* [PATCH 0/3] sched/fair: Reduce contention on tg's load_avg
@ 2015-11-25 19:09 Waiman Long
2015-11-25 19:09 ` [PATCH 1/3] sched/fair: Avoid redundant idle_cpu() call in update_sg_lb_stats() Waiman Long
` (2 more replies)
0 siblings, 3 replies; 18+ messages in thread
From: Waiman Long @ 2015-11-25 19:09 UTC (permalink / raw)
To: Ingo Molnar, Peter Zijlstra
Cc: linux-kernel, Scott J Norton, Douglas Hatch, Waiman Long
This patch series tries to reduce contention on task_group's load_avg
to improve system performance. It also tries to optimize the use of
the idle_cpu() call in update_sg_lb_stats().
Waiman Long (3):
sched/fair: Avoid redundant idle_cpu() call in update_sg_lb_stats()
sched/fair: Move hot load_avg into its own cacheline
sched/fair: Use different cachelines for readers and writers of
load_avg
 kernel/sched/core.c  |  9 +++++++++
 kernel/sched/fair.c  | 40 +++++++++++++++++++++++++++++++++++-----
 kernel/sched/sched.h | 15 ++++++++++++++-
 3 files changed, 58 insertions(+), 6 deletions(-)
^ permalink raw reply	[flat|nested] 18+ messages in thread

* [PATCH 1/3] sched/fair: Avoid redundant idle_cpu() call in update_sg_lb_stats()
  2015-11-25 19:09 [PATCH 0/3] sched/fair: Reduce contention on tg's load_avg Waiman Long
@ 2015-11-25 19:09 ` Waiman Long
  2015-12-04 11:57   ` [tip:sched/core] " tip-bot for Waiman Long
  2015-11-25 19:09 ` [PATCH 2/3] sched/fair: Move hot load_avg into its own cacheline Waiman Long
  2015-11-25 19:09 ` [RFC PATCH 3/3] sched/fair: Use different cachelines for readers and writers of load_avg Waiman Long
  2 siblings, 1 reply; 18+ messages in thread
From: Waiman Long @ 2015-11-25 19:09 UTC (permalink / raw)
  To: Ingo Molnar, Peter Zijlstra
  Cc: linux-kernel, Scott J Norton, Douglas Hatch, Waiman Long

Part of the responsibility of the update_sg_lb_stats() function is to
update the idle_cpus statistical counter in struct sg_lb_stats. This
check is done by calling idle_cpu(). The idle_cpu() function, in turn,
checks a number of fields within the run queue structure such as
rq->curr and rq->nr_running.

With the current layout of the run queue structure, rq->curr and
rq->nr_running are in separate cachelines. The rq->curr variable is
checked first, followed by nr_running. As nr_running is also accessed
by update_sg_lb_stats() earlier, it makes no sense to load another
cacheline when nr_running is not 0, as idle_cpu() will always return
false in this case.

This patch eliminates this redundant cacheline load by checking the
cached nr_running before calling idle_cpu().

Signed-off-by: Waiman Long <Waiman.Long@hpe.com>
---
 kernel/sched/fair.c | 10 +++++++---
 1 files changed, 7 insertions(+), 3 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index f04fda8..8f1eccc 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6302,7 +6302,7 @@ static inline void update_sg_lb_stats(struct lb_env *env,
 			bool *overload)
 {
 	unsigned long load;
-	int i;
+	int i, nr_running;
 
 	memset(sgs, 0, sizeof(*sgs));
 
@@ -6319,7 +6319,8 @@ static inline void update_sg_lb_stats(struct lb_env *env,
 		sgs->group_util += cpu_util(i);
 		sgs->sum_nr_running += rq->cfs.h_nr_running;
 
-		if (rq->nr_running > 1)
+		nr_running = rq->nr_running;
+		if (nr_running > 1)
 			*overload = true;
 
 #ifdef CONFIG_NUMA_BALANCING
@@ -6327,7 +6328,10 @@ static inline void update_sg_lb_stats(struct lb_env *env,
 			sgs->nr_preferred_running += rq->nr_preferred_running;
 #endif
 		sgs->sum_weighted_load += weighted_cpuload(i);
-		if (idle_cpu(i))
+		/*
+		 * No need to call idle_cpu() if nr_running is not 0
+		 */
+		if (!nr_running && idle_cpu(i))
			sgs->idle_cpus++;
 	}
 
-- 
1.7.1

^ permalink raw reply related	[flat|nested] 18+ messages in thread
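For context, the short-circuit works because of the order of the checks
inside idle_cpu(). A sketch of the function as it looked in kernels of
that era (reconstructed from memory of kernel/sched/core.c, so treat it
as illustrative rather than authoritative):

int idle_cpu(int cpu)
{
	struct rq *rq = cpu_rq(cpu);

	/* per the patch description, rq->curr sits in a different
	 * cacheline than rq->nr_running */
	if (rq->curr != rq->idle)
		return 0;

	if (rq->nr_running)
		return 0;

#ifdef CONFIG_SMP
	if (!llist_empty(&rq->wake_list))
		return 0;
#endif

	return 1;
}

Since update_sg_lb_stats() has already read rq->nr_running, a nonzero
cached value guarantees a return of 0 here, so both the call and the
extra rq->curr cacheline load can be skipped.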
* [tip:sched/core] sched/fair: Avoid redundant idle_cpu() call in update_sg_lb_stats()
  2015-11-25 19:09 ` [PATCH 1/3] sched/fair: Avoid redundant idle_cpu() call in update_sg_lb_stats() Waiman Long
@ 2015-12-04 11:57   ` tip-bot for Waiman Long
  0 siblings, 0 replies; 18+ messages in thread
From: tip-bot for Waiman Long @ 2015-12-04 11:57 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: Waiman.Long, scott.norton, hpa, efault, mingo, peterz, torvalds,
	tglx, linux-kernel, doug.hatch

Commit-ID:  a426f99c91d1036767a7819aaaba6bd3191b7f06
Gitweb:     http://git.kernel.org/tip/a426f99c91d1036767a7819aaaba6bd3191b7f06
Author:     Waiman Long <Waiman.Long@hpe.com>
AuthorDate: Wed, 25 Nov 2015 14:09:38 -0500
Committer:  Ingo Molnar <mingo@kernel.org>
CommitDate: Fri, 4 Dec 2015 10:34:47 +0100

sched/fair: Avoid redundant idle_cpu() call in update_sg_lb_stats()

Part of the responsibility of the update_sg_lb_stats() function is to
update the idle_cpus statistical counter in struct sg_lb_stats. This
check is done by calling idle_cpu(). The idle_cpu() function, in turn,
checks a number of fields within the run queue structure such as
rq->curr and rq->nr_running.

With the current layout of the run queue structure, rq->curr and
rq->nr_running are in separate cachelines. The rq->curr variable is
checked first, followed by nr_running. As nr_running is also accessed
by update_sg_lb_stats() earlier, it makes no sense to load another
cacheline when nr_running is not 0, as idle_cpu() will always return
false in this case.

This patch eliminates this redundant cacheline load by checking the
cached nr_running before calling idle_cpu().

Signed-off-by: Waiman Long <Waiman.Long@hpe.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Douglas Hatch <doug.hatch@hpe.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Scott J Norton <scott.norton@hpe.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/1448478580-26467-2-git-send-email-Waiman.Long@hpe.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 kernel/sched/fair.c | 10 +++++++---
 1 file changed, 7 insertions(+), 3 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index efd664c..4b0e8b8 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6398,7 +6398,7 @@ static inline void update_sg_lb_stats(struct lb_env *env,
 			bool *overload)
 {
 	unsigned long load;
-	int i;
+	int i, nr_running;
 
 	memset(sgs, 0, sizeof(*sgs));
 
@@ -6415,7 +6415,8 @@ static inline void update_sg_lb_stats(struct lb_env *env,
 		sgs->group_util += cpu_util(i);
 		sgs->sum_nr_running += rq->cfs.h_nr_running;
 
-		if (rq->nr_running > 1)
+		nr_running = rq->nr_running;
+		if (nr_running > 1)
 			*overload = true;
 
 #ifdef CONFIG_NUMA_BALANCING
@@ -6423,7 +6424,10 @@ static inline void update_sg_lb_stats(struct lb_env *env,
 			sgs->nr_preferred_running += rq->nr_preferred_running;
 #endif
 		sgs->sum_weighted_load += weighted_cpuload(i);
-		if (idle_cpu(i))
+		/*
+		 * No need to call idle_cpu() if nr_running is not 0
+		 */
+		if (!nr_running && idle_cpu(i))
 			sgs->idle_cpus++;
 	}
 
^ permalink raw reply related	[flat|nested] 18+ messages in thread
* [PATCH 2/3] sched/fair: Move hot load_avg into its own cacheline
  2015-11-25 19:09 [PATCH 0/3] sched/fair: Reduce contention on tg's load_avg Waiman Long
  2015-11-25 19:09 ` [PATCH 1/3] sched/fair: Avoid redundant idle_cpu() call in update_sg_lb_stats() Waiman Long
@ 2015-11-25 19:09 ` Waiman Long
  2015-11-30 10:23   ` Peter Zijlstra
  2015-11-25 19:09 ` [RFC PATCH 3/3] sched/fair: Use different cachelines for readers and writers of load_avg Waiman Long
  2 siblings, 1 reply; 18+ messages in thread
From: Waiman Long @ 2015-11-25 19:09 UTC (permalink / raw)
  To: Ingo Molnar, Peter Zijlstra
  Cc: linux-kernel, Scott J Norton, Douglas Hatch, Waiman Long

When a system with a large number of sockets was driven to full
utilization, the clock tick handling was found to occupy a rather
significant proportion of CPU time when fair group scheduling was
enabled, which is the case for most distributions.

Running a java benchmark on a 16-socket IvyBridge-EX system, the perf
profile looked like:

  10.52%   0.00%  java   [kernel.vmlinux]  [k] smp_apic_timer_interrupt
   9.66%   0.05%  java   [kernel.vmlinux]  [k] hrtimer_interrupt
   8.65%   0.03%  java   [kernel.vmlinux]  [k] tick_sched_timer
   8.56%   0.00%  java   [kernel.vmlinux]  [k] update_process_times
   8.07%   0.03%  java   [kernel.vmlinux]  [k] scheduler_tick
   6.91%   1.78%  java   [kernel.vmlinux]  [k] task_tick_fair
   5.24%   5.04%  java   [kernel.vmlinux]  [k] update_cfs_shares

In particular, the high CPU time consumed by update_cfs_shares() was
mostly due to contention on the cacheline that contains the
task_group's load_avg statistical counter. This cacheline may also
contain variables such as shares, cfs_rq and se, which are accessed
rather frequently during clock tick processing.

This patch moves the load_avg variable into another cacheline,
separated from the other frequently accessed variables. By doing so,
the perf profile became:

   9.44%   0.00%  java   [kernel.vmlinux]  [k] smp_apic_timer_interrupt
   8.74%   0.01%  java   [kernel.vmlinux]  [k] hrtimer_interrupt
   7.83%   0.03%  java   [kernel.vmlinux]  [k] tick_sched_timer
   7.74%   0.00%  java   [kernel.vmlinux]  [k] update_process_times
   7.27%   0.03%  java   [kernel.vmlinux]  [k] scheduler_tick
   5.94%   1.74%  java   [kernel.vmlinux]  [k] task_tick_fair
   4.15%   3.92%  java   [kernel.vmlinux]  [k] update_cfs_shares

The CPU time is still fairly high, but it is better than before. The
benchmark results before and after the patch were as follows:

  Before patch - Max-jOPs: 907533    Critical-jOps: 134877
  After  patch - Max-jOPs: 916011    Critical-jOps: 142366

Signed-off-by: Waiman Long <Waiman.Long@hpe.com>
---
 kernel/sched/sched.h | 7 ++++++-
 1 files changed, 6 insertions(+), 1 deletions(-)

diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index efd3bfc..e679895 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -248,7 +248,12 @@ struct task_group {
 	unsigned long shares;
 
 #ifdef	CONFIG_SMP
-	atomic_long_t load_avg;
+	/*
+	 * load_avg can be heavily contended at clock tick time, so put
+	 * it in its own cacheline separated from the fields above which
+	 * will also be accessed at each tick.
+	 */
+	atomic_long_t load_avg ____cacheline_aligned;
 #endif
 #endif
 
-- 
1.7.1

^ permalink raw reply related	[flat|nested] 18+ messages in thread
* Re: [PATCH 2/3] sched/fair: Move hot load_avg into its own cacheline
  2015-11-25 19:09 ` [PATCH 2/3] sched/fair: Move hot load_avg into its own cacheline Waiman Long
@ 2015-11-30 10:23   ` Peter Zijlstra
  0 siblings, 0 replies; 18+ messages in thread
From: Peter Zijlstra @ 2015-11-30 10:23 UTC (permalink / raw)
  To: Waiman Long; +Cc: Ingo Molnar, linux-kernel, Scott J Norton, Douglas Hatch

On Wed, Nov 25, 2015 at 02:09:39PM -0500, Waiman Long wrote:
> +++ b/kernel/sched/sched.h
> @@ -248,7 +248,12 @@ struct task_group {
>  	unsigned long shares;
> 
>  #ifdef	CONFIG_SMP
> -	atomic_long_t load_avg;
> +	/*
> +	 * load_avg can be heavily contended at clock tick time, so put
> +	 * it in its own cacheline separated from the fields above which
> +	 * will also be accessed at each tick.
> +	 */
> +	atomic_long_t load_avg ____cacheline_aligned;

Same as with the other patch; this only works if the structure itself
is cacheline aligned, which I don't think it is.

>  #endif
>  #endif
> 
> -- 
> 1.7.1
> 

^ permalink raw reply	[flat|nested] 18+ messages in thread
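Peter's objection can be restated concretely: ____cacheline_aligned
only fixes the member's offset within the struct, not the address the
struct itself lands at. A hypothetical illustration (the struct name
below is made up for this example, not code from the thread):

/*
 * Hypothetical illustration only. The attribute pads load_avg out to
 * the next cacheline boundary *within* the struct layout, so
 * offsetof(struct tg_example, load_avg) is a multiple of
 * L1_CACHE_BYTES (64 here). But the member's runtime address is
 * base + offset: if kzalloc() hands back a base that is only 8-byte
 * aligned, base + 64 is also only 8-byte aligned, and load_avg can
 * still share a cacheline with the fields before it.
 */
struct tg_example {
	unsigned long shares;				/* offset 0 */
	atomic_long_t load_avg ____cacheline_aligned;	/* offset 64 */
};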
* [RFC PATCH 3/3] sched/fair: Use different cachelines for readers and writers of load_avg
  2015-11-25 19:09 [PATCH 0/3] sched/fair: Reduce contention on tg's load_avg Waiman Long
  2015-11-25 19:09 ` [PATCH 1/3] sched/fair: Avoid redundant idle_cpu() call in update_sg_lb_stats() Waiman Long
  2015-11-25 19:09 ` [PATCH 2/3] sched/fair: Move hot load_avg into its own cacheline Waiman Long
@ 2015-11-25 19:09 ` Waiman Long
  2015-11-30 10:22   ` Peter Zijlstra
  2 siblings, 1 reply; 18+ messages in thread
From: Waiman Long @ 2015-11-25 19:09 UTC (permalink / raw)
  To: Ingo Molnar, Peter Zijlstra
  Cc: linux-kernel, Scott J Norton, Douglas Hatch, Waiman Long

The load_avg statistical counter is only changed if the load on a CPU
deviates significantly from the previous tick. So there are usually
more readers than writers of load_avg. Still, on a large system, the
cacheline contention can cause significant slowdown and impact
performance.

This patch attempts to separate those load_avg readers
(update_cfs_shares) and writers (task_tick_fair) to use different
cachelines instead. Writers of load_avg will now accumulate the load
delta into load_avg_delta, which sits in a different cacheline. If
load_avg_delta is sufficiently large (> load_avg/64), it will then be
added back to load_avg.

Running a java benchmark on a 16-socket IvyBridge-EX system (240 cores,
480 threads), the perf profile before the patch was:

   9.44%   0.00%  java   [kernel.vmlinux]  [k] smp_apic_timer_interrupt
   8.74%   0.01%  java   [kernel.vmlinux]  [k] hrtimer_interrupt
   7.83%   0.03%  java   [kernel.vmlinux]  [k] tick_sched_timer
   7.74%   0.00%  java   [kernel.vmlinux]  [k] update_process_times
   7.27%   0.03%  java   [kernel.vmlinux]  [k] scheduler_tick
   5.94%   1.74%  java   [kernel.vmlinux]  [k] task_tick_fair
   4.15%   3.92%  java   [kernel.vmlinux]  [k] update_cfs_shares

After the patch, it became:

   2.94%   0.00%  java   [kernel.vmlinux]  [k] smp_apic_timer_interrupt
   2.52%   0.01%  java   [kernel.vmlinux]  [k] hrtimer_interrupt
   2.25%   0.02%  java   [kernel.vmlinux]  [k] tick_sched_timer
   2.21%   0.00%  java   [kernel.vmlinux]  [k] update_process_times
   1.70%   0.03%  java   [kernel.vmlinux]  [k] scheduler_tick
   0.96%   0.34%  java   [kernel.vmlinux]  [k] task_tick_fair
   0.61%   0.48%  java   [kernel.vmlinux]  [k] update_cfs_shares

The benchmark results before and after the patch were as follows:

  Before patch - Max-jOPs: 916011    Critical-jOps: 142366
  After  patch - Max-jOPs: 939130    Critical-jOps: 211937

There was a significant improvement in Critical-jOps, which is latency
sensitive.

This patch does introduce additional delay in getting the real load
average reflected in load_avg. It may also incur additional overhead
if the number of CPUs in a task group is small. As a result, this
change is only activated when running on 4-socket or larger systems,
which can get the most benefit from it.

Signed-off-by: Waiman Long <Waiman.Long@hpe.com>
---
 kernel/sched/core.c  |  9 +++++++++
 kernel/sched/fair.c  | 30 ++++++++++++++++++++++++++++--
 kernel/sched/sched.h |  8 ++++++++
 3 files changed, 45 insertions(+), 2 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 4d568ac..f3075da 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -7356,6 +7356,12 @@ void __init sched_init(void)
 		root_task_group.cfs_rq = (struct cfs_rq **)ptr;
 		ptr += nr_cpu_ids * sizeof(void **);
 
+#ifdef CONFIG_SMP
+		/*
+		 * Use load_avg_delta if not 2P or less
+		 */
+		root_task_group.use_la_delta = (num_possible_nodes() > 2);
+#endif /* CONFIG_SMP */
 #endif /* CONFIG_FAIR_GROUP_SCHED */
 #ifdef CONFIG_RT_GROUP_SCHED
 		root_task_group.rt_se = (struct sched_rt_entity **)ptr;
@@ -7691,6 +7697,9 @@ struct task_group *sched_create_group(struct task_group *parent)
 	if (!alloc_rt_sched_group(tg, parent))
 		goto err;
 
+#if defined(CONFIG_FAIR_GROUP_SCHED) && defined(CONFIG_SMP)
+	tg->use_la_delta = root_task_group.use_la_delta;
+#endif
 	return tg;
 
 err:
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 8f1eccc..44732cc 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -2663,15 +2663,41 @@ __update_load_avg(u64 now, int cpu, struct sched_avg *sa,
 
 #ifdef CONFIG_FAIR_GROUP_SCHED
 /*
- * Updating tg's load_avg is necessary before update_cfs_share (which is done)
+ * Updating tg's load_avg is necessary before update_cfs_shares (which is done)
  * and effective_load (which is not done because it is too costly).
+ *
+ * The tg's use_la_delta flag, if set, will cause the load_avg delta to be
+ * accumulated into the load_avg_delta variable instead to reduce cacheline
+ * contention on load_avg at the expense of more delay in reflecting the real
+ * load_avg. The tg's load_avg and load_avg_delta variables are in separate
+ * cachelines. With that flag set, load_avg will be read mostly whereas
+ * load_avg_delta will be write mostly.
 */
 static inline void update_tg_load_avg(struct cfs_rq *cfs_rq, int force)
 {
 	long delta = cfs_rq->avg.load_avg - cfs_rq->tg_load_avg_contrib;
 
 	if (force || abs(delta) > cfs_rq->tg_load_avg_contrib / 64) {
-		atomic_long_add(delta, &cfs_rq->tg->load_avg);
+		struct task_group *tg = cfs_rq->tg;
+		long load_avg, tot_delta;
+
+		if (!tg->use_la_delta) {
+			/*
+			 * If the use_la_delta isn't set, just add the
+			 * delta directly into load_avg.
+			 */
+			atomic_long_add(delta, &tg->load_avg);
+			goto set_contrib;
+		}
+
+		tot_delta = atomic_long_add_return(delta, &tg->load_avg_delta);
+		load_avg = atomic_long_read(&tg->load_avg);
+		if (abs(tot_delta) > load_avg / 64) {
+			tot_delta = atomic_long_xchg(&tg->load_avg_delta, 0);
+			if (tot_delta)
+				atomic_long_add(tot_delta, &tg->load_avg);
+		}
+set_contrib:
 		cfs_rq->tg_load_avg_contrib = cfs_rq->avg.load_avg;
 	}
 }
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index e679895..aef4e4e 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -252,8 +252,16 @@ struct task_group {
 	 * load_avg can be heavily contended at clock tick time, so put
 	 * it in its own cacheline separated from the fields above which
 	 * will also be accessed at each tick.
+	 *
+	 * The use_la_delta flag, if set, will enable the use of load_avg_delta
+	 * to accumulate the delta and only change load_avg when the delta
+	 * is big enough. This reduces the cacheline contention on load_avg.
+	 * This flag will be set at allocation time depending on the system
+	 * configuration.
 	 */
+	int use_la_delta;
 	atomic_long_t load_avg ____cacheline_aligned;
+	atomic_long_t load_avg_delta ____cacheline_aligned;
 #endif
 #endif
 
-- 
1.7.1

^ permalink raw reply related	[flat|nested] 18+ messages in thread
* Re: [RFC PATCH 3/3] sched/fair: Use different cachelines for readers and writers of load_avg
  2015-11-25 19:09 ` [RFC PATCH 3/3] sched/fair: Use different cachelines for readers and writers of load_avg Waiman Long
@ 2015-11-30 10:22   ` Peter Zijlstra
  2015-11-30 19:13     ` Waiman Long
  0 siblings, 1 reply; 18+ messages in thread
From: Peter Zijlstra @ 2015-11-30 10:22 UTC (permalink / raw)
  To: Waiman Long
  Cc: Ingo Molnar, linux-kernel, Scott J Norton, Douglas Hatch,
	Paul Turner, Ben Segall, Morten Rasmussen, Yuyang Du

Please always Cc the people who wrote the code.

+CC pjt, ben, morten, yuyang

On Wed, Nov 25, 2015 at 02:09:40PM -0500, Waiman Long wrote:
> The load_avg statistical counter is only changed if the load on a CPU
> deviates significantly from the previous tick. So there are usually
> more readers than writers of load_avg. Still, on a large system, the
> cacheline contention can cause significant slowdown and impact
> performance.
> 
> This patch attempts to separate those load_avg readers
> (update_cfs_shares) and writers (task_tick_fair) to use different
> cachelines instead. Writers of load_avg will now accumulate the load
> delta into load_avg_delta, which sits in a different cacheline. If
> load_avg_delta is sufficiently large (> load_avg/64), it will then be
> added back to load_avg.
> 
> Running a java benchmark on a 16-socket IvyBridge-EX system (240 cores,
> 480 threads), the perf profile before the patch was:
> 
>    9.44%   0.00%  java   [kernel.vmlinux]  [k] smp_apic_timer_interrupt
>    8.74%   0.01%  java   [kernel.vmlinux]  [k] hrtimer_interrupt
>    7.83%   0.03%  java   [kernel.vmlinux]  [k] tick_sched_timer
>    7.74%   0.00%  java   [kernel.vmlinux]  [k] update_process_times
>    7.27%   0.03%  java   [kernel.vmlinux]  [k] scheduler_tick
>    5.94%   1.74%  java   [kernel.vmlinux]  [k] task_tick_fair
>    4.15%   3.92%  java   [kernel.vmlinux]  [k] update_cfs_shares
> 
> After the patch, it became:
> 
>    2.94%   0.00%  java   [kernel.vmlinux]  [k] smp_apic_timer_interrupt
>    2.52%   0.01%  java   [kernel.vmlinux]  [k] hrtimer_interrupt
>    2.25%   0.02%  java   [kernel.vmlinux]  [k] tick_sched_timer
>    2.21%   0.00%  java   [kernel.vmlinux]  [k] update_process_times
>    1.70%   0.03%  java   [kernel.vmlinux]  [k] scheduler_tick
>    0.96%   0.34%  java   [kernel.vmlinux]  [k] task_tick_fair
>    0.61%   0.48%  java   [kernel.vmlinux]  [k] update_cfs_shares

This begs the question though; why are you running a global load in a
cgroup; and do we really need to update this for the root cgroup? It
seems to me we don't need calc_tg_weight() for the root cgroup, it
doesn't need to normalize its weight numbers.

That is; isn't this simply a problem we should avoid?

> The benchmark results before and after the patch were as follows:
> 
>   Before patch - Max-jOPs: 916011    Critical-jOps: 142366
>   After  patch - Max-jOPs: 939130    Critical-jOps: 211937
> 
> There was a significant improvement in Critical-jOps, which is latency
> sensitive.
> 
> This patch does introduce additional delay in getting the real load
> average reflected in load_avg. It may also incur additional overhead
> if the number of CPUs in a task group is small. As a result, this
> change is only activated when running on 4-socket or larger systems,
> which can get the most benefit from it.

So I'm not particularly charmed by this; it rather makes a mess of
things. Also this really wants a run of the cgroup fairness test thingy
pjt/ben have somewhere.

> Signed-off-by: Waiman Long <Waiman.Long@hpe.com>
> ---
>  kernel/sched/core.c  |  9 +++++++++
>  kernel/sched/fair.c  | 30 ++++++++++++++++++++++++++++--
>  kernel/sched/sched.h |  8 ++++++++
>  3 files changed, 45 insertions(+), 2 deletions(-)
> 
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index 4d568ac..f3075da 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -7356,6 +7356,12 @@ void __init sched_init(void)
>  		root_task_group.cfs_rq = (struct cfs_rq **)ptr;
>  		ptr += nr_cpu_ids * sizeof(void **);
> 
> +#ifdef CONFIG_SMP
> +		/*
> +		 * Use load_avg_delta if not 2P or less
> +		 */
> +		root_task_group.use_la_delta = (num_possible_nodes() > 2);
> +#endif /* CONFIG_SMP */
>  #endif /* CONFIG_FAIR_GROUP_SCHED */
>  #ifdef CONFIG_RT_GROUP_SCHED
>  		root_task_group.rt_se = (struct sched_rt_entity **)ptr;
> @@ -7691,6 +7697,9 @@ struct task_group *sched_create_group(struct task_group *parent)
>  	if (!alloc_rt_sched_group(tg, parent))
>  		goto err;
> 
> +#if defined(CONFIG_FAIR_GROUP_SCHED) && defined(CONFIG_SMP)
> +	tg->use_la_delta = root_task_group.use_la_delta;
> +#endif
>  	return tg;
> 
>  err:
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 8f1eccc..44732cc 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -2663,15 +2663,41 @@ __update_load_avg(u64 now, int cpu, struct sched_avg *sa,
> 
>  #ifdef CONFIG_FAIR_GROUP_SCHED
>  /*
> - * Updating tg's load_avg is necessary before update_cfs_share (which is done)
> + * Updating tg's load_avg is necessary before update_cfs_shares (which is done)
>   * and effective_load (which is not done because it is too costly).
> + *
> + * The tg's use_la_delta flag, if set, will cause the load_avg delta to be
> + * accumulated into the load_avg_delta variable instead to reduce cacheline
> + * contention on load_avg at the expense of more delay in reflecting the real
> + * load_avg. The tg's load_avg and load_avg_delta variables are in separate
> + * cachelines. With that flag set, load_avg will be read mostly whereas
> + * load_avg_delta will be write mostly.
>   */
>  static inline void update_tg_load_avg(struct cfs_rq *cfs_rq, int force)
>  {
>  	long delta = cfs_rq->avg.load_avg - cfs_rq->tg_load_avg_contrib;
> 
>  	if (force || abs(delta) > cfs_rq->tg_load_avg_contrib / 64) {
> -		atomic_long_add(delta, &cfs_rq->tg->load_avg);
> +		struct task_group *tg = cfs_rq->tg;
> +		long load_avg, tot_delta;
> +
> +		if (!tg->use_la_delta) {
> +			/*
> +			 * If the use_la_delta isn't set, just add the
> +			 * delta directly into load_avg.
> +			 */
> +			atomic_long_add(delta, &tg->load_avg);
> +			goto set_contrib;
> +		}
> +
> +		tot_delta = atomic_long_add_return(delta, &tg->load_avg_delta);
> +		load_avg = atomic_long_read(&tg->load_avg);
> +		if (abs(tot_delta) > load_avg / 64) {
> +			tot_delta = atomic_long_xchg(&tg->load_avg_delta, 0);
> +			if (tot_delta)
> +				atomic_long_add(tot_delta, &tg->load_avg);
> +		}
> +set_contrib:
>  		cfs_rq->tg_load_avg_contrib = cfs_rq->avg.load_avg;
>  	}
>  }

I'm thinking that it's now far too big to retain the inline qualifier.

> diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
> index e679895..aef4e4e 100644
> --- a/kernel/sched/sched.h
> +++ b/kernel/sched/sched.h
> @@ -252,8 +252,16 @@ struct task_group {
>  	 * load_avg can be heavily contended at clock tick time, so put
>  	 * it in its own cacheline separated from the fields above which
>  	 * will also be accessed at each tick.
> +	 *
> +	 * The use_la_delta flag, if set, will enable the use of load_avg_delta
> +	 * to accumulate the delta and only change load_avg when the delta
> +	 * is big enough. This reduces the cacheline contention on load_avg.
> +	 * This flag will be set at allocation time depending on the system
> +	 * configuration.
>  	 */
> +	int use_la_delta;
>  	atomic_long_t load_avg ____cacheline_aligned;
> +	atomic_long_t load_avg_delta ____cacheline_aligned;

This would only work if the structure itself is allocated with cacheline
alignment, and looking at sched_create_group(), we use a plain kzalloc()
for this, which doesn't guarantee any sort of alignment beyond machine
word size IIRC.

Also, you unconditionally grow the structure by a whole cacheline.

>  #endif
>  #endif
> 
> -- 
> 1.7.1
> 

^ permalink raw reply	[flat|nested] 18+ messages in thread
* Re: [RFC PATCH 3/3] sched/fair: Use different cachelines for readers and writers of load_avg
  2015-11-30 10:22 ` Peter Zijlstra
@ 2015-11-30 19:13   ` Waiman Long
  2015-11-30 22:09     ` Peter Zijlstra
  ` (2 more replies)
  0 siblings, 3 replies; 18+ messages in thread
From: Waiman Long @ 2015-11-30 19:13 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, linux-kernel, Scott J Norton, Douglas Hatch,
	Paul Turner, Ben Segall, Morten Rasmussen, Yuyang Du

On 11/30/2015 05:22 AM, Peter Zijlstra wrote:
> Please always Cc the people who wrote the code.
>
> +CC pjt, ben, morten, yuyang

Sorry for that. Their names didn't show up when I did get_maintainer.pl.

> On Wed, Nov 25, 2015 at 02:09:40PM -0500, Waiman Long wrote:
>> The load_avg statistical counter is only changed if the load on a CPU
>> deviates significantly from the previous tick. So there are usually
>> more readers than writers of load_avg. Still, on a large system, the
>> cacheline contention can cause significant slowdown and impact
>> performance.
>>
>> This patch attempts to separate those load_avg readers
>> (update_cfs_shares) and writers (task_tick_fair) to use different
>> cachelines instead. Writers of load_avg will now accumulate the load
>> delta into load_avg_delta, which sits in a different cacheline. If
>> load_avg_delta is sufficiently large (> load_avg/64), it will then be
>> added back to load_avg.
>>
>> Running a java benchmark on a 16-socket IvyBridge-EX system (240 cores,
>> 480 threads), the perf profile before the patch was:
>>
>>    9.44%   0.00%  java   [kernel.vmlinux]  [k] smp_apic_timer_interrupt
>>    8.74%   0.01%  java   [kernel.vmlinux]  [k] hrtimer_interrupt
>>    7.83%   0.03%  java   [kernel.vmlinux]  [k] tick_sched_timer
>>    7.74%   0.00%  java   [kernel.vmlinux]  [k] update_process_times
>>    7.27%   0.03%  java   [kernel.vmlinux]  [k] scheduler_tick
>>    5.94%   1.74%  java   [kernel.vmlinux]  [k] task_tick_fair
>>    4.15%   3.92%  java   [kernel.vmlinux]  [k] update_cfs_shares
>>
>> After the patch, it became:
>>
>>    2.94%   0.00%  java   [kernel.vmlinux]  [k] smp_apic_timer_interrupt
>>    2.52%   0.01%  java   [kernel.vmlinux]  [k] hrtimer_interrupt
>>    2.25%   0.02%  java   [kernel.vmlinux]  [k] tick_sched_timer
>>    2.21%   0.00%  java   [kernel.vmlinux]  [k] update_process_times
>>    1.70%   0.03%  java   [kernel.vmlinux]  [k] scheduler_tick
>>    0.96%   0.34%  java   [kernel.vmlinux]  [k] task_tick_fair
>>    0.61%   0.48%  java   [kernel.vmlinux]  [k] update_cfs_shares
> This begs the question though; why are you running a global load in a
> cgroup; and do we really need to update this for the root cgroup? It
> seems to me we don't need calc_tg_weight() for the root cgroup, it
> doesn't need to normalize its weight numbers.
>
> That is; isn't this simply a problem we should avoid?

I didn't use any cgroup in my test setup. Autogroup was enabled, though.
Booting up a 4.4-rc2 kernel caused sched_create_group() to be called 56
times.

>> The benchmark results before and after the patch were as follows:
>>
>>   Before patch - Max-jOPs: 916011    Critical-jOps: 142366
>>   After  patch - Max-jOPs: 939130    Critical-jOps: 211937
>>
>> There was a significant improvement in Critical-jOps, which is latency
>> sensitive.
>>
>> This patch does introduce additional delay in getting the real load
>> average reflected in load_avg. It may also incur additional overhead
>> if the number of CPUs in a task group is small. As a result, this
>> change is only activated when running on 4-socket or larger systems,
>> which can get the most benefit from it.
> So I'm not particularly charmed by this; it rather makes a mess of
> things. Also this really wants a run of the cgroup fairness test thingy
> pjt/ben have somewhere.

I will be glad to run any additional tests, if necessary. I do need
pointers to those tests, though.

>> Signed-off-by: Waiman Long <Waiman.Long@hpe.com>
>> ---
>>  kernel/sched/core.c  |  9 +++++++++
>>  kernel/sched/fair.c  | 30 ++++++++++++++++++++++++++++--
>>  kernel/sched/sched.h |  8 ++++++++
>>  3 files changed, 45 insertions(+), 2 deletions(-)
>>
>> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
>> index 4d568ac..f3075da 100644
>> --- a/kernel/sched/core.c
>> +++ b/kernel/sched/core.c
>> @@ -7356,6 +7356,12 @@ void __init sched_init(void)
>>  		root_task_group.cfs_rq = (struct cfs_rq **)ptr;
>>  		ptr += nr_cpu_ids * sizeof(void **);
>>
>> +#ifdef CONFIG_SMP
>> +		/*
>> +		 * Use load_avg_delta if not 2P or less
>> +		 */
>> +		root_task_group.use_la_delta = (num_possible_nodes() > 2);
>> +#endif /* CONFIG_SMP */
>>  #endif /* CONFIG_FAIR_GROUP_SCHED */
>>  #ifdef CONFIG_RT_GROUP_SCHED
>>  		root_task_group.rt_se = (struct sched_rt_entity **)ptr;
>> @@ -7691,6 +7697,9 @@ struct task_group *sched_create_group(struct task_group *parent)
>>  	if (!alloc_rt_sched_group(tg, parent))
>>  		goto err;
>>
>> +#if defined(CONFIG_FAIR_GROUP_SCHED) && defined(CONFIG_SMP)
>> +	tg->use_la_delta = root_task_group.use_la_delta;
>> +#endif
>>  	return tg;
>>
>>  err:
>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>> index 8f1eccc..44732cc 100644
>> --- a/kernel/sched/fair.c
>> +++ b/kernel/sched/fair.c
>> @@ -2663,15 +2663,41 @@ __update_load_avg(u64 now, int cpu, struct sched_avg *sa,
>>
>>  #ifdef CONFIG_FAIR_GROUP_SCHED
>>  /*
>> - * Updating tg's load_avg is necessary before update_cfs_share (which is done)
>> + * Updating tg's load_avg is necessary before update_cfs_shares (which is done)
>>   * and effective_load (which is not done because it is too costly).
>> + *
>> + * The tg's use_la_delta flag, if set, will cause the load_avg delta to be
>> + * accumulated into the load_avg_delta variable instead to reduce cacheline
>> + * contention on load_avg at the expense of more delay in reflecting the real
>> + * load_avg. The tg's load_avg and load_avg_delta variables are in separate
>> + * cachelines. With that flag set, load_avg will be read mostly whereas
>> + * load_avg_delta will be write mostly.
>>   */
>>  static inline void update_tg_load_avg(struct cfs_rq *cfs_rq, int force)
>>  {
>>  	long delta = cfs_rq->avg.load_avg - cfs_rq->tg_load_avg_contrib;
>>
>>  	if (force || abs(delta) > cfs_rq->tg_load_avg_contrib / 64) {
>> -		atomic_long_add(delta, &cfs_rq->tg->load_avg);
>> +		struct task_group *tg = cfs_rq->tg;
>> +		long load_avg, tot_delta;
>> +
>> +		if (!tg->use_la_delta) {
>> +			/*
>> +			 * If the use_la_delta isn't set, just add the
>> +			 * delta directly into load_avg.
>> +			 */
>> +			atomic_long_add(delta, &tg->load_avg);
>> +			goto set_contrib;
>> +		}
>> +
>> +		tot_delta = atomic_long_add_return(delta, &tg->load_avg_delta);
>> +		load_avg = atomic_long_read(&tg->load_avg);
>> +		if (abs(tot_delta) > load_avg / 64) {
>> +			tot_delta = atomic_long_xchg(&tg->load_avg_delta, 0);
>> +			if (tot_delta)
>> +				atomic_long_add(tot_delta, &tg->load_avg);
>> +		}
>> +set_contrib:
>>  		cfs_rq->tg_load_avg_contrib = cfs_rq->avg.load_avg;
>>  	}
>>  }
> I'm thinking that it's now far too big to retain the inline qualifier.

I can take the inline keyword out.

>> diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
>> index e679895..aef4e4e 100644
>> --- a/kernel/sched/sched.h
>> +++ b/kernel/sched/sched.h
>> @@ -252,8 +252,16 @@ struct task_group {
>>  	 * load_avg can be heavily contended at clock tick time, so put
>>  	 * it in its own cacheline separated from the fields above which
>>  	 * will also be accessed at each tick.
>> +	 *
>> +	 * The use_la_delta flag, if set, will enable the use of load_avg_delta
>> +	 * to accumulate the delta and only change load_avg when the delta
>> +	 * is big enough. This reduces the cacheline contention on load_avg.
>> +	 * This flag will be set at allocation time depending on the system
>> +	 * configuration.
>>  	 */
>> +	int use_la_delta;
>>  	atomic_long_t load_avg ____cacheline_aligned;
>> +	atomic_long_t load_avg_delta ____cacheline_aligned;
> This would only work if the structure itself is allocated with cacheline
> alignment, and looking at sched_create_group(), we use a plain kzalloc()
> for this, which doesn't guarantee any sort of alignment beyond machine
> word size IIRC.

With a RHEL 6 derived .config file, the size of the task_group structure
was 460 bytes on a 32-bit x86 kernel. Adding a ____cacheline_aligned tag
increases the size to 512 bytes. So it did make the structure a multiple
of the cacheline size. With both slub and slab, the allocated task group
pointers from kzalloc() in sched_create_group() were all multiples of
0x200. So they were properly aligned for the ____cacheline_aligned tag
to work.

> Also, you unconditionally grow the structure by a whole cacheline.

I know it is a drawback of using the ____cacheline_aligned tag. However,
we probably won't create too many task groups in normal use. So the
increase in memory consumption shouldn't be noticeable.

Cheers,
Longman

^ permalink raw reply	[flat|nested] 18+ messages in thread
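Waiman's empirical argument (sizes and addresses observed under one
.config) can be turned into explicit checks; a hypothetical sketch,
not part of any posted patch:

/*
 * Hypothetical sanity checks for the alignment assumption above --
 * illustrative only, the function name is made up. BUILD_BUG_ON()
 * rejects a struct size that is not a whole number of cachelines at
 * compile time; WARN_ON_ONCE() flags a misaligned allocation at
 * runtime.
 */
static struct task_group *task_group_alloc_checked(void)
{
	struct task_group *tg;

	BUILD_BUG_ON(sizeof(struct task_group) % L1_CACHE_BYTES);

	tg = kzalloc(sizeof(*tg), GFP_KERNEL);
	WARN_ON_ONCE(tg && ((unsigned long)tg & (L1_CACHE_BYTES - 1)));

	return tg;
}

As the follow-up messages point out, though, even a passing runtime
check only shows what today's allocator happens to do, not what
kzalloc() actually guarantees.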
* Re: [RFC PATCH 3/3] sched/fair: Use different cachelines for readers and writers of load_avg
  2015-11-30 19:13 ` Waiman Long
@ 2015-11-30 22:09   ` Peter Zijlstra
  2015-12-01  3:55     ` Waiman Long
  0 siblings, 1 reply; 18+ messages in thread
From: Peter Zijlstra @ 2015-11-30 22:09 UTC (permalink / raw)
  To: Waiman Long
  Cc: Ingo Molnar, linux-kernel, Scott J Norton, Douglas Hatch,
	Paul Turner, Ben Segall, Morten Rasmussen, Yuyang Du

On Mon, Nov 30, 2015 at 02:13:32PM -0500, Waiman Long wrote:
> >This begs the question though; why are you running a global load in a
> >cgroup; and do we really need to update this for the root cgroup? It
> >seems to me we don't need calc_tg_weight() for the root cgroup, it
> >doesn't need to normalize its weight numbers.
> >
> >That is; isn't this simply a problem we should avoid?
> 
> I didn't use any cgroup in my test setup. Autogroup was enabled, though.
> Booting up a 4.4-rc2 kernel caused sched_create_group() to be called 56
> times.

Yeah, can you kill autogroup and see if that helps? If not, we probably
should add some code to avoid calculating things for the root group.

^ permalink raw reply	[flat|nested] 18+ messages in thread
* Re: [RFC PATCH 3/3] sched/fair: Use different cachelines for readers and writers of load_avg
  2015-11-30 22:09 ` Peter Zijlstra
@ 2015-12-01  3:55   ` Waiman Long
  2015-12-01  8:49     ` Peter Zijlstra
  0 siblings, 1 reply; 18+ messages in thread
From: Waiman Long @ 2015-12-01  3:55 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, linux-kernel, Scott J Norton, Douglas Hatch,
	Paul Turner, Ben Segall, Morten Rasmussen, Yuyang Du

On 11/30/2015 05:09 PM, Peter Zijlstra wrote:
> On Mon, Nov 30, 2015 at 02:13:32PM -0500, Waiman Long wrote:
>>> This begs the question though; why are you running a global load in a
>>> cgroup; and do we really need to update this for the root cgroup? It
>>> seems to me we don't need calc_tg_weight() for the root cgroup, it
>>> doesn't need to normalize its weight numbers.
>>>
>>> That is; isn't this simply a problem we should avoid?
>> I didn't use any cgroup in my test setup. Autogroup was enabled, though.
>> Booting up a 4.4-rc2 kernel caused sched_create_group() to be called 56
>> times.
> Yeah, can you kill autogroup and see if that helps? If not, we probably
> should add some code to avoid calculating things for the root group.

I will try that out tomorrow. However, SCHED_AUTOGROUP was enabled in
the distribution kernels. So we still need to look at that with
autogroup enabled.

Cheers,
Longman

^ permalink raw reply	[flat|nested] 18+ messages in thread
* Re: [RFC PATCH 3/3] sched/fair: Use different cachelines for readers and writers of load_avg
  2015-12-01  3:55 ` Waiman Long
@ 2015-12-01  8:49   ` Peter Zijlstra
  2015-12-01 10:44     ` Mike Galbraith
  2015-12-02 18:48     ` Waiman Long
  0 siblings, 2 replies; 18+ messages in thread
From: Peter Zijlstra @ 2015-12-01  8:49 UTC (permalink / raw)
  To: Waiman Long
  Cc: Ingo Molnar, linux-kernel, Scott J Norton, Douglas Hatch,
	Paul Turner, Ben Segall, Morten Rasmussen, Yuyang Du

On Mon, Nov 30, 2015 at 10:55:02PM -0500, Waiman Long wrote:
> On 11/30/2015 05:09 PM, Peter Zijlstra wrote:
> >On Mon, Nov 30, 2015 at 02:13:32PM -0500, Waiman Long wrote:
> >>>This begs the question though; why are you running a global load in a
> >>>cgroup; and do we really need to update this for the root cgroup? It
> >>>seems to me we don't need calc_tg_weight() for the root cgroup, it
> >>>doesn't need to normalize its weight numbers.
> >>>
> >>>That is; isn't this simply a problem we should avoid?
> >>I didn't use any cgroup in my test setup. Autogroup was enabled, though.
> >>Booting up a 4.4-rc2 kernel caused sched_create_group() to be called 56
> >>times.
> >Yeah, can you kill autogroup and see if that helps? If not, we probably
> >should add some code to avoid calculating things for the root group.
> 
> I will try that out tomorrow. However, SCHED_AUTOGROUP was enabled in
> the distribution kernels. So we still need to look at that with
> autogroup enabled.

Meh, or just tell the people that have stupid large machines to use
noautogroup on boot (it's of questionable benefit in the first place
imo, esp. on servers).

^ permalink raw reply	[flat|nested] 18+ messages in thread
* Re: [RFC PATCH 3/3] sched/fair: Use different cachelines for readers and writers of load_avg
  2015-12-01  8:49 ` Peter Zijlstra
@ 2015-12-01 10:44   ` Mike Galbraith
  0 siblings, 0 replies; 18+ messages in thread
From: Mike Galbraith @ 2015-12-01 10:44 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Waiman Long, Ingo Molnar, linux-kernel, Scott J Norton,
	Douglas Hatch, Paul Turner, Ben Segall, Morten Rasmussen, Yuyang Du

On Tue, 2015-12-01 at 09:49 +0100, Peter Zijlstra wrote:
> On Mon, Nov 30, 2015 at 10:55:02PM -0500, Waiman Long wrote:
> > On 11/30/2015 05:09 PM, Peter Zijlstra wrote:
> > >On Mon, Nov 30, 2015 at 02:13:32PM -0500, Waiman Long wrote:
> > >>>This begs the question though; why are you running a global load in a
> > >>>cgroup; and do we really need to update this for the root cgroup? It
> > >>>seems to me we don't need calc_tg_weight() for the root cgroup, it
> > >>>doesn't need to normalize its weight numbers.
> > >>>
> > >>>That is; isn't this simply a problem we should avoid?
> > >>I didn't use any cgroup in my test setup. Autogroup was enabled, though.
> > >>Booting up a 4.4-rc2 kernel caused sched_create_group() to be called 56
> > >>times.
> > >Yeah, can you kill autogroup and see if that helps? If not, we probably
> > >should add some code to avoid calculating things for the root group.
> > 
> > I will try that out tomorrow. However, SCHED_AUTOGROUP was enabled in
> > the distribution kernels. So we still need to look at that with
> > autogroup enabled.
> 
> Meh, or just tell the people that have stupid large machines to use
> noautogroup on boot (it's of questionable benefit in the first place
> imo, esp. on servers).

Yup (and yup). Someone should also suggest to the systemd(isease) folks
that they try actually measuring before turning everything in the world
on. "Oo cgroups are cool" is NOT a good reason to turn it all on :)

	-Mike

^ permalink raw reply	[flat|nested] 18+ messages in thread
* Re: [RFC PATCH 3/3] sched/fair: Use different cachelines for readers and writers of load_avg
  2015-12-01  8:49 ` Peter Zijlstra
  2015-12-01 10:44   ` Mike Galbraith
@ 2015-12-02 18:48   ` Waiman Long
  1 sibling, 0 replies; 18+ messages in thread
From: Waiman Long @ 2015-12-02 18:48 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, linux-kernel, Scott J Norton, Douglas Hatch,
	Paul Turner, Ben Segall, Morten Rasmussen, Yuyang Du

On 12/01/2015 03:49 AM, Peter Zijlstra wrote:
> On Mon, Nov 30, 2015 at 10:55:02PM -0500, Waiman Long wrote:
>> On 11/30/2015 05:09 PM, Peter Zijlstra wrote:
>>> On Mon, Nov 30, 2015 at 02:13:32PM -0500, Waiman Long wrote:
>>>>> This begs the question though; why are you running a global load in a
>>>>> cgroup; and do we really need to update this for the root cgroup? It
>>>>> seems to me we don't need calc_tg_weight() for the root cgroup, it
>>>>> doesn't need to normalize its weight numbers.
>>>>>
>>>>> That is; isn't this simply a problem we should avoid?
>>>> I didn't use any cgroup in my test setup. Autogroup was enabled, though.
>>>> Booting up a 4.4-rc2 kernel caused sched_create_group() to be called 56
>>>> times.
>>> Yeah, can you kill autogroup and see if that helps? If not, we probably
>>> should add some code to avoid calculating things for the root group.
>> I will try that out tomorrow. However, SCHED_AUTOGROUP was enabled in
>> the distribution kernels. So we still need to look at that with
>> autogroup enabled.
> Meh, or just tell the people that have stupid large machines to use
> noautogroup on boot (it's of questionable benefit in the first place
> imo, esp. on servers).

Yes, I was able to recover most of the lost performance by disabling
autogroup. I did send out a new patch to disable the load_avg update
for the root_task_group. I need that for backporting to earlier
kernels, which were forced to update load_avg on every clock tick even
for the root_task_group.

Cheers,
Longman

^ permalink raw reply	[flat|nested] 18+ messages in thread
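The follow-up patch Waiman mentions is not part of this thread; a
minimal sketch of the idea, assuming the check goes at the top of
update_tg_load_avg() (a sketch of the approach, not the posted patch):

static inline void update_tg_load_avg(struct cfs_rq *cfs_rq, int force)
{
	long delta = cfs_rq->avg.load_avg - cfs_rq->tg_load_avg_contrib;

	/*
	 * Sketch of the root-group short-circuit (assumed shape, not
	 * the actual posted patch): the root task group's load_avg is
	 * not needed to normalize weights, so skip the contended
	 * atomic update for it entirely.
	 */
	if (cfs_rq->tg == &root_task_group)
		return;

	if (force || abs(delta) > cfs_rq->tg_load_avg_contrib / 64) {
		atomic_long_add(delta, &cfs_rq->tg->load_avg);
		cfs_rq->tg_load_avg_contrib = cfs_rq->avg.load_avg;
	}
}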
* Re: [RFC PATCH 3/3] sched/fair: Use different cachelines for readers and writers of load_avg
  2015-11-30 19:13 ` Waiman Long
  2015-11-30 22:09   ` Peter Zijlstra
@ 2015-11-30 22:29   ` Peter Zijlstra
  2015-12-01  4:00     ` Waiman Long
  2015-11-30 22:32   ` Peter Zijlstra
  2 siblings, 1 reply; 18+ messages in thread
From: Peter Zijlstra @ 2015-11-30 22:29 UTC (permalink / raw)
  To: Waiman Long
  Cc: Ingo Molnar, linux-kernel, Scott J Norton, Douglas Hatch,
	Paul Turner, Ben Segall, Morten Rasmussen, Yuyang Du

On Mon, Nov 30, 2015 at 02:13:32PM -0500, Waiman Long wrote:
> >This would only work if the structure itself is allocated with cacheline
> >alignment, and looking at sched_create_group(), we use a plain kzalloc()
> >for this, which doesn't guarantee any sort of alignment beyond machine
> >word size IIRC.
> 
> With a RHEL 6 derived .config file, the size of the task_group structure
> was 460 bytes on a 32-bit x86 kernel. Adding a ____cacheline_aligned tag
> increases the size to 512 bytes. So it did make the structure a multiple
> of the cacheline size. With both slub and slab, the allocated task group
> pointers from kzalloc() in sched_create_group() were all multiples of
> 0x200. So they were properly aligned for the ____cacheline_aligned tag
> to work.

Not sure we should rely on sl*b doing the right thing here.
KMALLOC_MIN_ALIGN is explicitly set to sizeof(long long). If you want
explicit alignment, one should use KMEM_CACHE().

^ permalink raw reply	[flat|nested] 18+ messages in thread
* Re: [RFC PATCH 3/3] sched/fair: Use different cachelines for readers and writers of load_avg
  2015-11-30 22:29 ` Peter Zijlstra
@ 2015-12-01  4:00   ` Waiman Long
  2015-12-01  8:47     ` Peter Zijlstra
  0 siblings, 1 reply; 18+ messages in thread
From: Waiman Long @ 2015-12-01  4:00 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, linux-kernel, Scott J Norton, Douglas Hatch,
	Paul Turner, Ben Segall, Morten Rasmussen, Yuyang Du

On 11/30/2015 05:29 PM, Peter Zijlstra wrote:
> On Mon, Nov 30, 2015 at 02:13:32PM -0500, Waiman Long wrote:
>>> This would only work if the structure itself is allocated with cacheline
>>> alignment, and looking at sched_create_group(), we use a plain kzalloc()
>>> for this, which doesn't guarantee any sort of alignment beyond machine
>>> word size IIRC.
>> With a RHEL 6 derived .config file, the size of the task_group structure
>> was 460 bytes on a 32-bit x86 kernel. Adding a ____cacheline_aligned tag
>> increases the size to 512 bytes. So it did make the structure a multiple
>> of the cacheline size. With both slub and slab, the allocated task group
>> pointers from kzalloc() in sched_create_group() were all multiples of
>> 0x200. So they were properly aligned for the ____cacheline_aligned tag
>> to work.
> Not sure we should rely on sl*b doing the right thing here.
> KMALLOC_MIN_ALIGN is explicitly set to sizeof(long long). If you want
> explicit alignment, one should use KMEM_CACHE().

I think the current kernel uses power-of-2 kmemcaches to satisfy
kmalloc() requests except when the size is less than or equal to 192,
where some non-power-of-2 kmemcaches are available. Given that the
task_group structure is large enough with FAIR_GROUP_SCHED enabled, we
shouldn't hit the case that the allocated buffer is not cacheline
aligned.

Cheers,
Longman

^ permalink raw reply	[flat|nested] 18+ messages in thread
* Re: [RFC PATCH 3/3] sched/fair: Use different cachelines for readers and writers of load_avg
  2015-12-01  4:00 ` Waiman Long
@ 2015-12-01  8:47   ` Peter Zijlstra
  2015-12-02 18:44     ` Waiman Long
  0 siblings, 1 reply; 18+ messages in thread
From: Peter Zijlstra @ 2015-12-01  8:47 UTC (permalink / raw)
  To: Waiman Long
  Cc: Ingo Molnar, linux-kernel, Scott J Norton, Douglas Hatch,
	Paul Turner, Ben Segall, Morten Rasmussen, Yuyang Du

On Mon, Nov 30, 2015 at 11:00:35PM -0500, Waiman Long wrote:
> I think the current kernel uses power-of-2 kmemcaches to satisfy
> kmalloc() requests except when the size is less than or equal to 192,
> where some non-power-of-2 kmemcaches are available. Given that the
> task_group structure is large enough with FAIR_GROUP_SCHED enabled, we
> shouldn't hit the case that the allocated buffer is not cacheline
> aligned.

Using out-of-object storage is allowed (none of the existing sl*b
allocators do so iirc).

That is, it's perfectly valid for a sl*b allocator for 64 byte objects
to allocate 72 bytes for each object and use the 'spare' 8 bytes for
object tracking or whatnot.

That would respect the minimum alignment guarantee of 8 bytes but not
provide the 'expected' object size alignment you're assuming.

Also, we have the proper interfaces to request explicit alignment for
a reason. So if you need the alignment for correctness, use those.

^ permalink raw reply	[flat|nested] 18+ messages in thread
* Re: [RFC PATCH 3/3] sched/fair: Use different cachelines for readers and writers of load_avg
  2015-12-01  8:47 ` Peter Zijlstra
@ 2015-12-02 18:44   ` Waiman Long
  0 siblings, 0 replies; 18+ messages in thread
From: Waiman Long @ 2015-12-02 18:44 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, linux-kernel, Scott J Norton, Douglas Hatch,
	Paul Turner, Ben Segall, Morten Rasmussen, Yuyang Du

On 12/01/2015 03:47 AM, Peter Zijlstra wrote:
> On Mon, Nov 30, 2015 at 11:00:35PM -0500, Waiman Long wrote:
>
>> I think the current kernel uses power-of-2 kmemcaches to satisfy
>> kmalloc() requests except when the size is less than or equal to 192,
>> where some non-power-of-2 kmemcaches are available. Given that the
>> task_group structure is large enough with FAIR_GROUP_SCHED enabled, we
>> shouldn't hit the case that the allocated buffer is not cacheline
>> aligned.
> Using out-of-object storage is allowed (none of the existing sl*b
> allocators do so iirc).
>
> That is, it's perfectly valid for a sl*b allocator for 64 byte objects
> to allocate 72 bytes for each object and use the 'spare' 8 bytes for
> object tracking or whatnot.
>
> That would respect the minimum alignment guarantee of 8 bytes but not
> provide the 'expected' object size alignment you're assuming.
>
> Also, we have the proper interfaces to request explicit alignment for
> a reason. So if you need the alignment for correctness, use those.

Thanks for the tip. I have just sent out an updated patch set which
creates a cacheline-aligned kmem cache for the task group structure.
That should work under all kernel config settings.

Cheers,
Longman

^ permalink raw reply	[flat|nested] 18+ messages in thread
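The updated patch set itself is not in this thread; a minimal sketch of
the approach Peter suggested, assuming a dedicated slab cache set up
during scheduler init (the function name is illustrative, not from the
actual patch):

/* Illustrative sketch only -- not the actual posted patch. */
static struct kmem_cache *task_group_cache __read_mostly;

void __init init_task_group_cache(void)
{
	/*
	 * KMEM_CACHE() derives the alignment from
	 * __alignof__(struct task_group), which the ____cacheline_aligned
	 * members raise to the cacheline size, so every object handed
	 * out is cacheline aligned regardless of the sl*b allocator in
	 * use.
	 */
	task_group_cache = KMEM_CACHE(task_group, 0);
}

sched_create_group() would then allocate with
kmem_cache_zalloc(task_group_cache, GFP_KERNEL) instead of the plain
kzalloc(), making the alignment an explicit property of the cache
rather than an accident of kmalloc's sizing.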
* Re: [RFC PATCH 3/3] sched/fair: Use different cachelines for readers and writers of load_avg
  2015-11-30 19:13 ` Waiman Long
  2015-11-30 22:09   ` Peter Zijlstra
  2015-11-30 22:29   ` Peter Zijlstra
@ 2015-11-30 22:32   ` Peter Zijlstra
  2 siblings, 0 replies; 18+ messages in thread
From: Peter Zijlstra @ 2015-11-30 22:32 UTC (permalink / raw)
  To: Waiman Long
  Cc: Ingo Molnar, linux-kernel, Scott J Norton, Douglas Hatch,
	Paul Turner, Ben Segall, Morten Rasmussen, Yuyang Du

On Mon, Nov 30, 2015 at 02:13:32PM -0500, Waiman Long wrote:
> On 11/30/2015 05:22 AM, Peter Zijlstra wrote:
> >Please always Cc the people who wrote the code.
> >
> >+CC pjt, ben, morten, yuyang
> 
> Sorry for that. Their names didn't show up when I did get_maintainer.pl.

Ah, I never much use get_maintainers.pl, but it might be that
--git-blame would yield some of the people who poked at this code.

But I suspect it's all yuyang now, seeing how he recently reworked it
all.

^ permalink raw reply	[flat|nested] 18+ messages in thread
end of thread, other threads:[~2015-12-04 11:58 UTC | newest]

Thread overview: 18+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2015-11-25 19:09 [PATCH 0/3] sched/fair: Reduce contention on tg's load_avg Waiman Long
2015-11-25 19:09 ` [PATCH 1/3] sched/fair: Avoid redundant idle_cpu() call in update_sg_lb_stats() Waiman Long
2015-12-04 11:57   ` [tip:sched/core] " tip-bot for Waiman Long
2015-11-25 19:09 ` [PATCH 2/3] sched/fair: Move hot load_avg into its own cacheline Waiman Long
2015-11-30 10:23   ` Peter Zijlstra
2015-11-25 19:09 ` [RFC PATCH 3/3] sched/fair: Use different cachelines for readers and writers of load_avg Waiman Long
2015-11-30 10:22   ` Peter Zijlstra
2015-11-30 19:13     ` Waiman Long
2015-11-30 22:09       ` Peter Zijlstra
2015-12-01  3:55         ` Waiman Long
2015-12-01  8:49           ` Peter Zijlstra
2015-12-01 10:44             ` Mike Galbraith
2015-12-02 18:48             ` Waiman Long
2015-11-30 22:29       ` Peter Zijlstra
2015-12-01  4:00         ` Waiman Long
2015-12-01  8:47           ` Peter Zijlstra
2015-12-02 18:44             ` Waiman Long
2015-11-30 22:32       ` Peter Zijlstra