* [PATCH v3 0/2] sched/fair: Limit access to overutilized
From: Shrikanth Hegde
Date: 2024-02-29 10:40 UTC
To: mingo, peterz, vincent.guittot
Cc: sshegde, yu.c.chen, dietmar.eggemann, linux-kernel, nysal, aboorvad, srikar, vschneid, pierre.gondois, morten.rasmussen, qyousef

When running an ISV workload on a large system (240 cores, SMT8), a perf
profile showed newidle_balance and enqueue_task_fair consuming more cycles
than expected. Perf annotate showed that most of the time was spent
accessing the overutilized field of the root domain. Aboorva was able to
reproduce a similar perf profile by making some changes to stress-ng
--wait. Both newidle_balance and enqueue_task_fair consume close to 5-7%,
and perf annotate shows that most of the cycles are spent accessing the
rd->overutilized field.

Overutilized was added for EAS (Energy Aware Scheduling) to decide whether
to do load balancing or not. Simultaneous access to overutilized by
multiple CPUs leads to cache-line invalidations due to true sharing.
Updating overutilized is not required on non-EAS platforms. Since
overutilized and overload are part of the same cacheline, there is false
sharing as well.

Patch 1/2 is the main patch and addresses the issue above: with it
applied, neither function shows up in the profile. A before/after
comparison is in its changelog. With the patch, the stated problem in the
ISV workload is also resolved and throughput has improved.

Patch 2/2 is only code refactoring to use a helper function instead of
accessing the field directly.

Detailed perf annotate output can be found in the cover letter of v1.

v2 -> v3:
- Pierre and Dietmar suggested adding one more EAS check before calling
  cpu_overutilized, since that value is not used in the non-EAS case
  anyway.
- Refactored the code as Dietmar suggested, to avoid an additional call
  to sched_energy_enabled().
- Minor edits to the changelog.

v1 -> v2:
- Chen Yu pointed out a minor issue in the code. Corrected the code and
  updated the changelog.

v1: https://lore.kernel.org/lkml/20240223150707.410417-1-sshegde@linux.ibm.com/
v2: https://lore.kernel.org/lkml/20240228071621.602596-1-sshegde@linux.ibm.com/

Shrikanth Hegde (2):
  sched/fair: Add EAS checks before updating overutilized
  sched/fair: Use helper function to access rd->overutilized

 kernel/sched/fair.c | 61 ++++++++++++++++++++++++++++++---------------
 1 file changed, 41 insertions(+), 20 deletions(-)

-- 
2.39.3
* [PATCH v3 1/2] sched/fair: Add EAS checks before updating overutilized
From: Shrikanth Hegde
Date: 2024-02-29 10:40 UTC
To: mingo, peterz, vincent.guittot
Cc: sshegde, yu.c.chen, dietmar.eggemann, linux-kernel, nysal, aboorvad, srikar, vschneid, pierre.gondois, morten.rasmussen, qyousef

The overutilized field of the root domain is only used by EAS (Energy
Aware Scheduling) to decide whether to do load balancing or not. It is
not used when EAS is not possible.

Currently enqueue_task_fair and task_tick_fair access, and sometimes
update, this field. In update_sd_lb_stats it is updated often. This
causes cache contention due to true sharing and burns a lot of cycles.
overload and overutilized are part of the same cacheline, so updating
overutilized often invalidates the cacheline and slows down accesses to
overload due to false sharing. Hence add an EAS check before
accessing/updating this field. The EAS check is either optimized out at
compile time or a static branch, so it shouldn't cost much.

With the patch, neither enqueue_task_fair nor newidle_balance shows up
as a hot routine in the perf profile.

6.8-rc4:
     7.18%  swapper  [kernel.vmlinux]  [k] enqueue_task_fair
     6.78%  s        [kernel.vmlinux]  [k] newidle_balance
+patch:
     0.14%  swapper  [kernel.vmlinux]  [k] enqueue_task_fair
     0.00%  swapper  [kernel.vmlinux]  [k] newidle_balance

Minor change: trace_sched_overutilized_tp expects its second argument to
be bool, so do an int to bool conversion for it.
Fixes: 2802bf3cd936 ("sched/fair: Add over-utilization/tipping point indicator")
Signed-off-by: Shrikanth Hegde <sshegde@linux.ibm.com>
---
 kernel/sched/fair.c | 47 ++++++++++++++++++++++++++++-----------------
 1 file changed, 29 insertions(+), 18 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 6a16129f9a5c..1f7d62b7c26f 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6670,15 +6670,29 @@ static inline bool cpu_overutilized(int cpu)
 	return !util_fits_cpu(cpu_util_cfs(cpu), rq_util_min, rq_util_max, cpu);
 }
 
-static inline void update_overutilized_status(struct rq *rq)
+static inline void set_rd_overutilized_status(struct root_domain *rd,
+					      unsigned int status)
 {
-	if (!READ_ONCE(rq->rd->overutilized) && cpu_overutilized(rq->cpu)) {
-		WRITE_ONCE(rq->rd->overutilized, SG_OVERUTILIZED);
-		trace_sched_overutilized_tp(rq->rd, SG_OVERUTILIZED);
-	}
+	WRITE_ONCE(rd->overutilized, status);
+	trace_sched_overutilized_tp(rd, !!status);
+}
+
+static inline void check_update_overutilized_status(struct rq *rq)
+{
+	/*
+	 * overutilized field is used for load balancing decisions only
+	 * if energy aware scheduler is being used
+	 */
+	if (!sched_energy_enabled())
+		return;
+
+	if (!READ_ONCE(rq->rd->overutilized) && cpu_overutilized(rq->cpu))
+		set_rd_overutilized_status(rq->rd, SG_OVERUTILIZED);
 }
 #else
-static inline void update_overutilized_status(struct rq *rq) { }
+static inline void check_update_overutilized_status(struct rq *rq) { }
+static inline void set_rd_overutilized_status(struct root_domain *rd,
+					      unsigned int status) { }
 #endif
 
 /* Runqueue only has SCHED_IDLE tasks enqueued */
@@ -6779,7 +6793,7 @@ enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)
 	 * and the following generally works well enough in practice.
 	 */
 	if (!task_new)
-		update_overutilized_status(rq);
+		check_update_overutilized_status(rq);
 
 enqueue_throttle:
 	assert_list_leaf_cfs_rq(rq);
@@ -9902,7 +9916,7 @@ static inline void update_sg_lb_stats(struct lb_env *env,
 	if (nr_running > 1)
 		*sg_status |= SG_OVERLOAD;
 
-	if (cpu_overutilized(i))
+	if (sched_energy_enabled() && cpu_overutilized(i))
 		*sg_status |= SG_OVERUTILIZED;
 
 #ifdef CONFIG_NUMA_BALANCING
@@ -10596,19 +10610,16 @@ static inline void update_sd_lb_stats(struct lb_env *env, struct sd_lb_stats *sd
 	env->fbq_type = fbq_classify_group(&sds->busiest_stat);
 
 	if (!env->sd->parent) {
-		struct root_domain *rd = env->dst_rq->rd;
-
 		/* update overload indicator if we are at root domain */
-		WRITE_ONCE(rd->overload, sg_status & SG_OVERLOAD);
+		WRITE_ONCE(env->dst_rq->rd->overload, sg_status & SG_OVERLOAD);
 
 		/* Update over-utilization (tipping point, U >= 0) indicator */
-		WRITE_ONCE(rd->overutilized, sg_status & SG_OVERUTILIZED);
-		trace_sched_overutilized_tp(rd, sg_status & SG_OVERUTILIZED);
+		if (sched_energy_enabled()) {
+			set_rd_overutilized_status(env->dst_rq->rd,
+						   sg_status & SG_OVERUTILIZED);
+		}
 	} else if (sg_status & SG_OVERUTILIZED) {
-		struct root_domain *rd = env->dst_rq->rd;
-
-		WRITE_ONCE(rd->overutilized, SG_OVERUTILIZED);
-		trace_sched_overutilized_tp(rd, SG_OVERUTILIZED);
+		set_rd_overutilized_status(env->dst_rq->rd, SG_OVERUTILIZED);
 	}
 
 	update_idle_cpu_scan(env, sum_util);
@@ -12609,7 +12620,7 @@ static void task_tick_fair(struct rq *rq, struct task_struct *curr, int queued)
 		task_tick_numa(rq, curr);
 
 	update_misfit_status(curr, rq);
-	update_overutilized_status(task_rq(curr));
+	check_update_overutilized_status(task_rq(curr));
 
 	task_tick_core(rq, curr);
 }
-- 
2.39.3
* Re: [PATCH v3 1/2] sched/fair: Add EAS checks before updating overutilized
From: Srikar Dronamraju
Date: 2024-03-01 9:26 UTC
To: Shrikanth Hegde
Cc: mingo, peterz, vincent.guittot, yu.c.chen, dietmar.eggemann, linux-kernel, nysal, aboorvad, vschneid, pierre.gondois, morten.rasmussen, qyousef

* Shrikanth Hegde <sshegde@linux.ibm.com> [2024-02-29 16:10:09]:

> Overutilized field of root domain is only used for EAS(energy aware scheduler)
> to decide whether to do load balance or not. It is not used if EAS
> not possible.
>
> [...]
>
> Fixes: 2802bf3cd936 ("sched/fair: Add over-utilization/tipping point indicator")
> Signed-off-by: Shrikanth Hegde <sshegde@linux.ibm.com>

Looks good to me.

Reviewed-by: Srikar Dronamraju <srikar@linux.ibm.com>

-- 
Thanks and Regards
Srikar Dronamraju
* Re: [PATCH v3 1/2] sched/fair: Add EAS checks before updating overutilized
From: Shrikanth Hegde
Date: 2024-03-01 14:14 UTC
To: mingo, peterz, vincent.guittot
Cc: yu.c.chen, dietmar.eggemann, linux-kernel, nysal, aboorvad, srikar, vschneid, pierre.gondois, morten.rasmussen, qyousef

On 2/29/24 4:10 PM, Shrikanth Hegde wrote:

> [...]
>
> @@ -10596,19 +10610,16 @@ static inline void update_sd_lb_stats(struct lb_env *env, struct sd_lb_stats *sd
>
> 		/* Update over-utilization (tipping point, U >= 0) indicator */
> -		WRITE_ONCE(rd->overutilized, sg_status & SG_OVERUTILIZED);
> -		trace_sched_overutilized_tp(rd, sg_status & SG_OVERUTILIZED);
> +		if (sched_energy_enabled()) {
> +			set_rd_overutilized_status(env->dst_rq->rd,
> +						   sg_status & SG_OVERUTILIZED);
> +		}
> 	} else if (sg_status & SG_OVERUTILIZED) {

Sorry, I missed adding a check here. Let me do that and send out the
next version.

> -		struct root_domain *rd = env->dst_rq->rd;
> -
> -		WRITE_ONCE(rd->overutilized, SG_OVERUTILIZED);
> -		trace_sched_overutilized_tp(rd, SG_OVERUTILIZED);
> +		set_rd_overutilized_status(env->dst_rq->rd, SG_OVERUTILIZED);
> 	}
>
> [...]
* [PATCH v3 2/2] sched/fair: Use helper function to access rd->overutilized
From: Shrikanth Hegde
Date: 2024-02-29 10:40 UTC
To: mingo, peterz, vincent.guittot
Cc: sshegde, yu.c.chen, dietmar.eggemann, linux-kernel, nysal, aboorvad, srikar, vschneid, pierre.gondois, morten.rasmussen, qyousef

The overutilized field is accessed directly in multiple places, so it
could use a helper function. That way readers are better informed that
it is meant to be used only in the EAS case.

No change in functionality intended.

Signed-off-by: Shrikanth Hegde <sshegde@linux.ibm.com>
---
 kernel/sched/fair.c | 16 +++++++++++++---
 1 file changed, 13 insertions(+), 3 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 1f7d62b7c26f..99b5ddf92924 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6670,6 +6670,15 @@ static inline bool cpu_overutilized(int cpu)
 	return !util_fits_cpu(cpu_util_cfs(cpu), rq_util_min, rq_util_max, cpu);
 }
 
+/*
+ * Callers must ensure EAS is in use: the overutilized value
+ * makes sense only if EAS is enabled.
+ */
+static inline int is_rd_overutilized(struct root_domain *rd)
+{
+	return READ_ONCE(rd->overutilized);
+}
+
 static inline void set_rd_overutilized_status(struct root_domain *rd,
 					      unsigned int status)
 {
@@ -6686,13 +6695,14 @@ static inline void check_update_overutilized_status(struct rq *rq)
 	if (!sched_energy_enabled())
 		return;
 
-	if (!READ_ONCE(rq->rd->overutilized) && cpu_overutilized(rq->cpu))
+	if (!is_rd_overutilized(rq->rd) && cpu_overutilized(rq->cpu))
 		set_rd_overutilized_status(rq->rd, SG_OVERUTILIZED);
 }
 #else
 static inline void check_update_overutilized_status(struct rq *rq) { }
 static inline void set_rd_overutilized_status(struct root_domain *rd,
 					      unsigned int status) { }
+static inline int is_rd_overutilized(struct root_domain *rd) { return 0; }
 #endif
 
 /* Runqueue only has SCHED_IDLE tasks enqueued */
@@ -7974,7 +7984,7 @@ static int find_energy_efficient_cpu(struct task_struct *p, int prev_cpu)
 
 	rcu_read_lock();
 	pd = rcu_dereference(rd->pd);
-	if (!pd || READ_ONCE(rd->overutilized))
+	if (!pd || is_rd_overutilized(rd))
 		goto unlock;
 
 	/*
@@ -10859,7 +10869,7 @@ static struct sched_group *find_busiest_group(struct lb_env *env)
 	if (sched_energy_enabled()) {
 		struct root_domain *rd = env->dst_rq->rd;
 
-		if (rcu_dereference(rd->pd) && !READ_ONCE(rd->overutilized))
+		if (rcu_dereference(rd->pd) && !is_rd_overutilized(rd))
 			goto out_balanced;
 	}
 
-- 
2.39.3

[Editor's note: the stub in the #else branch was posted as `{ }`; since it is declared to return int, it is shown here as `{ return 0; }` so the non-EAS build compiles cleanly.]