public inbox for linux-kernel@vger.kernel.org
* [PATCH v2 0/2] sched/fair: Limit access to overutilized
@ 2024-02-28  7:16 Shrikanth Hegde
  2024-02-28  7:16 ` [PATCH v2 1/2] sched/fair: Add EAS checks before updating overutilized Shrikanth Hegde
                   ` (2 more replies)
  0 siblings, 3 replies; 11+ messages in thread
From: Shrikanth Hegde @ 2024-02-28  7:16 UTC (permalink / raw)
  To: mingo, peterz, vincent.guittot
  Cc: sshegde, yu.c.chen, dietmar.eggemann, linux-kernel, nysal,
	aboorvad, srikar, vschneid, pierre.gondois, morten.rasmussen,
	qyousef

When running an ISV workload on a large system (240 cores, SMT8), it was
observed from the perf profile that newidle_balance and enqueue_task_fair
were consuming a significant number of cycles. Perf annotate showed that most
of the time was spent accessing the overutilized field of the root domain.

Aboorva was able to reproduce a similar perf profile by making some
changes to stress-ng --wait. Both newidle_balance and enqueue_task_fair
consume close to 5-7% of cycles each. Perf annotate shows that most of these
cycles are spent accessing the rd->overutilized field.

The overutilized field was added for EAS (Energy Aware Scheduling) to choose
between EAS-aware load balancing and regular load balancing. As verified, on
both x86 and powerpc the overload and overutilized fields share the same
cacheline in rd. Updating overutilized is not required on non-EAS platforms.
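
For reference, the two fields are adjacent in struct root_domain
(kernel/sched/sched.h); a rough sketch of the relevant part of the layout,
abbreviated and approximate as of v6.8:

	struct root_domain {
		atomic_t	refcount;
		atomic_t	rto_count;
		struct rcu_head	rcu;
		cpumask_var_t	span;
		cpumask_var_t	online;

		/* more than one runnable task on some CPU */
		int		overload;

		/* one or more CPUs over-utilized (tipping point) */
		int		overutilized;
		...
	};

Since the two ints sit back to back they typically land on the same
cacheline, so frequent stores to overutilized also dirty the line that
overload is read from.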

Patch 1/2 is the main patch; it addresses the issue described above. With it
applied, neither function shows up in the profile. A before/after comparison
is in its changelog. With the patch, the stated problem in the ISV workload
was also solved and throughput improved.
Patch 2/2 is only code refactoring to use a helper function instead of
accessing the field directly, so that one can tell it is accessed only for
EAS. It depends on 1/2 being applied first.

Thanks to Aboorva Devarajan and Nysal Jan K A for helping to recreate and
debug this issue and for verifying the patch.
Detailed perf annotate output can be found in the cover letter of v1.

v1 -> v2:
Chen Yu pointed out a minor issue in the code. Corrected it and updated
the changelog.

v1: https://lore.kernel.org/lkml/20240223150707.410417-1-sshegde@linux.ibm.com/

Shrikanth Hegde (2):
  sched/fair: Add EAS checks before updating overutilized
  sched/fair: Use helper function to access rd->overutilized

 kernel/sched/fair.c | 49 +++++++++++++++++++++++++++++++++------------
 1 file changed, 36 insertions(+), 13 deletions(-)

--
2.39.3



* [PATCH v2 1/2] sched/fair: Add EAS checks before updating overutilized
  2024-02-28  7:16 [PATCH v2 0/2] sched/fair: Limit access to overutilized Shrikanth Hegde
@ 2024-02-28  7:16 ` Shrikanth Hegde
  2024-02-28 15:58   ` Pierre Gondois
  2024-02-28  7:16 ` [PATCH v2 2/2] sched/fair: Use helper function to access rd->overutilized Shrikanth Hegde
  2024-02-29  0:08 ` [PATCH v2 0/2] sched/fair: Limit access to overutilized Qais Yousef
  2 siblings, 1 reply; 11+ messages in thread
From: Shrikanth Hegde @ 2024-02-28  7:16 UTC (permalink / raw)
  To: mingo, peterz, vincent.guittot
  Cc: sshegde, yu.c.chen, dietmar.eggemann, linux-kernel, nysal,
	aboorvad, srikar, vschneid, pierre.gondois, morten.rasmussen,
	qyousef

The overutilized field of the root domain is only used by EAS (Energy Aware
Scheduling) to decide whether to do a regular load balance or an EAS-aware
load balance. It is not used if EAS is not possible.

Currently enqueue_task_fair and task_tick_fair access, and sometimes update,
this field. In update_sd_lb_stats it is updated often.
This causes cache contention due to load/store tearing and burns
a lot of cycles. Hence add an EAS check before updating this field.
The EAS check is either optimized out at compile time or a static branch,
so it shouldn't cost much.

With the patch, both enqueue_task_fair and newidle_balance don't show
up as hot routines in perf profile.

6.8-rc4:
7.18%  swapper          [kernel.vmlinux]              [k] enqueue_task_fair
6.78%  s                [kernel.vmlinux]              [k] newidle_balance
+patch:
0.14%  swapper          [kernel.vmlinux]              [k] enqueue_task_fair
0.00%  swapper          [kernel.vmlinux]              [k] newidle_balance

Minor change: trace_sched_overutilized_tp expects its second argument to
be bool, so do an int to bool conversion for it.
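
For reference, the tracepoint is declared with a bool second argument,
roughly as follows (include/trace/events/sched.h):

DECLARE_TRACE(sched_overutilized_tp,
	TP_PROTO(struct root_domain *rd, bool overutilized),
	TP_ARGS(rd, overutilized));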

Fixes: 2802bf3cd936 ("sched/fair: Add over-utilization/tipping point indicator")
Signed-off-by: Shrikanth Hegde <sshegde@linux.ibm.com>
---
 kernel/sched/fair.c | 35 ++++++++++++++++++++++++-----------
 1 file changed, 24 insertions(+), 11 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 8e30e2bb77a0..3105fb08b87e 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6670,15 +6670,30 @@ static inline bool cpu_overutilized(int cpu)
 	return !util_fits_cpu(cpu_util_cfs(cpu), rq_util_min, rq_util_max, cpu);
 }

-static inline void update_overutilized_status(struct rq *rq)
+static inline void update_rd_overutilized_status(struct root_domain *rd,
+						 int status)
 {
-	if (!READ_ONCE(rq->rd->overutilized) && cpu_overutilized(rq->cpu)) {
-		WRITE_ONCE(rq->rd->overutilized, SG_OVERUTILIZED);
-		trace_sched_overutilized_tp(rq->rd, SG_OVERUTILIZED);
+	if (sched_energy_enabled()) {
+		WRITE_ONCE(rd->overutilized, status);
+		trace_sched_overutilized_tp(rd, !!status);
+	}
+}
+
+static inline void check_update_overutilized_status(struct rq *rq)
+{
+	/*
+	 * The overutilized field is used for load balancing decisions
+	 * only if the energy aware scheduler is in use.
+	 */
+	if (sched_energy_enabled()) {
+		if (!READ_ONCE(rq->rd->overutilized) && cpu_overutilized(rq->cpu))
+			update_rd_overutilized_status(rq->rd, SG_OVERUTILIZED);
 	}
 }
 #else
-static inline void update_overutilized_status(struct rq *rq) { }
+static inline void check_update_overutilized_status(struct rq *rq) { }
+static inline void update_rd_overutilized_status(struct root_domain *rd,
+						 int status) { }
 #endif

 /* Runqueue only has SCHED_IDLE tasks enqueued */
@@ -6779,7 +6794,7 @@ enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)
 	 * and the following generally works well enough in practice.
 	 */
 	if (!task_new)
-		update_overutilized_status(rq);
+		check_update_overutilized_status(rq);

 enqueue_throttle:
 	assert_list_leaf_cfs_rq(rq);
@@ -10613,13 +10628,11 @@ static inline void update_sd_lb_stats(struct lb_env *env, struct sd_lb_stats *sd
 		WRITE_ONCE(rd->overload, sg_status & SG_OVERLOAD);

 		/* Update over-utilization (tipping point, U >= 0) indicator */
-		WRITE_ONCE(rd->overutilized, sg_status & SG_OVERUTILIZED);
-		trace_sched_overutilized_tp(rd, sg_status & SG_OVERUTILIZED);
+		update_rd_overutilized_status(rd, sg_status & SG_OVERUTILIZED);
 	} else if (sg_status & SG_OVERUTILIZED) {
 		struct root_domain *rd = env->dst_rq->rd;

-		WRITE_ONCE(rd->overutilized, SG_OVERUTILIZED);
-		trace_sched_overutilized_tp(rd, SG_OVERUTILIZED);
+		update_rd_overutilized_status(rd, SG_OVERUTILIZED);
 	}

 	update_idle_cpu_scan(env, sum_util);
@@ -12625,7 +12638,7 @@ static void task_tick_fair(struct rq *rq, struct task_struct *curr, int queued)
 		task_tick_numa(rq, curr);

 	update_misfit_status(curr, rq);
-	update_overutilized_status(task_rq(curr));
+	check_update_overutilized_status(task_rq(curr));

 	task_tick_core(rq, curr);
 }
--
2.39.3



* [PATCH v2 2/2] sched/fair: Use helper function to access rd->overutilized
  2024-02-28  7:16 [PATCH v2 0/2] sched/fair: Limit access to overutilized Shrikanth Hegde
  2024-02-28  7:16 ` [PATCH v2 1/2] sched/fair: Add EAS checks before updating overutilized Shrikanth Hegde
@ 2024-02-28  7:16 ` Shrikanth Hegde
  2024-02-29  0:08 ` [PATCH v2 0/2] sched/fair: Limit access to overutilized Qais Yousef
  2 siblings, 0 replies; 11+ messages in thread
From: Shrikanth Hegde @ 2024-02-28  7:16 UTC (permalink / raw)
  To: mingo, peterz, vincent.guittot
  Cc: sshegde, yu.c.chen, dietmar.eggemann, linux-kernel, nysal,
	aboorvad, srikar, vschneid, pierre.gondois, morten.rasmussen,
	qyousef

The overutilized field is accessed directly in multiple places, so
introduce a helper function for it. That way it is clearer that the
field is meant to be used only in the EAS case.

No change in functionality intended.

Signed-off-by: Shrikanth Hegde <sshegde@linux.ibm.com>
---
 kernel/sched/fair.c | 16 +++++++++++++---
 1 file changed, 13 insertions(+), 3 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 3105fb08b87e..c085fe8421b8 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6670,6 +6670,15 @@ static inline bool cpu_overutilized(int cpu)
 	return !util_fits_cpu(cpu_util_cfs(cpu), rq_util_min, rq_util_max, cpu);
 }

+/*
+ * The caller must ensure EAS is enabled: the overutilized
+ * value makes sense only if EAS is enabled.
+ */
+static inline int is_rd_overutilized(struct root_domain *rd)
+{
+	return READ_ONCE(rd->overutilized);
+}
+
 static inline void update_rd_overutilized_status(struct root_domain *rd,
 						 int status)
 {
@@ -6686,7 +6695,7 @@ static inline void check_update_overutilized_status(struct rq *rq)
 	 * only if the energy aware scheduler is in use.
 	 */
 	if (sched_energy_enabled()) {
-		if (!READ_ONCE(rq->rd->overutilized) && cpu_overutilized(rq->cpu))
+		if (!is_rd_overutilized(rq->rd) && cpu_overutilized(rq->cpu))
 			update_rd_overutilized_status(rq->rd, SG_OVERUTILIZED);
 	}
 }
@@ -6694,6 +6703,7 @@ static inline void check_update_overutilized_status(struct rq *rq)
 static inline void check_update_overutilized_status(struct rq *rq) { }
 static inline void update_rd_overutilized_status(struct root_domain *rd,
 						 int status) { }
+static inline int is_rd_overutilized(struct root_domain *rd) { return 0; }
 #endif

 /* Runqueue only has SCHED_IDLE tasks enqueued */
@@ -7969,7 +7979,7 @@ static int find_energy_efficient_cpu(struct task_struct *p, int prev_cpu)

 	rcu_read_lock();
 	pd = rcu_dereference(rd->pd);
-	if (!pd || READ_ONCE(rd->overutilized))
+	if (!pd || is_rd_overutilized(rd))
 		goto unlock;

 	/*
@@ -10872,7 +10882,7 @@ static struct sched_group *find_busiest_group(struct lb_env *env)
 	if (sched_energy_enabled()) {
 		struct root_domain *rd = env->dst_rq->rd;

-		if (rcu_dereference(rd->pd) && !READ_ONCE(rd->overutilized))
+		if (rcu_dereference(rd->pd) && !is_rd_overutilized(rd))
 			goto out_balanced;
 	}

--
2.39.3



* Re: [PATCH v2 1/2] sched/fair: Add EAS checks before updating overutilized
  2024-02-28  7:16 ` [PATCH v2 1/2] sched/fair: Add EAS checks before updating overutilized Shrikanth Hegde
@ 2024-02-28 15:58   ` Pierre Gondois
  2024-02-28 17:24     ` Shrikanth Hegde
  0 siblings, 1 reply; 11+ messages in thread
From: Pierre Gondois @ 2024-02-28 15:58 UTC (permalink / raw)
  To: Shrikanth Hegde, mingo, peterz, vincent.guittot
  Cc: yu.c.chen, dietmar.eggemann, linux-kernel, nysal, aboorvad,
	srikar, vschneid, morten.rasmussen, qyousef

It is nice to avoid calling effective_cpu_util() through the following
when EAS is not enabled:

cpu_overutilized()
\-util_fits_cpu()
   \- ...
     \-effective_cpu_util()

On 2/28/24 08:16, Shrikanth Hegde wrote:
> The overutilized field of the root domain is only used by EAS (Energy Aware
> Scheduling) to decide whether to do a regular load balance or an EAS-aware
> load balance. It is not used if EAS is not possible.
>
> Currently enqueue_task_fair and task_tick_fair access, and sometimes update,
> this field. In update_sd_lb_stats it is updated often.
> This causes cache contention due to load/store tearing and burns
> a lot of cycles. Hence add an EAS check before updating this field.
> The EAS check is either optimized out at compile time or a static branch,
> so it shouldn't cost much.
> 
> With the patch, both enqueue_task_fair and newidle_balance don't show
> up as hot routines in perf profile.
> 
> 6.8-rc4:
> 7.18%  swapper          [kernel.vmlinux]              [k] enqueue_task_fair
> 6.78%  s                [kernel.vmlinux]              [k] newidle_balance
> +patch:
> 0.14%  swapper          [kernel.vmlinux]              [k] enqueue_task_fair
> 0.00%  swapper          [kernel.vmlinux]              [k] newidle_balance
> 
> Minor change: trace_sched_overutilized_tp expects its second argument to
> be bool, so do an int to bool conversion for it.
> 
> Fixes: 2802bf3cd936 ("sched/fair: Add over-utilization/tipping point indicator")
> Signed-off-by: Shrikanth Hegde <sshegde@linux.ibm.com>
> ---
>   kernel/sched/fair.c | 35 ++++++++++++++++++++++++-----------
>   1 file changed, 24 insertions(+), 11 deletions(-)
> 
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 8e30e2bb77a0..3105fb08b87e 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -6670,15 +6670,30 @@ static inline bool cpu_overutilized(int cpu)
>   	return !util_fits_cpu(cpu_util_cfs(cpu), rq_util_min, rq_util_max, cpu);
>   }
> 
> -static inline void update_overutilized_status(struct rq *rq)
> +static inline void update_rd_overutilized_status(struct root_domain *rd,
> +						 int status)
>   {
> -	if (!READ_ONCE(rq->rd->overutilized) && cpu_overutilized(rq->cpu)) {
> -		WRITE_ONCE(rq->rd->overutilized, SG_OVERUTILIZED);
> -		trace_sched_overutilized_tp(rq->rd, SG_OVERUTILIZED);
> +	if (sched_energy_enabled()) {
> +		WRITE_ONCE(rd->overutilized, status);
> +		trace_sched_overutilized_tp(rd, !!status);
> +	}
> +}

NIT:
When called from check_update_overutilized_status(),
sched_energy_enabled() will be checked twice.

> +
> +static inline void check_update_overutilized_status(struct rq *rq)
> +{
> +	/*
> +	 * The overutilized field is used for load balancing decisions
> +	 * only if the energy aware scheduler is in use.
> +	 */
> +	if (sched_energy_enabled()) {
> +		if (!READ_ONCE(rq->rd->overutilized) && cpu_overutilized(rq->cpu))
> +			update_rd_overutilized_status(rq->rd, SG_OVERUTILIZED);
>   	}
>   }
>   #else
> -static inline void update_overutilized_status(struct rq *rq) { }
> +static inline void check_update_overutilized_status(struct rq *rq) { }
> +static inline void update_rd_overutilized_status(struct root_domain *rd,
> +						 int status) { }
>   #endif
> 
>   /* Runqueue only has SCHED_IDLE tasks enqueued */
> @@ -6779,7 +6794,7 @@ enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)
>   	 * and the following generally works well enough in practice.
>   	 */
>   	if (!task_new)
> -		update_overutilized_status(rq);
> +		check_update_overutilized_status(rq);
> 
>   enqueue_throttle:
>   	assert_list_leaf_cfs_rq(rq);
> @@ -10613,13 +10628,11 @@ static inline void update_sd_lb_stats(struct lb_env *env, struct sd_lb_stats *sd
>   		WRITE_ONCE(rd->overload, sg_status & SG_OVERLOAD);
> 
>   		/* Update over-utilization (tipping point, U >= 0) indicator */
> -		WRITE_ONCE(rd->overutilized, sg_status & SG_OVERUTILIZED);
> -		trace_sched_overutilized_tp(rd, sg_status & SG_OVERUTILIZED);
> +		update_rd_overutilized_status(rd, sg_status & SG_OVERUTILIZED);
>   	} else if (sg_status & SG_OVERUTILIZED) {
>   		struct root_domain *rd = env->dst_rq->rd;
> 
> -		WRITE_ONCE(rd->overutilized, SG_OVERUTILIZED);
> -		trace_sched_overutilized_tp(rd, SG_OVERUTILIZED);
> +		update_rd_overutilized_status(rd, SG_OVERUTILIZED);
>   	}
> 
>   	update_idle_cpu_scan(env, sum_util);
> @@ -12625,7 +12638,7 @@ static void task_tick_fair(struct rq *rq, struct task_struct *curr, int queued)
>   		task_tick_numa(rq, curr);
> 
>   	update_misfit_status(curr, rq);
> -	update_overutilized_status(task_rq(curr));
> +	check_update_overutilized_status(task_rq(curr));
> 
>   	task_tick_core(rq, curr);
>   }
> --
> 2.39.3
> 


* Re: [PATCH v2 1/2] sched/fair: Add EAS checks before updating overutilized
  2024-02-28 15:58   ` Pierre Gondois
@ 2024-02-28 17:24     ` Shrikanth Hegde
  2024-02-28 23:34       ` Dietmar Eggemann
  2024-02-29  9:11       ` Pierre Gondois
  0 siblings, 2 replies; 11+ messages in thread
From: Shrikanth Hegde @ 2024-02-28 17:24 UTC (permalink / raw)
  To: Pierre Gondois
  Cc: yu.c.chen, dietmar.eggemann, linux-kernel, nysal, aboorvad,
	srikar, vschneid, morten.rasmussen, qyousef, mingo, peterz,
	vincent.guittot



On 2/28/24 9:28 PM, Pierre Gondois wrote:

Hi Pierre, Thanks for taking a look.

> It is nice to avoid calling effective_cpu_util() through the following
> when EAS is not enabled:

I think we are avoiding calling cpu_overutilized except in update_sg_lb_stats.
I didn't want to put an EAS check in cpu_overutilized as it could be a useful
function in non-EAS cases in the future. Calling cpu_overutilized alone doesn't
access the root_domain's overutilized field, so we are okay w.r.t.
cache issues.
But currently we do some extra computation and then don't use the result in
the non-EAS case in update_sg_lb_stats.

Would something like this make sense?
@@ -9925,7 +9925,7 @@ static inline void update_sg_lb_stats(struct lb_env *env,
                if (nr_running > 1)
                        *sg_status |= SG_OVERLOAD;
 
-               if (cpu_overutilized(i))
+               if (sched_energy_enabled() && cpu_overutilized(i))
                        *sg_status |= SG_OVERUTILIZED;
 



I didn't find how util_fits_cpu would end up calling effective_cpu_util.
Could you please elaborate?

> cpu_overutilized()
> \-util_fits_cpu()
>   \- ...
>     \-effective_cpu_util()
> 
> On 2/28/24 08:16, Shrikanth Hegde wrote:
>> The overutilized field of the root domain is only used by EAS (Energy Aware
>> Scheduling) to decide whether to do a regular load balance or an EAS-aware
>> load balance. It is not used if EAS is not possible.
>>
>> Currently enqueue_task_fair and task_tick_fair access, and sometimes update,
>> this field. In update_sd_lb_stats it is updated often.
>> This causes cache contention due to load/store tearing and burns
>> a lot of cycles. Hence add an EAS check before updating this field.
>> The EAS check is either optimized out at compile time or a static branch,
>> so it shouldn't cost much.
>>
>> With the patch, both enqueue_task_fair and newidle_balance don't show
>> up as hot routines in perf profile.
>>
>> 6.8-rc4:
>> 7.18%  swapper          [kernel.vmlinux]              [k]
>> enqueue_task_fair
>> 6.78%  s                [kernel.vmlinux]              [k] newidle_balance
>> +patch:
>> 0.14%  swapper          [kernel.vmlinux]              [k]
>> enqueue_task_fair
>> 0.00%  swapper          [kernel.vmlinux]              [k] newidle_balance
>>
>> Minor change: trace_sched_overutilized_tp expects its second argument to
>> be bool, so do an int to bool conversion for it.
>>
>> Fixes: 2802bf3cd936 ("sched/fair: Add over-utilization/tipping point
>> indicator")
>> Signed-off-by: Shrikanth Hegde <sshegde@linux.ibm.com>
>> ---
>>   kernel/sched/fair.c | 35 ++++++++++++++++++++++++-----------
>>   1 file changed, 24 insertions(+), 11 deletions(-)
>>
>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>> index 8e30e2bb77a0..3105fb08b87e 100644
>> --- a/kernel/sched/fair.c
>> +++ b/kernel/sched/fair.c
>> @@ -6670,15 +6670,30 @@ static inline bool cpu_overutilized(int cpu)
>>       return !util_fits_cpu(cpu_util_cfs(cpu), rq_util_min,
>> rq_util_max, cpu);
>>   }
>>
>> -static inline void update_overutilized_status(struct rq *rq)
>> +static inline void update_rd_overutilized_status(struct root_domain *rd,
>> +                         int status)
>>   {
>> -    if (!READ_ONCE(rq->rd->overutilized) && cpu_overutilized(rq->cpu)) {
>> -        WRITE_ONCE(rq->rd->overutilized, SG_OVERUTILIZED);
>> -        trace_sched_overutilized_tp(rq->rd, SG_OVERUTILIZED);
>> +    if (sched_energy_enabled()) {
>> +        WRITE_ONCE(rd->overutilized, status);
>> +        trace_sched_overutilized_tp(rd, !!status);
>> +    }
>> +}
> 
> NIT:
> When called from check_update_overutilized_status(),
> sched_energy_enabled() will be checked twice.
Yes. 
But, I think that's okay since it is a static branch check at best. 
This way it keeps the code simpler. 
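
For reference, sched_energy_enabled() is roughly the following
(kernel/sched/sched.h, simplified):

#if defined(CONFIG_ENERGY_MODEL) && defined(CONFIG_CPU_FREQ_GOV_SCHEDUTIL)
DECLARE_STATIC_KEY_FALSE(sched_energy_present);

static inline bool sched_energy_enabled(void)
{
	/* patched to a near-free jump/no-op at runtime */
	return static_branch_unlikely(&sched_energy_present);
}
#else
/* folded to a compile-time constant false on !EAS builds */
static inline bool sched_energy_enabled(void)
{
	return false;
}
#endif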

> 
>> +
>> +static inline void check_update_overutilized_status(struct rq *rq)
>> +{
>> +    /*
>> +     * The overutilized field is used for load balancing decisions
>> +     * only if the energy aware scheduler is in use.
>> +     */
>> +    if (sched_energy_enabled()) {
>> +        if (!READ_ONCE(rq->rd->overutilized) &&
>> cpu_overutilized(rq->cpu))
>> +            update_rd_overutilized_status(rq->rd, SG_OVERUTILIZED);
>>       }
>>   }
>>   #else
>> -static inline void update_overutilized_status(struct rq *rq) { }
>> +static inline void check_update_overutilized_status(struct rq *rq) { }
>> +static inline void update_rd_overutilized_status(struct root_domain *rd,
>> +                         int status) { }
>>   #endif
>>
>>   /* Runqueue only has SCHED_IDLE tasks enqueued */
>> @@ -6779,7 +6794,7 @@ enqueue_task_fair(struct rq *rq, struct
>> task_struct *p, int flags)
>>        * and the following generally works well enough in practice.
>>        */
>>       if (!task_new)
>> -        update_overutilized_status(rq);
>> +        check_update_overutilized_status(rq);
>>
>>   enqueue_throttle:
>>       assert_list_leaf_cfs_rq(rq);
>> @@ -10613,13 +10628,11 @@ static inline void update_sd_lb_stats(struct
>> lb_env *env, struct sd_lb_stats *sd
>>           WRITE_ONCE(rd->overload, sg_status & SG_OVERLOAD);
>>
>>           /* Update over-utilization (tipping point, U >= 0) indicator */
>> -        WRITE_ONCE(rd->overutilized, sg_status & SG_OVERUTILIZED);
>> -        trace_sched_overutilized_tp(rd, sg_status & SG_OVERUTILIZED);
>> +        update_rd_overutilized_status(rd, sg_status & SG_OVERUTILIZED);
>>       } else if (sg_status & SG_OVERUTILIZED) {
>>           struct root_domain *rd = env->dst_rq->rd;
>>
>> -        WRITE_ONCE(rd->overutilized, SG_OVERUTILIZED);
>> -        trace_sched_overutilized_tp(rd, SG_OVERUTILIZED);
>> +        update_rd_overutilized_status(rd, SG_OVERUTILIZED);
>>       }
>>
>>       update_idle_cpu_scan(env, sum_util);
>> @@ -12625,7 +12638,7 @@ static void task_tick_fair(struct rq *rq,
>> struct task_struct *curr, int queued)
>>           task_tick_numa(rq, curr);
>>
>>       update_misfit_status(curr, rq);
>> -    update_overutilized_status(task_rq(curr));
>> +    check_update_overutilized_status(task_rq(curr));
>>
>>       task_tick_core(rq, curr);
>>   }
>> -- 
>> 2.39.3
>>


* Re: [PATCH v2 1/2] sched/fair: Add EAS checks before updating overutilized
  2024-02-28 17:24     ` Shrikanth Hegde
@ 2024-02-28 23:34       ` Dietmar Eggemann
  2024-02-29  4:30         ` Shrikanth Hegde
  2024-02-29  9:11       ` Pierre Gondois
  1 sibling, 1 reply; 11+ messages in thread
From: Dietmar Eggemann @ 2024-02-28 23:34 UTC (permalink / raw)
  To: Shrikanth Hegde, Pierre Gondois
  Cc: yu.c.chen, linux-kernel, nysal, aboorvad, srikar, vschneid,
	morten.rasmussen, qyousef, mingo, peterz, vincent.guittot

On 28/02/2024 18:24, Shrikanth Hegde wrote:
> 
> 
> On 2/28/24 9:28 PM, Pierre Gondois wrote:

[...]

> But currently we do some extra computation and then don't use the result in
> the non-EAS case in update_sg_lb_stats.
> 
> Would something like this make sense?
> @@ -9925,7 +9925,7 @@ static inline void update_sg_lb_stats(struct lb_env *env,
>                 if (nr_running > 1)
>                         *sg_status |= SG_OVERLOAD;
>  
> -               if (cpu_overutilized(i))
> +               if (sched_energy_enabled() && cpu_overutilized(i))
>                         *sg_status |= SG_OVERUTILIZED;

Yes, we could also disable the setting of OU in load_balance in the
!EAS case.

[...]

>> NIT:
>> When called from check_update_overutilized_status(),
>> sched_energy_enabled() will be checked twice.
> Yes. 
> But, I think that's okay since it is a static branch check at best. 
> This way it keeps the code simpler. 

You could keep the sched_energy_enabled() outside of the new
set_overutilized_status() to avoid this:

-->8--

---
 kernel/sched/fair.c | 32 ++++++++++++++++++--------------
 1 file changed, 18 insertions(+), 14 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 32bc98d9123d..c82164bf45f3 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6676,12 +6676,19 @@ static inline bool cpu_overutilized(int cpu)
 	return !util_fits_cpu(cpu_util_cfs(cpu), rq_util_min, rq_util_max, cpu);
 }
 
+static inline void set_overutilized_status(struct rq *rq, unsigned int val)
+{
+	WRITE_ONCE(rq->rd->overutilized, val);
+	trace_sched_overutilized_tp(rq->rd, val);
+}
+
 static inline void update_overutilized_status(struct rq *rq)
 {
-	if (!READ_ONCE(rq->rd->overutilized) && cpu_overutilized(rq->cpu)) {
-		WRITE_ONCE(rq->rd->overutilized, SG_OVERUTILIZED);
-		trace_sched_overutilized_tp(rq->rd, SG_OVERUTILIZED);
-	}
+	if (!sched_energy_enabled())
+		return;
+
+	if (!READ_ONCE(rq->rd->overutilized) && cpu_overutilized(rq->cpu))
+		set_overutilized_status(rq, SG_OVERUTILIZED);
 }
 #else
 static inline void update_overutilized_status(struct rq *rq) { }
@@ -10755,19 +10762,16 @@ static inline void update_sd_lb_stats(struct lb_env *env, struct sd_lb_stats *sd
 		env->fbq_type = fbq_classify_group(&sds->busiest_stat);
 
 	if (!env->sd->parent) {
-		struct root_domain *rd = env->dst_rq->rd;
-
 		/* update overload indicator if we are at root domain */
-		WRITE_ONCE(rd->overload, sg_status & SG_OVERLOAD);
+		WRITE_ONCE(env->dst_rq->rd->overload, sg_status & SG_OVERLOAD);
 
 		/* Update over-utilization (tipping point, U >= 0) indicator */
-		WRITE_ONCE(rd->overutilized, sg_status & SG_OVERUTILIZED);
-		trace_sched_overutilized_tp(rd, sg_status & SG_OVERUTILIZED);
-	} else if (sg_status & SG_OVERUTILIZED) {
-		struct root_domain *rd = env->dst_rq->rd;
-
-		WRITE_ONCE(rd->overutilized, SG_OVERUTILIZED);
-		trace_sched_overutilized_tp(rd, SG_OVERUTILIZED);
+		if (sched_energy_enabled()) {
+			set_overutilized_status(env->dst_rq,
+						sg_status & SG_OVERUTILIZED);
+		}
+	} else if (sched_energy_enabled() && sg_status & SG_OVERUTILIZED) {
+		set_overutilized_status(env->dst_rq, SG_OVERUTILIZED);
 	}
 
 	update_idle_cpu_scan(env, sum_util);
-- 
2.25.1


* Re: [PATCH v2 0/2] sched/fair: Limit access to overutilized
  2024-02-28  7:16 [PATCH v2 0/2] sched/fair: Limit access to overutilized Shrikanth Hegde
  2024-02-28  7:16 ` [PATCH v2 1/2] sched/fair: Add EAS checks before updating overutilized Shrikanth Hegde
  2024-02-28  7:16 ` [PATCH v2 2/2] sched/fair: Use helper function to access rd->overutilized Shrikanth Hegde
@ 2024-02-29  0:08 ` Qais Yousef
  2024-02-29  4:46   ` Shrikanth Hegde
  2 siblings, 1 reply; 11+ messages in thread
From: Qais Yousef @ 2024-02-29  0:08 UTC (permalink / raw)
  To: Shrikanth Hegde
  Cc: mingo, peterz, vincent.guittot, yu.c.chen, dietmar.eggemann,
	linux-kernel, nysal, aboorvad, srikar, vschneid, pierre.gondois,
	morten.rasmussen

On 02/28/24 12:46, Shrikanth Hegde wrote:
> When running an ISV workload on a large system (240 cores, SMT8), it was
> observed from the perf profile that newidle_balance and enqueue_task_fair
> were consuming a significant number of cycles. Perf annotate showed that most
> of the time was spent accessing the overutilized field of the root domain.
> 
> Aboorva was able to reproduce a similar perf profile by making some
> changes to stress-ng --wait. Both newidle_balance and enqueue_task_fair
> consume close to 5-7% of cycles each. Perf annotate shows that most of these
> cycles are spent accessing the rd->overutilized field.
> 
> The overutilized field was added for EAS (Energy Aware Scheduling) to choose
> between EAS-aware load balancing and regular load balancing. As verified, on
> both x86 and

It actually toggles load balance on/off (off if !overutilized).

Misfit load balancing used to be controlled by this, but it was decoupled in
commit e5ed0550c04c ("sched/fair: unlink misfit task from cpu overutilized").

> powerpc the overload and overutilized fields share the same cacheline in rd.
> Updating overutilized is not required on non-EAS platforms.

Is the fact that these two share a cacheline part of the problem? From patch
1 it seems the problem is that overutilized is updated often from different
CPUs? Did you try to move overutilized to a different place to see if that
alternatively helps?

The patches look fine to me. I am just trying to verify that the access to
overutilized is indeed the problem, and not that something else on the same
cacheline is accidentally being slowed down, which would mean the problem
could resurface in the future.

> 
> Patch 1/2 is the main patch; it addresses the issue described above. With it
> applied, neither function shows up in the profile. A before/after comparison
> is in its changelog. With the patch, the stated problem in the ISV workload
> was also solved and throughput improved.
> Patch 2/2 is only code refactoring to use a helper function instead of
> accessing the field directly, so that one can tell it is accessed only for
> EAS. It depends on 1/2 being applied first.
> 
> Thanks to Aboorva Devarajan and Nysal Jan K A for helping to recreate and
> debug this issue and for verifying the patch.
> Detailed perf annotate output can be found in the cover letter of v1.
> 
> v1 -> v2:
> Chen Yu pointed out a minor issue in the code. Corrected it and updated
> the changelog.
> 
> v1: https://lore.kernel.org/lkml/20240223150707.410417-1-sshegde@linux.ibm.com/
> 
> Shrikanth Hegde (2):
>   sched/fair: Add EAS checks before updating overutilized
>   sched/fair: Use helper function to access rd->overutilized
> 
>  kernel/sched/fair.c | 49 +++++++++++++++++++++++++++++++++------------
>  1 file changed, 36 insertions(+), 13 deletions(-)
> 
> --
> 2.39.3
> 


* Re: [PATCH v2 1/2] sched/fair: Add EAS checks before updating overutilized
  2024-02-28 23:34       ` Dietmar Eggemann
@ 2024-02-29  4:30         ` Shrikanth Hegde
  0 siblings, 0 replies; 11+ messages in thread
From: Shrikanth Hegde @ 2024-02-29  4:30 UTC (permalink / raw)
  To: Dietmar Eggemann
  Cc: yu.c.chen, linux-kernel, nysal, aboorvad, srikar, vschneid,
	morten.rasmussen, qyousef, mingo, peterz, vincent.guittot,
	Pierre Gondois



On 2/29/24 5:04 AM, Dietmar Eggemann wrote:
> On 28/02/2024 18:24, Shrikanth Hegde wrote:
> 

Thank you, Dietmar, for taking a look.

> [...]
> 
>> But currently we do some extra computation and then don't use the result in
>> the non-EAS case in update_sg_lb_stats.
>>
>> Would something like this make sense?
>> @@ -9925,7 +9925,7 @@ static inline void update_sg_lb_stats(struct lb_env *env,
>>                 if (nr_running > 1)
>>                         *sg_status |= SG_OVERLOAD;
>>  
>> -               if (cpu_overutilized(i))
>> +               if (sched_energy_enabled() && cpu_overutilized(i))
>>                         *sg_status |= SG_OVERUTILIZED;
> 
> Yes, we could also disable the setting of OU in load_balance in the
> !EAS case.
> 
> [...]

Ok. I will add this change. I don't see any other place where we need to do an
EAS check w.r.t. overutilized. This should cover all cases then.

> 
>>> NIT:
>>> When called from check_update_overutilized_status(),
>>> sched_energy_enabled() will be checked twice.
>> Yes. 
>> But, I think that's okay since it is a static branch check at best. 
>> This way it keeps the code simpler. 
> 
> You could keep the sched_energy_enabled() outside of the new
> set_overutilized_status() to avoid this:
> 
> -->8--

Ok. We can do this as well. I will incorporate it and send out v3 soon.


> 
> ---
>  kernel/sched/fair.c | 32 ++++++++++++++++++--------------
>  1 file changed, 18 insertions(+), 14 deletions(-)
> 
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 32bc98d9123d..c82164bf45f3 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -6676,12 +6676,19 @@ static inline bool cpu_overutilized(int cpu)
>  	return !util_fits_cpu(cpu_util_cfs(cpu), rq_util_min, rq_util_max, cpu);
>  }
>  
> +static inline void set_overutilized_status(struct rq *rq, unsigned int val)
> +{
> +	WRITE_ONCE(rq->rd->overutilized, val);
> +	trace_sched_overutilized_tp(rq->rd, val);
> +}
> +
>  static inline void update_overutilized_status(struct rq *rq)
>  {
> -	if (!READ_ONCE(rq->rd->overutilized) && cpu_overutilized(rq->cpu)) {
> -		WRITE_ONCE(rq->rd->overutilized, SG_OVERUTILIZED);
> -		trace_sched_overutilized_tp(rq->rd, SG_OVERUTILIZED);
> -	}
> +	if (!sched_energy_enabled())
> +		return;
> +
> +	if (!READ_ONCE(rq->rd->overutilized) && cpu_overutilized(rq->cpu))
> +		set_overutilized_status(rq, SG_OVERUTILIZED);
>  }
>  #else
>  static inline void update_overutilized_status(struct rq *rq) { }
> @@ -10755,19 +10762,16 @@ static inline void update_sd_lb_stats(struct lb_env *env, struct sd_lb_stats *sd
>  		env->fbq_type = fbq_classify_group(&sds->busiest_stat);
>  
>  	if (!env->sd->parent) {
> -		struct root_domain *rd = env->dst_rq->rd;
> -
>  		/* update overload indicator if we are at root domain */
> -		WRITE_ONCE(rd->overload, sg_status & SG_OVERLOAD);
> +		WRITE_ONCE(env->dst_rq->rd->overload, sg_status & SG_OVERLOAD);
>  
>  		/* Update over-utilization (tipping point, U >= 0) indicator */
> -		WRITE_ONCE(rd->overutilized, sg_status & SG_OVERUTILIZED);
> -		trace_sched_overutilized_tp(rd, sg_status & SG_OVERUTILIZED);
> -	} else if (sg_status & SG_OVERUTILIZED) {
> -		struct root_domain *rd = env->dst_rq->rd;
> -
> -		WRITE_ONCE(rd->overutilized, SG_OVERUTILIZED);
> -		trace_sched_overutilized_tp(rd, SG_OVERUTILIZED);
> +		if (sched_energy_enabled()) {
> +			set_overutilized_status(env->dst_rq,
> +						sg_status & SG_OVERUTILIZED);
> +		}
> +	} else if (sched_energy_enabled() && sg_status & SG_OVERUTILIZED) {
> +		set_overutilized_status(env->dst_rq, SG_OVERUTILIZED);
>  	}
>  
>  	update_idle_cpu_scan(env, sum_util);


* Re: [PATCH v2 0/2] sched/fair: Limit access to overutilized
  2024-02-29  0:08 ` [PATCH v2 0/2] sched/fair: Limit access to overutilized Qais Yousef
@ 2024-02-29  4:46   ` Shrikanth Hegde
  2024-03-03 18:06     ` Qais Yousef
  0 siblings, 1 reply; 11+ messages in thread
From: Shrikanth Hegde @ 2024-02-29  4:46 UTC (permalink / raw)
  To: Qais Yousef
  Cc: mingo, peterz, vincent.guittot, yu.c.chen, dietmar.eggemann,
	linux-kernel, nysal, aboorvad, srikar, vschneid, pierre.gondois,
	morten.rasmussen



On 2/29/24 5:38 AM, Qais Yousef wrote:
> On 02/28/24 12:46, Shrikanth Hegde wrote:
[...]
>> The overutilized field was added for EAS (Energy Aware Scheduling) to choose
>> between EAS-aware load balancing and regular load balancing. As verified, on
>> both x86 and
> 
> It actually toggles load balance on/off (off if !overutilized).
> 
> Misfit load balancing used to be controlled by this, but it was decoupled in
> commit e5ed0550c04c ("sched/fair: unlink misfit task from cpu overutilized").
> 

Ok.

>> powerpc the overload and overutilized fields share the same cacheline in rd.
>> Updating overutilized is not required on non-EAS platforms.
> 
> > Is the fact that these two share a cacheline part of the problem? From patch
> > 1 it seems the problem is that overutilized is updated often from different
> > CPUs? Did you try to move overutilized to a different place to see if that
> > alternatively helps?
> >
> > The patches look fine to me. I am just trying to verify that the access to
> > overutilized is indeed the problem, and not that something else on the same
> > cacheline is accidentally being slowed down, which would mean the problem
> > could resurface in the future.
> 

We explicitly cacheline-aligned overload. With that, newidle_balance goes away
from the perf profile, but enqueue_task_fair still remains. That is because
load/store tearing happens on the overutilized field alone, due to different
CPUs accessing and updating it at the same time.

We have also verified that rq->rd->overutilized in the enqueue_task_fair path
is the reason it shows up in the perf profile.
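
For illustration, the cacheline-align experiment was along these lines (a
hypothetical hunk against struct root_domain, not something posted in this
series):

-	int			overload;
+	/* keep overload off the line that overutilized stores keep dirtying */
+	int			overload ____cacheline_aligned;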

>>
[...]
>>
>> --
>> 2.39.3
>>


* Re: [PATCH v2 1/2] sched/fair: Add EAS checks before updating overutilized
  2024-02-28 17:24     ` Shrikanth Hegde
  2024-02-28 23:34       ` Dietmar Eggemann
@ 2024-02-29  9:11       ` Pierre Gondois
  1 sibling, 0 replies; 11+ messages in thread
From: Pierre Gondois @ 2024-02-29  9:11 UTC (permalink / raw)
  To: Shrikanth Hegde
  Cc: yu.c.chen, dietmar.eggemann, linux-kernel, nysal, aboorvad,
	srikar, vschneid, morten.rasmussen, qyousef, mingo, peterz,
	vincent.guittot

Hello Shrikanth,

On 2/28/24 18:24, Shrikanth Hegde wrote:
> 
> 
> On 2/28/24 9:28 PM, Pierre Gondois wrote:
> 
> Hi Pierre, Thanks for taking a look.
> 
>> It is nice to avoid calling effective_cpu_util() through the following
>> when EAS is not enabled:
> I think we are avoiding calling cpu_overutilized except in update_sg_lb_stats.
> I didn't want to put an EAS check in cpu_overutilized as it could be a useful
> function in non-EAS cases in the future. Calling cpu_overutilized alone doesn't
> access the root_domain's overutilized field, so we are okay w.r.t.
> cache issues.
> But currently we do some extra computation and then don't use the result in
> the non-EAS case in update_sg_lb_stats.
> 
> Would something like this make sense?
> @@ -9925,7 +9925,7 @@ static inline void update_sg_lb_stats(struct lb_env *env,
>                  if (nr_running > 1)
>                          *sg_status |= SG_OVERLOAD;
>   
> -               if (cpu_overutilized(i))
> +               if (sched_energy_enabled() && cpu_overutilized(i))
>                          *sg_status |= SG_OVERUTILIZED;
>   

Yes, right. I think what Dietmar suggested is also a good idea, and it
could be used instead.

> 
> 
> 
> I didn't find how util_fits_cpu would end up calling effective_cpu_util.
> Could you please elaborate?

Sorry, I meant this path:
cpu_overutilized()
\-cpu_util_cfs()
   \-cpu_util()

effective_cpu_util() is effectively not involved.

> 
>> cpu_overutilized()
>> \-util_fits_cpu()
>>    \- ...
>>      \-effective_cpu_util()
>>

[snip]

Regards,
Pierre


* Re: [PATCH v2 0/2] sched/fair: Limit access to overutilized
  2024-02-29  4:46   ` Shrikanth Hegde
@ 2024-03-03 18:06     ` Qais Yousef
  0 siblings, 0 replies; 11+ messages in thread
From: Qais Yousef @ 2024-03-03 18:06 UTC (permalink / raw)
  To: Shrikanth Hegde
  Cc: mingo, peterz, vincent.guittot, yu.c.chen, dietmar.eggemann,
	linux-kernel, nysal, aboorvad, srikar, vschneid, pierre.gondois,
	morten.rasmussen

On 02/29/24 10:16, Shrikanth Hegde wrote:

> > Is the fact that these two share a cacheline part of the problem? From patch
> > 1 it seems the problem is that overutilized is updated often from different
> > CPUs? Did you try to move overutilized to a different place to see if that
> > alternatively helps?
> >
> > The patches look fine to me. I am just trying to verify that the access to
> > overutilized is indeed the problem, and not that something else on the same
> > cacheline is accidentally being slowed down, which would mean the problem
> > could resurface in the future.
> > 
> 
> We explicitly cacheline-aligned overload. With that, newidle_balance goes away
> from the perf profile, but enqueue_task_fair still remains. That is because
> load/store

I don't have a solution, but this accidental dependency is something to ponder
for the future..

> tearing happens on the overutilized field alone, due to different CPUs
> accessing and updating it at the same time.
>
> We have also verified that rq->rd->overutilized in the enqueue_task_fair path
> is the reason it shows up in the perf profile.

Something to ponder as well. Maybe this is better converted to a per-cpu
variable if it can cause such cacheline bouncing. But that's probably not good
enough on its own.
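
Very roughly, something like the below (purely illustrative, hypothetical
names; readers would then have to aggregate across CPUs, which is the part
that is probably not good enough):

	DECLARE_PER_CPU(int, overutilized);

	static inline void set_cpu_overutilized(int cpu, int val)
	{
		/* each CPU stores only to its own per-cpu copy: no bouncing */
		per_cpu(overutilized, cpu) = val;
	}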

Anyway. Something to ponder for the future too. I think the current definition
of overutilized is due for a revisit and I'll keep this problem in mind if we
can introduce a friendlier pattern.


Thanks

--
Qais Yousef

