public inbox for linux-kernel@vger.kernel.org
* [RFC PATCH] sched/fair: Skip idle CPU search on busy system
@ 2023-07-26  9:36 Shrikanth Hegde
  2023-07-27  7:25 ` Chen Yu
                   ` (2 more replies)
  0 siblings, 3 replies; 6+ messages in thread
From: Shrikanth Hegde @ 2023-07-26  9:36 UTC (permalink / raw)
  To: peterz, vincent.guittot
  Cc: sshegde, srikar, linux-kernel, mingo, dietmar.eggemann, mgorman

When the system is fully busy, there will not be any idle CPUs.
In that case, load_balance will be called mainly with the CPU_NOT_IDLE
type. should_we_balance currently checks for an idle CPU if one
exists. When the system is 100% busy, there will not be an idle CPU,
so these idle_cpu checks can be skipped. This avoids fetching those
rq structures.

This is a minor optimization for a specific case of 100% utilization.

.....
Coming to the current implementation: it is a very basic approach to
the issue and may not be the best/perfect way to do this. It works only
with CONFIG_NO_HZ_COMMON, since nohz.nr_cpus is, AFAIU, the only
globally available counter that tracks idle CPUs. If there is other
such information, we could use that instead. Note that nohz.nr_cpus is
atomic, which might be costly too.

An alternative would be to add a new attribute to sched_domain and
update it per CPU in the idle entry/exit path. The advantage is that
the check could be per env->sd instead of global. Slightly more
complicated, but maybe better; there could be other advantages too,
such as limiting the scan at wakeup.
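To make that alternative concrete, here is a minimal userspace sketch
(all names and the layout are hypothetical, not actual kernel code) of a
per-domain busy counter updated from the idle entry/exit path:

```c
#include <assert.h>
#include <stdbool.h>

#define NR_DOMAIN_CPUS 8

/* Hypothetical per-domain state; the real sched_domain layout differs. */
struct sd {
	int nr_busy_cpus;	/* updated on idle entry/exit */
};

static void cpu_enter_idle(struct sd *sd) { sd->nr_busy_cpus--; }
static void cpu_exit_idle(struct sd *sd)  { sd->nr_busy_cpus++; }

/*
 * If the destination CPU is not idle and every CPU in the domain is
 * busy, the idle_cpu() scan in should_we_balance() cannot find
 * anything, so it could be skipped entirely.
 */
static bool can_skip_idle_scan(const struct sd *sd, bool dst_cpu_busy)
{
	return dst_cpu_busy && sd->nr_busy_cpus == NR_DOMAIN_CPUS;
}
```

The counter would be maintained per env->sd, so the check avoids any
global cacheline; the sketch ignores the locking this would need.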

Your feedback would really help. Does this optimization make sense?

Signed-off-by: Shrikanth Hegde <sshegde@linux.vnet.ibm.com>
---
 kernel/sched/fair.c | 6 ++++++
 1 file changed, 6 insertions(+)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 373ff5f55884..903d59b5290c 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -10713,6 +10713,12 @@ static int should_we_balance(struct lb_env *env)
 		return 1;
 	}

+#ifdef CONFIG_NO_HZ_COMMON
+	/* If the system is fully busy, its better to skip the idle checks */
+	if (env->idle == CPU_NOT_IDLE && atomic_read(&nohz.nr_cpus) == 0)
+		return group_balance_cpu(sg) == env->dst_cpu;
+#endif
+
 	/* Try to find first idle CPU */
 	for_each_cpu_and(cpu, group_balance_mask(sg), env->cpus) {
 		if (!idle_cpu(cpu))
--
2.31.1


^ permalink raw reply related	[flat|nested] 6+ messages in thread

* Re: [RFC PATCH] sched/fair: Skip idle CPU search on busy system
  2023-07-26  9:36 [RFC PATCH] sched/fair: Skip idle CPU search on busy system Shrikanth Hegde
@ 2023-07-27  7:25 ` Chen Yu
  2023-07-27 15:04   ` Shrikanth Hegde
  2023-08-09 18:44 ` Vishal Chourasia
  2023-10-04 16:25 ` Ingo Molnar
  2 siblings, 1 reply; 6+ messages in thread
From: Chen Yu @ 2023-07-27  7:25 UTC (permalink / raw)
  To: Shrikanth Hegde
  Cc: peterz, vincent.guittot, srikar, linux-kernel, mingo,
	dietmar.eggemann, mgorman, rui.zhang, tim.c.chen

On 2023-07-26 at 15:06:12 +0530, Shrikanth Hegde wrote:
> When the system is fully busy, there will not be any idle CPUs.
> In that case, load_balance will be called mainly with the CPU_NOT_IDLE
> type. should_we_balance currently checks for an idle CPU if one
> exists. When the system is 100% busy, there will not be an idle CPU,
> so these idle_cpu checks can be skipped. This avoids fetching those
> rq structures.
>

Yes, I guess this could help reduce the cost if the sched group
has many CPUs.

> This is a minor optimization for a specific case of 100% utilization.
> 
> .....
> Coming to the current implementation: it is a very basic approach to
> the issue and may not be the best/perfect way to do this. It works only
> with CONFIG_NO_HZ_COMMON, since nohz.nr_cpus is, AFAIU, the only
> globally available counter that tracks idle CPUs. If there is other
> such information, we could use that instead. Note that nohz.nr_cpus is
> atomic, which might be costly too.
>
> An alternative would be to add a new attribute to sched_domain and
> update it per CPU in the idle entry/exit path. The advantage is that
> the check could be per env->sd instead of global. Slightly more
> complicated, but maybe better; there could be other advantages too,
> such as limiting the scan at wakeup.
> 

When checking the code, I found that there is a per-domain nr_busy_cpus.
However, that variable exists only for the LLC domain. Maybe extending
sd_share to the domains below NUMA would be applicable, IMO.
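A rough userspace model of that idea (the structures here are
hypothetical; the kernel's sched_domain_shared and its nr_busy_cpus are
real, but extending them below NUMA is only a suggestion):

```c
#include <assert.h>
#include <stdbool.h>

#define CPUS_PER_LLC 4

/* Models sched_domain_shared, which today carries nr_busy_cpus only
 * at the LLC level. */
struct sd_shared {
	int nr_busy_cpus;
};

/* A hypothetical non-NUMA domain spanning several LLCs, each with
 * its own shared state. */
struct domain {
	int nr_llcs;
	struct sd_shared *llc;
};

/* The domain is fully busy only if every child LLC is fully busy,
 * so the existing per-LLC counters can answer the question without
 * a new global atomic. */
static bool domain_fully_busy(const struct domain *d)
{
	for (int i = 0; i < d->nr_llcs; i++)
		if (d->llc[i].nr_busy_cpus < CPUS_PER_LLC)
			return false;
	return true;
}
```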

thanks,
Chenyu

> Your feedback would really help. Does this optimization make sense?
> 
> Signed-off-by: Shrikanth Hegde <sshegde@linux.vnet.ibm.com>
> ---
>  kernel/sched/fair.c | 6 ++++++
>  1 file changed, 6 insertions(+)
> 
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 373ff5f55884..903d59b5290c 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -10713,6 +10713,12 @@ static int should_we_balance(struct lb_env *env)
>  		return 1;
>  	}
> 
> +#ifdef CONFIG_NO_HZ_COMMON
> +	/* If the system is fully busy, its better to skip the idle checks */
> +	if (env->idle == CPU_NOT_IDLE && atomic_read(&nohz.nr_cpus) == 0)
> +		return group_balance_cpu(sg) == env->dst_cpu;
> +#endif
> +
>  	/* Try to find first idle CPU */
>  	for_each_cpu_and(cpu, group_balance_mask(sg), env->cpus) {
>  		if (!idle_cpu(cpu))
> --
> 2.31.1
> 

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: [RFC PATCH] sched/fair: Skip idle CPU search on busy system
  2023-07-27  7:25 ` Chen Yu
@ 2023-07-27 15:04   ` Shrikanth Hegde
  0 siblings, 0 replies; 6+ messages in thread
From: Shrikanth Hegde @ 2023-07-27 15:04 UTC (permalink / raw)
  To: Chen Yu
  Cc: peterz, vincent.guittot, srikar, linux-kernel, mingo,
	dietmar.eggemann, mgorman, rui.zhang, tim.c.chen



On 7/27/23 12:55 PM, Chen Yu wrote:
> On 2023-07-26 at 15:06:12 +0530, Shrikanth Hegde wrote:
>> When the system is fully busy, there will not be any idle CPUs.
>> In that case, load_balance will be called mainly with the CPU_NOT_IDLE
>> type. should_we_balance currently checks for an idle CPU if one
>> exists. When the system is 100% busy, there will not be an idle CPU,
>> so these idle_cpu checks can be skipped. This avoids fetching those
>> rq structures.
>>
> 
> Yes, I guess this could help reduce the cost if the sched group
> has many CPUs.

Thank you for the review, Chen Yu.

> 
>> This is a minor optimization for a specific case of 100% utilization.
>>
>> .....
>> Coming to the current implementation: it is a very basic approach to
>> the issue and may not be the best/perfect way to do this. It works only
>> with CONFIG_NO_HZ_COMMON, since nohz.nr_cpus is, AFAIU, the only
>> globally available counter that tracks idle CPUs. If there is other
>> such information, we could use that instead. Note that nohz.nr_cpus is
>> atomic, which might be costly too.
>>
>> An alternative would be to add a new attribute to sched_domain and
>> update it per CPU in the idle entry/exit path. The advantage is that
>> the check could be per env->sd instead of global. Slightly more
>> complicated, but maybe better; there could be other advantages too,
>> such as limiting the scan at wakeup.
>>
> 
> When checking the code, I found that there is a per-domain nr_busy_cpus.
> However, that variable exists only for the LLC domain. Maybe extending
> sd_share to the domains below NUMA would be applicable, IMO.

True, I did see that. Doing this at every level when there is a large
number of CPUs will likely need a lock when updating sd_share, and that
could become a bottleneck as well. Since sd_share never makes sense for
NUMA, this would require different code paths for NUMA and non-NUMA,
even though the main benefit for this corner case would be on NUMA
systems, since that is where the large CPU counts are.

I will keep that thought in mind and try to work out something along
those lines.

> 
> thanks,
> Chenyu
> 
>> Your feedback would really help. Does this optimization make sense?
>>
>> Signed-off-by: Shrikanth Hegde <sshegde@linux.vnet.ibm.com>
>> ---
>>  kernel/sched/fair.c | 6 ++++++
>>  1 file changed, 6 insertions(+)
>>
>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>> index 373ff5f55884..903d59b5290c 100644
>> --- a/kernel/sched/fair.c
>> +++ b/kernel/sched/fair.c
>> @@ -10713,6 +10713,12 @@ static int should_we_balance(struct lb_env *env)
>>  		return 1;
>>  	}
>>
>> +#ifdef CONFIG_NO_HZ_COMMON
>> +	/* If the system is fully busy, its better to skip the idle checks */
>> +	if (env->idle == CPU_NOT_IDLE && atomic_read(&nohz.nr_cpus) == 0)
>> +		return group_balance_cpu(sg) == env->dst_cpu;
>> +#endif
>> +
>>  	/* Try to find first idle CPU */
>>  	for_each_cpu_and(cpu, group_balance_mask(sg), env->cpus) {
>>  		if (!idle_cpu(cpu))
>> --
>> 2.31.1
>>

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: [RFC PATCH] sched/fair: Skip idle CPU search on busy system
  2023-07-26  9:36 [RFC PATCH] sched/fair: Skip idle CPU search on busy system Shrikanth Hegde
  2023-07-27  7:25 ` Chen Yu
@ 2023-08-09 18:44 ` Vishal Chourasia
  2023-08-10 15:44   ` Shrikanth Hegde
  2023-10-04 16:25 ` Ingo Molnar
  2 siblings, 1 reply; 6+ messages in thread
From: Vishal Chourasia @ 2023-08-09 18:44 UTC (permalink / raw)
  To: Shrikanth Hegde
  Cc: peterz, vincent.guittot, srikar, linux-kernel, mingo,
	dietmar.eggemann, mgorman

On Wed, Jul 26, 2023 at 03:06:12PM +0530, Shrikanth Hegde wrote:
> When the system is fully busy, there will not be any idle CPUs.
> In that case, load_balance will be called mainly with the CPU_NOT_IDLE
> type. should_we_balance currently checks for an idle CPU if one
> exists. When the system is 100% busy, there will not be an idle CPU,
> so these idle_cpu checks can be skipped. This avoids fetching those
> rq structures.
> 
> This is a minor optimization for a specific case of 100% utilization.
> 
> .....
> Coming to the current implementation: it is a very basic approach to
> the issue and may not be the best/perfect way to do this. It works only
> with CONFIG_NO_HZ_COMMON, since nohz.nr_cpus is, AFAIU, the only
> globally available counter that tracks idle CPUs. If there is other
> such information, we could use that instead. Note that nohz.nr_cpus is
> atomic, which might be costly too.
>
> An alternative would be to add a new attribute to sched_domain and
> update it per CPU in the idle entry/exit path. The advantage is that
> the check could be per env->sd instead of global. Slightly more
> complicated, but maybe better; there could be other advantages too,
> such as limiting the scan at wakeup.
> 
> Your feedback would really help. Does this optimization make sense?
> 
> Signed-off-by: Shrikanth Hegde <sshegde@linux.vnet.ibm.com>
> ---
>  kernel/sched/fair.c | 6 ++++++
>  1 file changed, 6 insertions(+)
> 
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 373ff5f55884..903d59b5290c 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -10713,6 +10713,12 @@ static int should_we_balance(struct lb_env *env)
>  		return 1;
>  	}
> 
> +#ifdef CONFIG_NO_HZ_COMMON
> +	/* If the system is fully busy, its better to skip the idle checks */
> +	if (env->idle == CPU_NOT_IDLE && atomic_read(&nohz.nr_cpus) == 0)
> +		return group_balance_cpu(sg) == env->dst_cpu;
> +#endif
> +
>  	/* Try to find first idle CPU */
>  	for_each_cpu_and(cpu, group_balance_mask(sg), env->cpus) {
>  		if (!idle_cpu(cpu))
> --
> 2.31.1
> 
Tested this patchset on top of v6.4

5 Runs of stress-ng (100% load) on a system with 16CPUs spawning 23 threads for
60 minutes.

stress-ng: 16CPUs, 23threads, 60mins

- 6.4.0

| completion time(sec) |      user |        sys |
|----------------------+-----------+------------|
|              3600.05 |  57582.44 |       0.70 |
|              3600.10 |  57597.07 |       0.68 |
|              3600.05 |  57596.65 |       0.47 |
|              3600.04 |  57596.36 |       0.71 |
|              3600.06 |  57595.32 |       0.42 |
|              3600.06 | 57593.568 |      0.596 | average
|          0.046904158 | 12.508392 | 0.27878307 | stddev

- 6.4.0+ (with patch)

| completion time(sec) |      user |         sys |
|----------------------+-----------+-------------|
|              3600.04 |  57596.58 |        0.50 |
|              3600.04 |  57595.19 |        0.48 |
|              3600.05 |  57597.39 |        0.49 |
|              3600.04 |  57596.64 |        0.53 |
|              3600.04 |  57595.94 |        0.43 |
|             3600.042 | 57596.348 |       0.486 | average
|         0.0089442719 | 1.6529610 | 0.072938330 | stddev

The average system time is slightly lower in the patched version (0.486 seconds)
compared to the 6.4.0 version (0.596 seconds). 
The standard deviation for system time is also lower in the patched version
(0.0729 seconds) than in the 6.4.0 version (0.2788 seconds), suggesting more
consistent system time results with the patch.

vishal.c

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: [RFC PATCH] sched/fair: Skip idle CPU search on busy system
  2023-08-09 18:44 ` Vishal Chourasia
@ 2023-08-10 15:44   ` Shrikanth Hegde
  0 siblings, 0 replies; 6+ messages in thread
From: Shrikanth Hegde @ 2023-08-10 15:44 UTC (permalink / raw)
  To: Vishal Chourasia
  Cc: peterz, vincent.guittot, srikar, linux-kernel, mingo,
	dietmar.eggemann, mgorman



On 8/10/23 12:14 AM, Vishal Chourasia wrote:
> On Wed, Jul 26, 2023 at 03:06:12PM +0530, Shrikanth Hegde wrote:
>> When the system is fully busy, there will not be any idle CPUs.
>>
> Tested this patchset on top of v6.4
[...]
> 5 Runs of stress-ng (100% load) on a system with 16CPUs spawning 23 threads for
> 60 minutes.
> 
> stress-ng: 16CPUs, 23threads, 60mins
> 
> - 6.4.0
> 
> | completion time(sec) |      user |        sys |
> |----------------------+-----------+------------|
> |              3600.05 |  57582.44 |       0.70 |
> |              3600.10 |  57597.07 |       0.68 |
> |              3600.05 |  57596.65 |       0.47 |
> |              3600.04 |  57596.36 |       0.71 |
> |              3600.06 |  57595.32 |       0.42 |
> |              3600.06 | 57593.568 |      0.596 | average
> |          0.046904158 | 12.508392 | 0.27878307 | stddev
> 
> - 6.4.0+ (with patch)
> 
> | completion time(sec) |      user |         sys |
> |----------------------+-----------+-------------|
> |              3600.04 |  57596.58 |        0.50 |
> |              3600.04 |  57595.19 |        0.48 |
> |              3600.05 |  57597.39 |        0.49 |
> |              3600.04 |  57596.64 |        0.53 |
> |              3600.04 |  57595.94 |        0.43 |
> |             3600.042 | 57596.348 |       0.486 | average
> |         0.0089442719 | 1.6529610 | 0.072938330 | stddev
> 
> The average system time is slightly lower in the patched version (0.486 seconds)
> compared to the 6.4.0 version (0.596 seconds). 
> The standard deviation for system time is also lower in the patched version
> (0.0729 seconds) than in the 6.4.0 version (0.2788 seconds), suggesting more
> consistent system time results with the patch.
> 
> vishal.c

Thank you very much, Vishal, for trying this out.

Meanwhile, I have yet to try the suggestion given by Chen. Let me see if that works okay.

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: [RFC PATCH] sched/fair: Skip idle CPU search on busy system
  2023-07-26  9:36 [RFC PATCH] sched/fair: Skip idle CPU search on busy system Shrikanth Hegde
  2023-07-27  7:25 ` Chen Yu
  2023-08-09 18:44 ` Vishal Chourasia
@ 2023-10-04 16:25 ` Ingo Molnar
  2 siblings, 0 replies; 6+ messages in thread
From: Ingo Molnar @ 2023-10-04 16:25 UTC (permalink / raw)
  To: Shrikanth Hegde
  Cc: peterz, vincent.guittot, srikar, linux-kernel, mingo,
	dietmar.eggemann, mgorman


* Shrikanth Hegde <sshegde@linux.vnet.ibm.com> wrote:

> When the system is fully busy, there will not be any idle CPUs.
> In that case, load_balance will be called mainly with the CPU_NOT_IDLE
> type. should_we_balance currently checks for an idle CPU if one
> exists. When the system is 100% busy, there will not be an idle CPU,
> so these idle_cpu checks can be skipped. This avoids fetching those
> rq structures.
> 
> This is a minor optimization for a specific case of 100% utilization.
> 
> .....
> Coming to the current implementation: it is a very basic approach to
> the issue and may not be the best/perfect way to do this. It works only
> with CONFIG_NO_HZ_COMMON, since nohz.nr_cpus is, AFAIU, the only
> globally available counter that tracks idle CPUs. If there is other
> such information, we could use that instead. Note that nohz.nr_cpus is
> atomic, which might be costly too.
>
> An alternative would be to add a new attribute to sched_domain and
> update it per CPU in the idle entry/exit path. The advantage is that
> the check could be per env->sd instead of global. Slightly more
> complicated, but maybe better; there could be other advantages too,
> such as limiting the scan at wakeup.
> 
> Your feedback would really help. Does this optimization make sense?
> 
> Signed-off-by: Shrikanth Hegde <sshegde@linux.vnet.ibm.com>
> ---
>  kernel/sched/fair.c | 6 ++++++
>  1 file changed, 6 insertions(+)
> 
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 373ff5f55884..903d59b5290c 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -10713,6 +10713,12 @@ static int should_we_balance(struct lb_env *env)
>  		return 1;
>  	}
> 
> +#ifdef CONFIG_NO_HZ_COMMON
> +	/* If the system is fully busy, its better to skip the idle checks */
> +	if (env->idle == CPU_NOT_IDLE && atomic_read(&nohz.nr_cpus) == 0)
> +		return group_balance_cpu(sg) == env->dst_cpu;
> +#endif

Not a big fan of coupling NOHZ to a scheduler optimization in this fashion, 
and not a big fan of the nohz.nr_cpus global cacheline either.

I think it should be done unconditionally, via the scheduler topology tree:

 - We should probably slow-propagate "permanently busy" status of a CPU
   down the topology tree, i.e.:

     - mark a domain fully-busy with a delay & batching, probably driven
       by the busy-tick only,

     - while marking a domain idle instantly & propagating this up the
       domain tree only if necessary. The propagation can stop if it
       finds a non-busy domain, so usually it won't reach the root domain.

 - This approach ensures there's no real overhead problem in the domain 
   tree: think of hundreds of CPUs all accessing the nohz.nr_cpus global 
   variable... I bet it's a measurable problem already on large systems.

 - The "atomic_read(&nohz.nr_cpus) == 0" condition in your patch is simply
   the busy-flag checked at the root domain: a readonly global cacheline
   that never gets modified on a permanently busy system.
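The asymmetric propagation above can be sketched in a few lines of
userspace C (all names are hypothetical, not a kernel implementation):
going busy is lazy and per-level, while going idle clears flags up the
tree and stops at the first non-busy domain.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct domain {
	struct domain *parent;
	bool fully_busy;
};

/*
 * Idle entry propagates instantly: clear the busy flag and walk up,
 * stopping as soon as an ancestor is already non-busy, so the walk
 * usually does not reach the root domain.
 */
static void mark_idle(struct domain *d)
{
	for (; d && d->fully_busy; d = d->parent)
		d->fully_busy = false;
}

/*
 * Going busy is the slow path, marked with delay and batching (e.g.
 * driven by the busy tick); each level is marked on its own, so a
 * parent only ends up busy once all of its children have been.
 */
static void busy_tick(struct domain *d)
{
	d->fully_busy = true;
}
```

On a permanently busy system mark_idle() is never called, so the root's
fully_busy flag becomes exactly the never-modified, read-mostly
cacheline described above.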

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 6+ messages in thread

end of thread, other threads:[~2023-10-04 16:25 UTC | newest]

Thread overview: 6+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2023-07-26  9:36 [RFC PATCH] sched/fair: Skip idle CPU search on busy system Shrikanth Hegde
2023-07-27  7:25 ` Chen Yu
2023-07-27 15:04   ` Shrikanth Hegde
2023-08-09 18:44 ` Vishal Chourasia
2023-08-10 15:44   ` Shrikanth Hegde
2023-10-04 16:25 ` Ingo Molnar
