From: Tobias Huschle <huschle@linux.ibm.com>
To: Shrikanth Hegde <sshegde@linux.vnet.ibm.com>
Cc: mingo@redhat.com, peterz@infradead.org, juri.lelli@redhat.com,
vincent.guittot@linaro.org, dietmar.eggemann@arm.com,
rostedt@goodmis.org, bsegall@google.com, mgorman@suse.de,
bristot@redhat.com, vschneid@redhat.com,
srikar@linux.vnet.ibm.com, linuxppc-dev@lists.ozlabs.org,
linux-kernel@vger.kernel.org
Subject: Re: [RFC 1/1] sched/fair: Consider asymmetric scheduler groups in load balancer
Date: Fri, 07 Jul 2023 09:45:15 +0200 [thread overview]
Message-ID: <8a29c5d6b0af0321316a88b924b8322f@linux.ibm.com> (raw)
In-Reply-To: <de856969-5d78-f771-96ff-4afce3a6e776@linux.vnet.ibm.com>
On 2023-07-06 19:19, Shrikanth Hegde wrote:
> On 5/15/23 5:16 PM, Tobias Huschle wrote:
>> The current load balancer implementation assumes that all scheduler
>> groups within the same domain host the same number of CPUs. This is
>> reflected in the condition that a scheduler group which is load
>> balancing and classified as having spare capacity should pull work
>> from the busiest group if the local group runs fewer processes than
>> the busiest one. This implies that these two groups should run the
>> same number of processes, which is problematic if the groups are not
>> of the same size.
>>
>> The assumption that scheduler groups within the same scheduler domain
>> host the same number of CPUs appears to be true for non-s390
>> architectures. Nevertheless, s390 can have scheduler groups of unequal
>> size.
>>
>> This introduces a performance degradation in the following scenario:
>>
>> Consider a system with 8 CPUs, 6 CPUs are located on one CPU socket,
>> the remaining 2 are located on another socket:
>>
>> Socket   -----1-----    -2-
>> CPU      1 2 3 4 5 6    7 8
>>
>> Placing some workload ( x = one task ) yields the following
>> scenarios:
>>
>> The first 5 tasks are distributed evenly across the two groups.
>>
>> Socket   -----1-----    -2-
>> CPU      1 2 3 4 5 6    7 8
>>          x x x          x x
>>
>> Adding a 6th task yields the following distribution:
>>
>> Socket   -----1-----    -2-
>> CPU      1 2 3 4 5 6    7 8
>> SMT1     x x x          x x
>> SMT2                    x
>>
>> The task is added to the 2nd scheduler group, as the scheduler
>> assumes that scheduler groups are of the same size and should
>> therefore also host the same number of tasks. This makes CPU 7 run on
>> its second SMT thread, which comes with a performance penalty. In the
>> window of 6-8 tasks, load balancing is therefore done suboptimally:
>> SMT is used although there is no reason to do so, as fully idle CPUs
>> are still available.
>>
>> Taking the weight of the scheduler groups into account ensures that
>> a load balancing CPU within a smaller group will not try to pull tasks
>> from a bigger group while the bigger group still has idle CPUs
>> available.
>>
>> Signed-off-by: Tobias Huschle <huschle@linux.ibm.com>
>> ---
>> kernel/sched/fair.c | 3 ++-
>> 1 file changed, 2 insertions(+), 1 deletion(-)
>>
>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>> index 48b6f0ca13ac..b1307d7e4065 100644
>> --- a/kernel/sched/fair.c
>> +++ b/kernel/sched/fair.c
>> @@ -10426,7 +10426,8 @@ static struct sched_group *find_busiest_group(struct lb_env *env)
>>  	 * group's child domain.
>>  	 */
>>  	if (sds.prefer_sibling && local->group_type == group_has_spare &&
>> -	    busiest->sum_nr_running > local->sum_nr_running + 1)
>> +	    busiest->sum_nr_running * local->group_weight >
>> +	    local->sum_nr_running * busiest->group_weight + 1)
>>  		goto force_balance;
>>
>
>
> I assume it's SMT2 here. Each sched group is denoted as
> [busy_cpus/idle_cpus/weight].
>
> Let's take an example; we will begin with the said imbalance:
> [3/9/12] -- local group
> [3/1/4]  -- busy group
> We will evaluate 3*12 > (3*4 + 1), which is true, and proceed further
> to balance. But won't calculate_imbalance then compute zero as the
> imbalance? In the prefer sibling case it takes the difference of the
> sum_nr_running of the two groups, which will be 3-3 = 0.
>
> This would need modifications to calculate_imbalance.
> Maybe something like this will help? NOT TESTED AT ALL.
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index a80a73909dc2..e027d4086edc 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -10296,7 +10296,9 @@ static inline void calculate_imbalance(struct lb_env *env, struct sd_lb_stats *s
>  		return;
>  	}
>
> -	if (busiest->group_weight == 1 || sds->prefer_sibling) {
> +	if (busiest->group_weight == 1 ||
> +	    (sds->prefer_sibling &&
> +	     busiest->group_weight == local->group_weight)) {
>  		unsigned int nr_diff = busiest->sum_nr_running;
>  		/*
>  		 * When prefer sibling, evenly spread running tasks on
> @@ -10305,6 +10307,11 @@ static inline void calculate_imbalance(struct lb_env *env, struct sd_lb_stats *s
>  		env->migration_type = migrate_task;
>  		lsub_positive(&nr_diff, local->sum_nr_running);
>  		env->imbalance = nr_diff;
> +	}
> +	if (sds->prefer_sibling &&
> +	    busiest->group_weight != local->group_weight) {
> +		env->migration_type = migrate_task;
> +		env->imbalance = 1;
>  	} else {
I also had a look at this when Vincent pointed out that this part was
missing. The formula proposed by Vincent works pretty well; it is not
even necessary to add additional if-statements here.
>
> ---------------------------------------------------------------------------------------------------
> On a tangential dimension.
>
>
> Since prefer_siblings makes the inherent assumption that all groups
> have equal weight, it will cause complications when group weights are
> different. I think that becomes very tricky when there are more than
> two groups. Depending on which CPU is doing the load_balance, there
> can be unnecessary migrations.
>
>
> Isn't there currently a similar case even in NUMA? There will be an
> unequal number of CPUs, right? If we fix that case and remove
> prefer_siblings in your arch, will that work?
>
> Maybe something like this? NOT TESTED AT ALL.
>
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index a80a73909dc2..fc6377f48ced 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -10514,6 +10514,7 @@ static struct sched_group *find_busiest_group(struct lb_env *env)
>  		goto out_balanced;
>
>  	if (busiest->group_weight > 1 &&
> +	    busiest->group_weight == local->group_weight &&
>  	    local->idle_cpus <= (busiest->idle_cpus + 1))
>  		/*
>  		 * If the busiest group is not overloaded
> @@ -10526,6 +10527,11 @@ static struct sched_group *find_busiest_group(struct lb_env *env)
>  		 */
>  		goto out_balanced;
>
> +	if ((busiest->group_weight != local->group_weight) &&
> +	    (local->idle_cpus * busiest->group_weight <=
> +	     local->group_weight * (busiest->idle_cpus + 1)))
> +		goto out_balanced;
> +
>  	if (busiest->sum_h_nr_running == 1)
>  		/*
>  		 * busiest doesn't have any tasks waiting to run
>
>
>
>
>> if (busiest->group_type != group_overloaded) {
I played around with alternate solutions as well; yours could be
interesting in order to essentially declare the problematic state as
balanced. I wouldn't be opposed to removing prefer_siblings. The
question would be whether it is necessary to address both scenarios
anyway.
Thread overview: 16+ messages
2023-05-15 11:46 [RFC 0/1] sched/fair: Consider asymmetric scheduler groups in load balancer Tobias Huschle
2023-05-15 11:46 ` [RFC 1/1] " Tobias Huschle
2023-05-16 13:36 ` Vincent Guittot
2023-06-05 8:07 ` Tobias Huschle
2023-07-05 7:52 ` Vincent Guittot
2023-07-07 7:44 ` Tobias Huschle
2023-07-07 14:33 ` Shrikanth Hegde
2023-07-07 15:59 ` Tobias Huschle
2023-07-07 16:26 ` Shrikanth Hegde
2023-07-04 13:40 ` Peter Zijlstra
2023-07-07 7:44 ` Tobias Huschle
2023-07-06 17:19 ` Shrikanth Hegde
2023-07-07 7:45 ` Tobias Huschle [this message]
2023-05-16 16:35 ` [RFC 0/1] " Dietmar Eggemann
2023-07-04 9:11 ` Tobias Huschle
2023-07-06 11:11 ` Dietmar Eggemann