From: Tobias Huschle <huschle@linux.ibm.com>
To: Shrikanth Hegde <sshegde@linux.vnet.ibm.com>
Cc: mingo@redhat.com, peterz@infradead.org, juri.lelli@redhat.com,
vincent.guittot@linaro.org, dietmar.eggemann@arm.com,
rostedt@goodmis.org, bsegall@google.com, mgorman@suse.de,
bristot@redhat.com, vschneid@redhat.com,
srikar@linux.vnet.ibm.com, linuxppc-dev@lists.ozlabs.org,
linux-kernel@vger.kernel.org
Subject: Re: [RFC 1/1] sched/fair: Consider asymmetric scheduler groups in load balancer
Date: Fri, 07 Jul 2023 09:45:15 +0200 [thread overview]
Message-ID: <8a29c5d6b0af0321316a88b924b8322f@linux.ibm.com> (raw)
In-Reply-To: <de856969-5d78-f771-96ff-4afce3a6e776@linux.vnet.ibm.com>
On 2023-07-06 19:19, Shrikanth Hegde wrote:
> On 5/15/23 5:16 PM, Tobias Huschle wrote:
>> The current load balancer implementation assumes that all scheduler
>> groups within the same domain host the same number of CPUs. This is
>> reflected in the condition that a scheduler group which is load
>> balancing and classified as having spare capacity should pull work
>> from the busiest group if the local group runs fewer processes than
>> the busiest one. This implies that these two groups should run the
>> same number of processes, which is problematic if the groups are not
>> of the same size.
>>
>> The assumption that scheduler groups within the same scheduler domain
>> host the same number of CPUs appears to be true for non-s390
>> architectures. Nevertheless, s390 can have scheduler groups of unequal
>> size.
>>
>> This introduces a performance degradation in the following scenario:
>>
>> Consider a system with 8 CPUs, 6 CPUs are located on one CPU socket,
>> the remaining 2 are located on another socket:
>>
>> Socket   -----1-----    -2-
>> CPU      1 2 3 4 5 6    7 8
>>
>> Placing some workload ( x = one task ) yields the following
>> scenarios:
>>
>> The first 5 tasks are distributed evenly across the two groups.
>>
>> Socket   -----1-----    -2-
>> CPU      1 2 3 4 5 6    7 8
>>          x x x          x x
>>
>> Adding a 6th task yields the following distribution:
>>
>> Socket   -----1-----    -2-
>> CPU      1 2 3 4 5 6    7 8
>> SMT1     x x x          x x
>> SMT2                    x
>>
>> The task is added to the 2nd scheduler group, as the scheduler
>> assumes that scheduler groups are of the same size and should
>> therefore also host the same number of tasks. This makes CPU 7 run on
>> its second SMT thread, which comes with a performance penalty. In the
>> window of 6-8 tasks, load balancing is therefore done suboptimally:
>> SMT is used although there is no reason to do so, as fully idle CPUs
>> are still available.
>>
>> Taking the weight of the scheduler groups into account ensures that
>> a load balancing CPU within a smaller group will not try to pull tasks
>> from a bigger group while the bigger group still has idle CPUs
>> available.
>>
>> Signed-off-by: Tobias Huschle <huschle@linux.ibm.com>
>> ---
>> kernel/sched/fair.c | 3 ++-
>> 1 file changed, 2 insertions(+), 1 deletion(-)
>>
>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>> index 48b6f0ca13ac..b1307d7e4065 100644
>> --- a/kernel/sched/fair.c
>> +++ b/kernel/sched/fair.c
>> @@ -10426,7 +10426,8 @@ static struct sched_group *find_busiest_group(struct lb_env *env)
>>  	 * group's child domain.
>>  	 */
>>  	if (sds.prefer_sibling && local->group_type == group_has_spare &&
>> -	    busiest->sum_nr_running > local->sum_nr_running + 1)
>> +	    busiest->sum_nr_running * local->group_weight >
>> +	    local->sum_nr_running * busiest->group_weight + 1)
>>  		goto force_balance;
>>
>
>
> I assume it's SMT2 here. Each sched group is denoted as
> [busy_cpus/idle_cpus/weight].
>
> Let's take an example; we will begin with the said imbalance:
> [3/9/12] -- local group
> [3/1/4]  -- busy group
> We will evaluate 3*12 > (3*4 + 1), which is true, and proceed further
> to balance. But won't calculate_imbalance then compute zero as the
> imbalance? In the prefer sibling case it takes the difference of the
> sum_nr_running of the two groups, which will be 3-3 = 0.
>
> This would need modifications to calculate_imbalance.
> Maybe something like this will help? NOT TESTED AT ALL.
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index a80a73909dc2..e027d4086edc 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -10296,7 +10296,9 @@ static inline void calculate_imbalance(struct lb_env *env, struct sd_lb_stats *s
>  		return;
>  	}
>
> -	if (busiest->group_weight == 1 || sds->prefer_sibling) {
> +	if (busiest->group_weight == 1 ||
> +	    (sds->prefer_sibling &&
> +	     busiest->group_weight == local->group_weight)) {
>  		unsigned int nr_diff = busiest->sum_nr_running;
>  		/*
>  		 * When prefer sibling, evenly spread running tasks on
> @@ -10305,6 +10307,11 @@ static inline void calculate_imbalance(struct lb_env *env, struct sd_lb_stats *s
>  		env->migration_type = migrate_task;
>  		lsub_positive(&nr_diff, local->sum_nr_running);
>  		env->imbalance = nr_diff;
> +	}
> +	if (sds->prefer_sibling &&
> +	    busiest->group_weight != local->group_weight) {
> +		env->migration_type = migrate_task;
> +		env->imbalance = 1;
>  	} else {
I also had a look at this when Vincent pointed out that this part was
missing. The formula proposed by Vincent works pretty well; it is not
even necessary to add additional if-statements here.
>
> ---------------------------------------------------------------------------------------------------
> On a tangential dimension.
>
>
> Since prefer_siblings makes the inherent assumption that all groups
> have equal weight, it will cause complications when group weights are
> different. I think that becomes very tricky when there are more than
> two groups. Depending on which CPU is doing the load_balance, there
> can be unnecessary migrations.
>
>
> Isn't there currently a similar case even in NUMA? There will be an
> unequal number of CPUs, right? If we fix that case and remove
> prefer_siblings in your arch, will that work?
>
> Maybe something like this? NOT TESTED AT ALL.
>
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index a80a73909dc2..fc6377f48ced 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -10514,6 +10514,7 @@ static struct sched_group *find_busiest_group(struct lb_env *env)
>  		goto out_balanced;
>
>  	if (busiest->group_weight > 1 &&
> +	    busiest->group_weight == local->group_weight &&
>  	    local->idle_cpus <= (busiest->idle_cpus + 1))
>  		/*
>  		 * If the busiest group is not overloaded
> @@ -10526,6 +10527,11 @@ static struct sched_group *find_busiest_group(struct lb_env *env)
>  		 */
>  		goto out_balanced;
>
> +	if ((busiest->group_weight != local->group_weight) &&
> +	    (local->idle_cpus * busiest->group_weight <=
> +	     local->group_weight * (busiest->idle_cpus + 1)))
> +		goto out_balanced;
> +
>  	if (busiest->sum_h_nr_running == 1)
>  		/*
>  		 * busiest doesn't have any tasks waiting to run
>
>
>
>
>> if (busiest->group_type != group_overloaded) {
I played around with alternate solutions as well; yours could be
interesting in order to essentially declare the problematic state as
balanced. I wouldn't be opposed to removing prefer_siblings. The
question would be whether it is necessary to address both scenarios
anyway.
Thread overview: 16+ messages
2023-05-15 11:46 [RFC 0/1] sched/fair: Consider asymmetric scheduler groups in load balancer Tobias Huschle
2023-05-15 11:46 ` [RFC 1/1] " Tobias Huschle
2023-05-16 13:36 ` Vincent Guittot
2023-06-05 8:07 ` Tobias Huschle
2023-07-05 7:52 ` Vincent Guittot
2023-07-07 7:44 ` Tobias Huschle
2023-07-07 14:33 ` Shrikanth Hegde
2023-07-07 15:59 ` Tobias Huschle
2023-07-07 16:26 ` Shrikanth Hegde
2023-07-04 13:40 ` Peter Zijlstra
2023-07-07 7:44 ` Tobias Huschle
2023-07-06 17:19 ` Shrikanth Hegde
2023-07-07 7:45 ` Tobias Huschle [this message]
2023-05-16 16:35 ` [RFC 0/1] " Dietmar Eggemann
2023-07-04 9:11 ` Tobias Huschle
2023-07-06 11:11 ` Dietmar Eggemann