From: Tobias Huschle <huschle@linux.ibm.com>
To: Vincent Guittot <vincent.guittot@linaro.org>
Cc: juri.lelli@redhat.com, vschneid@redhat.com,
	srikar@linux.vnet.ibm.com, peterz@infradead.org,
	sshegde@linux.vnet.ibm.com, linuxppc-dev@lists.ozlabs.org,
	linux-kernel@vger.kernel.org, rostedt@goodmis.org,
	bsegall@google.com, mingo@redhat.com, mgorman@suse.de,
	bristot@redhat.com, dietmar.eggemann@arm.com
Subject: Re: [RFC 1/1] sched/fair: Consider asymmetric scheduler groups in load balancer
Date: Fri, 07 Jul 2023 09:44:24 +0200	[thread overview]
Message-ID: <12befe210503469beac9e711d9782675@linux.ibm.com> (raw)
In-Reply-To: <ZKUhPO3xcvCfjWfe@vingu-book>

On 2023-07-05 09:52, Vincent Guittot wrote:
> Le lundi 05 juin 2023 à 10:07:16 (+0200), Tobias Huschle a écrit :
>> On 2023-05-16 15:36, Vincent Guittot wrote:
>> > On Mon, 15 May 2023 at 13:46, Tobias Huschle <huschle@linux.ibm.com>
>> > wrote:
>> > >
>> > > The current load balancer implementation assumes that scheduler
>> > > groups within the same domain all host the same number of CPUs.
>> > > This is reflected in the condition that a scheduler group which is
>> > > load balancing and classified as having spare capacity should pull
>> > > work from the busiest group if the local group runs fewer processes
>> > > than the busiest one. This implies that these two groups should run
>> > > the same number of processes, which is problematic if the groups
>> > > are not of the same size.
>> > >
>> > > The assumption that scheduler groups within the same scheduler domain
>> > > host the same number of CPUs appears to be true for non-s390
>> > > architectures. Nevertheless, s390 can have scheduler groups of unequal
>> > > size.
>> > >
>> > > This introduces a performance degradation in the following scenario:
>> > >
>> > > Consider a system with 8 CPUs, where 6 CPUs are located on one socket
>> > > and the remaining 2 are located on another socket:
>> > >
>> > > Socket   -----1-----    -2-
>> > > CPU      1 2 3 4 5 6    7 8
>> > >
>> > > Placing some workload ( x = one task ) yields the following
>> > > scenarios:
>> > >
>> > > The first 5 tasks are distributed evenly across the two groups.
>> > >
>> > > Socket   -----1-----    -2-
>> > > CPU      1 2 3 4 5 6    7 8
>> > >          x x x          x x
>> > >
>> > > Adding a 6th task yields the following distribution:
>> > >
>> > > Socket   -----1-----    -2-
>> > > CPU      1 2 3 4 5 6    7 8
>> > > SMT1     x x x          x x
>> > > SMT2                    x
>> >
>> > Your description is a bit confusing for me. What you name CPU above
>> > should be named core, shouldn't it?
>> >
>> > Could you share with us your scheduler topology ?
>> >
>> 
>> You are correct, it should say core instead of CPU.
>> 
>> One actual configuration from one of my test machines (lscpu -e):
>> 
> 
> [...]
> 
>> 
>> So, 6 cores / 12 CPUs in one group and 2 cores / 4 CPUs in the other.
> 
> Thanks for the details.
> 
>> 
>> If I run stress-ng with 8 CPU stressors on the original code, I get a
>> distribution like this:
>> 
>> 00 01 02 03 04 05 06 07 08 09 10 11  || 12 13 14 15
>>                 x     x     x     x      x  x  x  x
>> 
>> This means that the two cores in the smaller group are running into SMT
>> while two cores in the larger group are still idle. This is caused by
>> the prefer_sibling path, which really wants to see both groups run the
>> same number of tasks.
> 
> Yes, and it assumes that there is the same number of CPUs per group
> 
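(In case someone wants to reproduce the placement: something like
"stress-ng --cpu 8 -t 60" and checking which CPUs the stressors end up on,
e.g. with "ps -eLo pid,psr,comm", should show the same kind of pattern.)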
>> 
>> > >
>> > > The task is added to the 2nd scheduler group, as the scheduler has the
>> > > assumption that scheduler groups are of the same size, so they should
>> > > also host the same number of tasks. This makes CPU 7 run into SMT,
>> > > which comes with a performance penalty. This means that, in the
>> > > window of 6-8 tasks, load balancing is done suboptimally, because
>> > > SMT is used although there is no reason to do so, as fully idle CPUs
>> > > are still available.
>> > >
>> > > Taking the weight of the scheduler groups into account ensures that
>> > > a load balancing CPU within a smaller group will not try to pull tasks
>> > > from a bigger group while the bigger group still has idle CPUs
>> > > available.
>> > >
>> > > Signed-off-by: Tobias Huschle <huschle@linux.ibm.com>
>> > > ---
>> > >  kernel/sched/fair.c | 3 ++-
>> > >  1 file changed, 2 insertions(+), 1 deletion(-)
>> > >
>> > > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>> > > index 48b6f0ca13ac..b1307d7e4065 100644
>> > > --- a/kernel/sched/fair.c
>> > > +++ b/kernel/sched/fair.c
>> > > @@ -10426,7 +10426,8 @@ static struct sched_group *find_busiest_group(struct lb_env *env)
>> > >          * group's child domain.
>> > >          */
>> > >         if (sds.prefer_sibling && local->group_type == group_has_spare &&
>> > > -           busiest->sum_nr_running > local->sum_nr_running + 1)
>> > > +           busiest->sum_nr_running * local->group_weight >
>> > > +                       local->sum_nr_running * busiest->group_weight + 1)
> 
> 
> What you want to test here is that moving 1 task from the busiest to the
> local group would help and balance the ratio of tasks per CPU:
> 
> (busiest->sum_nr_running - 1) / busiest->group_weight >=
> (local->sum_nr_running + 1) / local->group_weight
> 
> which can be developed into
> 
> busiest->sum_nr_running * local->group_weight >= local->sum_nr_running
> * busiest->group_weight + busiest->group_weight + local->group_weight
> 
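Just to double-check this with the topology from above (local group = 4 CPUs,
busiest group = 12 CPUs): with 1 task in the small group and 5 tasks in the
big one, 5 * 4 = 20 < 1 * 12 + 12 + 4 = 28, so the small group would not try
to pull; with 9 tasks in the big group, 9 * 4 = 36 >= 28, so it would. That
looks like the behavior I am after.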
> and you also have to change how we calculate the imbalance, which
> currently just provides half of the diff of nr_running,
> 
> by something like
> 
> ((busiest->sum_nr_running * local->group_weight) -
>  (local->sum_nr_running * busiest->group_weight)) /
> (busiest->group_weight + local->group_weight)
> 
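Taking the 9 vs. 1 tasks example again, this gives (9 * 4 - 1 * 12) / (12 + 4)
= 24 / 16 = 1 in integer math, i.e. move one task, which leaves the groups at
8/12 and 2/4 tasks per CPU respectively.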

Ah right, I had a look at the imbalance part now and your suggestion works
pretty well. I just had to make some minor adjustments so far.
A nice side effect is that this allows the load balancer to behave exactly
the same as before in the cases where local->group_weight ==
busiest->group_weight. The behavior only changes when the groups are not
of equal size.

I will figure out a solution and send a patch soon which incorporates these
adjustments plus a more detailed description.
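
For the record, the direction I am currently testing looks roughly like the
sketch below (untested and simplified; the first hunk would go into
find_busiest_group(), the second into the prefer_sibling handling in
calculate_imbalance(), and the final patch may well look different):

        /* Compare tasks per CPU instead of raw task counts. */
        if (sds.prefer_sibling && local->group_type == group_has_spare &&
            busiest->sum_nr_running * local->group_weight >=
                        local->sum_nr_running * busiest->group_weight +
                        busiest->group_weight + local->group_weight)
                goto force_balance;

        /* Number of tasks to move to even out the tasks-per-CPU ratio. */
        env->imbalance = (busiest->sum_nr_running * local->group_weight -
                          local->sum_nr_running * busiest->group_weight) /
                         (busiest->group_weight + local->group_weight);

With local->group_weight == busiest->group_weight both expressions collapse
to the current behavior (B > L + 1 and half of the diff of nr_running), which
is the nice side effect mentioned above.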

Thanks for the feedback.

>> >
>> > This is the prefer_sibling path. Could it be that you should disable
>> > prefer_sibling between your sockets for such a topology? The default
>> > path compares the number of idle CPUs when groups have spare capacity.
>> >
>> >
>> 
>> If I disable prefer_sibling (I played around with it a bit), I run into
>> the problem that the tasks are distributed such that each group has the
>> same number of idle CPUs, which yields distributions similar to this:
>> 
>> 00 01 02 03 04 05 06 07 08 09 10 11  || 12 13 14 15
>>     x  x  x     x  x     x     x  x
>> 
>> Now both groups have 4 idle CPUs, which fulfills the criterion imposed by
>> the load balancer, but the larger group is now running SMT while the
>> smaller one is just idle.
>> 
>> So, in this asymmetric setup, neither criterion works in an optimal way.
>> Going for the same number of idle CPUs, or alternatively for the same
>> number of running processes, causes a sub-optimal distribution of tasks,
>> leading to unnecessary SMT.
> 
> there is the same behavior and assumption here too
> 
> 
>> 
>> It also seems possible to address the regular load balancing path by
>> aiming for the same unused capacity between groups instead of the same
>> number of idle CPUs. This seems to have been considered in the past, but
>> the choice went in favor of the number of idle CPUs.
> 
> Unused capacity doesn't give the instantaneous state, so a group can be
> idle but without unused capacity.
> 
>> Since this decision was actively taken already, I focused on the
>> prefer_sibling path.
>> 
>> The question now would be how to address this properly (or whether I'm
>> missing something here). As mentioned in the cover letter, this was the
>> most simplistic and least invasive approach I could find; others might
>> be more sophisticated but also have some side effects.
>> 
>> I have a bit of a hard time leaving this one as-is, as it just introduces
>> two additional multiplications with no effect for most architectures.
>> Maybe an architecture-specific inline function that the compiler can
>> optimize away if not needed?
>> 
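To make that last idea a bit more concrete, below is a rough sketch of what
I had in mind (the config symbol and helper name are made up, and it uses the
simple condition from this RFC just for illustration):

#ifdef CONFIG_ARCH_HAS_ASYM_SCHED_GROUPS        /* invented symbol */
static inline bool prefer_sibling_imbalanced(struct sg_lb_stats *busiest,
                                             struct sg_lb_stats *local)
{
        /* Asymmetric groups: weigh the task counts by the group sizes. */
        return busiest->sum_nr_running * local->group_weight >
               local->sum_nr_running * busiest->group_weight + 1;
}
#else
static inline bool prefer_sibling_imbalanced(struct sg_lb_stats *busiest,
                                             struct sg_lb_stats *local)
{
        /* Equal-sized groups: the original, multiplication-free check. */
        return busiest->sum_nr_running > local->sum_nr_running + 1;
}
#endif

find_busiest_group() would then call prefer_sibling_imbalanced(busiest, local)
instead of open-coding the comparison, so architectures with equal-sized
groups would not pay for the extra multiplications.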
>> > >                 goto force_balance;
>> > >
>> > >         if (busiest->group_type != group_overloaded) {
>> > > --
>> > > 2.34.1
>> > >
>> 
