Re: [RFC 0/1] sched/fair: Consider asymmetric scheduler groups in load balancer

From: Tobias Huschle <huschle@linux.ibm.com>
To: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: juri.lelli@redhat.com, vschneid@redhat.com,
	vincent.guittot@linaro.org, srikar@linux.vnet.ibm.com,
	peterz@infradead.org, sshegde@linux.vnet.ibm.com,
	linuxppc-dev@lists.ozlabs.org, linux-kernel@vger.kernel.org,
	rostedt@goodmis.org, bsegall@google.com, mingo@redhat.com,
	mgorman@suse.de, bristot@redhat.com
Subject: Re: [RFC 0/1] sched/fair: Consider asymmetric scheduler groups in load balancer
Date: Tue, 04 Jul 2023 11:11:00 +0200
Message-ID: <4c28b46b59bcc083956757074d1fe059@linux.ibm.com>
In-Reply-To: <26fe6dc1-33c5-b825-c019-b346e8bedc0a@arm.com>

On 2023-05-16 18:35, Dietmar Eggemann wrote:
> On 15/05/2023 13:46, Tobias Huschle wrote:
>> The current load balancer implementation implies that scheduler groups,
>> within the same scheduler domain, all host the same number of CPUs.
>> 
>> This appears to be valid for non-s390 architectures. Nevertheless, s390
>> can actually have scheduler groups of unequal size.
> 
> Arm (classical) big.LITTLE had this for years before we switched to flat
> scheduling (only MC sched domain) over CPU capacity boundaries for Arm
> DynamIQ.
> 
> Arm64 Juno platform in mainline:
> 
> root@juno:~# cat /sys/devices/system/cpu/cpu*/topology/cluster_cpus_list
> 0,3-5
> 1-2
> 1-2
> 0,3-5
> 0,3-5
> 0,3-5
> 
> root@juno:~# cat /proc/schedstat | grep ^domain | awk '{print $1, $2}'
> 
> domain0 39 <--
> domain1 3f
> domain0 06 <--
> domain1 3f
> domain0 06
> domain1 3f
> domain0 39
> domain1 3f
> domain0 39
> domain1 3f
> domain0 39
> domain1 3f
> 
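As a reading aid for the masks above: the sketch below decodes such a hex
mask into CPU numbers, assuming the second schedstat field is a plain
hexadecimal CPU mask as shown. mask_to_cpus() is a hypothetical helper
written for this mail, not an existing tool.

```python
def mask_to_cpus(hex_mask):
    """Return the CPU numbers whose bits are set in a hex CPU mask string."""
    # Larger systems print comma-separated 32-bit chunks; drop the commas.
    value = int(hex_mask.replace(",", ""), 16)
    return [cpu for cpu in range(value.bit_length()) if (value >> cpu) & 1]


# The three masks from the Juno output above.
for mask in ("39", "06", "3f"):
    print(mask, mask_to_cpus(mask))
```

This gives CPUs 0,3-5 for 39 and CPUs 1-2 for 06, i.e. two MC groups of
unequal size below the DIE domain that spans all six CPUs (3f).
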
> root@juno:~# cat /sys/kernel/debug/sched/domains/cpu0/domain*/name
> MC
> DIE
> 
> But we don't have SMT on the mobile processors.
> 
> It looks like you are only interested in getting a group_weight dependency
> into this 'prefer_sibling' condition of find_busiest_group()?
> 
Sorry, it looks like your reply got caught by a filter in my mail program.
Let me answer, although it's a bit late.

Yes, I would like to get the group_weight into the prefer_sibling path.
Unfortunately, we cannot go for a flat hierarchy, as the s390 hardware
allows CPUs to be pretty far apart (cache-wise), which means the load
balancer should avoid moving tasks back and forth between those CPUs
if possible.

We can't remove SD_PREFER_SIBLING either, as this would cause the load
balancer to aim for the same number of idle CPUs in all groups,
which is also a problem with asymmetric groups, for example:

With SD_PREFER_SIBLING, aiming for the same number of non-idle CPUs
00 01 02 03 04 05 06 07 08 09 10 11  || 12 13 14 15
                 x     x     x     x      x  x  x  x

Without SD_PREFER_SIBLING, aiming for the same number of idle CPUs
00 01 02 03 04 05 06 07 08 09 10 11  || 12 13 14 15
     x  x  x     x  x     x     x  x


Hence the idea to add the group_weight to the prefer_sibling path.
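
To make the difference concrete, here is a toy model of the three
policies for the 12 + 4 CPU example above with 8 runnable tasks. It is
only a sketch of the intended outcome, not the logic in
kernel/sched/fair.c; the function names are made up for this mail, and
the third function is just one possible reading of "taking group_weight
into account".

```python
def spread_equal_busy(nr_tasks, weights):
    """Aim for the same number of busy CPUs in every group."""
    busy = [0] * len(weights)
    remaining = nr_tasks
    while remaining > 0:
        placed = False
        for g, weight in enumerate(weights):
            if remaining > 0 and busy[g] < weight:
                busy[g] += 1
                remaining -= 1
                placed = True
        if not placed:  # all groups are full
            break
    return busy


def spread_equal_idle(nr_tasks, weights):
    """Aim for the same number of idle CPUs in every group."""
    busy = [0] * len(weights)
    for _ in range(nr_tasks):
        g = max(range(len(weights)), key=lambda i: weights[i] - busy[i])
        if busy[g] == weights[g]:  # no idle CPU left anywhere
            break
        busy[g] += 1
    return busy


def spread_by_weight(nr_tasks, weights):
    """Aim for the same busy/weight ratio in every group."""
    busy = [0] * len(weights)
    for _ in range(nr_tasks):
        free = [i for i in range(len(weights)) if busy[i] < weights[i]]
        if not free:
            break
        g = min(free, key=lambda i: busy[i] / weights[i])
        busy[g] += 1
    return busy


weights = [12, 4]  # group sizes from the example above
print(spread_equal_busy(8, weights))  # [4, 4]
print(spread_equal_idle(8, weights))  # [8, 0]
print(spread_by_weight(8, weights))   # [6, 2]
```

The first two results match the two pictures above. The third is what I
would like to see: busy CPUs in proportion to the group size.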

I was wondering if this would be the right place to address this issue
or if I should go down another route.

> We in (classical) big.LITTLE (sd flag SD_ASYM_CPUCAPACITY) remove
> SD_PREFER_SIBLING from sd->child so we don't run this condition.
> 
>> The current scheduler behavior causes some s390 configs to use SMT
>> while some cores are still idle, leading to a performance degradation
>> under certain levels of workload.
>> 
>> Please refer to the patch's commit message for more details and an
>> example. This patch is a proposal on how to integrate the size of
>> scheduler groups into the decision process.
>> 
>> This patch is the most basic approach to address this issue and does
>> not claim to be perfect as-is.
>> 
>> Other ideas that also proved to address the problem, and that are more
>> complex but potentially more precise:
>>   1. On scheduler group building, count the number of CPUs within each
>>      group that are first in their sibling mask. This represents the
>>      number of CPUs that can be used before running into SMT. This
>>      should be slightly more accurate than using the full group weight
>>      if the number of available SMT threads per core varies.
>>   2. Introduce a new scheduler group classification (smt_busy) in
>>      between fully_busy and has_spare. This classification would
>>      indicate that a group still has spare capacity, but will run
>>      into SMT when using that capacity. This would make the load
>>      balancer prefer groups with fully idle CPUs over ones that are
>>      about to run into SMT.
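
Idea 1 can be sketched as follows. This is a toy model and not kernel
code: the topology is a made-up example, count_primary_cpus() is a
hypothetical name, and "first in their sibling mask" is read here as
the lowest-numbered thread of a core.

```python
def count_primary_cpus(group, siblings):
    """Count the CPUs in `group` that are the first thread of their core.

    `siblings` maps each CPU to the set of CPUs sharing its core,
    including the CPU itself.
    """
    return sum(1 for cpu in group if cpu == min(siblings[cpu]))


# Made-up topology: three SMT2 cores plus one single-threaded core.
siblings = {
    0: {0, 1}, 1: {0, 1},
    2: {2, 3}, 3: {2, 3},
    4: {4, 5}, 5: {4, 5},
    6: {6},
}
group = [0, 1, 2, 3, 4, 5, 6]
print(len(group), count_primary_cpus(group, siblings))  # 7 4
```

Here the full group weight is 7, but only 4 tasks can be placed before
the group runs into SMT, which is the number the balancer would compare.
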
>> 
>> Feedback would be greatly appreciated.
>> 
>> Tobias Huschle (1):
>>   sched/fair: Consider asymmetric scheduler groups in load balancer
>> 
>>  kernel/sched/fair.c | 3 ++-
>>  1 file changed, 2 insertions(+), 1 deletion(-)
>>
