[RFC 0/6] rework sched_domain topology description

From: dietmar.eggemann@arm.com (Dietmar Eggemann)
To: linux-arm-kernel@lists.infradead.org
Subject: [RFC 0/6] rework sched_domain topology description
Date: Thu, 13 Mar 2014 14:07:43 +0000
Message-ID: <5321BBAF.8050108@arm.com>
In-Reply-To: <CAKfTPtC3Y7s=n205MND24tLrsT+-3+JhWHebxcBkwAuaMngDhA@mail.gmail.com>

On 12/03/14 13:47, Vincent Guittot wrote:
> On 12 March 2014 14:28, Dietmar Eggemann <dietmar.eggemann@arm.com> wrote:
>> On 11/03/14 13:17, Peter Zijlstra wrote:
>>> On Sat, Mar 08, 2014 at 12:40:58PM +0000, Dietmar Eggemann wrote:
>>>>>
>>>>> I don't have a strong opinion about whether or not to use a cpu
>>>>> argument for setting the flags of a level (it was part of the initial
>>>>> proposal before we started to completely rework the build of
>>>>> sched_domain). Nevertheless, I see one potential concern: you can have
>>>>> completely different flag configurations at the same sd level for 2 cpus.
>>>>
>>>> Could you elaborate a little bit further regarding the last sentence? Do you
>>>> think that such completely different flag configurations would make it
>>>> impossible for the load-balance code to work at all at this sd?
>>>
>>> So a problem with such an interface is that it makes it far too easy to
>>> generate completely broken domains.
>>
>> I see the point. What I'm still struggling with is to understand why
>> this interface is worse than the one where we set up additional,
>> adjacent sd levels with new cpu_foo_mask functions plus different static
>> sd-flags configurations and rely on the sd degenerate functionality in
>> the core scheduler to fold these levels together to achieve different
>> per-cpu sd flags configurations.
> 
> The main difference is that all CPUs have the same levels in the
> initial state, and the degenerate sequence can then decide whether it's
> worth removing a level and whether doing so would create unusable domains.
>

Agreed. But what I'm trying to say is that the approach of multiple
adjacent sd levels with different cpu_mask(int cpu) functions and static
sd topology flags does not spare us from having to enforce sane sd
topology flag set-ups somewhere inside the core scheduler.

It is just as easy to introduce erroneous sd topology flag set-ups with
this approach.

As an example on the ARM TC2 platform, I changed
cpu_corepower_mask(int cpu) [arch/arm/kernel/topology.c] to simulate
that the cores in socket 1 (3 Cortex-A7) can power-gate individually
whereas those in socket 0 (2 Cortex-A15) can't:

 const struct cpumask *cpu_corepower_mask(int cpu)
 {
-       return &cpu_topology[cpu].thread_sibling;
+       return cpu_topology[cpu].socket_id ?
+                       &cpu_topology[cpu].thread_sibling :
+                       &cpu_topology[cpu].core_sibling;
 }


With this I get the following cpu mask configuration:

dmesg snippet (w/ additional debug in cpu_coregroup_mask(),
cpu_corepower_mask()):

...
CPU0: cpu_corepower_mask=0-1
CPU0: cpu_coregroup_mask=0-1
CPU1: cpu_corepower_mask=0-1
CPU1: cpu_coregroup_mask=0-1
CPU2: cpu_corepower_mask=2
CPU2: cpu_coregroup_mask=2-4
CPU3: cpu_corepower_mask=3
CPU3: cpu_coregroup_mask=2-4
CPU4: cpu_corepower_mask=4
CPU4: cpu_coregroup_mask=2-4
...
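For anyone who wants to reproduce the mask layout without a TC2 board, here
is a minimal user-space sketch of the modified function. It is not the
kernel code: struct cpu_topo, the plain unsigned int bitmaps and the table
values are my assumptions, chosen so that the result matches the dmesg
output above (the thread_sibling entries for socket 0 in particular are
assumed, since they are never printed).

```c
#include <stdio.h>

#define NR_CPUS 5

/* Hypothetical stand-in for the kernel's cpu_topology[] entries.
 * Masks are plain bitmaps here: bit n set means CPU n is in the mask. */
struct cpu_topo {
	int socket_id;
	unsigned int thread_sibling;
	unsigned int core_sibling;
};

/* Assumed TC2 layout: socket 0 = CPU0-1 (A15), socket 1 = CPU2-4 (A7). */
static const struct cpu_topo cpu_topology[NR_CPUS] = {
	{ 0, 1u << 0, 0x03 },
	{ 0, 1u << 1, 0x03 },
	{ 1, 1u << 2, 0x1c },
	{ 1, 1u << 3, 0x1c },
	{ 1, 1u << 4, 0x1c },
};

static unsigned int cpu_coregroup_mask(int cpu)
{
	return cpu_topology[cpu].core_sibling;
}

/* Same selection as the modified function in the diff: per-core mask in
 * socket 1, whole-cluster mask in socket 0. */
static unsigned int cpu_corepower_mask(int cpu)
{
	return cpu_topology[cpu].socket_id ?
			cpu_topology[cpu].thread_sibling :
			cpu_topology[cpu].core_sibling;
}

/* Print both masks for every CPU as hex bitmaps. */
static void print_masks(void)
{
	for (int cpu = 0; cpu < NR_CPUS; cpu++)
		printf("CPU%d: cpu_corepower_mask=0x%02x cpu_coregroup_mask=0x%02x\n",
		       cpu, cpu_corepower_mask(cpu), cpu_coregroup_mask(cpu));
}
```

Calling print_masks() from a main() prints the masks as bitmaps instead of
CPU lists, e.g. 0x03 for CPUs 0-1 and 0x1c for CPUs 2-4.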

And I deliberately introduced the following error into the
arm_topology[] table:

 static struct sched_domain_topology_level arm_topology[] = {
 #ifdef CONFIG_SCHED_MC
-       { cpu_corepower_mask, SD_SHARE_PKG_RESOURCES | SD_SHARE_POWERDOMAIN, SD_INIT_NAME(GMC) },
+       { cpu_corepower_mask, SD_SHARE_POWERDOMAIN, SD_INIT_NAME(GMC) },
        { cpu_coregroup_mask, SD_SHARE_PKG_RESOURCES, SD_INIT_NAME(MC) },

With this set-up, I get GMC & DIE levels for CPU0,1 and MC & DIE levels
for CPU2,3,4, i.e. the SD_SHARE_PKG_RESOURCES flag is only set for
CPU2,3,4, at MC level.

dmesg snippet (w/ adapted sched_domain_debug_one(), only CPU0 and CPU2
shown here):

...
CPU0 attaching sched-domain:
domain 0: span 0-1 level GMC
SD_LOAD_BALANCE SD_BALANCE_NEWIDLE SD_BALANCE_EXEC SD_BALANCE_FORK
SD_WAKE_AFFINE SD_SHARE_POWERDOMAIN SD_PREFER_SIBLING
groups: 0 1
...
domain 1: span 0-4 level DIE
SD_LOAD_BALANCE SD_BALANCE_NEWIDLE SD_BALANCE_EXEC SD_BALANCE_FORK
SD_WAKE_AFFINE SD_PREFER_SIBLING
groups: 0-1 (cpu_power = 2048) 2-4 (cpu_power = 3072)
...
CPU2 attaching sched-domain:
domain 0: span 2-4 level MC
SD_LOAD_BALANCE SD_BALANCE_NEWIDLE SD_BALANCE_EXEC SD_BALANCE_FORK
SD_WAKE_AFFINE SD_SHARE_PKG_RESOURCES
groups: 2 3 4
...
domain 1: span 0-4 level DIE
SD_LOAD_BALANCE SD_BALANCE_NEWIDLE SD_BALANCE_EXEC SD_BALANCE_FORK
SD_WAKE_AFFINE SD_PREFER_SIBLING
groups: 2-4 (cpu_power = 3072) 0-1 (cpu_power = 2048)
...

My point is that, IMHO, whichever approach we take (multiple adjacent sd
levels or a per-cpu topo sd flag function), we have to enforce a sane sd
topology flags set-up inside the core scheduler anyway.
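To illustrate what such an enforcement could look like, here is a rough
user-space sketch, not a proposal for the actual kernel code. It assumes
one possible rule: a "shared resource" flag that is set at some level must
also be set at every level below it, since CPUs sharing a resource in a
larger span also share it in a smaller one. The struct topo_level,
check_shared_flags() and the flag values are all hypothetical, and the
check only looks at the static table, not at the per-cpu spans.

```c
#include <stdio.h>

/* Illustrative flag values only; these are not the kernel's definitions. */
#define SD_SHARE_PKG_RESOURCES	0x1
#define SD_SHARE_POWERDOMAIN	0x2

/* Hypothetical, stripped-down topology level: name and static flags. */
struct topo_level {
	const char *name;
	unsigned int sd_flags;
};

/* Table as in the original arm_topology[] (GMC below MC). */
static const struct topo_level good_topology[] = {
	{ "GMC", SD_SHARE_PKG_RESOURCES | SD_SHARE_POWERDOMAIN },
	{ "MC",  SD_SHARE_PKG_RESOURCES },
};

/* Table with the deliberately introduced error. */
static const struct topo_level broken_topology[] = {
	{ "GMC", SD_SHARE_POWERDOMAIN },
	{ "MC",  SD_SHARE_PKG_RESOURCES },
};

/*
 * Walk the table bottom-up (index 0 is the lowest level). For each flag in
 * shared_child that is set at level i, require it at level i - 1 as well.
 * Returns the number of adjacent level pairs that violate the rule.
 */
static int check_shared_flags(const struct topo_level *tl, int nr_levels,
			      unsigned int shared_child)
{
	int bad = 0;

	for (int i = 1; i < nr_levels; i++) {
		unsigned int missing = tl[i].sd_flags & shared_child &
				       ~tl[i - 1].sd_flags;

		if (missing) {
			fprintf(stderr, "%s lacks flags 0x%x set at %s\n",
				tl[i - 1].name, missing, tl[i].name);
			bad++;
		}
	}
	return bad;
}
```

With both flags passed as shared_child, the original table passes and the
broken one is reported once, because MC sets SD_SHARE_PKG_RESOURCES while
GMC below it does not.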

-- Dietmar

>>
>> IMHO, exposing struct sched_domain_topology_level bar_topology[] to the
>> arch is the reason why the core scheduler has to check if the arch
>> provides a sane sd setup in both cases.
>>
>>>
>>> You can, for two cpus in the same domain, provide different flags; such
>>> a configuration doesn't make any sense at all.
>>>
>>> Now I see why people would like to have this; but unless we can make it
>>> robust I'd be very hesitant to go this route.
>>>
>>
>> By making it robust, I guess you mean that the core scheduler has to
>> check that the provided set-ups are sane, something like the following
>> code snippet in sd_init()
>>
>> if (WARN_ONCE(tl->sd_flags & ~TOPOLOGY_SD_FLAGS,
>>                 "wrong sd_flags in topology description\n"))
>>         tl->sd_flags &= TOPOLOGY_SD_FLAGS;
>>
>> but for per-cpu set-ups.
>> Obviously, this check has to be in sync with the usage of these flags in
>> the core scheduler algorithms. This probably means that a subset of
>> these topology sd flags has to be set for all cpus in a sd level whereas
>> others can be set only for some cpus.
[...]
