From: dietmar.eggemann@arm.com (Dietmar Eggemann)
To: linux-arm-kernel@lists.infradead.org
Subject: [PATCH v3 1/6] sched: rework of sched_domain topology definition
Date: Mon, 24 Mar 2014 15:02:41 +0100 [thread overview]
Message-ID: <53303B01.6070302@arm.com> (raw)
In-Reply-To: <CAKfTPtALFxdiWYa1ka3W+zhKBDshjngbz0Jys8trOgCjnmtDoA@mail.gmail.com>
On 21/03/14 11:04, Vincent Guittot wrote:
> On 20 March 2014 18:18, Dietmar Eggemann <dietmar.eggemann@arm.com> wrote:
>> On 20/03/14 17:02, Vincent Guittot wrote:
>>> On 20 March 2014 13:41, Dietmar Eggemann <dietmar.eggemann@arm.com> wrote:
>>>> On 19/03/14 16:22, Vincent Guittot wrote:
>>>>> We replace the old way to configure the scheduler topology with a new method
>>>>> which enables a platform to declare additionnal level (if needed).
>>>>>
>>>>> We still have a default topology table definition that can be used by platform
>>>>> that don't want more level than the SMT, MC, CPU and NUMA ones. This table can
>>>>> be overwritten by an arch which either wants to add new level where a load balance
>>>>> make sense like BOOK or powergating level or wants to change the flags
>>>>> configuration of some levels.
>>>>>
>>>>> For each level, we need a function pointer that returns cpumask for each cpu,
>>>>> a function pointer that returns the flags for the level and a name. Only flags
>>>>> that describe topology, can be set by an architecture. The current topology
>>>>> flags are:
>>>>> SD_SHARE_CPUPOWER
>>>>> SD_SHARE_PKG_RESOURCES
>>>>> SD_NUMA
>>>>> SD_ASYM_PACKING
>>>>>
>>>>> Then, each level must be a subset on the next one. The build sequence of the
>>>>> sched_domain will take care of removing useless levels like those with 1 CPU
>>>>> and those with the same CPU span and relevant information for load balancing
>>>>> than its child.
>>>>
>>>> The paragraph above contains important information to set this up
>>>> correctly, that's why it might be worth clarifying:
>>>>
>>>> - "next one" of sd means "child of sd" ?
>>>
>>> It's the next one in the table so the parent in the sched_domain
>>
>> Right, it's this way around. DIE is parent of MC is parent of GMC. Maybe
>> you could be more explicit about the parent of relation here?
>>
>>>
>>>> - "subset" means really "subset" and not "proper subset" ?
>>>
>>> yes, it's really "subset" and not "proper subset"
>>>
>>> Vincent
>>>
>>>>
>>>> On TC2 w/ the following change in cpu_corepower_mask()
>>>>
>>>> const struct cpumask *cpu_corepower_mask(int cpu)
>>>> {
>>>> - return &cpu_topology[cpu].thread_sibling;
>>>> + return cpu_topology[cpu].socket_id ?
>>>> &cpu_topology[cpu].thread_sibling :
>>>> + &cpu_topology[cpu].core_sibling;
>>>> }
>>>>
>>>> I get this e.g. for CPU0,2:
>>>>
>>>> CPU0: cpu_corepower_mask=0-1 -> GMC is subset of MC
>>>> CPU0: cpu_coregroup_mask=0-1
>>>> CPU0: cpu_cpu_mask=0-4
>>>>
>>>> CPU2: cpu_corepower_mask=2 -> GMC is proper sunset of MC
>>>> CPU2: cpu_coregroup_mask=2-4
>>>> CPU2: cpu_cpu_mask=0-4
>>>>
>>>> I assume here that this is a correct set-up.
>>
>> So this is a correct setup?
>
> yes it's a correct setup before the degenerate sequence
Cool, thanks.
>
>>
>>>>
>>>> The domain degenerate part:
>>>>
>>>> "useless levels like those with 1 CPU" ... that's the case for GMC level
>>>> for CPU2,3,4
>>>>
>>>> The GMC level is destroyed because of the following code snippet in
>>>> sd_degenerate(): if (cpumask_weight(sched_domain_span(sd)) == 1)
>>>>
>>>> so that's fine.
>>>>
>>>> In case of CPU0,1 since GMC and MC have the same span, the code in
>>>> build_sched_groups() creates only one group for MC and that's why
>>>> pflags is altered in sd_parent_degenerate() to SD_WAKE_AFFINE (0x20) and
>>>> the if condition 'if (~cflags & pflags)' is not hit and
>>>> sd_parent_degenerate() finally returns 1 for MC.
>>>>
>>>> So the "those with the same CPU span and relevant information for load
>>>> balancing than its child." is not so easy to understand for me. Because
>>>> both levels have the same span we actually don't take the flags of the
>>>> parent into consideration which require at least 2 groups.
>
> It's only the case if the parent has got 1 group otherwise we take
> care of all flags
Agreed & understood.
>
>>>>
>>>> So the TC2 example covers for me two corner cases: (1) The level I want
>>>> to get rid of only contains 1 CPU (GMC for CPU2,3,4) and (2) The span of
>>>> the parent level I want to get rid of (MC for CPU0,1) of is the same as
>>>> the span of the level which should stay.
>
> Having the same span is not enough. There must also no have relevant
> differences in the flags (after removing flags that need more than 1
> group is the parent has only 1 groups)
But if the span is the same (e.g. GMC, MC in the TC2 example), then
build_sched_groups() will always only create 1 group for the appropriate
parent (e.g. MC) following to the degenerate related code path I
described above. The TC2 example simply doesn't cover the case where the
parent is destroyed because of relevant differences in the flags. Also,
the added SD_SHARE_POWERDOMAIN in sd_parent_degenerate() of patch
'sched: add a new SD_SHARE_POWERDOMAIN for sched_domain' doesn't make a
differences because it's not set on MC level in the TC2 example. All I
want to say is that this code is not completely tested w/ this TC2
set-up alone.
>
>>>>
>>>> Are these two corner cases the only one supported here? If yes this has
>>>> to be stated somewhere, otherwise if somebody will try this approach on
>>>> a different topology, (s)he might be surprised.
>
> The degenerate sequence is there to remove useless level but it will
> not remove useful level. This rework has not modify the behavior of
> the degenerate sequence so (s)he should take the same care than
> previously.
Probably nitpicking here, but the patch 'sched: add a new
SD_SHARE_POWERDOMAIN for sched_domain' adds SD_SHARE_POWERDOMAIN in
sd_degenerate() and sd_parent_degenerate() does by introducing this flag.
-- Dietmar
>
> Vincent
>
>>
>> Could you please comment on the paragraph above too?
>>
>> Thanks,
>>
>> -- Dietmar
>>
>>>>
>>>> If we only consider SD_SHARE_POWERDOMAIN for the socket related level,
>>>> this works fine.
>>>>
>>>> I would like to test this on more platforms but I only have my TC2
>>>> available :-)
>>>>
>>>> -- Dietmar
>>>>
>>>> [...]
>>>>
>>>
>>
>>
>
WARNING: multiple messages have this Message-ID (diff)
From: Dietmar Eggemann <dietmar.eggemann@arm.com>
To: Vincent Guittot <vincent.guittot@linaro.org>
Cc: "peterz@infradead.org" <peterz@infradead.org>,
"mingo@kernel.org" <mingo@kernel.org>,
"linux-kernel@vger.kernel.org" <linux-kernel@vger.kernel.org>,
"tony.luck@intel.com" <tony.luck@intel.com>,
"fenghua.yu@intel.com" <fenghua.yu@intel.com>,
"schwidefsky@de.ibm.com" <schwidefsky@de.ibm.com>,
"james.hogan@imgtec.com" <james.hogan@imgtec.com>,
"cmetcalf@tilera.com" <cmetcalf@tilera.com>,
"benh@kernel.crashing.org" <benh@kernel.crashing.org>,
"linux@arm.linux.org.uk" <linux@arm.linux.org.uk>,
"linux-arm-kernel@lists.infradead.org"
<linux-arm-kernel@lists.infradead.org>,
"preeti@linux.vnet.ibm.com" <preeti@linux.vnet.ibm.com>,
"linaro-kernel@lists.linaro.org" <linaro-kernel@lists.linaro.org>
Subject: Re: [PATCH v3 1/6] sched: rework of sched_domain topology definition
Date: Mon, 24 Mar 2014 15:02:41 +0100 [thread overview]
Message-ID: <53303B01.6070302@arm.com> (raw)
In-Reply-To: <CAKfTPtALFxdiWYa1ka3W+zhKBDshjngbz0Jys8trOgCjnmtDoA@mail.gmail.com>
On 21/03/14 11:04, Vincent Guittot wrote:
> On 20 March 2014 18:18, Dietmar Eggemann <dietmar.eggemann@arm.com> wrote:
>> On 20/03/14 17:02, Vincent Guittot wrote:
>>> On 20 March 2014 13:41, Dietmar Eggemann <dietmar.eggemann@arm.com> wrote:
>>>> On 19/03/14 16:22, Vincent Guittot wrote:
>>>>> We replace the old way to configure the scheduler topology with a new method
>>>>> which enables a platform to declare additionnal level (if needed).
>>>>>
>>>>> We still have a default topology table definition that can be used by platform
>>>>> that don't want more level than the SMT, MC, CPU and NUMA ones. This table can
>>>>> be overwritten by an arch which either wants to add new level where a load balance
>>>>> make sense like BOOK or powergating level or wants to change the flags
>>>>> configuration of some levels.
>>>>>
>>>>> For each level, we need a function pointer that returns cpumask for each cpu,
>>>>> a function pointer that returns the flags for the level and a name. Only flags
>>>>> that describe topology, can be set by an architecture. The current topology
>>>>> flags are:
>>>>> SD_SHARE_CPUPOWER
>>>>> SD_SHARE_PKG_RESOURCES
>>>>> SD_NUMA
>>>>> SD_ASYM_PACKING
>>>>>
>>>>> Then, each level must be a subset on the next one. The build sequence of the
>>>>> sched_domain will take care of removing useless levels like those with 1 CPU
>>>>> and those with the same CPU span and relevant information for load balancing
>>>>> than its child.
>>>>
>>>> The paragraph above contains important information to set this up
>>>> correctly, that's why it might be worth clarifying:
>>>>
>>>> - "next one" of sd means "child of sd" ?
>>>
>>> It's the next one in the table so the parent in the sched_domain
>>
>> Right, it's this way around. DIE is parent of MC is parent of GMC. Maybe
>> you could be more explicit about the parent of relation here?
>>
>>>
>>>> - "subset" means really "subset" and not "proper subset" ?
>>>
>>> yes, it's really "subset" and not "proper subset"
>>>
>>> Vincent
>>>
>>>>
>>>> On TC2 w/ the following change in cpu_corepower_mask()
>>>>
>>>> const struct cpumask *cpu_corepower_mask(int cpu)
>>>> {
>>>> - return &cpu_topology[cpu].thread_sibling;
>>>> + return cpu_topology[cpu].socket_id ?
>>>> &cpu_topology[cpu].thread_sibling :
>>>> + &cpu_topology[cpu].core_sibling;
>>>> }
>>>>
>>>> I get this e.g. for CPU0,2:
>>>>
>>>> CPU0: cpu_corepower_mask=0-1 -> GMC is subset of MC
>>>> CPU0: cpu_coregroup_mask=0-1
>>>> CPU0: cpu_cpu_mask=0-4
>>>>
>>>> CPU2: cpu_corepower_mask=2 -> GMC is proper sunset of MC
>>>> CPU2: cpu_coregroup_mask=2-4
>>>> CPU2: cpu_cpu_mask=0-4
>>>>
>>>> I assume here that this is a correct set-up.
>>
>> So this is a correct setup?
>
> yes it's a correct setup before the degenerate sequence
Cool, thanks.
>
>>
>>>>
>>>> The domain degenerate part:
>>>>
>>>> "useless levels like those with 1 CPU" ... that's the case for GMC level
>>>> for CPU2,3,4
>>>>
>>>> The GMC level is destroyed because of the following code snippet in
>>>> sd_degenerate(): if (cpumask_weight(sched_domain_span(sd)) == 1)
>>>>
>>>> so that's fine.
>>>>
>>>> In case of CPU0,1 since GMC and MC have the same span, the code in
>>>> build_sched_groups() creates only one group for MC and that's why
>>>> pflags is altered in sd_parent_degenerate() to SD_WAKE_AFFINE (0x20) and
>>>> the if condition 'if (~cflags & pflags)' is not hit and
>>>> sd_parent_degenerate() finally returns 1 for MC.
>>>>
>>>> So the "those with the same CPU span and relevant information for load
>>>> balancing than its child." is not so easy to understand for me. Because
>>>> both levels have the same span we actually don't take the flags of the
>>>> parent into consideration which require at least 2 groups.
>
> It's only the case if the parent has got 1 group otherwise we take
> care of all flags
Agreed & understood.
>
>>>>
>>>> So the TC2 example covers for me two corner cases: (1) The level I want
>>>> to get rid of only contains 1 CPU (GMC for CPU2,3,4) and (2) The span of
>>>> the parent level I want to get rid of (MC for CPU0,1) of is the same as
>>>> the span of the level which should stay.
>
> Having the same span is not enough. There must also no have relevant
> differences in the flags (after removing flags that need more than 1
> group is the parent has only 1 groups)
But if the span is the same (e.g. GMC, MC in the TC2 example), then
build_sched_groups() will always only create 1 group for the appropriate
parent (e.g. MC) following to the degenerate related code path I
described above. The TC2 example simply doesn't cover the case where the
parent is destroyed because of relevant differences in the flags. Also,
the added SD_SHARE_POWERDOMAIN in sd_parent_degenerate() of patch
'sched: add a new SD_SHARE_POWERDOMAIN for sched_domain' doesn't make a
differences because it's not set on MC level in the TC2 example. All I
want to say is that this code is not completely tested w/ this TC2
set-up alone.
>
>>>>
>>>> Are these two corner cases the only one supported here? If yes this has
>>>> to be stated somewhere, otherwise if somebody will try this approach on
>>>> a different topology, (s)he might be surprised.
>
> The degenerate sequence is there to remove useless level but it will
> not remove useful level. This rework has not modify the behavior of
> the degenerate sequence so (s)he should take the same care than
> previously.
Probably nitpicking here, but the patch 'sched: add a new
SD_SHARE_POWERDOMAIN for sched_domain' adds SD_SHARE_POWERDOMAIN in
sd_degenerate() and sd_parent_degenerate() does by introducing this flag.
-- Dietmar
>
> Vincent
>
>>
>> Could you please comment on the paragraph above too?
>>
>> Thanks,
>>
>> -- Dietmar
>>
>>>>
>>>> If we only consider SD_SHARE_POWERDOMAIN for the socket related level,
>>>> this works fine.
>>>>
>>>> I would like to test this on more platforms but I only have my TC2
>>>> available :-)
>>>>
>>>> -- Dietmar
>>>>
>>>> [...]
>>>>
>>>
>>
>>
>
next prev parent reply other threads:[~2014-03-24 14:02 UTC|newest]
Thread overview: 30+ messages / expand[flat|nested] mbox.gz Atom feed top
2014-03-19 16:22 [PATCH v3 0/6] rework sched_domain topology description Vincent Guittot
2014-03-19 16:22 ` Vincent Guittot
2014-03-19 16:22 ` [PATCH v3 1/6] sched: rework of sched_domain topology definition Vincent Guittot
2014-03-19 16:22 ` Vincent Guittot
2014-03-20 12:41 ` Dietmar Eggemann
2014-03-20 12:41 ` Dietmar Eggemann
2014-03-20 17:02 ` Vincent Guittot
2014-03-20 17:02 ` Vincent Guittot
2014-03-20 17:18 ` Dietmar Eggemann
2014-03-20 17:18 ` Dietmar Eggemann
2014-03-21 10:04 ` Vincent Guittot
2014-03-21 10:04 ` Vincent Guittot
2014-03-24 14:02 ` Dietmar Eggemann [this message]
2014-03-24 14:02 ` Dietmar Eggemann
2014-03-19 16:22 ` [PATCH v3 2/6] sched: s390: create a dedicated topology table Vincent Guittot
2014-03-19 16:22 ` Vincent Guittot
2014-03-19 16:22 ` [PATCH v3 3/6] sched: powerpc: " Vincent Guittot
2014-03-19 16:22 ` Vincent Guittot
2014-03-19 16:22 ` [PATCH v3 4/6] sched: add a new SD_SHARE_POWERDOMAIN for sched_domain Vincent Guittot
2014-03-19 16:22 ` Vincent Guittot
2014-03-19 16:22 ` [PATCH 5/6] sched: ARM: create a dedicated scheduler topology table Vincent Guittot
2014-03-19 16:22 ` Vincent Guittot
2014-03-19 16:22 ` [PATCH v3 6/6] sched: powerpc: Add SD_SHARE_POWERDOMAIN for SMT level Vincent Guittot
2014-03-19 16:22 ` Vincent Guittot
2014-03-23 1:49 ` Preeti U Murthy
2014-03-23 1:49 ` Preeti U Murthy
2014-03-23 3:12 ` Benjamin Herrenschmidt
2014-03-23 3:12 ` Benjamin Herrenschmidt
2014-03-24 8:21 ` Vincent Guittot
2014-03-24 8:21 ` Vincent Guittot
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=53303B01.6070302@arm.com \
--to=dietmar.eggemann@arm.com \
--cc=linux-arm-kernel@lists.infradead.org \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.