From: Darren Hart <darren@os.amperecomputing.com>
To: Barry Song <21cnbao@gmail.com>,
Sudeep Holla <sudeep.holla@arm.com>,
Greg Kroah-Hartman <gregkh@linuxfoundation.org>,
"Rafael J. Wysocki" <rafael@kernel.org>
Cc: Vincent Guittot <vincent.guittot@linaro.org>,
LKML <linux-kernel@vger.kernel.org>,
Linux Arm <linux-arm-kernel@lists.infradead.org>,
Catalin Marinas <Catalin.Marinas@arm.com>,
Will Deacon <will@kernel.org>,
Peter Zijlstra <peterz@infradead.org>,
Barry Song <song.bao.hua@hisilicon.com>,
Valentin Schneider <Valentin.Schneider@arm.com>,
"D . Scott Phillips" <scott@os.amperecomputing.com>,
Ilkka Koskinen <ilkka@os.amperecomputing.com>,
stable@vger.kernel.org
Subject: Re: [PATCH 1/1] arm64: smp: Skip MC sched domain on SoCs with no LLC
Date: Thu, 3 Mar 2022 08:35:35 -0800
Message-ID: <YiDuV8YkaWGNgky7@fedora>
In-Reply-To: <CAGsJ_4y8MkQhAZ9c9yz_UHee7MCZrtv3aui=Luq-ZOBeAsGbGQ@mail.gmail.com>
On Thu, Mar 03, 2022 at 06:36:30PM +1300, Barry Song wrote:
> On Thu, Mar 3, 2022 at 3:22 PM Darren Hart
> <darren@os.amperecomputing.com> wrote:
> >
> > On Wed, Mar 02, 2022 at 10:32:06AM +0100, Vincent Guittot wrote:
> > > On Tue, 1 Mar 2022 at 01:35, Darren Hart <darren@os.amperecomputing.com> wrote:
> > > >
> > > > Ampere Altra defines CPU clusters in the ACPI PPTT. They share a Snoop
> > > > Control Unit, but have no shared CPU-side last level cache.
> > > >
> > > > cpu_coregroup_mask() will return a cpumask with weight 1, while
> > > > cpu_clustergroup_mask() will return a cpumask with weight 2.
> > > >
> > > > As a result, build_sched_domain() will BUG() once per CPU with:
> > > >
> > > > BUG: arch topology borken
> > > > the CLS domain not a subset of the MC domain
> > > >
> > > > The MC level cpumask is then extended to that of the CLS child, and is
> > > > later removed entirely as redundant. This sched domain topology is an
> > > > improvement over previous topologies, or those built without
> > > > SCHED_CLUSTER, particularly for certain latency sensitive workloads.
> > > > With the current scheduler model and heuristics, this is a desirable
> > > > default topology for Ampere Altra and Altra Max system.
> > > >
> > > > Introduce an alternate sched domain topology for arm64 without the MC
> > > > level and test for llc_sibling weight 1 across all CPUs to enable it.
> > > >
> > > > Do this in arch/arm64/kernel/smp.c (as opposed to
> > > > arch/arm64/kernel/topology.c) as all the CPU sibling maps are now
> > > > populated and we avoid needing to extend the drivers/acpi/pptt.c API to
> > > > detect the cluster level being above the cpu llc level. This is
> > > > consistent with other architectures and provides a readily extensible
> > > > mechanism for other alternate topologies.
> > > >
> > > > The final sched domain topology for a 2 socket Ampere Altra system is
> > > > unchanged with or without CONFIG_SCHED_CLUSTER, and the BUG is avoided:
> > > >
> > > > For CPU0:
> > > >
> > > > CONFIG_SCHED_CLUSTER=y
> > > > CLS [0-1]
> > > > DIE [0-79]
> > > > NUMA [0-159]
> > > >
> > > > CONFIG_SCHED_CLUSTER is not set
> > > > DIE [0-79]
> > > > NUMA [0-159]
> > > >
> > > > Cc: Catalin Marinas <catalin.marinas@arm.com>
> > > > Cc: Will Deacon <will@kernel.org>
> > > > Cc: Peter Zijlstra <peterz@infradead.org>
> > > > Cc: Vincent Guittot <vincent.guittot@linaro.org>
> > > > Cc: Barry Song <song.bao.hua@hisilicon.com>
> > > > Cc: Valentin Schneider <valentin.schneider@arm.com>
> > > > Cc: D. Scott Phillips <scott@os.amperecomputing.com>
> > > > Cc: Ilkka Koskinen <ilkka@os.amperecomputing.com>
> > > > Cc: <stable@vger.kernel.org> # 5.16.x
> > > > Signed-off-by: Darren Hart <darren@os.amperecomputing.com>
> > > > ---
> > > > arch/arm64/kernel/smp.c | 28 ++++++++++++++++++++++++++++
> > > > 1 file changed, 28 insertions(+)
> > > >
> > > > diff --git a/arch/arm64/kernel/smp.c b/arch/arm64/kernel/smp.c
> > > > index 27df5c1e6baa..3597e75645e1 100644
> > > > --- a/arch/arm64/kernel/smp.c
> > > > +++ b/arch/arm64/kernel/smp.c
> > > > @@ -433,6 +433,33 @@ static void __init hyp_mode_check(void)
> > > > }
> > > > }
> > > >
> > > > +static struct sched_domain_topology_level arm64_no_mc_topology[] = {
> > > > +#ifdef CONFIG_SCHED_SMT
> > > > + { cpu_smt_mask, cpu_smt_flags, SD_INIT_NAME(SMT) },
> > > > +#endif
> > > > +
> > > > +#ifdef CONFIG_SCHED_CLUSTER
> > > > + { cpu_clustergroup_mask, cpu_cluster_flags, SD_INIT_NAME(CLS) },
> > > > +#endif
> > > > +
> > > > + { cpu_cpu_mask, SD_INIT_NAME(DIE) },
> > > > + { NULL, },
> > > > +};
> > > > +
> > > > +static void __init update_sched_domain_topology(void)
> > > > +{
> > > > + int cpu;
> > > > +
> > > > + for_each_possible_cpu(cpu) {
> > > > + if (cpu_topology[cpu].llc_id != -1 &&
> > >
> > > Have you tested it with a non-acpi system ? AFAICT, llc_id is only set
> > > by ACPI system and llc_id == -1 for others like DT based system
> > >
> > > > + cpumask_weight(&cpu_topology[cpu].llc_sibling) > 1)
> > > > + return;
> > > > + }
> >
> > Hi Vincent,
> >
> > I did not have a non-acpi system to test, no. You're right of course,
> > llc_id is only set by ACPI systems on arm64. We could wrap this in a
> > CONFIG_ACPI ifdef (or IS_ENABLED), but I think this would be preferable:
> >
> > + for_each_possible_cpu(cpu) {
> > + if (cpu_topology[cpu].llc_id == -1 ||
> > + cpumask_weight(&cpu_topology[cpu].llc_sibling) > 1)
> > + return;
> > + }
> >
> > Quickly tested on Altra successfully. Would appreciate anyone with non-acpi
> > arm64 systems who can test and verify this behaves as intended. I will ask
> > around tomorrow as well to see what I may have access to.
>
> I wonder if we can fix it by this
>
> diff --git a/drivers/base/arch_topology.c b/drivers/base/arch_topology.c
> index 976154140f0b..551655ccd0eb 100644
> --- a/drivers/base/arch_topology.c
> +++ b/drivers/base/arch_topology.c
> @@ -627,6 +627,13 @@ const struct cpumask *cpu_coregroup_mask(int cpu)
>  		if (cpumask_subset(&cpu_topology[cpu].llc_sibling, core_mask))
>  			core_mask = &cpu_topology[cpu].llc_sibling;
>  	}
> +	/*
> +	 * Some machines have no LLC but have clusters, we let MC = CLUSTER
> +	 * as MC should always be after CLUSTER. But anyway, the MC domain
> +	 * will be removed
> +	 */
> +	if (cpumask_subset(core_mask, &cpu_topology[cpu].cluster_sibling))
> +		core_mask = &cpu_topology[cpu].cluster_sibling;
>
>  	return core_mask;
>  }
>
> as it can make all kinds of topologies happy - symmetric and asymmetric.
>
Hah. Full circle. Yes, this works, and it's basically what we'd started
with internally. I ended up exploring various paths here to avoid a
"band aid" and to target the fix and minimize impact. That said, after
digging through the acpi, topology, smp, and sched domains code... I
don't think this approach is a band aid and it's a very minimal
solution. The only downside I can think of is masking a potential
topology bug and not catching it in the scheduler - that seems very
unlikely. I'm perfectly happy with this solution as well.
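
To illustrate what the extra check does, here is a minimal userspace sketch
of the mask selection. It is not the kernel code: toy_mask, struct
toy_topology, toy_subset() and toy_coregroup_mask() are hypothetical
stand-ins for struct cpumask, the cpu_topology[] entry, cpumask_subset() and
cpu_coregroup_mask(), and the NUMA-node and llc_id handling is left out on
purpose.

```c
#include <stdbool.h>
#include <stdint.h>

/* Toy stand-in for struct cpumask: bit n set means CPU n is in the mask. */
typedef uint64_t toy_mask;

/* Toy stand-in for a cpu_topology[] entry; only the fields used here. */
struct toy_topology {
	toy_mask core_sibling;    /* CPUs in the same package */
	toy_mask llc_sibling;     /* CPUs sharing the last level cache */
	toy_mask cluster_sibling; /* CPUs in the same cluster */
};

/* True when every CPU in a is also in b, like cpumask_subset(a, b). */
static bool toy_subset(toy_mask a, toy_mask b)
{
	return (a & ~b) == 0;
}

/*
 * Simplified MC mask selection. with_fallback enables the check
 * suggested in this thread; everything else is an assumption made
 * to keep the sketch short.
 */
static toy_mask toy_coregroup_mask(const struct toy_topology *t,
				   bool with_fallback)
{
	toy_mask core_mask = t->core_sibling;

	/* Narrow MC to the LLC siblings when they fit inside it. */
	if (t->llc_sibling && toy_subset(t->llc_sibling, core_mask))
		core_mask = t->llc_sibling;

	/* Suggested fallback: never let MC be smaller than the cluster. */
	if (with_fallback && toy_subset(core_mask, t->cluster_sibling))
		core_mask = t->cluster_sibling;

	return core_mask;
}
```

For a scaled-down Altra-like CPU0 with core_sibling 0xff, llc_sibling 0x01
and cluster_sibling 0x03, the sketch returns 0x01 without the fallback, so
CLS (0x03) is not a subset of MC, which is the condition the scheduler
complains about. With the fallback it returns 0x03, so MC equals CLS and can
be dropped later as redundant.
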

Will D, would you prefer this approach?

+Sudeep, Greg, and Rafael,

Are you OK with this approach?

If so, we can drop my arm64-specific new topology patch and I can send a
version of this one out (Suggested-by Barry, of course), unless you'd
prefer to send it, Barry?

Thanks,
> >
> > Thanks,
> >
> > > > +
> > > > + pr_info("No LLC siblings, using No MC sched domains topology\n");
> > > > + set_sched_topology(arm64_no_mc_topology);
> > > > +}
> > > > +
> > > > void __init smp_cpus_done(unsigned int max_cpus)
> > > > {
> > > > pr_info("SMP: Total of %d processors activated.\n", num_online_cpus());
> > > > @@ -440,6 +467,7 @@ void __init smp_cpus_done(unsigned int max_cpus)
> > > > hyp_mode_check();
> > > > apply_alternatives_all();
> > > > mark_linear_text_alias_ro();
> > > > + update_sched_domain_topology();
> > > > }
> > > >
> > > > void __init smp_prepare_boot_cpu(void)
> > > > --
> > > > 2.31.1
> > > >
> >
> > --
> > Darren Hart
> > Ampere Computing / OS and Kernel
>
> Thanks
> Barry
--
Darren Hart
Ampere Computing / OS and Kernel
_______________________________________________
linux-arm-kernel mailing list
linux-arm-kernel@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-arm-kernel