From: Morten Rasmussen <morten.rasmussen@arm.com>
To: Jeremy Linton <jeremy.linton@arm.com>
Cc: mark.rutland@arm.com, vkilari@codeaurora.org,
lorenzo.pieralisi@arm.com, catalin.marinas@arm.com,
tnowicki@caviumnetworks.com, gregkh@linuxfoundation.org,
will.deacon@arm.com, dietmar.eggemann@arm.com, rjw@rjwysocki.net,
linux-kernel@vger.kernel.org, ahs3@redhat.com,
linux-acpi@vger.kernel.org, palmer@sifive.com,
hanjun.guo@linaro.org, sudeep.holla@arm.com,
austinwc@codeaurora.org, linux-riscv@lists.infradead.org,
john.garry@huawei.com, wangxiongfeng2@huawei.com,
linux-arm-kernel@lists.infradead.org, lenb@kernel.org
Subject: Re: [PATCH v7 13/13] arm64: topology: divorce MC scheduling domain from core_siblings
Date: Wed, 14 Mar 2018 13:05:09 +0000 [thread overview]
Message-ID: <20180314130509.GM4589@e105550-lin.cambridge.arm.com> (raw)
In-Reply-To: <4e7fd404-fb7b-ee38-4a78-8ee091ae07cf@arm.com>
On Wed, Mar 07, 2018 at 10:19:50AM -0600, Jeremy Linton wrote:
> Hi,
>
> On 03/07/2018 07:06 AM, Morten Rasmussen wrote:
> >On Tue, Mar 06, 2018 at 04:22:18PM -0600, Jeremy Linton wrote:
> >>>>>>To do this correctly, we should really base that on the cache
> >>>>>>topology immediately below the NUMA node (for NUMA in socket)
> >>>>>>or below the physical package for normal NUMA configurations.
> >>>>>
> >>>>>That means we wouldn't support multi-die NUMA nodes?
> >>>>
> >>>>You mean a bottom level NUMA domain that crosses multiple sockets/dies? That
> >>>>should work. This patch is picking the widest cache layer below the smallest
> >>>>of the package or numa grouping. What actually happens depends on the
> >>>>topology. Given a case where there are multiple dies in a socket, and the
> >>>>numa domain is at the socket level the MC is going to reflect the caching
> >>>>topology immediately below the socket. In the case of multiple dies, with a
> >>>>cache that crosses them in socket, then the MC is basically going to be the
> >>>>socket, otherwise if the widest cache is per die, or some narrower grouping
> >>>>(cluster?) then that is what ends up in the MC. (this is easier with some
> >>>>pictures)
> >>>
> >>>That is more or less what I meant. I think I got confused with the role
> >>>of "DIE" level, i.e. that top non-NUMA level, in this. The DIE level
> >>>cpumask spans exactly the NUMA node, so IIUC we have three scenarios:
> >>>
> >>>1. Multi-die/socket/physical package NUMA node
> >>> Top non-NUMA level (DIE) spans multiple packages. Bottom NUMA level
> >>> spans multiple multi-package nodes. The MC mask reflects the last-level
> >>> cache within the NUMA node which is most likely per-die or per-cluster
> >>> (inside each die).
> >>>
> >>>2. physical package == NUMA node
> >>> The top non-NUMA (DIE) mask is the same as the core sibling mask.
> >>> If there is cache spanning the entire node, the scheduler topology
> >>> will eliminate a layer (DIE?), so bottom NUMA level would be right on
> >>> top of MC spanning multiple physical packages. If there is no
> >>> node-wide last level cache, DIE is preserved and MC matches the span
> >>> of the last level cache.
> >>>
> >>>3. numa-in-package
> >>> Top non-NUMA (DIE) mask is not reflecting the actual die, but is
> >>> reflecting the NUMA node. MC has a span equal to the largest shared
> >>> cache span smaller than or equal to the NUMA node. If it is
> >>> equal, DIE level is eliminated, otherwise DIE is preserved, but
> >>> doesn't really represent die. Bottom non-NUMA level spans multiple
> >>> in-package NUMA nodes.
> >>>
> >>>As you said, multi-die nodes should work. However, I'm not sure if
> >>>shrinking MC to match a cache could cause us trouble, or if it should
> >>>just be shrunk to be the smaller of the node mask and core siblings.
> >>
> >>Shrinking to the smaller of the numa or package is a fairly trivial change,
> >>and I'm good with that change too. I discounted it because there might be an
> >>advantage in case 2 if the internal hardware is actually a case 3 (or just
> >>multiple rings/whatever each with a L3). In those cases the firmware vendor
> >>could play around with whatever representation serves them the best.
> >
> >Agreed. Distributed last level caches and interconnect speeds make it
> >virtually impossible to define MC in a way that works well for everyone
> >based on the topology information we have at hand.
> >
> >>
> >>>Unless you have a node-wide last level cache DIE level won't be
> >>>eliminated in scenario 2 and 3, and could cause trouble. For
> >>>numa-in-package, you can end up with a DIE level inside the node where
> >>>the default flags don't favour aggressive spreading of tasks. The same
> >>>could be the case for per-package nodes (scenario 2).
> >>>
> >>>Don't we end up redefining physical package to be last level cache
> >>>instead of using the PPTT flag for scenario 2 and 3?
> >>
> >>I'm not sure I understand, core_siblings isn't changing (it's still per
> >>package). Only the MC mapping which normally is just core_siblings. For all
> >>intents right now this patch is the same as v6, except for the
> >>numa-in-package where the MC domain is being shrunk to the node siblings.
> >>I'm just trying to setup the code for potential future cases where the LLC
> >>isn't equal to the node or package.
> >
> >Right, core_siblings remains the same. The scheduler topology just looks
> >a bit odd as we can have core_siblings spanning the full true physical
> >package and have DIE level as a subset of that with an MC level where
> >the MC siblings is a much smaller subset of cpus than core_siblings.
> >
> >IOW, it would lead to having one topology used by the scheduler and
> >another used by the users of topology_core_cpumask() (which is not
> >many I think).
> >
> >Is there a good reason for diverging instead of adjusting the
> >core_sibling mask? On x86 the core_siblings mask is defined by the last
> >level cache span so they don't have this issue.
>
> I'm overwhelmingly convinced we are doing the right thing WRT the core
> siblings at the moment. It's exported to user space, and the general
> understanding is that it's a socket. Even with numa-in-package/on-die, if you
> run lscpu, lstopo, etc., they all understand the system topology correctly
> doing it this way (AFAIK).
Right. As said in my reply to Brice, I thought MC level and sysfs were
aligned, but they clearly aren't. I agree that treating them differently
is the right thing to do.
> >I would prefer this simpler solution as it should eliminate DIE level
> >for all numa-in-package configurations. Although, I think we should consider
> >just shrinking the core_sibling mask instead of having a different MC
> >mask (cpu_coregroup_mask). Do you see any problems in doing that?
> My strongest opinion is leaning toward core_siblings being correct as it
> stands. How the scheduler deals with that is less clear. I will toss the
> above as a separate patch, and we can forget this one. I see dropping DIE as
> a larger patch set defining an arch specific scheduler topology and tweaking
> the individual scheduler level/flags/tuning. OTOH, unless there is something
> particularly creative there, I don't see how to avoid NUMA domains pushing
> deeper into the cache/system topology, which means filling the MC layer (and
> possibly others) similarly to the above snippet.
Agreed that core_siblings is correct. With the simple solution, DIE
shouldn't show up for any numa-in-package configurations, allowing NUMA
to sit directly on top of MC, which should mean that the flags should be
roughly okay.