* [PATCH v3] x86,sched: allow topologies where NUMA nodes share an LLC
@ 2018-03-29 0:00 Alison Schofield
2018-03-29 13:16 ` Thomas Gleixner
` (2 more replies)
0 siblings, 3 replies; 13+ messages in thread
From: Alison Schofield @ 2018-03-29 0:00 UTC (permalink / raw)
To: Thomas Gleixner, Ingo Molnar
Cc: Dave Hansen, Tony Luck, Tim Chen, H. Peter Anvin, Borislav Petkov,
Peter Zijlstra, David Rientjes, Igor Mammedov, Prarit Bhargava,
brice.goglin, x86, linux-kernel
From: Alison Schofield <alison.schofield@intel.com>
Intel's Skylake Server CPUs have a different LLC topology than previous
generations. When in Sub-NUMA-Clustering (SNC) mode, the package is
divided into two "slices", each containing half the cores, half the LLC,
and one memory controller, and each slice is enumerated to Linux as a
NUMA node. This is similar to how the cores and LLC were arranged
for the Cluster-On-Die (CoD) feature.
CoD allowed the same cache line to be present in each half of the LLC.
But, with SNC, each line is only ever present in *one* slice. This
means that the portion of the LLC *available* to a CPU depends on the
data being accessed:
Remote socket: entire package LLC is shared
Local socket->local slice: data goes into local slice LLC
Local socket->remote slice: data goes into remote-slice LLC. Slightly
higher latency than local slice LLC.
The biggest implication from this is that a process accessing all
NUMA-local memory only sees half the LLC capacity.
The CPU describes its cache hierarchy with the CPUID instruction. One
of the CPUID leaves enumerates the "logical processors sharing this
cache". This information is used for scheduling decisions so that tasks
move more freely between CPUs sharing the cache.
But, the CPUID for the SNC configuration discussed above enumerates
the LLC as being shared by the entire package. This is not 100%
precise because the entire cache is not usable by all accesses. But,
it *is* the way the hardware enumerates itself, and this is not likely
to change.
This breaks the topology_sane() check in the smpboot.c code because
this topology is considered not-sane. To fix this, add a vendor and
model specific check to never call topology_sane() for these systems.
Also, just as with "Cluster-on-Die", we throw out the "coregroup"
sched_domain_topology_level and use NUMA information from the SRAT
alone.
This is OK at least on the hardware we are immediately concerned about
because the LLC sharing happens at both the slice and at the package
level, which are also NUMA boundaries.
This patch eliminates a warning that looks like this:
sched: CPU #3's llc-sibling CPU #0 is not on the same node! [node: 1 != 0]. Ignoring dependency.
Signed-off-by: Alison Schofield <alison.schofield@intel.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Luck, Tony <tony.luck@intel.com>
Cc: Tim Chen <tim.c.chen@linux.intel.com>
Cc: "H. Peter Anvin" <hpa@linux.intel.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Igor Mammedov <imammedo@redhat.com>
Cc: Prarit Bhargava <prarit@redhat.com>
Cc: brice.goglin@gmail.com
Cc: Ingo Molnar <mingo@kernel.org>
---
Changes in v3:
* Use x86_match_cpu() for the vendor & model check and move related
comments to the array definition. (Still just one system model)
* Updated the comments surrounding the topology_sane() check.
Changes in v2:
* Add vendor check (Intel) where we only had a model check (Skylake_X).
Considered the suggestion of adding a new flag here but thought that
to be overkill for this usage.
* Return false, instead of true, from match_llc() per reviewer suggestion.
That also cleaned up a topology broken bug message in sched_domain().
* Updated the comments around the match_llc() return value, continuing to
note that the return value doesn't actually matter because we are throwing
away coregroups for scheduling.
* Changed submitter. I'm following up on this patch originally submitted
by Dave Hansen.
arch/x86/kernel/smpboot.c | 62 +++++++++++++++++++++++++++++++++++++++--------
1 file changed, 52 insertions(+), 10 deletions(-)
diff --git a/arch/x86/kernel/smpboot.c b/arch/x86/kernel/smpboot.c
index ff99e2b..92f0fe0 100644
--- a/arch/x86/kernel/smpboot.c
+++ b/arch/x86/kernel/smpboot.c
@@ -77,6 +77,8 @@
#include <asm/i8259.h>
#include <asm/misc.h>
#include <asm/qspinlock.h>
+#include <asm/intel-family.h>
+#include <asm/cpu_device_id.h>
/* Number of siblings per CPU package */
int smp_num_siblings = 1;
@@ -390,15 +392,61 @@ static bool match_smt(struct cpuinfo_x86 *c, struct cpuinfo_x86 *o)
return false;
}
+/*
+ * Set if a package/die has multiple NUMA nodes inside.
+ * AMD Magny-Cours, Intel Cluster-on-Die, and Intel
+ * Sub-NUMA Clustering have this.
+ */
+
+static bool x86_has_numa_in_package;
+
+/*
+ * Define SNC (Sub-NUMA Cluster) CPUs.
+ *
+ * These are Intel CPUs that enumerate an LLC that is shared by
+ * multiple NUMA nodes. The LLC on these systems is shared for
+ * off-package data access but private to the NUMA node (half
+ * of the package) for on-package access.
+ *
+ * CPUID (the source of the information about the LLC) can only
+ * enumerate the cache as being shared *or* unshared, but not
+ * this particular configuration. The CPU in this case enumerates
+ * the cache to be shared across the entire package (spanning both
+ * NUMA nodes).
+ */
+
+static const struct x86_cpu_id snc_cpu[] __initconst = {
+ { X86_VENDOR_INTEL, 6, INTEL_FAM6_SKYLAKE_X },
+ {}
+};
+
static bool match_llc(struct cpuinfo_x86 *c, struct cpuinfo_x86 *o)
{
int cpu1 = c->cpu_index, cpu2 = o->cpu_index;
- if (per_cpu(cpu_llc_id, cpu1) != BAD_APICID &&
- per_cpu(cpu_llc_id, cpu1) == per_cpu(cpu_llc_id, cpu2))
- return topology_sane(c, o, "llc");
+ /* Do not match if we do not have a valid APICID for cpu: */
+ if (per_cpu(cpu_llc_id, cpu1) == BAD_APICID)
+ return false;
- return false;
+ /* Do not match if LLC id does not match: */
+ if (per_cpu(cpu_llc_id, cpu1) != per_cpu(cpu_llc_id, cpu2))
+ return false;
+
+ /*
+ * topology_sane() considers LLCs that span NUMA nodes to be
+ * insane and will display a warning message. Bypass the call
+ * to topology_sane() for snc_cpu's to avoid that warning.
+ */
+
+ if (!topology_same_node(c, o) && x86_match_cpu(snc_cpu)) {
+ /* Indicate that package has NUMA nodes inside: */
+ x86_has_numa_in_package = true;
+
+ /* Do not use LLC for scheduler decisions: */
+ return false;
+ }
+
+ return topology_sane(c, o, "llc");
}
/*
@@ -454,12 +502,6 @@ static struct sched_domain_topology_level x86_topology[] = {
{ NULL, },
};
-/*
- * Set if a package/die has multiple NUMA nodes inside.
- * AMD Magny-Cours and Intel Cluster-on-Die have this.
- */
-static bool x86_has_numa_in_package;
-
void set_cpu_sibling_map(int cpu)
{
bool has_smt = smp_num_siblings > 1;
--
2.7.4
* Re: [PATCH v3] x86,sched: allow topologies where NUMA nodes share an LLC
2018-03-29 0:00 [PATCH v3] x86,sched: allow topologies where NUMA nodes share an LLC Alison Schofield
@ 2018-03-29 13:16 ` Thomas Gleixner
2018-03-29 13:45 ` Dave Hansen
2018-03-29 13:47 ` Peter Zijlstra
2018-03-30 3:20 ` kbuild test robot
2018-03-30 17:34 ` Alison Schofield
2 siblings, 2 replies; 13+ messages in thread
From: Thomas Gleixner @ 2018-03-29 13:16 UTC (permalink / raw)
To: Alison Schofield
Cc: Ingo Molnar, Dave Hansen, Tony Luck, Tim Chen, H. Peter Anvin,
Borislav Petkov, Peter Zijlstra, David Rientjes, Igor Mammedov,
Prarit Bhargava, brice.goglin, x86, linux-kernel
On Wed, 28 Mar 2018, Alison Schofield wrote:
> From: Alison Schofield <alison.schofield@intel.com>
>
> Intel's Skylake Server CPUs have a different LLC topology than previous
> generations. When in Sub-NUMA-Clustering (SNC) mode, the package is
> divided into two "slices", each containing half the cores, half the LLC,
> and one memory controller and each slice is enumerated to Linux as a
> NUMA node. This is similar to how the cores and LLC were arranged
> for the Cluster-On-Die (CoD) feature.
>
> CoD allowed the same cache line to be present in each half of the LLC.
> But, with SNC, each line is only ever present in *one* slice. This
> means that the portion of the LLC *available* to a CPU depends on the
> data being accessed:
>
> Remote socket: entire package LLC is shared
> Local socket->local slice: data goes into local slice LLC
> Local socket->remote slice: data goes into remote-slice LLC. Slightly
> higher latency than local slice LLC.
>
> The biggest implication from this is that a process accessing all
> NUMA-local memory only sees half the LLC capacity.
>
> The CPU describes its cache hierarchy with the CPUID instruction. One
> of the CPUID leaves enumerates the "logical processors sharing this
> cache". This information is used for scheduling decisions so that tasks
> move more freely between CPUs sharing the cache.
>
> But, the CPUID for the SNC configuration discussed above enumerates
> the LLC as being shared by the entire package. This is not 100%
> precise because the entire cache is not usable by all accesses. But,
> it *is* the way the hardware enumerates itself, and this is not likely
> to change.
>
> This breaks the sane_topology() check in the smpboot.c code because
> this topology is considered not-sane. To fix this, add a vendor and
> model specific check to never call topology_sane() for these systems.
> Also, just like "Cluster-on-Die" we throw out the "coregroup"
> sched_domain_topology_level and use NUMA information from the SRAT
> alone.
>
> This is OK at least on the hardware we are immediately concerned about
> because the LLC sharing happens at both the slice and at the package
> level, which are also NUMA boundaries.
So that addresses the scheduler interaction, but it still leaves the
information in the sysfs files unchanged. See cpu/intel_cacheinfo.c. There
are applications which use that information so it should be correct.
Thanks,
tglx
* Re: [PATCH v3] x86,sched: allow topologies where NUMA nodes share an LLC
2018-03-29 13:16 ` Thomas Gleixner
@ 2018-03-29 13:45 ` Dave Hansen
2018-03-29 13:57 ` Peter Zijlstra
2018-03-29 13:47 ` Peter Zijlstra
1 sibling, 1 reply; 13+ messages in thread
From: Dave Hansen @ 2018-03-29 13:45 UTC (permalink / raw)
To: Thomas Gleixner, Alison Schofield
Cc: Ingo Molnar, Tony Luck, Tim Chen, H. Peter Anvin, Borislav Petkov,
Peter Zijlstra, David Rientjes, Igor Mammedov, Prarit Bhargava,
brice.goglin, x86, linux-kernel
On 03/29/2018 06:16 AM, Thomas Gleixner wrote:
>> This is OK at least on the hardware we are immediately concerned about
>> because the LLC sharing happens at both the slice and at the package
>> level, which are also NUMA boundaries.
> So that addresses the scheduler interaction, but it still leaves the
> information in the sysfs files unchanged. See cpu/intel_cacheinfo.c. There
> are applications which use that information so it should be correct.
Were you thinking of shared_cpu_list/map? The information in there is
correct for core->off-package access. It is not correct for
core->on-package access, unless that access is perfectly interleaved
across both package "slices".
We could try to add an attribute or two to clarify this situation. But,
similar to the CPUID leaves, I don't think we actually have a precise
way to describe the way the cache actually works here.
* Re: [PATCH v3] x86,sched: allow topologies where NUMA nodes share an LLC
2018-03-29 13:16 ` Thomas Gleixner
2018-03-29 13:45 ` Dave Hansen
@ 2018-03-29 13:47 ` Peter Zijlstra
2018-03-29 14:34 ` Dave Hansen
2018-03-29 14:37 ` Dave Hansen
1 sibling, 2 replies; 13+ messages in thread
From: Peter Zijlstra @ 2018-03-29 13:47 UTC (permalink / raw)
To: Thomas Gleixner
Cc: Alison Schofield, Ingo Molnar, Dave Hansen, Tony Luck, Tim Chen,
H. Peter Anvin, Borislav Petkov, David Rientjes, Igor Mammedov,
Prarit Bhargava, brice.goglin, x86, linux-kernel
On Thu, Mar 29, 2018 at 03:16:12PM +0200, Thomas Gleixner wrote:
> On Wed, 28 Mar 2018, Alison Schofield wrote:
> > From: Alison Schofield <alison.schofield@intel.com>
> >
> > Intel's Skylake Server CPUs have a different LLC topology than previous
> > generations. When in Sub-NUMA-Clustering (SNC) mode, the package is
> > divided into two "slices", each containing half the cores, half the LLC,
> > and one memory controller and each slice is enumerated to Linux as a
> > NUMA node. This is similar to how the cores and LLC were arranged
> > for the Cluster-On-Die (CoD) feature.
> >
> > CoD allowed the same cache line to be present in each half of the LLC.
> > But, with SNC, each line is only ever present in *one* slice. This
> > means that the portion of the LLC *available* to a CPU depends on the
> > data being accessed:
> >
> > Remote socket: entire package LLC is shared
> > Local socket->local slice: data goes into local slice LLC
> > Local socket->remote slice: data goes into remote-slice LLC. Slightly
> > higher latency than local slice LLC.
> >
> > The biggest implication from this is that a process accessing all
> > NUMA-local memory only sees half the LLC capacity.
> >
> > The CPU describes its cache hierarchy with the CPUID instruction. One
> > of the CPUID leaves enumerates the "logical processors sharing this
> > cache". This information is used for scheduling decisions so that tasks
> > move more freely between CPUs sharing the cache.
> >
> > But, the CPUID for the SNC configuration discussed above enumerates
> > the LLC as being shared by the entire package. This is not 100%
> > precise because the entire cache is not usable by all accesses. But,
> > it *is* the way the hardware enumerates itself, and this is not likely
> > to change.
> >
> > This breaks the sane_topology() check in the smpboot.c code because
> > this topology is considered not-sane. To fix this, add a vendor and
> > model specific check to never call topology_sane() for these systems.
> > Also, just like "Cluster-on-Die" we throw out the "coregroup"
> > sched_domain_topology_level and use NUMA information from the SRAT
> > alone.
> >
> > This is OK at least on the hardware we are immediately concerned about
> > because the LLC sharing happens at both the slice and at the package
> > level, which are also NUMA boundaries.
>
> So that addresses the scheduler interaction, but it still leaves the
> information in the sysfs files unchanged. See cpu/intel_cacheinfo.c. There
> are applications which use that information so it should be correct.
Not unchanged, it actually breaks it. The patch doesn't consider the
user-visible impact of the mask changes at all, and that is a problem.
The issue is that HPC workloads care about cache-size-per-cpu measure,
and the way they go about obtaining that is reading the cache-size and
dividing it by the h-weight of the cache-mask.
Now the patch does in fact change the cache-mask as exposed to
userspace, it however does _NOT_ change the cache-size. This means that
anybody using the values from sysfs to compute size/weight, now gets
double the value they ought to get.
So either it must not change the llc-mask, or it must also change the llc-size.
Further I think Dave argued that we should not change the llc-size,
because while SNC presents a subset of the cache to local CPUs, for
remote data the whole cache is still available, again something some
applications might rely on.
Which then leads to the conclusion that the current:
> + /* Do not use LLC for scheduler decisions: */
> + return false;
is wrong. Also, that comment is *completely* wrong, since the return
value has *nothing* to do with scheduler decisions whatsoever; you
affected that by setting:
> + x86_has_numa_in_package = true;
So please, try again.
* Re: [PATCH v3] x86,sched: allow topologies where NUMA nodes share an LLC
2018-03-29 13:45 ` Dave Hansen
@ 2018-03-29 13:57 ` Peter Zijlstra
0 siblings, 0 replies; 13+ messages in thread
From: Peter Zijlstra @ 2018-03-29 13:57 UTC (permalink / raw)
To: Dave Hansen
Cc: Thomas Gleixner, Alison Schofield, Ingo Molnar, Tony Luck,
Tim Chen, H. Peter Anvin, Borislav Petkov, David Rientjes,
Igor Mammedov, Prarit Bhargava, brice.goglin, x86, linux-kernel
On Thu, Mar 29, 2018 at 06:45:12AM -0700, Dave Hansen wrote:
> On 03/29/2018 06:16 AM, Thomas Gleixner wrote:
> >> This is OK at least on the hardware we are immediately concerned about
> >> because the LLC sharing happens at both the slice and at the package
> >> level, which are also NUMA boundaries.
> > So that addresses the scheduler interaction, but it still leaves the
> > information in the sysfs files unchanged. See cpu/intel_cacheinfo.c. There
> > are applications which use that information so it should be correct.
>
> Were you thinking of shared_cpu_list/map?
Yes, the fact that those are changed and the Changelog doesn't mention
them is a giant fail all by itself.
> The information in there is
> correct for core->off-package access. It is not correct for
> core->on-package access, unless that access is perfectly interleaved
> across both package "slices".
The fact that 'simple' measures like cache/cpu are now completely broken
is a problem.
* Re: [PATCH v3] x86,sched: allow topologies where NUMA nodes share an LLC
2018-03-29 13:47 ` Peter Zijlstra
@ 2018-03-29 14:34 ` Dave Hansen
2018-03-29 16:09 ` Peter Zijlstra
2018-03-29 14:37 ` Dave Hansen
1 sibling, 1 reply; 13+ messages in thread
From: Dave Hansen @ 2018-03-29 14:34 UTC (permalink / raw)
To: Peter Zijlstra, Thomas Gleixner
Cc: Alison Schofield, Ingo Molnar, Tony Luck, Tim Chen,
H. Peter Anvin, Borislav Petkov, David Rientjes, Igor Mammedov,
Prarit Bhargava, brice.goglin, x86, linux-kernel
On 03/29/2018 06:47 AM, Peter Zijlstra wrote:
> The issue is that HPC workloads care about cache-size-per-cpu measure,
> and the way they go about obtaining that is reading the cache-size and
> dividing it by the h-weight of the cache-mask.
That works, but only if the memory being accessed is slice/node-local.
If it's spread across the package, it'll be wrong.
But, the HPC folks are the ones that are the most likely to have good
NUMA affinity, so that would seem to point us in the direction of both
halving the size and the mask so that the LLC _looks_ split to userspace.
> Now the patch does in fact change the cache-mask as exposed to
> userspace, it however does _NOT_ change the cache-size. This means that
> anybody using the values from sysfs to compute size/weight, now gets
> double the value they ought to get.
>
> So either is must not change the llc-mask, or also change the llc-size.
IOW, don't make it look like we've either doubled or halved the exposed
size of the llc.
> Which then leads to the conclusion that the current:
>
>> + /* Do not use LLC for scheduler decisions: */
>> + return false;
>
> is wrong. Also, that comment is *completely* wrong, since the return
> value has *nothing* to do with scheduler decisions
OK, got it. That comment betrayed my ignorance. I'm glad we put it there.
What should we say, though?
/*
* false means 'c' does not share the LLC of 'o'.
* Note: this decision gets reflected all the way
* out to userspace
*/
return false;
* Re: [PATCH v3] x86,sched: allow topologies where NUMA nodes share an LLC
2018-03-29 13:47 ` Peter Zijlstra
2018-03-29 14:34 ` Dave Hansen
@ 2018-03-29 14:37 ` Dave Hansen
2018-03-29 14:55 ` Peter Zijlstra
1 sibling, 1 reply; 13+ messages in thread
From: Dave Hansen @ 2018-03-29 14:37 UTC (permalink / raw)
To: Peter Zijlstra, Thomas Gleixner
Cc: Alison Schofield, Ingo Molnar, Tony Luck, Tim Chen,
H. Peter Anvin, Borislav Petkov, David Rientjes, Igor Mammedov,
Prarit Bhargava, brice.goglin, x86, linux-kernel
On 03/29/2018 06:47 AM, Peter Zijlstra wrote:
> Further I think Dave argued that we should not change the llc-size,
> because while SNC presents a subset of the cache to local CPUs, for
> remote data the whole cache is still available, again something some
> applications might rely on.
BTW, I may have argued this in the past, but I don't think it's the best
thing to do.
If anything, we should optimize for the _users_ of this information: the
performance-sensitive ones who are digging up the cache topology. They
are also likely to be the most NUMA-affinitized and stay node-local with
much of their memory traffic. That would seem to point us in the
direction of enumerating two separate, half-sized LLCs "shared" only by
the slice when SNC mode is on.
That's what I've argued to the hardware folks lately, at least.
* Re: [PATCH v3] x86,sched: allow topologies where NUMA nodes share an LLC
2018-03-29 14:37 ` Dave Hansen
@ 2018-03-29 14:55 ` Peter Zijlstra
0 siblings, 0 replies; 13+ messages in thread
From: Peter Zijlstra @ 2018-03-29 14:55 UTC (permalink / raw)
To: Dave Hansen
Cc: Thomas Gleixner, Alison Schofield, Ingo Molnar, Tony Luck,
Tim Chen, H. Peter Anvin, Borislav Petkov, David Rientjes,
Igor Mammedov, Prarit Bhargava, brice.goglin, x86, linux-kernel
On Thu, Mar 29, 2018 at 07:37:29AM -0700, Dave Hansen wrote:
> On 03/29/2018 06:47 AM, Peter Zijlstra wrote:
> > Further I think Dave argued that we should not change the llc-size,
> > because while SNC presents a subset of the cache to local CPUs, for
> > remote data the whole cache is still available, again something some
> > applications might rely on.
>
> BTW, I may have argued this in the past, but I don't think it's the best
> thing to do.
>
> If anything, we should optimize for the _users_ of this information: the
> performance-sensitive ones who are digging up the cache topology. They
> are also likely to be the most NUMA-affinitized and stay node-local with
> much of their memory traffic. That would seem to point us in the
> direction of enumerating two separate, half-sized LLCs "shared" only by
> the slice when SNC mode is on.
>
> That's what I've argued to the hardware folks lately, at least.
Fair enough; then we need to change the reported cache-size and mention all
this in (preferably) both comment and Changelog.
* Re: [PATCH v3] x86,sched: allow topologies where NUMA nodes share an LLC
2018-03-29 14:34 ` Dave Hansen
@ 2018-03-29 16:09 ` Peter Zijlstra
2018-03-29 16:25 ` Dave Hansen
0 siblings, 1 reply; 13+ messages in thread
From: Peter Zijlstra @ 2018-03-29 16:09 UTC (permalink / raw)
To: Dave Hansen
Cc: Thomas Gleixner, Alison Schofield, Ingo Molnar, Tony Luck,
Tim Chen, H. Peter Anvin, Borislav Petkov, David Rientjes,
Igor Mammedov, Prarit Bhargava, brice.goglin, x86, linux-kernel
On Thu, Mar 29, 2018 at 07:34:58AM -0700, Dave Hansen wrote:
> What should we say, though?
>
> /*
> * false means 'c' does not share the LLC of 'o'.
> * Note: this decision gets reflected all the way
> * out to userspace
> */
> return false;
That works, I suppose. You might want to put in a reference to wherever
it is you shrink llc-size such that cache-per-cpu remains invariant.
* Re: [PATCH v3] x86,sched: allow topologies where NUMA nodes share an LLC
2018-03-29 16:09 ` Peter Zijlstra
@ 2018-03-29 16:25 ` Dave Hansen
2018-03-29 16:29 ` Luck, Tony
0 siblings, 1 reply; 13+ messages in thread
From: Dave Hansen @ 2018-03-29 16:25 UTC (permalink / raw)
To: Peter Zijlstra
Cc: Thomas Gleixner, Alison Schofield, Ingo Molnar, Tony Luck,
Tim Chen, H. Peter Anvin, Borislav Petkov, David Rientjes,
Igor Mammedov, Prarit Bhargava, brice.goglin, x86, linux-kernel
On 03/29/2018 09:09 AM, Peter Zijlstra wrote:
> On Thu, Mar 29, 2018 at 07:34:58AM -0700, Dave Hansen wrote:
>> What should we say, though?
>>
>> /*
>> * false means 'c' does not share the LLC of 'o'.
>> * Note: this decision gets reflected all the way
>> * out to userspace
>> */
>> return false;
> That works, I suppose. You might want to put in a refernce to wherever
> it is you shrink llc-size such that cache-per-cpu remains invariant.
Hmm, if we shrink llc-size by splitting it, do we also need to create a
unique "id" for each slice?
* RE: [PATCH v3] x86,sched: allow topologies where NUMA nodes share an LLC
2018-03-29 16:25 ` Dave Hansen
@ 2018-03-29 16:29 ` Luck, Tony
0 siblings, 0 replies; 13+ messages in thread
From: Luck, Tony @ 2018-03-29 16:29 UTC (permalink / raw)
To: Dave Hansen, Peter Zijlstra
Cc: Thomas Gleixner, Schofield, Alison, Ingo Molnar, Tim Chen,
H. Peter Anvin, Borislav Petkov, David Rientjes, Igor Mammedov,
Prarit Bhargava, brice.goglin@gmail.com, x86@kernel.org,
linux-kernel@vger.kernel.org
> Hmm, if we shrink llc-size by splitting it, do we also need to create a
> unique "id" for each slice?
RDT uses the cache id ... but it doesn't play well with cluster on die mode ... so our
recommendation is to not use RDT if COD mode is enabled.
If the result of these changes happens to be some easy way for RDT to tell that
COD mode is on, we'll add a check to the init/mount code to enforce not using
them together.
-Tony
* Re: [PATCH v3] x86,sched: allow topologies where NUMA nodes share an LLC
2018-03-29 0:00 [PATCH v3] x86,sched: allow topologies where NUMA nodes share an LLC Alison Schofield
2018-03-29 13:16 ` Thomas Gleixner
@ 2018-03-30 3:20 ` kbuild test robot
2018-03-30 17:34 ` Alison Schofield
2 siblings, 0 replies; 13+ messages in thread
From: kbuild test robot @ 2018-03-30 3:20 UTC (permalink / raw)
To: Alison Schofield
Cc: kbuild-all, Thomas Gleixner, Ingo Molnar, Dave Hansen, Tony Luck,
Tim Chen, H. Peter Anvin, Borislav Petkov, Peter Zijlstra,
David Rientjes, Igor Mammedov, Prarit Bhargava, brice.goglin, x86,
linux-kernel
[-- Attachment #1: Type: text/plain, Size: 1154 bytes --]
Hi Alison,
Thank you for the patch! Perhaps something to improve:
[auto build test WARNING on v4.16-rc7]
[also build test WARNING on next-20180329]
[cannot apply to tip/x86/core]
[if your patch is applied to the wrong git tree, please drop us a note to help improve the system]
url: https://github.com/0day-ci/linux/commits/Alison-Schofield/x86-sched-allow-topologies-where-NUMA-nodes-share-an-LLC/20180330-070743
config: x86_64-allyesdebian (attached as .config)
compiler: gcc-7 (Debian 7.3.0-1) 7.3.0
reproduce:
# save the attached .config to linux build tree
make ARCH=x86_64
All warnings (new ones prefixed by >>):
>> WARNING: vmlinux.o(.text+0x9601e): Section mismatch in reference from the function set_cpu_sibling_map() to the variable .init.rodata:snc_cpu
The function set_cpu_sibling_map() references
the variable __initconst snc_cpu.
This is often because set_cpu_sibling_map lacks a __initconst
annotation or the annotation of snc_cpu is wrong.
---
0-DAY kernel test infrastructure Open Source Technology Center
https://lists.01.org/pipermail/kbuild-all Intel Corporation
[-- Attachment #2: .config.gz --]
[-- Type: application/gzip, Size: 40066 bytes --]
* Re: [PATCH v3] x86,sched: allow topologies where NUMA nodes share an LLC
2018-03-29 0:00 [PATCH v3] x86,sched: allow topologies where NUMA nodes share an LLC Alison Schofield
2018-03-29 13:16 ` Thomas Gleixner
2018-03-30 3:20 ` kbuild test robot
@ 2018-03-30 17:34 ` Alison Schofield
2 siblings, 0 replies; 13+ messages in thread
From: Alison Schofield @ 2018-03-30 17:34 UTC (permalink / raw)
To: Thomas Gleixner, Ingo Molnar
Cc: Dave Hansen, Tony Luck, Tim Chen, H. Peter Anvin, Borislav Petkov,
Peter Zijlstra, David Rientjes, Igor Mammedov, Prarit Bhargava,
brice.goglin, x86, linux-kernel
On Wed, Mar 28, 2018 at 05:00:24PM -0700, Alison Schofield wrote:
> From: Alison Schofield <alison.schofield@intel.com>
>
> Intel's Skylake Server CPUs have a different LLC topology than previous
> generations. When in Sub-NUMA-Clustering (SNC) mode, the package is
> divided into two "slices", each containing half the cores, half the LLC,
> and one memory controller and each slice is enumerated to Linux as a
> NUMA node. This is similar to how the cores and LLC were arranged
> for the Cluster-On-Die (CoD) feature.
>
> CoD allowed the same cache line to be present in each half of the LLC.
> But, with SNC, each line is only ever present in *one* slice. This
> means that the portion of the LLC *available* to a CPU depends on the
> data being accessed:
>
> Remote socket: entire package LLC is shared
> Local socket->local slice: data goes into local slice LLC
> Local socket->remote slice: data goes into remote-slice LLC. Slightly
> higher latency than local slice LLC.
>
> The biggest implication from this is that a process accessing all
> NUMA-local memory only sees half the LLC capacity.
>
> The CPU describes its cache hierarchy with the CPUID instruction. One
> of the CPUID leaves enumerates the "logical processors sharing this
> cache". This information is used for scheduling decisions so that tasks
> move more freely between CPUs sharing the cache.
>
> But, the CPUID for the SNC configuration discussed above enumerates
> the LLC as being shared by the entire package. This is not 100%
> precise because the entire cache is not usable by all accesses. But,
> it *is* the way the hardware enumerates itself, and this is not likely
> to change.
>
> This breaks the sane_topology() check in the smpboot.c code because
> this topology is considered not-sane. To fix this, add a vendor and
> model specific check to never call topology_sane() for these systems.
> Also, just like "Cluster-on-Die" we throw out the "coregroup"
> sched_domain_topology_level and use NUMA information from the SRAT
> alone.
>
> This is OK at least on the hardware we are immediately concerned about
> because the LLC sharing happens at both the slice and at the package
> level, which are also NUMA boundaries.
>
> This patch eliminates a warning that looks like this:
>
> sched: CPU #3's llc-sibling CPU #0 is not on the same node! [node: 1 != 0]. Ignoring dependency.
>
Let's see if I'm getting a better grasp of this:
My goal here is to suppress that WARNING message from topology_sane().
(We have a customer who is seeing the WARNING and would like it to go away)
The sysfs-exported info for SNC systems is 'not precise'. It reports the
entire LLC as available. This imprecise data existed before this patch and
exists after this patch. This is a problem, agreed.
PeterZ:
At first I thought you were saying that this patch itself broke the
sysfs info. I experimented with that and found no differences in sysfs
info before/after the patch and with SNC on/off. That makes me think
you are saying that we should not say this topology is 'Allowed' when
the sysfs data is wrong. (ie. That WARNING serves a purpose)
If you did indeed mean that the patch breaks the sysfs data, please
point me real close! ie. How does it change the cache-mask as exposed
to userspace?
All:
Here are 3 alternatives:
1) Keep patch code basically the same and improve the comments & commit,
being very explicit about the sysfs info issue.
2) Change the way the LLC-size is reported.
Enumerate "two separate, half-sized LLCs shared only by the slice when
SNC mode is on."
3) Do not export the sysfs info that is wrong. Userspace cannot make bad
decisions based upon it.
Can we do 1) now and then follow with 2) or 3)?
Thanks for all the reviews/comments,
alisons