From: Dave Hansen <dave@sr71.net>
To: a.p.zijlstra@chello.nl
Cc: mingo@kernel.org, hpa@linux.intel.com, brice.goglin@gmail.com,
bp@alien8.de, linux-kernel@vger.kernel.org,
Dave Hansen <dave@sr71.net>
Subject: [RFC][PATCH 0/6] fix topology for multi-NUMA-node CPUs
Date: Wed, 17 Sep 2014 15:33:10 -0700
Message-ID: <20140917223310.026BCC2C@viggo.jf.intel.com>
This is a big fat RFC. It takes quite a few liberties with the
multi-core topology level that I'm not completely comfortable
with.
It has only been tested lightly.
Full dmesg for a Cluster-on-Die system with this set applied,
and sched_debug on the command-line is here:
http://sr71.net/~dave/intel/full-dmesg-hswep-20140917.txt
---
I'm getting the spew below when booting with Haswell (Xeon
E5-2699 v3) CPUs and the "Cluster-on-Die" (CoD) feature enabled
in the BIOS. It seems similar to the issue that some folks from
AMD ran into on their systems and addressed in this commit:
http://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/?id=161270fc1f9ddfc17154e0d49291472a9cdef7db
Both the Intel and AMD systems break an assumption enforced by
topology_sane(): a socket may not contain more than one NUMA
node.
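The assumption can be illustrated with a toy check (hypothetical
"cpu package node" triples, not read from a real system): flag any
package seen on more than one node, which is roughly the condition
topology_sane() warns about.

```shell
# Hypothetical "cpu package node" triples; cpu9 puts pkg0 on a second
# node, which is exactly the case topology_sane() considers insane.
printf '%s\n' \
  "cpu0  pkg0 node0" \
  "cpu9  pkg0 node1" \
  "cpu18 pkg1 node2" |
awk '{ if (($2 in node) && node[$2] != $3)
         printf "package %s spans nodes %s and %s\n", $2, node[$2], $3
       else
         node[$2] = $3 }'
# prints: package pkg0 spans nodes node0 and node1
```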
AMD special-cased their systems by looking for a cpuid flag. The
Intel mode depends on BIOS options, and I do not know of any way
in which it is enumerated other than the tables parsed during the
CPU bringup process.
This also fixes a sysfs bug: CPUs with the same 'physical_package_id'
in /sys/devices/system/cpu/cpu*/topology/ are not all listed together
in the same 'core_siblings_list'. That violates a statement from
Documentation/ABI/testing/sysfs-devices-system-cpu:

	core_siblings: internal kernel map of cpu#'s hardware threads
	within the same physical_package_id.

	core_siblings_list: human-readable list of the logical CPU
	numbers within the same physical_package_id as cpu#.
These sysfs effects confuse the hwloc tool into reporting more
sockets than are physically present.
Before this set, there are two packages:
# cd /sys/devices/system/cpu/
# cat cpu*/topology/physical_package_id | sort | uniq -c
18 0
18 1
But 4 _sets_ of core siblings:
# cat cpu*/topology/core_siblings_list | sort | uniq -c
9 0-8
9 18-26
9 27-35
9 9-17
After this set, there are only 2 sets of core siblings, which
is what we expect for a 2-socket system.
# cat cpu*/topology/physical_package_id | sort | uniq -c
18 0
18 1
# cat cpu*/topology/core_siblings_list | sort | uniq -c
18 0-17
18 18-35
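The invariant the before/after outputs illustrate can be spot-checked
mechanically. This is only a sketch, not part of the patch set: the
"before" values from above are fed in literally instead of being read
from /sys.

```shell
# On a sane system, the number of distinct physical_package_id values
# equals the number of distinct core_siblings_list values. These are
# the literal "before" values from the output above.
pkgs=$(printf '0\n0\n1\n1\n'              | sort -u | wc -l)
sets=$(printf '0-8\n9-17\n18-26\n27-35\n' | sort -u | wc -l)
if [ "$pkgs" -ne "$sets" ]; then
	echo "broken: $pkgs packages but $sets core-sibling sets"
fi
```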
Example spew:
...
NMI watchdog: enabled on all CPUs, permanently consumes one hw-PMU counter.
#2 #3 #4 #5 #6 #7 #8
.... node #1, CPUs: #9
------------[ cut here ]------------
WARNING: CPU: 9 PID: 0 at /home/ak/hle/linux-hle-2.6/arch/x86/kernel/smpboot.c:306 topology_sane.isra.2+0x74/0x90()
sched: CPU #9's mc-sibling CPU #0 is not on the same node! [node: 1 != 0]. Ignoring dependency.
Modules linked in:
CPU: 9 PID: 0 Comm: swapper/9 Not tainted 3.17.0-rc1-00293-g8e01c4d-dirty #631
Hardware name: Intel Corporation S2600WTT/S2600WTT, BIOS GRNDSDP1.86B.0036.R05.1407140519 07/14/2014
0000000000000009 ffff88046ddabe00 ffffffff8172e485 ffff88046ddabe48
ffff88046ddabe38 ffffffff8109691d 000000000000b001 0000000000000009
ffff88086fc12580 000000000000b020 0000000000000009 ffff88046ddabe98
Call Trace:
[<ffffffff8172e485>] dump_stack+0x45/0x56
[<ffffffff8109691d>] warn_slowpath_common+0x7d/0xa0
[<ffffffff8109698c>] warn_slowpath_fmt+0x4c/0x50
[<ffffffff81074f94>] topology_sane.isra.2+0x74/0x90
[<ffffffff8107530e>] set_cpu_sibling_map+0x31e/0x4f0
[<ffffffff8107568d>] start_secondary+0x1ad/0x240
---[ end trace 3fe5f587a9fcde61 ]---
#10 #11 #12 #13 #14 #15 #16 #17
.... node #2, CPUs: #18 #19 #20 #21 #22 #23 #24 #25 #26
.... node #3, CPUs: #27 #28 #29 #30 #31 #32 #33 #34 #35
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: "H. Peter Anvin" <hpa@linux.intel.com>
Cc: ak@linux.intel.com
Cc: brice.goglin@gmail.com
Cc: bp@alien8.de
Thread overview: 14+ messages
2014-09-17 22:33 Dave Hansen [this message]
2014-09-17 22:33 ` [RFC][PATCH 1/6] topology: rename topology_core_cpumask() to topology_package_cpumask() Dave Hansen
2014-09-17 22:33 ` [RFC][PATCH 2/6] x86: introduce cpumask specifically for the package Dave Hansen
2014-09-18 14:57 ` Peter Zijlstra
2014-09-17 22:33 ` [RFC][PATCH 3/6] x86: use package_map instead of core_map for sysfs Dave Hansen
2014-09-17 22:33 ` [RFC][PATCH 4/6] sched: eliminate "DIE" domain level when NUMA present Dave Hansen
2014-09-18 17:28 ` Peter Zijlstra
2014-09-17 22:33 ` [RFC][PATCH 5/6] sched: keep MC domain from crossing nodes OR packages Dave Hansen
2014-09-17 22:33 ` [RFC][PATCH 6/6] sched: consolidate config options Dave Hansen
2014-09-18 17:29 ` Peter Zijlstra
2014-09-19 19:15 ` Dave Hansen
2014-09-19 23:03 ` Peter Zijlstra
2014-09-18 7:45 ` [RFC][PATCH 0/6] fix topology for multi-NUMA-node CPUs Borislav Petkov
[not found] ` <CAOjmkp8EGO0jicmdO=p6ATHz-hUJmWb+xoBLjOdLBUwwGzyhhg@mail.gmail.com>
2014-09-22 15:54 ` Aravind Gopalakrishnan