From: Andrew Cooper <andrew.cooper3@citrix.com>
To: Xen-devel List <xen-devel@lists.xen.org>
Subject: Re: Hwloc with Xen host topology
Date: Thu, 2 Jan 2014 21:38:37 +0000
Message-ID: <52C5DC5D.3070106@citrix.com>
In-Reply-To: <52C5CB89.70804@citrix.com>
On 02/01/14 20:26, Andrew Cooper wrote:
> Hello,
>
> For some post-holiday hacking, I tried playing around with getting hwloc
> to understand Xen's full system topology, rather than the faked up
> topology dom0 receives.
>
> I present here some code which works (on some interestingly shaped
> servers in the XenRT test pool), and some discoveries/problems found
> along the way.
>
> Code can be found at:
> http://xenbits.xen.org/gitweb/?p=people/andrewcoop/hwloc.git;a=shortlog;h=refs/heads/hwloc-xen-topology-v1
>
> You will need a libxc with the following patch:
> http://xenbits.xen.org/gitweb/?p=people/andrewcoop/xen.git;a=shortlog;h=refs/heads/hwloc-support-experimental
>
> Instructions for use found in the commit message of the hwloc.git tree.
> It is worth noting that with the help of the hwloc-devel list, v2 is
> already quite a bit different, but is still in-progress.
>
>
> Anyway, for the Xen issues I encountered. If memory serves, some of
> them might have been brought up on xen-devel in the past.
>
> The first problem, as indicated by the extra patch required against
> libxc, is that the current interface for xc_{topology,numa}info() sucks
> if you are not libxl. The current interface forces the caller to handle
> hypercall bounce buffering, which is even harder to do sensibly as half
> the bounce buffer macros are private to libxc. Bounce buffering is the
> kind of detail which libxc should deal with on behalf of its callers,
> and should only be exposed to callers who want to do something special.
>
> My patch implements xc_{topology,numa}info_bounced() (name up for
> reconsideration) which takes some uint{32,64}_t arrays (optionally
> NULL), and properly bounce buffers them. This results in not needing to
> mess around with any of the bounce buffering in hwloc.
>
> The second problem is with the choice of max_node_id, which is
> MAX_NUMNODES-1, or 63. This means that the toolstack has to bounce a
> 16k buffer (64 * 64 * uint32_t) to get the node-node distances, even on
> a single or dual node system. The issue is less pronounced with the
> node_to_mem{size,free} arrays, which only have to be 64 * uint64_t long,
> but is still wasteful, especially if node_to_memfree is being periodically
> polled. Having nr_node_ids set dynamically (similar to nr_cpu_ids)
> would alleviate this overhead, as the number of nodes available on the
> system will unconditionally be static after boot.
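To put numbers on the overhead described above, here is a minimal sketch of
the buffer sizes involved. The helper names are illustrative only and are
not part of libxc or Xen; MAX_NUMNODES is taken as 64, as stated in the
mail.

```c
#include <stddef.h>
#include <stdint.h>

/* MAX_NUMNODES as described in the mail (max_node_id is 63). */
#define MAX_NUMNODES 64

/* Bytes needed for an n x n node-to-node distance matrix of uint32_t. */
static size_t distance_matrix_bytes(size_t nr_nodes)
{
    return nr_nodes * nr_nodes * sizeof(uint32_t);
}

/* Bytes needed for one per-node uint64_t array (memsize or memfree). */
static size_t per_node_array_bytes(size_t nr_nodes)
{
    return nr_nodes * sizeof(uint64_t);
}
```

With the fixed limit the distance matrix is 16384 bytes, whereas a
dual-node system sized by a dynamic node count would need only 16.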
>
> The third problem is the one which created the only real bug in my hwloc
> implementation. Cores are numbered per-socket in Xen, while sockets,
> numa nodes and cpus are numbered on an absolute scale. There is
> currently a gross hack in my hwloc code which adds (socket_id *
> cores_per_socket * threads_per_core) onto each core id to make them
> similarly numbered on an absolute scale. This is fine for a homogeneous
> system, but not for a heterogeneous system.
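The renumbering hack described above can be sketched roughly as follows.
This is an illustration of the formula in the mail, not the code from the
hwloc tree; the function name and parameter names are made up here.

```c
#include <stdint.h>

/*
 * Hypothetical helper: turn a per-socket core id into an id on an
 * absolute scale by offsetting it with a fixed per-socket stride.
 * It assumes every socket has the same cores_per_socket and
 * threads_per_core, which is exactly what fails on a heterogeneous
 * system.
 */
static uint32_t absolute_core_id(uint32_t socket_id, uint32_t core_id,
                                 uint32_t cores_per_socket,
                                 uint32_t threads_per_core)
{
    return socket_id * cores_per_socket * threads_per_core + core_id;
}
```

For 8 cores per socket and 2 threads per core, core 3 on socket 1 maps to
1 * 8 * 2 + 3 = 19. If socket 0 had a different core count, that stride
would be wrong for every later socket.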
>
> Relatedly, when debugging the third problem on an AMD Opteron 63xx
> system, I noticed that it advertises 8 cores per socket and 2 threads
> per core, but numbers the cores 1-16 on each socket. This is broken.
> It should either be 16 cores per socket and 1 thread per core, or really
> 8 cores per socket and 2 threads per core, with the cores numbered 1-8
> and each pair of cpus with the same core id.
>
> Fourth, the API for identifying offline cpus is broken. To mark a cpu
> as offline, it has its topology information shot, meaning that an
> offline cpu cannot be positively located in the topology. I happen to
> know it can be, as Xen writes the records sequentially, so a single offline
> cpu can be identified based on the valid information either side, but a
> block of offline cpus becomes rather harder to locate. Ideally,
> XEN_SYSCTL_topologyinfo should return 4 parameters, with one of them
> being a bitmap from 0 to max_cpu_index identifying which cpus are
> online, and writing the correct core/socket/node information (when
> known) into the other parameters. However, being an ABI now makes this
> somewhat harder to do.
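As a rough sketch of what a caller has to do today, the following derives
an online bitmap from a per-cpu core id array by looking for an invalid
marker. The sentinel value and the function name are assumptions made for
illustration, not taken from the Xen headers or from the hwloc tree.

```c
#include <stddef.h>
#include <stdint.h>

/* Assumed sentinel for "no topology information"; illustrative only. */
#define TOPOLOGY_ID_INVALID (~(uint32_t)0)

/*
 * Set bit i of online_map for every cpu whose core id is not the sentinel.
 * online_map must hold at least (nr_cpus + 7) / 8 bytes.
 * Returns the number of cpus found online.
 */
static size_t build_online_map(const uint32_t *cpu_to_core, size_t nr_cpus,
                               uint8_t *online_map)
{
    size_t i, online = 0;

    /* Start from an all-offline map. */
    for (i = 0; i < (nr_cpus + 7) / 8; i++)
        online_map[i] = 0;

    for (i = 0; i < nr_cpus; i++) {
        if (cpu_to_core[i] != TOPOLOGY_ID_INVALID) {
            online_map[i / 8] |= (uint8_t)(1u << (i % 8));
            online++;
        }
    }
    return online;
}
```

Note that the bitmap only says which cpus are offline. Their core, socket
and node ids are gone, which is the gap a separate online bitmap in the
sysctl would close.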
>
> Fifth, Xen has no way of querying the cpu cache information. hwloc
> likes to know the entire cache hierarchy, which is arguably more useful
> for its primary purpose of optimising HPC than for simply viewing the
> Xen topology, but is nonetheless a missing feature as far as Xen is
> concerned. I was considering adding a sysctl along the lines of "please
> execute cpuid with these parameters on that pcpu and give me the answers".
>
> Sixth and finally, which is also the hardest problem conceptually to
> solve, Xen has no notion of IO proximity. Devices on the system can
> report their location using _PXM() methods in the DSDT/SSDTs, but only
> dom0 can gather this information, and dom0 doesn't have an accurate view
> of the NUMA or CPU topology.
Seventh, from some very up-to-the-minute hacking: XEN_SYSCTL_numainfo is
not giving back valid information.

From a Haswell-EP SDP, running XenServer trunk (xen-4.3 based):
Xen NUMA information:
numa count 64, max numa id 1
node[ 0], size 19327352832, free 15262810112
node[ 1], size 17179869184, free 15961382912
Which sums to ~2GB more than the total system ram of:
(XEN) System RAM: 32320MB (33096268kB)
It would appear that a node's memsize includes IO space encompassed by the
node's start/end pfns, rather than just the RAM contained inside them.
(XEN) SRAT: Node 0 PXM 0 0-480000000
(XEN) SRAT: Node 1 PXM 1 480000000-880000000
Is this intentional or an oversight?
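The figures above can be cross-checked with a small sketch. All constants
are copied from the numainfo and SRAT output in this mail; the helper names
are illustrative, and the SRAT ranges are assumed to be half-open
(start inclusive, end exclusive).

```c
#include <stdint.h>

/* Figures copied from the numainfo and SRAT output quoted above. */
static const uint64_t node_size[2]  = { 19327352832ULL, 17179869184ULL };
static const uint64_t srat_start[2] = { 0x0ULL,         0x480000000ULL };
static const uint64_t srat_end[2]   = { 0x480000000ULL, 0x880000000ULL };
static const uint64_t system_ram_kb = 33096268ULL;

/* Size in bytes of the address range SRAT assigns to a node. */
static uint64_t srat_span(int node)
{
    return srat_end[node] - srat_start[node];
}

/* Bytes by which the reported node sizes exceed the reported system RAM. */
static uint64_t excess_bytes(void)
{
    return node_size[0] + node_size[1] - system_ram_kb * 1024;
}
```

Each reported node size equals its SRAT span exactly (18GiB and 16GiB),
and the excess over system RAM comes to roughly 2.4GiB, which fits the
reading that the whole address range is being counted rather than just
the RAM inside it.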
~Andrew