* Hwloc with Xen host topology
@ 2014-01-02 20:26 Andrew Cooper
2014-01-02 21:24 ` Samuel Thibault
2014-01-02 21:38 ` Andrew Cooper
0 siblings, 2 replies; 7+ messages in thread
From: Andrew Cooper @ 2014-01-02 20:26 UTC (permalink / raw)
To: Xen-devel List
Hello,
For some post-holiday hacking, I tried playing around with getting hwloc
to understand Xen's full system topology, rather than the faked up
topology dom0 receives.
I present here some code which works (on some interestingly shaped
servers in the XenRT test pool), and some discoveries/problems found
along the way.
Code can be found at:
http://xenbits.xen.org/gitweb/?p=people/andrewcoop/hwloc.git;a=shortlog;h=refs/heads/hwloc-xen-topology-v1
You will need a libxc with the following patch:
http://xenbits.xen.org/gitweb/?p=people/andrewcoop/xen.git;a=shortlog;h=refs/heads/hwloc-support-experimental
Instructions for use found in the commit message of the hwloc.git tree.
It is worth noting that with the help of the hwloc-devel list, v2 is
already quite a bit different, but is still in-progress.
Anyway, on to the Xen issues I encountered. If memory serves, some of
them might have been brought up on xen-devel in the past.
The first problem, as indicated by the extra patch required against
libxc, is that the current interface for xc_{topology,numa}info() sucks
if you are not libxl. The current interface forces the caller to handle
hypercall bounce buffering, which is even harder to do sensibly as half
the bounce buffer macros are private to libxc. Bounce buffering is the
kind of detail which libxc should deal with on behalf of its callers,
and should only be exposed to callers who want to do something special.
My patch implements xc_{topology,numa}info_bounced() (name up for
reconsideration), which take some uint{32,64}_t arrays (optionally
NULL) and properly bounce buffer them. This results in not needing to
mess around with any of the bounce buffering in hwloc.
The second problem is with the choice of max_node_id, which is
MAX_NUMNODES-1, or 63. This means that the toolstack has to bounce a
16k buffer (64 * 64 * uint32_t) to get the node-node distances, even on
a single or dual node system. The issue is less pronounced with the
node_to_mem{size,free} arrays, which only have to be 64 * uint64_t long,
but it is still wasteful, especially if node_to_memfree is being
periodically polled. Having nr_node_ids set dynamically (similar to
nr_cpu_ids) would alleviate this overhead, as the number of nodes
available on the system will unconditionally be static after boot.
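To put numbers on the overhead, here is a back-of-the-envelope sketch in
Python. It does not call into Xen at all; the helper name
numainfo_buffer_bytes() is made up for illustration, and the element
widths are simply the ones quoted above.

```python
# Sizes of the arrays a caller has to bounce for the NUMA information,
# using the element widths quoted in the text above.
MAX_NUMNODES = 64
UINT32_BYTES = 4
UINT64_BYTES = 8


def numainfo_buffer_bytes(nr_nodes):
    """Return (distance matrix bytes, per-node array bytes) for nr_nodes."""
    distances = nr_nodes * nr_nodes * UINT32_BYTES
    per_node = nr_nodes * UINT64_BYTES
    return distances, per_node


for nodes in (MAX_NUMNODES, 2, 1):
    dist, mem = numainfo_buffer_bytes(nodes)
    print(f"{nodes:2d} nodes: distances {dist} bytes, "
          f"memsize/memfree {mem} bytes each")
```

With a dynamic node count, a dual node system would need 16 bytes for
the distance matrix rather than 16384.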
The third problem is the one which created the only real bug in my hwloc
implementation. Cores are numbered per-socket in Xen, while sockets,
numa nodes and cpus are numbered on an absolute scale. There is
currently a gross hack in my hwloc code which adds (socket_id *
cores_per_socket * threads_per_core) onto each core id to make them
similarly numbered on an absolute scale. This is fine for a homogeneous
system, but not for a heterogeneous system.
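The renumbering described above can be sketched as follows. This is an
illustration in Python rather than the actual hwloc patch; the function
name and all of the socket/core counts are invented for the example.

```python
def absolute_core_id(socket_id, core_id, cores_per_socket, threads_per_core):
    """Offset a per-socket core id by a fixed per-socket stride."""
    return socket_id * cores_per_socket * threads_per_core + core_id


# Homogeneous: 2 sockets, 4 cores each, 2 threads per core (made-up numbers).
homogeneous = [(s, c) for s in range(2) for c in range(4)]
homogeneous_ids = [absolute_core_id(s, c, 4, 2) for s, c in homogeneous]

# Heterogeneous: the stride is taken from a 4-core socket, but suppose
# socket 0 reports a core id of 8. It then collides with socket 1 core 0.
clash = (absolute_core_id(0, 8, 4, 2), absolute_core_id(1, 0, 4, 2))

# Keying on the (socket, core) pair needs no stride at all.
pair_keys = {(0, 8), (1, 0)}

print(homogeneous_ids, clash, len(pair_keys))
```

The single stride only works when every socket shares the same shape,
which is why the hack breaks down once sockets differ.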
Relatedly, when debugging the third problem on an AMD Opteron 63xx
system, I noticed that it advertises 8 cores per socket and 2 threads
per core, but numbers the cores 1-16 on each socket. This is broken.
It should either be 16 cores per socket and 1 thread per core, or really
8 cores per socket and 2 threads per core, with the cores numbered 1-8
and each pair of cpus sharing the same core id.
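The inconsistency can be expressed as a small check. This is a sketch in
Python under the assumption that the per-cpu core ids for one socket are
available as a plain list; check_core_numbering() is a hypothetical
helper, not something hwloc or Xen provides.

```python
from collections import Counter


def check_core_numbering(core_ids, cores_per_socket, threads_per_core):
    """core_ids holds the core id reported for each cpu on one socket."""
    per_core = Counter(core_ids)
    distinct_ok = len(per_core) == cores_per_socket
    sharing_ok = all(n == threads_per_core for n in per_core.values())
    return distinct_ok and sharing_ok


observed = list(range(1, 17))                    # cores numbered 1-16
paired = [c for c in range(1, 9) for _ in range(2)]  # 1,1,2,2,...,8,8

print(check_core_numbering(observed, 8, 2))   # advertised shape
print(check_core_numbering(observed, 16, 1))  # first consistent option
print(check_core_numbering(paired, 8, 2))     # second consistent option
```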
Fourth, the API for identifying offline cpus is broken. To mark a cpu
as offline, it has its topology information shot, meaning that an
offline cpu cannot be positively located in the topology. In practice I
happen to know it can, as Xen writes the records sequentially, so a
single offline cpu can be identified based on the valid information
either side, but a block of offline cpus becomes rather harder to
locate. Ideally,
XEN_SYSCTL_topologyinfo should return 4 parameters, with one of them
being a bitmap from 0 to max_cpu_index identifying which cpus are
online, and writing the correct core/socket/node information (when
known) into the other parameters. However, as this is now an ABI, it is
somewhat harder to change.
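The neighbour-based guess for offline cpus can be sketched like this.
It is a Python illustration only: INVALID is an assumed stand-in for
whatever "no information" marker the hypervisor writes, and
infer_offline() is a hypothetical helper.

```python
INVALID = 0xFFFFFFFF  # assumed stand-in for the "no information" marker


def infer_offline(ids):
    """Map the index of each invalid entry to a guessed id, or None."""
    guesses = {}
    for i, value in enumerate(ids):
        if value != INVALID:
            continue
        left = ids[i - 1] if i > 0 else INVALID
        right = ids[i + 1] if i + 1 < len(ids) else INVALID
        # Only trust the guess when both neighbours are valid and agree.
        guesses[i] = left if left != INVALID and left == right else None
    return guesses


# Made-up cpu_to_socket array: cpu 2 is a lone offline cpu, cpus 5-6
# form an offline block.
cpu_to_socket = [0, 0, INVALID, 0, 1, INVALID, INVALID, 1]
print(infer_offline(cpu_to_socket))
```

The lone offline cpu is recoverable, while the block is not, which is
the case a separate online bitmap would resolve.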
Fifth, Xen has no way of querying the cpu cache information. hwloc
likes to know the entire cache hierarchy, which is arguably more useful
for its primary purpose of optimising HPC than for simply viewing the
Xen topology, but is none-the-less a missing feature as far as Xen is
concerned. I was considering adding a sysctl along the lines of "please
execute cpuid with these parameters on that pcpu and give me the answers".
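As a rough idea of what a caller would do with the answers, here is a
Python sketch that decodes the registers of Intel's deterministic cache
parameters leaf (CPUID leaf 4). The sysctl itself does not exist, so the
register values below are illustrative inputs rather than captured
output, and decode_cache_leaf() is a hypothetical helper.

```python
def decode_cache_leaf(eax, ebx, ecx):
    """Decode one CPUID leaf 4 subleaf; return None when no more caches."""
    cache_type = eax & 0x1F          # 0 = none, 1 = data, 2 = instr, 3 = unified
    if cache_type == 0:
        return None
    level = (eax >> 5) & 0x7
    line_size = (ebx & 0xFFF) + 1
    partitions = ((ebx >> 12) & 0x3FF) + 1
    ways = ((ebx >> 22) & 0x3FF) + 1
    sets = ecx + 1
    return {
        "type": cache_type,
        "level": level,
        "line_size": line_size,
        "ways": ways,
        "size": ways * partitions * line_size * sets,
    }


# Illustrative values describing an 8-way L1 data cache with 64 byte
# lines and 64 sets.
l1d = decode_cache_leaf(eax=0x21, ebx=(7 << 22) | 63, ecx=63)
print(l1d)
```

Iterating the subleaf index until the decode returns None would walk
the whole hierarchy for one pcpu.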
Sixth and finally, which is also the hardest problem conceptually to
solve, Xen has no notion of IO proximity. Devices on the system can
report their location using _PXM() methods in the DSDT/SSDTs, but only
dom0 can gather this information, and dom0 doesn't have an accurate view
of the NUMA or CPU topology.
Anyway - that is probably enough rambling. I don't expect much/any of
this to be resolved before the 4.5 dev window opens, but bringing these
issues to light might at least get some of them discussed.
~Andrew
^ permalink raw reply [flat|nested] 7+ messages in thread
* Re: Hwloc with Xen host topology
2014-01-02 20:26 Hwloc with Xen host topology Andrew Cooper
@ 2014-01-02 21:24 ` Samuel Thibault
2014-01-02 21:50 ` Andrew Cooper
2014-01-02 21:38 ` Andrew Cooper
1 sibling, 1 reply; 7+ messages in thread
From: Samuel Thibault @ 2014-01-02 21:24 UTC (permalink / raw)
To: Andrew Cooper; +Cc: Xen-devel List
Hello,
Andrew Cooper, le Thu 02 Jan 2014 20:26:49 +0000, a écrit :
> Cores are numbered per-socket in Xen, while sockets,
> numa nodes and cpus are numbered on an absolute scale. There is
> currently a gross hack in my hwloc code which adds (socket_id *
> cores_per_socket * threads_per_core) onto each core id to make them
> similarly numbered on an absolute scale. This is fine for a homogeneous
> system, but not for a heterogeneous system.
BTW, hwloc does not need these physical ids to be unique, it can cope
with duplication and whatnot. That said, having a coherent interface at
the Xen layer would be a good thing, indeed :)
Samuel
* Re: Hwloc with Xen host topology
2014-01-02 21:24 ` Samuel Thibault
@ 2014-01-02 21:50 ` Andrew Cooper
2014-01-02 21:55 ` Samuel Thibault
0 siblings, 1 reply; 7+ messages in thread
From: Andrew Cooper @ 2014-01-02 21:50 UTC (permalink / raw)
To: Samuel Thibault, Xen-devel List
On 02/01/14 21:24, Samuel Thibault wrote:
> Hello,
>
> Andrew Cooper, le Thu 02 Jan 2014 20:26:49 +0000, a écrit :
>> Cores are numbered per-socket in Xen, while sockets,
>> numa nodes and cpus are numbered on an absolute scale. There is
>> currently a gross hack in my hwloc code which adds (socket_id *
>> cores_per_socket * threads_per_core) onto each core id to make them
>> similarly numbered on an absolute scale. This is fine for a homogeneous
>> system, but not for a heterogeneous system.
> BTW, hwloc does not need these physical ids to be unique, it can cope
> with duplication and whatnot. That said, having a coherent interface at
> the Xen layer would be a good thing, indeed :)
>
> Samuel
If I take out the described hack, I am presented with
****************************************************************************
* hwloc has encountered what looks like an error from the operating system.
*
* object (Core P#0 cpuset 0x30000003) intersection without inclusion!
* Error occurred in topology.c line 853
*
* Please report this error message to the hwloc user's mailing list,
* along with the output from the hwloc-gather-topology.sh script.
****************************************************************************
Which I took to mean "I have done something stupid". I looked and saw
that I was attempting to insert a second Core P#0 object with a
different cpuset and decided to renumber the cores so they didn't
overlap in physical ids.
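For reference, one reading of "intersection without inclusion" can be
sketched as a plain bitmask check in Python. This is not hwloc's code;
the socket mask below is hypothetical, and only the 0x30000003 cpuset
comes from the error message.

```python
def bits(mask):
    """List the bit positions set in a cpuset mask."""
    return [i for i in range(mask.bit_length()) if (mask >> i) & 1]


def intersects_without_inclusion(a, b):
    """True when a and b overlap but neither contains the other."""
    common = a & b
    return common != 0 and common != a and common != b


core_cpuset = 0x30000003      # from the error message above
socket_cpuset = 0x0000FFFF    # hypothetical socket covering cpus 0-15

print(bits(core_cpuset))
print(intersects_without_inclusion(core_cpuset, socket_cpuset))
print(intersects_without_inclusion(0x3, socket_cpuset))
```

The cpuset in the error spans cpus 0-1 and 28-29, which is the shape
that results if two cores sharing a physical id end up as one object.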
If you believe that this should indeed work, then I guess I need to
raise a bug...
~Andrew
* Re: Hwloc with Xen host topology
2014-01-02 21:50 ` Andrew Cooper
@ 2014-01-02 21:55 ` Samuel Thibault
2014-01-02 22:01 ` Andrew Cooper
0 siblings, 1 reply; 7+ messages in thread
From: Samuel Thibault @ 2014-01-02 21:55 UTC (permalink / raw)
To: Andrew Cooper; +Cc: Xen-devel List
Andrew Cooper, le Thu 02 Jan 2014 21:50:06 +0000, a écrit :
> On 02/01/14 21:24, Samuel Thibault wrote:
> > Andrew Cooper, le Thu 02 Jan 2014 20:26:49 +0000, a écrit :
> >> Cores are numbered per-socket in Xen, while sockets,
> >> numa nodes and cpus are numbered on an absolute scale. There is
> >> currently a gross hack in my hwloc code which adds (socket_id *
> >> cores_per_socket * threads_per_core) onto each core id to make them
> >> similarly numbered on an absolute scale. This is fine for a homogeneous
> >> system, but not for a heterogeneous system.
> > BTW, hwloc does not need these physical ids to be unique, it can cope
> > with duplication and whatnot. That said, having a coherent interface at
> > the Xen layer would be a good thing, indeed :)
>
> If I take out the described hack, I am presented with
>
> ****************************************************************************
> * hwloc has encountered what looks like an error from the operating system.
> *
> * object (Core P#0 cpuset 0x30000003) intersection without inclusion!
> * Error occurred in topology.c line 853
> *
> * Please report this error message to the hwloc user's mailing list,
> * along with the output from the hwloc-gather-topology.sh script.
> ****************************************************************************
>
> Which I took to mean "I have done something stupid". I looked and saw
> that I was attempting to insert a second Core P#0 object with a
> different cpuset and decided to renumber the cores so they didn't
> overlap in physical ids.
>
> If you believe that this should indeed work, then I guess I need to
> raise a bug...
Well, logical processor physical ids, i.e. what is used for indexing
physical cpusets, have to be unique. The core/socket/node IDs don't have
to.
Samuel
* Re: Hwloc with Xen host topology
2014-01-02 21:55 ` Samuel Thibault
@ 2014-01-02 22:01 ` Andrew Cooper
2014-01-02 22:04 ` Samuel Thibault
0 siblings, 1 reply; 7+ messages in thread
From: Andrew Cooper @ 2014-01-02 22:01 UTC (permalink / raw)
To: Samuel Thibault, Xen-devel List
On 02/01/14 21:55, Samuel Thibault wrote:
> Andrew Cooper, le Thu 02 Jan 2014 21:50:06 +0000, a écrit :
>> On 02/01/14 21:24, Samuel Thibault wrote:
>>> Andrew Cooper, le Thu 02 Jan 2014 20:26:49 +0000, a écrit :
>>>> Cores are numbered per-socket in Xen, while sockets,
>>>> numa nodes and cpus are numbered on an absolute scale. There is
>>>> currently a gross hack in my hwloc code which adds (socket_id *
>>>> cores_per_socket * threads_per_core) onto each core id to make them
>>>> similarly numbered on an absolute scale. This is fine for a homogeneous
>>>> system, but not for a heterogeneous system.
>>> BTW, hwloc does not need these physical ids to be unique, it can cope
>>> with duplication and whatnot. That said, having a coherent interface at
>>> the Xen layer would be a good thing, indeed :)
>> If I take out the described hack, I am presented with
>>
>> ****************************************************************************
>> * hwloc has encountered what looks like an error from the operating system.
>> *
>> * object (Core P#0 cpuset 0x30000003) intersection without inclusion!
>> * Error occurred in topology.c line 853
>> *
>> * Please report this error message to the hwloc user's mailing list,
>> * along with the output from the hwloc-gather-topology.sh script.
>> ****************************************************************************
>>
>> Which I took to mean "I have done something stupid". I looked and saw
>> that I was attempting to insert a second Core P#0 object with a
>> different cpuset and decided to renumber the cores so they didn't
>> overlap in physical ids.
>>
>> If you believe that this should indeed work, then I guess I need to
>> raise a bug...
> Well, logical processor physical ids, i.e. what is used for indexing
> physical cpusets, have to be unique. The core/socket/node IDs don't have
> to.
>
> Samuel
Then a bug needs raising. My hack only changes the Core physical ID as
far as hwloc is concerned. The PU physical IDs are unchanged by the
hack, and already unique as presented by Xen.
~Andrew
* Re: Hwloc with Xen host topology
2014-01-02 22:01 ` Andrew Cooper
@ 2014-01-02 22:04 ` Samuel Thibault
0 siblings, 0 replies; 7+ messages in thread
From: Samuel Thibault @ 2014-01-02 22:04 UTC (permalink / raw)
To: Andrew Cooper; +Cc: hwloc-devel, Xen-devel List
Andrew Cooper, le Thu 02 Jan 2014 22:01:30 +0000, a écrit :
> On 02/01/14 21:55, Samuel Thibault wrote:
> > Andrew Cooper, le Thu 02 Jan 2014 21:50:06 +0000, a écrit :
> >> On 02/01/14 21:24, Samuel Thibault wrote:
> >>> Andrew Cooper, le Thu 02 Jan 2014 20:26:49 +0000, a écrit :
> >>>> Cores are numbered per-socket in Xen, while sockets,
> >>>> numa nodes and cpus are numbered on an absolute scale. There is
> >>>> currently a gross hack in my hwloc code which adds (socket_id *
> >>>> cores_per_socket * threads_per_core) onto each core id to make them
> >>>> similarly numbered on an absolute scale. This is fine for a homogeneous
> >>>> system, but not for a heterogeneous system.
> >>> BTW, hwloc does not need these physical ids to be unique, it can cope
> >>> with duplication and whatnot. That said, having a coherent interface at
> >>> the Xen layer would be a good thing, indeed :)
> >> If I take out the described hack, I am presented with
> >>
> >> ****************************************************************************
> >> * hwloc has encountered what looks like an error from the operating system.
> >> *
> >> * object (Core P#0 cpuset 0x30000003) intersection without inclusion!
> >> * Error occurred in topology.c line 853
> >> *
> >> * Please report this error message to the hwloc user's mailing list,
> >> * along with the output from the hwloc-gather-topology.sh script.
> >> ****************************************************************************
> >>
> >> Which I took to mean "I have done something stupid". I looked and saw
> >> that I was attempting to insert a second Core P#0 object with a
> >> different cpuset and decided to renumber the cores so they didn't
> >> overlap in physical ids.
> >>
> >> If you believe that this should indeed work, then I guess I need to
> >> raise a bug...
> > Well, logical processor physical ids, i.e. what is used for indexing
> > physical cpusets, have to be unique. The core/socket/node IDs don't have
> > to.
>
> Then a bug needs raising. My hack only changes the Core physical ID as
> far as hwloc is concerned. The PU physical IDs are unchanged by the
> hack, and already unique as presented by Xen.
This needs investigation indeed. I'm sure we are supposed to support
that kind of case.
Samuel
* Re: Hwloc with Xen host topology
2014-01-02 20:26 Hwloc with Xen host topology Andrew Cooper
2014-01-02 21:24 ` Samuel Thibault
@ 2014-01-02 21:38 ` Andrew Cooper
1 sibling, 0 replies; 7+ messages in thread
From: Andrew Cooper @ 2014-01-02 21:38 UTC (permalink / raw)
To: Xen-devel List
On 02/01/14 20:26, Andrew Cooper wrote:
> Hello,
>
> For some post-holiday hacking, I tried playing around with getting hwloc
> to understand Xen's full system topology, rather than the faked up
> topology dom0 receives.
>
> I present here some code which works (on some interestingly shaped
> servers in the XenRT test pool), and some discoveries/problems found
> along the way.
>
> Code can be found at:
> http://xenbits.xen.org/gitweb/?p=people/andrewcoop/hwloc.git;a=shortlog;h=refs/heads/hwloc-xen-topology-v1
>
> You will need a libxc with the following patch:
> http://xenbits.xen.org/gitweb/?p=people/andrewcoop/xen.git;a=shortlog;h=refs/heads/hwloc-support-experimental
>
> Instructions for use found in the commit message of the hwloc.git tree.
> It is worth noting that with the help of the hwloc-devel list, v2 is
> already quite a bit different, but is still in-progress.
>
>
> Anyway, for the Xen issues I encountered. If memory serves, some of
> them might have been brought up on xen-devel in the past.
>
> The first problem, as indicated from the extra patch required against
> libxc is that the current interface for xc_{topology,numa}info() suck if
> you are not libxl. The current interface forces the caller to handle
> hypercall bounce buffering, which is even harder to do sensibly as half
> the bounce buffer macros are private to libxc. Bounce buffering is the
> kind of details which libxc should deal with on behalf of its callers,
> and should only be exposed to callers who want to do something special.
>
> My patch implements xc_{topology,numa}info_bounced() (name up for
> reconsideration) which takes some uint{32,64}_t arrays (optionally
> NULL), and properly bounce buffer them. This results in not needing to
> mess around with any of the bounce buffering in hwloc.
>
> The second problem is with the choice of max_node_id, which is
> MAX_NUMNODES-1, or 63. This means that the toolstack has to bounce a
> 16k buffer (64 * 64 * uint32_t) to get the node-node distances, even on
> a single or dual node system. The issue is less pronounced with the
> node_to_mem{size,free} arrays, which only have to be 64 * uint64_t long,
> but still wasteful especially if node_to_memfree is being periodically
> polled. Having nr_node_ids set dynamically (similar to nr_cpu_ids)
> would alleviate this overhead, as the number of nodes available on the
> system will unconditionally be static after boot.
>
> The third problem is the one which created the only real bug in my hwloc
> implementation. Cores are numbered per-socket in Xen, while sockets,
> numa nodes and cpus are numbered on an absolute scale. There is
> currently a gross hack in my hwloc code which adds (socket_id *
> cores_per_socket * threads_per_core) onto each core id to make them
> similarly numbered on an absolute scale. This is fine for a homogeneous
> system, but not for a heterogeneous system.
>
> Relatedly, when debugging the third problem on an AMD Opteron 63xx
> system, I noticed that it advertises 8 cores per socket and 2 threads
> per core, but numbers the cores 1-16 on each socket. This is broken.
> It should either be 16 cores per socket and 1 thread per core, or really
> 8 cores per socket and 2 threads per core, with the cores numbered 1-8
> and each pair of cpus with the same core id.
>
> Fourth, the API for identifying offline cpus is broken. To mark a cpu
> as offline, it has its topology information shot, meaning that an
> offline cpu cannot be positively located in the topology. I happen to
> know it can as Xen writes the records sequentially, so a single offline
> cpu can be identified based on the valid information either side, but a
> block of offline cpus become rather harder to locate. Ideally,
> XEN_SYSCTL_topologyinfo should return 4 parameters, with one of them
> being a bitmap from 0 to max_cpu_index identifying which cpus are
> online, and writing the correct core/socket/node information (when
> known) into the other parameters. However, being an ABI now makes this
> somewhat harder to do.
>
> Fifth, Xen has no way of querying the cpu cache information. hwloc
> likes to know the entire cache hierarchy, which is arguably more useful
> for its primary purpose of optimising HPC than for simply viewing the
> Xen topology, but is none-the-less a missing feature as far as Xen is
> concerned. I was considering adding a sysctl along the lines of "please
> execute cpuid with these parameters on that pcpu and give me the answers".
>
> Sixth and finally, which is also the hardest problem conceptually to
> solve, Xen has no notion of IO proximity. Devices on the system can
> report their location using _PXM() methods in the DSDT/SSDTs, but only
> dom0 can gather this information, and doesn't have an accurate view of
> the NUMA or CPU topology.
Seventh, from some very up-to-the-minute hacking,
XEN_SYSCTL_numainfo is not giving back valid information.
From a Haswell-EP SDP, running XenServer trunk (xen-4.3 based):
Xen NUMA information:
numa count 64, max numa id 1
node[ 0], size 19327352832, free 15262810112
node[ 1], size 17179869184, free 15961382912
Which sums to ~2GB more than the total system ram of:
(XEN) System RAM: 32320MB (33096268kB)
It would appear that a node's memsize includes IO regions encompassed by
the node's start/end pfns, rather than just the RAM contained inside the
node's pfns.
(XEN) SRAT: Node 0 PXM 0 0-480000000
(XEN) SRAT: Node 1 PXM 1 480000000-880000000
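The figures above can be cross-checked with a few lines of Python. The
numbers are copied from the output quoted in this mail; the only
assumption is that the "kB" in the System RAM line means 1024 bytes.

```python
# Values copied from the numainfo output and Xen boot log above.
node_size = {0: 19327352832, 1: 17179869184}
system_ram_kb = 33096268  # assumed to be units of 1024 bytes
srat = {0: (0x0, 0x480000000), 1: (0x480000000, 0x880000000)}

total_reported = sum(node_size.values())
system_ram = system_ram_kb * 1024
excess = total_reported - system_ram

# Size of each node's SRAT address range, holes and IO included.
spans = {node: end - start for node, (start, end) in srat.items()}

print(f"reported {total_reported}, system RAM {system_ram}, "
      f"excess {excess} bytes")
print("memsize equals SRAT span:", spans == node_size)
```

Each node's reported size matches the width of its SRAT range exactly,
which is consistent with the size being derived from the start/end pfns.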
Is this intentional or an oversight?
~Andrew
end of thread, other threads:[~2014-01-02 22:04 UTC | newest]
Thread overview: 7+ messages
2014-01-02 20:26 Hwloc with Xen host topology Andrew Cooper
2014-01-02 21:24 ` Samuel Thibault
2014-01-02 21:50 ` Andrew Cooper
2014-01-02 21:55 ` Samuel Thibault
2014-01-02 22:01 ` Andrew Cooper
2014-01-02 22:04 ` Samuel Thibault
2014-01-02 21:38 ` Andrew Cooper