_PXM, NUMA, and all that goodnesss

* _PXM, NUMA, and all that goodnesss
@ 2014-02-12 19:50 Konrad Rzeszutek Wilk
  2014-02-13 10:08 ` Jan Beulich
  2014-02-13 11:22 ` George Dunlap
  0 siblings, 2 replies; 5+ messages in thread
From: Konrad Rzeszutek Wilk @ 2014-02-12 19:50 UTC
  To: xen-devel, george.dunlap, jun.nakajima, boris.ostrovsky, jbeulich,
	andrew.cooper3, andrew.thomas, ufimtseva
  Cc: kurt.hackel

Hey,

I have been looking at figuring out how we can "easily" do PCIe assignment
of devices that are on different sockets. The problem is that
on machines with many sockets (four or more) we might inadvertently assign
a PCIe device from one socket to a guest bound to a different NUMA
node. That means more QPI traffic, higher latency, etc.

From a Linux kernel perspective we do seem to 'pipe' said information
from the ACPI DSDT (drivers/xen/pci.c):

    unsigned long long pxm;

    status = acpi_evaluate_integer(handle, "_PXM",
                                   NULL, &pxm);
    if (ACPI_SUCCESS(status)) {
        add.optarr[0] = pxm;
        add.flags |= XEN_PCI_DEV_PXM;

Which is neat except that Xen ignores that flag altogether. I Googled
a bit but still did not find anything relevant - thought there were
some presentations from past Xen Summits referring to it
(I can't find it now :-()
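
For illustration only, here is a rough sketch in C of what consuming
that flag on the receiving side could look like. Nothing here is
existing Xen code: the struct, the SKETCH_* constants, the flag value
and the proximity-domain-to-node table are all hypothetical stand-ins,
chosen only to mirror the 'flags' and 'optarr[0]' fields used in the
snippet above.

```c
#include <stdint.h>

/* Hypothetical stand-ins; these are NOT the real Xen definitions. */
#define SKETCH_PCI_DEV_PXM 0x4u /* assumed flag bit, for illustration */
#define SKETCH_MAX_PXM     256
#define SKETCH_NO_NODE     (-1)

struct sketch_pci_dev_add {
    uint16_t seg;
    uint8_t  bus;
    uint8_t  devfn;
    uint32_t flags;
    uint32_t optarr[1]; /* optarr[0] carries _PXM when the flag is set */
};

/* Hypothetical proximity-domain -> node table (would come from SRAT). */
int sketch_pxm_to_node[SKETCH_MAX_PXM];

/* Mark every proximity domain as "no known node". */
void sketch_init_pxm_map(void)
{
    for (int i = 0; i < SKETCH_MAX_PXM; i++)
        sketch_pxm_to_node[i] = SKETCH_NO_NODE;
}

/* Return the node for a device, or SKETCH_NO_NODE if none was reported. */
int sketch_device_node(const struct sketch_pci_dev_add *add)
{
    if (!(add->flags & SKETCH_PCI_DEV_PXM))
        return SKETCH_NO_NODE;          /* dom0 did not pass a _PXM value */
    if (add->optarr[0] >= SKETCH_MAX_PXM)
        return SKETCH_NO_NODE;          /* out of range for our table */
    return sketch_pxm_to_node[add->optarr[0]];
}
```

The point of the sketch is only that the value is already delivered;
what is missing is somebody recording it per device and exporting it.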

Anyhow, what I am wondering is whether there were some prototypes out
there in the past that utilize this. And if we were to use this, how
can we expose it to 'libxl' or any other tools to say:

"Hey! You might want to use this other PCI device assigned
to pciback which is on the same node". Some form of
'numa-pci' affinity.

Interestingly enough one can also read this from sysfs:
/sys/bus/pci/devices/<BDF>/{numa_node,local_cpus,local_cpulist}.

Except that we don't expose the NUMA topology to the initial
domain, so 'numa_node' is -1 for every device. And 'local_cpus' depends
on seeing _all_ of the CPUs - and of course it assumes that
vCPU == pCPU.
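
As a minimal sketch of the sysfs route, the C function below reads the
'numa_node' attribute for one device. The function name and the
'sysfs_root' parameter (normally "/sys") are my own assumptions for
illustration; only the attribute path comes from the text above. The
-2 error value is likewise an arbitrary choice, picked so it cannot be
confused with the -1 the kernel itself reports.

```c
#include <stdio.h>

/*
 * Read the NUMA node the kernel reports for a PCI device.
 * Returns the node number (>= 0), -1 when the kernel reports no
 * affinity, or -2 if the attribute could not be read or parsed.
 */
int pci_numa_node(const char *sysfs_root, const char *bdf)
{
    char path[256];
    FILE *f;
    int node;
    int n;

    n = snprintf(path, sizeof(path), "%s/bus/pci/devices/%s/numa_node",
                 sysfs_root, bdf);
    if (n < 0 || (size_t)n >= sizeof(path))
        return -2;              /* path did not fit in the buffer */

    f = fopen(path, "r");
    if (!f)
        return -2;              /* no such device or attribute */

    if (fscanf(f, "%d", &node) != 1)
        node = -2;              /* attribute was not an integer */

    fclose(f);
    return node;
}
```

Called as pci_numa_node("/sys", "0000:03:00.0") in the initial domain
today, this would be expected to return -1 for the reason given above.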

Anyhow, if this was "tweaked" such that the initial domain
was seeing the hardware NUMA topology and parsing it (via
Elena's patches) we could potentially have at least the
'numa_node' information present and figure out if a guest
is using a PCIe device from the right socket.

So what I am wondering is:
 1) Were there any plans for the XEN_PCI_DEV_PXM in the
    hypervisor? Were there some prototypes for exporting the
    PCI device BDF and NUMA information out?

 2) Would it be better to just look at making the initial domain
   be able to figure out the NUMA topology and assign the
   correct 'numa_node' in the PCI fields?

 3) If either option is used, would taking that information into
   account when launching a guest with either 'cpus' or 'numa-affinity'
   or 'pci' and informing the user of a better choice be good?
   Or would it be better if there was some diagnostic tool to at
   least tell the user whether their PCI device assignment made
   sense or not? Or perhaps program the 'numa-affinity' based on
   the PCIe socket location?

* Re: _PXM, NUMA, and all that goodnesss
  2014-02-12 19:50 _PXM, NUMA, and all that goodnesss Konrad Rzeszutek Wilk
@ 2014-02-13 10:08 ` Jan Beulich
  2014-02-13 11:21   ` Andrew Cooper
  2014-02-13 11:22 ` George Dunlap
  1 sibling, 1 reply; 5+ messages in thread
From: Jan Beulich @ 2014-02-13 10:08 UTC
  To: Konrad Rzeszutek Wilk
  Cc: ufimtseva, andrew.thomas, george.dunlap, andrew.cooper3,
	jun.nakajima, kurt.hackel, xen-devel, boris.ostrovsky

>>> On 12.02.14 at 20:50, Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> wrote:
> From a Linux kernel perspective we do seem to 'pipe' said information
> from the ACPI DSDT (drivers/xen/pci.c):
> 
>     unsigned long long pxm;
> 
>     status = acpi_evaluate_integer(handle, "_PXM",
>                                    NULL, &pxm);
>     if (ACPI_SUCCESS(status)) {
>         add.optarr[0] = pxm;
>         add.flags |= XEN_PCI_DEV_PXM;
> 
> Which is neat except that Xen ignores that flag altogether. I Googled
> a bit but still did not find anything relevant - thought there were
> some presentations from past Xen Summits referring to it
> (I can't find it now :-()

When adding that interface it seemed pretty clear to me that we
would want/need this information sooner or later. I'm unaware of
any (prototype or better) code utilizing it.

> Anyhow,  what I am wondering if there are some prototypes out the
> in the past that utilize this. And if we were to use this how
> can we expose this to 'libxl' or any other tools to say:
> 
> "Hey! You might want to use this other PCI device assigned
> to pciback which is on the same node". Some of form of
> 'numa-pci' affinity.

Right, a hint like this might be desirable. But this shouldn't be
enforced.

> Interestingly enough one can also read this from sysfs:
> /sys/bus/pci/devices/<BDF>/{numa_node,local_cpus,local_cpulist}.
> 
> Except that we don't expose the NUMA topology to the initial
> domain, so 'numa_node' is -1 for every device. And 'local_cpus' depends
> on seeing _all_ of the CPUs - and of course it assumes that
> vCPU == pCPU.
> 
> Anyhow, if this was "tweaked" such that the initial domain
> was seeing the hardware NUMA topology and parsing it (via
> Elena's patches) we could potentially have at least the
> 'numa_node' information present and figure out if a guest
> is using a PCIe device from the right socket.

I think you're mixing up things here. Afaict Elena's patches
are to introduce _virtual_ NUMA, i.e. it would specifically _not_
expose the host NUMA properties to the Dom0 kernel. Don't
we have interfaces to expose the host NUMA information to
the tools already?

> So what I am wondering is:
>  1) Were there any plans for the XEN_PCI_DEV_PXM in the
>     hypervisor? Were there some prototypes for exporting the
>     PCI device BDF and NUMA information out.

As said above: Intentions (I wouldn't call it plans) yes, prototypes
no.

>  2) Would it be better to just look at making the initial domain
>    be able to figure out the NUMA topology and assign the
>    correct 'numa_node' in the PCI fields?

As said above, I don't think this should be exposed to and
handled in Dom0's kernel. It's the tool stack that has the overall
view here.

>  3). If either option is used, would taking that information in-to
>    advisement when launching a guest with either 'cpus' or 'numa-affinity'
>    or 'pci' and informing the user of a better choice be good?
>    Or would it be better if there was some diagnostic tool to at
>    least tell the user whether their PCI device assignment made
>    sense or not? Or perhaps program the 'numa-affinity' based on
>    the PCIe socket location?

I think issuing hint messages would be nice. Automatic placement
could clearly also take assigned devices' localities into consideration,
i.e. one could expect assigned devices to result in the respective
nodes to be picked in preference (as long as CPU and memory
availability allow doing so).
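
A minimal sketch of that preference, in C and under stated
assumptions: 'struct node_info', 'node_fits' and 'pick_node' are
hypothetical names, and the single-node, first-fit policy is a
deliberate simplification rather than how any existing placement code
works. Nodes hosting an assigned device are tried first; any other
node is only a fallback.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical per-node summary; the field names are illustrative. */
struct node_info {
    uint64_t     free_pages;
    unsigned int free_cpus;
};

/* Does this node have enough free memory and CPUs for the guest? */
int node_fits(const struct node_info *n, uint64_t need_pages,
              unsigned int need_cpus)
{
    return n->free_pages >= need_pages && n->free_cpus >= need_cpus;
}

/*
 * Pick a node for a guest. dev_nodes[] holds the node of each assigned
 * device (negative if unknown). Returns a node index, or -1 if no
 * single node can hold the guest.
 */
int pick_node(const struct node_info *nodes, size_t nr_nodes,
              const int *dev_nodes, size_t nr_devs,
              uint64_t need_pages, unsigned int need_cpus)
{
    /* First pass: prefer a node that hosts an assigned device. */
    for (size_t d = 0; d < nr_devs; d++) {
        int n = dev_nodes[d];

        if (n < 0 || (size_t)n >= nr_nodes)
            continue;           /* unknown or bogus locality */
        if (node_fits(&nodes[n], need_pages, need_cpus))
            return n;
    }

    /* Second pass: device locality is only a hint, so fall back. */
    for (size_t n = 0; n < nr_nodes; n++)
        if (node_fits(&nodes[n], need_pages, need_cpus))
            return (int)n;

    return -1;
}
```

The fallback pass is what keeps this a preference rather than
something enforced.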

Jan

* Re: _PXM, NUMA, and all that goodnesss
  2014-02-13 10:08 ` Jan Beulich
@ 2014-02-13 11:21   ` Andrew Cooper
  2014-02-13 11:40     ` Jan Beulich
  0 siblings, 1 reply; 5+ messages in thread
From: Andrew Cooper @ 2014-02-13 11:21 UTC
  To: Jan Beulich
  Cc: ufimtseva, andrew.thomas, george.dunlap, kurt.hackel,
	jun.nakajima, xen-devel, boris.ostrovsky

On 13/02/14 10:08, Jan Beulich wrote:
>> Interestingly enough one can also read this from sysfs:
>> /sys/bus/pci/devices/<BDF>/{numa_node,local_cpus,local_cpulist}.
>>
>> Except that we don't expose the NUMA topology to the initial
>> domain, so 'numa_node' is -1 for every device. And 'local_cpus' depends
>> on seeing _all_ of the CPUs - and of course it assumes that
>> vCPU == pCPU.
>>
>> Anyhow, if this was "tweaked" such that the initial domain
>> was seeing the hardware NUMA topology and parsing it (via
>> Elena's patches) we could potentially have at least the
>> 'numa_node' information present and figure out if a guest
>> is using a PCIe device from the right socket.
> I think you're mixing up things here. Afaict Elena's patches
> are to introduce _virtual_ NUMA, i.e. it would specifically _not_
> expose the host NUMA properties to the Dom0 kernel. Don't
> we have interfaces to expose the host NUMA information to
> the tools already?

I have recently looked into this when playing with xen support in hwloc.

Xen can export its vcpu_to_{socket,node,core} mappings for the toolstack
to consume, and for each node expose a count of used and free pages,
along with a square matrix of distances from the SLIT table.

The counts of used pages are problematic, because they include pages
mapping MMIO regions, which is different to the logical expectation of
just being RAM.
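
To show how a tool could combine these pieces, here is a small C
sketch that walks one row of the distance matrix to find the closest
node with enough free memory. The function name, the flattened
row-major layout (dist[from * nr_nodes + to]) and the parameter types
are assumptions made for the example, not the actual interface. It
deliberately looks only at the free-page counts, since the used counts
have the MMIO problem described above.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Return the node closest to 'from' (by the distance matrix) that has
 * at least 'need_pages' free pages, or -1 if none does. Ties go to the
 * lowest-numbered node.
 */
int nearest_node_with_memory(const uint32_t *dist, size_t nr_nodes,
                             const uint64_t *free_pages,
                             size_t from, uint64_t need_pages)
{
    int best = -1;
    uint32_t best_dist = UINT32_MAX;

    if (from >= nr_nodes)
        return -1;

    for (size_t to = 0; to < nr_nodes; to++) {
        uint32_t d = dist[from * nr_nodes + to];

        if (free_pages[to] < need_pages)
            continue;           /* not enough room on this node */
        if (best < 0 || d < best_dist) {
            best = (int)to;
            best_dist = d;
        }
    }
    return best;
}
```

With 'from' set to the node a PCI device sits on, the result is the
nearest node that could actually host a guest using that device.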

>
>> So what I am wondering is:
>>  1) Were there any plans for the XEN_PCI_DEV_PXM in the
>>     hypervisor? Were there some prototypes for exporting the
>>     PCI device BDF and NUMA information out.
> As said above: Intentions (I wouldn't call it plans) yes, prototypes
> no.
>
>>  2) Would it be better to just look at making the initial domain
>>    be able to figure out the NUMA topology and assign the
>>    correct 'numa_node' in the PCI fields?
> As said above, I don't think this should be exposed to and
> handled in Dom0's kernel. It's the tool stack to have the overall
> view here.

This is where things get awkward.  Dom0 has the real ACPI tables and is
the only entity with the ability to evaluate the _PXM() attributes to
work out which PCI devices belong to which NUMA nodes.  On the other
hand, its idea of cpus and numa is stifled by being virtual and
generally not having access to all the cpus it can see as present in the
ACPI tables.

It would certainly be nice for dom0 to report the _PXM() attributes back
up to Xen, but I have no idea how easy/hard it would be.

>
>>  3). If either option is used, would taking that information in-to
>>    advisement when launching a guest with either 'cpus' or 'numa-affinity'
>>    or 'pci' and informing the user of a better choice be good?
>>    Or would it be better if there was some diagnostic tool to at
>>    least tell the user whether their PCI device assignment made
>>    sense or not? Or perhaps program the 'numa-affinity' based on
>>    the PCIe socket location?
> I think issuing hint messages would be nice. Automatic placement
> could clearly also take assigned devices' localities into consideration,
> i.e. one could expect assigned devices to result in the respective
> nodes to be picked in preference (as long as CPU and memory
> availability allow doing so).
>
> Jan
>

A diagnostic tool is arguably in the works, having been done in my copious
free time, and rather more actively on the hwloc-devel list than
xen-devel, given the current code freeze.

http://xenbits.xen.org/gitweb/?p=people/andrewcoop/hwloc.git;a=shortlog;h=refs/heads/hwloc-xen-topology-v4
http://xenbits.xen.org/gitweb/?p=people/andrewcoop/xen.git;a=shortlog;h=refs/heads/hwloc-support-experimental-v2

One vague idea I had was to see about using hwloc's placement algorithms
to help advise domain placement, but I have not yet done any
investigation into the feasibility of this.

~Andrew

* Re: _PXM, NUMA, and all that goodnesss
  2014-02-13 11:21   ` Andrew Cooper
@ 2014-02-13 11:40     ` Jan Beulich
  0 siblings, 0 replies; 5+ messages in thread
From: Jan Beulich @ 2014-02-13 11:40 UTC
  To: Andrew Cooper
  Cc: ufimtseva, andrew.thomas, george.dunlap, kurt.hackel,
	jun.nakajima, xen-devel, boris.ostrovsky

>>> On 13.02.14 at 12:21, Andrew Cooper <andrew.cooper3@citrix.com> wrote:
> On 13/02/14 10:08, Jan Beulich wrote:
>>>  2) Would it be better to just look at making the initial domain
>>>    be able to figure out the NUMA topology and assign the
>>>    correct 'numa_node' in the PCI fields?
>> As said above, I don't think this should be exposed to and
>> handled in Dom0's kernel. It's the tool stack to have the overall
>> view here.
> 
> This is where things get awkward.  Dom0 has the real ACPI tables and is
> the only entity with the ability to evaluate the _PXM() attributes to
> work out which PCI devices belong to which NUMA nodes.  On the other
> hand, its idea of cpus and numa is stifled by being virtual and
> generally not having access to all the cpus it can see as present in the
> ACPI tables.
> 
> It would certainly be nice for dom0 to report the _PXM() attributes back
> up to Xen, but I have no idea how easy/hard it would be.

But that's being done already (see Konrad's original post), just
that the hypervisor doesn't really make use of the information
at present.

Jan

* Re: _PXM, NUMA, and all that goodnesss
  2014-02-12 19:50 _PXM, NUMA, and all that goodnesss Konrad Rzeszutek Wilk
  2014-02-13 10:08 ` Jan Beulich
@ 2014-02-13 11:22 ` George Dunlap
  1 sibling, 0 replies; 5+ messages in thread
From: George Dunlap @ 2014-02-13 11:22 UTC
  To: Konrad Rzeszutek Wilk, xen-devel, jun.nakajima, boris.ostrovsky,
	jbeulich, andrew.cooper3, andrew.thomas, ufimtseva
  Cc: kurt.hackel

On 02/12/2014 07:50 PM, Konrad Rzeszutek Wilk wrote:
> Hey,
>
> I have been looking at figuring out how we can "easily" do PCIe assignment
> of devices that are on different sockets. The problem is that
> on machines with many sockets (four or more) we might inadvertently assign
> a PCIe device from one socket to a guest bound to a different NUMA
> node. That means more QPI traffic, higher latency, etc.
>
>  From a Linux kernel perspective we do seem to 'pipe' said information
> from the ACPI DSDT (drivers/xen/pci.c):
>
>     unsigned long long pxm;
>
>     status = acpi_evaluate_integer(handle, "_PXM",
>                                    NULL, &pxm);
>     if (ACPI_SUCCESS(status)) {
>         add.optarr[0] = pxm;
>         add.flags |= XEN_PCI_DEV_PXM;
>
> Which is neat except that Xen ignores that flag altogether. I Googled
> a bit but still did not find anything relevant - thought there were
> some presentations from past Xen Summits referring to it
> (I can't find it now :-()
>
> Anyhow,  what I am wondering if there are some prototypes out the
> in the past that utilize this. And if we were to use this how
> can we expose this to 'libxl' or any other tools to say:
>
> "Hey! You might want to use this other PCI device assigned
> to pciback which is on the same node". Some of form of
> 'numa-pci' affinity.

A warning that the PCI device is not in the numa affinity of the guest 
might be nice.

> Interestingly enough one can also read this from sysfs:
> /sys/bus/pci/devices/<BDF>/{numa_node,local_cpus,local_cpulist}.
>
> Except that we don't expose the NUMA topology to the initial
> domain, so 'numa_node' is -1 for every device. And 'local_cpus' depends
> on seeing _all_ of the CPUs - and of course it assumes that
> vCPU == pCPU.
>
> Anyhow, if this was "tweaked" such that the initial domain
> was seeing the hardware NUMA topology and parsing it (via
> Elena's patches) we could potentially have at least the
> 'numa_node' information present and figure out if a guest
> is using a PCIe device from the right socket.

I don't think we want to go down the path of pretending that dom0 is the 
hypervisor.  This is the same reason I objected to Boris' approach to 
perf integration last year.  I can understand the idea of wanting to use 
the same tools in the same way; but the fact is dom0 is a guest, and its
virtual hardware (including #cpus, topology, &c) isn't (and shouldn't be
required to be) in any way related to the host.

On the other hand... just tossing this out there, but how hard would it 
be for dom0 to report information about the *physical* topology on 
certain things in sysfs, rather than *virtual* topology?  I.e., no 
matter what dom0's virtual topology was, to report the physical 
numa_node, local_cpu, &c in sysfs?

I suppose this might cause problems if the scheduler then tried to run a 
process / tasklet on the node to which the device was attached, only to 
find out that no such (virtual) node existed.

If that would be a no-go, then I think we need to expose that 
information via libxl somehow so the toolstack can make reasonable 
decisions.

>
> So what I am wondering is:
>   1) Were there any plans for the XEN_PCI_DEV_PXM in the
>      hypervisor? Were there some prototypes for exporting the
>      PCI device BDF and NUMA information out.
>
>   2) Would it be better to just look at making the initial domain
>     be able to figure out the NUMA topology and assign the
>     correct 'numa_node' in the PCI fields?
>
>   3). If either option is used, would taking that information in-to
>     advisement when launching a guest with either 'cpus' or 'numa-affinity'
>     or 'pci' and informing the user of a better choice be good?
>     Or would it be better if there was some diagnostic tool to at
>     least tell the user whether their PCI device assignment made
>     sense or not? Or perhaps program the 'numa-affinity' based on
>     the PCIe socket location?

I think in general, we should:
* Do something reasonable when no NUMA topology has been specified
* Do what the user asks (but help them make good decisions) when they do 
specify topology.

A couple of things that might mean:
* Having the NUMA placement algorithm take into account the location of 
assigned PCI devices is probably a good idea.
* Having a warning when a device is outside of a VM's soft cpu affinity 
or NUMA affinity.  (I think we do something similar when the soft cpu 
affinity doesn't intersect the NUMA affinity.)
* Exposing the NUMA affinity of a device when doing xl
pci-assignable-list might be a good idea as well, just to give people a
hint that they should maybe be thinking about this.  Maybe have xl
pci-assignable-add print what node a device is on as well? (Maybe only
on NUMA boxes?)
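
The warning in the second bullet could be sketched roughly as below,
in C. Everything here is hypothetical: the function names, the use of
a 64-bit mask for the guest's NUMA affinity (one bit per node, which
caps the sketch at 64 nodes) and the message wording are assumptions
for illustration, not existing xl or libxl code.

```c
#include <stdint.h>
#include <stdio.h>

/*
 * Return 1 if the device's node is known and is not part of the
 * guest's NUMA affinity mask, 0 otherwise. An unknown node (negative,
 * or beyond what the mask can express) never triggers a warning.
 */
int device_outside_affinity(uint64_t node_affinity_mask, int dev_node)
{
    if (dev_node < 0 || dev_node >= 64)
        return 0;
    return !(node_affinity_mask & (UINT64_C(1) << dev_node));
}

/* Print a hint (not an error) and return 1 if the device is remote. */
int warn_if_remote(FILE *out, const char *bdf,
                   uint64_t node_affinity_mask, int dev_node)
{
    if (!device_outside_affinity(node_affinity_mask, dev_node))
        return 0;

    fprintf(out, "warning: PCI device %s is on node %d, "
                 "outside the guest's NUMA affinity\n", bdf, dev_node);
    return 1;
}
```

Because the check only warns, it stays in line with doing what the
user asks while still helping them make a good decision.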

Just as an aside, can I take it that a lot of your customers have / are 
expected to have such NUMA boxes?  The accepted wisdom (at least in some 
circles) seems to be that NUMA isn't particularly important for cloud, 
because cloud providers will generally use a larger number of smaller 
boxes and use a cloud orchestration layer to tie them all together.

  -George
