_PXM, NUMA, and all that goodnesss - Konrad Rzeszutek Wilk

From: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
To: xen-devel@lists.xensource.com, george.dunlap@eu.citrix.com,
	jun.nakajima@intel.com, boris.ostrovsky@oracle.com,
	jbeulich@suse.com, andrew.cooper3@citrix.com,
	andrew.thomas@oracle.com, ufimtseva@gmail.com
Cc: kurt.hackel@oracle.com
Subject: _PXM, NUMA, and all that goodnesss
Date: Wed, 12 Feb 2014 14:50:05 -0500	[thread overview]
Message-ID: <20140212195005.GB29910@phenom.dumpdata.com> (raw)

Hey,

I have been looking at how we can "easily" do PCIe assignment
of devices that are on different sockets. The problem is that
on machines with many sockets (four or more) we might inadvertently assign
a PCIe device from one socket to a guest bound to a different NUMA
node. That means more QPI traffic, higher latency, etc.

From a Linux kernel perspective we do seem to 'pipe' said information
from the ACPI DSDT (drivers/xen/pci.c):

                unsigned long long pxm;

                status = acpi_evaluate_integer(handle, "_PXM",
                                   NULL, &pxm);
                if (ACPI_SUCCESS(status)) {
                    add.optarr[0] = pxm;
                    add.flags |= XEN_PCI_DEV_PXM;

Which is neat, except that Xen ignores that flag altogether. I Googled
a bit but still did not find anything relevant - though there were
some presentations from past Xen Summits referring to it
(I can't find them now :-()

Anyhow, what I am wondering is whether there were some prototypes out
there in the past that utilized this. And if we were to use this, how
can we expose it to 'libxl' or any other tools to say:

"Hey! You might want to use this other PCI device assigned
to pciback which is on the same node". Some form of
'numa-pci' affinity.

Interestingly enough, one can also read this from sysfs:
/sys/bus/pci/devices/<BDF>/{numa_node,local_cpus,local_cpulist}

Except that we don't expose the NUMA topology to the initial
domain, so 'numa_node' is -1 for every device. And 'local_cpus' depends
on seeing _all_ of the CPUs - and of course it assumes that
vCPU == pCPU.
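
For reference, a minimal sketch of reading that attribute from user
space. The helper names (numa_node_path, parse_numa_node,
read_pci_numa_node) are invented for this example, and it assumes the
attribute holds a single decimal integer followed by a newline:

```c
#include <assert.h>
#include <errno.h>
#include <limits.h>
#include <stdio.h>
#include <stdlib.h>

/* Build "/sys/bus/pci/devices/<BDF>/numa_node"; 0 on success, -1 if buf is too small. */
int numa_node_path(char *buf, size_t len, const char *bdf)
{
    assert(buf != NULL && bdf != NULL);
    int n = snprintf(buf, len, "/sys/bus/pci/devices/%s/numa_node", bdf);
    return (n < 0 || (size_t)n >= len) ? -1 : 0;
}

/* Parse the attribute text (e.g. "1\n" or "-1\n"); 0 on success, -1 on bad input. */
int parse_numa_node(const char *text, int *node)
{
    assert(text != NULL && node != NULL);
    char *end;
    errno = 0;
    long v = strtol(text, &end, 10);
    if (errno != 0 || end == text || v < -1 || v > INT_MAX)
        return -1;
    if (*end != '\n' && *end != '\0')
        return -1;
    *node = (int)v;
    return 0;
}

/* Read the node of one device; 0 and *node filled in on success, -1 on error. */
int read_pci_numa_node(const char *bdf, int *node)
{
    char path[128], line[32];
    FILE *f;
    int rc = -1;

    if (numa_node_path(path, sizeof(path), bdf) != 0)
        return -1;
    f = fopen(path, "r");
    if (f == NULL)
        return -1;
    if (fgets(line, sizeof(line), f) != NULL)
        rc = parse_numa_node(line, node);
    fclose(f);
    return rc;
}
```

Note that a successful read can still yield -1 as the node, which is
the case described above for the initial domain.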

Anyhow, if this was "tweaked" such that the initial domain
was seeing the hardware NUMA topology and parsing it (via
Elena's patches) we could potentially have at least the
'numa_node' information present and figure out if a guest
is using a PCIe device from the right socket.

So what I am wondering is:
 1) Were there any plans for XEN_PCI_DEV_PXM in the
    hypervisor? Were there any prototypes for exporting the
    PCI device BDF and NUMA information?

 2) Would it be better to just look at making the initial domain
    able to figure out the NUMA topology and assign the
    correct 'numa_node' in the PCI fields?

 3) If either option is used, would it be good to take that information
    into account when launching a guest with either 'cpus' or 'numa-affinity'
    or 'pci', and to inform the user of a better choice?
    Or would it be better if there was some diagnostic tool to at
    least tell the user whether their PCI device assignment made
    sense or not? Or perhaps program the 'numa-affinity' based on
    the PCIe socket location?
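
To illustrate the last point, a small sketch of the two checks such a
tool could make, assuming the device's node and the guest's node
affinity are already known. The names (pci_locality, check_pci_locality,
affinity_from_devices) and the 64-node bitmask representation are
assumptions made for this example, not anything libxl provides:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

enum pci_locality {
    PCI_LOCALITY_UNKNOWN, /* device reports no usable node (e.g. numa_node == -1) */
    PCI_LOCALITY_LOCAL,   /* device node is within the guest's affinity */
    PCI_LOCALITY_REMOTE   /* device sits on a node the guest is not bound to */
};

/* guest_nodes: bit n set means the guest has affinity to node n (nodes 0..63). */
enum pci_locality check_pci_locality(int dev_node, uint64_t guest_nodes)
{
    if (dev_node < 0 || dev_node >= 64)
        return PCI_LOCALITY_UNKNOWN;
    return (guest_nodes & (UINT64_C(1) << dev_node))
        ? PCI_LOCALITY_LOCAL : PCI_LOCALITY_REMOTE;
}

/* Derive a node-affinity mask from the nodes of the assigned devices,
 * skipping devices whose node is unknown or out of range. */
uint64_t affinity_from_devices(const int *dev_nodes, size_t count)
{
    uint64_t mask = 0;

    assert(dev_nodes != NULL || count == 0);
    for (size_t i = 0; i < count; i++)
        if (dev_nodes[i] >= 0 && dev_nodes[i] < 64)
            mask |= UINT64_C(1) << dev_nodes[i];
    return mask;
}
```

The first function corresponds to the "tell the user whether the
assignment made sense" option, the second to programming the affinity
from the device location.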
