Re: [PATCH 1/1] docs: adding NUMA documentation for pseries

From: Greg Kurz <groug@kaod.org>
To: David Gibson <david@gibson.dropbear.id.au>
Cc: Daniel Henrique Barboza <danielhb413@gmail.com>,
	qemu-ppc@nongnu.org, qemu-devel@nongnu.org
Subject: Re: [PATCH 1/1] docs: adding NUMA documentation for pseries
Date: Mon, 3 Aug 2020 13:49:17 +0200
Message-ID: <20200803134917.48c5e7a5@bahia.lan>
In-Reply-To: <20200730005852.GN84173@umbus.fritz.box>

On Thu, 30 Jul 2020 10:58:52 +1000
David Gibson <david@gibson.dropbear.id.au> wrote:

> On Wed, Jul 29, 2020 at 09:57:56AM -0300, Daniel Henrique Barboza wrote:
> > This patch adds a new documentation file, ppc-spapr-numa.rst,
> > informing what developers and users can expect of the NUMA distance
> > support for the pseries machine, up to QEMU 5.1.
> > 
> > In the (hopefully soon) future, when we rework the NUMA mechanics
> > of the pseries machine to at least attempt to contemplate user
> > choice, this doc will be extended to inform about the new
> > support.
> > 
> > Signed-off-by: Daniel Henrique Barboza <danielhb413@gmail.com>
> 
> Applied to ppc-for-5.2, thanks.
> 

I'm now hitting this:

Warning, treated as error:
docs/specs/ppc-spapr-numa.rst:document isn't included in any toctree

> > ---
> >  docs/specs/ppc-spapr-numa.rst | 191 ++++++++++++++++++++++++++++++++++
> >  1 file changed, 191 insertions(+)
> >  create mode 100644 docs/specs/ppc-spapr-numa.rst
> > 
> > diff --git a/docs/specs/ppc-spapr-numa.rst b/docs/specs/ppc-spapr-numa.rst
> > new file mode 100644
> > index 0000000000..e762038022
> > --- /dev/null
> > +++ b/docs/specs/ppc-spapr-numa.rst
> > @@ -0,0 +1,191 @@
> > +
> > +NUMA mechanics for sPAPR (pseries machines)
> > +============================================
> > +
> > +NUMA in sPAPR works differently from the System Locality Distance
> > +Information Table (SLIT) in ACPI. The logic is explained in the LOPAPR
> > +1.1 chapter 15, "Non Uniform Memory Access (NUMA) Option". This
> > +document aims to complement this specification, providing details
> > +of the elements that impact how QEMU views NUMA in pseries.
> > +
> > +Associativity and ibm,associativity property
> > +--------------------------------------------
> > +
> > +Associativity is defined as a group of platform resources that have
> > +similar mean performance (or in our context here, distance) relative to
> > +everyone else outside of the group.
> > +
> > +The format of the ibm,associativity property varies with the value of
> > +bit 0 of byte 5 of the ibm,architecture-vec-5 property. The format with
> > +bit 0 equal to zero is deprecated. The current format, with bit 0
> > +set to one, makes the ibm,associativity property represent the
> > +physical hierarchy of the platform, as one or more lists that start
> > +with the highest level grouping and go down to the smallest. Considering the
> > +following topology:
> > +
> > +::
> > +
> > +    Mem M1 ---- Proc P1    |
> > +    -----------------      | Socket S1  ---|
> > +          chip C1          |               |
> > +                                           | HW module 1 (MOD1)
> > +    Mem M2 ---- Proc P2    |               |
> > +    -----------------      | Socket S2  ---|
> > +          chip C2          |
> > +
> > +The ibm,associativity property for the processors would be:
> > +
> > +* P1: {MOD1, S1, C1, P1}
> > +* P2: {MOD1, S2, C2, P2}
> > +
> > +Each allocable resource has an ibm,associativity property. The LOPAPR
> > +specification allows multiple lists to be present in this property,
> > +considering that the same resource can have multiple connections to the
> > +platform.
> > +
> > +Relative Performance Distance and ibm,associativity-reference-points
> > +--------------------------------------------------------------------
> > +
> > +The ibm,associativity-reference-points property is an array that is used
> > +to define the relevant performance/distance related boundaries, defining
> > +the NUMA levels for the platform.
> > +
> > +The definition of its elements also varies with the value of bit 0 of byte 5
> > +of the ibm,architecture-vec-5 property. The format with bit 0 equal to zero
> > +is also deprecated. With the current format, each integer of the
> > +ibm,associativity-reference-points represents a 1-based ordinal index (i.e.
> > +the first element is 1) of the ibm,associativity array. The first
> > +boundary is the most significant to application performance, followed by
> > +less significant boundaries. Allocated resources that belong to the
> > +same performance boundaries are expected to have relative NUMA distance
> > +that matches the relevancy of the boundary itself. Resources that belong
> > +to the same first boundary will have the shortest distance from each
> > +other. Subsequent boundaries represent greater distances and degraded
> > +performance.
> > +
> > +Using the previous example, the following reference points setting defines
> > +three NUMA levels:
> > +
> > +* ibm,associativity-reference-points = {0x3, 0x2, 0x1}
> > +
> > +The first NUMA level (0x3) is interpreted as the third element of each
> > +ibm,associativity array, the second level is the second element and
> > +the third level is the first element. Let's also consider that elements
> > +belonging to the first NUMA level have distance equal to 10 from each
> > +other, and each NUMA level doubles the distance from the previous. This
> > +means that the second would be 20 and the third level 40. For the P1 and
> > +P2 processors, we would have the following NUMA levels:
> > +
> > +::
> > +
> > +  * ibm,associativity-reference-points = {0x3, 0x2, 0x1}
> > +
> > +  * P1: associativity{MOD1, S1, C1, P1}
> > +
> > +  First NUMA level (0x3) => associativity[2] = C1
> > +  Second NUMA level (0x2) => associativity[1] = S1
> > +  Third NUMA level (0x1) => associativity[0] = MOD1
> > +
> > +  * P2: associativity{MOD1, S2, C2, P2}
> > +
> > +  First NUMA level (0x3) => associativity[2] = C2
> > +  Second NUMA level (0x2) => associativity[1] = S2
> > +  Third NUMA level (0x1) => associativity[0] = MOD1
> > +
> > +  P1 and P2 have the same third NUMA level, MOD1: Distance between them = 40
> > +
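
To make the lookup above concrete, here is a minimal Python sketch of how
each reference point selects one element of an associativity array. The
helper name numa_levels and the string labels are assumptions made for
illustration only; they do not come from LOPAPR or from QEMU.

```python
def numa_levels(associativity, reference_points):
    """Return the associativity element selected by each reference point."""
    # Reference points are 1-based ordinal indexes, so subtract 1 to index
    # the Python list.
    return [associativity[ref - 1] for ref in reference_points]


# Associativity arrays from the example topology, highest level first.
P1 = ["MOD1", "S1", "C1", "P1"]
P2 = ["MOD1", "S2", "C2", "P2"]
REF_POINTS = [0x3, 0x2, 0x1]

p1_levels = numa_levels(P1, REF_POINTS)
p2_levels = numa_levels(P2, REF_POINTS)
print(p1_levels)
print(p2_levels)
```

The first entry of each result is the first NUMA level, so comparing the
two lists position by position shows at which level the two resources
start to share a boundary.
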
> > +Changing the ibm,associativity-reference-points array changes the performance
> > +distance attributes for the same associativity arrays, as the following
> > +example illustrates:
> > +
> > +::
> > +
> > +  * ibm,associativity-reference-points = {0x2}
> > +
> > +  * P1: associativity{MOD1, S1, C1, P1}
> > +
> > +  First NUMA level (0x2) => associativity[1] = S1
> > +
> > +  * P2: associativity{MOD1, S2, C2, P2}
> > +
> > +  First NUMA level (0x2) => associativity[1] = S2
> > +
> > +  P1 and P2 do not have a common performance boundary. Since this is a one level
> > +  NUMA configuration, distance between them is one boundary above the first
> > +  level, 20.
> > +
> > +
> > +In a hypothetical platform where all resources inside the same hardware module
> > +are considered to be on the same performance boundary:
> > +
> > +::
> > +
> > +  * ibm,associativity-reference-points = {0x1}
> > +
> > +  * P1: associativity{MOD1, S1, C1, P1}
> > +
> > +  First NUMA level (0x1) => associativity[0] = MOD1
> > +
> > +  * P2: associativity{MOD1, S2, C2, P2}
> > +
> > +  First NUMA level (0x1) => associativity[0] = MOD1
> > +
> > +  P1 and P2 belong to the same first order boundary. The distance between them
> > +  is 10.
> > +
> > +
> > +How the pseries Linux guest calculates NUMA distances
> > +=====================================================
> > +
> > +Another key difference between ACPI SLIT and the LOPAPR regarding NUMA is
> > +how the distances are expressed. The SLIT table provides the NUMA distance
> > +value between the relevant resources. LOPAPR does not provide a standard
> > +way to calculate it. We have the ibm,associativity for each resource, which
> > +provides a common-performance hierarchy, and the ibm,associativity-reference-points
> > +array that tells which level of associativity is considered to be relevant
> > +or not.
> > +
> > +The result is that each OS is free to implement and to interpret the distance
> > +as it sees fit. For the pseries Linux guest, each level of NUMA doubles
> > +the distance of the previous level, and the maximum number of levels is
> > +limited to MAX_DISTANCE_REF_POINTS = 4 (from arch/powerpc/mm/numa.c in the
> > +kernel tree). This results in the following distances:
> > +
> > +* both resources in the first NUMA level: 10
> > +* resources one NUMA level apart: 20
> > +* resources two NUMA levels apart: 40
> > +* resources three NUMA levels apart: 80
> > +* resources four NUMA levels apart: 160
> > +
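
The doubling rule described above can be sketched in a few lines of
Python. This is an illustration of the rule as the document states it,
not a copy of the kernel code; the function name numa_distance and the
constant LOCAL_DISTANCE are names assumed for this sketch.

```python
LOCAL_DISTANCE = 10  # distance when the first NUMA level already matches


def numa_distance(assoc_a, assoc_b, reference_points):
    """Distance between two resources under the doubling rule."""
    distance = LOCAL_DISTANCE
    for ref in reference_points:
        # Reference points are 1-based indexes into the associativity array.
        if assoc_a[ref - 1] == assoc_b[ref - 1]:
            break  # common boundary found, stop doubling
        distance *= 2
    return distance


P1 = ["MOD1", "S1", "C1", "P1"]
P2 = ["MOD1", "S2", "C2", "P2"]

print(numa_distance(P1, P2, [0x3, 0x2, 0x1]))  # 40
print(numa_distance(P1, P2, [0x2]))            # 20
print(numa_distance(P1, P2, [0x1]))            # 10
```

The three calls reproduce the three reference point examples from the
previous section. With four reference points and no common boundary at
any of them, the same loop yields 160, the last entry in the list above.
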
> > +
> > +Consequences for QEMU NUMA tuning
> > +---------------------------------
> > +
> > +The way the pseries Linux guest calculates NUMA distances has a direct effect
> > +on what QEMU users can expect when doing NUMA tuning. As of QEMU 5.1, this is
> > +the default ibm,associativity-reference-points being used in the pseries
> > +machine:
> > +
> > +ibm,associativity-reference-points = {0x4, 0x4, 0x2}
> > +
> > +The first and second levels are equal, 0x4, and a third one was added in
> > +commit a6030d7e0b35 exclusively for NVLink GPUs support. This means that
> > +regardless of how the ibm,associativity properties are being created in
> > +the device tree, the pseries Linux guest will only recognize three scenarios
> > +as far as NUMA distance goes:
> > +
> > +* if the resources belong to the same first NUMA level, distance = 10
> > +* the second level is skipped since it's equal to the first
> > +* all resources that aren't an NVLink GPU are guaranteed to belong
> > +  to the same third NUMA level, having distance = 40
> > +* for NVLink GPUs, distance = 80 from everything else
> > +
> > +In short, we can summarize the NUMA distances seen in pseries Linux guests, using
> > +QEMU up to 5.1, as follows:
> > +
> > +* local distance, i.e. the distance of the resource to its own NUMA node: 10
> > +* if it's an NVLink GPU device, distance: 80
> > +* every other resource, distance: 40
> > +
> > +This also means that user input in the QEMU command line does not change the
> > +NUMA distancing inside the guest for the pseries machine.
> 
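
As a rough check of the summary above, the following Python sketch applies
the doubling rule to the {0x4, 0x4, 0x2} reference points. The
associativity arrays are illustrative placeholders chosen so that regular
nodes share their second element and the GPU does not; they are an
assumption of this sketch, not values read from QEMU's device tree.

```python
LOCAL_DISTANCE = 10
REF_POINTS = [0x4, 0x4, 0x2]  # default quoted above for QEMU 5.1

# Hypothetical associativity arrays, for illustration only.
RESOURCES = {
    "node0": [0, 0, 0, 0],
    "node1": [0, 0, 0, 1],
    "nvgpu": [1, 1, 1, 2],
}


def numa_distance(assoc_a, assoc_b, reference_points):
    """Double the distance for every reference point that does not match."""
    distance = LOCAL_DISTANCE
    for ref in reference_points:
        if assoc_a[ref - 1] == assoc_b[ref - 1]:
            break
        distance *= 2
    return distance


names = list(RESOURCES)
matrix = [
    [numa_distance(RESOURCES[a], RESOURCES[b], REF_POINTS) for b in names]
    for a in names
]
for name, row in zip(names, matrix):
    print(name, row)
```

Under these assumptions the matrix only ever contains 10, 40 and 80.
Because 0x4 appears twice, two resources that differ at the first level
always differ at the second as well, so a distance of 20 cannot occur.
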


Thread overview: 8+ messages
2020-07-29 12:57 [PATCH 1/1] docs: adding NUMA documentation for pseries Daniel Henrique Barboza
2020-07-30  0:58 ` David Gibson
2020-08-03 11:49   ` Greg Kurz [this message]
2020-08-03 12:14     ` Daniel Henrique Barboza
2020-08-03 12:53       ` Greg Kurz
2020-08-03 13:43         ` Daniel Henrique Barboza
2020-08-03 12:59       ` Peter Maydell
2020-08-03 13:30         ` Daniel Henrique Barboza
