From: David Gibson <david@gibson.dropbear.id.au>
To: Reza Arbab <arbab@linux.ibm.com>
Cc: Alexey Kardashevskiy <aik@ozlabs.ru>,
	Daniel Henrique Barboza <danielhb@linux.ibm.com>,
	Leonardo Augusto Guimaraes Garcia <lagarcia@linux.ibm.com>,
	qemu-ppc@nongnu.org, qemu-devel@nongnu.org
Subject: Re: [PATCH] spapr: Add a new level of NUMA for GPUs
Date: Mon, 11 May 2020 16:17:45 +1000
Message-ID: <20200511061745.GP2183@umbus.fritz.box>
In-Reply-To: <20200508175927.21791-1-arbab@linux.ibm.com>


On Fri, May 08, 2020 at 12:59:27PM -0500, Reza Arbab wrote:
> NUMA nodes corresponding to GPU memory currently have the same
> affinity/distance as normal memory nodes. Add a third NUMA associativity
> reference point enabling us to give GPU nodes more distance.
> 
> Before, `numactl -H` output in a guest with 4 GPUs (nodes 2-5):
> 
> node distances:
> node   0   1   2   3   4   5
>   0:  10  40  40  40  40  40
>   1:  40  10  40  40  40  40
>   2:  40  40  10  40  40  40
>   3:  40  40  40  10  40  40
>   4:  40  40  40  40  10  40
>   5:  40  40  40  40  40  10
> 
> After:
> 
> node distances:
> node   0   1   2   3   4   5
>   0:  10  40  80  80  80  80
>   1:  40  10  80  80  80  80
>   2:  80  80  10  80  80  80
>   3:  80  80  80  10  80  80
>   4:  80  80  80  80  10  80
>   5:  80  80  80  80  80  10
> 
> These are the same distances as on the host, mirroring the change made
> to host firmware in skiboot commit f845a648b8cb ("numa/associativity:
> Add a new level of NUMA for GPU's").

Urgh.  I have a number of thoughts on this.

1)

This would all be much simpler if PAPR's representation of NUMA
distances weren't so awful.  Somehow it manages to be both so complex
that it's very hard to understand, and yet so limited that it has no
way to represent distances in any absolute units, or even specific
ratios between distances.

Both qemu and the guest kernel can have an arbitrary set of nodes,
with an arbitrary matrix of distances between each pair, which we then
have to lossily encode into this PAPR nonsense.
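
(For the record, what a Linux guest does to turn the encoding back
into distances is roughly the sketch below.  Illustrative only, not
the actual kernel code: it walks ibm,associativity-reference-points
and doubles the distance until the referenced entries first match,
which is why only LOCAL_DISTANCE times a power of two is ever
expressible.)

#include <stdint.h>

/*
 * Illustrative only.  refpoints holds the cells of
 * ibm,associativity-reference-points: 1-based indices into an
 * ibm,associativity array, whose element 0 is the length cell.
 */
static int papr_distance(const uint32_t *assoc_a, const uint32_t *assoc_b,
                         const uint32_t *refpoints, int depth)
{
    int distance = 10; /* LOCAL_DISTANCE */
    int i;

    for (i = 0; i < depth; i++) {
        /* Stop doubling at the first level where the two nodes
         * share an associativity domain */
        if (assoc_a[refpoints[i]] == assoc_b[refpoints[i]]) {
            break;
        }
        distance *= 2;
    }
    return distance;
}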

2)

Alas, I don't think we can simply change this information.  We'll
have to do it conditionally on a new machine type.  This is
guest-visible information, which shouldn't change under a running
guest across migration between different qemu versions.  At least for
Linux guests we'd probably get away with it, since I think the kernel
only reads this info at boot, so across a migration we'd at worst get
non-optimal behaviour, not actual breakage.

Still, I'd need a stronger case than "probably won't break" before
breaking our rules about guest environment stability within a machine
type.
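
(The usual shape of that would be something like the sketch below.
The compat flag name is invented for illustration; it would be set in
the 5.0 and older machine class options, so only new machine types
expose the third reference point:)

static void spapr_dt_ref_points(SpaprMachineState *spapr, void *fdt, int rtas)
{
    SpaprMachineClass *smc = SPAPR_MACHINE_GET_CLASS(spapr);
    uint32_t refpoints[] = {
        cpu_to_be32(0x4),
        cpu_to_be32(0x4),
        cpu_to_be32(0x2),
    };
    size_t nr_refpoints = ARRAY_SIZE(refpoints);

    /* Hypothetical compat flag: old machine types keep the
     * two-entry reference points guests have already seen */
    if (smc->pre_5_1_assoc_refpoints) {
        nr_refpoints = 2;
    }

    _FDT(fdt_setprop(fdt, rtas, "ibm,associativity-reference-points",
                     refpoints, nr_refpoints * sizeof(refpoints[0])));
}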

3)

I'm not sure that this version is totally correct w.r.t. PAPR.  But
then, I'm not really sure of that for the existing version either.
Specifically, it's not at all clear from PAPR whether the IDs used at
each level of ibm,associativity need to be:
   a) globally unique
   b) unique only within the associativity level they appear at
or c) unique only within the "node" at the next higher level they
      belong to
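
To illustrate the difference with made-up IDs: take two domains 7 and
8 at the second-from-last level, each with a single leaf below it.
Under (a) the leaves need globally fresh IDs, say { ..., 7, 100 } and
{ ..., 8, 101 }; under (b) a leaf may reuse an ID from another level
but not from another leaf, so { ..., 7, 8 } and { ..., 8, 7 } would
be fine; under (c) both leaves could simply be 0, i.e. { ..., 7, 0 }
and { ..., 8, 0 }, each qualified by its parent.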

4)

I'm not totally clear on the rationale for using the individual
GPU's numa_id at all levels, rather than just the last one.  I'm
guessing this is so that the GPU memory blocks are distant from each
other as well as distant from main memory.  Is that right?
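
At least under the doubling rule (see the sketch under 1), that is
the effect: with reference points { 0x4, 0x4, 0x2 }, a GPU node's
array { 4, G, G, G, G } differs from every other node at all three
reference points (10 -> 20 -> 40 -> 80), while two plain memory
nodes ({ 4, 0, 0, 0, N }) still match at entry 2 and stop at 40,
which is exactly the before/after matrices above.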

> Signed-off-by: Reza Arbab <arbab@linux.ibm.com>
> ---
>  hw/ppc/spapr.c             | 6 +++++-
>  hw/ppc/spapr_pci_nvlink2.c | 8 +++-----
>  2 files changed, 8 insertions(+), 6 deletions(-)
> 
> diff --git a/hw/ppc/spapr.c b/hw/ppc/spapr.c
> index c18eab0a2305..53567f98f0c6 100644
> --- a/hw/ppc/spapr.c
> +++ b/hw/ppc/spapr.c
> @@ -892,7 +892,11 @@ static void spapr_dt_rtas(SpaprMachineState *spapr, void *fdt)
>      int rtas;
>      GString *hypertas = g_string_sized_new(256);
>      GString *qemu_hypertas = g_string_sized_new(256);
> -    uint32_t refpoints[] = { cpu_to_be32(0x4), cpu_to_be32(0x4) };
> +    uint32_t refpoints[] = {
> +        cpu_to_be32(0x4),
> +        cpu_to_be32(0x4),
> +        cpu_to_be32(0x2),
> +    };
>      uint64_t max_device_addr = MACHINE(spapr)->device_memory->base +
>          memory_region_size(&MACHINE(spapr)->device_memory->mr);
>      uint32_t lrdr_capacity[] = {
> diff --git a/hw/ppc/spapr_pci_nvlink2.c b/hw/ppc/spapr_pci_nvlink2.c
> index 8332d5694e46..f2cb26019e88 100644
> --- a/hw/ppc/spapr_pci_nvlink2.c
> +++ b/hw/ppc/spapr_pci_nvlink2.c
> @@ -37,8 +37,6 @@
>  #define PHANDLE_NVLINK(phb, gn, nn)  (0x00130000 | (((phb)->index) << 8) | \
>                                       ((gn) << 4) | (nn))
>  
> -#define SPAPR_GPU_NUMA_ID           (cpu_to_be32(1))
> -
>  typedef struct SpaprPhbPciNvGpuSlot {
>          uint64_t tgt;
>          uint64_t gpa;
> @@ -361,9 +359,9 @@ void spapr_phb_nvgpu_ram_populate_dt(SpaprPhbState *sphb, void *fdt)
>                                                      "nvlink2-mr[0]", NULL);
>          uint32_t associativity[] = {
>              cpu_to_be32(0x4),
> -            SPAPR_GPU_NUMA_ID,
> -            SPAPR_GPU_NUMA_ID,
> -            SPAPR_GPU_NUMA_ID,
> +            cpu_to_be32(nvslot->numa_id),
> +            cpu_to_be32(nvslot->numa_id),
> +            cpu_to_be32(nvslot->numa_id),
>              cpu_to_be32(nvslot->numa_id)
>          };
>          uint64_t size = object_property_get_uint(nv_mrobj, "size", NULL);

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson
