* [PATCH v2 1/2] spapr: Add associativity reference point count to machine info
From: Reza Arbab @ 2020-05-18 21:44 UTC
To: David Gibson, qemu-ppc, qemu-devel
Cc: Alexey Kardashevskiy, Daniel Henrique Barboza,
Leonardo Augusto Guimaraes Garcia
Make the number of NUMA associativity reference points a
machine-specific value, using the currently assumed default (two
reference points). This preps the next patch to conditionally change it.
Signed-off-by: Reza Arbab <arbab@linux.ibm.com>
---
hw/ppc/spapr.c | 6 +++++-
include/hw/ppc/spapr.h | 1 +
2 files changed, 6 insertions(+), 1 deletion(-)
diff --git a/hw/ppc/spapr.c b/hw/ppc/spapr.c
index c18eab0a2305..88b4a1f17716 100644
--- a/hw/ppc/spapr.c
+++ b/hw/ppc/spapr.c
@@ -889,10 +889,12 @@ static int spapr_dt_rng(void *fdt)
static void spapr_dt_rtas(SpaprMachineState *spapr, void *fdt)
{
MachineState *ms = MACHINE(spapr);
+ SpaprMachineClass *smc = SPAPR_MACHINE_GET_CLASS(ms);
int rtas;
GString *hypertas = g_string_sized_new(256);
GString *qemu_hypertas = g_string_sized_new(256);
uint32_t refpoints[] = { cpu_to_be32(0x4), cpu_to_be32(0x4) };
+ uint32_t nr_refpoints;
uint64_t max_device_addr = MACHINE(spapr)->device_memory->base +
memory_region_size(&MACHINE(spapr)->device_memory->mr);
uint32_t lrdr_capacity[] = {
@@ -944,8 +946,9 @@ static void spapr_dt_rtas(SpaprMachineState *spapr, void *fdt)
qemu_hypertas->str, qemu_hypertas->len));
g_string_free(qemu_hypertas, TRUE);
+ nr_refpoints = MIN(smc->nr_assoc_refpoints, ARRAY_SIZE(refpoints));
_FDT(fdt_setprop(fdt, rtas, "ibm,associativity-reference-points",
- refpoints, sizeof(refpoints)));
+ refpoints, nr_refpoints * sizeof(uint32_t)));
_FDT(fdt_setprop(fdt, rtas, "ibm,max-associativity-domains",
maxdomains, sizeof(maxdomains)));
@@ -4541,6 +4544,7 @@ static void spapr_machine_class_init(ObjectClass *oc, void *data)
smc->linux_pci_probe = true;
smc->smp_threads_vsmt = true;
smc->nr_xirqs = SPAPR_NR_XIRQS;
+ smc->nr_assoc_refpoints = 2;
xfc->match_nvt = spapr_match_nvt;
}
diff --git a/include/hw/ppc/spapr.h b/include/hw/ppc/spapr.h
index e579eaf28c05..abaf9a92adc0 100644
--- a/include/hw/ppc/spapr.h
+++ b/include/hw/ppc/spapr.h
@@ -129,6 +129,7 @@ struct SpaprMachineClass {
bool linux_pci_probe;
bool smp_threads_vsmt; /* set VSMT to smp_threads by default */
hwaddr rma_limit; /* clamp the RMA to this size */
+ uint32_t nr_assoc_refpoints;
void (*phb_placement)(SpaprMachineState *spapr, uint32_t index,
uint64_t *buid, hwaddr *pio,
--
2.18.2
* [PATCH v2 2/2] spapr: Add a new level of NUMA for GPUs
From: Reza Arbab @ 2020-05-18 21:44 UTC
To: David Gibson, qemu-ppc, qemu-devel
Cc: Alexey Kardashevskiy, Daniel Henrique Barboza,
Leonardo Augusto Guimaraes Garcia
NUMA nodes corresponding to GPU memory currently have the same
affinity/distance as normal memory nodes. Add a third NUMA associativity
reference point enabling us to give GPU nodes more distance.
This is guest visible information, which shouldn't change under a
running guest across migration between different qemu versions, so make
the change effective only in new (pseries > 5.0) machine types.
Before, `numactl -H` output in a guest with 4 GPUs (nodes 2-5):
node distances:
node 0 1 2 3 4 5
0: 10 40 40 40 40 40
1: 40 10 40 40 40 40
2: 40 40 10 40 40 40
3: 40 40 40 10 40 40
4: 40 40 40 40 10 40
5: 40 40 40 40 40 10
After:
node distances:
node 0 1 2 3 4 5
0: 10 40 80 80 80 80
1: 40 10 80 80 80 80
2: 80 80 10 80 80 80
3: 80 80 80 10 80 80
4: 80 80 80 80 10 80
5: 80 80 80 80 80 10
These are the same distances as on the host, mirroring the change made
to host firmware in skiboot commit f845a648b8cb ("numa/associativity:
Add a new level of NUMA for GPU's").
Signed-off-by: Reza Arbab <arbab@linux.ibm.com>
---
hw/ppc/spapr.c | 11 +++++++++--
hw/ppc/spapr_pci_nvlink2.c | 2 +-
2 files changed, 10 insertions(+), 3 deletions(-)
diff --git a/hw/ppc/spapr.c b/hw/ppc/spapr.c
index 88b4a1f17716..1d9193d5ee49 100644
--- a/hw/ppc/spapr.c
+++ b/hw/ppc/spapr.c
@@ -893,7 +893,11 @@ static void spapr_dt_rtas(SpaprMachineState *spapr, void *fdt)
int rtas;
GString *hypertas = g_string_sized_new(256);
GString *qemu_hypertas = g_string_sized_new(256);
- uint32_t refpoints[] = { cpu_to_be32(0x4), cpu_to_be32(0x4) };
+ uint32_t refpoints[] = {
+ cpu_to_be32(0x4),
+ cpu_to_be32(0x4),
+ cpu_to_be32(0x2),
+ };
uint32_t nr_refpoints;
uint64_t max_device_addr = MACHINE(spapr)->device_memory->base +
memory_region_size(&MACHINE(spapr)->device_memory->mr);
@@ -4544,7 +4548,7 @@ static void spapr_machine_class_init(ObjectClass *oc, void *data)
smc->linux_pci_probe = true;
smc->smp_threads_vsmt = true;
smc->nr_xirqs = SPAPR_NR_XIRQS;
- smc->nr_assoc_refpoints = 2;
+ smc->nr_assoc_refpoints = 3;
xfc->match_nvt = spapr_match_nvt;
}
@@ -4611,8 +4615,11 @@ DEFINE_SPAPR_MACHINE(5_1, "5.1", true);
*/
static void spapr_machine_5_0_class_options(MachineClass *mc)
{
+ SpaprMachineClass *smc = SPAPR_MACHINE_CLASS(mc);
+
spapr_machine_5_1_class_options(mc);
compat_props_add(mc->compat_props, hw_compat_5_0, hw_compat_5_0_len);
+ smc->nr_assoc_refpoints = 2;
}
DEFINE_SPAPR_MACHINE(5_0, "5.0", false);
diff --git a/hw/ppc/spapr_pci_nvlink2.c b/hw/ppc/spapr_pci_nvlink2.c
index 8332d5694e46..247fd48731e2 100644
--- a/hw/ppc/spapr_pci_nvlink2.c
+++ b/hw/ppc/spapr_pci_nvlink2.c
@@ -362,7 +362,7 @@ void spapr_phb_nvgpu_ram_populate_dt(SpaprPhbState *sphb, void *fdt)
uint32_t associativity[] = {
cpu_to_be32(0x4),
SPAPR_GPU_NUMA_ID,
- SPAPR_GPU_NUMA_ID,
+ cpu_to_be32(nvslot->numa_id),
SPAPR_GPU_NUMA_ID,
cpu_to_be32(nvslot->numa_id)
};
--
2.18.2
* Re: [PATCH v2 1/2] spapr: Add associativity reference point count to machine info
From: Greg Kurz @ 2020-05-20 23:34 UTC
To: Reza Arbab
Cc: Daniel Henrique Barboza, Leonardo Augusto Guimaraes Garcia,
qemu-ppc, qemu-devel, David Gibson
On Mon, 18 May 2020 16:44:17 -0500
Reza Arbab <arbab@linux.ibm.com> wrote:
> Make the number of NUMA associativity reference points a
> machine-specific value, using the currently assumed default (two
> reference points). This preps the next patch to conditionally change it.
>
> Signed-off-by: Reza Arbab <arbab@linux.ibm.com>
> ---
> hw/ppc/spapr.c | 6 +++++-
> include/hw/ppc/spapr.h | 1 +
> 2 files changed, 6 insertions(+), 1 deletion(-)
>
> diff --git a/hw/ppc/spapr.c b/hw/ppc/spapr.c
> index c18eab0a2305..88b4a1f17716 100644
> --- a/hw/ppc/spapr.c
> +++ b/hw/ppc/spapr.c
> @@ -889,10 +889,12 @@ static int spapr_dt_rng(void *fdt)
> static void spapr_dt_rtas(SpaprMachineState *spapr, void *fdt)
> {
> MachineState *ms = MACHINE(spapr);
> + SpaprMachineClass *smc = SPAPR_MACHINE_GET_CLASS(ms);
> int rtas;
> GString *hypertas = g_string_sized_new(256);
> GString *qemu_hypertas = g_string_sized_new(256);
> uint32_t refpoints[] = { cpu_to_be32(0x4), cpu_to_be32(0x4) };
> + uint32_t nr_refpoints;
> uint64_t max_device_addr = MACHINE(spapr)->device_memory->base +
> memory_region_size(&MACHINE(spapr)->device_memory->mr);
> uint32_t lrdr_capacity[] = {
> @@ -944,8 +946,9 @@ static void spapr_dt_rtas(SpaprMachineState *spapr, void *fdt)
> qemu_hypertas->str, qemu_hypertas->len));
> g_string_free(qemu_hypertas, TRUE);
>
> + nr_refpoints = MIN(smc->nr_assoc_refpoints, ARRAY_SIZE(refpoints));
Having the machine request more reference points than are available
would clearly be a bug. I'd rather add an assert() than silently
clip to the size of refpoints[].
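Something along these lines, perhaps (untested sketch):

    assert(smc->nr_assoc_refpoints <= ARRAY_SIZE(refpoints));
    nr_refpoints = smc->nr_assoc_refpoints;

so the bug shows up on the first run of the offending machine type
instead of shipping a silently truncated property.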
> _FDT(fdt_setprop(fdt, rtas, "ibm,associativity-reference-points",
> - refpoints, sizeof(refpoints)));
> + refpoints, nr_refpoints * sizeof(uint32_t)));
>
Size can be expressed without yet another explicit reference to the
uint32_t type:
nr_refpoints * sizeof(refpoints[0])
> _FDT(fdt_setprop(fdt, rtas, "ibm,max-associativity-domains",
> maxdomains, sizeof(maxdomains)));
> @@ -4541,6 +4544,7 @@ static void spapr_machine_class_init(ObjectClass *oc, void *data)
> smc->linux_pci_probe = true;
> smc->smp_threads_vsmt = true;
> smc->nr_xirqs = SPAPR_NR_XIRQS;
> + smc->nr_assoc_refpoints = 2;
When adding a new setting for the default machine type, we usually
take care of older machine types at the same time, i.e. by folding this
patch into the next one. Both patches are simple enough that this
should be okay, and it would avoid touching this line again.
> xfc->match_nvt = spapr_match_nvt;
> }
>
> diff --git a/include/hw/ppc/spapr.h b/include/hw/ppc/spapr.h
> index e579eaf28c05..abaf9a92adc0 100644
> --- a/include/hw/ppc/spapr.h
> +++ b/include/hw/ppc/spapr.h
> @@ -129,6 +129,7 @@ struct SpaprMachineClass {
> bool linux_pci_probe;
> bool smp_threads_vsmt; /* set VSMT to smp_threads by default */
> hwaddr rma_limit; /* clamp the RMA to this size */
> + uint32_t nr_assoc_refpoints;
>
> void (*phb_placement)(SpaprMachineState *spapr, uint32_t index,
> uint64_t *buid, hwaddr *pio,
* Re: [PATCH v2 2/2] spapr: Add a new level of NUMA for GPUs
From: Greg Kurz @ 2020-05-20 23:36 UTC
To: Reza Arbab
Cc: Daniel Henrique Barboza, Leonardo Augusto Guimaraes Garcia,
qemu-ppc, qemu-devel, David Gibson
On Mon, 18 May 2020 16:44:18 -0500
Reza Arbab <arbab@linux.ibm.com> wrote:
> NUMA nodes corresponding to GPU memory currently have the same
> affinity/distance as normal memory nodes. Add a third NUMA associativity
> reference point enabling us to give GPU nodes more distance.
>
> This is guest visible information, which shouldn't change under a
> running guest across migration between different qemu versions, so make
> the change effective only in new (pseries > 5.0) machine types.
>
> Before, `numactl -H` output in a guest with 4 GPUs (nodes 2-5):
>
> node distances:
> node 0 1 2 3 4 5
> 0: 10 40 40 40 40 40
> 1: 40 10 40 40 40 40
> 2: 40 40 10 40 40 40
> 3: 40 40 40 10 40 40
> 4: 40 40 40 40 10 40
> 5: 40 40 40 40 40 10
>
> After:
>
> node distances:
> node 0 1 2 3 4 5
> 0: 10 40 80 80 80 80
> 1: 40 10 80 80 80 80
> 2: 80 80 10 80 80 80
> 3: 80 80 80 10 80 80
> 4: 80 80 80 80 10 80
> 5: 80 80 80 80 80 10
>
> These are the same distances as on the host, mirroring the change made
> to host firmware in skiboot commit f845a648b8cb ("numa/associativity:
> Add a new level of NUMA for GPU's").
>
> Signed-off-by: Reza Arbab <arbab@linux.ibm.com>
> ---
> hw/ppc/spapr.c | 11 +++++++++--
> hw/ppc/spapr_pci_nvlink2.c | 2 +-
> 2 files changed, 10 insertions(+), 3 deletions(-)
>
> diff --git a/hw/ppc/spapr.c b/hw/ppc/spapr.c
> index 88b4a1f17716..1d9193d5ee49 100644
> --- a/hw/ppc/spapr.c
> +++ b/hw/ppc/spapr.c
> @@ -893,7 +893,11 @@ static void spapr_dt_rtas(SpaprMachineState *spapr, void *fdt)
> int rtas;
> GString *hypertas = g_string_sized_new(256);
> GString *qemu_hypertas = g_string_sized_new(256);
> - uint32_t refpoints[] = { cpu_to_be32(0x4), cpu_to_be32(0x4) };
> + uint32_t refpoints[] = {
> + cpu_to_be32(0x4),
> + cpu_to_be32(0x4),
> + cpu_to_be32(0x2),
> + };
> uint32_t nr_refpoints;
> uint64_t max_device_addr = MACHINE(spapr)->device_memory->base +
> memory_region_size(&MACHINE(spapr)->device_memory->mr);
> @@ -4544,7 +4548,7 @@ static void spapr_machine_class_init(ObjectClass *oc, void *data)
> smc->linux_pci_probe = true;
> smc->smp_threads_vsmt = true;
> smc->nr_xirqs = SPAPR_NR_XIRQS;
> - smc->nr_assoc_refpoints = 2;
> + smc->nr_assoc_refpoints = 3;
> xfc->match_nvt = spapr_match_nvt;
> }
>
> @@ -4611,8 +4615,11 @@ DEFINE_SPAPR_MACHINE(5_1, "5.1", true);
> */
> static void spapr_machine_5_0_class_options(MachineClass *mc)
> {
> + SpaprMachineClass *smc = SPAPR_MACHINE_CLASS(mc);
> +
> spapr_machine_5_1_class_options(mc);
> compat_props_add(mc->compat_props, hw_compat_5_0, hw_compat_5_0_len);
> + smc->nr_assoc_refpoints = 2;
> }
>
> DEFINE_SPAPR_MACHINE(5_0, "5.0", false);
> diff --git a/hw/ppc/spapr_pci_nvlink2.c b/hw/ppc/spapr_pci_nvlink2.c
> index 8332d5694e46..247fd48731e2 100644
> --- a/hw/ppc/spapr_pci_nvlink2.c
> +++ b/hw/ppc/spapr_pci_nvlink2.c
> @@ -362,7 +362,7 @@ void spapr_phb_nvgpu_ram_populate_dt(SpaprPhbState *sphb, void *fdt)
> uint32_t associativity[] = {
> cpu_to_be32(0x4),
> SPAPR_GPU_NUMA_ID,
> - SPAPR_GPU_NUMA_ID,
> + cpu_to_be32(nvslot->numa_id),
This is a guest visible change. It should theoretically be controlled
with a compat property of the PHB (look for "static GlobalProperty" in
spapr.c). But since this code is only used for GPU passthrough and we
don't support migration of such devices, I guess it's okay. Maybe just
mention it in the changelog.
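For illustration, the compat entry could look something like this
(property name invented for the example; the real entries live in
hw_compat_5_0 in hw/core/machine.c):

    static GlobalProperty hw_compat_5_0[] = {
        ...
        { "spapr-pci-host-bridge", "pre-5.1-associativity", "on" },
    };

paired with a boolean property registered on the PHB that
spapr_phb_nvgpu_ram_populate_dt() would test before emitting the
third-level entry.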
> SPAPR_GPU_NUMA_ID,
> cpu_to_be32(nvslot->numa_id)
> };
* Re: [PATCH v2 1/2] spapr: Add associativity reference point count to machine info
From: David Gibson @ 2020-05-21 5:12 UTC
To: Greg Kurz
Cc: qemu-ppc, Daniel Henrique Barboza,
Leonardo Augusto Guimaraes Garcia, Reza Arbab, qemu-devel
On Thu, May 21, 2020 at 01:34:37AM +0200, Greg Kurz wrote:
> On Mon, 18 May 2020 16:44:17 -0500
> Reza Arbab <arbab@linux.ibm.com> wrote:
>
> > Make the number of NUMA associativity reference points a
> > machine-specific value, using the currently assumed default (two
> > reference points). This preps the next patch to conditionally change it.
> >
> > Signed-off-by: Reza Arbab <arbab@linux.ibm.com>
> > ---
> > hw/ppc/spapr.c | 6 +++++-
> > include/hw/ppc/spapr.h | 1 +
> > 2 files changed, 6 insertions(+), 1 deletion(-)
> >
> > diff --git a/hw/ppc/spapr.c b/hw/ppc/spapr.c
> > index c18eab0a2305..88b4a1f17716 100644
> > --- a/hw/ppc/spapr.c
> > +++ b/hw/ppc/spapr.c
> > @@ -889,10 +889,12 @@ static int spapr_dt_rng(void *fdt)
> > static void spapr_dt_rtas(SpaprMachineState *spapr, void *fdt)
> > {
> > MachineState *ms = MACHINE(spapr);
> > + SpaprMachineClass *smc = SPAPR_MACHINE_GET_CLASS(ms);
> > int rtas;
> > GString *hypertas = g_string_sized_new(256);
> > GString *qemu_hypertas = g_string_sized_new(256);
> > uint32_t refpoints[] = { cpu_to_be32(0x4), cpu_to_be32(0x4) };
> > + uint32_t nr_refpoints;
> > uint64_t max_device_addr = MACHINE(spapr)->device_memory->base +
> > memory_region_size(&MACHINE(spapr)->device_memory->mr);
> > uint32_t lrdr_capacity[] = {
> > @@ -944,8 +946,9 @@ static void spapr_dt_rtas(SpaprMachineState *spapr, void *fdt)
> > qemu_hypertas->str, qemu_hypertas->len));
> > g_string_free(qemu_hypertas, TRUE);
> >
> > + nr_refpoints = MIN(smc->nr_assoc_refpoints, ARRAY_SIZE(refpoints));
>
> Having the machine request more reference points than are available
> would clearly be a bug. I'd rather add an assert() than silently
> clip to the size of refpoints[].
Actually, I think this "num reference points" thing is a false
abstraction. It's selecting a number of entries from a list of
reference points that's fixed. The number of things we could do
simply by changing the machine property and not the array is pretty
small.
I think it would be simpler to just have a boolean in the machine
class.
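Roughly (flag name invented for the sketch):

    /* in SpaprMachineClass */
    bool pre_5_1_assoc_refpoints;

    /* in spapr_dt_rtas() */
    nr_refpoints = ARRAY_SIZE(refpoints);
    if (smc->pre_5_1_assoc_refpoints) {
        nr_refpoints = 2;
    }

That way older machine types just opt back in to the two-level layout
and the array itself stays the single source of truth.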
> > _FDT(fdt_setprop(fdt, rtas, "ibm,associativity-reference-points",
> > - refpoints, sizeof(refpoints)));
> > + refpoints, nr_refpoints * sizeof(uint32_t)));
> >
>
> Size can be expressed without yet another explicit reference to the
> uint32_t type:
>
> nr_refpoints * sizeof(refpoints[0])
>
> > _FDT(fdt_setprop(fdt, rtas, "ibm,max-associativity-domains",
> > maxdomains, sizeof(maxdomains)));
> > @@ -4541,6 +4544,7 @@ static void spapr_machine_class_init(ObjectClass *oc, void *data)
> > smc->linux_pci_probe = true;
> > smc->smp_threads_vsmt = true;
> > smc->nr_xirqs = SPAPR_NR_XIRQS;
> > + smc->nr_assoc_refpoints = 2;
>
> When adding a new setting for the default machine type, we usually
> take care of older machine types at the same time, i.e. by folding this
> patch into the next one. Both patches are simple enough that this
> should be okay, and it would avoid touching this line again.
>
> > xfc->match_nvt = spapr_match_nvt;
> > }
> >
> > diff --git a/include/hw/ppc/spapr.h b/include/hw/ppc/spapr.h
> > index e579eaf28c05..abaf9a92adc0 100644
> > --- a/include/hw/ppc/spapr.h
> > +++ b/include/hw/ppc/spapr.h
> > @@ -129,6 +129,7 @@ struct SpaprMachineClass {
> > bool linux_pci_probe;
> > bool smp_threads_vsmt; /* set VSMT to smp_threads by default */
> > hwaddr rma_limit; /* clamp the RMA to this size */
> > + uint32_t nr_assoc_refpoints;
> >
> > void (*phb_placement)(SpaprMachineState *spapr, uint32_t index,
> > uint64_t *buid, hwaddr *pio,
>
--
David Gibson | I'll have my music baroque, and my code
david AT gibson.dropbear.id.au | minimalist, thank you. NOT _the_ _other_
| _way_ _around_!
http://www.ozlabs.org/~dgibson
* Re: [PATCH v2 2/2] spapr: Add a new level of NUMA for GPUs
From: David Gibson @ 2020-05-21 5:13 UTC
To: Greg Kurz
Cc: qemu-ppc, Daniel Henrique Barboza,
Leonardo Augusto Guimaraes Garcia, Reza Arbab, qemu-devel
On Thu, May 21, 2020 at 01:36:16AM +0200, Greg Kurz wrote:
> On Mon, 18 May 2020 16:44:18 -0500
> Reza Arbab <arbab@linux.ibm.com> wrote:
>
> > NUMA nodes corresponding to GPU memory currently have the same
> > affinity/distance as normal memory nodes. Add a third NUMA associativity
> > reference point enabling us to give GPU nodes more distance.
> >
> > This is guest visible information, which shouldn't change under a
> > running guest across migration between different qemu versions, so make
> > the change effective only in new (pseries > 5.0) machine types.
> >
> > Before, `numactl -H` output in a guest with 4 GPUs (nodes 2-5):
> >
> > node distances:
> > node 0 1 2 3 4 5
> > 0: 10 40 40 40 40 40
> > 1: 40 10 40 40 40 40
> > 2: 40 40 10 40 40 40
> > 3: 40 40 40 10 40 40
> > 4: 40 40 40 40 10 40
> > 5: 40 40 40 40 40 10
> >
> > After:
> >
> > node distances:
> > node 0 1 2 3 4 5
> > 0: 10 40 80 80 80 80
> > 1: 40 10 80 80 80 80
> > 2: 80 80 10 80 80 80
> > 3: 80 80 80 10 80 80
> > 4: 80 80 80 80 10 80
> > 5: 80 80 80 80 80 10
> >
> > These are the same distances as on the host, mirroring the change made
> > to host firmware in skiboot commit f845a648b8cb ("numa/associativity:
> > Add a new level of NUMA for GPU's").
> >
> > Signed-off-by: Reza Arbab <arbab@linux.ibm.com>
> > ---
> > hw/ppc/spapr.c | 11 +++++++++--
> > hw/ppc/spapr_pci_nvlink2.c | 2 +-
> > 2 files changed, 10 insertions(+), 3 deletions(-)
> >
> > diff --git a/hw/ppc/spapr.c b/hw/ppc/spapr.c
> > index 88b4a1f17716..1d9193d5ee49 100644
> > --- a/hw/ppc/spapr.c
> > +++ b/hw/ppc/spapr.c
> > @@ -893,7 +893,11 @@ static void spapr_dt_rtas(SpaprMachineState *spapr, void *fdt)
> > int rtas;
> > GString *hypertas = g_string_sized_new(256);
> > GString *qemu_hypertas = g_string_sized_new(256);
> > - uint32_t refpoints[] = { cpu_to_be32(0x4), cpu_to_be32(0x4) };
> > + uint32_t refpoints[] = {
> > + cpu_to_be32(0x4),
> > + cpu_to_be32(0x4),
> > + cpu_to_be32(0x2),
> > + };
> > uint32_t nr_refpoints;
> > uint64_t max_device_addr = MACHINE(spapr)->device_memory->base +
> > memory_region_size(&MACHINE(spapr)->device_memory->mr);
> > @@ -4544,7 +4548,7 @@ static void spapr_machine_class_init(ObjectClass *oc, void *data)
> > smc->linux_pci_probe = true;
> > smc->smp_threads_vsmt = true;
> > smc->nr_xirqs = SPAPR_NR_XIRQS;
> > - smc->nr_assoc_refpoints = 2;
> > + smc->nr_assoc_refpoints = 3;
> > xfc->match_nvt = spapr_match_nvt;
> > }
> >
> > @@ -4611,8 +4615,11 @@ DEFINE_SPAPR_MACHINE(5_1, "5.1", true);
> > */
> > static void spapr_machine_5_0_class_options(MachineClass *mc)
> > {
> > + SpaprMachineClass *smc = SPAPR_MACHINE_CLASS(mc);
> > +
> > spapr_machine_5_1_class_options(mc);
> > compat_props_add(mc->compat_props, hw_compat_5_0, hw_compat_5_0_len);
> > + smc->nr_assoc_refpoints = 2;
> > }
> >
> > DEFINE_SPAPR_MACHINE(5_0, "5.0", false);
> > diff --git a/hw/ppc/spapr_pci_nvlink2.c b/hw/ppc/spapr_pci_nvlink2.c
> > index 8332d5694e46..247fd48731e2 100644
> > --- a/hw/ppc/spapr_pci_nvlink2.c
> > +++ b/hw/ppc/spapr_pci_nvlink2.c
> > @@ -362,7 +362,7 @@ void spapr_phb_nvgpu_ram_populate_dt(SpaprPhbState *sphb, void *fdt)
> > uint32_t associativity[] = {
> > cpu_to_be32(0x4),
> > SPAPR_GPU_NUMA_ID,
> > - SPAPR_GPU_NUMA_ID,
> > + cpu_to_be32(nvslot->numa_id),
>
> This is a guest visible change. It should theoretically be controlled
> with a compat property of the PHB (look for "static GlobalProperty" in
> spapr.c). But since this code is only used for GPU passthrough and we
> don't support migration of such devices, I guess it's okay. Maybe just
> mention it in the changelog.
Yeah, we might get away with it, but it should be too hard to get this
right, so let's do it.
>
> > SPAPR_GPU_NUMA_ID,
> > cpu_to_be32(nvslot->numa_id)
> > };
>
--
David Gibson | I'll have my music baroque, and my code
david AT gibson.dropbear.id.au | minimalist, thank you. NOT _the_ _other_
| _way_ _around_!
http://www.ozlabs.org/~dgibson
* Re: [PATCH v2 2/2] spapr: Add a new level of NUMA for GPUs
From: Greg Kurz @ 2020-05-21 9:00 UTC
To: David Gibson
Cc: Reza Arbab, Daniel Henrique Barboza,
Leonardo Augusto Guimaraes Garcia, qemu-ppc, qemu-devel
On Thu, 21 May 2020 15:13:45 +1000
David Gibson <david@gibson.dropbear.id.au> wrote:
> On Thu, May 21, 2020 at 01:36:16AM +0200, Greg Kurz wrote:
> > On Mon, 18 May 2020 16:44:18 -0500
> > Reza Arbab <arbab@linux.ibm.com> wrote:
> >
> > > NUMA nodes corresponding to GPU memory currently have the same
> > > affinity/distance as normal memory nodes. Add a third NUMA associativity
> > > reference point enabling us to give GPU nodes more distance.
> > >
> > > This is guest visible information, which shouldn't change under a
> > > running guest across migration between different qemu versions, so make
> > > the change effective only in new (pseries > 5.0) machine types.
> > >
> > > Before, `numactl -H` output in a guest with 4 GPUs (nodes 2-5):
> > >
> > > node distances:
> > > node 0 1 2 3 4 5
> > > 0: 10 40 40 40 40 40
> > > 1: 40 10 40 40 40 40
> > > 2: 40 40 10 40 40 40
> > > 3: 40 40 40 10 40 40
> > > 4: 40 40 40 40 10 40
> > > 5: 40 40 40 40 40 10
> > >
> > > After:
> > >
> > > node distances:
> > > node 0 1 2 3 4 5
> > > 0: 10 40 80 80 80 80
> > > 1: 40 10 80 80 80 80
> > > 2: 80 80 10 80 80 80
> > > 3: 80 80 80 10 80 80
> > > 4: 80 80 80 80 10 80
> > > 5: 80 80 80 80 80 10
> > >
> > > These are the same distances as on the host, mirroring the change made
> > > to host firmware in skiboot commit f845a648b8cb ("numa/associativity:
> > > Add a new level of NUMA for GPU's").
> > >
> > > Signed-off-by: Reza Arbab <arbab@linux.ibm.com>
> > > ---
> > > hw/ppc/spapr.c | 11 +++++++++--
> > > hw/ppc/spapr_pci_nvlink2.c | 2 +-
> > > 2 files changed, 10 insertions(+), 3 deletions(-)
> > >
> > > diff --git a/hw/ppc/spapr.c b/hw/ppc/spapr.c
> > > index 88b4a1f17716..1d9193d5ee49 100644
> > > --- a/hw/ppc/spapr.c
> > > +++ b/hw/ppc/spapr.c
> > > @@ -893,7 +893,11 @@ static void spapr_dt_rtas(SpaprMachineState *spapr, void *fdt)
> > > int rtas;
> > > GString *hypertas = g_string_sized_new(256);
> > > GString *qemu_hypertas = g_string_sized_new(256);
> > > - uint32_t refpoints[] = { cpu_to_be32(0x4), cpu_to_be32(0x4) };
> > > + uint32_t refpoints[] = {
> > > + cpu_to_be32(0x4),
> > > + cpu_to_be32(0x4),
> > > + cpu_to_be32(0x2),
> > > + };
> > > uint32_t nr_refpoints;
> > > uint64_t max_device_addr = MACHINE(spapr)->device_memory->base +
> > > memory_region_size(&MACHINE(spapr)->device_memory->mr);
> > > @@ -4544,7 +4548,7 @@ static void spapr_machine_class_init(ObjectClass *oc, void *data)
> > > smc->linux_pci_probe = true;
> > > smc->smp_threads_vsmt = true;
> > > smc->nr_xirqs = SPAPR_NR_XIRQS;
> > > - smc->nr_assoc_refpoints = 2;
> > > + smc->nr_assoc_refpoints = 3;
> > > xfc->match_nvt = spapr_match_nvt;
> > > }
> > >
> > > @@ -4611,8 +4615,11 @@ DEFINE_SPAPR_MACHINE(5_1, "5.1", true);
> > > */
> > > static void spapr_machine_5_0_class_options(MachineClass *mc)
> > > {
> > > + SpaprMachineClass *smc = SPAPR_MACHINE_CLASS(mc);
> > > +
> > > spapr_machine_5_1_class_options(mc);
> > > compat_props_add(mc->compat_props, hw_compat_5_0, hw_compat_5_0_len);
> > > + smc->nr_assoc_refpoints = 2;
> > > }
> > >
> > > DEFINE_SPAPR_MACHINE(5_0, "5.0", false);
> > > diff --git a/hw/ppc/spapr_pci_nvlink2.c b/hw/ppc/spapr_pci_nvlink2.c
> > > index 8332d5694e46..247fd48731e2 100644
> > > --- a/hw/ppc/spapr_pci_nvlink2.c
> > > +++ b/hw/ppc/spapr_pci_nvlink2.c
> > > @@ -362,7 +362,7 @@ void spapr_phb_nvgpu_ram_populate_dt(SpaprPhbState *sphb, void *fdt)
> > > uint32_t associativity[] = {
> > > cpu_to_be32(0x4),
> > > SPAPR_GPU_NUMA_ID,
> > > - SPAPR_GPU_NUMA_ID,
> > > + cpu_to_be32(nvslot->numa_id),
> >
> > This is a guest visible change. It should theoretically be controlled
> > with a compat property of the PHB (look for "static GlobalProperty" in
> > spapr.c). But since this code is only used for GPU passthrough and we
> > don't support migration of such devices, I guess it's okay. Maybe just
> > mention it in the changelog.
>
> Yeah, we might get away with it, but it should be too hard to get this
I guess you mean "it shouldn't be too hard"?
> right, so let's do it.
>
> >
> > > SPAPR_GPU_NUMA_ID,
> > > cpu_to_be32(nvslot->numa_id)
> > > };
> >
>
* Re: [PATCH v2 1/2] spapr: Add associativity reference point count to machine info
From: Reza Arbab @ 2020-05-21 23:10 UTC
To: Greg Kurz
Cc: Daniel Henrique Barboza, Leonardo Augusto Guimaraes Garcia,
qemu-ppc, qemu-devel, David Gibson
On Thu, May 21, 2020 at 01:34:37AM +0200, Greg Kurz wrote:
>On Mon, 18 May 2020 16:44:17 -0500
>Reza Arbab <arbab@linux.ibm.com> wrote:
>> @@ -944,8 +946,9 @@ static void spapr_dt_rtas(SpaprMachineState *spapr, void *fdt)
>> qemu_hypertas->str, qemu_hypertas->len));
>> g_string_free(qemu_hypertas, TRUE);
>>
>> + nr_refpoints = MIN(smc->nr_assoc_refpoints, ARRAY_SIZE(refpoints));
>
>Having the machine request more reference points than are available
>would clearly be a bug. I'd rather add an assert() than silently
>clip to the size of refpoints[].
I'll rework nr_assoc_refpoints into a bool as David suggested.
Struggling for a name at the moment, but I'll think on it.
>> _FDT(fdt_setprop(fdt, rtas, "ibm,associativity-reference-points",
>> - refpoints, sizeof(refpoints)));
>> + refpoints, nr_refpoints * sizeof(uint32_t)));
>>
>
>Size can be expressed without yet another explicit reference to the
>uint32_t type:
>
>nr_refpoints * sizeof(refpoints[0])
Will do.
>> @@ -4541,6 +4544,7 @@ static void spapr_machine_class_init(ObjectClass *oc, void *data)
>> smc->linux_pci_probe = true;
>> smc->smp_threads_vsmt = true;
>> smc->nr_xirqs = SPAPR_NR_XIRQS;
>> + smc->nr_assoc_refpoints = 2;
>
>When adding a new setting for the default machine type, we usually
>take care of older machine types at the same time, i.e. by folding this
>patch into the next one. Both patches are simple enough that this
>should be okay, and it would avoid touching this line again.
No problem. I'll squash the patches together and work in that PHB compat
property as well.
Thanks for your feedback!
--
Reza Arbab