From: David Gibson <david@gibson.dropbear.id.au>
To: Laurent Vivier <lvivier@redhat.com>
Cc: peter.maydell@linaro.org, Alexey Kardashevskiy <aik@ozlabs.ru>,
gkurz@kaod.org, qemu-devel@nongnu.org,
Alex Williamson <alex.williamson@redhat.com>,
qemu-ppc@nongnu.org, clg@kaod.org
Subject: Re: [Qemu-devel] [PULL 01/36] spapr: Support NVIDIA V100 GPU with NVLink2
Date: Mon, 20 May 2019 16:06:10 +1000
Message-ID: <20190520060609.GB27407@umbus.fritz.box>
In-Reply-To: <2187f170-8a8b-356d-78e0-fb010443df3b@redhat.com>
On Fri, May 17, 2019 at 07:37:04PM +0200, Laurent Vivier wrote:
> On 26/04/2019 08:05, David Gibson wrote:
> > From: Alexey Kardashevskiy <aik@ozlabs.ru>
> >
> > NVIDIA V100 GPUs have on-board RAM which is mapped into the host memory
> > space and accessible as normal RAM via an NVLink bus. The VFIO-PCI driver
> > implements special regions for such GPUs and emulates an NVLink bridge.
> > NVLink2-enabled POWER9 CPUs also provide address translation services
> > which include an ATS shootdown (ATSD) register exported via the NVLink
> > bridge device.
> >
> > This adds a quirk to VFIO to map the GPU memory and create an MR;
> > the new MR is stored in a PCI device as a QOM link. The sPAPR PCI uses
> > this to get the MR and map it to the system address space.
> > Another quirk does the same for ATSD.
> >
> > This adds additional steps to sPAPR PHB setup:
> >
> > 1. Search for specific GPUs and NPUs, collect findings in
> > sPAPRPHBState::nvgpus, manage system address space mappings;
> >
> > 2. Add device-specific properties such as "ibm,npu", "ibm,gpu",
> > "memory-block", "link-speed" to advertise the NVLink2 function to
> > the guest;
> >
> > 3. Add "mmio-atsd" to vPHB to advertise the ATSD capability;
> >
> > 4. Add new memory blocks (with extra "linux,memory-usable" to prevent
> > the guest OS from accessing the new memory until it is onlined) and
> > npuphb# nodes representing an NPU unit for every vPHB as the GPU driver
> > uses it for link discovery.
> >
> > This allocates space for GPU RAM and ATSD like we do for MMIOs by
> > adding 2 new parameters to the phb_placement() hook. Older machine types
> > set these to zero.
> >
> > This puts new memory nodes in a separate NUMA node as the GPU RAM
> > needs to be configured equally distant from any other node in the system.
> > Unlike the host setup which assigns numa ids from 255 downwards, this
> > adds new NUMA nodes after the user configures nodes or from 1 if none
> > were configured.
> >
> > This adds a requirement similar to EEH - one IOMMU group per vPHB.
> > The reason for this is that ATSD registers belong to a physical NPU
> > so they cannot invalidate translations on GPUs attached to another NPU.
> > It is guaranteed by the host platform as it does not mix NVLink bridges
> > or GPUs from different NPUs in the same IOMMU group. If more than one
> > IOMMU group is detected on a vPHB, this disables ATSD support for that
> > vPHB and prints a warning.
> >
> > Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
> > [aw: for vfio portions]
> > Acked-by: Alex Williamson <alex.williamson@redhat.com>
> > Message-Id: <20190312082103.130561-1-aik@ozlabs.ru>
> > Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
> > ---
> > hw/ppc/Makefile.objs | 2 +-
> > hw/ppc/spapr.c | 48 +++-
> > hw/ppc/spapr_pci.c | 19 ++
> > hw/ppc/spapr_pci_nvlink2.c | 450 ++++++++++++++++++++++++++++++++++++
> > hw/vfio/pci-quirks.c | 131 +++++++++++
> > hw/vfio/pci.c | 14 ++
> > hw/vfio/pci.h | 2 +
> > hw/vfio/trace-events | 4 +
> > include/hw/pci-host/spapr.h | 45 ++++
> > include/hw/ppc/spapr.h | 5 +-
> > 10 files changed, 711 insertions(+), 9 deletions(-)
> > create mode 100644 hw/ppc/spapr_pci_nvlink2.c
> >
> > diff --git a/hw/ppc/Makefile.objs b/hw/ppc/Makefile.objs
> > index 1111b218a0..636e717f20 100644
> > --- a/hw/ppc/Makefile.objs
> > +++ b/hw/ppc/Makefile.objs
> > @@ -9,7 +9,7 @@ obj-$(CONFIG_SPAPR_RNG) += spapr_rng.o
> > # IBM PowerNV
> > obj-$(CONFIG_POWERNV) += pnv.o pnv_xscom.o pnv_core.o pnv_lpc.o pnv_psi.o pnv_occ.o pnv_bmc.o
> > ifeq ($(CONFIG_PCI)$(CONFIG_PSERIES)$(CONFIG_LINUX), yyy)
> > -obj-y += spapr_pci_vfio.o
> > +obj-y += spapr_pci_vfio.o spapr_pci_nvlink2.o
> > endif
> > obj-$(CONFIG_PSERIES) += spapr_rtas_ddw.o
> > # PowerPC 4xx boards
> > diff --git a/hw/ppc/spapr.c b/hw/ppc/spapr.c
> > index b52b82d298..b81e237635 100644
> > --- a/hw/ppc/spapr.c
> > +++ b/hw/ppc/spapr.c
> > @@ -1034,12 +1034,13 @@ static void spapr_dt_rtas(SpaprMachineState *spapr, void *fdt)
> > 0, cpu_to_be32(SPAPR_MEMORY_BLOCK_SIZE),
> > cpu_to_be32(max_cpus / smp_threads),
> > };
> > + uint32_t maxdomain = cpu_to_be32(spapr->gpu_numa_id > 1 ? 1 : 0);
> > uint32_t maxdomains[] = {
> > cpu_to_be32(4),
> > - cpu_to_be32(0),
> > - cpu_to_be32(0),
> > - cpu_to_be32(0),
> > - cpu_to_be32(nb_numa_nodes ? nb_numa_nodes : 1),
> > + maxdomain,
> > + maxdomain,
> > + maxdomain,
> > + cpu_to_be32(spapr->gpu_numa_id),
> > };
> >
> > _FDT(rtas = fdt_add_subnode(fdt, 0, "rtas"));
> > @@ -1698,6 +1699,16 @@ static void spapr_machine_reset(void)
> > spapr_irq_msi_reset(spapr);
> > }
> >
> > + /*
> > + * NVLink2-connected GPU RAM needs to be placed on a separate NUMA node.
> > + * We assign a new numa ID per GPU in spapr_pci_collect_nvgpu() which is
> > + * called from vPHB reset handler so we initialize the counter here.
> > + * If no NUMA is configured from the QEMU side, we start from 1 as GPU RAM
> > + * must be equally distant from any other node.
> > + * The final value of spapr->gpu_numa_id is going to be written to
> > + * max-associativity-domains in spapr_build_fdt().
> > + */
> > + spapr->gpu_numa_id = MAX(1, nb_numa_nodes);
> > qemu_devices_reset();
> >
> > /*
> > @@ -3907,7 +3918,9 @@ static void spapr_phb_pre_plug(HotplugHandler *hotplug_dev, DeviceState *dev,
> > smc->phb_placement(spapr, sphb->index,
> > &sphb->buid, &sphb->io_win_addr,
> > &sphb->mem_win_addr, &sphb->mem64_win_addr,
> > - windows_supported, sphb->dma_liobn, errp);
> > + windows_supported, sphb->dma_liobn,
> > + &sphb->nv2_gpa_win_addr, &sphb->nv2_atsd_win_addr,
> > + errp);
> > }
> >
> > static void spapr_phb_plug(HotplugHandler *hotplug_dev, DeviceState *dev,
> > @@ -4108,7 +4121,8 @@ static const CPUArchIdList *spapr_possible_cpu_arch_ids(MachineState *machine)
> > static void spapr_phb_placement(SpaprMachineState *spapr, uint32_t index,
> > uint64_t *buid, hwaddr *pio,
> > hwaddr *mmio32, hwaddr *mmio64,
> > - unsigned n_dma, uint32_t *liobns, Error **errp)
> > + unsigned n_dma, uint32_t *liobns,
> > + hwaddr *nv2gpa, hwaddr *nv2atsd, Error **errp)
> > {
> > /*
> > * New-style PHB window placement.
> > @@ -4153,6 +4167,9 @@ static void spapr_phb_placement(SpaprMachineState *spapr, uint32_t index,
> > *pio = SPAPR_PCI_BASE + index * SPAPR_PCI_IO_WIN_SIZE;
> > *mmio32 = SPAPR_PCI_BASE + (index + 1) * SPAPR_PCI_MEM32_WIN_SIZE;
> > *mmio64 = SPAPR_PCI_BASE + (index + 1) * SPAPR_PCI_MEM64_WIN_SIZE;
> > +
> > + *nv2gpa = SPAPR_PCI_NV2RAM64_WIN_BASE + index * SPAPR_PCI_NV2RAM64_WIN_SIZE;
> > + *nv2atsd = SPAPR_PCI_NV2ATSD_WIN_BASE + index * SPAPR_PCI_NV2ATSD_WIN_SIZE;
> > }
> >
> > static ICSState *spapr_ics_get(XICSFabric *dev, int irq)
> > @@ -4357,6 +4374,18 @@ DEFINE_SPAPR_MACHINE(4_0, "4.0", true);
> > /*
> > * pseries-3.1
> > */
> > +static void phb_placement_3_1(SpaprMachineState *spapr, uint32_t index,
> > + uint64_t *buid, hwaddr *pio,
> > + hwaddr *mmio32, hwaddr *mmio64,
> > + unsigned n_dma, uint32_t *liobns,
> > + hwaddr *nv2gpa, hwaddr *nv2atsd, Error **errp)
> > +{
> > + spapr_phb_placement(spapr, index, buid, pio, mmio32, mmio64, n_dma, liobns,
> > + nv2gpa, nv2atsd, errp);
> > + *nv2gpa = 0;
> > + *nv2atsd = 0;
> > +}
> > +
> > static void spapr_machine_3_1_class_options(MachineClass *mc)
> > {
> > SpaprMachineClass *smc = SPAPR_MACHINE_CLASS(mc);
> > @@ -4372,6 +4401,7 @@ static void spapr_machine_3_1_class_options(MachineClass *mc)
> > smc->default_caps.caps[SPAPR_CAP_SBBC] = SPAPR_CAP_BROKEN;
> > smc->default_caps.caps[SPAPR_CAP_IBS] = SPAPR_CAP_BROKEN;
> > smc->default_caps.caps[SPAPR_CAP_LARGE_DECREMENTER] = SPAPR_CAP_OFF;
> > + smc->phb_placement = phb_placement_3_1;
>
> I think this should be renamed and go into the 4.0 machine type as it
> has already been released.
Drat, good point. The patch is already merged, but I'm writing a
followup to correct this. We haven't released since the wrong one was
merged, so that should be ok.
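
For illustration only, here is a minimal standalone sketch of the compat
pattern under discussion: the default placement hook computes the NVLink2
windows from the PHB index, and an older machine type overrides the hook to
zero them. It is not the actual follow-up patch. The names
phb_placement_new() and phb_placement_4_0() are hypothetical, the hook
signature is cut down to the two new parameters, and the window base/size
values are placeholders rather than the real constants from
include/hw/pci-host/spapr.h.

```c
#include <stdint.h>

typedef uint64_t hwaddr;

/* Placeholder values for illustration; NOT the real QEMU constants. */
#define NV2RAM64_WIN_BASE 0x1000000000ULL
#define NV2RAM64_WIN_SIZE 0x0100000000ULL
#define NV2ATSD_WIN_BASE  0x2000000000ULL
#define NV2ATSD_WIN_SIZE  0x0000010000ULL

/* Simplified hook type: only the two parameters added by the patch. */
typedef void (*phb_placement_fn)(uint32_t index,
                                 hwaddr *nv2gpa, hwaddr *nv2atsd);

/* Default placement: one fixed-size window per PHB index. */
void phb_placement_new(uint32_t index, hwaddr *nv2gpa, hwaddr *nv2atsd)
{
    *nv2gpa = NV2RAM64_WIN_BASE + (hwaddr)index * NV2RAM64_WIN_SIZE;
    *nv2atsd = NV2ATSD_WIN_BASE + (hwaddr)index * NV2ATSD_WIN_SIZE;
}

/*
 * Hypothetical compat hook for the newest already-released machine type:
 * reuse the default placement, then zero the NVLink2 windows so the
 * guest-visible layout of that machine type does not change.
 */
void phb_placement_4_0(uint32_t index, hwaddr *nv2gpa, hwaddr *nv2atsd)
{
    phb_placement_new(index, nv2gpa, nv2atsd);
    *nv2gpa = 0;
    *nv2atsd = 0;
}
```

Attaching the override to the 4.0 class options rather than 3.1 should keep
every released machine type unchanged, on the assumption that older class
options chain to the newer ones as usual.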
--
David Gibson | I'll have my music baroque, and my code
david AT gibson.dropbear.id.au | minimalist, thank you. NOT _the_ _other_
| _way_ _around_!
http://www.ozlabs.org/~dgibson