From mboxrd@z Thu Jan 1 00:00:00 1970 Received: from eggs.gnu.org ([2001:4830:134:3::10]:57426) by lists.gnu.org with esmtp (Exim 4.71) (envelope-from ) id 1btSVE-00031l-UU for qemu-devel@nongnu.org; Mon, 10 Oct 2016 00:47:55 -0400 Received: from Debian-exim by eggs.gnu.org with spam-scanned (Exim 4.71) (envelope-from ) id 1btSVB-0005LO-Lj for qemu-devel@nongnu.org; Mon, 10 Oct 2016 00:47:51 -0400 Date: Mon, 10 Oct 2016 15:07:01 +1100 From: David Gibson Message-ID: <20161010040701.GA22498@umbus.fritz.box> References: <1475722987-18644-1-git-send-email-david@gibson.dropbear.id.au> <1475722987-18644-2-git-send-email-david@gibson.dropbear.id.au> <7507c5a4-85b9-8759-b12c-8f4fcaf1bfb2@ozlabs.ru> <20161007051008.GV18490@umbus.fritz.box> <4cb9f978-26a9-871a-b9bb-65f945f73eae@ozlabs.ru> <20161007091754.GX18490@umbus.fritz.box> <220c7dcb-e647-4e15-6824-7cbb0fc7dd17@ozlabs.ru> MIME-Version: 1.0 Content-Type: multipart/signed; micalg=pgp-sha256; protocol="application/pgp-signature"; boundary="XsQoSWH+UP9D9v3l" Content-Disposition: inline In-Reply-To: <220c7dcb-e647-4e15-6824-7cbb0fc7dd17@ozlabs.ru> Subject: Re: [Qemu-devel] [RFC 1/4] spapr_pci: Delegate placement of PCI host bridges to machine type List-Id: List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , To: Alexey Kardashevskiy Cc: qemu-ppc@nongnu.org, qemu-devel@nongnu.org, benh@kernel.crashing.org, thuth@redhat.com, lvivier@redhat.com, agraf@suse.de, mst@redhat.com, mdroth@linux.vnet.ibm.com, nikunj@linux.vnet.ibm.com, bharata@linux.vnet.ibm.com, abologna@redhat.com, mpolednik@redhat.com --XsQoSWH+UP9D9v3l Content-Type: text/plain; charset=us-ascii Content-Disposition: inline Content-Transfer-Encoding: quoted-printable On Mon, Oct 10, 2016 at 12:04:29PM +1100, Alexey Kardashevskiy wrote: > On 07/10/16 20:17, David Gibson wrote: > > On Fri, Oct 07, 2016 at 04:34:59PM +1100, Alexey Kardashevskiy wrote: > >> On 07/10/16 16:10, David Gibson wrote: > >>> On Fri, Oct 07, 2016 at 02:57:43PM +1100, Alexey Kardashevskiy wrote: > >>>> On 06/10/16 14:03, David Gibson wrote: > >>>>> The 'spapr-pci-host-bridge' represents the virtual PCI host bridge = (PHB) > >>>>> for a PAPR guest. Unlike on x86, it's routine on Power (both bare = metal > >>>>> and PAPR guests) to have numerous independent PHBs, each controllin= g a > >>>>> separate PCI domain. > >>>>> > >>>>> There are two ways of configuring the spapr-pci-host-bridge device:= first > >>>>> it can be done fully manually, specifying the locations and sizes o= f all > >>>>> the IO windows. This gives the most control, but is very awkward w= ith 6 > >>>>> mandatory parameters. Alternatively just an "index" can be specifi= ed > >>>>> which essentially selects from an array of predefined PHB locations. > >>>>> The PHB at index 0 is automatically created as the default PHB. > >>>>> > >>>>> The current set of default locations causes some problems for guest= s with > >>>>> large RAM (> 1 TiB) or PCI devices with very large BARs (e.g. big n= Vidia > >>>>> GPGPU cards via VFIO). Obviously, for migration we can only change= the > >>>>> locations on a new machine type, however. > >>>>> > >>>>> This is awkward, because the placement is currently decided within = the > >>>>> spapr-pci-host-bridge code, so it breaks abstraction to look inside= the > >>>>> machine type version. > >>>>> > >>>>> So, this patch delegates the "default mode" PHB placement from the > >>>>> spapr-pci-host-bridge device back to the machine type via a public = method > >>>>> in sPAPRMachineClass. It's still a bit ugly, but it's about the be= st we > >>>>> can do. > >>>>> > >>>>> For now, this just changes where the calculation is done. It doesn= 't > >>>>> change the actual location of the host bridges, or any other behavi= our. > >>>>> > >>>>> Signed-off-by: David Gibson > >>>>> --- > >>>>> hw/ppc/spapr.c | 34 ++++++++++++++++++++++++++++++++++ > >>>>> hw/ppc/spapr_pci.c | 22 ++++++++-------------- > >>>>> include/hw/pci-host/spapr.h | 11 +---------- > >>>>> include/hw/ppc/spapr.h | 4 ++++ > >>>>> 4 files changed, 47 insertions(+), 24 deletions(-) > >>>>> > >>>>> diff --git a/hw/ppc/spapr.c b/hw/ppc/spapr.c > >>>>> index 03e3803..f6e9c2a 100644 > >>>>> --- a/hw/ppc/spapr.c > >>>>> +++ b/hw/ppc/spapr.c > >>>>> @@ -2370,6 +2370,39 @@ static HotpluggableCPUList *spapr_query_hotp= luggable_cpus(MachineState *machine) > >>>>> return head; > >>>>> } > >>>>> =20 > >>>>> +static void spapr_phb_placement(sPAPRMachineState *spapr, uint32_t= index, > >>>>> + uint64_t *buid, hwaddr *pio, hwadd= r *pio_size, > >>>>> + hwaddr *mmio, hwaddr *mmio_size, > >>>>> + unsigned n_dma, uint32_t *liobns, = Error **errp) > >>>>> +{ > >>>>> + const uint64_t base_buid =3D 0x800000020000000ULL; > >>>>> + const hwaddr phb0_base =3D 0x10000000000ULL; /* 1 TiB */ > >>>>> + const hwaddr phb_spacing =3D 0x1000000000ULL; /* 64 GiB */ > >>>>> + const hwaddr mmio_offset =3D 0xa0000000; /* 2 GiB + 512 MiB */ > >>>>> + const hwaddr pio_offset =3D 0x80000000; /* 2 GiB */ > >>>>> + const uint32_t max_index =3D 255; > >>>>> + > >>>>> + hwaddr phb_base; > >>>>> + int i; > >>>>> + > >>>>> + if (index > max_index) { > >>>>> + error_setg(errp, "\"index\" for PAPR PHB is too large (max= %u)", > >>>>> + max_index); > >>>>> + return; > >>>>> + } > >>>>> + > >>>>> + *buid =3D base_buid + index; > >>>>> + for (i =3D 0; i < n_dma; ++i) { > >>>>> + liobns[i] =3D SPAPR_PCI_LIOBN(index, i); > >>>>> + } > >>>>> + > >>>>> + phb_base =3D phb0_base + index * phb_spacing; > >>>>> + *pio =3D phb_base + pio_offset; > >>>>> + *pio_size =3D SPAPR_PCI_IO_WIN_SIZE; > >>>>> + *mmio =3D phb_base + mmio_offset; > >>>>> + *mmio_size =3D SPAPR_PCI_MMIO_WIN_SIZE; > >>>>> +} > >>>>> + > >>>>> static void spapr_machine_class_init(ObjectClass *oc, void *data) > >>>>> { > >>>>> MachineClass *mc =3D MACHINE_CLASS(oc); > >>>>> @@ -2406,6 +2439,7 @@ static void spapr_machine_class_init(ObjectCl= ass *oc, void *data) > >>>>> mc->query_hotpluggable_cpus =3D spapr_query_hotpluggable_cpus; > >>>>> fwc->get_dev_path =3D spapr_get_fw_dev_path; > >>>>> nc->nmi_monitor_handler =3D spapr_nmi; > >>>>> + smc->phb_placement =3D spapr_phb_placement; > >>>>> } > >>>>> =20 > >>>>> static const TypeInfo spapr_machine_info =3D { > >>>>> diff --git a/hw/ppc/spapr_pci.c b/hw/ppc/spapr_pci.c > >>>>> index 4f00865..c0fc964 100644 > >>>>> --- a/hw/ppc/spapr_pci.c > >>>>> +++ b/hw/ppc/spapr_pci.c > >>>>> @@ -1311,7 +1311,8 @@ static void spapr_phb_realize(DeviceState *de= v, Error **errp) > >>>>> sphb->ddw_enabled ? SPAPR_PCI_DMA_MAX_WINDOWS : 1; > >>>>> =20 > >>>>> if (sphb->index !=3D (uint32_t)-1) { > >>>>> - hwaddr windows_base; > >>>>> + sPAPRMachineClass *smc =3D SPAPR_MACHINE_GET_CLASS(spapr); > >>>>> + Error *local_err =3D NULL; > >>>>> =20 > >>>>> if ((sphb->buid !=3D (uint64_t)-1) || (sphb->dma_liobn[0] = !=3D (uint32_t)-1) > >>>>> || (sphb->dma_liobn[1] !=3D (uint32_t)-1 && windows_su= pported =3D=3D 2) > >>>>> @@ -1322,21 +1323,14 @@ static void spapr_phb_realize(DeviceState *= dev, Error **errp) > >>>>> return; > >>>>> } > >>>>> =20 > >>>>> - if (sphb->index > SPAPR_PCI_MAX_INDEX) { > >>>>> - error_setg(errp, "\"index\" for PAPR PHB is too large = (max %u)", > >>>>> - SPAPR_PCI_MAX_INDEX); > >>>>> + smc->phb_placement(spapr, sphb->index, > >>>>> + &sphb->buid, &sphb->io_win_addr, &sphb-= >io_win_size, > >>>>> + &sphb->mem_win_addr, &sphb->mem_win_siz= e, > >>>>> + windows_supported, sphb->dma_liobn, &lo= cal_err); > >>>>> + if (local_err) { > >>>>> + error_propagate(errp, local_err); > >>>>> return; > >>>>> } > >>>>> - > >>>>> - sphb->buid =3D SPAPR_PCI_BASE_BUID + sphb->index; > >>>>> - for (i =3D 0; i < windows_supported; ++i) { > >>>>> - sphb->dma_liobn[i] =3D SPAPR_PCI_LIOBN(sphb->index, i); > >>>>> - } > >>>>> - > >>>>> - windows_base =3D SPAPR_PCI_WINDOW_BASE > >>>>> - + sphb->index * SPAPR_PCI_WINDOW_SPACING; > >>>>> - sphb->mem_win_addr =3D windows_base + SPAPR_PCI_MMIO_WIN_O= FF; > >>>>> - sphb->io_win_addr =3D windows_base + SPAPR_PCI_IO_WIN_OFF; > >>>>> } > >>>>> =20 > >>>>> if (sphb->buid =3D=3D (uint64_t)-1) { > >>>>> diff --git a/include/hw/pci-host/spapr.h b/include/hw/pci-host/spap= r.h > >>>>> index 30dbd46..8c9ebfd 100644 > >>>>> --- a/include/hw/pci-host/spapr.h > >>>>> +++ b/include/hw/pci-host/spapr.h > >>>>> @@ -79,18 +79,9 @@ struct sPAPRPHBState { > >>>>> uint32_t numa_node; > >>>>> }; > >>>>> =20 > >>>>> -#define SPAPR_PCI_MAX_INDEX 255 > >>>>> - > >>>>> -#define SPAPR_PCI_BASE_BUID 0x800000020000000ULL > >>>>> - > >>>>> #define SPAPR_PCI_MEM_WIN_BUS_OFFSET 0x80000000ULL > >>>>> =20 > >>>>> -#define SPAPR_PCI_WINDOW_BASE 0x10000000000ULL > >>>>> -#define SPAPR_PCI_WINDOW_SPACING 0x1000000000ULL > >>>>> -#define SPAPR_PCI_MMIO_WIN_OFF 0xA0000000 > >>>>> -#define SPAPR_PCI_MMIO_WIN_SIZE (SPAPR_PCI_WINDOW_SPACING - \ > >>>>> - SPAPR_PCI_MEM_WIN_BUS_OFFSET) > >>>>> -#define SPAPR_PCI_IO_WIN_OFF 0x80000000 > >>>>> +#define SPAPR_PCI_MMIO_WIN_SIZE 0xf80000000 > >>>>> #define SPAPR_PCI_IO_WIN_SIZE 0x10000 > >>>>> =20 > >>>>> #define SPAPR_PCI_MSI_WINDOW 0x40000000000ULL > >>>>> diff --git a/include/hw/ppc/spapr.h b/include/hw/ppc/spapr.h > >>>>> index 39dadaa..9e1c5a5 100644 > >>>>> --- a/include/hw/ppc/spapr.h > >>>>> +++ b/include/hw/ppc/spapr.h > >>>>> @@ -40,6 +40,10 @@ struct sPAPRMachineClass { > >>>>> bool dr_lmb_enabled; /* enable dynamic-reconfig/hotplug = of LMBs */ > >>>>> bool use_ohci_by_default; /* use USB-OHCI instead of XHCI */ > >>>>> const char *tcg_default_cpu; /* which (TCG) CPU to simulate by= default */ > >>>>> + void (*phb_placement)(sPAPRMachineState *spapr, uint32_t index, > >>>>> + uint64_t *buid, hwaddr *pio, hwaddr *pio= _size, > >>>>> + hwaddr *mmio, hwaddr *mmio_size, > >>>>> + unsigned n_dma, uint32_t *liobns, Error = **errp); > >>>> > >>>> > >>>> I still like idea of getting rid of index better. In order to reduce= the > >>>> command line length, you could not enable IO and MMIO32 by default, = the > >>>> default size for MMIO64 could be set to 1TB or something like this, = BUID > >>>> could be just a raw MMIO64 address value. > >>> > >>> Hm, interesting idea. I do like the idea of making the BUID equal to > >>> the MMIO64 address. Obviously we'd still need addresses for the > >>> default PHB, including 32-bit and PIO windows. > >>> > >>> Trickier, we'd still need to support 'index' for compatibility with > >>> older command lines. I guess we could reserve index purely for the > >>> "legacy" locations - 2.8+ machine types would create the default > >>> bridge with explicit windows, rather than just using index 0. > >>> > >>>> So essentially you would only have to specify MMIO64 and 2xLIOBN in = the > >>>> command line which does not seem that much of overkill assuming that= a > >>>> second (third,...) PHB is almost never used and even if it is, it wi= ll be > >>>> used for SRIOV VF (which can do 64bit) or virtio (the same). > >>> > >>> Hm, maybe. 3 values is still a bit of a PITA. I guess liobn64 could > >>> theoretically be optional, although you're going to want it for > >>> basically every case where you're creating an additional PHB. > >>> > >>> Or.. what if we insisted the mmio64 window had at least 4G alignment, > >>> then we might be able to use mmio64 >> 32 (=3D=3D buid >> 32) as the > >>> liobn. Hmm.. guess it would need to be 8G alignment, so we have one > >>> extra bit. > >> > >> > >> As you mentioned in 4/4, guests cannot access above 64 TiB, we could u= se > >> upper bits to store the window number. > >=20 > > *Some* guests can't access above 64 TiB, that's not an inherent > > limitation. So I'd rather not do that. > >=20 > >> > >> > >>> > >>> ..and I wonder if this might be easier to manage if we introduced a > >>> new spapr-pci-host-bridge-2 subtype. > >>> > >>> Let me think about this.. > >>> > >>>> With index and phb_placement(), QEMU will still auto-generate these = IO/MMIO > >>>> addresses from the command line parameters which did hit us with mig= ration > >>>> before as the QEMU command line on the source side and the destinati= on side > >>>> is not exactly the same, and this might hit again... > >>> > >>> Uh.. I'm not quite following you. What's the situation which would > >>> break migration with the proposed scheme? > >> > >> Well, I cannot think of any other device available on SPAPR which we c= ould > >> map to the CPU address space but if it ever appears, we will have a pr= oblem > >> unless this new imaginary device does not pick an address from the CPU > >> space automatically. > >=20 > > I followed that even less that the previous one, I can't even parse tha= t. >=20 > For example, 2 ibm-veth devices, each with assigned MAC. The source libvi= rt > puts them in the QEMU command line in som order and they get automatically > assigned interrupts from a pool. Then we migrate and since libvirt does n= ot > read the ibm-veth.irq QOM property on the source side and therefore does > not send it to the destination side to use in the QEMU command line, there > is a change that on the destination side our ibm-veth devices will get > different IRQ numbers and VMSTATE_UINT32_EQUAL(irq, VIOsPAPRDevice) will = fail. >=20 > So either guest should be able to keep working with new IRQ numbers > (impossible) or we should make sure that libvirt sends them over or libvi= rt > always picks the correct number (derived from reg? should not depend on a= ny > other hardware). The same applies to addresses where you map PHB > MMIO. Ah, right. Yeah, we have a bunch of problems like this because I did a bunch of irq allocation from a pool, before I was thinking about migration. I've tried to avoid that more recently, but it's hard to see alternatives when you can have so many virtual devices. For PHBs we already migrate the necessary information (lsi_table and the equivalent stuff for MSI). If we don't for VIO, then that's a bug, but I don't see that it's really related to this rework of PHB placement. > ps. may be libvirt does send irq for vio devices now, dunno, but there was > a bug like this :) --=20 David Gibson | I'll have my music baroque, and my code david AT gibson.dropbear.id.au | minimalist, thank you. NOT _the_ _other_ | _way_ _around_! http://www.ozlabs.org/~dgibson --XsQoSWH+UP9D9v3l Content-Type: application/pgp-signature; name="signature.asc" -----BEGIN PGP SIGNATURE----- Version: GnuPG v2 iQIcBAEBCAAGBQJX+xPiAAoJEGw4ysog2bOS3qgQANiCZMDgtv3h5RdrGgnYDoTw E7PLJ816VxHQZid5XRXUxdQ4nf3E4xwEO/5H2brtTqYqHuRMxXAvubaD6x8X7TEX TwHLJ+mBkZd0N+wrQNmft+gR6ace2JvxsvpmEIneYQ2h+7zsNkkCG3/EAJO9MTpV 9mTuIQbQeZRp6msekn+DN7/rNFGhk7Fa1KZGEesp3LCicMITtzXN8ALeeC9S80tM 2YMRNBjiD2WG+b4xw8NVcWRM2KMqXW1dB71U/9VRDLzKzpozh/ar0bW0Qlo9E7v7 xc4gm8Kwl9Ey3Q8FouDkS78JPv9PxLvQ0hW/8dn3YTj511FVHBDvgjRal+aIExq5 PodMHrBBAmMn+3HBwZX1dvMGTmpb+6cPR5yM/bYae4uEMOPlrsnVAPMd31TERT/c gl1sjLaIeDoPEbn2tMSFtyswID9byK776ZssLGXCRI2xgoKHkVe96bJ6Z9LETgh5 5KJ77fRYvf+Ng/CuiWMi35TOAU0sbim8S6/g1zJZAN5IENvy4cO43fTA1D6yJkT2 imFzLJ/sbf6Q4dwAYUl4SuUeakKO1eWD7hzr6Mc5byFAmtc3k0g+z80Sg+r3lnTn NKZke3+Ql6ihvjZly9JIjkjrTqXQQJBCHZ24EGF0lOEHhVhdg8mmDrbN3l7Ww49Q hF8JoUeQOL4HoQ1/nQCg =pfhY -----END PGP SIGNATURE----- --XsQoSWH+UP9D9v3l--