From mboxrd@z Thu Jan 1 00:00:00 1970 Received: from eggs.gnu.org ([2001:4830:134:3::10]:54548) by lists.gnu.org with esmtp (Exim 4.71) (envelope-from ) id 1YR4iN-0004sF-Oz for qemu-devel@nongnu.org; Thu, 26 Feb 2015 15:07:21 -0500 Received: from Debian-exim by eggs.gnu.org with spam-scanned (Exim 4.71) (envelope-from ) id 1YR4iH-0004RR-Hj for qemu-devel@nongnu.org; Thu, 26 Feb 2015 15:07:19 -0500 Received: from e36.co.us.ibm.com ([32.97.110.154]:51037) by eggs.gnu.org with esmtp (Exim 4.71) (envelope-from ) id 1YR4iH-0004QV-9r for qemu-devel@nongnu.org; Thu, 26 Feb 2015 15:07:13 -0500 Received: from /spool/local by e36.co.us.ibm.com with IBM ESMTP SMTP Gateway: Authorized Use Only! Violators will be prosecuted for from ; Thu, 26 Feb 2015 13:07:11 -0700 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: quoted-printable From: Michael Roth In-Reply-To: <20150225064026.31752.8222@loki> References: <1424096872-29868-1-git-send-email-mdroth@linux.vnet.ibm.com> <1424096872-29868-16-git-send-email-mdroth@linux.vnet.ibm.com> <20150225031129.GD15695@voom.fritz.box> <20150225051724.31752.99805@loki> <20150225064026.31752.8222@loki> Message-ID: <20150226200604.31752.68449@loki> Date: Thu, 26 Feb 2015 14:06:04 -0600 Subject: Re: [Qemu-devel] [PATCH v5 15/16] spapr_pci: enable basic hotplug operations List-Id: List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , To: David Gibson Cc: aik@ozlabs.ru, qemu-devel@nongnu.org, agraf@suse.de, ncmike@ncultra.org, qemu-ppc@nongnu.org, tyreld@linux.vnet.ibm.com, bharata.rao@gmail.com, nfont@linux.vnet.ibm.com Quoting Michael Roth (2015-02-25 00:40:26) > Quoting Michael Roth (2015-02-24 23:17:24) > > Quoting David Gibson (2015-02-24 21:11:29) > > > On Mon, Feb 16, 2015 at 08:27:51AM -0600, Michael Roth wrote: > > > > This enables hotplug for PHB bridges. Upon hotplug we generate the > > > = > > > "PCI Host Bridge bridges" :-p > > > = > > > > OF-nodes required by PAPR specification and IEEE 1275-1994 > > > > "PCI Bus Binding to Open Firmware" for the device. > > > > = > > > > We associate the corresponding FDT for these nodes with the DrcEntry > > > > corresponding to the slot, which will be fetched via > > > > ibm,configure-connector RTAS calls by the guest as described by PAPR > > > > specification. The FDT is cleaned up in the case of unplug. > > > > = > > > > Signed-off-by: Michael Roth > > > > --- > > > > hw/ppc/spapr_pci.c | 391 +++++++++++++++++++++++++++++++++++++++++= +++++++++--- > > > > 1 file changed, 371 insertions(+), 20 deletions(-) > > > > = > > > > diff --git a/hw/ppc/spapr_pci.c b/hw/ppc/spapr_pci.c > > > > index 6a3d917..b9af1cd 100644 > > > > --- a/hw/ppc/spapr_pci.c > > > > +++ b/hw/ppc/spapr_pci.c > > > > @@ -33,8 +33,11 @@ > > > > #include > > > > #include "trace.h" > > > > #include "qemu/error-report.h" > > > > +#include "qapi/qmp/qerror.h" > > > > = > > > > #include "hw/pci/pci_bus.h" > > > > +#include "hw/ppc/spapr_drc.h" > > > > +#include "sysemu/device_tree.h" > > > > = > > > > /* Copied from the kernel arch/powerpc/platforms/pseries/msi.c */ > > > > #define RTAS_QUERY_FN 0 > > > > @@ -47,7 +50,13 @@ > > > > #define RTAS_TYPE_MSI 1 > > > > #define RTAS_TYPE_MSIX 2 > > > > = > > > > -#include "hw/ppc/spapr_drc.h" > > > > +#define _FDT(exp) \ > > > > + do { \ > > > > + int ret =3D (exp); = \ > > > > + if (ret < 0) { = \ > > > > + return ret; = \ > > > > + } = \ > > > > + } while (0) > > > > = > > > > static sPAPRPHBState *find_phb(sPAPREnvironment *spapr, uint64_t b= uid) > > > > { > > > > @@ -481,6 +490,359 @@ static AddressSpace *spapr_pci_dma_iommu(PCIB= us *bus, void *opaque, int devfn) > > > > return &phb->iommu_as; > > > > } > > > > = > > > > +/* Macros to operate with address in OF binding to PCI */ > > > > +#define b_x(x, p, l) (((x) & ((1<<(l))-1)) << (p)) > > > > +#define b_n(x) b_x((x), 31, 1) /* 0 if relocatable */ > > > > +#define b_p(x) b_x((x), 30, 1) /* 1 if prefetchable */ > > > > +#define b_t(x) b_x((x), 29, 1) /* 1 if the address is ali= ased */ > > > > +#define b_ss(x) b_x((x), 24, 2) /* the space code */ > > > > +#define b_bbbbbbbb(x) b_x((x), 16, 8) /* bus number */ > > > > +#define b_ddddd(x) b_x((x), 11, 5) /* device number */ > > > > +#define b_fff(x) b_x((x), 8, 3) /* function number */ > > > > +#define b_rrrrrrrr(x) b_x((x), 0, 8) /* register number */ > > > > +/* for 'reg'/'assigned-addresses' OF properties */ > > > > +#define RESOURCE_CELLS_SIZE 2 > > > > +#define RESOURCE_CELLS_ADDRESS 3 > > > > + > > > > +typedef struct ResourceFields { > > > > + uint32_t phys_hi; > > > > + uint32_t phys_mid; > > > > + uint32_t phys_lo; > > > > + uint32_t size_hi; > > > > + uint32_t size_lo; > > > > +} ResourceFields; > > > = > > > Hrm, 5*32-bit ints, that probably needs a ((packed)) on at least some > > > platforms to be safe. > > = > > I seem to remember some rule (C99?) about padding only being used in ca= ses > > where an n-byte field doesn't naturally fall on an n-byte boundary, but > > it doesn't hurt to be safe/explicit about what we're expecting. > > = > > > = > > > > +typedef struct ResourceProps { > > > > + union { > > > > + ResourceFields fields[8]; > > > > + uint8_t data[sizeof(ResourceFields) * 8]; > > > > + } reg; > > > > + union { > > > > + ResourceFields fields[7]; > > > > + uint8_t data[sizeof(ResourceFields) * 7]; > > > > + } assigned; > > > > + uint32_t reg_len; > > > > + uint32_t assigned_len; > > > > +} ResourceProps; > > > = > > > I'm a bit dubious about this union, rather than just casting when you > > > need the bytestring version. > > = > > Yah, I may have gotten a bit carried away there... > > = > > > = > > > > + > > > > +/* fill in the 'reg'/'assigned-resources' OF properties for > > > > + * a PCI device. 'reg' describes resource requirements for a > > > > + * device's IO/MEM regions, 'assigned-addresses' describes the > > > > + * actual resource assignments. > > > > + * > > > > + * the properties are arrays of ('phys-addr', 'size') pairs descri= bing > > > > + * the addressable regions of the PCI device, where 'phys-addr' is= a > > > > + * RESOURCE_CELLS_ADDRESS-tuple of 32-bit integers corresponding to > > > > + * (phys.hi, phys.mid, phys.lo), and 'size' is a > > > > + * RESOURCE_CELLS_SIZE-tuple corresponding to (size.hi, size.lo). > > > > + * phys.hi =3D 0xYYXXXXZZ, where: > > > > + * 0xYY =3D npt000ss > > > > + * ||| | > > > > + * ||| +-- space code: 1 if IO region, 2 if MEM region > > > > + * ||+------ for non-relocatable IO: 1 if aliased > > > > + * || for relocatable IO: 1 if below 64KB > > > > + * || for MEM: 1 if below 1MB > > > > + * |+------- 1 if region is prefetchable > > > > + * +-------- 1 if region is non-relocatable > > > > + * 0xXXXX =3D bbbbbbbb dddddfff, encoding bus, slot, and function > > > > + * bits respectively > > > > + * 0xZZ =3D rrrrrrrr, the register number of the BAR correspondi= ng > > > > + * to the region > > > > + * > > > > + * phys.mid and phys.lo correspond respectively to the hi/lo porti= ons > > > > + * of the actual address of the region. > > > > + * > > > > + * how the phys-addr/size values are used differ slightly between > > > > + * 'reg' and 'assigned-addresses' properties. namely, 'reg' has > > > > + * an additional description for the config space region of the > > > > + * device, and in the case of QEMU has n=3D0 and phys.mid=3Dphys.l= o=3D0 > > > > + * to describe the region as relocatable, with an address-mapping > > > > + * that corresponds directly to the PHB's address space for the > > > > + * resource. 'assigned-addresses' always has n=3D1 set with an abs= olute > > > > + * address assigned for the resource. in general, 'assigned-addres= ses' > > > > + * won't be populated, since addresses for PCI devices are general= ly > > > > + * unmapped initially and left to the guest to assign. > > > > + * > > > > + * in accordance with PCI Bus Binding to Open Firmware, > > > > + * IEEE Std 1275-1994, section 4.1.1, as implemented by PAPR+ v2.7, > > > > + * Appendix C. > > > > + */ > > > > +static void populate_resource_props(PCIDevice *d, ResourceProps *r= p) > > > > +{ > > > > + int bus_num =3D pci_bus_num(PCI_BUS(qdev_get_parent_bus(DEVICE= (d)))); > > > > + uint32_t dev_id =3D (b_bbbbbbbb(bus_num) | > > > > + b_ddddd(PCI_SLOT(d->devfn)) | > > > > + b_fff(PCI_FUNC(d->devfn))); > > > > + ResourceFields *reg, *assigned; > > > > + int i, reg_idx =3D 0, assigned_idx =3D 0; > > > > + > > > > + /* config space region */ > > > > + reg =3D &rp->reg.fields[reg_idx++]; > > > > + reg->phys_hi =3D cpu_to_be32(dev_id); > > > > + reg->phys_mid =3D 0; > > > > + reg->phys_lo =3D 0; > > > > + reg->size_hi =3D 0; > > > > + reg->size_lo =3D 0; > > > > + > > > > + for (i =3D 0; i < PCI_NUM_REGIONS; i++) { > > > > + if (!d->io_regions[i].size) { > > > > + continue; > > > > + } > > > > + > > > > + reg =3D &rp->reg.fields[reg_idx++]; > > > > + > > > > + reg->phys_hi =3D cpu_to_be32(dev_id | b_rrrrrrrr(pci_bar(d= , i))); > > > > + if (d->io_regions[i].type & PCI_BASE_ADDRESS_SPACE_IO) { > > > > + reg->phys_hi |=3D cpu_to_be32(b_ss(1)); > > > > + } else { > > > > + reg->phys_hi |=3D cpu_to_be32(b_ss(2)); > > > > + } > > > > + reg->phys_mid =3D 0; > > > > + reg->phys_lo =3D 0; > > > > + reg->size_hi =3D cpu_to_be32(d->io_regions[i].size >> 32); > > > > + reg->size_lo =3D cpu_to_be32(d->io_regions[i].size); > > > > + > > > > + if (d->io_regions[i].addr !=3D PCI_BAR_UNMAPPED) { > > > > + continue; > > > = > > > This has inverted sense, doesn't it? If the BAR *is* unmapped you > > > want to continue, skipping the assigned-addresses stuff. > > = > > Yes, should be inverted :( It ends up not mattering much with the > > current userspace code, since we rely on PCI rescan which always > > re-assigns BARs, but definitely needs fixing. > > = > > > = > > > > + } > > > > + > > > > + assigned =3D &rp->assigned.fields[assigned_idx++]; > > > > + assigned->phys_hi =3D cpu_to_be32(reg->phys_hi | b_n(1)); > > > > + assigned->phys_mid =3D cpu_to_be32(d->io_regions[i].addr >= > 32); > > > > + assigned->phys_lo =3D cpu_to_be32(d->io_regions[i].addr); > > > = > > > Not sure if I understood the comment above properly; should these > > > addresses actually be translated through an mappings above the PHB? > > > Not that there are any such translations on sPAPR, but... > > = > > The io_regions[i].addr values are relative to the IO/MEM address spaces= for > > the device, which correspond to IO/MEM windows for the PHB. So I think = the > > memory API should handle any translation above that... > > = > > Or do you mean because of the n=3D0/relocatable IO regions described by= 'reg'? > > = > > IIUC, when n=3D0, reg->phys_{mid,lo} can be used to encode a starting o= ffset, > > so a guest can re-assign/re-program an IO region that's already been as= signed > > and described via the correspond fields in 'assigned-addresses', so lon= g as > > uses an addr above reg->phys{mid,lo}. > > = > > So, we could use those reg->phys_{mid,lo} values as an alternative to t= he > > PHBs IO/MEM windows (I guess for platforms that don't provide these win= dows > > and just expose the full address space?). > > = > > But since we have those windows, we end up not needing this, so we alwa= ys do > > n=3D0, reg->phys_mid =3D reg->phys_lo =3D 0, so the values in > > assigned->phys_{mid,lo} end up just being offsets into the IO/MEM windo= ws, > > which correspond directly to io_regions[n].addr. > = > Argh, not sure what I was thinking. io_regions[n].addr is relative to the > IO/MEM window, so from the guest perspective .addr is indeed a different > value... > = > So maybe reg->phys_mid/reg->phys_lo do in fact need to reflect the window > offsets so that the relocatable region assignments are offset > accordingly... > = > Will look into it more. So, maybe my first response wasn't as misguided as I thought. I haven't found any documentation that lays out the specifics, but the kernel code ma= kes it fairly clear that the IO/MEM addresses in assigned-addresses are relative to PHB's IO/MEM windows. The code basically does this: arch/powerpc/kernel/pci_of_scan.c: of_pci_parse_addrs() addrs =3D of_get('assigned-addresses') config_offset =3D addrs[0] && 0xff (phys.hi) pci_bar_num =3D (config_offset - PCI_BASE_ADDRESS_0) / 4 base =3D &addrs[1:2] (phys.mid/phys.lo) size =3D &addrs[3:4] (size.hi, size.lo) region.start =3D base region.end =3D base + size -1 res =3D &dev->resource[pci_bar_num]; pcibios_bus_to_resource(dev->bus, res, ®ion) drivers/pci/host-bridge.c:pcibios_bus_to_resource(bus, bar_res, bar_regio= n) bridge =3D pci_host_bridge(bus) for window in bridge->windows: # skip if window type (IO/MEM), doesn't match device BAR's res) bus_region.start =3D window->res->start - window->offset bus_region.end =3D window->res->end - window->offset if bus_region contains device BAR region: offset =3D window->offset break //take the region, add the window offset, and use that as the resource = addr res->start =3D region->start + offset res->end =3D region->end + offset I tested this with the rpaphp hotplug module, which honors assigned-addresses (current guest hotplug uses generic pci rescan instead due to a long-standing issue with rpaphp only being able to handle 1 slot per PHB if there are devices present at module-load time), by adding some test code to pre-assign the BARs during hotplug (with addresses relati= ve to PHB's IO/MEM windows) to make sure assigned-addresses values get properly translated to the correct offset in guest's io/system address space. Will add a note in the comments.