From: David Gibson
Date: Fri, 15 Aug 2014 10:09:16 +1000
Subject: Re: [Qemu-devel] [RFC PATCH 09/10] spapr_pci_vfio: Enable DDW
Message-ID: <20140815000916.GL7628@voom.redhat.com>
In-Reply-To: <53ECBBE5.3070505@suse.de>
References: <1406799254-25223-10-git-send-email-aik@ozlabs.ru> <53E8B0BB.60006@suse.de> <53E8DAC2.1020307@ozlabs.ru> <53E8FDCC.7060308@suse.de> <53E959DA.3020206@ozlabs.ru> <53E9E077.1040804@suse.de> <53EA2E6F.6060002@ozlabs.ru> <53EA3298.2040600@suse.de> <53EAAED8.1040206@ozlabs.ru> <53ECBBE5.3070505@suse.de>
To: Alexander Graf
Cc: Alexey Kardashevskiy, Alex Williamson, qemu-ppc@nongnu.org, qemu-devel@nongnu.org

On Thu, Aug 14, 2014 at 03:38:45PM +0200, Alexander Graf wrote:
> 
> On 13.08.14 02:18, Alexey Kardashevskiy wrote:
> >On 08/13/2014 01:28 AM, Alexander Graf wrote:
> >>On 12.08.14 17:10, Alexey Kardashevskiy wrote:
> >>>On 08/12/2014 07:37 PM, Alexander Graf wrote:
> >>>>On 12.08.14 02:03, Alexey Kardashevskiy wrote:
> >>>>>On 08/12/2014 03:30 AM, Alexander Graf wrote:
> >>>>>>On 11.08.14 17:01, Alexey Kardashevskiy wrote:
> >>>>>>>On 08/11/2014 10:02 PM, Alexander Graf wrote:
> >>>>>>>>On 31.07.14 11:34, Alexey Kardashevskiy wrote:
> >>>>>>>>>This implements DDW for VFIO. Host kernel support is required for
> >>>>>>>>>this.
> >>>>>>>>>
> >>>>>>>>>Signed-off-by: Alexey Kardashevskiy
> >>>>>>>>>---
> >>>>>>>>> hw/ppc/spapr_pci_vfio.c | 75 +++++++++++++++++++++++++++++++++++++++++++++++++
> >>>>>>>>> 1 file changed, 75 insertions(+)
> >>>>>>>>>
> >>>>>>>>>diff --git a/hw/ppc/spapr_pci_vfio.c b/hw/ppc/spapr_pci_vfio.c
> >>>>>>>>>index d3bddf2..dc443e2 100644
> >>>>>>>>>--- a/hw/ppc/spapr_pci_vfio.c
> >>>>>>>>>+++ b/hw/ppc/spapr_pci_vfio.c
> >>>>>>>>>@@ -69,6 +69,77 @@ static void spapr_phb_vfio_finish_realize(sPAPRPHBState *sphb, Error **errp)
> >>>>>>>>>     /* Register default 32bit DMA window */
> >>>>>>>>>     memory_region_add_subregion(&sphb->iommu_root, tcet->bus_offset,
> >>>>>>>>>                                 spapr_tce_get_iommu(tcet));
> >>>>>>>>>+
> >>>>>>>>>+    sphb->ddw_supported = !!(info.flags & VFIO_IOMMU_SPAPR_TCE_FLAG_DDW);
> >>>>>>>>>+}
> >>>>>>>>>+
> >>>>>>>>>+static int spapr_pci_vfio_ddw_query(sPAPRPHBState *sphb,
> >>>>>>>>>+                                    uint32_t *windows_available,
> >>>>>>>>>+                                    uint32_t *page_size_mask)
> >>>>>>>>>+{
> >>>>>>>>>+    sPAPRPHBVFIOState *svphb = SPAPR_PCI_VFIO_HOST_BRIDGE(sphb);
> >>>>>>>>>+    struct vfio_iommu_spapr_tce_query query = { .argsz = sizeof(query) };
> >>>>>>>>>+    int ret;
> >>>>>>>>>+
> >>>>>>>>>+    ret = vfio_container_ioctl(&sphb->iommu_as, svphb->iommugroupid,
> >>>>>>>>>+                               VFIO_IOMMU_SPAPR_TCE_QUERY, &query);
> >>>>>>>>>+    if (ret) {
> >>>>>>>>>+        return ret;
> >>>>>>>>>+    }
> >>>>>>>>>+
> >>>>>>>>>+    *windows_available = query.windows_available;
> >>>>>>>>>+    *page_size_mask = query.page_size_mask;
> >>>>>>>>>+
> >>>>>>>>>+    return ret;
> >>>>>>>>>+}
> >>>>>>>>>+
> >>>>>>>>>+static int spapr_pci_vfio_ddw_create(sPAPRPHBState *sphb, uint32_t page_shift,
> >>>>>>>>>+                                     uint32_t window_shift, uint32_t liobn,
> >>>>>>>>>+                                     sPAPRTCETable **ptcet)
> >>>>>>>>>+{
> >>>>>>>>>+    sPAPRPHBVFIOState *svphb = SPAPR_PCI_VFIO_HOST_BRIDGE(sphb);
> >>>>>>>>>+    struct vfio_iommu_spapr_tce_create create = {
> >>>>>>>>>+        .argsz = sizeof(create),
> >>>>>>>>>+        .page_shift = page_shift,
> >>>>>>>>>+        .window_shift = window_shift,
> >>>>>>>>>+        .start_addr = 0
> >>>>>>>>>+    };
> >>>>>>>>>+    int ret;
> >>>>>>>>>+
> >>>>>>>>>+    ret = vfio_container_ioctl(&sphb->iommu_as, svphb->iommugroupid,
> >>>>>>>>>+                               VFIO_IOMMU_SPAPR_TCE_CREATE, &create);
> >>>>>>>>>+    if (ret) {
> >>>>>>>>>+        return ret;
> >>>>>>>>>+    }
> >>>>>>>>>+
> >>>>>>>>>+    *ptcet = spapr_tce_new_table(DEVICE(sphb), liobn, create.start_addr,
> >>>>>>>>>+                                 page_shift, 1 << (window_shift - page_shift),
> >>>>>>>>I spot a 1 without ULL again - this time it might work out ok, but please
> >>>>>>>>just always use ULL when you pass around addresses.
> >>>>>>>My bad. I keep forgetting this, I'll adjust my own checkpatch.py :)
> >>>>>>>
> >>>>>>>
> >>>>>>>>Please walk me through the abstraction levels on what each page size
> >>>>>>>>honoration means. If I use THP, what page size granularity can I use for
> >>>>>>>>TCE entries?
> >>>>>>>[RFC PATCH 06/10] spapr_rtas: Add Dynamic DMA windows (DDW) RTAS calls
> >>>>>>>support
> >>>>>>>
> >>>>>>>+    const struct { int shift; uint32_t mask; } masks[] = {
> >>>>>>>+        { 12, DDW_PGSIZE_4K },
> >>>>>>>+        { 16, DDW_PGSIZE_64K },
> >>>>>>>+        { 24, DDW_PGSIZE_16M },
> >>>>>>>+        { 25, DDW_PGSIZE_32M },
> >>>>>>>+        { 26, DDW_PGSIZE_64M },
> >>>>>>>+        { 27, DDW_PGSIZE_128M },
> >>>>>>>+        { 28, DDW_PGSIZE_256M },
> >>>>>>>+        { 34, DDW_PGSIZE_16G },
> >>>>>>>+    };
> >>>>>>>
> >>>>>>>
> >>>>>>>Supported page sizes are returned by the host kernel via "query". For 16MB
> >>>>>>>pages, "query" will return DDW_PGSIZE_4K|DDW_PGSIZE_64K|DDW_PGSIZE_16M.
> >>>>>>>Or I did not understand the question...
> >>>>>>Why do we care about the sizes? Anything bigger than what we support should
> >>>>>>always work, no? What happens if the guest creates a 16MB map but my pages
> >>>>>>are 4kb mapped? Wouldn't the same logic be able to deal with 16G pages?
> >>>>>It is DMA memory, if I split "virtual" 16M page to a bunch of real 4K
> >>>>>pages, I have to make sure these 16M are continuous - there will be one TCE
> >>>>>entry for it and no more translations besides IOMMU. What do I miss now?
> >>>>Who does the shadow translation where? Does it exist at all?
> >>>IOMMU? I am not sure I am following you... This IOMMU will look as direct
> >>>DMA for the guest but the real IOMMU table is sparse and it is populated
> >>>via a bunch of H_PUT_TCE calls as the default small window.
> >>>
> >>>There is a direct mapping in the host called "bypass window" but it is not
> >>>used here as sPAPR does not define that for paravirtualization.
> >>Ok, imagine I have 16MB of guest physical memory that is in reality backed
> >>by 256 64k pages on the host. The guest wants to create a 16M TCE entry for
> >>this (from its point of view contiguous) chunk of memory.
> >>
> >>Do we allow this?
> >No, we do not. We tell the guest what it can use.
> >
> >>Or do we force the guest to create 64k TCE entries?
> >16MB TCE pages are only allowed if qemu is running with hugepages.
> 
> That's unfortunate ;) but as long as we have to pin TCEd memory anyway, I
> guess it doesn't hurt as badly.
> 
> >
> >
> >>If we allow it, why would we ever put any restriction at the upper end of
> >>TCE entry sizes? If we already implement enough logic to map things lazily
> >>around, we could as well have the guest create a 256M TCE entry and just
> >>split it on the host view to 64k TCE entries.
> >Oh, thiiiiiis is what you meant...
> >
> >Well, we could, just for now current linux guests support 4K/64K/16M only
> >and they choose depending on what hypervisor supports - look at
> >enable_ddw() in the guest.
> >What you suggest seems to be an unnecessary code duplication for 16MB
> >pages case. For bigger page sizes - for example, for a 64GB guest, a TCE
> >table with 16MB TCEs will be 32KB which is already awesome enough, no?
> 
> In "normal" environments guests won't be backed by 16M pages, but by 64k
> pages with the occasional THP huge page merge that you can't rely on.
> 
> That's why I figured it'd be smart to support 16MB TCEs even when the
> underlying memory is only backed by 64k pages.

That could work for emulated PCI devices, but not for VFIO. With VFIO
the TCEs get passed through to the hardware, and so the pages mapped
must be physically contiguous, which can only happen if the guest is
backed by hugepages.

Well.. I guess you *could* fake it for VFIO, by making each guest
H_PUT_TCE result in many real TCEs being created. But I think it's a
bad idea, because it would trigger the guest to map all RAM when not
hugepage backed, and that would mean the translated (host) TCE table
would be inordinately large.

-- 
David Gibson                    | I'll have my music baroque, and my code
david AT gibson.dropbear.id.au  | minimalist, thank you.  NOT _the_ _other_
                                | _way_ _around_!
http://www.ozlabs.org/~dgibson