From mboxrd@z Thu Jan 1 00:00:00 1970
Received: from eggs.gnu.org ([2001:4830:134:3::10]:51947) by lists.gnu.org with esmtp (Exim 4.71) (envelope-from ) id 1Z8Ncv-0001Vm-2S for qemu-devel@nongnu.org; Fri, 26 Jun 2015 03:00:42 -0400
Received: from Debian-exim by eggs.gnu.org with spam-scanned (Exim 4.71) (envelope-from ) id 1Z8Ncp-0008Mf-Jg for qemu-devel@nongnu.org; Fri, 26 Jun 2015 03:00:41 -0400
Date: Fri, 26 Jun 2015 17:01:05 +1000
From: David Gibson
Message-ID: <20150626070105.GB27737@voom.redhat.com>
References: <1434627456-13745-1-git-send-email-aik@ozlabs.ru> <20150623064442.GC13352@voom.redhat.com> <558A8BF8.3080509@ozlabs.ru> <1435262378.3700.390.camel@redhat.com>
MIME-Version: 1.0
Content-Type: multipart/signed; micalg=pgp-sha1; protocol="application/pgp-signature"; boundary="f2QGlHpHGjS2mn6Y"
Content-Disposition: inline
In-Reply-To: <1435262378.3700.390.camel@redhat.com>
Subject: Re: [Qemu-devel] [PATCH qemu v8 00/14] spapr: vfio: Enable Dynamic DMA windows (DDW)
List-Id:
List-Unsubscribe: ,
List-Archive:
List-Post:
List-Help:
List-Subscribe: ,
To: Alex Williamson
Cc: Alexey Kardashevskiy , qemu-ppc@nongnu.org, qemu-devel@nongnu.org, Gavin Shan , Alexander Graf

--f2QGlHpHGjS2mn6Y
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
Content-Transfer-Encoding: quoted-printable

On Thu, Jun 25, 2015 at 01:59:38PM -0600, Alex Williamson wrote:
> On Wed, 2015-06-24 at 20:52 +1000, Alexey Kardashevskiy wrote:
> > On 06/23/2015 04:44 PM, David Gibson wrote:
> > > On Thu, Jun 18, 2015 at 09:37:22PM +1000, Alexey Kardashevskiy wrote:
> > >>
> > >> (cut-n-paste from kernel patchset)
> > >>
> > >> Each Partitionable Endpoint (IOMMU group) has an address range on a PCI bus
> > >> where devices are allowed to do DMA. These ranges are called DMA windows.
> > >> By default, there is a single DMA window, 1 or 2GB big, mapped at zero
> > >> on a PCI bus.
> > >>
> > >> PAPR defines a DDW RTAS API which allows pseries guests
> > >> to query the hypervisor about DDW support and capabilities (page size mask
> > >> for now). A pseries guest may request additional (to the default)
> > >> DMA windows using this RTAS API.
> > >> The existing pseries Linux guests request an additional window as big as
> > >> the guest RAM and map the entire guest window which effectively creates
> > >> direct mapping of the guest memory to a PCI bus.
> > >>
> > >> This patchset reworks PPC64 IOMMU code and adds necessary structures
> > >> to support big windows.
> > >>
> > >> Once a Linux guest discovers the presence of DDW, it does:
> > >> 1. query hypervisor about number of available windows and page size masks;
> > >> 2. create a window with the biggest possible page size (today 4K/64K/16M);
> > >> 3. map the entire guest RAM via H_PUT_TCE* hypercalls;
> > >> 4. switch dma_ops to direct_dma_ops on the selected PE.
> > >>
> > >> Once this is done, H_PUT_TCE is not called anymore for 64bit devices and
> > >> the guest does not waste time on DMA map/unmap operations.
> > >>
> > >> Note that 32bit devices won't use DDW and will keep using the default
> > >> DMA window so KVM optimizations will be required (to be posted later).
> > >>
> > >> This patchset adds DDW support for pseries. The host kernel changes are
> > >> required, posted as:
> > >>
> > >> [PATCH kernel v11 00/34] powerpc/iommu/vfio: Enable Dynamic DMA windows
> > >>
> > >> This patchset is based on git://github.com/dgibson/qemu.git spapr-next branch.
> > >
> > > A couple of general queries - this touches on the kernel part as well
> > > as the qemu part:
> > >
> > > * Am I correct in thinking that the point in doing the
> > >   pre-registration stuff is to allow the kernel to handle PUT_TCE
> > >   in real mode? i.e.
that the advantage of doing preregistration
> > >   rather than accounting on the DMA_MAP and DMA_UNMAP itself only
> > >   appears once you have kernel KVM+VFIO acceleration?
> >
> >
> > Handling PUT_TCE includes 2 things:
> > 1. get_user_pages_fast() and put_page()
> > 2. update locked_vm
> >
> > Both are tricky in real mode but 2) is also tricky in virtual mode as I
> > have to deal with multiple unrelated 32bit and 64bit windows (VFIO does not
> > care if they belong to one or many processes) with IOMMU page size==4K and
> > gup/put_page working with 64k pages (our default page size for host kernel).
> >
> > But yes, without keeping real mode handlers in mind, this thing could have
> > been made simpler.
> >
> >
> > > * Do you have test numbers to show that it's still worthwhile to have
> > >   kernel acceleration once you have a guest using DDW? With DDW in
> > >   play, even if PUT_TCE is slow, it should be called a lot less
> > >   often.
> >
> > With DDW, the whole RAM is mapped once at first set_dma_mask(64bit) called by
> > the guest, it is just a few PUT_TCE_INDIRECT calls.
> >
> > If the guest uses DDW, real mode handlers cannot possibly beat it and I
> > have reports that real mode handlers are noticeably slower than direct DMA
> > mapping (i.e. DDW) for 40Gb devices (10Gb seems to be fine but I have not
> > tried a dozen of guests yet).
> >
> >
> > > The reason I ask is that the preregistration handling is a pretty big
> > > chunk of code that inserts itself into some pretty core kernel data
> > > structures, all for one pretty specific use case. We only want to do
> > > that if there's a strong justification for it.
> >
> > Exactly. I keep asking Ben and Paul periodically if we want to keep it and
> > the answer is always yes :)
> >
> >
> > About "vfio: spapr: Move SPAPR-related code to a separate file" - I guess I
> > am better off removing it for now, right?
>
> I won't block the series for moving the code out, but it seems like
> we're consistently making spapr an exception. It seemed like we had
> settled on some spapr code that could be semi-portable to any
> guest-based IOMMU model, but patch 02/14 pulls that off into the spapr
> specific wart. Theoretically we could support a guest-based IOMMU with
> type1 code as well, we just don't have much impetus to do so nor does
> type1 really have a mapping interface designed for that kind of
> performance. We can always pull the code out of spapr or git history if
> we have some need for it though. Thanks,

Yeah, I think 2/14 should just be dropped. There are certainly places
(particularly on the kernel side) where it makes sense to split out the
code for the different IOMMU types. But the code that's being moved
in 2/14 isn't really different for spapr, nor is it large enough that
it needs to be in its own file.

-- 
David Gibson                    | I'll have my music baroque, and my code
david AT gibson.dropbear.id.au  | minimalist, thank you. NOT _the_ _other_
                                | _way_ _around_!
http://www.ozlabs.org/~dgibson

--f2QGlHpHGjS2mn6Y
Content-Type: application/pgp-signature

-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1

iQIcBAEBAgAGBQJVjPixAAoJEGw4ysog2bOSk9IP/08zanWoMqcXxpsAw7zCZxip
OhGv3cOCq0sdkwHAfinFyDxOpyLkw+jKK3UjwzUTOlN7ciem0oYpIMGXty+/T76s
smidvbX/LgIuT1bYbN2yWE1gA49/PzvQbVKL1xC1sPfjc8KcbCXp4d0uoWRs5gxw
24Xl/lgartiwsSjpEkNSR2IpnOmgsViMSZA9PCfZ9txI5rBLZpAeZbGPrcutbrSI
VrBhciRabbFm3kHsPBZCGyLqualKH96CgO200KQnQIJBkGXPiPBDVfCuQqua8s2N
EQtvUM1h27f55JoI4/pgRlj/yQsD4J8sy91nsoJX4Y0rdHivNdeFr3eGNgYlGs7V
6AgsgSH4clxhEUc6SbU2twd/f2Lz9gs5WJe5xA3cvcNXQHC8w4SWFV/Vw83D3you
wk1zU3DTbp+n4Gz+zOwz+3TTfQA73ILj8rKvZw0OGUncaA/Tv0y2fDJvPtCpubV3
39qzAOr/Qoazu2oodA/BpCbCBJmjV2JuyTR9J2COoDmJNp9JhkKFyL/xJMtgZMe7
y+vDGafLJ8+W/W9ZUUaKqyU7Cq2OCrxRoXQI/O5RaFz5atAoFzzxIfjXlGa3UDxW
wov/8SyYeg6rssdyPfILbZcRZVktVjilbW1OoUM1NLXstNS+b06b/hqAUduiIROx
CBTXW/aq4c/pMrulQh4C
=oi0w
-----END PGP SIGNATURE-----

--f2QGlHpHGjS2mn6Y--