Date: Wed, 3 Apr 2013 16:19:36 -0500
From: Scott Wood
Subject: Re: [Qemu-devel] RFC: vfio API changes needed for powerpc
Message-ID: <1365023976.25627.13@snotra>
In-Reply-To: <1364960240.2882.230.camel@bling.home> (from alex.williamson@redhat.com on Tue Apr 2 22:37:20 2013)
To: Alex Williamson
Cc: Wood Scott-B07421, "kvm@vger.kernel.org", Stuart Yoder, "qemu-devel@nongnu.org", "agraf@suse.de", Yoder Stuart-B08248, "iommu@lists.linux-foundation.org", Bhushan Bharat-R65777

On 04/02/2013 10:37:20 PM, Alex Williamson wrote:
> On Tue, 2013-04-02 at 17:50 -0500, Scott Wood wrote:
> > On 04/02/2013 04:38:45 PM, Alex Williamson wrote:
> > > On Tue, 2013-04-02 at 16:08 -0500, Stuart Yoder wrote:
> > > > On Tue, Apr 2, 2013 at 3:57 PM, Scott Wood wrote:
> > > > >> > C. Explicit mapping using normal DMA map. The last idea is that
> > > > >> > we would introduce a new ioctl to give user-space an fd to
> > > > >> > the MSI bank, which could be mmapped.
> > > > >> > The flow would be
> > > > >> > something like this:
> > > > >> >    -for each group user space calls new ioctl
> > > > >> >     VFIO_GROUP_GET_MSI_FD
> > > > >> >    -user space mmaps the fd, getting a vaddr
> > > > >> >    -user space does a normal DMA map for desired iova
> > > > >> > This approach makes everything explicit, but adds a new ioctl
> > > > >> > applicable most likely only to the PAMU (type2 iommu).
> > > > >>
> > > > >> And the DMA_MAP of that mmap then allows userspace to select the window
> > > > >> used? This one seems like a lot of overhead, adding a new ioctl, new
> > > > >> fd, mmap, special mapping path, etc.
> > > > >
> > > > > There's going to be special stuff no matter what. This would keep it
> > > > > separated from the IOMMU map code.
> > > > >
> > > > > I'm not sure what you mean by "overhead" here... the runtime overhead of
> > > > > setting things up is not particularly relevant as long as it's reasonable.
> > > > > If you mean development and maintenance effort, keeping things well
> > > > > separated should help.
> > > >
> > > > We don't need to change DMA_MAP. If we can simply add a new "type 2"
> > > > ioctl that allows user space to set which windows are MSIs, it seems vastly
> > > > less complex than an ioctl to supply a new fd, mmap of it, etc.
> > > >
> > > > So maybe 2 ioctls:
> > > >     VFIO_IOMMU_GET_MSI_COUNT
> >
> > Do you mean a count of actual MSIs or a count of MSI banks used by the
> > whole VFIO group?
>
> I hope the latter, which would clarify how this is distinct from
> DEVICE_GET_IRQ_INFO. Is hotplug even on the table? Presumably
> dynamically adding a device could bring along additional MSI banks?

I'm not sure -- maybe we could say that hotplug can add banks, but not
remove them or change the order, so userspace would just need to check
if the number of banks changed, and map the extras.

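
The hotplug rule suggested above (banks may be appended, but never removed or reordered) can be modelled in a few lines. This is only an illustrative Python sketch, not VFIO code: `get_msi_bank_count` and `map_bank` are hypothetical callbacks standing in for whatever ioctls the API ends up with, and the class name is invented for the example.

```python
from typing import Callable, List


class MsiBankTracker:
    """Toy model: remembers how many MSI banks userspace has mapped so far."""

    def __init__(self, get_msi_bank_count: Callable[[], int],
                 map_bank: Callable[[int], None]) -> None:
        # Both callbacks are hypothetical stand-ins for the proposed ioctls.
        self._get_count = get_msi_bank_count
        self._map_bank = map_bank
        self.mapped = 0  # banks with index in [0, mapped) are already mapped

    def sync(self) -> List[int]:
        """Map any banks added since the last call and return their indices."""
        count = self._get_count()
        if count < self.mapped:
            # Under the proposed rule banks are never removed, so a smaller
            # count would mean the rule was violated.
            raise RuntimeError("MSI bank count shrank")
        new_banks = list(range(self.mapped, count))
        for index in new_banks:
            self._map_bank(index)
        self.mapped = count
        return new_banks
```

Because the order of existing banks is fixed, userspace only has to remember a single number; calling `sync()` after a hotplug event maps just the new tail.
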
> The current VFIO MSI support has the host handling everything about MSI.
> The user never programs an MSI vector to the physical device, they set
> up everything through ioctl. On interrupt, we simply trigger an eventfd
> and leave it to things like KVM irqfd or QEMU to do the right thing in a
> virtual machine.
>
> Here the MSI vector has to go through a PAMU window to hit the correct
> MSI bank. So that means it has some component of the iova involved,
> which we're proposing here is controlled by userspace (whether that
> vector uses an offset from 0x10000000 or 0x00000000 depending on which
> window slot is used to make the MSI bank). I assume we're still working
> in a model where the physical interrupt fires into the host and a
> host-based interrupt handler triggers an eventfd, right?

Yes (subject to possible future optimizations).

> So that means the vector also has host components so we trigger the
> correct ISR. How is that coordinated?

Everything but the iova component needs to come from the host MSI
allocator.

> Would it be possible for userspace to simply leave room for MSI bank
> mapping (how much room could be determined by something like
> VFIO_IOMMU_GET_MSI_BANK_COUNT) then document the API that userspace can
> DMA_MAP starting at the 0x0 address of the aperture, growing up, and
> VFIO will map banks on demand at the top of the aperture, growing down?
> Wouldn't that avoid a lot of issues with userspace needing to know
> anything about MSI banks (other than count) and coordinating irq numbers
> and enabling handlers?

This would restrict a (possibly unlikely) use case where the user wants
to map something near the top of the aperture but has another place
MSIs can go (or is willing to live without MSIs). Otherwise it could
be workable, as long as we can require an explicit MSI enabling on a
device to happen after the aperture and subwindow count are set up.
I'm not sure it would really buy anything over having userspace iterate
over the MSI bank count, though -- it would probably be a bit more
complicated.

> > > On x86 MSI count is very
> > > device specific, which means it would be a VFIO_DEVICE_* ioctl
> > > (actually VFIO_DEVICE_GET_IRQ_INFO does this for us on x86). The
> > > trouble with it being a device ioctl is that you need to get the
> > > device FD, but the IOMMU protection needs to be established before
> > > you can get that... so there's an ordering problem if you need it
> > > from the device before configuring the IOMMU. Thanks,
> >
> > What do you mean by "IOMMU protection needs to be established"?
> > Wouldn't we just start with no mappings in place?
>
> If no mappings blocks all DMA, sure, that's fine. Once the VFIO device
> FD is accessible by userspace we have to protect the host against DMA.
> If any IOMMU_SET_ATTR calls temporarily disable DMA protection, that
> could be exploitable. Thanks,

Unless the PAMU is globally in bypass mode (which it wouldn't be),
there's no way to disable protection other than creating one giant
mapping.

-Scott
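
For reference, the "user DMA grows up from the bottom, MSI banks grow down from the top" layout discussed above comes down to simple arithmetic. The sketch below is a toy Python model under stated assumptions, not part of any VFIO or PAMU API: it assumes the aperture is split into equal subwindows and that each MSI bank occupies exactly one subwindow, and the function names are hypothetical.

```python
from typing import List


def msi_bank_iovas(aperture_base: int, aperture_size: int,
                   subwindow_count: int, bank_count: int) -> List[int]:
    """Return one iova per MSI bank, allocated from the top of the aperture down.

    Assumption for illustration: each bank occupies exactly one subwindow.
    """
    if subwindow_count <= 0 or aperture_size % subwindow_count:
        raise ValueError("aperture must split evenly into subwindows")
    if not 0 <= bank_count <= subwindow_count:
        raise ValueError("bank count must fit in the subwindows")
    subwindow_size = aperture_size // subwindow_count
    top = aperture_base + aperture_size
    # Bank 0 takes the highest subwindow, bank 1 the next one down, and so on.
    return [top - (i + 1) * subwindow_size for i in range(bank_count)]


def user_dma_limit(aperture_base: int, aperture_size: int,
                   subwindow_count: int, bank_count: int) -> int:
    """Exclusive upper bound left for ordinary DMA_MAP growing up from the base."""
    iovas = msi_bank_iovas(aperture_base, aperture_size,
                           subwindow_count, bank_count)
    return min(iovas) if iovas else aperture_base + aperture_size
```

With a 1 GiB aperture at 0x0 split into four subwindows and two MSI banks, this model puts the banks at 0x30000000 and 0x20000000 and leaves everything below 0x20000000 for ordinary mappings. That also shows the restriction noted in the reply: a user who wanted to map something in the top subwindows could not.
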