From mboxrd@z Thu Jan  1 00:00:00 1970
Received: from eggs.gnu.org ([208.118.235.92]:51459)
	by lists.gnu.org with esmtp (Exim 4.71)
	(envelope-from <B07421@freescale.com>) id 1UNV62-0002Ep-Ec
	for qemu-devel@nongnu.org; Wed, 03 Apr 2013 17:20:01 -0400
Received: from Debian-exim by eggs.gnu.org with spam-scanned (Exim 4.71)
	(envelope-from <B07421@freescale.com>) id 1UNV5x-0004o0-3c
	for qemu-devel@nongnu.org; Wed, 03 Apr 2013 17:19:54 -0400
Received: from mail-db8lp0184.outbound.messaging.microsoft.com
	([213.199.154.184]:30968
	helo=db8outboundpool.messaging.microsoft.com)
	by eggs.gnu.org with esmtp (Exim 4.71)
	(envelope-from <B07421@freescale.com>) id 1UNV5w-0004nc-Qf
	for qemu-devel@nongnu.org; Wed, 03 Apr 2013 17:19:48 -0400
Date: Wed, 3 Apr 2013 16:19:36 -0500
From: Scott Wood <scottwood@freescale.com>
In-Reply-To: <1364960240.2882.230.camel@bling.home> (from
	alex.williamson@redhat.com on Tue Apr  2 22:37:20 2013)
Message-ID: <1365023976.25627.13@snotra>
MIME-Version: 1.0
Content-Type: text/plain; charset="us-ascii"; delsp=Yes; format=Flowed
Content-Disposition: inline
Content-Transfer-Encoding: quoted-printable
Subject: Re: [Qemu-devel] RFC: vfio API changes needed for powerpc
List-Id: <qemu-devel.nongnu.org>
List-Unsubscribe: <https://lists.nongnu.org/mailman/options/qemu-devel>,
	<mailto:qemu-devel-request@nongnu.org?subject=unsubscribe>
List-Archive: <http://lists.nongnu.org/archive/html/qemu-devel>
List-Post: <mailto:qemu-devel@nongnu.org>
List-Help: <mailto:qemu-devel-request@nongnu.org?subject=help>
List-Subscribe: <https://lists.nongnu.org/mailman/listinfo/qemu-devel>,
	<mailto:qemu-devel-request@nongnu.org?subject=subscribe>
To: Alex Williamson <alex.williamson@redhat.com>
Cc: Wood Scott-B07421 <B07421@freescale.com>, "kvm@vger.kernel.org" <kvm@vger.kernel.org>, Stuart Yoder <b08248@gmail.com>, "qemu-devel@nongnu.org" <qemu-devel@nongnu.org>, "agraf@suse.de" <agraf@suse.de>, Yoder Stuart-B08248 <B08248@freescale.com>, "iommu@lists.linux-foundation.org" <iommu@lists.linux-foundation.org>, Bhushan Bharat-R65777 <R65777@freescale.com>

On 04/02/2013 10:37:20 PM, Alex Williamson wrote:
> On Tue, 2013-04-02 at 17:50 -0500, Scott Wood wrote:
> > On 04/02/2013 04:38:45 PM, Alex Williamson wrote:
> > > On Tue, 2013-04-02 at 16:08 -0500, Stuart Yoder wrote:
> > > > On Tue, Apr 2, 2013 at 3:57 PM, Scott Wood <scottwood@freescale.com> wrote:
> > > > > >> >    C.  Explicit mapping using normal DMA map.  The last idea is that
> > > > > >> >        we would introduce a new ioctl to give user-space an fd to
> > > > > >> >        the MSI bank, which could be mmapped.  The flow would be
> > > > > >> >        something like this:
> > > > > >> >           -for each group user space calls new ioctl
> > > > > >> >            VFIO_GROUP_GET_MSI_FD
> > > > > >> >           -user space mmaps the fd, getting a vaddr
> > > > > >> >           -user space does a normal DMA map for desired iova
> > > > > >> >        This approach makes everything explicit, but adds a new ioctl
> > > > > >> >        applicable most likely only to the PAMU (type2 iommu).
> > > > >>
> > > > > >> And the DMA_MAP of that mmap then allows userspace to select the window
> > > > > >> used?  This one seems like a lot of overhead, adding a new ioctl, new
> > > > > >> fd, mmap, special mapping path, etc.
> > > > >
> > > > >
> > > > > > There's going to be special stuff no matter what.  This would keep it
> > > > > > separated from the IOMMU map code.
> > > > > >
> > > > > > I'm not sure what you mean by "overhead" here... the runtime overhead of
> > > > > > setting things up is not particularly relevant as long as it's reasonable.
> > > > > > If you mean development and maintenance effort, keeping things well
> > > > > > separated should help.
> > > >
> > > > We don't need to change DMA_MAP.  If we can simply add a new "type 2"
> > > > ioctl that allows user space to set which windows are MSIs, it seems vastly
> > > > less complex than an ioctl to supply a new fd, mmap of it, etc.
> > > >
> > > > So maybe 2 ioctls:
> > > >     VFIO_IOMMU_GET_MSI_COUNT
> >
> > Do you mean a count of actual MSIs or a count of MSI banks used by the
> > whole VFIO group?
>
> I hope the latter, which would clarify how this is distinct from
> DEVICE_GET_IRQ_INFO.  Is hotplug even on the table?  Presumably
> dynamically adding a device could bring along additional MSI banks?

I'm not sure -- maybe we could say that hotplug can add banks, but not
remove them or change the order, so userspace would just need to check
if the number of banks changed, and map the extras.
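
A rough sketch of that userspace check, assuming the rule above (banks
are only ever appended, never removed or reordered).  The names
`bank_range` and `msi_banks_to_map` are made up for illustration; how
the count is actually obtained (e.g. the proposed bank-count ioctl) is
left out, since that API is still undecided.

```c
#include <assert.h>
#include <stddef.h>

/* Range of MSI bank indices that still need to be mapped. */
struct bank_range {
    unsigned int first;   /* index of the first unmapped bank */
    unsigned int count;   /* number of banks to map, may be 0 */
};

/*
 * Hypothetical helper: `mapped` is how many banks userspace has already
 * mapped, `reported` is the bank count most recently reported by the
 * kernel.  Because banks are assumed to be append-only, the extras are
 * exactly the indices [mapped, reported).
 *
 * Returns 0 on success, -1 if the count shrank (which the assumed rule
 * forbids, so the caller should treat it as an error).
 */
static int msi_banks_to_map(unsigned int mapped, unsigned int reported,
                            struct bank_range *out)
{
    assert(out != NULL);

    if (reported < mapped)
        return -1;

    out->first = mapped;
    out->count = reported - mapped;
    return 0;
}
```

After a hotplug event userspace would re-read the count, call the
helper, and map each bank in the returned range.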

> The current VFIO MSI support has the host handling everything about MSI.
> The user never programs an MSI vector to the physical device, they set
> up everything through ioctl.  On interrupt, we simply trigger an eventfd
> and leave it to things like KVM irqfd or QEMU to do the right thing in a
> virtual machine.
>
> Here the MSI vector has to go through a PAMU window to hit the correct
> MSI bank.  So that means it has some component of the iova involved,
> which we're proposing here is controlled by userspace (whether that
> vector uses an offset from 0x10000000 or 0x00000000 depending on which
> window slot is used to make the MSI bank).  I assume we're still working
> in a model where the physical interrupt fires into the host and a
> host-based interrupt handler triggers an eventfd, right?

Yes (subject to possible future optimizations).

> So that means the vector also has host components so we trigger the
> correct ISR.  How is that coordinated?

Everything but the iova component needs to come from the host MSI
allocator.

> Would it be possible for userspace to simply leave room for MSI bank
> mapping (how much room could be determined by something like
> VFIO_IOMMU_GET_MSI_BANK_COUNT) then document the API that userspace can
> DMA_MAP starting at the 0x0 address of the aperture, growing up, and
> VFIO will map banks on demand at the top of the aperture, growing down?
> Wouldn't that avoid a lot of issues with userspace needing to know
> anything about MSI banks (other than count) and coordinating irq numbers
> and enabling handlers?

This would restrict a (possibly unlikely) use case where the user wants
to map something near the top of the aperture but has another place
MSIs can go (or is willing to live without MSIs).  Otherwise it could
be workable, as long as we can require that MSIs be explicitly enabled
on a device only after the aperture and subwindow count are set up.
I'm not sure it would really buy anything over having userspace iterate
over the MSI bank count, though -- it would probably be a bit more
complicated.
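
For reference, the layout being discussed (ordinary DMA maps growing up
from the bottom of the aperture, MSI banks taking subwindows from the
top down) comes out roughly as in the sketch below.  It assumes the
aperture is split into equal-sized subwindows and that bank 0 takes the
topmost one; `struct aperture` and the helper names are hypothetical,
not an existing VFIO or PAMU interface.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical description of an already configured aperture. */
struct aperture {
    uint64_t base;               /* iova of the start of the aperture */
    uint64_t size;               /* total size in bytes */
    unsigned int nr_subwindows;  /* assumed equal-sized subwindows */
};

static uint64_t subwindow_size(const struct aperture *ap)
{
    assert(ap->nr_subwindows > 0 && ap->size % ap->nr_subwindows == 0);
    return ap->size / ap->nr_subwindows;
}

/* iova of the subwindow for MSI bank `bank`, counting down from the top. */
static uint64_t msi_bank_iova(const struct aperture *ap, unsigned int bank)
{
    assert(bank < ap->nr_subwindows);
    return ap->base +
           (uint64_t)(ap->nr_subwindows - 1 - bank) * subwindow_size(ap);
}

/* Subwindows left at the bottom for ordinary DMA_MAP use. */
static unsigned int dma_subwindows_left(const struct aperture *ap,
                                        unsigned int nr_banks)
{
    return nr_banks >= ap->nr_subwindows ? 0 : ap->nr_subwindows - nr_banks;
}
```

With a 1 GiB aperture at iova 0 and four subwindows, two MSI banks would
land at 0x30000000 and 0x20000000, leaving the two subwindows below
0x20000000 for the user's own mappings.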

> > > On x86 MSI count is very
> > > device specific, which means it would be a VFIO_DEVICE_* ioctl (actually
> > > VFIO_DEVICE_GET_IRQ_INFO does this for us on x86).  The trouble with it
> > > being a device ioctl is that you need to get the device FD, but the
> > > IOMMU protection needs to be established before you can get that... so
> > > there's an ordering problem if you need it from the device before
> > > configuring the IOMMU.  Thanks,
> >
> > What do you mean by "IOMMU protection needs to be established"?
> > Wouldn't we just start with no mappings in place?
>
> If no mappings blocks all DMA, sure, that's fine.  Once the VFIO device
> FD is accessible by userspace we have to protect the host against DMA.
> If any IOMMU_SET_ATTR calls temporarily disable DMA protection, that
> could be exploitable.  Thanks,

Unless the PAMU is globally in bypass mode (which it wouldn't be),
there's no way to disable protection other than creating one giant
mapping.

-Scott