From mboxrd@z Thu Jan  1 00:00:00 1970
Received: from eggs.gnu.org ([208.118.235.92]:50780)
	by lists.gnu.org with esmtp (Exim 4.71)
	(envelope-from <B07421@freescale.com>) id 1UN9wA-0004zW-8B
	for qemu-devel@nongnu.org; Tue, 02 Apr 2013 18:44:25 -0400
Received: from Debian-exim by eggs.gnu.org with spam-scanned (Exim 4.71)
	(envelope-from <B07421@freescale.com>) id 1UN9w7-0001J4-9s
	for qemu-devel@nongnu.org; Tue, 02 Apr 2013 18:44:18 -0400
Received: from mail-db8lp0184.outbound.messaging.microsoft.com
	([213.199.154.184]:30363
	helo=db8outboundpool.messaging.microsoft.com)
	by eggs.gnu.org with esmtp (Exim 4.71)
	(envelope-from <B07421@freescale.com>) id 1UN9w7-0001Iq-1Q
	for qemu-devel@nongnu.org; Tue, 02 Apr 2013 18:44:15 -0400
Date: Tue, 2 Apr 2013 17:44:06 -0500
From: Scott Wood <scottwood@freescale.com>
In-Reply-To: <1364938324.2882.179.camel@bling.home> (from
	alex.williamson@redhat.com on Tue Apr  2 16:32:04 2013)
Message-ID: <1364942646.24520.27@snotra>
MIME-Version: 1.0
Content-Type: text/plain; charset="us-ascii"; delsp=Yes; format=Flowed
Content-Disposition: inline
Content-Transfer-Encoding: quoted-printable
Subject: Re: [Qemu-devel] RFC: vfio API changes needed for powerpc
List-Id: <qemu-devel.nongnu.org>
List-Unsubscribe: <https://lists.nongnu.org/mailman/options/qemu-devel>,
	<mailto:qemu-devel-request@nongnu.org?subject=unsubscribe>
List-Archive: <http://lists.nongnu.org/archive/html/qemu-devel>
List-Post: <mailto:qemu-devel@nongnu.org>
List-Help: <mailto:qemu-devel-request@nongnu.org?subject=help>
List-Subscribe: <https://lists.nongnu.org/mailman/listinfo/qemu-devel>,
	<mailto:qemu-devel-request@nongnu.org?subject=subscribe>
To: Alex Williamson <alex.williamson@redhat.com>
Cc: Wood Scott-B07421 <B07421@freescale.com>, "kvm@vger.kernel.org" <kvm@vger.kernel.org>, "agraf@suse.de" <agraf@suse.de>, "qemu-devel@nongnu.org" <qemu-devel@nongnu.org>, Yoder Stuart-B08248 <B08248@freescale.com>, "iommu@lists.linux-foundation.org" <iommu@lists.linux-foundation.org>, Bhushan Bharat-R65777 <R65777@freescale.com>

On 04/02/2013 04:32:04 PM, Alex Williamson wrote:
> On Tue, 2013-04-02 at 15:57 -0500, Scott Wood wrote:
> > On 04/02/2013 03:32:17 PM, Alex Williamson wrote:
> > > On x86 the interrupt remapper handles this transparently when MSI
> > > is enabled and userspace never gets direct access to the device MSI
> > > address/data registers.
> >
> > x86 has a totally different mechanism here, as far as I understand --
> > even before you get into restrictions on mappings.
>
> So what control will userspace have over programming the actual MSI
> vectors on PAMU?

Not sure what you mean -- PAMU doesn't get explicitly involved in
MSIs.  It's just another 4K page mapping (per relevant MSI bank).  If
you want isolation, you need to make sure that an MSI group is only
used by one VFIO group, and that you're on a chip that has alias pages
with just one MSI bank register each (newer chips do, but the first
chip to have a PAMU didn't).
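
To spell out the first half of that isolation condition, here is a
minimal sketch in plain C.  The struct and function names are
hypothetical, invented purely for illustration; nothing here is an
existing kernel or VFIO interface, and it ignores the alias-page
requirement entirely.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical record: one device's use of an MSI group, tagged with
 * the VFIO group that device belongs to. */
struct msi_user {
    int msi_group;  /* index of the MSI group in use */
    int vfio_group; /* id of the VFIO group the device is in */
};

/* Returns true if no MSI group is used by more than one VFIO group. */
bool msi_groups_isolated(const struct msi_user *users, size_t n)
{
    assert(users != NULL || n == 0);

    for (size_t i = 0; i < n; i++) {
        for (size_t j = i + 1; j < n; j++) {
            if (users[i].msi_group == users[j].msi_group &&
                users[i].vfio_group != users[j].vfio_group)
                return false; /* same MSI group, different VFIO groups */
        }
    }
    return true;
}
```

Two devices in the same VFIO group may share an MSI group under this
check; only sharing across VFIO groups is rejected.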

> > > This could also be done as another "type2" ioctl extension.
> >
> > Again, what is "type2", specifically?  If someone else is adding their
> > own IOMMU that is kind of, sort of like PAMU, how would they know if
> > it's close enough?  What assumptions can a user make when they see that
> > they're dealing with "type2"?
>
> Naming always has and always will be a problem.  I assume this is named
> type2 rather than PAMU because it's trying to expose a generic windowed
> IOMMU fitting the IOMMU API.

But how closely is the MSI situation related to a generic windowed
IOMMU, then?  We could just as well have a highly flexible IOMMU in
terms of arbitrary 4K page mappings, but still handle MSIs as pages to
be mapped rather than a translation table.  Or we could have a windowed
IOMMU that has an MSI translation table.

> Like type1, it doesn't really make sense
> to name it "IOMMU API" because that's a kernel internal interface and
> we're designing a userspace interface that just happens to use that.
> Tagging it to a piece of hardware makes it less reusable.

Well, that's my point.  Is it reusable at all, anyway?  If not, then
giving it a more obscure name won't change that.  If it is reusable,
then where is the line drawn between things that are PAMU-specific or
MPIC-specific and things that are part of the "generic windowed IOMMU"
abstraction?

>  Type1 is arbitrary.  It might as well be named "brown" and this one
> can be "blue".

The difference is that "type1" seems to refer to hardware that can do
arbitrary 4K page mappings, possibly constrained by an aperture but
nothing else.  More than one IOMMU can reasonably fit that.  The odds
that another IOMMU would have exactly the same restrictions as PAMU
seem smaller in comparison.

In any case, if you had to deal with some Intel-only quirk, would it
make sense to call it a "type1 attribute"?  I'm not advocating one way
or the other on whether an abstraction is viable here (though Stuart
seems to think it's "highly unlikely anything but a PAMU will comply"),
just that if it is to be abstracted rather than a hardware-specific
interface, we need to document what is and is not part of the
abstraction.  Otherwise a non-PAMU-specific user won't know what they
can rely on, and someone adding support for a new windowed IOMMU won't
know if theirs is close enough, or they need to introduce a "type3".

> > > What's the value to userspace in determining which windows are used
> > > by which banks?
> >
> > That depends on who programs the MSI config space address.  What is
> > important is userspace controlling which iovas will be dedicated to
> > this, in case it wants to put something else there.
>
> So userspace is programming the MSI vectors, targeting a user programmed
> iova?  But an iova selects a window and I thought there were some number
> of MSI banks and we don't really know which ones we'll need...  still
> confused.

Userspace would also need a way to find out the page offset and data
value.  That may be an argument in favor of having the two ioctls
Stuart later suggested (get MSI count, and map MSI).  Would there be
any complication in the VFIO code from tracking a mapping that doesn't
have a userspace virtual address associated with it?
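
As a rough illustration of what userspace would have to track under
that two-ioctl scheme, here is a sketch that only models the
bookkeeping in plain C; it issues no real ioctls.  All names, fields,
and the layout policy are hypothetical and do not correspond to any
existing VFIO interface.  It assumes the kernel would report a page
offset and data value per mapped MSI bank, and that userspace picks
the iovas.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MSI_PAGE_SIZE 4096u /* each MSI bank is mapped as one 4K page */

/* Hypothetical per-bank result of a "map MSI" request. */
struct msi_bank_map {
    uint64_t iova;        /* iova userspace dedicated to this bank's page */
    uint32_t page_offset; /* offset of the MSI register within the page */
    uint32_t data;        /* data value the device should write */
};

/* Address the device's MSI address register would be programmed with. */
uint64_t msi_target_addr(const struct msi_bank_map *m)
{
    assert(m != NULL);
    return m->iova + m->page_offset;
}

/* Place `count` MSI bank pages back to back starting at `base_iova`,
 * where `count` stands in for the result of a "get MSI count" request.
 * Returns 0 on success, or -1 if the base is not page aligned or the
 * caller's array is too small.  Only the iova field is filled in. */
int msi_layout_banks(uint64_t base_iova, uint32_t count,
                     struct msi_bank_map *maps, size_t maps_len)
{
    assert(maps != NULL);

    if (base_iova % MSI_PAGE_SIZE != 0 || count > maps_len)
        return -1;
    for (uint32_t i = 0; i < count; i++)
        maps[i].iova = base_iova + (uint64_t)i * MSI_PAGE_SIZE;
    return 0;
}
```

The point of the sketch is that none of these mappings has a userspace
virtual address behind it: the iova, page offset, and data value are
all there is to track.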

> > There's going to be special stuff no matter what.  This would keep it
> > separated from the IOMMU map code.
> >
> > I'm not sure what you mean by "overhead" here... the runtime overhead
> > of setting things up is not particularly relevant as long as it's
> > reasonable.  If you mean development and maintenance effort, keeping
> > things well separated should help.
>
> Overhead in terms of code required and complexity.  More things to
> reference count and shut down in the proper order on userspace exit.
> Thanks,

That didn't stop others from having me convert the KVM device control
API to use file descriptors instead of something more ad-hoc with a
better-defined destruction order. :-)

I don't know if it necessarily needs to be a separate fd -- it could be
just another device resource like BARs, with some way for userspace to
tell if the page is shared by multiple devices in the group (e.g. make
the physical address visible).

-Scott