From: Alex Williamson
Date: Wed, 12 Feb 2014 17:03:33 -0700
Subject: Re: [Qemu-devel] Guest IOMMU and Cisco usnic
To: Benoît Canet
Cc: qemu-devel@nongnu.org

On Wed, 2014-02-12 at 23:51 +0100, Benoît Canet wrote:
> On Wednesday 12 Feb 2014 at 12:34:25 (-0700), Alex Williamson wrote:
> > On Wed, 2014-02-12 at 19:10 +0100, Benoît Canet wrote:
> > > Hi Alex,
> > > 
> > > After the IRC conversation we had a few days ago I understood that guest IOMMU
> > > was not implemented.
> > > 
> > > I have a real use case for it:
> > > 
> > > Cisco usnic allows writing MPI applications that drive the network card from
> > > userspace in order to optimize latency.  It's made for compute clusters.
> > > 
> > > The typical cloud provider doesn't offer bare metal access, only VMs on top
> > > of Cisco's hardware, hence VFIO uses the host IOMMU to pass the NIC through to
> > > the guest and no IOMMU is present in the guest.
> > > 
> > > Questions: Would writing a performant guest IOMMU implementation be possible?
> > >            How complex does this project look to someone who knows IOMMU issues?
> > > 
> > > The ideal implementation would forward the IOMMU work to the host hardware for
> > > speed.
> > > 
> > > I can devote time to writing the feature if it's doable.
> > 
> > Hi Benoît,
> > 
> > I imagine it's doable, but it's certainly not trivial; beyond that I
> > haven't put much thought into it.
> > 
> > VFIO running in a guest would need an IOMMU that implements both the
> > IOMMU API and IOMMU groups.  Whether that comes from an emulated
> > physical IOMMU (like VT-d) or from a new paravirt IOMMU would be for you
> > to decide.  VT-d would imply using a PCIe chipset like Q35 and trying to
> > bandage VT-d onto it, or updating Q35 to something that natively supports
> > VT-d.  Getting a sufficiently similar PCIe hierarchy between host and
> > guest would also be required.
> 
> This Cisco usnic thing (drivers/infiniband/hw/usnic) does not seem to use VFIO
> at all and seems to be hardcoded to use an Intel IOMMU.
> 
> I don't know whether that's a good thing or not.

Sorry, I got a little off track assuming usnic was a VFIO userspace
driver.  Peeking quickly at it, it looks like it also uses the IOMMU
API, so unless I missed the VT-d specific parts, a pv IOMMU in the
guest might allow some simplification if you don't care about non-Linux
support.
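
Roughly, the in-kernel IOMMU API surface I'm referring to looks like
this (a minimal illustrative sketch only, not the actual usnic code;
the function name is made up and error paths are simplified):

#include <linux/iommu.h>

/* Sketch: allocate a protection domain, attach a device to it, and
 * establish one IOVA -> physical mapping the device can DMA through. */
static int example_map_buffer(struct device *dev, unsigned long iova,
                              phys_addr_t paddr, size_t size)
{
        struct iommu_domain *domain;
        int ret;

        domain = iommu_domain_alloc(dev->bus);
        if (!domain)
                return -ENOMEM;

        ret = iommu_attach_device(domain, dev);
        if (ret)
                goto out_free;

        ret = iommu_map(domain, iova, paddr, size,
                        IOMMU_READ | IOMMU_WRITE);
        if (ret)
                goto out_detach;

        return 0;

out_detach:
        iommu_detach_device(domain, dev);
out_free:
        iommu_domain_free(domain);
        return ret;
}

A guest IOMMU (emulated VT-d or paravirt) would have to back exactly
these kinds of calls, plus the IOMMU group topology.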
> > The current model of putting all guest devices in a single IOMMU domain
> > on the host is likely not what you would want and might imply a new VFIO
> > IOMMU backend that is better tuned for separate domains, sparse
> > mappings, and low latency.  VFIO has a modular IOMMU design, so this
> > isn't architecturally a problem.  The VFIO user (QEMU) is able to select
> > which backend to use and the code is written with supporting multiple
> > backends in mind.
> > 
> > A complication you'll have is that the granularity of IOMMU operations
> > through VFIO is at the IOMMU group level, so the guest would not be able
> > to easily split devices grouped together on the host between separate
> > users in the guest.  That could be modeled as a conventional PCI bridge
> > masking the requester ID of devices in the guest such that host groups
> > are mirrored as guest groups.
> 
> I think that users would be happy with only one Palo UCS VF wrapped by usnic
> in the guest.  I definitely need to check this point.

The solution should support multiple devices though; it may just
require multiple guest IOMMUs and fairly strict configuration
constraints.

> > There might also be simpler "punch-through" ways to do it.  For
> > instance, what if instead of trying to make it work like it does on the
> > host, we invented a paravirt VFIO interface and a vfio-pv driver in the
> > guest populated /dev/vfio with slightly modified passthroughs to the host
> > fds?  The guest OS may not even really need to be aware of the device.
> 
> As I am not really interested in nesting VFIO but in using the Intel IOMMU
> directly in the guest, a "punch-through" method would be fine.

I was doing a lot of hand-waving for a vfio-pv punch-through, but I
don't even have a vague idea of what an IOMMU API punch-through would
look like.  It seems like you need to evaluate whether the pain of
emulating VT-d is greater than the pain of creating a new pv IOMMU,
and which is likely to perform better.  Thanks,

Alex

> > It's an interesting project and certainly a valid use case.  I'd also
> > like to see things like Intel's DPDK move to using VFIO, but the current
> > UIO-based DPDK is often used in guests.  Thanks,
> 
> I will ask Thomas Monjalon, the DPDK maintainer, about this.
> 
> Thanks,
> 
> Best regards
> 
> Benoît
> 
> > Alex
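
P.S. To make "the VFIO user is able to select which backend to use"
above a bit more concrete: the selection happens on the VFIO container
via the extension/set-IOMMU ioctls, roughly like this (a minimal sketch
with error handling trimmed; the group number is just a placeholder,
and a hypothetical new low-latency backend would simply advertise its
own extension ID):

#include <fcntl.h>
#include <sys/ioctl.h>
#include <linux/vfio.h>

/* Sketch: open a container and a group, then pick the type1 IOMMU
 * backend for the container.  Group "26" is only an example. */
static int pick_type1_backend(void)
{
        int container = open("/dev/vfio/vfio", O_RDWR);
        int group = open("/dev/vfio/26", O_RDWR);

        if (ioctl(container, VFIO_GET_API_VERSION) != VFIO_API_VERSION)
                return -1;

        /* a new backend would be probed with its own extension here */
        if (!ioctl(container, VFIO_CHECK_EXTENSION, VFIO_TYPE1_IOMMU))
                return -1;

        ioctl(group, VFIO_GROUP_SET_CONTAINER, &container);
        ioctl(container, VFIO_SET_IOMMU, VFIO_TYPE1_IOMMU);
        return container;
}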