From: Alex Williamson
Date: Wed, 12 Feb 2014 17:03:33 -0700
Subject: Re: [Qemu-devel] Guest IOMMU and Cisco usnic
To: Benoît Canet
Cc: qemu-devel@nongnu.org

On Wed, 2014-02-12 at 23:51 +0100, Benoît Canet wrote:
> On Wednesday 12 Feb 2014 at 12:34:25 (-0700), Alex Williamson wrote:
> > On Wed, 2014-02-12 at 19:10 +0100, Benoît Canet wrote:
> > > Hi Alex,
> > > 
> > > After the IRC conversation we had a few days ago I understood that guest IOMMU
> > > was not implemented.
> > > 
> > > I have a real use case for it:
> > > 
> > > Cisco usnic allows writing MPI applications that drive the network card from
> > > userspace in order to optimize latency.  It's made for compute clusters.
> > > 
> > > The typical cloud provider doesn't offer bare metal access, only VMs on top
> > > of Cisco's hardware, hence VFIO uses the host IOMMU to pass the NIC through to
> > > the guest and no IOMMU is present in the guest.
> > > 
> > > Questions: Would writing a performant guest IOMMU implementation be possible?
> > >            How complex does this project look to someone who knows IOMMU issues?
> > > 
> > > The ideal implementation would forward the IOMMU work to the host hardware for
> > > speed.
> > > 
> > > I can devote time to writing the feature if it's doable.
> > 
> > Hi Benoît,
> > 
> > I imagine it's doable, but it's certainly not trivial; beyond that I
> > haven't put much thought into it.
> > 
> > VFIO running in a guest would need an IOMMU that implements both the
> > IOMMU API and IOMMU groups.  Whether that comes from an emulated
> > physical IOMMU (like VT-d) or from a new paravirt IOMMU would be for you
> > to decide.  VT-d would imply using a PCIe chipset like Q35 and trying to
> > bandage VT-d onto it, or updating Q35 to something that natively supports
> > VT-d.  Getting a sufficiently similar PCIe hierarchy between host and
> > guest would also be required.
> 
> This Cisco usnic thing (drivers/infiniband/hw/usnic) does not seem to use VFIO
> at all and seems to be hardcoded to use an Intel IOMMU.
> 
> I don't know whether that's a good thing or not.

Sorry, I got a little off track assuming usnic was a VFIO userspace
driver.  Peeking quickly at it, it looks like it also uses the IOMMU
API, so unless I missed the VT-d specific parts, a pv IOMMU in the
guest might allow some simplification if you don't care about non-Linux
support.
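
Roughly, the in-kernel IOMMU API surface I'm referring to looks like
this (a minimal illustrative sketch only, not the actual usnic code;
the function name is made up and error paths are simplified):

#include <linux/iommu.h>

/* Sketch: allocate a protection domain, attach a device to it, and
 * establish one IOVA -> physical mapping the device can DMA through. */
static int example_map_buffer(struct device *dev, unsigned long iova,
                              phys_addr_t paddr, size_t size)
{
        struct iommu_domain *domain;
        int ret;

        domain = iommu_domain_alloc(dev->bus);
        if (!domain)
                return -ENOMEM;

        ret = iommu_attach_device(domain, dev);
        if (ret)
                goto out_free;

        ret = iommu_map(domain, iova, paddr, size,
                        IOMMU_READ | IOMMU_WRITE);
        if (ret)
                goto out_detach;

        return 0;

out_detach:
        iommu_detach_device(domain, dev);
out_free:
        iommu_domain_free(domain);
        return ret;
}

A guest IOMMU (emulated VT-d or paravirt) would have to back exactly
these kinds of calls, plus the IOMMU group topology.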
> > The current model of putting all guest devices in a single IOMMU domain
> > on the host is likely not what you would want and might imply a new VFIO
> > IOMMU backend that is better tuned for separate domains, sparse
> > mappings, and low latency.  VFIO has a modular IOMMU design, so this
> > isn't architecturally a problem.  The VFIO user (QEMU) is able to select
> > which backend to use and the code is written with supporting multiple
> > backends in mind.
> > 
> > A complication you'll have is that the granularity of IOMMU operations
> > through VFIO is at the IOMMU group level, so the guest would not be able
> > to easily split devices grouped together on the host between separate
> > users in the guest.  That could be modeled as a conventional PCI bridge
> > masking the requester ID of devices in the guest such that host groups
> > are mirrored as guest groups.
> 
> I think that users would be happy with only one Palo UCS VF wrapped by usnic
> in the guest.  I definitely need to check this point.

The solution should support multiple devices though; it may just
require multiple guest IOMMUs and fairly strict configuration
constraints.

> > There might also be simpler "punch-through" ways to do it.  For
> > instance, what if instead of trying to make it work like it does on the
> > host, we invented a paravirt VFIO interface and a vfio-pv driver in the
> > guest populated /dev/vfio with slightly modified passthroughs to the host
> > fds?  The guest OS may not even really need to be aware of the device.
> 
> As I am not really interested in nesting VFIO but in using the Intel IOMMU
> directly in the guest, a "punch-through" method would be fine.

I was doing a lot of hand-waving for a vfio-pv punch-through, but I
don't even have a vague idea of what an IOMMU API punch-through would
look like.  It seems like you need to evaluate whether the pain of
emulating VT-d is greater than the pain of creating a new pv IOMMU,
and which is likely to perform better.  Thanks,

Alex

> > It's an interesting project and certainly a valid use case.  I'd also
> > like to see things like Intel's DPDK move to using VFIO, but the current
> > UIO-based DPDK is often used in guests.  Thanks,
> 
> I will ask Thomas Monjalon, the DPDK maintainer, about this.
> 
> Thanks,
> 
> Best regards
> 
> Benoît
> 
> > Alex
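
P.S. To make "the VFIO user is able to select which backend to use"
above a bit more concrete: the selection happens on the VFIO container
via the extension/set-IOMMU ioctls, roughly like this (a minimal sketch
with error handling trimmed; the group number is just a placeholder,
and a hypothetical new low-latency backend would simply advertise its
own extension ID):

#include <fcntl.h>
#include <sys/ioctl.h>
#include <linux/vfio.h>

/* Sketch: open a container and a group, then pick the type1 IOMMU
 * backend for the container.  Group "26" is only an example. */
static int pick_type1_backend(void)
{
        int container = open("/dev/vfio/vfio", O_RDWR);
        int group = open("/dev/vfio/26", O_RDWR);

        if (ioctl(container, VFIO_GET_API_VERSION) != VFIO_API_VERSION)
                return -1;

        /* a new backend would be probed with its own extension here */
        if (!ioctl(container, VFIO_CHECK_EXTENSION, VFIO_TYPE1_IOMMU))
                return -1;

        ioctl(group, VFIO_GROUP_SET_CONTAINER, &container);
        ioctl(container, VFIO_SET_IOMMU, VFIO_TYPE1_IOMMU);
        return container;
}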