From: Alex Williamson
Date: Mon, 03 Jun 2013 12:02:09 -0600
To: Benoît Canet
Cc: iommu@lists.linux-foundation.org, qemu-devel@nongnu.org
Subject: Re: [Qemu-devel] VFIO and scheduled SR-IOV cards

On Mon, 2013-06-03 at 18:33 +0200, Benoît Canet wrote:
> Hello,
>
> I plan to write a PF driver for an SR-IOV card and make the VFs work
> with QEMU's VFIO passthrough, so I am asking the following design
> question before trying to write and push code.
>
> After SR-IOV is enabled on this hardware, only one VF can be active
> at a given time.

Is this actually an SR-IOV device or are you trying to write a driver
that emulates SR-IOV for a PF?

> The PF host kernel driver is acting as a scheduler.  Every few
> milliseconds it switches which VF is the currently active function
> while disabling the other VFs.
>
> One consequence of how the hardware works is that the MMR regions of
> the switched-off VFs must be unmapped, and their I/O accesses should
> block until the VF is switched on again.

MMR = Memory Mapped Register?  This seems contradictory to the SR-IOV
spec, which states:

        Each VF contains a non-shared set of physical resources
        required to deliver Function-specific services, e.g., resources
        such as work queues, data buffers, etc.  These resources can be
        directly accessed by an SI without requiring VI or SR-PCIM
        intervention.

Furthermore, each VF should have a separate requester ID.  What's being
suggested here seems like maybe that's not the case.  If true, it would
make iommu groups challenging.  Is there any VF save/restore around the
scheduling?

> Each IOMMU map/unmap should be done in less than 100ns.

I think that may be a lot to ask if we need to unmap the regions in the
guest and in the iommu.  If the "VFs" used different requester IDs,
iommu unmapping wouldn't be necessary.

I experimented with switching between trapped (read/write) access to
memory regions and mmap'd (direct mapping) for handling legacy
interrupts.  There was a noticeable performance penalty switching per
interrupt.

> As the kernel iommu module is being called by the VFIO driver, the PF
> driver cannot interface with it.
>
> Currently the only interface of the VFIO code is for the userland
> QEMU process, and I fear that notifying QEMU that it should do the
> unmap/block would take more than 100ns.
>
> Also, blocking the I/O access in QEMU under the BQL would freeze
> QEMU.
>
> Do you have an idea on how to write this required map and
> block/unmap feature?

It seems like there are several options, but I'm doubtful that any of
them will meet 100ns.
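Just to make the cost concrete, the PF-driver-side cycle you're
describing boils down to something like the sketch below (untested;
"domain", "iova", "paddr" and "bar_size" are placeholders for whatever
your driver tracks per VF):

#include <linux/iommu.h>
#include <linux/ktime.h>
#include <linux/printk.h>

/* Time one unmap/map cycle of a VF BAR through the kernel IOMMU API. */
static int time_vf_bar_switch(struct iommu_domain *domain,
                              unsigned long iova, phys_addr_t paddr,
                              size_t bar_size)
{
        ktime_t start = ktime_get();
        int ret;

        if (iommu_unmap(domain, iova, bar_size) != bar_size)
                return -EFAULT;

        /* ...the hardware switch of the active VF would happen here... */

        ret = iommu_map(domain, iova, paddr, bar_size,
                        IOMMU_READ | IOMMU_WRITE);

        pr_info("unmap+map took %lld ns\n",
                ktime_to_ns(ktime_sub(ktime_get(), start)));
        return ret;
}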
If this is completely fake SR-IOV and there's not a different requester
ID per VF, I'd start with seeing if you can even do the
iommu_unmap/iommu_map of the MMIO BARs in under 100ns.  If that's close
to your limit, then your only real option for QEMU is to freeze it,
which still involves getting multiple (maybe many) vCPUs out of VM
mode.  That's not free either.  If by some miracle you have time to
spare, you could remap the regions to trapped mode and let the vCPUs
run while vfio blocks on read/write.

Maybe there's even a question whether mmap'd mode is worthwhile for
this device.  Trapping every read/write is orders of magnitude slower,
but allows you to handle the "wait for VF" on the kernel side.

If you can provide more info on the device design/constraints, maybe we
can come up with better options.  Thanks,

Alex
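P.S.  For reference, the QEMU side of the trapped vs. mmap'd switching
I mentioned above is roughly the sketch below (untested; it assumes
each BAR is modeled as a trapped MMIO region with an mmap'd subregion
overlapping it, and vfio_set_mmap_enabled/mmap_mrs are made-up names):

#include "exec/memory.h"

/*
 * Flip a set of mmap'd BAR subregions on or off in one memory
 * transaction.  With the mmap'd view disabled, guest accesses fall
 * through to the trapped region underneath, i.e. vfio read()/write(),
 * where the kernel is free to block until the VF is scheduled back in.
 */
static void vfio_set_mmap_enabled(MemoryRegion **mmap_mrs, int nr_bars,
                                  bool enabled)
{
    int i;

    memory_region_transaction_begin();
    for (i = 0; i < nr_bars; i++) {
        memory_region_set_enabled(mmap_mrs[i], enabled);
    }
    memory_region_transaction_commit();
}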