From: Benoît Canet
Date: Tue, 4 Jun 2013 17:50:30 +0200
Subject: Re: [Qemu-devel] VFIO and scheduled SR-IOV cards
Message-ID: <20130604155030.GA5991@irqsave.net>
In-Reply-To: <1370285865.30975.361.camel@ul30vt.home>
To: Alex Williamson
Cc: Benoît Canet, iommu@lists.linux-foundation.org, Don Dutile, qemu-devel@nongnu.org

Hello,

More information on how the hardware works:

- Each VF has its own memory, MMRs, etc., so the resources are not shared.
- Each VF has its own bus number, device number and function number, so the
  requester ID is separate for each VF.

There is also a VF save/restore area for the switch. A VF's regular memory
(not its MMRs) is still accessible after the VF is switched out.

But while VF number 1 is scheduled, a read of an MMR of VF number 0 could
return the value of the same MMR in VF number 1, because VF number 1 is
switched on and the PF processor is busy servicing it. This could confuse
the guest VF driver, so the unmap-and-block (or another technique achieving
the same goal) is required.

I hope this information narrows down the problem to solve.
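To make the constraint more concrete, here is a rough sketch (invented
names, not working code) of where the PF scheduler would have to hook such
an unmap: struct vf_slot, vf_switch_out() and vf_switch_in() are made up
for illustration, only iommu_map()/iommu_unmap() are the real kernel IOMMU
API, and getting hold of the guest's IOMMU domain from the PF driver is
precisely the missing piece. The guest CPU side (the mmap'd BAR) would
still have to be blocked separately.

/*
 * Rough sketch only: struct vf_slot, vf_switch_out() and vf_switch_in()
 * are invented for illustration; iommu_map()/iommu_unmap() are the real
 * kernel IOMMU API the PF scheduler would ultimately need to reach.
 */
#include <linux/iommu.h>

struct vf_slot {
        struct iommu_domain *domain;  /* domain used by the guest's VFIO container */
        unsigned long mmr_iova;       /* IOVA where the VF's MMR BAR is mapped */
        phys_addr_t mmr_phys;         /* host physical address of the MMR BAR */
        size_t mmr_size;              /* size of the MMR BAR */
};

/* Would run just before the PF scheduler switches a VF out. */
static void vf_switch_out(struct vf_slot *vf)
{
        /*
         * Tear down the IOMMU mapping so a stale MMR access cannot hit
         * another VF's registers; the guest CPU mapping (the mmap'd BAR)
         * would have to be trapped or blocked separately.
         */
        iommu_unmap(vf->domain, vf->mmr_iova, vf->mmr_size);
}

/* Would run once the VF is the active function again. */
static int vf_switch_in(struct vf_slot *vf)
{
        return iommu_map(vf->domain, vf->mmr_iova, vf->mmr_phys,
                         vf->mmr_size, IOMMU_READ | IOMMU_WRITE);
}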

Best regards

Benoît Canet

> On Monday 03 Jun 2013 at 12:57:45 (-0600), Alex Williamson wrote:
> On Mon, 2013-06-03 at 14:34 -0400, Don Dutile wrote:
> > On 06/03/2013 02:02 PM, Alex Williamson wrote:
> > > On Mon, 2013-06-03 at 18:33 +0200, Benoît Canet wrote:
> > >> Hello,
> > >>
> > >> I plan to write a PF driver for an SR-IOV card and make the VFs work
> > >> with QEMU's VFIO passthrough, so I am asking the following design
> > >> question before trying to write and push code.
> > >>
> > >> After SR-IOV is enabled on this hardware, only one VF function can be
> > >> active at a given time.
> > >
> > > Is this actually an SR-IOV device or are you trying to write a driver
> > > that emulates SR-IOV for a PF?
> > >
> > >> The PF host kernel driver is acting as a scheduler.
> > >> It switches every few milliseconds which VF is the current active
> > >> function while disabling the other VFs.
> > >>
> > that's time-sharing of hw, which sw doesn't see ... so, ok.
> >
> > >> One consequence of how the hardware works is that the MMR regions of
> > >> the switched-off VFs must be unmapped and their IO access should
> > >> block until the VF is switched on again.
> > >
> > This violates the spec., and does impact sw -- how can one assign such
> > a VF to a guest -- it does not work indep. of other VFs.
> >
> > > MMR = Memory Mapped Register?
> > >
> > > This seems contradictory to the SR-IOV spec, which states:
> > >
> > >         Each VF contains a non-shared set of physical resources
> > >         required to deliver Function-specific services, e.g.,
> > >         resources such as work queues, data buffers, etc. These
> > >         resources can be directly accessed by an SI without requiring
> > >         VI or SR-PCIM intervention.
> > >
> > > Furthermore, each VF should have a separate requester ID.  What's
> > > being suggested here seems like maybe that's not the case.  If true,
> > > it would
> > I didn't read it that way above.  I read it as the PCIe end is
> > timeshared btwn VFs (& PFs?) ... with some VFs disappearing (from a
> > driver perspective) as if the device was hot-unplugged w/o
> > notification.  That will probably cause read-timeouts & SME's, bringing
> > down most enterprise-level systems.
>
> Perhaps I'm reading too much into it, but using the same requester ID
> would seem like justification for why the device needs to be unmapped.
> Otherwise we could just stop QEMU and leave the mappings alone if we
> just want to make sure access to the device is blocked while the device
> is swapped out.  Not the best overall throughput algorithm, but maybe a
> proof of concept.  Need more info about how the device actually behaves
> to know for sure.  Thanks,
>
> Alex
>
> > > make iommu groups challenging.  Is there any VF save/restore around
> > > the scheduling?
> > >
> > >> Each IOMMU map/unmap should be done in less than 100ns.
> > >
> > > I think that may be a lot to ask if we need to unmap the regions in
> > > the guest and in the iommu.  If the "VFs" used different requester
> > > IDs, iommu unmapping wouldn't be necessary.  I experimented with
> > > switching between trapped (read/write) access to memory regions and
> > > mmap'd (direct mapping) for handling legacy interrupts.  There was a
> > > noticeable performance penalty switching per interrupt.
> > >
> > >> As the kernel iommu module is being called by the VFIO driver, the
> > >> PF driver cannot interface with it.
> > >>
> > >> Currently the only interface of the VFIO code is for the userland
> > >> QEMU process, and I fear that notifying QEMU that it should do the
> > >> unmap/block would take more than 100ns.
> > >>
> > >> Also, blocking the IO access in QEMU under the BQL would freeze QEMU.
> > >>
> > >> Do you have an idea on how to write this required map and
> > >> block/unmap feature?
> > >
> > > It seems like there are several options, but I'm doubtful that any of
> > > them will meet 100ns.  If this is completely fake SR-IOV and there's
> > > not a different requester ID per VF, I'd start with seeing if you can
> > > even do the iommu_unmap/iommu_map of the MMIO BARs in under 100ns.
> > > If that's close to your limit, then your only real option for QEMU is
> > > to freeze it, which still involves getting multiple (maybe many)
> > > vCPUs out of VM mode.  That's not free either.  If by some miracle
> > > you have time to spare, you could remap the regions to trapped mode
> > > and let the vCPUs run while vfio blocks on read/write.
> > >
> > > Maybe there's even a question whether mmap'd mode is worthwhile for
> > > this device.  Trapping every read/write is orders of magnitude
> > > slower, but allows you to handle the "wait for VF" on the kernel
> > > side.
> > >
> > > If you can provide more info on the device design/constraints, maybe
> > > we can come up with better options.  Thanks,
> > >
> > > Alex
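
For reference, if the unmap had to be done by notifying QEMU, the userspace
side would look roughly like the sketch below on every switch.
VFIO_IOMMU_MAP_DMA and VFIO_IOMMU_UNMAP_DMA are the real type1 IOMMU
ioctls; container_fd, mmr_vaddr, mmr_iova and mmr_size are placeholders,
and the notification path from the PF scheduler to QEMU is exactly what
does not exist today. A signal or eventfd round trip into QEMU plus one
ioctl per switch makes the 100ns budget look very hard to meet.

/*
 * Sketch of the per-switch userspace work if QEMU had to do the unmap
 * itself.  The ioctls and structs are the real VFIO type1 interface;
 * container_fd, mmr_vaddr, mmr_iova and mmr_size are placeholders.
 */
#include <sys/ioctl.h>
#include <string.h>
#include <linux/vfio.h>

static int vf_mmr_unmap(int container_fd, __u64 mmr_iova, __u64 mmr_size)
{
        struct vfio_iommu_type1_dma_unmap unmap;

        memset(&unmap, 0, sizeof(unmap));
        unmap.argsz = sizeof(unmap);
        unmap.iova = mmr_iova;
        unmap.size = mmr_size;
        /* One syscall per VF switch-out, before any blocking of CPU access. */
        return ioctl(container_fd, VFIO_IOMMU_UNMAP_DMA, &unmap);
}

static int vf_mmr_remap(int container_fd, void *mmr_vaddr,
                        __u64 mmr_iova, __u64 mmr_size)
{
        struct vfio_iommu_type1_dma_map map;

        memset(&map, 0, sizeof(map));
        map.argsz = sizeof(map);
        map.flags = VFIO_DMA_MAP_FLAG_READ | VFIO_DMA_MAP_FLAG_WRITE;
        map.vaddr = (__u64)(unsigned long)mmr_vaddr;
        map.iova = mmr_iova;
        map.size = mmr_size;
        /* And the corresponding re-map on switch-in. */
        return ioctl(container_fd, VFIO_IOMMU_MAP_DMA, &map);
}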