From: Benoît Canet
Date: Tue, 4 Jun 2013 17:50:30 +0200
Subject: Re: [Qemu-devel] VFIO and scheduled SR-IOV cards
Message-ID: <20130604155030.GA5991@irqsave.net>
In-Reply-To: <1370285865.30975.361.camel@ul30vt.home>
To: Alex Williamson
Cc: Benoît Canet, iommu@lists.linux-foundation.org, Don Dutile, qemu-devel@nongnu.org

Hello,

More information on how the hardware works:

- Each VF has its own memory, MMRs, etc., so the resources are not shared.
- Each VF has its own bus number, device number and function number, so the
  requester ID is separate for each VF.

There is also a VF save/restore area for the switch. A VF's regular memory
(not its MMRs) is still accessible after the VF is switched out.

But while VF number 1 is scheduled, a read of an MMR of VF number 0 could
return the value of the same MMR in VF number 1, because VF number 1 is
switched on and the PF processor is busy servicing it. This could confuse
the guest VF driver, so the unmap-and-block (or another technique achieving
the same goal) is required.

I hope this information narrows down the problem to solve.
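To make the constraint more concrete, here is a rough sketch (invented
names, not working code) of where the PF scheduler would have to hook such
an unmap: struct vf_slot, vf_switch_out() and vf_switch_in() are made up
for illustration, only iommu_map()/iommu_unmap() are the real kernel IOMMU
API, and getting hold of the guest's IOMMU domain from the PF driver is
precisely the missing piece. The guest CPU side (the mmap'd BAR) would
still have to be blocked separately.

/*
 * Rough sketch only: struct vf_slot, vf_switch_out() and vf_switch_in()
 * are invented for illustration; iommu_map()/iommu_unmap() are the real
 * kernel IOMMU API the PF scheduler would ultimately need to reach.
 */
#include <linux/iommu.h>

struct vf_slot {
        struct iommu_domain *domain;  /* domain used by the guest's VFIO container */
        unsigned long mmr_iova;       /* IOVA where the VF's MMR BAR is mapped */
        phys_addr_t mmr_phys;         /* host physical address of the MMR BAR */
        size_t mmr_size;              /* size of the MMR BAR */
};

/* Would run just before the PF scheduler switches a VF out. */
static void vf_switch_out(struct vf_slot *vf)
{
        /*
         * Tear down the IOMMU mapping so a stale MMR access cannot hit
         * another VF's registers; the guest CPU mapping (the mmap'd BAR)
         * would have to be trapped or blocked separately.
         */
        iommu_unmap(vf->domain, vf->mmr_iova, vf->mmr_size);
}

/* Would run once the VF is the active function again. */
static int vf_switch_in(struct vf_slot *vf)
{
        return iommu_map(vf->domain, vf->mmr_iova, vf->mmr_phys,
                         vf->mmr_size, IOMMU_READ | IOMMU_WRITE);
}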

Best regards

Benoît Canet

> On Monday 03 Jun 2013 at 12:57:45 (-0600), Alex Williamson wrote:
> On Mon, 2013-06-03 at 14:34 -0400, Don Dutile wrote:
> > On 06/03/2013 02:02 PM, Alex Williamson wrote:
> > > On Mon, 2013-06-03 at 18:33 +0200, Benoît Canet wrote:
> > >> Hello,
> > >>
> > >> I plan to write a PF driver for an SR-IOV card and make the VFs work
> > >> with QEMU's VFIO passthrough, so I am asking the following design
> > >> question before trying to write and push code.
> > >>
> > >> After SR-IOV is enabled on this hardware, only one VF function can be
> > >> active at a given time.
> > >
> > > Is this actually an SR-IOV device or are you trying to write a driver
> > > that emulates SR-IOV for a PF?
> > >
> > >> The PF host kernel driver is acting as a scheduler.
> > >> It switches every few milliseconds which VF is the current active
> > >> function while disabling the other VFs.
> > >>
> > that's time-sharing of hw, which sw doesn't see ... so, ok.
> >
> > >> One consequence of how the hardware works is that the MMR regions of
> > >> the switched-off VFs must be unmapped and their IO access should
> > >> block until the VF is switched on again.
> > >
> > This violates the spec., and does impact sw -- how can one assign such
> > a VF to a guest -- it does not work indep. of other VFs.
> >
> > > MMR = Memory Mapped Register?
> > >
> > > This seems contradictory to the SR-IOV spec, which states:
> > >
> > >         Each VF contains a non-shared set of physical resources
> > >         required to deliver Function-specific services, e.g.,
> > >         resources such as work queues, data buffers, etc. These
> > >         resources can be directly accessed by an SI without requiring
> > >         VI or SR-PCIM intervention.
> > >
> > > Furthermore, each VF should have a separate requester ID.  What's
> > > being suggested here seems like maybe that's not the case.  If true,
> > > it would
> > I didn't read it that way above.  I read it as the PCIe end is
> > timeshared btwn VFs (& PFs?) ... with some VFs disappearing (from a
> > driver perspective) as if the device was hot-unplugged w/o
> > notification.  That will probably cause read-timeouts & SME's, bringing
> > down most enterprise-level systems.
>
> Perhaps I'm reading too much into it, but using the same requester ID
> would seem like justification for why the device needs to be unmapped.
> Otherwise we could just stop QEMU and leave the mappings alone if we
> just want to make sure access to the device is blocked while the device
> is swapped out.  Not the best overall throughput algorithm, but maybe a
> proof of concept.  Need more info about how the device actually behaves
> to know for sure.  Thanks,
>
> Alex
>
> > > make iommu groups challenging.  Is there any VF save/restore around
> > > the scheduling?
> > >
> > >> Each IOMMU map/unmap should be done in less than 100ns.
> > >
> > > I think that may be a lot to ask if we need to unmap the regions in
> > > the guest and in the iommu.  If the "VFs" used different requester
> > > IDs, iommu unmapping wouldn't be necessary.  I experimented with
> > > switching between trapped (read/write) access to memory regions and
> > > mmap'd (direct mapping) for handling legacy interrupts.  There was a
> > > noticeable performance penalty switching per interrupt.
> > >
> > >> As the kernel iommu module is being called by the VFIO driver, the
> > >> PF driver cannot interface with it.
> > >>
> > >> Currently the only interface of the VFIO code is for the userland
> > >> QEMU process, and I fear that notifying QEMU that it should do the
> > >> unmap/block would take more than 100ns.
> > >>
> > >> Also, blocking the IO access in QEMU under the BQL would freeze QEMU.
> > >>
> > >> Do you have an idea on how to write this required map and
> > >> block/unmap feature?
> > >
> > > It seems like there are several options, but I'm doubtful that any of
> > > them will meet 100ns.  If this is completely fake SR-IOV and there's
> > > not a different requester ID per VF, I'd start with seeing if you can
> > > even do the iommu_unmap/iommu_map of the MMIO BARs in under 100ns.
> > > If that's close to your limit, then your only real option for QEMU is
> > > to freeze it, which still involves getting multiple (maybe many)
> > > vCPUs out of VM mode.  That's not free either.  If by some miracle
> > > you have time to spare, you could remap the regions to trapped mode
> > > and let the vCPUs run while vfio blocks on read/write.
> > >
> > > Maybe there's even a question whether mmap'd mode is worthwhile for
> > > this device.  Trapping every read/write is orders of magnitude
> > > slower, but allows you to handle the "wait for VF" on the kernel
> > > side.
> > >
> > > If you can provide more info on the device design/constraints, maybe
> > > we can come up with better options.  Thanks,
> > >
> > > Alex
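
For reference, if the unmap had to be done by notifying QEMU, the userspace
side would look roughly like the sketch below on every switch.
VFIO_IOMMU_MAP_DMA and VFIO_IOMMU_UNMAP_DMA are the real type1 IOMMU
ioctls; container_fd, mmr_vaddr, mmr_iova and mmr_size are placeholders,
and the notification path from the PF scheduler to QEMU is exactly what
does not exist today. A signal or eventfd round trip into QEMU plus one
ioctl per switch makes the 100ns budget look very hard to meet.

/*
 * Sketch of the per-switch userspace work if QEMU had to do the unmap
 * itself.  The ioctls and structs are the real VFIO type1 interface;
 * container_fd, mmr_vaddr, mmr_iova and mmr_size are placeholders.
 */
#include <sys/ioctl.h>
#include <string.h>
#include <linux/vfio.h>

static int vf_mmr_unmap(int container_fd, __u64 mmr_iova, __u64 mmr_size)
{
        struct vfio_iommu_type1_dma_unmap unmap;

        memset(&unmap, 0, sizeof(unmap));
        unmap.argsz = sizeof(unmap);
        unmap.iova = mmr_iova;
        unmap.size = mmr_size;
        /* One syscall per VF switch-out, before any blocking of CPU access. */
        return ioctl(container_fd, VFIO_IOMMU_UNMAP_DMA, &unmap);
}

static int vf_mmr_remap(int container_fd, void *mmr_vaddr,
                        __u64 mmr_iova, __u64 mmr_size)
{
        struct vfio_iommu_type1_dma_map map;

        memset(&map, 0, sizeof(map));
        map.argsz = sizeof(map);
        map.flags = VFIO_DMA_MAP_FLAG_READ | VFIO_DMA_MAP_FLAG_WRITE;
        map.vaddr = (__u64)(unsigned long)mmr_vaddr;
        map.iova = mmr_iova;
        map.size = mmr_size;
        /* And the corresponding re-map on switch-in. */
        return ioctl(container_fd, VFIO_IOMMU_MAP_DMA, &map);
}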