From mboxrd@z Thu Jan  1 00:00:00 1970
From: Alex Williamson <alex.williamson@redhat.com>
Subject: Re: [RFC PATCH] qemu pci: pci_add_capability
 enhancement to prevent damaging config space
Date: Fri, 08 Jun 2012 08:56:24 -0600
Message-ID: <1339167384.26976.71.camel@ul30vt>
References: <4FACB581.2050609@ozlabs.ru>
	<6A22E211-BC82-49BD-A335-02D3BAA14A17@suse.de>
	<4FAD0A4F.2050506@ozlabs.ru>
	<E3C5DF35-72FF-4247-BC86-F3374E2B40E6@suse.de>
	<4FB080CE.3030703@ozlabs.ru>
	<4FB5DA43.90907@ozlabs.ru> <1337652170.2779.143.camel@pasglop>
	<6C472F5B-B8C3-48DE-B19B-00973AF6AC56@suse.de>
	<4FBB0B95.8050901@ozlabs.ru>
	<82643009-4F43-407F-B26C-C36537825BFD@suse.de>
	<4FBB2E25.2030206@ozlabs.ru>
	<584A5E54-2119-415C-93B4-BB91A08CA729@suse.de>
	<4FD1BC14.6030900@ozlabs.ru>
	<4FD1DA5C.5020900@siemens.com> <4FD1DF29.1050303@ozlabs.ru>
	<4FD1E25A.2010900@siemens.com> <4FD2058F.7030903@ozlabs.ru>
	<4FD20F92.9080805@siemens.com>
Mime-Version: 1.0
Content-Type: text/plain; charset="UTF-8"
Content-Transfer-Encoding: quoted-printable
Cc: "kvm@vger.kernel.org" <kvm@vger.kernel.org>,
	Alexey Kardashevskiy <aik@ozlabs.ru>, Alexander Graf <agraf@suse.de>,
	"qemu-devel@nongnu.org" <qemu-devel@nongnu.org>,
	"anthony@codemonkey.ws" <anthony@codemonkey.ws>,
	David Gibson <david@gibson.dropbear.id.au>
To: Jan Kiszka <jan.kiszka@siemens.com>
Return-path: <qemu-devel-bounces+gceq-qemu-devel=gmane.org@nongnu.org>
In-Reply-To: <4FD20F92.9080805@siemens.com>
List-Unsubscribe: <https://lists.nongnu.org/mailman/options/qemu-devel>,
	<mailto:qemu-devel-request@nongnu.org?subject=unsubscribe>
List-Archive: <http://lists.nongnu.org/archive/html/qemu-devel>
List-Post: <mailto:qemu-devel@nongnu.org>
List-Help: <mailto:qemu-devel-request@nongnu.org?subject=help>
List-Subscribe: <https://lists.nongnu.org/mailman/listinfo/qemu-devel>,
	<mailto:qemu-devel-request@nongnu.org?subject=subscribe>
Errors-To: qemu-devel-bounces+gceq-qemu-devel=gmane.org@nongnu.org
Sender: qemu-devel-bounces+gceq-qemu-devel=gmane.org@nongnu.org
List-Id: kvm.vger.kernel.org

On Fri, 2012-06-08 at 16:43 +0200, Jan Kiszka wrote:
> On 2012-06-08 16:00, Alexey Kardashevskiy wrote:
> > 08.06.2012 21:30, Jan Kiszka writes:
> >> On 2012-06-08 13:16, Alexey Kardashevskiy wrote:
> >>> 08.06.2012 20:56, Jan Kiszka wrote:
> >>>> On 2012-06-08 10:47, Alexey Kardashevskiy wrote:
> >>>>> Yet another try :)
> >>>>>
> >>>>> Normally pci_add_capability is called on devices to add a new
> >>>>> capability. This is OK for emulated devices, whose capability list
> >>>>> is built by QEMU.
> >>>>>
> >>>>> In the case of VFIO the capability may already exist and adding new
> >>>>
> >>>> Why does it exist? VFIO should build the virtual capability list from
> >>>> scratch (just like classic device assignment does), recreating the
> >>>> layout of the physical device (except for masked out caps). In that
> >>>> case, this conflict should become impossible, no?
> >>>
> >>> Normally capabilities in emulated devices are created by calling
> >>> msi_init or msix_init - just when the emulated device wants to advertise
> >>> them to the guest.
> >>>
> >>> In the case of VFIO, there are a lot of capabilities which QEMU does not
> >>> know and does not want to know about. They are read from the host kernel
> >>> as is. And we definitely want to pass these capabilities to the guest as
> >>> is, i.e. at the same positions and in the same number. Just for some
> >>> we call pci_add_capability (indirectly!) if we want QEMU to support them
> >>> somehow.
> >>>
> >>> If we invent some function which "re-adds" all the capabilities we got
> >>> from the host to keep QEMU's internal PCIDevice data in sync, then we'll
> >>> need to change every piece of code which adds capabilities.
> >>
> >> I can't follow. What is different in VFIO from device-assignment.c,
> >> assigned_device_pci_cap_init (except that it already uses msi[x]_init,
> >> something we need to fix in device-assignment.c)?
> >
> > What are device-assignment.c and assigned_device_pci_cap_init? Cannot
> > find them in QEMU tree.
>
> "Old-style" KVM device assignment is not yet upstream. You can find it
> in qemu-kvm, hopefully in upstream soon as well.
>
> >
> > Ah, anyway. The main difference is QEMU does not emulate VFIO devices,
> > it is just a proxy to the host system. Or I do not understand the question.
> >
> >>> I noticed,
> >>> this is a very common approach here to change a lot for a very small thing
> >>> or rare case but I'd like to avoid this :)
> >>>
> >>>> But if pci_*add*_capability should actually be used like this (I doubt
> >>>> this),
> >>>
> >>> MSI/MSIX use it. To enable MSI/MSIX on VFIO PCIDevice, we call
> >>> msi_init/msix_init and they call pci_add_capability.
> >>
> >> You can't blame msi_init/msix_init for the fact that VFIO creates a
> >> capability list with an existing MSI/MSI-X entry beforehand.
> >
> > VFIO does not create any capability. It gets them all from the host
> > kernel and passes them to the guest as is. VFIO only needs MSIX to be
> > enabled in VFIO.
>
> Just like any device in QEMU, VFIO also needs to set up a virtual config
> space when it registers with the PCI core layer. Even if the virtual one
> is modeled after the real one, it is still _created_ by the VFIO
> userspace part. And this creation process is obviously a bit messed up
> so far. Fix this, but not by adding workarounds in the MSI or PCI layer.
> Rather add all capabilities you want to expose to the guest via
> pci_add_capability or, indirectly, via msi[x]_init at the right
> position. Do not just copy the real config space over, that breaks the
> core layer as we see.

The difference between VFIO and kvm device assignment is that VFIO
emulates a lot of config space for us, so most things are passed
through.  MSI and MSIX are unique in that we actually do want qemu
support to help us manage them.  So we're basically not telling
qemu about anything other than these, and for the most part, that works
since qemu never handles access to the other capabilities.  However, I
think you're probably right, VFIO should just walk the capabilities
list, registering each with qemu.  It's a little "unnecessary" overhead
from the VFIO perspective, but it makes the VFIO device less unique.
I'll work on adding this.  Thanks,

Alex
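
For reference, the capability walk mentioned above could look roughly
like the standalone sketch below.  It is not VFIO or qemu code:
walk_capabilities, cap_visitor, cap_log, record_cap and
build_sample_config are hypothetical names made up for illustration,
and the visitor callback only stands in for whatever would register
each capability with qemu (no particular pci_add_capability signature
is assumed).  The register offsets and capability IDs are the standard
PCI ones.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define PCI_CONFIG_SIZE      256
#define PCI_STATUS           0x06  /* status register, low byte */
#define PCI_STATUS_CAP_LIST  0x10  /* set if a capability list exists */
#define PCI_CAPABILITY_LIST  0x34  /* pointer to the first capability */
#define PCI_CAP_LIST_ID      0     /* byte 0 of an entry: capability ID */
#define PCI_CAP_LIST_NEXT    1     /* byte 1 of an entry: next pointer */
#define PCI_CAP_ID_PM        0x01
#define PCI_CAP_ID_MSI       0x05
#define PCI_CAP_ID_MSIX      0x11
#define MAX_CAPS             48    /* (256 - 64) / 4 dword-aligned slots */

/* Hypothetical hook: called once per capability found in the list. */
typedef void (*cap_visitor)(uint8_t id, uint8_t offset, void *opaque);

/*
 * Walk the capability list of a 256-byte config space image.
 * Returns the number of capabilities visited, or -1 if the list is
 * malformed (pointer into the standard header, or a loop).
 */
static int walk_capabilities(const uint8_t *config, cap_visitor visit,
                             void *opaque)
{
    int count = 0;
    uint8_t pos;

    assert(config && visit);
    if (!(config[PCI_STATUS] & PCI_STATUS_CAP_LIST)) {
        return 0;                          /* device has no capabilities */
    }
    /* The low two bits of capability pointers are reserved; mask them. */
    pos = (uint8_t)(config[PCI_CAPABILITY_LIST] & 0xfc);
    while (pos) {
        if (pos < 0x40 || count >= MAX_CAPS) {
            return -1;                     /* bad pointer or looping list */
        }
        visit(config[pos + PCI_CAP_LIST_ID], pos, opaque);
        count++;
        pos = (uint8_t)(config[pos + PCI_CAP_LIST_NEXT] & 0xfc);
    }
    return count;
}

/* Illustrative visitor: records what a real one would register. */
struct cap_log {
    int n;
    uint8_t ids[MAX_CAPS];
    uint8_t offsets[MAX_CAPS];
};

static void record_cap(uint8_t id, uint8_t offset, void *opaque)
{
    struct cap_log *log = opaque;

    if (log->n < MAX_CAPS) {
        log->ids[log->n] = id;
        log->offsets[log->n] = offset;
        log->n++;
    }
}

/* Made-up device image: PM at 0x40 -> MSI at 0x50 -> MSI-X at 0x70. */
static void build_sample_config(uint8_t *config)
{
    memset(config, 0, PCI_CONFIG_SIZE);
    config[PCI_STATUS] |= PCI_STATUS_CAP_LIST;
    config[PCI_CAPABILITY_LIST] = 0x40;
    config[0x40] = PCI_CAP_ID_PM;   config[0x41] = 0x50;
    config[0x50] = PCI_CAP_ID_MSI;  config[0x51] = 0x70;
    config[0x70] = PCI_CAP_ID_MSIX; config[0x71] = 0x00;
}
```

In this sketch the MSI and MSI-X entries would be the ones handed to
msi_init/msix_init, while the rest would only be recorded so that the
core's view of config space matches what the guest sees.  The loop
bound is a cheap guard against a device reporting a circular list.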