From mboxrd@z Thu Jan  1 00:00:00 1970
From: Malcolm Crossley <malcolm.crossley@citrix.com>
Subject: Re: [PATCH] libxl: Don't insert PCI device into
 xenstore for HVM guests
Date: Tue, 2 Jun 2015 16:48:25 +0100
Message-ID: <556DD049.1090209@citrix.com>
References: <1432886385-20313-1-git-send-email-ross.lagerwall@citrix.com>
	<20150601152636.GD11531@x230.dumpdata.com>
	<556C7DB9.5000705@citrix.com>
	<556C8242.9060606@citrix.com> <20150601175550.GI13347@x230>
	<556D8022.5030504@citrix.com> <20150602143412.GD8260@l.oracle.com>
Mime-Version: 1.0
Content-Type: text/plain; charset="windows-1252"
Content-Transfer-Encoding: quoted-printable
Return-path: <xen-devel-bounces@lists.xen.org>
In-Reply-To: <20150602143412.GD8260@l.oracle.com>
List-Unsubscribe: <http://lists.xen.org/cgi-bin/mailman/options/xen-devel>,
	<mailto:xen-devel-request@lists.xen.org?subject=unsubscribe>
List-Post: <mailto:xen-devel@lists.xen.org>
List-Help: <mailto:xen-devel-request@lists.xen.org?subject=help>
List-Subscribe: <http://lists.xen.org/cgi-bin/mailman/listinfo/xen-devel>,
	<mailto:xen-devel-request@lists.xen.org?subject=subscribe>
Sender: xen-devel-bounces@lists.xen.org
Errors-To: xen-devel-bounces@lists.xen.org
To: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Cc: Wei Liu <wei.liu2@citrix.com>, Ian Campbell <ian.campbell@citrix.com>, Stefano Stabellini <stefano.stabellini@eu.citrix.com>, Ian Jackson <ian.jackson@eu.citrix.com>, xen-devel@lists.xen.org, Ross Lagerwall <ross.lagerwall@citrix.com>
List-Id: xen-devel@lists.xenproject.org

On 02/06/15 15:34, Konrad Rzeszutek Wilk wrote:
> On Tue, Jun 02, 2015 at 11:06:26AM +0100, Malcolm Crossley wrote:
>> On 01/06/15 18:55, Konrad Rzeszutek Wilk wrote:
>>> On Mon, Jun 01, 2015 at 05:03:14PM +0100, Malcolm Crossley wrote:
>>>> On 01/06/15 16:43, Ross Lagerwall wrote:
>>>>> On 06/01/2015 04:26 PM, Konrad Rzeszutek Wilk wrote:
>>>>>> On Fri, May 29, 2015 at 08:59:45AM +0100, Ross Lagerwall wrote:
>>>>>>> When doing passthrough of a PCI device for an HVM guest, don't inse=
rt
>>>>>>> the device into xenstore, otherwise pciback attempts to use it which
>>>>>>> conflicts with QEMU.
>>>>>>
>>>>>> How does it conflict?
>>>>>
>>>>> It doesn't work with repeated use. See below.
>>>>>
>>>>>>>
>>>>>>> This manifests itself such that the first time a device is passed t=
o a
>>>>>>> domain, it succeeds. Subsequent attempts fail unless the device is
>>>>>>> unbound from pciback or the machine rebooted.
>>>>>>
>>>>>> Can you be more specific please? What are the issues? Why does it
>>>>>> fail?
>>>>>
>>>>> Without this patch, if a device (e.g. a GPU) is bound to pciback and
>>>>> then passed through to a guest using xl pci-attach, it appears in the
>>>>> guest and works fine. If the guest is rebooted, and the device is aga=
in
>>>>> passed through with xl pci-attach, it appears in the guest as before =
but
>>>>> does not work. In Windows, it gets something like Error Code 43 and on
>>>>> Linux, the Nouveau driver fails to initialize the device (with error =
-22
>>>>> or something). The only way to get the device to work again is to reb=
oot
>>>>> the host or unbind and rebind it to pciback.
>>>>>
>>>>> With this patch, it works as expected. The device is bound to pciback
>>>>> and works after being passed through, even after the VM is rebooted.
>>>>>
>>>>>>
>>>>>> There are certain things that pciback does to "prepare" an PCI device
>>>>>> which QEMU also does. Some of them - such as saving the configuration
>>>>>> registers (And then restoring them after the device has been detache=
d) -
>>>>>> is something that QEMU does not do.
>>>>>>
>>>>>
>>>>> I really have no idea what the correct thing to do is, but the current
>>>>> code with qemu-trad doesn't seem to work (for me).
> =

> I think I know what the problem is. Do you by any chance have the XSA133-=
addenum
> patch in? If not could you apply it and tell me if it works?
> =

>>>>
>>>> The pciback pci_stub.c implements the pciback.hide and the device reset
>>>> logic.
>>>>
>>>> The rest of pciback implements the pciback xenbus device which PV gues=
ts
>>>> need in order to map/unmap MSI interrupts and access PCI config space.
>>>>
>>>> QEMU emulates and handles the MSI interrupt capabilities and PCI config
>>>> space directly.
>>>
>>> Right..
>>>>
>>>> This is why a pciback xenbus device should not be created for
>>>> passthrough PCI device being handled by QEMU.
>>>
>>> To me that sounds that we should not have PV drivers because QEMU
>>> emulates IDE or network devices.
>>
>> That is different. We first boot with QEMU handling the devices and then
>> we explictly unplug QEMU's handling of IDE and network devices.
>>
>> That handover protocol does not currently exist for PCI passthrough
>> devices so we have to chose one mechanism or the other to manage the
>> passed through PCI device at boot time. Otherwise a HVM guest could load
>> pcifront and cause's all kinds of chaos with interrupt management or
>> outbound MMIO window management.
> =

> Which would be fun! :-)
>>
>>>
>>> The crux here is that none of the operations that pciback performs
>>> should affect QEMU or guests. But it does - so there is a bug.
>>
>> I agree there is a bug but should we try to fix it based upon my
>> comments above?
> =

> I am still thinking about it. I do like certain things that pciback
> does as part of it being notified that a device is to be used by
> a guest and performing the configuration save/reset (see
> pcistub_put_pci_dev in the pciback).
> =

> If somehow that can still be done by libxl (or QEMU) via SysFS
> that would be good.
> =

> Just to clarify:
>  - I concur with you that having xen-pcifront loaded in HVM
>    guest and doing odd things behind QEMU is not good.
>  - I like the fact that xen-pciback does a bunch of safety
>    things with the PCI device to prepare it for a guest.
>  - Currently these 'safety things'  are done when you
>    'unbind' or 'bind' the device to pciback.
>  - Or when the guest is shutdown and via XenBus we are told
>    and can do the 'safety things'. This is the crux - if there
>    is a way to do this via SysFS this would be super.
> =

>    Or perhaps xenpciback can figure out that the guest is HVM
>    and ignore any XenBus actions?
> =


Xenserver toolstack currently bind/unbinds the device to pciback and
manually triggers the reset on the device. It then uses xenstore keys to
communicate with the QEMU hotplug mechanism.

A complete description is here:

"Xenopsd performs the following steps for HVM PCI passthrough (for each
PCI device):

    write /local/domain/0/backend/pci/<domid>/0/msitranslate =930=94 or =93=
1=94
    write /local/domain/0/backend/pci/<domid>/0/power_mgmt =930=94 or =931=
=94
    bind device to pciback (write device id to the new_slot and bind
nodes in sysfs)
    write /local/domain/0/backend/device-model/<domid>/command =93pci-ins=
=94
    write /local/domain/0/backend/device-model/<domid>/parameter
=93xxxx:xx:xx.x=94
    wait for =93pci-inserted=94 to appear in
/local/domain/0/backend/<domid>/device-model/state
    write /local/domain/0/backend/pci/<domid>/0/dev-<x> =93xxxx:xx:xx.x=94
    if /local/domain/<domid>/device/pci/0 does not yet exist, then
create /local/domain/<domid>/device/pci/0, give ownership to the guest,
and write backend=3D/local/domain/0/backend/pci/<domid>/0, backend-id=3D0,
state=3D1
    xc_domain_assign_device


To unplug:

    write /local/domain/0/backend/device-model/<domid>/command =93pci-rem=
=94
    write /local/domain/0/backend/device-model/<domid>/parameter
=93xxxx:xx:xx.x=94
    wait for =93pci-removed=94 to appear in
/local/domain/0/backend/<domid>/device-model/state
    remove /local/domain/0/backend/pci/<domid>/0/dev-<x>
    call /usr/lib/xcp/lib/pci-flr flr-pre xxxx:xx:xx.x
    if the file /sys/bus/pci/devices/xxxx:xx:xx.x/reset exists, then
write "1" to it; otherwise write "xxxx:xx:xx.x" to
/sys/bus/pci/drivers/pciback/do_flr
    call /usr/lib/xcp/lib/pci-flr flr-post xxxx:xx:xx.x
    xc_domain_deassign_device"


I would recommend that libxl performs similar operations (except for the
QEMU hotplug part and the flr script parts).

We should probably split pciback into parts:

1. A part to capture device at boot and prevent other drivers loading
(bind/unbind)

2. A part to manage reset on device's which don't have device specific
reset's.

3. A part to manage the pciback PV device for communicating to pcifront.


I don't think we can have pciback work out if it's a HVM guest and I
think we should instead use the toolstack to prep the device correctly
before passthrough. The toolstack is donating it's PCI device to the new
domain afterall :)


>>>
>>> I would like to understand which ones do it so I can fix in
>>> pciback - as it might be also be a problem with PV.
>>>
>>> Unless... are you by any chance using extra patches on top of the
>>> native pciback?
>>
>> We do have extra patches but they only allow us to do a SBR on PCI
>> device's which require it. They failure listed above occurs on devices
>> with device specific resets (e.g. FLR,D3) as well so those extra patches
>> aren't being used.
>>
>>>
>>>>
>>>> Malcolm
>>>>
>>>>>
>>>>> Regards
>>>>
>>