From mboxrd@z Thu Jan 1 00:00:00 1970 From: Malcolm Crossley Subject: Re: [PATCH] libxl: Don't insert PCI device into xenstore for HVM guests Date: Tue, 2 Jun 2015 16:48:25 +0100 Message-ID: <556DD049.1090209@citrix.com> References: <1432886385-20313-1-git-send-email-ross.lagerwall@citrix.com> <20150601152636.GD11531@x230.dumpdata.com> <556C7DB9.5000705@citrix.com> <556C8242.9060606@citrix.com> <20150601175550.GI13347@x230> <556D8022.5030504@citrix.com> <20150602143412.GD8260@l.oracle.com> Mime-Version: 1.0 Content-Type: text/plain; charset="windows-1252" Content-Transfer-Encoding: quoted-printable Return-path: In-Reply-To: <20150602143412.GD8260@l.oracle.com> List-Unsubscribe: , List-Post: List-Help: List-Subscribe: , Sender: xen-devel-bounces@lists.xen.org Errors-To: xen-devel-bounces@lists.xen.org To: Konrad Rzeszutek Wilk Cc: Wei Liu , Ian Campbell , Stefano Stabellini , Ian Jackson , xen-devel@lists.xen.org, Ross Lagerwall List-Id: xen-devel@lists.xenproject.org On 02/06/15 15:34, Konrad Rzeszutek Wilk wrote: > On Tue, Jun 02, 2015 at 11:06:26AM +0100, Malcolm Crossley wrote: >> On 01/06/15 18:55, Konrad Rzeszutek Wilk wrote: >>> On Mon, Jun 01, 2015 at 05:03:14PM +0100, Malcolm Crossley wrote: >>>> On 01/06/15 16:43, Ross Lagerwall wrote: >>>>> On 06/01/2015 04:26 PM, Konrad Rzeszutek Wilk wrote: >>>>>> On Fri, May 29, 2015 at 08:59:45AM +0100, Ross Lagerwall wrote: >>>>>>> When doing passthrough of a PCI device for an HVM guest, don't inse= rt >>>>>>> the device into xenstore, otherwise pciback attempts to use it which >>>>>>> conflicts with QEMU. >>>>>> >>>>>> How does it conflict? >>>>> >>>>> It doesn't work with repeated use. See below. >>>>> >>>>>>> >>>>>>> This manifests itself such that the first time a device is passed t= o a >>>>>>> domain, it succeeds. Subsequent attempts fail unless the device is >>>>>>> unbound from pciback or the machine rebooted. >>>>>> >>>>>> Can you be more specific please? What are the issues? Why does it >>>>>> fail? >>>>> >>>>> Without this patch, if a device (e.g. a GPU) is bound to pciback and >>>>> then passed through to a guest using xl pci-attach, it appears in the >>>>> guest and works fine. If the guest is rebooted, and the device is aga= in >>>>> passed through with xl pci-attach, it appears in the guest as before = but >>>>> does not work. In Windows, it gets something like Error Code 43 and on >>>>> Linux, the Nouveau driver fails to initialize the device (with error = -22 >>>>> or something). The only way to get the device to work again is to reb= oot >>>>> the host or unbind and rebind it to pciback. >>>>> >>>>> With this patch, it works as expected. The device is bound to pciback >>>>> and works after being passed through, even after the VM is rebooted. >>>>> >>>>>> >>>>>> There are certain things that pciback does to "prepare" an PCI device >>>>>> which QEMU also does. Some of them - such as saving the configuration >>>>>> registers (And then restoring them after the device has been detache= d) - >>>>>> is something that QEMU does not do. >>>>>> >>>>> >>>>> I really have no idea what the correct thing to do is, but the current >>>>> code with qemu-trad doesn't seem to work (for me). > = > I think I know what the problem is. Do you by any chance have the XSA133-= addenum > patch in? If not could you apply it and tell me if it works? > = >>>> >>>> The pciback pci_stub.c implements the pciback.hide and the device reset >>>> logic. >>>> >>>> The rest of pciback implements the pciback xenbus device which PV gues= ts >>>> need in order to map/unmap MSI interrupts and access PCI config space. >>>> >>>> QEMU emulates and handles the MSI interrupt capabilities and PCI config >>>> space directly. >>> >>> Right.. >>>> >>>> This is why a pciback xenbus device should not be created for >>>> passthrough PCI device being handled by QEMU. >>> >>> To me that sounds that we should not have PV drivers because QEMU >>> emulates IDE or network devices. >> >> That is different. We first boot with QEMU handling the devices and then >> we explictly unplug QEMU's handling of IDE and network devices. >> >> That handover protocol does not currently exist for PCI passthrough >> devices so we have to chose one mechanism or the other to manage the >> passed through PCI device at boot time. Otherwise a HVM guest could load >> pcifront and cause's all kinds of chaos with interrupt management or >> outbound MMIO window management. > = > Which would be fun! :-) >> >>> >>> The crux here is that none of the operations that pciback performs >>> should affect QEMU or guests. But it does - so there is a bug. >> >> I agree there is a bug but should we try to fix it based upon my >> comments above? > = > I am still thinking about it. I do like certain things that pciback > does as part of it being notified that a device is to be used by > a guest and performing the configuration save/reset (see > pcistub_put_pci_dev in the pciback). > = > If somehow that can still be done by libxl (or QEMU) via SysFS > that would be good. > = > Just to clarify: > - I concur with you that having xen-pcifront loaded in HVM > guest and doing odd things behind QEMU is not good. > - I like the fact that xen-pciback does a bunch of safety > things with the PCI device to prepare it for a guest. > - Currently these 'safety things' are done when you > 'unbind' or 'bind' the device to pciback. > - Or when the guest is shutdown and via XenBus we are told > and can do the 'safety things'. This is the crux - if there > is a way to do this via SysFS this would be super. > = > Or perhaps xenpciback can figure out that the guest is HVM > and ignore any XenBus actions? > = Xenserver toolstack currently bind/unbinds the device to pciback and manually triggers the reset on the device. It then uses xenstore keys to communicate with the QEMU hotplug mechanism. A complete description is here: "Xenopsd performs the following steps for HVM PCI passthrough (for each PCI device): write /local/domain/0/backend/pci//0/msitranslate =930=94 or =93= 1=94 write /local/domain/0/backend/pci//0/power_mgmt =930=94 or =931= =94 bind device to pciback (write device id to the new_slot and bind nodes in sysfs) write /local/domain/0/backend/device-model//command =93pci-ins= =94 write /local/domain/0/backend/device-model//parameter =93xxxx:xx:xx.x=94 wait for =93pci-inserted=94 to appear in /local/domain/0/backend//device-model/state write /local/domain/0/backend/pci//0/dev- =93xxxx:xx:xx.x=94 if /local/domain//device/pci/0 does not yet exist, then create /local/domain//device/pci/0, give ownership to the guest, and write backend=3D/local/domain/0/backend/pci//0, backend-id=3D0, state=3D1 xc_domain_assign_device To unplug: write /local/domain/0/backend/device-model//command =93pci-rem= =94 write /local/domain/0/backend/device-model//parameter =93xxxx:xx:xx.x=94 wait for =93pci-removed=94 to appear in /local/domain/0/backend//device-model/state remove /local/domain/0/backend/pci//0/dev- call /usr/lib/xcp/lib/pci-flr flr-pre xxxx:xx:xx.x if the file /sys/bus/pci/devices/xxxx:xx:xx.x/reset exists, then write "1" to it; otherwise write "xxxx:xx:xx.x" to /sys/bus/pci/drivers/pciback/do_flr call /usr/lib/xcp/lib/pci-flr flr-post xxxx:xx:xx.x xc_domain_deassign_device" I would recommend that libxl performs similar operations (except for the QEMU hotplug part and the flr script parts). We should probably split pciback into parts: 1. A part to capture device at boot and prevent other drivers loading (bind/unbind) 2. A part to manage reset on device's which don't have device specific reset's. 3. A part to manage the pciback PV device for communicating to pcifront. I don't think we can have pciback work out if it's a HVM guest and I think we should instead use the toolstack to prep the device correctly before passthrough. The toolstack is donating it's PCI device to the new domain afterall :) >>> >>> I would like to understand which ones do it so I can fix in >>> pciback - as it might be also be a problem with PV. >>> >>> Unless... are you by any chance using extra patches on top of the >>> native pciback? >> >> We do have extra patches but they only allow us to do a SBR on PCI >> device's which require it. They failure listed above occurs on devices >> with device specific resets (e.g. FLR,D3) as well so those extra patches >> aren't being used. >> >>> >>>> >>>> Malcolm >>>> >>>>> >>>>> Regards >>>> >>