Re: [PATCH] libxl: Don't insert PCI device into xenstore for HVM guests

All of lore.kernel.org
 help / color / mirror / Atom feed

From: Malcolm Crossley <malcolm.crossley@citrix.com>
To: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Cc: Wei Liu <wei.liu2@citrix.com>,
	Ian Campbell <ian.campbell@citrix.com>,
	Stefano Stabellini <stefano.stabellini@eu.citrix.com>,
	Ian Jackson <ian.jackson@eu.citrix.com>,
	xen-devel@lists.xen.org,
	Ross Lagerwall <ross.lagerwall@citrix.com>
Subject: Re: [PATCH] libxl: Don't insert PCI device into xenstore for HVM guests
Date: Tue, 2 Jun 2015 16:48:25 +0100	[thread overview]
Message-ID: <556DD049.1090209@citrix.com> (raw)
In-Reply-To: <20150602143412.GD8260@l.oracle.com>

On 02/06/15 15:34, Konrad Rzeszutek Wilk wrote:
> On Tue, Jun 02, 2015 at 11:06:26AM +0100, Malcolm Crossley wrote:
>> On 01/06/15 18:55, Konrad Rzeszutek Wilk wrote:
>>> On Mon, Jun 01, 2015 at 05:03:14PM +0100, Malcolm Crossley wrote:
>>>> On 01/06/15 16:43, Ross Lagerwall wrote:
>>>>> On 06/01/2015 04:26 PM, Konrad Rzeszutek Wilk wrote:
>>>>>> On Fri, May 29, 2015 at 08:59:45AM +0100, Ross Lagerwall wrote:
>>>>>>> When doing passthrough of a PCI device for an HVM guest, don't insert
>>>>>>> the device into xenstore, otherwise pciback attempts to use it which
>>>>>>> conflicts with QEMU.
>>>>>>
>>>>>> How does it conflict?
>>>>>
>>>>> It doesn't work with repeated use. See below.
>>>>>
>>>>>>>
>>>>>>> This manifests itself such that the first time a device is passed to a
>>>>>>> domain, it succeeds. Subsequent attempts fail unless the device is
>>>>>>> unbound from pciback or the machine rebooted.
>>>>>>
>>>>>> Can you be more specific please? What are the issues? Why does it
>>>>>> fail?
>>>>>
>>>>> Without this patch, if a device (e.g. a GPU) is bound to pciback and
>>>>> then passed through to a guest using xl pci-attach, it appears in the
>>>>> guest and works fine. If the guest is rebooted, and the device is again
>>>>> passed through with xl pci-attach, it appears in the guest as before but
>>>>> does not work. In Windows, it gets something like Error Code 43 and on
>>>>> Linux, the Nouveau driver fails to initialize the device (with error -22
>>>>> or something). The only way to get the device to work again is to reboot
>>>>> the host or unbind and rebind it to pciback.
>>>>>
>>>>> With this patch, it works as expected. The device is bound to pciback
>>>>> and works after being passed through, even after the VM is rebooted.
>>>>>
>>>>>>
>>>>>> There are certain things that pciback does to "prepare" an PCI device
>>>>>> which QEMU also does. Some of them - such as saving the configuration
>>>>>> registers (And then restoring them after the device has been detached) -
>>>>>> is something that QEMU does not do.
>>>>>>
>>>>>
>>>>> I really have no idea what the correct thing to do is, but the current
>>>>> code with qemu-trad doesn't seem to work (for me).
> 
> I think I know what the problem is. Do you by any chance have the XSA133-addenum
> patch in? If not could you apply it and tell me if it works?
> 
>>>>
>>>> The pciback pci_stub.c implements the pciback.hide and the device reset
>>>> logic.
>>>>
>>>> The rest of pciback implements the pciback xenbus device which PV guests
>>>> need in order to map/unmap MSI interrupts and access PCI config space.
>>>>
>>>> QEMU emulates and handles the MSI interrupt capabilities and PCI config
>>>> space directly.
>>>
>>> Right..
>>>>
>>>> This is why a pciback xenbus device should not be created for
>>>> passthrough PCI device being handled by QEMU.
>>>
>>> To me that sounds that we should not have PV drivers because QEMU
>>> emulates IDE or network devices.
>>
>> That is different. We first boot with QEMU handling the devices and then
>> we explictly unplug QEMU's handling of IDE and network devices.
>>
>> That handover protocol does not currently exist for PCI passthrough
>> devices so we have to chose one mechanism or the other to manage the
>> passed through PCI device at boot time. Otherwise a HVM guest could load
>> pcifront and cause's all kinds of chaos with interrupt management or
>> outbound MMIO window management.
> 
> Which would be fun! :-)
>>
>>>
>>> The crux here is that none of the operations that pciback performs
>>> should affect QEMU or guests. But it does - so there is a bug.
>>
>> I agree there is a bug but should we try to fix it based upon my
>> comments above?
> 
> I am still thinking about it. I do like certain things that pciback
> does as part of it being notified that a device is to be used by
> a guest and performing the configuration save/reset (see
> pcistub_put_pci_dev in the pciback).
> 
> If somehow that can still be done by libxl (or QEMU) via SysFS
> that would be good.
> 
> Just to clarify:
>  - I concur with you that having xen-pcifront loaded in HVM
>    guest and doing odd things behind QEMU is not good.
>  - I like the fact that xen-pciback does a bunch of safety
>    things with the PCI device to prepare it for a guest.
>  - Currently these 'safety things'  are done when you
>    'unbind' or 'bind' the device to pciback.
>  - Or when the guest is shutdown and via XenBus we are told
>    and can do the 'safety things'. This is the crux - if there
>    is a way to do this via SysFS this would be super.
> 
>    Or perhaps xenpciback can figure out that the guest is HVM
>    and ignore any XenBus actions?
> 

Xenserver toolstack currently bind/unbinds the device to pciback and
manually triggers the reset on the device. It then uses xenstore keys to
communicate with the QEMU hotplug mechanism.

A complete description is here:

"Xenopsd performs the following steps for HVM PCI passthrough (for each
PCI device):

    write /local/domain/0/backend/pci/<domid>/0/msitranslate “0” or “1”
    write /local/domain/0/backend/pci/<domid>/0/power_mgmt “0” or “1”
    bind device to pciback (write device id to the new_slot and bind
nodes in sysfs)
    write /local/domain/0/backend/device-model/<domid>/command “pci-ins”
    write /local/domain/0/backend/device-model/<domid>/parameter
“xxxx:xx:xx.x”
    wait for “pci-inserted” to appear in
/local/domain/0/backend/<domid>/device-model/state
    write /local/domain/0/backend/pci/<domid>/0/dev-<x> “xxxx:xx:xx.x”
    if /local/domain/<domid>/device/pci/0 does not yet exist, then
create /local/domain/<domid>/device/pci/0, give ownership to the guest,
and write backend=/local/domain/0/backend/pci/<domid>/0, backend-id=0,
state=1
    xc_domain_assign_device


To unplug:

    write /local/domain/0/backend/device-model/<domid>/command “pci-rem”
    write /local/domain/0/backend/device-model/<domid>/parameter
“xxxx:xx:xx.x”
    wait for “pci-removed” to appear in
/local/domain/0/backend/<domid>/device-model/state
    remove /local/domain/0/backend/pci/<domid>/0/dev-<x>
    call /usr/lib/xcp/lib/pci-flr flr-pre xxxx:xx:xx.x
    if the file /sys/bus/pci/devices/xxxx:xx:xx.x/reset exists, then
write "1" to it; otherwise write "xxxx:xx:xx.x" to
/sys/bus/pci/drivers/pciback/do_flr
    call /usr/lib/xcp/lib/pci-flr flr-post xxxx:xx:xx.x
    xc_domain_deassign_device"


I would recommend that libxl performs similar operations (except for the
QEMU hotplug part and the flr script parts).

We should probably split pciback into parts:

1. A part to capture device at boot and prevent other drivers loading
(bind/unbind)

2. A part to manage reset on device's which don't have device specific
reset's.

3. A part to manage the pciback PV device for communicating to pcifront.


I don't think we can have pciback work out if it's a HVM guest and I
think we should instead use the toolstack to prep the device correctly
before passthrough. The toolstack is donating it's PCI device to the new
domain afterall :)


>>>
>>> I would like to understand which ones do it so I can fix in
>>> pciback - as it might be also be a problem with PV.
>>>
>>> Unless... are you by any chance using extra patches on top of the
>>> native pciback?
>>
>> We do have extra patches but they only allow us to do a SBR on PCI
>> device's which require it. They failure listed above occurs on devices
>> with device specific resets (e.g. FLR,D3) as well so those extra patches
>> aren't being used.
>>
>>>
>>>>
>>>> Malcolm
>>>>
>>>>>
>>>>> Regards
>>>>
>>

next prev parent reply	other threads:[~2015-06-02 15:48 UTC|newest]

Thread overview: 17+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2015-05-29  7:59 [PATCH] libxl: Don't insert PCI device into xenstore for HVM guests Ross Lagerwall
2015-05-29  9:41 ` Wei Liu
2015-05-29  9:43   ` Ross Lagerwall
2015-05-29  9:50     ` Wei Liu
2015-05-29  9:54       ` Ross Lagerwall
2015-05-29 10:24         ` Wei Liu
2015-06-01 10:12 ` George Dunlap
2015-06-01 15:26 ` Konrad Rzeszutek Wilk
2015-06-01 15:43   ` Ross Lagerwall
2015-06-01 15:58     ` Konrad Rzeszutek Wilk
2015-06-01 15:59     ` Sander Eikelenboom
2015-06-01 16:03     ` Malcolm Crossley
2015-06-01 17:55       ` Konrad Rzeszutek Wilk
2015-06-02 10:06         ` Malcolm Crossley
2015-06-02 14:34           ` Konrad Rzeszutek Wilk
2015-06-02 15:48             ` Malcolm Crossley [this message]
2015-06-10 20:10               ` Konrad Rzeszutek Wilk

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=556DD049.1090209@citrix.com \
    --to=malcolm.crossley@citrix.com \
    --cc=ian.campbell@citrix.com \
    --cc=ian.jackson@eu.citrix.com \
    --cc=konrad.wilk@oracle.com \
    --cc=ross.lagerwall@citrix.com \
    --cc=stefano.stabellini@eu.citrix.com \
    --cc=wei.liu2@citrix.com \
    --cc=xen-devel@lists.xen.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.