Re: [PATCH] libxl: Don't insert PCI device into xenstore for HVM guests

From: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
To: Malcolm Crossley <malcolm.crossley@citrix.com>
Cc: Wei Liu <wei.liu2@citrix.com>,
	Ian Campbell <ian.campbell@citrix.com>,
	Stefano Stabellini <stefano.stabellini@eu.citrix.com>,
	Ian Jackson <ian.jackson@eu.citrix.com>,
	xen-devel@lists.xen.org,
	Ross Lagerwall <ross.lagerwall@citrix.com>
Subject: Re: [PATCH] libxl: Don't insert PCI device into xenstore for HVM guests
Date: Wed, 10 Jun 2015 16:10:56 -0400
Message-ID: <20150610201056.GA5464@l.oracle.com>
In-Reply-To: <556DD049.1090209@citrix.com>

On Tue, Jun 02, 2015 at 04:48:25PM +0100, Malcolm Crossley wrote:
> On 02/06/15 15:34, Konrad Rzeszutek Wilk wrote:
> > On Tue, Jun 02, 2015 at 11:06:26AM +0100, Malcolm Crossley wrote:
> >> On 01/06/15 18:55, Konrad Rzeszutek Wilk wrote:
> >>> On Mon, Jun 01, 2015 at 05:03:14PM +0100, Malcolm Crossley wrote:
> >>>> On 01/06/15 16:43, Ross Lagerwall wrote:
> >>>>> On 06/01/2015 04:26 PM, Konrad Rzeszutek Wilk wrote:
> >>>>>> On Fri, May 29, 2015 at 08:59:45AM +0100, Ross Lagerwall wrote:
> >>>>>>> When doing passthrough of a PCI device for an HVM guest, don't insert
> >>>>>>> the device into xenstore, otherwise pciback attempts to use it which
> >>>>>>> conflicts with QEMU.
> >>>>>>
> >>>>>> How does it conflict?
> >>>>>
> >>>>> It doesn't work with repeated use. See below.
> >>>>>
> >>>>>>>
> >>>>>>> This manifests itself such that the first time a device is passed to a
> >>>>>>> domain, it succeeds. Subsequent attempts fail unless the device is
> >>>>>>> unbound from pciback or the machine rebooted.
> >>>>>>
> >>>>>> Can you be more specific please? What are the issues? Why does it
> >>>>>> fail?
> >>>>>
> >>>>> Without this patch, if a device (e.g. a GPU) is bound to pciback and
> >>>>> then passed through to a guest using xl pci-attach, it appears in the
> >>>>> guest and works fine. If the guest is rebooted, and the device is again
> >>>>> passed through with xl pci-attach, it appears in the guest as before but
> >>>>> does not work. In Windows, it gets something like Error Code 43 and on
> >>>>> Linux, the Nouveau driver fails to initialize the device (with error -22
> >>>>> or something). The only way to get the device to work again is to reboot
> >>>>> the host or unbind and rebind it to pciback.
> >>>>>
> >>>>> With this patch, it works as expected. The device is bound to pciback
> >>>>> and works after being passed through, even after the VM is rebooted.
> >>>>>
> >>>>>>
> >>>>>> There are certain things that pciback does to "prepare" a PCI device
> >>>>>> which QEMU also does. Some of them - such as saving the configuration
> >>>>>> registers (and then restoring them after the device has been detached) -
> >>>>>> are things that QEMU does not do.
> >>>>>>
> >>>>>
> >>>>> I really have no idea what the correct thing to do is, but the current
> >>>>> code with qemu-trad doesn't seem to work (for me).
> > 
> > I think I know what the problem is. Do you by any chance have the XSA133-addendum
> > patch in? If not could you apply it and tell me if it works?

Ping?

It was XSA120-addendum.

> > 
> >>>>
> >>>> The pciback pci_stub.c implements the pciback.hide and the device reset
> >>>> logic.
> >>>>
> >>>> The rest of pciback implements the pciback xenbus device which PV guests
> >>>> need in order to map/unmap MSI interrupts and access PCI config space.
> >>>>
> >>>> QEMU emulates and handles the MSI interrupt capabilities and PCI config
> >>>> space directly.
> >>>
> >>> Right..
> >>>>
> >>>> This is why a pciback xenbus device should not be created for
> >>>> passthrough PCI device being handled by QEMU.
> >>>
> >>> To me that sounds like we should not have PV drivers because QEMU
> >>> emulates IDE or network devices.
> >>
> >> That is different. We first boot with QEMU handling the devices and then
> >> we explicitly unplug QEMU's handling of IDE and network devices.
> >>
> >> That handover protocol does not currently exist for PCI passthrough
> >> devices so we have to choose one mechanism or the other to manage the
> >> passed through PCI device at boot time. Otherwise an HVM guest could load
> >> pcifront and cause all kinds of chaos with interrupt management or
> >> outbound MMIO window management.
> > 
> > Which would be fun! :-)
> >>
> >>>
> >>> The crux here is that none of the operations that pciback performs
> >>> should affect QEMU or guests. But it does - so there is a bug.
> >>
> >> I agree there is a bug but should we try to fix it based upon my
> >> comments above?
> > 
> > I am still thinking about it. I do like certain things that pciback
> > does as part of it being notified that a device is to be used by
> > a guest and performing the configuration save/reset (see
> > pcistub_put_pci_dev in the pciback).
> > 
> > If somehow that can still be done by libxl (or QEMU) via SysFS
> > that would be good.
> > 
> > Just to clarify:
> >  - I concur with you that having xen-pcifront loaded in HVM
> >    guest and doing odd things behind QEMU is not good.
> >  - I like the fact that xen-pciback does a bunch of safety
> >    things with the PCI device to prepare it for a guest.
> >  - Currently these 'safety things'  are done when you
> >    'unbind' or 'bind' the device to pciback.
> >  - Or when the guest is shutdown and via XenBus we are told
> >    and can do the 'safety things'. This is the crux - if there
> >    is a way to do this via SysFS this would be super.
> > 
> >    Or perhaps xenpciback can figure out that the guest is HVM
> >    and ignore any XenBus actions?
> > 
> 
> Xenserver toolstack currently bind/unbinds the device to pciback and
> manually triggers the reset on the device. It then uses xenstore keys to
> communicate with the QEMU hotplug mechanism.
> 
> A complete description is here:
> 
> "Xenopsd performs the following steps for HVM PCI passthrough (for each
> PCI device):
> 
>     write /local/domain/0/backend/pci/<domid>/0/msitranslate “0” or “1”
>     write /local/domain/0/backend/pci/<domid>/0/power_mgmt “0” or “1”
>     bind device to pciback (write device id to the new_slot and bind
> nodes in sysfs)
>     write /local/domain/0/backend/device-model/<domid>/command “pci-ins”
>     write /local/domain/0/backend/device-model/<domid>/parameter
> “xxxx:xx:xx.x”
>     wait for “pci-inserted” to appear in
> /local/domain/0/backend/<domid>/device-model/state
>     write /local/domain/0/backend/pci/<domid>/0/dev-<x> “xxxx:xx:xx.x”
>     if /local/domain/<domid>/device/pci/0 does not yet exist, then
> create /local/domain/<domid>/device/pci/0, give ownership to the guest,
> and write backend=/local/domain/0/backend/pci/<domid>/0, backend-id=0,
> state=1
>     xc_domain_assign_device
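
The xenstore writes in the plug sequence quoted above can be laid out as
data. This is only a sketch: the function name, its arguments and the
dev_index default are made up for illustration, the paths are copied from
the quoted description, and the sysfs bind, the wait for "pci-inserted",
the frontend node creation and xc_domain_assign_device are left out.

```python
def hvm_passthrough_writes(domid, bdf, dev_index=0,
                           msitranslate=False, power_mgmt=False):
    """Return the (path, value) xenstore writes from the quoted plug steps.

    The first list is written before waiting for "pci-inserted", the
    second one after. Nothing is written here; a real toolstack would
    push these through its xenstore client.
    """
    pci = f"/local/domain/0/backend/pci/{domid}/0"
    dm = f"/local/domain/0/backend/device-model/{domid}"
    before_wait = [
        (f"{pci}/msitranslate", "1" if msitranslate else "0"),
        (f"{pci}/power_mgmt", "1" if power_mgmt else "0"),
        # The device is bound to pciback through sysfs at this point.
        (f"{dm}/command", "pci-ins"),
        (f"{dm}/parameter", bdf),
    ]
    after_wait = [(f"{pci}/dev-{dev_index}", bdf)]
    return before_wait, after_wait
```
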
> 
> 
> To unplug:
> 
>     write /local/domain/0/backend/device-model/<domid>/command “pci-rem”
>     write /local/domain/0/backend/device-model/<domid>/parameter
> “xxxx:xx:xx.x”
>     wait for “pci-removed” to appear in
> /local/domain/0/backend/<domid>/device-model/state
>     remove /local/domain/0/backend/pci/<domid>/0/dev-<x>
>     call /usr/lib/xcp/lib/pci-flr flr-pre xxxx:xx:xx.x

'pci-flr' ? User-space program to poke the configuration registers?

>     if the file /sys/bus/pci/devices/xxxx:xx:xx.x/reset exists, then
> write "1" to it; otherwise write "xxxx:xx:xx.x" to
> /sys/bus/pci/drivers/pciback/do_flr

Do you have a patch to expose this? This parameter does not exist
in the upstream Linux kernel.

>     call /usr/lib/xcp/lib/pci-flr flr-post xxxx:xx:xx.x
>     xc_domain_deassign_device"
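
The reset step in the unplug sequence quoted above (use the per-device
'reset' node if it exists, otherwise fall back to pciback's do_flr) could
look roughly like this. It is a sketch under assumptions: reset_device and
sysfs_root are illustrative names, the do_flr node is taken from the quoted
description and is not in the upstream kernel, and the pci-flr pre/post
hooks are not shown.

```python
import os


def reset_device(bdf, sysfs_root="/sys"):
    """Reset a PCI device the way the quoted unplug steps describe.

    Writes "1" to the device's own reset node when it exists, otherwise
    writes the device address to pciback's do_flr node. Returns the path
    that was written, so the caller can log which mechanism was used.
    """
    reset_node = os.path.join(sysfs_root, "bus/pci/devices", bdf, "reset")
    if os.path.exists(reset_node):
        target, value = reset_node, "1"
    else:
        # Assumed to be provided by out-of-tree pciback patches.
        target = os.path.join(sysfs_root, "bus/pci/drivers/pciback/do_flr")
        value = bdf
    with open(target, "w") as node:
        node.write(value)
    return target
```

The sysfs_root argument is only there so the logic can be tried against a
scratch directory instead of the real /sys.
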
> 
> 
> I would recommend that libxl performs similar operations (except for the
> QEMU hotplug part and the flr script parts).
> 
> We should probably split pciback into parts:
> 
> 1. A part to capture device at boot and prevent other drivers loading
> (bind/unbind)
> 
> 2. A part to manage reset on devices which don't have device-specific
> resets.

Not following you. I think you mean if the standard PCI 'reset' is not
exposed because the device has some quirks or it truly does not expose
FLR, D3/D0 switch, or any other ways to reset it?

> 
> 3. A part to manage the pciback PV device for communicating to pcifront.
> 
> 
> I don't think we can have pciback work out if it's a HVM guest and I
> think we should instead use the toolstack to prep the device correctly
> before passthrough. The toolstack is donating its PCI device to the new

I do not know what the right way is. The kernel has a lot of knowledge
about these devices and can do proper locking and deal with quirks. QEMU
also has some idea of what to do and how to reset a device.

> domain after all :)
> 
> 
> >>>
> >>> I would like to understand which ones do it so I can fix it in
> >>> pciback - as it might also be a problem with PV.
> >>>
> >>> Unless... are you by any chance using extra patches on top of the
> >>> native pciback?
> >>
> >> We do have extra patches but they only allow us to do an SBR on PCI
> >> devices which require it. The failure listed above occurs on devices
> >> with device-specific resets (e.g. FLR, D3) as well so those extra patches
> >> aren't being used.
> >>
> >>>
> >>>>
> >>>> Malcolm
> >>>>
> >>>>>
> >>>>> Regards
> >>>>
> >>
> 

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
http://lists.xen.org/xen-devel