Re: PCI passthrough (pci-attach) to HVM guests bug (BAR64 addresses are bogus)

From: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
To: David Vrabel <david.vrabel@citrix.com>, zhenzhong.duan@oracle.com
Cc: xen-devel@lists.xenproject.org, jbeulich@suse.com
Subject: Re: PCI passthrough (pci-attach) to HVM guests bug (BAR64 addresses are bogus)
Date: Tue, 11 Nov 2014 20:37:57 -0500
Message-ID: <20141112013757.GC2593@laptop.dumpdata.com>
In-Reply-To: <20141110213248.GA23182@laptop.dumpdata.com>

On Mon, Nov 10, 2014 at 04:32:48PM -0500, Konrad Rzeszutek Wilk wrote:
> On Mon, Nov 10, 2014 at 01:07:20PM -0500, Konrad Rzeszutek Wilk wrote:
> > On Mon, Nov 10, 2014 at 05:42:32PM +0000, David Vrabel wrote:
> > > On 10/11/14 17:32, Konrad Rzeszutek Wilk wrote:
> > > > Hey,
> > > > 
> > > > With Xen 4.5 (today's staging), when I boot a guest and then do pci-attach
> > > > the BARs values are corrupt.
> 
> I can reproduce this with Xen 4.4, Xen 4.3 and Xen 4.1.
> 
> After a bit of digging I realized that:
> 
> (XEN) memory_map:add: dom1 gfn=f4000 mfn=d8000 nr=4000 [64M]
> (XEN) AMD-Vi: update_paging_mode Try to access pdev_list without aquiring pcidevs_lock.
> (XEN) memory_map:add: dom1 gfn=f8000 mfn=fc000 nr=2000 [32M]
> (XEN) ioport_map:add: dom1 gport=1000 mport=c000 nr=80
> (XEN) AMD-Vi: Disable: device id = 0x500, domain = 0, paging mode = 3
> (XEN) AMD-Vi: Setup I/O page table: device id = 0x500, type = 0x1, root table = 0x228b02000, domain = 1, paging mode = 3
> 
> The sizes are my own annotation. This means QEMU is putting the
> devices in the MMIO region - and doing it successfully. But then:
> 
> > > 
> > > 
> > > > [  152.572965] pci 0000:00:04.0: BAR 1: no space for [mem size 0x08000000 64bit  pref]
> > [  152.518320] pci 0000:00:04.0: reg 0x14: [mem 0x00000000-0x07ffffff 64bit pref]
> 
> .. The guest computes the right size for them, but reads the wrong BAR value
> that was set by QEMU and also created in the hypervisor.
> 
> Perhaps this is the Linux kernel being on the fritz. Will try another kernel.

I figured this out.


When we pass in the device at bootup, the hvmloader does:

(d4) pci dev 05:0 bar 14 size 008000000: 0e000000c
(d4) pci dev 05:0 bar 1c size 004000000: 0e800000c
(d4) pci dev 05:0 bar 10 size 002000000: 0ec000000
(d4) pci dev 05:0 bar 24 size 000000080: 00000c201

That is - it finds the size, and then it sets the BARs to fit within
the MMIO region. QEMU is not involved in this.
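A minimal sketch of that placement, assuming largest-first allocation
with natural (size) alignment. The sizes, the 0xe0000000 base and the low
flag bits (0xc = 64-bit prefetchable memory BAR) are read off the log
above; the function name assign_bars is hypothetical and this is not the
actual hvmloader code. The I/O BAR (0x24) is left out because its base
depends on what other devices already claimed.

```python
def assign_bars(bars, mmio_base):
    """Place memory BARs largest-first, each aligned to its own size.

    bars: list of (config_offset, size, flag_bits) tuples.
    Returns {config_offset: value_written_to_the_BAR}.
    """
    assigned = {}
    cursor = mmio_base
    for reg, size, flags in sorted(bars, key=lambda b: b[1], reverse=True):
        cursor = (cursor + size - 1) & ~(size - 1)  # align up to BAR size
        assigned[reg] = cursor | flags              # low bits carry the type
        cursor += size
    return assigned


# Memory BARs of dev 05:0 as reported in the log above.
gpu_bars = [
    (0x10, 0x02000000, 0x0),  # 32M, 32-bit non-prefetchable
    (0x14, 0x08000000, 0xC),  # 128M, 64-bit prefetchable
    (0x1C, 0x04000000, 0xC),  # 64M, 64-bit prefetchable
]
result = assign_bars(gpu_bars, 0xE0000000)
for reg in sorted(result):
    print("bar %02x: %09x" % (reg, result[reg]))
```

Under those assumptions the three values match the (d4) lines above.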

When we hot-plug a device (PCI insert), the BARs are not set at all, and
hence the Linux kernel is the one that tries to assign them. The
reason it cannot fit the device in the MMIO region is that the
_CRS only advertises certain ranges (even though the MMIO region can
cover 2GB). See:

Without any devices (and me doing PCI insertion after that):
# dmesg | grep "bus resource"
[    0.366000] pci_bus 0000:00: root bus resource [bus 00-ff]
[    0.366000] pci_bus 0000:00: root bus resource [io  0x0000-0x0cf7]
[    0.366000] pci_bus 0000:00: root bus resource [io  0x0d00-0xffff]
[    0.366000] pci_bus 0000:00: root bus resource [mem 0x000a0000-0x000bffff]
[    0.366000] pci_bus 0000:00: root bus resource [mem 0xf0000000-0xfbffffff]

With the device (my GPU card) inserted so that hvmloader can enumerate it:
# dmesg | grep 'resource'
[    0.455006] pci_bus 0000:00: root bus resource [bus 00-ff]
[    0.459006] pci_bus 0000:00: root bus resource [io  0x0000-0x0cf7]
[    0.462006] pci_bus 0000:00: root bus resource [io  0x0d00-0xffff]
[    0.466006] pci_bus 0000:00: root bus resource [mem 0x000a0000-0x000bffff]
[    0.469006] pci_bus 0000:00: root bus resource [mem 0xe0000000-0xfbffffff]
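To put numbers on the two windows, here is a rough back-of-the-envelope
check. It only compares the total size of the GPU's three memory BARs
(taken from the hvmloader log earlier) against the size of the high MMIO
window in each dmesg output. It ignores alignment and whatever other
devices already occupy in the window, so it is a lower bound on the
space needed, not what the kernel actually computes.

```python
MIB = 1 << 20


def window_size(start, end):
    """Size in bytes of an inclusive [start, end] address range."""
    return end - start + 1


# 128M + 64M + 32M memory BARs of the GPU.
gpu_bar_sizes = [0x08000000, 0x04000000, 0x02000000]
demand = sum(gpu_bar_sizes)

hotplug_window = window_size(0xF0000000, 0xFBFFFFFF)  # no device at boot
boot_window = window_size(0xE0000000, 0xFBFFFFFF)     # device seen by hvmloader

print("BAR demand:     %d MiB" % (demand // MIB))
print("hotplug window: %d MiB" % (hotplug_window // MIB))
print("boot window:    %d MiB" % (boot_window // MIB))
print("fits at hotplug:", demand <= hotplug_window)
print("fits at boot:   ", demand <= boot_window)
```

The BARs need 224 MiB in total, while the window advertised in the
hot-plug case is only 192 MiB; the window seen when hvmloader enumerates
the device is 448 MiB.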

I chatted with Bjorn and Rafael on IRC about how PCI insertion works
on bare metal, and it sounds like Thunderbolt device insertion is an
interesting case. The SMM code sets the BARs to fit within the MMIO
window (which is advertised by the _CRS) and then pokes the OS to
enumerate the BARs. The OS is free to use what the firmware has set or
to reassign it. The end result is that since the SMM 'fits' the BARs
inside the pre-set _CRS window, it all works. We do not do that.

The two ways I could think of making this work are:
 - QEMU tracks BAR enumeration. When a new device is inserted it would
   set the BAR to fit within the E820 "HOLE" region. If it can't
   (because the MMIO is too small) it puts it at the end of the memory.
   Naturally the 'end of the memory' part would require adding
   _CRS to cover end of GPFN to never never land. And also the _CRS
   region for the MMIO under 4GB would have to be expanded so QEMU
   can jam things in there.

 - Or add in dsdt.asl another _CRS region controlled by the hvmloader.
   This one would start at the end of GPFN + delta of maxmem - mem and
   continue to never never land. The hvmloader would just write the
   values in the BIOS OperationRegion (0xFC000000) and let the
   AML code take care of parsing it and constructing the #9 _CRS region.
   This will allow kernels that are picky about BARs not being in a _CRS
   region to deal with cards that are hot-plugged past BIOS boot.
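
The placement policy in the first option could look roughly like the
sketch below: try the hole under 4GB first, and fall back to a high
region past the end of guest memory. Everything here is hypothetical
(the names place_bar and align_up, the already-used range, and the 5GB
top-of-memory value are made up for illustration); it is not QEMU code
and says nothing about how the matching _CRS entries get generated.

```python
def align_up(value, alignment):
    """Round value up to a multiple of alignment (a power of two)."""
    return (value + alignment - 1) & ~(alignment - 1)


def place_bar(size, hole_start, hole_end, used, high_cursor):
    """Pick an address for a BAR of the given size.

    used: non-overlapping inclusive (start, end) ranges already taken
    in the hole. Returns (address, new_high_cursor).
    """
    addr = align_up(hole_start, size)
    for start, end in sorted(used):
        if addr + size - 1 < start:
            break                           # fits in the gap before this range
        if addr <= end:
            addr = align_up(end + 1, size)  # skip past the occupied range
    if addr + size - 1 <= hole_end:
        return addr, high_cursor
    # Hole under 4GB is too small: place the BAR past the end of memory.
    high = align_up(high_cursor, size)
    return high, high + size


# Hypothetical state: 32M already used at the start of the hole,
# guest memory assumed to end at 5GB.
hole = (0xF0000000, 0xFBFFFFFF)
used = [(0xF0000000, 0xF1FFFFFF)]
top_of_memory = 0x140000000

big = place_bar(0x08000000, hole[0], hole[1], used, top_of_memory)
small = place_bar(0x02000000, hole[0], hole[1], used, top_of_memory)
print("128M BAR at %#x" % big[0])
print("32M BAR at  %#x" % small[0])
```

With these made-up inputs the 128M BAR does not fit under 4GB and lands
at the high cursor, while the 32M BAR still fits in the hole.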
