PCI passthrough (pci-attach) to HVM guests bug (BAR64 addresses are bogus)

* PCI passthrough (pci-attach) to HVM guests bug (BAR64 addresses are bogus)
@ 2014-11-10 17:32 Konrad Rzeszutek Wilk
  2014-11-10 17:42 ` David Vrabel
  0 siblings, 1 reply; 11+ messages in thread
From: Konrad Rzeszutek Wilk @ 2014-11-10 17:32 UTC
  To: xen-devel, jbeulich

Hey,

With Xen 4.5 (today's staging), when I boot a guest and then do pci-attach
the BAR values are corrupt.

For example, with this guest config:

kernel="hvmloader"
builder="hvm"
serial="pty"
memory = 2048
name = "XTT"
usb=1
usbdevice='tablet'
vcpus=2
vga="stdvga"
vif = [ 'mac=00:0f:4b:00:00:63,bridge=xenbr0' ]
disk= ['file:/root/root_image.iso,hdc:cdrom,r']
vnc=1
vnclisten="0.0.0.0"
boot = "dc"

And with this PCI card:
5:00.0 VGA compatible controller: NVIDIA Corporation GF104GLM [Quadro 4000M] (rev a1) (prog-if 00 [VGA controller])
        Subsystem: Gigabyte Technology Co., Ltd Device 34fc
        Flags: fast devsel, IRQ 20
        Memory at fc000000 (32-bit, non-prefetchable) [size=32M]
        Memory at d0000000 (64-bit, prefetchable) [size=128M]
        Memory at d8000000 (64-bit, prefetchable) [size=64M]

which in dom0 is reported as:

4064] pci 0000:05:00.0: [10de:0e3b] type 00 class 0x030000
[   22.834087] pci 0000:05:00.0: reg 0x10: [mem 0xfc000000-0xfdffffff]
[   22.834109] pci 0000:05:00.0: reg 0x14: [mem 0xd0000000-0xd7ffffff 64bit pref]
[   22.834131] pci 0000:05:00.0: reg 0x1c: [mem 0xd8000000-0xdbffffff 64bit pref]
[   22.834144] pci 0000:05:00.0: reg 0x24: [io  0xc000-0xc07f]
[   22.834157] pci 0000:05:00.0: reg 0x30: [mem 0xfe000000-0xfe07ffff pref]

When I assign said card (xl pci-attach XTT 05:00.0) the guest gives me:

     ... 498525] pci 0000:00:04.0: [10de:0e3b] type 00 class 0x030000
[  152.508612] pci 0000:00:04.0: reg 0x10: [mem 0x00000000-0x01ffffff]
[  152.518320] pci 0000:00:04.0: reg 0x14: [mem 0x00000000-0x07ffffff 64bit pref]
[  152.529301] pci 0000:00:04.0: reg 0x1c: [mem 0x00000000-0x03ffffff 64bit pref]
[  152.540095] pci 0000:00:04.0: reg 0x24: [io  0x0000-0x007f]
[  152.548497] pci 0000:00:04.0: reg 0x30: [mem 0x00000000-0x0007ffff pref]
[  152.561018] vgaarb: device added: PCI:0000:00:04.0,decodes=io+mem,owns=none,locks=none
[  152.572965] pci 0000:00:04.0: BAR 1: no space for [mem size 0x08000000 64bit pref]
[  152.583917] pci 0000:00:04.0: BAR 1: failed to assign [mem size 0x08000000 64bit pref]
[  152.595528] pci 0000:00:04.0: BAR 3: assigned [mem 0xf4000000-0xf7ffffff 64bit pref]

If I boot the guest with:
pci=["05:00.0"]
it works fine:

# dmesg | grep 05.0
[    0.000000] pcpu-alloc: [0] 000 001 002 003 004 005 006 007 008 009 010 011 012 013 014 015 
[    0.905008] pci 0000:00:03.0: reg 0x10: [mem 0xef000000-0xefffffff pref]
[    0.944101] pci 0000:00:05.0: [10de:0e3b] type 00 class 0x030000
[    0.953013] pci 0000:00:05.0: reg 0x10: [mem 0xec000000-0xedffffff]
[    0.967016] pci 0000:00:05.0: reg 0x14: [mem 0xe0000000-0xe7ffffff 64bit pref]
[    0.973016] pci 0000:00:05.0: reg 0x1c: [mem 0xe8000000-0xebffffff 64bit pref]
[    0.981015] pci 0000:00:05.0: reg 0x24: [io  0xc200-0xc27f]
[    0.988016] pci 0000:00:05.0: reg 0x30: [mem 0xf0000000-0xf007ffff pref]
[    0.995083] vgaarb: device added: PCI:0000:00:05.0,decodes=io+mem,owns=io+mem,locks=none
[    0.997000] vgaarb: bridge control possible 0000:00:05.0
[    3.952023] nouveau  [  DEVICE][0000:00:05.0] BOOT0  : 0x0c4d80a1
[    3.952025] nouveau  [  DEVICE][0000:00:05.0] Chipset: GF104 (NVC4)
[    3.952027] nouveau  [  DEVICE][0000:00:05.0] Family : NVC0
[    3.952072] nouveau  [   VBIOS][0000:00:05.0] checking PRAMIN for image...
[    3.952079] nouveau  [   VBIOS][0000:00:05.0] ... signature not found
[    3.952080] nouveau  [   VBIOS][0000:00:05.0] checking PROM for image...
[    4.096446] nouveau  [   VBIOS][0000:00:05.0] ... appears to be valid
[    4.104012] nouveau  [   VBIOS][0000:00:05.0] using image from PROM
[    4.111214] nouveau  [   VBIOS][0000:00:05.0] BIT signature found
[    4.118054] nouveau  [   VBIOS][0000:00:05.0] version 70.04.13.00.01
[    4.125248] nouveau  [ DEVINIT][0000:00:05.0] adaptor not initialised
[    4.131629] nouveau  [   VBIOS][0000:00:05.0] running init tables
[    4.227827] nouveau  [     PMC][0000:00:05.0] MSI interrupts enabled
[    4.234296] nouveau  [     PFB][0000:00:05.0] RAM type: GDDR5
[    4.240176] nouveau  [     PFB][0000:00:05.0] RAM size: 1024 MiB
[    4.245878] nouveau  [     PFB][0000:00:05.0]    ZCOMP: 0 tags
[    4.255523] nouveau  [    VOLT][0000:00:05.0] GPU voltage: 875000uv
[    4.291146] nouveau  [  PTHERM][0000:00:05.0] FAN control: PWM
[    4.296582] nouveau  [  PTHERM][0000:00:05.0] fan management: automatic
[    4.302668] nouveau  [  PTHERM][0000:00:05.0] internal sensor: yes
[    4.309465] nouveau  [     CLK][0000:00:05.0] 03: core 50 MHz memory 135 MHz 
[    4.317854] nouveau  [     CLK][0000:00:05.0] 07: core 405 MHz memory 324 MHz 
[    4.326655] nouveau  [     CLK][0000:00:05.0] 0c: core 405 MHz memory 1800 MHz 
[    4.333742] nouveau  [     CLK][0000:00:05.0] 0f: core 715 MHz memory 1800 MHz 
[    4.341362] nouveau  [     CLK][0000:00:05.0] --: core 50 MHz memory 135 MHz 
[    4.749977] nouveau 0000:00:05.0: fb0: nouveaufb frame buffer device
[    4.756172] nouveau 0000:00:05.0: registered panic notifier
[    4.765041] [drm] Initialized nouveau 1.2.0 20120801 for 0000:00:05.0 on minor 0

# lspci -s 00:05.0 -v
00:05.0 VGA compatible controller: nVidia Corporation Device 0e3b (rev a1) (prog-if 00 [VGA controller])
        Subsystem: Giga-byte Technology Device 34fc
        Physical Slot: 5
        Flags: bus master, fast devsel, latency 0, IRQ 67
        Memory at ec000000 (32-bit, non-prefetchable) [size=32M]
        Memory at e0000000 (64-bit, prefetchable) [size=128M]
        Memory at e8000000 (64-bit, prefetchable) [size=64M]
        I/O ports at c200 [size=128]
        Expansion ROM at f0000000 [disabled] [size=512K]

Interesting observations:

a) It does NOT make a difference whether I use qemu-traditional or qemu-xen.
   In both cases I get the same bogus BAR numbers.

b) qemu-xen logging is not that great. When I used qemu-trad I discovered:

m-command: hot insert pass-through pci dev
register_real_device: Assigning real physical device 05:00.0 ...
register_real_device: Disable MSI translation via per device option
register_real_device: Disable power management
pt_iomul_init: Error: pt_iomul_init can't open file /dev/xen/pci_iomul: No such file or directory: 0x5:0x0.0x0
pt_register_regions: IO region registered (size=0x02000000 base_addr=0xfc000000)
pt_register_regions: IO region registered (size=0x08000000 base_addr=0xd000000c)
pt_register_regions: IO region registered (size=0x04000000 base_addr=0xd800000c)
pt_register_regions: IO region registered (size=0x00000080 base_addr=0x0000c001)
pt_register_regions: Expansion ROM registered (size=0x00080000 base_addr=0xfe000000)
pci_intx: intx=1
register_real_device: Real physical device 05:00.0 registered successfuly!
IRQ type = INTx
generate a sci for PHP.
deassert due to disable GPE bit.
ACPI:debug: write addr=0xb044, val=0x20.
ACPI:debug: write addr=0xb045, val=0x0.
ACPI:debug: write addr=0xb044, val=0x20.
ACPI:debug: write addr=0xb045, val=0x89.
ACPI:debug: write addr=0xb044, val=0x21.
ACPI:debug: write addr=0xb045, val=0x89.
ACPI:debug: write addr=0xb044, val=0x22.
ACPI:debug: write addr=0xb045, val=0x89.
ACPI:debug: write addr=0xb044, val=0x23.
ACPI:debug: write addr=0xb045, val=0x89.
ACPI:debug: write addr=0xb044, val=0x24.
ACPI:debug: write addr=0xb045, val=0x89.
ACPI:debug: write addr=0xb044, val=0x25.
ACPI:debug: write addr=0xb045, val=0x89.
ACPI:debug: write addr=0xb044, val=0x26.
ACPI:debug: write addr=0xb045, val=0x89.
ACPI:debug: write addr=0xb044, val=0x27.
ACPI:debug: write addr=0xb045, val=0x89.
pt_iomem_map: e_phys=f2000000 maddr=fc000000 type=0 len=33554432 index=0 first_map=1
pt_iomem_map: e_phys=f4000000 maddr=d8000000 type=8 len=67108864 index=3 first_map=1
pt_ioport_map: e_phys=1000 pio_base=c000 len=128 index=5 first_map=1
pt_msgctrl_reg_write: setup msi for dev 20
pt_msi_setup: pt_msi_setup requested pirq = 87
pt_msi_setup: msi mapped with pirq 57
pt_msi_update: Update msi with pirq 57 gvec 0 gflags 3057
pci_intx: intx=1
pt_msi_disable: Unbind msi with pirq 57, gvec 0
pt_msi_disable: Unmap msi with pirq 57
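
For reference, the trailing 'c' and '1' in the base_addr values logged by
pt_register_regions are type flags, not address bits. A minimal sketch of
the decoding, assuming the standard PCI BAR register layout (the function
name decode_bar is made up for this illustration and is not part of QEMU
or Xen):

```python
def decode_bar(raw):
    """Split a raw 32-bit BAR register value into base address and type flags."""
    if raw & 0x1:
        # Bit 0 set: I/O space BAR. Bits 1:0 are flags, the rest is the port base.
        return {"space": "io", "base": raw & ~0x3}
    mem_type = (raw >> 1) & 0x3  # 0b00 = 32-bit BAR, 0b10 = 64-bit BAR
    return {
        "space": "mem",
        "base": raw & ~0xF,              # low 4 bits are flags
        "is_64bit": mem_type == 0x2,
        "prefetchable": bool(raw & 0x8),
    }


# The four base_addr values from the qemu-trad log above.
for raw in (0xfc000000, 0xd000000c, 0xd800000c, 0x0000c001):
    info = decode_bar(raw)
    print(hex(raw), info["space"], hex(info["base"]), info)
```

Under that layout, 0xd000000c is a 64-bit prefetchable memory BAR at
0xd0000000 and 0x0000c001 is an I/O BAR at port 0xc000, which agrees with
the lspci output from dom0.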

Any ideas which commit ID I ought to look at?

* Re: PCI passthrough (pci-attach) to HVM guests bug (BAR64 addresses are bogus)
  2014-11-10 17:32 PCI passthrough (pci-attach) to HVM guests bug (BAR64 addresses are bogus) Konrad Rzeszutek Wilk
@ 2014-11-10 17:42 ` David Vrabel
  2014-11-10 18:07   ` Konrad Rzeszutek Wilk
  0 siblings, 1 reply; 11+ messages in thread
From: David Vrabel @ 2014-11-10 17:42 UTC
  To: Konrad Rzeszutek Wilk, xen-devel, jbeulich

On 10/11/14 17:32, Konrad Rzeszutek Wilk wrote:
> Hey,
> 
> With Xen 4.5 (today's staging), when I boot a guest and then do pci-attach
> the BAR values are corrupt.

Corrupt?

> [  152.572965] pci 0000:00:04.0: BAR 1: no space for [mem size 0x08000000 64bit  pref]

Looks like the default MMIO hole isn't large enough for this device.

David

* Re: PCI passthrough (pci-attach) to HVM guests bug (BAR64 addresses are bogus)
  2014-11-10 17:42 ` David Vrabel
@ 2014-11-10 18:07   ` Konrad Rzeszutek Wilk
  2014-11-10 21:32     ` Konrad Rzeszutek Wilk
  0 siblings, 1 reply; 11+ messages in thread
From: Konrad Rzeszutek Wilk @ 2014-11-10 18:07 UTC
  To: David Vrabel; +Cc: xen-devel, jbeulich

On Mon, Nov 10, 2014 at 05:42:32PM +0000, David Vrabel wrote:
> On 10/11/14 17:32, Konrad Rzeszutek Wilk wrote:
> > Hey,
> > 
> > With Xen 4.5 (today's staging), when I boot a guest and then do pci-attach
> > the BAR values are corrupt.
> 
> Corrupt?
> 
> > [  152.572965] pci 0000:00:04.0: BAR 1: no space for [mem size 0x08000000 64bit  pref]
> 
> Looks like the default MMIO hole isn't large enough for this device.

The BARs are 32MB, 64MB and 128MB, and the MMIO hole is 2GB.

It looks like the BAR value is corrupted as:

(dom0):
[   22.834109] pci 0000:05:00.0: reg 0x14: [mem 0xd0000000-0xd7ffffff 64bit pref]
                                                  ^ - here we have '0xd'
guest:
[  152.518320] pci 0000:00:04.0: reg 0x14: [mem 0x00000000-0x07ffffff 64bit pref]


See how the '0xd' is gone in the guest?



> 
> David

* Re: PCI passthrough (pci-attach) to HVM guests bug (BAR64 addresses are bogus)
  2014-11-10 18:07   ` Konrad Rzeszutek Wilk
@ 2014-11-10 21:32     ` Konrad Rzeszutek Wilk
  2014-11-12  1:37       ` Konrad Rzeszutek Wilk
  0 siblings, 1 reply; 11+ messages in thread
From: Konrad Rzeszutek Wilk @ 2014-11-10 21:32 UTC
  To: David Vrabel; +Cc: xen-devel, jbeulich

On Mon, Nov 10, 2014 at 01:07:20PM -0500, Konrad Rzeszutek Wilk wrote:
> On Mon, Nov 10, 2014 at 05:42:32PM +0000, David Vrabel wrote:
> > On 10/11/14 17:32, Konrad Rzeszutek Wilk wrote:
> > > Hey,
> > > 
> > > With Xen 4.5 (today's staging), when I boot a guest and then do pci-attach
> > > the BAR values are corrupt.

I can reproduce this with Xen 4.4, Xen 4.3 and Xen 4.1.

After a bit of digging I found that:

(XEN) memory_map:add: dom1 gfn=f4000 mfn=d8000 nr=4000 [64M]
(XEN) AMD-Vi: update_paging_mode Try to access pdev_list without aquiring pcidevs_lock.
(XEN) memory_map:add: dom1 gfn=f8000 mfn=fc000 nr=2000 [32M]
(XEN) ioport_map:add: dom1 gport=1000 mport=c000 nr=80
(XEN) AMD-Vi: Disable: device id = 0x500, domain = 0, paging mode = 3
(XEN) AMD-Vi: Setup I/O page table: device id = 0x500, type = 0x1, root table = 0x228b02000, domain = 1, paging mode = 3
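
The bracketed sizes in the memory_map lines follow from nr counting
4 KiB frames. A small sketch of that arithmetic, assuming 4 KiB pages
(the helper names are hypothetical):

```python
PAGE_SIZE = 4096  # assumption: 4 KiB frames, as on x86


def mapping_bytes(nr_hex):
    """Size in bytes of a mapping of nr (hex string) frames."""
    return int(nr_hex, 16) * PAGE_SIZE


def frame_to_addr(frame_hex):
    """Physical address of a frame number given as a hex string."""
    return int(frame_hex, 16) * PAGE_SIZE


# Values taken from the two memory_map:add lines above.
for gfn, mfn, nr in (("f4000", "d8000", "4000"), ("f8000", "fc000", "2000")):
    print(
        "guest", hex(frame_to_addr(gfn)),
        "machine", hex(frame_to_addr(mfn)),
        "size", mapping_bytes(nr) // (1 << 20), "MiB",
    )
```

This gives 64 MiB for machine address 0xd8000000 and 32 MiB for
0xfc000000. Note that the excerpt contains no memory_map line for the
128 MiB BAR at 0xd0000000.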

The sizes are my own annotations. This means QEMU is putting the
devices in the MMIO region, and doing it successfully. But then:

> > 
> > 
> > > [  152.572965] pci 0000:00:04.0: BAR 1: no space for [mem size 0x08000000 64bit  pref]
> [  152.518320] pci 0000:00:04.0: reg 0x14: [mem 0x00000000-0x07ffffff 64bit pref]

So the guest computes the right size for each BAR, but it does not read
back the base address that QEMU set and that the hypervisor mapped.

Perhaps this is the Linux kernel being on the fritz. Will try another kernel.
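
As background, an all-zero base combined with a correct size is what the
usual BAR probe reports for a register that has never been programmed.
A minimal sketch of that probe against a toy 32-bit memory BAR (FakeBar
and probe are invented for this illustration; real code does the same
through PCI config space accesses, and a 64-bit BAR repeats it across
two dwords):

```python
class FakeBar:
    """Toy 32-bit memory BAR: address bits below the size are hard-wired to 0."""

    def __init__(self, size, flags=0x0):
        self.size = size    # power of two, at least 16
        self.flags = flags  # read-only low 4 bits
        self.base = 0       # never programmed, e.g. after a hot-plug

    def write(self, value):
        # Only address bits at or above the BAR size are writable.
        self.base = value & ~(self.size - 1) & 0xFFFFFFFF

    def read(self):
        return self.base | self.flags


def probe(bar):
    """Return (base, size) the way an OS sizes a BAR."""
    original = bar.read()
    bar.write(0xFFFFFFFF)       # write all ones
    readback = bar.read()       # hard-wired bits read back as zero
    bar.write(original)         # restore the previous value
    size = (~(readback & ~0xF) + 1) & 0xFFFFFFFF
    return original & ~0xF, size


base, size = probe(FakeBar(0x02000000))  # the 32M BAR at reg 0x10
print(f"[mem {base:#010x}-{base + size - 1:#010x}]")
# prints "[mem 0x00000000-0x01ffffff]"
```

The printed range has the same shape as the guest's reg 0x10 line: the
size is correct and the base is zero.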

* Re: PCI passthrough (pci-attach) to HVM guests bug (BAR64 addresses are bogus)
  2014-11-10 21:32     ` Konrad Rzeszutek Wilk
@ 2014-11-12  1:37       ` Konrad Rzeszutek Wilk
  2014-11-12  9:24         ` Jan Beulich
  0 siblings, 1 reply; 11+ messages in thread
From: Konrad Rzeszutek Wilk @ 2014-11-12  1:37 UTC
  To: David Vrabel, zhenzhong.duan; +Cc: xen-devel, jbeulich

On Mon, Nov 10, 2014 at 04:32:48PM -0500, Konrad Rzeszutek Wilk wrote:
> On Mon, Nov 10, 2014 at 01:07:20PM -0500, Konrad Rzeszutek Wilk wrote:
> > On Mon, Nov 10, 2014 at 05:42:32PM +0000, David Vrabel wrote:
> > > On 10/11/14 17:32, Konrad Rzeszutek Wilk wrote:
> > > > Hey,
> > > > 
> > > > With Xen 4.5 (today's staging), when I boot a guest and then do pci-attach
> > > > the BAR values are corrupt.
> 
> I can reproduce this with Xen 4.4, Xen 4.3 and Xen 4.1.
> 
> After a bit of digging I found that:
> 
> (XEN) memory_map:add: dom1 gfn=f4000 mfn=d8000 nr=4000 [64M]
> (XEN) AMD-Vi: update_paging_mode Try to access pdev_list without aquiring pcidevs_lock.
> (XEN) memory_map:add: dom1 gfn=f8000 mfn=fc000 nr=2000 [32M]
> (XEN) ioport_map:add: dom1 gport=1000 mport=c000 nr=80
> (XEN) AMD-Vi: Disable: device id = 0x500, domain = 0, paging mode = 3
> (XEN) AMD-Vi: Setup I/O page table: device id = 0x500, type = 0x1, root table = 0x228b02000, domain = 1, paging mode = 3
> 
> The sizes are my own annotations. This means QEMU is putting the
> devices in the MMIO region, and doing it successfully. But then:
> 
> > > 
> > > 
> > > > [  152.572965] pci 0000:00:04.0: BAR 1: no space for [mem size 0x08000000 64bit  pref]
> > [  152.518320] pci 0000:00:04.0: reg 0x14: [mem 0x00000000-0x07ffffff 64bit pref]
> 
> So the guest computes the right size for each BAR, but it does not read
> back the base address that QEMU set and that the hypervisor mapped.
> 
> Perhaps this is the Linux kernel being on the fritz. Will try another kernel.

I figured this out.


When we pass in the device at bootup, the hvmloader does:

(d4) pci dev 05:0 bar 14 size 008000000: 0e000000c
(d4) pci dev 05:0 bar 1c size 004000000: 0e800000c
(d4) pci dev 05:0 bar 10 size 002000000: 0ec000000
(d4) pci dev 05:0 bar 24 size 000000080: 00000c201

That is - it finds the size, and then it sets the BARs to fit within
the MMIO region. QEMU is not involved in this.
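
The placement in the hvmloader output above can be reproduced with a
simple largest-first allocator. This is only a sketch: the largest-first
order is an assumption inferred from the order of the log lines, the
start address 0xe0000000 is read off the first assignment, and
assign_bars is a hypothetical name, not an hvmloader function. I/O BARs
and emulated devices are left out.

```python
def assign_bars(bars, mem_start):
    """Place memory BARs largest-first, each naturally aligned.

    bars: list of (config_register_offset, size_in_bytes)
    Returns {config_register_offset: base_address}.
    """
    cursor = mem_start
    assigned = {}
    for reg, size in sorted(bars, key=lambda b: b[1], reverse=True):
        cursor = (cursor + size - 1) & ~(size - 1)  # round up to a multiple of size
        assigned[reg] = cursor
        cursor += size
    return assigned


# The three memory BARs of the GPU: reg 0x10 (32M), 0x14 (128M), 0x1c (64M).
bars = [(0x10, 0x02000000), (0x14, 0x08000000), (0x1c, 0x04000000)]
for reg, base in assign_bars(bars, 0xe0000000).items():
    print(f"bar {reg:02x}: {base:#010x}")
# bar 14: 0xe0000000
# bar 1c: 0xe8000000
# bar 10: 0xec000000
```

Because the sizes are powers of two and are placed in descending order
from an aligned start, no padding is needed between them.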

When we hot-insert a PCI device, the BARs are not set at all, and hence
the Linux kernel is the one that tries to assign them. The
reason it cannot fit the device in the MMIO region is that the
_CRS only advertises certain ranges (even though the MMIO region can
cover 2GB). See:

Without any devices (and me doing PCI insertion after that):
# dmesg | grep "bus resource"
[    0.366000] pci_bus 0000:00: root bus resource [bus 00-ff]
[    0.366000] pci_bus 0000:00: root bus resource [io  0x0000-0x0cf7]
[    0.366000] pci_bus 0000:00: root bus resource [io  0x0d00-0xffff]
[    0.366000] pci_bus 0000:00: root bus resource [mem 0x000a0000-0x000bffff]
[    0.366000] pci_bus 0000:00: root bus resource [mem 0xf0000000-0xfbffffff]

With the device (my GPU card) inserted so that hvmloader can enumerate it:
# dmesg | grep 'resource'
[    0.455006] pci_bus 0000:00: root bus resource [bus 00-ff]
[    0.459006] pci_bus 0000:00: root bus resource [io  0x0000-0x0cf7]
[    0.462006] pci_bus 0000:00: root bus resource [io  0x0d00-0xffff]
[    0.466006] pci_bus 0000:00: root bus resource [mem 0x000a0000-0x000bffff]
[    0.469006] pci_bus 0000:00: root bus resource [mem 0xe0000000-0xfbffffff]
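
To illustrate why the 128M BAR cannot be placed in the smaller window,
here is a sketch of a first-fit search for a naturally aligned slot.
The window bounds come from the dmesg output above; the occupied range
is purely an assumption standing in for the emulated devices already in
the window (the thread does not list them), and find_slot is a
hypothetical helper, not kernel code.

```python
def find_slot(size, window, occupied):
    """First naturally aligned free range of `size` bytes inside `window`.

    window and occupied entries are (first_byte, last_byte), inclusive.
    Returns the base address, or None if nothing fits.
    """
    win_start, win_end = window
    base = (win_start + size - 1) & ~(size - 1)  # align up to size
    while base + size - 1 <= win_end:
        if all(base + size <= lo or base > hi for lo, hi in occupied):
            return base
        base += size
    return None


small_window = (0xf0000000, 0xfbffffff)  # _CRS without the device at boot
large_window = (0xe0000000, 0xfbffffff)  # _CRS with the device at boot
occupied = [(0xf0000000, 0xf1ffffff)]    # assumption: emulated devices

for size in (0x08000000, 0x04000000, 0x02000000):
    slot = find_slot(size, small_window, occupied)
    print(hex(size), "->", hex(slot) if slot is not None else "no space")
# 0x8000000 -> no space
# 0x4000000 -> 0xf4000000
# 0x2000000 -> 0xf2000000

print(hex(find_slot(0x08000000, large_window, [])))  # 0xe0000000
```

With that assumed layout the only 128M-aligned candidates in the small
window are 0xf0000000 (already in use) and 0xf8000000 (which would run
past 0xfbffffff), so the search fails, while the 64M result coincides
with the "BAR 3: assigned [mem 0xf4000000-0xf7ffffff" line in the guest
log.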

I chatted with Bjorn and Rafael on IRC about how PCI insertion works
on bare metal and it sounds like Thunderbolt device insertion is an
interesting case. The SMM sets the BAR regions to fit within the MMIO
(which is advertised by the _CRS) and it then pokes the OS to enumerate
the BARs. The OS is free to use what the firmware has set or renumber
it. The end result is that since the SMM 'fits' the BAR inside the
pre-set _CRS window it all works. We do not do that.

The two ways I could think of making this work are:
 - QEMU tracks BAR enumeration. When a new device is inserted it would
   set the BAR to fit within the E820 "HOLE" region. If it can't
   (because the MMIO is too small) it puts it at the end of the memory.
   Naturally the 'end of the memory' part would require adding
   _CRS to cover end of GPFN to never never land. And also the _CRS
   region for the MMIO under 4GB would have to be expanded so QEMU
   can jam things in there.

 - Or add in dsdt.asl another _CRS region controlled by the hvmloader.
   This one would start at the end of GPFN + delta of maxmem - mem and
   continue to never never land. The hvmloader would just write the
   values in the BIOS OperationRegion (0xFC000000) and let the
   AML code take care of parsing them and constructing the #9 _CRS region.
   This will allow kernels that are picky about BARs not being in a _CRS
   region to deal with cards that are hot-plugged after BIOS boot.

* Re: PCI passthrough (pci-attach) to HVM guests bug (BAR64 addresses are bogus)
  2014-11-12  1:37       ` Konrad Rzeszutek Wilk
@ 2014-11-12  9:24         ` Jan Beulich
  2014-11-12 10:01           ` Malcolm Crossley
  0 siblings, 1 reply; 11+ messages in thread
From: Jan Beulich @ 2014-11-12  9:24 UTC
  To: Konrad Rzeszutek Wilk; +Cc: xen-devel, David Vrabel, zhenzhong.duan

>>> On 12.11.14 at 02:37, <konrad.wilk@oracle.com> wrote:
> When we hot-insert a PCI device, the BARs are not set at all, and hence
> the Linux kernel is the one that tries to assign them. The
> reason it cannot fit the device in the MMIO region is that the
> _CRS only advertises certain ranges (even though the MMIO region can
> cover 2GB). See:
> 
> Without any devices (and me doing PCI insertion after that):
> # dmesg | grep "bus resource"
> [    0.366000] pci_bus 0000:00: root bus resource [bus 00-ff]
> [    0.366000] pci_bus 0000:00: root bus resource [io  0x0000-0x0cf7]
> [    0.366000] pci_bus 0000:00: root bus resource [io  0x0d00-0xffff]
> [    0.366000] pci_bus 0000:00: root bus resource [mem 0x000a0000-0x000bffff]
> [    0.366000] pci_bus 0000:00: root bus resource [mem 0xf0000000-0xfbffffff]
> 
> With the device (my GPU card) inserted so that hvmloader can enumerate it:
>  dmesg | grep 'resource'     
> [    0.455006] pci_bus 0000:00: root bus resource [bus 00-ff]
> [    0.459006] pci_bus 0000:00: root bus resource [io  0x0000-0x0cf7]
> [    0.462006] pci_bus 0000:00: root bus resource [io  0x0d00-0xffff]
> [    0.466006] pci_bus 0000:00: root bus resource [mem 0x000a0000-0x000bffff]
> [    0.469006] pci_bus 0000:00: root bus resource [mem 0xe0000000-0xfbffffff]
> 
> I chatted with Bjorn and Rafael on IRC about how PCI insertion works
> on baremetal and it sounds like Thunderbolt device insertion is an
> interesting case. The SMM sets the BAR regions to fit within the MMIO
> (which is advertised by the _CRS) and it then pokes the OS to enumerate
> the BARs. The OS is free to use what the firmware has set or renumber
> it. The end result is that since the SMM 'fits' the BAR inside the
> pre-set _CRS window it all works. We do not do that.

Who does the BAR assignment is pretty much orthogonal to the
problem at hand: If the region reserved for MMIO is too small,
no-one will be able to fit a device in there. Plus, what is being
reported as root bus resource doesn't have to have a
connection to the ranges usable for MMIO at all, at least if I
assume that the (Dell) system I'm right now looking at isn't
completely screwed:

pci_bus 0000:00: root bus resource [bus 00-ff]
pci_bus 0000:00: root bus resource [io  0x0000-0xffff]
pci_bus 0000:00: root bus resource [mem 0x00000000-0x3fffffffff]

(i.e. it simply reports the full usable 38 bits wide address space)

Looking at another (Intel) one, there is no mention of regions
above the 4G boundary at all:

pci_bus 0000:00: root bus resource [bus 00-3d]
pci_bus 0000:00: root bus resource [io  0x0000-0x0cf7]
pci_bus 0000:00: root bus resource [io  0x0d00-0xffff]
pci_bus 0000:00: root bus resource [mem 0x000a0000-0x000bffff]
pci_bus 0000:00: root bus resource [mem 0x000c4000-0x000cbfff]
pci_bus 0000:00: root bus resource [mem 0xfed40000-0xfedfffff]
pci_bus 0000:00: root bus resource [mem 0xd0000000-0xf7ffffff]

Not sure how the OS would know it is safe to assign BARs above
4GB here.

In any event, what you need is an equivalent of the frequently
seen BIOS option controlling the size of the space to be reserved
for MMIO (often allowing it to be 1, 2, or 3 GB), i.e. an alternative
(or extension) to the dynamic lowering of pci_mem_start in
hvmloader.
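
A rough model of what "dynamic lowering of pci_mem_start" means is
sketched below. It is not the hvmloader source: the doubling strategy
and the function name are assumptions made for illustration, the upper
bound 0xfc000000 is inferred from the 0xfbffffff end of the root bus
window in the logs quoted above, and the default start 0xf0000000 is
taken from the window seen when no device is passed at boot.

```python
PCI_MEM_END = 0xfc000000  # assumption: end of the low MMIO hole (exclusive)


def lower_pci_mem_start(mmio_total, pci_mem_start=0xf0000000):
    """Grow the hole downward until mmio_total bytes fit, or we run out."""
    while mmio_total > PCI_MEM_END - pci_mem_start:
        # Shifting left in 32 bits drops the top bit: f0 -> e0 -> c0 -> 80.
        lowered = (pci_mem_start << 1) & 0xFFFFFFFF
        if lowered == 0:
            break  # cannot lower any further
        pci_mem_start = lowered
    return pci_mem_start


gpu_bars = 0x08000000 + 0x04000000 + 0x02000000  # 128M + 64M + 32M
print(hex(lower_pci_mem_start(gpu_bars)))    # 0xe0000000
print(hex(lower_pci_mem_start(0x01000000)))  # 0xf0000000 (already fits)
```

Under these assumptions the GPU's 224M of memory BARs does not fit in
the default 192M hole, so the start drops to 0xe0000000, which matches
the window reported when the device is present at boot. A hot-plugged
device never goes through this step, hence the need for a way to size
the hole up front.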

Jan

* Re: PCI passthrough (pci-attach) to HVM guests bug (BAR64 addresses are bogus)
  2014-11-12  9:24         ` Jan Beulich
@ 2014-11-12 10:01           ` Malcolm Crossley
  2014-11-12 10:11             ` Jan Beulich
  2014-11-12 15:14             ` Konrad Rzeszutek Wilk
  0 siblings, 2 replies; 11+ messages in thread
From: Malcolm Crossley @ 2014-11-12 10:01 UTC
  To: xen-devel

On 12/11/14 09:24, Jan Beulich wrote:
>>>> On 12.11.14 at 02:37, <konrad.wilk@oracle.com> wrote:
>> When we hot-insert a PCI device, the BARs are not set at all, and hence
>> the Linux kernel is the one that tries to assign them. The
>> reason it cannot fit the device in the MMIO region is that the
>> _CRS only advertises certain ranges (even though the MMIO region can
>> cover 2GB). See:
>>
>> Without any devices (and me doing PCI insertion after that):
>> # dmesg | grep "bus resource"
>> [    0.366000] pci_bus 0000:00: root bus resource [bus 00-ff]
>> [    0.366000] pci_bus 0000:00: root bus resource [io  0x0000-0x0cf7]
>> [    0.366000] pci_bus 0000:00: root bus resource [io  0x0d00-0xffff]
>> [    0.366000] pci_bus 0000:00: root bus resource [mem 0x000a0000-0x000bffff]
>> [    0.366000] pci_bus 0000:00: root bus resource [mem 0xf0000000-0xfbffffff]
>>
>> With the device (my GPU card) inserted so that hvmloader can enumerate it:
>>  dmesg | grep 'resource'     
>> [    0.455006] pci_bus 0000:00: root bus resource [bus 00-ff]
>> [    0.459006] pci_bus 0000:00: root bus resource [io  0x0000-0x0cf7]
>> [    0.462006] pci_bus 0000:00: root bus resource [io  0x0d00-0xffff]
>> [    0.466006] pci_bus 0000:00: root bus resource [mem 0x000a0000-0x000bffff]
>> [    0.469006] pci_bus 0000:00: root bus resource [mem 0xe0000000-0xfbffffff]
>>
>> I chatted with Bjorn and Rafael on IRC about how PCI insertion works
>> on baremetal and it sounds like Thunderbolt device insertion is an
>> interesting case. The SMM sets the BAR regions to fit within the MMIO
>> (which is advertised by the _CRS) and it then pokes the OS to enumerate
>> the BARs. The OS is free to use what the firmware has set or renumber
>> it. The end result is that since the SMM 'fits' the BAR inside the
>> pre-set _CRS window it all works. We do not do that.
> 
> Who does the BAR assignment is pretty much orthogonal to the
> problem at hand: If the region reserved for MMIO is too small,
> no-one will be able to fit a device in there. Plus, what is being
> reported as root bus resource doesn't have to have a
> connection to the ranges usable for MMIO at all, at least if I
> assume that the (Dell) system I'm right now looking at isn't
> completely screwed:
> 
> pci_bus 0000:00: root bus resource [bus 00-ff]
> pci_bus 0000:00: root bus resource [io  0x0000-0xffff]
> pci_bus 0000:00: root bus resource [mem 0x00000000-0x3fffffffff]
> 
> (i.e. it simply reports the full usable 38 bits wide address space)
> 
> Looking at another (Intel) one, there is no mention of regions
> above the 4G boundary at all:
> 
> pci_bus 0000:00: root bus resource [bus 00-3d]
> pci_bus 0000:00: root bus resource [io  0x0000-0x0cf7]
> pci_bus 0000:00: root bus resource [io  0x0d00-0xffff]
> pci_bus 0000:00: root bus resource [mem 0x000a0000-0x000bffff]
> pci_bus 0000:00: root bus resource [mem 0x000c4000-0x000cbfff]
> pci_bus 0000:00: root bus resource [mem 0xfed40000-0xfedfffff]
> pci_bus 0000:00: root bus resource [mem 0xd0000000-0xf7ffffff]
> 
> Not sure how the OS would know it is safe to assign BARs above
> 4Gb here.
> 
> In any event, what you need is an equivalent of the frequently
> seen BIOS option controlling the size of the space to be reserved
> for MMIO (often allowing it to be 1, 2, or 3 Gb). I.e. an alternative
> (or extension) to the dynamic lowering of pci_mem_start in
> hvmloader.
> 

I agree with Jan. By using xl pci-attach you are effectively hotplugging
a PCI device (in the bare metal case). The only way this will work
reliably is if you reserve some MMIO space for the device you are about
to attach. You cannot just use space above the 4G boundary because the
PCI device may have 32-bit-only BARs and thus its MMIO cannot be
placed at addresses above 4G.

The problem you have is that you cannot predict how much MMIO space to
reserve because you don't know in advance how many PCI devices you are
going to hotplug and how much MMIO space is required per device.

As for the CRS regions: These typically describe the BIOS set limits in
hardware configuration for the MMIO hole itself. On single socket
systems anything which isn't RAM or another predefined region decodes to
MMIO. This is probably why Jan's Dell system has a CRS region which
covers the entire address space.

On multi-socket systems the CRS is very important because the chipset is
configured to only decode certain regions to the PCI Express ports. If
you use an address outside of those regions then accessing that address
will go "nowhere" and the machine will crash.

Typically you will see a separate high MMIO CRS region if 64bit BAR
support is enabled in BIOS.

To do HVM PCI hotplug properly we need to reserve MMIO space below 4G
and emulate a PCI hotplug capable PCI-PCI bridge device. The bridge
device will know the maximum size of the MMIO behind it (as allocated at
boot time) and so we can calculate if the device we are hotplugging can
fit. If it doesn't fit then we fail the hotplug; otherwise we allow it
and the OS will correctly allocate the BAR behind the bridge.

BTW, calculating the required MMIO for multi-BAR PCI devices is not
easy because all the BARs need to be aligned to their size (naturally
aligned).
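
To make the alignment point concrete, here is a small sketch that
computes how much address space a set of BARs consumes when each one is
placed at the next naturally aligned address. The sizes are the three
memory BARs of the GPU from earlier in the thread; packed_span and the
starting address are illustrative choices only.

```python
MB = 1 << 20


def packed_span(sizes, base):
    """Bytes consumed from `base` when placing BARs in the given order,
    each rounded up to a multiple of its own size."""
    cursor = base
    for size in sizes:
        cursor = (cursor + size - 1) & ~(size - 1)  # natural alignment
        cursor += size
    return cursor - base


sizes = [32 * MB, 128 * MB, 64 * MB]  # BAR order on the device

in_device_order = packed_span(sizes, 0xf0000000)
largest_first = packed_span(sorted(sizes, reverse=True), 0xf0000000)

print(in_device_order // MB, "MB in device order")  # 320 MB in device order
print(largest_first // MB, "MB largest first")      # 224 MB largest first
```

The BARs add up to 224 MB, but placing them in device order wastes
96 MB on padding before the 128 MB BAR. A fit check for a hotplug
bridge window therefore has to account for ordering and for the
alignment of the window itself, not just the sum of the sizes.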

Malcolm

> Jan
> 

* Re: PCI passthrough (pci-attach) to HVM guests bug (BAR64 addresses are bogus)
  2014-11-12 10:01           ` Malcolm Crossley
@ 2014-11-12 10:11             ` Jan Beulich
  2014-11-12 10:41               ` Malcolm Crossley
  2014-11-12 15:14             ` Konrad Rzeszutek Wilk
  1 sibling, 1 reply; 11+ messages in thread
From: Jan Beulich @ 2014-11-12 10:11 UTC
  To: Malcolm Crossley; +Cc: xen-devel

>>> On 12.11.14 at 11:01, <malcolm.crossley@citrix.com> wrote:
> As for the CRS regions: These typically describe the BIOS set limits in
> hardware configuration for the MMIO hole itself. On single socket
> systems anything which isn't RAM or another predefined region decodes to
> MMIO. This is probably why Jan's Dell system has a CRS region which
> covers the entire address space.
> 
> On multi socket systems the CRS is very important because the chipset is
> configured to only decode certain regions to the PCI express ports, if
> you use an address outside of those regions then accessing that address
> will go "nowhere" and the machine will crash.

Don't you mean multi-node instead of multi-socket here? Since what
matters is how the I/O subsystem is organized; the CPU topology is
pretty uninteresting for this.

Jan

* Re: PCI passthrough (pci-attach) to HVM guests bug (BAR64 addresses are bogus)
  2014-11-12 10:11             ` Jan Beulich
@ 2014-11-12 10:41               ` Malcolm Crossley
  0 siblings, 0 replies; 11+ messages in thread
From: Malcolm Crossley @ 2014-11-12 10:41 UTC
  To: Jan Beulich; +Cc: xen-devel

On 12/11/14 10:11, Jan Beulich wrote:
>>>> On 12.11.14 at 11:01, <malcolm.crossley@citrix.com> wrote:
>> As for the CRS regions: These typically describe the BIOS set limits in
>> hardware configuration for the MMIO hole itself. On single socket
>> systems anything which isn't RAM or another predefined region decodes to
>> MMIO. This is probably why Jan's Dell system has a CRS region which
>> covers the entire address space.
>>
>> On multi socket systems the CRS is very important because the chipset is
>> configured to only decode certain regions to the PCI express ports, if
>> you use an address outside of those regions then accessing that address
>> will go "nowhere" and the machine will crash.
> 
> Don't you mean multi-node instead of multi-socket here? Since what
> matters is how the I/O subsystem is organized; the CPU topology is
> pretty uninteresting for this.
> 

Yes, multi-IO-node would be the correct description (I used socket
because most people would understand that more clearly). I don't like
using the term "node" on its own because IO and memory nodes can be
quite different. The specific hardware dictating the address space
partitioning is the coherency fabric (QPI links/HT links).

Malcolm

> Jan
> 

* Re: PCI passthrough (pci-attach) to HVM guests bug (BAR64 addresses are bogus)
  2014-11-12 10:01           ` Malcolm Crossley
  2014-11-12 10:11             ` Jan Beulich
@ 2014-11-12 15:14             ` Konrad Rzeszutek Wilk
  2014-11-12 17:24               ` Jan Beulich
  1 sibling, 1 reply; 11+ messages in thread
From: Konrad Rzeszutek Wilk @ 2014-11-12 15:14 UTC
  To: Malcolm Crossley; +Cc: xen-devel

On Wed, Nov 12, 2014 at 10:01:28AM +0000, Malcolm Crossley wrote:
> On 12/11/14 09:24, Jan Beulich wrote:
> >>>> On 12.11.14 at 02:37, <konrad.wilk@oracle.com> wrote:
> >> When we hot-insert a PCI device, the BARs are not set at all, and hence
> >> the Linux kernel is the one that tries to assign them. The
> >> reason it cannot fit the device in the MMIO region is that the
> >> _CRS only advertises certain ranges (even though the MMIO region can
> >> cover 2GB). See:
> >>
> >> Without any devices (and me doing PCI insertion after that):
> >> # dmesg | grep "bus resource"
> >> [    0.366000] pci_bus 0000:00: root bus resource [bus 00-ff]
> >> [    0.366000] pci_bus 0000:00: root bus resource [io  0x0000-0x0cf7]
> >> [    0.366000] pci_bus 0000:00: root bus resource [io  0x0d00-0xffff]
> >> [    0.366000] pci_bus 0000:00: root bus resource [mem 0x000a0000-0x000bffff]
> >> [    0.366000] pci_bus 0000:00: root bus resource [mem 0xf0000000-0xfbffffff]
> >>
> >> With the device (my GPU card) inserted so that hvmloader can enumerate it:
> >>  dmesg | grep 'resource'     
> >> [    0.455006] pci_bus 0000:00: root bus resource [bus 00-ff]
> >> [    0.459006] pci_bus 0000:00: root bus resource [io  0x0000-0x0cf7]
> >> [    0.462006] pci_bus 0000:00: root bus resource [io  0x0d00-0xffff]
> >> [    0.466006] pci_bus 0000:00: root bus resource [mem 0x000a0000-0x000bffff]
> >> [    0.469006] pci_bus 0000:00: root bus resource [mem 0xe0000000-0xfbffffff]
> >>
> >> I chatted with Bjorn and Rafael on IRC about how PCI insertion works
> >> on baremetal and it sounds like Thunderbolt device insertion is an
> >> interesting case. The SMM sets the BAR regions to fit within the MMIO
> >> (which is advertised by the _CRS) and it then pokes the OS to enumerate
> >> the BARs. The OS is free to use what the firmware has set or renumber
> >> it. The end result is that since the SMM 'fits' the BAR inside the
> >> pre-set _CRS window it all works. We do not do that.
> > 
> > Who does the BAR assignment is pretty much orthogonal to the
> > problem at hand: If the region reserved for MMIO is too small,
> > no-one will be able to fit a device in there. Plus, what is being
> > reported as root bus resource doesn't have to have a
> > connection to the ranges usable for MMIO at all, at least if I
> > assume that the (Dell) system I'm right now looking at isn't
> > completely screwed:
> > 
> > pci_bus 0000:00: root bus resource [bus 00-ff]
> > pci_bus 0000:00: root bus resource [io  0x0000-0xffff]
> > pci_bus 0000:00: root bus resource [mem 0x00000000-0x3fffffffff]
> > 
> > (i.e. it simply reports the full usable 38 bits wide address space)
> > 
> > Looking at another (Intel) one, there is no mention of regions
> > above the 4G boundary at all:
> > 
> > pci_bus 0000:00: root bus resource [bus 00-3d]
> > pci_bus 0000:00: root bus resource [io  0x0000-0x0cf7]
> > pci_bus 0000:00: root bus resource [io  0x0d00-0xffff]
> > pci_bus 0000:00: root bus resource [mem 0x000a0000-0x000bffff]
> > pci_bus 0000:00: root bus resource [mem 0x000c4000-0x000cbfff]
> > pci_bus 0000:00: root bus resource [mem 0xfed40000-0xfedfffff]
> > pci_bus 0000:00: root bus resource [mem 0xd0000000-0xf7ffffff]
> > 
> > Not sure how the OS would know it is safe to assign BARs above
> > 4GB here.
> > 
> > In any event, what you need is an equivalent of the frequently
> > seen BIOS option controlling the size of the space to be reserved
> > for MMIO (often allowing it to be 1, 2, or 3 GB). I.e. an alternative
> > (or extension) to the dynamic lowering of pci_mem_start in
> > hvmloader.
> > 
> 
> I agree with Jan. By using xl pci-attach you are effectively hotplugging
> a PCI device (in the bare metal case). The only way this will work
> reliably is if you reserve some MMIO space for the device you are about
> to attach. You cannot just use space above the 4G boundary because the
> PCI device may have 32-bit-only BARs and thus its MMIO cannot be
> placed at addresses above 4G.

Is it safe to split the BARs to be in different locations? Say stash
all 64-bit BARs above 4GB and put all 32-bit under 4GB?

Looking at the hvmloader it looks to be doing that if it has exhausted
the mmio_total.

> 
> The problem you have is that you cannot predict how much MMIO space to
> reserve because you don't know in advance how many PCI devices you are
> going to hotplug and how much MMIO space is required per device.

Perhaps, following Jan's advice, allow "bigger" MMIO ranges to be
predefined: 4GB, 8GB, 16GB, etc. The larger ranges would cover
space under 4GB (so say max 3GB) while the rest is spilled above 4GB,
past the 'maxmem' range?

> 
> As for the CRS regions: These typically describe the BIOS set limits in
> hardware configuration for the MMIO hole itself. On single socket
> systems anything which isn't RAM or another predefined region decodes to
> MMIO. This is probably why Jan's Dell system has a CRS region which
> covers the entire address space.
> 
> On multi socket systems the CRS is very important because the chipset is
> configured to only decode certain regions to the PCI express ports, if
> you use an address outside of those regions then accessing that address
> will go "nowhere" and the machine will crash.
> 
> Typically you will see a separate high MMIO CRS region if 64bit BAR
> support is enabled in BIOS.
> 
> 
> To do HVM pci hotplug properly we need to reserve MMIO space below 4G
> and emulate a PCI hotplug capable PCI-PCI bridge device. The bridge
> device will know the maximum size of the MMIO behind it (as allocated at
> boot time) and so we can calculate if the device we are hotplugging can
> fit. If it doesn't fit then we fail the hotplug; otherwise we allow it
> and the OS will correctly allocate the BAR behind the bridge.

I think that can be done right now for the MMIO and _CRS in hvmloader
and libxc/libxl. I wonder if that can all be done without having a
PCI-PCI bridge device introduced?

> 
> BTW, calculating the required MMIO for multi-BAR PCI devices is not
> easy because all the BARs need to be aligned to their size (naturally
> aligned).

Ouch. So two 512MB and a 1GB can't be next to each other but would need:
512MB BAR<-- 512MB space--->| 1GB BAR.

Or just put the 1GB first:

1GB BAR | 512MB

?

> 
> Malcolm
> 
> 
> > Jan
> > 
> > 
> > _______________________________________________
> > Xen-devel mailing list
> > Xen-devel@lists.xen.org
> > http://lists.xen.org/xen-devel
> > 


* Re: PCI passthrough (pci-attach) to HVM guests bug (BAR64 addresses are bogus)
  2014-11-12 15:14             ` Konrad Rzeszutek Wilk
@ 2014-11-12 17:24               ` Jan Beulich
  0 siblings, 0 replies; 11+ messages in thread
From: Jan Beulich @ 2014-11-12 17:24 UTC (permalink / raw)
  To: Konrad Rzeszutek Wilk; +Cc: Malcolm Crossley, xen-devel

>>> On 12.11.14 at 16:14, <konrad.wilk@oracle.com> wrote:
> On Wed, Nov 12, 2014 at 10:01:28AM +0000, Malcolm Crossley wrote:
>> I agree with Jan. By using xl pci-attach you are effectively hotplugging
>> a PCI device (in the bare metal case). The only way this will work
>> reliably is if you reserve some MMIO space for the device you are about
>> to attach. You cannot just use space above the 4G boundary because the
>> PCI device may have 32-bit-only BARs and thus its MMIO cannot be
>> placed at addresses above 4G.
> 
> Is it safe to split the BARs to be in different locations? Say stash
> all 64-bit BARs above 4GB and put all 32-bit under 4GB?

Sure.
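
For illustration, a minimal sketch of such a split, using the BAR sizes of the GPU from the original report. The tuple layout and the helper name `split_bars` are made up for this example; the I/O BAR and the expansion ROM are ignored, and the totals do not include any alignment padding.

```python
MB = 1 << 20

# BARs of the GPU from the original report: (config register, size, is_64bit).
# The I/O BAR and the expansion ROM are left out of this sketch.
gpu_bars = [
    (0x10, 32 * MB, False),
    (0x14, 128 * MB, True),
    (0x1C, 64 * MB, True),
]


def split_bars(bars):
    """Return (below_4g, above_4g): 32-bit BARs must stay low, 64-bit BARs may go high."""
    below_4g = [bar for bar in bars if not bar[2]]
    above_4g = [bar for bar in bars if bar[2]]
    return below_4g, above_4g


below_4g, above_4g = split_bars(gpu_bars)
low_total = sum(size for _, size, _ in below_4g)
high_total = sum(size for _, size, _ in above_4g)
print(f"below 4GB: {low_total // MB}MB, above 4GB: {high_total // MB}MB")
```

With this split only the 32MB non-prefetchable BAR competes for space in the low MMIO hole.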

>> The problem you have is that you cannot predict how much MMIO space to
>> reserve because you don't know in advance how many PCI devices you are
>> going to hotplug and how much MMIO space is required per device.
> 
> Perhaps, following Jan's advice, allow "bigger" MMIO ranges to be
> predefined: 4GB, 8GB, 16GB, etc. The larger ranges would cover
> space under 4GB (so say max 3GB) while the rest is spilled above 4GB,
> past the 'maxmem' range?

Not sure how you'd put a 4G or even 8G hole below the 4G
boundary... These BIOS settings only ever relate to space up to
4G (at least as far as I had seen them).
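
To make the arithmetic concrete, a small sketch, assuming the hole simply ends at the 4G boundary and ignoring any fixed regions near the top of the 32-bit space. The function name `hole_start` is hypothetical and is not taken from hvmloader.

```python
GB = 1 << 30
FOUR_GB = 4 * GB


def hole_start(hole_size):
    """Start address of an MMIO hole of hole_size bytes that ends at the 4GB boundary."""
    if not 0 < hole_size < FOUR_GB:
        raise ValueError("hole must be non-empty and smaller than 4GB")
    return FOUR_GB - hole_size


for size in (1 * GB, 2 * GB, 3 * GB):
    print(f"{size // GB}GB hole starts at {hole_start(size):#010x}")
```

A 4G or 8G request is rejected here because there is no address below the boundary left to start from.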

>> To do HVM pci hotplug properly we need to reserve MMIO space below 4G
>> and emulate a PCI hotplug capable PCI-PCI bridge device. The bridge
>> device will know the maximum size of the MMIO behind it (as allocated at
>> boot time) and so we can calculate if the device we are hotplugging can
>> fit. If it doesn't fit then we fail the hotplug; otherwise we allow it
>> and the OS will correctly allocate the BAR behind the bridge.
> 
> I think that can be done right now for the MMIO and _CRS in hvmloader
> and libxc/libxl. I wonder if that can all be done without having a
> PCI-PCI bridge device introduced?

I think it could, even if that possibly wouldn't be 100% spec conforming.
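
Whichever component ends up owning the reserved window, the "does it fit" test could look roughly like the sketch below. It is a greedy largest-first check with natural alignment, not code from hvmloader or libxl; `fits_in_window` is a hypothetical name, and the window is the [mem 0xf0000000-0xfbffffff] range the guest reported earlier in the thread.

```python
MB = 1 << 20


def align_up(addr, size):
    """Round addr up to a multiple of size (size must be a power of two)."""
    return (addr + size - 1) & ~(size - 1)


def fits_in_window(window_start, window_size, bar_sizes):
    """Greedy check: place BARs largest first, each naturally aligned."""
    cursor = window_start
    for size in sorted(bar_sizes, reverse=True):
        cursor = align_up(cursor, size) + size
    return cursor <= window_start + window_size


# Window the guest reported without the GPU present: 0xf0000000-0xfbffffff.
window_start = 0xF0000000
window_size = 0xFBFFFFFF - 0xF0000000 + 1  # 192MB

all_memory_bars = [32 * MB, 128 * MB, 64 * MB]  # GPU memory BARs from the report
only_32bit_bar = [32 * MB]

print(fits_in_window(window_start, window_size, all_memory_bars))
print(fits_in_window(window_start, window_size, only_32bit_bar))
```

The three memory BARs add up to 224MB, so they cannot fit in the 192MB window; the 32-bit BAR alone does.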

>> BTW, calculating the required MMIO for multi-BAR PCI devices is not
>> easy because all the BARs need to be aligned to their size (naturally
>> aligned).
> 
> Ouch. So two 512MB and a 1GB can't be next to each other but would need:
> 512MB BAR<-- 512MB space--->| 1GB BAR.
> 
> Or just put the 1GB first:
> 
> 1GB BAR | 512MB
> 
> ?

The latter is what one should prefer.
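
A quick sketch of why, comparing both orders for one 512MB and one 1GB BAR. The 2GB base address is an arbitrary, 1GB-aligned assumption, and `layout` is a hypothetical helper rather than anything in hvmloader.

```python
MB = 1 << 20
GB = 1 << 30


def align_up(addr, size):
    """Round addr up to a multiple of size (size must be a power of two)."""
    return (addr + size - 1) & ~(size - 1)


def layout(base, bar_sizes):
    """Place BARs in the given order, each naturally aligned.

    Returns (placements, end) where placements is a list of (start, size).
    """
    placements = []
    cursor = base
    for size in bar_sizes:
        start = align_up(cursor, size)
        placements.append((start, size))
        cursor = start + size
    return placements, cursor


base = 2 * GB  # arbitrary 1GB-aligned start, chosen only for the example

small_first, small_first_end = layout(base, [512 * MB, 1 * GB])
large_first, large_first_end = layout(base, [1 * GB, 512 * MB])

print(f"512MB first: span {(small_first_end - base) // MB}MB")
print(f"1GB first:   span {(large_first_end - base) // MB}MB")
```

Placing the smaller BAR first leaves a 512MB gap before the 1GB BAR, so the pair spans 2GB instead of 1.5GB.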

Jan


Thread overview: 11+ messages
2014-11-10 17:32 PCI passthrough (pci-attach) to HVM guests bug (BAR64 addresses are bogus) Konrad Rzeszutek Wilk
2014-11-10 17:42 ` David Vrabel
2014-11-10 18:07   ` Konrad Rzeszutek Wilk
2014-11-10 21:32     ` Konrad Rzeszutek Wilk
2014-11-12  1:37       ` Konrad Rzeszutek Wilk
2014-11-12  9:24         ` Jan Beulich
2014-11-12 10:01           ` Malcolm Crossley
2014-11-12 10:11             ` Jan Beulich
2014-11-12 10:41               ` Malcolm Crossley
2014-11-12 15:14             ` Konrad Rzeszutek Wilk
2014-11-12 17:24               ` Jan Beulich
