Re: PCI passthrough (pci-attach) to HVM guests bug (BAR64 addresses are bogus)

From: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
To: Malcolm Crossley <malcolm.crossley@citrix.com>
Cc: xen-devel@lists.xen.org
Subject: Re: PCI passthrough (pci-attach) to HVM guests bug (BAR64 addresses are bogus)
Date: Wed, 12 Nov 2014 10:14:28 -0500
Message-ID: <20141112151428.GA6017@laptop.dumpdata.com>
In-Reply-To: <54632FF8.7020508@citrix.com>

On Wed, Nov 12, 2014 at 10:01:28AM +0000, Malcolm Crossley wrote:
> On 12/11/14 09:24, Jan Beulich wrote:
> >>>> On 12.11.14 at 02:37, <konrad.wilk@oracle.com> wrote:
> >> When we PCI insert a device, the BARs are not set at all - and hence
> >> the Linux kernel is the one that tries to set the BARs. The
> >> reason it cannot fit the device in the MMIO region is due to the
> >> _CRS only having certain ranges (even though the MMIO region can
> >> cover 2GB). See:
> >>
> >> Without any devices (and me doing PCI insertion after that):
> >> # dmesg | grep "bus resource"
> >> [    0.366000] pci_bus 0000:00: root bus resource [bus 00-ff]
> >> [    0.366000] pci_bus 0000:00: root bus resource [io  0x0000-0x0cf7]
> >> [    0.366000] pci_bus 0000:00: root bus resource [io  0x0d00-0xffff]
> >> [    0.366000] pci_bus 0000:00: root bus resource [mem 0x000a0000-0x000bffff]
> >> [    0.366000] pci_bus 0000:00: root bus resource [mem 0xf0000000-0xfbffffff]
> >>
> >> With the device (my GPU card) inserted so that hvmloader can enumerate it:
> >> # dmesg | grep 'resource'
> >> [    0.455006] pci_bus 0000:00: root bus resource [bus 00-ff]
> >> [    0.459006] pci_bus 0000:00: root bus resource [io  0x0000-0x0cf7]
> >> [    0.462006] pci_bus 0000:00: root bus resource [io  0x0d00-0xffff]
> >> [    0.466006] pci_bus 0000:00: root bus resource [mem 0x000a0000-0x000bffff]
> >> [    0.469006] pci_bus 0000:00: root bus resource [mem 0xe0000000-0xfbffffff]
> >>
> >> I chatted with Bjorn and Rafael on IRC about how PCI insertion works
> >> on baremetal and it sounds like Thunderbolt device insertion is an
> >> interesting case. The SMM sets the BAR regions to fit within the MMIO
> >> (which is advertised by the _CRS) and it then pokes the OS to enumerate
> >> the BARs. The OS is free to use what the firmware has set or renumber
> >> it. The end result is that since the SMM 'fits' the BAR inside the
> >> pre-set _CRS window it all works. We do not do that.
> > 
> > Who does the BAR assignment is pretty much orthogonal to the
> > problem at hand: If the region reserved for MMIO is too small,
> > no-one will be able to fit a device in there. Plus, what is being
> > reported as root bus resource doesn't have to have a
> > connection to the ranges usable for MMIO at all, at least if I
> > assume that the (Dell) system I'm right now looking at isn't
> > completely screwed:
> > 
> > pci_bus 0000:00: root bus resource [bus 00-ff]
> > pci_bus 0000:00: root bus resource [io  0x0000-0xffff]
> > pci_bus 0000:00: root bus resource [mem 0x00000000-0x3fffffffff]
> > 
> > (i.e. it simply reports the full usable 38 bits wide address space)
> > 
> > Looking at another (Intel) one, there is no mention of regions
> > above the 4G boundary at all:
> > 
> > pci_bus 0000:00: root bus resource [bus 00-3d]
> > pci_bus 0000:00: root bus resource [io  0x0000-0x0cf7]
> > pci_bus 0000:00: root bus resource [io  0x0d00-0xffff]
> > pci_bus 0000:00: root bus resource [mem 0x000a0000-0x000bffff]
> > pci_bus 0000:00: root bus resource [mem 0x000c4000-0x000cbfff]
> > pci_bus 0000:00: root bus resource [mem 0xfed40000-0xfedfffff]
> > pci_bus 0000:00: root bus resource [mem 0xd0000000-0xf7ffffff]
> > 
> > Not sure how the OS would know it is safe to assign BARs above
> > 4GB here.
> > 
> > In any event, what you need is an equivalent of the frequently
> > seen BIOS option controlling the size of the space to be reserved
> > for MMIO (often allowing it to be 1, 2, or 3 GB). I.e. an alternative
> > (or extension) to the dynamic lowering of pci_mem_start in
> > hvmloader.
> > 
> 
> I agree with Jan. By using xl pci-attach you are effectively hotplugging
> a PCI device (in the bare metal case). The only way this will work
> reliably is if you reserve some MMIO space for the device you are about
> to attach. You cannot just use space above the 4G boundary because the
> PCI device may have 32 bit only BARs and thus its MMIO cannot be
> placed at addresses above 4G.

Is it safe to split the BARs to be in different locations? Say stash
all 64-bit BARs above 4GB and put all 32-bit under 4GB?

Looking at hvmloader, it appears to do that once it has exhausted
mmio_total.
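
To make the question concrete, here is a rough Python sketch of such a
split. It is not hvmloader's code: place_bars(), the BAR names and the
high_base value are hypothetical, and the low window is just the
0xf0000000-0xfbffffff range from the dmesg output quoted above. 32-bit
BARs must land below 4GB; 64-bit BARs go below 4GB while there is room
and spill above otherwise.

```python
MB = 1 << 20
GB = 1 << 30

def align_up(addr, size):
    # BAR sizes are powers of two, so natural alignment is a mask operation.
    return (addr + size - 1) & ~(size - 1)

def place_bars(bars, low_base, low_end, high_base):
    """bars: list of (name, size, is_64bit); low_end is exclusive.
    Returns {name: address}. Hypothetical helper, for illustration only."""
    placed = {}
    low, high = low_base, high_base
    # 32-bit BARs first (they have no alternative), largest first in each class.
    for name, size, is_64bit in sorted(bars, key=lambda b: (b[2], -b[1])):
        addr = align_up(low, size)
        if addr + size <= low_end:
            placed[name] = addr
            low = addr + size
        elif is_64bit:
            # Low window exhausted for this BAR: spill above 4GB.
            addr = align_up(high, size)
            placed[name] = addr
            high = addr + size
        else:
            raise ValueError("32-bit BAR %s does not fit below 4GB" % name)
    return placed

# Assumed example device: one 32-bit BAR and two 64-bit BARs.
bars = [("bar0", 16 * MB, False), ("bar1", 256 * MB, True), ("bar3", 32 * MB, True)]
placed = place_bars(bars, 0xf0000000, 0xfc000000, 8 * GB)
for name in sorted(placed):
    print(name, hex(placed[name]))
```

In this example only the 256MB 64-bit BAR ends up above 4GB; the two
smaller BARs still fit in the low window.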

> 
> The problem you have is that you cannot predict how much MMIO space to
> reserve because you don't know in advance how many PCI devices you are
> going to hotplug and how much MMIO space is required per device.

Perhaps, following Jan's advice, allow "bigger" MMIO ranges to be
predefined: 4GB, 8GB, 16GB, etc. The larger ranges would cover
space under 4GB (say, at most 3GB) while the rest spills above 4GB,
past the 'maxmem' range?

> 
> As for the CRS regions: These typically describe the BIOS set limits in
> hardware configuration for the MMIO hole itself. On single socket
> systems anything which isn't RAM or another predefined region decodes to
> MMIO. This is probably why Jan's Dell system has a CRS region which
> covers the entire address space.
> 
> On multi socket systems the CRS is very important because the chipset is
> configured to only decode certain regions to the PCI express ports. If
> you use an address outside of those regions then accessing that address
> will go "nowhere" and the machine will crash.
> 
> Typically you will see a separate high MMIO CRS region if 64bit BAR
> support is enabled in BIOS.
> 
> 
> To do HVM pci hotplug properly we need to reserve MMIO space below 4G
> and emulate a PCI hotplug capable PCI-PCI bridge device. The bridge
> device will know the maximum size of the MMIO behind it (as allocated at
> boot time) and so we can calculate if the device we are hotplugging can
> fit. If it doesn't fit then we fail the hotplug, otherwise we allow it
> and the OS will correctly allocate the BAR behind the bridge.

I think that can be done right now for the MMIO and _CRS in hvmloader
and libxc/libxl. I wonder if that can all be done without introducing
a PCI-PCI bridge device?

> 
> BTW, calculating the required MMIO for multi BAR PCI devices is not
> easy because all the BARs need to be aligned to their size (naturally
> aligned).

Ouch. So two 512MB and a 1GB can't be next to each other but would need:
512MB BAR |<-- 512MB space -->| 1GB BAR

Or just put the 1GB first:

1GB BAR | 512MB BAR

?
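
A quick way to check the arithmetic is a small Python sketch that lays
BARs out in a given order, aligning each one to its own size. layout()
is a hypothetical helper, and the window base is assumed to be 0 (i.e.
aligned to the largest BAR); none of this is existing Xen code.

```python
MB = 1 << 20
GB = 1 << 30

def layout(sizes, base=0):
    """Place BARs in the given order, each naturally aligned.
    Returns ([(start, size), ...], total span including padding)."""
    addr = base
    placed = []
    for size in sizes:
        # Sizes are powers of two, so round up with a mask.
        start = (addr + size - 1) & ~(size - 1)
        placed.append((start, size))
        addr = start + size
    return placed, addr - base

# 512MB first: the 1GB BAR must skip 512MB to reach its alignment.
small_first = layout([512 * MB, GB])[1]
# 1GB first: no padding is needed.
big_first = layout([GB, 512 * MB])[1]
# Two 512MB BARs and a 1GB BAR, in a bad order and sorted largest first.
interleaved = layout([512 * MB, GB, 512 * MB])[1]
descending = layout(sorted([512 * MB, GB, 512 * MB], reverse=True))[1]

print(small_first // MB, big_first // MB, interleaved // MB, descending // MB)
```

Since BAR sizes are powers of two, placing them largest first from a
base aligned to the largest BAR should never need padding, so that
ordering gives the smallest window under those assumptions.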

> 
> Malcolm
> 
> 
> > Jan
> > 
> > 
> > _______________________________________________
> > Xen-devel mailing list
> > Xen-devel@lists.xen.org
> > http://lists.xen.org/xen-devel
> > 
