From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: from gate.crashing.org ([63.228.1.57]:32779 "EHLO
 gate.crashing.org" rhost-flags-OK-OK-OK-OK) by vger.kernel.org
 with ESMTP id S1753644AbaDPGaz (ORCPT );
 Wed, 16 Apr 2014 02:30:55 -0400
Message-ID: <1397629816.32730.8.camel@pasglop>
Subject: Re: [PATCH v7] PCI: Try best to allocate pref mmio 64bit above 4g
From: Benjamin Herrenschmidt
To: Bjorn Helgaas
Cc: Guo Chao, Yinghai Lu, Wei Yang, Gavin Shan, Jack Morgenstein,
 Amir Vadai, Or Gerlitz, Eugenia Emantayev, talal@mellanox.com,
 "linux-pci@vger.kernel.org"
Date: Wed, 16 Apr 2014 16:30:16 +1000
In-Reply-To:
References: <1394222924-28886-1-git-send-email-yinghai@kernel.org>
 <20140408025738.GA3198@yanx>
 <20140409075215.GA3173@yanx>
 <20140415115450.GA7792@yanx>
Content-Type: text/plain; charset="UTF-8"
Mime-Version: 1.0
Sender: linux-pci-owner@vger.kernel.org
List-ID:

On Tue, 2014-04-15 at 18:09 -0600, Bjorn Helgaas wrote:
>
> Thanks for the example. Please open a bug report at
> http://bugzilla.kernel.org and attach the complete dmesg logs before
> and after Yinghai's patch.
>
> Having the complete logs helps me answer questions myself without
> having to bother you, and it also helps me figure out whether we can
> improve our logging to make it easier to diagnose problems like this.

Unfortunately, for a *little while* longer (hint !) we can't publish a
complete log from a Power8 machine, but we should be able to include
everything remotely related to PCI.

> > | pci 0003:05:00.0: reg 0x10: [mem 0x3d05801000000-0x3d058010fffff 64bit]
> > | pci 0003:05:00.0: reg 0x18: [mem 0x3d05010000000-0x3d05017ffffff 64bit pref]
> > | pci 0003:05:00.0: reg 0x30: [mem 0x00000000-0x000fffff pref]
> > | pci 0003:05:00.0: reg 0x134: [mem 0x3d05018000000-0x3d0501fffffff 64bit pref]
> >
> > This is printed at enumeration phase. This device has a SRIOV BAR with
> > size of 0x7ffffff (128M). That's the size of a single VF BAR.
> > The device supports 63 VFs so we need nearly 8G space in total.
> > Apparently we need to exploit 64-bit space.
>
> Yes. Do we print a hint anywhere about how many VFs there are? In
> other words, can you deduce the number "63" from the dmesg, or do you
> have to figure that out some other way? It'd be nice if that
> information were somewhere in dmesg.
>
> > | PCI host bridge to bus 0003:00
> > | pci_bus 0003:00: root bus resource [mem 0x3d05800000000-0x3d0587ffeffff] (bus address [0x80000000-0xfffeffff])
> > | pci_bus 0003:00: root bus resource [mem 0x3d05008000000-0x3d057ffffffff 64bit pref]
> >
> > And we do have a huge (32G) 64-bit prefetchable window supply. We expect
> > everything to work fine, but:
> >
> > | pci 0003:00:00.0: BAR 15: can't assign mem pref (size 0x206000000)
> > | pci 0003:00:00.0: BAR 14: assigned [mem 0x3d05800000000-0x3d05802ffffff]
> > | pci 0003:00:00.0: BAR 13: can't assign io (size 0x4000)
> >
> > It went wrong at the beginning. Note the error message never says
> > whether it is 64-bit or not, but BAR 15 here has its MEM_64 flag cleared.
>
> BAR 15 is a bridge window. I think its resource flags should reflect
> the capability of the *window*, even if we disable the window or we
> happen to assign addresses that are under 4GB. So I think it's wrong
> that we clear the MEM_64 flag in pbus_size_mem() and the IO flag in
> pbus_size_io().
>
> > It first
> > tried to find a 32-bit prefetchable window, but we only supply a 64-bit one.
> > So it fell back to the (32-bit) non-prefetchable window, but there is not
> > enough room there. At last it went into complicated steps (not shown here)
> > of allocating requested resources first, then trying its best for the
> > optional ones, etc.
> >
> > Why is BAR 15 (prefetchable) 32 bit instead of 64? Because PCI core favours
> > 32-bit prefetchable BARs and we have some.
> > This is one of them:
> >
> > | pci 0003:05:00.0: reg 0x30: [mem 0x00000000-0x000fffff pref]
> >
> > PCI core decides to let them enjoy the benefit of prefetch. They can't
> > bear the risk of getting an address above 4G, so its parent, its parent's
> > parent, its parent's parent's parent, and finally the root bridge (00:00.0)
> > must have the MEM_64 flag of their prefetchable resource (BAR 15) cleared.
>
> It sounds like we're tracking the resource requirements
> (prefetchability and BAR width) by using the flags on bridge windows.
> If that's the case, I think it's wrong. We should preserve the bridge
> window flags, because those express the bridge hardware capabilities,
> and we should explicitly keep track of what's required by devices
> below the bridge in some other way.
>
> > In the end nobody
> > is eligible to use the 64-bit (prefetchable) space even though we have a
> > huge supply!
> >
> > Note that even if the resource is small and successfully falls back into
> > the 32-bit non-prefetchable window, that's still not OK for us because we
> > need the SRIOV resource to be in 64-bit prefetchable space to do platform
> > configuration.
> >
> > With Yinghai's patch, when 64-bit prefetchable BARs are found, they're
> > favoured over the 32-bit prefetchable ones (if any), so all upstream
> > bridges' prefetchable windows have their MEM_64 flag preserved and the
> > huge 64-bit prefetchable space will be exploited:
> >
> > | pci 0003:00:00.0: BAR 15: assigned [mem 0x3d05008000000-0x3d0521fffffff 64bit pref]
> > | pci 0003:00:00.0: BAR 14: assigned [mem 0x3d05800000000-0x3d05802ffffff]
> > | pci 0003:00:00.0: BAR 13: can't assign io (size 0x4000)
> >
> > (The IO resource error here is because we do not provide an IO window.)
>
> Yes. The lack of I/O space is just a constraint of the platform.
> It'd be nice if we printed a more meaningful error message in this
> case. One really has to be a PCI expert to distinguish this from a
> real problem that we need to fix.
>
> Bjorn
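
For anyone following the sizes in this thread, here is a rough sanity check of the arithmetic, written as a small Python sketch. The addresses and the VF count are copied from the dmesg excerpts and text quoted above. The helper and variable names are made up for illustration, and none of this is kernel code.

```python
# Sanity check of the window and BAR sizes quoted in the thread.
# All addresses come from the dmesg excerpts; names are hypothetical.

MIB = 1 << 20
GIB = 1 << 30


def span(start: int, end: int) -> int:
    """Size in bytes of an inclusive [start, end] address range."""
    return end - start + 1


# reg 0x134: the SR-IOV BAR, sized for a single VF.
vf_bar = span(0x3d05018000000, 0x3d0501fffffff)
num_vfs = 63  # VF count as stated in the thread, not visible in dmesg
sriov_total = vf_bar * num_vfs

# Root bus resources offered by the host bridge.
mem32_window = span(0x3d05800000000, 0x3d0587ffeffff)   # non-prefetchable
pref64_window = span(0x3d05008000000, 0x3d057ffffffff)  # 64bit pref

# Size from "BAR 15: can't assign mem pref (size 0x206000000)".
bar15_request = 0x206000000

print(f"one VF BAR:          {vf_bar / MIB:.0f} MiB")
print(f"all {num_vfs} VFs:          {sriov_total / GIB:.3f} GiB")
print(f"BAR 15 request:      {bar15_request / GIB:.3f} GiB")
print(f"32-bit window:       {mem32_window / GIB:.3f} GiB")
print(f"64-bit pref window:  {pref64_window / GIB:.3f} GiB")

# The request cannot fit in the 32-bit window but fits in the 64-bit one.
print("fits in 32-bit window:     ", bar15_request <= mem32_window)
print("fits in 64-bit pref window:", bar15_request <= pref64_window)
```

This only compares sizes. It ignores alignment and whatever else sits behind the bridge, which is presumably why the BAR 15 request is somewhat larger than the VF total alone.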