From: Laszlo Ersek
Date: Thu, 19 Oct 2017 13:52:14 +0200
Subject: Re: [Qemu-devel] [PATCH] hw/pci-host: Fix x86 Host Bridges 64bit PCI hole
Message-ID: <6e21e32b-ce06-a6ff-8603-d864af129d7a@redhat.com>
In-Reply-To: <7b08ea3e-d380-ac0c-a716-d0ac703feaea@redhat.com>
References: <20171018095807.101406-1-marcel@redhat.com> <69cefa98-5681-5b12-719c-e13fcba969c4@redhat.com> <9d4b1fa8-e670-556b-278d-4993ad41b512@redhat.com> <36a5b8e7-84d1-5ffd-d85b-7f0378801f3c@redhat.com> <7b08ea3e-d380-ac0c-a716-d0ac703feaea@redhat.com>
To: Marcel Apfelbaum, qemu-devel@nongnu.org
Cc: mst@redhat.com, pbonzini@redhat.com, ehabkost@redhat.com, Igor Mammedov, "Dr. David Alan Gilbert", Gerd Hoffmann

On 10/19/17 13:29, Marcel Apfelbaum wrote:
> Hi Laszlo,
>
> On 19/10/2017 13:41, Laszlo Ersek wrote:
>> On 10/19/17 11:30, Marcel Apfelbaum wrote:
>>>
>>> Hi Laszlo,
>>>
>>> On 18/10/2017 14:40, Laszlo Ersek wrote:
>>>> Hi Marcel,
>>>>
>>>> On 10/18/17 11:58, Marcel Apfelbaum wrote:
>>>>> Currently there is no MMIO range over 4G
>>>>> reserved for PCI hotplug. Since the 32bit PCI hole
>>>>> depends on the number of cold-plugged PCI devices
>>>>> and other factors, it is very possible it is too small
>>>>> to hotplug PCI devices with large BARs.
>>>>>
>>>>> Fix it by reserving all the address space after
>>>>> RAM (and the reserved space for RAM hotplug),
>>>>> but no more than 40 bits. The latter seems to be
>>>>> QEMU's magic number for the guest physical CPU addressable
>>>>> bits and is safe with respect to migration.
>>>>>
>>>>> Note this is a regression since previous QEMU versions had
>>>>> some range reserved for 64bit PCI hotplug.
>>>>>
>>>>> Signed-off-by: Marcel Apfelbaum
>>>>> ---
>>>>>   hw/i386/pc.c              | 22 ++++++++++++++++++++++
>>>>>   hw/pci-host/piix.c        | 10 ++++++++++
>>>>>   hw/pci-host/q35.c         |  9 +++++++++
>>>>>   include/hw/i386/pc.h      | 10 ++++++++++
>>>>>   include/hw/pci-host/q35.h |  1 +
>>>>>   5 files changed, 52 insertions(+)
>>>>
>>>> - What are the symptoms of the problem?
>>>>
>>>
>>> PCI devices with *relatively* large BARs can't be hot-plugged.
>>> The reason is that QEMU does not reserve an MMIO range over 4G and
>>> uses whatever remains in the 32bit PCI window.
>>>
>>> If you coldplug enough PCI devices you will end up with
>>> a 64bit PCI window that covers only the cold-plugged PCI
>>> devices, still with no room for hotplug.
>>>
>>>> - What QEMU commit regressed the previous functionality?
>>>>
>>>
>>>   - commit 39848901818 pc: limit 64 bit hole to 2G by default
>>>   shows us QEMU had the 64bit PCI hole, so it is a regression.
>>
>> (Side remark: this commit was part of release v1.6.0, and at that time
>> we also had the "etc/pci-info" fw_cfg file.)
>>
>>>
>>>   - commit 2028fdf3791 piix: use 64 bit window programmed by guest
>>>   leaves no room for PCI hotplug
>>
>> (Side remark: part of v1.7.0, from 2013. So, we can call this a
>> regression, but then it's a very old one.
>>
>> Of course this is *not* a counter-argument to your patch, or commit
>> message wording; it's just that I'm surprised that the issue could
>> remain dormant this long -- usually users report regressions very
>> quickly.)
>>
>
> I also thought I was proposing a new feature, and I was pretty amazed
> that I was actually fixing something that worked before.
>
>>>
>>>> - How exactly is the regression (and this fix) exposed to the guest?
>>>>
>>>
>>> The nuance is that both commits above computed the actual MMIO
>>> range used by cold-plugged devices, but as a side effect
>>> they influenced the remaining PCI window for hot-plugged devices.
>>>
>>> What the guest sees is the CRS MMIO range over 4G.
>>
>> OK, understood.
>>
>>>
>>>> I'm confused because semi-recently I've done multiple successful
>>>> hotplugs, with OVMF, Q35, PCIe root ports, and PCI Express devices
>>>> having large (64-bit) MMIO BARs. The examples include ivshmem / Linux
>>>> guest (based on your hints) and vfio-pci (XL710 NIC) / Windows Server
>>>> 2016 guest.
>>>
>>> Cool! That is actually great news!
>>> Back to the issue at hand, a few things:
>>> 1. It works with Q35 only, but not with the i440FX machines.
>>
>> OK.
>>
>>> 2. The hints are used for the PCIe root ports, meaning for PCI bridges,
>>>    and address a slightly different issue: leaving hotplug space
>>>    for secondary buses, while this patch tackles hotplug
>>>    space for root buses.
>>
>> OK.
>>
>>>
>>>>
>>>> By default, OVMF marks a 32GB area as the 64-bit MMIO aperture for BAR
>>>> allocation. This area is above the normal memory (plus the reservation
>>>> for hotplug memory, if any). It is also GB-aligned. The aperture size
>>>> can currently be changed with an experimental -fw_cfg switch (the
>>>> fw_cfg file is called "opt/ovmf/X-PciMmio64Mb"; it states the size of
>>>> the 64-bit MMIO aperture as a decimal number of megabytes).
>>>
>>> I'll try to explain how I see the "picture":
>>> the CRS is created by QEMU and computed *after* the firmware finishes
>>> the PCI BAR allocations.
>>
>> Yes.
>>
>>> Then *QEMU* decides how big the 64bit PCI hole
>>> will be, not the firmware - see the past commits.
>>
>> Yes, first the guest-side programming happens, then QEMU (re-)calculates
>> the ACPI payload once the right ACPI fw_cfg file is selected / read.
>> This re-calculation (including the CRS) observes the previously
>> programmed values.
>>
>>>
>>> So IMHO you don't need the opt/ovmf/X-PciMmio64Mb since OVMF
>>> is not deciding this value. It actually should not care at all.
>>
>> OVMF cannot *not* care.
>> It must internally provide (to other,
>> platform-independent components in edk2) the 64-bit MMIO aperture that
>> the BARs can be allocated from.
>>
>> But maybe I'm totally misunderstanding this patch! Are you saying that
>> the patch makes the following difference only: when the CRS is
>> recalculated (i.e., after the PCI enumeration has completed in the
>> guest, and the ACPI linker/loader fw_cfg files are opened), then you
>> only grow / round up the 64-bit range?
>>
>
> Exactly.
>
>> That would make sense, because a *superset* aperture exposed to the OS
>> via the _CRS does not invalidate any allocations done by the firmware
>> earlier. Furthermore, the firmware itself need not learn about the
>> details of said superset. In this sense I agree that OVMF should not
>> care at all.
>>
>
> Exactly, the range is a superset.
>
>> [snip]
>>
>>>> Also, how do you plan to integrate this feature with SeaBIOS?
>>
>> (Ever since I sent my previous response, I've been glad that I finished
>> my email with this question -- I really think OVMF and SeaBIOS should
>> behave uniformly in this regard. So I'm happy about your answer:)
>>
>>> Simple, no need :) . It just works as it should.
>>> SeaBIOS computes everything, *then* QEMU creates the CRS ranges
>>> based on firmware input and other properties.
>>> I think it also works with OVMF out of the box.
>>
>> OK, so this explains it all, thank you very much -- my question is if
>> you could spell out those "other properties" in a bit more detail (I
>> apologize if these should be obvious to me already, from the commit
>> message and the patch):
>>
>
> "Other properties" was a vague reference to max RAM, and things like
> that. Nothing specific.
>
>> (1) How exactly does the enlarged 64-bit hole remain a *superset* of the
>> firmware-programmed BARs? Can you point me to a location in the code?
>>
>
> Sure, in the patch itself:
>  We first take the firmware-computed ranges:
>      pci_bus_get_w64_range(h->bus, &w64);
>  Look at the *get_pci_hole64_start -> if the firmware set it, we don't
>  touch it.
>  And at the *get_pci_hole64_end -> we only update the value if it is
>                                    smaller than what the firmware set.
>
>> (2) What happens if we do have more than 1TB RAM?
>>
>
> You get no 64bit "hot-pluggable" reserved range.
> The code dealing with that is:
>
> +    if (hole64_start > PCI_HOST_PCI_HOLE64_END) {
> +        hole64_start = PCI_HOST_PCI_HOLE64_END;
> +    }
>
> And since the end is hard-coded to PCI_HOST_PCI_HOLE64_END
> we have a zero-size region, which gets discarded.
>
> An interesting case is if we have more than 1T of RAM and the
> firmware assigns BARs above that: we end up increasing the
> 64bit window to include some of the RAM...
>
> A better approach is to check against the actual hole64
> end and not the magic value.
> I'll send a v3, thanks!
Ah, right; in v1 -- and possibly v2, I haven't looked yet -- hole64_start
will be set to PCI_HOST_PCI_HOLE64_END, in preparation for the end to be
set the exact same way (resulting in an empty range); however, the end
won't be modified at all if it's already above PCI_HOST_PCI_HOLE64_END,
so we'll end up with a non-empty range that intersects with RAM.

Please CC me on v3 too; I feel that I'm now better equipped to comment :)

Thanks!
Laszlo

>
>>> One last note: when pxb/pxb-pcie devices become hot-pluggable,
>>> we need to leave some bits for them too. I wouldn't
>>> really want to add fw_cfg files for them too, or
>>> use a complicated fw_cfg, or rewrite
>>> the logic for all firmware if we don't have to.
>>
>> OK.
>>
>>>
>>> The 64bit PCI window starts at the end of memory (+reservations)
>>> and finishes at 40 bits.
>>
>> OK, this seems to (partly) answer my question (1) -- can you please
>> confirm that the
>>
>>    finishes at 40 bits
>>
>> part does not imply forcefully truncating the CRS at 1TB, in case the
>> firmware already allocated BARs higher than that?
>>
>
> I confirm:
>
> +    if (s->pci_hole64_fix && value < PCI_HOST_PCI_HOLE64_END) {
> +        value = PCI_HOST_PCI_HOLE64_END;
> +    }
>
> I touch the value only if it is smaller than what the firmware set.
>
>>> If you have 1T of RAM you
>>> get nothing for PCI hotplug, which seems fair.
>>
>> This is fine (as the answer to my question (2)) as long as you mean it
>> in the following sense: "QEMU will simply *not grow* the MMIO in the
>> _CRS past 1TB".
>>
>>> When guests with more than 1T of RAM become frequent,
>>> I suppose QEMU's default CPU addressable
>>> bits will change and we will change this too.
>>
>> Thanks Marcel, I feel a lot safer now.
>>
>> Please confirm that the changes can only grow the area exposed in the
>> _CRS -- i.e., only lower or preserve the base, and only raise or
>> preserve the absolute limit. I'll shut up then :)
>>
>
> Confirmed, and thanks for the review!
>
> Thanks,
> Marcel
>
>> Thanks!
>> Laszlo
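
For readers following the thread, here is a minimal, self-contained C
sketch of the hole64 getter logic under discussion. The struct and
function names are illustrative stand-ins (they are not the actual QEMU
symbols), and the clamping mirrors the v1 behavior analyzed above, not
the final v3 patch:

    #include <stdbool.h>
    #include <stdint.h>

    /* QEMU's 40-bit "magic" limit (1 TiB) discussed above. */
    #define PCI_HOST_PCI_HOLE64_END (1ULL << 40)

    /* Illustrative stand-in for the host bridge state; not the real
     * QEMU structures. */
    typedef struct {
        uint64_t w64_start;  /* lowest firmware-programmed 64-bit BAR */
        uint64_t w64_end;    /* one past the highest such BAR (0 if none) */
        uint64_t ram_end;    /* end of RAM plus the RAM-hotplug reservation */
        bool pci_hole64_fix; /* compat knob: off for older machine types */
    } HoleSketch;

    static uint64_t get_pci_hole64_start(const HoleSketch *s)
    {
        /* Honor the firmware-programmed window start; otherwise begin
         * right after RAM (and the RAM-hotplug reservation). */
        uint64_t start = s->w64_start ? s->w64_start : s->ram_end;

        /* v1 behavior under discussion: clamp the start down to the
         * magic limit when it lies above 1 TiB. */
        if (s->pci_hole64_fix && start > PCI_HOST_PCI_HOLE64_END) {
            start = PCI_HOST_PCI_HOLE64_END;
        }
        return start;
    }

    static uint64_t get_pci_hole64_end(const HoleSketch *s)
    {
        uint64_t end = s->w64_end;

        /* Grow-only: raise the end to the magic limit, but never lower
         * it below what the firmware actually programmed. */
        if (s->pci_hole64_fix && end < PCI_HOST_PCI_HOLE64_END) {
            end = PCI_HOST_PCI_HOLE64_END;
        }
        return end;
    }

With more than 1 TiB of RAM and firmware-assigned BARs above it,
get_pci_hole64_start() clamps down to 1 TiB while get_pci_hole64_end()
keeps the larger firmware value, so the resulting range is non-empty and
overlaps RAM -- the exact v1 problem described above. Clamping the start
against the computed end, rather than against the magic constant, makes
the range collapse to empty instead (the v3 direction).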