From mboxrd@z Thu Jan 1 00:00:00 1970
Received: from eggs.gnu.org ([208.118.235.92]:37588) by lists.gnu.org with esmtp (Exim 4.71) (envelope-from ) id 1aJjc9-0005mo-Bz for qemu-devel@nongnu.org; Thu, 14 Jan 2016 10:15:08 -0500
Received: from Debian-exim by eggs.gnu.org with spam-scanned (Exim 4.71) (envelope-from ) id 1aJjc4-00059v-9B for qemu-devel@nongnu.org; Thu, 14 Jan 2016 10:15:05 -0500
Received: from mx1.redhat.com ([209.132.183.28]:48829) by eggs.gnu.org with esmtp (Exim 4.71) (envelope-from ) id 1aJjc3-00059l-Ue for qemu-devel@nongnu.org; Thu, 14 Jan 2016 10:15:00 -0500
Received: from int-mx13.intmail.prod.int.phx2.redhat.com (int-mx13.intmail.prod.int.phx2.redhat.com [10.5.11.26]) by mx1.redhat.com (Postfix) with ESMTPS id 8E6707AE88 for ; Thu, 14 Jan 2016 15:14:59 +0000 (UTC)
References: <1452515063-5615-1-git-send-email-marcel@redhat.com> <5693D447.8070000@redhat.com> <5693D999.2030504@gmail.com> <5693E349.4090201@redhat.com> <5693EE0B.7030002@redhat.com> <5693F802.2010304@redhat.com> <5693FB2F.2080403@redhat.com> <56979399.20902@redhat.com> <5697B111.6070203@redhat.com>
From: Marcel Apfelbaum
Message-ID: <5697BB6D.3020406@redhat.com>
Date: Thu, 14 Jan 2016 17:14:53 +0200
MIME-Version: 1.0
In-Reply-To: <5697B111.6070203@redhat.com>
Content-Type: text/plain; charset=windows-1252; format=flowed
Content-Transfer-Encoding: 7bit
Subject: Re: [Qemu-devel] [PATCH] hw/pci: do not update the PCI mappings while Decode (I/O or memory) bit is not set in the Command register
List-Id:
List-Unsubscribe: ,
List-Archive:
List-Post:
List-Help:
List-Subscribe: ,
To: Laszlo Ersek , qemu-devel@nongnu.org
Cc: mst@redhat.com

On 01/14/2016 04:30 PM, Laszlo Ersek wrote:
> On 01/14/16 13:24, Marcel Apfelbaum wrote:
>> On 01/11/2016 08:57 PM, Marcel Apfelbaum wrote:
>>> On 01/11/2016 08:44 PM, Laszlo Ersek wrote:
>>>> On 01/11/16 19:01, Marcel Apfelbaum wrote:
>>>>> On 01/11/2016 07:15 PM, Laszlo Ersek wrote:
>>>>>> On 01/11/16 17:34, Marcel Apfelbaum wrote:
>>>>>>> On 01/11/2016 06:11 PM, Laszlo Ersek wrote:
>>>>>>>> On 01/11/16 13:24, Marcel Apfelbaum wrote:
>>>>>>>>> Two reasons:
>>>>>>>>> - PCI Spec indicates that while the bit is not set the memory sizing is not finished.
>>>>>>>>> - pci_bar_address will return PCI_BAR_UNMAPPED and a previous value can be accidentally overridden if the command register is modified (and not the BAR).
>>>>>>>>>
>>>>>>>>> Signed-off-by: Marcel Apfelbaum
>>>>>>>>> ---
>>>>>>>>>
>>>>>>>>> Hi,
>>>>>>>>>
>>>>>>>>> I found this when trying to use multiple root complexes with OVMF.
>>>>>>>>>
>>>>>>>>> When trying to attach a device to the pxb-pcie device as an Integrated Device, it did not receive the IO/MEM resources.
>>>>>>>>>
>>>>>>>>> The reason is that OVMF works like this:
>>>>>>>>> 1. It disables the Decode (I/O or memory) bit in the Command register
>>>>>>>>> 2. It configures the device BARs
>>>>>>>>> 3. Performs some tests on the Command register
>>>>>>>>> 4. ...
>>>>>>>>> 5. Enables the Decode (I/O or memory) bit at some point.
>>>>>>>>>
>>>>>>>>> On step 3 all the BARs are overridden to 0xffffffff by QEMU.
>>>>>>>>>
>>>>>>>>> Since QEMU uses the device BARs to compute the new host bridge resources, it now gets garbage.
>>>>>>>>>
>>>>>>>>> Laszlo, this also solves the SHPC problem for the pci-2-pci bridge inside the pxb.
>>>>>>>>> Now we can enable the SHPC for it too.
>>>>>>>>
>>>>>>>> I encountered the exact same problem months ago. I posted patches for it; you were CC'd. :)
>>>>>>>>
>>>>>>>> http://thread.gmane.org/gmane.comp.emulators.qemu/342206/focus=342209
>>>>>>>> http://thread.gmane.org/gmane.comp.emulators.qemu/342206/focus=342210
>>>>>>>>
>>>>>>>> As you can see under the second link above, I made the same analysis & observations as you do now. (It took me quite long to track down the "inexplicable" behavior of edk2's generic PCI bus driver / enumerator that is built into OVMF.)
>>>>>>>
>>>>>>> Wow, I just re-worked this issue again from 0! I wish I had remembered those threads :(
>>>>>>> This was another symptom of the exact same problem! And I remembered something about SHPC, I should have looked at those mail threads again...
>>>>>>>
>>>>>>>> I proposed to change pci_bar_address() so that it could return, to distinguished callers, the BAR values "under programming", even if the command bits were clear. Then the ACPI generator would utilize this special exception.
>>>>>>>>
>>>>>>>> Michael disagreed; in
>>>>>>>>
>>>>>>>> http://thread.gmane.org/gmane.comp.emulators.qemu/342206/focus=342242
>>>>>>>>
>>>>>>>> he wrote "[t]his is problematic - disabled BAR values have no meaning according to the PCI spec".
>>>>>>>
>>>>>>> Yes... because it looked like a hook for our case only. The good news is that this patch is based exactly on the fact that the BARs have no meaning if the bit is not set.
>>>>>>>
>>>>>>>> The current solution to the problem (= we disable the SHPC) was recommended by Michael in that message: "It might be best to add a property to just disable shpc in the bridge so no devices reside directly behind the pxb?"
>>>>>>>
>>>>>>> I confess I don't exactly understand what the SHPC of the pci-2-pci bridge has to do with sibling devices on the pxb's root bus (SHPC is the hot-plug controller for the devices behind the pci-2-pci bridge).
>>>>>>>
>>>>>>> The second part I do understand: the pxb design was to not have devices directly behind the pxb, so maybe he meant that the SHPC is the part of the pci-bridge that behaves like a device in the sense that it requires IO/MEM resources.
>>>>>>>
>>>>>>> Bottom line, your solution for the PXB was just fine :)
>>>>>>>
>>>>>>>> In comparison, your patch doesn't change pci_bar_address(). Instead, it modifies pci_update_mappings() *not to call* pci_bar_address(), if the respective command bits are clear.
>>>>>>>>
>>>>>>>> I guess that could have about the same effect.
>>>>>>>>
>>>>>>>> If, unlike my patch, yours actually improves QEMU's compliance with the PCI specs, then it's likely a good patch. (And apparently more general than the SHPC-specific solution we have now.)
>>>>>>>
>>>>>>> Exactly! Why should a PCI write to the command register *delete* previously set resources? I am looking at it as a bug.
>>>>>>>
>>>>>>> And also, updating the mappings while the Decoding bit is not enabled is at least not necessary.
>>>>>>>
>>>>>>>> I just don't know if it's a good idea to leave any old mappings active while the BARs are being reprogrammed (with the command bits clear).
>>>>>>>>
>>>>>>> First, because the OS can't use the IO/MEM anyway; secondly, the guest OS/firmware is the one that disabled the bit... (in order to program resources)
>>>>>>
>>>>>> I have something like the following in mind. Do you think it is a valid (although contrived) use case?
>>>>>>
>>>>>> - guest programs some BAR and uses it (as designed / intended)
>>>>>> - guest disables command bit, modifies BAR location
>>>>>> - guest accesses *old* BAR location
>>>>>>
>>>>>> What should a guest *expect* in such a case? Is this invalid guest behavior?
>>>>>
>>>>> Yes, this is indeed invalid behavior; from the device's point of view it is disabled. Best case scenario - the guest will see 0xffffffff, worst case - garbage.
>>>>>
>>>>>> If it is not invalid, then will QEMU comply with the guest's expectations if your patch is applied? Pre-patch, the guest would likely access a "hole" in the host bridge MMIO aperture, whereas with your patch (I guess?) it still might access the device through the old (still active) BAR?
>>>>>
>>>>> Since the IO is disabled, pci_bar_address will return PCI_BAR_UNMAPPED and *no updates will be made* pre or post this patch. It will behave the same from the guest's point of view. It will still access the memory region of the device.
>>>>>
>>>>>> Or would QEMU prevent that just by virtue of the command bit being clear?
>>>>>
>>>>> Again, this patch only changes the behavior in a specific case: when the device is disabled and the guest writes to the command register without enabling IO/MEM.
>>>>
>>>> Ah, right! That's exactly what the edk2 PCI bus driver / enumerator does. Massages the command register without touching those two bits, and...
>>>>
>>>>> Pre-patch -> the old BAR addresses are overridden with 0xffffffff (I really think this is a bug, nobody asked for this!!)
>>>>
>>>> the previously programmed BAR address gets lost. It's great that you have the PCI knowledge to state that this is actually a bug! It had looked fishy to me as well, but I couldn't make the same argument.
>>>>
>>>>> Post-Patch -> the old BAR values are returned to the prev state -> correct behavior IMHO.
>>
>> Hi Laszlo,
>>
>> After an off-line talk with Michael it seems that this solution is also not valid.
>>
>> When we build the ACPI table we don't take the IO/mem ranges directly from the Configuration Space, but use the proper wrappers to ensure the values are valid.
>>
>> When the IO decoding is disabled the wrappers do not modify the actual Configuration Space, but only the mappings, to show that the values are not valid.
>>
>> It doesn't matter what we are going to do here, the real problem lies in OVMF not *enabling* IO/Mem decoding before requesting the ACPI tables. (If OVMF did not do the "massage" we would get the right values by chance; we should not look at the values when IO is disabled)
>
> When discussing this, please address the code that is specific to OVMF *separately* from code that is generic in edk2. OVMF (the binary) comes together from a large number of modules. The code that is under my co-maintainership, and is specific to QEMU, lives under OvmfPkg/ in the edk2 tree. Most other modules in the tree, especially under MdeModulePkg/, are generic, and run unmodified on physical hardware.
>
> Why is this important?
> Because it is easy (or easier, anyway) to adapt things in OvmfPkg/ to QEMU. Modifying code in MdeModulePkg/ so that it works better with QEMU is *very* hard, especially if the changes are intrusive. The PCI bus driver (that is responsible for the enumeration etc) is generic, and lives in "MdeModulePkg/Bus/Pci/PciBusDxe/". If you'd like to change that, that will require extremely strong arguments for the maintainer to accept.
>
> MdeModulePkg/Bus/Pci/PciBusDxe communicates with the platform-specific PCI host bridge / root bridge driver. OVMF does have a specific version of that driver, in "OvmfPkg/PciHostBridgeDxe".
>
> Depending on whether you find an issue in the (higher level, generic) "MdeModulePkg/Bus/Pci/PciBusDxe", versus the lower level, specific "OvmfPkg/PciHostBridgeDxe", the effort required to fix the problem may differ hugely.
>
> The massaging of the command register occurs in "MdeModulePkg/Bus/Pci/PciBusDxe". If we say that the command register massaging is invalid, then the first question that the maintainer will likely ask is, "why does it work on physical hardware then".
>
> Unless we can argue that the massaging works on physical hardware *in spite of* the PCI specification, just as a lucky coincidence (that is, we can prove that the PciBusDxe code violates the PCI spec), our claim might well turn out to be a dead end.
>
> (Unless the change needed is really non-intrusive, or can be controlled with build-time flags.)
>
>> Bottom line, we have 2 options:
>> 1. Modify OVMF to enable IO Decoding *before* loading the ACPI tables.
>
> OVMF downloads the ACPI tables over fw_cfg strictly after PCI enumeration and resource allocation complete. This is the timeline:
>
> (1) The "OvmfPkg/PciHostBridgeDxe" driver is dispatched by the DXE core at some point. This is a DXE_DRIVER, which means that it produces its protocols (i.e., it makes structures with function pointers in them available to the rest of the firmware) right in its entry point function. You can see the GUIDs of these protocols listed in "PciHostBridgeDxe.inf", under [Protocols], with the comment ## PRODUCES. Two protocols are relevant here:
>
> - gEfiPciHostBridgeResourceAllocationProtocolGuid
> - gEfiPciRootBridgeIoProtocolGuid
>
> This driver exposes the root bridges (root buses) that are available in the system -- one EFI_PCI_ROOT_BRIDGE_IO_PROTOCOL structure for each root bus --, and a "singleton" EFI_PCI_HOST_BRIDGE_RESOURCE_ALLOCATION_PROTOCOL structure that allows the higher level PCI bus driver to allocate resources.
>
> (2) The "MdeModulePkg/Bus/Pci/PciBusDxe" driver is launched by the DXE core at some point. If you look at "PciBusDxe.inf", you can see in the [Protocols] section that the driver consumes the following protocols (among other things):
>
> - gEfiPciHostBridgeResourceAllocationProtocolGuid
> - gEfiPciRootBridgeIoProtocolGuid
>
> That is, the two produced in (1).
>
> Now, *unlike* the driver in (1), this is not a DXE_DRIVER module, but a UEFI_DRIVER module. Meaning, it conforms to the UEFI driver model: it doesn't do practically anything in its entry point function; instead it expects callbacks via the driver binding protocol. It won't bind root bridge devices until it is explicitly asked to, whereas the root bridge / host bridge driver in (1) produces the root bridge protocol instances right off the bat.
>
> You can also witness this in the [Protocols] usage comments.
> For a DXE_DRIVER, you have PRODUCES and CONSUMES -- these are "driver level" annotations. For a UEFI_DRIVER, in connection with the driver binding protocol, you will see TO_START (= the protocol this driver is capable of binding) and BY_START (= the protocol the driver produces on top of the handles it binds).
>
> For the PCI bus driver, both protocols listed above are marked TO_START. We also have "gEfiPciIoProtocolGuid", marked BY_START.
>
> This means that the PCI bus driver will bind, *when asked*, the root bridge protocol instances (that were automatically produced in (1)). It enumerates the devices on those root bridges (and recursively any other non-root bridges), allocates resources for them (with the other protocol listed in (1)), and produces, *when asked*, EFI_PCI_IO_PROTOCOL instances, one for each PCI device found.
>
> (
>
> In turn, the various PCI device drivers (video driver, virtio-block driver, etc) consume EFI_PCI_IO_PROTOCOL, and produce other UEFI abstractions on top (Graphics Output Protocol, Block IO protocol, etc).
>
> )
>
> Finally, you can also find this under [Protocols], for PciBusDxe:
>
> gEfiPciEnumerationCompleteProtocolGuid ## PRODUCES
>
> Note that this is *not* a BY_START remark, but a PRODUCES remark. The PCI bus driver will install a firmware-wide empty protocol structure, with the above protocol GUID, after the enumeration / resource allocation is complete. This allows other drivers to *wait* for that enumeration to complete.
>
> (3) OVMF's ACPI Platform Driver, in OvmfPkg/AcpiPlatformDxe/AcpiPlatformDxe.inf, is dispatched at some point by the DXE core. This is a DXE_DRIVER, meaning it does stuff immediately in its entry point.
>
> What it does is the following. It checks for the presence of gEfiPciEnumerationCompleteProtocolGuid in the protocol database. If that protocol is available immediately, then the driver knows it has been dispatched *after* PCI enumeration completed, so it goes right ahead with its business (see below). If the protocol is not available yet, then it registers a *protocol notify* callback for the same, and exits the entry point function. (It stays resident of course.)
>
> When the PCI bus driver finishes enumeration sometime later, and installs said protocol, then the ACPI platform driver receives the callback, and finally performs its business.
>
> And that business is to download the ACPI payload from QEMU over fw_cfg, "execute" the linker-loader script, and install the ACPI tables, calling EFI_ACPI_TABLE_PROTOCOL.InstallAcpiTable().
>
> (4) The final piece of the puzzle is the BDS driver (Boot Device Selection driver). This is also a DXE_DRIVER. The DXE core dispatches it as the last step of the DXE phase.
>
> This driver is a generic driver (in IntelFrameworkModulePkg/Universal/BdsDxe), but OVMF customizes it with a plug-in library instance, which lives in "OvmfPkg/Library/PlatformBdsLib". The important function in the library is PlatformBdsPolicyBehavior() -- the generic BDS driver calls it.
>
> In this function, we have:
>
> VisitAllInstancesOfProtocol (&gEfiPciRootBridgeIoProtocolGuid,
>   ConnectRootBridge, NULL);
>
> This looks simple, but it does a whole lot:
>
> (a) It locates all the root bridge IO protocol instances that (1) automatically produced, and calls the gBS->ConnectController() service on each. This is what *asks* the PCI bus driver to bind each root bridge.
>
> (b) In turn the PCI bus driver does the enumeration and resource allocation, communicating with OvmfPkg/PciHostBridgeDxe in the process. It produces a bunch of PciIo protocol instances for the devices it finds. Finally it installs the "PCI enumeration complete" protocol discussed in (2).
>
> At this point the PCI configuration *should* be final, as far as QEMU is concerned.
>
> (c) The ACPI platform driver described in (3) comes to life, and downloads the payload via fw_cfg. The payload is regenerated by QEMU on the spot, based on the current PCI configuration.
>
> (5) *Later*, but still in OvmfPkg/Library/PlatformBdsLib, we call BdsLibConnectAll(). This is a utility library function (not specific to OvmfPkg) that binds *all* possible UEFI drivers to *all* possible device handles.
>
> This is when the specific PCI device drivers I mentioned above (video, virtio, etc) bind the PciIo protocol instances -- produced in (b) -- and produce the higher level abstractions (Graphics Output Protocol, Block IO protocol), on top of which even higher level protocols are produced by other drivers, recursively. (Disk IO on Block IO, partitions on Disk IO, filesystems on partitions, for example.)
>
> The point is, the IO and MMIO enable bits, in each individual PCI device's command register, are set *only* in this stage, by the individual UEFI drivers that recognize, and drive, that specific kind of PCI device (most frequently based on vendor and device ID).
>
> For example, a QXL video card sees this occur when the QemuVideoControllerDriverStart() function, in "OvmfPkg/QemuVideoDxe/Driver.c", binds the QXL card's PciIo protocol instance -- search it for "EFI_PCI_DEVICE_ENABLE". (The definition of that macro is in "MdePkg/Include/Protocol/PciIo.h".)
>
> Bottom line, the ACPI payload is downloaded after the enumeration / resource allocation completes, but *not* after each single PCI device is *enabled* via its own command register.
>
> If the ACPI payload actually depends on *that*, that's trouble. For example, I could interrupt the boot process, and go to the UEFI shell first, before OVMF boots "grub" (or Windows). Then I can manually *unbind* some drivers from some devices.
>
> For example, I can manually unbind QemuVideoDxe from the QXL card. Being the well behaved UEFI_DRIVER that it is, QemuVideoDxe will restore the original PCI attributes for this device -- meaning, it will clear the various decode bits in the command register. If I happen to have a serial console for the virtual machine as well, then I can easily continue using the VM over that. I can return to the UEFI setup utility from the UEFI shell, and allow the boot process to continue -- without a configured Graphics Output Protocol for the QXL card!
>
> I can do this with any (well written) UEFI_DRIVER module. Therefore, the decode bits may change for any given PCI device *after* the ACPI payload was downloaded. I think neither QEMU should keep regenerating that ACPI payload, nor should OVMF attempt to notice any such change, and re-download and re-install the ACPI tables.
>
> Speaking in terms of "UEFI design", whether a PCI device is at the moment bound & driven by a matching UEFI driver is (should be) orthogonal to what the ACPI Platform driver does or does not do.
>
> I could try to re-wire the ACPI platform driver, so that it doesn't wake up when the PCI bus driver completes the enumeration -- see (4)(b) and (4)(c) above.
> Instead, I could try to signal it manually from PlatformBdsLib, *after* the explicit BdsLibConnectAll() -- see (5).
>
> Because at that point, all possible UEFI drivers should be active, and all recognized PCI devices should be decoding their resources.
>
> However,
> - this is incredibly ugly. (Small problem.)
> - The big problem is that the ACPI tables, particularly the FACS, are a *prerequisite* for a number of other firmware features; most importantly: S3.
>
> There's a bunch of S3- and SMM-related, important stuff happening *between* the enumeration -- (4)(c) -- and BdsLibConnectAll() -- (5). If I wanted to delay the ACPI table installation, that would turn the entire logic in-between on its head -- a dependency hell.
>
> So I don't think I want to do that; it was very hard to get into a functional state.
>

Hi Laszlo,

Thank you for the very detailed explanation (as always). Part of the above I understood by myself, but the explanation above makes my life much easier :)

>
>> 2. The same as with pxb, disable Integrated Endpoints for pxb-pcie.
>
> My vote, without a doubt.
>

I agree it is not worth the trouble. I'll add a QEMU check to not allow Integrated Endpoints for pxb-pcie.

Thanks!
Marcel

>>
>> I am going to look at 1., maybe it is doable in a clean way.
>
> My vote: don't. :)
>
> Thanks
> Laszlo
>
>> Thanks,
>> Marcel
>>
>>
>> [...]
>
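For anyone who wants to experiment with the behavior discussed in this thread, below is a minimal, self-contained C sketch. It is not the actual QEMU pci_update_mappings()/pci_bar_address() code; the names fake_pci_dev, fake_bar and update_one_mapping are invented for illustration only. It shows the guard the patch argues for: while the relevant decode bit in the command register is clear, a write that merely touches the command register must not clobber a previously programmed BAR mapping.

/*
 * Sketch only: illustrates "skip remapping while decoding is disabled".
 * PCI_COMMAND_IO / PCI_COMMAND_MEMORY match the real command-register
 * bit values; everything else here is made up for the example.
 */
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

#define PCI_COMMAND_IO     0x1  /* I/O space decode enable */
#define PCI_COMMAND_MEMORY 0x2  /* memory space decode enable */
#define BAR_UNMAPPED       UINT64_MAX

struct fake_bar {
    bool     is_io;     /* I/O BAR vs. memory BAR */
    uint64_t mapped_at; /* current mapping, or BAR_UNMAPPED */
};

struct fake_pci_dev {
    uint16_t        command; /* command register */
    struct fake_bar bar[6];
};

/* Re-evaluate one BAR mapping after a config-space write. */
static void update_one_mapping(struct fake_pci_dev *d, int i, uint64_t new_addr)
{
    uint16_t needed = d->bar[i].is_io ? PCI_COMMAND_IO : PCI_COMMAND_MEMORY;

    if (!(d->command & needed)) {
        /*
         * Decoding is disabled: leave the previously programmed mapping
         * state alone instead of overwriting it with an unmapped/garbage
         * value derived from a BAR that currently has no meaning.
         */
        return;
    }
    d->bar[i].mapped_at = new_addr;
}

int main(void)
{
    struct fake_pci_dev dev = {
        .command = 0,                             /* decoding disabled */
        .bar[0]  = { .is_io = false, .mapped_at = 0xfe000000 },
    };

    /* Firmware pokes the command register without touching BAR 0. */
    update_one_mapping(&dev, 0, BAR_UNMAPPED);

    printf("BAR0 still mapped at 0x%llx\n",
           (unsigned long long)dev.bar[0].mapped_at);
    return 0;
}

With the early return in place, the firmware's "massage" of the command register (step 3 in the cover letter quoted at the top) leaves BAR 0 mapped at its old address; without it, the mapping would be overwritten before the decode bit is ever set.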