From: Laszlo Ersek <lersek@redhat.com>
To: Marcel Apfelbaum <marcel@redhat.com>, qemu-devel@nongnu.org
Cc: mst@redhat.com, pbonzini@redhat.com, ehabkost@redhat.com,
Igor Mammedov <imammedo@redhat.com>,
"Dr. David Alan Gilbert" <dgilbert@redhat.com>,
Gerd Hoffmann <kraxel@redhat.com>
Subject: Re: [Qemu-devel] [PATCH] hw/pci-host: Fix x86 Host Bridges 64bit PCI hole
Date: Thu, 19 Oct 2017 12:41:41 +0200
Message-ID: <36a5b8e7-84d1-5ffd-d85b-7f0378801f3c@redhat.com>
In-Reply-To: <9d4b1fa8-e670-556b-278d-4993ad41b512@redhat.com>
On 10/19/17 11:30, Marcel Apfelbaum wrote:
>
> Hi Laszlo,
>
> On 18/10/2017 14:40, Laszlo Ersek wrote:
>> Hi Marcel,
>>
>> On 10/18/17 11:58, Marcel Apfelbaum wrote:
>>> Currently there is no MMIO range over 4G
>>> reserved for PCI hotplug. Since the 32bit PCI hole
>>> depends on the number of cold-plugged PCI devices
>>> and other factors, it is very possible that it is too
>>> small to hotplug PCI devices with large BARs.
>>>
>>> Fix it by reserving all the address space after
>>> RAM (and the reserved space for RAM hotplug),
>>> but no more than 40bits. The latter seems to be
>>> QEMU's magic number for the Guest physical CPU addressable
>>> bits and is safe with respect to migration.
>>>
>>> Note this is a regression since previous QEMU versions had
>>> some range reserved for 64bit PCI hotplug.
>>>
>>> Signed-off-by: Marcel Apfelbaum <marcel@redhat.com>
>>> ---
>>>  hw/i386/pc.c              | 22 ++++++++++++++++++++++
>>>  hw/pci-host/piix.c        | 10 ++++++++++
>>>  hw/pci-host/q35.c         |  9 +++++++++
>>>  include/hw/i386/pc.h      | 10 ++++++++++
>>>  include/hw/pci-host/q35.h |  1 +
>>>  5 files changed, 52 insertions(+)
>>
>> - What are the symptoms of the problem?
>>
>
> PCI devices with *relatively* large BARs can't be hot-plugged.
> The reason is that QEMU does not reserve any MMIO range over
> 4G and uses whatever remains of the 32bit PCI window.
>
> If you coldplug enough PCI devices you will end up
> with a 64bit PCI window that covers only the cold-plugged PCI
> devices, still leaving no room for hotplug.
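(For concreteness: as I understand it, QEMU derives the 64-bit window
from the BARs the firmware actually programmed, in the spirit of
pci_bus_get_w64_range(). A minimal illustrative sketch of that
derivation -- simplified types and names, not the QEMU source:

    #include <stdint.h>

    struct bar { uint64_t base, size; };

    /* Take the min base and max limit of the 64-bit BARs; if the
     * window is sized exactly to the cold-plugged devices, nothing
     * is left over for hotplug. */
    static void get_w64_range(const struct bar *bars, int n,
                              uint64_t *lo, uint64_t *hi)
    {
        *lo = UINT64_MAX;
        *hi = 0;
        for (int i = 0; i < n; i++) {
            uint64_t end = bars[i].base + bars[i].size;
            if (bars[i].base >= 0x100000000ULL) { /* above 4G */
                *lo = bars[i].base < *lo ? bars[i].base : *lo;
                *hi = end > *hi ? end : *hi;
            }
        }
    }
)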
>
>> - What QEMU commit regressed the previous functionality?
>>
>
> - commit 39848901818 ("pc: limit 64 bit hole to 2G by default")
>   shows us QEMU had the 64bit PCI hole, so it is a regression.
(Side remark: this commit was part of release v1.6.0, and at that time
we also had the "etc/pci-info" fw_cfg file.)
>
> - commit 2028fdf3791 ("piix: use 64 bit window programmed by guest")
>   leaves no room for PCI hotplug.
(Side remark: part of v1.7.0, from 2013. So, we can call this a
regression, but then it's a very old one.
Of course this is *not* a counter-argument to your patch, or commit
message wording; it's just that I'm surprised that the issue could
remain dormant this long -- usually users report regressions very quickly.)
>
>> - How exactly is the regression (and this fix) exposed to the guest?
>>
>
> The nuance is that both commits above computed the actual MMIO
> range used by cold-plugged devices, but as a side effect
> they influenced the remaining PCI window for hot-plugged devices.
>
> What the guest sees is the CRS MMIO range over 4G.
OK, understood.
>
>> I'm confused because semi-recently I've done multiple successful
>> hotplugs, with OVMF, Q35, PCIe root ports, and PCI Express devices
>> having large (64-bit) MMIO BARs. The examples include ivshmem / Linux
>> guest (based on your hints) and vfio-pci (XL710 NIC) / Windows Server
>> 2016 guest.
>
> Cool! This is actually great news!
> Back to the issue in hand, a few things:
> 1. It works only with Q35, not with the I440FX machines.
OK.
> 2. The hints are used for the PCIe root ports, meaning for PCI
>    bridges, and they address a slightly different issue: leaving
>    hotplug space for secondary buses, while this patch tackles
>    hotplug space for root buses.
OK.
>
>>
>> By default, OVMF marks a 32GB area as 64-bit MMIO aperture, for BAR
>> allocation. This area is above the normal memory (plus reservation for
>> hotplug memory, if any). It is also GB-aligned. The aperture size can be
>> changed (currently) with an experimental -fw_cfg switch (the fw_cfg file
>> is called "opt/ovmf/X-PciMmio64Mb", it states the size of the 64-bit
>> MMIO aperture as a decimal number of megabytes).
>
> I'll try to explain how I see the "picture":
> CRS is created by QEMU and computed *after* the firmware finishes
> the PCI BAR allocations.
Yes.
> Then *QEMU* decides how big the
> 64bit PCI hole will be, not the firmware - see the past commits.
Yes, first the guest-side programming happens, then QEMU (re-)calculates
the ACPI payload once the right ACPI fw_cfg file is selected / read.
This re-calculation (including the CRS) observes the previously
programmed values.
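(In code terms -- a much-simplified sketch of that ordering, with
hypothetical names; the real mechanism lives in hw/i386/acpi-build.c
and is driven through fw_cfg_add_file_callback():

    /* Hypothetical sketch: the guest firmware's read of the ACPI
     * table fw_cfg file triggers a rebuild, so the regenerated
     * tables, the _CRS included, observe the BAR values that the
     * firmware has already programmed. */
    typedef void (*read_cb)(void *opaque);

    static void acpi_build_update(void *build_state)
    {
        (void)build_state;
        /* 1. re-read the guest-programmed PCI BAR ranges */
        /* 2. regenerate the ACPI tables, recomputing the _CRS */
    }

    static void on_fw_cfg_file_read(read_cb cb, void *opaque)
    {
        cb(opaque); /* runs before the file content is returned */
    }
)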
>
> So IMHO you don't need opt/ovmf/X-PciMmio64Mb, since OVMF
> is not deciding this value. It actually should not care at all.
OVMF cannot *not* care. It must internally provide (to other,
platform-independent components in edk2) the 64-bit MMIO aperture that
the BARs can be allocated from.
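(For reference, the override mentioned above looks like this on the
QEMU command line; 65536 MB requests a 64 GB aperture, and the "X-"
prefix marks the knob as experimental:

    qemu-system-x86_64 \
        -fw_cfg name=opt/ovmf/X-PciMmio64Mb,string=65536 \
        ... # rest of the usual command line
)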
But maybe I'm totally misunderstanding this patch! Are you saying that
the patch makes only the following difference: when the CRS is
recalculated (i.e., after the PCI enumeration has completed in the
guest, and the ACPI linker/loader fw_cfg files are opened), then you
only grow / round up the 64-bit range?
That would make sense, because a *superset* aperture exposed to the OS
via the _CRS does not invalidate any allocations done by the firmware
earlier. Furthermore, the firmware itself need not learn about the
details of said superset. In this sense I agree that OVMF should not
care at all.
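(To pin down the semantics I have in mind: "superset" would mean a
grow-only merge of the firmware-programmed range with the reserved
hole, along these lines -- a hypothetical helper, not the patch
itself:

    #include <stdint.h>

    #define MIN(a, b) ((a) < (b) ? (a) : (b))
    #define MAX(a, b) ((a) > (b) ? (a) : (b))

    /* Grow-only: never shrink the firmware-programmed 64-bit range;
     * only widen it to cover the reserved hotplug hole. Earlier BAR
     * allocations therefore stay valid. */
    static void merge_hole64(uint64_t fw_base, uint64_t fw_limit,
                             uint64_t hole_base, uint64_t hole_limit,
                             uint64_t *crs_base, uint64_t *crs_limit)
    {
        *crs_base  = MIN(fw_base,  hole_base);
        *crs_limit = MAX(fw_limit, hole_limit);
    }
)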
[snip]
>> Also, how do you plan to integrate this feature with SeaBIOS?
(Ever since I sent my previous response, I've been glad that I finished
my email with this question -- I really think OVMF and SeaBIOS should
behave uniformly in this regard. So I'm happy about your answer:)
> Simple, no need :). It just works as it should.
> SeaBIOS computes everything, *then* QEMU creates the CRS ranges
> based on firmware input and other properties.
> I think it works also with OVMF out of the box.
OK, so this explains it all, thank you very much -- my question is
whether you could spell out those "other properties" in a bit more
detail (I apologize if these should be obvious to me already, from the
commit message and the patch):
(1) How exactly does the enlarged 64-bit hole remain a *superset* of the
firmware-programmed BARs? Can you point me to a location in the code?
(2) What happens if we do have more than 1TB RAM?
> A last note: when pxb/pxb-pcie devices become hot-pluggable,
> we need to leave some bits for them too. I wouldn't
> really want to add fw_cfg files for them too, or
> use a complicated fw_cfg scheme, or rewrite
> the logic for all firmware if we don't have to.
OK.
>
> The 64bit PCI window starts at the end of memory (+reserved)
> and finishes at 40bit.
OK, this seems to (partly) answer my question (1) -- can you please
confirm that the "finishes at 40bit" part does not imply forcefully
truncating the CRS at 1TB, in case the firmware already allocated BARs
higher than that?
> If you have 1T of RAM you
> get nothing for PCI hotplug, which seems fair.
This is fine (as the answer to my question (2)) as long as you mean it
in the following sense: "QEMU will simply *not grow* the MMIO in the
_CRS past 1TB".
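(For concreteness, the computation I understand from your description,
with illustrative names -- not the patch itself:

    #include <stdint.h>

    #define GiB (1ULL << 30)

    /* The window starts where RAM (plus the memory-hotplug area)
     * ends, rounded up to a GiB boundary; the default end is the
     * 40-bit guest-physical limit, i.e. 1 TiB. With 1T of RAM,
     * start == end and the window is empty, as you say. */
    static uint64_t hole64_start(uint64_t ram_end,
                                 uint64_t hotplug_size)
    {
        uint64_t start = ram_end + hotplug_size;
        return (start + GiB - 1) & ~(GiB - 1);
    }

    static uint64_t hole64_end(void)
    {
        return 1ULL << 40;
    }
)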
> When guests with more than 1T of RAM become
> frequent, I suppose QEMU's default CPU addressable
> bits will change and we will change this too.
Thanks Marcel, I feel a lot safer now.
Please confirm that the changes can only grow the area exposed in the
_CRS -- i.e., only lower or preserve the base, and only raise or
preserve the absolute limit. I'll shut up then :)
Thanks!
Laszlo