From: Laszlo Ersek <lersek@redhat.com>
To: Marcel Apfelbaum <marcel.apfelbaum@gmail.com>,
"Michael S. Tsirkin" <mst@redhat.com>
Cc: Zihan Yang <whois.zihan.yang@gmail.com>,
qemu-devel@nongnu.org, Igor Mammedov <imammedo@redhat.com>,
Alex Williamson <alex.williamson@redhat.com>,
Eric Auger <eauger@redhat.com>, Drew Jones <drjones@redhat.com>,
Wei Huang <wei@redhat.com>
Subject: Re: [Qemu-devel] [RFC 3/3] acpi-build: allocate mcfg for multiple host bridges
Date: Wed, 23 May 2018 19:25:04 +0200
Message-ID: <aa3e1859-bfd0-7fc6-bc9e-c35dd68588a8@redhat.com>
In-Reply-To: <6b3efcdf-c953-b2bc-e8f7-ac172f143d07@gmail.com>

On 05/23/18 19:11, Marcel Apfelbaum wrote:
> On 05/23/2018 10:32 AM, Laszlo Ersek wrote:
>> On 05/23/18 01:40, Michael S. Tsirkin wrote:
>>> On Wed, May 23, 2018 at 12:42:09AM +0200, Laszlo Ersek wrote:
>>>> If we figure out a placement strategy or an easy to consume
>>>> representation of these data for the firmware, it might be possible
>>>> for OVMF to hook them into the edk2 core (although not in the
>>>> earliest firmware phases, such as SEC and PEI).
>
> Can you please remind me how OVMF places the 64-bit PCI hotplug
> window?
If you mean the 64-bit PCI MMIO aperture, I described it here in detail:

https://bugzilla.redhat.com/show_bug.cgi?id=1353591#c8

I'll also quote it inline, before returning to your email:
On 03/26/18 16:10, bugzilla@redhat.com wrote:
> https://bugzilla.redhat.com/show_bug.cgi?id=1353591
>
> Laszlo Ersek <lersek@redhat.com> changed:
>
> What  |Removed                      |Added
> ----------------------------------------------------------------------------
> Flags |needinfo?(lersek@redhat.com) |
>
> --- Comment #8 from Laszlo Ersek <lersek@redhat.com> ---
> Sure, I can attempt :) The function to look at is GetFirstNonAddress()
> in "OvmfPkg/PlatformPei/MemDetect.c". I'll try to write it up here in
> natural language (although I commented the function heavily as well).
>
> As an introduction, the "number of address bits" is a quantity that
> the firmware itself needs to know, so that in the DXE phase page
> tables exist that actually map that address space. The
> GetFirstNonAddress() function (in the PEI phase) calculates the
> highest *exclusive* address that the firmware might want or need to
> use (in the DXE phase).
>
> (1) First we get the highest exclusive cold-plugged RAM address.
> (There are two methods for this: the more robust one reads QEMU's
> E820 map, while the older, less robust one calculates it from the
> CMOS.) If the result would be <4GB, then we take exactly 4GB from this
> step, because the firmware always needs to be able to address up to
> 4GB. Note that this is already somewhat non-intuitive; for example, if
> you have 4GB of RAM (as in, *amount*), it will go up to 6GB in the
> guest-phys address space (because [0x8000_0000..0xFFFF_FFFF] is not
> RAM but MMIO on q35).
>
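To make step (1) concrete, here is a minimal Python sketch. It is not the actual MemDetect.c code: the function name and the fixed 2GB low-RAM split are assumptions for illustration, taken from the q35 example in the paragraph above.

```python
GB = 1 << 30


def first_non_address_from_ram(ram_size, low_ram_limit=2 * GB):
    """Highest exclusive cold-plugged RAM address, clamped to at least 4GB.

    Assumption (from the q35 example in the text): RAM below `low_ram_limit`
    stays low, [low_ram_limit, 4GB) is MMIO, and the remaining RAM is placed
    starting at 4GB.
    """
    if ram_size <= low_ram_limit:
        ram_end = ram_size
    else:
        ram_end = 4 * GB + (ram_size - low_ram_limit)
    # The firmware always needs to address up to 4GB.
    return max(ram_end, 4 * GB)


print(first_non_address_from_ram(4 * GB) // GB)  # 4GB of RAM, as an amount
print(first_non_address_from_ram(1 * GB) // GB)  # small guest, clamped
```

With 4GB of RAM the sketch yields 6GB, matching the example above; a 1GB guest is clamped to 4GB.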
> (2) If the DXE phase is 32-bit, then we're done. (No addresses >=4GB
> can be accessed, either for RAM or MMIO.) For RHEL this is never the
> case.
>
> (3) Grab the size of the 64-bit PCI MMIO aperture. This defaults to
> 32GB, but a custom (OVMF specific) fw_cfg file (from the QEMU command
> line) can resize it or even disable it. This aperture is relevant
> because it's going to be the top of the address space that the
> firmware is interested in. If the aperture is disabled (on the QEMU
> cmdline), then we're done, and only the value from point (1) matters
> -- that determines the address width we need.
>
> (4) OK, so we have a 64-bit PCI MMIO aperture (for allocating BARs out
> of, later); we have to place it somewhere. The base cannot match the
> value from (1) directly, because that would not leave room for the
> DIMM hotplug area. So the end of that area is read from the fw_cfg
> file "etc/reserved-memory-end". DIMM hotplug is enabled iff
> "etc/reserved-memory-end" exists. If "etc/reserved-memory-end" exists,
> then it is guaranteed to be larger than the value from (1) -- i.e.,
> top of cold-plugged RAM.
>
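Steps (3) and (4) boil down to a small decision, sketched below in Python. The function and parameter names are hypothetical, and representing "aperture disabled" as a size of 0 and "fw_cfg file absent" as None are assumptions of this sketch, not a statement about how OVMF encodes them.

```python
from typing import Optional

GB = 1 << 30


def lowest_aperture_base(ram_end: int,
                         aperture_size: int,
                         reserved_memory_end: Optional[int]) -> Optional[int]:
    """Lowest address at which the 64-bit PCI MMIO aperture may start.

    ram_end             -- result of step (1)
    aperture_size       -- size from step (3); 0 means "disabled" here
    reserved_memory_end -- content of "etc/reserved-memory-end", or None
                           if the file does not exist (no DIMM hotplug)
    Returns None when the aperture is disabled, i.e. only step (1) matters.
    """
    if aperture_size == 0:
        return None
    if reserved_memory_end is not None:
        # Guaranteed by QEMU to be above the top of cold-plugged RAM.
        return reserved_memory_end
    return ram_end


print(lowest_aperture_base(6 * GB, 32 * GB, None))
print(lowest_aperture_base(6 * GB, 32 * GB, 10 * GB))
```

The returned value is only a lower bound; the rounding in the following steps can move the base higher.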
> (5) We round up the size of the 64-bit PCI aperture to a multiple of
> 1GB. We also round up the base of the same -- i.e., from (4) or (1),
> as appropriate -- to a 1GB boundary. This is inspired by SeaBIOS,
> because this lets the host map the aperture with 1GB hugepages.
>
> (6) The base address of the aperture is then rounded up so that it
> ends up aligned "naturally". "Natural" alignment means that we take
> the largest whole power of two (i.e., BAR size) that can fit *within*
> the aperture (whose size comes from (3) and (5)) and use that BAR size
> as alignment requirement. This is because the PciBusDxe driver sorts
> the BARs in decreasing size order (and equivalently, decreasing
> alignment order), for allocation in increasing address order, so if
> our aperture base is aligned sufficiently for the largest BAR that can
> theoretically fit into the aperture, then the base will be aligned
> correctly for *any* other BAR that fits.
>
> For example, if you have a 32GB aperture size, then the largest BAR
> that can fit is 32GB, so the alignment requirement in step (6) will be
> 32GB. Whereas, if the user configures a 48GB aperture size in (3),
> then your alignment will remain 32GB in step (6), because a 64GB BAR
> would not fit, and a 32GB BAR (which fits) dictates a 32GB alignment.
>
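The rounding in steps (5) and (6) can be sketched as follows. This is an illustrative Python sketch under the assumptions stated in the comments, not the firmware code; the helper names are made up for the example.

```python
GB = 1 << 30


def align_up(value, alignment):
    """Round `value` up to a multiple of `alignment` (a power of two)."""
    return (value + alignment - 1) & ~(alignment - 1)


def natural_alignment(size):
    """Largest whole power of two that fits within `size` (size > 0)."""
    return 1 << (size.bit_length() - 1)


def place_aperture(lowest_base, size):
    """Apply step (5), then step (6); returns (base, size)."""
    size = align_up(size, GB)                        # (5) size to 1GB
    base = align_up(lowest_base, GB)                 # (5) base to 1GB
    base = align_up(base, natural_alignment(size))   # (6) natural alignment
    return base, size


print(natural_alignment(32 * GB) // GB)
print(natural_alignment(48 * GB) // GB)
print(place_aperture(6 * GB + 1, 32 * GB))
```

Both a 32GB and a 48GB aperture get a 32GB alignment here, which is the behaviour described in the example above.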
> Thus we have the following "ladder" of ranges:
>
> (a) cold-plugged RAM (low, <2GB)
> (b) 32-bit PCI MMIO aperture, ECAM/MMCONFIG, APIC, pflash, etc (<4GB)
> (c) cold-plugged RAM (high, >=4GB)
> (d) DIMM hot-plug area
> (e) padding up to 1GB alignment (for hugepages)
> (f) padding up to the natural alignment of the 64-bit PCI MMIO
> aperture size (32GB by default)
> (g) 64-bit PCI MMIO aperture
>
> To my understanding, "maxmem" determines the end of (d). And, the
> address width is dictated by the end of (g).
>
> Two more examples.
>
> - If you have 36 phys address bits, that doesn't let you use
> maxmem=32G. This is because maxmem=32G puts the end of the DIMM
> hotplug area (d) strictly *above* 32GB (due to the "RAM gap" (b)),
> and then the padding (f) places the 64-bit PCI MMIO aperture at
> 64GB. So 36 phys address bits don't suffice.
>
> - On the other hand, if you have 37 phys address bits, that *should*
> let you use maxmem=64G. While the DIMM hot-plug area will end
> strictly above 64GB, the 64-bit PCI MMIO aperture (of size 32GB) can
> be placed at 96GB, so it will all fit into 128GB (i.e. 37 address
> bits).
>
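The two examples can be checked with a short Python sketch. Note the loud assumption: the end of the DIMM hotplug area is modelled as maxmem plus a 2GB "RAM gap", which is only a stand-in for "strictly above maxmem"; the real value comes from "etc/reserved-memory-end". The function name is hypothetical.

```python
GB = 1 << 30


def align_up(value, alignment):
    """Round `value` up to a multiple of `alignment` (a power of two)."""
    return (value + alignment - 1) & ~(alignment - 1)


def required_address_bits(hotplug_end, aperture_size=32 * GB):
    """Place the aperture per steps (5) and (6), return (base, end, bits)."""
    size = align_up(aperture_size, GB)
    base = align_up(hotplug_end, GB)
    base = align_up(base, 1 << (size.bit_length() - 1))
    end = base + size                    # exclusive end of range (g)
    bits = (end - 1).bit_length()        # address width needed to reach it
    return base, end, bits


for maxmem in (32 * GB, 64 * GB):
    hotplug_end = maxmem + 2 * GB        # ASSUMPTION: illustrative value only
    base, end, bits = required_address_bits(hotplug_end)
    print(f"maxmem={maxmem // GB}G: aperture at {base // GB}GB, "
          f"ends at {end // GB}GB, needs {bits} address bits")
```

Under that assumption, maxmem=32G places the aperture at 64GB and needs 37 bits (so 36 do not suffice), while maxmem=64G places it at 96GB and still fits in 37 bits.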
> Sorry if this is confusing, I got very little sleep last night.
>
Back to your email:

On 05/23/18 19:11, Marcel Apfelbaum wrote:
> I think we may be able to succeed with "standard" ACPI declarations of
> the PCI segments + placing the extra MMCONFIG ranges before the 64-bit
> PCI hotplug area.

That idea could work, but firmware will need hints about it.

Thanks!
Laszlo