From: Marcel Apfelbaum <marcel.apfelbaum@gmail.com>
To: Alex Williamson <alex.williamson@redhat.com>,
Laszlo Ersek <lersek@redhat.com>
Cc: Zihan Yang <whois.zihan.yang@gmail.com>,
qemu-devel@nongnu.org, Wei Huang <wei@redhat.com>,
Drew Jones <drjones@redhat.com>,
"Michael S. Tsirkin" <mst@redhat.com>,
Eric Auger <eauger@redhat.com>,
Igor Mammedov <imammedo@redhat.com>
Subject: Re: [Qemu-devel] [RFC 3/3] acpi-build: allocate mcfg for multiple host bridges
Date: Wed, 23 May 2018 20:00:45 +0300 [thread overview]
Message-ID: <e3f4ec71-98d2-1a82-cf8d-502bac431bc4@gmail.com> (raw)
In-Reply-To: <20180522151732.36122156@w520.home>
Hi Alex,
On 05/23/2018 12:17 AM, Alex Williamson wrote:
> On Tue, 22 May 2018 21:51:47 +0200
> Laszlo Ersek <lersek@redhat.com> wrote:
>
>> On 05/22/18 21:01, Marcel Apfelbaum wrote:
>>> Hi Laszlo,
>>>
>>> On 05/22/2018 12:52 PM, Laszlo Ersek wrote:
>>>> On 05/21/18 13:53, Marcel Apfelbaum wrote:
>>>>> On 05/20/2018 10:28 AM, Zihan Yang wrote:
>>>>>> Currently only the q35 host bridge is allocated space in the MCFG table.
>>>>>> To put pxb hosts into separate PCI domains, each of them should have its
>>>>>> own configuration space in the MCFG table.
>>>>>>
>>>>>> Signed-off-by: Zihan Yang <whois.zihan.yang@gmail.com>
>>>>>> ---
>>>>>> hw/i386/acpi-build.c | 83 +++++++++++++++++++++++++++++++++++++++-------------
>>>>>> 1 file changed, 62 insertions(+), 21 deletions(-)
>>>>>>
>>>>>> diff --git a/hw/i386/acpi-build.c b/hw/i386/acpi-build.c
>>>>>> index 9bc6d97..808d815 100644
>>>>>> --- a/hw/i386/acpi-build.c
>>>>>> +++ b/hw/i386/acpi-build.c
>>>>>> @@ -89,6 +89,7 @@
>>>>>> typedef struct AcpiMcfgInfo {
>>>>>> uint64_t mcfg_base;
>>>>>> uint32_t mcfg_size;
>>>>>> + struct AcpiMcfgInfo *next;
>>>>>> } AcpiMcfgInfo;
>>>>>> typedef struct AcpiPmInfo {
>>>>>> @@ -2427,14 +2428,15 @@ build_mcfg_q35(GArray *table_data, BIOSLinker *linker, AcpiMcfgInfo *info)
>>>>>> {
>>>>>> AcpiTableMcfg *mcfg;
>>>>>> const char *sig;
>>>>>> - int len = sizeof(*mcfg) + 1 * sizeof(mcfg->allocation[0]);
>>>>>> + int len, count = 0;
>>>>>> + AcpiMcfgInfo *cfg = info;
>>>>>> + while (cfg) {
>>>>>> + ++count;
>>>>>> + cfg = cfg->next;
>>>>>> + }
>>>>>> + len = sizeof(*mcfg) + count * sizeof(mcfg->allocation[0]);
>>>>>> mcfg = acpi_data_push(table_data, len);
>>>>>> - mcfg->allocation[0].address = cpu_to_le64(info->mcfg_base);
>>>>>> - /* Only a single allocation so no need to play with segments */
>>>>>> - mcfg->allocation[0].pci_segment = cpu_to_le16(0);
>>>>>> - mcfg->allocation[0].start_bus_number = 0;
>>>>>> - mcfg->allocation[0].end_bus_number = PCIE_MMCFG_BUS(info->mcfg_size - 1);
>>>>>> /* MCFG is used for ECAM which can be enabled or disabled by guest.
>>>>>> * To avoid table size changes (which create migration issues),
>>>>>> @@ -2448,6 +2450,17 @@ build_mcfg_q35(GArray *table_data, BIOSLinker *linker, AcpiMcfgInfo *info)
>>>>>> } else {
>>>>>> sig = "MCFG";
>>>>>> }
>>>>>> +
>>>>>> + count = 0;
>>>>>> + while (info) {
>>>>>> + mcfg[count].allocation[0].address = cpu_to_le64(info->mcfg_base);
>>>>>> + /* Only a single allocation so no need to play with segments */
>>>>>> + mcfg[count].allocation[0].pci_segment = cpu_to_le16(count);
>>>>>> + mcfg[count].allocation[0].start_bus_number = 0;
>>>>>> + mcfg[count++].allocation[0].end_bus_number = PCIE_MMCFG_BUS(info->mcfg_size - 1);
>>>>> An interesting question is whether we want to limit the MMCFG size for
>>>>> PXBs, as we may not be interested in using all the buses in a specific
>>>>> domain; for each bus we require some address space that remains unused.
>>>>>
>>>>>> + info = info->next;
>>>>>> + }
>>>>>> +
>>>>>> build_header(linker, table_data, (void *)mcfg, sig, len, 1, NULL, NULL);
>>>>>> }
>>>>>> @@ -2602,26 +2615,52 @@ struct AcpiBuildState {
>>>>>> MemoryRegion *linker_mr;
>>>>>> } AcpiBuildState;
>>>>>> -static bool acpi_get_mcfg(AcpiMcfgInfo *mcfg)
>>>>>> +static inline void cleanup_mcfg(AcpiMcfgInfo *mcfg)
>>>>>> +{
>>>>>> + AcpiMcfgInfo *tmp;
>>>>>> + while (mcfg) {
>>>>>> + tmp = mcfg->next;
>>>>>> + g_free(mcfg);
>>>>>> + mcfg = tmp;
>>>>>> + }
>>>>>> +}
>>>>>> +
>>>>>> +static AcpiMcfgInfo *acpi_get_mcfg(void)
>>>>>> {
>>>>>> Object *pci_host;
>>>>>> QObject *o;
>>>>>> + AcpiMcfgInfo *head = NULL, *tail, *mcfg;
>>>>>> pci_host = acpi_get_i386_pci_host();
>>>>>> g_assert(pci_host);
>>>>>> - o = object_property_get_qobject(pci_host, PCIE_HOST_MCFG_BASE, NULL);
>>>>>> - if (!o) {
>>>>>> - return false;
>>>>>> + while (pci_host) {
>>>>>> + mcfg = g_new0(AcpiMcfgInfo, 1);
>>>>>> + mcfg->next = NULL;
>>>>>> + if (!head) {
>>>>>> + tail = head = mcfg;
>>>>>> + } else {
>>>>>> + tail->next = mcfg;
>>>>>> + tail = mcfg;
>>>>>> + }
>>>>>> +
>>>>>> + o = object_property_get_qobject(pci_host, PCIE_HOST_MCFG_BASE, NULL);
>>>>>> + if (!o) {
>>>>>> + cleanup_mcfg(head);
>>>>>> + g_free(mcfg);
>>>>>> + return NULL;
>>>>>> + }
>>>>>> + mcfg->mcfg_base = qnum_get_uint(qobject_to(QNum, o));
>>>>>> + qobject_unref(o);
>>>>>> +
>>>>>> + o = object_property_get_qobject(pci_host, PCIE_HOST_MCFG_SIZE, NULL);
>>>>> I'll let Igor comment on the ACPI bits, but the idea of querying each
>>>>> PCI host for the MMCFG details and putting them into a different table
>>>>> entry sounds good to me.
>>>>>
>>>>> [Adding Laszlo for his insights]
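(To make the per-segment layout concrete, here is my reading of the intent
above -- a sketch only, using a hypothetical helper, not the patch as
posted -- with one allocation entry per host bridge and segment numbers
assigned in discovery order:

    static void fill_mcfg_allocations(AcpiTableMcfg *mcfg, AcpiMcfgInfo *info)
    {
        int i;

        /* One AcpiMcfgAllocation entry per host bridge / PCI segment. */
        for (i = 0; info; info = info->next, i++) {
            mcfg->allocation[i].address = cpu_to_le64(info->mcfg_base);
            mcfg->allocation[i].pci_segment = cpu_to_le16(i);
            mcfg->allocation[i].start_bus_number = 0;
            mcfg->allocation[i].end_bus_number =
                PCIE_MMCFG_BUS(info->mcfg_size - 1);
        }
    }

Note the sketch indexes the allocation[] array, i.e. mcfg->allocation[i],
rather than the table pointer itself.)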
>>>> Thanks for the CC -- I don't have many (positive) insights here to
>>>> offer, I'm afraid. First, the patch set (including the blurb) doesn't
>>>> seem to explain *why* multiple host bridges / ECAM areas are a good
>>>> idea.
>>> The issue we want to solve is the hard limitation of 256 PCIe devices
>>> that can be used in a Q35 machine.
> Isn't it interesting that conventional PCI can easily support so many
> more devices? Sorta makes you wonder why we care that virtual devices
> are express rather than conventional for a high density configuration...
>
>> Implying that there are use cases for which ~256 PCIe devices aren't
>> enough. I remain unconvinced until proved wrong :)
>>
>> Anyway, a significant source of waste comes from the restriction that we
>> can only put 1 device (with up to 8 functions) on each non-root-complex
>> PCI Express bus (such as root ports and downstream ports). This forfeits
>> a huge portion of the ECAM area (about 31/32 of it) that we already have.
>> Rather than spending more MMIO guest-phys address range on new
>> discontiguous ECAM ranges, I'd prefer if we could look into ARI. I seem
>> to recall from your earlier presentation that ARI could recover that
>> lost address space (meaning both ECAM ranges and PCIe B/D/F address space).
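(For reference, and assuming the usual ECAM layout: each bus number decodes
32 devices x 8 functions x 4 KiB = 1 MiB of ECAM, while the secondary bus
behind a root port or downstream port hosts at most one device, i.e. at
most 8 x 4 KiB = 32 KiB of it -- which is where the ~31/32 waste figure
above comes from.)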
> How does ARI solve the hotplug problem? ARI is effectively
> multifunction on steroids,
+1
> the ARI capability in each function points
> to the next function number so that we don't need to scan the entire
> devfn address space per bus (an inefficiency we don't care about when
> there are only 8 function). So yeah, we can fill an entire bus with
> devices with ARI, but they're all rooted at 00.0.
>
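Just to spell out the enumeration shortcut Alex describes, a minimal
sketch -- cfg_read16()/find_ext_cap() are hypothetical helpers, while the
constants follow the PCIe spec's ARI extended capability (ID 0x0e, Next
Function Number in bits 15:8 of the ARI Capability register at offset 4):

    #include <stdint.h>

    #define PCI_EXT_CAP_ID_ARI  0x0e
    #define PCI_ARI_CAP         0x04  /* ARI Capability register offset */

    uint16_t cfg_read16(uint8_t bus, uint8_t fn, uint16_t off);   /* hypothetical */
    uint16_t find_ext_cap(uint8_t bus, uint8_t fn, uint16_t id);  /* hypothetical */

    static void enumerate_ari_bus(uint8_t bus)
    {
        uint8_t fn = 0;

        do {
            /* probe_function(bus, fn); */
            uint16_t pos = find_ext_cap(bus, fn, PCI_EXT_CAP_ID_ARI);
            if (!pos) {
                break;  /* no ARI capability: a full devfn scan would be needed */
            }
            fn = cfg_read16(bus, fn, pos + PCI_ARI_CAP) >> 8;  /* Next Function Number */
        } while (fn);
    }

So enumeration hops from function to function instead of scanning all 256
devfns, but, as Alex says, everything is still rooted at device 00 of that
single bus.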
>> There are signs that the edk2 core supports ARI if the underlying
>> platform supports it. (Which is not the case with multiple PCIe domains
>> / multiple ECAM ranges.)
> It's pretty surprising to me that edk2 wouldn't already have support
> for multiple PCIe domains, they're really not *that* uncommon. Some
> architectures make almost gratuitous use of PCIe domains. I certainly
> know there were UEFI ia64 platforms with multiple domains.
>
>> ARI support could also help aarch64/virt. Eric (CC'd) has been working
>> on raising the max PCIe bus count from *16* to 256 for aarch64/virt, and
>> AFAIR one of the challenges there has been finding a good spot for the
>> larger ECAM in guest-phys address space. Fiddling with such address maps
>> is always a pain.
>>
>> Back to x86, the guest-phys address space is quite crowded too. The
>> 32-bit PCI MMIO aperture (with the neighboring ECAM and platform MMIO
>> areas such as LAPIC, IO-APIC, HPET, TPM, pflash, ...) is always a scarce
>> resource. Plus, reaching agreement between OVMF and QEMU on the exact
>> location of the 32-bit PCI MMIO aperture has always been a huge pain; so
>> you'd likely shoot for 64-bit.
> Why do we need to equally split 32-bit MMIO space between domains? Put
> it on domain 0 and require that devices installed into non-zero domains
> have no 32-bit dependencies.
>
>> But 64-bit is ill-partitioned and/or crowded too: first you have the
>> cold-plugged >4GB DRAM (whose size the firmware can interrogate), then
>> the hotplug DIMM area up to "etc/reserved-memory-end" (ditto), then the
>> 64-bit PCI MMIO aperture (whose size the firmware decides itself,
>> because QEMU cannot be asked about it presently). Placing the additional
>> MMCFGs somewhere would need extending the total order between these
>> things at the design level, more fw_cfg files, more sanity checks in
>> platform code in the firmware, and, quite importantly, adding support to
>> multiple discontiguous MMCFGs to core edk2 code.
> Hmm, we're doing something wrong if we can't figure out some standards
> within QEMU for placing per domain 64-bit MMIO and MMCFG ranges.
>
>> I don't know much about ARI but it already looks a whole lot more
>> attractive to me.
>>
>>> We can use "PCI functions" to increase the number, but then we lose the
>>> hot-plugging granularity.
>>>
>>> Currently pxb-pcie host bridges share the same PCI domain (0)
>>> bus numbers, so adding more of them will not result in more usable
>>> devices (each PCIe device resides on its own bus),
>>> but putting each of them on a different domain removes
>>> that limitation.
>>>
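(Concretely, today each pxb-pcie just carves bus numbers out of domain 0,
e.g. -- with made-up bus_nr values --

    -device pxb-pcie,id=pxb1,bus_nr=64,bus=pcie.0 \
    -device pxb-pcie,id=pxb2,bus_nr=128,bus=pcie.0 \

so the total number of buses behind pcie.0 plus all pxb-pcie bridges is
still capped at 256.)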
>>> Another possible use (there is a good chance I am wrong, adding Alex to
>>> correct me) may be the "modeling" of a multi-function assigned device.
>>> Currently all the functions of an assigned device will be placed on
>>> different PCIe Root Ports, "losing" the connection between them.
>>>
>>> However, if we put all the functions of the same assigned device in the
>>> same PCI domain, we may be able to "re-connect" them.
>> OK -- why is that useful? What's the current practical problem with the
>> splitting you describe?
> There are very few cases where this is useful, things like associating
> USB companion devices or translating a guest bus reset to host bus
> reset when functions are split to separate virtual buses. That said, I
> have no idea how multiple domains plays a factor here. Regardless of
> how many PCIe domains a VM supports, we can easily use existing
> multifunction within a single domain to expose multifunction assigned
> devices in a way that better resembles the physical hardware.
> Multifunction PCIe endpoints cannot span PCIe domains.
Sorry for the misunderstanding. I was referring to the complete opposite.
We want to assign two functions of the same physical device to the same VM.
They will land on different PCIe Root Ports, and we may have issues with
the data flow between them.
I was wondering if assigning them both to the same PCI domain (different
from 0) would help us avoid implementing ACS for the PCIe Root Ports. But
again, I may be way off here.
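(For completeness, the single-domain multifunction layout Alex refers to
above would look roughly like this -- with made-up host addresses --

    -device pcie-root-port,id=rp1,chassis=1 \
    -device vfio-pci,host=02:00.0,bus=rp1,addr=00.0,multifunction=on \
    -device vfio-pci,host=02:00.1,bus=rp1,addr=00.1 \

i.e. both functions of the physical device exposed as functions of one
slot, with no extra domain needed.)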
> In fact, IOMMUs generally cannot span domains either, so we better also
> be thinking about at least a VT-d DRHD or vIOMMU per PCIe domain.
>
Right
Thanks,
Marcel
[...]