From: "Michael S. Tsirkin"
Date: Wed, 23 May 2018 00:22:53 +0300
Message-ID: <20180523001822-mutt-send-email-mst@kernel.org>
In-Reply-To: <20180522151732.36122156@w520.home>
References: <1526801333-30613-1-git-send-email-whois.zihan.yang@gmail.com> <1526801333-30613-4-git-send-email-whois.zihan.yang@gmail.com> <17a3765f-b835-2d45-e8b9-ffd4aff909f9@redhat.com> <20180522151732.36122156@w520.home>
Subject: Re: [Qemu-devel] [RFC 3/3] acpi-build: allocate mcfg for multiple host bridges
To: Alex Williamson
Cc: Laszlo Ersek, Marcel Apfelbaum, Zihan Yang, qemu-devel@nongnu.org, Wei Huang, Drew Jones, Eric Auger, Igor Mammedov

On Tue, May 22, 2018 at 03:17:32PM -0600, Alex Williamson wrote:
> On Tue, 22 May 2018 21:51:47 +0200
> Laszlo Ersek wrote:
>
> > On 05/22/18 21:01, Marcel Apfelbaum wrote:
> > > Hi Laszlo,
> > >
> > > On 05/22/2018 12:52 PM, Laszlo Ersek wrote:
> > >> On 05/21/18 13:53, Marcel Apfelbaum wrote:
> > >>>
> > >>> On 05/20/2018 10:28 AM, Zihan Yang wrote:
> > >>>> Currently only the q35 host bridge is allocated space in the MCFG
> > >>>> table. To put pxb hosts into separate PCI domains, each of them
> > >>>> should have its own configuration space in the MCFG table.
> > >>>>
> > >>>> Signed-off-by: Zihan Yang
> > >>>> ---
> > >>>>  hw/i386/acpi-build.c | 83 +++++++++++++++++++++++++++++++++++++++-------------
> > >>>>  1 file changed, 62 insertions(+), 21 deletions(-)
> > >>>>
> > >>>> diff --git a/hw/i386/acpi-build.c b/hw/i386/acpi-build.c
> > >>>> index 9bc6d97..808d815 100644
> > >>>> --- a/hw/i386/acpi-build.c
> > >>>> +++ b/hw/i386/acpi-build.c
> > >>>> @@ -89,6 +89,7 @@
> > >>>>  typedef struct AcpiMcfgInfo {
> > >>>>      uint64_t mcfg_base;
> > >>>>      uint32_t mcfg_size;
> > >>>> +    struct AcpiMcfgInfo *next;
> > >>>>  } AcpiMcfgInfo;
> > >>>>
> > >>>>  typedef struct AcpiPmInfo {
> > >>>> @@ -2427,14 +2428,15 @@ build_mcfg_q35(GArray *table_data, BIOSLinker *linker, AcpiMcfgInfo *info)
> > >>>>  {
> > >>>>      AcpiTableMcfg *mcfg;
> > >>>>      const char *sig;
> > >>>> -    int len = sizeof(*mcfg) + 1 * sizeof(mcfg->allocation[0]);
> > >>>> +    int len, count = 0;
> > >>>> +    AcpiMcfgInfo *cfg = info;
> > >>>>
> > >>>> +    while (cfg) {
> > >>>> +        ++count;
> > >>>> +        cfg = cfg->next;
> > >>>> +    }
> > >>>> +    len = sizeof(*mcfg) + count * sizeof(mcfg->allocation[0]);
> > >>>>      mcfg = acpi_data_push(table_data, len);
> > >>>> -    mcfg->allocation[0].address = cpu_to_le64(info->mcfg_base);
> > >>>> -    /* Only a single allocation so no need to play with segments */
> > >>>> -    mcfg->allocation[0].pci_segment = cpu_to_le16(0);
> > >>>> -    mcfg->allocation[0].start_bus_number = 0;
> > >>>> -    mcfg->allocation[0].end_bus_number = PCIE_MMCFG_BUS(info->mcfg_size - 1);
> > >>>>
> > >>>>      /* MCFG is used for ECAM which can be enabled or disabled by guest.
> > >>>>       * To avoid table size changes (which create migration issues),
> > >>>> @@ -2448,6 +2450,17 @@ build_mcfg_q35(GArray *table_data, BIOSLinker *linker, AcpiMcfgInfo *info)
> > >>>>      } else {
> > >>>>          sig = "MCFG";
> > >>>>      }
> > >>>> +
> > >>>> +    count = 0;
> > >>>> +    while (info) {
> > >>>> +        mcfg[count].allocation[0].address = cpu_to_le64(info->mcfg_base);
> > >>>> +        /* Only a single allocation so no need to play with segments */
> > >>>> +        mcfg[count].allocation[0].pci_segment = cpu_to_le16(count);
> > >>>> +        mcfg[count].allocation[0].start_bus_number = 0;
> > >>>> +        mcfg[count++].allocation[0].end_bus_number = PCIE_MMCFG_BUS(info->mcfg_size - 1);
> > >>> An interesting point is whether we want to limit the MMCFG size for
> > >>> PXBs, as we may not be interested in using all the buses in a specific
> > >>> domain. For each bus we require some address space that remains unused.
> > >>>
> > >>>> +        info = info->next;
> > >>>> +    }
> > >>>> +
> > >>>>      build_header(linker, table_data, (void *)mcfg, sig, len, 1, NULL, NULL);
> > >>>>  }
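
Just to make the table layout explicit: the loop above emits one allocation
entry per host bridge into a single MCFG table. Per the PCI Firmware spec,
each entry is 16 bytes and looks roughly like this (a sketch only -- the
struct name here is illustrative, QEMU carries an equivalent packed
definition):

    /* One MCFG allocation entry (16 bytes), one per PCI segment. */
    struct McfgAllocation {
        uint64_t address;           /* ECAM base address for this segment */
        uint16_t pci_segment;       /* PCI segment group number */
        uint8_t  start_bus_number;  /* first bus decoded through this ECAM */
        uint8_t  end_bus_number;    /* last bus decoded through this ECAM */
        uint32_t reserved;
    } __attribute__((packed));

So an extra host bridge per PCI domain costs 16 bytes of MCFG, plus whatever
ECAM window the entry points at.
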
> > >>>> @@ -2602,26 +2615,52 @@ struct AcpiBuildState {
> > >>>>      MemoryRegion *linker_mr;
> > >>>>  } AcpiBuildState;
> > >>>>
> > >>>> -static bool acpi_get_mcfg(AcpiMcfgInfo *mcfg)
> > >>>> +static inline void cleanup_mcfg(AcpiMcfgInfo *mcfg)
> > >>>> +{
> > >>>> +    AcpiMcfgInfo *tmp;
> > >>>> +    while (mcfg) {
> > >>>> +        tmp = mcfg->next;
> > >>>> +        g_free(mcfg);
> > >>>> +        mcfg = tmp;
> > >>>> +    }
> > >>>> +}
> > >>>> +
> > >>>> +static AcpiMcfgInfo *acpi_get_mcfg(void)
> > >>>>  {
> > >>>>      Object *pci_host;
> > >>>>      QObject *o;
> > >>>> +    AcpiMcfgInfo *head = NULL, *tail, *mcfg;
> > >>>>
> > >>>>      pci_host = acpi_get_i386_pci_host();
> > >>>>      g_assert(pci_host);
> > >>>>
> > >>>> -    o = object_property_get_qobject(pci_host, PCIE_HOST_MCFG_BASE, NULL);
> > >>>> -    if (!o) {
> > >>>> -        return false;
> > >>>> +    while (pci_host) {
> > >>>> +        mcfg = g_new0(AcpiMcfgInfo, 1);
> > >>>> +        mcfg->next = NULL;
> > >>>> +        if (!head) {
> > >>>> +            tail = head = mcfg;
> > >>>> +        } else {
> > >>>> +            tail->next = mcfg;
> > >>>> +            tail = mcfg;
> > >>>> +        }
> > >>>> +
> > >>>> +        o = object_property_get_qobject(pci_host, PCIE_HOST_MCFG_BASE, NULL);
> > >>>> +        if (!o) {
> > >>>> +            cleanup_mcfg(head);
> > >>>> +            g_free(mcfg);
> > >>>> +            return NULL;
> > >>>> +        }
> > >>>> +        mcfg->mcfg_base = qnum_get_uint(qobject_to(QNum, o));
> > >>>> +        qobject_unref(o);
> > >>>> +
> > >>>> +        o = object_property_get_qobject(pci_host, PCIE_HOST_MCFG_SIZE, NULL);
> > >>> I'll let Igor comment on the ACPI bits, but the idea of querying each
> > >>> PCI host for the MMCFG details and putting them into a separate table
> > >>> entry sounds good to me.
> > >>>
> > >>> [Adding Laszlo for his insights]
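
For reference, here is what the guest-side consumption of those per-segment
entries boils down to -- a minimal sketch, with made-up names (this is not an
actual QEMU or guest API):

    #include <stdint.h>

    /* ECAM: config space of seg:bus:dev.fn sits at a fixed offset from the
     * segment's MCFG base -- 1 MiB per bus, 32 KiB per device, 4 KiB per
     * function.  Covering buses 0..255 therefore needs 256 MiB of
     * guest-phys address space per segment. */
    static uint64_t ecam_addr(uint64_t seg_base, uint8_t bus,
                              uint8_t dev, uint8_t fn, uint16_t offset)
    {
        return seg_base + ((uint64_t)bus << 20) + ((uint64_t)dev << 15) +
               ((uint64_t)fn << 12) + offset;
    }

Which is why Marcel's point above about capping the bus range per PXB
matters: every bus in the range costs 1 MiB of ECAM whether or not anything
sits on it.
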
> > >> Thanks for the CC -- I don't have many (positive) insights here to
> > >> offer, I'm afraid. First, the patch set (including the blurb) doesn't
> > >> seem to explain *why* multiple host bridges / ECAM areas are a good
> > >> idea.
> > >
> > > The issue we want to solve is the hard limitation of 256 PCIe devices
> > > that can be used in a Q35 machine.
>
> Isn't it interesting that conventional PCI can easily support so many
> more devices?  Sorta makes you wonder why we care that virtual devices
> are express rather than conventional for a high density configuration...

Because they are express in real life?

> > Implying that there are use cases for which ~256 PCIe devices aren't
> > enough. I remain unconvinced until proved wrong :)
> >
> > Anyway, a significant source of waste comes from the restriction that we
> > can only put 1 device (with up to 8 functions) on each non-root-complex
> > PCI Express bus (such as root ports and downstream ports). This forfeits
> > a huge portion of the ECAM area (about 31/32) that we already have.
> > Rather than spending more MMIO guest-phys address range on new
> > discontiguous ECAM ranges, I'd prefer if we could look into ARI. I seem
> > to recall from your earlier presentation that ARI could recover that
> > lost address space (meaning both ECAM ranges and PCIe B/D/F address space).
>
> How does ARI solve the hotplug problem?  ARI is effectively
> multifunction on steroids: the ARI capability in each function points
> to the next function number, so that we don't need to scan the entire
> devfn address space per bus (an inefficiency we don't care about when
> there are only 8 functions).  So yeah, we can fill an entire bus with
> devices with ARI, but they're all rooted at 00.0.
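
To spell out the enumeration Alex describes, here is a rough sketch of
walking an ARI bus. The config-space accessor is hypothetical (cfg_read32()
is not a real API), only the capability layout comes from the PCIe spec:

    #include <stdint.h>

    /* Hypothetical config-space read: with ARI, the 8-bit function number
     * replaces the usual 5-bit device / 3-bit function split on this bus. */
    uint32_t cfg_read32(uint8_t bus, uint8_t fn, uint16_t offset);

    #define PCI_EXT_CAP_ID_ARI  0x000e

    static uint16_t find_ext_cap(uint8_t bus, uint8_t fn, uint16_t cap_id)
    {
        uint16_t off = 0x100;            /* extended caps start at 0x100 */

        while (off) {                    /* simplified: no loop guard */
            uint32_t hdr = cfg_read32(bus, fn, off);
            if ((hdr & 0xffff) == cap_id) {
                return off;
            }
            off = hdr >> 20;             /* "next capability" pointer */
        }
        return 0;
    }

    static void walk_ari_bus(uint8_t bus)
    {
        uint8_t fn = 0;                  /* the chain is rooted at fn 0 */

        do {
            uint16_t ari = find_ext_cap(bus, fn, PCI_EXT_CAP_ID_ARI);
            if (!ari) {
                break;
            }
            /* ... probe function 'fn' here ... */
            /* ARI Capability register, bits 15:8: next function number. */
            fn = (cfg_read32(bus, fn, ari + 4) >> 8) & 0xff;
        } while (fn);
    }

Note how everything hangs off function 0, which is exactly the hotplug
granularity problem: you can't hot-add or hot-remove one of those functions
on its own.
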
>
> > There are signs that the edk2 core supports ARI if the underlying
> > platform supports it. (Which is not the case with multiple PCIe domains
> > / multiple ECAM ranges.)
>
> It's pretty surprising to me that edk2 wouldn't already have support
> for multiple PCIe domains, they're really not *that* uncommon.  Some
> architectures make almost gratuitous use of PCIe domains.  I certainly
> know there were UEFI ia64 platforms with multiple domains.
>
> > ARI support could also help aarch64/virt. Eric (CC'd) has been working
> > on raising the max PCIe bus count from *16* to 256 for aarch64/virt, and
> > AFAIR one of the challenges there has been finding a good spot for the
> > larger ECAM in guest-phys address space. Fiddling with such address maps
> > is always a pain.
> >
> > Back to x86, the guest-phys address space is quite crowded too. The
> > 32-bit PCI MMIO aperture (with the neighboring ECAM and platform MMIO
> > areas such as LAPIC, IO-APIC, HPET, TPM, pflash, ...) is always a scarce
> > resource. Plus, reaching agreement between OVMF and QEMU on the exact
> > location of the 32-bit PCI MMIO aperture has always been a huge pain; so
> > you'd likely shoot for 64-bit.
>
> Why do we need to equally split 32-bit MMIO space between domains?  Put
> it on domain 0 and require that devices installed into non-zero domains
> have no 32-bit dependencies.

+1

> > But 64-bit is ill-partitioned and/or crowded too: first you have the
> > cold-plugged >4GB DRAM (whose size the firmware can interrogate), then
> > the hotplug DIMM area up to "etc/reserved-memory-end" (ditto), then the
> > 64-bit PCI MMIO aperture (whose size the firmware decides itself,
> > because QEMU cannot be asked about it presently). Placing the additional
> > MMCFGs somewhere would need extending the total order between these
> > things at the design level, more fw_cfg files, more sanity checks in
> > platform code in the firmware, and, quite importantly, adding support
> > for multiple discontiguous MMCFGs to core edk2 code.
>
> Hmm, we're doing something wrong if we can't figure out some standards
> within QEMU for placing per-domain 64-bit MMIO and MMCFG ranges.

I thought in this patch it is done by firmware.

> > I don't know much about ARI but it already looks a whole lot more
> > attractive to me.
> >
> > > We can use "PCI functions" to increase
> > > the number, but then we lose the hot-plugging granularity.
> > >
> > > Currently pxb-pcie host bridges share the same PCI domain (0)
> > > bus numbers, so adding more of them will not result in more usable
> > > devices (each PCIe device resides on its own bus),
> > > but putting each of them on a different domain removes
> > > that limitation.
> > >
> > > Another possible use (there is a good chance I am wrong, adding Alex
> > > to correct me) may be the "modeling" of a multi-function assigned
> > > device. Currently all the functions of an assigned device will be
> > > placed on different PCIe Root Ports, "losing" the connection between
> > > them.
> > >
> > > However, if we put all the functions of the same assigned device in
> > > the same PCI domain we may be able to "re-connect" them.
> >
> > OK -- why is that useful? What's the current practical problem with the
> > splitting you describe?
>
> There are very few cases where this is useful, things like associating
> USB companion devices or translating a guest bus reset to host bus
> reset when functions are split to separate virtual buses.  That said, I
> have no idea how multiple domains play a factor here.  Regardless of
> how many PCIe domains a VM supports, we can easily use existing
> multifunction within a single domain to expose multifunction assigned
> devices in a way that better resembles the physical hardware.
> Multifunction PCIe endpoints cannot span PCIe domains.
>
> In fact, IOMMUs generally cannot span domains either, so we better also
> be thinking about at least a VT-d DRHD or vIOMMU per PCIe domain.
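
That last point is worth underlining: in the DMAR table each remapping unit
is scoped to a single PCI segment, so every additional PCIe domain would need
its own DRHD entry (and, with a vIOMMU, its own register block). Roughly, as
a sketch of the layout in the VT-d spec (field names illustrative):

    /* DMAR "DMA Remapping Hardware Unit Definition" (DRHD) entry --
     * note the per-entry PCI segment number. */
    struct DmarHardwareUnit {
        uint16_t type;         /* 0 = DRHD */
        uint16_t length;
        uint8_t  flags;        /* bit 0: INCLUDE_PCI_ALL for this segment */
        uint8_t  reserved;
        uint16_t pci_segment;  /* the one segment this unit covers */
        uint64_t address;      /* remapping hardware register base */
        /* device scope entries follow */
    } __attribute__((packed));
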
> > >>  Second, supporting such would take very intrusive patches / large
> > >> feature development for OVMF (and that's not just for OvmfPkg and
> > >> ArmVirtPkg platform code, but also core MdeModulePkg code).
> > >>
> > >> Obviously I'm not saying this feature should not be implemented for
> > >> QEMU + SeaBIOS, just don't expect edk2 support for it anytime soon.
> > >> Without a convincing use case, even the review impact seems hard to
> > >> justify.
> > >
> > > I understand, thank you for the feedback, the point was to be sure
> > > we don't decide something that *can't be done* in OVMF.
> >
> > Semantics :) Obviously, everything "can be done" in software; that's why
> > it's *soft*ware. Who is going to write it in practice? It had taken
> > years until the edk2 core gained a universal PciHostBridgeDxe driver
> > with a well-defined platform customization interface, and that interface
> > doesn't support multiple domains / segments. It had also taken years
> > until the same edk2 core driver gained support for nonzero MMIO address
> > translation (i.e. where the CPU view of system RAM addresses differs
> > from the PCI device view of the same, for DMA purposes) -- and that's
> > just a linear translation, not even about IOMMUs. The PCI core
> > maintainers in edk2 are always very busy, and upstreaming such changes
> > tends to take forever.
>
> Wow, that's surprising, ia64 was doing all of that on UEFI platforms a
> decade ago.
>
> > Again, I'm not opposing this feature per se. I just see any potential
> > support for it in edk2 as very difficult. I remain unconvinced that ~256
> > PCIe devices aren't enough. Even assuming I'm proved plain wrong there,
> > I'd still much prefer ARI (although I don't know much about it at this
> > point), because (a) the edk2 core already shows signs that it intends to
> > support ARI, (b) ARI would help aarch64/virt with nearly the same
> > effort, (c) it seems that ARI would address the problem (namely, wasting
> > MMCFG) head-on, rather than throwing yet more MMCFG at it.
>
> -ENOHOTPLUG.  From an assigned device perspective, this also implies
> that we're potentially adding a PCIe extended capability that may not
> exist on the physical device to the virtual view of the device.
> There's no guarantee that we can place that capability in the device's
> extended config space without stepping on registers that the hardware
> has hidden between the capabilities (for real, GPUs have registers in
> extended config space that aren't covered by capabilities) or hitting
> the unlikely event that there's no more room in extended config space.
> Thanks,
>
> Alex