* [RFC PATCH 0/8] hw/arm/virt, hw/pci: PCI pre-enumeration and fixed BAR allocation
@ 2026-05-08 18:37 Tushar Dave
  2026-05-08 18:37 ` [RFC PATCH 1/8] hw/pci: add fixed-bars property to allow fixed BAR addresses Tushar Dave
                   ` (10 more replies)
  0 siblings, 11 replies; 23+ messages in thread
From: Tushar Dave @ 2026-05-08 18:37 UTC (permalink / raw)
  To: qemu-devel
  Cc: alwilliamson, jgg, skolothumtho, qemu-arm, peter.maydell, mst,
	marcel.apfelbaum, devel

This RFC introduces a mechanism to specify Guest Physical Addresses
(GPAs) for PCI BARs, allowing explicit placement of guest MMIO BAR
addresses to match host physical addresses for assigned devices.

On some platforms, P2P DMA is performed between devices within the same
IOMMU group. To achieve the required performance, ACS in the PCI fabric
is configured to permit direct P2P traffic that does not traverse the
host bridge.

To support this multi-device IOMMU group P2P scenario in virtualization,
the VM may need to place its MMIO BARs at the same guest physical
addresses as in the host physical address layout.

This series implements a per-device PCI property, "fixed-bars", which
allows users to specify fixed BAR addresses. The property is generic
and is available on any PCI-capable machine. It is a comma-separated
list of BAR assignments:

        barN@<addr>[,barM@<addr>]*

The virt machine builds on this with two additional machine properties:

pci-pre-enum
    When enabled, QEMU performs PCI enumeration and resource assignment
    before handing control to firmware (e.g. EDK2). This includes
    programming 64-bit prefetchable BARs according to fixed-bars
    assignments, and programming bridge prefetchable windows.
    A "pci-enum-done" device-tree property is set so firmware preserves
    the configuration.

pcie-mmio-window
    Defines the MMIO64 window for PCIe devices. When using fixed-bars,
    this allows the aperture to be resized or repositioned so all
    assigned BARs fall within a valid address range.

Why QEMU programs PCI resources rather than EDK2:

To support fixed BAR placement, QEMU performs PCI bus enumeration and
resource assignment prior to firmware execution. EDK2 already provides
a PCD-controlled mechanism (PcdPciDisableBusEnumeration) that allows
the platform to skip PCI enumeration and resource allocation. This
series leverages that mechanism so that, when enabled, firmware runs in
a discovery-only mode and preserves the configuration established by
QEMU.

When pci-pre-enum is enabled, QEMU runs PCI enumeration and resource
allocation, prioritizing fixed BARs specified via fixed-bars. If
allocation fails due to alignment, overlap, or address space constraints,
QEMU terminates with an error. Otherwise, all BARs and bridge windows are
fully programmed before firmware execution.
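The alignment, window, and overlap constraints that make allocation fail
amount to simple arithmetic on power-of-two-sized ranges. A minimal
stand-alone sketch (the helper names are invented for illustration and
are not the code in this series):

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* A BAR address must be naturally aligned to the BAR size (a power of two). */
static bool bar_aligned(uint64_t addr, uint64_t size)
{
    return (addr & (size - 1)) == 0;
}

/* [addr, addr+size) must fall inside the MMIO64 window [wbase, wbase+wsize). */
static bool bar_in_window(uint64_t addr, uint64_t size,
                          uint64_t wbase, uint64_t wsize)
{
    return addr >= wbase && size <= wsize && addr - wbase <= wsize - size;
}

/* Two half-open ranges overlap iff each one starts before the other ends. */
static bool bar_overlaps(uint64_t a, uint64_t asize, uint64_t b, uint64_t bsize)
{
    return a < b + bsize && b < a + asize;
}
```

Any fixed BAR failing one of these three predicates is the kind of input
that makes QEMU terminate with an error before firmware runs.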

There is certainly room for improvement, but this RFC aims to gather
feedback on the overall approach chosen to address this problem.

We use the virt machine in this series as the concrete example
consuming the fixed-BAR model. Other machines may require their own
machine-specific mechanism (such as pcie-mmio-window) if they want to
adopt the same approach.

Example usage:

  -machine virt,...,pcie-mmio-window=0x400000000000:0x400000000000,pci-pre-enum=on \
  -device vfio-pci,host=0009:06:00.0,id=dev0 \
  -set device.dev0.fixed-bars=bar2@0x6b8000000000,bar4@0x6c8000000000

Testing:
This series was tested on NVIDIA GB300 platforms with a recent Linux
kernel. GPUDirect P2P between a GPU and a CX8 NIC requires a PCIe
topology in the VM that mirrors bare metal (e.g. both devices under the
same switch and ACS tuned for the minimal P2P paths needed for GPUDirect
RDMA).

TODO:
- The fixed BAR allocator handles 64-bit prefetchable BARs and related
  bridge prefetch windows only. Programming PIO, 32-bit MMIO, and
  64-bit non-prefetchable BARs, and sizing bridge windows for those
  resource types, is left for follow-up patches.
- SR-IOV virtual functions are not included when sizing bridge prefetch
  apertures and may require additional work.
- Add ACPI _DSM so the fixed BARs are preserved.


A git branch with this series applied is available at:
https://github.com/tdavenvidia/upstream-qemu/commits/upstream_May_08_26/

The related EDK2 change is available at:
https://github.com/tdavenvidia/edk2/commits/upstream_May_08_26/

Tushar Dave (8):
  hw/pci: add fixed-bars property to allow fixed BAR addresses
  hw/pci: enumerate PCI bus and program bridge bus numbers
  hw/pci: introduce allocator for fixed BAR placement
  hw/pci: pack remaining BARs and update bridge windows
  hw/pci: allocate remaining BARs for buses without fixed BARs
  hw/pci: finalize bridge prefetch windows after BAR allocation
  hw/arm/virt: add pcie-mmio-window machine property
  hw/arm/virt: add pci-pre-enum machine property

 hw/arm/virt.c               |  157 ++++-
 hw/pci/meson.build          |    2 +
 hw/pci/pci-enumerate.c      |  144 +++++
 hw/pci/pci-enumerate.h      |   15 +
 hw/pci/pci-resource.c       | 1099 +++++++++++++++++++++++++++++++++++
 hw/pci/pci-resource.h       |   82 +++
 hw/pci/pci.c                |  108 ++++
 include/hw/arm/virt.h       |    3 +
 include/hw/pci/pci_device.h |   10 +
 9 files changed, 1615 insertions(+), 5 deletions(-)
 create mode 100644 hw/pci/pci-enumerate.c
 create mode 100644 hw/pci/pci-enumerate.h
 create mode 100644 hw/pci/pci-resource.c
 create mode 100644 hw/pci/pci-resource.h

-- 
2.34.1



^ permalink raw reply	[flat|nested] 23+ messages in thread

* [RFC PATCH 1/8] hw/pci: add fixed-bars property to allow fixed BAR addresses
  2026-05-08 18:37 [RFC PATCH 0/8] hw/arm/virt, hw/pci: PCI pre-enumeration and fixed BAR allocation Tushar Dave
@ 2026-05-08 18:37 ` Tushar Dave
  2026-05-08 18:37 ` [RFC PATCH 2/8] hw/pci: enumerate PCI bus and program bridge bus numbers Tushar Dave
                   ` (9 subsequent siblings)
  10 siblings, 0 replies; 23+ messages in thread
From: Tushar Dave @ 2026-05-08 18:37 UTC (permalink / raw)
  To: qemu-devel
  Cc: alwilliamson, jgg, skolothumtho, qemu-arm, peter.maydell, mst,
	marcel.apfelbaum, devel

Introduce a per-device fixed-bars property that lets users supply fixed
PCI BAR addresses for PCI endpoint devices.

The fixed-bars property is not supported on hot-plugged PCI devices;
BARs of a hot-plugged device are programmed by the guest at the time
the device appears.

Property format:
- Comma-separated list of BAR entries, each as:
  barN@<addr>[,barM@<addr>]*

- Example:
  -device vfio-pci,...,fixed-bars=bar2@0x6b8000000000

- Multiple BARs:
  -device vfio-pci,host=...,id=dev0
  -set dev0.fixed-bars=bar0@0x400000000000,bar4@0x410000000000
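Independent of the QEMU implementation in the diff below, the accepted
shape of a single entry can be sketched as a small stand-alone parser
(the helper names and return convention here are illustrative only):

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Parse one "barN@<addr>" entry; returns 0 on success, -1 on bad input. */
static int parse_fixed_bar_entry(const char *entry, int *bar, uint64_t *addr)
{
    char *end;

    if (strncmp(entry, "bar", 3) != 0 ||
        entry[3] < '0' || entry[3] > '5' || entry[4] != '@') {
        return -1;     /* only bar0..bar5 followed by '@' are accepted */
    }
    *bar = entry[3] - '0';
    errno = 0;
    *addr = strtoull(entry + 5, &end, 0);    /* base 0: hex or decimal */
    if (errno || end == entry + 5 || *end != '\0') {
        return -1;     /* no digits, overflow, or trailing junk */
    }
    return 0;
}

/* Convenience for testing: the parsed address, or UINT64_MAX on error. */
static uint64_t parsed_addr(const char *entry)
{
    int bar;
    uint64_t addr;
    return parse_fixed_bar_entry(entry, &bar, &addr) == 0 ? addr : UINT64_MAX;
}
```

The real property parser additionally rejects duplicate BAR indices and
hot-plugged devices, as described above.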

Signed-off-by: Tushar Dave <tdave@nvidia.com>
---
 hw/pci/pci.c                | 108 ++++++++++++++++++++++++++++++++++++
 include/hw/pci/pci_device.h |  10 ++++
 2 files changed, 118 insertions(+)

diff --git a/hw/pci/pci.c b/hw/pci/pci.c
index 2c3657d00d..054fc2c0fa 100644
--- a/hw/pci/pci.c
+++ b/hw/pci/pci.c
@@ -50,6 +50,7 @@
 #include "hw/core/boards.h"
 #include "hw/nvram/fw_cfg.h"
 #include "qapi/error.h"
+#include "qapi/util.h"
 #include "qemu/cutils.h"
 #include "pci-internal.h"
 
@@ -81,6 +82,7 @@ static const Property pci_props[] = {
     DEFINE_PROP_STRING("romfile", PCIDevice, romfile),
     DEFINE_PROP_UINT32("romsize", PCIDevice, romsize, UINT32_MAX),
     DEFINE_PROP_INT32("rombar",  PCIDevice, rom_bar, -1),
+    DEFINE_PROP_STRING("fixed-bars", PCIDevice, fixed_bars),
     DEFINE_PROP_BIT("multifunction", PCIDevice, cap_present,
                     QEMU_PCI_CAP_MULTIFUNCTION_BITNR, false),
     DEFINE_PROP_BIT("x-pcie-lnksta-dllla", PCIDevice, cap_present,
@@ -218,6 +220,103 @@ static void pci_bus_unrealize(BusState *qbus)
     vmstate_unregister(NULL, &vmstate_pcibus, bus);
 }
 
+#define FIXED_BARS_ERR "fixed-bars: expected barN@<addr>[,barM@<addr>]*; "
+
+static int pci_parse_bar_token(const char *tok, Error **errp)
+{
+    int v = qapi_enum_parse(&OffAutoPCIBAR_lookup, tok, -1, errp);
+
+    if (v < 0) {
+        return -1;
+    }
+    if (v < OFF_AUTO_PCIBAR_BAR0) {
+        error_setg(errp, FIXED_BARS_ERR "invalid BAR '%s', expected bar0..bar5", tok);
+        return -1;
+    }
+    return v - OFF_AUTO_PCIBAR_BAR0;
+}
+
+/*
+ * Parse fixed-bars=barN@<addr>[,barM@<addr>]*
+ * BAR type, size, and alignment validation is deferred to the allocator,
+ * which has the full device context needed to perform those checks.
+ * On error, sets *@errp.
+ */
+static void pci_parse_fixed_bars(PCIDevice *pci_dev, Error **errp)
+{
+    Error *local_err = NULL;
+    char **entries = NULL;
+    char **parts = NULL;
+    const char *endp;
+    char **e;
+    uint64_t bar_addr;
+    int index;
+    int i, ret;
+
+    if (!pci_dev->fixed_bars || !*pci_dev->fixed_bars) {
+        return;
+    }
+    if (DEVICE(pci_dev)->hotplugged) {
+        error_setg(&local_err,
+                   "fixed-bars is not supported on hot-plugged PCI devices");
+        goto out;
+    }
+
+    entries = g_strsplit(pci_dev->fixed_bars, ",", -1);
+    for (e = entries; e && *e; e++) {
+        const char *entry = g_strstrip(*e);
+        if (*entry == '\0') {
+            error_setg(&local_err, FIXED_BARS_ERR "empty field in list");
+            goto out;
+        }
+
+        parts = g_strsplit(entry, "@", 2);
+        if (!parts[0] || !parts[1]) {
+            error_setg(&local_err, FIXED_BARS_ERR "not '%s'", entry);
+            goto out;
+        }
+
+        index = pci_parse_bar_token(parts[0], &local_err);
+        if (index < 0) {
+            goto out;
+        }
+
+        ret = qemu_strtou64(parts[1], &endp, 0, &bar_addr);
+        if (ret) {
+            error_setg(&local_err, FIXED_BARS_ERR "unparseable address in '%s'",
+                       entry);
+            goto out;
+        }
+        if (*endp != '\0') {
+            error_setg(&local_err, FIXED_BARS_ERR "trailing data after address in '%s'",
+                       entry);
+            goto out;
+        }
+        g_clear_pointer(&parts, g_strfreev);
+
+        if (!pci_dev->fixed_bar_addrs) {
+            pci_dev->fixed_bar_addrs = g_new(pcibus_t, PCI_NUM_REGIONS - 1);
+            for (i = 0; i < PCI_NUM_REGIONS - 1; i++) {
+                pci_dev->fixed_bar_addrs[i] = PCI_BAR_UNMAPPED;
+            }
+        }
+        if (pci_dev->fixed_bar_addrs[index] != PCI_BAR_UNMAPPED) {
+            error_setg(&local_err, FIXED_BARS_ERR "bar%d specified more than once",
+                       index);
+            goto out;
+        }
+        pci_dev->fixed_bar_addrs[index] = (pcibus_t)bar_addr;
+    }
+
+out:
+    g_clear_pointer(&parts, g_strfreev);
+    g_strfreev(entries);
+    if (local_err) {
+        g_clear_pointer(&pci_dev->fixed_bar_addrs, g_free);
+        error_propagate(errp, local_err);
+    }
+}
+
 static int pcibus_num(PCIBus *bus)
 {
     if (pci_bus_is_root(bus)) {
@@ -1473,6 +1572,8 @@ static void pci_qdev_unrealize(DeviceState *dev)
     pci_del_option_rom(pci_dev);
     pcie_sriov_unregister_device(pci_dev);
 
+    g_clear_pointer(&pci_dev->fixed_bar_addrs, g_free);
+
     if (pc->exit) {
         pc->exit(pci_dev);
     }
@@ -2369,6 +2470,13 @@ static void pci_qdev_realize(DeviceState *qdev, Error **errp)
         is_default_rom = true;
     }
 
+    pci_parse_fixed_bars(pci_dev, &local_err);
+    if (local_err) {
+        error_propagate(errp, local_err);
+        pci_qdev_unrealize(DEVICE(pci_dev));
+        return;
+    }
+
     pci_add_option_rom(pci_dev, is_default_rom, &local_err);
     if (local_err) {
         error_propagate(errp, local_err);
diff --git a/include/hw/pci/pci_device.h b/include/hw/pci/pci_device.h
index 5cac6e1688..3e46876985 100644
--- a/include/hw/pci/pci_device.h
+++ b/include/hw/pci/pci_device.h
@@ -179,6 +179,16 @@ struct PCIDevice {
     char *failover_pair_id;
     uint32_t acpi_index;
 
+    /*
+     * When fixed-bars property is in use, fixed_bar_addrs is non-NULL
+     * and has PCI_NUM_REGIONS - 1 elements (bar0..bar5); each slot is
+     * either PCI_BAR_UNMAPPED (no fixed address for that BAR) or the
+     * fixed address for that BAR.  NULL if the property is unused/empty
+     * or the map is not yet allocated.
+     */
+    char *fixed_bars;
+    pcibus_t *fixed_bar_addrs;
+
     /*
      * Indirect DMA region bounce buffer size as configured for the device. This
      * is a configuration parameter that is reflected into bus_master_as when
-- 
2.34.1




* [RFC PATCH 2/8] hw/pci: enumerate PCI bus and program bridge bus numbers
  2026-05-08 18:37 [RFC PATCH 0/8] hw/arm/virt, hw/pci: PCI pre-enumeration and fixed BAR allocation Tushar Dave
  2026-05-08 18:37 ` [RFC PATCH 1/8] hw/pci: add fixed-bars property to allow fixed BAR addresses Tushar Dave
@ 2026-05-08 18:37 ` Tushar Dave
  2026-05-08 18:37 ` [RFC PATCH 3/8] hw/pci: introduce allocator for fixed BAR placement Tushar Dave
                   ` (8 subsequent siblings)
  10 siblings, 0 replies; 23+ messages in thread
From: Tushar Dave @ 2026-05-08 18:37 UTC (permalink / raw)
  To: qemu-devel
  Cc: alwilliamson, jgg, skolothumtho, qemu-arm, peter.maydell, mst,
	marcel.apfelbaum, devel

When guest firmware is told not to perform bus enumeration, QEMU
must program bridge primary, secondary, and subordinate bus number
registers before handing control to firmware. Walk the hierarchy
under the root bus, assign secondary bus numbers in firmware-like
order (PXB roots first by bus number, then PCI bridges by devfn),
and program those bridge registers.

Note that SR-IOV bus number allocation (VF offset/stride/NumVFs) is not
handled in this commit and requires additional work.
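As a stand-alone illustration of the depth-first numbering scheme (this
is not the code in the patch; the toy tree type is invented for the
example):

```c
#include <assert.h>
#include <stdint.h>

/* Toy bridge tree: each node is a bridge with up to 4 child bridges. */
typedef struct ToyBridge {
    uint8_t primary, secondary, subordinate;
    struct ToyBridge *child[4];
} ToyBridge;

/*
 * Depth-first assignment: a bridge's secondary bus gets the next free
 * number, its children are numbered recursively, and its subordinate
 * is the highest bus number found underneath it.
 */
static uint8_t assign_bus_numbers(ToyBridge *br, uint8_t primary,
                                  uint8_t *next)
{
    br->primary = primary;
    br->secondary = (*next)++;
    br->subordinate = br->secondary;
    for (int i = 0; i < 4 && br->child[i]; i++) {
        br->subordinate = assign_bus_numbers(br->child[i],
                                             br->secondary, next);
    }
    return br->subordinate;
}
```

The QEMU code follows the same recursion but additionally orders PXB
roots before ordinary bridges and writes the results into the bridges'
config space.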

Signed-off-by: Tushar Dave <tdave@nvidia.com>
---
 hw/pci/meson.build     |   1 +
 hw/pci/pci-enumerate.c | 144 +++++++++++++++++++++++++++++++++++++++++
 hw/pci/pci-enumerate.h |  15 +++++
 3 files changed, 160 insertions(+)
 create mode 100644 hw/pci/pci-enumerate.c
 create mode 100644 hw/pci/pci-enumerate.h

diff --git a/hw/pci/meson.build b/hw/pci/meson.build
index a6cbd89c0a..7e8f5bb87d 100644
--- a/hw/pci/meson.build
+++ b/hw/pci/meson.build
@@ -5,6 +5,7 @@ pci_ss.add(files(
   'pci.c',
   'pci_bridge.c',
   'pci_host.c',
+  'pci-enumerate.c',
   'pci-hmp-cmds.c',
   'pci-qmp-cmds.c',
   'pcie_sriov.c',
diff --git a/hw/pci/pci-enumerate.c b/hw/pci/pci-enumerate.c
new file mode 100644
index 0000000000..2c6d25b25d
--- /dev/null
+++ b/hw/pci/pci-enumerate.c
@@ -0,0 +1,144 @@
+/*
+ * Copyright (C) 2026 NVIDIA
+ * Written by Tushar Dave
+ *
+ * SPDX-License-Identifier: GPL-2.0-or-later
+ */
+
+#include "qemu/osdep.h"
+#include "hw/pci/pci.h"
+#include "hw/pci/pci_bridge.h"
+#include "hw/pci/pci_bus.h"
+#include "hw/pci/pci-enumerate.h"
+
+/* Forward declaration */
+static uint8_t pci_program_bus_numbers(PCIBus *bus, uint8_t current_bus_num,
+                                       uint8_t *next_bus_num);
+
+static int cmp_bus_by_devfn(gconstpointer a, gconstpointer b)
+{
+    PCIBus *bus_a = *(PCIBus * const *)a;
+    PCIBus *bus_b = *(PCIBus * const *)b;
+    return (int)bus_a->parent_dev->devfn - (int)bus_b->parent_dev->devfn;
+}
+
+static int cmp_bus_by_num(gconstpointer a, gconstpointer b)
+{
+    PCIBus *bus_a = *(PCIBus * const *)a;
+    PCIBus *bus_b = *(PCIBus * const *)b;
+    return pci_bus_num(bus_a) - pci_bus_num(bus_b);
+}
+
+/*
+ * Program one bridge's primary, secondary and subordinate bus numbers
+ * and recurse. Return the max subordinate bus number.
+ */
+static uint8_t pci_program_bridge(PCIDevice *bridge, PCIBus *child_bus,
+                                  uint8_t current_bus_num,
+                                  uint8_t *next_bus_num)
+{
+    uint8_t secondary, max_child;
+
+    /* Bus number space exhausted; no bus number to assign. */
+    if (*next_bus_num == 0) {
+        return current_bus_num;
+    }
+    secondary = *next_bus_num;
+    (*next_bus_num)++;
+
+    pci_default_write_config(bridge, PCI_PRIMARY_BUS, current_bus_num, 1);
+    pci_default_write_config(bridge, PCI_SECONDARY_BUS, secondary, 1);
+    /*
+     * Unlike real hardware, QEMU does not require opening a subordinate
+     * aperture before scanning downstream devices.  Write secondary as
+     * a placeholder; the final value is set after recursion below.
+     */
+    pci_default_write_config(bridge, PCI_SUBORDINATE_BUS, secondary, 1);
+
+    max_child = pci_program_bus_numbers(child_bus, secondary, next_bus_num);
+    pci_default_write_config(bridge, PCI_SUBORDINATE_BUS, max_child, 1);
+    return max_child;
+}
+
+/*
+ * Program bus numbers for this bus and all subordinates.
+ * - current_bus_num: this bus' number (0 for root, or already set for PXB).
+ * - next_bus_num: next free bus number to assign to a bridge.
+ *
+ * Children come from bus->child only. Two kinds:
+ * 1) PXB (extra root): child has PCI_BUS_IS_ROOT. Has bus number
+ *    already set, recurse only.
+ * 2) Normal bridge: parent is IS_PCI_BRIDGE. Assign secondary = *next_bus_num,
+ *    program primary, secondary and subordinate bus numbers, and recurse.
+ *
+ * Order matches EDK2 PciBusDxe enumeration: process PXB children first
+ * (sorted by bus number), then bridges (sorted by devfn).
+ */
+static uint8_t pci_program_bus_numbers(PCIBus *bus, uint8_t current_bus_num,
+                                       uint8_t *next_bus_num)
+{
+    PCIBus *child_bus;
+    GArray *pxb_buses = g_array_new(false, false, sizeof(PCIBus *));
+    GArray *bridges = g_array_new(false, false, sizeof(PCIBus *));
+    uint8_t max_subordinate = current_bus_num;
+    uint8_t child_num;
+    uint8_t one_max;
+    guint i;
+
+    /* Single pass over bus->child: split into PXB vs bridge */
+    QLIST_FOREACH(child_bus, &bus->child, sibling) {
+        if (!child_bus->parent_dev) {
+            continue;
+        }
+        if (pci_bus_is_root(child_bus)) {
+            /* PXB or similar: bus number already set (e.g. bus_nr=1, 9) */
+            g_array_append_val(pxb_buses, child_bus);
+        } else if (IS_PCI_BRIDGE(child_bus->parent_dev)) {
+            g_array_append_val(bridges, child_bus);
+        }
+    }
+
+    /* PXB first, sorted by bus number (e.g. 1 before 9) */
+    if (pxb_buses->len > 1) {
+        g_array_sort(pxb_buses, cmp_bus_by_num);
+    }
+    for (i = 0; i < pxb_buses->len; i++) {
+        child_bus = g_array_index(pxb_buses, PCIBus *, i);
+        child_num = (uint8_t)pci_bus_num(child_bus);
+        if (child_num + 1 > *next_bus_num) {
+            *next_bus_num = child_num + 1;
+        }
+        one_max = pci_program_bus_numbers(child_bus, child_num, next_bus_num);
+        if (one_max > max_subordinate) {
+            max_subordinate = one_max;
+        }
+    }
+    g_array_free(pxb_buses, true);
+
+    /* Bridges second, sorted by devfn */
+    if (bridges->len > 1) {
+        g_array_sort(bridges, cmp_bus_by_devfn);
+    }
+    for (i = 0; i < bridges->len; i++) {
+        child_bus = g_array_index(bridges, PCIBus *, i);
+        one_max = pci_program_bridge(child_bus->parent_dev, child_bus,
+                                     current_bus_num, next_bus_num);
+        if (one_max > max_subordinate) {
+            max_subordinate = one_max;
+        }
+    }
+    g_array_free(bridges, true);
+
+    return max_subordinate;
+}
+
+void pci_enumerate_bus(PCIBus *root_bus)
+{
+    uint8_t next_bus_num;
+
+    if (!root_bus) {
+        return;
+    }
+    next_bus_num = 1;
+    pci_program_bus_numbers(root_bus, 0, &next_bus_num);
+}
diff --git a/hw/pci/pci-enumerate.h b/hw/pci/pci-enumerate.h
new file mode 100644
index 0000000000..b1e4b989f1
--- /dev/null
+++ b/hw/pci/pci-enumerate.h
@@ -0,0 +1,15 @@
+/*
+ * Copyright (C) 2026 NVIDIA
+ * Written by Tushar Dave
+ *
+ * SPDX-License-Identifier: GPL-2.0-or-later
+ */
+
+#ifndef HW_PCI_PCI_ENUMERATE_H
+#define HW_PCI_PCI_ENUMERATE_H
+
+#include "hw/pci/pci_bus.h"
+
+void pci_enumerate_bus(PCIBus *root_bus);
+
+#endif
-- 
2.34.1




* [RFC PATCH 3/8] hw/pci: introduce allocator for fixed BAR placement
  2026-05-08 18:37 [RFC PATCH 0/8] hw/arm/virt, hw/pci: PCI pre-enumeration and fixed BAR allocation Tushar Dave
  2026-05-08 18:37 ` [RFC PATCH 1/8] hw/pci: add fixed-bars property to allow fixed BAR addresses Tushar Dave
  2026-05-08 18:37 ` [RFC PATCH 2/8] hw/pci: enumerate PCI bus and program bridge bus numbers Tushar Dave
@ 2026-05-08 18:37 ` Tushar Dave
  2026-05-08 18:37 ` [RFC PATCH 4/8] hw/pci: pack remaining BARs and update bridge windows Tushar Dave
                   ` (7 subsequent siblings)
  10 siblings, 0 replies; 23+ messages in thread
From: Tushar Dave @ 2026-05-08 18:37 UTC (permalink / raw)
  To: qemu-devel
  Cc: alwilliamson, jgg, skolothumtho, qemu-arm, peter.maydell, mst,
	marcel.apfelbaum, devel

The allocator walks the PCI topology and, for each device with fixed
BAR requests, validates the BAR (type, size, alignment, and that the
resulting range fits within the machine-provided MMIO64 window),
detects conflicts against already claimed ranges, and programs 64-bit
prefetchable BARs.

A global list of claimed ranges is maintained to detect overlapping
allocations across devices. Overlapping fixed BARs within a device are
also detected, and any conflict results in an error.

This patch implements the initial phase of fixed BAR handling and
reservation tracking. Allocation of remaining resources and bridge
window setup will be added in follow-up patches.

Limitations:

Only 64-bit prefetchable MMIO BARs within the MMIO64 window are
handled. This covers devices that use large prefetchable MMIO regions.
32-bit MMIO, PIO, and 64-bit non-prefetchable BARs are not supported
and will be addressed in future work.
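The claimed-range tracking described above is a linear scan over closed
intervals. A minimal sketch (names are invented and a fixed-size array
stands in for the GArray used by the patch):

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

typedef struct {
    uint64_t start, end;   /* inclusive bounds, as in the allocator */
} Claim;

#define MAX_CLAIMS 64
static Claim claims[MAX_CLAIMS];
static int nclaims;

/* Closed intervals [a0,a1] and [b0,b1] overlap iff a0 <= b1 && b0 <= a1. */
static bool claim_conflicts(uint64_t start, uint64_t end)
{
    for (int i = 0; i < nclaims; i++) {
        if (start <= claims[i].end && claims[i].start <= end) {
            return true;
        }
    }
    return false;
}

/* Claim [start,end]; returns false (claiming nothing) on any conflict. */
static bool claim_add(uint64_t start, uint64_t end)
{
    if (nclaims == MAX_CLAIMS || claim_conflicts(start, end)) {
        return false;
    }
    claims[nclaims++] = (Claim){ start, end };
    return true;
}
```

In the series a conflict is fatal (QEMU exits with an error) rather
than reported to the caller as here.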

Signed-off-by: Tushar Dave <tdave@nvidia.com>
---
 hw/pci/meson.build    |   1 +
 hw/pci/pci-resource.c | 255 ++++++++++++++++++++++++++++++++++++++++++
 hw/pci/pci-resource.h |  65 +++++++++++
 3 files changed, 321 insertions(+)
 create mode 100644 hw/pci/pci-resource.c
 create mode 100644 hw/pci/pci-resource.h

diff --git a/hw/pci/meson.build b/hw/pci/meson.build
index 7e8f5bb87d..d26695414f 100644
--- a/hw/pci/meson.build
+++ b/hw/pci/meson.build
@@ -6,6 +6,7 @@ pci_ss.add(files(
   'pci_bridge.c',
   'pci_host.c',
   'pci-enumerate.c',
+  'pci-resource.c',
   'pci-hmp-cmds.c',
   'pci-qmp-cmds.c',
   'pcie_sriov.c',
diff --git a/hw/pci/pci-resource.c b/hw/pci/pci-resource.c
new file mode 100644
index 0000000000..5e9a78ec16
--- /dev/null
+++ b/hw/pci/pci-resource.c
@@ -0,0 +1,255 @@
+/*
+ * Copyright (C) 2026 NVIDIA
+ * Written by Tushar Dave
+ *
+ * SPDX-License-Identifier: GPL-2.0-or-later
+ */
+
+#include "qemu/osdep.h"
+#include "qemu/error-report.h"
+#include "qemu/range.h"
+#include "hw/pci/pci.h"
+#include "hw/pci/pci_bridge.h"
+#include "hw/pci/pci_bus.h"
+#include "hw/pci/pci_host.h"
+#include "hw/pci/pci-resource.h"
+
+/* Global list of claimed fixed 64-bit prefetchable BAR ranges */
+static GArray *fixed_claim_regions;
+
+static void fixed_claim_regions_reset(void)
+{
+    if (fixed_claim_regions) {
+        g_array_free(fixed_claim_regions, true);
+        fixed_claim_regions = NULL;
+    }
+    fixed_claim_regions = g_array_new(false, true, sizeof(FixedClaim));
+}
+
+static bool fixed_claim_regions_conflicts(uint64_t start, uint64_t end,
+                                          uint64_t wbase64, uint64_t wlimit64,
+                                          uint64_t *conflict_end)
+{
+    /* Hard guard: out-of-window ranges are invalid input */
+    if (start < wbase64 || end > wlimit64) {
+        error_report("placement [0x%"PRIx64"..0x%"PRIx64"] out of window "
+                     "[0x%"PRIx64"..0x%"PRIx64"]",
+                     start, end, wbase64, wlimit64);
+        exit(1);
+    }
+    if (!fixed_claim_regions) {
+        return false;
+    }
+    for (guint i = 0; i < fixed_claim_regions->len; i++) {
+        FixedClaim *c = &g_array_index(fixed_claim_regions, FixedClaim, i);
+        if (ranges_overlap(start, end - start + 1, c->start, c->end - c->start + 1)) {
+            if (conflict_end) {
+                *conflict_end = c->end;
+            }
+            return true;
+        }
+    }
+    return false;
+}
+
+static void fixed_claim_regions_add(uint64_t start, uint64_t end, PCIDevice *dev, int bar)
+{
+    FixedClaim cl = { .start = start, .end = end, .owner = dev, .bar = bar };
+    g_array_append_val(fixed_claim_regions, cl);
+}
+
+static void pci_validate_fixed_bar(PCIDevice *dev, int bar_index,
+                                   uint64_t addr, uint64_t size,
+                                   uint64_t wbase64, uint64_t wlimit64)
+{
+    PCIIORegion *r = &dev->io_regions[bar_index];
+    uint64_t end;
+
+    if (!r->size || !(r->type & PCI_BASE_ADDRESS_MEM_TYPE_64)) {
+        error_report("Invalid fixed-bars for %s [%02x:%02x.%x] BAR%d: "
+                     "BAR not 64-bit or size=0 (type=0x%x size=0x%"PRIx64")",
+                     dev->name, pci_dev_bus_num(dev),
+                     PCI_SLOT(dev->devfn), PCI_FUNC(dev->devfn),
+                     bar_index, r->type, (uint64_t)r->size);
+        exit(1);
+    }
+    /* This path only programs 64-bit prefetchable MMIO in the MMIO64 window. */
+    if (!(r->type & PCI_BASE_ADDRESS_MEM_PREFETCH) &&
+        !pci_bus_is_root(pci_get_bus(dev))) {
+        error_report("Invalid fixed-bars for %s [%02x:%02x.%x] BAR%d: "
+                     "this allocator only supports 64-bit prefetchable MMIO; "
+                     "64-bit non-prefetchable is not supported",
+                     dev->name, pci_dev_bus_num(dev),
+                     PCI_SLOT(dev->devfn), PCI_FUNC(dev->devfn), bar_index);
+        exit(1);
+    }
+    end = addr + size - 1;
+    if (addr & (size - 1)) {
+        error_report("Invalid fixed-bars alignment for %s [%02x:%02x.%x] "
+                     "BAR%d: addr=0x%"PRIx64" size=0x%"PRIx64,
+                     dev->name, pci_dev_bus_num(dev),
+                     PCI_SLOT(dev->devfn), PCI_FUNC(dev->devfn),
+                     bar_index, addr, size);
+        exit(1);
+    }
+    if (addr < wbase64 || end > wlimit64) {
+        error_report("fixed-bars out of window for %s [%02x:%02x.%x] BAR%d "
+                     "range=[0x%"PRIx64"..0x%"PRIx64"] window=[0x%"PRIx64"..0x%"PRIx64"]",
+                     dev->name, pci_dev_bus_num(dev),
+                     PCI_SLOT(dev->devfn), PCI_FUNC(dev->devfn),
+                     bar_index, addr, end, wbase64, wlimit64);
+        exit(1);
+    }
+}
+
+static void pci_check_fixed_bar_overlap(PCIDevice *dev, PhysBAR *pbars)
+{
+    for (int i = 0; i < PCI_ROM_SLOT; i++) {
+        if (!(pbars[i].flags & IORESOURCE_PREFETCH)) {
+            continue;
+        }
+        for (int j = i + 1; j < PCI_ROM_SLOT; j++) {
+            if (!(pbars[j].flags & IORESOURCE_PREFETCH)) {
+                continue;
+            }
+            if (ranges_overlap(pbars[i].addr, dev->io_regions[i].size,
+                               pbars[j].addr, dev->io_regions[j].size)) {
+                error_report("Invalid fixed-bars: fixed BAR overlap on %s [%02x:%02x.%x]: "
+                             "BAR%d [0x%"PRIx64"..0x%"PRIx64"] vs BAR%d [0x%"PRIx64"..0x%"PRIx64"]",
+                             dev->name, pci_dev_bus_num(dev),
+                             PCI_SLOT(dev->devfn), PCI_FUNC(dev->devfn),
+                             i, pbars[i].addr, pbars[i].addr + dev->io_regions[i].size - 1,
+                             j, pbars[j].addr, pbars[j].addr + dev->io_regions[j].size - 1);
+                exit(1);
+            }
+        }
+    }
+}
+
+/* Program 64-bit prefetchable BARs */
+static void pci_program_prefetch_bars(PCIDevice *dev, PhysBAR *pbars)
+{
+    int idx;
+    uint32_t laddr;
+
+    for (idx = 0; idx < PCI_ROM_SLOT; idx++) {
+        PhysBAR *pbar = &pbars[idx];
+
+        if (!(pbar->flags & IORESOURCE_PREFETCH)) {
+            continue;
+        }
+        laddr = pbar->addr & PCI_BASE_ADDRESS_MEM_MASK;
+        laddr |= PCI_BASE_ADDRESS_MEM_TYPE_64;
+        /* Set PREFETCH bit only if the BAR itself is prefetchable */
+        if (dev->io_regions[idx].type & PCI_BASE_ADDRESS_MEM_PREFETCH) {
+            laddr |= PCI_BASE_ADDRESS_MEM_PREFETCH;
+        }
+
+        pci_host_config_write_common(dev,
+                                     PCI_BASE_ADDRESS_0 + (idx * 4),
+                                     pci_config_size(dev),
+                                     laddr,
+                                     4);
+        pci_host_config_write_common(dev,
+                                     PCI_BASE_ADDRESS_0 + (idx * 4) + 4,
+                                     pci_config_size(dev),
+                                     (uint32_t)(pbar->addr >> 32),
+                                     4);
+    }
+}
+
+/* Phase 1: claim and program fixed BARs for one device (per-device callback) */
+static void pci_dev_claim_and_program_fixed_bars(PCIBus *bus, PCIDevice *dev, void *opaque)
+{
+    PciProgramCtx *pctx = (PciProgramCtx *)opaque;
+    PhysBAR *pbar, pbars[PCI_ROM_SLOT];
+    bool had_any_fixed = false;
+    uint64_t start;
+    uint64_t end;
+    int idx;
+
+    pbar = pbars;
+    memset(pbar, 0, sizeof(pbars));
+
+    if (!dev->fixed_bar_addrs) {
+        return;
+    }
+    for (idx = 0; idx < PCI_ROM_SLOT; idx++) {
+        PCIIORegion *r = &dev->io_regions[idx];
+        if (dev->fixed_bar_addrs[idx] == PCI_BAR_UNMAPPED) {
+            continue;
+        }
+        pci_validate_fixed_bar(dev, idx,
+                               dev->fixed_bar_addrs[idx],
+                               r->size,
+                               pctx->mmio64_base,
+                               pctx->mmio64_base + pctx->mmio64_size - 1);
+
+        start = dev->fixed_bar_addrs[idx];
+        end = start + r->size - 1;
+        if (fixed_claim_regions_conflicts(start, end,
+                                          pctx->mmio64_base,
+                                          pctx->mmio64_base + pctx->mmio64_size - 1,
+                                          NULL)) {
+            error_report("Invalid fixed-bars: fixed BAR for %s [%02x:%02x.%x] "
+                         "BAR%d [0x%"PRIx64"..0x%"PRIx64"] overlaps an existing fixed range",
+                         dev->name, pci_dev_bus_num(dev),
+                         PCI_SLOT(dev->devfn), PCI_FUNC(dev->devfn),
+                         idx, start, end);
+            exit(1);
+        }
+        fixed_claim_regions_add(start, end, dev, idx);
+        pbars[idx].addr = dev->fixed_bar_addrs[idx];
+        pbars[idx].end = pbars[idx].addr + r->size - 1;
+        pbars[idx].flags = IORESOURCE_PREFETCH;
+        had_any_fixed = true;
+    }
+    if (had_any_fixed) {
+        g_hash_table_insert(pctx->had_fixed, dev, dev);
+    }
+    /* Abort if intra-device fixed overlap */
+    pci_check_fixed_bar_overlap(dev, pbars);
+    /* Program fixed BARs now */
+    pci_program_prefetch_bars(dev, pbars);
+}
+
+static void pci_bus_claim_and_program_fixed_bars(PCIBus *bus, void *opaque)
+{
+    pci_for_each_device_under_bus(bus, pci_dev_claim_and_program_fixed_bars, opaque);
+}
+
+static void pci_resource_init_from_mmio(PciAllocCfg *pci_res,
+                                        const PciFixedBarMmioParams *mmio)
+{
+    pci_res->mmio32_base = mmio->mmio32_base;
+    pci_res->mmio32_size = mmio->mmio32_size;
+    pci_res->mmio64_base = mmio->mmio64_base;
+    pci_res->mmio64_size = mmio->mmio64_size;
+}
+
+void pci_fixed_bar_allocator(PCIBus *root, const PciFixedBarMmioParams *mmio)
+{
+    PciAllocCfg pci_res_buf, *pci_res = &pci_res_buf;
+    PCIBus *bus = root;
+
+    /* Fill allocator MMIO window once from machine memmap */
+    pci_resource_init_from_mmio(pci_res, mmio);
+
+    /* Reset fixed-claims tracking */
+    fixed_claim_regions_reset();
+
+    PciProgramCtx pctx = {
+        .mmio64_base = pci_res->mmio64_base,
+        .mmio64_size = pci_res->mmio64_size,
+        .had_fixed = g_hash_table_new(NULL, NULL),
+    };
+
+    /* Phase 1: program all fixed BARs and claim them */
+    pci_for_each_bus(bus, pci_bus_claim_and_program_fixed_bars, &pctx);
+
+    /* TODO: Phases 2-3, program remaining BARs, bridge window refresh, etc. */
+
+    /* Cleanup */
+    g_hash_table_destroy(pctx.had_fixed);
+    fixed_claim_regions_reset();
+}
diff --git a/hw/pci/pci-resource.h b/hw/pci/pci-resource.h
new file mode 100644
index 0000000000..cc4d6f71cb
--- /dev/null
+++ b/hw/pci/pci-resource.h
@@ -0,0 +1,65 @@
+/*
+ * Copyright (C) 2026 NVIDIA
+ * Written by Tushar Dave
+ *
+ * SPDX-License-Identifier: GPL-2.0-or-later
+ */
+
+#ifndef HW_PCI_PCI_RESOURCE_H
+#define HW_PCI_PCI_RESOURCE_H
+
+#include "exec/hwaddr.h"
+#include "hw/pci/pci.h"
+#include <glib.h>
+
+#define IORESOURCE_PREFETCH     0x00002000
+
+typedef struct {
+    uint64_t addr;
+    uint64_t end;
+    uint64_t flags;
+} PhysBAR;
+
+typedef struct {
+    uint64_t wbase;
+    uint64_t wlimit;
+    uint64_t wbase64;
+    uint64_t wlimit64;
+    uint64_t rbase;
+    uint64_t rlimit;
+    uint64_t rsize;
+    uint64_t piobase;
+    bool     available;
+    bool     search_mmio64;
+    PCIDevice *dev;
+    PCIBus *bus;
+    /* Allocator window (filled once from machine memmap) */
+    hwaddr   mmio32_base;
+    hwaddr   mmio32_size;
+    hwaddr   mmio64_base;
+    hwaddr   mmio64_size;
+} PciAllocCfg;
+
+typedef struct FixedClaim {
+    uint64_t start;
+    uint64_t end;
+    PCIDevice *owner;
+    int bar;
+} FixedClaim;
+
+typedef struct {
+    hwaddr mmio64_base;
+    hwaddr mmio64_size;
+    GHashTable *had_fixed; /* set of PCIDevice* that had at least one fixed BAR */
+} PciProgramCtx;
+
+typedef struct PciFixedBarMmioParams {
+    hwaddr mmio32_base;
+    hwaddr mmio32_size;
+    hwaddr mmio64_base;
+    hwaddr mmio64_size;
+} PciFixedBarMmioParams;
+
+void pci_fixed_bar_allocator(PCIBus *root, const PciFixedBarMmioParams *mmio);
+
+#endif
-- 
2.34.1




* [RFC PATCH 4/8] hw/pci: pack remaining BARs and update bridge windows
  2026-05-08 18:37 [RFC PATCH 0/8] hw/arm/virt, hw/pci: PCI pre-enumeration and fixed BAR allocation Tushar Dave
                   ` (2 preceding siblings ...)
  2026-05-08 18:37 ` [RFC PATCH 3/8] hw/pci: introduce allocator for fixed BAR placement Tushar Dave
@ 2026-05-08 18:37 ` Tushar Dave
  2026-05-08 18:37 ` [RFC PATCH 5/8] hw/pci: allocate remaining BARs for buses without fixed BARs Tushar Dave
                   ` (6 subsequent siblings)
  10 siblings, 0 replies; 23+ messages in thread
From: Tushar Dave @ 2026-05-08 18:37 UTC (permalink / raw)
  To: qemu-devel
  Cc: alwilliamson, jgg, skolothumtho, qemu-arm, peter.maydell, mst,
	marcel.apfelbaum, devel

Extend the fixed BAR allocator to handle remaining 64-bit prefetchable
BARs after fixed BAR placement.

For each bus with fixed BAR devices, collect fixed and unassigned BARs,
compute available MMIO64 holes considering both local fixed BAR anchors
and globally claimed regions, and select an appropriate region to pack
remaining BARs.

Remaining BARs are sorted by size and packed into the selected hole
using a greedy placement strategy. Fixed BAR placement is preserved,
and all allocations are tracked via the global claim list.

After BAR placement, update the PCI bridge prefetchable window to cover
both fixed and dynamically assigned BAR ranges, ensuring firmware sees
a consistent MMIO layout.

This implements the allocator's second phase: dynamic BAR placement and
bridge window sizing for buses with fixed BAR constraints.

Signed-off-by: Tushar Dave <tdave@nvidia.com>
---
 hw/pci/pci-resource.c | 404 +++++++++++++++++++++++++++++++++++++++++-
 hw/pci/pci-resource.h |  17 ++
 2 files changed, 420 insertions(+), 1 deletion(-)

diff --git a/hw/pci/pci-resource.c b/hw/pci/pci-resource.c
index 5e9a78ec16..de98924aa6 100644
--- a/hw/pci/pci-resource.c
+++ b/hw/pci/pci-resource.c
@@ -7,6 +7,7 @@
 
 #include "qemu/osdep.h"
 #include "qemu/error-report.h"
+#include "qemu/bitops.h"
 #include "qemu/range.h"
 #include "hw/pci/pci.h"
 #include "hw/pci/pci_bridge.h"
@@ -158,6 +159,404 @@ static void pci_program_prefetch_bars(PCIDevice *dev, PhysBAR *pbars)
     }
 }
 
+static void pci_update_prefetch_window(PCIBus *bus, uint64_t base, uint64_t limit)
+{
+    PCIDevice *bridge = pci_bridge_get_device(bus);
+    uint32_t reg_base, reg_limit;
+
+    assert(bridge);
+
+    reg_base = (uint32_t)(extract64(base, 20, 12) << 4);
+    reg_limit = (uint32_t)(extract64(limit, 20, 12) << 4);
+    pci_host_config_write_common(bridge,
+                                 PCI_PREF_MEMORY_BASE,
+                                 pci_config_size(bridge),
+                                 reg_base | PCI_PREF_RANGE_TYPE_64,
+                                 2);
+    pci_host_config_write_common(bridge,
+                                 PCI_PREF_BASE_UPPER32,
+                                 pci_config_size(bridge),
+                                 (uint32_t)(base >> 32),
+                                 4);
+    pci_host_config_write_common(bridge,
+                                 PCI_PREF_MEMORY_LIMIT,
+                                 pci_config_size(bridge),
+                                 reg_limit | PCI_PREF_RANGE_TYPE_64,
+                                 2);
+    pci_host_config_write_common(bridge,
+                                 PCI_PREF_LIMIT_UPPER32,
+                                 pci_config_size(bridge),
+                                 (uint32_t)(limit >> 32),
+                                 4);
+}
+
+static inline bool is_64bit_pref_bar(PCIIORegion *r)
+{
+    if (!r->size) {
+        return false;
+    }
+    if (r->type & PCI_BASE_ADDRESS_SPACE_IO) {
+        return false;
+    }
+    if (!(r->type & PCI_BASE_ADDRESS_MEM_TYPE_64)) {
+        return false;
+    }
+    if (!(r->type & PCI_BASE_ADDRESS_MEM_PREFETCH)) {
+        return false;
+    }
+    return true;
+}
+
+/* Comparison function for sorting intervals by start address */
+static int compare_intervals(gconstpointer a, gconstpointer b)
+{
+    const AddressInterval *ia = (const AddressInterval *)a;
+    const AddressInterval *ib = (const AddressInterval *)b;
+    if (ia->start < ib->start) return -1;
+    if (ia->start > ib->start) return 1;
+    return 0;
+}
+
+/* Comparison function for sorting BARs by descending size */
+static int compare_bar_size_desc(gconstpointer a, gconstpointer b)
+{
+    const BarEntry *ea = (const BarEntry *)a;
+    const BarEntry *eb = (const BarEntry *)b;
+    if (ea->size > eb->size) return -1;
+    if (ea->size < eb->size) return 1;
+    return 0;
+}
+
+/* Categorize holes relative to anchors */
+static CategorizedHoles categorize_holes(GArray *holes, GArray *fixed_bars)
+{
+    CategorizedHoles result = {
+        .leftmost_hole = -1,
+        .middle_holes = g_array_new(false, false, sizeof(int)),
+        .rightmost_hole = -1
+    };
+
+    /* Get anchor boundaries */
+    uint64_t first_anchor_start = g_array_index(fixed_bars, AddressInterval, 0).start;
+    uint64_t last_anchor_end = g_array_index(fixed_bars, AddressInterval,
+                                             fixed_bars->len - 1).end;
+    /* Categorize each hole */
+    for (guint h = 0; h < holes->len; h++) {
+        AddressInterval *hole = &g_array_index(holes, AddressInterval, h);
+
+        if (hole->end < first_anchor_start) {
+            result.leftmost_hole = h;  /* Before all anchors */
+        } else if (hole->start > last_anchor_end) {
+            result.rightmost_hole = h;  /* After all anchors */
+        } else {
+            g_array_append_val(result.middle_holes, h);  /* Between anchors */
+        }
+    }
+    return result;
+}
+
+/*
+ * Compute REAL holes considering both local anchors and global claims.
+ * This returns actual free space that can be used for packing.
+ * Strategy: Collect all obstacles (local fixed BARs + global claims from
+ * other buses), then compute gaps between them.
+ */
+static GArray *compute_real_holes(GArray *fixed_bars, uint64_t mmio_start, uint64_t mmio_end)
+{
+    GArray *holes = g_array_new(false, false, sizeof(AddressInterval));
+    GArray *claimed_regions = g_array_new(false, false, sizeof(AddressInterval));
+    uint64_t scan;
+
+    /* Add local fixed BARs (anchors) as claimed regions */
+    for (guint i = 0; i < fixed_bars->len; i++) {
+        AddressInterval *anchor = &g_array_index(fixed_bars, AddressInterval, i);
+        g_array_append_val(claimed_regions, *anchor);
+    }
+
+    /* Add global claims from ALL buses (including other buses) */
+    if (fixed_claim_regions) {
+        for (guint i = 0; i < fixed_claim_regions->len; i++) {
+            FixedClaim *claim = &g_array_index(fixed_claim_regions, FixedClaim, i);
+            /* Only consider claims within our MMIO window */
+            if (claim->start <= mmio_end && claim->end >= mmio_start) {
+                AddressInterval region = {
+                    .start = claim->start,
+                    .end = claim->end
+                };
+                g_array_append_val(claimed_regions, region);
+            }
+        }
+    }
+
+    /* Handle case with no claimed regions */
+    if (claimed_regions->len == 0) {
+        AddressInterval hole = { .start = mmio_start, .end = mmio_end };
+        g_array_append_val(holes, hole);
+        g_array_free(claimed_regions, true);
+        return holes;
+    }
+
+    /* Sort claimed regions by start address */
+    g_array_sort(claimed_regions, compare_intervals);
+
+    /* Compute holes between all claimed regions */
+    scan = mmio_start;
+    for (guint i = 0; i < claimed_regions->len; i++) {
+        AddressInterval *claimed = &g_array_index(claimed_regions, AddressInterval, i);
+
+        /* Free space before this claimed region */
+        if (scan < claimed->start) {
+            AddressInterval hole = { .start = scan, .end = claimed->start - 1 };
+            g_array_append_val(holes, hole);
+        }
+
+        /* Move scan cursor past this claimed region */
+        scan = MAX(scan, claimed->end + 1);
+    }
+
+    /* Free space after last claimed region */
+    if (scan <= mmio_end) {
+        AddressInterval hole = { .start = scan, .end = mmio_end };
+        g_array_append_val(holes, hole);
+    }
+
+    g_array_free(claimed_regions, true);
+    return holes;
+}
+
+static bool pack_bars_into_region(GArray *bars, uint64_t pack_start, uint64_t pack_end,
+                                   uint64_t *out_min_addr, uint64_t *out_max_addr)
+{
+    uint64_t pack_cursor = pack_start;
+    uint64_t min_addr = UINT64_MAX;
+    uint64_t max_addr = 0;
+
+    for (guint i = 0; i < bars->len; i++) {
+        BarEntry *e = &g_array_index(bars, BarEntry, i);
+        PCIIORegion *r = &e->dev->io_regions[e->bar_idx];
+
+        uint64_t aligned_addr = ROUND_UP(pack_cursor, r->size);
+        uint64_t bar_start = aligned_addr;
+        uint64_t bar_end = bar_start + r->size - 1;
+
+        if (bar_end > pack_end) {
+            return false; /* Doesn't fit */
+        }
+
+        PhysBAR pbars_array[PCI_ROM_SLOT];
+        memset(pbars_array, 0, sizeof(pbars_array));
+        pbars_array[e->bar_idx].addr = bar_start;
+        pbars_array[e->bar_idx].end = bar_end;
+        pbars_array[e->bar_idx].flags = IORESOURCE_PREFETCH;
+
+        pci_program_prefetch_bars(e->dev, pbars_array);
+
+        min_addr = MIN(min_addr, bar_start);
+        max_addr = MAX(max_addr, bar_end);
+        pack_cursor = bar_end + 1;
+    }
+
+    *out_min_addr = min_addr;
+    *out_max_addr = max_addr;
+    return true;
+}
+
+static void finalize_bridge_window(PCIBus *bus, uint64_t min_addr, uint64_t max_addr)
+{
+    PCIDevice *bridge_dev = pci_bridge_get_device(bus);
+
+    if (bridge_dev) {
+        fixed_claim_regions_add(min_addr, max_addr, bridge_dev, -1);
+        pci_update_prefetch_window(bus, min_addr, max_addr);
+    }
+}
+
+static bool pci_bus_phase2_fill_bar_lists(PCIBus *bus, PciProgramCtx *pctx,
+                                          GArray *fixed_bars, GArray *remaining_bars)
+{
+    AddressInterval interval;
+    BarEntry bentry;
+    PCIDevice *d;
+    PCIIORegion *r;
+    bool bus_has_fixed = false;
+    bool device_has_fixed;
+    int devfn, i;
+
+    for (devfn = 0; devfn < ARRAY_SIZE(bus->devices); devfn++) {
+        d = bus->devices[devfn];
+        if (!d) {
+            continue;
+        }
+        device_has_fixed = g_hash_table_contains(pctx->had_fixed, d);
+        if (device_has_fixed) {
+            bus_has_fixed = true;
+        }
+        for (i = 0; i < PCI_ROM_SLOT; i++) {
+            r = &d->io_regions[i];
+            if (!is_64bit_pref_bar(r)) {
+                continue;
+            }
+            if (device_has_fixed && d->fixed_bar_addrs &&
+                d->fixed_bar_addrs[i] != PCI_BAR_UNMAPPED) {
+                interval.start = d->fixed_bar_addrs[i];
+                interval.end = d->fixed_bar_addrs[i] + r->size - 1;
+                g_array_append_val(fixed_bars, interval);
+            } else {
+                bentry.dev = d;
+                bentry.bar_idx = i;
+                bentry.size = r->size;
+                g_array_append_val(remaining_bars, bentry);
+            }
+        }
+    }
+    return bus_has_fixed;
+}
+
+/* Find a mmio64 hole, pack unassigned BARs and program the bridge */
+static void
+pci_bus_phase2_hole_pack_and_update_bridge(PCIBus *bus, GArray *fixed_bars,
+                                           GArray *remaining_bars,
+                                           uint64_t mmio_start,
+                                           uint64_t mmio_end)
+{
+    GArray *holes;
+    FixedClaim *claim;
+    CategorizedHoles cat;
+    AddressInterval *holep, *selected;
+    int selected_hole, largest_middle, h_idx;
+    guint c, mid_i, f;
+    uint64_t bus_min_addr, bus_max_addr, remaining_demand;
+    uint64_t leftmost_anchor, rightmost_anchor_end, valid_start, valid_end;
+    uint64_t largest_size, hole_size, pack_start, pack_end;
+
+    g_array_sort(fixed_bars, compare_intervals);
+    g_array_sort(remaining_bars, compare_bar_size_desc);
+
+    remaining_demand = 0;
+    for (c = 0; c < remaining_bars->len; c++) {
+        remaining_demand += g_array_index(remaining_bars, BarEntry, c).size;
+    }
+
+    leftmost_anchor = g_array_index(fixed_bars, AddressInterval, 0).start;
+    rightmost_anchor_end = g_array_index(fixed_bars, AddressInterval,
+                                        fixed_bars->len - 1).end;
+
+    valid_start = mmio_start;
+    valid_end = mmio_end;
+
+    if (fixed_claim_regions) {
+        for (c = 0; c < fixed_claim_regions->len; c++) {
+            claim = &g_array_index(fixed_claim_regions, FixedClaim, c);
+            if (claim->end < leftmost_anchor && claim->end >= valid_start) {
+                valid_start = claim->end + 1;
+            }
+            if (claim->start > rightmost_anchor_end && claim->start <= valid_end) {
+                valid_end = claim->start - 1;
+            }
+        }
+    }
+
+    holes = compute_real_holes(fixed_bars, valid_start, valid_end);
+    cat = categorize_holes(holes, fixed_bars);
+
+    selected_hole = -1;
+    pack_start = 0;
+    pack_end = 0;
+
+    if (cat.middle_holes->len > 0) {
+        largest_middle = -1;
+        largest_size = 0;
+        for (mid_i = 0; mid_i < cat.middle_holes->len; mid_i++) {
+            h_idx = g_array_index(cat.middle_holes, int, mid_i);
+            holep = &g_array_index(holes, AddressInterval, h_idx);
+            hole_size = holep->end - holep->start + 1;
+            if (hole_size >= remaining_demand && hole_size > largest_size) {
+                largest_size = hole_size;
+                largest_middle = h_idx;
+            }
+        }
+        if (largest_middle >= 0) {
+            selected_hole = largest_middle;
+        }
+    }
+    if (selected_hole < 0 && cat.rightmost_hole >= 0) {
+        holep = &g_array_index(holes, AddressInterval, cat.rightmost_hole);
+        hole_size = holep->end - holep->start + 1;
+        if (hole_size >= remaining_demand) {
+            selected_hole = cat.rightmost_hole;
+        }
+    }
+    if (selected_hole < 0 && cat.leftmost_hole >= 0) {
+        holep = &g_array_index(holes, AddressInterval, cat.leftmost_hole);
+        hole_size = holep->end - holep->start + 1;
+        if (hole_size >= remaining_demand) {
+            selected_hole = cat.leftmost_hole;
+        }
+    }
+    g_array_free(cat.middle_holes, true);
+    if (selected_hole < 0) {
+        error_report("bus [%02x] insufficient contiguous space for "
+                     "remaining_demand=0x%"PRIx64,
+                     pci_bus_num(bus), remaining_demand);
+        g_array_free(holes, true);
+        g_array_free(fixed_bars, true);
+        g_array_free(remaining_bars, true);
+        exit(1);
+    }
+    selected = &g_array_index(holes, AddressInterval, selected_hole);
+    pack_start = selected->start;
+    pack_end = selected->end;
+    g_array_free(holes, true);
+    if (!pack_bars_into_region(remaining_bars, pack_start, pack_end,
+                                 &bus_min_addr, &bus_max_addr)) {
+        error_report("bus [%02x] failed to pack BARs", pci_bus_num(bus));
+        g_array_free(fixed_bars, true);
+        g_array_free(remaining_bars, true);
+        exit(1);
+    }
+    for (f = 0; f < fixed_bars->len; f++) {
+        holep = &g_array_index(fixed_bars, AddressInterval, f);
+        bus_min_addr = MIN(bus_min_addr, holep->start);
+        bus_max_addr = MAX(bus_max_addr, holep->end);
+    }
+    finalize_bridge_window(bus, bus_min_addr, bus_max_addr);
+    g_array_free(fixed_bars, true);
+    g_array_free(remaining_bars, true);
+}
+
+static void pci_bus_phase2_pack_remaining_bars(PCIBus *bus, void *opaque)
+{
+    PciProgramCtx *pctx = (PciProgramCtx *)opaque;
+    GArray *fixed_bars, *remaining_bars;
+    uint64_t mmio_start, mmio_end, bus_min_addr, bus_max_addr;
+    bool bus_has_fixed;
+
+    mmio_start = pctx->mmio64_base;
+    mmio_end = pctx->mmio64_base + pctx->mmio64_size - 1;
+    fixed_bars = g_array_new(false, false, sizeof(AddressInterval));
+    remaining_bars = g_array_new(false, false, sizeof(BarEntry));
+    bus_has_fixed = pci_bus_phase2_fill_bar_lists(bus, pctx, fixed_bars,
+                                                    remaining_bars);
+    if (!bus_has_fixed) {
+        g_array_free(fixed_bars, true);
+        g_array_free(remaining_bars, true);
+        return;
+    }
+    if (remaining_bars->len == 0) {
+        if (fixed_bars->len > 0) {
+            g_array_sort(fixed_bars, compare_intervals);
+            bus_min_addr = g_array_index(fixed_bars, AddressInterval, 0).start;
+            bus_max_addr = g_array_index(fixed_bars, AddressInterval,
+                                        fixed_bars->len - 1).end;
+            finalize_bridge_window(bus, bus_min_addr, bus_max_addr);
+        }
+        g_array_free(fixed_bars, true);
+        g_array_free(remaining_bars, true);
+        return;
+    }
+    pci_bus_phase2_hole_pack_and_update_bridge(bus, fixed_bars, remaining_bars,
+                                                mmio_start, mmio_end);
+}
 /* Phase 1: claim and program fixed BARs for one device (per-device callback) */
 static void pci_dev_claim_and_program_fixed_bars(PCIBus *bus, PCIDevice *dev, void *opaque)
 {
@@ -247,7 +646,10 @@ void pci_fixed_bar_allocator(PCIBus *root, const PciFixedBarMmioParams *mmio)
     /* Phase 1: program all fixed BARs and claim them */
     pci_for_each_bus(bus, pci_bus_claim_and_program_fixed_bars, &pctx);
 
-    /* TODO: Phases 2-3, program remaining BARs, bridge window refresh, etc. */
+    /* Phase 2: pack remaining 64-bit prefetchable BARs and size parent bridge window */
+    pci_for_each_bus(bus, pci_bus_phase2_pack_remaining_bars, &pctx);
+
+    /* Phase 3: buses without fixed-BAR devices and final bridge pass (follow-up) */
 
     /* Cleanup */
     g_hash_table_destroy(pctx.had_fixed);
diff --git a/hw/pci/pci-resource.h b/hw/pci/pci-resource.h
index cc4d6f71cb..5155a7cefa 100644
--- a/hw/pci/pci-resource.h
+++ b/hw/pci/pci-resource.h
@@ -47,6 +47,23 @@ typedef struct FixedClaim {
     int bar;
 } FixedClaim;
 
+typedef struct {
+    uint64_t start;
+    uint64_t end;
+} AddressInterval;
+
+typedef struct {
+    PCIDevice *dev;
+    int bar_idx;
+    uint64_t size;
+} BarEntry;
+
+typedef struct {
+    int leftmost_hole;      /* Index of hole before first anchor, or -1 */
+    GArray *middle_holes;   /* Array of hole indices between anchors */
+    int rightmost_hole;     /* Index of hole after last anchor, or -1 */
+} CategorizedHoles;
+
 typedef struct {
     hwaddr mmio64_base;
     hwaddr mmio64_size;
-- 
2.34.1




* [RFC PATCH 5/8] hw/pci: allocate remaining BARs for buses without fixed BARs
  2026-05-08 18:37 [RFC PATCH 0/8] hw/arm/virt, hw/pci: PCI pre-enumeration and fixed BAR allocation Tushar Dave
                   ` (3 preceding siblings ...)
  2026-05-08 18:37 ` [RFC PATCH 4/8] hw/pci: pack remaining BARs and update bridge windows Tushar Dave
@ 2026-05-08 18:37 ` Tushar Dave
  2026-05-08 18:37 ` [RFC PATCH 6/8] hw/pci: finalize bridge prefetch windows after BAR allocation Tushar Dave
                   ` (5 subsequent siblings)
  10 siblings, 0 replies; 23+ messages in thread
From: Tushar Dave @ 2026-05-08 18:37 UTC (permalink / raw)
  To: qemu-devel
  Cc: alwilliamson, jgg, skolothumtho, qemu-arm, peter.maydell, mst,
	marcel.apfelbaum, devel

This phase performs PCI BAR allocation for buses without fixed BAR
assignments. It respects existing allocations and does not disturb
already programmed BARs or bridge windows.

It computes remaining MMIO64 requirements, assigns BARs, and extends
bridge prefetch windows if required.

Signed-off-by: Tushar Dave <tdave@nvidia.com>
---
 hw/pci/pci-resource.c | 355 +++++++++++++++++++++++++++++++++++++++++-
 1 file changed, 354 insertions(+), 1 deletion(-)

diff --git a/hw/pci/pci-resource.c b/hw/pci/pci-resource.c
index de98924aa6..e2d2adc7de 100644
--- a/hw/pci/pci-resource.c
+++ b/hw/pci/pci-resource.c
@@ -371,6 +371,358 @@ static void finalize_bridge_window(PCIBus *bus, uint64_t min_addr, uint64_t max_
     }
 }
 
+/* Returns true if this 64-bit pref BAR is already assigned */
+static bool bar_is_assigned(PCIDevice *dev, int bar_idx, GHashTable *had_fixed)
+{
+    PCIIORegion *r = &dev->io_regions[bar_idx];
+    uint32_t lo;
+    uint32_t hi;
+
+    if (!is_64bit_pref_bar(r)) {
+        return false;
+    }
+    if (dev->fixed_bar_addrs &&
+        dev->fixed_bar_addrs[bar_idx] != PCI_BAR_UNMAPPED) {
+        return true;
+    }
+    if (bar_idx >= PCI_ROM_SLOT - 1) {
+        return false; /* 64-bit BAR uses two slots */
+    }
+    lo = pci_get_long(dev->config + PCI_BASE_ADDRESS_0 + bar_idx * 4);
+    if (!(lo & PCI_BASE_ADDRESS_MEM_TYPE_64)) {
+        return (lo & PCI_BASE_ADDRESS_MEM_MASK) != 0;
+    }
+    hi = pci_get_long(dev->config + PCI_BASE_ADDRESS_0 + bar_idx * 4 + 4);
+    return (((uint64_t)hi << 32) | (lo & PCI_BASE_ADDRESS_MEM_MASK)) != 0;
+}
+
+/* Return BAR address from config, or 0 if unassigned. */
+static uint64_t get_bar_addr_from_config(PCIDevice *dev, int bar_idx)
+{
+    PCIIORegion *r = &dev->io_regions[bar_idx];
+    uint32_t lo;
+    uint32_t hi;
+
+    if (!r->size || bar_idx >= PCI_ROM_SLOT - 1) {
+        return 0;
+    }
+    lo = pci_get_long(dev->config + PCI_BASE_ADDRESS_0 + bar_idx * 4);
+    if (lo & PCI_BASE_ADDRESS_MEM_TYPE_64) {
+        hi = pci_get_long(dev->config + PCI_BASE_ADDRESS_0 + bar_idx * 4 + 4);
+        return ((uint64_t)hi << 32) | (lo & PCI_BASE_ADDRESS_MEM_MASK);
+    }
+    return lo & PCI_BASE_ADDRESS_MEM_MASK;
+}
+
+/* Total size of unassigned 64-bit pref BARs in this bus and its subtree. */
+static uint64_t size_entire_subtree(PCIBus *bus, GHashTable *had_fixed)
+{
+    uint64_t total = 0;
+
+    for (int devfn = 0; devfn < ARRAY_SIZE(bus->devices); devfn++) {
+        PCIDevice *d = bus->devices[devfn];
+        if (!d) {
+            continue;
+        }
+        for (int i = 0; i < PCI_ROM_SLOT; i++) {
+            PCIIORegion *r = &d->io_regions[i];
+            if (!is_64bit_pref_bar(r)) {
+                continue;
+            }
+            if (bar_is_assigned(d, i, had_fixed)) {
+                continue;
+            }
+            total += r->size;
+        }
+        if (IS_PCI_BRIDGE(d)) {
+            total += size_entire_subtree(pci_bridge_get_sec_bus(PCI_BRIDGE(d)), had_fixed);
+        }
+    }
+    return total;
+}
+
+/* Highest end address of any assigned BAR or bridge window in this bus and subtree. */
+static uint64_t find_highest_assigned_in_bus(PCIBus *bus)
+{
+    uint64_t highest = 0;
+    uint64_t base;
+    uint64_t limit;
+    uint64_t addr;
+
+    for (int devfn = 0; devfn < ARRAY_SIZE(bus->devices); devfn++) {
+        PCIDevice *d = bus->devices[devfn];
+        if (!d) {
+            continue;
+        }
+        if (IS_PCI_BRIDGE(d)) {
+            PCIBus *sec = pci_bridge_get_sec_bus(PCI_BRIDGE(d));
+            PCIDevice *bridge_dev = pci_bridge_get_device(sec);
+            if (bridge_dev) {
+                base = pci_bridge_get_base(bridge_dev, PCI_BASE_ADDRESS_MEM_PREFETCH);
+                limit = pci_bridge_get_limit(bridge_dev, PCI_BASE_ADDRESS_MEM_PREFETCH);
+                if (limit > base) {
+                    highest = MAX(highest, limit);
+                }
+                highest = MAX(highest, find_highest_assigned_in_bus(sec));
+            }
+            continue;
+        }
+        for (int i = 0; i < PCI_ROM_SLOT; i++) {
+            PCIIORegion *r = &d->io_regions[i];
+            if (!is_64bit_pref_bar(r)) {
+                continue;
+            }
+            addr = 0;
+            if (d->fixed_bar_addrs &&
+                d->fixed_bar_addrs[i] != PCI_BAR_UNMAPPED) {
+                addr = d->fixed_bar_addrs[i];
+            } else {
+                addr = get_bar_addr_from_config(d, i);
+            }
+            if (addr != 0 && r->size) {
+                highest = MAX(highest, addr + r->size - 1);
+            }
+        }
+    }
+    return highest;
+}
+
+/* Next free address in root MMIO64. */
+static uint64_t next_free_from_root(hwaddr mmio64_base, hwaddr mmio64_size)
+{
+    uint64_t mmio_start = mmio64_base;
+    uint64_t mmio_end = mmio64_base + mmio64_size - 1;
+    uint64_t highest;
+
+    highest = mmio_start - 1;
+    if (fixed_claim_regions) {
+        for (guint i = 0; i < fixed_claim_regions->len; i++) {
+            FixedClaim *c = &g_array_index(fixed_claim_regions, FixedClaim, i);
+            if (c->end >= mmio_start && c->start <= mmio_end) {
+                highest = MAX(highest, c->end);
+            }
+        }
+    }
+    return ROUND_UP(highest + 1, 0x1000); /* 4K align for new window */
+}
+
+static bool
+pci_bus_phase3_ensure_parent_prefetch_window(PCIBus *bus, PciProgramCtx *pctx,
+                                             PCIDevice *parent_bridge, uint64_t mmio_end)
+{
+    PCIBus *parent_bus;
+    PCIDevice *grandparent;
+    uint64_t parent_win_base, parent_win_limit, next_in_subtree;
+    uint64_t required, window_base, window_limit;
+    bool window_not_programmed;
+    bool parent_in_mmio64;
+
+    window_base = pci_bridge_get_base(parent_bridge, PCI_BASE_ADDRESS_MEM_PREFETCH);
+    window_limit = pci_bridge_get_limit(parent_bridge, PCI_BASE_ADDRESS_MEM_PREFETCH);
+    window_not_programmed = (window_base >= window_limit) ||
+                            (window_base < pctx->mmio64_base) || (window_limit > mmio_end);
+    if (!window_not_programmed) {
+        return true;
+    }
+
+    required = size_entire_subtree(bus, pctx->had_fixed);
+    if (required == 0) {
+        return false;
+    }
+    required = ROUND_UP(required, 0x1000);
+
+    parent_bus = pci_get_bus(parent_bridge);
+    grandparent = parent_bus ? pci_bridge_get_device(parent_bus) : NULL;
+    if (!grandparent) {
+        window_base = next_free_from_root(pctx->mmio64_base, pctx->mmio64_size);
+        window_limit = window_base + required - 1;
+        if (window_limit > mmio_end) {
+            error_report("bus [%02x] out of root MMIO64 space", pci_bus_num(bus));
+            exit(1);
+        }
+    } else {
+        parent_win_base = pci_bridge_get_base(grandparent, PCI_BASE_ADDRESS_MEM_PREFETCH);
+        parent_win_limit = pci_bridge_get_limit(grandparent, PCI_BASE_ADDRESS_MEM_PREFETCH);
+        parent_in_mmio64 = (parent_win_limit > parent_win_base) &&
+                           (parent_win_base >= pctx->mmio64_base) && (parent_win_limit <= mmio_end);
+        if (!parent_in_mmio64) {
+            window_base = next_free_from_root(pctx->mmio64_base, pctx->mmio64_size);
+            window_limit = window_base + required - 1;
+            if (window_limit > mmio_end) {
+                error_report("bus [%02x] out of root MMIO64 space", pci_bus_num(bus));
+                exit(1);
+            }
+        } else {
+            next_in_subtree = ROUND_UP(
+                find_highest_assigned_in_bus(parent_bus) + 1, 0x1000);
+            window_base = MAX(parent_win_base, next_in_subtree);
+            window_limit = window_base + required - 1;
+            if (window_limit > parent_win_limit) {
+                error_report("bus [%02x] no room in parent bridge window", pci_bus_num(bus));
+                exit(1);
+            }
+        }
+    }
+    finalize_bridge_window(bus, window_base, window_limit);
+    return true;
+}
+
+static GArray *pci_bus_phase3_collect_unassigned_bars(PCIBus *bus, PciProgramCtx *pctx,
+                                                      uint64_t *out_total_size)
+{
+    PCIDevice *d;
+    PCIIORegion *r;
+    GArray *bars;
+    uint64_t required;
+    int devfn, i;
+
+    required = 0;
+    bars = g_array_new(false, false, sizeof(BarEntry));
+    for (devfn = 0; devfn < ARRAY_SIZE(bus->devices); devfn++) {
+        d = bus->devices[devfn];
+        if (!d) {
+            continue;
+        }
+        for (i = 0; i < PCI_ROM_SLOT; i++) {
+            r = &d->io_regions[i];
+            if (!is_64bit_pref_bar(r) || bar_is_assigned(d, i, pctx->had_fixed)) {
+                continue;
+            }
+            required += r->size;
+            g_array_append_val(
+                bars, ((BarEntry){ .dev = d, .bar_idx = i, .size = r->size }));
+        }
+    }
+    *out_total_size = required;
+    return bars;
+}
+
+static void
+pci_bus_phase3_extend_window_for_bars(PCIBus *bus, PciProgramCtx *pctx,
+                                      PCIDevice *parent_bridge, uint64_t mmio_end,
+                                      uint64_t current, uint64_t required,
+                                      uint64_t window_base, uint64_t *window_limit,
+                                      GArray *bars_this_bus)
+{
+    uint64_t parent_limit, gp_base, gp_limit, new_limit;
+    PCIBus *parent_bus;
+    PCIDevice *grandparent;
+
+    if (current + required <= *window_limit) {
+        return;
+    }
+
+    parent_bus = pci_get_bus(parent_bridge);
+    grandparent = parent_bus ? pci_bridge_get_device(parent_bus) : NULL;
+    parent_limit = mmio_end;
+    if (grandparent) {
+        gp_base = pci_bridge_get_base(grandparent, PCI_BASE_ADDRESS_MEM_PREFETCH);
+        gp_limit = pci_bridge_get_limit(grandparent, PCI_BASE_ADDRESS_MEM_PREFETCH);
+        if (gp_limit > gp_base && gp_base >= pctx->mmio64_base) {
+            parent_limit = gp_limit;
+        }
+    }
+    new_limit = current + required - 1;
+    if (new_limit > parent_limit) {
+        error_report("bus [%02x] out of MMIO space (required 0x%" PRIx64 ")", pci_bus_num(bus),
+                    required);
+        g_array_free(bars_this_bus, true);
+        exit(1);
+    }
+    if (new_limit > *window_limit) {
+        pci_update_prefetch_window(bus, window_base, new_limit);
+        fixed_claim_regions_add(*window_limit + 1, new_limit, parent_bridge, -1);
+        *window_limit = new_limit;
+    }
+}
+
+static void
+pci_bus_phase3_program_bars_and_update_bridge(PCIBus *bus, PCIDevice *parent_bridge,
+                                              uint64_t window_base, uint64_t window_limit,
+                                              uint64_t start_addr, GArray *bars)
+{
+    guint b;
+    BarEntry *be;
+    PCIIORegion *r;
+    uint64_t addr, bar_end, high;
+    PhysBAR pbars_array[PCI_ROM_SLOT];
+
+    g_array_sort(bars, compare_bar_size_desc);
+    addr = start_addr;
+    for (b = 0; b < bars->len; b++) {
+        be = &g_array_index(bars, BarEntry, b);
+        r = &be->dev->io_regions[be->bar_idx];
+        addr = ROUND_UP(addr, r->size);
+        bar_end = addr + r->size - 1;
+        memset(pbars_array, 0, sizeof(pbars_array));
+        pbars_array[be->bar_idx].addr = addr;
+        pbars_array[be->bar_idx].end = bar_end;
+        pbars_array[be->bar_idx].flags = IORESOURCE_PREFETCH;
+        pci_program_prefetch_bars(be->dev, pbars_array);
+        addr = bar_end + 1;
+    }
+    high = find_highest_assigned_in_bus(bus);
+    if (high > window_limit) {
+        pci_update_prefetch_window(bus, window_base, high);
+        fixed_claim_regions_add(window_limit + 1, high, parent_bridge, -1);
+    }
+    g_array_free(bars, true);
+}
+
+/* Allocate and program 64-bit pref BARs for a bus with no fixed-BAR devices. */
+static void pci_bus_phase3_allocate_bars(PCIBus *bus, PciProgramCtx *pctx)
+{
+    uint64_t mmio_end, window_base, window_limit, current, required;
+    PCIDevice *parent_bridge;
+    GArray *bars;
+
+    parent_bridge = pci_bridge_get_device(bus);
+    if (!parent_bridge) {
+        return; /* Root bus has no bridge; skip */
+    }
+
+    mmio_end = pctx->mmio64_base + pctx->mmio64_size - 1;
+    if (!pci_bus_phase3_ensure_parent_prefetch_window(bus, pctx, parent_bridge, mmio_end)) {
+        return;
+    }
+    window_base = pci_bridge_get_base(parent_bridge, PCI_BASE_ADDRESS_MEM_PREFETCH);
+    window_limit = pci_bridge_get_limit(parent_bridge, PCI_BASE_ADDRESS_MEM_PREFETCH);
+    current = ROUND_UP(find_highest_assigned_in_bus(bus) + 1, 0x1000);
+    if (current < window_base) {
+        current = window_base;
+    }
+
+    bars = pci_bus_phase3_collect_unassigned_bars(bus, pctx, &required);
+    if (bars->len == 0) {
+        g_array_free(bars, true);
+        return;
+    }
+    pci_bus_phase3_extend_window_for_bars(bus, pctx, parent_bridge, mmio_end, current,
+                                          required, window_base, &window_limit, bars);
+    pci_bus_phase3_program_bars_and_update_bridge(
+        bus, parent_bridge, window_base, window_limit, current, bars);
+}
+
+/* Run once per bus; act only when the bus has no fixed-BAR devices. */
+static void pci_bus_phase3_allocate_no_fixed_bars(PCIBus *bus, void *opaque)
+{
+    PciProgramCtx *pctx = (PciProgramCtx *)opaque;
+    bool bus_has_fixed = false;
+
+    for (int devfn = 0; devfn < ARRAY_SIZE(bus->devices); devfn++) {
+        PCIDevice *d = bus->devices[devfn];
+        if (d && g_hash_table_contains(pctx->had_fixed, d)) {
+            bus_has_fixed = true;
+            break;
+        }
+    }
+
+    if (bus_has_fixed) {
+        return;
+    }
+    pci_bus_phase3_allocate_bars(bus, pctx);
+}
+
 static bool pci_bus_phase2_fill_bar_lists(PCIBus *bus, PciProgramCtx *pctx,
                                           GArray *fixed_bars, GArray *remaining_bars)
 {
@@ -649,7 +1001,8 @@ void pci_fixed_bar_allocator(PCIBus *root, const PciFixedBarMmioParams *mmio)
     /* Phase 2: pack remaining 64-bit prefetchable BARs and size parent bridge window */
     pci_for_each_bus(bus, pci_bus_phase2_pack_remaining_bars, &pctx);
 
-    /* Phase 3: buses with no fixed-BAR devices; final bridge pass: follow-up */
+    /* Phase 3: allocate BARs for buses that have no fixed-BAR devices */
+    pci_for_each_bus(bus, pci_bus_phase3_allocate_no_fixed_bars, &pctx);
 
     /* Cleanup */
     g_hash_table_destroy(pctx.had_fixed);
-- 
2.34.1



^ permalink raw reply related	[flat|nested] 23+ messages in thread

* [RFC PATCH 6/8] hw/pci: finalize bridge prefetch windows after BAR allocation
  2026-05-08 18:37 [RFC PATCH 0/8] hw/arm/virt, hw/pci: PCI pre-enumeration and fixed BAR allocation Tushar Dave
                   ` (4 preceding siblings ...)
  2026-05-08 18:37 ` [RFC PATCH 5/8] hw/pci: allocate remaining BARs for buses without fixed BARs Tushar Dave
@ 2026-05-08 18:37 ` Tushar Dave
  2026-05-08 18:37 ` [RFC PATCH 7/8] hw/arm/virt: add pcie-mmio-window machine property Tushar Dave
                   ` (4 subsequent siblings)
  10 siblings, 0 replies; 23+ messages in thread
From: Tushar Dave @ 2026-05-08 18:37 UTC (permalink / raw)
  To: qemu-devel
  Cc: alwilliamson, jgg, skolothumtho, qemu-arm, peter.maydell, mst,
	marcel.apfelbaum, devel

Add a final reconciliation pass to update bridge prefetch windows
after all BARs have been assigned across all phases.

This ensures bridge windows accurately reflect final BAR placement
across all buses.

SR-IOV virtual functions are not included when sizing bridge prefetch
apertures and may require additional work.

Signed-off-by: Tushar Dave <tdave@nvidia.com>
---
 hw/pci/pci-resource.c | 89 +++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 89 insertions(+)

diff --git a/hw/pci/pci-resource.c b/hw/pci/pci-resource.c
index e2d2adc7de..01db59c4af 100644
--- a/hw/pci/pci-resource.c
+++ b/hw/pci/pci-resource.c
@@ -190,6 +190,76 @@ static void pci_update_prefetch_window(PCIBus *bus, uint64_t base, uint64_t limi
                                  4);
 }
 
+static void pci_get_bridge_window(PCIBus *bus, void *opaque)
+{
+    PCIDevice *bridge = pci_bridge_get_device(bus);
+    PciAllocCfg *pci_res = (PciAllocCfg *)opaque;
+
+    if (!bridge) {
+        pci_res->wbase = pci_res->mmio32_base;
+        pci_res->wlimit = pci_res->mmio32_base + pci_res->mmio32_size - 1;
+        pci_res->wbase64 = pci_res->mmio64_base;
+        pci_res->wlimit64 = pci_res->mmio64_base + pci_res->mmio64_size - 1;
+    } else {
+        pci_res->wbase = pci_bridge_get_base(bridge, PCI_BASE_ADDRESS_MEM_TYPE_32);
+        pci_res->wlimit = pci_bridge_get_limit(bridge, PCI_BASE_ADDRESS_MEM_TYPE_32);
+        pci_res->wbase64 = pci_bridge_get_base(bridge, PCI_BASE_ADDRESS_MEM_PREFETCH);
+        pci_res->wlimit64 = pci_bridge_get_limit(bridge, PCI_BASE_ADDRESS_MEM_PREFETCH);
+    }
+}
+
+static void pci_collect_mmio64_window(PCIBus *bus, PCIDevice *dev, void *opaque)
+{
+    PciAllocCfg *pci_res = (PciAllocCfg *)opaque;
+    uint64_t rbase, rlimit;
+    uint32_t idx;
+
+    for (idx = 0; idx < PCI_ROM_SLOT; idx++) {
+        PCIIORegion *res = &dev->io_regions[idx];
+
+        if (!res->size) {
+            continue;
+        }
+        rbase = res->addr;
+        rlimit = res->addr + res->size - 1;
+        /* Entire BAR must lie in the window; do not count partial overlap. */
+        if (rbase < pci_res->wbase64 || rlimit > pci_res->wlimit64) {
+            continue;
+        }
+        pci_res->rbase = MIN(pci_res->rbase, rbase);
+        pci_res->rlimit = MAX(pci_res->rlimit, rlimit);
+    }
+
+    if (IS_PCI_BRIDGE(dev)) {
+        rbase = pci_bridge_get_base(dev, PCI_BASE_ADDRESS_MEM_PREFETCH);
+        rlimit = pci_bridge_get_limit(dev, PCI_BASE_ADDRESS_MEM_PREFETCH);
+
+        if ((rbase < pci_res->wbase64) ||
+            (rbase > pci_res->wlimit64) ||
+            (rlimit < pci_res->wbase64) ||
+            (rlimit > pci_res->wlimit64)) {
+            return;
+        }
+
+        pci_res->rbase = MIN(pci_res->rbase, rbase);
+        pci_res->rlimit = MAX(pci_res->rlimit, rlimit);
+    }
+}
+
+static void pci_bus_update_prefetch_window(PCIBus *bus, void *opaque)
+{
+    PciAllocCfg *pci_res = (PciAllocCfg *)opaque;
+    pci_res->rbase = ~0;
+    pci_res->rlimit = 0;
+
+    assert(pci_bridge_get_device(bus));
+    pci_for_each_device_under_bus(bus, pci_collect_mmio64_window, pci_res);
+
+    if (pci_res->rlimit > pci_res->rbase) {
+        pci_update_prefetch_window(bus, pci_res->rbase, pci_res->rlimit);
+    }
+}
+
 static inline bool is_64bit_pref_bar(PCIIORegion *r)
 {
     if (!r->size) {
@@ -1004,6 +1074,25 @@ void pci_fixed_bar_allocator(PCIBus *root, const PciFixedBarMmioParams *mmio)
     /* Phase 3: allocate BARs for buses that have no fixed-BAR devices */
     pci_for_each_bus(bus, pci_bus_phase3_allocate_no_fixed_bars, &pctx);
 
+    memset(pci_res, 0, sizeof(PciAllocCfg));
+    pci_resource_init_from_mmio(pci_res, mmio);
+
+    /* TODO: 32-bit MMIO/ROM adjustment */
+    /* TODO: PIO assignment */
+    /* TODO: 64-bit non-prefetchable */
+
+    /* Align bridge prefetch window with assigned BAR ranges */
+    pci_get_bridge_window(bus, pci_res);
+
+    QLIST_FOREACH(bus, &bus->child, sibling) {
+        pci_res->bus = bus;
+        /* Use the full mmio64 window */
+        pci_res->wbase64 = pci_res->mmio64_base;
+        pci_res->wlimit64 = pci_res->mmio64_base + pci_res->mmio64_size - 1;
+
+        pci_for_each_bus(bus, pci_bus_update_prefetch_window, pci_res);
+    }
+
     /* Cleanup */
     g_hash_table_destroy(pctx.had_fixed);
     fixed_claim_regions_reset();
-- 
2.34.1



* [RFC PATCH 7/8] hw/arm/virt: add pcie-mmio-window machine property
  2026-05-08 18:37 [RFC PATCH 0/8] hw/arm/virt, hw/pci: PCI pre-enumeration and fixed BAR allocation Tushar Dave
                   ` (5 preceding siblings ...)
  2026-05-08 18:37 ` [RFC PATCH 6/8] hw/pci: finalize bridge prefetch windows after BAR allocation Tushar Dave
@ 2026-05-08 18:37 ` Tushar Dave
  2026-05-08 18:37 ` [RFC PATCH 8/8] hw/arm/virt: add pci-pre-enum " Tushar Dave
                   ` (3 subsequent siblings)
  10 siblings, 0 replies; 23+ messages in thread
From: Tushar Dave @ 2026-05-08 18:37 UTC (permalink / raw)
  To: qemu-devel
  Cc: alwilliamson, jgg, skolothumtho, qemu-arm, peter.maydell, mst,
	marcel.apfelbaum, devel

Introduce a machine property to explicitly set the high PCIe MMIO
window as BASE:SIZE, and apply it in the high memory map.

Usage:
    -machine pcie-mmio-window=0x400000000000:0x400000000000

When using the fixed-bars property to assign guest physical
addresses to PCI BARs, those addresses must fall within the
machine's MMIO64 window. The default aperture may be too small
or not cover the required range.

This property allows the aperture to be resized or repositioned
so that all fixed BAR addresses are accessible to the guest.
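
Aside: the constraints the setter enforces — positive BASE and SIZE, a
power-of-two SIZE, and BASE aligned to SIZE — can be restated as a tiny
standalone check. This is only an illustration of the rules; the helper
name is made up and is not part of the patch:

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Mirror of the pcie-mmio-window validity rules (illustrative only):
 *   - BASE and SIZE must both be non-zero
 *   - SIZE must be a power of two
 *   - BASE must be aligned to SIZE
 */
static bool pcie_window_valid(uint64_t base, uint64_t size)
{
    if (base == 0 || size == 0) {
        return false;
    }
    if (size & (size - 1)) {
        return false;               /* not a power of two */
    }
    return base % size == 0;        /* base aligned to size */
}
```

Under these rules the example above, 0x400000000000:0x400000000000, is
accepted, while e.g. a size of 0x300000000000 would be rejected because
it is not a power of two.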

Signed-off-by: Tushar Dave <tdave@nvidia.com>
---
 hw/arm/virt.c         | 87 ++++++++++++++++++++++++++++++++++++++++++-
 include/hw/arm/virt.h |  2 +
 2 files changed, 87 insertions(+), 2 deletions(-)

diff --git a/hw/arm/virt.c b/hw/arm/virt.c
index ec0d8475ca..55f41c7e46 100644
--- a/hw/arm/virt.c
+++ b/hw/arm/virt.c
@@ -1915,8 +1915,31 @@ static void virt_set_high_memmap(VirtMachineState *vms,
 
     for (i = VIRT_LOWMEMMAP_LAST; i < ARRAY_SIZE(extended_memmap); i++) {
         region_enabled = virt_get_high_memmap_enabled(vms, i);
-        region_base = ROUND_UP(base, extended_memmap[i].size);
-        region_size = extended_memmap[i].size;
+
+        if (i == VIRT_HIGH_PCIE_MMIO && vms->override_pcie_mmio_size) {
+            region_base = vms->override_pcie_mmio_base;
+            region_size = vms->override_pcie_mmio_size;
+
+            /* Check for overlap with prior high regions */
+            if (region_base < base) {
+                error_report("pcie-mmio-window base 0x%" PRIx64 " overlaps "
+                            "high memory layout (must be >= 0x%" PRIx64 ")",
+                            (uint64_t)region_base, (uint64_t)base);
+                exit(1);
+            }
+            /* Must not exceed the PA space */
+            if (region_base + region_size > BIT_ULL(pa_bits)) {
+                error_report("pcie-mmio-window [0x%" PRIx64 ", 0x%" PRIx64 ") "
+                            "exceeds %d-bit PA space",
+                            (uint64_t)region_base,
+                            (uint64_t)(region_base + region_size),
+                            pa_bits);
+                exit(1);
+            }
+        } else {
+            region_base = ROUND_UP(base, extended_memmap[i].size);
+            region_size = extended_memmap[i].size;
+        }
 
         vms->memmap[i].base = region_base;
         vms->memmap[i].size = region_size;
@@ -3004,6 +3027,60 @@ static void virt_set_gic_version(Object *obj, const char *value, Error **errp)
     }
 }
 
+static char *virt_get_pcie_mmio_window(Object *obj, Error **errp)
+{
+    VirtMachineState *vms = VIRT_MACHINE(obj);
+
+    if (!vms->override_pcie_mmio_size) {
+        return g_strdup("");
+    }
+    return g_strdup_printf("0x%" PRIx64 ":0x%" PRIx64,
+                           (uint64_t)vms->override_pcie_mmio_base,
+                           (uint64_t)vms->override_pcie_mmio_size);
+}
+
+static void virt_set_pcie_mmio_window(Object *obj, const char *value, Error **errp)
+{
+    VirtMachineState *vms = VIRT_MACHINE(obj);
+    uint64_t base = 0, size = 0;
+    const char *endptr;
+    int ret;
+
+    if (!value || !*value) {
+        return;
+    }
+
+    ret = qemu_strtou64(value, &endptr, 0, &base);
+    if (ret || base == 0) {
+        error_setg(errp, "pcie-mmio-window base must be a positive number");
+        return;
+    }
+    if (*endptr != ':' || !*(endptr + 1)) {
+        error_setg(errp, "pcie-mmio-window expects BASE:SIZE");
+        return;
+    }
+
+    ret = qemu_strtou64(endptr + 1, NULL, 0, &size);
+    if (ret || size == 0) {
+        error_setg(errp, "pcie-mmio-window size must be a positive number");
+        return;
+    }
+
+    if (!is_power_of_2(size)) {
+        error_setg(errp, "pcie-mmio-window size 0x%" PRIx64 " must be a power of 2",
+                   (uint64_t)size);
+        return;
+    }
+    if (base % size != 0) {
+        error_setg(errp, "pcie-mmio-window base 0x%" PRIx64 " must be aligned to size 0x%" PRIx64,
+                  (uint64_t)base, (uint64_t)size);
+        return;
+    }
+
+    vms->override_pcie_mmio_base = base;
+    vms->override_pcie_mmio_size = size;
+}
+
 static char *virt_get_iommu(Object *obj, Error **errp)
 {
     VirtMachineState *vms = VIRT_MACHINE(obj);
@@ -3582,6 +3659,12 @@ static void virt_machine_class_init(ObjectClass *oc, const void *data)
                                           "Set the IOMMU type. "
                                           "Valid values are none and smmuv3");
 
+    object_class_property_add_str(oc, "pcie-mmio-window",
+                                  virt_get_pcie_mmio_window,
+                                  virt_set_pcie_mmio_window);
+    object_class_property_set_description(oc, "pcie-mmio-window",
+                                          "Override the high PCIe MMIO window as BASE:SIZE");
+
     object_class_property_add_bool(oc, "default-bus-bypass-iommu",
                                    virt_get_default_bus_bypass_iommu,
                                    virt_set_default_bus_bypass_iommu);
diff --git a/include/hw/arm/virt.h b/include/hw/arm/virt.h
index 5fcbd1c76f..410df857c7 100644
--- a/include/hw/arm/virt.h
+++ b/include/hw/arm/virt.h
@@ -187,6 +187,8 @@ struct VirtMachineState {
     MemoryRegion *sysmem;
     MemoryRegion *secure_sysmem;
     bool pci_preserve_config;
+    hwaddr override_pcie_mmio_base;
+    hwaddr override_pcie_mmio_size;
 };
 
 #define VIRT_ECAM_ID(high) (high ? VIRT_HIGH_PCIE_ECAM : VIRT_PCIE_ECAM)
-- 
2.34.1



* [RFC PATCH 8/8] hw/arm/virt: add pci-pre-enum machine property
  2026-05-08 18:37 [RFC PATCH 0/8] hw/arm/virt, hw/pci: PCI pre-enumeration and fixed BAR allocation Tushar Dave
                   ` (6 preceding siblings ...)
  2026-05-08 18:37 ` [RFC PATCH 7/8] hw/arm/virt: add pcie-mmio-window machine property Tushar Dave
@ 2026-05-08 18:37 ` Tushar Dave
  2026-05-11  7:46 ` [RFC PATCH 0/8] hw/arm/virt, hw/pci: PCI pre-enumeration and fixed BAR allocation Peter Maydell
                   ` (2 subsequent siblings)
  10 siblings, 0 replies; 23+ messages in thread
From: Tushar Dave @ 2026-05-08 18:37 UTC (permalink / raw)
  To: qemu-devel
  Cc: alwilliamson, jgg, skolothumtho, qemu-arm, peter.maydell, mst,
	marcel.apfelbaum, devel

Add a "pci-pre-enum" option for the virt machine. When enabled, QEMU
performs PCI enumeration and programs BARs before handing control to
firmware.

This is intended for use with the "fixed-bars" property, where the
user assigns fixed BAR addresses and expects firmware to preserve the
configuration.

pci-pre-enum is exposed as a separate machine property rather than
being implied by the presence of fixed-bars. This allows QEMU's PCI
enumeration path to be exercised independently (for example, to
verify that QEMU produces the same device enumeration as EDK2)
without requiring any device to specify fixed BARs.

When enabled, a "pci-enum-done" property is added to the PCI node in
the device tree to indicate to firmware (e.g. EDK2) that PCI
enumeration has already been performed.

When disabled (default), behavior is unchanged.
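
For illustration, with pci-pre-enum=on the virt machine's PCIe host
bridge node ends up carrying the marker cell, roughly like this (the
unit address matches the virt machine's ECAM node; the other standard
properties QEMU generates are omitted here):

```dts
pcie@10000000 {
        compatible = "pci-host-ecam-generic";
        /* set by QEMU when pci-pre-enum=on; firmware skips re-enumeration */
        pci-enum-done = <1>;
};
```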

Signed-off-by: Tushar Dave <tdave@nvidia.com>
---
 hw/arm/virt.c         | 70 +++++++++++++++++++++++++++++++++++++++++--
 include/hw/arm/virt.h |  1 +
 2 files changed, 68 insertions(+), 3 deletions(-)

diff --git a/hw/arm/virt.c b/hw/arm/virt.c
index 55f41c7e46..7d41bfc457 100644
--- a/hw/arm/virt.c
+++ b/hw/arm/virt.c
@@ -52,6 +52,7 @@
 #include "system/whpx.h"
 #include "system/qtest.h"
 #include "system/system.h"
+#include "system/reset.h"
 #include "hw/core/loader.h"
 #include "qapi/error.h"
 #include "qemu/bitops.h"
@@ -94,6 +95,8 @@
 #include "hw/cxl/cxl.h"
 #include "hw/cxl/cxl_host.h"
 #include "qemu/guest-random.h"
+#include "hw/pci/pci-resource.h"
+#include "hw/pci/pci-enumerate.h"
 
 static GlobalProperty arm_virt_compat_defaults[] = {
     { TYPE_VIRTIO_IOMMU_PCI, "aw-bits", "48" },
@@ -1697,6 +1700,10 @@ static void create_pcie(VirtMachineState *vms)
     qemu_fdt_setprop_cell(ms->fdt, nodename, "#interrupt-cells", 1);
     create_pcie_irq_map(ms, vms->gic_phandle, irq, nodename);
 
+    if (vms->pci_pre_enum) {
+        qemu_fdt_setprop_cell(ms->fdt, nodename, "pci-enum-done", 1);
+    }
+
     if (vms->iommu) {
         vms->iommu_phandle = qemu_fdt_alloc_phandle(ms->fdt);
 
@@ -1832,6 +1839,20 @@ static void virt_build_smbios(VirtMachineState *vms)
     }
 }
 
+static void virt_pci_apply_fix_bar_after_reset(void *opaque)
+{
+    VirtMachineState *vms = opaque;
+    PciFixedBarMmioParams mmio = {
+      .mmio32_base = vms->memmap[VIRT_PCIE_MMIO].base,
+      .mmio32_size = vms->memmap[VIRT_PCIE_MMIO].size,
+      .mmio64_base = vms->memmap[VIRT_HIGH_PCIE_MMIO].base,
+      .mmio64_size = vms->memmap[VIRT_HIGH_PCIE_MMIO].size,
+    };
+
+    pci_enumerate_bus(vms->bus);
+    pci_fixed_bar_allocator(vms->bus, &mmio);
+}
+
 static
 void virt_machine_done(Notifier *notifier, void *data)
 {
@@ -1864,11 +1885,30 @@ void virt_machine_done(Notifier *notifier, void *data)
     if (arm_load_dtb(info->dtb_start, info, info->dtb_limit, as, ms, cpu) < 0) {
         exit(1);
     }
-
-    pci_bus_add_fw_cfg_extra_pci_roots(vms->fw_cfg, vms->bus,
-                                       &error_abort);
+    /*
+     * In pci-pre-enum mode, EDK2 does not perform PCI enumeration or
+     * resource assignment (PcdPciDisableBusEnumeration = TRUE). All root
+     * bridges are marked ResourceAssigned, meaning the topology and
+     * MMIO/MMIO64 apertures provided by QEMU are treated as final.
+     *
+     * In this mode, each root bridge is consumed as an independent resource
+     * domain. Exposing additional root bridges (e.g. PXB extra roots) that
+     * share identical MMIO/MMIO64 apertures creates duplicate resource domains
+     * with overlapping address spaces, which is invalid in this mode.
+     *
+     * Therefore, extra root bridges are not exposed in pre-enumeration mode.
+     */
+    if (!vms->pci_pre_enum) {
+        pci_bus_add_fw_cfg_extra_pci_roots(vms->fw_cfg, vms->bus,
+                                           &error_abort);
+    }
 
     virt_acpi_setup(vms);
+
+    if (vms->pci_pre_enum) {
+        qemu_register_reset(virt_pci_apply_fix_bar_after_reset, vms);
+    }
+
     virt_build_smbios(vms);
 }
 
@@ -2988,6 +3028,20 @@ static void virt_set_mte(Object *obj, bool value, Error **errp)
     vms->mte = value;
 }
 
+static bool virt_get_pci_pre_enum(Object *obj, Error **errp)
+{
+    VirtMachineState *vms = VIRT_MACHINE(obj);
+
+    return vms->pci_pre_enum;
+}
+
+static void virt_set_pci_pre_enum(Object *obj, bool value, Error **errp)
+{
+    VirtMachineState *vms = VIRT_MACHINE(obj);
+
+    vms->pci_pre_enum = value;
+}
+
 static char *virt_get_gic_version(Object *obj, Error **errp)
 {
     VirtMachineState *vms = VIRT_MACHINE(obj);
@@ -3726,6 +3780,13 @@ static void virt_machine_class_init(ObjectClass *oc, const void *data)
                                           "in ACPI table header."
                                           "The string may be up to 8 bytes in size");
 
+    object_class_property_add_bool(oc, "pci-pre-enum",
+                                   virt_get_pci_pre_enum,
+                                   virt_set_pci_pre_enum);
+    object_class_property_set_description(oc, "pci-pre-enum",
+                                          "Set on/off to enable/disable PCI enumeration and resource assignment"
+                                          " in QEMU. When enabled, QEMU programs BARs (including fixed-bars"
+                                          " addresses) before handing control to firmware.");
 }
 
 static void virt_instance_init(Object *obj)
@@ -3768,6 +3829,9 @@ static void virt_instance_init(Object *obj)
     /* MTE is disabled by default.  */
     vms->mte = false;
 
+    /* PCI pre-enumeration disabled by default */
+    vms->pci_pre_enum = false;
+
     /* Supply kaslr-seed and rng-seed by default */
     vms->dtb_randomness = true;
 
diff --git a/include/hw/arm/virt.h b/include/hw/arm/virt.h
index 410df857c7..0786f4a4fc 100644
--- a/include/hw/arm/virt.h
+++ b/include/hw/arm/virt.h
@@ -187,6 +187,7 @@ struct VirtMachineState {
     MemoryRegion *sysmem;
     MemoryRegion *secure_sysmem;
     bool pci_preserve_config;
+    bool pci_pre_enum;
     hwaddr override_pcie_mmio_base;
     hwaddr override_pcie_mmio_size;
 };
-- 
2.34.1



* Re: [RFC PATCH 0/8] hw/arm/virt, hw/pci: PCI pre-enumeration and fixed BAR allocation
  2026-05-08 18:37 [RFC PATCH 0/8] hw/arm/virt, hw/pci: PCI pre-enumeration and fixed BAR allocation Tushar Dave
                   ` (7 preceding siblings ...)
  2026-05-08 18:37 ` [RFC PATCH 8/8] hw/arm/virt: add pci-pre-enum " Tushar Dave
@ 2026-05-11  7:46 ` Peter Maydell
  2026-05-11 12:26   ` Jason Gunthorpe
  2026-05-11  9:09 ` Michael S. Tsirkin
  2026-05-11 11:43 ` [edk2-devel] " Ard Biesheuvel
  10 siblings, 1 reply; 23+ messages in thread
From: Peter Maydell @ 2026-05-11  7:46 UTC (permalink / raw)
  To: Tushar Dave
  Cc: qemu-devel, alwilliamson, jgg, skolothumtho, qemu-arm, mst,
	marcel.apfelbaum, devel

On Fri, 8 May 2026 at 19:37, Tushar Dave <tdave@nvidia.com> wrote:
>
> This RFC introduces a mechanism to specify Guest Physical Addresses
> (GPAs) for PCI BARs, allowing explicit placement of guest MMIO BAR
> addresses to match host physical addresses for assigned devices.
>
> On some platforms, P2P DMA is performed between devices within the same
> IOMMU group. The PCI fabric ACS is configured to permit direct P2P
> without going through the host bridge in order to achieve the required
> performance.
>
> To support this multi-device IOMMU group P2P scenario in virtualization,
> the VM may need to use the same MMIO BAR addresses as the host physical
> address layout.

This feels like something's wrong in the design. A VM doesn't
necessarily have the same memory layout as the host: the
VM hardware is all about making that possible.

> Why QEMU programs PCI resources rather than EDK2:
>
> To support fixed BAR placement, QEMU performs PCI bus enumeration and
> resource assignment prior to firmware execution. EDK2 already provides
> a PCD-controlled mechanism (PcdPciDisableBusEnumeration) that allows
> the platform to skip PCI enumeration and resource allocation. This
> series leverages that mechanism so that, when enabled, firmware runs in
> a discovery-only mode and preserves the configuration established by
> QEMU.

I'm definitely not enthusiastic about having QEMU do PCI bus
enumeration. This isn't the way the hardware does it, and it's a
lot of code that's duplicating what the guest already has (there's
over a thousand lines of code in this patchset).

> We use the virt machine in this series as the concrete example
> consuming the fixed-BAR model. Other machines may require their own
> machine-specific mechanism (such as pcie-mmio-window) if they want to
> adopt the same approach.
>
> Example usage:
>
>   -machine virt,...,pcie-mmio-window=0x400000000000:0x400000000000,pci-pre-enum=on \
>   -device vfio-pci,host=0009:06:00.0,id=dev0 \
>   -set device.dev0.fixed-bars=bar2@0x6b8000000000,bar4@0x6c8000000000

...and you end up with enormous command lines like this full of
magic numbers relating to address space layout.

I think it would be better to find a way of doing this that
doesn't have the "VM address space layout has to match the
host layout" restriction.

-- PMM


* Re: [RFC PATCH 0/8] hw/arm/virt, hw/pci: PCI pre-enumeration and fixed BAR allocation
  2026-05-08 18:37 [RFC PATCH 0/8] hw/arm/virt, hw/pci: PCI pre-enumeration and fixed BAR allocation Tushar Dave
                   ` (8 preceding siblings ...)
  2026-05-11  7:46 ` [RFC PATCH 0/8] hw/arm/virt, hw/pci: PCI pre-enumeration and fixed BAR allocation Peter Maydell
@ 2026-05-11  9:09 ` Michael S. Tsirkin
  2026-05-11 18:10   ` Tushar Dave
  2026-05-11 11:43 ` [edk2-devel] " Ard Biesheuvel
  10 siblings, 1 reply; 23+ messages in thread
From: Michael S. Tsirkin @ 2026-05-11  9:09 UTC (permalink / raw)
  To: Tushar Dave
  Cc: qemu-devel, alwilliamson, jgg, skolothumtho, qemu-arm,
	peter.maydell, marcel.apfelbaum, devel

On Fri, May 08, 2026 at 01:37:09PM -0500, Tushar Dave wrote:
> This RFC introduces a mechanism to specify Guest Physical Addresses
> (GPAs) for PCI BARs, allowing explicit placement of guest MMIO BAR
> addresses to match host physical addresses for assigned devices.
> 
> On some platforms, P2P DMA is performed between devices within the same
> IOMMU group. The PCI fabric ACS is configured to permit direct P2P
> without going through the host bridge in order to achieve the required
> performance.

Pass this info to guest firmware, let it set bars any way it wants?



* Re: [edk2-devel] [RFC PATCH 0/8] hw/arm/virt, hw/pci: PCI pre-enumeration and fixed BAR allocation
  2026-05-08 18:37 [RFC PATCH 0/8] hw/arm/virt, hw/pci: PCI pre-enumeration and fixed BAR allocation Tushar Dave
                   ` (9 preceding siblings ...)
  2026-05-11  9:09 ` Michael S. Tsirkin
@ 2026-05-11 11:43 ` Ard Biesheuvel
  2026-05-12 17:25   ` Tushar Dave
  10 siblings, 1 reply; 23+ messages in thread
From: Ard Biesheuvel @ 2026-05-11 11:43 UTC (permalink / raw)
  To: devel@edk2.groups.io, tdave, qemu-devel
  Cc: alwilliamson, jgg, skolothumtho, qemu-arm, peter.maydell, mst,
	marcel.apfelbaum

Hello Tushar,

On Fri, 8 May 2026, at 20:37, Tushar Dave via groups.io wrote:
> This RFC introduces a mechanism to specify Guest Physical Addresses
> (GPAs) for PCI BARs, allowing explicit placement of guest MMIO BAR
> addresses to match host physical addresses for assigned devices.
>
> On some platforms, P2P DMA is performed between devices within the same
> IOMMU group. The PCI fabric ACS is configured to permit direct P2P
> without going through the host bridge in order to achieve the required
> performance.
>
> To support this multi-device IOMMU group P2P scenario in virtualization,
> the VM may need to use the same MMIO BAR addresses as the host physical
> address layout.
>

Did you consider implementing this using Enhanced Allocation (EA)? If so,
could you explain why it is not suitable here?

Also, I think I understand what the intent is here, but could you describe
the topology in a bit more detail? These are assigned physical PCIe endpoints
behind an emulated host bridge, right? And the BAR needs to reside at an
a priori fixed address so that another PCIe endpoint behind the same emulated
host bridge can DMA straight into it?

Doing PCIe enumeration at yet another level is not a feasible approach imo;
having UEFI and Linux play nice together is already a bit of a challenge.

Is there any way this could be handled by having special rules for inbound
translation in the host bridge driver/implementation?




* Re: [RFC PATCH 0/8] hw/arm/virt, hw/pci: PCI pre-enumeration and fixed BAR allocation
  2026-05-11  7:46 ` [RFC PATCH 0/8] hw/arm/virt, hw/pci: PCI pre-enumeration and fixed BAR allocation Peter Maydell
@ 2026-05-11 12:26   ` Jason Gunthorpe
  2026-05-11 18:38     ` Mohamed Mediouni
  0 siblings, 1 reply; 23+ messages in thread
From: Jason Gunthorpe @ 2026-05-11 12:26 UTC (permalink / raw)
  To: Peter Maydell
  Cc: Tushar Dave, qemu-devel, alwilliamson, skolothumtho, qemu-arm,
	mst, marcel.apfelbaum, devel

On Mon, May 11, 2026 at 08:46:57AM +0100, Peter Maydell wrote:
> On Fri, 8 May 2026 at 19:37, Tushar Dave <tdave@nvidia.com> wrote:
> >
> > This RFC introduces a mechanism to specify Guest Physical Addresses
> > (GPAs) for PCI BARs, allowing explicit placement of guest MMIO BAR
> > addresses to match host physical addresses for assigned devices.
> >
> > On some platforms, P2P DMA is performed between devices within the same
> > IOMMU group. The PCI fabric ACS is configured to permit direct P2P
> > without going through the host bridge in order to achieve the required
> > performance.
> >
> > To support this multi-device IOMMU group P2P scenario in virtualization,
> > the VM may need to use the same MMIO BAR addresses as the host physical
> > address layout.
> 
> This feels like something's wrong in the design. A VM doesn't
> necessarily have the same memory layout as the host: the
> VM hardware is all about making that possible.

The HW running these systems is unfortunately limited and doesn't have
ATS support. Without the right HW features, the physical PCI topology
is leaked into the VM and there is no choice but to have the guest
physical and true physical addresses match; otherwise the VM can't work.

There is no other way to support these VM shapes on this HW.

Newer CPUs in this family have more HW features and won't need to do
these things.

Jason


* Re: [RFC PATCH 0/8] hw/arm/virt, hw/pci: PCI pre-enumeration and fixed BAR allocation
  2026-05-11  9:09 ` Michael S. Tsirkin
@ 2026-05-11 18:10   ` Tushar Dave
  2026-05-11 22:09     ` Michael S. Tsirkin
  0 siblings, 1 reply; 23+ messages in thread
From: Tushar Dave @ 2026-05-11 18:10 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: qemu-devel, alwilliamson, jgg, skolothumtho, qemu-arm,
	peter.maydell, marcel.apfelbaum, devel



On 5/11/2026 4:09 AM, Michael S. Tsirkin wrote:
> On Fri, May 08, 2026 at 01:37:09PM -0500, Tushar Dave wrote:
>> This RFC introduces a mechanism to specify Guest Physical Addresses
>> (GPAs) for PCI BARs, allowing explicit placement of guest MMIO BAR
>> addresses to match host physical addresses for assigned devices.
>>
>> On some platforms, P2P DMA is performed between devices within the same
>> IOMMU group. The PCI fabric ACS is configured to permit direct P2P
>> without going through the host bridge in order to achieve the required
>> performance.
> 
> Pass this info to guest firmware, let it set bars any way it wants?

We are using firmware, relying on the existing EDK2-supported mode
enabled by PcdPciDisableBusEnumeration, where firmware is expected
to preserve the PCI topology and BAR programming established by
the hypervisor.

In our case, the hypervisor is QEMU, which performs PCI enumeration
and resource assignment before handing control to firmware. EDK2
then explicitly refrains from re-enumerating or reallocating PCI
BARs, as this is already a supported firmware behavior.
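For concreteness, the flow might be driven roughly as below. This is only
a sketch built from the property names in the cover letter; the BDFs, the
addresses, and the pcie-mmio-window value syntax are all illustrative, not
a tested invocation:

```shell
# Sketch only: property names from the RFC cover letter; BDFs, addresses,
# and the window syntax are illustrative placeholders.
qemu-system-aarch64 \
    -machine virt,pci-pre-enum=on,pcie-mmio-window=<base>/<size> \
    -bios QEMU_EFI.fd \
    -device vfio-pci,host=0005:01:00.0,fixed-bars=bar0@0x240000000000 \
    -device vfio-pci,host=0005:02:00.0,fixed-bars=bar0@0x240800000000
```

(QEMU_EFI.fd here stands in for an EDK2 build with
PcdPciDisableBusEnumeration enabled.)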

-Tushar




* Re: [RFC PATCH 0/8] hw/arm/virt, hw/pci: PCI pre-enumeration and fixed BAR allocation
  2026-05-11 12:26   ` Jason Gunthorpe
@ 2026-05-11 18:38     ` Mohamed Mediouni
  2026-05-11 20:28       ` Jason Gunthorpe
  0 siblings, 1 reply; 23+ messages in thread
From: Mohamed Mediouni @ 2026-05-11 18:38 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Peter Maydell, Tushar Dave, qemu-devel, alwilliamson,
	skolothumtho, qemu-arm, mst, marcel.apfelbaum, devel



> On 11. May 2026, at 14:26, Jason Gunthorpe <jgg@nvidia.com> wrote:
> 
> On Mon, May 11, 2026 at 08:46:57AM +0100, Peter Maydell wrote:
>> On Fri, 8 May 2026 at 19:37, Tushar Dave <tdave@nvidia.com> wrote:
>>> 
>>> This RFC introduces a mechanism to specify Guest Physical Addresses
>>> (GPAs) for PCI BARs, allowing explicit placement of guest MMIO BAR
>>> addresses to match host physical addresses for assigned devices.
>>> 
>>> On some platforms, P2P DMA is performed between devices within the same
>>> IOMMU group. The PCI fabric ACS is configured to permit direct P2P
>>> without going through the host bridge in order to achieve the required
>>> performance.
>>> 
>>> To support this multi-device IOMMU group P2P scenario in virtualization,
>>> the VM may need to use the same MMIO BAR addresses as the host physical
>>> address layout.
>> 
>> This feels like something's wrong in the design. A VM doesn't
>> necessarily have the same memory layout as the host: the
>> VM hardware is all about making that possible.
> 
> The HW running these systems is unfortunately limited and doesn't have
> ATS support. Without the right HW features the physical PCI topology
> is leaked into the VM and there is no choice but to have the VM guest
> physical and true physical match, otherwise the VM can't work.
> 
> There is no other way to support these VM shapes on this HW.
> 
> Newer CPUs in this family have more HW features and won't need to do
> these things.
> 
> Jason
> 

Hi,

It has been years already since I last looked at Grace (synthetic-)PCIe
handling (and it was with another VMM than QEMU), but...

As you've said, newer parts will not have this rather peculiar requirement,
which makes me think this is better suited as a narrowly scoped workaround
than as an open-ended mechanism. Leaving it open like that risks attracting
future users, and I don't think that makes much sense.

And not very relevant to Grace + GPU systems specifically, but any hope of
live migration support is lost, outside of very awkward placement trickery,
when using this to match host addresses.

Is this specific to the NIC attached directly to the GPU (that is
then attached over C2C) configuration?

Using PCIe passthrough with ATS disabled doesn’t sound like such a great idea
on the security side of things imo.





* Re: [RFC PATCH 0/8] hw/arm/virt, hw/pci: PCI pre-enumeration and fixed BAR allocation
  2026-05-11 18:38     ` Mohamed Mediouni
@ 2026-05-11 20:28       ` Jason Gunthorpe
  0 siblings, 0 replies; 23+ messages in thread
From: Jason Gunthorpe @ 2026-05-11 20:28 UTC (permalink / raw)
  To: Mohamed Mediouni
  Cc: Peter Maydell, Tushar Dave, qemu-devel, alwilliamson,
	skolothumtho, qemu-arm, mst, marcel.apfelbaum, devel

On Mon, May 11, 2026 at 08:38:44PM +0200, Mohamed Mediouni wrote:

> And not very relevant to Grace + GPU systems specifically, but any hopes of live
> migration support are lost outside of very awkward placement trickery when using
> this to match host addresses.

I think live migration is still possible, it does further limit
migration pairs to ones with identical physical topology though.

> Is this specific to the NIC attached directly to the GPU (that is
> then attached over C2C) configuration?

Something like that, it is convoluted.

> Using PCIe passthrough with ATS disabled doesn’t sound like such a great idea
> on the security side of things imo.

The platform was specifically designed in a way that this is secure. Yes,
it is not a great idea in general, but with careful choices it can be made
safe.

Jason



* Re: [RFC PATCH 0/8] hw/arm/virt, hw/pci: PCI pre-enumeration and fixed BAR allocation
  2026-05-11 18:10   ` Tushar Dave
@ 2026-05-11 22:09     ` Michael S. Tsirkin
  0 siblings, 0 replies; 23+ messages in thread
From: Michael S. Tsirkin @ 2026-05-11 22:09 UTC (permalink / raw)
  To: Tushar Dave
  Cc: qemu-devel, alwilliamson, jgg, skolothumtho, qemu-arm,
	peter.maydell, marcel.apfelbaum, devel

On Mon, May 11, 2026 at 01:10:43PM -0500, Tushar Dave wrote:
> 
> 
> On 5/11/2026 4:09 AM, Michael S. Tsirkin wrote:
> > On Fri, May 08, 2026 at 01:37:09PM -0500, Tushar Dave wrote:
> >> This RFC introduces a mechanism to specify Guest Physical Addresses
> >> (GPAs) for PCI BARs, allowing explicit placement of guest MMIO BAR
> >> addresses to match host physical addresses for assigned devices.
> >>
> >> On some platforms, P2P DMA is performed between devices within the same
> >> IOMMU group. The PCI fabric ACS is configured to permit direct P2P
> >> without going through the host bridge in order to achieve the required
> >> performance.
> > 
> > Pass this info to guest firmware, let it set bars any way it wants?
> 
> We are using firmware, relying on the existing EDK2-supported mode
> enabled by PcdPciDisableBusEnumeration, where firmware is expected
> to preserve the PCI topology and BAR programming established by
> the hypervisor.
> 
> In our case, the hypervisor is QEMU, which performs PCI enumeration
> and resource assignment before handing control to firmware. EDK2
> then explicitly refrains from re-enumerating or reallocating PCI
> BARs, as this is already a supported firmware behavior.
> 
> -Tushar


I see no advantage in performing pci enumeration in qemu when firmware
is already doing an adequate job of it. If you want firmware to map
specific devices at specific addresses, pass that info along to it.

-- 
MST




* Re: [edk2-devel] [RFC PATCH 0/8] hw/arm/virt, hw/pci: PCI pre-enumeration and fixed BAR allocation
  2026-05-11 11:43 ` [edk2-devel] " Ard Biesheuvel
@ 2026-05-12 17:25   ` Tushar Dave
  2026-05-12 23:06     ` Alex Williamson
  0 siblings, 1 reply; 23+ messages in thread
From: Tushar Dave @ 2026-05-12 17:25 UTC (permalink / raw)
  To: Ard Biesheuvel, devel@edk2.groups.io, qemu-devel
  Cc: alwilliamson, jgg, skolothumtho, qemu-arm, peter.maydell, mst,
	marcel.apfelbaum



On 5/11/2026 6:43 AM, Ard Biesheuvel wrote:
> Hello Tushar,
> 
> On Fri, 8 May 2026, at 20:37, Tushar Dave via groups.io wrote:
>> This RFC introduces a mechanism to specify Guest Physical Addresses
>> (GPAs) for PCI BARs, allowing explicit placement of guest MMIO BAR
>> addresses to match host physical addresses for assigned devices.
>>
>> On some platforms, P2P DMA is performed between devices within the same
>> IOMMU group. The PCI fabric ACS is configured to permit direct P2P
>> without going through the host bridge in order to achieve the required
>> performance.
>>
>> To support this multi-device IOMMU group P2P scenario in virtualization,
>> the VM may need to use the same MMIO BAR addresses as the host physical
>> address layout.
>>
> 
> Did you consider implementing this using Enhanced Allocation (EA)? If so,
> could you explain why it is not suitable here?

I have not evaluated EA for this design. When I looked at EDK2, I
chose PcdPciDisableBusEnumeration because it cleanly preserves fixed
BAR programming established by the hypervisor — at the cost of QEMU
performing PCI bus number and resource assignment.

I did a quick search and do not see EA support in EDK2. Any pointers
to EA being used in a similar fashion to achieve fixed BAR placement
would be appreciated.

> 
> Also, I think I understand what the intent is here, but could you describe
> the topology in a bit more detail? These are assigned physical PCIe endpoints
> behind an emulated host bridge, right? And the BAR needs to reside at an
> a priori fixed address so that another PCIe endpoint behind the same emulated
> host bridge can DMA straight into it?

Yes, that is all correct.

      -[0000:00]-+-00.0  Host bridge
                 +-01.0  Root Port
                     \-[0000:02]
                          +-00.0 Switch Upstream Port
                          +-01.0 Switch Downstream Port A
                          |      \-[0000:04] Device A
                          +-02.0 Switch Downstream Port B
                                 \-[0000:05] Device B

> 
> Doing PCIe enumeration at yet another level is not a feasible approach imo,
> having UEFI and Linux play nice together is already a bit of a challenge.

I agree, but to clarify: in this case QEMU performs PCI topology
initialization and resource assignment prior to firmware execution,
where EDK2 avoids full PCI bus re-enumeration. Linux sees a fully
enumerated bus from firmware just as it does today. There is no
duplicated enumeration step between firmware and Linux when we use
EDK2 with PcdPciDisableBusEnumeration.

> 
> Is there any way this could be handled by having special rules for inbound
> translation in the host bridge driver/implementation?

Not that I can think of.

Thanks.
-Tushar



* Re: [edk2-devel] [RFC PATCH 0/8] hw/arm/virt, hw/pci: PCI pre-enumeration and fixed BAR allocation
  2026-05-12 17:25   ` Tushar Dave
@ 2026-05-12 23:06     ` Alex Williamson
  2026-05-12 23:12       ` Michael S. Tsirkin
  0 siblings, 1 reply; 23+ messages in thread
From: Alex Williamson @ 2026-05-12 23:06 UTC (permalink / raw)
  To: Tushar Dave, Cédric Le Goater
  Cc: Ard Biesheuvel, devel@edk2.groups.io, qemu-devel, jgg,
	skolothumtho, qemu-arm, peter.maydell, mst, marcel.apfelbaum

On Tue, 12 May 2026 12:25:45 -0500
Tushar Dave <tdave@nvidia.com> wrote:

> On 5/11/2026 6:43 AM, Ard Biesheuvel wrote:
> > Hello Tushar,
> > 
> > On Fri, 8 May 2026, at 20:37, Tushar Dave via groups.io wrote:  
> >> This RFC introduces a mechanism to specify Guest Physical Addresses
> >> (GPAs) for PCI BARs, allowing explicit placement of guest MMIO BAR
> >> addresses to match host physical addresses for assigned devices.
> >>
> >> On some platforms, P2P DMA is performed between devices within the same
> >> IOMMU group. The PCI fabric ACS is configured to permit direct P2P
> >> without going through the host bridge in order to achieve the required
> >> performance.
> >>
> >> To support this multi-device IOMMU group P2P scenario in virtualization,
> >> the VM may need to use the same MMIO BAR addresses as the host physical
> >> address layout.
> >>  
> > 
> > Did you consider implementing this using Enhanced Allocation (EA)? If so,
> > could you explain why it is not suitable here?  
> 
> I have not evaluated EA for this design. When I looked at EDK2, I
> chose PcdPciDisableBusEnumeration because it cleanly preserves fixed
> BAR programming established by the hypervisor — at the cost of QEMU
> performing PCI bus number and resource assignment.
> 
> I did a quick search and do not see EA support in EDK2. Any pointers
> to EA being used in a similar fashion to achieve fixed BAR placement
> would be appreciated.

EA wasn't on my radar either, but I did some research and chatted with
Tushar and I think it could work.  I'll sketch out a rough idea of what
it might look like.

EA describes BAR equivalents (fixed base address, size, and type) in a
separate capability while the corresponding device BAR registers appear
unimplemented.  Linux already consumes endpoint EA capabilities and
marks the resulting resources IORESOURCE_PCI_FIXED.  EDK2 doesn't know
about EA (cap 0x14 isn't defined anywhere in MdePkg, and PciBusDxe
never consults it afaict), but that turns out to be useful here rather
than a problem.

Starting at the QEMU device, for a vfio-pci device we'd need to
virtualize the real BARs as unimplemented and surface that information
via a synthesized EA capability instead.  It's debatable whether this
is a generic PCI mechanism or vfio-pci specific, whether HPA is
automatically used as the base address for vfio-pci devices or
user-specified, and the capability offset in config space.  None of
those fundamentally change the shape of the flow.

For the absolute bare-minimum level of support (EA device on the root
complex, EA resources don't overlap the VM address space or MMIO range,
EDK2 firmware, Linux guest booted with pci=nocrs) I think this actually
works with just adding the EA capability above.  Let's walk through
those constraints and how we relax them.

At the firmware level we lean on the real BAR registers being
unimplemented for EA devices, so EDK2 allocates no MMIO or IO resources
for them.  Only bus numbers get assigned if the EA device sits in a PCI
hierarchy.  That's exactly what we want, EDK2 doing conventional bus
assignment but staying out of the EA resource flow entirely.

Instead of firmware EA enlightenment we lean on the guest OS.  Linux
reads endpoint EA today, but the bridge aperture sizing path ignores
those fixed resources.  As Tushar's series demonstrates, generically
handling mixed "fixed-BAR" and programmable-BAR devices in one
hierarchy is hard.  An incremental Linux enhancement that greatly
simplifies the problem space would be to program bridge apertures only
for hierarchies consisting entirely of fixed resources.  The math
becomes trivial (window spans min..max of fixed children, aligned to
bridge granularity), and there's no regression risk, these hierarchies
currently fail silently.  The sizer ignores fixed children and the
fixed-claim walk-up finds no containing parent.  This enhancement,
plus the homogeneous-hierarchy constraint, removes the root-complex
constraint and lets us mirror the bare-metal topologies we need.

Resource ranges are a bit messier.  The extent of the EA device ranges
could be determined in QEMU and the VM address map adjusted to prevent
overlap.  Tushar already has a similar user-specified machine option in
this series.  That range also needs to reach the guest as a CRS (to
avoid pci=nocrs) but needs to stay distinct from the DT range passed to
EDK2 for programmable BAR devices so EDK2 won't place a programmable
BAR or bridge window into the EA region.  So long as we keep EA and
programmable devices in separate hierarchies, EDK2 only needs the
programmable range via DT and we can add the EA range as additional CRS
ranges visible only to the guest.

In practice, EDK2 programs all the programmable devices and the EA
devices live entirely in the additional CRS.  A possibly cleaner
alternative is additional PXB host bridges for the EA devices, each
with its own CRS.  That sidesteps the DT/CRS split entirely since the
EA PXB has nothing for EDK2 to allocate anyway.

If we agree that homogeneous hierarchies (no mixing of EA and
programmable BARs) is a reasonable constraint, and possibly extend that
to homogeneous per host bridge to simplify the CRS mapping, we have the
following work items:

 * Extend Linux EA support to program bridge apertures for subordinate
   homogeneous EA hierarchies.

 * Develop options to virtualize programmable BARs as EA for vfio-pci
   devices, if not generically for the benefit of testing.

 * Implement a way to poke holes in the VM address space and plumb
   through to account for addresses used by EA devices.

 * Provide those same ranges to the guest via CRS (but not via DT to
   EDK2), or alternatively expose them through additional PXB host
   bridges.

Does that shape roughly seem accurate?  Are there additional gaps I've
missed?  Thanks,

Alex



* Re: [edk2-devel] [RFC PATCH 0/8] hw/arm/virt, hw/pci: PCI pre-enumeration and fixed BAR allocation
  2026-05-12 23:06     ` Alex Williamson
@ 2026-05-12 23:12       ` Michael S. Tsirkin
  2026-05-12 23:57         ` Alex Williamson
  0 siblings, 1 reply; 23+ messages in thread
From: Michael S. Tsirkin @ 2026-05-12 23:12 UTC (permalink / raw)
  To: Alex Williamson
  Cc: Tushar Dave, Cédric Le Goater, Ard Biesheuvel,
	devel@edk2.groups.io, qemu-devel, jgg, skolothumtho, qemu-arm,
	peter.maydell, marcel.apfelbaum

On Tue, May 12, 2026 at 05:06:50PM -0600, Alex Williamson wrote:
> On Tue, 12 May 2026 12:25:45 -0500
> Tushar Dave <tdave@nvidia.com> wrote:
> 
> > On 5/11/2026 6:43 AM, Ard Biesheuvel wrote:
> > > Hello Tushar,
> > > 
> > > On Fri, 8 May 2026, at 20:37, Tushar Dave via groups.io wrote:  
> > >> This RFC introduces a mechanism to specify Guest Physical Addresses
> > >> (GPAs) for PCI BARs, allowing explicit placement of guest MMIO BAR
> > >> addresses to match host physical addresses for assigned devices.
> > >>
> > >> On some platforms, P2P DMA is performed between devices within the same
> > >> IOMMU group. The PCI fabric ACS is configured to permit direct P2P
> > >> without going through the host bridge in order to achieve the required
> > >> performance.
> > >>
> > >> To support this multi-device IOMMU group P2P scenario in virtualization,
> > >> the VM may need to use the same MMIO BAR addresses as the host physical
> > >> address layout.
> > >>  
> > > 
> > > Did you consider implementing this using Enhanced Allocation (EA)? If so,
> > > could you explain why it is not suitable here?  
> > 
> > I have not evaluated EA for this design. When I looked at EDK2, I
> > chose PcdPciDisableBusEnumeration because it cleanly preserves fixed
> > BAR programming established by the hypervisor — at the cost of QEMU
> > performing PCI bus number and resource assignment.
> > 
> > I did a quick search and do not see EA support in EDK2. Any pointers
> > to EA being used in a similar fashion to achieve fixed BAR placement
> > would be appreciated.
> 
> EA wasn't on my radar either, but I did some research and chatted with
> Tushar and I think it could work.  I'll sketch out a rough idea of what
> it might look like.
> 
> EA describes BAR equivalents (fixed base address, size, and type) in a
> separate capability while the corresponding device BAR registers appear
> unimplemented.  Linux already consumes endpoint EA capabilities and
> marks the resulting resources IORESOURCE_PCI_FIXED.  EDK2 doesn't know
> about EA (cap 0x14 isn't defined anywhere in MdePkg, and PciBusDxe
> never consults it afaict), but that turns out to be useful here rather
> than a problem.
> 
> Starting at the QEMU device, for a vfio-pci device we'd need to
> virtualize the real BARs as unimplemented and surface that information
> via a synthesized EA capability instead.  It's debatable whether this
> is a generic PCI mechanism or vfio-pci specific, whether HPA is
> automatically used as the base address for vfio-pci devices or
> user-specified, and the capability offset in config space.  None of
> those fundamentally change the shape of the flow.
> 
> For the absolute bare-minimum level of support (EA device on the root
> complex, EA resources don't overlap the VM address space or MMIO range,
> EDK2 firmware, Linux guest booted with pci=nocrs) I think this actually
> works with just adding the EA capability above.  Let's walk through
> those constraints and how we relax them.
> 
> At the firmware level we lean on the real BAR registers being
> unimplemented for EA devices, so EDK2 allocates no MMIO or IO resources
> for them.  Only bus numbers get assigned if the EA device sits in a PCI
> hierarchy.  That's exactly what we want, EDK2 doing conventional bus
> assignment but staying out of the EA resource flow entirely.
> 
> Instead of firmware EA enlightenment we lean on the guest OS.  Linux
> reads endpoint EA today, but the bridge aperture sizing path ignores
> those fixed resources.  As Tushar's series demonstrates, generically
> handling mixed "fixed-BAR" and programmable-BAR devices in one
> hierarchy is hard.  An incremental Linux enhancement that greatly
> simplifies the problem space would be to program bridge apertures only
> for hierarchies consisting entirely of fixed resources.  The math
> becomes trivial (window spans min..max of fixed children, aligned to
> bridge granularity), and there's no regression risk: these hierarchies
> currently fail silently, since the sizer ignores fixed children and the
> fixed-claim walk-up finds no containing parent.  This enhancement,
> plus the homogeneous-hierarchy constraint, removes the root-complex
> constraint and lets us mirror the bare-metal topologies we need.
> 
> Resource ranges are a bit messier.  The extent of the EA device ranges
> could be determined in QEMU and the VM address map adjusted to prevent
> overlap.  Tushar already has a similar user-specified machine option in
> this series.  That range also needs to reach the guest as a CRS (to
> avoid pci=nocrs) but needs to stay distinct from the DT range passed to
> EDK2 for programmable BAR devices so EDK2 won't place a programmable
> BAR or bridge window into the EA region.  So long as we keep EA and
> programmable devices in separate hierarchies, EDK2 only needs the
> programmable range via DT and we can add the EA range as additional CRS
> ranges visible only to the guest.
> 
> In practice, EDK2 programs all the programmable devices and the EA
> devices live entirely in the additional CRS.  A possibly cleaner
> alternative is additional PXB host bridges for the EA devices, each
> with its own CRS.  That sidesteps the DT/CRS split entirely since the
> EA PXB has nothing for EDK2 to allocate anyway.
> 
> If we agree that homogeneous hierarchies (no mixing of EA and
> programmable BARs) is a reasonable constraint, and possibly extend that
> to homogeneous per host bridge to simplify the CRS mapping, we have the
> following work items:
> 
>  * Extend Linux EA support to program bridge apertures for subordinate
>    homogeneous EA hierarchies.
> 
>  * Develop options to virtualize programmable BARs as EA for vfio-pci
>    devices, if not generically for the benefit of testing.
> 
>  * Implement a way to poke holes in the VM address space and plumb
>    through to account for addresses used by EA devices.
> 
>  * Provide those same ranges to the guest via CRS (but not via DT to
>    EDK2), or alternatively expose them through additional PXB host
>    bridges.
> 
> Does that shape roughly seem accurate?  Are there additional gaps I've
> missed?  Thanks,
> 
> Alex


Just one question: why not do it in firmware, so that Windows is
conceivably handled as well?




* Re: [edk2-devel] [RFC PATCH 0/8] hw/arm/virt, hw/pci: PCI pre-enumeration and fixed BAR allocation
  2026-05-12 23:12       ` Michael S. Tsirkin
@ 2026-05-12 23:57         ` Alex Williamson
  2026-05-13 11:36           ` Jason Gunthorpe
  2026-05-13 14:25           ` Ard Biesheuvel
  0 siblings, 2 replies; 23+ messages in thread
From: Alex Williamson @ 2026-05-12 23:57 UTC (permalink / raw)
  To: Michael S. Tsirkin, Alex Williamson
  Cc: Tushar Dave, Cédric Le Goater, Ard Biesheuvel,
	devel@edk2.groups.io, qemu-devel, Jason Gunthorpe,
	Shameer Kolothum, qemu-arm, Peter Maydell, marcel.apfelbaum

On Tue, May 12, 2026, at 5:12 PM, Michael S. Tsirkin wrote:
> On Tue, May 12, 2026 at 05:06:50PM -0600, Alex Williamson wrote:
>> If we agree that homogeneous hierarchies (no mixing of EA and
>> programmable BARs) is a reasonable constraint, and possibly extend that
>> to homogeneous per host bridge to simplify the CRS mapping, we have the
>> following work items:
>> 
>>  * Extend Linux EA support to program bridge apertures for subordinate
>>    homogeneous EA hierarchies.
>> 
>>  * Develop options to virtualize programmable BARs as EA for vfio-pci
>>    devices, if not generically for the benefit of testing.
>> 
>>  * Implement a way to poke holes in the VM address space and plumb
>>    through to account for addresses used by EA devices.
>> 
>>  * Provide those same ranges to the guest via CRS (but not via DT to
>>    EDK2), or alternatively expose them through additional PXB host
>>    bridges.
>> 
>> Does that shape roughly seem accurate?  Are there additional gaps I've
>> missed?  Thanks,
>
> Just one question: why not do it in firmware, so that Windows is
> conceivably handled as well?

I suppose someone could chime in if they have a similar requirement for
Windows guests.  Otherwise, the incremental effort to extend Linux EA
support seems smaller, though I also don't know what EA support, if any,
Windows has.  Regardless, improving Linux EA support might help elsewhere
and doesn't preclude edk2 support in the future.  Thanks,

Alex



* Re: [edk2-devel] [RFC PATCH 0/8] hw/arm/virt, hw/pci: PCI pre-enumeration and fixed BAR allocation
  2026-05-12 23:57         ` Alex Williamson
@ 2026-05-13 11:36           ` Jason Gunthorpe
  2026-05-13 14:25           ` Ard Biesheuvel
  1 sibling, 0 replies; 23+ messages in thread
From: Jason Gunthorpe @ 2026-05-13 11:36 UTC (permalink / raw)
  To: Alex Williamson
  Cc: Michael S. Tsirkin, Alex Williamson, Tushar Dave,
	Cédric Le Goater, Ard Biesheuvel, devel@edk2.groups.io,
	qemu-devel, Shameer Kolothum, qemu-arm, Peter Maydell,
	marcel.apfelbaum

On Tue, May 12, 2026 at 05:57:19PM -0600, Alex Williamson wrote:
> On Tue, May 12, 2026, at 5:12 PM, Michael S. Tsirkin wrote:
> > On Tue, May 12, 2026 at 05:06:50PM -0600, Alex Williamson wrote:
> >> If we agree that homogeneous hierarchies (no mixing of EA and
> >> programmable BARs) is a reasonable constraint, and possibly extend that
> >> to homogeneous per host bridge to simplify the CRS mapping, we have the
> >> following work items:
> >> 
> >>  * Extend Linux EA support to program bridge apertures for subordinate
> >>    homogeneous EA hierarchies.
> >> 
> >>  * Develop options to virtualize programmable BARs as EA for vfio-pci
> >>    devices, if not generically for the benefit of testing.
> >> 
> >>  * Implement a way to poke holes in the VM address space and plumb
> >>    through to account for addresses used by EA devices.
> >> 
> >>  * Provide those same ranges to the guest via CRS (but not via DT to
> >>    EDK2), or alternatively expose them through additional PXB host
> >>    bridges.
> >> 
> >> Does that shape roughly seem accurate?  Are there additional gaps I've
> >> missed?  Thanks,
> >
> > Just one question: why not do it in firmware, so that Windows is
> > conceivably handled as well?
> 
> I suppose someone could chime in if they have a similar requirement
> for Windows guests.  Otherwise, the incremental effort to extend
> Linux EA support seems smaller, though I also don't know what, if
> any support Windows has for EA to bother.  Regardless, improving
> Linux EA support might help elsewhere and doesn't preclude edk2
> support in the future.  Thanks,

I think there are specific already-deployed distros that need to work
under qemu though - so I would discount anything that needs kernel
changes to work.

Jason



* Re: [edk2-devel] [RFC PATCH 0/8] hw/arm/virt, hw/pci: PCI pre-enumeration and fixed BAR allocation
  2026-05-12 23:57         ` Alex Williamson
  2026-05-13 11:36           ` Jason Gunthorpe
@ 2026-05-13 14:25           ` Ard Biesheuvel
  1 sibling, 0 replies; 23+ messages in thread
From: Ard Biesheuvel @ 2026-05-13 14:25 UTC (permalink / raw)
  To: Alex Williamson, Michael S. Tsirkin, Alex Williamson
  Cc: Tushar Dave, Cédric Le Goater, devel@edk2.groups.io,
	qemu-devel, Jason Gunthorpe, Shameer Kolothum, qemu-arm,
	Peter Maydell, marcel.apfelbaum


On Wed, 13 May 2026, at 01:57, Alex Williamson wrote:
> On Tue, May 12, 2026, at 5:12 PM, Michael S. Tsirkin wrote:
>> On Tue, May 12, 2026 at 05:06:50PM -0600, Alex Williamson wrote:
>>> If we agree that homogeneous hierarchies (no mixing of EA and
>>> programmable BARs) is a reasonable constraint, and possibly extend
>>> that to homogeneous per host bridge to simplify the CRS mapping, we
>>> have the following work items:
>>>
>>>  * Extend Linux EA support to program bridge apertures for
>>>    subordinate homogeneous EA hierarchies.
>>>
>>>  * Develop options to virtualize programmable BARs as EA for vfio-
>>>    pci devices, if not generically for the benefit of testing.
>>>
>>>  * Implement a way to poke holes in the VM address space and plumb
>>>    through to account for addresses used by EA devices.
>>>
>>>  * Provide those same ranges to the guest via CRS (but not via DT to
>>>    EDK2), or alternatively expose them through additional PXB host
>>>    bridges.
>>>
>>> Does that shape roughly seem accurate?  Are there additional gaps
>>> I've missed?  Thanks,
>>
>> Just one question: why not do it in firmware, so that Windows is
>> conceivably handled as well?
>
> I suppose someone could chime in if they have a similar requirement
> for Windows guests.  Otherwise, the incremental effort to extend Linux
> EA support seems smaller, though I also don't know what, if any
> support Windows has for EA to bother.  Regardless, improving Linux EA
> support might help elsewhere and doesn't preclude edk2 support in the
> future. Thanks,
>

If EA is too much of a hassle to implement, another avenue that you
might explore is EFI_INCOMPATIBLE_PCI_DEVICE_SUPPORT_PROTOCOL in edk2,
which can be implemented by the platform to inform the PCI core about
non-PCI compliant devices that have special requirements.

While it is supposed to support this use case too, the PCI resource
allocation code in EDK2 currently does not correctly support fixed
resources that are reported by this protocol, but getting that fixed
(and implementing the protocol in your firmware) might be a shorter
path to getting this hardware supported under any OS (assuming EFI
boot) than EA.



end of thread, other threads:[~2026-05-13 14:26 UTC | newest]

Thread overview: 23+ messages
2026-05-08 18:37 [RFC PATCH 0/8] hw/arm/virt, hw/pci: PCI pre-enumeration and fixed BAR allocation Tushar Dave
2026-05-08 18:37 ` [RFC PATCH 1/8] hw/pci: add fixed-bars property to allow fixed BAR addresses Tushar Dave
2026-05-08 18:37 ` [RFC PATCH 2/8] hw/pci: enumerate PCI bus and program bridge bus numbers Tushar Dave
2026-05-08 18:37 ` [RFC PATCH 3/8] hw/pci: introduce allocator for fixed BAR placement Tushar Dave
2026-05-08 18:37 ` [RFC PATCH 4/8] hw/pci: pack remaining BARs and update bridge windows Tushar Dave
2026-05-08 18:37 ` [RFC PATCH 5/8] hw/pci: allocate remaining BARs for buses without fixed BARs Tushar Dave
2026-05-08 18:37 ` [RFC PATCH 6/8] hw/pci: finalize bridge prefetch windows after BAR allocation Tushar Dave
2026-05-08 18:37 ` [RFC PATCH 7/8] hw/arm/virt: add pcie-mmio-window machine property Tushar Dave
2026-05-08 18:37 ` [RFC PATCH 8/8] hw/arm/virt: add pci-pre-enum " Tushar Dave
2026-05-11  7:46 ` [RFC PATCH 0/8] hw/arm/virt, hw/pci: PCI pre-enumeration and fixed BAR allocation Peter Maydell
2026-05-11 12:26   ` Jason Gunthorpe
2026-05-11 18:38     ` Mohamed Mediouni
2026-05-11 20:28       ` Jason Gunthorpe
2026-05-11  9:09 ` Michael S. Tsirkin
2026-05-11 18:10   ` Tushar Dave
2026-05-11 22:09     ` Michael S. Tsirkin
2026-05-11 11:43 ` [edk2-devel] " Ard Biesheuvel
2026-05-12 17:25   ` Tushar Dave
2026-05-12 23:06     ` Alex Williamson
2026-05-12 23:12       ` Michael S. Tsirkin
2026-05-12 23:57         ` Alex Williamson
2026-05-13 11:36           ` Jason Gunthorpe
2026-05-13 14:25           ` Ard Biesheuvel
