* [PATCH v3 1/6] PCI: handle PCI->PCIe bridges as well in free_pdev()
2026-01-29 13:05 [PATCH v3 0/6] (v)PCI: extended capability handling Jan Beulich
@ 2026-01-29 13:07 ` Jan Beulich
2026-01-29 13:49 ` Roger Pau Monné
2026-01-29 13:08 ` [PATCH v3 2/6] PCI: determine whether a device has extended config space Jan Beulich
` (4 subsequent siblings)
5 siblings, 1 reply; 27+ messages in thread
From: Jan Beulich @ 2026-01-29 13:07 UTC (permalink / raw)
To: xen-devel@lists.xenproject.org
Cc: Andrew Cooper, Julien Grall, Stefano Stabellini, Anthony PERARD,
Michal Orzel, Roger Pau Monné, Stewart Hildebrand
Don't know how I managed to overlook the freeing side when adding the case
to alloc_pdev().
Fixes: cd2b9f0b1986 ("PCI: handle PCI->PCIe bridges as well in alloc_pdev()")
Signed-off-by: Jan Beulich <jbeulich@suse.com>
---
v3: New.
---
Noticed due to the original patch still applying cleanly, just with an
offset of a few dozen lines.
--- a/xen/drivers/passthrough/pci.c
+++ b/xen/drivers/passthrough/pci.c
@@ -436,6 +436,7 @@ static void free_pdev(struct pci_seg *ps
unsigned long flags;
case DEV_TYPE_PCIe2PCI_BRIDGE:
+ case DEV_TYPE_PCI2PCIe_BRIDGE:
case DEV_TYPE_LEGACY_PCI_BRIDGE:
sec_bus = pci_conf_read8(pdev->sbdf, PCI_SECONDARY_BUS);
sub_bus = pci_conf_read8(pdev->sbdf, PCI_SUBORDINATE_BUS);
^ permalink raw reply [flat|nested] 27+ messages in thread* Re: [PATCH v3 1/6] PCI: handle PCI->PCIe bridges as well in free_pdev()
2026-01-29 13:07 ` [PATCH v3 1/6] PCI: handle PCI->PCIe bridges as well in free_pdev() Jan Beulich
@ 2026-01-29 13:49 ` Roger Pau Monné
0 siblings, 0 replies; 27+ messages in thread
From: Roger Pau Monné @ 2026-01-29 13:49 UTC (permalink / raw)
To: Jan Beulich
Cc: xen-devel@lists.xenproject.org, Andrew Cooper, Julien Grall,
Stefano Stabellini, Anthony PERARD, Michal Orzel,
Stewart Hildebrand
On Thu, Jan 29, 2026 at 02:07:58PM +0100, Jan Beulich wrote:
> Don't know how I managed to overlook the freeing side when adding the case
> to alloc_pdev().
>
> Fixes: cd2b9f0b1986 ("PCI: handle PCI->PCIe bridges as well in alloc_pdev()")
> Signed-off-by: Jan Beulich <jbeulich@suse.com>
Acked-by: Roger Pau Monné <roger.pau@citrix.com>
Thanks, Roger.
^ permalink raw reply [flat|nested] 27+ messages in thread
* [PATCH v3 2/6] PCI: determine whether a device has extended config space
2026-01-29 13:05 [PATCH v3 0/6] (v)PCI: extended capability handling Jan Beulich
2026-01-29 13:07 ` [PATCH v3 1/6] PCI: handle PCI->PCIe bridges as well in free_pdev() Jan Beulich
@ 2026-01-29 13:08 ` Jan Beulich
2026-01-29 14:09 ` Roger Pau Monné
2026-02-02 9:14 ` Jan Beulich
2026-01-29 13:08 ` [PATCH v3 3/6] PCI: don't look for ext-caps when there's no extended cfg space Jan Beulich
` (3 subsequent siblings)
5 siblings, 2 replies; 27+ messages in thread
From: Jan Beulich @ 2026-01-29 13:08 UTC (permalink / raw)
To: xen-devel@lists.xenproject.org
Cc: Andrew Cooper, Julien Grall, Stefano Stabellini, Anthony PERARD,
Michal Orzel, Roger Pau Monné, Stewart Hildebrand
Legacy PCI devices don't have any extended config space. Reading any part
thereof may return all ones or other arbitrary data, e.g. in some cases
base config space contents repeatedly.
Logic follows Linux 6.19-rc's pci_cfg_space_size(), albeit leveraging our
determination of device type; in particular some comments are taken
verbatim from there. Like with Linux'es CONFIG_PCI_QUIRKS, only the alias
detection logic is covered by the new "pci=no-quirks". The singular access
at PCI_CFG_SPACE_SIZE is left unconditional.
Signed-off-by: Jan Beulich <jbeulich@suse.com>
Acked-by: Roger Pau Monné <roger.pau@citrix.com>
---
The warning near the bottom of pci_check_extcfg() may be issued multiple
times for a single device now. Should we try to avoid that?
Note that no vPCI adjustments are done here, but they're going to be
needed: Whatever requires extended capabilities will need re-
evaluating / newly establishing / tearing down in case an invocation of
PHYSDEVOP_pci_mmcfg_reserved alters global state.
---
v3: Add command line (sub-)option.
v2: Major re-work to also check upon PHYSDEVOP_pci_mmcfg_reserved
invocation.
--- a/docs/misc/xen-command-line.pandoc
+++ b/docs/misc/xen-command-line.pandoc
@@ -2009,12 +2009,21 @@ Only effective if CONFIG_PARTIAL_EMULATI
behavior.**
### pci
- = List of [ serr=<bool>, perr=<bool> ]
+ = List of [ serr=<bool>, perr=<bool>, quirks=<bool> ]
+
+* `serr` and `perr`
Default: Signaling left as set by firmware.
-Override the firmware settings, and explicitly enable or disable the
-signalling of PCI System and Parity errors.
+ Override the firmware settings, and explicitly enable or disable the
+ signalling of PCI System and Parity errors.
+
+* `quirks`
+
+ Default: `on`
+
+ In its negative form, allows to suppress certain quirk workarounds, in case
+ they cause issues.
### pci-phantom
> `=[<seg>:]<bus>:<device>,<stride>`
--- a/xen/arch/x86/physdev.c
+++ b/xen/arch/x86/physdev.c
@@ -22,6 +22,8 @@ int physdev_map_pirq(struct domain *d, i
struct msi_info *msi);
int physdev_unmap_pirq(struct domain *d, int pirq);
+int cf_check physdev_check_pci_extcfg(struct pci_dev *pdev, void *arg);
+
#include "x86_64/mmconfig.h"
#ifndef COMPAT
@@ -160,6 +162,17 @@ int physdev_unmap_pirq(struct domain *d,
return ret;
}
+
+int cf_check physdev_check_pci_extcfg(struct pci_dev *pdev, void *arg)
+{
+ const struct physdev_pci_mmcfg_reserved *info = arg;
+
+ ASSERT(pdev->seg == info->segment);
+ if ( pdev->bus >= info->start_bus && pdev->bus <= info->end_bus )
+ pci_check_extcfg(pdev);
+
+ return 0;
+}
#endif /* COMPAT */
ret_t do_physdev_op(int cmd, XEN_GUEST_HANDLE_PARAM(void) arg)
@@ -511,6 +524,11 @@ ret_t do_physdev_op(int cmd, XEN_GUEST_H
ret = pci_mmcfg_reserved(info.address, info.segment,
info.start_bus, info.end_bus, info.flags);
+
+ if ( !ret )
+ ret = pci_segment_iterate(info.segment, physdev_check_pci_extcfg,
+ &info);
+
if ( !ret && has_vpci(currd) && (info.flags & XEN_PCI_MMCFG_RESERVED) )
{
/*
--- a/xen/drivers/passthrough/pci.c
+++ b/xen/drivers/passthrough/pci.c
@@ -183,6 +183,7 @@ custom_param("pci-phantom", parse_phanto
static u16 __read_mostly command_mask;
static u16 __read_mostly bridge_ctl_mask;
+static bool __ro_after_init opt_pci_quirks = true;
static int __init cf_check parse_pci_param(const char *s)
{
@@ -207,6 +208,8 @@ static int __init cf_check parse_pci_par
cmd_mask = PCI_COMMAND_PARITY;
brctl_mask = PCI_BRIDGE_CTL_PARITY;
}
+ else if ( (val = parse_boolean("quirks", s, ss)) >= 0 )
+ opt_pci_quirks = val;
else
rc = -EINVAL;
@@ -422,6 +425,9 @@ static struct pci_dev *alloc_pdev(struct
}
apply_quirks(pdev);
+
+ pci_check_extcfg(pdev);
+
check_pdev(pdev);
return pdev;
@@ -719,6 +725,11 @@ int pci_add_device(u16 seg, u8 bus, u8 d
list_add(&pdev->vf_list, &pf_pdev->vf_list);
}
+
+ if ( !pdev->ext_cfg )
+ printk(XENLOG_WARNING
+ "%pp: VF without extended config space?\n",
+ &pdev->sbdf);
}
}
@@ -1042,6 +1053,79 @@ enum pdev_type pdev_type(u16 seg, u8 bus
return pos ? DEV_TYPE_PCIe_ENDPOINT : DEV_TYPE_PCI;
}
+void pci_check_extcfg(struct pci_dev *pdev)
+{
+ unsigned int pos;
+
+ pdev->ext_cfg = false;
+
+ switch ( pdev->type )
+ {
+ case DEV_TYPE_PCIe_ENDPOINT:
+ case DEV_TYPE_PCIe_BRIDGE:
+ case DEV_TYPE_PCI_HOST_BRIDGE:
+ case DEV_TYPE_PCIe2PCI_BRIDGE:
+ case DEV_TYPE_PCI2PCIe_BRIDGE:
+ break;
+
+ case DEV_TYPE_LEGACY_PCI_BRIDGE:
+ case DEV_TYPE_PCI:
+ pos = pci_find_cap_offset(pdev->sbdf, PCI_CAP_ID_PCIX);
+ if ( !pos ||
+ !(pci_conf_read32(pdev->sbdf, pos + PCI_X_STATUS) &
+ (PCI_X_STATUS_266MHZ | PCI_X_STATUS_533MHZ)) )
+ return;
+ break;
+
+ default:
+ return;
+ }
+
+ /*
+ * Regular PCI devices have 256 bytes, but PCI-X 2 and PCI Express devices
+ * have 4096 bytes. Even if the device is capable, that doesn't mean we
+ * can access it. Maybe we don't have a way to generate extended config
+ * space accesses, or the device is behind a reverse Express bridge. So
+ * we try reading the dword at PCI_CFG_SPACE_SIZE which must either be 0
+ * or a valid extended capability header.
+ */
+ if ( pci_conf_read32(pdev->sbdf, PCI_CFG_SPACE_SIZE) == 0xffffffffU )
+ return;
+
+ if ( opt_pci_quirks )
+ {
+ /*
+ * PCI Express to PCI/PCI-X Bridge Specification, rev 1.0, 4.1.4 says
+ * that when forwarding a type1 configuration request the bridge must
+ * check that the extended register address field is zero. The bridge
+ * is not permitted to forward the transactions and must handle it as
+ * an Unsupported Request. Some bridges do not follow this rule and
+ * simply drop the extended register bits, resulting in the standard
+ * config space being aliased, every 256 bytes across the entire
+ * configuration space. Test for this condition by comparing the first
+ * dword of each potential alias to the vendor/device ID.
+ * Known offenders:
+ * ASM1083/1085 PCIe-to-PCI Reversible Bridge (1b21:1080, rev 01 & 03)
+ * AMD/ATI SBx00 PCI to PCI Bridge (1002:4384, rev 40)
+ */
+ unsigned int sig = pci_conf_read32(pdev->sbdf, PCI_VENDOR_ID);
+
+ for ( pos = PCI_CFG_SPACE_SIZE;
+ pos < PCI_CFG_SPACE_EXP_SIZE; pos += PCI_CFG_SPACE_SIZE )
+ if ( pci_conf_read32(pdev->sbdf, pos) != sig )
+ break;
+
+ if ( pos >= PCI_CFG_SPACE_EXP_SIZE )
+ {
+ printk(XENLOG_WARNING "%pp: extended config space aliases base one\n",
+ &pdev->sbdf);
+ return;
+ }
+ }
+
+ pdev->ext_cfg = true;
+}
+
/*
* find the upstream PCIe-to-PCI/PCIX bridge or PCI legacy bridge
* return 0: the device is integrated PCI device or PCIe
@@ -1842,6 +1926,29 @@ int pci_iterate_devices(int (*handler)(s
return pci_segments_iterate(iterate_all, &iter) ?: iter.rc;
}
+/* Iterate a single PCI segment, with locking but without preemption. */
+int pci_segment_iterate(unsigned int segment,
+ int (*handler)(struct pci_dev *pdev, void *arg),
+ void *arg)
+{
+ struct pci_seg *seg = get_pseg(segment);
+ struct segment_iter iter = {
+ .handler = handler,
+ .arg = arg,
+ };
+
+ if ( !seg )
+ return -ENODEV;
+
+ pcidevs_lock();
+
+ iter.rc = iterate_all(seg, &iter) ?: iter.rc;
+
+ pcidevs_unlock();
+
+ return iter.rc;
+}
+
/*
* Local variables:
* mode: C
--- a/xen/include/xen/pci.h
+++ b/xen/include/xen/pci.h
@@ -126,6 +126,9 @@ struct pci_dev {
nodeid_t node; /* NUMA node */
+ /* Whether the device has (accessible) extended config space. */
+ bool ext_cfg;
+
/* Device to be quarantined, don't automatically re-assign to dom0 */
bool quarantine;
@@ -242,6 +245,11 @@ void pci_check_disable_device(u16 seg, u
int pci_iterate_devices(int (*handler)(struct pci_dev *pdev, void *arg),
void *arg);
+/* Iterate a single PCI segment, with locking but without preemption. */
+int pci_segment_iterate(unsigned int segment,
+ int (*handler)(struct pci_dev *pdev, void *arg),
+ void *arg);
+
uint8_t pci_conf_read8(pci_sbdf_t sbdf, unsigned int reg);
uint16_t pci_conf_read16(pci_sbdf_t sbdf, unsigned int reg);
uint32_t pci_conf_read32(pci_sbdf_t sbdf, unsigned int reg);
@@ -260,6 +268,7 @@ unsigned int pci_find_next_cap_ttl(pci_s
unsigned int *ttl);
unsigned int pci_find_next_cap(pci_sbdf_t sbdf, unsigned int pos,
unsigned int cap);
+void pci_check_extcfg(struct pci_dev *pdev);
unsigned int pci_find_ext_capability(const struct pci_dev *pdev,
unsigned int cap);
unsigned int pci_find_next_ext_capability(const struct pci_dev *pdev,
^ permalink raw reply [flat|nested] 27+ messages in thread* Re: [PATCH v3 2/6] PCI: determine whether a device has extended config space
2026-01-29 13:08 ` [PATCH v3 2/6] PCI: determine whether a device has extended config space Jan Beulich
@ 2026-01-29 14:09 ` Roger Pau Monné
2026-01-29 14:17 ` Jan Beulich
2026-02-02 9:14 ` Jan Beulich
1 sibling, 1 reply; 27+ messages in thread
From: Roger Pau Monné @ 2026-01-29 14:09 UTC (permalink / raw)
To: Jan Beulich
Cc: xen-devel@lists.xenproject.org, Andrew Cooper, Julien Grall,
Stefano Stabellini, Anthony PERARD, Michal Orzel,
Stewart Hildebrand
On Thu, Jan 29, 2026 at 02:08:27PM +0100, Jan Beulich wrote:
> Legacy PCI devices don't have any extended config space. Reading any part
> thereof may return all ones or other arbitrary data, e.g. in some cases
> base config space contents repeatedly.
>
> Logic follows Linux 6.19-rc's pci_cfg_space_size(), albeit leveraging our
> determination of device type; in particular some comments are taken
> verbatim from there. Like with Linux'es CONFIG_PCI_QUIRKS, only the alias
> detection logic is covered by the new "pci=no-quirks". The singular access
> at PCI_CFG_SPACE_SIZE is left unconditional.
>
> Signed-off-by: Jan Beulich <jbeulich@suse.com>
> Acked-by: Roger Pau Monné <roger.pau@citrix.com>
> ---
> The warning near the bottom of pci_check_extcfg() may be issued multiple
> times for a single device now. Should we try to avoid that?
>
> Note that no vPCI adjustments are done here, but they're going to be
> needed: Whatever requires extended capabilities will need re-
> evaluating / newly establishing / tearing down in case an invocation of
> PHYSDEVOP_pci_mmcfg_reserved alters global state.
> ---
> v3: Add command line (sub-)option.
> v2: Major re-work to also check upon PHYSDEVOP_pci_mmcfg_reserved
> invocation.
>
> --- a/docs/misc/xen-command-line.pandoc
> +++ b/docs/misc/xen-command-line.pandoc
> @@ -2009,12 +2009,21 @@ Only effective if CONFIG_PARTIAL_EMULATI
> behavior.**
>
> ### pci
> - = List of [ serr=<bool>, perr=<bool> ]
> + = List of [ serr=<bool>, perr=<bool>, quirks=<bool> ]
> +
> +* `serr` and `perr`
>
> Default: Signaling left as set by firmware.
>
> -Override the firmware settings, and explicitly enable or disable the
> -signalling of PCI System and Parity errors.
> + Override the firmware settings, and explicitly enable or disable the
> + signalling of PCI System and Parity errors.
> +
> +* `quirks`
> +
> + Default: `on`
> +
> + In its negative form, allows to suppress certain quirk workarounds, in case
> + they cause issues.
Not that I oppose to this, but I've assumed that you would introduce
an option to fallback to the previous behavior where Xen would just
assume extended space to be accessible.
I'm fine with this, just assumed it would be a proper fallback to the
previous behavior (which doesn't make much sense as it's partially
bogus when used with vPCI).
Regards, Roger.
^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [PATCH v3 2/6] PCI: determine whether a device has extended config space
2026-01-29 14:09 ` Roger Pau Monné
@ 2026-01-29 14:17 ` Jan Beulich
0 siblings, 0 replies; 27+ messages in thread
From: Jan Beulich @ 2026-01-29 14:17 UTC (permalink / raw)
To: Roger Pau Monné
Cc: xen-devel@lists.xenproject.org, Andrew Cooper, Julien Grall,
Stefano Stabellini, Anthony PERARD, Michal Orzel,
Stewart Hildebrand
On 29.01.2026 15:09, Roger Pau Monné wrote:
> On Thu, Jan 29, 2026 at 02:08:27PM +0100, Jan Beulich wrote:
>> Legacy PCI devices don't have any extended config space. Reading any part
>> thereof may return all ones or other arbitrary data, e.g. in some cases
>> base config space contents repeatedly.
>>
>> Logic follows Linux 6.19-rc's pci_cfg_space_size(), albeit leveraging our
>> determination of device type; in particular some comments are taken
>> verbatim from there. Like with Linux'es CONFIG_PCI_QUIRKS, only the alias
>> detection logic is covered by the new "pci=no-quirks". The singular access
>> at PCI_CFG_SPACE_SIZE is left unconditional.
>>
>> Signed-off-by: Jan Beulich <jbeulich@suse.com>
>> Acked-by: Roger Pau Monné <roger.pau@citrix.com>
>> ---
>> The warning near the bottom of pci_check_extcfg() may be issued multiple
>> times for a single device now. Should we try to avoid that?
>>
>> Note that no vPCI adjustments are done here, but they're going to be
>> needed: Whatever requires extended capabilities will need re-
>> evaluating / newly establishing / tearing down in case an invocation of
>> PHYSDEVOP_pci_mmcfg_reserved alters global state.
>> ---
>> v3: Add command line (sub-)option.
>> v2: Major re-work to also check upon PHYSDEVOP_pci_mmcfg_reserved
>> invocation.
>>
>> --- a/docs/misc/xen-command-line.pandoc
>> +++ b/docs/misc/xen-command-line.pandoc
>> @@ -2009,12 +2009,21 @@ Only effective if CONFIG_PARTIAL_EMULATI
>> behavior.**
>>
>> ### pci
>> - = List of [ serr=<bool>, perr=<bool> ]
>> + = List of [ serr=<bool>, perr=<bool>, quirks=<bool> ]
>> +
>> +* `serr` and `perr`
>>
>> Default: Signaling left as set by firmware.
>>
>> -Override the firmware settings, and explicitly enable or disable the
>> -signalling of PCI System and Parity errors.
>> + Override the firmware settings, and explicitly enable or disable the
>> + signalling of PCI System and Parity errors.
>> +
>> +* `quirks`
>> +
>> + Default: `on`
>> +
>> + In its negative form, allows to suppress certain quirk workarounds, in case
>> + they cause issues.
>
> Not that I oppose to this, but I've assumed that you would introduce
> an option to fallback to the previous behavior where Xen would just
> assume extended space to be accessible.
But that would be wrong. It didn't even occur to me that you could have
wanted that.
Jan
^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [PATCH v3 2/6] PCI: determine whether a device has extended config space
2026-01-29 13:08 ` [PATCH v3 2/6] PCI: determine whether a device has extended config space Jan Beulich
2026-01-29 14:09 ` Roger Pau Monné
@ 2026-02-02 9:14 ` Jan Beulich
2026-02-02 9:21 ` Roger Pau Monné
1 sibling, 1 reply; 27+ messages in thread
From: Jan Beulich @ 2026-02-02 9:14 UTC (permalink / raw)
To: xen-devel@lists.xenproject.org
Cc: Andrew Cooper, Julien Grall, Stefano Stabellini, Anthony PERARD,
Michal Orzel, Roger Pau Monné, Stewart Hildebrand
On 29.01.2026 14:08, Jan Beulich wrote:
> @@ -1042,6 +1053,79 @@ enum pdev_type pdev_type(u16 seg, u8 bus
> return pos ? DEV_TYPE_PCIe_ENDPOINT : DEV_TYPE_PCI;
> }
>
> +void pci_check_extcfg(struct pci_dev *pdev)
> +{
> + unsigned int pos;
> +
> + pdev->ext_cfg = false;
> +
> + switch ( pdev->type )
> + {
> + case DEV_TYPE_PCIe_ENDPOINT:
> + case DEV_TYPE_PCIe_BRIDGE:
> + case DEV_TYPE_PCI_HOST_BRIDGE:
> + case DEV_TYPE_PCIe2PCI_BRIDGE:
> + case DEV_TYPE_PCI2PCIe_BRIDGE:
> + break;
> +
> + case DEV_TYPE_LEGACY_PCI_BRIDGE:
> + case DEV_TYPE_PCI:
> + pos = pci_find_cap_offset(pdev->sbdf, PCI_CAP_ID_PCIX);
> + if ( !pos ||
> + !(pci_conf_read32(pdev->sbdf, pos + PCI_X_STATUS) &
> + (PCI_X_STATUS_266MHZ | PCI_X_STATUS_533MHZ)) )
To not violate Misra rule 7.2 I'll fold in the change below. I guess I'll
further follow up with a patch adjusting other problematic #define-s in
that header, too.
Jan
--- a/xen/include/xen/pci_regs.h
+++ b/xen/include/xen/pci_regs.h
@@ -382,7 +382,7 @@
#define PCI_X_STATUS_MAX_CUM 0x1c000000 /* Designed Max Cumulative Read Size */
#define PCI_X_STATUS_SPL_ERR 0x20000000 /* Rcvd Split Completion Error Msg */
#define PCI_X_STATUS_266MHZ 0x40000000 /* 266 MHz capable */
-#define PCI_X_STATUS_533MHZ 0x80000000 /* 533 MHz capable */
+#define PCI_X_STATUS_533MHZ 0x80000000U /* 533 MHz capable */
/* PCI Express capability registers */
^ permalink raw reply [flat|nested] 27+ messages in thread* Re: [PATCH v3 2/6] PCI: determine whether a device has extended config space
2026-02-02 9:14 ` Jan Beulich
@ 2026-02-02 9:21 ` Roger Pau Monné
0 siblings, 0 replies; 27+ messages in thread
From: Roger Pau Monné @ 2026-02-02 9:21 UTC (permalink / raw)
To: Jan Beulich
Cc: xen-devel@lists.xenproject.org, Andrew Cooper, Julien Grall,
Stefano Stabellini, Anthony PERARD, Michal Orzel,
Stewart Hildebrand
On Mon, Feb 02, 2026 at 10:14:57AM +0100, Jan Beulich wrote:
> On 29.01.2026 14:08, Jan Beulich wrote:
> > @@ -1042,6 +1053,79 @@ enum pdev_type pdev_type(u16 seg, u8 bus
> > return pos ? DEV_TYPE_PCIe_ENDPOINT : DEV_TYPE_PCI;
> > }
> >
> > +void pci_check_extcfg(struct pci_dev *pdev)
> > +{
> > + unsigned int pos;
> > +
> > + pdev->ext_cfg = false;
> > +
> > + switch ( pdev->type )
> > + {
> > + case DEV_TYPE_PCIe_ENDPOINT:
> > + case DEV_TYPE_PCIe_BRIDGE:
> > + case DEV_TYPE_PCI_HOST_BRIDGE:
> > + case DEV_TYPE_PCIe2PCI_BRIDGE:
> > + case DEV_TYPE_PCI2PCIe_BRIDGE:
> > + break;
> > +
> > + case DEV_TYPE_LEGACY_PCI_BRIDGE:
> > + case DEV_TYPE_PCI:
> > + pos = pci_find_cap_offset(pdev->sbdf, PCI_CAP_ID_PCIX);
> > + if ( !pos ||
> > + !(pci_conf_read32(pdev->sbdf, pos + PCI_X_STATUS) &
> > + (PCI_X_STATUS_266MHZ | PCI_X_STATUS_533MHZ)) )
>
> To not violate Misra rule 7.2 I'll fold in the change below. I guess I'll
> further follow up with a patch adjusting other problematic #define-s in
> that header, too.
I'm fine if you want to do a pre-patch fixing all the problematic
defines in the header and put it ahead of the series. Maybe it makes
more sense to fix them all in a single commit than fixing one here and
the rest separately?
Thanks, Roger.
^ permalink raw reply [flat|nested] 27+ messages in thread
* [PATCH v3 3/6] PCI: don't look for ext-caps when there's no extended cfg space
2026-01-29 13:05 [PATCH v3 0/6] (v)PCI: extended capability handling Jan Beulich
2026-01-29 13:07 ` [PATCH v3 1/6] PCI: handle PCI->PCIe bridges as well in free_pdev() Jan Beulich
2026-01-29 13:08 ` [PATCH v3 2/6] PCI: determine whether a device has extended config space Jan Beulich
@ 2026-01-29 13:08 ` Jan Beulich
2026-01-29 13:09 ` [PATCH v3 4/6] vPCI: really no ext-caps without extended config space Jan Beulich
` (2 subsequent siblings)
5 siblings, 0 replies; 27+ messages in thread
From: Jan Beulich @ 2026-01-29 13:08 UTC (permalink / raw)
To: xen-devel@lists.xenproject.org
Cc: Andrew Cooper, Julien Grall, Stefano Stabellini, Anthony PERARD,
Michal Orzel, Roger Pau Monné, Stewart Hildebrand
Avoid interpreting as extended capabilities what may be about anything. In
doing so, vPCI then also won't mis-interpret data from beyond base config
space anymore.
Fixes: 3b35911d709e ("Enable pci mmcfg and ATS for x86_64")
Fixes: a845b50c12f3 ("vpci/header: Emulate extended capability list for dom0")
Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Stewart Hildebrand <stewart.hildebrand@amd.com>
Acked-by: Roger Pau Monné <roger.pau@citrix.com>
---
Because of the multiple prereq changes, despite the Fixes: tags I'm not
quite sure whether to backport this. (I'm leaning towards "no".)
--- a/xen/drivers/pci/pci.c
+++ b/xen/drivers/pci/pci.c
@@ -113,6 +113,12 @@ unsigned int pci_find_next_ext_capabilit
int ttl = 480; /* 3840 bytes, minimum 8 bytes per capability */
unsigned int pos = max(start, PCI_CFG_SPACE_SIZE + 0U);
+ if ( !pdev->ext_cfg )
+ {
+ ASSERT(!start);
+ return 0;
+ }
+
header = pci_conf_read32(pdev->sbdf, pos);
/*
^ permalink raw reply [flat|nested] 27+ messages in thread* [PATCH v3 4/6] vPCI: really no ext-caps without extended config space
2026-01-29 13:05 [PATCH v3 0/6] (v)PCI: extended capability handling Jan Beulich
` (2 preceding siblings ...)
2026-01-29 13:08 ` [PATCH v3 3/6] PCI: don't look for ext-caps when there's no extended cfg space Jan Beulich
@ 2026-01-29 13:09 ` Jan Beulich
2026-01-29 14:13 ` Roger Pau Monné
2026-01-29 13:10 ` [PATCH v3 5/6] x86/PCI: avoid re-evaluation of extended config space accessibility Jan Beulich
2026-01-29 13:10 ` [PATCH v3 6/6] vPCI: re-init extended-capability lists when MMCFG availability changed Jan Beulich
5 siblings, 1 reply; 27+ messages in thread
From: Jan Beulich @ 2026-01-29 13:09 UTC (permalink / raw)
To: xen-devel@lists.xenproject.org; +Cc: Roger Pau Monné, Stewart Hildebrand
For DomU-s, whether to emulate accesses to the first 32 bits of extended
config space as read-as-zero or read-as-all-ones depends on whether a
device actually has extended config space. If it doesn't, read-as-zero
isn't correct; not getting this right may confuse functions like Linux
6.19-rc's pci_ext_cfg_is_aliased().
For Dom0 this then simply allows dropping a later conditional.
Fixes: a845b50c12f3 ("vpci/header: Emulate extended capability list for dom0")
Signed-off-by: Jan Beulich <jbeulich@suse.com>
---
v3: Move code condition to top-level function scope. Eliminate a later
conditional in exchange.
--- a/xen/drivers/vpci/header.c
+++ b/xen/drivers/vpci/header.c
@@ -829,6 +829,9 @@ static int vpci_init_ext_capability_list
{
unsigned int pos = PCI_CFG_SPACE_SIZE;
+ if ( !pdev->ext_cfg )
+ return 0;
+
if ( !is_hardware_domain(pdev->domain) )
/* Extended capabilities read as zero, write ignore for DomU */
return vpci_add_register(pdev->vpci, vpci_read_val, NULL,
@@ -841,10 +844,9 @@ static int vpci_init_ext_capability_list
if ( header == 0xffffffffU )
{
- if ( pos != PCI_CFG_SPACE_SIZE )
- printk(XENLOG_WARNING
- "%pd %pp: broken extended cap list, offset %#x\n",
- pdev->domain, &pdev->sbdf, pos);
+ printk(XENLOG_WARNING
+ "%pd %pp: broken extended cap list, offset %#x\n",
+ pdev->domain, &pdev->sbdf, pos);
return 0;
}
^ permalink raw reply [flat|nested] 27+ messages in thread* Re: [PATCH v3 4/6] vPCI: really no ext-caps without extended config space
2026-01-29 13:09 ` [PATCH v3 4/6] vPCI: really no ext-caps without extended config space Jan Beulich
@ 2026-01-29 14:13 ` Roger Pau Monné
0 siblings, 0 replies; 27+ messages in thread
From: Roger Pau Monné @ 2026-01-29 14:13 UTC (permalink / raw)
To: Jan Beulich; +Cc: xen-devel@lists.xenproject.org, Stewart Hildebrand
On Thu, Jan 29, 2026 at 02:09:27PM +0100, Jan Beulich wrote:
> For DomU-s, whether to emulate accesses to the first 32 bits of extended
> config space as read-as-zero or read-as-all-ones depends on whether a
> device actually has extended config space. If it doesn't, read-as-zero
> isn't correct; not getting this right may confuse functions like Linux
> 6.19-rc's pci_ext_cfg_is_aliased().
>
> For Dom0 this then simply allows dropping a later conditional.
>
> Fixes: a845b50c12f3 ("vpci/header: Emulate extended capability list for dom0")
> Signed-off-by: Jan Beulich <jbeulich@suse.com>
Acked-by: Roger Pau Monné <roger.pau@citrix.com>
Thanks, Roger.
^ permalink raw reply [flat|nested] 27+ messages in thread
* [PATCH v3 5/6] x86/PCI: avoid re-evaluation of extended config space accessibility
2026-01-29 13:05 [PATCH v3 0/6] (v)PCI: extended capability handling Jan Beulich
` (3 preceding siblings ...)
2026-01-29 13:09 ` [PATCH v3 4/6] vPCI: really no ext-caps without extended config space Jan Beulich
@ 2026-01-29 13:10 ` Jan Beulich
2026-01-29 15:32 ` Roger Pau Monné
2026-02-02 8:51 ` Jan Beulich
2026-01-29 13:10 ` [PATCH v3 6/6] vPCI: re-init extended-capability lists when MMCFG availability changed Jan Beulich
5 siblings, 2 replies; 27+ messages in thread
From: Jan Beulich @ 2026-01-29 13:10 UTC (permalink / raw)
To: xen-devel@lists.xenproject.org
Cc: Andrew Cooper, Roger Pau Monné, Stewart Hildebrand
When, during boot, we have already correctly determined availability of
the MMCFG access method for a given bus range, there's then no need to
invoke pci_check_extcfg() again for every of the devices. This in
particular avoids ->ext_cfg to transiently indicate the wrong state.
Switch to using Xen style on lines being touched and immediately adjacent
ones.
Signed-off-by: Jan Beulich <jbeulich@suse.com>
---
v3: New.
--- a/xen/arch/x86/physdev.c
+++ b/xen/arch/x86/physdev.c
@@ -528,6 +528,8 @@ ret_t do_physdev_op(int cmd, XEN_GUEST_H
if ( !ret )
ret = pci_segment_iterate(info.segment, physdev_check_pci_extcfg,
&info);
+ else if ( ret > 0 ) /* Indication of "no change". */
+ ret = 0;
if ( !ret && has_vpci(currd) && (info.flags & XEN_PCI_MMCFG_RESERVED) )
{
--- a/xen/arch/x86/x86_64/mmconfig.h
+++ b/xen/arch/x86/x86_64/mmconfig.h
@@ -74,6 +74,6 @@ int pci_mmcfg_reserved(uint64_t address,
unsigned int flags);
int pci_mmcfg_arch_init(void);
int pci_mmcfg_arch_enable(unsigned int idx);
-void pci_mmcfg_arch_disable(unsigned int idx);
+int pci_mmcfg_arch_disable(unsigned int idx);
#endif /* X86_64_MMCONFIG_H */
--- a/xen/arch/x86/x86_64/mmconfig-shared.c
+++ b/xen/arch/x86/x86_64/mmconfig-shared.c
@@ -388,8 +388,9 @@ static bool __init pci_mmcfg_reject_brok
(unsigned int)cfg->start_bus_number,
(unsigned int)cfg->end_bus_number);
- if (!is_mmconf_reserved(addr, size, i, cfg) ||
- pci_mmcfg_arch_enable(i)) {
+ if ( !is_mmconf_reserved(addr, size, i, cfg) ||
+ pci_mmcfg_arch_enable(i) < 0 )
+ {
pci_mmcfg_arch_disable(i);
valid = 0;
}
@@ -417,8 +418,8 @@ void __init acpi_mmcfg_init(void)
unsigned int i;
pci_mmcfg_arch_init();
- for (i = 0; i < pci_mmcfg_config_num; ++i)
- if (pci_mmcfg_arch_enable(i))
+ for ( i = 0; i < pci_mmcfg_config_num; ++i )
+ if ( pci_mmcfg_arch_enable(i) < 0 )
valid = 0;
} else {
acpi_table_parse(ACPI_SIG_MCFG, acpi_parse_mcfg);
@@ -458,10 +459,11 @@ int pci_mmcfg_reserved(uint64_t address,
segment, start_bus, end_bus, address, cfg->address);
return -EIO;
}
- if (flags & XEN_PCI_MMCFG_RESERVED)
+
+ if ( flags & XEN_PCI_MMCFG_RESERVED )
return pci_mmcfg_arch_enable(i);
- pci_mmcfg_arch_disable(i);
- return 0;
+
+ return pci_mmcfg_arch_disable(i);
}
}
--- a/xen/arch/x86/x86_64/mmconfig_64.c
+++ b/xen/arch/x86/x86_64/mmconfig_64.c
@@ -138,8 +138,9 @@ int pci_mmcfg_arch_enable(unsigned int i
const typeof(pci_mmcfg_config[0]) *cfg = pci_mmcfg_virt[idx].cfg;
unsigned long start_mfn, end_mfn;
- if (pci_mmcfg_virt[idx].virt)
- return 0;
+ if ( pci_mmcfg_virt[idx].virt )
+ return 1;
+
pci_mmcfg_virt[idx].virt = mcfg_ioremap(cfg, idx, PAGE_HYPERVISOR_UC);
if (!pci_mmcfg_virt[idx].virt) {
printk(KERN_ERR "PCI: Cannot map MCFG aperture for segment %04x\n",
@@ -160,10 +161,13 @@ int pci_mmcfg_arch_enable(unsigned int i
return 0;
}
-void pci_mmcfg_arch_disable(unsigned int idx)
+int pci_mmcfg_arch_disable(unsigned int idx)
{
const typeof(pci_mmcfg_config[0]) *cfg = pci_mmcfg_virt[idx].cfg;
+ if ( !pci_mmcfg_virt[idx].virt )
+ return 1;
+
pci_mmcfg_virt[idx].virt = NULL;
/*
* Don't use destroy_xen_mappings() here, or make sure that at least
@@ -173,6 +177,8 @@ void pci_mmcfg_arch_disable(unsigned int
mcfg_ioremap(cfg, idx, 0);
printk(KERN_WARNING "PCI: Not using MCFG for segment %04x bus %02x-%02x\n",
cfg->pci_segment, cfg->start_bus_number, cfg->end_bus_number);
+
+ return 0;
}
bool pci_mmcfg_decode(unsigned long mfn, unsigned int *seg, unsigned int *bdf)
^ permalink raw reply [flat|nested] 27+ messages in thread* Re: [PATCH v3 5/6] x86/PCI: avoid re-evaluation of extended config space accessibility
2026-01-29 13:10 ` [PATCH v3 5/6] x86/PCI: avoid re-evaluation of extended config space accessibility Jan Beulich
@ 2026-01-29 15:32 ` Roger Pau Monné
2026-01-29 15:47 ` Jan Beulich
2026-02-02 8:51 ` Jan Beulich
1 sibling, 1 reply; 27+ messages in thread
From: Roger Pau Monné @ 2026-01-29 15:32 UTC (permalink / raw)
To: Jan Beulich
Cc: xen-devel@lists.xenproject.org, Andrew Cooper, Stewart Hildebrand
On Thu, Jan 29, 2026 at 02:10:01PM +0100, Jan Beulich wrote:
> When, during boot, we have already correctly determined availability of
> the MMCFG access method for a given bus range, there's then no need to
> invoke pci_check_extcfg() again for every of the devices. This in
> particular avoids ->ext_cfg to transiently indicate the wrong state.
>
> Switch to using Xen style on lines being touched and immediately adjacent
> ones.
>
> Signed-off-by: Jan Beulich <jbeulich@suse.com>
Acked-by: Roger Pau Monné <roger.pau@citrix.com>
One suggestion for a further addition below.
> ---
> v3: New.
>
> --- a/xen/arch/x86/physdev.c
> +++ b/xen/arch/x86/physdev.c
> @@ -528,6 +528,8 @@ ret_t do_physdev_op(int cmd, XEN_GUEST_H
> if ( !ret )
> ret = pci_segment_iterate(info.segment, physdev_check_pci_extcfg,
> &info);
> + else if ( ret > 0 ) /* Indication of "no change". */
> + ret = 0;
>
> if ( !ret && has_vpci(currd) && (info.flags & XEN_PCI_MMCFG_RESERVED) )
Maybe it doesn't need to be strictly done here, but now that
pci_mmcfg_reserved() signals whether the MMCFG was already registered,
could you also restrict the register_vpci_mmcfg_handler() call to ret
== 0?
That will also simplify the logic in register_vpci_mmcfg_handler()
since we no longer need to return 0 when the region is already
registered, returning -EEXIST should be fine if the caller is
adjusted.
Thanks, Roger.
^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [PATCH v3 5/6] x86/PCI: avoid re-evaluation of extended config space accessibility
2026-01-29 15:32 ` Roger Pau Monné
@ 2026-01-29 15:47 ` Jan Beulich
0 siblings, 0 replies; 27+ messages in thread
From: Jan Beulich @ 2026-01-29 15:47 UTC (permalink / raw)
To: Roger Pau Monné
Cc: xen-devel@lists.xenproject.org, Andrew Cooper, Stewart Hildebrand
On 29.01.2026 16:32, Roger Pau Monné wrote:
> On Thu, Jan 29, 2026 at 02:10:01PM +0100, Jan Beulich wrote:
>> When, during boot, we have already correctly determined availability of
>> the MMCFG access method for a given bus range, there's then no need to
>> invoke pci_check_extcfg() again for every of the devices. This in
>> particular avoids ->ext_cfg to transiently indicate the wrong state.
>>
>> Switch to using Xen style on lines being touched and immediately adjacent
>> ones.
>>
>> Signed-off-by: Jan Beulich <jbeulich@suse.com>
>
> Acked-by: Roger Pau Monné <roger.pau@citrix.com>
Thanks.
> One suggestion for a further addition below.
>
>> ---
>> v3: New.
>>
>> --- a/xen/arch/x86/physdev.c
>> +++ b/xen/arch/x86/physdev.c
>> @@ -528,6 +528,8 @@ ret_t do_physdev_op(int cmd, XEN_GUEST_H
>> if ( !ret )
>> ret = pci_segment_iterate(info.segment, physdev_check_pci_extcfg,
>> &info);
>> + else if ( ret > 0 ) /* Indication of "no change". */
>> + ret = 0;
>>
>> if ( !ret && has_vpci(currd) && (info.flags & XEN_PCI_MMCFG_RESERVED) )
>
> Maybe it doesn't need to be strictly done here, but now that
> pci_mmcfg_reserved() signals whether the MMCFG was already registered,
> could you also restrict the register_vpci_mmcfg_handler() call to ret
> == 0?
Possibly; you know vPCI better than I do.
> That will also simplify the logic in register_vpci_mmcfg_handler()
> since we no longer need to return 0 when the region is already
> registered, returning -EEXIST should be fine if the caller is
> adjusted.
I think this then will want to be a separate change, with its own
justification.
Jan
^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [PATCH v3 5/6] x86/PCI: avoid re-evaluation of extended config space accessibility
2026-01-29 13:10 ` [PATCH v3 5/6] x86/PCI: avoid re-evaluation of extended config space accessibility Jan Beulich
2026-01-29 15:32 ` Roger Pau Monné
@ 2026-02-02 8:51 ` Jan Beulich
2026-02-02 9:14 ` Roger Pau Monné
2026-02-02 19:25 ` Stewart Hildebrand
1 sibling, 2 replies; 27+ messages in thread
From: Jan Beulich @ 2026-02-02 8:51 UTC (permalink / raw)
To: xen-devel@lists.xenproject.org
Cc: Andrew Cooper, Roger Pau Monné, Stewart Hildebrand
On 29.01.2026 14:10, Jan Beulich wrote:
> @@ -160,10 +161,13 @@ int pci_mmcfg_arch_enable(unsigned int i
> return 0;
> }
>
> -void pci_mmcfg_arch_disable(unsigned int idx)
> +int pci_mmcfg_arch_disable(unsigned int idx)
> {
> const typeof(pci_mmcfg_config[0]) *cfg = pci_mmcfg_virt[idx].cfg;
>
> + if ( !pci_mmcfg_virt[idx].virt )
> + return 1;
Afaict this is what causes CI (adl-*) to say no here:
(XEN) [ 4.132689] PCI: Using MCFG for segment 0000 bus 00-ff
(XEN) [ 4.132697] ----[ Xen-4.22-unstable x86_64 debug=y ubsan=y Not tainted ]----
(XEN) [ 4.132700] CPU: 12
(XEN) [ 4.132702] RIP: e008:[<ffff82d0405779bd>] pci_mmcfg_read+0x19e/0x1c7
(XEN) [ 4.132708] RFLAGS: 0000000000010286 CONTEXT: hypervisor (d0v0)
(XEN) [ 4.132711] rax: 0000000000300000 rbx: ffff808000300100 rcx: 0000000000000000
(XEN) [ 4.132714] rdx: ffff808000300100 rsi: 0000000000000000 rdi: ffff8304959ffcec
(XEN) [ 4.132716] rbp: ffff8304959ffd18 rsp: ffff8304959ffce8 r8: 0000000000000004
(XEN) [ 4.132718] r9: ffff8304959ffd2c r10: 0000000000000000 r11: 0000000000000000
(XEN) [ 4.132720] r12: 0000000000000100 r13: 0000000000000004 r14: ffff8304959ffd2c
(XEN) [ 4.132723] r15: ffff808000000000 cr0: 0000000080050033 cr4: 0000000000b526e0
(XEN) [ 4.132725] cr3: 0000000492a30000 cr2: ffff808000300100
(XEN) [ 4.132727] fsb: 0000000000000000 gsb: ffff8881b9a00000 gss: 0000000000000000
(XEN) [ 4.132729] ds: 0000 es: 0000 fs: 0000 gs: 0000 ss: e010 cs: e008
(XEN) [ 4.132733] Xen code around <ffff82d0405779bd> (pci_mmcfg_read+0x19e/0x1c7):
(XEN) [ 4.132734] 48 39 d3 72 ea 4c 01 e3 <8b> 03 89 c3 4d 85 f6 74 0d 41 89 1e b8 00 00 00
(XEN) [ 4.132744] Xen stack trace from rsp=ffff8304959ffce8:
(XEN) [ 4.132745] 0000000300000286 ffff830495bd8010 0000000000000003 ffff830495bd8010
(XEN) [ 4.132749] ffff8304959ffdd0 ffff82d0405fa7ef ffff8304959ffd30 ffff82d040576877
(XEN) [ 4.132753] 000000000000000c ffff8304959ffd58 ffff82d04039b81d ffff8304959ffe28
(XEN) [ 4.132756] 0000000000000003 ffff830495bd8010 ffff8304959ffd80 ffff82d0405fa90b
(XEN) [ 4.132760] ffff8304959ffdc8 ffff830495bd8010 ffff830498019650 ffff8304959ffdb8
(XEN) [ 4.132764] ffff82d0403983e0 ffff830498019650 ffff8304959ffe28 ffff82d0405fa7ef
(XEN) [ 4.132767] 0000000000000018 ffffc9004002b900 ffff8304959ffdf8 ffff82d04039feba
(XEN) [ 4.132771] ffff82d0405fa7ef ffff8304959ffe28 0000000000000000 ffffc9004002b900
(XEN) [ 4.132774] 0000000000000000 ffff8304959bb000 ffff8304959ffe78 ffff82d0405ff666
(XEN) [ 4.132778] ffff82d0405713b8 00000000ffffffff 00a0fb0081f456e0 ffff8304959b3010
(XEN) [ 4.132782] 00000000c0000000 00000001ff000000 ffff8304959fff08 0000000000000040
(XEN) [ 4.132785] 000000ec00000001 ffff8304959fff08 ffff8304959a4000 0000000000000021
(XEN) [ 4.132789] 0000000000000000 ffffc9004002b900 ffff8304959ffef8 ffff82d0405711b2
(XEN) [ 4.132792] 0000000000000000 ffff888100456938 ffff8881001470b8 0000000000000018
(XEN) [ 4.132795] 0000000000000000 ffff8304959ffef8 ffff82d0406213f9 ffff8304959a4000
(XEN) [ 4.132799] 0000000000000000 ffff8304959a4000 0000000000000000 0000000000000000
(XEN) [ 4.132802] ffff8304959fffff 0000000000000000 00007cfb6a6000d7 ffff82d0402012d3
(XEN) [ 4.132806] 0000000000000000 00000000ffffffff ffff8881001470b8 ffff888100b88900
(XEN) [ 4.132809] ffffc9004002b900 ffff8881001470b8 0000000000000283 ffff888100456938
(XEN) [ 4.132813] ffff888100065410 0000000000000000 0000000000000021 ffffffff81f7842a
(XEN) [ 4.132816] Xen call trace:
(XEN) [ 4.132819] [<ffff82d0405779bd>] R pci_mmcfg_read+0x19e/0x1c7
(XEN) [ 4.132822] [<ffff82d040576877>] F pci_conf_read32+0x55/0x5e
(XEN) [ 4.132826] [<ffff82d04039b81d>] F pci_check_extcfg+0xb1/0x13b
(XEN) [ 4.132831] [<ffff82d0405fa90b>] F physdev_check_pci_extcfg+0x11c/0x121
(XEN) [ 4.132833] [<ffff82d0403983e0>] F drivers/passthrough/pci.c#iterate_all+0xa2/0xe2
(XEN) [ 4.132836] [<ffff82d04039feba>] F pci_segment_iterate+0x4e/0x74
(XEN) [ 4.132839] [<ffff82d0405ff666>] F do_physdev_op+0x362a/0x4161
(XEN) [ 4.132842] [<ffff82d0405711b2>] F pv_hypercall+0x6be/0x838
(XEN) [ 4.132845] [<ffff82d0402012d3>] F lstar_enter+0x143/0x148
(XEN) [ 4.132847]
(XEN) [ 4.132848] Pagetable walk from ffff808000300100:
(XEN) [ 4.132851] L4[0x101] = 0000000000000000 ffffffffffffffff
There is an important comment in pci_mmcfg_arch_disable():
/*
* Don't use destroy_xen_mappings() here, or make sure that at least
* the necessary L4 entries get populated (so that they get properly
* propagated to guest domains' page tables).
*/
Hence it is wrong to bypass
mcfg_ioremap(cfg, idx, 0);
. v4 coming.
Jan
^ permalink raw reply [flat|nested] 27+ messages in thread* Re: [PATCH v3 5/6] x86/PCI: avoid re-evaluation of extended config space accessibility
2026-02-02 8:51 ` Jan Beulich
@ 2026-02-02 9:14 ` Roger Pau Monné
2026-02-02 9:30 ` Jan Beulich
2026-02-02 19:25 ` Stewart Hildebrand
1 sibling, 1 reply; 27+ messages in thread
From: Roger Pau Monné @ 2026-02-02 9:14 UTC (permalink / raw)
To: Jan Beulich
Cc: xen-devel@lists.xenproject.org, Andrew Cooper, Stewart Hildebrand
On Mon, Feb 02, 2026 at 09:51:18AM +0100, Jan Beulich wrote:
> On 29.01.2026 14:10, Jan Beulich wrote:
> > @@ -160,10 +161,13 @@ int pci_mmcfg_arch_enable(unsigned int i
> > return 0;
> > }
> >
> > -void pci_mmcfg_arch_disable(unsigned int idx)
> > +int pci_mmcfg_arch_disable(unsigned int idx)
> > {
> > const typeof(pci_mmcfg_config[0]) *cfg = pci_mmcfg_virt[idx].cfg;
> >
> > + if ( !pci_mmcfg_virt[idx].virt )
> > + return 1;
>
> Afaict this is what causes CI (adl-*) to say no here:
>
> (XEN) [ 4.132689] PCI: Using MCFG for segment 0000 bus 00-ff
> (XEN) [ 4.132697] ----[ Xen-4.22-unstable x86_64 debug=y ubsan=y Not tainted ]----
> (XEN) [ 4.132700] CPU: 12
> (XEN) [ 4.132702] RIP: e008:[<ffff82d0405779bd>] pci_mmcfg_read+0x19e/0x1c7
> (XEN) [ 4.132708] RFLAGS: 0000000000010286 CONTEXT: hypervisor (d0v0)
> (XEN) [ 4.132711] rax: 0000000000300000 rbx: ffff808000300100 rcx: 0000000000000000
> (XEN) [ 4.132714] rdx: ffff808000300100 rsi: 0000000000000000 rdi: ffff8304959ffcec
> (XEN) [ 4.132716] rbp: ffff8304959ffd18 rsp: ffff8304959ffce8 r8: 0000000000000004
> (XEN) [ 4.132718] r9: ffff8304959ffd2c r10: 0000000000000000 r11: 0000000000000000
> (XEN) [ 4.132720] r12: 0000000000000100 r13: 0000000000000004 r14: ffff8304959ffd2c
> (XEN) [ 4.132723] r15: ffff808000000000 cr0: 0000000080050033 cr4: 0000000000b526e0
> (XEN) [ 4.132725] cr3: 0000000492a30000 cr2: ffff808000300100
> (XEN) [ 4.132727] fsb: 0000000000000000 gsb: ffff8881b9a00000 gss: 0000000000000000
> (XEN) [ 4.132729] ds: 0000 es: 0000 fs: 0000 gs: 0000 ss: e010 cs: e008
> (XEN) [ 4.132733] Xen code around <ffff82d0405779bd> (pci_mmcfg_read+0x19e/0x1c7):
> (XEN) [ 4.132734] 48 39 d3 72 ea 4c 01 e3 <8b> 03 89 c3 4d 85 f6 74 0d 41 89 1e b8 00 00 00
> (XEN) [ 4.132744] Xen stack trace from rsp=ffff8304959ffce8:
> (XEN) [ 4.132745] 0000000300000286 ffff830495bd8010 0000000000000003 ffff830495bd8010
> (XEN) [ 4.132749] ffff8304959ffdd0 ffff82d0405fa7ef ffff8304959ffd30 ffff82d040576877
> (XEN) [ 4.132753] 000000000000000c ffff8304959ffd58 ffff82d04039b81d ffff8304959ffe28
> (XEN) [ 4.132756] 0000000000000003 ffff830495bd8010 ffff8304959ffd80 ffff82d0405fa90b
> (XEN) [ 4.132760] ffff8304959ffdc8 ffff830495bd8010 ffff830498019650 ffff8304959ffdb8
> (XEN) [ 4.132764] ffff82d0403983e0 ffff830498019650 ffff8304959ffe28 ffff82d0405fa7ef
> (XEN) [ 4.132767] 0000000000000018 ffffc9004002b900 ffff8304959ffdf8 ffff82d04039feba
> (XEN) [ 4.132771] ffff82d0405fa7ef ffff8304959ffe28 0000000000000000 ffffc9004002b900
> (XEN) [ 4.132774] 0000000000000000 ffff8304959bb000 ffff8304959ffe78 ffff82d0405ff666
> (XEN) [ 4.132778] ffff82d0405713b8 00000000ffffffff 00a0fb0081f456e0 ffff8304959b3010
> (XEN) [ 4.132782] 00000000c0000000 00000001ff000000 ffff8304959fff08 0000000000000040
> (XEN) [ 4.132785] 000000ec00000001 ffff8304959fff08 ffff8304959a4000 0000000000000021
> (XEN) [ 4.132789] 0000000000000000 ffffc9004002b900 ffff8304959ffef8 ffff82d0405711b2
> (XEN) [ 4.132792] 0000000000000000 ffff888100456938 ffff8881001470b8 0000000000000018
> (XEN) [ 4.132795] 0000000000000000 ffff8304959ffef8 ffff82d0406213f9 ffff8304959a4000
> (XEN) [ 4.132799] 0000000000000000 ffff8304959a4000 0000000000000000 0000000000000000
> (XEN) [ 4.132802] ffff8304959fffff 0000000000000000 00007cfb6a6000d7 ffff82d0402012d3
> (XEN) [ 4.132806] 0000000000000000 00000000ffffffff ffff8881001470b8 ffff888100b88900
> (XEN) [ 4.132809] ffffc9004002b900 ffff8881001470b8 0000000000000283 ffff888100456938
> (XEN) [ 4.132813] ffff888100065410 0000000000000000 0000000000000021 ffffffff81f7842a
> (XEN) [ 4.132816] Xen call trace:
> (XEN) [ 4.132819] [<ffff82d0405779bd>] R pci_mmcfg_read+0x19e/0x1c7
> (XEN) [ 4.132822] [<ffff82d040576877>] F pci_conf_read32+0x55/0x5e
> (XEN) [ 4.132826] [<ffff82d04039b81d>] F pci_check_extcfg+0xb1/0x13b
> (XEN) [ 4.132831] [<ffff82d0405fa90b>] F physdev_check_pci_extcfg+0x11c/0x121
> (XEN) [ 4.132833] [<ffff82d0403983e0>] F drivers/passthrough/pci.c#iterate_all+0xa2/0xe2
> (XEN) [ 4.132836] [<ffff82d04039feba>] F pci_segment_iterate+0x4e/0x74
> (XEN) [ 4.132839] [<ffff82d0405ff666>] F do_physdev_op+0x362a/0x4161
> (XEN) [ 4.132842] [<ffff82d0405711b2>] F pv_hypercall+0x6be/0x838
> (XEN) [ 4.132845] [<ffff82d0402012d3>] F lstar_enter+0x143/0x148
> (XEN) [ 4.132847]
> (XEN) [ 4.132848] Pagetable walk from ffff808000300100:
> (XEN) [ 4.132851] L4[0x101] = 0000000000000000 ffffffffffffffff
>
> There is an important comment in pci_mmcfg_arch_disable():
>
> /*
> * Don't use destroy_xen_mappings() here, or make sure that at least
> * the necessary L4 entries get populated (so that they get properly
> * propagated to guest domains' page tables).
> */
>
> Hence it is wrong to bypass
>
> mcfg_ioremap(cfg, idx, 0);
Hm, I see. The L4 slot must be unconditionally populated before we
clone the idle page-table, otherwise the mappings won't propagate.
What about unconditionally populating the L4 slot in
subarch_init_memory()? That seems less fragile than doing it in
pci_mmcfg_arch_disable().
Thanks, Roger.
^ permalink raw reply [flat|nested] 27+ messages in thread* Re: [PATCH v3 5/6] x86/PCI: avoid re-evaluation of extended config space accessibility
2026-02-02 9:14 ` Roger Pau Monné
@ 2026-02-02 9:30 ` Jan Beulich
2026-02-02 10:13 ` Roger Pau Monné
0 siblings, 1 reply; 27+ messages in thread
From: Jan Beulich @ 2026-02-02 9:30 UTC (permalink / raw)
To: Roger Pau Monné
Cc: xen-devel@lists.xenproject.org, Andrew Cooper, Stewart Hildebrand
On 02.02.2026 10:14, Roger Pau Monné wrote:
> On Mon, Feb 02, 2026 at 09:51:18AM +0100, Jan Beulich wrote:
>> On 29.01.2026 14:10, Jan Beulich wrote:
>>> @@ -160,10 +161,13 @@ int pci_mmcfg_arch_enable(unsigned int i
>>> return 0;
>>> }
>>>
>>> -void pci_mmcfg_arch_disable(unsigned int idx)
>>> +int pci_mmcfg_arch_disable(unsigned int idx)
>>> {
>>> const typeof(pci_mmcfg_config[0]) *cfg = pci_mmcfg_virt[idx].cfg;
>>>
>>> + if ( !pci_mmcfg_virt[idx].virt )
>>> + return 1;
>>
>> Afaict this is what causes CI (adl-*) to say no here:
>>
>> (XEN) [ 4.132689] PCI: Using MCFG for segment 0000 bus 00-ff
>> (XEN) [ 4.132697] ----[ Xen-4.22-unstable x86_64 debug=y ubsan=y Not tainted ]----
>> (XEN) [ 4.132700] CPU: 12
>> (XEN) [ 4.132702] RIP: e008:[<ffff82d0405779bd>] pci_mmcfg_read+0x19e/0x1c7
>> (XEN) [ 4.132708] RFLAGS: 0000000000010286 CONTEXT: hypervisor (d0v0)
>> (XEN) [ 4.132711] rax: 0000000000300000 rbx: ffff808000300100 rcx: 0000000000000000
>> (XEN) [ 4.132714] rdx: ffff808000300100 rsi: 0000000000000000 rdi: ffff8304959ffcec
>> (XEN) [ 4.132716] rbp: ffff8304959ffd18 rsp: ffff8304959ffce8 r8: 0000000000000004
>> (XEN) [ 4.132718] r9: ffff8304959ffd2c r10: 0000000000000000 r11: 0000000000000000
>> (XEN) [ 4.132720] r12: 0000000000000100 r13: 0000000000000004 r14: ffff8304959ffd2c
>> (XEN) [ 4.132723] r15: ffff808000000000 cr0: 0000000080050033 cr4: 0000000000b526e0
>> (XEN) [ 4.132725] cr3: 0000000492a30000 cr2: ffff808000300100
>> (XEN) [ 4.132727] fsb: 0000000000000000 gsb: ffff8881b9a00000 gss: 0000000000000000
>> (XEN) [ 4.132729] ds: 0000 es: 0000 fs: 0000 gs: 0000 ss: e010 cs: e008
>> (XEN) [ 4.132733] Xen code around <ffff82d0405779bd> (pci_mmcfg_read+0x19e/0x1c7):
>> (XEN) [ 4.132734] 48 39 d3 72 ea 4c 01 e3 <8b> 03 89 c3 4d 85 f6 74 0d 41 89 1e b8 00 00 00
>> (XEN) [ 4.132744] Xen stack trace from rsp=ffff8304959ffce8:
>> (XEN) [ 4.132745] 0000000300000286 ffff830495bd8010 0000000000000003 ffff830495bd8010
>> (XEN) [ 4.132749] ffff8304959ffdd0 ffff82d0405fa7ef ffff8304959ffd30 ffff82d040576877
>> (XEN) [ 4.132753] 000000000000000c ffff8304959ffd58 ffff82d04039b81d ffff8304959ffe28
>> (XEN) [ 4.132756] 0000000000000003 ffff830495bd8010 ffff8304959ffd80 ffff82d0405fa90b
>> (XEN) [ 4.132760] ffff8304959ffdc8 ffff830495bd8010 ffff830498019650 ffff8304959ffdb8
>> (XEN) [ 4.132764] ffff82d0403983e0 ffff830498019650 ffff8304959ffe28 ffff82d0405fa7ef
>> (XEN) [ 4.132767] 0000000000000018 ffffc9004002b900 ffff8304959ffdf8 ffff82d04039feba
>> (XEN) [ 4.132771] ffff82d0405fa7ef ffff8304959ffe28 0000000000000000 ffffc9004002b900
>> (XEN) [ 4.132774] 0000000000000000 ffff8304959bb000 ffff8304959ffe78 ffff82d0405ff666
>> (XEN) [ 4.132778] ffff82d0405713b8 00000000ffffffff 00a0fb0081f456e0 ffff8304959b3010
>> (XEN) [ 4.132782] 00000000c0000000 00000001ff000000 ffff8304959fff08 0000000000000040
>> (XEN) [ 4.132785] 000000ec00000001 ffff8304959fff08 ffff8304959a4000 0000000000000021
>> (XEN) [ 4.132789] 0000000000000000 ffffc9004002b900 ffff8304959ffef8 ffff82d0405711b2
>> (XEN) [ 4.132792] 0000000000000000 ffff888100456938 ffff8881001470b8 0000000000000018
>> (XEN) [ 4.132795] 0000000000000000 ffff8304959ffef8 ffff82d0406213f9 ffff8304959a4000
>> (XEN) [ 4.132799] 0000000000000000 ffff8304959a4000 0000000000000000 0000000000000000
>> (XEN) [ 4.132802] ffff8304959fffff 0000000000000000 00007cfb6a6000d7 ffff82d0402012d3
>> (XEN) [ 4.132806] 0000000000000000 00000000ffffffff ffff8881001470b8 ffff888100b88900
>> (XEN) [ 4.132809] ffffc9004002b900 ffff8881001470b8 0000000000000283 ffff888100456938
>> (XEN) [ 4.132813] ffff888100065410 0000000000000000 0000000000000021 ffffffff81f7842a
>> (XEN) [ 4.132816] Xen call trace:
>> (XEN) [ 4.132819] [<ffff82d0405779bd>] R pci_mmcfg_read+0x19e/0x1c7
>> (XEN) [ 4.132822] [<ffff82d040576877>] F pci_conf_read32+0x55/0x5e
>> (XEN) [ 4.132826] [<ffff82d04039b81d>] F pci_check_extcfg+0xb1/0x13b
>> (XEN) [ 4.132831] [<ffff82d0405fa90b>] F physdev_check_pci_extcfg+0x11c/0x121
>> (XEN) [ 4.132833] [<ffff82d0403983e0>] F drivers/passthrough/pci.c#iterate_all+0xa2/0xe2
>> (XEN) [ 4.132836] [<ffff82d04039feba>] F pci_segment_iterate+0x4e/0x74
>> (XEN) [ 4.132839] [<ffff82d0405ff666>] F do_physdev_op+0x362a/0x4161
>> (XEN) [ 4.132842] [<ffff82d0405711b2>] F pv_hypercall+0x6be/0x838
>> (XEN) [ 4.132845] [<ffff82d0402012d3>] F lstar_enter+0x143/0x148
>> (XEN) [ 4.132847]
>> (XEN) [ 4.132848] Pagetable walk from ffff808000300100:
>> (XEN) [ 4.132851] L4[0x101] = 0000000000000000 ffffffffffffffff
>>
>> There is an important comment in pci_mmcfg_arch_disable():
>>
>> /*
>> * Don't use destroy_xen_mappings() here, or make sure that at least
>> * the necessary L4 entries get populated (so that they get properly
>> * propagated to guest domains' page tables).
>> */
>>
>> Hence it is wrong to bypass
>>
>> mcfg_ioremap(cfg, idx, 0);
>
> Hm, I see. The L4 slot must be unconditionally populated before we
> clone the idle page-table, otherwise the mappings won't propagate.
>
> What about unconditionally populating the L4 slot in
> subarch_init_memory()? That seems less fragile than doing it in
> pci_mmcfg_arch_disable().
Less fragile - perhaps. Yet I don't see why we should populate the field if
we wouldn't ever need it. Of course with violating layering some, we could
know in subarch_init_memory(), as pci_setup() runs earlier.
Jan
^ permalink raw reply [flat|nested] 27+ messages in thread* Re: [PATCH v3 5/6] x86/PCI: avoid re-evaluation of extended config space accessibility
2026-02-02 9:30 ` Jan Beulich
@ 2026-02-02 10:13 ` Roger Pau Monné
2026-02-02 14:40 ` Jan Beulich
0 siblings, 1 reply; 27+ messages in thread
From: Roger Pau Monné @ 2026-02-02 10:13 UTC (permalink / raw)
To: Jan Beulich
Cc: xen-devel@lists.xenproject.org, Andrew Cooper, Stewart Hildebrand
On Mon, Feb 02, 2026 at 10:30:29AM +0100, Jan Beulich wrote:
> On 02.02.2026 10:14, Roger Pau Monné wrote:
> > On Mon, Feb 02, 2026 at 09:51:18AM +0100, Jan Beulich wrote:
> >> On 29.01.2026 14:10, Jan Beulich wrote:
> >>> @@ -160,10 +161,13 @@ int pci_mmcfg_arch_enable(unsigned int i
> >>> return 0;
> >>> }
> >>>
> >>> -void pci_mmcfg_arch_disable(unsigned int idx)
> >>> +int pci_mmcfg_arch_disable(unsigned int idx)
> >>> {
> >>> const typeof(pci_mmcfg_config[0]) *cfg = pci_mmcfg_virt[idx].cfg;
> >>>
> >>> + if ( !pci_mmcfg_virt[idx].virt )
> >>> + return 1;
> >>
> >> Afaict this is what causes CI (adl-*) to say no here:
> >>
> >> (XEN) [ 4.132689] PCI: Using MCFG for segment 0000 bus 00-ff
> >> (XEN) [ 4.132697] ----[ Xen-4.22-unstable x86_64 debug=y ubsan=y Not tainted ]----
> >> (XEN) [ 4.132700] CPU: 12
> >> (XEN) [ 4.132702] RIP: e008:[<ffff82d0405779bd>] pci_mmcfg_read+0x19e/0x1c7
> >> (XEN) [ 4.132708] RFLAGS: 0000000000010286 CONTEXT: hypervisor (d0v0)
> >> (XEN) [ 4.132711] rax: 0000000000300000 rbx: ffff808000300100 rcx: 0000000000000000
> >> (XEN) [ 4.132714] rdx: ffff808000300100 rsi: 0000000000000000 rdi: ffff8304959ffcec
> >> (XEN) [ 4.132716] rbp: ffff8304959ffd18 rsp: ffff8304959ffce8 r8: 0000000000000004
> >> (XEN) [ 4.132718] r9: ffff8304959ffd2c r10: 0000000000000000 r11: 0000000000000000
> >> (XEN) [ 4.132720] r12: 0000000000000100 r13: 0000000000000004 r14: ffff8304959ffd2c
> >> (XEN) [ 4.132723] r15: ffff808000000000 cr0: 0000000080050033 cr4: 0000000000b526e0
> >> (XEN) [ 4.132725] cr3: 0000000492a30000 cr2: ffff808000300100
> >> (XEN) [ 4.132727] fsb: 0000000000000000 gsb: ffff8881b9a00000 gss: 0000000000000000
> >> (XEN) [ 4.132729] ds: 0000 es: 0000 fs: 0000 gs: 0000 ss: e010 cs: e008
> >> (XEN) [ 4.132733] Xen code around <ffff82d0405779bd> (pci_mmcfg_read+0x19e/0x1c7):
> >> (XEN) [ 4.132734] 48 39 d3 72 ea 4c 01 e3 <8b> 03 89 c3 4d 85 f6 74 0d 41 89 1e b8 00 00 00
> >> (XEN) [ 4.132744] Xen stack trace from rsp=ffff8304959ffce8:
> >> (XEN) [ 4.132745] 0000000300000286 ffff830495bd8010 0000000000000003 ffff830495bd8010
> >> (XEN) [ 4.132749] ffff8304959ffdd0 ffff82d0405fa7ef ffff8304959ffd30 ffff82d040576877
> >> (XEN) [ 4.132753] 000000000000000c ffff8304959ffd58 ffff82d04039b81d ffff8304959ffe28
> >> (XEN) [ 4.132756] 0000000000000003 ffff830495bd8010 ffff8304959ffd80 ffff82d0405fa90b
> >> (XEN) [ 4.132760] ffff8304959ffdc8 ffff830495bd8010 ffff830498019650 ffff8304959ffdb8
> >> (XEN) [ 4.132764] ffff82d0403983e0 ffff830498019650 ffff8304959ffe28 ffff82d0405fa7ef
> >> (XEN) [ 4.132767] 0000000000000018 ffffc9004002b900 ffff8304959ffdf8 ffff82d04039feba
> >> (XEN) [ 4.132771] ffff82d0405fa7ef ffff8304959ffe28 0000000000000000 ffffc9004002b900
> >> (XEN) [ 4.132774] 0000000000000000 ffff8304959bb000 ffff8304959ffe78 ffff82d0405ff666
> >> (XEN) [ 4.132778] ffff82d0405713b8 00000000ffffffff 00a0fb0081f456e0 ffff8304959b3010
> >> (XEN) [ 4.132782] 00000000c0000000 00000001ff000000 ffff8304959fff08 0000000000000040
> >> (XEN) [ 4.132785] 000000ec00000001 ffff8304959fff08 ffff8304959a4000 0000000000000021
> >> (XEN) [ 4.132789] 0000000000000000 ffffc9004002b900 ffff8304959ffef8 ffff82d0405711b2
> >> (XEN) [ 4.132792] 0000000000000000 ffff888100456938 ffff8881001470b8 0000000000000018
> >> (XEN) [ 4.132795] 0000000000000000 ffff8304959ffef8 ffff82d0406213f9 ffff8304959a4000
> >> (XEN) [ 4.132799] 0000000000000000 ffff8304959a4000 0000000000000000 0000000000000000
> >> (XEN) [ 4.132802] ffff8304959fffff 0000000000000000 00007cfb6a6000d7 ffff82d0402012d3
> >> (XEN) [ 4.132806] 0000000000000000 00000000ffffffff ffff8881001470b8 ffff888100b88900
> >> (XEN) [ 4.132809] ffffc9004002b900 ffff8881001470b8 0000000000000283 ffff888100456938
> >> (XEN) [ 4.132813] ffff888100065410 0000000000000000 0000000000000021 ffffffff81f7842a
> >> (XEN) [ 4.132816] Xen call trace:
> >> (XEN) [ 4.132819] [<ffff82d0405779bd>] R pci_mmcfg_read+0x19e/0x1c7
> >> (XEN) [ 4.132822] [<ffff82d040576877>] F pci_conf_read32+0x55/0x5e
> >> (XEN) [ 4.132826] [<ffff82d04039b81d>] F pci_check_extcfg+0xb1/0x13b
> >> (XEN) [ 4.132831] [<ffff82d0405fa90b>] F physdev_check_pci_extcfg+0x11c/0x121
> >> (XEN) [ 4.132833] [<ffff82d0403983e0>] F drivers/passthrough/pci.c#iterate_all+0xa2/0xe2
> >> (XEN) [ 4.132836] [<ffff82d04039feba>] F pci_segment_iterate+0x4e/0x74
> >> (XEN) [ 4.132839] [<ffff82d0405ff666>] F do_physdev_op+0x362a/0x4161
> >> (XEN) [ 4.132842] [<ffff82d0405711b2>] F pv_hypercall+0x6be/0x838
> >> (XEN) [ 4.132845] [<ffff82d0402012d3>] F lstar_enter+0x143/0x148
> >> (XEN) [ 4.132847]
> >> (XEN) [ 4.132848] Pagetable walk from ffff808000300100:
> >> (XEN) [ 4.132851] L4[0x101] = 0000000000000000 ffffffffffffffff
> >>
> >> There is an important comment in pci_mmcfg_arch_disable():
> >>
> >> /*
> >> * Don't use destroy_xen_mappings() here, or make sure that at least
> >> * the necessary L4 entries get populated (so that they get properly
> >> * propagated to guest domains' page tables).
> >> */
> >>
> >> Hence it is wrong to bypass
> >>
> >> mcfg_ioremap(cfg, idx, 0);
> >
> > Hm, I see. The L4 slot must be unconditionally populated before we
> > clone the idle page-table, otherwise the mappings won't propagate.
> >
> > What about unconditionally populating the L4 slot in
> > subarch_init_memory()? That seems less fragile than doing it in
> > pci_mmcfg_arch_disable().
>
> Less fragile - perhaps. Yet I don't see why we should populate the field if
> we wouldn't ever need it. Of course with violating layering some, we could
> know in subarch_init_memory(), as pci_setup() runs earlier.
How can Xen be sure at setup time that the slot will never be used?
The MMCFG could be empty initially when parsed by Xen, but MMCFG
regions might appear as a result of AML method execution (_CBA and
_CRS methods in hotplug host bridges).
Thanks, Roger.
^ permalink raw reply [flat|nested] 27+ messages in thread* Re: [PATCH v3 5/6] x86/PCI: avoid re-evaluation of extended config space accessibility
2026-02-02 10:13 ` Roger Pau Monné
@ 2026-02-02 14:40 ` Jan Beulich
2026-02-02 15:18 ` Roger Pau Monné
0 siblings, 1 reply; 27+ messages in thread
From: Jan Beulich @ 2026-02-02 14:40 UTC (permalink / raw)
To: Roger Pau Monné
Cc: xen-devel@lists.xenproject.org, Andrew Cooper, Stewart Hildebrand
On 02.02.2026 11:13, Roger Pau Monné wrote:
> On Mon, Feb 02, 2026 at 10:30:29AM +0100, Jan Beulich wrote:
>> On 02.02.2026 10:14, Roger Pau Monné wrote:
>>> On Mon, Feb 02, 2026 at 09:51:18AM +0100, Jan Beulich wrote:
>>>> On 29.01.2026 14:10, Jan Beulich wrote:
>>>>> @@ -160,10 +161,13 @@ int pci_mmcfg_arch_enable(unsigned int i
>>>>> return 0;
>>>>> }
>>>>>
>>>>> -void pci_mmcfg_arch_disable(unsigned int idx)
>>>>> +int pci_mmcfg_arch_disable(unsigned int idx)
>>>>> {
>>>>> const typeof(pci_mmcfg_config[0]) *cfg = pci_mmcfg_virt[idx].cfg;
>>>>>
>>>>> + if ( !pci_mmcfg_virt[idx].virt )
>>>>> + return 1;
>>>>
>>>> Afaict this is what causes CI (adl-*) to say no here:
>>>>
>>>> (XEN) [ 4.132689] PCI: Using MCFG for segment 0000 bus 00-ff
>>>> (XEN) [ 4.132697] ----[ Xen-4.22-unstable x86_64 debug=y ubsan=y Not tainted ]----
>>>> (XEN) [ 4.132700] CPU: 12
>>>> (XEN) [ 4.132702] RIP: e008:[<ffff82d0405779bd>] pci_mmcfg_read+0x19e/0x1c7
>>>> (XEN) [ 4.132708] RFLAGS: 0000000000010286 CONTEXT: hypervisor (d0v0)
>>>> (XEN) [ 4.132711] rax: 0000000000300000 rbx: ffff808000300100 rcx: 0000000000000000
>>>> (XEN) [ 4.132714] rdx: ffff808000300100 rsi: 0000000000000000 rdi: ffff8304959ffcec
>>>> (XEN) [ 4.132716] rbp: ffff8304959ffd18 rsp: ffff8304959ffce8 r8: 0000000000000004
>>>> (XEN) [ 4.132718] r9: ffff8304959ffd2c r10: 0000000000000000 r11: 0000000000000000
>>>> (XEN) [ 4.132720] r12: 0000000000000100 r13: 0000000000000004 r14: ffff8304959ffd2c
>>>> (XEN) [ 4.132723] r15: ffff808000000000 cr0: 0000000080050033 cr4: 0000000000b526e0
>>>> (XEN) [ 4.132725] cr3: 0000000492a30000 cr2: ffff808000300100
>>>> (XEN) [ 4.132727] fsb: 0000000000000000 gsb: ffff8881b9a00000 gss: 0000000000000000
>>>> (XEN) [ 4.132729] ds: 0000 es: 0000 fs: 0000 gs: 0000 ss: e010 cs: e008
>>>> (XEN) [ 4.132733] Xen code around <ffff82d0405779bd> (pci_mmcfg_read+0x19e/0x1c7):
>>>> (XEN) [ 4.132734] 48 39 d3 72 ea 4c 01 e3 <8b> 03 89 c3 4d 85 f6 74 0d 41 89 1e b8 00 00 00
>>>> (XEN) [ 4.132744] Xen stack trace from rsp=ffff8304959ffce8:
>>>> (XEN) [ 4.132745] 0000000300000286 ffff830495bd8010 0000000000000003 ffff830495bd8010
>>>> (XEN) [ 4.132749] ffff8304959ffdd0 ffff82d0405fa7ef ffff8304959ffd30 ffff82d040576877
>>>> (XEN) [ 4.132753] 000000000000000c ffff8304959ffd58 ffff82d04039b81d ffff8304959ffe28
>>>> (XEN) [ 4.132756] 0000000000000003 ffff830495bd8010 ffff8304959ffd80 ffff82d0405fa90b
>>>> (XEN) [ 4.132760] ffff8304959ffdc8 ffff830495bd8010 ffff830498019650 ffff8304959ffdb8
>>>> (XEN) [ 4.132764] ffff82d0403983e0 ffff830498019650 ffff8304959ffe28 ffff82d0405fa7ef
>>>> (XEN) [ 4.132767] 0000000000000018 ffffc9004002b900 ffff8304959ffdf8 ffff82d04039feba
>>>> (XEN) [ 4.132771] ffff82d0405fa7ef ffff8304959ffe28 0000000000000000 ffffc9004002b900
>>>> (XEN) [ 4.132774] 0000000000000000 ffff8304959bb000 ffff8304959ffe78 ffff82d0405ff666
>>>> (XEN) [ 4.132778] ffff82d0405713b8 00000000ffffffff 00a0fb0081f456e0 ffff8304959b3010
>>>> (XEN) [ 4.132782] 00000000c0000000 00000001ff000000 ffff8304959fff08 0000000000000040
>>>> (XEN) [ 4.132785] 000000ec00000001 ffff8304959fff08 ffff8304959a4000 0000000000000021
>>>> (XEN) [ 4.132789] 0000000000000000 ffffc9004002b900 ffff8304959ffef8 ffff82d0405711b2
>>>> (XEN) [ 4.132792] 0000000000000000 ffff888100456938 ffff8881001470b8 0000000000000018
>>>> (XEN) [ 4.132795] 0000000000000000 ffff8304959ffef8 ffff82d0406213f9 ffff8304959a4000
>>>> (XEN) [ 4.132799] 0000000000000000 ffff8304959a4000 0000000000000000 0000000000000000
>>>> (XEN) [ 4.132802] ffff8304959fffff 0000000000000000 00007cfb6a6000d7 ffff82d0402012d3
>>>> (XEN) [ 4.132806] 0000000000000000 00000000ffffffff ffff8881001470b8 ffff888100b88900
>>>> (XEN) [ 4.132809] ffffc9004002b900 ffff8881001470b8 0000000000000283 ffff888100456938
>>>> (XEN) [ 4.132813] ffff888100065410 0000000000000000 0000000000000021 ffffffff81f7842a
>>>> (XEN) [ 4.132816] Xen call trace:
>>>> (XEN) [ 4.132819] [<ffff82d0405779bd>] R pci_mmcfg_read+0x19e/0x1c7
>>>> (XEN) [ 4.132822] [<ffff82d040576877>] F pci_conf_read32+0x55/0x5e
>>>> (XEN) [ 4.132826] [<ffff82d04039b81d>] F pci_check_extcfg+0xb1/0x13b
>>>> (XEN) [ 4.132831] [<ffff82d0405fa90b>] F physdev_check_pci_extcfg+0x11c/0x121
>>>> (XEN) [ 4.132833] [<ffff82d0403983e0>] F drivers/passthrough/pci.c#iterate_all+0xa2/0xe2
>>>> (XEN) [ 4.132836] [<ffff82d04039feba>] F pci_segment_iterate+0x4e/0x74
>>>> (XEN) [ 4.132839] [<ffff82d0405ff666>] F do_physdev_op+0x362a/0x4161
>>>> (XEN) [ 4.132842] [<ffff82d0405711b2>] F pv_hypercall+0x6be/0x838
>>>> (XEN) [ 4.132845] [<ffff82d0402012d3>] F lstar_enter+0x143/0x148
>>>> (XEN) [ 4.132847]
>>>> (XEN) [ 4.132848] Pagetable walk from ffff808000300100:
>>>> (XEN) [ 4.132851] L4[0x101] = 0000000000000000 ffffffffffffffff
>>>>
>>>> There is an important comment in pci_mmcfg_arch_disable():
>>>>
>>>> /*
>>>> * Don't use destroy_xen_mappings() here, or make sure that at least
>>>> * the necessary L4 entries get populated (so that they get properly
>>>> * propagated to guest domains' page tables).
>>>> */
>>>>
>>>> Hence it is wrong to bypass
>>>>
>>>> mcfg_ioremap(cfg, idx, 0);
>>>
>>> Hm, I see. The L4 slot must be unconditionally populated before we
>>> clone the idle page-table, otherwise the mappings won't propagate.
>>>
>>> What about unconditionally populating the L4 slot in
>>> subarch_init_memory()? That seems less fragile than doing it in
>>> pci_mmcfg_arch_disable().
>>
>> Less fragile - perhaps. Yet I don't see why we should populate the field if
>> we wouldn't ever need it. Of course with violating layering some, we could
>> know in subarch_init_memory(), as pci_setup() runs earlier.
>
> How can Xen be sure at setup time that the slot will never be used?
> The MMCFG could be empty initially when parsed by Xen, but MMCFG
> regions might appear as a result of AML method execution (_CBA and
> _CRS methods in hotplug host bridges).
Their usability may change, but how many of them there are (going to be) we
need to know at boot time. See how pci_mmcfg_config_num, pci_mmcfg_config[],
and pci_mmcfg_virt[] are initialized (all by __init functions). If regions
could truly appear "out of the blue", we'd have a bigger problem.
Jan
^ permalink raw reply [flat|nested] 27+ messages in thread* Re: [PATCH v3 5/6] x86/PCI: avoid re-evaluation of extended config space accessibility
2026-02-02 14:40 ` Jan Beulich
@ 2026-02-02 15:18 ` Roger Pau Monné
2026-02-03 14:48 ` Jan Beulich
0 siblings, 1 reply; 27+ messages in thread
From: Roger Pau Monné @ 2026-02-02 15:18 UTC (permalink / raw)
To: Jan Beulich
Cc: xen-devel@lists.xenproject.org, Andrew Cooper, Stewart Hildebrand
On Mon, Feb 02, 2026 at 03:40:39PM +0100, Jan Beulich wrote:
> On 02.02.2026 11:13, Roger Pau Monné wrote:
> > On Mon, Feb 02, 2026 at 10:30:29AM +0100, Jan Beulich wrote:
> >> On 02.02.2026 10:14, Roger Pau Monné wrote:
> >>> On Mon, Feb 02, 2026 at 09:51:18AM +0100, Jan Beulich wrote:
> >>>> On 29.01.2026 14:10, Jan Beulich wrote:
> >>>>> @@ -160,10 +161,13 @@ int pci_mmcfg_arch_enable(unsigned int i
> >>>>> return 0;
> >>>>> }
> >>>>>
> >>>>> -void pci_mmcfg_arch_disable(unsigned int idx)
> >>>>> +int pci_mmcfg_arch_disable(unsigned int idx)
> >>>>> {
> >>>>> const typeof(pci_mmcfg_config[0]) *cfg = pci_mmcfg_virt[idx].cfg;
> >>>>>
> >>>>> + if ( !pci_mmcfg_virt[idx].virt )
> >>>>> + return 1;
> >>>>
> >>>> Afaict this is what causes CI (adl-*) to say no here:
> >>>>
> >>>> (XEN) [ 4.132689] PCI: Using MCFG for segment 0000 bus 00-ff
> >>>> (XEN) [ 4.132697] ----[ Xen-4.22-unstable x86_64 debug=y ubsan=y Not tainted ]----
> >>>> (XEN) [ 4.132700] CPU: 12
> >>>> (XEN) [ 4.132702] RIP: e008:[<ffff82d0405779bd>] pci_mmcfg_read+0x19e/0x1c7
> >>>> (XEN) [ 4.132708] RFLAGS: 0000000000010286 CONTEXT: hypervisor (d0v0)
> >>>> (XEN) [ 4.132711] rax: 0000000000300000 rbx: ffff808000300100 rcx: 0000000000000000
> >>>> (XEN) [ 4.132714] rdx: ffff808000300100 rsi: 0000000000000000 rdi: ffff8304959ffcec
> >>>> (XEN) [ 4.132716] rbp: ffff8304959ffd18 rsp: ffff8304959ffce8 r8: 0000000000000004
> >>>> (XEN) [ 4.132718] r9: ffff8304959ffd2c r10: 0000000000000000 r11: 0000000000000000
> >>>> (XEN) [ 4.132720] r12: 0000000000000100 r13: 0000000000000004 r14: ffff8304959ffd2c
> >>>> (XEN) [ 4.132723] r15: ffff808000000000 cr0: 0000000080050033 cr4: 0000000000b526e0
> >>>> (XEN) [ 4.132725] cr3: 0000000492a30000 cr2: ffff808000300100
> >>>> (XEN) [ 4.132727] fsb: 0000000000000000 gsb: ffff8881b9a00000 gss: 0000000000000000
> >>>> (XEN) [ 4.132729] ds: 0000 es: 0000 fs: 0000 gs: 0000 ss: e010 cs: e008
> >>>> (XEN) [ 4.132733] Xen code around <ffff82d0405779bd> (pci_mmcfg_read+0x19e/0x1c7):
> >>>> (XEN) [ 4.132734] 48 39 d3 72 ea 4c 01 e3 <8b> 03 89 c3 4d 85 f6 74 0d 41 89 1e b8 00 00 00
> >>>> (XEN) [ 4.132744] Xen stack trace from rsp=ffff8304959ffce8:
> >>>> (XEN) [ 4.132745] 0000000300000286 ffff830495bd8010 0000000000000003 ffff830495bd8010
> >>>> (XEN) [ 4.132749] ffff8304959ffdd0 ffff82d0405fa7ef ffff8304959ffd30 ffff82d040576877
> >>>> (XEN) [ 4.132753] 000000000000000c ffff8304959ffd58 ffff82d04039b81d ffff8304959ffe28
> >>>> (XEN) [ 4.132756] 0000000000000003 ffff830495bd8010 ffff8304959ffd80 ffff82d0405fa90b
> >>>> (XEN) [ 4.132760] ffff8304959ffdc8 ffff830495bd8010 ffff830498019650 ffff8304959ffdb8
> >>>> (XEN) [ 4.132764] ffff82d0403983e0 ffff830498019650 ffff8304959ffe28 ffff82d0405fa7ef
> >>>> (XEN) [ 4.132767] 0000000000000018 ffffc9004002b900 ffff8304959ffdf8 ffff82d04039feba
> >>>> (XEN) [ 4.132771] ffff82d0405fa7ef ffff8304959ffe28 0000000000000000 ffffc9004002b900
> >>>> (XEN) [ 4.132774] 0000000000000000 ffff8304959bb000 ffff8304959ffe78 ffff82d0405ff666
> >>>> (XEN) [ 4.132778] ffff82d0405713b8 00000000ffffffff 00a0fb0081f456e0 ffff8304959b3010
> >>>> (XEN) [ 4.132782] 00000000c0000000 00000001ff000000 ffff8304959fff08 0000000000000040
> >>>> (XEN) [ 4.132785] 000000ec00000001 ffff8304959fff08 ffff8304959a4000 0000000000000021
> >>>> (XEN) [ 4.132789] 0000000000000000 ffffc9004002b900 ffff8304959ffef8 ffff82d0405711b2
> >>>> (XEN) [ 4.132792] 0000000000000000 ffff888100456938 ffff8881001470b8 0000000000000018
> >>>> (XEN) [ 4.132795] 0000000000000000 ffff8304959ffef8 ffff82d0406213f9 ffff8304959a4000
> >>>> (XEN) [ 4.132799] 0000000000000000 ffff8304959a4000 0000000000000000 0000000000000000
> >>>> (XEN) [ 4.132802] ffff8304959fffff 0000000000000000 00007cfb6a6000d7 ffff82d0402012d3
> >>>> (XEN) [ 4.132806] 0000000000000000 00000000ffffffff ffff8881001470b8 ffff888100b88900
> >>>> (XEN) [ 4.132809] ffffc9004002b900 ffff8881001470b8 0000000000000283 ffff888100456938
> >>>> (XEN) [ 4.132813] ffff888100065410 0000000000000000 0000000000000021 ffffffff81f7842a
> >>>> (XEN) [ 4.132816] Xen call trace:
> >>>> (XEN) [ 4.132819] [<ffff82d0405779bd>] R pci_mmcfg_read+0x19e/0x1c7
> >>>> (XEN) [ 4.132822] [<ffff82d040576877>] F pci_conf_read32+0x55/0x5e
> >>>> (XEN) [ 4.132826] [<ffff82d04039b81d>] F pci_check_extcfg+0xb1/0x13b
> >>>> (XEN) [ 4.132831] [<ffff82d0405fa90b>] F physdev_check_pci_extcfg+0x11c/0x121
> >>>> (XEN) [ 4.132833] [<ffff82d0403983e0>] F drivers/passthrough/pci.c#iterate_all+0xa2/0xe2
> >>>> (XEN) [ 4.132836] [<ffff82d04039feba>] F pci_segment_iterate+0x4e/0x74
> >>>> (XEN) [ 4.132839] [<ffff82d0405ff666>] F do_physdev_op+0x362a/0x4161
> >>>> (XEN) [ 4.132842] [<ffff82d0405711b2>] F pv_hypercall+0x6be/0x838
> >>>> (XEN) [ 4.132845] [<ffff82d0402012d3>] F lstar_enter+0x143/0x148
> >>>> (XEN) [ 4.132847]
> >>>> (XEN) [ 4.132848] Pagetable walk from ffff808000300100:
> >>>> (XEN) [ 4.132851] L4[0x101] = 0000000000000000 ffffffffffffffff
> >>>>
> >>>> There is an important comment in pci_mmcfg_arch_disable():
> >>>>
> >>>> /*
> >>>> * Don't use destroy_xen_mappings() here, or make sure that at least
> >>>> * the necessary L4 entries get populated (so that they get properly
> >>>> * propagated to guest domains' page tables).
> >>>> */
> >>>>
> >>>> Hence it is wrong to bypass
> >>>>
> >>>> mcfg_ioremap(cfg, idx, 0);
> >>>
> >>> Hm, I see. The L4 slot must be unconditionally populated before we
> >>> clone the idle page-table, otherwise the mappings won't propagate.
> >>>
> >>> What about unconditionally populating the L4 slot in
> >>> subarch_init_memory()? That seems less fragile than doing it in
> >>> pci_mmcfg_arch_disable().
> >>
> >> Less fragile - perhaps. Yet I don't see why we should populate the field if
> >> we wouldn't ever need it. Of course with violating layering some, we could
> >> know in subarch_init_memory(), as pci_setup() runs earlier.
> >
> > How can Xen be sure at setup time that the slot will never be used?
> > The MMCFG could be empty initially when parsed by Xen, but MMCFG
> > regions might appear as a result of AML method execution (_CBA and
> > _CRS methods in hotplug host bridges).
>
> Their usability may change, but how many of them there are (going to be) we
> need to know at boot time. See how pci_mmcfg_config_num, pci_mmcfg_config[],
> and pci_mmcfg_virt[] are initialized (all by __init functions). If regions
> could truly appear "out of the blue", we'd have a bigger problem.
My copy of the PCI Firmware Spec v3.3 contains:
"4.1.2. MCFG Table Description
The MCFG table is an ACPI table that is used to communicate the base
addresses corresponding to the non-hot removable PCI Segment Groups
range within a PCI Segment Group available to the operating system at
boot.
[...]
4.1.3. The _CBA Method
Some systems may support hot plug of host bridges that introduce
either a range of buses within an existing PCI Segment Group or
introduce a new PCI Segment Group. For example, each I/O chip in a
multi-chip PCI Express root complex implementation could start a new
PCI Segment Group."
Together with this:
"The MCFG table format allows for more than one memory mapped base
address entry provided each entry (memory mapped configuration space
base address allocation structure) corresponds to a unique PCI Segment
Group consisting of 256 PCI buses. Multiple entries corresponding to a
single PCI Segment Group is not allowed."
Given that each segment group can only appear once in the MCFG, and
that the _CBA method can introduce new segment groups, it would seem
to me the spec does allow for new segments appearing exclusively as
the return of _CBA method? It does read as if hot-removable segment
groups must not appear in the MCFG table. I'm not finding any clear
statement in the spec that says that ECAM areas must previously appear
in the MCFG table.
I'm not sure how common that is, but it doesn't seem impossible given
my reading of the spec.
Thanks, Roger.
^ permalink raw reply [flat|nested] 27+ messages in thread* Re: [PATCH v3 5/6] x86/PCI: avoid re-evaluation of extended config space accessibility
2026-02-02 15:18 ` Roger Pau Monné
@ 2026-02-03 14:48 ` Jan Beulich
2026-02-03 17:19 ` Roger Pau Monné
0 siblings, 1 reply; 27+ messages in thread
From: Jan Beulich @ 2026-02-03 14:48 UTC (permalink / raw)
To: Roger Pau Monné
Cc: xen-devel@lists.xenproject.org, Andrew Cooper, Stewart Hildebrand
On 02.02.2026 16:18, Roger Pau Monné wrote:
> My copy of the PCI Firmware Spec v3.3 contains:
>
> "4.1.2. MCFG Table Description
>
> The MCFG table is an ACPI table that is used to communicate the base
> addresses corresponding to the non-hot removable PCI Segment Groups
> range within a PCI Segment Group available to the operating system at
> boot.
>
> [...]
>
> 4.1.3. The _CBA Method
>
> Some systems may support hot plug of host bridges that introduce
> either a range of buses within an existing PCI Segment Group or
> introduce a new PCI Segment Group. For example, each I/O chip in a
> multi-chip PCI Express root complex implementation could start a new
> PCI Segment Group."
>
> Together with this:
>
> "The MCFG table format allows for more than one memory mapped base
> address entry provided each entry (memory mapped configuration space
> base address allocation structure) corresponds to a unique PCI Segment
> Group consisting of 256 PCI buses. Multiple entries corresponding to a
> single PCI Segment Group is not allowed."
>
> Given that each segment group can only appear once in the MCFG, and
> that the _CBA method can introduce new segment groups, it would seem
> to me the spec does allow for new segments appearing exclusively as
> the return of _CBA method? It does read as if hot-removable segment
> groups must not appear in the MCFG table. I'm not finding any clear
> statement in the spec that says that ECAM areas must previously appear
> in the MCFG table.
>
> I'm not sure how common that is, but it doesn't seem impossible given
> my reading of the spec.
Hmm, that'll be a bit of work then, as Dom0 will also need to propagate
the necessary data into Xen.
Jan
^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [PATCH v3 5/6] x86/PCI: avoid re-evaluation of extended config space accessibility
2026-02-03 14:48 ` Jan Beulich
@ 2026-02-03 17:19 ` Roger Pau Monné
0 siblings, 0 replies; 27+ messages in thread
From: Roger Pau Monné @ 2026-02-03 17:19 UTC (permalink / raw)
To: Jan Beulich
Cc: xen-devel@lists.xenproject.org, Andrew Cooper, Stewart Hildebrand
On Tue, Feb 03, 2026 at 03:48:52PM +0100, Jan Beulich wrote:
> On 02.02.2026 16:18, Roger Pau Monné wrote:
> > My copy of the PCI Firmware Spec v3.3 contains:
> >
> > "4.1.2. MCFG Table Description
> >
> > The MCFG table is an ACPI table that is used to communicate the base
> > addresses corresponding to the non-hot removable PCI Segment Groups
> > range within a PCI Segment Group available to the operating system at
> > boot.
> >
> > [...]
> >
> > 4.1.3. The _CBA Method
> >
> > Some systems may support hot plug of host bridges that introduce
> > either a range of buses within an existing PCI Segment Group or
> > introduce a new PCI Segment Group. For example, each I/O chip in a
> > multi-chip PCI Express root complex implementation could start a new
> > PCI Segment Group."
> >
> > Together with this:
> >
> > "The MCFG table format allows for more than one memory mapped base
> > address entry provided each entry (memory mapped configuration space
> > base address allocation structure) corresponds to a unique PCI Segment
> > Group consisting of 256 PCI buses. Multiple entries corresponding to a
> > single PCI Segment Group is not allowed."
> >
> > Given that each segment group can only appear once in the MCFG, and
> > that the _CBA method can introduce new segment groups, it would seem
> > to me the spec does allow for new segments appearing exclusively as
> > the return of _CBA method? It does read as if hot-removable segment
> > groups must not appear in the MCFG table. I'm not finding any clear
> > statement in the spec that says that ECAM areas must previously appear
> > in the MCFG table.
> >
> > I'm not sure how common that is, but it doesn't seem impossible given
> > my reading of the spec.
>
> Hmm, that'll be a bit of work then, as Dom0 will also need to propagate
> the necessary data into Xen.
TBH this is what the spec says, but I've never encountered such a
system. In fact I've never tested hotplug of a PCI host bridge.
Not sure this can be simulated with QEMU so that we could at least
test whatever fixes we plan to do in that area? I guess we could
"fake" a bodge where Xen ignores the MCFG completely and only becomes
aware of the ECAM areas from what the hardware domain reports back.
Thanks, Roger.
^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [PATCH v3 5/6] x86/PCI: avoid re-evaluation of extended config space accessibility
2026-02-02 8:51 ` Jan Beulich
2026-02-02 9:14 ` Roger Pau Monné
@ 2026-02-02 19:25 ` Stewart Hildebrand
1 sibling, 0 replies; 27+ messages in thread
From: Stewart Hildebrand @ 2026-02-02 19:25 UTC (permalink / raw)
To: Jan Beulich, xen-devel@lists.xenproject.org
Cc: Andrew Cooper, Roger Pau Monné
On 2/2/26 03:51, Jan Beulich wrote:
> On 29.01.2026 14:10, Jan Beulich wrote:
>> @@ -160,10 +161,13 @@ int pci_mmcfg_arch_enable(unsigned int i
>> return 0;
>> }
>>
>> -void pci_mmcfg_arch_disable(unsigned int idx)
>> +int pci_mmcfg_arch_disable(unsigned int idx)
>> {
>> const typeof(pci_mmcfg_config[0]) *cfg = pci_mmcfg_virt[idx].cfg;
>>
>> + if ( !pci_mmcfg_virt[idx].virt )
>> + return 1;
>
> Afaict this is what causes CI (adl-*) to say no here:
>
> (XEN) [ 4.132689] PCI: Using MCFG for segment 0000 bus 00-ff
> (XEN) [ 4.132697] ----[ Xen-4.22-unstable x86_64 debug=y ubsan=y Not tainted ]----
> (XEN) [ 4.132700] CPU: 12
> (XEN) [ 4.132702] RIP: e008:[<ffff82d0405779bd>] pci_mmcfg_read+0x19e/0x1c7
> (XEN) [ 4.132708] RFLAGS: 0000000000010286 CONTEXT: hypervisor (d0v0)
> (XEN) [ 4.132711] rax: 0000000000300000 rbx: ffff808000300100 rcx: 0000000000000000
> (XEN) [ 4.132714] rdx: ffff808000300100 rsi: 0000000000000000 rdi: ffff8304959ffcec
> (XEN) [ 4.132716] rbp: ffff8304959ffd18 rsp: ffff8304959ffce8 r8: 0000000000000004
> (XEN) [ 4.132718] r9: ffff8304959ffd2c r10: 0000000000000000 r11: 0000000000000000
> (XEN) [ 4.132720] r12: 0000000000000100 r13: 0000000000000004 r14: ffff8304959ffd2c
> (XEN) [ 4.132723] r15: ffff808000000000 cr0: 0000000080050033 cr4: 0000000000b526e0
> (XEN) [ 4.132725] cr3: 0000000492a30000 cr2: ffff808000300100
> (XEN) [ 4.132727] fsb: 0000000000000000 gsb: ffff8881b9a00000 gss: 0000000000000000
> (XEN) [ 4.132729] ds: 0000 es: 0000 fs: 0000 gs: 0000 ss: e010 cs: e008
> (XEN) [ 4.132733] Xen code around <ffff82d0405779bd> (pci_mmcfg_read+0x19e/0x1c7):
> (XEN) [ 4.132734] 48 39 d3 72 ea 4c 01 e3 <8b> 03 89 c3 4d 85 f6 74 0d 41 89 1e b8 00 00 00
> (XEN) [ 4.132744] Xen stack trace from rsp=ffff8304959ffce8:
> (XEN) [ 4.132745] 0000000300000286 ffff830495bd8010 0000000000000003 ffff830495bd8010
> (XEN) [ 4.132749] ffff8304959ffdd0 ffff82d0405fa7ef ffff8304959ffd30 ffff82d040576877
> (XEN) [ 4.132753] 000000000000000c ffff8304959ffd58 ffff82d04039b81d ffff8304959ffe28
> (XEN) [ 4.132756] 0000000000000003 ffff830495bd8010 ffff8304959ffd80 ffff82d0405fa90b
> (XEN) [ 4.132760] ffff8304959ffdc8 ffff830495bd8010 ffff830498019650 ffff8304959ffdb8
> (XEN) [ 4.132764] ffff82d0403983e0 ffff830498019650 ffff8304959ffe28 ffff82d0405fa7ef
> (XEN) [ 4.132767] 0000000000000018 ffffc9004002b900 ffff8304959ffdf8 ffff82d04039feba
> (XEN) [ 4.132771] ffff82d0405fa7ef ffff8304959ffe28 0000000000000000 ffffc9004002b900
> (XEN) [ 4.132774] 0000000000000000 ffff8304959bb000 ffff8304959ffe78 ffff82d0405ff666
> (XEN) [ 4.132778] ffff82d0405713b8 00000000ffffffff 00a0fb0081f456e0 ffff8304959b3010
> (XEN) [ 4.132782] 00000000c0000000 00000001ff000000 ffff8304959fff08 0000000000000040
> (XEN) [ 4.132785] 000000ec00000001 ffff8304959fff08 ffff8304959a4000 0000000000000021
> (XEN) [ 4.132789] 0000000000000000 ffffc9004002b900 ffff8304959ffef8 ffff82d0405711b2
> (XEN) [ 4.132792] 0000000000000000 ffff888100456938 ffff8881001470b8 0000000000000018
> (XEN) [ 4.132795] 0000000000000000 ffff8304959ffef8 ffff82d0406213f9 ffff8304959a4000
> (XEN) [ 4.132799] 0000000000000000 ffff8304959a4000 0000000000000000 0000000000000000
> (XEN) [ 4.132802] ffff8304959fffff 0000000000000000 00007cfb6a6000d7 ffff82d0402012d3
> (XEN) [ 4.132806] 0000000000000000 00000000ffffffff ffff8881001470b8 ffff888100b88900
> (XEN) [ 4.132809] ffffc9004002b900 ffff8881001470b8 0000000000000283 ffff888100456938
> (XEN) [ 4.132813] ffff888100065410 0000000000000000 0000000000000021 ffffffff81f7842a
> (XEN) [ 4.132816] Xen call trace:
> (XEN) [ 4.132819] [<ffff82d0405779bd>] R pci_mmcfg_read+0x19e/0x1c7
> (XEN) [ 4.132822] [<ffff82d040576877>] F pci_conf_read32+0x55/0x5e
> (XEN) [ 4.132826] [<ffff82d04039b81d>] F pci_check_extcfg+0xb1/0x13b
> (XEN) [ 4.132831] [<ffff82d0405fa90b>] F physdev_check_pci_extcfg+0x11c/0x121
> (XEN) [ 4.132833] [<ffff82d0403983e0>] F drivers/passthrough/pci.c#iterate_all+0xa2/0xe2
> (XEN) [ 4.132836] [<ffff82d04039feba>] F pci_segment_iterate+0x4e/0x74
> (XEN) [ 4.132839] [<ffff82d0405ff666>] F do_physdev_op+0x362a/0x4161
> (XEN) [ 4.132842] [<ffff82d0405711b2>] F pv_hypercall+0x6be/0x838
> (XEN) [ 4.132845] [<ffff82d0402012d3>] F lstar_enter+0x143/0x148
> (XEN) [ 4.132847]
> (XEN) [ 4.132848] Pagetable walk from ffff808000300100:
> (XEN) [ 4.132851] L4[0x101] = 0000000000000000 ffffffffffffffff
FWIW, I can reproduce this on my test machine.
> There is an important comment in pci_mmcfg_arch_disable():
>
> /*
> * Don't use destroy_xen_mappings() here, or make sure that at least
> * the necessary L4 entries get populated (so that they get properly
> * propagated to guest domains' page tables).
> */
>
> Hence it is wrong to bypass
>
> mcfg_ioremap(cfg, idx, 0);
Indeed, restoring this call resolves it.
^ permalink raw reply [flat|nested] 27+ messages in thread
* [PATCH v3 6/6] vPCI: re-init extended-capability lists when MMCFG availability changed
2026-01-29 13:05 [PATCH v3 0/6] (v)PCI: extended capability handling Jan Beulich
` (4 preceding siblings ...)
2026-01-29 13:10 ` [PATCH v3 5/6] x86/PCI: avoid re-evaluation of extended config space accessibility Jan Beulich
@ 2026-01-29 13:10 ` Jan Beulich
2026-01-29 15:40 ` Roger Pau Monné
5 siblings, 1 reply; 27+ messages in thread
From: Jan Beulich @ 2026-01-29 13:10 UTC (permalink / raw)
To: xen-devel@lists.xenproject.org
Cc: Andrew Cooper, Roger Pau Monné, Stewart Hildebrand
When Dom0 informs up about MMCFG usability, this may change whether
extended capabilities are available (accessible) for devices. Zap what
might be on record, and re-initialize the list.
No synchronization is added for the case where devices may already be in
use. That'll need sorting when (a) DomU support was added and (b) DomU-s
may run already while Dom0 / hwdom still boots (dom0less, Hyperlaunch).
Signed-off-by: Jan Beulich <jbeulich@suse.com>
---
vpci_reinit_ext_capability_list()'s return value isn't checked, as it
doesn't feel quite right to fail the hypercall because of this. At the
same time it also doesn't feel quite right to have the function return
"void". Thoughts?
---
v3: New.
--- a/xen/arch/x86/physdev.c
+++ b/xen/arch/x86/physdev.c
@@ -8,6 +8,8 @@
#include <xen/guest_access.h>
#include <xen/iocap.h>
#include <xen/serial.h>
+#include <xen/vpci.h>
+
#include <asm/current.h>
#include <asm/io_apic.h>
#include <asm/msi.h>
@@ -169,7 +171,10 @@ int cf_check physdev_check_pci_extcfg(st
ASSERT(pdev->seg == info->segment);
if ( pdev->bus >= info->start_bus && pdev->bus <= info->end_bus )
+ {
pci_check_extcfg(pdev);
+ vpci_reinit_ext_capability_list(pdev);
+ }
return 0;
}
--- a/xen/drivers/vpci/header.c
+++ b/xen/drivers/vpci/header.c
@@ -869,6 +869,18 @@ static int vpci_init_ext_capability_list
return 0;
}
+int vpci_reinit_ext_capability_list(const struct pci_dev *pdev)
+{
+ if ( !pdev->vpci )
+ return 0;
+
+ if ( vpci_remove_registers(pdev->vpci, PCI_CFG_SPACE_SIZE,
+ PCI_CFG_SPACE_EXP_SIZE - PCI_CFG_SPACE_SIZE) )
+ ASSERT_UNREACHABLE();
+
+ return vpci_init_ext_capability_list(pdev);
+}
+
int vpci_init_header(struct pci_dev *pdev)
{
uint16_t cmd;
--- a/xen/include/xen/vpci.h
+++ b/xen/include/xen/vpci.h
@@ -45,6 +45,7 @@ typedef struct {
REGISTER_VPCI_CAPABILITY(PCI_EXT_CAP_ID_##name, name, finit, fclean, true)
int __must_check vpci_init_header(struct pci_dev *pdev);
+int vpci_reinit_ext_capability_list(const struct pci_dev *pdev);
/* Assign vPCI to device by adding handlers. */
int __must_check vpci_assign_device(struct pci_dev *pdev);
@@ -300,6 +301,11 @@ bool vpci_ecam_read(pci_sbdf_t sbdf, uns
#else /* !CONFIG_HAS_VPCI */
struct vpci_vcpu {};
+static inline int vpci_reinit_ext_capability_list(const struct pci_dev *pdev)
+{
+ return 0;
+}
+
static inline int vpci_assign_device(struct pci_dev *pdev)
{
return 0;
^ permalink raw reply [flat|nested] 27+ messages in thread* Re: [PATCH v3 6/6] vPCI: re-init extended-capability lists when MMCFG availability changed
2026-01-29 13:10 ` [PATCH v3 6/6] vPCI: re-init extended-capability lists when MMCFG availability changed Jan Beulich
@ 2026-01-29 15:40 ` Roger Pau Monné
2026-01-29 15:54 ` Jan Beulich
0 siblings, 1 reply; 27+ messages in thread
From: Roger Pau Monné @ 2026-01-29 15:40 UTC (permalink / raw)
To: Jan Beulich
Cc: xen-devel@lists.xenproject.org, Andrew Cooper, Stewart Hildebrand
On Thu, Jan 29, 2026 at 02:10:34PM +0100, Jan Beulich wrote:
> When Dom0 informs up about MMCFG usability, this may change whether
> extended capabilities are available (accessible) for devices. Zap what
> might be on record, and re-initialize the list.
>
> No synchronization is added for the case where devices may already be in
> use. That'll need sorting when (a) DomU support was added and (b) DomU-s
> may run already while Dom0 / hwdom still boots (dom0less, Hyperlaunch).
>
> Signed-off-by: Jan Beulich <jbeulich@suse.com>
> ---
> vpci_reinit_ext_capability_list()'s return value isn't checked, as it
> doesn't feel quite right to fail the hypercall because of this. At the
> same time it also doesn't feel quite right to have the function return
> "void". Thoughts?
> ---
> v3: New.
>
> --- a/xen/arch/x86/physdev.c
> +++ b/xen/arch/x86/physdev.c
> @@ -8,6 +8,8 @@
> #include <xen/guest_access.h>
> #include <xen/iocap.h>
> #include <xen/serial.h>
> +#include <xen/vpci.h>
> +
> #include <asm/current.h>
> #include <asm/io_apic.h>
> #include <asm/msi.h>
> @@ -169,7 +171,10 @@ int cf_check physdev_check_pci_extcfg(st
>
> ASSERT(pdev->seg == info->segment);
> if ( pdev->bus >= info->start_bus && pdev->bus <= info->end_bus )
> + {
> pci_check_extcfg(pdev);
> + vpci_reinit_ext_capability_list(pdev);
> + }
>
> return 0;
> }
> --- a/xen/drivers/vpci/header.c
> +++ b/xen/drivers/vpci/header.c
> @@ -869,6 +869,18 @@ static int vpci_init_ext_capability_list
> return 0;
> }
>
> +int vpci_reinit_ext_capability_list(const struct pci_dev *pdev)
> +{
> + if ( !pdev->vpci )
> + return 0;
> +
> + if ( vpci_remove_registers(pdev->vpci, PCI_CFG_SPACE_SIZE,
> + PCI_CFG_SPACE_EXP_SIZE - PCI_CFG_SPACE_SIZE) )
> + ASSERT_UNREACHABLE();
> +
> + return vpci_init_ext_capability_list(pdev);
Isn't this missing the possible addition or removal of managed
extended capabilities? IOW: on removal of access to the extended
space the vPCI managed capabilties that have is_ext == true should
call their ->cleanup() hooks, and on discovery of MMCFG access we
should call the ->init() hooks?
Thanks, Roger.
^ permalink raw reply [flat|nested] 27+ messages in thread* Re: [PATCH v3 6/6] vPCI: re-init extended-capability lists when MMCFG availability changed
2026-01-29 15:40 ` Roger Pau Monné
@ 2026-01-29 15:54 ` Jan Beulich
2026-01-29 17:41 ` Roger Pau Monné
0 siblings, 1 reply; 27+ messages in thread
From: Jan Beulich @ 2026-01-29 15:54 UTC (permalink / raw)
To: Roger Pau Monné
Cc: xen-devel@lists.xenproject.org, Andrew Cooper, Stewart Hildebrand
On 29.01.2026 16:40, Roger Pau Monné wrote:
> On Thu, Jan 29, 2026 at 02:10:34PM +0100, Jan Beulich wrote:
>> --- a/xen/drivers/vpci/header.c
>> +++ b/xen/drivers/vpci/header.c
>> @@ -869,6 +869,18 @@ static int vpci_init_ext_capability_list
>> return 0;
>> }
>>
>> +int vpci_reinit_ext_capability_list(const struct pci_dev *pdev)
>> +{
>> + if ( !pdev->vpci )
>> + return 0;
>> +
>> + if ( vpci_remove_registers(pdev->vpci, PCI_CFG_SPACE_SIZE,
>> + PCI_CFG_SPACE_EXP_SIZE - PCI_CFG_SPACE_SIZE) )
>> + ASSERT_UNREACHABLE();
>> +
>> + return vpci_init_ext_capability_list(pdev);
>
> Isn't this missing the possible addition or removal of managed
> extended capabilities? IOW: on removal of access to the extended
> space the vPCI managed capabilties that have is_ext == true should
> call their ->cleanup() hooks, and on discovery of MMCFG access we
> should call the ->init() hooks?
Now I know why this felt too easy. Yet I wonder: Why is this done in two
parts in the first place?
Jan
^ permalink raw reply [flat|nested] 27+ messages in thread* Re: [PATCH v3 6/6] vPCI: re-init extended-capability lists when MMCFG availability changed
2026-01-29 15:54 ` Jan Beulich
@ 2026-01-29 17:41 ` Roger Pau Monné
0 siblings, 0 replies; 27+ messages in thread
From: Roger Pau Monné @ 2026-01-29 17:41 UTC (permalink / raw)
To: Jan Beulich
Cc: xen-devel@lists.xenproject.org, Andrew Cooper, Stewart Hildebrand
On Thu, Jan 29, 2026 at 04:54:59PM +0100, Jan Beulich wrote:
> On 29.01.2026 16:40, Roger Pau Monné wrote:
> > On Thu, Jan 29, 2026 at 02:10:34PM +0100, Jan Beulich wrote:
> >> --- a/xen/drivers/vpci/header.c
> >> +++ b/xen/drivers/vpci/header.c
> >> @@ -869,6 +869,18 @@ static int vpci_init_ext_capability_list
> >> return 0;
> >> }
> >>
> >> +int vpci_reinit_ext_capability_list(const struct pci_dev *pdev)
> >> +{
> >> + if ( !pdev->vpci )
> >> + return 0;
> >> +
> >> + if ( vpci_remove_registers(pdev->vpci, PCI_CFG_SPACE_SIZE,
> >> + PCI_CFG_SPACE_EXP_SIZE - PCI_CFG_SPACE_SIZE) )
> >> + ASSERT_UNREACHABLE();
> >> +
> >> + return vpci_init_ext_capability_list(pdev);
> >
> > Isn't this missing the possible addition or removal of managed
> > extended capabilities? IOW: on removal of access to the extended
> > space the vPCI managed capabilties that have is_ext == true should
> > call their ->cleanup() hooks, and on discovery of MMCFG access we
> > should call the ->init() hooks?
>
> Now I know why this felt too easy. Yet I wonder: Why is this done in two
> parts in the first place?
I think this boils down to us (me I would think) not planning ahead
that capabilities might appear _after_ the initial device assignation.
This is true for example when Xen doesn't have access to the MMCFG at
boot, and it's only made aware of such after starting the hardware
domain.
Right now vpci_init_{,ext_}capability_list() is called from
vpci_init_header(), but for the case where MMCFG is registered late we
don't really want to re-init the header, so calling that helper is not
an option.
We might want to create two new wrappers that encapsulate
vpci_init_{,ext_}capability_list() + the calling of the
vpci_init_capabilities(), provided that vpci_init_capabilities() is
also split between legacy and extended capabilities. We want to call
those new helpers from vpci_assign_device() instead of
vpci_init_header().
Thanks, Roger.
^ permalink raw reply [flat|nested] 27+ messages in thread