* [PATCH v3 0/7] Implement SR-IOV support for PVH
From: Mykyta Poturai @ 2026-04-09 14:01 UTC (permalink / raw)
To: xen-devel@lists.xenproject.org
Cc: Mykyta Poturai, Roger Pau Monné, Stewart Hildebrand,
Andrew Cooper, Anthony PERARD, Michal Orzel, Jan Beulich,
Julien Grall, Stefano Stabellini, Daniel P. Smith
This series enables support for the PCI SR-IOV capability in PVH domains.
It allows Dom0 to enable and use SR-IOV virtual functions (VFs), and allows
these functions to be passed through to guests.
To achieve this, add handlers for the SRIOV_CONTROL registers and simplified
handlers for VF headers. Xen relies on Dom0 to enable SR-IOV and to call
PHYSDEVOP_pci_device_* to inform Xen about the addition/removal of VFs.
Core functionality is based on previous work[1].
Tested on an R-Car Spider board with a Samsung NVMe SSD Controller 980 and an
Intel X550T Ethernet card.
[1]: https://lore.kernel.org/xen-devel/20180717094830.54806-1-roger.pau@citrix.com/
v2->v3:
* rework the series for VF discovery by Dom0
* drop pci/iommu: Check that IOMMU supports removing devices, see [2]
* drop vpci: add a wait operation to the vpci vcpu pending actions
* add vpci: allow queueing of mapping operations
* minor changes in individual patches
[2]: https://patchew.org/Xen/a59c2da0d4c72deb42950e9a8e3982fbdee60668.1775555766.git.mykyta._5Fpoturai@epam.com/
v1->v2:
* rework the series for VF discovery in Xen
* separate doc changes into the last patch
Mykyta Poturai (2):
vpci: Use pervcpu ranges for BAR mapping
docs: Update SR-IOV support status
Stewart Hildebrand (5):
vpci: rename and export vpci_modify_bars
vpci: rename and export vpci_guest_mem_bar_{read,write}
vpci: allow queueing of mapping operations
vpci: add SR-IOV support for PVH Dom0
vpci: add SR-IOV support for DomUs
SUPPORT.md | 2 -
xen/common/domain.c | 2 +
xen/drivers/vpci/Makefile | 1 +
xen/drivers/vpci/header.c | 314 +++++++++++++++++++------------
xen/drivers/vpci/private.h | 10 +
xen/drivers/vpci/sriov.c | 366 +++++++++++++++++++++++++++++++++++++
xen/drivers/vpci/vpci.c | 10 +-
xen/include/xen/vpci.h | 28 ++-
8 files changed, 607 insertions(+), 126 deletions(-)
create mode 100644 xen/drivers/vpci/sriov.c
--
2.51.2
* [PATCH v3 1/7] vpci: rename and export vpci_modify_bars
From: Mykyta Poturai @ 2026-04-09 14:01 UTC (permalink / raw)
To: xen-devel@lists.xenproject.org
Cc: Stewart Hildebrand, Roger Pau Monné, Mykyta Poturai,
Teddy Astie
From: Stewart Hildebrand <stewart.hildebrand@amd.com>
Export functions required for SR-IOV support.
Signed-off-by: Stewart Hildebrand <stewart.hildebrand@amd.com>
Signed-off-by: Mykyta Poturai <mykyta_poturai@epam.com>
Reviewed-by: Teddy Astie <teddy.astie@vates.tech>
---
v2->v3:
* move declaration to private.h
v1->v2:
* Collect RBs
---
| 16 +++++++++-------
xen/drivers/vpci/private.h | 3 +++
2 files changed, 12 insertions(+), 7 deletions(-)
diff --git a/xen/drivers/vpci/header.c b/xen/drivers/vpci/header.c
index a760d8c32f..96995e098b 100644
--- a/xen/drivers/vpci/header.c
+++ b/xen/drivers/vpci/header.c
@@ -305,7 +305,7 @@ static void defer_map(const struct pci_dev *pdev, uint16_t cmd, bool rom_only)
raise_softirq(SCHEDULE_SOFTIRQ);
}
-static int modify_bars(const struct pci_dev *pdev, uint16_t cmd, bool rom_only)
+int vpci_modify_bars(const struct pci_dev *pdev, uint16_t cmd, bool rom_only)
{
struct vpci_header *header = &pdev->vpci->header;
struct pci_dev *tmp;
@@ -546,7 +546,7 @@ static void cf_check cmd_write(
* memory decoding bit has not been changed, so leave everything as-is,
* hoping the guest will realize and try again.
*/
- modify_bars(pdev, cmd, false);
+ vpci_modify_bars(pdev, cmd, false);
else
pci_conf_write16(pdev->sbdf, reg, cmd);
}
@@ -714,13 +714,15 @@ static void cf_check rom_write(
* Pass PCI_COMMAND_MEMORY or 0 to signal a map/unmap request, note that
* this fabricated command is never going to be written to the register.
*/
- else if ( modify_bars(pdev, new_enabled ? PCI_COMMAND_MEMORY : 0, true) )
+ else if ( vpci_modify_bars(pdev,
+ new_enabled ? PCI_COMMAND_MEMORY : 0,
+ true) )
/*
* No memory has been added or removed from the p2m (because the actual
* p2m changes are deferred in defer_map) and the ROM enable bit has
* not been changed, so leave everything as-is, hoping the guest will
* realize and try again. It's important to not update rom->addr in the
- * unmap case if modify_bars has failed, or future attempts would
+ * unmap case if vpci_modify_bars has failed, or future attempts would
* attempt to unmap the wrong address.
*/
return;
@@ -800,8 +802,8 @@ int vpci_init_header(struct pci_dev *pdev)
/*
* For DomUs, clear PCI_COMMAND_{MASTER,MEMORY,IO} and other
* DomU-controllable bits in PCI_COMMAND. Devices assigned to DomUs will
- * start with memory decoding disabled, and modify_bars() will not be called
- * at the end of this function.
+ * start with memory decoding disabled, and vpci_modify_bars() will not be
+ * called at the end of this function.
*/
if ( !is_hwdom )
cmd &= ~(PCI_COMMAND_VGA_PALETTE | PCI_COMMAND_INVALIDATE |
@@ -926,7 +928,7 @@ int vpci_init_header(struct pci_dev *pdev)
goto fail;
}
- return (cmd & PCI_COMMAND_MEMORY) ? modify_bars(pdev, cmd, false) : 0;
+ return (cmd & PCI_COMMAND_MEMORY) ? vpci_modify_bars(pdev, cmd, false) : 0;
fail:
pci_conf_write16(pdev->sbdf, PCI_COMMAND, cmd);
diff --git a/xen/drivers/vpci/private.h b/xen/drivers/vpci/private.h
index 2907f6b40f..6fdf8a20d9 100644
--- a/xen/drivers/vpci/private.h
+++ b/xen/drivers/vpci/private.h
@@ -138,6 +138,9 @@ static inline size_t vmsix_table_size(const struct vpci *vpci, unsigned int nr)
8), 8);
}
+/* Map/unmap the BARs of a vPCI device. */
+int vpci_modify_bars(const struct pci_dev *pdev, uint16_t cmd, bool rom_only);
+
#endif /* __XEN__ */
#endif /* VPCI_PRIVATE_H */
--
2.51.2
* [PATCH v3 2/7] vpci: rename and export vpci_guest_mem_bar_{read,write}
From: Mykyta Poturai @ 2026-04-09 14:01 UTC (permalink / raw)
To: xen-devel@lists.xenproject.org
Cc: Stewart Hildebrand, Roger Pau Monné, Mykyta Poturai,
Teddy Astie
From: Stewart Hildebrand <stewart.hildebrand@amd.com>
Export functions required for SR-IOV support.
Signed-off-by: Stewart Hildebrand <stewart.hildebrand@amd.com>
Signed-off-by: Mykyta Poturai <mykyta_poturai@epam.com>
Reviewed-by: Teddy Astie <teddy.astie@vates.tech>
---
v2->v3:
* move declarations to private.h
v1->v2:
* collect RBs
---
| 20 +++++++++++---------
xen/drivers/vpci/private.h | 6 ++++++
2 files changed, 17 insertions(+), 9 deletions(-)
diff --git a/xen/drivers/vpci/header.c b/xen/drivers/vpci/header.c
index 96995e098b..5d5ba5c016 100644
--- a/xen/drivers/vpci/header.c
+++ b/xen/drivers/vpci/header.c
@@ -611,9 +611,9 @@ static void cf_check bar_write(
pci_conf_write32(pdev->sbdf, reg, val);
}
-static void cf_check guest_mem_bar_write(const struct pci_dev *pdev,
- unsigned int reg, uint32_t val,
- void *data)
+void cf_check vpci_guest_mem_bar_write(const struct pci_dev *pdev,
+ unsigned int reg, uint32_t val,
+ void *data)
{
struct vpci_bar *bar = data;
bool hi = false;
@@ -653,8 +653,8 @@ static void cf_check guest_mem_bar_write(const struct pci_dev *pdev,
bar->guest_addr = guest_addr;
}
-static uint32_t cf_check guest_mem_bar_read(const struct pci_dev *pdev,
- unsigned int reg, void *data)
+uint32_t cf_check vpci_guest_mem_bar_read(const struct pci_dev *pdev,
+ unsigned int reg, void *data)
{
const struct vpci_bar *bar = data;
uint32_t reg_val;
@@ -826,8 +826,9 @@ int vpci_init_header(struct pci_dev *pdev)
bars[i].type = VPCI_BAR_MEM64_HI;
rc = vpci_add_register(pdev->vpci,
is_hwdom ? vpci_hw_read32
- : guest_mem_bar_read,
- is_hwdom ? bar_write : guest_mem_bar_write,
+ : vpci_guest_mem_bar_read,
+ is_hwdom ? bar_write
+ : vpci_guest_mem_bar_write,
reg, 4, &bars[i]);
if ( rc )
goto fail;
@@ -885,8 +886,9 @@ int vpci_init_header(struct pci_dev *pdev)
bars[i].prefetchable = val & PCI_BASE_ADDRESS_MEM_PREFETCH;
rc = vpci_add_register(pdev->vpci,
- is_hwdom ? vpci_hw_read32 : guest_mem_bar_read,
- is_hwdom ? bar_write : guest_mem_bar_write,
+ is_hwdom ? vpci_hw_read32
+ : vpci_guest_mem_bar_read,
+ is_hwdom ? bar_write : vpci_guest_mem_bar_write,
reg, 4, &bars[i]);
if ( rc )
goto fail;
diff --git a/xen/drivers/vpci/private.h b/xen/drivers/vpci/private.h
index 6fdf8a20d9..f012fd160d 100644
--- a/xen/drivers/vpci/private.h
+++ b/xen/drivers/vpci/private.h
@@ -141,6 +141,12 @@ static inline size_t vmsix_table_size(const struct vpci *vpci, unsigned int nr)
/* Map/unmap the BARs of a vPCI device. */
int vpci_modify_bars(const struct pci_dev *pdev, uint16_t cmd, bool rom_only);
+void cf_check vpci_guest_mem_bar_write(const struct pci_dev *pdev,
+ unsigned int reg, uint32_t val,
+ void *data);
+
+uint32_t cf_check vpci_guest_mem_bar_read(const struct pci_dev *pdev,
+ unsigned int reg, void *data);
#endif /* __XEN__ */
#endif /* VPCI_PRIVATE_H */
--
2.51.2
* [PATCH v3 3/7] vpci: Use pervcpu ranges for BAR mapping
From: Mykyta Poturai @ 2026-04-09 14:01 UTC (permalink / raw)
To: xen-devel@lists.xenproject.org
Cc: Mykyta Poturai, Andrew Cooper, Anthony PERARD, Michal Orzel,
Jan Beulich, Julien Grall, Roger Pau Monné,
Stefano Stabellini, Stewart Hildebrand, Mykyta Poturai
From: Mykyta Poturai <Mykyta_Poturai@epam.com>
There is no need to store ranges for each PCI device: they are only used
during the mapping/unmapping process and can be reused across devices. This
also avoids having to allocate and destroy rangesets for each device.
Move the rangesets from struct vpci_bar to struct vpci_vcpu and perform
(de-)allocation alongside vcpu (de-)allocation. Introduce the
RANGESET_DESTROY() macro to free a rangeset and set the pointer to NULL.
Amends: 622bdd962822 ("vpci/header: handle p2m range sets per BAR")
Signed-off-by: Mykyta Poturai <mykyta_poturai@epam.com>
Signed-off-by: Stewart Hildebrand <stewart.hildebrand@amd.com>
---
v2->v3:
* synced with BAR write with memory decoding enabled series[1]
* add Amends tag
* remove unused variable i due to rebasing over 998060dd9101 ("vPCI:
move vpci_init_capabilities() to a separate file")
* enclose entire struct vpci_vcpu inside #ifdef __XEN__
* s/bar_mem/mem/
* use ARRAY_SIZE
* put init/destroy in functions
* only allocate for domains with vPCI and idle domain
* replace 'if ( !mem ) continue;' with ASSERT
v1->v2
* new patch
[1]: https://patchew.org/Xen/20260406191203.97662-1-stewart.hildebrand@amd.com/
---
xen/common/domain.c | 5 +++
| 67 ++++++++++++++------------------------
xen/drivers/vpci/vpci.c | 34 ++++++++++++++++---
xen/include/xen/rangeset.h | 7 ++++
xen/include/xen/vpci.h | 10 ++++--
5 files changed, 73 insertions(+), 50 deletions(-)
diff --git a/xen/common/domain.c b/xen/common/domain.c
index bb9e210c28..5ef7db8f09 100644
--- a/xen/common/domain.c
+++ b/xen/common/domain.c
@@ -455,6 +455,8 @@ static int vcpu_teardown(struct vcpu *v)
*/
static void vcpu_destroy(struct vcpu *v)
{
+ vpci_vcpu_destroy(v);
+
free_vcpu_struct(v);
}
@@ -512,6 +514,9 @@ struct vcpu *vcpu_create(struct domain *d, unsigned int vcpu_id)
if ( arch_vcpu_create(v) != 0 )
goto fail_sched;
+ if ( vpci_vcpu_init(v) )
+ goto fail_sched;
+
d->vcpu[vcpu_id] = v;
if ( vcpu_id != 0 )
{
diff --git a/xen/drivers/vpci/header.c b/xen/drivers/vpci/header.c
index 5d5ba5c016..5bfb541b6a 100644
--- a/xen/drivers/vpci/header.c
+++ b/xen/drivers/vpci/header.c
@@ -196,6 +196,7 @@ bool vpci_process_pending(struct vcpu *v)
for ( i = 0; i < ARRAY_SIZE(header->bars); i++ )
{
struct vpci_bar *bar = &header->bars[i];
+ struct rangeset *mem = v->vpci.mem[i];
struct map_data data = {
.d = v->domain,
.map = v->vpci.cmd & PCI_COMMAND_MEMORY,
@@ -203,10 +204,10 @@ bool vpci_process_pending(struct vcpu *v)
};
int rc;
- if ( rangeset_is_empty(bar->mem) )
+ if ( rangeset_is_empty(mem) )
continue;
- rc = rangeset_consume_ranges(bar->mem, map_range, &data);
+ rc = rangeset_consume_ranges(mem, map_range, &data);
if ( rc == -ERESTART )
{
@@ -224,8 +225,8 @@ bool vpci_process_pending(struct vcpu *v)
/* Clean all the rangesets */
for ( i = 0; i < ARRAY_SIZE(header->bars); i++ )
- if ( !rangeset_is_empty(header->bars[i].mem) )
- rangeset_purge(header->bars[i].mem);
+ if ( !rangeset_is_empty(v->vpci.mem[i]) )
+ rangeset_purge(v->vpci.mem[i]);
v->vpci.pdev = NULL;
@@ -260,13 +261,14 @@ static int __init apply_map(struct domain *d, const struct pci_dev *pdev,
for ( i = 0; i < ARRAY_SIZE(header->bars); i++ )
{
struct vpci_bar *bar = &header->bars[i];
+ struct rangeset *mem = current->vpci.mem[i];
struct map_data data = { .d = d, .map = true, .bar = bar };
- if ( rangeset_is_empty(bar->mem) )
+ if ( rangeset_is_empty(mem) )
continue;
- while ( (rc = rangeset_consume_ranges(bar->mem, map_range,
- &data)) == -ERESTART )
+ while ( (rc = rangeset_consume_ranges(mem, map_range, &data)) ==
+ -ERESTART )
{
/*
* It's safe to drop and reacquire the lock in this context
@@ -331,13 +333,13 @@ int vpci_modify_bars(const struct pci_dev *pdev, uint16_t cmd, bool rom_only)
for ( i = 0; i < ARRAY_SIZE(header->bars); i++ )
{
struct vpci_bar *bar = &header->bars[i];
+ struct rangeset *mem = current->vpci.mem[i];
unsigned long start = PFN_DOWN(bar->addr);
unsigned long end = PFN_DOWN(bar->addr + bar->size - 1);
unsigned long start_guest = PFN_DOWN(bar->guest_addr);
unsigned long end_guest = PFN_DOWN(bar->guest_addr + bar->size - 1);
- if ( !bar->mem )
- continue;
+ ASSERT(mem);
if ( !MAPPABLE_BAR(bar) ||
(rom_only ? bar->type != VPCI_BAR_ROM
@@ -354,7 +356,7 @@ int vpci_modify_bars(const struct pci_dev *pdev, uint16_t cmd, bool rom_only)
continue;
}
- ASSERT(rangeset_is_empty(bar->mem));
+ ASSERT(rangeset_is_empty(mem));
/*
* Make sure that the guest set address has the same page offset
@@ -369,7 +371,7 @@ int vpci_modify_bars(const struct pci_dev *pdev, uint16_t cmd, bool rom_only)
return -EINVAL;
}
- rc = rangeset_add_range(bar->mem, start_guest, end_guest);
+ rc = rangeset_add_range(mem, start_guest, end_guest);
if ( rc )
{
printk(XENLOG_G_WARNING "Failed to add [%lx, %lx]: %d\n",
@@ -380,12 +382,12 @@ int vpci_modify_bars(const struct pci_dev *pdev, uint16_t cmd, bool rom_only)
/* Check for overlap with the already setup BAR ranges. */
for ( j = 0; j < i; j++ )
{
- struct vpci_bar *prev_bar = &header->bars[j];
+ struct rangeset *prev_mem = current->vpci.mem[j];
- if ( rangeset_is_empty(prev_bar->mem) )
+ if ( rangeset_is_empty(prev_mem) )
continue;
- rc = rangeset_remove_range(prev_bar->mem, start_guest, end_guest);
+ rc = rangeset_remove_range(prev_mem, start_guest, end_guest);
if ( rc )
{
gprintk(XENLOG_WARNING,
@@ -395,7 +397,7 @@ int vpci_modify_bars(const struct pci_dev *pdev, uint16_t cmd, bool rom_only)
}
}
- rc = pci_sanitize_bar_memory(bar->mem);
+ rc = pci_sanitize_bar_memory(mem);
if ( rc )
{
gprintk(XENLOG_WARNING,
@@ -412,14 +414,14 @@ int vpci_modify_bars(const struct pci_dev *pdev, uint16_t cmd, bool rom_only)
unsigned long end = PFN_DOWN(vmsix_table_addr(pdev->vpci, i) +
vmsix_table_size(pdev->vpci, i) - 1);
- for ( j = 0; j < ARRAY_SIZE(header->bars); j++ )
+ for ( j = 0; j < ARRAY_SIZE(current->vpci.mem); j++ )
{
- const struct vpci_bar *bar = &header->bars[j];
+ struct rangeset *mem = current->vpci.mem[j];
- if ( rangeset_is_empty(bar->mem) )
+ if ( rangeset_is_empty(mem) )
continue;
- rc = rangeset_remove_range(bar->mem, start, end);
+ rc = rangeset_remove_range(mem, start, end);
if ( rc )
{
gprintk(XENLOG_WARNING,
@@ -469,8 +471,9 @@ int vpci_modify_bars(const struct pci_dev *pdev, uint16_t cmd, bool rom_only)
for ( j = 0; j < ARRAY_SIZE(header->bars); j++)
{
const struct vpci_bar *bar = &header->bars[j];
+ struct rangeset *mem = current->vpci.mem[j];
- if ( !rangeset_overlaps_range(bar->mem, start, end) ||
+ if ( !rangeset_overlaps_range(mem, start, end) ||
/*
* If only the ROM enable bit is toggled check against
* other BARs in the same device for overlaps, but not
@@ -481,7 +484,7 @@ int vpci_modify_bars(const struct pci_dev *pdev, uint16_t cmd, bool rom_only)
bar->type == VPCI_BAR_ROM) )
continue;
- rc = rangeset_remove_range(bar->mem, start, end);
+ rc = rangeset_remove_range(mem, start, end);
if ( rc )
{
gprintk(XENLOG_WARNING,
@@ -734,18 +737,6 @@ static void cf_check rom_write(
}
}
-static int bar_add_rangeset(const struct pci_dev *pdev, struct vpci_bar *bar,
- unsigned int i)
-{
- char str[32];
-
- snprintf(str, sizeof(str), "%pp:BAR%u", &pdev->sbdf, i);
-
- bar->mem = rangeset_new(pdev->domain, str, RANGESETF_no_print);
-
- return !bar->mem ? -ENOMEM : 0;
-}
-
int vpci_init_header(struct pci_dev *pdev)
{
uint16_t cmd;
@@ -856,10 +847,6 @@ int vpci_init_header(struct pci_dev *pdev)
else
bars[i].type = VPCI_BAR_MEM32;
- rc = bar_add_rangeset(pdev, &bars[i], i);
- if ( rc )
- goto fail;
-
rc = pci_size_mem_bar(pdev->sbdf, reg, &addr, &size,
(i == num_bars - 1) ? PCI_BAR_LAST : 0);
if ( rc < 0 )
@@ -913,12 +900,6 @@ int vpci_init_header(struct pci_dev *pdev)
4, rom);
if ( rc )
rom->type = VPCI_BAR_EMPTY;
- else
- {
- rc = bar_add_rangeset(pdev, rom, num_bars);
- if ( rc )
- goto fail;
- }
}
else if ( !is_hwdom )
{
diff --git a/xen/drivers/vpci/vpci.c b/xen/drivers/vpci/vpci.c
index 0ac9ec8b04..d069ca6d9c 100644
--- a/xen/drivers/vpci/vpci.c
+++ b/xen/drivers/vpci/vpci.c
@@ -24,6 +24,35 @@
#ifdef __XEN__
+void vpci_vcpu_destroy(struct vcpu *v)
+{
+ if ( !has_vpci(v->domain) && !is_idle_domain(v->domain) )
+ return;
+
+ for ( unsigned int i = 0; i < ARRAY_SIZE(v->vpci.mem); i++ )
+ RANGESET_DESTROY(v->vpci.mem[i]);
+}
+
+int vpci_vcpu_init(struct vcpu *v)
+{
+ unsigned int i;
+
+ if ( !has_vpci(v->domain) && !is_idle_domain(v->domain) )
+ return 0;
+
+ for ( i = 0; i < ARRAY_SIZE(v->vpci.mem); i++ )
+ {
+ char str[32];
+
+ snprintf(str, sizeof(str), "%pv:BAR%u", v, i);
+ v->vpci.mem[i] = rangeset_new(v->domain, str, RANGESETF_no_print);
+ if ( !v->vpci.mem[i] )
+ return -ENOMEM;
+ }
+
+ return 0;
+}
+
#ifdef CONFIG_HAS_VPCI_GUEST_SUPPORT
static int assign_virtual_sbdf(struct pci_dev *pdev)
{
@@ -89,8 +118,6 @@ struct vpci_register *vpci_get_register(const struct vpci *vpci,
void vpci_deassign_device(struct pci_dev *pdev)
{
- unsigned int i;
-
ASSERT(rw_is_write_locked(&pdev->domain->pci_lock));
if ( !has_vpci(pdev->domain) || !pdev->vpci )
@@ -116,9 +143,6 @@ void vpci_deassign_device(struct pci_dev *pdev)
}
spin_unlock(&pdev->vpci->lock);
- for ( i = 0; i < ARRAY_SIZE(pdev->vpci->header.bars); i++ )
- rangeset_destroy(pdev->vpci->header.bars[i].mem);
-
xfree(pdev->vpci);
pdev->vpci = NULL;
}
diff --git a/xen/include/xen/rangeset.h b/xen/include/xen/rangeset.h
index 817505badf..f01e00ec92 100644
--- a/xen/include/xen/rangeset.h
+++ b/xen/include/xen/rangeset.h
@@ -40,6 +40,13 @@ struct rangeset *rangeset_new(
void rangeset_destroy(
struct rangeset *r);
+/* Destroy a rangeset, and zero the pointer to it. */
+#define RANGESET_DESTROY(r) \
+ ({ \
+ rangeset_destroy(r); \
+ (r) = NULL; \
+ })
+
/*
* Set a limit on the number of ranges that may exist in set @r.
* NOTE: This must be called while @r is empty.
diff --git a/xen/include/xen/vpci.h b/xen/include/xen/vpci.h
index 877aa391d1..b55bacbe6e 100644
--- a/xen/include/xen/vpci.h
+++ b/xen/include/xen/vpci.h
@@ -19,6 +19,9 @@
*/
#define VPCI_MAX_VIRT_DEV (PCI_SLOT(~0) + 1)
+void vpci_vcpu_destroy(struct vcpu *v);
+int vpci_vcpu_init(struct vcpu *v);
+
/* Assign vPCI to device by adding handlers. */
int __must_check vpci_assign_device(struct pci_dev *pdev);
@@ -54,7 +57,6 @@ struct vpci {
uint64_t guest_addr;
uint64_t size;
uint64_t resizable_sizes;
- struct rangeset *mem;
enum {
VPCI_BAR_EMPTY,
VPCI_BAR_IO,
@@ -152,14 +154,15 @@ struct vpci {
#endif
};
+#ifdef __XEN__
struct vpci_vcpu {
/* Per-vcpu structure to store state while {un}mapping of PCI BARs. */
const struct pci_dev *pdev;
+ struct rangeset *mem[ARRAY_SIZE(((struct vpci_header *)NULL)->bars)];
uint16_t cmd;
bool rom_only : 1;
};
-#ifdef __XEN__
void vpci_dump_msi(void);
/* Arch-specific vPCI MSI helpers. */
@@ -204,6 +207,9 @@ bool vpci_ecam_read(pci_sbdf_t sbdf, unsigned int reg, unsigned int len,
#else /* !CONFIG_HAS_VPCI */
struct vpci_vcpu {};
+static inline void vpci_vcpu_destroy(struct vcpu *v) { }
+static inline int vpci_vcpu_init(struct vcpu *v) { return 0; }
+
static inline int vpci_reinit_ext_capabilities(struct pci_dev *pdev)
{
return 0;
--
2.51.2
* [PATCH v3 4/7] vpci: allow queueing of mapping operations
From: Mykyta Poturai @ 2026-04-09 14:01 UTC (permalink / raw)
To: xen-devel@lists.xenproject.org
Cc: Stewart Hildebrand, Andrew Cooper, Anthony PERARD, Michal Orzel,
Jan Beulich, Julien Grall, Roger Pau Monné,
Stefano Stabellini
From: Stewart Hildebrand <stewart.hildebrand@amd.com>
Introduce a vPCI BAR mapping task queue. Store the information needed to
map/unmap BARs in struct vpci_map_task. Allow queueing of BAR map/unmap
operations in a list, thus making it possible to perform multiple p2m
operations associated with a single PCI device.
This is preparatory work for further changes that need to perform
multiple unmap/map operations before returning to the guest.
At the moment, only a single operation will be queued. However, once
multiple operations can be queued, the check in modify_bars() that skips
BARs already in the requested state will no longer be accurate. Remove
this check in preparation for the upcoming changes.
Signed-off-by: Stewart Hildebrand <stewart.hildebrand@amd.com>
Message-ID: <20260406191203.97662-3-stewart.hildebrand@amd.com>
---
v2->v3:
* new patch in this series, borrowed from [1]
[1]: https://patchew.org/Xen/20260406191203.97662-1-stewart.hildebrand@amd.com/20260406191203.97662-3-stewart.hildebrand@amd.com/
---
xen/common/domain.c | 5 +-
| 227 ++++++++++++++++++++++++++-----------
xen/drivers/vpci/vpci.c | 28 +----
xen/include/xen/rangeset.h | 7 --
xen/include/xen/vpci.h | 21 ++--
5 files changed, 179 insertions(+), 109 deletions(-)
diff --git a/xen/common/domain.c b/xen/common/domain.c
index 5ef7db8f09..b1931be987 100644
--- a/xen/common/domain.c
+++ b/xen/common/domain.c
@@ -455,8 +455,6 @@ static int vcpu_teardown(struct vcpu *v)
*/
static void vcpu_destroy(struct vcpu *v)
{
- vpci_vcpu_destroy(v);
-
free_vcpu_struct(v);
}
@@ -514,8 +512,7 @@ struct vcpu *vcpu_create(struct domain *d, unsigned int vcpu_id)
if ( arch_vcpu_create(v) != 0 )
goto fail_sched;
- if ( vpci_vcpu_init(v) )
- goto fail_sched;
+ vpci_vcpu_init(v);
d->vcpu[vcpu_id] = v;
if ( vcpu_id != 0 )
diff --git a/xen/drivers/vpci/header.c b/xen/drivers/vpci/header.c
index 5bfb541b6a..451cdd3a6f 100644
--- a/xen/drivers/vpci/header.c
+++ b/xen/drivers/vpci/header.c
@@ -23,6 +23,7 @@
#include <xen/lib.h>
#include <xen/sched.h>
#include <xen/softirq.h>
+#include <xen/xvmalloc.h>
#include <xsm/xsm.h>
@@ -35,7 +36,7 @@
struct map_data {
struct domain *d;
- const struct vpci_bar *bar;
+ const struct vpci_bar_map *bar;
bool map;
};
@@ -174,32 +175,20 @@ static void modify_decoding(const struct pci_dev *pdev, uint16_t cmd,
ASSERT_UNREACHABLE();
}
-bool vpci_process_pending(struct vcpu *v)
+static int vpci_process_map_task(const struct pci_dev *pdev,
+ struct vpci_map_task *task)
{
- const struct pci_dev *pdev = v->vpci.pdev;
- struct vpci_header *header = NULL;
unsigned int i;
- if ( !pdev )
- return false;
-
- read_lock(&v->domain->pci_lock);
-
- if ( !pdev->vpci || (v->domain != pdev->domain) )
- {
- v->vpci.pdev = NULL;
- read_unlock(&v->domain->pci_lock);
- return false;
- }
+ ASSERT(rw_is_locked(&pdev->domain->pci_lock));
- header = &pdev->vpci->header;
- for ( i = 0; i < ARRAY_SIZE(header->bars); i++ )
+ for ( i = 0; i < ARRAY_SIZE(task->bars); i++ )
{
- struct vpci_bar *bar = &header->bars[i];
- struct rangeset *mem = v->vpci.mem[i];
+ struct vpci_bar_map *bar = &task->bars[i];
+ struct rangeset *mem = bar->mem;
struct map_data data = {
- .d = v->domain,
- .map = v->vpci.cmd & PCI_COMMAND_MEMORY,
+ .d = pdev->domain,
+ .map = task->cmd & PCI_COMMAND_MEMORY,
.bar = bar,
};
int rc;
@@ -210,58 +199,116 @@ bool vpci_process_pending(struct vcpu *v)
rc = rangeset_consume_ranges(mem, map_range, &data);
if ( rc == -ERESTART )
- {
- read_unlock(&v->domain->pci_lock);
- return true;
- }
+ return rc;
if ( rc )
{
spin_lock(&pdev->vpci->lock);
/* Disable memory decoding unconditionally on failure. */
- modify_decoding(pdev, v->vpci.cmd & ~PCI_COMMAND_MEMORY,
- false);
+ modify_decoding(pdev, task->cmd & ~PCI_COMMAND_MEMORY, false);
spin_unlock(&pdev->vpci->lock);
- /* Clean all the rangesets */
- for ( i = 0; i < ARRAY_SIZE(header->bars); i++ )
- if ( !rangeset_is_empty(v->vpci.mem[i]) )
- rangeset_purge(v->vpci.mem[i]);
+ if ( !is_hardware_domain(pdev->domain) )
+ domain_crash(pdev->domain);
+
+ return rc;
+ }
+ }
+
+ spin_lock(&pdev->vpci->lock);
+ modify_decoding(pdev, task->cmd, task->rom_only);
+ spin_unlock(&pdev->vpci->lock);
+
+ return 0;
+}
+
+static void destroy_map_task(struct vpci_map_task *task)
+{
+ unsigned int i;
+
+ if ( !task )
+ {
+ ASSERT_UNREACHABLE();
+ return;
+ }
+
+ for ( i = 0; i < ARRAY_SIZE(task->bars); i++ )
+ rangeset_destroy(task->bars[i].mem);
+
+ xvfree(task);
+}
+
+static void clear_map_queue(struct vcpu *v)
+{
+ struct vpci_map_task *task;
+
+ while ( (task = list_first_entry_or_null(&v->vpci.task_queue,
+ struct vpci_map_task,
+ next)) != NULL )
+ {
+ list_del(&task->next);
+ destroy_map_task(task);
+ }
+}
+
+bool vpci_process_pending(struct vcpu *v)
+{
+ const struct pci_dev *pdev = v->vpci.pdev;
+ struct vpci_map_task *task;
- v->vpci.pdev = NULL;
+ if ( !pdev )
+ return false;
+ read_lock(&v->domain->pci_lock);
+
+ if ( !pdev->vpci || (v->domain != pdev->domain) )
+ {
+ clear_map_queue(v);
+ v->vpci.pdev = NULL;
+ read_unlock(&v->domain->pci_lock);
+ return false;
+ }
+
+ while ( (task = list_first_entry_or_null(&v->vpci.task_queue,
+ struct vpci_map_task,
+ next)) != NULL )
+ {
+ int rc = vpci_process_map_task(pdev, task);
+
+ if ( rc == -ERESTART )
+ {
read_unlock(&v->domain->pci_lock);
+ return true;
+ }
- if ( !is_hardware_domain(v->domain) )
- domain_crash(v->domain);
+ list_del(&task->next);
+ destroy_map_task(task);
- return false;
+ if ( rc )
+ {
+ clear_map_queue(v);
+ break;
}
}
v->vpci.pdev = NULL;
- spin_lock(&pdev->vpci->lock);
- modify_decoding(pdev, v->vpci.cmd, v->vpci.rom_only);
- spin_unlock(&pdev->vpci->lock);
-
read_unlock(&v->domain->pci_lock);
return false;
}
static int __init apply_map(struct domain *d, const struct pci_dev *pdev,
- uint16_t cmd)
+ struct vpci_map_task *task)
{
- struct vpci_header *header = &pdev->vpci->header;
int rc = 0;
unsigned int i;
ASSERT(rw_is_write_locked(&d->pci_lock));
- for ( i = 0; i < ARRAY_SIZE(header->bars); i++ )
+ for ( i = 0; i < ARRAY_SIZE(task->bars); i++ )
{
- struct vpci_bar *bar = &header->bars[i];
- struct rangeset *mem = current->vpci.mem[i];
+ struct vpci_bar_map *bar = &task->bars[i];
+ struct rangeset *mem = bar->mem;
struct map_data data = { .d = d, .map = true, .bar = bar };
if ( rangeset_is_empty(mem) )
@@ -281,15 +328,52 @@ static int __init apply_map(struct domain *d, const struct pci_dev *pdev,
}
}
if ( !rc )
- modify_decoding(pdev, cmd, false);
+ modify_decoding(pdev, task->cmd, false);
return rc;
}
-static void defer_map(const struct pci_dev *pdev, uint16_t cmd, bool rom_only)
+static struct vpci_map_task *alloc_map_task(const struct pci_dev *pdev,
+ uint16_t cmd, bool rom_only)
+{
+ struct vpci_map_task *task;
+ unsigned int i;
+
+ task = xvzalloc(struct vpci_map_task);
+
+ if ( !task )
+ return NULL;
+
+ for ( i = 0; i < ARRAY_SIZE(task->bars); i++ )
+ {
+ if ( !MAPPABLE_BAR(&pdev->vpci->header.bars[i]) )
+ continue;
+
+ task->bars[i].mem = rangeset_new(pdev->domain, NULL,
+ RANGESETF_no_print);
+
+ if ( !task->bars[i].mem )
+ {
+ destroy_map_task(task);
+ return NULL;
+ }
+
+ task->bars[i].addr = pdev->vpci->header.bars[i].addr;
+ task->bars[i].guest_addr = pdev->vpci->header.bars[i].guest_addr;
+ }
+
+ task->cmd = cmd;
+ task->rom_only = rom_only;
+
+ return task;
+}
+
+static void defer_map(const struct pci_dev *pdev, struct vpci_map_task *task)
{
struct vcpu *curr = current;
+ ASSERT(!curr->vpci.pdev || curr->vpci.pdev == pdev);
+
/*
* FIXME: when deferring the {un}map the state of the device should not
* be trusted. For example the enable bit is toggled after the device
@@ -297,8 +381,8 @@ static void defer_map(const struct pci_dev *pdev, uint16_t cmd, bool rom_only)
* started for the same device if the domain is not well-behaved.
*/
curr->vpci.pdev = pdev;
- curr->vpci.cmd = cmd;
- curr->vpci.rom_only = rom_only;
+ list_add_tail(&task->next, &curr->vpci.task_queue);
+
/*
* Raise a scheduler softirq in order to prevent the guest from resuming
* execution with pending mapping operations, to trigger the invocation
@@ -313,11 +397,17 @@ int vpci_modify_bars(const struct pci_dev *pdev, uint16_t cmd, bool rom_only)
struct pci_dev *tmp;
const struct domain *d;
const struct vpci_msix *msix = pdev->vpci->msix;
+ struct vpci_map_task *task;
unsigned int i, j;
int rc;
ASSERT(rw_is_write_locked(&pdev->domain->pci_lock));
+ task = alloc_map_task(pdev, cmd, rom_only);
+
+ if ( !task )
+ return -ENOMEM;
+
/*
* Create a rangeset per BAR that represents the current device memory
* region and compare it against all the currently active BAR memory
@@ -333,19 +423,18 @@ int vpci_modify_bars(const struct pci_dev *pdev, uint16_t cmd, bool rom_only)
for ( i = 0; i < ARRAY_SIZE(header->bars); i++ )
{
struct vpci_bar *bar = &header->bars[i];
- struct rangeset *mem = current->vpci.mem[i];
+ struct rangeset *mem = task->bars[i].mem;
unsigned long start = PFN_DOWN(bar->addr);
unsigned long end = PFN_DOWN(bar->addr + bar->size - 1);
unsigned long start_guest = PFN_DOWN(bar->guest_addr);
unsigned long end_guest = PFN_DOWN(bar->guest_addr + bar->size - 1);
- ASSERT(mem);
+ if ( !mem )
+ continue;
if ( !MAPPABLE_BAR(bar) ||
(rom_only ? bar->type != VPCI_BAR_ROM
- : (bar->type == VPCI_BAR_ROM && !header->rom_enabled)) ||
- /* Skip BARs already in the requested state. */
- bar->enabled == !!(cmd & PCI_COMMAND_MEMORY) )
+ : (bar->type == VPCI_BAR_ROM && !header->rom_enabled)) )
continue;
if ( !pci_check_bar(pdev, _mfn(start), _mfn(end)) )
@@ -368,7 +457,8 @@ int vpci_modify_bars(const struct pci_dev *pdev, uint16_t cmd, bool rom_only)
gprintk(XENLOG_G_WARNING,
"%pp: can't map BAR%u - offset mismatch: %#lx vs %#lx\n",
&pdev->sbdf, i, bar->guest_addr, bar->addr);
- return -EINVAL;
+ rc = -EINVAL;
+ goto fail;
}
rc = rangeset_add_range(mem, start_guest, end_guest);
@@ -376,13 +466,13 @@ int vpci_modify_bars(const struct pci_dev *pdev, uint16_t cmd, bool rom_only)
{
printk(XENLOG_G_WARNING "Failed to add [%lx, %lx]: %d\n",
start_guest, end_guest, rc);
- return rc;
+ goto fail;
}
/* Check for overlap with the already setup BAR ranges. */
for ( j = 0; j < i; j++ )
{
- struct rangeset *prev_mem = current->vpci.mem[j];
+ struct rangeset *prev_mem = task->bars[j].mem;
if ( rangeset_is_empty(prev_mem) )
continue;
@@ -393,7 +483,7 @@ int vpci_modify_bars(const struct pci_dev *pdev, uint16_t cmd, bool rom_only)
gprintk(XENLOG_WARNING,
"%pp: failed to remove overlapping range [%lx, %lx]: %d\n",
&pdev->sbdf, start_guest, end_guest, rc);
- return rc;
+ goto fail;
}
}
@@ -403,7 +493,7 @@ int vpci_modify_bars(const struct pci_dev *pdev, uint16_t cmd, bool rom_only)
gprintk(XENLOG_WARNING,
"%pp: failed to sanitize BAR#%u memory: %d\n",
&pdev->sbdf, i, rc);
- return rc;
+ goto fail;
}
}
@@ -414,9 +504,9 @@ int vpci_modify_bars(const struct pci_dev *pdev, uint16_t cmd, bool rom_only)
unsigned long end = PFN_DOWN(vmsix_table_addr(pdev->vpci, i) +
vmsix_table_size(pdev->vpci, i) - 1);
- for ( j = 0; j < ARRAY_SIZE(current->vpci.mem); j++ )
+ for ( j = 0; j < ARRAY_SIZE(task->bars); j++ )
{
- struct rangeset *mem = current->vpci.mem[j];
+ struct rangeset *mem = task->bars[j].mem;
if ( rangeset_is_empty(mem) )
continue;
@@ -427,7 +517,7 @@ int vpci_modify_bars(const struct pci_dev *pdev, uint16_t cmd, bool rom_only)
gprintk(XENLOG_WARNING,
"%pp: failed to remove MSIX table [%lx, %lx]: %d\n",
&pdev->sbdf, start, end, rc);
- return rc;
+ goto fail;
}
}
}
@@ -471,7 +561,7 @@ int vpci_modify_bars(const struct pci_dev *pdev, uint16_t cmd, bool rom_only)
for ( j = 0; j < ARRAY_SIZE(header->bars); j++)
{
const struct vpci_bar *bar = &header->bars[j];
- struct rangeset *mem = current->vpci.mem[j];
+ struct rangeset *mem = task->bars[j].mem;
if ( !rangeset_overlaps_range(mem, start, end) ||
/*
@@ -490,7 +580,7 @@ int vpci_modify_bars(const struct pci_dev *pdev, uint16_t cmd, bool rom_only)
gprintk(XENLOG_WARNING,
"%pp: failed to remove [%lx, %lx]: %d\n",
&pdev->sbdf, start, end, rc);
- return rc;
+ goto fail;
}
}
}
@@ -513,12 +603,19 @@ int vpci_modify_bars(const struct pci_dev *pdev, uint16_t cmd, bool rom_only)
* will always be to establish mappings and process all the BARs.
*/
ASSERT((cmd & PCI_COMMAND_MEMORY) && !rom_only);
- return apply_map(pdev->domain, pdev, cmd);
+ rc = apply_map(pdev->domain, pdev, task);
+ destroy_map_task(task);
+ return rc;
}
- defer_map(pdev, cmd, rom_only);
+ defer_map(pdev, task);
return 0;
+
+ fail:
+ destroy_map_task(task);
+
+ return rc;
}
static void cf_check cmd_write(
diff --git a/xen/drivers/vpci/vpci.c b/xen/drivers/vpci/vpci.c
index d069ca6d9c..ce9fb5b357 100644
--- a/xen/drivers/vpci/vpci.c
+++ b/xen/drivers/vpci/vpci.c
@@ -24,33 +24,9 @@
#ifdef __XEN__
-void vpci_vcpu_destroy(struct vcpu *v)
+void vpci_vcpu_init(struct vcpu *v)
{
- if ( !has_vpci(v->domain) && !is_idle_domain(v->domain) )
- return;
-
- for ( unsigned int i = 0; i < ARRAY_SIZE(v->vpci.mem); i++ )
- RANGESET_DESTROY(v->vpci.mem[i]);
-}
-
-int vpci_vcpu_init(struct vcpu *v)
-{
- unsigned int i;
-
- if ( !has_vpci(v->domain) && !is_idle_domain(v->domain) )
- return 0;
-
- for ( i = 0; i < ARRAY_SIZE(v->vpci.mem); i++ )
- {
- char str[32];
-
- snprintf(str, sizeof(str), "%pv:BAR%u", v, i);
- v->vpci.mem[i] = rangeset_new(v->domain, str, RANGESETF_no_print);
- if ( !v->vpci.mem[i] )
- return -ENOMEM;
- }
-
- return 0;
+ INIT_LIST_HEAD(&v->vpci.task_queue);
}
#ifdef CONFIG_HAS_VPCI_GUEST_SUPPORT
diff --git a/xen/include/xen/rangeset.h b/xen/include/xen/rangeset.h
index f01e00ec92..817505badf 100644
--- a/xen/include/xen/rangeset.h
+++ b/xen/include/xen/rangeset.h
@@ -40,13 +40,6 @@ struct rangeset *rangeset_new(
void rangeset_destroy(
struct rangeset *r);
-/* Destroy a rangeset, and zero the pointer to it. */
-#define RANGESET_DESTROY(r) \
- ({ \
- rangeset_destroy(r); \
- (r) = NULL; \
- })
-
/*
* Set a limit on the number of ranges that may exist in set @r.
* NOTE: This must be called while @r is empty.
diff --git a/xen/include/xen/vpci.h b/xen/include/xen/vpci.h
index b55bacbe6e..e34f7abe6d 100644
--- a/xen/include/xen/vpci.h
+++ b/xen/include/xen/vpci.h
@@ -19,8 +19,7 @@
*/
#define VPCI_MAX_VIRT_DEV (PCI_SLOT(~0) + 1)
-void vpci_vcpu_destroy(struct vcpu *v);
-int vpci_vcpu_init(struct vcpu *v);
+void vpci_vcpu_init(struct vcpu *v);
/* Assign vPCI to device by adding handlers. */
int __must_check vpci_assign_device(struct pci_dev *pdev);
@@ -155,14 +154,23 @@ struct vpci {
};
#ifdef __XEN__
-struct vpci_vcpu {
+struct vpci_map_task {
/* Per-vcpu structure to store state while {un}mapping of PCI BARs. */
- const struct pci_dev *pdev;
- struct rangeset *mem[ARRAY_SIZE(((struct vpci_header *)NULL)->bars)];
+ struct list_head next;
+ struct vpci_bar_map {
+ uint64_t addr;
+ uint64_t guest_addr;
+ struct rangeset *mem;
+ } bars[ARRAY_SIZE(((struct vpci_header *)NULL)->bars)];
uint16_t cmd;
bool rom_only : 1;
};
+struct vpci_vcpu {
+ const struct pci_dev *pdev;
+ struct list_head task_queue;
+};
+
void vpci_dump_msi(void);
/* Arch-specific vPCI MSI helpers. */
@@ -207,8 +215,7 @@ bool vpci_ecam_read(pci_sbdf_t sbdf, unsigned int reg, unsigned int len,
#else /* !CONFIG_HAS_VPCI */
struct vpci_vcpu {};
-static inline void vpci_vcpu_destroy(struct vcpu *v) { }
-static inline int vpci_vcpu_init(struct vcpu *v) { return 0; }
+static inline void vpci_vcpu_init(struct vcpu *v) { }
static inline int vpci_reinit_ext_capabilities(struct pci_dev *pdev)
{
--
2.51.2
^ permalink raw reply related [flat|nested] 35+ messages in thread
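The defer_map() change above queues map tasks per vCPU instead of storing a single pending operation, so several vpci_modify_bars() calls can be outstanding at once. A minimal standalone sketch of that FIFO pattern, with illustrative struct and function names rather than the actual Xen types (the real code uses Xen's list_head helpers and a scheduler softirq):

```c
#include <assert.h>
#include <stddef.h>

/* Illustrative stand-ins for struct vpci_map_task and the per-vCPU
 * task queue; names here are not the actual Xen identifiers. */
struct map_task {
    unsigned int cmd;
    int rom_only;
    struct map_task *next;
};

struct task_queue {
    struct map_task *head, *tail;
};

/* Queue a task at the tail, mirroring list_add_tail() in defer_map(). */
static void queue_task(struct task_queue *q, struct map_task *t)
{
    t->next = NULL;
    if (q->tail)
        q->tail->next = t;
    else
        q->head = t;
    q->tail = t;
}

/* Drain in FIFO order, as a softirq handler would process pending maps. */
static struct map_task *dequeue_task(struct task_queue *q)
{
    struct map_task *t = q->head;

    if (t) {
        q->head = t->next;
        if (!q->head)
            q->tail = NULL;
    }
    return t;
}
```

A softirq handler would then drain the queue in order, performing each deferred mapping before the vCPU resumes guest execution.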
* [PATCH v3 5/7] vpci: add SR-IOV support for PVH Dom0
2026-04-09 14:01 [PATCH v3 0/7] Implement SR-IOV support for PVH Mykyta Poturai
` (4 preceding siblings ...)
2026-04-09 14:01 ` [PATCH v3 6/7] vpci: add SR-IOV support for DomUs Mykyta Poturai
@ 2026-04-09 14:01 ` Mykyta Poturai
2026-04-09 15:27 ` Daniel P. Smith
` (3 more replies)
2026-04-09 14:01 ` [PATCH v3 7/7] docs: Update SR-IOV support status Mykyta Poturai
6 siblings, 4 replies; 35+ messages in thread
From: Mykyta Poturai @ 2026-04-09 14:01 UTC (permalink / raw)
To: xen-devel@lists.xenproject.org
Cc: Stewart Hildebrand, Roger Pau Monné, Daniel P. Smith,
Mykyta Poturai
From: Stewart Hildebrand <stewart.hildebrand@amd.com>
This code is expected to be used only by privileged domains;
unprivileged domains should not get access to the SR-IOV capability.
Implement RW handlers for the PCI_SRIOV_CTRL register to dynamically
map/unmap VF BARs. Recalculate BAR sizes before mapping VFs to account
for possible changes in the system page size register. Also force VFs to
always use emulated reads for the command register; this is needed to
prevent some drivers from accidentally unmapping BARs. Discovery of VFs
is done by Dom0, which must register them with Xen.
Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
Signed-off-by: Stewart Hildebrand <stewart.hildebrand@amd.com>
Signed-off-by: Mykyta Poturai <mykyta_poturai@epam.com>
---
v2->v3:
* rework adding VFs with multiple mapping operations support
* use private.h
* style fixes
* fixup cleanup_sriov
* emulate command register read for VFs
v1->v2:
* switch to VF discovery by Xen
* fix sequential vpci_modify_bars calls
* move documentation changes to a separate commit
---
xen/drivers/vpci/Makefile | 1 +
xen/drivers/vpci/header.c | 6 +-
xen/drivers/vpci/private.h | 1 +
xen/drivers/vpci/sriov.c | 314 +++++++++++++++++++++++++++++++++++++
xen/include/xen/vpci.h | 7 +-
5 files changed, 327 insertions(+), 2 deletions(-)
create mode 100644 xen/drivers/vpci/sriov.c
diff --git a/xen/drivers/vpci/Makefile b/xen/drivers/vpci/Makefile
index 9793a4f9b0..8b0e3c03a7 100644
--- a/xen/drivers/vpci/Makefile
+++ b/xen/drivers/vpci/Makefile
@@ -1,3 +1,4 @@
obj-y += cap.o
obj-y += vpci.o header.o rebar.o
+obj-y += sriov.o
obj-$(CONFIG_HAS_PCI_MSI) += msi.o msix.o
diff --git a/xen/drivers/vpci/header.c b/xen/drivers/vpci/header.c
index 451cdd3a6f..a36285672e 100644
--- a/xen/drivers/vpci/header.c
+++ b/xen/drivers/vpci/header.c
@@ -869,7 +869,8 @@ int vpci_init_header(struct pci_dev *pdev)
* PCI_COMMAND_PARITY, PCI_COMMAND_SERR, and PCI_COMMAND_FAST_BACK.
*/
rc = vpci_add_register_mask(pdev->vpci,
- is_hwdom ? vpci_hw_read16 : guest_cmd_read,
+ is_hwdom && !pdev->info.is_virtfn
+ ? vpci_hw_read16 : guest_cmd_read,
cmd_write, PCI_COMMAND, 2, header, 0, 0,
is_hwdom ? 0
: PCI_COMMAND_RSVDP_MASK |
@@ -900,6 +901,9 @@ int vpci_init_header(struct pci_dev *pdev)
header->guest_cmd = cmd;
+ if ( pdev->info.is_virtfn )
+ return vpci_vf_init_header(pdev);
+
/* Disable memory decoding before sizing. */
if ( !is_hwdom || (cmd & PCI_COMMAND_MEMORY) )
pci_conf_write16(pdev->sbdf, PCI_COMMAND, cmd & ~PCI_COMMAND_MEMORY);
diff --git a/xen/drivers/vpci/private.h b/xen/drivers/vpci/private.h
index f012fd160d..8e0043ddbe 100644
--- a/xen/drivers/vpci/private.h
+++ b/xen/drivers/vpci/private.h
@@ -45,6 +45,7 @@ typedef struct {
REGISTER_VPCI_CAPABILITY(PCI_EXT_CAP_ID_##name, name, finit, fclean, true)
int __must_check vpci_init_header(struct pci_dev *pdev);
+int __must_check vpci_vf_init_header(struct pci_dev *pdev);
int vpci_init_capabilities(struct pci_dev *pdev, bool ext_only);
void vpci_cleanup_capabilities(struct pci_dev *pdev, bool ext_only);
diff --git a/xen/drivers/vpci/sriov.c b/xen/drivers/vpci/sriov.c
new file mode 100644
index 0000000000..ec6a7b84d5
--- /dev/null
+++ b/xen/drivers/vpci/sriov.c
@@ -0,0 +1,314 @@
+/* SPDX-License-Identifier: GPL-2.0-only */
+/*
+ * Handlers for accesses to the SR-IOV capability structure.
+ *
+ * Copyright (C) 2026 Citrix Systems R&D
+ */
+
+#include <xen/sched.h>
+#include <xen/vpci.h>
+#include <xsm/xsm.h>
+#include "private.h"
+
+static int vf_init_bars(const struct pci_dev *vf_pdev)
+{
+ int vf_idx;
+ unsigned int i;
+ const struct pci_dev *pf_pdev = vf_pdev->pf_pdev;
+ struct vpci_bar *bars = vf_pdev->vpci->header.bars;
+ struct vpci_bar *physfn_vf_bars = pf_pdev->vpci->sriov->vf_bars;
+ unsigned int sriov_pos = pci_find_ext_capability(pf_pdev,
+ PCI_EXT_CAP_ID_SRIOV);
+ uint16_t offset = pci_conf_read16(pf_pdev->sbdf,
+ sriov_pos + PCI_SRIOV_VF_OFFSET);
+ uint16_t stride = pci_conf_read16(pf_pdev->sbdf,
+ sriov_pos + PCI_SRIOV_VF_STRIDE);
+
+ vf_idx = vf_pdev->sbdf.sbdf - (pf_pdev->sbdf.sbdf + offset);
+ if ( vf_idx < 0 )
+ return -EINVAL;
+
+ if ( stride )
+ {
+ if ( vf_idx % stride )
+ return -EINVAL;
+ vf_idx /= stride;
+ }
+
+ /*
+ * Set up BARs for this VF out of PF's VF BARs taking into account
+ * the index of the VF.
+ */
+ for ( i = 0; i < PCI_SRIOV_NUM_BARS; i++ )
+ {
+ bars[i].addr = physfn_vf_bars[i].addr + vf_idx * physfn_vf_bars[i].size;
+ bars[i].guest_addr = bars[i].addr;
+ bars[i].size = physfn_vf_bars[i].size;
+ bars[i].type = physfn_vf_bars[i].type;
+ bars[i].prefetchable = physfn_vf_bars[i].prefetchable;
+ }
+
+ return 0;
+}
+
+static int map_vfs(const struct pci_dev *pf_pdev, uint16_t cmd)
+{
+ struct pci_dev *vf_pdev;
+ int rc;
+
+ ASSERT(rw_is_write_locked(&pf_pdev->domain->pci_lock));
+
+ list_for_each_entry(vf_pdev, &pf_pdev->vf_list, vf_list)
+ {
+ rc = vpci_modify_bars(vf_pdev, cmd, false);
+ if ( rc )
+ {
+ gprintk(XENLOG_ERR, "failed to %s VF %pp: %d\n",
+ (cmd & PCI_COMMAND_MEMORY) ? "map" : "unmap",
+ &vf_pdev->sbdf, rc);
+ return rc;
+ }
+
+ vf_pdev->vpci->header.guest_cmd &= ~PCI_COMMAND_MEMORY;
+ vf_pdev->vpci->header.guest_cmd |= (cmd & PCI_COMMAND_MEMORY);
+ }
+
+ return 0;
+}
+
+static int size_vf_bars(const struct pci_dev *pf_pdev, unsigned int sriov_pos,
+ uint64_t *vf_rlen)
+{
+ struct vpci_bar *bars;
+ unsigned int i;
+ int rc = 0;
+
+ ASSERT(rw_is_write_locked(&pf_pdev->domain->pci_lock));
+ ASSERT(!pf_pdev->info.is_virtfn);
+ ASSERT(pf_pdev->vpci->sriov);
+
+ /* Read BARs for VFs out of PF's SR-IOV extended capability. */
+ bars = pf_pdev->vpci->sriov->vf_bars;
+ /* Set the BARs addresses and size. */
+ for ( i = 0; i < PCI_SRIOV_NUM_BARS; i += rc )
+ {
+ unsigned int idx = sriov_pos + PCI_SRIOV_BAR + i * 4;
+ uint32_t bar;
+ uint64_t addr, size;
+
+ bar = pci_conf_read32(pf_pdev->sbdf, idx);
+
+ rc = pci_size_mem_bar(pf_pdev->sbdf, idx, &addr, &size,
+ PCI_BAR_VF |
+ ((i == PCI_SRIOV_NUM_BARS - 1) ? PCI_BAR_LAST
+ : 0));
+
+ /*
+ * Update vf_rlen on the PF. According to the spec the size of
+ * the BARs can change if the system page size register is
+ * modified, so always update rlen when enabling VFs.
+ */
+ vf_rlen[i] = size;
+
+ if ( !size )
+ {
+ bars[i].type = VPCI_BAR_EMPTY;
+ continue;
+ }
+
+ bars[i].addr = addr;
+ bars[i].guest_addr = addr;
+ bars[i].size = size;
+ bars[i].prefetchable = bar & PCI_BASE_ADDRESS_MEM_PREFETCH;
+
+ switch ( rc )
+ {
+ case 1:
+ bars[i].type = VPCI_BAR_MEM32;
+ break;
+
+ case 2:
+ bars[i].type = VPCI_BAR_MEM64_LO;
+ bars[i + 1].type = VPCI_BAR_MEM64_HI;
+ break;
+
+ default:
+ ASSERT_UNREACHABLE();
+ }
+ }
+
+ rc = rc > 0 ? 0 : rc;
+
+ return rc;
+}
+
+static void cf_check control_write(const struct pci_dev *pdev, unsigned int reg,
+ uint32_t val, void *data)
+{
+ unsigned int sriov_pos = reg - PCI_SRIOV_CTRL;
+ struct vpci_sriov *sriov = pdev->vpci->sriov;
+ uint16_t control = pci_conf_read16(pdev->sbdf, reg);
+ bool mem_enabled = control & PCI_SRIOV_CTRL_MSE;
+ bool new_mem_enabled = val & PCI_SRIOV_CTRL_MSE;
+ bool enabled = control & PCI_SRIOV_CTRL_VFE;
+ bool new_enabled = val & PCI_SRIOV_CTRL_VFE;
+ int rc;
+
+ ASSERT(!pdev->info.is_virtfn);
+
+ if ( new_enabled == enabled && new_mem_enabled == mem_enabled )
+ {
+ pci_conf_write16(pdev->sbdf, reg, val);
+ return;
+ }
+
+ if ( mem_enabled && !new_mem_enabled )
+ map_vfs(pdev, 0);
+
+ if ( !enabled && new_enabled )
+ {
+ size_vf_bars(pdev, sriov_pos, (uint64_t *)data);
+
+ /*
+ * Only update the number of active VFs when enabling, when
+ * disabling use the cached value in order to always remove the same
+ * number of VFs that were active.
+ */
+ sriov->num_vfs = pci_conf_read16(pdev->sbdf,
+ sriov_pos + PCI_SRIOV_NUM_VF);
+ }
+
+ if ( !mem_enabled && new_mem_enabled )
+ {
+ rc = map_vfs(pdev, PCI_COMMAND_MEMORY);
+
+ if ( rc )
+ map_vfs(pdev, 0);
+ }
+
+ pci_conf_write16(pdev->sbdf, reg, val);
+}
+
+int vpci_vf_init_header(struct pci_dev *vf_pdev)
+{
+ const struct pci_dev *pf_pdev;
+ unsigned int sriov_pos;
+ int rc = 0;
+ uint16_t ctrl;
+
+ ASSERT(rw_is_write_locked(&vf_pdev->domain->pci_lock));
+
+ if ( !vf_pdev->info.is_virtfn )
+ return 0;
+
+ pf_pdev = vf_pdev->pf_pdev;
+ ASSERT(pf_pdev);
+
+ rc = vf_init_bars(vf_pdev);
+ if ( rc )
+ return rc;
+
+ sriov_pos = pci_find_ext_capability(pf_pdev, PCI_EXT_CAP_ID_SRIOV);
+ ctrl = pci_conf_read16(pf_pdev->sbdf, sriov_pos + PCI_SRIOV_CTRL);
+
+ if ( (pf_pdev->domain == vf_pdev->domain) && (ctrl & PCI_SRIOV_CTRL_MSE) )
+ {
+ rc = vpci_modify_bars(vf_pdev, PCI_COMMAND_MEMORY, false);
+ if ( rc )
+ return rc;
+
+ vf_pdev->vpci->header.guest_cmd |= PCI_COMMAND_MEMORY;
+ }
+
+ return rc;
+}
+
+static int cf_check init_sriov(struct pci_dev *pdev)
+{
+ unsigned int pos;
+
+ ASSERT(!pdev->info.is_virtfn);
+
+ pos = pci_find_ext_capability(pdev, PCI_EXT_CAP_ID_SRIOV);
+
+ if ( !pos )
+ return 0;
+
+ if ( xsm_resource_setup_pci(XSM_PRIV, pdev->sbdf.bdf) )
+ {
+ printk(XENLOG_ERR
+ "%pp: SR-IOV configuration unsupported for unpriv %pd\n",
+ &pdev->sbdf, pdev->domain);
+ return -EACCES;
+ }
+
+ pdev->vpci->sriov = xzalloc(struct vpci_sriov);
+ if ( !pdev->vpci->sriov )
+ return -ENOMEM;
+
+ /*
+ * We need to modify vf_rlen in control_write but we can't do it cleanly
+ * from pdev because write callback only accepts const pdev. Moving vf_rlen
+ * inside of struct vpci_sriov is also not possible because it is used
+ * before vpci init. So pass it here as additional data to not require
+ * dropping const in control_write.
+ */
+ return vpci_add_register(pdev->vpci, vpci_hw_read16, control_write,
+ pos + PCI_SRIOV_CTRL, 2, &pdev->physfn.vf_rlen);
+}
+
+static int cf_check cleanup_sriov(const struct pci_dev *pdev, bool hide)
+{
+ unsigned int pos;
+ int rc;
+
+ if ( !pdev->vpci->sriov )
+ return 0;
+
+ ASSERT(!pdev->info.is_virtfn);
+ ASSERT(list_empty(&pdev->vf_list));
+
+ if ( !hide )
+ {
+ XFREE(pdev->vpci->sriov);
+ return 0;
+ }
+
+ pos = pci_find_ext_capability(pdev, PCI_EXT_CAP_ID_SRIOV);
+ rc = vpci_remove_registers(pdev->vpci, pos + PCI_SRIOV_CTRL, 2);
+ if ( rc )
+ {
+ printk(XENLOG_ERR "%pd %pp: fail to remove SRIOV handlers rc=%d\n",
+ pdev->domain, &pdev->sbdf, rc);
+ ASSERT_UNREACHABLE();
+ return rc;
+ }
+ XFREE(pdev->vpci->sriov);
+
+ /*
+ * Unprivileged domains have a deny by default register access policy, no
+ * need to add any further handlers for them.
+ */
+ if ( !is_hardware_domain(pdev->domain) )
+ return 0;
+
+ rc = vpci_add_register(pdev->vpci, vpci_hw_read16, NULL,
+ pos + PCI_SRIOV_CTRL, 2, NULL);
+ if ( rc )
+ printk(XENLOG_ERR "%pd %pp: fail to add SRIOV ctrl handler rc=%d\n",
+ pdev->domain, &pdev->sbdf, rc);
+
+ return rc;
+}
+
+REGISTER_VPCI_EXTCAP(SRIOV, init_sriov, cleanup_sriov);
+
+/*
+ * Local variables:
+ * mode: C
+ * c-file-style: "BSD"
+ * c-basic-offset: 4
+ * tab-width: 4
+ * indent-tabs-mode: nil
+ * End:
+ */
diff --git a/xen/include/xen/vpci.h b/xen/include/xen/vpci.h
index e34f7abe6d..fc8a943abe 100644
--- a/xen/include/xen/vpci.h
+++ b/xen/include/xen/vpci.h
@@ -84,7 +84,6 @@ struct vpci {
* upon to know whether BARs are mapped into the guest p2m.
*/
bool bars_mapped : 1;
- /* FIXME: currently there's no support for SR-IOV. */
} header;
/* MSI data. */
@@ -145,6 +144,12 @@ struct vpci {
unsigned int nbars:3;
} rebar;
+ struct vpci_sriov {
+ /* PF only */
+ struct vpci_bar vf_bars[PCI_SRIOV_NUM_BARS];
+ uint16_t num_vfs;
+ } *sriov;
+
#ifdef CONFIG_HAS_VPCI_GUEST_SUPPORT
/* Guest SBDF of the device. */
#define INVALID_GUEST_SBDF ((pci_sbdf_t){ .sbdf = ~0U })
--
2.51.2
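The vf_init_bars() logic in this patch derives a VF's index from the PF's First VF Offset and VF Stride registers and carves each VF BAR out of the PF's VF BAR window. A standalone sketch of that arithmetic, with illustrative names and simplified 16-bit routing IDs (the real code operates on pci_sbdf_t values and reads the registers from the SR-IOV extended capability):

```c
#include <assert.h>
#include <stdint.h>

/* Compute the VF index from routing IDs. The SR-IOV spec defines VF n's
 * RID as PF RID + First VF Offset + (n - 1) * VF Stride; here indices
 * are zero-based. Returns -1 when the RID cannot belong to this PF. */
static int vf_index(uint16_t pf_rid, uint16_t vf_rid,
                    uint16_t offset, uint16_t stride)
{
    int idx = vf_rid - (pf_rid + offset);

    if (idx < 0)
        return -1;
    if (stride) {
        if (idx % stride)
            return -1;
        idx /= stride;
    }
    return idx;
}

/* Each VF's BAR is a fixed-size slice of the PF's VF BAR window. */
static uint64_t vf_bar_addr(uint64_t pf_vf_bar_base, uint64_t vf_bar_size,
                            unsigned int vf_idx)
{
    return pf_vf_bar_base + (uint64_t)vf_idx * vf_bar_size;
}
```

For example, with a PF at RID 0x100, offset 4 and stride 2, RID 0x108 is VF index 2, and its slice of a 16 KiB-per-VF BAR window starting at 0xf0000000 begins at 0xf0008000.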
* [PATCH v3 6/7] vpci: add SR-IOV support for DomUs
2026-04-09 14:01 [PATCH v3 0/7] Implement SR-IOV support for PVH Mykyta Poturai
` (3 preceding siblings ...)
2026-04-09 14:01 ` [PATCH v3 3/7] vpci: Use pervcpu ranges for BAR mapping Mykyta Poturai
@ 2026-04-09 14:01 ` Mykyta Poturai
2026-04-21 14:55 ` Jan Beulich
2026-04-09 14:01 ` [PATCH v3 5/7] vpci: add SR-IOV support for PVH Dom0 Mykyta Poturai
2026-04-09 14:01 ` [PATCH v3 7/7] docs: Update SR-IOV support status Mykyta Poturai
6 siblings, 1 reply; 35+ messages in thread
From: Mykyta Poturai @ 2026-04-09 14:01 UTC (permalink / raw)
To: xen-devel@lists.xenproject.org
Cc: Stewart Hildebrand, Roger Pau Monné, Mykyta Poturai
From: Stewart Hildebrand <stewart.hildebrand@amd.com>
Emulate guest BAR register values for VFs based on the PF's VF BAR
values. This creates a guest view of the normal BAR registers, emulating
the sizes and properties the guest expects to see during PCI device
enumeration.
Expose VID/DID and class/revision to the guest.
Note that a VF's ROM BAR is read-only and reads as all zeros; a VF could
provide access to the PF's ROM via emulation, but that is not
implemented here.
Signed-off-by: Stewart Hildebrand <stewart.hildebrand@amd.com>
Signed-off-by: Mykyta Poturai <mykyta_poturai@epam.com>
---
v2->v3:
* don't emulate VID/DID registers
* replace ifdef with IS_ENABLED
v1->v2:
* remove VF register handlers covered by init_header
* set guest addr unconditionally
---
xen/drivers/vpci/sriov.c | 52 ++++++++++++++++++++++++++++++++++++++++
1 file changed, 52 insertions(+)
diff --git a/xen/drivers/vpci/sriov.c b/xen/drivers/vpci/sriov.c
index ec6a7b84d5..b43a2b26d3 100644
--- a/xen/drivers/vpci/sriov.c
+++ b/xen/drivers/vpci/sriov.c
@@ -211,6 +211,58 @@ int vpci_vf_init_header(struct pci_dev *vf_pdev)
sriov_pos = pci_find_ext_capability(pf_pdev, PCI_EXT_CAP_ID_SRIOV);
ctrl = pci_conf_read16(pf_pdev->sbdf, sriov_pos + PCI_SRIOV_CTRL);
+ if ( IS_ENABLED(CONFIG_HAS_VPCI_GUEST_SUPPORT) &&
+ pf_pdev->domain != vf_pdev->domain )
+ {
+ struct vpci_bar *bars = vf_pdev->vpci->header.bars;
+
+ rc = vpci_add_register(vf_pdev->vpci, vpci_hw_read16, NULL,
+ PCI_VENDOR_ID, 2, NULL);
+ if ( rc )
+ return rc;
+
+ rc = vpci_add_register(vf_pdev->vpci, vpci_hw_read16, NULL,
+ PCI_DEVICE_ID, 2, NULL);
+ if ( rc )
+ return rc;
+
+ /* Hardcode multi-function device bit to 0 */
+ rc = vpci_add_register(vf_pdev->vpci, vpci_read_val, NULL,
+ PCI_HEADER_TYPE, 1,
+ (void *)PCI_HEADER_TYPE_NORMAL);
+ if ( rc )
+ return rc;
+
+ rc = vpci_add_register(vf_pdev->vpci, vpci_hw_read32, NULL,
+ PCI_CLASS_REVISION, 4, NULL);
+ if ( rc )
+ return rc;
+
+ for ( unsigned int i = 0; i < PCI_SRIOV_NUM_BARS; i++ )
+ {
+ switch ( pf_pdev->vpci->sriov->vf_bars[i].type )
+ {
+ case VPCI_BAR_MEM32:
+ case VPCI_BAR_MEM64_LO:
+ case VPCI_BAR_MEM64_HI:
+ rc = vpci_add_register(vf_pdev->vpci, vpci_guest_mem_bar_read,
+ vpci_guest_mem_bar_write,
+ PCI_BASE_ADDRESS_0 + i * 4, 4, &bars[i]);
+ if ( rc )
+ return rc;
+ break;
+
+ default:
+ rc = vpci_add_register(vf_pdev->vpci, vpci_read_val, NULL,
+ PCI_BASE_ADDRESS_0 + i * 4, 4,
+ (void *)0);
+ if ( rc )
+ return rc;
+ break;
+ }
+ }
+ }
+
if ( (pf_pdev->domain == vf_pdev->domain) && (ctrl & PCI_SRIOV_CTRL_MSE) )
{
rc = vpci_modify_bars(vf_pdev, PCI_COMMAND_MEMORY, false);
--
2.51.2
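The guest BAR handlers registered in this patch emulate ordinary BAR behaviour for VFs, whose physical BAR registers read as zero: the guest sizes a BAR by writing all-ones and reading back a mask that exposes the size. A simplified 32-bit sketch of that emulation, with illustrative names and without the type/prefetch low bits that the real vpci_guest_mem_bar_{read,write} handlers also preserve:

```c
#include <assert.h>
#include <stdint.h>

/* Emulated guest view of one 32-bit memory BAR. Size must be a power
 * of two, as PCI BAR sizes always are. */
struct guest_bar {
    uint32_t guest_addr;
    uint32_t size;
};

/* On write, only address bits at or above the size granularity are
 * kept, so an all-ones write stores the size mask - exactly what a
 * guest's BAR sizing probe expects to read back. */
static void bar_write(struct guest_bar *bar, uint32_t val)
{
    bar->guest_addr = val & ~(bar->size - 1);
}

static uint32_t bar_read(const struct guest_bar *bar)
{
    return bar->guest_addr;
}
```

With a 16 KiB BAR, writing 0xffffffff and reading back yields 0xffffc000, from which the guest computes the size as ~0xffffc000 + 1 = 0x4000.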
* [PATCH v3 7/7] docs: Update SR-IOV support status
2026-04-09 14:01 [PATCH v3 0/7] Implement SR-IOV support for PVH Mykyta Poturai
` (5 preceding siblings ...)
2026-04-09 14:01 ` [PATCH v3 5/7] vpci: add SR-IOV support for PVH Dom0 Mykyta Poturai
@ 2026-04-09 14:01 ` Mykyta Poturai
2026-04-21 14:56 ` Jan Beulich
6 siblings, 1 reply; 35+ messages in thread
From: Mykyta Poturai @ 2026-04-09 14:01 UTC (permalink / raw)
To: xen-devel@lists.xenproject.org
Cc: Mykyta Poturai, Andrew Cooper, Anthony PERARD, Michal Orzel,
Jan Beulich, Julien Grall, Roger Pau Monné,
Stefano Stabellini
Signed-off-by: Mykyta Poturai <mykyta_poturai@epam.com>
---
v1->v2:
* new patch
---
SUPPORT.md | 2 --
1 file changed, 2 deletions(-)
diff --git a/SUPPORT.md b/SUPPORT.md
index 8e7ab7cb3e..f4f1458bbb 100644
--- a/SUPPORT.md
+++ b/SUPPORT.md
@@ -170,8 +170,6 @@ unexpected behavior or issues on some hardware.
At least the following features are missing on a PVH dom0:
- * PCI SR-IOV.
-
* Native NMI forwarding (nmi=dom0 command line option).
* MCE handling.
--
2.51.2
* Re: [PATCH v3 5/7] vpci: add SR-IOV support for PVH Dom0
2026-04-09 14:01 ` [PATCH v3 5/7] vpci: add SR-IOV support for PVH Dom0 Mykyta Poturai
@ 2026-04-09 15:27 ` Daniel P. Smith
2026-04-21 14:43 ` Jan Beulich
` (2 subsequent siblings)
3 siblings, 0 replies; 35+ messages in thread
From: Daniel P. Smith @ 2026-04-09 15:27 UTC (permalink / raw)
To: Mykyta Poturai, xen-devel@lists.xenproject.org
Cc: Stewart Hildebrand, Roger Pau Monné, Jason Andryuk
On 4/9/26 10:01, Mykyta Poturai wrote:
> From: Stewart Hildebrand <stewart.hildebrand@amd.com>
>
> This code is expected to be used only by privileged domains;
> unprivileged domains should not get access to the SR-IOV capability.
>
> Implement RW handlers for the PCI_SRIOV_CTRL register to dynamically
> map/unmap VF BARs. Recalculate BAR sizes before mapping VFs to account
> for possible changes in the system page size register. Also force VFs to
> always use emulated reads for the command register; this is needed to
> prevent some drivers from accidentally unmapping BARs. Discovery of VFs
> is done by Dom0, which must register them with Xen.
>
> Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
> Signed-off-by: Stewart Hildebrand <stewart.hildebrand@amd.com>
> Signed-off-by: Mykyta Poturai <mykyta_poturai@epam.com>
> ---
> v2->v3:
> * rework adding VFs with multiple mapping operations support
> * use private.h
> * style fixes
> * fixup cleanup_sriov
> * emulate command register read for VFs
>
> v1->v2:
> * switch to VF discovery by Xen
> * fix sequential vpci_modify_bars calls
> * move documentation changes to a separate commit
> ---
I have only given this patch a cursory read, but it got me thinking, and
I'm still in the middle of working out my questions. I figured I would
pose it to the community for others' opinions as well. From what I have
looked at so far, this is just wiring up SR-IOV under the existing PCI
access controls. It's possible to just provide a specific label to the
device to differentiate it in policy, but I am wondering whether there
should be a further degree of fine-grained access control over SR-IOV
devices?
V/r,
Daniel P. Smith
* Re: [PATCH v3 5/7] vpci: add SR-IOV support for PVH Dom0
2026-04-09 14:01 ` [PATCH v3 5/7] vpci: add SR-IOV support for PVH Dom0 Mykyta Poturai
2026-04-09 15:27 ` Daniel P. Smith
@ 2026-04-21 14:43 ` Jan Beulich
2026-04-23 10:12 ` Mykyta Poturai
2026-04-22 14:19 ` Roger Pau Monné
2026-04-28 20:05 ` Stewart Hildebrand
3 siblings, 1 reply; 35+ messages in thread
From: Jan Beulich @ 2026-04-21 14:43 UTC (permalink / raw)
To: Mykyta Poturai, Stewart Hildebrand
Cc: Roger Pau Monné, Daniel P. Smith,
xen-devel@lists.xenproject.org
On 09.04.2026 16:01, Mykyta Poturai wrote:
> From: Stewart Hildebrand <stewart.hildebrand@amd.com>
>
> This code is expected to be used only by privileged domains;
> unprivileged domains should not get access to the SR-IOV capability.
>
> Implement RW handlers for the PCI_SRIOV_CTRL register to dynamically
> map/unmap VF BARs. Recalculate BAR sizes before mapping VFs to account
> for possible changes in the system page size register. Also force VFs to
> always use emulated reads for the command register; this is needed to
> prevent some drivers from accidentally unmapping BARs.
This apparently refers to the change to vpci_init_header(). Writes are
already intercepted. How would a read lead to accidental BAR unmap? Even
for writes I don't see how a VF driver could accidentally unmap BARs, as
the memory decode bit there is hardwired to 0.
> Discovery of VFs is
> done by Dom0, which must register them with Xen.
If we intercept control register writes, why would we still require
Dom0 to report the VFs that appear?
> Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
> Signed-off-by: Stewart Hildebrand <stewart.hildebrand@amd.com>
> Signed-off-by: Mykyta Poturai <mykyta_poturai@epam.com>
First S-o-b doesn't match From: - is that correct?
> --- a/xen/drivers/vpci/Makefile
> +++ b/xen/drivers/vpci/Makefile
> @@ -1,3 +1,4 @@
> obj-y += cap.o
> obj-y += vpci.o header.o rebar.o
> +obj-y += sriov.o
> obj-$(CONFIG_HAS_PCI_MSI) += msi.o msix.o
I understand the one line with multiple objects is unhelpful here, but
for new additions (like was the case with cap.o) I think we want to
respect alphabetical ordering, excluding the one odd line.
> --- /dev/null
> +++ b/xen/drivers/vpci/sriov.c
> @@ -0,0 +1,314 @@
> +/* SPDX-License-Identifier: GPL-2.0-only */
> +/*
> + * Handlers for accesses to the SR-IOV capability structure.
> + *
> + * Copyright (C) 2026 Citrix Systems R&D
> + */
> +
> +#include <xen/sched.h>
> +#include <xen/vpci.h>
> +#include <xsm/xsm.h>
> +#include "private.h"
> +
> +static int vf_init_bars(const struct pci_dev *vf_pdev)
> +{
> + int vf_idx;
> + unsigned int i;
> + const struct pci_dev *pf_pdev = vf_pdev->pf_pdev;
> + struct vpci_bar *bars = vf_pdev->vpci->header.bars;
> + struct vpci_bar *physfn_vf_bars = pf_pdev->vpci->sriov->vf_bars;
> + unsigned int sriov_pos = pci_find_ext_capability(pf_pdev,
> + PCI_EXT_CAP_ID_SRIOV);
> + uint16_t offset = pci_conf_read16(pf_pdev->sbdf,
> + sriov_pos + PCI_SRIOV_VF_OFFSET);
> + uint16_t stride = pci_conf_read16(pf_pdev->sbdf,
> + sriov_pos + PCI_SRIOV_VF_STRIDE);
> +
> + vf_idx = vf_pdev->sbdf.sbdf - (pf_pdev->sbdf.sbdf + offset);
> + if ( vf_idx < 0 )
> + return -EINVAL;
Why is underflow checked for, but too large a value isn't?
> + if ( stride )
> + {
> + if ( vf_idx % stride )
> + return -EINVAL;
> + vf_idx /= stride;
> + }
What if stride is 0 and there are multiple VFs?
> + /*
> + * Set up BARs for this VF out of PF's VF BARs taking into account
> + * the index of the VF.
> + */
> + for ( i = 0; i < PCI_SRIOV_NUM_BARS; i++ )
> + {
> + bars[i].addr = physfn_vf_bars[i].addr + vf_idx * physfn_vf_bars[i].size;
> + bars[i].guest_addr = bars[i].addr;
> + bars[i].size = physfn_vf_bars[i].size;
> + bars[i].type = physfn_vf_bars[i].type;
> + bars[i].prefetchable = physfn_vf_bars[i].prefetchable;
This may end up quite a bit easier to read if &bar[i] and &physfn_vf_bars[i]
were latched into local variables. It might further help if the = were all
aligned with one another.
> +static int map_vfs(const struct pci_dev *pf_pdev, uint16_t cmd)
> +{
> + struct pci_dev *vf_pdev;
> + int rc;
> +
> + ASSERT(rw_is_write_locked(&pf_pdev->domain->pci_lock));
> +
> + list_for_each_entry(vf_pdev, &pf_pdev->vf_list, vf_list)
> + {
> + rc = vpci_modify_bars(vf_pdev, cmd, false);
> + if ( rc )
> + {
> + gprintk(XENLOG_ERR, "failed to %s VF %pp: %d\n",
> + (cmd & PCI_COMMAND_MEMORY) ? "map" : "unmap",
> + &vf_pdev->sbdf, rc);
> + return rc;
> + }
> +
> + vf_pdev->vpci->header.guest_cmd &= ~PCI_COMMAND_MEMORY;
> + vf_pdev->vpci->header.guest_cmd |= (cmd & PCI_COMMAND_MEMORY);
As mentioned elsewhere as well, this bit is supposed to be 0 for VFs.
> +static void cf_check control_write(const struct pci_dev *pdev, unsigned int reg,
> + uint32_t val, void *data)
> +{
> + unsigned int sriov_pos = reg - PCI_SRIOV_CTRL;
> + struct vpci_sriov *sriov = pdev->vpci->sriov;
> + uint16_t control = pci_conf_read16(pdev->sbdf, reg);
> + bool mem_enabled = control & PCI_SRIOV_CTRL_MSE;
> + bool new_mem_enabled = val & PCI_SRIOV_CTRL_MSE;
> + bool enabled = control & PCI_SRIOV_CTRL_VFE;
> + bool new_enabled = val & PCI_SRIOV_CTRL_VFE;
> + int rc;
> +
> + ASSERT(!pdev->info.is_virtfn);
> +
> + if ( new_enabled == enabled && new_mem_enabled == mem_enabled )
> + {
> + pci_conf_write16(pdev->sbdf, reg, val);
> + return;
> + }
> +
> + if ( mem_enabled && !new_mem_enabled )
> + map_vfs(pdev, 0);
Why are {new_,}enabled not also taken into account here?
> + if ( !enabled && new_enabled )
> + {
> + size_vf_bars(pdev, sriov_pos, (uint64_t *)data);
(The cast is generally unnecessary here. If you want to keep it, I
think a comment is needed.)
> + /*
> + * Only update the number of active VFs when enabling, when
> + * disabling use the cached value in order to always remove the same
> + * number of VFs that were active.
> + */
> + sriov->num_vfs = pci_conf_read16(pdev->sbdf,
> + sriov_pos + PCI_SRIOV_NUM_VF);
> + }
> +
> + if ( !mem_enabled && new_mem_enabled )
> + {
> + rc = map_vfs(pdev, PCI_COMMAND_MEMORY);
Same here.
> + if ( rc )
> + map_vfs(pdev, 0);
> + }
> +
> + pci_conf_write16(pdev->sbdf, reg, val);
> +}
> +
> +int vpci_vf_init_header(struct pci_dev *vf_pdev)
I think this would better move (up or down), to not sit in the middle of
PF functions.
> +{
> + const struct pci_dev *pf_pdev;
> + unsigned int sriov_pos;
> + int rc = 0;
> + uint16_t ctrl;
> +
> + ASSERT(rw_is_write_locked(&vf_pdev->domain->pci_lock));
> +
> + if ( !vf_pdev->info.is_virtfn )
> + return 0;
Looking at the sole call site, this apparently wants to be an assertion.
> + pf_pdev = vf_pdev->pf_pdev;
> + ASSERT(pf_pdev);
The caller doesn't guarantee this, does it?
> + rc = vf_init_bars(vf_pdev);
> + if ( rc )
> + return rc;
> +
> + sriov_pos = pci_find_ext_capability(pf_pdev, PCI_EXT_CAP_ID_SRIOV);
> + ctrl = pci_conf_read16(pf_pdev->sbdf, sriov_pos + PCI_SRIOV_CTRL);
> +
> + if ( (pf_pdev->domain == vf_pdev->domain) && (ctrl & PCI_SRIOV_CTRL_MSE) )
> + {
> + rc = vpci_modify_bars(vf_pdev, PCI_COMMAND_MEMORY, false);
Doesn't VF enable also need to be set for the BARs to be mapped?
> + if ( rc )
> + return rc;
> +
> + vf_pdev->vpci->header.guest_cmd |= PCI_COMMAND_MEMORY;
But the bit is required to be zero in the VF command register.
> + }
> +
> + return rc;
> +}
> +
> +static int cf_check init_sriov(struct pci_dev *pdev)
> +{
> + unsigned int pos;
> +
> + ASSERT(!pdev->info.is_virtfn);
Where's the check that prevents this from triggering? Aiui if
vpci_init_capabilities() finds an SR-IOV capability in a VF's config
space, it'll call here.
> + pos = pci_find_ext_capability(pdev, PCI_EXT_CAP_ID_SRIOV);
> +
> + if ( !pos )
> + return 0;
Can't this be an assertion? (In fact I was wondering why the caller,
which already had to find the capability, doesn't pass the position.)
> +static int cf_check cleanup_sriov(const struct pci_dev *pdev, bool hide)
> +{
> + unsigned int pos;
> + int rc;
> +
> + if ( !pdev->vpci->sriov )
> + return 0;
> +
> + ASSERT(!pdev->info.is_virtfn);
> + ASSERT(list_empty(&pdev->vf_list));
What guarantees this latter property?
> + if ( !hide )
> + {
> + XFREE(pdev->vpci->sriov);
> + return 0;
> + }
> +
> + pos = pci_find_ext_capability(pdev, PCI_EXT_CAP_ID_SRIOV);
As was explained in 9ebba62ca843 ("vPCI/ReBAR: improve cleanup") and
d734babf8bc3 ("vPCI: re-init extended-capabilities when MMCFG
availability changed"), you may not call this function during cleanup.
Jan
^ permalink raw reply [flat|nested] 35+ messages in thread
* Re: [PATCH v3 6/7] vpci: add SR-IOV support for DomUs
2026-04-09 14:01 ` [PATCH v3 6/7] vpci: add SR-IOV support for DomUs Mykyta Poturai
@ 2026-04-21 14:55 ` Jan Beulich
2026-04-24 6:34 ` Mykyta Poturai
0 siblings, 1 reply; 35+ messages in thread
From: Jan Beulich @ 2026-04-21 14:55 UTC (permalink / raw)
To: Mykyta Poturai, Stewart Hildebrand
Cc: Roger Pau Monné, xen-devel@lists.xenproject.org
On 09.04.2026 16:01, Mykyta Poturai wrote:
> From: Stewart Hildebrand <stewart.hildebrand@amd.com>
>
> Emulate guest BAR register values based on PF BAR values for VFs.
> This creates a guest view of the normal BAR registers, emulating their
> size and properties as they would be discovered during PCI device
> enumeration by the guest.
>
> Expose VID/DID and class/revision to the guest.
>
> Note that the VF's ROM BAR is read-only and all zeros; a VF may provide
> access to the PF's ROM via emulation, but this is not implemented.
>
> Signed-off-by: Stewart Hildebrand <stewart.hildebrand@amd.com>
> Signed-off-by: Mykyta Poturai <mykyta_poturai@epam.com>
So this change is merely to avoid having yet another gap on the road to
DomU support in vPCI? I.e. there's no claim or expectation that VFs
could now be used in DomU-s?
Jan
* [PATCH v3 7/7] docs: Update SR-IOV support status
2026-04-09 14:01 ` [PATCH v3 7/7] docs: Update SR-IOV support status Mykyta Poturai
@ 2026-04-21 14:56 ` Jan Beulich
0 siblings, 0 replies; 35+ messages in thread
From: Jan Beulich @ 2026-04-21 14:56 UTC (permalink / raw)
To: Mykyta Poturai
Cc: Andrew Cooper, Anthony PERARD, Michal Orzel, Julien Grall,
Roger Pau Monné, Stefano Stabellini,
xen-devel@lists.xenproject.org
On 09.04.2026 16:01, Mykyta Poturai wrote:
> Signed-off-by: Mykyta Poturai <mykyta_poturai@epam.com>
Might this better be part of patch 5?
Jan
> --- a/SUPPORT.md
> +++ b/SUPPORT.md
> @@ -170,8 +170,6 @@ unexpected behavior or issues on some hardware.
>
> At least the following features are missing on a PVH dom0:
>
> - * PCI SR-IOV.
> -
> * Native NMI forwarding (nmi=dom0 command line option).
>
> * MCE handling.
* Re: [PATCH v3 2/7] vpci: rename and export vpci_guest_mem_bar_{read,write}
2026-04-09 14:01 ` [PATCH v3 2/7] vpci: rename and export vpci_guest_mem_bar_{read,write} Mykyta Poturai
@ 2026-04-22 10:27 ` Roger Pau Monné
0 siblings, 0 replies; 35+ messages in thread
From: Roger Pau Monné @ 2026-04-22 10:27 UTC (permalink / raw)
To: Mykyta Poturai
Cc: xen-devel@lists.xenproject.org, Stewart Hildebrand, Teddy Astie
On Thu, Apr 09, 2026 at 02:01:32PM +0000, Mykyta Poturai wrote:
> From: Stewart Hildebrand <stewart.hildebrand@amd.com>
>
> Export functions required for SR-IOV support.
>
> Signed-off-by: Stewart Hildebrand <stewart.hildebrand@amd.com>
> Signed-off-by: Mykyta Poturai <mykyta_poturai@epam.com>
> Reviewed-by: Teddy Astie <teddy.astie@vates.tech>
FWIW, you could possibly squash patch 1 and 2 into a single patch.
Both changes are just exporting functions from the same file.
Thanks, Roger.
* Re: [PATCH v3 3/7] vpci: Use pervcpu ranges for BAR mapping
2026-04-09 14:01 ` [PATCH v3 3/7] vpci: Use pervcpu ranges for BAR mapping Mykyta Poturai
@ 2026-04-22 11:00 ` Roger Pau Monné
2026-04-22 12:04 ` Jan Beulich
0 siblings, 1 reply; 35+ messages in thread
From: Roger Pau Monné @ 2026-04-22 11:00 UTC (permalink / raw)
To: Mykyta Poturai
Cc: xen-devel@lists.xenproject.org, Andrew Cooper, Anthony PERARD,
Michal Orzel, Jan Beulich, Julien Grall, Stefano Stabellini,
Stewart Hildebrand
On Thu, Apr 09, 2026 at 02:01:33PM +0000, Mykyta Poturai wrote:
> From: Mykyta Poturai <Mykyta_Poturai@epam.com>
>
> There is no need to store ranges for each PCI device, as they are only
> used during the mapping/unmapping process and can be reused for each
> device. This also avoids the need to allocate and destroy
> rangesets for each device.
>
> Move the rangesets from struct vpci_bar to struct vpci_vcpu and perform
> (de-)allocation with vcpu (de-)allocation. Introduce RANGESET_DESTROY()
> macro to free a rangeset and set the pointer to NULL.
While this obviously has advantages when multiple PCI devices are
assigned to a single domain (i.e. for the hardware domain), it actually
consumes more memory in the domU case when a single device is assigned
to a domain that has more than one vCPU.
I'm not saying it's bad per se, but it's not obviously such an
improvement IMO. With the current approach we know Xen will allocate
a number of rangesets that's bounded by the number of PCI devices
(nr_rangesets = nr_pci_devs * NR_BARS), but with this approach the
number of allocated rangesets depends on the number of vCPUs, and
hence is no longer bounded by the hardware.
I assume that this will change further to support SR-IOV devices that
might have even more VF BARs per PF, so I have to look at further
patches.
> Amends: 622bdd962822 ("vpci/header: handle p2m range sets per BAR")
> Signed-off-by: Mykyta Poturai <mykyta_poturai@epam.com>
> Signed-off-by: Stewart Hildebrand <stewart.hildebrand@amd.com>
> ---
> v2->v3:
> * synced with BAR write with memory decoding enabled series[1]
> * add Amends tag
> * remove unused variable i due to rebasing over 998060dd9101 ("vPCI:
> move vpci_init_capabilities() to a separate file")
> * enclose entire struct vpci_vcpu inside #ifdef __XEN__
> * s/bar_mem/mem/
> * use ARRAY_SIZE
> * put init/destroy in functions
> * only allocate for domains with vPCI and idle domain
> * replace 'if ( !mem ) continue;' with ASSERT
>
> v1->v2
> * new patch
>
> [1]: https://patchew.org/Xen/20260406191203.97662-1-stewart.hildebrand@amd.com/
> ---
> xen/common/domain.c | 5 +++
> xen/drivers/vpci/header.c | 67 ++++++++++++++------------------------
> xen/drivers/vpci/vpci.c | 34 ++++++++++++++++---
> xen/include/xen/rangeset.h | 7 ++++
> xen/include/xen/vpci.h | 10 ++++--
> 5 files changed, 73 insertions(+), 50 deletions(-)
>
> diff --git a/xen/common/domain.c b/xen/common/domain.c
> index bb9e210c28..5ef7db8f09 100644
> --- a/xen/common/domain.c
> +++ b/xen/common/domain.c
> @@ -455,6 +455,8 @@ static int vcpu_teardown(struct vcpu *v)
> */
> static void vcpu_destroy(struct vcpu *v)
> {
> + vpci_vcpu_destroy(v);
> +
> free_vcpu_struct(v);
> }
>
> @@ -512,6 +514,9 @@ struct vcpu *vcpu_create(struct domain *d, unsigned int vcpu_id)
> if ( arch_vcpu_create(v) != 0 )
> goto fail_sched;
>
> + if ( vpci_vcpu_init(v) )
> + goto fail_sched;
> +
> d->vcpu[vcpu_id] = v;
> if ( vcpu_id != 0 )
> {
> diff --git a/xen/drivers/vpci/header.c b/xen/drivers/vpci/header.c
> index 5d5ba5c016..5bfb541b6a 100644
> --- a/xen/drivers/vpci/header.c
> +++ b/xen/drivers/vpci/header.c
> @@ -196,6 +196,7 @@ bool vpci_process_pending(struct vcpu *v)
> for ( i = 0; i < ARRAY_SIZE(header->bars); i++ )
> {
> struct vpci_bar *bar = &header->bars[i];
> + struct rangeset *mem = v->vpci.mem[i];
> struct map_data data = {
> .d = v->domain,
> .map = v->vpci.cmd & PCI_COMMAND_MEMORY,
> @@ -203,10 +204,10 @@ bool vpci_process_pending(struct vcpu *v)
> };
> int rc;
>
> - if ( rangeset_is_empty(bar->mem) )
> + if ( rangeset_is_empty(mem) )
> continue;
>
> - rc = rangeset_consume_ranges(bar->mem, map_range, &data);
> + rc = rangeset_consume_ranges(mem, map_range, &data);
>
> if ( rc == -ERESTART )
> {
> @@ -224,8 +225,8 @@ bool vpci_process_pending(struct vcpu *v)
>
> /* Clean all the rangesets */
> for ( i = 0; i < ARRAY_SIZE(header->bars); i++ )
> - if ( !rangeset_is_empty(header->bars[i].mem) )
> - rangeset_purge(header->bars[i].mem);
> + if ( !rangeset_is_empty(v->vpci.mem[i]) )
> + rangeset_purge(v->vpci.mem[i]);
>
> v->vpci.pdev = NULL;
>
> @@ -260,13 +261,14 @@ static int __init apply_map(struct domain *d, const struct pci_dev *pdev,
> for ( i = 0; i < ARRAY_SIZE(header->bars); i++ )
> {
> struct vpci_bar *bar = &header->bars[i];
> + struct rangeset *mem = current->vpci.mem[i];
> struct map_data data = { .d = d, .map = true, .bar = bar };
>
> - if ( rangeset_is_empty(bar->mem) )
> + if ( rangeset_is_empty(mem) )
> continue;
>
> - while ( (rc = rangeset_consume_ranges(bar->mem, map_range,
> - &data)) == -ERESTART )
> + while ( (rc = rangeset_consume_ranges(mem, map_range, &data)) ==
> + -ERESTART )
> {
> /*
> * It's safe to drop and reacquire the lock in this context
> @@ -331,13 +333,13 @@ int vpci_modify_bars(const struct pci_dev *pdev, uint16_t cmd, bool rom_only)
> for ( i = 0; i < ARRAY_SIZE(header->bars); i++ )
> {
> struct vpci_bar *bar = &header->bars[i];
> + struct rangeset *mem = current->vpci.mem[i];
> unsigned long start = PFN_DOWN(bar->addr);
> unsigned long end = PFN_DOWN(bar->addr + bar->size - 1);
> unsigned long start_guest = PFN_DOWN(bar->guest_addr);
> unsigned long end_guest = PFN_DOWN(bar->guest_addr + bar->size - 1);
>
> - if ( !bar->mem )
> - continue;
> + ASSERT(mem);
>
> if ( !MAPPABLE_BAR(bar) ||
> (rom_only ? bar->type != VPCI_BAR_ROM
> @@ -354,7 +356,7 @@ int vpci_modify_bars(const struct pci_dev *pdev, uint16_t cmd, bool rom_only)
> continue;
> }
>
> - ASSERT(rangeset_is_empty(bar->mem));
> + ASSERT(rangeset_is_empty(mem));
>
> /*
> * Make sure that the guest set address has the same page offset
> @@ -369,7 +371,7 @@ int vpci_modify_bars(const struct pci_dev *pdev, uint16_t cmd, bool rom_only)
> return -EINVAL;
> }
>
> - rc = rangeset_add_range(bar->mem, start_guest, end_guest);
> + rc = rangeset_add_range(mem, start_guest, end_guest);
> if ( rc )
> {
> printk(XENLOG_G_WARNING "Failed to add [%lx, %lx]: %d\n",
> @@ -380,12 +382,12 @@ int vpci_modify_bars(const struct pci_dev *pdev, uint16_t cmd, bool rom_only)
> /* Check for overlap with the already setup BAR ranges. */
> for ( j = 0; j < i; j++ )
> {
> - struct vpci_bar *prev_bar = &header->bars[j];
> + struct rangeset *prev_mem = current->vpci.mem[j];
>
> - if ( rangeset_is_empty(prev_bar->mem) )
> + if ( rangeset_is_empty(prev_mem) )
> continue;
>
> - rc = rangeset_remove_range(prev_bar->mem, start_guest, end_guest);
> + rc = rangeset_remove_range(prev_mem, start_guest, end_guest);
> if ( rc )
> {
> gprintk(XENLOG_WARNING,
> @@ -395,7 +397,7 @@ int vpci_modify_bars(const struct pci_dev *pdev, uint16_t cmd, bool rom_only)
> }
> }
>
> - rc = pci_sanitize_bar_memory(bar->mem);
> + rc = pci_sanitize_bar_memory(mem);
> if ( rc )
> {
> gprintk(XENLOG_WARNING,
> @@ -412,14 +414,14 @@ int vpci_modify_bars(const struct pci_dev *pdev, uint16_t cmd, bool rom_only)
> unsigned long end = PFN_DOWN(vmsix_table_addr(pdev->vpci, i) +
> vmsix_table_size(pdev->vpci, i) - 1);
>
> - for ( j = 0; j < ARRAY_SIZE(header->bars); j++ )
> + for ( j = 0; j < ARRAY_SIZE(current->vpci.mem); j++ )
You make non-trivial use of current in vpci_modify_bars(); maybe you
should consider introducing a local variable for it:
struct vcpu *curr = current;
current expands to a call to get_cpu_info(), which is better to avoid
doing repeatedly, especially in the context above where it's used as a
loop upper bound.
> {
> - const struct vpci_bar *bar = &header->bars[j];
> + struct rangeset *mem = current->vpci.mem[j];
>
> - if ( rangeset_is_empty(bar->mem) )
> + if ( rangeset_is_empty(mem) )
> continue;
>
> - rc = rangeset_remove_range(bar->mem, start, end);
> + rc = rangeset_remove_range(mem, start, end);
> if ( rc )
> {
> gprintk(XENLOG_WARNING,
> @@ -469,8 +471,9 @@ int vpci_modify_bars(const struct pci_dev *pdev, uint16_t cmd, bool rom_only)
> for ( j = 0; j < ARRAY_SIZE(header->bars); j++)
> {
> const struct vpci_bar *bar = &header->bars[j];
> + struct rangeset *mem = current->vpci.mem[j];
>
> - if ( !rangeset_overlaps_range(bar->mem, start, end) ||
> + if ( !rangeset_overlaps_range(mem, start, end) ||
> /*
> * If only the ROM enable bit is toggled check against
> * other BARs in the same device for overlaps, but not
> @@ -481,7 +484,7 @@ int vpci_modify_bars(const struct pci_dev *pdev, uint16_t cmd, bool rom_only)
> bar->type == VPCI_BAR_ROM) )
> continue;
>
> - rc = rangeset_remove_range(bar->mem, start, end);
> + rc = rangeset_remove_range(mem, start, end);
> if ( rc )
> {
> gprintk(XENLOG_WARNING,
> @@ -734,18 +737,6 @@ static void cf_check rom_write(
> }
> }
>
> -static int bar_add_rangeset(const struct pci_dev *pdev, struct vpci_bar *bar,
> - unsigned int i)
> -{
> - char str[32];
> -
> - snprintf(str, sizeof(str), "%pp:BAR%u", &pdev->sbdf, i);
> -
> - bar->mem = rangeset_new(pdev->domain, str, RANGESETF_no_print);
> -
> - return !bar->mem ? -ENOMEM : 0;
> -}
> -
> int vpci_init_header(struct pci_dev *pdev)
> {
> uint16_t cmd;
> @@ -856,10 +847,6 @@ int vpci_init_header(struct pci_dev *pdev)
> else
> bars[i].type = VPCI_BAR_MEM32;
>
> - rc = bar_add_rangeset(pdev, &bars[i], i);
> - if ( rc )
> - goto fail;
> -
> rc = pci_size_mem_bar(pdev->sbdf, reg, &addr, &size,
> (i == num_bars - 1) ? PCI_BAR_LAST : 0);
> if ( rc < 0 )
> @@ -913,12 +900,6 @@ int vpci_init_header(struct pci_dev *pdev)
> 4, rom);
> if ( rc )
> rom->type = VPCI_BAR_EMPTY;
> - else
> - {
> - rc = bar_add_rangeset(pdev, rom, num_bars);
> - if ( rc )
> - goto fail;
> - }
> }
> else if ( !is_hwdom )
> {
> diff --git a/xen/drivers/vpci/vpci.c b/xen/drivers/vpci/vpci.c
> index 0ac9ec8b04..d069ca6d9c 100644
> --- a/xen/drivers/vpci/vpci.c
> +++ b/xen/drivers/vpci/vpci.c
> @@ -24,6 +24,35 @@
>
> #ifdef __XEN__
>
> +void vpci_vcpu_destroy(struct vcpu *v)
> +{
> + if ( !has_vpci(v->domain) && !is_idle_domain(v->domain) )
> + return;
> +
> + for ( unsigned int i = 0; i < ARRAY_SIZE(v->vpci.mem); i++ )
> + RANGESET_DESTROY(v->vpci.mem[i]);
> +}
> +
> +int vpci_vcpu_init(struct vcpu *v)
> +{
> + unsigned int i;
> +
> + if ( !has_vpci(v->domain) && !is_idle_domain(v->domain) )
> + return 0;
> +
> + for ( i = 0; i < ARRAY_SIZE(v->vpci.mem); i++ )
> + {
> + char str[32];
> +
> + snprintf(str, sizeof(str), "%pv:BAR%u", v, i);
> + v->vpci.mem[i] = rangeset_new(v->domain, str, RANGESETF_no_print);
> + if ( !v->vpci.mem[i] )
> + return -ENOMEM;
> + }
> +
> + return 0;
> +}
> +
> #ifdef CONFIG_HAS_VPCI_GUEST_SUPPORT
> static int assign_virtual_sbdf(struct pci_dev *pdev)
> {
> @@ -89,8 +118,6 @@ struct vpci_register *vpci_get_register(const struct vpci *vpci,
>
> void vpci_deassign_device(struct pci_dev *pdev)
> {
> - unsigned int i;
> -
> ASSERT(rw_is_write_locked(&pdev->domain->pci_lock));
>
> if ( !has_vpci(pdev->domain) || !pdev->vpci )
> @@ -116,9 +143,6 @@ void vpci_deassign_device(struct pci_dev *pdev)
> }
> spin_unlock(&pdev->vpci->lock);
>
> - for ( i = 0; i < ARRAY_SIZE(pdev->vpci->header.bars); i++ )
> - rangeset_destroy(pdev->vpci->header.bars[i].mem);
> -
> xfree(pdev->vpci);
> pdev->vpci = NULL;
> }
> diff --git a/xen/include/xen/rangeset.h b/xen/include/xen/rangeset.h
> index 817505badf..f01e00ec92 100644
> --- a/xen/include/xen/rangeset.h
> +++ b/xen/include/xen/rangeset.h
> @@ -40,6 +40,13 @@ struct rangeset *rangeset_new(
> void rangeset_destroy(
> struct rangeset *r);
>
> +/* Destroy a rangeset, and zero the pointer to it. */
> +#define RANGESET_DESTROY(r) \
> + ({ \
> + rangeset_destroy(r); \
> + (r) = NULL; \
You want to use a construct similar to what XFREE() does:
#define RANGESET_DESTROY(r) do { \
struct rangeset *_r_ = (r); \
(r) = NULL; \
rangeset_destroy(_r_); \
} while ( false )
After the GhostRace vulnerability we NULL the variable before freeing
the data associated with it.
Thanks, Roger.
* Re: [PATCH v3 4/7] vpci: allow queueing of mapping operations
2026-04-09 14:01 ` [PATCH v3 4/7] vpci: allow queueing of mapping operations Mykyta Poturai
@ 2026-04-22 11:38 ` Roger Pau Monné
0 siblings, 0 replies; 35+ messages in thread
From: Roger Pau Monné @ 2026-04-22 11:38 UTC (permalink / raw)
To: Mykyta Poturai
Cc: xen-devel@lists.xenproject.org, Stewart Hildebrand, Andrew Cooper,
Anthony PERARD, Michal Orzel, Jan Beulich, Julien Grall,
Stefano Stabellini
On Thu, Apr 09, 2026 at 02:01:33PM +0000, Mykyta Poturai wrote:
> From: Stewart Hildebrand <stewart.hildebrand@amd.com>
>
> Introduce vPCI BAR mapping task queue. Store information needed to
> map/unmap BARs in struct vpci_map_task. Allow queueing of BAR map/unmap
> operations in a list, thus making it possible to perform multiple p2m
> operations associated with single PCI device.
I would replace "single PCI device" with "single PCI config space
register access", as it's more specific.
> This is preparatory work for further changes that need to perform
> multiple unmap/map operations before returning to guest.
>
> At the moment, only a single operation will be queued. However, when
> multiple operations are queued, there is a check in modify_bars() to
> skip BARs already in the requested state that will no longer be
> accurate. Remove this check in preparation of upcoming changes.
>
> Signed-off-by: Stewart Hildebrand <stewart.hildebrand@amd.com>
> Message-ID: <20260406191203.97662-3-stewart.hildebrand@amd.com>
I think you also need to add your SoB if you add it to your series.
> ---
> v2->v3:
> * new patch in this series, borrowed from [1]
>
> [1]: https://patchew.org/Xen/20260406191203.97662-1-stewart.hildebrand@amd.com/20260406191203.97662-3-stewart.hildebrand@amd.com/
> ---
> xen/common/domain.c | 5 +-
> xen/drivers/vpci/header.c | 227 ++++++++++++++++++++++++++-----------
> xen/drivers/vpci/vpci.c | 28 +----
> xen/include/xen/rangeset.h | 7 --
> xen/include/xen/vpci.h | 21 ++--
> 5 files changed, 179 insertions(+), 109 deletions(-)
>
> diff --git a/xen/common/domain.c b/xen/common/domain.c
> index 5ef7db8f09..b1931be987 100644
> --- a/xen/common/domain.c
> +++ b/xen/common/domain.c
> @@ -455,8 +455,6 @@ static int vcpu_teardown(struct vcpu *v)
> */
> static void vcpu_destroy(struct vcpu *v)
> {
> - vpci_vcpu_destroy(v);
> -
> free_vcpu_struct(v);
> }
>
> @@ -514,8 +512,7 @@ struct vcpu *vcpu_create(struct domain *d, unsigned int vcpu_id)
> if ( arch_vcpu_create(v) != 0 )
> goto fail_sched;
>
> - if ( vpci_vcpu_init(v) )
> - goto fail_sched;
> + vpci_vcpu_init(v);
>
> d->vcpu[vcpu_id] = v;
> if ( vcpu_id != 0 )
> diff --git a/xen/drivers/vpci/header.c b/xen/drivers/vpci/header.c
> index 5bfb541b6a..451cdd3a6f 100644
> --- a/xen/drivers/vpci/header.c
> +++ b/xen/drivers/vpci/header.c
> @@ -23,6 +23,7 @@
> #include <xen/lib.h>
> #include <xen/sched.h>
> #include <xen/softirq.h>
> +#include <xen/xvmalloc.h>
>
> #include <xsm/xsm.h>
>
> @@ -35,7 +36,7 @@
>
> struct map_data {
> struct domain *d;
> - const struct vpci_bar *bar;
> + const struct vpci_bar_map *bar;
> bool map;
> };
>
> @@ -174,32 +175,20 @@ static void modify_decoding(const struct pci_dev *pdev, uint16_t cmd,
> ASSERT_UNREACHABLE();
> }
>
> -bool vpci_process_pending(struct vcpu *v)
> +static int vpci_process_map_task(const struct pci_dev *pdev,
> + struct vpci_map_task *task)
> {
> - const struct pci_dev *pdev = v->vpci.pdev;
> - struct vpci_header *header = NULL;
> unsigned int i;
>
> - if ( !pdev )
> - return false;
> -
> - read_lock(&v->domain->pci_lock);
> -
> - if ( !pdev->vpci || (v->domain != pdev->domain) )
> - {
> - v->vpci.pdev = NULL;
> - read_unlock(&v->domain->pci_lock);
> - return false;
> - }
> + ASSERT(rw_is_locked(&pdev->domain->pci_lock));
>
> - header = &pdev->vpci->header;
> - for ( i = 0; i < ARRAY_SIZE(header->bars); i++ )
> + for ( i = 0; i < ARRAY_SIZE(task->bars); i++ )
> {
> - struct vpci_bar *bar = &header->bars[i];
> - struct rangeset *mem = v->vpci.mem[i];
> + struct vpci_bar_map *bar = &task->bars[i];
> + struct rangeset *mem = bar->mem;
> struct map_data data = {
> - .d = v->domain,
> - .map = v->vpci.cmd & PCI_COMMAND_MEMORY,
> + .d = pdev->domain,
> + .map = task->cmd & PCI_COMMAND_MEMORY,
> .bar = bar,
> };
> int rc;
> @@ -210,58 +199,116 @@ bool vpci_process_pending(struct vcpu *v)
> rc = rangeset_consume_ranges(mem, map_range, &data);
>
> if ( rc == -ERESTART )
> - {
> - read_unlock(&v->domain->pci_lock);
> - return true;
> - }
> + return rc;
>
> if ( rc )
> {
> spin_lock(&pdev->vpci->lock);
> /* Disable memory decoding unconditionally on failure. */
> - modify_decoding(pdev, v->vpci.cmd & ~PCI_COMMAND_MEMORY,
> - false);
> + modify_decoding(pdev, task->cmd & ~PCI_COMMAND_MEMORY, false);
> spin_unlock(&pdev->vpci->lock);
>
> - /* Clean all the rangesets */
> - for ( i = 0; i < ARRAY_SIZE(header->bars); i++ )
> - if ( !rangeset_is_empty(v->vpci.mem[i]) )
> - rangeset_purge(v->vpci.mem[i]);
> + if ( !is_hardware_domain(pdev->domain) )
> + domain_crash(pdev->domain);
> +
> + return rc;
> + }
> + }
> +
> + spin_lock(&pdev->vpci->lock);
> + modify_decoding(pdev, task->cmd, task->rom_only);
> + spin_unlock(&pdev->vpci->lock);
> +
> + return 0;
> +}
> +
> +static void destroy_map_task(struct vpci_map_task *task)
> +{
> + unsigned int i;
> +
> + if ( !task )
> + {
> + ASSERT_UNREACHABLE();
> + return;
> + }
> +
> + for ( i = 0; i < ARRAY_SIZE(task->bars); i++ )
> + rangeset_destroy(task->bars[i].mem);
> +
> + xvfree(task);
> +}
> +
> +static void clear_map_queue(struct vcpu *v)
> +{
> + struct vpci_map_task *task;
> +
> + while ( (task = list_first_entry_or_null(&v->vpci.task_queue,
> + struct vpci_map_task,
> + next)) != NULL )
> + {
> + list_del(&task->next);
> + destroy_map_task(task);
> + }
> +}
> +
> +bool vpci_process_pending(struct vcpu *v)
> +{
> + const struct pci_dev *pdev = v->vpci.pdev;
> + struct vpci_map_task *task;
>
> - v->vpci.pdev = NULL;
> + if ( !pdev )
> + return false;
>
> + read_lock(&v->domain->pci_lock);
> +
> + if ( !pdev->vpci || (v->domain != pdev->domain) )
> + {
> + clear_map_queue(v);
> + v->vpci.pdev = NULL;
> + read_unlock(&v->domain->pci_lock);
> + return false;
> + }
> +
> + while ( (task = list_first_entry_or_null(&v->vpci.task_queue,
> + struct vpci_map_task,
> + next)) != NULL )
> + {
> + int rc = vpci_process_map_task(pdev, task);
> +
> + if ( rc == -ERESTART )
> + {
> read_unlock(&v->domain->pci_lock);
> + return true;
> + }
>
> - if ( !is_hardware_domain(v->domain) )
> - domain_crash(v->domain);
> + list_del(&task->next);
> + destroy_map_task(task);
>
> - return false;
> + if ( rc )
> + {
> + clear_map_queue(v);
> + break;
> }
> }
> v->vpci.pdev = NULL;
>
> - spin_lock(&pdev->vpci->lock);
> - modify_decoding(pdev, v->vpci.cmd, v->vpci.rom_only);
> - spin_unlock(&pdev->vpci->lock);
> -
> read_unlock(&v->domain->pci_lock);
>
> return false;
> }
>
> static int __init apply_map(struct domain *d, const struct pci_dev *pdev,
> - uint16_t cmd)
> + struct vpci_map_task *task)
> {
> - struct vpci_header *header = &pdev->vpci->header;
> int rc = 0;
> unsigned int i;
>
> ASSERT(rw_is_write_locked(&d->pci_lock));
>
> - for ( i = 0; i < ARRAY_SIZE(header->bars); i++ )
> + for ( i = 0; i < ARRAY_SIZE(task->bars); i++ )
> {
> - struct vpci_bar *bar = &header->bars[i];
> - struct rangeset *mem = current->vpci.mem[i];
> + struct vpci_bar_map *bar = &task->bars[i];
> + struct rangeset *mem = bar->mem;
> struct map_data data = { .d = d, .map = true, .bar = bar };
>
> if ( rangeset_is_empty(mem) )
> @@ -281,15 +328,52 @@ static int __init apply_map(struct domain *d, const struct pci_dev *pdev,
> }
> }
> if ( !rc )
> - modify_decoding(pdev, cmd, false);
> + modify_decoding(pdev, task->cmd, false);
>
> return rc;
> }
>
> -static void defer_map(const struct pci_dev *pdev, uint16_t cmd, bool rom_only)
> +static struct vpci_map_task *alloc_map_task(const struct pci_dev *pdev,
> + uint16_t cmd, bool rom_only)
> +{
> + struct vpci_map_task *task;
> + unsigned int i;
> +
> + task = xvzalloc(struct vpci_map_task);
> +
> + if ( !task )
> + return NULL;
> +
> + for ( i = 0; i < ARRAY_SIZE(task->bars); i++ )
> + {
> + if ( !MAPPABLE_BAR(&pdev->vpci->header.bars[i]) )
> + continue;
> +
> + task->bars[i].mem = rangeset_new(pdev->domain, NULL,
> + RANGESETF_no_print);
> +
> + if ( !task->bars[i].mem )
> + {
> + destroy_map_task(task);
> + return NULL;
> + }
> +
> + task->bars[i].addr = pdev->vpci->header.bars[i].addr;
> + task->bars[i].guest_addr = pdev->vpci->header.bars[i].guest_addr;
> + }
> +
> + task->cmd = cmd;
> + task->rom_only = rom_only;
> +
> + return task;
> +}
> +
> +static void defer_map(const struct pci_dev *pdev, struct vpci_map_task *task)
> {
> struct vcpu *curr = current;
>
> + ASSERT(!curr->vpci.pdev || curr->vpci.pdev == pdev);
> +
> /*
> * FIXME: when deferring the {un}map the state of the device should not
> * be trusted. For example the enable bit is toggled after the device
> @@ -297,8 +381,8 @@ static void defer_map(const struct pci_dev *pdev, uint16_t cmd, bool rom_only)
> * started for the same device if the domain is not well-behaved.
> */
> curr->vpci.pdev = pdev;
> - curr->vpci.cmd = cmd;
> - curr->vpci.rom_only = rom_only;
> + list_add_tail(&task->next, &curr->vpci.task_queue);
> +
> /*
> * Raise a scheduler softirq in order to prevent the guest from resuming
> * execution with pending mapping operations, to trigger the invocation
> @@ -313,11 +397,17 @@ int vpci_modify_bars(const struct pci_dev *pdev, uint16_t cmd, bool rom_only)
> struct pci_dev *tmp;
> const struct domain *d;
> const struct vpci_msix *msix = pdev->vpci->msix;
> + struct vpci_map_task *task;
> unsigned int i, j;
> int rc;
>
> ASSERT(rw_is_write_locked(&pdev->domain->pci_lock));
>
> + task = alloc_map_task(pdev, cmd, rom_only);
> +
> + if ( !task )
> + return -ENOMEM;
> +
> /*
> * Create a rangeset per BAR that represents the current device memory
> * region and compare it against all the currently active BAR memory
> @@ -333,19 +423,18 @@ int vpci_modify_bars(const struct pci_dev *pdev, uint16_t cmd, bool rom_only)
> for ( i = 0; i < ARRAY_SIZE(header->bars); i++ )
> {
> struct vpci_bar *bar = &header->bars[i];
> - struct rangeset *mem = current->vpci.mem[i];
> + struct rangeset *mem = task->bars[i].mem;
> unsigned long start = PFN_DOWN(bar->addr);
> unsigned long end = PFN_DOWN(bar->addr + bar->size - 1);
> unsigned long start_guest = PFN_DOWN(bar->guest_addr);
> unsigned long end_guest = PFN_DOWN(bar->guest_addr + bar->size - 1);
>
> - ASSERT(mem);
> + if ( !mem )
> + continue;
>
> if ( !MAPPABLE_BAR(bar) ||
> (rom_only ? bar->type != VPCI_BAR_ROM
> - : (bar->type == VPCI_BAR_ROM && !header->rom_enabled)) ||
> - /* Skip BARs already in the requested state. */
> - bar->enabled == !!(cmd & PCI_COMMAND_MEMORY) )
> + : (bar->type == VPCI_BAR_ROM && !header->rom_enabled)) )
> continue;
>
> if ( !pci_check_bar(pdev, _mfn(start), _mfn(end)) )
> @@ -368,7 +457,8 @@ int vpci_modify_bars(const struct pci_dev *pdev, uint16_t cmd, bool rom_only)
> gprintk(XENLOG_G_WARNING,
> "%pp: can't map BAR%u - offset mismatch: %#lx vs %#lx\n",
> &pdev->sbdf, i, bar->guest_addr, bar->addr);
> - return -EINVAL;
> + rc = -EINVAL;
> + goto fail;
> }
>
> rc = rangeset_add_range(mem, start_guest, end_guest);
> @@ -376,13 +466,13 @@ int vpci_modify_bars(const struct pci_dev *pdev, uint16_t cmd, bool rom_only)
> {
> printk(XENLOG_G_WARNING "Failed to add [%lx, %lx]: %d\n",
> start_guest, end_guest, rc);
> - return rc;
> + goto fail;
> }
>
> /* Check for overlap with the already setup BAR ranges. */
> for ( j = 0; j < i; j++ )
> {
> - struct rangeset *prev_mem = current->vpci.mem[j];
> + struct rangeset *prev_mem = task->bars[j].mem;
>
> if ( rangeset_is_empty(prev_mem) )
> continue;
> @@ -393,7 +483,7 @@ int vpci_modify_bars(const struct pci_dev *pdev, uint16_t cmd, bool rom_only)
> gprintk(XENLOG_WARNING,
> "%pp: failed to remove overlapping range [%lx, %lx]: %d\n",
> &pdev->sbdf, start_guest, end_guest, rc);
> - return rc;
> + goto fail;
> }
> }
>
> @@ -403,7 +493,7 @@ int vpci_modify_bars(const struct pci_dev *pdev, uint16_t cmd, bool rom_only)
> gprintk(XENLOG_WARNING,
> "%pp: failed to sanitize BAR#%u memory: %d\n",
> &pdev->sbdf, i, rc);
> - return rc;
> + goto fail;
> }
> }
>
> @@ -414,9 +504,9 @@ int vpci_modify_bars(const struct pci_dev *pdev, uint16_t cmd, bool rom_only)
> unsigned long end = PFN_DOWN(vmsix_table_addr(pdev->vpci, i) +
> vmsix_table_size(pdev->vpci, i) - 1);
>
> - for ( j = 0; j < ARRAY_SIZE(current->vpci.mem); j++ )
> + for ( j = 0; j < ARRAY_SIZE(task->bars); j++ )
> {
> - struct rangeset *mem = current->vpci.mem[j];
> + struct rangeset *mem = task->bars[j].mem;
>
> if ( rangeset_is_empty(mem) )
> continue;
> @@ -427,7 +517,7 @@ int vpci_modify_bars(const struct pci_dev *pdev, uint16_t cmd, bool rom_only)
> gprintk(XENLOG_WARNING,
> "%pp: failed to remove MSIX table [%lx, %lx]: %d\n",
> &pdev->sbdf, start, end, rc);
> - return rc;
> + goto fail;
> }
> }
> }
> @@ -471,7 +561,7 @@ int vpci_modify_bars(const struct pci_dev *pdev, uint16_t cmd, bool rom_only)
> for ( j = 0; j < ARRAY_SIZE(header->bars); j++)
> {
> const struct vpci_bar *bar = &header->bars[j];
> - struct rangeset *mem = current->vpci.mem[j];
> + struct rangeset *mem = task->bars[j].mem;
>
> if ( !rangeset_overlaps_range(mem, start, end) ||
> /*
> @@ -490,7 +580,7 @@ int vpci_modify_bars(const struct pci_dev *pdev, uint16_t cmd, bool rom_only)
> gprintk(XENLOG_WARNING,
> "%pp: failed to remove [%lx, %lx]: %d\n",
> &pdev->sbdf, start, end, rc);
> - return rc;
> + goto fail;
> }
> }
> }
> @@ -513,12 +603,19 @@ int vpci_modify_bars(const struct pci_dev *pdev, uint16_t cmd, bool rom_only)
> * will always be to establish mappings and process all the BARs.
> */
> ASSERT((cmd & PCI_COMMAND_MEMORY) && !rom_only);
> - return apply_map(pdev->domain, pdev, cmd);
> + rc = apply_map(pdev->domain, pdev, task);
> + destroy_map_task(task);
> + return rc;
> }
>
> - defer_map(pdev, cmd, rom_only);
> + defer_map(pdev, task);
>
> return 0;
> +
> + fail:
> + destroy_map_task(task);
> +
> + return rc;
> }
>
> static void cf_check cmd_write(
> diff --git a/xen/drivers/vpci/vpci.c b/xen/drivers/vpci/vpci.c
> index d069ca6d9c..ce9fb5b357 100644
> --- a/xen/drivers/vpci/vpci.c
> +++ b/xen/drivers/vpci/vpci.c
> @@ -24,33 +24,9 @@
>
> #ifdef __XEN__
>
> -void vpci_vcpu_destroy(struct vcpu *v)
> +void vpci_vcpu_init(struct vcpu *v)
> {
> - if ( !has_vpci(v->domain) && !is_idle_domain(v->domain) )
> - return;
> -
> - for ( unsigned int i = 0; i < ARRAY_SIZE(v->vpci.mem); i++ )
> - RANGESET_DESTROY(v->vpci.mem[i]);
> -}
> -
> -int vpci_vcpu_init(struct vcpu *v)
> -{
> - unsigned int i;
> -
> - if ( !has_vpci(v->domain) && !is_idle_domain(v->domain) )
> - return 0;
> -
> - for ( i = 0; i < ARRAY_SIZE(v->vpci.mem); i++ )
> - {
> - char str[32];
> -
> - snprintf(str, sizeof(str), "%pv:BAR%u", v, i);
> - v->vpci.mem[i] = rangeset_new(v->domain, str, RANGESETF_no_print);
> - if ( !v->vpci.mem[i] )
> - return -ENOMEM;
> - }
> -
> - return 0;
> + INIT_LIST_HEAD(&v->vpci.task_queue);
> }
>
> #ifdef CONFIG_HAS_VPCI_GUEST_SUPPORT
> diff --git a/xen/include/xen/rangeset.h b/xen/include/xen/rangeset.h
> index f01e00ec92..817505badf 100644
> --- a/xen/include/xen/rangeset.h
> +++ b/xen/include/xen/rangeset.h
> @@ -40,13 +40,6 @@ struct rangeset *rangeset_new(
> void rangeset_destroy(
> struct rangeset *r);
>
> -/* Destroy a rangeset, and zero the pointer to it. */
> -#define RANGESET_DESTROY(r) \
> - ({ \
> - rangeset_destroy(r); \
> - (r) = NULL; \
> - })
You introduce this in the previous patch, and then remove it in the
next patch? Please try to avoid such churn: it's not good to waste
effort reviewing newly introduced code that's then removed by a patch
in the same series.
Except in very special cases, we usually recommend that submitters
make sure the code introduced in a patch series is consistent, so that
we don't end up in a situation like this, where you remove a lot of
code that was introduced by the previous patch.
You likely want to split the changes slightly differently here: maybe
the previous patch should introduce the non-functional changes, like
the change to use mem instead of bar->mem or similar, and a separate
patch should introduce the usage of the linked-list task queue.
> -
> /*
> * Set a limit on the number of ranges that may exist in set @r.
> * NOTE: This must be called while @r is empty.
> diff --git a/xen/include/xen/vpci.h b/xen/include/xen/vpci.h
> index b55bacbe6e..e34f7abe6d 100644
> --- a/xen/include/xen/vpci.h
> +++ b/xen/include/xen/vpci.h
> @@ -19,8 +19,7 @@
> */
> #define VPCI_MAX_VIRT_DEV (PCI_SLOT(~0) + 1)
>
> -void vpci_vcpu_destroy(struct vcpu *v);
> -int vpci_vcpu_init(struct vcpu *v);
> +void vpci_vcpu_init(struct vcpu *v);
>
> /* Assign vPCI to device by adding handlers. */
> int __must_check vpci_assign_device(struct pci_dev *pdev);
> @@ -155,14 +154,23 @@ struct vpci {
> };
>
> #ifdef __XEN__
> -struct vpci_vcpu {
> +struct vpci_map_task {
> /* Per-vcpu structure to store state while {un}mapping of PCI BARs. */
> - const struct pci_dev *pdev;
> - struct rangeset *mem[ARRAY_SIZE(((struct vpci_header *)NULL)->bars)];
> + struct list_head next;
> + struct vpci_bar_map {
> + uint64_t addr;
> + uint64_t guest_addr;
> + struct rangeset *mem;
> + } bars[ARRAY_SIZE(((struct vpci_header *)NULL)->bars)];
I'm a bit puzzled (possibly missing something), but why are you
keeping this vpci_bar_map array here? AFAICT map tasks are allocated
on-demand after this change (by using {alloc,destroy}_map_task()).
Thanks, Roger.
^ permalink raw reply [flat|nested] 35+ messages in thread
* Re: [PATCH v3 3/7] vpci: Use pervcpu ranges for BAR mapping
2026-04-22 11:00 ` Roger Pau Monné
@ 2026-04-22 12:04 ` Jan Beulich
2026-04-22 14:20 ` Roger Pau Monné
0 siblings, 1 reply; 35+ messages in thread
From: Jan Beulich @ 2026-04-22 12:04 UTC (permalink / raw)
To: Roger Pau Monné
Cc: xen-devel@lists.xenproject.org, Andrew Cooper, Anthony PERARD,
Michal Orzel, Julien Grall, Stefano Stabellini,
Stewart Hildebrand, Mykyta Poturai
On 22.04.2026 13:00, Roger Pau Monné wrote:
> On Thu, Apr 09, 2026 at 02:01:33PM +0000, Mykyta Poturai wrote:
>> @@ -412,14 +414,14 @@ int vpci_modify_bars(const struct pci_dev *pdev, uint16_t cmd, bool rom_only)
>> unsigned long end = PFN_DOWN(vmsix_table_addr(pdev->vpci, i) +
>> vmsix_table_size(pdev->vpci, i) - 1);
>>
>> - for ( j = 0; j < ARRAY_SIZE(header->bars); j++ )
>> + for ( j = 0; j < ARRAY_SIZE(current->vpci.mem); j++ )
>
> You make a non-trivial use of current in vpci_modify_bars(), maybe you
> should consider introducing a local variable for it:
>
> struct vcpu *curr = current;
Without any intention to negate this, ...
> current expands to a call to get_cpu_info(), which is better to avoid
> doing repeatedly, especially in the context above, where it is used as
> a loop upper bound.
... I'd like to point out that "current" isn't evaluated when used by
ARRAY_SIZE() (resolving to two uses of sizeof()).
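This can be checked with a small stand-alone sketch (get_current() and
fake_vcpu are illustrative stand-ins, not the actual Xen definitions):
since ARRAY_SIZE() resolves to two uses of sizeof(), and sizeof() does
not evaluate its operand (barring VLAs), no get_cpu_info() call is
emitted for the loop bound.

```c
#include <assert.h>
#include <stddef.h>

/* Xen's ARRAY_SIZE() boils down to two uses of sizeof(). */
#define ARRAY_SIZE(a) (sizeof(a) / sizeof((a)[0]))

struct fake_vcpu {
    struct { unsigned long mem[6]; } vpci;
};

static int evaluations;

/* Stand-in for the get_cpu_info()-backed "current": counts how often
 * it is actually evaluated. */
static struct fake_vcpu *get_current(void)
{
    static struct fake_vcpu v;

    evaluations++;
    return &v;
}

static size_t bar_count(void)
{
    /* The operand of sizeof() is not evaluated (no VLA involved), so
     * get_current() is never called here. */
    return ARRAY_SIZE(get_current()->vpci.mem);
}
```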
Jan
* Re: [PATCH v3 5/7] vpci: add SR-IOV support for PVH Dom0
2026-04-09 14:01 ` [PATCH v3 5/7] vpci: add SR-IOV support for PVH Dom0 Mykyta Poturai
2026-04-09 15:27 ` Daniel P. Smith
2026-04-21 14:43 ` Jan Beulich
@ 2026-04-22 14:19 ` Roger Pau Monné
2026-04-28 20:05 ` Stewart Hildebrand
3 siblings, 0 replies; 35+ messages in thread
From: Roger Pau Monné @ 2026-04-22 14:19 UTC (permalink / raw)
To: Mykyta Poturai
Cc: xen-devel@lists.xenproject.org, Stewart Hildebrand,
Daniel P. Smith
On Thu, Apr 09, 2026 at 02:01:34PM +0000, Mykyta Poturai wrote:
> From: Stewart Hildebrand <stewart.hildebrand@amd.com>
The "From" should match the first SoB IIRC.
> This code is expected to only be used by privileged domains,
> unprivileged domains should not get access to the SR-IOV capability.
>
> Implement RW handlers for PCI_SRIOV_CTRL register to dynamically
> map/unmap VF BARS. Recalculate BAR sizes before mapping VFs to account
> for possible changes in the system page size register. Also force VFs to
> always use emulated reads for command register, this is needed to
> prevent some drivers accidentally unmapping BARs. Discovery of VFs is
> done by Dom0, which must register them with Xen.
>
> Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
> Signed-off-by: Stewart Hildebrand <stewart.hildebrand@amd.com>
> Signed-off-by: Mykyta Poturai <mykyta_poturai@epam.com>
> ---
> v2->v3:
> * rework adding VFs with multiple mapping operations support
> * use private.h
> * style fixes
> * fixup cleanup_sriov
> * emulate command register read for VFs
>
> v1->v2:
> * switch to VF discovery by Xen
> * fix sequential vpci_modify_bars calls
> * move documentation changes to a separate commit
> ---
> xen/drivers/vpci/Makefile | 1 +
> xen/drivers/vpci/header.c | 6 +-
> xen/drivers/vpci/private.h | 1 +
> xen/drivers/vpci/sriov.c | 314 +++++++++++++++++++++++++++++++++++++
> xen/include/xen/vpci.h | 7 +-
> 5 files changed, 327 insertions(+), 2 deletions(-)
> create mode 100644 xen/drivers/vpci/sriov.c
>
> diff --git a/xen/drivers/vpci/Makefile b/xen/drivers/vpci/Makefile
> index 9793a4f9b0..8b0e3c03a7 100644
> --- a/xen/drivers/vpci/Makefile
> +++ b/xen/drivers/vpci/Makefile
> @@ -1,3 +1,4 @@
> obj-y += cap.o
> obj-y += vpci.o header.o rebar.o
> +obj-y += sriov.o
> obj-$(CONFIG_HAS_PCI_MSI) += msi.o msix.o
> diff --git a/xen/drivers/vpci/header.c b/xen/drivers/vpci/header.c
> index 451cdd3a6f..a36285672e 100644
> --- a/xen/drivers/vpci/header.c
> +++ b/xen/drivers/vpci/header.c
> @@ -869,7 +869,8 @@ int vpci_init_header(struct pci_dev *pdev)
> * PCI_COMMAND_PARITY, PCI_COMMAND_SERR, and PCI_COMMAND_FAST_BACK.
> */
> rc = vpci_add_register_mask(pdev->vpci,
> - is_hwdom ? vpci_hw_read16 : guest_cmd_read,
> + is_hwdom && !pdev->info.is_virtfn
> + ? vpci_hw_read16 : guest_cmd_read,
Why are you using the guest command register handler for VFs? A
hardware domain should still be able to read the native VF hardware
value, as it knows it's a VF. Otherwise this needs a comment
explaining why it's needed.
> cmd_write, PCI_COMMAND, 2, header, 0, 0,
> is_hwdom ? 0
> : PCI_COMMAND_RSVDP_MASK |
> @@ -900,6 +901,9 @@ int vpci_init_header(struct pci_dev *pdev)
>
> header->guest_cmd = cmd;
>
> + if ( pdev->info.is_virtfn )
> + return vpci_vf_init_header(pdev);
> +
> /* Disable memory decoding before sizing. */
> if ( !is_hwdom || (cmd & PCI_COMMAND_MEMORY) )
> pci_conf_write16(pdev->sbdf, PCI_COMMAND, cmd & ~PCI_COMMAND_MEMORY);
> diff --git a/xen/drivers/vpci/private.h b/xen/drivers/vpci/private.h
> index f012fd160d..8e0043ddbe 100644
> --- a/xen/drivers/vpci/private.h
> +++ b/xen/drivers/vpci/private.h
> @@ -45,6 +45,7 @@ typedef struct {
> REGISTER_VPCI_CAPABILITY(PCI_EXT_CAP_ID_##name, name, finit, fclean, true)
>
> int __must_check vpci_init_header(struct pci_dev *pdev);
> +int __must_check vpci_vf_init_header(struct pci_dev *pdev);
>
> int vpci_init_capabilities(struct pci_dev *pdev, bool ext_only);
> void vpci_cleanup_capabilities(struct pci_dev *pdev, bool ext_only);
> diff --git a/xen/drivers/vpci/sriov.c b/xen/drivers/vpci/sriov.c
> new file mode 100644
> index 0000000000..ec6a7b84d5
> --- /dev/null
> +++ b/xen/drivers/vpci/sriov.c
> @@ -0,0 +1,314 @@
> +/* SPDX-License-Identifier: GPL-2.0-only */
> +/*
> + * Handlers for accesses to the SR-IOV capability structure.
> + *
> + * Copyright (C) 2026 Citrix Systems R&D
> + */
> +
> +#include <xen/sched.h>
> +#include <xen/vpci.h>
Newline here ...
> +#include <xsm/xsm.h>
... and here preferably.
> +#include "private.h"
> +
> +static int vf_init_bars(const struct pci_dev *vf_pdev)
For sanity's sake, vf_pdev shouldn't be const here. We modify it; it's
just that the modified fields are different allocations (pdev->vpci).
> +{
> + int vf_idx;
> + unsigned int i;
> + const struct pci_dev *pf_pdev = vf_pdev->pf_pdev;
> + struct vpci_bar *bars = vf_pdev->vpci->header.bars;
> + struct vpci_bar *physfn_vf_bars = pf_pdev->vpci->sriov->vf_bars;
> + unsigned int sriov_pos = pci_find_ext_capability(pf_pdev,
> + PCI_EXT_CAP_ID_SRIOV);
> + uint16_t offset = pci_conf_read16(pf_pdev->sbdf,
> + sriov_pos + PCI_SRIOV_VF_OFFSET);
> + uint16_t stride = pci_conf_read16(pf_pdev->sbdf,
> + sriov_pos + PCI_SRIOV_VF_STRIDE);
> +
> + vf_idx = vf_pdev->sbdf.sbdf - (pf_pdev->sbdf.sbdf + offset);
> + if ( vf_idx < 0 )
> + return -EINVAL;
> +
> + if ( stride )
> + {
> + if ( vf_idx % stride )
> + return -EINVAL;
> + vf_idx /= stride;
> + }
> +
> + /*
> + * Set up BARs for this VF out of PF's VF BARs taking into account
> + * the index of the VF.
> + */
> + for ( i = 0; i < PCI_SRIOV_NUM_BARS; i++ )
> + {
> + bars[i].addr = physfn_vf_bars[i].addr + vf_idx * physfn_vf_bars[i].size;
You only want to set .addr if the BAR is populated? IOW: if the BAR is
of type VPCI_BAR_MEM{32,64_LO}. Otherwise we are populating the .addr
field with junk.
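A hedged sketch of the guarded assignment being suggested (the enum and
struct here are simplified stand-ins for the vPCI types, not the real
declarations): only populated memory BARs get a derived address, so
unpopulated slots never inherit junk.

```c
#include <assert.h>
#include <stdint.h>

/* Simplified stand-ins for vPCI's BAR types and struct vpci_bar. */
enum bar_type { BAR_EMPTY, BAR_IO, BAR_MEM32, BAR_MEM64_LO, BAR_MEM64_HI };

struct bar {
    enum bar_type type;
    uint64_t addr, size;
};

/* Derive a VF's BARs from the PF's vf_bars: only populated memory BARs
 * get an address; empty and high-half slots keep addr == 0. */
static void vf_bars_from_pf(struct bar *vf, const struct bar *pf,
                            unsigned int nbars, unsigned int vf_idx)
{
    for ( unsigned int i = 0; i < nbars; i++ )
    {
        vf[i] = pf[i];
        if ( pf[i].type == BAR_MEM32 || pf[i].type == BAR_MEM64_LO )
            vf[i].addr = pf[i].addr + (uint64_t)vf_idx * pf[i].size;
        else
            vf[i].addr = 0; /* don't copy junk into unpopulated slots */
    }
}
```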
> + bars[i].guest_addr = bars[i].addr;
> + bars[i].size = physfn_vf_bars[i].size;
> + bars[i].type = physfn_vf_bars[i].type;
> + bars[i].prefetchable = physfn_vf_bars[i].prefetchable;
> + }
> +
> + return 0;
> +}
> +
> +static int map_vfs(const struct pci_dev *pf_pdev, uint16_t cmd)
> +{
> + struct pci_dev *vf_pdev;
> + int rc;
> +
> + ASSERT(rw_is_write_locked(&pf_pdev->domain->pci_lock));
> +
> + list_for_each_entry(vf_pdev, &pf_pdev->vf_list, vf_list)
> + {
> + rc = vpci_modify_bars(vf_pdev, cmd, false);
> + if ( rc )
> + {
> + gprintk(XENLOG_ERR, "failed to %s VF %pp: %d\n",
> + (cmd & PCI_COMMAND_MEMORY) ? "map" : "unmap",
> + &vf_pdev->sbdf, rc);
> + return rc;
> + }
> +
> + vf_pdev->vpci->header.guest_cmd &= ~PCI_COMMAND_MEMORY;
> + vf_pdev->vpci->header.guest_cmd |= (cmd & PCI_COMMAND_MEMORY);
You shouldn't be modifying guest_cmd here; the hardware domain
should see the native VF command register. The hardware domain knows
it's a VF.
> + }
> +
> + return 0;
> +}
> +
> +static int size_vf_bars(const struct pci_dev *pf_pdev, unsigned int sriov_pos,
> + uint64_t *vf_rlen)
> +{
> + struct vpci_bar *bars;
> + unsigned int i;
> + int rc = 0;
> +
> + ASSERT(rw_is_write_locked(&pf_pdev->domain->pci_lock));
> + ASSERT(!pf_pdev->info.is_virtfn);
> + ASSERT(pf_pdev->vpci->sriov);
> +
> + /* Read BARs for VFs out of PF's SR-IOV extended capability. */
> + bars = pf_pdev->vpci->sriov->vf_bars;
You can do the initialization at variable definition.
> + /* Set the BARs addresses and size. */
> + for ( i = 0; i < PCI_SRIOV_NUM_BARS; i += rc )
> + {
> + unsigned int idx = sriov_pos + PCI_SRIOV_BAR + i * 4;
> + uint32_t bar;
> + uint64_t addr, size;
> +
> + bar = pci_conf_read32(pf_pdev->sbdf, idx);
> +
> + rc = pci_size_mem_bar(pf_pdev->sbdf, idx, &addr, &size,
> + PCI_BAR_VF |
> + ((i == PCI_SRIOV_NUM_BARS - 1) ? PCI_BAR_LAST
> + : 0));
> +
> + /*
> + * Update vf_rlen on the PF. According to the spec the size of
> + * the BARs can change if the system page size register is
> + * modified, so always update rlen when enabling VFs.
> + */
> + vf_rlen[i] = size;
> +
> + if ( !size )
> + {
> + bars[i].type = VPCI_BAR_EMPTY;
> + continue;
> + }
> +
> + bars[i].addr = addr;
> + bars[i].guest_addr = addr;
> + bars[i].size = size;
> + bars[i].prefetchable = bar & PCI_BASE_ADDRESS_MEM_PREFETCH;
> +
> + switch ( rc )
> + {
> + case 1:
> + bars[i].type = VPCI_BAR_MEM32;
> + break;
> +
> + case 2:
> + bars[i].type = VPCI_BAR_MEM64_LO;
> + bars[i + 1].type = VPCI_BAR_MEM64_HI;
> + break;
> +
> + default:
> + ASSERT_UNREACHABLE();
While ASSERT_UNREACHABLE() is fine here, you still need to make sure the
code makes progress on non-debug builds, and hence rc should be set
to 1 so that the loop moves into processing the next BAR.
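The requested fallback can be sketched with a toy model of the
`i += rc` sizing loop (walk_bars() and its inputs are illustrative, not
the actual Xen code): an unexpected sizing result falls back to a
stride of 1 so a release build still terminates.

```c
#include <assert.h>

/* Toy model of the "for ( i = 0; i < n; i += rc )" sizing loop: rc is
 * normally 1 (32-bit BAR) or 2 (64-bit BAR pair); anything else is a
 * bug, but release builds must still make forward progress. */
static unsigned int walk_bars(const int *rc_per_bar, unsigned int nbars)
{
    unsigned int i, steps = 0;
    int rc;

    for ( i = 0; i < nbars; i += rc )
    {
        rc = rc_per_bar[i];
        if ( rc != 1 && rc != 2 )
        {
            /* ASSERT_UNREACHABLE() in Xen; fall back to one BAR here
             * so the loop cannot get stuck or run backwards. */
            rc = 1;
        }
        steps++;
    }
    return steps;
}
```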
> + }
> + }
> +
> + rc = rc > 0 ? 0 : rc;
This will only take into account the return code of the last
pci_size_mem_bar() call. It might be best to either properly
accumulate errors, or simply make the function void, seeing as the
only caller ignores the return value.
> +
> + return rc;
> +}
> +
> +static void cf_check control_write(const struct pci_dev *pdev, unsigned int reg,
> + uint32_t val, void *data)
> +{
> + unsigned int sriov_pos = reg - PCI_SRIOV_CTRL;
> + struct vpci_sriov *sriov = pdev->vpci->sriov;
> + uint16_t control = pci_conf_read16(pdev->sbdf, reg);
> + bool mem_enabled = control & PCI_SRIOV_CTRL_MSE;
> + bool new_mem_enabled = val & PCI_SRIOV_CTRL_MSE;
> + bool enabled = control & PCI_SRIOV_CTRL_VFE;
> + bool new_enabled = val & PCI_SRIOV_CTRL_VFE;
> + int rc;
> +
> + ASSERT(!pdev->info.is_virtfn);
> +
> + if ( new_enabled == enabled && new_mem_enabled == mem_enabled )
> + {
> + pci_conf_write16(pdev->sbdf, reg, val);
> + return;
> + }
> +
> + if ( mem_enabled && !new_mem_enabled )
> + map_vfs(pdev, 0);
> +
> + if ( !enabled && new_enabled )
> + {
> + size_vf_bars(pdev, sriov_pos, (uint64_t *)data);
> +
> + /*
> + * Only update the number of active VFs when enabling, when
> + * disabling use the cached value in order to always remove the same
> + * number of VFs that were active.
> + */
> + sriov->num_vfs = pci_conf_read16(pdev->sbdf,
> + sriov_pos + PCI_SRIOV_NUM_VF);
> + }
> +
> + if ( !mem_enabled && new_mem_enabled )
> + {
> + rc = map_vfs(pdev, PCI_COMMAND_MEMORY);
> +
> + if ( rc )
> + map_vfs(pdev, 0);
> + }
> +
> + pci_conf_write16(pdev->sbdf, reg, val);
> +}
> +
> +int vpci_vf_init_header(struct pci_dev *vf_pdev)
> +{
> + const struct pci_dev *pf_pdev;
> + unsigned int sriov_pos;
> + int rc = 0;
> + uint16_t ctrl;
> +
> + ASSERT(rw_is_write_locked(&vf_pdev->domain->pci_lock));
> +
> + if ( !vf_pdev->info.is_virtfn )
> + return 0;
> +
> + pf_pdev = vf_pdev->pf_pdev;
> + ASSERT(pf_pdev);
> +
> + rc = vf_init_bars(vf_pdev);
> + if ( rc )
> + return rc;
> +
> + sriov_pos = pci_find_ext_capability(pf_pdev, PCI_EXT_CAP_ID_SRIOV);
> + ctrl = pci_conf_read16(pf_pdev->sbdf, sriov_pos + PCI_SRIOV_CTRL);
> +
> + if ( (pf_pdev->domain == vf_pdev->domain) && (ctrl & PCI_SRIOV_CTRL_MSE) )
> + {
> + rc = vpci_modify_bars(vf_pdev, PCI_COMMAND_MEMORY, false);
> + if ( rc )
> + return rc;
> +
> + vf_pdev->vpci->header.guest_cmd |= PCI_COMMAND_MEMORY;
> + }
> +
> + return rc;
> +}
> +
> +static int cf_check init_sriov(struct pci_dev *pdev)
> +{
> + unsigned int pos;
> +
> + ASSERT(!pdev->info.is_virtfn);
> +
> + pos = pci_find_ext_capability(pdev, PCI_EXT_CAP_ID_SRIOV);
> +
> + if ( !pos )
> + return 0;
> +
> + if ( xsm_resource_setup_pci(XSM_PRIV, pdev->sbdf.bdf) )
> + {
> + printk(XENLOG_ERR
> + "%pp: SR-IOV configuration unsupported for unpriv %pd\n",
> + &pdev->sbdf, pdev->domain);
> + return -EACCES;
> + }
> +
> + pdev->vpci->sriov = xzalloc(struct vpci_sriov);
> + if ( !pdev->vpci->sriov )
> + return -ENOMEM;
> +
> + /*
> + * We need to modify vf_rlen in control_write but we can't do it cleanly
> + * from pdev because write callback only accepts const pdev. Moving vf_rlen
> + * inside of struct vpci_sriov is also not possible because it is used
> + * before vpci init. So pass it here as additional data to not require
> + * dropping const in control_write.
> + */
> + return vpci_add_register(pdev->vpci, vpci_hw_read16, control_write,
> + pos + PCI_SRIOV_CTRL, 2, &pdev->physfn.vf_rlen);
> +}
> +
> +static int cf_check cleanup_sriov(const struct pci_dev *pdev, bool hide)
> +{
> + unsigned int pos;
> + int rc;
> +
> + if ( !pdev->vpci->sriov )
> + return 0;
> +
> + ASSERT(!pdev->info.is_virtfn);
> + ASSERT(list_empty(&pdev->vf_list));
> +
> + if ( !hide )
> + {
> + XFREE(pdev->vpci->sriov);
> + return 0;
> + }
> +
> + pos = pci_find_ext_capability(pdev, PCI_EXT_CAP_ID_SRIOV);
You need to rework this a bit, after some recent changes access to the
ECAM space could disappear if dom0 signals that it's not OK to use
whatever memory range Xen was using. Hence the position of the SR-IOV
capability needs to be cached at init time, so it can be removed here
regardless of whether access to ECAM has been lost/revoked.
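One way to realize this, sketched with hypothetical stand-ins
(find_ext_cap(), ecam_ok, and the 0x160 offset are all illustrative):
cache the capability offset in the SR-IOV state at init time, so
cleanup never depends on config space still being accessible.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical SR-IOV state carrying the cached capability offset. */
struct sriov_state {
    unsigned int pos;
};

static int ecam_ok = 1;

/* Models pci_find_ext_capability(): returns 0 once access to the ECAM
 * range has been lost/revoked. */
static unsigned int find_ext_cap(void)
{
    return ecam_ok ? 0x160 : 0;
}

static struct sriov_state *sriov_init(void)
{
    struct sriov_state *s = calloc(1, sizeof(*s));

    if ( s )
        s->pos = find_ext_cap(); /* cached while ECAM still works */
    return s;
}

/* Cleanup uses the cached offset instead of a fresh config-space
 * lookup, so it works even after ECAM access disappears. */
static unsigned int sriov_cleanup_pos(const struct sriov_state *s)
{
    return s->pos;
}
```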
> + rc = vpci_remove_registers(pdev->vpci, pos + PCI_SRIOV_CTRL, 2);
> + if ( rc )
> + {
> + printk(XENLOG_ERR "%pd %pp: fail to remove SRIOV handlers rc=%d\n",
> + pdev->domain, &pdev->sbdf, rc);
> + ASSERT_UNREACHABLE();
> + return rc;
> + }
> + XFREE(pdev->vpci->sriov);
> +
> + /*
> + * Unprivileged domains have a deny by default register access policy, no
> + * need to add any further handlers for them.
> + */
> + if ( !is_hardware_domain(pdev->domain) )
> + return 0;
> +
> + rc = vpci_add_register(pdev->vpci, vpci_hw_read16, NULL,
> + pos + PCI_SRIOV_CTRL, 2, NULL);
> + if ( rc )
> + printk(XENLOG_ERR "%pd %pp: fail to add SRIOV ctrl handler rc=%d\n",
> + pdev->domain, &pdev->sbdf, rc);
> +
> + return rc;
> +}
> +
> +REGISTER_VPCI_EXTCAP(SRIOV, init_sriov, cleanup_sriov);
> +
> +/*
> + * Local variables:
> + * mode: C
> + * c-file-style: "BSD"
> + * c-basic-offset: 4
> + * tab-width: 4
> + * indent-tabs-mode: nil
> + * End:
> + */
> diff --git a/xen/include/xen/vpci.h b/xen/include/xen/vpci.h
> index e34f7abe6d..fc8a943abe 100644
> --- a/xen/include/xen/vpci.h
> +++ b/xen/include/xen/vpci.h
> @@ -84,7 +84,6 @@ struct vpci {
> * upon to know whether BARs are mapped into the guest p2m.
> */
> bool bars_mapped : 1;
> - /* FIXME: currently there's no support for SR-IOV. */
You also want to add a note to CHANGELOG mentioning the SR-IOV
support addition.
Thanks, Roger.
* Re: [PATCH v3 3/7] vpci: Use pervcpu ranges for BAR mapping
2026-04-22 12:04 ` Jan Beulich
@ 2026-04-22 14:20 ` Roger Pau Monné
0 siblings, 0 replies; 35+ messages in thread
From: Roger Pau Monné @ 2026-04-22 14:20 UTC (permalink / raw)
To: Jan Beulich
Cc: xen-devel@lists.xenproject.org, Andrew Cooper, Anthony PERARD,
Michal Orzel, Julien Grall, Stefano Stabellini,
Stewart Hildebrand, Mykyta Poturai
On Wed, Apr 22, 2026 at 02:04:59PM +0200, Jan Beulich wrote:
> On 22.04.2026 13:00, Roger Pau Monné wrote:
> > On Thu, Apr 09, 2026 at 02:01:33PM +0000, Mykyta Poturai wrote:
> >> @@ -412,14 +414,14 @@ int vpci_modify_bars(const struct pci_dev *pdev, uint16_t cmd, bool rom_only)
> >> unsigned long end = PFN_DOWN(vmsix_table_addr(pdev->vpci, i) +
> >> vmsix_table_size(pdev->vpci, i) - 1);
> >>
> >> - for ( j = 0; j < ARRAY_SIZE(header->bars); j++ )
> >> + for ( j = 0; j < ARRAY_SIZE(current->vpci.mem); j++ )
> >
> > You make a non-trivial use of current in vpci_modify_bars(), maybe you
> > should consider introducing a local variable for it:
> >
> > struct vcpu *curr = current;
>
> Without any intention to negate this, ...
>
> > current expands to a call to get_cpu_info(), which is better to avoid
> > doing repeatedly, especially in the context above, where it is used as
> > a loop upper bound.
>
> ... I'd like to point out that "current" isn't evaluated when used by
> ARRAY_SIZE() (resolving to two uses of sizeof()).
Oh indeed, this is just for getting the type.
Thanks, Roger.
* Re: [PATCH v3 5/7] vpci: add SR-IOV support for PVH Dom0
2026-04-21 14:43 ` Jan Beulich
@ 2026-04-23 10:12 ` Mykyta Poturai
2026-05-04 5:37 ` Jan Beulich
0 siblings, 1 reply; 35+ messages in thread
From: Mykyta Poturai @ 2026-04-23 10:12 UTC (permalink / raw)
To: Jan Beulich, Stewart Hildebrand
Cc: Roger Pau Monné, Daniel P. Smith,
xen-devel@lists.xenproject.org
On 4/21/26 17:43, Jan Beulich wrote:
> On 09.04.2026 16:01, Mykyta Poturai wrote:
>> From: Stewart Hildebrand <stewart.hildebrand@amd.com>
>>
>> This code is expected to only be used by privileged domains,
>> unprivileged domains should not get access to the SR-IOV capability.
>>
>> Implement RW handlers for PCI_SRIOV_CTRL register to dynamically
>> map/unmap VF BARS. Recalculate BAR sizes before mapping VFs to account
>> for possible changes in the system page size register. Also force VFs to
>> always use emulated reads for command register, this is needed to
>> prevent some drivers accidentally unmapping BARs.
>
> This apparently refers to the change to vpci_init_header(). Writes are
> already intercepted. How would a read lead to accidental BAR unmap? Even
> for writes I don't see how a VF driver could accidentally unmap BARs, as
> the memory decode bit there is hardwired to 0.
>
>> Discovery of VFs is
>> done by Dom0, which must register them with Xen.
>
> If we intercept control register writes, why would we still require
> Dom0 to report the VFs that appear?
>
Sorry, I don't understand this question. You specifically requested this
to be done this way in V2. Quoting your reply from V2 below.
> Aren't you effectively busy-waiting for these 100ms, by simply
> returning "true" from vpci_process_pending() until the time has
> passed? This imo is a no-go. You want to set a timer and put the
> vCPU to sleep, to wake it up again when the timer has expired.
> That'll then eliminate the need for the not-so-nice patch 4.
> Question is whether we need to actually go this far (right away). I
> expect you don't mean to hand PFs to DomU-s. As long as we keep them
> in the hardware domain, can't we trust it to set things up correctly,
> just like we trust it in a number of other aspects?
> Jan
>> +static int map_vfs(const struct pci_dev *pf_pdev, uint16_t cmd)
>> +{
>> + struct pci_dev *vf_pdev;
>> + int rc;
>> +
>> + ASSERT(rw_is_write_locked(&pf_pdev->domain->pci_lock));
>> +
>> + list_for_each_entry(vf_pdev, &pf_pdev->vf_list, vf_list)
>> + {
>> + rc = vpci_modify_bars(vf_pdev, cmd, false);
>> + if ( rc )
>> + {
>> + gprintk(XENLOG_ERR, "failed to %s VF %pp: %d\n",
>> + (cmd & PCI_COMMAND_MEMORY) ? "map" : "unmap",
>> + &vf_pdev->sbdf, rc);
>> + return rc;
>> + }
>> +
>> + vf_pdev->vpci->header.guest_cmd &= ~PCI_COMMAND_MEMORY;
>> + vf_pdev->vpci->header.guest_cmd |= (cmd & PCI_COMMAND_MEMORY);
>
> As mentioned elsewhere as well, this bit is supposed to be 0 for VFs.
There are some devices that expose VFs with the same VID/DID as the
PF, causing Linux to use a normal driver for them and treat them like
normal devices. At some point, those normal drivers try to do a
read-modify-write of the command register and end up writing 0 to
PCI_COMMAND_MEMORY, causing cmd_write to unmap the BARs of that device.
I am not sure; maybe it would be better to just ignore cmd writes for VFs?
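That failure mode can be modeled in a few lines (the register values
and helper names are illustrative): the VF's hardware memory-decode bit
reads as 0, so a driver's read-modify-write on the raw value writes
back 0 for PCI_COMMAND_MEMORY even though it only meant to flip another
bit.

```c
#include <assert.h>
#include <stdint.h>

#define PCI_COMMAND_MEMORY 0x0002
#define PCI_COMMAND_MASTER 0x0004

/* On a VF the memory-decode bit is hardwired to 0 in hardware, so a
 * pass-through read never shows it set. */
static uint16_t vf_hw_cmd_read(void)
{
    return PCI_COMMAND_MASTER; /* MSE (bit 1) reads as 0 */
}

/* A generic driver enabling bus mastering via read-modify-write of the
 * raw hardware value: the value it writes back has PCI_COMMAND_MEMORY
 * clear, which an intercepting cmd_write would take as "unmap BARs". */
static uint16_t driver_rmw_set_master(void)
{
    uint16_t cmd = vf_hw_cmd_read();

    cmd |= PCI_COMMAND_MASTER;
    return cmd;
}
```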
>> + sriov_pos = pci_find_ext_capability(pf_pdev, PCI_EXT_CAP_ID_SRIOV);
>> + ctrl = pci_conf_read16(pf_pdev->sbdf, sriov_pos + PCI_SRIOV_CTRL);
>> +
>> + if ( (pf_pdev->domain == vf_pdev->domain) && (ctrl & PCI_SRIOV_CTRL_MSE) )
>> + {
>> + rc = vpci_modify_bars(vf_pdev, PCI_COMMAND_MEMORY, false);
>
> Doesn't VF enable also need to be set for the BARs to be mapped?
I don't think so. Enabling memory space maps naturally onto mapping
memory into the guest. I don't see any benefit in also requiring the
VFE bit here.
--
Mykyta
* Re: [PATCH v3 6/7] vpci: add SR-IOV support for DomUs
2026-04-21 14:55 ` Jan Beulich
@ 2026-04-24 6:34 ` Mykyta Poturai
0 siblings, 0 replies; 35+ messages in thread
From: Mykyta Poturai @ 2026-04-24 6:34 UTC (permalink / raw)
To: Jan Beulich, Stewart Hildebrand
Cc: Roger Pau Monné, xen-devel@lists.xenproject.org
On 4/21/26 17:55, Jan Beulich wrote:
> On 09.04.2026 16:01, Mykyta Poturai wrote:
>> From: Stewart Hildebrand <stewart.hildebrand@amd.com>
>>
>> Emulate guest BAR register values based on PF BAR values for VFs.
>> This allows creating a guest view of the normal BAR registers and emulates
>> the size and properties as it is done during PCI device enumeration by
>> the guest.
>>
>> Expose VID/DID and class/revision to the guest.
>>
>> Note that the VF's ROM BAR is read-only and all zeros; a VF may provide
>> access to the PF's ROM via emulation, but this is not implemented.
>>
>> Signed-off-by: Stewart Hildebrand <stewart.hildebrand@amd.com>
>> Signed-off-by: Mykyta Poturai <mykyta_poturai@epam.com>
>
> So this change is merely to avoid having yet another gap on the road to
> DomU support in vPCI? I.e. there's no claim or expectation that VFs
> could now be used in DomU-s?
>
> Jan
Yes, all of my DomU tests were done with extra patches for DomU VPCI
support that are not upstreamed yet.
--
Mykyta
* Re: [PATCH v3 5/7] vpci: add SR-IOV support for PVH Dom0
2026-04-09 14:01 ` [PATCH v3 5/7] vpci: add SR-IOV support for PVH Dom0 Mykyta Poturai
` (2 preceding siblings ...)
2026-04-22 14:19 ` Roger Pau Monné
@ 2026-04-28 20:05 ` Stewart Hildebrand
3 siblings, 0 replies; 35+ messages in thread
From: Stewart Hildebrand @ 2026-04-28 20:05 UTC (permalink / raw)
To: Mykyta Poturai, xen-devel@lists.xenproject.org
Cc: Roger Pau Monné, Daniel P. Smith
On 4/9/26 10:01, Mykyta Poturai wrote:
> From: Stewart Hildebrand <stewart.hildebrand@amd.com>
>
> This code is expected to only be used by privileged domains,
> unprivileged domains should not get access to the SR-IOV capability.
>
> Implement RW handlers for PCI_SRIOV_CTRL register to dynamically
> map/unmap VF BARS. Recalculate BAR sizes before mapping VFs to account
> for possible changes in the system page size register. Also force VFs to
> always use emulated reads for command register, this is needed to
> prevent some drivers accidentally unmapping BARs. Discovery of VFs is
> done by Dom0, which must register them with Xen.
>
> Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
> Signed-off-by: Stewart Hildebrand <stewart.hildebrand@amd.com>
> Signed-off-by: Mykyta Poturai <mykyta_poturai@epam.com>
> ---
> v2->v3:
> * rework adding VFs with multiple mapping operations support
> * use private.h
> * style fixes
> * fixup cleanup_sriov
> * emulate command register read for VFs
Also noteworthy: switched back to no longer adding VFs within Xen.
>
> v1->v2:
> * switch to VF discovery by Xen
> * fix sequential vpci_modify_bars calls
> * move documentation changes to a separate commit
> ---
> xen/drivers/vpci/Makefile | 1 +
> xen/drivers/vpci/header.c | 6 +-
> xen/drivers/vpci/private.h | 1 +
> xen/drivers/vpci/sriov.c | 314 +++++++++++++++++++++++++++++++++++++
> xen/include/xen/vpci.h | 7 +-
> 5 files changed, 327 insertions(+), 2 deletions(-)
> create mode 100644 xen/drivers/vpci/sriov.c
>
> diff --git a/xen/drivers/vpci/Makefile b/xen/drivers/vpci/Makefile
> index 9793a4f9b0..8b0e3c03a7 100644
> --- a/xen/drivers/vpci/Makefile
> +++ b/xen/drivers/vpci/Makefile
> @@ -1,3 +1,4 @@
> obj-y += cap.o
> obj-y += vpci.o header.o rebar.o
> +obj-y += sriov.o
> obj-$(CONFIG_HAS_PCI_MSI) += msi.o msix.o
> diff --git a/xen/drivers/vpci/header.c b/xen/drivers/vpci/header.c
> index 451cdd3a6f..a36285672e 100644
> --- a/xen/drivers/vpci/header.c
> +++ b/xen/drivers/vpci/header.c
> @@ -869,7 +869,8 @@ int vpci_init_header(struct pci_dev *pdev)
> * PCI_COMMAND_PARITY, PCI_COMMAND_SERR, and PCI_COMMAND_FAST_BACK.
> */
> rc = vpci_add_register_mask(pdev->vpci,
> - is_hwdom ? vpci_hw_read16 : guest_cmd_read,
> + is_hwdom && !pdev->info.is_virtfn
> + ? vpci_hw_read16 : guest_cmd_read,
> cmd_write, PCI_COMMAND, 2, header, 0, 0,
> is_hwdom ? 0
> : PCI_COMMAND_RSVDP_MASK |
> @@ -900,6 +901,9 @@ int vpci_init_header(struct pci_dev *pdev)
>
> header->guest_cmd = cmd;
>
> + if ( pdev->info.is_virtfn )
> + return vpci_vf_init_header(pdev);
> +
> /* Disable memory decoding before sizing. */
> if ( !is_hwdom || (cmd & PCI_COMMAND_MEMORY) )
> pci_conf_write16(pdev->sbdf, PCI_COMMAND, cmd & ~PCI_COMMAND_MEMORY);
> diff --git a/xen/drivers/vpci/private.h b/xen/drivers/vpci/private.h
> index f012fd160d..8e0043ddbe 100644
> --- a/xen/drivers/vpci/private.h
> +++ b/xen/drivers/vpci/private.h
> @@ -45,6 +45,7 @@ typedef struct {
> REGISTER_VPCI_CAPABILITY(PCI_EXT_CAP_ID_##name, name, finit, fclean, true)
>
> int __must_check vpci_init_header(struct pci_dev *pdev);
> +int __must_check vpci_vf_init_header(struct pci_dev *pdev);
>
> int vpci_init_capabilities(struct pci_dev *pdev, bool ext_only);
> void vpci_cleanup_capabilities(struct pci_dev *pdev, bool ext_only);
> diff --git a/xen/drivers/vpci/sriov.c b/xen/drivers/vpci/sriov.c
> new file mode 100644
> index 0000000000..ec6a7b84d5
> --- /dev/null
> +++ b/xen/drivers/vpci/sriov.c
> @@ -0,0 +1,314 @@
> +/* SPDX-License-Identifier: GPL-2.0-only */
> +/*
> + * Handlers for accesses to the SR-IOV capability structure.
> + *
> + * Copyright (C) 2026 Citrix Systems R&D
> + */
> +
> +#include <xen/sched.h>
> +#include <xen/vpci.h>
> +#include <xsm/xsm.h>
> +#include "private.h"
Should "private.h" move up so it's first?
> +
> +static int vf_init_bars(const struct pci_dev *vf_pdev)
> +{
> + int vf_idx;
> + unsigned int i;
> + const struct pci_dev *pf_pdev = vf_pdev->pf_pdev;
> + struct vpci_bar *bars = vf_pdev->vpci->header.bars;
> + struct vpci_bar *physfn_vf_bars = pf_pdev->vpci->sriov->vf_bars;
> + unsigned int sriov_pos = pci_find_ext_capability(pf_pdev,
> + PCI_EXT_CAP_ID_SRIOV);
> + uint16_t offset = pci_conf_read16(pf_pdev->sbdf,
> + sriov_pos + PCI_SRIOV_VF_OFFSET);
> + uint16_t stride = pci_conf_read16(pf_pdev->sbdf,
> + sriov_pos + PCI_SRIOV_VF_STRIDE);
> +
> + vf_idx = vf_pdev->sbdf.sbdf - (pf_pdev->sbdf.sbdf + offset);
> + if ( vf_idx < 0 )
> + return -EINVAL;
> +
> + if ( stride )
> + {
> + if ( vf_idx % stride )
> + return -EINVAL;
> + vf_idx /= stride;
> + }
> +
> + /*
> + * Set up BARs for this VF out of PF's VF BARs taking into account
> + * the index of the VF.
> + */
> + for ( i = 0; i < PCI_SRIOV_NUM_BARS; i++ )
> + {
> + bars[i].addr = physfn_vf_bars[i].addr + vf_idx * physfn_vf_bars[i].size;
> + bars[i].guest_addr = bars[i].addr;
We don't want to leak the host address to domUs, i.e. you might consider doing
something like:
bars[i].guest_addr = is_hardware_domain(d) ? bars[i].addr : 0;
> + bars[i].size = physfn_vf_bars[i].size;
> + bars[i].type = physfn_vf_bars[i].type;
> + bars[i].prefetchable = physfn_vf_bars[i].prefetchable;
> + }
> +
> + return 0;
> +}
> +
> +static int map_vfs(const struct pci_dev *pf_pdev, uint16_t cmd)
> +{
> + struct pci_dev *vf_pdev;
> + int rc;
> +
> + ASSERT(rw_is_write_locked(&pf_pdev->domain->pci_lock));
> +
> + list_for_each_entry(vf_pdev, &pf_pdev->vf_list, vf_list)
> + {
> + rc = vpci_modify_bars(vf_pdev, cmd, false);
> + if ( rc )
> + {
> + gprintk(XENLOG_ERR, "failed to %s VF %pp: %d\n",
> + (cmd & PCI_COMMAND_MEMORY) ? "map" : "unmap",
> + &vf_pdev->sbdf, rc);
> + return rc;
> + }
> +
> + vf_pdev->vpci->header.guest_cmd &= ~PCI_COMMAND_MEMORY;
> + vf_pdev->vpci->header.guest_cmd |= (cmd & PCI_COMMAND_MEMORY);
> + }
> +
> + return 0;
> +}
> +
> +static int size_vf_bars(const struct pci_dev *pf_pdev, unsigned int sriov_pos,
> + uint64_t *vf_rlen)
> +{
> + struct vpci_bar *bars;
> + unsigned int i;
> + int rc = 0;
> +
> + ASSERT(rw_is_write_locked(&pf_pdev->domain->pci_lock));
> + ASSERT(!pf_pdev->info.is_virtfn);
> + ASSERT(pf_pdev->vpci->sriov);
> +
> + /* Read BARs for VFs out of PF's SR-IOV extended capability. */
> + bars = pf_pdev->vpci->sriov->vf_bars;
> + /* Set the BARs addresses and size. */
> + for ( i = 0; i < PCI_SRIOV_NUM_BARS; i += rc )
> + {
> + unsigned int idx = sriov_pos + PCI_SRIOV_BAR + i * 4;
> + uint32_t bar;
> + uint64_t addr, size;
> +
> + bar = pci_conf_read32(pf_pdev->sbdf, idx);
> +
> + rc = pci_size_mem_bar(pf_pdev->sbdf, idx, &addr, &size,
> + PCI_BAR_VF |
> + ((i == PCI_SRIOV_NUM_BARS - 1) ? PCI_BAR_LAST
> + : 0));
> +
> + /*
> + * Update vf_rlen on the PF. According to the spec the size of
> + * the BARs can change if the system page size register is
> + * modified, so always update rlen when enabling VFs.
> + */
> + vf_rlen[i] = size;
> +
> + if ( !size )
> + {
> + bars[i].type = VPCI_BAR_EMPTY;
> + continue;
> + }
> +
> + bars[i].addr = addr;
> + bars[i].guest_addr = addr;
> + bars[i].size = size;
> + bars[i].prefetchable = bar & PCI_BASE_ADDRESS_MEM_PREFETCH;
> +
> + switch ( rc )
> + {
> + case 1:
> + bars[i].type = VPCI_BAR_MEM32;
> + break;
> +
> + case 2:
> + bars[i].type = VPCI_BAR_MEM64_LO;
> + bars[i + 1].type = VPCI_BAR_MEM64_HI;
> + break;
> +
> + default:
> + ASSERT_UNREACHABLE();
> + }
> + }
> +
> + rc = rc > 0 ? 0 : rc;
> +
> + return rc;
> +}
> +
> +static void cf_check control_write(const struct pci_dev *pdev, unsigned int reg,
> + uint32_t val, void *data)
> +{
> + unsigned int sriov_pos = reg - PCI_SRIOV_CTRL;
> + struct vpci_sriov *sriov = pdev->vpci->sriov;
> + uint16_t control = pci_conf_read16(pdev->sbdf, reg);
> + bool mem_enabled = control & PCI_SRIOV_CTRL_MSE;
> + bool new_mem_enabled = val & PCI_SRIOV_CTRL_MSE;
> + bool enabled = control & PCI_SRIOV_CTRL_VFE;
> + bool new_enabled = val & PCI_SRIOV_CTRL_VFE;
> + int rc;
> +
> + ASSERT(!pdev->info.is_virtfn);
> +
> + if ( new_enabled == enabled && new_mem_enabled == mem_enabled )
> + {
> + pci_conf_write16(pdev->sbdf, reg, val);
> + return;
> + }
> +
> + if ( mem_enabled && !new_mem_enabled )
> + map_vfs(pdev, 0);
> +
> + if ( !enabled && new_enabled )
> + {
> + size_vf_bars(pdev, sriov_pos, (uint64_t *)data);
> +
> + /*
> + * Only update the number of active VFs when enabling, when
> + * disabling use the cached value in order to always remove the same
> + * number of VFs that were active.
> + */
> + sriov->num_vfs = pci_conf_read16(pdev->sbdf,
> + sriov_pos + PCI_SRIOV_NUM_VF);
> + }
> +
> + if ( !mem_enabled && new_mem_enabled )
> + {
> + rc = map_vfs(pdev, PCI_COMMAND_MEMORY);
> +
> + if ( rc )
> + map_vfs(pdev, 0);
> + }
> +
> + pci_conf_write16(pdev->sbdf, reg, val);
> +}
> +
> +int vpci_vf_init_header(struct pci_dev *vf_pdev)
> +{
> + const struct pci_dev *pf_pdev;
> + unsigned int sriov_pos;
> + int rc = 0;
> + uint16_t ctrl;
> +
> + ASSERT(rw_is_write_locked(&vf_pdev->domain->pci_lock));
> +
> + if ( !vf_pdev->info.is_virtfn )
> + return 0;
> +
> + pf_pdev = vf_pdev->pf_pdev;
> + ASSERT(pf_pdev);
> +
> + rc = vf_init_bars(vf_pdev);
> + if ( rc )
> + return rc;
> +
> + sriov_pos = pci_find_ext_capability(pf_pdev, PCI_EXT_CAP_ID_SRIOV);
> + ctrl = pci_conf_read16(pf_pdev->sbdf, sriov_pos + PCI_SRIOV_CTRL);
> +
> + if ( (pf_pdev->domain == vf_pdev->domain) && (ctrl & PCI_SRIOV_CTRL_MSE) )
> + {
> + rc = vpci_modify_bars(vf_pdev, PCI_COMMAND_MEMORY, false);
> + if ( rc )
> + return rc;
> +
> + vf_pdev->vpci->header.guest_cmd |= PCI_COMMAND_MEMORY;
> + }
> +
> + return rc;
> +}
> +
> +static int cf_check init_sriov(struct pci_dev *pdev)
> +{
> + unsigned int pos;
> +
> + ASSERT(!pdev->info.is_virtfn);
> +
> + pos = pci_find_ext_capability(pdev, PCI_EXT_CAP_ID_SRIOV);
> +
> + if ( !pos )
> + return 0;
> +
> + if ( xsm_resource_setup_pci(XSM_PRIV, pdev->sbdf.bdf) )
> + {
> + printk(XENLOG_ERR
> + "%pp: SR-IOV configuration unsupported for unpriv %pd\n",
> + &pdev->sbdf, pdev->domain);
> + return -EACCES;
> + }
> +
> + pdev->vpci->sriov = xzalloc(struct vpci_sriov);
Don't we prefer xvzalloc now? If so, changing this should be paired with a
corresponding XVFREE in the cleanup handler.
> + if ( !pdev->vpci->sriov )
> + return -ENOMEM;
> +
> + /*
> + * We need to modify vf_rlen in control_write but we can't do it cleanly
> + * from pdev because write callback only accepts const pdev. Moving vf_rlen
> + * inside of struct vpci_sriov is also not possible because it is used
> + * before vpci init. So pass it here as additional data to not require
> + * dropping const in control_write.
> + */
> + return vpci_add_register(pdev->vpci, vpci_hw_read16, control_write,
> + pos + PCI_SRIOV_CTRL, 2, &pdev->physfn.vf_rlen);
> +}
> +
> +static int cf_check cleanup_sriov(const struct pci_dev *pdev, bool hide)
> +{
> + unsigned int pos;
> + int rc;
> +
> + if ( !pdev->vpci->sriov )
> + return 0;
> +
> + ASSERT(!pdev->info.is_virtfn);
> + ASSERT(list_empty(&pdev->vf_list));
> +
> + if ( !hide )
> + {
> + XFREE(pdev->vpci->sriov);
> + return 0;
> + }
> +
> + pos = pci_find_ext_capability(pdev, PCI_EXT_CAP_ID_SRIOV);
> + rc = vpci_remove_registers(pdev->vpci, pos + PCI_SRIOV_CTRL, 2);
> + if ( rc )
> + {
> + printk(XENLOG_ERR "%pd %pp: fail to remove SRIOV handlers rc=%d\n",
> + pdev->domain, &pdev->sbdf, rc);
> + ASSERT_UNREACHABLE();
> + return rc;
> + }
> + XFREE(pdev->vpci->sriov);
> +
> + /*
> + * Unprivileged domains have a deny by default register access policy, no
> + * need to add any further handlers for them.
> + */
> + if ( !is_hardware_domain(pdev->domain) )
> + return 0;
> +
> + rc = vpci_add_register(pdev->vpci, vpci_hw_read16, NULL,
> + pos + PCI_SRIOV_CTRL, 2, NULL);
> + if ( rc )
> + printk(XENLOG_ERR "%pd %pp: fail to add SRIOV ctrl handler rc=%d\n",
> + pdev->domain, &pdev->sbdf, rc);
> +
> + return rc;
> +}
> +
> +REGISTER_VPCI_EXTCAP(SRIOV, init_sriov, cleanup_sriov);
> +
> +/*
> + * Local variables:
> + * mode: C
> + * c-file-style: "BSD"
> + * c-basic-offset: 4
> + * tab-width: 4
> + * indent-tabs-mode: nil
> + * End:
> + */
> diff --git a/xen/include/xen/vpci.h b/xen/include/xen/vpci.h
> index e34f7abe6d..fc8a943abe 100644
> --- a/xen/include/xen/vpci.h
> +++ b/xen/include/xen/vpci.h
> @@ -84,7 +84,6 @@ struct vpci {
> * upon to know whether BARs are mapped into the guest p2m.
> */
> bool bars_mapped : 1;
> - /* FIXME: currently there's no support for SR-IOV. */
> } header;
>
> /* MSI data. */
> @@ -145,6 +144,12 @@ struct vpci {
> unsigned int nbars:3;
> } rebar;
>
> + struct vpci_sriov {
> + /* PF only */
> + struct vpci_bar vf_bars[PCI_SRIOV_NUM_BARS];
> + uint16_t num_vfs;
> + } *sriov;
> +
> #ifdef CONFIG_HAS_VPCI_GUEST_SUPPORT
> /* Guest SBDF of the device. */
> #define INVALID_GUEST_SBDF ((pci_sbdf_t){ .sbdf = ~0U })
^ permalink raw reply [flat|nested] 35+ messages in thread
* Re: [PATCH v3 5/7] vpci: add SR-IOV support for PVH Dom0
2026-04-23 10:12 ` Mykyta Poturai
@ 2026-05-04 5:37 ` Jan Beulich
2026-05-06 9:39 ` Mykyta Poturai
0 siblings, 1 reply; 35+ messages in thread
From: Jan Beulich @ 2026-05-04 5:37 UTC (permalink / raw)
To: Mykyta Poturai
Cc: Roger Pau Monné, Daniel P. Smith,
xen-devel@lists.xenproject.org, Stewart Hildebrand
On 23.04.2026 12:12, Mykyta Poturai wrote:
> On 4/21/26 17:43, Jan Beulich wrote:
>> On 09.04.2026 16:01, Mykyta Poturai wrote:
>>> From: Stewart Hildebrand <stewart.hildebrand@amd.com>
>>>
>>> This code is expected to only be used by privileged domains;
>>> unprivileged domains should not get access to the SR-IOV capability.
>>>
>>> Implement RW handlers for the PCI_SRIOV_CTRL register to dynamically
>>> map/unmap VF BARs. Recalculate BAR sizes before mapping VFs to account
>>> for possible changes in the system page size register. Also force VFs to
>>> always use emulated reads for the command register; this is needed to
>>> prevent some drivers accidentally unmapping BARs.
>>
>> This apparently refers to the change to vpci_init_header(). Writes are
>> already intercepted. How would a read lead to accidental BAR unmap? Even
>> for writes I don't see how a VF driver could accidentally unmap BARs, as
>> the memory decode bit there is hardwired to 0.
>>
>>> Discovery of VFs is
>>> done by Dom0, which must register them with Xen.
>>
>> If we intercept control register writes, why would we still require
>> Dom0 to report the VFs that appear?
>>
>
> Sorry, I don't understand this question. You specifically requested this
> to be done this way in V2. Quoting your reply from V2 below.
>
> > Aren't you effectively busy-waiting for these 100ms, by simply
> returning "true"
> > from vpci_process_pending() until the time has passed? This imo is a
> no-go. You
> > want to set a timer and put the vCPU to sleep, to wake it up again
> when the
> > timer has expired. That'll then eliminate the need for the
> not-so-nice patch 4.
>
> > Question is whether we need to actually go this far (right away). I
> expect you
> > don't mean to hand PFs to DomU-s. As long as we keep them in the hardware
> > domain, can't we trust it to set things up correctly, just like we
> trust it in
> > a number of other aspects?
How's any of this related to the question I raised here, or your reply
thereto? If we intercept PCI_SRIOV_CTRL, we know when VFs are created.
Why still demand Dom0 to report them then?
>>> +static int map_vfs(const struct pci_dev *pf_pdev, uint16_t cmd)
>>> +{
>>> + struct pci_dev *vf_pdev;
>>> + int rc;
>>> +
>>> + ASSERT(rw_is_write_locked(&pf_pdev->domain->pci_lock));
>>> +
>>> + list_for_each_entry(vf_pdev, &pf_pdev->vf_list, vf_list)
>>> + {
>>> + rc = vpci_modify_bars(vf_pdev, cmd, false);
>>> + if ( rc )
>>> + {
>>> + gprintk(XENLOG_ERR, "failed to %s VF %pp: %d\n",
>>> + (cmd & PCI_COMMAND_MEMORY) ? "map" : "unmap",
>>> + &vf_pdev->sbdf, rc);
>>> + return rc;
>>> + }
>>> +
>>> + vf_pdev->vpci->header.guest_cmd &= ~PCI_COMMAND_MEMORY;
>>> + vf_pdev->vpci->header.guest_cmd |= (cmd & PCI_COMMAND_MEMORY);
>>
>> As mentioned elsewhere as well, this bit is supposed to be 0 for VFs.
>
> There are some devices that expose VFs with the same VID/DID as the
> PF, causing Linux to bind the normal driver to them and treat them like
> normal devices. At some point, those normal drivers do a
> read-modify-write of the command register and end up writing 0 to
> PCI_COMMAND_MEMORY, causing cmd_write to unmap the BARs of that device.
> I am not sure, maybe it would be better to just ignore cmd writes for VFs?
No. We should treat r/o bits as r/o (which for this bit implies it not
controlling BAR mapping).
>>> + sriov_pos = pci_find_ext_capability(pf_pdev, PCI_EXT_CAP_ID_SRIOV);
>>> + ctrl = pci_conf_read16(pf_pdev->sbdf, sriov_pos + PCI_SRIOV_CTRL);
>>> +
>>> + if ( (pf_pdev->domain == vf_pdev->domain) && (ctrl & PCI_SRIOV_CTRL_MSE) )
>>> + {
>>> + rc = vpci_modify_bars(vf_pdev, PCI_COMMAND_MEMORY, false);
>>
>> Doesn't VF enable also need to be set for the BARs to be mapped?
>
> I don't think so. Enabling memory space logically maps very well to
> mapping memory to the guest. I don’t see any benefit of also requiring
> VFE bit here.
Iirc the spec is quite explicit in this regard.
Jan
^ permalink raw reply [flat|nested] 35+ messages in thread
* Re: [PATCH v3 5/7] vpci: add SR-IOV support for PVH Dom0
2026-05-04 5:37 ` Jan Beulich
@ 2026-05-06 9:39 ` Mykyta Poturai
2026-05-06 11:54 ` Jan Beulich
0 siblings, 1 reply; 35+ messages in thread
From: Mykyta Poturai @ 2026-05-06 9:39 UTC (permalink / raw)
To: Jan Beulich
Cc: Roger Pau Monné, Daniel P. Smith,
xen-devel@lists.xenproject.org, Stewart Hildebrand
On 5/4/26 08:37, Jan Beulich wrote:
> On 23.04.2026 12:12, Mykyta Poturai wrote:
>> On 4/21/26 17:43, Jan Beulich wrote:
>>> On 09.04.2026 16:01, Mykyta Poturai wrote:
>>>> From: Stewart Hildebrand <stewart.hildebrand@amd.com>
>>>>
>>>> This code is expected to only be used by privileged domains;
>>>> unprivileged domains should not get access to the SR-IOV capability.
>>>>
>>>> Implement RW handlers for the PCI_SRIOV_CTRL register to dynamically
>>>> map/unmap VF BARs. Recalculate BAR sizes before mapping VFs to account
>>>> for possible changes in the system page size register. Also force VFs to
>>>> always use emulated reads for the command register; this is needed to
>>>> prevent some drivers accidentally unmapping BARs.
>>>
>>> This apparently refers to the change to vpci_init_header(). Writes are
>>> already intercepted. How would a read lead to accidental BAR unmap? Even
>>> for writes I don't see how a VF driver could accidentally unmap BARs, as
>>> the memory decode bit there is hardwired to 0.
>>>
>>>> Discovery of VFs is
>>>> done by Dom0, which must register them with Xen.
>>>
>>> If we intercept control register writes, why would we still require
>>> Dom0 to report the VFs that appear?
>>>
>>
>> Sorry, I don't understand this question. You specifically requested this
>> to be done this way in V2. Quoting your reply from V2 below.
>>
>> > Aren't you effectively busy-waiting for these 100ms, by simply
>> returning "true"
>> > from vpci_process_pending() until the time has passed? This imo is a
>> no-go. You
>> > want to set a timer and put the vCPU to sleep, to wake it up again
>> when the
>> > timer has expired. That'll then eliminate the need for the
>> not-so-nice patch 4.
>>
>> > Question is whether we need to actually go this far (right away). I
>> expect you
>> > don't mean to hand PFs to DomU-s. As long as we keep them in the hardware
>> > domain, can't we trust it to set things up correctly, just like we
>> trust it in
>> > a number of other aspects?
>
> How's any of this related to the question I raised here, or your reply
> thereto? If we intercept PCI_SRIOV_CTRL, we know when VFs are created.
> Why still demand Dom0 to report them then?
>
The spec states that VFs can take up to 100ms to become available after
the VF_ENABLE bit is set. We discussed in V2 that it is not acceptable
to do the required 100ms wait in Xen while blocking a domain. And not
blocking would require some mechanism to only allow a domain to run for
precisely 99 (or more?) ms. You yourself suggested that we can trust the
hardware domain with registering VFs if we already trust it with other
PCI-related stuff. Did you change your mind, or am I completely
misunderstanding this question?
>>>> +static int map_vfs(const struct pci_dev *pf_pdev, uint16_t cmd)
>>>> +{
>>>> + struct pci_dev *vf_pdev;
>>>> + int rc;
>>>> +
>>>> + ASSERT(rw_is_write_locked(&pf_pdev->domain->pci_lock));
>>>> +
>>>> + list_for_each_entry(vf_pdev, &pf_pdev->vf_list, vf_list)
>>>> + {
>>>> + rc = vpci_modify_bars(vf_pdev, cmd, false);
>>>> + if ( rc )
>>>> + {
>>>> + gprintk(XENLOG_ERR, "failed to %s VF %pp: %d\n",
>>>> + (cmd & PCI_COMMAND_MEMORY) ? "map" : "unmap",
>>>> + &vf_pdev->sbdf, rc);
>>>> + return rc;
>>>> + }
>>>> +
>>>> + vf_pdev->vpci->header.guest_cmd &= ~PCI_COMMAND_MEMORY;
>>>> + vf_pdev->vpci->header.guest_cmd |= (cmd & PCI_COMMAND_MEMORY);
>>>
>>> As mentioned elsewhere as well, this bit is supposed to be 0 for VFs.
>>
>> There are some devices that expose VFs with the same VID/DID as the
>> PF, causing Linux to bind the normal driver to them and treat them like
>> normal devices. At some point, those normal drivers do a
>> read-modify-write of the command register and end up writing 0 to
>> PCI_COMMAND_MEMORY, causing cmd_write to unmap the BARs of that device.
>> I am not sure, maybe it would be better to just ignore cmd writes for VFs?
>
> No. We should treat r/o bits as r/o (which for this bit implies it not
> controlling BAR mapping).
>
>>>> + sriov_pos = pci_find_ext_capability(pf_pdev, PCI_EXT_CAP_ID_SRIOV);
>>>> + ctrl = pci_conf_read16(pf_pdev->sbdf, sriov_pos + PCI_SRIOV_CTRL);
>>>> +
>>>> + if ( (pf_pdev->domain == vf_pdev->domain) && (ctrl & PCI_SRIOV_CTRL_MSE) )
>>>> + {
>>>> + rc = vpci_modify_bars(vf_pdev, PCI_COMMAND_MEMORY, false);
>>>
>>> Doesn't VF enable also need to be set for the BARs to be mapped?
>>
>> I don't think so. Enabling memory space logically maps very well to
>> mapping memory to the guest. I don’t see any benefit of also requiring
>> VFE bit here.
>
> Iirc the spec is quite explicit in this regard.
>
> Jan
I will recheck the spec regarding this question.
--
Mykyta
^ permalink raw reply [flat|nested] 35+ messages in thread
* Re: [PATCH v3 5/7] vpci: add SR-IOV support for PVH Dom0
2026-05-06 9:39 ` Mykyta Poturai
@ 2026-05-06 11:54 ` Jan Beulich
2026-05-07 20:40 ` Volodymyr Babchuk
0 siblings, 1 reply; 35+ messages in thread
From: Jan Beulich @ 2026-05-06 11:54 UTC (permalink / raw)
To: Mykyta Poturai
Cc: Roger Pau Monné, Daniel P. Smith,
xen-devel@lists.xenproject.org, Stewart Hildebrand
On 06.05.2026 11:39, Mykyta Poturai wrote:
> On 5/4/26 08:37, Jan Beulich wrote:
>> On 23.04.2026 12:12, Mykyta Poturai wrote:
>>> On 4/21/26 17:43, Jan Beulich wrote:
>>>> On 09.04.2026 16:01, Mykyta Poturai wrote:
>>>>> From: Stewart Hildebrand <stewart.hildebrand@amd.com>
>>>>>
>>>>> This code is expected to only be used by privileged domains;
>>>>> unprivileged domains should not get access to the SR-IOV capability.
>>>>>
>>>>> Implement RW handlers for the PCI_SRIOV_CTRL register to dynamically
>>>>> map/unmap VF BARs. Recalculate BAR sizes before mapping VFs to account
>>>>> for possible changes in the system page size register. Also force VFs to
>>>>> always use emulated reads for the command register; this is needed to
>>>>> prevent some drivers accidentally unmapping BARs.
>>>>
>>>> This apparently refers to the change to vpci_init_header(). Writes are
>>>> already intercepted. How would a read lead to accidental BAR unmap? Even
>>>> for writes I don't see how a VF driver could accidentally unmap BARs, as
>>>> the memory decode bit there is hardwired to 0.
>>>>
>>>>> Discovery of VFs is
>>>>> done by Dom0, which must register them with Xen.
>>>>
>>>> If we intercept control register writes, why would we still require
>>>> Dom0 to report the VFs that appear?
>>>>
>>>
>>> Sorry, I don't understand this question. You specifically requested this
>>> to be done this way in V2. Quoting your reply from V2 below.
>>>
>>> > Aren't you effectively busy-waiting for these 100ms, by simply
>>> returning "true"
>>> > from vpci_process_pending() until the time has passed? This imo is a
>>> no-go. You
>>> > want to set a timer and put the vCPU to sleep, to wake it up again
>>> when the
>>> > timer has expired. That'll then eliminate the need for the
>>> not-so-nice patch 4.
>>>
>>> > Question is whether we need to actually go this far (right away). I
>>> expect you
>>> > don't mean to hand PFs to DomU-s. As long as we keep them in the hardware
>>> > domain, can't we trust it to set things up correctly, just like we
>>> trust it in
>>> > a number of other aspects?
>>
>> How's any of this related to the question I raised here, or your reply
>> thereto? If we intercept PCI_SRIOV_CTRL, we know when VFs are created.
>> Why still demand Dom0 to report them then?
>>
>
> The spec states that VFs can take up to 100ms after the VF_ENABLE bit is
> set to become alive. We discussed in the V2 that it is not acceptable to
> do a required 100ms wait in Xen while blocking a domain. And not doing
> that blocking would require some mechanism to only allow a domain to run
> for precisely 99(or more?)ms. You yourself suggested that we can trust
> the hardware domain with registering VFs if we already trust it with
> other PCI-related stuff. Did you change your mind, or am I completely
> misunderstanding this question?
No, I still think that we can trust hwdom enough. Nevertheless we should
aim at being independent of it where possible. And I seem to recall that
I had also outlined an approach how to avoid spin-waiting for 100ms in
the hypervisor.
Jan
^ permalink raw reply [flat|nested] 35+ messages in thread
* Re: [PATCH v3 5/7] vpci: add SR-IOV support for PVH Dom0
2026-05-06 11:54 ` Jan Beulich
@ 2026-05-07 20:40 ` Volodymyr Babchuk
2026-05-08 5:52 ` Jan Beulich
2026-05-08 5:52 ` Jan Beulich
0 siblings, 2 replies; 35+ messages in thread
From: Volodymyr Babchuk @ 2026-05-07 20:40 UTC (permalink / raw)
To: Jan Beulich
Cc: Mykyta Poturai, Roger Pau Monné, Daniel P. Smith,
xen-devel@lists.xenproject.org, Stewart Hildebrand
Hi Jan,
Jan Beulich <jbeulich@suse.com> writes:
> On 06.05.2026 11:39, Mykyta Poturai wrote:
>> On 5/4/26 08:37, Jan Beulich wrote:
>>> On 23.04.2026 12:12, Mykyta Poturai wrote:
>>>> On 4/21/26 17:43, Jan Beulich wrote:
>>>>> On 09.04.2026 16:01, Mykyta Poturai wrote:
>>>>>> From: Stewart Hildebrand <stewart.hildebrand@amd.com>
>>>>>>
>>>>>> This code is expected to only be used by privileged domains;
>>>>>> unprivileged domains should not get access to the SR-IOV capability.
>>>>>>
>>>>>> Implement RW handlers for the PCI_SRIOV_CTRL register to dynamically
>>>>>> map/unmap VF BARs. Recalculate BAR sizes before mapping VFs to account
>>>>>> for possible changes in the system page size register. Also force VFs to
>>>>>> always use emulated reads for the command register; this is needed to
>>>>>> prevent some drivers accidentally unmapping BARs.
>>>>>
>>>>> This apparently refers to the change to vpci_init_header(). Writes are
>>>>> already intercepted. How would a read lead to accidental BAR unmap? Even
>>>>> for writes I don't see how a VF driver could accidentally unmap BARs, as
>>>>> the memory decode bit there is hardwired to 0.
>>>>>
>>>>>> Discovery of VFs is
>>>>>> done by Dom0, which must register them with Xen.
>>>>>
>>>>> If we intercept control register writes, why would we still require
>>>>> Dom0 to report the VFs that appear?
>>>>>
>>>>
>>>> Sorry, I don't understand this question. You specifically requested this
>>>> to be done this way in V2. Quoting your reply from V2 below.
>>>>
>>>> > Aren't you effectively busy-waiting for these 100ms, by simply
>>>> returning "true"
>>>> > from vpci_process_pending() until the time has passed? This imo is a
>>>> no-go. You
>>>> > want to set a timer and put the vCPU to sleep, to wake it up again
>>>> when the
>>>> > timer has expired. That'll then eliminate the need for the
>>>> not-so-nice patch 4.
>>>>
>>>> > Question is whether we need to actually go this far (right away). I
>>>> expect you
>>>> > don't mean to hand PFs to DomU-s. As long as we keep them in the hardware
>>>> > domain, can't we trust it to set things up correctly, just like we
>>>> trust it in
>>>> > a number of other aspects?
>>>
>>> How's any of this related to the question I raised here, or your reply
>>> thereto? If we intercept PCI_SRIOV_CTRL, we know when VFs are created.
>>> Why still demand Dom0 to report them then?
>>>
>>
>> The spec states that VFs can take up to 100ms after the VF_ENABLE bit is
>> set to become alive. We discussed in the V2 that it is not acceptable to
>> do a required 100ms wait in Xen while blocking a domain. And not doing
>> that blocking would require some mechanism to only allow a domain to run
>> for precisely 99(or more?)ms. You yourself suggested that we can trust
>> the hardware domain with registering VFs if we already trust it with
>> other PCI-related stuff. Did you change your mind, or am I completely
>> misunderstanding this question?
>
> No, I still think that we can trust hwdom enough. Nevertheless we should
> aim at being independent of it where possible. And I seem to recall that
> I had also outlined an approach how to avoid spin-waiting for 100ms in
> the hypervisor.
I want to clarify: you are saying that Xen should not wait for hwdom to
report VFs, and should instead create them by itself. Is this correct?
--
WBR, Volodymyr
^ permalink raw reply [flat|nested] 35+ messages in thread
* Re: [PATCH v3 5/7] vpci: add SR-IOV support for PVH Dom0
2026-05-07 20:40 ` Volodymyr Babchuk
@ 2026-05-08 5:52 ` Jan Beulich
2026-05-11 14:10 ` Volodymyr Babchuk
2026-05-08 5:52 ` Jan Beulich
1 sibling, 1 reply; 35+ messages in thread
From: Jan Beulich @ 2026-05-08 5:52 UTC (permalink / raw)
To: Volodymyr Babchuk
Cc: Mykyta Poturai, Roger Pau Monné, Daniel P. Smith,
xen-devel@lists.xenproject.org, Stewart Hildebrand
On 07.05.2026 22:40, Volodymyr Babchuk wrote:
> Jan Beulich <jbeulich@suse.com> writes:
>> On 06.05.2026 11:39, Mykyta Poturai wrote:
>>> On 5/4/26 08:37, Jan Beulich wrote:
>>>> On 23.04.2026 12:12, Mykyta Poturai wrote:
>>>>> On 4/21/26 17:43, Jan Beulich wrote:
>>>>>> On 09.04.2026 16:01, Mykyta Poturai wrote:
>>>>>>> From: Stewart Hildebrand <stewart.hildebrand@amd.com>
>>>>>>>
>>>>>>> This code is expected to only be used by privileged domains;
>>>>>>> unprivileged domains should not get access to the SR-IOV capability.
>>>>>>>
>>>>>>> Implement RW handlers for the PCI_SRIOV_CTRL register to dynamically
>>>>>>> map/unmap VF BARs. Recalculate BAR sizes before mapping VFs to account
>>>>>>> for possible changes in the system page size register. Also force VFs to
>>>>>>> always use emulated reads for the command register; this is needed to
>>>>>>> prevent some drivers accidentally unmapping BARs.
>>>>>>
>>>>>> This apparently refers to the change to vpci_init_header(). Writes are
>>>>>> already intercepted. How would a read lead to accidental BAR unmap? Even
>>>>>> for writes I don't see how a VF driver could accidentally unmap BARs, as
>>>>>> the memory decode bit there is hardwired to 0.
>>>>>>
>>>>>>> Discovery of VFs is
>>>>>>> done by Dom0, which must register them with Xen.
>>>>>>
>>>>>> If we intercept control register writes, why would we still require
>>>>>> Dom0 to report the VFs that appear?
>>>>>>
>>>>>
>>>>> Sorry, I don't understand this question. You specifically requested this
>>>>> to be done this way in V2. Quoting your reply from V2 below.
>>>>>
>>>>> > Aren't you effectively busy-waiting for these 100ms, by simply
>>>>> returning "true"
>>>>> > from vpci_process_pending() until the time has passed? This imo is a
>>>>> no-go. You
>>>>> > want to set a timer and put the vCPU to sleep, to wake it up again
>>>>> when the
>>>>> > timer has expired. That'll then eliminate the need for the
>>>>> not-so-nice patch 4.
>>>>>
>>>>> > Question is whether we need to actually go this far (right away). I
>>>>> expect you
>>>>> > don't mean to hand PFs to DomU-s. As long as we keep them in the hardware
>>>>> > domain, can't we trust it to set things up correctly, just like we
>>>>> trust it in
>>>>> > a number of other aspects?
>>>>
>>>> How's any of this related to the question I raised here, or your reply
>>>> thereto? If we intercept PCI_SRIOV_CTRL, we know when VFs are created.
>>>> Why still demand Dom0 to report them then?
>>>>
>>>
>>> The spec states that VFs can take up to 100ms after the VF_ENABLE bit is
>>> set to become alive. We discussed in the V2 that it is not acceptable to
>>> do a required 100ms wait in Xen while blocking a domain. And not doing
>>> that blocking would require some mechanism to only allow a domain to run
>>> for precisely 99(or more?)ms. You yourself suggested that we can trust
>>> the hardware domain with registering VFs if we already trust it with
>>> other PCI-related stuff. Did you change your mind, or am I completely
>>> misunderstanding this question?
>>
>> No, I still think that we can trust hwdom enough. Nevertheless we should
>> aim at being independent of it where possible. And I seem to recall that
>> I had also outlined an approach how to avoid spin-waiting for 100ms in
>> the hypervisor.
>
> I want to clarify: you are saying that Xen should not wait for hwdom to
> report VFs, and should instead create them by itself. Is this correct?
If that's technically possible, yes.
Jan
^ permalink raw reply [flat|nested] 35+ messages in thread
* Re: [PATCH v3 5/7] vpci: add SR-IOV support for PVH Dom0
2026-05-08 5:52 ` Jan Beulich
@ 2026-05-11 14:10 ` Volodymyr Babchuk
2026-05-12 6:20 ` Jan Beulich
0 siblings, 1 reply; 35+ messages in thread
From: Volodymyr Babchuk @ 2026-05-11 14:10 UTC (permalink / raw)
To: Jan Beulich
Cc: Mykyta Poturai, Roger Pau Monné, Daniel P. Smith,
xen-devel@lists.xenproject.org, Stewart Hildebrand
Hi Jan,
Jan Beulich <jbeulich@suse.com> writes:
> On 07.05.2026 22:40, Volodymyr Babchuk wrote:
>> Jan Beulich <jbeulich@suse.com> writes:
>>> On 06.05.2026 11:39, Mykyta Poturai wrote:
>>>> On 5/4/26 08:37, Jan Beulich wrote:
>>>>> On 23.04.2026 12:12, Mykyta Poturai wrote:
>>>>>> On 4/21/26 17:43, Jan Beulich wrote:
>>>>>>> On 09.04.2026 16:01, Mykyta Poturai wrote:
>>>>>>>> From: Stewart Hildebrand <stewart.hildebrand@amd.com>
>>>>>>>>
>>>>>>>> This code is expected to only be used by privileged domains;
>>>>>>>> unprivileged domains should not get access to the SR-IOV capability.
>>>>>>>>
>>>>>>>> Implement RW handlers for the PCI_SRIOV_CTRL register to dynamically
>>>>>>>> map/unmap VF BARs. Recalculate BAR sizes before mapping VFs to account
>>>>>>>> for possible changes in the system page size register. Also force VFs to
>>>>>>>> always use emulated reads for the command register; this is needed to
>>>>>>>> prevent some drivers accidentally unmapping BARs.
>>>>>>>
>>>>>>> This apparently refers to the change to vpci_init_header(). Writes are
>>>>>>> already intercepted. How would a read lead to accidental BAR unmap? Even
>>>>>>> for writes I don't see how a VF driver could accidentally unmap BARs, as
>>>>>>> the memory decode bit there is hardwired to 0.
>>>>>>>
>>>>>>>> Discovery of VFs is
>>>>>>>> done by Dom0, which must register them with Xen.
>>>>>>>
>>>>>>> If we intercept control register writes, why would we still require
>>>>>>> Dom0 to report the VFs that appear?
>>>>>>>
>>>>>>
>>>>>> Sorry, I don't understand this question. You specifically requested this
>>>>>> to be done this way in V2. Quoting your reply from V2 below.
>>>>>>
>>>>>> > Aren't you effectively busy-waiting for these 100ms, by simply
>>>>>> returning "true"
>>>>>> > from vpci_process_pending() until the time has passed? This imo is a
>>>>>> no-go. You
>>>>>> > want to set a timer and put the vCPU to sleep, to wake it up again
>>>>>> when the
>>>>>> > timer has expired. That'll then eliminate the need for the
>>>>>> not-so-nice patch 4.
>>>>>>
>>>>>> > Question is whether we need to actually go this far (right away). I
>>>>>> expect you
>>>>>> > don't mean to hand PFs to DomU-s. As long as we keep them in the hardware
>>>>>> > domain, can't we trust it to set things up correctly, just like we
>>>>>> trust it in
>>>>>> > a number of other aspects?
>>>>>
>>>>> How's any of this related to the question I raised here, or your reply
>>>>> thereto? If we intercept PCI_SRIOV_CTRL, we know when VFs are created.
>>>>> Why still demand Dom0 to report them then?
>>>>>
>>>>
>>>> The spec states that VFs can take up to 100ms after the VF_ENABLE bit is
>>>> set to become alive. We discussed in the V2 that it is not acceptable to
>>>> do a required 100ms wait in Xen while blocking a domain. And not doing
>>>> that blocking would require some mechanism to only allow a domain to run
>>>> for precisely 99(or more?)ms. You yourself suggested that we can trust
>>>> the hardware domain with registering VFs if we already trust it with
>>>> other PCI-related stuff. Did you change your mind, or am I completely
>>>> misunderstanding this question?
>>>
>>> No, I still think that we can trust hwdom enough. Nevertheless we should
>>> aim at being independent of it where possible. And I seem to recall that
>>> I had also outlined an approach how to avoid spin-waiting for 100ms in
>>> the hypervisor.
>>
>> I want to clarify: you are telling that Xen should not wait for hwdom to
>> report VFs and instead create them by itself. Is this correct?
>
> If that's technically possible, yes.
Okay, so let's clear this up. If I remember correctly, you discussed this
with Mykyta in the previous version and suggested putting the vCPU to
sleep for 100ms. I don't think that this is a good idea, because the
guest kernel will not be happy about that. So, IMO, it is better to just
allow the guest to tell Xen when VFs are ready.
Or maybe I am missing something and you had some other idea?
--
WBR, Volodymyr
^ permalink raw reply [flat|nested] 35+ messages in thread
* Re: [PATCH v3 5/7] vpci: add SR-IOV support for PVH Dom0
2026-05-11 14:10 ` Volodymyr Babchuk
@ 2026-05-12 6:20 ` Jan Beulich
2026-05-12 7:32 ` Mykyta Poturai
0 siblings, 1 reply; 35+ messages in thread
From: Jan Beulich @ 2026-05-12 6:20 UTC (permalink / raw)
To: Volodymyr Babchuk
Cc: Mykyta Poturai, Roger Pau Monné, Daniel P. Smith,
xen-devel@lists.xenproject.org, Stewart Hildebrand
On 11.05.2026 16:10, Volodymyr Babchuk wrote:
> Hi Jan,
>
> Jan Beulich <jbeulich@suse.com> writes:
>
>> On 07.05.2026 22:40, Volodymyr Babchuk wrote:
>>> Jan Beulich <jbeulich@suse.com> writes:
>>>> On 06.05.2026 11:39, Mykyta Poturai wrote:
>>>>> On 5/4/26 08:37, Jan Beulich wrote:
>>>>>> On 23.04.2026 12:12, Mykyta Poturai wrote:
>>>>>>> On 4/21/26 17:43, Jan Beulich wrote:
>>>>>>>> On 09.04.2026 16:01, Mykyta Poturai wrote:
>>>>>>>>> From: Stewart Hildebrand <stewart.hildebrand@amd.com>
>>>>>>>>>
>>>>>>>>> This code is expected to only be used by privileged domains,
>>>>>>>>> unprivileged domains should not get access to the SR-IOV capability.
>>>>>>>>>
>>>>>>>>> Implement RW handlers for PCI_SRIOV_CTRL register to dynamically
>>>>>>>>> map/unmap VF BARS. Recalculate BAR sizes before mapping VFs to account
>>>>>>>>> for possible changes in the system page size register. Also force VFs to
>>>>>>>>> always use emulated reads for command register, this is needed to
>>>>>>>>> prevent some drivers accidentally unmapping BARs.
>>>>>>>>
>>>>>>>> This apparently refers to the change to vpci_init_header(). Writes are
>>>>>>>> already intercepted. How would a read lead to accidental BAR unmap? Even
>>>>>>>> for writes I don't see how a VF driver could accidentally unmap BARs, as
>>>>>>>> the memory decode bit there is hardwired to 0.
>>>>>>>>
>>>>>>>>> Discovery of VFs is
>>>>>>>>> done by Dom0, which must register them with Xen.
>>>>>>>>
>>>>>>>> If we intercept control register writes, why would we still require
>>>>>>>> Dom0 to report the VFs that appear?
>>>>>>>>
>>>>>>>
>>>>>>> Sorry, I don't understand this question. You specifically requested this
>>>>>>> to be done this way in V2. Quoting your reply from V2 below.
>>>>>>>
>>>>>>> > Aren't you effectively busy-waiting for these 100ms, by simply
>>>>>>> returning "true"
>>>>>>> > from vpci_process_pending() until the time has passed? This imo is a
>>>>>>> no-go. You
>>>>>>> > want to set a timer and put the vCPU to sleep, to wake it up again
>>>>>>> when the
>>>>>>> > timer has expired. That'll then eliminate the need for the
>>>>>>> not-so-nice patch 4.
>>>>>>>
>>>>>>> > Question is whether we need to actually go this far (right away). I
>>>>>>> expect you
>>>>>>> > don't mean to hand PFs to DomU-s. As long as we keep them in the hardware
>>>>>>> > domain, can't we trust it to set things up correctly, just like we
>>>>>>> trust it in
>>>>>>> > a number of other aspects?
>>>>>>
>>>>>> How's any of this related to the question I raised here, or your reply
>>>>>> thereto? If we intercept PCI_SRIOV_CTRL, we know when VFs are created.
>>>>>> Why still demand Dom0 to report them then?
>>>>>>
>>>>>
>>>>> The spec states that VFs can take up to 100ms after the VF_ENABLE bit is
>>>>> set to become alive. We discussed in the V2 that it is not acceptable to
>>>>> do a required 100ms wait in Xen while blocking a domain. And not doing
>>>>> that blocking would require some mechanism to only allow a domain to run
>>>>> for precisely 99(or more?)ms. You yourself suggested that we can trust
>>>>> the hardware domain with registering VFs if we already trust it with
>>>>> other PCI-related stuff. Did you change your mind, or am I completely
>>>>> misunderstanding this question?
>>>>
>>>> No, I still think that we can trust hwdom enough. Nevertheless we should
>>>> aim at being independent of it where possible. And I seem to recall that
>>>> I had also outlined an approach how to avoid spin-waiting for 100ms in
>>>> the hypervisor.
>>>
>>> I want to clarify: you are telling that Xen should not wait for hwdom to
>>> report VFs and instead create them by itself. Is this correct?
>>
>> If that's technically possible, yes.
>
> Okay, so let's clear this. If I remember correct, you discussed this
> with Mykyta in the previous version and suggested to put the vCPU to
> sleep for 100ms.
I don't think I did (except perhaps from a very abstract perspective),
precisely because of ...
> I don't think that this is a good idea, because guest
> kernel will not be happy about that.
... this. Instead iirc I suggested to refuse (short-circuit) handling
VF register accesses for the next 100ms.
Jan
^ permalink raw reply [flat|nested] 35+ messages in thread
* Re: [PATCH v3 5/7] vpci: add SR-IOV support for PVH Dom0
2026-05-12 6:20 ` Jan Beulich
@ 2026-05-12 7:32 ` Mykyta Poturai
2026-05-12 8:58 ` Roger Pau Monné
0 siblings, 1 reply; 35+ messages in thread
From: Mykyta Poturai @ 2026-05-12 7:32 UTC (permalink / raw)
To: Jan Beulich, Volodymyr Babchuk
Cc: Roger Pau Monné, Daniel P. Smith,
xen-devel@lists.xenproject.org, Stewart Hildebrand
On 5/12/26 09:20, Jan Beulich wrote:
> On 11.05.2026 16:10, Volodymyr Babchuk wrote:
>> Hi Jan,
>>
>> Jan Beulich <jbeulich@suse.com> writes:
>>
>>> On 07.05.2026 22:40, Volodymyr Babchuk wrote:
>>>> Jan Beulich <jbeulich@suse.com> writes:
>>>>> On 06.05.2026 11:39, Mykyta Poturai wrote:
>>>>>> On 5/4/26 08:37, Jan Beulich wrote:
>>>>>>> On 23.04.2026 12:12, Mykyta Poturai wrote:
>>>>>>>> On 4/21/26 17:43, Jan Beulich wrote:
>>>>>>>>> On 09.04.2026 16:01, Mykyta Poturai wrote:
>>>>>>>>>> From: Stewart Hildebrand <stewart.hildebrand@amd.com>
>>>>>>>>>>
>>>>>>>>>> This code is expected to only be used by privileged domains,
>>>>>>>>>> unprivileged domains should not get access to the SR-IOV capability.
>>>>>>>>>>
>>>>>>>>>> Implement RW handlers for PCI_SRIOV_CTRL register to dynamically
>>>>>>>>>> map/unmap VF BARS. Recalculate BAR sizes before mapping VFs to account
>>>>>>>>>> for possible changes in the system page size register. Also force VFs to
>>>>>>>>>> always use emulated reads for command register, this is needed to
>>>>>>>>>> prevent some drivers accidentally unmapping BARs.
>>>>>>>>>
>>>>>>>>> This apparently refers to the change to vpci_init_header(). Writes are
>>>>>>>>> already intercepted. How would a read lead to accidental BAR unmap? Even
>>>>>>>>> for writes I don't see how a VF driver could accidentally unmap BARs, as
>>>>>>>>> the memory decode bit there is hardwired to 0.
>>>>>>>>>
>>>>>>>>>> Discovery of VFs is
>>>>>>>>>> done by Dom0, which must register them with Xen.
>>>>>>>>>
>>>>>>>>> If we intercept control register writes, why would we still require
>>>>>>>>> Dom0 to report the VFs that appear?
>>>>>>>>>
>>>>>>>>
>>>>>>>> Sorry, I don't understand this question. You specifically requested this
>>>>>>>> to be done this way in V2. Quoting your reply from V2 below.
>>>>>>>>
>>>>>>>> > Aren't you effectively busy-waiting for these 100ms, by simply
>>>>>>>> returning "true"
>>>>>>>> > from vpci_process_pending() until the time has passed? This imo is a
>>>>>>>> no-go. You
>>>>>>>> > want to set a timer and put the vCPU to sleep, to wake it up again
>>>>>>>> when the
>>>>>>>> > timer has expired. That'll then eliminate the need for the
>>>>>>>> not-so-nice patch 4.
>>>>>>>>
>>>>>>>> > Question is whether we need to actually go this far (right away). I
>>>>>>>> expect you
>>>>>>>> > don't mean to hand PFs to DomU-s. As long as we keep them in the hardware
>>>>>>>> > domain, can't we trust it to set things up correctly, just like we
>>>>>>>> trust it in
>>>>>>>> > a number of other aspects?
>>>>>>>
>>>>>>> How's any of this related to the question I raised here, or your reply
>>>>>>> thereto? If we intercept PCI_SRIOV_CTRL, we know when VFs are created.
>>>>>>> Why still demand Dom0 to report them then?
>>>>>>>
>>>>>>
>>>>>> The spec states that VFs can take up to 100ms after the VF_ENABLE bit is
>>>>>> set to become alive. We discussed in the V2 that it is not acceptable to
>>>>>> do a required 100ms wait in Xen while blocking a domain. And not doing
>>>>>> that blocking would require some mechanism to only allow a domain to run
>>>>>> for precisely 99(or more?)ms. You yourself suggested that we can trust
>>>>>> the hardware domain with registering VFs if we already trust it with
>>>>>> other PCI-related stuff. Did you change your mind, or am I completely
>>>>>> misunderstanding this question?
>>>>>
>>>>> No, I still think that we can trust hwdom enough. Nevertheless we should
>>>>> aim at being independent of it where possible. And I seem to recall that
>>>>> I had also outlined an approach how to avoid spin-waiting for 100ms in
>>>>> the hypervisor.
>>>>
>>>> I want to clarify: you are telling that Xen should not wait for hwdom to
>>>> report VFs and instead create them by itself. Is this correct?
>>>
>>> If that's technically possible, yes.
>>
>> Okay, so let's clear this. If I remember correct, you discussed this
>> with Mykyta in the previous version and suggested to put the vCPU to
>> sleep for 100ms.
>
> I don't think I did (except perhaps from a very abstract perspective),
> precisely because of ...
>
>> I don't think that this is a good idea, because guest
>> kernel will not be happy about that.
>
> ... this. Instead iirc I suggested to refuse (short-circuit) handling
> VF register accesses for the next 100ms.
>
> Jan
Do you have any suggestions on how to ensure that we accurately catch
the window where 100ms have already passed, but guests haven’t tried to
read anything yet, to flip this back? As I mentioned in the previous
version, Linux, for example, doesn’t attempt to re-read anything if the
first read failed after 100ms. So it appears to me that this approach
would be prone to racing with the guest for getting to the VF first. One
approach I can think of is to somehow swap the register handlers back
in-flight during the first read by the guest if 100ms have already
passed. However, this would still depend on Dom0 for registering VFs,
but in a more convoluted way. We also can’t add the VFs before 100ms
have passed and add timing checks to all register handlers, because
pci_add_device and everything below it expects the device to be
functional at the moment of addition.
Maybe you see some other way to avoid these problems that I am missing?
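The in-flight handler swap mentioned above could be modeled roughly like this (a toy sketch with made-up names — `vf_state`, `vf_access_allowed` and `VF_SETTLE_NS` are illustrative, not actual vPCI code): the first access after the settle window elapses is what flips the VF to its real handlers.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define VF_SETTLE_NS 100000000ULL /* the spec's 100ms, in ns */

struct vf_state {
    uint64_t enable_time;  /* timestamp of the VF Enable write */
    bool handlers_live;    /* real VF register handlers installed? */
};

/* Decide whether a VF config access may be served; the real handlers
 * are installed lazily on the first access after the window elapses. */
static bool vf_access_allowed(struct vf_state *vf, uint64_t now)
{
    if (!vf->handlers_live && now - vf->enable_time >= VF_SETTLE_NS)
        vf->handlers_live = true; /* the in-flight "swap" */
    return vf->handlers_live;
}
```

As the text notes, this still races with a guest that only reads once: an access just inside the window is refused and never retried.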
--
Mykyta
^ permalink raw reply [flat|nested] 35+ messages in thread
* Re: [PATCH v3 5/7] vpci: add SR-IOV support for PVH Dom0
2026-05-12 7:32 ` Mykyta Poturai
@ 2026-05-12 8:58 ` Roger Pau Monné
2026-05-12 10:28 ` Volodymyr Babchuk
2026-05-12 10:44 ` Jan Beulich
0 siblings, 2 replies; 35+ messages in thread
From: Roger Pau Monné @ 2026-05-12 8:58 UTC (permalink / raw)
To: Mykyta Poturai
Cc: Jan Beulich, Volodymyr Babchuk, Daniel P. Smith,
xen-devel@lists.xenproject.org, Stewart Hildebrand
On Tue, May 12, 2026 at 07:32:20AM +0000, Mykyta Poturai wrote:
>
>
> On 5/12/26 09:20, Jan Beulich wrote:
> > On 11.05.2026 16:10, Volodymyr Babchuk wrote:
> >> Hi Jan,
> >>
> >> Jan Beulich <jbeulich@suse.com> writes:
> >>
> >>> On 07.05.2026 22:40, Volodymyr Babchuk wrote:
> >>>> Jan Beulich <jbeulich@suse.com> writes:
> >>>>> On 06.05.2026 11:39, Mykyta Poturai wrote:
> >>>>>> On 5/4/26 08:37, Jan Beulich wrote:
> >>>>>>> On 23.04.2026 12:12, Mykyta Poturai wrote:
> >>>>>>>> On 4/21/26 17:43, Jan Beulich wrote:
> >>>>>>>>> On 09.04.2026 16:01, Mykyta Poturai wrote:
> >>>>>>>>>> From: Stewart Hildebrand <stewart.hildebrand@amd.com>
> >>>>>>>>>>
> >>>>>>>>>> This code is expected to only be used by privileged domains,
> >>>>>>>>>> unprivileged domains should not get access to the SR-IOV capability.
> >>>>>>>>>>
> >>>>>>>>>> Implement RW handlers for PCI_SRIOV_CTRL register to dynamically
> >>>>>>>>>> map/unmap VF BARS. Recalculate BAR sizes before mapping VFs to account
> >>>>>>>>>> for possible changes in the system page size register. Also force VFs to
> >>>>>>>>>> always use emulated reads for command register, this is needed to
> >>>>>>>>>> prevent some drivers accidentally unmapping BARs.
> >>>>>>>>>
> >>>>>>>>> This apparently refers to the change to vpci_init_header(). Writes are
> >>>>>>>>> already intercepted. How would a read lead to accidental BAR unmap? Even
> >>>>>>>>> for writes I don't see how a VF driver could accidentally unmap BARs, as
> >>>>>>>>> the memory decode bit there is hardwired to 0.
> >>>>>>>>>
> >>>>>>>>>> Discovery of VFs is
> >>>>>>>>>> done by Dom0, which must register them with Xen.
> >>>>>>>>>
> >>>>>>>>> If we intercept control register writes, why would we still require
> >>>>>>>>> Dom0 to report the VFs that appear?
> >>>>>>>>>
> >>>>>>>>
> >>>>>>>> Sorry, I don't understand this question. You specifically requested this
> >>>>>>>> to be done this way in V2. Quoting your reply from V2 below.
> >>>>>>>>
> >>>>>>>> > Aren't you effectively busy-waiting for these 100ms, by simply
> >>>>>>>> returning "true"
> >>>>>>>> > from vpci_process_pending() until the time has passed? This imo is a
> >>>>>>>> no-go. You
> >>>>>>>> > want to set a timer and put the vCPU to sleep, to wake it up again
> >>>>>>>> when the
> >>>>>>>> > timer has expired. That'll then eliminate the need for the
> >>>>>>>> not-so-nice patch 4.
> >>>>>>>>
> >>>>>>>> > Question is whether we need to actually go this far (right away). I
> >>>>>>>> expect you
> >>>>>>>> > don't mean to hand PFs to DomU-s. As long as we keep them in the hardware
> >>>>>>>> > domain, can't we trust it to set things up correctly, just like we
> >>>>>>>> trust it in
> >>>>>>>> > a number of other aspects?
> >>>>>>>
> >>>>>>> How's any of this related to the question I raised here, or your reply
> >>>>>>> thereto? If we intercept PCI_SRIOV_CTRL, we know when VFs are created.
> >>>>>>> Why still demand Dom0 to report them then?
> >>>>>>>
> >>>>>>
> >>>>>> The spec states that VFs can take up to 100ms after the VF_ENABLE bit is
> >>>>>> set to become alive. We discussed in the V2 that it is not acceptable to
> >>>>>> do a required 100ms wait in Xen while blocking a domain. And not doing
> >>>>>> that blocking would require some mechanism to only allow a domain to run
> >>>>>> for precisely 99(or more?)ms. You yourself suggested that we can trust
> >>>>>> the hardware domain with registering VFs if we already trust it with
> >>>>>> other PCI-related stuff. Did you change your mind, or am I completely
> >>>>>> misunderstanding this question?
> >>>>>
> >>>>> No, I still think that we can trust hwdom enough. Nevertheless we should
> >>>>> aim at being independent of it where possible. And I seem to recall that
> >>>>> I had also outlined an approach how to avoid spin-waiting for 100ms in
> >>>>> the hypervisor.
> >>>>
> >>>> I want to clarify: you are telling that Xen should not wait for hwdom to
> >>>> report VFs and instead create them by itself. Is this correct?
> >>>
> >>> If that's technically possible, yes.
> >>
> >> Okay, so let's clear this. If I remember correct, you discussed this
> >> with Mykyta in the previous version and suggested to put the vCPU to
> >> sleep for 100ms.
> >
> > I don't think I did (except perhaps from a very abstract perspective),
> > precisely because of ...
> >
> >> I don't think that this is a good idea, because guest
> >> kernel will not be happy about that.
> >
> > ... this. Instead iirc I suggested to refuse (short-circuit) handling
> > VF register accesses for the next 100ms.
> >
> > Jan
>
> Do you have any suggestions on how to ensure that we accurately catch
> the window where 100ms have already passed, but guests haven’t tried to
> read anything yet, to flip this back? As I mentioned in the previous
> version, Linux, for example, doesn’t attempt to re-read anything if the
> first read failed after 100ms. So it appears to me that this approach
> would be prone to racing with the guest for getting to the VF first. One
> approach I can think of is to somehow swap the register handlers back
> in-flight during the first read by the guest if 100ms have already
> passed. However, this would still depend on Dom0 for registering VFs,
> but in a more convoluted way. We also can’t add the VFs before 100ms
> have passed and add timing checks to all register handlers, because
> pci_add_device and everything below it expects the device to be
> functional at the moment of addition.
>
>
>
> Maybe you see some other way to avoid these problems that I am missing?
We could maybe do some middle ground here, kind of similar to what
Linux does. The overall idea would be to put on hold any accesses to
the device(s) PCI config space for 100ms; that would include the PF
and any VFs. At the point when VF Enable is set, Xen already knows the
position of the VFs in the PCI config space.
Any PCI config space access attempts to the PF or VFs during that
100ms window would cause the guest vCPU to be put on hold, and the
access would only be retried once the 100ms window has passed and Xen
has registered the VFs with vPCI. This approach needs extra logic to
put vPCI accesses on hold, similar to what Xen does when mapping a BAR
into the p2m, and a timer to defer the adding of the VFs and the
unlocking of the affected PCI config space region.
That would be a middle ground IMO, as the guest vCPUs could keep
running freely unless accesses to the affected PCI config space were
attempted before the 100ms window expired, at which point they would
be blocked waiting for the timeout to expire. A well-behaved domain
shouldn't try to access the PCI config space before the 100ms window
expires anyway.
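A minimal model of that hold-and-timer scheme might look like the following (illustrative only — `sriov_hold`, `sriov_arm_hold` and friends are invented names, not the Xen vPCI API): the PCI_SRIOV_CTRL write handler arms the hold, every config access to the PF/VFs checks it, and the timer callback registers the VFs and releases any paused vCPUs.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define VF_SETTLE_NS 100000000ULL /* 100ms */

struct sriov_hold {
    bool armed;           /* timer pending, config space on hold */
    uint64_t expires;     /* deadline of the 100ms window */
    unsigned int waiters; /* vCPUs currently paused on this device */
};

/* Called from the PCI_SRIOV_CTRL write handler when VF Enable is set. */
static void sriov_arm_hold(struct sriov_hold *h, uint64_t now)
{
    h->armed = true;
    h->expires = now + VF_SETTLE_NS;
}

/* Called on every config access to the PF or one of its VFs: returns
 * true if the vCPU must be put on hold and the access retried later. */
static bool sriov_access_must_wait(struct sriov_hold *h, uint64_t now)
{
    if (h->armed && now < h->expires) {
        h->waiters++;
        return true;
    }
    return false;
}

/* Timer callback: this is where the VFs would be registered with
 * vPCI; it then lifts the hold and reports how many vCPUs to wake. */
static unsigned int sriov_timer_fired(struct sriov_hold *h)
{
    unsigned int released = h->waiters;

    h->armed = false;
    h->waiters = 0;
    return released; /* callers would re-issue the deferred accesses */
}
```

The pause/retry plumbing itself would reuse whatever mechanism already defers accesses while a BAR is being mapped into the p2m.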
Roger.
^ permalink raw reply [flat|nested] 35+ messages in thread
* Re: [PATCH v3 5/7] vpci: add SR-IOV support for PVH Dom0
2026-05-12 8:58 ` Roger Pau Monné
@ 2026-05-12 10:28 ` Volodymyr Babchuk
2026-05-12 13:22 ` Roger Pau Monné
2026-05-12 10:44 ` Jan Beulich
1 sibling, 1 reply; 35+ messages in thread
From: Volodymyr Babchuk @ 2026-05-12 10:28 UTC (permalink / raw)
To: Roger Pau Monné
Cc: Mykyta Poturai, Jan Beulich, Daniel P. Smith,
xen-devel@lists.xenproject.org, Stewart Hildebrand
Hi Roger,
Roger Pau Monné <roger.pau@citrix.com> writes:
> On Tue, May 12, 2026 at 07:32:20AM +0000, Mykyta Poturai wrote:
>>
>>
>> On 5/12/26 09:20, Jan Beulich wrote:
>> > On 11.05.2026 16:10, Volodymyr Babchuk wrote:
>> >> Hi Jan,
>> >>
>> >> Jan Beulich <jbeulich@suse.com> writes:
>> >>
>> >>> On 07.05.2026 22:40, Volodymyr Babchuk wrote:
>> >>>> Jan Beulich <jbeulich@suse.com> writes:
>> >>>>> On 06.05.2026 11:39, Mykyta Poturai wrote:
>> >>>>>> On 5/4/26 08:37, Jan Beulich wrote:
>> >>>>>>> On 23.04.2026 12:12, Mykyta Poturai wrote:
>> >>>>>>>> On 4/21/26 17:43, Jan Beulich wrote:
>> >>>>>>>>> On 09.04.2026 16:01, Mykyta Poturai wrote:
>> >>>>>>>>>> From: Stewart Hildebrand <stewart.hildebrand@amd.com>
>> >>>>>>>>>>
>> >>>>>>>>>> This code is expected to only be used by privileged domains,
>> >>>>>>>>>> unprivileged domains should not get access to the SR-IOV capability.
>> >>>>>>>>>>
>> >>>>>>>>>> Implement RW handlers for PCI_SRIOV_CTRL register to dynamically
>> >>>>>>>>>> map/unmap VF BARS. Recalculate BAR sizes before mapping VFs to account
>> >>>>>>>>>> for possible changes in the system page size register. Also force VFs to
>> >>>>>>>>>> always use emulated reads for command register, this is needed to
>> >>>>>>>>>> prevent some drivers accidentally unmapping BARs.
>> >>>>>>>>>
>> >>>>>>>>> This apparently refers to the change to vpci_init_header(). Writes are
>> >>>>>>>>> already intercepted. How would a read lead to accidental BAR unmap? Even
>> >>>>>>>>> for writes I don't see how a VF driver could accidentally unmap BARs, as
>> >>>>>>>>> the memory decode bit there is hardwired to 0.
>> >>>>>>>>>
>> >>>>>>>>>> Discovery of VFs is
>> >>>>>>>>>> done by Dom0, which must register them with Xen.
>> >>>>>>>>>
>> >>>>>>>>> If we intercept control register writes, why would we still require
>> >>>>>>>>> Dom0 to report the VFs that appear?
>> >>>>>>>>>
>> >>>>>>>>
>> >>>>>>>> Sorry, I don't understand this question. You specifically requested this
>> >>>>>>>> to be done this way in V2. Quoting your reply from V2 below.
>> >>>>>>>>
>> >>>>>>>> > Aren't you effectively busy-waiting for these 100ms, by simply
>> >>>>>>>> returning "true"
>> >>>>>>>> > from vpci_process_pending() until the time has passed? This imo is a
>> >>>>>>>> no-go. You
>> >>>>>>>> > want to set a timer and put the vCPU to sleep, to wake it up again
>> >>>>>>>> when the
>> >>>>>>>> > timer has expired. That'll then eliminate the need for the
>> >>>>>>>> not-so-nice patch 4.
>> >>>>>>>>
>> >>>>>>>> > Question is whether we need to actually go this far (right away). I
>> >>>>>>>> expect you
>> >>>>>>>> > don't mean to hand PFs to DomU-s. As long as we keep them in the hardware
>> >>>>>>>> > domain, can't we trust it to set things up correctly, just like we
>> >>>>>>>> trust it in
>> >>>>>>>> > a number of other aspects?
>> >>>>>>>
>> >>>>>>> How's any of this related to the question I raised here, or your reply
>> >>>>>>> thereto? If we intercept PCI_SRIOV_CTRL, we know when VFs are created.
>> >>>>>>> Why still demand Dom0 to report them then?
>> >>>>>>>
>> >>>>>>
>> >>>>>> The spec states that VFs can take up to 100ms after the VF_ENABLE bit is
>> >>>>>> set to become alive. We discussed in the V2 that it is not acceptable to
>> >>>>>> do a required 100ms wait in Xen while blocking a domain. And not doing
>> >>>>>> that blocking would require some mechanism to only allow a domain to run
>> >>>>>> for precisely 99(or more?)ms. You yourself suggested that we can trust
>> >>>>>> the hardware domain with registering VFs if we already trust it with
>> >>>>>> other PCI-related stuff. Did you change your mind, or am I completely
>> >>>>>> misunderstanding this question?
>> >>>>>
>> >>>>> No, I still think that we can trust hwdom enough. Nevertheless we should
>> >>>>> aim at being independent of it where possible. And I seem to recall that
>> >>>>> I had also outlined an approach how to avoid spin-waiting for 100ms in
>> >>>>> the hypervisor.
>> >>>>
>> >>>> I want to clarify: you are telling that Xen should not wait for hwdom to
>> >>>> report VFs and instead create them by itself. Is this correct?
>> >>>
>> >>> If that's technically possible, yes.
>> >>
>> >> Okay, so let's clear this. If I remember correct, you discussed this
>> >> with Mykyta in the previous version and suggested to put the vCPU to
>> >> sleep for 100ms.
>> >
>> > I don't think I did (except perhaps from a very abstract perspective),
>> > precisely because of ...
>> >
>> >> I don't think that this is a good idea, because guest
>> >> kernel will not be happy about that.
>> >
>> > ... this. Instead iirc I suggested to refuse (short-circuit) handling
>> > VF register accesses for the next 100ms.
>> >
>> > Jan
>>
>> Do you have any suggestions on how to ensure that we accurately catch
>> the window where 100ms have already passed, but guests haven’t tried to
>> read anything yet, to flip this back? As I mentioned in the previous
>> version, Linux, for example, doesn’t attempt to re-read anything if the
>> first read failed after 100ms. So it appears to me that this approach
>> would be prone to racing with the guest for getting to the VF first. One
>> approach I can think of is to somehow swap the register handlers back
>> in-flight during the first read by the guest if 100ms have already
>> passed. However, this would still depend on Dom0 for registering VFs,
>> but in a more convoluted way. We also can’t add the VFs before 100ms
>> have passed and add timing checks to all register handlers, because
>> pci_add_device and everything below it expects the device to be
>> functional at the moment of addition.
>>
>>
>>
>> Maybe you see some other way to avoid these problems that I am missing?
>
> We could maybe do some middle ground here, kind of similar to what
> Linux does. The overall idea would be to put on hold any accesses to
> the device(s) PCI config space for 100ms, that would include the PF
> and any VFs. At the point when VF enable is set Xen already knows the
> position of the VFs in the PCI config space.
>
> Any PCI config space access attempts to the PF or VFs during that
> 100ms window would cause the guest vCPU to be put on hold, and the
> access would only be retried once the 100ms window has passed and Xen
> has registered the VFs with vPCI. This approach needs extra logic to
> put vPCI accesses on hold, similar to what Xen does when mapping a BAR
> into the p2m, and a timer to defer the adding of the Vfs and the
> unlocking of the affected PCI config space region.
>
> That would be a middle ground IMO, as the guest vCPUs could be running
> freely, unless accesses to the affected PCI config space was attempted
> before the 100ms window, at which point they would be blocked waiting
> for the timeout to expire. A well-behaved domain shouldn't try to
> access the PCI config space either ahead the 100ms window expiring.
This approach seems reasonable to me, but it would require big
changes in the vPCI logic, as pci_add_device() would now need the
ability to defer all config space accesses until the VFs are ready,
and in the meantime we'd have to deal with a half-initialized pdev.
The PCI/vPCI logic is already convoluted enough, and adding more
intermediate states that need to be dealt with in different places
will make things even worse. Unless I'm missing some easy fix, of
course...
What I am trying to say is that your suggestion is technically doable,
but it requires lots of work, and we don't have the resources for this
right now. So, what's your opinion on the existing approach? Is relying
on a domain to introduce VFs such a bad idea?
--
WBR, Volodymyr
^ permalink raw reply [flat|nested] 35+ messages in thread
* Re: [PATCH v3 5/7] vpci: add SR-IOV support for PVH Dom0
2026-05-12 8:58 ` Roger Pau Monné
2026-05-12 10:28 ` Volodymyr Babchuk
@ 2026-05-12 10:44 ` Jan Beulich
2026-05-12 11:11 ` Roger Pau Monné
1 sibling, 1 reply; 35+ messages in thread
From: Jan Beulich @ 2026-05-12 10:44 UTC (permalink / raw)
To: Roger Pau Monné
Cc: Volodymyr Babchuk, Daniel P. Smith,
xen-devel@lists.xenproject.org, Stewart Hildebrand,
Mykyta Poturai
On 12.05.2026 10:58, Roger Pau Monné wrote:
> On Tue, May 12, 2026 at 07:32:20AM +0000, Mykyta Poturai wrote:
>> On 5/12/26 09:20, Jan Beulich wrote:
>>> On 11.05.2026 16:10, Volodymyr Babchuk wrote:
>>>> Okay, so let's clear this. If I remember correct, you discussed this
>>>> with Mykyta in the previous version and suggested to put the vCPU to
>>>> sleep for 100ms.
>>>
>>> I don't think I did (except perhaps from a very abstract perspective),
>>> precisely because of ...
>>>
>>>> I don't think that this is a good idea, because guest
>>>> kernel will not be happy about that.
>>>
>>> ... this. Instead iirc I suggested to refuse (short-circuit) handling
>>> VF register accesses for the next 100ms.
>>
>> Do you have any suggestions on how to ensure that we accurately catch
>> the window where 100ms have already passed, but guests haven’t tried to
>> read anything yet, to flip this back? As I mentioned in the previous
>> version, Linux, for example, doesn’t attempt to re-read anything if the
>> first read failed after 100ms. So it appears to me that this approach
>> would be prone to racing with the guest for getting to the VF first.
When we do the write to the control register in Xen, our timer will start
ticking before the guest's. Hence our 100ms will be over (slightly)
earlier, and a well-behaved guest (having waited for the full 100ms
according to its own tracking) will be handled fine.
>> One
>> approach I can think of is to somehow swap the register handlers back
>> in-flight during the first read by the guest if 100ms have already
>> passed. However, this would still depend on Dom0 for registering VFs,
>> but in a more convoluted way. We also can’t add the VFs before 100ms
>> have passed and add timing checks to all register handlers, because
>> pci_add_device and everything below it expects the device to be
>> functional at the moment of addition.
I fear I'm not following this.
> We could maybe do some middle ground here, kind of similar to what
> Linux does. The overall idea would be to put on hold any accesses to
> the device(s) PCI config space for 100ms, that would include the PF
> and any VFs.
For the PF, at most parts of the SR-IOV capability should be thus
constrained, I think.
> At the point when VF enable is set Xen already knows the
> position of the VFs in the PCI config space.
>
> Any PCI config space access attempts to the PF or VFs during that
> 100ms window would cause the guest vCPU to be put on hold, and the
> access would only be retried once the 100ms window has passed and Xen
> has registered the VFs with vPCI. This approach needs extra logic to
> put vPCI accesses on hold, similar to what Xen does when mapping a BAR
> into the p2m, and a timer to defer the adding of the Vfs and the
> unlocking of the affected PCI config space region.
I was meaning to have this done in even simpler a way: Simply record
when the VFs were configured, and within the next 100ms terminate all
accesses (read all ones, discard writes).
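That simpler scheme reduces to a timestamp check in the access paths; a sketch (hypothetical helper names, not Xen code) could be:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define VF_SETTLE_NS 100000000ULL /* 100ms settle window */

/* True while the VFs are still within their settle window. */
static bool vf_settling(uint64_t enable_time, uint64_t now)
{
    return now - enable_time < VF_SETTLE_NS;
}

/* Read path: short-circuit to all ones while settling, otherwise
 * return the value obtained from the device (hw_val here). */
static uint32_t vf_read(uint64_t enable_time, uint64_t now,
                        uint32_t hw_val)
{
    return vf_settling(enable_time, now) ? ~0U : hw_val;
}

/* Write path: report whether the write would be forwarded to the
 * device (true) or silently discarded (false). */
static bool vf_write_allowed(uint64_t enable_time, uint64_t now)
{
    return !vf_settling(enable_time, now);
}
```

Since Xen records `enable_time` at its own write to the control register, its window necessarily expires no later than a guest's, matching the earlier timing argument.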
Jan
^ permalink raw reply [flat|nested] 35+ messages in thread
* Re: [PATCH v3 5/7] vpci: add SR-IOV support for PVH Dom0
2026-05-12 10:44 ` Jan Beulich
@ 2026-05-12 11:11 ` Roger Pau Monné
0 siblings, 0 replies; 35+ messages in thread
From: Roger Pau Monné @ 2026-05-12 11:11 UTC (permalink / raw)
To: Jan Beulich
Cc: Volodymyr Babchuk, Daniel P. Smith,
xen-devel@lists.xenproject.org, Stewart Hildebrand,
Mykyta Poturai
On Tue, May 12, 2026 at 12:44:48PM +0200, Jan Beulich wrote:
> On 12.05.2026 10:58, Roger Pau Monné wrote:
> > On Tue, May 12, 2026 at 07:32:20AM +0000, Mykyta Poturai wrote:
> >> On 5/12/26 09:20, Jan Beulich wrote:
> >>> On 11.05.2026 16:10, Volodymyr Babchuk wrote:
> >>>> Okay, so let's clear this. If I remember correctly, you discussed this
> >>>> with Mykyta in the previous version and suggested to put the vCPU to
> >>>> sleep for 100ms.
> >>>
> >>> I don't think I did (except perhaps from a very abstract perspective),
> >>> precisely because of ...
> >>>
> >>>> I don't think that this is a good idea, because guest
> >>>> kernel will not be happy about that.
> >>>
> >>> ... this. Instead iirc I suggested to refuse (short-circuit) handling
> >>> VF register accesses for the next 100ms.
> >>
> >> Do you have any suggestions on how to ensure that we accurately catch
> >> the window where 100ms have already passed, but guests haven’t tried to
> >> read anything yet, to flip this back? As I mentioned in the previous
> >> version, Linux, for example, doesn’t attempt to re-read anything if the
> >> first read failed after 100ms. So it appears to me that this approach
> >> would be prone to racing with the guest for getting to the VF first.
>
> When we do the write to the control register in Xen, our timer will start
> ticking before the guest's. Hence our 100ms will be over (slightly)
> earlier, and a well-behaved guest (having waited for the full 100ms
> according to its own tracking) will be handled fine.
>
> >> One
> >> approach I can think of is to somehow swap the register handlers back
> >> in-flight during the first read by the guest if 100ms have already
> >> passed. However, this would still depend on Dom0 for registering VFs,
> >> but in a more convoluted way. We also can’t add the VFs before 100ms
> >> have passed and add timing checks to all register handlers, because
> >> pci_add_device and everything below it expects the device to be
> >> functional at the moment of addition.
>
> I fear I'm not following this.
>
> > We could maybe do some middle ground here, kind of similar to what
> > Linux does. The overall idea would be to put on hold any accesses to
> > the device(s) PCI config space for 100ms, that would include the PF
> > and any VFs.
>
> For the PF, at most parts of the SR-IOV capability should be thus
> constrained, I think.
Linux blocks access to the whole device PCI config space, but that
might be simply because it's easier to implement that way on their
side. Certainly the spec doesn't mention any restriction in accessing
the PF config space during that window.
As a simpler approach we might want to reject write accesses to the
SR-IOV capability during that window.
> > At the point when VF enable is set Xen already knows the
> > position of the VFs in the PCI config space.
> >
> > Any PCI config space access attempts to the PF or VFs during that
> > 100ms window would cause the guest vCPU to be put on hold, and the
> > access would only be retried once the 100ms window has passed and Xen
> > has registered the VFs with vPCI. This approach needs extra logic to
> > put vPCI accesses on hold, similar to what Xen does when mapping a BAR
> > into the p2m, and a timer to defer the adding of the VFs and the
> > unlocking of the affected PCI config space region.
>
> I was meaning to have this done in even simpler a way: Simply record
> when the VFs were configured, and within the next 100ms terminate all
> accesses (read all ones, discard writes).
Hm, I thought about such an approach also; I was mostly worried that
some drivers might know the device has a shorter initialization time,
and hence attempt to access it before the 100ms window expires.
However, simply discarding accesses might be easier to implement
initially, and hence I would be fine with such an approach. We would
need to log any such discarded accesses during the init window.
Thanks, Roger.
* Re: [PATCH v3 5/7] vpci: add SR-IOV support for PVH Dom0
2026-05-12 10:28 ` Volodymyr Babchuk
@ 2026-05-12 13:22 ` Roger Pau Monné
0 siblings, 0 replies; 35+ messages in thread
From: Roger Pau Monné @ 2026-05-12 13:22 UTC (permalink / raw)
To: Volodymyr Babchuk
Cc: Mykyta Poturai, Jan Beulich, Daniel P. Smith,
xen-devel@lists.xenproject.org, Stewart Hildebrand
On Tue, May 12, 2026 at 10:28:05AM +0000, Volodymyr Babchuk wrote:
> Hi Roger,
>
> Roger Pau Monné <roger.pau@citrix.com> writes:
>
> > On Tue, May 12, 2026 at 07:32:20AM +0000, Mykyta Poturai wrote:
> >>
> >>
> >> On 5/12/26 09:20, Jan Beulich wrote:
> >> > On 11.05.2026 16:10, Volodymyr Babchuk wrote:
> >> >> Hi Jan,
> >> >>
> >> >> Jan Beulich <jbeulich@suse.com> writes:
> >> >>
> >> >>> On 07.05.2026 22:40, Volodymyr Babchuk wrote:
> >> >>>> Jan Beulich <jbeulich@suse.com> writes:
> >> >>>>> On 06.05.2026 11:39, Mykyta Poturai wrote:
> >> >>>>>> On 5/4/26 08:37, Jan Beulich wrote:
> >> >>>>>>> On 23.04.2026 12:12, Mykyta Poturai wrote:
> >> >>>>>>>> On 4/21/26 17:43, Jan Beulich wrote:
> >> >>>>>>>>> On 09.04.2026 16:01, Mykyta Poturai wrote:
> >> >>>>>>>>>> From: Stewart Hildebrand <stewart.hildebrand@amd.com>
> >> >>>>>>>>>>
> >> >>>>>>>>>> This code is expected to only be used by privileged domains,
> >> >>>>>>>>>> unprivileged domains should not get access to the SR-IOV capability.
> >> >>>>>>>>>>
> >> >>>>>>>>>> Implement RW handlers for PCI_SRIOV_CTRL register to dynamically
> >> >>>>>>>>>> map/unmap VF BARS. Recalculate BAR sizes before mapping VFs to account
> >> >>>>>>>>>> for possible changes in the system page size register. Also force VFs to
> >> >>>>>>>>>> always use emulated reads for command register, this is needed to
> >> >>>>>>>>>> prevent some drivers accidentally unmapping BARs.
> >> >>>>>>>>>
> >> >>>>>>>>> This apparently refers to the change to vpci_init_header(). Writes are
> >> >>>>>>>>> already intercepted. How would a read lead to accidental BAR unmap? Even
> >> >>>>>>>>> for writes I don't see how a VF driver could accidentally unmap BARs, as
> >> >>>>>>>>> the memory decode bit there is hardwired to 0.
> >> >>>>>>>>>
> >> >>>>>>>>>> Discovery of VFs is
> >> >>>>>>>>>> done by Dom0, which must register them with Xen.
> >> >>>>>>>>>
> >> >>>>>>>>> If we intercept control register writes, why would we still require
> >> >>>>>>>>> Dom0 to report the VFs that appear?
> >> >>>>>>>>>
> >> >>>>>>>>
> >> >>>>>>>> Sorry, I don't understand this question. You specifically requested this
> >> >>>>>>>> to be done this way in V2. Quoting your reply from V2 below.
> >> >>>>>>>>
> >> >>>>>>>> > Aren't you effectively busy-waiting for these 100ms, by simply
> >> >>>>>>>> returning "true"
> >> >>>>>>>> > from vpci_process_pending() until the time has passed? This imo is a
> >> >>>>>>>> no-go. You
> >> >>>>>>>> > want to set a timer and put the vCPU to sleep, to wake it up again
> >> >>>>>>>> when the
> >> >>>>>>>> > timer has expired. That'll then eliminate the need for the
> >> >>>>>>>> not-so-nice patch 4.
> >> >>>>>>>>
> >> >>>>>>>> > Question is whether we need to actually go this far (right away). I
> >> >>>>>>>> expect you
> >> >>>>>>>> > don't mean to hand PFs to DomU-s. As long as we keep them in the hardware
> >> >>>>>>>> > domain, can't we trust it to set things up correctly, just like we
> >> >>>>>>>> trust it in
> >> >>>>>>>> > a number of other aspects?
> >> >>>>>>>
> >> >>>>>>> How's any of this related to the question I raised here, or your reply
> >> >>>>>>> thereto? If we intercept PCI_SRIOV_CTRL, we know when VFs are created.
> >> >>>>>>> Why still demand Dom0 to report them then?
> >> >>>>>>>
> >> >>>>>>
> >> >>>>>> The spec states that VFs can take up to 100ms after the VF_ENABLE bit is
> >> >>>>>> set to become alive. We discussed in the V2 that it is not acceptable to
> >> >>>>>> do a required 100ms wait in Xen while blocking a domain. And not doing
> >> >>>>>> that blocking would require some mechanism to only allow a domain to run
> >> >>>>>> for precisely 99(or more?)ms. You yourself suggested that we can trust
> >> >>>>>> the hardware domain with registering VFs if we already trust it with
> >> >>>>>> other PCI-related stuff. Did you change your mind, or am I completely
> >> >>>>>> misunderstanding this question?
> >> >>>>>
> >> >>>>> No, I still think that we can trust hwdom enough. Nevertheless we should
> >> >>>>> aim at being independent of it where possible. And I seem to recall that
> >> >>>>> I had also outlined an approach how to avoid spin-waiting for 100ms in
> >> >>>>> the hypervisor.
> >> >>>>
> >> >>>> I want to clarify: you are telling that Xen should not wait for hwdom to
> >> >>>> report VFs and instead create them by itself. Is this correct?
> >> >>>
> >> >>> If that's technically possible, yes.
> >> >>
> >> >> Okay, so let's clear this. If I remember correctly, you discussed this
> >> >> with Mykyta in the previous version and suggested to put the vCPU to
> >> >> sleep for 100ms.
> >> >
> >> > I don't think I did (except perhaps from a very abstract perspective),
> >> > precisely because of ...
> >> >
> >> >> I don't think that this is a good idea, because guest
> >> >> kernel will not be happy about that.
> >> >
> >> > ... this. Instead iirc I suggested to refuse (short-circuit) handling
> >> > VF register accesses for the next 100ms.
> >> >
> >> > Jan
> >>
> >> Do you have any suggestions on how to ensure that we accurately catch
> >> the window where 100ms have already passed, but guests haven’t tried to
> >> read anything yet, to flip this back? As I mentioned in the previous
> >> version, Linux, for example, doesn’t attempt to re-read anything if the
> >> first read failed after 100ms. So it appears to me that this approach
> >> would be prone to racing with the guest for getting to the VF first. One
> >> approach I can think of is to somehow swap the register handlers back
> >> in-flight during the first read by the guest if 100ms have already
> >> passed. However, this would still depend on Dom0 for registering VFs,
> >> but in a more convoluted way. We also can’t add the VFs before 100ms
> >> have passed and add timing checks to all register handlers, because
> >> pci_add_device and everything below it expects the device to be
> >> functional at the moment of addition.
> >>
> >>
> >>
> >> Maybe you see some other way to avoid these problems that I am missing?
> >
> > We could maybe do some middle ground here, kind of similar to what
> > Linux does. The overall idea would be to put on hold any accesses to
> > the device(s) PCI config space for 100ms, that would include the PF
> > and any VFs. At the point when VF enable is set Xen already knows the
> > position of the VFs in the PCI config space.
> >
> > Any PCI config space access attempts to the PF or VFs during that
> > 100ms window would cause the guest vCPU to be put on hold, and the
> > access would only be retried once the 100ms window has passed and Xen
> > has registered the VFs with vPCI. This approach needs extra logic to
> > put vPCI accesses on hold, similar to what Xen does when mapping a BAR
> > into the p2m, and a timer to defer the adding of the VFs and the
> > unlocking of the affected PCI config space region.
> >
> > That would be a middle ground IMO, as the guest vCPUs could be running
> > freely, unless accesses to the affected PCI config space was attempted
> > before the 100ms window, at which point they would be blocked waiting
> > for the timeout to expire. A well-behaved domain shouldn't try to
> > access the PCI config space either ahead of the 100ms window expiring.
>
> This approach seems reasonable to me, but it would require big
> changes in the vPCI logic, as pci_add_device() now needs the ability
> to defer all config space accesses until the VFs are ready, and in
> the meantime we'll have to deal with a half-initialized pdev. The
> PCI/vPCI logic is already convoluted enough, and adding more
> intermediate states, which need to be dealt with in different places,
> will make things even worse. Unless I miss some easy fix, of course...
You could just defer the pci_add_device() call until after the 100ms
window from enabling the VFs has passed? The only requirement would be
blocking PCI config space accesses to the VFs during those 100ms. You
could use a bitmap to signal which SBDFs should be rejected in
vpci_{read,write}. There's already logic in those functions to reject
accesses.
> What I am trying to say is that your suggestion is technically doable,
> but requires lots of work, and we don't have the resources for this
> right now. So, what's your opinion on the existing approach? Is
> relying on a domain to introduce VFs such a bad idea?
IMO yes, requiring the usage of a hypercall (or any other side-band
interface) when not strictly required is just adding a handicap for
OSes that then need to be ported to run on Xen. It might seem easier
at first, but adding and maintaining such side-band interfaces in OSes
tends to be cumbersome and prone to errors.
We could have avoided introducing vPCI at the cost of adding a
completely new side-band interface to manage PCI config space
accesses and interrupts on x86 PVH, yet we didn't do it because,
albeit easier to implement from the Xen side, it would have had a huge
cost for OSes that would want to run in PVH mode.
Thanks, Roger.
end of thread, other threads:[~2026-05-12 13:23 UTC | newest]
Thread overview: 35+ messages -- links below jump to the message on this page --
2026-04-09 14:01 [PATCH v3 0/7] Implement SR-IOV support for PVH Mykyta Poturai
2026-04-09 14:01 ` [PATCH v3 2/7] vpci: rename and export vpci_guest_mem_bar_{read,write} Mykyta Poturai
2026-04-22 10:27 ` Roger Pau Monné
2026-04-09 14:01 ` [PATCH v3 1/7] vpci: rename and export vpci_modify_bars Mykyta Poturai
2026-04-09 14:01 ` [PATCH v3 4/7] vpci: allow queueing of mapping operations Mykyta Poturai
2026-04-22 11:38 ` Roger Pau Monné
2026-04-09 14:01 ` [PATCH v3 3/7] vpci: Use pervcpu ranges for BAR mapping Mykyta Poturai
2026-04-22 11:00 ` Roger Pau Monné
2026-04-22 12:04 ` Jan Beulich
2026-04-22 14:20 ` Roger Pau Monné
2026-04-09 14:01 ` [PATCH v3 6/7] vpci: add SR-IOV support for DomUs Mykyta Poturai
2026-04-21 14:55 ` Jan Beulich
2026-04-24 6:34 ` Mykyta Poturai
2026-04-09 14:01 ` [PATCH v3 5/7] vpci: add SR-IOV support for PVH Dom0 Mykyta Poturai
2026-04-09 15:27 ` Daniel P. Smith
2026-04-21 14:43 ` Jan Beulich
2026-04-23 10:12 ` Mykyta Poturai
2026-05-04 5:37 ` Jan Beulich
2026-05-06 9:39 ` Mykyta Poturai
2026-05-06 11:54 ` Jan Beulich
2026-05-07 20:40 ` Volodymyr Babchuk
2026-05-08 5:52 ` Jan Beulich
2026-05-11 14:10 ` Volodymyr Babchuk
2026-05-12 6:20 ` Jan Beulich
2026-05-12 7:32 ` Mykyta Poturai
2026-05-12 8:58 ` Roger Pau Monné
2026-05-12 10:28 ` Volodymyr Babchuk
2026-05-12 13:22 ` Roger Pau Monné
2026-05-12 10:44 ` Jan Beulich
2026-05-12 11:11 ` Roger Pau Monné
2026-05-08 5:52 ` Jan Beulich
2026-04-22 14:19 ` Roger Pau Monné
2026-04-28 20:05 ` Stewart Hildebrand
2026-04-09 14:01 ` [PATCH v3 7/7] docs: Update SR-IOV support status Mykyta Poturai
2026-04-21 14:56 ` Jan Beulich