* [PATCH v2 0/8] Implement SR-IOV support for PVH
@ 2026-03-09 11:08 Mykyta Poturai
  2026-03-09 11:08 ` [PATCH v2 1/8] vpci: rename and export vpci_modify_bars Mykyta Poturai
                   ` (7 more replies)
  0 siblings, 8 replies; 29+ messages in thread
From: Mykyta Poturai @ 2026-03-09 11:08 UTC (permalink / raw)
  To: xen-devel@lists.xenproject.org
  Cc: Mykyta Poturai, Roger Pau Monné, Stewart Hildebrand,
	Andrew Cooper, Anthony PERARD, Michal Orzel, Jan Beulich,
	Julien Grall, Stefano Stabellini, Daniel P. Smith

This series enables support for the PCI SR-IOV capability for PVH domains.
It allows Dom0 to enable and use SR-IOV virtual functions, and for these
functions to be passed through to guests.

To achieve this, handlers for the SR-IOV control registers, simplified
handlers for VF headers, and scanning for newly added VFs were implemented
in Xen.

Core functionality is based on previous work[1].

Tested on an R-Car Spider board with a Samsung NVMe SSD Controller 980 and
an Intel X550T Ethernet card.

[1]: https://lore.kernel.org/xen-devel/20180717094830.54806-1-roger.pau@citrix.com/

v1->v2:
* rework the series for VF discovery in Xen
* separate doc changes into the last patch


Mykyta Poturai (4):
  vpci: Use pervcpu ranges for BAR mapping
  vpci: add a wait operation to the vpci vcpu pending actions
  pci/iommu: Check that IOMMU supports removing devices
  docs: Update SR-IOV support status

Stewart Hildebrand (4):
  vpci: rename and export vpci_modify_bars
  vpci: rename and export vpci_guest_mem_bar_{read,write}
  vpci: add SR-IOV support for PVH Dom0
  vpci: add SR-IOV support for DomUs

 SUPPORT.md                    |   2 -
 xen/common/domain.c           |  24 ++
 xen/drivers/passthrough/pci.c |   6 +
 xen/drivers/vpci/Makefile     |   2 +-
 xen/drivers/vpci/header.c     | 226 +++++++++---------
 xen/drivers/vpci/sriov.c      | 420 ++++++++++++++++++++++++++++++++++
 xen/drivers/vpci/vpci.c       |   3 -
 xen/include/xen/vpci.h        |  42 +++-
 8 files changed, 608 insertions(+), 117 deletions(-)
 create mode 100644 xen/drivers/vpci/sriov.c

-- 
2.51.2

^ permalink raw reply	[flat|nested] 29+ messages in thread

* [PATCH v2 1/8] vpci: rename and export vpci_modify_bars
  2026-03-09 11:08 [PATCH v2 0/8] Implement SR-IOV support for PVH Mykyta Poturai
@ 2026-03-09 11:08 ` Mykyta Poturai
  2026-03-16 21:03   ` Stewart Hildebrand
  2026-03-09 11:08 ` [PATCH v2 3/8] vpci: Use pervcpu ranges for BAR mapping Mykyta Poturai
                   ` (6 subsequent siblings)
  7 siblings, 1 reply; 29+ messages in thread
From: Mykyta Poturai @ 2026-03-09 11:08 UTC (permalink / raw)
  To: xen-devel@lists.xenproject.org
  Cc: Stewart Hildebrand, Roger Pau Monné, Mykyta Poturai,
	Teddy Astie

From: Stewart Hildebrand <stewart.hildebrand@amd.com>

Export functions required for SR-IOV support.

Signed-off-by: Stewart Hildebrand <stewart.hildebrand@amd.com>
Signed-off-by: Mykyta Poturai <mykyta_poturai@epam.com>
Reviewed-by: Teddy Astie <teddy.astie@vates.tech>

---
v1->v2:
* Collect RBs
---
 xen/drivers/vpci/header.c | 16 +++++++++-------
 xen/include/xen/vpci.h    |  3 +++
 2 files changed, 12 insertions(+), 7 deletions(-)

diff --git a/xen/drivers/vpci/header.c b/xen/drivers/vpci/header.c
index 739a5f610e..5202518e83 100644
--- a/xen/drivers/vpci/header.c
+++ b/xen/drivers/vpci/header.c
@@ -304,7 +304,7 @@ static void defer_map(const struct pci_dev *pdev, uint16_t cmd, bool rom_only)
     raise_softirq(SCHEDULE_SOFTIRQ);
 }
 
-static int modify_bars(const struct pci_dev *pdev, uint16_t cmd, bool rom_only)
+int vpci_modify_bars(const struct pci_dev *pdev, uint16_t cmd, bool rom_only)
 {
     struct vpci_header *header = &pdev->vpci->header;
     struct pci_dev *tmp;
@@ -545,7 +545,7 @@ static void cf_check cmd_write(
          * memory decoding bit has not been changed, so leave everything as-is,
          * hoping the guest will realize and try again.
          */
-        modify_bars(pdev, cmd, false);
+        vpci_modify_bars(pdev, cmd, false);
     else
         pci_conf_write16(pdev->sbdf, reg, cmd);
 }
@@ -713,13 +713,15 @@ static void cf_check rom_write(
      * Pass PCI_COMMAND_MEMORY or 0 to signal a map/unmap request, note that
      * this fabricated command is never going to be written to the register.
      */
-    else if ( modify_bars(pdev, new_enabled ? PCI_COMMAND_MEMORY : 0, true) )
+    else if ( vpci_modify_bars(pdev,
+                               new_enabled ? PCI_COMMAND_MEMORY : 0,
+                               true) )
         /*
          * No memory has been added or removed from the p2m (because the actual
          * p2m changes are deferred in defer_map) and the ROM enable bit has
          * not been changed, so leave everything as-is, hoping the guest will
          * realize and try again. It's important to not update rom->addr in the
-         * unmap case if modify_bars has failed, or future attempts would
+         * unmap case if vpci_modify_bars has failed, or future attempts would
          * attempt to unmap the wrong address.
          */
         return;
@@ -933,8 +935,8 @@ int vpci_init_header(struct pci_dev *pdev)
     /*
      * For DomUs, clear PCI_COMMAND_{MASTER,MEMORY,IO} and other
      * DomU-controllable bits in PCI_COMMAND. Devices assigned to DomUs will
-     * start with memory decoding disabled, and modify_bars() will not be called
-     * at the end of this function.
+     * start with memory decoding disabled, and vpci_modify_bars() will not be
+     * called at the end of this function.
      */
     if ( !is_hwdom )
         cmd &= ~(PCI_COMMAND_VGA_PALETTE | PCI_COMMAND_INVALIDATE |
@@ -1059,7 +1061,7 @@ int vpci_init_header(struct pci_dev *pdev)
             goto fail;
     }
 
-    return (cmd & PCI_COMMAND_MEMORY) ? modify_bars(pdev, cmd, false) : 0;
+    return (cmd & PCI_COMMAND_MEMORY) ? vpci_modify_bars(pdev, cmd, false) : 0;
 
  fail:
     pci_conf_write16(pdev->sbdf, PCI_COMMAND, cmd);
diff --git a/xen/include/xen/vpci.h b/xen/include/xen/vpci.h
index d6310104ea..a98ddbb32e 100644
--- a/xen/include/xen/vpci.h
+++ b/xen/include/xen/vpci.h
@@ -295,6 +295,9 @@ bool vpci_ecam_write(pci_sbdf_t sbdf, unsigned int reg, unsigned int len,
 bool vpci_ecam_read(pci_sbdf_t sbdf, unsigned int reg, unsigned int len,
                     unsigned long *data);
 
+/* Map/unmap the BARs of a vPCI device. */
+int vpci_modify_bars(const struct pci_dev *pdev, uint16_t cmd, bool rom_only);
+
 #endif /* __XEN__ */
 
 #else /* !CONFIG_HAS_VPCI */
-- 
2.51.2



* [PATCH v2 2/8] vpci: rename and export vpci_guest_mem_bar_{read,write}
  2026-03-09 11:08 [PATCH v2 0/8] Implement SR-IOV support for PVH Mykyta Poturai
  2026-03-09 11:08 ` [PATCH v2 1/8] vpci: rename and export vpci_modify_bars Mykyta Poturai
  2026-03-09 11:08 ` [PATCH v2 3/8] vpci: Use pervcpu ranges for BAR mapping Mykyta Poturai
@ 2026-03-09 11:08 ` Mykyta Poturai
  2026-03-16 21:04   ` Stewart Hildebrand
  2026-03-09 11:08 ` [PATCH v2 4/8] vpci: add a wait operation to the vpci vcpu pending actions Mykyta Poturai
                   ` (4 subsequent siblings)
  7 siblings, 1 reply; 29+ messages in thread
From: Mykyta Poturai @ 2026-03-09 11:08 UTC (permalink / raw)
  To: xen-devel@lists.xenproject.org
  Cc: Stewart Hildebrand, Roger Pau Monné, Mykyta Poturai,
	Teddy Astie

From: Stewart Hildebrand <stewart.hildebrand@amd.com>

Export functions required for SR-IOV support.

Signed-off-by: Stewart Hildebrand <stewart.hildebrand@amd.com>
Signed-off-by: Mykyta Poturai <mykyta_poturai@epam.com>
Reviewed-by: Teddy Astie <teddy.astie@vates.tech>
---
v1->v2:
* collect RBs
---
 xen/drivers/vpci/header.c | 20 +++++++++++---------
 xen/include/xen/vpci.h    |  6 ++++++
 2 files changed, 17 insertions(+), 9 deletions(-)

diff --git a/xen/drivers/vpci/header.c b/xen/drivers/vpci/header.c
index 5202518e83..07ec991a12 100644
--- a/xen/drivers/vpci/header.c
+++ b/xen/drivers/vpci/header.c
@@ -610,9 +610,9 @@ static void cf_check bar_write(
     pci_conf_write32(pdev->sbdf, reg, val);
 }
 
-static void cf_check guest_mem_bar_write(const struct pci_dev *pdev,
-                                         unsigned int reg, uint32_t val,
-                                         void *data)
+void cf_check vpci_guest_mem_bar_write(const struct pci_dev *pdev,
+                                       unsigned int reg, uint32_t val,
+                                       void *data)
 {
     struct vpci_bar *bar = data;
     bool hi = false;
@@ -652,8 +652,8 @@ static void cf_check guest_mem_bar_write(const struct pci_dev *pdev,
     bar->guest_addr = guest_addr;
 }
 
-static uint32_t cf_check guest_mem_bar_read(const struct pci_dev *pdev,
-                                            unsigned int reg, void *data)
+uint32_t cf_check vpci_guest_mem_bar_read(const struct pci_dev *pdev,
+                                          unsigned int reg, void *data)
 {
     const struct vpci_bar *bar = data;
     uint32_t reg_val;
@@ -959,8 +959,9 @@ int vpci_init_header(struct pci_dev *pdev)
             bars[i].type = VPCI_BAR_MEM64_HI;
             rc = vpci_add_register(pdev->vpci,
                                    is_hwdom ? vpci_hw_read32
-                                            : guest_mem_bar_read,
-                                   is_hwdom ? bar_write : guest_mem_bar_write,
+                                            : vpci_guest_mem_bar_read,
+                                   is_hwdom ? bar_write
+                                            : vpci_guest_mem_bar_write,
                                    reg, 4, &bars[i]);
             if ( rc )
                 goto fail;
@@ -1018,8 +1019,9 @@ int vpci_init_header(struct pci_dev *pdev)
         bars[i].prefetchable = val & PCI_BASE_ADDRESS_MEM_PREFETCH;
 
         rc = vpci_add_register(pdev->vpci,
-                               is_hwdom ? vpci_hw_read32 : guest_mem_bar_read,
-                               is_hwdom ? bar_write : guest_mem_bar_write,
+                               is_hwdom ? vpci_hw_read32
+                                        : vpci_guest_mem_bar_read,
+                               is_hwdom ? bar_write : vpci_guest_mem_bar_write,
                                reg, 4, &bars[i]);
         if ( rc )
             goto fail;
diff --git a/xen/include/xen/vpci.h b/xen/include/xen/vpci.h
index a98ddbb32e..dd233b8b03 100644
--- a/xen/include/xen/vpci.h
+++ b/xen/include/xen/vpci.h
@@ -78,6 +78,12 @@ void vpci_write(pci_sbdf_t sbdf, unsigned int reg, unsigned int size,
 uint32_t cf_check vpci_read_val(
     const struct pci_dev *pdev, unsigned int reg, void *data);
 
+void cf_check vpci_guest_mem_bar_write(const struct pci_dev *pdev,
+                                       unsigned int reg, uint32_t val,
+                                       void *data);
+uint32_t cf_check vpci_guest_mem_bar_read(const struct pci_dev *pdev,
+                                          unsigned int reg, void *data);
+
 /* Passthrough handlers. */
 uint32_t cf_check vpci_hw_read8(
     const struct pci_dev *pdev, unsigned int reg, void *data);
-- 
2.51.2



* [PATCH v2 4/8] vpci: add a wait operation to the vpci vcpu pending actions
  2026-03-09 11:08 [PATCH v2 0/8] Implement SR-IOV support for PVH Mykyta Poturai
                   ` (2 preceding siblings ...)
  2026-03-09 11:08 ` [PATCH v2 2/8] vpci: rename and export vpci_guest_mem_bar_{read,write} Mykyta Poturai
@ 2026-03-09 11:08 ` Mykyta Poturai
  2026-03-17 14:02   ` Stewart Hildebrand
                     ` (2 more replies)
  2026-03-09 11:08 ` [PATCH v2 6/8] vpci: add SR-IOV support for PVH Dom0 Mykyta Poturai
                   ` (3 subsequent siblings)
  7 siblings, 3 replies; 29+ messages in thread
From: Mykyta Poturai @ 2026-03-09 11:08 UTC (permalink / raw)
  To: xen-devel@lists.xenproject.org
  Cc: Mykyta Poturai, Roger Pau Monné, Stewart Hildebrand

This allows waiting for a specified amount of time on the vCPU. Once the
wait has finished, a callback is executed.

Note that this is not yet used; it is introduced here to reduce the
complexity of the patches that actually make use of it.

Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
Signed-off-by: Mykyta Poturai <mykyta_poturai@epam.com>
---
v1->v2:
* new patch
---
 xen/drivers/vpci/header.c | 125 ++++++++++++++++++++++----------------
 xen/include/xen/vpci.h    |  19 ++++++
 2 files changed, 90 insertions(+), 54 deletions(-)

diff --git a/xen/drivers/vpci/header.c b/xen/drivers/vpci/header.c
index cb64d9b9fc..284964f0d4 100644
--- a/xen/drivers/vpci/header.c
+++ b/xen/drivers/vpci/header.c
@@ -175,76 +175,92 @@ static void modify_decoding(const struct pci_dev *pdev, uint16_t cmd,
 
 bool vpci_process_pending(struct vcpu *v)
 {
-    const struct pci_dev *pdev = v->vpci.pdev;
-    struct vpci_header *header = NULL;
-    unsigned int i;
-
-    if ( !pdev )
-        return false;
-
-    read_lock(&v->domain->pci_lock);
-
-    if ( !pdev->vpci || (v->domain != pdev->domain) )
+    switch ( v->vpci.task )
     {
-        v->vpci.pdev = NULL;
-        read_unlock(&v->domain->pci_lock);
-        return false;
-    }
-
-    header = &pdev->vpci->header;
-    for ( i = 0; i < ARRAY_SIZE(header->bars); i++ )
+    case MODIFY_MEMORY:
     {
-        struct vpci_bar *bar = &header->bars[i];
-        struct rangeset *mem = v->vpci.bar_mem[i];
-        struct map_data data = {
-            .d = v->domain,
-            .map = v->vpci.cmd & PCI_COMMAND_MEMORY,
-            .bar = bar,
-        };
-        int rc;
+        const struct pci_dev *pdev = v->vpci.memory.pdev;
+        struct vpci_header *header = NULL;
+        unsigned int i;
 
-        if ( rangeset_is_empty(mem) )
-            continue;
+        if ( !pdev )
+            break;
 
-        rc = rangeset_consume_ranges(mem, map_range, &data);
+        read_lock(&v->domain->pci_lock);
 
-        if ( rc == -ERESTART )
+        if ( !pdev->vpci || (v->domain != pdev->domain) )
         {
+            v->vpci.memory.pdev = NULL;
             read_unlock(&v->domain->pci_lock);
-            return true;
+            break;
         }
 
-        if ( rc )
+        header = &pdev->vpci->header;
+        for ( i = 0; i < ARRAY_SIZE(header->bars); i++ )
         {
-            spin_lock(&pdev->vpci->lock);
-            /* Disable memory decoding unconditionally on failure. */
-            modify_decoding(pdev, v->vpci.cmd & ~PCI_COMMAND_MEMORY,
-                            false);
-            spin_unlock(&pdev->vpci->lock);
+            struct vpci_bar *bar = &header->bars[i];
+            struct rangeset *mem = v->vpci.bar_mem[i];
+            struct map_data data = {
+                .d = v->domain,
+                .map = v->vpci.memory.cmd & PCI_COMMAND_MEMORY,
+                .bar = bar,
+            };
+            int rc;
+
+            if ( rangeset_is_empty(mem) )
+                continue;
 
-            /* Clean all the rangesets */
-            for ( i = 0; i < ARRAY_SIZE(header->bars); i++ )
-                if ( !rangeset_is_empty(v->vpci.bar_mem[i]) )
-                     rangeset_purge(v->vpci.bar_mem[i]);
+            rc = rangeset_consume_ranges(mem, map_range, &data);
 
-            v->vpci.pdev = NULL;
+            if ( rc == -ERESTART )
+            {
+                read_unlock(&v->domain->pci_lock);
+                return true;
+            }
 
-            read_unlock(&v->domain->pci_lock);
+            if ( rc )
+            {
+                spin_lock(&pdev->vpci->lock);
+                /* Disable memory decoding unconditionally on failure. */
+                modify_decoding(pdev, v->vpci.memory.cmd & ~PCI_COMMAND_MEMORY,
+                                false);
+                spin_unlock(&pdev->vpci->lock);
+
+                /* Clean all the rangesets */
+                for ( i = 0; i < ARRAY_SIZE(header->bars); i++ )
+                    if ( !rangeset_is_empty(v->vpci.bar_mem[i]) )
+                        rangeset_purge(v->vpci.bar_mem[i]);
+
+                v->vpci.memory.pdev = NULL;
+
+                read_unlock(&v->domain->pci_lock);
 
-            if ( !is_hardware_domain(v->domain) )
-                domain_crash(v->domain);
+                if ( !is_hardware_domain(v->domain) )
+                    domain_crash(v->domain);
 
-            return false;
+                break;
+            }
         }
-    }
-    v->vpci.pdev = NULL;
+        v->vpci.memory.pdev = NULL;
 
-    spin_lock(&pdev->vpci->lock);
-    modify_decoding(pdev, v->vpci.cmd, v->vpci.rom_only);
-    spin_unlock(&pdev->vpci->lock);
+        spin_lock(&pdev->vpci->lock);
+        modify_decoding(pdev, v->vpci.memory.cmd, v->vpci.memory.rom_only);
+        spin_unlock(&pdev->vpci->lock);
 
-    read_unlock(&v->domain->pci_lock);
+        read_unlock(&v->domain->pci_lock);
+
+        break;
+    }
+    case WAIT:
+        if ( NOW() < v->vpci.wait.end )
+            return true;
+        v->vpci.wait.callback(v->vpci.wait.data);
+        break;
+    case NONE:
+        return false;
+    }
 
+    v->vpci.task = NONE;
     return false;
 }
 
@@ -295,9 +311,10 @@ static void defer_map(const struct pci_dev *pdev, uint16_t cmd, bool rom_only)
      * is mapped. This can lead to parallel mapping operations being
      * started for the same device if the domain is not well-behaved.
      */
-    curr->vpci.pdev = pdev;
-    curr->vpci.cmd = cmd;
-    curr->vpci.rom_only = rom_only;
+    curr->vpci.memory.pdev = pdev;
+    curr->vpci.memory.cmd = cmd;
+    curr->vpci.memory.rom_only = rom_only;
+    curr->vpci.task = MODIFY_MEMORY;
     /*
      * Raise a scheduler softirq in order to prevent the guest from resuming
      * execution with pending mapping operations, to trigger the invocation
diff --git a/xen/include/xen/vpci.h b/xen/include/xen/vpci.h
index fa654545e5..47cdb54d42 100644
--- a/xen/include/xen/vpci.h
+++ b/xen/include/xen/vpci.h
@@ -212,7 +212,26 @@ struct vpci_vcpu {
     /* Per-vcpu structure to store state while {un}mapping of PCI BARs. */
     const struct pci_dev *pdev;
 #ifdef __XEN__
+    enum {
+        NONE,
+        MODIFY_MEMORY,
+        WAIT,
+    } task;
     struct rangeset *bar_mem[PCI_HEADER_NORMAL_NR_BARS + 1];
+    union {
+        struct {
+            /* Store state while {un}mapping of PCI BARs. */
+            const struct pci_dev *pdev;
+            uint16_t cmd;
+            bool rom_only : 1;
+        } memory;
+        struct {
+            /* Store wait state. */
+            s_time_t end;
+            void (*callback)(void *);
+            void *data;
+        } wait;
+    };
 #endif
     uint16_t cmd;
     bool rom_only : 1;
-- 
2.51.2


* [PATCH v2 3/8] vpci: Use pervcpu ranges for BAR mapping
  2026-03-09 11:08 [PATCH v2 0/8] Implement SR-IOV support for PVH Mykyta Poturai
  2026-03-09 11:08 ` [PATCH v2 1/8] vpci: rename and export vpci_modify_bars Mykyta Poturai
@ 2026-03-09 11:08 ` Mykyta Poturai
  2026-03-16 21:36   ` Stewart Hildebrand
                     ` (2 more replies)
  2026-03-09 11:08 ` [PATCH v2 2/8] vpci: rename and export vpci_guest_mem_bar_{read,write} Mykyta Poturai
                   ` (5 subsequent siblings)
  7 siblings, 3 replies; 29+ messages in thread
From: Mykyta Poturai @ 2026-03-09 11:08 UTC (permalink / raw)
  To: xen-devel@lists.xenproject.org
  Cc: Mykyta Poturai, Andrew Cooper, Anthony PERARD, Michal Orzel,
	Jan Beulich, Julien Grall, Roger Pau Monné,
	Stefano Stabellini, Stewart Hildebrand

There is no need to store ranges for each PCI device, as they are only
used during the mapping/unmapping process and can be reused across
devices. This also avoids having to allocate and destroy rangesets for
each device.

Signed-off-by: Mykyta Poturai <mykyta_poturai@epam.com>
---
v1->v2:
* new patch
---
 xen/common/domain.c       | 24 ++++++++++++++
 xen/drivers/vpci/header.c | 66 ++++++++++++++-------------------------
 xen/drivers/vpci/vpci.c   |  3 --
 xen/include/xen/vpci.h    |  4 ++-
 4 files changed, 51 insertions(+), 46 deletions(-)

diff --git a/xen/common/domain.c b/xen/common/domain.c
index e06174fca7..76b0163616 100644
--- a/xen/common/domain.c
+++ b/xen/common/domain.c
@@ -454,6 +454,14 @@ static int vcpu_teardown(struct vcpu *v)
  */
 static void vcpu_destroy(struct vcpu *v)
 {
+#ifdef CONFIG_HAS_VPCI
+    int i;
+
+    for ( i = 0; i < ARRAY_SIZE(v->vpci.bar_mem); i++ )
+        if ( v->vpci.bar_mem[i] )
+            rangeset_destroy(v->vpci.bar_mem[i]);
+
+#endif
     free_vcpu_struct(v);
 }
 
@@ -511,6 +519,22 @@ struct vcpu *vcpu_create(struct domain *d, unsigned int vcpu_id)
     if ( arch_vcpu_create(v) != 0 )
         goto fail_sched;
 
+#ifdef CONFIG_HAS_VPCI
+    {
+        int i;
+
+        for ( i = 0; i < ARRAY_SIZE(v->vpci.bar_mem); i++ )
+        {
+            char str[32];
+
+            snprintf(str, sizeof(str), "%pv:BAR%u", v, i);
+            v->vpci.bar_mem[i] = rangeset_new(d, str, RANGESETF_no_print);
+            if ( !v->vpci.bar_mem[i] )
+                goto fail_sched;
+        }
+    }
+#endif
+
     d->vcpu[vcpu_id] = v;
     if ( vcpu_id != 0 )
     {
diff --git a/xen/drivers/vpci/header.c b/xen/drivers/vpci/header.c
index 07ec991a12..cb64d9b9fc 100644
--- a/xen/drivers/vpci/header.c
+++ b/xen/drivers/vpci/header.c
@@ -195,6 +195,7 @@ bool vpci_process_pending(struct vcpu *v)
     for ( i = 0; i < ARRAY_SIZE(header->bars); i++ )
     {
         struct vpci_bar *bar = &header->bars[i];
+        struct rangeset *mem = v->vpci.bar_mem[i];
         struct map_data data = {
             .d = v->domain,
             .map = v->vpci.cmd & PCI_COMMAND_MEMORY,
@@ -202,10 +203,10 @@ bool vpci_process_pending(struct vcpu *v)
         };
         int rc;
 
-        if ( rangeset_is_empty(bar->mem) )
+        if ( rangeset_is_empty(mem) )
             continue;
 
-        rc = rangeset_consume_ranges(bar->mem, map_range, &data);
+        rc = rangeset_consume_ranges(mem, map_range, &data);
 
         if ( rc == -ERESTART )
         {
@@ -223,8 +224,8 @@ bool vpci_process_pending(struct vcpu *v)
 
             /* Clean all the rangesets */
             for ( i = 0; i < ARRAY_SIZE(header->bars); i++ )
-                if ( !rangeset_is_empty(header->bars[i].mem) )
-                     rangeset_purge(header->bars[i].mem);
+                if ( !rangeset_is_empty(v->vpci.bar_mem[i]) )
+                     rangeset_purge(v->vpci.bar_mem[i]);
 
             v->vpci.pdev = NULL;
 
@@ -259,13 +260,14 @@ static int __init apply_map(struct domain *d, const struct pci_dev *pdev,
     for ( i = 0; i < ARRAY_SIZE(header->bars); i++ )
     {
         struct vpci_bar *bar = &header->bars[i];
+        struct rangeset *mem = current->vpci.bar_mem[i];
         struct map_data data = { .d = d, .map = true, .bar = bar };
 
-        if ( rangeset_is_empty(bar->mem) )
+        if ( rangeset_is_empty(mem) )
             continue;
 
-        while ( (rc = rangeset_consume_ranges(bar->mem, map_range,
-                                              &data)) == -ERESTART )
+        while ( (rc = rangeset_consume_ranges(mem, map_range, &data)) ==
+                -ERESTART )
         {
             /*
              * It's safe to drop and reacquire the lock in this context
@@ -330,12 +332,13 @@ int vpci_modify_bars(const struct pci_dev *pdev, uint16_t cmd, bool rom_only)
     for ( i = 0; i < ARRAY_SIZE(header->bars); i++ )
     {
         struct vpci_bar *bar = &header->bars[i];
+        struct rangeset *mem = current->vpci.bar_mem[i];
         unsigned long start = PFN_DOWN(bar->addr);
         unsigned long end = PFN_DOWN(bar->addr + bar->size - 1);
         unsigned long start_guest = PFN_DOWN(bar->guest_addr);
         unsigned long end_guest = PFN_DOWN(bar->guest_addr + bar->size - 1);
 
-        if ( !bar->mem )
+        if ( !mem )
             continue;
 
         if ( !MAPPABLE_BAR(bar) ||
@@ -353,7 +356,7 @@ int vpci_modify_bars(const struct pci_dev *pdev, uint16_t cmd, bool rom_only)
             continue;
         }
 
-        ASSERT(rangeset_is_empty(bar->mem));
+        ASSERT(rangeset_is_empty(mem));
 
         /*
          * Make sure that the guest set address has the same page offset
@@ -368,7 +371,7 @@ int vpci_modify_bars(const struct pci_dev *pdev, uint16_t cmd, bool rom_only)
             return -EINVAL;
         }
 
-        rc = rangeset_add_range(bar->mem, start_guest, end_guest);
+        rc = rangeset_add_range(mem, start_guest, end_guest);
         if ( rc )
         {
             printk(XENLOG_G_WARNING "Failed to add [%lx, %lx]: %d\n",
@@ -379,12 +382,12 @@ int vpci_modify_bars(const struct pci_dev *pdev, uint16_t cmd, bool rom_only)
         /* Check for overlap with the already setup BAR ranges. */
         for ( j = 0; j < i; j++ )
         {
-            struct vpci_bar *prev_bar = &header->bars[j];
+            struct rangeset *prev_mem = current->vpci.bar_mem[j];
 
-            if ( rangeset_is_empty(prev_bar->mem) )
+            if ( rangeset_is_empty(prev_mem) )
                 continue;
 
-            rc = rangeset_remove_range(prev_bar->mem, start_guest, end_guest);
+            rc = rangeset_remove_range(prev_mem, start_guest, end_guest);
             if ( rc )
             {
                 gprintk(XENLOG_WARNING,
@@ -394,7 +397,7 @@ int vpci_modify_bars(const struct pci_dev *pdev, uint16_t cmd, bool rom_only)
             }
         }
 
-        rc = pci_sanitize_bar_memory(bar->mem);
+        rc = pci_sanitize_bar_memory(mem);
         if ( rc )
         {
             gprintk(XENLOG_WARNING,
@@ -411,14 +414,14 @@ int vpci_modify_bars(const struct pci_dev *pdev, uint16_t cmd, bool rom_only)
         unsigned long end = PFN_DOWN(vmsix_table_addr(pdev->vpci, i) +
                                      vmsix_table_size(pdev->vpci, i) - 1);
 
-        for ( j = 0; j < ARRAY_SIZE(header->bars); j++ )
+        for ( j = 0; j < ARRAY_SIZE(current->vpci.bar_mem); j++ )
         {
-            const struct vpci_bar *bar = &header->bars[j];
+            struct rangeset *mem = current->vpci.bar_mem[j];
 
-            if ( rangeset_is_empty(bar->mem) )
+            if ( rangeset_is_empty(mem) )
                 continue;
 
-            rc = rangeset_remove_range(bar->mem, start, end);
+            rc = rangeset_remove_range(mem, start, end);
             if ( rc )
             {
                 gprintk(XENLOG_WARNING,
@@ -468,8 +471,9 @@ int vpci_modify_bars(const struct pci_dev *pdev, uint16_t cmd, bool rom_only)
                 for ( j = 0; j < ARRAY_SIZE(header->bars); j++)
                 {
                     const struct vpci_bar *bar = &header->bars[j];
+                    struct rangeset *mem = current->vpci.bar_mem[j];
 
-                    if ( !rangeset_overlaps_range(bar->mem, start, end) ||
+                    if ( !rangeset_overlaps_range(mem, start, end) ||
                          /*
                           * If only the ROM enable bit is toggled check against
                           * other BARs in the same device for overlaps, but not
@@ -480,7 +484,7 @@ int vpci_modify_bars(const struct pci_dev *pdev, uint16_t cmd, bool rom_only)
                           bar->type == VPCI_BAR_ROM) )
                         continue;
 
-                    rc = rangeset_remove_range(bar->mem, start, end);
+                    rc = rangeset_remove_range(mem, start, end);
                     if ( rc )
                     {
                         gprintk(XENLOG_WARNING,
@@ -733,18 +737,6 @@ static void cf_check rom_write(
     }
 }
 
-static int bar_add_rangeset(const struct pci_dev *pdev, struct vpci_bar *bar,
-                            unsigned int i)
-{
-    char str[32];
-
-    snprintf(str, sizeof(str), "%pp:BAR%u", &pdev->sbdf, i);
-
-    bar->mem = rangeset_new(pdev->domain, str, RANGESETF_no_print);
-
-    return !bar->mem ? -ENOMEM : 0;
-}
-
 static int vpci_init_capability_list(struct pci_dev *pdev)
 {
     int rc;
@@ -989,10 +981,6 @@ int vpci_init_header(struct pci_dev *pdev)
         else
             bars[i].type = VPCI_BAR_MEM32;
 
-        rc = bar_add_rangeset(pdev, &bars[i], i);
-        if ( rc )
-            goto fail;
-
         rc = pci_size_mem_bar(pdev->sbdf, reg, &addr, &size,
                               (i == num_bars - 1) ? PCI_BAR_LAST : 0);
         if ( rc < 0 )
@@ -1046,12 +1034,6 @@ int vpci_init_header(struct pci_dev *pdev)
                                4, rom);
         if ( rc )
             rom->type = VPCI_BAR_EMPTY;
-        else
-        {
-            rc = bar_add_rangeset(pdev, rom, num_bars);
-            if ( rc )
-                goto fail;
-        }
     }
     else if ( !is_hwdom )
     {
diff --git a/xen/drivers/vpci/vpci.c b/xen/drivers/vpci/vpci.c
index f66f50c8ba..af61b521b0 100644
--- a/xen/drivers/vpci/vpci.c
+++ b/xen/drivers/vpci/vpci.c
@@ -357,9 +357,6 @@ void vpci_deassign_device(struct pci_dev *pdev)
     }
     spin_unlock(&pdev->vpci->lock);
 
-    for ( i = 0; i < ARRAY_SIZE(pdev->vpci->header.bars); i++ )
-        rangeset_destroy(pdev->vpci->header.bars[i].mem);
-
     xfree(pdev->vpci);
     pdev->vpci = NULL;
 }
diff --git a/xen/include/xen/vpci.h b/xen/include/xen/vpci.h
index dd233b8b03..fa654545e5 100644
--- a/xen/include/xen/vpci.h
+++ b/xen/include/xen/vpci.h
@@ -118,7 +118,6 @@ struct vpci {
             uint64_t guest_addr;
             uint64_t size;
             uint64_t resizable_sizes;
-            struct rangeset *mem;
             enum {
                 VPCI_BAR_EMPTY,
                 VPCI_BAR_IO,
@@ -212,6 +211,9 @@ struct vpci {
 struct vpci_vcpu {
     /* Per-vcpu structure to store state while {un}mapping of PCI BARs. */
     const struct pci_dev *pdev;
+#ifdef __XEN__
+    struct rangeset *bar_mem[PCI_HEADER_NORMAL_NR_BARS + 1];
+#endif
     uint16_t cmd;
     bool rom_only : 1;
 };
-- 
2.51.2



* [PATCH v2 5/8] pci/iommu: Check that IOMMU supports removing devices
  2026-03-09 11:08 [PATCH v2 0/8] Implement SR-IOV support for PVH Mykyta Poturai
                   ` (4 preceding siblings ...)
  2026-03-09 11:08 ` [PATCH v2 6/8] vpci: add SR-IOV support for PVH Dom0 Mykyta Poturai
@ 2026-03-09 11:08 ` Mykyta Poturai
  2026-03-31 14:28   ` Jan Beulich
  2026-03-09 11:08 ` [PATCH v2 7/8] vpci: add SR-IOV support for DomUs Mykyta Poturai
  2026-03-09 11:08 ` [PATCH v2 8/8] docs: Update SR-IOV support status Mykyta Poturai
  7 siblings, 1 reply; 29+ messages in thread
From: Mykyta Poturai @ 2026-03-09 11:08 UTC (permalink / raw)
  To: xen-devel@lists.xenproject.org
  Cc: Mykyta Poturai, Jan Beulich, Roger Pau Monné

Check that the IOMMU supports removing devices before trying to remove
them. Some IOMMU platforms, such as SMMUv3 and IPMMU-VMSA, don't
implement remove_device, and calling a NULL hook would result in a
crash, so verify that platform_ops has remove_device set before calling
it.

Signed-off-by: Mykyta Poturai <mykyta_poturai@epam.com>
---
v1->v2:
* new patch
---
 xen/drivers/passthrough/pci.c | 6 ++++++
 1 file changed, 6 insertions(+)

diff --git a/xen/drivers/passthrough/pci.c b/xen/drivers/passthrough/pci.c
index 464bb0fee4..704eb6e19f 100644
--- a/xen/drivers/passthrough/pci.c
+++ b/xen/drivers/passthrough/pci.c
@@ -1545,6 +1545,12 @@ static int iommu_remove_device(struct pci_dev *pdev)
     if ( !is_iommu_enabled(pdev->domain) )
         return 0;
 
+    if ( !hd->platform_ops->remove_device )
+    {
+        printk(XENLOG_ERR "IOMMU: remove_device not supported by platform\n");
+        return -EOPNOTSUPP;
+    }
+
     for ( devfn = pdev->devfn ; pdev->phantom_stride; )
     {
         int rc;
-- 
2.51.2



* [PATCH v2 6/8] vpci: add SR-IOV support for PVH Dom0
  2026-03-09 11:08 [PATCH v2 0/8] Implement SR-IOV support for PVH Mykyta Poturai
                   ` (3 preceding siblings ...)
  2026-03-09 11:08 ` [PATCH v2 4/8] vpci: add a wait operation to the vpci vcpu pending actions Mykyta Poturai
@ 2026-03-09 11:08 ` Mykyta Poturai
  2026-03-19 21:11   ` Stewart Hildebrand
  2026-03-31 14:44   ` Jan Beulich
  2026-03-09 11:08 ` [PATCH v2 5/8] pci/iommu: Check that IOMMU supports removing devices Mykyta Poturai
                   ` (2 subsequent siblings)
  7 siblings, 2 replies; 29+ messages in thread
From: Mykyta Poturai @ 2026-03-09 11:08 UTC (permalink / raw)
  To: xen-devel@lists.xenproject.org
  Cc: Stewart Hildebrand, Roger Pau Monné, Daniel P. Smith,
	Mykyta Poturai

From: Stewart Hildebrand <stewart.hildebrand@amd.com>

This code is expected to only be used by privileged domains;
unprivileged domains should not get access to the SR-IOV capability.

Implement RW handlers for the PCI_SRIOV_CTRL register to dynamically
map/unmap VF BARs. Recalculate BAR sizes before mapping VFs to account
for possible changes in the system page size register.

Allow forcing vpci_modify_bars to not defer the actual mapping changes,
which is needed to fix the sequential calls to vpci_modify_bars when
enabling VFs from Dom0.

Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
Signed-off-by: Stewart Hildebrand <stewart.hildebrand@amd.com>
Signed-off-by: Mykyta Poturai <mykyta_poturai@epam.com>
---
v1->v2:
* switch to VF discovery by Xen
* fix sequential vpci_modify_bars calls
* move documentation changes to a separate commit
---
 xen/drivers/vpci/Makefile |   2 +-
 xen/drivers/vpci/header.c |  17 +-
 xen/drivers/vpci/sriov.c  | 363 ++++++++++++++++++++++++++++++++++++++
 xen/include/xen/vpci.h    |  12 +-
 4 files changed, 385 insertions(+), 9 deletions(-)
 create mode 100644 xen/drivers/vpci/sriov.c

diff --git a/xen/drivers/vpci/Makefile b/xen/drivers/vpci/Makefile
index a7c8a30a89..fe1e57b64d 100644
--- a/xen/drivers/vpci/Makefile
+++ b/xen/drivers/vpci/Makefile
@@ -1,2 +1,2 @@
-obj-y += vpci.o header.o rebar.o
+obj-y += vpci.o header.o rebar.o sriov.o
 obj-$(CONFIG_HAS_PCI_MSI) += msi.o msix.o
diff --git a/xen/drivers/vpci/header.c b/xen/drivers/vpci/header.c
index 284964f0d4..c55c3380d4 100644
--- a/xen/drivers/vpci/header.c
+++ b/xen/drivers/vpci/header.c
@@ -264,7 +264,7 @@ bool vpci_process_pending(struct vcpu *v)
     return false;
 }
 
-static int __init apply_map(struct domain *d, const struct pci_dev *pdev,
+static int apply_map(struct domain *d, const struct pci_dev *pdev,
                             uint16_t cmd)
 {
     struct vpci_header *header = &pdev->vpci->header;
@@ -323,7 +323,8 @@ static void defer_map(const struct pci_dev *pdev, uint16_t cmd, bool rom_only)
     raise_softirq(SCHEDULE_SOFTIRQ);
 }
 
-int vpci_modify_bars(const struct pci_dev *pdev, uint16_t cmd, bool rom_only)
+int vpci_modify_bars(const struct pci_dev *pdev, uint16_t cmd, bool rom_only,
+                     bool no_defer)
 {
     struct vpci_header *header = &pdev->vpci->header;
     struct pci_dev *tmp;
@@ -519,7 +520,7 @@ int vpci_modify_bars(const struct pci_dev *pdev, uint16_t cmd, bool rom_only)
         d = dom_xen;
     }
 
-    if ( system_state < SYS_STATE_active )
+    if ( system_state < SYS_STATE_active || no_defer )
     {
         /*
          * Mappings might be created when building Dom0 if the memory decoding
@@ -566,7 +567,7 @@ static void cf_check cmd_write(
          * memory decoding bit has not been changed, so leave everything as-is,
          * hoping the guest will realize and try again.
          */
-        vpci_modify_bars(pdev, cmd, false);
+        vpci_modify_bars(pdev, cmd, false, false);
     else
         pci_conf_write16(pdev->sbdf, reg, cmd);
 }
@@ -736,7 +737,7 @@ static void cf_check rom_write(
      */
     else if ( vpci_modify_bars(pdev,
                                new_enabled ? PCI_COMMAND_MEMORY : 0,
-                               true) )
+                               true, false) )
         /*
          * No memory has been added or removed from the p2m (because the actual
          * p2m changes are deferred in defer_map) and the ROM enable bit has
@@ -954,6 +955,9 @@ int vpci_init_header(struct pci_dev *pdev)
 
     header->guest_cmd = cmd;
 
+    if ( pdev->info.is_virtfn )
+        return vf_init_header(pdev);
+
     /* Disable memory decoding before sizing. */
     if ( !is_hwdom || (cmd & PCI_COMMAND_MEMORY) )
         pci_conf_write16(pdev->sbdf, PCI_COMMAND, cmd & ~PCI_COMMAND_MEMORY);
@@ -1062,7 +1066,8 @@ int vpci_init_header(struct pci_dev *pdev)
             goto fail;
     }
 
-    return (cmd & PCI_COMMAND_MEMORY) ? vpci_modify_bars(pdev, cmd, false) : 0;
+    return (cmd & PCI_COMMAND_MEMORY)
+                ? vpci_modify_bars(pdev, cmd, false, false) : 0;
 
  fail:
     pci_conf_write16(pdev->sbdf, PCI_COMMAND, cmd);
diff --git a/xen/drivers/vpci/sriov.c b/xen/drivers/vpci/sriov.c
new file mode 100644
index 0000000000..6f691149e9
--- /dev/null
+++ b/xen/drivers/vpci/sriov.c
@@ -0,0 +1,363 @@
+/* SPDX-License-Identifier: GPL-2.0-only */
+/*
+ * Handlers for accesses to the SR-IOV capability structure.
+ *
+ * Copyright (C) 2026 Citrix Systems R&D
+ */
+
+#include <xen/sched.h>
+#include <xen/vpci.h>
+#include <xsm/xsm.h>
+
+static int vf_init_bars(const struct pci_dev *vf_pdev)
+{
+    int vf_idx;
+    unsigned int i;
+    const struct pci_dev *pf_pdev = vf_pdev->pf_pdev;
+    struct vpci_bar *bars = vf_pdev->vpci->header.bars;
+    struct vpci_bar *physfn_vf_bars = pf_pdev->vpci->sriov->vf_bars;
+    unsigned int sriov_pos = pci_find_ext_capability(pf_pdev,
+                                                     PCI_EXT_CAP_ID_SRIOV);
+    uint16_t offset = pci_conf_read16(pf_pdev->sbdf,
+                                      sriov_pos + PCI_SRIOV_VF_OFFSET);
+    uint16_t stride = pci_conf_read16(pf_pdev->sbdf,
+                                      sriov_pos + PCI_SRIOV_VF_STRIDE);
+
+    vf_idx = vf_pdev->sbdf.sbdf - (pf_pdev->sbdf.sbdf + offset);
+    if ( vf_idx < 0 )
+        return -EINVAL;
+
+    if ( stride )
+    {
+        if ( vf_idx % stride )
+            return -EINVAL;
+        vf_idx /= stride;
+    }
+
+    /*
+     * Set up BARs for this VF out of PF's VF BARs taking into account
+     * the index of the VF.
+     */
+    for ( i = 0; i < PCI_SRIOV_NUM_BARS; i++ )
+    {
+        bars[i].addr = physfn_vf_bars[i].addr + vf_idx * physfn_vf_bars[i].size;
+        bars[i].guest_addr = bars[i].addr;
+        bars[i].size = physfn_vf_bars[i].size;
+        bars[i].type = physfn_vf_bars[i].type;
+        bars[i].prefetchable = physfn_vf_bars[i].prefetchable;
+    }
+
+    return 0;
+}
+
+/* Must be called from vpci_process_pending context */
+static int map_vfs(const struct pci_dev *pf_pdev, uint16_t cmd)
+{
+    struct pci_dev *vf_pdev;
+    int rc;
+
+    ASSERT(rw_is_write_locked(&pf_pdev->domain->pci_lock));
+
+    list_for_each_entry(vf_pdev, &pf_pdev->vf_list, vf_list) {
+        rc = vpci_modify_bars(vf_pdev, cmd, false, true);
+        if ( rc )
+        {
+            gprintk(XENLOG_ERR, "failed to %s VF %pp: %d\n",
+                    (cmd & PCI_COMMAND_MEMORY) ? "map" : "unmap",
+                    &vf_pdev->sbdf, rc);
+            return rc;
+        }
+    }
+
+    return 0;
+}
+
+
+static int size_vf_bars(struct pci_dev *pf_pdev, unsigned int sriov_pos)
+{
+    /*
+     * NB: a non-const pci_dev of the PF is needed in order to update
+     * vf_rlen.
+     */
+    struct vpci_bar *bars;
+    unsigned int i;
+    int rc = 0;
+
+    ASSERT(rw_is_write_locked(&pf_pdev->domain->pci_lock));
+    ASSERT(!pf_pdev->info.is_virtfn);
+    ASSERT(pf_pdev->vpci->sriov);
+
+    /* Read BARs for VFs out of PF's SR-IOV extended capability. */
+    bars = pf_pdev->vpci->sriov->vf_bars;
+    /* Set the BARs addresses and size. */
+    for ( i = 0; i < PCI_SRIOV_NUM_BARS; i += rc )
+    {
+        unsigned int idx = sriov_pos + PCI_SRIOV_BAR + i * 4;
+        uint32_t bar;
+        uint64_t addr, size;
+
+        bar = pci_conf_read32(pf_pdev->sbdf, idx);
+
+        rc = pci_size_mem_bar(pf_pdev->sbdf, idx, &addr, &size,
+                              PCI_BAR_VF |
+                              ((i == PCI_SRIOV_NUM_BARS - 1) ? PCI_BAR_LAST
+                                                             : 0));
+
+        /*
+         * Update vf_rlen on the PF. According to the spec the size of
+         * the BARs can change if the system page size register is
+         * modified, so always update rlen when enabling VFs.
+         */
+        pf_pdev->physfn.vf_rlen[i] = size;
+
+        if ( !size )
+        {
+            bars[i].type = VPCI_BAR_EMPTY;
+            continue;
+        }
+
+        bars[i].addr = addr;
+        bars[i].guest_addr = addr;
+        bars[i].size = size;
+        bars[i].prefetchable = bar & PCI_BASE_ADDRESS_MEM_PREFETCH;
+
+        switch ( rc )
+        {
+        case 1:
+            bars[i].type = VPCI_BAR_MEM32;
+            break;
+
+        case 2:
+            bars[i].type = VPCI_BAR_MEM64_LO;
+            bars[i + 1].type = VPCI_BAR_MEM64_HI;
+            break;
+
+        default:
+            ASSERT_UNREACHABLE();
+        }
+    }
+
+    rc = rc > 0 ? 0 : rc;
+
+    return rc;
+}
+
+struct callback_data {
+    const struct pci_dev *pdev;
+    unsigned int pos;
+    uint32_t value;
+    bool enable : 1;
+    bool disable : 1;
+    bool map : 1;
+    bool unmap : 1;
+};
+
+static void cf_check control_write_cb(void *data)
+{
+    struct callback_data *cb = data;
+    const struct pci_dev *pdev = cb->pdev;
+    uint16_t offset = pci_conf_read16(pdev->sbdf, cb->pos + PCI_SRIOV_VF_OFFSET);
+    uint16_t stride = pci_conf_read16(pdev->sbdf, cb->pos + PCI_SRIOV_VF_STRIDE);
+    struct vpci_sriov *sriov = pdev->vpci->sriov;
+    int rc = 0;
+    unsigned int i;
+
+    if ( cb->unmap )
+    {
+        write_lock(&pdev->domain->pci_lock);
+        map_vfs(pdev, 0);
+        write_unlock(&pdev->domain->pci_lock);
+    }
+
+    if ( cb->enable || cb->disable )
+    {
+        for ( i = 0; i < sriov->num_vfs; i++ )
+        {
+            const pci_sbdf_t vf_sbdf = {
+                .sbdf = pdev->sbdf.sbdf + offset + stride * i,
+            };
+
+            if ( cb->enable )
+            {
+                const struct pci_dev_info info = {
+                    .is_virtfn = true,
+                    .is_extfn = false,
+                    .physfn.bus = pdev->sbdf.bus,
+                    .physfn.devfn = pdev->sbdf.devfn,
+                };
+                rc = pci_add_device(vf_sbdf.seg, vf_sbdf.bus, vf_sbdf.devfn,
+                                    &info, pdev->node);
+            }
+            if ( cb->disable )
+                rc = pci_remove_device(vf_sbdf.seg, vf_sbdf.bus, vf_sbdf.devfn);
+
+            if ( rc && rc != -ENODEV )
+                gprintk(XENLOG_ERR, "failed to %s VF %pp: %d\n",
+                        cb->enable ? "add" : "remove", &vf_sbdf, rc);
+        }
+    }
+
+    if ( cb->map )
+    {
+        write_lock(&pdev->domain->pci_lock);
+        rc = map_vfs(pdev, PCI_COMMAND_MEMORY);
+
+        if ( rc )
+            map_vfs(pdev, 0);
+        write_unlock(&pdev->domain->pci_lock);
+    }
+
+    pci_conf_write16(pdev->sbdf, cb->pos + PCI_SRIOV_CTRL, cb->value);
+    xfree(cb);
+}
+
+static void cf_check control_write(const struct pci_dev *pdev, unsigned int reg,
+                                   uint32_t val, void *data)
+{
+    unsigned int sriov_pos = reg - PCI_SRIOV_CTRL;
+    struct vpci_sriov *sriov = pdev->vpci->sriov;
+    struct callback_data *cb = NULL;
+    uint16_t control = pci_conf_read16(pdev->sbdf, reg);
+    bool mem_enabled = control & PCI_SRIOV_CTRL_MSE;
+    bool new_mem_enabled = val & PCI_SRIOV_CTRL_MSE;
+    bool enabled = control & PCI_SRIOV_CTRL_VFE;
+    bool new_enabled = val & PCI_SRIOV_CTRL_VFE;
+
+    ASSERT(!pdev->info.is_virtfn);
+
+    if ( new_enabled == enabled && new_mem_enabled == mem_enabled )
+    {
+        pci_conf_write16(pdev->sbdf, reg, val);
+        return;
+    }
+
+    cb = xzalloc(struct callback_data);
+
+    if ( !cb )
+    {
+        gprintk(XENLOG_ERR,
+                "%pp: Unable to allocate memory for SR-IOV enable\n",
+                &pdev->sbdf);
+        return;
+    }
+
+    cb->pdev = pdev;
+    cb->pos = sriov_pos;
+    cb->value = val;
+    cb->map = new_mem_enabled && !mem_enabled;
+    cb->unmap = !new_mem_enabled && mem_enabled;
+    cb->enable = new_enabled && !enabled;
+    cb->disable = !new_enabled && enabled;
+
+    current->vpci.task = WAIT;
+    current->vpci.wait.callback = control_write_cb;
+    current->vpci.wait.data = cb;
+    current->vpci.wait.end = NOW();
+
+    if ( cb->enable )
+    {
+        size_vf_bars((struct pci_dev *)pdev, sriov_pos);
+
+        /*
+         * Only update the number of active VFs when enabling, when
+         * disabling use the cached value in order to always remove the same
+         * number of VFs that were active.
+         */
+        sriov->num_vfs = pci_conf_read16(pdev->sbdf,
+                                         sriov_pos + PCI_SRIOV_NUM_VF);
+        /*
+         * NB: VFE needs to be enabled before calling pci_add_device so Xen
+         * can access the config space of VFs. FIXME casting away const-ness
+         * to modify vf_rlen
+         */
+        pci_conf_write16(pdev->sbdf, reg, control | PCI_SRIOV_CTRL_VFE);
+        /*
+         * The spec states that the software must wait at least 100ms before
+         * attempting to access VF registers when enabling virtual functions
+         * on the PF.
+         */
+
+        current->vpci.wait.end = NOW() + MILLISECS(100);
+    }
+}
+
+int vf_init_header(struct pci_dev *vf_pdev)
+{
+    const struct pci_dev *pf_pdev;
+    unsigned int sriov_pos;
+    int rc = 0;
+    uint16_t ctrl;
+
+    ASSERT(rw_is_write_locked(&vf_pdev->domain->pci_lock));
+
+    if ( !vf_pdev->info.is_virtfn )
+        return 0;
+
+    pf_pdev = vf_pdev->pf_pdev;
+    ASSERT(pf_pdev);
+
+    rc = vf_init_bars(vf_pdev);
+    if ( rc )
+        return rc;
+
+    sriov_pos = pci_find_ext_capability(pf_pdev, PCI_EXT_CAP_ID_SRIOV);
+    ctrl = pci_conf_read16(pf_pdev->sbdf, sriov_pos + PCI_SRIOV_CTRL);
+
+    if ( (pf_pdev->domain == vf_pdev->domain) && (ctrl & PCI_SRIOV_CTRL_MSE) )
+    {
+        rc = vpci_modify_bars(vf_pdev, PCI_COMMAND_MEMORY, false, false);
+        if ( rc )
+            return rc;
+    }
+
+    return rc;
+}
+
+static int cf_check init_sriov(struct pci_dev *pdev)
+{
+    unsigned int pos;
+
+    ASSERT(!pdev->info.is_virtfn);
+
+    pos = pci_find_ext_capability(pdev, PCI_EXT_CAP_ID_SRIOV);
+
+    if ( !pos )
+        return 0;
+
+    if ( xsm_resource_setup_pci(XSM_PRIV, pdev->sbdf.bdf) )
+    {
+        printk(XENLOG_ERR
+               "%pp: SR-IOV configuration unsupported for unpriv %pd\n",
+               &pdev->sbdf, pdev->domain);
+        return 0;
+    }
+
+    pdev->vpci->sriov = xzalloc(struct vpci_sriov);
+    if ( !pdev->vpci->sriov )
+        return -ENOMEM;
+
+    return vpci_add_register(pdev->vpci, vpci_hw_read16, control_write,
+                             pos + PCI_SRIOV_CTRL, 2, NULL);
+}
+
+static int cf_check cleanup_sriov(const struct pci_dev *pdev, bool hide)
+{
+    if ( hide )
+        return 0;
+
+    XFREE(pdev->vpci->sriov);
+
+    return 0;
+}
+
+REGISTER_VPCI_EXTCAP(SRIOV, init_sriov, cleanup_sriov);
+
+/*
+ * Local variables:
+ * mode: C
+ * c-file-style: "BSD"
+ * c-basic-offset: 4
+ * tab-width: 4
+ * indent-tabs-mode: nil
+ * End:
+ */
diff --git a/xen/include/xen/vpci.h b/xen/include/xen/vpci.h
index 47cdb54d42..ae5f3b7274 100644
--- a/xen/include/xen/vpci.h
+++ b/xen/include/xen/vpci.h
@@ -45,6 +45,7 @@ typedef struct {
     REGISTER_VPCI_CAPABILITY(PCI_EXT_CAP_ID_##name, name, finit, fclean, true)
 
 int __must_check vpci_init_header(struct pci_dev *pdev);
+int __must_check vf_init_header(struct pci_dev *pdev);
 
 /* Assign vPCI to device by adding handlers. */
 int __must_check vpci_assign_device(struct pci_dev *pdev);
@@ -146,7 +147,6 @@ struct vpci {
          * upon to know whether BARs are mapped into the guest p2m.
          */
         bool bars_mapped      : 1;
-        /* FIXME: currently there's no support for SR-IOV. */
     } header;
 
     /* MSI data. */
@@ -200,6 +200,13 @@ struct vpci {
             struct vpci_arch_msix_entry arch;
         } entries[];
     } *msix;
+
+    struct vpci_sriov {
+        /* PF only */
+        struct vpci_bar vf_bars[PCI_SRIOV_NUM_BARS];
+        uint16_t num_vfs;
+    } *sriov;
+
 #ifdef CONFIG_HAS_VPCI_GUEST_SUPPORT
     /* Guest SBDF of the device. */
 #define INVALID_GUEST_SBDF ((pci_sbdf_t){ .sbdf = ~0U })
@@ -323,7 +330,8 @@ bool vpci_ecam_read(pci_sbdf_t sbdf, unsigned int reg, unsigned int len,
                     unsigned long *data);
 
 /* Map/unmap the BARs of a vPCI device. */
-int vpci_modify_bars(const struct pci_dev *pdev, uint16_t cmd, bool rom_only);
+int vpci_modify_bars(const struct pci_dev *pdev, uint16_t cmd, bool rom_only,
+                     bool no_defer);
 
 #endif /* __XEN__ */
 
-- 
2.51.2


* [PATCH v2 7/8] vpci: add SR-IOV support for DomUs
  2026-03-09 11:08 [PATCH v2 0/8] Implement SR-IOV support for PVH Mykyta Poturai
                   ` (5 preceding siblings ...)
  2026-03-09 11:08 ` [PATCH v2 5/8] pci/iommu: Check that IOMMU supports removing devices Mykyta Poturai
@ 2026-03-09 11:08 ` Mykyta Poturai
  2026-03-31 15:02   ` Jan Beulich
  2026-03-09 11:08 ` [PATCH v2 8/8] docs: Update SR-IOV support status Mykyta Poturai
  7 siblings, 1 reply; 29+ messages in thread
From: Mykyta Poturai @ 2026-03-09 11:08 UTC (permalink / raw)
  To: xen-devel@lists.xenproject.org
  Cc: Stewart Hildebrand, Roger Pau Monné, Mykyta Poturai

From: Stewart Hildebrand <stewart.hildebrand@amd.com>

SR-IOV virtual functions have a simplified configuration space: the
Vendor ID is derived from the physical function and the Device ID comes
from the SR-IOV extended capability.
Emulate those, so we can provide a VID/DID pair for guests which use PCI
passthrough for SR-IOV virtual functions.

Emulate guest BAR register values for VFs based on the PF's BAR values.
This creates a guest view of the normal BAR registers, emulating their
size and properties as seen during PCI device enumeration by the guest.

Note that a VF's ROM BAR is read-only and reads as all zeros. A VF may
provide access to the PF's ROM via emulation, but that is not
implemented.

Signed-off-by: Stewart Hildebrand <stewart.hildebrand@amd.com>
Signed-off-by: Mykyta Poturai <mykyta_poturai@epam.com>
---
v1->v2:
* remove VF register handlers covered by init_header
* set guest addr unconditionally
---
 xen/drivers/vpci/sriov.c | 57 ++++++++++++++++++++++++++++++++++++++++
 1 file changed, 57 insertions(+)

diff --git a/xen/drivers/vpci/sriov.c b/xen/drivers/vpci/sriov.c
index 6f691149e9..1c408d1c6b 100644
--- a/xen/drivers/vpci/sriov.c
+++ b/xen/drivers/vpci/sriov.c
@@ -303,6 +303,63 @@ int vf_init_header(struct pci_dev *vf_pdev)
     sriov_pos = pci_find_ext_capability(pf_pdev, PCI_EXT_CAP_ID_SRIOV);
     ctrl = pci_conf_read16(pf_pdev->sbdf, sriov_pos + PCI_SRIOV_CTRL);
 
+#ifdef CONFIG_HAS_VPCI_GUEST_SUPPORT
+    if ( pf_pdev->domain != vf_pdev->domain )
+    {
+        uint16_t vid = pci_conf_read16(pf_pdev->sbdf, PCI_VENDOR_ID);
+        uint16_t did = pci_conf_read16(pf_pdev->sbdf,
+                                       sriov_pos + PCI_SRIOV_VF_DID);
+        struct vpci_bar *bars = vf_pdev->vpci->header.bars;
+        unsigned int i;
+
+        rc = vpci_add_register(vf_pdev->vpci, vpci_read_val, NULL,
+                               PCI_VENDOR_ID, 2, (void *)(uintptr_t)vid);
+        if ( rc )
+            return rc;
+
+        rc = vpci_add_register(vf_pdev->vpci, vpci_read_val, NULL,
+                               PCI_DEVICE_ID, 2, (void *)(uintptr_t)did);
+        if ( rc )
+            return rc;
+
+        /* Hardcode multi-function device bit to 0 */
+        rc = vpci_add_register(vf_pdev->vpci, vpci_read_val, NULL,
+                               PCI_HEADER_TYPE, 1,
+                               (void *)PCI_HEADER_TYPE_NORMAL);
+        if ( rc )
+            return rc;
+
+        rc = vpci_add_register(vf_pdev->vpci, vpci_hw_read32, NULL,
+                               PCI_CLASS_REVISION, 4, NULL);
+        if ( rc )
+            return rc;
+
+        for ( i = 0; i < PCI_SRIOV_NUM_BARS; i++ )
+        {
+            switch ( pf_pdev->vpci->sriov->vf_bars[i].type )
+            {
+            case VPCI_BAR_MEM32:
+            case VPCI_BAR_MEM64_LO:
+            case VPCI_BAR_MEM64_HI:
+                rc = vpci_add_register(vf_pdev->vpci, vpci_guest_mem_bar_read,
+                                       vpci_guest_mem_bar_write,
+                                       PCI_BASE_ADDRESS_0 + i * 4, 4, &bars[i]);
+                if ( rc )
+                    return rc;
+                break;
+            default:
+                rc = vpci_add_register(vf_pdev->vpci, vpci_read_val, NULL,
+                                       PCI_BASE_ADDRESS_0 + i * 4, 4,
+                                       (void *)0);
+                if ( rc )
+                    return rc;
+                break;
+            }
+        }
+
+    }
+#endif /* CONFIG_HAS_VPCI_GUEST_SUPPORT */
+
     if ( (pf_pdev->domain == vf_pdev->domain) && (ctrl & PCI_SRIOV_CTRL_MSE) )
     {
         rc = vpci_modify_bars(vf_pdev, PCI_COMMAND_MEMORY, false, false);
-- 
2.51.2



* [PATCH v2 8/8] docs: Update SR-IOV support status
  2026-03-09 11:08 [PATCH v2 0/8] Implement SR-IOV support for PVH Mykyta Poturai
                   ` (6 preceding siblings ...)
  2026-03-09 11:08 ` [PATCH v2 7/8] vpci: add SR-IOV support for DomUs Mykyta Poturai
@ 2026-03-09 11:08 ` Mykyta Poturai
  7 siblings, 0 replies; 29+ messages in thread
From: Mykyta Poturai @ 2026-03-09 11:08 UTC (permalink / raw)
  To: xen-devel@lists.xenproject.org
  Cc: Mykyta Poturai, Andrew Cooper, Anthony PERARD, Michal Orzel,
	Jan Beulich, Julien Grall, Roger Pau Monné,
	Stefano Stabellini

Signed-off-by: Mykyta Poturai <mykyta_poturai@epam.com>
---
v1->v2:
* new patch
---
 SUPPORT.md | 2 --
 1 file changed, 2 deletions(-)

diff --git a/SUPPORT.md b/SUPPORT.md
index d441bccf37..c01853fe7e 100644
--- a/SUPPORT.md
+++ b/SUPPORT.md
@@ -170,8 +170,6 @@ unexpected behavior or issues on some hardware.
 
 At least the following features are missing on a PVH dom0:
 
-  * PCI SR-IOV.
-
   * Native NMI forwarding (nmi=dom0 command line option).
 
   * MCE handling.
-- 
2.51.2



* Re: [PATCH v2 1/8] vpci: rename and export vpci_modify_bars
  2026-03-09 11:08 ` [PATCH v2 1/8] vpci: rename and export vpci_modify_bars Mykyta Poturai
@ 2026-03-16 21:03   ` Stewart Hildebrand
  0 siblings, 0 replies; 29+ messages in thread
From: Stewart Hildebrand @ 2026-03-16 21:03 UTC (permalink / raw)
  To: Mykyta Poturai, xen-devel@lists.xenproject.org
  Cc: Roger Pau Monné, Teddy Astie

On 3/9/26 07:08, Mykyta Poturai wrote:
> diff --git a/xen/include/xen/vpci.h b/xen/include/xen/vpci.h
> index d6310104ea..a98ddbb32e 100644
> --- a/xen/include/xen/vpci.h
> +++ b/xen/include/xen/vpci.h
> @@ -295,6 +295,9 @@ bool vpci_ecam_write(pci_sbdf_t sbdf, unsigned int reg, unsigned int len,
>  bool vpci_ecam_read(pci_sbdf_t sbdf, unsigned int reg, unsigned int len,
>                      unsigned long *data);
>  
> +/* Map/unmap the BARs of a vPCI device. */
> +int vpci_modify_bars(const struct pci_dev *pdev, uint16_t cmd, bool rom_only);

This prototype should be moved to private.h

> +
>  #endif /* __XEN__ */
>  
>  #else /* !CONFIG_HAS_VPCI */




* Re: [PATCH v2 2/8] vpci: rename and export vpci_guest_mem_bar_{read,write}
  2026-03-09 11:08 ` [PATCH v2 2/8] vpci: rename and export vpci_guest_mem_bar_{read,write} Mykyta Poturai
@ 2026-03-16 21:04   ` Stewart Hildebrand
  0 siblings, 0 replies; 29+ messages in thread
From: Stewart Hildebrand @ 2026-03-16 21:04 UTC (permalink / raw)
  To: Mykyta Poturai, xen-devel@lists.xenproject.org
  Cc: Roger Pau Monné, Teddy Astie

On 3/9/26 07:08, Mykyta Poturai wrote:
> diff --git a/xen/include/xen/vpci.h b/xen/include/xen/vpci.h
> index a98ddbb32e..dd233b8b03 100644
> --- a/xen/include/xen/vpci.h
> +++ b/xen/include/xen/vpci.h
> @@ -78,6 +78,12 @@ void vpci_write(pci_sbdf_t sbdf, unsigned int reg, unsigned int size,
>  uint32_t cf_check vpci_read_val(
>      const struct pci_dev *pdev, unsigned int reg, void *data);
>  
> +void cf_check vpci_guest_mem_bar_write(const struct pci_dev *pdev,
> +                                       unsigned int reg, uint32_t val,
> +                                       void *data);
> +uint32_t cf_check vpci_guest_mem_bar_read(const struct pci_dev *pdev,
> +                                          unsigned int reg, void *data);
> +

These prototypes should be moved to private.h

>  /* Passthrough handlers. */
>  uint32_t cf_check vpci_hw_read8(
>      const struct pci_dev *pdev, unsigned int reg, void *data);




* Re: [PATCH v2 3/8] vpci: Use pervcpu ranges for BAR mapping
  2026-03-09 11:08 ` [PATCH v2 3/8] vpci: Use pervcpu ranges for BAR mapping Mykyta Poturai
@ 2026-03-16 21:36   ` Stewart Hildebrand
  2026-03-17  0:36   ` Stewart Hildebrand
  2026-03-31  9:59   ` Jan Beulich
  2 siblings, 0 replies; 29+ messages in thread
From: Stewart Hildebrand @ 2026-03-16 21:36 UTC (permalink / raw)
  To: Mykyta Poturai, xen-devel@lists.xenproject.org
  Cc: Andrew Cooper, Anthony PERARD, Michal Orzel, Jan Beulich,
	Julien Grall, Roger Pau Monné, Stefano Stabellini

On 3/9/26 07:08, Mykyta Poturai wrote:
> There is no need to store ranges for each PCI device, as they are only
> used during the mapping/unmapping process and can be reused for each
> device. This also allows to avoid the need to allocate and destroy
> rangesets for each device.
> 

Thank you for this.

Consider adding this tag:
Amends: 622bdd962822 ("vpci/header: handle p2m range sets per BAR")

> Signed-off-by: Mykyta Poturai <mykyta_poturai@epam.com>
> ---
> v1->v2:
> * new patch
> ---
>  xen/common/domain.c       | 24 ++++++++++++++
>  xen/drivers/vpci/header.c | 66 ++++++++++++++-------------------------
>  xen/drivers/vpci/vpci.c   |  3 --
>  xen/include/xen/vpci.h    |  4 ++-
>  4 files changed, 51 insertions(+), 46 deletions(-)
> 
> diff --git a/xen/common/domain.c b/xen/common/domain.c
> index e06174fca7..76b0163616 100644
> --- a/xen/common/domain.c
> +++ b/xen/common/domain.c
> @@ -454,6 +454,14 @@ static int vcpu_teardown(struct vcpu *v)
>   */
>  static void vcpu_destroy(struct vcpu *v)
>  {
> +#ifdef CONFIG_HAS_VPCI
> +    int i;
> +
> +    for ( i = 0; i < ARRAY_SIZE(v->vpci.bar_mem); i++ )
> +        if ( v->vpci.bar_mem[i] )
> +            rangeset_destroy(v->vpci.bar_mem[i]);

Paraphrasing some previous feedback from [1], you might additionally want:

    v->vpci.bar_mem[i] = NULL;

or introduce a RANGESET_DESTROY() similar to XFREE().

[1] https://lore.kernel.org/xen-devel/20250723163744.13095-1-stewart.hildebrand@amd.com/T/#t

> +
> +#endif
>      free_vcpu_struct(v);
>  }
>  
> @@ -511,6 +519,22 @@ struct vcpu *vcpu_create(struct domain *d, unsigned int vcpu_id)
>      if ( arch_vcpu_create(v) != 0 )
>          goto fail_sched;
>  
> +#ifdef CONFIG_HAS_VPCI

Would it make sense to introduce a vpci_* function instead of #ifdef? Same for
the deallocation above.

> +    {

I'm not sure it's necessary to start a new block here.

> +        int i;
> +
> +        for ( i = 0; i < ARRAY_SIZE(v->vpci.bar_mem); i++ )
> +        {
> +            char str[32];
> +
> +            snprintf(str, sizeof(str), "%pv:BAR%u", v, i);
> +            v->vpci.bar_mem[i] = rangeset_new(d, str, RANGESETF_no_print);

This seems to be performing an allocation for vPCI even for domains that don't
have vPCI.

> +            if ( !v->vpci.bar_mem[i] )
> +                goto fail_sched;
> +        }
> +    }
> +#endif
> +
>      d->vcpu[vcpu_id] = v;
>      if ( vcpu_id != 0 )
>      {
> diff --git a/xen/drivers/vpci/header.c b/xen/drivers/vpci/header.c
> index 07ec991a12..cb64d9b9fc 100644
> --- a/xen/drivers/vpci/header.c
> +++ b/xen/drivers/vpci/header.c
> @@ -195,6 +195,7 @@ bool vpci_process_pending(struct vcpu *v)
>      for ( i = 0; i < ARRAY_SIZE(header->bars); i++ )
>      {
>          struct vpci_bar *bar = &header->bars[i];
> +        struct rangeset *mem = v->vpci.bar_mem[i];
>          struct map_data data = {
>              .d = v->domain,
>              .map = v->vpci.cmd & PCI_COMMAND_MEMORY,
> @@ -202,10 +203,10 @@ bool vpci_process_pending(struct vcpu *v)
>          };
>          int rc;
>  
> -        if ( rangeset_is_empty(bar->mem) )
> +        if ( rangeset_is_empty(mem) )
>              continue;
>  
> -        rc = rangeset_consume_ranges(bar->mem, map_range, &data);
> +        rc = rangeset_consume_ranges(mem, map_range, &data);
>  
>          if ( rc == -ERESTART )
>          {
> @@ -223,8 +224,8 @@ bool vpci_process_pending(struct vcpu *v)
>  
>              /* Clean all the rangesets */
>              for ( i = 0; i < ARRAY_SIZE(header->bars); i++ )
> -                if ( !rangeset_is_empty(header->bars[i].mem) )
> -                     rangeset_purge(header->bars[i].mem);
> +                if ( !rangeset_is_empty(v->vpci.bar_mem[i]) )
> +                     rangeset_purge(v->vpci.bar_mem[i]);
>  
>              v->vpci.pdev = NULL;
>  
> @@ -259,13 +260,14 @@ static int __init apply_map(struct domain *d, const struct pci_dev *pdev,
>      for ( i = 0; i < ARRAY_SIZE(header->bars); i++ )
>      {
>          struct vpci_bar *bar = &header->bars[i];
> +        struct rangeset *mem = current->vpci.bar_mem[i];
>          struct map_data data = { .d = d, .map = true, .bar = bar };
>  
> -        if ( rangeset_is_empty(bar->mem) )
> +        if ( rangeset_is_empty(mem) )
>              continue;
>  
> -        while ( (rc = rangeset_consume_ranges(bar->mem, map_range,
> -                                              &data)) == -ERESTART )
> +        while ( (rc = rangeset_consume_ranges(mem, map_range, &data)) ==
> +                -ERESTART )
>          {
>              /*
>               * It's safe to drop and reacquire the lock in this context
> @@ -330,12 +332,13 @@ int vpci_modify_bars(const struct pci_dev *pdev, uint16_t cmd, bool rom_only)
>      for ( i = 0; i < ARRAY_SIZE(header->bars); i++ )
>      {
>          struct vpci_bar *bar = &header->bars[i];
> +        struct rangeset *mem = current->vpci.bar_mem[i];
>          unsigned long start = PFN_DOWN(bar->addr);
>          unsigned long end = PFN_DOWN(bar->addr + bar->size - 1);
>          unsigned long start_guest = PFN_DOWN(bar->guest_addr);
>          unsigned long end_guest = PFN_DOWN(bar->guest_addr + bar->size - 1);
>  
> -        if ( !bar->mem )
> +        if ( !mem )
>              continue;
>  
>          if ( !MAPPABLE_BAR(bar) ||
> @@ -353,7 +356,7 @@ int vpci_modify_bars(const struct pci_dev *pdev, uint16_t cmd, bool rom_only)
>              continue;
>          }
>  
> -        ASSERT(rangeset_is_empty(bar->mem));
> +        ASSERT(rangeset_is_empty(mem));
>  
>          /*
>           * Make sure that the guest set address has the same page offset
> @@ -368,7 +371,7 @@ int vpci_modify_bars(const struct pci_dev *pdev, uint16_t cmd, bool rom_only)
>              return -EINVAL;
>          }
>  
> -        rc = rangeset_add_range(bar->mem, start_guest, end_guest);
> +        rc = rangeset_add_range(mem, start_guest, end_guest);
>          if ( rc )
>          {
>              printk(XENLOG_G_WARNING "Failed to add [%lx, %lx]: %d\n",
> @@ -379,12 +382,12 @@ int vpci_modify_bars(const struct pci_dev *pdev, uint16_t cmd, bool rom_only)
>          /* Check for overlap with the already setup BAR ranges. */
>          for ( j = 0; j < i; j++ )
>          {
> -            struct vpci_bar *prev_bar = &header->bars[j];
> +            struct rangeset *prev_mem = current->vpci.bar_mem[j];
>  
> -            if ( rangeset_is_empty(prev_bar->mem) )
> +            if ( rangeset_is_empty(prev_mem) )
>                  continue;
>  
> -            rc = rangeset_remove_range(prev_bar->mem, start_guest, end_guest);
> +            rc = rangeset_remove_range(prev_mem, start_guest, end_guest);
>              if ( rc )
>              {
>                  gprintk(XENLOG_WARNING,
> @@ -394,7 +397,7 @@ int vpci_modify_bars(const struct pci_dev *pdev, uint16_t cmd, bool rom_only)
>              }
>          }
>  
> -        rc = pci_sanitize_bar_memory(bar->mem);
> +        rc = pci_sanitize_bar_memory(mem);
>          if ( rc )
>          {
>              gprintk(XENLOG_WARNING,
> @@ -411,14 +414,14 @@ int vpci_modify_bars(const struct pci_dev *pdev, uint16_t cmd, bool rom_only)
>          unsigned long end = PFN_DOWN(vmsix_table_addr(pdev->vpci, i) +
>                                       vmsix_table_size(pdev->vpci, i) - 1);
>  
> -        for ( j = 0; j < ARRAY_SIZE(header->bars); j++ )
> +        for ( j = 0; j < ARRAY_SIZE(current->vpci.bar_mem); j++ )
>          {
> -            const struct vpci_bar *bar = &header->bars[j];
> +            struct rangeset *mem = current->vpci.bar_mem[j];
>  
> -            if ( rangeset_is_empty(bar->mem) )
> +            if ( rangeset_is_empty(mem) )
>                  continue;
>  
> -            rc = rangeset_remove_range(bar->mem, start, end);
> +            rc = rangeset_remove_range(mem, start, end);
>              if ( rc )
>              {
>                  gprintk(XENLOG_WARNING,
> @@ -468,8 +471,9 @@ int vpci_modify_bars(const struct pci_dev *pdev, uint16_t cmd, bool rom_only)
>                  for ( j = 0; j < ARRAY_SIZE(header->bars); j++)
>                  {
>                      const struct vpci_bar *bar = &header->bars[j];
> +                    struct rangeset *mem = current->vpci.bar_mem[j];
>  
> -                    if ( !rangeset_overlaps_range(bar->mem, start, end) ||
> +                    if ( !rangeset_overlaps_range(mem, start, end) ||
>                           /*
>                            * If only the ROM enable bit is toggled check against
>                            * other BARs in the same device for overlaps, but not
> @@ -480,7 +484,7 @@ int vpci_modify_bars(const struct pci_dev *pdev, uint16_t cmd, bool rom_only)
>                            bar->type == VPCI_BAR_ROM) )
>                          continue;
>  
> -                    rc = rangeset_remove_range(bar->mem, start, end);
> +                    rc = rangeset_remove_range(mem, start, end);
>                      if ( rc )
>                      {
>                          gprintk(XENLOG_WARNING,
> @@ -733,18 +737,6 @@ static void cf_check rom_write(
>      }
>  }
>  
> -static int bar_add_rangeset(const struct pci_dev *pdev, struct vpci_bar *bar,
> -                            unsigned int i)
> -{
> -    char str[32];
> -
> -    snprintf(str, sizeof(str), "%pp:BAR%u", &pdev->sbdf, i);
> -
> -    bar->mem = rangeset_new(pdev->domain, str, RANGESETF_no_print);
> -
> -    return !bar->mem ? -ENOMEM : 0;
> -}
> -
>  static int vpci_init_capability_list(struct pci_dev *pdev)
>  {
>      int rc;
> @@ -989,10 +981,6 @@ int vpci_init_header(struct pci_dev *pdev)
>          else
>              bars[i].type = VPCI_BAR_MEM32;
>  
> -        rc = bar_add_rangeset(pdev, &bars[i], i);
> -        if ( rc )
> -            goto fail;
> -
>          rc = pci_size_mem_bar(pdev->sbdf, reg, &addr, &size,
>                                (i == num_bars - 1) ? PCI_BAR_LAST : 0);
>          if ( rc < 0 )
> @@ -1046,12 +1034,6 @@ int vpci_init_header(struct pci_dev *pdev)
>                                 4, rom);
>          if ( rc )
>              rom->type = VPCI_BAR_EMPTY;
> -        else
> -        {
> -            rc = bar_add_rangeset(pdev, rom, num_bars);
> -            if ( rc )
> -                goto fail;
> -        }
>      }
>      else if ( !is_hwdom )
>      {
> diff --git a/xen/drivers/vpci/vpci.c b/xen/drivers/vpci/vpci.c
> index f66f50c8ba..af61b521b0 100644
> --- a/xen/drivers/vpci/vpci.c
> +++ b/xen/drivers/vpci/vpci.c
> @@ -357,9 +357,6 @@ void vpci_deassign_device(struct pci_dev *pdev)
>      }
>      spin_unlock(&pdev->vpci->lock);
>  
> -    for ( i = 0; i < ARRAY_SIZE(pdev->vpci->header.bars); i++ )
> -        rangeset_destroy(pdev->vpci->header.bars[i].mem);
> -
>      xfree(pdev->vpci);
>      pdev->vpci = NULL;
>  }
> diff --git a/xen/include/xen/vpci.h b/xen/include/xen/vpci.h
> index dd233b8b03..fa654545e5 100644
> --- a/xen/include/xen/vpci.h
> +++ b/xen/include/xen/vpci.h
> @@ -118,7 +118,6 @@ struct vpci {
>              uint64_t guest_addr;
>              uint64_t size;
>              uint64_t resizable_sizes;
> -            struct rangeset *mem;
>              enum {
>                  VPCI_BAR_EMPTY,
>                  VPCI_BAR_IO,
> @@ -212,6 +211,9 @@ struct vpci {
>  struct vpci_vcpu {
>      /* Per-vcpu structure to store state while {un}mapping of PCI BARs. */
>      const struct pci_dev *pdev;
> +#ifdef __XEN__

Why not enclose the whole vpci_vcpu struct inside __XEN__?

> +    struct rangeset *bar_mem[PCI_HEADER_NORMAL_NR_BARS + 1];

Maybe add a brief comment explaining the + 1 ?

> +#endif
>      uint16_t cmd;
>      bool rom_only : 1;
>  };



^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH v2 3/8] vpci: Use pervcpu ranges for BAR mapping
  2026-03-09 11:08 ` [PATCH v2 3/8] vpci: Use pervcpu ranges for BAR mapping Mykyta Poturai
  2026-03-16 21:36   ` Stewart Hildebrand
@ 2026-03-17  0:36   ` Stewart Hildebrand
  2026-03-31  9:59   ` Jan Beulich
  2 siblings, 0 replies; 29+ messages in thread
From: Stewart Hildebrand @ 2026-03-17  0:36 UTC (permalink / raw)
  To: Mykyta Poturai, xen-devel@lists.xenproject.org
  Cc: Andrew Cooper, Anthony PERARD, Michal Orzel, Jan Beulich,
	Julien Grall, Roger Pau Monné, Stefano Stabellini

On 3/9/26 07:08, Mykyta Poturai wrote:
> diff --git a/xen/drivers/vpci/header.c b/xen/drivers/vpci/header.c
> index 07ec991a12..cb64d9b9fc 100644
> --- a/xen/drivers/vpci/header.c
> +++ b/xen/drivers/vpci/header.c

... snip ...

> @@ -330,12 +332,13 @@ int vpci_modify_bars(const struct pci_dev *pdev, uint16_t cmd, bool rom_only)
>      for ( i = 0; i < ARRAY_SIZE(header->bars); i++ )
>      {
>          struct vpci_bar *bar = &header->bars[i];
> +        struct rangeset *mem = current->vpci.bar_mem[i];
>          unsigned long start = PFN_DOWN(bar->addr);
>          unsigned long end = PFN_DOWN(bar->addr + bar->size - 1);
>          unsigned long start_guest = PFN_DOWN(bar->guest_addr);
>          unsigned long end_guest = PFN_DOWN(bar->guest_addr + bar->size - 1);
>  
> -        if ( !bar->mem )
> +        if ( !mem )

Since all the members of the bar_mem array are allocated unconditionally in
vcpu_create, is the check here still necessary?



* Re: [PATCH v2 4/8] vpci: add a wait operation to the vpci vcpu pending actions
  2026-03-09 11:08 ` [PATCH v2 4/8] vpci: add a wait operation to the vpci vcpu pending actions Mykyta Poturai
@ 2026-03-17 14:02   ` Stewart Hildebrand
  2026-03-31  9:53   ` Jan Beulich
  2026-03-31 14:55   ` Jan Beulich
  2 siblings, 0 replies; 29+ messages in thread
From: Stewart Hildebrand @ 2026-03-17 14:02 UTC (permalink / raw)
  To: Mykyta Poturai, xen-devel@lists.xenproject.org; +Cc: Roger Pau Monné

On 3/9/26 07:08, Mykyta Poturai wrote:
> This allows waiting a specified number of cycles on the vcpu. Once the
> wait has finished a callback is executed.
> 
> Note that this is still not used, but introduced here in order to
> simplify the complexity of the patches that actually make use of it.
> 
> Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
> Signed-off-by: Mykyta Poturai <mykyta_poturai@epam.com>
> ---
> v1->v2:
> * new patch
> ---
>  xen/drivers/vpci/header.c | 125 ++++++++++++++++++++++----------------
>  xen/include/xen/vpci.h    |  19 ++++++
>  2 files changed, 90 insertions(+), 54 deletions(-)
> 
> diff --git a/xen/drivers/vpci/header.c b/xen/drivers/vpci/header.c
> index cb64d9b9fc..284964f0d4 100644
> --- a/xen/drivers/vpci/header.c
> +++ b/xen/drivers/vpci/header.c
> @@ -175,76 +175,92 @@ static void modify_decoding(const struct pci_dev *pdev, uint16_t cmd,
>  
>  bool vpci_process_pending(struct vcpu *v)
>  {
> -    const struct pci_dev *pdev = v->vpci.pdev;
> -    struct vpci_header *header = NULL;
> -    unsigned int i;
> -
> -    if ( !pdev )
> -        return false;
> -
> -    read_lock(&v->domain->pci_lock);
> -
> -    if ( !pdev->vpci || (v->domain != pdev->domain) )
> +    switch ( v->vpci.task )
>      {
> -        v->vpci.pdev = NULL;
> -        read_unlock(&v->domain->pci_lock);
> -        return false;
> -    }
> -
> -    header = &pdev->vpci->header;
> -    for ( i = 0; i < ARRAY_SIZE(header->bars); i++ )
> +    case MODIFY_MEMORY:
>      {
> -        struct vpci_bar *bar = &header->bars[i];
> -        struct rangeset *mem = v->vpci.bar_mem[i];
> -        struct map_data data = {
> -            .d = v->domain,
> -            .map = v->vpci.cmd & PCI_COMMAND_MEMORY,
> -            .bar = bar,
> -        };
> -        int rc;
> +        const struct pci_dev *pdev = v->vpci.memory.pdev;
> +        struct vpci_header *header = NULL;
> +        unsigned int i;
>  
> -        if ( rangeset_is_empty(mem) )
> -            continue;
> +        if ( !pdev )
> +            break;
>  
> -        rc = rangeset_consume_ranges(mem, map_range, &data);
> +        read_lock(&v->domain->pci_lock);
>  
> -        if ( rc == -ERESTART )
> +        if ( !pdev->vpci || (v->domain != pdev->domain) )
>          {
> +            v->vpci.memory.pdev = NULL;
>              read_unlock(&v->domain->pci_lock);
> -            return true;
> +            break;
>          }
>  
> -        if ( rc )
> +        header = &pdev->vpci->header;
> +        for ( i = 0; i < ARRAY_SIZE(header->bars); i++ )
>          {
> -            spin_lock(&pdev->vpci->lock);
> -            /* Disable memory decoding unconditionally on failure. */
> -            modify_decoding(pdev, v->vpci.cmd & ~PCI_COMMAND_MEMORY,
> -                            false);
> -            spin_unlock(&pdev->vpci->lock);
> +            struct vpci_bar *bar = &header->bars[i];
> +            struct rangeset *mem = v->vpci.bar_mem[i];
> +            struct map_data data = {
> +                .d = v->domain,
> +                .map = v->vpci.memory.cmd & PCI_COMMAND_MEMORY,
> +                .bar = bar,
> +            };
> +            int rc;
> +
> +            if ( rangeset_is_empty(mem) )
> +                continue;
>  
> -            /* Clean all the rangesets */
> -            for ( i = 0; i < ARRAY_SIZE(header->bars); i++ )
> -                if ( !rangeset_is_empty(v->vpci.bar_mem[i]) )
> -                     rangeset_purge(v->vpci.bar_mem[i]);
> +            rc = rangeset_consume_ranges(mem, map_range, &data);
>  
> -            v->vpci.pdev = NULL;
> +            if ( rc == -ERESTART )
> +            {
> +                read_unlock(&v->domain->pci_lock);
> +                return true;
> +            }
>  
> -            read_unlock(&v->domain->pci_lock);
> +            if ( rc )
> +            {
> +                spin_lock(&pdev->vpci->lock);
> +                /* Disable memory decoding unconditionally on failure. */
> +                modify_decoding(pdev, v->vpci.memory.cmd & ~PCI_COMMAND_MEMORY,
> +                                false);
> +                spin_unlock(&pdev->vpci->lock);
> +
> +                /* Clean all the rangesets */
> +                for ( i = 0; i < ARRAY_SIZE(header->bars); i++ )
> +                    if ( !rangeset_is_empty(v->vpci.bar_mem[i]) )
> +                        rangeset_purge(v->vpci.bar_mem[i]);
> +
> +                v->vpci.memory.pdev = NULL;
> +
> +                read_unlock(&v->domain->pci_lock);
>  
> -            if ( !is_hardware_domain(v->domain) )
> -                domain_crash(v->domain);
> +                if ( !is_hardware_domain(v->domain) )
> +                    domain_crash(v->domain);
>  
> -            return false;
> +                break;
> +            }
>          }
> -    }
> -    v->vpci.pdev = NULL;
> +        v->vpci.memory.pdev = NULL;
>  
> -    spin_lock(&pdev->vpci->lock);
> -    modify_decoding(pdev, v->vpci.cmd, v->vpci.rom_only);
> -    spin_unlock(&pdev->vpci->lock);
> +        spin_lock(&pdev->vpci->lock);
> +        modify_decoding(pdev, v->vpci.memory.cmd, v->vpci.memory.rom_only);
> +        spin_unlock(&pdev->vpci->lock);
>  
> -    read_unlock(&v->domain->pci_lock);
> +        read_unlock(&v->domain->pci_lock);
> +
> +        break;
> +    }

Nit: this is a lot of churn for a relatively small number of changes. Could the
indentation level be retained (reducing churn) by putting the block in a new
function?

> +    case WAIT:
> +        if ( NOW() < v->vpci.wait.end )
> +            return true;
> +        v->vpci.wait.callback(v->vpci.wait.data);
> +        break;
> +    case NONE:
> +        return false;
> +    }
>  
> +    v->vpci.task = NONE;
>      return false;
>  }
>  
> @@ -295,9 +311,10 @@ static void defer_map(const struct pci_dev *pdev, uint16_t cmd, bool rom_only)
>       * is mapped. This can lead to parallel mapping operations being
>       * started for the same device if the domain is not well-behaved.
>       */
> -    curr->vpci.pdev = pdev;
> -    curr->vpci.cmd = cmd;
> -    curr->vpci.rom_only = rom_only;
> +    curr->vpci.memory.pdev = pdev;
> +    curr->vpci.memory.cmd = cmd;
> +    curr->vpci.memory.rom_only = rom_only;
> +    curr->vpci.task = MODIFY_MEMORY;
>      /*
>       * Raise a scheduler softirq in order to prevent the guest from resuming
>       * execution with pending mapping operations, to trigger the invocation
> diff --git a/xen/include/xen/vpci.h b/xen/include/xen/vpci.h
> index fa654545e5..47cdb54d42 100644
> --- a/xen/include/xen/vpci.h
> +++ b/xen/include/xen/vpci.h
> @@ -212,7 +212,26 @@ struct vpci_vcpu {
>      /* Per-vcpu structure to store state while {un}mapping of PCI BARs. */
>      const struct pci_dev *pdev;

pdev can now be removed from here

>  #ifdef __XEN__
> +    enum {
> +        NONE,
> +        MODIFY_MEMORY,
> +        WAIT,
> +    } task;
>      struct rangeset *bar_mem[PCI_HEADER_NORMAL_NR_BARS + 1];
> +    union {
> +        struct {
> +            /* Store state while {un}mapping of PCI BARs. */
> +            const struct pci_dev *pdev;
> +            uint16_t cmd;
> +            bool rom_only : 1;
> +        } memory;
> +        struct {
> +            /* Store wait state. */
> +            s_time_t end;
> +            void (*callback)(void *);
> +            void *data;
> +        } wait;
> +    };
>  #endif
>      uint16_t cmd;
>      bool rom_only : 1;

cmd and rom_only can be removed from here



* Re: [PATCH v2 6/8] vpci: add SR-IOV support for PVH Dom0
  2026-03-09 11:08 ` [PATCH v2 6/8] vpci: add SR-IOV support for PVH Dom0 Mykyta Poturai
@ 2026-03-19 21:11   ` Stewart Hildebrand
  2026-03-23  9:46     ` Mykyta Poturai
  2026-03-31 14:44   ` Jan Beulich
  1 sibling, 1 reply; 29+ messages in thread
From: Stewart Hildebrand @ 2026-03-19 21:11 UTC (permalink / raw)
  To: Mykyta Poturai, xen-devel@lists.xenproject.org
  Cc: Roger Pau Monné, Daniel P. Smith

On 3/9/26 07:08, Mykyta Poturai wrote:
> From: Stewart Hildebrand <stewart.hildebrand@amd.com>
> 
> This code is expected to only be used by privileged domains,
> unprivileged domains should not get access to the SR-IOV capability.
> 
> Implement RW handlers for PCI_SRIOV_CTRL register to dynamically
> map/unmap VF BARS. Recalculate BAR sizes before mapping VFs to account
> for possible changes in the system page size register.
> 
> Allow forcing vpci_modify_bars to not defer the actual mapping changes,

I don't think this is suitable. We perform the p2m operations in a deferred
context because they may take a long time; for the same reason, the logic is
interruptible: in map_range(), we perform a general_preempt_check()
and return -ERESTART so that we give a chance for other pending work to
complete, including the scheduler softirq. If vpci_process_pending() returns
true, it will be called again and is expected to resume where it left off. The
vcpu won't continue until vpci_process_pending() returns false.

> which is needed to fix the sequential calls to vpci_modify_bars when
> enabling VFs from Dom0.

I'm guessing you resorted to this because you need to perform multiple mapping
operations, but the vPCI deferred mapping mechanism only supports a single
operation? If so, this is an issue I've been attempting to resolve for some time
with the BAR-write-with-memory-decoding-enabled series [1]. In that series I'm
working on introducing the ability to perform multiple mapping operations. I'm
almost ready to send v3 of the BAR-write-with-memory-decoding-enabled series,
and I hope you don't mind that I include your patch ("vpci: Use pervcpu ranges
for BAR mapping"). You may consider basing SR-IOV on this work if suitable.

[1] https://lore.kernel.org/xen-devel/20250723163744.13095-1-stewart.hildebrand@amd.com/T/#t

Regardless, ultimately we need to find a way to return from
vpci_process_pending() during the potentially long-running p2m operations.
As an alternative suggestion, could you return from control_write_cb() after
each call to map_vfs(), and somehow make it resume where it left off?

> 
> Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
> Signed-off-by: Stewart Hildebrand <stewart.hildebrand@amd.com>
> Signed-off-by: Mykyta Poturai <mykyta_poturai@epam.com>
> ---
> v1->v2:
> * switch to VF discovery by Xen
> * fix sequential vpci_modify_bars calls
> * move documentation changes to a separate commit
> ---
>  xen/drivers/vpci/Makefile |   2 +-
>  xen/drivers/vpci/header.c |  17 +-
>  xen/drivers/vpci/sriov.c  | 363 ++++++++++++++++++++++++++++++++++++++
>  xen/include/xen/vpci.h    |  12 +-
>  4 files changed, 385 insertions(+), 9 deletions(-)
>  create mode 100644 xen/drivers/vpci/sriov.c
> 
> diff --git a/xen/drivers/vpci/Makefile b/xen/drivers/vpci/Makefile
> index a7c8a30a89..fe1e57b64d 100644
> --- a/xen/drivers/vpci/Makefile
> +++ b/xen/drivers/vpci/Makefile
> @@ -1,2 +1,2 @@
> -obj-y += vpci.o header.o rebar.o
> +obj-y += vpci.o header.o rebar.o sriov.o

I suggest putting sriov.o on its own line

>  obj-$(CONFIG_HAS_PCI_MSI) += msi.o msix.o
> diff --git a/xen/drivers/vpci/header.c b/xen/drivers/vpci/header.c
> index 284964f0d4..c55c3380d4 100644
> --- a/xen/drivers/vpci/header.c
> +++ b/xen/drivers/vpci/header.c
> @@ -264,7 +264,7 @@ bool vpci_process_pending(struct vcpu *v)
>      return false;
>  }
>  
> -static int __init apply_map(struct domain *d, const struct pci_dev *pdev,
> +static int apply_map(struct domain *d, const struct pci_dev *pdev,

There are some differences in apply_map() to be aware of that make it
unsuitable for use in vpci_process_pending() context, hence why it's marked
__init:

* apply_map() uses 'current' instead of the vcpu passed from
  vpci_process_pending(), but I don't think there's a guarantee that 'current'
  will be the same vcpu. In other words, it may end up using the wrong bar_mem.
* apply_map() doesn't return on -ERESTART, instead continuously calls
  process_pending_softirqs(). While this may allow some other pending softirqs
  to execute, process_pending_softirqs() will never invoke the scheduler
  softirq.
* apply_map() doesn't handle errors from map_range() (other than -ERESTART),
  though I'm not sure if this is intentional or a bug. Compare this to the
  mapping loop in vpci_process_pending() that properly cleans up and breaks out
  of the loop on error.

>                              uint16_t cmd)
>  {
>      struct vpci_header *header = &pdev->vpci->header;
> @@ -323,7 +323,8 @@ static void defer_map(const struct pci_dev *pdev, uint16_t cmd, bool rom_only)
>      raise_softirq(SCHEDULE_SOFTIRQ);
>  }
>  
> -int vpci_modify_bars(const struct pci_dev *pdev, uint16_t cmd, bool rom_only)
> +int vpci_modify_bars(const struct pci_dev *pdev, uint16_t cmd, bool rom_only,
> +                     bool no_defer)
>  {
>      struct vpci_header *header = &pdev->vpci->header;
>      struct pci_dev *tmp;
> @@ -519,7 +520,7 @@ int vpci_modify_bars(const struct pci_dev *pdev, uint16_t cmd, bool rom_only)
>          d = dom_xen;
>      }
>  
> -    if ( system_state < SYS_STATE_active )
> +    if ( system_state < SYS_STATE_active || no_defer )
>      {
>          /*
>           * Mappings might be created when building Dom0 if the memory decoding
> @@ -566,7 +567,7 @@ static void cf_check cmd_write(
>           * memory decoding bit has not been changed, so leave everything as-is,
>           * hoping the guest will realize and try again.
>           */
> -        vpci_modify_bars(pdev, cmd, false);
> +        vpci_modify_bars(pdev, cmd, false, false);
>      else
>          pci_conf_write16(pdev->sbdf, reg, cmd);
>  }
> @@ -736,7 +737,7 @@ static void cf_check rom_write(
>       */
>      else if ( vpci_modify_bars(pdev,
>                                 new_enabled ? PCI_COMMAND_MEMORY : 0,
> -                               true) )
> +                               true, false) )
>          /*
>           * No memory has been added or removed from the p2m (because the actual
>           * p2m changes are deferred in defer_map) and the ROM enable bit has
> @@ -954,6 +955,9 @@ int vpci_init_header(struct pci_dev *pdev)
>  
>      header->guest_cmd = cmd;
>  
> +    if ( pdev->info.is_virtfn )
> +        return vf_init_header(pdev);
> +
>      /* Disable memory decoding before sizing. */
>      if ( !is_hwdom || (cmd & PCI_COMMAND_MEMORY) )
>          pci_conf_write16(pdev->sbdf, PCI_COMMAND, cmd & ~PCI_COMMAND_MEMORY);
> @@ -1062,7 +1066,8 @@ int vpci_init_header(struct pci_dev *pdev)
>              goto fail;
>      }
>  
> -    return (cmd & PCI_COMMAND_MEMORY) ? vpci_modify_bars(pdev, cmd, false) : 0;
> +    return (cmd & PCI_COMMAND_MEMORY)
> +                ? vpci_modify_bars(pdev, cmd, false, false) : 0;
>  
>   fail:
>      pci_conf_write16(pdev->sbdf, PCI_COMMAND, cmd);
> diff --git a/xen/drivers/vpci/sriov.c b/xen/drivers/vpci/sriov.c
> new file mode 100644
> index 0000000000..6f691149e9
> --- /dev/null
> +++ b/xen/drivers/vpci/sriov.c
> @@ -0,0 +1,363 @@
> +/* SPDX-License-Identifier: GPL-2.0-only */
> +/*
> + * Handlers for accesses to the SR-IOV capability structure.
> + *
> + * Copyright (C) 2026 Citrix Systems R&D
> + */
> +

#include "private.h"

> +#include <xen/sched.h>
> +#include <xen/vpci.h>
> +#include <xsm/xsm.h>
> +
> +static int vf_init_bars(const struct pci_dev *vf_pdev)
> +{
> +    int vf_idx;
> +    unsigned int i;
> +    const struct pci_dev *pf_pdev = vf_pdev->pf_pdev;
> +    struct vpci_bar *bars = vf_pdev->vpci->header.bars;
> +    struct vpci_bar *physfn_vf_bars = pf_pdev->vpci->sriov->vf_bars;
> +    unsigned int sriov_pos = pci_find_ext_capability(pf_pdev,
> +                                                     PCI_EXT_CAP_ID_SRIOV);
> +    uint16_t offset = pci_conf_read16(pf_pdev->sbdf,
> +                                      sriov_pos + PCI_SRIOV_VF_OFFSET);
> +    uint16_t stride = pci_conf_read16(pf_pdev->sbdf,
> +                                      sriov_pos + PCI_SRIOV_VF_STRIDE);
> +
> +    vf_idx = vf_pdev->sbdf.sbdf - (pf_pdev->sbdf.sbdf + offset);
> +    if ( vf_idx < 0 )
> +        return -EINVAL;
> +
> +    if ( stride )
> +    {
> +        if ( vf_idx % stride )
> +            return -EINVAL;
> +        vf_idx /= stride;
> +    }
> +
> +    /*
> +     * Set up BARs for this VF out of PF's VF BARs taking into account
> +     * the index of the VF.
> +     */
> +    for ( i = 0; i < PCI_SRIOV_NUM_BARS; i++ )
> +    {
> +        bars[i].addr = physfn_vf_bars[i].addr + vf_idx * physfn_vf_bars[i].size;
> +        bars[i].guest_addr = bars[i].addr;
> +        bars[i].size = physfn_vf_bars[i].size;
> +        bars[i].type = physfn_vf_bars[i].type;
> +        bars[i].prefetchable = physfn_vf_bars[i].prefetchable;
> +    }
> +
> +    return 0;
> +}
> +
> +/* Must be called from vpci_process_pending context */
> +static int map_vfs(const struct pci_dev *pf_pdev, uint16_t cmd)
> +{
> +    struct pci_dev *vf_pdev;
> +    int rc;
> +
> +    ASSERT(rw_is_write_locked(&pf_pdev->domain->pci_lock));
> +
> +    list_for_each_entry(vf_pdev, &pf_pdev->vf_list, vf_list) {

Style: { should be on its own line

> +        rc = vpci_modify_bars(vf_pdev, cmd, false, true);
> +        if ( rc )
> +        {
> +            gprintk(XENLOG_ERR, "failed to %s VF %pp: %d\n",
> +                    (cmd & PCI_COMMAND_MEMORY) ? "map" : "unmap",
> +                    &vf_pdev->sbdf, rc);
> +            return rc;
> +        }
> +    }
> +
> +    return 0;
> +}
> +
> +

Stray newline

> +static int size_vf_bars(struct pci_dev *pf_pdev, unsigned int sriov_pos)
> +{
> +    /*
> +     * NB: a non-const pci_dev of the PF is needed in order to update
> +     * vf_rlen.
> +     */
> +    struct vpci_bar *bars;
> +    unsigned int i;
> +    int rc = 0;
> +
> +    ASSERT(rw_is_write_locked(&pf_pdev->domain->pci_lock));
> +    ASSERT(!pf_pdev->info.is_virtfn);
> +    ASSERT(pf_pdev->vpci->sriov);
> +
> +    /* Read BARs for VFs out of PF's SR-IOV extended capability. */
> +    bars = pf_pdev->vpci->sriov->vf_bars;
> +    /* Set the BARs addresses and size. */
> +    for ( i = 0; i < PCI_SRIOV_NUM_BARS; i += rc )
> +    {
> +        unsigned int idx = sriov_pos + PCI_SRIOV_BAR + i * 4;
> +        uint32_t bar;
> +        uint64_t addr, size;
> +
> +        bar = pci_conf_read32(pf_pdev->sbdf, idx);
> +
> +        rc = pci_size_mem_bar(pf_pdev->sbdf, idx, &addr, &size,
> +                              PCI_BAR_VF |
> +                              ((i == PCI_SRIOV_NUM_BARS - 1) ? PCI_BAR_LAST
> +                                                             : 0));
> +
> +        /*
> +         * Update vf_rlen on the PF. According to the spec the size of
> +         * the BARs can change if the system page size register is
> +         * modified, so always update rlen when enabling VFs.
> +         */
> +        pf_pdev->physfn.vf_rlen[i] = size;
> +
> +        if ( !size )
> +        {
> +            bars[i].type = VPCI_BAR_EMPTY;
> +            continue;
> +        }
> +
> +        bars[i].addr = addr;
> +        bars[i].guest_addr = addr;
> +        bars[i].size = size;
> +        bars[i].prefetchable = bar & PCI_BASE_ADDRESS_MEM_PREFETCH;
> +
> +        switch ( rc )
> +        {
> +        case 1:
> +            bars[i].type = VPCI_BAR_MEM32;
> +            break;
> +
> +        case 2:
> +            bars[i].type = VPCI_BAR_MEM64_LO;
> +            bars[i + 1].type = VPCI_BAR_MEM64_HI;
> +            break;
> +
> +        default:
> +            ASSERT_UNREACHABLE();
> +        }
> +    }
> +
> +    rc = rc > 0 ? 0 : rc;
> +
> +    return rc;
> +}
> +
> +struct callback_data {
> +    const struct pci_dev *pdev;
> +    unsigned int pos;
> +    uint32_t value;
> +    bool enable : 1;
> +    bool disable : 1;
> +    bool map : 1;
> +    bool unmap : 1;
> +};
> +
> +static void cf_check control_write_cb(void *data)
> +{
> +    struct callback_data *cb = data;
> +    const struct pci_dev *pdev = cb->pdev;
> +    uint16_t offset = pci_conf_read16(pdev->sbdf, cb->pos + PCI_SRIOV_VF_OFFSET);
> +    uint16_t stride = pci_conf_read16(pdev->sbdf, cb->pos + PCI_SRIOV_VF_STRIDE);
> +    struct vpci_sriov *sriov = pdev->vpci->sriov;
> +    int rc = 0;
> +    unsigned int i;

The validity of pdev should be checked. vpci_process_pending() already has
logic for checking pdev validity and acquiring d->pci_lock; I suggest
reusing that here.

> +
> +    if ( cb->unmap )
> +    {
> +        write_lock(&pdev->domain->pci_lock);
> +        map_vfs(pdev, 0);
> +        write_unlock(&pdev->domain->pci_lock);

I don't think it's appropriate to release the lock here, then re-acquire it
below in the 'if ( cb->map )' condition, however ...

> +    }
> +
> +    if ( cb->enable || cb->disable )
> +    {
> +        for ( i = 0; i < sriov->num_vfs; i++ )
> +        {
> +            const pci_sbdf_t vf_sbdf = {
> +                .sbdf = pdev->sbdf.sbdf + offset + stride * i,
> +            };
> +
> +            if ( cb->enable )
> +            {
> +                const struct pci_dev_info info = {
> +                    .is_virtfn = true,
> +                    .is_extfn = false,
> +                    .physfn.bus = pdev->sbdf.bus,
> +                    .physfn.devfn = pdev->sbdf.devfn,
> +                };
> +                rc = pci_add_device(vf_sbdf.seg, vf_sbdf.bus, vf_sbdf.devfn,
> +                                    &info, pdev->node);

... pci_add_device() acquires pcidevs_lock(), which would be against the locking
order if d->pci_lock is already held (see the comment for pci_lock in sched.h).
I wonder if we need a variant of pci_add_device() where the caller obtains the
appropriate lock(s)?

We should also consider that pci_add_device() performs vpci_assign_device(), and
I haven't completely thought through the implications of that yet.

> +            }
> +            if ( cb->disable )
> +                rc = pci_remove_device(vf_sbdf.seg, vf_sbdf.bus, vf_sbdf.devfn);
> +
> +            if ( rc && rc != -ENODEV)
> +                gprintk(XENLOG_ERR, "failed to %s VF %pp: %d\n",
> +                        cb->enable ? "add" : "remove", &vf_sbdf, rc);
> +        }
> +    }
> +
> +    if ( cb->map )
> +    {
> +        write_lock(&pdev->domain->pci_lock);
> +        rc = map_vfs(pdev, PCI_COMMAND_MEMORY);
> +
> +        if ( rc )
> +            map_vfs(pdev, 0);
> +        write_unlock(&pdev->domain->pci_lock);
> +    }
> +
> +    pci_conf_write16(pdev->sbdf, cb->pos + PCI_SRIOV_CTRL, cb->value);
> +    xfree(cb);
> +}
> +
> +static void cf_check control_write(const struct pci_dev *pdev, unsigned int reg,
> +                                   uint32_t val, void *data)
> +{
> +    unsigned int sriov_pos = reg - PCI_SRIOV_CTRL;
> +    struct vpci_sriov *sriov = pdev->vpci->sriov;
> +    struct callback_data *cb = NULL;
> +    uint16_t control = pci_conf_read16(pdev->sbdf, reg);
> +    bool mem_enabled = control & PCI_SRIOV_CTRL_MSE;
> +    bool new_mem_enabled = val & PCI_SRIOV_CTRL_MSE;
> +    bool enabled = control & PCI_SRIOV_CTRL_VFE;
> +    bool new_enabled = val & PCI_SRIOV_CTRL_VFE;
> +
> +    ASSERT(!pdev->info.is_virtfn);
> +
> +    if ( new_enabled == enabled && new_mem_enabled == mem_enabled )
> +    {
> +        pci_conf_write16(pdev->sbdf, reg, val);
> +        return;
> +    }
> +
> +    cb = xzalloc(struct callback_data);

Are there any alternatives to this runtime allocation? E.g. could the state be
stored in struct vpci_vcpu or struct vpci_sriov?

> +
> +    if ( !cb )
> +    {
> +        gprintk(XENLOG_ERR,
> +                "%pp: Unable to allocate memory for SR-IOV enable\n",
> +                pdev);
> +        return;
> +    }
> +
> +    cb->pdev = pdev;
> +    cb->pos = sriov_pos;
> +    cb->value = val;
> +    cb->map = new_mem_enabled && !mem_enabled;
> +    cb->unmap = !new_mem_enabled && mem_enabled;
> +    cb->enable = new_enabled && !enabled;
> +    cb->disable = !new_enabled && enabled;
> +
> +    current->vpci.task = WAIT;
> +    current->vpci.wait.callback = control_write_cb;
> +    current->vpci.wait.data = cb;
> +    current->vpci.wait.end = NOW();
> +
> +    if ( cb->enable )
> +    {
> +        size_vf_bars((struct pci_dev *)pdev, sriov_pos);
> +
> +        /*
> +         * Only update the number of active VFs when enabling, when
> +         * disabling use the cached value in order to always remove the same
> +         * number of VFs that were active.
> +         */
> +        sriov->num_vfs = pci_conf_read16(pdev->sbdf,
> +                                         sriov_pos + PCI_SRIOV_NUM_VF);
> +        /*
> +         * NB: VFE needs to be enabled before calling pci_add_device so Xen
> +         * can access the config space of VFs. FIXME casting away const-ness
> +         * to modify vf_rlen
> +         */
> +        pci_conf_write16(pdev->sbdf, reg, control | PCI_SRIOV_CTRL_VFE);
> +        /*
> +         * The spec states that the software must wait at least 100ms before
> +         * attempting to access VF registers when enabling virtual functions
> +         * on the PF.
> +         */
> +
> +        current->vpci.wait.end = NOW() + MILLISECS(100);
> +    }
> +}
> +
> +int vf_init_header(struct pci_dev *vf_pdev)
> +{
> +    const struct pci_dev *pf_pdev;
> +    unsigned int sriov_pos;
> +    int rc = 0;
> +    uint16_t ctrl;
> +
> +    ASSERT(rw_is_write_locked(&vf_pdev->domain->pci_lock));
> +
> +    if ( !vf_pdev->info.is_virtfn )
> +        return 0;
> +
> +    pf_pdev = vf_pdev->pf_pdev;
> +    ASSERT(pf_pdev);
> +
> +    rc = vf_init_bars(vf_pdev);
> +    if ( rc )
> +        return rc;
> +
> +    sriov_pos = pci_find_ext_capability(pf_pdev, PCI_EXT_CAP_ID_SRIOV);
> +    ctrl = pci_conf_read16(pf_pdev->sbdf, sriov_pos + PCI_SRIOV_CTRL);
> +
> +    if ( (pf_pdev->domain == vf_pdev->domain) && (ctrl & PCI_SRIOV_CTRL_MSE) )
> +    {
> +        rc = vpci_modify_bars(vf_pdev, PCI_COMMAND_MEMORY, false, false);
> +        if ( rc )
> +            return rc;
> +    }
> +
> +    return rc;
> +}
> +
> +static int cf_check init_sriov(struct pci_dev *pdev)
> +{
> +    unsigned int pos;
> +
> +    ASSERT(!pdev->info.is_virtfn);
> +
> +    pos = pci_find_ext_capability(pdev, PCI_EXT_CAP_ID_SRIOV);
> +
> +    if ( !pos )
> +        return 0;
> +
> +    if ( xsm_resource_setup_pci(XSM_PRIV, pdev->sbdf.bdf) )
> +    {
> +        printk(XENLOG_ERR
> +               "%pp: SR-IOV configuration unsupported for unpriv %pd\n",
> +               &pdev->sbdf, pdev->domain);
> +        return 0;

Should this return an error?

> +    }
> +
> +    pdev->vpci->sriov = xzalloc(struct vpci_sriov);
> +    if ( !pdev->vpci->sriov )
> +        return -ENOMEM;
> +
> +    return vpci_add_register(pdev->vpci, vpci_hw_read16, control_write,
> +                             pos + PCI_SRIOV_CTRL, 2, NULL);
> +}
> +
> +static int cf_check cleanup_sriov(const struct pci_dev *pdev, bool hide)
> +{
> +    if ( hide )
> +        return 0;
> +
> +    XFREE(pdev->vpci->sriov);

pdev->vpci->sriov should always be freed, regardless of whether hide is true
or false.

For the hardware domain, there should also be a handler added to hide the
capability when hide == true. See the other capability cleanup hooks for
examples.

> +
> +    return 0;
> +}
> +
> +REGISTER_VPCI_EXTCAP(SRIOV, init_sriov, cleanup_sriov);
> +
> +/*
> + * Local variables:
> + * mode: C
> + * c-file-style: "BSD"
> + * c-basic-offset: 4
> + * tab-width: 4
> + * indent-tabs-mode: nil
> + * End:
> + */
> diff --git a/xen/include/xen/vpci.h b/xen/include/xen/vpci.h
> index 47cdb54d42..ae5f3b7274 100644
> --- a/xen/include/xen/vpci.h
> +++ b/xen/include/xen/vpci.h
> @@ -45,6 +45,7 @@ typedef struct {
>      REGISTER_VPCI_CAPABILITY(PCI_EXT_CAP_ID_##name, name, finit, fclean, true)
>  
>  int __must_check vpci_init_header(struct pci_dev *pdev);
> +int __must_check vf_init_header(struct pci_dev *pdev);

Move to private.h. The function name should also have a vpci_ prefix since it's
not static.

>  
>  /* Assign vPCI to device by adding handlers. */
>  int __must_check vpci_assign_device(struct pci_dev *pdev);
> @@ -146,7 +147,6 @@ struct vpci {
>           * upon to know whether BARs are mapped into the guest p2m.
>           */
>          bool bars_mapped      : 1;
> -        /* FIXME: currently there's no support for SR-IOV. */
>      } header;
>  
>      /* MSI data. */
> @@ -200,6 +200,13 @@ struct vpci {
>              struct vpci_arch_msix_entry arch;
>          } entries[];
>      } *msix;
> +
> +    struct vpci_sriov {
> +        /* PF only */
> +        struct vpci_bar vf_bars[PCI_SRIOV_NUM_BARS];
> +        uint16_t num_vfs;
> +    } *sriov;
> +
>  #ifdef CONFIG_HAS_VPCI_GUEST_SUPPORT
>      /* Guest SBDF of the device. */
>  #define INVALID_GUEST_SBDF ((pci_sbdf_t){ .sbdf = ~0U })
> @@ -323,7 +330,8 @@ bool vpci_ecam_read(pci_sbdf_t sbdf, unsigned int reg, unsigned int len,
>                      unsigned long *data);
>  
>  /* Map/unmap the BARs of a vPCI device. */
> -int vpci_modify_bars(const struct pci_dev *pdev, uint16_t cmd, bool rom_only);
> +int vpci_modify_bars(const struct pci_dev *pdev, uint16_t cmd, bool rom_only,
> +                     bool no_defer);

Move to private.h

>  
>  #endif /* __XEN__ */
>  



^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH v2 6/8] vpci: add SR-IOV support for PVH Dom0
  2026-03-19 21:11   ` Stewart Hildebrand
@ 2026-03-23  9:46     ` Mykyta Poturai
  0 siblings, 0 replies; 29+ messages in thread
From: Mykyta Poturai @ 2026-03-23  9:46 UTC (permalink / raw)
  To: Stewart Hildebrand, xen-devel@lists.xenproject.org
  Cc: Roger Pau Monné, Daniel P. Smith

On 3/19/26 23:11, Stewart Hildebrand wrote:
> On 3/9/26 07:08, Mykyta Poturai wrote:
>> From: Stewart Hildebrand <stewart.hildebrand@amd.com>
>>
>> This code is expected to only be used by privileged domains,
>> unprivileged domains should not get access to the SR-IOV capability.
>>
>> Implement RW handlers for PCI_SRIOV_CTRL register to dynamically
>> map/unmap VF BARS. Recalculate BAR sizes before mapping VFs to account
>> for possible changes in the system page size register.
>>
>> Allow forcing vpci_modify_bars to not defer the actual mapping changes,
> 
> I don't think this is suitable. We perform the p2m operations in a deferred
> context because they may take a long time. And since they may take a long time,
> the logic is interruptible: in map_range(), we perform a general_preempt_check()
> and return -ERESTART so that we give a chance for other pending work to
> complete, including the scheduler softirq. If vpci_process_pending() returns
> true, it will be called again and is expected to resume where it left off. The
> vcpu won't continue until vpci_process_pending() returns false.
> 
>> which is needed to fix the sequential calls to vpci_modify_bars when
>> enabling VFs from Dom0.
> 
> I'm guessing you resorted to this because you need to perform multiple mapping
> operations, but the vPCI deferred mapping mechanism only supports a single
> operation? If so, this is an issue I've been attempting to resolve for some time
> with the BAR-write-with-memory-decoding-enabled series [1]. In that series I'm
> working on introducing the ability to perform multiple mapping operations. I'm
> almost ready to send v3 of the BAR-write-with-memory-decoding-enabled series,
> and I hope you don't mind that I include your patch ("vpci: Use pervcpu ranges
> for BAR mapping"). You may consider the possibility of basing SR-IOV on this
> work if suitable.
> 
> [1] https://lore.kernel.org/xen-devel/20250723163744.13095-1-stewart.hildebrand@amd.com/T/#t
> 

I’ve looked at your changes, but there seems to be a push against
dynamically allocating tasks, which would not work for SR-IOV, or would
require a lot of task structs to be preallocated yet used only rarely.

> Regardless, ultimately we need to find a way to return from
> vpci_process_pending() during the potentially long-running p2m operations.
> As an alternative suggestion, could you return from control_write_cb() after
> each call to map_vfs(), and somehow make it resume where it left off?

I’ll try this approach, thanks.

-- 
Mykyta


* Re: [PATCH v2 4/8] vpci: add a wait operation to the vpci vcpu pending actions
  2026-03-09 11:08 ` [PATCH v2 4/8] vpci: add a wait operation to the vpci vcpu pending actions Mykyta Poturai
  2026-03-17 14:02   ` Stewart Hildebrand
@ 2026-03-31  9:53   ` Jan Beulich
  2026-03-31 14:55   ` Jan Beulich
  2 siblings, 0 replies; 29+ messages in thread
From: Jan Beulich @ 2026-03-31  9:53 UTC (permalink / raw)
  To: Mykyta Poturai
  Cc: Roger Pau Monné, Stewart Hildebrand,
	xen-devel@lists.xenproject.org

On 09.03.2026 12:08, Mykyta Poturai wrote:
> --- a/xen/include/xen/vpci.h
> +++ b/xen/include/xen/vpci.h
> @@ -212,7 +212,26 @@ struct vpci_vcpu {
>      /* Per-vcpu structure to store state while {un}mapping of PCI BARs. */
>      const struct pci_dev *pdev;
>  #ifdef __XEN__
> +    enum {
> +        NONE,
> +        MODIFY_MEMORY,
> +        WAIT,
> +    } task;

Unlike structure or union fields, the scope of enumerators is global. I
don't think generic names like NONE and WAIT should be introduced into
global scope. At the very least VPCI_ wants prefixing to them, albeit
VPCI_NONE of course isn't to going to read very well either. Hence
either replace "NONE" there as well, or use e.g. VPCI_OP_* as a prefix.

Jan



* Re: [PATCH v2 3/8] vpci: Use pervcpu ranges for BAR mapping
  2026-03-09 11:08 ` [PATCH v2 3/8] vpci: Use pervcpu ranges for BAR mapping Mykyta Poturai
  2026-03-16 21:36   ` Stewart Hildebrand
  2026-03-17  0:36   ` Stewart Hildebrand
@ 2026-03-31  9:59   ` Jan Beulich
  2026-03-31 11:56     ` Andrew Cooper
  2 siblings, 1 reply; 29+ messages in thread
From: Jan Beulich @ 2026-03-31  9:59 UTC (permalink / raw)
  To: Mykyta Poturai
  Cc: Andrew Cooper, Anthony PERARD, Michal Orzel, Julien Grall,
	Roger Pau Monné, Stefano Stabellini, Stewart Hildebrand,
	xen-devel@lists.xenproject.org

On 09.03.2026 12:08, Mykyta Poturai wrote:
> --- a/xen/common/domain.c
> +++ b/xen/common/domain.c
> @@ -454,6 +454,14 @@ static int vcpu_teardown(struct vcpu *v)
>   */
>  static void vcpu_destroy(struct vcpu *v)
>  {
> +#ifdef CONFIG_HAS_VPCI
> +    int i;

Nit: No plain int please when ...

> +    for ( i = 0; i < ARRAY_SIZE(v->vpci.bar_mem); i++ )

... its values are only ever non-negative. (Applies elsewhere as well.)

Jan



* Re: [PATCH v2 3/8] vpci: Use pervcpu ranges for BAR mapping
  2026-03-31  9:59   ` Jan Beulich
@ 2026-03-31 11:56     ` Andrew Cooper
  0 siblings, 0 replies; 29+ messages in thread
From: Andrew Cooper @ 2026-03-31 11:56 UTC (permalink / raw)
  To: Jan Beulich, Mykyta Poturai
  Cc: Andrew Cooper, Anthony PERARD, Michal Orzel, Julien Grall,
	Roger Pau Monné, Stefano Stabellini, Stewart Hildebrand,
	xen-devel@lists.xenproject.org

On 31/03/2026 10:59 am, Jan Beulich wrote:
> On 09.03.2026 12:08, Mykyta Poturai wrote:
>> --- a/xen/common/domain.c
>> +++ b/xen/common/domain.c
>> @@ -454,6 +454,14 @@ static int vcpu_teardown(struct vcpu *v)
>>   */
>>  static void vcpu_destroy(struct vcpu *v)
>>  {
>> +#ifdef CONFIG_HAS_VPCI
>> +    int i;
> Nit: No plain int please when ...
>
>> +    for ( i = 0; i < ARRAY_SIZE(v->vpci.bar_mem); i++ )
> ... its values are only ever non-negative. (Applies elsewhere as well.)

Furthermore, please use `for ( unsigned int i = ...`

That avoids needing to play weird scope games for i.

~Andrew



* Re: [PATCH v2 5/8] pci/iommu: Check that IOMMU supports removing devices
  2026-03-09 11:08 ` [PATCH v2 5/8] pci/iommu: Check that IOMMU supports removing devices Mykyta Poturai
@ 2026-03-31 14:28   ` Jan Beulich
  0 siblings, 0 replies; 29+ messages in thread
From: Jan Beulich @ 2026-03-31 14:28 UTC (permalink / raw)
  To: Mykyta Poturai; +Cc: Roger Pau Monné, xen-devel@lists.xenproject.org

On 09.03.2026 12:08, Mykyta Poturai wrote:
> before trying to remove them.
> Some IOMMU platforms, such as SMMU-V3 and IPMMU-VMSA don't support
> removing devices. Trying to call remove_device will result in a crash.
> So check that platform_ops have remove_device set before calling it.

Hmm, but both have .add_device populated. They ought to support
.remove_device, especially if ...

> --- a/xen/drivers/passthrough/pci.c
> +++ b/xen/drivers/passthrough/pci.c
> @@ -1545,6 +1545,12 @@ static int iommu_remove_device(struct pci_dev *pdev)
>      if ( !is_iommu_enabled(pdev->domain) )
>          return 0;
>  
> +    if ( !hd->platform_ops->remove_device )
> +    {
> +        printk(XENLOG_ERR "IOMMU: remove_device not supported by platform\n");
> +        return -EOPNOTSUPP;
> +    }
> +
>      for ( devfn = pdev->devfn ; pdev->phantom_stride; )
>      {
>          int rc;

... this is for PCI. (I'm simply not qualified to discuss DT.) This
being in the SR-IOV series, I have to assume for the change to be about
a PF de-configuring its VFs. This imo shouldn't end in -EOPNOTSUPP. In
fact PHYSDEVOP_pci_device_remove serves largely as a notification -
please check the xen_remove_device() use in Linux. That is, the device
is going to be gone anyway, and hence we'd better take care of that fact
in Xen.

Jan



* Re: [PATCH v2 6/8] vpci: add SR-IOV support for PVH Dom0
  2026-03-09 11:08 ` [PATCH v2 6/8] vpci: add SR-IOV support for PVH Dom0 Mykyta Poturai
  2026-03-19 21:11   ` Stewart Hildebrand
@ 2026-03-31 14:44   ` Jan Beulich
  1 sibling, 0 replies; 29+ messages in thread
From: Jan Beulich @ 2026-03-31 14:44 UTC (permalink / raw)
  To: Mykyta Poturai
  Cc: Stewart Hildebrand, Roger Pau Monné, Daniel P. Smith,
	xen-devel@lists.xenproject.org

On 09.03.2026 12:08, Mykyta Poturai wrote:
> +static void cf_check control_write(const struct pci_dev *pdev, unsigned int reg,
> +                                   uint32_t val, void *data)
> +{
> +    unsigned int sriov_pos = reg - PCI_SRIOV_CTRL;
> +    struct vpci_sriov *sriov = pdev->vpci->sriov;
> +    struct callback_data *cb = NULL;
> +    uint16_t control = pci_conf_read16(pdev->sbdf, reg);
> +    bool mem_enabled = control & PCI_SRIOV_CTRL_MSE;
> +    bool new_mem_enabled = val & PCI_SRIOV_CTRL_MSE;
> +    bool enabled = control & PCI_SRIOV_CTRL_VFE;
> +    bool new_enabled = val & PCI_SRIOV_CTRL_VFE;
> +
> +    ASSERT(!pdev->info.is_virtfn);
> +
> +    if ( new_enabled == enabled && new_mem_enabled == mem_enabled )
> +    {
> +        pci_conf_write16(pdev->sbdf, reg, val);
> +        return;
> +    }
> +
> +    cb = xzalloc(struct callback_data);
> +
> +    if ( !cb )
> +    {
> +        gprintk(XENLOG_ERR,
> +                "%pp: Unable to allocate memory for SR-IOV enable\n",
> +                pdev);
> +        return;
> +    }
> +
> +    cb->pdev = pdev;
> +    cb->pos = sriov_pos;
> +    cb->value = val;
> +    cb->map = new_mem_enabled && !mem_enabled;
> +    cb->unmap = !new_mem_enabled && mem_enabled;
> +    cb->enable = new_enabled && !enabled;
> +    cb->disable = !new_enabled && enabled;
> +
> +    current->vpci.task = WAIT;
> +    current->vpci.wait.callback = control_write_cb;
> +    current->vpci.wait.data = cb;
> +    current->vpci.wait.end = NOW();
> +
> +    if ( cb->enable )
> +    {
> +        size_vf_bars((struct pci_dev *)pdev, sriov_pos);

No casting away of const, please. See Misra rule 11.8.

> +        /*
> +         * Only update the number of active VFs when enabling, when
> +         * disabling use the cached value in order to always remove the same
> +         * number of VFs that were active.
> +         */
> +        sriov->num_vfs = pci_conf_read16(pdev->sbdf,
> +                                         sriov_pos + PCI_SRIOV_NUM_VF);
> +        /*
> +         * NB: VFE needs to be enabled before calling pci_add_device so Xen
> +         * can access the config space of VFs. FIXME casting away const-ness
> +         * to modify vf_rlen
> +         */
> +        pci_conf_write16(pdev->sbdf, reg, control | PCI_SRIOV_CTRL_VFE);
> +        /*
> +         * The spec states that the software must wait at least 100ms before
> +         * attempting to access VF registers when enabling virtual functions
> +         * on the PF.
> +         */
> +
> +        current->vpci.wait.end = NOW() + MILLISECS(100);

Aren't you effectively busy-waiting for these 100ms, by simply returning "true"
from vpci_process_pending() until the time has passed? This imo is a no-go. You
want to set a timer and put the vCPU to sleep, to wake it up again when the
timer has expired. That'll then eliminate the need for the not-so-nice patch 4.

Question is whether we need to actually go this far (right away). I expect you
don't mean to hand PFs to DomU-s. As long as we keep them in the hardware
domain, can't we trust it to set things up correctly, just like we trust it in
a number of other aspects?

Jan



* Re: [PATCH v2 4/8] vpci: add a wait operation to the vpci vcpu pending actions
  2026-03-09 11:08 ` [PATCH v2 4/8] vpci: add a wait operation to the vpci vcpu pending actions Mykyta Poturai
  2026-03-17 14:02   ` Stewart Hildebrand
  2026-03-31  9:53   ` Jan Beulich
@ 2026-03-31 14:55   ` Jan Beulich
  2026-04-01  7:59     ` Mykyta Poturai
  2 siblings, 1 reply; 29+ messages in thread
From: Jan Beulich @ 2026-03-31 14:55 UTC (permalink / raw)
  To: Mykyta Poturai
  Cc: Roger Pau Monné, Stewart Hildebrand,
	xen-devel@lists.xenproject.org

On 09.03.2026 12:08, Mykyta Poturai wrote:
> --- a/xen/drivers/vpci/header.c
> +++ b/xen/drivers/vpci/header.c
> @@ -175,76 +175,92 @@ static void modify_decoding(const struct pci_dev *pdev, uint16_t cmd,
>  
>  bool vpci_process_pending(struct vcpu *v)
>  {
> -    const struct pci_dev *pdev = v->vpci.pdev;
> -    struct vpci_header *header = NULL;
> -    unsigned int i;
> -
> -    if ( !pdev )
> -        return false;
> -
> -    read_lock(&v->domain->pci_lock);
> -
> -    if ( !pdev->vpci || (v->domain != pdev->domain) )
> +    switch ( v->vpci.task )
>      {
> -        v->vpci.pdev = NULL;
> -        read_unlock(&v->domain->pci_lock);
> -        return false;
> -    }
> -
> -    header = &pdev->vpci->header;
> -    for ( i = 0; i < ARRAY_SIZE(header->bars); i++ )
> +    case MODIFY_MEMORY:
>      {
> -        struct vpci_bar *bar = &header->bars[i];
> -        struct rangeset *mem = v->vpci.bar_mem[i];
> -        struct map_data data = {
> -            .d = v->domain,
> -            .map = v->vpci.cmd & PCI_COMMAND_MEMORY,
> -            .bar = bar,
> -        };
> -        int rc;
> +        const struct pci_dev *pdev = v->vpci.memory.pdev;
> +        struct vpci_header *header = NULL;
> +        unsigned int i;
>  
> -        if ( rangeset_is_empty(mem) )
> -            continue;
> +        if ( !pdev )
> +            break;
>  
> -        rc = rangeset_consume_ranges(mem, map_range, &data);
> +        read_lock(&v->domain->pci_lock);
>  
> -        if ( rc == -ERESTART )
> +        if ( !pdev->vpci || (v->domain != pdev->domain) )
>          {
> +            v->vpci.memory.pdev = NULL;
>              read_unlock(&v->domain->pci_lock);
> -            return true;
> +            break;
>          }
>  
> -        if ( rc )
> +        header = &pdev->vpci->header;
> +        for ( i = 0; i < ARRAY_SIZE(header->bars); i++ )
>          {
> -            spin_lock(&pdev->vpci->lock);
> -            /* Disable memory decoding unconditionally on failure. */
> -            modify_decoding(pdev, v->vpci.cmd & ~PCI_COMMAND_MEMORY,
> -                            false);
> -            spin_unlock(&pdev->vpci->lock);
> +            struct vpci_bar *bar = &header->bars[i];
> +            struct rangeset *mem = v->vpci.bar_mem[i];
> +            struct map_data data = {
> +                .d = v->domain,
> +                .map = v->vpci.memory.cmd & PCI_COMMAND_MEMORY,
> +                .bar = bar,
> +            };
> +            int rc;
> +
> +            if ( rangeset_is_empty(mem) )
> +                continue;
>  
> -            /* Clean all the rangesets */
> -            for ( i = 0; i < ARRAY_SIZE(header->bars); i++ )
> -                if ( !rangeset_is_empty(v->vpci.bar_mem[i]) )
> -                     rangeset_purge(v->vpci.bar_mem[i]);
> +            rc = rangeset_consume_ranges(mem, map_range, &data);
>  
> -            v->vpci.pdev = NULL;
> +            if ( rc == -ERESTART )
> +            {
> +                read_unlock(&v->domain->pci_lock);
> +                return true;
> +            }
>  
> -            read_unlock(&v->domain->pci_lock);
> +            if ( rc )
> +            {
> +                spin_lock(&pdev->vpci->lock);
> +                /* Disable memory decoding unconditionally on failure. */
> +                modify_decoding(pdev, v->vpci.memory.cmd & ~PCI_COMMAND_MEMORY,
> +                                false);
> +                spin_unlock(&pdev->vpci->lock);
> +
> +                /* Clean all the rangesets */
> +                for ( i = 0; i < ARRAY_SIZE(header->bars); i++ )
> +                    if ( !rangeset_is_empty(v->vpci.bar_mem[i]) )
> +                        rangeset_purge(v->vpci.bar_mem[i]);
> +
> +                v->vpci.memory.pdev = NULL;
> +
> +                read_unlock(&v->domain->pci_lock);
>  
> -            if ( !is_hardware_domain(v->domain) )
> -                domain_crash(v->domain);
> +                if ( !is_hardware_domain(v->domain) )
> +                    domain_crash(v->domain);
>  
> -            return false;
> +                break;
> +            }
>          }
> -    }
> -    v->vpci.pdev = NULL;
> +        v->vpci.memory.pdev = NULL;
>  
> -    spin_lock(&pdev->vpci->lock);
> -    modify_decoding(pdev, v->vpci.cmd, v->vpci.rom_only);
> -    spin_unlock(&pdev->vpci->lock);
> +        spin_lock(&pdev->vpci->lock);
> +        modify_decoding(pdev, v->vpci.memory.cmd, v->vpci.memory.rom_only);
> +        spin_unlock(&pdev->vpci->lock);
>  
> -    read_unlock(&v->domain->pci_lock);
> +        read_unlock(&v->domain->pci_lock);
> +
> +        break;
> +    }
> +    case WAIT:
> +        if ( NOW() < v->vpci.wait.end )
> +            return true;
> +        v->vpci.wait.callback(v->vpci.wait.data);
> +        break;

As just indicated in reply to patch 6, busy waiting isn't really acceptable.
This is even more so when the waiting exceeds the typical length of a
scheduling timeslice.

In that other reply I said to put the vCPU to sleep, but you need to be careful
there too: The domain may not expect its vCPU to not make any progress for such
an extended period of time. This may need doing entirely differently: Once the
command register was written, you may want to record the time after which
accesses to the VF registers are permitted. Earlier accesses would simply be
terminated. You may still additionally need a timer, in order to kick off BAR
> mapping after that time. (Yet better would be if the BAR mapping could be
done during those 100ms. After all that may be a reason why this long a delay
is specified: Firmware on the device may also require some time to set up the
BARs accordingly.)

Jan



* Re: [PATCH v2 7/8] vpci: add SR-IOV support for DomUs
  2026-03-09 11:08 ` [PATCH v2 7/8] vpci: add SR-IOV support for DomUs Mykyta Poturai
@ 2026-03-31 15:02   ` Jan Beulich
  0 siblings, 0 replies; 29+ messages in thread
From: Jan Beulich @ 2026-03-31 15:02 UTC (permalink / raw)
  To: Mykyta Poturai
  Cc: Stewart Hildebrand, Roger Pau Monné,
	xen-devel@lists.xenproject.org

On 09.03.2026 12:08, Mykyta Poturai wrote:
> From: Stewart Hildebrand <stewart.hildebrand@amd.com>
> 
> SR-IOV virtual functions have simplified configuration space such as
> Vendor ID is derived from the physical function and Device ID comes
> from SR-IOV extended capability.
> Emulate those, so we can provide VID/DID pair for guests which use PCI
> passthrough for SR-IOV virtual functions.

Why do we need to emulate what hardware would do anyway? These are r/o
fields, so likely okay to expose directly to a domain?

> --- a/xen/drivers/vpci/sriov.c
> +++ b/xen/drivers/vpci/sriov.c
> @@ -303,6 +303,63 @@ int vf_init_header(struct pci_dev *vf_pdev)
>      sriov_pos = pci_find_ext_capability(pf_pdev, PCI_EXT_CAP_ID_SRIOV);
>      ctrl = pci_conf_read16(pf_pdev->sbdf, sriov_pos + PCI_SRIOV_CTRL);
>  
> +#ifdef CONFIG_HAS_VPCI_GUEST_SUPPORT
> +    if ( pf_pdev->domain != vf_pdev->domain )

Is there anything speaking against the (preferred) use of
IS_ENABLED(CONFIG_HAS_VPCI_GUEST_SUPPORT) here?

> +    {
> +        uint16_t vid = pci_conf_read16(pf_pdev->sbdf, PCI_VENDOR_ID);
> +        uint16_t did = pci_conf_read16(pf_pdev->sbdf,
> +                                       sriov_pos + PCI_SRIOV_VF_DID);
> +        struct vpci_bar *bars = vf_pdev->vpci->header.bars;
> +        unsigned int i;
> +
> +        rc = vpci_add_register(vf_pdev->vpci, vpci_read_val, NULL,
> +                               PCI_VENDOR_ID, 2, (void *)(uintptr_t)vid);
> +        if ( rc )
> +            return rc;
> +
> +        rc = vpci_add_register(vf_pdev->vpci, vpci_read_val, NULL,
> +                               PCI_DEVICE_ID, 2, (void *)(uintptr_t)did);
> +        if ( rc )
> +            return rc;
> +
> +        /* Hardcode multi-function device bit to 0 */
> +        rc = vpci_add_register(vf_pdev->vpci, vpci_read_val, NULL,
> +                               PCI_HEADER_TYPE, 1,
> +                               (void *)PCI_HEADER_TYPE_NORMAL);
> +        if ( rc )
> +            return rc;
> +
> +        rc = vpci_add_register(vf_pdev->vpci, vpci_hw_read32, NULL,
> +                               PCI_CLASS_REVISION, 4, NULL);
> +        if ( rc )
> +            return rc;
> +
> +        for ( i = 0; i < PCI_SRIOV_NUM_BARS; i++ )
> +        {
> +            switch ( pf_pdev->vpci->sriov->vf_bars[i].type )
> +            {
> +            case VPCI_BAR_MEM32:
> +            case VPCI_BAR_MEM64_LO:
> +            case VPCI_BAR_MEM64_HI:
> +                rc = vpci_add_register(vf_pdev->vpci, vpci_guest_mem_bar_read,
> +                                       vpci_guest_mem_bar_write,
> +                                       PCI_BASE_ADDRESS_0 + i * 4, 4, &bars[i]);
> +                if ( rc )
> +                    return rc;
> +                break;
> +            default:

Nit: Blank lines please between non-fall-through case blocks.

Jan



* Re: [PATCH v2 4/8] vpci: add a wait operation to the vpci vcpu pending actions
  2026-03-31 14:55   ` Jan Beulich
@ 2026-04-01  7:59     ` Mykyta Poturai
  2026-04-01  8:21       ` Jan Beulich
  0 siblings, 1 reply; 29+ messages in thread
From: Mykyta Poturai @ 2026-04-01  7:59 UTC (permalink / raw)
  To: Jan Beulich
  Cc: Roger Pau Monné, Stewart Hildebrand,
	xen-devel@lists.xenproject.org

On 3/31/26 17:55, Jan Beulich wrote:
> On 09.03.2026 12:08, Mykyta Poturai wrote:
>> --- a/xen/drivers/vpci/header.c
>> +++ b/xen/drivers/vpci/header.c
>> @@ -175,76 +175,92 @@ static void modify_decoding(const struct pci_dev *pdev, uint16_t cmd,
>>   
>>   bool vpci_process_pending(struct vcpu *v)
>>   {
>> -    const struct pci_dev *pdev = v->vpci.pdev;
>> -    struct vpci_header *header = NULL;
>> -    unsigned int i;
>> -
>> -    if ( !pdev )
>> -        return false;
>> -
>> -    read_lock(&v->domain->pci_lock);
>> -
>> -    if ( !pdev->vpci || (v->domain != pdev->domain) )
>> +    switch ( v->vpci.task )
>>       {
>> -        v->vpci.pdev = NULL;
>> -        read_unlock(&v->domain->pci_lock);
>> -        return false;
>> -    }
>> -
>> -    header = &pdev->vpci->header;
>> -    for ( i = 0; i < ARRAY_SIZE(header->bars); i++ )
>> +    case MODIFY_MEMORY:
>>       {
>> -        struct vpci_bar *bar = &header->bars[i];
>> -        struct rangeset *mem = v->vpci.bar_mem[i];
>> -        struct map_data data = {
>> -            .d = v->domain,
>> -            .map = v->vpci.cmd & PCI_COMMAND_MEMORY,
>> -            .bar = bar,
>> -        };
>> -        int rc;
>> +        const struct pci_dev *pdev = v->vpci.memory.pdev;
>> +        struct vpci_header *header = NULL;
>> +        unsigned int i;
>>   
>> -        if ( rangeset_is_empty(mem) )
>> -            continue;
>> +        if ( !pdev )
>> +            break;
>>   
>> -        rc = rangeset_consume_ranges(mem, map_range, &data);
>> +        read_lock(&v->domain->pci_lock);
>>   
>> -        if ( rc == -ERESTART )
>> +        if ( !pdev->vpci || (v->domain != pdev->domain) )
>>           {
>> +            v->vpci.memory.pdev = NULL;
>>               read_unlock(&v->domain->pci_lock);
>> -            return true;
>> +            break;
>>           }
>>   
>> -        if ( rc )
>> +        header = &pdev->vpci->header;
>> +        for ( i = 0; i < ARRAY_SIZE(header->bars); i++ )
>>           {
>> -            spin_lock(&pdev->vpci->lock);
>> -            /* Disable memory decoding unconditionally on failure. */
>> -            modify_decoding(pdev, v->vpci.cmd & ~PCI_COMMAND_MEMORY,
>> -                            false);
>> -            spin_unlock(&pdev->vpci->lock);
>> +            struct vpci_bar *bar = &header->bars[i];
>> +            struct rangeset *mem = v->vpci.bar_mem[i];
>> +            struct map_data data = {
>> +                .d = v->domain,
>> +                .map = v->vpci.memory.cmd & PCI_COMMAND_MEMORY,
>> +                .bar = bar,
>> +            };
>> +            int rc;
>> +
>> +            if ( rangeset_is_empty(mem) )
>> +                continue;
>>   
>> -            /* Clean all the rangesets */
>> -            for ( i = 0; i < ARRAY_SIZE(header->bars); i++ )
>> -                if ( !rangeset_is_empty(v->vpci.bar_mem[i]) )
>> -                     rangeset_purge(v->vpci.bar_mem[i]);
>> +            rc = rangeset_consume_ranges(mem, map_range, &data);
>>   
>> -            v->vpci.pdev = NULL;
>> +            if ( rc == -ERESTART )
>> +            {
>> +                read_unlock(&v->domain->pci_lock);
>> +                return true;
>> +            }
>>   
>> -            read_unlock(&v->domain->pci_lock);
>> +            if ( rc )
>> +            {
>> +                spin_lock(&pdev->vpci->lock);
>> +                /* Disable memory decoding unconditionally on failure. */
>> +                modify_decoding(pdev, v->vpci.memory.cmd & ~PCI_COMMAND_MEMORY,
>> +                                false);
>> +                spin_unlock(&pdev->vpci->lock);
>> +
>> +                /* Clean all the rangesets */
>> +                for ( i = 0; i < ARRAY_SIZE(header->bars); i++ )
>> +                    if ( !rangeset_is_empty(v->vpci.bar_mem[i]) )
>> +                        rangeset_purge(v->vpci.bar_mem[i]);
>> +
>> +                v->vpci.memory.pdev = NULL;
>> +
>> +                read_unlock(&v->domain->pci_lock);
>>   
>> -            if ( !is_hardware_domain(v->domain) )
>> -                domain_crash(v->domain);
>> +                if ( !is_hardware_domain(v->domain) )
>> +                    domain_crash(v->domain);
>>   
>> -            return false;
>> +                break;
>> +            }
>>           }
>> -    }
>> -    v->vpci.pdev = NULL;
>> +        v->vpci.memory.pdev = NULL;
>>   
>> -    spin_lock(&pdev->vpci->lock);
>> -    modify_decoding(pdev, v->vpci.cmd, v->vpci.rom_only);
>> -    spin_unlock(&pdev->vpci->lock);
>> +        spin_lock(&pdev->vpci->lock);
>> +        modify_decoding(pdev, v->vpci.memory.cmd, v->vpci.memory.rom_only);
>> +        spin_unlock(&pdev->vpci->lock);
>>   
>> -    read_unlock(&v->domain->pci_lock);
>> +        read_unlock(&v->domain->pci_lock);
>> +
>> +        break;
>> +    }
>> +    case WAIT:
>> +        if ( NOW() < v->vpci.wait.end )
>> +            return true;
>> +        v->vpci.wait.callback(v->vpci.wait.data);
>> +        break;
> 
> As just indicated in reply to patch 6, busy waiting isn't really acceptable.
> This is even more so when the waiting exceeds the typical length of a
> scheduling timeslice.
> 
> In that other reply I said to put the vCPU to sleep, but you need to be careful
> there too: The domain may not expect its vCPU to not make any progress for such
> an extended period of time. This may need doing entirely differently: Once the
> command register was written, you may want to record the time after which
> accesses to the VF registers are permitted. Earlier accesses would simply be
> terminated. You may still additionally need a timer, in order to kick off BAR
> mapping after that time. (Yet better would be if the BAR mapping could be
> done during those 100ms. After all that may be a reason why this long a delay
> is specified: Firmware on the device may also require some time to set up the
> BARs accordingly.)
> 
> Jan

I am not sure it would work that way. If we look at how Linux 
initializes SR-IOV, it writes the VFE and MSE bits, waits 100ms, and then 
expects the VFs to be operational. If they are not operational at that 
point, it considers the operation failed and removes all VFs. If 
we also wait 100ms before enabling access, the probability of a guest 
trying to access something before we allow it would be very high.

So I think there is no way to add VFs in Xen without blocking the 
guest's vCPU in some way. We could revert to the old variant and rely 
on the physdev op to add VFs one by one as they are discovered by Dom0; 
then we would not need to wait explicitly.
@Roger are you okay with that?


Snippet from Linux:

static int sriov_enable(struct pci_dev *dev, int nr_virtfn)
{
	...
	pci_iov_set_numvfs(dev, nr_virtfn);
	iov->ctrl |= PCI_SRIOV_CTRL_VFE | PCI_SRIOV_CTRL_MSE;
	pci_cfg_access_lock(dev);
	pci_write_config_word(dev, iov->pos + PCI_SRIOV_CTRL, iov->ctrl);
	msleep(100);
	pci_cfg_access_unlock(dev);

	rc = sriov_add_vfs(dev, initial);
	if (rc)
		goto err_pcibios;
	...
}





-- 
Mykyta

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH v2 4/8] vpci: add a wait operation to the vpci vcpu pending actions
  2026-04-01  7:59     ` Mykyta Poturai
@ 2026-04-01  8:21       ` Jan Beulich
  2026-04-01 14:07         ` Mykyta Poturai
  0 siblings, 1 reply; 29+ messages in thread
From: Jan Beulich @ 2026-04-01  8:21 UTC (permalink / raw)
  To: Mykyta Poturai
  Cc: Roger Pau Monné, Stewart Hildebrand,
	xen-devel@lists.xenproject.org

On 01.04.2026 09:59, Mykyta Poturai wrote:
> On 3/31/26 17:55, Jan Beulich wrote:
>> On 09.03.2026 12:08, Mykyta Poturai wrote:
>>> --- a/xen/drivers/vpci/header.c
>>> +++ b/xen/drivers/vpci/header.c
>>> @@ -175,76 +175,92 @@ static void modify_decoding(const struct pci_dev *pdev, uint16_t cmd,
>>>   
>>>   bool vpci_process_pending(struct vcpu *v)
>>>   {
>>> -    const struct pci_dev *pdev = v->vpci.pdev;
>>> -    struct vpci_header *header = NULL;
>>> -    unsigned int i;
>>> -
>>> -    if ( !pdev )
>>> -        return false;
>>> -
>>> -    read_lock(&v->domain->pci_lock);
>>> -
>>> -    if ( !pdev->vpci || (v->domain != pdev->domain) )
>>> +    switch ( v->vpci.task )
>>>       {
>>> -        v->vpci.pdev = NULL;
>>> -        read_unlock(&v->domain->pci_lock);
>>> -        return false;
>>> -    }
>>> -
>>> -    header = &pdev->vpci->header;
>>> -    for ( i = 0; i < ARRAY_SIZE(header->bars); i++ )
>>> +    case MODIFY_MEMORY:
>>>       {
>>> -        struct vpci_bar *bar = &header->bars[i];
>>> -        struct rangeset *mem = v->vpci.bar_mem[i];
>>> -        struct map_data data = {
>>> -            .d = v->domain,
>>> -            .map = v->vpci.cmd & PCI_COMMAND_MEMORY,
>>> -            .bar = bar,
>>> -        };
>>> -        int rc;
>>> +        const struct pci_dev *pdev = v->vpci.memory.pdev;
>>> +        struct vpci_header *header = NULL;
>>> +        unsigned int i;
>>>   
>>> -        if ( rangeset_is_empty(mem) )
>>> -            continue;
>>> +        if ( !pdev )
>>> +            break;
>>>   
>>> -        rc = rangeset_consume_ranges(mem, map_range, &data);
>>> +        read_lock(&v->domain->pci_lock);
>>>   
>>> -        if ( rc == -ERESTART )
>>> +        if ( !pdev->vpci || (v->domain != pdev->domain) )
>>>           {
>>> +            v->vpci.memory.pdev = NULL;
>>>               read_unlock(&v->domain->pci_lock);
>>> -            return true;
>>> +            break;
>>>           }
>>>   
>>> -        if ( rc )
>>> +        header = &pdev->vpci->header;
>>> +        for ( i = 0; i < ARRAY_SIZE(header->bars); i++ )
>>>           {
>>> -            spin_lock(&pdev->vpci->lock);
>>> -            /* Disable memory decoding unconditionally on failure. */
>>> -            modify_decoding(pdev, v->vpci.cmd & ~PCI_COMMAND_MEMORY,
>>> -                            false);
>>> -            spin_unlock(&pdev->vpci->lock);
>>> +            struct vpci_bar *bar = &header->bars[i];
>>> +            struct rangeset *mem = v->vpci.bar_mem[i];
>>> +            struct map_data data = {
>>> +                .d = v->domain,
>>> +                .map = v->vpci.memory.cmd & PCI_COMMAND_MEMORY,
>>> +                .bar = bar,
>>> +            };
>>> +            int rc;
>>> +
>>> +            if ( rangeset_is_empty(mem) )
>>> +                continue;
>>>   
>>> -            /* Clean all the rangesets */
>>> -            for ( i = 0; i < ARRAY_SIZE(header->bars); i++ )
>>> -                if ( !rangeset_is_empty(v->vpci.bar_mem[i]) )
>>> -                     rangeset_purge(v->vpci.bar_mem[i]);
>>> +            rc = rangeset_consume_ranges(mem, map_range, &data);
>>>   
>>> -            v->vpci.pdev = NULL;
>>> +            if ( rc == -ERESTART )
>>> +            {
>>> +                read_unlock(&v->domain->pci_lock);
>>> +                return true;
>>> +            }
>>>   
>>> -            read_unlock(&v->domain->pci_lock);
>>> +            if ( rc )
>>> +            {
>>> +                spin_lock(&pdev->vpci->lock);
>>> +                /* Disable memory decoding unconditionally on failure. */
>>> +                modify_decoding(pdev, v->vpci.memory.cmd & ~PCI_COMMAND_MEMORY,
>>> +                                false);
>>> +                spin_unlock(&pdev->vpci->lock);
>>> +
>>> +                /* Clean all the rangesets */
>>> +                for ( i = 0; i < ARRAY_SIZE(header->bars); i++ )
>>> +                    if ( !rangeset_is_empty(v->vpci.bar_mem[i]) )
>>> +                        rangeset_purge(v->vpci.bar_mem[i]);
>>> +
>>> +                v->vpci.memory.pdev = NULL;
>>> +
>>> +                read_unlock(&v->domain->pci_lock);
>>>   
>>> -            if ( !is_hardware_domain(v->domain) )
>>> -                domain_crash(v->domain);
>>> +                if ( !is_hardware_domain(v->domain) )
>>> +                    domain_crash(v->domain);
>>>   
>>> -            return false;
>>> +                break;
>>> +            }
>>>           }
>>> -    }
>>> -    v->vpci.pdev = NULL;
>>> +        v->vpci.memory.pdev = NULL;
>>>   
>>> -    spin_lock(&pdev->vpci->lock);
>>> -    modify_decoding(pdev, v->vpci.cmd, v->vpci.rom_only);
>>> -    spin_unlock(&pdev->vpci->lock);
>>> +        spin_lock(&pdev->vpci->lock);
>>> +        modify_decoding(pdev, v->vpci.memory.cmd, v->vpci.memory.rom_only);
>>> +        spin_unlock(&pdev->vpci->lock);
>>>   
>>> -    read_unlock(&v->domain->pci_lock);
>>> +        read_unlock(&v->domain->pci_lock);
>>> +
>>> +        break;
>>> +    }
>>> +    case WAIT:
>>> +        if ( NOW() < v->vpci.wait.end )
>>> +            return true;
>>> +        v->vpci.wait.callback(v->vpci.wait.data);
>>> +        break;
>>
>> As just indicated in reply to patch 6, busy waiting isn't really acceptable.
>> This is even more so when the waiting exceeds the typical length of a
>> scheduling timeslice.
>>
>> In that other reply I said to put the vCPU to sleep, but you need to be careful
>> there too: The domain may not expect its vCPU to not make any progress for such
>> an extended period of time. This may need doing entirely differently: Once the
>> command register was written, you may want to record the time after which
>> accesses to the VF registers are permitted. Earlier accesses would simply be
>> terminated. You may still additionally need a timer, in order to kick off BAR
>> mapping after that time. (Yet better would  be if the BAR mapping could be
>> done during those 100ms. After all that may be a reason why this long a delay
>> is specified: Firmware on the device may also require some time to set up the
>> BARs accordingly.)
> 
> I am not sure it would work that way. If we look at how linux 
> initialized sriov, it writes VFE and MSE bits, waits 100ms and then 
> expects VFs to be operational. If they are not operational at that 
> moment, then it considers the operation failed and removes all VFs. If 
> we also wait 100ms before enabling access, the probability of a guest 
> trying to access something before we allow it would be very high.

Well, not really. Our counting of the 100ms necessarily starts before Dom0's.
Furthermore it may be acceptable (or even appropriate) to stall premature
accesses (because they shouldn't occur in the first place), by blocking the
vCPU at that point. A middle route may be possible: Terminate accesses in,
say, the first 90ms, and stall the vCPU for any access past that, but before
the 100ms expired.

Jan
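
[Editor's note: the "middle route" above can be sketched as a standalone
three-way policy. This is an illustrative model, not Xen code: the
`vf_access_action()` helper, the `MS()` macro, and plain nanosecond
timestamps standing in for Xen's NOW() are all assumptions, and the
90ms/100ms thresholds are just the example figures from the reply.]

```c
#include <stdint.h>

/*
 * Hypothetical policy for guest accesses to VF registers after the
 * VFE/MSE bits were written at time 'enable'.  All times are in
 * nanoseconds.  Following the example above: terminate accesses in the
 * first 90ms (they shouldn't occur at all), stall the vCPU for accesses
 * between 90ms and 100ms (the remaining wait is short), and allow
 * accesses once the full 100ms VF-enable delay has elapsed.
 */

#define MS(x) ((uint64_t)(x) * 1000000ULL)

enum vf_action { VF_TERMINATE, VF_STALL, VF_ALLOW };

static enum vf_action vf_access_action(uint64_t enable, uint64_t now)
{
    uint64_t elapsed = now - enable;

    if ( elapsed >= MS(100) )
        return VF_ALLOW;      /* Delay expired: handle the access. */
    if ( elapsed >= MS(90) )
        return VF_STALL;      /* Near the deadline: block the vCPU briefly. */
    return VF_TERMINATE;      /* Premature access: terminate it. */
}
```

A timer armed for enable + 100ms would presumably still be needed to kick
off BAR mapping even if the guest never touches the VF in the meantime.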



* Re: [PATCH v2 4/8] vpci: add a wait operation to the vpci vcpu pending actions
  2026-04-01  8:21       ` Jan Beulich
@ 2026-04-01 14:07         ` Mykyta Poturai
  2026-04-01 14:14           ` Jan Beulich
  0 siblings, 1 reply; 29+ messages in thread
From: Mykyta Poturai @ 2026-04-01 14:07 UTC (permalink / raw)
  To: Jan Beulich
  Cc: Roger Pau Monné, Stewart Hildebrand,
	xen-devel@lists.xenproject.org



On 4/1/26 11:21, Jan Beulich wrote:
> On 01.04.2026 09:59, Mykyta Poturai wrote:
>> On 3/31/26 17:55, Jan Beulich wrote:
>>> On 09.03.2026 12:08, Mykyta Poturai wrote:
>>>> [...]
>>>
>>> As just indicated in reply to patch 6, busy waiting isn't really acceptable.
>>> This is even more so when the waiting exceeds the typical length of a
>>> scheduling timeslice.
>>>
>>> In that other reply I said to put the vCPU to sleep, but you need to be careful
>>> there too: The domain may not expect its vCPU to not make any progress for such
>>> an extended period of time. This may need doing entirely differently: Once the
>>> command register was written, you may want to record the time after which
>>> accesses to the VF registers are permitted. Earlier accesses would simply be
>>> terminated. You may still additionally need a timer, in order to kick off BAR
>>> mapping after that time. (Yet better would  be if the BAR mapping could be
>>> done during those 100ms. After all that may be a reason why this long a delay
>>> is specified: Firmware on the device may also require some time to set up the
>>> BARs accordingly.)
>>
>> I am not sure it would work that way. If we look at how linux
>> initialized sriov, it writes VFE and MSE bits, waits 100ms and then
>> expects VFs to be operational. If they are not operational at that
>> moment, then it considers the operation failed and removes all VFs. If
>> we also wait 100ms before enabling access, the probability of a guest
>> trying to access something before we allow it would be very high.
> 
> Well, not really. Our counting of the 100ms necessarily starts before Dom0's.
> Furthermore it may be acceptable (or even appropriate) to stall premature
> accesses (because they shouldn't occur in the first place), by blocking the
> vCPU at that point. A middle route may be possible: Terminate accesses in,
> say, the first 90ms, and stall the vCPU for any access past that, but before
> the 100ms expired.
> 

Is there any real benefit to doing all this work instead of just waiting 
for Dom0 to register the VFs? Implementing it the way you described 
would require a relatively complex state machine and two timers per 
SR-IOV-capable device, and would probably also require some hacks to 
handle partially initialized VFs in Xen. That adds a lot of work and 
many possible bugs for not a lot of benefit, in my opinion.

-- 
Mykyta


* Re: [PATCH v2 4/8] vpci: add a wait operation to the vpci vcpu pending actions
  2026-04-01 14:07         ` Mykyta Poturai
@ 2026-04-01 14:14           ` Jan Beulich
  2026-04-01 14:40             ` Mykyta Poturai
  0 siblings, 1 reply; 29+ messages in thread
From: Jan Beulich @ 2026-04-01 14:14 UTC (permalink / raw)
  To: Mykyta Poturai
  Cc: Roger Pau Monné, Stewart Hildebrand,
	xen-devel@lists.xenproject.org

On 01.04.2026 16:07, Mykyta Poturai wrote:
> On 4/1/26 11:21, Jan Beulich wrote:
>> On 01.04.2026 09:59, Mykyta Poturai wrote:
>>> On 3/31/26 17:55, Jan Beulich wrote:
>>>> On 09.03.2026 12:08, Mykyta Poturai wrote:
>>>>> [...]
>>>>
>>>> As just indicated in reply to patch 6, busy waiting isn't really acceptable.
>>>> This is even more so when the waiting exceeds the typical length of a
>>>> scheduling timeslice.
>>>>
>>>> In that other reply I said to put the vCPU to sleep, but you need to be careful
>>>> there too: The domain may not expect its vCPU to not make any progress for such
>>>> an extended period of time. This may need doing entirely differently: Once the
>>>> command register was written, you may want to record the time after which
>>>> accesses to the VF registers are permitted. Earlier accesses would simply be
>>>> terminated. You may still additionally need a timer, in order to kick off BAR
>>>> mapping after that time. (Yet better would  be if the BAR mapping could be
>>>> done during those 100ms. After all that may be a reason why this long a delay
>>>> is specified: Firmware on the device may also require some time to set up the
>>>> BARs accordingly.)
>>>
>>> I am not sure it would work that way. If we look at how linux
>>> initialized sriov, it writes VFE and MSE bits, waits 100ms and then
>>> expects VFs to be operational. If they are not operational at that
>>> moment, then it considers the operation failed and removes all VFs. If
>>> we also wait 100ms before enabling access, the probability of a guest
>>> trying to access something before we allow it would be very high.
>>
>> Well, not really. Our counting of the 100ms necessarily starts before Dom0's.
>> Furthermore it may be acceptable (or even appropriate) to stall premature
>> accesses (because they shouldn't occur in the first place), by blocking the
>> vCPU at that point. A middle route may be possible: Terminate accesses in,
>> say, the first 90ms, and stall the vCPU for any access past that, but before
>> the 100ms expired.
> 
> Is there any real benefit to doing all this work instead of just waiting 
>> for Dom0 to register the VFs? Implementing it the way you described 
> would require a relatively complex state machine and two timers per 
> sriov-capable device. And will also probably require some hacks to 
> handle partially initialized VFs in Xen. This adds a lot of work and 
> many possible bugs for not a lot of benefit in my opinion.

Odd that you ask me this question. If there was no benefit, why did you do
it this way?

Jan



* Re: [PATCH v2 4/8] vpci: add a wait operation to the vpci vcpu pending actions
  2026-04-01 14:14           ` Jan Beulich
@ 2026-04-01 14:40             ` Mykyta Poturai
  2026-04-01 14:44               ` Jan Beulich
  0 siblings, 1 reply; 29+ messages in thread
From: Mykyta Poturai @ 2026-04-01 14:40 UTC (permalink / raw)
  To: Jan Beulich
  Cc: Roger Pau Monné, Stewart Hildebrand,
	xen-devel@lists.xenproject.org



On 4/1/26 17:14, Jan Beulich wrote:
> On 01.04.2026 16:07, Mykyta Poturai wrote:
>> On 4/1/26 11:21, Jan Beulich wrote:
>>> On 01.04.2026 09:59, Mykyta Poturai wrote:
>>>> On 3/31/26 17:55, Jan Beulich wrote:
>>>>> On 09.03.2026 12:08, Mykyta Poturai wrote:
>>>>>> [...]
>>>>>
>>>>> As just indicated in reply to patch 6, busy waiting isn't really acceptable.
>>>>> This is even more so when the waiting exceeds the typical length of a
>>>>> scheduling timeslice.
>>>>>
>>>>> In that other reply I said to put the vCPU to sleep, but you need to be careful
>>>>> there too: The domain may not expect its vCPU to not make any progress for such
>>>>> an extended period of time. This may need doing entirely differently: Once the
>>>>> command register was written, you may want to record the time after which
>>>>> accesses to the VF registers are permitted. Earlier accesses would simply be
>>>>> terminated. You may still additionally need a timer, in order to kick off BAR
>>>>> mapping after that time. (Yet better would be if the BAR mapping could be
>>>>> done during those 100ms. After all that may be a reason why this long a delay
>>>>> is specified: Firmware on the device may also require some time to set up the
>>>>> BARs accordingly.)
>>>>
>>>> I am not sure it would work that way. If we look at how Linux
>>>> initializes SR-IOV, it writes the VFE and MSE bits, waits 100ms and then
>>>> expects VFs to be operational. If they are not operational at that
>>>> moment, then it considers the operation failed and removes all VFs. If
>>>> we also wait 100ms before enabling access, the probability of a guest
>>>> trying to access something before we allow it would be very high.
>>>
>>> Well, not really. Our counting of the 100ms necessarily starts before Dom0's.
>>> Furthermore it may be acceptable (or even appropriate) to stall premature
>>> accesses (because they shouldn't occur in the first place), by blocking the
>>> vCPU at that point. A middle route may be possible: Terminate accesses in,
>>> say, the first 90ms, and stall the vCPU for any access past that, but before
>>> the 100ms expired.
>>
>> Is there any real benefit to doing all this work instead of just waiting
>> for Dom0 to register the VFs? Implementing it the way you described
>> would require a relatively complex state machine and two timers per
>> SR-IOV-capable device. It will also probably require some hacks to
>> handle partially initialized VFs in Xen. This adds a lot of work and
>> many possible bugs for not a lot of benefit in my opinion.
> 
> Odd that you ask me this question. If there was no benefit, why did you do
> it this way?
> 

Roger asked for this approach in V1, and I saw that it could be done in a 
relatively straightforward way, so I implemented it. I didn't exactly 
get what the benefits were, but I assumed that there were some and that 
the effort was small enough to just do it if the maintainers were asking 
for it.

Now, with the possibility of redoing everything *again* and making it much 
more complex, I started to really wonder whether it's actually needed. So 
I want to know your and others' opinions on registering VFs with Dom0.


-- 
Mykyta
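
The timer-based scheme Jan describes above (record a deadline when the SR-IOV control register is written, terminate premature VF accesses, and let a one-shot timer kick off BAR mapping) might be sketched roughly as below. All names here (`sriov_state`, `vf_access_after`, `sriov_control_write`) are illustrative and not taken from the series:

```c
/*
 * Illustrative sketch only.  Instead of a vCPU busy-waiting on
 * NOW() < end, the SR-IOV control write records a deadline; VF
 * accesses before that deadline are simply terminated, and a
 * one-shot timer (not shown) maps the VF BARs once it passes.
 */
#include <stdbool.h>
#include <stdint.h>

#define MS_TO_NS(ms) ((uint64_t)(ms) * 1000000ULL)

struct sriov_state {
    uint64_t vf_access_after;   /* earliest time VF accesses are legal */
    bool     bars_mapped;       /* set by the timer callback */
};

/* Dom0 wrote VF Enable + VF MSE: start the 100ms clock. */
static void sriov_control_write(struct sriov_state *s, uint64_t now)
{
    s->vf_access_after = now + MS_TO_NS(100);
    s->bars_mapped = false;
    /* a set_timer()-style call armed for vf_access_after would go here */
}

/* VF config access: has the wait expired and the timer callback run? */
static bool vf_access_permitted(const struct sriov_state *s, uint64_t now)
{
    return s->bars_mapped && now >= s->vf_access_after;
}
```

A real implementation would hook `vf_access_permitted()` into the VF config-space handlers and drop (or, per the "middle route" above, stall) accesses that arrive before the deadline.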

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH v2 4/8] vpci: add a wait operation to the vpci vcpu pending actions
  2026-04-01 14:40             ` Mykyta Poturai
@ 2026-04-01 14:44               ` Jan Beulich
  0 siblings, 0 replies; 29+ messages in thread
From: Jan Beulich @ 2026-04-01 14:44 UTC (permalink / raw)
  To: Mykyta Poturai
  Cc: Roger Pau Monné, Stewart Hildebrand,
	xen-devel@lists.xenproject.org

On 01.04.2026 16:40, Mykyta Poturai wrote:
> 
> 
> On 4/1/26 17:14, Jan Beulich wrote:
>> On 01.04.2026 16:07, Mykyta Poturai wrote:
>>> On 4/1/26 11:21, Jan Beulich wrote:
>>>> On 01.04.2026 09:59, Mykyta Poturai wrote:
>>>>> On 3/31/26 17:55, Jan Beulich wrote:
>>>>>> On 09.03.2026 12:08, Mykyta Poturai wrote:
>>>>>>> --- a/xen/drivers/vpci/header.c
>>>>>>> +++ b/xen/drivers/vpci/header.c
>>>>>>> @@ -175,76 +175,92 @@ static void modify_decoding(const struct pci_dev *pdev, uint16_t cmd,
>>>>>>>     
>>>>>>>     bool vpci_process_pending(struct vcpu *v)
>>>>>>>     {
>>>>>>> -    const struct pci_dev *pdev = v->vpci.pdev;
>>>>>>> -    struct vpci_header *header = NULL;
>>>>>>> -    unsigned int i;
>>>>>>> -
>>>>>>> -    if ( !pdev )
>>>>>>> -        return false;
>>>>>>> -
>>>>>>> -    read_lock(&v->domain->pci_lock);
>>>>>>> -
>>>>>>> -    if ( !pdev->vpci || (v->domain != pdev->domain) )
>>>>>>> +    switch ( v->vpci.task )
>>>>>>>         {
>>>>>>> -        v->vpci.pdev = NULL;
>>>>>>> -        read_unlock(&v->domain->pci_lock);
>>>>>>> -        return false;
>>>>>>> -    }
>>>>>>> -
>>>>>>> -    header = &pdev->vpci->header;
>>>>>>> -    for ( i = 0; i < ARRAY_SIZE(header->bars); i++ )
>>>>>>> +    case MODIFY_MEMORY:
>>>>>>>         {
>>>>>>> -        struct vpci_bar *bar = &header->bars[i];
>>>>>>> -        struct rangeset *mem = v->vpci.bar_mem[i];
>>>>>>> -        struct map_data data = {
>>>>>>> -            .d = v->domain,
>>>>>>> -            .map = v->vpci.cmd & PCI_COMMAND_MEMORY,
>>>>>>> -            .bar = bar,
>>>>>>> -        };
>>>>>>> -        int rc;
>>>>>>> +        const struct pci_dev *pdev = v->vpci.memory.pdev;
>>>>>>> +        struct vpci_header *header = NULL;
>>>>>>> +        unsigned int i;
>>>>>>>     
>>>>>>> -        if ( rangeset_is_empty(mem) )
>>>>>>> -            continue;
>>>>>>> +        if ( !pdev )
>>>>>>> +            break;
>>>>>>>     
>>>>>>> -        rc = rangeset_consume_ranges(mem, map_range, &data);
>>>>>>> +        read_lock(&v->domain->pci_lock);
>>>>>>>     
>>>>>>> -        if ( rc == -ERESTART )
>>>>>>> +        if ( !pdev->vpci || (v->domain != pdev->domain) )
>>>>>>>             {
>>>>>>> +            v->vpci.memory.pdev = NULL;
>>>>>>>                 read_unlock(&v->domain->pci_lock);
>>>>>>> -            return true;
>>>>>>> +            break;
>>>>>>>             }
>>>>>>>     
>>>>>>> -        if ( rc )
>>>>>>> +        header = &pdev->vpci->header;
>>>>>>> +        for ( i = 0; i < ARRAY_SIZE(header->bars); i++ )
>>>>>>>             {
>>>>>>> -            spin_lock(&pdev->vpci->lock);
>>>>>>> -            /* Disable memory decoding unconditionally on failure. */
>>>>>>> -            modify_decoding(pdev, v->vpci.cmd & ~PCI_COMMAND_MEMORY,
>>>>>>> -                            false);
>>>>>>> -            spin_unlock(&pdev->vpci->lock);
>>>>>>> +            struct vpci_bar *bar = &header->bars[i];
>>>>>>> +            struct rangeset *mem = v->vpci.bar_mem[i];
>>>>>>> +            struct map_data data = {
>>>>>>> +                .d = v->domain,
>>>>>>> +                .map = v->vpci.memory.cmd & PCI_COMMAND_MEMORY,
>>>>>>> +                .bar = bar,
>>>>>>> +            };
>>>>>>> +            int rc;
>>>>>>> +
>>>>>>> +            if ( rangeset_is_empty(mem) )
>>>>>>> +                continue;
>>>>>>>     
>>>>>>> -            /* Clean all the rangesets */
>>>>>>> -            for ( i = 0; i < ARRAY_SIZE(header->bars); i++ )
>>>>>>> -                if ( !rangeset_is_empty(v->vpci.bar_mem[i]) )
>>>>>>> -                     rangeset_purge(v->vpci.bar_mem[i]);
>>>>>>> +            rc = rangeset_consume_ranges(mem, map_range, &data);
>>>>>>>     
>>>>>>> -            v->vpci.pdev = NULL;
>>>>>>> +            if ( rc == -ERESTART )
>>>>>>> +            {
>>>>>>> +                read_unlock(&v->domain->pci_lock);
>>>>>>> +                return true;
>>>>>>> +            }
>>>>>>>     
>>>>>>> -            read_unlock(&v->domain->pci_lock);
>>>>>>> +            if ( rc )
>>>>>>> +            {
>>>>>>> +                spin_lock(&pdev->vpci->lock);
>>>>>>> +                /* Disable memory decoding unconditionally on failure. */
>>>>>>> +                modify_decoding(pdev, v->vpci.memory.cmd & ~PCI_COMMAND_MEMORY,
>>>>>>> +                                false);
>>>>>>> +                spin_unlock(&pdev->vpci->lock);
>>>>>>> +
>>>>>>> +                /* Clean all the rangesets */
>>>>>>> +                for ( i = 0; i < ARRAY_SIZE(header->bars); i++ )
>>>>>>> +                    if ( !rangeset_is_empty(v->vpci.bar_mem[i]) )
>>>>>>> +                        rangeset_purge(v->vpci.bar_mem[i]);
>>>>>>> +
>>>>>>> +                v->vpci.memory.pdev = NULL;
>>>>>>> +
>>>>>>> +                read_unlock(&v->domain->pci_lock);
>>>>>>>     
>>>>>>> -            if ( !is_hardware_domain(v->domain) )
>>>>>>> -                domain_crash(v->domain);
>>>>>>> +                if ( !is_hardware_domain(v->domain) )
>>>>>>> +                    domain_crash(v->domain);
>>>>>>>     
>>>>>>> -            return false;
>>>>>>> +                break;
>>>>>>> +            }
>>>>>>>             }
>>>>>>> -    }
>>>>>>> -    v->vpci.pdev = NULL;
>>>>>>> +        v->vpci.memory.pdev = NULL;
>>>>>>>     
>>>>>>> -    spin_lock(&pdev->vpci->lock);
>>>>>>> -    modify_decoding(pdev, v->vpci.cmd, v->vpci.rom_only);
>>>>>>> -    spin_unlock(&pdev->vpci->lock);
>>>>>>> +        spin_lock(&pdev->vpci->lock);
>>>>>>> +        modify_decoding(pdev, v->vpci.memory.cmd, v->vpci.memory.rom_only);
>>>>>>> +        spin_unlock(&pdev->vpci->lock);
>>>>>>>     
>>>>>>> -    read_unlock(&v->domain->pci_lock);
>>>>>>> +        read_unlock(&v->domain->pci_lock);
>>>>>>> +
>>>>>>> +        break;
>>>>>>> +    }
>>>>>>> +    case WAIT:
>>>>>>> +        if ( NOW() < v->vpci.wait.end )
>>>>>>> +            return true;
>>>>>>> +        v->vpci.wait.callback(v->vpci.wait.data);
>>>>>>> +        break;
>>>>>>
>>>>>> As just indicated in reply to patch 6, busy waiting isn't really acceptable.
>>>>>> This is even more so when the waiting exceeds the typical length of a
>>>>>> scheduling timeslice.
>>>>>>
>>>>>> In that other reply I said to put the vCPU to sleep, but you need to be careful
>>>>>> there too: The domain may not expect its vCPU to not make any progress for such
>>>>>> an extended period of time. This may need doing entirely differently: Once the
>>>>>> command register was written, you may want to record the time after which
>>>>>> accesses to the VF registers are permitted. Earlier accesses would simply be
>>>>>> terminated. You may still additionally need a timer, in order to kick off BAR
>>>>>> mapping after that time. (Yet better would be if the BAR mapping could be
>>>>>> done during those 100ms. After all that may be a reason why this long a delay
>>>>>> is specified: Firmware on the device may also require some time to set up the
>>>>>> BARs accordingly.)
>>>>>
>>>>> I am not sure it would work that way. If we look at how Linux
>>>>> initializes SR-IOV, it writes the VFE and MSE bits, waits 100ms and then
>>>>> expects VFs to be operational. If they are not operational at that
>>>>> moment, then it considers the operation failed and removes all VFs. If
>>>>> we also wait 100ms before enabling access, the probability of a guest
>>>>> trying to access something before we allow it would be very high.
>>>>
>>>> Well, not really. Our counting of the 100ms necessarily starts before Dom0's.
>>>> Furthermore it may be acceptable (or even appropriate) to stall premature
>>>> accesses (because they shouldn't occur in the first place), by blocking the
>>>> vCPU at that point. A middle route may be possible: Terminate accesses in,
>>>> say, the first 90ms, and stall the vCPU for any access past that, but before
>>>> the 100ms expired.
>>>
>>> Is there any real benefit to doing all this work instead of just waiting
>>> for Dom0 to register the VFs? Implementing it the way you described
>>> would require a relatively complex state machine and two timers per
>>> SR-IOV-capable device. It will also probably require some hacks to
>>> handle partially initialized VFs in Xen. This adds a lot of work and
>>> many possible bugs for not a lot of benefit in my opinion.
>>
>> Odd that you ask me this question. If there was no benefit, why did you do
>> it this way?
> 
> Roger asked for this approach in V1, and I saw that it could be done in a 
> relatively straightforward way, so I implemented it. I didn't exactly 
> get what the benefits were, but I assumed that there were some and that 
> the effort was small enough to just do it if the maintainers were asking 
> for it.
> 
> Now, with the possibility of redoing everything *again* and making it much 
> more complex, I started to really wonder whether it's actually needed. So 
> I want to know your and others' opinions on registering VFs with Dom0.

Well, before I voice any view here I'd need to understand why Roger wanted
it done like that. (Apparently you must have agreed sufficiently to not
ask back.)

Jan


^ permalink raw reply	[flat|nested] 29+ messages in thread

end of thread, other threads:[~2026-04-01 14:59 UTC | newest]

Thread overview: 29+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2026-03-09 11:08 [PATCH v2 0/8] Implement SR-IOV support for PVH Mykyta Poturai
2026-03-09 11:08 ` [PATCH v2 1/8] vpci: rename and export vpci_modify_bars Mykyta Poturai
2026-03-16 21:03   ` Stewart Hildebrand
2026-03-09 11:08 ` [PATCH v2 3/8] vpci: Use pervcpu ranges for BAR mapping Mykyta Poturai
2026-03-16 21:36   ` Stewart Hildebrand
2026-03-17  0:36   ` Stewart Hildebrand
2026-03-31  9:59   ` Jan Beulich
2026-03-31 11:56     ` Andrew Cooper
2026-03-09 11:08 ` [PATCH v2 2/8] vpci: rename and export vpci_guest_mem_bar_{read,write} Mykyta Poturai
2026-03-16 21:04   ` Stewart Hildebrand
2026-03-09 11:08 ` [PATCH v2 4/8] vpci: add a wait operation to the vpci vcpu pending actions Mykyta Poturai
2026-03-17 14:02   ` Stewart Hildebrand
2026-03-31  9:53   ` Jan Beulich
2026-03-31 14:55   ` Jan Beulich
2026-04-01  7:59     ` Mykyta Poturai
2026-04-01  8:21       ` Jan Beulich
2026-04-01 14:07         ` Mykyta Poturai
2026-04-01 14:14           ` Jan Beulich
2026-04-01 14:40             ` Mykyta Poturai
2026-04-01 14:44               ` Jan Beulich
2026-03-09 11:08 ` [PATCH v2 6/8] vpci: add SR-IOV support for PVH Dom0 Mykyta Poturai
2026-03-19 21:11   ` Stewart Hildebrand
2026-03-23  9:46     ` Mykyta Poturai
2026-03-31 14:44   ` Jan Beulich
2026-03-09 11:08 ` [PATCH v2 5/8] pci/iommu: Check that IOMMU supports removing devices Mykyta Poturai
2026-03-31 14:28   ` Jan Beulich
2026-03-09 11:08 ` [PATCH v2 7/8] vpci: add SR-IOV support for DomUs Mykyta Poturai
2026-03-31 15:02   ` Jan Beulich
2026-03-09 11:08 ` [PATCH v2 8/8] docs: Update SR-IOV support status Mykyta Poturai
