All of lore.kernel.org
 help / color / mirror / Atom feed
* [PATCH v6 0/3] xen: SR-IOV fixes
@ 2024-10-18 20:39 Stewart Hildebrand
  2024-10-18 20:39 ` [PATCH v6 1/3] x86/msi: harden stale pdev handling Stewart Hildebrand
                   ` (2 more replies)
  0 siblings, 3 replies; 30+ messages in thread
From: Stewart Hildebrand @ 2024-10-18 20:39 UTC (permalink / raw)
  To: xen-devel
  Cc: Stewart Hildebrand, Jan Beulich, Andrew Cooper,
	Roger Pau Monné, Julien Grall, Stefano Stabellini,
	Teddy Astie

A fix for handling of stale pdevs, and a fix for a regressiong related
to a locking change.

Stewart Hildebrand (3):
  x86/msi: harden stale pdev handling
  xen/pci: introduce PF<->VF links
  x86/msi: fix locking for SR-IOV devices

 xen/arch/x86/msi.c            | 56 +++++++++++++----------
 xen/drivers/passthrough/msi.c |  3 ++
 xen/drivers/passthrough/pci.c | 84 ++++++++++++++++++++++++++++-------
 xen/drivers/vpci/msi.c        |  2 +-
 xen/drivers/vpci/msix.c       |  2 +-
 xen/include/xen/pci.h         | 20 ++++++++-
 6 files changed, 123 insertions(+), 44 deletions(-)


base-commit: 4c21b6affb4c5a3afe22cbc144362091de063366
-- 
2.47.0



^ permalink raw reply	[flat|nested] 30+ messages in thread

* [PATCH v6 1/3] x86/msi: harden stale pdev handling
  2024-10-18 20:39 [PATCH v6 0/3] xen: SR-IOV fixes Stewart Hildebrand
@ 2024-10-18 20:39 ` Stewart Hildebrand
  2024-10-28 16:58   ` Jan Beulich
  2024-10-18 20:39 ` [PATCH v6 2/3] xen/pci: introduce PF<->VF links Stewart Hildebrand
  2024-10-18 20:39 ` [PATCH v6 3/3] x86/msi: fix locking for SR-IOV devices Stewart Hildebrand
  2 siblings, 1 reply; 30+ messages in thread
From: Stewart Hildebrand @ 2024-10-18 20:39 UTC (permalink / raw)
  To: xen-devel
  Cc: Stewart Hildebrand, Jan Beulich, Andrew Cooper,
	Roger Pau Monné, Julien Grall, Stefano Stabellini

Dom0 normally informs Xen of PCI device removal via
PHYSDEVOP_pci_device_remove, e.g. in response to SR-IOV disable or
hot-unplug. We might find ourselves with stale pdevs if a buggy dom0
fails to report removal via PHYSDEVOP_pci_device_remove. In this case,
attempts to access the config space of the stale pdevs would be invalid
and return all 1s.

Some possible conditions leading to this are:

1. Dom0 disables SR-IOV without reporting VF removal to Xen.

The Linux SR-IOV subsystem normally reports VF removal when a PF driver
disables SR-IOV. In case of a buggy dom0 SR-IOV subsystem, SR-IOV could
become disabled with stale dangling VF pdevs in both dom0 Linux and Xen.

2. Dom0 reporting PF removal without reporting VF removal.

During SR-IOV PF removal (hot-unplug), a buggy PF driver may fail to
disable SR-IOV, thus failing to remove the VFs, leaving stale dangling
VFs behind in both Xen and Linux. At least Linux warns in this case:

[  100.000000]  0000:01:00.0: driver left SR-IOV enabled after remove

In either case, Xen is left with stale VF pdevs, risking invalid PCI
config space accesses.

When Xen is built with CONFIG_DEBUG=y, the following Xen crashes were
observed when dom0 attempted to access the config space of a stale VF:

(XEN) Assertion 'pos' failed at arch/x86/msi.c:1274
(XEN) ----[ Xen-4.20-unstable  x86_64  debug=y  Tainted:   C    ]----
...
(XEN) Xen call trace:
(XEN)    [<ffff82d040346834>] R pci_msi_conf_write_intercept+0xa2/0x1de
(XEN)    [<ffff82d04035d6b4>] F pci_conf_write_intercept+0x68/0x78
(XEN)    [<ffff82d0403264e5>] F arch/x86/pv/emul-priv-op.c#pci_cfg_ok+0xa0/0x114
(XEN)    [<ffff82d04032660e>] F arch/x86/pv/emul-priv-op.c#guest_io_write+0xb5/0x1c8
(XEN)    [<ffff82d0403267bb>] F arch/x86/pv/emul-priv-op.c#write_io+0x9a/0xe0
(XEN)    [<ffff82d04037c77a>] F x86_emulate+0x100e5/0x25f1e
(XEN)    [<ffff82d0403941a8>] F x86_emulate_wrapper+0x29/0x64
(XEN)    [<ffff82d04032802b>] F pv_emulate_privileged_op+0x12e/0x217
(XEN)    [<ffff82d040369f12>] F do_general_protection+0xc2/0x1b8
(XEN)    [<ffff82d040201aa7>] F x86_64/entry.S#handle_exception_saved+0x2b/0x8c

(XEN) Assertion 'pos' failed at arch/x86/msi.c:1246
(XEN) ----[ Xen-4.20-unstable  x86_64  debug=y  Tainted:   C    ]----
...
(XEN) Xen call trace:
(XEN)    [<ffff82d040346b0a>] R pci_reset_msix_state+0x47/0x50
(XEN)    [<ffff82d040287eec>] F pdev_msix_assign+0x19/0x35
(XEN)    [<ffff82d040286184>] F drivers/passthrough/pci.c#assign_device+0x181/0x471
(XEN)    [<ffff82d040287c36>] F iommu_do_pci_domctl+0x248/0x2ec
(XEN)    [<ffff82d040284e1f>] F iommu_do_domctl+0x26/0x44
(XEN)    [<ffff82d0402483b8>] F do_domctl+0x8c1/0x1660
(XEN)    [<ffff82d04032977e>] F pv_hypercall+0x5ce/0x6af
(XEN)    [<ffff82d0402012d3>] F lstar_enter+0x143/0x150

These ASSERTs triggered because the MSI-X capability position can't be
found for a stale pdev.

Latch the capability positions of MSI and MSI-X during device init, and
replace instances of pci_find_cap_offset(..., PCI_CAP_ID_MSI{,X}) with
the stored value. Introduce one additional ASSERT, while the two
existing ASSERTs in question continue to work as intended, even with a
stale pdev.

Fixes: 484d7c852e4f ("x86/MSI-X: track host and guest mask-all requests separately")
Fixes: 575e18d54d19 ("pci: clear {host/guest}_maskall field on assign")
Signed-off-by: Stewart Hildebrand <stewart.hildebrand@amd.com>
---
v5->v6;
* latch MSI/MSI-X capability position during device init

v4->v5:
* new patch, independent of the rest of the series
* new approach to fixing the issue: don't rely on dom0 to report any
  sort of device removal; rather, fix the condition directly

---
Instructions to reproduce
Requires Xen with CONFIG_DEBUG=y
Tested with Linux 6.11

1. Dom0 disables SR-IOV without reporting VF removal to Xen.

* Hack the Linux SR-IOV subsystem to remove the call to
  pci_stop_and_remove_bus_device() in
  drivers/pci/iov.c:pci_iov_remove_virtfn().

* Enable SR-IOV, then disable SR-IOV
  echo 1 > /sys/bus/pci/devices/0000\:01\:00.0/sriov_numvfs
  echo 0 > /sys/bus/pci/devices/0000\:01\:00.0/sriov_numvfs

* Now we have a stale VF. We can trigger the ASSERT either by unbinding
  the VF driver and issuing a reset...

  echo 0000\:01\:10.0 > /sys/bus/pci/devices/0000\:01\:10.0/driver/unbind
  echo 1 > /sys/bus/pci/devices/0000\:01\:10.0/reset

  ... or by doing xl pci-assignable-add

  xl pci-assignable-add 01:10.0

2. Dom0 reporting PF removal without reporting VF removal.

* Hack your PF driver to leave SR-IOV enabled when removing the device

* Enable SR-IOV

  echo 1 > /sys/bus/pci/devices/0000\:01\:00.0/sriov_numvfs

* Unplug the PCI device

  (qemu) device_del mydev

* Now we have a stale VF. We can trigger the ASSERT either by re-adding
  the PF device with SR-IOV disabled...

  echo 0000\:01\:10.0 > /sys/bus/pci/devices/0000\:01\:10.0/driver/unbind
  (qemu) device_add igb,id=mydev,bus=pcie.1,netdev=net1

  ... or by reset / xl pci-assignable-add as above.
---
 xen/arch/x86/msi.c            | 19 +++++++++----------
 xen/drivers/passthrough/msi.c |  3 +++
 xen/drivers/vpci/msi.c        |  2 +-
 xen/drivers/vpci/msix.c       |  2 +-
 xen/include/xen/pci.h         |  3 +++
 5 files changed, 17 insertions(+), 12 deletions(-)

diff --git a/xen/arch/x86/msi.c b/xen/arch/x86/msi.c
index ff2e3d86878d..5e24df7be0c0 100644
--- a/xen/arch/x86/msi.c
+++ b/xen/arch/x86/msi.c
@@ -278,23 +278,21 @@ void __msi_set_enable(u16 seg, u8 bus, u8 slot, u8 func, int pos, int enable)
 
 static void msi_set_enable(struct pci_dev *dev, int enable)
 {
-    int pos;
+    int pos = dev->msi_pos;
     u16 seg = dev->seg;
     u8 bus = dev->bus;
     u8 slot = PCI_SLOT(dev->devfn);
     u8 func = PCI_FUNC(dev->devfn);
 
-    pos = pci_find_cap_offset(dev->sbdf, PCI_CAP_ID_MSI);
     if ( pos )
         __msi_set_enable(seg, bus, slot, func, pos, enable);
 }
 
 static void msix_set_enable(struct pci_dev *dev, int enable)
 {
-    int pos;
+    int pos = dev->msix_pos;
     uint16_t control;
 
-    pos = pci_find_cap_offset(dev->sbdf, PCI_CAP_ID_MSIX);
     if ( pos )
     {
         control = pci_conf_read16(dev->sbdf, msix_control_reg(pos));
@@ -601,7 +599,7 @@ static int msi_capability_init(struct pci_dev *dev,
     uint16_t control;
 
     ASSERT_PDEV_LIST_IS_READ_LOCKED(dev->domain);
-    pos = pci_find_cap_offset(dev->sbdf, PCI_CAP_ID_MSI);
+    pos = dev->msi_pos;
     if ( !pos )
         return -ENODEV;
     control = pci_conf_read16(dev->sbdf, msi_control_reg(pos));
@@ -764,7 +762,7 @@ static int msix_capability_init(struct pci_dev *dev,
     u8 slot = PCI_SLOT(dev->devfn);
     u8 func = PCI_FUNC(dev->devfn);
     bool maskall = msix->host_maskall, zap_on_error = false;
-    unsigned int pos = pci_find_cap_offset(dev->sbdf, PCI_CAP_ID_MSIX);
+    unsigned int pos = dev->msix_pos;
 
     if ( !pos )
         return -ENODEV;
@@ -1133,11 +1131,13 @@ static void _pci_cleanup_msix(struct arch_msix *msix)
 static void __pci_disable_msix(struct msi_desc *entry)
 {
     struct pci_dev *dev = entry->dev;
-    unsigned int pos = pci_find_cap_offset(dev->sbdf, PCI_CAP_ID_MSIX);
+    unsigned int pos = dev->msix_pos;
     u16 control = pci_conf_read16(dev->sbdf,
                                   msix_control_reg(entry->msi_attrib.pos));
     bool maskall = dev->msix->host_maskall;
 
+    ASSERT(pos);
+
     if ( unlikely(!(control & PCI_MSIX_FLAGS_ENABLE)) )
     {
         dev->msix->host_maskall = 1;
@@ -1241,7 +1241,7 @@ void pci_cleanup_msi(struct pci_dev *pdev)
 
 int pci_reset_msix_state(struct pci_dev *pdev)
 {
-    unsigned int pos = pci_find_cap_offset(pdev->sbdf, PCI_CAP_ID_MSIX);
+    unsigned int pos = pdev->msix_pos;
 
     ASSERT(pos);
     /*
@@ -1269,8 +1269,7 @@ int pci_msi_conf_write_intercept(struct pci_dev *pdev, unsigned int reg,
     if ( pdev->msix )
     {
         entry = find_msi_entry(pdev, -1, PCI_CAP_ID_MSIX);
-        pos = entry ? entry->msi_attrib.pos
-                    : pci_find_cap_offset(pdev->sbdf, PCI_CAP_ID_MSIX);
+        pos = entry ? entry->msi_attrib.pos : pdev->msix_pos;
         ASSERT(pos);
 
         if ( reg >= pos && reg < msix_pba_offset_reg(pos) + 4 )
diff --git a/xen/drivers/passthrough/msi.c b/xen/drivers/passthrough/msi.c
index 13d904692ef8..ed2bc7ebe635 100644
--- a/xen/drivers/passthrough/msi.c
+++ b/xen/drivers/passthrough/msi.c
@@ -29,6 +29,7 @@ int pdev_msi_init(struct pci_dev *pdev)
     {
         uint16_t ctrl = pci_conf_read16(pdev->sbdf, msi_control_reg(pos));
 
+        pdev->msi_pos = pos;
         pdev->msi_maxvec = multi_msi_capable(ctrl);
     }
 
@@ -41,6 +42,8 @@ int pdev_msi_init(struct pci_dev *pdev)
         if ( !msix )
             return -ENOMEM;
 
+        pdev->msix_pos = pos;
+
         spin_lock_init(&msix->table_lock);
 
         ctrl = pci_conf_read16(pdev->sbdf, msix_control_reg(pos));
diff --git a/xen/drivers/vpci/msi.c b/xen/drivers/vpci/msi.c
index dd6620ec5674..66e5a8a116be 100644
--- a/xen/drivers/vpci/msi.c
+++ b/xen/drivers/vpci/msi.c
@@ -195,7 +195,7 @@ static void cf_check mask_write(
 
 static int cf_check init_msi(struct pci_dev *pdev)
 {
-    unsigned int pos = pci_find_cap_offset(pdev->sbdf, PCI_CAP_ID_MSI);
+    unsigned int pos = pdev->msi_pos;
     uint16_t control;
     int ret;
 
diff --git a/xen/drivers/vpci/msix.c b/xen/drivers/vpci/msix.c
index 5bb4444ce21f..6bd8c55bb48e 100644
--- a/xen/drivers/vpci/msix.c
+++ b/xen/drivers/vpci/msix.c
@@ -711,7 +711,7 @@ static int cf_check init_msix(struct pci_dev *pdev)
     struct vpci_msix *msix;
     int rc;
 
-    msix_offset = pci_find_cap_offset(pdev->sbdf, PCI_CAP_ID_MSIX);
+    msix_offset = pdev->msix_pos;
     if ( !msix_offset )
         return 0;
 
diff --git a/xen/include/xen/pci.h b/xen/include/xen/pci.h
index 63e49f0117e9..ef56e80651d6 100644
--- a/xen/include/xen/pci.h
+++ b/xen/include/xen/pci.h
@@ -113,6 +113,9 @@ struct pci_dev {
         pci_sbdf_t sbdf;
     };
 
+    unsigned int msi_pos;
+    unsigned int msix_pos;
+
     uint8_t msi_maxvec;
     uint8_t phantom_stride;
 
-- 
2.47.0



^ permalink raw reply related	[flat|nested] 30+ messages in thread

* [PATCH v6 2/3] xen/pci: introduce PF<->VF links
  2024-10-18 20:39 [PATCH v6 0/3] xen: SR-IOV fixes Stewart Hildebrand
  2024-10-18 20:39 ` [PATCH v6 1/3] x86/msi: harden stale pdev handling Stewart Hildebrand
@ 2024-10-18 20:39 ` Stewart Hildebrand
  2024-10-28 17:02   ` Jan Beulich
  2024-10-28 18:41   ` Roger Pau Monné
  2024-10-18 20:39 ` [PATCH v6 3/3] x86/msi: fix locking for SR-IOV devices Stewart Hildebrand
  2 siblings, 2 replies; 30+ messages in thread
From: Stewart Hildebrand @ 2024-10-18 20:39 UTC (permalink / raw)
  To: xen-devel
  Cc: Stewart Hildebrand, Jan Beulich, Roger Pau Monné,
	Andrew Cooper, Julien Grall, Stefano Stabellini

Add links between a VF's struct pci_dev and its associated PF struct
pci_dev. Move the calls to pci_get_pdev()/pci_add_device() down to avoid
dropping and re-acquiring the pcidevs_lock().

During PF removal, unlink VF from PF and mark the VF broken. As before,
VFs may exist without a corresponding PF, although now only with
pdev->broken = true.

The hardware domain is expected to remove the associated VFs before
removing the PF. Print a warning in case a PF is removed with associated
VFs still present.

Signed-off-by: Stewart Hildebrand <stewart.hildebrand@amd.com>
---
Candidate for backport to 4.19 (the next patch depends on this one)

v5->v6:
* move printk() before ASSERT_UNREACHABLE()
* warn about PF removal with VFs still present
* clarify commit message

v4->v5:
* new patch, split from ("x86/msi: fix locking for SR-IOV devices")
* move INIT_LIST_HEAD(&pdev->vf_list); earlier
* collapse struct list_head instances
* retain error code from pci_add_device()
* unlink (and mark broken) VFs instead of removing them
* const-ify VF->PF link
---
 xen/drivers/passthrough/pci.c | 76 ++++++++++++++++++++++++++++-------
 xen/include/xen/pci.h         | 10 +++++
 2 files changed, 72 insertions(+), 14 deletions(-)

diff --git a/xen/drivers/passthrough/pci.c b/xen/drivers/passthrough/pci.c
index 74d3895e1ef6..fe31255b1207 100644
--- a/xen/drivers/passthrough/pci.c
+++ b/xen/drivers/passthrough/pci.c
@@ -333,6 +333,8 @@ static struct pci_dev *alloc_pdev(struct pci_seg *pseg, u8 bus, u8 devfn)
     *((u8*) &pdev->devfn) = devfn;
     pdev->domain = NULL;
 
+    INIT_LIST_HEAD(&pdev->vf_list);
+
     arch_pci_init_pdev(pdev);
 
     rc = pdev_msi_init(pdev);
@@ -449,6 +451,10 @@ static void free_pdev(struct pci_seg *pseg, struct pci_dev *pdev)
 
     list_del(&pdev->alldevs_list);
     pdev_msi_deinit(pdev);
+
+    if ( pdev->info.is_virtfn && pdev->virtfn.pf_pdev )
+        list_del(&pdev->vf_list);
+
     xfree(pdev);
 }
 
@@ -656,24 +662,11 @@ int pci_add_device(u16 seg, u8 bus, u8 devfn,
     unsigned int slot = PCI_SLOT(devfn), func = PCI_FUNC(devfn);
     const char *type;
     int ret;
-    bool pf_is_extfn = false;
 
     if ( !info )
         type = "device";
     else if ( info->is_virtfn )
-    {
-        pcidevs_lock();
-        pdev = pci_get_pdev(NULL,
-                            PCI_SBDF(seg, info->physfn.bus,
-                                     info->physfn.devfn));
-        if ( pdev )
-            pf_is_extfn = pdev->info.is_extfn;
-        pcidevs_unlock();
-        if ( !pdev )
-            pci_add_device(seg, info->physfn.bus, info->physfn.devfn,
-                           NULL, node);
         type = "virtual function";
-    }
     else if ( info->is_extfn )
         type = "extended function";
     else
@@ -703,7 +696,44 @@ int pci_add_device(u16 seg, u8 bus, u8 devfn,
          * extended function.
          */
         if ( pdev->info.is_virtfn )
-            pdev->info.is_extfn = pf_is_extfn;
+        {
+            struct pci_dev *pf_pdev;
+
+            pf_pdev = pci_get_pdev(NULL,
+                                   PCI_SBDF(seg, info->physfn.bus,
+                                            info->physfn.devfn));
+
+            if ( !pf_pdev )
+            {
+                ret = pci_add_device(seg, info->physfn.bus, info->physfn.devfn,
+                                     NULL, node);
+                if ( ret )
+                {
+                    printk(XENLOG_WARNING "Failed to add SR-IOV device PF %pp for VF %pp\n",
+                           &PCI_SBDF(seg, info->physfn.bus, info->physfn.devfn),
+                           &pdev->sbdf);
+                    free_pdev(pseg, pdev);
+                    goto out;
+                }
+                pf_pdev = pci_get_pdev(NULL,
+                                       PCI_SBDF(seg, info->physfn.bus,
+                                                info->physfn.devfn));
+                if ( !pf_pdev )
+                {
+                    printk(XENLOG_ERR "Failed to find SR-IOV device PF %pp for VF %pp\n",
+                           &PCI_SBDF(seg, info->physfn.bus, info->physfn.devfn),
+                           &pdev->sbdf);
+                    ASSERT_UNREACHABLE();
+                    free_pdev(pseg, pdev);
+                    ret = -EILSEQ;
+                    goto out;
+                }
+            }
+
+            pdev->info.is_extfn = pf_pdev->info.is_extfn;
+            pdev->virtfn.pf_pdev = pf_pdev;
+            list_add(&pdev->vf_list, &pf_pdev->vf_list);
+        }
     }
 
     if ( !pdev->info.is_virtfn && !pdev->vf_rlen[0] )
@@ -821,6 +851,24 @@ int pci_remove_device(u16 seg, u8 bus, u8 devfn)
     list_for_each_entry ( pdev, &pseg->alldevs_list, alldevs_list )
         if ( pdev->bus == bus && pdev->devfn == devfn )
         {
+            if ( !pdev->info.is_virtfn )
+            {
+                struct pci_dev *vf_pdev, *tmp;
+                bool warn_stale_vfs = false;
+
+                list_for_each_entry_safe(vf_pdev, tmp, &pdev->vf_list, vf_list)
+                {
+                    list_del(&vf_pdev->vf_list);
+                    vf_pdev->virtfn.pf_pdev = NULL;
+                    vf_pdev->broken = true;
+                    warn_stale_vfs = true;
+                }
+
+                if ( warn_stale_vfs )
+                    printk(XENLOG_WARNING "PCI SR-IOV PF %pp removed with VFs still present\n",
+                           &pdev->sbdf);
+            }
+
             if ( pdev->domain )
             {
                 write_lock(&pdev->domain->pci_lock);
diff --git a/xen/include/xen/pci.h b/xen/include/xen/pci.h
index ef56e80651d6..2ea168d5f914 100644
--- a/xen/include/xen/pci.h
+++ b/xen/include/xen/pci.h
@@ -153,7 +153,17 @@ struct pci_dev {
         unsigned int count;
 #define PT_FAULT_THRESHOLD 10
     } fault;
+
+    /*
+     * List head if info.is_virtfn == false
+     * List entry if info.is_virtfn == true
+     */
+    struct list_head vf_list;
     u64 vf_rlen[6];
+    struct {
+        /* Only populated for VFs (info.is_virtfn == true) */
+        const struct pci_dev *pf_pdev;        /* Link from VF to PF */
+    } virtfn;
 
     /* Data for vPCI. */
     struct vpci *vpci;
-- 
2.47.0



^ permalink raw reply related	[flat|nested] 30+ messages in thread

* [PATCH v6 3/3] x86/msi: fix locking for SR-IOV devices
  2024-10-18 20:39 [PATCH v6 0/3] xen: SR-IOV fixes Stewart Hildebrand
  2024-10-18 20:39 ` [PATCH v6 1/3] x86/msi: harden stale pdev handling Stewart Hildebrand
  2024-10-18 20:39 ` [PATCH v6 2/3] xen/pci: introduce PF<->VF links Stewart Hildebrand
@ 2024-10-18 20:39 ` Stewart Hildebrand
  2 siblings, 0 replies; 30+ messages in thread
From: Stewart Hildebrand @ 2024-10-18 20:39 UTC (permalink / raw)
  To: xen-devel
  Cc: Stewart Hildebrand, Jan Beulich, Andrew Cooper,
	Roger Pau Monné, Julien Grall, Stefano Stabellini,
	Teddy Astie

In commit 4f78438b45e2 ("vpci: use per-domain PCI lock to protect vpci
structure") a lock was moved from allocate_and_map_msi_pirq() to the
caller and changed from pcidevs_lock() to read_lock(&d->pci_lock).
However, one call path wasn't updated to reflect the change, leading to
a failed assertion observed under the following conditions:

* PV dom0
* Debug build (CONFIG_DEBUG=y) of Xen
* There is an SR-IOV device in the system with one or more VFs enabled
* Dom0 has loaded the driver for the VF and enabled MSI-X

(XEN) Assertion 'd || pcidevs_locked()' failed at drivers/passthrough/pci.c:535
(XEN) ----[ Xen-4.20-unstable  x86_64  debug=y  Not tainted ]----
...
(XEN) Xen call trace:
(XEN)    [<ffff82d040284da8>] R pci_get_pdev+0x4c/0xab
(XEN)    [<ffff82d040344f5c>] F arch/x86/msi.c#read_pci_mem_bar+0x58/0x272
(XEN)    [<ffff82d04034530e>] F arch/x86/msi.c#msix_capability_init+0x198/0x755
(XEN)    [<ffff82d040345dad>] F arch/x86/msi.c#__pci_enable_msix+0x82/0xe8
(XEN)    [<ffff82d0403463e5>] F pci_enable_msi+0x3f/0x78
(XEN)    [<ffff82d04034be2b>] F map_domain_pirq+0x2a4/0x6dc
(XEN)    [<ffff82d04034d4d5>] F allocate_and_map_msi_pirq+0x103/0x262
(XEN)    [<ffff82d04035da5d>] F physdev_map_pirq+0x210/0x259
(XEN)    [<ffff82d04035e798>] F do_physdev_op+0x9c3/0x1454
(XEN)    [<ffff82d040329475>] F pv_hypercall+0x5ac/0x6af
(XEN)    [<ffff82d0402012d3>] F lstar_enter+0x143/0x150

In read_pci_mem_bar(), the VF obtains the struct pci_dev pointer for its
associated PF to access the vf_rlen array. This array is initialized in
pci_add_device() and is only populated in the associated PF's struct
pci_dev.

Access the vf_rlen array via the link to the PF, and remove the
troublesome call to pci_get_pdev().

Fixes: 4f78438b45e2 ("vpci: use per-domain PCI lock to protect vpci structure")
Reported-by: Teddy Astie <teddy.astie@vates.tech>
Signed-off-by: Stewart Hildebrand <stewart.hildebrand@amd.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
---
Candidate for backport to 4.19
Patch #2 ("xen/pci: introduce PF<->VF links") is pre-requisite

v5->v6:
* add Jan's R-b

v4->v5:
* split the PF<->VF links to a pre-requisite patch
* pass pci_sbdf_t to read_pci_mem_bar()
* use stdint.h types on changed lines
* re-add NULL check for pf_info in read_pci_mem_bar(), as pf_info could
  be NULL

v3->v4:
* handle case when PF is removed with VFs enabled, then re-added with
  VFs disabled

v2->v3:
* link from VF to PF's struct pci_dev *

v1->v2:
* remove call to pci_get_pdev()
---
 xen/arch/x86/msi.c            | 37 ++++++++++++++++++++++-------------
 xen/drivers/passthrough/pci.c |  8 +++++---
 xen/include/xen/pci.h         | 15 +++++++++-----
 3 files changed, 38 insertions(+), 22 deletions(-)

diff --git a/xen/arch/x86/msi.c b/xen/arch/x86/msi.c
index 5e24df7be0c0..79fff9de4305 100644
--- a/xen/arch/x86/msi.c
+++ b/xen/arch/x86/msi.c
@@ -660,34 +660,35 @@ static int msi_capability_init(struct pci_dev *dev,
     return 0;
 }
 
-static u64 read_pci_mem_bar(u16 seg, u8 bus, u8 slot, u8 func, u8 bir, int vf)
+static uint64_t read_pci_mem_bar(pci_sbdf_t sbdf, uint8_t bir, int vf,
+                                 const struct pf_info *pf_info)
 {
+    uint16_t seg = sbdf.seg;
+    uint8_t bus = sbdf.bus, slot = sbdf.dev, func = sbdf.fn;
     u8 limit;
     u32 addr, base = PCI_BASE_ADDRESS_0;
     u64 disp = 0;
 
     if ( vf >= 0 )
     {
-        struct pci_dev *pdev = pci_get_pdev(NULL,
-                                            PCI_SBDF(seg, bus, slot, func));
         unsigned int pos;
         uint16_t ctrl, num_vf, offset, stride;
 
-        if ( !pdev )
+        if ( !pf_info )
             return 0;
 
-        pos = pci_find_ext_capability(pdev->sbdf, PCI_EXT_CAP_ID_SRIOV);
-        ctrl = pci_conf_read16(pdev->sbdf, pos + PCI_SRIOV_CTRL);
-        num_vf = pci_conf_read16(pdev->sbdf, pos + PCI_SRIOV_NUM_VF);
-        offset = pci_conf_read16(pdev->sbdf, pos + PCI_SRIOV_VF_OFFSET);
-        stride = pci_conf_read16(pdev->sbdf, pos + PCI_SRIOV_VF_STRIDE);
+        pos = pci_find_ext_capability(sbdf, PCI_EXT_CAP_ID_SRIOV);
+        ctrl = pci_conf_read16(sbdf, pos + PCI_SRIOV_CTRL);
+        num_vf = pci_conf_read16(sbdf, pos + PCI_SRIOV_NUM_VF);
+        offset = pci_conf_read16(sbdf, pos + PCI_SRIOV_VF_OFFSET);
+        stride = pci_conf_read16(sbdf, pos + PCI_SRIOV_VF_STRIDE);
 
         if ( !pos ||
              !(ctrl & PCI_SRIOV_CTRL_VFE) ||
              !(ctrl & PCI_SRIOV_CTRL_MSE) ||
              !num_vf || !offset || (num_vf > 1 && !stride) ||
              bir >= PCI_SRIOV_NUM_BARS ||
-             !pdev->vf_rlen[bir] )
+             !pf_info->vf_rlen[bir] )
             return 0;
         base = pos + PCI_SRIOV_BAR;
         vf -= PCI_BDF(bus, slot, func) + offset;
@@ -701,8 +702,8 @@ static u64 read_pci_mem_bar(u16 seg, u8 bus, u8 slot, u8 func, u8 bir, int vf)
         }
         if ( vf >= num_vf )
             return 0;
-        BUILD_BUG_ON(ARRAY_SIZE(pdev->vf_rlen) != PCI_SRIOV_NUM_BARS);
-        disp = vf * pdev->vf_rlen[bir];
+        BUILD_BUG_ON(ARRAY_SIZE(pf_info->vf_rlen) != PCI_SRIOV_NUM_BARS);
+        disp = vf * pf_info->vf_rlen[bir];
         limit = PCI_SRIOV_NUM_BARS;
     }
     else switch ( pci_conf_read8(PCI_SBDF(seg, bus, slot, func),
@@ -811,6 +812,7 @@ static int msix_capability_init(struct pci_dev *dev,
         int vf;
         paddr_t pba_paddr;
         unsigned int pba_offset;
+        const struct pf_info *pf_info;
 
         if ( !dev->info.is_virtfn )
         {
@@ -818,6 +820,7 @@ static int msix_capability_init(struct pci_dev *dev,
             pslot = slot;
             pfunc = func;
             vf = -1;
+            pf_info = NULL;
         }
         else
         {
@@ -825,9 +828,14 @@ static int msix_capability_init(struct pci_dev *dev,
             pslot = PCI_SLOT(dev->info.physfn.devfn);
             pfunc = PCI_FUNC(dev->info.physfn.devfn);
             vf = dev->sbdf.bdf;
+            if ( dev->virtfn.pf_pdev )
+                pf_info = &dev->virtfn.pf_pdev->physfn;
+            else
+                pf_info = NULL;
         }
 
-        table_paddr = read_pci_mem_bar(seg, pbus, pslot, pfunc, bir, vf);
+        table_paddr = read_pci_mem_bar(PCI_SBDF(seg, pbus, pslot, pfunc), bir,
+                                       vf, pf_info);
         WARN_ON(msi && msi->table_base != table_paddr);
         if ( !table_paddr )
         {
@@ -850,7 +858,8 @@ static int msix_capability_init(struct pci_dev *dev,
 
         pba_offset = pci_conf_read32(dev->sbdf, msix_pba_offset_reg(pos));
         bir = (u8)(pba_offset & PCI_MSIX_BIRMASK);
-        pba_paddr = read_pci_mem_bar(seg, pbus, pslot, pfunc, bir, vf);
+        pba_paddr = read_pci_mem_bar(PCI_SBDF(seg, pbus, pslot, pfunc), bir, vf,
+                                     pf_info);
         WARN_ON(!pba_paddr);
         pba_paddr += pba_offset & ~PCI_MSIX_BIRMASK;
 
diff --git a/xen/drivers/passthrough/pci.c b/xen/drivers/passthrough/pci.c
index fe31255b1207..9182723ece1b 100644
--- a/xen/drivers/passthrough/pci.c
+++ b/xen/drivers/passthrough/pci.c
@@ -736,7 +736,7 @@ int pci_add_device(u16 seg, u8 bus, u8 devfn,
         }
     }
 
-    if ( !pdev->info.is_virtfn && !pdev->vf_rlen[0] )
+    if ( !pdev->info.is_virtfn && !pdev->physfn.vf_rlen[0] )
     {
         unsigned int pos = pci_find_ext_capability(pdev->sbdf,
                                                    PCI_EXT_CAP_ID_SRIOV);
@@ -748,7 +748,9 @@ int pci_add_device(u16 seg, u8 bus, u8 devfn,
         {
             unsigned int i;
 
-            BUILD_BUG_ON(ARRAY_SIZE(pdev->vf_rlen) != PCI_SRIOV_NUM_BARS);
+            BUILD_BUG_ON(ARRAY_SIZE(pdev->physfn.vf_rlen) !=
+                                    PCI_SRIOV_NUM_BARS);
+
             for ( i = 0; i < PCI_SRIOV_NUM_BARS; )
             {
                 unsigned int idx = pos + PCI_SRIOV_BAR + i * 4;
@@ -763,7 +765,7 @@ int pci_add_device(u16 seg, u8 bus, u8 devfn,
                     continue;
                 }
                 ret = pci_size_mem_bar(pdev->sbdf, idx, NULL,
-                                       &pdev->vf_rlen[i],
+                                       &pdev->physfn.vf_rlen[i],
                                        PCI_BAR_VF |
                                        ((i == PCI_SRIOV_NUM_BARS - 1) ?
                                         PCI_BAR_LAST : 0));
diff --git a/xen/include/xen/pci.h b/xen/include/xen/pci.h
index 2ea168d5f914..e961a3c66799 100644
--- a/xen/include/xen/pci.h
+++ b/xen/include/xen/pci.h
@@ -159,11 +159,16 @@ struct pci_dev {
      * List entry if info.is_virtfn == true
      */
     struct list_head vf_list;
-    u64 vf_rlen[6];
-    struct {
-        /* Only populated for VFs (info.is_virtfn == true) */
-        const struct pci_dev *pf_pdev;        /* Link from VF to PF */
-    } virtfn;
+    union {
+        struct pf_info {
+            /* Only populated for PFs (info.is_virtfn == false) */
+            uint64_t vf_rlen[PCI_SRIOV_NUM_BARS];
+        } physfn;
+        struct {
+            /* Only populated for VFs (info.is_virtfn == true) */
+            const struct pci_dev *pf_pdev;        /* Link from VF to PF */
+        } virtfn;
+    };
 
     /* Data for vPCI. */
     struct vpci *vpci;
-- 
2.47.0



^ permalink raw reply related	[flat|nested] 30+ messages in thread

* Re: [PATCH v6 1/3] x86/msi: harden stale pdev handling
  2024-10-18 20:39 ` [PATCH v6 1/3] x86/msi: harden stale pdev handling Stewart Hildebrand
@ 2024-10-28 16:58   ` Jan Beulich
  2024-10-28 17:53     ` Roger Pau Monné
  0 siblings, 1 reply; 30+ messages in thread
From: Jan Beulich @ 2024-10-28 16:58 UTC (permalink / raw)
  To: Stewart Hildebrand
  Cc: Andrew Cooper, Roger Pau Monné, Julien Grall,
	Stefano Stabellini, xen-devel

On 18.10.2024 22:39, Stewart Hildebrand wrote:
> Dom0 normally informs Xen of PCI device removal via
> PHYSDEVOP_pci_device_remove, e.g. in response to SR-IOV disable or
> hot-unplug. We might find ourselves with stale pdevs if a buggy dom0
> fails to report removal via PHYSDEVOP_pci_device_remove. In this case,
> attempts to access the config space of the stale pdevs would be invalid
> and return all 1s.
> 
> Some possible conditions leading to this are:
> 
> 1. Dom0 disables SR-IOV without reporting VF removal to Xen.
> 
> The Linux SR-IOV subsystem normally reports VF removal when a PF driver
> disables SR-IOV. In case of a buggy dom0 SR-IOV subsystem, SR-IOV could
> become disabled with stale dangling VF pdevs in both dom0 Linux and Xen.
> 
> 2. Dom0 reporting PF removal without reporting VF removal.
> 
> During SR-IOV PF removal (hot-unplug), a buggy PF driver may fail to
> disable SR-IOV, thus failing to remove the VFs, leaving stale dangling
> VFs behind in both Xen and Linux. At least Linux warns in this case:
> 
> [  100.000000]  0000:01:00.0: driver left SR-IOV enabled after remove
> 
> In either case, Xen is left with stale VF pdevs, risking invalid PCI
> config space accesses.
> 
> When Xen is built with CONFIG_DEBUG=y, the following Xen crashes were
> observed when dom0 attempted to access the config space of a stale VF:
> 
> (XEN) Assertion 'pos' failed at arch/x86/msi.c:1274
> (XEN) ----[ Xen-4.20-unstable  x86_64  debug=y  Tainted:   C    ]----
> ...
> (XEN) Xen call trace:
> (XEN)    [<ffff82d040346834>] R pci_msi_conf_write_intercept+0xa2/0x1de
> (XEN)    [<ffff82d04035d6b4>] F pci_conf_write_intercept+0x68/0x78
> (XEN)    [<ffff82d0403264e5>] F arch/x86/pv/emul-priv-op.c#pci_cfg_ok+0xa0/0x114
> (XEN)    [<ffff82d04032660e>] F arch/x86/pv/emul-priv-op.c#guest_io_write+0xb5/0x1c8
> (XEN)    [<ffff82d0403267bb>] F arch/x86/pv/emul-priv-op.c#write_io+0x9a/0xe0
> (XEN)    [<ffff82d04037c77a>] F x86_emulate+0x100e5/0x25f1e
> (XEN)    [<ffff82d0403941a8>] F x86_emulate_wrapper+0x29/0x64
> (XEN)    [<ffff82d04032802b>] F pv_emulate_privileged_op+0x12e/0x217
> (XEN)    [<ffff82d040369f12>] F do_general_protection+0xc2/0x1b8
> (XEN)    [<ffff82d040201aa7>] F x86_64/entry.S#handle_exception_saved+0x2b/0x8c
> 
> (XEN) Assertion 'pos' failed at arch/x86/msi.c:1246
> (XEN) ----[ Xen-4.20-unstable  x86_64  debug=y  Tainted:   C    ]----
> ...
> (XEN) Xen call trace:
> (XEN)    [<ffff82d040346b0a>] R pci_reset_msix_state+0x47/0x50
> (XEN)    [<ffff82d040287eec>] F pdev_msix_assign+0x19/0x35
> (XEN)    [<ffff82d040286184>] F drivers/passthrough/pci.c#assign_device+0x181/0x471
> (XEN)    [<ffff82d040287c36>] F iommu_do_pci_domctl+0x248/0x2ec
> (XEN)    [<ffff82d040284e1f>] F iommu_do_domctl+0x26/0x44
> (XEN)    [<ffff82d0402483b8>] F do_domctl+0x8c1/0x1660
> (XEN)    [<ffff82d04032977e>] F pv_hypercall+0x5ce/0x6af
> (XEN)    [<ffff82d0402012d3>] F lstar_enter+0x143/0x150
> 
> These ASSERTs triggered because the MSI-X capability position can't be
> found for a stale pdev.
> 
> Latch the capability positions of MSI and MSI-X during device init, and
> replace instances of pci_find_cap_offset(..., PCI_CAP_ID_MSI{,X}) with
> the stored value. Introduce one additional ASSERT, while the two
> existing ASSERTs in question continue to work as intended, even with a
> stale pdev.
> 
> Fixes: 484d7c852e4f ("x86/MSI-X: track host and guest mask-all requests separately")
> Fixes: 575e18d54d19 ("pci: clear {host/guest}_maskall field on assign")
> Signed-off-by: Stewart Hildebrand <stewart.hildebrand@amd.com>

Looks largely okay to me now, just two type selection aspects:

> --- a/xen/arch/x86/msi.c
> +++ b/xen/arch/x86/msi.c
> @@ -278,23 +278,21 @@ void __msi_set_enable(u16 seg, u8 bus, u8 slot, u8 func, int pos, int enable)
>  
>  static void msi_set_enable(struct pci_dev *dev, int enable)
>  {
> -    int pos;
> +    int pos = dev->msi_pos;

This and ...

>      u16 seg = dev->seg;
>      u8 bus = dev->bus;
>      u8 slot = PCI_SLOT(dev->devfn);
>      u8 func = PCI_FUNC(dev->devfn);
>  
> -    pos = pci_find_cap_offset(dev->sbdf, PCI_CAP_ID_MSI);
>      if ( pos )
>          __msi_set_enable(seg, bus, slot, func, pos, enable);
>  }
>  
>  static void msix_set_enable(struct pci_dev *dev, int enable)
>  {
> -    int pos;
> +    int pos = dev->msix_pos;

... this want to become unsigned int at this occasion, imo. Like we have ...

> @@ -764,7 +762,7 @@ static int msix_capability_init(struct pci_dev *dev,
>      u8 slot = PCI_SLOT(dev->devfn);
>      u8 func = PCI_FUNC(dev->devfn);
>      bool maskall = msix->host_maskall, zap_on_error = false;
> -    unsigned int pos = pci_find_cap_offset(dev->sbdf, PCI_CAP_ID_MSIX);
> +    unsigned int pos = dev->msix_pos;

... e.g. here already.

> --- a/xen/include/xen/pci.h
> +++ b/xen/include/xen/pci.h
> @@ -113,6 +113,9 @@ struct pci_dev {
>          pci_sbdf_t sbdf;
>      };
>  
> +    unsigned int msi_pos;
> +    unsigned int msix_pos;
> +
>      uint8_t msi_maxvec;
>      uint8_t phantom_stride;

As can be seen from the subsequent members, we're trying to be space
conserving here. Both fields won't require more than 8 bits, so uint8_t
or unsigned char would be the better type to use. Again imo. Preferably
with those adjustments (which could likely be done while committing)
Reviewed-by: Jan Beulich <jbeulich@suse.com>

Jan


^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH v6 2/3] xen/pci: introduce PF<->VF links
  2024-10-18 20:39 ` [PATCH v6 2/3] xen/pci: introduce PF<->VF links Stewart Hildebrand
@ 2024-10-28 17:02   ` Jan Beulich
  2024-11-01 20:16     ` Stewart Hildebrand
  2024-10-28 18:41   ` Roger Pau Monné
  1 sibling, 1 reply; 30+ messages in thread
From: Jan Beulich @ 2024-10-28 17:02 UTC (permalink / raw)
  To: Stewart Hildebrand
  Cc: Roger Pau Monné, Andrew Cooper, Julien Grall,
	Stefano Stabellini, xen-devel

On 18.10.2024 22:39, Stewart Hildebrand wrote:
> Add links between a VF's struct pci_dev and its associated PF struct
> pci_dev. Move the calls to pci_get_pdev()/pci_add_device() down to avoid
> dropping and re-acquiring the pcidevs_lock().
> 
> During PF removal, unlink VF from PF and mark the VF broken. As before,
> VFs may exist without a corresponding PF, although now only with
> pdev->broken = true.
> 
> The hardware domain is expected to remove the associated VFs before
> removing the PF. Print a warning in case a PF is removed with associated
> VFs still present.
> 
> Signed-off-by: Stewart Hildebrand <stewart.hildebrand@amd.com>
> ---
> Candidate for backport to 4.19 (the next patch depends on this one)
> 
> v5->v6:
> * move printk() before ASSERT_UNREACHABLE()
> * warn about PF removal with VFs still present

Hmm, maybe I didn't make this clear enough when commenting on v5: I wasn't
just after an adjustment to the commit message. I'm instead actively
concerned of the resulting behavior. Question is whether we can reasonably
do something about that.

Jan


^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH v6 1/3] x86/msi: harden stale pdev handling
  2024-10-28 16:58   ` Jan Beulich
@ 2024-10-28 17:53     ` Roger Pau Monné
  2024-10-31 11:29       ` Jan Beulich
  0 siblings, 1 reply; 30+ messages in thread
From: Roger Pau Monné @ 2024-10-28 17:53 UTC (permalink / raw)
  To: Jan Beulich
  Cc: Stewart Hildebrand, Andrew Cooper, Julien Grall,
	Stefano Stabellini, xen-devel

On Mon, Oct 28, 2024 at 05:58:28PM +0100, Jan Beulich wrote:
> On 18.10.2024 22:39, Stewart Hildebrand wrote:
> > Dom0 normally informs Xen of PCI device removal via
> > PHYSDEVOP_pci_device_remove, e.g. in response to SR-IOV disable or
> > hot-unplug. We might find ourselves with stale pdevs if a buggy dom0
> > fails to report removal via PHYSDEVOP_pci_device_remove. In this case,
> > attempts to access the config space of the stale pdevs would be invalid
> > and return all 1s.
> > 
> > Some possible conditions leading to this are:
> > 
> > 1. Dom0 disables SR-IOV without reporting VF removal to Xen.
> > 
> > The Linux SR-IOV subsystem normally reports VF removal when a PF driver
> > disables SR-IOV. In case of a buggy dom0 SR-IOV subsystem, SR-IOV could
> > become disabled with stale dangling VF pdevs in both dom0 Linux and Xen.
> > 
> > 2. Dom0 reporting PF removal without reporting VF removal.
> > 
> > During SR-IOV PF removal (hot-unplug), a buggy PF driver may fail to
> > disable SR-IOV, thus failing to remove the VFs, leaving stale dangling
> > VFs behind in both Xen and Linux. At least Linux warns in this case:
> > 
> > [  100.000000]  0000:01:00.0: driver left SR-IOV enabled after remove
> > 
> > In either case, Xen is left with stale VF pdevs, risking invalid PCI
> > config space accesses.
> > 
> > When Xen is built with CONFIG_DEBUG=y, the following Xen crashes were
> > observed when dom0 attempted to access the config space of a stale VF:
> > 
> > (XEN) Assertion 'pos' failed at arch/x86/msi.c:1274
> > (XEN) ----[ Xen-4.20-unstable  x86_64  debug=y  Tainted:   C    ]----
> > ...
> > (XEN) Xen call trace:
> > (XEN)    [<ffff82d040346834>] R pci_msi_conf_write_intercept+0xa2/0x1de
> > (XEN)    [<ffff82d04035d6b4>] F pci_conf_write_intercept+0x68/0x78
> > (XEN)    [<ffff82d0403264e5>] F arch/x86/pv/emul-priv-op.c#pci_cfg_ok+0xa0/0x114
> > (XEN)    [<ffff82d04032660e>] F arch/x86/pv/emul-priv-op.c#guest_io_write+0xb5/0x1c8
> > (XEN)    [<ffff82d0403267bb>] F arch/x86/pv/emul-priv-op.c#write_io+0x9a/0xe0
> > (XEN)    [<ffff82d04037c77a>] F x86_emulate+0x100e5/0x25f1e
> > (XEN)    [<ffff82d0403941a8>] F x86_emulate_wrapper+0x29/0x64
> > (XEN)    [<ffff82d04032802b>] F pv_emulate_privileged_op+0x12e/0x217
> > (XEN)    [<ffff82d040369f12>] F do_general_protection+0xc2/0x1b8
> > (XEN)    [<ffff82d040201aa7>] F x86_64/entry.S#handle_exception_saved+0x2b/0x8c
> > 
> > (XEN) Assertion 'pos' failed at arch/x86/msi.c:1246
> > (XEN) ----[ Xen-4.20-unstable  x86_64  debug=y  Tainted:   C    ]----
> > ...
> > (XEN) Xen call trace:
> > (XEN)    [<ffff82d040346b0a>] R pci_reset_msix_state+0x47/0x50
> > (XEN)    [<ffff82d040287eec>] F pdev_msix_assign+0x19/0x35
> > (XEN)    [<ffff82d040286184>] F drivers/passthrough/pci.c#assign_device+0x181/0x471
> > (XEN)    [<ffff82d040287c36>] F iommu_do_pci_domctl+0x248/0x2ec
> > (XEN)    [<ffff82d040284e1f>] F iommu_do_domctl+0x26/0x44
> > (XEN)    [<ffff82d0402483b8>] F do_domctl+0x8c1/0x1660
> > (XEN)    [<ffff82d04032977e>] F pv_hypercall+0x5ce/0x6af
> > (XEN)    [<ffff82d0402012d3>] F lstar_enter+0x143/0x150
> > 
> > These ASSERTs triggered because the MSI-X capability position can't be
> > found for a stale pdev.
> > 
> > Latch the capability positions of MSI and MSI-X during device init, and
> > replace instances of pci_find_cap_offset(..., PCI_CAP_ID_MSI{,X}) with
> > the stored value. Introduce one additional ASSERT, while the two
> > existing ASSERTs in question continue to work as intended, even with a
> > stale pdev.
> > 
> > Fixes: 484d7c852e4f ("x86/MSI-X: track host and guest mask-all requests separately")
> > Fixes: 575e18d54d19 ("pci: clear {host/guest}_maskall field on assign")
> > Signed-off-by: Stewart Hildebrand <stewart.hildebrand@amd.com>
> 
> Looks largely okay to me now, just two type selection aspects:
> 
> > --- a/xen/arch/x86/msi.c
> > +++ b/xen/arch/x86/msi.c
> > @@ -278,23 +278,21 @@ void __msi_set_enable(u16 seg, u8 bus, u8 slot, u8 func, int pos, int enable)
> >  
> >  static void msi_set_enable(struct pci_dev *dev, int enable)
> >  {
> > -    int pos;
> > +    int pos = dev->msi_pos;
> 
> This and ...
> 
> >      u16 seg = dev->seg;
> >      u8 bus = dev->bus;
> >      u8 slot = PCI_SLOT(dev->devfn);
> >      u8 func = PCI_FUNC(dev->devfn);
> >  
> > -    pos = pci_find_cap_offset(dev->sbdf, PCI_CAP_ID_MSI);
> >      if ( pos )
> >          __msi_set_enable(seg, bus, slot, func, pos, enable);
> >  }
> >  
> >  static void msix_set_enable(struct pci_dev *dev, int enable)
> >  {
> > -    int pos;
> > +    int pos = dev->msix_pos;
> 
> ... this want to become unsigned int at this occasion, imo. Like we have ...
> 
> > @@ -764,7 +762,7 @@ static int msix_capability_init(struct pci_dev *dev,
> >      u8 slot = PCI_SLOT(dev->devfn);
> >      u8 func = PCI_FUNC(dev->devfn);
> >      bool maskall = msix->host_maskall, zap_on_error = false;
> > -    unsigned int pos = pci_find_cap_offset(dev->sbdf, PCI_CAP_ID_MSIX);
> > +    unsigned int pos = dev->msix_pos;
> 
> ... e.g. here already.
> 
> > --- a/xen/include/xen/pci.h
> > +++ b/xen/include/xen/pci.h
> > @@ -113,6 +113,9 @@ struct pci_dev {
> >          pci_sbdf_t sbdf;
> >      };
> >  
> > +    unsigned int msi_pos;
> > +    unsigned int msix_pos;
> > +
> >      uint8_t msi_maxvec;
> >      uint8_t phantom_stride;
> 
> As can be seen from the subsequent members, we're trying to be space
> conserving here. Both fields won't require more than 8 bits, so uint8_t
> or unsigned char would be the better type to use. Again imo. Preferably
> with those adjustments (which could likely be done while committing)
> Reviewed-by: Jan Beulich <jbeulich@suse.com>

uint8_t would seem preferable here, as it's fixed-size width clearly
related to the offset into the PCI configuration space for a device.

It might also be worth noting in the commit message that having the
position cached should be a small perf improvement, by not having to
walk the capability list each time.

Anyway, no strong opinion about the commit message adjustment, so with
the type changed:

Reviewed-by: Roger Pau Monné <roger.pau@citrix.com>

Thanks, Roger.


^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH v6 2/3] xen/pci: introduce PF<->VF links
  2024-10-18 20:39 ` [PATCH v6 2/3] xen/pci: introduce PF<->VF links Stewart Hildebrand
  2024-10-28 17:02   ` Jan Beulich
@ 2024-10-28 18:41   ` Roger Pau Monné
  2024-11-11 20:07     ` Stewart Hildebrand
  1 sibling, 1 reply; 30+ messages in thread
From: Roger Pau Monné @ 2024-10-28 18:41 UTC (permalink / raw)
  To: Stewart Hildebrand
  Cc: xen-devel, Jan Beulich, Andrew Cooper, Julien Grall,
	Stefano Stabellini

On Fri, Oct 18, 2024 at 04:39:09PM -0400, Stewart Hildebrand wrote:
> Add links between a VF's struct pci_dev and its associated PF struct
> pci_dev. Move the calls to pci_get_pdev()/pci_add_device() down to avoid
> dropping and re-acquiring the pcidevs_lock().
> 
> During PF removal, unlink VF from PF and mark the VF broken. As before,
> VFs may exist without a corresponding PF, although now only with
> pdev->broken = true.
> 
> The hardware domain is expected to remove the associated VFs before
> removing the PF. Print a warning in case a PF is removed with associated
> VFs still present.
> 
> Signed-off-by: Stewart Hildebrand <stewart.hildebrand@amd.com>
> ---
> Candidate for backport to 4.19 (the next patch depends on this one)
> 
> v5->v6:
> * move printk() before ASSERT_UNREACHABLE()
> * warn about PF removal with VFs still present
> * clarify commit message
> 
> v4->v5:
> * new patch, split from ("x86/msi: fix locking for SR-IOV devices")
> * move INIT_LIST_HEAD(&pdev->vf_list); earlier
> * collapse struct list_head instances
> * retain error code from pci_add_device()
> * unlink (and mark broken) VFs instead of removing them
> * const-ify VF->PF link
> ---
>  xen/drivers/passthrough/pci.c | 76 ++++++++++++++++++++++++++++-------
>  xen/include/xen/pci.h         | 10 +++++
>  2 files changed, 72 insertions(+), 14 deletions(-)
> 
> diff --git a/xen/drivers/passthrough/pci.c b/xen/drivers/passthrough/pci.c
> index 74d3895e1ef6..fe31255b1207 100644
> --- a/xen/drivers/passthrough/pci.c
> +++ b/xen/drivers/passthrough/pci.c
> @@ -333,6 +333,8 @@ static struct pci_dev *alloc_pdev(struct pci_seg *pseg, u8 bus, u8 devfn)
>      *((u8*) &pdev->devfn) = devfn;
>      pdev->domain = NULL;
>  
> +    INIT_LIST_HEAD(&pdev->vf_list);
> +
>      arch_pci_init_pdev(pdev);
>  
>      rc = pdev_msi_init(pdev);
> @@ -449,6 +451,10 @@ static void free_pdev(struct pci_seg *pseg, struct pci_dev *pdev)
>  
>      list_del(&pdev->alldevs_list);
>      pdev_msi_deinit(pdev);
> +
> +    if ( pdev->info.is_virtfn && pdev->virtfn.pf_pdev )

Shouldn't having pdev->info.is_virtfn set already ensure that
pdev->virtfn.pf_pdev != NULL?

> +        list_del(&pdev->vf_list);
> +
>      xfree(pdev);
>  }
>  
> @@ -656,24 +662,11 @@ int pci_add_device(u16 seg, u8 bus, u8 devfn,
>      unsigned int slot = PCI_SLOT(devfn), func = PCI_FUNC(devfn);
>      const char *type;
>      int ret;
> -    bool pf_is_extfn = false;
>  
>      if ( !info )
>          type = "device";
>      else if ( info->is_virtfn )
> -    {
> -        pcidevs_lock();
> -        pdev = pci_get_pdev(NULL,
> -                            PCI_SBDF(seg, info->physfn.bus,
> -                                     info->physfn.devfn));
> -        if ( pdev )
> -            pf_is_extfn = pdev->info.is_extfn;
> -        pcidevs_unlock();
> -        if ( !pdev )
> -            pci_add_device(seg, info->physfn.bus, info->physfn.devfn,
> -                           NULL, node);
>          type = "virtual function";
> -    }
>      else if ( info->is_extfn )
>          type = "extended function";
>      else
> @@ -703,7 +696,44 @@ int pci_add_device(u16 seg, u8 bus, u8 devfn,
>           * extended function.
>           */
>          if ( pdev->info.is_virtfn )
> -            pdev->info.is_extfn = pf_is_extfn;
> +        {
> +            struct pci_dev *pf_pdev;

This could be const?

> +
> +            pf_pdev = pci_get_pdev(NULL,
> +                                   PCI_SBDF(seg, info->physfn.bus,
> +                                            info->physfn.devfn));

You can probably initialize at declaration?

> +
> +            if ( !pf_pdev )

Is this even feasible during correct operation?  IOW: shouldn't the PF
always be added first, so that SR-IOV can be enabled and the VFs added
afterwards?

I see previous code also catered for VFs being added without the PF
being present, so I assume there was some need for this.

> +            {
> +                ret = pci_add_device(seg, info->physfn.bus, info->physfn.devfn,
> +                                     NULL, node);
> +                if ( ret )
> +                {
> +                    printk(XENLOG_WARNING "Failed to add SR-IOV device PF %pp for VF %pp\n",

Could you split this to make the line a bit shorter?

                       printk(XENLOG_WARNING
		              "Failed to add SR-IOV device PF %pp for VF %pp\n",

Same below.

> +                           &PCI_SBDF(seg, info->physfn.bus, info->physfn.devfn),
> +                           &pdev->sbdf);
> +                    free_pdev(pseg, pdev);
> +                    goto out;
> +                }
> +                pf_pdev = pci_get_pdev(NULL,
> +                                       PCI_SBDF(seg, info->physfn.bus,
> +                                                info->physfn.devfn));
> +                if ( !pf_pdev )
> +                {
> +                    printk(XENLOG_ERR "Failed to find SR-IOV device PF %pp for VF %pp\n",
> +                           &PCI_SBDF(seg, info->physfn.bus, info->physfn.devfn),
> +                           &pdev->sbdf);

If you want to add an error message here, I think it should mention
the fact this state is not expected:

"Inconsistent PCI state: failed to find newly added PF %pp for VF %pp\n"

> +                    ASSERT_UNREACHABLE();
> +                    free_pdev(pseg, pdev);
> +                    ret = -EILSEQ;
> +                    goto out;
> +                }
> +            }
> +
> +            pdev->info.is_extfn = pf_pdev->info.is_extfn;
> +            pdev->virtfn.pf_pdev = pf_pdev;
> +            list_add(&pdev->vf_list, &pf_pdev->vf_list);
> +        }
>      }
>  
>      if ( !pdev->info.is_virtfn && !pdev->vf_rlen[0] )
> @@ -821,6 +851,24 @@ int pci_remove_device(u16 seg, u8 bus, u8 devfn)
>      list_for_each_entry ( pdev, &pseg->alldevs_list, alldevs_list )
>          if ( pdev->bus == bus && pdev->devfn == devfn )
>          {
> +            if ( !pdev->info.is_virtfn )

Given we have no field to mark a device as a PF, we could check that
pdev->vf_list is not empty, and by doing so the warn_stale_vfs
variable could be dropped?

if ( !pdev->info.is_virtfn && !list_empty(&pdev->vf_list) )
{
    struct pci_dev *vf_pdev;

    while ( (vf_pdev = list_first_entry_or_null(&pdev->vf_list,
                                                struct pci_dev,
						vf_list)) != NULL )
    {
        list_del(&vf_pdev->vf_list);
        vf_pdev->virtfn.pf_pdev = NULL;
        vf_pdev->broken = true;
    }

    printk(XENLOG_WARNING "PCI SR-IOV PF %pp removed with VFs still present\n",
           &pdev->sbdf);
}

> +            {
> +                struct pci_dev *vf_pdev, *tmp;
> +                bool warn_stale_vfs = false;
> +
> +                list_for_each_entry_safe(vf_pdev, tmp, &pdev->vf_list, vf_list)
> +                {
> +                    list_del(&vf_pdev->vf_list);
> +                    vf_pdev->virtfn.pf_pdev = NULL;
> +                    vf_pdev->broken = true;
> +                    warn_stale_vfs = true;
> +                }
> +
> +                if ( warn_stale_vfs )
> +                    printk(XENLOG_WARNING "PCI SR-IOV PF %pp removed with VFs still present\n",
> +                           &pdev->sbdf);
> +            }
> +
>              if ( pdev->domain )
>              {
>                  write_lock(&pdev->domain->pci_lock);
> diff --git a/xen/include/xen/pci.h b/xen/include/xen/pci.h
> index ef56e80651d6..2ea168d5f914 100644
> --- a/xen/include/xen/pci.h
> +++ b/xen/include/xen/pci.h
> @@ -153,7 +153,17 @@ struct pci_dev {
>          unsigned int count;
>  #define PT_FAULT_THRESHOLD 10
>      } fault;
> +
> +    /*
> +     * List head if info.is_virtfn == false
> +     * List entry if info.is_virtfn == true
> +     */
> +    struct list_head vf_list;
>      u64 vf_rlen[6];
> +    struct {
> +        /* Only populated for VFs (info.is_virtfn == true) */

All comments here (specially the first ones) would better use PF and
VF consistently, rather than referring to other fields in the struct.
Specially because the fields can change names and the comments would
then become stale.

> +        const struct pci_dev *pf_pdev;        /* Link from VF to PF */
> +    } virtfn;

I'm unsure you need an outer virtfn struct, as it's only one field in
this patch?  Maybe more fields gets added by further patches?

Thanks, Roger.


^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH v6 1/3] x86/msi: harden stale pdev handling
  2024-10-28 17:53     ` Roger Pau Monné
@ 2024-10-31 11:29       ` Jan Beulich
  0 siblings, 0 replies; 30+ messages in thread
From: Jan Beulich @ 2024-10-31 11:29 UTC (permalink / raw)
  To: Roger Pau Monné
  Cc: Stewart Hildebrand, Andrew Cooper, Julien Grall,
	Stefano Stabellini, xen-devel

On 28.10.2024 18:53, Roger Pau Monné wrote:
> Anyway, no strong opinion about the commit message adjustment, so with
> the type changed:

Btw, while preparing this patch for committing I ended up confused by this:
I can't find any request to adjust the commit message. The only other thing
I had asked for where plain int -> unsigned int adjustments.

Jan


^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH v6 2/3] xen/pci: introduce PF<->VF links
  2024-10-28 17:02   ` Jan Beulich
@ 2024-11-01 20:16     ` Stewart Hildebrand
  2024-11-02 15:18       ` Daniel P. Smith
  2024-11-04  7:44       ` Jan Beulich
  0 siblings, 2 replies; 30+ messages in thread
From: Stewart Hildebrand @ 2024-11-01 20:16 UTC (permalink / raw)
  To: Jan Beulich
  Cc: Roger Pau Monné, Andrew Cooper, Julien Grall,
	Stefano Stabellini, xen-devel, Daniel P. Smith

+Daniel (XSM mention)

On 10/28/24 13:02, Jan Beulich wrote:
> On 18.10.2024 22:39, Stewart Hildebrand wrote:
>> Add links between a VF's struct pci_dev and its associated PF struct
>> pci_dev. Move the calls to pci_get_pdev()/pci_add_device() down to avoid
>> dropping and re-acquiring the pcidevs_lock().
>>
>> During PF removal, unlink VF from PF and mark the VF broken. As before,
>> VFs may exist without a corresponding PF, although now only with
>> pdev->broken = true.
>>
>> The hardware domain is expected to remove the associated VFs before
>> removing the PF. Print a warning in case a PF is removed with associated
>> VFs still present.
>>
>> Signed-off-by: Stewart Hildebrand <stewart.hildebrand@amd.com>
>> ---
>> Candidate for backport to 4.19 (the next patch depends on this one)
>>
>> v5->v6:
>> * move printk() before ASSERT_UNREACHABLE()
>> * warn about PF removal with VFs still present
> 
> Hmm, maybe I didn't make this clear enough when commenting on v5: I wasn't
> just after an adjustment to the commit message. I'm instead actively
> concerned of the resulting behavior. Question is whether we can reasonably
> do something about that.
> 
> Jan

Right. My suggestion then is to go back to roughly how it was done in
v4 [0]:

* Remove the VFs right away during PF removal, so that we don't end up
with stale VFs. Regarding XSM, assume that a domain with permission to
remove the PF is also allowed to remove the VFs. We should probably also
return an error from pci_remove_device in the case of removing the PF
with VFs still present (and still perform the removals despite returning
an error). Subsequent attempts by a domain to remove the VFs would
return an error (as they have already been removed), but that's expected
since we've taken a stance that PF-then-VF removal order is invalid
anyway.

While the above is what I prefer, I just want to mention other options I
considered for the scenario of PF removal with VFs still present:

* Increase the "scariness" of the warning message added in v6.

* Return an error from pci_remove_device (while still removing only the
PF). We would be left with stale VFs in Xen. At least this would
concretely inform dom0 that Xen takes issue with the PF-then-VF removal
order. Subsequent attempts by a domain to remove VFs, however
(un)likely, would succeed.

* Return an error from pci_remove_device and keep the PF and VFs. This
is IMO the worst option because then we would have a stale PF in
addition to stale VFs.

[0] https://lore.kernel.org/xen-devel/20240827035929.118003-1-stewart.hildebrand@amd.com/T/#t


^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH v6 2/3] xen/pci: introduce PF<->VF links
  2024-11-01 20:16     ` Stewart Hildebrand
@ 2024-11-02 15:18       ` Daniel P. Smith
  2024-11-04 11:45         ` Alejandro Vallejo
  2024-11-05  9:03         ` Roger Pau Monné
  2024-11-04  7:44       ` Jan Beulich
  1 sibling, 2 replies; 30+ messages in thread
From: Daniel P. Smith @ 2024-11-02 15:18 UTC (permalink / raw)
  To: Stewart Hildebrand, Jan Beulich
  Cc: Roger Pau Monné, Andrew Cooper, Julien Grall,
	Stefano Stabellini, xen-devel

On 11/1/24 16:16, Stewart Hildebrand wrote:
> +Daniel (XSM mention)
> 
> On 10/28/24 13:02, Jan Beulich wrote:
>> On 18.10.2024 22:39, Stewart Hildebrand wrote:
>>> Add links between a VF's struct pci_dev and its associated PF struct
>>> pci_dev. Move the calls to pci_get_pdev()/pci_add_device() down to avoid
>>> dropping and re-acquiring the pcidevs_lock().
>>>
>>> During PF removal, unlink VF from PF and mark the VF broken. As before,
>>> VFs may exist without a corresponding PF, although now only with
>>> pdev->broken = true.
>>>
>>> The hardware domain is expected to remove the associated VFs before
>>> removing the PF. Print a warning in case a PF is removed with associated
>>> VFs still present.
>>>
>>> Signed-off-by: Stewart Hildebrand <stewart.hildebrand@amd.com>
>>> ---
>>> Candidate for backport to 4.19 (the next patch depends on this one)
>>>
>>> v5->v6:
>>> * move printk() before ASSERT_UNREACHABLE()
>>> * warn about PF removal with VFs still present
>>
>> Hmm, maybe I didn't make this clear enough when commenting on v5: I wasn't
>> just after an adjustment to the commit message. I'm instead actively
>> concerned of the resulting behavior. Question is whether we can reasonably
>> do something about that.
>>
>> Jan
> 
> Right. My suggestion then is to go back to roughly how it was done in
> v4 [0]:
> 
> * Remove the VFs right away during PF removal, so that we don't end up
> with stale VFs. Regarding XSM, assume that a domain with permission to
> remove the PF is also allowed to remove the VFs. We should probably also
> return an error from pci_remove_device in the case of removing the PF
> with VFs still present (and still perform the removals despite returning
> an error). Subsequent attempts by a domain to remove the VFs would
> return an error (as they have already been removed), but that's expected
> since we've taken a stance that PF-then-VF removal order is invalid
> anyway.

I am not confident this is a safe assumption. It will likely be safe for 
probably 99% of the implementations. Apologies for not following 
closely, and correct me if I am wrong here, but from a resource 
perspective each VF can appear to the system as its own unique BDF and 
so I am fairly certain it would be possible to uniquely label each VF. 
For instance in the SVP architecture, the VF may be labeled to restrict 
control to a hardware domain within a Guest Virtual Platform while the 
PF may be restricted to the Supervisor Virtual Platform. In this 
scenario, the Guest would be torn down before the Supervisor so the VF 
should get released before the PF. But it's all theoretical, so I have 
no real implementation to point at that this could be checked/confirmed.

I am only raising this for awareness and not as an objection. If people 
want to punt that theoretical use case down the road until someone 
actually attempts it, I would not be opposed.

v/r,
dps


^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH v6 2/3] xen/pci: introduce PF<->VF links
  2024-11-01 20:16     ` Stewart Hildebrand
  2024-11-02 15:18       ` Daniel P. Smith
@ 2024-11-04  7:44       ` Jan Beulich
  2024-11-07 14:32         ` Stewart Hildebrand
  2024-11-08 12:42         ` Alejandro Vallejo
  1 sibling, 2 replies; 30+ messages in thread
From: Jan Beulich @ 2024-11-04  7:44 UTC (permalink / raw)
  To: Stewart Hildebrand
  Cc: Roger Pau Monné, Andrew Cooper, Julien Grall,
	Stefano Stabellini, xen-devel, Daniel P. Smith

On 01.11.2024 21:16, Stewart Hildebrand wrote:
> +Daniel (XSM mention)
> 
> On 10/28/24 13:02, Jan Beulich wrote:
>> On 18.10.2024 22:39, Stewart Hildebrand wrote:
>>> Add links between a VF's struct pci_dev and its associated PF struct
>>> pci_dev. Move the calls to pci_get_pdev()/pci_add_device() down to avoid
>>> dropping and re-acquiring the pcidevs_lock().
>>>
>>> During PF removal, unlink VF from PF and mark the VF broken. As before,
>>> VFs may exist without a corresponding PF, although now only with
>>> pdev->broken = true.
>>>
>>> The hardware domain is expected to remove the associated VFs before
>>> removing the PF. Print a warning in case a PF is removed with associated
>>> VFs still present.
>>>
>>> Signed-off-by: Stewart Hildebrand <stewart.hildebrand@amd.com>
>>> ---
>>> Candidate for backport to 4.19 (the next patch depends on this one)
>>>
>>> v5->v6:
>>> * move printk() before ASSERT_UNREACHABLE()
>>> * warn about PF removal with VFs still present
>>
>> Hmm, maybe I didn't make this clear enough when commenting on v5: I wasn't
>> just after an adjustment to the commit message. I'm instead actively
>> concerned of the resulting behavior. Question is whether we can reasonably
>> do something about that.
> 
> Right. My suggestion then is to go back to roughly how it was done in
> v4 [0]:
> 
> * Remove the VFs right away during PF removal, so that we don't end up
> with stale VFs. Regarding XSM, assume that a domain with permission to
> remove the PF is also allowed to remove the VFs. We should probably also
> return an error from pci_remove_device in the case of removing the PF
> with VFs still present (and still perform the removals despite returning
> an error). Subsequent attempts by a domain to remove the VFs would
> return an error (as they have already been removed), but that's expected
> since we've taken a stance that PF-then-VF removal order is invalid
> anyway.

Imo going back is not an option.

> While the above is what I prefer, I just want to mention other options I
> considered for the scenario of PF removal with VFs still present:
> 
> * Increase the "scariness" of the warning message added in v6.
> 
> * Return an error from pci_remove_device (while still removing only the
> PF). We would be left with stale VFs in Xen. At least this would
> concretely inform dom0 that Xen takes issue with the PF-then-VF removal
> order. Subsequent attempts by a domain to remove VFs, however
> (un)likely, would succeed.

Returning an error in such a case is a possibility, but comes with the
risk of confusion. Seeing such an error, a caller may itself assume the
device still is there, and retry its (with or without having removed the
VFs) removal at a later point.

> * Return an error from pci_remove_device and keep the PF and VFs. This
> is IMO the worst option because then we would have a stale PF in
> addition to stale VFs.

Yet this would at least be self-consistent, unlike the variant above. No
matter what, any failure to remove VFs and/or PFs correctly will need to
result in there being no attempt to physically remove the device.

You didn't enumerate an option lightly mentioned before, perhaps because
of its anticipated intrusiveness: Re-associate stale VFs with their PF,
once the PF is re-reported. Problem of course is that, aiui, the VFs
could in principle re-appear at a different BDF (albeit we have other
issues with potential bus-renumbering done by Dom0), and their count
could also change.

Jan

> [0] https://lore.kernel.org/xen-devel/20240827035929.118003-1-stewart.hildebrand@amd.com/T/#t



^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH v6 2/3] xen/pci: introduce PF<->VF links
  2024-11-02 15:18       ` Daniel P. Smith
@ 2024-11-04 11:45         ` Alejandro Vallejo
  2024-11-05  9:08           ` Roger Pau Monné
  2024-11-06 17:04           ` Daniel P. Smith
  2024-11-05  9:03         ` Roger Pau Monné
  1 sibling, 2 replies; 30+ messages in thread
From: Alejandro Vallejo @ 2024-11-04 11:45 UTC (permalink / raw)
  To: Daniel P. Smith, Stewart Hildebrand, Jan Beulich
  Cc: Roger Pau Monné, Andrew Cooper, Julien Grall,
	Stefano Stabellini, xen-devel

On Sat Nov 2, 2024 at 3:18 PM GMT, Daniel P. Smith wrote:
> On 11/1/24 16:16, Stewart Hildebrand wrote:
> > +Daniel (XSM mention)
> > 
> > On 10/28/24 13:02, Jan Beulich wrote:
> >> On 18.10.2024 22:39, Stewart Hildebrand wrote:
> >>> Add links between a VF's struct pci_dev and its associated PF struct
> >>> pci_dev. Move the calls to pci_get_pdev()/pci_add_device() down to avoid
> >>> dropping and re-acquiring the pcidevs_lock().
> >>>
> >>> During PF removal, unlink VF from PF and mark the VF broken. As before,
> >>> VFs may exist without a corresponding PF, although now only with
> >>> pdev->broken = true.
> >>>
> >>> The hardware domain is expected to remove the associated VFs before
> >>> removing the PF. Print a warning in case a PF is removed with associated
> >>> VFs still present.
> >>>
> >>> Signed-off-by: Stewart Hildebrand <stewart.hildebrand@amd.com>
> >>> ---
> >>> Candidate for backport to 4.19 (the next patch depends on this one)
> >>>
> >>> v5->v6:
> >>> * move printk() before ASSERT_UNREACHABLE()
> >>> * warn about PF removal with VFs still present
> >>
> >> Hmm, maybe I didn't make this clear enough when commenting on v5: I wasn't
> >> just after an adjustment to the commit message. I'm instead actively
> >> concerned of the resulting behavior. Question is whether we can reasonably
> >> do something about that.
> >>
> >> Jan
> > 
> > Right. My suggestion then is to go back to roughly how it was done in
> > v4 [0]:
> > 
> > * Remove the VFs right away during PF removal, so that we don't end up
> > with stale VFs. Regarding XSM, assume that a domain with permission to
> > remove the PF is also allowed to remove the VFs. We should probably also
> > return an error from pci_remove_device in the case of removing the PF
> > with VFs still present (and still perform the removals despite returning
> > an error). Subsequent attempts by a domain to remove the VFs would
> > return an error (as they have already been removed), but that's expected
> > since we've taken a stance that PF-then-VF removal order is invalid
> > anyway.
>
> I am not confident this is a safe assumption. It will likely be safe for 
> probably 99% of the implementations. Apologies for not following 
> closely, and correct me if I am wrong here, but from a resource 
> perspective each VF can appear to the system as its own unique BDF and 
> so I am fairly certain it would be possible to uniquely label each VF. 
> For instance in the SVP architecture, the VF may be labeled to restrict 
> control to a hardware domain within a Guest Virtual Platform while the 
> PF may be restricted to the Supervisor Virtual Platform. In this 
> scenario, the Guest would be torn down before the Supervisor so the VF 
> should get released before the PF. But it's all theoretical, so I have 
> no real implementation to point at that this could be checked/confirmed.
>
> I am only raising this for awareness and not as an objection. If people 
> want to punt that theoretical use case down the road until someone 
> actually attempts it, I would not be opposed.

Wouldn't it stand to reason then to act conditionally on the authority of the
caller?

i.e: If the caller has the (XSM-checked) authority to remove _BOTH_ PF and
VFs, remove all. If it doesn't have authority to remove the VFs then early exit
with an error, leaving the PF behind as well.

That would do the clean thing in the common case and be consistent with the
security policy even with a conflicting policy. The semantics are somewhat more
complex, but trying to remove a PF before removing the VFs is silly and the
only sensible thing (imo) is to help out during cleanup _or_ be strict about
checking.

>
> v/r,
> dps

Cheers,
Alejandro


^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH v6 2/3] xen/pci: introduce PF<->VF links
  2024-11-02 15:18       ` Daniel P. Smith
  2024-11-04 11:45         ` Alejandro Vallejo
@ 2024-11-05  9:03         ` Roger Pau Monné
  2024-11-06 16:31           ` Daniel P. Smith
  1 sibling, 1 reply; 30+ messages in thread
From: Roger Pau Monné @ 2024-11-05  9:03 UTC (permalink / raw)
  To: Daniel P. Smith
  Cc: Stewart Hildebrand, Jan Beulich, Andrew Cooper, Julien Grall,
	Stefano Stabellini, xen-devel

On Sat, Nov 02, 2024 at 11:18:24AM -0400, Daniel P. Smith wrote:
> On 11/1/24 16:16, Stewart Hildebrand wrote:
> > +Daniel (XSM mention)
> > 
> > On 10/28/24 13:02, Jan Beulich wrote:
> > > On 18.10.2024 22:39, Stewart Hildebrand wrote:
> > > > Add links between a VF's struct pci_dev and its associated PF struct
> > > > pci_dev. Move the calls to pci_get_pdev()/pci_add_device() down to avoid
> > > > dropping and re-acquiring the pcidevs_lock().
> > > > 
> > > > During PF removal, unlink VF from PF and mark the VF broken. As before,
> > > > VFs may exist without a corresponding PF, although now only with
> > > > pdev->broken = true.
> > > > 
> > > > The hardware domain is expected to remove the associated VFs before
> > > > removing the PF. Print a warning in case a PF is removed with associated
> > > > VFs still present.
> > > > 
> > > > Signed-off-by: Stewart Hildebrand <stewart.hildebrand@amd.com>
> > > > ---
> > > > Candidate for backport to 4.19 (the next patch depends on this one)
> > > > 
> > > > v5->v6:
> > > > * move printk() before ASSERT_UNREACHABLE()
> > > > * warn about PF removal with VFs still present
> > > 
> > > Hmm, maybe I didn't make this clear enough when commenting on v5: I wasn't
> > > just after an adjustment to the commit message. I'm instead actively
> > > concerned of the resulting behavior. Question is whether we can reasonably
> > > do something about that.
> > > 
> > > Jan
> > 
> > Right. My suggestion then is to go back to roughly how it was done in
> > v4 [0]:
> > 
> > * Remove the VFs right away during PF removal, so that we don't end up
> > with stale VFs. Regarding XSM, assume that a domain with permission to
> > remove the PF is also allowed to remove the VFs. We should probably also
> > return an error from pci_remove_device in the case of removing the PF
> > with VFs still present (and still perform the removals despite returning
> > an error). Subsequent attempts by a domain to remove the VFs would
> > return an error (as they have already been removed), but that's expected
> > since we've taken a stance that PF-then-VF removal order is invalid
> > anyway.
> 
> I am not confident this is a safe assumption. It will likely be safe for
> probably 99% of the implementations. Apologies for not following closely,
> and correct me if I am wrong here, but from a resource perspective each VF
> can appear to the system as its own unique BDF and so I am fairly certain it
> would be possible to uniquely label each VF. For instance in the SVP
> architecture, the VF may be labeled to restrict control to a hardware domain
> within a Guest Virtual Platform while the PF may be restricted to the
> Supervisor Virtual Platform. In this scenario, the Guest would be torn down
> before the Supervisor so the VF should get released before the PF.

The VF getting released before the PF is what we would usually expect?

I'm a bit confused because a guest being torn down doesn't imply that
the device is removed from Xen (iow: a call to pci_remove_device()).
Removing a device is hot-unplugging the PCI device from Xen, not
deassinging from a guest.

I would also be uneasy with assigning a PF to a non-privileged domain,
specially if VFs from that same device are being assigned to trusted
domains.

My assumption is that you generally want the PFs assigned to the
hardware domain, and the VFs assigned to any other domains (trusted or
not).

Thanks, Roger.


^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH v6 2/3] xen/pci: introduce PF<->VF links
  2024-11-04 11:45         ` Alejandro Vallejo
@ 2024-11-05  9:08           ` Roger Pau Monné
  2024-11-06 16:20             ` Daniel P. Smith
  2024-11-06 17:04           ` Daniel P. Smith
  1 sibling, 1 reply; 30+ messages in thread
From: Roger Pau Monné @ 2024-11-05  9:08 UTC (permalink / raw)
  To: Alejandro Vallejo
  Cc: Daniel P. Smith, Stewart Hildebrand, Jan Beulich, Andrew Cooper,
	Julien Grall, Stefano Stabellini, xen-devel

On Mon, Nov 04, 2024 at 11:45:05AM +0000, Alejandro Vallejo wrote:
> On Sat Nov 2, 2024 at 3:18 PM GMT, Daniel P. Smith wrote:
> > On 11/1/24 16:16, Stewart Hildebrand wrote:
> > > +Daniel (XSM mention)
> > > 
> > > On 10/28/24 13:02, Jan Beulich wrote:
> > >> On 18.10.2024 22:39, Stewart Hildebrand wrote:
> > >>> Add links between a VF's struct pci_dev and its associated PF struct
> > >>> pci_dev. Move the calls to pci_get_pdev()/pci_add_device() down to avoid
> > >>> dropping and re-acquiring the pcidevs_lock().
> > >>>
> > >>> During PF removal, unlink VF from PF and mark the VF broken. As before,
> > >>> VFs may exist without a corresponding PF, although now only with
> > >>> pdev->broken = true.
> > >>>
> > >>> The hardware domain is expected to remove the associated VFs before
> > >>> removing the PF. Print a warning in case a PF is removed with associated
> > >>> VFs still present.
> > >>>
> > >>> Signed-off-by: Stewart Hildebrand <stewart.hildebrand@amd.com>
> > >>> ---
> > >>> Candidate for backport to 4.19 (the next patch depends on this one)
> > >>>
> > >>> v5->v6:
> > >>> * move printk() before ASSERT_UNREACHABLE()
> > >>> * warn about PF removal with VFs still present
> > >>
> > >> Hmm, maybe I didn't make this clear enough when commenting on v5: I wasn't
> > >> just after an adjustment to the commit message. I'm instead actively
> > >> concerned of the resulting behavior. Question is whether we can reasonably
> > >> do something about that.
> > >>
> > >> Jan
> > > 
> > > Right. My suggestion then is to go back to roughly how it was done in
> > > v4 [0]:
> > > 
> > > * Remove the VFs right away during PF removal, so that we don't end up
> > > with stale VFs. Regarding XSM, assume that a domain with permission to
> > > remove the PF is also allowed to remove the VFs. We should probably also
> > > return an error from pci_remove_device in the case of removing the PF
> > > with VFs still present (and still perform the removals despite returning
> > > an error). Subsequent attempts by a domain to remove the VFs would
> > > return an error (as they have already been removed), but that's expected
> > > since we've taken a stance that PF-then-VF removal order is invalid
> > > anyway.
> >
> > I am not confident this is a safe assumption. It will likely be safe for 
> > probably 99% of the implementations. Apologies for not following 
> > closely, and correct me if I am wrong here, but from a resource 
> > perspective each VF can appear to the system as its own unique BDF and 
> > so I am fairly certain it would be possible to uniquely label each VF. 
> > For instance in the SVP architecture, the VF may be labeled to restrict 
> > control to a hardware domain within a Guest Virtual Platform while the 
> > PF may be restricted to the Supervisor Virtual Platform. In this 
> > scenario, the Guest would be torn down before the Supervisor so the VF 
> > should get released before the PF. But it's all theoretical, so I have 
> > no real implementation to point at that this could be checked/confirmed.
> >
> > I am only raising this for awareness and not as an objection. If people 
> > want to punt that theoretical use case down the road until someone 
> > actually attempts it, I would not be opposed.
> 
> Wouldn't it stand to reason then to act conditionally on the authority of the
> caller?
> 
> i.e: If the caller has the (XSM-checked) authority to remove _BOTH_ PF and
> VFs, remove all. If it doesn't have authority to remove the VFs then early exit
> with an error, leaving the PF behind as well.

I'm unsure if it makes sense to have an entity that's allowed to issue
a pci_remove_device() against a PF, but not against the VFs of the
device.

The owner of the PF should be capable of disabling SR-IOV, at which
point all the VFs disappear from the PCI config space.  If such entity
is capable of controlling the SR-IOV capability, it should also be
able to issue pci_remove_device() calls against the VFs.

Thanks, Roger.


^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH v6 2/3] xen/pci: introduce PF<->VF links
  2024-11-05  9:08           ` Roger Pau Monné
@ 2024-11-06 16:20             ` Daniel P. Smith
  0 siblings, 0 replies; 30+ messages in thread
From: Daniel P. Smith @ 2024-11-06 16:20 UTC (permalink / raw)
  To: Roger Pau Monné, Alejandro Vallejo
  Cc: Stewart Hildebrand, Jan Beulich, Andrew Cooper, Julien Grall,
	Stefano Stabellini, xen-devel

On 11/5/24 04:08, Roger Pau Monné wrote:
> On Mon, Nov 04, 2024 at 11:45:05AM +0000, Alejandro Vallejo wrote:
>> On Sat Nov 2, 2024 at 3:18 PM GMT, Daniel P. Smith wrote:
>>> On 11/1/24 16:16, Stewart Hildebrand wrote:
>>>> +Daniel (XSM mention)
>>>>
>>>> On 10/28/24 13:02, Jan Beulich wrote:
>>>>> On 18.10.2024 22:39, Stewart Hildebrand wrote:
>>>>>> Add links between a VF's struct pci_dev and its associated PF struct
>>>>>> pci_dev. Move the calls to pci_get_pdev()/pci_add_device() down to avoid
>>>>>> dropping and re-acquiring the pcidevs_lock().
>>>>>>
>>>>>> During PF removal, unlink VF from PF and mark the VF broken. As before,
>>>>>> VFs may exist without a corresponding PF, although now only with
>>>>>> pdev->broken = true.
>>>>>>
>>>>>> The hardware domain is expected to remove the associated VFs before
>>>>>> removing the PF. Print a warning in case a PF is removed with associated
>>>>>> VFs still present.
>>>>>>
>>>>>> Signed-off-by: Stewart Hildebrand <stewart.hildebrand@amd.com>
>>>>>> ---
>>>>>> Candidate for backport to 4.19 (the next patch depends on this one)
>>>>>>
>>>>>> v5->v6:
>>>>>> * move printk() before ASSERT_UNREACHABLE()
>>>>>> * warn about PF removal with VFs still present
>>>>>
>>>>> Hmm, maybe I didn't make this clear enough when commenting on v5: I wasn't
>>>>> just after an adjustment to the commit message. I'm instead actively
>>>>> concerned of the resulting behavior. Question is whether we can reasonably
>>>>> do something about that.
>>>>>
>>>>> Jan
>>>>
>>>> Right. My suggestion then is to go back to roughly how it was done in
>>>> v4 [0]:
>>>>
>>>> * Remove the VFs right away during PF removal, so that we don't end up
>>>> with stale VFs. Regarding XSM, assume that a domain with permission to
>>>> remove the PF is also allowed to remove the VFs. We should probably also
>>>> return an error from pci_remove_device in the case of removing the PF
>>>> with VFs still present (and still perform the removals despite returning
>>>> an error). Subsequent attempts by a domain to remove the VFs would
>>>> return an error (as they have already been removed), but that's expected
>>>> since we've taken a stance that PF-then-VF removal order is invalid
>>>> anyway.
>>>
>>> I am not confident this is a safe assumption. It will likely be safe for
>>> probably 99% of the implementations. Apologies for not following
>>> closely, and correct me if I am wrong here, but from a resource
>>> perspective each VF can appear to the system as its own unique BDF and
>>> so I am fairly certain it would be possible to uniquely label each VF.
>>> For instance in the SVP architecture, the VF may be labeled to restrict
>>> control to a hardware domain within a Guest Virtual Platform while the
>>> PF may be restricted to the Supervisor Virtual Platform. In this
>>> scenario, the Guest would be torn down before the Supervisor so the VF
>>> should get released before the PF. But it's all theoretical, so I have
>>> no real implementation to point at that this could be checked/confirmed.
>>>
>>> I am only raising this for awareness and not as an objection. If people
>>> want to punt that theoretical use case down the road until someone
>>> actually attempts it, I would not be opposed.
>>
>> Wouldn't it stand to reason then to act conditionally on the authority of the
>> caller?
>>
>> i.e: If the caller has the (XSM-checked) authority to remove _BOTH_ PF and
>> VFs, remove all. If it doesn't have authority to remove the VFs then early exit
>> with an error, leaving the PF behind as well.
> 
> I'm unsure if it makes sense to have an entity that's allowed to issue
> a pci_remove_device() against a PF, but not against the VFs of the
> device.

Apologies for not fully defining how SVP would work. The Supervisor is 
the one of the few domains considered running at the higher trust level.

When I was referring to restricting the VF, for instance, that a VF of 
L:M:N may be assigned label gvp1_pci_t that only allows guest VP 1 
access, while VF X:Y:Z would be labeled gvp2_pci_t that grants guest VP 
2 access only.

At the same time, the Supervisor would be allowed in the policy to 
remove all VFs, in the example above it would have access to 
gvp1/2_pci_t labeled devices. In theory, it would have attempted to have 
the Guest VP tear down (or unplug the VF in a hotplug scenario) the 
virtual device. But in the end, it can't rely on the Guest VP to 
properly shutdown/unplug, and must be able to properly manage the system.

> The owner of the PF should be capable of disabling SR-IOV, at which
> point all the VFs disappear from the PCI config space.  If such entity
> is capable of controlling the SR-IOV capability, it should also be
> able to issue pci_remove_device() calls against the VFs.

Correct, as I mentioned above.

v/r,
dps


^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH v6 2/3] xen/pci: introduce PF<->VF links
  2024-11-05  9:03         ` Roger Pau Monné
@ 2024-11-06 16:31           ` Daniel P. Smith
  0 siblings, 0 replies; 30+ messages in thread
From: Daniel P. Smith @ 2024-11-06 16:31 UTC (permalink / raw)
  To: Roger Pau Monné
  Cc: Stewart Hildebrand, Jan Beulich, Andrew Cooper, Julien Grall,
	Stefano Stabellini, xen-devel

On 11/5/24 04:03, Roger Pau Monné wrote:
> On Sat, Nov 02, 2024 at 11:18:24AM -0400, Daniel P. Smith wrote:
>> On 11/1/24 16:16, Stewart Hildebrand wrote:
>>> +Daniel (XSM mention)
>>>
>>> On 10/28/24 13:02, Jan Beulich wrote:
>>>> On 18.10.2024 22:39, Stewart Hildebrand wrote:
>>>>> Add links between a VF's struct pci_dev and its associated PF struct
>>>>> pci_dev. Move the calls to pci_get_pdev()/pci_add_device() down to avoid
>>>>> dropping and re-acquiring the pcidevs_lock().
>>>>>
>>>>> During PF removal, unlink VF from PF and mark the VF broken. As before,
>>>>> VFs may exist without a corresponding PF, although now only with
>>>>> pdev->broken = true.
>>>>>
>>>>> The hardware domain is expected to remove the associated VFs before
>>>>> removing the PF. Print a warning in case a PF is removed with associated
>>>>> VFs still present.
>>>>>
>>>>> Signed-off-by: Stewart Hildebrand <stewart.hildebrand@amd.com>
>>>>> ---
>>>>> Candidate for backport to 4.19 (the next patch depends on this one)
>>>>>
>>>>> v5->v6:
>>>>> * move printk() before ASSERT_UNREACHABLE()
>>>>> * warn about PF removal with VFs still present
>>>>
>>>> Hmm, maybe I didn't make this clear enough when commenting on v5: I wasn't
>>>> just after an adjustment to the commit message. I'm instead actively
>>>> concerned of the resulting behavior. Question is whether we can reasonably
>>>> do something about that.
>>>>
>>>> Jan
>>>
>>> Right. My suggestion then is to go back to roughly how it was done in
>>> v4 [0]:
>>>
>>> * Remove the VFs right away during PF removal, so that we don't end up
>>> with stale VFs. Regarding XSM, assume that a domain with permission to
>>> remove the PF is also allowed to remove the VFs. We should probably also
>>> return an error from pci_remove_device in the case of removing the PF
>>> with VFs still present (and still perform the removals despite returning
>>> an error). Subsequent attempts by a domain to remove the VFs would
>>> return an error (as they have already been removed), but that's expected
>>> since we've taken a stance that PF-then-VF removal order is invalid
>>> anyway.
>>
>> I am not confident this is a safe assumption. It will likely be safe for
>> probably 99% of the implementations. Apologies for not following closely,
>> and correct me if I am wrong here, but from a resource perspective each VF
>> can appear to the system as its own unique BDF and so I am fairly certain it
>> would be possible to uniquely label each VF. For instance in the SVP
>> architecture, the VF may be labeled to restrict control to a hardware domain
>> within a Guest Virtual Platform while the PF may be restricted to the
>> Supervisor Virtual Platform. In this scenario, the Guest would be torn down
>> before the Supervisor so the VF should get released before the PF.
> 
> The VF getting released before the PF is what we would usually expect?
> 
> I'm a bit confused because a guest being torn down doesn't imply that
> the device is removed from Xen (iow: a call to pci_remove_device()).
> Removing a device is hot-unplugging the PCI device from Xen, not
> deassinging from a guest.

I was providing a use-case that was crafted, just not implemented. I 
have not looked at SRIOV in some time and not at the level necessary to 
determine this, but I would be uneasy thinking a VF could just be 
released from a guest to be reassigned either to the host hardware 
domain, aka Supervisor VP, or another guest. Perhaps a reset of the VF 
is enough or it might need more, I have not looked at the details at 
this point since we are still ages away from Xen being exactly capable 
of really doing a true implementation of the SVP architecture.

> I would also be uneasy with assigning a PF to a non-privileged domain,
> specially if VFs from that same device are being assigned to trusted
> domains.

As mentioned in the other response, the PF is in a higher trusted domain 
level than where the VFs would be assigned.

> My assumption is that you generally want the PFs assigned to the
> hardware domain, and the VFs assigned to any other domains (trusted or
> not).

In the SVP architecture, there is more than one "hardware domain".

v/r,
dps



^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH v6 2/3] xen/pci: introduce PF<->VF links
  2024-11-04 11:45         ` Alejandro Vallejo
  2024-11-05  9:08           ` Roger Pau Monné
@ 2024-11-06 17:04           ` Daniel P. Smith
  1 sibling, 0 replies; 30+ messages in thread
From: Daniel P. Smith @ 2024-11-06 17:04 UTC (permalink / raw)
  To: Alejandro Vallejo, Stewart Hildebrand, Jan Beulich
  Cc: Roger Pau Monné, Andrew Cooper, Julien Grall,
	Stefano Stabellini, xen-devel

On 11/4/24 06:45, Alejandro Vallejo wrote:
> On Sat Nov 2, 2024 at 3:18 PM GMT, Daniel P. Smith wrote:
>> On 11/1/24 16:16, Stewart Hildebrand wrote:
>>> +Daniel (XSM mention)
>>>
>>> On 10/28/24 13:02, Jan Beulich wrote:
>>>> On 18.10.2024 22:39, Stewart Hildebrand wrote:
>>>>> Add links between a VF's struct pci_dev and its associated PF struct
>>>>> pci_dev. Move the calls to pci_get_pdev()/pci_add_device() down to avoid
>>>>> dropping and re-acquiring the pcidevs_lock().
>>>>>
>>>>> During PF removal, unlink VF from PF and mark the VF broken. As before,
>>>>> VFs may exist without a corresponding PF, although now only with
>>>>> pdev->broken = true.
>>>>>
>>>>> The hardware domain is expected to remove the associated VFs before
>>>>> removing the PF. Print a warning in case a PF is removed with associated
>>>>> VFs still present.
>>>>>
>>>>> Signed-off-by: Stewart Hildebrand <stewart.hildebrand@amd.com>
>>>>> ---
>>>>> Candidate for backport to 4.19 (the next patch depends on this one)
>>>>>
>>>>> v5->v6:
>>>>> * move printk() before ASSERT_UNREACHABLE()
>>>>> * warn about PF removal with VFs still present
>>>>
>>>> Hmm, maybe I didn't make this clear enough when commenting on v5: I wasn't
>>>> just after an adjustment to the commit message. I'm instead actively
>>>> concerned of the resulting behavior. Question is whether we can reasonably
>>>> do something about that.
>>>>
>>>> Jan
>>>
>>> Right. My suggestion then is to go back to roughly how it was done in
>>> v4 [0]:
>>>
>>> * Remove the VFs right away during PF removal, so that we don't end up
>>> with stale VFs. Regarding XSM, assume that a domain with permission to
>>> remove the PF is also allowed to remove the VFs. We should probably also
>>> return an error from pci_remove_device in the case of removing the PF
>>> with VFs still present (and still perform the removals despite returning
>>> an error). Subsequent attempts by a domain to remove the VFs would
>>> return an error (as they have already been removed), but that's expected
>>> since we've taken a stance that PF-then-VF removal order is invalid
>>> anyway.
>>
>> I am not confident this is a safe assumption. It will likely be safe for
>> probably 99% of the implementations. Apologies for not following
>> closely, and correct me if I am wrong here, but from a resource
>> perspective each VF can appear to the system as its own unique BDF and
>> so I am fairly certain it would be possible to uniquely label each VF.
>> For instance in the SVP architecture, the VF may be labeled to restrict
>> control to a hardware domain within a Guest Virtual Platform while the
>> PF may be restricted to the Supervisor Virtual Platform. In this
>> scenario, the Guest would be torn down before the Supervisor so the VF
>> should get released before the PF. But it's all theoretical, so I have
>> no real implementation to point at that this could be checked/confirmed.
>>
>> I am only raising this for awareness and not as an objection. If people
>> want to punt that theoretical use case down the road until someone
>> actually attempts it, I would not be opposed.
> 
> Wouldn't it stand to reason then to act conditionally on the authority of the
> caller?

Correct, and as I hopefully clarified in response to Roger, like 
everything with virtualization, it's turtles all the way down. Try 
having this model in mind,

   +----------------+  +-------------------------------+
   | Supervisor VP  |  | Guest VP                      |
   |                |  |                               |
   | +------------+ |  | +------------+ +------------+ |
   | |    HWDOM   | |  | |    HWDOM   | |    Guest   | |
   | |  +------+  | |  | |  +------+  | |  +------+  | |
   | |  |      |  | |  | |  |      |  | |  |      |  | |
   | |  |  PF  |  | |  | |  |  VF  +--+-+-->  PCI |  | |
   | |  |      |  | |  | |  |      |  | |  |      |  | |
   | |  +------+  | |  | |  +------+  | |  +------+  | |
   | +------------+ |  | +------------+ +------------+ |
   +----------------+  +-------------------------------+


> i.e: If the caller has the (XSM-checked) authority to remove _BOTH_ PF and
> VFs, remove all. If it doesn't have authority to remove the VFs then early exit
> with an error, leaving the PF behind as well.

Yes, I would agree that is reasonable. Please, I am not trying to make 
this complicated, but just trying to give a more advanced (nested?) 
model to consider. Not something that we have to figure out all the 
details of how to make it all work. I am just saying, consider that a VF 
may have a different label than the PF which allows another domain to be 
able to take actions on the device as if the domain fully owns that 
virtual PCI device. A code comment around the xsm check(s) could state 
that it is expected that a system with access to the PF label must have 
access to all VF labels and the dummy policy should already be wired up 
to only allow the hwdom to do it.

> That would do the clean thing in the common case and be consistent with the
> security policy even with a conflicting policy. The semantics are somewhat more
> complex, but trying to remove a PF before removing the VFs is silly and the
> only sensible thing (imo) is to help out during cleanup _or_ be strict about
> checking.

Yes, and I was not trying to suggest the PF would be removed before the 
VFs. Apologies if it sounds as though I am repeating, myself,, all that 
I was attempting to say is that the VFs should be able to be checked 
separately from the PF. This would allow a domain that was granted 
ownership of a VF, to be able to unplug the VF and only that VF from the 
system. I would leave it to the PCI logic to manage the sequencing of 
the PF and VF operations. In such a scenario, I would expect the hwdom 
to have the facilities to monitor/check the status of the VF to manage 
the case that a guest unplugged it.

v/r,
dps


^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH v6 2/3] xen/pci: introduce PF<->VF links
  2024-11-04  7:44       ` Jan Beulich
@ 2024-11-07 14:32         ` Stewart Hildebrand
  2024-11-08 12:42         ` Alejandro Vallejo
  1 sibling, 0 replies; 30+ messages in thread
From: Stewart Hildebrand @ 2024-11-07 14:32 UTC (permalink / raw)
  To: Jan Beulich
  Cc: Roger Pau Monné, Andrew Cooper, Julien Grall,
	Stefano Stabellini, xen-devel, Daniel P. Smith

On 11/4/24 02:44, Jan Beulich wrote:
> On 01.11.2024 21:16, Stewart Hildebrand wrote:
>> +Daniel (XSM mention)
>>
>> On 10/28/24 13:02, Jan Beulich wrote:
>>> On 18.10.2024 22:39, Stewart Hildebrand wrote:
>>>> Add links between a VF's struct pci_dev and its associated PF struct
>>>> pci_dev. Move the calls to pci_get_pdev()/pci_add_device() down to avoid
>>>> dropping and re-acquiring the pcidevs_lock().
>>>>
>>>> During PF removal, unlink VF from PF and mark the VF broken. As before,
>>>> VFs may exist without a corresponding PF, although now only with
>>>> pdev->broken = true.
>>>>
>>>> The hardware domain is expected to remove the associated VFs before
>>>> removing the PF. Print a warning in case a PF is removed with associated
>>>> VFs still present.
>>>>
>>>> Signed-off-by: Stewart Hildebrand <stewart.hildebrand@amd.com>
>>>> ---
>>>> Candidate for backport to 4.19 (the next patch depends on this one)
>>>>
>>>> v5->v6:
>>>> * move printk() before ASSERT_UNREACHABLE()
>>>> * warn about PF removal with VFs still present
>>>
>>> Hmm, maybe I didn't make this clear enough when commenting on v5: I wasn't
>>> just after an adjustment to the commit message. I'm instead actively
>>> concerned of the resulting behavior. Question is whether we can reasonably
>>> do something about that.
>>
>> Right. My suggestion then is to go back to roughly how it was done in
>> v4 [0]:
>>
>> * Remove the VFs right away during PF removal, so that we don't end up
>> with stale VFs. Regarding XSM, assume that a domain with permission to
>> remove the PF is also allowed to remove the VFs. We should probably also
>> return an error from pci_remove_device in the case of removing the PF
>> with VFs still present (and still perform the removals despite returning
>> an error). Subsequent attempts by a domain to remove the VFs would
>> return an error (as they have already been removed), but that's expected
>> since we've taken a stance that PF-then-VF removal order is invalid
>> anyway.
> 
> Imo going back is not an option.
> 
>> While the above is what I prefer, I just want to mention other options I
>> considered for the scenario of PF removal with VFs still present:
>>
>> * Increase the "scariness" of the warning message added in v6.
>>
>> * Return an error from pci_remove_device (while still removing only the
>> PF). We would be left with stale VFs in Xen. At least this would
>> concretely inform dom0 that Xen takes issue with the PF-then-VF removal
>> order. Subsequent attempts by a domain to remove VFs, however
>> (un)likely, would succeed.
> 
> Returning an error in such a case is a possibility, but comes with the
> risk of confusion. Seeing such an error, a caller may itself assume the
> device still is there, and retry its (with or without having removed the
> VFs) removal at a later point.
> 
>> * Return an error from pci_remove_device and keep the PF and VFs. This
>> is IMO the worst option because then we would have a stale PF in
>> addition to stale VFs.
> 
> Yet this would at least be self-consistent, unlike the variant above. No
> matter what, any failure to remove VFs and/or PFs correctly will need to
> result in there being no attempt to physically remove the device.
> 
> You didn't enumerate an option lightly mentioned before, perhaps because
> of its anticipated intrusiveness: Re-associate stale VFs with their PF,
> once the PF is re-reported.

I'll plan on this approach for v7.

> Problem of course is that, aiui, the VFs
> could in principle re-appear at a different BDF (albeit we have other
> issues with potential bus-renumbering done by Dom0), and their count
> could also change.
> 
> Jan
> 
>> [0] https://lore.kernel.org/xen-devel/20240827035929.118003-1-stewart.hildebrand@amd.com/T/#t
> 



^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH v6 2/3] xen/pci: introduce PF<->VF links
  2024-11-04  7:44       ` Jan Beulich
  2024-11-07 14:32         ` Stewart Hildebrand
@ 2024-11-08 12:42         ` Alejandro Vallejo
  2024-11-08 13:17           ` Jan Beulich
  1 sibling, 1 reply; 30+ messages in thread
From: Alejandro Vallejo @ 2024-11-08 12:42 UTC (permalink / raw)
  To: Jan Beulich, Stewart Hildebrand
  Cc: Roger Pau Monné, Andrew Cooper, Julien Grall,
	Stefano Stabellini, xen-devel, Daniel P. Smith

On Mon Nov 4, 2024 at 7:44 AM GMT, Jan Beulich wrote:
> On 01.11.2024 21:16, Stewart Hildebrand wrote:
> > +Daniel (XSM mention)
> > 
> > On 10/28/24 13:02, Jan Beulich wrote:
> >> On 18.10.2024 22:39, Stewart Hildebrand wrote:
> >>> Add links between a VF's struct pci_dev and its associated PF struct
> >>> pci_dev. Move the calls to pci_get_pdev()/pci_add_device() down to avoid
> >>> dropping and re-acquiring the pcidevs_lock().
> >>>
> >>> During PF removal, unlink VF from PF and mark the VF broken. As before,
> >>> VFs may exist without a corresponding PF, although now only with
> >>> pdev->broken = true.
> >>>
> >>> The hardware domain is expected to remove the associated VFs before
> >>> removing the PF. Print a warning in case a PF is removed with associated
> >>> VFs still present.
> >>>
> >>> Signed-off-by: Stewart Hildebrand <stewart.hildebrand@amd.com>
> >>> ---
> >>> Candidate for backport to 4.19 (the next patch depends on this one)
> >>>
> >>> v5->v6:
> >>> * move printk() before ASSERT_UNREACHABLE()
> >>> * warn about PF removal with VFs still present
> >>
> >> Hmm, maybe I didn't make this clear enough when commenting on v5: I wasn't
> >> just after an adjustment to the commit message. I'm instead actively
> >> concerned of the resulting behavior. Question is whether we can reasonably
> >> do something about that.
> > 
> > Right. My suggestion then is to go back to roughly how it was done in
> > v4 [0]:
> > 
> > * Remove the VFs right away during PF removal, so that we don't end up
> > with stale VFs. Regarding XSM, assume that a domain with permission to
> > remove the PF is also allowed to remove the VFs. We should probably also
> > return an error from pci_remove_device in the case of removing the PF
> > with VFs still present (and still perform the removals despite returning
> > an error). Subsequent attempts by a domain to remove the VFs would
> > return an error (as they have already been removed), but that's expected
> > since we've taken a stance that PF-then-VF removal order is invalid
> > anyway.
>
> Imo going back is not an option.
>
> > While the above is what I prefer, I just want to mention other options I
> > considered for the scenario of PF removal with VFs still present:
> > 
> > * Increase the "scariness" of the warning message added in v6.
> > 
> > * Return an error from pci_remove_device (while still removing only the
> > PF). We would be left with stale VFs in Xen. At least this would
> > concretely inform dom0 that Xen takes issue with the PF-then-VF removal
> > order. Subsequent attempts by a domain to remove VFs, however
> > (un)likely, would succeed.
>
> Returning an error in such a case is a possibility, but comes with the
> risk of confusion. Seeing such an error, a caller may itself assume the
> device still is there, and retry its (with or without having removed the
> VFs) removal at a later point.
>
> > * Return an error from pci_remove_device and keep the PF and VFs. This
> > is IMO the worst option because then we would have a stale PF in
> > addition to stale VFs.
>
> Yet this would at least be self-consistent, unlike the variant above. No
> matter what, any failure to remove VFs and/or PFs correctly will need to
> result in there being no attempt to physically remove the device.
>
> You didn't enumerate an option lightly mentioned before, perhaps because
> of its anticipated intrusiveness: Re-associate stale VFs with their PF,
> once the PF is re-reported. Problem of course is that, aiui, the VFs
> could in principle re-appear at a different BDF (albeit we have other
> issues with potential bus-renumbering done by Dom0), and their count
> could also change.

Are you enumerating it for completeness or suggesting it should be done?

Maybe I'm missing something here (and please, do tell me what if so), but why
would this option be desirable at all? What would benefit from such semantics
(as opposed to any of the others)? It would break the lifetime dependency
between PF and VFs, and that doesn't strike me as a feature. It also turns
kernel bugs into a fine implementations by making promises about how state is
persisted, but the consequences of that appear to be too far reaching to know
for sure it's 100% ok.

From afar, it sounds like trying to turn a bug into a feature. And that cannot
always be done sanely. But again, maybe I might very well be missing
something...

>
> Jan
>
> > [0] https://lore.kernel.org/xen-devel/20240827035929.118003-1-stewart.hildebrand@amd.com/T/#t

Cheers,
Alejandro


^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH v6 2/3] xen/pci: introduce PF<->VF links
  2024-11-08 12:42         ` Alejandro Vallejo
@ 2024-11-08 13:17           ` Jan Beulich
  2024-11-08 15:19             ` Roger Pau Monné
  0 siblings, 1 reply; 30+ messages in thread
From: Jan Beulich @ 2024-11-08 13:17 UTC (permalink / raw)
  To: Alejandro Vallejo
  Cc: Roger Pau Monné, Andrew Cooper, Julien Grall,
	Stefano Stabellini, xen-devel, Daniel P. Smith,
	Stewart Hildebrand

On 08.11.2024 13:42, Alejandro Vallejo wrote:
> On Mon Nov 4, 2024 at 7:44 AM GMT, Jan Beulich wrote:
>> On 01.11.2024 21:16, Stewart Hildebrand wrote:
>>> +Daniel (XSM mention)
>>>
>>> On 10/28/24 13:02, Jan Beulich wrote:
>>>> On 18.10.2024 22:39, Stewart Hildebrand wrote:
>>>>> Add links between a VF's struct pci_dev and its associated PF struct
>>>>> pci_dev. Move the calls to pci_get_pdev()/pci_add_device() down to avoid
>>>>> dropping and re-acquiring the pcidevs_lock().
>>>>>
>>>>> During PF removal, unlink VF from PF and mark the VF broken. As before,
>>>>> VFs may exist without a corresponding PF, although now only with
>>>>> pdev->broken = true.
>>>>>
>>>>> The hardware domain is expected to remove the associated VFs before
>>>>> removing the PF. Print a warning in case a PF is removed with associated
>>>>> VFs still present.
>>>>>
>>>>> Signed-off-by: Stewart Hildebrand <stewart.hildebrand@amd.com>
>>>>> ---
>>>>> Candidate for backport to 4.19 (the next patch depends on this one)
>>>>>
>>>>> v5->v6:
>>>>> * move printk() before ASSERT_UNREACHABLE()
>>>>> * warn about PF removal with VFs still present
>>>>
>>>> Hmm, maybe I didn't make this clear enough when commenting on v5: I wasn't
>>>> just after an adjustment to the commit message. I'm instead actively
>>>> concerned of the resulting behavior. Question is whether we can reasonably
>>>> do something about that.
>>>
>>> Right. My suggestion then is to go back to roughly how it was done in
>>> v4 [0]:
>>>
>>> * Remove the VFs right away during PF removal, so that we don't end up
>>> with stale VFs. Regarding XSM, assume that a domain with permission to
>>> remove the PF is also allowed to remove the VFs. We should probably also
>>> return an error from pci_remove_device in the case of removing the PF
>>> with VFs still present (and still perform the removals despite returning
>>> an error). Subsequent attempts by a domain to remove the VFs would
>>> return an error (as they have already been removed), but that's expected
>>> since we've taken a stance that PF-then-VF removal order is invalid
>>> anyway.
>>
>> Imo going back is not an option.
>>
>>> While the above is what I prefer, I just want to mention other options I
>>> considered for the scenario of PF removal with VFs still present:
>>>
>>> * Increase the "scariness" of the warning message added in v6.
>>>
>>> * Return an error from pci_remove_device (while still removing only the
>>> PF). We would be left with stale VFs in Xen. At least this would
>>> concretely inform dom0 that Xen takes issue with the PF-then-VF removal
>>> order. Subsequent attempts by a domain to remove VFs, however
>>> (un)likely, would succeed.
>>
>> Returning an error in such a case is a possibility, but comes with the
>> risk of confusion. Seeing such an error, a caller may itself assume the
>> device still is there, and retry its (with or without having removed the
>> VFs) removal at a later point.
>>
>>> * Return an error from pci_remove_device and keep the PF and VFs. This
>>> is IMO the worst option because then we would have a stale PF in
>>> addition to stale VFs.
>>
>> Yet this would at least be self-consistent, unlike the variant above. No
>> matter what, any failure to remove VFs and/or PFs correctly will need to
>> result in there being no attempt to physically remove the device.
>>
>> You didn't enumerate an option lightly mentioned before, perhaps because
>> of its anticipated intrusiveness: Re-associate stale VFs with their PF,
>> once the PF is re-reported. Problem of course is that, aiui, the VFs
>> could in principle re-appear at a different BDF (albeit we have other
>> issues with potential bus-renumbering done by Dom0), and their count
>> could also change.
> 
> Are you enumerating it for completeness or suggesting it should be done?

I was meaning to suggest that it should at least be considered.

> Maybe I'm missing something here (and please, do tell me what if so), but why
> would this option be desirable at all? What would benefit from such semantics
> (as opposed to any of the others)? It would break the lifetime dependency
> between PF and VFs, and that doesn't strike me as a feature. It also turns
> kernel bugs into a fine implementations by making promises about how state is
> persisted, but the consequences of that appear to be too far reaching to know
> for sure it's 100% ok.
> 
> From afar, it sounds like trying to turn a bug into a feature. And that cannot
> always be done sanely. But again, maybe I might very well be missing
> something...

My main point is that the other suggested options have weaknesses, too.
Leaving stale VFs around forever isn't, imo, any better than trying to
reuse their structs.

Jan


^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH v6 2/3] xen/pci: introduce PF<->VF links
  2024-11-08 13:17           ` Jan Beulich
@ 2024-11-08 15:19             ` Roger Pau Monné
  2024-11-08 16:29               ` Stewart Hildebrand
  0 siblings, 1 reply; 30+ messages in thread
From: Roger Pau Monné @ 2024-11-08 15:19 UTC (permalink / raw)
  To: Jan Beulich
  Cc: Alejandro Vallejo, Andrew Cooper, Julien Grall,
	Stefano Stabellini, xen-devel, Daniel P. Smith,
	Stewart Hildebrand

On Fri, Nov 08, 2024 at 02:17:40PM +0100, Jan Beulich wrote:
> On 08.11.2024 13:42, Alejandro Vallejo wrote:
> > On Mon Nov 4, 2024 at 7:44 AM GMT, Jan Beulich wrote:
> >> On 01.11.2024 21:16, Stewart Hildebrand wrote:
> >>> +Daniel (XSM mention)
> >>>
> >>> On 10/28/24 13:02, Jan Beulich wrote:
> >>>> On 18.10.2024 22:39, Stewart Hildebrand wrote:
> >>>>> Add links between a VF's struct pci_dev and its associated PF struct
> >>>>> pci_dev. Move the calls to pci_get_pdev()/pci_add_device() down to avoid
> >>>>> dropping and re-acquiring the pcidevs_lock().
> >>>>>
> >>>>> During PF removal, unlink VF from PF and mark the VF broken. As before,
> >>>>> VFs may exist without a corresponding PF, although now only with
> >>>>> pdev->broken = true.
> >>>>>
> >>>>> The hardware domain is expected to remove the associated VFs before
> >>>>> removing the PF. Print a warning in case a PF is removed with associated
> >>>>> VFs still present.
> >>>>>
> >>>>> Signed-off-by: Stewart Hildebrand <stewart.hildebrand@amd.com>
> >>>>> ---
> >>>>> Candidate for backport to 4.19 (the next patch depends on this one)
> >>>>>
> >>>>> v5->v6:
> >>>>> * move printk() before ASSERT_UNREACHABLE()
> >>>>> * warn about PF removal with VFs still present
> >>>>
> >>>> Hmm, maybe I didn't make this clear enough when commenting on v5: I wasn't
> >>>> just after an adjustment to the commit message. I'm instead actively
> >>>> concerned of the resulting behavior. Question is whether we can reasonably
> >>>> do something about that.
> >>>
> >>> Right. My suggestion then is to go back to roughly how it was done in
> >>> v4 [0]:
> >>>
> >>> * Remove the VFs right away during PF removal, so that we don't end up
> >>> with stale VFs. Regarding XSM, assume that a domain with permission to
> >>> remove the PF is also allowed to remove the VFs. We should probably also
> >>> return an error from pci_remove_device in the case of removing the PF
> >>> with VFs still present (and still perform the removals despite returning
> >>> an error). Subsequent attempts by a domain to remove the VFs would
> >>> return an error (as they have already been removed), but that's expected
> >>> since we've taken a stance that PF-then-VF removal order is invalid
> >>> anyway.
> >>
> >> Imo going back is not an option.
> >>
> >>> While the above is what I prefer, I just want to mention other options I
> >>> considered for the scenario of PF removal with VFs still present:
> >>>
> >>> * Increase the "scariness" of the warning message added in v6.
> >>>
> >>> * Return an error from pci_remove_device (while still removing only the
> >>> PF). We would be left with stale VFs in Xen. At least this would
> >>> concretely inform dom0 that Xen takes issue with the PF-then-VF removal
> >>> order. Subsequent attempts by a domain to remove VFs, however
> >>> (un)likely, would succeed.
> >>
> >> Returning an error in such a case is a possibility, but comes with the
> >> risk of confusion. Seeing such an error, a caller may itself assume the
> >> device still is there, and retry its (with or without having removed the
> >> VFs) removal at a later point.
> >>
> >>> * Return an error from pci_remove_device and keep the PF and VFs. This
> >>> is IMO the worst option because then we would have a stale PF in
> >>> addition to stale VFs.

I'm thinking probably this is the least bad option, and just force the
owner of the PF to ensure there are no VFs left when removing the PF.

What sense does it make anyway to allow removing a PF with VFs still
present?  Not sure exactly what the owner of the PF will do before
calling pci_remove_device(), but it would seem to me the device should
be ready for unplug (so SR-IOV disabled).  Calling pci_remove_device()
with VFs still active points to an error to do proper cleanup by the
owner of the PF.

Returning error from pci_remove_device() and doing nothing would seem
fine to me.  There should be no stale PF or VFs in that case, as the
caller has been notified the device has failed to be removed, so
should treat the device as still present.

Thanks. Roger.


^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH v6 2/3] xen/pci: introduce PF<->VF links
  2024-11-08 15:19             ` Roger Pau Monné
@ 2024-11-08 16:29               ` Stewart Hildebrand
  2024-11-11  6:40                 ` Jan Beulich
  0 siblings, 1 reply; 30+ messages in thread
From: Stewart Hildebrand @ 2024-11-08 16:29 UTC (permalink / raw)
  To: Roger Pau Monné, Jan Beulich
  Cc: Alejandro Vallejo, Andrew Cooper, Julien Grall,
	Stefano Stabellini, xen-devel, Daniel P. Smith

On 11/8/24 10:19, Roger Pau Monné wrote:
> On Fri, Nov 08, 2024 at 02:17:40PM +0100, Jan Beulich wrote:
>> On 08.11.2024 13:42, Alejandro Vallejo wrote:
>>> On Mon Nov 4, 2024 at 7:44 AM GMT, Jan Beulich wrote:
>>>> On 01.11.2024 21:16, Stewart Hildebrand wrote:
>>>>> +Daniel (XSM mention)
>>>>>
>>>>> On 10/28/24 13:02, Jan Beulich wrote:
>>>>>> On 18.10.2024 22:39, Stewart Hildebrand wrote:
>>>>>>> Add links between a VF's struct pci_dev and its associated PF struct
>>>>>>> pci_dev. Move the calls to pci_get_pdev()/pci_add_device() down to avoid
>>>>>>> dropping and re-acquiring the pcidevs_lock().
>>>>>>>
>>>>>>> During PF removal, unlink VF from PF and mark the VF broken. As before,
>>>>>>> VFs may exist without a corresponding PF, although now only with
>>>>>>> pdev->broken = true.
>>>>>>>
>>>>>>> The hardware domain is expected to remove the associated VFs before
>>>>>>> removing the PF. Print a warning in case a PF is removed with associated
>>>>>>> VFs still present.
>>>>>>>
>>>>>>> Signed-off-by: Stewart Hildebrand <stewart.hildebrand@amd.com>
>>>>>>> ---
>>>>>>> Candidate for backport to 4.19 (the next patch depends on this one)
>>>>>>>
>>>>>>> v5->v6:
>>>>>>> * move printk() before ASSERT_UNREACHABLE()
>>>>>>> * warn about PF removal with VFs still present
>>>>>>
>>>>>> Hmm, maybe I didn't make this clear enough when commenting on v5: I wasn't
>>>>>> just after an adjustment to the commit message. I'm instead actively
>>>>>> concerned of the resulting behavior. Question is whether we can reasonably
>>>>>> do something about that.
>>>>>
>>>>> Right. My suggestion then is to go back to roughly how it was done in
>>>>> v4 [0]:
>>>>>
>>>>> * Remove the VFs right away during PF removal, so that we don't end up
>>>>> with stale VFs. Regarding XSM, assume that a domain with permission to
>>>>> remove the PF is also allowed to remove the VFs. We should probably also
>>>>> return an error from pci_remove_device in the case of removing the PF
>>>>> with VFs still present (and still perform the removals despite returning
>>>>> an error). Subsequent attempts by a domain to remove the VFs would
>>>>> return an error (as they have already been removed), but that's expected
>>>>> since we've taken a stance that PF-then-VF removal order is invalid
>>>>> anyway.
>>>>
>>>> Imo going back is not an option.
>>>>
>>>>> While the above is what I prefer, I just want to mention other options I
>>>>> considered for the scenario of PF removal with VFs still present:
>>>>>
>>>>> * Increase the "scariness" of the warning message added in v6.
>>>>>
>>>>> * Return an error from pci_remove_device (while still removing only the
>>>>> PF). We would be left with stale VFs in Xen. At least this would
>>>>> concretely inform dom0 that Xen takes issue with the PF-then-VF removal
>>>>> order. Subsequent attempts by a domain to remove VFs, however
>>>>> (un)likely, would succeed.
>>>>
>>>> Returning an error in such a case is a possibility, but comes with the
>>>> risk of confusion. Seeing such an error, a caller may itself assume the
>>>> device still is there, and retry its (with or without having removed the
>>>> VFs) removal at a later point.
>>>>
>>>>> * Return an error from pci_remove_device and keep the PF and VFs. This
>>>>> is IMO the worst option because then we would have a stale PF in
>>>>> addition to stale VFs.
> 
> I'm thinking probably this is the least bad option, and just force the
> owner of the PF to ensure there are no VFs left when removing the PF.
> 
> What sense does it make anyway to allow removing a PF with VFs still
> present?  Not sure exactly what the owner of the PF will do before
> calling pci_remove_device(), but it would seem to me the device should
> be ready for unplug (so SR-IOV disabled).  Calling pci_remove_device()
> with VFs still active points to an error to do proper cleanup by the
> owner of the PF.

In normal, correct operation, right. The PF driver is indeed expected to
disable SR-IOV (i.e. remove VFs) during its removal, prior to calling
PHYSDEVOP_pci_device_remove for the PF.

> Returning error from pci_remove_device() and doing nothing would seem
> fine to me.  There should be no stale PF or VFs in that case, as the
> caller has been notified the device has failed to be removed, so
> should treat the device as still present.

But software has no way to guarantee there won't be a physical device
removal.

In test scenario #2 described in the first patch [1], the PF (the whole
device, actually) has already been physically unplugged, and dom0
invokes PHYSDEVOP_pci_device_remove to inform Xen about it.

[1] https://lore.kernel.org/xen-devel/20241018203913.1162962-2-stewart.hildebrand@amd.com/

That said, test scenario #2 would only happen when a buggy PF driver
failed to properly clean up the VFs before the PF. But the point is that
returning an error does not guarantee there won't be a stale pdev in
case of a buggy dom0.

I guess as long as we trust the owner of the PF, this approach is fine.


^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH v6 2/3] xen/pci: introduce PF<->VF links
  2024-11-08 16:29               ` Stewart Hildebrand
@ 2024-11-11  6:40                 ` Jan Beulich
  2024-11-12  9:05                   ` Roger Pau Monné
  0 siblings, 1 reply; 30+ messages in thread
From: Jan Beulich @ 2024-11-11  6:40 UTC (permalink / raw)
  To: Stewart Hildebrand, Roger Pau Monné
  Cc: Alejandro Vallejo, Andrew Cooper, Julien Grall,
	Stefano Stabellini, xen-devel, Daniel P. Smith

On 08.11.2024 17:29, Stewart Hildebrand wrote:
> On 11/8/24 10:19, Roger Pau Monné wrote:
>> On Fri, Nov 08, 2024 at 02:17:40PM +0100, Jan Beulich wrote:
>>> On 08.11.2024 13:42, Alejandro Vallejo wrote:
>>>> On Mon Nov 4, 2024 at 7:44 AM GMT, Jan Beulich wrote:
>>>>> On 01.11.2024 21:16, Stewart Hildebrand wrote:
>>>>>> +Daniel (XSM mention)
>>>>>>
>>>>>> On 10/28/24 13:02, Jan Beulich wrote:
>>>>>>> On 18.10.2024 22:39, Stewart Hildebrand wrote:
>>>>>>>> Add links between a VF's struct pci_dev and its associated PF struct
>>>>>>>> pci_dev. Move the calls to pci_get_pdev()/pci_add_device() down to avoid
>>>>>>>> dropping and re-acquiring the pcidevs_lock().
>>>>>>>>
>>>>>>>> During PF removal, unlink VF from PF and mark the VF broken. As before,
>>>>>>>> VFs may exist without a corresponding PF, although now only with
>>>>>>>> pdev->broken = true.
>>>>>>>>
>>>>>>>> The hardware domain is expected to remove the associated VFs before
>>>>>>>> removing the PF. Print a warning in case a PF is removed with associated
>>>>>>>> VFs still present.
>>>>>>>>
>>>>>>>> Signed-off-by: Stewart Hildebrand <stewart.hildebrand@amd.com>
>>>>>>>> ---
>>>>>>>> Candidate for backport to 4.19 (the next patch depends on this one)
>>>>>>>>
>>>>>>>> v5->v6:
>>>>>>>> * move printk() before ASSERT_UNREACHABLE()
>>>>>>>> * warn about PF removal with VFs still present
>>>>>>>
>>>>>>> Hmm, maybe I didn't make this clear enough when commenting on v5: I wasn't
>>>>>>> just after an adjustment to the commit message. I'm instead actively
>>>>>>> concerned of the resulting behavior. Question is whether we can reasonably
>>>>>>> do something about that.
>>>>>>
>>>>>> Right. My suggestion then is to go back to roughly how it was done in
>>>>>> v4 [0]:
>>>>>>
>>>>>> * Remove the VFs right away during PF removal, so that we don't end up
>>>>>> with stale VFs. Regarding XSM, assume that a domain with permission to
>>>>>> remove the PF is also allowed to remove the VFs. We should probably also
>>>>>> return an error from pci_remove_device in the case of removing the PF
>>>>>> with VFs still present (and still perform the removals despite returning
>>>>>> an error). Subsequent attempts by a domain to remove the VFs would
>>>>>> return an error (as they have already been removed), but that's expected
>>>>>> since we've taken a stance that PF-then-VF removal order is invalid
>>>>>> anyway.
>>>>>
>>>>> Imo going back is not an option.
>>>>>
>>>>>> While the above is what I prefer, I just want to mention other options I
>>>>>> considered for the scenario of PF removal with VFs still present:
>>>>>>
>>>>>> * Increase the "scariness" of the warning message added in v6.
>>>>>>
>>>>>> * Return an error from pci_remove_device (while still removing only the
>>>>>> PF). We would be left with stale VFs in Xen. At least this would
>>>>>> concretely inform dom0 that Xen takes issue with the PF-then-VF removal
>>>>>> order. Subsequent attempts by a domain to remove VFs, however
>>>>>> (un)likely, would succeed.
>>>>>
>>>>> Returning an error in such a case is a possibility, but comes with the
>>>>> risk of confusion. Seeing such an error, a caller may itself assume the
>>>>> device still is there, and retry its (with or without having removed the
>>>>> VFs) removal at a later point.
>>>>>
>>>>>> * Return an error from pci_remove_device and keep the PF and VFs. This
>>>>>> is IMO the worst option because then we would have a stale PF in
>>>>>> addition to stale VFs.
>>
>> I'm thinking probably this is the least bad option, and just force the
>> owner of the PF to ensure there are no VFs left when removing the PF.
>>
>> What sense does it make anyway to allow removing a PF with VFs still
>> present?  Not sure exactly what the owner of the PF will do before
>> calling pci_remove_device(), but it would seem to me the device should
>> be ready for unplug (so SR-IOV disabled).  Calling pci_remove_device()
>> with VFs still active points to an error to do proper cleanup by the
>> owner of the PF.
> 
> In normal, correct operation, right. The PF driver is indeed expected to
> disable SR-IOV (i.e. remove VFs) during its removal, prior to calling
> PHYSDEVOP_pci_device_remove for the PF.
> 
>> Returning error from pci_remove_device() and doing nothing would seem
>> fine to me.  There should be no stale PF or VFs in that case, as the
>> caller has been notified the device has failed to be removed, so
>> should treat the device as still present.

Imo really that's another case that would best be addressed by proper
ref-counting. Each VF would hold a ref to its PF, and hence the PF would
go away when the last VF is removed, or when PF removal is (properly)
last. Just that this likely is too complex a change to be warranted for
the purpose here.

> But software has no way to guarantee there won't be a physical device
> removal.
> 
> In test scenario #2 described in the first patch [1], the PF (the whole
> device, actually) has already been physically unplugged, and dom0
> invokes PHYSDEVOP_pci_device_remove to inform Xen about it.

I don't think that's how it's supposed to work. Physical removal should
occur only after software has done all "soft removal". I'd view the
notification to Xen as part of that.

Jan

> [1] https://lore.kernel.org/xen-devel/20241018203913.1162962-2-stewart.hildebrand@amd.com/
> 
> That said, test scenario #2 would only happen when a buggy PF driver
> failed to properly clean up the VFs before the PF. But the point is that
> returning an error does not guarantee there won't be a stale pdev in
> case of a buggy dom0.
> 
> I guess as long as we trust the owner of the PF, this approach is fine.



^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH v6 2/3] xen/pci: introduce PF<->VF links
  2024-10-28 18:41   ` Roger Pau Monné
@ 2024-11-11 20:07     ` Stewart Hildebrand
  2024-11-12  9:02       ` Roger Pau Monné
  2024-11-12  9:42       ` Jan Beulich
  0 siblings, 2 replies; 30+ messages in thread
From: Stewart Hildebrand @ 2024-11-11 20:07 UTC (permalink / raw)
  To: Roger Pau Monné
  Cc: xen-devel, Jan Beulich, Andrew Cooper, Julien Grall,
	Stefano Stabellini

On 10/28/24 14:41, Roger Pau Monné wrote:
> On Fri, Oct 18, 2024 at 04:39:09PM -0400, Stewart Hildebrand wrote:
>> Add links between a VF's struct pci_dev and its associated PF struct
>> pci_dev. Move the calls to pci_get_pdev()/pci_add_device() down to avoid
>> dropping and re-acquiring the pcidevs_lock().
>>
>> During PF removal, unlink VF from PF and mark the VF broken. As before,
>> VFs may exist without a corresponding PF, although now only with
>> pdev->broken = true.
>>
>> The hardware domain is expected to remove the associated VFs before
>> removing the PF. Print a warning in case a PF is removed with associated
>> VFs still present.
>>
>> Signed-off-by: Stewart Hildebrand <stewart.hildebrand@amd.com>
>> ---
>> Candidate for backport to 4.19 (the next patch depends on this one)
>>
>> v5->v6:
>> * move printk() before ASSERT_UNREACHABLE()
>> * warn about PF removal with VFs still present
>> * clarify commit message
>>
>> v4->v5:
>> * new patch, split from ("x86/msi: fix locking for SR-IOV devices")
>> * move INIT_LIST_HEAD(&pdev->vf_list); earlier
>> * collapse struct list_head instances
>> * retain error code from pci_add_device()
>> * unlink (and mark broken) VFs instead of removing them
>> * const-ify VF->PF link
>> ---
>>  xen/drivers/passthrough/pci.c | 76 ++++++++++++++++++++++++++++-------
>>  xen/include/xen/pci.h         | 10 +++++
>>  2 files changed, 72 insertions(+), 14 deletions(-)
>>
>> diff --git a/xen/drivers/passthrough/pci.c b/xen/drivers/passthrough/pci.c
>> index 74d3895e1ef6..fe31255b1207 100644
>> --- a/xen/drivers/passthrough/pci.c
>> +++ b/xen/drivers/passthrough/pci.c
>> @@ -333,6 +333,8 @@ static struct pci_dev *alloc_pdev(struct pci_seg *pseg, u8 bus, u8 devfn)
>>      *((u8*) &pdev->devfn) = devfn;
>>      pdev->domain = NULL;
>>  
>> +    INIT_LIST_HEAD(&pdev->vf_list);
>> +
>>      arch_pci_init_pdev(pdev);
>>  
>>      rc = pdev_msi_init(pdev);
>> @@ -449,6 +451,10 @@ static void free_pdev(struct pci_seg *pseg, struct pci_dev *pdev)
>>  
>>      list_del(&pdev->alldevs_list);
>>      pdev_msi_deinit(pdev);
>> +
>> +    if ( pdev->info.is_virtfn && pdev->virtfn.pf_pdev )
> 
> Shouldn't having pdev->info.is_virtfn set already ensure that
> pdev->virtfn.pf_pdev != NULL?

In the current rev, the possibility exists, however unlikely, that a
*buggy* dom0 may remove the PF before removing the VFs. In this case a
VF would exist without a corresponding PF, and thus pdev->virtfn.pf_pdev
is NULL.

For the next rev, you're right, it'll be back to the situation where a
VF can only exist with an associated PF.

>> +        list_del(&pdev->vf_list);
>> +
>>      xfree(pdev);
>>  }
>>  
>> @@ -656,24 +662,11 @@ int pci_add_device(u16 seg, u8 bus, u8 devfn,
>>      unsigned int slot = PCI_SLOT(devfn), func = PCI_FUNC(devfn);
>>      const char *type;
>>      int ret;
>> -    bool pf_is_extfn = false;
>>  
>>      if ( !info )
>>          type = "device";
>>      else if ( info->is_virtfn )
>> -    {
>> -        pcidevs_lock();
>> -        pdev = pci_get_pdev(NULL,
>> -                            PCI_SBDF(seg, info->physfn.bus,
>> -                                     info->physfn.devfn));
>> -        if ( pdev )
>> -            pf_is_extfn = pdev->info.is_extfn;
>> -        pcidevs_unlock();
>> -        if ( !pdev )
>> -            pci_add_device(seg, info->physfn.bus, info->physfn.devfn,
>> -                           NULL, node);
>>          type = "virtual function";
>> -    }
>>      else if ( info->is_extfn )
>>          type = "extended function";
>>      else
>> @@ -703,7 +696,44 @@ int pci_add_device(u16 seg, u8 bus, u8 devfn,
>>           * extended function.
>>           */
>>          if ( pdev->info.is_virtfn )
>> -            pdev->info.is_extfn = pf_is_extfn;
>> +        {
>> +            struct pci_dev *pf_pdev;
> 
> This could be const?

No, as we are doing this below:
    list_add(&pdev->vf_list, &pf_pdev->vf_list);

>> +
>> +            pf_pdev = pci_get_pdev(NULL,
>> +                                   PCI_SBDF(seg, info->physfn.bus,
>> +                                            info->physfn.devfn));
> 
> You can probably initialize at declaration?

OK

>> +
>> +            if ( !pf_pdev )
> 
> Is this even feasible during correct operation?

No, I don't think so.

> IOW: shouldn't the PF
> always be added first, so that SR-IOV can be enabled and the VFs added
> afterwards?

Yes, I think you're right.

> I see previous code also catered for VFs being added without the PF
> being present, so I assume there was some need for this.

This is exactly the source of the confusion on my part. In the removal
path, the consensus seems to be that removing a PF with VFs still
present indicates an error. Then shouldn't the opposite also be true?
Adding a VF without a PF should also indicate an error.

I see the PF-adding logic was added in 942a6f1376d8 ("x86/PCI-MSI:
properly determine VF BAR values"). Searching the mailing list archives
didn't reveal much about it [0].

[0] https://lore.kernel.org/xen-devel/4E3FC6E102000078000501CA@nat28.tlf.novell.com/

The only time I've observed this path being taken is by manually
calling PHYSDEVOP_pci_device_add. I've resorted to calling
PHYSDEVOP_pci_device_{remove,add} from userspace in order to test this,
because the Linux kernel doesn't behave this way.

I can't think of a good rationale for catering to VFs being added
without a PF, so I'll turn it into an error for the next rev.

>> +            {
>> +                ret = pci_add_device(seg, info->physfn.bus, info->physfn.devfn,
>> +                                     NULL, node);
>> +                if ( ret )
>> +                {
>> +                    printk(XENLOG_WARNING "Failed to add SR-IOV device PF %pp for VF %pp\n",
> 
> Could you split this to make the line a bit shorter?
> 
>                        printk(XENLOG_WARNING
> 		              "Failed to add SR-IOV device PF %pp for VF %pp\n",
> 
> Same below.
> 
>> +                           &PCI_SBDF(seg, info->physfn.bus, info->physfn.devfn),
>> +                           &pdev->sbdf);
>> +                    free_pdev(pseg, pdev);
>> +                    goto out;
>> +                }
>> +                pf_pdev = pci_get_pdev(NULL,
>> +                                       PCI_SBDF(seg, info->physfn.bus,
>> +                                                info->physfn.devfn));
>> +                if ( !pf_pdev )
>> +                {
>> +                    printk(XENLOG_ERR "Failed to find SR-IOV device PF %pp for VF %pp\n",
>> +                           &PCI_SBDF(seg, info->physfn.bus, info->physfn.devfn),
>> +                           &pdev->sbdf);
> 
> If you want to add an error message here, I think it should mention
> the fact this state is not expected:
> 
> "Inconsistent PCI state: failed to find newly added PF %pp for VF %pp\n"
> 
>> +                    ASSERT_UNREACHABLE();
>> +                    free_pdev(pseg, pdev);
>> +                    ret = -EILSEQ;
>> +                    goto out;
>> +                }
>> +            }
>> +
>> +            pdev->info.is_extfn = pf_pdev->info.is_extfn;
>> +            pdev->virtfn.pf_pdev = pf_pdev;
>> +            list_add(&pdev->vf_list, &pf_pdev->vf_list);
>> +        }
>>      }
>>  
>>      if ( !pdev->info.is_virtfn && !pdev->vf_rlen[0] )
>> @@ -821,6 +851,24 @@ int pci_remove_device(u16 seg, u8 bus, u8 devfn)
>>      list_for_each_entry ( pdev, &pseg->alldevs_list, alldevs_list )
>>          if ( pdev->bus == bus && pdev->devfn == devfn )
>>          {
>> +            if ( !pdev->info.is_virtfn )
> 
> Given we have no field to mark a device as a PF, we could check that
> pdev->vf_list is not empty, and by doing so the warn_stale_vfs
> variable could be dropped?
> 
> if ( !pdev->info.is_virtfn && !list_empty(&pdev->vf_list) )
> {
>     struct pci_dev *vf_pdev;
> 
>     while ( (vf_pdev = list_first_entry_or_null(&pdev->vf_list,
>                                                 struct pci_dev,
> 						vf_list)) != NULL )
>     {
>         list_del(&vf_pdev->vf_list);
>         vf_pdev->virtfn.pf_pdev = NULL;
>         vf_pdev->broken = true;
>     }
> 
>     printk(XENLOG_WARNING "PCI SR-IOV PF %pp removed with VFs still present\n",
>            &pdev->sbdf);
> }

Yeah. Given that the consensus is leaning toward keeping the PF and
returning an error, here's my suggestion:

    if ( !pdev->info.is_virtfn && !list_empty(&pdev->vf_list) )
    {
        struct pci_dev *vf_pdev;

        list_for_each_entry(vf_pdev, &pdev->vf_list, vf_list)
            vf_pdev->broken = true;

        pdev->broken = true;

        printk(XENLOG_WARNING
               "Attempted to remove PCI SR-IOV PF %pp with VFs still present\n",
               &pdev->sbdf);

        ret = -EBUSY;
        break;
    }

>> +            {
>> +                struct pci_dev *vf_pdev, *tmp;
>> +                bool warn_stale_vfs = false;
>> +
>> +                list_for_each_entry_safe(vf_pdev, tmp, &pdev->vf_list, vf_list)
>> +                {
>> +                    list_del(&vf_pdev->vf_list);
>> +                    vf_pdev->virtfn.pf_pdev = NULL;
>> +                    vf_pdev->broken = true;
>> +                    warn_stale_vfs = true;
>> +                }
>> +
>> +                if ( warn_stale_vfs )
>> +                    printk(XENLOG_WARNING "PCI SR-IOV PF %pp removed with VFs still present\n",
>> +                           &pdev->sbdf);
>> +            }
>> +
>>              if ( pdev->domain )
>>              {
>>                  write_lock(&pdev->domain->pci_lock);
>> diff --git a/xen/include/xen/pci.h b/xen/include/xen/pci.h
>> index ef56e80651d6..2ea168d5f914 100644
>> --- a/xen/include/xen/pci.h
>> +++ b/xen/include/xen/pci.h
>> @@ -153,7 +153,17 @@ struct pci_dev {
>>          unsigned int count;
>>  #define PT_FAULT_THRESHOLD 10
>>      } fault;
>> +
>> +    /*
>> +     * List head if info.is_virtfn == false
>> +     * List entry if info.is_virtfn == true
>> +     */
>> +    struct list_head vf_list;
>>      u64 vf_rlen[6];
>> +    struct {
>> +        /* Only populated for VFs (info.is_virtfn == true) */
> 
> All comments here (specially the first ones) would better use PF and
> VF consistently, rather than referring to other fields in the struct.
> Specially because the fields can change names and the comments would
> then become stale.

OK

>> +        const struct pci_dev *pf_pdev;        /* Link from VF to PF */
>> +    } virtfn;
> 
> I'm unsure you need an outer virtfn struct, as it's only one field in
> this patch?  Maybe more fields gets added by further patches?

Right. There are no more fields to be added, so there's no need. It was
leftover from a previous rev when vf_list was split.

> 
> Thanks, Roger.



^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH v6 2/3] xen/pci: introduce PF<->VF links
  2024-11-11 20:07     ` Stewart Hildebrand
@ 2024-11-12  9:02       ` Roger Pau Monné
  2024-11-12  9:39         ` Jan Beulich
  2024-11-12  9:42       ` Jan Beulich
  1 sibling, 1 reply; 30+ messages in thread
From: Roger Pau Monné @ 2024-11-12  9:02 UTC (permalink / raw)
  To: Stewart Hildebrand
  Cc: xen-devel, Jan Beulich, Andrew Cooper, Julien Grall,
	Stefano Stabellini

On Mon, Nov 11, 2024 at 03:07:28PM -0500, Stewart Hildebrand wrote:
> On 10/28/24 14:41, Roger Pau Monné wrote:
> > On Fri, Oct 18, 2024 at 04:39:09PM -0400, Stewart Hildebrand wrote:
> > IOW: shouldn't the PF
> > always be added first, so that SR-IOV can be enabled and the VFs added
> > afterwards?
> 
> Yes, I think you're right.
> 
> > I see previous code also catered for VFs being added without the PF
> > being present, so I assume there was some need for this.
> 
> This is exactly the source of the confusion on my part. In the removal
> path, the consensus seems to be that removing a PF with VFs still
> present indicates an error. Then shouldn't the opposite also be true?
> Adding a VF without a PF should also indicate an error.
> 
> I see the PF-adding logic was added in 942a6f1376d8 ("x86/PCI-MSI:
> properly determine VF BAR values"). Searching the mailing list archives
> didn't reveal much about it [0].
> 
> [0] https://lore.kernel.org/xen-devel/4E3FC6E102000078000501CA@nat28.tlf.novell.com/
> 
> The only time I've observed this path being taken is by manually
> calling PHYSDEVOP_pci_device_add. I've resorted to calling
> PHYSDEVOP_pci_device_{remove,add} from userspace in order to test this,
> because the Linux kernel doesn't behave this way.
> 
> I can't think of a good rationale for catering to VFs being added
> without a PF, so I'll turn it into an error for the next rev.

Maybe there's a case for a device to be discovered with SR-IOV already
enabled (ie: when booting in kexec like environments), but then I
would still expect the OS to first add the PF and afterwards the VFs.
Otherwise I'm not even sure whether the OS would be capable of
identifying the VFs as such.

> >> +                    ASSERT_UNREACHABLE();
> >> +                    free_pdev(pseg, pdev);
> >> +                    ret = -EILSEQ;
> >> +                    goto out;
> >> +                }
> >> +            }
> >> +
> >> +            pdev->info.is_extfn = pf_pdev->info.is_extfn;
> >> +            pdev->virtfn.pf_pdev = pf_pdev;
> >> +            list_add(&pdev->vf_list, &pf_pdev->vf_list);
> >> +        }
> >>      }
> >>  
> >>      if ( !pdev->info.is_virtfn && !pdev->vf_rlen[0] )
> >> @@ -821,6 +851,24 @@ int pci_remove_device(u16 seg, u8 bus, u8 devfn)
> >>      list_for_each_entry ( pdev, &pseg->alldevs_list, alldevs_list )
> >>          if ( pdev->bus == bus && pdev->devfn == devfn )
> >>          {
> >> +            if ( !pdev->info.is_virtfn )
> > 
> > Given we have no field to mark a device as a PF, we could check that
> > pdev->vf_list is not empty, and by doing so the warn_stale_vfs
> > variable could be dropped?
> > 
> > if ( !pdev->info.is_virtfn && !list_empty(&pdev->vf_list) )
> > {
> >     struct pci_dev *vf_pdev;
> > 
> >     while ( (vf_pdev = list_first_entry_or_null(&pdev->vf_list,
> >                                                 struct pci_dev,
> > 						vf_list)) != NULL )
> >     {
> >         list_del(&vf_pdev->vf_list);
> >         vf_pdev->virtfn.pf_pdev = NULL;
> >         vf_pdev->broken = true;
> >     }
> > 
> >     printk(XENLOG_WARNING "PCI SR-IOV PF %pp removed with VFs still present\n",
> >            &pdev->sbdf);
> > }
> 
> Yeah. Given that the consensus is leaning toward keeping the PF and
> returning an error, here's my suggestion:
> 
>     if ( !pdev->info.is_virtfn && !list_empty(&pdev->vf_list) )
>     {
>         struct pci_dev *vf_pdev;
> 
>         list_for_each_entry(vf_pdev, &pdev->vf_list, vf_list)
>             vf_pdev->broken = true;
> 
>         pdev->broken = true;

Do you need to mark the devices as broken?  My expectation would be
that returning -EBUSY here should prevent the device from being
removed, and hence there would be no breakage, just failure to fulfill
the (possible) hot-unplug request.

Thanks, Roger.


^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH v6 2/3] xen/pci: introduce PF<->VF links
  2024-11-11  6:40                 ` Jan Beulich
@ 2024-11-12  9:05                   ` Roger Pau Monné
  0 siblings, 0 replies; 30+ messages in thread
From: Roger Pau Monné @ 2024-11-12  9:05 UTC (permalink / raw)
  To: Jan Beulich, Stewart Hildebrand
  Cc: Alejandro Vallejo, Andrew Cooper, Julien Grall,
	Stefano Stabellini, xen-devel, Daniel P. Smith

On Mon, Nov 11, 2024 at 07:40:14AM +0100, Jan Beulich wrote:
> On 08.11.2024 17:29, Stewart Hildebrand wrote:
> > On 11/8/24 10:19, Roger Pau Monné wrote:
> >> On Fri, Nov 08, 2024 at 02:17:40PM +0100, Jan Beulich wrote:
> >>> On 08.11.2024 13:42, Alejandro Vallejo wrote:
> >>>> On Mon Nov 4, 2024 at 7:44 AM GMT, Jan Beulich wrote:
> >>>>> On 01.11.2024 21:16, Stewart Hildebrand wrote:
> >>>>>> +Daniel (XSM mention)
> >>>>>>
> >>>>>> On 10/28/24 13:02, Jan Beulich wrote:
> >>>>>>> On 18.10.2024 22:39, Stewart Hildebrand wrote:
> >>>>>>>> Add links between a VF's struct pci_dev and its associated PF struct
> >>>>>>>> pci_dev. Move the calls to pci_get_pdev()/pci_add_device() down to avoid
> >>>>>>>> dropping and re-acquiring the pcidevs_lock().
> >>>>>>>>
> >>>>>>>> During PF removal, unlink VF from PF and mark the VF broken. As before,
> >>>>>>>> VFs may exist without a corresponding PF, although now only with
> >>>>>>>> pdev->broken = true.
> >>>>>>>>
> >>>>>>>> The hardware domain is expected to remove the associated VFs before
> >>>>>>>> removing the PF. Print a warning in case a PF is removed with associated
> >>>>>>>> VFs still present.
> >>>>>>>>
> >>>>>>>> Signed-off-by: Stewart Hildebrand <stewart.hildebrand@amd.com>
> >>>>>>>> ---
> >>>>>>>> Candidate for backport to 4.19 (the next patch depends on this one)
> >>>>>>>>
> >>>>>>>> v5->v6:
> >>>>>>>> * move printk() before ASSERT_UNREACHABLE()
> >>>>>>>> * warn about PF removal with VFs still present
> >>>>>>>
> >>>>>>> Hmm, maybe I didn't make this clear enough when commenting on v5: I wasn't
> >>>>>>> just after an adjustment to the commit message. I'm instead actively
> >>>>>>> concerned of the resulting behavior. Question is whether we can reasonably
> >>>>>>> do something about that.
> >>>>>>
> >>>>>> Right. My suggestion then is to go back to roughly how it was done in
> >>>>>> v4 [0]:
> >>>>>>
> >>>>>> * Remove the VFs right away during PF removal, so that we don't end up
> >>>>>> with stale VFs. Regarding XSM, assume that a domain with permission to
> >>>>>> remove the PF is also allowed to remove the VFs. We should probably also
> >>>>>> return an error from pci_remove_device in the case of removing the PF
> >>>>>> with VFs still present (and still perform the removals despite returning
> >>>>>> an error). Subsequent attempts by a domain to remove the VFs would
> >>>>>> return an error (as they have already been removed), but that's expected
> >>>>>> since we've taken a stance that PF-then-VF removal order is invalid
> >>>>>> anyway.
> >>>>>
> >>>>> Imo going back is not an option.
> >>>>>
> >>>>>> While the above is what I prefer, I just want to mention other options I
> >>>>>> considered for the scenario of PF removal with VFs still present:
> >>>>>>
> >>>>>> * Increase the "scariness" of the warning message added in v6.
> >>>>>>
> >>>>>> * Return an error from pci_remove_device (while still removing only the
> >>>>>> PF). We would be left with stale VFs in Xen. At least this would
> >>>>>> concretely inform dom0 that Xen takes issue with the PF-then-VF removal
> >>>>>> order. Subsequent attempts by a domain to remove VFs, however
> >>>>>> (un)likely, would succeed.
> >>>>>
> >>>>> Returning an error in such a case is a possibility, but comes with the
> >>>>> risk of confusion. Seeing such an error, a caller may itself assume the
> >>>>> device still is there, and retry its (with or without having removed the
> >>>>> VFs) removal at a later point.
> >>>>>
> >>>>>> * Return an error from pci_remove_device and keep the PF and VFs. This
> >>>>>> is IMO the worst option because then we would have a stale PF in
> >>>>>> addition to stale VFs.
> >>
> >> I'm thinking probably this is the least bad option, and just force the
> >> owner of the PF to ensure there are no VFs left when removing the PF.
> >>
> >> What sense does it make anyway to allow removing a PF with VFs still
> >> present?  Not sure exactly what the owner of the PF will do before
> >> calling pci_remove_device(), but it would seem to me the device should
> >> be ready for unplug (so SR-IOV disabled).  Calling pci_remove_device()
> >> with VFs still active points to an error to do proper cleanup by the
> >> owner of the PF.
> > 
> > In normal, correct operation, right. The PF driver is indeed expected to
> > disable SR-IOV (i.e. remove VFs) during its removal, prior to calling
> > PHYSDEVOP_pci_device_remove for the PF.
> > 
> >> Returning error from pci_remove_device() and doing nothing would seem
> >> fine to me.  There should be no stale PF or VFs in that case, as the
> >> caller has been notified the device has failed to be removed, so
> >> should treat the device as still present.
> 
> Imo really that's another case that would best be addressed by proper
> ref-counting. Each VF would hold a ref to its PF, and hence the PF would
> go away when the last VF is removed, or when PF removal is (properly)
> last. Just that this likely is too complex a change to be warranted for
> the purpose here.
> 
> > But software has no way to guarantee there won't be a physical device
> > removal.
> > 
> > In test scenario #2 described in the first patch [1], the PF (the whole
> > device, actually) has already been physically unplugged, and dom0
> > invokes PHYSDEVOP_pci_device_remove to inform Xen about it.
> 
> I don't think that's how it's supposed to work. Physical removal should
> occur only after software has done all "soft removal". I'd view the
> notification to Xen as part of that.

That would be my expectation also, albeit IIRC we had other instances
where the calling of the removal hypercall was not done in a vetoing
manner by Linux, so it was mostly a notification that the device was
going away, rather than a request for permission to removal.

Thanks, Roger.


^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH v6 2/3] xen/pci: introduce PF<->VF links
  2024-11-12  9:02       ` Roger Pau Monné
@ 2024-11-12  9:39         ` Jan Beulich
  2024-11-12 17:41           ` Stewart Hildebrand
  0 siblings, 1 reply; 30+ messages in thread
From: Jan Beulich @ 2024-11-12  9:39 UTC (permalink / raw)
  To: Roger Pau Monné
  Cc: xen-devel, Andrew Cooper, Julien Grall, Stefano Stabellini,
	Stewart Hildebrand

On 12.11.2024 10:02, Roger Pau Monné wrote:
> On Mon, Nov 11, 2024 at 03:07:28PM -0500, Stewart Hildebrand wrote:
>> On 10/28/24 14:41, Roger Pau Monné wrote:
>>> if ( !pdev->info.is_virtfn && !list_empty(&pdev->vf_list) )
>>> {
>>>     struct pci_dev *vf_pdev;
>>>
>>>     while ( (vf_pdev = list_first_entry_or_null(&pdev->vf_list,
>>>                                                 struct pci_dev,
>>> 						vf_list)) != NULL )
>>>     {
>>>         list_del(&vf_pdev->vf_list);
>>>         vf_pdev->virtfn.pf_pdev = NULL;
>>>         vf_pdev->broken = true;
>>>     }
>>>
>>>     printk(XENLOG_WARNING "PCI SR-IOV PF %pp removed with VFs still present\n",
>>>            &pdev->sbdf);
>>> }
>>
>> Yeah. Given that the consensus is leaning toward keeping the PF and
>> returning an error, here's my suggestion:
>>
>>     if ( !pdev->info.is_virtfn && !list_empty(&pdev->vf_list) )
>>     {
>>         struct pci_dev *vf_pdev;
>>
>>         list_for_each_entry(vf_pdev, &pdev->vf_list, vf_list)
>>             vf_pdev->broken = true;
>>
>>         pdev->broken = true;
> 
> Do you need to mark the devices as broken?  My expectation would be
> that returning -EBUSY here should prevent the device from being
> removed, and hence there would be no breakage, just failure to fulfill
> the (possible) hot-unplug request.

That very much depends on Dom0 kernels then actually respecting the error,
and not considering the underlying hypercall a mere notification.

Jan


^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH v6 2/3] xen/pci: introduce PF<->VF links
  2024-11-11 20:07     ` Stewart Hildebrand
  2024-11-12  9:02       ` Roger Pau Monné
@ 2024-11-12  9:42       ` Jan Beulich
  1 sibling, 0 replies; 30+ messages in thread
From: Jan Beulich @ 2024-11-12  9:42 UTC (permalink / raw)
  To: Stewart Hildebrand
  Cc: xen-devel, Andrew Cooper, Julien Grall, Stefano Stabellini,
	Roger Pau Monné

On 11.11.2024 21:07, Stewart Hildebrand wrote:
> On 10/28/24 14:41, Roger Pau Monné wrote:
>> I see previous code also catered for VFs being added without the PF
>> being present, so I assume there was some need for this.
> 
> This is exactly the source of the confusion on my part. In the removal
> path, the consensus seems to be that removing a PF with VFs still
> present indicates an error. Then shouldn't the opposite also be true?
> Adding a VF without a PF should also indicate an error.
> 
> I see the PF-adding logic was added in 942a6f1376d8 ("x86/PCI-MSI:
> properly determine VF BAR values"). Searching the mailing list archives
> didn't reveal much about it [0].
> 
> [0] https://lore.kernel.org/xen-devel/4E3FC6E102000078000501CA@nat28.tlf.novell.com/
> 
> The only time I've observed this path being taken is by manually
> calling PHYSDEVOP_pci_device_add. I've resorted to calling
> PHYSDEVOP_pci_device_{remove,add} from userspace in order to test this,
> because the Linux kernel doesn't behave this way.

The goal was to avoid returning an error when we don't strictly need to.
With the overall adjustment ...

> I can't think of a good rationale for catering to VFs being added
> without a PF, so I'll turn it into an error for the next rev.

... changing this may indeed result in better overall consistency.

Jan


^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH v6 2/3] xen/pci: introduce PF<->VF links
  2024-11-12  9:39         ` Jan Beulich
@ 2024-11-12 17:41           ` Stewart Hildebrand
  0 siblings, 0 replies; 30+ messages in thread
From: Stewart Hildebrand @ 2024-11-12 17:41 UTC (permalink / raw)
  To: Jan Beulich, Roger Pau Monné
  Cc: xen-devel, Andrew Cooper, Julien Grall, Stefano Stabellini

On 11/12/24 04:39, Jan Beulich wrote:
> On 12.11.2024 10:02, Roger Pau Monné wrote:
>> On Mon, Nov 11, 2024 at 03:07:28PM -0500, Stewart Hildebrand wrote:
>>> On 10/28/24 14:41, Roger Pau Monné wrote:
>>>> if ( !pdev->info.is_virtfn && !list_empty(&pdev->vf_list) )
>>>> {
>>>>     struct pci_dev *vf_pdev;
>>>>
>>>>     while ( (vf_pdev = list_first_entry_or_null(&pdev->vf_list,
>>>>                                                 struct pci_dev,
>>>> 						vf_list)) != NULL )
>>>>     {
>>>>         list_del(&vf_pdev->vf_list);
>>>>         vf_pdev->virtfn.pf_pdev = NULL;
>>>>         vf_pdev->broken = true;
>>>>     }
>>>>
>>>>     printk(XENLOG_WARNING "PCI SR-IOV PF %pp removed with VFs still present\n",
>>>>            &pdev->sbdf);
>>>> }
>>>
>>> Yeah. Given that the consensus is leaning toward keeping the PF and
>>> returning an error, here's my suggestion:
>>>
>>>     if ( !pdev->info.is_virtfn && !list_empty(&pdev->vf_list) )
>>>     {
>>>         struct pci_dev *vf_pdev;
>>>
>>>         list_for_each_entry(vf_pdev, &pdev->vf_list, vf_list)
>>>             vf_pdev->broken = true;
>>>
>>>         pdev->broken = true;
>>
>> Do you need to mark the devices as broken?  My expectation would be
>> that returning -EBUSY here should prevent the device from being
>> removed, and hence there would be no breakage, just failure to fulfill
>> the (possible) hot-unplug request.
> 
> That very much depends on Dom0 kernels then actually respecting the error,
> and not considering the underlying hypercall a mere notification.

All dom0 Linux does is print a warning:

# echo 1 > /sys/bus/pci/devices/0000\:01\:00.0/sriov_numvfs
# echo 1 > /sys/bus/pci/devices/0000\:01\:00.0/remove
[   56.738750]  0000:01:00.0: driver left SR-IOV enabled after remove
(XEN) Attempted to remove PCI SR-IOV PF 0000:01:00.0 with VFs still present
[   56.749904] pci 0000:01:00.0: Failed to delete - passthrough or MSI/MSI-X might fail!
# echo $?
0

Subsequently, lspci reveals no entry for 0000:01:00.0. I think it's
appropriate to mark them broken.


^ permalink raw reply	[flat|nested] 30+ messages in thread

end of thread, other threads:[~2024-11-12 17:41 UTC | newest]

Thread overview: 30+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2024-10-18 20:39 [PATCH v6 0/3] xen: SR-IOV fixes Stewart Hildebrand
2024-10-18 20:39 ` [PATCH v6 1/3] x86/msi: harden stale pdev handling Stewart Hildebrand
2024-10-28 16:58   ` Jan Beulich
2024-10-28 17:53     ` Roger Pau Monné
2024-10-31 11:29       ` Jan Beulich
2024-10-18 20:39 ` [PATCH v6 2/3] xen/pci: introduce PF<->VF links Stewart Hildebrand
2024-10-28 17:02   ` Jan Beulich
2024-11-01 20:16     ` Stewart Hildebrand
2024-11-02 15:18       ` Daniel P. Smith
2024-11-04 11:45         ` Alejandro Vallejo
2024-11-05  9:08           ` Roger Pau Monné
2024-11-06 16:20             ` Daniel P. Smith
2024-11-06 17:04           ` Daniel P. Smith
2024-11-05  9:03         ` Roger Pau Monné
2024-11-06 16:31           ` Daniel P. Smith
2024-11-04  7:44       ` Jan Beulich
2024-11-07 14:32         ` Stewart Hildebrand
2024-11-08 12:42         ` Alejandro Vallejo
2024-11-08 13:17           ` Jan Beulich
2024-11-08 15:19             ` Roger Pau Monné
2024-11-08 16:29               ` Stewart Hildebrand
2024-11-11  6:40                 ` Jan Beulich
2024-11-12  9:05                   ` Roger Pau Monné
2024-10-28 18:41   ` Roger Pau Monné
2024-11-11 20:07     ` Stewart Hildebrand
2024-11-12  9:02       ` Roger Pau Monné
2024-11-12  9:39         ` Jan Beulich
2024-11-12 17:41           ` Stewart Hildebrand
2024-11-12  9:42       ` Jan Beulich
2024-10-18 20:39 ` [PATCH v6 3/3] x86/msi: fix locking for SR-IOV devices Stewart Hildebrand

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.