[PATCH v12 00/15] PCI devices passthrough on Arm, part 3

All of lore.kernel.org
 help / color / mirror / Atom feed

* [PATCH v12 00/15] PCI devices passthrough on Arm, part 3
@ 2024-01-09 21:51 Stewart Hildebrand
  2024-01-09 21:51 ` [PATCH v12 01/15] vpci: use per-domain PCI lock to protect vpci structure Stewart Hildebrand
                   ` (14 more replies)
  0 siblings, 15 replies; 68+ messages in thread
From: Stewart Hildebrand @ 2024-01-09 21:51 UTC (permalink / raw)
  To: xen-devel
  Cc: Stewart Hildebrand, Jan Beulich, Andrew Cooper,
	Roger Pau Monné, Wei Liu, George Dunlap, Julien Grall,
	Stefano Stabellini, Jun Nakajima, Kevin Tian, Paul Durrant,
	Daniel P. Smith, Bertrand Marquis, Michal Orzel,
	Volodymyr Babchuk

This is next version of vPCI rework. Aim of this series is to prepare
ground for introducing PCI support on ARM platform.

in v12:
 - I (Stewart) coordinated with Volodomyr to send this whole series. So,
   add my (Stewart) Signed-off-by to all patches.
 - The biggest change is to re-work the PCI_COMMAND register patch.
   Additional feedback has also been addressed - see individual patches.
 - Drop ("pci: msi: pass pdev to pci_enable_msi() function") and
   ("pci: introduce per-domain PCI rwlock") as they were committed
 - Rename ("rangeset: add rangeset_empty() function")
       to ("rangeset: add rangeset_purge() function")
 - Rename ("vpci/header: rework exit path in init_bars")
       to ("vpci/header: rework exit path in init_header()")

in v11:
 - Added my (Volodymyr) Signed-off-by tag to all patches
 - Patch "vpci/header: emulate PCI_COMMAND register for guests" is in
   intermediate state, because it was agreed to rework it once Stewart's
   series on register handling are in.
 - Addressed comments, please see patch descriptions for details.

in v10:

 - Removed patch ("xen/arm: vpci: check guest range"), proper fix
   for the issue is part of ("vpci/header: emulate PCI_COMMAND
   register for guests")
 - Removed patch ("pci/header: reset the command register when adding
   devices")
 - Added patch ("rangeset: add rangeset_empty() function") because
   this function is needed in ("vpci/header: handle p2m range sets
   per BAR")
 - Added ("vpci/header: handle p2m range sets per BAR") which addressed
   an issue discovered by Andrii Chepurnyi during virtio integration
 - Added ("pci: msi: pass pdev to pci_enable_msi() function"), which is
   prereq for ("pci: introduce per-domain PCI rwlock")
 - Fixed "Since v9/v8/... " comments in changelogs to reduce confusion.
   I left "Since" entries for older versions, because they were added
   by original author of the patches.

in v9:

v9 includes addressed commentes from a previous one. Also it
introduces a couple patches from Stewart. This patches are related to
vPCI use on ARM. Patch "vpci/header: rework exit path in init_bars"
was factored-out from "vpci/header: handle p2m range sets per BAR".

in v8:

The biggest change from previous, mistakenly named, v7 series is how
locking is implemented. Instead of d->vpci_rwlock we introduce
d->pci_lock which has broader scope, as it protects not only domain's
vpci state, but domain's list of PCI devices as well.

As we discussed in IRC with Roger, it is not feasible to rework all
the existing code to use the new lock right away. It was agreed that
any write access to d->pdev_list will be protected by **both**
d->pci_lock in write mode and pcidevs_lock(). Read access on other
hand should be protected by either d->pci_lock in read mode or
pcidevs_lock(). It is expected that existing code will use
pcidevs_lock() and new users will use new rw lock. Of course, this
does not mean that new users shall not use pcidevs_lock() when it is
appropriate.

Changes from previous versions are described in each separate patch.

Oleksandr Andrushchenko (11):
  vpci: use per-domain PCI lock to protect vpci structure
  vpci: restrict unhandled read/write operations for guests
  vpci: add hooks for PCI device assign/de-assign
  vpci/header: implement guest BAR register handlers
  rangeset: add RANGESETF_no_print flag
  vpci/header: handle p2m range sets per BAR
  vpci/header: program p2m with guest BAR view
  vpci/header: emulate PCI_COMMAND register for guests
  vpci: add initial support for virtual PCI bus topology
  xen/arm: translate virtual PCI bus topology for guests
  xen/arm: account IO handlers for emulated PCI MSI-X

Stewart Hildebrand (1):
  xen/arm: vpci: permit access to guest vpci space

Volodymyr Babchuk (3):
  vpci/header: rework exit path in init_header()
  rangeset: add rangeset_purge() function
  arm/vpci: honor access size when returning an error

 xen/arch/arm/vpci.c           |  72 ++++-
 xen/arch/x86/hvm/vmsi.c       |  22 +-
 xen/arch/x86/hvm/vmx/vmx.c    |   2 +-
 xen/arch/x86/irq.c            |   8 +-
 xen/arch/x86/msi.c            |  14 +-
 xen/arch/x86/physdev.c        |   2 +
 xen/common/domain.c           |  12 +-
 xen/common/rangeset.c         |  21 +-
 xen/drivers/Kconfig           |   4 +
 xen/drivers/passthrough/pci.c |  32 ++-
 xen/drivers/vpci/header.c     | 495 +++++++++++++++++++++++++++-------
 xen/drivers/vpci/msi.c        |  37 ++-
 xen/drivers/vpci/msix.c       |  59 +++-
 xen/drivers/vpci/vpci.c       | 129 ++++++++-
 xen/include/xen/pci_regs.h    |   1 +
 xen/include/xen/rangeset.h    |   8 +-
 xen/include/xen/sched.h       |  11 +-
 xen/include/xen/vpci.h        |  44 ++-
 18 files changed, 798 insertions(+), 175 deletions(-)


base-commit: c27c8922f2c6995d688437b0758cec6a27d18320
-- 
2.43.0



^ permalink raw reply	[flat|nested] 68+ messages in thread

* [PATCH v12 01/15] vpci: use per-domain PCI lock to protect vpci structure
  2024-01-09 21:51 [PATCH v12 00/15] PCI devices passthrough on Arm, part 3 Stewart Hildebrand
@ 2024-01-09 21:51 ` Stewart Hildebrand
  2024-01-12 13:48   ` Roger Pau Monné
  2024-01-15 19:43   ` [PATCH v12.2 " Stewart Hildebrand
  2024-01-09 21:51 ` [PATCH v12 02/15] vpci: restrict unhandled read/write operations for guests Stewart Hildebrand
                   ` (13 subsequent siblings)
  14 siblings, 2 replies; 68+ messages in thread
From: Stewart Hildebrand @ 2024-01-09 21:51 UTC (permalink / raw)
  To: xen-devel
  Cc: Oleksandr Andrushchenko, Jan Beulich, Andrew Cooper,
	Roger Pau Monné, Wei Liu, George Dunlap, Julien Grall,
	Stefano Stabellini, Jun Nakajima, Kevin Tian, Paul Durrant,
	Volodymyr Babchuk, Stewart Hildebrand

From: Oleksandr Andrushchenko <oleksandr_andrushchenko@epam.com>

Use a previously introduced per-domain read/write lock to check
whether vpci is present, so we are sure there are no accesses to the
contents of the vpci struct if not. This lock can be used (and in a
few cases is used right away) so that vpci removal can be performed
while holding the lock in write mode. Previously such removal could
race with vpci_read for example.

When taking both d->pci_lock and pdev->vpci->lock, they should be
taken in this exact order: d->pci_lock then pdev->vpci->lock to avoid
possible deadlock situations.

1. Per-domain's pci_lock is used to protect pdev->vpci structure
from being removed.

2. Writing the command register and ROM BAR register may trigger
modify_bars to run, which in turn may access multiple pdevs while
checking for the existing BAR's overlap. The overlapping check, if
done under the read lock, requires vpci->lock to be acquired on both
devices being compared, which may produce a deadlock. It is not
possible to upgrade read lock to write lock in such a case. So, in
order to prevent the deadlock, use d->pci_lock instead.

All other code, which doesn't lead to pdev->vpci destruction and does
not access multiple pdevs at the same time, can still use a
combination of the read lock and pdev->vpci->lock.

3. Drop const qualifier where the new rwlock is used and this is
appropriate.

4. Do not call process_pending_softirqs with any locks held. For that
unlock prior the call and re-acquire the locks after. After
re-acquiring the lock there is no need to check if pdev->vpci exists:
 - in apply_map because of the context it is called (no race condition
   possible)
 - for MSI/MSI-X debug code because it is called at the end of
   pdev->vpci access and no further access to pdev->vpci is made

5. Use d->pci_lock around for_each_pdev and pci_get_pdev_by_domain
while accessing pdevs in vpci code.

Suggested-by: Roger Pau Monné <roger.pau@citrix.com>
Suggested-by: Jan Beulich <jbeulich@suse.com>
Signed-off-by: Oleksandr Andrushchenko <oleksandr_andrushchenko@epam.com>
Signed-off-by: Volodymyr Babchuk <volodymyr_babchuk@epam.com>
Signed-off-by: Stewart Hildebrand <stewart.hildebrand@amd.com>
---
Changes in v12:
 - s/pci_rwlock/pci_lock/ in commit message
 - expand comment about scope of pci_lock in sched.h
 - in vpci_{read,write}, if hwdom is trying to access a device assigned
   to dom_xen, holding hwdom->pci_lock is sufficient (no need to hold
   dom_xen->pci_lock)
 - reintroduce ASSERT in vmx_pi_update_irte()
 - reintroduce ASSERT in __pci_enable_msi{x}()
 - delete note 6. in commit message about removing ASSERTs since we have
   reintroduced them

Changes in v11:
 - Fixed commit message regarding possible spinlocks
 - Removed parameter from allocate_and_map_msi_pirq(), which was added
 in the prev version. Now we are taking pcidevs_lock in
 physdev_map_pirq()
 - Returned ASSERT to pci_enable_msi
 - Fixed case when we took read lock instead of write one
 - Fixed label indentation

Changes in v10:
 - Moved printk pas locked area
 - Returned back ASSERTs
 - Added new parameter to allocate_and_map_msi_pirq() so it knows if
 it should take the global pci lock
 - Added comment about possible improvement in vpci_write
 - Changed ASSERT(rw_is_locked()) to rw_is_write_locked() in
   appropriate places
 - Renamed release_domain_locks() to release_domain_write_locks()
 - moved domain_done label in vpci_dump_msi() to correct place
Changes in v9:
 - extended locked region to protect vpci_remove_device and
   vpci_add_handlers() calls
 - vpci_write() takes lock in the write mode to protect
   potential call to modify_bars()
 - renamed lock releasing function
 - removed ASSERT()s from msi code
 - added trylock in vpci_dump_msi

Changes in v8:
 - changed d->vpci_lock to d->pci_lock
 - introducing d->pci_lock in a separate patch
 - extended locked region in vpci_process_pending
 - removed pcidevs_lockis vpci_dump_msi()
 - removed some changes as they are not needed with
   the new locking scheme
 - added handling for hwdom && dom_xen case
---
 xen/arch/x86/hvm/vmsi.c       | 22 +++++++--------
 xen/arch/x86/hvm/vmx/vmx.c    |  2 +-
 xen/arch/x86/irq.c            |  8 +++---
 xen/arch/x86/msi.c            | 14 +++++-----
 xen/arch/x86/physdev.c        |  2 ++
 xen/drivers/passthrough/pci.c |  9 +++---
 xen/drivers/vpci/header.c     | 18 ++++++++++++
 xen/drivers/vpci/msi.c        | 28 +++++++++++++++++--
 xen/drivers/vpci/msix.c       | 52 ++++++++++++++++++++++++++++++-----
 xen/drivers/vpci/vpci.c       | 28 ++++++++++++++++---
 xen/include/xen/sched.h       |  3 +-
 11 files changed, 144 insertions(+), 42 deletions(-)

diff --git a/xen/arch/x86/hvm/vmsi.c b/xen/arch/x86/hvm/vmsi.c
index 128f23636279..03caf91beefc 100644
--- a/xen/arch/x86/hvm/vmsi.c
+++ b/xen/arch/x86/hvm/vmsi.c
@@ -468,7 +468,7 @@ int msixtbl_pt_register(struct domain *d, struct pirq *pirq, uint64_t gtable)
     struct msixtbl_entry *entry, *new_entry;
     int r = -EINVAL;
 
-    ASSERT(pcidevs_locked());
+    ASSERT(pcidevs_locked() || rw_is_locked(&d->pci_lock));
     ASSERT(rw_is_write_locked(&d->event_lock));
 
     if ( !msixtbl_initialised(d) )
@@ -538,7 +538,7 @@ void msixtbl_pt_unregister(struct domain *d, struct pirq *pirq)
     struct pci_dev *pdev;
     struct msixtbl_entry *entry;
 
-    ASSERT(pcidevs_locked());
+    ASSERT(pcidevs_locked() || rw_is_locked(&d->pci_lock));
     ASSERT(rw_is_write_locked(&d->event_lock));
 
     if ( !msixtbl_initialised(d) )
@@ -684,7 +684,7 @@ static int vpci_msi_update(const struct pci_dev *pdev, uint32_t data,
 {
     unsigned int i;
 
-    ASSERT(pcidevs_locked());
+    ASSERT(rw_is_locked(&pdev->domain->pci_lock));
 
     if ( (address & MSI_ADDR_BASE_MASK) != MSI_ADDR_HEADER )
     {
@@ -725,8 +725,8 @@ void vpci_msi_arch_update(struct vpci_msi *msi, const struct pci_dev *pdev)
     int rc;
 
     ASSERT(msi->arch.pirq != INVALID_PIRQ);
+    ASSERT(rw_is_locked(&pdev->domain->pci_lock));
 
-    pcidevs_lock();
     for ( i = 0; i < msi->vectors && msi->arch.bound; i++ )
     {
         struct xen_domctl_bind_pt_irq unbind = {
@@ -745,7 +745,6 @@ void vpci_msi_arch_update(struct vpci_msi *msi, const struct pci_dev *pdev)
 
     msi->arch.bound = !vpci_msi_update(pdev, msi->data, msi->address,
                                        msi->vectors, msi->arch.pirq, msi->mask);
-    pcidevs_unlock();
 }
 
 static int vpci_msi_enable(const struct pci_dev *pdev, unsigned int nr,
@@ -778,15 +777,14 @@ int vpci_msi_arch_enable(struct vpci_msi *msi, const struct pci_dev *pdev,
     int rc;
 
     ASSERT(msi->arch.pirq == INVALID_PIRQ);
+    ASSERT(rw_is_locked(&pdev->domain->pci_lock));
     rc = vpci_msi_enable(pdev, vectors, 0);
     if ( rc < 0 )
         return rc;
     msi->arch.pirq = rc;
 
-    pcidevs_lock();
     msi->arch.bound = !vpci_msi_update(pdev, msi->data, msi->address, vectors,
                                        msi->arch.pirq, msi->mask);
-    pcidevs_unlock();
 
     return 0;
 }
@@ -797,8 +795,8 @@ static void vpci_msi_disable(const struct pci_dev *pdev, int pirq,
     unsigned int i;
 
     ASSERT(pirq != INVALID_PIRQ);
+    ASSERT(rw_is_locked(&pdev->domain->pci_lock));
 
-    pcidevs_lock();
     for ( i = 0; i < nr && bound; i++ )
     {
         struct xen_domctl_bind_pt_irq bind = {
@@ -814,7 +812,6 @@ static void vpci_msi_disable(const struct pci_dev *pdev, int pirq,
     write_lock(&pdev->domain->event_lock);
     unmap_domain_pirq(pdev->domain, pirq);
     write_unlock(&pdev->domain->event_lock);
-    pcidevs_unlock();
 }
 
 void vpci_msi_arch_disable(struct vpci_msi *msi, const struct pci_dev *pdev)
@@ -854,6 +851,7 @@ int vpci_msix_arch_enable_entry(struct vpci_msix_entry *entry,
     int rc;
 
     ASSERT(entry->arch.pirq == INVALID_PIRQ);
+    ASSERT(rw_is_locked(&pdev->domain->pci_lock));
     rc = vpci_msi_enable(pdev, vmsix_entry_nr(pdev->vpci->msix, entry),
                          table_base);
     if ( rc < 0 )
@@ -861,7 +859,6 @@ int vpci_msix_arch_enable_entry(struct vpci_msix_entry *entry,
 
     entry->arch.pirq = rc;
 
-    pcidevs_lock();
     rc = vpci_msi_update(pdev, entry->data, entry->addr, 1, entry->arch.pirq,
                          entry->masked);
     if ( rc )
@@ -869,7 +866,6 @@ int vpci_msix_arch_enable_entry(struct vpci_msix_entry *entry,
         vpci_msi_disable(pdev, entry->arch.pirq, 1, false);
         entry->arch.pirq = INVALID_PIRQ;
     }
-    pcidevs_unlock();
 
     return rc;
 }
@@ -895,6 +891,8 @@ int vpci_msix_arch_print(const struct vpci_msix *msix)
 {
     unsigned int i;
 
+    ASSERT(rw_is_locked(&msix->pdev->domain->pci_lock));
+
     for ( i = 0; i < msix->max_entries; i++ )
     {
         const struct vpci_msix_entry *entry = &msix->entries[i];
@@ -913,7 +911,9 @@ int vpci_msix_arch_print(const struct vpci_msix *msix)
             struct pci_dev *pdev = msix->pdev;
 
             spin_unlock(&msix->pdev->vpci->lock);
+            read_unlock(&pdev->domain->pci_lock);
             process_pending_softirqs();
+            read_lock(&pdev->domain->pci_lock);
             /* NB: we assume that pdev cannot go away for an alive domain. */
             if ( !pdev->vpci || !spin_trylock(&pdev->vpci->lock) )
                 return -EBUSY;
diff --git a/xen/arch/x86/hvm/vmx/vmx.c b/xen/arch/x86/hvm/vmx/vmx.c
index 8ff675883c2b..890faef48b82 100644
--- a/xen/arch/x86/hvm/vmx/vmx.c
+++ b/xen/arch/x86/hvm/vmx/vmx.c
@@ -413,7 +413,7 @@ static int cf_check vmx_pi_update_irte(const struct vcpu *v,
 
     spin_unlock_irq(&desc->lock);
 
-    ASSERT(pcidevs_locked());
+    ASSERT(pcidevs_locked() || rw_is_locked(&msi_desc->dev->domain->pci_lock));
 
     return iommu_update_ire_from_msi(msi_desc, &msi_desc->msg);
 
diff --git a/xen/arch/x86/irq.c b/xen/arch/x86/irq.c
index 50e49e1a4b6f..4d89d9af99fe 100644
--- a/xen/arch/x86/irq.c
+++ b/xen/arch/x86/irq.c
@@ -2166,7 +2166,7 @@ int map_domain_pirq(
         struct pci_dev *pdev;
         unsigned int nr = 0;
 
-        ASSERT(pcidevs_locked());
+        ASSERT(pcidevs_locked() || rw_is_locked(&d->pci_lock));
 
         ret = -ENODEV;
         if ( !cpu_has_apic )
@@ -2323,7 +2323,7 @@ int unmap_domain_pirq(struct domain *d, int pirq)
     if ( (pirq < 0) || (pirq >= d->nr_pirqs) )
         return -EINVAL;
 
-    ASSERT(pcidevs_locked());
+    ASSERT(pcidevs_locked() || rw_is_locked(&d->pci_lock));
     ASSERT(rw_is_write_locked(&d->event_lock));
 
     info = pirq_info(d, pirq);
@@ -2888,6 +2888,8 @@ int allocate_and_map_msi_pirq(struct domain *d, int index, int *pirq_p,
 {
     int irq, pirq, ret;
 
+    ASSERT(pcidevs_locked() || rw_is_locked(&d->pci_lock));
+
     switch ( type )
     {
     case MAP_PIRQ_TYPE_MSI:
@@ -2917,7 +2919,6 @@ int allocate_and_map_msi_pirq(struct domain *d, int index, int *pirq_p,
 
     msi->irq = irq;
 
-    pcidevs_lock();
     /* Verify or get pirq. */
     write_lock(&d->event_lock);
     pirq = allocate_pirq(d, index, *pirq_p, irq, type, &msi->entry_nr);
@@ -2933,7 +2934,6 @@ int allocate_and_map_msi_pirq(struct domain *d, int index, int *pirq_p,
 
  done:
     write_unlock(&d->event_lock);
-    pcidevs_unlock();
     if ( ret )
     {
         switch ( type )
diff --git a/xen/arch/x86/msi.c b/xen/arch/x86/msi.c
index 335c0868a225..7da2affa2e82 100644
--- a/xen/arch/x86/msi.c
+++ b/xen/arch/x86/msi.c
@@ -602,7 +602,7 @@ static int msi_capability_init(struct pci_dev *dev,
     unsigned int i, mpos;
     uint16_t control;
 
-    ASSERT(pcidevs_locked());
+    ASSERT(pcidevs_locked() || rw_is_locked(&dev->domain->pci_lock));
     pos = pci_find_cap_offset(dev->sbdf, PCI_CAP_ID_MSI);
     if ( !pos )
         return -ENODEV;
@@ -771,7 +771,7 @@ static int msix_capability_init(struct pci_dev *dev,
     if ( !pos )
         return -ENODEV;
 
-    ASSERT(pcidevs_locked());
+    ASSERT(pcidevs_locked() || rw_is_locked(&dev->domain->pci_lock));
 
     control = pci_conf_read16(dev->sbdf, msix_control_reg(pos));
     /*
@@ -988,11 +988,11 @@ static int __pci_enable_msi(struct pci_dev *pdev, struct msi_info *msi,
 {
     struct msi_desc *old_desc;
 
-    ASSERT(pcidevs_locked());
-
     if ( !pdev )
         return -ENODEV;
 
+    ASSERT(pcidevs_locked() || rw_is_locked(&pdev->domain->pci_lock));
+
     old_desc = find_msi_entry(pdev, msi->irq, PCI_CAP_ID_MSI);
     if ( old_desc )
     {
@@ -1043,11 +1043,11 @@ static int __pci_enable_msix(struct pci_dev *pdev, struct msi_info *msi,
 {
     struct msi_desc *old_desc;
 
-    ASSERT(pcidevs_locked());
-
     if ( !pdev || !pdev->msix )
         return -ENODEV;
 
+    ASSERT(pcidevs_locked() || rw_is_locked(&pdev->domain->pci_lock));
+
     if ( msi->entry_nr >= pdev->msix->nr_entries )
         return -EINVAL;
 
@@ -1154,7 +1154,7 @@ int pci_prepare_msix(u16 seg, u8 bus, u8 devfn, bool off)
 int pci_enable_msi(struct pci_dev *pdev, struct msi_info *msi,
                    struct msi_desc **desc)
 {
-    ASSERT(pcidevs_locked());
+    ASSERT(pcidevs_locked() || rw_is_locked(&pdev->domain->pci_lock));
 
     if ( !use_msi )
         return -EPERM;
diff --git a/xen/arch/x86/physdev.c b/xen/arch/x86/physdev.c
index 47c4da0af7e1..369c9e788c1c 100644
--- a/xen/arch/x86/physdev.c
+++ b/xen/arch/x86/physdev.c
@@ -123,7 +123,9 @@ int physdev_map_pirq(domid_t domid, int type, int *index, int *pirq_p,
 
     case MAP_PIRQ_TYPE_MSI:
     case MAP_PIRQ_TYPE_MULTI_MSI:
+        pcidevs_lock();
         ret = allocate_and_map_msi_pirq(d, *index, pirq_p, type, msi);
+        pcidevs_unlock();
         break;
 
     default:
diff --git a/xen/drivers/passthrough/pci.c b/xen/drivers/passthrough/pci.c
index 1439d1ef2b26..3a973324bca1 100644
--- a/xen/drivers/passthrough/pci.c
+++ b/xen/drivers/passthrough/pci.c
@@ -750,7 +750,6 @@ int pci_add_device(u16 seg, u8 bus, u8 devfn,
         pdev->domain = hardware_domain;
         write_lock(&hardware_domain->pci_lock);
         list_add(&pdev->domain_list, &hardware_domain->pdev_list);
-        write_unlock(&hardware_domain->pci_lock);
 
         /*
          * For devices not discovered by Xen during boot, add vPCI handlers
@@ -759,18 +758,18 @@ int pci_add_device(u16 seg, u8 bus, u8 devfn,
         ret = vpci_add_handlers(pdev);
         if ( ret )
         {
-            printk(XENLOG_ERR "Setup of vPCI failed: %d\n", ret);
-            write_lock(&hardware_domain->pci_lock);
             list_del(&pdev->domain_list);
             write_unlock(&hardware_domain->pci_lock);
             pdev->domain = NULL;
+            printk(XENLOG_ERR "Setup of vPCI failed: %d\n", ret);
             goto out;
         }
+        write_unlock(&hardware_domain->pci_lock);
         ret = iommu_add_device(pdev);
         if ( ret )
         {
-            vpci_remove_device(pdev);
             write_lock(&hardware_domain->pci_lock);
+            vpci_remove_device(pdev);
             list_del(&pdev->domain_list);
             write_unlock(&hardware_domain->pci_lock);
             pdev->domain = NULL;
@@ -1146,7 +1145,9 @@ static void __hwdom_init setup_one_hwdom_device(const struct setup_hwdom *ctxt,
     } while ( devfn != pdev->devfn &&
               PCI_SLOT(devfn) == PCI_SLOT(pdev->devfn) );
 
+    write_lock(&ctxt->d->pci_lock);
     err = vpci_add_handlers(pdev);
+    write_unlock(&ctxt->d->pci_lock);
     if ( err )
         printk(XENLOG_ERR "setup of vPCI for d%d failed: %d\n",
                ctxt->d->domain_id, err);
diff --git a/xen/drivers/vpci/header.c b/xen/drivers/vpci/header.c
index 58195549d50a..8f5850b8cf6d 100644
--- a/xen/drivers/vpci/header.c
+++ b/xen/drivers/vpci/header.c
@@ -173,6 +173,7 @@ bool vpci_process_pending(struct vcpu *v)
         if ( rc == -ERESTART )
             return true;
 
+        write_lock(&v->domain->pci_lock);
         spin_lock(&v->vpci.pdev->vpci->lock);
         /* Disable memory decoding unconditionally on failure. */
         modify_decoding(v->vpci.pdev,
@@ -191,6 +192,7 @@ bool vpci_process_pending(struct vcpu *v)
              * failure.
              */
             vpci_remove_device(v->vpci.pdev);
+        write_unlock(&v->domain->pci_lock);
     }
 
     return false;
@@ -202,8 +204,20 @@ static int __init apply_map(struct domain *d, const struct pci_dev *pdev,
     struct map_data data = { .d = d, .map = true };
     int rc;
 
+    ASSERT(rw_is_write_locked(&d->pci_lock));
+
     while ( (rc = rangeset_consume_ranges(mem, map_range, &data)) == -ERESTART )
+    {
+        /*
+         * It's safe to drop and reacquire the lock in this context
+         * without risking pdev disappearing because devices cannot be
+         * removed until the initial domain has been started.
+         */
+        write_unlock(&d->pci_lock);
         process_pending_softirqs();
+        write_lock(&d->pci_lock);
+    }
+
     rangeset_destroy(mem);
     if ( !rc )
         modify_decoding(pdev, cmd, false);
@@ -244,6 +258,8 @@ static int modify_bars(const struct pci_dev *pdev, uint16_t cmd, bool rom_only)
     unsigned int i;
     int rc;
 
+    ASSERT(rw_is_write_locked(&pdev->domain->pci_lock));
+
     if ( !mem )
         return -ENOMEM;
 
@@ -524,6 +540,8 @@ static int cf_check init_header(struct pci_dev *pdev)
     int rc;
     bool mask_cap_list = false;
 
+    ASSERT(rw_is_write_locked(&pdev->domain->pci_lock));
+
     switch ( pci_conf_read8(pdev->sbdf, PCI_HEADER_TYPE) & 0x7f )
     {
     case PCI_HEADER_TYPE_NORMAL:
diff --git a/xen/drivers/vpci/msi.c b/xen/drivers/vpci/msi.c
index a253ccbd7db7..6ff71e5f9ab7 100644
--- a/xen/drivers/vpci/msi.c
+++ b/xen/drivers/vpci/msi.c
@@ -263,7 +263,7 @@ REGISTER_VPCI_INIT(init_msi, VPCI_PRIORITY_LOW);
 
 void vpci_dump_msi(void)
 {
-    const struct domain *d;
+    struct domain *d;
 
     rcu_read_lock(&domlist_read_lock);
     for_each_domain ( d )
@@ -275,6 +275,9 @@ void vpci_dump_msi(void)
 
         printk("vPCI MSI/MSI-X d%d\n", d->domain_id);
 
+        if ( !read_trylock(&d->pci_lock) )
+            continue;
+
         for_each_pdev ( d, pdev )
         {
             const struct vpci_msi *msi;
@@ -316,14 +319,33 @@ void vpci_dump_msi(void)
                      * holding the lock.
                      */
                     printk("unable to print all MSI-X entries: %d\n", rc);
-                    process_pending_softirqs();
-                    continue;
+                    goto pdev_done;
                 }
             }
 
             spin_unlock(&pdev->vpci->lock);
+        pdev_done:
+            /*
+             * Unlock lock to process pending softirqs. This is
+             * potentially unsafe, as d->pdev_list can be changed in
+             * meantime.
+             */
+            read_unlock(&d->pci_lock);
             process_pending_softirqs();
+            if ( !read_trylock(&d->pci_lock) )
+            {
+                printk("unable to access other devices for the domain\n");
+                goto domain_done;
+            }
         }
+        read_unlock(&d->pci_lock);
+    domain_done:
+        /*
+         * We need this label at the end of the loop, but some
+         * compilers might not be happy about label at the end of the
+         * compound statement so we adding an empty statement here.
+         */
+        ;
     }
     rcu_read_unlock(&domlist_read_lock);
 }
diff --git a/xen/drivers/vpci/msix.c b/xen/drivers/vpci/msix.c
index d1126a417da9..b6abab47efdd 100644
--- a/xen/drivers/vpci/msix.c
+++ b/xen/drivers/vpci/msix.c
@@ -147,6 +147,8 @@ static struct vpci_msix *msix_find(const struct domain *d, unsigned long addr)
 {
     struct vpci_msix *msix;
 
+    ASSERT(rw_is_locked(&d->pci_lock));
+
     list_for_each_entry ( msix, &d->arch.hvm.msix_tables, next )
     {
         const struct vpci_bar *bars = msix->pdev->vpci->header.bars;
@@ -163,7 +165,13 @@ static struct vpci_msix *msix_find(const struct domain *d, unsigned long addr)
 
 static int cf_check msix_accept(struct vcpu *v, unsigned long addr)
 {
-    return !!msix_find(v->domain, addr);
+    int rc;
+
+    read_lock(&v->domain->pci_lock);
+    rc = !!msix_find(v->domain, addr);
+    read_unlock(&v->domain->pci_lock);
+
+    return rc;
 }
 
 static bool access_allowed(const struct pci_dev *pdev, unsigned long addr,
@@ -358,21 +366,35 @@ static int adjacent_read(const struct domain *d, const struct vpci_msix *msix,
 static int cf_check msix_read(
     struct vcpu *v, unsigned long addr, unsigned int len, unsigned long *data)
 {
-    const struct domain *d = v->domain;
-    struct vpci_msix *msix = msix_find(d, addr);
+    struct domain *d = v->domain;
+    struct vpci_msix *msix;
     const struct vpci_msix_entry *entry;
     unsigned int offset;
 
     *data = ~0UL;
 
+    read_lock(&d->pci_lock);
+
+    msix = msix_find(d, addr);
     if ( !msix )
+    {
+        read_unlock(&d->pci_lock);
         return X86EMUL_RETRY;
+    }
 
     if ( adjacent_handle(msix, addr) )
-        return adjacent_read(d, msix, addr, len, data);
+    {
+        int rc = adjacent_read(d, msix, addr, len, data);
+
+        read_unlock(&d->pci_lock);
+        return rc;
+    }
 
     if ( !access_allowed(msix->pdev, addr, len) )
+    {
+        read_unlock(&d->pci_lock);
         return X86EMUL_OKAY;
+    }
 
     spin_lock(&msix->pdev->vpci->lock);
     entry = get_entry(msix, addr);
@@ -404,6 +426,7 @@ static int cf_check msix_read(
         break;
     }
     spin_unlock(&msix->pdev->vpci->lock);
+    read_unlock(&d->pci_lock);
 
     return X86EMUL_OKAY;
 }
@@ -491,19 +514,33 @@ static int adjacent_write(const struct domain *d, const struct vpci_msix *msix,
 static int cf_check msix_write(
     struct vcpu *v, unsigned long addr, unsigned int len, unsigned long data)
 {
-    const struct domain *d = v->domain;
-    struct vpci_msix *msix = msix_find(d, addr);
+    struct domain *d = v->domain;
+    struct vpci_msix *msix;
     struct vpci_msix_entry *entry;
     unsigned int offset;
 
+    read_lock(&d->pci_lock);
+
+    msix = msix_find(d, addr);
     if ( !msix )
+    {
+        read_unlock(&d->pci_lock);
         return X86EMUL_RETRY;
+    }
 
     if ( adjacent_handle(msix, addr) )
-        return adjacent_write(d, msix, addr, len, data);
+    {
+        int rc = adjacent_write(d, msix, addr, len, data);
+
+        read_unlock(&d->pci_lock);
+        return rc;
+    }
 
     if ( !access_allowed(msix->pdev, addr, len) )
+    {
+        read_unlock(&d->pci_lock);
         return X86EMUL_OKAY;
+    }
 
     spin_lock(&msix->pdev->vpci->lock);
     entry = get_entry(msix, addr);
@@ -579,6 +616,7 @@ static int cf_check msix_write(
         break;
     }
     spin_unlock(&msix->pdev->vpci->lock);
+    read_unlock(&d->pci_lock);
 
     return X86EMUL_OKAY;
 }
diff --git a/xen/drivers/vpci/vpci.c b/xen/drivers/vpci/vpci.c
index 72ef277c4f8e..a1a004460491 100644
--- a/xen/drivers/vpci/vpci.c
+++ b/xen/drivers/vpci/vpci.c
@@ -42,6 +42,8 @@ extern vpci_register_init_t *const __end_vpci_array[];
 
 void vpci_remove_device(struct pci_dev *pdev)
 {
+    ASSERT(rw_is_write_locked(&pdev->domain->pci_lock));
+
     if ( !has_vpci(pdev->domain) || !pdev->vpci )
         return;
 
@@ -77,6 +79,8 @@ int vpci_add_handlers(struct pci_dev *pdev)
     const unsigned long *ro_map;
     int rc = 0;
 
+    ASSERT(rw_is_write_locked(&pdev->domain->pci_lock));
+
     if ( !has_vpci(pdev->domain) )
         return 0;
 
@@ -361,7 +365,7 @@ static uint32_t merge_result(uint32_t data, uint32_t new, unsigned int size,
 
 uint32_t vpci_read(pci_sbdf_t sbdf, unsigned int reg, unsigned int size)
 {
-    const struct domain *d = current->domain;
+    struct domain *d = current->domain;
     const struct pci_dev *pdev;
     const struct vpci_register *r;
     unsigned int data_offset = 0;
@@ -375,13 +379,19 @@ uint32_t vpci_read(pci_sbdf_t sbdf, unsigned int reg, unsigned int size)
 
     /*
      * Find the PCI dev matching the address, which for hwdom also requires
-     * consulting DomXEN.  Passthrough everything that's not trapped.
+     * consulting DomXEN. Passthrough everything that's not trapped.
+     * If this is hwdom and the device is assigned to dom_xen, acquiring hwdom's
+     * pci_lock is sufficient.
      */
+    read_lock(&d->pci_lock);
     pdev = pci_get_pdev(d, sbdf);
     if ( !pdev && is_hardware_domain(d) )
         pdev = pci_get_pdev(dom_xen, sbdf);
     if ( !pdev || !pdev->vpci )
+    {
+        read_unlock(&d->pci_lock);
         return vpci_read_hw(sbdf, reg, size);
+    }
 
     spin_lock(&pdev->vpci->lock);
 
@@ -428,6 +438,7 @@ uint32_t vpci_read(pci_sbdf_t sbdf, unsigned int reg, unsigned int size)
         ASSERT(data_offset < size);
     }
     spin_unlock(&pdev->vpci->lock);
+    read_unlock(&d->pci_lock);
 
     if ( data_offset < size )
     {
@@ -470,7 +481,7 @@ static void vpci_write_helper(const struct pci_dev *pdev,
 void vpci_write(pci_sbdf_t sbdf, unsigned int reg, unsigned int size,
                 uint32_t data)
 {
-    const struct domain *d = current->domain;
+    struct domain *d = current->domain;
     const struct pci_dev *pdev;
     const struct vpci_register *r;
     unsigned int data_offset = 0;
@@ -483,8 +494,14 @@ void vpci_write(pci_sbdf_t sbdf, unsigned int reg, unsigned int size,
 
     /*
      * Find the PCI dev matching the address, which for hwdom also requires
-     * consulting DomXEN.  Passthrough everything that's not trapped.
+     * consulting DomXEN. Passthrough everything that's not trapped.
+     * If this is hwdom and the device is assigned to dom_xen, acquiring hwdom's
+     * pci_lock is sufficient.
+     *
+     * TODO: We need to take pci_locks in exclusive mode only if we
+     * are modifying BARs, so there is a room for improvement.
      */
+    write_lock(&d->pci_lock);
     pdev = pci_get_pdev(d, sbdf);
     if ( !pdev && is_hardware_domain(d) )
         pdev = pci_get_pdev(dom_xen, sbdf);
@@ -493,6 +510,8 @@ void vpci_write(pci_sbdf_t sbdf, unsigned int reg, unsigned int size,
         /* Ignore writes to read-only devices, which have no ->vpci. */
         const unsigned long *ro_map = pci_get_ro_map(sbdf.seg);
 
+        write_unlock(&d->pci_lock);
+
         if ( !ro_map || !test_bit(sbdf.bdf, ro_map) )
             vpci_write_hw(sbdf, reg, size, data);
         return;
@@ -534,6 +553,7 @@ void vpci_write(pci_sbdf_t sbdf, unsigned int reg, unsigned int size,
         ASSERT(data_offset < size);
     }
     spin_unlock(&pdev->vpci->lock);
+    write_unlock(&d->pci_lock);
 
     if ( data_offset < size )
         /* Tailing gap, write the remaining. */
diff --git a/xen/include/xen/sched.h b/xen/include/xen/sched.h
index 9da91e0e6244..37f5922f3206 100644
--- a/xen/include/xen/sched.h
+++ b/xen/include/xen/sched.h
@@ -462,7 +462,8 @@ struct domain
 #ifdef CONFIG_HAS_PCI
     struct list_head pdev_list;
     /*
-     * pci_lock protects access to pdev_list.
+     * pci_lock protects access to pdev_list. pci_lock also protects pdev->vpci
+     * structure from being removed.
      *
      * Any user *reading* from pdev_list, or from devices stored in pdev_list,
      * should hold either pcidevs_lock() or pci_lock in read mode. Optionally,
-- 
2.43.0



^ permalink raw reply related	[flat|nested] 68+ messages in thread

* [PATCH v12 02/15] vpci: restrict unhandled read/write operations for guests
  2024-01-09 21:51 [PATCH v12 00/15] PCI devices passthrough on Arm, part 3 Stewart Hildebrand
  2024-01-09 21:51 ` [PATCH v12 01/15] vpci: use per-domain PCI lock to protect vpci structure Stewart Hildebrand
@ 2024-01-09 21:51 ` Stewart Hildebrand
  2024-01-09 21:51 ` [PATCH v12 03/15] vpci: add hooks for PCI device assign/de-assign Stewart Hildebrand
                   ` (12 subsequent siblings)
  14 siblings, 0 replies; 68+ messages in thread
From: Stewart Hildebrand @ 2024-01-09 21:51 UTC (permalink / raw)
  To: xen-devel
  Cc: Oleksandr Andrushchenko, Roger Pau Monné, Volodymyr Babchuk,
	Stewart Hildebrand

From: Oleksandr Andrushchenko <oleksandr_andrushchenko@epam.com>

A guest would be able to read and write those registers which are not
emulated and have no respective vPCI handlers, so it will be possible
for it to access the hardware directly.
In order to prevent a guest from reads and writes from/to the unhandled
registers make sure only hardware domain can access the hardware directly
and restrict guests from doing so.

Suggested-by: Roger Pau Monné <roger.pau@citrix.com>
Signed-off-by: Oleksandr Andrushchenko <oleksandr_andrushchenko@epam.com>
Signed-off-by: Volodymyr Babchuk <volodymyr_babchuk@epam.com>
Reviewed-by: Roger Pau Monné <roger.pau@citrix.com>
Signed-off-by: Stewart Hildebrand <stewart.hildebrand@amd.com>
---
Since v9:
- removed stray formatting change
- added Roger's R-b tag
Since v6:
- do not use is_hwdom parameter for vpci_{read|write}_hw and use
  current->domain internally
- update commit message
New in v6
---
 xen/drivers/vpci/vpci.c | 8 ++++++++
 1 file changed, 8 insertions(+)

diff --git a/xen/drivers/vpci/vpci.c b/xen/drivers/vpci/vpci.c
index a1a004460491..e98693e1dc3e 100644
--- a/xen/drivers/vpci/vpci.c
+++ b/xen/drivers/vpci/vpci.c
@@ -268,6 +268,10 @@ static uint32_t vpci_read_hw(pci_sbdf_t sbdf, unsigned int reg,
 {
     uint32_t data;
 
+    /* Guest domains are not allowed to read real hardware. */
+    if ( !is_hardware_domain(current->domain) )
+        return ~(uint32_t)0;
+
     switch ( size )
     {
     case 4:
@@ -311,6 +315,10 @@ static uint32_t vpci_read_hw(pci_sbdf_t sbdf, unsigned int reg,
 static void vpci_write_hw(pci_sbdf_t sbdf, unsigned int reg, unsigned int size,
                           uint32_t data)
 {
+    /* Guest domains are not allowed to write real hardware. */
+    if ( !is_hardware_domain(current->domain) )
+        return;
+
     switch ( size )
     {
     case 4:
-- 
2.43.0



^ permalink raw reply related	[flat|nested] 68+ messages in thread

* [PATCH v12 03/15] vpci: add hooks for PCI device assign/de-assign
  2024-01-09 21:51 [PATCH v12 00/15] PCI devices passthrough on Arm, part 3 Stewart Hildebrand
  2024-01-09 21:51 ` [PATCH v12 01/15] vpci: use per-domain PCI lock to protect vpci structure Stewart Hildebrand
  2024-01-09 21:51 ` [PATCH v12 02/15] vpci: restrict unhandled read/write operations for guests Stewart Hildebrand
@ 2024-01-09 21:51 ` Stewart Hildebrand
  2024-01-23 14:36   ` Jan Beulich
  2024-01-30 19:22   ` Stewart Hildebrand
  2024-01-09 21:51 ` [PATCH v12 04/15] vpci/header: rework exit path in init_header() Stewart Hildebrand
                   ` (11 subsequent siblings)
  14 siblings, 2 replies; 68+ messages in thread
From: Stewart Hildebrand @ 2024-01-09 21:51 UTC (permalink / raw)
  To: xen-devel
  Cc: Oleksandr Andrushchenko, Jan Beulich, Paul Durrant,
	Roger Pau Monné, Volodymyr Babchuk, Stewart Hildebrand

From: Oleksandr Andrushchenko <oleksandr_andrushchenko@epam.com>

When a PCI device gets assigned/de-assigned we need to
initialize/de-initialize vPCI state for the device.

Also, rename vpci_add_handlers() to vpci_assign_device() and
vpci_remove_device() to vpci_deassign_device() to better reflect role
of the functions.

Signed-off-by: Oleksandr Andrushchenko <oleksandr_andrushchenko@epam.com>
Signed-off-by: Volodymyr Babchuk <volodymyr_babchuk@epam.com>
Reviewed-by: Roger Pau Monné <roger.pau@citrix.com>
Signed-off-by: Stewart Hildebrand <stewart.hildebrand@amd.com>
---
In v12:
 - Add Roger's R-b
 - Clean up comment in xen/include/xen/vpci.h
 - Add comment in xen/drivers/passthrough/pci.c:deassign_device() to
   clarify vpci_assign_device() call
In v11:
- Call vpci_assign_device() in "deassign_device" if IOMMU call
"reassign_device" was successful.
In v10:
- removed HAS_VPCI_GUEST_SUPPORT checks
- HAS_VPCI_GUEST_SUPPORT config option (in Kconfig) as it is not used
  anywhere
In v9:
- removed previous  vpci_[de]assign_device function and renamed
  existing handlers
- dropped attempts to handle errors in assign_device() function
- do not call vpci_assign_device for dom_io
- use d instead of pdev->domain
- use IS_ENABLED macro
In v8:
- removed vpci_deassign_device
In v6:
- do not pass struct domain to vpci_{assign|deassign}_device as
  pdev->domain can be used
- do not leave the device assigned (pdev->domain == new domain) in case
  vpci_assign_device fails: try to de-assign and if this also fails, then
  crash the domain
In v5:
- do not split code into run_vpci_init
- do not check for is_system_domain in vpci_{de}assign_device
- do not use vpci_remove_device_handlers_locked and re-allocate
  pdev->vpci completely
- make vpci_deassign_device void
In v4:
 - de-assign vPCI from the previous domain on device assignment
 - do not remove handlers in vpci_assign_device as those must not
   exist at that point
In v3:
 - remove toolstack roll-back description from the commit message
   as error are to be handled with proper cleanup in Xen itself
 - remove __must_check
 - remove redundant rc check while assigning devices
 - fix redundant CONFIG_HAS_VPCI check for CONFIG_HAS_VPCI_GUEST_SUPPORT
 - use REGISTER_VPCI_INIT machinery to run required steps on device
   init/assign: add run_vpci_init helper
In v2:
- define CONFIG_HAS_VPCI_GUEST_SUPPORT so dead code is not compiled
  for x86
In v1:
 - constify struct pci_dev where possible
 - do not open code is_system_domain()
 - extended the commit message
---
 xen/drivers/passthrough/pci.c | 25 +++++++++++++++++++++----
 xen/drivers/vpci/header.c     |  2 +-
 xen/drivers/vpci/vpci.c       |  6 +++---
 xen/include/xen/vpci.h        | 10 +++++-----
 4 files changed, 30 insertions(+), 13 deletions(-)

diff --git a/xen/drivers/passthrough/pci.c b/xen/drivers/passthrough/pci.c
index 3a973324bca1..a902de6a8693 100644
--- a/xen/drivers/passthrough/pci.c
+++ b/xen/drivers/passthrough/pci.c
@@ -755,7 +755,7 @@ int pci_add_device(u16 seg, u8 bus, u8 devfn,
          * For devices not discovered by Xen during boot, add vPCI handlers
          * when Dom0 first informs Xen about such devices.
          */
-        ret = vpci_add_handlers(pdev);
+        ret = vpci_assign_device(pdev);
         if ( ret )
         {
             list_del(&pdev->domain_list);
@@ -769,7 +769,7 @@ int pci_add_device(u16 seg, u8 bus, u8 devfn,
         if ( ret )
         {
             write_lock(&hardware_domain->pci_lock);
-            vpci_remove_device(pdev);
+            vpci_deassign_device(pdev);
             list_del(&pdev->domain_list);
             write_unlock(&hardware_domain->pci_lock);
             pdev->domain = NULL;
@@ -817,7 +817,7 @@ int pci_remove_device(u16 seg, u8 bus, u8 devfn)
     list_for_each_entry ( pdev, &pseg->alldevs_list, alldevs_list )
         if ( pdev->bus == bus && pdev->devfn == devfn )
         {
-            vpci_remove_device(pdev);
+            vpci_deassign_device(pdev);
             pci_cleanup_msi(pdev);
             ret = iommu_remove_device(pdev);
             if ( pdev->domain )
@@ -875,6 +875,10 @@ static int deassign_device(struct domain *d, uint16_t seg, uint8_t bus,
             goto out;
     }
 
+    write_lock(&d->pci_lock);
+    vpci_deassign_device(pdev);
+    write_unlock(&d->pci_lock);
+
     devfn = pdev->devfn;
     ret = iommu_call(hd->platform_ops, reassign_device, d, target, devfn,
                      pci_to_dev(pdev));
@@ -886,6 +890,11 @@ static int deassign_device(struct domain *d, uint16_t seg, uint8_t bus,
 
     pdev->fault.count = 0;
 
+    write_lock(&target->pci_lock);
+    /* Re-assign back to hardware_domain */
+    ret = vpci_assign_device(pdev);
+    write_unlock(&target->pci_lock);
+
  out:
     if ( ret )
         printk(XENLOG_G_ERR "%pd: deassign (%pp) failed (%d)\n",
@@ -1146,7 +1155,7 @@ static void __hwdom_init setup_one_hwdom_device(const struct setup_hwdom *ctxt,
               PCI_SLOT(devfn) == PCI_SLOT(pdev->devfn) );
 
     write_lock(&ctxt->d->pci_lock);
-    err = vpci_add_handlers(pdev);
+    err = vpci_assign_device(pdev);
     write_unlock(&ctxt->d->pci_lock);
     if ( err )
         printk(XENLOG_ERR "setup of vPCI for d%d failed: %d\n",
@@ -1476,6 +1485,10 @@ static int assign_device(struct domain *d, u16 seg, u8 bus, u8 devfn, u32 flag)
     if ( pdev->broken && d != hardware_domain && d != dom_io )
         goto done;
 
+    write_lock(&pdev->domain->pci_lock);
+    vpci_deassign_device(pdev);
+    write_unlock(&pdev->domain->pci_lock);
+
     rc = pdev_msix_assign(d, pdev);
     if ( rc )
         goto done;
@@ -1502,6 +1515,10 @@ static int assign_device(struct domain *d, u16 seg, u8 bus, u8 devfn, u32 flag)
                         pci_to_dev(pdev), flag);
     }
 
+    write_lock(&d->pci_lock);
+    rc = vpci_assign_device(pdev);
+    write_unlock(&d->pci_lock);
+
  done:
     if ( rc )
         printk(XENLOG_G_WARNING "%pd: assign (%pp) failed (%d)\n",
diff --git a/xen/drivers/vpci/header.c b/xen/drivers/vpci/header.c
index 8f5850b8cf6d..2f2d98ada012 100644
--- a/xen/drivers/vpci/header.c
+++ b/xen/drivers/vpci/header.c
@@ -191,7 +191,7 @@ bool vpci_process_pending(struct vcpu *v)
              * killed in order to avoid leaking stale p2m mappings on
              * failure.
              */
-            vpci_remove_device(v->vpci.pdev);
+            vpci_deassign_device(v->vpci.pdev);
         write_unlock(&v->domain->pci_lock);
     }
 
diff --git a/xen/drivers/vpci/vpci.c b/xen/drivers/vpci/vpci.c
index e98693e1dc3e..42eac85106a3 100644
--- a/xen/drivers/vpci/vpci.c
+++ b/xen/drivers/vpci/vpci.c
@@ -40,7 +40,7 @@ extern vpci_register_init_t *const __start_vpci_array[];
 extern vpci_register_init_t *const __end_vpci_array[];
 #define NUM_VPCI_INIT (__end_vpci_array - __start_vpci_array)
 
-void vpci_remove_device(struct pci_dev *pdev)
+void vpci_deassign_device(struct pci_dev *pdev)
 {
     ASSERT(rw_is_write_locked(&pdev->domain->pci_lock));
 
@@ -73,7 +73,7 @@ void vpci_remove_device(struct pci_dev *pdev)
     pdev->vpci = NULL;
 }
 
-int vpci_add_handlers(struct pci_dev *pdev)
+int vpci_assign_device(struct pci_dev *pdev)
 {
     unsigned int i;
     const unsigned long *ro_map;
@@ -107,7 +107,7 @@ int vpci_add_handlers(struct pci_dev *pdev)
     }
 
     if ( rc )
-        vpci_remove_device(pdev);
+        vpci_deassign_device(pdev);
 
     return rc;
 }
diff --git a/xen/include/xen/vpci.h b/xen/include/xen/vpci.h
index d20c301a3db3..99fe76f08ace 100644
--- a/xen/include/xen/vpci.h
+++ b/xen/include/xen/vpci.h
@@ -25,11 +25,11 @@ typedef int vpci_register_init_t(struct pci_dev *dev);
   static vpci_register_init_t *const x##_entry  \
                __used_section(".data.vpci." p) = x
 
-/* Add vPCI handlers to device. */
-int __must_check vpci_add_handlers(struct pci_dev *pdev);
+/* Assign vPCI to device by adding handlers. */
+int __must_check vpci_assign_device(struct pci_dev *pdev);
 
 /* Remove all handlers and free vpci related structures. */
-void vpci_remove_device(struct pci_dev *pdev);
+void vpci_deassign_device(struct pci_dev *pdev);
 
 /* Add/remove a register handler. */
 int __must_check vpci_add_register_mask(struct vpci *vpci,
@@ -255,12 +255,12 @@ bool vpci_ecam_read(pci_sbdf_t sbdf, unsigned int reg, unsigned int len,
 #else /* !CONFIG_HAS_VPCI */
 struct vpci_vcpu {};
 
-static inline int vpci_add_handlers(struct pci_dev *pdev)
+static inline int vpci_assign_device(struct pci_dev *pdev)
 {
     return 0;
 }
 
-static inline void vpci_remove_device(struct pci_dev *pdev) { }
+static inline void vpci_deassign_device(struct pci_dev *pdev) { }
 
 static inline void vpci_dump_msi(void) { }
 
-- 
2.43.0



^ permalink raw reply related	[flat|nested] 68+ messages in thread

* [PATCH v12 04/15] vpci/header: rework exit path in init_header()
  2024-01-09 21:51 [PATCH v12 00/15] PCI devices passthrough on Arm, part 3 Stewart Hildebrand
                   ` (2 preceding siblings ...)
  2024-01-09 21:51 ` [PATCH v12 03/15] vpci: add hooks for PCI device assign/de-assign Stewart Hildebrand
@ 2024-01-09 21:51 ` Stewart Hildebrand
  2024-01-09 21:51 ` [PATCH v12 05/15] vpci/header: implement guest BAR register handlers Stewart Hildebrand
                   ` (10 subsequent siblings)
  14 siblings, 0 replies; 68+ messages in thread
From: Stewart Hildebrand @ 2024-01-09 21:51 UTC (permalink / raw)
  To: xen-devel
  Cc: Volodymyr Babchuk, Roger Pau Monné, Volodymyr Babchuk,
	Stewart Hildebrand

From: Volodymyr Babchuk <Volodymyr_Babchuk@epam.com>

Introduce "fail" label in init_header() function to have the centralized
error return path. This is the pre-requirement for the future changes
in this function.

This patch does not introduce functional changes.

Suggested-by: Roger Pau Monné <roger.pau@citrix.com>
Signed-off-by: Volodymyr Babchuk <volodymyr_babchuk@epam.com>
Acked-by: Roger Pau Monné <roger.pau@citrix.com>
Signed-off-by: Stewart Hildebrand <stewart.hildebrand@amd.com>
---
In v12:
- s/init_bars/init_header/
- Re-order tags
- Fixup scissors line
In v11:
- Do not remove empty line between "goto fail;" and "continue;"
In v10:
- Added Roger's A-b tag.
In v9:
- New in v9
---
 xen/drivers/vpci/header.c | 19 +++++++------------
 1 file changed, 7 insertions(+), 12 deletions(-)

diff --git a/xen/drivers/vpci/header.c b/xen/drivers/vpci/header.c
index 2f2d98ada012..803fe4bb99a6 100644
--- a/xen/drivers/vpci/header.c
+++ b/xen/drivers/vpci/header.c
@@ -656,10 +656,7 @@ static int cf_check init_header(struct pci_dev *pdev)
             rc = vpci_add_register(pdev->vpci, vpci_hw_read32, bar_write, reg,
                                    4, &bars[i]);
             if ( rc )
-            {
-                pci_conf_write16(pdev->sbdf, PCI_COMMAND, cmd);
-                return rc;
-            }
+                goto fail;
 
             continue;
         }
@@ -679,10 +676,7 @@ static int cf_check init_header(struct pci_dev *pdev)
         rc = pci_size_mem_bar(pdev->sbdf, reg, &addr, &size,
                               (i == num_bars - 1) ? PCI_BAR_LAST : 0);
         if ( rc < 0 )
-        {
-            pci_conf_write16(pdev->sbdf, PCI_COMMAND, cmd);
-            return rc;
-        }
+            goto fail;
 
         if ( size == 0 )
         {
@@ -697,10 +691,7 @@ static int cf_check init_header(struct pci_dev *pdev)
         rc = vpci_add_register(pdev->vpci, vpci_hw_read32, bar_write, reg, 4,
                                &bars[i]);
         if ( rc )
-        {
-            pci_conf_write16(pdev->sbdf, PCI_COMMAND, cmd);
-            return rc;
-        }
+            goto fail;
     }
 
     /* Check expansion ROM. */
@@ -722,6 +713,10 @@ static int cf_check init_header(struct pci_dev *pdev)
     }
 
     return (cmd & PCI_COMMAND_MEMORY) ? modify_bars(pdev, cmd, false) : 0;
+
+ fail:
+    pci_conf_write16(pdev->sbdf, PCI_COMMAND, cmd);
+    return rc;
 }
 REGISTER_VPCI_INIT(init_header, VPCI_PRIORITY_MIDDLE);
 
-- 
2.43.0



^ permalink raw reply related	[flat|nested] 68+ messages in thread

* [PATCH v12 05/15] vpci/header: implement guest BAR register handlers
  2024-01-09 21:51 [PATCH v12 00/15] PCI devices passthrough on Arm, part 3 Stewart Hildebrand
                   ` (3 preceding siblings ...)
  2024-01-09 21:51 ` [PATCH v12 04/15] vpci/header: rework exit path in init_header() Stewart Hildebrand
@ 2024-01-09 21:51 ` Stewart Hildebrand
  2024-01-09 21:51 ` [PATCH v12 06/15] rangeset: add RANGESETF_no_print flag Stewart Hildebrand
                   ` (9 subsequent siblings)
  14 siblings, 0 replies; 68+ messages in thread
From: Stewart Hildebrand @ 2024-01-09 21:51 UTC (permalink / raw)
  To: xen-devel
  Cc: Oleksandr Andrushchenko, Roger Pau Monné, Volodymyr Babchuk,
	Stewart Hildebrand

From: Oleksandr Andrushchenko <oleksandr_andrushchenko@epam.com>

Add relevant vpci register handlers when assigning PCI device to a domain
and remove those when de-assigning. This allows having different
handlers for different domains, e.g. hwdom and other guests.

Emulate guest BAR register values: this allows creating a guest view
of the registers and emulates size and properties probe as it is done
during PCI device enumeration by the guest.

All empty, IO and ROM BARs for guests are emulated by returning 0 on
reads and ignoring writes: this BARs are special with this respect as
their lower bits have special meaning, so returning default ~0 on read
may confuse guest OS.

Introduce is_hwdom convenience variable and convert an existing
is_hardware_domain() check.

Signed-off-by: Oleksandr Andrushchenko <oleksandr_andrushchenko@epam.com>
Signed-off-by: Volodymyr Babchuk <volodymyr_babchuk@epam.com>
Reviewed-by: Roger Pau Monné <roger.pau@citrix.com>
Signed-off-by: Stewart Hildebrand <stewart.hildebrand@amd.com>
---
In v12:
- Add Roger's R-b
- Get rid of empty_bar_read, use vpci_read_val instead
- Convert an existing is_hardware_domain() check to is_hwdom
- Re-indent usage of ternary operator
- Use ternary operator to avoid re-indenting expansion ROM block
In v11:
- Access guest_addr after adjusting for MEM64_HI bar in
guest_bar_write()
- guest bar handlers renamed and now  _mem_ part to denote
that they are handling only memory BARs
- refuse to update guest BAR address if BAR is enabled
In v10:
- ull -> ULL to be MISRA-compatbile
- Use PAGE_OFFSET() instead of combining with ~PAGE_MASK
- Set type of empty bars to VPCI_BAR_EMPTY
In v9:
- factored-out "fail" label introduction in init_bars()
- replaced #ifdef CONFIG_X86 with IS_ENABLED()
- do not pass bars[i] to empty_bar_read() handler
- store guest's BAR address instead of guests BAR register view
Since v6:
- unify the writing of the PCI_COMMAND register on the
  error path into a label
- do not introduce bar_ignore_access helper and open code
- s/guest_bar_ignore_read/empty_bar_read
- update error message in guest_bar_write
- only setup empty_bar_read for IO if !x86
Since v5:
- make sure that the guest set address has the same page offset
  as the physical address on the host
- remove guest_rom_{read|write} as those just implement the default
  behaviour of the registers not being handled
- adjusted comment for struct vpci.addr field
- add guest handlers for BARs which are not handled and will otherwise
  return ~0 on read and ignore writes. The BARs are special with this
  respect as their lower bits have special meaning, so returning ~0
  doesn't seem to be right
Since v4:
- updated commit message
- s/guest_addr/guest_reg
Since v3:
- squashed two patches: dynamic add/remove handlers and guest BAR
  handler implementation
- fix guest BAR read of the high part of a 64bit BAR (Roger)
- add error handling to vpci_assign_device
- s/dom%pd/%pd
- blank line before return
Since v2:
- remove unneeded ifdefs for CONFIG_HAS_VPCI_GUEST_SUPPORT as more code
  has been eliminated from being built on x86
Since v1:
 - constify struct pci_dev where possible
 - do not open code is_system_domain()
 - simplify some code3. simplify
 - use gdprintk + error code instead of gprintk
 - gate vpci_bar_{add|remove}_handlers with CONFIG_HAS_VPCI_GUEST_SUPPORT,
   so these do not get compiled for x86
 - removed unneeded is_system_domain check
 - re-work guest read/write to be much simpler and do more work on write
   than read which is expected to be called more frequently
 - removed one too obvious comment
---
 xen/drivers/vpci/header.c | 109 +++++++++++++++++++++++++++++++++++---
 xen/include/xen/vpci.h    |   3 ++
 2 files changed, 106 insertions(+), 6 deletions(-)

diff --git a/xen/drivers/vpci/header.c b/xen/drivers/vpci/header.c
index 803fe4bb99a6..39e11e141b38 100644
--- a/xen/drivers/vpci/header.c
+++ b/xen/drivers/vpci/header.c
@@ -478,6 +478,69 @@ static void cf_check bar_write(
     pci_conf_write32(pdev->sbdf, reg, val);
 }
 
+static void cf_check guest_mem_bar_write(const struct pci_dev *pdev,
+                                         unsigned int reg, uint32_t val,
+                                         void *data)
+{
+    struct vpci_bar *bar = data;
+    bool hi = false;
+    uint64_t guest_addr;
+
+    if ( bar->type == VPCI_BAR_MEM64_HI )
+    {
+        ASSERT(reg > PCI_BASE_ADDRESS_0);
+        bar--;
+        hi = true;
+    }
+    else
+    {
+        val &= PCI_BASE_ADDRESS_MEM_MASK;
+    }
+
+    guest_addr = bar->guest_addr;
+    guest_addr &= ~(0xffffffffULL << (hi ? 32 : 0));
+    guest_addr |= (uint64_t)val << (hi ? 32 : 0);
+
+    /* Allow guest to size BAR correctly */
+    guest_addr &= ~(bar->size - 1);
+
+    /*
+     * Xen only cares whether the BAR is mapped into the p2m, so allow BAR
+     * writes as long as the BAR is not mapped into the p2m.
+     */
+    if ( bar->enabled )
+    {
+        /* If the value written is the current one avoid printing a warning. */
+        if ( guest_addr != bar->guest_addr )
+            gprintk(XENLOG_WARNING,
+                    "%pp: ignored guest BAR %zu write while mapped\n",
+                    &pdev->sbdf, bar - pdev->vpci->header.bars + hi);
+        return;
+    }
+    bar->guest_addr = guest_addr;
+}
+
+static uint32_t cf_check guest_mem_bar_read(const struct pci_dev *pdev,
+                                            unsigned int reg, void *data)
+{
+    const struct vpci_bar *bar = data;
+    uint32_t reg_val;
+
+    if ( bar->type == VPCI_BAR_MEM64_HI )
+    {
+        ASSERT(reg > PCI_BASE_ADDRESS_0);
+        bar--;
+        return bar->guest_addr >> 32;
+    }
+
+    reg_val = bar->guest_addr;
+    reg_val |= bar->type == VPCI_BAR_MEM32 ? PCI_BASE_ADDRESS_MEM_TYPE_32 :
+                                             PCI_BASE_ADDRESS_MEM_TYPE_64;
+    reg_val |= bar->prefetchable ? PCI_BASE_ADDRESS_MEM_PREFETCH : 0;
+
+    return reg_val;
+}
+
 static void cf_check rom_write(
     const struct pci_dev *pdev, unsigned int reg, uint32_t val, void *data)
 {
@@ -539,6 +602,7 @@ static int cf_check init_header(struct pci_dev *pdev)
     struct vpci_bar *bars = header->bars;
     int rc;
     bool mask_cap_list = false;
+    bool is_hwdom = is_hardware_domain(pdev->domain);
 
     ASSERT(rw_is_write_locked(&pdev->domain->pci_lock));
 
@@ -564,7 +628,7 @@ static int cf_check init_header(struct pci_dev *pdev)
     if ( rc )
         return rc;
 
-    if ( !is_hardware_domain(pdev->domain) )
+    if ( !is_hwdom )
     {
         if ( pci_conf_read16(pdev->sbdf, PCI_STATUS) & PCI_STATUS_CAP_LIST )
         {
@@ -653,8 +717,11 @@ static int cf_check init_header(struct pci_dev *pdev)
         if ( i && bars[i - 1].type == VPCI_BAR_MEM64_LO )
         {
             bars[i].type = VPCI_BAR_MEM64_HI;
-            rc = vpci_add_register(pdev->vpci, vpci_hw_read32, bar_write, reg,
-                                   4, &bars[i]);
+            rc = vpci_add_register(pdev->vpci,
+                                   is_hwdom ? vpci_hw_read32
+                                            : guest_mem_bar_read,
+                                   is_hwdom ? bar_write : guest_mem_bar_write,
+                                   reg, 4, &bars[i]);
             if ( rc )
                 goto fail;
 
@@ -665,6 +732,14 @@ static int cf_check init_header(struct pci_dev *pdev)
         if ( (val & PCI_BASE_ADDRESS_SPACE) == PCI_BASE_ADDRESS_SPACE_IO )
         {
             bars[i].type = VPCI_BAR_IO;
+            if ( !IS_ENABLED(CONFIG_X86) && !is_hwdom )
+            {
+                rc = vpci_add_register(pdev->vpci, vpci_read_val, NULL,
+                                       reg, 4, (void *)0);
+                if ( rc )
+                    goto fail;
+            }
+
             continue;
         }
         if ( (val & PCI_BASE_ADDRESS_MEM_TYPE_MASK) ==
@@ -681,6 +756,15 @@ static int cf_check init_header(struct pci_dev *pdev)
         if ( size == 0 )
         {
             bars[i].type = VPCI_BAR_EMPTY;
+
+            if ( !is_hwdom )
+            {
+                rc = vpci_add_register(pdev->vpci, vpci_read_val, NULL,
+                                       reg, 4, (void *)0);
+                if ( rc )
+                    goto fail;
+            }
+
             continue;
         }
 
@@ -688,14 +772,18 @@ static int cf_check init_header(struct pci_dev *pdev)
         bars[i].size = size;
         bars[i].prefetchable = val & PCI_BASE_ADDRESS_MEM_PREFETCH;
 
-        rc = vpci_add_register(pdev->vpci, vpci_hw_read32, bar_write, reg, 4,
-                               &bars[i]);
+        rc = vpci_add_register(pdev->vpci,
+                               is_hwdom ? vpci_hw_read32 : guest_mem_bar_read,
+                               is_hwdom ? bar_write : guest_mem_bar_write,
+                               reg, 4, &bars[i]);
         if ( rc )
             goto fail;
     }
 
     /* Check expansion ROM. */
-    rc = pci_size_mem_bar(pdev->sbdf, rom_reg, &addr, &size, PCI_BAR_ROM);
+    rc = is_hwdom ? pci_size_mem_bar(pdev->sbdf, rom_reg, &addr, &size,
+                                     PCI_BAR_ROM)
+                  : 0;
     if ( rc > 0 && size )
     {
         struct vpci_bar *rom = &header->bars[num_bars];
@@ -711,6 +799,15 @@ static int cf_check init_header(struct pci_dev *pdev)
         if ( rc )
             rom->type = VPCI_BAR_EMPTY;
     }
+    else if ( !is_hwdom )
+    {
+        /* TODO: Check expansion ROM, we do not handle ROM for guests for now */
+        header->bars[num_bars].type = VPCI_BAR_EMPTY;
+        rc = vpci_add_register(pdev->vpci, vpci_read_val, NULL,
+                               rom_reg, 4, (void *)0);
+        if ( rc )
+            goto fail;
+    }
 
     return (cmd & PCI_COMMAND_MEMORY) ? modify_bars(pdev, cmd, false) : 0;
 
diff --git a/xen/include/xen/vpci.h b/xen/include/xen/vpci.h
index 99fe76f08ace..b0e38a5a1acb 100644
--- a/xen/include/xen/vpci.h
+++ b/xen/include/xen/vpci.h
@@ -87,7 +87,10 @@ struct vpci {
     struct vpci_header {
         /* Information about the PCI BARs of this device. */
         struct vpci_bar {
+            /* Physical (host) address. */
             uint64_t addr;
+            /* Guest address. */
+            uint64_t guest_addr;
             uint64_t size;
             enum {
                 VPCI_BAR_EMPTY,
-- 
2.43.0



^ permalink raw reply related	[flat|nested] 68+ messages in thread

* [PATCH v12 06/15] rangeset: add RANGESETF_no_print flag
  2024-01-09 21:51 [PATCH v12 00/15] PCI devices passthrough on Arm, part 3 Stewart Hildebrand
                   ` (4 preceding siblings ...)
  2024-01-09 21:51 ` [PATCH v12 05/15] vpci/header: implement guest BAR register handlers Stewart Hildebrand
@ 2024-01-09 21:51 ` Stewart Hildebrand
  2024-01-09 21:51 ` [PATCH v12 07/15] rangeset: add rangeset_purge() function Stewart Hildebrand
                   ` (8 subsequent siblings)
  14 siblings, 0 replies; 68+ messages in thread
From: Stewart Hildebrand @ 2024-01-09 21:51 UTC (permalink / raw)
  To: xen-devel
  Cc: Oleksandr Andrushchenko, Andrew Cooper, George Dunlap,
	Jan Beulich, Julien Grall, Stefano Stabellini, Wei Liu,
	Volodymyr Babchuk, Stewart Hildebrand

From: Oleksandr Andrushchenko <oleksandr_andrushchenko@epam.com>

There are range sets which should not be printed, so introduce a flag
which allows marking those as such. Implement relevant logic to skip
such entries while printing.

While at it also simplify the definition of the flags by directly
defining those without helpers.

Suggested-by: Jan Beulich <jbeulich@suse.com>
Signed-off-by: Oleksandr Andrushchenko <oleksandr_andrushchenko@epam.com>
Signed-off-by: Volodymyr Babchuk <volodymyr_babchuk@epam.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
Signed-off-by: Stewart Hildebrand <stewart.hildebrand@amd.com>
---
Since v5:
- comment indentation (Jan)
Since v1:
- update BUG_ON with new flag
- simplify the definition of the flags
---
 xen/common/rangeset.c      | 5 ++++-
 xen/include/xen/rangeset.h | 5 +++--
 2 files changed, 7 insertions(+), 3 deletions(-)

diff --git a/xen/common/rangeset.c b/xen/common/rangeset.c
index 16a4c3b842e6..0ccd53caac52 100644
--- a/xen/common/rangeset.c
+++ b/xen/common/rangeset.c
@@ -433,7 +433,7 @@ struct rangeset *rangeset_new(
     INIT_LIST_HEAD(&r->range_list);
     r->nr_ranges = -1;
 
-    BUG_ON(flags & ~RANGESETF_prettyprint_hex);
+    BUG_ON(flags & ~(RANGESETF_prettyprint_hex | RANGESETF_no_print));
     r->flags = flags;
 
     safe_strcpy(r->name, name ?: "(no name)");
@@ -575,6 +575,9 @@ void rangeset_domain_printk(
 
     list_for_each_entry ( r, &d->rangesets, rangeset_list )
     {
+        if ( r->flags & RANGESETF_no_print )
+            continue;
+
         printk("    ");
         rangeset_printk(r);
         printk("\n");
diff --git a/xen/include/xen/rangeset.h b/xen/include/xen/rangeset.h
index 8be0722787ed..87bd956962b5 100644
--- a/xen/include/xen/rangeset.h
+++ b/xen/include/xen/rangeset.h
@@ -49,8 +49,9 @@ void rangeset_limit(
 
 /* Flags for passing to rangeset_new(). */
  /* Pretty-print range limits in hexadecimal. */
-#define _RANGESETF_prettyprint_hex 0
-#define RANGESETF_prettyprint_hex  (1U << _RANGESETF_prettyprint_hex)
+#define RANGESETF_prettyprint_hex   (1U << 0)
+ /* Do not print entries marked with this flag. */
+#define RANGESETF_no_print          (1U << 1)
 
 bool __must_check rangeset_is_empty(
     const struct rangeset *r);
-- 
2.43.0



^ permalink raw reply related	[flat|nested] 68+ messages in thread

* [PATCH v12 07/15] rangeset: add rangeset_purge() function
  2024-01-09 21:51 [PATCH v12 00/15] PCI devices passthrough on Arm, part 3 Stewart Hildebrand
                   ` (5 preceding siblings ...)
  2024-01-09 21:51 ` [PATCH v12 06/15] rangeset: add RANGESETF_no_print flag Stewart Hildebrand
@ 2024-01-09 21:51 ` Stewart Hildebrand
  2024-01-10 10:00   ` Jan Beulich
  2024-01-09 21:51 ` [PATCH v12 08/15] vpci/header: handle p2m range sets per BAR Stewart Hildebrand
                   ` (7 subsequent siblings)
  14 siblings, 1 reply; 68+ messages in thread
From: Stewart Hildebrand @ 2024-01-09 21:51 UTC (permalink / raw)
  To: xen-devel
  Cc: Volodymyr Babchuk, Andrew Cooper, George Dunlap, Jan Beulich,
	Julien Grall, Stefano Stabellini, Wei Liu, Volodymyr Babchuk,
	Stewart Hildebrand

From: Volodymyr Babchuk <Volodymyr_Babchuk@epam.com>

This function can be used when user wants to remove all rangeset
entries but do not want to destroy rangeset itself.

Signed-off-by: Volodymyr Babchuk <volodymyr_babchuk@epam.com>
Signed-off-by: Stewart Hildebrand <stewart.hildebrand@amd.com>
---
Changes in v12:
 - s/rangeset_empty/rangeset_purge/
Changes in v11:
 - Now the function only empties rangeset, without removing it from
   domain's list

Changes in v10:
 - New in v10. The function is used in "vpci/header: handle p2m range sets per BAR"
---
 xen/common/rangeset.c      | 16 ++++++++++++----
 xen/include/xen/rangeset.h |  3 ++-
 2 files changed, 14 insertions(+), 5 deletions(-)

diff --git a/xen/common/rangeset.c b/xen/common/rangeset.c
index 0ccd53caac52..b75590f90744 100644
--- a/xen/common/rangeset.c
+++ b/xen/common/rangeset.c
@@ -448,11 +448,20 @@ struct rangeset *rangeset_new(
     return r;
 }
 
-void rangeset_destroy(
-    struct rangeset *r)
+void rangeset_purge(struct rangeset *r)
 {
     struct range *x;
 
+    if ( r == NULL )
+        return;
+
+    while ( (x = first_range(r)) != NULL )
+        destroy_range(r, x);
+}
+
+void rangeset_destroy(
+    struct rangeset *r)
+{
     if ( r == NULL )
         return;
 
@@ -463,8 +472,7 @@ void rangeset_destroy(
         spin_unlock(&r->domain->rangesets_lock);
     }
 
-    while ( (x = first_range(r)) != NULL )
-        destroy_range(r, x);
+    rangeset_purge(r);
 
     xfree(r);
 }
diff --git a/xen/include/xen/rangeset.h b/xen/include/xen/rangeset.h
index 87bd956962b5..96c918082501 100644
--- a/xen/include/xen/rangeset.h
+++ b/xen/include/xen/rangeset.h
@@ -56,7 +56,7 @@ void rangeset_limit(
 bool __must_check rangeset_is_empty(
     const struct rangeset *r);
 
-/* Add/claim/remove/query a numeric range. */
+/* Add/claim/remove/query/purge a numeric range. */
 int __must_check rangeset_add_range(
     struct rangeset *r, unsigned long s, unsigned long e);
 int __must_check rangeset_claim_range(struct rangeset *r, unsigned long size,
@@ -70,6 +70,7 @@ bool __must_check rangeset_overlaps_range(
 int rangeset_report_ranges(
     struct rangeset *r, unsigned long s, unsigned long e,
     int (*cb)(unsigned long s, unsigned long e, void *data), void *ctxt);
+void rangeset_purge(struct rangeset *r);
 
 /*
  * Note that the consume function can return an error value apart from
-- 
2.43.0



^ permalink raw reply related	[flat|nested] 68+ messages in thread

* [PATCH v12 08/15] vpci/header: handle p2m range sets per BAR
  2024-01-09 21:51 [PATCH v12 00/15] PCI devices passthrough on Arm, part 3 Stewart Hildebrand
                   ` (6 preceding siblings ...)
  2024-01-09 21:51 ` [PATCH v12 07/15] rangeset: add rangeset_purge() function Stewart Hildebrand
@ 2024-01-09 21:51 ` Stewart Hildebrand
  2024-01-09 21:51 ` [PATCH v12 09/15] vpci/header: program p2m with guest BAR view Stewart Hildebrand
                   ` (6 subsequent siblings)
  14 siblings, 0 replies; 68+ messages in thread
From: Stewart Hildebrand @ 2024-01-09 21:51 UTC (permalink / raw)
  To: xen-devel
  Cc: Oleksandr Andrushchenko, Roger Pau Monné, Volodymyr Babchuk,
	Stewart Hildebrand

From: Oleksandr Andrushchenko <oleksandr_andrushchenko@epam.com>

Instead of handling a single range set, that contains all the memory
regions of all the BARs and ROM, have them per BAR.
As the range sets are now created when a PCI device is added and destroyed
when it is removed so make them named and accounted.

Note that rangesets were chosen here despite there being only up to
3 separate ranges in each set (typically just 1). But rangeset per BAR
was chosen for the ease of implementation and existing code re-usability.

Also note that error handling of vpci_process_pending() is slightly
modified, and that vPCI handlers are no longer removed if the creation
of the mappings in vpci_process_pending() fails, as that's unlikely to
lead to a functional device in any case.

This is in preparation of making non-identity mappings in p2m for the MMIOs.

Signed-off-by: Oleksandr Andrushchenko <oleksandr_andrushchenko@epam.com>
Signed-off-by: Volodymyr Babchuk <volodymyr_babchuk@epam.com>
Reviewed-by: Roger Pau Monné <roger.pau@citrix.com>
Signed-off-by: Stewart Hildebrand <stewart.hildebrand@amd.com>
---
In v12:
- s/rangeset_empty/rangeset_purge/
- change i to num_bars for expansion ROM (purely cosmetic change)
In v11:
- Modified commit message to note changes in error handling in
vpci_process_pending()
- Removed redundant ASSERT() in defer_map. There is no reason to
introduce it in this patch and there is no other patch where
introducing that ASSERT() was appropriate.
- Fixed formatting
- vpci_process_pending() clears v->vpci.pdev if it failed
  checks at the beginning
- Added Roger's R-B tag
In v10:
- Added additional checks to vpci_process_pending()
- vpci_process_pending() now clears rangeset in case of failure
- Fixed locks in vpci_process_pending()
- Fixed coding style issues
- Fixed error handling in init_bars
In v9:
- removed d->vpci.map_pending in favor of checking v->vpci.pdev !=
NULL
- printk -> gprintk
- renamed bar variable to fix shadowing
- fixed bug with iterating on remote device's BARs
- relaxed lock in vpci_process_pending
- removed stale comment
Since v6:
- update according to the new locking scheme
- remove odd fail label in modify_bars
Since v5:
- fix comments
- move rangeset allocation to init_bars and only allocate
  for MAPPABLE BARs
- check for overlap with the already setup BAR ranges
Since v4:
- use named range sets for BARs (Jan)
- changes required by the new locking scheme
- updated commit message (Jan)
Since v3:
- re-work vpci_cancel_pending accordingly to the per-BAR handling
- s/num_mem_ranges/map_pending and s/uint8_t/bool
- ASSERT(bar->mem) in modify_bars
- create and destroy the rangesets on add/remove
---
 xen/drivers/vpci/header.c | 257 ++++++++++++++++++++++++++------------
 xen/drivers/vpci/vpci.c   |   6 +
 xen/include/xen/vpci.h    |   2 +-
 3 files changed, 185 insertions(+), 80 deletions(-)

diff --git a/xen/drivers/vpci/header.c b/xen/drivers/vpci/header.c
index 39e11e141b38..feccd070ddd0 100644
--- a/xen/drivers/vpci/header.c
+++ b/xen/drivers/vpci/header.c
@@ -162,63 +162,107 @@ static void modify_decoding(const struct pci_dev *pdev, uint16_t cmd,
 
 bool vpci_process_pending(struct vcpu *v)
 {
-    if ( v->vpci.mem )
+    struct pci_dev *pdev = v->vpci.pdev;
+    struct map_data data = {
+        .d = v->domain,
+        .map = v->vpci.cmd & PCI_COMMAND_MEMORY,
+    };
+    struct vpci_header *header = NULL;
+    unsigned int i;
+
+    if ( !pdev )
+        return false;
+
+    read_lock(&v->domain->pci_lock);
+
+    if ( !pdev->vpci || (v->domain != pdev->domain) )
     {
-        struct map_data data = {
-            .d = v->domain,
-            .map = v->vpci.cmd & PCI_COMMAND_MEMORY,
-        };
-        int rc = rangeset_consume_ranges(v->vpci.mem, map_range, &data);
+        v->vpci.pdev = NULL;
+        read_unlock(&v->domain->pci_lock);
+        return false;
+    }
+
+    header = &pdev->vpci->header;
+    for ( i = 0; i < ARRAY_SIZE(header->bars); i++ )
+    {
+        struct vpci_bar *bar = &header->bars[i];
+        int rc;
+
+        if ( rangeset_is_empty(bar->mem) )
+            continue;
+
+        rc = rangeset_consume_ranges(bar->mem, map_range, &data);
 
         if ( rc == -ERESTART )
+        {
+            read_unlock(&v->domain->pci_lock);
             return true;
+        }
 
-        write_lock(&v->domain->pci_lock);
-        spin_lock(&v->vpci.pdev->vpci->lock);
-        /* Disable memory decoding unconditionally on failure. */
-        modify_decoding(v->vpci.pdev,
-                        rc ? v->vpci.cmd & ~PCI_COMMAND_MEMORY : v->vpci.cmd,
-                        !rc && v->vpci.rom_only);
-        spin_unlock(&v->vpci.pdev->vpci->lock);
-
-        rangeset_destroy(v->vpci.mem);
-        v->vpci.mem = NULL;
         if ( rc )
-            /*
-             * FIXME: in case of failure remove the device from the domain.
-             * Note that there might still be leftover mappings. While this is
-             * safe for Dom0, for DomUs the domain will likely need to be
-             * killed in order to avoid leaking stale p2m mappings on
-             * failure.
-             */
-            vpci_deassign_device(v->vpci.pdev);
-        write_unlock(&v->domain->pci_lock);
+        {
+            spin_lock(&pdev->vpci->lock);
+            /* Disable memory decoding unconditionally on failure. */
+            modify_decoding(pdev, v->vpci.cmd & ~PCI_COMMAND_MEMORY,
+                            false);
+            spin_unlock(&pdev->vpci->lock);
+
+            /* Clean all the rangesets */
+            for ( i = 0; i < ARRAY_SIZE(header->bars); i++ )
+                if ( !rangeset_is_empty(header->bars[i].mem) )
+                     rangeset_purge(header->bars[i].mem);
+
+            v->vpci.pdev = NULL;
+
+            read_unlock(&v->domain->pci_lock);
+
+            if ( !is_hardware_domain(v->domain) )
+                domain_crash(v->domain);
+
+            return false;
+        }
     }
+    v->vpci.pdev = NULL;
+
+    spin_lock(&pdev->vpci->lock);
+    modify_decoding(pdev, v->vpci.cmd, v->vpci.rom_only);
+    spin_unlock(&pdev->vpci->lock);
+
+    read_unlock(&v->domain->pci_lock);
 
     return false;
 }
 
 static int __init apply_map(struct domain *d, const struct pci_dev *pdev,
-                            struct rangeset *mem, uint16_t cmd)
+                            uint16_t cmd)
 {
     struct map_data data = { .d = d, .map = true };
-    int rc;
+    struct vpci_header *header = &pdev->vpci->header;
+    int rc = 0;
+    unsigned int i;
 
     ASSERT(rw_is_write_locked(&d->pci_lock));
 
-    while ( (rc = rangeset_consume_ranges(mem, map_range, &data)) == -ERESTART )
+    for ( i = 0; i < ARRAY_SIZE(header->bars); i++ )
     {
-        /*
-         * It's safe to drop and reacquire the lock in this context
-         * without risking pdev disappearing because devices cannot be
-         * removed until the initial domain has been started.
-         */
-        write_unlock(&d->pci_lock);
-        process_pending_softirqs();
-        write_lock(&d->pci_lock);
-    }
+        struct vpci_bar *bar = &header->bars[i];
 
-    rangeset_destroy(mem);
+        if ( rangeset_is_empty(bar->mem) )
+            continue;
+
+        while ( (rc = rangeset_consume_ranges(bar->mem, map_range,
+                                              &data)) == -ERESTART )
+        {
+            /*
+             * It's safe to drop and reacquire the lock in this context
+             * without risking pdev disappearing because devices cannot be
+             * removed until the initial domain has been started.
+             */
+            write_unlock(&d->pci_lock);
+            process_pending_softirqs();
+            write_lock(&d->pci_lock);
+        }
+    }
     if ( !rc )
         modify_decoding(pdev, cmd, false);
 
@@ -226,7 +270,7 @@ static int __init apply_map(struct domain *d, const struct pci_dev *pdev,
 }
 
 static void defer_map(struct domain *d, struct pci_dev *pdev,
-                      struct rangeset *mem, uint16_t cmd, bool rom_only)
+                      uint16_t cmd, bool rom_only)
 {
     struct vcpu *curr = current;
 
@@ -237,7 +281,6 @@ static void defer_map(struct domain *d, struct pci_dev *pdev,
      * started for the same device if the domain is not well-behaved.
      */
     curr->vpci.pdev = pdev;
-    curr->vpci.mem = mem;
     curr->vpci.cmd = cmd;
     curr->vpci.rom_only = rom_only;
     /*
@@ -251,33 +294,33 @@ static void defer_map(struct domain *d, struct pci_dev *pdev,
 static int modify_bars(const struct pci_dev *pdev, uint16_t cmd, bool rom_only)
 {
     struct vpci_header *header = &pdev->vpci->header;
-    struct rangeset *mem = rangeset_new(NULL, NULL, 0);
     struct pci_dev *tmp, *dev = NULL;
     const struct domain *d;
     const struct vpci_msix *msix = pdev->vpci->msix;
-    unsigned int i;
+    unsigned int i, j;
     int rc;
 
     ASSERT(rw_is_write_locked(&pdev->domain->pci_lock));
 
-    if ( !mem )
-        return -ENOMEM;
-
     /*
-     * Create a rangeset that represents the current device BARs memory region
-     * and compare it against all the currently active BAR memory regions. If
-     * an overlap is found, subtract it from the region to be mapped/unmapped.
+     * Create a rangeset per BAR that represents the current device memory
+     * region and compare it against all the currently active BAR memory
+     * regions. If an overlap is found, subtract it from the region to be
+     * mapped/unmapped.
      *
-     * First fill the rangeset with all the BARs of this device or with the ROM
+     * First fill the rangesets with the BAR of this device or with the ROM
      * BAR only, depending on whether the guest is toggling the memory decode
      * bit of the command register, or the enable bit of the ROM BAR register.
      */
     for ( i = 0; i < ARRAY_SIZE(header->bars); i++ )
     {
-        const struct vpci_bar *bar = &header->bars[i];
+        struct vpci_bar *bar = &header->bars[i];
         unsigned long start = PFN_DOWN(bar->addr);
         unsigned long end = PFN_DOWN(bar->addr + bar->size - 1);
 
+        if ( !bar->mem )
+            continue;
+
         if ( !MAPPABLE_BAR(bar) ||
              (rom_only ? bar->type != VPCI_BAR_ROM
                        : (bar->type == VPCI_BAR_ROM && !header->rom_enabled)) ||
@@ -293,14 +336,31 @@ static int modify_bars(const struct pci_dev *pdev, uint16_t cmd, bool rom_only)
             continue;
         }
 
-        rc = rangeset_add_range(mem, start, end);
+        rc = rangeset_add_range(bar->mem, start, end);
         if ( rc )
         {
             printk(XENLOG_G_WARNING "Failed to add [%lx, %lx]: %d\n",
                    start, end, rc);
-            rangeset_destroy(mem);
             return rc;
         }
+
+        /* Check for overlap with the already setup BAR ranges. */
+        for ( j = 0; j < i; j++ )
+        {
+            struct vpci_bar *prev_bar = &header->bars[j];
+
+            if ( rangeset_is_empty(prev_bar->mem) )
+                continue;
+
+            rc = rangeset_remove_range(prev_bar->mem, start, end);
+            if ( rc )
+            {
+                gprintk(XENLOG_WARNING,
+                       "%pp: failed to remove overlapping range [%lx, %lx]: %d\n",
+                        &pdev->sbdf, start, end, rc);
+                return rc;
+            }
+        }
     }
 
     /* Remove any MSIX regions if present. */
@@ -310,14 +370,21 @@ static int modify_bars(const struct pci_dev *pdev, uint16_t cmd, bool rom_only)
         unsigned long end = PFN_DOWN(vmsix_table_addr(pdev->vpci, i) +
                                      vmsix_table_size(pdev->vpci, i) - 1);
 
-        rc = rangeset_remove_range(mem, start, end);
-        if ( rc )
+        for ( j = 0; j < ARRAY_SIZE(header->bars); j++ )
         {
-            printk(XENLOG_G_WARNING
-                   "Failed to remove MSIX table [%lx, %lx]: %d\n",
-                   start, end, rc);
-            rangeset_destroy(mem);
-            return rc;
+            const struct vpci_bar *bar = &header->bars[j];
+
+            if ( rangeset_is_empty(bar->mem) )
+                continue;
+
+            rc = rangeset_remove_range(bar->mem, start, end);
+            if ( rc )
+            {
+                gprintk(XENLOG_WARNING,
+                       "%pp: failed to remove MSIX table [%lx, %lx]: %d\n",
+                        &pdev->sbdf, start, end, rc);
+                return rc;
+            }
         }
     }
 
@@ -357,27 +424,37 @@ static int modify_bars(const struct pci_dev *pdev, uint16_t cmd, bool rom_only)
 
             for ( i = 0; i < ARRAY_SIZE(tmp->vpci->header.bars); i++ )
             {
-                const struct vpci_bar *bar = &tmp->vpci->header.bars[i];
-                unsigned long start = PFN_DOWN(bar->addr);
-                unsigned long end = PFN_DOWN(bar->addr + bar->size - 1);
-
-                if ( !bar->enabled ||
-                     !rangeset_overlaps_range(mem, start, end) ||
-                     /*
-                      * If only the ROM enable bit is toggled check against
-                      * other BARs in the same device for overlaps, but not
-                      * against the same ROM BAR.
-                      */
-                     (rom_only && tmp == pdev && bar->type == VPCI_BAR_ROM) )
+                const struct vpci_bar *remote_bar = &tmp->vpci->header.bars[i];
+                unsigned long start = PFN_DOWN(remote_bar->addr);
+                unsigned long end = PFN_DOWN(remote_bar->addr +
+                                             remote_bar->size - 1);
+
+                if ( !remote_bar->enabled )
                     continue;
 
-                rc = rangeset_remove_range(mem, start, end);
-                if ( rc )
+                for ( j = 0; j < ARRAY_SIZE(header->bars); j++)
                 {
-                    printk(XENLOG_G_WARNING "Failed to remove [%lx, %lx]: %d\n",
-                           start, end, rc);
-                    rangeset_destroy(mem);
-                    return rc;
+                    const struct vpci_bar *bar = &header->bars[j];
+
+                    if ( !rangeset_overlaps_range(bar->mem, start, end) ||
+                         /*
+                          * If only the ROM enable bit is toggled check against
+                          * other BARs in the same device for overlaps, but not
+                          * against the same ROM BAR.
+                          */
+                         (rom_only &&
+                          tmp == pdev &&
+                          bar->type == VPCI_BAR_ROM) )
+                        continue;
+
+                    rc = rangeset_remove_range(bar->mem, start, end);
+                    if ( rc )
+                    {
+                        gprintk(XENLOG_WARNING,
+                                "%pp: failed to remove [%lx, %lx]: %d\n",
+                                &pdev->sbdf, start, end, rc);
+                        return rc;
+                    }
                 }
             }
         }
@@ -401,10 +478,10 @@ static int modify_bars(const struct pci_dev *pdev, uint16_t cmd, bool rom_only)
          * will always be to establish mappings and process all the BARs.
          */
         ASSERT((cmd & PCI_COMMAND_MEMORY) && !rom_only);
-        return apply_map(pdev->domain, pdev, mem, cmd);
+        return apply_map(pdev->domain, pdev, cmd);
     }
 
-    defer_map(dev->domain, dev, mem, cmd, rom_only);
+    defer_map(dev->domain, dev, cmd, rom_only);
 
     return 0;
 }
@@ -593,6 +670,18 @@ static void cf_check rom_write(
         rom->addr = val & PCI_ROM_ADDRESS_MASK;
 }
 
+static int bar_add_rangeset(const struct pci_dev *pdev, struct vpci_bar *bar,
+                            unsigned int i)
+{
+    char str[32];
+
+    snprintf(str, sizeof(str), "%pp:BAR%u", &pdev->sbdf, i);
+
+    bar->mem = rangeset_new(pdev->domain, str, RANGESETF_no_print);
+
+    return !bar->mem ? -ENOMEM : 0;
+}
+
 static int cf_check init_header(struct pci_dev *pdev)
 {
     uint16_t cmd;
@@ -748,6 +837,10 @@ static int cf_check init_header(struct pci_dev *pdev)
         else
             bars[i].type = VPCI_BAR_MEM32;
 
+        rc = bar_add_rangeset(pdev, &bars[i], i);
+        if ( rc )
+            goto fail;
+
         rc = pci_size_mem_bar(pdev->sbdf, reg, &addr, &size,
                               (i == num_bars - 1) ? PCI_BAR_LAST : 0);
         if ( rc < 0 )
@@ -798,6 +891,12 @@ static int cf_check init_header(struct pci_dev *pdev)
                                4, rom);
         if ( rc )
             rom->type = VPCI_BAR_EMPTY;
+        else
+        {
+            rc = bar_add_rangeset(pdev, rom, num_bars);
+            if ( rc )
+                goto fail;
+        }
     }
     else if ( !is_hwdom )
     {
diff --git a/xen/drivers/vpci/vpci.c b/xen/drivers/vpci/vpci.c
index 42eac85106a3..a0e8b1012509 100644
--- a/xen/drivers/vpci/vpci.c
+++ b/xen/drivers/vpci/vpci.c
@@ -42,6 +42,8 @@ extern vpci_register_init_t *const __end_vpci_array[];
 
 void vpci_deassign_device(struct pci_dev *pdev)
 {
+    unsigned int i;
+
     ASSERT(rw_is_write_locked(&pdev->domain->pci_lock));
 
     if ( !has_vpci(pdev->domain) || !pdev->vpci )
@@ -67,6 +69,10 @@ void vpci_deassign_device(struct pci_dev *pdev)
             if ( pdev->vpci->msix->table[i] )
                 iounmap(pdev->vpci->msix->table[i]);
     }
+
+    for ( i = 0; i < ARRAY_SIZE(pdev->vpci->header.bars); i++ )
+        rangeset_destroy(pdev->vpci->header.bars[i].mem);
+
     xfree(pdev->vpci->msix);
     xfree(pdev->vpci->msi);
     xfree(pdev->vpci);
diff --git a/xen/include/xen/vpci.h b/xen/include/xen/vpci.h
index b0e38a5a1acb..817ee9ee7300 100644
--- a/xen/include/xen/vpci.h
+++ b/xen/include/xen/vpci.h
@@ -92,6 +92,7 @@ struct vpci {
             /* Guest address. */
             uint64_t guest_addr;
             uint64_t size;
+            struct rangeset *mem;
             enum {
                 VPCI_BAR_EMPTY,
                 VPCI_BAR_IO,
@@ -176,7 +177,6 @@ struct vpci {
 
 struct vpci_vcpu {
     /* Per-vcpu structure to store state while {un}mapping of PCI BARs. */
-    struct rangeset *mem;
     struct pci_dev *pdev;
     uint16_t cmd;
     bool rom_only : 1;
-- 
2.43.0



^ permalink raw reply related	[flat|nested] 68+ messages in thread

* [PATCH v12 09/15] vpci/header: program p2m with guest BAR view
  2024-01-09 21:51 [PATCH v12 00/15] PCI devices passthrough on Arm, part 3 Stewart Hildebrand
                   ` (7 preceding siblings ...)
  2024-01-09 21:51 ` [PATCH v12 08/15] vpci/header: handle p2m range sets per BAR Stewart Hildebrand
@ 2024-01-09 21:51 ` Stewart Hildebrand
  2024-01-12 15:06   ` Roger Pau Monné
                     ` (2 more replies)
  2024-01-09 21:51 ` [PATCH v12 10/15] vpci/header: emulate PCI_COMMAND register for guests Stewart Hildebrand
                   ` (5 subsequent siblings)
  14 siblings, 3 replies; 68+ messages in thread
From: Stewart Hildebrand @ 2024-01-09 21:51 UTC (permalink / raw)
  To: xen-devel
  Cc: Oleksandr Andrushchenko, Roger Pau Monné, Daniel P. Smith,
	Volodymyr Babchuk, Stewart Hildebrand

From: Oleksandr Andrushchenko <oleksandr_andrushchenko@epam.com>

Take into account guest's BAR view and program its p2m accordingly:
gfn is guest's view of the BAR and mfn is the physical BAR value.
This way hardware domain sees physical BAR values and guest sees
emulated ones.

Hardware domain continues getting the BARs identity mapped, while for
domUs the BARs are mapped at the requested guest address without
modifying the BAR address in the device PCI config space.

Signed-off-by: Oleksandr Andrushchenko <oleksandr_andrushchenko@epam.com>
Signed-off-by: Volodymyr Babchuk <volodymyr_babchuk@epam.com>
Signed-off-by: Stewart Hildebrand <stewart.hildebrand@amd.com>
---
In v12:
- Update guest_addr in rom_write()
- Use unsigned long for start_mfn and map_mfn to reduce mfn_x() calls
- Use existing vmsix_table_*() functions
- Change vmsix_table_base() to use .guest_addr
In v11:
- Add vmsix_guest_table_addr() and vmsix_guest_table_base() functions
  to access guest's view of the VMSIx tables.
- Use MFN (not GFN) to check access permissions
- Move page offset check to this patch
- Call rangeset_remove_range() with correct parameters
In v10:
- Moved GFN variable definition outside the loop in map_range()
- Updated printk error message in map_range()
- Now BAR address is always stored in bar->guest_addr, even for
  HW dom, this removes bunch of ugly is_hwdom() checks in modify_bars()
- vmsix_table_base() now uses .guest_addr instead of .addr
In v9:
- Extended the commit message
- Use bar->guest_addr in modify_bars
- Extended printk error message in map_range
- Moved map_data initialization so .bar can be initialized during declaration
Since v5:
- remove debug print in map_range callback
- remove "identity" from the debug print
Since v4:
- moved start_{gfn|mfn} calculation into map_range
- pass vpci_bar in the map_data instead of start_{gfn|mfn}
- s/guest_addr/guest_reg
Since v3:
- updated comment (Roger)
- removed gfn_add(map->start_gfn, rc); which is wrong
- use v->domain instead of v->vpci.pdev->domain
- removed odd e.g. in comment
- s/d%d/%pd in altered code
- use gdprintk for map/unmap logs
Since v2:
- improve readability for data.start_gfn and restructure ?: construct
Since v1:
 - s/MSI/MSI-X in comments
---
 xen/drivers/vpci/header.c | 81 +++++++++++++++++++++++++++++++--------
 xen/include/xen/vpci.h    |  3 +-
 2 files changed, 66 insertions(+), 18 deletions(-)

diff --git a/xen/drivers/vpci/header.c b/xen/drivers/vpci/header.c
index feccd070ddd0..f0b0b64b0929 100644
--- a/xen/drivers/vpci/header.c
+++ b/xen/drivers/vpci/header.c
@@ -34,6 +34,7 @@
 
 struct map_data {
     struct domain *d;
+    const struct vpci_bar *bar;
     bool map;
 };
 
@@ -41,13 +42,24 @@ static int cf_check map_range(
     unsigned long s, unsigned long e, void *data, unsigned long *c)
 {
     const struct map_data *map = data;
+    /* Start address of the BAR as seen by the guest. */
+    unsigned long start_gfn = PFN_DOWN(map->bar->guest_addr);
+    /* Physical start address of the BAR. */
+    unsigned long start_mfn = PFN_DOWN(map->bar->addr);
     int rc;
 
     for ( ; ; )
     {
         unsigned long size = e - s + 1;
+        /*
+         * Ranges to be mapped don't always start at the BAR start address, as
+         * there can be holes or partially consumed ranges. Account for the
+         * offset of the current address from the BAR start.
+         */
+        unsigned long map_mfn = start_mfn + s - start_gfn;
+        unsigned long m_end = map_mfn + size - 1;
 
-        if ( !iomem_access_permitted(map->d, s, e) )
+        if ( !iomem_access_permitted(map->d, map_mfn, m_end) )
         {
             printk(XENLOG_G_WARNING
                    "%pd denied access to MMIO range [%#lx, %#lx]\n",
@@ -55,7 +67,8 @@ static int cf_check map_range(
             return -EPERM;
         }
 
-        rc = xsm_iomem_mapping(XSM_HOOK, map->d, s, e, map->map);
+        rc = xsm_iomem_mapping(XSM_HOOK, map->d, map_mfn, m_end,
+                               map->map);
         if ( rc )
         {
             printk(XENLOG_G_WARNING
@@ -73,8 +86,8 @@ static int cf_check map_range(
          * - {un}map_mmio_regions doesn't support preemption.
          */
 
-        rc = map->map ? map_mmio_regions(map->d, _gfn(s), size, _mfn(s))
-                      : unmap_mmio_regions(map->d, _gfn(s), size, _mfn(s));
+        rc = map->map ? map_mmio_regions(map->d, _gfn(s), size, _mfn(map_mfn))
+                      : unmap_mmio_regions(map->d, _gfn(s), size, _mfn(map_mfn));
         if ( rc == 0 )
         {
             *c += size;
@@ -83,8 +96,9 @@ static int cf_check map_range(
         if ( rc < 0 )
         {
             printk(XENLOG_G_WARNING
-                   "Failed to identity %smap [%lx, %lx] for d%d: %d\n",
-                   map->map ? "" : "un", s, e, map->d->domain_id, rc);
+                   "Failed to %smap [%lx %lx] -> [%lx %lx] for %pd: %d\n",
+                   map->map ? "" : "un", s, e, map_mfn,
+                   map_mfn + size, map->d, rc);
             break;
         }
         ASSERT(rc < size);
@@ -163,10 +177,6 @@ static void modify_decoding(const struct pci_dev *pdev, uint16_t cmd,
 bool vpci_process_pending(struct vcpu *v)
 {
     struct pci_dev *pdev = v->vpci.pdev;
-    struct map_data data = {
-        .d = v->domain,
-        .map = v->vpci.cmd & PCI_COMMAND_MEMORY,
-    };
     struct vpci_header *header = NULL;
     unsigned int i;
 
@@ -186,6 +196,11 @@ bool vpci_process_pending(struct vcpu *v)
     for ( i = 0; i < ARRAY_SIZE(header->bars); i++ )
     {
         struct vpci_bar *bar = &header->bars[i];
+        struct map_data data = {
+            .d = v->domain,
+            .map = v->vpci.cmd & PCI_COMMAND_MEMORY,
+            .bar = bar,
+        };
         int rc;
 
         if ( rangeset_is_empty(bar->mem) )
@@ -236,7 +251,6 @@ bool vpci_process_pending(struct vcpu *v)
 static int __init apply_map(struct domain *d, const struct pci_dev *pdev,
                             uint16_t cmd)
 {
-    struct map_data data = { .d = d, .map = true };
     struct vpci_header *header = &pdev->vpci->header;
     int rc = 0;
     unsigned int i;
@@ -246,6 +260,7 @@ static int __init apply_map(struct domain *d, const struct pci_dev *pdev,
     for ( i = 0; i < ARRAY_SIZE(header->bars); i++ )
     {
         struct vpci_bar *bar = &header->bars[i];
+        struct map_data data = { .d = d, .map = true, .bar = bar };
 
         if ( rangeset_is_empty(bar->mem) )
             continue;
@@ -311,12 +326,16 @@ static int modify_bars(const struct pci_dev *pdev, uint16_t cmd, bool rom_only)
      * First fill the rangesets with the BAR of this device or with the ROM
      * BAR only, depending on whether the guest is toggling the memory decode
      * bit of the command register, or the enable bit of the ROM BAR register.
+     *
+     * For non-hardware domain we use guest physical addresses.
      */
     for ( i = 0; i < ARRAY_SIZE(header->bars); i++ )
     {
         struct vpci_bar *bar = &header->bars[i];
         unsigned long start = PFN_DOWN(bar->addr);
         unsigned long end = PFN_DOWN(bar->addr + bar->size - 1);
+        unsigned long start_guest = PFN_DOWN(bar->guest_addr);
+        unsigned long end_guest = PFN_DOWN(bar->guest_addr + bar->size - 1);
 
         if ( !bar->mem )
             continue;
@@ -336,11 +355,25 @@ static int modify_bars(const struct pci_dev *pdev, uint16_t cmd, bool rom_only)
             continue;
         }
 
-        rc = rangeset_add_range(bar->mem, start, end);
+        /*
+         * Make sure that the guest set address has the same page offset
+         * as the physical address on the host or otherwise things won't work as
+         * expected.
+         */
+        if ( PAGE_OFFSET(bar->guest_addr) != PAGE_OFFSET(bar->addr) )
+        {
+            gprintk(XENLOG_G_WARNING,
+                    "%pp: Can't map BAR%d because of page offset mismatch: %lx vs %lx\n",
+                    &pdev->sbdf, i, PAGE_OFFSET(bar->guest_addr),
+                    PAGE_OFFSET(bar->addr));
+            return -EINVAL;
+        }
+
+        rc = rangeset_add_range(bar->mem, start_guest, end_guest);
         if ( rc )
         {
             printk(XENLOG_G_WARNING "Failed to add [%lx, %lx]: %d\n",
-                   start, end, rc);
+                   start_guest, end_guest, rc);
             return rc;
         }
 
@@ -352,12 +385,12 @@ static int modify_bars(const struct pci_dev *pdev, uint16_t cmd, bool rom_only)
             if ( rangeset_is_empty(prev_bar->mem) )
                 continue;
 
-            rc = rangeset_remove_range(prev_bar->mem, start, end);
+            rc = rangeset_remove_range(prev_bar->mem, start_guest, end_guest);
             if ( rc )
             {
                 gprintk(XENLOG_WARNING,
                        "%pp: failed to remove overlapping range [%lx, %lx]: %d\n",
-                        &pdev->sbdf, start, end, rc);
+                        &pdev->sbdf, start_guest, end_guest, rc);
                 return rc;
             }
         }
@@ -425,8 +458,8 @@ static int modify_bars(const struct pci_dev *pdev, uint16_t cmd, bool rom_only)
             for ( i = 0; i < ARRAY_SIZE(tmp->vpci->header.bars); i++ )
             {
                 const struct vpci_bar *remote_bar = &tmp->vpci->header.bars[i];
-                unsigned long start = PFN_DOWN(remote_bar->addr);
-                unsigned long end = PFN_DOWN(remote_bar->addr +
+                unsigned long start = PFN_DOWN(remote_bar->guest_addr);
+                unsigned long end = PFN_DOWN(remote_bar->guest_addr +
                                              remote_bar->size - 1);
 
                 if ( !remote_bar->enabled )
@@ -513,6 +546,8 @@ static void cf_check bar_write(
     struct vpci_bar *bar = data;
     bool hi = false;
 
+    ASSERT(is_hardware_domain(pdev->domain));
+
     if ( bar->type == VPCI_BAR_MEM64_HI )
     {
         ASSERT(reg > PCI_BASE_ADDRESS_0);
@@ -543,6 +578,10 @@ static void cf_check bar_write(
      */
     bar->addr &= ~(0xffffffffULL << (hi ? 32 : 0));
     bar->addr |= (uint64_t)val << (hi ? 32 : 0);
+    /*
+     * Update guest address as well, so hardware domain sees BAR identity mapped
+     */
+    bar->guest_addr = bar->addr;
 
     /* Make sure Xen writes back the same value for the BAR RO bits. */
     if ( !hi )
@@ -639,11 +678,14 @@ static void cf_check rom_write(
     }
 
     if ( !rom->enabled )
+    {
         /*
          * If the ROM BAR is not mapped update the address field so the
          * correct address is mapped into the p2m.
          */
         rom->addr = val & PCI_ROM_ADDRESS_MASK;
+        rom->guest_addr = rom->addr;
+    }
 
     if ( !header->bars_mapped || rom->enabled == new_enabled )
     {
@@ -667,7 +709,10 @@ static void cf_check rom_write(
         return;
 
     if ( !new_enabled )
+    {
         rom->addr = val & PCI_ROM_ADDRESS_MASK;
+        rom->guest_addr = rom->addr;
+    }
 }
 
 static int bar_add_rangeset(const struct pci_dev *pdev, struct vpci_bar *bar,
@@ -862,6 +907,7 @@ static int cf_check init_header(struct pci_dev *pdev)
         }
 
         bars[i].addr = addr;
+        bars[i].guest_addr = addr;
         bars[i].size = size;
         bars[i].prefetchable = val & PCI_BASE_ADDRESS_MEM_PREFETCH;
 
@@ -884,6 +930,7 @@ static int cf_check init_header(struct pci_dev *pdev)
         rom->type = VPCI_BAR_ROM;
         rom->size = size;
         rom->addr = addr;
+        rom->guest_addr = addr;
         header->rom_enabled = pci_conf_read32(pdev->sbdf, rom_reg) &
                               PCI_ROM_ADDRESS_ENABLE;
 
diff --git a/xen/include/xen/vpci.h b/xen/include/xen/vpci.h
index 817ee9ee7300..e89c571890b2 100644
--- a/xen/include/xen/vpci.h
+++ b/xen/include/xen/vpci.h
@@ -216,7 +216,8 @@ int vpci_msix_arch_print(const struct vpci_msix *msix);
  */
 static inline paddr_t vmsix_table_base(const struct vpci *vpci, unsigned int nr)
 {
-    return vpci->header.bars[vpci->msix->tables[nr] & PCI_MSIX_BIRMASK].addr;
+    return vpci->header.bars[vpci->msix->tables[nr] &
+                             PCI_MSIX_BIRMASK].guest_addr;
 }
 
 static inline paddr_t vmsix_table_addr(const struct vpci *vpci, unsigned int nr)
-- 
2.43.0



^ permalink raw reply related	[flat|nested] 68+ messages in thread

* [PATCH v12 10/15] vpci/header: emulate PCI_COMMAND register for guests
  2024-01-09 21:51 [PATCH v12 00/15] PCI devices passthrough on Arm, part 3 Stewart Hildebrand
                   ` (8 preceding siblings ...)
  2024-01-09 21:51 ` [PATCH v12 09/15] vpci/header: program p2m with guest BAR view Stewart Hildebrand
@ 2024-01-09 21:51 ` Stewart Hildebrand
  2024-01-25 15:43   ` Jan Beulich
  2024-01-09 21:51 ` [PATCH v12 11/15] vpci: add initial support for virtual PCI bus topology Stewart Hildebrand
                   ` (4 subsequent siblings)
  14 siblings, 1 reply; 68+ messages in thread
From: Stewart Hildebrand @ 2024-01-09 21:51 UTC (permalink / raw)
  To: xen-devel
  Cc: Oleksandr Andrushchenko, Roger Pau Monné, Andrew Cooper,
	George Dunlap, Jan Beulich, Julien Grall, Stefano Stabellini,
	Wei Liu, Volodymyr Babchuk, Stewart Hildebrand

From: Oleksandr Andrushchenko <oleksandr_andrushchenko@epam.com>

Xen and/or Dom0 may have put values in PCI_COMMAND which they expect
to remain unaltered. PCI_COMMAND_SERR bit is a good example: while the
guest's (domU) view of this will want to be zero (for now), the host
having set it to 1 should be preserved, or else we'd effectively be
giving the domU control of the bit. Thus, PCI_COMMAND register needs
proper emulation in order to honor host's settings.

According to "PCI LOCAL BUS SPECIFICATION, REV. 3.0", section "6.2.2
Device Control" the reset state of the command register is typically 0,
so when assigning a PCI device use 0 as the initial state for the
guest's (domU) view of the command register.

Here is the full list of command register bits with notes about
PCI/PCIe specification, and how Xen handles the bit. QEMU's behavior is
also documented here since that is our current reference implementation
for PCI passthrough.

PCI_COMMAND_IO (bit 0)
  PCIe 6.1: RW
  PCI LB 3.0: RW
  QEMU: (emu_mask) QEMU provides an emulated view of this bit. Guest
    writes do not propagate to hardware. QEMU sets this bit to 1 in
    hardware if an I/O BAR is exposed to the guest.
  Xen domU: (rsvdp_mask) We treat this bit as RsvdP for now since we
    don't yet support I/O BARs for domUs.
  Xen dom0: We allow dom0 to control this bit freely.

PCI_COMMAND_MEMORY (bit 1)
  PCIe 6.1: RW
  PCI LB 3.0: RW
  QEMU: (emu_mask) QEMU provides an emulated view of this bit. Guest
    writes do not propagate to hardware. QEMU sets this bit to 1 in
    hardware if a Memory BAR is exposed to the guest.
  Xen domU/dom0: We handle writes to this bit by mapping/unmapping BAR
    regions.
  Xen domU: For devices assigned to DomUs, memory decoding will be
    disabled at the time of initialization.

PCI_COMMAND_MASTER (bit 2)
  PCIe 6.1: RW
  PCI LB 3.0: RW
  QEMU: Pass through writes to hardware.
  Xen domU/dom0: Pass through writes to hardware.

PCI_COMMAND_SPECIAL (bit 3)
  PCIe 6.1: RO, hardwire to 0
  PCI LB 3.0: RW
  QEMU: Pass through writes to hardware.
  Xen domU/dom0: Pass through writes to hardware.

PCI_COMMAND_INVALIDATE (bit 4)
  PCIe 6.1: RO, hardwire to 0
  PCI LB 3.0: RW
  QEMU: Pass through writes to hardware.
  Xen domU/dom0: Pass through writes to hardware.

PCI_COMMAND_VGA_PALETTE (bit 5)
  PCIe 6.1: RO, hardwire to 0
  PCI LB 3.0: RW
  QEMU: Pass through writes to hardware.
  Xen domU/dom0: Pass through writes to hardware.

PCI_COMMAND_PARITY (bit 6)
  PCIe 6.1: RW
  PCI LB 3.0: RW
  QEMU: (emu_mask) QEMU provides an emulated view of this bit. Guest
    writes do not propagate to hardware.
  Xen domU: (rsvdp_mask) We treat this bit as RsvdP.
  Xen dom0: We allow dom0 to control this bit freely.

PCI_COMMAND_WAIT (bit 7)
  PCIe 6.1: RO, hardwire to 0
  PCI LB 3.0: hardwire to 0
  QEMU: res_mask
  Xen domU: (rsvdp_mask) We treat this bit as RsvdP.
  Xen dom0: We allow dom0 to control this bit freely.

PCI_COMMAND_SERR (bit 8)
  PCIe 6.1: RW
  PCI LB 3.0: RW
  QEMU: (emu_mask) QEMU provides an emulated view of this bit. Guest
    writes do not propagate to hardware.
  Xen domU: (rsvdp_mask) We treat this bit as RsvdP.
  Xen dom0: We allow dom0 to control this bit freely.

PCI_COMMAND_FAST_BACK (bit 9)
  PCIe 6.1: RO, hardwire to 0
  PCI LB 3.0: RW
  QEMU: (emu_mask) QEMU provides an emulated view of this bit. Guest
    writes do not propagate to hardware.
  Xen domU: (rsvdp_mask) We treat this bit as RsvdP.
  Xen dom0: We allow dom0 to control this bit freely.

PCI_COMMAND_INTX_DISABLE (bit 10)
  PCIe 6.1: RW
  PCI LB 3.0: RW
  QEMU: (emu_mask) QEMU provides an emulated view of this bit. Guest
    writes do not propagate to hardware. QEMU checks if INTx was mapped
    for a device. If it is not, then guest can't control
    PCI_COMMAND_INTX_DISABLE bit.
  Xen domU: We prohibit a guest from enabling INTx if MSI(X) is enabled.
  Xen dom0: We allow dom0 to control this bit freely.

Bits 11-15
  PCIe 6.1: RsvdP
  PCI LB 3.0: Reserved
  QEMU: res_mask
  Xen domU/dom0: rsvdp_mask

Signed-off-by: Oleksandr Andrushchenko <oleksandr_andrushchenko@epam.com>
Signed-off-by: Volodymyr Babchuk <volodymyr_babchuk@epam.com>
Signed-off-by: Stewart Hildebrand <stewart.hildebrand@amd.com>
---
In v12:
- Rework patch using vpci_add_register_mask()
- Add bitmask #define in pci_regs.h according to PCIe 6.1 spec, except
  don't add the RO bits because they were RW in PCI LB 3.0 spec.
- Move and expand TODO comment about properly emulating bits
- Update guest_cmd in msi.c/msix.c:control_write()
- Simplify cmd_write(), thanks to rsvdp_mask
- Update commit description

In v11:
- Fix copy-paste mistake: vpci->msi should be vpci->msix
- Handle PCI_COMMAND_IO
- Fix condition for disabling INTx in the MSI-X code
- Show domU changes to only allowed bits
- Show PCI_COMMAND_MEMORY write only after P2M was altered
- Update comments in the code
In v10:
- Added cf_check attribute to guest_cmd_read
- Removed warning about non-zero cmd
- Updated comment MSI code regarding disabling INTX
- Used ternary operator in vpci_add_register() call
- Disable memory decoding for DomUs in init_bars()
In v9:
- Reworked guest_cmd_read
- Added handling for more bits
Since v6:
- fold guest's logic into cmd_write
- implement cmd_read, so we can report emulated INTx state to guests
- introduce header->guest_cmd to hold the emulated state of the
  PCI_COMMAND register for guests
Since v5:
- add additional check for MSI-X enabled while altering INTX bit
- make sure INTx disabled while guests enable MSI/MSI-X
Since v3:
- gate more code on CONFIG_HAS_MSI
- removed logic for the case when MSI/MSI-X not enabled
---
 xen/drivers/vpci/header.c  | 59 +++++++++++++++++++++++++++++++++++---
 xen/drivers/vpci/msi.c     |  9 ++++++
 xen/drivers/vpci/msix.c    |  7 +++++
 xen/include/xen/pci_regs.h |  1 +
 xen/include/xen/vpci.h     |  3 ++
 5 files changed, 75 insertions(+), 4 deletions(-)

diff --git a/xen/drivers/vpci/header.c b/xen/drivers/vpci/header.c
index f0b0b64b0929..374e8e119231 100644
--- a/xen/drivers/vpci/header.c
+++ b/xen/drivers/vpci/header.c
@@ -168,6 +168,9 @@ static void modify_decoding(const struct pci_dev *pdev, uint16_t cmd,
     if ( !rom_only )
     {
         pci_conf_write16(pdev->sbdf, PCI_COMMAND, cmd);
+        /* Show DomU that we updated P2M */
+        header->guest_cmd &= ~PCI_COMMAND_MEMORY;
+        header->guest_cmd |= cmd & PCI_COMMAND_MEMORY;
         header->bars_mapped = map;
     }
     else
@@ -524,9 +527,26 @@ static void cf_check cmd_write(
 {
     struct vpci_header *header = data;
 
+    if ( !is_hardware_domain(pdev->domain) )
+    {
+        const struct vpci *vpci = pdev->vpci;
+
+        if ( (vpci->msi && vpci->msi->enabled) ||
+             (vpci->msix && vpci->msix->enabled) )
+            cmd |= PCI_COMMAND_INTX_DISABLE;
+
+        /*
+         * Do not show change to PCI_COMMAND_MEMORY bit until we finish
+         * modifying P2M mappings.
+         */
+        header->guest_cmd = (cmd & ~PCI_COMMAND_MEMORY) |
+                            (header->guest_cmd & PCI_COMMAND_MEMORY);
+    }
+
     /*
      * Let Dom0 play with all the bits directly except for the memory
-     * decoding one.
+     * decoding one. Bits that are not allowed for DomU are already
+     * handled above and by the rsvdp_mask.
      */
     if ( header->bars_mapped != !!(cmd & PCI_COMMAND_MEMORY) )
         /*
@@ -540,6 +560,14 @@ static void cf_check cmd_write(
         pci_conf_write16(pdev->sbdf, reg, cmd);
 }
 
+static uint32_t cf_check guest_cmd_read(
+    const struct pci_dev *pdev, unsigned int reg, void *data)
+{
+    const struct vpci_header *header = data;
+
+    return header->guest_cmd;
+}
+
 static void cf_check bar_write(
     const struct pci_dev *pdev, unsigned int reg, uint32_t val, void *data)
 {
@@ -756,9 +784,23 @@ static int cf_check init_header(struct pci_dev *pdev)
         return -EOPNOTSUPP;
     }
 
-    /* Setup a handler for the command register. */
-    rc = vpci_add_register(pdev->vpci, vpci_hw_read16, cmd_write, PCI_COMMAND,
-                           2, header);
+    /*
+     * Setup a handler for the command register.
+     *
+     * TODO: If support for emulated bits is added, re-visit how to handle
+     * PCI_COMMAND_PARITY, PCI_COMMAND_SERR, and PCI_COMMAND_FAST_BACK.
+     */
+    rc = vpci_add_register_mask(pdev->vpci,
+                                is_hwdom ? vpci_hw_read16 : guest_cmd_read,
+                                cmd_write, PCI_COMMAND, 2, header, 0, 0,
+                                PCI_COMMAND_RSVDP_MASK |
+                                    (is_hwdom ? 0
+                                              : PCI_COMMAND_IO |
+                                                PCI_COMMAND_PARITY |
+                                                PCI_COMMAND_WAIT |
+                                                PCI_COMMAND_SERR |
+                                                PCI_COMMAND_FAST_BACK),
+                                0);
     if ( rc )
         return rc;
 
@@ -843,6 +885,15 @@ static int cf_check init_header(struct pci_dev *pdev)
     if ( cmd & PCI_COMMAND_MEMORY )
         pci_conf_write16(pdev->sbdf, PCI_COMMAND, cmd & ~PCI_COMMAND_MEMORY);
 
+    /*
+     * Clear PCI_COMMAND_MEMORY and PCI_COMMAND_IO for DomUs, so they will
+     * always start with memory decoding disabled and to ensure that we will not
+     * call modify_bars() at the end of this function.
+     */
+    if ( !is_hwdom )
+        cmd &= ~(PCI_COMMAND_MEMORY | PCI_COMMAND_IO);
+    header->guest_cmd = cmd;
+
     for ( i = 0; i < num_bars; i++ )
     {
         uint8_t reg = PCI_BASE_ADDRESS_0 + i * 4;
diff --git a/xen/drivers/vpci/msi.c b/xen/drivers/vpci/msi.c
index 6ff71e5f9ab7..aae8055ce92d 100644
--- a/xen/drivers/vpci/msi.c
+++ b/xen/drivers/vpci/msi.c
@@ -70,6 +70,15 @@ static void cf_check control_write(
 
         if ( vpci_msi_arch_enable(msi, pdev, vectors) )
             return;
+
+        /*
+         * Make sure domU doesn't enable INTx while enabling MSI.
+         */
+        if ( !is_hardware_domain(pdev->domain) )
+        {
+            pci_intx(pdev, false);
+            pdev->vpci->header.guest_cmd |= PCI_COMMAND_INTX_DISABLE;
+        }
     }
     else
         vpci_msi_arch_disable(msi, pdev);
diff --git a/xen/drivers/vpci/msix.c b/xen/drivers/vpci/msix.c
index b6abab47efdd..d4f756ce287a 100644
--- a/xen/drivers/vpci/msix.c
+++ b/xen/drivers/vpci/msix.c
@@ -135,6 +135,13 @@ static void cf_check control_write(
         }
     }
 
+    /* Make sure domU doesn't enable INTx while enabling MSI-X. */
+    if ( new_enabled && !msix->enabled && !is_hardware_domain(pdev->domain) )
+    {
+        pci_intx(pdev, false);
+        pdev->vpci->header.guest_cmd |= PCI_COMMAND_INTX_DISABLE;
+    }
+
     msix->masked = new_masked;
     msix->enabled = new_enabled;
 
diff --git a/xen/include/xen/pci_regs.h b/xen/include/xen/pci_regs.h
index 9909b27425a5..b2f2e43e864d 100644
--- a/xen/include/xen/pci_regs.h
+++ b/xen/include/xen/pci_regs.h
@@ -48,6 +48,7 @@
 #define  PCI_COMMAND_SERR	0x100	/* Enable SERR */
 #define  PCI_COMMAND_FAST_BACK	0x200	/* Enable back-to-back writes */
 #define  PCI_COMMAND_INTX_DISABLE 0x400 /* INTx Emulation Disable */
+#define  PCI_COMMAND_RSVDP_MASK	0xf800
 
 #define PCI_STATUS		0x06	/* 16 bits */
 #define  PCI_STATUS_IMM_READY	0x01	/* Immediate Readiness */
diff --git a/xen/include/xen/vpci.h b/xen/include/xen/vpci.h
index e89c571890b2..77320a667e55 100644
--- a/xen/include/xen/vpci.h
+++ b/xen/include/xen/vpci.h
@@ -107,6 +107,9 @@ struct vpci {
         } bars[PCI_HEADER_NORMAL_NR_BARS + 1];
         /* At most 6 BARS + 1 expansion ROM BAR. */
 
+        /* Guest (domU only) view of the PCI_COMMAND register. */
+        uint16_t guest_cmd;
+
         /*
          * Store whether the ROM enable bit is set (doesn't imply ROM BAR
          * is mapped into guest p2m) if there's a ROM BAR on the device.
-- 
2.43.0



^ permalink raw reply related	[flat|nested] 68+ messages in thread

* [PATCH v12 11/15] vpci: add initial support for virtual PCI bus topology
  2024-01-09 21:51 [PATCH v12 00/15] PCI devices passthrough on Arm, part 3 Stewart Hildebrand
                   ` (9 preceding siblings ...)
  2024-01-09 21:51 ` [PATCH v12 10/15] vpci/header: emulate PCI_COMMAND register for guests Stewart Hildebrand
@ 2024-01-09 21:51 ` Stewart Hildebrand
  2024-01-12 11:46   ` George Dunlap
  2024-01-25 16:00   ` Jan Beulich
  2024-01-09 21:51 ` [PATCH v12 12/15] xen/arm: translate virtual PCI bus topology for guests Stewart Hildebrand
                   ` (3 subsequent siblings)
  14 siblings, 2 replies; 68+ messages in thread
From: Stewart Hildebrand @ 2024-01-09 21:51 UTC (permalink / raw)
  To: xen-devel
  Cc: Oleksandr Andrushchenko, Andrew Cooper, George Dunlap,
	Jan Beulich, Julien Grall, Stefano Stabellini, Wei Liu,
	Roger Pau Monné, Volodymyr Babchuk, Stewart Hildebrand

From: Oleksandr Andrushchenko <oleksandr_andrushchenko@epam.com>

Assign SBDF to the PCI devices being passed through with bus 0.
The resulting topology is where PCIe devices reside on the bus 0 of the
root complex itself (embedded endpoints).
This implementation is limited to 32 devices which are allowed on
a single PCI bus.

Please note, that at the moment only function 0 of a multifunction
device can be passed through.

Signed-off-by: Oleksandr Andrushchenko <oleksandr_andrushchenko@epam.com>
Signed-off-by: Volodymyr Babchuk <volodymyr_babchuk@epam.com>
Signed-off-by: Stewart Hildebrand <stewart.hildebrand@amd.com>
---
In v11:
- Fixed code formatting
- Removed bogus write_unlock() call
- Fixed type for new_dev_number
In v10:
- Removed ASSERT(pcidevs_locked())
- Removed redundant code (local sbdf variable, clearing sbdf during
device removal, etc)
- Added __maybe_unused attribute to "out:" label
- Introduced HAS_VPCI_GUEST_SUPPORT Kconfig option, as this is the
  first patch where it is used (previously was in "vpci: add hooks for
  PCI device assign/de-assign")
In v9:
- Lock in add_virtual_device() replaced with ASSERT (thanks, Stewart)
In v8:
- Added write lock in add_virtual_device
Since v6:
- re-work wrt new locking scheme
- OT: add ASSERT(pcidevs_write_locked()); to add_virtual_device()
Since v5:
- s/vpci_add_virtual_device/add_virtual_device and make it static
- call add_virtual_device from vpci_assign_device and do not use
  REGISTER_VPCI_INIT machinery
- add pcidevs_locked ASSERT
- use DECLARE_BITMAP for vpci_dev_assigned_map
Since v4:
- moved and re-worked guest sbdf initializers
- s/set_bit/__set_bit
- s/clear_bit/__clear_bit
- minor comment fix s/Virtual/Guest/
- added VPCI_MAX_VIRT_DEV constant (PCI_SLOT(~0) + 1) which will be used
  later for counting the number of MMIO handlers required for a guest
  (Julien)
Since v3:
 - make use of VPCI_INIT
 - moved all new code to vpci.c which belongs to it
 - changed open-coded 31 to PCI_SLOT(~0)
 - added comments and code to reject multifunction devices with
   functions other than 0
 - updated comment about vpci_dev_next and made it unsigned int
 - implement roll back in case of error while assigning/deassigning devices
 - s/dom%pd/%pd
Since v2:
 - remove casts that are (a) malformed and (b) unnecessary
 - add new line for better readability
 - remove CONFIG_HAS_VPCI_GUEST_SUPPORT ifdef's as the relevant vPCI
    functions are now completely gated with this config
 - gate common code with CONFIG_HAS_VPCI_GUEST_SUPPORT
New in v2
---
 xen/drivers/Kconfig     |  4 +++
 xen/drivers/vpci/vpci.c | 57 +++++++++++++++++++++++++++++++++++++++++
 xen/include/xen/sched.h |  8 ++++++
 xen/include/xen/vpci.h  | 11 ++++++++
 4 files changed, 80 insertions(+)

diff --git a/xen/drivers/Kconfig b/xen/drivers/Kconfig
index db94393f47a6..780490cf8e39 100644
--- a/xen/drivers/Kconfig
+++ b/xen/drivers/Kconfig
@@ -15,4 +15,8 @@ source "drivers/video/Kconfig"
 config HAS_VPCI
 	bool
 
+config HAS_VPCI_GUEST_SUPPORT
+	bool
+	depends on HAS_VPCI
+
 endmenu
diff --git a/xen/drivers/vpci/vpci.c b/xen/drivers/vpci/vpci.c
index a0e8b1012509..57cfabfd9ad3 100644
--- a/xen/drivers/vpci/vpci.c
+++ b/xen/drivers/vpci/vpci.c
@@ -40,6 +40,49 @@ extern vpci_register_init_t *const __start_vpci_array[];
 extern vpci_register_init_t *const __end_vpci_array[];
 #define NUM_VPCI_INIT (__end_vpci_array - __start_vpci_array)
 
+#ifdef CONFIG_HAS_VPCI_GUEST_SUPPORT
+static int add_virtual_device(struct pci_dev *pdev)
+{
+    struct domain *d = pdev->domain;
+    unsigned int new_dev_number;
+
+    if ( is_hardware_domain(d) )
+        return 0;
+
+    ASSERT(rw_is_write_locked(&pdev->domain->pci_lock));
+
+    /*
+     * Each PCI bus supports 32 devices/slots at max or up to 256 when
+     * there are multi-function ones which are not yet supported.
+     */
+    if ( pdev->info.is_extfn && !pdev->info.is_virtfn )
+    {
+        gdprintk(XENLOG_ERR, "%pp: only function 0 passthrough supported\n",
+                 &pdev->sbdf);
+        return -EOPNOTSUPP;
+    }
+    new_dev_number = find_first_zero_bit(d->vpci_dev_assigned_map,
+                                         VPCI_MAX_VIRT_DEV);
+    if ( new_dev_number == VPCI_MAX_VIRT_DEV )
+        return -ENOSPC;
+
+    __set_bit(new_dev_number, &d->vpci_dev_assigned_map);
+
+    /*
+     * Both segment and bus number are 0:
+     *  - we emulate a single host bridge for the guest, e.g. segment 0
+     *  - with bus 0 the virtual devices are seen as embedded
+     *    endpoints behind the root complex
+     *
+     * TODO: add support for multi-function devices.
+     */
+    pdev->vpci->guest_sbdf = PCI_SBDF(0, 0, new_dev_number, 0);
+
+    return 0;
+}
+
+#endif /* CONFIG_HAS_VPCI_GUEST_SUPPORT */
+
 void vpci_deassign_device(struct pci_dev *pdev)
 {
     unsigned int i;
@@ -49,6 +92,12 @@ void vpci_deassign_device(struct pci_dev *pdev)
     if ( !has_vpci(pdev->domain) || !pdev->vpci )
         return;
 
+#ifdef CONFIG_HAS_VPCI_GUEST_SUPPORT
+    if ( pdev->vpci->guest_sbdf.sbdf != ~0 )
+        __clear_bit(pdev->vpci->guest_sbdf.dev,
+                    &pdev->domain->vpci_dev_assigned_map);
+#endif
+
     spin_lock(&pdev->vpci->lock);
     while ( !list_empty(&pdev->vpci->handlers) )
     {
@@ -105,6 +154,13 @@ int vpci_assign_device(struct pci_dev *pdev)
     INIT_LIST_HEAD(&pdev->vpci->handlers);
     spin_lock_init(&pdev->vpci->lock);
 
+#ifdef CONFIG_HAS_VPCI_GUEST_SUPPORT
+    pdev->vpci->guest_sbdf.sbdf = ~0;
+    rc = add_virtual_device(pdev);
+    if ( rc )
+        goto out;
+#endif
+
     for ( i = 0; i < NUM_VPCI_INIT; i++ )
     {
         rc = __start_vpci_array[i](pdev);
@@ -112,6 +168,7 @@ int vpci_assign_device(struct pci_dev *pdev)
             break;
     }
 
+ out: __maybe_unused;
     if ( rc )
         vpci_deassign_device(pdev);
 
diff --git a/xen/include/xen/sched.h b/xen/include/xen/sched.h
index 37f5922f3206..b58a822847be 100644
--- a/xen/include/xen/sched.h
+++ b/xen/include/xen/sched.h
@@ -484,6 +484,14 @@ struct domain
      * 2. pdev->vpci->lock
      */
     rwlock_t pci_lock;
+#ifdef CONFIG_HAS_VPCI_GUEST_SUPPORT
+    /*
+     * The bitmap which shows which device numbers are already used by the
+     * virtual PCI bus topology and is used to assign a unique SBDF to the
+     * next passed through virtual PCI device.
+     */
+    DECLARE_BITMAP(vpci_dev_assigned_map, VPCI_MAX_VIRT_DEV);
+#endif
 #endif
 
 #ifdef CONFIG_HAS_PASSTHROUGH
diff --git a/xen/include/xen/vpci.h b/xen/include/xen/vpci.h
index 77320a667e55..053467f04982 100644
--- a/xen/include/xen/vpci.h
+++ b/xen/include/xen/vpci.h
@@ -21,6 +21,13 @@ typedef int vpci_register_init_t(struct pci_dev *dev);
 
 #define VPCI_ECAM_BDF(addr)     (((addr) & 0x0ffff000) >> 12)
 
+/*
+ * Maximum number of devices supported by the virtual bus topology:
+ * each PCI bus supports 32 devices/slots at max or up to 256 when
+ * there are multi-function ones which are not yet supported.
+ */
+#define VPCI_MAX_VIRT_DEV       (PCI_SLOT(~0) + 1)
+
 #define REGISTER_VPCI_INIT(x, p)                \
   static vpci_register_init_t *const x##_entry  \
                __used_section(".data.vpci." p) = x
@@ -175,6 +182,10 @@ struct vpci {
             struct vpci_arch_msix_entry arch;
         } entries[];
     } *msix;
+#ifdef CONFIG_HAS_VPCI_GUEST_SUPPORT
+    /* Guest SBDF of the device. */
+    pci_sbdf_t guest_sbdf;
+#endif
 #endif
 };
 
-- 
2.43.0



^ permalink raw reply related	[flat|nested] 68+ messages in thread

* [PATCH v12 12/15] xen/arm: translate virtual PCI bus topology for guests
  2024-01-09 21:51 [PATCH v12 00/15] PCI devices passthrough on Arm, part 3 Stewart Hildebrand
                   ` (10 preceding siblings ...)
  2024-01-09 21:51 ` [PATCH v12 11/15] vpci: add initial support for virtual PCI bus topology Stewart Hildebrand
@ 2024-01-09 21:51 ` Stewart Hildebrand
  2024-01-09 21:51 ` [PATCH v12 13/15] xen/arm: account IO handlers for emulated PCI MSI-X Stewart Hildebrand
                   ` (2 subsequent siblings)
  14 siblings, 0 replies; 68+ messages in thread
From: Stewart Hildebrand @ 2024-01-09 21:51 UTC (permalink / raw)
  To: xen-devel
  Cc: Oleksandr Andrushchenko, Stefano Stabellini, Julien Grall,
	Bertrand Marquis, Michal Orzel, Volodymyr Babchuk,
	Roger Pau Monné, Volodymyr Babchuk, Stewart Hildebrand

From: Oleksandr Andrushchenko <oleksandr_andrushchenko@epam.com>

There are three  originators for the PCI configuration space access:
1. The domain that owns physical host bridge: MMIO handlers are
there so we can update vPCI register handlers with the values
written by the hardware domain, e.g. physical view of the registers
vs guest's view on the configuration space.
2. Guest access to the passed through PCI devices: we need to properly
map virtual bus topology to the physical one, e.g. pass the configuration
space access to the corresponding physical devices.
3. Emulated host PCI bridge access. It doesn't exist in the physical
topology, e.g. it can't be mapped to some physical host bridge.
So, all access to the host bridge itself needs to be trapped and
emulated.

Signed-off-by: Oleksandr Andrushchenko <oleksandr_andrushchenko@epam.com>
Signed-off-by: Volodymyr Babchuk <volodymyr_babchuk@epam.com>
Signed-off-by: Stewart Hildebrand <stewart.hildebrand@amd.com>
---
In v11:
- Fixed format issues
- Added ASSERT_UNREACHABLE() to the dummy implementation of
vpci_translate_virtual_device()
- Moved variable in vpci_sbdf_from_gpa(), now it is easier to follow
the logic in the function
Since v9:
- Commend about required lock replaced with ASSERT()
- Style fixes
- call to vpci_translate_virtual_device folded into vpci_sbdf_from_gpa
Since v8:
- locks moved out of vpci_translate_virtual_device()
Since v6:
- add pcidevs locking to vpci_translate_virtual_device
- update wrt to the new locking scheme
Since v5:
- add vpci_translate_virtual_device for #ifndef CONFIG_HAS_VPCI_GUEST_SUPPORT
  case to simplify ifdefery
- add ASSERT(!is_hardware_domain(d)); to vpci_translate_virtual_device
- reset output register on failed virtual SBDF translation
Since v4:
- indentation fixes
- constify struct domain
- updated commit message
- updates to the new locking scheme (pdev->vpci_lock)
Since v3:
- revisit locking
- move code to vpci.c
Since v2:
 - pass struct domain instead of struct vcpu
 - constify arguments where possible
 - gate relevant code with CONFIG_HAS_VPCI_GUEST_SUPPORT
New in v2
---
 xen/arch/arm/vpci.c     | 47 +++++++++++++++++++++++++++++++----------
 xen/drivers/vpci/vpci.c | 24 +++++++++++++++++++++
 xen/include/xen/vpci.h  | 12 +++++++++++
 3 files changed, 72 insertions(+), 11 deletions(-)

diff --git a/xen/arch/arm/vpci.c b/xen/arch/arm/vpci.c
index 3bc4bb55082a..7a6a0017d132 100644
--- a/xen/arch/arm/vpci.c
+++ b/xen/arch/arm/vpci.c
@@ -7,31 +7,51 @@
 
 #include <asm/mmio.h>
 
-static pci_sbdf_t vpci_sbdf_from_gpa(const struct pci_host_bridge *bridge,
-                                     paddr_t gpa)
+static bool vpci_sbdf_from_gpa(struct domain *d,
+                               const struct pci_host_bridge *bridge,
+                               paddr_t gpa, pci_sbdf_t *sbdf)
 {
-    pci_sbdf_t sbdf;
+    bool translated = true;
+
+    ASSERT(sbdf);
 
     if ( bridge )
     {
-        sbdf.sbdf = VPCI_ECAM_BDF(gpa - bridge->cfg->phys_addr);
-        sbdf.seg = bridge->segment;
-        sbdf.bus += bridge->cfg->busn_start;
+        sbdf->sbdf = VPCI_ECAM_BDF(gpa - bridge->cfg->phys_addr);
+        sbdf->seg = bridge->segment;
+        sbdf->bus += bridge->cfg->busn_start;
     }
     else
-        sbdf.sbdf = VPCI_ECAM_BDF(gpa - GUEST_VPCI_ECAM_BASE);
+    {
+        /*
+         * For the passed through devices we need to map their virtual SBDF
+         * to the physical PCI device being passed through.
+         */
+        sbdf->sbdf = VPCI_ECAM_BDF(gpa - GUEST_VPCI_ECAM_BASE);
+        read_lock(&d->pci_lock);
+        translated = vpci_translate_virtual_device(d, sbdf);
+        read_unlock(&d->pci_lock);
+    }
 
-    return sbdf;
+    return translated;
 }
 
 static int vpci_mmio_read(struct vcpu *v, mmio_info_t *info,
                           register_t *r, void *p)
 {
     struct pci_host_bridge *bridge = p;
-    pci_sbdf_t sbdf = vpci_sbdf_from_gpa(bridge, info->gpa);
+    pci_sbdf_t sbdf;
     /* data is needed to prevent a pointer cast on 32bit */
     unsigned long data;
 
+    ASSERT(!bridge == !is_hardware_domain(v->domain));
+
+    if ( !vpci_sbdf_from_gpa(v->domain, bridge, info->gpa, &sbdf) )
+    {
+        *r = ~0UL;
+        return 1;
+    }
+
     if ( vpci_ecam_read(sbdf, ECAM_REG_OFFSET(info->gpa),
                         1U << info->dabt.size, &data) )
     {
@@ -39,7 +59,7 @@ static int vpci_mmio_read(struct vcpu *v, mmio_info_t *info,
         return 1;
     }
 
-    *r = ~0ul;
+    *r = ~0UL;
 
     return 0;
 }
@@ -48,7 +68,12 @@ static int vpci_mmio_write(struct vcpu *v, mmio_info_t *info,
                            register_t r, void *p)
 {
     struct pci_host_bridge *bridge = p;
-    pci_sbdf_t sbdf = vpci_sbdf_from_gpa(bridge, info->gpa);
+    pci_sbdf_t sbdf;
+
+    ASSERT(!bridge == !is_hardware_domain(v->domain));
+
+    if ( !vpci_sbdf_from_gpa(v->domain, bridge, info->gpa, &sbdf) )
+        return 1;
 
     return vpci_ecam_write(sbdf, ECAM_REG_OFFSET(info->gpa),
                            1U << info->dabt.size, r);
diff --git a/xen/drivers/vpci/vpci.c b/xen/drivers/vpci/vpci.c
index 57cfabfd9ad3..6e2d428d691b 100644
--- a/xen/drivers/vpci/vpci.c
+++ b/xen/drivers/vpci/vpci.c
@@ -81,6 +81,30 @@ static int add_virtual_device(struct pci_dev *pdev)
     return 0;
 }
 
+/*
+ * Find the physical device which is mapped to the virtual device
+ * and translate virtual SBDF to the physical one.
+ */
+bool vpci_translate_virtual_device(const struct domain *d, pci_sbdf_t *sbdf)
+{
+    const struct pci_dev *pdev;
+
+    ASSERT(!is_hardware_domain(d));
+    ASSERT(rw_is_locked(&d->pci_lock));
+
+    for_each_pdev ( d, pdev )
+    {
+        if ( pdev->vpci && (pdev->vpci->guest_sbdf.sbdf == sbdf->sbdf) )
+        {
+            /* Replace guest SBDF with the physical one. */
+            *sbdf = pdev->sbdf;
+            return true;
+        }
+    }
+
+    return false;
+}
+
 #endif /* CONFIG_HAS_VPCI_GUEST_SUPPORT */
 
 void vpci_deassign_device(struct pci_dev *pdev)
diff --git a/xen/include/xen/vpci.h b/xen/include/xen/vpci.h
index 053467f04982..3cfd9a401178 100644
--- a/xen/include/xen/vpci.h
+++ b/xen/include/xen/vpci.h
@@ -302,6 +302,18 @@ static inline bool __must_check vpci_process_pending(struct vcpu *v)
 }
 #endif
 
+#ifdef CONFIG_HAS_VPCI_GUEST_SUPPORT
+bool vpci_translate_virtual_device(const struct domain *d, pci_sbdf_t *sbdf);
+#else
+static inline bool vpci_translate_virtual_device(const struct domain *d,
+                                                 pci_sbdf_t *sbdf)
+{
+    ASSERT_UNREACHABLE();
+
+    return false;
+}
+#endif
+
 #endif
 
 /*
-- 
2.43.0



^ permalink raw reply related	[flat|nested] 68+ messages in thread

* [PATCH v12 13/15] xen/arm: account IO handlers for emulated PCI MSI-X
  2024-01-09 21:51 [PATCH v12 00/15] PCI devices passthrough on Arm, part 3 Stewart Hildebrand
                   ` (11 preceding siblings ...)
  2024-01-09 21:51 ` [PATCH v12 12/15] xen/arm: translate virtual PCI bus topology for guests Stewart Hildebrand
@ 2024-01-09 21:51 ` Stewart Hildebrand
  2024-01-09 21:51 ` [PATCH v12 14/15] xen/arm: vpci: permit access to guest vpci space Stewart Hildebrand
  2024-01-09 21:51 ` [PATCH v12 15/15] arm/vpci: honor access size when returning an error Stewart Hildebrand
  14 siblings, 0 replies; 68+ messages in thread
From: Stewart Hildebrand @ 2024-01-09 21:51 UTC (permalink / raw)
  To: xen-devel
  Cc: Oleksandr Andrushchenko, Stefano Stabellini, Julien Grall,
	Bertrand Marquis, Michal Orzel, Volodymyr Babchuk, Julien Grall,
	Volodymyr Babchuk, Stewart Hildebrand

From: Oleksandr Andrushchenko <oleksandr_andrushchenko@epam.com>

At the moment, we always allocate an extra 16 slots for IO handlers
(see MAX_IO_HANDLER). So while adding IO trap handlers for the emulated
MSI-X registers we need to explicitly tell that we have additional IO
handlers, so those are accounted.

Signed-off-by: Oleksandr Andrushchenko <oleksandr_andrushchenko@epam.com>
Acked-by: Julien Grall <jgrall@amazon.com>
Signed-off-by: Volodymyr Babchuk <volodymyr_babchuk@epam.com>
Signed-off-by: Stewart Hildebrand <stewart.hildebrand@amd.com>
---
This actually moved here from the part 2 of the prep work for PCI
passthrough on Arm as it seems to be the proper place for it.

Since v5:
- optimize with IS_ENABLED(CONFIG_HAS_PCI_MSI) since VPCI_MAX_VIRT_DEV is
  defined unconditionally
New in v5
---
 xen/arch/arm/vpci.c | 14 +++++++++++++-
 1 file changed, 13 insertions(+), 1 deletion(-)

diff --git a/xen/arch/arm/vpci.c b/xen/arch/arm/vpci.c
index 7a6a0017d132..348ba0fbc860 100644
--- a/xen/arch/arm/vpci.c
+++ b/xen/arch/arm/vpci.c
@@ -130,6 +130,8 @@ static int vpci_get_num_handlers_cb(struct domain *d,
 
 unsigned int domain_vpci_get_num_mmio_handlers(struct domain *d)
 {
+    unsigned int count;
+
     if ( !has_vpci(d) )
         return 0;
 
@@ -150,7 +152,17 @@ unsigned int domain_vpci_get_num_mmio_handlers(struct domain *d)
      * For guests each host bridge requires one region to cover the
      * configuration space. At the moment, we only expose a single host bridge.
      */
-    return 1;
+    count = 1;
+
+    /*
+     * There's a single MSI-X MMIO handler that deals with both PBA
+     * and MSI-X tables per each PCI device being passed through.
+     * Maximum number of emulated virtual devices is VPCI_MAX_VIRT_DEV.
+     */
+    if ( IS_ENABLED(CONFIG_HAS_PCI_MSI) )
+        count += VPCI_MAX_VIRT_DEV;
+
+    return count;
 }
 
 /*
-- 
2.43.0



^ permalink raw reply related	[flat|nested] 68+ messages in thread

* [PATCH v12 14/15] xen/arm: vpci: permit access to guest vpci space
  2024-01-09 21:51 [PATCH v12 00/15] PCI devices passthrough on Arm, part 3 Stewart Hildebrand
                   ` (12 preceding siblings ...)
  2024-01-09 21:51 ` [PATCH v12 13/15] xen/arm: account IO handlers for emulated PCI MSI-X Stewart Hildebrand
@ 2024-01-09 21:51 ` Stewart Hildebrand
  2024-01-17  3:03   ` Stewart Hildebrand
  2024-01-09 21:51 ` [PATCH v12 15/15] arm/vpci: honor access size when returning an error Stewart Hildebrand
  14 siblings, 1 reply; 68+ messages in thread
From: Stewart Hildebrand @ 2024-01-09 21:51 UTC (permalink / raw)
  To: xen-devel
  Cc: Stewart Hildebrand, Stefano Stabellini, Julien Grall,
	Bertrand Marquis, Michal Orzel, Volodymyr Babchuk, Andrew Cooper,
	George Dunlap, Jan Beulich, Wei Liu

Move iomem_caps initialization earlier (before arch_domain_create()).

Signed-off-by: Stewart Hildebrand <stewart.hildebrand@amd.com>
---
Changes in v11:
* move both iomem_caps and irq_caps initializations earlier, along with NULL
  check

Changes in v10:
* fix off-by-one
* also permit access to GUEST_VPCI_PREFETCH_MEM_ADDR

Changes in v9:
* new patch

This is sort of a follow-up to:

  baa6ea700386 ("vpci: add permission checks to map_range()")

I don't believe we need a fixes tag since this depends on the vPCI p2m BAR
patches.
---
 xen/arch/arm/vpci.c |  9 +++++++++
 xen/common/domain.c | 12 ++++++------
 2 files changed, 15 insertions(+), 6 deletions(-)

diff --git a/xen/arch/arm/vpci.c b/xen/arch/arm/vpci.c
index 348ba0fbc860..b6ef440f17b0 100644
--- a/xen/arch/arm/vpci.c
+++ b/xen/arch/arm/vpci.c
@@ -2,6 +2,7 @@
 /*
  * xen/arch/arm/vpci.c
  */
+#include <xen/iocap.h>
 #include <xen/sched.h>
 #include <xen/vpci.h>
 
@@ -115,8 +116,16 @@ int domain_vpci_init(struct domain *d)
             return ret;
     }
     else
+    {
         register_mmio_handler(d, &vpci_mmio_handler,
                               GUEST_VPCI_ECAM_BASE, GUEST_VPCI_ECAM_SIZE, NULL);
+        iomem_permit_access(d, paddr_to_pfn(GUEST_VPCI_MEM_ADDR),
+                            paddr_to_pfn(GUEST_VPCI_MEM_ADDR +
+                                         GUEST_VPCI_MEM_SIZE - 1));
+        iomem_permit_access(d, paddr_to_pfn(GUEST_VPCI_PREFETCH_MEM_ADDR),
+                            paddr_to_pfn(GUEST_VPCI_PREFETCH_MEM_ADDR +
+                                         GUEST_VPCI_PREFETCH_MEM_SIZE - 1));
+    }
 
     return 0;
 }
diff --git a/xen/common/domain.c b/xen/common/domain.c
index f6f557499660..8078d1ade690 100644
--- a/xen/common/domain.c
+++ b/xen/common/domain.c
@@ -693,6 +693,12 @@ struct domain *domain_create(domid_t domid,
         d->nr_pirqs = min(d->nr_pirqs, nr_irqs);
 
         radix_tree_init(&d->pirq_tree);
+
+        err = -ENOMEM;
+        d->iomem_caps = rangeset_new(d, "I/O Memory", RANGESETF_prettyprint_hex);
+        d->irq_caps   = rangeset_new(d, "Interrupts", 0);
+        if ( !d->iomem_caps || !d->irq_caps )
+            goto fail;
     }
 
     if ( (err = arch_domain_create(d, config, flags)) != 0 )
@@ -711,12 +717,6 @@ struct domain *domain_create(domid_t domid,
         watchdog_domain_init(d);
         init_status |= INIT_watchdog;
 
-        err = -ENOMEM;
-        d->iomem_caps = rangeset_new(d, "I/O Memory", RANGESETF_prettyprint_hex);
-        d->irq_caps   = rangeset_new(d, "Interrupts", 0);
-        if ( !d->iomem_caps || !d->irq_caps )
-            goto fail;
-
         if ( (err = xsm_domain_create(XSM_HOOK, d, config->ssidref)) != 0 )
             goto fail;
 
-- 
2.43.0



^ permalink raw reply related	[flat|nested] 68+ messages in thread

* [PATCH v12 15/15] arm/vpci: honor access size when returning an error
  2024-01-09 21:51 [PATCH v12 00/15] PCI devices passthrough on Arm, part 3 Stewart Hildebrand
                   ` (13 preceding siblings ...)
  2024-01-09 21:51 ` [PATCH v12 14/15] xen/arm: vpci: permit access to guest vpci space Stewart Hildebrand
@ 2024-01-09 21:51 ` Stewart Hildebrand
  14 siblings, 0 replies; 68+ messages in thread
From: Stewart Hildebrand @ 2024-01-09 21:51 UTC (permalink / raw)
  To: xen-devel
  Cc: Volodymyr Babchuk, Stefano Stabellini, Julien Grall,
	Bertrand Marquis, Michal Orzel, Volodymyr Babchuk,
	Stewart Hildebrand

From: Volodymyr Babchuk <Volodymyr_Babchuk@epam.com>

Guest can try to read config space using different access sizes: 8,
16, 32, 64 bits. We need to take this into account when we are
returning an error back to MMIO handler, otherwise it is possible to
provide more data than requested: i.e. guest issues LDRB instruction
to read one byte, but we are writing 0xFFFFFFFFFFFFFFFF in the target
register.

Signed-off-by: Volodymyr Babchuk <volodymyr_babchuk@epam.com>
Signed-off-by: Stewart Hildebrand <stewart.hildebrand@amd.com>
---
 xen/arch/arm/vpci.c | 6 ++++--
 1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/xen/arch/arm/vpci.c b/xen/arch/arm/vpci.c
index b6ef440f17b0..05a479096ef7 100644
--- a/xen/arch/arm/vpci.c
+++ b/xen/arch/arm/vpci.c
@@ -42,6 +42,8 @@ static int vpci_mmio_read(struct vcpu *v, mmio_info_t *info,
 {
     struct pci_host_bridge *bridge = p;
     pci_sbdf_t sbdf;
+    const uint8_t access_size = (1 << info->dabt.size) * 8;
+    const uint64_t access_mask = GENMASK_ULL(access_size - 1, 0);
     /* data is needed to prevent a pointer cast on 32bit */
     unsigned long data;
 
@@ -49,7 +51,7 @@ static int vpci_mmio_read(struct vcpu *v, mmio_info_t *info,
 
     if ( !vpci_sbdf_from_gpa(v->domain, bridge, info->gpa, &sbdf) )
     {
-        *r = ~0UL;
+        *r = access_mask;
         return 1;
     }
 
@@ -60,7 +62,7 @@ static int vpci_mmio_read(struct vcpu *v, mmio_info_t *info,
         return 1;
     }
 
-    *r = ~0UL;
+    *r = access_mask;
 
     return 0;
 }
-- 
2.43.0



^ permalink raw reply related	[flat|nested] 68+ messages in thread

* Re: [PATCH v12 07/15] rangeset: add rangeset_purge() function
  2024-01-09 21:51 ` [PATCH v12 07/15] rangeset: add rangeset_purge() function Stewart Hildebrand
@ 2024-01-10 10:00   ` Jan Beulich
  0 siblings, 0 replies; 68+ messages in thread
From: Jan Beulich @ 2024-01-10 10:00 UTC (permalink / raw)
  To: Stewart Hildebrand
  Cc: Volodymyr Babchuk, Andrew Cooper, George Dunlap, Julien Grall,
	Stefano Stabellini, Wei Liu, xen-devel

On 09.01.2024 22:51, Stewart Hildebrand wrote:
> From: Volodymyr Babchuk <Volodymyr_Babchuk@epam.com>
> 
> This function can be used when user wants to remove all rangeset
> entries but do not want to destroy rangeset itself.
> 
> Signed-off-by: Volodymyr Babchuk <volodymyr_babchuk@epam.com>
> Signed-off-by: Stewart Hildebrand <stewart.hildebrand@amd.com>

Acked-by: Jan Beulich <jbeulich@suse.com>




^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [PATCH v12 11/15] vpci: add initial support for virtual PCI bus topology
  2024-01-09 21:51 ` [PATCH v12 11/15] vpci: add initial support for virtual PCI bus topology Stewart Hildebrand
@ 2024-01-12 11:46   ` George Dunlap
  2024-01-12 13:50     ` Stewart Hildebrand
  2024-01-25 16:00   ` Jan Beulich
  1 sibling, 1 reply; 68+ messages in thread
From: George Dunlap @ 2024-01-12 11:46 UTC (permalink / raw)
  To: Stewart Hildebrand
  Cc: xen-devel, Oleksandr Andrushchenko, Andrew Cooper, Jan Beulich,
	Julien Grall, Stefano Stabellini, Wei Liu, Roger Pau Monné,
	Volodymyr Babchuk

On Tue, Jan 9, 2024 at 9:54 PM Stewart Hildebrand
<stewart.hildebrand@amd.com> wrote:
> diff --git a/xen/include/xen/sched.h b/xen/include/xen/sched.h
> index 37f5922f3206..b58a822847be 100644
> --- a/xen/include/xen/sched.h
> +++ b/xen/include/xen/sched.h
> @@ -484,6 +484,14 @@ struct domain
>       * 2. pdev->vpci->lock
>       */
>      rwlock_t pci_lock;
> +#ifdef CONFIG_HAS_VPCI_GUEST_SUPPORT
> +    /*
> +     * The bitmap which shows which device numbers are already used by the
> +     * virtual PCI bus topology and is used to assign a unique SBDF to the
> +     * next passed through virtual PCI device.
> +     */
> +    DECLARE_BITMAP(vpci_dev_assigned_map, VPCI_MAX_VIRT_DEV);
> +#endif
>  #endif

Without digging through the whole series, how big do we expect this
bitmap to be on typical systems?

If it's only going to be a handful of bytes, keeping it around for all
guests would be OK; but it's large, it would be better as a pointer,
since it's unused on the vast majority of guests.

 -George


^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [PATCH v12 01/15] vpci: use per-domain PCI lock to protect vpci structure
  2024-01-09 21:51 ` [PATCH v12 01/15] vpci: use per-domain PCI lock to protect vpci structure Stewart Hildebrand
@ 2024-01-12 13:48   ` Roger Pau Monné
  2024-01-12 17:54     ` Stewart Hildebrand
  2024-01-15 19:43   ` [PATCH v12.2 " Stewart Hildebrand
  1 sibling, 1 reply; 68+ messages in thread
From: Roger Pau Monné @ 2024-01-12 13:48 UTC (permalink / raw)
  To: Stewart Hildebrand
  Cc: xen-devel, Oleksandr Andrushchenko, Jan Beulich, Andrew Cooper,
	Wei Liu, George Dunlap, Julien Grall, Stefano Stabellini,
	Jun Nakajima, Kevin Tian, Paul Durrant, Volodymyr Babchuk

On Tue, Jan 09, 2024 at 04:51:16PM -0500, Stewart Hildebrand wrote:
> From: Oleksandr Andrushchenko <oleksandr_andrushchenko@epam.com>
> 
> Use a previously introduced per-domain read/write lock to check
> whether vpci is present, so we are sure there are no accesses to the
> contents of the vpci struct if not.

Nit: isn't this sentence kind of awkward?  I would word it as:

"Use the per-domain PCI read/write lock to protect the presence of the
pci device vpci field."

> This lock can be used (and in a
> few cases is used right away) so that vpci removal can be performed
> while holding the lock in write mode. Previously such removal could
> race with vpci_read for example.
> 
> When taking both d->pci_lock and pdev->vpci->lock, they should be
> taken in this exact order: d->pci_lock then pdev->vpci->lock to avoid
> possible deadlock situations.
> 
> 1. Per-domain's pci_lock is used to protect pdev->vpci structure
> from being removed.
> 
> 2. Writing the command register and ROM BAR register may trigger
> modify_bars to run, which in turn may access multiple pdevs while
> checking for the existing BAR's overlap. The overlapping check, if
> done under the read lock, requires vpci->lock to be acquired on both
> devices being compared, which may produce a deadlock. It is not
> possible to upgrade read lock to write lock in such a case. So, in
> order to prevent the deadlock, use d->pci_lock instead.

... use d->pci_lock in write mode instead.

> 
> All other code, which doesn't lead to pdev->vpci destruction and does
> not access multiple pdevs at the same time, can still use a
> combination of the read lock and pdev->vpci->lock.
> 
> 3. Drop const qualifier where the new rwlock is used and this is
> appropriate.
> 
> 4. Do not call process_pending_softirqs with any locks held. For that
> unlock prior the call and re-acquire the locks after. After
> re-acquiring the lock there is no need to check if pdev->vpci exists:
>  - in apply_map because of the context it is called (no race condition
>    possible)
>  - for MSI/MSI-X debug code because it is called at the end of
>    pdev->vpci access and no further access to pdev->vpci is made
> 
> 5. Use d->pci_lock around for_each_pdev and pci_get_pdev_by_domain
> while accessing pdevs in vpci code.
> 
> Suggested-by: Roger Pau Monné <roger.pau@citrix.com>
> Suggested-by: Jan Beulich <jbeulich@suse.com>
> Signed-off-by: Oleksandr Andrushchenko <oleksandr_andrushchenko@epam.com>
> Signed-off-by: Volodymyr Babchuk <volodymyr_babchuk@epam.com>
> Signed-off-by: Stewart Hildebrand <stewart.hildebrand@amd.com>

Just one code fix regarding the usage of read_lock (instead of the
trylock version) in vpci_msix_arch_print().. With that and the
comments adjusted:

Reviewed-by: Roger Pau Monné <roger.pau@citrix.com>

> ---
> Changes in v12:
>  - s/pci_rwlock/pci_lock/ in commit message
>  - expand comment about scope of pci_lock in sched.h
>  - in vpci_{read,write}, if hwdom is trying to access a device assigned
>    to dom_xen, holding hwdom->pci_lock is sufficient (no need to hold
>    dom_xen->pci_lock)
>  - reintroduce ASSERT in vmx_pi_update_irte()
>  - reintroduce ASSERT in __pci_enable_msi{x}()
>  - delete note 6. in commit message about removing ASSERTs since we have
>    reintroduced them
> 
> Changes in v11:
>  - Fixed commit message regarding possible spinlocks
>  - Removed parameter from allocate_and_map_msi_pirq(), which was added
>  in the prev version. Now we are taking pcidevs_lock in
>  physdev_map_pirq()
>  - Returned ASSERT to pci_enable_msi
>  - Fixed case when we took read lock instead of write one
>  - Fixed label indentation
> 
> Changes in v10:
>  - Moved printk pas locked area
>  - Returned back ASSERTs
>  - Added new parameter to allocate_and_map_msi_pirq() so it knows if
>  it should take the global pci lock
>  - Added comment about possible improvement in vpci_write
>  - Changed ASSERT(rw_is_locked()) to rw_is_write_locked() in
>    appropriate places
>  - Renamed release_domain_locks() to release_domain_write_locks()
>  - moved domain_done label in vpci_dump_msi() to correct place
> Changes in v9:
>  - extended locked region to protect vpci_remove_device and
>    vpci_add_handlers() calls
>  - vpci_write() takes lock in the write mode to protect
>    potential call to modify_bars()
>  - renamed lock releasing function
>  - removed ASSERT()s from msi code
>  - added trylock in vpci_dump_msi
> 
> Changes in v8:
>  - changed d->vpci_lock to d->pci_lock
>  - introducing d->pci_lock in a separate patch
>  - extended locked region in vpci_process_pending
>  - removed pcidevs_lockis vpci_dump_msi()
>  - removed some changes as they are not needed with
>    the new locking scheme
>  - added handling for hwdom && dom_xen case
> ---
>  xen/arch/x86/hvm/vmsi.c       | 22 +++++++--------
>  xen/arch/x86/hvm/vmx/vmx.c    |  2 +-
>  xen/arch/x86/irq.c            |  8 +++---
>  xen/arch/x86/msi.c            | 14 +++++-----
>  xen/arch/x86/physdev.c        |  2 ++
>  xen/drivers/passthrough/pci.c |  9 +++---
>  xen/drivers/vpci/header.c     | 18 ++++++++++++
>  xen/drivers/vpci/msi.c        | 28 +++++++++++++++++--
>  xen/drivers/vpci/msix.c       | 52 ++++++++++++++++++++++++++++++-----
>  xen/drivers/vpci/vpci.c       | 28 ++++++++++++++++---
>  xen/include/xen/sched.h       |  3 +-
>  11 files changed, 144 insertions(+), 42 deletions(-)
> 
> diff --git a/xen/arch/x86/hvm/vmsi.c b/xen/arch/x86/hvm/vmsi.c
> index 128f23636279..03caf91beefc 100644
> --- a/xen/arch/x86/hvm/vmsi.c
> +++ b/xen/arch/x86/hvm/vmsi.c
> @@ -468,7 +468,7 @@ int msixtbl_pt_register(struct domain *d, struct pirq *pirq, uint64_t gtable)
>      struct msixtbl_entry *entry, *new_entry;
>      int r = -EINVAL;
>  
> -    ASSERT(pcidevs_locked());
> +    ASSERT(pcidevs_locked() || rw_is_locked(&d->pci_lock));
>      ASSERT(rw_is_write_locked(&d->event_lock));
>  
>      if ( !msixtbl_initialised(d) )
> @@ -538,7 +538,7 @@ void msixtbl_pt_unregister(struct domain *d, struct pirq *pirq)
>      struct pci_dev *pdev;
>      struct msixtbl_entry *entry;
>  
> -    ASSERT(pcidevs_locked());
> +    ASSERT(pcidevs_locked() || rw_is_locked(&d->pci_lock));
>      ASSERT(rw_is_write_locked(&d->event_lock));
>  
>      if ( !msixtbl_initialised(d) )
> @@ -684,7 +684,7 @@ static int vpci_msi_update(const struct pci_dev *pdev, uint32_t data,
>  {
>      unsigned int i;
>  
> -    ASSERT(pcidevs_locked());
> +    ASSERT(rw_is_locked(&pdev->domain->pci_lock));
>  
>      if ( (address & MSI_ADDR_BASE_MASK) != MSI_ADDR_HEADER )
>      {
> @@ -725,8 +725,8 @@ void vpci_msi_arch_update(struct vpci_msi *msi, const struct pci_dev *pdev)
>      int rc;
>  
>      ASSERT(msi->arch.pirq != INVALID_PIRQ);
> +    ASSERT(rw_is_locked(&pdev->domain->pci_lock));
>  
> -    pcidevs_lock();
>      for ( i = 0; i < msi->vectors && msi->arch.bound; i++ )
>      {
>          struct xen_domctl_bind_pt_irq unbind = {
> @@ -745,7 +745,6 @@ void vpci_msi_arch_update(struct vpci_msi *msi, const struct pci_dev *pdev)
>  
>      msi->arch.bound = !vpci_msi_update(pdev, msi->data, msi->address,
>                                         msi->vectors, msi->arch.pirq, msi->mask);
> -    pcidevs_unlock();
>  }
>  
>  static int vpci_msi_enable(const struct pci_dev *pdev, unsigned int nr,
> @@ -778,15 +777,14 @@ int vpci_msi_arch_enable(struct vpci_msi *msi, const struct pci_dev *pdev,
>      int rc;
>  
>      ASSERT(msi->arch.pirq == INVALID_PIRQ);
> +    ASSERT(rw_is_locked(&pdev->domain->pci_lock));
>      rc = vpci_msi_enable(pdev, vectors, 0);
>      if ( rc < 0 )
>          return rc;
>      msi->arch.pirq = rc;
>  
> -    pcidevs_lock();
>      msi->arch.bound = !vpci_msi_update(pdev, msi->data, msi->address, vectors,
>                                         msi->arch.pirq, msi->mask);
> -    pcidevs_unlock();
>  
>      return 0;
>  }
> @@ -797,8 +795,8 @@ static void vpci_msi_disable(const struct pci_dev *pdev, int pirq,
>      unsigned int i;
>  
>      ASSERT(pirq != INVALID_PIRQ);
> +    ASSERT(rw_is_locked(&pdev->domain->pci_lock));
>  
> -    pcidevs_lock();
>      for ( i = 0; i < nr && bound; i++ )
>      {
>          struct xen_domctl_bind_pt_irq bind = {
> @@ -814,7 +812,6 @@ static void vpci_msi_disable(const struct pci_dev *pdev, int pirq,
>      write_lock(&pdev->domain->event_lock);
>      unmap_domain_pirq(pdev->domain, pirq);
>      write_unlock(&pdev->domain->event_lock);
> -    pcidevs_unlock();
>  }
>  
>  void vpci_msi_arch_disable(struct vpci_msi *msi, const struct pci_dev *pdev)
> @@ -854,6 +851,7 @@ int vpci_msix_arch_enable_entry(struct vpci_msix_entry *entry,
>      int rc;
>  
>      ASSERT(entry->arch.pirq == INVALID_PIRQ);
> +    ASSERT(rw_is_locked(&pdev->domain->pci_lock));
>      rc = vpci_msi_enable(pdev, vmsix_entry_nr(pdev->vpci->msix, entry),
>                           table_base);
>      if ( rc < 0 )
> @@ -861,7 +859,6 @@ int vpci_msix_arch_enable_entry(struct vpci_msix_entry *entry,
>  
>      entry->arch.pirq = rc;
>  
> -    pcidevs_lock();
>      rc = vpci_msi_update(pdev, entry->data, entry->addr, 1, entry->arch.pirq,
>                           entry->masked);
>      if ( rc )
> @@ -869,7 +866,6 @@ int vpci_msix_arch_enable_entry(struct vpci_msix_entry *entry,
>          vpci_msi_disable(pdev, entry->arch.pirq, 1, false);
>          entry->arch.pirq = INVALID_PIRQ;
>      }
> -    pcidevs_unlock();
>  
>      return rc;
>  }
> @@ -895,6 +891,8 @@ int vpci_msix_arch_print(const struct vpci_msix *msix)
>  {
>      unsigned int i;
>  
> +    ASSERT(rw_is_locked(&msix->pdev->domain->pci_lock));
> +
>      for ( i = 0; i < msix->max_entries; i++ )
>      {
>          const struct vpci_msix_entry *entry = &msix->entries[i];
> @@ -913,7 +911,9 @@ int vpci_msix_arch_print(const struct vpci_msix *msix)
>              struct pci_dev *pdev = msix->pdev;
>  
>              spin_unlock(&msix->pdev->vpci->lock);
> +            read_unlock(&pdev->domain->pci_lock);
>              process_pending_softirqs();
> +            read_lock(&pdev->domain->pci_lock);

This should be a trylock, like it's on the caller (vpci_dump_msi()).

>              /* NB: we assume that pdev cannot go away for an alive domain. */
>              if ( !pdev->vpci || !spin_trylock(&pdev->vpci->lock) )
>                  return -EBUSY;
> diff --git a/xen/arch/x86/hvm/vmx/vmx.c b/xen/arch/x86/hvm/vmx/vmx.c
> index 8ff675883c2b..890faef48b82 100644
> --- a/xen/arch/x86/hvm/vmx/vmx.c
> +++ b/xen/arch/x86/hvm/vmx/vmx.c
> @@ -413,7 +413,7 @@ static int cf_check vmx_pi_update_irte(const struct vcpu *v,
>  
>      spin_unlock_irq(&desc->lock);
>  
> -    ASSERT(pcidevs_locked());
> +    ASSERT(pcidevs_locked() || rw_is_locked(&msi_desc->dev->domain->pci_lock));
>  
>      return iommu_update_ire_from_msi(msi_desc, &msi_desc->msg);
>  
> diff --git a/xen/arch/x86/irq.c b/xen/arch/x86/irq.c
> index 50e49e1a4b6f..4d89d9af99fe 100644
> --- a/xen/arch/x86/irq.c
> +++ b/xen/arch/x86/irq.c
> @@ -2166,7 +2166,7 @@ int map_domain_pirq(
>          struct pci_dev *pdev;
>          unsigned int nr = 0;
>  
> -        ASSERT(pcidevs_locked());
> +        ASSERT(pcidevs_locked() || rw_is_locked(&d->pci_lock));
>  
>          ret = -ENODEV;
>          if ( !cpu_has_apic )
> @@ -2323,7 +2323,7 @@ int unmap_domain_pirq(struct domain *d, int pirq)
>      if ( (pirq < 0) || (pirq >= d->nr_pirqs) )
>          return -EINVAL;
>  
> -    ASSERT(pcidevs_locked());
> +    ASSERT(pcidevs_locked() || rw_is_locked(&d->pci_lock));
>      ASSERT(rw_is_write_locked(&d->event_lock));
>  
>      info = pirq_info(d, pirq);
> @@ -2888,6 +2888,8 @@ int allocate_and_map_msi_pirq(struct domain *d, int index, int *pirq_p,
>  {
>      int irq, pirq, ret;
>  
> +    ASSERT(pcidevs_locked() || rw_is_locked(&d->pci_lock));
> +
>      switch ( type )
>      {
>      case MAP_PIRQ_TYPE_MSI:
> @@ -2917,7 +2919,6 @@ int allocate_and_map_msi_pirq(struct domain *d, int index, int *pirq_p,
>  
>      msi->irq = irq;
>  
> -    pcidevs_lock();
>      /* Verify or get pirq. */
>      write_lock(&d->event_lock);
>      pirq = allocate_pirq(d, index, *pirq_p, irq, type, &msi->entry_nr);
> @@ -2933,7 +2934,6 @@ int allocate_and_map_msi_pirq(struct domain *d, int index, int *pirq_p,
>  
>   done:
>      write_unlock(&d->event_lock);
> -    pcidevs_unlock();
>      if ( ret )
>      {
>          switch ( type )
> diff --git a/xen/arch/x86/msi.c b/xen/arch/x86/msi.c
> index 335c0868a225..7da2affa2e82 100644
> --- a/xen/arch/x86/msi.c
> +++ b/xen/arch/x86/msi.c
> @@ -602,7 +602,7 @@ static int msi_capability_init(struct pci_dev *dev,
>      unsigned int i, mpos;
>      uint16_t control;
>  
> -    ASSERT(pcidevs_locked());
> +    ASSERT(pcidevs_locked() || rw_is_locked(&dev->domain->pci_lock));
>      pos = pci_find_cap_offset(dev->sbdf, PCI_CAP_ID_MSI);
>      if ( !pos )
>          return -ENODEV;
> @@ -771,7 +771,7 @@ static int msix_capability_init(struct pci_dev *dev,
>      if ( !pos )
>          return -ENODEV;
>  
> -    ASSERT(pcidevs_locked());
> +    ASSERT(pcidevs_locked() || rw_is_locked(&dev->domain->pci_lock));
>  
>      control = pci_conf_read16(dev->sbdf, msix_control_reg(pos));
>      /*
> @@ -988,11 +988,11 @@ static int __pci_enable_msi(struct pci_dev *pdev, struct msi_info *msi,
>  {
>      struct msi_desc *old_desc;
>  
> -    ASSERT(pcidevs_locked());
> -
>      if ( !pdev )
>          return -ENODEV;
>  
> +    ASSERT(pcidevs_locked() || rw_is_locked(&pdev->domain->pci_lock));
> +
>      old_desc = find_msi_entry(pdev, msi->irq, PCI_CAP_ID_MSI);
>      if ( old_desc )
>      {
> @@ -1043,11 +1043,11 @@ static int __pci_enable_msix(struct pci_dev *pdev, struct msi_info *msi,
>  {
>      struct msi_desc *old_desc;
>  
> -    ASSERT(pcidevs_locked());
> -
>      if ( !pdev || !pdev->msix )
>          return -ENODEV;
>  
> +    ASSERT(pcidevs_locked() || rw_is_locked(&pdev->domain->pci_lock));
> +
>      if ( msi->entry_nr >= pdev->msix->nr_entries )
>          return -EINVAL;
>  
> @@ -1154,7 +1154,7 @@ int pci_prepare_msix(u16 seg, u8 bus, u8 devfn, bool off)
>  int pci_enable_msi(struct pci_dev *pdev, struct msi_info *msi,
>                     struct msi_desc **desc)
>  {
> -    ASSERT(pcidevs_locked());
> +    ASSERT(pcidevs_locked() || rw_is_locked(&pdev->domain->pci_lock));
>  
>      if ( !use_msi )
>          return -EPERM;
> diff --git a/xen/arch/x86/physdev.c b/xen/arch/x86/physdev.c
> index 47c4da0af7e1..369c9e788c1c 100644
> --- a/xen/arch/x86/physdev.c
> +++ b/xen/arch/x86/physdev.c
> @@ -123,7 +123,9 @@ int physdev_map_pirq(domid_t domid, int type, int *index, int *pirq_p,
>  
>      case MAP_PIRQ_TYPE_MSI:
>      case MAP_PIRQ_TYPE_MULTI_MSI:
> +        pcidevs_lock();
>          ret = allocate_and_map_msi_pirq(d, *index, pirq_p, type, msi);
> +        pcidevs_unlock();
>          break;
>  
>      default:
> diff --git a/xen/drivers/passthrough/pci.c b/xen/drivers/passthrough/pci.c
> index 1439d1ef2b26..3a973324bca1 100644
> --- a/xen/drivers/passthrough/pci.c
> +++ b/xen/drivers/passthrough/pci.c
> @@ -750,7 +750,6 @@ int pci_add_device(u16 seg, u8 bus, u8 devfn,
>          pdev->domain = hardware_domain;
>          write_lock(&hardware_domain->pci_lock);
>          list_add(&pdev->domain_list, &hardware_domain->pdev_list);
> -        write_unlock(&hardware_domain->pci_lock);
>  
>          /*
>           * For devices not discovered by Xen during boot, add vPCI handlers
> @@ -759,18 +758,18 @@ int pci_add_device(u16 seg, u8 bus, u8 devfn,
>          ret = vpci_add_handlers(pdev);
>          if ( ret )
>          {
> -            printk(XENLOG_ERR "Setup of vPCI failed: %d\n", ret);
> -            write_lock(&hardware_domain->pci_lock);
>              list_del(&pdev->domain_list);
>              write_unlock(&hardware_domain->pci_lock);
>              pdev->domain = NULL;
> +            printk(XENLOG_ERR "Setup of vPCI failed: %d\n", ret);
>              goto out;
>          }
> +        write_unlock(&hardware_domain->pci_lock);
>          ret = iommu_add_device(pdev);
>          if ( ret )
>          {
> -            vpci_remove_device(pdev);
>              write_lock(&hardware_domain->pci_lock);
> +            vpci_remove_device(pdev);
>              list_del(&pdev->domain_list);
>              write_unlock(&hardware_domain->pci_lock);
>              pdev->domain = NULL;
> @@ -1146,7 +1145,9 @@ static void __hwdom_init setup_one_hwdom_device(const struct setup_hwdom *ctxt,
>      } while ( devfn != pdev->devfn &&
>                PCI_SLOT(devfn) == PCI_SLOT(pdev->devfn) );
>  
> +    write_lock(&ctxt->d->pci_lock);
>      err = vpci_add_handlers(pdev);
> +    write_unlock(&ctxt->d->pci_lock);
>      if ( err )
>          printk(XENLOG_ERR "setup of vPCI for d%d failed: %d\n",
>                 ctxt->d->domain_id, err);
> diff --git a/xen/drivers/vpci/header.c b/xen/drivers/vpci/header.c
> index 58195549d50a..8f5850b8cf6d 100644
> --- a/xen/drivers/vpci/header.c
> +++ b/xen/drivers/vpci/header.c
> @@ -173,6 +173,7 @@ bool vpci_process_pending(struct vcpu *v)
>          if ( rc == -ERESTART )
>              return true;
>  
> +        write_lock(&v->domain->pci_lock);
>          spin_lock(&v->vpci.pdev->vpci->lock);
>          /* Disable memory decoding unconditionally on failure. */
>          modify_decoding(v->vpci.pdev,
> @@ -191,6 +192,7 @@ bool vpci_process_pending(struct vcpu *v)
>               * failure.
>               */
>              vpci_remove_device(v->vpci.pdev);
> +        write_unlock(&v->domain->pci_lock);
>      }
>  
>      return false;
> @@ -202,8 +204,20 @@ static int __init apply_map(struct domain *d, const struct pci_dev *pdev,
>      struct map_data data = { .d = d, .map = true };
>      int rc;
>  
> +    ASSERT(rw_is_write_locked(&d->pci_lock));
> +
>      while ( (rc = rangeset_consume_ranges(mem, map_range, &data)) == -ERESTART )
> +    {
> +        /*
> +         * It's safe to drop and reacquire the lock in this context
> +         * without risking pdev disappearing because devices cannot be
> +         * removed until the initial domain has been started.
> +         */
> +        write_unlock(&d->pci_lock);
>          process_pending_softirqs();
> +        write_lock(&d->pci_lock);

Hm, I should have noticed before, but we already call
process_pending_softirqs() with the pdev->vpci->lock held here, so it
would make sense to drop it also.

Fix should be in a separate patch.

> +    }
> +
>      rangeset_destroy(mem);
>      if ( !rc )
>          modify_decoding(pdev, cmd, false);
> @@ -244,6 +258,8 @@ static int modify_bars(const struct pci_dev *pdev, uint16_t cmd, bool rom_only)
>      unsigned int i;
>      int rc;
>  
> +    ASSERT(rw_is_write_locked(&pdev->domain->pci_lock));
> +
>      if ( !mem )
>          return -ENOMEM;
>  
> @@ -524,6 +540,8 @@ static int cf_check init_header(struct pci_dev *pdev)
>      int rc;
>      bool mask_cap_list = false;
>  
> +    ASSERT(rw_is_write_locked(&pdev->domain->pci_lock));
> +
>      switch ( pci_conf_read8(pdev->sbdf, PCI_HEADER_TYPE) & 0x7f )
>      {
>      case PCI_HEADER_TYPE_NORMAL:
> diff --git a/xen/drivers/vpci/msi.c b/xen/drivers/vpci/msi.c
> index a253ccbd7db7..6ff71e5f9ab7 100644
> --- a/xen/drivers/vpci/msi.c
> +++ b/xen/drivers/vpci/msi.c
> @@ -263,7 +263,7 @@ REGISTER_VPCI_INIT(init_msi, VPCI_PRIORITY_LOW);
>  
>  void vpci_dump_msi(void)
>  {
> -    const struct domain *d;
> +    struct domain *d;
>  
>      rcu_read_lock(&domlist_read_lock);
>      for_each_domain ( d )
> @@ -275,6 +275,9 @@ void vpci_dump_msi(void)
>  
>          printk("vPCI MSI/MSI-X d%d\n", d->domain_id);
>  
> +        if ( !read_trylock(&d->pci_lock) )
> +            continue;
> +
>          for_each_pdev ( d, pdev )
>          {
>              const struct vpci_msi *msi;
> @@ -316,14 +319,33 @@ void vpci_dump_msi(void)
>                       * holding the lock.
>                       */
>                      printk("unable to print all MSI-X entries: %d\n", rc);
> -                    process_pending_softirqs();
> -                    continue;
> +                    goto pdev_done;
>                  }
>              }
>  
>              spin_unlock(&pdev->vpci->lock);
> +        pdev_done:
> +            /*
> +             * Unlock lock to process pending softirqs. This is
> +             * potentially unsafe, as d->pdev_list can be changed in
> +             * meantime.
> +             */
> +            read_unlock(&d->pci_lock);
>              process_pending_softirqs();
> +            if ( !read_trylock(&d->pci_lock) )
> +            {
> +                printk("unable to access other devices for the domain\n");
> +                goto domain_done;
> +            }
>          }
> +        read_unlock(&d->pci_lock);
> +    domain_done:
> +        /*
> +         * We need this label at the end of the loop, but some
> +         * compilers might not be happy about label at the end of the
> +         * compound statement so we adding an empty statement here.
> +         */
> +        ;
>      }
>      rcu_read_unlock(&domlist_read_lock);
>  }
> diff --git a/xen/drivers/vpci/msix.c b/xen/drivers/vpci/msix.c
> index d1126a417da9..b6abab47efdd 100644
> --- a/xen/drivers/vpci/msix.c
> +++ b/xen/drivers/vpci/msix.c
> @@ -147,6 +147,8 @@ static struct vpci_msix *msix_find(const struct domain *d, unsigned long addr)
>  {
>      struct vpci_msix *msix;
>  
> +    ASSERT(rw_is_locked(&d->pci_lock));
> +
>      list_for_each_entry ( msix, &d->arch.hvm.msix_tables, next )
>      {
>          const struct vpci_bar *bars = msix->pdev->vpci->header.bars;
> @@ -163,7 +165,13 @@ static struct vpci_msix *msix_find(const struct domain *d, unsigned long addr)
>  
>  static int cf_check msix_accept(struct vcpu *v, unsigned long addr)
>  {
> -    return !!msix_find(v->domain, addr);
> +    int rc;
> +
> +    read_lock(&v->domain->pci_lock);
> +    rc = !!msix_find(v->domain, addr);
> +    read_unlock(&v->domain->pci_lock);
> +
> +    return rc;
>  }
>  
>  static bool access_allowed(const struct pci_dev *pdev, unsigned long addr,
> @@ -358,21 +366,35 @@ static int adjacent_read(const struct domain *d, const struct vpci_msix *msix,
>  static int cf_check msix_read(
>      struct vcpu *v, unsigned long addr, unsigned int len, unsigned long *data)
>  {
> -    const struct domain *d = v->domain;
> -    struct vpci_msix *msix = msix_find(d, addr);
> +    struct domain *d = v->domain;
> +    struct vpci_msix *msix;
>      const struct vpci_msix_entry *entry;
>      unsigned int offset;
>  
>      *data = ~0UL;
>  
> +    read_lock(&d->pci_lock);
> +
> +    msix = msix_find(d, addr);
>      if ( !msix )
> +    {
> +        read_unlock(&d->pci_lock);
>          return X86EMUL_RETRY;
> +    }
>  
>      if ( adjacent_handle(msix, addr) )
> -        return adjacent_read(d, msix, addr, len, data);
> +    {
> +        int rc = adjacent_read(d, msix, addr, len, data);
> +
> +        read_unlock(&d->pci_lock);
> +        return rc;
> +    }
>  
>      if ( !access_allowed(msix->pdev, addr, len) )
> +    {
> +        read_unlock(&d->pci_lock);
>          return X86EMUL_OKAY;
> +    }
>  
>      spin_lock(&msix->pdev->vpci->lock);
>      entry = get_entry(msix, addr);
> @@ -404,6 +426,7 @@ static int cf_check msix_read(
>          break;
>      }
>      spin_unlock(&msix->pdev->vpci->lock);
> +    read_unlock(&d->pci_lock);
>  
>      return X86EMUL_OKAY;
>  }
> @@ -491,19 +514,33 @@ static int adjacent_write(const struct domain *d, const struct vpci_msix *msix,
>  static int cf_check msix_write(
>      struct vcpu *v, unsigned long addr, unsigned int len, unsigned long data)
>  {
> -    const struct domain *d = v->domain;
> -    struct vpci_msix *msix = msix_find(d, addr);
> +    struct domain *d = v->domain;
> +    struct vpci_msix *msix;
>      struct vpci_msix_entry *entry;
>      unsigned int offset;
>  
> +    read_lock(&d->pci_lock);
> +
> +    msix = msix_find(d, addr);
>      if ( !msix )
> +    {
> +        read_unlock(&d->pci_lock);
>          return X86EMUL_RETRY;
> +    }
>  
>      if ( adjacent_handle(msix, addr) )
> -        return adjacent_write(d, msix, addr, len, data);
> +    {
> +        int rc = adjacent_write(d, msix, addr, len, data);
> +
> +        read_unlock(&d->pci_lock);
> +        return rc;
> +    }
>  
>      if ( !access_allowed(msix->pdev, addr, len) )
> +    {
> +        read_unlock(&d->pci_lock);
>          return X86EMUL_OKAY;
> +    }
>  
>      spin_lock(&msix->pdev->vpci->lock);
>      entry = get_entry(msix, addr);
> @@ -579,6 +616,7 @@ static int cf_check msix_write(
>          break;
>      }
>      spin_unlock(&msix->pdev->vpci->lock);
> +    read_unlock(&d->pci_lock);
>  
>      return X86EMUL_OKAY;
>  }
> diff --git a/xen/drivers/vpci/vpci.c b/xen/drivers/vpci/vpci.c
> index 72ef277c4f8e..a1a004460491 100644
> --- a/xen/drivers/vpci/vpci.c
> +++ b/xen/drivers/vpci/vpci.c
> @@ -42,6 +42,8 @@ extern vpci_register_init_t *const __end_vpci_array[];
>  
>  void vpci_remove_device(struct pci_dev *pdev)
>  {
> +    ASSERT(rw_is_write_locked(&pdev->domain->pci_lock));
> +
>      if ( !has_vpci(pdev->domain) || !pdev->vpci )
>          return;
>  
> @@ -77,6 +79,8 @@ int vpci_add_handlers(struct pci_dev *pdev)
>      const unsigned long *ro_map;
>      int rc = 0;
>  
> +    ASSERT(rw_is_write_locked(&pdev->domain->pci_lock));
> +
>      if ( !has_vpci(pdev->domain) )
>          return 0;
>  
> @@ -361,7 +365,7 @@ static uint32_t merge_result(uint32_t data, uint32_t new, unsigned int size,
>  
>  uint32_t vpci_read(pci_sbdf_t sbdf, unsigned int reg, unsigned int size)
>  {
> -    const struct domain *d = current->domain;
> +    struct domain *d = current->domain;
>      const struct pci_dev *pdev;
>      const struct vpci_register *r;
>      unsigned int data_offset = 0;
> @@ -375,13 +379,19 @@ uint32_t vpci_read(pci_sbdf_t sbdf, unsigned int reg, unsigned int size)
>  
>      /*
>       * Find the PCI dev matching the address, which for hwdom also requires
> -     * consulting DomXEN.  Passthrough everything that's not trapped.
> +     * consulting DomXEN. Passthrough everything that's not trapped.

The double space in the comment above was intentional.

> +     * If this is hwdom and the device is assigned to dom_xen, acquiring hwdom's

For consistency with the above sentence, could you please use DomXEN
here?

> +     * pci_lock is sufficient.
>       */
> +    read_lock(&d->pci_lock);
>      pdev = pci_get_pdev(d, sbdf);
>      if ( !pdev && is_hardware_domain(d) )
>          pdev = pci_get_pdev(dom_xen, sbdf);
>      if ( !pdev || !pdev->vpci )
> +    {
> +        read_unlock(&d->pci_lock);
>          return vpci_read_hw(sbdf, reg, size);
> +    }
>  
>      spin_lock(&pdev->vpci->lock);
>  
> @@ -428,6 +438,7 @@ uint32_t vpci_read(pci_sbdf_t sbdf, unsigned int reg, unsigned int size)
>          ASSERT(data_offset < size);
>      }
>      spin_unlock(&pdev->vpci->lock);
> +    read_unlock(&d->pci_lock);
>  
>      if ( data_offset < size )
>      {
> @@ -470,7 +481,7 @@ static void vpci_write_helper(const struct pci_dev *pdev,
>  void vpci_write(pci_sbdf_t sbdf, unsigned int reg, unsigned int size,
>                  uint32_t data)
>  {
> -    const struct domain *d = current->domain;
> +    struct domain *d = current->domain;
>      const struct pci_dev *pdev;
>      const struct vpci_register *r;
>      unsigned int data_offset = 0;
> @@ -483,8 +494,14 @@ void vpci_write(pci_sbdf_t sbdf, unsigned int reg, unsigned int size,
>  
>      /*
>       * Find the PCI dev matching the address, which for hwdom also requires
> -     * consulting DomXEN.  Passthrough everything that's not trapped.
> +     * consulting DomXEN. Passthrough everything that's not trapped.

Same here re the double space removal and the usage of dom_xen below.

Thanks, Roger.


^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [PATCH v12 11/15] vpci: add initial support for virtual PCI bus topology
  2024-01-12 11:46   ` George Dunlap
@ 2024-01-12 13:50     ` Stewart Hildebrand
  2024-01-15 11:48       ` George Dunlap
  0 siblings, 1 reply; 68+ messages in thread
From: Stewart Hildebrand @ 2024-01-12 13:50 UTC (permalink / raw)
  To: George Dunlap
  Cc: xen-devel, Oleksandr Andrushchenko, Andrew Cooper, Jan Beulich,
	Julien Grall, Stefano Stabellini, Wei Liu, Roger Pau Monné,
	Volodymyr Babchuk

On 1/12/24 06:46, George Dunlap wrote:
> On Tue, Jan 9, 2024 at 9:54 PM Stewart Hildebrand
> <stewart.hildebrand@amd.com> wrote:
>> diff --git a/xen/include/xen/sched.h b/xen/include/xen/sched.h
>> index 37f5922f3206..b58a822847be 100644
>> --- a/xen/include/xen/sched.h
>> +++ b/xen/include/xen/sched.h
>> @@ -484,6 +484,14 @@ struct domain
>>       * 2. pdev->vpci->lock
>>       */
>>      rwlock_t pci_lock;
>> +#ifdef CONFIG_HAS_VPCI_GUEST_SUPPORT
>> +    /*
>> +     * The bitmap which shows which device numbers are already used by the
>> +     * virtual PCI bus topology and is used to assign a unique SBDF to the
>> +     * next passed through virtual PCI device.
>> +     */
>> +    DECLARE_BITMAP(vpci_dev_assigned_map, VPCI_MAX_VIRT_DEV);
>> +#endif
>>  #endif
> 
> Without digging through the whole series, how big do we expect this
> bitmap to be on typical systems?
> 
> If it's only going to be a handful of bytes, keeping it around for all
> guests would be OK; but it's large, it would be better as a pointer,
> since it's unused on the vast majority of guests.

Since the bitmap is an unsigned long type it will typically be 8 bytes, although only 4 bytes are actually used. VPCI_MAX_VIRT_DEV is currently fixed at 32, as we are only tracking D (not the whole SBDF) in the bitmap so far.


^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [PATCH v12 09/15] vpci/header: program p2m with guest BAR view
  2024-01-09 21:51 ` [PATCH v12 09/15] vpci/header: program p2m with guest BAR view Stewart Hildebrand
@ 2024-01-12 15:06   ` Roger Pau Monné
  2024-01-12 20:31     ` Stewart Hildebrand
  2024-01-15  9:07     ` [PATCH v12 " Jan Beulich
  2024-01-15 19:44   ` [PATCH v12.2 " Stewart Hildebrand
  2024-01-19 14:28   ` [PATCH v12.3 " Stewart Hildebrand
  2 siblings, 2 replies; 68+ messages in thread
From: Roger Pau Monné @ 2024-01-12 15:06 UTC (permalink / raw)
  To: Stewart Hildebrand
  Cc: xen-devel, Oleksandr Andrushchenko, Daniel P. Smith,
	Volodymyr Babchuk

On Tue, Jan 09, 2024 at 04:51:24PM -0500, Stewart Hildebrand wrote:
> From: Oleksandr Andrushchenko <oleksandr_andrushchenko@epam.com>
> 
> Take into account guest's BAR view and program its p2m accordingly:
> gfn is guest's view of the BAR and mfn is the physical BAR value.
> This way hardware domain sees physical BAR values and guest sees
> emulated ones.
> 
> Hardware domain continues getting the BARs identity mapped, while for
> domUs the BARs are mapped at the requested guest address without
> modifying the BAR address in the device PCI config space.
> 
> Signed-off-by: Oleksandr Andrushchenko <oleksandr_andrushchenko@epam.com>
> Signed-off-by: Volodymyr Babchuk <volodymyr_babchuk@epam.com>
> Signed-off-by: Stewart Hildebrand <stewart.hildebrand@amd.com>

Some nits and a request to add an extra assert. If you agree:

Reviewed-by: Roger Pau Monné <roger.pau@citrix.com>

> ---
> In v12:
> - Update guest_addr in rom_write()
> - Use unsigned long for start_mfn and map_mfn to reduce mfn_x() calls
> - Use existing vmsix_table_*() functions
> - Change vmsix_table_base() to use .guest_addr
> In v11:
> - Add vmsix_guest_table_addr() and vmsix_guest_table_base() functions
>   to access guest's view of the VMSIx tables.
> - Use MFN (not GFN) to check access permissions
> - Move page offset check to this patch
> - Call rangeset_remove_range() with correct parameters
> In v10:
> - Moved GFN variable definition outside the loop in map_range()
> - Updated printk error message in map_range()
> - Now BAR address is always stored in bar->guest_addr, even for
>   HW dom, this removes bunch of ugly is_hwdom() checks in modify_bars()
> - vmsix_table_base() now uses .guest_addr instead of .addr
> In v9:
> - Extended the commit message
> - Use bar->guest_addr in modify_bars
> - Extended printk error message in map_range
> - Moved map_data initialization so .bar can be initialized during declaration
> Since v5:
> - remove debug print in map_range callback
> - remove "identity" from the debug print
> Since v4:
> - moved start_{gfn|mfn} calculation into map_range
> - pass vpci_bar in the map_data instead of start_{gfn|mfn}
> - s/guest_addr/guest_reg
> Since v3:
> - updated comment (Roger)
> - removed gfn_add(map->start_gfn, rc); which is wrong
> - use v->domain instead of v->vpci.pdev->domain
> - removed odd e.g. in comment
> - s/d%d/%pd in altered code
> - use gdprintk for map/unmap logs
> Since v2:
> - improve readability for data.start_gfn and restructure ?: construct
> Since v1:
>  - s/MSI/MSI-X in comments
> ---
>  xen/drivers/vpci/header.c | 81 +++++++++++++++++++++++++++++++--------
>  xen/include/xen/vpci.h    |  3 +-
>  2 files changed, 66 insertions(+), 18 deletions(-)
> 
> diff --git a/xen/drivers/vpci/header.c b/xen/drivers/vpci/header.c
> index feccd070ddd0..f0b0b64b0929 100644
> --- a/xen/drivers/vpci/header.c
> +++ b/xen/drivers/vpci/header.c
> @@ -34,6 +34,7 @@
>  
>  struct map_data {
>      struct domain *d;
> +    const struct vpci_bar *bar;
>      bool map;
>  };
>  
> @@ -41,13 +42,24 @@ static int cf_check map_range(
>      unsigned long s, unsigned long e, void *data, unsigned long *c)
>  {
>      const struct map_data *map = data;
> +    /* Start address of the BAR as seen by the guest. */
> +    unsigned long start_gfn = PFN_DOWN(map->bar->guest_addr);
> +    /* Physical start address of the BAR. */
> +    unsigned long start_mfn = PFN_DOWN(map->bar->addr);
>      int rc;
>  
>      for ( ; ; )
>      {
>          unsigned long size = e - s + 1;
> +        /*
> +         * Ranges to be mapped don't always start at the BAR start address, as
> +         * there can be holes or partially consumed ranges. Account for the
> +         * offset of the current address from the BAR start.
> +         */
> +        unsigned long map_mfn = start_mfn + s - start_gfn;
> +        unsigned long m_end = map_mfn + size - 1;
>  
> -        if ( !iomem_access_permitted(map->d, s, e) )
> +        if ( !iomem_access_permitted(map->d, map_mfn, m_end) )
>          {
>              printk(XENLOG_G_WARNING
>                     "%pd denied access to MMIO range [%#lx, %#lx]\n",
> @@ -55,7 +67,8 @@ static int cf_check map_range(
>              return -EPERM;
>          }
>  
> -        rc = xsm_iomem_mapping(XSM_HOOK, map->d, s, e, map->map);
> +        rc = xsm_iomem_mapping(XSM_HOOK, map->d, map_mfn, m_end,
> +                               map->map);
>          if ( rc )
>          {
>              printk(XENLOG_G_WARNING
> @@ -73,8 +86,8 @@ static int cf_check map_range(
>           * - {un}map_mmio_regions doesn't support preemption.
>           */
>  
> -        rc = map->map ? map_mmio_regions(map->d, _gfn(s), size, _mfn(s))
> -                      : unmap_mmio_regions(map->d, _gfn(s), size, _mfn(s));
> +        rc = map->map ? map_mmio_regions(map->d, _gfn(s), size, _mfn(map_mfn))
> +                      : unmap_mmio_regions(map->d, _gfn(s), size, _mfn(map_mfn));
>          if ( rc == 0 )
>          {
>              *c += size;
> @@ -83,8 +96,9 @@ static int cf_check map_range(
>          if ( rc < 0 )
>          {
>              printk(XENLOG_G_WARNING
> -                   "Failed to identity %smap [%lx, %lx] for d%d: %d\n",
> -                   map->map ? "" : "un", s, e, map->d->domain_id, rc);
> +                   "Failed to %smap [%lx %lx] -> [%lx %lx] for %pd: %d\n",
> +                   map->map ? "" : "un", s, e, map_mfn,
> +                   map_mfn + size, map->d, rc);
>              break;
>          }
>          ASSERT(rc < size);
> @@ -163,10 +177,6 @@ static void modify_decoding(const struct pci_dev *pdev, uint16_t cmd,
>  bool vpci_process_pending(struct vcpu *v)
>  {
>      struct pci_dev *pdev = v->vpci.pdev;
> -    struct map_data data = {
> -        .d = v->domain,
> -        .map = v->vpci.cmd & PCI_COMMAND_MEMORY,
> -    };
>      struct vpci_header *header = NULL;
>      unsigned int i;
>  
> @@ -186,6 +196,11 @@ bool vpci_process_pending(struct vcpu *v)
>      for ( i = 0; i < ARRAY_SIZE(header->bars); i++ )
>      {
>          struct vpci_bar *bar = &header->bars[i];
> +        struct map_data data = {
> +            .d = v->domain,
> +            .map = v->vpci.cmd & PCI_COMMAND_MEMORY,
> +            .bar = bar,
> +        };
>          int rc;
>  
>          if ( rangeset_is_empty(bar->mem) )
> @@ -236,7 +251,6 @@ bool vpci_process_pending(struct vcpu *v)
>  static int __init apply_map(struct domain *d, const struct pci_dev *pdev,
>                              uint16_t cmd)
>  {
> -    struct map_data data = { .d = d, .map = true };
>      struct vpci_header *header = &pdev->vpci->header;
>      int rc = 0;
>      unsigned int i;
> @@ -246,6 +260,7 @@ static int __init apply_map(struct domain *d, const struct pci_dev *pdev,
>      for ( i = 0; i < ARRAY_SIZE(header->bars); i++ )
>      {
>          struct vpci_bar *bar = &header->bars[i];
> +        struct map_data data = { .d = d, .map = true, .bar = bar };
>  
>          if ( rangeset_is_empty(bar->mem) )
>              continue;
> @@ -311,12 +326,16 @@ static int modify_bars(const struct pci_dev *pdev, uint16_t cmd, bool rom_only)
>       * First fill the rangesets with the BAR of this device or with the ROM
>       * BAR only, depending on whether the guest is toggling the memory decode
>       * bit of the command register, or the enable bit of the ROM BAR register.
> +     *
> +     * For non-hardware domain we use guest physical addresses.
>       */
>      for ( i = 0; i < ARRAY_SIZE(header->bars); i++ )
>      {
>          struct vpci_bar *bar = &header->bars[i];
>          unsigned long start = PFN_DOWN(bar->addr);
>          unsigned long end = PFN_DOWN(bar->addr + bar->size - 1);
> +        unsigned long start_guest = PFN_DOWN(bar->guest_addr);
> +        unsigned long end_guest = PFN_DOWN(bar->guest_addr + bar->size - 1);
>  
>          if ( !bar->mem )
>              continue;
> @@ -336,11 +355,25 @@ static int modify_bars(const struct pci_dev *pdev, uint16_t cmd, bool rom_only)
>              continue;
>          }

Should we assert that the BAR rangeset is empty here? To stay on the
safe side.

>  
> -        rc = rangeset_add_range(bar->mem, start, end);
> +        /*
> +         * Make sure that the guest set address has the same page offset
> +         * as the physical address on the host or otherwise things won't work as
> +         * expected.
> +         */
> +        if ( PAGE_OFFSET(bar->guest_addr) != PAGE_OFFSET(bar->addr) )
> +        {
> +            gprintk(XENLOG_G_WARNING,
> +                    "%pp: Can't map BAR%d because of page offset mismatch: %lx vs %lx\n",
                                           ^u

Also when using the x modifier it's better to also use # to print the
0x prefix.  You can also reduce the length of the message using
s/because of/due to/ IMO:

%pp: Can't map BAR%u due to offset mismatch: %lx vs %lx

> +                    &pdev->sbdf, i, PAGE_OFFSET(bar->guest_addr),
> +                    PAGE_OFFSET(bar->addr));

Maybe worth printing the whole address?

> +            return -EINVAL;
> +        }
> +
> +        rc = rangeset_add_range(bar->mem, start_guest, end_guest);
>          if ( rc )
>          {
>              printk(XENLOG_G_WARNING "Failed to add [%lx, %lx]: %d\n",
> -                   start, end, rc);
> +                   start_guest, end_guest, rc);
>              return rc;
>          }
>  
> @@ -352,12 +385,12 @@ static int modify_bars(const struct pci_dev *pdev, uint16_t cmd, bool rom_only)
>              if ( rangeset_is_empty(prev_bar->mem) )
>                  continue;
>  
> -            rc = rangeset_remove_range(prev_bar->mem, start, end);
> +            rc = rangeset_remove_range(prev_bar->mem, start_guest, end_guest);
>              if ( rc )
>              {
>                  gprintk(XENLOG_WARNING,
>                         "%pp: failed to remove overlapping range [%lx, %lx]: %d\n",
> -                        &pdev->sbdf, start, end, rc);
> +                        &pdev->sbdf, start_guest, end_guest, rc);
>                  return rc;
>              }
>          }
> @@ -425,8 +458,8 @@ static int modify_bars(const struct pci_dev *pdev, uint16_t cmd, bool rom_only)
>              for ( i = 0; i < ARRAY_SIZE(tmp->vpci->header.bars); i++ )
>              {
>                  const struct vpci_bar *remote_bar = &tmp->vpci->header.bars[i];
> -                unsigned long start = PFN_DOWN(remote_bar->addr);
> -                unsigned long end = PFN_DOWN(remote_bar->addr +
> +                unsigned long start = PFN_DOWN(remote_bar->guest_addr);
> +                unsigned long end = PFN_DOWN(remote_bar->guest_addr +
>                                               remote_bar->size - 1);
>  
>                  if ( !remote_bar->enabled )
> @@ -513,6 +546,8 @@ static void cf_check bar_write(
>      struct vpci_bar *bar = data;
>      bool hi = false;
>  
> +    ASSERT(is_hardware_domain(pdev->domain));
> +
>      if ( bar->type == VPCI_BAR_MEM64_HI )
>      {
>          ASSERT(reg > PCI_BASE_ADDRESS_0);
> @@ -543,6 +578,10 @@ static void cf_check bar_write(
>       */
>      bar->addr &= ~(0xffffffffULL << (hi ? 32 : 0));
>      bar->addr |= (uint64_t)val << (hi ? 32 : 0);
> +    /*
> +     * Update guest address as well, so hardware domain sees BAR identity mapped
> +     */

Can you drop the 'as well' and make this a single line comment?

Otherwise maybe reword to:

Update guest address, so hardware domain BAR is identity mapped.

Sorry, I find it wasteful to have the opening and closing comment
delimiters in separate lines for single line comments.

Thanks, Roger.


^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [PATCH v12 01/15] vpci: use per-domain PCI lock to protect vpci structure
  2024-01-12 13:48   ` Roger Pau Monné
@ 2024-01-12 17:54     ` Stewart Hildebrand
  2024-01-12 18:14       ` [PATCH v12.1 " Stewart Hildebrand
  2024-01-15  8:53       ` [PATCH v12 " Roger Pau Monné
  0 siblings, 2 replies; 68+ messages in thread
From: Stewart Hildebrand @ 2024-01-12 17:54 UTC (permalink / raw)
  To: Roger Pau Monné
  Cc: xen-devel, Oleksandr Andrushchenko, Jan Beulich, Andrew Cooper,
	Wei Liu, George Dunlap, Julien Grall, Stefano Stabellini,
	Jun Nakajima, Kevin Tian, Paul Durrant, Volodymyr Babchuk

On 1/12/24 08:48, Roger Pau Monné wrote:
> On Tue, Jan 09, 2024 at 04:51:16PM -0500, Stewart Hildebrand wrote:
>> From: Oleksandr Andrushchenko <oleksandr_andrushchenko@epam.com>
>>
>> Use a previously introduced per-domain read/write lock to check
>> whether vpci is present, so we are sure there are no accesses to the
>> contents of the vpci struct if not.
> 
> Nit: isn't this sentence kind of awkward?  I would word it as:
> 
> "Use the per-domain PCI read/write lock to protect the presence of the
> pci device vpci field."

I'll use your suggested wording.

> 
>> This lock can be used (and in a
>> few cases is used right away) so that vpci removal can be performed
>> while holding the lock in write mode. Previously such removal could
>> race with vpci_read for example.
>>
>> When taking both d->pci_lock and pdev->vpci->lock, they should be
>> taken in this exact order: d->pci_lock then pdev->vpci->lock to avoid
>> possible deadlock situations.
>>
>> 1. Per-domain's pci_lock is used to protect pdev->vpci structure
>> from being removed.
>>
>> 2. Writing the command register and ROM BAR register may trigger
>> modify_bars to run, which in turn may access multiple pdevs while
>> checking for the existing BAR's overlap. The overlapping check, if
>> done under the read lock, requires vpci->lock to be acquired on both
>> devices being compared, which may produce a deadlock. It is not
>> possible to upgrade read lock to write lock in such a case. So, in
>> order to prevent the deadlock, use d->pci_lock instead.
> 
> ... use d->pci_lock in write mode instead.

Will fix

> 
>>
>> All other code, which doesn't lead to pdev->vpci destruction and does
>> not access multiple pdevs at the same time, can still use a
>> combination of the read lock and pdev->vpci->lock.
>>
>> 3. Drop const qualifier where the new rwlock is used and this is
>> appropriate.
>>
>> 4. Do not call process_pending_softirqs with any locks held. For that
>> unlock prior the call and re-acquire the locks after. After
>> re-acquiring the lock there is no need to check if pdev->vpci exists:
>>  - in apply_map because of the context it is called (no race condition
>>    possible)
>>  - for MSI/MSI-X debug code because it is called at the end of
>>    pdev->vpci access and no further access to pdev->vpci is made
>>
>> 5. Use d->pci_lock around for_each_pdev and pci_get_pdev_by_domain
>> while accessing pdevs in vpci code.
>>
>> Suggested-by: Roger Pau Monné <roger.pau@citrix.com>
>> Suggested-by: Jan Beulich <jbeulich@suse.com>
>> Signed-off-by: Oleksandr Andrushchenko <oleksandr_andrushchenko@epam.com>
>> Signed-off-by: Volodymyr Babchuk <volodymyr_babchuk@epam.com>
>> Signed-off-by: Stewart Hildebrand <stewart.hildebrand@amd.com>
> 
> Just one code fix regarding the usage of read_lock (instead of the
> trylock version) in vpci_msix_arch_print().. With that and the
> comments adjusted:
> 
> Reviewed-by: Roger Pau Monné <roger.pau@citrix.com>

Thanks! I'll reply to this with a v12.1 patch so as to not unnessecarily resend the whole series.

> 
>> ---
>> Changes in v12:
>>  - s/pci_rwlock/pci_lock/ in commit message
>>  - expand comment about scope of pci_lock in sched.h
>>  - in vpci_{read,write}, if hwdom is trying to access a device assigned
>>    to dom_xen, holding hwdom->pci_lock is sufficient (no need to hold
>>    dom_xen->pci_lock)
>>  - reintroduce ASSERT in vmx_pi_update_irte()
>>  - reintroduce ASSERT in __pci_enable_msi{x}()
>>  - delete note 6. in commit message about removing ASSERTs since we have
>>    reintroduced them
>>
>> Changes in v11:
>>  - Fixed commit message regarding possible spinlocks
>>  - Removed parameter from allocate_and_map_msi_pirq(), which was added
>>  in the prev version. Now we are taking pcidevs_lock in
>>  physdev_map_pirq()
>>  - Returned ASSERT to pci_enable_msi
>>  - Fixed case when we took read lock instead of write one
>>  - Fixed label indentation
>>
>> Changes in v10:
>>  - Moved printk pas locked area
>>  - Returned back ASSERTs
>>  - Added new parameter to allocate_and_map_msi_pirq() so it knows if
>>  it should take the global pci lock
>>  - Added comment about possible improvement in vpci_write
>>  - Changed ASSERT(rw_is_locked()) to rw_is_write_locked() in
>>    appropriate places
>>  - Renamed release_domain_locks() to release_domain_write_locks()
>>  - moved domain_done label in vpci_dump_msi() to correct place
>> Changes in v9:
>>  - extended locked region to protect vpci_remove_device and
>>    vpci_add_handlers() calls
>>  - vpci_write() takes lock in the write mode to protect
>>    potential call to modify_bars()
>>  - renamed lock releasing function
>>  - removed ASSERT()s from msi code
>>  - added trylock in vpci_dump_msi
>>
>> Changes in v8:
>>  - changed d->vpci_lock to d->pci_lock
>>  - introducing d->pci_lock in a separate patch
>>  - extended locked region in vpci_process_pending
>>  - removed pcidevs_lockis vpci_dump_msi()
>>  - removed some changes as they are not needed with
>>    the new locking scheme
>>  - added handling for hwdom && dom_xen case
>> ---
>>  xen/arch/x86/hvm/vmsi.c       | 22 +++++++--------
>>  xen/arch/x86/hvm/vmx/vmx.c    |  2 +-
>>  xen/arch/x86/irq.c            |  8 +++---
>>  xen/arch/x86/msi.c            | 14 +++++-----
>>  xen/arch/x86/physdev.c        |  2 ++
>>  xen/drivers/passthrough/pci.c |  9 +++---
>>  xen/drivers/vpci/header.c     | 18 ++++++++++++
>>  xen/drivers/vpci/msi.c        | 28 +++++++++++++++++--
>>  xen/drivers/vpci/msix.c       | 52 ++++++++++++++++++++++++++++++-----
>>  xen/drivers/vpci/vpci.c       | 28 ++++++++++++++++---
>>  xen/include/xen/sched.h       |  3 +-
>>  11 files changed, 144 insertions(+), 42 deletions(-)
>>
>> diff --git a/xen/arch/x86/hvm/vmsi.c b/xen/arch/x86/hvm/vmsi.c
>> index 128f23636279..03caf91beefc 100644
>> --- a/xen/arch/x86/hvm/vmsi.c
>> +++ b/xen/arch/x86/hvm/vmsi.c
>> @@ -468,7 +468,7 @@ int msixtbl_pt_register(struct domain *d, struct pirq *pirq, uint64_t gtable)
>>      struct msixtbl_entry *entry, *new_entry;
>>      int r = -EINVAL;
>>  
>> -    ASSERT(pcidevs_locked());
>> +    ASSERT(pcidevs_locked() || rw_is_locked(&d->pci_lock));
>>      ASSERT(rw_is_write_locked(&d->event_lock));
>>  
>>      if ( !msixtbl_initialised(d) )
>> @@ -538,7 +538,7 @@ void msixtbl_pt_unregister(struct domain *d, struct pirq *pirq)
>>      struct pci_dev *pdev;
>>      struct msixtbl_entry *entry;
>>  
>> -    ASSERT(pcidevs_locked());
>> +    ASSERT(pcidevs_locked() || rw_is_locked(&d->pci_lock));
>>      ASSERT(rw_is_write_locked(&d->event_lock));
>>  
>>      if ( !msixtbl_initialised(d) )
>> @@ -684,7 +684,7 @@ static int vpci_msi_update(const struct pci_dev *pdev, uint32_t data,
>>  {
>>      unsigned int i;
>>  
>> -    ASSERT(pcidevs_locked());
>> +    ASSERT(rw_is_locked(&pdev->domain->pci_lock));
>>  
>>      if ( (address & MSI_ADDR_BASE_MASK) != MSI_ADDR_HEADER )
>>      {
>> @@ -725,8 +725,8 @@ void vpci_msi_arch_update(struct vpci_msi *msi, const struct pci_dev *pdev)
>>      int rc;
>>  
>>      ASSERT(msi->arch.pirq != INVALID_PIRQ);
>> +    ASSERT(rw_is_locked(&pdev->domain->pci_lock));
>>  
>> -    pcidevs_lock();
>>      for ( i = 0; i < msi->vectors && msi->arch.bound; i++ )
>>      {
>>          struct xen_domctl_bind_pt_irq unbind = {
>> @@ -745,7 +745,6 @@ void vpci_msi_arch_update(struct vpci_msi *msi, const struct pci_dev *pdev)
>>  
>>      msi->arch.bound = !vpci_msi_update(pdev, msi->data, msi->address,
>>                                         msi->vectors, msi->arch.pirq, msi->mask);
>> -    pcidevs_unlock();
>>  }
>>  
>>  static int vpci_msi_enable(const struct pci_dev *pdev, unsigned int nr,
>> @@ -778,15 +777,14 @@ int vpci_msi_arch_enable(struct vpci_msi *msi, const struct pci_dev *pdev,
>>      int rc;
>>  
>>      ASSERT(msi->arch.pirq == INVALID_PIRQ);
>> +    ASSERT(rw_is_locked(&pdev->domain->pci_lock));
>>      rc = vpci_msi_enable(pdev, vectors, 0);
>>      if ( rc < 0 )
>>          return rc;
>>      msi->arch.pirq = rc;
>>  
>> -    pcidevs_lock();
>>      msi->arch.bound = !vpci_msi_update(pdev, msi->data, msi->address, vectors,
>>                                         msi->arch.pirq, msi->mask);
>> -    pcidevs_unlock();
>>  
>>      return 0;
>>  }
>> @@ -797,8 +795,8 @@ static void vpci_msi_disable(const struct pci_dev *pdev, int pirq,
>>      unsigned int i;
>>  
>>      ASSERT(pirq != INVALID_PIRQ);
>> +    ASSERT(rw_is_locked(&pdev->domain->pci_lock));
>>  
>> -    pcidevs_lock();
>>      for ( i = 0; i < nr && bound; i++ )
>>      {
>>          struct xen_domctl_bind_pt_irq bind = {
>> @@ -814,7 +812,6 @@ static void vpci_msi_disable(const struct pci_dev *pdev, int pirq,
>>      write_lock(&pdev->domain->event_lock);
>>      unmap_domain_pirq(pdev->domain, pirq);
>>      write_unlock(&pdev->domain->event_lock);
>> -    pcidevs_unlock();
>>  }
>>  
>>  void vpci_msi_arch_disable(struct vpci_msi *msi, const struct pci_dev *pdev)
>> @@ -854,6 +851,7 @@ int vpci_msix_arch_enable_entry(struct vpci_msix_entry *entry,
>>      int rc;
>>  
>>      ASSERT(entry->arch.pirq == INVALID_PIRQ);
>> +    ASSERT(rw_is_locked(&pdev->domain->pci_lock));
>>      rc = vpci_msi_enable(pdev, vmsix_entry_nr(pdev->vpci->msix, entry),
>>                           table_base);
>>      if ( rc < 0 )
>> @@ -861,7 +859,6 @@ int vpci_msix_arch_enable_entry(struct vpci_msix_entry *entry,
>>  
>>      entry->arch.pirq = rc;
>>  
>> -    pcidevs_lock();
>>      rc = vpci_msi_update(pdev, entry->data, entry->addr, 1, entry->arch.pirq,
>>                           entry->masked);
>>      if ( rc )
>> @@ -869,7 +866,6 @@ int vpci_msix_arch_enable_entry(struct vpci_msix_entry *entry,
>>          vpci_msi_disable(pdev, entry->arch.pirq, 1, false);
>>          entry->arch.pirq = INVALID_PIRQ;
>>      }
>> -    pcidevs_unlock();
>>  
>>      return rc;
>>  }
>> @@ -895,6 +891,8 @@ int vpci_msix_arch_print(const struct vpci_msix *msix)
>>  {
>>      unsigned int i;
>>  
>> +    ASSERT(rw_is_locked(&msix->pdev->domain->pci_lock));
>> +
>>      for ( i = 0; i < msix->max_entries; i++ )
>>      {
>>          const struct vpci_msix_entry *entry = &msix->entries[i];
>> @@ -913,7 +911,9 @@ int vpci_msix_arch_print(const struct vpci_msix *msix)
>>              struct pci_dev *pdev = msix->pdev;
>>  
>>              spin_unlock(&msix->pdev->vpci->lock);
>> +            read_unlock(&pdev->domain->pci_lock);
>>              process_pending_softirqs();
>> +            read_lock(&pdev->domain->pci_lock);
> 
> This should be a trylock, like it's on the caller (vpci_dump_msi()).

I'll replace the read_lock with:

            if ( !read_trylock(&pdev->domain->pci_lock) )
                return -EBUSY;

> 
>>              /* NB: we assume that pdev cannot go away for an alive domain. */
>>              if ( !pdev->vpci || !spin_trylock(&pdev->vpci->lock) )
>>                  return -EBUSY;
>> diff --git a/xen/arch/x86/hvm/vmx/vmx.c b/xen/arch/x86/hvm/vmx/vmx.c
>> index 8ff675883c2b..890faef48b82 100644
>> --- a/xen/arch/x86/hvm/vmx/vmx.c
>> +++ b/xen/arch/x86/hvm/vmx/vmx.c
>> @@ -413,7 +413,7 @@ static int cf_check vmx_pi_update_irte(const struct vcpu *v,
>>  
>>      spin_unlock_irq(&desc->lock);
>>  
>> -    ASSERT(pcidevs_locked());
>> +    ASSERT(pcidevs_locked() || rw_is_locked(&msi_desc->dev->domain->pci_lock));
>>  
>>      return iommu_update_ire_from_msi(msi_desc, &msi_desc->msg);
>>  
>> diff --git a/xen/arch/x86/irq.c b/xen/arch/x86/irq.c
>> index 50e49e1a4b6f..4d89d9af99fe 100644
>> --- a/xen/arch/x86/irq.c
>> +++ b/xen/arch/x86/irq.c
>> @@ -2166,7 +2166,7 @@ int map_domain_pirq(
>>          struct pci_dev *pdev;
>>          unsigned int nr = 0;
>>  
>> -        ASSERT(pcidevs_locked());
>> +        ASSERT(pcidevs_locked() || rw_is_locked(&d->pci_lock));
>>  
>>          ret = -ENODEV;
>>          if ( !cpu_has_apic )
>> @@ -2323,7 +2323,7 @@ int unmap_domain_pirq(struct domain *d, int pirq)
>>      if ( (pirq < 0) || (pirq >= d->nr_pirqs) )
>>          return -EINVAL;
>>  
>> -    ASSERT(pcidevs_locked());
>> +    ASSERT(pcidevs_locked() || rw_is_locked(&d->pci_lock));
>>      ASSERT(rw_is_write_locked(&d->event_lock));
>>  
>>      info = pirq_info(d, pirq);
>> @@ -2888,6 +2888,8 @@ int allocate_and_map_msi_pirq(struct domain *d, int index, int *pirq_p,
>>  {
>>      int irq, pirq, ret;
>>  
>> +    ASSERT(pcidevs_locked() || rw_is_locked(&d->pci_lock));
>> +
>>      switch ( type )
>>      {
>>      case MAP_PIRQ_TYPE_MSI:
>> @@ -2917,7 +2919,6 @@ int allocate_and_map_msi_pirq(struct domain *d, int index, int *pirq_p,
>>  
>>      msi->irq = irq;
>>  
>> -    pcidevs_lock();
>>      /* Verify or get pirq. */
>>      write_lock(&d->event_lock);
>>      pirq = allocate_pirq(d, index, *pirq_p, irq, type, &msi->entry_nr);
>> @@ -2933,7 +2934,6 @@ int allocate_and_map_msi_pirq(struct domain *d, int index, int *pirq_p,
>>  
>>   done:
>>      write_unlock(&d->event_lock);
>> -    pcidevs_unlock();
>>      if ( ret )
>>      {
>>          switch ( type )
>> diff --git a/xen/arch/x86/msi.c b/xen/arch/x86/msi.c
>> index 335c0868a225..7da2affa2e82 100644
>> --- a/xen/arch/x86/msi.c
>> +++ b/xen/arch/x86/msi.c
>> @@ -602,7 +602,7 @@ static int msi_capability_init(struct pci_dev *dev,
>>      unsigned int i, mpos;
>>      uint16_t control;
>>  
>> -    ASSERT(pcidevs_locked());
>> +    ASSERT(pcidevs_locked() || rw_is_locked(&dev->domain->pci_lock));
>>      pos = pci_find_cap_offset(dev->sbdf, PCI_CAP_ID_MSI);
>>      if ( !pos )
>>          return -ENODEV;
>> @@ -771,7 +771,7 @@ static int msix_capability_init(struct pci_dev *dev,
>>      if ( !pos )
>>          return -ENODEV;
>>  
>> -    ASSERT(pcidevs_locked());
>> +    ASSERT(pcidevs_locked() || rw_is_locked(&dev->domain->pci_lock));
>>  
>>      control = pci_conf_read16(dev->sbdf, msix_control_reg(pos));
>>      /*
>> @@ -988,11 +988,11 @@ static int __pci_enable_msi(struct pci_dev *pdev, struct msi_info *msi,
>>  {
>>      struct msi_desc *old_desc;
>>  
>> -    ASSERT(pcidevs_locked());
>> -
>>      if ( !pdev )
>>          return -ENODEV;
>>  
>> +    ASSERT(pcidevs_locked() || rw_is_locked(&pdev->domain->pci_lock));
>> +
>>      old_desc = find_msi_entry(pdev, msi->irq, PCI_CAP_ID_MSI);
>>      if ( old_desc )
>>      {
>> @@ -1043,11 +1043,11 @@ static int __pci_enable_msix(struct pci_dev *pdev, struct msi_info *msi,
>>  {
>>      struct msi_desc *old_desc;
>>  
>> -    ASSERT(pcidevs_locked());
>> -
>>      if ( !pdev || !pdev->msix )
>>          return -ENODEV;
>>  
>> +    ASSERT(pcidevs_locked() || rw_is_locked(&pdev->domain->pci_lock));
>> +
>>      if ( msi->entry_nr >= pdev->msix->nr_entries )
>>          return -EINVAL;
>>  
>> @@ -1154,7 +1154,7 @@ int pci_prepare_msix(u16 seg, u8 bus, u8 devfn, bool off)
>>  int pci_enable_msi(struct pci_dev *pdev, struct msi_info *msi,
>>                     struct msi_desc **desc)
>>  {
>> -    ASSERT(pcidevs_locked());
>> +    ASSERT(pcidevs_locked() || rw_is_locked(&pdev->domain->pci_lock));
>>  
>>      if ( !use_msi )
>>          return -EPERM;
>> diff --git a/xen/arch/x86/physdev.c b/xen/arch/x86/physdev.c
>> index 47c4da0af7e1..369c9e788c1c 100644
>> --- a/xen/arch/x86/physdev.c
>> +++ b/xen/arch/x86/physdev.c
>> @@ -123,7 +123,9 @@ int physdev_map_pirq(domid_t domid, int type, int *index, int *pirq_p,
>>  
>>      case MAP_PIRQ_TYPE_MSI:
>>      case MAP_PIRQ_TYPE_MULTI_MSI:
>> +        pcidevs_lock();
>>          ret = allocate_and_map_msi_pirq(d, *index, pirq_p, type, msi);
>> +        pcidevs_unlock();
>>          break;
>>  
>>      default:
>> diff --git a/xen/drivers/passthrough/pci.c b/xen/drivers/passthrough/pci.c
>> index 1439d1ef2b26..3a973324bca1 100644
>> --- a/xen/drivers/passthrough/pci.c
>> +++ b/xen/drivers/passthrough/pci.c
>> @@ -750,7 +750,6 @@ int pci_add_device(u16 seg, u8 bus, u8 devfn,
>>          pdev->domain = hardware_domain;
>>          write_lock(&hardware_domain->pci_lock);
>>          list_add(&pdev->domain_list, &hardware_domain->pdev_list);
>> -        write_unlock(&hardware_domain->pci_lock);
>>  
>>          /*
>>           * For devices not discovered by Xen during boot, add vPCI handlers
>> @@ -759,18 +758,18 @@ int pci_add_device(u16 seg, u8 bus, u8 devfn,
>>          ret = vpci_add_handlers(pdev);
>>          if ( ret )
>>          {
>> -            printk(XENLOG_ERR "Setup of vPCI failed: %d\n", ret);
>> -            write_lock(&hardware_domain->pci_lock);
>>              list_del(&pdev->domain_list);
>>              write_unlock(&hardware_domain->pci_lock);
>>              pdev->domain = NULL;
>> +            printk(XENLOG_ERR "Setup of vPCI failed: %d\n", ret);
>>              goto out;
>>          }
>> +        write_unlock(&hardware_domain->pci_lock);
>>          ret = iommu_add_device(pdev);
>>          if ( ret )
>>          {
>> -            vpci_remove_device(pdev);
>>              write_lock(&hardware_domain->pci_lock);
>> +            vpci_remove_device(pdev);
>>              list_del(&pdev->domain_list);
>>              write_unlock(&hardware_domain->pci_lock);
>>              pdev->domain = NULL;
>> @@ -1146,7 +1145,9 @@ static void __hwdom_init setup_one_hwdom_device(const struct setup_hwdom *ctxt,
>>      } while ( devfn != pdev->devfn &&
>>                PCI_SLOT(devfn) == PCI_SLOT(pdev->devfn) );
>>  
>> +    write_lock(&ctxt->d->pci_lock);
>>      err = vpci_add_handlers(pdev);
>> +    write_unlock(&ctxt->d->pci_lock);
>>      if ( err )
>>          printk(XENLOG_ERR "setup of vPCI for d%d failed: %d\n",
>>                 ctxt->d->domain_id, err);
>> diff --git a/xen/drivers/vpci/header.c b/xen/drivers/vpci/header.c
>> index 58195549d50a..8f5850b8cf6d 100644
>> --- a/xen/drivers/vpci/header.c
>> +++ b/xen/drivers/vpci/header.c
>> @@ -173,6 +173,7 @@ bool vpci_process_pending(struct vcpu *v)
>>          if ( rc == -ERESTART )
>>              return true;
>>  
>> +        write_lock(&v->domain->pci_lock);
>>          spin_lock(&v->vpci.pdev->vpci->lock);
>>          /* Disable memory decoding unconditionally on failure. */
>>          modify_decoding(v->vpci.pdev,
>> @@ -191,6 +192,7 @@ bool vpci_process_pending(struct vcpu *v)
>>               * failure.
>>               */
>>              vpci_remove_device(v->vpci.pdev);
>> +        write_unlock(&v->domain->pci_lock);
>>      }
>>  
>>      return false;
>> @@ -202,8 +204,20 @@ static int __init apply_map(struct domain *d, const struct pci_dev *pdev,
>>      struct map_data data = { .d = d, .map = true };
>>      int rc;
>>  
>> +    ASSERT(rw_is_write_locked(&d->pci_lock));
>> +
>>      while ( (rc = rangeset_consume_ranges(mem, map_range, &data)) == -ERESTART )
>> +    {
>> +        /*
>> +         * It's safe to drop and reacquire the lock in this context
>> +         * without risking pdev disappearing because devices cannot be
>> +         * removed until the initial domain has been started.
>> +         */
>> +        write_unlock(&d->pci_lock);
>>          process_pending_softirqs();
>> +        write_lock(&d->pci_lock);
> 
> Hm, I should have noticed before, but we already call
> process_pending_softirqs() with the pdev->vpci->lock held here, so it
> would make sense to drop it also.

I don't quite understand this, maybe I'm missing something. I don't see where we acquire pdev->vpci->lock before calling process_pending_softirqs()?

Also, I tried adding

    ASSERT(!spin_is_locked(&pdev->vpci->lock));

both here in apply_map() and in vpci_process_pending(), and they haven't triggered in either dom0 or domU test cases, tested on both arm and x86.

> 
> Fix should be in a separate patch.

Agreed, I'll send v12.1 addressing the other feedback, and we can address this separately.

> 
>> +    }
>> +
>>      rangeset_destroy(mem);
>>      if ( !rc )
>>          modify_decoding(pdev, cmd, false);
>> @@ -244,6 +258,8 @@ static int modify_bars(const struct pci_dev *pdev, uint16_t cmd, bool rom_only)
>>      unsigned int i;
>>      int rc;
>>  
>> +    ASSERT(rw_is_write_locked(&pdev->domain->pci_lock));
>> +
>>      if ( !mem )
>>          return -ENOMEM;
>>  
>> @@ -524,6 +540,8 @@ static int cf_check init_header(struct pci_dev *pdev)
>>      int rc;
>>      bool mask_cap_list = false;
>>  
>> +    ASSERT(rw_is_write_locked(&pdev->domain->pci_lock));
>> +
>>      switch ( pci_conf_read8(pdev->sbdf, PCI_HEADER_TYPE) & 0x7f )
>>      {
>>      case PCI_HEADER_TYPE_NORMAL:
>> diff --git a/xen/drivers/vpci/msi.c b/xen/drivers/vpci/msi.c
>> index a253ccbd7db7..6ff71e5f9ab7 100644
>> --- a/xen/drivers/vpci/msi.c
>> +++ b/xen/drivers/vpci/msi.c
>> @@ -263,7 +263,7 @@ REGISTER_VPCI_INIT(init_msi, VPCI_PRIORITY_LOW);
>>  
>>  void vpci_dump_msi(void)
>>  {
>> -    const struct domain *d;
>> +    struct domain *d;
>>  
>>      rcu_read_lock(&domlist_read_lock);
>>      for_each_domain ( d )
>> @@ -275,6 +275,9 @@ void vpci_dump_msi(void)
>>  
>>          printk("vPCI MSI/MSI-X d%d\n", d->domain_id);
>>  
>> +        if ( !read_trylock(&d->pci_lock) )
>> +            continue;
>> +
>>          for_each_pdev ( d, pdev )
>>          {
>>              const struct vpci_msi *msi;
>> @@ -316,14 +319,33 @@ void vpci_dump_msi(void)
>>                       * holding the lock.
>>                       */
>>                      printk("unable to print all MSI-X entries: %d\n", rc);
>> -                    process_pending_softirqs();
>> -                    continue;
>> +                    goto pdev_done;
>>                  }
>>              }
>>  
>>              spin_unlock(&pdev->vpci->lock);
>> +        pdev_done:
>> +            /*
>> +             * Unlock lock to process pending softirqs. This is
>> +             * potentially unsafe, as d->pdev_list can be changed in
>> +             * meantime.
>> +             */
>> +            read_unlock(&d->pci_lock);
>>              process_pending_softirqs();
>> +            if ( !read_trylock(&d->pci_lock) )
>> +            {
>> +                printk("unable to access other devices for the domain\n");
>> +                goto domain_done;
>> +            }
>>          }
>> +        read_unlock(&d->pci_lock);
>> +    domain_done:
>> +        /*
>> +         * We need this label at the end of the loop, but some
>> +         * compilers might not be happy about label at the end of the
>> +         * compound statement so we adding an empty statement here.
>> +         */
>> +        ;
>>      }
>>      rcu_read_unlock(&domlist_read_lock);
>>  }
>> diff --git a/xen/drivers/vpci/msix.c b/xen/drivers/vpci/msix.c
>> index d1126a417da9..b6abab47efdd 100644
>> --- a/xen/drivers/vpci/msix.c
>> +++ b/xen/drivers/vpci/msix.c
>> @@ -147,6 +147,8 @@ static struct vpci_msix *msix_find(const struct domain *d, unsigned long addr)
>>  {
>>      struct vpci_msix *msix;
>>  
>> +    ASSERT(rw_is_locked(&d->pci_lock));
>> +
>>      list_for_each_entry ( msix, &d->arch.hvm.msix_tables, next )
>>      {
>>          const struct vpci_bar *bars = msix->pdev->vpci->header.bars;
>> @@ -163,7 +165,13 @@ static struct vpci_msix *msix_find(const struct domain *d, unsigned long addr)
>>  
>>  static int cf_check msix_accept(struct vcpu *v, unsigned long addr)
>>  {
>> -    return !!msix_find(v->domain, addr);
>> +    int rc;
>> +
>> +    read_lock(&v->domain->pci_lock);
>> +    rc = !!msix_find(v->domain, addr);
>> +    read_unlock(&v->domain->pci_lock);
>> +
>> +    return rc;
>>  }
>>  
>>  static bool access_allowed(const struct pci_dev *pdev, unsigned long addr,
>> @@ -358,21 +366,35 @@ static int adjacent_read(const struct domain *d, const struct vpci_msix *msix,
>>  static int cf_check msix_read(
>>      struct vcpu *v, unsigned long addr, unsigned int len, unsigned long *data)
>>  {
>> -    const struct domain *d = v->domain;
>> -    struct vpci_msix *msix = msix_find(d, addr);
>> +    struct domain *d = v->domain;
>> +    struct vpci_msix *msix;
>>      const struct vpci_msix_entry *entry;
>>      unsigned int offset;
>>  
>>      *data = ~0UL;
>>  
>> +    read_lock(&d->pci_lock);
>> +
>> +    msix = msix_find(d, addr);
>>      if ( !msix )
>> +    {
>> +        read_unlock(&d->pci_lock);
>>          return X86EMUL_RETRY;
>> +    }
>>  
>>      if ( adjacent_handle(msix, addr) )
>> -        return adjacent_read(d, msix, addr, len, data);
>> +    {
>> +        int rc = adjacent_read(d, msix, addr, len, data);
>> +
>> +        read_unlock(&d->pci_lock);
>> +        return rc;
>> +    }
>>  
>>      if ( !access_allowed(msix->pdev, addr, len) )
>> +    {
>> +        read_unlock(&d->pci_lock);
>>          return X86EMUL_OKAY;
>> +    }
>>  
>>      spin_lock(&msix->pdev->vpci->lock);
>>      entry = get_entry(msix, addr);
>> @@ -404,6 +426,7 @@ static int cf_check msix_read(
>>          break;
>>      }
>>      spin_unlock(&msix->pdev->vpci->lock);
>> +    read_unlock(&d->pci_lock);
>>  
>>      return X86EMUL_OKAY;
>>  }
>> @@ -491,19 +514,33 @@ static int adjacent_write(const struct domain *d, const struct vpci_msix *msix,
>>  static int cf_check msix_write(
>>      struct vcpu *v, unsigned long addr, unsigned int len, unsigned long data)
>>  {
>> -    const struct domain *d = v->domain;
>> -    struct vpci_msix *msix = msix_find(d, addr);
>> +    struct domain *d = v->domain;
>> +    struct vpci_msix *msix;
>>      struct vpci_msix_entry *entry;
>>      unsigned int offset;
>>  
>> +    read_lock(&d->pci_lock);
>> +
>> +    msix = msix_find(d, addr);
>>      if ( !msix )
>> +    {
>> +        read_unlock(&d->pci_lock);
>>          return X86EMUL_RETRY;
>> +    }
>>  
>>      if ( adjacent_handle(msix, addr) )
>> -        return adjacent_write(d, msix, addr, len, data);
>> +    {
>> +        int rc = adjacent_write(d, msix, addr, len, data);
>> +
>> +        read_unlock(&d->pci_lock);
>> +        return rc;
>> +    }
>>  
>>      if ( !access_allowed(msix->pdev, addr, len) )
>> +    {
>> +        read_unlock(&d->pci_lock);
>>          return X86EMUL_OKAY;
>> +    }
>>  
>>      spin_lock(&msix->pdev->vpci->lock);
>>      entry = get_entry(msix, addr);
>> @@ -579,6 +616,7 @@ static int cf_check msix_write(
>>          break;
>>      }
>>      spin_unlock(&msix->pdev->vpci->lock);
>> +    read_unlock(&d->pci_lock);
>>  
>>      return X86EMUL_OKAY;
>>  }
>> diff --git a/xen/drivers/vpci/vpci.c b/xen/drivers/vpci/vpci.c
>> index 72ef277c4f8e..a1a004460491 100644
>> --- a/xen/drivers/vpci/vpci.c
>> +++ b/xen/drivers/vpci/vpci.c
>> @@ -42,6 +42,8 @@ extern vpci_register_init_t *const __end_vpci_array[];
>>  
>>  void vpci_remove_device(struct pci_dev *pdev)
>>  {
>> +    ASSERT(rw_is_write_locked(&pdev->domain->pci_lock));
>> +
>>      if ( !has_vpci(pdev->domain) || !pdev->vpci )
>>          return;
>>  
>> @@ -77,6 +79,8 @@ int vpci_add_handlers(struct pci_dev *pdev)
>>      const unsigned long *ro_map;
>>      int rc = 0;
>>  
>> +    ASSERT(rw_is_write_locked(&pdev->domain->pci_lock));
>> +
>>      if ( !has_vpci(pdev->domain) )
>>          return 0;
>>  
>> @@ -361,7 +365,7 @@ static uint32_t merge_result(uint32_t data, uint32_t new, unsigned int size,
>>  
>>  uint32_t vpci_read(pci_sbdf_t sbdf, unsigned int reg, unsigned int size)
>>  {
>> -    const struct domain *d = current->domain;
>> +    struct domain *d = current->domain;
>>      const struct pci_dev *pdev;
>>      const struct vpci_register *r;
>>      unsigned int data_offset = 0;
>> @@ -375,13 +379,19 @@ uint32_t vpci_read(pci_sbdf_t sbdf, unsigned int reg, unsigned int size)
>>  
>>      /*
>>       * Find the PCI dev matching the address, which for hwdom also requires
>> -     * consulting DomXEN.  Passthrough everything that's not trapped.
>> +     * consulting DomXEN. Passthrough everything that's not trapped.
> 
> The double space in the comment above was intentional.

Will fix

> 
>> +     * If this is hwdom and the device is assigned to dom_xen, acquiring hwdom's
> 
> For consistency with the above sentence, could you please use DomXEN
> here?

Yes

> 
>> +     * pci_lock is sufficient.
>>       */
>> +    read_lock(&d->pci_lock);
>>      pdev = pci_get_pdev(d, sbdf);
>>      if ( !pdev && is_hardware_domain(d) )
>>          pdev = pci_get_pdev(dom_xen, sbdf);
>>      if ( !pdev || !pdev->vpci )
>> +    {
>> +        read_unlock(&d->pci_lock);
>>          return vpci_read_hw(sbdf, reg, size);
>> +    }
>>  
>>      spin_lock(&pdev->vpci->lock);
>>  
>> @@ -428,6 +438,7 @@ uint32_t vpci_read(pci_sbdf_t sbdf, unsigned int reg, unsigned int size)
>>          ASSERT(data_offset < size);
>>      }
>>      spin_unlock(&pdev->vpci->lock);
>> +    read_unlock(&d->pci_lock);
>>  
>>      if ( data_offset < size )
>>      {
>> @@ -470,7 +481,7 @@ static void vpci_write_helper(const struct pci_dev *pdev,
>>  void vpci_write(pci_sbdf_t sbdf, unsigned int reg, unsigned int size,
>>                  uint32_t data)
>>  {
>> -    const struct domain *d = current->domain;
>> +    struct domain *d = current->domain;
>>      const struct pci_dev *pdev;
>>      const struct vpci_register *r;
>>      unsigned int data_offset = 0;
>> @@ -483,8 +494,14 @@ void vpci_write(pci_sbdf_t sbdf, unsigned int reg, unsigned int size,
>>  
>>      /*
>>       * Find the PCI dev matching the address, which for hwdom also requires
>> -     * consulting DomXEN.  Passthrough everything that's not trapped.
>> +     * consulting DomXEN. Passthrough everything that's not trapped.
> 
> Same here re the double space removal and the usage of dom_xen below.

Will fix

> 
> Thanks, Roger.


^ permalink raw reply	[flat|nested] 68+ messages in thread

* [PATCH v12.1 01/15] vpci: use per-domain PCI lock to protect vpci structure
  2024-01-12 17:54     ` Stewart Hildebrand
@ 2024-01-12 18:14       ` Stewart Hildebrand
  2024-01-15  8:58         ` Jan Beulich
  2024-01-15  8:53       ` [PATCH v12 " Roger Pau Monné
  1 sibling, 1 reply; 68+ messages in thread
From: Stewart Hildebrand @ 2024-01-12 18:14 UTC (permalink / raw)
  To: xen-devel
  Cc: Oleksandr Andrushchenko, Jan Beulich, Andrew Cooper,
	Roger Pau Monné, Wei Liu, George Dunlap, Julien Grall,
	Stefano Stabellini, Jun Nakajima, Kevin Tian, Paul Durrant,
	Volodymyr Babchuk, Stewart Hildebrand

From: Oleksandr Andrushchenko <oleksandr_andrushchenko@epam.com>

Use the per-domain PCI read/write lock to protect the presence of the
pci device vpci field. This lock can be used (and in a few cases is used
right away) so that vpci removal can be performed while holding the lock
in write mode. Previously such removal could race with vpci_read for
example.

When taking both d->pci_lock and pdev->vpci->lock, they should be
taken in this exact order: d->pci_lock then pdev->vpci->lock to avoid
possible deadlock situations.

1. Per-domain's pci_lock is used to protect pdev->vpci structure
from being removed.

2. Writing the command register and ROM BAR register may trigger
modify_bars to run, which in turn may access multiple pdevs while
checking for the existing BAR's overlap. The overlapping check, if
done under the read lock, requires vpci->lock to be acquired on both
devices being compared, which may produce a deadlock. It is not
possible to upgrade read lock to write lock in such a case. So, in
order to prevent the deadlock, use d->pci_lock in write mode instead.

All other code, which doesn't lead to pdev->vpci destruction and does
not access multiple pdevs at the same time, can still use a
combination of the read lock and pdev->vpci->lock.

3. Drop const qualifier where the new rwlock is used and this is
appropriate.

4. Do not call process_pending_softirqs with any locks held. For that
unlock prior the call and re-acquire the locks after. After
re-acquiring the lock there is no need to check if pdev->vpci exists:
 - in apply_map because of the context it is called (no race condition
   possible)
 - for MSI/MSI-X debug code because it is called at the end of
   pdev->vpci access and no further access to pdev->vpci is made

5. Use d->pci_lock around for_each_pdev and pci_get_pdev_by_domain
while accessing pdevs in vpci code.

Suggested-by: Roger Pau Monné <roger.pau@citrix.com>
Suggested-by: Jan Beulich <jbeulich@suse.com>
Signed-off-by: Oleksandr Andrushchenko <oleksandr_andrushchenko@epam.com>
Signed-off-by: Volodymyr Babchuk <volodymyr_babchuk@epam.com>
Signed-off-by: Stewart Hildebrand <stewart.hildebrand@amd.com>
Reviewed-by: Roger Pau Monné <roger.pau@citrix.com>
---
Changes in v12.1:
 - use read_trylock() in vpci_msix_arch_print()
 - fixup in-code comments (revert double space, use DomXEN) in
   vpci_{read,write}()
 - minor updates in commit message
 - add Roger's R-b

Changes in v12:
 - s/pci_rwlock/pci_lock/ in commit message
 - expand comment about scope of pci_lock in sched.h
 - in vpci_{read,write}, if hwdom is trying to access a device assigned
   to dom_xen, holding hwdom->pci_lock is sufficient (no need to hold
   dom_xen->pci_lock)
 - reintroduce ASSERT in vmx_pi_update_irte()
 - reintroduce ASSERT in __pci_enable_msi{x}()
 - delete note 6. in commit message about removing ASSERTs since we have
   reintroduced them

Changes in v11:
 - Fixed commit message regarding possible spinlocks
 - Removed parameter from allocate_and_map_msi_pirq(), which was added
 in the prev version. Now we are taking pcidevs_lock in
 physdev_map_pirq()
 - Returned ASSERT to pci_enable_msi
 - Fixed case when we took read lock instead of write one
 - Fixed label indentation

Changes in v10:
 - Moved printk pas locked area
 - Returned back ASSERTs
 - Added new parameter to allocate_and_map_msi_pirq() so it knows if
 it should take the global pci lock
 - Added comment about possible improvement in vpci_write
 - Changed ASSERT(rw_is_locked()) to rw_is_write_locked() in
   appropriate places
 - Renamed release_domain_locks() to release_domain_write_locks()
 - moved domain_done label in vpci_dump_msi() to correct place
Changes in v9:
 - extended locked region to protect vpci_remove_device and
   vpci_add_handlers() calls
 - vpci_write() takes lock in the write mode to protect
   potential call to modify_bars()
 - renamed lock releasing function
 - removed ASSERT()s from msi code
 - added trylock in vpci_dump_msi

Changes in v8:
 - changed d->vpci_lock to d->pci_lock
 - introducing d->pci_lock in a separate patch
 - extended locked region in vpci_process_pending
 - removed pcidevs_lockis vpci_dump_msi()
 - removed some changes as they are not needed with
   the new locking scheme
 - added handling for hwdom && dom_xen case
---
 xen/arch/x86/hvm/vmsi.c       | 25 +++++++++--------
 xen/arch/x86/hvm/vmx/vmx.c    |  2 +-
 xen/arch/x86/irq.c            |  8 +++---
 xen/arch/x86/msi.c            | 14 +++++-----
 xen/arch/x86/physdev.c        |  2 ++
 xen/drivers/passthrough/pci.c |  9 +++---
 xen/drivers/vpci/header.c     | 18 ++++++++++++
 xen/drivers/vpci/msi.c        | 28 +++++++++++++++++--
 xen/drivers/vpci/msix.c       | 52 ++++++++++++++++++++++++++++++-----
 xen/drivers/vpci/vpci.c       | 24 ++++++++++++++--
 xen/include/xen/sched.h       |  3 +-
 11 files changed, 145 insertions(+), 40 deletions(-)

diff --git a/xen/arch/x86/hvm/vmsi.c b/xen/arch/x86/hvm/vmsi.c
index 128f23636279..4725e3b72d53 100644
--- a/xen/arch/x86/hvm/vmsi.c
+++ b/xen/arch/x86/hvm/vmsi.c
@@ -468,7 +468,7 @@ int msixtbl_pt_register(struct domain *d, struct pirq *pirq, uint64_t gtable)
     struct msixtbl_entry *entry, *new_entry;
     int r = -EINVAL;
 
-    ASSERT(pcidevs_locked());
+    ASSERT(pcidevs_locked() || rw_is_locked(&d->pci_lock));
     ASSERT(rw_is_write_locked(&d->event_lock));
 
     if ( !msixtbl_initialised(d) )
@@ -538,7 +538,7 @@ void msixtbl_pt_unregister(struct domain *d, struct pirq *pirq)
     struct pci_dev *pdev;
     struct msixtbl_entry *entry;
 
-    ASSERT(pcidevs_locked());
+    ASSERT(pcidevs_locked() || rw_is_locked(&d->pci_lock));
     ASSERT(rw_is_write_locked(&d->event_lock));
 
     if ( !msixtbl_initialised(d) )
@@ -684,7 +684,7 @@ static int vpci_msi_update(const struct pci_dev *pdev, uint32_t data,
 {
     unsigned int i;
 
-    ASSERT(pcidevs_locked());
+    ASSERT(rw_is_locked(&pdev->domain->pci_lock));
 
     if ( (address & MSI_ADDR_BASE_MASK) != MSI_ADDR_HEADER )
     {
@@ -725,8 +725,8 @@ void vpci_msi_arch_update(struct vpci_msi *msi, const struct pci_dev *pdev)
     int rc;
 
     ASSERT(msi->arch.pirq != INVALID_PIRQ);
+    ASSERT(rw_is_locked(&pdev->domain->pci_lock));
 
-    pcidevs_lock();
     for ( i = 0; i < msi->vectors && msi->arch.bound; i++ )
     {
         struct xen_domctl_bind_pt_irq unbind = {
@@ -745,7 +745,6 @@ void vpci_msi_arch_update(struct vpci_msi *msi, const struct pci_dev *pdev)
 
     msi->arch.bound = !vpci_msi_update(pdev, msi->data, msi->address,
                                        msi->vectors, msi->arch.pirq, msi->mask);
-    pcidevs_unlock();
 }
 
 static int vpci_msi_enable(const struct pci_dev *pdev, unsigned int nr,
@@ -778,15 +777,14 @@ int vpci_msi_arch_enable(struct vpci_msi *msi, const struct pci_dev *pdev,
     int rc;
 
     ASSERT(msi->arch.pirq == INVALID_PIRQ);
+    ASSERT(rw_is_locked(&pdev->domain->pci_lock));
     rc = vpci_msi_enable(pdev, vectors, 0);
     if ( rc < 0 )
         return rc;
     msi->arch.pirq = rc;
 
-    pcidevs_lock();
     msi->arch.bound = !vpci_msi_update(pdev, msi->data, msi->address, vectors,
                                        msi->arch.pirq, msi->mask);
-    pcidevs_unlock();
 
     return 0;
 }
@@ -797,8 +795,8 @@ static void vpci_msi_disable(const struct pci_dev *pdev, int pirq,
     unsigned int i;
 
     ASSERT(pirq != INVALID_PIRQ);
+    ASSERT(rw_is_locked(&pdev->domain->pci_lock));
 
-    pcidevs_lock();
     for ( i = 0; i < nr && bound; i++ )
     {
         struct xen_domctl_bind_pt_irq bind = {
@@ -814,7 +812,6 @@ static void vpci_msi_disable(const struct pci_dev *pdev, int pirq,
     write_lock(&pdev->domain->event_lock);
     unmap_domain_pirq(pdev->domain, pirq);
     write_unlock(&pdev->domain->event_lock);
-    pcidevs_unlock();
 }
 
 void vpci_msi_arch_disable(struct vpci_msi *msi, const struct pci_dev *pdev)
@@ -854,6 +851,7 @@ int vpci_msix_arch_enable_entry(struct vpci_msix_entry *entry,
     int rc;
 
     ASSERT(entry->arch.pirq == INVALID_PIRQ);
+    ASSERT(rw_is_locked(&pdev->domain->pci_lock));
     rc = vpci_msi_enable(pdev, vmsix_entry_nr(pdev->vpci->msix, entry),
                          table_base);
     if ( rc < 0 )
@@ -861,7 +859,6 @@ int vpci_msix_arch_enable_entry(struct vpci_msix_entry *entry,
 
     entry->arch.pirq = rc;
 
-    pcidevs_lock();
     rc = vpci_msi_update(pdev, entry->data, entry->addr, 1, entry->arch.pirq,
                          entry->masked);
     if ( rc )
@@ -869,7 +866,6 @@ int vpci_msix_arch_enable_entry(struct vpci_msix_entry *entry,
         vpci_msi_disable(pdev, entry->arch.pirq, 1, false);
         entry->arch.pirq = INVALID_PIRQ;
     }
-    pcidevs_unlock();
 
     return rc;
 }
@@ -895,6 +891,8 @@ int vpci_msix_arch_print(const struct vpci_msix *msix)
 {
     unsigned int i;
 
+    ASSERT(rw_is_locked(&msix->pdev->domain->pci_lock));
+
     for ( i = 0; i < msix->max_entries; i++ )
     {
         const struct vpci_msix_entry *entry = &msix->entries[i];
@@ -913,7 +911,12 @@ int vpci_msix_arch_print(const struct vpci_msix *msix)
             struct pci_dev *pdev = msix->pdev;
 
             spin_unlock(&msix->pdev->vpci->lock);
+            read_unlock(&pdev->domain->pci_lock);
             process_pending_softirqs();
+
+            if ( !read_trylock(&pdev->domain->pci_lock) )
+                return -EBUSY;
+
             /* NB: we assume that pdev cannot go away for an alive domain. */
             if ( !pdev->vpci || !spin_trylock(&pdev->vpci->lock) )
                 return -EBUSY;
diff --git a/xen/arch/x86/hvm/vmx/vmx.c b/xen/arch/x86/hvm/vmx/vmx.c
index 8ff675883c2b..890faef48b82 100644
--- a/xen/arch/x86/hvm/vmx/vmx.c
+++ b/xen/arch/x86/hvm/vmx/vmx.c
@@ -413,7 +413,7 @@ static int cf_check vmx_pi_update_irte(const struct vcpu *v,
 
     spin_unlock_irq(&desc->lock);
 
-    ASSERT(pcidevs_locked());
+    ASSERT(pcidevs_locked() || rw_is_locked(&msi_desc->dev->domain->pci_lock));
 
     return iommu_update_ire_from_msi(msi_desc, &msi_desc->msg);
 
diff --git a/xen/arch/x86/irq.c b/xen/arch/x86/irq.c
index 50e49e1a4b6f..4d89d9af99fe 100644
--- a/xen/arch/x86/irq.c
+++ b/xen/arch/x86/irq.c
@@ -2166,7 +2166,7 @@ int map_domain_pirq(
         struct pci_dev *pdev;
         unsigned int nr = 0;
 
-        ASSERT(pcidevs_locked());
+        ASSERT(pcidevs_locked() || rw_is_locked(&d->pci_lock));
 
         ret = -ENODEV;
         if ( !cpu_has_apic )
@@ -2323,7 +2323,7 @@ int unmap_domain_pirq(struct domain *d, int pirq)
     if ( (pirq < 0) || (pirq >= d->nr_pirqs) )
         return -EINVAL;
 
-    ASSERT(pcidevs_locked());
+    ASSERT(pcidevs_locked() || rw_is_locked(&d->pci_lock));
     ASSERT(rw_is_write_locked(&d->event_lock));
 
     info = pirq_info(d, pirq);
@@ -2888,6 +2888,8 @@ int allocate_and_map_msi_pirq(struct domain *d, int index, int *pirq_p,
 {
     int irq, pirq, ret;
 
+    ASSERT(pcidevs_locked() || rw_is_locked(&d->pci_lock));
+
     switch ( type )
     {
     case MAP_PIRQ_TYPE_MSI:
@@ -2917,7 +2919,6 @@ int allocate_and_map_msi_pirq(struct domain *d, int index, int *pirq_p,
 
     msi->irq = irq;
 
-    pcidevs_lock();
     /* Verify or get pirq. */
     write_lock(&d->event_lock);
     pirq = allocate_pirq(d, index, *pirq_p, irq, type, &msi->entry_nr);
@@ -2933,7 +2934,6 @@ int allocate_and_map_msi_pirq(struct domain *d, int index, int *pirq_p,
 
  done:
     write_unlock(&d->event_lock);
-    pcidevs_unlock();
     if ( ret )
     {
         switch ( type )
diff --git a/xen/arch/x86/msi.c b/xen/arch/x86/msi.c
index 335c0868a225..7da2affa2e82 100644
--- a/xen/arch/x86/msi.c
+++ b/xen/arch/x86/msi.c
@@ -602,7 +602,7 @@ static int msi_capability_init(struct pci_dev *dev,
     unsigned int i, mpos;
     uint16_t control;
 
-    ASSERT(pcidevs_locked());
+    ASSERT(pcidevs_locked() || rw_is_locked(&dev->domain->pci_lock));
     pos = pci_find_cap_offset(dev->sbdf, PCI_CAP_ID_MSI);
     if ( !pos )
         return -ENODEV;
@@ -771,7 +771,7 @@ static int msix_capability_init(struct pci_dev *dev,
     if ( !pos )
         return -ENODEV;
 
-    ASSERT(pcidevs_locked());
+    ASSERT(pcidevs_locked() || rw_is_locked(&dev->domain->pci_lock));
 
     control = pci_conf_read16(dev->sbdf, msix_control_reg(pos));
     /*
@@ -988,11 +988,11 @@ static int __pci_enable_msi(struct pci_dev *pdev, struct msi_info *msi,
 {
     struct msi_desc *old_desc;
 
-    ASSERT(pcidevs_locked());
-
     if ( !pdev )
         return -ENODEV;
 
+    ASSERT(pcidevs_locked() || rw_is_locked(&pdev->domain->pci_lock));
+
     old_desc = find_msi_entry(pdev, msi->irq, PCI_CAP_ID_MSI);
     if ( old_desc )
     {
@@ -1043,11 +1043,11 @@ static int __pci_enable_msix(struct pci_dev *pdev, struct msi_info *msi,
 {
     struct msi_desc *old_desc;
 
-    ASSERT(pcidevs_locked());
-
     if ( !pdev || !pdev->msix )
         return -ENODEV;
 
+    ASSERT(pcidevs_locked() || rw_is_locked(&pdev->domain->pci_lock));
+
     if ( msi->entry_nr >= pdev->msix->nr_entries )
         return -EINVAL;
 
@@ -1154,7 +1154,7 @@ int pci_prepare_msix(u16 seg, u8 bus, u8 devfn, bool off)
 int pci_enable_msi(struct pci_dev *pdev, struct msi_info *msi,
                    struct msi_desc **desc)
 {
-    ASSERT(pcidevs_locked());
+    ASSERT(pcidevs_locked() || rw_is_locked(&pdev->domain->pci_lock));
 
     if ( !use_msi )
         return -EPERM;
diff --git a/xen/arch/x86/physdev.c b/xen/arch/x86/physdev.c
index 47c4da0af7e1..369c9e788c1c 100644
--- a/xen/arch/x86/physdev.c
+++ b/xen/arch/x86/physdev.c
@@ -123,7 +123,9 @@ int physdev_map_pirq(domid_t domid, int type, int *index, int *pirq_p,
 
     case MAP_PIRQ_TYPE_MSI:
     case MAP_PIRQ_TYPE_MULTI_MSI:
+        pcidevs_lock();
         ret = allocate_and_map_msi_pirq(d, *index, pirq_p, type, msi);
+        pcidevs_unlock();
         break;
 
     default:
diff --git a/xen/drivers/passthrough/pci.c b/xen/drivers/passthrough/pci.c
index 1439d1ef2b26..3a973324bca1 100644
--- a/xen/drivers/passthrough/pci.c
+++ b/xen/drivers/passthrough/pci.c
@@ -750,7 +750,6 @@ int pci_add_device(u16 seg, u8 bus, u8 devfn,
         pdev->domain = hardware_domain;
         write_lock(&hardware_domain->pci_lock);
         list_add(&pdev->domain_list, &hardware_domain->pdev_list);
-        write_unlock(&hardware_domain->pci_lock);
 
         /*
          * For devices not discovered by Xen during boot, add vPCI handlers
@@ -759,18 +758,18 @@ int pci_add_device(u16 seg, u8 bus, u8 devfn,
         ret = vpci_add_handlers(pdev);
         if ( ret )
         {
-            printk(XENLOG_ERR "Setup of vPCI failed: %d\n", ret);
-            write_lock(&hardware_domain->pci_lock);
             list_del(&pdev->domain_list);
             write_unlock(&hardware_domain->pci_lock);
             pdev->domain = NULL;
+            printk(XENLOG_ERR "Setup of vPCI failed: %d\n", ret);
             goto out;
         }
+        write_unlock(&hardware_domain->pci_lock);
         ret = iommu_add_device(pdev);
         if ( ret )
         {
-            vpci_remove_device(pdev);
             write_lock(&hardware_domain->pci_lock);
+            vpci_remove_device(pdev);
             list_del(&pdev->domain_list);
             write_unlock(&hardware_domain->pci_lock);
             pdev->domain = NULL;
@@ -1146,7 +1145,9 @@ static void __hwdom_init setup_one_hwdom_device(const struct setup_hwdom *ctxt,
     } while ( devfn != pdev->devfn &&
               PCI_SLOT(devfn) == PCI_SLOT(pdev->devfn) );
 
+    write_lock(&ctxt->d->pci_lock);
     err = vpci_add_handlers(pdev);
+    write_unlock(&ctxt->d->pci_lock);
     if ( err )
         printk(XENLOG_ERR "setup of vPCI for d%d failed: %d\n",
                ctxt->d->domain_id, err);
diff --git a/xen/drivers/vpci/header.c b/xen/drivers/vpci/header.c
index 58195549d50a..8f5850b8cf6d 100644
--- a/xen/drivers/vpci/header.c
+++ b/xen/drivers/vpci/header.c
@@ -173,6 +173,7 @@ bool vpci_process_pending(struct vcpu *v)
         if ( rc == -ERESTART )
             return true;
 
+        write_lock(&v->domain->pci_lock);
         spin_lock(&v->vpci.pdev->vpci->lock);
         /* Disable memory decoding unconditionally on failure. */
         modify_decoding(v->vpci.pdev,
@@ -191,6 +192,7 @@ bool vpci_process_pending(struct vcpu *v)
              * failure.
              */
             vpci_remove_device(v->vpci.pdev);
+        write_unlock(&v->domain->pci_lock);
     }
 
     return false;
@@ -202,8 +204,20 @@ static int __init apply_map(struct domain *d, const struct pci_dev *pdev,
     struct map_data data = { .d = d, .map = true };
     int rc;
 
+    ASSERT(rw_is_write_locked(&d->pci_lock));
+
     while ( (rc = rangeset_consume_ranges(mem, map_range, &data)) == -ERESTART )
+    {
+        /*
+         * It's safe to drop and reacquire the lock in this context
+         * without risking pdev disappearing because devices cannot be
+         * removed until the initial domain has been started.
+         */
+        write_unlock(&d->pci_lock);
         process_pending_softirqs();
+        write_lock(&d->pci_lock);
+    }
+
     rangeset_destroy(mem);
     if ( !rc )
         modify_decoding(pdev, cmd, false);
@@ -244,6 +258,8 @@ static int modify_bars(const struct pci_dev *pdev, uint16_t cmd, bool rom_only)
     unsigned int i;
     int rc;
 
+    ASSERT(rw_is_write_locked(&pdev->domain->pci_lock));
+
     if ( !mem )
         return -ENOMEM;
 
@@ -524,6 +540,8 @@ static int cf_check init_header(struct pci_dev *pdev)
     int rc;
     bool mask_cap_list = false;
 
+    ASSERT(rw_is_write_locked(&pdev->domain->pci_lock));
+
     switch ( pci_conf_read8(pdev->sbdf, PCI_HEADER_TYPE) & 0x7f )
     {
     case PCI_HEADER_TYPE_NORMAL:
diff --git a/xen/drivers/vpci/msi.c b/xen/drivers/vpci/msi.c
index a253ccbd7db7..6ff71e5f9ab7 100644
--- a/xen/drivers/vpci/msi.c
+++ b/xen/drivers/vpci/msi.c
@@ -263,7 +263,7 @@ REGISTER_VPCI_INIT(init_msi, VPCI_PRIORITY_LOW);
 
 void vpci_dump_msi(void)
 {
-    const struct domain *d;
+    struct domain *d;
 
     rcu_read_lock(&domlist_read_lock);
     for_each_domain ( d )
@@ -275,6 +275,9 @@ void vpci_dump_msi(void)
 
         printk("vPCI MSI/MSI-X d%d\n", d->domain_id);
 
+        if ( !read_trylock(&d->pci_lock) )
+            continue;
+
         for_each_pdev ( d, pdev )
         {
             const struct vpci_msi *msi;
@@ -316,14 +319,33 @@ void vpci_dump_msi(void)
                      * holding the lock.
                      */
                     printk("unable to print all MSI-X entries: %d\n", rc);
-                    process_pending_softirqs();
-                    continue;
+                    goto pdev_done;
                 }
             }
 
             spin_unlock(&pdev->vpci->lock);
+        pdev_done:
+            /*
+             * Unlock lock to process pending softirqs. This is
+             * potentially unsafe, as d->pdev_list can be changed in
+             * meantime.
+             */
+            read_unlock(&d->pci_lock);
             process_pending_softirqs();
+            if ( !read_trylock(&d->pci_lock) )
+            {
+                printk("unable to access other devices for the domain\n");
+                goto domain_done;
+            }
         }
+        read_unlock(&d->pci_lock);
+    domain_done:
+        /*
+         * We need this label at the end of the loop, but some
+         * compilers might not be happy about label at the end of the
+         * compound statement so we adding an empty statement here.
+         */
+        ;
     }
     rcu_read_unlock(&domlist_read_lock);
 }
diff --git a/xen/drivers/vpci/msix.c b/xen/drivers/vpci/msix.c
index d1126a417da9..b6abab47efdd 100644
--- a/xen/drivers/vpci/msix.c
+++ b/xen/drivers/vpci/msix.c
@@ -147,6 +147,8 @@ static struct vpci_msix *msix_find(const struct domain *d, unsigned long addr)
 {
     struct vpci_msix *msix;
 
+    ASSERT(rw_is_locked(&d->pci_lock));
+
     list_for_each_entry ( msix, &d->arch.hvm.msix_tables, next )
     {
         const struct vpci_bar *bars = msix->pdev->vpci->header.bars;
@@ -163,7 +165,13 @@ static struct vpci_msix *msix_find(const struct domain *d, unsigned long addr)
 
 static int cf_check msix_accept(struct vcpu *v, unsigned long addr)
 {
-    return !!msix_find(v->domain, addr);
+    int rc;
+
+    read_lock(&v->domain->pci_lock);
+    rc = !!msix_find(v->domain, addr);
+    read_unlock(&v->domain->pci_lock);
+
+    return rc;
 }
 
 static bool access_allowed(const struct pci_dev *pdev, unsigned long addr,
@@ -358,21 +366,35 @@ static int adjacent_read(const struct domain *d, const struct vpci_msix *msix,
 static int cf_check msix_read(
     struct vcpu *v, unsigned long addr, unsigned int len, unsigned long *data)
 {
-    const struct domain *d = v->domain;
-    struct vpci_msix *msix = msix_find(d, addr);
+    struct domain *d = v->domain;
+    struct vpci_msix *msix;
     const struct vpci_msix_entry *entry;
     unsigned int offset;
 
     *data = ~0UL;
 
+    read_lock(&d->pci_lock);
+
+    msix = msix_find(d, addr);
     if ( !msix )
+    {
+        read_unlock(&d->pci_lock);
         return X86EMUL_RETRY;
+    }
 
     if ( adjacent_handle(msix, addr) )
-        return adjacent_read(d, msix, addr, len, data);
+    {
+        int rc = adjacent_read(d, msix, addr, len, data);
+
+        read_unlock(&d->pci_lock);
+        return rc;
+    }
 
     if ( !access_allowed(msix->pdev, addr, len) )
+    {
+        read_unlock(&d->pci_lock);
         return X86EMUL_OKAY;
+    }
 
     spin_lock(&msix->pdev->vpci->lock);
     entry = get_entry(msix, addr);
@@ -404,6 +426,7 @@ static int cf_check msix_read(
         break;
     }
     spin_unlock(&msix->pdev->vpci->lock);
+    read_unlock(&d->pci_lock);
 
     return X86EMUL_OKAY;
 }
@@ -491,19 +514,33 @@ static int adjacent_write(const struct domain *d, const struct vpci_msix *msix,
 static int cf_check msix_write(
     struct vcpu *v, unsigned long addr, unsigned int len, unsigned long data)
 {
-    const struct domain *d = v->domain;
-    struct vpci_msix *msix = msix_find(d, addr);
+    struct domain *d = v->domain;
+    struct vpci_msix *msix;
     struct vpci_msix_entry *entry;
     unsigned int offset;
 
+    read_lock(&d->pci_lock);
+
+    msix = msix_find(d, addr);
     if ( !msix )
+    {
+        read_unlock(&d->pci_lock);
         return X86EMUL_RETRY;
+    }
 
     if ( adjacent_handle(msix, addr) )
-        return adjacent_write(d, msix, addr, len, data);
+    {
+        int rc = adjacent_write(d, msix, addr, len, data);
+
+        read_unlock(&d->pci_lock);
+        return rc;
+    }
 
     if ( !access_allowed(msix->pdev, addr, len) )
+    {
+        read_unlock(&d->pci_lock);
         return X86EMUL_OKAY;
+    }
 
     spin_lock(&msix->pdev->vpci->lock);
     entry = get_entry(msix, addr);
@@ -579,6 +616,7 @@ static int cf_check msix_write(
         break;
     }
     spin_unlock(&msix->pdev->vpci->lock);
+    read_unlock(&d->pci_lock);
 
     return X86EMUL_OKAY;
 }
diff --git a/xen/drivers/vpci/vpci.c b/xen/drivers/vpci/vpci.c
index 72ef277c4f8e..475272b173f3 100644
--- a/xen/drivers/vpci/vpci.c
+++ b/xen/drivers/vpci/vpci.c
@@ -42,6 +42,8 @@ extern vpci_register_init_t *const __end_vpci_array[];
 
 void vpci_remove_device(struct pci_dev *pdev)
 {
+    ASSERT(rw_is_write_locked(&pdev->domain->pci_lock));
+
     if ( !has_vpci(pdev->domain) || !pdev->vpci )
         return;
 
@@ -77,6 +79,8 @@ int vpci_add_handlers(struct pci_dev *pdev)
     const unsigned long *ro_map;
     int rc = 0;
 
+    ASSERT(rw_is_write_locked(&pdev->domain->pci_lock));
+
     if ( !has_vpci(pdev->domain) )
         return 0;
 
@@ -361,7 +365,7 @@ static uint32_t merge_result(uint32_t data, uint32_t new, unsigned int size,
 
 uint32_t vpci_read(pci_sbdf_t sbdf, unsigned int reg, unsigned int size)
 {
-    const struct domain *d = current->domain;
+    struct domain *d = current->domain;
     const struct pci_dev *pdev;
     const struct vpci_register *r;
     unsigned int data_offset = 0;
@@ -376,12 +380,18 @@ uint32_t vpci_read(pci_sbdf_t sbdf, unsigned int reg, unsigned int size)
     /*
      * Find the PCI dev matching the address, which for hwdom also requires
      * consulting DomXEN.  Passthrough everything that's not trapped.
+     * If this is hwdom and the device is assigned to DomXEN, acquiring hwdom's
+     * pci_lock is sufficient.
      */
+    read_lock(&d->pci_lock);
     pdev = pci_get_pdev(d, sbdf);
     if ( !pdev && is_hardware_domain(d) )
         pdev = pci_get_pdev(dom_xen, sbdf);
     if ( !pdev || !pdev->vpci )
+    {
+        read_unlock(&d->pci_lock);
         return vpci_read_hw(sbdf, reg, size);
+    }
 
     spin_lock(&pdev->vpci->lock);
 
@@ -428,6 +438,7 @@ uint32_t vpci_read(pci_sbdf_t sbdf, unsigned int reg, unsigned int size)
         ASSERT(data_offset < size);
     }
     spin_unlock(&pdev->vpci->lock);
+    read_unlock(&d->pci_lock);
 
     if ( data_offset < size )
     {
@@ -470,7 +481,7 @@ static void vpci_write_helper(const struct pci_dev *pdev,
 void vpci_write(pci_sbdf_t sbdf, unsigned int reg, unsigned int size,
                 uint32_t data)
 {
-    const struct domain *d = current->domain;
+    struct domain *d = current->domain;
     const struct pci_dev *pdev;
     const struct vpci_register *r;
     unsigned int data_offset = 0;
@@ -484,7 +495,13 @@ void vpci_write(pci_sbdf_t sbdf, unsigned int reg, unsigned int size,
     /*
      * Find the PCI dev matching the address, which for hwdom also requires
      * consulting DomXEN.  Passthrough everything that's not trapped.
+     * If this is hwdom and the device is assigned to DomXEN, acquiring hwdom's
+     * pci_lock is sufficient.
+     *
+     * TODO: We need to take pci_locks in exclusive mode only if we
+     * are modifying BARs, so there is a room for improvement.
      */
+    write_lock(&d->pci_lock);
     pdev = pci_get_pdev(d, sbdf);
     if ( !pdev && is_hardware_domain(d) )
         pdev = pci_get_pdev(dom_xen, sbdf);
@@ -493,6 +510,8 @@ void vpci_write(pci_sbdf_t sbdf, unsigned int reg, unsigned int size,
         /* Ignore writes to read-only devices, which have no ->vpci. */
         const unsigned long *ro_map = pci_get_ro_map(sbdf.seg);
 
+        write_unlock(&d->pci_lock);
+
         if ( !ro_map || !test_bit(sbdf.bdf, ro_map) )
             vpci_write_hw(sbdf, reg, size, data);
         return;
@@ -534,6 +553,7 @@ void vpci_write(pci_sbdf_t sbdf, unsigned int reg, unsigned int size,
         ASSERT(data_offset < size);
     }
     spin_unlock(&pdev->vpci->lock);
+    write_unlock(&d->pci_lock);
 
     if ( data_offset < size )
         /* Tailing gap, write the remaining. */
diff --git a/xen/include/xen/sched.h b/xen/include/xen/sched.h
index 9da91e0e6244..37f5922f3206 100644
--- a/xen/include/xen/sched.h
+++ b/xen/include/xen/sched.h
@@ -462,7 +462,8 @@ struct domain
 #ifdef CONFIG_HAS_PCI
     struct list_head pdev_list;
     /*
-     * pci_lock protects access to pdev_list.
+     * pci_lock protects access to pdev_list. pci_lock also protects pdev->vpci
+     * structure from being removed.
      *
      * Any user *reading* from pdev_list, or from devices stored in pdev_list,
      * should hold either pcidevs_lock() or pci_lock in read mode. Optionally,

base-commit: 1ec3fe1f664fa837daf31e9fa8938f6109464f28
-- 
2.43.0



^ permalink raw reply related	[flat|nested] 68+ messages in thread

* Re: [PATCH v12 09/15] vpci/header: program p2m with guest BAR view
  2024-01-12 15:06   ` Roger Pau Monné
@ 2024-01-12 20:31     ` Stewart Hildebrand
  2024-01-12 20:49       ` [PATCH v12.1 " Stewart Hildebrand
  2024-01-15  9:07     ` [PATCH v12 " Jan Beulich
  1 sibling, 1 reply; 68+ messages in thread
From: Stewart Hildebrand @ 2024-01-12 20:31 UTC (permalink / raw)
  To: Roger Pau Monné
  Cc: xen-devel, Oleksandr Andrushchenko, Daniel P. Smith,
	Volodymyr Babchuk

On 1/12/24 10:06, Roger Pau Monné wrote:
> On Tue, Jan 09, 2024 at 04:51:24PM -0500, Stewart Hildebrand wrote:
>> From: Oleksandr Andrushchenko <oleksandr_andrushchenko@epam.com>
>>
>> Take into account guest's BAR view and program its p2m accordingly:
>> gfn is guest's view of the BAR and mfn is the physical BAR value.
>> This way hardware domain sees physical BAR values and guest sees
>> emulated ones.
>>
>> Hardware domain continues getting the BARs identity mapped, while for
>> domUs the BARs are mapped at the requested guest address without
>> modifying the BAR address in the device PCI config space.
>>
>> Signed-off-by: Oleksandr Andrushchenko <oleksandr_andrushchenko@epam.com>
>> Signed-off-by: Volodymyr Babchuk <volodymyr_babchuk@epam.com>
>> Signed-off-by: Stewart Hildebrand <stewart.hildebrand@amd.com>
> 
> Some nits and a request to add an extra assert. If you agree:
> 
> Reviewed-by: Roger Pau Monné <roger.pau@citrix.com>

Thanks! I agree. I'll reply to this with a v12.1 patch.

> 
>> ---
>> In v12:
>> - Update guest_addr in rom_write()
>> - Use unsigned long for start_mfn and map_mfn to reduce mfn_x() calls
>> - Use existing vmsix_table_*() functions
>> - Change vmsix_table_base() to use .guest_addr
>> In v11:
>> - Add vmsix_guest_table_addr() and vmsix_guest_table_base() functions
>>   to access guest's view of the VMSIx tables.
>> - Use MFN (not GFN) to check access permissions
>> - Move page offset check to this patch
>> - Call rangeset_remove_range() with correct parameters
>> In v10:
>> - Moved GFN variable definition outside the loop in map_range()
>> - Updated printk error message in map_range()
>> - Now BAR address is always stored in bar->guest_addr, even for
>>   HW dom, this removes bunch of ugly is_hwdom() checks in modify_bars()
>> - vmsix_table_base() now uses .guest_addr instead of .addr
>> In v9:
>> - Extended the commit message
>> - Use bar->guest_addr in modify_bars
>> - Extended printk error message in map_range
>> - Moved map_data initialization so .bar can be initialized during declaration
>> Since v5:
>> - remove debug print in map_range callback
>> - remove "identity" from the debug print
>> Since v4:
>> - moved start_{gfn|mfn} calculation into map_range
>> - pass vpci_bar in the map_data instead of start_{gfn|mfn}
>> - s/guest_addr/guest_reg
>> Since v3:
>> - updated comment (Roger)
>> - removed gfn_add(map->start_gfn, rc); which is wrong
>> - use v->domain instead of v->vpci.pdev->domain
>> - removed odd e.g. in comment
>> - s/d%d/%pd in altered code
>> - use gdprintk for map/unmap logs
>> Since v2:
>> - improve readability for data.start_gfn and restructure ?: construct
>> Since v1:
>>  - s/MSI/MSI-X in comments
>> ---
>>  xen/drivers/vpci/header.c | 81 +++++++++++++++++++++++++++++++--------
>>  xen/include/xen/vpci.h    |  3 +-
>>  2 files changed, 66 insertions(+), 18 deletions(-)
>>
>> diff --git a/xen/drivers/vpci/header.c b/xen/drivers/vpci/header.c
>> index feccd070ddd0..f0b0b64b0929 100644
>> --- a/xen/drivers/vpci/header.c
>> +++ b/xen/drivers/vpci/header.c
>> @@ -34,6 +34,7 @@
>>  
>>  struct map_data {
>>      struct domain *d;
>> +    const struct vpci_bar *bar;
>>      bool map;
>>  };
>>  
>> @@ -41,13 +42,24 @@ static int cf_check map_range(
>>      unsigned long s, unsigned long e, void *data, unsigned long *c)
>>  {
>>      const struct map_data *map = data;
>> +    /* Start address of the BAR as seen by the guest. */
>> +    unsigned long start_gfn = PFN_DOWN(map->bar->guest_addr);
>> +    /* Physical start address of the BAR. */
>> +    unsigned long start_mfn = PFN_DOWN(map->bar->addr);
>>      int rc;
>>  
>>      for ( ; ; )
>>      {
>>          unsigned long size = e - s + 1;
>> +        /*
>> +         * Ranges to be mapped don't always start at the BAR start address, as
>> +         * there can be holes or partially consumed ranges. Account for the
>> +         * offset of the current address from the BAR start.
>> +         */
>> +        unsigned long map_mfn = start_mfn + s - start_gfn;
>> +        unsigned long m_end = map_mfn + size - 1;
>>  
>> -        if ( !iomem_access_permitted(map->d, s, e) )
>> +        if ( !iomem_access_permitted(map->d, map_mfn, m_end) )
>>          {
>>              printk(XENLOG_G_WARNING
>>                     "%pd denied access to MMIO range [%#lx, %#lx]\n",
>> @@ -55,7 +67,8 @@ static int cf_check map_range(
>>              return -EPERM;
>>          }
>>  
>> -        rc = xsm_iomem_mapping(XSM_HOOK, map->d, s, e, map->map);
>> +        rc = xsm_iomem_mapping(XSM_HOOK, map->d, map_mfn, m_end,
>> +                               map->map);
>>          if ( rc )
>>          {
>>              printk(XENLOG_G_WARNING
>> @@ -73,8 +86,8 @@ static int cf_check map_range(
>>           * - {un}map_mmio_regions doesn't support preemption.
>>           */
>>  
>> -        rc = map->map ? map_mmio_regions(map->d, _gfn(s), size, _mfn(s))
>> -                      : unmap_mmio_regions(map->d, _gfn(s), size, _mfn(s));
>> +        rc = map->map ? map_mmio_regions(map->d, _gfn(s), size, _mfn(map_mfn))
>> +                      : unmap_mmio_regions(map->d, _gfn(s), size, _mfn(map_mfn));
>>          if ( rc == 0 )
>>          {
>>              *c += size;
>> @@ -83,8 +96,9 @@ static int cf_check map_range(
>>          if ( rc < 0 )
>>          {
>>              printk(XENLOG_G_WARNING
>> -                   "Failed to identity %smap [%lx, %lx] for d%d: %d\n",
>> -                   map->map ? "" : "un", s, e, map->d->domain_id, rc);
>> +                   "Failed to %smap [%lx %lx] -> [%lx %lx] for %pd: %d\n",
>> +                   map->map ? "" : "un", s, e, map_mfn,
>> +                   map_mfn + size, map->d, rc);
>>              break;
>>          }
>>          ASSERT(rc < size);
>> @@ -163,10 +177,6 @@ static void modify_decoding(const struct pci_dev *pdev, uint16_t cmd,
>>  bool vpci_process_pending(struct vcpu *v)
>>  {
>>      struct pci_dev *pdev = v->vpci.pdev;
>> -    struct map_data data = {
>> -        .d = v->domain,
>> -        .map = v->vpci.cmd & PCI_COMMAND_MEMORY,
>> -    };
>>      struct vpci_header *header = NULL;
>>      unsigned int i;
>>  
>> @@ -186,6 +196,11 @@ bool vpci_process_pending(struct vcpu *v)
>>      for ( i = 0; i < ARRAY_SIZE(header->bars); i++ )
>>      {
>>          struct vpci_bar *bar = &header->bars[i];
>> +        struct map_data data = {
>> +            .d = v->domain,
>> +            .map = v->vpci.cmd & PCI_COMMAND_MEMORY,
>> +            .bar = bar,
>> +        };
>>          int rc;
>>  
>>          if ( rangeset_is_empty(bar->mem) )
>> @@ -236,7 +251,6 @@ bool vpci_process_pending(struct vcpu *v)
>>  static int __init apply_map(struct domain *d, const struct pci_dev *pdev,
>>                              uint16_t cmd)
>>  {
>> -    struct map_data data = { .d = d, .map = true };
>>      struct vpci_header *header = &pdev->vpci->header;
>>      int rc = 0;
>>      unsigned int i;
>> @@ -246,6 +260,7 @@ static int __init apply_map(struct domain *d, const struct pci_dev *pdev,
>>      for ( i = 0; i < ARRAY_SIZE(header->bars); i++ )
>>      {
>>          struct vpci_bar *bar = &header->bars[i];
>> +        struct map_data data = { .d = d, .map = true, .bar = bar };
>>  
>>          if ( rangeset_is_empty(bar->mem) )
>>              continue;
>> @@ -311,12 +326,16 @@ static int modify_bars(const struct pci_dev *pdev, uint16_t cmd, bool rom_only)
>>       * First fill the rangesets with the BAR of this device or with the ROM
>>       * BAR only, depending on whether the guest is toggling the memory decode
>>       * bit of the command register, or the enable bit of the ROM BAR register.
>> +     *
>> +     * For non-hardware domain we use guest physical addresses.
>>       */
>>      for ( i = 0; i < ARRAY_SIZE(header->bars); i++ )
>>      {
>>          struct vpci_bar *bar = &header->bars[i];
>>          unsigned long start = PFN_DOWN(bar->addr);
>>          unsigned long end = PFN_DOWN(bar->addr + bar->size - 1);
>> +        unsigned long start_guest = PFN_DOWN(bar->guest_addr);
>> +        unsigned long end_guest = PFN_DOWN(bar->guest_addr + bar->size - 1);
>>  
>>          if ( !bar->mem )
>>              continue;
>> @@ -336,11 +355,25 @@ static int modify_bars(const struct pci_dev *pdev, uint16_t cmd, bool rom_only)
>>              continue;
>>          }
> 
> Should we assert that the BAR rangeset is empty here? To stay on the
> safe side.

Yes, I'll add an ASSERT.

> 
>>  
>> -        rc = rangeset_add_range(bar->mem, start, end);
>> +        /*
>> +         * Make sure that the guest set address has the same page offset
>> +         * as the physical address on the host or otherwise things won't work as
>> +         * expected.
>> +         */
>> +        if ( PAGE_OFFSET(bar->guest_addr) != PAGE_OFFSET(bar->addr) )
>> +        {
>> +            gprintk(XENLOG_G_WARNING,
>> +                    "%pp: Can't map BAR%d because of page offset mismatch: %lx vs %lx\n",
>                                            ^u
> 
> Also when using the x modifier it's better to also use # to print the
> 0x prefix.  You can also reduce the length of the message using
> s/because of/due to/ IMO:
> 
> %pp: Can't map BAR%u due to offset mismatch: %lx vs %lx

Will do

> 
>> +                    &pdev->sbdf, i, PAGE_OFFSET(bar->guest_addr),
>> +                    PAGE_OFFSET(bar->addr));
> 
> Maybe worth printing the whole address?

Agreed, will do

> 
>> +            return -EINVAL;
>> +        }
>> +
>> +        rc = rangeset_add_range(bar->mem, start_guest, end_guest);
>>          if ( rc )
>>          {
>>              printk(XENLOG_G_WARNING "Failed to add [%lx, %lx]: %d\n",
>> -                   start, end, rc);
>> +                   start_guest, end_guest, rc);
>>              return rc;
>>          }
>>  
>> @@ -352,12 +385,12 @@ static int modify_bars(const struct pci_dev *pdev, uint16_t cmd, bool rom_only)
>>              if ( rangeset_is_empty(prev_bar->mem) )
>>                  continue;
>>  
>> -            rc = rangeset_remove_range(prev_bar->mem, start, end);
>> +            rc = rangeset_remove_range(prev_bar->mem, start_guest, end_guest);
>>              if ( rc )
>>              {
>>                  gprintk(XENLOG_WARNING,
>>                         "%pp: failed to remove overlapping range [%lx, %lx]: %d\n",
>> -                        &pdev->sbdf, start, end, rc);
>> +                        &pdev->sbdf, start_guest, end_guest, rc);
>>                  return rc;
>>              }
>>          }
>> @@ -425,8 +458,8 @@ static int modify_bars(const struct pci_dev *pdev, uint16_t cmd, bool rom_only)
>>              for ( i = 0; i < ARRAY_SIZE(tmp->vpci->header.bars); i++ )
>>              {
>>                  const struct vpci_bar *remote_bar = &tmp->vpci->header.bars[i];
>> -                unsigned long start = PFN_DOWN(remote_bar->addr);
>> -                unsigned long end = PFN_DOWN(remote_bar->addr +
>> +                unsigned long start = PFN_DOWN(remote_bar->guest_addr);
>> +                unsigned long end = PFN_DOWN(remote_bar->guest_addr +
>>                                               remote_bar->size - 1);
>>  
>>                  if ( !remote_bar->enabled )
>> @@ -513,6 +546,8 @@ static void cf_check bar_write(
>>      struct vpci_bar *bar = data;
>>      bool hi = false;
>>  
>> +    ASSERT(is_hardware_domain(pdev->domain));
>> +
>>      if ( bar->type == VPCI_BAR_MEM64_HI )
>>      {
>>          ASSERT(reg > PCI_BASE_ADDRESS_0);
>> @@ -543,6 +578,10 @@ static void cf_check bar_write(
>>       */
>>      bar->addr &= ~(0xffffffffULL << (hi ? 32 : 0));
>>      bar->addr |= (uint64_t)val << (hi ? 32 : 0);
>> +    /*
>> +     * Update guest address as well, so hardware domain sees BAR identity mapped
>> +     */
> 
> Can you drop the 'as well' and make this a single line comment?
> 
> Otherwise maybe reword to:
> 
> Update guest address, so hardware domain BAR is identity mapped.
> 
> Sorry, I find it wasteful to have the opening and closing comment
> delimiters in separate lines for single line comments.

Yep, I'll make it single line


^ permalink raw reply	[flat|nested] 68+ messages in thread

* [PATCH v12.1 09/15] vpci/header: program p2m with guest BAR view
  2024-01-12 20:31     ` Stewart Hildebrand
@ 2024-01-12 20:49       ` Stewart Hildebrand
  0 siblings, 0 replies; 68+ messages in thread
From: Stewart Hildebrand @ 2024-01-12 20:49 UTC (permalink / raw)
  To: xen-devel
  Cc: Oleksandr Andrushchenko, Roger Pau Monné, Daniel P. Smith,
	Volodymyr Babchuk, Stewart Hildebrand

From: Oleksandr Andrushchenko <oleksandr_andrushchenko@epam.com>

Take into account guest's BAR view and program its p2m accordingly:
gfn is guest's view of the BAR and mfn is the physical BAR value.
This way hardware domain sees physical BAR values and guest sees
emulated ones.

Hardware domain continues getting the BARs identity mapped, while for
domUs the BARs are mapped at the requested guest address without
modifying the BAR address in the device PCI config space.

Signed-off-by: Oleksandr Andrushchenko <oleksandr_andrushchenko@epam.com>
Signed-off-by: Volodymyr Babchuk <volodymyr_babchuk@epam.com>
Signed-off-by: Stewart Hildebrand <stewart.hildebrand@amd.com>
Reviewed-by: Roger Pau Monné <roger.pau@citrix.com>
---
In v12.1:
- ASSERT(rangeset_is_empty()) in modify_bars()
- Fixup print format in modify_bars()
- Make comment single line in bar_write()
- Add Roger's R-b
In v12:
- Update guest_addr in rom_write()
- Use unsigned long for start_mfn and map_mfn to reduce mfn_x() calls
- Use existing vmsix_table_*() functions
- Change vmsix_table_base() to use .guest_addr
In v11:
- Add vmsix_guest_table_addr() and vmsix_guest_table_base() functions
  to access guest's view of the VMSIx tables.
- Use MFN (not GFN) to check access permissions
- Move page offset check to this patch
- Call rangeset_remove_range() with correct parameters
In v10:
- Moved GFN variable definition outside the loop in map_range()
- Updated printk error message in map_range()
- Now BAR address is always stored in bar->guest_addr, even for
  HW dom, this removes bunch of ugly is_hwdom() checks in modify_bars()
- vmsix_table_base() now uses .guest_addr instead of .addr
In v9:
- Extended the commit message
- Use bar->guest_addr in modify_bars
- Extended printk error message in map_range
- Moved map_data initialization so .bar can be initialized during declaration
Since v5:
- remove debug print in map_range callback
- remove "identity" from the debug print
Since v4:
- moved start_{gfn|mfn} calculation into map_range
- pass vpci_bar in the map_data instead of start_{gfn|mfn}
- s/guest_addr/guest_reg
Since v3:
- updated comment (Roger)
- removed gfn_add(map->start_gfn, rc); which is wrong
- use v->domain instead of v->vpci.pdev->domain
- removed odd e.g. in comment
- s/d%d/%pd in altered code
- use gdprintk for map/unmap logs
Since v2:
- improve readability for data.start_gfn and restructure ?: construct
Since v1:
 - s/MSI/MSI-X in comments
---
 xen/drivers/vpci/header.c | 80 ++++++++++++++++++++++++++++++---------
 xen/include/xen/vpci.h    |  3 +-
 2 files changed, 65 insertions(+), 18 deletions(-)

diff --git a/xen/drivers/vpci/header.c b/xen/drivers/vpci/header.c
index feccd070ddd0..f1e366bb5fc6 100644
--- a/xen/drivers/vpci/header.c
+++ b/xen/drivers/vpci/header.c
@@ -34,6 +34,7 @@
 
 struct map_data {
     struct domain *d;
+    const struct vpci_bar *bar;
     bool map;
 };
 
@@ -41,13 +42,24 @@ static int cf_check map_range(
     unsigned long s, unsigned long e, void *data, unsigned long *c)
 {
     const struct map_data *map = data;
+    /* Start address of the BAR as seen by the guest. */
+    unsigned long start_gfn = PFN_DOWN(map->bar->guest_addr);
+    /* Physical start address of the BAR. */
+    unsigned long start_mfn = PFN_DOWN(map->bar->addr);
     int rc;
 
     for ( ; ; )
     {
         unsigned long size = e - s + 1;
+        /*
+         * Ranges to be mapped don't always start at the BAR start address, as
+         * there can be holes or partially consumed ranges. Account for the
+         * offset of the current address from the BAR start.
+         */
+        unsigned long map_mfn = start_mfn + s - start_gfn;
+        unsigned long m_end = map_mfn + size - 1;
 
-        if ( !iomem_access_permitted(map->d, s, e) )
+        if ( !iomem_access_permitted(map->d, map_mfn, m_end) )
         {
             printk(XENLOG_G_WARNING
                    "%pd denied access to MMIO range [%#lx, %#lx]\n",
@@ -55,7 +67,8 @@ static int cf_check map_range(
             return -EPERM;
         }
 
-        rc = xsm_iomem_mapping(XSM_HOOK, map->d, s, e, map->map);
+        rc = xsm_iomem_mapping(XSM_HOOK, map->d, map_mfn, m_end,
+                               map->map);
         if ( rc )
         {
             printk(XENLOG_G_WARNING
@@ -73,8 +86,8 @@ static int cf_check map_range(
          * - {un}map_mmio_regions doesn't support preemption.
          */
 
-        rc = map->map ? map_mmio_regions(map->d, _gfn(s), size, _mfn(s))
-                      : unmap_mmio_regions(map->d, _gfn(s), size, _mfn(s));
+        rc = map->map ? map_mmio_regions(map->d, _gfn(s), size, _mfn(map_mfn))
+                      : unmap_mmio_regions(map->d, _gfn(s), size, _mfn(map_mfn));
         if ( rc == 0 )
         {
             *c += size;
@@ -83,8 +96,9 @@ static int cf_check map_range(
         if ( rc < 0 )
         {
             printk(XENLOG_G_WARNING
-                   "Failed to identity %smap [%lx, %lx] for d%d: %d\n",
-                   map->map ? "" : "un", s, e, map->d->domain_id, rc);
+                   "Failed to %smap [%lx %lx] -> [%lx %lx] for %pd: %d\n",
+                   map->map ? "" : "un", s, e, map_mfn,
+                   map_mfn + size, map->d, rc);
             break;
         }
         ASSERT(rc < size);
@@ -163,10 +177,6 @@ static void modify_decoding(const struct pci_dev *pdev, uint16_t cmd,
 bool vpci_process_pending(struct vcpu *v)
 {
     struct pci_dev *pdev = v->vpci.pdev;
-    struct map_data data = {
-        .d = v->domain,
-        .map = v->vpci.cmd & PCI_COMMAND_MEMORY,
-    };
     struct vpci_header *header = NULL;
     unsigned int i;
 
@@ -186,6 +196,11 @@ bool vpci_process_pending(struct vcpu *v)
     for ( i = 0; i < ARRAY_SIZE(header->bars); i++ )
     {
         struct vpci_bar *bar = &header->bars[i];
+        struct map_data data = {
+            .d = v->domain,
+            .map = v->vpci.cmd & PCI_COMMAND_MEMORY,
+            .bar = bar,
+        };
         int rc;
 
         if ( rangeset_is_empty(bar->mem) )
@@ -236,7 +251,6 @@ bool vpci_process_pending(struct vcpu *v)
 static int __init apply_map(struct domain *d, const struct pci_dev *pdev,
                             uint16_t cmd)
 {
-    struct map_data data = { .d = d, .map = true };
     struct vpci_header *header = &pdev->vpci->header;
     int rc = 0;
     unsigned int i;
@@ -246,6 +260,7 @@ static int __init apply_map(struct domain *d, const struct pci_dev *pdev,
     for ( i = 0; i < ARRAY_SIZE(header->bars); i++ )
     {
         struct vpci_bar *bar = &header->bars[i];
+        struct map_data data = { .d = d, .map = true, .bar = bar };
 
         if ( rangeset_is_empty(bar->mem) )
             continue;
@@ -311,12 +326,16 @@ static int modify_bars(const struct pci_dev *pdev, uint16_t cmd, bool rom_only)
      * First fill the rangesets with the BAR of this device or with the ROM
      * BAR only, depending on whether the guest is toggling the memory decode
      * bit of the command register, or the enable bit of the ROM BAR register.
+     *
+     * For non-hardware domain we use guest physical addresses.
      */
     for ( i = 0; i < ARRAY_SIZE(header->bars); i++ )
     {
         struct vpci_bar *bar = &header->bars[i];
         unsigned long start = PFN_DOWN(bar->addr);
         unsigned long end = PFN_DOWN(bar->addr + bar->size - 1);
+        unsigned long start_guest = PFN_DOWN(bar->guest_addr);
+        unsigned long end_guest = PFN_DOWN(bar->guest_addr + bar->size - 1);
 
         if ( !bar->mem )
             continue;
@@ -336,11 +355,26 @@ static int modify_bars(const struct pci_dev *pdev, uint16_t cmd, bool rom_only)
             continue;
         }
 
-        rc = rangeset_add_range(bar->mem, start, end);
+        ASSERT(rangeset_is_empty(bar->mem));
+
+        /*
+         * Make sure that the guest set address has the same page offset
+         * as the physical address on the host or otherwise things won't work as
+         * expected.
+         */
+        if ( PAGE_OFFSET(bar->guest_addr) != PAGE_OFFSET(bar->addr) )
+        {
+            gprintk(XENLOG_G_WARNING,
+                    "%pp: Can't map BAR%u due to offset mismatch: %#lx vs %#lx\n",
+                    &pdev->sbdf, i, bar->guest_addr, bar->addr);
+            return -EINVAL;
+        }
+
+        rc = rangeset_add_range(bar->mem, start_guest, end_guest);
         if ( rc )
         {
             printk(XENLOG_G_WARNING "Failed to add [%lx, %lx]: %d\n",
-                   start, end, rc);
+                   start_guest, end_guest, rc);
             return rc;
         }
 
@@ -352,12 +386,12 @@ static int modify_bars(const struct pci_dev *pdev, uint16_t cmd, bool rom_only)
             if ( rangeset_is_empty(prev_bar->mem) )
                 continue;
 
-            rc = rangeset_remove_range(prev_bar->mem, start, end);
+            rc = rangeset_remove_range(prev_bar->mem, start_guest, end_guest);
             if ( rc )
             {
                 gprintk(XENLOG_WARNING,
                        "%pp: failed to remove overlapping range [%lx, %lx]: %d\n",
-                        &pdev->sbdf, start, end, rc);
+                        &pdev->sbdf, start_guest, end_guest, rc);
                 return rc;
             }
         }
@@ -425,8 +459,8 @@ static int modify_bars(const struct pci_dev *pdev, uint16_t cmd, bool rom_only)
             for ( i = 0; i < ARRAY_SIZE(tmp->vpci->header.bars); i++ )
             {
                 const struct vpci_bar *remote_bar = &tmp->vpci->header.bars[i];
-                unsigned long start = PFN_DOWN(remote_bar->addr);
-                unsigned long end = PFN_DOWN(remote_bar->addr +
+                unsigned long start = PFN_DOWN(remote_bar->guest_addr);
+                unsigned long end = PFN_DOWN(remote_bar->guest_addr +
                                              remote_bar->size - 1);
 
                 if ( !remote_bar->enabled )
@@ -513,6 +547,8 @@ static void cf_check bar_write(
     struct vpci_bar *bar = data;
     bool hi = false;
 
+    ASSERT(is_hardware_domain(pdev->domain));
+
     if ( bar->type == VPCI_BAR_MEM64_HI )
     {
         ASSERT(reg > PCI_BASE_ADDRESS_0);
@@ -543,6 +579,8 @@ static void cf_check bar_write(
      */
     bar->addr &= ~(0xffffffffULL << (hi ? 32 : 0));
     bar->addr |= (uint64_t)val << (hi ? 32 : 0);
+    /* Update guest address, so hardware domain BAR is identity mapped. */
+    bar->guest_addr = bar->addr;
 
     /* Make sure Xen writes back the same value for the BAR RO bits. */
     if ( !hi )
@@ -639,11 +677,14 @@ static void cf_check rom_write(
     }
 
     if ( !rom->enabled )
+    {
         /*
          * If the ROM BAR is not mapped update the address field so the
          * correct address is mapped into the p2m.
          */
         rom->addr = val & PCI_ROM_ADDRESS_MASK;
+        rom->guest_addr = rom->addr;
+    }
 
     if ( !header->bars_mapped || rom->enabled == new_enabled )
     {
@@ -667,7 +708,10 @@ static void cf_check rom_write(
         return;
 
     if ( !new_enabled )
+    {
         rom->addr = val & PCI_ROM_ADDRESS_MASK;
+        rom->guest_addr = rom->addr;
+    }
 }
 
 static int bar_add_rangeset(const struct pci_dev *pdev, struct vpci_bar *bar,
@@ -862,6 +906,7 @@ static int cf_check init_header(struct pci_dev *pdev)
         }
 
         bars[i].addr = addr;
+        bars[i].guest_addr = addr;
         bars[i].size = size;
         bars[i].prefetchable = val & PCI_BASE_ADDRESS_MEM_PREFETCH;
 
@@ -884,6 +929,7 @@ static int cf_check init_header(struct pci_dev *pdev)
         rom->type = VPCI_BAR_ROM;
         rom->size = size;
         rom->addr = addr;
+        rom->guest_addr = addr;
         header->rom_enabled = pci_conf_read32(pdev->sbdf, rom_reg) &
                               PCI_ROM_ADDRESS_ENABLE;
 
diff --git a/xen/include/xen/vpci.h b/xen/include/xen/vpci.h
index 817ee9ee7300..e89c571890b2 100644
--- a/xen/include/xen/vpci.h
+++ b/xen/include/xen/vpci.h
@@ -216,7 +216,8 @@ int vpci_msix_arch_print(const struct vpci_msix *msix);
  */
 static inline paddr_t vmsix_table_base(const struct vpci *vpci, unsigned int nr)
 {
-    return vpci->header.bars[vpci->msix->tables[nr] & PCI_MSIX_BIRMASK].addr;
+    return vpci->header.bars[vpci->msix->tables[nr] &
+                             PCI_MSIX_BIRMASK].guest_addr;
 }
 
 static inline paddr_t vmsix_table_addr(const struct vpci *vpci, unsigned int nr)
-- 
2.43.0



^ permalink raw reply related	[flat|nested] 68+ messages in thread

* Re: [PATCH v12 01/15] vpci: use per-domain PCI lock to protect vpci structure
  2024-01-12 17:54     ` Stewart Hildebrand
  2024-01-12 18:14       ` [PATCH v12.1 " Stewart Hildebrand
@ 2024-01-15  8:53       ` Roger Pau Monné
  2024-01-15 15:08         ` Stewart Hildebrand
  1 sibling, 1 reply; 68+ messages in thread
From: Roger Pau Monné @ 2024-01-15  8:53 UTC (permalink / raw)
  To: Stewart Hildebrand
  Cc: xen-devel, Oleksandr Andrushchenko, Jan Beulich, Andrew Cooper,
	Wei Liu, George Dunlap, Julien Grall, Stefano Stabellini,
	Jun Nakajima, Kevin Tian, Paul Durrant, Volodymyr Babchuk

On Fri, Jan 12, 2024 at 12:54:56PM -0500, Stewart Hildebrand wrote:
> On 1/12/24 08:48, Roger Pau Monné wrote:
> > On Tue, Jan 09, 2024 at 04:51:16PM -0500, Stewart Hildebrand wrote:
> >> @@ -202,8 +204,20 @@ static int __init apply_map(struct domain *d, const struct pci_dev *pdev,
> >>      struct map_data data = { .d = d, .map = true };
> >>      int rc;
> >>  
> >> +    ASSERT(rw_is_write_locked(&d->pci_lock));
> >> +
> >>      while ( (rc = rangeset_consume_ranges(mem, map_range, &data)) == -ERESTART )
> >> +    {
> >> +        /*
> >> +         * It's safe to drop and reacquire the lock in this context
> >> +         * without risking pdev disappearing because devices cannot be
> >> +         * removed until the initial domain has been started.
> >> +         */
> >> +        write_unlock(&d->pci_lock);
> >>          process_pending_softirqs();
> >> +        write_lock(&d->pci_lock);
> > 
> > Hm, I should have noticed before, but we already call
> > process_pending_softirqs() with the pdev->vpci->lock held here, so it
> > would make sense to drop it also.
> 
> I don't quite understand this, maybe I'm missing something. I don't see where we acquire pdev->vpci->lock before calling process_pending_softirqs()?
> 
> Also, I tried adding
> 
>     ASSERT(!spin_is_locked(&pdev->vpci->lock));
> 
> both here in apply_map() and in vpci_process_pending(), and they haven't triggered in either dom0 or domU test cases, tested on both arm and x86.

I think I was confused.  Are you sure that pdev->vpci->lock is taken
in the apply_map() call?  I was mistakenly assuming that
vpci_add_handlers() called the init function with the vpci->lock
taken, but that doesn't seem to be case with the current code.  That
leads to apply_map() also being called without the vpci->lock taken.

I was wrongly assuming that apply_map() was called with the vpci->lock
lock taken, and that would need dropping around the
process_pending_softirqs() call.

Thanks, Roger.


^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [PATCH v12.1 01/15] vpci: use per-domain PCI lock to protect vpci structure
  2024-01-12 18:14       ` [PATCH v12.1 " Stewart Hildebrand
@ 2024-01-15  8:58         ` Jan Beulich
  2024-01-15 15:42           ` Stewart Hildebrand
  0 siblings, 1 reply; 68+ messages in thread
From: Jan Beulich @ 2024-01-15  8:58 UTC (permalink / raw)
  To: Stewart Hildebrand, Roger Pau Monné
  Cc: Oleksandr Andrushchenko, Andrew Cooper, Wei Liu, George Dunlap,
	Julien Grall, Stefano Stabellini, Jun Nakajima, Kevin Tian,
	Paul Durrant, Volodymyr Babchuk, xen-devel

On 12.01.2024 19:14, Stewart Hildebrand wrote:
> From: Oleksandr Andrushchenko <oleksandr_andrushchenko@epam.com>
> 
> Use the per-domain PCI read/write lock to protect the presence of the
> pci device vpci field. This lock can be used (and in a few cases is used
> right away) so that vpci removal can be performed while holding the lock
> in write mode. Previously such removal could race with vpci_read for
> example.
> 
> When taking both d->pci_lock and pdev->vpci->lock, they should be
> taken in this exact order: d->pci_lock then pdev->vpci->lock to avoid
> possible deadlock situations.
> 
> 1. Per-domain's pci_lock is used to protect pdev->vpci structure
> from being removed.
> 
> 2. Writing the command register and ROM BAR register may trigger
> modify_bars to run, which in turn may access multiple pdevs while
> checking for the existing BAR's overlap. The overlapping check, if
> done under the read lock, requires vpci->lock to be acquired on both
> devices being compared, which may produce a deadlock. It is not
> possible to upgrade read lock to write lock in such a case. So, in
> order to prevent the deadlock, use d->pci_lock in write mode instead.
> 
> All other code, which doesn't lead to pdev->vpci destruction and does
> not access multiple pdevs at the same time, can still use a
> combination of the read lock and pdev->vpci->lock.
> 
> 3. Drop const qualifier where the new rwlock is used and this is
> appropriate.
> 
> 4. Do not call process_pending_softirqs with any locks held. For that
> unlock prior the call and re-acquire the locks after. After
> re-acquiring the lock there is no need to check if pdev->vpci exists:
>  - in apply_map because of the context it is called (no race condition
>    possible)
>  - for MSI/MSI-X debug code because it is called at the end of
>    pdev->vpci access and no further access to pdev->vpci is made
> 
> 5. Use d->pci_lock around for_each_pdev and pci_get_pdev_by_domain
> while accessing pdevs in vpci code.
> 
> Suggested-by: Roger Pau Monné <roger.pau@citrix.com>
> Suggested-by: Jan Beulich <jbeulich@suse.com>
> Signed-off-by: Oleksandr Andrushchenko <oleksandr_andrushchenko@epam.com>
> Signed-off-by: Volodymyr Babchuk <volodymyr_babchuk@epam.com>
> Signed-off-by: Stewart Hildebrand <stewart.hildebrand@amd.com>
> Reviewed-by: Roger Pau Monné <roger.pau@citrix.com>

While I know Roger did offer the tag with certain adjustments, ...

> @@ -913,7 +911,12 @@ int vpci_msix_arch_print(const struct vpci_msix *msix)
>              struct pci_dev *pdev = msix->pdev;
>  
>              spin_unlock(&msix->pdev->vpci->lock);
> +            read_unlock(&pdev->domain->pci_lock);
>              process_pending_softirqs();
> +
> +            if ( !read_trylock(&pdev->domain->pci_lock) )
> +                return -EBUSY;
> +
>              /* NB: we assume that pdev cannot go away for an alive domain. */
>              if ( !pdev->vpci || !spin_trylock(&pdev->vpci->lock) )
>                  return -EBUSY;

... I'm sure he was assuming you would get this right, in also
dropping the 1st-try-acquired lock when this 2nd try-lock fails.
Personally I feel this is the kind of change one would better not
offer (or take) R-b ahead of time.

I further think the respective comment in vpci_dump_msi() also wants
adjusting from singular to plural.

Jan


^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [PATCH v12 09/15] vpci/header: program p2m with guest BAR view
  2024-01-12 15:06   ` Roger Pau Monné
  2024-01-12 20:31     ` Stewart Hildebrand
@ 2024-01-15  9:07     ` Jan Beulich
  2024-01-15 19:03       ` Stewart Hildebrand
  1 sibling, 1 reply; 68+ messages in thread
From: Jan Beulich @ 2024-01-15  9:07 UTC (permalink / raw)
  To: Roger Pau Monné, Stewart Hildebrand
  Cc: xen-devel, Oleksandr Andrushchenko, Daniel P. Smith,
	Volodymyr Babchuk

On 12.01.2024 16:06, Roger Pau Monné wrote:
> On Tue, Jan 09, 2024 at 04:51:24PM -0500, Stewart Hildebrand wrote:
>> From: Oleksandr Andrushchenko <oleksandr_andrushchenko@epam.com>
>> +        /*
>> +         * Make sure that the guest set address has the same page offset
>> +         * as the physical address on the host or otherwise things won't work as
>> +         * expected.
>> +         */
>> +        if ( PAGE_OFFSET(bar->guest_addr) != PAGE_OFFSET(bar->addr) )
>> +        {
>> +            gprintk(XENLOG_G_WARNING,
>> +                    "%pp: Can't map BAR%d because of page offset mismatch: %lx vs %lx\n",
>                                            ^u
> 
> Also when using the x modifier it's better to also use # to print the
> 0x prefix.  You can also reduce the length of the message using
> s/because of/due to/ IMO:
> 
> %pp: Can't map BAR%u due to offset mismatch: %lx vs %lx

Or even

%pp: can't map BAR%u - offset mismatch: %lx vs %lx

? Note also my use of lower-case 'c', which brings this log message in
line with all pre-existing (prior to the whole series) vPCI log messages
starting with "%pp: " (when not limiting to thus-prefixed there are a
couple of "Failed to ..." outliers).

Jan


^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [PATCH v12 11/15] vpci: add initial support for virtual PCI bus topology
  2024-01-12 13:50     ` Stewart Hildebrand
@ 2024-01-15 11:48       ` George Dunlap
  0 siblings, 0 replies; 68+ messages in thread
From: George Dunlap @ 2024-01-15 11:48 UTC (permalink / raw)
  To: Stewart Hildebrand
  Cc: xen-devel, Oleksandr Andrushchenko, Andrew Cooper, Jan Beulich,
	Julien Grall, Stefano Stabellini, Wei Liu, Roger Pau Monné,
	Volodymyr Babchuk

On Fri, Jan 12, 2024 at 1:50 PM Stewart Hildebrand
<stewart.hildebrand@amd.com> wrote:
>
> On 1/12/24 06:46, George Dunlap wrote:
> > On Tue, Jan 9, 2024 at 9:54 PM Stewart Hildebrand
> > <stewart.hildebrand@amd.com> wrote:
> >> diff --git a/xen/include/xen/sched.h b/xen/include/xen/sched.h
> >> index 37f5922f3206..b58a822847be 100644
> >> --- a/xen/include/xen/sched.h
> >> +++ b/xen/include/xen/sched.h
> >> @@ -484,6 +484,14 @@ struct domain
> >>       * 2. pdev->vpci->lock
> >>       */
> >>      rwlock_t pci_lock;
> >> +#ifdef CONFIG_HAS_VPCI_GUEST_SUPPORT
> >> +    /*
> >> +     * The bitmap which shows which device numbers are already used by the
> >> +     * virtual PCI bus topology and is used to assign a unique SBDF to the
> >> +     * next passed through virtual PCI device.
> >> +     */
> >> +    DECLARE_BITMAP(vpci_dev_assigned_map, VPCI_MAX_VIRT_DEV);
> >> +#endif
> >>  #endif
> >
> > Without digging through the whole series, how big do we expect this
> > bitmap to be on typical systems?
> >
> > If it's only going to be a handful of bytes, keeping it around for all
> > guests would be OK; but it's large, it would be better as a pointer,
> > since it's unused on the vast majority of guests.
>
> Since the bitmap is an unsigned long type it will typically be 8 bytes, although only 4 bytes are actually used. VPCI_MAX_VIRT_DEV is currently fixed at 32, as we are only tracking D (not the whole SBDF) in the bitmap so far.

OK, that's fine with me.  (FYI I replied because I thought you needed
my ack specifically for sched.h; looks like any of THE REST will do,
however.)

 -George


^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [PATCH v12 01/15] vpci: use per-domain PCI lock to protect vpci structure
  2024-01-15  8:53       ` [PATCH v12 " Roger Pau Monné
@ 2024-01-15 15:08         ` Stewart Hildebrand
  0 siblings, 0 replies; 68+ messages in thread
From: Stewart Hildebrand @ 2024-01-15 15:08 UTC (permalink / raw)
  To: Roger Pau Monné
  Cc: xen-devel, Oleksandr Andrushchenko, Jan Beulich, Andrew Cooper,
	Wei Liu, George Dunlap, Julien Grall, Stefano Stabellini,
	Jun Nakajima, Kevin Tian, Paul Durrant, Volodymyr Babchuk

On 1/15/24 03:53, Roger Pau Monné wrote:
> On Fri, Jan 12, 2024 at 12:54:56PM -0500, Stewart Hildebrand wrote:
>> On 1/12/24 08:48, Roger Pau Monné wrote:
>>> On Tue, Jan 09, 2024 at 04:51:16PM -0500, Stewart Hildebrand wrote:
>>>> @@ -202,8 +204,20 @@ static int __init apply_map(struct domain *d, const struct pci_dev *pdev,
>>>>      struct map_data data = { .d = d, .map = true };
>>>>      int rc;
>>>>  
>>>> +    ASSERT(rw_is_write_locked(&d->pci_lock));
>>>> +
>>>>      while ( (rc = rangeset_consume_ranges(mem, map_range, &data)) == -ERESTART )
>>>> +    {
>>>> +        /*
>>>> +         * It's safe to drop and reacquire the lock in this context
>>>> +         * without risking pdev disappearing because devices cannot be
>>>> +         * removed until the initial domain has been started.
>>>> +         */
>>>> +        write_unlock(&d->pci_lock);
>>>>          process_pending_softirqs();
>>>> +        write_lock(&d->pci_lock);
>>>
>>> Hm, I should have noticed before, but we already call
>>> process_pending_softirqs() with the pdev->vpci->lock held here, so it
>>> would make sense to drop it also.
>>
>> I don't quite understand this, maybe I'm missing something. I don't see where we acquire pdev->vpci->lock before calling process_pending_softirqs()?
>>
>> Also, I tried adding
>>
>>     ASSERT(!spin_is_locked(&pdev->vpci->lock));
>>
>> both here in apply_map() and in vpci_process_pending(), and they haven't triggered in either dom0 or domU test cases, tested on both arm and x86.
> 
> I think I was confused.  Are you sure that pdev->vpci->lock is taken
> in the apply_map() call?

I'm sure that it's NOT taken in apply_map(). See the ! in the test ASSERT above.

> I was mistakenly assuming that
> vpci_add_handlers() called the init function with the vpci->lock
> taken, but that doesn't seem to be case with the current code.  That
> leads to apply_map() also being called without the vpci->lock taken.

Right.

> 
> I was wrongly assuming that apply_map() was called with the vpci->lock
> lock taken, and that would need dropping around the
> process_pending_softirqs() call.
> 
> Thanks, Roger.


^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [PATCH v12.1 01/15] vpci: use per-domain PCI lock to protect vpci structure
  2024-01-15  8:58         ` Jan Beulich
@ 2024-01-15 15:42           ` Stewart Hildebrand
  0 siblings, 0 replies; 68+ messages in thread
From: Stewart Hildebrand @ 2024-01-15 15:42 UTC (permalink / raw)
  To: Jan Beulich, Roger Pau Monné
  Cc: Oleksandr Andrushchenko, Andrew Cooper, Wei Liu, George Dunlap,
	Julien Grall, Stefano Stabellini, Jun Nakajima, Kevin Tian,
	Paul Durrant, Volodymyr Babchuk, xen-devel

On 1/15/24 03:58, Jan Beulich wrote:
> On 12.01.2024 19:14, Stewart Hildebrand wrote:
>> From: Oleksandr Andrushchenko <oleksandr_andrushchenko@epam.com>
>>
>> Use the per-domain PCI read/write lock to protect the presence of the
>> pci device vpci field. This lock can be used (and in a few cases is used
>> right away) so that vpci removal can be performed while holding the lock
>> in write mode. Previously such removal could race with vpci_read for
>> example.
>>
>> When taking both d->pci_lock and pdev->vpci->lock, they should be
>> taken in this exact order: d->pci_lock then pdev->vpci->lock to avoid
>> possible deadlock situations.
>>
>> 1. Per-domain's pci_lock is used to protect pdev->vpci structure
>> from being removed.
>>
>> 2. Writing the command register and ROM BAR register may trigger
>> modify_bars to run, which in turn may access multiple pdevs while
>> checking for the existing BAR's overlap. The overlapping check, if
>> done under the read lock, requires vpci->lock to be acquired on both
>> devices being compared, which may produce a deadlock. It is not
>> possible to upgrade read lock to write lock in such a case. So, in
>> order to prevent the deadlock, use d->pci_lock in write mode instead.
>>
>> All other code, which doesn't lead to pdev->vpci destruction and does
>> not access multiple pdevs at the same time, can still use a
>> combination of the read lock and pdev->vpci->lock.
>>
>> 3. Drop const qualifier where the new rwlock is used and this is
>> appropriate.
>>
>> 4. Do not call process_pending_softirqs with any locks held. For that
>> unlock prior the call and re-acquire the locks after. After
>> re-acquiring the lock there is no need to check if pdev->vpci exists:
>>  - in apply_map because of the context it is called (no race condition
>>    possible)
>>  - for MSI/MSI-X debug code because it is called at the end of
>>    pdev->vpci access and no further access to pdev->vpci is made
>>
>> 5. Use d->pci_lock around for_each_pdev and pci_get_pdev_by_domain
>> while accessing pdevs in vpci code.
>>
>> Suggested-by: Roger Pau Monné <roger.pau@citrix.com>
>> Suggested-by: Jan Beulich <jbeulich@suse.com>
>> Signed-off-by: Oleksandr Andrushchenko <oleksandr_andrushchenko@epam.com>
>> Signed-off-by: Volodymyr Babchuk <volodymyr_babchuk@epam.com>
>> Signed-off-by: Stewart Hildebrand <stewart.hildebrand@amd.com>
>> Reviewed-by: Roger Pau Monné <roger.pau@citrix.com>
> 
> While I know Roger did offer the tag with certain adjustments, ...
> 
>> @@ -913,7 +911,12 @@ int vpci_msix_arch_print(const struct vpci_msix *msix)
>>              struct pci_dev *pdev = msix->pdev;
>>  
>>              spin_unlock(&msix->pdev->vpci->lock);
>> +            read_unlock(&pdev->domain->pci_lock);
>>              process_pending_softirqs();
>> +
>> +            if ( !read_trylock(&pdev->domain->pci_lock) )
>> +                return -EBUSY;
>> +
>>              /* NB: we assume that pdev cannot go away for an alive domain. */
>>              if ( !pdev->vpci || !spin_trylock(&pdev->vpci->lock) )
>>                  return -EBUSY;
> 
> ... I'm sure he was assuming you would get this right, in also
> dropping the 1st-try-acquired lock when this 2nd try-lock fails.

Thanks for catching this, and I appreciate the suggestion. I'll make sure both locks are dropped if needed on all error paths in vpci_msix_arch_print(), and adjust vpci_dump_msi() accordingly.

> Personally I feel this is the kind of change one would better not
> offer (or take) R-b ahead of time.

I'll drop Roger's R-b for v12.2.

> 
> I further think the respective comment in vpci_dump_msi() also wants
> adjusting from singular to plural.

I'll fix for v12.2, thanks for suggesting this.


^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [PATCH v12 09/15] vpci/header: program p2m with guest BAR view
  2024-01-15  9:07     ` [PATCH v12 " Jan Beulich
@ 2024-01-15 19:03       ` Stewart Hildebrand
  0 siblings, 0 replies; 68+ messages in thread
From: Stewart Hildebrand @ 2024-01-15 19:03 UTC (permalink / raw)
  To: Jan Beulich, Roger Pau Monné
  Cc: xen-devel, Oleksandr Andrushchenko, Daniel P. Smith,
	Volodymyr Babchuk

On 1/15/24 04:07, Jan Beulich wrote:
> On 12.01.2024 16:06, Roger Pau Monné wrote:
>> On Tue, Jan 09, 2024 at 04:51:24PM -0500, Stewart Hildebrand wrote:
>>> From: Oleksandr Andrushchenko <oleksandr_andrushchenko@epam.com>
>>> +        /*
>>> +         * Make sure that the guest set address has the same page offset
>>> +         * as the physical address on the host or otherwise things won't work as
>>> +         * expected.
>>> +         */
>>> +        if ( PAGE_OFFSET(bar->guest_addr) != PAGE_OFFSET(bar->addr) )
>>> +        {
>>> +            gprintk(XENLOG_G_WARNING,
>>> +                    "%pp: Can't map BAR%d because of page offset mismatch: %lx vs %lx\n",
>>                                            ^u
>>
>> Also when using the x modifier it's better to also use # to print the
>> 0x prefix.  You can also reduce the length of the message using
>> s/because of/due to/ IMO:
>>
>> %pp: Can't map BAR%u due to offset mismatch: %lx vs %lx
> 
> Or even
> 
> %pp: can't map BAR%u - offset mismatch: %lx vs %lx
> 
> ? 

Using # that becomes:

"%pp: can't map BAR%u - offset mismatch: %#lx vs %#lx\n"

I'll send v12.2.

> Note also my use of lower-case 'c', which brings this log message in
> line with all pre-existing (prior to the whole series) vPCI log messages
> starting with "%pp: " (when not limiting to thus-prefixed there are a
> couple of "Failed to ..." outliers).
> 
> Jan


^ permalink raw reply	[flat|nested] 68+ messages in thread

* [PATCH v12.2 01/15] vpci: use per-domain PCI lock to protect vpci structure
  2024-01-09 21:51 ` [PATCH v12 01/15] vpci: use per-domain PCI lock to protect vpci structure Stewart Hildebrand
  2024-01-12 13:48   ` Roger Pau Monné
@ 2024-01-15 19:43   ` Stewart Hildebrand
  2024-01-19 13:42     ` Roger Pau Monné
                       ` (3 more replies)
  1 sibling, 4 replies; 68+ messages in thread
From: Stewart Hildebrand @ 2024-01-15 19:43 UTC (permalink / raw)
  To: xen-devel
  Cc: Oleksandr Andrushchenko, Jan Beulich, Andrew Cooper,
	Roger Pau Monné, Wei Liu, George Dunlap, Julien Grall,
	Stefano Stabellini, Jun Nakajima, Kevin Tian, Paul Durrant,
	Volodymyr Babchuk, Stewart Hildebrand

From: Oleksandr Andrushchenko <oleksandr_andrushchenko@epam.com>

Use the per-domain PCI read/write lock to protect the presence of the
pci device vpci field. This lock can be used (and in a few cases is used
right away) so that vpci removal can be performed while holding the lock
in write mode. Previously such removal could race with vpci_read for
example.

When taking both d->pci_lock and pdev->vpci->lock, they should be
taken in this exact order: d->pci_lock then pdev->vpci->lock to avoid
possible deadlock situations.

1. Per-domain's pci_lock is used to protect pdev->vpci structure
from being removed.

2. Writing the command register and ROM BAR register may trigger
modify_bars to run, which in turn may access multiple pdevs while
checking for the existing BAR's overlap. The overlapping check, if
done under the read lock, requires vpci->lock to be acquired on both
devices being compared, which may produce a deadlock. It is not
possible to upgrade read lock to write lock in such a case. So, in
order to prevent the deadlock, use d->pci_lock in write mode instead.

All other code, which doesn't lead to pdev->vpci destruction and does
not access multiple pdevs at the same time, can still use a
combination of the read lock and pdev->vpci->lock.

3. Drop const qualifier where the new rwlock is used and this is
appropriate.

4. Do not call process_pending_softirqs with any locks held. For that
unlock prior the call and re-acquire the locks after. After
re-acquiring the lock there is no need to check if pdev->vpci exists:
 - in apply_map because of the context it is called (no race condition
   possible)
 - for MSI/MSI-X debug code because it is called at the end of
   pdev->vpci access and no further access to pdev->vpci is made

5. Use d->pci_lock around for_each_pdev and pci_get_pdev_by_domain
while accessing pdevs in vpci code.

Suggested-by: Roger Pau Monné <roger.pau@citrix.com>
Suggested-by: Jan Beulich <jbeulich@suse.com>
Signed-off-by: Oleksandr Andrushchenko <oleksandr_andrushchenko@epam.com>
Signed-off-by: Volodymyr Babchuk <volodymyr_babchuk@epam.com>
Signed-off-by: Stewart Hildebrand <stewart.hildebrand@amd.com>
---
Changes in v12.2:
 - drop Roger's R-b
 - drop both locks on error paths in vpci_msix_arch_print()
 - add another ASSERT in vpci_msix_arch_print(), to enforce the
   expectation both locks are held before calling vpci_msix_arch_print()
 - move pdev_done label in vpci_dump_msi()
 - update comments in vpci_dump_msi() to say locks (plural)

Changes in v12.1:
 - use read_trylock() in vpci_msix_arch_print()
 - fixup in-code comments (revert double space, use DomXEN) in
   vpci_{read,write}()
 - minor updates in commit message
 - add Roger's R-b

Changes in v12:
 - s/pci_rwlock/pci_lock/ in commit message
 - expand comment about scope of pci_lock in sched.h
 - in vpci_{read,write}, if hwdom is trying to access a device assigned
   to dom_xen, holding hwdom->pci_lock is sufficient (no need to hold
   dom_xen->pci_lock)
 - reintroduce ASSERT in vmx_pi_update_irte()
 - reintroduce ASSERT in __pci_enable_msi{x}()
 - delete note 6. in commit message about removing ASSERTs since we have
   reintroduced them

Changes in v11:
 - Fixed commit message regarding possible spinlocks
 - Removed parameter from allocate_and_map_msi_pirq(), which was added
 in the prev version. Now we are taking pcidevs_lock in
 physdev_map_pirq()
 - Returned ASSERT to pci_enable_msi
 - Fixed case when we took read lock instead of write one
 - Fixed label indentation

Changes in v10:
 - Moved printk pas locked area
 - Returned back ASSERTs
 - Added new parameter to allocate_and_map_msi_pirq() so it knows if
 it should take the global pci lock
 - Added comment about possible improvement in vpci_write
 - Changed ASSERT(rw_is_locked()) to rw_is_write_locked() in
   appropriate places
 - Renamed release_domain_locks() to release_domain_write_locks()
 - moved domain_done label in vpci_dump_msi() to correct place
Changes in v9:
 - extended locked region to protect vpci_remove_device and
   vpci_add_handlers() calls
 - vpci_write() takes lock in the write mode to protect
   potential call to modify_bars()
 - renamed lock releasing function
 - removed ASSERT()s from msi code
 - added trylock in vpci_dump_msi

Changes in v8:
 - changed d->vpci_lock to d->pci_lock
 - introducing d->pci_lock in a separate patch
 - extended locked region in vpci_process_pending
 - removed pcidevs_lockis vpci_dump_msi()
 - removed some changes as they are not needed with
   the new locking scheme
 - added handling for hwdom && dom_xen case
---
 xen/arch/x86/hvm/vmsi.c       | 31 +++++++++++++--------
 xen/arch/x86/hvm/vmx/vmx.c    |  2 +-
 xen/arch/x86/irq.c            |  8 +++---
 xen/arch/x86/msi.c            | 14 +++++-----
 xen/arch/x86/physdev.c        |  2 ++
 xen/drivers/passthrough/pci.c |  9 +++---
 xen/drivers/vpci/header.c     | 18 ++++++++++++
 xen/drivers/vpci/msi.c        | 30 +++++++++++++++++---
 xen/drivers/vpci/msix.c       | 52 ++++++++++++++++++++++++++++++-----
 xen/drivers/vpci/vpci.c       | 24 ++++++++++++++--
 xen/include/xen/sched.h       |  3 +-
 11 files changed, 152 insertions(+), 41 deletions(-)

diff --git a/xen/arch/x86/hvm/vmsi.c b/xen/arch/x86/hvm/vmsi.c
index 128f23636279..9e5e93a6ba0f 100644
--- a/xen/arch/x86/hvm/vmsi.c
+++ b/xen/arch/x86/hvm/vmsi.c
@@ -468,7 +468,7 @@ int msixtbl_pt_register(struct domain *d, struct pirq *pirq, uint64_t gtable)
     struct msixtbl_entry *entry, *new_entry;
     int r = -EINVAL;
 
-    ASSERT(pcidevs_locked());
+    ASSERT(pcidevs_locked() || rw_is_locked(&d->pci_lock));
     ASSERT(rw_is_write_locked(&d->event_lock));
 
     if ( !msixtbl_initialised(d) )
@@ -538,7 +538,7 @@ void msixtbl_pt_unregister(struct domain *d, struct pirq *pirq)
     struct pci_dev *pdev;
     struct msixtbl_entry *entry;
 
-    ASSERT(pcidevs_locked());
+    ASSERT(pcidevs_locked() || rw_is_locked(&d->pci_lock));
     ASSERT(rw_is_write_locked(&d->event_lock));
 
     if ( !msixtbl_initialised(d) )
@@ -684,7 +684,7 @@ static int vpci_msi_update(const struct pci_dev *pdev, uint32_t data,
 {
     unsigned int i;
 
-    ASSERT(pcidevs_locked());
+    ASSERT(rw_is_locked(&pdev->domain->pci_lock));
 
     if ( (address & MSI_ADDR_BASE_MASK) != MSI_ADDR_HEADER )
     {
@@ -725,8 +725,8 @@ void vpci_msi_arch_update(struct vpci_msi *msi, const struct pci_dev *pdev)
     int rc;
 
     ASSERT(msi->arch.pirq != INVALID_PIRQ);
+    ASSERT(rw_is_locked(&pdev->domain->pci_lock));
 
-    pcidevs_lock();
     for ( i = 0; i < msi->vectors && msi->arch.bound; i++ )
     {
         struct xen_domctl_bind_pt_irq unbind = {
@@ -745,7 +745,6 @@ void vpci_msi_arch_update(struct vpci_msi *msi, const struct pci_dev *pdev)
 
     msi->arch.bound = !vpci_msi_update(pdev, msi->data, msi->address,
                                        msi->vectors, msi->arch.pirq, msi->mask);
-    pcidevs_unlock();
 }
 
 static int vpci_msi_enable(const struct pci_dev *pdev, unsigned int nr,
@@ -778,15 +777,14 @@ int vpci_msi_arch_enable(struct vpci_msi *msi, const struct pci_dev *pdev,
     int rc;
 
     ASSERT(msi->arch.pirq == INVALID_PIRQ);
+    ASSERT(rw_is_locked(&pdev->domain->pci_lock));
     rc = vpci_msi_enable(pdev, vectors, 0);
     if ( rc < 0 )
         return rc;
     msi->arch.pirq = rc;
 
-    pcidevs_lock();
     msi->arch.bound = !vpci_msi_update(pdev, msi->data, msi->address, vectors,
                                        msi->arch.pirq, msi->mask);
-    pcidevs_unlock();
 
     return 0;
 }
@@ -797,8 +795,8 @@ static void vpci_msi_disable(const struct pci_dev *pdev, int pirq,
     unsigned int i;
 
     ASSERT(pirq != INVALID_PIRQ);
+    ASSERT(rw_is_locked(&pdev->domain->pci_lock));
 
-    pcidevs_lock();
     for ( i = 0; i < nr && bound; i++ )
     {
         struct xen_domctl_bind_pt_irq bind = {
@@ -814,7 +812,6 @@ static void vpci_msi_disable(const struct pci_dev *pdev, int pirq,
     write_lock(&pdev->domain->event_lock);
     unmap_domain_pirq(pdev->domain, pirq);
     write_unlock(&pdev->domain->event_lock);
-    pcidevs_unlock();
 }
 
 void vpci_msi_arch_disable(struct vpci_msi *msi, const struct pci_dev *pdev)
@@ -854,6 +851,7 @@ int vpci_msix_arch_enable_entry(struct vpci_msix_entry *entry,
     int rc;
 
     ASSERT(entry->arch.pirq == INVALID_PIRQ);
+    ASSERT(rw_is_locked(&pdev->domain->pci_lock));
     rc = vpci_msi_enable(pdev, vmsix_entry_nr(pdev->vpci->msix, entry),
                          table_base);
     if ( rc < 0 )
@@ -861,7 +859,6 @@ int vpci_msix_arch_enable_entry(struct vpci_msix_entry *entry,
 
     entry->arch.pirq = rc;
 
-    pcidevs_lock();
     rc = vpci_msi_update(pdev, entry->data, entry->addr, 1, entry->arch.pirq,
                          entry->masked);
     if ( rc )
@@ -869,7 +866,6 @@ int vpci_msix_arch_enable_entry(struct vpci_msix_entry *entry,
         vpci_msi_disable(pdev, entry->arch.pirq, 1, false);
         entry->arch.pirq = INVALID_PIRQ;
     }
-    pcidevs_unlock();
 
     return rc;
 }
@@ -895,6 +891,9 @@ int vpci_msix_arch_print(const struct vpci_msix *msix)
 {
     unsigned int i;
 
+    ASSERT(rw_is_locked(&msix->pdev->domain->pci_lock));
+    ASSERT(spin_is_locked(&msix->pdev->vpci->lock));
+
     for ( i = 0; i < msix->max_entries; i++ )
     {
         const struct vpci_msix_entry *entry = &msix->entries[i];
@@ -913,13 +912,23 @@ int vpci_msix_arch_print(const struct vpci_msix *msix)
             struct pci_dev *pdev = msix->pdev;
 
             spin_unlock(&msix->pdev->vpci->lock);
+            read_unlock(&pdev->domain->pci_lock);
             process_pending_softirqs();
+
+            if ( !read_trylock(&pdev->domain->pci_lock) )
+                return -EBUSY;
+
             /* NB: we assume that pdev cannot go away for an alive domain. */
             if ( !pdev->vpci || !spin_trylock(&pdev->vpci->lock) )
+            {
+                read_unlock(&pdev->domain->pci_lock);
                 return -EBUSY;
+            }
+
             if ( pdev->vpci->msix != msix )
             {
                 spin_unlock(&pdev->vpci->lock);
+                read_unlock(&pdev->domain->pci_lock);
                 return -EAGAIN;
             }
         }
diff --git a/xen/arch/x86/hvm/vmx/vmx.c b/xen/arch/x86/hvm/vmx/vmx.c
index 8ff675883c2b..890faef48b82 100644
--- a/xen/arch/x86/hvm/vmx/vmx.c
+++ b/xen/arch/x86/hvm/vmx/vmx.c
@@ -413,7 +413,7 @@ static int cf_check vmx_pi_update_irte(const struct vcpu *v,
 
     spin_unlock_irq(&desc->lock);
 
-    ASSERT(pcidevs_locked());
+    ASSERT(pcidevs_locked() || rw_is_locked(&msi_desc->dev->domain->pci_lock));
 
     return iommu_update_ire_from_msi(msi_desc, &msi_desc->msg);
 
diff --git a/xen/arch/x86/irq.c b/xen/arch/x86/irq.c
index 50e49e1a4b6f..4d89d9af99fe 100644
--- a/xen/arch/x86/irq.c
+++ b/xen/arch/x86/irq.c
@@ -2166,7 +2166,7 @@ int map_domain_pirq(
         struct pci_dev *pdev;
         unsigned int nr = 0;
 
-        ASSERT(pcidevs_locked());
+        ASSERT(pcidevs_locked() || rw_is_locked(&d->pci_lock));
 
         ret = -ENODEV;
         if ( !cpu_has_apic )
@@ -2323,7 +2323,7 @@ int unmap_domain_pirq(struct domain *d, int pirq)
     if ( (pirq < 0) || (pirq >= d->nr_pirqs) )
         return -EINVAL;
 
-    ASSERT(pcidevs_locked());
+    ASSERT(pcidevs_locked() || rw_is_locked(&d->pci_lock));
     ASSERT(rw_is_write_locked(&d->event_lock));
 
     info = pirq_info(d, pirq);
@@ -2888,6 +2888,8 @@ int allocate_and_map_msi_pirq(struct domain *d, int index, int *pirq_p,
 {
     int irq, pirq, ret;
 
+    ASSERT(pcidevs_locked() || rw_is_locked(&d->pci_lock));
+
     switch ( type )
     {
     case MAP_PIRQ_TYPE_MSI:
@@ -2917,7 +2919,6 @@ int allocate_and_map_msi_pirq(struct domain *d, int index, int *pirq_p,
 
     msi->irq = irq;
 
-    pcidevs_lock();
     /* Verify or get pirq. */
     write_lock(&d->event_lock);
     pirq = allocate_pirq(d, index, *pirq_p, irq, type, &msi->entry_nr);
@@ -2933,7 +2934,6 @@ int allocate_and_map_msi_pirq(struct domain *d, int index, int *pirq_p,
 
  done:
     write_unlock(&d->event_lock);
-    pcidevs_unlock();
     if ( ret )
     {
         switch ( type )
diff --git a/xen/arch/x86/msi.c b/xen/arch/x86/msi.c
index 335c0868a225..7da2affa2e82 100644
--- a/xen/arch/x86/msi.c
+++ b/xen/arch/x86/msi.c
@@ -602,7 +602,7 @@ static int msi_capability_init(struct pci_dev *dev,
     unsigned int i, mpos;
     uint16_t control;
 
-    ASSERT(pcidevs_locked());
+    ASSERT(pcidevs_locked() || rw_is_locked(&dev->domain->pci_lock));
     pos = pci_find_cap_offset(dev->sbdf, PCI_CAP_ID_MSI);
     if ( !pos )
         return -ENODEV;
@@ -771,7 +771,7 @@ static int msix_capability_init(struct pci_dev *dev,
     if ( !pos )
         return -ENODEV;
 
-    ASSERT(pcidevs_locked());
+    ASSERT(pcidevs_locked() || rw_is_locked(&dev->domain->pci_lock));
 
     control = pci_conf_read16(dev->sbdf, msix_control_reg(pos));
     /*
@@ -988,11 +988,11 @@ static int __pci_enable_msi(struct pci_dev *pdev, struct msi_info *msi,
 {
     struct msi_desc *old_desc;
 
-    ASSERT(pcidevs_locked());
-
     if ( !pdev )
         return -ENODEV;
 
+    ASSERT(pcidevs_locked() || rw_is_locked(&pdev->domain->pci_lock));
+
     old_desc = find_msi_entry(pdev, msi->irq, PCI_CAP_ID_MSI);
     if ( old_desc )
     {
@@ -1043,11 +1043,11 @@ static int __pci_enable_msix(struct pci_dev *pdev, struct msi_info *msi,
 {
     struct msi_desc *old_desc;
 
-    ASSERT(pcidevs_locked());
-
     if ( !pdev || !pdev->msix )
         return -ENODEV;
 
+    ASSERT(pcidevs_locked() || rw_is_locked(&pdev->domain->pci_lock));
+
     if ( msi->entry_nr >= pdev->msix->nr_entries )
         return -EINVAL;
 
@@ -1154,7 +1154,7 @@ int pci_prepare_msix(u16 seg, u8 bus, u8 devfn, bool off)
 int pci_enable_msi(struct pci_dev *pdev, struct msi_info *msi,
                    struct msi_desc **desc)
 {
-    ASSERT(pcidevs_locked());
+    ASSERT(pcidevs_locked() || rw_is_locked(&pdev->domain->pci_lock));
 
     if ( !use_msi )
         return -EPERM;
diff --git a/xen/arch/x86/physdev.c b/xen/arch/x86/physdev.c
index 47c4da0af7e1..369c9e788c1c 100644
--- a/xen/arch/x86/physdev.c
+++ b/xen/arch/x86/physdev.c
@@ -123,7 +123,9 @@ int physdev_map_pirq(domid_t domid, int type, int *index, int *pirq_p,
 
     case MAP_PIRQ_TYPE_MSI:
     case MAP_PIRQ_TYPE_MULTI_MSI:
+        pcidevs_lock();
         ret = allocate_and_map_msi_pirq(d, *index, pirq_p, type, msi);
+        pcidevs_unlock();
         break;
 
     default:
diff --git a/xen/drivers/passthrough/pci.c b/xen/drivers/passthrough/pci.c
index 1439d1ef2b26..3a973324bca1 100644
--- a/xen/drivers/passthrough/pci.c
+++ b/xen/drivers/passthrough/pci.c
@@ -750,7 +750,6 @@ int pci_add_device(u16 seg, u8 bus, u8 devfn,
         pdev->domain = hardware_domain;
         write_lock(&hardware_domain->pci_lock);
         list_add(&pdev->domain_list, &hardware_domain->pdev_list);
-        write_unlock(&hardware_domain->pci_lock);
 
         /*
          * For devices not discovered by Xen during boot, add vPCI handlers
@@ -759,18 +758,18 @@ int pci_add_device(u16 seg, u8 bus, u8 devfn,
         ret = vpci_add_handlers(pdev);
         if ( ret )
         {
-            printk(XENLOG_ERR "Setup of vPCI failed: %d\n", ret);
-            write_lock(&hardware_domain->pci_lock);
             list_del(&pdev->domain_list);
             write_unlock(&hardware_domain->pci_lock);
             pdev->domain = NULL;
+            printk(XENLOG_ERR "Setup of vPCI failed: %d\n", ret);
             goto out;
         }
+        write_unlock(&hardware_domain->pci_lock);
         ret = iommu_add_device(pdev);
         if ( ret )
         {
-            vpci_remove_device(pdev);
             write_lock(&hardware_domain->pci_lock);
+            vpci_remove_device(pdev);
             list_del(&pdev->domain_list);
             write_unlock(&hardware_domain->pci_lock);
             pdev->domain = NULL;
@@ -1146,7 +1145,9 @@ static void __hwdom_init setup_one_hwdom_device(const struct setup_hwdom *ctxt,
     } while ( devfn != pdev->devfn &&
               PCI_SLOT(devfn) == PCI_SLOT(pdev->devfn) );
 
+    write_lock(&ctxt->d->pci_lock);
     err = vpci_add_handlers(pdev);
+    write_unlock(&ctxt->d->pci_lock);
     if ( err )
         printk(XENLOG_ERR "setup of vPCI for d%d failed: %d\n",
                ctxt->d->domain_id, err);
diff --git a/xen/drivers/vpci/header.c b/xen/drivers/vpci/header.c
index 58195549d50a..8f5850b8cf6d 100644
--- a/xen/drivers/vpci/header.c
+++ b/xen/drivers/vpci/header.c
@@ -173,6 +173,7 @@ bool vpci_process_pending(struct vcpu *v)
         if ( rc == -ERESTART )
             return true;
 
+        write_lock(&v->domain->pci_lock);
         spin_lock(&v->vpci.pdev->vpci->lock);
         /* Disable memory decoding unconditionally on failure. */
         modify_decoding(v->vpci.pdev,
@@ -191,6 +192,7 @@ bool vpci_process_pending(struct vcpu *v)
              * failure.
              */
             vpci_remove_device(v->vpci.pdev);
+        write_unlock(&v->domain->pci_lock);
     }
 
     return false;
@@ -202,8 +204,20 @@ static int __init apply_map(struct domain *d, const struct pci_dev *pdev,
     struct map_data data = { .d = d, .map = true };
     int rc;
 
+    ASSERT(rw_is_write_locked(&d->pci_lock));
+
     while ( (rc = rangeset_consume_ranges(mem, map_range, &data)) == -ERESTART )
+    {
+        /*
+         * It's safe to drop and reacquire the lock in this context
+         * without risking pdev disappearing because devices cannot be
+         * removed until the initial domain has been started.
+         */
+        write_unlock(&d->pci_lock);
         process_pending_softirqs();
+        write_lock(&d->pci_lock);
+    }
+
     rangeset_destroy(mem);
     if ( !rc )
         modify_decoding(pdev, cmd, false);
@@ -244,6 +258,8 @@ static int modify_bars(const struct pci_dev *pdev, uint16_t cmd, bool rom_only)
     unsigned int i;
     int rc;
 
+    ASSERT(rw_is_write_locked(&pdev->domain->pci_lock));
+
     if ( !mem )
         return -ENOMEM;
 
@@ -524,6 +540,8 @@ static int cf_check init_header(struct pci_dev *pdev)
     int rc;
     bool mask_cap_list = false;
 
+    ASSERT(rw_is_write_locked(&pdev->domain->pci_lock));
+
     switch ( pci_conf_read8(pdev->sbdf, PCI_HEADER_TYPE) & 0x7f )
     {
     case PCI_HEADER_TYPE_NORMAL:
diff --git a/xen/drivers/vpci/msi.c b/xen/drivers/vpci/msi.c
index a253ccbd7db7..dc71938e23f5 100644
--- a/xen/drivers/vpci/msi.c
+++ b/xen/drivers/vpci/msi.c
@@ -263,7 +263,7 @@ REGISTER_VPCI_INIT(init_msi, VPCI_PRIORITY_LOW);
 
 void vpci_dump_msi(void)
 {
-    const struct domain *d;
+    struct domain *d;
 
     rcu_read_lock(&domlist_read_lock);
     for_each_domain ( d )
@@ -275,6 +275,9 @@ void vpci_dump_msi(void)
 
         printk("vPCI MSI/MSI-X d%d\n", d->domain_id);
 
+        if ( !read_trylock(&d->pci_lock) )
+            continue;
+
         for_each_pdev ( d, pdev )
         {
             const struct vpci_msi *msi;
@@ -313,17 +316,36 @@ void vpci_dump_msi(void)
                 {
                     /*
                      * On error vpci_msix_arch_print will always return without
-                     * holding the lock.
+                     * holding the locks.
                      */
                     printk("unable to print all MSI-X entries: %d\n", rc);
-                    process_pending_softirqs();
-                    continue;
+                    goto pdev_done;
                 }
             }
 
+            /*
+             * Unlock locks to process pending softirqs. This is
+             * potentially unsafe, as d->pdev_list can be changed in
+             * meantime.
+             */
             spin_unlock(&pdev->vpci->lock);
+            read_unlock(&d->pci_lock);
+        pdev_done:
             process_pending_softirqs();
+            if ( !read_trylock(&d->pci_lock) )
+            {
+                printk("unable to access other devices for the domain\n");
+                goto domain_done;
+            }
         }
+        read_unlock(&d->pci_lock);
+    domain_done:
+        /*
+         * We need this label at the end of the loop, but some
+         * compilers might not be happy about label at the end of the
+         * compound statement so we adding an empty statement here.
+         */
+        ;
     }
     rcu_read_unlock(&domlist_read_lock);
 }
diff --git a/xen/drivers/vpci/msix.c b/xen/drivers/vpci/msix.c
index d1126a417da9..b6abab47efdd 100644
--- a/xen/drivers/vpci/msix.c
+++ b/xen/drivers/vpci/msix.c
@@ -147,6 +147,8 @@ static struct vpci_msix *msix_find(const struct domain *d, unsigned long addr)
 {
     struct vpci_msix *msix;
 
+    ASSERT(rw_is_locked(&d->pci_lock));
+
     list_for_each_entry ( msix, &d->arch.hvm.msix_tables, next )
     {
         const struct vpci_bar *bars = msix->pdev->vpci->header.bars;
@@ -163,7 +165,13 @@ static struct vpci_msix *msix_find(const struct domain *d, unsigned long addr)
 
 static int cf_check msix_accept(struct vcpu *v, unsigned long addr)
 {
-    return !!msix_find(v->domain, addr);
+    int rc;
+
+    read_lock(&v->domain->pci_lock);
+    rc = !!msix_find(v->domain, addr);
+    read_unlock(&v->domain->pci_lock);
+
+    return rc;
 }
 
 static bool access_allowed(const struct pci_dev *pdev, unsigned long addr,
@@ -358,21 +366,35 @@ static int adjacent_read(const struct domain *d, const struct vpci_msix *msix,
 static int cf_check msix_read(
     struct vcpu *v, unsigned long addr, unsigned int len, unsigned long *data)
 {
-    const struct domain *d = v->domain;
-    struct vpci_msix *msix = msix_find(d, addr);
+    struct domain *d = v->domain;
+    struct vpci_msix *msix;
     const struct vpci_msix_entry *entry;
     unsigned int offset;
 
     *data = ~0UL;
 
+    read_lock(&d->pci_lock);
+
+    msix = msix_find(d, addr);
     if ( !msix )
+    {
+        read_unlock(&d->pci_lock);
         return X86EMUL_RETRY;
+    }
 
     if ( adjacent_handle(msix, addr) )
-        return adjacent_read(d, msix, addr, len, data);
+    {
+        int rc = adjacent_read(d, msix, addr, len, data);
+
+        read_unlock(&d->pci_lock);
+        return rc;
+    }
 
     if ( !access_allowed(msix->pdev, addr, len) )
+    {
+        read_unlock(&d->pci_lock);
         return X86EMUL_OKAY;
+    }
 
     spin_lock(&msix->pdev->vpci->lock);
     entry = get_entry(msix, addr);
@@ -404,6 +426,7 @@ static int cf_check msix_read(
         break;
     }
     spin_unlock(&msix->pdev->vpci->lock);
+    read_unlock(&d->pci_lock);
 
     return X86EMUL_OKAY;
 }
@@ -491,19 +514,33 @@ static int adjacent_write(const struct domain *d, const struct vpci_msix *msix,
 static int cf_check msix_write(
     struct vcpu *v, unsigned long addr, unsigned int len, unsigned long data)
 {
-    const struct domain *d = v->domain;
-    struct vpci_msix *msix = msix_find(d, addr);
+    struct domain *d = v->domain;
+    struct vpci_msix *msix;
     struct vpci_msix_entry *entry;
     unsigned int offset;
 
+    read_lock(&d->pci_lock);
+
+    msix = msix_find(d, addr);
     if ( !msix )
+    {
+        read_unlock(&d->pci_lock);
         return X86EMUL_RETRY;
+    }
 
     if ( adjacent_handle(msix, addr) )
-        return adjacent_write(d, msix, addr, len, data);
+    {
+        int rc = adjacent_write(d, msix, addr, len, data);
+
+        read_unlock(&d->pci_lock);
+        return rc;
+    }
 
     if ( !access_allowed(msix->pdev, addr, len) )
+    {
+        read_unlock(&d->pci_lock);
         return X86EMUL_OKAY;
+    }
 
     spin_lock(&msix->pdev->vpci->lock);
     entry = get_entry(msix, addr);
@@ -579,6 +616,7 @@ static int cf_check msix_write(
         break;
     }
     spin_unlock(&msix->pdev->vpci->lock);
+    read_unlock(&d->pci_lock);
 
     return X86EMUL_OKAY;
 }
diff --git a/xen/drivers/vpci/vpci.c b/xen/drivers/vpci/vpci.c
index 72ef277c4f8e..475272b173f3 100644
--- a/xen/drivers/vpci/vpci.c
+++ b/xen/drivers/vpci/vpci.c
@@ -42,6 +42,8 @@ extern vpci_register_init_t *const __end_vpci_array[];
 
 void vpci_remove_device(struct pci_dev *pdev)
 {
+    ASSERT(rw_is_write_locked(&pdev->domain->pci_lock));
+
     if ( !has_vpci(pdev->domain) || !pdev->vpci )
         return;
 
@@ -77,6 +79,8 @@ int vpci_add_handlers(struct pci_dev *pdev)
     const unsigned long *ro_map;
     int rc = 0;
 
+    ASSERT(rw_is_write_locked(&pdev->domain->pci_lock));
+
     if ( !has_vpci(pdev->domain) )
         return 0;
 
@@ -361,7 +365,7 @@ static uint32_t merge_result(uint32_t data, uint32_t new, unsigned int size,
 
 uint32_t vpci_read(pci_sbdf_t sbdf, unsigned int reg, unsigned int size)
 {
-    const struct domain *d = current->domain;
+    struct domain *d = current->domain;
     const struct pci_dev *pdev;
     const struct vpci_register *r;
     unsigned int data_offset = 0;
@@ -376,12 +380,18 @@ uint32_t vpci_read(pci_sbdf_t sbdf, unsigned int reg, unsigned int size)
     /*
      * Find the PCI dev matching the address, which for hwdom also requires
      * consulting DomXEN.  Passthrough everything that's not trapped.
+     * If this is hwdom and the device is assigned to DomXEN, acquiring hwdom's
+     * pci_lock is sufficient.
      */
+    read_lock(&d->pci_lock);
     pdev = pci_get_pdev(d, sbdf);
     if ( !pdev && is_hardware_domain(d) )
         pdev = pci_get_pdev(dom_xen, sbdf);
     if ( !pdev || !pdev->vpci )
+    {
+        read_unlock(&d->pci_lock);
         return vpci_read_hw(sbdf, reg, size);
+    }
 
     spin_lock(&pdev->vpci->lock);
 
@@ -428,6 +438,7 @@ uint32_t vpci_read(pci_sbdf_t sbdf, unsigned int reg, unsigned int size)
         ASSERT(data_offset < size);
     }
     spin_unlock(&pdev->vpci->lock);
+    read_unlock(&d->pci_lock);
 
     if ( data_offset < size )
     {
@@ -470,7 +481,7 @@ static void vpci_write_helper(const struct pci_dev *pdev,
 void vpci_write(pci_sbdf_t sbdf, unsigned int reg, unsigned int size,
                 uint32_t data)
 {
-    const struct domain *d = current->domain;
+    struct domain *d = current->domain;
     const struct pci_dev *pdev;
     const struct vpci_register *r;
     unsigned int data_offset = 0;
@@ -484,7 +495,13 @@ void vpci_write(pci_sbdf_t sbdf, unsigned int reg, unsigned int size,
     /*
      * Find the PCI dev matching the address, which for hwdom also requires
      * consulting DomXEN.  Passthrough everything that's not trapped.
+     * If this is hwdom and the device is assigned to DomXEN, acquiring hwdom's
+     * pci_lock is sufficient.
+     *
+     * TODO: We need to take pci_locks in exclusive mode only if we
+     * are modifying BARs, so there is a room for improvement.
      */
+    write_lock(&d->pci_lock);
     pdev = pci_get_pdev(d, sbdf);
     if ( !pdev && is_hardware_domain(d) )
         pdev = pci_get_pdev(dom_xen, sbdf);
@@ -493,6 +510,8 @@ void vpci_write(pci_sbdf_t sbdf, unsigned int reg, unsigned int size,
         /* Ignore writes to read-only devices, which have no ->vpci. */
         const unsigned long *ro_map = pci_get_ro_map(sbdf.seg);
 
+        write_unlock(&d->pci_lock);
+
         if ( !ro_map || !test_bit(sbdf.bdf, ro_map) )
             vpci_write_hw(sbdf, reg, size, data);
         return;
@@ -534,6 +553,7 @@ void vpci_write(pci_sbdf_t sbdf, unsigned int reg, unsigned int size,
         ASSERT(data_offset < size);
     }
     spin_unlock(&pdev->vpci->lock);
+    write_unlock(&d->pci_lock);
 
     if ( data_offset < size )
         /* Tailing gap, write the remaining. */
diff --git a/xen/include/xen/sched.h b/xen/include/xen/sched.h
index 9da91e0e6244..37f5922f3206 100644
--- a/xen/include/xen/sched.h
+++ b/xen/include/xen/sched.h
@@ -462,7 +462,8 @@ struct domain
 #ifdef CONFIG_HAS_PCI
     struct list_head pdev_list;
     /*
-     * pci_lock protects access to pdev_list.
+     * pci_lock protects access to pdev_list. pci_lock also protects pdev->vpci
+     * structure from being removed.
      *
      * Any user *reading* from pdev_list, or from devices stored in pdev_list,
      * should hold either pcidevs_lock() or pci_lock in read mode. Optionally,

base-commit: c2ce3466472e9c9eda79f5dc98eb701bc6fdba20
-- 
2.43.0



^ permalink raw reply related	[flat|nested] 68+ messages in thread

* [PATCH v12.2 09/15] vpci/header: program p2m with guest BAR view
  2024-01-09 21:51 ` [PATCH v12 09/15] vpci/header: program p2m with guest BAR view Stewart Hildebrand
  2024-01-12 15:06   ` Roger Pau Monné
@ 2024-01-15 19:44   ` Stewart Hildebrand
  2024-01-17  3:01     ` Stewart Hildebrand
  2024-01-19 14:28   ` [PATCH v12.3 " Stewart Hildebrand
  2 siblings, 1 reply; 68+ messages in thread
From: Stewart Hildebrand @ 2024-01-15 19:44 UTC (permalink / raw)
  To: xen-devel
  Cc: Oleksandr Andrushchenko, Roger Pau Monné, Daniel P. Smith,
	Volodymyr Babchuk, Stewart Hildebrand

From: Oleksandr Andrushchenko <oleksandr_andrushchenko@epam.com>

Take into account guest's BAR view and program its p2m accordingly:
gfn is guest's view of the BAR and mfn is the physical BAR value.
This way hardware domain sees physical BAR values and guest sees
emulated ones.

Hardware domain continues getting the BARs identity mapped, while for
domUs the BARs are mapped at the requested guest address without
modifying the BAR address in the device PCI config space.

Signed-off-by: Oleksandr Andrushchenko <oleksandr_andrushchenko@epam.com>
Signed-off-by: Volodymyr Babchuk <volodymyr_babchuk@epam.com>
Signed-off-by: Stewart Hildebrand <stewart.hildebrand@amd.com>
Reviewed-by: Roger Pau Monné <roger.pau@citrix.com>
---
In v12.2:
- Slightly tweak print format in modify_bars()
In v12.1:
- ASSERT(rangeset_is_empty()) in modify_bars()
- Fixup print format in modify_bars()
- Make comment single line in bar_write()
- Add Roger's R-b
In v12:
- Update guest_addr in rom_write()
- Use unsigned long for start_mfn and map_mfn to reduce mfn_x() calls
- Use existing vmsix_table_*() functions
- Change vmsix_table_base() to use .guest_addr
In v11:
- Add vmsix_guest_table_addr() and vmsix_guest_table_base() functions
  to access guest's view of the VMSIx tables.
- Use MFN (not GFN) to check access permissions
- Move page offset check to this patch
- Call rangeset_remove_range() with correct parameters
In v10:
- Moved GFN variable definition outside the loop in map_range()
- Updated printk error message in map_range()
- Now BAR address is always stored in bar->guest_addr, even for
  HW dom, this removes bunch of ugly is_hwdom() checks in modify_bars()
- vmsix_table_base() now uses .guest_addr instead of .addr
In v9:
- Extended the commit message
- Use bar->guest_addr in modify_bars
- Extended printk error message in map_range
- Moved map_data initialization so .bar can be initialized during declaration
Since v5:
- remove debug print in map_range callback
- remove "identity" from the debug print
Since v4:
- moved start_{gfn|mfn} calculation into map_range
- pass vpci_bar in the map_data instead of start_{gfn|mfn}
- s/guest_addr/guest_reg
Since v3:
- updated comment (Roger)
- removed gfn_add(map->start_gfn, rc); which is wrong
- use v->domain instead of v->vpci.pdev->domain
- removed odd e.g. in comment
- s/d%d/%pd in altered code
- use gdprintk for map/unmap logs
Since v2:
- improve readability for data.start_gfn and restructure ?: construct
Since v1:
 - s/MSI/MSI-X in comments
---
 xen/drivers/vpci/header.c | 80 ++++++++++++++++++++++++++++++---------
 xen/include/xen/vpci.h    |  3 +-
 2 files changed, 65 insertions(+), 18 deletions(-)

diff --git a/xen/drivers/vpci/header.c b/xen/drivers/vpci/header.c
index feccd070ddd0..8483404c5e91 100644
--- a/xen/drivers/vpci/header.c
+++ b/xen/drivers/vpci/header.c
@@ -34,6 +34,7 @@
 
 struct map_data {
     struct domain *d;
+    const struct vpci_bar *bar;
     bool map;
 };
 
@@ -41,13 +42,24 @@ static int cf_check map_range(
     unsigned long s, unsigned long e, void *data, unsigned long *c)
 {
     const struct map_data *map = data;
+    /* Start address of the BAR as seen by the guest. */
+    unsigned long start_gfn = PFN_DOWN(map->bar->guest_addr);
+    /* Physical start address of the BAR. */
+    unsigned long start_mfn = PFN_DOWN(map->bar->addr);
     int rc;
 
     for ( ; ; )
     {
         unsigned long size = e - s + 1;
+        /*
+         * Ranges to be mapped don't always start at the BAR start address, as
+         * there can be holes or partially consumed ranges. Account for the
+         * offset of the current address from the BAR start.
+         */
+        unsigned long map_mfn = start_mfn + s - start_gfn;
+        unsigned long m_end = map_mfn + size - 1;
 
-        if ( !iomem_access_permitted(map->d, s, e) )
+        if ( !iomem_access_permitted(map->d, map_mfn, m_end) )
         {
             printk(XENLOG_G_WARNING
                    "%pd denied access to MMIO range [%#lx, %#lx]\n",
@@ -55,7 +67,8 @@ static int cf_check map_range(
             return -EPERM;
         }
 
-        rc = xsm_iomem_mapping(XSM_HOOK, map->d, s, e, map->map);
+        rc = xsm_iomem_mapping(XSM_HOOK, map->d, map_mfn, m_end,
+                               map->map);
         if ( rc )
         {
             printk(XENLOG_G_WARNING
@@ -73,8 +86,8 @@ static int cf_check map_range(
          * - {un}map_mmio_regions doesn't support preemption.
          */
 
-        rc = map->map ? map_mmio_regions(map->d, _gfn(s), size, _mfn(s))
-                      : unmap_mmio_regions(map->d, _gfn(s), size, _mfn(s));
+        rc = map->map ? map_mmio_regions(map->d, _gfn(s), size, _mfn(map_mfn))
+                      : unmap_mmio_regions(map->d, _gfn(s), size, _mfn(map_mfn));
         if ( rc == 0 )
         {
             *c += size;
@@ -83,8 +96,9 @@ static int cf_check map_range(
         if ( rc < 0 )
         {
             printk(XENLOG_G_WARNING
-                   "Failed to identity %smap [%lx, %lx] for d%d: %d\n",
-                   map->map ? "" : "un", s, e, map->d->domain_id, rc);
+                   "Failed to %smap [%lx %lx] -> [%lx %lx] for %pd: %d\n",
+                   map->map ? "" : "un", s, e, map_mfn,
+                   map_mfn + size, map->d, rc);
             break;
         }
         ASSERT(rc < size);
@@ -163,10 +177,6 @@ static void modify_decoding(const struct pci_dev *pdev, uint16_t cmd,
 bool vpci_process_pending(struct vcpu *v)
 {
     struct pci_dev *pdev = v->vpci.pdev;
-    struct map_data data = {
-        .d = v->domain,
-        .map = v->vpci.cmd & PCI_COMMAND_MEMORY,
-    };
     struct vpci_header *header = NULL;
     unsigned int i;
 
@@ -186,6 +196,11 @@ bool vpci_process_pending(struct vcpu *v)
     for ( i = 0; i < ARRAY_SIZE(header->bars); i++ )
     {
         struct vpci_bar *bar = &header->bars[i];
+        struct map_data data = {
+            .d = v->domain,
+            .map = v->vpci.cmd & PCI_COMMAND_MEMORY,
+            .bar = bar,
+        };
         int rc;
 
         if ( rangeset_is_empty(bar->mem) )
@@ -236,7 +251,6 @@ bool vpci_process_pending(struct vcpu *v)
 static int __init apply_map(struct domain *d, const struct pci_dev *pdev,
                             uint16_t cmd)
 {
-    struct map_data data = { .d = d, .map = true };
     struct vpci_header *header = &pdev->vpci->header;
     int rc = 0;
     unsigned int i;
@@ -246,6 +260,7 @@ static int __init apply_map(struct domain *d, const struct pci_dev *pdev,
     for ( i = 0; i < ARRAY_SIZE(header->bars); i++ )
     {
         struct vpci_bar *bar = &header->bars[i];
+        struct map_data data = { .d = d, .map = true, .bar = bar };
 
         if ( rangeset_is_empty(bar->mem) )
             continue;
@@ -311,12 +326,16 @@ static int modify_bars(const struct pci_dev *pdev, uint16_t cmd, bool rom_only)
      * First fill the rangesets with the BAR of this device or with the ROM
      * BAR only, depending on whether the guest is toggling the memory decode
      * bit of the command register, or the enable bit of the ROM BAR register.
+     *
+     * For non-hardware domain we use guest physical addresses.
      */
     for ( i = 0; i < ARRAY_SIZE(header->bars); i++ )
     {
         struct vpci_bar *bar = &header->bars[i];
         unsigned long start = PFN_DOWN(bar->addr);
         unsigned long end = PFN_DOWN(bar->addr + bar->size - 1);
+        unsigned long start_guest = PFN_DOWN(bar->guest_addr);
+        unsigned long end_guest = PFN_DOWN(bar->guest_addr + bar->size - 1);
 
         if ( !bar->mem )
             continue;
@@ -336,11 +355,26 @@ static int modify_bars(const struct pci_dev *pdev, uint16_t cmd, bool rom_only)
             continue;
         }
 
-        rc = rangeset_add_range(bar->mem, start, end);
+        ASSERT(rangeset_is_empty(bar->mem));
+
+        /*
+         * Make sure that the guest set address has the same page offset
+         * as the physical address on the host or otherwise things won't work as
+         * expected.
+         */
+        if ( PAGE_OFFSET(bar->guest_addr) != PAGE_OFFSET(bar->addr) )
+        {
+            gprintk(XENLOG_G_WARNING,
+                    "%pp: can't map BAR%u - offset mismatch: %#lx vs %#lx\n",
+                    &pdev->sbdf, i, bar->guest_addr, bar->addr);
+            return -EINVAL;
+        }
+
+        rc = rangeset_add_range(bar->mem, start_guest, end_guest);
         if ( rc )
         {
             printk(XENLOG_G_WARNING "Failed to add [%lx, %lx]: %d\n",
-                   start, end, rc);
+                   start_guest, end_guest, rc);
             return rc;
         }
 
@@ -352,12 +386,12 @@ static int modify_bars(const struct pci_dev *pdev, uint16_t cmd, bool rom_only)
             if ( rangeset_is_empty(prev_bar->mem) )
                 continue;
 
-            rc = rangeset_remove_range(prev_bar->mem, start, end);
+            rc = rangeset_remove_range(prev_bar->mem, start_guest, end_guest);
             if ( rc )
             {
                 gprintk(XENLOG_WARNING,
                        "%pp: failed to remove overlapping range [%lx, %lx]: %d\n",
-                        &pdev->sbdf, start, end, rc);
+                        &pdev->sbdf, start_guest, end_guest, rc);
                 return rc;
             }
         }
@@ -425,8 +459,8 @@ static int modify_bars(const struct pci_dev *pdev, uint16_t cmd, bool rom_only)
             for ( i = 0; i < ARRAY_SIZE(tmp->vpci->header.bars); i++ )
             {
                 const struct vpci_bar *remote_bar = &tmp->vpci->header.bars[i];
-                unsigned long start = PFN_DOWN(remote_bar->addr);
-                unsigned long end = PFN_DOWN(remote_bar->addr +
+                unsigned long start = PFN_DOWN(remote_bar->guest_addr);
+                unsigned long end = PFN_DOWN(remote_bar->guest_addr +
                                              remote_bar->size - 1);
 
                 if ( !remote_bar->enabled )
@@ -513,6 +547,8 @@ static void cf_check bar_write(
     struct vpci_bar *bar = data;
     bool hi = false;
 
+    ASSERT(is_hardware_domain(pdev->domain));
+
     if ( bar->type == VPCI_BAR_MEM64_HI )
     {
         ASSERT(reg > PCI_BASE_ADDRESS_0);
@@ -543,6 +579,8 @@ static void cf_check bar_write(
      */
     bar->addr &= ~(0xffffffffULL << (hi ? 32 : 0));
     bar->addr |= (uint64_t)val << (hi ? 32 : 0);
+    /* Update guest address, so hardware domain BAR is identity mapped. */
+    bar->guest_addr = bar->addr;
 
     /* Make sure Xen writes back the same value for the BAR RO bits. */
     if ( !hi )
@@ -639,11 +677,14 @@ static void cf_check rom_write(
     }
 
     if ( !rom->enabled )
+    {
         /*
          * If the ROM BAR is not mapped update the address field so the
          * correct address is mapped into the p2m.
          */
         rom->addr = val & PCI_ROM_ADDRESS_MASK;
+        rom->guest_addr = rom->addr;
+    }
 
     if ( !header->bars_mapped || rom->enabled == new_enabled )
     {
@@ -667,7 +708,10 @@ static void cf_check rom_write(
         return;
 
     if ( !new_enabled )
+    {
         rom->addr = val & PCI_ROM_ADDRESS_MASK;
+        rom->guest_addr = rom->addr;
+    }
 }
 
 static int bar_add_rangeset(const struct pci_dev *pdev, struct vpci_bar *bar,
@@ -862,6 +906,7 @@ static int cf_check init_header(struct pci_dev *pdev)
         }
 
         bars[i].addr = addr;
+        bars[i].guest_addr = addr;
         bars[i].size = size;
         bars[i].prefetchable = val & PCI_BASE_ADDRESS_MEM_PREFETCH;
 
@@ -884,6 +929,7 @@ static int cf_check init_header(struct pci_dev *pdev)
         rom->type = VPCI_BAR_ROM;
         rom->size = size;
         rom->addr = addr;
+        rom->guest_addr = addr;
         header->rom_enabled = pci_conf_read32(pdev->sbdf, rom_reg) &
                               PCI_ROM_ADDRESS_ENABLE;
 
diff --git a/xen/include/xen/vpci.h b/xen/include/xen/vpci.h
index 817ee9ee7300..e89c571890b2 100644
--- a/xen/include/xen/vpci.h
+++ b/xen/include/xen/vpci.h
@@ -216,7 +216,8 @@ int vpci_msix_arch_print(const struct vpci_msix *msix);
  */
 static inline paddr_t vmsix_table_base(const struct vpci *vpci, unsigned int nr)
 {
-    return vpci->header.bars[vpci->msix->tables[nr] & PCI_MSIX_BIRMASK].addr;
+    return vpci->header.bars[vpci->msix->tables[nr] &
+                             PCI_MSIX_BIRMASK].guest_addr;
 }
 
 static inline paddr_t vmsix_table_addr(const struct vpci *vpci, unsigned int nr)
-- 
2.43.0



^ permalink raw reply related	[flat|nested] 68+ messages in thread

* Re: [PATCH v12.2 09/15] vpci/header: program p2m with guest BAR view
  2024-01-15 19:44   ` [PATCH v12.2 " Stewart Hildebrand
@ 2024-01-17  3:01     ` Stewart Hildebrand
  2024-01-19 13:43       ` Roger Pau Monné
  0 siblings, 1 reply; 68+ messages in thread
From: Stewart Hildebrand @ 2024-01-17  3:01 UTC (permalink / raw)
  To: xen-devel
  Cc: Oleksandr Andrushchenko, Roger Pau Monné, Daniel P. Smith,
	Volodymyr Babchuk

On 1/15/24 14:44, Stewart Hildebrand wrote:
> diff --git a/xen/drivers/vpci/header.c b/xen/drivers/vpci/header.c
> index feccd070ddd0..8483404c5e91 100644
> --- a/xen/drivers/vpci/header.c
> +++ b/xen/drivers/vpci/header.c
> @@ -41,13 +42,24 @@ static int cf_check map_range(
>      unsigned long s, unsigned long e, void *data, unsigned long *c)
>  {
>      const struct map_data *map = data;
> +    /* Start address of the BAR as seen by the guest. */
> +    unsigned long start_gfn = PFN_DOWN(map->bar->guest_addr);
> +    /* Physical start address of the BAR. */
> +    unsigned long start_mfn = PFN_DOWN(map->bar->addr);
>      int rc;
>  
>      for ( ; ; )
>      {
>          unsigned long size = e - s + 1;
> +        /*
> +         * Ranges to be mapped don't always start at the BAR start address, as
> +         * there can be holes or partially consumed ranges. Account for the
> +         * offset of the current address from the BAR start.
> +         */
> +        unsigned long map_mfn = start_mfn + s - start_gfn;
> +        unsigned long m_end = map_mfn + size - 1;
>  
> -        if ( !iomem_access_permitted(map->d, s, e) )
> +        if ( !iomem_access_permitted(map->d, map_mfn, m_end) )

Nit: since this check will now use map_mfn and m_end...

>          {
>              printk(XENLOG_G_WARNING
>                     "%pd denied access to MMIO range [%#lx, %#lx]\n",
>                     map->d, s, e);

... I'd like to also update the arguments passed to this print statement.


^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [PATCH v12 14/15] xen/arm: vpci: permit access to guest vpci space
  2024-01-09 21:51 ` [PATCH v12 14/15] xen/arm: vpci: permit access to guest vpci space Stewart Hildebrand
@ 2024-01-17  3:03   ` Stewart Hildebrand
  0 siblings, 0 replies; 68+ messages in thread
From: Stewart Hildebrand @ 2024-01-17  3:03 UTC (permalink / raw)
  To: xen-devel

On 1/9/24 16:51, Stewart Hildebrand wrote:
> Move iomem_caps initialization earlier (before arch_domain_create()).
> 
> Signed-off-by: Stewart Hildebrand <stewart.hildebrand@amd.com>

Since the iomem_access_permitted() check over in ("vpci/header: program p2m with guest BAR view") was changed to use MFNs (it used GFNs in an earlier rev) this whole patch should be dropped. The toolstack already does what this patch was trying to do with XEN_DOMCTL_iomem_permission.

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [PATCH v12.2 01/15] vpci: use per-domain PCI lock to protect vpci structure
  2024-01-15 19:43   ` [PATCH v12.2 " Stewart Hildebrand
@ 2024-01-19 13:42     ` Roger Pau Monné
  2024-01-23 14:26     ` Jan Beulich
                       ` (2 subsequent siblings)
  3 siblings, 0 replies; 68+ messages in thread
From: Roger Pau Monné @ 2024-01-19 13:42 UTC (permalink / raw)
  To: Stewart Hildebrand
  Cc: xen-devel, Oleksandr Andrushchenko, Jan Beulich, Andrew Cooper,
	Wei Liu, George Dunlap, Julien Grall, Stefano Stabellini,
	Jun Nakajima, Kevin Tian, Paul Durrant, Volodymyr Babchuk

On Mon, Jan 15, 2024 at 02:43:08PM -0500, Stewart Hildebrand wrote:
> From: Oleksandr Andrushchenko <oleksandr_andrushchenko@epam.com>
> 
> Use the per-domain PCI read/write lock to protect the presence of the
> pci device vpci field. This lock can be used (and in a few cases is used
> right away) so that vpci removal can be performed while holding the lock
> in write mode. Previously such removal could race with vpci_read for
> example.
> 
> When taking both d->pci_lock and pdev->vpci->lock, they should be
> taken in this exact order: d->pci_lock then pdev->vpci->lock to avoid
> possible deadlock situations.
> 
> 1. Per-domain's pci_lock is used to protect pdev->vpci structure
> from being removed.
> 
> 2. Writing the command register and ROM BAR register may trigger
> modify_bars to run, which in turn may access multiple pdevs while
> checking for the existing BAR's overlap. The overlapping check, if
> done under the read lock, requires vpci->lock to be acquired on both
> devices being compared, which may produce a deadlock. It is not
> possible to upgrade read lock to write lock in such a case. So, in
> order to prevent the deadlock, use d->pci_lock in write mode instead.
> 
> All other code, which doesn't lead to pdev->vpci destruction and does
> not access multiple pdevs at the same time, can still use a
> combination of the read lock and pdev->vpci->lock.
> 
> 3. Drop const qualifier where the new rwlock is used and this is
> appropriate.
> 
> 4. Do not call process_pending_softirqs with any locks held. For that
> unlock prior the call and re-acquire the locks after. After
> re-acquiring the lock there is no need to check if pdev->vpci exists:
>  - in apply_map because of the context it is called (no race condition
>    possible)
>  - for MSI/MSI-X debug code because it is called at the end of
>    pdev->vpci access and no further access to pdev->vpci is made
> 
> 5. Use d->pci_lock around for_each_pdev and pci_get_pdev_by_domain
> while accessing pdevs in vpci code.
> 
> Suggested-by: Roger Pau Monné <roger.pau@citrix.com>
> Suggested-by: Jan Beulich <jbeulich@suse.com>
> Signed-off-by: Oleksandr Andrushchenko <oleksandr_andrushchenko@epam.com>
> Signed-off-by: Volodymyr Babchuk <volodymyr_babchuk@epam.com>
> Signed-off-by: Stewart Hildebrand <stewart.hildebrand@amd.com>

Reviewed-by: Roger Pau Monné <roger.pau@citrix.com>

Thanks, Roger.


^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [PATCH v12.2 09/15] vpci/header: program p2m with guest BAR view
  2024-01-17  3:01     ` Stewart Hildebrand
@ 2024-01-19 13:43       ` Roger Pau Monné
  0 siblings, 0 replies; 68+ messages in thread
From: Roger Pau Monné @ 2024-01-19 13:43 UTC (permalink / raw)
  To: Stewart Hildebrand
  Cc: xen-devel, Oleksandr Andrushchenko, Daniel P. Smith,
	Volodymyr Babchuk

On Tue, Jan 16, 2024 at 10:01:24PM -0500, Stewart Hildebrand wrote:
> On 1/15/24 14:44, Stewart Hildebrand wrote:
> > diff --git a/xen/drivers/vpci/header.c b/xen/drivers/vpci/header.c
> > index feccd070ddd0..8483404c5e91 100644
> > --- a/xen/drivers/vpci/header.c
> > +++ b/xen/drivers/vpci/header.c
> > @@ -41,13 +42,24 @@ static int cf_check map_range(
> >      unsigned long s, unsigned long e, void *data, unsigned long *c)
> >  {
> >      const struct map_data *map = data;
> > +    /* Start address of the BAR as seen by the guest. */
> > +    unsigned long start_gfn = PFN_DOWN(map->bar->guest_addr);
> > +    /* Physical start address of the BAR. */
> > +    unsigned long start_mfn = PFN_DOWN(map->bar->addr);
> >      int rc;
> >  
> >      for ( ; ; )
> >      {
> >          unsigned long size = e - s + 1;
> > +        /*
> > +         * Ranges to be mapped don't always start at the BAR start address, as
> > +         * there can be holes or partially consumed ranges. Account for the
> > +         * offset of the current address from the BAR start.
> > +         */
> > +        unsigned long map_mfn = start_mfn + s - start_gfn;
> > +        unsigned long m_end = map_mfn + size - 1;
> >  
> > -        if ( !iomem_access_permitted(map->d, s, e) )
> > +        if ( !iomem_access_permitted(map->d, map_mfn, m_end) )
> 
> Nit: since this check will now use map_mfn and m_end...
> 
> >          {
> >              printk(XENLOG_G_WARNING
> >                     "%pd denied access to MMIO range [%#lx, %#lx]\n",
> >                     map->d, s, e);
> 
> ... I'd like to also update the arguments passed to this print statement.

Indeed.  You will need a new version.

Thanks, Roger.


^ permalink raw reply	[flat|nested] 68+ messages in thread

* [PATCH v12.3 09/15] vpci/header: program p2m with guest BAR view
  2024-01-09 21:51 ` [PATCH v12 09/15] vpci/header: program p2m with guest BAR view Stewart Hildebrand
  2024-01-12 15:06   ` Roger Pau Monné
  2024-01-15 19:44   ` [PATCH v12.2 " Stewart Hildebrand
@ 2024-01-19 14:28   ` Stewart Hildebrand
  2 siblings, 0 replies; 68+ messages in thread
From: Stewart Hildebrand @ 2024-01-19 14:28 UTC (permalink / raw)
  To: xen-devel
  Cc: Oleksandr Andrushchenko, Roger Pau Monné, Daniel P. Smith,
	Volodymyr Babchuk, Stewart Hildebrand

From: Oleksandr Andrushchenko <oleksandr_andrushchenko@epam.com>

Take into account guest's BAR view and program its p2m accordingly:
gfn is guest's view of the BAR and mfn is the physical BAR value.
This way hardware domain sees physical BAR values and guest sees
emulated ones.

Hardware domain continues getting the BARs identity mapped, while for
domUs the BARs are mapped at the requested guest address without
modifying the BAR address in the device PCI config space.

Signed-off-by: Oleksandr Andrushchenko <oleksandr_andrushchenko@epam.com>
Signed-off-by: Volodymyr Babchuk <volodymyr_babchuk@epam.com>
Signed-off-by: Stewart Hildebrand <stewart.hildebrand@amd.com>
Reviewed-by: Roger Pau Monné <roger.pau@citrix.com>
---
In v12.3:
- Update arguments passed to permission error prints in map_range()
In v12.2:
- Slightly tweak print format in modify_bars()
In v12.1:
- ASSERT(rangeset_is_empty()) in modify_bars()
- Fixup print format in modify_bars()
- Make comment single line in bar_write()
- Add Roger's R-b
In v12:
- Update guest_addr in rom_write()
- Use unsigned long for start_mfn and map_mfn to reduce mfn_x() calls
- Use existing vmsix_table_*() functions
- Change vmsix_table_base() to use .guest_addr
In v11:
- Add vmsix_guest_table_addr() and vmsix_guest_table_base() functions
  to access guest's view of the VMSIx tables.
- Use MFN (not GFN) to check access permissions
- Move page offset check to this patch
- Call rangeset_remove_range() with correct parameters
In v10:
- Moved GFN variable definition outside the loop in map_range()
- Updated printk error message in map_range()
- Now BAR address is always stored in bar->guest_addr, even for
  HW dom, this removes bunch of ugly is_hwdom() checks in modify_bars()
- vmsix_table_base() now uses .guest_addr instead of .addr
In v9:
- Extended the commit message
- Use bar->guest_addr in modify_bars
- Extended printk error message in map_range
- Moved map_data initialization so .bar can be initialized during declaration
Since v5:
- remove debug print in map_range callback
- remove "identity" from the debug print
Since v4:
- moved start_{gfn|mfn} calculation into map_range
- pass vpci_bar in the map_data instead of start_{gfn|mfn}
- s/guest_addr/guest_reg
Since v3:
- updated comment (Roger)
- removed gfn_add(map->start_gfn, rc); which is wrong
- use v->domain instead of v->vpci.pdev->domain
- removed odd e.g. in comment
- s/d%d/%pd in altered code
- use gdprintk for map/unmap logs
Since v2:
- improve readability for data.start_gfn and restructure ?: construct
Since v1:
 - s/MSI/MSI-X in comments
---
 xen/drivers/vpci/header.c | 84 ++++++++++++++++++++++++++++++---------
 xen/include/xen/vpci.h    |  3 +-
 2 files changed, 67 insertions(+), 20 deletions(-)

diff --git a/xen/drivers/vpci/header.c b/xen/drivers/vpci/header.c
index feccd070ddd0..61f0660a9b0a 100644
--- a/xen/drivers/vpci/header.c
+++ b/xen/drivers/vpci/header.c
@@ -34,6 +34,7 @@
 
 struct map_data {
     struct domain *d;
+    const struct vpci_bar *bar;
     bool map;
 };
 
@@ -41,26 +42,38 @@ static int cf_check map_range(
     unsigned long s, unsigned long e, void *data, unsigned long *c)
 {
     const struct map_data *map = data;
+    /* Start address of the BAR as seen by the guest. */
+    unsigned long start_gfn = PFN_DOWN(map->bar->guest_addr);
+    /* Physical start address of the BAR. */
+    unsigned long start_mfn = PFN_DOWN(map->bar->addr);
     int rc;
 
     for ( ; ; )
     {
         unsigned long size = e - s + 1;
+        /*
+         * Ranges to be mapped don't always start at the BAR start address, as
+         * there can be holes or partially consumed ranges. Account for the
+         * offset of the current address from the BAR start.
+         */
+        unsigned long map_mfn = start_mfn + s - start_gfn;
+        unsigned long m_end = map_mfn + size - 1;
 
-        if ( !iomem_access_permitted(map->d, s, e) )
+        if ( !iomem_access_permitted(map->d, map_mfn, m_end) )
         {
             printk(XENLOG_G_WARNING
                    "%pd denied access to MMIO range [%#lx, %#lx]\n",
-                   map->d, s, e);
+                   map->d, map_mfn, m_end);
             return -EPERM;
         }
 
-        rc = xsm_iomem_mapping(XSM_HOOK, map->d, s, e, map->map);
+        rc = xsm_iomem_mapping(XSM_HOOK, map->d, map_mfn, m_end,
+                               map->map);
         if ( rc )
         {
             printk(XENLOG_G_WARNING
                    "%pd XSM denied access to MMIO range [%#lx, %#lx]: %d\n",
-                   map->d, s, e, rc);
+                   map->d, map_mfn, m_end, rc);
             return rc;
         }
 
@@ -73,8 +86,8 @@ static int cf_check map_range(
          * - {un}map_mmio_regions doesn't support preemption.
          */
 
-        rc = map->map ? map_mmio_regions(map->d, _gfn(s), size, _mfn(s))
-                      : unmap_mmio_regions(map->d, _gfn(s), size, _mfn(s));
+        rc = map->map ? map_mmio_regions(map->d, _gfn(s), size, _mfn(map_mfn))
+                      : unmap_mmio_regions(map->d, _gfn(s), size, _mfn(map_mfn));
         if ( rc == 0 )
         {
             *c += size;
@@ -83,8 +96,9 @@ static int cf_check map_range(
         if ( rc < 0 )
         {
             printk(XENLOG_G_WARNING
-                   "Failed to identity %smap [%lx, %lx] for d%d: %d\n",
-                   map->map ? "" : "un", s, e, map->d->domain_id, rc);
+                   "Failed to %smap [%lx %lx] -> [%lx %lx] for %pd: %d\n",
+                   map->map ? "" : "un", s, e, map_mfn,
+                   map_mfn + size, map->d, rc);
             break;
         }
         ASSERT(rc < size);
@@ -163,10 +177,6 @@ static void modify_decoding(const struct pci_dev *pdev, uint16_t cmd,
 bool vpci_process_pending(struct vcpu *v)
 {
     struct pci_dev *pdev = v->vpci.pdev;
-    struct map_data data = {
-        .d = v->domain,
-        .map = v->vpci.cmd & PCI_COMMAND_MEMORY,
-    };
     struct vpci_header *header = NULL;
     unsigned int i;
 
@@ -186,6 +196,11 @@ bool vpci_process_pending(struct vcpu *v)
     for ( i = 0; i < ARRAY_SIZE(header->bars); i++ )
     {
         struct vpci_bar *bar = &header->bars[i];
+        struct map_data data = {
+            .d = v->domain,
+            .map = v->vpci.cmd & PCI_COMMAND_MEMORY,
+            .bar = bar,
+        };
         int rc;
 
         if ( rangeset_is_empty(bar->mem) )
@@ -236,7 +251,6 @@ bool vpci_process_pending(struct vcpu *v)
 static int __init apply_map(struct domain *d, const struct pci_dev *pdev,
                             uint16_t cmd)
 {
-    struct map_data data = { .d = d, .map = true };
     struct vpci_header *header = &pdev->vpci->header;
     int rc = 0;
     unsigned int i;
@@ -246,6 +260,7 @@ static int __init apply_map(struct domain *d, const struct pci_dev *pdev,
     for ( i = 0; i < ARRAY_SIZE(header->bars); i++ )
     {
         struct vpci_bar *bar = &header->bars[i];
+        struct map_data data = { .d = d, .map = true, .bar = bar };
 
         if ( rangeset_is_empty(bar->mem) )
             continue;
@@ -311,12 +326,16 @@ static int modify_bars(const struct pci_dev *pdev, uint16_t cmd, bool rom_only)
      * First fill the rangesets with the BAR of this device or with the ROM
      * BAR only, depending on whether the guest is toggling the memory decode
      * bit of the command register, or the enable bit of the ROM BAR register.
+     *
+     * For non-hardware domain we use guest physical addresses.
      */
     for ( i = 0; i < ARRAY_SIZE(header->bars); i++ )
     {
         struct vpci_bar *bar = &header->bars[i];
         unsigned long start = PFN_DOWN(bar->addr);
         unsigned long end = PFN_DOWN(bar->addr + bar->size - 1);
+        unsigned long start_guest = PFN_DOWN(bar->guest_addr);
+        unsigned long end_guest = PFN_DOWN(bar->guest_addr + bar->size - 1);
 
         if ( !bar->mem )
             continue;
@@ -336,11 +355,26 @@ static int modify_bars(const struct pci_dev *pdev, uint16_t cmd, bool rom_only)
             continue;
         }
 
-        rc = rangeset_add_range(bar->mem, start, end);
+        ASSERT(rangeset_is_empty(bar->mem));
+
+        /*
+         * Make sure that the guest set address has the same page offset
+         * as the physical address on the host or otherwise things won't work as
+         * expected.
+         */
+        if ( PAGE_OFFSET(bar->guest_addr) != PAGE_OFFSET(bar->addr) )
+        {
+            gprintk(XENLOG_G_WARNING,
+                    "%pp: can't map BAR%u - offset mismatch: %#lx vs %#lx\n",
+                    &pdev->sbdf, i, bar->guest_addr, bar->addr);
+            return -EINVAL;
+        }
+
+        rc = rangeset_add_range(bar->mem, start_guest, end_guest);
         if ( rc )
         {
             printk(XENLOG_G_WARNING "Failed to add [%lx, %lx]: %d\n",
-                   start, end, rc);
+                   start_guest, end_guest, rc);
             return rc;
         }
 
@@ -352,12 +386,12 @@ static int modify_bars(const struct pci_dev *pdev, uint16_t cmd, bool rom_only)
             if ( rangeset_is_empty(prev_bar->mem) )
                 continue;
 
-            rc = rangeset_remove_range(prev_bar->mem, start, end);
+            rc = rangeset_remove_range(prev_bar->mem, start_guest, end_guest);
             if ( rc )
             {
                 gprintk(XENLOG_WARNING,
                        "%pp: failed to remove overlapping range [%lx, %lx]: %d\n",
-                        &pdev->sbdf, start, end, rc);
+                        &pdev->sbdf, start_guest, end_guest, rc);
                 return rc;
             }
         }
@@ -425,8 +459,8 @@ static int modify_bars(const struct pci_dev *pdev, uint16_t cmd, bool rom_only)
             for ( i = 0; i < ARRAY_SIZE(tmp->vpci->header.bars); i++ )
             {
                 const struct vpci_bar *remote_bar = &tmp->vpci->header.bars[i];
-                unsigned long start = PFN_DOWN(remote_bar->addr);
-                unsigned long end = PFN_DOWN(remote_bar->addr +
+                unsigned long start = PFN_DOWN(remote_bar->guest_addr);
+                unsigned long end = PFN_DOWN(remote_bar->guest_addr +
                                              remote_bar->size - 1);
 
                 if ( !remote_bar->enabled )
@@ -513,6 +547,8 @@ static void cf_check bar_write(
     struct vpci_bar *bar = data;
     bool hi = false;
 
+    ASSERT(is_hardware_domain(pdev->domain));
+
     if ( bar->type == VPCI_BAR_MEM64_HI )
     {
         ASSERT(reg > PCI_BASE_ADDRESS_0);
@@ -543,6 +579,8 @@ static void cf_check bar_write(
      */
     bar->addr &= ~(0xffffffffULL << (hi ? 32 : 0));
     bar->addr |= (uint64_t)val << (hi ? 32 : 0);
+    /* Update guest address, so hardware domain BAR is identity mapped. */
+    bar->guest_addr = bar->addr;
 
     /* Make sure Xen writes back the same value for the BAR RO bits. */
     if ( !hi )
@@ -639,11 +677,14 @@ static void cf_check rom_write(
     }
 
     if ( !rom->enabled )
+    {
         /*
          * If the ROM BAR is not mapped update the address field so the
          * correct address is mapped into the p2m.
          */
         rom->addr = val & PCI_ROM_ADDRESS_MASK;
+        rom->guest_addr = rom->addr;
+    }
 
     if ( !header->bars_mapped || rom->enabled == new_enabled )
     {
@@ -667,7 +708,10 @@ static void cf_check rom_write(
         return;
 
     if ( !new_enabled )
+    {
         rom->addr = val & PCI_ROM_ADDRESS_MASK;
+        rom->guest_addr = rom->addr;
+    }
 }
 
 static int bar_add_rangeset(const struct pci_dev *pdev, struct vpci_bar *bar,
@@ -862,6 +906,7 @@ static int cf_check init_header(struct pci_dev *pdev)
         }
 
         bars[i].addr = addr;
+        bars[i].guest_addr = addr;
         bars[i].size = size;
         bars[i].prefetchable = val & PCI_BASE_ADDRESS_MEM_PREFETCH;
 
@@ -884,6 +929,7 @@ static int cf_check init_header(struct pci_dev *pdev)
         rom->type = VPCI_BAR_ROM;
         rom->size = size;
         rom->addr = addr;
+        rom->guest_addr = addr;
         header->rom_enabled = pci_conf_read32(pdev->sbdf, rom_reg) &
                               PCI_ROM_ADDRESS_ENABLE;
 
diff --git a/xen/include/xen/vpci.h b/xen/include/xen/vpci.h
index 817ee9ee7300..e89c571890b2 100644
--- a/xen/include/xen/vpci.h
+++ b/xen/include/xen/vpci.h
@@ -216,7 +216,8 @@ int vpci_msix_arch_print(const struct vpci_msix *msix);
  */
 static inline paddr_t vmsix_table_base(const struct vpci *vpci, unsigned int nr)
 {
-    return vpci->header.bars[vpci->msix->tables[nr] & PCI_MSIX_BIRMASK].addr;
+    return vpci->header.bars[vpci->msix->tables[nr] &
+                             PCI_MSIX_BIRMASK].guest_addr;
 }
 
 static inline paddr_t vmsix_table_addr(const struct vpci *vpci, unsigned int nr)
-- 
2.43.0



^ permalink raw reply related	[flat|nested] 68+ messages in thread

* Re: [PATCH v12.2 01/15] vpci: use per-domain PCI lock to protect vpci structure
  2024-01-15 19:43   ` [PATCH v12.2 " Stewart Hildebrand
  2024-01-19 13:42     ` Roger Pau Monné
@ 2024-01-23 14:26     ` Jan Beulich
  2024-01-23 15:23       ` Roger Pau Monné
  2024-01-23 14:29     ` Jan Beulich
  2024-01-23 14:32     ` Jan Beulich
  3 siblings, 1 reply; 68+ messages in thread
From: Jan Beulich @ 2024-01-23 14:26 UTC (permalink / raw)
  To: Stewart Hildebrand
  Cc: Oleksandr Andrushchenko, Andrew Cooper, Roger Pau Monné,
	Wei Liu, George Dunlap, Julien Grall, Stefano Stabellini,
	Jun Nakajima, Kevin Tian, Paul Durrant, Volodymyr Babchuk,
	xen-devel

On 15.01.2024 20:43, Stewart Hildebrand wrote:
> --- a/xen/arch/x86/hvm/vmsi.c
> +++ b/xen/arch/x86/hvm/vmsi.c
> @@ -468,7 +468,7 @@ int msixtbl_pt_register(struct domain *d, struct pirq *pirq, uint64_t gtable)
>      struct msixtbl_entry *entry, *new_entry;
>      int r = -EINVAL;
>  
> -    ASSERT(pcidevs_locked());
> +    ASSERT(pcidevs_locked() || rw_is_locked(&d->pci_lock));
>      ASSERT(rw_is_write_locked(&d->event_lock));
>  
>      if ( !msixtbl_initialised(d) )
> @@ -538,7 +538,7 @@ void msixtbl_pt_unregister(struct domain *d, struct pirq *pirq)
>      struct pci_dev *pdev;
>      struct msixtbl_entry *entry;
>  
> -    ASSERT(pcidevs_locked());
> +    ASSERT(pcidevs_locked() || rw_is_locked(&d->pci_lock));
>      ASSERT(rw_is_write_locked(&d->event_lock));

I was hoping to just ack this patch, but the two changes above look
questionable to me: How can it be that holding _either_ lock is okay?
It's not obvious in this context that consumers have to hold both
locks now. In fact consumers looks to be the callers of
msixtbl_find_entry(), yet the list is RCU-protected. Whereas races
against themselves or against one another are avoided by holding
d->event_lock.

My only guess then for the original need of holding pcidevs_lock is
the use of msi_desc->dev, with the desire for the device to not go
away. Yet the description doesn't talk about interactions of the per-
domain PCI lock with that one at all; it all circles around the
domain'd vPCI lock.

Feels like I'm missing something that's obvious to everyone else.
Or maybe this part of the patch is actually unrelated, and should be
split off (with its own [proper] justification)? Or wouldn't it then
be better to also change the other paths leading here to acquire the
per-domain PCI lock?

> --- a/xen/arch/x86/hvm/vmx/vmx.c
> +++ b/xen/arch/x86/hvm/vmx/vmx.c
> @@ -413,7 +413,7 @@ static int cf_check vmx_pi_update_irte(const struct vcpu *v,
>  
>      spin_unlock_irq(&desc->lock);
>  
> -    ASSERT(pcidevs_locked());
> +    ASSERT(pcidevs_locked() || rw_is_locked(&msi_desc->dev->domain->pci_lock));
>  
>      return iommu_update_ire_from_msi(msi_desc, &msi_desc->msg);

This then falls in the same category. And apparently there are more.

Jan

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [PATCH v12.2 01/15] vpci: use per-domain PCI lock to protect vpci structure
  2024-01-15 19:43   ` [PATCH v12.2 " Stewart Hildebrand
  2024-01-19 13:42     ` Roger Pau Monné
  2024-01-23 14:26     ` Jan Beulich
@ 2024-01-23 14:29     ` Jan Beulich
  2024-01-24  5:07       ` Stewart Hildebrand
  2024-01-23 14:32     ` Jan Beulich
  3 siblings, 1 reply; 68+ messages in thread
From: Jan Beulich @ 2024-01-23 14:29 UTC (permalink / raw)
  To: Stewart Hildebrand
  Cc: Oleksandr Andrushchenko, Andrew Cooper, Roger Pau Monné,
	Wei Liu, George Dunlap, Julien Grall, Stefano Stabellini,
	Jun Nakajima, Kevin Tian, Paul Durrant, Volodymyr Babchuk,
	xen-devel

On 15.01.2024 20:43, Stewart Hildebrand wrote:
> @@ -1043,11 +1043,11 @@ static int __pci_enable_msix(struct pci_dev *pdev, struct msi_info *msi,
>  {
>      struct msi_desc *old_desc;
>  
> -    ASSERT(pcidevs_locked());
> -
>      if ( !pdev || !pdev->msix )
>          return -ENODEV;
>  
> +    ASSERT(pcidevs_locked() || rw_is_locked(&pdev->domain->pci_lock));
> +
>      if ( msi->entry_nr >= pdev->msix->nr_entries )
>          return -EINVAL;

Further looking at this - is dereferencing pdev actually safe without holding
the global lock?

Jan


^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [PATCH v12.2 01/15] vpci: use per-domain PCI lock to protect vpci structure
  2024-01-15 19:43   ` [PATCH v12.2 " Stewart Hildebrand
                       ` (2 preceding siblings ...)
  2024-01-23 14:29     ` Jan Beulich
@ 2024-01-23 14:32     ` Jan Beulich
  2024-01-23 15:07       ` Roger Pau Monné
  3 siblings, 1 reply; 68+ messages in thread
From: Jan Beulich @ 2024-01-23 14:32 UTC (permalink / raw)
  To: Stewart Hildebrand
  Cc: Oleksandr Andrushchenko, Andrew Cooper, Roger Pau Monné,
	Wei Liu, George Dunlap, Julien Grall, Stefano Stabellini,
	Jun Nakajima, Kevin Tian, Paul Durrant, Volodymyr Babchuk,
	xen-devel

On 15.01.2024 20:43, Stewart Hildebrand wrote:
> @@ -2888,6 +2888,8 @@ int allocate_and_map_msi_pirq(struct domain *d, int index, int *pirq_p,
>  {
>      int irq, pirq, ret;
>  
> +    ASSERT(pcidevs_locked() || rw_is_locked(&d->pci_lock));

If either lock is sufficient to hold here, ...

> --- a/xen/arch/x86/physdev.c
> +++ b/xen/arch/x86/physdev.c
> @@ -123,7 +123,9 @@ int physdev_map_pirq(domid_t domid, int type, int *index, int *pirq_p,
>  
>      case MAP_PIRQ_TYPE_MSI:
>      case MAP_PIRQ_TYPE_MULTI_MSI:
> +        pcidevs_lock();
>          ret = allocate_and_map_msi_pirq(d, *index, pirq_p, type, msi);
> +        pcidevs_unlock();
>          break;

... why is it the global lock that's being acquired here?

Jan


^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [PATCH v12 03/15] vpci: add hooks for PCI device assign/de-assign
  2024-01-09 21:51 ` [PATCH v12 03/15] vpci: add hooks for PCI device assign/de-assign Stewart Hildebrand
@ 2024-01-23 14:36   ` Jan Beulich
  2024-01-30 19:22   ` Stewart Hildebrand
  1 sibling, 0 replies; 68+ messages in thread
From: Jan Beulich @ 2024-01-23 14:36 UTC (permalink / raw)
  To: Stewart Hildebrand
  Cc: Oleksandr Andrushchenko, Paul Durrant, Roger Pau Monné,
	Volodymyr Babchuk, xen-devel

On 09.01.2024 22:51, Stewart Hildebrand wrote:
> From: Oleksandr Andrushchenko <oleksandr_andrushchenko@epam.com>
> 
> When a PCI device gets assigned/de-assigned we need to
> initialize/de-initialize vPCI state for the device.
> 
> Also, rename vpci_add_handlers() to vpci_assign_device() and
> vpci_remove_device() to vpci_deassign_device() to better reflect role
> of the functions.
> 
> Signed-off-by: Oleksandr Andrushchenko <oleksandr_andrushchenko@epam.com>
> Signed-off-by: Volodymyr Babchuk <volodymyr_babchuk@epam.com>
> Reviewed-by: Roger Pau Monné <roger.pau@citrix.com>
> Signed-off-by: Stewart Hildebrand <stewart.hildebrand@amd.com>

Acked-by: Jan Beulich <jbeulich@suse.com>




^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [PATCH v12.2 01/15] vpci: use per-domain PCI lock to protect vpci structure
  2024-01-23 14:32     ` Jan Beulich
@ 2024-01-23 15:07       ` Roger Pau Monné
  2024-01-24  5:00         ` Stewart Hildebrand
  2024-01-24  8:48         ` Jan Beulich
  0 siblings, 2 replies; 68+ messages in thread
From: Roger Pau Monné @ 2024-01-23 15:07 UTC (permalink / raw)
  To: Jan Beulich
  Cc: Stewart Hildebrand, Oleksandr Andrushchenko, Andrew Cooper,
	Wei Liu, George Dunlap, Julien Grall, Stefano Stabellini,
	Jun Nakajima, Kevin Tian, Paul Durrant, Volodymyr Babchuk,
	xen-devel

On Tue, Jan 23, 2024 at 03:32:12PM +0100, Jan Beulich wrote:
> On 15.01.2024 20:43, Stewart Hildebrand wrote:
> > @@ -2888,6 +2888,8 @@ int allocate_and_map_msi_pirq(struct domain *d, int index, int *pirq_p,
> >  {
> >      int irq, pirq, ret;
> >  
> > +    ASSERT(pcidevs_locked() || rw_is_locked(&d->pci_lock));
> 
> If either lock is sufficient to hold here, ...
> 
> > --- a/xen/arch/x86/physdev.c
> > +++ b/xen/arch/x86/physdev.c
> > @@ -123,7 +123,9 @@ int physdev_map_pirq(domid_t domid, int type, int *index, int *pirq_p,
> >  
> >      case MAP_PIRQ_TYPE_MSI:
> >      case MAP_PIRQ_TYPE_MULTI_MSI:
> > +        pcidevs_lock();
> >          ret = allocate_and_map_msi_pirq(d, *index, pirq_p, type, msi);
> > +        pcidevs_unlock();
> >          break;
> 

IIRC (Stewart can further comment) this is done holding the pcidevs
lock to keep the path unmodified, as there's no need to hold the
per-domain rwlock.

Roger.


^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [PATCH v12.2 01/15] vpci: use per-domain PCI lock to protect vpci structure
  2024-01-23 14:26     ` Jan Beulich
@ 2024-01-23 15:23       ` Roger Pau Monné
  2024-01-24  8:56         ` Jan Beulich
  0 siblings, 1 reply; 68+ messages in thread
From: Roger Pau Monné @ 2024-01-23 15:23 UTC (permalink / raw)
  To: Jan Beulich
  Cc: Stewart Hildebrand, Oleksandr Andrushchenko, Andrew Cooper,
	Wei Liu, George Dunlap, Julien Grall, Stefano Stabellini,
	Jun Nakajima, Kevin Tian, Paul Durrant, Volodymyr Babchuk,
	xen-devel

On Tue, Jan 23, 2024 at 03:26:26PM +0100, Jan Beulich wrote:
> On 15.01.2024 20:43, Stewart Hildebrand wrote:
> > --- a/xen/arch/x86/hvm/vmsi.c
> > +++ b/xen/arch/x86/hvm/vmsi.c
> > @@ -468,7 +468,7 @@ int msixtbl_pt_register(struct domain *d, struct pirq *pirq, uint64_t gtable)
> >      struct msixtbl_entry *entry, *new_entry;
> >      int r = -EINVAL;
> >  
> > -    ASSERT(pcidevs_locked());
> > +    ASSERT(pcidevs_locked() || rw_is_locked(&d->pci_lock));
> >      ASSERT(rw_is_write_locked(&d->event_lock));
> >  
> >      if ( !msixtbl_initialised(d) )
> > @@ -538,7 +538,7 @@ void msixtbl_pt_unregister(struct domain *d, struct pirq *pirq)
> >      struct pci_dev *pdev;
> >      struct msixtbl_entry *entry;
> >  
> > -    ASSERT(pcidevs_locked());
> > +    ASSERT(pcidevs_locked() || rw_is_locked(&d->pci_lock));
> >      ASSERT(rw_is_write_locked(&d->event_lock));
> 
> I was hoping to just ack this patch, but the two changes above look
> questionable to me: How can it be that holding _either_ lock is okay?
> It's not obvious in this context that consumers have to hold both
> locks now. In fact consumers looks to be the callers of
> msixtbl_find_entry(), yet the list is RCU-protected. Whereas races
> against themselves or against one another are avoided by holding
> d->event_lock.

The reason for the change here is that msixtbl_pt_{un,}register() gets
called by pt_irq_{create,destroy}_bind(), which is in turn called by
vPCI code (pcidevs_locked()) that has been switched to not take the
pcidevs lock anymore, and hence the ASSERT would trigger.

> My only guess then for the original need of holding pcidevs_lock is
> the use of msi_desc->dev, with the desire for the device to not go
> away. Yet the description doesn't talk about interactions of the per-
> domain PCI lock with that one at all; it all circles around the
> domain'd vPCI lock.

I do agree that it looks like the original intention of holding
pcidevs_lock is to prevent msi_desc->dev from being removed - yet I'm
not sure it's possible for the device to go away while the domain
event_lock is hold, as device removal would need to take that same
lock in order to destroy the irq_desc.

> Feels like I'm missing something that's obvious to everyone else.
> Or maybe this part of the patch is actually unrelated, and should be
> split off (with its own [proper] justification)? Or wouldn't it then
> be better to also change the other paths leading here to acquire the
> per-domain PCI lock?

Other paths in vPCI vpci_msi_update(), vpci_msi_arch_update(),
vpci_msi_arch_enable()... are switched in this patch to use the
per-domain pci_lock instead of pcidevs lock.

> > --- a/xen/arch/x86/hvm/vmx/vmx.c
> > +++ b/xen/arch/x86/hvm/vmx/vmx.c
> > @@ -413,7 +413,7 @@ static int cf_check vmx_pi_update_irte(const struct vcpu *v,
> >  
> >      spin_unlock_irq(&desc->lock);
> >  
> > -    ASSERT(pcidevs_locked());
> > +    ASSERT(pcidevs_locked() || rw_is_locked(&msi_desc->dev->domain->pci_lock));
> >  
> >      return iommu_update_ire_from_msi(msi_desc, &msi_desc->msg);
> 
> This then falls in the same category. And apparently there are more.

This one is again a result of such function being called from
pt_irq_create_bind() from vPCI code that has been switched to use the
per-domain pci_lock.

IOMMU state is already protected by it's own internal locks, and
doesn't rely on pcidevs lock.  Hence I can also only guess that the
usage here is to prevent the device from being removed.

Regards, Roger.


^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [PATCH v12.2 01/15] vpci: use per-domain PCI lock to protect vpci structure
  2024-01-23 15:07       ` Roger Pau Monné
@ 2024-01-24  5:00         ` Stewart Hildebrand
  2024-01-30 14:59           ` Stewart Hildebrand
  2024-01-24  8:48         ` Jan Beulich
  1 sibling, 1 reply; 68+ messages in thread
From: Stewart Hildebrand @ 2024-01-24  5:00 UTC (permalink / raw)
  To: Roger Pau Monné, Jan Beulich
  Cc: Oleksandr Andrushchenko, Andrew Cooper, Wei Liu, George Dunlap,
	Julien Grall, Stefano Stabellini, Jun Nakajima, Kevin Tian,
	Paul Durrant, Volodymyr Babchuk, xen-devel

On 1/23/24 10:07, Roger Pau Monné wrote:
> On Tue, Jan 23, 2024 at 03:32:12PM +0100, Jan Beulich wrote:
>> On 15.01.2024 20:43, Stewart Hildebrand wrote:
>>> @@ -2888,6 +2888,8 @@ int allocate_and_map_msi_pirq(struct domain *d, int index, int *pirq_p,
>>>  {
>>>      int irq, pirq, ret;
>>>  
>>> +    ASSERT(pcidevs_locked() || rw_is_locked(&d->pci_lock));
>>
>> If either lock is sufficient to hold here, ...
>>
>>> --- a/xen/arch/x86/physdev.c
>>> +++ b/xen/arch/x86/physdev.c
>>> @@ -123,7 +123,9 @@ int physdev_map_pirq(domid_t domid, int type, int *index, int *pirq_p,
>>>  
>>>      case MAP_PIRQ_TYPE_MSI:
>>>      case MAP_PIRQ_TYPE_MULTI_MSI:
>>> +        pcidevs_lock();
>>>          ret = allocate_and_map_msi_pirq(d, *index, pirq_p, type, msi);
>>> +        pcidevs_unlock();
>>>          break;
>>
>> ... why is it the global lock that's being acquired here?
>>
> 
> IIRC (Stewart can further comment) this is done holding the pcidevs
> lock to keep the path unmodified, as there's no need to hold the
> per-domain rwlock.
> 

Although allocate_and_map_msi_pirq() was itself acquiring the global pcidevs_lock() before this patch, we could just as well use read_lock(&d->pci_lock) here instead now. It seems like a good optimization to make, so if there aren't any objections I'll change it to read_lock(&d->pci_lock).


^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [PATCH v12.2 01/15] vpci: use per-domain PCI lock to protect vpci structure
  2024-01-23 14:29     ` Jan Beulich
@ 2024-01-24  5:07       ` Stewart Hildebrand
  2024-01-24  8:21         ` Roger Pau Monné
  2024-01-24  8:50         ` Jan Beulich
  0 siblings, 2 replies; 68+ messages in thread
From: Stewart Hildebrand @ 2024-01-24  5:07 UTC (permalink / raw)
  To: Jan Beulich
  Cc: Oleksandr Andrushchenko, Andrew Cooper, Roger Pau Monné,
	Wei Liu, George Dunlap, Julien Grall, Stefano Stabellini,
	Jun Nakajima, Kevin Tian, Paul Durrant, Volodymyr Babchuk,
	xen-devel

On 1/23/24 09:29, Jan Beulich wrote:
> On 15.01.2024 20:43, Stewart Hildebrand wrote:
>> @@ -1043,11 +1043,11 @@ static int __pci_enable_msix(struct pci_dev *pdev, struct msi_info *msi,
>>  {
>>      struct msi_desc *old_desc;
>>  
>> -    ASSERT(pcidevs_locked());
>> -
>>      if ( !pdev || !pdev->msix )
>>          return -ENODEV;
>>  
>> +    ASSERT(pcidevs_locked() || rw_is_locked(&pdev->domain->pci_lock));
>> +
>>      if ( msi->entry_nr >= pdev->msix->nr_entries )
>>          return -EINVAL;
> 
> Further looking at this - is dereferencing pdev actually safe without holding
> the global lock?

Are you referring to the new placement of the ASSERT, which opens up the possibility that pdev could be dereferenced and the function return before the ASSERT? If that is what you mean, I see your point. The ASSERT was placed there simply because we wanted to check that pdev != NULL first. See prior discussion at [1]. Hmm.. How about splitting the pdev-checking condition? E.g.:

    if ( !pdev )
        return -ENODEV;

    ASSERT(pcidevs_locked() || rw_is_locked(&pdev->domain->pci_lock));

    if ( !pdev->msix )
        return -ENODEV;


[1] https://lore.kernel.org/xen-devel/85a52f8d-d6db-4478-92b1-2b6305769c96@amd.com/


^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [PATCH v12.2 01/15] vpci: use per-domain PCI lock to protect vpci structure
  2024-01-24  5:07       ` Stewart Hildebrand
@ 2024-01-24  8:21         ` Roger Pau Monné
  2024-01-24 20:21           ` Stewart Hildebrand
  2024-01-24  8:50         ` Jan Beulich
  1 sibling, 1 reply; 68+ messages in thread
From: Roger Pau Monné @ 2024-01-24  8:21 UTC (permalink / raw)
  To: Stewart Hildebrand
  Cc: Jan Beulich, Oleksandr Andrushchenko, Andrew Cooper, Wei Liu,
	George Dunlap, Julien Grall, Stefano Stabellini, Jun Nakajima,
	Kevin Tian, Paul Durrant, Volodymyr Babchuk, xen-devel

On Wed, Jan 24, 2024 at 12:07:28AM -0500, Stewart Hildebrand wrote:
> On 1/23/24 09:29, Jan Beulich wrote:
> > On 15.01.2024 20:43, Stewart Hildebrand wrote:
> >> @@ -1043,11 +1043,11 @@ static int __pci_enable_msix(struct pci_dev *pdev, struct msi_info *msi,
> >>  {
> >>      struct msi_desc *old_desc;
> >>  
> >> -    ASSERT(pcidevs_locked());
> >> -
> >>      if ( !pdev || !pdev->msix )
> >>          return -ENODEV;
> >>  
> >> +    ASSERT(pcidevs_locked() || rw_is_locked(&pdev->domain->pci_lock));
> >> +
> >>      if ( msi->entry_nr >= pdev->msix->nr_entries )
> >>          return -EINVAL;
> > 
> > Further looking at this - is dereferencing pdev actually safe without holding
> > the global lock?

It is safe because either the global pcidevs lock or the per-domain
pci_lock will be held, which should prevent the device from being
removed.

> Are you referring to the new placement of the ASSERT, which opens up the possibility that pdev could be dereferenced and the function return before the ASSERT? If that is what you mean, I see your point. The ASSERT was placed there simply because we wanted to check that pdev != NULL first. See prior discussion at [1]. Hmm.. How about splitting the pdev-checking condition? E.g.:
> 
>     if ( !pdev )
>         return -ENODEV;
> 
>     ASSERT(pcidevs_locked() || rw_is_locked(&pdev->domain->pci_lock));
> 
>     if ( !pdev->msix )
>         return -ENODEV;

I'm not specially worried about the position of the assert, those are
just debug messages at the end.

One worry I have after further looking at the code, when called from
ns16550_init_postirq(), does the device have pdev->domain set?

That case would satisfy the first condition of the assert, so won't
attempt to dereference pdev->domain, but still would be good to ensure
consistency here wrt the state of pdev->domain.

Regards, Roger.


^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [PATCH v12.2 01/15] vpci: use per-domain PCI lock to protect vpci structure
  2024-01-23 15:07       ` Roger Pau Monné
  2024-01-24  5:00         ` Stewart Hildebrand
@ 2024-01-24  8:48         ` Jan Beulich
  2024-01-24  9:24           ` Roger Pau Monné
  1 sibling, 1 reply; 68+ messages in thread
From: Jan Beulich @ 2024-01-24  8:48 UTC (permalink / raw)
  To: Roger Pau Monné
  Cc: Stewart Hildebrand, Oleksandr Andrushchenko, Andrew Cooper,
	Wei Liu, George Dunlap, Julien Grall, Stefano Stabellini,
	Jun Nakajima, Kevin Tian, Paul Durrant, Volodymyr Babchuk,
	xen-devel

On 23.01.2024 16:07, Roger Pau Monné wrote:
> On Tue, Jan 23, 2024 at 03:32:12PM +0100, Jan Beulich wrote:
>> On 15.01.2024 20:43, Stewart Hildebrand wrote:
>>> @@ -2888,6 +2888,8 @@ int allocate_and_map_msi_pirq(struct domain *d, int index, int *pirq_p,
>>>  {
>>>      int irq, pirq, ret;
>>>  
>>> +    ASSERT(pcidevs_locked() || rw_is_locked(&d->pci_lock));
>>
>> If either lock is sufficient to hold here, ...
>>
>>> --- a/xen/arch/x86/physdev.c
>>> +++ b/xen/arch/x86/physdev.c
>>> @@ -123,7 +123,9 @@ int physdev_map_pirq(domid_t domid, int type, int *index, int *pirq_p,
>>>  
>>>      case MAP_PIRQ_TYPE_MSI:
>>>      case MAP_PIRQ_TYPE_MULTI_MSI:
>>> +        pcidevs_lock();
>>>          ret = allocate_and_map_msi_pirq(d, *index, pirq_p, type, msi);
>>> +        pcidevs_unlock();
>>>          break;
>>
> 
> IIRC (Stewart can further comment) this is done holding the pcidevs
> lock to keep the path unmodified, as there's no need to hold the
> per-domain rwlock.

Yet why would we prefer to acquire a global lock when a per-domain one
suffices?

Jan


^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [PATCH v12.2 01/15] vpci: use per-domain PCI lock to protect vpci structure
  2024-01-24  5:07       ` Stewart Hildebrand
  2024-01-24  8:21         ` Roger Pau Monné
@ 2024-01-24  8:50         ` Jan Beulich
  1 sibling, 0 replies; 68+ messages in thread
From: Jan Beulich @ 2024-01-24  8:50 UTC (permalink / raw)
  To: Stewart Hildebrand
  Cc: Oleksandr Andrushchenko, Andrew Cooper, Roger Pau Monné,
	Wei Liu, George Dunlap, Julien Grall, Stefano Stabellini,
	Jun Nakajima, Kevin Tian, Paul Durrant, Volodymyr Babchuk,
	xen-devel

On 24.01.2024 06:07, Stewart Hildebrand wrote:
> On 1/23/24 09:29, Jan Beulich wrote:
>> On 15.01.2024 20:43, Stewart Hildebrand wrote:
>>> @@ -1043,11 +1043,11 @@ static int __pci_enable_msix(struct pci_dev *pdev, struct msi_info *msi,
>>>  {
>>>      struct msi_desc *old_desc;
>>>  
>>> -    ASSERT(pcidevs_locked());
>>> -
>>>      if ( !pdev || !pdev->msix )
>>>          return -ENODEV;
>>>  
>>> +    ASSERT(pcidevs_locked() || rw_is_locked(&pdev->domain->pci_lock));
>>> +
>>>      if ( msi->entry_nr >= pdev->msix->nr_entries )
>>>          return -EINVAL;
>>
>> Further looking at this - is dereferencing pdev actually safe without holding
>> the global lock?
> 
> Are you referring to the new placement of the ASSERT, which opens up the possibility that pdev could be dereferenced and the function return before the ASSERT? If that is what you mean, I see your point. The ASSERT was placed there simply because we wanted to check that pdev != NULL first. See prior discussion at [1]. Hmm.. How about splitting the pdev-checking condition? E.g.:
> 
>     if ( !pdev )
>         return -ENODEV;
> 
>     ASSERT(pcidevs_locked() || rw_is_locked(&pdev->domain->pci_lock));
> 
>     if ( !pdev->msix )
>         return -ENODEV;

Yes, this is the particular arrangement I would have expected (at least
partly based on the guessing of the purpose of holding those locks).

Jan

> [1] https://lore.kernel.org/xen-devel/85a52f8d-d6db-4478-92b1-2b6305769c96@amd.com/



^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [PATCH v12.2 01/15] vpci: use per-domain PCI lock to protect vpci structure
  2024-01-23 15:23       ` Roger Pau Monné
@ 2024-01-24  8:56         ` Jan Beulich
  2024-01-24  9:39           ` Roger Pau Monné
  0 siblings, 1 reply; 68+ messages in thread
From: Jan Beulich @ 2024-01-24  8:56 UTC (permalink / raw)
  To: Roger Pau Monné
  Cc: Stewart Hildebrand, Oleksandr Andrushchenko, Andrew Cooper,
	Wei Liu, George Dunlap, Julien Grall, Stefano Stabellini,
	Jun Nakajima, Kevin Tian, Paul Durrant, Volodymyr Babchuk,
	xen-devel

On 23.01.2024 16:23, Roger Pau Monné wrote:
> On Tue, Jan 23, 2024 at 03:26:26PM +0100, Jan Beulich wrote:
>> On 15.01.2024 20:43, Stewart Hildebrand wrote:
>>> --- a/xen/arch/x86/hvm/vmsi.c
>>> +++ b/xen/arch/x86/hvm/vmsi.c
>>> @@ -468,7 +468,7 @@ int msixtbl_pt_register(struct domain *d, struct pirq *pirq, uint64_t gtable)
>>>      struct msixtbl_entry *entry, *new_entry;
>>>      int r = -EINVAL;
>>>  
>>> -    ASSERT(pcidevs_locked());
>>> +    ASSERT(pcidevs_locked() || rw_is_locked(&d->pci_lock));
>>>      ASSERT(rw_is_write_locked(&d->event_lock));
>>>  
>>>      if ( !msixtbl_initialised(d) )
>>> @@ -538,7 +538,7 @@ void msixtbl_pt_unregister(struct domain *d, struct pirq *pirq)
>>>      struct pci_dev *pdev;
>>>      struct msixtbl_entry *entry;
>>>  
>>> -    ASSERT(pcidevs_locked());
>>> +    ASSERT(pcidevs_locked() || rw_is_locked(&d->pci_lock));
>>>      ASSERT(rw_is_write_locked(&d->event_lock));
>>
>> I was hoping to just ack this patch, but the two changes above look
>> questionable to me: How can it be that holding _either_ lock is okay?
>> It's not obvious in this context that consumers have to hold both
>> locks now. In fact consumers looks to be the callers of
>> msixtbl_find_entry(), yet the list is RCU-protected. Whereas races
>> against themselves or against one another are avoided by holding
>> d->event_lock.
> 
> The reason for the change here is that msixtbl_pt_{un,}register() gets
> called by pt_irq_{create,destroy}_bind(), which is in turn called by
> vPCI code (pcidevs_locked()) that has been switched to not take the
> pcidevs lock anymore, and hence the ASSERT would trigger.

I understand this is the motivation for the change, but that doesn't
(alone) render the construct above sensible / correct.

>> My only guess then for the original need of holding pcidevs_lock is
>> the use of msi_desc->dev, with the desire for the device to not go
>> away. Yet the description doesn't talk about interactions of the per-
>> domain PCI lock with that one at all; it all circles around the
>> domain'd vPCI lock.
> 
> I do agree that it looks like the original intention of holding
> pcidevs_lock is to prevent msi_desc->dev from being removed - yet I'm
> not sure it's possible for the device to go away while the domain
> event_lock is hold, as device removal would need to take that same
> lock in order to destroy the irq_desc.

Yes, that matches an observation of mine as well. If we can simplify
(rather then complicate) locking, I'd prefer if we did. May need to
be a separate (prereq) patch, though.

>> Feels like I'm missing something that's obvious to everyone else.
>> Or maybe this part of the patch is actually unrelated, and should be
>> split off (with its own [proper] justification)? Or wouldn't it then
>> be better to also change the other paths leading here to acquire the
>> per-domain PCI lock?
> 
> Other paths in vPCI vpci_msi_update(), vpci_msi_arch_update(),
> vpci_msi_arch_enable()... are switched in this patch to use the
> per-domain pci_lock instead of pcidevs lock.

Hence my question: Can't we consolidate to all paths only using the
per-domain lock? That would make these odd-looking assertions become
normal-looking again.

Jan


^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [PATCH v12.2 01/15] vpci: use per-domain PCI lock to protect vpci structure
  2024-01-24  8:48         ` Jan Beulich
@ 2024-01-24  9:24           ` Roger Pau Monné
  2024-01-24 11:34             ` Jan Beulich
  0 siblings, 1 reply; 68+ messages in thread
From: Roger Pau Monné @ 2024-01-24  9:24 UTC (permalink / raw)
  To: Jan Beulich
  Cc: Stewart Hildebrand, Oleksandr Andrushchenko, Andrew Cooper,
	Wei Liu, George Dunlap, Julien Grall, Stefano Stabellini,
	Jun Nakajima, Kevin Tian, Paul Durrant, Volodymyr Babchuk,
	xen-devel

On Wed, Jan 24, 2024 at 09:48:35AM +0100, Jan Beulich wrote:
> On 23.01.2024 16:07, Roger Pau Monné wrote:
> > On Tue, Jan 23, 2024 at 03:32:12PM +0100, Jan Beulich wrote:
> >> On 15.01.2024 20:43, Stewart Hildebrand wrote:
> >>> @@ -2888,6 +2888,8 @@ int allocate_and_map_msi_pirq(struct domain *d, int index, int *pirq_p,
> >>>  {
> >>>      int irq, pirq, ret;
> >>>  
> >>> +    ASSERT(pcidevs_locked() || rw_is_locked(&d->pci_lock));
> >>
> >> If either lock is sufficient to hold here, ...
> >>
> >>> --- a/xen/arch/x86/physdev.c
> >>> +++ b/xen/arch/x86/physdev.c
> >>> @@ -123,7 +123,9 @@ int physdev_map_pirq(domid_t domid, int type, int *index, int *pirq_p,
> >>>  
> >>>      case MAP_PIRQ_TYPE_MSI:
> >>>      case MAP_PIRQ_TYPE_MULTI_MSI:
> >>> +        pcidevs_lock();
> >>>          ret = allocate_and_map_msi_pirq(d, *index, pirq_p, type, msi);
> >>> +        pcidevs_unlock();
> >>>          break;
> >>
> > 
> > IIRC (Stewart can further comment) this is done holding the pcidevs
> > lock to keep the path unmodified, as there's no need to hold the
> > per-domain rwlock.
> 
> Yet why would we prefer to acquire a global lock when a per-domain one
> suffices?

I was hoping to introduce less changes, specially if they are not
strictly required, as it's less risk.  I'm always quite worry of
locking changes.

Thanks, Roger.


^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [PATCH v12.2 01/15] vpci: use per-domain PCI lock to protect vpci structure
  2024-01-24  8:56         ` Jan Beulich
@ 2024-01-24  9:39           ` Roger Pau Monné
  0 siblings, 0 replies; 68+ messages in thread
From: Roger Pau Monné @ 2024-01-24  9:39 UTC (permalink / raw)
  To: Jan Beulich
  Cc: Stewart Hildebrand, Oleksandr Andrushchenko, Andrew Cooper,
	Wei Liu, George Dunlap, Julien Grall, Stefano Stabellini,
	Jun Nakajima, Kevin Tian, Paul Durrant, Volodymyr Babchuk,
	xen-devel

On Wed, Jan 24, 2024 at 09:56:42AM +0100, Jan Beulich wrote:
> On 23.01.2024 16:23, Roger Pau Monné wrote:
> > On Tue, Jan 23, 2024 at 03:26:26PM +0100, Jan Beulich wrote:
> >> On 15.01.2024 20:43, Stewart Hildebrand wrote:
> >>> --- a/xen/arch/x86/hvm/vmsi.c
> >>> +++ b/xen/arch/x86/hvm/vmsi.c
> >>> @@ -468,7 +468,7 @@ int msixtbl_pt_register(struct domain *d, struct pirq *pirq, uint64_t gtable)
> >>>      struct msixtbl_entry *entry, *new_entry;
> >>>      int r = -EINVAL;
> >>>  
> >>> -    ASSERT(pcidevs_locked());
> >>> +    ASSERT(pcidevs_locked() || rw_is_locked(&d->pci_lock));
> >>>      ASSERT(rw_is_write_locked(&d->event_lock));
> >>>  
> >>>      if ( !msixtbl_initialised(d) )
> >>> @@ -538,7 +538,7 @@ void msixtbl_pt_unregister(struct domain *d, struct pirq *pirq)
> >>>      struct pci_dev *pdev;
> >>>      struct msixtbl_entry *entry;
> >>>  
> >>> -    ASSERT(pcidevs_locked());
> >>> +    ASSERT(pcidevs_locked() || rw_is_locked(&d->pci_lock));
> >>>      ASSERT(rw_is_write_locked(&d->event_lock));
> >>
> >> I was hoping to just ack this patch, but the two changes above look
> >> questionable to me: How can it be that holding _either_ lock is okay?
> >> It's not obvious in this context that consumers have to hold both
> >> locks now. In fact consumers looks to be the callers of
> >> msixtbl_find_entry(), yet the list is RCU-protected. Whereas races
> >> against themselves or against one another are avoided by holding
> >> d->event_lock.
> > 
> > The reason for the change here is that msixtbl_pt_{un,}register() gets
> > called by pt_irq_{create,destroy}_bind(), which is in turn called by
> > vPCI code (pcidevs_locked()) that has been switched to not take the
> > pcidevs lock anymore, and hence the ASSERT would trigger.
> 
> I understand this is the motivation for the change, but that doesn't
> (alone) render the construct above sensible / correct.

But we agreed that for the purpose of the device not going anyway,
either the pcidevs or the per-domain pci_lock should be held, both are
valid for the purpose, and hence functions have adjusted to take that
into account.

So your concern is about the pcidevs lock being used here not just for
preventing the device from being removed, but also for protecting MSI
related state?

> >> My only guess then for the original need of holding pcidevs_lock is
> >> the use of msi_desc->dev, with the desire for the device to not go
> >> away. Yet the description doesn't talk about interactions of the per-
> >> domain PCI lock with that one at all; it all circles around the
> >> domain'd vPCI lock.
> > 
> > I do agree that it looks like the original intention of holding
> > pcidevs_lock is to prevent msi_desc->dev from being removed - yet I'm
> > not sure it's possible for the device to go away while the domain
> > event_lock is hold, as device removal would need to take that same
> > lock in order to destroy the irq_desc.
> 
> Yes, that matches an observation of mine as well. If we can simplify
> (rather then complicate) locking, I'd prefer if we did. May need to
> be a separate (prereq) patch, though.

Hm, yes, that might be an option, and doing a pre-patch that removes
the need to have pcidevs locked here would avoid the what seem
controversial changes.

> >> Feels like I'm missing something that's obvious to everyone else.
> >> Or maybe this part of the patch is actually unrelated, and should be
> >> split off (with its own [proper] justification)? Or wouldn't it then
> >> be better to also change the other paths leading here to acquire the
> >> per-domain PCI lock?
> > 
> > Other paths in vPCI vpci_msi_update(), vpci_msi_arch_update(),
> > vpci_msi_arch_enable()... are switched in this patch to use the
> > per-domain pci_lock instead of pcidevs lock.
> 
> Hence my question: Can't we consolidate to all paths only using the
> per-domain lock? That would make these odd-looking assertions become
> normal-looking again.

Hm, I think that's more work than originally planned, as the initial
plan was to use both locks during an interim period in order to avoid
doing a full swept change to switch to the per-domain one.

Regards, Roger.


^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [PATCH v12.2 01/15] vpci: use per-domain PCI lock to protect vpci structure
  2024-01-24  9:24           ` Roger Pau Monné
@ 2024-01-24 11:34             ` Jan Beulich
  2024-01-24 17:51               ` Roger Pau Monné
  0 siblings, 1 reply; 68+ messages in thread
From: Jan Beulich @ 2024-01-24 11:34 UTC (permalink / raw)
  To: Roger Pau Monné
  Cc: Stewart Hildebrand, Oleksandr Andrushchenko, Andrew Cooper,
	Wei Liu, George Dunlap, Julien Grall, Stefano Stabellini,
	Jun Nakajima, Kevin Tian, Paul Durrant, Volodymyr Babchuk,
	xen-devel

On 24.01.2024 10:24, Roger Pau Monné wrote:
> On Wed, Jan 24, 2024 at 09:48:35AM +0100, Jan Beulich wrote:
>> On 23.01.2024 16:07, Roger Pau Monné wrote:
>>> On Tue, Jan 23, 2024 at 03:32:12PM +0100, Jan Beulich wrote:
>>>> On 15.01.2024 20:43, Stewart Hildebrand wrote:
>>>>> @@ -2888,6 +2888,8 @@ int allocate_and_map_msi_pirq(struct domain *d, int index, int *pirq_p,
>>>>>  {
>>>>>      int irq, pirq, ret;
>>>>>  
>>>>> +    ASSERT(pcidevs_locked() || rw_is_locked(&d->pci_lock));
>>>>
>>>> If either lock is sufficient to hold here, ...
>>>>
>>>>> --- a/xen/arch/x86/physdev.c
>>>>> +++ b/xen/arch/x86/physdev.c
>>>>> @@ -123,7 +123,9 @@ int physdev_map_pirq(domid_t domid, int type, int *index, int *pirq_p,
>>>>>  
>>>>>      case MAP_PIRQ_TYPE_MSI:
>>>>>      case MAP_PIRQ_TYPE_MULTI_MSI:
>>>>> +        pcidevs_lock();
>>>>>          ret = allocate_and_map_msi_pirq(d, *index, pirq_p, type, msi);
>>>>> +        pcidevs_unlock();
>>>>>          break;
>>>>
>>>
>>> IIRC (Stewart can further comment) this is done holding the pcidevs
>>> lock to keep the path unmodified, as there's no need to hold the
>>> per-domain rwlock.
>>
>> Yet why would we prefer to acquire a global lock when a per-domain one
>> suffices?
> 
> I was hoping to introduce less changes, specially if they are not
> strictly required, as it's less risk.  I'm always quite worry of
> locking changes.

In which case more description / code commenting is needed. The pattern
of the assertions looks dangerous. Even if (as you say in a later reply)
this is only temporary, we all know how long "temporary" can be. It
might even be advisable to introduce a helper construct. That would then
be where the respective code comment goes, clarifying that the _sole_
purpose is to guarantee a pdev to not go away; no further protection of
any state, no other critical region aspects (assuming of course all of
this is actually true for all of the affected use sites).

Jan


^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [PATCH v12.2 01/15] vpci: use per-domain PCI lock to protect vpci structure
  2024-01-24 11:34             ` Jan Beulich
@ 2024-01-24 17:51               ` Roger Pau Monné
  2024-01-25  7:43                 ` Jan Beulich
  0 siblings, 1 reply; 68+ messages in thread
From: Roger Pau Monné @ 2024-01-24 17:51 UTC (permalink / raw)
  To: Jan Beulich
  Cc: Stewart Hildebrand, Oleksandr Andrushchenko, Andrew Cooper,
	Wei Liu, George Dunlap, Julien Grall, Stefano Stabellini,
	Jun Nakajima, Kevin Tian, Paul Durrant, Volodymyr Babchuk,
	xen-devel

On Wed, Jan 24, 2024 at 12:34:10PM +0100, Jan Beulich wrote:
> On 24.01.2024 10:24, Roger Pau Monné wrote:
> > On Wed, Jan 24, 2024 at 09:48:35AM +0100, Jan Beulich wrote:
> >> On 23.01.2024 16:07, Roger Pau Monné wrote:
> >>> On Tue, Jan 23, 2024 at 03:32:12PM +0100, Jan Beulich wrote:
> >>>> On 15.01.2024 20:43, Stewart Hildebrand wrote:
> >>>>> @@ -2888,6 +2888,8 @@ int allocate_and_map_msi_pirq(struct domain *d, int index, int *pirq_p,
> >>>>>  {
> >>>>>      int irq, pirq, ret;
> >>>>>  
> >>>>> +    ASSERT(pcidevs_locked() || rw_is_locked(&d->pci_lock));
> >>>>
> >>>> If either lock is sufficient to hold here, ...
> >>>>
> >>>>> --- a/xen/arch/x86/physdev.c
> >>>>> +++ b/xen/arch/x86/physdev.c
> >>>>> @@ -123,7 +123,9 @@ int physdev_map_pirq(domid_t domid, int type, int *index, int *pirq_p,
> >>>>>  
> >>>>>      case MAP_PIRQ_TYPE_MSI:
> >>>>>      case MAP_PIRQ_TYPE_MULTI_MSI:
> >>>>> +        pcidevs_lock();
> >>>>>          ret = allocate_and_map_msi_pirq(d, *index, pirq_p, type, msi);
> >>>>> +        pcidevs_unlock();
> >>>>>          break;
> >>>>
> >>>
> >>> IIRC (Stewart can further comment) this is done holding the pcidevs
> >>> lock to keep the path unmodified, as there's no need to hold the
> >>> per-domain rwlock.
> >>
> >> Yet why would we prefer to acquire a global lock when a per-domain one
> >> suffices?
> > 
> > I was hoping to introduce less changes, specially if they are not
> > strictly required, as it's less risk.  I'm always quite worry of
> > locking changes.
> 
> In which case more description / code commenting is needed. The pattern
> of the assertions looks dangerous.

Is such dangerousness perception because you fear some of the pcidevs
lock usage might be there not just for preventing the pdev from going
away, but also to guarantee exclusive access to certain state?

> Even if (as you say in a later reply)
> this is only temporary, we all know how long "temporary" can be. It
> might even be advisable to introduce a helper construct.

The aim here was to modify as little as possible, in order to avoid
having to analyze all possible users of pcidevs lock, and thus not
block the vPCI work on the probably lengthy and difficult analysis.

Not sure adding a construct makes is much better, as I didn't want to
give the impression all checks for the pcidevs lock can merely be
replaced by the new construct.

Thanks, Roger.


^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [PATCH v12.2 01/15] vpci: use per-domain PCI lock to protect vpci structure
  2024-01-24  8:21         ` Roger Pau Monné
@ 2024-01-24 20:21           ` Stewart Hildebrand
  0 siblings, 0 replies; 68+ messages in thread
From: Stewart Hildebrand @ 2024-01-24 20:21 UTC (permalink / raw)
  To: Roger Pau Monné
  Cc: Jan Beulich, Oleksandr Andrushchenko, Andrew Cooper, Wei Liu,
	George Dunlap, Julien Grall, Stefano Stabellini, Jun Nakajima,
	Kevin Tian, Paul Durrant, Volodymyr Babchuk, xen-devel

On 1/24/24 03:21, Roger Pau Monné wrote:
> On Wed, Jan 24, 2024 at 12:07:28AM -0500, Stewart Hildebrand wrote:
>> On 1/23/24 09:29, Jan Beulich wrote:
>>> On 15.01.2024 20:43, Stewart Hildebrand wrote:
>>>> @@ -1043,11 +1043,11 @@ static int __pci_enable_msix(struct pci_dev *pdev, struct msi_info *msi,
>>>>  {
>>>>      struct msi_desc *old_desc;
>>>>  
>>>> -    ASSERT(pcidevs_locked());
>>>> -
>>>>      if ( !pdev || !pdev->msix )
>>>>          return -ENODEV;
>>>>  
>>>> +    ASSERT(pcidevs_locked() || rw_is_locked(&pdev->domain->pci_lock));
>>>> +
>>>>      if ( msi->entry_nr >= pdev->msix->nr_entries )
>>>>          return -EINVAL;
>>>
>>> Further looking at this - is dereferencing pdev actually safe without holding
>>> the global lock?
> 
> It is safe because either the global pcidevs lock or the per-domain
> pci_lock will be held, which should prevent the device from being
> removed.
> 
>> Are you referring to the new placement of the ASSERT, which opens up the possibility that pdev could be dereferenced and the function return before the ASSERT? If that is what you mean, I see your point. The ASSERT was placed there simply because we wanted to check that pdev != NULL first. See prior discussion at [1]. Hmm.. How about splitting the pdev-checking condition? E.g.:
>>
>>     if ( !pdev )
>>         return -ENODEV;
>>
>>     ASSERT(pcidevs_locked() || rw_is_locked(&pdev->domain->pci_lock));
>>
>>     if ( !pdev->msix )
>>         return -ENODEV;
> 
> I'm not specially worried about the position of the assert, those are
> just debug messages at the end.
> 
> One worry I have after further looking at the code, when called from
> ns16550_init_postirq(), does the device have pdev->domain set?
> 
> That case would satisfy the first condition of the assert, so won't
> attempt to dereference pdev->domain, but still would be good to ensure
> consistency here wrt the state of pdev->domain.

Indeed. How about this?

    if ( !pdev )
        return -ENODEV;

    ASSERT(pcidevs_locked() ||
           (pdev->domain && rw_is_locked(&pdev->domain->pci_lock)));

    if ( !pdev->msix )
        return -ENODEV;

And similarly in __pci_enable_msi(), without the !pdev->msix check. And similarly in pci_enable_msi(), which then should also gain its own if ( !pdev ) return -ENODEV; check.


^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [PATCH v12.2 01/15] vpci: use per-domain PCI lock to protect vpci structure
  2024-01-24 17:51               ` Roger Pau Monné
@ 2024-01-25  7:43                 ` Jan Beulich
  2024-01-25  9:05                   ` Roger Pau Monné
  0 siblings, 1 reply; 68+ messages in thread
From: Jan Beulich @ 2024-01-25  7:43 UTC (permalink / raw)
  To: Roger Pau Monné
  Cc: Stewart Hildebrand, Oleksandr Andrushchenko, Andrew Cooper,
	Wei Liu, George Dunlap, Julien Grall, Stefano Stabellini,
	Jun Nakajima, Kevin Tian, Paul Durrant, Volodymyr Babchuk,
	xen-devel

On 24.01.2024 18:51, Roger Pau Monné wrote:
> On Wed, Jan 24, 2024 at 12:34:10PM +0100, Jan Beulich wrote:
>> On 24.01.2024 10:24, Roger Pau Monné wrote:
>>> On Wed, Jan 24, 2024 at 09:48:35AM +0100, Jan Beulich wrote:
>>>> On 23.01.2024 16:07, Roger Pau Monné wrote:
>>>>> On Tue, Jan 23, 2024 at 03:32:12PM +0100, Jan Beulich wrote:
>>>>>> On 15.01.2024 20:43, Stewart Hildebrand wrote:
>>>>>>> @@ -2888,6 +2888,8 @@ int allocate_and_map_msi_pirq(struct domain *d, int index, int *pirq_p,
>>>>>>>  {
>>>>>>>      int irq, pirq, ret;
>>>>>>>  
>>>>>>> +    ASSERT(pcidevs_locked() || rw_is_locked(&d->pci_lock));
>>>>>>
>>>>>> If either lock is sufficient to hold here, ...
>>>>>>
>>>>>>> --- a/xen/arch/x86/physdev.c
>>>>>>> +++ b/xen/arch/x86/physdev.c
>>>>>>> @@ -123,7 +123,9 @@ int physdev_map_pirq(domid_t domid, int type, int *index, int *pirq_p,
>>>>>>>  
>>>>>>>      case MAP_PIRQ_TYPE_MSI:
>>>>>>>      case MAP_PIRQ_TYPE_MULTI_MSI:
>>>>>>> +        pcidevs_lock();
>>>>>>>          ret = allocate_and_map_msi_pirq(d, *index, pirq_p, type, msi);
>>>>>>> +        pcidevs_unlock();
>>>>>>>          break;
>>>>>>
>>>>>
>>>>> IIRC (Stewart can further comment) this is done holding the pcidevs
>>>>> lock to keep the path unmodified, as there's no need to hold the
>>>>> per-domain rwlock.
>>>>
>>>> Yet why would we prefer to acquire a global lock when a per-domain one
>>>> suffices?
>>>
>>> I was hoping to introduce less changes, specially if they are not
>>> strictly required, as it's less risk.  I'm always quite worry of
>>> locking changes.
>>
>> In which case more description / code commenting is needed. The pattern
>> of the assertions looks dangerous.
> 
> Is such dangerousness perception because you fear some of the pcidevs
> lock usage might be there not just for preventing the pdev from going
> away, but also to guarantee exclusive access to certain state?

Indeed. In my view the main purpose of locks is to guard state. Their
use here to guard against devices here is imo rather an abuse; as
mentioned before this should instead be achieved e.g via refcounting.
And it's bad enough already that pcidevs_lock() alone has been abused
this way, without proper marking (leaving us to guess in many places).
It gets worse when a second lock can now also serve this same purpose.

>> Even if (as you say in a later reply)
>> this is only temporary, we all know how long "temporary" can be. It
>> might even be advisable to introduce a helper construct.
> 
> The aim here was to modify as little as possible, in order to avoid
> having to analyze all possible users of pcidevs lock, and thus not
> block the vPCI work on the probably lengthy and difficult analysis.
> 
> Not sure adding a construct makes is much better, as I didn't want to
> give the impression all checks for the pcidevs lock can merely be
> replaced by the new construct.

Of course such a construct could only be used in places where it can
be shown to be appropriate.

Jan


^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [PATCH v12.2 01/15] vpci: use per-domain PCI lock to protect vpci structure
  2024-01-25  7:43                 ` Jan Beulich
@ 2024-01-25  9:05                   ` Roger Pau Monné
  2024-01-25 11:23                     ` Jan Beulich
  0 siblings, 1 reply; 68+ messages in thread
From: Roger Pau Monné @ 2024-01-25  9:05 UTC (permalink / raw)
  To: Jan Beulich
  Cc: Stewart Hildebrand, Oleksandr Andrushchenko, Andrew Cooper,
	Wei Liu, George Dunlap, Julien Grall, Stefano Stabellini,
	Jun Nakajima, Kevin Tian, Paul Durrant, Volodymyr Babchuk,
	xen-devel

On Thu, Jan 25, 2024 at 08:43:05AM +0100, Jan Beulich wrote:
> On 24.01.2024 18:51, Roger Pau Monné wrote:
> > On Wed, Jan 24, 2024 at 12:34:10PM +0100, Jan Beulich wrote:
> >> On 24.01.2024 10:24, Roger Pau Monné wrote:
> >>> On Wed, Jan 24, 2024 at 09:48:35AM +0100, Jan Beulich wrote:
> >>>> On 23.01.2024 16:07, Roger Pau Monné wrote:
> >>>>> On Tue, Jan 23, 2024 at 03:32:12PM +0100, Jan Beulich wrote:
> >>>>>> On 15.01.2024 20:43, Stewart Hildebrand wrote:
> >>>>>>> @@ -2888,6 +2888,8 @@ int allocate_and_map_msi_pirq(struct domain *d, int index, int *pirq_p,
> >>>>>>>  {
> >>>>>>>      int irq, pirq, ret;
> >>>>>>>  
> >>>>>>> +    ASSERT(pcidevs_locked() || rw_is_locked(&d->pci_lock));
> >>>>>>
> >>>>>> If either lock is sufficient to hold here, ...
> >>>>>>
> >>>>>>> --- a/xen/arch/x86/physdev.c
> >>>>>>> +++ b/xen/arch/x86/physdev.c
> >>>>>>> @@ -123,7 +123,9 @@ int physdev_map_pirq(domid_t domid, int type, int *index, int *pirq_p,
> >>>>>>>  
> >>>>>>>      case MAP_PIRQ_TYPE_MSI:
> >>>>>>>      case MAP_PIRQ_TYPE_MULTI_MSI:
> >>>>>>> +        pcidevs_lock();
> >>>>>>>          ret = allocate_and_map_msi_pirq(d, *index, pirq_p, type, msi);
> >>>>>>> +        pcidevs_unlock();
> >>>>>>>          break;
> >>>>>>
> >>>>>
> >>>>> IIRC (Stewart can further comment) this is done holding the pcidevs
> >>>>> lock to keep the path unmodified, as there's no need to hold the
> >>>>> per-domain rwlock.
> >>>>
> >>>> Yet why would we prefer to acquire a global lock when a per-domain one
> >>>> suffices?
> >>>
> >>> I was hoping to introduce less changes, specially if they are not
> >>> strictly required, as it's less risk.  I'm always quite worry of
> >>> locking changes.
> >>
> >> In which case more description / code commenting is needed. The pattern
> >> of the assertions looks dangerous.
> > 
> > Is such dangerousness perception because you fear some of the pcidevs
> > lock usage might be there not just for preventing the pdev from going
> > away, but also to guarantee exclusive access to certain state?
> 
> Indeed. In my view the main purpose of locks is to guard state. Their
> use here to guard against devices here is imo rather an abuse; as
> mentioned before this should instead be achieved e.g via refcounting.
> And it's bad enough already that pcidevs_lock() alone has been abused
> this way, without proper marking (leaving us to guess in many places).
> It gets worse when a second lock can now also serve this same
> purpose.

The new lock is taken in read mode in most contexts, and hence can't
be used to indirectly gain exclusive access to domain related
structures in a safe way.

Not saying this makes it any better, but so far this is the best
solution we could come up with that didn't involve a full evaluation
and possible re-write of all usage of the pcidevs lock.

I would also prefer a solution that fully replaces the pcidevs lock
with something else, but for once I don't have a clear picture of how
that would look like because the analysis is a huge amount of work,
likely more than the implementation itself.

Hence the proposed compromise solution that should allow the vPCI work
to make progress.

Regards, Roger.


^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [PATCH v12.2 01/15] vpci: use per-domain PCI lock to protect vpci structure
  2024-01-25  9:05                   ` Roger Pau Monné
@ 2024-01-25 11:23                     ` Jan Beulich
  2024-01-25 12:33                       ` Roger Pau Monné
  0 siblings, 1 reply; 68+ messages in thread
From: Jan Beulich @ 2024-01-25 11:23 UTC (permalink / raw)
  To: Roger Pau Monné
  Cc: Stewart Hildebrand, Oleksandr Andrushchenko, Andrew Cooper,
	Wei Liu, George Dunlap, Julien Grall, Stefano Stabellini,
	Jun Nakajima, Kevin Tian, Paul Durrant, Volodymyr Babchuk,
	xen-devel

On 25.01.2024 10:05, Roger Pau Monné wrote:
> On Thu, Jan 25, 2024 at 08:43:05AM +0100, Jan Beulich wrote:
>> On 24.01.2024 18:51, Roger Pau Monné wrote:
>>> On Wed, Jan 24, 2024 at 12:34:10PM +0100, Jan Beulich wrote:
>>>> On 24.01.2024 10:24, Roger Pau Monné wrote:
>>>>> On Wed, Jan 24, 2024 at 09:48:35AM +0100, Jan Beulich wrote:
>>>>>> On 23.01.2024 16:07, Roger Pau Monné wrote:
>>>>>>> On Tue, Jan 23, 2024 at 03:32:12PM +0100, Jan Beulich wrote:
>>>>>>>> On 15.01.2024 20:43, Stewart Hildebrand wrote:
>>>>>>>>> @@ -2888,6 +2888,8 @@ int allocate_and_map_msi_pirq(struct domain *d, int index, int *pirq_p,
>>>>>>>>>  {
>>>>>>>>>      int irq, pirq, ret;
>>>>>>>>>  
>>>>>>>>> +    ASSERT(pcidevs_locked() || rw_is_locked(&d->pci_lock));
>>>>>>>>
>>>>>>>> If either lock is sufficient to hold here, ...
>>>>>>>>
>>>>>>>>> --- a/xen/arch/x86/physdev.c
>>>>>>>>> +++ b/xen/arch/x86/physdev.c
>>>>>>>>> @@ -123,7 +123,9 @@ int physdev_map_pirq(domid_t domid, int type, int *index, int *pirq_p,
>>>>>>>>>  
>>>>>>>>>      case MAP_PIRQ_TYPE_MSI:
>>>>>>>>>      case MAP_PIRQ_TYPE_MULTI_MSI:
>>>>>>>>> +        pcidevs_lock();
>>>>>>>>>          ret = allocate_and_map_msi_pirq(d, *index, pirq_p, type, msi);
>>>>>>>>> +        pcidevs_unlock();
>>>>>>>>>          break;
>>>>>>>>
>>>>>>>
>>>>>>> IIRC (Stewart can further comment) this is done holding the pcidevs
>>>>>>> lock to keep the path unmodified, as there's no need to hold the
>>>>>>> per-domain rwlock.
>>>>>>
>>>>>> Yet why would we prefer to acquire a global lock when a per-domain one
>>>>>> suffices?
>>>>>
>>>>> I was hoping to introduce less changes, specially if they are not
>>>>> strictly required, as it's less risk.  I'm always quite worry of
>>>>> locking changes.
>>>>
>>>> In which case more description / code commenting is needed. The pattern
>>>> of the assertions looks dangerous.
>>>
>>> Is such dangerousness perception because you fear some of the pcidevs
>>> lock usage might be there not just for preventing the pdev from going
>>> away, but also to guarantee exclusive access to certain state?
>>
>> Indeed. In my view the main purpose of locks is to guard state. Their
>> use here to guard against devices here is imo rather an abuse; as
>> mentioned before this should instead be achieved e.g via refcounting.
>> And it's bad enough already that pcidevs_lock() alone has been abused
>> this way, without proper marking (leaving us to guess in many places).
>> It gets worse when a second lock can now also serve this same
>> purpose.
> 
> The new lock is taken in read mode in most contexts, and hence can't
> be used to indirectly gain exclusive access to domain related
> structures in a safe way.

Oh, right - I keep being misled by rw_is_locked(). This is a fair
argument. Irrespective it would feel better to me if an abstraction
construct was introduced; but seeing you don't like the idea I guess
I won't insist.

Jan


^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [PATCH v12.2 01/15] vpci: use per-domain PCI lock to protect vpci structure
  2024-01-25 11:23                     ` Jan Beulich
@ 2024-01-25 12:33                       ` Roger Pau Monné
  2024-01-30 15:04                         ` Stewart Hildebrand
  0 siblings, 1 reply; 68+ messages in thread
From: Roger Pau Monné @ 2024-01-25 12:33 UTC (permalink / raw)
  To: Jan Beulich
  Cc: Stewart Hildebrand, Oleksandr Andrushchenko, Andrew Cooper,
	Wei Liu, George Dunlap, Julien Grall, Stefano Stabellini,
	Jun Nakajima, Kevin Tian, Paul Durrant, Volodymyr Babchuk,
	xen-devel

On Thu, Jan 25, 2024 at 12:23:05PM +0100, Jan Beulich wrote:
> On 25.01.2024 10:05, Roger Pau Monné wrote:
> > On Thu, Jan 25, 2024 at 08:43:05AM +0100, Jan Beulich wrote:
> >> On 24.01.2024 18:51, Roger Pau Monné wrote:
> >>> On Wed, Jan 24, 2024 at 12:34:10PM +0100, Jan Beulich wrote:
> >>>> On 24.01.2024 10:24, Roger Pau Monné wrote:
> >>>>> On Wed, Jan 24, 2024 at 09:48:35AM +0100, Jan Beulich wrote:
> >>>>>> On 23.01.2024 16:07, Roger Pau Monné wrote:
> >>>>>>> On Tue, Jan 23, 2024 at 03:32:12PM +0100, Jan Beulich wrote:
> >>>>>>>> On 15.01.2024 20:43, Stewart Hildebrand wrote:
> >>>>>>>>> @@ -2888,6 +2888,8 @@ int allocate_and_map_msi_pirq(struct domain *d, int index, int *pirq_p,
> >>>>>>>>>  {
> >>>>>>>>>      int irq, pirq, ret;
> >>>>>>>>>  
> >>>>>>>>> +    ASSERT(pcidevs_locked() || rw_is_locked(&d->pci_lock));
> >>>>>>>>
> >>>>>>>> If either lock is sufficient to hold here, ...
> >>>>>>>>
> >>>>>>>>> --- a/xen/arch/x86/physdev.c
> >>>>>>>>> +++ b/xen/arch/x86/physdev.c
> >>>>>>>>> @@ -123,7 +123,9 @@ int physdev_map_pirq(domid_t domid, int type, int *index, int *pirq_p,
> >>>>>>>>>  
> >>>>>>>>>      case MAP_PIRQ_TYPE_MSI:
> >>>>>>>>>      case MAP_PIRQ_TYPE_MULTI_MSI:
> >>>>>>>>> +        pcidevs_lock();
> >>>>>>>>>          ret = allocate_and_map_msi_pirq(d, *index, pirq_p, type, msi);
> >>>>>>>>> +        pcidevs_unlock();
> >>>>>>>>>          break;
> >>>>>>>>
> >>>>>>>
> >>>>>>> IIRC (Stewart can further comment) this is done holding the pcidevs
> >>>>>>> lock to keep the path unmodified, as there's no need to hold the
> >>>>>>> per-domain rwlock.
> >>>>>>
> >>>>>> Yet why would we prefer to acquire a global lock when a per-domain one
> >>>>>> suffices?
> >>>>>
> >>>>> I was hoping to introduce less changes, specially if they are not
> >>>>> strictly required, as it's less risk.  I'm always quite worry of
> >>>>> locking changes.
> >>>>
> >>>> In which case more description / code commenting is needed. The pattern
> >>>> of the assertions looks dangerous.
> >>>
> >>> Is such dangerousness perception because you fear some of the pcidevs
> >>> lock usage might be there not just for preventing the pdev from going
> >>> away, but also to guarantee exclusive access to certain state?
> >>
> >> Indeed. In my view the main purpose of locks is to guard state. Their
> >> use here to guard against devices here is imo rather an abuse; as
> >> mentioned before this should instead be achieved e.g via refcounting.
> >> And it's bad enough already that pcidevs_lock() alone has been abused
> >> this way, without proper marking (leaving us to guess in many places).
> >> It gets worse when a second lock can now also serve this same
> >> purpose.
> > 
> > The new lock is taken in read mode in most contexts, and hence can't
> > be used to indirectly gain exclusive access to domain related
> > structures in a safe way.
> 
> Oh, right - I keep being misled by rw_is_locked(). This is a fair
> argument. Irrespective it would feel better to me if an abstraction
> construct was introduced; but seeing you don't like the idea I guess
> I won't insist.

TBH I'm not going to argue against it if you and Stewart think it's
clearer, but I also won't request the addition of such wrapper myself.

Thanks, Roger.


^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [PATCH v12 10/15] vpci/header: emulate PCI_COMMAND register for guests
  2024-01-09 21:51 ` [PATCH v12 10/15] vpci/header: emulate PCI_COMMAND register for guests Stewart Hildebrand
@ 2024-01-25 15:43   ` Jan Beulich
  2024-02-01  4:50     ` Stewart Hildebrand
  0 siblings, 1 reply; 68+ messages in thread
From: Jan Beulich @ 2024-01-25 15:43 UTC (permalink / raw)
  To: Stewart Hildebrand
  Cc: Oleksandr Andrushchenko, Roger Pau Monné, Andrew Cooper,
	George Dunlap, Julien Grall, Stefano Stabellini, Wei Liu,
	Volodymyr Babchuk, xen-devel

On 09.01.2024 22:51, Stewart Hildebrand wrote:
> --- a/xen/drivers/vpci/header.c
> +++ b/xen/drivers/vpci/header.c
> @@ -168,6 +168,9 @@ static void modify_decoding(const struct pci_dev *pdev, uint16_t cmd,
>      if ( !rom_only )
>      {
>          pci_conf_write16(pdev->sbdf, PCI_COMMAND, cmd);
> +        /* Show DomU that we updated P2M */
> +        header->guest_cmd &= ~PCI_COMMAND_MEMORY;
> +        header->guest_cmd |= cmd & PCI_COMMAND_MEMORY;
>          header->bars_mapped = map;
>      }

I don't follow what the comment means to say. The bit in question has no
real connection to the P2M, and the guest also may have no notion of the
underlying hypervisor's internals. Likely connected to ...

> @@ -524,9 +527,26 @@ static void cf_check cmd_write(
>  {
>      struct vpci_header *header = data;
>  
> +    if ( !is_hardware_domain(pdev->domain) )
> +    {
> +        const struct vpci *vpci = pdev->vpci;
> +
> +        if ( (vpci->msi && vpci->msi->enabled) ||
> +             (vpci->msix && vpci->msix->enabled) )
> +            cmd |= PCI_COMMAND_INTX_DISABLE;
> +
> +        /*
> +         * Do not show change to PCI_COMMAND_MEMORY bit until we finish
> +         * modifying P2M mappings.
> +         */
> +        header->guest_cmd = (cmd & ~PCI_COMMAND_MEMORY) |
> +                            (header->guest_cmd & PCI_COMMAND_MEMORY);
> +    }

... the comment here, but then shouldn't it be that the guest can't even
issue a 2nd cfg space access until the present write has been carried out?
Otherwise I'd be inclined to claim that such a partial update is unlikely
to be spec-conformant.

> @@ -843,6 +885,15 @@ static int cf_check init_header(struct pci_dev *pdev)
>      if ( cmd & PCI_COMMAND_MEMORY )
>          pci_conf_write16(pdev->sbdf, PCI_COMMAND, cmd & ~PCI_COMMAND_MEMORY);
>  
> +    /*
> +     * Clear PCI_COMMAND_MEMORY and PCI_COMMAND_IO for DomUs, so they will
> +     * always start with memory decoding disabled and to ensure that we will not
> +     * call modify_bars() at the end of this function.
> +     */
> +    if ( !is_hwdom )
> +        cmd &= ~(PCI_COMMAND_MEMORY | PCI_COMMAND_IO);
> +    header->guest_cmd = cmd;

With PCI_COMMAND_MEMORY clear, the hw reg won't further be written on the
success return path. Yet wouldn't we better clear PCI_COMMAND_IO also in
hardware (until we properly support it)?

I also think the insertion point for the new code isn't well chosen: The
comment just out of context indicates that the code in context above is
connected to the subsequent code. Whereas the addition is not.

> --- a/xen/drivers/vpci/msi.c
> +++ b/xen/drivers/vpci/msi.c
> @@ -70,6 +70,15 @@ static void cf_check control_write(
>  
>          if ( vpci_msi_arch_enable(msi, pdev, vectors) )
>              return;
> +
> +        /*
> +         * Make sure domU doesn't enable INTx while enabling MSI.
> +         */

Nit: This ought to be a single line comment, just like ...

> +        if ( !is_hardware_domain(pdev->domain) )
> +        {
> +            pci_intx(pdev, false);
> +            pdev->vpci->header.guest_cmd |= PCI_COMMAND_INTX_DISABLE;
> +        }
>      }
>      else
>          vpci_msi_arch_disable(msi, pdev);
> --- a/xen/drivers/vpci/msix.c
> +++ b/xen/drivers/vpci/msix.c
> @@ -135,6 +135,13 @@ static void cf_check control_write(
>          }
>      }
>  
> +    /* Make sure domU doesn't enable INTx while enabling MSI-X. */
> +    if ( new_enabled && !msix->enabled && !is_hardware_domain(pdev->domain) )
> +    {
> +        pci_intx(pdev, false);
> +        pdev->vpci->header.guest_cmd |= PCI_COMMAND_INTX_DISABLE;
> +    }

... the similar code here has it.

In both cases, is it really appropriate to set the bit in guest view?

Jan


^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [PATCH v12 11/15] vpci: add initial support for virtual PCI bus topology
  2024-01-09 21:51 ` [PATCH v12 11/15] vpci: add initial support for virtual PCI bus topology Stewart Hildebrand
  2024-01-12 11:46   ` George Dunlap
@ 2024-01-25 16:00   ` Jan Beulich
  2024-02-02  3:30     ` Stewart Hildebrand
  1 sibling, 1 reply; 68+ messages in thread
From: Jan Beulich @ 2024-01-25 16:00 UTC (permalink / raw)
  To: Stewart Hildebrand
  Cc: Oleksandr Andrushchenko, Andrew Cooper, George Dunlap,
	Julien Grall, Stefano Stabellini, Wei Liu, Roger Pau Monné,
	Volodymyr Babchuk, xen-devel

On 09.01.2024 22:51, Stewart Hildebrand wrote:
> --- a/xen/drivers/Kconfig
> +++ b/xen/drivers/Kconfig
> @@ -15,4 +15,8 @@ source "drivers/video/Kconfig"
>  config HAS_VPCI
>  	bool
>  
> +config HAS_VPCI_GUEST_SUPPORT
> +	bool
> +	depends on HAS_VPCI

Wouldn't this better be "select", or even just "imply"?

> --- a/xen/drivers/vpci/vpci.c
> +++ b/xen/drivers/vpci/vpci.c
> @@ -40,6 +40,49 @@ extern vpci_register_init_t *const __start_vpci_array[];
>  extern vpci_register_init_t *const __end_vpci_array[];
>  #define NUM_VPCI_INIT (__end_vpci_array - __start_vpci_array)
>  
> +#ifdef CONFIG_HAS_VPCI_GUEST_SUPPORT
> +static int add_virtual_device(struct pci_dev *pdev)
> +{
> +    struct domain *d = pdev->domain;
> +    unsigned int new_dev_number;
> +
> +    if ( is_hardware_domain(d) )
> +        return 0;
> +
> +    ASSERT(rw_is_write_locked(&pdev->domain->pci_lock));
> +
> +    /*
> +     * Each PCI bus supports 32 devices/slots at max or up to 256 when
> +     * there are multi-function ones which are not yet supported.
> +     */
> +    if ( pdev->info.is_extfn && !pdev->info.is_virtfn )
> +    {
> +        gdprintk(XENLOG_ERR, "%pp: only function 0 passthrough supported\n",
> +                 &pdev->sbdf);

The message suggests you ought to check pdev->devfn to have the low
three bits clear. Yet what you check are two booleans.

Further doesn't this require the multi-function bit to be emulated
clear? And finally don't you then also need to disallow assignment of
devices with phantom functions?

> --- a/xen/include/xen/sched.h
> +++ b/xen/include/xen/sched.h
> @@ -484,6 +484,14 @@ struct domain
>       * 2. pdev->vpci->lock
>       */
>      rwlock_t pci_lock;
> +#ifdef CONFIG_HAS_VPCI_GUEST_SUPPORT
> +    /*
> +     * The bitmap which shows which device numbers are already used by the
> +     * virtual PCI bus topology and is used to assign a unique SBDF to the
> +     * next passed through virtual PCI device.
> +     */
> +    DECLARE_BITMAP(vpci_dev_assigned_map, VPCI_MAX_VIRT_DEV);
> +#endif
>  #endif

With this the 2nd #endif would likely better gain a comment.

> --- a/xen/include/xen/vpci.h
> +++ b/xen/include/xen/vpci.h
> @@ -21,6 +21,13 @@ typedef int vpci_register_init_t(struct pci_dev *dev);
>  
>  #define VPCI_ECAM_BDF(addr)     (((addr) & 0x0ffff000) >> 12)
>  
> +/*
> + * Maximum number of devices supported by the virtual bus topology:
> + * each PCI bus supports 32 devices/slots at max or up to 256 when
> + * there are multi-function ones which are not yet supported.
> + */
> +#define VPCI_MAX_VIRT_DEV       (PCI_SLOT(~0) + 1)

The limit being this means only bus 0 / seg 0 is supported, which I
think the comment would better also say. (In add_virtual_device(),
which has a similar comment, there's then at least a 2nd one saying
so.)

Jan


^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [PATCH v12.2 01/15] vpci: use per-domain PCI lock to protect vpci structure
  2024-01-24  5:00         ` Stewart Hildebrand
@ 2024-01-30 14:59           ` Stewart Hildebrand
  0 siblings, 0 replies; 68+ messages in thread
From: Stewart Hildebrand @ 2024-01-30 14:59 UTC (permalink / raw)
  To: Roger Pau Monné, Jan Beulich
  Cc: Oleksandr Andrushchenko, Andrew Cooper, Wei Liu, George Dunlap,
	Julien Grall, Stefano Stabellini, Jun Nakajima, Kevin Tian,
	Paul Durrant, Volodymyr Babchuk, xen-devel

On 1/24/24 00:00, Stewart Hildebrand wrote:
> On 1/23/24 10:07, Roger Pau Monné wrote:
>> On Tue, Jan 23, 2024 at 03:32:12PM +0100, Jan Beulich wrote:
>>> On 15.01.2024 20:43, Stewart Hildebrand wrote:
>>>> @@ -2888,6 +2888,8 @@ int allocate_and_map_msi_pirq(struct domain *d, int index, int *pirq_p,
>>>>  {
>>>>      int irq, pirq, ret;
>>>>  
>>>> +    ASSERT(pcidevs_locked() || rw_is_locked(&d->pci_lock));
>>>
>>> If either lock is sufficient to hold here, ...
>>>
>>>> --- a/xen/arch/x86/physdev.c
>>>> +++ b/xen/arch/x86/physdev.c
>>>> @@ -123,7 +123,9 @@ int physdev_map_pirq(domid_t domid, int type, int *index, int *pirq_p,
>>>>  
>>>>      case MAP_PIRQ_TYPE_MSI:
>>>>      case MAP_PIRQ_TYPE_MULTI_MSI:
>>>> +        pcidevs_lock();
>>>>          ret = allocate_and_map_msi_pirq(d, *index, pirq_p, type, msi);
>>>> +        pcidevs_unlock();
>>>>          break;
>>>
>>> ... why is it the global lock that's being acquired here?
>>>
>>
>> IIRC (Stewart can further comment) this is done holding the pcidevs
>> lock to keep the path unmodified, as there's no need to hold the
>> per-domain rwlock.
>>
> 
> Although allocate_and_map_msi_pirq() was itself acquiring the global pcidevs_lock() before this patch, we could just as well use read_lock(&d->pci_lock) here instead now. It seems like a good optimization to make, so if there aren't any objections I'll change it to read_lock(&d->pci_lock).
> 

Actually, I take this back. As mentioned in the cover letter of this series, and has been reiterated in recent discussions, the goal with this is to keep existing (non-vPCI) code paths as unmodified as possible. So I'll keep it as pcidevs_lock() here.


^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [PATCH v12.2 01/15] vpci: use per-domain PCI lock to protect vpci structure
  2024-01-25 12:33                       ` Roger Pau Monné
@ 2024-01-30 15:04                         ` Stewart Hildebrand
  0 siblings, 0 replies; 68+ messages in thread
From: Stewart Hildebrand @ 2024-01-30 15:04 UTC (permalink / raw)
  To: Roger Pau Monné, Jan Beulich
  Cc: Oleksandr Andrushchenko, Andrew Cooper, Wei Liu, George Dunlap,
	Julien Grall, Stefano Stabellini, Jun Nakajima, Kevin Tian,
	Paul Durrant, Volodymyr Babchuk, xen-devel

On 1/25/24 07:33, Roger Pau Monné wrote:
> On Thu, Jan 25, 2024 at 12:23:05PM +0100, Jan Beulich wrote:
>> On 25.01.2024 10:05, Roger Pau Monné wrote:
>>> On Thu, Jan 25, 2024 at 08:43:05AM +0100, Jan Beulich wrote:
>>>> On 24.01.2024 18:51, Roger Pau Monné wrote:
>>>>> On Wed, Jan 24, 2024 at 12:34:10PM +0100, Jan Beulich wrote:
>>>>>> On 24.01.2024 10:24, Roger Pau Monné wrote:
>>>>>>> On Wed, Jan 24, 2024 at 09:48:35AM +0100, Jan Beulich wrote:
>>>>>>>> On 23.01.2024 16:07, Roger Pau Monné wrote:
>>>>>>>>> On Tue, Jan 23, 2024 at 03:32:12PM +0100, Jan Beulich wrote:
>>>>>>>>>> On 15.01.2024 20:43, Stewart Hildebrand wrote:
>>>>>>>>>>> @@ -2888,6 +2888,8 @@ int allocate_and_map_msi_pirq(struct domain *d, int index, int *pirq_p,
>>>>>>>>>>>  {
>>>>>>>>>>>      int irq, pirq, ret;
>>>>>>>>>>>  
>>>>>>>>>>> +    ASSERT(pcidevs_locked() || rw_is_locked(&d->pci_lock));
>>>>>>>>>>
>>>>>>>>>> If either lock is sufficient to hold here, ...
>>>>>>>>>>
>>>>>>>>>>> --- a/xen/arch/x86/physdev.c
>>>>>>>>>>> +++ b/xen/arch/x86/physdev.c
>>>>>>>>>>> @@ -123,7 +123,9 @@ int physdev_map_pirq(domid_t domid, int type, int *index, int *pirq_p,
>>>>>>>>>>>  
>>>>>>>>>>>      case MAP_PIRQ_TYPE_MSI:
>>>>>>>>>>>      case MAP_PIRQ_TYPE_MULTI_MSI:
>>>>>>>>>>> +        pcidevs_lock();
>>>>>>>>>>>          ret = allocate_and_map_msi_pirq(d, *index, pirq_p, type, msi);
>>>>>>>>>>> +        pcidevs_unlock();
>>>>>>>>>>>          break;
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>> IIRC (Stewart can further comment) this is done holding the pcidevs
>>>>>>>>> lock to keep the path unmodified, as there's no need to hold the
>>>>>>>>> per-domain rwlock.
>>>>>>>>
>>>>>>>> Yet why would we prefer to acquire a global lock when a per-domain one
>>>>>>>> suffices?
>>>>>>>
>>>>>>> I was hoping to introduce less changes, specially if they are not
>>>>>>> strictly required, as it's less risk.  I'm always quite worry of
>>>>>>> locking changes.
>>>>>>
>>>>>> In which case more description / code commenting is needed. The pattern
>>>>>> of the assertions looks dangerous.
>>>>>
>>>>> Is such dangerousness perception because you fear some of the pcidevs
>>>>> lock usage might be there not just for preventing the pdev from going
>>>>> away, but also to guarantee exclusive access to certain state?
>>>>
>>>> Indeed. In my view the main purpose of locks is to guard state. Their
>>>> use here to guard against devices here is imo rather an abuse; as
>>>> mentioned before this should instead be achieved e.g via refcounting.
>>>> And it's bad enough already that pcidevs_lock() alone has been abused
>>>> this way, without proper marking (leaving us to guess in many places).
>>>> It gets worse when a second lock can now also serve this same
>>>> purpose.
>>>
>>> The new lock is taken in read mode in most contexts, and hence can't
>>> be used to indirectly gain exclusive access to domain related
>>> structures in a safe way.
>>
>> Oh, right - I keep being misled by rw_is_locked(). This is a fair
>> argument. Irrespective it would feel better to me if an abstraction
>> construct was introduced; but seeing you don't like the idea I guess
>> I won't insist.
> 
> TBH I'm not going to argue against it if you and Stewart think it's
> clearer, but I also won't request the addition of such wrapper myself.
> 
> Thanks, Roger.

Overall, I think there are two sources of confusion:

    1. This patch is using the odd-looking ASSERTs to verify that it is safe to *read* d->pdev_list, and/or ensure a pdev does not go away or get reassigned. The purpose of these ASSERTs is not immediately obvious due to inadequate description.

    2. At first glance, the patch appears to be doing two things: using d->pci_lock for d->pdev_list/pdev protection, and using d->pci_lock for pdev->vpci protection.

Regarding #1, while the review experience could have been improved by introducing a wrapper construct, I think it would also (more importantly) be valuable to have such a wrapper for the sake of code readability. I think it is important to get this right and hopefully avoid/reduce potential future confusion. I'll add something like this in v13, e.g. in sched.h:

/* Ensure pdevs do not go away or get assigned to other domains. */
#define pdev_list_is_read_locked(d) ({                           \

        struct domain *d_ = (d);                                 \

        pcidevs_locked() || (d_ && rw_is_locked(&d_->pci_lock)); \

    })

Example use:

    ASSERT(pdev_list_is_read_locked(d));

Regarding #2, the patch description primarily talks about protecting the pdev->vpci field, and the d->pdev_list read / pdev reassignment protection seems an afterthought. However, the use of pcidevs_lock() for pdev protection is pre-existing. Now that vPCI callers are going to use d->pci_lock (instead of pcidevs_lock()), we are simultaneously changing the allowable mechanism for protecting d->pdev_list reads and pdevs going away or getting reassigned. I briefly experimented with splitting this into two separate patches, but I chose not to pursue this further because then we'd have a brief odd intermediate state, not to mention the additional test/review burden of evaluating each separate change independently. Keep in mind this patch as a whole has already been through much test/review, and at this point my primary focus is to improve readability and avoid confusion. I will plan to add appropriate description and rationale for v13.

Since I will be changing to use a wrapper construct and updating the descriptions, I will plan to drop Roger's R-b tag on this patch for v13.

Lastly, as has already been mentioned in the cover letter and reiterated in discussions here, for the non-vPCI code paths already using pcidevs_lock() I will plan to keep them using pcidevs_lock().


^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [PATCH v12 03/15] vpci: add hooks for PCI device assign/de-assign
  2024-01-09 21:51 ` [PATCH v12 03/15] vpci: add hooks for PCI device assign/de-assign Stewart Hildebrand
  2024-01-23 14:36   ` Jan Beulich
@ 2024-01-30 19:22   ` Stewart Hildebrand
  1 sibling, 0 replies; 68+ messages in thread
From: Stewart Hildebrand @ 2024-01-30 19:22 UTC (permalink / raw)
  To: xen-devel
  Cc: Oleksandr Andrushchenko, Jan Beulich, Paul Durrant,
	Roger Pau Monné, Volodymyr Babchuk

On 1/9/24 16:51, Stewart Hildebrand wrote:
> diff --git a/xen/drivers/passthrough/pci.c b/xen/drivers/passthrough/pci.c
> index 3a973324bca1..a902de6a8693 100644
> --- a/xen/drivers/passthrough/pci.c
> +++ b/xen/drivers/passthrough/pci.c
> @@ -1476,6 +1485,10 @@ static int assign_device(struct domain *d, u16 seg, u8 bus, u8 devfn, u32 flag)
>      if ( pdev->broken && d != hardware_domain && d != dom_io )
>          goto done;
>  
> +    write_lock(&pdev->domain->pci_lock);
> +    vpci_deassign_device(pdev);
> +    write_unlock(&pdev->domain->pci_lock);
> +
>      rc = pdev_msix_assign(d, pdev);
>      if ( rc )
>          goto done;
> @@ -1502,6 +1515,10 @@ static int assign_device(struct domain *d, u16 seg, u8 bus, u8 devfn, u32 flag)
>                          pci_to_dev(pdev), flag);
>      }
>  

After rebasing this on the following commit:

  cb4ecb3cc17b ("pci: fail device assignment if phantom functions cannot be assigned")

I'll add this here:

    if ( rc )
        goto done;

I'll plan on retaining Roger's R-b tag and and Jan's A-b tags for v13.

> +    write_lock(&d->pci_lock);
> +    rc = vpci_assign_device(pdev);
> +    write_unlock(&d->pci_lock);
> +
>   done:
>      if ( rc )
>          printk(XENLOG_G_WARNING "%pd: assign (%pp) failed (%d)\n",


^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [PATCH v12 10/15] vpci/header: emulate PCI_COMMAND register for guests
  2024-01-25 15:43   ` Jan Beulich
@ 2024-02-01  4:50     ` Stewart Hildebrand
  2024-02-01  8:14       ` Jan Beulich
  0 siblings, 1 reply; 68+ messages in thread
From: Stewart Hildebrand @ 2024-02-01  4:50 UTC (permalink / raw)
  To: Jan Beulich
  Cc: Oleksandr Andrushchenko, Roger Pau Monné, Andrew Cooper,
	George Dunlap, Julien Grall, Stefano Stabellini, Wei Liu,
	Volodymyr Babchuk, xen-devel

On 1/25/24 10:43, Jan Beulich wrote:
> On 09.01.2024 22:51, Stewart Hildebrand wrote:
>> --- a/xen/drivers/vpci/header.c
>> +++ b/xen/drivers/vpci/header.c
>> @@ -168,6 +168,9 @@ static void modify_decoding(const struct pci_dev *pdev, uint16_t cmd,
>>      if ( !rom_only )
>>      {
>>          pci_conf_write16(pdev->sbdf, PCI_COMMAND, cmd);
>> +        /* Show DomU that we updated P2M */
>> +        header->guest_cmd &= ~PCI_COMMAND_MEMORY;
>> +        header->guest_cmd |= cmd & PCI_COMMAND_MEMORY;
>>          header->bars_mapped = map;
>>      }
> 
> I don't follow what the comment means to say. The bit in question has no
> real connection to the P2M, and the guest also may have no notion of the
> underlying hypervisor's internals. Likely connected to ...

Indeed. If the comment survives to v13, I'll update it to:

        /* Now that we updated P2M, show DomU change to PCI_COMMAND_MEMORY */

> 
>> @@ -524,9 +527,26 @@ static void cf_check cmd_write(
>>  {
>>      struct vpci_header *header = data;
>>  
>> +    if ( !is_hardware_domain(pdev->domain) )
>> +    {
>> +        const struct vpci *vpci = pdev->vpci;
>> +
>> +        if ( (vpci->msi && vpci->msi->enabled) ||
>> +             (vpci->msix && vpci->msix->enabled) )
>> +            cmd |= PCI_COMMAND_INTX_DISABLE;
>> +
>> +        /*
>> +         * Do not show change to PCI_COMMAND_MEMORY bit until we finish
>> +         * modifying P2M mappings.
>> +         */
>> +        header->guest_cmd = (cmd & ~PCI_COMMAND_MEMORY) |
>> +                            (header->guest_cmd & PCI_COMMAND_MEMORY);
>> +    }
> 
> ... the comment here, but then shouldn't it be that the guest can't even
> issue a 2nd cfg space access until the present write has been carried out?
> Otherwise I'd be inclined to claim that such a partial update is unlikely
> to be spec-conformant.

Due to the raise_softirq() call added in

  3e568fa9e19c ("vpci: fix deferral of long operations")

my current understanding is: when the guest toggles memory decoding, the guest vcpu doesn't resume execution until vpci_process_pending() and modify_decoding() have finished. So I think the guest should see a consistent state of the register, unless it was trying to read from a different vcpu than the one doing the writing.

Regardless, if the guest did have an opportunity to successfully read the partially updated state of the register, I'm not really spotting what part of the spec that would be a violation of. PCIe 6.1 has this description regarding the bit: "When this bit is Set" and "When this bit is Clear" the device will decode (or not) memory accesses. The spec doesn't seem to distinguish whether the host or the device itself is the one to set/clear the bit. One might even try to argue the opposite: allowing the bit to be toggled before the device reflects the change would be a violation of spec. Since the spec is ambiguous in this regard, I don't think either argument is particularly strong.

Chesterton's fence: the logic for deferring the update of PCI_COMMAND_MEMORY in guest_cmd was added between v10 and v11 of this series. I went back to look at the review comments on v10 [1], but the rationale is still not entirely clear to me. At the end of the day, with the information I have at hand, I suspect it would be fine either way (whether updating guest_cmd is deferred or not). If no other info comes to light, I'm leaning toward not deferring because it would be simpler to update the bit right away in cmd_write().

[1] https://lore.kernel.org/xen-devel/ZVy73iJ3E8nJHvgf@macbook.local/

> 
>> @@ -843,6 +885,15 @@ static int cf_check init_header(struct pci_dev *pdev)
>>      if ( cmd & PCI_COMMAND_MEMORY )
>>          pci_conf_write16(pdev->sbdf, PCI_COMMAND, cmd & ~PCI_COMMAND_MEMORY);
>>  
>> +    /*
>> +     * Clear PCI_COMMAND_MEMORY and PCI_COMMAND_IO for DomUs, so they will
>> +     * always start with memory decoding disabled and to ensure that we will not
>> +     * call modify_bars() at the end of this function.
>> +     */
>> +    if ( !is_hwdom )
>> +        cmd &= ~(PCI_COMMAND_MEMORY | PCI_COMMAND_IO);
>> +    header->guest_cmd = cmd;
> 
> With PCI_COMMAND_MEMORY clear, the hw reg won't further be written on the
> success return path. Yet wouldn't we better clear PCI_COMMAND_IO also in
> hardware (until we properly support it)?

Yes, I'll clear PCI_COMMAND_IO in hardware too

> 
> I also think the insertion point for the new code isn't well chosen: The
> comment just out of context indicates that the code in context above is
> connected to the subsequent code. Whereas the addition is not.

I'll rearrange it

> 
>> --- a/xen/drivers/vpci/msi.c
>> +++ b/xen/drivers/vpci/msi.c
>> @@ -70,6 +70,15 @@ static void cf_check control_write(
>>  
>>          if ( vpci_msi_arch_enable(msi, pdev, vectors) )
>>              return;
>> +
>> +        /*
>> +         * Make sure domU doesn't enable INTx while enabling MSI.
>> +         */
> 
> Nit: This ought to be a single line comment, just like ...

OK, I'll make it a single line

> 
>> +        if ( !is_hardware_domain(pdev->domain) )
>> +        {
>> +            pci_intx(pdev, false);
>> +            pdev->vpci->header.guest_cmd |= PCI_COMMAND_INTX_DISABLE;
>> +        }
>>      }
>>      else
>>          vpci_msi_arch_disable(msi, pdev);
>> --- a/xen/drivers/vpci/msix.c
>> +++ b/xen/drivers/vpci/msix.c
>> @@ -135,6 +135,13 @@ static void cf_check control_write(
>>          }
>>      }
>>  
>> +    /* Make sure domU doesn't enable INTx while enabling MSI-X. */
>> +    if ( new_enabled && !msix->enabled && !is_hardware_domain(pdev->domain) )
>> +    {
>> +        pci_intx(pdev, false);
>> +        pdev->vpci->header.guest_cmd |= PCI_COMMAND_INTX_DISABLE;
>> +    }
> 
> ... the similar code here has it.
> 
> In both cases, is it really appropriate to set the bit in guest view?

I added this based on Roger's comment at [2]. Roger, what do you think? I don't believe QEMU updates the guest view in this manner.

[2] https://lore.kernel.org/xen-devel/ZLqI65gmNj1XDBm4@MacBook-Air-de-Roger.local/


^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [PATCH v12 10/15] vpci/header: emulate PCI_COMMAND register for guests
  2024-02-01  4:50     ` Stewart Hildebrand
@ 2024-02-01  8:14       ` Jan Beulich
  0 siblings, 0 replies; 68+ messages in thread
From: Jan Beulich @ 2024-02-01  8:14 UTC (permalink / raw)
  To: Stewart Hildebrand, Roger Pau Monné
  Cc: Oleksandr Andrushchenko, Andrew Cooper, George Dunlap,
	Julien Grall, Stefano Stabellini, Wei Liu, Volodymyr Babchuk,
	xen-devel

On 01.02.2024 05:50, Stewart Hildebrand wrote:
> On 1/25/24 10:43, Jan Beulich wrote:
>> On 09.01.2024 22:51, Stewart Hildebrand wrote:
>>> --- a/xen/drivers/vpci/header.c
>>> +++ b/xen/drivers/vpci/header.c
>>> @@ -168,6 +168,9 @@ static void modify_decoding(const struct pci_dev *pdev, uint16_t cmd,
>>>      if ( !rom_only )
>>>      {
>>>          pci_conf_write16(pdev->sbdf, PCI_COMMAND, cmd);
>>> +        /* Show DomU that we updated P2M */
>>> +        header->guest_cmd &= ~PCI_COMMAND_MEMORY;
>>> +        header->guest_cmd |= cmd & PCI_COMMAND_MEMORY;
>>>          header->bars_mapped = map;
>>>      }
>>
>> I don't follow what the comment means to say. The bit in question has no
>> real connection to the P2M, and the guest also may have no notion of the
>> underlying hypervisor's internals. Likely connected to ...
> 
> Indeed. If the comment survives to v13, I'll update it to:
> 
>         /* Now that we updated P2M, show DomU change to PCI_COMMAND_MEMORY */
> 
>>
>>> @@ -524,9 +527,26 @@ static void cf_check cmd_write(
>>>  {
>>>      struct vpci_header *header = data;
>>>  
>>> +    if ( !is_hardware_domain(pdev->domain) )
>>> +    {
>>> +        const struct vpci *vpci = pdev->vpci;
>>> +
>>> +        if ( (vpci->msi && vpci->msi->enabled) ||
>>> +             (vpci->msix && vpci->msix->enabled) )
>>> +            cmd |= PCI_COMMAND_INTX_DISABLE;
>>> +
>>> +        /*
>>> +         * Do not show change to PCI_COMMAND_MEMORY bit until we finish
>>> +         * modifying P2M mappings.
>>> +         */
>>> +        header->guest_cmd = (cmd & ~PCI_COMMAND_MEMORY) |
>>> +                            (header->guest_cmd & PCI_COMMAND_MEMORY);
>>> +    }
>>
>> ... the comment here, but then shouldn't it be that the guest can't even
>> issue a 2nd cfg space access until the present write has been carried out?
>> Otherwise I'd be inclined to claim that such a partial update is unlikely
>> to be spec-conformant.
> 
> Due to the raise_softirq() call added in
> 
>   3e568fa9e19c ("vpci: fix deferral of long operations")
> 
> my current understanding is: when the guest toggles memory decoding, the guest vcpu doesn't resume execution until vpci_process_pending() and modify_decoding() have finished. So I think the guest should see a consistent state of the register, unless it was trying to read from a different vcpu than the one doing the writing.
> 
> Regardless, if the guest did have an opportunity to successfully read the partially updated state of the register, I'm not really spotting what part of the spec that would be a violation of. PCIe 6.1 has this description regarding the bit: "When this bit is Set" and "When this bit is Clear" the device will decode (or not) memory accesses. The spec doesn't seem to distinguish whether the host or the device itself is the one to set/clear the bit. One might even try to argue the opposite: allowing the bit to be toggled before the device reflects the change would be a violation of spec. Since the spec is ambiguous in this regard, I don't think either argument is particularly strong.
> 
> Chesterton's fence: the logic for deferring the update of PCI_COMMAND_MEMORY in guest_cmd was added between v10 and v11 of this series. I went back to look at the review comments on v10 [1], but the rationale is still not entirely clear to me.

Indeed. The only sentence possibly hinting in such a direction would imo
have been "I'm kind of unsure whether we want to fake the guest view by
returning what the guest writes." It's unclear to me whether it really
was meant that way.

> At the end of the day, with the information I have at hand, I suspect it would be fine either way (whether updating guest_cmd is deferred or not). If no other info comes to light, I'm leaning toward not deferring because it would be simpler to update the bit right away in cmd_write().

I'm not sure it would be fine either way. Config space writes aren't
posted writes, so they complete synchronously. IOW whatever internal
state updates are needed in the device, they ought to have finished by
the time the write completes.

> [1] https://lore.kernel.org/xen-devel/ZVy73iJ3E8nJHvgf@macbook.local/
> 
>>[...]
>>> --- a/xen/drivers/vpci/msix.c
>>> +++ b/xen/drivers/vpci/msix.c
>>> @@ -135,6 +135,13 @@ static void cf_check control_write(
>>>          }
>>>      }
>>>  
>>> +    /* Make sure domU doesn't enable INTx while enabling MSI-X. */
>>> +    if ( new_enabled && !msix->enabled && !is_hardware_domain(pdev->domain) )
>>> +    {
>>> +        pci_intx(pdev, false);
>>> +        pdev->vpci->header.guest_cmd |= PCI_COMMAND_INTX_DISABLE;
>>> +    }
>>
>> ... the similar code here has it.
>>
>> In both cases, is it really appropriate to set the bit in guest view?
> 
> I added this based on Roger's comment at [2]. Roger, what do you think? I don't believe QEMU updates the guest view in this manner.
> 
> [2] https://lore.kernel.org/xen-devel/ZLqI65gmNj1XDBm4@MacBook-Air-de-Roger.local/

Leaving this for Roger to answer.

Jan


^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [PATCH v12 11/15] vpci: add initial support for virtual PCI bus topology
  2024-01-25 16:00   ` Jan Beulich
@ 2024-02-02  3:30     ` Stewart Hildebrand
  0 siblings, 0 replies; 68+ messages in thread
From: Stewart Hildebrand @ 2024-02-02  3:30 UTC (permalink / raw)
  To: Jan Beulich
  Cc: Oleksandr Andrushchenko, Andrew Cooper, George Dunlap,
	Julien Grall, Stefano Stabellini, Wei Liu, Roger Pau Monné,
	Volodymyr Babchuk, xen-devel

On 1/25/24 11:00, Jan Beulich wrote:
> On 09.01.2024 22:51, Stewart Hildebrand wrote:
>> --- a/xen/drivers/Kconfig
>> +++ b/xen/drivers/Kconfig
>> @@ -15,4 +15,8 @@ source "drivers/video/Kconfig"
>>  config HAS_VPCI
>>  	bool
>>  
>> +config HAS_VPCI_GUEST_SUPPORT
>> +	bool
>> +	depends on HAS_VPCI
> 
> Wouldn't this better be "select", or even just "imply"?

I prefer "select"

> 
>> --- a/xen/drivers/vpci/vpci.c
>> +++ b/xen/drivers/vpci/vpci.c
>> @@ -40,6 +40,49 @@ extern vpci_register_init_t *const __start_vpci_array[];
>>  extern vpci_register_init_t *const __end_vpci_array[];
>>  #define NUM_VPCI_INIT (__end_vpci_array - __start_vpci_array)
>>  
>> +#ifdef CONFIG_HAS_VPCI_GUEST_SUPPORT
>> +static int add_virtual_device(struct pci_dev *pdev)
>> +{
>> +    struct domain *d = pdev->domain;
>> +    unsigned int new_dev_number;
>> +
>> +    if ( is_hardware_domain(d) )
>> +        return 0;
>> +
>> +    ASSERT(rw_is_write_locked(&pdev->domain->pci_lock));
>> +
>> +    /*
>> +     * Each PCI bus supports 32 devices/slots at max or up to 256 when
>> +     * there are multi-function ones which are not yet supported.
>> +     */
>> +    if ( pdev->info.is_extfn && !pdev->info.is_virtfn )
>> +    {
>> +        gdprintk(XENLOG_ERR, "%pp: only function 0 passthrough supported\n",
>> +                 &pdev->sbdf);
> 
> The message suggests you ought to check pdev->devfn to have the low
> three bits clear. Yet what you check are two booleans.

I'll check pdev->sbdf.fn

> 
> Further doesn't this require the multi-function bit to be emulated
> clear?

I consider this to be future work. The header type register, where the
bit resides, is not yet emulated for domUs. I have a series in the works
for emulating additional registers (including PCI_HEADER_TYPE), but I'm
planning to wait to submit it until after the current series is finished
so as to not delay the current series any further.

> And finally don't you then also need to disallow assignment of
> devices with phantom functions?

No, I don't think so. My understanding is that there is no configuration
space associated with the phantom SBDFs. There's no special handling
required in vPCI per se, because the phantom function RIDs get mapped
in the IOMMU when the device gets assigned. Future work would include
exposing the PCI Express Capability, including device control register
with the phantom function enable bit. I say this having only done
limited testing of phantom functions on ARM, and by faking it using the
pci-phantom= Xen arg because I don't have a real device with phantom
function capability.

> 
>> --- a/xen/include/xen/sched.h
>> +++ b/xen/include/xen/sched.h
>> @@ -484,6 +484,14 @@ struct domain
>>       * 2. pdev->vpci->lock
>>       */
>>      rwlock_t pci_lock;
>> +#ifdef CONFIG_HAS_VPCI_GUEST_SUPPORT
>> +    /*
>> +     * The bitmap which shows which device numbers are already used by the
>> +     * virtual PCI bus topology and is used to assign a unique SBDF to the
>> +     * next passed through virtual PCI device.
>> +     */
>> +    DECLARE_BITMAP(vpci_dev_assigned_map, VPCI_MAX_VIRT_DEV);
>> +#endif
>>  #endif
> 
> With this the 2nd #endif would likely better gain a comment.

I will add it. Actually, I see no harm in adding a comment for both of
these #endifs.

> 
>> --- a/xen/include/xen/vpci.h
>> +++ b/xen/include/xen/vpci.h
>> @@ -21,6 +21,13 @@ typedef int vpci_register_init_t(struct pci_dev *dev);
>>  
>>  #define VPCI_ECAM_BDF(addr)     (((addr) & 0x0ffff000) >> 12)
>>  
>> +/*
>> + * Maximum number of devices supported by the virtual bus topology:
>> + * each PCI bus supports 32 devices/slots at max or up to 256 when
>> + * there are multi-function ones which are not yet supported.
>> + */
>> +#define VPCI_MAX_VIRT_DEV       (PCI_SLOT(~0) + 1)
> 
> The limit being this means only bus 0 / seg 0 is supported, which I
> think the comment would better also say. (In add_virtual_device(),
> which has a similar comment, there's then at least a 2nd one saying
> so.)

OK, I'll adjust the comment.


^ permalink raw reply	[flat|nested] 68+ messages in thread

end of thread, other threads:[~2024-02-02  3:30 UTC | newest]

Thread overview: 68+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2024-01-09 21:51 [PATCH v12 00/15] PCI devices passthrough on Arm, part 3 Stewart Hildebrand
2024-01-09 21:51 ` [PATCH v12 01/15] vpci: use per-domain PCI lock to protect vpci structure Stewart Hildebrand
2024-01-12 13:48   ` Roger Pau Monné
2024-01-12 17:54     ` Stewart Hildebrand
2024-01-12 18:14       ` [PATCH v12.1 " Stewart Hildebrand
2024-01-15  8:58         ` Jan Beulich
2024-01-15 15:42           ` Stewart Hildebrand
2024-01-15  8:53       ` [PATCH v12 " Roger Pau Monné
2024-01-15 15:08         ` Stewart Hildebrand
2024-01-15 19:43   ` [PATCH v12.2 " Stewart Hildebrand
2024-01-19 13:42     ` Roger Pau Monné
2024-01-23 14:26     ` Jan Beulich
2024-01-23 15:23       ` Roger Pau Monné
2024-01-24  8:56         ` Jan Beulich
2024-01-24  9:39           ` Roger Pau Monné
2024-01-23 14:29     ` Jan Beulich
2024-01-24  5:07       ` Stewart Hildebrand
2024-01-24  8:21         ` Roger Pau Monné
2024-01-24 20:21           ` Stewart Hildebrand
2024-01-24  8:50         ` Jan Beulich
2024-01-23 14:32     ` Jan Beulich
2024-01-23 15:07       ` Roger Pau Monné
2024-01-24  5:00         ` Stewart Hildebrand
2024-01-30 14:59           ` Stewart Hildebrand
2024-01-24  8:48         ` Jan Beulich
2024-01-24  9:24           ` Roger Pau Monné
2024-01-24 11:34             ` Jan Beulich
2024-01-24 17:51               ` Roger Pau Monné
2024-01-25  7:43                 ` Jan Beulich
2024-01-25  9:05                   ` Roger Pau Monné
2024-01-25 11:23                     ` Jan Beulich
2024-01-25 12:33                       ` Roger Pau Monné
2024-01-30 15:04                         ` Stewart Hildebrand
2024-01-09 21:51 ` [PATCH v12 02/15] vpci: restrict unhandled read/write operations for guests Stewart Hildebrand
2024-01-09 21:51 ` [PATCH v12 03/15] vpci: add hooks for PCI device assign/de-assign Stewart Hildebrand
2024-01-23 14:36   ` Jan Beulich
2024-01-30 19:22   ` Stewart Hildebrand
2024-01-09 21:51 ` [PATCH v12 04/15] vpci/header: rework exit path in init_header() Stewart Hildebrand
2024-01-09 21:51 ` [PATCH v12 05/15] vpci/header: implement guest BAR register handlers Stewart Hildebrand
2024-01-09 21:51 ` [PATCH v12 06/15] rangeset: add RANGESETF_no_print flag Stewart Hildebrand
2024-01-09 21:51 ` [PATCH v12 07/15] rangeset: add rangeset_purge() function Stewart Hildebrand
2024-01-10 10:00   ` Jan Beulich
2024-01-09 21:51 ` [PATCH v12 08/15] vpci/header: handle p2m range sets per BAR Stewart Hildebrand
2024-01-09 21:51 ` [PATCH v12 09/15] vpci/header: program p2m with guest BAR view Stewart Hildebrand
2024-01-12 15:06   ` Roger Pau Monné
2024-01-12 20:31     ` Stewart Hildebrand
2024-01-12 20:49       ` [PATCH v12.1 " Stewart Hildebrand
2024-01-15  9:07     ` [PATCH v12 " Jan Beulich
2024-01-15 19:03       ` Stewart Hildebrand
2024-01-15 19:44   ` [PATCH v12.2 " Stewart Hildebrand
2024-01-17  3:01     ` Stewart Hildebrand
2024-01-19 13:43       ` Roger Pau Monné
2024-01-19 14:28   ` [PATCH v12.3 " Stewart Hildebrand
2024-01-09 21:51 ` [PATCH v12 10/15] vpci/header: emulate PCI_COMMAND register for guests Stewart Hildebrand
2024-01-25 15:43   ` Jan Beulich
2024-02-01  4:50     ` Stewart Hildebrand
2024-02-01  8:14       ` Jan Beulich
2024-01-09 21:51 ` [PATCH v12 11/15] vpci: add initial support for virtual PCI bus topology Stewart Hildebrand
2024-01-12 11:46   ` George Dunlap
2024-01-12 13:50     ` Stewart Hildebrand
2024-01-15 11:48       ` George Dunlap
2024-01-25 16:00   ` Jan Beulich
2024-02-02  3:30     ` Stewart Hildebrand
2024-01-09 21:51 ` [PATCH v12 12/15] xen/arm: translate virtual PCI bus topology for guests Stewart Hildebrand
2024-01-09 21:51 ` [PATCH v12 13/15] xen/arm: account IO handlers for emulated PCI MSI-X Stewart Hildebrand
2024-01-09 21:51 ` [PATCH v12 14/15] xen/arm: vpci: permit access to guest vpci space Stewart Hildebrand
2024-01-17  3:03   ` Stewart Hildebrand
2024-01-09 21:51 ` [PATCH v12 15/15] arm/vpci: honor access size when returning an error Stewart Hildebrand

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.