All of lore.kernel.org
 help / color / mirror / Atom feed
* [PATCH v2 0/6] x86/iommu: improve setup time of hwdom IOMMU
@ 2023-12-04  9:42 Roger Pau Monne
  2023-12-04  9:43 ` [PATCH v2 1/6] iommu/vt-d: do not assume page table levels for quarantine domain Roger Pau Monne
                   ` (5 more replies)
  0 siblings, 6 replies; 24+ messages in thread
From: Roger Pau Monne @ 2023-12-04  9:42 UTC (permalink / raw)
  To: xen-devel
  Cc: Roger Pau Monne, Kevin Tian, Jan Beulich, Andrew Cooper,
	Paul Durrant, Wei Liu, Lukasz Hawrylko, Daniel P. Smith,
	Mateusz Mówka

Hello,

The aim of the series is to reduce boot time setup of IOMMU page tables
for dom0.

The first patch is completely unrelated leftover work from XSA-445, just
included in the series because it's IOMMU code.

Second patch is a pre-req, as further patches can end up attempting to
create maps above the max RAM address, and hence without properly
setting the IOMMU page tables levels those attempts to map would fail.

Last 4 patches rework the hardware domain IOMMU setup to use a rangeset
instead of iterating over all addresses up to the max RAM page.  See
patch 5/6 for performance figures.

Thanks, Roger.

Roger Pau Monne (6):
  iommu/vt-d: do not assume page table levels for quarantine domain
  amd-vi: set IOMMU page table levels based on guest reported paddr
    width
  x86/iommu: introduce a rangeset to perform hwdom IOMMU setup
  x86/iommu: remove regions not to be mapped
  x86/iommu: switch hwdom IOMMU to use a rangeset
  x86/iommu: cleanup unused functions

 xen/arch/x86/hvm/io.c                       |  16 ++
 xen/arch/x86/include/asm/hvm/io.h           |   4 +-
 xen/arch/x86/include/asm/setup.h            |   2 +-
 xen/arch/x86/setup.c                        |  81 +++---
 xen/arch/x86/tboot.c                        |   2 +-
 xen/drivers/passthrough/amd/pci_amd_iommu.c |  20 +-
 xen/drivers/passthrough/vtd/iommu.c         |   2 +-
 xen/drivers/passthrough/x86/iommu.c         | 279 +++++++++++++-------
 8 files changed, 248 insertions(+), 158 deletions(-)

-- 
2.43.0



^ permalink raw reply	[flat|nested] 24+ messages in thread

* [PATCH v2 1/6] iommu/vt-d: do not assume page table levels for quarantine domain
  2023-12-04  9:42 [PATCH v2 0/6] x86/iommu: improve setup time of hwdom IOMMU Roger Pau Monne
@ 2023-12-04  9:43 ` Roger Pau Monne
  2023-12-05 14:24   ` Jan Beulich
  2023-12-04  9:43 ` [PATCH v2 2/6] amd-vi: set IOMMU page table levels based on guest reported paddr width Roger Pau Monne
                   ` (4 subsequent siblings)
  5 siblings, 1 reply; 24+ messages in thread
From: Roger Pau Monne @ 2023-12-04  9:43 UTC (permalink / raw)
  To: xen-devel; +Cc: Roger Pau Monne, Kevin Tian

Like XSA-445, do not assume IOMMU page table levels on VT-d are always set
based on DEFAULT_DOMAIN_ADDRESS_WIDTH and instead fetch the value set by
intel_iommu_hwdom_init() from the domain iommu structure.  This prevents
changes to intel_iommu_hwdom_init() possibly getting the levels out of sync
with what intel_iommu_quarantine_init() expects.

No functional change, since on Intel domains are hardcoded to use
DEFAULT_DOMAIN_ADDRESS_WIDTH.

Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
---
Changes since v1:
 - New in this version.
---
 xen/drivers/passthrough/vtd/iommu.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/xen/drivers/passthrough/vtd/iommu.c b/xen/drivers/passthrough/vtd/iommu.c
index e13b7d99db40..bc6181c9f911 100644
--- a/xen/drivers/passthrough/vtd/iommu.c
+++ b/xen/drivers/passthrough/vtd/iommu.c
@@ -3162,7 +3162,7 @@ static int cf_check intel_iommu_quarantine_init(struct pci_dev *pdev,
 {
     struct domain_iommu *hd = dom_iommu(dom_io);
     struct page_info *pg;
-    unsigned int agaw = width_to_agaw(DEFAULT_DOMAIN_ADDRESS_WIDTH);
+    unsigned int agaw = hd->arch.vtd.agaw;
     unsigned int level = agaw_to_level(agaw);
     const struct acpi_drhd_unit *drhd;
     const struct acpi_rmrr_unit *rmrr;
-- 
2.43.0



^ permalink raw reply related	[flat|nested] 24+ messages in thread

* [PATCH v2 2/6] amd-vi: set IOMMU page table levels based on guest reported paddr width
  2023-12-04  9:42 [PATCH v2 0/6] x86/iommu: improve setup time of hwdom IOMMU Roger Pau Monne
  2023-12-04  9:43 ` [PATCH v2 1/6] iommu/vt-d: do not assume page table levels for quarantine domain Roger Pau Monne
@ 2023-12-04  9:43 ` Roger Pau Monne
  2023-12-05 14:32   ` Jan Beulich
  2023-12-04  9:43 ` [PATCH v2 3/6] x86/iommu: introduce a rangeset to perform hwdom IOMMU setup Roger Pau Monne
                   ` (3 subsequent siblings)
  5 siblings, 1 reply; 24+ messages in thread
From: Roger Pau Monne @ 2023-12-04  9:43 UTC (permalink / raw)
  To: xen-devel; +Cc: Roger Pau Monne, Jan Beulich, Andrew Cooper

However take into account the minimum number of levels required by unity maps
when setting the page table levels.

The previous setting of the page table levels for PV guests based on the
highest RAM address was bogus, as there can be other non-RAM regions past the
highest RAM address that need to be mapped, for example device MMIO.

For HVM we also take amd_iommu_min_paging_mode into account, however if unity
maps require more than 4 levels attempting to add those will currently fail at
the p2m level, as 4 levels is the maximum supported.

Fixes: 0700c962ac2d ('Add AMD IOMMU support into hypervisor')
Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
---
changes since v1:
 - Use paging_max_paddr_bits() instead of hardcoding
   DEFAULT_DOMAIN_ADDRESS_WIDTH.
---
 xen/drivers/passthrough/amd/pci_amd_iommu.c | 20 ++++++++------------
 1 file changed, 8 insertions(+), 12 deletions(-)

diff --git a/xen/drivers/passthrough/amd/pci_amd_iommu.c b/xen/drivers/passthrough/amd/pci_amd_iommu.c
index 6bc73dc21052..00a25e649f22 100644
--- a/xen/drivers/passthrough/amd/pci_amd_iommu.c
+++ b/xen/drivers/passthrough/amd/pci_amd_iommu.c
@@ -359,21 +359,17 @@ int __read_mostly amd_iommu_min_paging_mode = 1;
 static int cf_check amd_iommu_domain_init(struct domain *d)
 {
     struct domain_iommu *hd = dom_iommu(d);
+    int pglvl = amd_iommu_get_paging_mode(
+                PFN_DOWN(1UL << paging_max_paddr_bits(d)));
+
+    if ( pglvl < 0 )
+        return pglvl;
 
     /*
-     * Choose the number of levels for the IOMMU page tables.
-     * - PV needs 3 or 4, depending on whether there is RAM (including hotplug
-     *   RAM) above the 512G boundary.
-     * - HVM could in principle use 3 or 4 depending on how much guest
-     *   physical address space we give it, but this isn't known yet so use 4
-     *   unilaterally.
-     * - Unity maps may require an even higher number.
+     * Choose the number of levels for the IOMMU page tables, taking into
+     * account unity maps.
      */
-    hd->arch.amd.paging_mode = max(amd_iommu_get_paging_mode(
-            is_hvm_domain(d)
-            ? 1UL << (DEFAULT_DOMAIN_ADDRESS_WIDTH - PAGE_SHIFT)
-            : get_upper_mfn_bound() + 1),
-        amd_iommu_min_paging_mode);
+    hd->arch.amd.paging_mode = max(pglvl, amd_iommu_min_paging_mode);
 
     return 0;
 }
-- 
2.43.0



^ permalink raw reply related	[flat|nested] 24+ messages in thread

* [PATCH v2 3/6] x86/iommu: introduce a rangeset to perform hwdom IOMMU setup
  2023-12-04  9:42 [PATCH v2 0/6] x86/iommu: improve setup time of hwdom IOMMU Roger Pau Monne
  2023-12-04  9:43 ` [PATCH v2 1/6] iommu/vt-d: do not assume page table levels for quarantine domain Roger Pau Monne
  2023-12-04  9:43 ` [PATCH v2 2/6] amd-vi: set IOMMU page table levels based on guest reported paddr width Roger Pau Monne
@ 2023-12-04  9:43 ` Roger Pau Monne
  2023-12-05 14:50   ` Jan Beulich
  2023-12-04  9:43 ` [PATCH v2 4/6] x86/iommu: remove regions not to be mapped Roger Pau Monne
                   ` (2 subsequent siblings)
  5 siblings, 1 reply; 24+ messages in thread
From: Roger Pau Monne @ 2023-12-04  9:43 UTC (permalink / raw)
  To: xen-devel; +Cc: Roger Pau Monne, Jan Beulich, Paul Durrant

This change just introduces the boilerplate code in order to use a rangeset
when setting up the hardware domain IOMMU mappings.  The rangeset is never
populated in this patch, so it's a non-functional change as far as the mappings
the domain gets established.

Note there's a change for HVM domains (ie: PVH dom0) that will get switched to
create the p2m mappings using map_mmio_regions() instead of
p2m_add_identity_entry(), so that ranges can be mapped with a single function
call if possible.  Note that the interface of map_mmio_regions() doesn't allow
creating read-only mappings, but so far there are no such mappings created for
PVH dom0 in arch_iommu_hwdom_init().

Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
---
Changes since v1:
 - Split from bigger patch.
---
 xen/drivers/passthrough/x86/iommu.c | 89 +++++++++++++++++++++++++++++
 1 file changed, 89 insertions(+)

diff --git a/xen/drivers/passthrough/x86/iommu.c b/xen/drivers/passthrough/x86/iommu.c
index 857dccb6a465..531a428f6496 100644
--- a/xen/drivers/passthrough/x86/iommu.c
+++ b/xen/drivers/passthrough/x86/iommu.c
@@ -370,10 +370,77 @@ static unsigned int __hwdom_init hwdom_iommu_map(const struct domain *d,
     return perms;
 }
 
+struct map_data {
+    struct domain *d;
+    unsigned int flush_flags;
+    bool ro;
+};
+
+static int __hwdom_init cf_check identity_map(unsigned long s, unsigned long e,
+                                              void *data)
+{
+    struct map_data *info = data;
+    struct domain *d = info->d;
+    long rc;
+
+    if ( iommu_verbose )
+        printk(XENLOG_INFO " [%010lx, %010lx] R%c\n",
+               s, e, info->ro ? 'O' : 'W');
+
+    if ( paging_mode_translate(d) )
+    {
+        if ( info->ro )
+        {
+            ASSERT_UNREACHABLE();
+            return 0;
+        }
+        while ( (rc = map_mmio_regions(d, _gfn(s), e - s + 1, _mfn(s))) > 0 )
+        {
+            s += rc;
+            process_pending_softirqs();
+        }
+    }
+    else
+    {
+        const unsigned int perms = IOMMUF_readable | IOMMUF_preempt |
+                                   (info->ro ? 0 : IOMMUF_writable);
+
+        if ( info->ro && !iomem_access_permitted(d, s, e) )
+        {
+            /*
+             * Should be more fine grained in order to not map the forbidden
+             * frame instead of rejecting the region as a whole, but it's only
+             * for read-only MMIO regions, which are very limited.
+             */
+            printk(XENLOG_DEBUG
+                   "IOMMU read-only mapping of region [%lx, %lx] forbidden\n",
+                   s, e);
+            return 0;
+        }
+        while ( (rc = iommu_map(d, _dfn(s), _mfn(s), e - s + 1,
+                                perms, &info->flush_flags)) > 0 )
+        {
+            s += rc;
+            process_pending_softirqs();
+        }
+    }
+    ASSERT(rc <= 0);
+    if ( rc )
+        printk(XENLOG_WARNING
+               "IOMMU identity mapping of [%lx, %lx] failed: %ld\n",
+               s, e, rc);
+
+    /* Ignore errors and attempt to map the remaining regions. */
+    return 0;
+}
+
 void __hwdom_init arch_iommu_hwdom_init(struct domain *d)
 {
     unsigned long i, top, max_pfn, start, count;
     unsigned int flush_flags = 0, start_perms = 0;
+    struct rangeset *map;
+    struct map_data map_data = { .d = d };
+    int rc;
 
     BUG_ON(!is_hardware_domain(d));
 
@@ -397,6 +464,10 @@ void __hwdom_init arch_iommu_hwdom_init(struct domain *d)
     if ( iommu_hwdom_passthrough )
         return;
 
+    map = rangeset_new(NULL, NULL, 0);
+    if ( !map )
+        panic("IOMMU init: unable to allocate rangeset\n");
+
     max_pfn = (GB(4) >> PAGE_SHIFT) - 1;
     top = max(max_pdx, pfn_to_pdx(max_pfn) + 1);
 
@@ -451,6 +522,24 @@ void __hwdom_init arch_iommu_hwdom_init(struct domain *d)
             goto commit;
     }
 
+    if ( iommu_verbose )
+        printk(XENLOG_INFO "d%u: identity mappings for IOMMU:\n",
+               d->domain_id);
+
+    rc = rangeset_report_ranges(map, 0, ~0UL, identity_map, &map_data);
+    if ( rc )
+        panic("IOMMU unable to create mappings: %d\n", rc);
+    if ( is_pv_domain(d) )
+    {
+        map_data.ro = true;
+        rc = rangeset_report_ranges(mmio_ro_ranges, 0, ~0UL, identity_map,
+                                    &map_data);
+        if ( rc )
+            panic("IOMMU unable to create read-only mappings: %d\n", rc);
+    }
+
+    rangeset_destroy(map);
+
     /* Use if to avoid compiler warning */
     if ( iommu_iotlb_flush_all(d, flush_flags) )
         return;
-- 
2.43.0



^ permalink raw reply related	[flat|nested] 24+ messages in thread

* [PATCH v2 4/6] x86/iommu: remove regions not to be mapped
  2023-12-04  9:42 [PATCH v2 0/6] x86/iommu: improve setup time of hwdom IOMMU Roger Pau Monne
                   ` (2 preceding siblings ...)
  2023-12-04  9:43 ` [PATCH v2 3/6] x86/iommu: introduce a rangeset to perform hwdom IOMMU setup Roger Pau Monne
@ 2023-12-04  9:43 ` Roger Pau Monne
  2023-12-05 15:11   ` Jan Beulich
  2023-12-04  9:43 ` [PATCH v2 5/6] x86/iommu: switch hwdom IOMMU to use a rangeset Roger Pau Monne
  2023-12-04  9:43 ` [PATCH v2 6/6] x86/iommu: cleanup unused functions Roger Pau Monne
  5 siblings, 1 reply; 24+ messages in thread
From: Roger Pau Monne @ 2023-12-04  9:43 UTC (permalink / raw)
  To: xen-devel
  Cc: Roger Pau Monne, Paul Durrant, Jan Beulich, Andrew Cooper,
	Wei Liu

Introduce the code to remove regions not to be mapped from the rangeset
that will be used to setup the IOMMU page tables for the hardware domain.

This change also introduces two new functions: remove_xen_ranges() and
vpci_subtract_mmcfg() that copy the logic in xen_in_range() and
vpci_is_mmcfg_address() respectively and remove the ranges that would otherwise
be intercepted by the original functions.

Note that the rangeset is still not populated.

Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
---
Changes since v1:
 - Split from bigger patch.
---
 xen/arch/x86/hvm/io.c               | 16 ++++++++
 xen/arch/x86/include/asm/hvm/io.h   |  3 ++
 xen/arch/x86/include/asm/setup.h    |  1 +
 xen/arch/x86/setup.c                | 48 ++++++++++++++++++++++
 xen/drivers/passthrough/x86/iommu.c | 64 +++++++++++++++++++++++++++++
 5 files changed, 132 insertions(+)

diff --git a/xen/arch/x86/hvm/io.c b/xen/arch/x86/hvm/io.c
index d75af83ad01f..a42854c52b65 100644
--- a/xen/arch/x86/hvm/io.c
+++ b/xen/arch/x86/hvm/io.c
@@ -369,6 +369,22 @@ bool vpci_is_mmcfg_address(const struct domain *d, paddr_t addr)
     return vpci_mmcfg_find(d, addr);
 }
 
+int __hwdom_init vpci_subtract_mmcfg(const struct domain *d, struct rangeset *r)
+{
+    const struct hvm_mmcfg *mmcfg;
+
+    list_for_each_entry ( mmcfg, &d->arch.hvm.mmcfg_regions, next )
+    {
+        int rc = rangeset_remove_range(r, PFN_DOWN(mmcfg->addr),
+                                       PFN_DOWN(mmcfg->addr + mmcfg->size - 1));
+
+        if ( rc )
+            return rc;
+    }
+
+    return 0;
+}
+
 static unsigned int vpci_mmcfg_decode_addr(const struct hvm_mmcfg *mmcfg,
                                            paddr_t addr, pci_sbdf_t *sbdf)
 {
diff --git a/xen/arch/x86/include/asm/hvm/io.h b/xen/arch/x86/include/asm/hvm/io.h
index a97731657801..e1e5e6fe7491 100644
--- a/xen/arch/x86/include/asm/hvm/io.h
+++ b/xen/arch/x86/include/asm/hvm/io.h
@@ -156,6 +156,9 @@ void destroy_vpci_mmcfg(struct domain *d);
 /* Check if an address is between a MMCFG region for a domain. */
 bool vpci_is_mmcfg_address(const struct domain *d, paddr_t addr);
 
+/* Remove MMCFG regions from a given rangeset. */
+int vpci_subtract_mmcfg(const struct domain *d, struct rangeset *r);
+
 #endif /* __ASM_X86_HVM_IO_H__ */
 
 
diff --git a/xen/arch/x86/include/asm/setup.h b/xen/arch/x86/include/asm/setup.h
index 9a460e4db8f4..cd07d98101d8 100644
--- a/xen/arch/x86/include/asm/setup.h
+++ b/xen/arch/x86/include/asm/setup.h
@@ -37,6 +37,7 @@ void discard_initial_images(void);
 void *bootstrap_map(const module_t *mod);
 
 int xen_in_range(unsigned long mfn);
+int remove_xen_ranges(struct rangeset *r);
 
 extern uint8_t kbd_shift_flags;
 
diff --git a/xen/arch/x86/setup.c b/xen/arch/x86/setup.c
index 3cba2be0af6c..71fa0b46f181 100644
--- a/xen/arch/x86/setup.c
+++ b/xen/arch/x86/setup.c
@@ -2136,6 +2136,54 @@ int __hwdom_init xen_in_range(unsigned long mfn)
     return 0;
 }
 
+int __hwdom_init remove_xen_ranges(struct rangeset *r)
+{
+    paddr_t start, end;
+    int rc;
+
+    /* S3 resume code (and other real mode trampoline code) */
+    rc = rangeset_remove_range(r, PFN_DOWN(bootsym_phys(trampoline_start)),
+                               PFN_DOWN(bootsym_phys(trampoline_end)));
+    if ( rc )
+        return rc;
+
+    /*
+     * This needs to remain in sync with the uses of the same symbols in
+     * - __start_xen()
+     * - is_xen_fixed_mfn()
+     * - tboot_shutdown()
+     */
+    /* hypervisor .text + .rodata */
+    rc = rangeset_remove_range(r, PFN_DOWN(__pa(&_stext)),
+                               PFN_DOWN(__pa(&__2M_rodata_end)));
+    if ( rc )
+        return rc;
+
+    /* hypervisor .data + .bss */
+    if ( efi_boot_mem_unused(&start, &end) )
+    {
+        ASSERT(__pa(start) >= __pa(&__2M_rwdata_start));
+        rc = rangeset_remove_range(r, PFN_DOWN(__pa(&__2M_rwdata_start)),
+                                   PFN_DOWN(__pa(start)));
+        if ( rc )
+            return rc;
+        ASSERT(__pa(end) <= __pa(&__2M_rwdata_end));
+        rc = rangeset_remove_range(r, PFN_DOWN(__pa(end)),
+                                   PFN_DOWN(__pa(&__2M_rwdata_end)));
+        if ( rc )
+            return rc;
+    }
+    else
+    {
+        rc = rangeset_remove_range(r, PFN_DOWN(__pa(&__2M_rwdata_start)),
+                                   PFN_DOWN(__pa(&__2M_rwdata_end)));
+        if ( rc )
+            return rc;
+    }
+
+    return 0;
+}
+
 static int __hwdom_init cf_check io_bitmap_cb(
     unsigned long s, unsigned long e, void *ctx)
 {
diff --git a/xen/drivers/passthrough/x86/iommu.c b/xen/drivers/passthrough/x86/iommu.c
index 531a428f6496..7e97805fccec 100644
--- a/xen/drivers/passthrough/x86/iommu.c
+++ b/xen/drivers/passthrough/x86/iommu.c
@@ -370,6 +370,14 @@ static unsigned int __hwdom_init hwdom_iommu_map(const struct domain *d,
     return perms;
 }
 
+static int __hwdom_init cf_check map_subtract(unsigned long s, unsigned long e,
+                                              void *data)
+{
+    struct rangeset *map = data;
+
+    return rangeset_remove_range(map, s, e);
+}
+
 struct map_data {
     struct domain *d;
     unsigned int flush_flags;
@@ -522,6 +530,62 @@ void __hwdom_init arch_iommu_hwdom_init(struct domain *d)
             goto commit;
     }
 
+    /* Remove any areas in-use by Xen. */
+    rc = remove_xen_ranges(map);
+    if ( rc )
+        panic("IOMMU failed to remove Xen ranges: %d\n", rc);
+
+    /* Remove any overlap with the Interrupt Address Range. */
+    rc = rangeset_remove_range(map, 0xfee00, 0xfeeff);
+    if ( rc )
+        panic("IOMMU failed to remove Interrupt Address Range: %d\n",
+              rc);
+
+    /* If emulating IO-APIC(s) make sure the base address is unmapped. */
+    if ( has_vioapic(d) )
+    {
+        for ( i = 0; i < d->arch.hvm.nr_vioapics; i++ )
+        {
+            rc = rangeset_remove_singleton(map,
+                PFN_DOWN(domain_vioapic(d, i)->base_address));
+            if ( rc )
+                panic("IOMMU failed to remove IO-APIC: %d\n",
+                      rc);
+        }
+    }
+
+    if ( is_pv_domain(d) )
+    {
+        /*
+         * Be consistent with CPU mappings: Dom0 is permitted to establish r/o
+         * ones there (also for e.g. HPET in certain cases), so it should also
+         * have such established for IOMMUs.  Remove any read-only ranges here,
+         * since ranges in mmio_ro_ranges are already explicitly mapped below
+         * in read-only mode.
+         */
+        rc = rangeset_report_ranges(mmio_ro_ranges, 0, ~0UL, map_subtract, map);
+        if ( rc )
+            panic("IOMMU failed to remove read-only regions: %d\n",
+                  rc);
+    }
+
+    if ( has_vpci(d) )
+    {
+        /*
+         * TODO: runtime added MMCFG regions are not checked to make sure they
+         * don't overlap with already mapped regions, thus preventing trapping.
+         */
+        rc = vpci_subtract_mmcfg(d, map);
+        if ( rc )
+            panic("IOMMU unable to remove MMCFG areas: %d\n", rc);
+    }
+
+    /* Remove any regions past the last address addressable by the domain. */
+    rc = rangeset_remove_range(map, PFN_DOWN(1UL << paging_max_paddr_bits(d)),
+                               ~0UL);
+    if ( rc )
+        panic("IOMMU unable to remove unaddressable ranges: %d\n", rc);
+
     if ( iommu_verbose )
         printk(XENLOG_INFO "d%u: identity mappings for IOMMU:\n",
                d->domain_id);
-- 
2.43.0



^ permalink raw reply related	[flat|nested] 24+ messages in thread

* [PATCH v2 5/6] x86/iommu: switch hwdom IOMMU to use a rangeset
  2023-12-04  9:42 [PATCH v2 0/6] x86/iommu: improve setup time of hwdom IOMMU Roger Pau Monne
                   ` (3 preceding siblings ...)
  2023-12-04  9:43 ` [PATCH v2 4/6] x86/iommu: remove regions not to be mapped Roger Pau Monne
@ 2023-12-04  9:43 ` Roger Pau Monne
  2023-12-05 15:27   ` Jan Beulich
  2023-12-04  9:43 ` [PATCH v2 6/6] x86/iommu: cleanup unused functions Roger Pau Monne
  5 siblings, 1 reply; 24+ messages in thread
From: Roger Pau Monne @ 2023-12-04  9:43 UTC (permalink / raw)
  To: xen-devel; +Cc: Roger Pau Monne, Jan Beulich, Paul Durrant

The current loop that iterates from 0 to the maximum RAM address in order to
setup the IOMMU mappings is highly inefficient, and it will get worse as the
amount of RAM increases.  It's also not accounting for any reserved regions
past the last RAM address.

Instead of iterating over memory addresses, iterate over the memory map regions
and use a rangeset in order to keep track of which ranges need to be identity
mapped in the hardware domain physical address space.

On an AMD EPYC 7452 with 512GiB of RAM, the time to execute
arch_iommu_hwdom_init() in nanoseconds is:

x old
+ new
    N           Min           Max        Median           Avg        Stddev
x   5 2.2364154e+10  2.338244e+10 2.2474685e+10 2.2622409e+10 4.2949869e+08
+   5       1025012       1033036       1026188     1028276.2     3623.1194
Difference at 95.0% confidence
        -2.26214e+10 +/- 4.42931e+08
        -99.9955% +/- 9.05152e-05%
        (Student's t, pooled s = 3.03701e+08)

Execution time of arch_iommu_hwdom_init() goes down from ~22s to ~0.001s.

Note there's a change for HVM domains (ie: PVH dom0) that get switched to
create the p2m mappings using map_mmio_regions() instead of
p2m_add_identity_entry(), so that ranges can be mapped with a single function
call if possible.  Note that the interface of map_mmio_regions() doesn't
allow creating read-only mappings, but so far there are no such mappings
created for PVH dom0 in arch_iommu_hwdom_init().

No change intended in the resulting mappings that a hardware domain gets.

Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
---
Changes since v1:
 - Split from bigger patch.
 - Remove unneeded default case.
---
 xen/drivers/passthrough/x86/iommu.c | 157 ++++++++--------------------
 1 file changed, 42 insertions(+), 115 deletions(-)

diff --git a/xen/drivers/passthrough/x86/iommu.c b/xen/drivers/passthrough/x86/iommu.c
index 7e97805fccec..81d6129189d0 100644
--- a/xen/drivers/passthrough/x86/iommu.c
+++ b/xen/drivers/passthrough/x86/iommu.c
@@ -300,76 +300,6 @@ void iommu_identity_map_teardown(struct domain *d)
     }
 }
 
-static unsigned int __hwdom_init hwdom_iommu_map(const struct domain *d,
-                                                 unsigned long pfn,
-                                                 unsigned long max_pfn)
-{
-    mfn_t mfn = _mfn(pfn);
-    unsigned int i, type, perms = IOMMUF_readable | IOMMUF_writable;
-
-    /*
-     * Set up 1:1 mapping for dom0. Default to include only conventional RAM
-     * areas and let RMRRs include needed reserved regions. When set, the
-     * inclusive mapping additionally maps in every pfn up to 4GB except those
-     * that fall in unusable ranges for PV Dom0.
-     */
-    if ( (pfn > max_pfn && !mfn_valid(mfn)) || xen_in_range(pfn) )
-        return 0;
-
-    switch ( type = page_get_ram_type(mfn) )
-    {
-    case RAM_TYPE_UNUSABLE:
-        return 0;
-
-    case RAM_TYPE_CONVENTIONAL:
-        if ( iommu_hwdom_strict )
-            return 0;
-        break;
-
-    default:
-        if ( type & RAM_TYPE_RESERVED )
-        {
-            if ( !iommu_hwdom_inclusive && !iommu_hwdom_reserved )
-                perms = 0;
-        }
-        else if ( is_hvm_domain(d) )
-            return 0;
-        else if ( !iommu_hwdom_inclusive || pfn > max_pfn )
-            perms = 0;
-    }
-
-    /* Check that it doesn't overlap with the Interrupt Address Range. */
-    if ( pfn >= 0xfee00 && pfn <= 0xfeeff )
-        return 0;
-    /* ... or the IO-APIC */
-    if ( has_vioapic(d) )
-    {
-        for ( i = 0; i < d->arch.hvm.nr_vioapics; i++ )
-            if ( pfn == PFN_DOWN(domain_vioapic(d, i)->base_address) )
-                return 0;
-    }
-    else if ( is_pv_domain(d) )
-    {
-        /*
-         * Be consistent with CPU mappings: Dom0 is permitted to establish r/o
-         * ones there (also for e.g. HPET in certain cases), so it should also
-         * have such established for IOMMUs.
-         */
-        if ( iomem_access_permitted(d, pfn, pfn) &&
-             rangeset_contains_singleton(mmio_ro_ranges, pfn) )
-            perms = IOMMUF_readable;
-    }
-    /*
-     * ... or the PCIe MCFG regions.
-     * TODO: runtime added MMCFG regions are not checked to make sure they
-     * don't overlap with already mapped regions, thus preventing trapping.
-     */
-    if ( has_vpci(d) && vpci_is_mmcfg_address(d, pfn_to_paddr(pfn)) )
-        return 0;
-
-    return perms;
-}
-
 static int __hwdom_init cf_check map_subtract(unsigned long s, unsigned long e,
                                               void *data)
 {
@@ -444,8 +374,8 @@ static int __hwdom_init cf_check identity_map(unsigned long s, unsigned long e,
 
 void __hwdom_init arch_iommu_hwdom_init(struct domain *d)
 {
-    unsigned long i, top, max_pfn, start, count;
-    unsigned int flush_flags = 0, start_perms = 0;
+    const unsigned long max_pfn = PFN_DOWN(GB(4)) - 1;
+    unsigned int i;
     struct rangeset *map;
     struct map_data map_data = { .d = d };
     int rc;
@@ -476,58 +406,55 @@ void __hwdom_init arch_iommu_hwdom_init(struct domain *d)
     if ( !map )
         panic("IOMMU init: unable to allocate rangeset\n");
 
-    max_pfn = (GB(4) >> PAGE_SHIFT) - 1;
-    top = max(max_pdx, pfn_to_pdx(max_pfn) + 1);
+    if ( iommu_hwdom_inclusive )
+    {
+        /* Add the whole range below 4GB, UNUSABLE regions will be removed. */
+        rc = rangeset_add_range(map, 0, max_pfn);
+        if ( rc )
+            panic("IOMMU inclusive mappings can't be added: %d\n",
+                  rc);
+    }
 
-    for ( i = 0, start = 0, count = 0; i < top; )
+    for ( i = 0; i < e820.nr_map; i++ )
     {
-        unsigned long pfn = pdx_to_pfn(i);
-        unsigned int perms = hwdom_iommu_map(d, pfn, max_pfn);
+        struct e820entry entry = e820.map[i];
 
-        if ( !perms )
-            /* nothing */;
-        else if ( paging_mode_translate(d) )
+        switch ( entry.type )
         {
-            int rc;
+        case E820_UNUSABLE:
+            if ( !iommu_hwdom_inclusive || PFN_DOWN(entry.addr) > max_pfn )
+                continue;
 
-            rc = p2m_add_identity_entry(d, pfn,
-                                        perms & IOMMUF_writable ? p2m_access_rw
-                                                                : p2m_access_r,
-                                        0);
+            rc = rangeset_remove_range(map, PFN_DOWN(entry.addr),
+                                       PFN_DOWN(entry.addr + entry.size - 1));
             if ( rc )
-                printk(XENLOG_WARNING
-                       "%pd: identity mapping of %lx failed: %d\n",
-                       d, pfn, rc);
-        }
-        else if ( pfn != start + count || perms != start_perms )
-        {
-            long rc;
+                panic("IOMMU failed to remove unusable memory: %d\n",
+                      rc);
+            continue;
 
-        commit:
-            while ( (rc = iommu_map(d, _dfn(start), _mfn(start), count,
-                                    start_perms | IOMMUF_preempt,
-                                    &flush_flags)) > 0 )
-            {
-                start += rc;
-                count -= rc;
-                process_pending_softirqs();
-            }
-            if ( rc )
-                printk(XENLOG_WARNING
-                       "%pd: IOMMU identity mapping of [%lx,%lx) failed: %ld\n",
-                       d, start, start + count, rc);
-            start = pfn;
-            count = 1;
-            start_perms = perms;
+        case E820_RESERVED:
+            if ( !iommu_hwdom_inclusive && !iommu_hwdom_reserved )
+                continue;
+            break;
+
+        case E820_RAM:
+            if ( iommu_hwdom_strict )
+                continue;
+            break;
         }
-        else
-            ++count;
 
-        if ( !(++i & 0xfffff) )
-            process_pending_softirqs();
+        if ( iommu_hwdom_inclusive &&
+             PFN_DOWN(entry.addr + entry.size - 1) <= max_pfn )
+            /*
+             * Any range below 4GB is already in the rangeset if using inclusive
+             * mode.
+             */
+            continue;
 
-        if ( i == top && count )
-            goto commit;
+        rc = rangeset_add_range(map, PFN_DOWN(entry.addr),
+                                PFN_DOWN(entry.addr + entry.size - 1));
+        if ( rc )
+            panic("IOMMU failed to add identity range: %d\n", rc);
     }
 
     /* Remove any areas in-use by Xen. */
@@ -605,7 +532,7 @@ void __hwdom_init arch_iommu_hwdom_init(struct domain *d)
     rangeset_destroy(map);
 
     /* Use if to avoid compiler warning */
-    if ( iommu_iotlb_flush_all(d, flush_flags) )
+    if ( iommu_iotlb_flush_all(d, map_data.flush_flags) )
         return;
 }
 
-- 
2.43.0



^ permalink raw reply related	[flat|nested] 24+ messages in thread

* [PATCH v2 6/6] x86/iommu: cleanup unused functions
  2023-12-04  9:42 [PATCH v2 0/6] x86/iommu: improve setup time of hwdom IOMMU Roger Pau Monne
                   ` (4 preceding siblings ...)
  2023-12-04  9:43 ` [PATCH v2 5/6] x86/iommu: switch hwdom IOMMU to use a rangeset Roger Pau Monne
@ 2023-12-04  9:43 ` Roger Pau Monne
  2023-12-05 15:29   ` Jan Beulich
  5 siblings, 1 reply; 24+ messages in thread
From: Roger Pau Monne @ 2023-12-04  9:43 UTC (permalink / raw)
  To: xen-devel
  Cc: Roger Pau Monne, Paul Durrant, Jan Beulich, Andrew Cooper,
	Wei Liu, Lukasz Hawrylko, Daniel P. Smith, Mateusz Mówka

Remove xen_in_range() and vpci_is_mmcfg_address() now that hey are unused.

Adjust comments to point to the new functions that replace the existing ones.

No functional change.

Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
---
Can be squashed with the previous patch if desired, split as a separate patch
for clarity.
---
 xen/arch/x86/include/asm/hvm/io.h |  3 --
 xen/arch/x86/include/asm/setup.h  |  1 -
 xen/arch/x86/setup.c              | 53 ++-----------------------------
 xen/arch/x86/tboot.c              |  2 +-
 4 files changed, 3 insertions(+), 56 deletions(-)

diff --git a/xen/arch/x86/include/asm/hvm/io.h b/xen/arch/x86/include/asm/hvm/io.h
index e1e5e6fe7491..24d1b6134f02 100644
--- a/xen/arch/x86/include/asm/hvm/io.h
+++ b/xen/arch/x86/include/asm/hvm/io.h
@@ -153,9 +153,6 @@ int register_vpci_mmcfg_handler(struct domain *d, paddr_t addr,
 /* Destroy tracked MMCFG areas. */
 void destroy_vpci_mmcfg(struct domain *d);
 
-/* Check if an address is between a MMCFG region for a domain. */
-bool vpci_is_mmcfg_address(const struct domain *d, paddr_t addr);
-
 /* Remove MMCFG regions from a given rangeset. */
 int vpci_subtract_mmcfg(const struct domain *d, struct rangeset *r);
 
diff --git a/xen/arch/x86/include/asm/setup.h b/xen/arch/x86/include/asm/setup.h
index cd07d98101d8..1ced1299c77b 100644
--- a/xen/arch/x86/include/asm/setup.h
+++ b/xen/arch/x86/include/asm/setup.h
@@ -36,7 +36,6 @@ unsigned long initial_images_nrpages(nodeid_t node);
 void discard_initial_images(void);
 void *bootstrap_map(const module_t *mod);
 
-int xen_in_range(unsigned long mfn);
 int remove_xen_ranges(struct rangeset *r);
 
 extern uint8_t kbd_shift_flags;
diff --git a/xen/arch/x86/setup.c b/xen/arch/x86/setup.c
index 71fa0b46f181..7d2cb61a2a4a 100644
--- a/xen/arch/x86/setup.c
+++ b/xen/arch/x86/setup.c
@@ -1343,7 +1343,7 @@ void asmlinkage __init noreturn __start_xen(unsigned long mbi_p)
         relocated = true;
 
         /*
-         * This needs to remain in sync with xen_in_range() and the
+         * This needs to remain in sync with remove_xen_ranges() and the
          * respective reserve_e820_ram() invocation below. No need to
          * query efi_boot_mem_unused() here, though.
          */
@@ -1495,7 +1495,7 @@ void asmlinkage __init noreturn __start_xen(unsigned long mbi_p)
     if ( using_2M_mapping() )
         efi_boot_mem_unused(NULL, NULL);
 
-    /* This needs to remain in sync with xen_in_range(). */
+    /* This needs to remain in sync with remove_xen_ranges(). */
     if ( efi_boot_mem_unused(&eb_start, &eb_end) )
     {
         reserve_e820_ram(&boot_e820, __pa(_stext), __pa(eb_start));
@@ -2087,55 +2087,6 @@ void arch_get_xen_caps(xen_capabilities_info_t *info)
     }
 }
 
-int __hwdom_init xen_in_range(unsigned long mfn)
-{
-    paddr_t start, end;
-    int i;
-
-    enum { region_s3, region_ro, region_rw, region_bss, nr_regions };
-    static struct {
-        paddr_t s, e;
-    } xen_regions[nr_regions] __hwdom_initdata;
-
-    /* initialize first time */
-    if ( !xen_regions[0].s )
-    {
-        /* S3 resume code (and other real mode trampoline code) */
-        xen_regions[region_s3].s = bootsym_phys(trampoline_start);
-        xen_regions[region_s3].e = bootsym_phys(trampoline_end);
-
-        /*
-         * This needs to remain in sync with the uses of the same symbols in
-         * - __start_xen() (above)
-         * - is_xen_fixed_mfn()
-         * - tboot_shutdown()
-         */
-
-        /* hypervisor .text + .rodata */
-        xen_regions[region_ro].s = __pa(&_stext);
-        xen_regions[region_ro].e = __pa(&__2M_rodata_end);
-        /* hypervisor .data + .bss */
-        xen_regions[region_rw].s = __pa(&__2M_rwdata_start);
-        xen_regions[region_rw].e = __pa(&__2M_rwdata_end);
-        if ( efi_boot_mem_unused(&start, &end) )
-        {
-            ASSERT(__pa(start) >= xen_regions[region_rw].s);
-            ASSERT(__pa(end) <= xen_regions[region_rw].e);
-            xen_regions[region_rw].e = __pa(start);
-            xen_regions[region_bss].s = __pa(end);
-            xen_regions[region_bss].e = __pa(&__2M_rwdata_end);
-        }
-    }
-
-    start = (paddr_t)mfn << PAGE_SHIFT;
-    end = start + PAGE_SIZE;
-    for ( i = 0; i < nr_regions; i++ )
-        if ( (start < xen_regions[i].e) && (end > xen_regions[i].s) )
-            return 1;
-
-    return 0;
-}
-
 int __hwdom_init remove_xen_ranges(struct rangeset *r)
 {
     paddr_t start, end;
diff --git a/xen/arch/x86/tboot.c b/xen/arch/x86/tboot.c
index 86c4c22cacb8..4c254b4e34b4 100644
--- a/xen/arch/x86/tboot.c
+++ b/xen/arch/x86/tboot.c
@@ -321,7 +321,7 @@ void tboot_shutdown(uint32_t shutdown_type)
 
         /*
          * Xen regions for tboot to MAC. This needs to remain in sync with
-         * xen_in_range().
+         * remove_xen_ranges().
          */
         g_tboot_shared->num_mac_regions = 3;
         /* S3 resume code (and other real mode trampoline code) */
-- 
2.43.0



^ permalink raw reply related	[flat|nested] 24+ messages in thread

* Re: [PATCH v2 1/6] iommu/vt-d: do not assume page table levels for quarantine domain
  2023-12-04  9:43 ` [PATCH v2 1/6] iommu/vt-d: do not assume page table levels for quarantine domain Roger Pau Monne
@ 2023-12-05 14:24   ` Jan Beulich
  2023-12-05 14:30     ` Roger Pau Monné
  0 siblings, 1 reply; 24+ messages in thread
From: Jan Beulich @ 2023-12-05 14:24 UTC (permalink / raw)
  To: Roger Pau Monne; +Cc: Kevin Tian, xen-devel

On 04.12.2023 10:43, Roger Pau Monne wrote:
> Like XSA-445, do not assume IOMMU page table levels on VT-d are always set
> based on DEFAULT_DOMAIN_ADDRESS_WIDTH and instead fetch the value set by
> intel_iommu_hwdom_init() from the domain iommu structure.  This prevents
> changes to intel_iommu_hwdom_init() possibly getting the levels out of sync

In both cases, don't you mean intel_iommu_domain_init() instead? Only if
so
Reviewed-by: Jan Beulich <jbeulich@suse.com>
(and happy to adjust while committing).

Otherwise I must be missing something.

Jan

> with what intel_iommu_quarantine_init() expects.
> 
> No functional change, since on Intel domains are hardcoded to use
> DEFAULT_DOMAIN_ADDRESS_WIDTH.
> 
> Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
> ---
> Changes since v1:
>  - New in this version.
> ---
>  xen/drivers/passthrough/vtd/iommu.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/xen/drivers/passthrough/vtd/iommu.c b/xen/drivers/passthrough/vtd/iommu.c
> index e13b7d99db40..bc6181c9f911 100644
> --- a/xen/drivers/passthrough/vtd/iommu.c
> +++ b/xen/drivers/passthrough/vtd/iommu.c
> @@ -3162,7 +3162,7 @@ static int cf_check intel_iommu_quarantine_init(struct pci_dev *pdev,
>  {
>      struct domain_iommu *hd = dom_iommu(dom_io);
>      struct page_info *pg;
> -    unsigned int agaw = width_to_agaw(DEFAULT_DOMAIN_ADDRESS_WIDTH);
> +    unsigned int agaw = hd->arch.vtd.agaw;
>      unsigned int level = agaw_to_level(agaw);
>      const struct acpi_drhd_unit *drhd;
>      const struct acpi_rmrr_unit *rmrr;



^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH v2 1/6] iommu/vt-d: do not assume page table levels for quarantine domain
  2023-12-05 14:24   ` Jan Beulich
@ 2023-12-05 14:30     ` Roger Pau Monné
  0 siblings, 0 replies; 24+ messages in thread
From: Roger Pau Monné @ 2023-12-05 14:30 UTC (permalink / raw)
  To: Jan Beulich; +Cc: Kevin Tian, xen-devel

On Tue, Dec 05, 2023 at 03:24:46PM +0100, Jan Beulich wrote:
> On 04.12.2023 10:43, Roger Pau Monne wrote:
> > Like XSA-445, do not assume IOMMU page table levels on VT-d are always set
> > based on DEFAULT_DOMAIN_ADDRESS_WIDTH and instead fetch the value set by
> > intel_iommu_hwdom_init() from the domain iommu structure.  This prevents
> > changes to intel_iommu_hwdom_init() possibly getting the levels out of sync
> 
> In both cases, don't you mean intel_iommu_domain_init() instead? Only if

Indeed, sorry.

> so
> Reviewed-by: Jan Beulich <jbeulich@suse.com>
> (and happy to adjust while committing).

Thanks.

> Otherwise I must be missing something.

No, you are right.

Roger.


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH v2 2/6] amd-vi: set IOMMU page table levels based on guest reported paddr width
  2023-12-04  9:43 ` [PATCH v2 2/6] amd-vi: set IOMMU page table levels based on guest reported paddr width Roger Pau Monne
@ 2023-12-05 14:32   ` Jan Beulich
  2023-12-05 15:11     ` Roger Pau Monné
  0 siblings, 1 reply; 24+ messages in thread
From: Jan Beulich @ 2023-12-05 14:32 UTC (permalink / raw)
  To: Roger Pau Monne; +Cc: Andrew Cooper, xen-devel

On 04.12.2023 10:43, Roger Pau Monne wrote:
> --- a/xen/drivers/passthrough/amd/pci_amd_iommu.c
> +++ b/xen/drivers/passthrough/amd/pci_amd_iommu.c
> @@ -359,21 +359,17 @@ int __read_mostly amd_iommu_min_paging_mode = 1;
>  static int cf_check amd_iommu_domain_init(struct domain *d)
>  {
>      struct domain_iommu *hd = dom_iommu(d);
> +    int pglvl = amd_iommu_get_paging_mode(
> +                PFN_DOWN(1UL << paging_max_paddr_bits(d)));

This is a function in the paging subsystem, i.e. generally inapplicable
to system domains (specifically DomIO). If this is to remain this way,
the function would imo need to gain a warning. Yet better would imo be
if the function was avoided for system domains.

Jan


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH v2 3/6] x86/iommu: introduce a rangeset to perform hwdom IOMMU setup
  2023-12-04  9:43 ` [PATCH v2 3/6] x86/iommu: introduce a rangeset to perform hwdom IOMMU setup Roger Pau Monne
@ 2023-12-05 14:50   ` Jan Beulich
  2023-12-05 15:29     ` Roger Pau Monné
  0 siblings, 1 reply; 24+ messages in thread
From: Jan Beulich @ 2023-12-05 14:50 UTC (permalink / raw)
  To: Roger Pau Monne; +Cc: Paul Durrant, xen-devel

On 04.12.2023 10:43, Roger Pau Monne wrote:
> This change just introduces the boilerplate code in order to use a rangeset
> when setting up the hardware domain IOMMU mappings.  The rangeset is never
> populated in this patch, so it's a non-functional change as far as the mappings
> the domain gets established.
> 
> Note there's a change for HVM domains (ie: PVH dom0) that will get switched to
> create the p2m mappings using map_mmio_regions() instead of
> p2m_add_identity_entry(), so that ranges can be mapped with a single function
> call if possible.  Note that the interface of map_mmio_regions() doesn't allow
> creating read-only mappings, but so far there are no such mappings created for
> PVH dom0 in arch_iommu_hwdom_init().

I don't understand this paragraph: The rangeset remains empty, so nothing is
changing right here. DYM there is going to be such a change as a result of
this patch, but in a later part of this series?

> --- a/xen/drivers/passthrough/x86/iommu.c
> +++ b/xen/drivers/passthrough/x86/iommu.c
> @@ -370,10 +370,77 @@ static unsigned int __hwdom_init hwdom_iommu_map(const struct domain *d,
>      return perms;
>  }
>  
> +struct map_data {
> +    struct domain *d;
> +    unsigned int flush_flags;
> +    bool ro;
> +};
> +
> +static int __hwdom_init cf_check identity_map(unsigned long s, unsigned long e,
> +                                              void *data)
> +{
> +    struct map_data *info = data;
> +    struct domain *d = info->d;
> +    long rc;
> +
> +    if ( iommu_verbose )
> +        printk(XENLOG_INFO " [%010lx, %010lx] R%c\n",
> +               s, e, info->ro ? 'O' : 'W');
> +
> +    if ( paging_mode_translate(d) )
> +    {
> +        if ( info->ro )
> +        {
> +            ASSERT_UNREACHABLE();
> +            return 0;
> +        }
> +        while ( (rc = map_mmio_regions(d, _gfn(s), e - s + 1, _mfn(s))) > 0 )
> +        {
> +            s += rc;
> +            process_pending_softirqs();
> +        }
> +    }
> +    else
> +    {
> +        const unsigned int perms = IOMMUF_readable | IOMMUF_preempt |
> +                                   (info->ro ? 0 : IOMMUF_writable);
> +
> +        if ( info->ro && !iomem_access_permitted(d, s, e) )

How is r/o-ness related to iomem_access_permitted()? The present callers
are such that there is a connection, but that's invisible here. I guess
either the field wants to change name (maybe mmio_ro or ro_mmio or even
just mmio), or there wants to be a comment.

> +        {
> +            /*
> +             * Should be more fine grained in order to not map the forbidden
> +             * frame instead of rejecting the region as a whole, but it's only
> +             * for read-only MMIO regions, which are very limited.
> +             */

How certain are you/we that no two adjacent ones may appear, with
different permissions granted to Dom0?

> +            printk(XENLOG_DEBUG
> +                   "IOMMU read-only mapping of region [%lx, %lx] forbidden\n",
> +                   s, e);
> +            return 0;
> +        }
> +        while ( (rc = iommu_map(d, _dfn(s), _mfn(s), e - s + 1,
> +                                perms, &info->flush_flags)) > 0 )
> +        {
> +            s += rc;
> +            process_pending_softirqs();
> +        }
> +    }
> +    ASSERT(rc <= 0);
> +    if ( rc )
> +        printk(XENLOG_WARNING
> +               "IOMMU identity mapping of [%lx, %lx] failed: %ld\n",
> +               s, e, rc);
> +
> +    /* Ignore errors and attempt to map the remaining regions. */
> +    return 0;
> +}
> +
>  void __hwdom_init arch_iommu_hwdom_init(struct domain *d)
>  {
>      unsigned long i, top, max_pfn, start, count;
>      unsigned int flush_flags = 0, start_perms = 0;
> +    struct rangeset *map;
> +    struct map_data map_data = { .d = d };
> +    int rc;
>  
>      BUG_ON(!is_hardware_domain(d));
>  
> @@ -397,6 +464,10 @@ void __hwdom_init arch_iommu_hwdom_init(struct domain *d)
>      if ( iommu_hwdom_passthrough )
>          return;
>  
> +    map = rangeset_new(NULL, NULL, 0);
> +    if ( !map )
> +        panic("IOMMU init: unable to allocate rangeset\n");
> +
>      max_pfn = (GB(4) >> PAGE_SHIFT) - 1;
>      top = max(max_pdx, pfn_to_pdx(max_pfn) + 1);
>  
> @@ -451,6 +522,24 @@ void __hwdom_init arch_iommu_hwdom_init(struct domain *d)
>              goto commit;
>      }
>  
> +    if ( iommu_verbose )
> +        printk(XENLOG_INFO "d%u: identity mappings for IOMMU:\n",
> +               d->domain_id);

%pd: ?

> +    rc = rangeset_report_ranges(map, 0, ~0UL, identity_map, &map_data);
> +    if ( rc )
> +        panic("IOMMU unable to create mappings: %d\n", rc);
> +    if ( is_pv_domain(d) )
> +    {
> +        map_data.ro = true;
> +        rc = rangeset_report_ranges(mmio_ro_ranges, 0, ~0UL, identity_map,
> +                                    &map_data);
> +        if ( rc )
> +            panic("IOMMU unable to create read-only mappings: %d\n", rc);
> +    }
> +
> +    rangeset_destroy(map);

This could move up, couldn't it?

>      /* Use if to avoid compiler warning */
>      if ( iommu_iotlb_flush_all(d, flush_flags) )

Don't you need to fold map.flush_flags into flush_flags ahead of this call?
Or can the variable perhaps go away altogether, being replaced by the struct
field?

Jan

>          return;



^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH v2 2/6] amd-vi: set IOMMU page table levels based on guest reported paddr width
  2023-12-05 14:32   ` Jan Beulich
@ 2023-12-05 15:11     ` Roger Pau Monné
  2023-12-05 15:19       ` Jan Beulich
  0 siblings, 1 reply; 24+ messages in thread
From: Roger Pau Monné @ 2023-12-05 15:11 UTC (permalink / raw)
  To: Jan Beulich; +Cc: Andrew Cooper, xen-devel

On Tue, Dec 05, 2023 at 03:32:20PM +0100, Jan Beulich wrote:
> On 04.12.2023 10:43, Roger Pau Monne wrote:
> > --- a/xen/drivers/passthrough/amd/pci_amd_iommu.c
> > +++ b/xen/drivers/passthrough/amd/pci_amd_iommu.c
> > @@ -359,21 +359,17 @@ int __read_mostly amd_iommu_min_paging_mode = 1;
> >  static int cf_check amd_iommu_domain_init(struct domain *d)
> >  {
> >      struct domain_iommu *hd = dom_iommu(d);
> > +    int pglvl = amd_iommu_get_paging_mode(
> > +                PFN_DOWN(1UL << paging_max_paddr_bits(d)));
> 
> This is a function in the paging subsystem, i.e. generally inapplicable
> to system domains (specifically DomIO). If this is to remain this way,
> the function would imo need to gain a warning. Yet better would imo be
> if the function was avoided for system domains.

I have to admit I'm confused, won't systems domains return
paging_mode_hap(d) == false, and thus fallback to using paddr_bits
(host paddr width?).

I can avoid such domains calling into paging_max_paddr_bits() but it
seems redundant, and would just be duplicated logic for a case that
paging_max_paddr_bits() already handles correctly AFAICT.

Would it be better for me to rename paging_max_paddr_bits() to
domain_max_paddr_bits() and move it to asm/domain.h?

Thanks, Roger.


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH v2 4/6] x86/iommu: remove regions not to be mapped
  2023-12-04  9:43 ` [PATCH v2 4/6] x86/iommu: remove regions not to be mapped Roger Pau Monne
@ 2023-12-05 15:11   ` Jan Beulich
  2023-12-05 15:31     ` Roger Pau Monné
  0 siblings, 1 reply; 24+ messages in thread
From: Jan Beulich @ 2023-12-05 15:11 UTC (permalink / raw)
  To: Roger Pau Monne; +Cc: Paul Durrant, Andrew Cooper, Wei Liu, xen-devel

On 04.12.2023 10:43, Roger Pau Monne wrote:
> --- a/xen/arch/x86/setup.c
> +++ b/xen/arch/x86/setup.c
> @@ -2136,6 +2136,54 @@ int __hwdom_init xen_in_range(unsigned long mfn)
>      return 0;
>  }
>  
> +int __hwdom_init remove_xen_ranges(struct rangeset *r)
> +{
> +    paddr_t start, end;
> +    int rc;
> +
> +    /* S3 resume code (and other real mode trampoline code) */
> +    rc = rangeset_remove_range(r, PFN_DOWN(bootsym_phys(trampoline_start)),
> +                               PFN_DOWN(bootsym_phys(trampoline_end)));
> +    if ( rc )
> +        return rc;
> +
> +    /*
> +     * This needs to remain in sync with the uses of the same symbols in
> +     * - __start_xen()
> +     * - is_xen_fixed_mfn()
> +     * - tboot_shutdown()
> +     */

As you're duplicating this comment from xen_in_range(), you want to
- also mention xen_in_range() here,
- also update xen_in_range()'s comment,
- also update the respective comments in __start_xen() that also mention
  xen_in_range().
Everything else here looks good to me.

Jan


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH v2 2/6] amd-vi: set IOMMU page table levels based on guest reported paddr width
  2023-12-05 15:11     ` Roger Pau Monné
@ 2023-12-05 15:19       ` Jan Beulich
  0 siblings, 0 replies; 24+ messages in thread
From: Jan Beulich @ 2023-12-05 15:19 UTC (permalink / raw)
  To: Roger Pau Monné; +Cc: Andrew Cooper, xen-devel

On 05.12.2023 16:11, Roger Pau Monné wrote:
> On Tue, Dec 05, 2023 at 03:32:20PM +0100, Jan Beulich wrote:
>> On 04.12.2023 10:43, Roger Pau Monne wrote:
>>> --- a/xen/drivers/passthrough/amd/pci_amd_iommu.c
>>> +++ b/xen/drivers/passthrough/amd/pci_amd_iommu.c
>>> @@ -359,21 +359,17 @@ int __read_mostly amd_iommu_min_paging_mode = 1;
>>>  static int cf_check amd_iommu_domain_init(struct domain *d)
>>>  {
>>>      struct domain_iommu *hd = dom_iommu(d);
>>> +    int pglvl = amd_iommu_get_paging_mode(
>>> +                PFN_DOWN(1UL << paging_max_paddr_bits(d)));
>>
>> This is a function in the paging subsystem, i.e. generally inapplicable
>> to system domains (specifically DomIO). If this is to remain this way,
>> the function would imo need to gain a warning. Yet better would imo be
>> if the function was avoided for system domains.
> 
> I have to admit I'm confused, won't systems domains return
> paging_mode_hap(d) == false, and thus fallback to using paddr_bits
> (host paddr width?).

True, but that check lives inside the function.

> I can avoid such domains calling into paging_max_paddr_bits() but it
> seems redundant, and would just be duplicated logic for a case that
> paging_max_paddr_bits() already handles correctly AFAICT.

Hence why I suggested a comment (warning) as alternative.

> Would it be better for me to rename paging_max_paddr_bits() to
> domain_max_paddr_bits() and move it to asm/domain.h?

Maybe. I'm not sure exactly why the function was introduced where it
is and under the name it has. It sole present caller is in cpu-policy.c,
so either name/placement would look good to me.

Jan


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH v2 5/6] x86/iommu: switch hwdom IOMMU to use a rangeset
  2023-12-04  9:43 ` [PATCH v2 5/6] x86/iommu: switch hwdom IOMMU to use a rangeset Roger Pau Monne
@ 2023-12-05 15:27   ` Jan Beulich
  2023-12-05 15:43     ` Roger Pau Monné
  0 siblings, 1 reply; 24+ messages in thread
From: Jan Beulich @ 2023-12-05 15:27 UTC (permalink / raw)
  To: Roger Pau Monne; +Cc: Paul Durrant, xen-devel

On 04.12.2023 10:43, Roger Pau Monne wrote:
> @@ -476,58 +406,55 @@ void __hwdom_init arch_iommu_hwdom_init(struct domain *d)
>      if ( !map )
>          panic("IOMMU init: unable to allocate rangeset\n");
>  
> -    max_pfn = (GB(4) >> PAGE_SHIFT) - 1;
> -    top = max(max_pdx, pfn_to_pdx(max_pfn) + 1);
> +    if ( iommu_hwdom_inclusive )
> +    {
> +        /* Add the whole range below 4GB, UNUSABLE regions will be removed. */
> +        rc = rangeset_add_range(map, 0, max_pfn);
> +        if ( rc )
> +            panic("IOMMU inclusive mappings can't be added: %d\n",
> +                  rc);
> +    }
>  
> -    for ( i = 0, start = 0, count = 0; i < top; )
> +    for ( i = 0; i < e820.nr_map; i++ )
>      {
> -        unsigned long pfn = pdx_to_pfn(i);
> -        unsigned int perms = hwdom_iommu_map(d, pfn, max_pfn);
> +        struct e820entry entry = e820.map[i];
>  
> -        if ( !perms )
> -            /* nothing */;
> -        else if ( paging_mode_translate(d) )
> +        switch ( entry.type )
>          {
> -            int rc;
> +        case E820_UNUSABLE:
> +            if ( !iommu_hwdom_inclusive || PFN_DOWN(entry.addr) > max_pfn )
> +                continue;

The !iommu_hwdom_inclusive part isn't really needed here, is it? The ...

> -            rc = p2m_add_identity_entry(d, pfn,
> -                                        perms & IOMMUF_writable ? p2m_access_rw
> -                                                                : p2m_access_r,
> -                                        0);
> +            rc = rangeset_remove_range(map, PFN_DOWN(entry.addr),
> +                                       PFN_DOWN(entry.addr + entry.size - 1));

... call here would then simply be a no-op, as it looks. And things would
overall look more safe if the removal was skipped for fewer reasons.

> @@ -605,7 +532,7 @@ void __hwdom_init arch_iommu_hwdom_init(struct domain *d)
>      rangeset_destroy(map);
>  
>      /* Use if to avoid compiler warning */
> -    if ( iommu_iotlb_flush_all(d, flush_flags) )
> +    if ( iommu_iotlb_flush_all(d, map_data.flush_flags) )
>          return;
>  }

Ah yes, here is said change. But I think for correctness this wants
moving to the earlier patch.

Jan


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH v2 3/6] x86/iommu: introduce a rangeset to perform hwdom IOMMU setup
  2023-12-05 14:50   ` Jan Beulich
@ 2023-12-05 15:29     ` Roger Pau Monné
  0 siblings, 0 replies; 24+ messages in thread
From: Roger Pau Monné @ 2023-12-05 15:29 UTC (permalink / raw)
  To: Jan Beulich; +Cc: Paul Durrant, xen-devel

On Tue, Dec 05, 2023 at 03:50:44PM +0100, Jan Beulich wrote:
> On 04.12.2023 10:43, Roger Pau Monne wrote:
> > This change just introduces the boilerplate code in order to use a rangeset
> > when setting up the hardware domain IOMMU mappings.  The rangeset is never
> > populated in this patch, so it's a non-functional change as far as the mappings
> > the domain gets established.
> > 
> > Note there's a change for HVM domains (ie: PVH dom0) that will get switched to
> > create the p2m mappings using map_mmio_regions() instead of
> > p2m_add_identity_entry(), so that ranges can be mapped with a single function
> > call if possible.  Note that the interface of map_mmio_regions() doesn't allow
> > creating read-only mappings, but so far there are no such mappings created for
> > PVH dom0 in arch_iommu_hwdom_init().
> 
> I don't understand this paragraph: The rangeset remains empty, so nothing is
> changing right here. DYM there is going to be such a change as a result of
> this patch, but in a later part of this series?

Yes, when the rangeset is populated and mappings are created based on
its contents, map_mmio_regions() will be used instead of
p2m_add_identity_entry().  I guess the '... that will get switched to
create the p2m ...' is not clear enough.

> > --- a/xen/drivers/passthrough/x86/iommu.c
> > +++ b/xen/drivers/passthrough/x86/iommu.c
> > @@ -370,10 +370,77 @@ static unsigned int __hwdom_init hwdom_iommu_map(const struct domain *d,
> >      return perms;
> >  }
> >  
> > +struct map_data {
> > +    struct domain *d;
> > +    unsigned int flush_flags;
> > +    bool ro;
> > +};
> > +
> > +static int __hwdom_init cf_check identity_map(unsigned long s, unsigned long e,
> > +                                              void *data)
> > +{
> > +    struct map_data *info = data;
> > +    struct domain *d = info->d;
> > +    long rc;
> > +
> > +    if ( iommu_verbose )
> > +        printk(XENLOG_INFO " [%010lx, %010lx] R%c\n",
> > +               s, e, info->ro ? 'O' : 'W');
> > +
> > +    if ( paging_mode_translate(d) )
> > +    {
> > +        if ( info->ro )
> > +        {
> > +            ASSERT_UNREACHABLE();
> > +            return 0;
> > +        }
> > +        while ( (rc = map_mmio_regions(d, _gfn(s), e - s + 1, _mfn(s))) > 0 )
> > +        {
> > +            s += rc;
> > +            process_pending_softirqs();
> > +        }
> > +    }
> > +    else
> > +    {
> > +        const unsigned int perms = IOMMUF_readable | IOMMUF_preempt |
> > +                                   (info->ro ? 0 : IOMMUF_writable);
> > +
> > +        if ( info->ro && !iomem_access_permitted(d, s, e) )
> 
> How is r/o-ness related to iomem_access_permitted()? The present callers
> are such that there is a connection, but that's invisible here. I guess
> either the field wants to change name (maybe mmio_ro or ro_mmio or even
> just mmio), or there wants to be a comment.

Will add:

"read-only ranges are only created based on the contents of
mmio_ro_ranges, and hence need the additional iomem_access_permitted()
check."

> > +        {
> > +            /*
> > +             * Should be more fine grained in order to not map the forbidden
> > +             * frame instead of rejecting the region as a whole, but it's only
> > +             * for read-only MMIO regions, which are very limited.
> > +             */
> 
> How certain are you/we that no two adjacent ones may appear, with
> different permissions granted to Dom0?

Yeah, I was already not very convinced by this, and I think the only
solution here is to iterate over the read-only ranges with page
granularity.  In any case read-only ranges are both few and small in
size, hence this is unlikely to be noticeable performance wise.

> > +            printk(XENLOG_DEBUG
> > +                   "IOMMU read-only mapping of region [%lx, %lx] forbidden\n",
> > +                   s, e);
> > +            return 0;
> > +        }
> > +        while ( (rc = iommu_map(d, _dfn(s), _mfn(s), e - s + 1,
> > +                                perms, &info->flush_flags)) > 0 )
> > +        {
> > +            s += rc;
> > +            process_pending_softirqs();
> > +        }
> > +    }
> > +    ASSERT(rc <= 0);
> > +    if ( rc )
> > +        printk(XENLOG_WARNING
> > +               "IOMMU identity mapping of [%lx, %lx] failed: %ld\n",
> > +               s, e, rc);
> > +
> > +    /* Ignore errors and attempt to map the remaining regions. */
> > +    return 0;
> > +}
> > +
> >  void __hwdom_init arch_iommu_hwdom_init(struct domain *d)
> >  {
> >      unsigned long i, top, max_pfn, start, count;
> >      unsigned int flush_flags = 0, start_perms = 0;
> > +    struct rangeset *map;
> > +    struct map_data map_data = { .d = d };
> > +    int rc;
> >  
> >      BUG_ON(!is_hardware_domain(d));
> >  
> > @@ -397,6 +464,10 @@ void __hwdom_init arch_iommu_hwdom_init(struct domain *d)
> >      if ( iommu_hwdom_passthrough )
> >          return;
> >  
> > +    map = rangeset_new(NULL, NULL, 0);
> > +    if ( !map )
> > +        panic("IOMMU init: unable to allocate rangeset\n");
> > +
> >      max_pfn = (GB(4) >> PAGE_SHIFT) - 1;
> >      top = max(max_pdx, pfn_to_pdx(max_pfn) + 1);
> >  
> > @@ -451,6 +522,24 @@ void __hwdom_init arch_iommu_hwdom_init(struct domain *d)
> >              goto commit;
> >      }
> >  
> > +    if ( iommu_verbose )
> > +        printk(XENLOG_INFO "d%u: identity mappings for IOMMU:\n",
> > +               d->domain_id);
> 
> %pd: ?

Indeed, I probably copied this from a different chunk and didn't
adjust to use %pd.

> > +    rc = rangeset_report_ranges(map, 0, ~0UL, identity_map, &map_data);
> > +    if ( rc )
> > +        panic("IOMMU unable to create mappings: %d\n", rc);
> > +    if ( is_pv_domain(d) )
> > +    {
> > +        map_data.ro = true;
> > +        rc = rangeset_report_ranges(mmio_ro_ranges, 0, ~0UL, identity_map,
> > +                                    &map_data);
> > +        if ( rc )
> > +            panic("IOMMU unable to create read-only mappings: %d\n", rc);
> > +    }
> > +
> > +    rangeset_destroy(map);
> 
> This could move up, couldn't it?

Yes, could be moved just after the rangeset_report_ranges(map...)
call.

> >      /* Use if to avoid compiler warning */
> >      if ( iommu_iotlb_flush_all(d, flush_flags) )
> 
> Don't you need to fold map.flush_flags into flush_flags ahead of this call?
> Or can the variable perhaps go away altogether, being replaced by the struct
> field?

Yes, the variable ends up being replaced in a later patch, hence I
didn't touch it here.

Thanks, Roger.


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH v2 6/6] x86/iommu: cleanup unused functions
  2023-12-04  9:43 ` [PATCH v2 6/6] x86/iommu: cleanup unused functions Roger Pau Monne
@ 2023-12-05 15:29   ` Jan Beulich
  2023-12-05 15:48     ` Roger Pau Monné
  0 siblings, 1 reply; 24+ messages in thread
From: Jan Beulich @ 2023-12-05 15:29 UTC (permalink / raw)
  To: Roger Pau Monne
  Cc: Paul Durrant, Andrew Cooper, Wei Liu, Lukasz Hawrylko,
	Daniel P. Smith, Mateusz Mówka, xen-devel

On 04.12.2023 10:43, Roger Pau Monne wrote:
> Remove xen_in_range() and vpci_is_mmcfg_address() now that hey are unused.

Of the latter you remove only the declaration. Stale patch maybe?

For the former, am I misremembering that Andrew had asked for the function
to stay?

Jan


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH v2 4/6] x86/iommu: remove regions not to be mapped
  2023-12-05 15:11   ` Jan Beulich
@ 2023-12-05 15:31     ` Roger Pau Monné
  2023-12-05 15:34       ` Jan Beulich
  0 siblings, 1 reply; 24+ messages in thread
From: Roger Pau Monné @ 2023-12-05 15:31 UTC (permalink / raw)
  To: Jan Beulich; +Cc: Paul Durrant, Andrew Cooper, Wei Liu, xen-devel

On Tue, Dec 05, 2023 at 04:11:21PM +0100, Jan Beulich wrote:
> On 04.12.2023 10:43, Roger Pau Monne wrote:
> > --- a/xen/arch/x86/setup.c
> > +++ b/xen/arch/x86/setup.c
> > @@ -2136,6 +2136,54 @@ int __hwdom_init xen_in_range(unsigned long mfn)
> >      return 0;
> >  }
> >  
> > +int __hwdom_init remove_xen_ranges(struct rangeset *r)
> > +{
> > +    paddr_t start, end;
> > +    int rc;
> > +
> > +    /* S3 resume code (and other real mode trampoline code) */
> > +    rc = rangeset_remove_range(r, PFN_DOWN(bootsym_phys(trampoline_start)),
> > +                               PFN_DOWN(bootsym_phys(trampoline_end)));
> > +    if ( rc )
> > +        return rc;
> > +
> > +    /*
> > +     * This needs to remain in sync with the uses of the same symbols in
> > +     * - __start_xen()
> > +     * - is_xen_fixed_mfn()
> > +     * - tboot_shutdown()
> > +     */
> 
> As you're duplicating this comment from xen_in_range(), you want to
> - also mention xen_in_range() here,
> - also update xen_in_range()'s comment,

xen_in_range() is going away in the last patch, hence I did bother tyo
update it.

> - also update the respective comments in __start_xen() that also mention
>   xen_in_range().

That's done in patch 6/6.

> Everything else here looks good to me.

Let me know if doing such changes in a later patch is OK.

Thanks, Roger.


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH v2 4/6] x86/iommu: remove regions not to be mapped
  2023-12-05 15:31     ` Roger Pau Monné
@ 2023-12-05 15:34       ` Jan Beulich
  0 siblings, 0 replies; 24+ messages in thread
From: Jan Beulich @ 2023-12-05 15:34 UTC (permalink / raw)
  To: Roger Pau Monné; +Cc: Paul Durrant, Andrew Cooper, Wei Liu, xen-devel

On 05.12.2023 16:31, Roger Pau Monné wrote:
> On Tue, Dec 05, 2023 at 04:11:21PM +0100, Jan Beulich wrote:
>> On 04.12.2023 10:43, Roger Pau Monne wrote:
>>> --- a/xen/arch/x86/setup.c
>>> +++ b/xen/arch/x86/setup.c
>>> @@ -2136,6 +2136,54 @@ int __hwdom_init xen_in_range(unsigned long mfn)
>>>      return 0;
>>>  }
>>>  
>>> +int __hwdom_init remove_xen_ranges(struct rangeset *r)
>>> +{
>>> +    paddr_t start, end;
>>> +    int rc;
>>> +
>>> +    /* S3 resume code (and other real mode trampoline code) */
>>> +    rc = rangeset_remove_range(r, PFN_DOWN(bootsym_phys(trampoline_start)),
>>> +                               PFN_DOWN(bootsym_phys(trampoline_end)));
>>> +    if ( rc )
>>> +        return rc;
>>> +
>>> +    /*
>>> +     * This needs to remain in sync with the uses of the same symbols in
>>> +     * - __start_xen()
>>> +     * - is_xen_fixed_mfn()
>>> +     * - tboot_shutdown()
>>> +     */
>>
>> As you're duplicating this comment from xen_in_range(), you want to
>> - also mention xen_in_range() here,
>> - also update xen_in_range()'s comment,
> 
> xen_in_range() is going away in the last patch, hence I did bother tyo
> update it.
> 
>> - also update the respective comments in __start_xen() that also mention
>>   xen_in_range().
> 
> That's done in patch 6/6.
> 
>> Everything else here looks good to me.
> 
> Let me know if doing such changes in a later patch is OK.

If xen_in_range() is indeed going to go away (see my question there), I'd be
okay-ish with that.

Jan


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH v2 5/6] x86/iommu: switch hwdom IOMMU to use a rangeset
  2023-12-05 15:27   ` Jan Beulich
@ 2023-12-05 15:43     ` Roger Pau Monné
  2023-12-05 15:47       ` Jan Beulich
  0 siblings, 1 reply; 24+ messages in thread
From: Roger Pau Monné @ 2023-12-05 15:43 UTC (permalink / raw)
  To: Jan Beulich; +Cc: Paul Durrant, xen-devel

On Tue, Dec 05, 2023 at 04:27:05PM +0100, Jan Beulich wrote:
> On 04.12.2023 10:43, Roger Pau Monne wrote:
> > @@ -476,58 +406,55 @@ void __hwdom_init arch_iommu_hwdom_init(struct domain *d)
> >      if ( !map )
> >          panic("IOMMU init: unable to allocate rangeset\n");
> >  
> > -    max_pfn = (GB(4) >> PAGE_SHIFT) - 1;
> > -    top = max(max_pdx, pfn_to_pdx(max_pfn) + 1);
> > +    if ( iommu_hwdom_inclusive )
> > +    {
> > +        /* Add the whole range below 4GB, UNUSABLE regions will be removed. */
> > +        rc = rangeset_add_range(map, 0, max_pfn);
> > +        if ( rc )
> > +            panic("IOMMU inclusive mappings can't be added: %d\n",
> > +                  rc);
> > +    }
> >  
> > -    for ( i = 0, start = 0, count = 0; i < top; )
> > +    for ( i = 0; i < e820.nr_map; i++ )
> >      {
> > -        unsigned long pfn = pdx_to_pfn(i);
> > -        unsigned int perms = hwdom_iommu_map(d, pfn, max_pfn);
> > +        struct e820entry entry = e820.map[i];
> >  
> > -        if ( !perms )
> > -            /* nothing */;
> > -        else if ( paging_mode_translate(d) )
> > +        switch ( entry.type )
> >          {
> > -            int rc;
> > +        case E820_UNUSABLE:
> > +            if ( !iommu_hwdom_inclusive || PFN_DOWN(entry.addr) > max_pfn )
> > +                continue;
> 
> The !iommu_hwdom_inclusive part isn't really needed here, is it? The ...

Nor the PFN_DOWN(entry.addr) > max_pfn.

> > -            rc = p2m_add_identity_entry(d, pfn,
> > -                                        perms & IOMMUF_writable ? p2m_access_rw
> > -                                                                : p2m_access_r,
> > -                                        0);
> > +            rc = rangeset_remove_range(map, PFN_DOWN(entry.addr),
> > +                                       PFN_DOWN(entry.addr + entry.size - 1));
> 
> ... call here would then simply be a no-op, as it looks. And things would
> overall look more safe if the removal was skipped for fewer reasons.

OK, the removal can be done unconditionally if so desired.

> > @@ -605,7 +532,7 @@ void __hwdom_init arch_iommu_hwdom_init(struct domain *d)
> >      rangeset_destroy(map);
> >  
> >      /* Use if to avoid compiler warning */
> > -    if ( iommu_iotlb_flush_all(d, flush_flags) )
> > +    if ( iommu_iotlb_flush_all(d, map_data.flush_flags) )
> >          return;
> >  }
> 
> Ah yes, here is said change. But I think for correctness this wants
> moving to the earlier patch.

OK, so something like:

map_data.flush_flags |= flush_flags;

And adjusting the iommu_iotlb_flush_all() would be fine in this patch
context.

Thanks, Roger.


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH v2 5/6] x86/iommu: switch hwdom IOMMU to use a rangeset
  2023-12-05 15:43     ` Roger Pau Monné
@ 2023-12-05 15:47       ` Jan Beulich
  2023-12-05 16:07         ` Roger Pau Monné
  0 siblings, 1 reply; 24+ messages in thread
From: Jan Beulich @ 2023-12-05 15:47 UTC (permalink / raw)
  To: Roger Pau Monné; +Cc: Paul Durrant, xen-devel

On 05.12.2023 16:43, Roger Pau Monné wrote:
> On Tue, Dec 05, 2023 at 04:27:05PM +0100, Jan Beulich wrote:
>> On 04.12.2023 10:43, Roger Pau Monne wrote:
>>> @@ -476,58 +406,55 @@ void __hwdom_init arch_iommu_hwdom_init(struct domain *d)
>>>      if ( !map )
>>>          panic("IOMMU init: unable to allocate rangeset\n");
>>>  
>>> -    max_pfn = (GB(4) >> PAGE_SHIFT) - 1;
>>> -    top = max(max_pdx, pfn_to_pdx(max_pfn) + 1);
>>> +    if ( iommu_hwdom_inclusive )
>>> +    {
>>> +        /* Add the whole range below 4GB, UNUSABLE regions will be removed. */
>>> +        rc = rangeset_add_range(map, 0, max_pfn);
>>> +        if ( rc )
>>> +            panic("IOMMU inclusive mappings can't be added: %d\n",
>>> +                  rc);
>>> +    }
>>>  
>>> -    for ( i = 0, start = 0, count = 0; i < top; )
>>> +    for ( i = 0; i < e820.nr_map; i++ )
>>>      {
>>> -        unsigned long pfn = pdx_to_pfn(i);
>>> -        unsigned int perms = hwdom_iommu_map(d, pfn, max_pfn);
>>> +        struct e820entry entry = e820.map[i];
>>>  
>>> -        if ( !perms )
>>> -            /* nothing */;
>>> -        else if ( paging_mode_translate(d) )
>>> +        switch ( entry.type )
>>>          {
>>> -            int rc;
>>> +        case E820_UNUSABLE:
>>> +            if ( !iommu_hwdom_inclusive || PFN_DOWN(entry.addr) > max_pfn )
>>> +                continue;
>>
>> The !iommu_hwdom_inclusive part isn't really needed here, is it? The ...
> 
> Nor the PFN_DOWN(entry.addr) > max_pfn.

Hmm, I couldn't convince myself that could also be dropped.

>>> -            rc = p2m_add_identity_entry(d, pfn,
>>> -                                        perms & IOMMUF_writable ? p2m_access_rw
>>> -                                                                : p2m_access_r,
>>> -                                        0);
>>> +            rc = rangeset_remove_range(map, PFN_DOWN(entry.addr),
>>> +                                       PFN_DOWN(entry.addr + entry.size - 1));
>>
>> ... call here would then simply be a no-op, as it looks. And things would
>> overall look more safe if the removal was skipped for fewer reasons.
> 
> OK, the removal can be done unconditionally if so desired.
> 
>>> @@ -605,7 +532,7 @@ void __hwdom_init arch_iommu_hwdom_init(struct domain *d)
>>>      rangeset_destroy(map);
>>>  
>>>      /* Use if to avoid compiler warning */
>>> -    if ( iommu_iotlb_flush_all(d, flush_flags) )
>>> +    if ( iommu_iotlb_flush_all(d, map_data.flush_flags) )
>>>          return;
>>>  }
>>
>> Ah yes, here is said change. But I think for correctness this wants
>> moving to the earlier patch.
> 
> OK, so something like:
> 
> map_data.flush_flags |= flush_flags;

Or simply drop flush_flags here right away (read: replace by map.flush_flags).

Jan

> And adjusting the iommu_iotlb_flush_all() would be fine in this patch
> context.
> 
> Thanks, Roger.



^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH v2 6/6] x86/iommu: cleanup unused functions
  2023-12-05 15:29   ` Jan Beulich
@ 2023-12-05 15:48     ` Roger Pau Monné
  2023-12-05 15:52       ` Jan Beulich
  0 siblings, 1 reply; 24+ messages in thread
From: Roger Pau Monné @ 2023-12-05 15:48 UTC (permalink / raw)
  To: Jan Beulich
  Cc: Paul Durrant, Andrew Cooper, Wei Liu, Lukasz Hawrylko,
	Daniel P. Smith, Mateusz Mówka, xen-devel

On Tue, Dec 05, 2023 at 04:29:18PM +0100, Jan Beulich wrote:
> On 04.12.2023 10:43, Roger Pau Monne wrote:
> > Remove xen_in_range() and vpci_is_mmcfg_address() now that hey are unused.
> 
> Of the latter you remove only the declaration. Stale patch maybe?

Fixed.

> For the former, am I misremembering that Andrew had asked for the function
> to stay?

https://lore.kernel.org/xen-devel/81534803-9da4-49b7-894e-f3fb5e8fb131@citrix.com

I did read Andrew's response as it being fine to switch the current
function to use a rangeset, but not as a request to duplicate it.

Thanks, Roger.


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH v2 6/6] x86/iommu: cleanup unused functions
  2023-12-05 15:48     ` Roger Pau Monné
@ 2023-12-05 15:52       ` Jan Beulich
  0 siblings, 0 replies; 24+ messages in thread
From: Jan Beulich @ 2023-12-05 15:52 UTC (permalink / raw)
  To: Roger Pau Monné
  Cc: Paul Durrant, Andrew Cooper, Wei Liu, Lukasz Hawrylko,
	Daniel P. Smith, Mateusz Mówka, xen-devel

On 05.12.2023 16:48, Roger Pau Monné wrote:
> On Tue, Dec 05, 2023 at 04:29:18PM +0100, Jan Beulich wrote:
>> On 04.12.2023 10:43, Roger Pau Monne wrote:
>>> Remove xen_in_range() and vpci_is_mmcfg_address() now that hey are unused.
>>
>> Of the latter you remove only the declaration. Stale patch maybe?
> 
> Fixed.
> 
>> For the former, am I misremembering that Andrew had asked for the function
>> to stay?
> 
> https://lore.kernel.org/xen-devel/81534803-9da4-49b7-894e-f3fb5e8fb131@citrix.com
> 
> I did read Andrew's response as it being fine to switch the current
> function to use a rangeset, but not as a request to duplicate it.

Hmm, maybe I inferred too much from "And I'd hoped to make this common".

Jan



^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH v2 5/6] x86/iommu: switch hwdom IOMMU to use a rangeset
  2023-12-05 15:47       ` Jan Beulich
@ 2023-12-05 16:07         ` Roger Pau Monné
  0 siblings, 0 replies; 24+ messages in thread
From: Roger Pau Monné @ 2023-12-05 16:07 UTC (permalink / raw)
  To: Jan Beulich; +Cc: Paul Durrant, xen-devel

On Tue, Dec 05, 2023 at 04:47:15PM +0100, Jan Beulich wrote:
> On 05.12.2023 16:43, Roger Pau Monné wrote:
> > On Tue, Dec 05, 2023 at 04:27:05PM +0100, Jan Beulich wrote:
> >> On 04.12.2023 10:43, Roger Pau Monne wrote:
> >>> @@ -476,58 +406,55 @@ void __hwdom_init arch_iommu_hwdom_init(struct domain *d)
> >>>      if ( !map )
> >>>          panic("IOMMU init: unable to allocate rangeset\n");
> >>>  
> >>> -    max_pfn = (GB(4) >> PAGE_SHIFT) - 1;
> >>> -    top = max(max_pdx, pfn_to_pdx(max_pfn) + 1);
> >>> +    if ( iommu_hwdom_inclusive )
> >>> +    {
> >>> +        /* Add the whole range below 4GB, UNUSABLE regions will be removed. */
> >>> +        rc = rangeset_add_range(map, 0, max_pfn);
> >>> +        if ( rc )
> >>> +            panic("IOMMU inclusive mappings can't be added: %d\n",
> >>> +                  rc);
> >>> +    }
> >>>  
> >>> -    for ( i = 0, start = 0, count = 0; i < top; )
> >>> +    for ( i = 0; i < e820.nr_map; i++ )
> >>>      {
> >>> -        unsigned long pfn = pdx_to_pfn(i);
> >>> -        unsigned int perms = hwdom_iommu_map(d, pfn, max_pfn);
> >>> +        struct e820entry entry = e820.map[i];
> >>>  
> >>> -        if ( !perms )
> >>> -            /* nothing */;
> >>> -        else if ( paging_mode_translate(d) )
> >>> +        switch ( entry.type )
> >>>          {
> >>> -            int rc;
> >>> +        case E820_UNUSABLE:
> >>> +            if ( !iommu_hwdom_inclusive || PFN_DOWN(entry.addr) > max_pfn )
> >>> +                continue;
> >>
> >> The !iommu_hwdom_inclusive part isn't really needed here, is it? The ...
> > 
> > Nor the PFN_DOWN(entry.addr) > max_pfn.
> 
> Hmm, I couldn't convince myself that could also be dropped.

We never map unusable regions, so it's always fine to remove them from
the rangeset.  The condition was just a way to exit early and avoid
the rangeset_remove_range() call.

> >>> -            rc = p2m_add_identity_entry(d, pfn,
> >>> -                                        perms & IOMMUF_writable ? p2m_access_rw
> >>> -                                                                : p2m_access_r,
> >>> -                                        0);
> >>> +            rc = rangeset_remove_range(map, PFN_DOWN(entry.addr),
> >>> +                                       PFN_DOWN(entry.addr + entry.size - 1));
> >>
> >> ... call here would then simply be a no-op, as it looks. And things would
> >> overall look more safe if the removal was skipped for fewer reasons.
> > 
> > OK, the removal can be done unconditionally if so desired.
> > 
> >>> @@ -605,7 +532,7 @@ void __hwdom_init arch_iommu_hwdom_init(struct domain *d)
> >>>      rangeset_destroy(map);
> >>>  
> >>>      /* Use if to avoid compiler warning */
> >>> -    if ( iommu_iotlb_flush_all(d, flush_flags) )
> >>> +    if ( iommu_iotlb_flush_all(d, map_data.flush_flags) )
> >>>          return;
> >>>  }
> >>
> >> Ah yes, here is said change. But I think for correctness this wants
> >> moving to the earlier patch.
> > 
> > OK, so something like:
> > 
> > map_data.flush_flags |= flush_flags;
> 
> Or simply drop flush_flags here right away (read: replace by map.flush_flags).

Right, OK, that will lead to some small changes to existing code which
I wanted to avoid in the context of just adding new code to deal with
a rangeset, but anyway, if that's preferred I will adjust.

Thanks, Roger.


^ permalink raw reply	[flat|nested] 24+ messages in thread

end of thread, other threads:[~2023-12-05 16:07 UTC | newest]

Thread overview: 24+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2023-12-04  9:42 [PATCH v2 0/6] x86/iommu: improve setup time of hwdom IOMMU Roger Pau Monne
2023-12-04  9:43 ` [PATCH v2 1/6] iommu/vt-d: do not assume page table levels for quarantine domain Roger Pau Monne
2023-12-05 14:24   ` Jan Beulich
2023-12-05 14:30     ` Roger Pau Monné
2023-12-04  9:43 ` [PATCH v2 2/6] amd-vi: set IOMMU page table levels based on guest reported paddr width Roger Pau Monne
2023-12-05 14:32   ` Jan Beulich
2023-12-05 15:11     ` Roger Pau Monné
2023-12-05 15:19       ` Jan Beulich
2023-12-04  9:43 ` [PATCH v2 3/6] x86/iommu: introduce a rangeset to perform hwdom IOMMU setup Roger Pau Monne
2023-12-05 14:50   ` Jan Beulich
2023-12-05 15:29     ` Roger Pau Monné
2023-12-04  9:43 ` [PATCH v2 4/6] x86/iommu: remove regions not to be mapped Roger Pau Monne
2023-12-05 15:11   ` Jan Beulich
2023-12-05 15:31     ` Roger Pau Monné
2023-12-05 15:34       ` Jan Beulich
2023-12-04  9:43 ` [PATCH v2 5/6] x86/iommu: switch hwdom IOMMU to use a rangeset Roger Pau Monne
2023-12-05 15:27   ` Jan Beulich
2023-12-05 15:43     ` Roger Pau Monné
2023-12-05 15:47       ` Jan Beulich
2023-12-05 16:07         ` Roger Pau Monné
2023-12-04  9:43 ` [PATCH v2 6/6] x86/iommu: cleanup unused functions Roger Pau Monne
2023-12-05 15:29   ` Jan Beulich
2023-12-05 15:48     ` Roger Pau Monné
2023-12-05 15:52       ` Jan Beulich

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.