[XEN RFC PATCH v6 00/11] IOMMU subsystem redesign and PV-IOMMU interface


* [XEN RFC PATCH v6 00/11] IOMMU subsystem redesign and PV-IOMMU interface
@ 2025-02-17 10:18 Teddy Astie
  2025-02-17 10:18 ` [XEN RFC PATCH v6 01/11] docs/designs: Add a design document for IOMMU subsystem redesign Teddy Astie
                   ` (11 more replies)
  0 siblings, 12 replies; 17+ messages in thread
From: Teddy Astie @ 2025-02-17 10:18 UTC (permalink / raw)
  To: xen-devel
  Cc: Teddy Astie, Andrew Cooper, Anthony PERARD, Michal Orzel,
	Jan Beulich, Julien Grall, Roger Pau Monné,
	Stefano Stabellini, Bertrand Marquis, Volodymyr Babchuk,
	Shawn Anastasio, Lukasz Hawrylko, Daniel P. Smith,
	Mateusz Mówka, Marek Marczykowski-Górecki

This work has been presented at Xen Summit 2024 during the
  IOMMU paravirtualization and Xen IOMMU subsystem rework
design session.

Operating systems may want access to an IOMMU in order to do DMA
protection or implement certain features (e.g. VFIO on Linux).

VFIO support is mandatory for frameworks such as SPDK, which can be useful to
implement an alternative storage backend for virtual machines [1].

In this patch series, we introduce in Xen the ability to manage several
contexts per domain and provide a new hypercall interface to allow guests
to manage IOMMU contexts.

The VT-d and AMD-Vi drivers are updated to support these new features.

[1] Using SPDK with the Xen hypervisor - FOSDEM 2023
---
Cc: Marek Marczykowski-Górecki <marmarek@invisiblethingslab.com>

PCI Passthrough now works on my side, but things still feel quite brittle.

Changed in v2 :
* fixed Xen crash when dumping IOMMU contexts (using X debug key)
with DomUs without IOMMU
* s/dettach/detach/
* removed some unused includes
* fix dangling devices in contexts with detach

Changed in v3 :
* lock entirely map/unmap in hypercall
* prevent IOMMU operations on dying contexts (fix race condition)
* iommu_check_context+iommu_get_context -> iommu_get_context and check for NULL

Changed in v4 :
* Part of initialization logic is moved to domain or toolstack (IOMMU_init)
  + domain/toolstack now decides on "context count" and "pagetable pool size"
  + for now, all domains are able to initialize PV-IOMMU
* introduce "dom0-iommu=no-dma" to make default context block all DMA
  (disables HAP and sync-pt), enforcing usage of PV-IOMMU for DMA
  Can be used to expose properly "Pre-boot DMA protection"
* redesigned locking logic for contexts
  + contexts are accessed using iommu_get_context and released with iommu_put_context
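
To illustrate the get/put pattern mentioned above, here is a minimal sketch (not
the code from this series). It assumes a toy, single-threaded context table in
which a lookup fails for unknown or dying contexts; all toy_* names and
TOY_NR_CTX are hypothetical, and real code would need proper locking.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define TOY_NR_CTX 4 /* hypothetical pool size, not a Xen constant */

struct toy_ctx {
    bool allocated;
    bool dying;          /* set once teardown has started */
    unsigned int users;  /* holders between get and put */
};

struct toy_domain {
    struct toy_ctx ctx[TOY_NR_CTX]; /* index 0 plays the default context */
};

/* Return the context, or NULL if it is out of range, unallocated or dying. */
static struct toy_ctx *toy_get_context(struct toy_domain *d, uint16_t ctx_id)
{
    struct toy_ctx *ctx;

    if ( ctx_id >= TOY_NR_CTX )
        return NULL;

    ctx = &d->ctx[ctx_id];
    if ( !ctx->allocated || ctx->dying )
        return NULL;

    ctx->users++;
    return ctx;
}

/* Release a context previously returned by toy_get_context(). */
static void toy_put_context(struct toy_ctx *ctx)
{
    assert(ctx && ctx->users > 0);
    ctx->users--;
}
```

The point of the single NULL check is that callers no longer need a separate
"does this context exist" step before taking it.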

Changed in v5 :
* various PCI Passthrough related fixes
  + rewrote parts of PCI Passthrough logic
  + various other related bug fixes
* simplified VT-d DID (for hardware) management by only having one map instead of two
  (the pseudo_domid map was previously used for the old quarantine code, then recycled for
   PV-IOMMU, in addition to another map also tracking Domain<->VT-d DID; now a single
   map tracks both, which makes things simpler)
* reworked parts of Xen quarantine logic (needed for PCI Passthrough)
* added cf_check annotations
* some changes to PV-IOMMU headers (Alejandro)

Changed in v6 :
* reorganized the patch series to allow bisecting
   * it is split into various smaller patches
* initial AMD-Vi port (it doesn't completely work with PV-IOMMU yet, but it at least
  builds)
   * AMD-Vi lacks support for iommu_lookup_page (needed for several PV-IOMMU ops)

TODO:
* fix some issues with no-dma+PV and grants
* complete "no-dma" mode (expose to toolstack, add documentation, ...)
* properly define nested mode and PASID support
* consider per-iommu domid limit (allocate did on first attach/reattach ?)
* fix ARM/PPC build issues

* make new quarantine code more unity region aware (isolate devices with
  different reserved regions using separate 'contexts')
* find a way to make PV-IOMMU work in DomUs (they don't see machine bdf)
* there are corner cases with PV-IOMMU and to-domain Xen PCI Passthrough
  (e.g. pci-assignable-remove will reassign to context 0, while the driver
   expects the device to be in context X)

Teddy Astie (11):
  docs/designs: Add a design document for IOMMU subsystem redesign
  docs/designs: Add a design document for PV-IOMMU
  x86/domain: Defer domain iommu initialization.
  iommu: Move IOMMU domain related structures to (arch_)iommu_context
  iommu: Simplify quarantine logic
  vtd: Remove MAP_ERROR_RECOVERY code path in domain_context_mapping_one
  iommu: Simplify hardware did management
  iommu: Introduce redesigned IOMMU subsystem
  x86/iommu: Introduce IOMMU arena
  iommu: Introduce PV-IOMMU
  iommu: Introduce no-dma feature

 docs/designs/iommu-contexts.md              |  403 +++++
 docs/designs/pv-iommu.md                    |  116 ++
 xen/arch/arm/include/asm/iommu.h            |    4 +
 xen/arch/ppc/include/asm/iommu.h            |    3 +
 xen/arch/x86/domain.c                       |   10 +-
 xen/arch/x86/include/asm/arena.h            |   54 +
 xen/arch/x86/include/asm/iommu.h            |   59 +-
 xen/arch/x86/include/asm/pci.h              |   17 -
 xen/arch/x86/mm/p2m-ept.c                   |    2 +-
 xen/arch/x86/pv/dom0_build.c                |    6 +-
 xen/arch/x86/tboot.c                        |    3 +-
 xen/common/Makefile                         |    1 +
 xen/common/memory.c                         |    4 +-
 xen/common/pv-iommu.c                       |  539 +++++++
 xen/drivers/passthrough/amd/iommu.h         |   21 +-
 xen/drivers/passthrough/amd/iommu_cmd.c     |   20 +-
 xen/drivers/passthrough/amd/iommu_init.c    |   13 +-
 xen/drivers/passthrough/amd/iommu_map.c     |  217 +--
 xen/drivers/passthrough/amd/pci_amd_iommu.c |  346 ++--
 xen/drivers/passthrough/iommu.c             |  735 ++++++++-
 xen/drivers/passthrough/pci.c               |  404 ++---
 xen/drivers/passthrough/vtd/extern.h        |   19 +-
 xen/drivers/passthrough/vtd/iommu.c         | 1612 ++++++-------------
 xen/drivers/passthrough/vtd/iommu.h         |    2 -
 xen/drivers/passthrough/vtd/qinval.c        |    2 +-
 xen/drivers/passthrough/vtd/quirks.c        |   21 +-
 xen/drivers/passthrough/vtd/vtd.h           |    3 +-
 xen/drivers/passthrough/x86/Makefile        |    1 +
 xen/drivers/passthrough/x86/arena.c         |  157 ++
 xen/drivers/passthrough/x86/iommu.c         |  294 +++-
 xen/include/hypercall-defs.c                |    6 +
 xen/include/public/pv-iommu.h               |  343 ++++
 xen/include/public/xen.h                    |    1 +
 xen/include/xen/iommu.h                     |  117 +-
 xen/include/xen/pci.h                       |    3 +
 35 files changed, 3585 insertions(+), 1973 deletions(-)
 create mode 100644 docs/designs/iommu-contexts.md
 create mode 100644 docs/designs/pv-iommu.md
 create mode 100644 xen/arch/x86/include/asm/arena.h
 create mode 100644 xen/common/pv-iommu.c
 create mode 100644 xen/drivers/passthrough/x86/arena.c
 create mode 100644 xen/include/public/pv-iommu.h

--
2.47.2



Teddy Astie | Vates XCP-ng Developer

XCP-ng & Xen Orchestra - Vates solutions

web: https://vates.tech




* [XEN RFC PATCH v6 06/11] vtd: Remove MAP_ERROR_RECOVERY code path in domain_context_mapping_one
  2025-02-17 10:18 [XEN RFC PATCH v6 00/11] IOMMU subsystem redesign and PV-IOMMU interface Teddy Astie
  2025-02-17 10:18 ` [XEN RFC PATCH v6 01/11] docs/designs: Add a design document for IOMMU subsystem redesign Teddy Astie
  2025-02-17 10:18 ` [XEN RFC PATCH v6 05/11] iommu: Simplify quarantine logic Teddy Astie
@ 2025-02-17 10:18 ` Teddy Astie
  2025-02-17 10:18 ` [XEN RFC PATCH v6 02/11] docs/designs: Add a design document for PV-IOMMU Teddy Astie
                   ` (8 subsequent siblings)
  11 siblings, 0 replies; 17+ messages in thread
From: Teddy Astie @ 2025-02-17 10:18 UTC (permalink / raw)
  To: xen-devel; +Cc: Teddy Astie, Jan Beulich, Andrew Cooper, Roger Pau Monné

This logic is almost never called as the only possible failures are
- no memory to allocate the pagetable (if it isn't already allocated);
  this is fixed in this patch series by ensuring that the pagetable is allocated
  when entering this function
- EILSEQ when there is a race condition with hardware, which should not happen under
  normal circumstances

Remove this logic to simplify the error management of the function.

Signed-off-by: Teddy Astie <teddy.astie@vates.tech>
---
 xen/drivers/passthrough/vtd/iommu.c | 20 --------------------
 xen/drivers/passthrough/vtd/vtd.h   |  3 +--
 2 files changed, 1 insertion(+), 22 deletions(-)
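
Side note on the vtd.h hunk: with MAP_ERROR_RECOVERY gone, UNMAP_ME_PHANTOM_FUNC
moves from bit 3 to bit 2. A minimal sketch of a compile-time check that the
remaining flags stay distinct bits follows. The values are copied from the hunk;
the TOY_ prefix, TOY_ALL_FLAGS and toy_mode_has() are illustrative only and not
part of the patch.

```c
#include <assert.h>
#include <stdbool.h>

/* Values as they appear in vtd.h after this patch; TOY_ prefix is illustrative. */
#define TOY_MAP_WITH_RMRR         (1u << 0)
#define TOY_MAP_OWNER_DYING       (1u << 1)
#define TOY_UNMAP_ME_PHANTOM_FUNC (1u << 2)

#define TOY_ALL_FLAGS \
    (TOY_MAP_WITH_RMRR | TOY_MAP_OWNER_DYING | TOY_UNMAP_ME_PHANTOM_FUNC)

/*
 * Each flag is a single bit, so the OR equals 0x7 only if all three are
 * distinct and packed contiguously from bit 0.
 */
static_assert(TOY_ALL_FLAGS == 0x7u, "flags overlap or leave a gap");

/* True if every bit of flag is set in mode. */
static bool toy_mode_has(unsigned int mode, unsigned int flag)
{
    return (mode & flag) == flag;
}
```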

diff --git a/xen/drivers/passthrough/vtd/iommu.c b/xen/drivers/passthrough/vtd/iommu.c
index 55562084fc..852994cf97 100644
--- a/xen/drivers/passthrough/vtd/iommu.c
+++ b/xen/drivers/passthrough/vtd/iommu.c
@@ -1621,26 +1621,6 @@ int domain_context_mapping_one(
     if ( !seg && !rc )
         rc = me_wifi_quirk(domain, bus, devfn, domid, pgd_maddr, mode);
 
-    if ( rc && !(mode & MAP_ERROR_RECOVERY) )
-    {
-        if ( !prev_dom ||
-             /*
-              * Unmapping here means DEV_TYPE_PCI devices with RMRRs (if such
-              * exist) would cause problems if such a region was actually
-              * accessed.
-              */
-             (prev_dom == dom_io && !pdev) )
-            ret = domain_context_unmap_one(domain, iommu, bus, devfn);
-        else
-            ret = domain_context_mapping_one(prev_dom, ctx, iommu, bus, devfn, pdev,
-                                             prev_dom->domain_id,
-                                             iommu_default_context(prev_dom)->arch.vtd.pgd_maddr,
-                                             (mode & MAP_WITH_RMRR) |
-                                             MAP_ERROR_RECOVERY) < 0;
-
-        if ( !ret && pdev && pdev->devfn == devfn )
-            check_cleanup_domid_map(domain, pdev, iommu);
-    }
 
     if ( prev_dom )
         rcu_unlock_domain(prev_dom);
diff --git a/xen/drivers/passthrough/vtd/vtd.h b/xen/drivers/passthrough/vtd/vtd.h
index b95124517b..72aa9a70c9 100644
--- a/xen/drivers/passthrough/vtd/vtd.h
+++ b/xen/drivers/passthrough/vtd/vtd.h
@@ -28,8 +28,7 @@
  */
 #define MAP_WITH_RMRR         (1u << 0)
 #define MAP_OWNER_DYING       (1u << 1)
-#define MAP_ERROR_RECOVERY    (1u << 2)
-#define UNMAP_ME_PHANTOM_FUNC (1u << 3)
+#define UNMAP_ME_PHANTOM_FUNC (1u << 2)
 
 /* Allow for both IOAPIC and IOSAPIC. */
 #define IO_xAPIC_route_entry IO_APIC_route_entry
-- 
2.47.2






* [XEN RFC PATCH v6 01/11] docs/designs: Add a design document for IOMMU subsystem redesign
  2025-02-17 10:18 [XEN RFC PATCH v6 00/11] IOMMU subsystem redesign and PV-IOMMU interface Teddy Astie
@ 2025-02-17 10:18 ` Teddy Astie
  2025-02-17 10:18 ` [XEN RFC PATCH v6 05/11] iommu: Simplify quarantine logic Teddy Astie
                   ` (10 subsequent siblings)
  11 siblings, 0 replies; 17+ messages in thread
From: Teddy Astie @ 2025-02-17 10:18 UTC (permalink / raw)
  To: xen-devel
  Cc: Teddy Astie, Andrew Cooper, Anthony PERARD, Michal Orzel,
	Jan Beulich, Julien Grall, Roger Pau Monné,
	Stefano Stabellini

The current IOMMU subsystem has some limitations that make PV-IOMMU practically impossible.
One of them is the assumption that each domain is bound to a single "IOMMU domain", which
also complicates the quarantine implementation.

Moreover, the current IOMMU subsystem is not entirely well-defined. For instance, the behavior
of map_page differs greatly between ARM SMMUv3 and x86 VT-d/AMD-Vi. On ARM, it can modify
the domain page table, while on x86 it may be forbidden (e.g. using HAP with PVH) or may only
modify the devices' PoV (e.g. using PV).

The goal of this redesign is to define the behavior and interface of the IOMMU subsystem
more explicitly, while allowing PV-IOMMU to be implemented effectively.

Signed-off-by: Teddy Astie <teddy.astie@vates.tech>
---
 docs/designs/iommu-contexts.md | 403 +++++++++++++++++++++++++++++++++
 1 file changed, 403 insertions(+)
 create mode 100644 docs/designs/iommu-contexts.md
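
For readers skimming the document below: it describes a fixed-size pool of
non-default contexts whose ids match their position in the map, with id 0
reserved for the default context. A minimal sketch of that id allocation,
assuming a single 64-bit bitmap word; the toy_* names, TOY_POOL_SIZE and the
chosen error codes are hypothetical and not the interface of this series.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

#define TOY_POOL_SIZE 64 /* hypothetical: one 64-bit bitmap word */

struct toy_context_list {
    uint16_t count;  /* number of non-default contexts in the pool */
    uint64_t bitmap; /* bit i set => pool slot i is allocated */
};

/* Allocate a free slot; on success store its id (slot + 1, never 0). */
static int toy_context_alloc(struct toy_context_list *l, uint16_t *ctx_id)
{
    assert(l->count <= TOY_POOL_SIZE);

    for ( uint16_t i = 0; i < l->count; i++ )
    {
        if ( !(l->bitmap & (UINT64_C(1) << i)) )
        {
            l->bitmap |= UINT64_C(1) << i;
            *ctx_id = i + 1;
            return 0;
        }
    }

    return -ENOSPC;
}

/* Free a non-default context id; id 0 (the default context) is rejected. */
static int toy_context_free(struct toy_context_list *l, uint16_t ctx_id)
{
    if ( ctx_id == 0 || ctx_id > l->count )
        return -EINVAL;

    if ( !(l->bitmap & (UINT64_C(1) << (ctx_id - 1))) )
        return -ENOENT;

    l->bitmap &= ~(UINT64_C(1) << (ctx_id - 1));
    return 0;
}
```

Because the pool size is fixed at initialization, exhaustion is an ordinary
error for the caller rather than a trigger for further allocation.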

diff --git a/docs/designs/iommu-contexts.md b/docs/designs/iommu-contexts.md
new file mode 100644
index 0000000000..d61c5fcde2
--- /dev/null
+++ b/docs/designs/iommu-contexts.md
@@ -0,0 +1,403 @@
+# IOMMU context management in Xen
+
+Status: Experimental
+Revision: 0
+
+# Background
+
+The design for *IOMMU paravirtualization for Dom0* [1] explains that some guests may
+want access to IOMMU features. In order to implement this in Xen, several adjustments
+need to be made to the IOMMU subsystem.
+
+The "hardware IOMMU domain" is currently implemented on a per-domain basis, such that each
+domain has exactly one *hardware IOMMU domain*. This design aims to allow a
+single Xen domain to manage several "IOMMU contexts", and to allow some domains (e.g. Dom0
+[1]) to modify their IOMMU contexts.
+
+In addition, the quarantine feature can be refactored to use IOMMU contexts,
+reducing the complexity of platform-specific implementations and ensuring more
+consistency across platforms.
+
+# IOMMU context
+
+We define an "IOMMU context" as being a *hardware IOMMU domain*, but named as a context
+to avoid confusion with Xen domains.
+It represents some hardware-specific data structure that contains mappings from a device
+frame-number to a machine frame-number (e.g. using a pagetable) that can be applied to
+a device using IOMMU hardware.
+
+This structure is bound to a Xen domain, but a Xen domain may have several IOMMU contexts.
+These contexts may be modifiable using the interface as defined in [1], aside from some
+specific cases (e.g. modifying the default context).
+
+This is implemented in Xen as a new structure that will hold context-specific
+data.
+
+```c
+struct iommu_context {
+    u16 id; /* Context id (0 means default context) */
+    struct list_head devices;
+
+    struct arch_iommu_context arch;
+
+    bool opaque; /* context can't be modified nor accessed (e.g HAP) */
+};
+```
+
+A context is identified by a number that is domain-specific and may be used by IOMMU
+users such as PV-IOMMU by the guest.
+
+`struct arch_iommu_context` is split out of `struct arch_iommu`:
+
+```c
+struct arch_iommu_context
+{
+    spinlock_t pgtables_lock;
+    struct page_list_head pgtables;
+
+    union {
+        /* Intel VT-d */
+        struct {
+            uint64_t pgd_maddr; /* io page directory machine address */
+            domid_t *didmap; /* per-iommu DID */
+            unsigned long *iommu_bitmap; /* bitmap of iommu(s) that the context uses */
+        } vtd;
+        /* AMD IOMMU */
+        struct {
+            struct page_info *root_table;
+        } amd;
+    };
+};
+
+struct arch_iommu
+{
+    spinlock_t mapping_lock; /* io page table lock */
+    struct {
+        struct page_list_head list;
+        spinlock_t lock;
+    } pgtables;
+
+    struct list_head identity_maps;
+
+    union {
+        /* Intel VT-d */
+        struct {
+            /* no more context-specific values */
+            unsigned int agaw; /* adjusted guest address width, 0 is level 2 30-bit */
+        } vtd;
+        /* AMD IOMMU */
+        struct {
+            unsigned int paging_mode;
+            struct guest_iommu *g_iommu;
+        } amd;
+    };
+};
+```
+
+IOMMU context information is now carried by iommu_context rather than being integrated into
+struct arch_iommu.
+
+# Xen domain IOMMU structure
+
+`struct domain_iommu` is modified to allow multiple contexts to exist within a single
+Xen domain:
+
+```c
+struct iommu_context_list {
+    uint16_t count; /* Context count excluding default context */
+
+    /* if count > 0 */
+
+    uint64_t *bitmap; /* bitmap of context allocation */
+    struct iommu_context *map; /* Map of contexts */
+};
+
+struct domain_iommu {
+    /* ... */
+
+    struct iommu_context default_ctx;
+    struct iommu_context_list other_contexts;
+
+    /* ... */
+};
+```
+
+default_ctx is a special context with id=0 that holds the page table mapping the entire
+domain, which basically preserves the previous behavior. All devices are expected to be
+bound to this context during initialization.
+
+Along with this default context, which always exists, we use a pool of contexts that has a
+fixed size at domain initialization, where contexts can be allocated (if possible), and
+have an id matching their position in the map (considering that id != 0).
+These contexts may be used by IOMMU context users such as PV-IOMMU or the quarantine domain
+(DomIO).
+
+# Platform independent context management interface
+
+A new platform-independent interface is introduced in the Xen hypervisor to allow
+IOMMU context users to create and manage contexts within domains.
+
+```c
+/* Direct context access functions (not supposed to be used directly) */
+struct iommu_context *iommu_get_context(struct domain *d, u16 ctx_id);
+void iommu_put_context(struct iommu_context *ctx);
+
+/* Flag for default context initialization */
+#define IOMMU_CONTEXT_INIT_default (1 << 0)
+
+/* Flag for quarantine contexts (scratch page, DMA Abort mode, ...) */
+#define IOMMU_CONTEXT_INIT_quarantine (1 << 1)
+
+int iommu_context_init(struct domain *d, struct iommu_context *ctx, u16 ctx_id, u32 flags);
+
+/* Flag to specify that devices will need to be reattached to default domain */
+#define IOMMU_TEARDOWN_REATTACH_DEFAULT (1 << 0)
+
+/*
+ * Flag to specify that the context needs to be destroyed preemptively
+ * (multiple calls to iommu_context_teardown will be required)
+ */
+#define IOMMU_TEARDOWN_PREEMPT (1 << 1)
+
+int iommu_context_teardown(struct domain *d, struct iommu_context *ctx, u32 flags);
+
+/* Allocate a new context, uses CONTEXT_INIT flags */
+int iommu_context_alloc(struct domain *d, u16 *ctx_id, u32 flags);
+
+/* Free a context, uses CONTEXT_TEARDOWN flags */
+int iommu_context_free(struct domain *d, u16 ctx_id, u32 flags);
+
+/* Move a device from one context to another, including between different domains. */
+int iommu_reattach_context(struct domain *prev_dom, struct domain *next_dom,
+                           device_t *dev, u16 ctx_id);
+
+/* Add a device to a context for first initialization */
+int iommu_attach_context(struct domain *d, device_t *dev, u16 ctx_id);
+
+/* Remove a device from a context, effectively removing it from the IOMMU. */
+int iommu_detach_context(struct domain *d, device_t *dev);
+```
+
+This interface relies on a new interface with the drivers to implement these features.
+
+Some existing functions gain a new parameter specifying which context to operate on:
+- iommu_map (iommu_legacy_map untouched)
+- iommu_unmap (iommu_legacy_unmap untouched)
+- iommu_lookup_page
+- iommu_iotlb_flush
+
+These functions will modify the iommu_context structure to accommodate the
+operations applied, and will be used to replace some operations previously
+performed in the IOMMU driver.
+
+# IOMMU platform_ops interface changes
+
+The IOMMU driver needs to expose a way to create and manage IOMMU contexts. The approach
+taken here is to modify the interface to allow specifying an IOMMU context on operations
+and, at the same time, to simplify the interface by relying more on IOMMU
+platform-independent code.
+
+Functions added to iommu_ops:
+
+```c
+/* Initialize a context (creating page tables, allocating hardware, structures, ...) */
+int (*context_init)(struct domain *d, struct iommu_context *ctx,
+                    u32 flags);
+/* Destroy a context, assumes no device is bound to the context. */
+int (*context_teardown)(struct domain *d, struct iommu_context *ctx,
+                        u32 flags);
+/* Put a device in a context (assumes the device is not attached to another context) */
+int (*attach)(struct domain *d, device_t *dev,
+              struct iommu_context *ctx);
+/* Remove a device from a context, and from the IOMMU. */
+int (*detach)(struct domain *d, device_t *dev,
+              struct iommu_context *prev_ctx);
+/* Move the device from a context to another, including if the new context is in
+   another domain. d corresponds to the target domain. */
+int (*reattach)(struct domain *d, device_t *dev,
+                struct iommu_context *prev_ctx,
+                struct iommu_context *ctx);
+
+#ifdef CONFIG_HAS_PCI
+/* Specific interface for phantom function devices. */
+int (*add_devfn)(struct domain *d, struct pci_dev *pdev, u16 devfn,
+                 struct iommu_context *ctx);
+int (*remove_devfn)(struct domain *d, struct pci_dev *pdev, u16 devfn,
+                    struct iommu_context *ctx);
+#endif
+
+/* Existing hooks, changed to take a specific iommu_context. */
+int __must_check (*map_page)(struct domain *d, dfn_t dfn, mfn_t mfn,
+                             unsigned int flags,
+                             unsigned int *flush_flags,
+                             struct iommu_context *ctx);
+int __must_check (*unmap_page)(struct domain *d, dfn_t dfn,
+                               unsigned int order,
+                               unsigned int *flush_flags,
+                               struct iommu_context *ctx);
+int __must_check (*lookup_page)(struct domain *d, dfn_t dfn, mfn_t *mfn,
+                                unsigned int *flags,
+                                struct iommu_context *ctx);
+
+int __must_check (*iotlb_flush)(struct domain *d,
+                                struct iommu_context *ctx, dfn_t dfn,
+                                unsigned long page_count,
+                                unsigned int flush_flags);
+
+void (*clear_root_pgtable)(struct domain *d, struct iommu_context *ctx);
+```
+
+These functions are redundant with existing functions. Therefore, the following functions
+are replaced with new equivalents:
+- quarantine_init : platform-independent code and IOMMU_CONTEXT_INIT_quarantine flag
+- add_device : attach and add_devfn (phantom)
+- assign_device : attach and add_devfn (phantom)
+- remove_device : detach and remove_devfn (phantom)
+- reassign_device : reattach
+
+There are some functional differences from the previous functions. The following should be
+handled by platform-independent/arch-specific code instead of the IOMMU driver:
+- identity mappings (unity mappings and rmrr)
+- device list in context and domain
+- domain of a device
+- quarantine
+
+The idea behind this is to implement IOMMU context features while simplifying IOMMU
+drivers implementations and ensuring more consistency between IOMMU drivers.
+
+## Phantom function handling
+
+PCI devices may use additional devfns to do DMA operations. In order to support such
+devices, an interface is added to map specific device functions without implying that
+the device is mapped to a new context (that may cause duplicates in Xen data structures).
+
+Functions add_devfn and remove_devfn allow mapping an IOMMU context on a specific devfn
+of a PCI device, without altering platform-independent data structures.
+
+It is important for the reattach operation to care about these devices, in order
+to prevent devices from being partially reattached to the new context (see XSA-449 [2]),
+by using an all-or-nothing approach for reattaching such devices.
+
+# Quarantine refactoring using IOMMU contexts
+
+The quarantine mechanism can be entirely reimplemented using IOMMU contexts, making
+it simpler and more consistent between platforms.
+
+Quarantine is currently only supported on x86 platforms and works by creating a
+single *hardware IOMMU domain* per quarantined device. All the quarantine logic is
+then implemented in a platform-specific fashion while actually implementing the same
+concepts:
+
+The *hardware IOMMU context* data structures for quarantine are currently stored in
+the device structure itself (using arch_pci_dev) and the IOMMU driver needs to care about
+whether we are dealing with quarantine operations or regular operations (often dealt
+with using macros such as QUARANTINE_SKIP or DEVICE_PGTABLE).
+
+The page table that will apply to the quarantined device is created by mapping reserved
+device regions, and adding mappings to a scratch page if enabled (quarantine=scratch-page).
+
+A new approach we can use is allowing the quarantine domain (DomIO) to manage IOMMU
+contexts, and implement all the quarantine logic using IOMMU contexts.
+
+That way, the quarantine implementation can be platform-independent, thus have a more
+consistent implementation between platforms. It will also allow quarantine to work
+with other IOMMU implementations without having to implement platform-specific behavior.
+Moreover, quarantine operations can be implemented using regular context operations
+instead of relying on driver-specific code.
+
+Quarantine implementation can be summarised as
+
+```c
+int iommu_quarantine_dev_init(device_t *dev)
+{
+    int ret;
+    u16 ctx_id;
+
+    if ( !iommu_quarantine )
+        return -EINVAL;
+
+    ret = iommu_context_alloc(dom_io, &ctx_id, IOMMU_CONTEXT_INIT_quarantine);
+
+    if ( ret )
+        return ret;
+
+    /** TODO: Setup scratch page, mappings... */
+
+    ret = iommu_reattach_context(dev->domain, dom_io, dev, ctx_id);
+
+    /*
+     * Free outside ASSERT(): release builds do not evaluate its argument.
+     */
+    if ( ret && iommu_context_free(dom_io, ctx_id, 0) )
+        ASSERT_UNREACHABLE();
+
+    return ret;
+}
+```
+
+# Platform-specific considerations
+
+## Reference counters on target pages
+
+When mapping a guest page onto an IOMMU context, we need to make sure that
+this page is not reused for something else while being actually referenced
+by an IOMMU context. One way of doing it is incrementing the reference counter
+of each target page we map (excluding reserved regions), and decrementing it
+when the mapping isn't used anymore.
+
+One consideration to have is when destroying the context while having existing
+mappings in it. We can walk through the entire page table and decrement the
+reference counter of all mappings. All of that assumes that there is no reserved
+region mapped (which should be the case as a requirement of teardown, or as a
+consequence of REATTACH_DEFAULT flag).
+
+Another consideration is that the "cleanup mappings" operation may take a long time,
+depending on the complexity of the page table. A preemptible teardown operation lets the
+hypercall be preempted if needed, and prevents a malicious guest from stalling a CPU in
+teardown with a specially crafted IOMMU context (e.g. with several 1G superpages).
+
+## Limit the amount of pages IOMMU contexts can use
+
+In order to prevent a (possibly malicious) guest from causing too many allocations
+in Xen, we can enforce limits on the memory the IOMMU subsystem can use for IOMMU contexts.
+A possible implementation is to preallocate a reasonably large chunk of memory
+and split it into pages for use by the IOMMU subsystem only for non-default IOMMU
+contexts (e.g. PV-IOMMU interface). If this limit is reached, some operations
+may fail from the guest side. These limitations shouldn't impact "usual" operations
+of the IOMMU subsystem (e.g. default context initialization).
+
+## x86 Architecture
+
+TODO
+
+### Intel VT-d
+
+VT-d uses DID to tag the *IOMMU domain* applied to a device and assumes that all entries
+with the same DID use the same page table (i.e. same IOMMU context).
+Under certain circumstances (e.g. DRHD with DID limit below 16-bits), the *DID* is
+transparently converted into a DRHD-specific DID using a map managed internally.
+
+The current implementation of the code reuses the Xen domain_id as DID.
+However, by using multiple IOMMU contexts per domain, we can't use the domain_id for
+contexts (otherwise, different page tables will be mapped with the same DID).
+The following strategy is used:
+- on the default context, reuse the domain_id (the default context is unique per domain)
+- on non-default contexts, use an id allocated in the pseudo_domid map (currently used by
+quarantine), which is a DID outside of the Xen domain_id range
+
+### AMD-Vi
+
+TODO
+
+## Device-tree platforms
+
+### SMMU and SMMUv3
+
+TODO
+
+* * *
+
+[1] See pv-iommu.md
+
+[2] pci: phantom functions assigned to incorrect contexts
+https://xenbits.xen.org/xsa/advisory-449.html
\ No newline at end of file
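
On the VT-d DID strategy described in the document above: a minimal sketch of
the selection rule, assuming regular domain ids fit in 15 bits and pseudo ids
are handed out just above that range. TOY_DOMID_MASK, TOY_PSEUDO_SLOTS and all
toy_* names are assumptions made for this illustration, not code from the
series.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Assumptions for this sketch only, not taken from the series. */
#define TOY_DOMID_MASK   0x7FFFu /* assumed upper bound of regular domain ids */
#define TOY_PSEUDO_SLOTS 64u     /* hypothetical size of the pseudo-id pool */

struct toy_pseudo_map {
    uint64_t used; /* bit i set => pseudo id (TOY_DOMID_MASK + 1 + i) is taken */
};

/*
 * Pick the DID for a context: the default context (id 0) reuses the domain id,
 * any other context takes a free pseudo id above the regular domain id range.
 * Returns false if the pseudo-id pool is exhausted.
 */
static bool toy_pick_did(struct toy_pseudo_map *map, uint16_t domid,
                         uint16_t ctx_id, uint16_t *did)
{
    assert(domid <= TOY_DOMID_MASK);

    if ( ctx_id == 0 )
    {
        *did = domid;
        return true;
    }

    for ( unsigned int i = 0; i < TOY_PSEUDO_SLOTS; i++ )
    {
        if ( !(map->used & (UINT64_C(1) << i)) )
        {
            map->used |= UINT64_C(1) << i;
            *did = (uint16_t)(TOY_DOMID_MASK + 1 + i);
            return true;
        }
    }

    return false;
}
```

Keeping the two ranges disjoint is what guarantees that two different page
tables never end up tagged with the same DID.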
-- 
2.47.2






* [XEN RFC PATCH v6 05/11] iommu: Simplify quarantine logic
  2025-02-17 10:18 [XEN RFC PATCH v6 00/11] IOMMU subsystem redesign and PV-IOMMU interface Teddy Astie
  2025-02-17 10:18 ` [XEN RFC PATCH v6 01/11] docs/designs: Add a design document for IOMMU subsystem redesign Teddy Astie
@ 2025-02-17 10:18 ` Teddy Astie
  2025-02-17 10:18 ` [XEN RFC PATCH v6 06/11] vtd: Remove MAP_ERROR_RECOVERY code path in domain_context_mapping_one Teddy Astie
                   ` (9 subsequent siblings)
  11 siblings, 0 replies; 17+ messages in thread
From: Teddy Astie @ 2025-02-17 10:18 UTC (permalink / raw)
  To: xen-devel; +Cc: Teddy Astie, Jan Beulich, Andrew Cooper, Roger Pau Monné

The current quarantine code is very complicated and very hard to
change. Remove most of it and replace it with direct
reassignment to the dom_io domain instead.

Signed-off-by: Teddy Astie <teddy.astie@vates.tech>
---
An idea would be to rework this feature using the reworked
IOMMU subsystem.
---
 xen/arch/x86/include/asm/pci.h              |  17 --
 xen/drivers/passthrough/amd/iommu_map.c     | 129 +---------
 xen/drivers/passthrough/amd/pci_amd_iommu.c |  51 +---
 xen/drivers/passthrough/pci.c               |   7 +-
 xen/drivers/passthrough/vtd/iommu.c         | 253 ++------------------
 xen/drivers/passthrough/x86/iommu.c         |   1 -
 6 files changed, 29 insertions(+), 429 deletions(-)

diff --git a/xen/arch/x86/include/asm/pci.h b/xen/arch/x86/include/asm/pci.h
index fd5480d67d..214c1a0948 100644
--- a/xen/arch/x86/include/asm/pci.h
+++ b/xen/arch/x86/include/asm/pci.h
@@ -15,23 +15,6 @@
 
 struct arch_pci_dev {
     vmask_t used_vectors;
-    /*
-     * These fields are (de)initialized under pcidevs-lock. Other uses of
-     * them don't race (de)initialization and hence don't strictly need any
-     * locking.
-     */
-    union {
-        /* Subset of struct arch_iommu's fields, to be used in dom_io. */
-        struct {
-            uint64_t pgd_maddr;
-        } vtd;
-        struct {
-            struct page_info *root_table;
-        } amd;
-    };
-    domid_t pseudo_domid;
-    mfn_t leaf_mfn;
-    struct page_list_head pgtables_list;
 };
 
 int pci_conf_write_intercept(unsigned int seg, unsigned int bdf,
diff --git a/xen/drivers/passthrough/amd/iommu_map.c b/xen/drivers/passthrough/amd/iommu_map.c
index 7514384789..91d8c21048 100644
--- a/xen/drivers/passthrough/amd/iommu_map.c
+++ b/xen/drivers/passthrough/amd/iommu_map.c
@@ -656,9 +656,6 @@ int amd_iommu_reserve_domain_unity_map(struct domain *d, struct iommu_context *c
 {
     int rc;
 
-    if ( d == dom_io )
-        return 0;
-
     for ( rc = 0; !rc && map; map = map->next )
     {
         p2m_access_t p2ma = p2m_access_n;
@@ -680,9 +677,6 @@ int amd_iommu_reserve_domain_unity_unmap(struct domain *d, struct iommu_context
 {
     int rc;
 
-    if ( d == dom_io )
-        return 0;
-
     for ( rc = 0; map; map = map->next )
     {
         int ret = iommu_identity_mapping(d, ctx, p2m_access_x, map->addr,
@@ -771,134 +765,15 @@ int cf_check amd_iommu_get_reserved_device_memory(
     return 0;
 }
 
-static int fill_qpt(union amd_iommu_pte *this, unsigned int level,
-                    struct page_info *pgs[IOMMU_MAX_PT_LEVELS])
-{
-    struct domain_iommu *hd = dom_iommu(dom_io);
-    struct iommu_context *ctx = iommu_default_context(dom_io);
-    unsigned int i;
-    int rc = 0;
-
-    for ( i = 0; !rc && i < PTE_PER_TABLE_SIZE; ++i )
-    {
-        union amd_iommu_pte *pte = &this[i], *next;
-
-        if ( !pte->pr )
-        {
-            if ( !pgs[level] )
-            {
-                /*
-                 * The pgtable allocator is fine for the leaf page, as well as
-                 * page table pages, and the resulting allocations are always
-                 * zeroed.
-                 */
-                pgs[level] = iommu_alloc_pgtable(hd, ctx, 0);
-                if ( !pgs[level] )
-                {
-                    rc = -ENOMEM;
-                    break;
-                }
-
-                if ( level )
-                {
-                    next = __map_domain_page(pgs[level]);
-                    rc = fill_qpt(next, level - 1, pgs);
-                    unmap_domain_page(next);
-                }
-            }
-
-            /*
-             * PDEs are essentially a subset of PTEs, so this function
-             * is fine to use even at the leaf.
-             */
-            set_iommu_pde_present(pte, mfn_x(page_to_mfn(pgs[level])), level,
-                                  true, true);
-        }
-        else if ( level && pte->next_level )
-        {
-            next = map_domain_page(_mfn(pte->mfn));
-            rc = fill_qpt(next, level - 1, pgs);
-            unmap_domain_page(next);
-        }
-    }
-
-    return rc;
-}
-
 int cf_check amd_iommu_quarantine_init(struct pci_dev *pdev, bool scratch_page)
 {
-    struct domain_iommu *hd = dom_iommu(dom_io);
-    struct iommu_context *ctx = iommu_default_context(dom_io);
-    unsigned int level = ctx->arch.amd.paging_mode;
-    unsigned int req_id = get_dma_requestor_id(pdev->seg, pdev->sbdf.bdf);
-    const struct ivrs_mappings *ivrs_mappings = get_ivrs_mappings(pdev->seg);
-    int rc;
+    amd_iommu_quarantine_teardown(pdev);
 
-    ASSERT(pcidevs_locked());
-    ASSERT(!ctx->arch.amd.root_table);
-    ASSERT(page_list_empty(&ctx->arch.pgtables));
-
-    if ( !scratch_page && !ivrs_mappings[req_id].unity_map )
-        return 0;
-
-    ASSERT(pdev->arch.pseudo_domid != DOMID_INVALID);
-
-    if ( pdev->arch.amd.root_table )
-    {
-        clear_domain_page(pdev->arch.leaf_mfn);
-        return 0;
-    }
-
-    pdev->arch.amd.root_table = iommu_alloc_pgtable(hd, ctx, 0);
-    if ( !pdev->arch.amd.root_table )
-        return -ENOMEM;
-
-    /* Transiently install the root into DomIO, for iommu_identity_mapping(). */
-    ctx->arch.amd.root_table = pdev->arch.amd.root_table;
-
-    rc = amd_iommu_reserve_domain_unity_map(dom_io, ctx,
-                                            ivrs_mappings[req_id].unity_map,
-                                            0);
-
-    iommu_identity_map_teardown(dom_io, ctx);
-    ctx->arch.amd.root_table = NULL;
-
-    if ( rc )
-        AMD_IOMMU_WARN("%pp: quarantine unity mapping failed\n", &pdev->sbdf);
-    else if ( scratch_page )
-    {
-        union amd_iommu_pte *root;
-        struct page_info *pgs[IOMMU_MAX_PT_LEVELS] = {};
-
-        root = __map_domain_page(pdev->arch.amd.root_table);
-        rc = fill_qpt(root, level - 1, pgs);
-        unmap_domain_page(root);
-
-        pdev->arch.leaf_mfn = page_to_mfn(pgs[0]);
-    }
-
-    page_list_move(&pdev->arch.pgtables_list, &ctx->arch.pgtables);
-
-    if ( rc )
-        amd_iommu_quarantine_teardown(pdev);
-
-    return rc;
+    return 0;
 }
 
 void amd_iommu_quarantine_teardown(struct pci_dev *pdev)
 {
-    struct iommu_context *ctx = iommu_default_context(dom_io);
-
-    ASSERT(pcidevs_locked());
-
-    if ( !pdev->arch.amd.root_table )
-        return;
-
-    ASSERT(page_list_empty(&ctx->arch.pgtables));
-    page_list_move(&ctx->arch.pgtables, &pdev->arch.pgtables_list);
-    while ( iommu_free_pgtables(dom_io, ctx) == -ERESTART )
-        /* nothing */;
-    pdev->arch.amd.root_table = NULL;
 }
 
 /*
diff --git a/xen/drivers/passthrough/amd/pci_amd_iommu.c b/xen/drivers/passthrough/amd/pci_amd_iommu.c
index a3815d71be..0008b35162 100644
--- a/xen/drivers/passthrough/amd/pci_amd_iommu.c
+++ b/xen/drivers/passthrough/amd/pci_amd_iommu.c
@@ -138,9 +138,6 @@ static int __must_check amd_iommu_setup_domain_device(
     const struct page_info *root_pg;
     domid_t domid;
 
-    if ( QUARANTINE_SKIP(domain, pdev) )
-        return 0;
-
     BUG_ON(!ctx->arch.amd.paging_mode || !iommu->dev_table.buffer);
 
     rc = allocate_domain_resources(domain);
@@ -159,16 +156,8 @@ static int __must_check amd_iommu_setup_domain_device(
     dte = &table[req_id];
     ivrs_dev = &get_ivrs_mappings(iommu->seg)[req_id];
 
-    if ( domain != dom_io )
-    {
-        root_pg = ctx->arch.amd.root_table;
-        domid = domain->domain_id;
-    }
-    else
-    {
-        root_pg = pdev->arch.amd.root_table;
-        domid = pdev->arch.pseudo_domid;
-    }
+    root_pg = ctx->arch.amd.root_table;
+    domid = domain->domain_id;
 
     spin_lock_irqsave(&iommu->lock, flags);
 
@@ -414,9 +403,6 @@ static void amd_iommu_disable_domain_device(const struct domain *domain,
     int req_id;
     u8 bus = pdev->bus;
 
-    if ( QUARANTINE_SKIP(domain, pdev) )
-        return;
-
     ASSERT(pcidevs_locked());
 
     if ( pci_ats_device(iommu->seg, bus, pdev->devfn) &&
@@ -479,14 +465,9 @@ static int cf_check reassign_device(
         return -ENODEV;
     }
 
-    if ( !QUARANTINE_SKIP(target, pdev) )
-    {
-        rc = amd_iommu_setup_domain_device(target, target_ctx, iommu, devfn, pdev);
-        if ( rc )
-            return rc;
-    }
-    else
-        amd_iommu_disable_domain_device(source, iommu, devfn, pdev);
+    rc = amd_iommu_setup_domain_device(target, target_ctx, iommu, devfn, pdev);
+    if ( rc )
+        return rc;
 
     if ( devfn == pdev->devfn && pdev->domain != target )
     {
@@ -579,8 +560,6 @@ static int cf_check amd_iommu_add_device(u8 devfn, struct pci_dev *pdev)
     struct iommu_context *ctx;
     u16 bdf;
     struct ivrs_mappings *ivrs_mappings;
-    bool fresh_domid = false;
-    int ret;
 
     if ( !pdev->domain )
         return -EINVAL;
@@ -649,22 +628,7 @@ static int cf_check amd_iommu_add_device(u8 devfn, struct pci_dev *pdev)
         AMD_IOMMU_WARN("%pd: unity mapping failed for %pp\n",
                        pdev->domain, &PCI_SBDF(pdev->seg, bdf));
 
-    if ( iommu_quarantine && pdev->arch.pseudo_domid == DOMID_INVALID )
-    {
-        pdev->arch.pseudo_domid = iommu_alloc_domid(iommu->domid_map);
-        if ( pdev->arch.pseudo_domid == DOMID_INVALID )
-            return -ENOSPC;
-        fresh_domid = true;
-    }
-
-    ret = amd_iommu_setup_domain_device(pdev->domain, ctx, iommu, devfn, pdev);
-    if ( ret && fresh_domid )
-    {
-        iommu_free_domid(pdev->arch.pseudo_domid, iommu->domid_map);
-        pdev->arch.pseudo_domid = DOMID_INVALID;
-    }
-
-    return ret;
+    return amd_iommu_setup_domain_device(pdev->domain, ctx, iommu, devfn, pdev);
 }
 
 static int cf_check amd_iommu_remove_device(u8 devfn, struct pci_dev *pdev)
@@ -700,9 +664,6 @@ static int cf_check amd_iommu_remove_device(u8 devfn, struct pci_dev *pdev)
 
     amd_iommu_quarantine_teardown(pdev);
 
-    iommu_free_domid(pdev->arch.pseudo_domid, iommu->domid_map);
-    pdev->arch.pseudo_domid = DOMID_INVALID;
-
     if ( amd_iommu_perdev_intremap &&
          ivrs_mappings[bdf].dte_requestor_id == bdf &&
          ivrs_mappings[bdf].intremap_table )
diff --git a/xen/drivers/passthrough/pci.c b/xen/drivers/passthrough/pci.c
index 777c6b1a7f..e1ca74b477 100644
--- a/xen/drivers/passthrough/pci.c
+++ b/xen/drivers/passthrough/pci.c
@@ -1370,12 +1370,7 @@ static int cf_check _dump_pci_devices(struct pci_seg *pseg, void *arg)
     list_for_each_entry ( pdev, &pseg->alldevs_list, alldevs_list )
     {
         printk("%pp - ", &pdev->sbdf);
-#ifdef CONFIG_X86
-        if ( pdev->domain == dom_io )
-            printk("DomIO:%x", pdev->arch.pseudo_domid);
-        else
-#endif
-            printk("%pd", pdev->domain);
+        printk("%pd", pdev->domain);
         printk(" - node %-3d", (pdev->node != NUMA_NO_NODE) ? pdev->node : -1);
         pdev_dump_msi(pdev);
         printk("\n");
diff --git a/xen/drivers/passthrough/vtd/iommu.c b/xen/drivers/passthrough/vtd/iommu.c
index f60f39ee1d..55562084fc 100644
--- a/xen/drivers/passthrough/vtd/iommu.c
+++ b/xen/drivers/passthrough/vtd/iommu.c
@@ -49,14 +49,6 @@
 #define CONTIG_MASK DMA_PTE_CONTIG_MASK
 #include <asm/pt-contig-markers.h>
 
-/* dom_io is used as a sentinel for quarantined devices */
-#define QUARANTINE_SKIP(d, pgd_maddr) ((d) == dom_io && !(pgd_maddr))
-#define DEVICE_DOMID(d, pdev) ((d) != dom_io ? (d)->domain_id \
-                                             : (pdev)->arch.pseudo_domid)
-#define DEVICE_PGTABLE(d, pdev) ((d) != dom_io \
-                                 ? iommu_default_context(d)->arch.vtd.pgd_maddr \
-                                 : (pdev)->arch.vtd.pgd_maddr)
-
 bool __read_mostly iommu_igfx = true;
 bool __read_mostly iommu_qinval = true;
 #ifndef iommu_snoop
@@ -1494,8 +1486,6 @@ int domain_context_mapping_one(
     int rc, ret;
     bool flush_dev_iotlb;
 
-    if ( QUARANTINE_SKIP(domain, pgd_maddr) )
-        return 0;
 
     ASSERT(pcidevs_locked());
     spin_lock(&iommu->lock);
@@ -1512,8 +1502,6 @@ int domain_context_mapping_one(
         domid = did_to_domain_id(iommu, prev_did);
         if ( domid < DOMID_FIRST_RESERVED )
             prev_dom = rcu_lock_domain_by_id(domid);
-        else if ( pdev ? domid == pdev->arch.pseudo_domid : domid > DOMID_MASK )
-            prev_dom = rcu_lock_domain(dom_io);
         if ( !prev_dom )
         {
             spin_unlock(&iommu->lock);
@@ -1645,8 +1633,8 @@ int domain_context_mapping_one(
             ret = domain_context_unmap_one(domain, iommu, bus, devfn);
         else
             ret = domain_context_mapping_one(prev_dom, ctx, iommu, bus, devfn, pdev,
-                                             DEVICE_DOMID(prev_dom, pdev),
-                                             DEVICE_PGTABLE(prev_dom, pdev),
+                                             prev_dom->domain_id,
+                                             iommu_default_context(prev_dom)->arch.vtd.pgd_maddr,
                                              (mode & MAP_WITH_RMRR) |
                                              MAP_ERROR_RECOVERY) < 0;
 
@@ -1668,8 +1656,8 @@ static int domain_context_mapping(struct domain *domain, struct iommu_context *c
 {
     const struct acpi_drhd_unit *drhd = acpi_find_matched_drhd_unit(pdev);
     const struct acpi_rmrr_unit *rmrr;
-    paddr_t pgd_maddr = DEVICE_PGTABLE(domain, pdev);
-    domid_t orig_domid = pdev->arch.pseudo_domid;
+    paddr_t pgd_maddr = ctx->arch.vtd.pgd_maddr;
+    domid_t did = domain->domain_id;
     int ret = 0;
     unsigned int i, mode = 0;
     uint16_t seg = pdev->seg, bdf;
@@ -1722,20 +1710,11 @@ static int domain_context_mapping(struct domain *domain, struct iommu_context *c
         if ( !drhd )
             return -ENODEV;
 
-        if ( iommu_quarantine && orig_domid == DOMID_INVALID )
-        {
-            pdev->arch.pseudo_domid =
-                iommu_alloc_domid(drhd->iommu->pseudo_domid_map);
-            if ( pdev->arch.pseudo_domid == DOMID_INVALID )
-                return -ENOSPC;
-        }
-
         if ( iommu_debug )
             printk(VTDPREFIX "%pd:PCIe: map %pp\n",
                    domain, &PCI_SBDF(seg, bus, devfn));
         ret = domain_context_mapping_one(domain, ctx, drhd->iommu, bus, devfn, pdev,
-                                         DEVICE_DOMID(domain, pdev), pgd_maddr,
-                                         mode);
+                                         did, pgd_maddr, mode);
         if ( ret > 0 )
             ret = 0;
         if ( !ret && devfn == pdev->devfn && ats_device(pdev, drhd) > 0 )
@@ -1747,21 +1726,12 @@ static int domain_context_mapping(struct domain *domain, struct iommu_context *c
         if ( !drhd )
             return -ENODEV;
 
-        if ( iommu_quarantine && orig_domid == DOMID_INVALID )
-        {
-            pdev->arch.pseudo_domid =
-                iommu_alloc_domid(drhd->iommu->pseudo_domid_map);
-            if ( pdev->arch.pseudo_domid == DOMID_INVALID )
-                return -ENOSPC;
-        }
-
         if ( iommu_debug )
             printk(VTDPREFIX "%pd:PCI: map %pp\n",
                    domain, &PCI_SBDF(seg, bus, devfn));
 
         ret = domain_context_mapping_one(domain, ctx, drhd->iommu, bus, devfn,
-                                         pdev, DEVICE_DOMID(domain, pdev),
-                                         pgd_maddr, mode);
+                                         pdev, did, pgd_maddr, mode);
         if ( ret < 0 )
             break;
         prev_present = ret;
@@ -1791,8 +1761,7 @@ static int domain_context_mapping(struct domain *domain, struct iommu_context *c
          */
         if ( ret >= 0 )
             ret = domain_context_mapping_one(domain, ctx, drhd->iommu, bus, devfn,
-                                             NULL, DEVICE_DOMID(domain, pdev),
-                                             pgd_maddr, mode);
+                                             NULL, did, pgd_maddr, mode);
 
         /*
          * Devices behind PCIe-to-PCI/PCIx bridge may generate different
@@ -1807,8 +1776,7 @@ static int domain_context_mapping(struct domain *domain, struct iommu_context *c
         if ( !ret && pdev_type(seg, bus, devfn) == DEV_TYPE_PCIe2PCI_BRIDGE &&
              (secbus != pdev->bus || pdev->devfn != 0) )
             ret = domain_context_mapping_one(domain, ctx, drhd->iommu, secbus, 0,
-                                             NULL, DEVICE_DOMID(domain, pdev),
-                                             pgd_maddr, mode);
+                                             NULL, did, pgd_maddr, mode);
 
         if ( ret )
         {
@@ -1830,13 +1798,6 @@ static int domain_context_mapping(struct domain *domain, struct iommu_context *c
     if ( !ret && devfn == pdev->devfn )
         pci_vtd_quirk(pdev);
 
-    if ( ret && drhd && orig_domid == DOMID_INVALID )
-    {
-        iommu_free_domid(pdev->arch.pseudo_domid,
-                         drhd->iommu->pseudo_domid_map);
-        pdev->arch.pseudo_domid = DOMID_INVALID;
-    }
-
     return ret;
 }
 
@@ -1994,10 +1955,6 @@ static const struct acpi_drhd_unit *domain_context_unmap(
         return ERR_PTR(-EINVAL);
     }
 
-    if ( !ret && pdev->devfn == devfn &&
-         !QUARANTINE_SKIP(domain, pdev->arch.vtd.pgd_maddr) )
-        check_cleanup_domid_map(domain, pdev, iommu);
-
     return drhd;
 }
 
@@ -2031,21 +1988,6 @@ static void cf_check iommu_domain_teardown(struct domain *d)
 static void quarantine_teardown(struct pci_dev *pdev,
                                 const struct acpi_drhd_unit *drhd)
 {
-    struct iommu_context *ctx = iommu_default_context(dom_io);
-
-    ASSERT(pcidevs_locked());
-
-    if ( !pdev->arch.vtd.pgd_maddr )
-        return;
-
-    ASSERT(page_list_empty(&ctx->arch.pgtables));
-    page_list_move(&ctx->arch.pgtables, &pdev->arch.pgtables_list);
-    while ( iommu_free_pgtables(dom_io, ctx) == -ERESTART )
-        /* nothing */;
-    pdev->arch.vtd.pgd_maddr = 0;
-
-    if ( drhd )
-        cleanup_domid_map(pdev->arch.pseudo_domid, drhd->iommu);
 }
 
 static int __must_check cf_check intel_iommu_map_page(
@@ -2386,13 +2328,6 @@ static int cf_check intel_iommu_remove_device(u8 devfn, struct pci_dev *pdev)
 
     quarantine_teardown(pdev, drhd);
 
-    if ( drhd )
-    {
-        iommu_free_domid(pdev->arch.pseudo_domid,
-                         drhd->iommu->pseudo_domid_map);
-        pdev->arch.pseudo_domid = DOMID_INVALID;
-    }
-
     return 0;
 }
 
@@ -2750,42 +2685,22 @@ static int cf_check reassign_device_ownership(
 {
     int ret;
 
-    if ( !QUARANTINE_SKIP(target, pdev->arch.vtd.pgd_maddr) )
-    {
-        struct iommu_context *target_ctx = iommu_default_context(target);
-
-        if ( !has_arch_pdevs(target) )
-            vmx_pi_hooks_assign(target);
+    if ( !has_arch_pdevs(target) )
+        vmx_pi_hooks_assign(target);
 
 #ifdef CONFIG_PV
-        /*
-         * Devices assigned to untrusted domains (here assumed to be any domU)
-         * can attempt to send arbitrary LAPIC/MSI messages. We are unprotected
-         * by the root complex unless interrupt remapping is enabled.
-         */
-        if ( !iommu_intremap && !is_hardware_domain(target) &&
-             !is_system_domain(target) )
-            untrusted_msi = true;
+    /*
+     * Devices assigned to untrusted domains (here assumed to be any domU)
+     * can attempt to send arbitrary LAPIC/MSI messages. We are unprotected
+     * by the root complex unless interrupt remapping is enabled.
+     */
+    if ( !iommu_intremap && !is_hardware_domain(target) &&
+         !is_system_domain(target) )
+        untrusted_msi = true;
 #endif
 
-        ret = domain_context_mapping(target, target_ctx, devfn, pdev);
-
-        if ( !ret && pdev->devfn == devfn &&
-             !QUARANTINE_SKIP(source, pdev->arch.vtd.pgd_maddr) )
-        {
-            const struct acpi_drhd_unit *drhd = acpi_find_matched_drhd_unit(pdev);
+    ret = domain_context_mapping(target, iommu_default_context(target), devfn, pdev);
 
-            if ( drhd )
-                check_cleanup_domid_map(source, pdev, drhd->iommu);
-        }
-    }
-    else
-    {
-        const struct acpi_drhd_unit *drhd;
-
-        drhd = domain_context_unmap(source, devfn, pdev);
-        ret = IS_ERR(drhd) ? PTR_ERR(drhd) : 0;
-    }
     if ( ret )
     {
         if ( !has_arch_pdevs(target) )
@@ -2884,9 +2799,6 @@ static int cf_check intel_iommu_assign_device(
         }
     }
 
-    if ( d == dom_io )
-        return reassign_device_ownership(s, d, devfn, pdev);
-
     /* Setup rmrr identity mapping */
     for_each_rmrr_device( rmrr, bdf, i )
     {
@@ -3096,135 +3008,10 @@ static void cf_check vtd_dump_page_tables(struct domain *d)
                               agaw_to_level(hd->arch.vtd.agaw), 0, 0);
 }
 
-static int fill_qpt(struct dma_pte *this, unsigned int level,
-                    struct page_info *pgs[6])
-{
-    struct domain_iommu *hd = dom_iommu(dom_io);
-    struct iommu_context *ctx = iommu_default_context(dom_io);
-    unsigned int i;
-    int rc = 0;
-
-    for ( i = 0; !rc && i < PTE_NUM; ++i )
-    {
-        struct dma_pte *pte = &this[i], *next;
-
-        if ( !dma_pte_present(*pte) )
-        {
-            if ( !pgs[level] )
-            {
-                /*
-                 * The pgtable allocator is fine for the leaf page, as well as
-                 * page table pages, and the resulting allocations are always
-                 * zeroed.
-                 */
-                pgs[level] = iommu_alloc_pgtable(hd, ctx, 0);
-                if ( !pgs[level] )
-                {
-                    rc = -ENOMEM;
-                    break;
-                }
-
-                if ( level )
-                {
-                    next = map_vtd_domain_page(page_to_maddr(pgs[level]));
-                    rc = fill_qpt(next, level - 1, pgs);
-                    unmap_vtd_domain_page(next);
-                }
-            }
-
-            dma_set_pte_addr(*pte, page_to_maddr(pgs[level]));
-            dma_set_pte_readable(*pte);
-            dma_set_pte_writable(*pte);
-        }
-        else if ( level && !dma_pte_superpage(*pte) )
-        {
-            next = map_vtd_domain_page(dma_pte_addr(*pte));
-            rc = fill_qpt(next, level - 1, pgs);
-            unmap_vtd_domain_page(next);
-        }
-    }
-
-    return rc;
-}
-
 static int cf_check intel_iommu_quarantine_init(struct pci_dev *pdev,
                                                 bool scratch_page)
 {
-    struct domain_iommu *hd = dom_iommu(dom_io);
-    struct iommu_context *ctx = iommu_default_context(dom_io);
-    struct page_info *pg;
-    unsigned int agaw = hd->arch.vtd.agaw;
-    unsigned int level = agaw_to_level(agaw);
-    const struct acpi_drhd_unit *drhd;
-    const struct acpi_rmrr_unit *rmrr;
-    unsigned int i, bdf;
-    bool rmrr_found = false;
-    int rc;
-
-    ASSERT(pcidevs_locked());
-    ASSERT(!ctx->arch.vtd.pgd_maddr);
-    ASSERT(page_list_empty(&ctx->arch.pgtables));
-
-    if ( pdev->arch.vtd.pgd_maddr )
-    {
-        clear_domain_page(pdev->arch.leaf_mfn);
-        return 0;
-    }
-
-    drhd = acpi_find_matched_drhd_unit(pdev);
-    if ( !drhd )
-        return -ENODEV;
-
-    pg = iommu_alloc_pgtable(hd, ctx, 0);
-    if ( !pg )
-        return -ENOMEM;
-
-    rc = context_set_domain_id(NULL, pdev->arch.pseudo_domid, drhd->iommu);
-
-    /* Transiently install the root into DomIO, for iommu_identity_mapping(). */
-    ctx->arch.vtd.pgd_maddr = page_to_maddr(pg);
-
-    for_each_rmrr_device ( rmrr, bdf, i )
-    {
-        if ( rc )
-            break;
-
-        if ( rmrr->segment == pdev->seg && bdf == pdev->sbdf.bdf )
-        {
-            rmrr_found = true;
-
-            rc = iommu_identity_mapping(dom_io, ctx, p2m_access_rw,
-                                        rmrr->base_address, rmrr->end_address,
-                                        0);
-            if ( rc )
-                printk(XENLOG_ERR VTDPREFIX
-                       "%pp: RMRR quarantine mapping failed\n",
-                       &pdev->sbdf);
-        }
-    }
-
-    iommu_identity_map_teardown(dom_io, ctx);
-    ctx->arch.vtd.pgd_maddr = 0;
-    pdev->arch.vtd.pgd_maddr = page_to_maddr(pg);
-
-    if ( !rc && scratch_page )
-    {
-        struct dma_pte *root;
-        struct page_info *pgs[6] = {};
-
-        root = map_vtd_domain_page(pdev->arch.vtd.pgd_maddr);
-        rc = fill_qpt(root, level - 1, pgs);
-        unmap_vtd_domain_page(root);
-
-        pdev->arch.leaf_mfn = page_to_mfn(pgs[0]);
-    }
-
-    page_list_move(&pdev->arch.pgtables_list, &ctx->arch.pgtables);
-
-    if ( rc || (!scratch_page && !rmrr_found) )
-        quarantine_teardown(pdev, drhd);
-
-    return rc;
+    return 0;
 }
 
 static const struct iommu_ops __initconst_cf_clobber vtd_ops = {
diff --git a/xen/drivers/passthrough/x86/iommu.c b/xen/drivers/passthrough/x86/iommu.c
index 4a3fe059cb..a444e5813e 100644
--- a/xen/drivers/passthrough/x86/iommu.c
+++ b/xen/drivers/passthrough/x86/iommu.c
@@ -549,7 +549,6 @@ void __hwdom_init arch_iommu_hwdom_init(struct domain *d)
 
 void arch_pci_init_pdev(struct pci_dev *pdev)
 {
-    pdev->arch.pseudo_domid = DOMID_INVALID;
 }
 
 unsigned long *__init iommu_init_domid(domid_t reserve)
-- 
2.47.2



Teddy Astie | Vates XCP-ng Developer

XCP-ng & Xen Orchestra - Vates solutions

web: https://vates.tech



* [XEN RFC PATCH v6 02/11] docs/designs: Add a design document for PV-IOMMU
  2025-02-17 10:18 [XEN RFC PATCH v6 00/11] IOMMU subsystem redesign and PV-IOMMU interface Teddy Astie
                   ` (2 preceding siblings ...)
  2025-02-17 10:18 ` [XEN RFC PATCH v6 06/11] vtd: Remove MAP_ERROR_RECOVERY code path in domain_context_mapping_one Teddy Astie
@ 2025-02-17 10:18 ` Teddy Astie
  2025-02-19 12:02   ` Frediano Ziglio
  2025-02-17 10:18 ` [XEN RFC PATCH v6 09/11] x86/iommu: Introduce IOMMU arena Teddy Astie
                   ` (7 subsequent siblings)
  11 siblings, 1 reply; 17+ messages in thread
From: Teddy Astie @ 2025-02-17 10:18 UTC (permalink / raw)
  To: xen-devel
  Cc: Teddy Astie, Andrew Cooper, Anthony PERARD, Michal Orzel,
	Jan Beulich, Julien Grall, Roger Pau Monné,
	Stefano Stabellini

Some operating systems want to use the IOMMU to implement various features
(e.g. VFIO) or DMA protection.
This patch introduces a proposal for IOMMU paravirtualization for Dom0.

Signed-off-by: Teddy Astie <teddy.astie@vates.tech>
---
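The context rules described in the design document can be sketched as a
small user-space model. This is only an illustration under stated
assumptions: every name (model_*), every limit and every errno value is
hypothetical and is not the hypercall ABI proposed by this series. It shows
that the default context cannot be freed or remapped, that freeing a context
with attached devices fails unless reattach-to-default is requested, and
that mapping an already mapped dfn is refused.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>

/* All names and limits below are hypothetical, chosen for this model only. */
#define MODEL_MAX_CTX 4            /* context 0 is the default context */
#define MODEL_MAX_DFN 16           /* tiny DMA address space */
#define MODEL_MAX_DEV 4
#define MODEL_NO_GFN  UINT64_MAX   /* marks an unmapped dfn */

struct model_ctx {
    bool allocated;
    uint64_t gfn[MODEL_MAX_DFN];   /* dfn -> gfn, MODEL_NO_GFN if unmapped */
};

struct model_domain {
    struct model_ctx ctx[MODEL_MAX_CTX];
    unsigned int dev_ctx[MODEL_MAX_DEV]; /* context each device sits in */
};

static void model_clear_ctx(struct model_ctx *c)
{
    for ( unsigned int i = 0; i < MODEL_MAX_DFN; i++ )
        c->gfn[i] = MODEL_NO_GFN;
    c->allocated = false;
}

void model_init(struct model_domain *d)
{
    for ( unsigned int i = 0; i < MODEL_MAX_CTX; i++ )
        model_clear_ctx(&d->ctx[i]);
    for ( unsigned int i = 0; i < MODEL_MAX_DEV; i++ )
        d->dev_ctx[i] = 0;         /* every device starts in the default context */
    d->ctx[0].allocated = true;    /* the default context always exists */
}

bool model_ctx_valid(const struct model_domain *d, unsigned int no)
{
    return no < MODEL_MAX_CTX && d->ctx[no].allocated;
}

/* Alloc context: return the new context number, or fail at the limit. */
int model_alloc_ctx(struct model_domain *d)
{
    for ( unsigned int i = 1; i < MODEL_MAX_CTX; i++ )
        if ( !d->ctx[i].allocated )
        {
            d->ctx[i].allocated = true;
            return (int)i;
        }
    return -ENOSPC;
}

/* Reattach device: the target context must be valid and allocated. */
int model_reattach(struct model_domain *d, unsigned int dev, unsigned int no)
{
    if ( dev >= MODEL_MAX_DEV )
        return -ENODEV;
    if ( !model_ctx_valid(d, no) )
        return -ENOENT;
    d->dev_ctx[dev] = no;
    return 0;
}

/* Free context: never the default one; busy unless reattach is requested. */
int model_free_ctx(struct model_domain *d, unsigned int no, bool reattach)
{
    if ( no == 0 )
        return -EPERM;
    if ( !model_ctx_valid(d, no) )
        return -ENOENT;
    for ( unsigned int i = 0; i < MODEL_MAX_DEV; i++ )
        if ( d->dev_ctx[i] == no && !reattach )
            return -EBUSY;
    for ( unsigned int i = 0; i < MODEL_MAX_DEV; i++ )
        if ( d->dev_ctx[i] == no )
            d->dev_ctx[i] = 0;     /* move the device back to the default context */
    model_clear_ctx(&d->ctx[no]);
    return 0;
}

/* Map page: refuse the default context and any dfn that is already mapped. */
int model_map(struct model_domain *d, unsigned int no, uint64_t dfn,
              uint64_t gfn)
{
    if ( no == 0 )
        return -EPERM;
    if ( !model_ctx_valid(d, no) )
        return -ENOENT;
    if ( dfn >= MODEL_MAX_DFN )
        return -EINVAL;
    if ( d->ctx[no].gfn[dfn] != MODEL_NO_GFN )
        return -EEXIST;
    d->ctx[no].gfn[dfn] = gfn;
    return 0;
}
```

In this model a caller would run model_init() once, obtain a context number
from model_alloc_ctx(), move a device with model_reattach() and then
populate the context with model_map(). Unmap, lookup and batching are left
out to keep the sketch short.
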
 docs/designs/pv-iommu.md | 116 +++++++++++++++++++++++++++++++++++++++
 1 file changed, 116 insertions(+)
 create mode 100644 docs/designs/pv-iommu.md

diff --git a/docs/designs/pv-iommu.md b/docs/designs/pv-iommu.md
new file mode 100644
index 0000000000..7df9fa0b94
--- /dev/null
+++ b/docs/designs/pv-iommu.md
@@ -0,0 +1,116 @@
+# IOMMU paravirtualization for Dom0
+
+Status: Experimental
+
+# Background
+
+By default, Xen only uses the IOMMU for itself, either to make device address
+space coherent with guest address space (x86 HVM/PVH) or to prevent devices
+from doing DMA outside their expected memory regions, including the hypervisor
+(x86 PV).
+
+A limitation is that guests (especially privileged ones) may want to use
+IOMMU hardware in order to implement features such as DMA protection and
+VFIO [1], but IOMMU functionality is currently not available outside of the
+hypervisor.
+
+[1] VFIO - "Virtual Function I/O" - https://www.kernel.org/doc/html/latest/driver-api/vfio.html
+
+# Design
+
+The operating system may want to have access to various IOMMU features such as
+context management and DMA remapping. We can create a new hypercall that allows
+the guest to have access to a new paravirtualized IOMMU interface.
+
+This feature is only meant to be available to Dom0. DomUs have some emulated
+devices that are not hardware and can't be managed on the Xen side, so we
+can't rely on the hardware IOMMU to enforce DMA remapping for them.
+
+This interface is exposed under the `iommu_op` hypercall.
+
+In addition, Xen domains are modified to allow several IOMMU contexts to
+exist, including a default one that implements the default behavior (e.g.
+hardware-assisted paging) and can't be modified by the guest. DomUs cannot
+have contexts, and therefore act as if they only have the default context.
+
+Each IOMMU context within a Xen domain is identified using a domain-specific
+context number that is used in the Xen IOMMU subsystem and the hypercall
+interface.
+
+The number of IOMMU contexts a domain can use is specified by either the
+toolstack or the domain itself.
+
+# IOMMU operations
+
+## Initialize PV-IOMMU
+
+Initialize PV-IOMMU for the domain.
+It can only be called once.
+
+## Alloc context
+
+Create a new IOMMU context for the guest and return the context number to the
+guest.
+Fail if the IOMMU context limit of the guest is reached.
+
+A flag can be specified to create an identity mapping.
+
+## Free context
+
+Destroy an IOMMU context created previously.
+It is not possible to free the default context.
+
+Reattach context devices to the default context if specified by the guest.
+
+Fail if there is a device in the context and the reattach-to-default flag is
+not specified.
+
+## Reattach device
+
+Reattach a device to another IOMMU context (including the default one).
+The target IOMMU context number must be valid and the context allocated.
+
+The guest needs to specify the PCI SBDF of a device it has access to.
+
+## Map/unmap page
+
+Map/unmap a page on a context.
+The guest needs to specify a gfn and target dfn to map.
+
+Refuse to create the mapping if one already exists for the same dfn.
+
+## Lookup page
+
+Get the gfn mapped by a specific dfn.
+
+## Remote command
+
+Perform a PV-IOMMU operation on behalf of another domain.
+Especially useful for implementing IOMMU emulation (e.g. using QEMU)
+or initializing PV-IOMMU with enforced limits.
+
+# Implementation considerations
+
+## Hypercall batching
+
+In order to prevent unneeded hypercalls and IOMMU flushing, it is advisable to
+be able to batch some critical IOMMU operations (e.g map/unmap multiple pages).
+
+## Hardware without IOMMU support
+
+The operating system needs to be aware of the PV-IOMMU capability, and whether
+it is able to create contexts. However, some operating systems may critically
+fail if they are unable to create a new IOMMU context, which is expected to
+happen if no IOMMU hardware is available.
+
+The hypercall interface needs a way to advertise the ability to create
+and manage IOMMU contexts, including the number of contexts the guest is able
+to use. Using this information, Dom0 may decide whether or not to use
+the PV-IOMMU interface.
+
+## Page pool for contexts
+
+In order to prevent a buggy Dom0 from unexpectedly starving the hypervisor
+of memory, we can preallocate the pages the contexts will use and make
+map/unmap use these pages instead of allocating them dynamically.
+
-- 
2.47.2
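
The "Page pool for contexts" consideration in the design document can be
illustrated with a fixed, preallocated pool. This is a minimal sketch under
assumptions: the model_pool type, its size and its helpers are hypothetical
and are not the allocator used by this series. The point is that once the
budget is spent, allocation fails instead of drawing on hypervisor memory.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

#define MODEL_POOL_PAGES 3     /* hypothetical budget granted at init */
#define MODEL_PAGE_SIZE  4096

struct model_pool {
    unsigned char pages[MODEL_POOL_PAGES][MODEL_PAGE_SIZE];
    bool in_use[MODEL_POOL_PAGES];
};

/* Hand out a zeroed page from the pool, or NULL once the budget is spent. */
void *model_pool_alloc(struct model_pool *p)
{
    for ( size_t i = 0; i < MODEL_POOL_PAGES; i++ )
        if ( !p->in_use[i] )
        {
            p->in_use[i] = true;
            memset(p->pages[i], 0, MODEL_PAGE_SIZE);
            return p->pages[i];
        }
    return NULL;
}

/* Return a page to the pool; reject pointers the pool does not own. */
bool model_pool_free(struct model_pool *p, void *page)
{
    for ( size_t i = 0; i < MODEL_POOL_PAGES; i++ )
        if ( page == (void *)p->pages[i] && p->in_use[i] )
        {
            p->in_use[i] = false;
            return true;
        }
    return false;
}
```

A map operation built on such a pool would report an out-of-memory error to
the guest when model_pool_alloc() returns NULL, so a misbehaving Dom0 can
only exhaust its own budget.
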




* [XEN RFC PATCH v6 03/11] x86/domain: Defer domain iommu initialization.
  2025-02-17 10:18 [XEN RFC PATCH v6 00/11] IOMMU subsystem redesign and PV-IOMMU interface Teddy Astie
                   ` (5 preceding siblings ...)
  2025-02-17 10:18 ` [XEN RFC PATCH v6 07/11] iommu: Simplify hardware did management Teddy Astie
@ 2025-02-17 10:18 ` Teddy Astie
  2025-02-17 10:18 ` [XEN RFC PATCH v6 11/11] iommu: Introduce no-dma feature Teddy Astie
                   ` (4 subsequent siblings)
  11 siblings, 0 replies; 17+ messages in thread
From: Teddy Astie @ 2025-02-17 10:18 UTC (permalink / raw)
  To: xen-devel; +Cc: Teddy Astie, Jan Beulich, Andrew Cooper, Roger Pau Monné

For the IOMMU redesign, the IOMMU context pagetable is defined once during
initialization. When reusing the P2M pagetable, we want to ensure that this
pagetable is properly initialized, so call iommu_domain_init() after the
HVM/PV domain initialization.

Signed-off-by: Teddy Astie <teddy.astie@vates.tech>
---
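The ordering dependency can be sketched with a toy model. All names here
(model_*) and the placeholder addresses are hypothetical; in particular, the
model returns an error when the P2M root is missing, which is an assumption
made for illustration and not a claim about how Xen behaves with the old
ordering.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>

struct model_domain {
    bool share_p2m;        /* IOMMU context reuses the P2M pagetable */
    uint64_t p2m_root;     /* 0 until paging has been initialized */
    uint64_t iommu_root;   /* captured once, at IOMMU initialization */
};

/* Stand-in for the HVM/PV paging setup: the P2M root becomes valid here. */
void model_paging_init(struct model_domain *d)
{
    d->p2m_root = 0x1000;  /* placeholder address */
}

/* Stand-in for iommu_domain_init(): the context root is fixed at this point. */
int model_iommu_domain_init(struct model_domain *d)
{
    if ( d->share_p2m )
    {
        if ( !d->p2m_root )
            return -EINVAL;        /* nothing valid to share yet */
        d->iommu_root = d->p2m_root;
    }
    else
        d->iommu_root = 0x2000;    /* placeholder for a dedicated root */

    return 0;
}

/* Create a domain with either the old (IOMMU first) or the new ordering. */
int model_domain_create(struct model_domain *d, bool iommu_first)
{
    int rc;

    if ( iommu_first )
    {
        rc = model_iommu_domain_init(d);
        if ( rc )
            return rc;
        model_paging_init(d);
    }
    else
    {
        model_paging_init(d);
        rc = model_iommu_domain_init(d);
    }

    return rc;
}
```

With share_p2m set, only the deferred ordering lets the IOMMU context pick
up an initialized root in this model.
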
 xen/arch/x86/domain.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/xen/arch/x86/domain.c b/xen/arch/x86/domain.c
index 78a13e6812..48bf9625e2 100644
--- a/xen/arch/x86/domain.c
+++ b/xen/arch/x86/domain.c
@@ -858,9 +858,6 @@ int arch_domain_create(struct domain *d,
     if ( (rc = init_domain_irq_mapping(d)) != 0 )
         goto fail;
 
-    if ( (rc = iommu_domain_init(d, config->iommu_opts)) != 0 )
-        goto fail;
-
     psr_domain_init(d);
 
     if ( is_hvm_domain(d) )
@@ -879,6 +876,9 @@ int arch_domain_create(struct domain *d,
     else
         ASSERT_UNREACHABLE(); /* Not HVM and not PV? */
 
+    if ( (rc = iommu_domain_init(d, config->iommu_opts)) != 0 )
+        goto fail;
+
     if ( (rc = tsc_set_info(d, XEN_CPUID_TSC_MODE_DEFAULT, 0, 0, 0)) != 0 )
     {
         ASSERT_UNREACHABLE();
-- 
2.47.2




* [XEN RFC PATCH v6 07/11] iommu: Simplify hardware did management
  2025-02-17 10:18 [XEN RFC PATCH v6 00/11] IOMMU subsystem redesign and PV-IOMMU interface Teddy Astie
                   ` (4 preceding siblings ...)
  2025-02-17 10:18 ` [XEN RFC PATCH v6 09/11] x86/iommu: Introduce IOMMU arena Teddy Astie
@ 2025-02-17 10:18 ` Teddy Astie
  2025-02-19 12:17   ` Frediano Ziglio
  2025-02-19 12:17   ` Frediano Ziglio
  2025-02-17 10:18 ` [XEN RFC PATCH v6 03/11] x86/domain: Defer domain iommu initialization Teddy Astie
                   ` (5 subsequent siblings)
  11 siblings, 2 replies; 17+ messages in thread
From: Teddy Astie @ 2025-02-17 10:18 UTC (permalink / raw)
  To: xen-devel; +Cc: Teddy Astie, Jan Beulich, Andrew Cooper, Roger Pau Monné

Simplify the hardware DID management by allocating a DID
per IOMMU context (currently one per Xen domain) instead of trying
to reuse the Xen domain ID as the DID (which may not be possible
depending on hardware constraints such as DID limits).

Signed-off-by: Teddy Astie <teddy.astie@vates.tech>
---
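The bookkeeping implied by the new didmap and iommu_dev_cnt fields can be
sketched as follows. This is a simplified model under assumptions: the
model_* names, the tiny DID limit and the first-fit allocation are
hypothetical, and the model ignores any DIDs that real hardware reserves.
It shows a context taking a DID on an IOMMU when its first device attaches
there, and releasing it when the last device on that IOMMU detaches.

```c
#include <assert.h>
#include <stdint.h>

#define MODEL_NR_IOMMUS   2
#define MODEL_NR_DIDS     4            /* hypothetical per-IOMMU DID limit */
#define MODEL_DID_INVALID UINT16_MAX

struct model_iommu {
    unsigned int index;                /* position in the per-context arrays */
    uint8_t did_used[MODEL_NR_DIDS];   /* which DIDs are taken on this IOMMU */
};

struct model_ctx {
    uint16_t didmap[MODEL_NR_IOMMUS];  /* valid only if iommu_dev_cnt > 0 */
    unsigned long iommu_dev_cnt[MODEL_NR_IOMMUS];
};

/*
 * Attach one device of the context to an IOMMU. The first device allocates
 * a DID on that IOMMU; later devices reuse it. Returns the DID in use, or
 * MODEL_DID_INVALID if the IOMMU has no free DID left.
 */
uint16_t model_ctx_attach(struct model_ctx *ctx, struct model_iommu *iommu)
{
    unsigned int i = iommu->index;

    if ( ctx->iommu_dev_cnt[i] == 0 )
    {
        unsigned int did;

        for ( did = 0; did < MODEL_NR_DIDS; did++ )
            if ( !iommu->did_used[did] )
                break;
        if ( did == MODEL_NR_DIDS )
            return MODEL_DID_INVALID;

        iommu->did_used[did] = 1;
        ctx->didmap[i] = (uint16_t)did;
    }

    ctx->iommu_dev_cnt[i]++;
    return ctx->didmap[i];
}

/* Detach one device; the DID is released when the last device leaves. */
void model_ctx_detach(struct model_ctx *ctx, struct model_iommu *iommu)
{
    unsigned int i = iommu->index;

    assert(ctx->iommu_dev_cnt[i] > 0);
    if ( --ctx->iommu_dev_cnt[i] == 0 )
    {
        iommu->did_used[ctx->didmap[i]] = 0;
        ctx->didmap[i] = MODEL_DID_INVALID;
    }
}
```

Because the DID is looked up per IOMMU, two contexts can hold the same DID
value on different IOMMUs, and a context's DID no longer has to match any
Xen domain ID.
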
 xen/arch/x86/include/asm/iommu.h         |   5 +-
 xen/drivers/passthrough/amd/iommu.h      |   3 +
 xen/drivers/passthrough/amd/iommu_cmd.c  |   4 +-
 xen/drivers/passthrough/amd/iommu_init.c |   3 +-
 xen/drivers/passthrough/vtd/extern.h     |   2 -
 xen/drivers/passthrough/vtd/iommu.c      | 335 +++++------------------
 xen/drivers/passthrough/vtd/iommu.h      |   2 -
 xen/drivers/passthrough/vtd/qinval.c     |   2 +-
 xen/drivers/passthrough/x86/iommu.c      |  27 +-
 9 files changed, 89 insertions(+), 294 deletions(-)

diff --git a/xen/arch/x86/include/asm/iommu.h b/xen/arch/x86/include/asm/iommu.h
index 94513ba9dc..d20c3cda59 100644
--- a/xen/arch/x86/include/asm/iommu.h
+++ b/xen/arch/x86/include/asm/iommu.h
@@ -45,12 +45,15 @@ struct arch_iommu_context
         /* Intel VT-d */
         struct {
             uint64_t pgd_maddr; /* io page directory machine address */
-            unsigned long *iommu_bitmap; /* bitmap of iommu(s) that the context uses */
+            domid_t *didmap; /* per-iommu DID (valid only if related iommu_dev_cnt > 0) */
+            unsigned long *iommu_dev_cnt; /* counter of devices per iommu */
         } vtd;
         /* AMD IOMMU */
         struct {
             unsigned int paging_mode;
             struct page_info *root_table;
+            domid_t *didmap; /* per-iommu DID (valid only if related iommu_dev_cnt > 0) */
+            unsigned long *iommu_dev_cnt; /* counter of devices per iommu */
         } amd;
     };
 };
diff --git a/xen/drivers/passthrough/amd/iommu.h b/xen/drivers/passthrough/amd/iommu.h
index 6095bc6a21..dbe427ed27 100644
--- a/xen/drivers/passthrough/amd/iommu.h
+++ b/xen/drivers/passthrough/amd/iommu.h
@@ -35,6 +35,7 @@
 
 #define iommu_found()           (!list_empty(&amd_iommu_head))
 
+extern unsigned int nr_amd_iommus;
 extern struct list_head amd_iommu_head;
 
 typedef struct event_entry
@@ -106,6 +107,8 @@ struct amd_iommu {
 
     int enabled;
 
+    unsigned int index;
+
     struct list_head ats_devices;
 };
 
diff --git a/xen/drivers/passthrough/amd/iommu_cmd.c b/xen/drivers/passthrough/amd/iommu_cmd.c
index 83c525b84f..e1a252db93 100644
--- a/xen/drivers/passthrough/amd/iommu_cmd.c
+++ b/xen/drivers/passthrough/amd/iommu_cmd.c
@@ -331,11 +331,13 @@ static void _amd_iommu_flush_pages(struct domain *d,
                                    daddr_t daddr, unsigned int order)
 {
     struct amd_iommu *iommu;
-    unsigned int dom_id = d->domain_id;
+    struct iommu_context *ctx = iommu_default_context(d);
 
     /* send INVALIDATE_IOMMU_PAGES command */
     for_each_amd_iommu ( iommu )
     {
+        domid_t dom_id = ctx->arch.amd.didmap[iommu->index];
+
         invalidate_iommu_pages(iommu, daddr, dom_id, order);
         flush_command_buffer(iommu, 0);
     }
diff --git a/xen/drivers/passthrough/amd/iommu_init.c b/xen/drivers/passthrough/amd/iommu_init.c
index 41e241ccc8..333d5d5e39 100644
--- a/xen/drivers/passthrough/amd/iommu_init.c
+++ b/xen/drivers/passthrough/amd/iommu_init.c
@@ -23,7 +23,7 @@
 
 #include "iommu.h"
 
-static int __initdata nr_amd_iommus;
+unsigned int nr_amd_iommus = 0;
 static bool __initdata pci_init;
 
 static struct tasklet amd_iommu_irq_tasklet;
@@ -919,6 +919,7 @@ static void enable_iommu(struct amd_iommu *iommu)
     set_iommu_translation_control(iommu, IOMMU_CONTROL_ENABLED);
 
     iommu->enabled = 1;
+    iommu->index = nr_amd_iommus;
 
     spin_unlock_irqrestore(&iommu->lock, flags);
 
diff --git a/xen/drivers/passthrough/vtd/extern.h b/xen/drivers/passthrough/vtd/extern.h
index 3dcb77c711..82db8f9435 100644
--- a/xen/drivers/passthrough/vtd/extern.h
+++ b/xen/drivers/passthrough/vtd/extern.h
@@ -45,8 +45,6 @@ void disable_intremap(struct vtd_iommu *iommu);
 int iommu_alloc(struct acpi_drhd_unit *drhd);
 void iommu_free(struct acpi_drhd_unit *drhd);
 
-domid_t did_to_domain_id(const struct vtd_iommu *iommu, unsigned int did);
-
 int iommu_flush_iec_global(struct vtd_iommu *iommu);
 int iommu_flush_iec_index(struct vtd_iommu *iommu, u8 im, u16 iidx);
 void clear_fault_bits(struct vtd_iommu *iommu);
diff --git a/xen/drivers/passthrough/vtd/iommu.c b/xen/drivers/passthrough/vtd/iommu.c
index 852994cf97..34b2a287f7 100644
--- a/xen/drivers/passthrough/vtd/iommu.c
+++ b/xen/drivers/passthrough/vtd/iommu.c
@@ -63,50 +63,6 @@ static struct tasklet vtd_fault_tasklet;
 static int cf_check setup_hwdom_device(u8 devfn, struct pci_dev *);
 static void setup_hwdom_rmrr(struct domain *d);
 
-static bool domid_mapping(const struct vtd_iommu *iommu)
-{
-    return (const void *)iommu->domid_bitmap != (const void *)iommu->domid_map;
-}
-
-static domid_t convert_domid(const struct vtd_iommu *iommu, domid_t domid)
-{
-    /*
-     * While we need to avoid DID 0 for caching-mode IOMMUs, maintain
-     * the property of the transformation being the same in either
-     * direction. By clipping to 16 bits we ensure that the resulting
-     * DID will fit in the respective context entry field.
-     */
-    BUILD_BUG_ON(DOMID_MASK >= 0xffff);
-
-    return !cap_caching_mode(iommu->cap) ? domid : ~domid;
-}
-
-static int get_iommu_did(domid_t domid, const struct vtd_iommu *iommu,
-                         bool warn)
-{
-    unsigned int nr_dom, i;
-
-    if ( !domid_mapping(iommu) )
-        return convert_domid(iommu, domid);
-
-    nr_dom = cap_ndoms(iommu->cap);
-    i = find_first_bit(iommu->domid_bitmap, nr_dom);
-    while ( i < nr_dom )
-    {
-        if ( iommu->domid_map[i] == domid )
-            return i;
-
-        i = find_next_bit(iommu->domid_bitmap, nr_dom, i + 1);
-    }
-
-    if ( warn )
-        dprintk(XENLOG_ERR VTDPREFIX,
-                "No valid iommu %u domid for Dom%d\n",
-                iommu->index, domid);
-
-    return -1;
-}
-
 #define DID_FIELD_WIDTH 16
 #define DID_HIGH_OFFSET 8
 
@@ -117,127 +73,17 @@ static int get_iommu_did(domid_t domid, const struct vtd_iommu *iommu,
 static int context_set_domain_id(struct context_entry *context,
                                  domid_t domid, struct vtd_iommu *iommu)
 {
-    unsigned int i;
-
     ASSERT(pcidevs_locked());
 
-    if ( domid_mapping(iommu) )
-    {
-        unsigned int nr_dom = cap_ndoms(iommu->cap);
-
-        i = find_first_bit(iommu->domid_bitmap, nr_dom);
-        while ( i < nr_dom && iommu->domid_map[i] != domid )
-            i = find_next_bit(iommu->domid_bitmap, nr_dom, i + 1);
-
-        if ( i >= nr_dom )
-        {
-            i = find_first_zero_bit(iommu->domid_bitmap, nr_dom);
-            if ( i >= nr_dom )
-            {
-                dprintk(XENLOG_ERR VTDPREFIX, "IOMMU: no free domain id\n");
-                return -EBUSY;
-            }
-            iommu->domid_map[i] = domid;
-            set_bit(i, iommu->domid_bitmap);
-        }
-    }
-    else
-        i = convert_domid(iommu, domid);
-
     if ( context )
     {
         context->hi &= ~(((1 << DID_FIELD_WIDTH) - 1) << DID_HIGH_OFFSET);
-        context->hi |= (i & ((1 << DID_FIELD_WIDTH) - 1)) << DID_HIGH_OFFSET;
+        context->hi |= (domid & ((1 << DID_FIELD_WIDTH) - 1)) << DID_HIGH_OFFSET;
     }
 
     return 0;
 }
 
-static void cleanup_domid_map(domid_t domid, struct vtd_iommu *iommu)
-{
-    int iommu_domid;
-
-    if ( !domid_mapping(iommu) )
-        return;
-
-    iommu_domid = get_iommu_did(domid, iommu, false);
-
-    if ( iommu_domid >= 0 )
-    {
-        /*
-         * Update domid_map[] /before/ domid_bitmap[] to avoid a race with
-         * context_set_domain_id(), setting the slot to DOMID_INVALID for
-         * did_to_domain_id() to return a suitable value while the bit is
-         * still set.
-         */
-        iommu->domid_map[iommu_domid] = DOMID_INVALID;
-        clear_bit(iommu_domid, iommu->domid_bitmap);
-    }
-}
-
-static bool any_pdev_behind_iommu(const struct domain *d,
-                                  const struct pci_dev *exclude,
-                                  const struct vtd_iommu *iommu)
-{
-    const struct pci_dev *pdev;
-
-    for_each_pdev ( d, pdev )
-    {
-        const struct acpi_drhd_unit *drhd;
-
-        if ( pdev == exclude )
-            continue;
-
-        drhd = acpi_find_matched_drhd_unit(pdev);
-        if ( drhd && drhd->iommu == iommu )
-            return true;
-    }
-
-    return false;
-}
-
-/*
- * If no other devices under the same iommu owned by this domain,
- * clear iommu in iommu_bitmap and clear domain_id in domid_bitmap.
- */
-static void check_cleanup_domid_map(const struct domain *d,
-                                    const struct pci_dev *exclude,
-                                    struct vtd_iommu *iommu)
-{
-    bool found;
-
-    if ( d == dom_io )
-        return;
-
-    found = any_pdev_behind_iommu(d, exclude, iommu);
-    /*
-     * Hidden devices are associated with DomXEN but usable by the hardware
-     * domain. Hence they need considering here as well.
-     */
-    if ( !found && is_hardware_domain(d) )
-        found = any_pdev_behind_iommu(dom_xen, exclude, iommu);
-
-    if ( !found )
-    {
-        clear_bit(iommu->index, iommu_default_context(d)->arch.vtd.iommu_bitmap);
-        cleanup_domid_map(d->domain_id, iommu);
-    }
-}
-
-domid_t did_to_domain_id(const struct vtd_iommu *iommu, unsigned int did)
-{
-    if ( did >= cap_ndoms(iommu->cap) )
-        return DOMID_INVALID;
-
-    if ( !domid_mapping(iommu) )
-        return convert_domid(iommu, did);
-
-    if ( !test_bit(did, iommu->domid_bitmap) )
-        return DOMID_INVALID;
-
-    return iommu->domid_map[did];
-}
-
 /* Allocate page table, return its machine address */
 uint64_t alloc_pgtable_maddr(unsigned long npages, nodeid_t node)
 {
@@ -754,13 +600,11 @@ static int __must_check cf_check iommu_flush_iotlb(struct domain *d, dfn_t dfn,
 
         iommu = drhd->iommu;
 
-        if ( !test_bit(iommu->index, ctx->arch.vtd.iommu_bitmap) )
+        if ( !ctx->arch.vtd.iommu_dev_cnt[iommu->index] )
             continue;
 
         flush_dev_iotlb = !!find_ats_dev_drhd(iommu);
-        iommu_domid = get_iommu_did(d->domain_id, iommu, !d->is_dying);
-        if ( iommu_domid == -1 )
-            continue;
+        iommu_domid = ctx->arch.vtd.didmap[iommu->index];
 
         if ( !page_count || (page_count & (page_count - 1)) ||
              dfn_eq(dfn, INVALID_DFN) || !IS_ALIGNED(dfn_x(dfn), page_count) )
@@ -1257,7 +1101,6 @@ int __init iommu_alloc(struct acpi_drhd_unit *drhd)
 {
     struct vtd_iommu *iommu;
     unsigned int sagaw, agaw = 0, nr_dom;
-    domid_t reserved_domid = DOMID_INVALID;
     int rc;
 
     iommu = xzalloc(struct vtd_iommu);
@@ -1346,43 +1189,16 @@ int __init iommu_alloc(struct acpi_drhd_unit *drhd)
     if ( !ecap_coherent(iommu->ecap) )
         iommu_non_coherent = true;
 
-    if ( nr_dom <= DOMID_MASK * 2 + cap_caching_mode(iommu->cap) )
-    {
-        /* Allocate domain id (bit) maps. */
-        iommu->domid_bitmap = xzalloc_array(unsigned long,
-                                            BITS_TO_LONGS(nr_dom));
-        iommu->domid_map = xzalloc_array(domid_t, nr_dom);
-        rc = -ENOMEM;
-        if ( !iommu->domid_bitmap || !iommu->domid_map )
-            goto free;
-
-        /*
-         * If Caching mode is set, then invalid translations are tagged
-         * with domain id 0. Hence reserve bit/slot 0.
-         */
-        if ( cap_caching_mode(iommu->cap) )
-        {
-            iommu->domid_map[0] = DOMID_INVALID;
-            __set_bit(0, iommu->domid_bitmap);
-        }
-    }
-    else
-    {
-        /* Don't leave dangling NULL pointers. */
-        iommu->domid_bitmap = ZERO_BLOCK_PTR;
-        iommu->domid_map = ZERO_BLOCK_PTR;
-
-        /*
-         * If Caching mode is set, then invalid translations are tagged
-         * with domain id 0. Hence reserve the ID taking up bit/slot 0.
-         */
-        reserved_domid = convert_domid(iommu, 0) ?: DOMID_INVALID;
-    }
+    /* Allocate domain id (bit) maps. */
+    iommu->domid_bitmap = xzalloc_array(unsigned long,
+                                        BITS_TO_LONGS(nr_dom));
 
-    iommu->pseudo_domid_map = iommu_init_domid(reserved_domid);
-    rc = -ENOMEM;
-    if ( !iommu->pseudo_domid_map )
-        goto free;
+    /*
+     * If Caching mode is set, then invalid translations are tagged
+     * with domain id 0. Hence reserve bit/slot 0.
+     */
+    if ( cap_caching_mode(iommu->cap) )
+        __set_bit(0, iommu->domid_bitmap);
 
     return 0;
 
@@ -1410,8 +1226,6 @@ void __init iommu_free(struct acpi_drhd_unit *drhd)
         iounmap(iommu->reg);
 
     xfree(iommu->domid_bitmap);
-    xfree(iommu->domid_map);
-    xfree(iommu->pseudo_domid_map);
 
     if ( iommu->msi.irq >= 0 )
         destroy_irq(iommu->msi.irq);
@@ -1425,19 +1239,39 @@ void __init iommu_free(struct acpi_drhd_unit *drhd)
         agaw = 64;                              \
     agaw; })
 
-static int cf_check intel_iommu_domain_init(struct domain *d)
+static int cf_check intel_iommu_context_init(struct domain *d, struct iommu_context *ctx)
 {
-    struct domain_iommu *hd = dom_iommu(d);
-    struct iommu_context *ctx = iommu_default_context(d);
+    struct acpi_drhd_unit *drhd;
 
-    ctx->arch.vtd.iommu_bitmap = xzalloc_array(unsigned long,
-                                              BITS_TO_LONGS(nr_iommus));
-    if ( !ctx->arch.vtd.iommu_bitmap )
+    ctx->arch.vtd.didmap = xzalloc_array(domid_t, nr_iommus);
+    if ( !ctx->arch.vtd.didmap )
         return -ENOMEM;
 
+    ctx->arch.vtd.iommu_dev_cnt = xzalloc_array(unsigned long, nr_iommus);
+    if ( !ctx->arch.vtd.iommu_dev_cnt )
+    {
+        xfree(ctx->arch.vtd.didmap);
+        return -ENOMEM;
+    }
+
+    /* TODO: Allocate IOMMU domid only when attaching devices? */
+    /* Populate context DID map using pseudo DIDs */
+    for_each_drhd_unit(drhd)
+    {
+        ctx->arch.vtd.didmap[drhd->iommu->index] =
+            iommu_alloc_domid(drhd->iommu->domid_bitmap);
+    }
+
+    return arch_iommu_context_init(d, ctx, 0);
+}
+
+static int cf_check intel_iommu_domain_init(struct domain *d)
+{
+    struct domain_iommu *hd = dom_iommu(d);
+
     hd->arch.vtd.agaw = width_to_agaw(DEFAULT_DOMAIN_ADDRESS_WIDTH);
 
-    return 0;
+    return intel_iommu_context_init(d, iommu_default_context(d));
 }
 
 static void __hwdom_init cf_check intel_iommu_hwdom_init(struct domain *d)
@@ -1481,11 +1315,11 @@ int domain_context_mapping_one(
     struct context_entry *context, *context_entries, lctxt;
     __uint128_t res, old;
     uint64_t maddr;
-    uint16_t seg = iommu->drhd->segment, prev_did = 0;
-    struct domain *prev_dom = NULL;
+    uint16_t seg = iommu->drhd->segment, prev_did = 0, did;
     int rc, ret;
-    bool flush_dev_iotlb;
+    bool flush_dev_iotlb, overwrite_entry = false;
 
+    struct iommu_context *prev_ctx = pdev->domain ? iommu_default_context(pdev->domain) : NULL;
 
     ASSERT(pcidevs_locked());
     spin_lock(&iommu->lock);
@@ -1494,23 +1328,12 @@ int domain_context_mapping_one(
     context = &context_entries[devfn];
     old = (lctxt = *context).full;
 
+    did = ctx->arch.vtd.didmap[iommu->index];
+
     if ( context_present(lctxt) )
     {
-        domid_t domid;
-
         prev_did = context_domain_id(lctxt);
-        domid = did_to_domain_id(iommu, prev_did);
-        if ( domid < DOMID_FIRST_RESERVED )
-            prev_dom = rcu_lock_domain_by_id(domid);
-        if ( !prev_dom )
-        {
-            spin_unlock(&iommu->lock);
-            unmap_vtd_domain_page(context_entries);
-            dprintk(XENLOG_DEBUG VTDPREFIX,
-                    "no domain for did %u (nr_dom %u)\n",
-                    prev_did, cap_ndoms(iommu->cap));
-            return -ESRCH;
-        }
+        overwrite_entry = true;
     }
 
     if ( iommu_hwdom_passthrough && is_hardware_domain(domain) )
@@ -1526,11 +1349,7 @@ int domain_context_mapping_one(
         root = domain_pgd_maddr(domain, ctx, pgd_maddr, iommu->nr_pt_levels);
         if ( !root )
         {
-            spin_unlock(&ctx->arch.mapping_lock);
-            spin_unlock(&iommu->lock);
             unmap_vtd_domain_page(context_entries);
-            if ( prev_dom )
-                rcu_unlock_domain(prev_dom);
             return -ENOMEM;
         }
 
@@ -1543,35 +1362,13 @@ int domain_context_mapping_one(
         spin_unlock(&ctx->arch.mapping_lock);
     }
 
-    rc = context_set_domain_id(&lctxt, domid, iommu);
+    rc = context_set_domain_id(&lctxt, did, iommu);
     if ( rc )
-    {
-    unlock:
-        spin_unlock(&iommu->lock);
-        unmap_vtd_domain_page(context_entries);
-        if ( prev_dom )
-            rcu_unlock_domain(prev_dom);
-        return rc;
-    }
-
-    if ( !prev_dom )
-    {
-        context_set_address_width(lctxt, level_to_agaw(iommu->nr_pt_levels));
-        context_set_fault_enable(lctxt);
-        context_set_present(lctxt);
-    }
-    else if ( prev_dom == domain )
-    {
-        ASSERT(lctxt.full == context->full);
-        rc = !!pdev;
         goto unlock;
-    }
-    else
-    {
-        ASSERT(context_address_width(lctxt) ==
-               level_to_agaw(iommu->nr_pt_levels));
-        ASSERT(!context_fault_disable(lctxt));
-    }
+
+    context_set_address_width(lctxt, level_to_agaw(iommu->nr_pt_levels));
+    context_set_fault_enable(lctxt);
+    context_set_present(lctxt);
 
     res = cmpxchg16b(context, &old, &lctxt.full);
 
@@ -1581,8 +1378,6 @@ int domain_context_mapping_one(
      */
     if ( res != old )
     {
-        if ( pdev )
-            check_cleanup_domid_map(domain, pdev, iommu);
         printk(XENLOG_ERR
                 "%pp: unexpected context entry %016lx_%016lx (expected %016lx_%016lx)\n",
                 &PCI_SBDF(seg, bus, devfn),
@@ -1596,9 +1391,9 @@ int domain_context_mapping_one(
     spin_unlock(&iommu->lock);
 
     rc = iommu_flush_context_device(iommu, prev_did, PCI_BDF(bus, devfn),
-                                    DMA_CCMD_MASK_NOBIT, !prev_dom);
+                                    DMA_CCMD_MASK_NOBIT, !overwrite_entry);
     flush_dev_iotlb = !!find_ats_dev_drhd(iommu);
-    ret = iommu_flush_iotlb_dsi(iommu, prev_did, !prev_dom, flush_dev_iotlb);
+    ret = iommu_flush_iotlb_dsi(iommu, prev_did, !overwrite_entry, flush_dev_iotlb);
 
     /*
      * The current logic for returns:
@@ -1614,18 +1409,27 @@ int domain_context_mapping_one(
     if ( rc > 0 )
         rc = 0;
 
-    set_bit(iommu->index, ctx->arch.vtd.iommu_bitmap);
+    if ( prev_ctx )
+    {
+        /* Don't underflow the counter. */
+        BUG_ON(!prev_ctx->arch.vtd.iommu_dev_cnt[iommu->index]);
+        prev_ctx->arch.vtd.iommu_dev_cnt[iommu->index]--;
+    }
+
+    ctx->arch.vtd.iommu_dev_cnt[iommu->index]++;
 
     unmap_vtd_domain_page(context_entries);
+    spin_unlock(&iommu->lock);
 
     if ( !seg && !rc )
         rc = me_wifi_quirk(domain, bus, devfn, domid, pgd_maddr, mode);
 
+    return rc;
 
-    if ( prev_dom )
-        rcu_unlock_domain(prev_dom);
-
-    return rc ?: pdev && prev_dom;
+    unlock:
+        unmap_vtd_domain_page(context_entries);
+        spin_unlock(&iommu->lock);
+        return rc;
 }
 
 static const struct acpi_drhd_unit *domain_context_unmap(
@@ -1637,7 +1441,7 @@ static int domain_context_mapping(struct domain *domain, struct iommu_context *c
     const struct acpi_drhd_unit *drhd = acpi_find_matched_drhd_unit(pdev);
     const struct acpi_rmrr_unit *rmrr;
     paddr_t pgd_maddr = ctx->arch.vtd.pgd_maddr;
-    domid_t did = domain->domain_id;
+    domid_t did = ctx->arch.vtd.didmap[drhd->iommu->index];
     int ret = 0;
     unsigned int i, mode = 0;
     uint16_t seg = pdev->seg, bdf;
@@ -1960,9 +1764,10 @@ static void cf_check iommu_domain_teardown(struct domain *d)
     ASSERT(!ctx->arch.vtd.pgd_maddr);
 
     for_each_drhd_unit ( drhd )
-        cleanup_domid_map(d->domain_id, drhd->iommu);
+        iommu_free_domid(d->domain_id, drhd->iommu->domid_bitmap);
 
-    XFREE(ctx->arch.vtd.iommu_bitmap);
+    XFREE(ctx->arch.vtd.iommu_dev_cnt);
+    XFREE(ctx->arch.vtd.didmap);
 }
 
 static void quarantine_teardown(struct pci_dev *pdev,
diff --git a/xen/drivers/passthrough/vtd/iommu.h b/xen/drivers/passthrough/vtd/iommu.h
index 29d350b23d..77edfa3587 100644
--- a/xen/drivers/passthrough/vtd/iommu.h
+++ b/xen/drivers/passthrough/vtd/iommu.h
@@ -506,9 +506,7 @@ struct vtd_iommu {
     } flush;
 
     struct list_head ats_devices;
-    unsigned long *pseudo_domid_map; /* "pseudo" domain id bitmap */
     unsigned long *domid_bitmap;  /* domain id bitmap */
-    domid_t *domid_map;           /* domain id mapping array */
     uint32_t version;
 };
 
diff --git a/xen/drivers/passthrough/vtd/qinval.c b/xen/drivers/passthrough/vtd/qinval.c
index 036f3e8505..3f25b6a2e0 100644
--- a/xen/drivers/passthrough/vtd/qinval.c
+++ b/xen/drivers/passthrough/vtd/qinval.c
@@ -229,7 +229,7 @@ static int __must_check dev_invalidate_sync(struct vtd_iommu *iommu,
     rc = queue_invalidate_wait(iommu, 0, 1, 1, 1);
     if ( rc == -ETIMEDOUT && !pdev->broken )
     {
-        struct domain *d = rcu_lock_domain_by_id(did_to_domain_id(iommu, did));
+        struct domain *d = rcu_lock_domain(pdev->domain);
 
         /*
          * In case the domain has been freed or the IOMMU domid bitmap is
diff --git a/xen/drivers/passthrough/x86/iommu.c b/xen/drivers/passthrough/x86/iommu.c
index a444e5813e..730a75e628 100644
--- a/xen/drivers/passthrough/x86/iommu.c
+++ b/xen/drivers/passthrough/x86/iommu.c
@@ -555,9 +555,6 @@ unsigned long *__init iommu_init_domid(domid_t reserve)
 {
     unsigned long *map;
 
-    if ( !iommu_quarantine )
-        return ZERO_BLOCK_PTR;
-
     BUILD_BUG_ON(DOMID_MASK * 2U >= UINT16_MAX);
 
     map = xzalloc_array(unsigned long, BITS_TO_LONGS(UINT16_MAX - DOMID_MASK));
@@ -572,36 +569,24 @@ unsigned long *__init iommu_init_domid(domid_t reserve)
 
 domid_t iommu_alloc_domid(unsigned long *map)
 {
-    /*
-     * This is used uniformly across all IOMMUs, such that on typical
-     * systems we wouldn't re-use the same ID very quickly (perhaps never).
-     */
-    static unsigned int start;
-    unsigned int idx = find_next_zero_bit(map, UINT16_MAX - DOMID_MASK, start);
+    /* TODO: Consider nr_doms ? */
+    unsigned int idx = find_next_zero_bit(map, UINT16_MAX, 0);
 
-    ASSERT(pcidevs_locked());
-
-    if ( idx >= UINT16_MAX - DOMID_MASK )
-        idx = find_first_zero_bit(map, UINT16_MAX - DOMID_MASK);
-    if ( idx >= UINT16_MAX - DOMID_MASK )
-        return DOMID_INVALID;
+    if ( idx >= UINT16_MAX )
+        return UINT16_MAX;
 
     __set_bit(idx, map);
 
-    start = idx + 1;
-
-    return idx | (DOMID_MASK + 1);
+    return idx;
 }
 
 void iommu_free_domid(domid_t domid, unsigned long *map)
 {
     ASSERT(pcidevs_locked());
 
-    if ( domid == DOMID_INVALID )
+    if ( domid == UINT16_MAX )
         return;
 
-    ASSERT(domid > DOMID_MASK);
-
     if ( !__test_and_clear_bit(domid & DOMID_MASK, map) )
         BUG();
 }
-- 
2.47.2




* [XEN RFC PATCH v6 09/11] x86/iommu: Introduce IOMMU arena
  2025-02-17 10:18 [XEN RFC PATCH v6 00/11] IOMMU subsystem redesign and PV-IOMMU interface Teddy Astie
                   ` (3 preceding siblings ...)
  2025-02-17 10:18 ` [XEN RFC PATCH v6 02/11] docs/designs: Add a design document for PV-IOMMU Teddy Astie
@ 2025-02-17 10:18 ` Teddy Astie
  2025-02-17 10:18 ` [XEN RFC PATCH v6 07/11] iommu: Simplify hardware did management Teddy Astie
                   ` (6 subsequent siblings)
  11 siblings, 0 replies; 17+ messages in thread
From: Teddy Astie @ 2025-02-17 10:18 UTC (permalink / raw)
  To: xen-devel; +Cc: Teddy Astie, Jan Beulich, Andrew Cooper, Roger Pau Monné

Introduce a new facility that reserves a fixed number of contiguous
pages and provides a way to allocate them.

It is used to ensure that a guest cannot cause the hypervisor to
OOM with unconstrained allocations by abusing the PV-IOMMU interface.
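
For readers who want the idea in isolation, here is a minimal
single-threaded sketch of a fixed-capacity bitmap arena. It is an
illustration under stated assumptions, not the code of this patch: the
names (`struct arena`, `arena_alloc`, `arena_free`) and the capacity are
hypothetical, pages are represented by plain indices, and the atomic
test-and-set used by the real allocator is replaced by ordinary bit
operations.

```c
#include <stdbool.h>
#include <stdint.h>

#define ARENA_PAGES 8u /* hypothetical fixed capacity (must be <= 32 here) */

struct arena {
    uint32_t map;      /* bit n set => page n is allocated */
    unsigned int used; /* number of allocated pages */
};

/* Returns a page index in [0, ARENA_PAGES), or -1 when the arena is full. */
static int arena_alloc(struct arena *a)
{
    for (unsigned int i = 0; i < ARENA_PAGES; i++) {
        if (!(a->map & (UINT32_C(1) << i))) {
            a->map |= UINT32_C(1) << i;
            a->used++;
            return (int)i;
        }
    }
    return -1;
}

/* Returns false on an out-of-range index or a double free. */
static bool arena_free(struct arena *a, int idx)
{
    if (idx < 0 || (unsigned int)idx >= ARENA_PAGES)
        return false;
    if (!(a->map & (UINT32_C(1) << idx)))
        return false;
    a->map &= ~(UINT32_C(1) << idx);
    a->used--;
    return true;
}
```

Because the capacity is fixed when the arena is created, a caller can
exhaust the arena but cannot make it grow, which is the property the
commit message relies on.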

Signed-off-by: Teddy Astie <teddy.astie@vates.tech>
---
 xen/arch/x86/include/asm/arena.h     |  54 +++++++++
 xen/arch/x86/include/asm/iommu.h     |   3 +
 xen/drivers/passthrough/x86/Makefile |   1 +
 xen/drivers/passthrough/x86/arena.c  | 157 +++++++++++++++++++++++++++
 4 files changed, 215 insertions(+)
 create mode 100644 xen/arch/x86/include/asm/arena.h
 create mode 100644 xen/drivers/passthrough/x86/arena.c

diff --git a/xen/arch/x86/include/asm/arena.h b/xen/arch/x86/include/asm/arena.h
new file mode 100644
index 0000000000..7555b100e0
--- /dev/null
+++ b/xen/arch/x86/include/asm/arena.h
@@ -0,0 +1,54 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+/**
+ * Simple arena-based page allocator.
+ */
+
+#ifndef __XEN_IOMMU_ARENA_H__
+#define __XEN_IOMMU_ARENA_H__
+
+#include "xen/domain.h"
+#include "xen/atomic.h"
+#include "xen/mm-frame.h"
+#include "xen/types.h"
+
+/**
+ * struct iommu_arena: Page arena structure
+ */
+struct iommu_arena {
+    /* mfn of the first page of the memory region */
+    mfn_t region_start;
+    /* bitmap of allocations */
+    unsigned long *map;
+
+    /* Order of the arena */
+    unsigned int order;
+
+    /* Used page count */
+    atomic_t used_pages;
+};
+
+/**
+ * Initialize an arena using the domheap allocator.
+ * @param [out] arena Arena to allocate
+ * @param [in] domain domain that has ownership of arena pages
+ * @param [in] order order of the arena (power of two of the size)
+ * @param [in] memflags Flags for alloc_domheap_pages()
+ * @return -ENOMEM on arena allocation error, 0 otherwise
+ */
+int iommu_arena_initialize(struct iommu_arena *arena, struct domain *domain,
+                           unsigned int order, unsigned int memflags);
+
+/**
+ * Tear down an arena.
+ * @param [in,out] arena arena to tear down
+ * @param [in] check check for existing allocations
+ * @return -EBUSY if check is set and pages are still allocated, 0 otherwise
+ */
+int iommu_arena_teardown(struct iommu_arena *arena, bool check);
+
+struct page_info *iommu_arena_allocate_page(struct iommu_arena *arena);
+bool iommu_arena_free_page(struct iommu_arena *arena, struct page_info *page);
+
+#define iommu_arena_size(arena) (1LLU << (arena)->order)
+
+#endif
diff --git a/xen/arch/x86/include/asm/iommu.h b/xen/arch/x86/include/asm/iommu.h
index 654a07b9b2..452b98b42d 100644
--- a/xen/arch/x86/include/asm/iommu.h
+++ b/xen/arch/x86/include/asm/iommu.h
@@ -12,6 +12,8 @@
 #include <asm/cache.h>
 #include <asm/processor.h>
 
+#include "arena.h"
+
 #define DEFAULT_DOMAIN_ADDRESS_WIDTH 48
 
 struct g2m_ioport {
@@ -62,6 +64,7 @@ struct arch_iommu
 {
     /* Queue for freeing pages */
     struct page_list_head free_queue;
+    struct iommu_arena pt_arena; /* allocator for non-default contexts */
 
     union {
         /* Intel VT-d */
diff --git a/xen/drivers/passthrough/x86/Makefile b/xen/drivers/passthrough/x86/Makefile
index 75b2885336..1614f3d284 100644
--- a/xen/drivers/passthrough/x86/Makefile
+++ b/xen/drivers/passthrough/x86/Makefile
@@ -1,2 +1,3 @@
 obj-y += iommu.o
+obj-y += arena.o
 obj-$(CONFIG_HVM) += hvm.o
diff --git a/xen/drivers/passthrough/x86/arena.c b/xen/drivers/passthrough/x86/arena.c
new file mode 100644
index 0000000000..984bc4d643
--- /dev/null
+++ b/xen/drivers/passthrough/x86/arena.c
@@ -0,0 +1,157 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+/**
+ * Simple arena-based page allocator.
+ *
+ * Allocate a large block using alloc_domheap_pages and allocate single pages
+ * using iommu_arena_allocate_page and iommu_arena_free_page functions.
+ *
+ * Concurrent {allocate/free}_page is thread-safe.
+ * iommu_arena_teardown during {allocate/free}_page is not thread-safe.
+ *
+ * Written by Teddy Astie <teddy.astie@vates.tech>
+ */
+
+#include <asm/bitops.h>
+#include <asm/page.h>
+#include <xen/atomic.h>
+#include <xen/bug.h>
+#include <xen/config.h>
+#include <xen/mm-frame.h>
+#include <xen/mm.h>
+#include <xen/xmalloc.h>
+
+#include <asm/arena.h>
+
+/* Maximum number of scan retries if the bit found is no longer available */
+#define ARENA_TSL_MAX_TRIES 5
+
+int iommu_arena_initialize(struct iommu_arena *arena, struct domain *d,
+                           unsigned int order, unsigned int memflags)
+{
+    struct page_info *page;
+
+    /* TODO: Maybe allocate differently ? */
+    page = alloc_domheap_pages(d, order, memflags);
+
+    if ( !page )
+        return -ENOMEM;
+
+    arena->map = xzalloc_array(unsigned long, BITS_TO_LONGS(1LLU << order));
+    arena->order = order;
+    arena->region_start = page_to_mfn(page);
+
+    _atomic_set(&arena->used_pages, 0);
+    bitmap_zero(arena->map, iommu_arena_size(arena));
+
+    printk(XENLOG_DEBUG "IOMMU: Allocated arena (%llu pages, start=%"PRI_mfn")\n",
+           iommu_arena_size(arena), mfn_x(arena->region_start));
+    return 0;
+}
+
+int iommu_arena_teardown(struct iommu_arena *arena, bool check)
+{
+    BUG_ON(mfn_x(arena->region_start) == 0);
+
+    /* Check for allocations if check is specified */
+    if ( check && (atomic_read(&arena->used_pages) > 0) )
+        return -EBUSY;
+
+    free_domheap_pages(mfn_to_page(arena->region_start), arena->order);
+
+    arena->region_start = _mfn(0);
+    _atomic_set(&arena->used_pages, 0);
+    xfree(arena->map);
+    arena->map = NULL;
+
+    return 0;
+}
+
+struct page_info *iommu_arena_allocate_page(struct iommu_arena *arena)
+{
+    unsigned int index;
+    unsigned int tsl_tries = 0;
+
+    BUG_ON(mfn_x(arena->region_start) == 0);
+
+    if ( atomic_read(&arena->used_pages) == iommu_arena_size(arena) )
+        /* All pages used */
+        return NULL;
+
+    do
+    {
+        index = find_first_zero_bit(arena->map, iommu_arena_size(arena));
+
+        if ( index >= iommu_arena_size(arena) )
+            /* No more free pages */
+            return NULL;
+
+        /*
+         * While there shouldn't be a lot of retries in practice, this loop
+         * *may* run indefinitely if the found bit is never free due to being
+         * overwritten by another CPU core right after. Add a safeguard for
+         * such very rare cases.
+         */
+        tsl_tries++;
+
+        if ( unlikely(tsl_tries == ARENA_TSL_MAX_TRIES) )
+        {
+            printk(XENLOG_ERR "ARENA: Too many TSL retries!\n");
+            return NULL;
+        }
+
+        /* Make sure that the bit we found is still free */
+    } while ( test_and_set_bit(index, arena->map) );
+
+    atomic_inc(&arena->used_pages);
+
+    return mfn_to_page(mfn_add(arena->region_start, index));
+}
+
+bool iommu_arena_free_page(struct iommu_arena *arena, struct page_info *page)
+{
+    unsigned long index;
+    mfn_t frame;
+
+    if ( !page )
+    {
+        printk(XENLOG_WARNING "IOMMU: Trying to free NULL page\n");
+        WARN();
+        return false;
+    }
+
+    frame = page_to_mfn(page);
+
+    /* Check if page belongs to our arena */
+    if ( (mfn_x(frame) < mfn_x(arena->region_start))
+        || (mfn_x(frame) >= (mfn_x(arena->region_start) + iommu_arena_size(arena))) )
+    {
+        printk(XENLOG_WARNING
+               "IOMMU: Trying to free outside arena region [mfn=%"PRI_mfn"]\n",
+               mfn_x(frame));
+        WARN();
+        return false;
+    }
+
+    index = mfn_x(frame) - mfn_x(arena->region_start);
+
+    /* Sanity check in case of underflow. */
+    ASSERT(index < iommu_arena_size(arena));
+
+    if ( !test_and_clear_bit(index, arena->map) )
+    {
+        /*
+         * Bit was free during our arena_free_page, which means that
+         * either this page was never allocated, or we are in a double-free
+         * situation.
+         */
+        printk(XENLOG_WARNING
+               "IOMMU: Freeing non-allocated region (double-free?) [mfn=%"PRI_mfn"]\n",
+               mfn_x(frame));
+        WARN();
+        return false;
+    }
+
+    atomic_dec(&arena->used_pages);
+
+    return true;
+}
\ No newline at end of file
-- 
2.47.2




* [XEN RFC PATCH v6 11/11] iommu: Introduce no-dma feature
  2025-02-17 10:18 [XEN RFC PATCH v6 00/11] IOMMU subsystem redesign and PV-IOMMU interface Teddy Astie
                   ` (6 preceding siblings ...)
  2025-02-17 10:18 ` [XEN RFC PATCH v6 03/11] x86/domain: Defer domain iommu initialization Teddy Astie
@ 2025-02-17 10:18 ` Teddy Astie
  2025-02-17 10:18 ` [XEN RFC PATCH v6 04/11] iommu: Move IOMMU domain related structures to (arch_)iommu_context Teddy Astie
                   ` (3 subsequent siblings)
  11 siblings, 0 replies; 17+ messages in thread
From: Teddy Astie @ 2025-02-17 10:18 UTC (permalink / raw)
  To: xen-devel
  Cc: Teddy Astie, Andrew Cooper, Anthony PERARD, Michal Orzel,
	Jan Beulich, Julien Grall, Roger Pau Monné,
	Stefano Stabellini

This feature, exposed through `dom0-iommu=no-dma`, prevents devices in the
default context from accessing the domain's memory.
This effectively enforces DMA protection by default. The domain must
set up a dedicated IOMMU context before its devices can do DMA.

This feature needs the guest to provide a PV-IOMMU driver.
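
As a rough illustration of the option semantics (not the actual Xen code), the
sketch below models how a `dma` / `no-dma` token could end up in a "no DMA" flag.
`toy_parse_boolean`, `toy_hwdom_no_dma` and `toy_parse_dom0_iommu_token` are
hypothetical names; the real parser uses Xen's parse_boolean(), which accepts
more forms than this simplified stand-in.

```c
#include <stdbool.h>
#include <string.h>

/*
 * Hypothetical, simplified stand-in for Xen's parse_boolean(): returns 1 for
 * "<name>", 0 for "no-<name>", and -1 when the token does not match.
 */
static int toy_parse_boolean(const char *name, const char *tok)
{
    if ( strncmp(tok, "no-", 3) == 0 )
        return strcmp(tok + 3, name) == 0 ? 0 : -1;

    return strcmp(tok, name) == 0 ? 1 : -1;
}

/* Toy model of the global flag added by the patch (default: DMA allowed). */
static bool toy_hwdom_no_dma = false;

/* Models the new branch in the dom0-iommu parser; returns 0 or -1. */
static int toy_parse_dom0_iommu_token(const char *tok)
{
    int val = toy_parse_boolean("dma", tok);

    if ( val < 0 )
        return -1;              /* not a (no-)dma token */

    toy_hwdom_no_dma = !val;    /* "no-dma" => true, "dma" => false */
    return 0;
}
```

The flag is stored inverted with respect to the parsed boolean, which is why
`no-dma` on the command line yields a true "no DMA" flag.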

Signed-off-by: Teddy Astie <teddy.astie@vates.tech>
---
 xen/common/pv-iommu.c               |  3 +++
 xen/drivers/passthrough/iommu.c     | 10 ++++++++++
 xen/drivers/passthrough/x86/iommu.c |  4 ++++
 xen/include/xen/iommu.h             |  3 +++
 4 files changed, 20 insertions(+)

diff --git a/xen/common/pv-iommu.c b/xen/common/pv-iommu.c
index a1315bf582..9c7d04b4c7 100644
--- a/xen/common/pv-iommu.c
+++ b/xen/common/pv-iommu.c
@@ -99,6 +99,9 @@ static long capabilities_op(struct pv_iommu_capabilities *cap, struct domain *d)
     cap->max_pasid = 0; /* TODO */
     cap->cap_flags = 0;
 
+    if ( !dom_iommu(d)->no_dma )
+        cap->cap_flags |= IOMMUCAP_default_identity;
+
     cap->pgsize_mask = PAGE_SIZE_4K;
 
     return 0;
diff --git a/xen/drivers/passthrough/iommu.c b/xen/drivers/passthrough/iommu.c
index c26a2160f9..59a4c64915 100644
--- a/xen/drivers/passthrough/iommu.c
+++ b/xen/drivers/passthrough/iommu.c
@@ -55,6 +55,7 @@ static bool __hwdom_initdata iommu_hwdom_none;
 bool __hwdom_initdata iommu_hwdom_strict;
 bool __read_mostly iommu_hwdom_passthrough;
 bool __hwdom_initdata iommu_hwdom_inclusive;
+bool __read_mostly iommu_hwdom_no_dma = false;
 int8_t __hwdom_initdata iommu_hwdom_reserved = -1;
 
 #ifndef iommu_hap_pt_share
@@ -172,6 +173,8 @@ static int __init cf_check parse_dom0_iommu_param(const char *s)
             iommu_hwdom_reserved = val;
         else if ( !cmdline_strcmp(s, "none") )
             iommu_hwdom_none = true;
+        else if ( (val = parse_boolean("dma", s, ss)) >= 0 )
+            iommu_hwdom_no_dma = !val;
         else
             rc = -EINVAL;
 
@@ -329,6 +332,13 @@ int iommu_domain_init(struct domain *d, unsigned int opts)
     if ( !is_hardware_domain(d) || iommu_hwdom_strict )
         hd->need_sync = !iommu_use_hap_pt(d);
 
+    if ( hd->no_dma )
+    {
+        /* No-DMA mode is exclusive with HAP and sync_pt. */
+        hd->hap_pt_share = false;
+        hd->need_sync = false;
+    }
+
     ASSERT(!(hd->need_sync && hd->hap_pt_share));
 
     hd->allow_pv_iommu = true;
diff --git a/xen/drivers/passthrough/x86/iommu.c b/xen/drivers/passthrough/x86/iommu.c
index 79efc6ad47..174c218b9b 100644
--- a/xen/drivers/passthrough/x86/iommu.c
+++ b/xen/drivers/passthrough/x86/iommu.c
@@ -529,6 +529,10 @@ void __hwdom_init arch_iommu_hwdom_init(struct domain *d)
     if ( iommu_hwdom_reserved == -1 )
         iommu_hwdom_reserved = 1;
 
+    if ( iommu_hwdom_no_dma )
+        /* Skip special mappings with no-dma mode */
+        return;
+
     if ( iommu_hwdom_inclusive )
     {
         printk(XENLOG_WARNING
diff --git a/xen/include/xen/iommu.h b/xen/include/xen/iommu.h
index e115642b86..fb38c1be86 100644
--- a/xen/include/xen/iommu.h
+++ b/xen/include/xen/iommu.h
@@ -106,6 +106,7 @@ extern bool iommu_debug;
 extern bool amd_iommu_perdev_intremap;
 
 extern bool iommu_hwdom_strict, iommu_hwdom_passthrough, iommu_hwdom_inclusive;
+extern bool iommu_hwdom_no_dma;
 extern int8_t iommu_hwdom_reserved;
 
 extern unsigned int iommu_dev_iotlb_timeout;
@@ -411,6 +412,8 @@ struct domain_iommu {
     /* SAF-2-safe enum constant in arithmetic operation */
     DECLARE_BITMAP(features, IOMMU_FEAT_count);
 
+    /* Does the IOMMU block all DMA on default context (implies !hap_pt_share) ? */
+    bool no_dma;
 
     /* Is the domain allowed to use PV-IOMMU ? */
     bool allow_pv_iommu;
-- 
2.47.2




* [XEN RFC PATCH v6 04/11] iommu: Move IOMMU domain related structures to (arch_)iommu_context
  2025-02-17 10:18 [XEN RFC PATCH v6 00/11] IOMMU subsystem redesign and PV-IOMMU interface Teddy Astie
                   ` (7 preceding siblings ...)
  2025-02-17 10:18 ` [XEN RFC PATCH v6 11/11] iommu: Introduce no-dma feature Teddy Astie
@ 2025-02-17 10:18 ` Teddy Astie
  2025-02-17 10:18 ` [XEN RFC PATCH v6 08/11] iommu: Introduce redesigned IOMMU subsystem Teddy Astie
                   ` (2 subsequent siblings)
  11 siblings, 0 replies; 17+ messages in thread
From: Teddy Astie @ 2025-02-17 10:18 UTC (permalink / raw)
  To: xen-devel
  Cc: Teddy Astie, Stefano Stabellini, Julien Grall, Bertrand Marquis,
	Michal Orzel, Volodymyr Babchuk, Shawn Anastasio, Jan Beulich,
	Andrew Cooper, Roger Pau Monné, Lukasz Hawrylko,
	Daniel P. Smith, Mateusz Mówka

Preparatory work for IOMMU redesign.

Introduce a new structure, (arch_)iommu_context, that will hold all the
per-IOMMU-context information for the IOMMU drivers.
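
A minimal sketch of the intended split, using toy types rather than the real
Xen definitions: per-context state (page-table root, mapping lock) lives in the
context, while properties shared by every context of a domain stay in the
per-domain structure. All `toy_*` names, and the choice to embed the default
context directly in the per-domain structure, are assumptions made for
illustration only.

```c
#include <stdint.h>

/* Toy per-context state: what is specific to one set of I/O page tables. */
struct toy_arch_iommu_context {
    uint64_t pgd_maddr;     /* root of this context's I/O page tables */
    int mapping_lock;       /* placeholder for a spinlock_t */
};

struct toy_iommu_context {
    struct toy_arch_iommu_context arch;
};

/* Toy per-domain state: shared by all contexts of the domain. */
struct toy_domain_iommu {
    unsigned int agaw;                      /* e.g. VT-d address width */
    struct toy_iommu_context default_ctx;   /* assumed storage location */
};

/* Analogue of an accessor returning the domain's default context. */
static struct toy_iommu_context *
toy_default_context(struct toy_domain_iommu *hd)
{
    return &hd->default_ctx;
}

/* Driver code now takes the context explicitly instead of the domain. */
static void toy_context_set_root(struct toy_iommu_context *ctx, uint64_t maddr)
{
    ctx->arch.pgd_maddr = maddr;
}
```

Passing the context explicitly is what allows the same driver paths to operate
on non-default contexts later in the series.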

Signed-off-by: Teddy Astie <teddy.astie@vates.tech>
---
 xen/arch/arm/include/asm/iommu.h            |   4 +
 xen/arch/ppc/include/asm/iommu.h            |   3 +
 xen/arch/x86/domain.c                       |   4 +-
 xen/arch/x86/include/asm/iommu.h            |  50 +++--
 xen/arch/x86/tboot.c                        |   3 +-
 xen/drivers/passthrough/amd/iommu.h         |   5 +-
 xen/drivers/passthrough/amd/iommu_init.c    |   8 +-
 xen/drivers/passthrough/amd/iommu_map.c     | 102 +++++-----
 xen/drivers/passthrough/amd/pci_amd_iommu.c |  81 ++++----
 xen/drivers/passthrough/iommu.c             |   6 +
 xen/drivers/passthrough/vtd/extern.h        |   4 +-
 xen/drivers/passthrough/vtd/iommu.c         | 208 +++++++++++---------
 xen/drivers/passthrough/vtd/quirks.c        |   3 +-
 xen/drivers/passthrough/x86/iommu.c         |  62 +++---
 xen/include/xen/iommu.h                     |  10 +
 15 files changed, 320 insertions(+), 233 deletions(-)

diff --git a/xen/arch/arm/include/asm/iommu.h b/xen/arch/arm/include/asm/iommu.h
index d57bd8a38c..5ca56cc663 100644
--- a/xen/arch/arm/include/asm/iommu.h
+++ b/xen/arch/arm/include/asm/iommu.h
@@ -20,6 +20,10 @@ struct arch_iommu
     void *priv;
 };
 
+struct arch_iommu_context
+{
+};
+
 const struct iommu_ops *iommu_get_ops(void);
 void iommu_set_ops(const struct iommu_ops *ops);
 
diff --git a/xen/arch/ppc/include/asm/iommu.h b/xen/arch/ppc/include/asm/iommu.h
index 024ead3473..8367505de2 100644
--- a/xen/arch/ppc/include/asm/iommu.h
+++ b/xen/arch/ppc/include/asm/iommu.h
@@ -5,4 +5,7 @@
 struct arch_iommu {
 };
 
+struct arch_iommu_context {
+};
+
 #endif /* __ASM_PPC_IOMMU_H__ */
diff --git a/xen/arch/x86/domain.c b/xen/arch/x86/domain.c
index 48bf9625e2..26729c879c 100644
--- a/xen/arch/x86/domain.c
+++ b/xen/arch/x86/domain.c
@@ -678,7 +678,7 @@ int arch_sanitise_domain_config(struct xen_domctl_createdomain *config)
     if ( nested_virt && !hvm_nested_virt_supported() )
     {
         dprintk(XENLOG_INFO, "Nested virt requested but not available\n");
-        return -EINVAL;        
+        return -EINVAL;
     }
 
     if ( nested_virt && !hap )
@@ -2392,7 +2392,7 @@ int domain_relinquish_resources(struct domain *d)
 
     PROGRESS(iommu_pagetables):
 
-        ret = iommu_free_pgtables(d);
+        ret = iommu_free_pgtables(d, iommu_default_context(d));
         if ( ret )
             return ret;
 
diff --git a/xen/arch/x86/include/asm/iommu.h b/xen/arch/x86/include/asm/iommu.h
index 8dc464fbd3..94513ba9dc 100644
--- a/xen/arch/x86/include/asm/iommu.h
+++ b/xen/arch/x86/include/asm/iommu.h
@@ -31,22 +31,21 @@ typedef uint64_t daddr_t;
 #define dfn_to_daddr(dfn) __dfn_to_daddr(dfn_x(dfn))
 #define daddr_to_dfn(daddr) _dfn(__daddr_to_dfn(daddr))
 
-struct arch_iommu
-{
-    spinlock_t mapping_lock; /* io page table lock */
-    struct {
-        struct page_list_head list;
-        spinlock_t lock;
-    } pgtables;
+struct iommu_context;
 
+struct arch_iommu_context
+{
+    struct page_list_head pgtables;
     struct list_head identity_maps;
 
+
+    spinlock_t mapping_lock; /* io page table lock */
+
     union {
         /* Intel VT-d */
         struct {
             uint64_t pgd_maddr; /* io page directory machine address */
-            unsigned int agaw; /* adjusted guest address width, 0 is level 2 30-bit */
-            unsigned long *iommu_bitmap; /* bitmap of iommu(s) that the domain uses */
+            unsigned long *iommu_bitmap; /* bitmap of iommu(s) that the context uses */
         } vtd;
         /* AMD IOMMU */
         struct {
@@ -56,6 +55,24 @@ struct arch_iommu
     };
 };
 
+struct arch_iommu
+{
+    /* Queue for freeing pages */
+    struct page_list_head free_queue;
+
+    union {
+        /* Intel VT-d */
+        struct {
+            unsigned int agaw; /* adjusted guest address width, 0 is level 2 30-bit */
+        } vtd;
+        /* AMD IOMMU */
+        struct {
+            unsigned int paging_mode;
+            struct guest_iommu *g_iommu;
+        };
+    };
+};
+
 extern struct iommu_ops iommu_ops;
 
 # include <asm/alternative.h>
@@ -109,10 +126,10 @@ static inline void iommu_disable_x2apic(void)
         iommu_vcall(&iommu_ops, disable_x2apic);
 }
 
-int iommu_identity_mapping(struct domain *d, p2m_access_t p2ma,
-                           paddr_t base, paddr_t end,
+int iommu_identity_mapping(struct domain *d, struct iommu_context *ctx,
+                           p2m_access_t p2ma, paddr_t base, paddr_t end,
                            unsigned int flag);
-void iommu_identity_map_teardown(struct domain *d);
+void iommu_identity_map_teardown(struct domain *d, struct iommu_context *ctx);
 
 extern bool untrusted_msi;
 
@@ -128,14 +145,19 @@ unsigned long *iommu_init_domid(domid_t reserve);
 domid_t iommu_alloc_domid(unsigned long *map);
 void iommu_free_domid(domid_t domid, unsigned long *map);
 
-int __must_check iommu_free_pgtables(struct domain *d);
+int __must_check iommu_free_pgtables(struct domain *d, struct iommu_context *ctx);
 struct domain_iommu;
 struct page_info *__must_check iommu_alloc_pgtable(struct domain_iommu *hd,
+                                                   struct iommu_context *ctx,
                                                    uint64_t contig_mask);
-void iommu_queue_free_pgtable(struct domain_iommu *hd, struct page_info *pg);
+void iommu_queue_free_pgtable(struct domain *d, struct iommu_context *ctx,
+                              struct page_info *pg);
 
 /* Check [start, end] unity map range for correctness. */
 bool iommu_unity_region_ok(const char *prefix, mfn_t start, mfn_t end);
+int arch_iommu_context_init(struct domain *d, struct iommu_context *ctx, u32 flags);
+int arch_iommu_context_teardown(struct domain *d, struct iommu_context *ctx, u32 flags);
+int arch_iommu_flush_free_queue(struct domain *d);
 
 #endif /* !__ARCH_X86_IOMMU_H__ */
 /*
diff --git a/xen/arch/x86/tboot.c b/xen/arch/x86/tboot.c
index d5db60d335..0a5aee8b92 100644
--- a/xen/arch/x86/tboot.c
+++ b/xen/arch/x86/tboot.c
@@ -220,7 +220,8 @@ static void tboot_gen_domain_integrity(const uint8_t key[TB_KEY_SIZE],
         {
             const struct domain_iommu *dio = dom_iommu(d);
 
-            update_iommu_mac(&ctx, dio->arch.vtd.pgd_maddr,
+            update_iommu_mac(&ctx,
+                             iommu_default_context(d)->arch.vtd.pgd_maddr,
                              agaw_to_level(dio->arch.vtd.agaw));
         }
     }
diff --git a/xen/drivers/passthrough/amd/iommu.h b/xen/drivers/passthrough/amd/iommu.h
index c32e9e9a16..6095bc6a21 100644
--- a/xen/drivers/passthrough/amd/iommu.h
+++ b/xen/drivers/passthrough/amd/iommu.h
@@ -26,6 +26,7 @@
 #include <xen/tasklet.h>
 #include <xen/sched.h>
 #include <xen/domain_page.h>
+#include <xen/iommu.h>
 
 #include <asm/msi.h>
 #include <asm/apicdef.h>
@@ -199,10 +200,10 @@ int __must_check cf_check amd_iommu_unmap_page(
     struct domain *d, dfn_t dfn, unsigned int order,
     unsigned int *flush_flags);
 int __must_check amd_iommu_alloc_root(struct domain *d);
-int amd_iommu_reserve_domain_unity_map(struct domain *d,
+int amd_iommu_reserve_domain_unity_map(struct domain *d, struct iommu_context *ctx,
                                        const struct ivrs_unity_map *map,
                                        unsigned int flag);
-int amd_iommu_reserve_domain_unity_unmap(struct domain *d,
+int amd_iommu_reserve_domain_unity_unmap(struct domain *d, struct iommu_context *ctx,
                                          const struct ivrs_unity_map *map);
 int cf_check amd_iommu_get_reserved_device_memory(
     iommu_grdm_t *func, void *ctxt);
diff --git a/xen/drivers/passthrough/amd/iommu_init.c b/xen/drivers/passthrough/amd/iommu_init.c
index 3023625020..41e241ccc8 100644
--- a/xen/drivers/passthrough/amd/iommu_init.c
+++ b/xen/drivers/passthrough/amd/iommu_init.c
@@ -604,7 +604,6 @@ static void iommu_check_event_log(struct amd_iommu *iommu)
                    sizeof(event_entry_t), parse_event_log_entry);
 
     spin_lock_irqsave(&iommu->lock, flags);
-    
     /* Check event overflow. */
     entry = readl(iommu->mmio_base + IOMMU_STATUS_MMIO_OFFSET);
     if ( entry & IOMMU_STATUS_EVENT_LOG_OVERFLOW )
@@ -660,9 +659,8 @@ static void iommu_check_ppr_log(struct amd_iommu *iommu)
 
     iommu_read_log(iommu, &iommu->ppr_log,
                    sizeof(ppr_entry_t), parse_ppr_log_entry);
-    
-    spin_lock_irqsave(&iommu->lock, flags);
 
+    spin_lock_irqsave(&iommu->lock, flags);
     /* Check event overflow. */
     entry = readl(iommu->mmio_base + IOMMU_STATUS_MMIO_OFFSET);
     if ( entry & IOMMU_STATUS_PPR_LOG_OVERFLOW )
@@ -1545,7 +1543,7 @@ static void invalidate_all_domain_pages(void)
 static int cf_check _invalidate_all_devices(
     u16 seg, struct ivrs_mappings *ivrs_mappings)
 {
-    unsigned int bdf; 
+    unsigned int bdf;
     u16 req_id;
     struct amd_iommu *iommu;
 
@@ -1595,7 +1593,7 @@ void cf_check amd_iommu_resume(void)
     for_each_amd_iommu ( iommu )
     {
        /*
-        * To make sure that iommus have not been touched 
+        * To make sure that iommus have not been touched
         * before re-enablement
         */
         disable_iommu(iommu);
diff --git a/xen/drivers/passthrough/amd/iommu_map.c b/xen/drivers/passthrough/amd/iommu_map.c
index dde393645a..7514384789 100644
--- a/xen/drivers/passthrough/amd/iommu_map.c
+++ b/xen/drivers/passthrough/amd/iommu_map.c
@@ -18,6 +18,7 @@
  */
 
 #include <xen/acpi.h>
+#include <xen/iommu.h>
 
 #include "iommu.h"
 
@@ -264,9 +265,9 @@ void __init iommu_dte_add_device_entry(struct amd_iommu_dte *dte,
  * {Re, un}mapping super page frames causes re-allocation of io
  * page tables.
  */
-static int iommu_pde_from_dfn(struct domain *d, unsigned long dfn,
-                              unsigned int target, unsigned long *pt_mfn,
-                              unsigned int *flush_flags, bool map)
+static int iommu_pde_from_dfn(struct domain *d, struct iommu_context *ctx,
+                              unsigned long dfn, unsigned int target,
+                              unsigned long *pt_mfn, unsigned int *flush_flags, bool map)
 {
     union amd_iommu_pte *pde, *next_table_vaddr;
     unsigned long  next_table_mfn;
@@ -274,8 +275,8 @@ static int iommu_pde_from_dfn(struct domain *d, unsigned long dfn,
     struct page_info *table;
     struct domain_iommu *hd = dom_iommu(d);
 
-    table = hd->arch.amd.root_table;
-    level = hd->arch.amd.paging_mode;
+    table = ctx->arch.amd.root_table;
+    level = ctx->arch.amd.paging_mode;
 
     if ( !table || target < 1 || level < target || level > 6 )
     {
@@ -311,7 +312,7 @@ static int iommu_pde_from_dfn(struct domain *d, unsigned long dfn,
             mfn = next_table_mfn;
 
             /* allocate lower level page table */
-            table = iommu_alloc_pgtable(hd, IOMMU_PTE_CONTIG_MASK);
+            table = iommu_alloc_pgtable(hd, ctx, IOMMU_PTE_CONTIG_MASK);
             if ( table == NULL )
             {
                 AMD_IOMMU_ERROR("cannot allocate I/O page table\n");
@@ -346,7 +347,7 @@ static int iommu_pde_from_dfn(struct domain *d, unsigned long dfn,
 
             if ( next_table_mfn == 0 )
             {
-                table = iommu_alloc_pgtable(hd, IOMMU_PTE_CONTIG_MASK);
+                table = iommu_alloc_pgtable(hd, ctx, IOMMU_PTE_CONTIG_MASK);
                 if ( table == NULL )
                 {
                     AMD_IOMMU_ERROR("cannot allocate I/O page table\n");
@@ -376,7 +377,8 @@ static int iommu_pde_from_dfn(struct domain *d, unsigned long dfn,
     return 0;
 }
 
-static void queue_free_pt(struct domain_iommu *hd, mfn_t mfn, unsigned int level)
+static void queue_free_pt(struct domain *d, struct iommu_context *ctx, mfn_t mfn,
+                          unsigned int level)
 {
     if ( level > 1 )
     {
@@ -387,13 +389,13 @@ static void queue_free_pt(struct domain_iommu *hd, mfn_t mfn, unsigned int level
             if ( pt[i].pr && pt[i].next_level )
             {
                 ASSERT(pt[i].next_level < level);
-                queue_free_pt(hd, _mfn(pt[i].mfn), pt[i].next_level);
+                queue_free_pt(d, ctx, _mfn(pt[i].mfn), pt[i].next_level);
             }
 
         unmap_domain_page(pt);
     }
 
-    iommu_queue_free_pgtable(hd, mfn_to_page(mfn));
+    iommu_queue_free_pgtable(d, ctx, mfn_to_page(mfn));
 }
 
 int cf_check amd_iommu_map_page(
@@ -401,6 +403,7 @@ int cf_check amd_iommu_map_page(
     unsigned int *flush_flags)
 {
     struct domain_iommu *hd = dom_iommu(d);
+    struct iommu_context *ctx = iommu_default_context(d);
     unsigned int level = (IOMMUF_order(flags) / PTE_PER_TABLE_SHIFT) + 1;
     bool contig;
     int rc;
@@ -410,7 +413,7 @@ int cf_check amd_iommu_map_page(
     ASSERT((hd->platform_ops->page_sizes >> IOMMUF_order(flags)) &
            PAGE_SIZE_4K);
 
-    spin_lock(&hd->arch.mapping_lock);
+    spin_lock(&ctx->arch.mapping_lock);
 
     /*
      * IOMMU mapping request can be safely ignored when the domain is dying.
@@ -420,24 +423,24 @@ int cf_check amd_iommu_map_page(
      */
     if ( d->is_dying )
     {
-        spin_unlock(&hd->arch.mapping_lock);
+        spin_unlock(&ctx->arch.mapping_lock);
         return 0;
     }
 
     rc = amd_iommu_alloc_root(d);
     if ( rc )
     {
-        spin_unlock(&hd->arch.mapping_lock);
+        spin_unlock(&ctx->arch.mapping_lock);
         AMD_IOMMU_ERROR("root table alloc failed, dfn = %"PRI_dfn"\n",
                         dfn_x(dfn));
         domain_crash(d);
         return rc;
     }
 
-    if ( iommu_pde_from_dfn(d, dfn_x(dfn), level, &pt_mfn, flush_flags, true) ||
+    if ( iommu_pde_from_dfn(d, ctx, dfn_x(dfn), level, &pt_mfn, flush_flags, true) ||
          !pt_mfn )
     {
-        spin_unlock(&hd->arch.mapping_lock);
+        spin_unlock(&ctx->arch.mapping_lock);
         AMD_IOMMU_ERROR("invalid IO pagetable entry dfn = %"PRI_dfn"\n",
                         dfn_x(dfn));
         domain_crash(d);
@@ -449,12 +452,12 @@ int cf_check amd_iommu_map_page(
                                 flags & IOMMUF_writable,
                                 flags & IOMMUF_readable, &contig);
 
-    while ( unlikely(contig) && ++level < hd->arch.amd.paging_mode )
+    while ( unlikely(contig) && ++level < ctx->arch.amd.paging_mode )
     {
         struct page_info *pg = mfn_to_page(_mfn(pt_mfn));
         unsigned long next_mfn;
 
-        if ( iommu_pde_from_dfn(d, dfn_x(dfn), level, &pt_mfn, flush_flags,
+        if ( iommu_pde_from_dfn(d, ctx, dfn_x(dfn), level, &pt_mfn, flush_flags,
                                 false) )
             BUG();
         BUG_ON(!pt_mfn);
@@ -464,11 +467,11 @@ int cf_check amd_iommu_map_page(
                               flags & IOMMUF_writable,
                               flags & IOMMUF_readable, &contig);
         *flush_flags |= IOMMU_FLUSHF_modified | IOMMU_FLUSHF_all;
-        iommu_queue_free_pgtable(hd, pg);
+        iommu_queue_free_pgtable(d, ctx, pg);
         perfc_incr(iommu_pt_coalesces);
     }
 
-    spin_unlock(&hd->arch.mapping_lock);
+    spin_unlock(&ctx->arch.mapping_lock);
 
     *flush_flags |= IOMMU_FLUSHF_added;
     if ( old.pr )
@@ -476,7 +479,7 @@ int cf_check amd_iommu_map_page(
         *flush_flags |= IOMMU_FLUSHF_modified;
 
         if ( IOMMUF_order(flags) && old.next_level )
-            queue_free_pt(hd, _mfn(old.mfn), old.next_level);
+            queue_free_pt(d, ctx, _mfn(old.mfn), old.next_level);
     }
 
     return 0;
@@ -487,6 +490,7 @@ int cf_check amd_iommu_unmap_page(
 {
     unsigned long pt_mfn = 0;
     struct domain_iommu *hd = dom_iommu(d);
+    struct iommu_context *ctx = iommu_default_context(d);
     unsigned int level = (order / PTE_PER_TABLE_SHIFT) + 1;
     union amd_iommu_pte old = {};
 
@@ -496,17 +500,17 @@ int cf_check amd_iommu_unmap_page(
      */
     ASSERT((hd->platform_ops->page_sizes >> order) & PAGE_SIZE_4K);
 
-    spin_lock(&hd->arch.mapping_lock);
+    spin_lock(&ctx->arch.mapping_lock);
 
-    if ( !hd->arch.amd.root_table )
+    if ( !ctx->arch.amd.root_table )
     {
-        spin_unlock(&hd->arch.mapping_lock);
+        spin_unlock(&ctx->arch.mapping_lock);
         return 0;
     }
 
-    if ( iommu_pde_from_dfn(d, dfn_x(dfn), level, &pt_mfn, flush_flags, false) )
+    if ( iommu_pde_from_dfn(d, ctx, dfn_x(dfn), level, &pt_mfn, flush_flags, false) )
     {
-        spin_unlock(&hd->arch.mapping_lock);
+        spin_unlock(&ctx->arch.mapping_lock);
         AMD_IOMMU_ERROR("invalid IO pagetable entry dfn = %"PRI_dfn"\n",
                         dfn_x(dfn));
         domain_crash(d);
@@ -520,30 +524,30 @@ int cf_check amd_iommu_unmap_page(
         /* Mark PTE as 'page not present'. */
         old = clear_iommu_pte_present(pt_mfn, dfn_x(dfn), level, &free);
 
-        while ( unlikely(free) && ++level < hd->arch.amd.paging_mode )
+        while ( unlikely(free) && ++level < ctx->arch.amd.paging_mode )
         {
             struct page_info *pg = mfn_to_page(_mfn(pt_mfn));
 
-            if ( iommu_pde_from_dfn(d, dfn_x(dfn), level, &pt_mfn,
+            if ( iommu_pde_from_dfn(d, ctx, dfn_x(dfn), level, &pt_mfn,
                                     flush_flags, false) )
                 BUG();
             BUG_ON(!pt_mfn);
 
             clear_iommu_pte_present(pt_mfn, dfn_x(dfn), level, &free);
             *flush_flags |= IOMMU_FLUSHF_all;
-            iommu_queue_free_pgtable(hd, pg);
+            iommu_queue_free_pgtable(d, ctx, pg);
             perfc_incr(iommu_pt_coalesces);
         }
     }
 
-    spin_unlock(&hd->arch.mapping_lock);
+    spin_unlock(&ctx->arch.mapping_lock);
 
     if ( old.pr )
     {
         *flush_flags |= IOMMU_FLUSHF_modified;
 
         if ( order && old.next_level )
-            queue_free_pt(hd, _mfn(old.mfn), old.next_level);
+            queue_free_pt(d, ctx, _mfn(old.mfn), old.next_level);
     }
 
     return 0;
@@ -646,7 +650,7 @@ int cf_check amd_iommu_flush_iotlb_pages(
     return 0;
 }
 
-int amd_iommu_reserve_domain_unity_map(struct domain *d,
+int amd_iommu_reserve_domain_unity_map(struct domain *d, struct iommu_context *ctx,
                                        const struct ivrs_unity_map *map,
                                        unsigned int flag)
 {
@@ -664,14 +668,14 @@ int amd_iommu_reserve_domain_unity_map(struct domain *d,
         if ( map->write )
             p2ma |= p2m_access_w;
 
-        rc = iommu_identity_mapping(d, p2ma, map->addr,
+        rc = iommu_identity_mapping(d, ctx, p2ma, map->addr,
                                     map->addr + map->length - 1, flag);
     }
 
     return rc;
 }
 
-int amd_iommu_reserve_domain_unity_unmap(struct domain *d,
+int amd_iommu_reserve_domain_unity_unmap(struct domain *d, struct iommu_context *ctx,
                                          const struct ivrs_unity_map *map)
 {
     int rc;
@@ -681,7 +685,7 @@ int amd_iommu_reserve_domain_unity_unmap(struct domain *d,
 
     for ( rc = 0; map; map = map->next )
     {
-        int ret = iommu_identity_mapping(d, p2m_access_x, map->addr,
+        int ret = iommu_identity_mapping(d, ctx, p2m_access_x, map->addr,
                                          map->addr + map->length - 1, 0);
 
         if ( ret && ret != -ENOENT && !rc )
@@ -771,6 +775,7 @@ static int fill_qpt(union amd_iommu_pte *this, unsigned int level,
                     struct page_info *pgs[IOMMU_MAX_PT_LEVELS])
 {
     struct domain_iommu *hd = dom_iommu(dom_io);
+    struct iommu_context *ctx = iommu_default_context(dom_io);
     unsigned int i;
     int rc = 0;
 
@@ -787,7 +792,7 @@ static int fill_qpt(union amd_iommu_pte *this, unsigned int level,
                  * page table pages, and the resulting allocations are always
                  * zeroed.
                  */
-                pgs[level] = iommu_alloc_pgtable(hd, 0);
+                pgs[level] = iommu_alloc_pgtable(hd, ctx, 0);
                 if ( !pgs[level] )
                 {
                     rc = -ENOMEM;
@@ -823,14 +828,15 @@ static int fill_qpt(union amd_iommu_pte *this, unsigned int level,
 int cf_check amd_iommu_quarantine_init(struct pci_dev *pdev, bool scratch_page)
 {
     struct domain_iommu *hd = dom_iommu(dom_io);
-    unsigned int level = hd->arch.amd.paging_mode;
+    struct iommu_context *ctx = iommu_default_context(dom_io);
+    unsigned int level = ctx->arch.amd.paging_mode;
     unsigned int req_id = get_dma_requestor_id(pdev->seg, pdev->sbdf.bdf);
     const struct ivrs_mappings *ivrs_mappings = get_ivrs_mappings(pdev->seg);
     int rc;
 
     ASSERT(pcidevs_locked());
-    ASSERT(!hd->arch.amd.root_table);
-    ASSERT(page_list_empty(&hd->arch.pgtables.list));
+    ASSERT(!ctx->arch.amd.root_table);
+    ASSERT(page_list_empty(&ctx->arch.pgtables));
 
     if ( !scratch_page && !ivrs_mappings[req_id].unity_map )
         return 0;
@@ -843,19 +849,19 @@ int cf_check amd_iommu_quarantine_init(struct pci_dev *pdev, bool scratch_page)
         return 0;
     }
 
-    pdev->arch.amd.root_table = iommu_alloc_pgtable(hd, 0);
+    pdev->arch.amd.root_table = iommu_alloc_pgtable(hd, ctx, 0);
     if ( !pdev->arch.amd.root_table )
         return -ENOMEM;
 
     /* Transiently install the root into DomIO, for iommu_identity_mapping(). */
-    hd->arch.amd.root_table = pdev->arch.amd.root_table;
+    ctx->arch.amd.root_table = pdev->arch.amd.root_table;
 
-    rc = amd_iommu_reserve_domain_unity_map(dom_io,
+    rc = amd_iommu_reserve_domain_unity_map(dom_io, ctx,
                                             ivrs_mappings[req_id].unity_map,
                                             0);
 
-    iommu_identity_map_teardown(dom_io);
-    hd->arch.amd.root_table = NULL;
+    iommu_identity_map_teardown(dom_io, ctx);
+    ctx->arch.amd.root_table = NULL;
 
     if ( rc )
         AMD_IOMMU_WARN("%pp: quarantine unity mapping failed\n", &pdev->sbdf);
@@ -871,7 +877,7 @@ int cf_check amd_iommu_quarantine_init(struct pci_dev *pdev, bool scratch_page)
         pdev->arch.leaf_mfn = page_to_mfn(pgs[0]);
     }
 
-    page_list_move(&pdev->arch.pgtables_list, &hd->arch.pgtables.list);
+    page_list_move(&pdev->arch.pgtables_list, &ctx->arch.pgtables);
 
     if ( rc )
         amd_iommu_quarantine_teardown(pdev);
@@ -881,16 +887,16 @@ int cf_check amd_iommu_quarantine_init(struct pci_dev *pdev, bool scratch_page)
 
 void amd_iommu_quarantine_teardown(struct pci_dev *pdev)
 {
-    struct domain_iommu *hd = dom_iommu(dom_io);
+    struct iommu_context *ctx = iommu_default_context(dom_io);
 
     ASSERT(pcidevs_locked());
 
     if ( !pdev->arch.amd.root_table )
         return;
 
-    ASSERT(page_list_empty(&hd->arch.pgtables.list));
-    page_list_move(&hd->arch.pgtables.list, &pdev->arch.pgtables_list);
-    while ( iommu_free_pgtables(dom_io) == -ERESTART )
+    ASSERT(page_list_empty(&ctx->arch.pgtables));
+    page_list_move(&ctx->arch.pgtables, &pdev->arch.pgtables_list);
+    while ( iommu_free_pgtables(dom_io, ctx) == -ERESTART )
         /* nothing */;
     pdev->arch.amd.root_table = NULL;
 }
diff --git a/xen/drivers/passthrough/amd/pci_amd_iommu.c b/xen/drivers/passthrough/amd/pci_amd_iommu.c
index f96f59440b..a3815d71be 100644
--- a/xen/drivers/passthrough/amd/pci_amd_iommu.c
+++ b/xen/drivers/passthrough/amd/pci_amd_iommu.c
@@ -19,6 +19,7 @@
 
 #include <xen/iocap.h>
 #include <xen/softirq.h>
+#include <xen/iommu.h>
 
 #include <asm/acpi.h>
 
@@ -86,12 +87,12 @@ int get_dma_requestor_id(uint16_t seg, uint16_t bdf)
 
 static int __must_check allocate_domain_resources(struct domain *d)
 {
-    struct domain_iommu *hd = dom_iommu(d);
+    struct iommu_context *ctx = iommu_default_context(d);
     int rc;
 
-    spin_lock(&hd->arch.mapping_lock);
+    spin_lock(&ctx->arch.mapping_lock);
     rc = amd_iommu_alloc_root(d);
-    spin_unlock(&hd->arch.mapping_lock);
+    spin_unlock(&ctx->arch.mapping_lock);
 
     return rc;
 }
@@ -125,7 +126,7 @@ static bool use_ats(
 }
 
 static int __must_check amd_iommu_setup_domain_device(
-    struct domain *domain, struct amd_iommu *iommu,
+    struct domain *domain, struct iommu_context *ctx, struct amd_iommu *iommu,
     uint8_t devfn, struct pci_dev *pdev)
 {
     struct amd_iommu_dte *table, *dte;
@@ -133,7 +134,6 @@ static int __must_check amd_iommu_setup_domain_device(
     unsigned int req_id, sr_flags;
     int rc;
     u8 bus = pdev->bus;
-    struct domain_iommu *hd = dom_iommu(domain);
     const struct ivrs_mappings *ivrs_dev;
     const struct page_info *root_pg;
     domid_t domid;
@@ -141,7 +141,7 @@ static int __must_check amd_iommu_setup_domain_device(
     if ( QUARANTINE_SKIP(domain, pdev) )
         return 0;
 
-    BUG_ON(!hd->arch.amd.paging_mode || !iommu->dev_table.buffer);
+    BUG_ON(!ctx->arch.amd.paging_mode || !iommu->dev_table.buffer);
 
     rc = allocate_domain_resources(domain);
     if ( rc )
@@ -161,7 +161,7 @@ static int __must_check amd_iommu_setup_domain_device(
 
     if ( domain != dom_io )
     {
-        root_pg = hd->arch.amd.root_table;
+        root_pg = ctx->arch.amd.root_table;
         domid = domain->domain_id;
     }
     else
@@ -177,7 +177,7 @@ static int __must_check amd_iommu_setup_domain_device(
         /* bind DTE to domain page-tables */
         rc = amd_iommu_set_root_page_table(
                  dte, page_to_maddr(root_pg), domid,
-                 hd->arch.amd.paging_mode, sr_flags);
+                 ctx->arch.amd.paging_mode, sr_flags);
         if ( rc )
         {
             ASSERT(rc < 0);
@@ -219,7 +219,7 @@ static int __must_check amd_iommu_setup_domain_device(
         else
             rc = amd_iommu_set_root_page_table(
                      dte, page_to_maddr(root_pg), domid,
-                     hd->arch.amd.paging_mode, sr_flags);
+                     ctx->arch.amd.paging_mode, sr_flags);
         if ( rc < 0 )
         {
             spin_unlock_irqrestore(&iommu->lock, flags);
@@ -270,7 +270,7 @@ static int __must_check amd_iommu_setup_domain_device(
                     "root table = %#"PRIx64", "
                     "domain = %d, paging mode = %d\n",
                     req_id, pdev->type, page_to_maddr(root_pg),
-                    domid, hd->arch.amd.paging_mode);
+                    domid, ctx->arch.amd.paging_mode);
 
     ASSERT(pcidevs_locked());
 
@@ -352,11 +352,12 @@ static int cf_check iov_enable_xt(void)
 int amd_iommu_alloc_root(struct domain *d)
 {
     struct domain_iommu *hd = dom_iommu(d);
+    struct iommu_context *ctx = iommu_default_context(d);
 
-    if ( unlikely(!hd->arch.amd.root_table) && d != dom_io )
+    if ( unlikely(!ctx->arch.amd.root_table) && d != dom_io )
     {
-        hd->arch.amd.root_table = iommu_alloc_pgtable(hd, 0);
-        if ( !hd->arch.amd.root_table )
+        ctx->arch.amd.root_table = iommu_alloc_pgtable(hd, ctx, 0);
+        if ( !ctx->arch.amd.root_table )
             return -ENOMEM;
     }
 
@@ -368,7 +369,7 @@ int __read_mostly amd_iommu_min_paging_mode = 1;
 
 static int cf_check amd_iommu_domain_init(struct domain *d)
 {
-    struct domain_iommu *hd = dom_iommu(d);
+    struct iommu_context *ctx = iommu_default_context(d);
     int pglvl = amd_iommu_get_paging_mode(
                     1UL << (domain_max_paddr_bits(d) - PAGE_SHIFT));
 
@@ -379,7 +380,7 @@ static int cf_check amd_iommu_domain_init(struct domain *d)
      * Choose the number of levels for the IOMMU page tables, taking into
      * account unity maps.
      */
-    hd->arch.amd.paging_mode = max(pglvl, amd_iommu_min_paging_mode);
+    ctx->arch.amd.paging_mode = max(pglvl, amd_iommu_min_paging_mode);
 
     return 0;
 }
@@ -455,7 +456,7 @@ static void amd_iommu_disable_domain_device(const struct domain *domain,
         AMD_IOMMU_DEBUG("Disable: device id = %#x, "
                         "domain = %d, paging mode = %d\n",
                         req_id, dte->domain_id,
-                        dom_iommu(domain)->arch.amd.paging_mode);
+                        iommu_default_context(domain)->arch.amd.paging_mode);
     }
     else
         spin_unlock_irqrestore(&iommu->lock, flags);
@@ -466,6 +467,8 @@ static int cf_check reassign_device(
     struct pci_dev *pdev)
 {
     struct amd_iommu *iommu;
+    struct iommu_context *target_ctx = iommu_default_context(target);
+    struct iommu_context *source_ctx = iommu_default_context(source);
     int rc;
 
     iommu = find_iommu_for_device(pdev->seg, pdev->sbdf.bdf);
@@ -478,7 +481,7 @@ static int cf_check reassign_device(
 
     if ( !QUARANTINE_SKIP(target, pdev) )
     {
-        rc = amd_iommu_setup_domain_device(target, iommu, devfn, pdev);
+        rc = amd_iommu_setup_domain_device(target, target_ctx, iommu, devfn, pdev);
         if ( rc )
             return rc;
     }
@@ -509,7 +512,7 @@ static int cf_check reassign_device(
         unsigned int bdf = PCI_BDF(pdev->bus, devfn);
 
         rc = amd_iommu_reserve_domain_unity_unmap(
-                 source,
+                 source, source_ctx,
                  ivrs_mappings[get_dma_requestor_id(pdev->seg, bdf)].unity_map);
         if ( rc )
             return rc;
@@ -528,7 +531,8 @@ static int cf_check amd_iommu_assign_device(
     unsigned int bdf = PCI_BDF(pdev->bus, devfn);
     int req_id = get_dma_requestor_id(pdev->seg, bdf);
     int rc = amd_iommu_reserve_domain_unity_map(
-                 d, ivrs_mappings[req_id].unity_map, flag);
+                 d, iommu_default_context(d),
+                 ivrs_mappings[req_id].unity_map, flag);
 
     if ( !rc )
         rc = reassign_device(pdev->domain, d, devfn, pdev);
@@ -536,7 +540,8 @@ static int cf_check amd_iommu_assign_device(
     if ( rc && !is_hardware_domain(d) )
     {
         int ret = amd_iommu_reserve_domain_unity_unmap(
-                      d, ivrs_mappings[req_id].unity_map);
+                      d, iommu_default_context(d),
+                      ivrs_mappings[req_id].unity_map);
 
         if ( ret )
         {
@@ -553,22 +558,25 @@ static int cf_check amd_iommu_assign_device(
 
 static void cf_check amd_iommu_clear_root_pgtable(struct domain *d)
 {
-    struct domain_iommu *hd = dom_iommu(d);
+    struct iommu_context *ctx = iommu_default_context(d);
 
-    spin_lock(&hd->arch.mapping_lock);
-    hd->arch.amd.root_table = NULL;
-    spin_unlock(&hd->arch.mapping_lock);
+    spin_lock(&ctx->arch.mapping_lock);
+    ctx->arch.amd.root_table = NULL;
+    spin_unlock(&ctx->arch.mapping_lock);
 }
 
 static void cf_check amd_iommu_domain_destroy(struct domain *d)
 {
-    iommu_identity_map_teardown(d);
-    ASSERT(!dom_iommu(d)->arch.amd.root_table);
+    struct iommu_context *ctx = iommu_default_context(d);
+
+    iommu_identity_map_teardown(d, ctx);
+    ASSERT(!ctx->arch.amd.root_table);
 }
 
 static int cf_check amd_iommu_add_device(u8 devfn, struct pci_dev *pdev)
 {
     struct amd_iommu *iommu;
+    struct iommu_context *ctx;
     u16 bdf;
     struct ivrs_mappings *ivrs_mappings;
     bool fresh_domid = false;
@@ -577,6 +585,8 @@ static int cf_check amd_iommu_add_device(u8 devfn, struct pci_dev *pdev)
     if ( !pdev->domain )
         return -EINVAL;
 
+    ctx = iommu_default_context(pdev->domain);
+
     for_each_amd_iommu(iommu)
         if ( pdev->seg == iommu->seg && pdev->sbdf.bdf == iommu->bdf )
             return is_hardware_domain(pdev->domain) ? 0 : -ENODEV;
@@ -633,7 +643,7 @@ static int cf_check amd_iommu_add_device(u8 devfn, struct pci_dev *pdev)
     }
 
     if ( amd_iommu_reserve_domain_unity_map(
-             pdev->domain,
+             pdev->domain, ctx,
              ivrs_mappings[ivrs_mappings[bdf].dte_requestor_id].unity_map,
              0) )
         AMD_IOMMU_WARN("%pd: unity mapping failed for %pp\n",
@@ -647,7 +657,7 @@ static int cf_check amd_iommu_add_device(u8 devfn, struct pci_dev *pdev)
         fresh_domid = true;
     }
 
-    ret = amd_iommu_setup_domain_device(pdev->domain, iommu, devfn, pdev);
+    ret = amd_iommu_setup_domain_device(pdev->domain, ctx, iommu, devfn, pdev);
     if ( ret && fresh_domid )
     {
         iommu_free_domid(pdev->arch.pseudo_domid, iommu->domid_map);
@@ -660,12 +670,15 @@ static int cf_check amd_iommu_add_device(u8 devfn, struct pci_dev *pdev)
 static int cf_check amd_iommu_remove_device(u8 devfn, struct pci_dev *pdev)
 {
     struct amd_iommu *iommu;
+    struct iommu_context *ctx;
     u16 bdf;
     struct ivrs_mappings *ivrs_mappings;
 
     if ( !pdev->domain )
         return -EINVAL;
 
+    ctx = iommu_default_context(pdev->domain);
+
     iommu = find_iommu_for_device(pdev->seg, pdev->sbdf.bdf);
     if ( !iommu )
     {
@@ -680,7 +693,7 @@ static int cf_check amd_iommu_remove_device(u8 devfn, struct pci_dev *pdev)
     bdf = PCI_BDF(pdev->bus, devfn);
 
     if ( amd_iommu_reserve_domain_unity_unmap(
-             pdev->domain,
+             pdev->domain, ctx,
              ivrs_mappings[ivrs_mappings[bdf].dte_requestor_id].unity_map) )
         AMD_IOMMU_WARN("%pd: unity unmapping failed for %pp\n",
                        pdev->domain, &PCI_SBDF(pdev->seg, bdf));
@@ -755,14 +768,14 @@ static void amd_dump_page_table_level(struct page_info *pg, int level,
 
 static void cf_check amd_dump_page_tables(struct domain *d)
 {
-    const struct domain_iommu *hd = dom_iommu(d);
+    struct iommu_context *ctx = iommu_default_context(d);
 
-    if ( !hd->arch.amd.root_table )
+    if ( !ctx->arch.amd.root_table )
         return;
 
-    printk("AMD IOMMU %pd table has %u levels\n", d, hd->arch.amd.paging_mode);
-    amd_dump_page_table_level(hd->arch.amd.root_table,
-                              hd->arch.amd.paging_mode, 0, 0);
+    printk("AMD IOMMU %pd table has %u levels\n", d, ctx->arch.amd.paging_mode);
+    amd_dump_page_table_level(ctx->arch.amd.root_table,
+                              ctx->arch.amd.paging_mode, 0, 0);
 }
 
 static const struct iommu_ops __initconst_cf_clobber _iommu_ops = {
diff --git a/xen/drivers/passthrough/iommu.c b/xen/drivers/passthrough/iommu.c
index 9e74a1fc72..662da49766 100644
--- a/xen/drivers/passthrough/iommu.c
+++ b/xen/drivers/passthrough/iommu.c
@@ -403,12 +403,15 @@ long iommu_unmap(struct domain *d, dfn_t dfn0, unsigned long page_count,
     unsigned long i;
     unsigned int order, j = 0;
     int rc = 0;
+    struct iommu_context *ctx;
 
     if ( !is_iommu_enabled(d) )
         return 0;
 
     ASSERT(!(flags & ~IOMMUF_preempt));
 
+    ctx = iommu_default_context(d);
+
     for ( i = 0; i < page_count; i += 1UL << order )
     {
         dfn_t dfn = dfn_add(dfn0, i);
@@ -468,10 +471,13 @@ int iommu_lookup_page(struct domain *d, dfn_t dfn, mfn_t *mfn,
                       unsigned int *flags)
 {
     const struct domain_iommu *hd = dom_iommu(d);
+    struct iommu_context *ctx;
 
     if ( !is_iommu_enabled(d) || !hd->platform_ops->lookup_page )
         return -EOPNOTSUPP;
 
+    ctx = iommu_default_context(d);
+
     return iommu_call(hd->platform_ops, lookup_page, d, dfn, mfn, flags);
 }
 
diff --git a/xen/drivers/passthrough/vtd/extern.h b/xen/drivers/passthrough/vtd/extern.h
index c16583c951..3dcb77c711 100644
--- a/xen/drivers/passthrough/vtd/extern.h
+++ b/xen/drivers/passthrough/vtd/extern.h
@@ -80,8 +80,8 @@ uint64_t alloc_pgtable_maddr(unsigned long npages, nodeid_t node);
 void free_pgtable_maddr(u64 maddr);
 void *map_vtd_domain_page(u64 maddr);
 void unmap_vtd_domain_page(const void *va);
-int domain_context_mapping_one(struct domain *domain, struct vtd_iommu *iommu,
-                               uint8_t bus, uint8_t devfn,
+int domain_context_mapping_one(struct domain *domain, struct iommu_context *ctx,
+                               struct vtd_iommu *iommu, uint8_t bus, uint8_t devfn,
                                const struct pci_dev *pdev, domid_t domid,
                                paddr_t pgd_maddr, unsigned int mode);
 int domain_context_unmap_one(struct domain *domain, struct vtd_iommu *iommu,
diff --git a/xen/drivers/passthrough/vtd/iommu.c b/xen/drivers/passthrough/vtd/iommu.c
index 9d7a9977a6..f60f39ee1d 100644
--- a/xen/drivers/passthrough/vtd/iommu.c
+++ b/xen/drivers/passthrough/vtd/iommu.c
@@ -54,7 +54,7 @@
 #define DEVICE_DOMID(d, pdev) ((d) != dom_io ? (d)->domain_id \
                                              : (pdev)->arch.pseudo_domid)
 #define DEVICE_PGTABLE(d, pdev) ((d) != dom_io \
-                                 ? dom_iommu(d)->arch.vtd.pgd_maddr \
+                                 ? iommu_default_context(d)->arch.vtd.pgd_maddr \
                                  : (pdev)->arch.vtd.pgd_maddr)
 
 bool __read_mostly iommu_igfx = true;
@@ -227,7 +227,7 @@ static void check_cleanup_domid_map(const struct domain *d,
 
     if ( !found )
     {
-        clear_bit(iommu->index, dom_iommu(d)->arch.vtd.iommu_bitmap);
+        clear_bit(iommu->index, iommu_default_context(d)->arch.vtd.iommu_bitmap);
         cleanup_domid_map(d->domain_id, iommu);
     }
 }
@@ -315,8 +315,9 @@ static u64 bus_to_context_maddr(struct vtd_iommu *iommu, u8 bus)
  *   PTE for the requested address,
  * - for target == 0 the full PTE contents below PADDR_BITS limit.
  */
-static uint64_t addr_to_dma_page_maddr(struct domain *domain, daddr_t addr,
-                                       unsigned int target,
+static uint64_t addr_to_dma_page_maddr(struct domain *domain,
+                                       struct iommu_context *ctx,
+                                       daddr_t addr, unsigned int target,
                                        unsigned int *flush_flags, bool alloc)
 {
     struct domain_iommu *hd = dom_iommu(domain);
@@ -326,10 +327,10 @@ static uint64_t addr_to_dma_page_maddr(struct domain *domain, daddr_t addr,
     u64 pte_maddr = 0;
 
     addr &= (((u64)1) << addr_width) - 1;
-    ASSERT(spin_is_locked(&hd->arch.mapping_lock));
+    ASSERT(spin_is_locked(&ctx->arch.mapping_lock));
     ASSERT(target || !alloc);
 
-    if ( !hd->arch.vtd.pgd_maddr )
+    if ( !ctx->arch.vtd.pgd_maddr )
     {
         struct page_info *pg;
 
@@ -337,13 +338,13 @@ static uint64_t addr_to_dma_page_maddr(struct domain *domain, daddr_t addr,
             goto out;
 
         pte_maddr = level;
-        if ( !(pg = iommu_alloc_pgtable(hd, 0)) )
+        if ( !(pg = iommu_alloc_pgtable(hd, ctx, 0)) )
             goto out;
 
-        hd->arch.vtd.pgd_maddr = page_to_maddr(pg);
+        ctx->arch.vtd.pgd_maddr = page_to_maddr(pg);
     }
 
-    pte_maddr = hd->arch.vtd.pgd_maddr;
+    pte_maddr = ctx->arch.vtd.pgd_maddr;
     parent = map_vtd_domain_page(pte_maddr);
     while ( level > target )
     {
@@ -379,7 +380,7 @@ static uint64_t addr_to_dma_page_maddr(struct domain *domain, daddr_t addr,
             }
 
             pte_maddr = level - 1;
-            pg = iommu_alloc_pgtable(hd, DMA_PTE_CONTIG_MASK);
+            pg = iommu_alloc_pgtable(hd, ctx, DMA_PTE_CONTIG_MASK);
             if ( !pg )
                 break;
 
@@ -431,13 +432,12 @@ static uint64_t addr_to_dma_page_maddr(struct domain *domain, daddr_t addr,
     return pte_maddr;
 }
 
-static paddr_t domain_pgd_maddr(struct domain *d, paddr_t pgd_maddr,
-                                unsigned int nr_pt_levels)
+static paddr_t domain_pgd_maddr(struct domain *d, struct iommu_context *ctx,
+                                paddr_t pgd_maddr, unsigned int nr_pt_levels)
 {
-    struct domain_iommu *hd = dom_iommu(d);
     unsigned int agaw;
 
-    ASSERT(spin_is_locked(&hd->arch.mapping_lock));
+    ASSERT(spin_is_locked(&ctx->arch.mapping_lock));
 
     if ( pgd_maddr )
         /* nothing */;
@@ -449,19 +449,19 @@ static paddr_t domain_pgd_maddr(struct domain *d, paddr_t pgd_maddr,
     }
     else
     {
-        if ( !hd->arch.vtd.pgd_maddr )
+        if ( !ctx->arch.vtd.pgd_maddr )
         {
             /*
              * Ensure we have pagetables allocated down to the smallest
              * level the loop below may need to run to.
              */
-            addr_to_dma_page_maddr(d, 0, min_pt_levels, NULL, true);
+            addr_to_dma_page_maddr(d, ctx, 0, min_pt_levels, NULL, true);
 
-            if ( !hd->arch.vtd.pgd_maddr )
+            if ( !ctx->arch.vtd.pgd_maddr )
                 return 0;
         }
 
-        pgd_maddr = hd->arch.vtd.pgd_maddr;
+        pgd_maddr = ctx->arch.vtd.pgd_maddr;
     }
 
     /* Skip top level(s) of page tables for less-than-maximum level DRHDs. */
@@ -734,7 +734,7 @@ static int __must_check cf_check iommu_flush_iotlb(struct domain *d, dfn_t dfn,
                                                    unsigned long page_count,
                                                    unsigned int flush_flags)
 {
-    struct domain_iommu *hd = dom_iommu(d);
+    struct iommu_context *ctx = iommu_default_context(d);
     struct acpi_drhd_unit *drhd;
     struct vtd_iommu *iommu;
     bool flush_dev_iotlb;
@@ -762,7 +762,7 @@ static int __must_check cf_check iommu_flush_iotlb(struct domain *d, dfn_t dfn,
 
         iommu = drhd->iommu;
 
-        if ( !test_bit(iommu->index, hd->arch.vtd.iommu_bitmap) )
+        if ( !test_bit(iommu->index, ctx->arch.vtd.iommu_bitmap) )
             continue;
 
         flush_dev_iotlb = !!find_ats_dev_drhd(iommu);
@@ -790,7 +790,8 @@ static int __must_check cf_check iommu_flush_iotlb(struct domain *d, dfn_t dfn,
     return ret;
 }
 
-static void queue_free_pt(struct domain_iommu *hd, mfn_t mfn, unsigned int level)
+static void queue_free_pt(struct domain *d, struct iommu_context *ctx, mfn_t mfn,
+                          unsigned int level)
 {
     if ( level > 1 )
     {
@@ -799,13 +800,13 @@ static void queue_free_pt(struct domain_iommu *hd, mfn_t mfn, unsigned int level
 
         for ( i = 0; i < PTE_NUM; ++i )
             if ( dma_pte_present(pt[i]) && !dma_pte_superpage(pt[i]) )
-                queue_free_pt(hd, maddr_to_mfn(dma_pte_addr(pt[i])),
+                queue_free_pt(d, ctx, maddr_to_mfn(dma_pte_addr(pt[i])),
                               level - 1);
 
         unmap_domain_page(pt);
     }
 
-    iommu_queue_free_pgtable(hd, mfn_to_page(mfn));
+    iommu_queue_free_pgtable(d, ctx, mfn_to_page(mfn));
 }
 
 static int iommu_set_root_entry(struct vtd_iommu *iommu)
@@ -1435,10 +1436,11 @@ void __init iommu_free(struct acpi_drhd_unit *drhd)
 static int cf_check intel_iommu_domain_init(struct domain *d)
 {
     struct domain_iommu *hd = dom_iommu(d);
+    struct iommu_context *ctx = iommu_default_context(d);
 
-    hd->arch.vtd.iommu_bitmap = xzalloc_array(unsigned long,
+    ctx->arch.vtd.iommu_bitmap = xzalloc_array(unsigned long,
                                               BITS_TO_LONGS(nr_iommus));
-    if ( !hd->arch.vtd.iommu_bitmap )
+    if ( !ctx->arch.vtd.iommu_bitmap )
         return -ENOMEM;
 
     hd->arch.vtd.agaw = width_to_agaw(DEFAULT_DOMAIN_ADDRESS_WIDTH);
@@ -1479,11 +1481,11 @@ static void __hwdom_init cf_check intel_iommu_hwdom_init(struct domain *d)
  */
 int domain_context_mapping_one(
     struct domain *domain,
+    struct iommu_context *ctx,
     struct vtd_iommu *iommu,
     uint8_t bus, uint8_t devfn, const struct pci_dev *pdev,
     domid_t domid, paddr_t pgd_maddr, unsigned int mode)
 {
-    struct domain_iommu *hd = dom_iommu(domain);
     struct context_entry *context, *context_entries, lctxt;
     __uint128_t res, old;
     uint64_t maddr;
@@ -1531,12 +1533,12 @@ int domain_context_mapping_one(
     {
         paddr_t root;
 
-        spin_lock(&hd->arch.mapping_lock);
+        spin_lock(&ctx->arch.mapping_lock);
 
-        root = domain_pgd_maddr(domain, pgd_maddr, iommu->nr_pt_levels);
+        root = domain_pgd_maddr(domain, ctx, pgd_maddr, iommu->nr_pt_levels);
         if ( !root )
         {
-            spin_unlock(&hd->arch.mapping_lock);
+            spin_unlock(&ctx->arch.mapping_lock);
             spin_unlock(&iommu->lock);
             unmap_vtd_domain_page(context_entries);
             if ( prev_dom )
@@ -1550,7 +1552,7 @@ int domain_context_mapping_one(
         else
             context_set_translation_type(lctxt, CONTEXT_TT_MULTI_LEVEL);
 
-        spin_unlock(&hd->arch.mapping_lock);
+        spin_unlock(&ctx->arch.mapping_lock);
     }
 
     rc = context_set_domain_id(&lctxt, domid, iommu);
@@ -1624,7 +1626,7 @@ int domain_context_mapping_one(
     if ( rc > 0 )
         rc = 0;
 
-    set_bit(iommu->index, hd->arch.vtd.iommu_bitmap);
+    set_bit(iommu->index, ctx->arch.vtd.iommu_bitmap);
 
     unmap_vtd_domain_page(context_entries);
 
@@ -1642,7 +1644,7 @@ int domain_context_mapping_one(
              (prev_dom == dom_io && !pdev) )
             ret = domain_context_unmap_one(domain, iommu, bus, devfn);
         else
-            ret = domain_context_mapping_one(prev_dom, iommu, bus, devfn, pdev,
+            ret = domain_context_mapping_one(prev_dom, ctx, iommu, bus, devfn, pdev,
                                              DEVICE_DOMID(prev_dom, pdev),
                                              DEVICE_PGTABLE(prev_dom, pdev),
                                              (mode & MAP_WITH_RMRR) |
@@ -1661,8 +1663,8 @@ int domain_context_mapping_one(
 static const struct acpi_drhd_unit *domain_context_unmap(
     struct domain *d, uint8_t devfn, struct pci_dev *pdev);
 
-static int domain_context_mapping(struct domain *domain, u8 devfn,
-                                  struct pci_dev *pdev)
+static int domain_context_mapping(struct domain *domain, struct iommu_context *ctx,
+                                  u8 devfn, struct pci_dev *pdev)
 {
     const struct acpi_drhd_unit *drhd = acpi_find_matched_drhd_unit(pdev);
     const struct acpi_rmrr_unit *rmrr;
@@ -1731,7 +1733,7 @@ static int domain_context_mapping(struct domain *domain, u8 devfn,
         if ( iommu_debug )
             printk(VTDPREFIX "%pd:PCIe: map %pp\n",
                    domain, &PCI_SBDF(seg, bus, devfn));
-        ret = domain_context_mapping_one(domain, drhd->iommu, bus, devfn, pdev,
+        ret = domain_context_mapping_one(domain, ctx, drhd->iommu, bus, devfn, pdev,
                                          DEVICE_DOMID(domain, pdev), pgd_maddr,
                                          mode);
         if ( ret > 0 )
@@ -1757,7 +1759,7 @@ static int domain_context_mapping(struct domain *domain, u8 devfn,
             printk(VTDPREFIX "%pd:PCI: map %pp\n",
                    domain, &PCI_SBDF(seg, bus, devfn));
 
-        ret = domain_context_mapping_one(domain, drhd->iommu, bus, devfn,
+        ret = domain_context_mapping_one(domain, ctx, drhd->iommu, bus, devfn,
                                          pdev, DEVICE_DOMID(domain, pdev),
                                          pgd_maddr, mode);
         if ( ret < 0 )
@@ -1788,7 +1790,7 @@ static int domain_context_mapping(struct domain *domain, u8 devfn,
          * their owner would be the wrong one. Pass NULL instead.
          */
         if ( ret >= 0 )
-            ret = domain_context_mapping_one(domain, drhd->iommu, bus, devfn,
+            ret = domain_context_mapping_one(domain, ctx, drhd->iommu, bus, devfn,
                                              NULL, DEVICE_DOMID(domain, pdev),
                                              pgd_maddr, mode);
 
@@ -1804,7 +1806,7 @@ static int domain_context_mapping(struct domain *domain, u8 devfn,
          */
         if ( !ret && pdev_type(seg, bus, devfn) == DEV_TYPE_PCIe2PCI_BRIDGE &&
              (secbus != pdev->bus || pdev->devfn != 0) )
-            ret = domain_context_mapping_one(domain, drhd->iommu, secbus, 0,
+            ret = domain_context_mapping_one(domain, ctx, drhd->iommu, secbus, 0,
                                              NULL, DEVICE_DOMID(domain, pdev),
                                              pgd_maddr, mode);
 
@@ -1813,7 +1815,7 @@ static int domain_context_mapping(struct domain *domain, u8 devfn,
             if ( !prev_present )
                 domain_context_unmap(domain, devfn, pdev);
             else if ( pdev->domain != domain ) /* Avoid infinite recursion. */
-                domain_context_mapping(pdev->domain, devfn, pdev);
+                domain_context_mapping(pdev->domain, ctx, devfn, pdev);
         }
 
         break;
@@ -2001,44 +2003,44 @@ static const struct acpi_drhd_unit *domain_context_unmap(
 
 static void cf_check iommu_clear_root_pgtable(struct domain *d)
 {
-    struct domain_iommu *hd = dom_iommu(d);
+    struct iommu_context *ctx = iommu_default_context(d);
 
-    spin_lock(&hd->arch.mapping_lock);
-    hd->arch.vtd.pgd_maddr = 0;
-    spin_unlock(&hd->arch.mapping_lock);
+    spin_lock(&ctx->arch.mapping_lock);
+    ctx->arch.vtd.pgd_maddr = 0;
+    spin_unlock(&ctx->arch.mapping_lock);
 }
 
 static void cf_check iommu_domain_teardown(struct domain *d)
 {
-    struct domain_iommu *hd = dom_iommu(d);
+    struct iommu_context *ctx = iommu_default_context(d);
     const struct acpi_drhd_unit *drhd;
 
     if ( list_empty(&acpi_drhd_units) )
         return;
 
-    iommu_identity_map_teardown(d);
+    iommu_identity_map_teardown(d, ctx);
 
-    ASSERT(!hd->arch.vtd.pgd_maddr);
+    ASSERT(!ctx->arch.vtd.pgd_maddr);
 
     for_each_drhd_unit ( drhd )
         cleanup_domid_map(d->domain_id, drhd->iommu);
 
-    XFREE(hd->arch.vtd.iommu_bitmap);
+    XFREE(ctx->arch.vtd.iommu_bitmap);
 }
 
 static void quarantine_teardown(struct pci_dev *pdev,
                                 const struct acpi_drhd_unit *drhd)
 {
-    struct domain_iommu *hd = dom_iommu(dom_io);
+    struct iommu_context *ctx = iommu_default_context(dom_io);
 
     ASSERT(pcidevs_locked());
 
     if ( !pdev->arch.vtd.pgd_maddr )
         return;
 
-    ASSERT(page_list_empty(&hd->arch.pgtables.list));
-    page_list_move(&hd->arch.pgtables.list, &pdev->arch.pgtables_list);
-    while ( iommu_free_pgtables(dom_io) == -ERESTART )
+    ASSERT(page_list_empty(&ctx->arch.pgtables));
+    page_list_move(&ctx->arch.pgtables, &pdev->arch.pgtables_list);
+    while ( iommu_free_pgtables(dom_io, ctx) == -ERESTART )
         /* nothing */;
     pdev->arch.vtd.pgd_maddr = 0;
 
@@ -2051,6 +2053,7 @@ static int __must_check cf_check intel_iommu_map_page(
     unsigned int *flush_flags)
 {
     struct domain_iommu *hd = dom_iommu(d);
+    struct iommu_context *ctx = iommu_default_context(d);
     struct dma_pte *page, *pte, old, new = {};
     u64 pg_maddr;
     unsigned int level = (IOMMUF_order(flags) / LEVEL_STRIDE) + 1;
@@ -2067,7 +2070,7 @@ static int __must_check cf_check intel_iommu_map_page(
     if ( iommu_hwdom_passthrough && is_hardware_domain(d) )
         return 0;
 
-    spin_lock(&hd->arch.mapping_lock);
+    spin_lock(&ctx->arch.mapping_lock);
 
     /*
      * IOMMU mapping request can be safely ignored when the domain is dying.
@@ -2077,15 +2080,15 @@ static int __must_check cf_check intel_iommu_map_page(
      */
     if ( d->is_dying )
     {
-        spin_unlock(&hd->arch.mapping_lock);
+        spin_unlock(&ctx->arch.mapping_lock);
         return 0;
     }
 
-    pg_maddr = addr_to_dma_page_maddr(d, dfn_to_daddr(dfn), level, flush_flags,
+    pg_maddr = addr_to_dma_page_maddr(d, ctx, dfn_to_daddr(dfn), level, flush_flags,
                                       true);
     if ( pg_maddr < PAGE_SIZE )
     {
-        spin_unlock(&hd->arch.mapping_lock);
+        spin_unlock(&ctx->arch.mapping_lock);
         return -ENOMEM;
     }
 
@@ -2106,7 +2109,7 @@ static int __must_check cf_check intel_iommu_map_page(
 
     if ( !((old.val ^ new.val) & ~DMA_PTE_CONTIG_MASK) )
     {
-        spin_unlock(&hd->arch.mapping_lock);
+        spin_unlock(&ctx->arch.mapping_lock);
         unmap_vtd_domain_page(page);
         return 0;
     }
@@ -2135,7 +2138,7 @@ static int __must_check cf_check intel_iommu_map_page(
         new.val &= ~(LEVEL_MASK << level_to_offset_bits(level));
         dma_set_pte_superpage(new);
 
-        pg_maddr = addr_to_dma_page_maddr(d, dfn_to_daddr(dfn), ++level,
+        pg_maddr = addr_to_dma_page_maddr(d, ctx, dfn_to_daddr(dfn), ++level,
                                           flush_flags, false);
         BUG_ON(pg_maddr < PAGE_SIZE);
 
@@ -2145,11 +2148,11 @@ static int __must_check cf_check intel_iommu_map_page(
         iommu_sync_cache(pte, sizeof(*pte));
 
         *flush_flags |= IOMMU_FLUSHF_modified | IOMMU_FLUSHF_all;
-        iommu_queue_free_pgtable(hd, pg);
+        iommu_queue_free_pgtable(d, ctx, pg);
         perfc_incr(iommu_pt_coalesces);
     }
 
-    spin_unlock(&hd->arch.mapping_lock);
+    spin_unlock(&ctx->arch.mapping_lock);
     unmap_vtd_domain_page(page);
 
     *flush_flags |= IOMMU_FLUSHF_added;
@@ -2158,7 +2161,7 @@ static int __must_check cf_check intel_iommu_map_page(
         *flush_flags |= IOMMU_FLUSHF_modified;
 
         if ( IOMMUF_order(flags) && !dma_pte_superpage(old) )
-            queue_free_pt(hd, maddr_to_mfn(dma_pte_addr(old)),
+            queue_free_pt(d, ctx, maddr_to_mfn(dma_pte_addr(old)),
                           IOMMUF_order(flags) / LEVEL_STRIDE);
     }
 
@@ -2169,6 +2172,7 @@ static int __must_check cf_check intel_iommu_unmap_page(
     struct domain *d, dfn_t dfn, unsigned int order, unsigned int *flush_flags)
 {
     struct domain_iommu *hd = dom_iommu(d);
+    struct iommu_context *ctx = iommu_default_context(d);
     daddr_t addr = dfn_to_daddr(dfn);
     struct dma_pte *page = NULL, *pte = NULL, old;
     uint64_t pg_maddr;
@@ -2188,12 +2192,12 @@ static int __must_check cf_check intel_iommu_unmap_page(
     if ( iommu_hwdom_passthrough && is_hardware_domain(d) )
         return 0;
 
-    spin_lock(&hd->arch.mapping_lock);
+    spin_lock(&ctx->arch.mapping_lock);
     /* get target level pte */
-    pg_maddr = addr_to_dma_page_maddr(d, addr, level, flush_flags, false);
+    pg_maddr = addr_to_dma_page_maddr(d, ctx, addr, level, flush_flags, false);
     if ( pg_maddr < PAGE_SIZE )
     {
-        spin_unlock(&hd->arch.mapping_lock);
+        spin_unlock(&ctx->arch.mapping_lock);
         return pg_maddr ? -ENOMEM : 0;
     }
 
@@ -2202,7 +2206,7 @@ static int __must_check cf_check intel_iommu_unmap_page(
 
     if ( !dma_pte_present(*pte) )
     {
-        spin_unlock(&hd->arch.mapping_lock);
+        spin_unlock(&ctx->arch.mapping_lock);
         unmap_vtd_domain_page(page);
         return 0;
     }
@@ -2220,7 +2224,7 @@ static int __must_check cf_check intel_iommu_unmap_page(
 
         unmap_vtd_domain_page(page);
 
-        pg_maddr = addr_to_dma_page_maddr(d, addr, level, flush_flags, false);
+        pg_maddr = addr_to_dma_page_maddr(d, ctx, addr, level, flush_flags, false);
         BUG_ON(pg_maddr < PAGE_SIZE);
 
         page = map_vtd_domain_page(pg_maddr);
@@ -2229,18 +2233,18 @@ static int __must_check cf_check intel_iommu_unmap_page(
         iommu_sync_cache(pte, sizeof(*pte));
 
         *flush_flags |= IOMMU_FLUSHF_all;
-        iommu_queue_free_pgtable(hd, pg);
+        iommu_queue_free_pgtable(d, ctx, pg);
         perfc_incr(iommu_pt_coalesces);
     }
 
-    spin_unlock(&hd->arch.mapping_lock);
+    spin_unlock(&ctx->arch.mapping_lock);
 
     unmap_vtd_domain_page(page);
 
     *flush_flags |= IOMMU_FLUSHF_modified;
 
     if ( order && !dma_pte_superpage(old) )
-        queue_free_pt(hd, maddr_to_mfn(dma_pte_addr(old)),
+        queue_free_pt(d, ctx, maddr_to_mfn(dma_pte_addr(old)),
                       order / LEVEL_STRIDE);
 
     return 0;
@@ -2249,7 +2253,7 @@ static int __must_check cf_check intel_iommu_unmap_page(
 static int cf_check intel_iommu_lookup_page(
     struct domain *d, dfn_t dfn, mfn_t *mfn, unsigned int *flags)
 {
-    struct domain_iommu *hd = dom_iommu(d);
+    struct iommu_context *ctx = iommu_default_context(d);
     uint64_t val;
 
     /*
@@ -2260,11 +2264,11 @@ static int cf_check intel_iommu_lookup_page(
          (iommu_hwdom_passthrough && is_hardware_domain(d)) )
         return -EOPNOTSUPP;
 
-    spin_lock(&hd->arch.mapping_lock);
+    spin_lock(&ctx->arch.mapping_lock);
 
-    val = addr_to_dma_page_maddr(d, dfn_to_daddr(dfn), 0, NULL, false);
+    val = addr_to_dma_page_maddr(d, ctx, dfn_to_daddr(dfn), 0, NULL, false);
 
-    spin_unlock(&hd->arch.mapping_lock);
+    spin_unlock(&ctx->arch.mapping_lock);
 
     if ( val < PAGE_SIZE )
         return -ENOENT;
@@ -2285,7 +2289,7 @@ static bool __init vtd_ept_page_compatible(const struct vtd_iommu *iommu)
 
     /* EPT is not initialised yet, so we must check the capability in
      * the MSR explicitly rather than use cpu_has_vmx_ept_*() */
-    if ( rdmsr_safe(MSR_IA32_VMX_EPT_VPID_CAP, ept_cap) != 0 ) 
+    if ( rdmsr_safe(MSR_IA32_VMX_EPT_VPID_CAP, ept_cap) != 0 )
         return false;
 
     return (ept_has_2mb(ept_cap) && opt_hap_2mb) <=
@@ -2297,6 +2301,7 @@ static bool __init vtd_ept_page_compatible(const struct vtd_iommu *iommu)
 static int cf_check intel_iommu_add_device(u8 devfn, struct pci_dev *pdev)
 {
     struct acpi_rmrr_unit *rmrr;
+    struct iommu_context *ctx;
     u16 bdf;
     int ret, i;
 
@@ -2305,6 +2310,8 @@ static int cf_check intel_iommu_add_device(u8 devfn, struct pci_dev *pdev)
     if ( !pdev->domain )
         return -EINVAL;
 
+    ctx = iommu_default_context(pdev->domain);
+
     for_each_rmrr_device ( rmrr, bdf, i )
     {
         if ( rmrr->segment == pdev->seg && bdf == PCI_BDF(pdev->bus, devfn) )
@@ -2315,7 +2322,7 @@ static int cf_check intel_iommu_add_device(u8 devfn, struct pci_dev *pdev)
              * Since RMRRs are always reserved in the e820 map for the hardware
              * domain, there shouldn't be a conflict.
              */
-            ret = iommu_identity_mapping(pdev->domain, p2m_access_rw,
+            ret = iommu_identity_mapping(pdev->domain, ctx, p2m_access_rw,
                                          rmrr->base_address, rmrr->end_address,
                                          0);
             if ( ret )
@@ -2324,7 +2331,7 @@ static int cf_check intel_iommu_add_device(u8 devfn, struct pci_dev *pdev)
         }
     }
 
-    ret = domain_context_mapping(pdev->domain, devfn, pdev);
+    ret = domain_context_mapping(pdev->domain, ctx, devfn, pdev);
     if ( ret )
         dprintk(XENLOG_ERR VTDPREFIX, "%pd: context mapping failed\n",
                 pdev->domain);
@@ -2353,10 +2360,13 @@ static int cf_check intel_iommu_remove_device(u8 devfn, struct pci_dev *pdev)
     struct acpi_rmrr_unit *rmrr;
     u16 bdf;
     unsigned int i;
+    struct iommu_context *ctx;
 
     if ( !pdev->domain )
         return -EINVAL;
 
+    ctx = iommu_default_context(pdev->domain);
+
     drhd = domain_context_unmap(pdev->domain, devfn, pdev);
     if ( IS_ERR(drhd) )
         return PTR_ERR(drhd);
@@ -2370,7 +2380,7 @@ static int cf_check intel_iommu_remove_device(u8 devfn, struct pci_dev *pdev)
          * Any flag is nothing to clear these mappings but here
          * its always safe and strict to set 0.
          */
-        iommu_identity_mapping(pdev->domain, p2m_access_x, rmrr->base_address,
+        iommu_identity_mapping(pdev->domain, ctx, p2m_access_x, rmrr->base_address,
                                rmrr->end_address, 0);
     }
 
@@ -2389,7 +2399,9 @@ static int cf_check intel_iommu_remove_device(u8 devfn, struct pci_dev *pdev)
 static int __hwdom_init cf_check setup_hwdom_device(
     u8 devfn, struct pci_dev *pdev)
 {
-    return domain_context_mapping(pdev->domain, devfn, pdev);
+    struct iommu_context *ctx = iommu_default_context(pdev->domain);
+
+    return domain_context_mapping(pdev->domain, ctx, devfn, pdev);
 }
 
 void clear_fault_bits(struct vtd_iommu *iommu)
@@ -2483,7 +2495,7 @@ static int __must_check init_vtd_hw(bool resume)
 
     /*
      * Enable queue invalidation
-     */   
+     */
     for_each_drhd_unit ( drhd )
     {
         iommu = drhd->iommu;
@@ -2504,7 +2516,7 @@ static int __must_check init_vtd_hw(bool resume)
 
     /*
      * Enable interrupt remapping
-     */  
+     */
     if ( iommu_intremap != iommu_intremap_off )
     {
         int apic;
@@ -2561,6 +2573,7 @@ static int __must_check init_vtd_hw(bool resume)
 
 static void __hwdom_init setup_hwdom_rmrr(struct domain *d)
 {
+    struct iommu_context *ctx = iommu_default_context(d);
     struct acpi_rmrr_unit *rmrr;
     u16 bdf;
     int ret, i;
@@ -2574,7 +2587,7 @@ static void __hwdom_init setup_hwdom_rmrr(struct domain *d)
          * domain, there shouldn't be a conflict. So its always safe and
          * strict to set 0.
          */
-        ret = iommu_identity_mapping(d, p2m_access_rw, rmrr->base_address,
+        ret = iommu_identity_mapping(d, ctx, p2m_access_rw, rmrr->base_address,
                                      rmrr->end_address, 0);
         if ( ret )
             dprintk(XENLOG_ERR VTDPREFIX,
@@ -2739,6 +2752,8 @@ static int cf_check reassign_device_ownership(
 
     if ( !QUARANTINE_SKIP(target, pdev->arch.vtd.pgd_maddr) )
     {
+        struct iommu_context *target_ctx = iommu_default_context(target);
+
         if ( !has_arch_pdevs(target) )
             vmx_pi_hooks_assign(target);
 
@@ -2753,7 +2768,7 @@ static int cf_check reassign_device_ownership(
             untrusted_msi = true;
 #endif
 
-        ret = domain_context_mapping(target, devfn, pdev);
+        ret = domain_context_mapping(target, target_ctx, devfn, pdev);
 
         if ( !ret && pdev->devfn == devfn &&
              !QUARANTINE_SKIP(source, pdev->arch.vtd.pgd_maddr) )
@@ -2802,6 +2817,7 @@ static int cf_check reassign_device_ownership(
     if ( !is_hardware_domain(source) )
     {
         const struct acpi_rmrr_unit *rmrr;
+        struct iommu_context *ctx = iommu_default_context(source);
         u16 bdf;
         unsigned int i;
 
@@ -2813,7 +2829,7 @@ static int cf_check reassign_device_ownership(
                  * Any RMRR flag is always ignored when remove a device,
                  * but its always safe and strict to set 0.
                  */
-                ret = iommu_identity_mapping(source, p2m_access_x,
+                ret = iommu_identity_mapping(source, ctx, p2m_access_x,
                                              rmrr->base_address,
                                              rmrr->end_address, 0);
                 if ( ret && ret != -ENOENT )
@@ -2828,6 +2844,7 @@ static int cf_check intel_iommu_assign_device(
     struct domain *d, u8 devfn, struct pci_dev *pdev, u32 flag)
 {
     struct domain *s = pdev->domain;
+    struct iommu_context *ctx = iommu_default_context(d);
     struct acpi_rmrr_unit *rmrr;
     int ret = 0, i;
     u16 bdf, seg;
@@ -2875,7 +2892,7 @@ static int cf_check intel_iommu_assign_device(
     {
         if ( rmrr->segment == seg && bdf == PCI_BDF(bus, devfn) )
         {
-            ret = iommu_identity_mapping(d, p2m_access_rw, rmrr->base_address,
+            ret = iommu_identity_mapping(d, ctx, p2m_access_rw, rmrr->base_address,
                                          rmrr->end_address, flag);
             if ( ret )
             {
@@ -2898,7 +2915,7 @@ static int cf_check intel_iommu_assign_device(
     {
         if ( rmrr->segment == seg && bdf == PCI_BDF(bus, devfn) )
         {
-            int rc = iommu_identity_mapping(d, p2m_access_x,
+            int rc = iommu_identity_mapping(d, ctx, p2m_access_x,
                                             rmrr->base_address,
                                             rmrr->end_address, 0);
 
@@ -3071,10 +3088,11 @@ static void vtd_dump_page_table_level(paddr_t pt_maddr, int level, paddr_t gpa,
 static void cf_check vtd_dump_page_tables(struct domain *d)
 {
     const struct domain_iommu *hd = dom_iommu(d);
+    struct iommu_context *ctx = iommu_default_context(d);
 
     printk(VTDPREFIX" %pd table has %d levels\n", d,
            agaw_to_level(hd->arch.vtd.agaw));
-    vtd_dump_page_table_level(hd->arch.vtd.pgd_maddr,
+    vtd_dump_page_table_level(ctx->arch.vtd.pgd_maddr,
                               agaw_to_level(hd->arch.vtd.agaw), 0, 0);
 }
 
@@ -3082,6 +3100,7 @@ static int fill_qpt(struct dma_pte *this, unsigned int level,
                     struct page_info *pgs[6])
 {
     struct domain_iommu *hd = dom_iommu(dom_io);
+    struct iommu_context *ctx = iommu_default_context(dom_io);
     unsigned int i;
     int rc = 0;
 
@@ -3098,7 +3117,7 @@ static int fill_qpt(struct dma_pte *this, unsigned int level,
                  * page table pages, and the resulting allocations are always
                  * zeroed.
                  */
-                pgs[level] = iommu_alloc_pgtable(hd, 0);
+                pgs[level] = iommu_alloc_pgtable(hd, ctx, 0);
                 if ( !pgs[level] )
                 {
                     rc = -ENOMEM;
@@ -3132,6 +3151,7 @@ static int cf_check intel_iommu_quarantine_init(struct pci_dev *pdev,
                                                 bool scratch_page)
 {
     struct domain_iommu *hd = dom_iommu(dom_io);
+    struct iommu_context *ctx = iommu_default_context(dom_io);
     struct page_info *pg;
     unsigned int agaw = hd->arch.vtd.agaw;
     unsigned int level = agaw_to_level(agaw);
@@ -3142,8 +3162,8 @@ static int cf_check intel_iommu_quarantine_init(struct pci_dev *pdev,
     int rc;
 
     ASSERT(pcidevs_locked());
-    ASSERT(!hd->arch.vtd.pgd_maddr);
-    ASSERT(page_list_empty(&hd->arch.pgtables.list));
+    ASSERT(!ctx->arch.vtd.pgd_maddr);
+    ASSERT(page_list_empty(&ctx->arch.pgtables));
 
     if ( pdev->arch.vtd.pgd_maddr )
     {
@@ -3155,14 +3175,14 @@ static int cf_check intel_iommu_quarantine_init(struct pci_dev *pdev,
     if ( !drhd )
         return -ENODEV;
 
-    pg = iommu_alloc_pgtable(hd, 0);
+    pg = iommu_alloc_pgtable(hd, ctx, 0);
     if ( !pg )
         return -ENOMEM;
 
     rc = context_set_domain_id(NULL, pdev->arch.pseudo_domid, drhd->iommu);
 
     /* Transiently install the root into DomIO, for iommu_identity_mapping(). */
-    hd->arch.vtd.pgd_maddr = page_to_maddr(pg);
+    ctx->arch.vtd.pgd_maddr = page_to_maddr(pg);
 
     for_each_rmrr_device ( rmrr, bdf, i )
     {
@@ -3173,7 +3193,7 @@ static int cf_check intel_iommu_quarantine_init(struct pci_dev *pdev,
         {
             rmrr_found = true;
 
-            rc = iommu_identity_mapping(dom_io, p2m_access_rw,
+            rc = iommu_identity_mapping(dom_io, ctx, p2m_access_rw,
                                         rmrr->base_address, rmrr->end_address,
                                         0);
             if ( rc )
@@ -3183,8 +3203,8 @@ static int cf_check intel_iommu_quarantine_init(struct pci_dev *pdev,
         }
     }
 
-    iommu_identity_map_teardown(dom_io);
-    hd->arch.vtd.pgd_maddr = 0;
+    iommu_identity_map_teardown(dom_io, ctx);
+    ctx->arch.vtd.pgd_maddr = 0;
     pdev->arch.vtd.pgd_maddr = page_to_maddr(pg);
 
     if ( !rc && scratch_page )
@@ -3199,7 +3219,7 @@ static int cf_check intel_iommu_quarantine_init(struct pci_dev *pdev,
         pdev->arch.leaf_mfn = page_to_mfn(pgs[0]);
     }
 
-    page_list_move(&pdev->arch.pgtables_list, &hd->arch.pgtables.list);
+    page_list_move(&pdev->arch.pgtables_list, &ctx->arch.pgtables);
 
     if ( rc || (!scratch_page && !rmrr_found) )
         quarantine_teardown(pdev, drhd);
diff --git a/xen/drivers/passthrough/vtd/quirks.c b/xen/drivers/passthrough/vtd/quirks.c
index dc3dac749c..7937eb8c2b 100644
--- a/xen/drivers/passthrough/vtd/quirks.c
+++ b/xen/drivers/passthrough/vtd/quirks.c
@@ -422,7 +422,8 @@ static int __must_check map_me_phantom_function(struct domain *domain,
 
     /* map or unmap ME phantom function */
     if ( !(mode & UNMAP_ME_PHANTOM_FUNC) )
-        rc = domain_context_mapping_one(domain, drhd->iommu, 0,
+        rc = domain_context_mapping_one(domain, iommu_default_context(domain),
+                                        drhd->iommu, 0,
                                         PCI_DEVFN(dev, 7), NULL,
                                         domid, pgd_maddr, mode);
     else
diff --git a/xen/drivers/passthrough/x86/iommu.c b/xen/drivers/passthrough/x86/iommu.c
index 8b1e0596b8..4a3fe059cb 100644
--- a/xen/drivers/passthrough/x86/iommu.c
+++ b/xen/drivers/passthrough/x86/iommu.c
@@ -19,6 +19,7 @@
 #include <xen/paging.h>
 #include <xen/guest_access.h>
 #include <xen/event.h>
+#include <xen/spinlock.h>
 #include <xen/softirq.h>
 #include <xen/vm_event.h>
 #include <xsm/xsm.h>
@@ -185,26 +186,31 @@ void __hwdom_init arch_iommu_check_autotranslated_hwdom(struct domain *d)
 
 int arch_iommu_domain_init(struct domain *d)
 {
-    struct domain_iommu *hd = dom_iommu(d);
+    INIT_PAGE_LIST_HEAD(&dom_iommu(d)->arch.free_queue);
+    return 0;
+}
 
-    spin_lock_init(&hd->arch.mapping_lock);
+int arch_iommu_context_init(struct domain *d, struct iommu_context *ctx, u32 flags)
+{
+    spin_lock_init(&ctx->arch.mapping_lock);
 
-    INIT_PAGE_LIST_HEAD(&hd->arch.pgtables.list);
-    spin_lock_init(&hd->arch.pgtables.lock);
-    INIT_LIST_HEAD(&hd->arch.identity_maps);
+    INIT_PAGE_LIST_HEAD(&ctx->arch.pgtables);
+    INIT_LIST_HEAD(&ctx->arch.identity_maps);
+
+    return 0;
+}
+
+int arch_iommu_context_teardown(struct domain *d, struct iommu_context *ctx, u32 flags)
+{
+    /* Cleanup all page tables */
+    while ( iommu_free_pgtables(d, ctx) == -ERESTART )
+        /* nothing */;
 
     return 0;
 }
 
 void arch_iommu_domain_destroy(struct domain *d)
 {
-    /*
-     * There should be not page-tables left allocated by the time the
-     * domain is destroyed. Note that arch_iommu_domain_destroy() is
-     * called unconditionally, so pgtables may be uninitialized.
-     */
-    ASSERT(!dom_iommu(d)->platform_ops ||
-           page_list_empty(&dom_iommu(d)->arch.pgtables.list));
 }
 
 struct identity_map {
@@ -214,14 +220,13 @@ struct identity_map {
     unsigned int count;
 };
 
-int iommu_identity_mapping(struct domain *d, p2m_access_t p2ma,
-                           paddr_t base, paddr_t end,
+int iommu_identity_mapping(struct domain *d, struct iommu_context *ctx,
+                           p2m_access_t p2ma, paddr_t base, paddr_t end,
                            unsigned int flag)
 {
     unsigned long base_pfn = base >> PAGE_SHIFT_4K;
     unsigned long end_pfn = PAGE_ALIGN_4K(end) >> PAGE_SHIFT_4K;
     struct identity_map *map;
-    struct domain_iommu *hd = dom_iommu(d);
 
     ASSERT(pcidevs_locked());
     ASSERT(base < end);
@@ -230,7 +235,7 @@ int iommu_identity_mapping(struct domain *d, p2m_access_t p2ma,
      * No need to acquire hd->arch.mapping_lock: Both insertion and removal
      * get done while holding pcidevs_lock.
      */
-    list_for_each_entry( map, &hd->arch.identity_maps, list )
+    list_for_each_entry( map, &ctx->arch.identity_maps, list )
     {
         if ( map->base == base && map->end == end )
         {
@@ -280,7 +285,7 @@ int iommu_identity_mapping(struct domain *d, p2m_access_t p2ma,
      * Insert into list ahead of mapping, so the range can be found when
      * trying to clean up.
      */
-    list_add_tail(&map->list, &hd->arch.identity_maps);
+    list_add_tail(&map->list, &ctx->arch.identity_maps);
 
     for ( ; base_pfn < end_pfn; ++base_pfn )
     {
@@ -300,12 +305,11 @@ int iommu_identity_mapping(struct domain *d, p2m_access_t p2ma,
     return 0;
 }
 
-void iommu_identity_map_teardown(struct domain *d)
+void iommu_identity_map_teardown(struct domain *d, struct iommu_context *ctx)
 {
-    struct domain_iommu *hd = dom_iommu(d);
     struct identity_map *map, *tmp;
 
-    list_for_each_entry_safe ( map, tmp, &hd->arch.identity_maps, list )
+    list_for_each_entry_safe ( map, tmp, &ctx->arch.identity_maps, list )
     {
         list_del(&map->list);
         xfree(map);
@@ -603,7 +607,7 @@ void iommu_free_domid(domid_t domid, unsigned long *map)
         BUG();
 }
 
-int iommu_free_pgtables(struct domain *d)
+int iommu_free_pgtables(struct domain *d, struct iommu_context *ctx)
 {
     struct domain_iommu *hd = dom_iommu(d);
     struct page_info *pg;
@@ -613,7 +617,7 @@ int iommu_free_pgtables(struct domain *d)
         return 0;
 
     /* After this barrier, no new IOMMU mappings can be inserted. */
-    spin_barrier(&hd->arch.mapping_lock);
+    spin_barrier(&ctx->arch.mapping_lock);
 
     /*
      * Pages will be moved to the free list below. So we want to
@@ -621,7 +625,7 @@ int iommu_free_pgtables(struct domain *d)
      */
     iommu_vcall(hd->platform_ops, clear_root_pgtable, d);
 
-    while ( (pg = page_list_remove_head(&hd->arch.pgtables.list)) )
+    while ( (pg = page_list_remove_head(&ctx->arch.pgtables)) )
     {
         free_domheap_page(pg);
 
@@ -633,6 +637,7 @@ int iommu_free_pgtables(struct domain *d)
 }
 
 struct page_info *iommu_alloc_pgtable(struct domain_iommu *hd,
+                                      struct iommu_context *ctx,
                                       uint64_t contig_mask)
 {
     unsigned int memflags = 0;
@@ -677,9 +682,7 @@ struct page_info *iommu_alloc_pgtable(struct domain_iommu *hd,
 
     unmap_domain_page(p);
 
-    spin_lock(&hd->arch.pgtables.lock);
-    page_list_add(pg, &hd->arch.pgtables.list);
-    spin_unlock(&hd->arch.pgtables.lock);
+    page_list_add(pg, &ctx->arch.pgtables);
 
     return pg;
 }
@@ -718,13 +721,12 @@ static void cf_check free_queued_pgtables(void *arg)
     }
 }
 
-void iommu_queue_free_pgtable(struct domain_iommu *hd, struct page_info *pg)
+void iommu_queue_free_pgtable(struct domain *d, struct iommu_context *ctx,
+                              struct page_info *pg)
 {
     unsigned int cpu = smp_processor_id();
 
-    spin_lock(&hd->arch.pgtables.lock);
-    page_list_del(pg, &hd->arch.pgtables.list);
-    spin_unlock(&hd->arch.pgtables.lock);
+    page_list_del(pg, &ctx->arch.pgtables);
 
     page_list_add_tail(pg, &per_cpu(free_pgt_list, cpu));
 
diff --git a/xen/include/xen/iommu.h b/xen/include/xen/iommu.h
index b928c67e19..11d23cdafb 100644
--- a/xen/include/xen/iommu.h
+++ b/xen/include/xen/iommu.h
@@ -343,9 +343,18 @@ extern int iommu_get_extra_reserved_device_memory(iommu_grdm_t *func,
 # define iommu_vcall iommu_call
 #endif
 
+struct iommu_context {
+#ifdef CONFIG_HAS_PASSTHROUGH
+    u16 id; /* Context id (0 means default context) */
+
+    struct arch_iommu_context arch;
+#endif
+};
+
 struct domain_iommu {
 #ifdef CONFIG_HAS_PASSTHROUGH
     struct arch_iommu arch;
+    struct iommu_context default_ctx;
 #endif
 
     /* iommu_ops */
@@ -380,6 +389,7 @@ struct domain_iommu {
 #define dom_iommu(d)              (&(d)->iommu)
 #define iommu_set_feature(d, f)   set_bit(f, dom_iommu(d)->features)
 #define iommu_clear_feature(d, f) clear_bit(f, dom_iommu(d)->features)
+#define iommu_default_context(d)  (&dom_iommu(d)->default_ctx) /* does not lock! */
 
 /* Are we using the domain P2M table as its IOMMU pagetable? */
 #define iommu_use_hap_pt(d)       (IS_ENABLED(CONFIG_HVM) && \
-- 
2.47.2



Teddy Astie | Vates XCP-ng Developer

XCP-ng & Xen Orchestra - Vates solutions

web: https://vates.tech



* [XEN RFC PATCH v6 08/11] iommu: Introduce redesigned IOMMU subsystem
  2025-02-17 10:18 [XEN RFC PATCH v6 00/11] IOMMU subsystem redesign and PV-IOMMU interface Teddy Astie
                   ` (8 preceding siblings ...)
  2025-02-17 10:18 ` [XEN RFC PATCH v6 04/11] iommu: Move IOMMU domain related structures to (arch_)iommu_context Teddy Astie
@ 2025-02-17 10:18 ` Teddy Astie
  2025-02-17 10:18 ` [XEN RFC PATCH v6 10/11] iommu: Introduce PV-IOMMU Teddy Astie
  2025-02-18 14:26 ` [XEN RFC PATCH v6 00/11] IOMMU subsystem redesign and PV-IOMMU interface Marek Marczykowski-Górecki
  11 siblings, 0 replies; 17+ messages in thread
From: Teddy Astie @ 2025-02-17 10:18 UTC (permalink / raw)
  To: xen-devel
  Cc: Teddy Astie, Jan Beulich, Andrew Cooper, Roger Pau Monné,
	Anthony PERARD, Michal Orzel, Julien Grall, Stefano Stabellini

Introduce the changes proposed in docs/designs/iommu-context.md.

Signed-off-by: Teddy Astie <teddy.astie@vates.tech>
---
This patch is still quite large, but I am not sure how to split it further.
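
For orientation while reading the diff: most call-site changes add a context
argument. Generic callers pass an extra integer (0 at every call site visible
in this patch, which matches "0 means default context" in struct
iommu_context), while driver hooks receive a struct iommu_context pointer.
The sketch below models the lookup step between the two. It is a simplified
illustration under stated assumptions, not the code of this series: the toy_*
names, the fixed-size table and the -ENOENT error code are hypothetical.
Returning NULL for an unknown id follows the v3 changelog entry about
iommu_get_context.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

#define TOY_MAX_CTX 4 /* hypothetical table size */

/* Toy context: an id plus a counter standing in for real mappings. */
struct toy_iommu_context {
    uint16_t id;         /* 0 means default context */
    unsigned long mapped;
};

/* Toy domain: the default context is embedded, others are looked up by id. */
struct toy_domain {
    struct toy_iommu_context default_ctx;
    struct toy_iommu_context *ctx_table[TOY_MAX_CTX]; /* slot 0 unused */
};

/* Resolve a context id; id 0 always yields the embedded default context. */
static struct toy_iommu_context *toy_get_context(struct toy_domain *d,
                                                 uint16_t ctx_id)
{
    if ( ctx_id == 0 )
        return &d->default_ctx;

    if ( ctx_id >= TOY_MAX_CTX )
        return NULL;

    return d->ctx_table[ctx_id];
}

/* Generic entry point: takes an id, resolves it, then acts on the context. */
static int toy_iommu_map(struct toy_domain *d, unsigned long nr,
                         uint16_t ctx_id)
{
    struct toy_iommu_context *ctx = toy_get_context(d, ctx_id);

    if ( !ctx )
        return -ENOENT; /* assumed error code, for illustration only */

    ctx->mapped += nr;

    return 0;
}
```

Existing callers that only know about the domain keep working by passing 0,
which is why so many hunks in this patch are one-argument changes.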
---
 xen/arch/x86/include/asm/iommu.h            |    8 +-
 xen/arch/x86/mm/p2m-ept.c                   |    2 +-
 xen/arch/x86/pv/dom0_build.c                |    6 +-
 xen/common/memory.c                         |    4 +-
 xen/drivers/passthrough/amd/iommu.h         |   13 +-
 xen/drivers/passthrough/amd/iommu_cmd.c     |   20 +-
 xen/drivers/passthrough/amd/iommu_init.c    |    2 +-
 xen/drivers/passthrough/amd/iommu_map.c     |   52 +-
 xen/drivers/passthrough/amd/pci_amd_iommu.c |  297 +++---
 xen/drivers/passthrough/iommu.c             |  622 ++++++++++-
 xen/drivers/passthrough/pci.c               |  397 +++----
 xen/drivers/passthrough/vtd/extern.h        |   17 +-
 xen/drivers/passthrough/vtd/iommu.c         | 1048 ++++++++-----------
 xen/drivers/passthrough/vtd/quirks.c        |   22 +-
 xen/drivers/passthrough/x86/iommu.c         |  153 ++-
 xen/include/xen/iommu.h                     |   93 +-
 xen/include/xen/pci.h                       |    3 +
 17 files changed, 1538 insertions(+), 1221 deletions(-)

diff --git a/xen/arch/x86/include/asm/iommu.h b/xen/arch/x86/include/asm/iommu.h
index d20c3cda59..654a07b9b2 100644
--- a/xen/arch/x86/include/asm/iommu.h
+++ b/xen/arch/x86/include/asm/iommu.h
@@ -2,10 +2,12 @@
 #ifndef __ARCH_X86_IOMMU_H__
 #define __ARCH_X86_IOMMU_H__
 
+#include <xen/bitmap.h>
 #include <xen/errno.h>
 #include <xen/list.h>
 #include <xen/mem_access.h>
 #include <xen/spinlock.h>
+#include <xen/stdbool.h>
 #include <asm/apicdef.h>
 #include <asm/cache.h>
 #include <asm/processor.h>
@@ -39,18 +41,16 @@ struct arch_iommu_context
     struct list_head identity_maps;
 
 
-    spinlock_t mapping_lock; /* io page table lock */
-
     union {
         /* Intel VT-d */
         struct {
             uint64_t pgd_maddr; /* io page directory machine address */
             domid_t *didmap; /* per-iommu DID (valid only if related iommu_dev_cnt > 0) */
             unsigned long *iommu_dev_cnt; /* counter of devices per iommu */
+            uint32_t superpage_progress; /* superpage progress during teardown */
         } vtd;
         /* AMD IOMMU */
         struct {
-            unsigned int paging_mode;
             struct page_info *root_table;
             domid_t *didmap; /* per-iommu DID (valid only if related iommu_dev_cnt > 0) */
             unsigned long *iommu_dev_cnt; /* counter of devices per iommu */
@@ -72,7 +72,7 @@ struct arch_iommu
         struct {
             unsigned int paging_mode;
             struct guest_iommu *g_iommu;
-        };
+        } amd;
     };
 };
 
diff --git a/xen/arch/x86/mm/p2m-ept.c b/xen/arch/x86/mm/p2m-ept.c
index 0cf6818c13..0cf5d3c323 100644
--- a/xen/arch/x86/mm/p2m-ept.c
+++ b/xen/arch/x86/mm/p2m-ept.c
@@ -978,7 +978,7 @@ out:
             rc = iommu_iotlb_flush(d, _dfn(gfn), 1ul << order,
                                    (iommu_flags ? IOMMU_FLUSHF_added : 0) |
                                    (vtd_pte_present ? IOMMU_FLUSHF_modified
-                                                    : 0));
+                                                    : 0), 0);
         else if ( need_iommu_pt_sync(d) )
             rc = iommu_flags ?
                 iommu_legacy_map(d, _dfn(gfn), mfn, 1ul << order, iommu_flags) :
diff --git a/xen/arch/x86/pv/dom0_build.c b/xen/arch/x86/pv/dom0_build.c
index f54d1da5c6..453fb22252 100644
--- a/xen/arch/x86/pv/dom0_build.c
+++ b/xen/arch/x86/pv/dom0_build.c
@@ -77,7 +77,7 @@ static __init void mark_pv_pt_pages_rdonly(struct domain *d,
          * iommu_memory_setup() ended up mapping them.
          */
         if ( need_iommu_pt_sync(d) &&
-             iommu_unmap(d, _dfn(mfn_x(page_to_mfn(page))), 1, 0, flush_flags) )
+             iommu_unmap(d, _dfn(mfn_x(page_to_mfn(page))), 1, 0, flush_flags, 0) )
             BUG();
 
         /* Read-only mapping + PGC_allocated + page-table page. */
@@ -128,7 +128,7 @@ static void __init iommu_memory_setup(struct domain *d, const char *what,
 
     while ( (rc = iommu_map(d, _dfn(mfn_x(mfn)), mfn, nr,
                             IOMMUF_readable | IOMMUF_writable | IOMMUF_preempt,
-                            flush_flags)) > 0 )
+                            flush_flags, 0)) > 0 )
     {
         mfn = mfn_add(mfn, rc);
         nr -= rc;
@@ -970,7 +970,7 @@ static int __init dom0_construct(struct boot_info *bi, struct domain *d)
     }
 
     /* Use while() to avoid compiler warning. */
-    while ( iommu_iotlb_flush_all(d, flush_flags) )
+    while ( iommu_iotlb_flush_all(d, 0, flush_flags) )
         break;
 
     if ( initrd_len != 0 )
diff --git a/xen/common/memory.c b/xen/common/memory.c
index a6f2f6d1b3..acf305bcd0 100644
--- a/xen/common/memory.c
+++ b/xen/common/memory.c
@@ -926,7 +926,7 @@ int xenmem_add_to_physmap(struct domain *d, struct xen_add_to_physmap *xatp,
         this_cpu(iommu_dont_flush_iotlb) = 0;
 
         ret = iommu_iotlb_flush(d, _dfn(xatp->idx - done), done,
-                                IOMMU_FLUSHF_modified);
+                                IOMMU_FLUSHF_modified, 0);
         if ( unlikely(ret) && rc >= 0 )
             rc = ret;
 
@@ -940,7 +940,7 @@ int xenmem_add_to_physmap(struct domain *d, struct xen_add_to_physmap *xatp,
             put_page(pages[i]);
 
         ret = iommu_iotlb_flush(d, _dfn(xatp->gpfn - done), done,
-                                IOMMU_FLUSHF_added | IOMMU_FLUSHF_modified);
+                                IOMMU_FLUSHF_added | IOMMU_FLUSHF_modified, 0);
         if ( unlikely(ret) && rc >= 0 )
             rc = ret;
     }
diff --git a/xen/drivers/passthrough/amd/iommu.h b/xen/drivers/passthrough/amd/iommu.h
index dbe427ed27..217c1ebc7a 100644
--- a/xen/drivers/passthrough/amd/iommu.h
+++ b/xen/drivers/passthrough/amd/iommu.h
@@ -198,11 +198,10 @@ void amd_iommu_quarantine_teardown(struct pci_dev *pdev);
 /* mapping functions */
 int __must_check cf_check amd_iommu_map_page(
     struct domain *d, dfn_t dfn, mfn_t mfn, unsigned int flags,
-    unsigned int *flush_flags);
+    unsigned int *flush_flags, struct iommu_context *ctx);
 int __must_check cf_check amd_iommu_unmap_page(
     struct domain *d, dfn_t dfn, unsigned int order,
-    unsigned int *flush_flags);
-int __must_check amd_iommu_alloc_root(struct domain *d);
+    unsigned int *flush_flags, struct iommu_context *ctx);
 int amd_iommu_reserve_domain_unity_map(struct domain *d, struct iommu_context *ctx,
                                        const struct ivrs_unity_map *map,
                                        unsigned int flag);
@@ -211,7 +210,7 @@ int amd_iommu_reserve_domain_unity_unmap(struct domain *d, struct iommu_context
 int cf_check amd_iommu_get_reserved_device_memory(
     iommu_grdm_t *func, void *ctxt);
 int __must_check cf_check amd_iommu_flush_iotlb_pages(
-    struct domain *d, dfn_t dfn, unsigned long page_count,
+    struct domain *d, struct iommu_context *ctx, dfn_t dfn, unsigned long page_count,
     unsigned int flush_flags);
 void amd_iommu_print_entries(const struct amd_iommu *iommu, unsigned int dev_id,
                              dfn_t dfn);
@@ -233,9 +232,9 @@ void iommu_dte_add_device_entry(struct amd_iommu_dte *dte,
                                 const struct ivrs_mappings *ivrs_dev);
 
 /* send cmd to iommu */
-void amd_iommu_flush_all_pages(struct domain *d);
-void amd_iommu_flush_pages(struct domain *d, unsigned long dfn,
-                           unsigned int order);
+void amd_iommu_flush_all_pages(struct domain *d, struct iommu_context *ctx);
+void amd_iommu_flush_pages(struct domain *d, struct iommu_context *ctx,
+                           unsigned long dfn, unsigned int order);
 void amd_iommu_flush_iotlb(u8 devfn, const struct pci_dev *pdev,
                            daddr_t daddr, unsigned int order);
 void amd_iommu_flush_device(struct amd_iommu *iommu, uint16_t bdf,
diff --git a/xen/drivers/passthrough/amd/iommu_cmd.c b/xen/drivers/passthrough/amd/iommu_cmd.c
index e1a252db93..495e6139fd 100644
--- a/xen/drivers/passthrough/amd/iommu_cmd.c
+++ b/xen/drivers/passthrough/amd/iommu_cmd.c
@@ -327,19 +327,21 @@ static void amd_iommu_flush_all_iotlbs(const struct domain *d, daddr_t daddr,
 }
 
 /* Flush iommu cache after p2m changes. */
-static void _amd_iommu_flush_pages(struct domain *d,
+static void _amd_iommu_flush_pages(struct domain *d, struct iommu_context *ctx,
                                    daddr_t daddr, unsigned int order)
 {
     struct amd_iommu *iommu;
-    struct iommu_context *ctx = iommu_default_context(d);
 
     /* send INVALIDATE_IOMMU_PAGES command */
     for_each_amd_iommu ( iommu )
     {
-        domid_t dom_id = ctx->arch.amd.didmap[iommu->index];
+        if ( ctx->arch.amd.iommu_dev_cnt[iommu->index] )
+        {
+            domid_t dom_id = ctx->arch.amd.didmap[iommu->index];
 
-        invalidate_iommu_pages(iommu, daddr, dom_id, order);
-        flush_command_buffer(iommu, 0);
+            invalidate_iommu_pages(iommu, daddr, dom_id, order);
+            flush_command_buffer(iommu, 0);
+        }
     }
 
     if ( ats_enabled )
@@ -355,15 +357,15 @@ static void _amd_iommu_flush_pages(struct domain *d,
     }
 }
 
-void amd_iommu_flush_all_pages(struct domain *d)
+void amd_iommu_flush_all_pages(struct domain *d, struct iommu_context *ctx)
 {
-    _amd_iommu_flush_pages(d, INV_IOMMU_ALL_PAGES_ADDRESS, 0);
+    _amd_iommu_flush_pages(d, ctx, INV_IOMMU_ALL_PAGES_ADDRESS, 0);
 }
 
-void amd_iommu_flush_pages(struct domain *d,
+void amd_iommu_flush_pages(struct domain *d, struct iommu_context *ctx,
                            unsigned long dfn, unsigned int order)
 {
-    _amd_iommu_flush_pages(d, __dfn_to_daddr(dfn), order);
+    _amd_iommu_flush_pages(d, ctx, __dfn_to_daddr(dfn), order);
 }
 
 void amd_iommu_flush_device(struct amd_iommu *iommu, uint16_t bdf,
diff --git a/xen/drivers/passthrough/amd/iommu_init.c b/xen/drivers/passthrough/amd/iommu_init.c
index 333d5d5e39..67235b4ce4 100644
--- a/xen/drivers/passthrough/amd/iommu_init.c
+++ b/xen/drivers/passthrough/amd/iommu_init.c
@@ -1538,7 +1538,7 @@ static void invalidate_all_domain_pages(void)
 
     for_each_domain( d )
         if ( is_iommu_enabled(d) )
-            amd_iommu_flush_all_pages(d);
+            amd_iommu_flush_all_pages(d, iommu_default_context(d));
 }
 
 static int cf_check _invalidate_all_devices(
diff --git a/xen/drivers/passthrough/amd/iommu_map.c b/xen/drivers/passthrough/amd/iommu_map.c
index 91d8c21048..6c3ec975ce 100644
--- a/xen/drivers/passthrough/amd/iommu_map.c
+++ b/xen/drivers/passthrough/amd/iommu_map.c
@@ -276,7 +276,7 @@ static int iommu_pde_from_dfn(struct domain *d, struct iommu_context *ctx,
     struct domain_iommu *hd = dom_iommu(d);
 
     table = ctx->arch.amd.root_table;
-    level = ctx->arch.amd.paging_mode;
+    level = hd->arch.amd.paging_mode;
 
     if ( !table || target < 1 || level < target || level > 6 )
     {
@@ -400,21 +400,17 @@ static void queue_free_pt(struct domain *d, struct iommu_context *ctx, mfn_t mfn
 
 int cf_check amd_iommu_map_page(
     struct domain *d, dfn_t dfn, mfn_t mfn, unsigned int flags,
-    unsigned int *flush_flags)
+    unsigned int *flush_flags, struct iommu_context *ctx)
 {
     struct domain_iommu *hd = dom_iommu(d);
-    struct iommu_context *ctx = iommu_default_context(d);
     unsigned int level = (IOMMUF_order(flags) / PTE_PER_TABLE_SHIFT) + 1;
     bool contig;
-    int rc;
     unsigned long pt_mfn = 0;
     union amd_iommu_pte old;
 
     ASSERT((hd->platform_ops->page_sizes >> IOMMUF_order(flags)) &
            PAGE_SIZE_4K);
 
-    spin_lock(&ctx->arch.mapping_lock);
-
     /*
      * IOMMU mapping request can be safely ignored when the domain is dying.
      *
@@ -422,25 +418,11 @@ int cf_check amd_iommu_map_page(
      * before any page tables are freed (see iommu_free_pgtables()).
      */
     if ( d->is_dying )
-    {
-        spin_unlock(&ctx->arch.mapping_lock);
         return 0;
-    }
-
-    rc = amd_iommu_alloc_root(d);
-    if ( rc )
-    {
-        spin_unlock(&ctx->arch.mapping_lock);
-        AMD_IOMMU_ERROR("root table alloc failed, dfn = %"PRI_dfn"\n",
-                        dfn_x(dfn));
-        domain_crash(d);
-        return rc;
-    }
 
     if ( iommu_pde_from_dfn(d, ctx, dfn_x(dfn), level, &pt_mfn, flush_flags, true) ||
          !pt_mfn )
     {
-        spin_unlock(&ctx->arch.mapping_lock);
         AMD_IOMMU_ERROR("invalid IO pagetable entry dfn = %"PRI_dfn"\n",
                         dfn_x(dfn));
         domain_crash(d);
@@ -452,7 +434,7 @@ int cf_check amd_iommu_map_page(
                                 flags & IOMMUF_writable,
                                 flags & IOMMUF_readable, &contig);
 
-    while ( unlikely(contig) && ++level < ctx->arch.amd.paging_mode )
+    while ( unlikely(contig) && ++level < hd->arch.amd.paging_mode )
     {
         struct page_info *pg = mfn_to_page(_mfn(pt_mfn));
         unsigned long next_mfn;
@@ -471,8 +453,6 @@ int cf_check amd_iommu_map_page(
         perfc_incr(iommu_pt_coalesces);
     }
 
-    spin_unlock(&ctx->arch.mapping_lock);
-
     *flush_flags |= IOMMU_FLUSHF_added;
     if ( old.pr )
     {
@@ -486,11 +466,11 @@ int cf_check amd_iommu_map_page(
 }
 
 int cf_check amd_iommu_unmap_page(
-    struct domain *d, dfn_t dfn, unsigned int order, unsigned int *flush_flags)
+    struct domain *d, dfn_t dfn, unsigned int order, unsigned int *flush_flags,
+    struct iommu_context *ctx)
 {
     unsigned long pt_mfn = 0;
     struct domain_iommu *hd = dom_iommu(d);
-    struct iommu_context *ctx = iommu_default_context(d);
     unsigned int level = (order / PTE_PER_TABLE_SHIFT) + 1;
     union amd_iommu_pte old = {};
 
@@ -500,17 +480,11 @@ int cf_check amd_iommu_unmap_page(
      */
     ASSERT((hd->platform_ops->page_sizes >> order) & PAGE_SIZE_4K);
 
-    spin_lock(&ctx->arch.mapping_lock);
-
     if ( !ctx->arch.amd.root_table )
-    {
-        spin_unlock(&ctx->arch.mapping_lock);
         return 0;
-    }
 
     if ( iommu_pde_from_dfn(d, ctx, dfn_x(dfn), level, &pt_mfn, flush_flags, false) )
     {
-        spin_unlock(&ctx->arch.mapping_lock);
         AMD_IOMMU_ERROR("invalid IO pagetable entry dfn = %"PRI_dfn"\n",
                         dfn_x(dfn));
         domain_crash(d);
@@ -524,7 +498,7 @@ int cf_check amd_iommu_unmap_page(
         /* Mark PTE as 'page not present'. */
         old = clear_iommu_pte_present(pt_mfn, dfn_x(dfn), level, &free);
 
-        while ( unlikely(free) && ++level < ctx->arch.amd.paging_mode )
+        while ( unlikely(free) && ++level < hd->arch.amd.paging_mode )
         {
             struct page_info *pg = mfn_to_page(_mfn(pt_mfn));
 
@@ -540,8 +514,6 @@ int cf_check amd_iommu_unmap_page(
         }
     }
 
-    spin_unlock(&ctx->arch.mapping_lock);
-
     if ( old.pr )
     {
         *flush_flags |= IOMMU_FLUSHF_modified;
@@ -608,7 +580,7 @@ static unsigned long flush_count(unsigned long dfn, unsigned long page_count,
 }
 
 int cf_check amd_iommu_flush_iotlb_pages(
-    struct domain *d, dfn_t dfn, unsigned long page_count,
+    struct domain *d, struct iommu_context *ctx, dfn_t dfn, unsigned long page_count,
     unsigned int flush_flags)
 {
     unsigned long dfn_l = dfn_x(dfn);
@@ -626,7 +598,7 @@ int cf_check amd_iommu_flush_iotlb_pages(
     /* If so requested or if the range wraps then just flush everything. */
     if ( (flush_flags & IOMMU_FLUSHF_all) || dfn_l + page_count < dfn_l )
     {
-        amd_iommu_flush_all_pages(d);
+        amd_iommu_flush_all_pages(d, ctx);
         return 0;
     }
 
@@ -639,13 +611,13 @@ int cf_check amd_iommu_flush_iotlb_pages(
      *       flush code.
      */
     if ( page_count == 1 ) /* order 0 flush count */
-        amd_iommu_flush_pages(d, dfn_l, 0);
+        amd_iommu_flush_pages(d, ctx, dfn_l, 0);
     else if ( flush_count(dfn_l, page_count, 9) == 1 )
-        amd_iommu_flush_pages(d, dfn_l, 9);
+        amd_iommu_flush_pages(d, ctx, dfn_l, 9);
     else if ( flush_count(dfn_l, page_count, 18) == 1 )
-        amd_iommu_flush_pages(d, dfn_l, 18);
+        amd_iommu_flush_pages(d, ctx, dfn_l, 18);
     else
-        amd_iommu_flush_all_pages(d);
+        amd_iommu_flush_all_pages(d, ctx);
 
     return 0;
 }
diff --git a/xen/drivers/passthrough/amd/pci_amd_iommu.c b/xen/drivers/passthrough/amd/pci_amd_iommu.c
index 0008b35162..366d5eb982 100644
--- a/xen/drivers/passthrough/amd/pci_amd_iommu.c
+++ b/xen/drivers/passthrough/amd/pci_amd_iommu.c
@@ -20,8 +20,11 @@
 #include <xen/iocap.h>
 #include <xen/softirq.h>
 #include <xen/iommu.h>
+#include <xen/mm.h>
+#include <xen/pci.h>
 
 #include <asm/acpi.h>
+#include <asm/iommu.h>
 
 #include "iommu.h"
 #include "../ats.h"
@@ -85,18 +88,6 @@ int get_dma_requestor_id(uint16_t seg, uint16_t bdf)
     return req_id;
 }
 
-static int __must_check allocate_domain_resources(struct domain *d)
-{
-    struct iommu_context *ctx = iommu_default_context(d);
-    int rc;
-
-    spin_lock(&ctx->arch.mapping_lock);
-    rc = amd_iommu_alloc_root(d);
-    spin_unlock(&ctx->arch.mapping_lock);
-
-    return rc;
-}
-
 static bool any_pdev_behind_iommu(const struct domain *d,
                                   const struct pci_dev *exclude,
                                   const struct amd_iommu *iommu)
@@ -127,8 +118,9 @@ static bool use_ats(
 
 static int __must_check amd_iommu_setup_domain_device(
     struct domain *domain, struct iommu_context *ctx, struct amd_iommu *iommu,
-    uint8_t devfn, struct pci_dev *pdev)
+    uint8_t devfn, struct pci_dev *pdev, struct iommu_context *prev_ctx)
 {
+    struct domain_iommu *hd = dom_iommu(domain);
     struct amd_iommu_dte *table, *dte;
     unsigned long flags;
     unsigned int req_id, sr_flags;
@@ -138,11 +130,7 @@ static int __must_check amd_iommu_setup_domain_device(
     const struct page_info *root_pg;
     domid_t domid;
 
-    BUG_ON(!ctx->arch.amd.paging_mode || !iommu->dev_table.buffer);
-
-    rc = allocate_domain_resources(domain);
-    if ( rc )
-        return rc;
+    BUG_ON(!hd->arch.amd.paging_mode || !iommu->dev_table.buffer);
 
     req_id = get_dma_requestor_id(iommu->seg, pdev->sbdf.bdf);
     ivrs_dev = &get_ivrs_mappings(iommu->seg)[req_id];
@@ -157,7 +145,7 @@ static int __must_check amd_iommu_setup_domain_device(
     ivrs_dev = &get_ivrs_mappings(iommu->seg)[req_id];
 
     root_pg = ctx->arch.amd.root_table;
-    domid = domain->domain_id;
+    domid = ctx->arch.amd.didmap[iommu->index];
 
     spin_lock_irqsave(&iommu->lock, flags);
 
@@ -166,7 +154,7 @@ static int __must_check amd_iommu_setup_domain_device(
         /* bind DTE to domain page-tables */
         rc = amd_iommu_set_root_page_table(
                  dte, page_to_maddr(root_pg), domid,
-                 ctx->arch.amd.paging_mode, sr_flags);
+                 hd->arch.amd.paging_mode, sr_flags);
         if ( rc )
         {
             ASSERT(rc < 0);
@@ -208,7 +196,7 @@ static int __must_check amd_iommu_setup_domain_device(
         else
             rc = amd_iommu_set_root_page_table(
                      dte, page_to_maddr(root_pg), domid,
-                     ctx->arch.amd.paging_mode, sr_flags);
+                     hd->arch.amd.paging_mode, sr_flags);
         if ( rc < 0 )
         {
             spin_unlock_irqrestore(&iommu->lock, flags);
@@ -259,7 +247,7 @@ static int __must_check amd_iommu_setup_domain_device(
                     "root table = %#"PRIx64", "
                     "domain = %d, paging mode = %d\n",
                     req_id, pdev->type, page_to_maddr(root_pg),
-                    domid, ctx->arch.amd.paging_mode);
+                    domid, hd->arch.amd.paging_mode);
 
     ASSERT(pcidevs_locked());
 
@@ -272,6 +260,15 @@ static int __must_check amd_iommu_setup_domain_device(
         amd_iommu_flush_iotlb(devfn, pdev, INV_IOMMU_ALL_PAGES_ADDRESS, 0);
     }
 
+    if ( prev_ctx )
+    {
+        /* Don't underflow the counter. */
+        BUG_ON(!prev_ctx->arch.amd.iommu_dev_cnt[iommu->index]);
+        prev_ctx->arch.amd.iommu_dev_cnt[iommu->index]--;
+    }
+
+    ctx->arch.amd.iommu_dev_cnt[iommu->index]++;
+
     return 0;
 }
 
@@ -338,27 +335,12 @@ static int cf_check iov_enable_xt(void)
     return 0;
 }
 
-int amd_iommu_alloc_root(struct domain *d)
-{
-    struct domain_iommu *hd = dom_iommu(d);
-    struct iommu_context *ctx = iommu_default_context(d);
-
-    if ( unlikely(!ctx->arch.amd.root_table) && d != dom_io )
-    {
-        ctx->arch.amd.root_table = iommu_alloc_pgtable(hd, ctx, 0);
-        if ( !ctx->arch.amd.root_table )
-            return -ENOMEM;
-    }
-
-    return 0;
-}
-
 unsigned int __read_mostly amd_iommu_max_paging_mode = IOMMU_MAX_PT_LEVELS;
 int __read_mostly amd_iommu_min_paging_mode = 1;
 
 static int cf_check amd_iommu_domain_init(struct domain *d)
 {
-    struct iommu_context *ctx = iommu_default_context(d);
+    struct domain_iommu *hd = dom_iommu(d);
     int pglvl = amd_iommu_get_paging_mode(
                     1UL << (domain_max_paddr_bits(d) - PAGE_SHIFT));
 
@@ -369,7 +351,7 @@ static int cf_check amd_iommu_domain_init(struct domain *d)
      * Choose the number of levels for the IOMMU page tables, taking into
      * account unity maps.
      */
-    ctx->arch.amd.paging_mode = max(pglvl, amd_iommu_min_paging_mode);
+    hd->arch.amd.paging_mode = max(pglvl, amd_iommu_min_paging_mode);
 
     return 0;
 }
@@ -380,9 +362,6 @@ static void __hwdom_init cf_check amd_iommu_hwdom_init(struct domain *d)
 {
     const struct amd_iommu *iommu;
 
-    if ( allocate_domain_resources(d) )
-        BUG();
-
     for_each_amd_iommu ( iommu )
         if ( iomem_deny_access(d, PFN_DOWN(iommu->mmio_base_phys),
                                PFN_DOWN(iommu->mmio_base_phys +
@@ -394,8 +373,11 @@ static void __hwdom_init cf_check amd_iommu_hwdom_init(struct domain *d)
     setup_hwdom_pci_devices(d, amd_iommu_add_device);
 }
 
+
+
 static void amd_iommu_disable_domain_device(const struct domain *domain,
                                             struct amd_iommu *iommu,
+                                            struct iommu_context *prev_ctx,
                                             uint8_t devfn, struct pci_dev *pdev)
 {
     struct amd_iommu_dte *table, *dte;
@@ -442,108 +424,141 @@ static void amd_iommu_disable_domain_device(const struct domain *domain,
         AMD_IOMMU_DEBUG("Disable: device id = %#x, "
                         "domain = %d, paging mode = %d\n",
                         req_id, dte->domain_id,
-                        iommu_default_context(domain)->arch.amd.paging_mode);
+                        dom_iommu(domain)->arch.amd.paging_mode);
     }
     else
         spin_unlock_irqrestore(&iommu->lock, flags);
+
+    BUG_ON(!prev_ctx->arch.amd.iommu_dev_cnt[iommu->index]);
+    prev_ctx->arch.amd.iommu_dev_cnt[iommu->index]--;
 }
 
-static int cf_check reassign_device(
-    struct domain *source, struct domain *target, u8 devfn,
-    struct pci_dev *pdev)
+static int cf_check amd_iommu_context_init(struct domain *d, struct iommu_context *ctx,
+                                           u32 flags)
 {
     struct amd_iommu *iommu;
-    struct iommu_context *target_ctx = iommu_default_context(target);
-    struct iommu_context *source_ctx = iommu_default_context(source);
-    int rc;
+    struct domain_iommu *hd = dom_iommu(d);
 
-    iommu = find_iommu_for_device(pdev->seg, pdev->sbdf.bdf);
-    if ( !iommu )
+    ctx->arch.amd.didmap = xzalloc_array(domid_t, nr_amd_iommus);
+    if ( !ctx->arch.amd.didmap )
+        return -ENOMEM;
+
+    ctx->arch.amd.iommu_dev_cnt = xzalloc_array(unsigned long, nr_amd_iommus);
+    if ( !ctx->arch.amd.iommu_dev_cnt )
     {
-        AMD_IOMMU_WARN("failed to find IOMMU: %pp cannot be assigned to %pd\n",
-                       &PCI_SBDF(pdev->seg, pdev->bus, devfn), target);
-        return -ENODEV;
+        xfree(ctx->arch.amd.didmap);
+        return -ENOMEM;
     }
 
-    rc = amd_iommu_setup_domain_device(target, target_ctx, iommu, devfn, pdev);
-    if ( rc )
-        return rc;
+    /* TODO: Allocate IOMMU domid only when attaching devices? */
+    /* Populate context DID map using pseudo DIDs */
+    for_each_amd_iommu(iommu)
+    {
+        ctx->arch.amd.didmap[iommu->index] =
+            iommu_alloc_domid(iommu->domid_map);
+    }
 
-    if ( devfn == pdev->devfn && pdev->domain != target )
+    if ( !ctx->opaque )
     {
-        write_lock(&source->pci_lock);
-        list_del(&pdev->domain_list);
-        write_unlock(&source->pci_lock);
+        /* Create initial context page */
+        ctx->arch.amd.root_table = iommu_alloc_pgtable(hd, ctx, 0);
+    }
 
-        pdev->domain = target;
+    return arch_iommu_context_init(d, ctx, flags);
 
-        write_lock(&target->pci_lock);
-        list_add(&pdev->domain_list, &target->pdev_list);
-        write_unlock(&target->pci_lock);
-    }
+}
 
-    /*
-     * If the device belongs to the hardware domain, and it has a unity mapping,
-     * don't remove it from the hardware domain, because BIOS may reference that
-     * mapping.
-     */
-    if ( !is_hardware_domain(source) )
-    {
-        const struct ivrs_mappings *ivrs_mappings = get_ivrs_mappings(pdev->seg);
-        unsigned int bdf = PCI_BDF(pdev->bus, devfn);
+static int cf_check amd_iommu_context_teardown(struct domain *d,
+                                        struct iommu_context *ctx, u32 flags)
+{
+    struct amd_iommu *iommu;
+    pcidevs_lock();
 
-        rc = amd_iommu_reserve_domain_unity_unmap(
-                 source, source_ctx,
-                 ivrs_mappings[get_dma_requestor_id(pdev->seg, bdf)].unity_map);
-        if ( rc )
-            return rc;
+    /* TODO: Cleanup mappings */
+    ASSERT(ctx->arch.amd.didmap);
+
+    for_each_amd_iommu(iommu)
+    {
+        iommu_free_domid(ctx->arch.amd.didmap[iommu->index], iommu->domid_map);
     }
 
-    AMD_IOMMU_DEBUG("Re-assign %pp from %pd to %pd\n",
-                    &PCI_SBDF(pdev->seg, pdev->bus, devfn), source, target);
+    xfree(ctx->arch.amd.didmap);
 
-    return 0;
+    pcidevs_unlock();
+    return arch_iommu_context_teardown(d, ctx, flags);
 }
 
-static int cf_check amd_iommu_assign_device(
-    struct domain *d, u8 devfn, struct pci_dev *pdev, u32 flag)
+static int cf_check amd_iommu_attach(
+    struct domain *d, struct pci_dev *pdev, struct iommu_context *ctx)
 {
+    int ret;
     struct ivrs_mappings *ivrs_mappings = get_ivrs_mappings(pdev->seg);
-    unsigned int bdf = PCI_BDF(pdev->bus, devfn);
-    int req_id = get_dma_requestor_id(pdev->seg, bdf);
-    int rc = amd_iommu_reserve_domain_unity_map(
-                 d, iommu_default_context(d),
-                 ivrs_mappings[req_id].unity_map, flag);
+    int req_id = get_dma_requestor_id(pdev->seg, pdev->sbdf.bdf);
+    struct ivrs_unity_map *map = ivrs_mappings[req_id].unity_map;
+    struct amd_iommu *iommu = find_iommu_for_device(pdev->seg, pdev->sbdf.bdf);
 
-    if ( !rc )
-        rc = reassign_device(pdev->domain, d, devfn, pdev);
+    ret = amd_iommu_reserve_domain_unity_map(d, ctx, map, 0);
+    if ( ret )
+        return ret;
 
-    if ( rc && !is_hardware_domain(d) )
-    {
-        int ret = amd_iommu_reserve_domain_unity_unmap(
-                      d, iommu_default_context(d),
-                      ivrs_mappings[req_id].unity_map);
+    return amd_iommu_setup_domain_device(d, ctx, iommu, pdev->devfn, pdev, NULL);
+}
 
-        if ( ret )
-        {
-            printk(XENLOG_ERR "AMD-Vi: "
-                   "unity-unmap for %pd/%04x:%02x:%02x.%u failed (%d)\n",
-                   d, pdev->seg, pdev->bus,
-                   PCI_SLOT(devfn), PCI_FUNC(devfn), ret);
-            domain_crash(d);
-        }
-    }
+static int cf_check amd_iommu_detach(struct domain *d, struct pci_dev *pdev,
+                                     struct iommu_context *prev_ctx)
+{
+    struct ivrs_mappings *ivrs_mappings = get_ivrs_mappings(pdev->seg);
+    int req_id = get_dma_requestor_id(pdev->seg, pdev->sbdf.bdf);
+    struct amd_iommu *iommu = find_iommu_for_device(pdev->seg, pdev->sbdf.bdf);
+
+    amd_iommu_disable_domain_device(d, iommu, prev_ctx, pdev->devfn, pdev);
 
-    return rc;
+    return amd_iommu_reserve_domain_unity_unmap(d, prev_ctx, ivrs_mappings[req_id].unity_map);
 }
 
-static void cf_check amd_iommu_clear_root_pgtable(struct domain *d)
+static int cf_check amd_iommu_add_devfn(struct domain *d, struct pci_dev *pdev,
+                                        u16 devfn, struct iommu_context *ctx)
 {
-    struct iommu_context *ctx = iommu_default_context(d);
+    struct amd_iommu *iommu = find_iommu_for_device(pdev->seg, pdev->sbdf.bdf);
+
+    return amd_iommu_setup_domain_device(d, ctx, iommu, devfn, pdev, NULL);
+}
+
+static int cf_check amd_iommu_remove_devfn(struct domain *d, struct pci_dev *pdev,
+                                           u16 devfn)
+{
+    struct amd_iommu *iommu = find_iommu_for_device(pdev->seg, pdev->sbdf.bdf);
+
+    amd_iommu_disable_domain_device(d, iommu, NULL, pdev->devfn, pdev);
+
+    return 0;
+}
+
+static int cf_check amd_iommu_reattach(struct domain *d,
+                                       struct pci_dev *pdev,
+                                       struct iommu_context *prev_ctx,
+                                       struct iommu_context *ctx)
+{
+    int ret;
+    struct ivrs_mappings *ivrs_mappings = get_ivrs_mappings(pdev->seg);
+    int req_id = get_dma_requestor_id(pdev->seg, pdev->sbdf.bdf);
+    struct ivrs_unity_map *map = ivrs_mappings[req_id].unity_map;
+
+    ret = amd_iommu_reserve_domain_unity_map(d, ctx, map, 0);
+    if ( ret )
+        return ret;
+
+    ret = amd_iommu_setup_domain_device(d, ctx, ivrs_mappings[req_id].iommu,
+                                        pdev->devfn, pdev, prev_ctx);
+    if ( ret )
+        return ret;
 
-    spin_lock(&ctx->arch.mapping_lock);
+    return amd_iommu_reserve_domain_unity_unmap(d, prev_ctx, map);
+}
+
+static void cf_check amd_iommu_clear_root_pgtable(struct domain *d, struct iommu_context *ctx)
+{
     ctx->arch.amd.root_table = NULL;
-    spin_unlock(&ctx->arch.mapping_lock);
 }
 
 static void cf_check amd_iommu_domain_destroy(struct domain *d)
@@ -628,48 +643,7 @@ static int cf_check amd_iommu_add_device(u8 devfn, struct pci_dev *pdev)
         AMD_IOMMU_WARN("%pd: unity mapping failed for %pp\n",
                        pdev->domain, &PCI_SBDF(pdev->seg, bdf));
 
-    return amd_iommu_setup_domain_device(pdev->domain, ctx, iommu, devfn, pdev);
-}
-
-static int cf_check amd_iommu_remove_device(u8 devfn, struct pci_dev *pdev)
-{
-    struct amd_iommu *iommu;
-    struct iommu_context *ctx;
-    u16 bdf;
-    struct ivrs_mappings *ivrs_mappings;
-
-    if ( !pdev->domain )
-        return -EINVAL;
-
-    ctx = iommu_default_context(pdev->domain);
-
-    iommu = find_iommu_for_device(pdev->seg, pdev->sbdf.bdf);
-    if ( !iommu )
-    {
-        AMD_IOMMU_WARN("failed to find IOMMU: %pp cannot be removed from %pd\n",
-                        &PCI_SBDF(pdev->seg, pdev->bus, devfn), pdev->domain);
-        return -ENODEV;
-    }
-
-    amd_iommu_disable_domain_device(pdev->domain, iommu, devfn, pdev);
-
-    ivrs_mappings = get_ivrs_mappings(pdev->seg);
-    bdf = PCI_BDF(pdev->bus, devfn);
-
-    if ( amd_iommu_reserve_domain_unity_unmap(
-             pdev->domain, ctx,
-             ivrs_mappings[ivrs_mappings[bdf].dte_requestor_id].unity_map) )
-        AMD_IOMMU_WARN("%pd: unity unmapping failed for %pp\n",
-                       pdev->domain, &PCI_SBDF(pdev->seg, bdf));
-
-    amd_iommu_quarantine_teardown(pdev);
-
-    if ( amd_iommu_perdev_intremap &&
-         ivrs_mappings[bdf].dte_requestor_id == bdf &&
-         ivrs_mappings[bdf].intremap_table )
-        amd_iommu_free_intremap_table(iommu, &ivrs_mappings[bdf], bdf);
-
-    return 0;
+    return amd_iommu_setup_domain_device(pdev->domain, ctx, iommu, devfn, pdev, NULL);
 }
 
 static int cf_check amd_iommu_group_id(u16 seg, u8 bus, u8 devfn)
@@ -729,30 +703,33 @@ static void amd_dump_page_table_level(struct page_info *pg, int level,
 
 static void cf_check amd_dump_page_tables(struct domain *d)
 {
+    struct domain_iommu *hd = dom_iommu(d);
     struct iommu_context *ctx = iommu_default_context(d);
 
     if ( !ctx->arch.amd.root_table )
         return;
 
-    printk("AMD IOMMU %pd table has %u levels\n", d, ctx->arch.amd.paging_mode);
+    printk("AMD IOMMU %pd table has %u levels\n", d, hd->arch.amd.paging_mode);
     amd_dump_page_table_level(ctx->arch.amd.root_table,
-                              ctx->arch.amd.paging_mode, 0, 0);
+                              hd->arch.amd.paging_mode, 0, 0);
 }
 
 static const struct iommu_ops __initconst_cf_clobber _iommu_ops = {
     .page_sizes = PAGE_SIZE_4K | PAGE_SIZE_2M | PAGE_SIZE_1G,
     .init = amd_iommu_domain_init,
     .hwdom_init = amd_iommu_hwdom_init,
-    .quarantine_init = amd_iommu_quarantine_init,
-    .add_device = amd_iommu_add_device,
-    .remove_device = amd_iommu_remove_device,
-    .assign_device  = amd_iommu_assign_device,
+    .context_init = amd_iommu_context_init,
+    .context_teardown = amd_iommu_context_teardown,
+    .attach = amd_iommu_attach,
+    .detach = amd_iommu_detach,
+    .reattach = amd_iommu_reattach,
+    .add_devfn = amd_iommu_add_devfn,
+    .remove_devfn = amd_iommu_remove_devfn,
     .teardown = amd_iommu_domain_destroy,
     .clear_root_pgtable = amd_iommu_clear_root_pgtable,
     .map_page = amd_iommu_map_page,
     .unmap_page = amd_iommu_unmap_page,
     .iotlb_flush = amd_iommu_flush_iotlb_pages,
-    .reassign_device = reassign_device,
     .get_device_group_id = amd_iommu_group_id,
     .enable_x2apic = iov_enable_xt,
     .update_ire_from_apic = amd_iommu_ioapic_update_ire,
diff --git a/xen/drivers/passthrough/iommu.c b/xen/drivers/passthrough/iommu.c
index 662da49766..f92835a2ed 100644
--- a/xen/drivers/passthrough/iommu.c
+++ b/xen/drivers/passthrough/iommu.c
@@ -208,13 +208,15 @@ int iommu_domain_init(struct domain *d, unsigned int opts)
     hd->node = NUMA_NO_NODE;
 #endif
 
+    rspin_lock_init(&hd->default_ctx.lock);
+
     ret = arch_iommu_domain_init(d);
     if ( ret )
         return ret;
 
     hd->platform_ops = iommu_get_ops();
     ret = iommu_call(hd->platform_ops, init, d);
-    if ( ret || is_system_domain(d) )
+    if ( ret || (is_system_domain(d) && d != dom_io) )
         return ret;
 
     /*
@@ -236,7 +238,17 @@ int iommu_domain_init(struct domain *d, unsigned int opts)
 
     ASSERT(!(hd->need_sync && hd->hap_pt_share));
 
-    return 0;
+    rspin_lock(&hd->default_ctx.lock);
+    ret = iommu_context_init(d, &hd->default_ctx, 0, IOMMU_CONTEXT_INIT_default);
+    rspin_unlock(&hd->default_ctx.lock);
+
+    rwlock_init(&hd->other_contexts.lock);
+    hd->other_contexts.initialized = (atomic_t)ATOMIC_INIT(0);
+    hd->other_contexts.count = 0;
+    hd->other_contexts.bitmap = NULL;
+    hd->other_contexts.map = NULL;
+
+    return ret;
 }
 
 static void cf_check iommu_dump_page_tables(unsigned char key)
@@ -249,14 +261,11 @@ static void cf_check iommu_dump_page_tables(unsigned char key)
 
     for_each_domain(d)
     {
-        if ( is_hardware_domain(d) || !is_iommu_enabled(d) )
+        if ( !is_iommu_enabled(d) )
             continue;
 
         if ( iommu_use_hap_pt(d) )
-        {
             printk("%pd sharing page tables\n", d);
-            continue;
-        }
 
         iommu_vcall(dom_iommu(d)->platform_ops, dump_page_tables, d);
     }
@@ -274,9 +283,13 @@ void __hwdom_init iommu_hwdom_init(struct domain *d)
     iommu_vcall(hd->platform_ops, hwdom_init, d);
 }
 
-static void iommu_teardown(struct domain *d)
+void cf_check iommu_domain_destroy(struct domain *d)
 {
     struct domain_iommu *hd = dom_iommu(d);
+    struct pci_dev *pdev;
+
+    if ( !is_iommu_enabled(d) )
+        return;
 
     /*
      * During early domain creation failure, we may reach here with the
@@ -285,17 +298,65 @@ static void iommu_teardown(struct domain *d)
     if ( !hd->platform_ops )
         return;
 
+    /* Move all devices back to quarantine */
+    /* TODO: Is it needed ? */
+    for_each_pdev(d, pdev)
+    {
+        int rc = iommu_reattach_context(d, dom_io, pdev, 0);
+
+        if ( rc )
+        {
+            printk(XENLOG_WARNING "Unable to quarantine device %pp (%d)\n", &pdev->sbdf, rc);
+            pdev->broken = true;
+        }
+        else
+            pdev->domain = dom_io;
+    }
+
     iommu_vcall(hd->platform_ops, teardown, d);
+
+    arch_iommu_domain_destroy(d);
 }
 
-void iommu_domain_destroy(struct domain *d)
-{
-    if ( !is_iommu_enabled(d) )
-        return;
+bool cf_check iommu_check_context(struct domain *d, u16 ctx_id) {
+    struct domain_iommu *hd = dom_iommu(d);
 
-    iommu_teardown(d);
+    if (ctx_id == 0)
+        return 1; /* Default context always exists. */
 
-    arch_iommu_domain_destroy(d);
+    if ((ctx_id - 1) >= hd->other_contexts.count)
+        return 0; /* out of bounds */
+
+    return test_bit(ctx_id - 1, hd->other_contexts.bitmap);
+}
+
+struct iommu_context * cf_check iommu_get_context(struct domain *d, u16 ctx_id) {
+    struct domain_iommu *hd = dom_iommu(d);
+    struct iommu_context *ctx;
+
+    if ( !iommu_check_context(d, ctx_id) )
+        return NULL;
+
+    if (ctx_id == 0)
+        ctx = &hd->default_ctx;
+    else
+        ctx = &hd->other_contexts.map[ctx_id - 1];
+
+    rspin_lock(&ctx->lock);
+    /* Check if the context is still valid at this point */
+    if ( unlikely(!iommu_check_context(d, ctx_id)) )
+    {
+        /* Context has been destroyed in between */
+        rspin_unlock(&ctx->lock);
+        return NULL;
+    }
+
+    return ctx;
+}
+
+void cf_check iommu_put_context(struct iommu_context *ctx)
+{
+    rspin_unlock(&ctx->lock);
 }
 
 static unsigned int mapping_order(const struct domain_iommu *hd,
@@ -323,11 +384,11 @@ static unsigned int mapping_order(const struct domain_iommu *hd,
     return order;
 }
 
-long iommu_map(struct domain *d, dfn_t dfn0, mfn_t mfn0,
-               unsigned long page_count, unsigned int flags,
-               unsigned int *flush_flags)
+static long _iommu_map(struct domain *d, dfn_t dfn0, mfn_t mfn0,
+                       unsigned long page_count, unsigned int flags,
+                       unsigned int *flush_flags, struct iommu_context *ctx)
 {
-    const struct domain_iommu *hd = dom_iommu(d);
+    struct domain_iommu *hd = dom_iommu(d);
     unsigned long i;
     unsigned int order, j = 0;
     int rc = 0;
@@ -350,7 +411,7 @@ long iommu_map(struct domain *d, dfn_t dfn0, mfn_t mfn0,
             return i;
 
         rc = iommu_call(hd->platform_ops, map_page, d, dfn, mfn,
-                        flags | IOMMUF_order(order), flush_flags);
+                        flags | IOMMUF_order(order), flush_flags, ctx);
 
         if ( likely(!rc) )
             continue;
@@ -361,10 +422,10 @@ long iommu_map(struct domain *d, dfn_t dfn0, mfn_t mfn0,
                    d->domain_id, dfn_x(dfn), mfn_x(mfn), rc);
 
         /* while statement to satisfy __must_check */
-        while ( iommu_unmap(d, dfn0, i, 0, flush_flags) )
+        while ( iommu_unmap(d, dfn0, i, 0, flush_flags, ctx->id) )
             break;
 
-        if ( !is_hardware_domain(d) )
+        if ( !ctx->id && !is_hardware_domain(d) )
             domain_crash(d);
 
         break;
@@ -375,43 +436,67 @@ long iommu_map(struct domain *d, dfn_t dfn0, mfn_t mfn0,
      * page, flush everything and clear flush flags.
      */
     if ( page_count > 1 && unlikely(rc) &&
-         !iommu_iotlb_flush_all(d, *flush_flags) )
+         !iommu_iotlb_flush_all(d, ctx->id, *flush_flags) )
         *flush_flags = 0;
 
     return rc;
 }
 
+long iommu_map(struct domain *d, dfn_t dfn0, mfn_t mfn0,
+               unsigned long page_count, unsigned int flags,
+               unsigned int *flush_flags, u16 ctx_id)
+{
+    struct iommu_context *ctx;
+    long ret;
+
+    if ( !(ctx = iommu_get_context(d, ctx_id)) )
+        return -ENOENT;
+
+    ret = _iommu_map(d, dfn0, mfn0, page_count, flags, flush_flags, ctx);
+
+    iommu_put_context(ctx);
+
+    return ret;
+}
+
 int iommu_legacy_map(struct domain *d, dfn_t dfn, mfn_t mfn,
                      unsigned long page_count, unsigned int flags)
 {
+    struct iommu_context *ctx;
     unsigned int flush_flags = 0;
-    int rc;
+    int rc = 0;
 
     ASSERT(!(flags & IOMMUF_preempt));
-    rc = iommu_map(d, dfn, mfn, page_count, flags, &flush_flags);
 
-    if ( !this_cpu(iommu_dont_flush_iotlb) && !rc )
-        rc = iommu_iotlb_flush(d, dfn, page_count, flush_flags);
+    ctx = iommu_get_context(d, 0);
+
+    if ( !ctx->opaque )
+    {
+        rc = iommu_map(d, dfn, mfn, page_count, flags, &flush_flags, 0);
+
+        if ( !this_cpu(iommu_dont_flush_iotlb) && !rc )
+            rc = iommu_iotlb_flush(d, dfn, page_count, flush_flags, 0);
+    }
+
+    iommu_put_context(ctx);
 
     return rc;
 }
 
-long iommu_unmap(struct domain *d, dfn_t dfn0, unsigned long page_count,
-                 unsigned int flags, unsigned int *flush_flags)
+static long _iommu_unmap(struct domain *d, dfn_t dfn0, unsigned long page_count,
+                         unsigned int flags, unsigned int *flush_flags,
+                         struct iommu_context *ctx)
 {
-    const struct domain_iommu *hd = dom_iommu(d);
+    struct domain_iommu *hd = dom_iommu(d);
     unsigned long i;
     unsigned int order, j = 0;
     int rc = 0;
-    struct iommu_context *ctx;
 
     if ( !is_iommu_enabled(d) )
         return 0;
 
     ASSERT(!(flags & ~IOMMUF_preempt));
 
-    ctx = iommu_default_context(d);
-
     for ( i = 0; i < page_count; i += 1UL << order )
     {
         dfn_t dfn = dfn_add(dfn0, i);
@@ -425,7 +510,8 @@ long iommu_unmap(struct domain *d, dfn_t dfn0, unsigned long page_count,
             return i;
 
         err = iommu_call(hd->platform_ops, unmap_page, d, dfn,
-                         flags | IOMMUF_order(order), flush_flags);
+                         flags | IOMMUF_order(order), flush_flags,
+                         ctx);
 
         if ( likely(!err) )
             continue;
@@ -438,7 +524,7 @@ long iommu_unmap(struct domain *d, dfn_t dfn0, unsigned long page_count,
         if ( !rc )
             rc = err;
 
-        if ( !is_hardware_domain(d) )
+        if ( !ctx->id && !is_hardware_domain(d) )
         {
             domain_crash(d);
             break;
@@ -450,41 +536,74 @@ long iommu_unmap(struct domain *d, dfn_t dfn0, unsigned long page_count,
      * page, flush everything and clear flush flags.
      */
     if ( page_count > 1 && unlikely(rc) &&
-         !iommu_iotlb_flush_all(d, *flush_flags) )
+         !iommu_iotlb_flush_all(d, ctx->id, *flush_flags) )
         *flush_flags = 0;
 
     return rc;
 }
 
+long iommu_unmap(struct domain *d, dfn_t dfn0, unsigned long page_count,
+                 unsigned int flags, unsigned int *flush_flags,
+                 u16 ctx_id)
+{
+    struct iommu_context *ctx;
+    long ret;
+
+    if ( !(ctx = iommu_get_context(d, ctx_id)) )
+        return -ENOENT;
+
+    ret = _iommu_unmap(d, dfn0, page_count, flags, flush_flags, ctx);
+
+    iommu_put_context(ctx);
+
+    return ret;
+}
+
 int iommu_legacy_unmap(struct domain *d, dfn_t dfn, unsigned long page_count)
 {
     unsigned int flush_flags = 0;
-    int rc = iommu_unmap(d, dfn, page_count, 0, &flush_flags);
+    struct iommu_context *ctx;
+    int rc = 0;
+
+    ctx = iommu_get_context(d, 0);
 
-    if ( !this_cpu(iommu_dont_flush_iotlb) && !rc )
-        rc = iommu_iotlb_flush(d, dfn, page_count, flush_flags);
+    if ( !ctx->opaque )
+    {
+        rc = iommu_unmap(d, dfn, page_count, 0, &flush_flags, 0);
+
+        if ( !this_cpu(iommu_dont_flush_iotlb) && !rc )
+            rc = iommu_iotlb_flush(d, dfn, page_count, flush_flags, 0);
+    }
+
+    iommu_put_context(ctx);
 
     return rc;
 }
 
 int iommu_lookup_page(struct domain *d, dfn_t dfn, mfn_t *mfn,
-                      unsigned int *flags)
+                      unsigned int *flags, u16 ctx_id)
 {
-    const struct domain_iommu *hd = dom_iommu(d);
+    struct domain_iommu *hd = dom_iommu(d);
     struct iommu_context *ctx;
+    int ret;
 
     if ( !is_iommu_enabled(d) || !hd->platform_ops->lookup_page )
         return -EOPNOTSUPP;
 
-    ctx = iommu_default_context(d);
+    if ( !(ctx = iommu_get_context(d, ctx_id)) )
+        return -ENOENT;
+
+    ret = iommu_call(hd->platform_ops, lookup_page, d, dfn, mfn, flags, ctx);
 
-    return iommu_call(hd->platform_ops, lookup_page, d, dfn, mfn, flags);
+    iommu_put_context(ctx);
+    return ret;
 }
 
 int iommu_iotlb_flush(struct domain *d, dfn_t dfn, unsigned long page_count,
-                      unsigned int flush_flags)
+                      unsigned int flush_flags, u16 ctx_id)
 {
-    const struct domain_iommu *hd = dom_iommu(d);
+    struct domain_iommu *hd = dom_iommu(d);
+    struct iommu_context *ctx;
     int rc;
 
     if ( !is_iommu_enabled(d) || !hd->platform_ops->iotlb_flush ||
@@ -494,7 +613,10 @@ int iommu_iotlb_flush(struct domain *d, dfn_t dfn, unsigned long page_count,
     if ( dfn_eq(dfn, INVALID_DFN) )
         return -EINVAL;
 
-    rc = iommu_call(hd->platform_ops, iotlb_flush, d, dfn, page_count,
+    if ( !(ctx = iommu_get_context(d, ctx_id)) )
+        return -ENOENT;
+
+    rc = iommu_call(hd->platform_ops, iotlb_flush, d, ctx, dfn, page_count,
                     flush_flags);
     if ( unlikely(rc) )
     {
@@ -503,23 +625,29 @@ int iommu_iotlb_flush(struct domain *d, dfn_t dfn, unsigned long page_count,
                    "d%d: IOMMU IOTLB flush failed: %d, dfn %"PRI_dfn", page count %lu flags %x\n",
                    d->domain_id, rc, dfn_x(dfn), page_count, flush_flags);
 
-        if ( !is_hardware_domain(d) )
+        if ( !ctx->id && !is_hardware_domain(d) )
             domain_crash(d);
     }
 
+    iommu_put_context(ctx);
+
     return rc;
 }
 
-int iommu_iotlb_flush_all(struct domain *d, unsigned int flush_flags)
+int iommu_iotlb_flush_all(struct domain *d, u16 ctx_id, unsigned int flush_flags)
 {
-    const struct domain_iommu *hd = dom_iommu(d);
+    struct domain_iommu *hd = dom_iommu(d);
+    struct iommu_context *ctx;
     int rc;
 
     if ( !is_iommu_enabled(d) || !hd->platform_ops->iotlb_flush ||
          !flush_flags )
         return 0;
 
-    rc = iommu_call(hd->platform_ops, iotlb_flush, d, INVALID_DFN, 0,
+    if ( !(ctx = iommu_get_context(d, ctx_id)) )
+        return -ENOENT;
+
+    rc = iommu_call(hd->platform_ops, iotlb_flush, d, ctx, _dfn(0), 0,
                     flush_flags | IOMMU_FLUSHF_all);
     if ( unlikely(rc) )
     {
@@ -532,21 +660,409 @@ int iommu_iotlb_flush_all(struct domain *d, unsigned int flush_flags)
             domain_crash(d);
     }
 
+    iommu_put_context(ctx);
     return rc;
 }
 
+int cf_check iommu_context_init(struct domain *d, struct iommu_context *ctx, u16 ctx_id,
+                       u32 flags)
+{
+    if ( !dom_iommu(d)->platform_ops->context_init )
+        return -ENOSYS;
+
+    INIT_LIST_HEAD(&ctx->devices);
+    ctx->id = ctx_id;
+    ctx->dying = false;
+    ctx->opaque = false; /* assume non-opaque by default */
+
+    return iommu_call(dom_iommu(d)->platform_ops, context_init, d, ctx, flags);
+}
+
+int iommu_context_alloc(struct domain *d, u16 *ctx_id, u32 flags)
+{
+    unsigned int i;
+    int ret;
+    struct domain_iommu *hd = dom_iommu(d);
+    struct iommu_context *ctx;
+
+    do {
+        i = find_first_zero_bit(hd->other_contexts.bitmap, hd->other_contexts.count);
+
+        if ( i >= hd->other_contexts.count )
+            return -ENOSPC;
+
+        ctx = &hd->other_contexts.map[i];
+
+        /* Try to lock the context, retrying the search on concurrent accesses. */
+    } while ( !rspin_trylock(&ctx->lock) );
+
+    /*
+     * We can now set it as used; the lock is kept for initialization.
+     */
+    set_bit(i, hd->other_contexts.bitmap);
+
+    *ctx_id = i + 1;
+
+    ret = iommu_context_init(d, ctx, *ctx_id, flags);
+
+    if ( ret )
+        clear_bit(i, hd->other_contexts.bitmap);
+
+    iommu_put_context(ctx);
+    return ret;
+}
+
+/**
+ * Attach dev phantom functions to ctx, override any existing
+ * mapped context.
+ */
+static int cf_check iommu_reattach_phantom(struct domain *d, device_t *dev,
+                                  struct iommu_context *ctx)
+{
+    int ret = 0;
+    uint8_t devfn = dev->devfn;
+    struct domain_iommu *hd = dom_iommu(d);
+
+    while ( dev->phantom_stride )
+    {
+        devfn += dev->phantom_stride;
+
+        if ( PCI_SLOT(devfn) != PCI_SLOT(dev->devfn) )
+            break;
+
+        ret = iommu_call(hd->platform_ops, add_devfn, d, dev, devfn, ctx);
+
+        if ( ret )
+            break;
+    }
+
+    return ret;
+}
+
+/**
+ * Detach all device phantom functions.
+ */
+static int cf_check iommu_detach_phantom(struct domain *d, device_t *dev)
+{
+    int ret = 0;
+    uint8_t devfn = dev->devfn;
+    struct domain_iommu *hd = dom_iommu(d);
+
+    while ( dev->phantom_stride )
+    {
+        devfn += dev->phantom_stride;
+
+        if ( PCI_SLOT(devfn) != PCI_SLOT(dev->devfn) )
+            break;
+
+        ret = iommu_call(hd->platform_ops, remove_devfn, d, dev, devfn);
+
+        if ( ret )
+            break;
+    }
+
+    return ret;
+}
+
+int cf_check iommu_attach_context(struct domain *d, device_t *dev, u16 ctx_id)
+{
+    struct iommu_context *ctx = NULL;
+    int ret, rc;
+
+    pcidevs_lock();
+
+    if ( !(ctx = iommu_get_context(d, ctx_id)) )
+    {
+        ret = -ENOENT;
+        goto unlock;
+    }
+
+    if ( ctx->dying )
+    {
+        ret = -EINVAL;
+        goto unlock;
+    }
+
+    ret = iommu_call(dom_iommu(d)->platform_ops, attach, d, dev, ctx);
+
+    if ( ret )
+        goto unlock;
+
+    /* See iommu_reattach_context() */
+    rc = iommu_reattach_phantom(d, dev, ctx);
+
+    if ( rc )
+    {
+        printk(XENLOG_ERR "IOMMU: Unable to attach %pp phantom functions\n",
+               &dev->sbdf);
+
+        if ( iommu_call(dom_iommu(d)->platform_ops, detach, d, dev, ctx)
+            || iommu_detach_phantom(d, dev) )
+        {
+            printk(XENLOG_ERR "IOMMU: Improperly detached %pp\n", &dev->sbdf);
+            WARN();
+        }
+
+        ret = -EIO;
+        goto unlock;
+    }
+
+    dev->context = ctx_id;
+    list_add(&dev->context_list, &ctx->devices);
+
+unlock:
+    pcidevs_unlock();
+
+    if ( ctx )
+        iommu_put_context(ctx);
+
+    return ret;
+}
+
+int cf_check iommu_detach_context(struct domain *d, device_t *dev)
+{
+    struct iommu_context *ctx;
+    int ret, rc;
+
+    if ( !dev->domain )
+    {
+        printk(XENLOG_WARNING "IOMMU: Trying to detach a non-attached device\n");
+        WARN();
+        return 0;
+    }
+
+    /* Make sure device is actually in the domain. */
+    ASSERT(d == dev->domain);
+
+    pcidevs_lock();
+
+    ctx = iommu_get_context(d, dev->context);
+    ASSERT(ctx); /* device is using an invalid context ?
+                    dev->context invalid ? */
+
+    ret = iommu_call(dom_iommu(d)->platform_ops, detach, d, dev, ctx);
+
+    if ( ret )
+        goto unlock;
+
+    rc = iommu_detach_phantom(d, dev);
+
+    if ( rc )
+        printk(XENLOG_WARNING "IOMMU: "
+               "Improperly detached device functions (%d)\n", rc);
+
+    list_del(&dev->context_list);
+
+unlock:
+    pcidevs_unlock();
+    iommu_put_context(ctx);
+    return ret;
+}
+
+int cf_check iommu_reattach_context(struct domain *prev_dom, struct domain *next_dom,
+                           device_t *dev, u16 ctx_id)
+{
+    u16 prev_ctx_id;
+    device_t *ctx_dev;
+    struct domain_iommu *prev_hd, *next_hd;
+    struct iommu_context *prev_ctx = NULL, *next_ctx = NULL;
+    int ret, rc;
+    bool same_domain;
+
+    /* Make sure we actually are doing something meaningful */
+    BUG_ON(!prev_dom && !next_dom);
+
+    /* Device domain must be coherent with prev_dom. */
+    ASSERT(!prev_dom || dev->domain == prev_dom);
+
+    /// TODO: Do such cases exist?
+    // /* Platform ops must match */
+    // if (dom_iommu(prev_dom)->platform_ops != dom_iommu(next_dom)->platform_ops)
+    //     return -EINVAL;
+
+    if ( !prev_dom )
+        return iommu_attach_context(next_dom, dev, ctx_id);
+
+    if ( !next_dom )
+        return iommu_detach_context(prev_dom, dev);
+
+    prev_hd = dom_iommu(prev_dom);
+    next_hd = dom_iommu(next_dom);
+
+    pcidevs_lock();
+
+    same_domain = prev_dom == next_dom;
+
+    prev_ctx_id = dev->context;
+
+    if ( same_domain && (ctx_id == prev_ctx_id) )
+    {
+        printk(XENLOG_DEBUG
+               "IOMMU: Reattaching %pp to same IOMMU context c%hu\n",
+               &dev->sbdf, ctx_id);
+        ret = 0;
+        goto unlock;
+    }
+
+    if ( !(prev_ctx = iommu_get_context(prev_dom, prev_ctx_id)) )
+    {
+        ret = -ENOENT;
+        goto unlock;
+    }
+
+    if ( !(next_ctx = iommu_get_context(next_dom, ctx_id)) )
+    {
+        ret = -ENOENT;
+        goto unlock;
+    }
+
+    if ( next_ctx->dying )
+    {
+        ret = -EINVAL;
+        goto unlock;
+    }
+
+    ret = iommu_call(prev_hd->platform_ops, reattach, next_dom, dev, prev_ctx,
+                     next_ctx);
+
+    if ( ret )
+        goto unlock;
+
+    /*
+     * We need to do special handling for phantom devices as they
+     * also use some other PCI functions behind the scenes.
+     */
+    rc = iommu_reattach_phantom(next_dom, dev, next_ctx);
+
+    if ( rc )
+    {
+        /*
+         * Device is only partially reattached: the primary function and
+         * possibly some phantom functions are attached to next_ctx, while
+         * the remaining phantom functions are still attached to prev_ctx.
+         */
+        printk(XENLOG_WARNING "IOMMU: "
+               "Device %pp improperly reattached due to phantom function"
+               " reattach failure between %dd%dc and %dd%dc (%d)\n",
+               &dev->sbdf, prev_dom->domain_id, prev_ctx->id,
+               next_dom->domain_id, next_ctx->id, rc);
+
+        /* Try reattaching to previous context, reverting into a consistent state. */
+        if ( iommu_call(prev_hd->platform_ops, reattach, prev_dom, dev, next_ctx,
+                        prev_ctx) || iommu_reattach_phantom(prev_dom, dev, prev_ctx) )
+        {
+            printk(XENLOG_ERR "Unable to reattach %pp back to %dd%dc\n",
+                   &dev->sbdf, prev_dom->domain_id, prev_ctx->id);
+
+            if ( !is_hardware_domain(prev_dom) )
+                domain_crash(prev_dom);
+
+            if ( prev_dom != next_dom && !is_hardware_domain(next_dom) )
+                domain_crash(next_dom);
+
+            rc = -EIO;
+        }
+
+        ret = rc;
+        goto unlock;
+    }
+
+    /* Remove device from previous context, and add it to new one. */
+    list_for_each_entry(ctx_dev, &prev_ctx->devices, context_list)
+    {
+        if ( ctx_dev == dev )
+        {
+            list_del(&ctx_dev->context_list);
+            list_add(&ctx_dev->context_list, &next_ctx->devices);
+            break;
+        }
+    }
+
+    if ( !ret )
+        dev->context = ctx_id; /* Update device context. */
+
+unlock:
+    pcidevs_unlock();
+
+    if ( prev_ctx )
+        iommu_put_context(prev_ctx);
+
+    if ( next_ctx )
+        iommu_put_context(next_ctx);
+
+    return ret;
+}
+
+int cf_check iommu_context_teardown(struct domain *d, struct iommu_context *ctx, u32 flags)
+{
+    struct domain_iommu *hd = dom_iommu(d);
+
+    if ( !hd->platform_ops->context_teardown )
+        return -ENOSYS;
+
+    ctx->dying = true;
+
+    /* first reattach devices back to default context if needed */
+    if ( flags & IOMMU_TEARDOWN_REATTACH_DEFAULT )
+    {
+        struct pci_dev *device, *tmp;
+        list_for_each_entry_safe(device, tmp, &ctx->devices, context_list)
+            iommu_reattach_context(d, d, device, 0);
+    }
+    else if ( !list_empty(&ctx->devices) )
+        return -EBUSY; /* there is a device in context */
+
+    return iommu_call(hd->platform_ops, context_teardown, d, ctx, flags);
+}
+
+int cf_check iommu_context_free(struct domain *d, u16 ctx_id, u32 flags)
+{
+    int ret;
+    struct domain_iommu *hd = dom_iommu(d);
+    struct iommu_context *ctx;
+
+    if ( ctx_id == 0 )
+        return -EINVAL;
+
+    if ( !(ctx = iommu_get_context(d, ctx_id)) )
+        return -ENOENT;
+
+    ret = iommu_context_teardown(d, ctx, flags);
+
+    if ( !ret )
+        clear_bit(ctx_id - 1, hd->other_contexts.bitmap);
+
+    iommu_put_context(ctx);
+    return ret;
+}
+
 int iommu_quarantine_dev_init(device_t *dev)
 {
-    const struct domain_iommu *hd = dom_iommu(dom_io);
+    int ret;
+    u16 ctx_id;
 
-    if ( !iommu_quarantine || !hd->platform_ops->quarantine_init )
+    if ( !iommu_quarantine )
         return 0;
 
-    return iommu_call(hd->platform_ops, quarantine_init,
-                      dev, iommu_quarantine == IOMMU_quarantine_scratch_page);
+    ret = iommu_context_alloc(dom_io, &ctx_id, IOMMU_CONTEXT_INIT_quarantine);
+
+    if ( ret )
+        return ret;
+
+    /** TODO: Setup scratch page, mappings... */
+
+    ret = iommu_reattach_context(dev->domain, dom_io, dev, ctx_id);
+
+    if ( ret && iommu_context_free(dom_io, ctx_id, 0) )
+    {
+        /* Freeing the unused quarantine context is not expected to fail. */
+        ASSERT_UNREACHABLE();
+    }
+
+    return ret;
 }
 
-static int __init iommu_quarantine_init(void)
+int __init iommu_quarantine_init(void)
 {
     dom_io->options |= XEN_DOMCTL_CDF_iommu;
 
diff --git a/xen/drivers/passthrough/pci.c b/xen/drivers/passthrough/pci.c
index e1ca74b477..56f65090fc 100644
--- a/xen/drivers/passthrough/pci.c
+++ b/xen/drivers/passthrough/pci.c
@@ -654,6 +654,101 @@ unsigned int pci_size_mem_bar(pci_sbdf_t sbdf, unsigned int pos,
     return is64bits ? 2 : 1;
 }
 
+static int device_assigned(struct pci_dev *pdev)
+{
+    int rc = 0;
+
+    /*
+     * If the device exists and it is not owned by either the hardware
+     * domain or dom_io then it must be assigned to a guest, or be
+     * hidden (owned by dom_xen).
+     */
+    if ( pdev->domain != hardware_domain && pdev->domain != dom_io )
+        rc = -EBUSY;
+
+    return rc;
+}
+
+/* Caller should hold the pcidevs_lock */
+static int pci_reassign_device(struct domain *prev_dom, struct domain *next_dom,
+                               struct pci_dev *pdev, u32 flag)
+{
+    int rc = 0;
+    ASSERT(prev_dom || next_dom);
+
+    if ( !is_iommu_enabled(next_dom) )
+        return -EINVAL;
+
+    if ( !arch_iommu_use_permitted(next_dom) )
+        return -EXDEV;
+
+    /* Do not allow broken devices to be assigned to guests. */
+    if ( pdev->broken && next_dom != hardware_domain && next_dom != dom_io )
+        return -EBADF;
+
+    if ( prev_dom )
+    {
+        write_lock(&prev_dom->pci_lock);
+        vpci_deassign_device(pdev);
+        write_unlock(&prev_dom->pci_lock);
+    }
+
+    rc = pdev_msix_assign(next_dom, pdev);
+    if ( rc )
+        goto done;
+
+    pdev->fault.count = 0;
+
+    if ( prev_dom && next_dom )
+    {
+        printk(XENLOG_INFO "PCI: Reassigning PCI device from %dd to %dd\n",
+               prev_dom->domain_id, next_dom->domain_id);
+    }
+    else if ( prev_dom )
+    {
+        printk(XENLOG_INFO "PCI: Remove PCI device of %dd\n", prev_dom->domain_id);
+    }
+    else if ( next_dom )
+    {
+        printk(XENLOG_INFO "PCI: Assigning PCI device to %dd\n", next_dom->domain_id);
+    }
+    else
+    {
+        ASSERT_UNREACHABLE();
+    }
+
+    rc = iommu_reattach_context(prev_dom, next_dom, pci_to_dev(pdev), 0);
+
+    if ( rc )
+        goto done;
+
+    if ( prev_dom )
+    {
+        write_lock(&prev_dom->pci_lock);
+        list_del(&pdev->domain_list);
+        write_unlock(&prev_dom->pci_lock);
+    }
+
+    pdev->domain = next_dom;
+
+    if ( next_dom )
+    {
+        write_lock(&next_dom->pci_lock);
+        list_add(&pdev->domain_list, &next_dom->pdev_list);
+
+        rc = vpci_assign_device(pdev);
+        write_unlock(&next_dom->pci_lock);
+    }
+
+ done:
+
+    /* The device is assigned to dom_io so mark it as quarantined */
+    if ( !rc && next_dom == dom_io )
+        pdev->quarantine = true;
+
+    return rc;
+}
+
 int pci_add_device(u16 seg, u8 bus, u8 devfn,
                    const struct pci_dev_info *info, nodeid_t node)
 {
@@ -699,13 +794,30 @@ int pci_add_device(u16 seg, u8 bus, u8 devfn,
 
             if ( !pf_pdev )
             {
-                printk(XENLOG_WARNING
-                       "Attempted to add SR-IOV VF %pp without PF %pp\n",
-                       &pdev->sbdf,
-                       &PCI_SBDF(seg, info->physfn.bus, info->physfn.devfn));
-                free_pdev(pseg, pdev);
-                ret = -ENODEV;
-                goto out;
+                ret = pci_add_device(seg, info->physfn.bus, info->physfn.devfn,
+                                     NULL, node);
+                if ( ret )
+                {
+                    printk(XENLOG_WARNING
+                           "Failed to add SR-IOV device PF %pp for VF %pp\n",
+                           &PCI_SBDF(seg, info->physfn.bus, info->physfn.devfn),
+                           &pdev->sbdf);
+                    free_pdev(pseg, pdev);
+                    goto out;
+                }
+                pf_pdev = pci_get_pdev(NULL, PCI_SBDF(seg, info->physfn.bus,
+                                                      info->physfn.devfn));
+                if ( !pf_pdev )
+                {
+                    printk(XENLOG_ERR
+                           "Inconsistent PCI state: failed to find newly added PF %pp for VF %pp\n",
+                           &PCI_SBDF(seg, info->physfn.bus, info->physfn.devfn),
+                           &pdev->sbdf);
+                    ASSERT_UNREACHABLE();
+                    free_pdev(pseg, pdev);
+                    ret = -EILSEQ;
+                    goto out;
+                }
             }
 
             if ( !pdev->pf_pdev )
@@ -877,74 +989,6 @@ int pci_remove_device(u16 seg, u8 bus, u8 devfn)
     return ret;
 }
 
-/* Caller should hold the pcidevs_lock */
-static int deassign_device(struct domain *d, uint16_t seg, uint8_t bus,
-                           uint8_t devfn)
-{
-    const struct domain_iommu *hd = dom_iommu(d);
-    struct pci_dev *pdev;
-    struct domain *target;
-    int ret = 0;
-
-    if ( !is_iommu_enabled(d) )
-        return -EINVAL;
-
-    ASSERT(pcidevs_locked());
-    pdev = pci_get_pdev(d, PCI_SBDF(seg, bus, devfn));
-    if ( !pdev )
-        return -ENODEV;
-
-    /* De-assignment from dom_io should de-quarantine the device */
-    if ( (pdev->quarantine || iommu_quarantine) && pdev->domain != dom_io )
-    {
-        ret = iommu_quarantine_dev_init(pci_to_dev(pdev));
-        if ( ret )
-           return ret;
-
-        target = dom_io;
-    }
-    else
-        target = hardware_domain;
-
-    while ( pdev->phantom_stride )
-    {
-        devfn += pdev->phantom_stride;
-        if ( PCI_SLOT(devfn) != PCI_SLOT(pdev->devfn) )
-            break;
-        ret = iommu_call(hd->platform_ops, reassign_device, d, target, devfn,
-                         pci_to_dev(pdev));
-        if ( ret )
-            goto out;
-    }
-
-    write_lock(&d->pci_lock);
-    vpci_deassign_device(pdev);
-    write_unlock(&d->pci_lock);
-
-    devfn = pdev->devfn;
-    ret = iommu_call(hd->platform_ops, reassign_device, d, target, devfn,
-                     pci_to_dev(pdev));
-    if ( ret )
-        goto out;
-
-    if ( pdev->domain == hardware_domain  )
-        pdev->quarantine = false;
-
-    pdev->fault.count = 0;
-
-    write_lock(&target->pci_lock);
-    /* Re-assign back to hardware_domain */
-    ret = vpci_assign_device(pdev);
-    write_unlock(&target->pci_lock);
-
- out:
-    if ( ret )
-        printk(XENLOG_G_ERR "%pd: deassign (%pp) failed (%d)\n",
-               d, &PCI_SBDF(seg, bus, devfn), ret);
-
-    return ret;
-}
-
 int pci_release_devices(struct domain *d)
 {
     int combined_ret;
@@ -966,13 +1010,10 @@ int pci_release_devices(struct domain *d)
         struct pci_dev *pdev = list_first_entry(&d->pdev_list,
                                                 struct pci_dev,
                                                 domain_list);
-        uint16_t seg = pdev->seg;
-        uint8_t bus = pdev->bus;
-        uint8_t devfn = pdev->devfn;
         int ret;
 
         write_unlock(&d->pci_lock);
-        ret = deassign_device(d, seg, bus, devfn);
+        ret = pci_reassign_device(d, dom_io, pdev, 0);
         write_lock(&d->pci_lock);
         if ( ret )
         {
@@ -1180,25 +1221,18 @@ struct setup_hwdom {
 static void __hwdom_init setup_one_hwdom_device(const struct setup_hwdom *ctxt,
                                                 struct pci_dev *pdev)
 {
-    u8 devfn = pdev->devfn;
     int err;
 
-    do {
-        err = ctxt->handler(devfn, pdev);
-        if ( err )
-        {
-            printk(XENLOG_ERR "setup %pp for d%d failed (%d)\n",
-                   &pdev->sbdf, ctxt->d->domain_id, err);
-            if ( devfn == pdev->devfn )
-                return;
-        }
-        devfn += pdev->phantom_stride;
-    } while ( devfn != pdev->devfn &&
-              PCI_SLOT(devfn) == PCI_SLOT(pdev->devfn) );
+    err = ctxt->handler(pdev->devfn, pdev);
+
+    if ( err )
+        goto done;
 
     write_lock(&ctxt->d->pci_lock);
     err = vpci_assign_device(pdev);
     write_unlock(&ctxt->d->pci_lock);
+
+done:
     if ( err )
         printk(XENLOG_ERR "setup of vPCI for d%d failed: %d\n",
                ctxt->d->domain_id, err);
@@ -1397,8 +1431,6 @@ __initcall(setup_dump_pcidevs);
 static int iommu_add_device(struct pci_dev *pdev)
 {
     const struct domain_iommu *hd;
-    int rc;
-    unsigned int devfn = pdev->devfn;
 
     if ( !pdev->domain )
         return -EINVAL;
@@ -1409,20 +1441,7 @@ static int iommu_add_device(struct pci_dev *pdev)
     if ( !is_iommu_enabled(pdev->domain) )
         return 0;
 
-    rc = iommu_call(hd->platform_ops, add_device, devfn, pci_to_dev(pdev));
-    if ( rc || !pdev->phantom_stride )
-        return rc;
-
-    for ( ; ; )
-    {
-        devfn += pdev->phantom_stride;
-        if ( PCI_SLOT(devfn) != PCI_SLOT(pdev->devfn) )
-            return 0;
-        rc = iommu_call(hd->platform_ops, add_device, devfn, pci_to_dev(pdev));
-        if ( rc )
-            printk(XENLOG_WARNING "IOMMU: add %pp failed (%d)\n",
-                   &PCI_SBDF(pdev->seg, pdev->bus, devfn), rc);
-    }
+    return iommu_attach_context(pdev->domain, pci_to_dev(pdev), 0);
 }
 
 static int iommu_enable_device(struct pci_dev *pdev)
@@ -1444,145 +1463,13 @@ static int iommu_enable_device(struct pci_dev *pdev)
 
 static int iommu_remove_device(struct pci_dev *pdev)
 {
-    const struct domain_iommu *hd;
-    u8 devfn;
-
     if ( !pdev->domain )
         return -EINVAL;
 
-    hd = dom_iommu(pdev->domain);
     if ( !is_iommu_enabled(pdev->domain) )
         return 0;
 
-    for ( devfn = pdev->devfn ; pdev->phantom_stride; )
-    {
-        int rc;
-
-        devfn += pdev->phantom_stride;
-        if ( PCI_SLOT(devfn) != PCI_SLOT(pdev->devfn) )
-            break;
-        rc = iommu_call(hd->platform_ops, remove_device, devfn,
-                        pci_to_dev(pdev));
-        if ( !rc )
-            continue;
-
-        printk(XENLOG_ERR "IOMMU: remove %pp failed (%d)\n",
-               &PCI_SBDF(pdev->seg, pdev->bus, devfn), rc);
-        return rc;
-    }
-
-    devfn = pdev->devfn;
-
-    return iommu_call(hd->platform_ops, remove_device, devfn, pci_to_dev(pdev));
-}
-
-static int device_assigned(u16 seg, u8 bus, u8 devfn)
-{
-    struct pci_dev *pdev;
-    int rc = 0;
-
-    ASSERT(pcidevs_locked());
-    pdev = pci_get_pdev(NULL, PCI_SBDF(seg, bus, devfn));
-
-    if ( !pdev )
-        rc = -ENODEV;
-    /*
-     * If the device exists and it is not owned by either the hardware
-     * domain or dom_io then it must be assigned to a guest, or be
-     * hidden (owned by dom_xen).
-     */
-    else if ( pdev->domain != hardware_domain &&
-              pdev->domain != dom_io )
-        rc = -EBUSY;
-
-    return rc;
-}
-
-/* Caller should hold the pcidevs_lock */
-static int assign_device(struct domain *d, u16 seg, u8 bus, u8 devfn, u32 flag)
-{
-    const struct domain_iommu *hd = dom_iommu(d);
-    struct pci_dev *pdev;
-    int rc = 0;
-
-    if ( !is_iommu_enabled(d) )
-        return 0;
-
-    if ( !arch_iommu_use_permitted(d) )
-        return -EXDEV;
-
-    /* device_assigned() should already have cleared the device for assignment */
-    ASSERT(pcidevs_locked());
-    pdev = pci_get_pdev(NULL, PCI_SBDF(seg, bus, devfn));
-    ASSERT(pdev && (pdev->domain == hardware_domain ||
-                    pdev->domain == dom_io));
-
-    /* Do not allow broken devices to be assigned to guests. */
-    rc = -EBADF;
-    if ( pdev->broken && d != hardware_domain && d != dom_io )
-        goto done;
-
-    write_lock(&pdev->domain->pci_lock);
-    vpci_deassign_device(pdev);
-    write_unlock(&pdev->domain->pci_lock);
-
-    rc = pdev_msix_assign(d, pdev);
-    if ( rc )
-        goto done;
-
-    if ( pdev->domain != dom_io )
-    {
-        rc = iommu_quarantine_dev_init(pci_to_dev(pdev));
-        if ( rc )
-            goto done;
-    }
-
-    pdev->fault.count = 0;
-
-    rc = iommu_call(hd->platform_ops, assign_device, d, devfn, pci_to_dev(pdev),
-                    flag);
-
-    while ( pdev->phantom_stride && !rc )
-    {
-        devfn += pdev->phantom_stride;
-        if ( PCI_SLOT(devfn) != PCI_SLOT(pdev->devfn) )
-            break;
-        rc = iommu_call(hd->platform_ops, assign_device, d, devfn,
-                        pci_to_dev(pdev), flag);
-    }
-
-    if ( rc )
-        goto done;
-
-    write_lock(&d->pci_lock);
-    rc = vpci_assign_device(pdev);
-    write_unlock(&d->pci_lock);
-
- done:
-    if ( rc )
-    {
-        printk(XENLOG_G_WARNING "%pd: assign %s(%pp) failed (%d)\n",
-               d, devfn != pdev->devfn ? "phantom function " : "",
-               &PCI_SBDF(seg, bus, devfn), rc);
-
-        if ( devfn != pdev->devfn && deassign_device(d, seg, bus, pdev->devfn) )
-        {
-            /*
-             * Device with phantom functions that failed to both assign and
-             * rollback.  Mark the device as broken and crash the target domain,
-             * as the state of the functions at this point is unknown and Xen
-             * has no way to assert consistent context assignment among them.
-             */
-            pdev->broken = true;
-            if ( !is_hardware_domain(d) && d != dom_io )
-                domain_crash(d);
-        }
-    }
-    /* The device is assigned to dom_io so mark it as quarantined */
-    else if ( d == dom_io )
-        pdev->quarantine = true;
-
-    return rc;
+    return iommu_detach_context(pdev->domain, pci_to_dev(pdev));
 }
 
 static int iommu_get_device_group(
@@ -1672,6 +1559,7 @@ int iommu_do_pci_domctl(
     u8 bus, devfn;
     int ret = 0;
     uint32_t machine_sbdf;
+    struct pci_dev *pdev;
 
     switch ( domctl->cmd )
     {
@@ -1741,7 +1629,15 @@ int iommu_do_pci_domctl(
         devfn = PCI_DEVFN(machine_sbdf);
 
         pcidevs_lock();
-        ret = device_assigned(seg, bus, devfn);
+        pdev = pci_get_pdev(NULL, PCI_SBDF(seg, bus, devfn));
+        if ( !pdev )
+        {
+            printk(XENLOG_G_INFO "%pp doesn't exist\n", &PCI_SBDF(seg, bus, devfn));
+            pcidevs_unlock();
+            ret = -ENODEV;
+            break;
+        }
+        ret = device_assigned(pdev);
         if ( domctl->cmd == XEN_DOMCTL_test_assign_device )
         {
             if ( ret )
@@ -1752,7 +1648,7 @@ int iommu_do_pci_domctl(
             }
         }
         else if ( !ret )
-            ret = assign_device(d, seg, bus, devfn, flags);
+            ret = pci_reassign_device(pdev->domain, d, pdev, flags);
         pcidevs_unlock();
         if ( ret == -ERESTART )
             ret = hypercall_create_continuation(__HYPERVISOR_domctl,
@@ -1786,7 +1682,20 @@ int iommu_do_pci_domctl(
         devfn = PCI_DEVFN(machine_sbdf);
 
         pcidevs_lock();
-        ret = deassign_device(d, seg, bus, devfn);
+        pdev = pci_get_pdev(d, PCI_SBDF(seg, bus, devfn));
+
+        if ( pdev )
+        {
+            struct domain *target = hardware_domain;
+
+            if ( (pdev->quarantine || iommu_quarantine) && pdev->domain != dom_io )
+                target = dom_io;
+
+            ret = pci_reassign_device(d, target, pdev, 0);
+        }
+        else
+            ret = -ENODEV;
+
         pcidevs_unlock();
         break;
 
diff --git a/xen/drivers/passthrough/vtd/extern.h b/xen/drivers/passthrough/vtd/extern.h
index 82db8f9435..a980be3646 100644
--- a/xen/drivers/passthrough/vtd/extern.h
+++ b/xen/drivers/passthrough/vtd/extern.h
@@ -78,12 +78,12 @@ uint64_t alloc_pgtable_maddr(unsigned long npages, nodeid_t node);
 void free_pgtable_maddr(u64 maddr);
 void *map_vtd_domain_page(u64 maddr);
 void unmap_vtd_domain_page(const void *va);
-int domain_context_mapping_one(struct domain *domain, struct iommu_context *ctx,
-                               struct vtd_iommu *iommu, uint8_t bus, uint8_t devfn,
-                               const struct pci_dev *pdev, domid_t domid,
-                               paddr_t pgd_maddr, unsigned int mode);
-int domain_context_unmap_one(struct domain *domain, struct vtd_iommu *iommu,
-                             uint8_t bus, uint8_t devfn);
+int apply_context_single(struct domain *domain, struct iommu_context *ctx,
+                         struct vtd_iommu *iommu, uint8_t bus, uint8_t devfn,
+                         struct iommu_context *prev_ctx);
+int unapply_context_single(struct domain *domain, struct vtd_iommu *iommu,
+                           struct iommu_context *prev_ctx, uint8_t bus,
+                           uint8_t devfn);
 int cf_check intel_iommu_get_reserved_device_memory(
     iommu_grdm_t *func, void *ctxt);
 
@@ -104,8 +104,9 @@ void platform_quirks_init(void);
 void vtd_ops_preamble_quirk(struct vtd_iommu *iommu);
 void vtd_ops_postamble_quirk(struct vtd_iommu *iommu);
 int __must_check me_wifi_quirk(struct domain *domain, uint8_t bus,
-                               uint8_t devfn, domid_t domid, paddr_t pgd_maddr,
-                               unsigned int mode);
+                               uint8_t devfn, domid_t domid,
+                               unsigned int mode, struct iommu_context *ctx,
+                               struct iommu_context *prev_ctx);
 void pci_vtd_quirk(const struct pci_dev *);
 void quirk_iommu_caps(struct vtd_iommu *iommu);
 
diff --git a/xen/drivers/passthrough/vtd/iommu.c b/xen/drivers/passthrough/vtd/iommu.c
index 34b2a287f7..bb53cff158 100644
--- a/xen/drivers/passthrough/vtd/iommu.c
+++ b/xen/drivers/passthrough/vtd/iommu.c
@@ -27,6 +27,7 @@
 #include <xen/iommu.h>
 #include <xen/numa.h>
 #include <xen/softirq.h>
+#include <xen/event.h>
 #include <xen/time.h>
 #include <xen/pci.h>
 #include <xen/pci_regs.h>
@@ -61,7 +62,6 @@ static unsigned int __ro_after_init min_pt_levels = UINT_MAX;
 static struct tasklet vtd_fault_tasklet;
 
 static int cf_check setup_hwdom_device(u8 devfn, struct pci_dev *);
-static void setup_hwdom_rmrr(struct domain *d);
 
 #define DID_FIELD_WIDTH 16
 #define DID_HIGH_OFFSET 8
@@ -165,7 +165,7 @@ static uint64_t addr_to_dma_page_maddr(struct domain *domain,
     u64 pte_maddr = 0;
 
     addr &= (((u64)1) << addr_width) - 1;
-    ASSERT(spin_is_locked(&ctx->arch.mapping_lock));
+    ASSERT(rspin_is_locked(&ctx->lock));
     ASSERT(target || !alloc);
 
     if ( !ctx->arch.vtd.pgd_maddr )
@@ -270,36 +270,22 @@ static uint64_t addr_to_dma_page_maddr(struct domain *domain,
     return pte_maddr;
 }
 
-static paddr_t domain_pgd_maddr(struct domain *d, struct iommu_context *ctx,
-                                paddr_t pgd_maddr, unsigned int nr_pt_levels)
+static paddr_t get_context_pgd(struct domain *d, struct iommu_context *ctx,
+                               unsigned int nr_pt_levels)
 {
     unsigned int agaw;
+    paddr_t pgd_maddr = ctx->arch.vtd.pgd_maddr;
 
-    ASSERT(spin_is_locked(&ctx->arch.mapping_lock));
-
-    if ( pgd_maddr )
-        /* nothing */;
-    else if ( iommu_use_hap_pt(d) )
+    if ( !ctx->arch.vtd.pgd_maddr )
     {
-        pagetable_t pgt = p2m_get_pagetable(p2m_get_hostp2m(d));
+        /*
+         * Ensure we have pagetables allocated down to the smallest
+         * level the loop below may need to run to.
+         */
+        addr_to_dma_page_maddr(d, ctx, 0, min_pt_levels, NULL, true);
 
-        pgd_maddr = pagetable_get_paddr(pgt);
-    }
-    else
-    {
         if ( !ctx->arch.vtd.pgd_maddr )
-        {
-            /*
-             * Ensure we have pagetables allocated down to the smallest
-             * level the loop below may need to run to.
-             */
-            addr_to_dma_page_maddr(d, ctx, 0, min_pt_levels, NULL, true);
-
-            if ( !ctx->arch.vtd.pgd_maddr )
-                return 0;
-        }
-
-        pgd_maddr = ctx->arch.vtd.pgd_maddr;
+            return 0;
     }
 
     /* Skip top level(s) of page tables for less-than-maximum level DRHDs. */
@@ -568,17 +554,20 @@ static int __must_check iommu_flush_all(void)
     return rc;
 }
 
-static int __must_check cf_check iommu_flush_iotlb(struct domain *d, dfn_t dfn,
+static int __must_check cf_check iommu_flush_iotlb(struct domain *d,
+                                                   struct iommu_context *ctx,
+                                                   dfn_t dfn,
                                                    unsigned long page_count,
                                                    unsigned int flush_flags)
 {
-    struct iommu_context *ctx = iommu_default_context(d);
     struct acpi_drhd_unit *drhd;
     struct vtd_iommu *iommu;
     bool flush_dev_iotlb;
     int iommu_domid;
     int ret = 0;
 
+    ASSERT(ctx);
+
     if ( flush_flags & IOMMU_FLUSHF_all )
     {
         dfn = INVALID_DFN;
@@ -1239,7 +1228,8 @@ void __init iommu_free(struct acpi_drhd_unit *drhd)
         agaw = 64;                              \
     agaw; })
 
-static int cf_check intel_iommu_context_init(struct domain *d, struct iommu_context *ctx)
+static int cf_check intel_iommu_context_init(struct domain *d, struct iommu_context *ctx,
+                                             u32 flags)
 {
     struct acpi_drhd_unit *drhd;
 
@@ -1254,6 +1244,27 @@ static int cf_check intel_iommu_context_init(struct domain *d, struct iommu_cont
         return -ENOMEM;
     }
 
+    ctx->arch.vtd.superpage_progress = 0;
+
+    if ( flags & IOMMU_CONTEXT_INIT_default )
+    {
+        ctx->arch.vtd.pgd_maddr = 0;
+
+        /*
+         * Context is considered "opaque" (non-managed) in these cases :
+         *  - HAP is enabled, in this case, the pagetable is not managed by the
+         *    IOMMU code, thus opaque
+         *  - IOMMU is in passthrough which means that there is no actual pagetable
+         */
+        if ( iommu_use_hap_pt(d) )
+        {
+            pagetable_t pgt = p2m_get_pagetable(p2m_get_hostp2m(d));
+            ctx->arch.vtd.pgd_maddr = pagetable_get_paddr(pgt);
+
+            ctx->opaque = true;
+        }
+    }
+
     // TODO: Allocate IOMMU domid only when attaching devices ?
     /* Populate context DID map using pseudo DIDs */
     for_each_drhd_unit(drhd)
@@ -1262,7 +1273,11 @@ static int cf_check intel_iommu_context_init(struct domain *d, struct iommu_cont
             iommu_alloc_domid(drhd->iommu->domid_bitmap);
     }
 
-    return arch_iommu_context_init(d, ctx, 0);
+    if ( !ctx->opaque )
+        /* Create initial context page */
+        addr_to_dma_page_maddr(d, ctx, 0, min_pt_levels, NULL, true);
+
+    return arch_iommu_context_init(d, ctx, flags);
 }
 
 static int cf_check intel_iommu_domain_init(struct domain *d)
@@ -1271,7 +1286,7 @@ static int cf_check intel_iommu_domain_init(struct domain *d)
 
     hd->arch.vtd.agaw = width_to_agaw(DEFAULT_DOMAIN_ADDRESS_WIDTH);
 
-    return intel_iommu_context_init(d, iommu_default_context(d));
+    return 0;
 }
 
 static void __hwdom_init cf_check intel_iommu_hwdom_init(struct domain *d)
@@ -1279,7 +1294,7 @@ static void __hwdom_init cf_check intel_iommu_hwdom_init(struct domain *d)
     struct acpi_drhd_unit *drhd;
 
     setup_hwdom_pci_devices(d, setup_hwdom_device);
-    setup_hwdom_rmrr(d);
+
     /* Make sure workarounds are applied before enabling the IOMMU(s). */
     arch_iommu_hwdom_init(d);
 
@@ -1296,21 +1311,17 @@ static void __hwdom_init cf_check intel_iommu_hwdom_init(struct domain *d)
     }
 }
 
-/*
- * This function returns
- * - a negative errno value upon error,
- * - zero upon success when previously the entry was non-present, or this isn't
- *   the "main" request for a device (pdev == NULL), or for no-op quarantining
- *   assignments,
- * - positive (one) upon success when previously the entry was present and this
- *   is the "main" request for a device (pdev != NULL).
+/**
+ * Apply a context on a device.
+ * @param domain Domain of the context
+ * @param ctx IOMMU context to apply
+ * @param iommu IOMMU hardware to use (must match device iommu)
+ * @param bus, devfn PCI bus and device/function numbers of the device
+ * @param prev_ctx IOMMU context previously applied to the device
  */
-int domain_context_mapping_one(
-    struct domain *domain,
-    struct iommu_context *ctx,
-    struct vtd_iommu *iommu,
-    uint8_t bus, uint8_t devfn, const struct pci_dev *pdev,
-    domid_t domid, paddr_t pgd_maddr, unsigned int mode)
+int apply_context_single(struct domain *domain, struct iommu_context *ctx,
+                         struct vtd_iommu *iommu, uint8_t bus, uint8_t devfn,
+                         struct iommu_context *prev_ctx)
 {
     struct context_entry *context, *context_entries, lctxt;
     __uint128_t res, old;
@@ -1319,8 +1330,6 @@ int domain_context_mapping_one(
     int rc, ret;
     bool flush_dev_iotlb, overwrite_entry = false;
 
-    struct iommu_context *prev_ctx = pdev->domain ? iommu_default_context(pdev->domain) : NULL;
-
     ASSERT(pcidevs_locked());
     spin_lock(&iommu->lock);
     maddr = bus_to_context_maddr(iommu, bus);
@@ -1336,7 +1345,7 @@ int domain_context_mapping_one(
         overwrite_entry = true;
     }
 
-    if ( iommu_hwdom_passthrough && is_hardware_domain(domain) )
+    if ( iommu_hwdom_passthrough && is_hardware_domain(domain) && !ctx->id )
     {
         context_set_translation_type(lctxt, CONTEXT_TT_PASS_THRU);
     }
@@ -1344,9 +1353,7 @@ int domain_context_mapping_one(
     {
         paddr_t root;
 
-        spin_lock(&ctx->arch.mapping_lock);
-
-        root = domain_pgd_maddr(domain, ctx, pgd_maddr, iommu->nr_pt_levels);
+        root = get_context_pgd(domain, ctx, iommu->nr_pt_levels);
         if ( !root )
         {
             unmap_vtd_domain_page(context_entries);
@@ -1358,8 +1365,6 @@ int domain_context_mapping_one(
             context_set_translation_type(lctxt, CONTEXT_TT_DEV_IOTLB);
         else
             context_set_translation_type(lctxt, CONTEXT_TT_MULTI_LEVEL);
-
-        spin_unlock(&ctx->arch.mapping_lock);
     }
 
     rc = context_set_domain_id(&lctxt, did, iommu);
@@ -1388,7 +1393,6 @@ int domain_context_mapping_one(
     }
 
     iommu_sync_cache(context, sizeof(struct context_entry));
-    spin_unlock(&iommu->lock);
 
     rc = iommu_flush_context_device(iommu, prev_did, PCI_BDF(bus, devfn),
                                     DMA_CCMD_MASK_NOBIT, !overwrite_entry);
@@ -1422,7 +1426,7 @@ int domain_context_mapping_one(
     spin_unlock(&iommu->lock);
 
     if ( !seg && !rc )
-        rc = me_wifi_quirk(domain, bus, devfn, domid, pgd_maddr, mode);
+        rc = me_wifi_quirk(domain, bus, devfn, did, 0, ctx, prev_ctx);
 
     return rc;
 
@@ -1432,152 +1436,32 @@ int domain_context_mapping_one(
         return rc;
 }
 
-static const struct acpi_drhd_unit *domain_context_unmap(
-    struct domain *d, uint8_t devfn, struct pci_dev *pdev);
-
-static int domain_context_mapping(struct domain *domain, struct iommu_context *ctx,
-                                  u8 devfn, struct pci_dev *pdev)
+int apply_context(struct domain *d, struct iommu_context *ctx,
+                  struct pci_dev *pdev, u8 devfn,
+                  struct iommu_context *prev_ctx)
 {
-    const struct acpi_drhd_unit *drhd = acpi_find_matched_drhd_unit(pdev);
-    const struct acpi_rmrr_unit *rmrr;
-    paddr_t pgd_maddr = ctx->arch.vtd.pgd_maddr;
-    domid_t did = ctx->arch.vtd.didmap[drhd->iommu->index];
+    struct acpi_drhd_unit *drhd = acpi_find_matched_drhd_unit(pdev);
+    struct vtd_iommu *iommu = drhd ? drhd->iommu : NULL;
     int ret = 0;
-    unsigned int i, mode = 0;
-    uint16_t seg = pdev->seg, bdf;
-    uint8_t bus = pdev->bus, secbus;
-
-    /*
-     * Generally we assume only devices from one node to get assigned to a
-     * given guest.  But even if not, by replacing the prior value here we
-     * guarantee that at least some basic allocations for the device being
-     * added will get done against its node.  Any further allocations for
-     * this or other devices may be penalized then, but some would also be
-     * if we left other than NUMA_NO_NODE untouched here.
-     */
-    if ( drhd && drhd->iommu->node != NUMA_NO_NODE )
-        dom_iommu(domain)->node = drhd->iommu->node;
 
-    ASSERT(pcidevs_locked());
+    if ( !drhd )
+        return -EINVAL;
 
-    for_each_rmrr_device( rmrr, bdf, i )
+    if ( pdev->type == DEV_TYPE_PCI_HOST_BRIDGE ||
+         pdev->type == DEV_TYPE_PCIe_BRIDGE ||
+         pdev->type == DEV_TYPE_PCIe2PCI_BRIDGE ||
+         pdev->type == DEV_TYPE_LEGACY_PCI_BRIDGE )
     {
-        if ( rmrr->segment != pdev->seg || bdf != pdev->sbdf.bdf )
-            continue;
-
-        mode |= MAP_WITH_RMRR;
-        break;
+        printk(XENLOG_WARNING VTDPREFIX " Ignoring apply_context on PCI bridge\n");
+        return 0;
     }
 
-    if ( domain != pdev->domain && pdev->domain != dom_io &&
-         pdev->domain->is_dying )
-        mode |= MAP_OWNER_DYING;
-
-    switch ( pdev->type )
-    {
-        bool prev_present;
-
-    case DEV_TYPE_PCI_HOST_BRIDGE:
-        if ( iommu_debug )
-            printk(VTDPREFIX "%pd:Hostbridge: skip %pp map\n",
-                   domain, &PCI_SBDF(seg, bus, devfn));
-        if ( !is_hardware_domain(domain) )
-            return -EPERM;
-        break;
-
-    case DEV_TYPE_PCIe_BRIDGE:
-    case DEV_TYPE_PCIe2PCI_BRIDGE:
-    case DEV_TYPE_LEGACY_PCI_BRIDGE:
-        break;
-
-    case DEV_TYPE_PCIe_ENDPOINT:
-        if ( !drhd )
-            return -ENODEV;
-
-        if ( iommu_debug )
-            printk(VTDPREFIX "%pd:PCIe: map %pp\n",
-                   domain, &PCI_SBDF(seg, bus, devfn));
-        ret = domain_context_mapping_one(domain, ctx, drhd->iommu, bus, devfn, pdev,
-                                         did, pgd_maddr, mode);
-        if ( ret > 0 )
-            ret = 0;
-        if ( !ret && devfn == pdev->devfn && ats_device(pdev, drhd) > 0 )
-            enable_ats_device(pdev, &drhd->iommu->ats_devices);
-
-        break;
-
-    case DEV_TYPE_PCI:
-        if ( !drhd )
-            return -ENODEV;
-
-        if ( iommu_debug )
-            printk(VTDPREFIX "%pd:PCI: map %pp\n",
-                   domain, &PCI_SBDF(seg, bus, devfn));
-
-        ret = domain_context_mapping_one(domain, ctx, drhd->iommu, bus, devfn,
-                                         pdev, did, pgd_maddr, mode);
-        if ( ret < 0 )
-            break;
-        prev_present = ret;
-
-        if ( (ret = find_upstream_bridge(seg, &bus, &devfn, &secbus)) < 1 )
-        {
-            if ( !ret )
-                break;
-            ret = -ENXIO;
-        }
-        /*
-         * Strictly speaking if the device is the only one behind this bridge
-         * and the only one with this (secbus,0,0) tuple, it could be allowed
-         * to be re-assigned regardless of RMRR presence.  But let's deal with
-         * that case only if it is actually found in the wild.  Note that
-         * dealing with this just here would still not render the operation
-         * secure.
-         */
-        else if ( prev_present && (mode & MAP_WITH_RMRR) &&
-                  domain != pdev->domain )
-            ret = -EOPNOTSUPP;
-
-        /*
-         * Mapping a bridge should, if anything, pass the struct pci_dev of
-         * that bridge. Since bridges don't normally get assigned to guests,
-         * their owner would be the wrong one. Pass NULL instead.
-         */
-        if ( ret >= 0 )
-            ret = domain_context_mapping_one(domain, ctx, drhd->iommu, bus, devfn,
-                                             NULL, did, pgd_maddr, mode);
-
-        /*
-         * Devices behind PCIe-to-PCI/PCIx bridge may generate different
-         * requester-id. It may originate from devfn=0 on the secondary bus
-         * behind the bridge. Map that id as well if we didn't already.
-         *
-         * Somewhat similar as for bridges, we don't want to pass a struct
-         * pci_dev here - there may not even exist one for this (secbus,0,0)
-         * tuple. If there is one, without properly working device groups it
-         * may again not have the correct owner.
-         */
-        if ( !ret && pdev_type(seg, bus, devfn) == DEV_TYPE_PCIe2PCI_BRIDGE &&
-             (secbus != pdev->bus || pdev->devfn != 0) )
-            ret = domain_context_mapping_one(domain, ctx, drhd->iommu, secbus, 0,
-                                             NULL, did, pgd_maddr, mode);
-
-        if ( ret )
-        {
-            if ( !prev_present )
-                domain_context_unmap(domain, devfn, pdev);
-            else if ( pdev->domain != domain ) /* Avoid infinite recursion. */
-                domain_context_mapping(pdev->domain, ctx, devfn, pdev);
-        }
+    ASSERT(pcidevs_locked());
 
-        break;
+    ret = apply_context_single(d, ctx, iommu, pdev->bus, pdev->devfn, prev_ctx);
 
-    default:
-        dprintk(XENLOG_ERR VTDPREFIX, "%pd:unknown(%u): %pp\n",
-                domain, pdev->type, &PCI_SBDF(seg, bus, devfn));
-        ret = -EINVAL;
-        break;
-    }
+    if ( !ret && ats_device(pdev, drhd) > 0 )
+        enable_ats_device(pdev, &iommu->ats_devices);
 
     if ( !ret && devfn == pdev->devfn )
         pci_vtd_quirk(pdev);
@@ -1585,10 +1469,8 @@ static int domain_context_mapping(struct domain *domain, struct iommu_context *c
     return ret;
 }
 
-int domain_context_unmap_one(
-    struct domain *domain,
-    struct vtd_iommu *iommu,
-    uint8_t bus, uint8_t devfn)
+int unapply_context_single(struct domain *domain, struct vtd_iommu *iommu,
+                           struct iommu_context *prev_ctx, uint8_t bus, uint8_t devfn)
 {
     struct context_entry *context, *context_entries;
     u64 maddr;
@@ -1636,12 +1518,18 @@ int domain_context_unmap_one(
     if ( rc > 0 )
         rc = 0;
 
+    if ( !rc )
+    {
+        BUG_ON(!prev_ctx->arch.vtd.iommu_dev_cnt[iommu->index]);
+        prev_ctx->arch.vtd.iommu_dev_cnt[iommu->index]--;
+    }
+
     spin_unlock(&iommu->lock);
     unmap_vtd_domain_page(context_entries);
 
     if ( !iommu->drhd->segment && !rc )
-        rc = me_wifi_quirk(domain, bus, devfn, DOMID_INVALID, 0,
-                           UNMAP_ME_PHANTOM_FUNC);
+        rc = me_wifi_quirk(domain, bus, devfn, DOMID_INVALID, UNMAP_ME_PHANTOM_FUNC,
+                           NULL, prev_ctx);
 
     if ( rc && !is_hardware_domain(domain) && domain != dom_io )
     {
@@ -1659,128 +1547,27 @@ int domain_context_unmap_one(
     return rc;
 }
 
-static const struct acpi_drhd_unit *domain_context_unmap(
-    struct domain *domain,
-    uint8_t devfn,
-    struct pci_dev *pdev)
+static void cf_check iommu_clear_root_pgtable(struct domain *d,
+                                              struct iommu_context *ctx)
 {
-    const struct acpi_drhd_unit *drhd = acpi_find_matched_drhd_unit(pdev);
-    struct vtd_iommu *iommu = drhd ? drhd->iommu : NULL;
-    int ret;
-    uint16_t seg = pdev->seg;
-    uint8_t bus = pdev->bus, tmp_bus, tmp_devfn, secbus;
-
-    switch ( pdev->type )
-    {
-    case DEV_TYPE_PCI_HOST_BRIDGE:
-        if ( iommu_debug )
-            printk(VTDPREFIX "%pd:Hostbridge: skip %pp unmap\n",
-                   domain, &PCI_SBDF(seg, bus, devfn));
-        return ERR_PTR(is_hardware_domain(domain) ? 0 : -EPERM);
-
-    case DEV_TYPE_PCIe_BRIDGE:
-    case DEV_TYPE_PCIe2PCI_BRIDGE:
-    case DEV_TYPE_LEGACY_PCI_BRIDGE:
-        return ERR_PTR(0);
-
-    case DEV_TYPE_PCIe_ENDPOINT:
-        if ( !iommu )
-            return ERR_PTR(-ENODEV);
-
-        if ( iommu_debug )
-            printk(VTDPREFIX "%pd:PCIe: unmap %pp\n",
-                   domain, &PCI_SBDF(seg, bus, devfn));
-        ret = domain_context_unmap_one(domain, iommu, bus, devfn);
-        if ( !ret && devfn == pdev->devfn && ats_device(pdev, drhd) > 0 )
-            disable_ats_device(pdev);
-
-        break;
-
-    case DEV_TYPE_PCI:
-        if ( !iommu )
-            return ERR_PTR(-ENODEV);
-
-        if ( iommu_debug )
-            printk(VTDPREFIX "%pd:PCI: unmap %pp\n",
-                   domain, &PCI_SBDF(seg, bus, devfn));
-        ret = domain_context_unmap_one(domain, iommu, bus, devfn);
-        if ( ret )
-            break;
-
-        tmp_bus = bus;
-        tmp_devfn = devfn;
-        if ( (ret = find_upstream_bridge(seg, &tmp_bus, &tmp_devfn,
-                                         &secbus)) < 1 )
-        {
-            if ( ret )
-            {
-                ret = -ENXIO;
-                if ( !domain->is_dying &&
-                     !is_hardware_domain(domain) && domain != dom_io )
-                {
-                    domain_crash(domain);
-                    /* Make upper layers continue in a best effort manner. */
-                    ret = 0;
-                }
-            }
-            break;
-        }
-
-        ret = domain_context_unmap_one(domain, iommu, tmp_bus, tmp_devfn);
-        /* PCIe to PCI/PCIx bridge */
-        if ( !ret && pdev_type(seg, tmp_bus, tmp_devfn) == DEV_TYPE_PCIe2PCI_BRIDGE )
-            ret = domain_context_unmap_one(domain, iommu, secbus, 0);
-
-        break;
-
-    default:
-        dprintk(XENLOG_ERR VTDPREFIX, "%pd:unknown(%u): %pp\n",
-                domain, pdev->type, &PCI_SBDF(seg, bus, devfn));
-        return ERR_PTR(-EINVAL);
-    }
-
-    return drhd;
-}
-
-static void cf_check iommu_clear_root_pgtable(struct domain *d)
-{
-    struct iommu_context *ctx = iommu_default_context(d);
-
-    spin_lock(&ctx->arch.mapping_lock);
     ctx->arch.vtd.pgd_maddr = 0;
-    spin_unlock(&ctx->arch.mapping_lock);
 }
 
 static void cf_check iommu_domain_teardown(struct domain *d)
 {
     struct iommu_context *ctx = iommu_default_context(d);
-    const struct acpi_drhd_unit *drhd;
 
     if ( list_empty(&acpi_drhd_units) )
         return;
 
-    iommu_identity_map_teardown(d, ctx);
-
     ASSERT(!ctx->arch.vtd.pgd_maddr);
-
-    for_each_drhd_unit ( drhd )
-        iommu_free_domid(d->domain_id, drhd->iommu->domid_bitmap);
-
-    XFREE(ctx->arch.vtd.iommu_dev_cnt);
-    XFREE(ctx->arch.vtd.didmap);
-}
-
-static void quarantine_teardown(struct pci_dev *pdev,
-                                const struct acpi_drhd_unit *drhd)
-{
 }
 
 static int __must_check cf_check intel_iommu_map_page(
     struct domain *d, dfn_t dfn, mfn_t mfn, unsigned int flags,
-    unsigned int *flush_flags)
+    unsigned int *flush_flags, struct iommu_context *ctx)
 {
     struct domain_iommu *hd = dom_iommu(d);
-    struct iommu_context *ctx = iommu_default_context(d);
     struct dma_pte *page, *pte, old, new = {};
     u64 pg_maddr;
     unsigned int level = (IOMMUF_order(flags) / LEVEL_STRIDE) + 1;
@@ -1789,35 +1576,22 @@ static int __must_check cf_check intel_iommu_map_page(
     ASSERT((hd->platform_ops->page_sizes >> IOMMUF_order(flags)) &
            PAGE_SIZE_4K);
 
-    /* Do nothing if VT-d shares EPT page table */
-    if ( iommu_use_hap_pt(d) )
+    if ( ctx->opaque )
         return 0;
 
-    /* Do nothing if hardware domain and iommu supports pass thru. */
-    if ( iommu_hwdom_passthrough && is_hardware_domain(d) )
-        return 0;
-
-    spin_lock(&ctx->arch.mapping_lock);
-
     /*
      * IOMMU mapping request can be safely ignored when the domain is dying.
      *
-     * hd->arch.mapping_lock guarantees that d->is_dying will be observed
+     * hd->lock guarantees that d->is_dying will be observed
      * before any page tables are freed (see iommu_free_pgtables())
      */
     if ( d->is_dying )
-    {
-        spin_unlock(&ctx->arch.mapping_lock);
         return 0;
-    }
 
     pg_maddr = addr_to_dma_page_maddr(d, ctx, dfn_to_daddr(dfn), level, flush_flags,
                                       true);
     if ( pg_maddr < PAGE_SIZE )
-    {
-        spin_unlock(&ctx->arch.mapping_lock);
         return -ENOMEM;
-    }
 
     page = (struct dma_pte *)map_vtd_domain_page(pg_maddr);
     pte = &page[address_level_offset(dfn_to_daddr(dfn), level)];
@@ -1836,7 +1610,6 @@ static int __must_check cf_check intel_iommu_map_page(
 
     if ( !((old.val ^ new.val) & ~DMA_PTE_CONTIG_MASK) )
     {
-        spin_unlock(&ctx->arch.mapping_lock);
         unmap_vtd_domain_page(page);
         return 0;
     }
@@ -1879,7 +1652,6 @@ static int __must_check cf_check intel_iommu_map_page(
         perfc_incr(iommu_pt_coalesces);
     }
 
-    spin_unlock(&ctx->arch.mapping_lock);
     unmap_vtd_domain_page(page);
 
     *flush_flags |= IOMMU_FLUSHF_added;
@@ -1896,10 +1668,10 @@ static int __must_check cf_check intel_iommu_map_page(
 }
 
 static int __must_check cf_check intel_iommu_unmap_page(
-    struct domain *d, dfn_t dfn, unsigned int order, unsigned int *flush_flags)
+    struct domain *d, dfn_t dfn, unsigned int order, unsigned int *flush_flags,
+    struct iommu_context *ctx)
 {
     struct domain_iommu *hd = dom_iommu(d);
-    struct iommu_context *ctx = iommu_default_context(d);
     daddr_t addr = dfn_to_daddr(dfn);
     struct dma_pte *page = NULL, *pte = NULL, old;
     uint64_t pg_maddr;
@@ -1911,20 +1683,13 @@ static int __must_check cf_check intel_iommu_unmap_page(
      */
     ASSERT((hd->platform_ops->page_sizes >> order) & PAGE_SIZE_4K);
 
-    /* Do nothing if VT-d shares EPT page table */
-    if ( iommu_use_hap_pt(d) )
+    if ( ctx->opaque )
         return 0;
 
-    /* Do nothing if hardware domain and iommu supports pass thru. */
-    if ( iommu_hwdom_passthrough && is_hardware_domain(d) )
-        return 0;
-
-    spin_lock(&ctx->arch.mapping_lock);
     /* get target level pte */
     pg_maddr = addr_to_dma_page_maddr(d, ctx, addr, level, flush_flags, false);
     if ( pg_maddr < PAGE_SIZE )
     {
-        spin_unlock(&ctx->arch.mapping_lock);
         return pg_maddr ? -ENOMEM : 0;
     }
 
@@ -1933,7 +1698,6 @@ static int __must_check cf_check intel_iommu_unmap_page(
 
     if ( !dma_pte_present(*pte) )
     {
-        spin_unlock(&ctx->arch.mapping_lock);
         unmap_vtd_domain_page(page);
         return 0;
     }
@@ -1964,8 +1728,6 @@ static int __must_check cf_check intel_iommu_unmap_page(
         perfc_incr(iommu_pt_coalesces);
     }
 
-    spin_unlock(&ctx->arch.mapping_lock);
-
     unmap_vtd_domain_page(page);
 
     *flush_flags |= IOMMU_FLUSHF_modified;
@@ -1978,25 +1740,16 @@ static int __must_check cf_check intel_iommu_unmap_page(
 }
 
 static int cf_check intel_iommu_lookup_page(
-    struct domain *d, dfn_t dfn, mfn_t *mfn, unsigned int *flags)
+    struct domain *d, dfn_t dfn, mfn_t *mfn, unsigned int *flags,
+    struct iommu_context *ctx)
 {
-    struct iommu_context *ctx = iommu_default_context(d);
     uint64_t val;
 
-    /*
-     * If VT-d shares EPT page table or if the domain is the hardware
-     * domain and iommu_passthrough is set then pass back the dfn.
-     */
-    if ( iommu_use_hap_pt(d) ||
-         (iommu_hwdom_passthrough && is_hardware_domain(d)) )
+    if ( ctx->opaque )
         return -EOPNOTSUPP;
 
-    spin_lock(&ctx->arch.mapping_lock);
-
     val = addr_to_dma_page_maddr(d, ctx, dfn_to_daddr(dfn), 0, NULL, false);
 
-    spin_unlock(&ctx->arch.mapping_lock);
-
     if ( val < PAGE_SIZE )
         return -ENOENT;
 
@@ -2025,47 +1778,6 @@ static bool __init vtd_ept_page_compatible(const struct vtd_iommu *iommu)
             (cap_sps_1gb(vtd_cap) && iommu_superpages);
 }
 
-static int cf_check intel_iommu_add_device(u8 devfn, struct pci_dev *pdev)
-{
-    struct acpi_rmrr_unit *rmrr;
-    struct iommu_context *ctx;
-    u16 bdf;
-    int ret, i;
-
-    ASSERT(pcidevs_locked());
-
-    if ( !pdev->domain )
-        return -EINVAL;
-
-    ctx = iommu_default_context(pdev->domain);
-
-    for_each_rmrr_device ( rmrr, bdf, i )
-    {
-        if ( rmrr->segment == pdev->seg && bdf == PCI_BDF(pdev->bus, devfn) )
-        {
-            /*
-             * iommu_add_device() is only called for the hardware
-             * domain (see xen/drivers/passthrough/pci.c:pci_add_device()).
-             * Since RMRRs are always reserved in the e820 map for the hardware
-             * domain, there shouldn't be a conflict.
-             */
-            ret = iommu_identity_mapping(pdev->domain, ctx, p2m_access_rw,
-                                         rmrr->base_address, rmrr->end_address,
-                                         0);
-            if ( ret )
-                dprintk(XENLOG_ERR VTDPREFIX, "%pd: RMRR mapping failed\n",
-                        pdev->domain);
-        }
-    }
-
-    ret = domain_context_mapping(pdev->domain, ctx, devfn, pdev);
-    if ( ret )
-        dprintk(XENLOG_ERR VTDPREFIX, "%pd: context mapping failed\n",
-                pdev->domain);
-
-    return ret;
-}
-
 static int cf_check intel_iommu_enable_device(struct pci_dev *pdev)
 {
     struct acpi_drhd_unit *drhd = acpi_find_matched_drhd_unit(pdev);
@@ -2081,47 +1793,16 @@ static int cf_check intel_iommu_enable_device(struct pci_dev *pdev)
     return ret >= 0 ? 0 : ret;
 }
 
-static int cf_check intel_iommu_remove_device(u8 devfn, struct pci_dev *pdev)
-{
-    const struct acpi_drhd_unit *drhd;
-    struct acpi_rmrr_unit *rmrr;
-    u16 bdf;
-    unsigned int i;
-    struct iommu_context *ctx;
-
-    if ( !pdev->domain )
-        return -EINVAL;
-
-    ctx = iommu_default_context(pdev->domain);
-
-    drhd = domain_context_unmap(pdev->domain, devfn, pdev);
-    if ( IS_ERR(drhd) )
-        return PTR_ERR(drhd);
-
-    for_each_rmrr_device ( rmrr, bdf, i )
-    {
-        if ( rmrr->segment != pdev->seg || bdf != PCI_BDF(pdev->bus, devfn) )
-            continue;
-
-        /*
-         * Any flag is nothing to clear these mappings but here
-         * its always safe and strict to set 0.
-         */
-        iommu_identity_mapping(pdev->domain, ctx, p2m_access_x, rmrr->base_address,
-                               rmrr->end_address, 0);
-    }
-
-    quarantine_teardown(pdev, drhd);
-
-    return 0;
-}
-
 static int __hwdom_init cf_check setup_hwdom_device(
     u8 devfn, struct pci_dev *pdev)
 {
-    struct iommu_context *ctx = iommu_default_context(pdev->domain);
+    if ( pdev->type == DEV_TYPE_PCI_HOST_BRIDGE ||
+         pdev->type == DEV_TYPE_PCIe_BRIDGE ||
+         pdev->type == DEV_TYPE_PCIe2PCI_BRIDGE ||
+         pdev->type == DEV_TYPE_LEGACY_PCI_BRIDGE )
+        return 0;
 
-    return domain_context_mapping(pdev->domain, ctx, devfn, pdev);
+    return iommu_attach_context(hardware_domain, pdev, 0);
 }
 
 void clear_fault_bits(struct vtd_iommu *iommu)
@@ -2291,35 +1972,53 @@ static int __must_check init_vtd_hw(bool resume)
     return iommu_flush_all();
 }
 
-static void __hwdom_init setup_hwdom_rmrr(struct domain *d)
+static void cf_check arch_iommu_dump_domain_contexts(struct domain *d)
 {
-    struct iommu_context *ctx = iommu_default_context(d);
-    struct acpi_rmrr_unit *rmrr;
-    u16 bdf;
-    int ret, i;
+    unsigned int i, iommu_no;
+    struct pci_dev *pdev;
+    struct iommu_context *ctx;
+    struct domain_iommu *hd = dom_iommu(d);
 
-    pcidevs_lock();
-    for_each_rmrr_device ( rmrr, bdf, i )
+    if ( d == dom_io )
+        printk("d[IO] contexts\n");
+    else
+        printk("d%hu contexts\n", d->domain_id);
+
+    for ( i = 0; i < (1 + hd->other_contexts.count); ++i )
     {
-        /*
-         * Here means we're add a device to the hardware domain.
-         * Since RMRRs are always reserved in the e820 map for the hardware
-         * domain, there shouldn't be a conflict. So its always safe and
-         * strict to set 0.
-         */
-        ret = iommu_identity_mapping(d, ctx, p2m_access_rw, rmrr->base_address,
-                                     rmrr->end_address, 0);
-        if ( ret )
-            dprintk(XENLOG_ERR VTDPREFIX,
-                     "IOMMU: mapping reserved region failed\n");
+        if ( (ctx = iommu_get_context(d, i)) )
+        {
+            printk(" Context %u (%"PRIx64")\n", i, ctx->arch.vtd.pgd_maddr);
+
+            for ( iommu_no = 0; iommu_no < nr_iommus; iommu_no++ )
+                printk("  IOMMU %u (used=%lu; did=%hu)\n", iommu_no,
+                       ctx->arch.vtd.iommu_dev_cnt[iommu_no],
+                       ctx->arch.vtd.didmap[iommu_no]);
+
+            list_for_each_entry(pdev, &ctx->devices, context_list)
+            {
+                printk("  - %pp\n", &pdev->sbdf);
+            }
+
+            iommu_put_context(ctx);
+        }
     }
-    pcidevs_unlock();
 }
 
 static struct iommu_state {
     uint32_t fectl;
 } *__read_mostly iommu_state;
 
+static void cf_check arch_iommu_dump_contexts(unsigned char key)
+{
+    struct domain *d;
+
+    for_each_domain ( d )
+        if ( is_iommu_enabled(d) )
+            arch_iommu_dump_domain_contexts(d);
+
+    arch_iommu_dump_domain_contexts(dom_io);
+}
 static int __init cf_check vtd_setup(void)
 {
     struct acpi_drhd_unit *drhd;
@@ -2449,6 +2148,7 @@ static int __init cf_check vtd_setup(void)
     iommu_ops.page_sizes |= large_sizes;
 
     register_keyhandler('V', vtd_dump_iommu_info, "dump iommu info", 1);
+    register_keyhandler('X', arch_iommu_dump_contexts, "dump iommu contexts", 1);
 
     return 0;
 
@@ -2463,173 +2163,6 @@ static int __init cf_check vtd_setup(void)
     return ret;
 }
 
-static int cf_check reassign_device_ownership(
-    struct domain *source,
-    struct domain *target,
-    u8 devfn, struct pci_dev *pdev)
-{
-    int ret;
-
-    if ( !has_arch_pdevs(target) )
-        vmx_pi_hooks_assign(target);
-
-#ifdef CONFIG_PV
-    /*
-        * Devices assigned to untrusted domains (here assumed to be any domU)
-        * can attempt to send arbitrary LAPIC/MSI messages. We are unprotected
-        * by the root complex unless interrupt remapping is enabled.
-        */
-    if ( !iommu_intremap && !is_hardware_domain(target) &&
-            !is_system_domain(target) )
-        untrusted_msi = true;
-#endif
-
-    ret = domain_context_mapping(target, iommu_default_context(target), devfn, pdev);
-
-    if ( ret )
-    {
-        if ( !has_arch_pdevs(target) )
-            vmx_pi_hooks_deassign(target);
-        return ret;
-    }
-
-    if ( devfn == pdev->devfn && pdev->domain != target )
-    {
-        write_lock(&source->pci_lock);
-        list_del(&pdev->domain_list);
-        write_unlock(&source->pci_lock);
-
-        pdev->domain = target;
-
-        write_lock(&target->pci_lock);
-        list_add(&pdev->domain_list, &target->pdev_list);
-        write_unlock(&target->pci_lock);
-    }
-
-    if ( !has_arch_pdevs(source) )
-        vmx_pi_hooks_deassign(source);
-
-    /*
-     * If the device belongs to the hardware domain, and it has RMRR, don't
-     * remove it from the hardware domain, because BIOS may use RMRR at
-     * booting time.
-     */
-    if ( !is_hardware_domain(source) )
-    {
-        const struct acpi_rmrr_unit *rmrr;
-        struct iommu_context *ctx = iommu_default_context(source);
-        u16 bdf;
-        unsigned int i;
-
-        for_each_rmrr_device( rmrr, bdf, i )
-            if ( rmrr->segment == pdev->seg &&
-                 bdf == PCI_BDF(pdev->bus, devfn) )
-            {
-                /*
-                 * Any RMRR flag is always ignored when remove a device,
-                 * but its always safe and strict to set 0.
-                 */
-                ret = iommu_identity_mapping(source, ctx, p2m_access_x,
-                                             rmrr->base_address,
-                                             rmrr->end_address, 0);
-                if ( ret && ret != -ENOENT )
-                    return ret;
-            }
-    }
-
-    return 0;
-}
-
-static int cf_check intel_iommu_assign_device(
-    struct domain *d, u8 devfn, struct pci_dev *pdev, u32 flag)
-{
-    struct domain *s = pdev->domain;
-    struct iommu_context *ctx = iommu_default_context(d);
-    struct acpi_rmrr_unit *rmrr;
-    int ret = 0, i;
-    u16 bdf, seg;
-    u8 bus;
-
-    if ( list_empty(&acpi_drhd_units) )
-        return -ENODEV;
-
-    seg = pdev->seg;
-    bus = pdev->bus;
-    /*
-     * In rare cases one given rmrr is shared by multiple devices but
-     * obviously this would put the security of a system at risk. So
-     * we would prevent from this sort of device assignment. But this
-     * can be permitted if user set
-     *      "pci = [ 'sbdf, rdm_policy=relaxed' ]"
-     *
-     * TODO: in the future we can introduce group device assignment
-     * interface to make sure devices sharing RMRR are assigned to the
-     * same domain together.
-     */
-    for_each_rmrr_device( rmrr, bdf, i )
-    {
-        if ( rmrr->segment == seg && bdf == PCI_BDF(bus, devfn) &&
-             rmrr->scope.devices_cnt > 1 )
-        {
-            bool relaxed = flag & XEN_DOMCTL_DEV_RDM_RELAXED;
-
-            printk(XENLOG_GUEST "%s" VTDPREFIX
-                   " It's %s to assign %pp"
-                   " with shared RMRR at %"PRIx64" for %pd.\n",
-                   relaxed ? XENLOG_WARNING : XENLOG_ERR,
-                   relaxed ? "risky" : "disallowed",
-                   &PCI_SBDF(seg, bus, devfn), rmrr->base_address, d);
-            if ( !relaxed )
-                return -EPERM;
-        }
-    }
-
-    /* Setup rmrr identity mapping */
-    for_each_rmrr_device( rmrr, bdf, i )
-    {
-        if ( rmrr->segment == seg && bdf == PCI_BDF(bus, devfn) )
-        {
-            ret = iommu_identity_mapping(d, ctx, p2m_access_rw, rmrr->base_address,
-                                         rmrr->end_address, flag);
-            if ( ret )
-            {
-                printk(XENLOG_G_ERR VTDPREFIX
-                       "%pd: cannot map reserved region [%"PRIx64",%"PRIx64"]: %d\n",
-                       d, rmrr->base_address, rmrr->end_address, ret);
-                break;
-            }
-        }
-    }
-
-    if ( !ret )
-        ret = reassign_device_ownership(s, d, devfn, pdev);
-
-    /* See reassign_device_ownership() for the hwdom aspect. */
-    if ( !ret || is_hardware_domain(d) )
-        return ret;
-
-    for_each_rmrr_device( rmrr, bdf, i )
-    {
-        if ( rmrr->segment == seg && bdf == PCI_BDF(bus, devfn) )
-        {
-            int rc = iommu_identity_mapping(d, ctx, p2m_access_x,
-                                            rmrr->base_address,
-                                            rmrr->end_address, 0);
-
-            if ( rc && rc != -ENOENT )
-            {
-                printk(XENLOG_ERR VTDPREFIX
-                       "%pd: cannot unmap reserved region [%"PRIx64",%"PRIx64"]: %d\n",
-                       d, rmrr->base_address, rmrr->end_address, rc);
-                domain_crash(d);
-                break;
-            }
-        }
-    }
-
-    return ret;
-}
-
 static int cf_check intel_iommu_group_id(u16 seg, u8 bus, u8 devfn)
 {
     u8 secbus;
@@ -2754,6 +2287,11 @@ static void vtd_dump_page_table_level(paddr_t pt_maddr, int level, paddr_t gpa,
     if ( level < 1 )
         return;
 
+    if (pt_maddr == 0) {
+        printk(" (empty)\n");
+        return;
+    }
+
     pt_vaddr = map_vtd_domain_page(pt_maddr);
 
     next_level = level - 1;
@@ -2785,35 +2323,305 @@ static void vtd_dump_page_table_level(paddr_t pt_maddr, int level, paddr_t gpa,
 static void cf_check vtd_dump_page_tables(struct domain *d)
 {
     const struct domain_iommu *hd = dom_iommu(d);
-    struct iommu_context *ctx = iommu_default_context(d);
+    unsigned int i;
 
-    printk(VTDPREFIX" %pd table has %d levels\n", d,
+    printk(VTDPREFIX " %pd table has %d levels\n", d,
            agaw_to_level(hd->arch.vtd.agaw));
-    vtd_dump_page_table_level(ctx->arch.vtd.pgd_maddr,
-                              agaw_to_level(hd->arch.vtd.agaw), 0, 0);
+
+    for (i = 1; i < (1 + hd->other_contexts.count); ++i)
+    {
+        struct iommu_context *ctx = iommu_get_context(d, i);
+
+        printk(VTDPREFIX " %pd context %u: %s\n", d, i,
+               ctx ? "allocated" : "non-allocated");
+
+        if (ctx)
+        {
+            vtd_dump_page_table_level(ctx->arch.vtd.pgd_maddr,
+                                      agaw_to_level(hd->arch.vtd.agaw), 0, 0);
+            iommu_put_context(ctx);
+        }
+    }
+}
+
+static int intel_iommu_cleanup_pte(uint64_t pte_maddr, bool preempt)
+{
+    size_t i;
+    struct dma_pte *pte = map_vtd_domain_page(pte_maddr);
+
+    for (i = 0; i < (1 << PAGETABLE_ORDER); ++i)
+        if ( dma_pte_present(pte[i]) )
+        {
+            /* Remove the reference of the target mapping (if needed) */
+            mfn_t mfn = maddr_to_mfn(dma_pte_addr(pte[i]));
+
+            if ( mfn_valid(mfn) )
+                put_page(mfn_to_page(mfn));
+
+            if ( preempt )
+                dma_clear_pte(pte[i]);
+        }
+
+    unmap_vtd_domain_page(pte);
+
+    return 0;
+}
+
+/**
+ * Cleanup logic :
+ * Walk through the entire page table, progressively removing mappings if preempt.
+ *
+ * Return values :
+ *  - Report preemption with -ERESTART.
+ *  - Report empty pte/pgd with 0.
+ *
+ * When preempted during superpage operation, store state in vtd.superpage_progress.
+ */
+
+static int intel_iommu_cleanup_superpage(struct iommu_context *ctx,
+                                          unsigned int page_order, uint64_t pte_maddr,
+                                          bool preempt)
+{
+    size_t i = 0, page_count = 1 << page_order;
+    struct page_info *page = maddr_to_page(pte_maddr);
+
+    if ( preempt )
+        i = ctx->arch.vtd.superpage_progress;
+
+    for (; i < page_count; i++)
+    {
+        put_page(page + i);
+
+        if ( preempt && (i & 0xff) && general_preempt_check() )
+        {
+            ctx->arch.vtd.superpage_progress = i + 1;
+            return -ERESTART;
+        }
+    }
+
+    if ( preempt )
+        ctx->arch.vtd.superpage_progress = 0;
+
+    return 0;
+}
+
+static int intel_iommu_cleanup_mappings(struct iommu_context *ctx,
+                                         unsigned int nr_pt_levels, uint64_t pgd_maddr,
+                                         bool preempt)
+{
+    size_t i;
+    int rc;
+    struct dma_pte *pgd;
+
+    if ( ctx->opaque )
+        /* don't touch opaque contexts */
+        return 0;
+
+    pgd = map_vtd_domain_page(pgd_maddr);
+
+    for (i = 0; i < (1 << PAGETABLE_ORDER); ++i)
+    {
+        if ( dma_pte_present(pgd[i]) )
+        {
+            uint64_t pte_maddr = dma_pte_addr(pgd[i]);
+
+            if ( dma_pte_superpage(pgd[i]) )
+                rc = intel_iommu_cleanup_superpage(ctx, nr_pt_levels * SUPERPAGE_ORDER,
+                                                   pte_maddr, preempt);
+            else if ( nr_pt_levels > 2 )
+                /* Next level is not PTE */
+                rc = intel_iommu_cleanup_mappings(ctx, nr_pt_levels - 1,
+                                                  pte_maddr, preempt);
+            else
+                rc = intel_iommu_cleanup_pte(pte_maddr, preempt);
+
+            if ( preempt && !rc )
+                /* Fold pgd (no more mappings in it) */
+                dma_clear_pte(pgd[i]);
+            else if ( preempt && (rc == -ERESTART || general_preempt_check()) )
+            {
+                unmap_vtd_domain_page(pgd);
+                return -ERESTART;
+            }
+        }
+    }
+
+    unmap_vtd_domain_page(pgd);
+
+    return 0;
 }
 
-static int cf_check intel_iommu_quarantine_init(struct pci_dev *pdev,
-                                                bool scratch_page)
+static int cf_check intel_iommu_context_teardown(struct domain *d,
+                                        struct iommu_context *ctx, u32 flags)
 {
+    struct acpi_drhd_unit *drhd;
+    pcidevs_lock();
+
+    /* Cleanup mappings */
+    if ( intel_iommu_cleanup_mappings(ctx, agaw_to_level(d->iommu.arch.vtd.agaw),
+                                      ctx->arch.vtd.pgd_maddr,
+                                      flags & IOMMUF_preempt) < 0 )
+    {
+        pcidevs_unlock();
+        return -ERESTART;
+    }
+
+    ASSERT(ctx->arch.vtd.didmap);
+
+    for_each_drhd_unit(drhd)
+    {
+        unsigned long index = drhd->iommu->index;
+
+        iommu_free_domid(ctx->arch.vtd.didmap[index], drhd->iommu->domid_bitmap);
+    }
+
+    xfree(ctx->arch.vtd.didmap);
+
+    pcidevs_unlock();
+    return arch_iommu_context_teardown(d, ctx, flags);
+}
+
+static int intel_iommu_dev_rmrr(struct domain *d, struct pci_dev *pdev,
+                                struct iommu_context *ctx, bool unmap)
+{
+    struct acpi_rmrr_unit *rmrr;
+    u16 bdf;
+    int ret, i;
+
+    for_each_rmrr_device(rmrr, bdf, i)
+    {
+        if ( PCI_SBDF(rmrr->segment, bdf).sbdf == pdev->sbdf.sbdf )
+        {
+            ret = iommu_identity_mapping(d, ctx,
+                                         unmap ? p2m_access_x : p2m_access_rw,
+                                         rmrr->base_address, rmrr->end_address,
+                                         0);
+
+            if ( ret < 0 )
+                return ret;
+        }
+    }
+
     return 0;
 }
 
+static int cf_check intel_iommu_attach(struct domain *d, struct pci_dev *pdev,
+                                       struct iommu_context *ctx)
+{
+    int ret;
+    const struct acpi_drhd_unit *drhd = acpi_find_matched_drhd_unit(pdev);
+
+    if ( !pdev || !drhd )
+        return -EINVAL;
+
+    ret = intel_iommu_dev_rmrr(d, pdev, ctx, false);
+
+    if ( ret )
+        return ret;
+
+    ret = apply_context(d, ctx, pdev, pdev->devfn, NULL);
+
+    if ( ret )
+        return ret;
+
+    pci_vtd_quirk(pdev);
+
+    return ret;
+}
+
+static int cf_check intel_iommu_detach(struct domain *d, struct pci_dev *pdev,
+                                       struct iommu_context *prev_ctx)
+{
+    int ret, rc;
+    const struct acpi_drhd_unit *drhd = acpi_find_matched_drhd_unit(pdev);
+
+    if ( !pdev || !drhd )
+        return -EINVAL;
+
+    ret = unapply_context_single(d, drhd->iommu, prev_ctx, pdev->bus, pdev->devfn);
+
+    if ( ret )
+        return ret;
+
+    if ( (rc = intel_iommu_dev_rmrr(d, pdev, prev_ctx, true)) )
+        printk(XENLOG_WARNING VTDPREFIX
+               " Unable to unmap RMRR from d%dc%d for %pp (%d)\n",
+               d->domain_id, prev_ctx->id, &pdev->sbdf, rc);
+
+    return ret;
+}
+
+static int cf_check intel_iommu_reattach(struct domain *d,
+                                         struct pci_dev *pdev,
+                                         struct iommu_context *prev_ctx,
+                                         struct iommu_context *ctx)
+{
+    int ret, rc;
+    const struct acpi_drhd_unit *drhd = acpi_find_matched_drhd_unit(pdev);
+
+    if ( !pdev || !drhd )
+        return -EINVAL;
+
+    ret = intel_iommu_dev_rmrr(d, pdev, ctx, false);
+
+    if ( ret )
+        return ret;
+
+    ret = apply_context(d, ctx, pdev, pdev->devfn, prev_ctx);
+
+    if ( ret )
+        return ret;
+
+    if ( (rc = intel_iommu_dev_rmrr(d, pdev, prev_ctx, true)) )
+        printk(XENLOG_WARNING VTDPREFIX
+               " Unable to unmap RMRR from d%dc%d for %pp (%d)\n",
+               d->domain_id, prev_ctx->id, &pdev->sbdf, rc);
+
+    pci_vtd_quirk(pdev);
+
+    return ret;
+}
+
+static int cf_check intel_iommu_add_devfn(struct domain *d,
+                                          struct pci_dev *pdev, u16 devfn,
+                                          struct iommu_context *ctx)
+{
+    const struct acpi_drhd_unit *drhd = acpi_find_matched_drhd_unit(pdev);
+
+    if ( !pdev || !drhd )
+        return -EINVAL;
+
+    return apply_context(d, ctx, pdev, devfn, NULL);
+}
+
+static int cf_check intel_iommu_remove_devfn(struct domain *d, struct pci_dev *pdev,
+                                             u16 devfn)
+{
+    const struct acpi_drhd_unit *drhd = acpi_find_matched_drhd_unit(pdev);
+
+    if ( !pdev || !drhd )
+        return -EINVAL;
+
+    return unapply_context_single(d, drhd->iommu, NULL, pdev->bus, devfn);
+}
+
 static const struct iommu_ops __initconst_cf_clobber vtd_ops = {
     .page_sizes = PAGE_SIZE_4K,
     .init = intel_iommu_domain_init,
     .hwdom_init = intel_iommu_hwdom_init,
-    .quarantine_init = intel_iommu_quarantine_init,
-    .add_device = intel_iommu_add_device,
+    .context_init = intel_iommu_context_init,
+    .context_teardown = intel_iommu_context_teardown,
+    .attach = intel_iommu_attach,
+    .detach = intel_iommu_detach,
+    .reattach = intel_iommu_reattach,
+    .add_devfn = intel_iommu_add_devfn,
+    .remove_devfn = intel_iommu_remove_devfn,
     .enable_device = intel_iommu_enable_device,
-    .remove_device = intel_iommu_remove_device,
-    .assign_device  = intel_iommu_assign_device,
     .teardown = iommu_domain_teardown,
     .clear_root_pgtable = iommu_clear_root_pgtable,
     .map_page = intel_iommu_map_page,
     .unmap_page = intel_iommu_unmap_page,
     .lookup_page = intel_iommu_lookup_page,
-    .reassign_device = reassign_device_ownership,
     .get_device_group_id = intel_iommu_group_id,
     .enable_x2apic = intel_iommu_enable_eim,
     .disable_x2apic = intel_iommu_disable_eim,
diff --git a/xen/drivers/passthrough/vtd/quirks.c b/xen/drivers/passthrough/vtd/quirks.c
index 7937eb8c2b..0c8a6d73dd 100644
--- a/xen/drivers/passthrough/vtd/quirks.c
+++ b/xen/drivers/passthrough/vtd/quirks.c
@@ -408,9 +408,9 @@ void __init platform_quirks_init(void)
 
 static int __must_check map_me_phantom_function(struct domain *domain,
                                                 unsigned int dev,
-                                                domid_t domid,
-                                                paddr_t pgd_maddr,
-                                                unsigned int mode)
+                                                unsigned int mode,
+                                                struct iommu_context *ctx,
+                                                struct iommu_context *prev_ctx)
 {
     struct acpi_drhd_unit *drhd;
     struct pci_dev *pdev;
@@ -422,19 +422,17 @@ static int __must_check map_me_phantom_function(struct domain *domain,
 
     /* map or unmap ME phantom function */
     if ( !(mode & UNMAP_ME_PHANTOM_FUNC) )
-        rc = domain_context_mapping_one(domain, iommu_default_context(domain),
-                                        drhd->iommu, 0,
-                                        PCI_DEVFN(dev, 7), NULL,
-                                        domid, pgd_maddr, mode);
+        rc = apply_context_single(domain, ctx, drhd->iommu, 0,
+                                  PCI_DEVFN(dev, 7), prev_ctx);
     else
-        rc = domain_context_unmap_one(domain, drhd->iommu, 0,
-                                      PCI_DEVFN(dev, 7));
+        rc = unapply_context_single(domain, drhd->iommu, prev_ctx, 0, PCI_DEVFN(dev, 7));
 
     return rc;
 }
 
 int me_wifi_quirk(struct domain *domain, uint8_t bus, uint8_t devfn,
-                  domid_t domid, paddr_t pgd_maddr, unsigned int mode)
+                  domid_t domid, unsigned int mode,
+                  struct iommu_context *ctx, struct iommu_context *prev_ctx)
 {
     u32 id;
     int rc = 0;
@@ -458,7 +456,7 @@ int me_wifi_quirk(struct domain *domain, uint8_t bus, uint8_t devfn,
             case 0x423b8086:
             case 0x423c8086:
             case 0x423d8086:
-                rc = map_me_phantom_function(domain, 3, domid, pgd_maddr, mode);
+                rc = map_me_phantom_function(domain, 3, mode, ctx, prev_ctx);
                 break;
             default:
                 break;
@@ -484,7 +482,7 @@ int me_wifi_quirk(struct domain *domain, uint8_t bus, uint8_t devfn,
             case 0x42388086:        /* Puma Peak */
             case 0x422b8086:
             case 0x422c8086:
-                rc = map_me_phantom_function(domain, 22, domid, pgd_maddr, mode);
+                rc = map_me_phantom_function(domain, 22, mode, ctx, prev_ctx);
                 break;
             default:
                 break;
diff --git a/xen/drivers/passthrough/x86/iommu.c b/xen/drivers/passthrough/x86/iommu.c
index 730a75e628..7b7fac0db8 100644
--- a/xen/drivers/passthrough/x86/iommu.c
+++ b/xen/drivers/passthrough/x86/iommu.c
@@ -12,6 +12,12 @@
  * this program; If not, see <http://www.gnu.org/licenses/>.
  */
 
+#include <xen/keyhandler.h>
+#include <xen/lib.h>
+#include <xen/pci.h>
+#include <xen/bitmap.h>
+#include <xen/list.h>
+#include <xen/mm.h>
 #include <xen/cpu.h>
 #include <xen/sched.h>
 #include <xen/iocap.h>
@@ -19,7 +25,6 @@
 #include <xen/paging.h>
 #include <xen/guest_access.h>
 #include <xen/event.h>
-#include <xen/spinlock.h>
 #include <xen/softirq.h>
 #include <xen/vm_event.h>
 #include <xsm/xsm.h>
@@ -29,6 +34,9 @@
 #include <asm/mem_paging.h>
 #include <asm/pt-contig-markers.h>
 #include <asm/setup.h>
+#include <asm/iommu.h>
+#include <asm/page.h>
+#include <asm/p2m.h>
 
 const struct iommu_init_ops *__initdata iommu_init_ops;
 struct iommu_ops __ro_after_init iommu_ops;
@@ -192,8 +200,6 @@ int arch_iommu_domain_init(struct domain *d)
 
 int arch_iommu_context_init(struct domain *d, struct iommu_context *ctx, u32 flags)
 {
-    spin_lock_init(&ctx->arch.mapping_lock);
-
     INIT_PAGE_LIST_HEAD(&ctx->arch.pgtables);
     INIT_LIST_HEAD(&ctx->arch.identity_maps);
 
@@ -220,6 +226,95 @@ struct identity_map {
     unsigned int count;
 };
 
+static int unmap_identity_region(struct domain *d, struct iommu_context *ctx,
+                                 unsigned int base_pfn, unsigned int end_pfn)
+{
+    int ret = 0;
+
+    if ( ctx->opaque && !ctx->id )
+    {
+        #ifdef CONFIG_HVM
+        this_cpu(iommu_dont_flush_iotlb) = true;
+        while ( base_pfn < end_pfn )
+        {
+            if ( p2m_remove_identity_entry(d, base_pfn) )
+                ret = -ENXIO;
+
+            base_pfn++;
+        }
+        this_cpu(iommu_dont_flush_iotlb) = false;
+        #else
+        ASSERT_UNREACHABLE();
+        #endif
+    }
+    else
+    {
+        size_t page_count = end_pfn - base_pfn + 1;
+        unsigned int flush_flags = 0;
+
+        ret = iommu_unmap(d, _dfn(base_pfn), page_count, 0, &flush_flags,
+                          ctx->id);
+
+        if ( ret )
+            return ret;
+
+        ret = iommu_iotlb_flush(d, _dfn(base_pfn), page_count,
+                                flush_flags, ctx->id);
+    }
+
+    return ret;
+}
+
+static int map_identity_region(struct domain *d, struct iommu_context *ctx,
+                               unsigned int base_pfn, unsigned int end_pfn,
+                               p2m_access_t p2ma, unsigned int flag)
+{
+    int ret = 0;
+    unsigned int flush_flags = 0;
+    size_t page_count = end_pfn - base_pfn + 1;
+
+    if ( ctx->opaque && !ctx->id )
+    {
+        #ifdef CONFIG_HVM
+        int i;
+        this_cpu(iommu_dont_flush_iotlb) = true;
+
+        for (i = 0; i < page_count; i++)
+        {
+            ret = p2m_add_identity_entry(d, base_pfn + i, p2ma, flag);
+
+            if ( ret )
+                break;
+
+            /* Don't advance base_pfn: 'i' is already the offset. */
+        }
+        this_cpu(iommu_dont_flush_iotlb) = false;
+        #else
+        ASSERT_UNREACHABLE();
+        #endif
+    }
+    else
+    {
+        int i;
+
+        for (i = 0; i < page_count; i++)
+        {
+            ret = iommu_map(d, _dfn(base_pfn + i), _mfn(base_pfn + i), 1,
+                            p2m_access_to_iommu_flags(p2ma), &flush_flags,
+                            ctx->id);
+
+            if ( ret )
+                break;
+        }
+    }
+
+    /* Always flush; never let a successful flush hide a mapping error. */
+    ret = iommu_iotlb_flush(d, _dfn(base_pfn), page_count, flush_flags,
+                            ctx->id) ?: ret;
+    return ret;
+}
+
+/* p2m_access_x removes the mapping */
 int iommu_identity_mapping(struct domain *d, struct iommu_context *ctx,
                            p2m_access_t p2ma, paddr_t base, paddr_t end,
                            unsigned int flag)
@@ -227,24 +322,20 @@ int iommu_identity_mapping(struct domain *d, struct iommu_context *ctx,
     unsigned long base_pfn = base >> PAGE_SHIFT_4K;
     unsigned long end_pfn = PAGE_ALIGN_4K(end) >> PAGE_SHIFT_4K;
     struct identity_map *map;
+    int ret = 0;
 
     ASSERT(pcidevs_locked());
     ASSERT(base < end);
 
-    /*
-     * No need to acquire hd->arch.mapping_lock: Both insertion and removal
-     * get done while holding pcidevs_lock.
-     */
     list_for_each_entry( map, &ctx->arch.identity_maps, list )
     {
         if ( map->base == base && map->end == end )
         {
-            int ret = 0;
-
             if ( p2ma != p2m_access_x )
             {
                 if ( map->access != p2ma )
                     return -EADDRINUSE;
+
                 ++map->count;
                 return 0;
             }
@@ -252,12 +343,9 @@ int iommu_identity_mapping(struct domain *d, struct iommu_context *ctx,
             if ( --map->count )
                 return 0;
 
-            while ( base_pfn < end_pfn )
-            {
-                if ( clear_identity_p2m_entry(d, base_pfn) )
-                    ret = -ENXIO;
-                base_pfn++;
-            }
+            printk("Unmapping [%"PRI_mfn":%"PRI_mfn"] for d%dc%d\n", base_pfn, end_pfn,
+                   d->domain_id, ctx->id);
+            ret = unmap_identity_region(d, ctx, base_pfn, end_pfn);
 
             list_del(&map->list);
             xfree(map);
@@ -281,27 +369,17 @@ int iommu_identity_mapping(struct domain *d, struct iommu_context *ctx,
     map->access = p2ma;
     map->count = 1;
 
-    /*
-     * Insert into list ahead of mapping, so the range can be found when
-     * trying to clean up.
-     */
-    list_add_tail(&map->list, &ctx->arch.identity_maps);
+    printk("Mapping [%"PRI_mfn":%"PRI_mfn"] for d%dc%d\n", base_pfn, end_pfn,
+           d->domain_id, ctx->id);
+    ret = map_identity_region(d, ctx, base_pfn, end_pfn, p2ma, flag);
 
-    for ( ; base_pfn < end_pfn; ++base_pfn )
+    if ( ret )
     {
-        int err = set_identity_p2m_entry(d, base_pfn, p2ma, flag);
-
-        if ( !err )
-            continue;
-
-        if ( (map->base >> PAGE_SHIFT_4K) == base_pfn )
-        {
-            list_del(&map->list);
-            xfree(map);
-        }
-        return err;
+        xfree(map);
+        return ret;
     }
 
+    list_add(&map->list, &ctx->arch.identity_maps);
     return 0;
 }
 
@@ -373,7 +451,7 @@ static int __hwdom_init cf_check identity_map(unsigned long s, unsigned long e,
             if ( iomem_access_permitted(d, s, s) )
             {
                 rc = iommu_map(d, _dfn(s), _mfn(s), 1, perms,
-                               &info->flush_flags);
+                               &info->flush_flags, 0);
                 if ( rc < 0 )
                     break;
                 /* Must map a frame at least, which is what we request for. */
@@ -383,7 +461,7 @@ static int __hwdom_init cf_check identity_map(unsigned long s, unsigned long e,
             s++;
         }
         while ( (rc = iommu_map(d, _dfn(s), _mfn(s), e - s + 1,
-                                perms, &info->flush_flags)) > 0 )
+                                perms, &info->flush_flags, 0)) > 0 )
         {
             s += rc;
             process_pending_softirqs();
@@ -543,7 +621,7 @@ void __hwdom_init arch_iommu_hwdom_init(struct domain *d)
                map_data.mmio_ro ? "read-only " : "", rc);
 
     /* Use if to avoid compiler warning */
-    if ( iommu_iotlb_flush_all(d, map_data.flush_flags) )
+    if ( iommu_iotlb_flush_all(d, 0, map_data.flush_flags) )
         return;
 }
 
@@ -600,14 +678,11 @@ int iommu_free_pgtables(struct domain *d, struct iommu_context *ctx)
     if ( !is_iommu_enabled(d) )
         return 0;
 
-    /* After this barrier, no new IOMMU mappings can be inserted. */
-    spin_barrier(&ctx->arch.mapping_lock);
-
     /*
      * Pages will be moved to the free list below. So we want to
      * clear the root page-table to avoid any potential use after-free.
      */
-    iommu_vcall(hd->platform_ops, clear_root_pgtable, d);
+    iommu_vcall(hd->platform_ops, clear_root_pgtable, d, ctx);
 
     while ( (pg = page_list_remove_head(&ctx->arch.pgtables)) )
     {
diff --git a/xen/include/xen/iommu.h b/xen/include/xen/iommu.h
index 11d23cdafb..15250da119 100644
--- a/xen/include/xen/iommu.h
+++ b/xen/include/xen/iommu.h
@@ -161,11 +161,10 @@ enum
  */
 long __must_check iommu_map(struct domain *d, dfn_t dfn0, mfn_t mfn0,
                             unsigned long page_count, unsigned int flags,
-                            unsigned int *flush_flags);
+                            unsigned int *flush_flags, u16 ctx_id);
 long __must_check iommu_unmap(struct domain *d, dfn_t dfn0,
                               unsigned long page_count, unsigned int flags,
-                              unsigned int *flush_flags);
-
+                              unsigned int *flush_flags, u16 ctx_id);
 int __must_check iommu_legacy_map(struct domain *d, dfn_t dfn, mfn_t mfn,
                                   unsigned long page_count,
                                   unsigned int flags);
@@ -173,12 +172,13 @@ int __must_check iommu_legacy_unmap(struct domain *d, dfn_t dfn,
                                     unsigned long page_count);
 
 int __must_check iommu_lookup_page(struct domain *d, dfn_t dfn, mfn_t *mfn,
-                                   unsigned int *flags);
+                                   unsigned int *flags, u16 ctx_id);
 
 int __must_check iommu_iotlb_flush(struct domain *d, dfn_t dfn,
                                    unsigned long page_count,
-                                   unsigned int flush_flags);
-int __must_check iommu_iotlb_flush_all(struct domain *d,
+                                   unsigned int flush_flags,
+                                   u16 ctx_id);
+int __must_check iommu_iotlb_flush_all(struct domain *d, u16 ctx_id,
                                        unsigned int flush_flags);
 
 enum iommu_feature
@@ -250,20 +250,30 @@ struct page_info;
  */
 typedef int iommu_grdm_t(xen_pfn_t start, xen_ulong_t nr, u32 id, void *ctxt);
 
+struct iommu_context;
+
 struct iommu_ops {
     unsigned long page_sizes;
     int (*init)(struct domain *d);
     void (*hwdom_init)(struct domain *d);
-    int (*quarantine_init)(device_t *dev, bool scratch_page);
-    int (*add_device)(uint8_t devfn, device_t *dev);
+    int (*context_init)(struct domain *d, struct iommu_context *ctx,
+                        u32 flags);
+    int (*context_teardown)(struct domain *d, struct iommu_context *ctx,
+                            u32 flags);
+    int (*attach)(struct domain *d, device_t *dev,
+                  struct iommu_context *ctx);
+    int (*detach)(struct domain *d, device_t *dev,
+                   struct iommu_context *prev_ctx);
+    int (*reattach)(struct domain *d, device_t *dev,
+                    struct iommu_context *prev_ctx,
+                    struct iommu_context *ctx);
+
     int (*enable_device)(device_t *dev);
-    int (*remove_device)(uint8_t devfn, device_t *dev);
-    int (*assign_device)(struct domain *d, uint8_t devfn, device_t *dev,
-                         uint32_t flag);
-    int (*reassign_device)(struct domain *s, struct domain *t,
-                           uint8_t devfn, device_t *dev);
 #ifdef CONFIG_HAS_PCI
     int (*get_device_group_id)(uint16_t seg, uint8_t bus, uint8_t devfn);
+    int (*add_devfn)(struct domain *d, struct pci_dev *pdev, u16 devfn,
+                     struct iommu_context *ctx);
+    int (*remove_devfn)(struct domain *d, struct pci_dev *pdev, u16 devfn);
 #endif /* HAS_PCI */
 
     void (*teardown)(struct domain *d);
@@ -274,12 +284,15 @@ struct iommu_ops {
      */
     int __must_check (*map_page)(struct domain *d, dfn_t dfn, mfn_t mfn,
                                  unsigned int flags,
-                                 unsigned int *flush_flags);
+                                 unsigned int *flush_flags,
+                                 struct iommu_context *ctx);
     int __must_check (*unmap_page)(struct domain *d, dfn_t dfn,
                                    unsigned int order,
-                                   unsigned int *flush_flags);
+                                   unsigned int *flush_flags,
+                                   struct iommu_context *ctx);
     int __must_check (*lookup_page)(struct domain *d, dfn_t dfn, mfn_t *mfn,
-                                    unsigned int *flags);
+                                    unsigned int *flags,
+                                    struct iommu_context *ctx);
 
 #ifdef CONFIG_X86
     int (*enable_x2apic)(void);
@@ -292,14 +305,15 @@ struct iommu_ops {
     int (*setup_hpet_msi)(struct msi_desc *msi_desc);
 
     void (*adjust_irq_affinities)(void);
-    void (*clear_root_pgtable)(struct domain *d);
+    void (*clear_root_pgtable)(struct domain *d, struct iommu_context *ctx);
     int (*update_ire_from_msi)(struct msi_desc *msi_desc, struct msi_msg *msg);
 #endif /* CONFIG_X86 */
 
     int __must_check (*suspend)(void);
     void (*resume)(void);
     void (*crash_shutdown)(void);
-    int __must_check (*iotlb_flush)(struct domain *d, dfn_t dfn,
+    int __must_check (*iotlb_flush)(struct domain *d,
+                                    struct iommu_context *ctx, dfn_t dfn,
                                     unsigned long page_count,
                                     unsigned int flush_flags);
     int (*get_reserved_device_memory)(iommu_grdm_t *func, void *ctxt);
@@ -346,15 +360,36 @@ extern int iommu_get_extra_reserved_device_memory(iommu_grdm_t *func,
 struct iommu_context {
     #ifdef CONFIG_HAS_PASSTHROUGH
     u16 id; /* Context id (0 means default context) */
+    rspinlock_t lock; /* context lock */
+
+    struct list_head devices;
 
     struct arch_iommu_context arch;
+
+    bool opaque; /* context cannot be modified or accessed (e.g. HAP) */
+    bool dying; /* the context is tearing down */
     #endif
 };
 
+struct iommu_context_list {
+    atomic_t initialized; /* context list is initialized (or being so) */
+    rwlock_t lock; /* prevent concurrent destruction and access of contexts */
+    uint16_t count; /* Context count excluding default context */
+
+    /* if count > 0 */
+
+    uint64_t *bitmap; /* bitmap of context allocation */
+    struct iommu_context *map; /* Map of contexts */
+};
+
+
 struct domain_iommu {
+
 #ifdef CONFIG_HAS_PASSTHROUGH
     struct arch_iommu arch;
+
     struct iommu_context default_ctx;
+    struct iommu_context_list other_contexts;
 #endif
 
     /* iommu_ops */
@@ -415,6 +450,8 @@ int __must_check iommu_suspend(void);
 void iommu_resume(void);
 void iommu_crash_shutdown(void);
 int iommu_get_reserved_device_memory(iommu_grdm_t *func, void *ctxt);
+
+int __init iommu_quarantine_init(void);
 int iommu_quarantine_dev_init(device_t *dev);
 
 #ifdef CONFIG_HAS_PCI
@@ -424,6 +461,26 @@ int iommu_do_pci_domctl(struct xen_domctl *domctl, struct domain *d,
 
 void iommu_dev_iotlb_flush_timeout(struct domain *d, struct pci_dev *pdev);
 
+
+struct iommu_context *iommu_get_context(struct domain *d, u16 ctx_id);
+void iommu_put_context(struct iommu_context *ctx);
+
+#define IOMMU_CONTEXT_INIT_default (1 << 0)
+#define IOMMU_CONTEXT_INIT_quarantine (1 << 1)
+int iommu_context_init(struct domain *d, struct iommu_context *ctx, u16 ctx_id, u32 flags);
+
+#define IOMMU_TEARDOWN_REATTACH_DEFAULT (1 << 0)
+#define IOMMU_TEARDOWN_PREEMPT (1 << 1)
+int iommu_context_teardown(struct domain *d, struct iommu_context *ctx, u32 flags);
+
+int iommu_context_alloc(struct domain *d, u16 *ctx_id, u32 flags);
+int iommu_context_free(struct domain *d, u16 ctx_id, u32 flags);
+
+int iommu_reattach_context(struct domain *prev_dom, struct domain *next_dom,
+                           device_t *dev, u16 ctx_id);
+int iommu_attach_context(struct domain *d, device_t *dev, u16 ctx_id);
+int iommu_detach_context(struct domain *d, device_t *dev);
+
 /*
  * The purpose of the iommu_dont_flush_iotlb optional cpu flag is to
  * avoid unecessary iotlb_flush in the low level IOMMU code.
diff --git a/xen/include/xen/pci.h b/xen/include/xen/pci.h
index f784e91160..a421ead1a4 100644
--- a/xen/include/xen/pci.h
+++ b/xen/include/xen/pci.h
@@ -97,6 +97,7 @@ struct pci_dev_info {
 struct pci_dev {
     struct list_head alldevs_list;
     struct list_head domain_list;
+    struct list_head context_list;
 
     struct list_head msi_list;
 
@@ -104,6 +105,8 @@ struct pci_dev {
 
     struct domain *domain;
 
+    uint16_t context; /* IOMMU context number of domain */
+
     const union {
         struct {
             uint8_t devfn;
-- 
2.47.2
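
The reference counting that iommu_identity_mapping() applies to reserved
regions can be modelled outside Xen. What follows is only a minimal sketch
under stated assumptions: `enum access`, `identity_mapping()` and the singly
linked list are hypothetical stand-ins for p2m_access_t, the real function and
ctx->arch.identity_maps. Returning -ENOENT when asked to remove an unknown
range is an assumption, since that path is not visible in the hunks above, and
the real overlap checks and page-table updates are left out.

```c
#include <errno.h>
#include <stdlib.h>

/* Hypothetical stand-ins for p2m_access_t; ACCESS_X requests removal. */
enum access { ACCESS_X, ACCESS_RW, ACCESS_R };

struct identity_map {
    struct identity_map *next;
    unsigned long base, end;
    enum access access;
    unsigned int count;          /* number of users of this exact range */
};

static int identity_mapping(struct identity_map **head, enum access a,
                            unsigned long base, unsigned long end)
{
    struct identity_map **pp, *map;

    for ( pp = head; (map = *pp) != NULL; pp = &map->next )
    {
        if ( map->base != base || map->end != end )
            continue;

        if ( a != ACCESS_X )
        {
            /* Same range requested again: permissions must agree. */
            if ( map->access != a )
                return -EADDRINUSE;
            ++map->count;
            return 0;
        }

        /* Removal: only the last user actually drops the range. */
        if ( --map->count )
            return 0;

        *pp = map->next;
        free(map);
        return 0;
    }

    /* Assumption: removing an unknown range is reported as -ENOENT. */
    if ( a == ACCESS_X )
        return -ENOENT;

    map = malloc(sizeof(*map));
    if ( !map )
        return -ENOMEM;

    map->base = base;
    map->end = end;
    map->access = a;
    map->count = 1;
    map->next = *head;           /* record the range only once it exists */
    *head = map;

    return 0;
}
```

Mapping the same range twice and removing it once leaves the entry in place
with a count of 1; only the second removal frees it.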


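The "store progress, return -ERESTART, resume later" scheme described in the
cleanup comment of intel_iommu_cleanup_superpage() can also be shown in
isolation. This is a sketch rather than the driver's code: `cleanup_state`,
`cleanup_range()` and the `max_steps` budget are hypothetical, with the budget
standing in for general_preempt_check() and a counter standing in for
put_page().

```c
#include <errno.h>
#include <stddef.h>

#ifndef ERESTART
#define ERESTART 85              /* fallback where errno.h lacks it */
#endif

struct cleanup_state {
    size_t progress;             /* next index to handle when resumed */
    size_t released;             /* stands in for put_page() side effects */
};

/*
 * Release entries [progress, count). After max_steps entries, and if work
 * remains, save the position and ask the caller to retry. A max_steps of 0
 * means "never yield".
 */
static int cleanup_range(struct cleanup_state *st, size_t count,
                         size_t max_steps)
{
    size_t i, done = 0;

    for ( i = st->progress; i < count; i++ )
    {
        st->released++;          /* put_page(page + i) in the real code */

        if ( ++done == max_steps && i + 1 < count )
        {
            st->progress = i + 1;
            return -ERESTART;
        }
    }

    st->progress = 0;            /* finished: the next run starts afresh */

    return 0;
}
```

The caller simply repeats the call while it returns -ERESTART; each entry is
released exactly once because the loop always restarts from the saved index.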

Teddy Astie | Vates XCP-ng Developer

XCP-ng & Xen Orchestra - Vates solutions

web: https://vates.tech



* [XEN RFC PATCH v6 10/11] iommu: Introduce PV-IOMMU
  2025-02-17 10:18 [XEN RFC PATCH v6 00/11] IOMMU subsystem redesign and PV-IOMMU interface Teddy Astie
                   ` (9 preceding siblings ...)
  2025-02-17 10:18 ` [XEN RFC PATCH v6 08/11] iommu: Introduce redesigned IOMMU subsystem Teddy Astie
@ 2025-02-17 10:18 ` Teddy Astie
  2025-02-18 14:26 ` [XEN RFC PATCH v6 00/11] IOMMU subsystem redesign and PV-IOMMU interface Marek Marczykowski-Górecki
  11 siblings, 0 replies; 17+ messages in thread
From: Teddy Astie @ 2025-02-17 10:18 UTC (permalink / raw)
  To: xen-devel
  Cc: Teddy Astie, Jan Beulich, Andrew Cooper, Roger Pau Monné,
	Anthony PERARD, Michal Orzel, Julien Grall, Stefano Stabellini

Introduce the PV-IOMMU subsystem as defined in docs/designs/pv-iommu.md.

Signed-off-by: Teddy Astie <teddy.astie@vates.tech>
---
 xen/arch/x86/include/asm/iommu.h            |   3 +
 xen/common/Makefile                         |   1 +
 xen/common/pv-iommu.c                       | 536 ++++++++++++++++++++
 xen/drivers/passthrough/amd/pci_amd_iommu.c |  15 +
 xen/drivers/passthrough/iommu.c             | 105 ++++
 xen/drivers/passthrough/vtd/iommu.c         |   8 +
 xen/drivers/passthrough/x86/iommu.c         |  61 ++-
 xen/include/hypercall-defs.c                |   6 +
 xen/include/public/pv-iommu.h               | 343 +++++++++++++
 xen/include/public/xen.h                    |   1 +
 xen/include/xen/iommu.h                     |  11 +
 11 files changed, 1085 insertions(+), 5 deletions(-)
 create mode 100644 xen/common/pv-iommu.c
 create mode 100644 xen/include/public/pv-iommu.h

diff --git a/xen/arch/x86/include/asm/iommu.h b/xen/arch/x86/include/asm/iommu.h
index 452b98b42d..09fb512936 100644
--- a/xen/arch/x86/include/asm/iommu.h
+++ b/xen/arch/x86/include/asm/iommu.h
@@ -136,6 +136,9 @@ int iommu_identity_mapping(struct domain *d, struct iommu_context *ctx,
                            p2m_access_t p2ma, paddr_t base, paddr_t end,
                            unsigned int flag);
 void iommu_identity_map_teardown(struct domain *d, struct iommu_context *ctx);
+bool iommu_identity_map_check(struct domain *d, struct iommu_context *ctx,
+                              mfn_t mfn);
+
 
 extern bool untrusted_msi;
 
diff --git a/xen/common/Makefile b/xen/common/Makefile
index cba3b32733..c8583a80ba 100644
--- a/xen/common/Makefile
+++ b/xen/common/Makefile
@@ -37,6 +37,7 @@ obj-y += percpu.o
 obj-$(CONFIG_PERF_COUNTERS) += perfc.o
 obj-bin-$(CONFIG_HAS_PMAP) += pmap.init.o
 obj-y += preempt.o
+obj-y += pv-iommu.o
 obj-y += random.o
 obj-y += rangeset.o
 obj-y += radix-tree.o
diff --git a/xen/common/pv-iommu.c b/xen/common/pv-iommu.c
new file mode 100644
index 0000000000..a1315bf582
--- /dev/null
+++ b/xen/common/pv-iommu.c
@@ -0,0 +1,536 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+/*
+ * xen/common/pv-iommu.c
+ *
+ * PV-IOMMU hypercall interface.
+ */
+
+#include <xen/errno.h>
+#include <xen/mm.h>
+#include <xen/lib.h>
+#include <xen/iommu.h>
+#include <xen/sched.h>
+#include <xen/iocap.h>
+#include <xen/mm-frame.h>
+#include <xen/pci.h>
+#include <xen/guest_access.h>
+#include <asm/p2m.h>
+#include <asm/event.h>
+#include <asm/mm.h>
+#include <asm/iommu.h>
+#include <public/pv-iommu.h>
+
+#define PVIOMMU_PREFIX "[PV-IOMMU] "
+
+static int get_paged_frame(struct domain *d, gfn_t gfn, mfn_t *mfn,
+                           struct page_info **page, bool readonly)
+{
+    int ret = 0;
+    p2m_type_t p2mt = p2m_invalid;
+
+#ifdef CONFIG_X86
+    p2m_query_t query = P2M_ALLOC;
+
+    if ( !readonly )
+        query |= P2M_UNSHARE;
+
+    *mfn = get_gfn_type(d, gfn_x(gfn), &p2mt, query);
+#else
+    *mfn = p2m_lookup(d, gfn, &p2mt);
+#endif
+
+    if ( mfn_eq(*mfn, INVALID_MFN) )
+    {
+        /* No mapping ? */
+        printk(XENLOG_G_WARNING PVIOMMU_PREFIX
+               "Trying to map to non-backed page frame (gfn=%"PRI_gfn
+               " p2mt=%d d%d)\n", gfn_x(gfn), p2mt, d->domain_id);
+
+        ret = -ENOENT;
+    }
+    else if ( p2m_is_any_ram(p2mt) && mfn_valid(*mfn) )
+    {
+        *page = get_page_from_mfn(*mfn, d);
+        ret = 0;
+    }
+    else if ( p2m_is_mmio(p2mt) ||
+              iomem_access_permitted(d, mfn_x(*mfn), mfn_x(*mfn)) )
+    {
+        *page = NULL;
+        ret = 0;
+    }
+    else
+    {
+        printk(XENLOG_G_WARNING PVIOMMU_PREFIX
+               "Unexpected p2mt %d (d%d gfn=%"PRI_gfn" mfn=%"PRI_mfn")\n",
+               p2mt, d->domain_id, gfn_x(gfn), mfn_x(*mfn));
+
+        ret = -EPERM;
+    }
+
+    put_gfn(d, gfn_x(gfn));
+    return ret;
+}
+
+static bool can_use_iommu_check(struct domain *d)
+{
+    if ( !is_iommu_enabled(d) )
+    {
+        printk(XENLOG_G_WARNING PVIOMMU_PREFIX
+               "IOMMU disabled for this domain\n");
+        return false;
+    }
+
+    if ( !dom_iommu(d)->allow_pv_iommu )
+    {
+        printk(XENLOG_G_WARNING PVIOMMU_PREFIX
+               "PV-IOMMU disabled for this domain\n");
+        return false;
+    }
+
+    return true;
+}
+
+static long capabilities_op(struct pv_iommu_capabilities *cap, struct domain *d)
+{
+    cap->max_ctx_no = d->iommu.other_contexts.count;
+    cap->max_iova_addr = iommu_get_max_iova(d);
+
+    cap->max_pasid = 0; /* TODO */
+    cap->cap_flags = 0;
+
+    cap->pgsize_mask = PAGE_SIZE_4K;
+
+    return 0;
+}
+
+static long init_op(struct pv_iommu_init *init, struct domain *d)
+{
+    if ( init->max_ctx_no >= UINT16_MAX )
+        return -E2BIG;
+
+    return iommu_domain_pviommu_init(d, init->max_ctx_no + 1, init->arena_order);
+}
+
+static long alloc_context_op(struct pv_iommu_alloc *alloc, struct domain *d)
+{
+    u16 ctx_no = 0;
+    int status = 0;
+
+    status = iommu_context_alloc(d, &ctx_no, 0);
+
+    if ( status )
+        return status;
+
+    printk(XENLOG_G_INFO PVIOMMU_PREFIX
+           "Created IOMMU context %hu in d%d\n", ctx_no, d->domain_id);
+
+    alloc->ctx_no = ctx_no;
+    return 0;
+}
+
+static long free_context_op(struct pv_iommu_free *free, struct domain *d)
+{
+    int flags = IOMMU_TEARDOWN_PREEMPT;
+
+    if ( !free->ctx_no )
+        return -EINVAL;
+
+    if ( free->free_flags & IOMMU_FREE_reattach_default )
+        flags |= IOMMU_TEARDOWN_REATTACH_DEFAULT;
+
+    return iommu_context_free(d, free->ctx_no, flags);
+}
+
+static long reattach_device_op(struct pv_iommu_reattach_device *reattach,
+                               struct domain *d)
+{
+    int ret;
+    device_t *pdev;
+    struct physdev_pci_device dev = reattach->dev;
+
+    pcidevs_lock();
+    pdev = pci_get_pdev(d, PCI_SBDF(dev.seg, dev.bus, dev.devfn));
+
+    if ( !pdev )
+    {
+        pcidevs_unlock();
+        return -ENOENT;
+    }
+
+    ret = iommu_reattach_context(d, d, pdev, reattach->ctx_no);
+
+    pcidevs_unlock();
+    return ret;
+}
+
+static long map_pages_op(struct pv_iommu_map_pages *map, struct domain *d)
+{
+    struct iommu_context *ctx;
+    int ret = 0, flush_ret;
+    struct page_info *page = NULL;
+    mfn_t mfn, mfn_lookup;
+    unsigned int flags = 0, flush_flags = 0;
+    size_t i = 0;
+    dfn_t dfn0 = _dfn(map->dfn); /* original map->dfn */
+
+    if ( !map->ctx_no || !(ctx = iommu_get_context(d, map->ctx_no)) )
+        return -EINVAL;
+
+    if ( map->map_flags & IOMMU_MAP_readable )
+        flags |= IOMMUF_readable;
+
+    if ( map->map_flags & IOMMU_MAP_writeable )
+        flags |= IOMMUF_writable;
+
+    for ( i = 0; i < map->nr_pages; i++ )
+    {
+        gfn_t gfn = _gfn(map->gfn + i);
+        dfn_t dfn = _dfn(map->dfn + i);
+
+#ifdef CONFIG_X86
+        if ( iommu_identity_map_check(d, ctx, _mfn(dfn_x(dfn))) )
+        {
+            ret = -EADDRNOTAVAIL;
+            break;
+        }
+#endif
+
+        ret = get_paged_frame(d, gfn, &mfn, &page, 0);
+
+        if ( ret )
+            break;
+
+        /* Check for conflict with existing mappings */
+        if ( !iommu_lookup_page(d, dfn, &mfn_lookup, &flags, map->ctx_no) )
+        {
+            if ( page )
+                put_page(page);
+
+            ret = -EADDRINUSE;
+            break;
+        }
+
+        ret = iommu_map(d, dfn, mfn, 1, flags, &flush_flags, map->ctx_no);
+
+        if ( ret )
+        {
+            if ( page )
+                put_page(page);
+
+            break;
+        }
+
+        map->mapped++;
+
+        if ( (i & 0xff) && hypercall_preempt_check() )
+        {
+            i++;
+
+            map->gfn += i;
+            map->dfn += i;
+            map->nr_pages -= i;
+
+            ret = -ERESTART;
+            break;
+        }
+    }
+
+    flush_ret = iommu_iotlb_flush(d, dfn0, i, flush_flags, map->ctx_no);
+
+    if ( flush_ret )
+        printk(XENLOG_G_WARNING PVIOMMU_PREFIX
+               "Flush operation failed for d%dc%d (%d)\n", d->domain_id,
+               ctx->id, flush_ret);
+
+    iommu_put_context(ctx);
+
+    return ret;
+}
+
+static long unmap_pages_op(struct pv_iommu_unmap_pages *unmap, struct domain *d)
+{
+    struct iommu_context *ctx;
+    mfn_t mfn;
+    int ret = 0, flush_ret;
+    unsigned int flags, flush_flags = 0;
+    size_t i = 0;
+    dfn_t dfn0 = _dfn(unmap->dfn); /* original unmap->dfn */
+
+    if ( !unmap->ctx_no || !(ctx = iommu_get_context(d, unmap->ctx_no)) )
+        return -EINVAL;
+
+    for ( i = 0; i < unmap->nr_pages; i++ )
+    {
+        dfn_t dfn = _dfn(unmap->dfn + i);
+
+#ifdef CONFIG_X86
+        if ( iommu_identity_map_check(d, ctx, _mfn(dfn_x(dfn))) )
+        {
+            ret = -EADDRNOTAVAIL;
+            break;
+        }
+#endif
+
+        /* Check if there is a valid mapping for this domain */
+        if ( iommu_lookup_page(d, dfn, &mfn, &flags, unmap->ctx_no) ) {
+            ret = -ENOENT;
+            break;
+        }
+
+        ret = iommu_unmap(d, dfn, 1, 0, &flush_flags, unmap->ctx_no);
+
+        if ( ret )
+            break;
+
+        unmap->unmapped++;
+
+        /* Decrement reference counter (if needed) */
+        if ( mfn_valid(mfn) )
+            put_page(mfn_to_page(mfn));
+
+        if ( (i & 0xff) && hypercall_preempt_check() )
+        {
+            i++;
+
+            unmap->dfn += i;
+            unmap->nr_pages -= i;
+
+            ret = -ERESTART;
+            break;
+        }
+    }
+
+    flush_ret = iommu_iotlb_flush(d, dfn0, i, flush_flags, unmap->ctx_no);
+
+    if ( flush_ret )
+        printk(XENLOG_G_WARNING PVIOMMU_PREFIX
+               "Flush operation failed for d%dc%d (%d)\n", d->domain_id,
+               ctx->id, flush_ret);
+
+    iommu_put_context(ctx);
+
+    return ret;
+}
+
+static long do_iommu_subop(int subop, XEN_GUEST_HANDLE_PARAM(void) arg,
+                           struct domain *d, bool remote);
+
+static long remote_cmd_op(struct pv_iommu_remote_cmd *remote_cmd,
+                          struct domain *current_domain)
+{
+    long ret = 0;
+    struct domain *d;
+
+    /* TODO: use a better permission logic */
+    if ( !is_hardware_domain(current_domain) )
+        return -EPERM;
+
+    d = get_domain_by_id(remote_cmd->domid);
+
+    if ( !d )
+        return -ENOENT;
+
+    ret = do_iommu_subop(remote_cmd->subop, remote_cmd->arg, d, true);
+
+    put_domain(d);
+
+    return ret;
+}
+
+static long do_iommu_subop(int subop, XEN_GUEST_HANDLE_PARAM(void) arg,
+                           struct domain *d, bool remote)
+{
+    long ret = 0;
+
+    switch ( subop )
+    {
+        case IOMMU_noop:
+            break;
+
+        case IOMMU_query_capabilities:
+        {
+            struct pv_iommu_capabilities cap = {};
+
+            ret = capabilities_op(&cap, d);
+
+            if ( unlikely(copy_to_guest(arg, &cap, 1)) )
+                ret = -EFAULT;
+
+            break;
+        }
+
+        case IOMMU_init:
+        {
+            struct pv_iommu_init init;
+
+            if ( unlikely(copy_from_guest(&init, arg, 1)) )
+            {
+                ret = -EFAULT;
+                break;
+            }
+            ret = init_op(&init, d);
+            break;
+        }
+
+        case IOMMU_alloc_context:
+        {
+            struct pv_iommu_alloc alloc;
+
+            if ( unlikely(copy_from_guest(&alloc, arg, 1)) )
+            {
+                ret = -EFAULT;
+                break;
+            }
+
+            ret = alloc_context_op(&alloc, d);
+
+            if ( unlikely(copy_to_guest(arg, &alloc, 1)) )
+                ret = -EFAULT;
+
+            break;
+        }
+
+        case IOMMU_free_context:
+        {
+            struct pv_iommu_free free;
+
+            if ( unlikely(copy_from_guest(&free, arg, 1)) )
+            {
+                ret = -EFAULT;
+                break;
+            }
+
+            ret = free_context_op(&free, d);
+            break;
+        }
+
+        case IOMMU_reattach_device:
+        {
+            struct pv_iommu_reattach_device reattach;
+
+            if ( unlikely(copy_from_guest(&reattach, arg, 1)) )
+            {
+                ret = -EFAULT;
+                break;
+            }
+
+            ret = reattach_device_op(&reattach, d);
+            break;
+        }
+
+        case IOMMU_map_pages:
+        {
+            struct pv_iommu_map_pages map;
+
+            if ( unlikely(copy_from_guest(&map, arg, 1)) )
+            {
+                ret = -EFAULT;
+                break;
+            }
+
+            ret = map_pages_op(&map, d);
+
+            if ( unlikely(copy_to_guest(arg, &map, 1)) )
+                ret = -EFAULT;
+
+            break;
+        }
+
+        case IOMMU_unmap_pages:
+        {
+            struct pv_iommu_unmap_pages unmap;
+
+            if ( unlikely(copy_from_guest(&unmap, arg, 1)) )
+            {
+                ret = -EFAULT;
+                break;
+            }
+
+            ret = unmap_pages_op(&unmap, d);
+
+            if ( unlikely(copy_to_guest(arg, &unmap, 1)) )
+                ret = -EFAULT;
+
+            break;
+        }
+
+        case IOMMU_remote_cmd:
+        {
+            struct pv_iommu_remote_cmd remote_cmd;
+
+            if ( remote )
+            {
+                /* Prevent remote_cmd from being called recursively */
+                ret = -EINVAL;
+                break;
+            }
+
+            if ( unlikely(copy_from_guest(&remote_cmd, arg, 1)) )
+            {
+                ret = -EFAULT;
+                break;
+            }
+
+            ret = remote_cmd_op(&remote_cmd, d);
+            break;
+        }
+
+        /*
+         * TODO
+         */
+        case IOMMU_alloc_nested:
+        {
+            ret = -EOPNOTSUPP;
+            break;
+        }
+
+        case IOMMU_flush_nested:
+        {
+            ret = -EOPNOTSUPP;
+            break;
+        }
+
+        case IOMMU_attach_pasid:
+        {
+            ret = -EOPNOTSUPP;
+            break;
+        }
+
+        case IOMMU_detach_pasid:
+        {
+            ret = -EOPNOTSUPP;
+            break;
+        }
+
+        default:
+            return -EOPNOTSUPP;
+    }
+
+    return ret;
+}
+
+long do_iommu_op(unsigned int subop, XEN_GUEST_HANDLE_PARAM(void) arg)
+{
+    long ret = 0;
+
+    if ( !can_use_iommu_check(current->domain) )
+        return -ENODEV;
+
+    ret = do_iommu_subop(subop, arg, current->domain, false);
+
+    if ( ret == -ERESTART )
+        return hypercall_create_continuation(__HYPERVISOR_iommu_op, "ih", subop, arg);
+
+    return ret;
+}
+
+/*
+ * Local variables:
+ * mode: C
+ * c-file-style: "BSD"
+ * c-basic-offset: 4
+ * tab-width: 4
+ * indent-tabs-mode: nil
+ * End:
+ */
diff --git a/xen/drivers/passthrough/amd/pci_amd_iommu.c b/xen/drivers/passthrough/amd/pci_amd_iommu.c
index 366d5eb982..0b561ff99b 100644
--- a/xen/drivers/passthrough/amd/pci_amd_iommu.c
+++ b/xen/drivers/passthrough/amd/pci_amd_iommu.c
@@ -714,6 +714,20 @@ static void cf_check amd_dump_page_tables(struct domain *d)
                               hd->arch.amd.paging_mode, 0, 0);
 }
 
+static uint64_t cf_check amd_get_max_iova(struct domain *d)
+{
+    struct domain_iommu *hd = dom_iommu(d);
+    unsigned int bits = 12 + hd->arch.amd.paging_mode * 9;
+
+    /* If paging_mode == 6, which indicates 6-level page tables,
+     * we have bits == 66 while the GPA space is still 64 bits wide.
+     */
+    if ( bits >= 64 )
+        return ~0LLU;
+
+    return (1LLU << bits) - 1;
+}
+
 static const struct iommu_ops __initconst_cf_clobber _iommu_ops = {
     .page_sizes = PAGE_SIZE_4K | PAGE_SIZE_2M | PAGE_SIZE_1G,
     .init = amd_iommu_domain_init,
@@ -742,6 +756,7 @@ static const struct iommu_ops __initconst_cf_clobber _iommu_ops = {
     .crash_shutdown = amd_iommu_crash_shutdown,
     .get_reserved_device_memory = amd_iommu_get_reserved_device_memory,
     .dump_page_tables = amd_dump_page_tables,
+    .get_max_iova = amd_get_max_iova,
 };
 
 static const struct iommu_init_ops __initconstrel _iommu_init_ops = {
diff --git a/xen/drivers/passthrough/iommu.c b/xen/drivers/passthrough/iommu.c
index f92835a2ed..c26a2160f9 100644
--- a/xen/drivers/passthrough/iommu.c
+++ b/xen/drivers/passthrough/iommu.c
@@ -193,6 +193,99 @@ static void __hwdom_init check_hwdom_reqs(struct domain *d)
     arch_iommu_check_autotranslated_hwdom(d);
 }
 
+
+int iommu_domain_pviommu_init(struct domain *d, uint16_t nb_ctx, uint32_t arena_order)
+{
+    struct domain_iommu *hd = dom_iommu(d);
+    int rc;
+
+    BUG_ON(nb_ctx == 0); /* sanity check (prevent underflow) */
+
+    /*
+     * hd->other_contexts.count is always reported as 0 during initialization
+     * preventing misuse of partially initialized IOMMU contexts.
+     */
+
+    if ( atomic_cmpxchg(&hd->other_contexts.initialized, 0, 1) == 1 )
+        return -EACCES;
+
+    if ( (nb_ctx - 1) > 0 ) {
+        /* Initialize context bitmap */
+        size_t i;
+
+        hd->other_contexts.bitmap = xzalloc_array(unsigned long,
+                                                  BITS_TO_LONGS(nb_ctx - 1));
+
+        if ( !hd->other_contexts.bitmap )
+        {
+            rc = -ENOMEM;
+            goto cleanup;
+        }
+
+        hd->other_contexts.map = xzalloc_array(struct iommu_context, nb_ctx - 1);
+
+        if ( !hd->other_contexts.map )
+        {
+            rc = -ENOMEM;
+            goto cleanup;
+        }
+
+        for ( i = 0; i < (nb_ctx - 1); i++ )
+            rspin_lock_init(&hd->other_contexts.map[i].lock);
+    }
+
+    rc = arch_iommu_pviommu_init(d, nb_ctx, arena_order);
+
+    if ( rc )
+        goto cleanup;
+
+    /* Make sure initialization is complete before making it visible to other CPUs. */
+    smp_wmb();
+
+    hd->other_contexts.count = nb_ctx - 1;
+
+    printk(XENLOG_INFO "Dom%d uses %lu IOMMU contexts (%llu pages arena)\n",
+           d->domain_id, (unsigned long)nb_ctx, 1llu << arena_order);
+
+    return 0;
+
+cleanup:
+    /* TODO: Reset hd->other_contexts.initialized */
+    if ( hd->other_contexts.bitmap )
+    {
+        xfree(hd->other_contexts.bitmap);
+        hd->other_contexts.bitmap = NULL;
+    }
+
+    if ( hd->other_contexts.map )
+    {
+        xfree(hd->other_contexts.map);
+        hd->other_contexts.map = NULL;
+    }
+
+    return rc;
+}
+
+int iommu_domain_pviommu_teardown(struct domain *d)
+{
+    struct domain_iommu *hd = dom_iommu(d);
+    int i;
+    /* FIXME: Potential race condition with remote_op ? */
+
+    for (i = 0; i < hd->other_contexts.count; i++)
+        WARN_ON(iommu_context_free(d, i, IOMMU_TEARDOWN_REATTACH_DEFAULT) != -ENOENT);
+
+    hd->other_contexts.count = 0;
+
+    if ( hd->other_contexts.bitmap )
+        xfree(hd->other_contexts.bitmap);
+
+    if ( hd->other_contexts.map )
+        xfree(hd->other_contexts.map);
+
+    return 0;
+}
+
 int iommu_domain_init(struct domain *d, unsigned int opts)
 {
     struct domain_iommu *hd = dom_iommu(d);
@@ -238,6 +331,8 @@ int iommu_domain_init(struct domain *d, unsigned int opts)
 
     ASSERT(!(hd->need_sync && hd->hap_pt_share));
 
+    hd->allow_pv_iommu = true;
+
     rspin_lock(&hd->default_ctx.lock);
     ret = iommu_context_init(d, &hd->default_ctx, 0, IOMMU_CONTEXT_INIT_default);
     rspin_unlock(&hd->default_ctx.lock);
@@ -1204,6 +1299,16 @@ bool iommu_has_feature(struct domain *d, enum iommu_feature feature)
     return is_iommu_enabled(d) && test_bit(feature, dom_iommu(d)->features);
 }
 
+uint64_t iommu_get_max_iova(struct domain *d)
+{
+    struct domain_iommu *hd = dom_iommu(d);
+
+    if ( !hd->platform_ops->get_max_iova )
+        return 0;
+
+    return iommu_call(hd->platform_ops, get_max_iova, d);
+}
+
 #define MAX_EXTRA_RESERVED_RANGES 20
 struct extra_reserved_range {
     unsigned long start;
diff --git a/xen/drivers/passthrough/vtd/iommu.c b/xen/drivers/passthrough/vtd/iommu.c
index bb53cff158..20afb68399 100644
--- a/xen/drivers/passthrough/vtd/iommu.c
+++ b/xen/drivers/passthrough/vtd/iommu.c
@@ -2605,6 +2605,13 @@ static int cf_check intel_iommu_remove_devfn(struct domain *d, struct pci_dev *p
     return unapply_context_single(d, drhd->iommu, NULL, pdev->bus, devfn);
 }
 
+static uint64_t cf_check intel_iommu_get_max_iova(struct domain *d)
+{
+    struct domain_iommu *hd = dom_iommu(d);
+
+    return (1LLU << agaw_to_width(hd->arch.vtd.agaw)) - 1;
+}
+
 static const struct iommu_ops __initconst_cf_clobber vtd_ops = {
     .page_sizes = PAGE_SIZE_4K,
     .init = intel_iommu_domain_init,
@@ -2636,6 +2643,7 @@ static const struct iommu_ops __initconst_cf_clobber vtd_ops = {
     .iotlb_flush = iommu_flush_iotlb,
     .get_reserved_device_memory = intel_iommu_get_reserved_device_memory,
     .dump_page_tables = vtd_dump_page_tables,
+    .get_max_iova = intel_iommu_get_max_iova,
 };
 
 const struct iommu_init_ops __initconstrel intel_iommu_init_ops = {
diff --git a/xen/drivers/passthrough/x86/iommu.c b/xen/drivers/passthrough/x86/iommu.c
index 7b7fac0db8..79efc6ad47 100644
--- a/xen/drivers/passthrough/x86/iommu.c
+++ b/xen/drivers/passthrough/x86/iommu.c
@@ -215,6 +215,32 @@ int arch_iommu_context_teardown(struct domain *d, struct iommu_context *ctx, u32
     return 0;
 }
 
+int arch_iommu_pviommu_init(struct domain *d, uint16_t nb_ctx, uint32_t arena_order)
+{
+    struct domain_iommu *hd = dom_iommu(d);
+
+    if ( arena_order == 0 )
+        return 0;
+
+    return iommu_arena_initialize(&hd->arch.pt_arena, NULL, arena_order, 0);
+}
+
+int arch_iommu_pviommu_teardown(struct domain *d)
+{
+    struct domain_iommu *hd = dom_iommu(d);
+
+    if ( iommu_arena_teardown(&hd->arch.pt_arena, true) )
+    {
+        printk(XENLOG_WARNING "IOMMU Arena used while being destroyed\n");
+        WARN();
+
+        /* Teardown anyway */
+        iommu_arena_teardown(&hd->arch.pt_arena, false);
+    }
+
+    return 0;
+}
+
 void arch_iommu_domain_destroy(struct domain *d)
 {
 }
@@ -394,6 +420,19 @@ void iommu_identity_map_teardown(struct domain *d, struct iommu_context *ctx)
     }
 }
 
+bool iommu_identity_map_check(struct domain *d, struct iommu_context *ctx,
+                              mfn_t mfn)
+{
+    struct identity_map *map;
+    uint64_t addr = pfn_to_paddr(mfn_x(mfn));
+
+    list_for_each_entry ( map, &ctx->arch.identity_maps, list )
+        if ( addr >= map->base && addr < map->end )
+            return true;
+
+    return false;
+}
+
 static int __hwdom_init cf_check map_subtract(unsigned long s, unsigned long e,
                                               void *data)
 {
@@ -669,7 +708,7 @@ void iommu_free_domid(domid_t domid, unsigned long *map)
         BUG();
 }
 
-int iommu_free_pgtables(struct domain *d, struct iommu_context *ctx)
+int cf_check iommu_free_pgtables(struct domain *d, struct iommu_context *ctx)
 {
     struct domain_iommu *hd = dom_iommu(d);
     struct page_info *pg;
@@ -686,7 +725,10 @@ int iommu_free_pgtables(struct domain *d, struct iommu_context *ctx)
 
     while ( (pg = page_list_remove_head(&ctx->arch.pgtables)) )
     {
-        free_domheap_page(pg);
+        if ( ctx->id == 0 )
+            free_domheap_page(pg);
+        else
+            iommu_arena_free_page(&hd->arch.pt_arena, pg);
 
         if ( !(++done & 0xff) && general_preempt_check() )
             return -ERESTART;
@@ -708,7 +750,11 @@ struct page_info *iommu_alloc_pgtable(struct domain_iommu *hd,
         memflags = MEMF_node(hd->node);
 #endif
 
-    pg = alloc_domheap_page(NULL, memflags);
+    if ( ctx->id == 0 )
+        pg = alloc_domheap_page(NULL, memflags);
+    else
+        pg = iommu_arena_allocate_page(&hd->arch.pt_arena);
+
     if ( !pg )
         return NULL;
 
@@ -787,9 +833,14 @@ void iommu_queue_free_pgtable(struct domain *d, struct iommu_context *ctx,
 
     page_list_del(pg, &ctx->arch.pgtables);
 
-    page_list_add_tail(pg, &per_cpu(free_pgt_list, cpu));
+    if ( !ctx->id )
+    {
+        page_list_add_tail(pg, &per_cpu(free_pgt_list, cpu));
 
-    tasklet_schedule(&per_cpu(free_pgt_tasklet, cpu));
+        tasklet_schedule(&per_cpu(free_pgt_tasklet, cpu));
+    }
+    else
+        iommu_arena_free_page(&dom_iommu(d)->arch.pt_arena, pg);
 }
 
 static int cf_check cpu_callback(
diff --git a/xen/include/hypercall-defs.c b/xen/include/hypercall-defs.c
index 7720a29ade..78ca87b57f 100644
--- a/xen/include/hypercall-defs.c
+++ b/xen/include/hypercall-defs.c
@@ -209,6 +209,9 @@ hypfs_op(unsigned int cmd, const char *arg1, unsigned long arg2, void *arg3, uns
 #ifdef CONFIG_X86
 xenpmu_op(unsigned int op, xen_pmu_params_t *arg)
 #endif
+#ifdef CONFIG_HAS_PASSTHROUGH
+iommu_op(unsigned int subop, void *arg)
+#endif
 
 #ifdef CONFIG_PV
 caller: pv64
@@ -295,5 +298,8 @@ mca                                do       do       -        -        -
 #ifndef CONFIG_PV_SHIM_EXCLUSIVE
 paging_domctl_cont                 do       do       do       do       -
 #endif
+#ifdef CONFIG_HAS_PASSTHROUGH
+iommu_op                           do       do       do       do       -
+#endif
 
 #endif /* !CPPCHECK */
diff --git a/xen/include/public/pv-iommu.h b/xen/include/public/pv-iommu.h
new file mode 100644
index 0000000000..6f50aea4b7
--- /dev/null
+++ b/xen/include/public/pv-iommu.h
@@ -0,0 +1,343 @@
+/* SPDX-License-Identifier: MIT */
+/**
+ * pv-iommu.h
+ *
+ * Paravirtualized IOMMU driver interface.
+ *
+ * Copyright (c) 2024 Teddy Astie <teddy.astie@vates.tech>
+ */
+
+#ifndef __XEN_PUBLIC_PV_IOMMU_H__
+#define __XEN_PUBLIC_PV_IOMMU_H__
+
+#include "xen.h"
+#include "physdev.h"
+
+#ifndef uint64_aligned_t
+#define uint64_aligned_t uint64_t
+#endif
+
+#define IOMMU_DEFAULT_CONTEXT (0)
+
+enum pv_iommu_cmd {
+    /* Basic cmd */
+    IOMMU_noop = 0,
+    IOMMU_query_capabilities = 1,
+    IOMMU_init = 2,
+    IOMMU_alloc_context = 3,
+    IOMMU_free_context = 4,
+    IOMMU_reattach_device = 5,
+    IOMMU_map_pages = 6,
+    IOMMU_unmap_pages = 7,
+    IOMMU_remote_cmd = 8,
+
+    /* Extended cmd */
+    IOMMU_alloc_nested = 9,      /* if IOMMUCAP_nested */
+    IOMMU_flush_nested = 10,     /* if IOMMUCAP_nested */
+    IOMMU_attach_pasid = 11,     /* if IOMMUCAP_pasid */
+    IOMMU_detach_pasid = 12,     /* if IOMMUCAP_pasid */
+};
+
+/**
+ * If set, default context allows DMA to domain memory.
+ * If cleared, default context blocks all DMA to domain memory.
+ */
+#define IOMMUCAP_default_identity  (1U << 0)
+
+/**
+ * IOMMU_MAP_cache support.
+ */
+#define IOMMUCAP_cache     (1U << 1)
+
+/**
+ * If set, IOMMU_alloc_nested and IOMMU_flush_nested are supported.
+ */
+#define IOMMUCAP_nested    (1U << 2)
+
+/**
+ * If set, IOMMU_attach_pasid and IOMMU_detach_pasid are supported and
+ * a device PASID can be specified in IOMMU_reattach_device.
+ */
+#define IOMMUCAP_pasid     (1U << 3)
+
+/**
+ * If set, IOMMU_ALLOC_identity is supported in pv_iommu_alloc.
+ */
+#define IOMMUCAP_identity  (1U << 4)
+
+/**
+ * IOMMU_query_capabilities
+ * Query PV-IOMMU capabilities for this domain.
+ */
+struct pv_iommu_capabilities {
+    /*
+     * OUT: Maximum device address (iova) that the guest can use for mappings.
+     */
+    uint64_aligned_t max_iova_addr;
+
+    /* OUT: IOMMU capabilities flags */
+    uint32_t cap_flags;
+
+    /* OUT: Mask of all supported page sizes. */
+    uint32_t pgsize_mask;
+
+    /* OUT: Maximum pasid (if IOMMUCAP_pasid) */
+    uint32_t max_pasid;
+
+    /* OUT: Maximum number of IOMMU contexts this domain can use. */
+    uint16_t max_ctx_no;
+
+    uint16_t pad0;
+};
+typedef struct pv_iommu_capabilities pv_iommu_capabilities_t;
+DEFINE_XEN_GUEST_HANDLE(pv_iommu_capabilities_t);
+
+/**
+ * IOMMU_init
+ * Initialize PV-IOMMU for this domain.
+ *
+ * Fails with -EACCES if PV-IOMMU is already initialized.
+ */
+struct pv_iommu_init {
+    /* IN: Maximum number of IOMMU contexts this domain can use. */
+    uint32_t max_ctx_no;
+
+    /* IN: Arena size as an order (the arena holds 2^arena_order pages) */
+    uint32_t arena_order;
+};
+typedef struct pv_iommu_init pv_iommu_init_t;
+DEFINE_XEN_GUEST_HANDLE(pv_iommu_init_t);
+
+/**
+ * Create a 1:1 identity mapped context to domain memory
+ * (needs IOMMUCAP_identity).
+ */
+#define IOMMU_ALLOC_identity (1 << 0)
+
+/**
+ * IOMMU_alloc_context
+ * Allocate an IOMMU context.
+ * Fails with -ENOSPC if no context number is available.
+ */
+struct pv_iommu_alloc {
+    /* OUT: allocated IOMMU context number */
+    uint16_t ctx_no;
+
+    /* IN: allocation flags */
+    uint32_t alloc_flags;
+};
+typedef struct pv_iommu_alloc pv_iommu_alloc_t;
+DEFINE_XEN_GUEST_HANDLE(pv_iommu_alloc_t);
+
+/**
+ * Move all devices to default context before freeing the context.
+ */
+#define IOMMU_FREE_reattach_default (1 << 0)
+
+/**
+ * IOMMU_free_context
+ * Destroy an IOMMU context.
+ *
+ * If IOMMU_FREE_reattach_default is specified, move all context devices to
+ * default context before destroying this context.
+ *
+ * If there are devices in the context and IOMMU_FREE_reattach_default is not
+ * specified, fail with -EBUSY.
+ *
+ * The default context can't be destroyed.
+ */
+struct pv_iommu_free {
+    /* IN: IOMMU context number to free */
+    uint16_t ctx_no;
+
+    /* IN: Free operation specific flags */
+    uint32_t free_flags;
+};
+typedef struct pv_iommu_free pv_iommu_free_t;
+DEFINE_XEN_GUEST_HANDLE(pv_iommu_free_t);
+
+/* Device has read access */
+#define IOMMU_MAP_readable (1 << 0)
+
+/* Device has write access */
+#define IOMMU_MAP_writeable (1 << 1)
+
+/* Enforce DMA coherency */
+#define IOMMU_MAP_cache (1 << 2)
+
+/**
+ * IOMMU_map_pages
+ * Map pages on an IOMMU context.
+ *
+ * pgsize must be supported by pgsize_mask.
+ * Fails with -EADDRINUSE if mapping on top of another mapping.
+ * Report actually mapped page count in mapped field (regardless of failure).
+ */
+struct pv_iommu_map_pages {
+    /* IN: IOMMU context number */
+    uint16_t ctx_no;
+
+    /* IN: Guest frame number */
+    uint64_aligned_t gfn;
+
+    /* IN: Device frame number */
+    uint64_aligned_t dfn;
+
+    /* IN: Map flags */
+    uint32_t map_flags;
+
+    /* IN: Size of pages to map */
+    uint32_t pgsize;
+
+    /* IN: Number of pages to map */
+    uint32_t nr_pages;
+
+    /* OUT: Number of pages actually mapped */
+    uint32_t mapped;
+};
+typedef struct pv_iommu_map_pages pv_iommu_map_pages_t;
+DEFINE_XEN_GUEST_HANDLE(pv_iommu_map_pages_t);
+
+/**
+ * IOMMU_unmap_pages
+ * Unmap pages on an IOMMU context.
+ *
+ * pgsize must be supported by pgsize_mask.
+ * Report actually unmapped page count in unmapped field (regardless of failure).
+ * Fails with -ENOENT when attempting to unmap a page without any mapping.
+ */
+struct pv_iommu_unmap_pages {
+    /* IN: IOMMU context number */
+    uint16_t ctx_no;
+
+    /* IN: Device frame number */
+    uint64_aligned_t dfn;
+
+    /* IN: Size of pages to unmap */
+    uint32_t pgsize;
+
+    /* IN: Number of pages to unmap */
+    uint32_t nr_pages;
+
+    /* OUT: Number of pages actually unmapped */
+    uint32_t unmapped;
+};
+typedef struct pv_iommu_unmap_pages pv_iommu_unmap_pages_t;
+DEFINE_XEN_GUEST_HANDLE(pv_iommu_unmap_pages_t);
+
+/**
+ * IOMMU_reattach_device
+ * Reattach a device to another IOMMU context.
+ * Fails with -ENOENT if no such device exists.
+ */
+struct pv_iommu_reattach_device {
+    /* IN: Target IOMMU context number */
+    uint16_t ctx_no;
+
+    /* IN: Physical device to move */
+    struct physdev_pci_device dev;
+
+    /* IN: PASID of the device (if IOMMUCAP_pasid) */
+    uint32_t pasid;
+};
+typedef struct pv_iommu_reattach_device pv_iommu_reattach_device_t;
+DEFINE_XEN_GUEST_HANDLE(pv_iommu_reattach_device_t);
+
+
+/**
+ * IOMMU_remote_cmd
+ * Do a PV-IOMMU operation on another domain.
+ * Current domain needs to be allowed to act on the target domain, otherwise
+ * fails with -EPERM.
+ */
+struct pv_iommu_remote_cmd {
+    /* IN: Target domain to do the subop on */
+    uint16_t domid;
+
+    /* IN: Command to do on target domain. */
+    uint16_t subop;
+
+    /* INOUT: Command argument from current domain memory */
+    XEN_GUEST_HANDLE(void) arg;
+};
+typedef struct pv_iommu_remote_cmd pv_iommu_remote_cmd_t;
+DEFINE_XEN_GUEST_HANDLE(pv_iommu_remote_cmd_t);
+
+/**
+ * IOMMU_alloc_nested
+ * Create a nested IOMMU context (needs IOMMUCAP_nested).
+ *
+ * This context takes a platform-specific page table from domain address space,
+ * specified in pgtable_gfn, and uses it for nested translations.
+ *
+ * Explicit flushes need to be submitted with IOMMU_flush_nested on
+ * modification of the nested pagetable to ensure coherency between IOTLB and
+ * nested page table.
+ *
+ * This context can be destroyed using IOMMU_free_context.
+ * This context cannot be modified using map_pages, unmap_pages.
+ */
+struct pv_iommu_alloc_nested {
+    /* OUT: allocated IOMMU context number */
+    uint16_t ctx_no;
+
+    /* IN: guest frame number of the nested page table */
+    uint64_aligned_t pgtable_gfn;
+
+    /* IN: nested mode flags */
+    uint64_aligned_t nested_flags;
+};
+typedef struct pv_iommu_alloc_nested pv_iommu_alloc_nested_t;
+DEFINE_XEN_GUEST_HANDLE(pv_iommu_alloc_nested_t);
+
+/**
+ * IOMMU_flush_nested (needs IOMMUCAP_nested)
+ * Flush the IOTLB for nested translation.
+ */
+struct pv_iommu_flush_nested {
+    /* TODO */
+};
+typedef struct pv_iommu_flush_nested pv_iommu_flush_nested_t;
+DEFINE_XEN_GUEST_HANDLE(pv_iommu_flush_nested_t);
+
+/**
+ * IOMMU_attach_pasid (needs IOMMUCAP_pasid)
+ * Attach a new device-with-pasid to an IOMMU context.
+ * If a matching device-with-pasid already exists (globally),
+ * fails with -EEXIST.
+ * If pasid is 0, fails with -EINVAL.
+ * If physical device doesn't exist in domain, fails with -ENOENT.
+ */
+struct pv_iommu_attach_pasid {
+    /* IN: IOMMU context to add the device-with-pasid in */
+    uint16_t ctx_no;
+
+    /* IN: Physical device */
+    struct physdev_pci_device dev;
+
+    /* IN: pasid of the device to attach */
+    uint32_t pasid;
+};
+typedef struct pv_iommu_attach_pasid pv_iommu_attach_pasid_t;
+DEFINE_XEN_GUEST_HANDLE(pv_iommu_attach_pasid_t);
+
+/**
+ * IOMMU_detach_pasid (needs IOMMUCAP_pasid)
+ * Detach a device-with-pasid.
+ * If the device-with-pasid does not exist or does not belong to the domain,
+ * fails with -ENOENT.
+ * If pasid is 0, fails with -EINVAL.
+ */
+struct pv_iommu_detach_pasid {
+    /* IN: Physical device */
+    struct physdev_pci_device dev;
+
+    /* IN: pasid of the device to detach */
+    uint32_t pasid;
+};
+typedef struct pv_iommu_detach_pasid pv_iommu_detach_pasid_t;
+DEFINE_XEN_GUEST_HANDLE(pv_iommu_detach_pasid_t);
+
+/* long do_iommu_op(int subop, XEN_GUEST_HANDLE_PARAM(void) arg) */
+
+#endif
\ No newline at end of file
diff --git a/xen/include/public/xen.h b/xen/include/public/xen.h
index e051f989a5..d5bdedfee5 100644
--- a/xen/include/public/xen.h
+++ b/xen/include/public/xen.h
@@ -118,6 +118,7 @@ DEFINE_XEN_GUEST_HANDLE(xen_ulong_t);
 #define __HYPERVISOR_xenpmu_op            40
 #define __HYPERVISOR_dm_op                41
 #define __HYPERVISOR_hypfs_op             42
+#define __HYPERVISOR_iommu_op             43
 
 /* Architecture-specific hypercall definitions. */
 #define __HYPERVISOR_arch_0               48
diff --git a/xen/include/xen/iommu.h b/xen/include/xen/iommu.h
index 15250da119..e115642b86 100644
--- a/xen/include/xen/iommu.h
+++ b/xen/include/xen/iommu.h
@@ -328,6 +328,8 @@ struct iommu_ops {
      */
     int (*dt_xlate)(device_t *dev, const struct dt_phandle_args *args);
 #endif
+
+    uint64_t (*get_max_iova)(struct domain *d);
 };
 
 /*
@@ -409,6 +411,10 @@ struct domain_iommu {
     /* SAF-2-safe enum constant in arithmetic operation */
     DECLARE_BITMAP(features, IOMMU_FEAT_count);
 
+
+    /* Is the domain allowed to use PV-IOMMU ? */
+    bool allow_pv_iommu;
+
     /* Does the guest share HAP mapping with the IOMMU? */
     bool hap_pt_share;
 
@@ -446,6 +452,8 @@ static inline int iommu_do_domctl(struct xen_domctl *domctl, struct domain *d,
 }
 #endif
 
+int iommu_domain_pviommu_init(struct domain *d, uint16_t nb_ctx, uint32_t arena_order);
+
 int __must_check iommu_suspend(void);
 void iommu_resume(void);
 void iommu_crash_shutdown(void);
@@ -461,6 +469,7 @@ int iommu_do_pci_domctl(struct xen_domctl *domctl, struct domain *d,
 
 void iommu_dev_iotlb_flush_timeout(struct domain *d, struct pci_dev *pdev);
 
+uint64_t iommu_get_max_iova(struct domain *d);
 
 struct iommu_context *iommu_get_context(struct domain *d, u16 ctx_id);
 void iommu_put_context(struct iommu_context *ctx);
@@ -496,6 +505,8 @@ DECLARE_PER_CPU(bool, iommu_dont_flush_iotlb);
 extern struct spinlock iommu_pt_cleanup_lock;
 extern struct page_list_head iommu_pt_cleanup_list;
 
+int arch_iommu_pviommu_init(struct domain *d, uint16_t nb_ctx, uint32_t arena_order);
+int arch_iommu_pviommu_teardown(struct domain *d);
 bool arch_iommu_use_permitted(const struct domain *d);
 
 #ifdef CONFIG_X86
-- 
2.47.2



Teddy Astie | Vates XCP-ng Developer

XCP-ng & Xen Orchestra - Vates solutions

web: https://vates.tech



* Re: [XEN RFC PATCH v6 00/11] IOMMU subsystem redesign and PV-IOMMU interface
  2025-02-17 10:18 [XEN RFC PATCH v6 00/11] IOMMU subsystem redesign and PV-IOMMU interface Teddy Astie
                   ` (10 preceding siblings ...)
  2025-02-17 10:18 ` [XEN RFC PATCH v6 10/11] iommu: Introduce PV-IOMMU Teddy Astie
@ 2025-02-18 14:26 ` Marek Marczykowski-Górecki
  11 siblings, 0 replies; 17+ messages in thread
From: Marek Marczykowski-Górecki @ 2025-02-18 14:26 UTC (permalink / raw)
  To: Teddy Astie
  Cc: xen-devel, Andrew Cooper, Anthony PERARD, Michal Orzel,
	Jan Beulich, Julien Grall, Roger Pau Monné,
	Stefano Stabellini, Bertrand Marquis, Volodymyr Babchuk,
	Shawn Anastasio, Lukasz Hawrylko, Daniel P. Smith,
	Mateusz Mówka

On Mon, Feb 17, 2025 at 10:18:17AM +0000, Teddy Astie wrote:
> This work has been presented at Xen Summit 2024 during the
>   IOMMU paravirtualization and Xen IOMMU subsystem rework
> design session.
> 
> Operating systems may want to have access to a IOMMU in order to do DMA
> protection or implement certain features (e.g VFIO on Linux).
> 
> VFIO support is mandatory for framework such as SPDK, which can be useful to
> implement an alternative storage backend for virtual machines [1].
> 
> In this patch series, we introduce in Xen the ability to manage several
> contexts per domain and provide a new hypercall interface to allow guests
> to manage IOMMU contexts.
> 
> The VT-d and AMD-Vi driver is updated to support these new features.
> 
> [1] Using SPDK with the Xen hypervisor - FOSDEM 2023
> ---
> Cc: Marek Marczykowski-Górecki <marmarek@invisiblethingslab.com>
> 
> PCI Passthrough now work on my side, but things are still feels quite brittle.

Pipeline:
https://gitlab.com/xen-project/people/marmarek/xen/-/pipelines/1676457780

The passthrough tests on Intel are green, but not on AMD. I don't see
any specific crash, but the device doesn't work (DHCP timeout). Likely
an issue with interrupts.

There are also build failures on ARM...

> Changed in v2 :
> * fixed Xen crash when dumping IOMMU contexts (using X debug key)
> with DomUs without IOMMU
> * s/dettach/detach/
> * removed some unused includes
> * fix dangling devices in contexts with detach
> 
> Changed in v3 :
> * lock entirely map/unmap in hypercall
> * prevent IOMMU operations on dying contexts (fix race condition)
> * iommu_check_context+iommu_get_context -> iommu_get_context and check for NULL
> 
> Changed in v4 :
> * Part of initialization logic is moved to domain or toolstack (IOMMU_init)
>   + domain/toolstack now decides on "context count" and "pagetable pool size"
>   + for now, all domains are able to initialize PV-IOMMU
> * introduce "dom0-iommu=no-dma" to make default context block all DMA
>   (disables HAP and sync-pt), enforcing usage of PV-IOMMU for DMA
>   Can be used to expose properly "Pre-boot DMA protection"
> * redesigned locking logic for contexts
>   + contexts are accessed using iommu_get_context and released with iommu_put_context
> 
> Changed in v5 :
> * various PCI Passthrough related fixes
>   + rewrote parts of PCI Passthrough logic
>   + various other related bug fixes
> * simplified VT-d DID (for hardware) management by only having one map instead of two
>   (pseudo_domid map was previously used for old quarantine code then recycled for PV-IOMMU
>    in addition to another map also tracing Domain<->VT-d DID, now there is only one
>    map tracking both making things simpler)
> * reworked parts of Xen quarantine logic (needed for PCI Passthrough)
> * added cf_check annotations
> * some changes to PV-IOMMU headers (Alejandro)
> 
> Changed in v6 :
> * reorganized the patch series to allow bissecting
>    * it is splitted in various smaller patches
> * initial AMD-Vi port (it doesn't completely work with PV-IOMMU though, but builds at
>   least)
>    * AMD-Vi lacks support for iommu_lookup_page (needed for several PV-IOMMU ops)
> 
> TODO:
> * fix some issues with no-dma+PV and grants
> * complete "no-dma" mode (expose to toolstack, add documentation, ...)
> * properly define nested mode and PASID support
> * consider per-iommu domid limit (allocate did on first attach/reattach ?)
> * fix ARM/PPC build issues
> 
> * make new quarantine code more unity region aware (isolate devices with
>   different reserved regions regions using separate 'contexts')
> * find a way to make PV-IOMMU work in DomUs (they don't see machine bdf)
> * there are corner cases with PV-IOMMU and to-domain Xen PCI Passthrough
>   (e.g pci-assignable-remove will reassign to context 0, while the driver
>    expects the device to to be in context X)
> 
> Teddy Astie (11):
>   docs/designs: Add a design document for IOMMU subsystem redesign
>   docs/designs: Add a design document for PV-IOMMU
>   x86/domain: Defer domain iommu initialization.
>   iommu: Move IOMMU domain related structures to (arch_)iommu_context
>   iommu: Simplify quarantine logic
>   vtd: Remove MAP_ERROR_RECOVERY code path in domain_context_mapping_one
>   iommu: Simplify hardware did management
>   iommu: Introduce redesigned IOMMU subsystem
>   x86/iommu: Introduce IOMMU arena
>   iommu: Introduce PV-IOMMU
>   iommu: Introduce no-dma feature
> 
>  docs/designs/iommu-contexts.md              |  403 +++++
>  docs/designs/pv-iommu.md                    |  116 ++
>  xen/arch/arm/include/asm/iommu.h            |    4 +
>  xen/arch/ppc/include/asm/iommu.h            |    3 +
>  xen/arch/x86/domain.c                       |   10 +-
>  xen/arch/x86/include/asm/arena.h            |   54 +
>  xen/arch/x86/include/asm/iommu.h            |   59 +-
>  xen/arch/x86/include/asm/pci.h              |   17 -
>  xen/arch/x86/mm/p2m-ept.c                   |    2 +-
>  xen/arch/x86/pv/dom0_build.c                |    6 +-
>  xen/arch/x86/tboot.c                        |    3 +-
>  xen/common/Makefile                         |    1 +
>  xen/common/memory.c                         |    4 +-
>  xen/common/pv-iommu.c                       |  539 +++++++
>  xen/drivers/passthrough/amd/iommu.h         |   21 +-
>  xen/drivers/passthrough/amd/iommu_cmd.c     |   20 +-
>  xen/drivers/passthrough/amd/iommu_init.c    |   13 +-
>  xen/drivers/passthrough/amd/iommu_map.c     |  217 +--
>  xen/drivers/passthrough/amd/pci_amd_iommu.c |  346 ++--
>  xen/drivers/passthrough/iommu.c             |  735 ++++++++-
>  xen/drivers/passthrough/pci.c               |  404 ++---
>  xen/drivers/passthrough/vtd/extern.h        |   19 +-
>  xen/drivers/passthrough/vtd/iommu.c         | 1612 ++++++-------------
>  xen/drivers/passthrough/vtd/iommu.h         |    2 -
>  xen/drivers/passthrough/vtd/qinval.c        |    2 +-
>  xen/drivers/passthrough/vtd/quirks.c        |   21 +-
>  xen/drivers/passthrough/vtd/vtd.h           |    3 +-
>  xen/drivers/passthrough/x86/Makefile        |    1 +
>  xen/drivers/passthrough/x86/arena.c         |  157 ++
>  xen/drivers/passthrough/x86/iommu.c         |  294 +++-
>  xen/include/hypercall-defs.c                |    6 +
>  xen/include/public/pv-iommu.h               |  343 ++++
>  xen/include/public/xen.h                    |    1 +
>  xen/include/xen/iommu.h                     |  117 +-
>  xen/include/xen/pci.h                       |    3 +
>  35 files changed, 3585 insertions(+), 1973 deletions(-)
>  create mode 100644 docs/designs/iommu-contexts.md
>  create mode 100644 docs/designs/pv-iommu.md
>  create mode 100644 xen/arch/x86/include/asm/arena.h
>  create mode 100644 xen/common/pv-iommu.c
>  create mode 100644 xen/drivers/passthrough/x86/arena.c
>  create mode 100644 xen/include/public/pv-iommu.h
> 
> --
> 2.47.2
> 
> 
> 
> Teddy Astie | Vates XCP-ng Developer
> 
> XCP-ng & Xen Orchestra - Vates solutions
> 
> web: https://vates.tech
> 

-- 
Best Regards,
Marek Marczykowski-Górecki
Invisible Things Lab



* Re: [XEN RFC PATCH v6 02/11] docs/designs: Add a design document for PV-IOMMU
  2025-02-17 10:18 ` [XEN RFC PATCH v6 02/11] docs/designs: Add a design document for PV-IOMMU Teddy Astie
@ 2025-02-19 12:02   ` Frediano Ziglio
  2025-02-19 14:01     ` Teddy Astie
  0 siblings, 1 reply; 17+ messages in thread
From: Frediano Ziglio @ 2025-02-19 12:02 UTC (permalink / raw)
  To: Teddy Astie, xen-devel
  Cc: Andrew Cooper, Anthony PERARD, Michal Orzel, Jan Beulich,
	Julien Grall, Roger Pau Monné, Stefano Stabellini

On 17/02/2025 10:18, Teddy Astie wrote:
> Some operating systems want to use IOMMU to implement various features (e.g
> VFIO) or DMA protection.
> This patch introduce a proposal for IOMMU paravirtualization for Dom0.
> 
> Signed-off-by: Teddy Astie <teddy.astie@vates.tech>
> ---
>   docs/designs/pv-iommu.md | 116 +++++++++++++++++++++++++++++++++++++++
>   1 file changed, 116 insertions(+)
>   create mode 100644 docs/designs/pv-iommu.md
> 
> diff --git a/docs/designs/pv-iommu.md b/docs/designs/pv-iommu.md
> new file mode 100644
> index 0000000000..7df9fa0b94
> --- /dev/null
> +++ b/docs/designs/pv-iommu.md
> @@ -0,0 +1,116 @@
> +# IOMMU paravirtualization for Dom0
> +
> +Status: Experimental
> +
> +# Background
> +
> +By default, Xen only uses the IOMMU for itself, either to make device adress
> +space coherent with guest adress space (x86 HVM/PVH) or to prevent devices

typo: adress -> address

> +from doing DMA outside it's expected memory regions including the hypervisor
> +(x86 PV).
> +
> +A limitation is that guests (especially privildged ones) may want to use

typo: privildged -> privileged

> +IOMMU hardware in order to implement features such as DMA protection and
> +VFIO [1] as IOMMU functionality is not available outside of the hypervisor
> +currently.
> +
> +[1] VFIO - "Virtual Function I/O" - https://www.kernel.org/doc/html/latest/driver-api/vfio.html
> +
> +# Design
> +
> +The operating system may want to have access to various IOMMU features such as
> +context management and DMA remapping. We can create a new hypercall that allows
> +the guest to have access to a new paravirtualized IOMMU interface.
> +
> +This feature is only meant to be available for the Dom0, as DomU have some
> +emulated devices that can't be managed on Xen side and are not hardware, we
> +can't rely on the hardware IOMMU to enforce DMA remapping.
> +
> +This interface is exposed under the `iommu_op` hypercall.
> +
> +In addition, Xen domains are modified in order to allow existence of several
> +IOMMU context including a default one that implement default behavior (e.g
> +hardware assisted paging) and can't be modified by guest. DomU cannot have
> +contexts, and therefore act as if they only have the default domain.
> +
> +Each IOMMU context within a Xen domain is identified using a domain-specific
> +context number that is used in the Xen IOMMU subsystem and the hypercall
> +interface.
> +
> +The number of IOMMU context a domain is specified by either the toolstack or
> +the domain itself.

I don't understand what you want to express with the above sentence.
Maybe it's just me.

> +
> +# IOMMU operations
> +
> +## Initialize PV-IOMMU
> +
> +Initialize PV-IOMMU for the domain.
> +It can only be called once.
> +

Could this operation be done automatically on first context allocation?

> +## Alloc context
> +
> +Create a new IOMMU context for the guest and return the context number to the
> +guest.
> +Fail if the IOMMU context limit of the guest is reached.
> +
> +A flag can be specified to create a identity mapping.
> +
> +## Free context
> +
> +Destroy a IOMMU context created previously.
> +It is not possible to free the default context.
> +
> +Reattach context devices to default context if specified by the guest.
> +
> +Fail if there is a device in the context and reattach-to-default flag is not
> +specified.
> +
> +## Reattach device
> +
> +Reattach a device to another IOMMU context (including the default one).
> +The target IOMMU context number must be valid and the context allocated.
> +
> +The guest needs to specify a PCI SBDF of a device he has access to.
> +
> +## Map/unmap page
> +
> +Map/unmap a page on a context.
> +The guest needs to specify a gfn and target dfn to map.
> +
> +Refuse to create the mapping if one already exist for the same dfn.
> +
> +## Lookup page
> +
> +Get the gfn mapped by a specific dfn.
> +
> +## Remote command
> +
> +Make a PV-IOMMU operation on behalf of another domain.
> +Especially useful for implementing IOMMU emulation (e.g using QEMU)
> +or initializing PV-IOMMU with enforced limits.
> +
> +# Implementation considerations
> +
> +## Hypercall batching
> +
> +In order to prevent unneeded hypercalls and IOMMU flushing, it is advisable to
> +be able to batch some critical IOMMU operations (e.g map/unmap multiple pages).
> +

I suppose that batching also implies preemption.

> +## Hardware without IOMMU support
> +
> +Operating system needs to be aware on PV-IOMMU capability, and whether it is
> +able to make contexts. However, some operating system may critically fail in
> +case they are able to make a new IOMMU context. Which is supposed to happen
> +if no IOMMU hardware is available.
> +
> +The hypercall interface needs a interface to advertise the ability to create
> +and manage IOMMU contexts including the amount of context the guest is able
> +to use. Using these informations, the Dom0 may decide whether to use or not
> +the PV-IOMMU interface.
> +
> +## Page pool for contexts
> +
> +In order to prevent unexpected starving on the hypervisor memory with a
> +buggy Dom0. We can preallocate the pages the contexts will use and make
> +map/unmap use these pages instead of allocating them dynamically.
> +

Regards,
   Frediano




* Re: [XEN RFC PATCH v6 07/11] iommu: Simplify hardware did management
  2025-02-17 10:18 ` [XEN RFC PATCH v6 07/11] iommu: Simplify hardware did management Teddy Astie
@ 2025-02-19 12:17   ` Frediano Ziglio
  2025-02-19 12:17   ` Frediano Ziglio
  1 sibling, 0 replies; 17+ messages in thread
From: Frediano Ziglio @ 2025-02-19 12:17 UTC (permalink / raw)
  To: Teddy Astie, xen-devel; +Cc: Jan Beulich, Andrew Cooper, Roger Pau Monné

On 17/02/2025 10:18, Teddy Astie wrote:
> Simplify the hardware DID management by allocating a DID
> per IOMMU context (currently Xen domain) instead of trying
> to reuse Xen domain DID (which may not be possible depending
> on hardware constraints like did limits).

Minor: here and in the title, "did" should be "DID" if it's the acronym,
otherwise it can be confusing.

...

Frediano




* Re: [XEN RFC PATCH v6 07/11] iommu: Simplify hardware did management
  2025-02-17 10:18 ` [XEN RFC PATCH v6 07/11] iommu: Simplify hardware did management Teddy Astie
  2025-02-19 12:17   ` Frediano Ziglio
@ 2025-02-19 12:17   ` Frediano Ziglio
  1 sibling, 0 replies; 17+ messages in thread
From: Frediano Ziglio @ 2025-02-19 12:17 UTC (permalink / raw)
  To: Teddy Astie, xen-devel; +Cc: Jan Beulich, Andrew Cooper, Roger Pau Monné

On 17/02/2025 10:18, Teddy Astie wrote:
> Simplify the hardware DID management by allocating a DID
> per IOMMU context (currently Xen domain) instead of trying
> to reuse Xen domain DID (which may not be possible depending
> on hardware constraints like did limits).

Minor: here and in the title, "did" should be "DID" if it's the acronym,
otherwise it can be confusing.

...

Frediano




* Re: [XEN RFC PATCH v6 02/11] docs/designs: Add a design document for PV-IOMMU
  2025-02-19 12:02   ` Frediano Ziglio
@ 2025-02-19 14:01     ` Teddy Astie
  0 siblings, 0 replies; 17+ messages in thread
From: Teddy Astie @ 2025-02-19 14:01 UTC (permalink / raw)
  To: Frediano Ziglio, xen-devel
  Cc: Andrew Cooper, Anthony PERARD, Michal Orzel, Jan Beulich,
	Julien Grall, Roger Pau Monné, Stefano Stabellini

Hello Frediano,

OK for the typo fixes.

On 19/02/2025 at 13:02, Frediano Ziglio wrote:
> On 17/02/2025 10:18, Teddy Astie wrote:
>> +Each IOMMU context within a Xen domain is identified using a domain-specific
>> +context number that is used in the Xen IOMMU subsystem and the hypercall
>> +interface.
>> +
>> +The number of IOMMU context a domain is specified by either the toolstack or
>> +the domain itself.
>
> I don't understand what you want express with the above sentence.
> Maybe it's just me.
>
>> +
>> +# IOMMU operations
>> +
>> +## Initialize PV-IOMMU
>> +
>> +Initialize PV-IOMMU for the domain.
>> +It can only be called once.
>> +
>
> Could this operation be done automatically on first context allocation ?
>

Initializing PV-IOMMU requires some additional parameters
(memory/context limits). To prevent a guest from initializing with
arbitrary limits, the initialization can also be done by the toolstack
(e.g. the domain builder) to enforce specific limits, as it can only be
done once.
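
As a rough illustration of the once-only rule (a sketch, not the code from
this series; the structure, the function name and the error codes below are
hypothetical), whoever initializes first, guest or toolstack, fixes the
limits, and any later attempt is rejected:

```c
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical per-domain PV-IOMMU state, for illustration only. */
struct pviommu_state {
    bool initialized;      /* set by the first successful init */
    uint16_t nb_ctx;       /* context limit chosen at init */
    uint32_t arena_order;  /* page table pool size (order) chosen at init */
};

/*
 * Hypothetical helper: record the limits on the first call and refuse
 * any later call, so a guest cannot raise limits set by the toolstack.
 * The error codes are assumptions, not taken from the series.
 */
static int pviommu_init_once(struct pviommu_state *s, uint16_t nb_ctx,
                             uint32_t arena_order)
{
    if (s->initialized)
        return -EEXIST;
    if (nb_ctx == 0)
        return -EINVAL;

    s->nb_ctx = nb_ctx;
    s->arena_order = arena_order;
    s->initialized = true;
    return 0;
}
```

With this shape, a toolstack that initializes the domain with 8 contexts
before the guest boots keeps that limit even if the guest later asks for
more.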

>> +## Alloc context
>> +
>> +Create a new IOMMU context for the guest and return the context number to the
>> +guest.
>> +Fail if the IOMMU context limit of the guest is reached.
>> +
>> +A flag can be specified to create a identity mapping.
>> +
>> +## Free context
>> +
>> +Destroy a IOMMU context created previously.
>> +It is not possible to free the default context.
>> +
>> +Reattach context devices to default context if specified by the guest.
>> +
>> +Fail if there is a device in the context and reattach-to-default flag is not
>> +specified.
>> +
>> +## Reattach device
>> +
>> +Reattach a device to another IOMMU context (including the default one).
>> +The target IOMMU context number must be valid and the context allocated.
>> +
>> +The guest needs to specify a PCI SBDF of a device he has access to.
>> +
>> +## Map/unmap page
>> +
>> +Map/unmap a page on a context.
>> +The guest needs to specify a gfn and target dfn to map.
>> +
>> +Refuse to create the mapping if one already exist for the same dfn.
>> +
>> +## Lookup page
>> +
>> +Get the gfn mapped by a specific dfn.
>> +
>> +## Remote command
>> +
>> +Make a PV-IOMMU operation on behalf of another domain.
>> +Especially useful for implementing IOMMU emulation (e.g using QEMU)
>> +or initializing PV-IOMMU with enforced limits.
>> +
>> +# Implementation considerations
>> +
>> +## Hypercall batching
>> +
>> +In order to prevent unneeded hypercalls and IOMMU flushing, it is advisable to
>> +be able to batch some critical IOMMU operations (e.g map/unmap multiple pages).
>> +
>
> I suppose that batching also implies preemption.
>

Yes, the current implementation does it, but I haven't updated the doc on
that aspect.
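
A minimal sketch of what batching with preemption can look like (this is not
the code from the series; every name below is hypothetical, and the
should_preempt callback stands in for whatever preemption check the
hypervisor uses). The loop reports how many pages were processed so the
caller can resume from that point:

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical callbacks standing in for the real map and preempt checks. */
typedef int (*map_one_fn)(uint64_t gfn, uint64_t dfn);
typedef bool (*preempt_check_fn)(void);

struct batch_result {
    size_t done;     /* pages mapped so far */
    int rc;          /* 0, or the error returned by the failing map */
    bool preempted;  /* true if the loop stopped early to be resumed */
};

/* Map nr_pages consecutive pages, stopping early on error or preemption. */
static struct batch_result map_batch(uint64_t gfn, uint64_t dfn,
                                     size_t nr_pages, map_one_fn map_one,
                                     preempt_check_fn should_preempt)
{
    struct batch_result res = { 0, 0, false };

    while (res.done < nr_pages) {
        res.rc = map_one(gfn + res.done, dfn + res.done);
        if (res.rc)
            break;
        res.done++;

        /* Only yield when there is still work left to resume. */
        if (res.done < nr_pages && should_preempt()) {
            res.preempted = true;
            break;
        }
    }
    return res;
}

/* Demo stubs used to exercise the loop; not part of any real interface. */
static int demo_map_ok(uint64_t gfn, uint64_t dfn)
{
    (void)gfn;
    (void)dfn;
    return 0;
}

static bool demo_never_preempt(void) { return false; }
static bool demo_always_preempt(void) { return true; }
```

When the result has preempted set, the caller would retry with gfn, dfn and
nr_pages adjusted by done, which is the usual shape of a restartable batch.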

>> +## Hardware without IOMMU support
>> +
>> +Operating system needs to be aware on PV-IOMMU capability, and whether it is
>> +able to make contexts. However, some operating system may critically fail in
>> +case they are able to make a new IOMMU context. Which is supposed to happen
>> +if no IOMMU hardware is available.
>> +
>> +The hypercall interface needs a interface to advertise the ability to create
>> +and manage IOMMU contexts including the amount of context the guest is able
>> +to use. Using these informations, the Dom0 may decide whether to use or not
>> +the PV-IOMMU interface.
>> +
>> +## Page pool for contexts
>> +
>> +In order to prevent unexpected starving on the hypervisor memory with a
>> +buggy Dom0. We can preallocate the pages the contexts will use and make
>> +map/unmap use these pages instead of allocating them dynamically.
>> +
>
> Regards,
>    Frediano
>

Thanks
Teddy



Teddy Astie | Vates XCP-ng Developer

XCP-ng & Xen Orchestra - Vates solutions

web: https://vates.tech




Thread overview: 17+ messages
2025-02-17 10:18 [XEN RFC PATCH v6 00/11] IOMMU subsystem redesign and PV-IOMMU interface Teddy Astie
2025-02-17 10:18 ` [XEN RFC PATCH v6 01/11] docs/designs: Add a design document for IOMMU subsystem redesign Teddy Astie
2025-02-17 10:18 ` [XEN RFC PATCH v6 05/11] iommu: Simplify quarantine logic Teddy Astie
2025-02-17 10:18 ` [XEN RFC PATCH v6 06/11] vtd: Remove MAP_ERROR_RECOVERY code path in domain_context_mapping_one Teddy Astie
2025-02-17 10:18 ` [XEN RFC PATCH v6 02/11] docs/designs: Add a design document for PV-IOMMU Teddy Astie
2025-02-19 12:02   ` Frediano Ziglio
2025-02-19 14:01     ` Teddy Astie
2025-02-17 10:18 ` [XEN RFC PATCH v6 09/11] x86/iommu: Introduce IOMMU arena Teddy Astie
2025-02-17 10:18 ` [XEN RFC PATCH v6 07/11] iommu: Simplify hardware did management Teddy Astie
2025-02-19 12:17   ` Frediano Ziglio
2025-02-19 12:17   ` Frediano Ziglio
2025-02-17 10:18 ` [XEN RFC PATCH v6 03/11] x86/domain: Defer domain iommu initialization Teddy Astie
2025-02-17 10:18 ` [XEN RFC PATCH v6 11/11] iommu: Introduce no-dma feature Teddy Astie
2025-02-17 10:18 ` [XEN RFC PATCH v6 04/11] iommu: Move IOMMU domain related structures to (arch_)iommu_context Teddy Astie
2025-02-17 10:18 ` [XEN RFC PATCH v6 08/11] iommu: Introduce redesigned IOMMU subsystem Teddy Astie
2025-02-17 10:18 ` [XEN RFC PATCH v6 10/11] iommu: Introduce PV-IOMMU Teddy Astie
2025-02-18 14:26 ` [XEN RFC PATCH v6 00/11] IOMMU subsystem redesign and PV-IOMMU interface Marek Marczykowski-Górecki
