* [PATCH RFCv2 00/13] iommu: Add MSI mapping support with nested SMMU
@ 2025-01-11 3:32 Nicolin Chen
2025-01-11 3:32 ` [PATCH RFCv2 01/13] genirq/msi: Store the IOMMU IOVA directly in msi_desc instead of iommu_cookie Nicolin Chen
` (15 more replies)
0 siblings, 16 replies; 64+ messages in thread
From: Nicolin Chen @ 2025-01-11 3:32 UTC (permalink / raw)
To: will, robin.murphy, jgg, kevin.tian, tglx, maz, alex.williamson
Cc: joro, shuah, reinette.chatre, eric.auger, yebin10, apatel,
shivamurthy.shastri, bhelgaas, anna-maria, yury.norov,
nipun.gupta, iommu, linux-kernel, linux-arm-kernel, kvm,
linux-kselftest, patches, jean-philippe, mdf, mshavit,
shameerali.kolothum.thodi, smostafa, ddutile
[ Background ]
On ARM GIC systems and others, the target address of the MSI is translated
by the IOMMU. For GIC, the MSI address page is called "ITS" page. When the
IOMMU is disabled, the MSI address is programmed to the physical location
of the GIC ITS page (e.g. 0x20200000). When the IOMMU is enabled, the ITS
page is behind the IOMMU, so the MSI address is programmed to an allocated
IO virtual address (a.k.a IOVA), e.g. 0xFFFF0000, which must be mapped to
the physical ITS page: IOVA (0xFFFF0000) ===> PA (0x20200000).
When a 2-stage translation is enabled, IOVA will be still used to program
the MSI address, though the mappings will be in two stages:
IOVA (0xFFFF0000) ===> IPA (e.g. 0x80900000) ===> PA (0x20200000)
(IPA stands for Intermediate Physical Address).
If the device that generates MSI is attached to an IOMMU_DOMAIN_DMA, the
IOVA is dynamically allocated from the top of the IOVA space. If attached
to an IOMMU_DOMAIN_UNMANAGED (e.g. a VFIO passthrough device), the IOVA is
fixed to an MSI window reported by the IOMMU driver via IOMMU_RESV_SW_MSI,
which is hardwired to MSI_IOVA_BASE (IOVA==0x8000000) for ARM IOMMUs.
So far, this IOMMU_RESV_SW_MSI works well, as the kernel is entirely in
charge of the IOMMU translation (1-stage translation) and the IOVA for the
ITS page is fixed and known by the kernel. However, with a virtual machine
enabling a nested IOMMU translation (2-stage), a guest kernel directly
controls the stage-1 translation with an IOMMU_DOMAIN_DMA, mapping a vITS
page (at an IPA 0x80900000) onto its own IOVA space (e.g. 0xEEEE0000).
Then, the host kernel cannot know that guest-level IOVA to program the MSI
address.
There have been two approaches to solve this problem:
1. Create an identity mapping in the stage-1. The VMM could insert a few
RMRs (Reserved Memory Regions) in the guest's IORT. Then the guest kernel
would fetch these RMR entries from the IORT and create an
IOMMU_RESV_DIRECT region per iommu group for a direct mapping. Eventually,
the mappings
would look like: IOVA (0x8000000) === IPA (0x8000000) ===> 0x20200000
This requires an IOMMUFD ioctl for kernel and VMM to agree on the IPA.
2. Forward the guest-level MSI IOVA captured by VMM to the host-level GIC
driver, to program the correct MSI IOVA. Forward the VMM-defined vITS
page location (IPA) to the kernel for the stage-2 mapping. Eventually:
IOVA (0xFFFF0000) ===> IPA (0x80900000) ===> PA (0x20200000)
This requires a VFIO ioctl (for IOVA) and an IOMMUFD ioctl (for IPA).
It is worth mentioning that when Eric Auger was working on the same topic
with the VFIO iommu uAPI, he tried approach (2) first, and then switched
to approach (1) as suggested by Jean-Philippe, to reduce complexity.
Approach (1) basically feels like the existing VFIO passthrough, which has
a 1-stage mapping for the unmanaged domain, merely shifting the MSI
mapping from stage 1 (guest-has-no-iommu case) to stage 2 (guest-has-iommu
case). So it can reuse the existing IOMMU_RESV_SW_MSI piece, following the
same idea of "VMM leaving everything to the kernel".
Approach (2) is an ideal solution, yet it requires additional effort for
the kernel to be aware of the 1-stage gIOVA(s) and 2-stage IPAs for the
vITS page(s), which demands close cooperation from the VMM.
* It also brings some complicated use cases to the table, where the host
and/or guest system(s) has/have multiple ITS pages.
[ Execution ]
Though these two approaches feel very different on the surface, they can
share some underlying common infrastructure. Currently, only one pair of
sw_msi functions (prepare/compose) is provided by dma-iommu for irqchip
drivers to use directly. There could be different versions of these
functions from different domain owners: for existing VFIO passthrough
cases and in-kernel DMA domain cases, reuse the existing dma-iommu version
of the sw_msi functions; for nested translation use cases, there can be
another version of sw_msi functions to handle mapping and msi_msg(s)
differently.
To support both approaches, in this series
- Get rid of the duplication in the "compose" function
- Introduce a function pointer for the previously "prepare" function
- Allow different domain owners to set their own "sw_msi" implementations
- Implement an iommufd_sw_msi function to additionally support a nested
translation use case using the approach (2), i.e. the RMR solution
- Add a pair of IOMMUFD options for a SW_MSI window for kernel and VMM to
agree on (for approach 1)
- Add a new VFIO ioctl to set the MSI(x) vector(s) for iommufd_sw_msi()
to update the msi_desc structure accordingly (for approach 2)
A missing piece
- Potentially another IOMMUFD_CMD_IOAS_MAP_MSI ioctl for VMM to map the
IPAs of the vITS page(s) in the stage-2 io page table. (for approach 2)
(in this RFC, conveniently reuse the new IOMMUFD SW_MSI options to set
the vITS page's IPA, which works fine in a single-vITS-page case.)
This is a joint effort that includes Jason's rework at the irq/iommu/
iommufd base level and my additional patches on top of that for new uAPIs.
This series is on github:
https://github.com/nicolinc/iommufd/commits/iommufd_msi-rfcv2
Pairing QEMU branch for testing (approach 1):
https://github.com/nicolinc/qemu/commits/wip/for_iommufd_msi-rfcv2-rmr
Pairing QEMU branch for testing (approach 2):
https://github.com/nicolinc/qemu/commits/wip/for_iommufd_msi-rfcv2-vits
Changelog
v2
* Rebase on v6.13-rc6
* Drop all the irq/pci patches and rework the compose function instead
* Add a new sw_msi op to iommu_domain for a per-type implementation and
let the iommufd core have its own implementation to support both
approaches
* Add RMR-solution (approach 1) support since it is straightforward and
has been widely used in some out-of-tree projects
v1
https://lore.kernel.org/kvm/cover.1731130093.git.nicolinc@nvidia.com/
Thanks!
Nicolin
Jason Gunthorpe (5):
genirq/msi: Store the IOMMU IOVA directly in msi_desc instead of
iommu_cookie
genirq/msi: Rename iommu_dma_compose_msi_msg() to
msi_msg_set_msi_addr()
iommu: Make iommu_dma_prepare_msi() into a generic operation
irqchip: Have CONFIG_IRQ_MSI_IOMMU be selected by the irqchips that
need it
iommufd: Implement sw_msi support natively
Nicolin Chen (8):
iommu: Turn fault_data to iommufd private pointer
iommufd: Make attach_handle generic
iommu: Turn iova_cookie to dma-iommu private pointer
iommufd: Add IOMMU_OPTION_SW_MSI_START/SIZE ioctls
iommufd/selftest: Add coverage for IOMMU_OPTION_SW_MSI_START/SIZE
iommufd/device: Allow setting IOVAs for MSI(x) vectors
vfio-iommufd: Provide another layer of msi_iova helpers
vfio/pci: Allow preset MSI IOVAs via VFIO_IRQ_SET_ACTION_PREPARE
drivers/iommu/Kconfig | 1 -
drivers/irqchip/Kconfig | 4 +
kernel/irq/Kconfig | 1 +
drivers/iommu/iommufd/iommufd_private.h | 69 ++--
include/linux/iommu.h | 58 ++--
include/linux/iommufd.h | 6 +
include/linux/msi.h | 43 ++-
include/linux/vfio.h | 25 ++
include/uapi/linux/iommufd.h | 18 +-
include/uapi/linux/vfio.h | 8 +-
drivers/iommu/dma-iommu.c | 63 ++--
drivers/iommu/iommu.c | 29 ++
drivers/iommu/iommufd/device.c | 312 ++++++++++++++++--
drivers/iommu/iommufd/fault.c | 122 +------
drivers/iommu/iommufd/hw_pagetable.c | 5 +-
drivers/iommu/iommufd/io_pagetable.c | 4 +-
drivers/iommu/iommufd/ioas.c | 34 ++
drivers/iommu/iommufd/main.c | 15 +
drivers/irqchip/irq-gic-v2m.c | 5 +-
drivers/irqchip/irq-gic-v3-its.c | 13 +-
drivers/irqchip/irq-gic-v3-mbi.c | 12 +-
drivers/irqchip/irq-ls-scfg-msi.c | 5 +-
drivers/vfio/iommufd.c | 27 ++
drivers/vfio/pci/vfio_pci_intrs.c | 46 +++
drivers/vfio/vfio_main.c | 3 +
tools/testing/selftests/iommu/iommufd.c | 53 +++
.../selftests/iommu/iommufd_fail_nth.c | 14 +
27 files changed, 712 insertions(+), 283 deletions(-)
--
2.43.0
^ permalink raw reply [flat|nested] 64+ messages in thread
* [PATCH RFCv2 01/13] genirq/msi: Store the IOMMU IOVA directly in msi_desc instead of iommu_cookie
2025-01-11 3:32 [PATCH RFCv2 00/13] iommu: Add MSI mapping support with nested SMMU Nicolin Chen
@ 2025-01-11 3:32 ` Nicolin Chen
2025-01-23 17:10 ` Eric Auger
2025-01-11 3:32 ` [PATCH RFCv2 02/13] genirq/msi: Rename iommu_dma_compose_msi_msg() to msi_msg_set_msi_addr() Nicolin Chen
` (14 subsequent siblings)
15 siblings, 1 reply; 64+ messages in thread
From: Nicolin Chen @ 2025-01-11 3:32 UTC (permalink / raw)
To: will, robin.murphy, jgg, kevin.tian, tglx, maz, alex.williamson
Cc: joro, shuah, reinette.chatre, eric.auger, yebin10, apatel,
shivamurthy.shastri, bhelgaas, anna-maria, yury.norov,
nipun.gupta, iommu, linux-kernel, linux-arm-kernel, kvm,
linux-kselftest, patches, jean-philippe, mdf, mshavit,
shameerali.kolothum.thodi, smostafa, ddutile
From: Jason Gunthorpe <jgg@nvidia.com>
All the iommu cases simply want to override the MSI page's address with
the IOVA that was mapped through the iommu. This doesn't need a cookie
pointer; we just need to store the IOVA and its page size in the msi_desc.
Instead, provide msi_desc_set_iommu_msi_iova(), which allows the IOMMU
side to specify the IOVA at which the MSI page is placed during
iommu_dma_prepare_msi(). This is stored in the msi_desc, and then
iommu_dma_compose_msi_msg() becomes a simple inline that sets
address_hi/lo. The next patch will correct the naming.
This is done because we cannot correctly lock access to group->domain in
the atomic context that iommu_dma_compose_msi_msg() is called under. Today
the locking miss is tolerable because dma_iommu.c operates under an
assumption that the domain does not change while a driver is probed.
However iommufd now permits the domain to change while the driver is
probed and VFIO userspace can create races with IRQ changes calling
iommu_dma_prepare/compose_msi_msg() and changing/freeing the iommu_domain.
Removing the pointer, and critically, the call to
iommu_get_domain_for_dev() during compose resolves this race.
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Signed-off-by: Nicolin Chen <nicolinc@nvidia.com>
---
include/linux/iommu.h | 6 ------
include/linux/msi.h | 45 +++++++++++++++++++++++----------------
drivers/iommu/dma-iommu.c | 30 +++++---------------------
3 files changed, 32 insertions(+), 49 deletions(-)
diff --git a/include/linux/iommu.h b/include/linux/iommu.h
index 318d27841130..3a4215966c1b 100644
--- a/include/linux/iommu.h
+++ b/include/linux/iommu.h
@@ -1513,7 +1513,6 @@ static inline void iommu_debugfs_setup(void) {}
int iommu_get_msi_cookie(struct iommu_domain *domain, dma_addr_t base);
int iommu_dma_prepare_msi(struct msi_desc *desc, phys_addr_t msi_addr);
-void iommu_dma_compose_msi_msg(struct msi_desc *desc, struct msi_msg *msg);
#else /* CONFIG_IOMMU_DMA */
@@ -1529,11 +1528,6 @@ static inline int iommu_dma_prepare_msi(struct msi_desc *desc, phys_addr_t msi_a
{
return 0;
}
-
-static inline void iommu_dma_compose_msi_msg(struct msi_desc *desc, struct msi_msg *msg)
-{
-}
-
#endif /* CONFIG_IOMMU_DMA */
/*
diff --git a/include/linux/msi.h b/include/linux/msi.h
index b10093c4d00e..d442b4a69d56 100644
--- a/include/linux/msi.h
+++ b/include/linux/msi.h
@@ -184,7 +184,8 @@ struct msi_desc {
struct msi_msg msg;
struct irq_affinity_desc *affinity;
#ifdef CONFIG_IRQ_MSI_IOMMU
- const void *iommu_cookie;
+ u64 iommu_msi_iova : 58;
+ u64 iommu_msi_page_shift : 6;
#endif
#ifdef CONFIG_SYSFS
struct device_attribute *sysfs_attrs;
@@ -285,28 +286,36 @@ struct msi_desc *msi_next_desc(struct device *dev, unsigned int domid,
#define msi_desc_to_dev(desc) ((desc)->dev)
-#ifdef CONFIG_IRQ_MSI_IOMMU
-static inline const void *msi_desc_get_iommu_cookie(struct msi_desc *desc)
-{
- return desc->iommu_cookie;
-}
-
-static inline void msi_desc_set_iommu_cookie(struct msi_desc *desc,
- const void *iommu_cookie)
+static inline void msi_desc_set_iommu_msi_iova(struct msi_desc *desc,
+ u64 msi_iova,
+ unsigned int page_shift)
{
- desc->iommu_cookie = iommu_cookie;
-}
-#else
-static inline const void *msi_desc_get_iommu_cookie(struct msi_desc *desc)
-{
- return NULL;
+#ifdef CONFIG_IRQ_MSI_IOMMU
+ desc->iommu_msi_iova = msi_iova >> page_shift;
+ desc->iommu_msi_page_shift = page_shift;
+#endif
}
-static inline void msi_desc_set_iommu_cookie(struct msi_desc *desc,
- const void *iommu_cookie)
+/**
+ * iommu_dma_compose_msi_msg() - Apply translation to an MSI message
+ * @desc: MSI descriptor prepared by iommu_dma_prepare_msi()
+ * @msg: MSI message containing target physical address
+ */
+static inline void iommu_dma_compose_msi_msg(struct msi_desc *desc,
+ struct msi_msg *msg)
{
-}
+#ifdef CONFIG_IRQ_MSI_IOMMU
+ if (desc->iommu_msi_page_shift) {
+ u64 msi_iova = desc->iommu_msi_iova
+ << desc->iommu_msi_page_shift;
+
+ msg->address_hi = upper_32_bits(msi_iova);
+ msg->address_lo = lower_32_bits(msi_iova) |
+ (msg->address_lo &
+ ((1 << desc->iommu_msi_page_shift) - 1));
+ }
#endif
+}
int msi_domain_insert_msi_desc(struct device *dev, unsigned int domid,
struct msi_desc *init_desc);
diff --git a/drivers/iommu/dma-iommu.c b/drivers/iommu/dma-iommu.c
index 2a9fa0c8cc00..bf91e014d179 100644
--- a/drivers/iommu/dma-iommu.c
+++ b/drivers/iommu/dma-iommu.c
@@ -1815,7 +1815,7 @@ int iommu_dma_prepare_msi(struct msi_desc *desc, phys_addr_t msi_addr)
static DEFINE_MUTEX(msi_prepare_lock); /* see below */
if (!domain || !domain->iova_cookie) {
- desc->iommu_cookie = NULL;
+ msi_desc_set_iommu_msi_iova(desc, 0, 0);
return 0;
}
@@ -1827,33 +1827,13 @@ int iommu_dma_prepare_msi(struct msi_desc *desc, phys_addr_t msi_addr)
mutex_lock(&msi_prepare_lock);
msi_page = iommu_dma_get_msi_page(dev, msi_addr, domain);
mutex_unlock(&msi_prepare_lock);
-
- msi_desc_set_iommu_cookie(desc, msi_page);
-
if (!msi_page)
return -ENOMEM;
- return 0;
-}
-/**
- * iommu_dma_compose_msi_msg() - Apply translation to an MSI message
- * @desc: MSI descriptor prepared by iommu_dma_prepare_msi()
- * @msg: MSI message containing target physical address
- */
-void iommu_dma_compose_msi_msg(struct msi_desc *desc, struct msi_msg *msg)
-{
- struct device *dev = msi_desc_to_dev(desc);
- const struct iommu_domain *domain = iommu_get_domain_for_dev(dev);
- const struct iommu_dma_msi_page *msi_page;
-
- msi_page = msi_desc_get_iommu_cookie(desc);
-
- if (!domain || !domain->iova_cookie || WARN_ON(!msi_page))
- return;
-
- msg->address_hi = upper_32_bits(msi_page->iova);
- msg->address_lo &= cookie_msi_granule(domain->iova_cookie) - 1;
- msg->address_lo += lower_32_bits(msi_page->iova);
+ msi_desc_set_iommu_msi_iova(
+ desc, msi_page->iova,
+ ilog2(cookie_msi_granule(domain->iova_cookie)));
+ return 0;
}
static int iommu_dma_init(void)
--
2.43.0
* [PATCH RFCv2 02/13] genirq/msi: Rename iommu_dma_compose_msi_msg() to msi_msg_set_msi_addr()
2025-01-11 3:32 [PATCH RFCv2 00/13] iommu: Add MSI mapping support with nested SMMU Nicolin Chen
2025-01-11 3:32 ` [PATCH RFCv2 01/13] genirq/msi: Store the IOMMU IOVA directly in msi_desc instead of iommu_cookie Nicolin Chen
@ 2025-01-11 3:32 ` Nicolin Chen
2025-01-23 17:10 ` Eric Auger
2025-01-11 3:32 ` [PATCH RFCv2 03/13] iommu: Make iommu_dma_prepare_msi() into a generic operation Nicolin Chen
` (13 subsequent siblings)
15 siblings, 1 reply; 64+ messages in thread
From: Nicolin Chen @ 2025-01-11 3:32 UTC (permalink / raw)
To: will, robin.murphy, jgg, kevin.tian, tglx, maz, alex.williamson
Cc: joro, shuah, reinette.chatre, eric.auger, yebin10, apatel,
shivamurthy.shastri, bhelgaas, anna-maria, yury.norov,
nipun.gupta, iommu, linux-kernel, linux-arm-kernel, kvm,
linux-kselftest, patches, jean-philippe, mdf, mshavit,
shameerali.kolothum.thodi, smostafa, ddutile
From: Jason Gunthorpe <jgg@nvidia.com>
The new function is used to take in a u64 MSI address and store it in the
msi_msg. If the iommu has provided an alternative address, then that is
used instead.
All callers have a tidy u64 already so this also consolidates the repeated
low/high code into a small helper.
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Signed-off-by: Nicolin Chen <nicolinc@nvidia.com>
---
include/linux/msi.h | 18 ++++++++----------
drivers/irqchip/irq-gic-v2m.c | 5 +----
drivers/irqchip/irq-gic-v3-its.c | 13 +++----------
drivers/irqchip/irq-gic-v3-mbi.c | 12 ++++--------
drivers/irqchip/irq-ls-scfg-msi.c | 5 ++---
5 files changed, 18 insertions(+), 35 deletions(-)
diff --git a/include/linux/msi.h b/include/linux/msi.h
index d442b4a69d56..f6369748fc6e 100644
--- a/include/linux/msi.h
+++ b/include/linux/msi.h
@@ -296,13 +296,8 @@ static inline void msi_desc_set_iommu_msi_iova(struct msi_desc *desc,
#endif
}
-/**
- * iommu_dma_compose_msi_msg() - Apply translation to an MSI message
- * @desc: MSI descriptor prepared by iommu_dma_prepare_msi()
- * @msg: MSI message containing target physical address
- */
-static inline void iommu_dma_compose_msi_msg(struct msi_desc *desc,
- struct msi_msg *msg)
+static inline void msi_msg_set_msi_addr(struct msi_desc *desc,
+ struct msi_msg *msg, u64 msi_addr)
{
#ifdef CONFIG_IRQ_MSI_IOMMU
if (desc->iommu_msi_page_shift) {
@@ -310,11 +305,14 @@ static inline void iommu_dma_compose_msi_msg(struct msi_desc *desc,
<< desc->iommu_msi_page_shift;
msg->address_hi = upper_32_bits(msi_iova);
- msg->address_lo = lower_32_bits(msi_iova) |
- (msg->address_lo &
- ((1 << desc->iommu_msi_page_shift) - 1));
+ msg->address_lo =
+ lower_32_bits(msi_iova) |
+ (msi_addr & ((1 << desc->iommu_msi_page_shift) - 1));
+ return;
}
#endif
+ msg->address_hi = upper_32_bits(msi_addr);
+ msg->address_lo = lower_32_bits(msi_addr);
}
int msi_domain_insert_msi_desc(struct device *dev, unsigned int domid,
diff --git a/drivers/irqchip/irq-gic-v2m.c b/drivers/irqchip/irq-gic-v2m.c
index be35c5349986..6599c56873ad 100644
--- a/drivers/irqchip/irq-gic-v2m.c
+++ b/drivers/irqchip/irq-gic-v2m.c
@@ -87,9 +87,6 @@ static void gicv2m_compose_msi_msg(struct irq_data *data, struct msi_msg *msg)
struct v2m_data *v2m = irq_data_get_irq_chip_data(data);
phys_addr_t addr = gicv2m_get_msi_addr(v2m, data->hwirq);
- msg->address_hi = upper_32_bits(addr);
- msg->address_lo = lower_32_bits(addr);
-
if (v2m->flags & GICV2M_GRAVITON_ADDRESS_ONLY)
msg->data = 0;
else
@@ -97,7 +94,7 @@ static void gicv2m_compose_msi_msg(struct irq_data *data, struct msi_msg *msg)
if (v2m->flags & GICV2M_NEEDS_SPI_OFFSET)
msg->data -= v2m->spi_offset;
- iommu_dma_compose_msi_msg(irq_data_get_msi_desc(data), msg);
+ msi_msg_set_msi_addr(irq_data_get_msi_desc(data), msg, addr);
}
static struct irq_chip gicv2m_irq_chip = {
diff --git a/drivers/irqchip/irq-gic-v3-its.c b/drivers/irqchip/irq-gic-v3-its.c
index 92244cfa0464..8c3ab7b471ca 100644
--- a/drivers/irqchip/irq-gic-v3-its.c
+++ b/drivers/irqchip/irq-gic-v3-its.c
@@ -1809,17 +1809,10 @@ static u64 its_irq_get_msi_base(struct its_device *its_dev)
static void its_irq_compose_msi_msg(struct irq_data *d, struct msi_msg *msg)
{
struct its_device *its_dev = irq_data_get_irq_chip_data(d);
- struct its_node *its;
- u64 addr;
-
- its = its_dev->its;
- addr = its->get_msi_base(its_dev);
-
- msg->address_lo = lower_32_bits(addr);
- msg->address_hi = upper_32_bits(addr);
- msg->data = its_get_event_id(d);
- iommu_dma_compose_msi_msg(irq_data_get_msi_desc(d), msg);
+ msg->data = its_get_event_id(d);
+ msi_msg_set_msi_addr(irq_data_get_msi_desc(d), msg,
+ its_dev->its->get_msi_base(its_dev));
}
static int its_irq_set_irqchip_state(struct irq_data *d,
diff --git a/drivers/irqchip/irq-gic-v3-mbi.c b/drivers/irqchip/irq-gic-v3-mbi.c
index 3fe870f8ee17..513479da9ee3 100644
--- a/drivers/irqchip/irq-gic-v3-mbi.c
+++ b/drivers/irqchip/irq-gic-v3-mbi.c
@@ -147,22 +147,18 @@ static const struct irq_domain_ops mbi_domain_ops = {
static void mbi_compose_msi_msg(struct irq_data *data, struct msi_msg *msg)
{
- msg[0].address_hi = upper_32_bits(mbi_phys_base + GICD_SETSPI_NSR);
- msg[0].address_lo = lower_32_bits(mbi_phys_base + GICD_SETSPI_NSR);
msg[0].data = data->parent_data->hwirq;
-
- iommu_dma_compose_msi_msg(irq_data_get_msi_desc(data), msg);
+ msi_msg_set_msi_addr(irq_data_get_msi_desc(data), &msg[0],
+ mbi_phys_base + GICD_SETSPI_NSR);
}
static void mbi_compose_mbi_msg(struct irq_data *data, struct msi_msg *msg)
{
mbi_compose_msi_msg(data, msg);
- msg[1].address_hi = upper_32_bits(mbi_phys_base + GICD_CLRSPI_NSR);
- msg[1].address_lo = lower_32_bits(mbi_phys_base + GICD_CLRSPI_NSR);
msg[1].data = data->parent_data->hwirq;
-
- iommu_dma_compose_msi_msg(irq_data_get_msi_desc(data), &msg[1]);
+ msi_msg_set_msi_addr(irq_data_get_msi_desc(data), &msg[1],
+ mbi_phys_base + GICD_CLRSPI_NSR);
}
static bool mbi_init_dev_msi_info(struct device *dev, struct irq_domain *domain,
diff --git a/drivers/irqchip/irq-ls-scfg-msi.c b/drivers/irqchip/irq-ls-scfg-msi.c
index c0e1aafe468c..2ac6d89b4cb4 100644
--- a/drivers/irqchip/irq-ls-scfg-msi.c
+++ b/drivers/irqchip/irq-ls-scfg-msi.c
@@ -87,8 +87,6 @@ static void ls_scfg_msi_compose_msg(struct irq_data *data, struct msi_msg *msg)
{
struct ls_scfg_msi *msi_data = irq_data_get_irq_chip_data(data);
- msg->address_hi = upper_32_bits(msi_data->msiir_addr);
- msg->address_lo = lower_32_bits(msi_data->msiir_addr);
msg->data = data->hwirq;
if (msi_affinity_flag) {
@@ -98,7 +96,8 @@ static void ls_scfg_msi_compose_msg(struct irq_data *data, struct msi_msg *msg)
msg->data |= cpumask_first(mask);
}
- iommu_dma_compose_msi_msg(irq_data_get_msi_desc(data), msg);
+ msi_msg_set_msi_addr(irq_data_get_msi_desc(data), msg,
+ msi_data->msiir_addr);
}
static int ls_scfg_msi_set_affinity(struct irq_data *irq_data,
--
2.43.0
* [PATCH RFCv2 03/13] iommu: Make iommu_dma_prepare_msi() into a generic operation
2025-01-11 3:32 [PATCH RFCv2 00/13] iommu: Add MSI mapping support with nested SMMU Nicolin Chen
2025-01-11 3:32 ` [PATCH RFCv2 01/13] genirq/msi: Store the IOMMU IOVA directly in msi_desc instead of iommu_cookie Nicolin Chen
2025-01-11 3:32 ` [PATCH RFCv2 02/13] genirq/msi: Rename iommu_dma_compose_msi_msg() to msi_msg_set_msi_addr() Nicolin Chen
@ 2025-01-11 3:32 ` Nicolin Chen
2025-01-23 17:10 ` Eric Auger
2025-01-11 3:32 ` [PATCH RFCv2 04/13] irqchip: Have CONFIG_IRQ_MSI_IOMMU be selected by the irqchips that need it Nicolin Chen
` (12 subsequent siblings)
15 siblings, 1 reply; 64+ messages in thread
From: Nicolin Chen @ 2025-01-11 3:32 UTC (permalink / raw)
To: will, robin.murphy, jgg, kevin.tian, tglx, maz, alex.williamson
Cc: joro, shuah, reinette.chatre, eric.auger, yebin10, apatel,
shivamurthy.shastri, bhelgaas, anna-maria, yury.norov,
nipun.gupta, iommu, linux-kernel, linux-arm-kernel, kvm,
linux-kselftest, patches, jean-philippe, mdf, mshavit,
shameerali.kolothum.thodi, smostafa, ddutile
From: Jason Gunthorpe <jgg@nvidia.com>
SW_MSI means the IOMMU is used to translate an MSI message before the MSI
message is delivered to the interrupt controller. On such systems the
iommu_domain must have a translation for the MSI message for interrupts to
work.
The IRQ subsystem will call into the IOMMU to request that a physical page
be set up to receive MSI messages, and the IOMMU then sets up an IOVA that
maps to that physical page. Ultimately the IOVA is programmed into the
device via the msi_msg.
Generalize this to allow the iommu_domain owner to provide its own
implementation of this mapping. Add a function pointer to struct
iommu_domain to allow the domain owner to provide an implementation.
Have dma-iommu supply its implementation for IOMMU_DOMAIN_DMA types during
the iommu_get_dma_cookie() path. For IOMMU_DOMAIN_UNMANAGED types used by
VFIO (and iommufd for now), have the same iommu_dma_sw_msi set as well in
the iommu_get_msi_cookie() path.
Hold the group mutex while in iommu_dma_prepare_msi() to ensure the domain
doesn't change or become freed while running. Races with IRQ operations
from VFIO and domain changes from iommufd are possible here.
Replace the msi_prepare_lock with a lockdep assertion for the group mutex
as documentation. For dma-iommu.c, each iommu_domain is unique to a
group.
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
[nicolinc: moved iommu_domain_set_sw_msi() from iommu_dma_init_domain() to
iommu_get_dma_cookie(); added an sw_msi test in iommu_put_dma_cookie()]
Signed-off-by: Nicolin Chen <nicolinc@nvidia.com>
---
include/linux/iommu.h | 44 ++++++++++++++++++++++++++-------------
drivers/iommu/dma-iommu.c | 33 +++++++++++++----------------
drivers/iommu/iommu.c | 29 ++++++++++++++++++++++++++
3 files changed, 73 insertions(+), 33 deletions(-)
diff --git a/include/linux/iommu.h b/include/linux/iommu.h
index 3a4215966c1b..423fdfa6b3bb 100644
--- a/include/linux/iommu.h
+++ b/include/linux/iommu.h
@@ -44,6 +44,8 @@ struct iommu_dma_cookie;
struct iommu_fault_param;
struct iommufd_ctx;
struct iommufd_viommu;
+struct msi_desc;
+struct msi_msg;
#define IOMMU_FAULT_PERM_READ (1 << 0) /* read */
#define IOMMU_FAULT_PERM_WRITE (1 << 1) /* write */
@@ -216,6 +218,12 @@ struct iommu_domain {
struct iommu_domain_geometry geometry;
struct iommu_dma_cookie *iova_cookie;
int (*iopf_handler)(struct iopf_group *group);
+
+#if IS_ENABLED(CONFIG_IRQ_MSI_IOMMU)
+ int (*sw_msi)(struct iommu_domain *domain, struct msi_desc *desc,
+ phys_addr_t msi_addr);
+#endif
+
void *fault_data;
union {
struct {
@@ -234,6 +242,16 @@ struct iommu_domain {
};
};
+static inline void iommu_domain_set_sw_msi(
+ struct iommu_domain *domain,
+ int (*sw_msi)(struct iommu_domain *domain, struct msi_desc *desc,
+ phys_addr_t msi_addr))
+{
+#if IS_ENABLED(CONFIG_IRQ_MSI_IOMMU)
+ domain->sw_msi = sw_msi;
+#endif
+}
+
static inline bool iommu_is_dma_domain(struct iommu_domain *domain)
{
return domain->type & __IOMMU_DOMAIN_DMA_API;
@@ -1475,6 +1493,18 @@ static inline ioasid_t iommu_alloc_global_pasid(struct device *dev)
static inline void iommu_free_global_pasid(ioasid_t pasid) {}
#endif /* CONFIG_IOMMU_API */
+#ifdef CONFIG_IRQ_MSI_IOMMU
+#ifdef CONFIG_IOMMU_API
+int iommu_dma_prepare_msi(struct msi_desc *desc, phys_addr_t msi_addr);
+#else
+static inline int iommu_dma_prepare_msi(struct msi_desc *desc,
+ phys_addr_t msi_addr)
+{
+ return 0;
+}
+#endif /* CONFIG_IOMMU_API */
+#endif /* CONFIG_IRQ_MSI_IOMMU */
+
#if IS_ENABLED(CONFIG_LOCKDEP) && IS_ENABLED(CONFIG_IOMMU_API)
void iommu_group_mutex_assert(struct device *dev);
#else
@@ -1508,26 +1538,12 @@ static inline void iommu_debugfs_setup(void) {}
#endif
#ifdef CONFIG_IOMMU_DMA
-#include <linux/msi.h>
-
int iommu_get_msi_cookie(struct iommu_domain *domain, dma_addr_t base);
-
-int iommu_dma_prepare_msi(struct msi_desc *desc, phys_addr_t msi_addr);
-
#else /* CONFIG_IOMMU_DMA */
-
-struct msi_desc;
-struct msi_msg;
-
static inline int iommu_get_msi_cookie(struct iommu_domain *domain, dma_addr_t base)
{
return -ENODEV;
}
-
-static inline int iommu_dma_prepare_msi(struct msi_desc *desc, phys_addr_t msi_addr)
-{
- return 0;
-}
#endif /* CONFIG_IOMMU_DMA */
/*
diff --git a/drivers/iommu/dma-iommu.c b/drivers/iommu/dma-iommu.c
index bf91e014d179..3b58244e6344 100644
--- a/drivers/iommu/dma-iommu.c
+++ b/drivers/iommu/dma-iommu.c
@@ -24,6 +24,7 @@
#include <linux/memremap.h>
#include <linux/mm.h>
#include <linux/mutex.h>
+#include <linux/msi.h>
#include <linux/of_iommu.h>
#include <linux/pci.h>
#include <linux/scatterlist.h>
@@ -102,6 +103,9 @@ static int __init iommu_dma_forcedac_setup(char *str)
}
early_param("iommu.forcedac", iommu_dma_forcedac_setup);
+static int iommu_dma_sw_msi(struct iommu_domain *domain, struct msi_desc *desc,
+ phys_addr_t msi_addr);
+
/* Number of entries per flush queue */
#define IOVA_DEFAULT_FQ_SIZE 256
#define IOVA_SINGLE_FQ_SIZE 32768
@@ -398,6 +402,7 @@ int iommu_get_dma_cookie(struct iommu_domain *domain)
return -ENOMEM;
mutex_init(&domain->iova_cookie->mutex);
+ iommu_domain_set_sw_msi(domain, iommu_dma_sw_msi);
return 0;
}
@@ -429,6 +434,7 @@ int iommu_get_msi_cookie(struct iommu_domain *domain, dma_addr_t base)
cookie->msi_iova = base;
domain->iova_cookie = cookie;
+ iommu_domain_set_sw_msi(domain, iommu_dma_sw_msi);
return 0;
}
EXPORT_SYMBOL(iommu_get_msi_cookie);
@@ -443,6 +449,9 @@ void iommu_put_dma_cookie(struct iommu_domain *domain)
struct iommu_dma_cookie *cookie = domain->iova_cookie;
struct iommu_dma_msi_page *msi, *tmp;
+ if (domain->sw_msi != iommu_dma_sw_msi)
+ return;
+
if (!cookie)
return;
@@ -1800,33 +1809,19 @@ static struct iommu_dma_msi_page *iommu_dma_get_msi_page(struct device *dev,
return NULL;
}
-/**
- * iommu_dma_prepare_msi() - Map the MSI page in the IOMMU domain
- * @desc: MSI descriptor, will store the MSI page
- * @msi_addr: MSI target address to be mapped
- *
- * Return: 0 on success or negative error code if the mapping failed.
- */
-int iommu_dma_prepare_msi(struct msi_desc *desc, phys_addr_t msi_addr)
+static int iommu_dma_sw_msi(struct iommu_domain *domain, struct msi_desc *desc,
+ phys_addr_t msi_addr)
{
struct device *dev = msi_desc_to_dev(desc);
- struct iommu_domain *domain = iommu_get_domain_for_dev(dev);
- struct iommu_dma_msi_page *msi_page;
- static DEFINE_MUTEX(msi_prepare_lock); /* see below */
+ const struct iommu_dma_msi_page *msi_page;
- if (!domain || !domain->iova_cookie) {
+ if (!domain->iova_cookie) {
msi_desc_set_iommu_msi_iova(desc, 0, 0);
return 0;
}
- /*
- * In fact the whole prepare operation should already be serialised by
- * irq_domain_mutex further up the callchain, but that's pretty subtle
- * on its own, so consider this locking as failsafe documentation...
- */
- mutex_lock(&msi_prepare_lock);
+ iommu_group_mutex_assert(dev);
msi_page = iommu_dma_get_msi_page(dev, msi_addr, domain);
- mutex_unlock(&msi_prepare_lock);
if (!msi_page)
return -ENOMEM;
diff --git a/drivers/iommu/iommu.c b/drivers/iommu/iommu.c
index 599030e1e890..fbbbcdba8a4f 100644
--- a/drivers/iommu/iommu.c
+++ b/drivers/iommu/iommu.c
@@ -3587,3 +3587,32 @@ int iommu_replace_group_handle(struct iommu_group *group,
return ret;
}
EXPORT_SYMBOL_NS_GPL(iommu_replace_group_handle, "IOMMUFD_INTERNAL");
+
+#if IS_ENABLED(CONFIG_IRQ_MSI_IOMMU)
+/**
+ * iommu_dma_prepare_msi() - Map the MSI page in the IOMMU domain
+ * @desc: MSI descriptor, will store the MSI page
+ * @msi_addr: MSI target address to be mapped
+ *
+ * The implementation of sw_msi() should take msi_addr and map it to
+ * an IOVA in the domain and call msi_desc_set_iommu_msi_iova() with the
+ * mapping information.
+ *
+ * Return: 0 on success or negative error code if the mapping failed.
+ */
+int iommu_dma_prepare_msi(struct msi_desc *desc, phys_addr_t msi_addr)
+{
+ struct device *dev = msi_desc_to_dev(desc);
+ struct iommu_group *group = dev->iommu_group;
+ int ret = 0;
+
+ if (!group)
+ return 0;
+
+ mutex_lock(&group->mutex);
+ if (group->domain && group->domain->sw_msi)
+ ret = group->domain->sw_msi(group->domain, desc, msi_addr);
+ mutex_unlock(&group->mutex);
+ return ret;
+}
+#endif /* CONFIG_IRQ_MSI_IOMMU */
--
2.43.0
* [PATCH RFCv2 04/13] irqchip: Have CONFIG_IRQ_MSI_IOMMU be selected by the irqchips that need it
2025-01-11 3:32 [PATCH RFCv2 00/13] iommu: Add MSI mapping support with nested SMMU Nicolin Chen
` (2 preceding siblings ...)
2025-01-11 3:32 ` [PATCH RFCv2 03/13] iommu: Make iommu_dma_prepare_msi() into a generic operation Nicolin Chen
@ 2025-01-11 3:32 ` Nicolin Chen
2025-01-11 3:32 ` [PATCH RFCv2 05/13] iommu: Turn fault_data to iommufd private pointer Nicolin Chen
` (11 subsequent siblings)
15 siblings, 0 replies; 64+ messages in thread
From: Nicolin Chen @ 2025-01-11 3:32 UTC (permalink / raw)
To: will, robin.murphy, jgg, kevin.tian, tglx, maz, alex.williamson
Cc: joro, shuah, reinette.chatre, eric.auger, yebin10, apatel,
shivamurthy.shastri, bhelgaas, anna-maria, yury.norov,
nipun.gupta, iommu, linux-kernel, linux-arm-kernel, kvm,
linux-kselftest, patches, jean-philippe, mdf, mshavit,
shameerali.kolothum.thodi, smostafa, ddutile
From: Jason Gunthorpe <jgg@nvidia.com>
Currently IRQ_MSI_IOMMU is selected if DMA_IOMMU is available to provide
an implementation for iommu_dma_prepare/compose_msi_msg(). However, it
makes more sense for the irqchips that call prepare/compose to select it,
so that only they trigger the additional code and data to be compiled into
the kernel.
If IRQ_MSI_IOMMU is selected with no IOMMU side implementation then
prepare/compose will be NOP stubs.
If IRQ_MSI_IOMMU is not selected by an irqchip then the related code on
the iommu side is compiled out.
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Signed-off-by: Nicolin Chen <nicolinc@nvidia.com>
---
drivers/iommu/Kconfig | 1 -
drivers/irqchip/Kconfig | 4 ++++
kernel/irq/Kconfig | 1 +
3 files changed, 5 insertions(+), 1 deletion(-)
diff --git a/drivers/iommu/Kconfig b/drivers/iommu/Kconfig
index 47c46e4b739e..b9a4f0521d71 100644
--- a/drivers/iommu/Kconfig
+++ b/drivers/iommu/Kconfig
@@ -154,7 +154,6 @@ config IOMMU_DMA
select DMA_OPS_HELPERS
select IOMMU_API
select IOMMU_IOVA
- select IRQ_MSI_IOMMU
select NEED_SG_DMA_LENGTH
select NEED_SG_DMA_FLAGS if SWIOTLB
diff --git a/drivers/irqchip/Kconfig b/drivers/irqchip/Kconfig
index 9bee02db1643..758cbbf907aa 100644
--- a/drivers/irqchip/Kconfig
+++ b/drivers/irqchip/Kconfig
@@ -28,6 +28,7 @@ config ARM_GIC_V2M
select ARM_GIC
select IRQ_MSI_LIB
select PCI_MSI
+ select IRQ_MSI_IOMMU
config GIC_NON_BANKED
bool
@@ -38,12 +39,14 @@ config ARM_GIC_V3
select PARTITION_PERCPU
select GENERIC_IRQ_EFFECTIVE_AFF_MASK if SMP
select HAVE_ARM_SMCCC_DISCOVERY
+ select IRQ_MSI_IOMMU
config ARM_GIC_V3_ITS
bool
select GENERIC_MSI_IRQ
select IRQ_MSI_LIB
default ARM_GIC_V3
+ select IRQ_MSI_IOMMU
config ARM_GIC_V3_ITS_FSL_MC
bool
@@ -407,6 +410,7 @@ config LS_EXTIRQ
config LS_SCFG_MSI
def_bool y if SOC_LS1021A || ARCH_LAYERSCAPE
+ select IRQ_MSI_IOMMU
depends on PCI_MSI
config PARTITION_PERCPU
diff --git a/kernel/irq/Kconfig b/kernel/irq/Kconfig
index 529adb1f5859..045b37073d4a 100644
--- a/kernel/irq/Kconfig
+++ b/kernel/irq/Kconfig
@@ -96,6 +96,7 @@ config GENERIC_MSI_IRQ
bool
select IRQ_DOMAIN_HIERARCHY
+# irqchip drivers should select this if they call iommu_dma_prepare_msi()
config IRQ_MSI_IOMMU
bool
--
2.43.0
^ permalink raw reply related [flat|nested] 64+ messages in thread
* [PATCH RFCv2 05/13] iommu: Turn fault_data to iommufd private pointer
2025-01-11 3:32 [PATCH RFCv2 00/13] iommu: Add MSI mapping support with nested SMMU Nicolin Chen
` (3 preceding siblings ...)
2025-01-11 3:32 ` [PATCH RFCv2 04/13] irqchip: Have CONFIG_IRQ_MSI_IOMMU be selected by the irqchips that need it Nicolin Chen
@ 2025-01-11 3:32 ` Nicolin Chen
2025-01-23 9:54 ` Tian, Kevin
2025-01-29 12:40 ` Eric Auger
2025-01-11 3:32 ` [PATCH RFCv2 06/13] iommufd: Make attach_handle generic Nicolin Chen
` (10 subsequent siblings)
15 siblings, 2 replies; 64+ messages in thread
From: Nicolin Chen @ 2025-01-11 3:32 UTC (permalink / raw)
To: will, robin.murphy, jgg, kevin.tian, tglx, maz, alex.williamson
Cc: joro, shuah, reinette.chatre, eric.auger, yebin10, apatel,
shivamurthy.shastri, bhelgaas, anna-maria, yury.norov,
nipun.gupta, iommu, linux-kernel, linux-arm-kernel, kvm,
linux-kselftest, patches, jean-philippe, mdf, mshavit,
shameerali.kolothum.thodi, smostafa, ddutile
A "fault_data" was added exclusively for the iommufd_fault_iopf_handler()
used by IOPF/PRI use cases, along with the attach_handle. Now, the iommufd
version of the sw_msi function will reuse the attach_handle and fault_data
for a non-fault case.
Rename "fault_data" to "iommufd_hwpt" so as not to confine it to a "fault"
case. Move it into a union to be the iommufd private pointer. A following
patch will move the iova_cookie to the union for dma-iommu too, after the
iommufd_sw_msi implementation is added.
Since we have two unions now, add some simple comments for readability.
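The owner-private pointer pattern this introduces can be sketched in plain C (a userspace illustration under assumed names, not kernel code): the first anonymous union holds a single pointer that only the domain's owner interprets, so iommufd can stash its hw_pagetable there without a dedicated "fault_data" field.

```c
#include <assert.h>
#include <stddef.h>

struct hw_pagetable { int id; };

/* Simplified model of struct iommu_domain after this patch: one union for
 * the owner's private pointer, one for the fault-handler state. */
struct domain {
	union { /* Pointer usable by owner of the domain */
		struct hw_pagetable *iommufd_hwpt; /* iommufd */
	};
	union { /* Fault handler */
		void *handler_token;
	};
};
```

Because the members live in a union, adding more owner-specific pointers later (as the following dma-iommu patch does with iova_cookie) costs no extra space in the domain.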
Signed-off-by: Nicolin Chen <nicolinc@nvidia.com>
---
include/linux/iommu.h | 6 ++++--
drivers/iommu/iommufd/fault.c | 2 +-
drivers/iommu/iommufd/hw_pagetable.c | 2 +-
3 files changed, 6 insertions(+), 4 deletions(-)
diff --git a/include/linux/iommu.h b/include/linux/iommu.h
index 423fdfa6b3bb..b6526d734f30 100644
--- a/include/linux/iommu.h
+++ b/include/linux/iommu.h
@@ -224,8 +224,10 @@ struct iommu_domain {
phys_addr_t msi_addr);
#endif
- void *fault_data;
- union {
+ union { /* Pointer usable by owner of the domain */
+ struct iommufd_hw_pagetable *iommufd_hwpt; /* iommufd */
+ };
+ union { /* Fault handler */
struct {
iommu_fault_handler_t handler;
void *handler_token;
diff --git a/drivers/iommu/iommufd/fault.c b/drivers/iommu/iommufd/fault.c
index 1fe804e28a86..06aa83a75e94 100644
--- a/drivers/iommu/iommufd/fault.c
+++ b/drivers/iommu/iommufd/fault.c
@@ -431,7 +431,7 @@ int iommufd_fault_iopf_handler(struct iopf_group *group)
struct iommufd_hw_pagetable *hwpt;
struct iommufd_fault *fault;
- hwpt = group->attach_handle->domain->fault_data;
+ hwpt = group->attach_handle->domain->iommufd_hwpt;
fault = hwpt->fault;
mutex_lock(&fault->mutex);
diff --git a/drivers/iommu/iommufd/hw_pagetable.c b/drivers/iommu/iommufd/hw_pagetable.c
index ce03c3804651..f7c0d7b214b6 100644
--- a/drivers/iommu/iommufd/hw_pagetable.c
+++ b/drivers/iommu/iommufd/hw_pagetable.c
@@ -402,10 +402,10 @@ int iommufd_hwpt_alloc(struct iommufd_ucmd *ucmd)
}
hwpt->fault = fault;
hwpt->domain->iopf_handler = iommufd_fault_iopf_handler;
- hwpt->domain->fault_data = hwpt;
refcount_inc(&fault->obj.users);
iommufd_put_object(ucmd->ictx, &fault->obj);
}
+ hwpt->domain->iommufd_hwpt = hwpt;
cmd->out_hwpt_id = hwpt->obj.id;
rc = iommufd_ucmd_respond(ucmd, sizeof(*cmd));
--
2.43.0
^ permalink raw reply related [flat|nested] 64+ messages in thread
* [PATCH RFCv2 06/13] iommufd: Make attach_handle generic
2025-01-11 3:32 [PATCH RFCv2 00/13] iommu: Add MSI mapping support with nested SMMU Nicolin Chen
` (4 preceding siblings ...)
2025-01-11 3:32 ` [PATCH RFCv2 05/13] iommu: Turn fault_data to iommufd private pointer Nicolin Chen
@ 2025-01-11 3:32 ` Nicolin Chen
2025-01-18 8:23 ` Yi Liu
2025-01-29 13:14 ` Eric Auger
2025-01-11 3:32 ` [PATCH RFCv2 07/13] iommufd: Implement sw_msi support natively Nicolin Chen
` (9 subsequent siblings)
15 siblings, 2 replies; 64+ messages in thread
From: Nicolin Chen @ 2025-01-11 3:32 UTC (permalink / raw)
To: will, robin.murphy, jgg, kevin.tian, tglx, maz, alex.williamson
Cc: joro, shuah, reinette.chatre, eric.auger, yebin10, apatel,
shivamurthy.shastri, bhelgaas, anna-maria, yury.norov,
nipun.gupta, iommu, linux-kernel, linux-arm-kernel, kvm,
linux-kselftest, patches, jean-philippe, mdf, mshavit,
shameerali.kolothum.thodi, smostafa, ddutile
"attach_handle" was added exclusively for the iommufd_fault_iopf_handler()
used by IOPF/PRI use cases, along with the "fault_data". Now, the iommufd
version of the sw_msi function will reuse the attach_handle and fault_data
for a non-fault case.
Move the attach_handle part out of the fault.c file to make it generic for
all cases. Simplify the remaining fault specific routine to attach/detach.
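The enable/disable decisions that the new enable_iopf/disable_iopf flags encode in the replace path can be condensed to the following behavioral sketch (error handling and the handle bookkeeping omitted; this only illustrates the toggle logic, not the real functions):

```c
#include <assert.h>
#include <stdbool.h>

/* 1 when IOPF is enabled on the device, as iommufd_fault_iopf_enable()
 * and iommufd_fault_iopf_disable() would leave it. */
static int iopf_enabled;

/* Replacing old -> new: IOPF is enabled only when the new domain needs
 * faults and the old one did not, and disabled only when the old needed
 * faults and the new does not. A fault -> fault replace touches nothing. */
static void replace(bool old_fault, bool new_fault)
{
	if (new_fault && !old_fault)
		iopf_enabled = 1; /* iommufd_fault_iopf_enable() */
	if (old_fault && !new_fault)
		iopf_enabled = 0; /* iommufd_fault_iopf_disable() */
}
```

This is why the generic attach/replace helpers in device.c can pass `!old->fault` and `!hwpt->fault` down to the remaining fault-specific routines.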
Signed-off-by: Nicolin Chen <nicolinc@nvidia.com>
---
drivers/iommu/iommufd/iommufd_private.h | 40 +-------
drivers/iommu/iommufd/device.c | 105 +++++++++++++++++++++
drivers/iommu/iommufd/fault.c | 120 +++---------------------
3 files changed, 122 insertions(+), 143 deletions(-)
diff --git a/drivers/iommu/iommufd/iommufd_private.h b/drivers/iommu/iommufd/iommufd_private.h
index b6d706cf2c66..063c0a42f54f 100644
--- a/drivers/iommu/iommufd/iommufd_private.h
+++ b/drivers/iommu/iommufd/iommufd_private.h
@@ -472,42 +472,12 @@ void iommufd_fault_destroy(struct iommufd_object *obj);
int iommufd_fault_iopf_handler(struct iopf_group *group);
int iommufd_fault_domain_attach_dev(struct iommufd_hw_pagetable *hwpt,
- struct iommufd_device *idev);
+ struct iommufd_device *idev,
+ bool enable_iopf);
void iommufd_fault_domain_detach_dev(struct iommufd_hw_pagetable *hwpt,
- struct iommufd_device *idev);
-int iommufd_fault_domain_replace_dev(struct iommufd_device *idev,
- struct iommufd_hw_pagetable *hwpt,
- struct iommufd_hw_pagetable *old);
-
-static inline int iommufd_hwpt_attach_device(struct iommufd_hw_pagetable *hwpt,
- struct iommufd_device *idev)
-{
- if (hwpt->fault)
- return iommufd_fault_domain_attach_dev(hwpt, idev);
-
- return iommu_attach_group(hwpt->domain, idev->igroup->group);
-}
-
-static inline void iommufd_hwpt_detach_device(struct iommufd_hw_pagetable *hwpt,
- struct iommufd_device *idev)
-{
- if (hwpt->fault) {
- iommufd_fault_domain_detach_dev(hwpt, idev);
- return;
- }
-
- iommu_detach_group(hwpt->domain, idev->igroup->group);
-}
-
-static inline int iommufd_hwpt_replace_device(struct iommufd_device *idev,
- struct iommufd_hw_pagetable *hwpt,
- struct iommufd_hw_pagetable *old)
-{
- if (old->fault || hwpt->fault)
- return iommufd_fault_domain_replace_dev(idev, hwpt, old);
-
- return iommu_group_replace_domain(idev->igroup->group, hwpt->domain);
-}
+ struct iommufd_device *idev,
+ struct iommufd_attach_handle *handle,
+ bool disable_iopf);
static inline struct iommufd_viommu *
iommufd_get_viommu(struct iommufd_ucmd *ucmd, u32 id)
diff --git a/drivers/iommu/iommufd/device.c b/drivers/iommu/iommufd/device.c
index dfd0898fb6c1..38b31b652147 100644
--- a/drivers/iommu/iommufd/device.c
+++ b/drivers/iommu/iommufd/device.c
@@ -352,6 +352,111 @@ iommufd_device_attach_reserved_iova(struct iommufd_device *idev,
return 0;
}
+/* The device attach/detach/replace helpers for attach_handle */
+
+static int iommufd_hwpt_attach_device(struct iommufd_hw_pagetable *hwpt,
+ struct iommufd_device *idev)
+{
+ struct iommufd_attach_handle *handle;
+ int rc;
+
+ if (hwpt->fault) {
+ rc = iommufd_fault_domain_attach_dev(hwpt, idev, true);
+ if (rc)
+ return rc;
+ }
+
+ handle = kzalloc(sizeof(*handle), GFP_KERNEL);
+ if (!handle) {
+ rc = -ENOMEM;
+ goto out_fault_detach;
+ }
+
+ handle->idev = idev;
+ rc = iommu_attach_group_handle(hwpt->domain, idev->igroup->group,
+ &handle->handle);
+ if (rc)
+ goto out_free_handle;
+
+ return 0;
+
+out_free_handle:
+ kfree(handle);
+ handle = NULL;
+out_fault_detach:
+ if (hwpt->fault)
+ iommufd_fault_domain_detach_dev(hwpt, idev, handle, true);
+ return rc;
+}
+
+static struct iommufd_attach_handle *
+iommufd_device_get_attach_handle(struct iommufd_device *idev)
+{
+ struct iommu_attach_handle *handle;
+
+ handle =
+ iommu_attach_handle_get(idev->igroup->group, IOMMU_NO_PASID, 0);
+ if (IS_ERR(handle))
+ return NULL;
+ return to_iommufd_handle(handle);
+}
+
+static void iommufd_hwpt_detach_device(struct iommufd_hw_pagetable *hwpt,
+ struct iommufd_device *idev)
+{
+ struct iommufd_attach_handle *handle;
+
+ handle = iommufd_device_get_attach_handle(idev);
+ iommu_detach_group_handle(hwpt->domain, idev->igroup->group);
+ if (hwpt->fault)
+ iommufd_fault_domain_detach_dev(hwpt, idev, handle, true);
+ kfree(handle);
+}
+
+static int iommufd_hwpt_replace_device(struct iommufd_device *idev,
+ struct iommufd_hw_pagetable *hwpt,
+ struct iommufd_hw_pagetable *old)
+{
+ struct iommufd_attach_handle *old_handle =
+ iommufd_device_get_attach_handle(idev);
+ struct iommufd_attach_handle *handle;
+ int rc;
+
+ if (hwpt->fault) {
+ rc = iommufd_fault_domain_attach_dev(hwpt, idev, !old->fault);
+ if (rc)
+ return rc;
+ }
+
+ handle = kzalloc(sizeof(*handle), GFP_KERNEL);
+ if (!handle) {
+ rc = -ENOMEM;
+ goto out_fault_detach;
+ }
+
+ handle->idev = idev;
+ rc = iommu_replace_group_handle(idev->igroup->group, hwpt->domain,
+ &handle->handle);
+ if (rc)
+ goto out_free_handle;
+
+ if (old->fault)
+ iommufd_fault_domain_detach_dev(old, idev, old_handle,
+ !hwpt->fault);
+ kfree(old_handle);
+
+ return 0;
+
+out_free_handle:
+ kfree(handle);
+ handle = NULL;
+out_fault_detach:
+ if (hwpt->fault)
+ iommufd_fault_domain_detach_dev(hwpt, idev, handle,
+ !old->fault);
+ return rc;
+}
+
int iommufd_hw_pagetable_attach(struct iommufd_hw_pagetable *hwpt,
struct iommufd_device *idev)
{
diff --git a/drivers/iommu/iommufd/fault.c b/drivers/iommu/iommufd/fault.c
index 06aa83a75e94..1d9bd3024b57 100644
--- a/drivers/iommu/iommufd/fault.c
+++ b/drivers/iommu/iommufd/fault.c
@@ -60,42 +60,17 @@ static void iommufd_fault_iopf_disable(struct iommufd_device *idev)
mutex_unlock(&idev->iopf_lock);
}
-static int __fault_domain_attach_dev(struct iommufd_hw_pagetable *hwpt,
- struct iommufd_device *idev)
-{
- struct iommufd_attach_handle *handle;
- int ret;
-
- handle = kzalloc(sizeof(*handle), GFP_KERNEL);
- if (!handle)
- return -ENOMEM;
-
- handle->idev = idev;
- ret = iommu_attach_group_handle(hwpt->domain, idev->igroup->group,
- &handle->handle);
- if (ret)
- kfree(handle);
-
- return ret;
-}
-
int iommufd_fault_domain_attach_dev(struct iommufd_hw_pagetable *hwpt,
- struct iommufd_device *idev)
+ struct iommufd_device *idev,
+ bool enable_iopf)
{
- int ret;
+ int rc = 0;
if (!hwpt->fault)
return -EINVAL;
-
- ret = iommufd_fault_iopf_enable(idev);
- if (ret)
- return ret;
-
- ret = __fault_domain_attach_dev(hwpt, idev);
- if (ret)
- iommufd_fault_iopf_disable(idev);
-
- return ret;
+ if (enable_iopf)
+ rc = iommufd_fault_iopf_enable(idev);
+ return rc;
}
static void iommufd_auto_response_faults(struct iommufd_hw_pagetable *hwpt,
@@ -127,86 +102,15 @@ static void iommufd_auto_response_faults(struct iommufd_hw_pagetable *hwpt,
mutex_unlock(&fault->mutex);
}
-static struct iommufd_attach_handle *
-iommufd_device_get_attach_handle(struct iommufd_device *idev)
-{
- struct iommu_attach_handle *handle;
-
- handle = iommu_attach_handle_get(idev->igroup->group, IOMMU_NO_PASID, 0);
- if (IS_ERR(handle))
- return NULL;
-
- return to_iommufd_handle(handle);
-}
-
void iommufd_fault_domain_detach_dev(struct iommufd_hw_pagetable *hwpt,
- struct iommufd_device *idev)
+ struct iommufd_device *idev,
+ struct iommufd_attach_handle *handle,
+ bool disable_iopf)
{
- struct iommufd_attach_handle *handle;
-
- handle = iommufd_device_get_attach_handle(idev);
- iommu_detach_group_handle(hwpt->domain, idev->igroup->group);
- iommufd_auto_response_faults(hwpt, handle);
- iommufd_fault_iopf_disable(idev);
- kfree(handle);
-}
-
-static int __fault_domain_replace_dev(struct iommufd_device *idev,
- struct iommufd_hw_pagetable *hwpt,
- struct iommufd_hw_pagetable *old)
-{
- struct iommufd_attach_handle *handle, *curr = NULL;
- int ret;
-
- if (old->fault)
- curr = iommufd_device_get_attach_handle(idev);
-
- if (hwpt->fault) {
- handle = kzalloc(sizeof(*handle), GFP_KERNEL);
- if (!handle)
- return -ENOMEM;
-
- handle->idev = idev;
- ret = iommu_replace_group_handle(idev->igroup->group,
- hwpt->domain, &handle->handle);
- } else {
- ret = iommu_replace_group_handle(idev->igroup->group,
- hwpt->domain, NULL);
- }
-
- if (!ret && curr) {
- iommufd_auto_response_faults(old, curr);
- kfree(curr);
- }
-
- return ret;
-}
-
-int iommufd_fault_domain_replace_dev(struct iommufd_device *idev,
- struct iommufd_hw_pagetable *hwpt,
- struct iommufd_hw_pagetable *old)
-{
- bool iopf_off = !hwpt->fault && old->fault;
- bool iopf_on = hwpt->fault && !old->fault;
- int ret;
-
- if (iopf_on) {
- ret = iommufd_fault_iopf_enable(idev);
- if (ret)
- return ret;
- }
-
- ret = __fault_domain_replace_dev(idev, hwpt, old);
- if (ret) {
- if (iopf_on)
- iommufd_fault_iopf_disable(idev);
- return ret;
- }
-
- if (iopf_off)
+ if (handle)
+ iommufd_auto_response_faults(hwpt, handle);
+ if (disable_iopf)
iommufd_fault_iopf_disable(idev);
-
- return 0;
}
void iommufd_fault_destroy(struct iommufd_object *obj)
--
2.43.0
^ permalink raw reply related [flat|nested] 64+ messages in thread
* [PATCH RFCv2 07/13] iommufd: Implement sw_msi support natively
2025-01-11 3:32 [PATCH RFCv2 00/13] iommu: Add MSI mapping support with nested SMMU Nicolin Chen
` (5 preceding siblings ...)
2025-01-11 3:32 ` [PATCH RFCv2 06/13] iommufd: Make attach_handle generic Nicolin Chen
@ 2025-01-11 3:32 ` Nicolin Chen
2025-01-15 4:21 ` Yury Norov
2025-01-23 19:30 ` Jason Gunthorpe
2025-01-11 3:32 ` [PATCH RFCv2 08/13] iommu: Turn iova_cookie to dma-iommu private pointer Nicolin Chen
` (8 subsequent siblings)
15 siblings, 2 replies; 64+ messages in thread
From: Nicolin Chen @ 2025-01-11 3:32 UTC (permalink / raw)
To: will, robin.murphy, jgg, kevin.tian, tglx, maz, alex.williamson
Cc: joro, shuah, reinette.chatre, eric.auger, yebin10, apatel,
shivamurthy.shastri, bhelgaas, anna-maria, yury.norov,
nipun.gupta, iommu, linux-kernel, linux-arm-kernel, kvm,
linux-kselftest, patches, jean-philippe, mdf, mshavit,
shameerali.kolothum.thodi, smostafa, ddutile
From: Jason Gunthorpe <jgg@nvidia.com>
iommufd has a model where the iommu_domain can be changed while the VFIO
device is attached. In this case the MSI should continue to work. This
corner case has not worked because the dma-iommu implementation of sw_msi
is tied to a single domain.
Implement the sw_msi mapping directly and use a global per-fd table to
associate the assigned IOVA with the MSI pages. This allows the MSI pages
to be loaded into a domain before it is attached, ensuring that MSI is not
disrupted.
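The per-fd table can be illustrated with a minimal userspace model (names and the fixed-size array are hypothetical simplifications; the kernel code uses a linked list plus an ID bitmap): every unique (sw_msi_start, msi_addr) pair gets one page offset inside the window, and repeated requests for the same doorbell return the same IOVA regardless of which domain asks.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

#define PAGE_SIZE 4096u
#define MAX_MAPS 64

/* Simplified model of struct iommufd_sw_msi_map: one slot per unique
 * (sw_msi_start, msi_addr) pair, shared by all domains on the fd. */
struct sw_msi_map {
	uint64_t sw_msi_start; /* base of the reserved MSI window */
	uint64_t msi_addr;     /* physical MSI doorbell page */
	unsigned int pgoff;    /* page index inside the window */
};

static struct sw_msi_map maps[MAX_MAPS];
static unsigned int nr_maps;

/* Return the IOVA for msi_addr, allocating the next free page offset in
 * the window on first use. Repeated lookups return the same IOVA. */
static uint64_t sw_msi_get_iova(uint64_t sw_msi_start, uint64_t msi_addr)
{
	unsigned int max_pgoff = 0;

	for (unsigned int i = 0; i < nr_maps; i++) {
		if (maps[i].sw_msi_start != sw_msi_start)
			continue;
		if (maps[i].pgoff + 1 > max_pgoff)
			max_pgoff = maps[i].pgoff + 1;
		if (maps[i].msi_addr == msi_addr)
			return sw_msi_start + maps[i].pgoff * PAGE_SIZE;
	}
	if (nr_maps >= MAX_MAPS)
		abort(); /* the kernel returns -EOVERFLOW here */
	maps[nr_maps].sw_msi_start = sw_msi_start;
	maps[nr_maps].msi_addr = msi_addr;
	maps[nr_maps].pgoff = max_pgoff;
	nr_maps++;
	return sw_msi_start + max_pgoff * PAGE_SIZE;
}
```

Because the IOVA depends only on the fd-global table, a newly created domain can have all required MSI pages installed before the device is attached to it.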
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
[nicolinc: set sw_msi pointer in nested hwpt allocators]
Signed-off-by: Nicolin Chen <nicolinc@nvidia.com>
---
drivers/iommu/iommufd/iommufd_private.h | 23 +++-
drivers/iommu/iommufd/device.c | 158 ++++++++++++++++++++----
drivers/iommu/iommufd/hw_pagetable.c | 3 +
drivers/iommu/iommufd/main.c | 9 ++
4 files changed, 170 insertions(+), 23 deletions(-)
diff --git a/drivers/iommu/iommufd/iommufd_private.h b/drivers/iommu/iommufd/iommufd_private.h
index 063c0a42f54f..3e83bbb5912c 100644
--- a/drivers/iommu/iommufd/iommufd_private.h
+++ b/drivers/iommu/iommufd/iommufd_private.h
@@ -19,6 +19,22 @@ struct iommu_group;
struct iommu_option;
struct iommufd_device;
+struct iommufd_sw_msi_map {
+ struct list_head sw_msi_item;
+ phys_addr_t sw_msi_start;
+ phys_addr_t msi_addr;
+ unsigned int pgoff;
+ unsigned int id;
+};
+
+/* Bitmap of struct iommufd_sw_msi_map::id */
+struct iommufd_sw_msi_maps {
+ DECLARE_BITMAP(bitmap, 64);
+};
+
+int iommufd_sw_msi(struct iommu_domain *domain, struct msi_desc *desc,
+ phys_addr_t msi_addr);
+
struct iommufd_ctx {
struct file *file;
struct xarray objects;
@@ -26,6 +42,10 @@ struct iommufd_ctx {
wait_queue_head_t destroy_wait;
struct rw_semaphore ioas_creation_lock;
+ struct mutex sw_msi_lock;
+ struct list_head sw_msi_list;
+ unsigned int sw_msi_id;
+
u8 account_mode;
/* Compatibility with VFIO no iommu */
u8 no_iommu_mode;
@@ -283,10 +303,10 @@ struct iommufd_hwpt_paging {
struct iommufd_ioas *ioas;
bool auto_domain : 1;
bool enforce_cache_coherency : 1;
- bool msi_cookie : 1;
bool nest_parent : 1;
/* Head at iommufd_ioas::hwpt_list */
struct list_head hwpt_item;
+ struct iommufd_sw_msi_maps present_sw_msi;
};
struct iommufd_hwpt_nested {
@@ -383,6 +403,7 @@ struct iommufd_group {
struct iommu_group *group;
struct iommufd_hw_pagetable *hwpt;
struct list_head device_list;
+ struct iommufd_sw_msi_maps required_sw_msi;
phys_addr_t sw_msi_start;
};
diff --git a/drivers/iommu/iommufd/device.c b/drivers/iommu/iommufd/device.c
index 38b31b652147..f75b3c23cd41 100644
--- a/drivers/iommu/iommufd/device.c
+++ b/drivers/iommu/iommufd/device.c
@@ -5,6 +5,7 @@
#include <linux/iommufd.h>
#include <linux/slab.h>
#include <uapi/linux/iommufd.h>
+#include <linux/msi.h>
#include "../iommu-priv.h"
#include "io_pagetable.h"
@@ -293,36 +294,149 @@ u32 iommufd_device_to_id(struct iommufd_device *idev)
}
EXPORT_SYMBOL_NS_GPL(iommufd_device_to_id, "IOMMUFD");
+/*
+ * Get an iommufd_sw_msi_map for the msi physical address requested by the irq
+ * layer. The mapping to IOVA is global to the iommufd file descriptor, every
+ * domain that is attached to a device using the same MSI parameters will use
+ * the same IOVA.
+ */
+static struct iommufd_sw_msi_map *
+iommufd_sw_msi_get_map(struct iommufd_ctx *ictx, phys_addr_t msi_addr,
+ phys_addr_t sw_msi_start)
+{
+ struct iommufd_sw_msi_map *cur;
+ unsigned int max_pgoff = 0;
+
+ lockdep_assert_held(&ictx->sw_msi_lock);
+
+ list_for_each_entry(cur, &ictx->sw_msi_list, sw_msi_item) {
+ if (cur->sw_msi_start != sw_msi_start)
+ continue;
+ max_pgoff = max(max_pgoff, cur->pgoff + 1);
+ if (cur->msi_addr == msi_addr)
+ return cur;
+ }
+
+ if (ictx->sw_msi_id >=
+ BITS_PER_BYTE * sizeof_field(struct iommufd_sw_msi_maps, bitmap))
+ return ERR_PTR(-EOVERFLOW);
+
+ cur = kzalloc(sizeof(*cur), GFP_KERNEL);
+ if (!cur)
+ return ERR_PTR(-ENOMEM);
+ cur->sw_msi_start = sw_msi_start;
+ cur->msi_addr = msi_addr;
+ cur->pgoff = max_pgoff;
+ cur->id = ictx->sw_msi_id++;
+ list_add_tail(&cur->sw_msi_item, &ictx->sw_msi_list);
+ return cur;
+}
+
+static int iommufd_sw_msi_install(struct iommufd_ctx *ictx,
+ struct iommufd_hwpt_paging *hwpt_paging,
+ struct iommufd_sw_msi_map *msi_map)
+{
+ unsigned long iova;
+
+ lockdep_assert_held(&ictx->sw_msi_lock);
+
+ iova = msi_map->sw_msi_start + msi_map->pgoff * PAGE_SIZE;
+ if (!test_bit(msi_map->id, hwpt_paging->present_sw_msi.bitmap)) {
+ int rc;
+
+ rc = iommu_map(hwpt_paging->common.domain, iova,
+ msi_map->msi_addr, PAGE_SIZE,
+ IOMMU_WRITE | IOMMU_READ | IOMMU_MMIO,
+ GFP_KERNEL_ACCOUNT);
+ if (rc)
+ return rc;
+ set_bit(msi_map->id, hwpt_paging->present_sw_msi.bitmap);
+ }
+ return 0;
+}
+
+/*
+ * Called by the irq code if the platform translates the MSI address through the
+ * IOMMU. msi_addr is the physical address of the MSI page. iommufd will
+ * allocate a fd global iova for the physical page that is the same on all
+ * domains and devices.
+ */
+#ifdef CONFIG_IRQ_MSI_IOMMU
+int iommufd_sw_msi(struct iommu_domain *domain, struct msi_desc *desc,
+ phys_addr_t msi_addr)
+{
+ struct device *dev = msi_desc_to_dev(desc);
+ struct iommu_attach_handle *raw_handle;
+ struct iommufd_hwpt_paging *hwpt_paging;
+ struct iommufd_attach_handle *handle;
+ struct iommufd_sw_msi_map *msi_map;
+ struct iommufd_ctx *ictx;
+ unsigned long iova;
+ int rc;
+
+ raw_handle =
+ iommu_attach_handle_get(dev->iommu_group, IOMMU_NO_PASID, 0);
+ if (IS_ERR(raw_handle))
+ return 0;
+ hwpt_paging = find_hwpt_paging(domain->iommufd_hwpt);
+
+ handle = to_iommufd_handle(raw_handle);
+ /* No IOMMU_RESV_SW_MSI means no change to the msi_msg */
+ if (handle->idev->igroup->sw_msi_start == PHYS_ADDR_MAX)
+ return 0;
+
+ ictx = handle->idev->ictx;
+ guard(mutex)(&ictx->sw_msi_lock);
+ /*
+ * The input msi_addr is the exact byte offset of the MSI doorbell, we
+ * assume the caller has checked that it is contained within an MMIO region
+ * that is secure to map at PAGE_SIZE.
+ */
+ msi_map = iommufd_sw_msi_get_map(handle->idev->ictx,
+ msi_addr & PAGE_MASK,
+ handle->idev->igroup->sw_msi_start);
+ if (IS_ERR(msi_map))
+ return PTR_ERR(msi_map);
+
+ rc = iommufd_sw_msi_install(ictx, hwpt_paging, msi_map);
+ if (rc)
+ return rc;
+ set_bit(msi_map->id, handle->idev->igroup->required_sw_msi.bitmap);
+
+ iova = msi_map->sw_msi_start + msi_map->pgoff * PAGE_SIZE;
+ msi_desc_set_iommu_msi_iova(desc, iova, PAGE_SHIFT);
+ return 0;
+}
+#endif
+
+/*
+ * FIXME: when a domain is removed any ids that are not in the union of
+ * all still attached devices should be removed.
+ */
+
static int iommufd_group_setup_msi(struct iommufd_group *igroup,
struct iommufd_hwpt_paging *hwpt_paging)
{
- phys_addr_t sw_msi_start = igroup->sw_msi_start;
- int rc;
+ struct iommufd_ctx *ictx = igroup->ictx;
+ struct iommufd_sw_msi_map *cur;
+
+ if (igroup->sw_msi_start == PHYS_ADDR_MAX)
+ return 0;
/*
- * If the IOMMU driver gives a IOMMU_RESV_SW_MSI then it is asking us to
- * call iommu_get_msi_cookie() on its behalf. This is necessary to setup
- * the MSI window so iommu_dma_prepare_msi() can install pages into our
- * domain after request_irq(). If it is not done interrupts will not
- * work on this domain.
- *
- * FIXME: This is conceptually broken for iommufd since we want to allow
- * userspace to change the domains, eg switch from an identity IOAS to a
- * DMA IOAS. There is currently no way to create a MSI window that
- * matches what the IRQ layer actually expects in a newly created
- * domain.
+ * Install all the MSI pages the device has been using into the domain
*/
- if (sw_msi_start != PHYS_ADDR_MAX && !hwpt_paging->msi_cookie) {
- rc = iommu_get_msi_cookie(hwpt_paging->common.domain,
- sw_msi_start);
+ guard(mutex)(&ictx->sw_msi_lock);
+ list_for_each_entry(cur, &ictx->sw_msi_list, sw_msi_item) {
+ int rc;
+
+ if (cur->sw_msi_start != igroup->sw_msi_start ||
+ !test_bit(cur->id, igroup->required_sw_msi.bitmap))
+ continue;
+
+ rc = iommufd_sw_msi_install(ictx, hwpt_paging, cur);
if (rc)
return rc;
-
- /*
- * iommu_get_msi_cookie() can only be called once per domain,
- * it returns -EBUSY on later calls.
- */
- hwpt_paging->msi_cookie = true;
}
return 0;
}
diff --git a/drivers/iommu/iommufd/hw_pagetable.c b/drivers/iommu/iommufd/hw_pagetable.c
index f7c0d7b214b6..538484eecb3b 100644
--- a/drivers/iommu/iommufd/hw_pagetable.c
+++ b/drivers/iommu/iommufd/hw_pagetable.c
@@ -156,6 +156,7 @@ iommufd_hwpt_paging_alloc(struct iommufd_ctx *ictx, struct iommufd_ioas *ioas,
goto out_abort;
}
}
+ iommu_domain_set_sw_msi(hwpt->domain, iommufd_sw_msi);
/*
* Set the coherency mode before we do iopt_table_add_domain() as some
@@ -251,6 +252,7 @@ iommufd_hwpt_nested_alloc(struct iommufd_ctx *ictx,
goto out_abort;
}
hwpt->domain->owner = ops;
+ iommu_domain_set_sw_msi(hwpt->domain, iommufd_sw_msi);
if (WARN_ON_ONCE(hwpt->domain->type != IOMMU_DOMAIN_NESTED)) {
rc = -EINVAL;
@@ -303,6 +305,7 @@ iommufd_viommu_alloc_hwpt_nested(struct iommufd_viommu *viommu, u32 flags,
goto out_abort;
}
hwpt->domain->owner = viommu->iommu_dev->ops;
+ iommu_domain_set_sw_msi(hwpt->domain, iommufd_sw_msi);
if (WARN_ON_ONCE(hwpt->domain->type != IOMMU_DOMAIN_NESTED)) {
rc = -EINVAL;
diff --git a/drivers/iommu/iommufd/main.c b/drivers/iommu/iommufd/main.c
index 97c5e3567d33..7cc9497b7193 100644
--- a/drivers/iommu/iommufd/main.c
+++ b/drivers/iommu/iommufd/main.c
@@ -227,6 +227,8 @@ static int iommufd_fops_open(struct inode *inode, struct file *filp)
xa_init(&ictx->groups);
ictx->file = filp;
init_waitqueue_head(&ictx->destroy_wait);
+ mutex_init(&ictx->sw_msi_lock);
+ INIT_LIST_HEAD(&ictx->sw_msi_list);
filp->private_data = ictx;
return 0;
}
@@ -234,6 +236,8 @@ static int iommufd_fops_open(struct inode *inode, struct file *filp)
static int iommufd_fops_release(struct inode *inode, struct file *filp)
{
struct iommufd_ctx *ictx = filp->private_data;
+ struct iommufd_sw_msi_map *next;
+ struct iommufd_sw_msi_map *cur;
struct iommufd_object *obj;
/*
@@ -262,6 +266,11 @@ static int iommufd_fops_release(struct inode *inode, struct file *filp)
break;
}
WARN_ON(!xa_empty(&ictx->groups));
+
+ mutex_destroy(&ictx->sw_msi_lock);
+ list_for_each_entry_safe(cur, next, &ictx->sw_msi_list, sw_msi_item)
+ kfree(cur);
+
kfree(ictx);
return 0;
}
--
2.43.0
^ permalink raw reply related [flat|nested] 64+ messages in thread
* [PATCH RFCv2 08/13] iommu: Turn iova_cookie to dma-iommu private pointer
2025-01-11 3:32 [PATCH RFCv2 00/13] iommu: Add MSI mapping support with nested SMMU Nicolin Chen
` (6 preceding siblings ...)
2025-01-11 3:32 ` [PATCH RFCv2 07/13] iommufd: Implement sw_msi support natively Nicolin Chen
@ 2025-01-11 3:32 ` Nicolin Chen
2025-01-13 16:40 ` Jason Gunthorpe
2025-01-11 3:32 ` [PATCH RFCv2 09/13] iommufd: Add IOMMU_OPTION_SW_MSI_START/SIZE ioctls Nicolin Chen
` (7 subsequent siblings)
15 siblings, 1 reply; 64+ messages in thread
From: Nicolin Chen @ 2025-01-11 3:32 UTC (permalink / raw)
To: will, robin.murphy, jgg, kevin.tian, tglx, maz, alex.williamson
Cc: joro, shuah, reinette.chatre, eric.auger, yebin10, apatel,
shivamurthy.shastri, bhelgaas, anna-maria, yury.norov,
nipun.gupta, iommu, linux-kernel, linux-arm-kernel, kvm,
linux-kselftest, patches, jean-philippe, mdf, mshavit,
shameerali.kolothum.thodi, smostafa, ddutile
Now that iommufd has its own iommufd_sw_msi() using iommufd_hwpt, the
private pointer of an iommufd-owned domain, the iova_cookie can similarly
be seen as the private pointer of a dma-iommu-owned domain. So, move
iova_cookie into the union.
Signed-off-by: Nicolin Chen <nicolinc@nvidia.com>
---
include/linux/iommu.h | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/include/linux/iommu.h b/include/linux/iommu.h
index b6526d734f30..284c40fe0e12 100644
--- a/include/linux/iommu.h
+++ b/include/linux/iommu.h
@@ -216,7 +216,6 @@ struct iommu_domain {
const struct iommu_ops *owner; /* Whose domain_alloc we came from */
unsigned long pgsize_bitmap; /* Bitmap of page sizes in use */
struct iommu_domain_geometry geometry;
- struct iommu_dma_cookie *iova_cookie;
int (*iopf_handler)(struct iopf_group *group);
#if IS_ENABLED(CONFIG_IRQ_MSI_IOMMU)
@@ -225,6 +224,7 @@ struct iommu_domain {
#endif
union { /* Pointer usable by owner of the domain */
+ struct iommu_dma_cookie *iova_cookie; /* dma-iommu */
struct iommufd_hw_pagetable *iommufd_hwpt; /* iommufd */
};
union { /* Fault handler */
--
2.43.0
^ permalink raw reply related [flat|nested] 64+ messages in thread
* [PATCH RFCv2 09/13] iommufd: Add IOMMU_OPTION_SW_MSI_START/SIZE ioctls
2025-01-11 3:32 [PATCH RFCv2 00/13] iommu: Add MSI mapping support with nested SMMU Nicolin Chen
` (7 preceding siblings ...)
2025-01-11 3:32 ` [PATCH RFCv2 08/13] iommu: Turn iova_cookie to dma-iommu private pointer Nicolin Chen
@ 2025-01-11 3:32 ` Nicolin Chen
2025-01-23 10:07 ` Tian, Kevin
2025-01-29 13:44 ` Eric Auger
2025-01-11 3:32 ` [PATCH RFCv2 10/13] iommufd/selftes: Add coverage for IOMMU_OPTION_SW_MSI_START/SIZE Nicolin Chen
` (6 subsequent siblings)
15 siblings, 2 replies; 64+ messages in thread
From: Nicolin Chen @ 2025-01-11 3:32 UTC (permalink / raw)
To: will, robin.murphy, jgg, kevin.tian, tglx, maz, alex.williamson
Cc: joro, shuah, reinette.chatre, eric.auger, yebin10, apatel,
shivamurthy.shastri, bhelgaas, anna-maria, yury.norov,
nipun.gupta, iommu, linux-kernel, linux-arm-kernel, kvm,
linux-kselftest, patches, jean-philippe, mdf, mshavit,
shameerali.kolothum.thodi, smostafa, ddutile
For systems that require MSI pages to be mapped into the IOMMU translation,
the IOMMU driver provides an IOMMU_RESV_SW_MSI range, which is the default
recommended IOVA window in which to place these mappings. However, there is
nothing special about this address. To support the RMR trick in the VMM for
nested translation, the VMM needs to know which sw_msi window the kernel is
using. As there is no particular reason to force the VMM to adopt the kernel
default, provide simple IOMMU_OPTION_SW_MSI_START/SIZE ioctls that the VMM
can use to directly specify the sw_msi window it wants to use. This replaces
and disables the default IOMMU_RESV_SW_MSI from the driver, avoiding the
need to build an API to discover the default IOMMU_RESV_SW_MSI.
Since iommufd now has its own sw_msi function, this is easy to implement.
To keep things simple, the parameters are global to the entire iommufd FD,
and will directly replace the IOMMU_RESV_SW_MSI values. The VMM must set
the values before creating any hwpt's to have any effect.
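The set-before-attach rule can be modeled with a small sketch (the helper names and the error return are hypothetical; in the real uAPI a late option write is simply not effective on already-created HWPTs rather than rejected):

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Minimal model of the iommufd context: the user-programmed window is
 * latched once the first device attaches, so later option writes cannot
 * retroactively move an already-established MSI window. */
struct ictx {
	uint64_t sw_msi_start;
	uint64_t sw_msi_size; /* in MB, 0 means "use the kernel default" */
	bool attached;
};

static int option_set_sw_msi(struct ictx *ctx, uint64_t start,
			     uint64_t size_mb)
{
	if (ctx->attached)
		return -1; /* too late: window already in use */
	ctx->sw_msi_start = start;
	ctx->sw_msi_size = size_mb;
	return 0;
}

/* Returns the window base this attachment will use: the user's value if
 * one was programmed, otherwise the driver's IOMMU_RESV_SW_MSI default
 * (0x8000000 on ARM IOMMUs). */
static uint64_t attach_device(struct ictx *ctx)
{
	ctx->attached = true;
	return ctx->sw_msi_size ? ctx->sw_msi_start : 0x8000000;
}
```

A VMM using the RMR trick would therefore issue both IOMMU_OPTION ioctls immediately after opening /dev/iommu, before any attach or hwpt allocation.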
Suggested-by: Jason Gunthorpe <jgg@nvidia.com>
Signed-off-by: Nicolin Chen <nicolinc@nvidia.com>
---
drivers/iommu/iommufd/iommufd_private.h | 4 +++
include/uapi/linux/iommufd.h | 18 ++++++++++++-
drivers/iommu/iommufd/device.c | 4 +++
drivers/iommu/iommufd/io_pagetable.c | 4 ++-
drivers/iommu/iommufd/ioas.c | 34 +++++++++++++++++++++++++
drivers/iommu/iommufd/main.c | 6 +++++
6 files changed, 68 insertions(+), 2 deletions(-)
diff --git a/drivers/iommu/iommufd/iommufd_private.h b/drivers/iommu/iommufd/iommufd_private.h
index 3e83bbb5912c..9f071609f00b 100644
--- a/drivers/iommu/iommufd/iommufd_private.h
+++ b/drivers/iommu/iommufd/iommufd_private.h
@@ -45,6 +45,9 @@ struct iommufd_ctx {
struct mutex sw_msi_lock;
struct list_head sw_msi_list;
unsigned int sw_msi_id;
+ /* User-programmed SW_MSI region, to override igroup->sw_msi_start */
+ phys_addr_t sw_msi_start;
+ size_t sw_msi_size;
u8 account_mode;
/* Compatibility with VFIO no iommu */
@@ -281,6 +284,7 @@ int iommufd_ioas_unmap(struct iommufd_ucmd *ucmd);
int iommufd_ioas_option(struct iommufd_ucmd *ucmd);
int iommufd_option_rlimit_mode(struct iommu_option *cmd,
struct iommufd_ctx *ictx);
+int iommufd_option_sw_msi(struct iommu_option *cmd, struct iommufd_ctx *ictx);
int iommufd_vfio_ioas(struct iommufd_ucmd *ucmd);
int iommufd_check_iova_range(struct io_pagetable *iopt,
diff --git a/include/uapi/linux/iommufd.h b/include/uapi/linux/iommufd.h
index 34810f6ae2b5..c864a201e502 100644
--- a/include/uapi/linux/iommufd.h
+++ b/include/uapi/linux/iommufd.h
@@ -294,7 +294,9 @@ struct iommu_ioas_unmap {
/**
* enum iommufd_option - ioctl(IOMMU_OPTION_RLIMIT_MODE) and
- * ioctl(IOMMU_OPTION_HUGE_PAGES)
+ * ioctl(IOMMU_OPTION_HUGE_PAGES) and
+ * ioctl(IOMMU_OPTION_SW_MSI_START) and
+ * ioctl(IOMMU_OPTION_SW_MSI_SIZE)
* @IOMMU_OPTION_RLIMIT_MODE:
* Change how RLIMIT_MEMLOCK accounting works. The caller must have privilege
* to invoke this. Value 0 (default) is user based accounting, 1 uses process
@@ -304,10 +306,24 @@ struct iommu_ioas_unmap {
* iommu mappings. Value 0 disables combining, everything is mapped to
* PAGE_SIZE. This can be useful for benchmarking. This is a per-IOAS
* option, the object_id must be the IOAS ID.
+ * @IOMMU_OPTION_SW_MSI_START:
+ * Change the base address of the IOMMU mapping region for MSI doorbell(s).
+ * It must be set before attaching a device to an IOAS/HWPT; otherwise, the
+ * option has no effect on that IOAS/HWPT. The user can let the kernel pick
+ * a base address by simply ignoring this option, or by setting
+ * IOMMU_OPTION_SW_MSI_SIZE to 0. Global option, object_id must be 0
+ * @IOMMU_OPTION_SW_MSI_SIZE:
+ * Change the size of the IOMMU mapping region for MSI doorbell(s). It must
+ * be set before attaching a device to an IOAS/HWPT; otherwise, it has no
+ * effect on that IOAS/HWPT. The value is in MB, and the minimum value is
+ * 1 MB. A value of 0 (default) invalidates the MSI doorbell base address
+ * set via IOMMU_OPTION_SW_MSI_START. Global option, object_id must be 0
*/
enum iommufd_option {
IOMMU_OPTION_RLIMIT_MODE = 0,
IOMMU_OPTION_HUGE_PAGES = 1,
+ IOMMU_OPTION_SW_MSI_START = 2,
+ IOMMU_OPTION_SW_MSI_SIZE = 3,
};
/**
diff --git a/drivers/iommu/iommufd/device.c b/drivers/iommu/iommufd/device.c
index f75b3c23cd41..093a3bd798db 100644
--- a/drivers/iommu/iommufd/device.c
+++ b/drivers/iommu/iommufd/device.c
@@ -445,10 +445,14 @@ static int
iommufd_device_attach_reserved_iova(struct iommufd_device *idev,
struct iommufd_hwpt_paging *hwpt_paging)
{
+ struct iommufd_ctx *ictx = idev->ictx;
int rc;
lockdep_assert_held(&idev->igroup->lock);
+ /* Override it with a user-programmed SW_MSI region */
+ if (ictx->sw_msi_size && ictx->sw_msi_start != PHYS_ADDR_MAX)
+ idev->igroup->sw_msi_start = ictx->sw_msi_start;
rc = iopt_table_enforce_dev_resv_regions(&hwpt_paging->ioas->iopt,
idev->dev,
&idev->igroup->sw_msi_start);
diff --git a/drivers/iommu/iommufd/io_pagetable.c b/drivers/iommu/iommufd/io_pagetable.c
index 8a790e597e12..5d7f5ca1eecf 100644
--- a/drivers/iommu/iommufd/io_pagetable.c
+++ b/drivers/iommu/iommufd/io_pagetable.c
@@ -1446,7 +1446,9 @@ int iopt_table_enforce_dev_resv_regions(struct io_pagetable *iopt,
if (sw_msi_start && resv->type == IOMMU_RESV_MSI)
num_hw_msi++;
if (sw_msi_start && resv->type == IOMMU_RESV_SW_MSI) {
- *sw_msi_start = resv->start;
+ /* Skip the driver-defined SW_MSI region if the user already set one */
+ if (*sw_msi_start == PHYS_ADDR_MAX)
+ *sw_msi_start = resv->start;
num_sw_msi++;
}
diff --git a/drivers/iommu/iommufd/ioas.c b/drivers/iommu/iommufd/ioas.c
index 1542c5fd10a8..3f4e25b660f9 100644
--- a/drivers/iommu/iommufd/ioas.c
+++ b/drivers/iommu/iommufd/ioas.c
@@ -620,6 +620,40 @@ int iommufd_option_rlimit_mode(struct iommu_option *cmd,
return -EOPNOTSUPP;
}
+int iommufd_option_sw_msi(struct iommu_option *cmd, struct iommufd_ctx *ictx)
+{
+ if (cmd->object_id)
+ return -EOPNOTSUPP;
+
+ if (cmd->op == IOMMU_OPTION_OP_GET) {
+ switch (cmd->option_id) {
+ case IOMMU_OPTION_SW_MSI_START:
+ cmd->val64 = (u64)ictx->sw_msi_start;
+ break;
+ case IOMMU_OPTION_SW_MSI_SIZE:
+ cmd->val64 = (u64)ictx->sw_msi_size;
+ break;
+ default:
+ return -EOPNOTSUPP;
+ }
+ return 0;
+ }
+ if (cmd->op == IOMMU_OPTION_OP_SET) {
+ switch (cmd->option_id) {
+ case IOMMU_OPTION_SW_MSI_START:
+ ictx->sw_msi_start = (phys_addr_t)cmd->val64;
+ break;
+ case IOMMU_OPTION_SW_MSI_SIZE:
+ ictx->sw_msi_size = (size_t)cmd->val64;
+ break;
+ default:
+ return -EOPNOTSUPP;
+ }
+ return 0;
+ }
+ return -EOPNOTSUPP;
+}
+
static int iommufd_ioas_option_huge_pages(struct iommu_option *cmd,
struct iommufd_ioas *ioas)
{
diff --git a/drivers/iommu/iommufd/main.c b/drivers/iommu/iommufd/main.c
index 7cc9497b7193..026297265c71 100644
--- a/drivers/iommu/iommufd/main.c
+++ b/drivers/iommu/iommufd/main.c
@@ -229,6 +229,8 @@ static int iommufd_fops_open(struct inode *inode, struct file *filp)
init_waitqueue_head(&ictx->destroy_wait);
mutex_init(&ictx->sw_msi_lock);
INIT_LIST_HEAD(&ictx->sw_msi_list);
+ ictx->sw_msi_start = PHYS_ADDR_MAX;
+ ictx->sw_msi_size = 0;
filp->private_data = ictx;
return 0;
}
@@ -287,6 +289,10 @@ static int iommufd_option(struct iommufd_ucmd *ucmd)
case IOMMU_OPTION_RLIMIT_MODE:
rc = iommufd_option_rlimit_mode(cmd, ucmd->ictx);
break;
+ case IOMMU_OPTION_SW_MSI_START:
+ case IOMMU_OPTION_SW_MSI_SIZE:
+ rc = iommufd_option_sw_msi(cmd, ucmd->ictx);
+ break;
case IOMMU_OPTION_HUGE_PAGES:
rc = iommufd_ioas_option(ucmd);
break;
--
2.43.0
^ permalink raw reply related [flat|nested] 64+ messages in thread
* [PATCH RFCv2 10/13] iommufd/selftest: Add coverage for IOMMU_OPTION_SW_MSI_START/SIZE
2025-01-11 3:32 [PATCH RFCv2 00/13] iommu: Add MSI mapping support with nested SMMU Nicolin Chen
` (8 preceding siblings ...)
2025-01-11 3:32 ` [PATCH RFCv2 09/13] iommufd: Add IOMMU_OPTION_SW_MSI_START/SIZE ioctls Nicolin Chen
@ 2025-01-11 3:32 ` Nicolin Chen
2025-01-11 3:32 ` [PATCH RFCv2 11/13] iommufd/device: Allow setting IOVAs for MSI(x) vectors Nicolin Chen
` (5 subsequent siblings)
15 siblings, 0 replies; 64+ messages in thread
From: Nicolin Chen @ 2025-01-11 3:32 UTC (permalink / raw)
To: will, robin.murphy, jgg, kevin.tian, tglx, maz, alex.williamson
Cc: joro, shuah, reinette.chatre, eric.auger, yebin10, apatel,
shivamurthy.shastri, bhelgaas, anna-maria, yury.norov,
nipun.gupta, iommu, linux-kernel, linux-arm-kernel, kvm,
linux-kselftest, patches, jean-philippe, mdf, mshavit,
shameerali.kolothum.thodi, smostafa, ddutile
Add selftest coverage for the new IOMMU_OPTION_SW_MSI_START/SIZE options, including fail_nth coverage.
Signed-off-by: Nicolin Chen <nicolinc@nvidia.com>
---
tools/testing/selftests/iommu/iommufd.c | 53 +++++++++++++++++++
.../selftests/iommu/iommufd_fail_nth.c | 14 +++++
2 files changed, 67 insertions(+)
diff --git a/tools/testing/selftests/iommu/iommufd.c b/tools/testing/selftests/iommu/iommufd.c
index a1b2b657999d..87b5b5ac927a 100644
--- a/tools/testing/selftests/iommu/iommufd.c
+++ b/tools/testing/selftests/iommu/iommufd.c
@@ -334,6 +334,59 @@ TEST_F(change_process, basic)
ASSERT_EQ(child, waitpid(child, NULL, 0));
}
+FIXTURE(iommufd_sw_msi)
+{
+ int fd;
+ uint32_t ioas_id;
+};
+
+FIXTURE_SETUP(iommufd_sw_msi)
+{
+ self->fd = open("/dev/iommu", O_RDWR);
+ ASSERT_NE(-1, self->fd);
+
+ test_ioctl_ioas_alloc(&self->ioas_id);
+}
+
+FIXTURE_TEARDOWN(iommufd_sw_msi)
+{
+ teardown_iommufd(self->fd, _metadata);
+}
+
+TEST_F(iommufd_sw_msi, basic)
+{
+ struct iommu_option cmd = {
+ .size = sizeof(cmd),
+ .op = IOMMU_OPTION_OP_SET,
+ };
+ /* Negative case: assign an object_id to this global option */
+ cmd.object_id = self->ioas_id;
+ cmd.option_id = IOMMU_OPTION_SW_MSI_START;
+ cmd.val64 = 0xffffffff;
+ EXPECT_ERRNO(EOPNOTSUPP, ioctl(self->fd, IOMMU_OPTION, &cmd));
+ cmd.option_id = IOMMU_OPTION_SW_MSI_SIZE;
+ cmd.val64 = 2;
+ EXPECT_ERRNO(EOPNOTSUPP, ioctl(self->fd, IOMMU_OPTION, &cmd));
+
+ cmd.object_id = 0;
+ cmd.option_id = IOMMU_OPTION_SW_MSI_START;
+ cmd.val64 = 0xffffffff;
+ ASSERT_EQ(0, ioctl(self->fd, IOMMU_OPTION, &cmd));
+ cmd.option_id = IOMMU_OPTION_SW_MSI_SIZE;
+ cmd.val64 = 2;
+ ASSERT_EQ(0, ioctl(self->fd, IOMMU_OPTION, &cmd));
+
+ /* Read them back to verify */
+ cmd.op = IOMMU_OPTION_OP_GET;
+ cmd.object_id = 0;
+ cmd.option_id = IOMMU_OPTION_SW_MSI_START;
+ ASSERT_EQ(0, ioctl(self->fd, IOMMU_OPTION, &cmd));
+ ASSERT_EQ(cmd.val64, 0xffffffff);
+ cmd.option_id = IOMMU_OPTION_SW_MSI_SIZE;
+ ASSERT_EQ(0, ioctl(self->fd, IOMMU_OPTION, &cmd));
+ ASSERT_EQ(cmd.val64, 2);
+}
+
FIXTURE(iommufd_ioas)
{
int fd;
diff --git a/tools/testing/selftests/iommu/iommufd_fail_nth.c b/tools/testing/selftests/iommu/iommufd_fail_nth.c
index 64b1f8e1b0cf..6d7b1417ce45 100644
--- a/tools/testing/selftests/iommu/iommufd_fail_nth.c
+++ b/tools/testing/selftests/iommu/iommufd_fail_nth.c
@@ -615,6 +615,10 @@ TEST_FAIL_NTH(basic_fail_nth, access_pin_domain)
/* device.c */
TEST_FAIL_NTH(basic_fail_nth, device)
{
+ struct iommu_option cmd = {
+ .size = sizeof(cmd),
+ .op = IOMMU_OPTION_OP_SET,
+ };
struct iommu_hwpt_selftest data = {
.iotlb = IOMMU_TEST_IOTLB_DEFAULT,
};
@@ -634,6 +638,16 @@ TEST_FAIL_NTH(basic_fail_nth, device)
if (self->fd == -1)
return -1;
+ cmd.option_id = IOMMU_OPTION_SW_MSI_START;
+ cmd.val64 = 0x8000000;
+ if (ioctl(self->fd, IOMMU_OPTION, &cmd))
+ return -1;
+
+ cmd.option_id = IOMMU_OPTION_SW_MSI_SIZE;
+ cmd.val64 = 2;
+ if (ioctl(self->fd, IOMMU_OPTION, &cmd))
+ return -1;
+
if (_test_ioctl_ioas_alloc(self->fd, &ioas_id))
return -1;
--
2.43.0
* [PATCH RFCv2 11/13] iommufd/device: Allow setting IOVAs for MSI(x) vectors
2025-01-11 3:32 [PATCH RFCv2 00/13] iommu: Add MSI mapping support with nested SMMU Nicolin Chen
` (9 preceding siblings ...)
2025-01-11 3:32 ` [PATCH RFCv2 10/13] iommufd/selftest: Add coverage for IOMMU_OPTION_SW_MSI_START/SIZE Nicolin Chen
@ 2025-01-11 3:32 ` Nicolin Chen
2025-01-11 3:32 ` [PATCH RFCv2 12/13] vfio-iommufd: Provide another layer of msi_iova helpers Nicolin Chen
` (4 subsequent siblings)
15 siblings, 0 replies; 64+ messages in thread
From: Nicolin Chen @ 2025-01-11 3:32 UTC (permalink / raw)
To: will, robin.murphy, jgg, kevin.tian, tglx, maz, alex.williamson
Cc: joro, shuah, reinette.chatre, eric.auger, yebin10, apatel,
shivamurthy.shastri, bhelgaas, anna-maria, yury.norov,
nipun.gupta, iommu, linux-kernel, linux-arm-kernel, kvm,
linux-kselftest, patches, jean-philippe, mdf, mshavit,
shameerali.kolothum.thodi, smostafa, ddutile
Add a resizable array "msi_iovas" to store the user space IOVAs of a
device's vMSI(x) vectors, and use the corresponding IOVA when setting the
msi_desc in iommufd_sw_msi(). This allows the irqchip driver to program
the correct MSI doorbell addresses when IOMMU translation is required.
Provide a set of helpers to allocate/set/unset entries in the resizable
array. The next patch will call them from the vfio code.
Signed-off-by: Nicolin Chen <nicolinc@nvidia.com>
---
drivers/iommu/iommufd/iommufd_private.h | 2 ++
include/linux/iommufd.h | 6 ++++
drivers/iommu/iommufd/device.c | 47 ++++++++++++++++++++++++-
3 files changed, 54 insertions(+), 1 deletion(-)
diff --git a/drivers/iommu/iommufd/iommufd_private.h b/drivers/iommu/iommufd/iommufd_private.h
index 9f071609f00b..81ff46d630dc 100644
--- a/drivers/iommu/iommufd/iommufd_private.h
+++ b/drivers/iommu/iommufd/iommufd_private.h
@@ -427,6 +427,8 @@ struct iommufd_device {
/* protect iopf_enabled counter */
struct mutex iopf_lock;
unsigned int iopf_enabled;
+ dma_addr_t *msi_iovas;
+ unsigned int num_msi_iovas;
};
static inline struct iommufd_device *
diff --git a/include/linux/iommufd.h b/include/linux/iommufd.h
index 11110c749200..787dcdcb9b3b 100644
--- a/include/linux/iommufd.h
+++ b/include/linux/iommufd.h
@@ -58,6 +58,12 @@ void iommufd_device_detach(struct iommufd_device *idev);
struct iommufd_ctx *iommufd_device_to_ictx(struct iommufd_device *idev);
u32 iommufd_device_to_id(struct iommufd_device *idev);
+int iommufd_device_set_num_msi_iovas(struct iommufd_device *idev,
+ unsigned int num);
+int iommufd_device_set_msi_iova(struct iommufd_device *idev, unsigned int index,
+ dma_addr_t iova);
+void iommufd_device_unset_msi_iova(struct iommufd_device *idev,
+ unsigned int index);
struct iommufd_access_ops {
u8 needs_pin_pages : 1;
diff --git a/drivers/iommu/iommufd/device.c b/drivers/iommu/iommufd/device.c
index 093a3bd798db..3d16967146cc 100644
--- a/drivers/iommu/iommufd/device.c
+++ b/drivers/iommu/iommufd/device.c
@@ -137,6 +137,8 @@ void iommufd_device_destroy(struct iommufd_object *obj)
struct iommufd_device *idev =
container_of(obj, struct iommufd_device, obj);
+ if (idev->num_msi_iovas)
+ kfree(idev->msi_iovas);
iommu_device_release_dma_owner(idev->dev);
iommufd_put_group(idev->igroup);
if (!iommufd_selftest_is_mock_dev(idev->dev))
@@ -294,6 +296,45 @@ u32 iommufd_device_to_id(struct iommufd_device *idev)
}
EXPORT_SYMBOL_NS_GPL(iommufd_device_to_id, "IOMMUFD");
+int iommufd_device_set_num_msi_iovas(struct iommufd_device *idev,
+ unsigned int num)
+{
+ dma_addr_t *msi_iovas;
+ int i;
+
+ msi_iovas = krealloc(idev->msi_iovas, sizeof(*idev->msi_iovas) * num,
+ GFP_KERNEL);
+ if (!msi_iovas)
+ return -ENOMEM;
+
+ for (i = idev->num_msi_iovas; i < num; i++)
+ msi_iovas[i] = PHYS_ADDR_MAX;
+
+ idev->msi_iovas = msi_iovas;
+ idev->num_msi_iovas = num;
+ return 0;
+}
+EXPORT_SYMBOL_NS_GPL(iommufd_device_set_num_msi_iovas, "IOMMUFD");
+
+int iommufd_device_set_msi_iova(struct iommufd_device *idev, unsigned int index,
+ dma_addr_t iova)
+{
+ if (index >= idev->num_msi_iovas)
+ return -EINVAL;
+ idev->msi_iovas[index] = iova;
+ return 0;
+}
+EXPORT_SYMBOL_NS_GPL(iommufd_device_set_msi_iova, "IOMMUFD");
+
+void iommufd_device_unset_msi_iova(struct iommufd_device *idev,
+ unsigned int index)
+{
+ if (index >= idev->num_msi_iovas)
+ return;
+ idev->msi_iovas[index] = PHYS_ADDR_MAX;
+}
+EXPORT_SYMBOL_NS_GPL(iommufd_device_unset_msi_iova, "IOMMUFD");
+
/*
* Get a iommufd_sw_msi_map for the msi physical address requested by the irq
* layer. The mapping to IOVA is global to the iommufd file descriptor, every
@@ -403,7 +444,11 @@ int iommufd_sw_msi(struct iommu_domain *domain, struct msi_desc *desc,
return rc;
set_bit(msi_map->id, handle->idev->igroup->required_sw_msi.bitmap);
- iova = msi_map->sw_msi_start + msi_map->pgoff * PAGE_SIZE;
+ if (desc->msi_index < handle->idev->num_msi_iovas &&
+ handle->idev->msi_iovas[desc->msi_index] != PHYS_ADDR_MAX)
+ iova = handle->idev->msi_iovas[desc->msi_index];
+ else
+ iova = msi_map->sw_msi_start + msi_map->pgoff * PAGE_SIZE;
msi_desc_set_iommu_msi_iova(desc, iova, PAGE_SHIFT);
return 0;
}
--
2.43.0
* [PATCH RFCv2 12/13] vfio-iommufd: Provide another layer of msi_iova helpers
2025-01-11 3:32 [PATCH RFCv2 00/13] iommu: Add MSI mapping support with nested SMMU Nicolin Chen
` (10 preceding siblings ...)
2025-01-11 3:32 ` [PATCH RFCv2 11/13] iommufd/device: Allow setting IOVAs for MSI(x) vectors Nicolin Chen
@ 2025-01-11 3:32 ` Nicolin Chen
2025-01-11 3:32 ` [PATCH RFCv2 13/13] vfio/pci: Allow preset MSI IOVAs via VFIO_IRQ_SET_ACTION_PREPARE Nicolin Chen
` (3 subsequent siblings)
15 siblings, 0 replies; 64+ messages in thread
From: Nicolin Chen @ 2025-01-11 3:32 UTC (permalink / raw)
To: will, robin.murphy, jgg, kevin.tian, tglx, maz, alex.williamson
Cc: joro, shuah, reinette.chatre, eric.auger, yebin10, apatel,
shivamurthy.shastri, bhelgaas, anna-maria, yury.norov,
nipun.gupta, iommu, linux-kernel, linux-arm-kernel, kvm,
linux-kselftest, patches, jean-philippe, mdf, mshavit,
shameerali.kolothum.thodi, smostafa, ddutile
As the bridge between the iommufd core and the vfio core, add a set of
msi_iova helpers that call the new iommufd_device-level msi_iova helpers.
The next patch will use these bridging helpers in the vfio-pci code.
Signed-off-by: Nicolin Chen <nicolinc@nvidia.com>
---
include/linux/vfio.h | 25 +++++++++++++++++++++++++
drivers/vfio/iommufd.c | 27 +++++++++++++++++++++++++++
2 files changed, 52 insertions(+)
diff --git a/include/linux/vfio.h b/include/linux/vfio.h
index 000a6cab2d31..f38a1e910a2f 100644
--- a/include/linux/vfio.h
+++ b/include/linux/vfio.h
@@ -133,6 +133,12 @@ struct vfio_device_ops {
#if IS_ENABLED(CONFIG_IOMMUFD)
struct iommufd_ctx *vfio_iommufd_device_ictx(struct vfio_device *vdev);
+int vfio_iommufd_device_set_num_msi_iovas(struct vfio_device *vdev,
+ unsigned int num);
+int vfio_iommufd_device_set_msi_iova(struct vfio_device *vdev,
+ unsigned int index, dma_addr_t iova);
+void vfio_iommufd_device_unset_msi_iova(struct vfio_device *vdev,
+ unsigned int index);
int vfio_iommufd_get_dev_id(struct vfio_device *vdev, struct iommufd_ctx *ictx);
int vfio_iommufd_physical_bind(struct vfio_device *vdev,
struct iommufd_ctx *ictx, u32 *out_device_id);
@@ -151,6 +157,25 @@ vfio_iommufd_device_ictx(struct vfio_device *vdev)
return NULL;
}
+static inline int
+vfio_iommufd_device_set_num_msi_iovas(struct vfio_device *vdev,
+ unsigned int num)
+{
+ return -EOPNOTSUPP;
+}
+
+static inline int vfio_iommufd_device_set_msi_iova(struct vfio_device *vdev,
+ unsigned int index,
+ dma_addr_t iova)
+{
+ return -EOPNOTSUPP;
+}
+
+static inline void vfio_iommufd_device_unset_msi_iova(struct vfio_device *vdev,
+ unsigned int index)
+{
+}
+
static inline int
vfio_iommufd_get_dev_id(struct vfio_device *vdev, struct iommufd_ctx *ictx)
{
diff --git a/drivers/vfio/iommufd.c b/drivers/vfio/iommufd.c
index 516294fd901b..805efdbccc97 100644
--- a/drivers/vfio/iommufd.c
+++ b/drivers/vfio/iommufd.c
@@ -69,6 +69,33 @@ struct iommufd_ctx *vfio_iommufd_device_ictx(struct vfio_device *vdev)
}
EXPORT_SYMBOL_GPL(vfio_iommufd_device_ictx);
+int vfio_iommufd_device_set_num_msi_iovas(struct vfio_device *vdev,
+ unsigned int num)
+{
+ if (!vdev->iommufd_device)
+ return -ENODEV;
+ return iommufd_device_set_num_msi_iovas(vdev->iommufd_device, num);
+}
+EXPORT_SYMBOL_GPL(vfio_iommufd_device_set_num_msi_iovas);
+
+int vfio_iommufd_device_set_msi_iova(struct vfio_device *vdev,
+ unsigned int index, dma_addr_t iova)
+{
+ if (!vdev->iommufd_device)
+ return -ENODEV;
+ return iommufd_device_set_msi_iova(vdev->iommufd_device, index, iova);
+}
+EXPORT_SYMBOL_GPL(vfio_iommufd_device_set_msi_iova);
+
+void vfio_iommufd_device_unset_msi_iova(struct vfio_device *vdev,
+ unsigned int index)
+{
+ if (!vdev->iommufd_device)
+ return;
+ iommufd_device_unset_msi_iova(vdev->iommufd_device, index);
+}
+EXPORT_SYMBOL_GPL(vfio_iommufd_device_unset_msi_iova);
+
static int vfio_iommufd_device_id(struct vfio_device *vdev)
{
if (vdev->iommufd_device)
--
2.43.0
* [PATCH RFCv2 13/13] vfio/pci: Allow preset MSI IOVAs via VFIO_IRQ_SET_ACTION_PREPARE
2025-01-11 3:32 [PATCH RFCv2 00/13] iommu: Add MSI mapping support with nested SMMU Nicolin Chen
` (11 preceding siblings ...)
2025-01-11 3:32 ` [PATCH RFCv2 12/13] vfio-iommufd: Provide another layer of msi_iova helpers Nicolin Chen
@ 2025-01-11 3:32 ` Nicolin Chen
2025-01-23 9:06 ` [PATCH RFCv2 00/13] iommu: Add MSI mapping support with nested SMMU Shameerali Kolothum Thodi
` (2 subsequent siblings)
15 siblings, 0 replies; 64+ messages in thread
From: Nicolin Chen @ 2025-01-11 3:32 UTC (permalink / raw)
To: will, robin.murphy, jgg, kevin.tian, tglx, maz, alex.williamson
Cc: joro, shuah, reinette.chatre, eric.auger, yebin10, apatel,
shivamurthy.shastri, bhelgaas, anna-maria, yury.norov,
nipun.gupta, iommu, linux-kernel, linux-arm-kernel, kvm,
linux-kselftest, patches, jean-philippe, mdf, mshavit,
shameerali.kolothum.thodi, smostafa, ddutile
Add a new VFIO_IRQ_SET_ACTION_PREPARE to set VFIO_IRQ_SET_DATA_MSI_IOVA,
giving user space an interface to forward to the kernel the stage-1 IOVA
(of a 2-stage translation: IOVA->IPA->PA) of an MSI doorbell address,
since the ITS hardware must be programmed with the top-level IOVA in
order to work with the IOMMU on ARM64.
Signed-off-by: Nicolin Chen <nicolinc@nvidia.com>
---
include/uapi/linux/vfio.h | 8 ++++--
drivers/vfio/pci/vfio_pci_intrs.c | 46 +++++++++++++++++++++++++++++++
drivers/vfio/vfio_main.c | 3 ++
3 files changed, 55 insertions(+), 2 deletions(-)
diff --git a/include/uapi/linux/vfio.h b/include/uapi/linux/vfio.h
index c8dbf8219c4f..85095e59a3c6 100644
--- a/include/uapi/linux/vfio.h
+++ b/include/uapi/linux/vfio.h
@@ -590,6 +590,8 @@ struct vfio_irq_set {
#define VFIO_IRQ_SET_ACTION_MASK (1 << 3) /* Mask interrupt */
#define VFIO_IRQ_SET_ACTION_UNMASK (1 << 4) /* Unmask interrupt */
#define VFIO_IRQ_SET_ACTION_TRIGGER (1 << 5) /* Trigger interrupt */
+#define VFIO_IRQ_SET_DATA_MSI_IOVA (1 << 6) /* Data is MSI IOVA (u64) */
+#define VFIO_IRQ_SET_ACTION_PREPARE (1 << 7) /* Prepare interrupt */
__u32 index;
__u32 start;
__u32 count;
@@ -599,10 +601,12 @@ struct vfio_irq_set {
#define VFIO_IRQ_SET_DATA_TYPE_MASK (VFIO_IRQ_SET_DATA_NONE | \
VFIO_IRQ_SET_DATA_BOOL | \
- VFIO_IRQ_SET_DATA_EVENTFD)
+ VFIO_IRQ_SET_DATA_EVENTFD | \
+ VFIO_IRQ_SET_DATA_MSI_IOVA)
#define VFIO_IRQ_SET_ACTION_TYPE_MASK (VFIO_IRQ_SET_ACTION_MASK | \
VFIO_IRQ_SET_ACTION_UNMASK | \
- VFIO_IRQ_SET_ACTION_TRIGGER)
+ VFIO_IRQ_SET_ACTION_TRIGGER | \
+ VFIO_IRQ_SET_ACTION_PREPARE)
/**
* VFIO_DEVICE_RESET - _IO(VFIO_TYPE, VFIO_BASE + 11)
*
diff --git a/drivers/vfio/pci/vfio_pci_intrs.c b/drivers/vfio/pci/vfio_pci_intrs.c
index 8382c5834335..80ea6bc1941f 100644
--- a/drivers/vfio/pci/vfio_pci_intrs.c
+++ b/drivers/vfio/pci/vfio_pci_intrs.c
@@ -685,6 +685,8 @@ static int vfio_pci_set_msi_trigger(struct vfio_pci_core_device *vdev,
if (irq_is(vdev, index) && !count && (flags & VFIO_IRQ_SET_DATA_NONE)) {
vfio_msi_disable(vdev, msix);
+ for (i = start; i < start + count; i++)
+ vfio_iommufd_device_unset_msi_iova(&vdev->vdev, i);
return 0;
}
@@ -728,6 +730,47 @@ static int vfio_pci_set_msi_trigger(struct vfio_pci_core_device *vdev,
return 0;
}
+static int vfio_pci_set_msi_prepare(struct vfio_pci_core_device *vdev,
+ unsigned int index, unsigned int start,
+ unsigned int count, uint32_t flags,
+ void *data)
+{
+ struct vfio_device *core = &vdev->vdev;
+ uint64_t *iovas = data;
+ unsigned int i;
+ int ret;
+
+ if (!vfio_iommufd_device_ictx(core))
+ return -EOPNOTSUPP;
+ if (!(irq_is(vdev, index) || is_irq_none(vdev)))
+ return -EINVAL;
+
+ if (flags & VFIO_IRQ_SET_DATA_NONE) {
+ if (!count)
+ return -EINVAL;
+ for (i = start; i < start + count; i++)
+ vfio_iommufd_device_unset_msi_iova(core, i);
+ return 0;
+ }
+
+ if (!(flags & VFIO_IRQ_SET_DATA_MSI_IOVA))
+ return -EOPNOTSUPP;
+ if (!IS_ENABLED(CONFIG_IRQ_MSI_IOMMU))
+ return -EOPNOTSUPP;
+
+ ret = vfio_iommufd_device_set_num_msi_iovas(core, start + count);
+ if (ret)
+ return ret;
+
+ for (i = start; i < start + count; i++) {
+ ret = vfio_iommufd_device_set_msi_iova(core, i, iovas[i]);
+ if (ret)
+ return ret;
+ }
+
+ return 0;
+}
+
static int vfio_pci_set_ctx_trigger_single(struct eventfd_ctx **ctx,
unsigned int count, uint32_t flags,
void *data)
@@ -837,6 +880,9 @@ int vfio_pci_set_irqs_ioctl(struct vfio_pci_core_device *vdev, uint32_t flags,
case VFIO_IRQ_SET_ACTION_TRIGGER:
func = vfio_pci_set_msi_trigger;
break;
+ case VFIO_IRQ_SET_ACTION_PREPARE:
+ func = vfio_pci_set_msi_prepare;
+ break;
}
break;
case VFIO_PCI_ERR_IRQ_INDEX:
diff --git a/drivers/vfio/vfio_main.c b/drivers/vfio/vfio_main.c
index 1fd261efc582..ad11c8e7da7b 100644
--- a/drivers/vfio/vfio_main.c
+++ b/drivers/vfio/vfio_main.c
@@ -1554,6 +1554,9 @@ int vfio_set_irqs_validate_and_prepare(struct vfio_irq_set *hdr, int num_irqs,
case VFIO_IRQ_SET_DATA_EVENTFD:
size = sizeof(int32_t);
break;
+ case VFIO_IRQ_SET_DATA_MSI_IOVA:
+ size = sizeof(uint64_t);
+ break;
default:
return -EINVAL;
}
--
2.43.0
* Re: [PATCH RFCv2 08/13] iommu: Turn iova_cookie to dma-iommu private pointer
2025-01-11 3:32 ` [PATCH RFCv2 08/13] iommu: Turn iova_cookie to dma-iommu private pointer Nicolin Chen
@ 2025-01-13 16:40 ` Jason Gunthorpe
0 siblings, 0 replies; 64+ messages in thread
From: Jason Gunthorpe @ 2025-01-13 16:40 UTC (permalink / raw)
To: Nicolin Chen
Cc: will, robin.murphy, kevin.tian, tglx, maz, alex.williamson, joro,
shuah, reinette.chatre, eric.auger, yebin10, apatel,
shivamurthy.shastri, bhelgaas, anna-maria, yury.norov,
nipun.gupta, iommu, linux-kernel, linux-arm-kernel, kvm,
linux-kselftest, patches, jean-philippe, mdf, mshavit,
shameerali.kolothum.thodi, smostafa, ddutile
On Fri, Jan 10, 2025 at 07:32:24PM -0800, Nicolin Chen wrote:
> Now, iommufd has its own iommufd_sw_msi using iommufd_hwpt, the iommufd
> owned domain's private pointer. Similarly, iova_cookie can be seen as a
> dma-iommu owned domain's private pointer. So, move iova_cookie into the
> union.
This commit message would be clearer as:
Now that iommufd does not rely on dma-iommu.c for any purpose we can
combine the dma-iommu.c iova_cookie and the iommufd_hwpt under the
same union. This union is effectively 'owner data' and can be used by the
entity that allocated the domain. Note that legacy vfio type1 flows
continue to use dma-iommu.c for sw_msi and still need iova_cookie.
Jason
* Re: [PATCH RFCv2 07/13] iommufd: Implement sw_msi support natively
2025-01-11 3:32 ` [PATCH RFCv2 07/13] iommufd: Implement sw_msi support natively Nicolin Chen
@ 2025-01-15 4:21 ` Yury Norov
2025-01-16 20:21 ` Jason Gunthorpe
2025-01-23 19:30 ` Jason Gunthorpe
1 sibling, 1 reply; 64+ messages in thread
From: Yury Norov @ 2025-01-15 4:21 UTC (permalink / raw)
To: Nicolin Chen
Cc: will, robin.murphy, jgg, kevin.tian, tglx, maz, alex.williamson,
joro, shuah, reinette.chatre, eric.auger, yebin10, apatel,
shivamurthy.shastri, bhelgaas, anna-maria, nipun.gupta, iommu,
linux-kernel, linux-arm-kernel, kvm, linux-kselftest, patches,
jean-philippe, mdf, mshavit, shameerali.kolothum.thodi, smostafa,
ddutile
On Fri, Jan 10, 2025 at 07:32:23PM -0800, Nicolin Chen wrote:
> From: Jason Gunthorpe <jgg@nvidia.com>
>
> iommufd has a model where the iommu_domain can be changed while the VFIO
> device is attached. In this case the MSI should continue to work. This
> corner case has not worked because the dma-iommu implementation of sw_msi
> is tied to a single domain.
>
> Implement the sw_msi mapping directly and use a global per-fd table to
> associate assigned iova to the MSI pages. This allows the MSI pages to
> loaded into a domain before it is attached ensuring that MSI is not
s/loaded/be loaded/ ?
> disrupted.
>
> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
> [nicolinc: set sw_msi pointer in nested hwpt allocators]
> Signed-off-by: Nicolin Chen <nicolinc@nvidia.com>
> ---
> drivers/iommu/iommufd/iommufd_private.h | 23 +++-
> drivers/iommu/iommufd/device.c | 158 ++++++++++++++++++++----
> drivers/iommu/iommufd/hw_pagetable.c | 3 +
> drivers/iommu/iommufd/main.c | 9 ++
> 4 files changed, 170 insertions(+), 23 deletions(-)
>
> diff --git a/drivers/iommu/iommufd/iommufd_private.h b/drivers/iommu/iommufd/iommufd_private.h
> index 063c0a42f54f..3e83bbb5912c 100644
> --- a/drivers/iommu/iommufd/iommufd_private.h
> +++ b/drivers/iommu/iommufd/iommufd_private.h
> @@ -19,6 +19,22 @@ struct iommu_group;
> struct iommu_option;
> struct iommufd_device;
>
> +struct iommufd_sw_msi_map {
> + struct list_head sw_msi_item;
> + phys_addr_t sw_msi_start;
> + phys_addr_t msi_addr;
> + unsigned int pgoff;
> + unsigned int id;
> +};
> +
> +/* Bitmap of struct iommufd_sw_msi_map::id */
> +struct iommufd_sw_msi_maps {
> + DECLARE_BITMAP(bitmap, 64);
> +};
> +
> +int iommufd_sw_msi(struct iommu_domain *domain, struct msi_desc *desc,
> + phys_addr_t msi_addr);
> +
> struct iommufd_ctx {
> struct file *file;
> struct xarray objects;
> @@ -26,6 +42,10 @@ struct iommufd_ctx {
> wait_queue_head_t destroy_wait;
> struct rw_semaphore ioas_creation_lock;
>
> + struct mutex sw_msi_lock;
> + struct list_head sw_msi_list;
> + unsigned int sw_msi_id;
> +
> u8 account_mode;
> /* Compatibility with VFIO no iommu */
> u8 no_iommu_mode;
> @@ -283,10 +303,10 @@ struct iommufd_hwpt_paging {
> struct iommufd_ioas *ioas;
> bool auto_domain : 1;
> bool enforce_cache_coherency : 1;
> - bool msi_cookie : 1;
> bool nest_parent : 1;
> /* Head at iommufd_ioas::hwpt_list */
> struct list_head hwpt_item;
> + struct iommufd_sw_msi_maps present_sw_msi;
> };
>
> struct iommufd_hwpt_nested {
> @@ -383,6 +403,7 @@ struct iommufd_group {
> struct iommu_group *group;
> struct iommufd_hw_pagetable *hwpt;
> struct list_head device_list;
> + struct iommufd_sw_msi_maps required_sw_msi;
> phys_addr_t sw_msi_start;
> };
>
> diff --git a/drivers/iommu/iommufd/device.c b/drivers/iommu/iommufd/device.c
> index 38b31b652147..f75b3c23cd41 100644
> --- a/drivers/iommu/iommufd/device.c
> +++ b/drivers/iommu/iommufd/device.c
> @@ -5,6 +5,7 @@
> #include <linux/iommufd.h>
> #include <linux/slab.h>
> #include <uapi/linux/iommufd.h>
> +#include <linux/msi.h>
>
> #include "../iommu-priv.h"
> #include "io_pagetable.h"
> @@ -293,36 +294,149 @@ u32 iommufd_device_to_id(struct iommufd_device *idev)
> }
> EXPORT_SYMBOL_NS_GPL(iommufd_device_to_id, "IOMMUFD");
>
> +/*
> + * Get a iommufd_sw_msi_map for the msi physical address requested by the irq
> + * layer. The mapping to IOVA is global to the iommufd file descriptor, every
> + * domain that is attached to a device using the same MSI parameters will use
> + * the same IOVA.
> + */
> +static struct iommufd_sw_msi_map *
> +iommufd_sw_msi_get_map(struct iommufd_ctx *ictx, phys_addr_t msi_addr,
> + phys_addr_t sw_msi_start)
> +{
> + struct iommufd_sw_msi_map *cur;
> + unsigned int max_pgoff = 0;
> +
> + lockdep_assert_held(&ictx->sw_msi_lock);
> +
> + list_for_each_entry(cur, &ictx->sw_msi_list, sw_msi_item) {
> + if (cur->sw_msi_start != sw_msi_start)
> + continue;
> + max_pgoff = max(max_pgoff, cur->pgoff + 1);
> + if (cur->msi_addr == msi_addr)
> + return cur;
> + }
> +
> + if (ictx->sw_msi_id >=
> + BITS_PER_BYTE * sizeof_field(struct iommufd_sw_msi_maps, bitmap))
> + return ERR_PTR(-EOVERFLOW);
> +
> + cur = kzalloc(sizeof(*cur), GFP_KERNEL);
> + if (!cur)
> + return ERR_PTR(-ENOMEM);
> + cur->sw_msi_start = sw_msi_start;
> + cur->msi_addr = msi_addr;
> + cur->pgoff = max_pgoff;
> + cur->id = ictx->sw_msi_id++;
> + list_add_tail(&cur->sw_msi_item, &ictx->sw_msi_list);
> + return cur;
> +}
> +
> +static int iommufd_sw_msi_install(struct iommufd_ctx *ictx,
> + struct iommufd_hwpt_paging *hwpt_paging,
> + struct iommufd_sw_msi_map *msi_map)
> +{
> + unsigned long iova;
> +
> + lockdep_assert_held(&ictx->sw_msi_lock);
> +
> + iova = msi_map->sw_msi_start + msi_map->pgoff * PAGE_SIZE;
> + if (!test_bit(msi_map->id, hwpt_paging->present_sw_msi.bitmap)) {
> + int rc;
> +
> + rc = iommu_map(hwpt_paging->common.domain, iova,
> + msi_map->msi_addr, PAGE_SIZE,
> + IOMMU_WRITE | IOMMU_READ | IOMMU_MMIO,
> + GFP_KERNEL_ACCOUNT);
> + if (rc)
> + return rc;
> + set_bit(msi_map->id, hwpt_paging->present_sw_msi.bitmap);
> + }
> + return 0;
> +}
So, does sw_msi_lock protect the present_sw_msi bitmap? If so, you
should use non-atomic __set_bit(). If not, you'd do something like:
if (test_and_set_bit(...))
return 0;
rc = iommu_map(...);
if (rc)
clear_bit(...);
return rc
Now it looks like a sequence of atomic accesses that is not actually
atomic as a whole, and that is misleading...
> +
> +/*
> + * Called by the irq code if the platform translates the MSI address through the
> + * IOMMU. msi_addr is the physical address of the MSI page. iommufd will
> + * allocate a fd global iova for the physical page that is the same on all
> + * domains and devices.
> + */
> +#ifdef CONFIG_IRQ_MSI_IOMMU
> +int iommufd_sw_msi(struct iommu_domain *domain, struct msi_desc *desc,
> + phys_addr_t msi_addr)
> +{
> + struct device *dev = msi_desc_to_dev(desc);
> + struct iommu_attach_handle *raw_handle;
> + struct iommufd_hwpt_paging *hwpt_paging;
> + struct iommufd_attach_handle *handle;
> + struct iommufd_sw_msi_map *msi_map;
> + struct iommufd_ctx *ictx;
> + unsigned long iova;
> + int rc;
> +
> + raw_handle =
> + iommu_attach_handle_get(dev->iommu_group, IOMMU_NO_PASID, 0);
Nit: no need to break the line.
> + if (!raw_handle)
> + return 0;
> + hwpt_paging = find_hwpt_paging(domain->iommufd_hwpt);
> +
> + handle = to_iommufd_handle(raw_handle);
> + /* No IOMMU_RESV_SW_MSI means no change to the msi_msg */
> + if (handle->idev->igroup->sw_msi_start == PHYS_ADDR_MAX)
> + return 0;
> +
> + ictx = handle->idev->ictx;
> + guard(mutex)(&ictx->sw_msi_lock);
> + /*
> + * The input msi_addr is the exact byte offset of the MSI doorbell, we
> + * assume the caller has checked that it is contained within an MMIO region
> + * that is safe to map at PAGE_SIZE.
> + */
> + msi_map = iommufd_sw_msi_get_map(handle->idev->ictx,
> + msi_addr & PAGE_MASK,
> + handle->idev->igroup->sw_msi_start);
> + if (IS_ERR(msi_map))
> + return PTR_ERR(msi_map);
> +
> + rc = iommufd_sw_msi_install(ictx, hwpt_paging, msi_map);
> + if (rc)
> + return rc;
> + set_bit(msi_map->id, handle->idev->igroup->required_sw_msi.bitmap);
Same here. I guess, sw_msi_lock protects required_sw_msi.bitmap,
right?
Thanks,
Yury
> +
> + iova = msi_map->sw_msi_start + msi_map->pgoff * PAGE_SIZE;
> + msi_desc_set_iommu_msi_iova(desc, iova, PAGE_SHIFT);
> + return 0;
> +}
> +#endif
> +
> +/*
> + * FIXME: when a domain is removed any ids that are not in the union of
> + * all still attached devices should be removed.
> + */
> +
> static int iommufd_group_setup_msi(struct iommufd_group *igroup,
> struct iommufd_hwpt_paging *hwpt_paging)
> {
> - phys_addr_t sw_msi_start = igroup->sw_msi_start;
> - int rc;
> + struct iommufd_ctx *ictx = igroup->ictx;
> + struct iommufd_sw_msi_map *cur;
> +
> + if (igroup->sw_msi_start == PHYS_ADDR_MAX)
> + return 0;
>
> /*
> - * If the IOMMU driver gives a IOMMU_RESV_SW_MSI then it is asking us to
> - * call iommu_get_msi_cookie() on its behalf. This is necessary to setup
> - * the MSI window so iommu_dma_prepare_msi() can install pages into our
> - * domain after request_irq(). If it is not done interrupts will not
> - * work on this domain.
> - *
> - * FIXME: This is conceptually broken for iommufd since we want to allow
> - * userspace to change the domains, eg switch from an identity IOAS to a
> - * DMA IOAS. There is currently no way to create a MSI window that
> - * matches what the IRQ layer actually expects in a newly created
> - * domain.
> + * Install all the MSI pages the device has been using into the domain
> */
> - if (sw_msi_start != PHYS_ADDR_MAX && !hwpt_paging->msi_cookie) {
> - rc = iommu_get_msi_cookie(hwpt_paging->common.domain,
> - sw_msi_start);
> + guard(mutex)(&ictx->sw_msi_lock);
> + list_for_each_entry(cur, &ictx->sw_msi_list, sw_msi_item) {
> + int rc;
> +
> + if (cur->sw_msi_start != igroup->sw_msi_start ||
> + !test_bit(cur->id, igroup->required_sw_msi.bitmap))
> + continue;
> +
> + rc = iommufd_sw_msi_install(ictx, hwpt_paging, cur);
> if (rc)
> return rc;
> -
> - /*
> - * iommu_get_msi_cookie() can only be called once per domain,
> - * it returns -EBUSY on later calls.
> - */
> - hwpt_paging->msi_cookie = true;
> }
> return 0;
> }
> diff --git a/drivers/iommu/iommufd/hw_pagetable.c b/drivers/iommu/iommufd/hw_pagetable.c
> index f7c0d7b214b6..538484eecb3b 100644
> --- a/drivers/iommu/iommufd/hw_pagetable.c
> +++ b/drivers/iommu/iommufd/hw_pagetable.c
> @@ -156,6 +156,7 @@ iommufd_hwpt_paging_alloc(struct iommufd_ctx *ictx, struct iommufd_ioas *ioas,
> goto out_abort;
> }
> }
> + iommu_domain_set_sw_msi(hwpt->domain, iommufd_sw_msi);
>
> /*
> * Set the coherency mode before we do iopt_table_add_domain() as some
> @@ -251,6 +252,7 @@ iommufd_hwpt_nested_alloc(struct iommufd_ctx *ictx,
> goto out_abort;
> }
> hwpt->domain->owner = ops;
> + iommu_domain_set_sw_msi(hwpt->domain, iommufd_sw_msi);
>
> if (WARN_ON_ONCE(hwpt->domain->type != IOMMU_DOMAIN_NESTED)) {
> rc = -EINVAL;
> @@ -303,6 +305,7 @@ iommufd_viommu_alloc_hwpt_nested(struct iommufd_viommu *viommu, u32 flags,
> goto out_abort;
> }
> hwpt->domain->owner = viommu->iommu_dev->ops;
> + iommu_domain_set_sw_msi(hwpt->domain, iommufd_sw_msi);
>
> if (WARN_ON_ONCE(hwpt->domain->type != IOMMU_DOMAIN_NESTED)) {
> rc = -EINVAL;
> diff --git a/drivers/iommu/iommufd/main.c b/drivers/iommu/iommufd/main.c
> index 97c5e3567d33..7cc9497b7193 100644
> --- a/drivers/iommu/iommufd/main.c
> +++ b/drivers/iommu/iommufd/main.c
> @@ -227,6 +227,8 @@ static int iommufd_fops_open(struct inode *inode, struct file *filp)
> xa_init(&ictx->groups);
> ictx->file = filp;
> init_waitqueue_head(&ictx->destroy_wait);
> + mutex_init(&ictx->sw_msi_lock);
> + INIT_LIST_HEAD(&ictx->sw_msi_list);
> filp->private_data = ictx;
> return 0;
> }
> @@ -234,6 +236,8 @@ static int iommufd_fops_open(struct inode *inode, struct file *filp)
> static int iommufd_fops_release(struct inode *inode, struct file *filp)
> {
> struct iommufd_ctx *ictx = filp->private_data;
> + struct iommufd_sw_msi_map *next;
> + struct iommufd_sw_msi_map *cur;
> struct iommufd_object *obj;
>
> /*
> @@ -262,6 +266,11 @@ static int iommufd_fops_release(struct inode *inode, struct file *filp)
> break;
> }
> WARN_ON(!xa_empty(&ictx->groups));
> +
> + mutex_destroy(&ictx->sw_msi_lock);
> + list_for_each_entry_safe(cur, next, &ictx->sw_msi_list, sw_msi_item)
> + kfree(cur);
> +
> kfree(ictx);
> return 0;
> }
> --
> 2.43.0
^ permalink raw reply [flat|nested] 64+ messages in thread
* Re: [PATCH RFCv2 07/13] iommufd: Implement sw_msi support natively
2025-01-15 4:21 ` Yury Norov
@ 2025-01-16 20:21 ` Jason Gunthorpe
0 siblings, 0 replies; 64+ messages in thread
From: Jason Gunthorpe @ 2025-01-16 20:21 UTC (permalink / raw)
To: Yury Norov
Cc: Nicolin Chen, will, robin.murphy, kevin.tian, tglx, maz,
alex.williamson, joro, shuah, reinette.chatre, eric.auger,
yebin10, apatel, shivamurthy.shastri, bhelgaas, anna-maria,
nipun.gupta, iommu, linux-kernel, linux-arm-kernel, kvm,
linux-kselftest, patches, jean-philippe, mdf, mshavit,
shameerali.kolothum.thodi, smostafa, ddutile
On Tue, Jan 14, 2025 at 11:21:13PM -0500, Yury Norov wrote:
> > +static int iommufd_sw_msi_install(struct iommufd_ctx *ictx,
> > + struct iommufd_hwpt_paging *hwpt_paging,
> > + struct iommufd_sw_msi_map *msi_map)
> > +{
> > + unsigned long iova;
> > +
> > + lockdep_assert_held(&ictx->sw_msi_lock);
> > +
> > + iova = msi_map->sw_msi_start + msi_map->pgoff * PAGE_SIZE;
> > + if (!test_bit(msi_map->id, hwpt_paging->present_sw_msi.bitmap)) {
> > + int rc;
> > +
> > + rc = iommu_map(hwpt_paging->common.domain, iova,
> > + msi_map->msi_addr, PAGE_SIZE,
> > + IOMMU_WRITE | IOMMU_READ | IOMMU_MMIO,
> > + GFP_KERNEL_ACCOUNT);
> > + if (rc)
> > + return rc;
> > + set_bit(msi_map->id, hwpt_paging->present_sw_msi.bitmap);
> > + }
> > + return 0;
> > +}
>
> So, does sw_msi_lock protect the present_sw_msi bitmap? If so, you
> should use non-atomic __set_bit().
Yes, that is a good point
Thanks,
Jason
* Re: [PATCH RFCv2 06/13] iommufd: Make attach_handle generic
2025-01-11 3:32 ` [PATCH RFCv2 06/13] iommufd: Make attach_handle generic Nicolin Chen
@ 2025-01-18 8:23 ` Yi Liu
2025-01-18 20:32 ` Nicolin Chen
2025-01-29 13:14 ` Eric Auger
1 sibling, 1 reply; 64+ messages in thread
From: Yi Liu @ 2025-01-18 8:23 UTC (permalink / raw)
To: Nicolin Chen, will, robin.murphy, jgg, kevin.tian, tglx, maz,
alex.williamson
Cc: joro, shuah, reinette.chatre, eric.auger, yebin10, apatel,
shivamurthy.shastri, bhelgaas, anna-maria, yury.norov,
nipun.gupta, iommu, linux-kernel, linux-arm-kernel, kvm,
linux-kselftest, patches, jean-philippe, mdf, mshavit,
shameerali.kolothum.thodi, smostafa, ddutile
On 2025/1/11 11:32, Nicolin Chen wrote:
> "attach_handle" was added exclusively for the iommufd_fault_iopf_handler()
> used by IOPF/PRI use cases, along with the "fault_data". Now, the iommufd
> version of sw_msi function will reuse the attach_handle and fault_data for
> a non-fault case.
>
> Move the attach_handle part out of the fault.c file to make it generic for
> all cases. Simplify the remaining fault specific routine to attach/detach.
I guess you can send it separately since both of our series need it. :)
> Signed-off-by: Nicolin Chen <nicolinc@nvidia.com>
> ---
> drivers/iommu/iommufd/iommufd_private.h | 40 +-------
> drivers/iommu/iommufd/device.c | 105 +++++++++++++++++++++
> drivers/iommu/iommufd/fault.c | 120 +++---------------------
> 3 files changed, 122 insertions(+), 143 deletions(-)
>
> diff --git a/drivers/iommu/iommufd/iommufd_private.h b/drivers/iommu/iommufd/iommufd_private.h
> index b6d706cf2c66..063c0a42f54f 100644
> --- a/drivers/iommu/iommufd/iommufd_private.h
> +++ b/drivers/iommu/iommufd/iommufd_private.h
> @@ -472,42 +472,12 @@ void iommufd_fault_destroy(struct iommufd_object *obj);
> int iommufd_fault_iopf_handler(struct iopf_group *group);
>
> int iommufd_fault_domain_attach_dev(struct iommufd_hw_pagetable *hwpt,
> - struct iommufd_device *idev);
> + struct iommufd_device *idev,
> + bool enable_iopf);
> void iommufd_fault_domain_detach_dev(struct iommufd_hw_pagetable *hwpt,
> - struct iommufd_device *idev);
> -int iommufd_fault_domain_replace_dev(struct iommufd_device *idev,
> - struct iommufd_hw_pagetable *hwpt,
> - struct iommufd_hw_pagetable *old);
> -
> -static inline int iommufd_hwpt_attach_device(struct iommufd_hw_pagetable *hwpt,
> - struct iommufd_device *idev)
> -{
> - if (hwpt->fault)
> - return iommufd_fault_domain_attach_dev(hwpt, idev);
> -
> - return iommu_attach_group(hwpt->domain, idev->igroup->group);
> -}
> -
> -static inline void iommufd_hwpt_detach_device(struct iommufd_hw_pagetable *hwpt,
> - struct iommufd_device *idev)
> -{
> - if (hwpt->fault) {
> - iommufd_fault_domain_detach_dev(hwpt, idev);
> - return;
> - }
> -
> - iommu_detach_group(hwpt->domain, idev->igroup->group);
> -}
> -
> -static inline int iommufd_hwpt_replace_device(struct iommufd_device *idev,
> - struct iommufd_hw_pagetable *hwpt,
> - struct iommufd_hw_pagetable *old)
> -{
> - if (old->fault || hwpt->fault)
> - return iommufd_fault_domain_replace_dev(idev, hwpt, old);
> -
> - return iommu_group_replace_domain(idev->igroup->group, hwpt->domain);
> -}
> + struct iommufd_device *idev,
> + struct iommufd_attach_handle *handle,
> + bool disable_iopf);
>
> static inline struct iommufd_viommu *
> iommufd_get_viommu(struct iommufd_ucmd *ucmd, u32 id)
> diff --git a/drivers/iommu/iommufd/device.c b/drivers/iommu/iommufd/device.c
> index dfd0898fb6c1..38b31b652147 100644
> --- a/drivers/iommu/iommufd/device.c
> +++ b/drivers/iommu/iommufd/device.c
> @@ -352,6 +352,111 @@ iommufd_device_attach_reserved_iova(struct iommufd_device *idev,
> return 0;
> }
>
> +/* The device attach/detach/replace helpers for attach_handle */
> +
> +static int iommufd_hwpt_attach_device(struct iommufd_hw_pagetable *hwpt,
> + struct iommufd_device *idev)
> +{
> + struct iommufd_attach_handle *handle;
> + int rc;
> +
> + if (hwpt->fault) {
> + rc = iommufd_fault_domain_attach_dev(hwpt, idev, true);
> + if (rc)
> + return rc;
> + }
> +
> + handle = kzalloc(sizeof(*handle), GFP_KERNEL);
> + if (!handle) {
> + rc = -ENOMEM;
> + goto out_fault_detach;
> + }
> +
> + handle->idev = idev;
> + rc = iommu_attach_group_handle(hwpt->domain, idev->igroup->group,
> + &handle->handle);
> + if (rc)
> + goto out_free_handle;
> +
> + return 0;
> +
> +out_free_handle:
> + kfree(handle);
> + handle = NULL;
> +out_fault_detach:
> + if (hwpt->fault)
> + iommufd_fault_domain_detach_dev(hwpt, idev, handle, true);
> + return rc;
> +}
> +
> +static struct iommufd_attach_handle *
> +iommufd_device_get_attach_handle(struct iommufd_device *idev)
> +{
> + struct iommu_attach_handle *handle;
> +
> + handle =
> + iommu_attach_handle_get(idev->igroup->group, IOMMU_NO_PASID, 0);
> + if (IS_ERR(handle))
> + return NULL;
> + return to_iommufd_handle(handle);
> +}
> +
> +static void iommufd_hwpt_detach_device(struct iommufd_hw_pagetable *hwpt,
> + struct iommufd_device *idev)
> +{
> + struct iommufd_attach_handle *handle;
> +
> + handle = iommufd_device_get_attach_handle(idev);
> + iommu_detach_group_handle(hwpt->domain, idev->igroup->group);
> + if (hwpt->fault)
> + iommufd_fault_domain_detach_dev(hwpt, idev, handle, true);
> + kfree(handle);
> +}
> +
> +static int iommufd_hwpt_replace_device(struct iommufd_device *idev,
> + struct iommufd_hw_pagetable *hwpt,
> + struct iommufd_hw_pagetable *old)
> +{
> + struct iommufd_attach_handle *old_handle =
> + iommufd_device_get_attach_handle(idev);
> + struct iommufd_attach_handle *handle;
> + int rc;
> +
> + if (hwpt->fault) {
> + rc = iommufd_fault_domain_attach_dev(hwpt, idev, !old->fault);
> + if (rc)
> + return rc;
> + }
> +
> + handle = kzalloc(sizeof(*handle), GFP_KERNEL);
> + if (!handle) {
> + rc = -ENOMEM;
> + goto out_fault_detach;
> + }
> +
> + handle->idev = idev;
> + rc = iommu_replace_group_handle(idev->igroup->group, hwpt->domain,
> + &handle->handle);
> + if (rc)
> + goto out_free_handle;
> +
> + if (old->fault)
> + iommufd_fault_domain_detach_dev(old, idev, old_handle,
> + !hwpt->fault);
> + kfree(old_handle);
> +
> + return 0;
> +
> +out_free_handle:
> + kfree(handle);
> + handle = NULL;
> +out_fault_detach:
> + if (hwpt->fault)
> + iommufd_fault_domain_detach_dev(hwpt, idev, handle,
> + !old->fault);
> + return rc;
> +}
> +
> int iommufd_hw_pagetable_attach(struct iommufd_hw_pagetable *hwpt,
> struct iommufd_device *idev)
> {
> diff --git a/drivers/iommu/iommufd/fault.c b/drivers/iommu/iommufd/fault.c
> index 06aa83a75e94..1d9bd3024b57 100644
> --- a/drivers/iommu/iommufd/fault.c
> +++ b/drivers/iommu/iommufd/fault.c
> @@ -60,42 +60,17 @@ static void iommufd_fault_iopf_disable(struct iommufd_device *idev)
> mutex_unlock(&idev->iopf_lock);
> }
>
> -static int __fault_domain_attach_dev(struct iommufd_hw_pagetable *hwpt,
> - struct iommufd_device *idev)
> -{
> - struct iommufd_attach_handle *handle;
> - int ret;
> -
> - handle = kzalloc(sizeof(*handle), GFP_KERNEL);
> - if (!handle)
> - return -ENOMEM;
> -
> - handle->idev = idev;
> - ret = iommu_attach_group_handle(hwpt->domain, idev->igroup->group,
> - &handle->handle);
> - if (ret)
> - kfree(handle);
> -
> - return ret;
> -}
> -
> int iommufd_fault_domain_attach_dev(struct iommufd_hw_pagetable *hwpt,
> - struct iommufd_device *idev)
> + struct iommufd_device *idev,
> + bool enable_iopf)
> {
> - int ret;
> + int rc = 0;
>
> if (!hwpt->fault)
> return -EINVAL;
> -
> - ret = iommufd_fault_iopf_enable(idev);
> - if (ret)
> - return ret;
> -
> - ret = __fault_domain_attach_dev(hwpt, idev);
> - if (ret)
> - iommufd_fault_iopf_disable(idev);
> -
> - return ret;
> + if (enable_iopf)
> + rc = iommufd_fault_iopf_enable(idev);
> + return rc;
> }
>
> static void iommufd_auto_response_faults(struct iommufd_hw_pagetable *hwpt,
> @@ -127,86 +102,15 @@ static void iommufd_auto_response_faults(struct iommufd_hw_pagetable *hwpt,
> mutex_unlock(&fault->mutex);
> }
>
> -static struct iommufd_attach_handle *
> -iommufd_device_get_attach_handle(struct iommufd_device *idev)
> -{
> - struct iommu_attach_handle *handle;
> -
> - handle = iommu_attach_handle_get(idev->igroup->group, IOMMU_NO_PASID, 0);
> - if (IS_ERR(handle))
> - return NULL;
> -
> - return to_iommufd_handle(handle);
> -}
> -
> void iommufd_fault_domain_detach_dev(struct iommufd_hw_pagetable *hwpt,
> - struct iommufd_device *idev)
> + struct iommufd_device *idev,
> + struct iommufd_attach_handle *handle,
> + bool disable_iopf)
> {
> - struct iommufd_attach_handle *handle;
> -
> - handle = iommufd_device_get_attach_handle(idev);
> - iommu_detach_group_handle(hwpt->domain, idev->igroup->group);
> - iommufd_auto_response_faults(hwpt, handle);
> - iommufd_fault_iopf_disable(idev);
> - kfree(handle);
> -}
> -
> -static int __fault_domain_replace_dev(struct iommufd_device *idev,
> - struct iommufd_hw_pagetable *hwpt,
> - struct iommufd_hw_pagetable *old)
> -{
> - struct iommufd_attach_handle *handle, *curr = NULL;
> - int ret;
> -
> - if (old->fault)
> - curr = iommufd_device_get_attach_handle(idev);
> -
> - if (hwpt->fault) {
> - handle = kzalloc(sizeof(*handle), GFP_KERNEL);
> - if (!handle)
> - return -ENOMEM;
> -
> - handle->idev = idev;
> - ret = iommu_replace_group_handle(idev->igroup->group,
> - hwpt->domain, &handle->handle);
> - } else {
> - ret = iommu_replace_group_handle(idev->igroup->group,
> - hwpt->domain, NULL);
> - }
> -
> - if (!ret && curr) {
> - iommufd_auto_response_faults(old, curr);
> - kfree(curr);
> - }
> -
> - return ret;
> -}
> -
> -int iommufd_fault_domain_replace_dev(struct iommufd_device *idev,
> - struct iommufd_hw_pagetable *hwpt,
> - struct iommufd_hw_pagetable *old)
> -{
> - bool iopf_off = !hwpt->fault && old->fault;
> - bool iopf_on = hwpt->fault && !old->fault;
> - int ret;
> -
> - if (iopf_on) {
> - ret = iommufd_fault_iopf_enable(idev);
> - if (ret)
> - return ret;
> - }
> -
> - ret = __fault_domain_replace_dev(idev, hwpt, old);
> - if (ret) {
> - if (iopf_on)
> - iommufd_fault_iopf_disable(idev);
> - return ret;
> - }
> -
> - if (iopf_off)
> + if (handle)
> + iommufd_auto_response_faults(hwpt, handle);
No need to check handle: after this patch, both the non-fault and the fault
paths allocate a handle, so it cannot be used to distinguish the fault path
from the non-fault one. Also, the callers of iommufd_fault_domain_detach_dev()
check hwpt->fault before calling it. So just call iommufd_auto_response_faults().
> + if (disable_iopf)
> iommufd_fault_iopf_disable(idev);
> -
> - return 0;
> }
>
> void iommufd_fault_destroy(struct iommufd_object *obj)
--
Regards,
Yi Liu
* Re: [PATCH RFCv2 06/13] iommufd: Make attach_handle generic
2025-01-18 8:23 ` Yi Liu
@ 2025-01-18 20:32 ` Nicolin Chen
2025-01-19 10:40 ` Yi Liu
2025-01-20 14:20 ` Jason Gunthorpe
0 siblings, 2 replies; 64+ messages in thread
From: Nicolin Chen @ 2025-01-18 20:32 UTC (permalink / raw)
To: Yi Liu, jgg
Cc: will, robin.murphy, kevin.tian, tglx, maz, alex.williamson, joro,
shuah, reinette.chatre, eric.auger, yebin10, apatel,
shivamurthy.shastri, bhelgaas, anna-maria, yury.norov,
nipun.gupta, iommu, linux-kernel, linux-arm-kernel, kvm,
linux-kselftest, patches, jean-philippe, mdf, mshavit,
shameerali.kolothum.thodi, smostafa, ddutile
On Sat, Jan 18, 2025 at 04:23:22PM +0800, Yi Liu wrote:
> On 2025/1/11 11:32, Nicolin Chen wrote:
> > "attach_handle" was added exclusively for the iommufd_fault_iopf_handler()
> > used by IOPF/PRI use cases, along with the "fault_data". Now, the iommufd
> > version of sw_msi function will reuse the attach_handle and fault_data for
> > a non-fault case.
> >
> > Move the attach_handle part out of the fault.c file to make it generic for
> > all cases. Simplify the remaining fault specific routine to attach/detach.
>
> I guess you can send it separately since both of our series need it. :)
Jason, would you like to take this patch separately? I can send
it prior to two big series for a quick review after rc1. It'll
likely impact the vEVENTQ series too.
> > +static int iommufd_hwpt_attach_device(struct iommufd_hw_pagetable *hwpt,
> > + struct iommufd_device *idev)
> > +{
> > + struct iommufd_attach_handle *handle;
> > + int rc;
> > +
> > + if (hwpt->fault) {
> > + rc = iommufd_fault_domain_attach_dev(hwpt, idev, true);
> > + if (rc)
> > + return rc;
> > + }
> > +
> > + handle = kzalloc(sizeof(*handle), GFP_KERNEL);
> > + if (!handle) {
> > + rc = -ENOMEM;
> > + goto out_fault_detach;
> > + }
> > +
> > + handle->idev = idev;
> > + rc = iommu_attach_group_handle(hwpt->domain, idev->igroup->group,
> > + &handle->handle);
> > + if (rc)
> > + goto out_free_handle;
> > +
> > + return 0;
> > +
> > +out_free_handle:
> > + kfree(handle);
> > + handle = NULL;
> > +out_fault_detach:
> > + if (hwpt->fault)
> > + iommufd_fault_domain_detach_dev(hwpt, idev, handle, true);
> > + return rc;
> > +}
Here the revert path passes in handle=NULL...
> > void iommufd_fault_domain_detach_dev(struct iommufd_hw_pagetable *hwpt,
> > - struct iommufd_device *idev)
> > + struct iommufd_device *idev,
> > + struct iommufd_attach_handle *handle,
> > + bool disable_iopf)
> > {
> > + if (handle)
> > + iommufd_auto_response_faults(hwpt, handle);
>
> no need to check handle. After this patch, both the non-fault and the fault
> path will allocate handle. It cannot be used to isolate fault and non-fault
> path. Also, the callers of iommufd_fault_domain_detach_dev() will check
> hwpt->fault before calling it. So just call iommufd_auto_response_faults().
..so we still need this !NULL validation? :-/
Thanks
Nicolin
* Re: [PATCH RFCv2 06/13] iommufd: Make attach_handle generic
2025-01-18 20:32 ` Nicolin Chen
@ 2025-01-19 10:40 ` Yi Liu
2025-01-20 5:54 ` Nicolin Chen
2025-01-20 14:20 ` Jason Gunthorpe
1 sibling, 1 reply; 64+ messages in thread
From: Yi Liu @ 2025-01-19 10:40 UTC (permalink / raw)
To: Nicolin Chen, jgg
Cc: will, robin.murphy, kevin.tian, tglx, maz, alex.williamson, joro,
shuah, reinette.chatre, eric.auger, yebin10, apatel,
shivamurthy.shastri, bhelgaas, anna-maria, yury.norov,
nipun.gupta, iommu, linux-kernel, linux-arm-kernel, kvm,
linux-kselftest, patches, jean-philippe, mdf, mshavit,
shameerali.kolothum.thodi, smostafa, ddutile
On 2025/1/19 04:32, Nicolin Chen wrote:
> On Sat, Jan 18, 2025 at 04:23:22PM +0800, Yi Liu wrote:
>> On 2025/1/11 11:32, Nicolin Chen wrote:
>>> "attach_handle" was added exclusively for the iommufd_fault_iopf_handler()
>>> used by IOPF/PRI use cases, along with the "fault_data". Now, the iommufd
>>> version of sw_msi function will reuse the attach_handle and fault_data for
>>> a non-fault case.
>>>
>>> Move the attach_handle part out of the fault.c file to make it generic for
>>> all cases. Simplify the remaining fault specific routine to attach/detach.
>>
>> I guess you can send it separately since both of our series need it. :)
>
> Jason, would you like to take this patch separately? I can send
> it prior to two big series for a quick review after rc1. It'll
> likely impact the vEVENTQ series too.
>
>>> +static int iommufd_hwpt_attach_device(struct iommufd_hw_pagetable *hwpt,
>>> + struct iommufd_device *idev)
>>> +{
>>> + struct iommufd_attach_handle *handle;
>>> + int rc;
>>> +
>>> + if (hwpt->fault) {
>>> + rc = iommufd_fault_domain_attach_dev(hwpt, idev, true);
>>> + if (rc)
>>> + return rc;
>>> + }
>>> +
>>> + handle = kzalloc(sizeof(*handle), GFP_KERNEL);
>>> + if (!handle) {
>>> + rc = -ENOMEM;
>>> + goto out_fault_detach;
>>> + }
>>> +
>>> + handle->idev = idev;
>>> + rc = iommu_attach_group_handle(hwpt->domain, idev->igroup->group,
>>> + &handle->handle);
>>> + if (rc)
>>> + goto out_free_handle;
>>> +
>>> + return 0;
>>> +
>>> +out_free_handle:
>>> + kfree(handle);
>>> + handle = NULL;
>>> +out_fault_detach:
>>> + if (hwpt->fault)
>>> + iommufd_fault_domain_detach_dev(hwpt, idev, handle, true);
>>> + return rc;
>>> +}
>
> Here the revert path passes in a handle=NULL..
aha, got it. Perhaps we can allocate the handle first. Per the thread below,
a failed domain attach may still have pending PRIs, which would require the
caller to issue the auto response. Although we are likely to swap the order,
it is still nice for the caller to be able to do it.
https://lore.kernel.org/linux-iommu/f685daca-081a-4ede-b1e1-559009fa9ebc@intel.com/
--
Regards,
Yi Liu
* Re: [PATCH RFCv2 06/13] iommufd: Make attach_handle generic
2025-01-19 10:40 ` Yi Liu
@ 2025-01-20 5:54 ` Nicolin Chen
2025-01-24 13:31 ` Yi Liu
0 siblings, 1 reply; 64+ messages in thread
From: Nicolin Chen @ 2025-01-20 5:54 UTC (permalink / raw)
To: Yi Liu
Cc: jgg, will, robin.murphy, kevin.tian, tglx, maz, alex.williamson,
joro, shuah, reinette.chatre, eric.auger, yebin10, apatel,
shivamurthy.shastri, bhelgaas, anna-maria, yury.norov,
nipun.gupta, iommu, linux-kernel, linux-arm-kernel, kvm,
linux-kselftest, patches, jean-philippe, mdf, mshavit,
shameerali.kolothum.thodi, smostafa, ddutile
On Sun, Jan 19, 2025 at 06:40:57PM +0800, Yi Liu wrote:
> On 2025/1/19 04:32, Nicolin Chen wrote:
> > On Sat, Jan 18, 2025 at 04:23:22PM +0800, Yi Liu wrote:
> > > On 2025/1/11 11:32, Nicolin Chen wrote:
> > > > +static int iommufd_hwpt_attach_device(struct iommufd_hw_pagetable *hwpt,
> > > > + struct iommufd_device *idev)
> > > > +{
> > > > + struct iommufd_attach_handle *handle;
> > > > + int rc;
> > > > +
> > > > + if (hwpt->fault) {
> > > > + rc = iommufd_fault_domain_attach_dev(hwpt, idev, true);
> > > > + if (rc)
> > > > + return rc;
> > > > + }
> > > > +
> > > > + handle = kzalloc(sizeof(*handle), GFP_KERNEL);
> > > > + if (!handle) {
> > > > + rc = -ENOMEM;
> > > > + goto out_fault_detach;
> > > > + }
> > > > +
> > > > + handle->idev = idev;
> > > > + rc = iommu_attach_group_handle(hwpt->domain, idev->igroup->group,
> > > > + &handle->handle);
> > > > + if (rc)
> > > > + goto out_free_handle;
> > > > +
> > > > + return 0;
> > > > +
> > > > +out_free_handle:
> > > > + kfree(handle);
> > > > + handle = NULL;
> > > > +out_fault_detach:
> > > > + if (hwpt->fault)
> > > > + iommufd_fault_domain_detach_dev(hwpt, idev, handle, true);
> > > > + return rc;
> > > > +}
> >
> > Here the revert path passes in a handle=NULL..
>
> aha. got it. Perhaps we can allocate handle first. In the below thread, it
> is possible that a failed domain may have pending PRIs, it would require
> the caller to call the auto response. Although, we are likely to swap the
> order, but it is nice to have for the caller to do it.
>
> https://lore.kernel.org/linux-iommu/f685daca-081a-4ede-b1e1-559009fa9ebc@intel.com/
Hmm, I don't really see the point in letting the detach flow scan the two
lists in hwpt->fault against a zeroed handle... which feels like a waste of
CPU cycles?
And I am not sure how that xa_insert part is related?
Thanks
Nic
* Re: [PATCH RFCv2 06/13] iommufd: Make attach_handle generic
2025-01-18 20:32 ` Nicolin Chen
2025-01-19 10:40 ` Yi Liu
@ 2025-01-20 14:20 ` Jason Gunthorpe
1 sibling, 0 replies; 64+ messages in thread
From: Jason Gunthorpe @ 2025-01-20 14:20 UTC (permalink / raw)
To: Nicolin Chen
Cc: Yi Liu, will, robin.murphy, kevin.tian, tglx, maz,
alex.williamson, joro, shuah, reinette.chatre, eric.auger,
yebin10, apatel, shivamurthy.shastri, bhelgaas, anna-maria,
yury.norov, nipun.gupta, iommu, linux-kernel, linux-arm-kernel,
kvm, linux-kselftest, patches, jean-philippe, mdf, mshavit,
shameerali.kolothum.thodi, smostafa, ddutile
On Sat, Jan 18, 2025 at 12:32:49PM -0800, Nicolin Chen wrote:
> On Sat, Jan 18, 2025 at 04:23:22PM +0800, Yi Liu wrote:
> > On 2025/1/11 11:32, Nicolin Chen wrote:
> > > "attach_handle" was added exclusively for the iommufd_fault_iopf_handler()
> > > used by IOPF/PRI use cases, along with the "fault_data". Now, the iommufd
> > > version of sw_msi function will reuse the attach_handle and fault_data for
> > > a non-fault case.
> > >
> > > Move the attach_handle part out of the fault.c file to make it generic for
> > > all cases. Simplify the remaining fault specific routine to attach/detach.
> >
> > I guess you can send it separately since both of our series need it. :)
>
> Jason, would you like to take this patch separately? I can send
> it prior to two big series for a quick review after rc1. It'll
> likely impact the vEVENTQ series too.
If it helps, you can put it in its own series and I will take it with
pasid or vevent, whichever goes first
Jason
* RE: [PATCH RFCv2 00/13] iommu: Add MSI mapping support with nested SMMU
2025-01-11 3:32 [PATCH RFCv2 00/13] iommu: Add MSI mapping support with nested SMMU Nicolin Chen
` (12 preceding siblings ...)
2025-01-11 3:32 ` [PATCH RFCv2 13/13] vfio/pci: Allow preset MSI IOVAs via VFIO_IRQ_SET_ACTION_PREPARE Nicolin Chen
@ 2025-01-23 9:06 ` Shameerali Kolothum Thodi
2025-01-23 13:24 ` Jason Gunthorpe
2025-02-05 22:49 ` Jacob Pan
2025-02-07 14:34 ` Jason Gunthorpe
15 siblings, 1 reply; 64+ messages in thread
From: Shameerali Kolothum Thodi @ 2025-01-23 9:06 UTC (permalink / raw)
To: Nicolin Chen, will@kernel.org, robin.murphy@arm.com,
jgg@nvidia.com, kevin.tian@intel.com, tglx@linutronix.de,
maz@kernel.org, alex.williamson@redhat.com
Cc: joro@8bytes.org, shuah@kernel.org, reinette.chatre@intel.com,
eric.auger@redhat.com, yebin (H), apatel@ventanamicro.com,
shivamurthy.shastri@linutronix.de, bhelgaas@google.com,
anna-maria@linutronix.de, yury.norov@gmail.com,
nipun.gupta@amd.com, iommu@lists.linux.dev,
linux-kernel@vger.kernel.org,
linux-arm-kernel@lists.infradead.org, kvm@vger.kernel.org,
linux-kselftest@vger.kernel.org, patches@lists.linux.dev,
jean-philippe@linaro.org, mdf@kernel.org, mshavit@google.com,
smostafa@google.com, ddutile@redhat.com
Hi Nicolin,
> Subject: [PATCH RFCv2 00/13] iommu: Add MSI mapping support with
> nested SMMU
>
> [ Background ]
> On ARM GIC systems and others, the target address of the MSI is translated
> by the IOMMU. For GIC, the MSI address page is called "ITS" page. When
> the
> IOMMU is disabled, the MSI address is programmed to the physical location
> of the GIC ITS page (e.g. 0x20200000). When the IOMMU is enabled, the ITS
> page is behind the IOMMU, so the MSI address is programmed to an
> allocated
> IO virtual address (a.k.a IOVA), e.g. 0xFFFF0000, which must be mapped to
> the physical ITS page: IOVA (0xFFFF0000) ===> PA (0x20200000).
> When a 2-stage translation is enabled, IOVA will be still used to program
> the MSI address, though the mappings will be in two stages:
> IOVA (0xFFFF0000) ===> IPA (e.g. 0x80900000) ===> PA (0x20200000)
> (IPA stands for Intermediate Physical Address).
>
> If the device that generates MSI is attached to an IOMMU_DOMAIN_DMA,
> the
> IOVA is dynamically allocated from the top of the IOVA space. If attached
> to an IOMMU_DOMAIN_UNMANAGED (e.g. a VFIO passthrough device), the
> IOVA is
> fixed to an MSI window reported by the IOMMU driver via
> IOMMU_RESV_SW_MSI,
> which is hardwired to MSI_IOVA_BASE (IOVA==0x8000000) for ARM
> IOMMUs.
>
> So far, this IOMMU_RESV_SW_MSI works well as kernel is entirely in charge
> of the IOMMU translation (1-stage translation), since the IOVA for the ITS
> page is fixed and known by kernel. However, with virtual machine enabling
> a nested IOMMU translation (2-stage), a guest kernel directly controls the
> stage-1 translation with an IOMMU_DOMAIN_DMA, mapping a vITS page (at
> an
> IPA 0x80900000) onto its own IOVA space (e.g. 0xEEEE0000). Then, the host
> kernel can't know that guest-level IOVA to program the MSI address.
>
> There have been two approaches to solve this problem:
> 1. Create an identity mapping in the stage-1. VMM could insert a few RMRs
> (Reserved Memory Regions) in guest's IORT. Then the guest kernel would
> fetch these RMR entries from the IORT and create an
> IOMMU_RESV_DIRECT
> region per iommu group for a direct mapping. Eventually, the mappings
> would look like: IOVA (0x8000000) === IPA (0x8000000) ===> 0x20200000
> This requires an IOMMUFD ioctl for kernel and VMM to agree on the IPA.
> 2. Forward the guest-level MSI IOVA captured by VMM to the host-level GIC
> driver, to program the correct MSI IOVA. Forward the VMM-defined vITS
> page location (IPA) to the kernel for the stage-2 mapping. Eventually:
> IOVA (0xFFFF0000) ===> IPA (0x80900000) ===> PA (0x20200000)
> This requires a VFIO ioctl (for IOVA) and an IOMMUFD ioctl (for IPA).
>
> Worth mentioning that when Eric Auger was working on the same topic with
> the VFIO iommu uAPI, he had the approach (2) first, and then switched to
> the approach (1), suggested by Jean-Philippe for reduction of complexity.
>
> The approach (1) basically feels like the existing VFIO passthrough that
> has a 1-stage mapping for the unmanaged domain, yet only by shifting the
> MSI mapping from stage 1 (guest-has-no-iommu case) to stage 2
> (guest-has-iommu case). So, it could reuse the existing IOMMU_RESV_SW_MSI
> piece, by sharing the same idea of "VMM leaving everything to the kernel".
>
> The approach (2) is an ideal solution, yet it requires additional effort
> for kernel to be aware of the 1-stage gIOVA(s) and 2-stage IPAs for vITS
> page(s), which demands VMM to closely cooperate.
> * It also brings some complicated use cases to the table where the host
> or/and guest system(s) has/have multiple ITS pages.
I have done some basic sanity tests with this series and the QEMU branches you
provided on HiSilicon hardware. Basic device assignment works fine. I will
rebase my QEMU smmuv3-accel branch on top of this and do some more tests.
One point of confusion I have about the above text: do we still plan to
support approach (1) (using RMR in QEMU), or are you just mentioning it here
because it is still possible to make use of it? I think from previous
discussions the argument was to adopt a more dedicated MSI pass-through model,
which I believe is approach (2) here. Could you please confirm.
Thanks,
Shameer
^ permalink raw reply [flat|nested] 64+ messages in thread
* RE: [PATCH RFCv2 05/13] iommu: Turn fault_data to iommufd private pointer
2025-01-11 3:32 ` [PATCH RFCv2 05/13] iommu: Turn fault_data to iommufd private pointer Nicolin Chen
@ 2025-01-23 9:54 ` Tian, Kevin
2025-01-23 13:25 ` Jason Gunthorpe
2025-01-29 12:40 ` Eric Auger
1 sibling, 1 reply; 64+ messages in thread
From: Tian, Kevin @ 2025-01-23 9:54 UTC (permalink / raw)
To: Nicolin Chen, will@kernel.org, robin.murphy@arm.com,
jgg@nvidia.com, tglx@linutronix.de, maz@kernel.org,
alex.williamson@redhat.com
Cc: joro@8bytes.org, shuah@kernel.org, Chatre, Reinette,
eric.auger@redhat.com, yebin10@huawei.com,
apatel@ventanamicro.com, shivamurthy.shastri@linutronix.de,
bhelgaas@google.com, anna-maria@linutronix.de,
yury.norov@gmail.com, nipun.gupta@amd.com, iommu@lists.linux.dev,
linux-kernel@vger.kernel.org,
linux-arm-kernel@lists.infradead.org, kvm@vger.kernel.org,
linux-kselftest@vger.kernel.org, patches@lists.linux.dev,
jean-philippe@linaro.org, mdf@kernel.org, mshavit@google.com,
shameerali.kolothum.thodi@huawei.com, smostafa@google.com,
ddutile@redhat.com
> From: Nicolin Chen <nicolinc@nvidia.com>
> Sent: Saturday, January 11, 2025 11:32 AM
>
> @@ -224,8 +224,10 @@ struct iommu_domain {
> phys_addr_t msi_addr);
> #endif
>
> - void *fault_data;
> - union {
> + union { /* Pointer usable by owner of the domain */
> + struct iommufd_hw_pagetable *iommufd_hwpt; /* iommufd
> */
> + };
> + union { /* Fault handler */
hmm is it better to rename it as "void *private;" and let the caller
do type conversion?
^ permalink raw reply [flat|nested] 64+ messages in thread
* RE: [PATCH RFCv2 09/13] iommufd: Add IOMMU_OPTION_SW_MSI_START/SIZE ioctls
2025-01-11 3:32 ` [PATCH RFCv2 09/13] iommufd: Add IOMMU_OPTION_SW_MSI_START/SIZE ioctls Nicolin Chen
@ 2025-01-23 10:07 ` Tian, Kevin
2025-02-03 18:36 ` Nicolin Chen
2025-01-29 13:44 ` Eric Auger
1 sibling, 1 reply; 64+ messages in thread
From: Tian, Kevin @ 2025-01-23 10:07 UTC (permalink / raw)
To: Nicolin Chen, will@kernel.org, robin.murphy@arm.com,
jgg@nvidia.com, tglx@linutronix.de, maz@kernel.org,
alex.williamson@redhat.com
Cc: joro@8bytes.org, shuah@kernel.org, Chatre, Reinette,
eric.auger@redhat.com, yebin10@huawei.com,
apatel@ventanamicro.com, shivamurthy.shastri@linutronix.de,
bhelgaas@google.com, anna-maria@linutronix.de,
yury.norov@gmail.com, nipun.gupta@amd.com, iommu@lists.linux.dev,
linux-kernel@vger.kernel.org,
linux-arm-kernel@lists.infradead.org, kvm@vger.kernel.org,
linux-kselftest@vger.kernel.org, patches@lists.linux.dev,
jean-philippe@linaro.org, mdf@kernel.org, mshavit@google.com,
shameerali.kolothum.thodi@huawei.com, smostafa@google.com,
ddutile@redhat.com
> From: Nicolin Chen <nicolinc@nvidia.com>
> Sent: Saturday, January 11, 2025 11:32 AM
>
> @@ -294,7 +294,9 @@ struct iommu_ioas_unmap {
>
> /**
> * enum iommufd_option - ioctl(IOMMU_OPTION_RLIMIT_MODE) and
> - * ioctl(IOMMU_OPTION_HUGE_PAGES)
> + * ioctl(IOMMU_OPTION_HUGE_PAGES) and
> + * ioctl(IOMMU_OPTION_SW_MSI_START) and
> + * ioctl(IOMMU_OPTION_SW_MSI_SIZE)
> * @IOMMU_OPTION_RLIMIT_MODE:
> * Change how RLIMIT_MEMLOCK accounting works. The caller must have
> privilege
> * to invoke this. Value 0 (default) is user based accounting, 1 uses process
> @@ -304,10 +306,24 @@ struct iommu_ioas_unmap {
> * iommu mappings. Value 0 disables combining, everything is mapped to
> * PAGE_SIZE. This can be useful for benchmarking. This is a per-IOAS
> * option, the object_id must be the IOAS ID.
> + * @IOMMU_OPTION_SW_MSI_START:
> + * Change the base address of the IOMMU mapping region for MSI
> doorbell(s).
> + * It must be set this before attaching a device to an IOAS/HWPT,
remove 'this'
> otherwise
> + * this option will be not effective on that IOAS/HWPT. User can
Do we want to explicitly check this instead of silently leaving it
with no effect?
> choose to
> + * let kernel pick a base address, by simply ignoring this option or setting
> + * a value 0 to IOMMU_OPTION_SW_MSI_SIZE. Global option, object_id
> must be 0
> + * @IOMMU_OPTION_SW_MSI_SIZE:
> + * Change the size of the IOMMU mapping region for MSI doorbell(s). It
> must
> + * be set this before attaching a device to an IOAS/HWPT, otherwise it
> won't
> + * be effective on that IOAS/HWPT. The value is in MB, and the minimum
> value
> + * is 1 MB. A value 0 (default) will invalidate the MSI doorbell base address
> + * value set to IOMMU_OPTION_SW_MSI_START. Global option, object_id
> must be 0
hmm, this patch neither checks the minimum value nor implements the
effect of value 0.
> iommufd_device_attach_reserved_iova(struct iommufd_device *idev,
> struct iommufd_hwpt_paging
> *hwpt_paging)
> {
> + struct iommufd_ctx *ictx = idev->ictx;
> int rc;
>
> lockdep_assert_held(&idev->igroup->lock);
>
> + /* Override it with a user-programmed SW_MSI region */
> + if (ictx->sw_msi_size && ictx->sw_msi_start != PHYS_ADDR_MAX)
> + idev->igroup->sw_msi_start = ictx->sw_msi_start;
> rc = iopt_table_enforce_dev_resv_regions(&hwpt_paging->ioas->iopt,
> idev->dev,
> &idev->igroup-
> >sw_msi_start);
what about moving the above additions into
iopt_table_enforce_dev_resv_regions(), which is all about finding
a sw_msi address and could check the user setting internally?
> diff --git a/drivers/iommu/iommufd/io_pagetable.c
> b/drivers/iommu/iommufd/io_pagetable.c
> index 8a790e597e12..5d7f5ca1eecf 100644
> --- a/drivers/iommu/iommufd/io_pagetable.c
> +++ b/drivers/iommu/iommufd/io_pagetable.c
> @@ -1446,7 +1446,9 @@ int iopt_table_enforce_dev_resv_regions(struct
> io_pagetable *iopt,
> if (sw_msi_start && resv->type == IOMMU_RESV_MSI)
> num_hw_msi++;
> if (sw_msi_start && resv->type == IOMMU_RESV_SW_MSI) {
> - *sw_msi_start = resv->start;
> + /* Bypass the driver-defined SW_MSI region, if preset
> */
> + if (*sw_msi_start == PHYS_ADDR_MAX)
> + *sw_msi_start = resv->start;
this code is not about bypassing. Instead it uses the driver-defined
region if the user hasn't set one.
^ permalink raw reply [flat|nested] 64+ messages in thread
* Re: [PATCH RFCv2 00/13] iommu: Add MSI mapping support with nested SMMU
2025-01-23 9:06 ` [PATCH RFCv2 00/13] iommu: Add MSI mapping support with nested SMMU Shameerali Kolothum Thodi
@ 2025-01-23 13:24 ` Jason Gunthorpe
2025-01-29 14:54 ` Eric Auger
0 siblings, 1 reply; 64+ messages in thread
From: Jason Gunthorpe @ 2025-01-23 13:24 UTC (permalink / raw)
To: Shameerali Kolothum Thodi
Cc: Nicolin Chen, will@kernel.org, robin.murphy@arm.com,
kevin.tian@intel.com, tglx@linutronix.de, maz@kernel.org,
alex.williamson@redhat.com, joro@8bytes.org, shuah@kernel.org,
reinette.chatre@intel.com, eric.auger@redhat.com, yebin (H),
apatel@ventanamicro.com, shivamurthy.shastri@linutronix.de,
bhelgaas@google.com, anna-maria@linutronix.de,
yury.norov@gmail.com, nipun.gupta@amd.com, iommu@lists.linux.dev,
linux-kernel@vger.kernel.org,
linux-arm-kernel@lists.infradead.org, kvm@vger.kernel.org,
linux-kselftest@vger.kernel.org, patches@lists.linux.dev,
jean-philippe@linaro.org, mdf@kernel.org, mshavit@google.com,
smostafa@google.com, ddutile@redhat.com
On Thu, Jan 23, 2025 at 09:06:49AM +0000, Shameerali Kolothum Thodi wrote:
> One confusion I have about the above text is, do we still plan to support the
> approach -1( Using RMR in Qemu)
Yes, it remains an option. The VMM would use the
IOMMU_OPTION_SW_MSI_START/SIZE ioctls to tell the kernel where it
wants to put the RMR region, and then it would send the RMR into the
VM through ACPI.
The kernel side promises that the RMR region will have a consistent
(but unpredictable!) layout of ITS pages (however many are required)
within that RMR space, regardless of what devices/domain are attached.
I would like to start with patches up to #10 for this part as it
solves two of the three problems here.
> or you are just mentioning it here because
> it is still possible to make use of that. I think from previous discussions the
> argument was to adopt a more dedicated MSI pass-through model which I
> think is approach-2 here.
The basic flow of the pass through model is shown in the last two
patches, it is not fully complete but is testable. It assumes a single
ITS page. The VM would use IOMMU_OPTION_SW_MSI_START/SIZE to put the
ITS page at the correct S2 location and then describe it in the ACPI
as an ITS page not a RMR.
The VMM will capture the MSI writes and use
VFIO_IRQ_SET_ACTION_PREPARE to convey the guest's S1 translation to
the IRQ subsystem.
The missing piece is cleaning up the ITS mapping to allow for
multiple ITS pages. I've imagined that kvm would somehow give iommufd
a FD that holds the specific ITS pages instead of the
IOMMU_OPTION_SW_MSI_START/SIZE flow.
Jason
^ permalink raw reply [flat|nested] 64+ messages in thread
* Re: [PATCH RFCv2 05/13] iommu: Turn fault_data to iommufd private pointer
2025-01-23 9:54 ` Tian, Kevin
@ 2025-01-23 13:25 ` Jason Gunthorpe
0 siblings, 0 replies; 64+ messages in thread
From: Jason Gunthorpe @ 2025-01-23 13:25 UTC (permalink / raw)
To: Tian, Kevin
Cc: Nicolin Chen, will@kernel.org, robin.murphy@arm.com,
tglx@linutronix.de, maz@kernel.org, alex.williamson@redhat.com,
joro@8bytes.org, shuah@kernel.org, Chatre, Reinette,
eric.auger@redhat.com, yebin10@huawei.com,
apatel@ventanamicro.com, shivamurthy.shastri@linutronix.de,
bhelgaas@google.com, anna-maria@linutronix.de,
yury.norov@gmail.com, nipun.gupta@amd.com, iommu@lists.linux.dev,
linux-kernel@vger.kernel.org,
linux-arm-kernel@lists.infradead.org, kvm@vger.kernel.org,
linux-kselftest@vger.kernel.org, patches@lists.linux.dev,
jean-philippe@linaro.org, mdf@kernel.org, mshavit@google.com,
shameerali.kolothum.thodi@huawei.com, smostafa@google.com,
ddutile@redhat.com
On Thu, Jan 23, 2025 at 09:54:38AM +0000, Tian, Kevin wrote:
> > From: Nicolin Chen <nicolinc@nvidia.com>
> > Sent: Saturday, January 11, 2025 11:32 AM
> >
> > @@ -224,8 +224,10 @@ struct iommu_domain {
> > phys_addr_t msi_addr);
> > #endif
> >
> > - void *fault_data;
> > - union {
> > + union { /* Pointer usable by owner of the domain */
> > + struct iommufd_hw_pagetable *iommufd_hwpt; /* iommufd
> > */
> > + };
> > + union { /* Fault handler */
>
> hmm is it better to rename it as "void *private;" and let the caller
> do type conversion?
I like the type safety, the union can hold other in-kernel users with
their proper types and this discourages drivers from inventing weird
things..
Jason
^ permalink raw reply [flat|nested] 64+ messages in thread
* Re: [PATCH RFCv2 03/13] iommu: Make iommu_dma_prepare_msi() into a generic operation
2025-01-11 3:32 ` [PATCH RFCv2 03/13] iommu: Make iommu_dma_prepare_msi() into a generic operation Nicolin Chen
@ 2025-01-23 17:10 ` Eric Auger
2025-01-23 18:16 ` Jason Gunthorpe
0 siblings, 1 reply; 64+ messages in thread
From: Eric Auger @ 2025-01-23 17:10 UTC (permalink / raw)
To: Nicolin Chen, will, robin.murphy, jgg, kevin.tian, tglx, maz,
alex.williamson
Cc: joro, shuah, reinette.chatre, yebin10, apatel,
shivamurthy.shastri, bhelgaas, anna-maria, yury.norov,
nipun.gupta, iommu, linux-kernel, linux-arm-kernel, kvm,
linux-kselftest, patches, jean-philippe, mdf, mshavit,
shameerali.kolothum.thodi, smostafa, ddutile
Hi,
On 1/11/25 4:32 AM, Nicolin Chen wrote:
> From: Jason Gunthorpe <jgg@nvidia.com>
>
> SW_MSI supports IOMMU to translate an MSI message before the MSI message
> is delivered to the interrupt controller. On such systems the iommu_domain
> must have a translation for the MSI message for interrupts to work.
>
> The IRQ subsystem will call into IOMMU to request that a physical page be
> setup to receive MSI message, and the IOMMU then sets an IOVA that maps to
> that physical page. Ultimately the IOVA is programmed into the device via
> the msi_msg.
>
> Generalize this to allow the iommu_domain owner to provide its own
> implementation of this mapping. Add a function pointer to struct
> iommu_domain to allow the domain owner to provide an implementation.
>
> Have dma-iommu supply its implementation for IOMMU_DOMAIN_DMA types during
> the iommu_get_dma_cookie() path. For IOMMU_DOMAIN_UNMANAGED types used by
> VFIO (and iommufd for now), have the same iommu_dma_sw_msi set as well in
> the iommu_get_msi_cookie() path.
>
> Hold the group mutex while in iommu_dma_prepare_msi() to ensure the domain
> doesn't change or become freed while running. Races with IRQ operations
> from VFIO and domain changes from iommufd are possible here.
this was my question in previous comments
>
> Rreplace the msi_prepare_lock with a lockdep assertion for the group mutex
Replace
> as documentation. For the dma_iommu.c each iommu_domain unique to a
is?
> group.
>
> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
> [nicolinc: move iommu_domain_set_sw_msi() from iommu_dma_init_domain() to
> iommu_dma_init_domain(); add in iommu_put_dma_cookie() an sw_msi test]
> Signed-off-by: Nicolin Chen <nicolinc@nvidia.com>
> ---
> include/linux/iommu.h | 44 ++++++++++++++++++++++++++-------------
> drivers/iommu/dma-iommu.c | 33 +++++++++++++----------------
> drivers/iommu/iommu.c | 29 ++++++++++++++++++++++++++
> 3 files changed, 73 insertions(+), 33 deletions(-)
>
> diff --git a/include/linux/iommu.h b/include/linux/iommu.h
> index 3a4215966c1b..423fdfa6b3bb 100644
> --- a/include/linux/iommu.h
> +++ b/include/linux/iommu.h
> @@ -44,6 +44,8 @@ struct iommu_dma_cookie;
> struct iommu_fault_param;
> struct iommufd_ctx;
> struct iommufd_viommu;
> +struct msi_desc;
> +struct msi_msg;
>
> #define IOMMU_FAULT_PERM_READ (1 << 0) /* read */
> #define IOMMU_FAULT_PERM_WRITE (1 << 1) /* write */
> @@ -216,6 +218,12 @@ struct iommu_domain {
> struct iommu_domain_geometry geometry;
> struct iommu_dma_cookie *iova_cookie;
> int (*iopf_handler)(struct iopf_group *group);
> +
> +#if IS_ENABLED(CONFIG_IRQ_MSI_IOMMU)
> + int (*sw_msi)(struct iommu_domain *domain, struct msi_desc *desc,
> + phys_addr_t msi_addr);
> +#endif
> +
> void *fault_data;
> union {
> struct {
> @@ -234,6 +242,16 @@ struct iommu_domain {
> };
> };
>
> +static inline void iommu_domain_set_sw_msi(
> + struct iommu_domain *domain,
> + int (*sw_msi)(struct iommu_domain *domain, struct msi_desc *desc,
> + phys_addr_t msi_addr))
> +{
> +#if IS_ENABLED(CONFIG_IRQ_MSI_IOMMU)
> + domain->sw_msi = sw_msi;
> +#endif
> +}
> +
> static inline bool iommu_is_dma_domain(struct iommu_domain *domain)
> {
> return domain->type & __IOMMU_DOMAIN_DMA_API;
> @@ -1475,6 +1493,18 @@ static inline ioasid_t iommu_alloc_global_pasid(struct device *dev)
> static inline void iommu_free_global_pasid(ioasid_t pasid) {}
> #endif /* CONFIG_IOMMU_API */
>
> +#ifdef CONFIG_IRQ_MSI_IOMMU
> +#ifdef CONFIG_IOMMU_API
> +int iommu_dma_prepare_msi(struct msi_desc *desc, phys_addr_t msi_addr);
> +#else
> +static inline int iommu_dma_prepare_msi(struct msi_desc *desc,
> + phys_addr_t msi_addr)
> +{
> + return 0;
> +}
> +#endif /* CONFIG_IOMMU_API */
> +#endif /* CONFIG_IRQ_MSI_IOMMU */
> +
> #if IS_ENABLED(CONFIG_LOCKDEP) && IS_ENABLED(CONFIG_IOMMU_API)
> void iommu_group_mutex_assert(struct device *dev);
> #else
> @@ -1508,26 +1538,12 @@ static inline void iommu_debugfs_setup(void) {}
> #endif
>
> #ifdef CONFIG_IOMMU_DMA
> -#include <linux/msi.h>
> -
> int iommu_get_msi_cookie(struct iommu_domain *domain, dma_addr_t base);
> -
> -int iommu_dma_prepare_msi(struct msi_desc *desc, phys_addr_t msi_addr);
> -
> #else /* CONFIG_IOMMU_DMA */
> -
> -struct msi_desc;
> -struct msi_msg;
> -
> static inline int iommu_get_msi_cookie(struct iommu_domain *domain, dma_addr_t base)
> {
> return -ENODEV;
> }
> -
> -static inline int iommu_dma_prepare_msi(struct msi_desc *desc, phys_addr_t msi_addr)
> -{
> - return 0;
> -}
> #endif /* CONFIG_IOMMU_DMA */
>
> /*
> diff --git a/drivers/iommu/dma-iommu.c b/drivers/iommu/dma-iommu.c
> index bf91e014d179..3b58244e6344 100644
> --- a/drivers/iommu/dma-iommu.c
> +++ b/drivers/iommu/dma-iommu.c
> @@ -24,6 +24,7 @@
> #include <linux/memremap.h>
> #include <linux/mm.h>
> #include <linux/mutex.h>
> +#include <linux/msi.h>
> #include <linux/of_iommu.h>
> #include <linux/pci.h>
> #include <linux/scatterlist.h>
> @@ -102,6 +103,9 @@ static int __init iommu_dma_forcedac_setup(char *str)
> }
> early_param("iommu.forcedac", iommu_dma_forcedac_setup);
>
> +static int iommu_dma_sw_msi(struct iommu_domain *domain, struct msi_desc *desc,
> + phys_addr_t msi_addr);
> +
> /* Number of entries per flush queue */
> #define IOVA_DEFAULT_FQ_SIZE 256
> #define IOVA_SINGLE_FQ_SIZE 32768
> @@ -398,6 +402,7 @@ int iommu_get_dma_cookie(struct iommu_domain *domain)
> return -ENOMEM;
>
> mutex_init(&domain->iova_cookie->mutex);
> + iommu_domain_set_sw_msi(domain, iommu_dma_sw_msi);
> return 0;
> }
>
> @@ -429,6 +434,7 @@ int iommu_get_msi_cookie(struct iommu_domain *domain, dma_addr_t base)
>
> cookie->msi_iova = base;
> domain->iova_cookie = cookie;
> + iommu_domain_set_sw_msi(domain, iommu_dma_sw_msi);
> return 0;
> }
> EXPORT_SYMBOL(iommu_get_msi_cookie);
> @@ -443,6 +449,9 @@ void iommu_put_dma_cookie(struct iommu_domain *domain)
> struct iommu_dma_cookie *cookie = domain->iova_cookie;
> struct iommu_dma_msi_page *msi, *tmp;
>
> + if (domain->sw_msi != iommu_dma_sw_msi)
> + return;
> +
I don't get the above check. The comment says this is also called for a
cookie prepared with iommu_get_dma_cookie(). Don't you need to do some
cleanup for the latter?
> if (!cookie)
> return;
>
> @@ -1800,33 +1809,19 @@ static struct iommu_dma_msi_page *iommu_dma_get_msi_page(struct device *dev,
> return NULL;
> }
>
> -/**
> - * iommu_dma_prepare_msi() - Map the MSI page in the IOMMU domain
> - * @desc: MSI descriptor, will store the MSI page
> - * @msi_addr: MSI target address to be mapped
> - *
> - * Return: 0 on success or negative error code if the mapping failed.
> - */
> -int iommu_dma_prepare_msi(struct msi_desc *desc, phys_addr_t msi_addr)
> +static int iommu_dma_sw_msi(struct iommu_domain *domain, struct msi_desc *desc,
> + phys_addr_t msi_addr)
> {
> struct device *dev = msi_desc_to_dev(desc);
> - struct iommu_domain *domain = iommu_get_domain_for_dev(dev);
> - struct iommu_dma_msi_page *msi_page;
> - static DEFINE_MUTEX(msi_prepare_lock); /* see below */
> + const struct iommu_dma_msi_page *msi_page;
>
> - if (!domain || !domain->iova_cookie) {
> + if (!domain->iova_cookie) {
> msi_desc_set_iommu_msi_iova(desc, 0, 0);
> return 0;
> }
>
> - /*
> - * In fact the whole prepare operation should already be serialised by
> - * irq_domain_mutex further up the callchain, but that's pretty subtle
> - * on its own, so consider this locking as failsafe documentation...
> - */
> - mutex_lock(&msi_prepare_lock);
> + iommu_group_mutex_assert(dev);
> msi_page = iommu_dma_get_msi_page(dev, msi_addr, domain);
> - mutex_unlock(&msi_prepare_lock);
> if (!msi_page)
> return -ENOMEM;
>
> diff --git a/drivers/iommu/iommu.c b/drivers/iommu/iommu.c
> index 599030e1e890..fbbbcdba8a4f 100644
> --- a/drivers/iommu/iommu.c
> +++ b/drivers/iommu/iommu.c
> @@ -3587,3 +3587,32 @@ int iommu_replace_group_handle(struct iommu_group *group,
> return ret;
> }
> EXPORT_SYMBOL_NS_GPL(iommu_replace_group_handle, "IOMMUFD_INTERNAL");
> +
> +#if IS_ENABLED(CONFIG_IRQ_MSI_IOMMU)
> +/**
> + * iommu_dma_prepare_msi() - Map the MSI page in the IOMMU domain
> + * @desc: MSI descriptor, will store the MSI page
> + * @msi_addr: MSI target address to be mapped
> + *
> + * The implementation of sw_msi() should take msi_addr and map it to
> + * an IOVA in the domain and call msi_desc_set_iommu_msi_iova() with the
> + * mapping information.
> + *
> + * Return: 0 on success or negative error code if the mapping failed.
> + */
> +int iommu_dma_prepare_msi(struct msi_desc *desc, phys_addr_t msi_addr)
> +{
> + struct device *dev = msi_desc_to_dev(desc);
> + struct iommu_group *group = dev->iommu_group;
> + int ret = 0;
> +
> + if (!group)
> + return 0;
> +
> + mutex_lock(&group->mutex);
> + if (group->domain && group->domain->sw_msi)
> + ret = group->domain->sw_msi(group->domain, desc, msi_addr);
> + mutex_unlock(&group->mutex);
> + return ret;
> +}
> +#endif /* CONFIG_IRQ_MSI_IOMMU */
Thanks
Eric
^ permalink raw reply [flat|nested] 64+ messages in thread
* Re: [PATCH RFCv2 01/13] genirq/msi: Store the IOMMU IOVA directly in msi_desc instead of iommu_cookie
2025-01-11 3:32 ` [PATCH RFCv2 01/13] genirq/msi: Store the IOMMU IOVA directly in msi_desc instead of iommu_cookie Nicolin Chen
@ 2025-01-23 17:10 ` Eric Auger
2025-01-23 18:48 ` Jason Gunthorpe
0 siblings, 1 reply; 64+ messages in thread
From: Eric Auger @ 2025-01-23 17:10 UTC (permalink / raw)
To: Nicolin Chen, will, robin.murphy, jgg, kevin.tian, tglx, maz,
alex.williamson
Cc: joro, shuah, reinette.chatre, yebin10, apatel,
shivamurthy.shastri, bhelgaas, anna-maria, yury.norov,
nipun.gupta, iommu, linux-kernel, linux-arm-kernel, kvm,
linux-kselftest, patches, jean-philippe, mdf, mshavit,
shameerali.kolothum.thodi, smostafa, ddutile
Hi Nicolin,
On 1/11/25 4:32 AM, Nicolin Chen wrote:
> From: Jason Gunthorpe <jgg@nvidia.com>
>
> All the iommu cases simply want to override the MSI page's address with
those which translate MSIs
> the IOVA that was mapped through the iommu. This doesn't need a cookie
> pointer, we just need to store the IOVA and its page size in the msi_desc.
>
> Instead provide msi_desc_set_iommu_msi_iova() which allows the IOMMU side
> to specify the IOVA that the MSI page is placed during
> iommu_dma_prepare(). This is stored in the msi_desc and then
iommu_dma_prepare_msi()
> iommu_dma_compose_msi_msg() is a simple inline that sets address_hi/lo.
>
> The next patch will correct the naming.
>
> This is done because we cannot correctly lock access to group->domain in
> the atomic context that iommu_dma_compose_msi_msg() is called under. Today
> the locking miss is tolerable because dma_iommu.c operates under an
> assumption that the domain does not change while a driver is probed.
>
> However iommufd now permits the domain to change while the driver is
> probed and VFIO userspace can create races with IRQ changes calling
> iommu_dma_prepare/compose_msi_msg() and changing/freeing the iommu_domain.
and is it safe in iommu_dma_prepare_msi()?
>
> Removing the pointer, and critically, the call to
> iommu_get_domain_for_dev() during compose resolves this race.
>
> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
> Signed-off-by: Nicolin Chen <nicolinc@nvidia.com>
> ---
> include/linux/iommu.h | 6 ------
> include/linux/msi.h | 45 +++++++++++++++++++++++----------------
> drivers/iommu/dma-iommu.c | 30 +++++---------------------
> 3 files changed, 32 insertions(+), 49 deletions(-)
>
> diff --git a/include/linux/iommu.h b/include/linux/iommu.h
> index 318d27841130..3a4215966c1b 100644
> --- a/include/linux/iommu.h
> +++ b/include/linux/iommu.h
> @@ -1513,7 +1513,6 @@ static inline void iommu_debugfs_setup(void) {}
> int iommu_get_msi_cookie(struct iommu_domain *domain, dma_addr_t base);
>
> int iommu_dma_prepare_msi(struct msi_desc *desc, phys_addr_t msi_addr);
> -void iommu_dma_compose_msi_msg(struct msi_desc *desc, struct msi_msg *msg);
>
> #else /* CONFIG_IOMMU_DMA */
>
> @@ -1529,11 +1528,6 @@ static inline int iommu_dma_prepare_msi(struct msi_desc *desc, phys_addr_t msi_a
> {
> return 0;
> }
> -
> -static inline void iommu_dma_compose_msi_msg(struct msi_desc *desc, struct msi_msg *msg)
> -{
> -}
> -
> #endif /* CONFIG_IOMMU_DMA */
>
> /*
> diff --git a/include/linux/msi.h b/include/linux/msi.h
> index b10093c4d00e..d442b4a69d56 100644
> --- a/include/linux/msi.h
> +++ b/include/linux/msi.h
> @@ -184,7 +184,8 @@ struct msi_desc {
> struct msi_msg msg;
> struct irq_affinity_desc *affinity;
> #ifdef CONFIG_IRQ_MSI_IOMMU
> - const void *iommu_cookie;
you may add kernel doc comments above
> + u64 iommu_msi_iova : 58;
> + u64 iommu_msi_page_shift : 6;
> #endif
> #ifdef CONFIG_SYSFS
> struct device_attribute *sysfs_attrs;
> @@ -285,28 +286,36 @@ struct msi_desc *msi_next_desc(struct device *dev, unsigned int domid,
>
> #define msi_desc_to_dev(desc) ((desc)->dev)
>
> -#ifdef CONFIG_IRQ_MSI_IOMMU
> -static inline const void *msi_desc_get_iommu_cookie(struct msi_desc *desc)
> -{
> - return desc->iommu_cookie;
> -}
> -
> -static inline void msi_desc_set_iommu_cookie(struct msi_desc *desc,
> - const void *iommu_cookie)
> +static inline void msi_desc_set_iommu_msi_iova(struct msi_desc *desc,
> + u64 msi_iova,
> + unsigned int page_shift)
> {
> - desc->iommu_cookie = iommu_cookie;
> -}
> -#else
> -static inline const void *msi_desc_get_iommu_cookie(struct msi_desc *desc)
> -{
> - return NULL;
> +#ifdef CONFIG_IRQ_MSI_IOMMU
> + desc->iommu_msi_iova = msi_iova >> page_shift;
> + desc->iommu_msi_page_shift = page_shift;
> +#endif
> }
>
> -static inline void msi_desc_set_iommu_cookie(struct msi_desc *desc,
> - const void *iommu_cookie)
> +/**
> + * iommu_dma_compose_msi_msg() - Apply translation to an MSI message
> + * @desc: MSI descriptor prepared by iommu_dma_prepare_msi()
> + * @msg: MSI message containing target physical address
> + */
> +static inline void iommu_dma_compose_msi_msg(struct msi_desc *desc,
> + struct msi_msg *msg)
> {
> -}
> +#ifdef CONFIG_IRQ_MSI_IOMMU
> + if (desc->iommu_msi_page_shift) {
> + u64 msi_iova = desc->iommu_msi_iova
> + << desc->iommu_msi_page_shift;
> +
> + msg->address_hi = upper_32_bits(msi_iova);
> + msg->address_lo = lower_32_bits(msi_iova) |
> + (msg->address_lo &
> + ((1 << desc->iommu_msi_page_shift) - 1));
> + }
> #endif
> +}
>
> int msi_domain_insert_msi_desc(struct device *dev, unsigned int domid,
> struct msi_desc *init_desc);
> diff --git a/drivers/iommu/dma-iommu.c b/drivers/iommu/dma-iommu.c
> index 2a9fa0c8cc00..bf91e014d179 100644
> --- a/drivers/iommu/dma-iommu.c
> +++ b/drivers/iommu/dma-iommu.c
> @@ -1815,7 +1815,7 @@ int iommu_dma_prepare_msi(struct msi_desc *desc, phys_addr_t msi_addr)
> static DEFINE_MUTEX(msi_prepare_lock); /* see below */
>
> if (!domain || !domain->iova_cookie) {
> - desc->iommu_cookie = NULL;
> + msi_desc_set_iommu_msi_iova(desc, 0, 0);
> return 0;
> }
>
> @@ -1827,33 +1827,13 @@ int iommu_dma_prepare_msi(struct msi_desc *desc, phys_addr_t msi_addr)
> mutex_lock(&msi_prepare_lock);
> msi_page = iommu_dma_get_msi_page(dev, msi_addr, domain);
> mutex_unlock(&msi_prepare_lock);
> -
> - msi_desc_set_iommu_cookie(desc, msi_page);
> -
> if (!msi_page)
> return -ENOMEM;
> - return 0;
> -}
>
> -/**
> - * iommu_dma_compose_msi_msg() - Apply translation to an MSI message
> - * @desc: MSI descriptor prepared by iommu_dma_prepare_msi()
> - * @msg: MSI message containing target physical address
> - */
> -void iommu_dma_compose_msi_msg(struct msi_desc *desc, struct msi_msg *msg)
> -{
> - struct device *dev = msi_desc_to_dev(desc);
> - const struct iommu_domain *domain = iommu_get_domain_for_dev(dev);
> - const struct iommu_dma_msi_page *msi_page;
> -
> - msi_page = msi_desc_get_iommu_cookie(desc);
> -
> - if (!domain || !domain->iova_cookie || WARN_ON(!msi_page))
> - return;
> -
> - msg->address_hi = upper_32_bits(msi_page->iova);
> - msg->address_lo &= cookie_msi_granule(domain->iova_cookie) - 1;
> - msg->address_lo += lower_32_bits(msi_page->iova);
> + msi_desc_set_iommu_msi_iova(
> + desc, msi_page->iova,
> + ilog2(cookie_msi_granule(domain->iova_cookie)));
> + return 0;
> }
>
> static int iommu_dma_init(void)
^ permalink raw reply [flat|nested] 64+ messages in thread
* Re: [PATCH RFCv2 02/13] genirq/msi: Rename iommu_dma_compose_msi_msg() to msi_msg_set_msi_addr()
2025-01-11 3:32 ` [PATCH RFCv2 02/13] genirq/msi: Rename iommu_dma_compose_msi_msg() to msi_msg_set_msi_addr() Nicolin Chen
@ 2025-01-23 17:10 ` Eric Auger
2025-01-23 18:50 ` Jason Gunthorpe
0 siblings, 1 reply; 64+ messages in thread
From: Eric Auger @ 2025-01-23 17:10 UTC (permalink / raw)
To: Nicolin Chen, will, robin.murphy, jgg, kevin.tian, tglx, maz,
alex.williamson
Cc: joro, shuah, reinette.chatre, yebin10, apatel,
shivamurthy.shastri, bhelgaas, anna-maria, yury.norov,
nipun.gupta, iommu, linux-kernel, linux-arm-kernel, kvm,
linux-kselftest, patches, jean-philippe, mdf, mshavit,
shameerali.kolothum.thodi, smostafa, ddutile
On 1/11/25 4:32 AM, Nicolin Chen wrote:
> From: Jason Gunthorpe <jgg@nvidia.com>
>
> The new function is used to take in a u64 MSI address and store it in the
> msi_msg. If the iommu has provided an alternative address then that is
> replaced instead.
>
> All callers have a tidy u64 already so this also consolidates the repeated
> low/high code into a small helper.
>
> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
> Signed-off-by: Nicolin Chen <nicolinc@nvidia.com>
> ---
> include/linux/msi.h | 18 ++++++++----------
> drivers/irqchip/irq-gic-v2m.c | 5 +----
> drivers/irqchip/irq-gic-v3-its.c | 13 +++----------
> drivers/irqchip/irq-gic-v3-mbi.c | 12 ++++--------
> drivers/irqchip/irq-ls-scfg-msi.c | 5 ++---
> 5 files changed, 18 insertions(+), 35 deletions(-)
>
> diff --git a/include/linux/msi.h b/include/linux/msi.h
> index d442b4a69d56..f6369748fc6e 100644
> --- a/include/linux/msi.h
> +++ b/include/linux/msi.h
> @@ -296,13 +296,8 @@ static inline void msi_desc_set_iommu_msi_iova(struct msi_desc *desc,
> #endif
> }
>
> -/**
> - * iommu_dma_compose_msi_msg() - Apply translation to an MSI message
> - * @desc: MSI descriptor prepared by iommu_dma_prepare_msi()
> - * @msg: MSI message containing target physical address
> - */
> -static inline void iommu_dma_compose_msi_msg(struct msi_desc *desc,
> - struct msi_msg *msg)
> +static inline void msi_msg_set_msi_addr(struct msi_desc *desc,
> + struct msi_msg *msg, u64 msi_addr)
nit: msi_msg_set_addr(ess) may be enough as the populated fields are
address_lo/hi
> {
> #ifdef CONFIG_IRQ_MSI_IOMMU
> if (desc->iommu_msi_page_shift) {
> @@ -310,11 +305,14 @@ static inline void iommu_dma_compose_msi_msg(struct msi_desc *desc,
> << desc->iommu_msi_page_shift;
>
> msg->address_hi = upper_32_bits(msi_iova);
> - msg->address_lo = lower_32_bits(msi_iova) |
> - (msg->address_lo &
> - ((1 << desc->iommu_msi_page_shift) - 1));
> + msg->address_lo =
> + lower_32_bits(msi_iova) |
> + (msi_addr & ((1 << desc->iommu_msi_page_shift) - 1));
> + return;
> }
> #endif
> + msg->address_hi = upper_32_bits(msi_addr);
> + msg->address_lo = lower_32_bits(msi_addr);
> }
>
> int msi_domain_insert_msi_desc(struct device *dev, unsigned int domid,
> diff --git a/drivers/irqchip/irq-gic-v2m.c b/drivers/irqchip/irq-gic-v2m.c
> index be35c5349986..6599c56873ad 100644
> --- a/drivers/irqchip/irq-gic-v2m.c
> +++ b/drivers/irqchip/irq-gic-v2m.c
> @@ -87,9 +87,6 @@ static void gicv2m_compose_msi_msg(struct irq_data *data, struct msi_msg *msg)
> struct v2m_data *v2m = irq_data_get_irq_chip_data(data);
> phys_addr_t addr = gicv2m_get_msi_addr(v2m, data->hwirq);
>
> - msg->address_hi = upper_32_bits(addr);
> - msg->address_lo = lower_32_bits(addr);
> -
> if (v2m->flags & GICV2M_GRAVITON_ADDRESS_ONLY)
> msg->data = 0;
> else
> @@ -97,7 +94,7 @@ static void gicv2m_compose_msi_msg(struct irq_data *data, struct msi_msg *msg)
> if (v2m->flags & GICV2M_NEEDS_SPI_OFFSET)
> msg->data -= v2m->spi_offset;
>
> - iommu_dma_compose_msi_msg(irq_data_get_msi_desc(data), msg);
> + msi_msg_set_msi_addr(irq_data_get_msi_desc(data), msg, addr);
> }
>
> static struct irq_chip gicv2m_irq_chip = {
> diff --git a/drivers/irqchip/irq-gic-v3-its.c b/drivers/irqchip/irq-gic-v3-its.c
> index 92244cfa0464..8c3ab7b471ca 100644
> --- a/drivers/irqchip/irq-gic-v3-its.c
> +++ b/drivers/irqchip/irq-gic-v3-its.c
> @@ -1809,17 +1809,10 @@ static u64 its_irq_get_msi_base(struct its_device *its_dev)
> static void its_irq_compose_msi_msg(struct irq_data *d, struct msi_msg *msg)
> {
> struct its_device *its_dev = irq_data_get_irq_chip_data(d);
> - struct its_node *its;
> - u64 addr;
> -
> - its = its_dev->its;
> - addr = its->get_msi_base(its_dev);
> -
> - msg->address_lo = lower_32_bits(addr);
> - msg->address_hi = upper_32_bits(addr);
> - msg->data = its_get_event_id(d);
>
> - iommu_dma_compose_msi_msg(irq_data_get_msi_desc(d), msg);
> + msg->data = its_get_event_id(d);
> + msi_msg_set_msi_addr(irq_data_get_msi_desc(d), msg,
> + its_dev->its->get_msi_base(its_dev));
> }
>
> static int its_irq_set_irqchip_state(struct irq_data *d,
> diff --git a/drivers/irqchip/irq-gic-v3-mbi.c b/drivers/irqchip/irq-gic-v3-mbi.c
> index 3fe870f8ee17..513479da9ee3 100644
> --- a/drivers/irqchip/irq-gic-v3-mbi.c
> +++ b/drivers/irqchip/irq-gic-v3-mbi.c
> @@ -147,22 +147,18 @@ static const struct irq_domain_ops mbi_domain_ops = {
>
> static void mbi_compose_msi_msg(struct irq_data *data, struct msi_msg *msg)
> {
> - msg[0].address_hi = upper_32_bits(mbi_phys_base + GICD_SETSPI_NSR);
> - msg[0].address_lo = lower_32_bits(mbi_phys_base + GICD_SETSPI_NSR);
> msg[0].data = data->parent_data->hwirq;
> -
> - iommu_dma_compose_msi_msg(irq_data_get_msi_desc(data), msg);
> + msi_msg_set_msi_addr(irq_data_get_msi_desc(data), &msg[0],
> + mbi_phys_base + GICD_SETSPI_NSR);
> }
>
> static void mbi_compose_mbi_msg(struct irq_data *data, struct msi_msg *msg)
> {
> mbi_compose_msi_msg(data, msg);
>
> - msg[1].address_hi = upper_32_bits(mbi_phys_base + GICD_CLRSPI_NSR);
> - msg[1].address_lo = lower_32_bits(mbi_phys_base + GICD_CLRSPI_NSR);
> msg[1].data = data->parent_data->hwirq;
> -
> - iommu_dma_compose_msi_msg(irq_data_get_msi_desc(data), &msg[1]);
> + msi_msg_set_msi_addr(irq_data_get_msi_desc(data), &msg[1],
> + mbi_phys_base + GICD_CLRSPI_NSR);
> }
>
> static bool mbi_init_dev_msi_info(struct device *dev, struct irq_domain *domain,
> diff --git a/drivers/irqchip/irq-ls-scfg-msi.c b/drivers/irqchip/irq-ls-scfg-msi.c
> index c0e1aafe468c..2ac6d89b4cb4 100644
> --- a/drivers/irqchip/irq-ls-scfg-msi.c
> +++ b/drivers/irqchip/irq-ls-scfg-msi.c
> @@ -87,8 +87,6 @@ static void ls_scfg_msi_compose_msg(struct irq_data *data, struct msi_msg *msg)
> {
> struct ls_scfg_msi *msi_data = irq_data_get_irq_chip_data(data);
>
> - msg->address_hi = upper_32_bits(msi_data->msiir_addr);
> - msg->address_lo = lower_32_bits(msi_data->msiir_addr);
> msg->data = data->hwirq;
>
> if (msi_affinity_flag) {
> @@ -98,7 +96,8 @@ static void ls_scfg_msi_compose_msg(struct irq_data *data, struct msi_msg *msg)
> msg->data |= cpumask_first(mask);
> }
>
> - iommu_dma_compose_msi_msg(irq_data_get_msi_desc(data), msg);
> + msi_msg_set_msi_addr(irq_data_get_msi_desc(data), msg,
> + msi_data->msiir_addr);
> }
>
> static int ls_scfg_msi_set_affinity(struct irq_data *irq_data,
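The bit manipulation in the new helper can be sanity-checked outside the kernel. Below is a minimal user-space model; the field and helper names mirror the patch (`address_lo`/`address_hi`, `lower_32_bits()`, the page-shift override), but this is an illustrative sketch, not the kernel code:

```c
#include <assert.h>
#include <stdint.h>

/* Toy stand-ins for struct msi_msg and the kernel's 32-bit split helpers */
struct msg { uint32_t address_lo, address_hi; };

static uint32_t lower_32_bits(uint64_t v) { return (uint32_t)v; }
static uint32_t upper_32_bits(uint64_t v) { return (uint32_t)(v >> 32); }

/*
 * Model of msi_msg_set_msi_addr(): iova_pfn/shift stand in for
 * desc->iommu_msi_iova and desc->iommu_msi_page_shift. When an IOMMU
 * IOVA is present (shift != 0), the IOVA replaces the physical address
 * but the low in-page offset bits of msi_addr are preserved.
 */
static void set_msi_addr(struct msg *m, uint64_t iova_pfn, unsigned int shift,
			 uint64_t msi_addr)
{
	if (shift) {
		uint64_t msi_iova = iova_pfn << shift;

		m->address_hi = upper_32_bits(msi_iova);
		m->address_lo = lower_32_bits(msi_iova) |
				(uint32_t)(msi_addr & ((1u << shift) - 1));
		return;
	}
	/* No IOMMU translation: program the address as-is */
	m->address_hi = upper_32_bits(msi_addr);
	m->address_lo = lower_32_bits(msi_addr);
}
```

With the addresses from the cover letter, a physical ITS doorbell at 0x20200040 and an MSI window at 0x8000000 with 4K pages would yield 0x8000040 in the message.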
^ permalink raw reply [flat|nested] 64+ messages in thread
* Re: [PATCH RFCv2 03/13] iommu: Make iommu_dma_prepare_msi() into a generic operation
2025-01-23 17:10 ` Eric Auger
@ 2025-01-23 18:16 ` Jason Gunthorpe
2025-01-29 12:29 ` Eric Auger
0 siblings, 1 reply; 64+ messages in thread
From: Jason Gunthorpe @ 2025-01-23 18:16 UTC (permalink / raw)
To: Eric Auger
Cc: Nicolin Chen, will, robin.murphy, kevin.tian, tglx, maz,
alex.williamson, joro, shuah, reinette.chatre, yebin10, apatel,
shivamurthy.shastri, bhelgaas, anna-maria, yury.norov,
nipun.gupta, iommu, linux-kernel, linux-arm-kernel, kvm,
linux-kselftest, patches, jean-philippe, mdf, mshavit,
shameerali.kolothum.thodi, smostafa, ddutile
On Thu, Jan 23, 2025 at 06:10:47PM +0100, Eric Auger wrote:
> Hi,
>
>
> On 1/11/25 4:32 AM, Nicolin Chen wrote:
> > From: Jason Gunthorpe <jgg@nvidia.com>
> >
> > SW_MSI supports IOMMU to translate an MSI message before the MSI message
> > is delivered to the interrupt controller. On such systems the iommu_domain
> > must have a translation for the MSI message for interrupts to work.
> >
> > The IRQ subsystem will call into IOMMU to request that a physical page be
> > setup to receive MSI message, and the IOMMU then sets an IOVA that maps to
> > that physical page. Ultimately the IOVA is programmed into the device via
> > the msi_msg.
> >
> > Generalize this to allow the iommu_domain owner to provide its own
> > implementation of this mapping. Add a function pointer to struct
> > iommu_domain to allow the domain owner to provide an implementation.
> >
> > Have dma-iommu supply its implementation for IOMMU_DOMAIN_DMA types during
> > the iommu_get_dma_cookie() path. For IOMMU_DOMAIN_UNMANAGED types used by
> > VFIO (and iommufd for now), have the same iommu_dma_sw_msi set as well in
> > the iommu_get_msi_cookie() path.
> >
> > Hold the group mutex while in iommu_dma_prepare_msi() to ensure the domain
> > doesn't change or become freed while running. Races with IRQ operations
> > from VFIO and domain changes from iommufd are possible here.
> this was my question in previous comments
Ah, well there is the answer :)
> > Rreplace the msi_prepare_lock with a lockdep assertion for the group mutex
> Replace
> > as documentation. For the dma_iommu.c each iommu_domain unique to a
> is?
> > group.
Yes
Replace the msi_prepare_lock with a lockdep assertion for the group mutex
as documentation. For dma-iommu.c each iommu_domain is unique to a
group.
> > @@ -443,6 +449,9 @@ void iommu_put_dma_cookie(struct iommu_domain *domain)
> > struct iommu_dma_cookie *cookie = domain->iova_cookie;
> > struct iommu_dma_msi_page *msi, *tmp;
> >
> > + if (domain->sw_msi != iommu_dma_sw_msi)
> > + return;
> > +
> I don't get the above check.
It is because of this:
void iommu_domain_free(struct iommu_domain *domain)
{
if (domain->type == IOMMU_DOMAIN_SVA)
mmdrop(domain->mm);
iommu_put_dma_cookie(domain);
iommufd may be using domain->sw_msi so iommu_put_dma_cookie() needs to
be a NOP. Also, later we move cookie into a union so it is not
reliably NULL anymore.
> The comment says this is also called for a
> cookie prepared with iommu_get_dma_cookie(). Don't you need to do some
> cleanup for this latter?
That seems OK, only two places set domain->iova_cookie:
int iommu_get_dma_cookie(struct iommu_domain *domain)
{
domain->iova_cookie = cookie_alloc(IOMMU_DMA_IOVA_COOKIE);
iommu_domain_set_sw_msi(domain, iommu_dma_sw_msi);
and
int iommu_get_msi_cookie(struct iommu_domain *domain, dma_addr_t base)
{
domain->iova_cookie = cookie;
iommu_domain_set_sw_msi(domain, iommu_dma_sw_msi);
So (domain->sw_msi == iommu_dma_sw_msi) in iommu_put_dma_cookie() for
both cases..
Jason
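The ownership check Jason describes can be modeled in isolation: iommu_put_dma_cookie() must be a NOP unless dma-iommu itself installed the sw_msi op, because iommufd may own the domain. The names below mirror the kernel's, but this is only an illustrative sketch:

```c
#include <assert.h>

/*
 * Toy model: a domain carries a sw_msi function pointer identifying its
 * owner, plus a flag standing in for the iova_cookie being freed.
 */
struct domain {
	void (*sw_msi)(void);
	int cookie_freed;
};

static void iommu_dma_sw_msi(void) { }	/* installed by dma-iommu cookies */
static void iommufd_sw_msi(void) { }	/* installed by iommufd (later patch) */

static void put_dma_cookie(struct domain *d)
{
	/* Not dma-iommu's cookie: leave the domain alone (NOP) */
	if (d->sw_msi != iommu_dma_sw_msi)
		return;
	d->cookie_freed = 1;	/* stand-in for freeing iova_cookie */
}
```

Since both iommu_get_dma_cookie() and iommu_get_msi_cookie() set sw_msi to iommu_dma_sw_msi, the guard passes in both legitimate cases and only filters out foreign owners.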
* Re: [PATCH RFCv2 01/13] genirq/msi: Store the IOMMU IOVA directly in msi_desc instead of iommu_cookie
2025-01-23 17:10 ` Eric Auger
@ 2025-01-23 18:48 ` Jason Gunthorpe
2025-01-29 12:11 ` Eric Auger
0 siblings, 1 reply; 64+ messages in thread
From: Jason Gunthorpe @ 2025-01-23 18:48 UTC (permalink / raw)
To: Eric Auger
Cc: Nicolin Chen, will, robin.murphy, kevin.tian, tglx, maz,
alex.williamson, joro, shuah, reinette.chatre, yebin10, apatel,
shivamurthy.shastri, bhelgaas, anna-maria, yury.norov,
nipun.gupta, iommu, linux-kernel, linux-arm-kernel, kvm,
linux-kselftest, patches, jean-philippe, mdf, mshavit,
shameerali.kolothum.thodi, smostafa, ddutile
On Thu, Jan 23, 2025 at 06:10:48PM +0100, Eric Auger wrote:
> > However iommufd now permits the domain to change while the driver is
> > probed and VFIO userspace can create races with IRQ changes calling
> > iommu_dma_prepare/compose_msi_msg() and changing/freeing the iommu_domain.
> and is it safe in iommu_dma_prepare_msi()?
iommu_dma_prepare_msi() takes the group mutex:
int iommu_dma_prepare_msi(struct msi_desc *desc, phys_addr_t msi_addr)
{
struct device *dev = msi_desc_to_dev(desc);
struct iommu_group *group = dev->iommu_group;
mutex_lock(&group->mutex);
if (group->domain && group->domain->sw_msi)
ret = group->domain->sw_msi(group->domain, desc, msi_addr);
Which prevents changing domain attachments during execution.
For iommufd, if the domain attachment changes immediately after
iommu_dma_prepare_msi() unlocks, then the information given to
msi_desc_set_iommu_msi_iova() is still valid on the new domain.
This is because the iommufd implementation of sw_msi keeps the same
IOVA for the same ITS page globally across all domains. Any racing
change of domain will attach a new domain with the right ITS IOVA
already mapped and populated.
It is why this series stops using the domain pointer as a cookie
inside the msi_desc, immediately after the group->mutex is unlocked
a new domain can be attached and the old domain can be freed, which
would UAF the domain pointer in the cookie.
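The invariant described above, the same ITS page always getting the same IOVA across all domains in a context, can be sketched as a context-wide lookup table. This is an illustrative model with made-up names, not the iommufd implementation:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_ITS_PAGES	8
#define MSI_IOVA_BASE	0x8000000ULL	/* ARM SW_MSI window base */
#define PAGE_SIZE	0x1000ULL

/* One table per iommufd context: ITS physical page -> stable IOVA */
struct sw_msi_map { uint64_t pa, iova; };
static struct sw_msi_map table[MAX_ITS_PAGES];
static size_t nr_pages;

/*
 * Return the IOVA for an ITS page, allocating a slot on first use.
 * Because the answer never changes, an msi_desc that recorded this
 * IOVA stays valid even if the attached domain is swapped afterwards:
 * the new domain is populated with the same IOVA -> PA mappings.
 */
static uint64_t its_page_to_iova(uint64_t pa)
{
	for (size_t i = 0; i < nr_pages; i++)
		if (table[i].pa == pa)
			return table[i].iova;	/* stable across domains */
	table[nr_pages].pa = pa;
	table[nr_pages].iova = MSI_IOVA_BASE + nr_pages * PAGE_SIZE;
	return table[nr_pages++].iova;
}
```

Storing the plain IOVA value in the msi_desc, instead of a pointer into a domain that may be freed, is what removes the use-after-free window.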
> > diff --git a/include/linux/msi.h b/include/linux/msi.h
> > index b10093c4d00e..d442b4a69d56 100644
> > --- a/include/linux/msi.h
> > +++ b/include/linux/msi.h
> > @@ -184,7 +184,8 @@ struct msi_desc {
> > struct msi_msg msg;
> > struct irq_affinity_desc *affinity;
> > #ifdef CONFIG_IRQ_MSI_IOMMU
> > - const void *iommu_cookie;
> you may add kernel doc comments above
I wondered if internal stuff was not being documented as the old
iommu_cookie didn't have a comment..
But sure:
* @iommu_msi_iova: Optional IOVA from the IOMMU to override the msi_addr.
* Only used if iommu_msi_page_shift != 0
* @iommu_msi_page_shift: Indicates how many bits of the original address
* should be preserved when using iommu_msi_iova.
Jason
* Re: [PATCH RFCv2 02/13] genirq/msi: Rename iommu_dma_compose_msi_msg() to msi_msg_set_msi_addr()
2025-01-23 17:10 ` Eric Auger
@ 2025-01-23 18:50 ` Jason Gunthorpe
2025-01-29 10:44 ` Eric Auger
0 siblings, 1 reply; 64+ messages in thread
From: Jason Gunthorpe @ 2025-01-23 18:50 UTC (permalink / raw)
To: Eric Auger
Cc: Nicolin Chen, will, robin.murphy, kevin.tian, tglx, maz,
alex.williamson, joro, shuah, reinette.chatre, yebin10, apatel,
shivamurthy.shastri, bhelgaas, anna-maria, yury.norov,
nipun.gupta, iommu, linux-kernel, linux-arm-kernel, kvm,
linux-kselftest, patches, jean-philippe, mdf, mshavit,
shameerali.kolothum.thodi, smostafa, ddutile
On Thu, Jan 23, 2025 at 06:10:54PM +0100, Eric Auger wrote:
> > -/**
> > - * iommu_dma_compose_msi_msg() - Apply translation to an MSI message
> > - * @desc: MSI descriptor prepared by iommu_dma_prepare_msi()
> > - * @msg: MSI message containing target physical address
> > - */
> > -static inline void iommu_dma_compose_msi_msg(struct msi_desc *desc,
> > - struct msi_msg *msg)
> > +static inline void msi_msg_set_msi_addr(struct msi_desc *desc,
> > + struct msi_msg *msg, u64 msi_addr)
> nit: msi_msg_set_addr(ess) may be enough as the populated fields are
> address_lo/hi
Not sure I follow, you suggest using msi_msg_set_addr() as the
function name? Seems OK
Jason
* Re: [PATCH RFCv2 07/13] iommufd: Implement sw_msi support natively
2025-01-11 3:32 ` [PATCH RFCv2 07/13] iommufd: Implement sw_msi support natively Nicolin Chen
2025-01-15 4:21 ` Yury Norov
@ 2025-01-23 19:30 ` Jason Gunthorpe
1 sibling, 0 replies; 64+ messages in thread
From: Jason Gunthorpe @ 2025-01-23 19:30 UTC (permalink / raw)
To: Nicolin Chen
Cc: will, robin.murphy, kevin.tian, tglx, maz, alex.williamson, joro,
shuah, reinette.chatre, eric.auger, yebin10, apatel,
shivamurthy.shastri, bhelgaas, anna-maria, yury.norov,
nipun.gupta, iommu, linux-kernel, linux-arm-kernel, kvm,
linux-kselftest, patches, jean-philippe, mdf, mshavit,
shameerali.kolothum.thodi, smostafa, ddutile
On Fri, Jan 10, 2025 at 07:32:23PM -0800, Nicolin Chen wrote:
> +/*
> + * FIXME: when a domain is removed any ids that are not in the union of
> + * all still attached devices should be removed.
> + */
I've been thinking about this, maybe we can just delete the comment.
It is thinking about the case where you attach a domain to device
A, then B, then detach B. If there are multiple ITS pages then B's
page will remain accessible to A.
However, A had access to B's page already and it was perfectly fine,
so why do we need to revoke it?
The logic to keep track of this is fine, so I think we can just let it
be. ITS pages populate in a lazy way, but are permanent once
populated.
Jason
* Re: [PATCH RFCv2 06/13] iommufd: Make attach_handle generic
2025-01-20 5:54 ` Nicolin Chen
@ 2025-01-24 13:31 ` Yi Liu
0 siblings, 0 replies; 64+ messages in thread
From: Yi Liu @ 2025-01-24 13:31 UTC (permalink / raw)
To: Nicolin Chen
Cc: jgg, will, robin.murphy, kevin.tian, tglx, maz, alex.williamson,
joro, shuah, reinette.chatre, eric.auger, yebin10, apatel,
shivamurthy.shastri, bhelgaas, anna-maria, yury.norov,
nipun.gupta, iommu, linux-kernel, linux-arm-kernel, kvm,
linux-kselftest, patches, jean-philippe, mdf, mshavit,
shameerali.kolothum.thodi, smostafa, ddutile
On 2025/1/20 13:54, Nicolin Chen wrote:
> On Sun, Jan 19, 2025 at 06:40:57PM +0800, Yi Liu wrote:
>> On 2025/1/19 04:32, Nicolin Chen wrote:
>>> On Sat, Jan 18, 2025 at 04:23:22PM +0800, Yi Liu wrote:
>>>> On 2025/1/11 11:32, Nicolin Chen wrote:
>>>>> +static int iommufd_hwpt_attach_device(struct iommufd_hw_pagetable *hwpt,
>>>>> + struct iommufd_device *idev)
>>>>> +{
>>>>> + struct iommufd_attach_handle *handle;
>>>>> + int rc;
>>>>> +
>>>>> + if (hwpt->fault) {
>>>>> + rc = iommufd_fault_domain_attach_dev(hwpt, idev, true);
>>>>> + if (rc)
>>>>> + return rc;
>>>>> + }
>>>>> +
>>>>> + handle = kzalloc(sizeof(*handle), GFP_KERNEL);
>>>>> + if (!handle) {
>>>>> + rc = -ENOMEM;
>>>>> + goto out_fault_detach;
>>>>> + }
>>>>> +
>>>>> + handle->idev = idev;
>>>>> + rc = iommu_attach_group_handle(hwpt->domain, idev->igroup->group,
>>>>> + &handle->handle);
>>>>> + if (rc)
>>>>> + goto out_free_handle;
>>>>> +
>>>>> + return 0;
>>>>> +
>>>>> +out_free_handle:
>>>>> + kfree(handle);
>>>>> + handle = NULL;
>>>>> +out_fault_detach:
>>>>> + if (hwpt->fault)
>>>>> + iommufd_fault_domain_detach_dev(hwpt, idev, handle, true);
>>>>> + return rc;
>>>>> +}
>>>
>>> Here the revert path passes in a handle=NULL..
>>
aha, got it. Perhaps we can allocate the handle first. In the below
thread, it is possible that a failed domain may have pending PRIs, which
would require the caller to issue the auto response. Although we are
likely to swap the order, it is still nice for the caller to be able to do it.
>>
>> https://lore.kernel.org/linux-iommu/f685daca-081a-4ede-b1e1-559009fa9ebc@intel.com/
>
> Hmm, I don't really see a point in letting the detach flow
> scan the two lists in hwpt->fault against a zeroed handle...
> which feels like a waste of CPU cycles?
I meant you may call iommufd_fault_domain_attach_dev() after allocating
the handle. Then in the error path, the handle is not zeroed when calling
iommufd_fault_domain_detach_dev(). The CPU cycles will not be wasted
if the two lists are empty, but they would be needed if the lists are not
empty. :)
> And I am not sure how that xa_insert part is related?
Maybe I failed to make it clear. That thread had discussed a case where
the PRIs may be forwarded to the hwpt before the attach succeeds, so it
is necessary to flush the PRIs in hwpt->fault. Although I would swap the
order of xa_insert() and __iommu_set_group_pasid(), it is still nice if the
iommufd side flushes the two lists in the error handling path.
Regards,
Yi Liu
* Re: [PATCH RFCv2 02/13] genirq/msi: Rename iommu_dma_compose_msi_msg() to msi_msg_set_msi_addr()
2025-01-23 18:50 ` Jason Gunthorpe
@ 2025-01-29 10:44 ` Eric Auger
0 siblings, 0 replies; 64+ messages in thread
From: Eric Auger @ 2025-01-29 10:44 UTC (permalink / raw)
To: Jason Gunthorpe
Cc: Nicolin Chen, will, robin.murphy, kevin.tian, tglx, maz,
alex.williamson, joro, shuah, reinette.chatre, yebin10, apatel,
shivamurthy.shastri, bhelgaas, anna-maria, yury.norov,
nipun.gupta, iommu, linux-kernel, linux-arm-kernel, kvm,
linux-kselftest, patches, jean-philippe, mdf, mshavit,
shameerali.kolothum.thodi, smostafa, ddutile
Hi Jason,
On 1/23/25 7:50 PM, Jason Gunthorpe wrote:
> On Thu, Jan 23, 2025 at 06:10:54PM +0100, Eric Auger wrote:
>
>>> -/**
>>> - * iommu_dma_compose_msi_msg() - Apply translation to an MSI message
>>> - * @desc: MSI descriptor prepared by iommu_dma_prepare_msi()
>>> - * @msg: MSI message containing target physical address
>>> - */
>>> -static inline void iommu_dma_compose_msi_msg(struct msi_desc *desc,
>>> - struct msi_msg *msg)
>>> +static inline void msi_msg_set_msi_addr(struct msi_desc *desc,
>>> + struct msi_msg *msg, u64 msi_addr)
>> nit: msi_msg_set_addr(ess) may be enough as the populated fields are
>> address_lo/hi
> Not sure I follow, you suggest to use msi_msg_set_addr() as the
> function name? Seems Ok
Yes my minor point was just about removing the "msi" redundancy
Eric
>
> Jason
>
* Re: [PATCH RFCv2 01/13] genirq/msi: Store the IOMMU IOVA directly in msi_desc instead of iommu_cookie
2025-01-23 18:48 ` Jason Gunthorpe
@ 2025-01-29 12:11 ` Eric Auger
0 siblings, 0 replies; 64+ messages in thread
From: Eric Auger @ 2025-01-29 12:11 UTC (permalink / raw)
To: Jason Gunthorpe
Cc: Nicolin Chen, will, robin.murphy, kevin.tian, tglx, maz,
alex.williamson, joro, shuah, reinette.chatre, yebin10, apatel,
shivamurthy.shastri, bhelgaas, anna-maria, yury.norov,
nipun.gupta, iommu, linux-kernel, linux-arm-kernel, kvm,
linux-kselftest, patches, jean-philippe, mdf, mshavit,
shameerali.kolothum.thodi, smostafa, ddutile
Hi,
On 1/23/25 7:48 PM, Jason Gunthorpe wrote:
> On Thu, Jan 23, 2025 at 06:10:48PM +0100, Eric Auger wrote:
>
>>> However iommufd now permits the domain to change while the driver is
>>> probed and VFIO userspace can create races with IRQ changes calling
>>> iommu_dma_prepare/compose_msi_msg() and changing/freeing the iommu_domain.
>> and is it safe in iommu_dma_prepare_msi()?
> iommu_dma_prepare_msi() takes the group mutex:
>
> int iommu_dma_prepare_msi(struct msi_desc *desc, phys_addr_t msi_addr)
> {
> struct device *dev = msi_desc_to_dev(desc);
> struct iommu_group *group = dev->iommu_group;
>
> mutex_lock(&group->mutex);
> if (group->domain && group->domain->sw_msi)
> ret = group->domain->sw_msi(group->domain, desc, msi_addr);
>
> Which prevents changing domain attachments during execution.
>
> For iommufd, if the domain attachment changes immediately after
> iommu_dma_prepare_msi() unlocks, then the information given to
> msi_desc_set_iommu_msi_iova() is still valid on the new domain.
>
> This is because the iommufd implementation of sw_msi keeps the same
> IOVA for the same ITS page globally across all domains. Any racing
> change of domain will attach a new domain with the right ITS IOVA
> already mapped and populated.
> It is why this series stops using the domain pointer as a cookie
> inside the msi_desc, immediately after the group->mutex is unlocked
> a new domain can be attached and the old domain can be freed, which
> would UAF the domain pointer in the cookie.
OK thank you for the clarification
>
>>> diff --git a/include/linux/msi.h b/include/linux/msi.h
>>> index b10093c4d00e..d442b4a69d56 100644
>>> --- a/include/linux/msi.h
>>> +++ b/include/linux/msi.h
>>> @@ -184,7 +184,8 @@ struct msi_desc {
>>> struct msi_msg msg;
>>> struct irq_affinity_desc *affinity;
>>> #ifdef CONFIG_IRQ_MSI_IOMMU
>>> - const void *iommu_cookie;
>> you may add kernel doc comments above
> I wondered if internal stuff was not being documented as the old
> iommu_cookie didn't have a comment..
>
> But sure:
>
> * @iommu_msi_iova: Optional IOVA from the IOMMU to overide the msi_addr.
> * Only used if iommu_msi_page_shift != 0
> * @iommu_msi_page_shift: Indicates how many bits of the original address
> * should be preserved when using iommu_msi_iova.
Sounds good
Eric
>
> Jason
>
* Re: [PATCH RFCv2 03/13] iommu: Make iommu_dma_prepare_msi() into a generic operation
2025-01-23 18:16 ` Jason Gunthorpe
@ 2025-01-29 12:29 ` Eric Auger
0 siblings, 0 replies; 64+ messages in thread
From: Eric Auger @ 2025-01-29 12:29 UTC (permalink / raw)
To: Jason Gunthorpe
Cc: Nicolin Chen, will, robin.murphy, kevin.tian, tglx, maz,
alex.williamson, joro, shuah, reinette.chatre, yebin10, apatel,
shivamurthy.shastri, bhelgaas, anna-maria, yury.norov,
nipun.gupta, iommu, linux-kernel, linux-arm-kernel, kvm,
linux-kselftest, patches, jean-philippe, mdf, mshavit,
shameerali.kolothum.thodi, smostafa, ddutile
Hi Jason,
On 1/23/25 7:16 PM, Jason Gunthorpe wrote:
> On Thu, Jan 23, 2025 at 06:10:47PM +0100, Eric Auger wrote:
>> Hi,
>>
>>
>> On 1/11/25 4:32 AM, Nicolin Chen wrote:
>>> From: Jason Gunthorpe <jgg@nvidia.com>
>>>
>>> SW_MSI supports IOMMU to translate an MSI message before the MSI message
>>> is delivered to the interrupt controller. On such systems the iommu_domain
>>> must have a translation for the MSI message for interrupts to work.
>>>
>>> The IRQ subsystem will call into IOMMU to request that a physical page be
>>> setup to receive MSI message, and the IOMMU then sets an IOVA that maps to
>>> that physical page. Ultimately the IOVA is programmed into the device via
>>> the msi_msg.
>>>
>>> Generalize this to allow the iommu_domain owner to provide its own
>>> implementation of this mapping. Add a function pointer to struct
>>> iommu_domain to allow the domain owner to provide an implementation.
>>>
>>> Have dma-iommu supply its implementation for IOMMU_DOMAIN_DMA types during
>>> the iommu_get_dma_cookie() path. For IOMMU_DOMAIN_UNMANAGED types used by
>>> VFIO (and iommufd for now), have the same iommu_dma_sw_msi set as well in
>>> the iommu_get_msi_cookie() path.
>>>
>>> Hold the group mutex while in iommu_dma_prepare_msi() to ensure the domain
>>> doesn't change or become freed while running. Races with IRQ operations
>>> from VFIO and domain changes from iommufd are possible here.
>> this was my question in previous comments
> Ah, well there is the answer :)
>
>>> Rreplace the msi_prepare_lock with a lockdep assertion for the group mutex
>> Replace
>>> as documentation. For the dma_iommu.c each iommu_domain unique to a
>> is?
>>> group.
> Yes
>
> Replace the msi_prepare_lock with a lockdep assertion for the group mutex
> as documentation. For the dmau_iommu.c each iommu_domain is unique to a
> group.
>
>>> @@ -443,6 +449,9 @@ void iommu_put_dma_cookie(struct iommu_domain *domain)
>>> struct iommu_dma_cookie *cookie = domain->iova_cookie;
>>> struct iommu_dma_msi_page *msi, *tmp;
>>>
>>> + if (domain->sw_msi != iommu_dma_sw_msi)
>>> + return;
>>> +
>> I don't get the above check.
> It is because of this:
>
> void iommu_domain_free(struct iommu_domain *domain)
> {
> if (domain->type == IOMMU_DOMAIN_SVA)
> mmdrop(domain->mm);
> iommu_put_dma_cookie(domain);
>
> iommufd may be using domain->sw_msi so iommu_put_dma_cookie() needs to
> be a NOP. Also, later we move cookie into a union so it is not
> reliably NULL anymore.
OK
>
>> The comment says this is also called for a
>> cookie prepared with iommu_get_dma_cookie(). Don't you need to do some
>> cleanup for this latter?
> That seems seems OK, only two places set domain->iova_cookie:
>
> int iommu_get_dma_cookie(struct iommu_domain *domain)
> {
> domain->iova_cookie = cookie_alloc(IOMMU_DMA_IOVA_COOKIE);
> iommu_domain_set_sw_msi(domain, iommu_dma_sw_msi);
>
> and
>
> int iommu_get_msi_cookie(struct iommu_domain *domain, dma_addr_t base)
> {
> domain->iova_cookie = cookie;
> iommu_domain_set_sw_msi(domain, iommu_dma_sw_msi);
>
> So (domain->sw_msi == iommu_dma_sw_msi) in iommu_put_dma_cookie() for
> both cases..
makes sense.
Thanks
Eric
>
> Jason
>
* Re: [PATCH RFCv2 05/13] iommu: Turn fault_data to iommufd private pointer
2025-01-11 3:32 ` [PATCH RFCv2 05/13] iommu: Turn fault_data to iommufd private pointer Nicolin Chen
2025-01-23 9:54 ` Tian, Kevin
@ 2025-01-29 12:40 ` Eric Auger
2025-02-03 17:48 ` Nicolin Chen
1 sibling, 1 reply; 64+ messages in thread
From: Eric Auger @ 2025-01-29 12:40 UTC (permalink / raw)
To: Nicolin Chen, will, robin.murphy, jgg, kevin.tian, tglx, maz,
alex.williamson
Cc: joro, shuah, reinette.chatre, yebin10, apatel,
shivamurthy.shastri, bhelgaas, anna-maria, yury.norov,
nipun.gupta, iommu, linux-kernel, linux-arm-kernel, kvm,
linux-kselftest, patches, jean-philippe, mdf, mshavit,
shameerali.kolothum.thodi, smostafa, ddutile
On 1/11/25 4:32 AM, Nicolin Chen wrote:
> A "fault_data" was added exclusively for the iommufd_fault_iopf_handler()
> used by IOPF/PRI use cases, along with the attach_handle. Now, the iommufd
> version of sw_msi function will resue the attach_handle and fault_data for
reuse
> a non-fault case.
>
> Rename "fault_data" to "iommufd_hwpt" so as not to confine it to a "fault"
> case. Move it into a union to be the iommufd private pointer. A following
> patch will move the iova_cookie to the union for dma-iommu too, after the
> iommufd_sw_msi implementation is added.
>
> Since we have two unions now, add some simple comments for readability.
>
> Signed-off-by: Nicolin Chen <nicolinc@nvidia.com>
> ---
> include/linux/iommu.h | 6 ++++--
> drivers/iommu/iommufd/fault.c | 2 +-
> drivers/iommu/iommufd/hw_pagetable.c | 2 +-
> 3 files changed, 6 insertions(+), 4 deletions(-)
>
> diff --git a/include/linux/iommu.h b/include/linux/iommu.h
> index 423fdfa6b3bb..b6526d734f30 100644
> --- a/include/linux/iommu.h
> +++ b/include/linux/iommu.h
> @@ -224,8 +224,10 @@ struct iommu_domain {
> phys_addr_t msi_addr);
> #endif
>
> - void *fault_data;
> - union {
> + union { /* Pointer usable by owner of the domain */
> + struct iommufd_hw_pagetable *iommufd_hwpt; /* iommufd */
> + };
> + union { /* Fault handler */
> struct {
> iommu_fault_handler_t handler;
> void *handler_token;
> diff --git a/drivers/iommu/iommufd/fault.c b/drivers/iommu/iommufd/fault.c
> index 1fe804e28a86..06aa83a75e94 100644
> --- a/drivers/iommu/iommufd/fault.c
> +++ b/drivers/iommu/iommufd/fault.c
> @@ -431,7 +431,7 @@ int iommufd_fault_iopf_handler(struct iopf_group *group)
> struct iommufd_hw_pagetable *hwpt;
> struct iommufd_fault *fault;
>
> - hwpt = group->attach_handle->domain->fault_data;
> + hwpt = group->attach_handle->domain->iommufd_hwpt;
> fault = hwpt->fault;
>
> mutex_lock(&fault->mutex);
> diff --git a/drivers/iommu/iommufd/hw_pagetable.c b/drivers/iommu/iommufd/hw_pagetable.c
> index ce03c3804651..f7c0d7b214b6 100644
> --- a/drivers/iommu/iommufd/hw_pagetable.c
> +++ b/drivers/iommu/iommufd/hw_pagetable.c
> @@ -402,10 +402,10 @@ int iommufd_hwpt_alloc(struct iommufd_ucmd *ucmd)
> }
> hwpt->fault = fault;
> hwpt->domain->iopf_handler = iommufd_fault_iopf_handler;
> - hwpt->domain->fault_data = hwpt;
> refcount_inc(&fault->obj.users);
> iommufd_put_object(ucmd->ictx, &fault->obj);
> }
> + hwpt->domain->iommufd_hwpt = hwpt;
don't we want to reset this somewhere on release path?
Eric
>
> cmd->out_hwpt_id = hwpt->obj.id;
> rc = iommufd_ucmd_respond(ucmd, sizeof(*cmd));
* Re: [PATCH RFCv2 06/13] iommufd: Make attach_handle generic
2025-01-11 3:32 ` [PATCH RFCv2 06/13] iommufd: Make attach_handle generic Nicolin Chen
2025-01-18 8:23 ` Yi Liu
@ 2025-01-29 13:14 ` Eric Auger
2025-02-03 18:08 ` Nicolin Chen
1 sibling, 1 reply; 64+ messages in thread
From: Eric Auger @ 2025-01-29 13:14 UTC (permalink / raw)
To: Nicolin Chen, will, robin.murphy, jgg, kevin.tian, tglx, maz,
alex.williamson
Cc: joro, shuah, reinette.chatre, yebin10, apatel,
shivamurthy.shastri, bhelgaas, anna-maria, yury.norov,
nipun.gupta, iommu, linux-kernel, linux-arm-kernel, kvm,
linux-kselftest, patches, jean-philippe, mdf, mshavit,
shameerali.kolothum.thodi, smostafa, ddutile
Hi,
On 1/11/25 4:32 AM, Nicolin Chen wrote:
> "attach_handle" was added exclusively for the iommufd_fault_iopf_handler()
> used by IOPF/PRI use cases, along with the "fault_data". Now, the iommufd
> version of sw_msi function will resue the attach_handle and fault_data for
reuse
> a non-fault case.
>
> Move the attach_handle part out of the fault.c file to make it generic for
> all cases. Simplify the remaining fault specific routine to attach/detach.
>
> Signed-off-by: Nicolin Chen <nicolinc@nvidia.com>
> ---
> drivers/iommu/iommufd/iommufd_private.h | 40 +-------
> drivers/iommu/iommufd/device.c | 105 +++++++++++++++++++++
> drivers/iommu/iommufd/fault.c | 120 +++---------------------
> 3 files changed, 122 insertions(+), 143 deletions(-)
>
> diff --git a/drivers/iommu/iommufd/iommufd_private.h b/drivers/iommu/iommufd/iommufd_private.h
> index b6d706cf2c66..063c0a42f54f 100644
> --- a/drivers/iommu/iommufd/iommufd_private.h
> +++ b/drivers/iommu/iommufd/iommufd_private.h
> @@ -472,42 +472,12 @@ void iommufd_fault_destroy(struct iommufd_object *obj);
> int iommufd_fault_iopf_handler(struct iopf_group *group);
>
> int iommufd_fault_domain_attach_dev(struct iommufd_hw_pagetable *hwpt,
> - struct iommufd_device *idev);
> + struct iommufd_device *idev,
> + bool enable_iopf);
> void iommufd_fault_domain_detach_dev(struct iommufd_hw_pagetable *hwpt,
> - struct iommufd_device *idev);
> -int iommufd_fault_domain_replace_dev(struct iommufd_device *idev,
> - struct iommufd_hw_pagetable *hwpt,
> - struct iommufd_hw_pagetable *old);
> -
> -static inline int iommufd_hwpt_attach_device(struct iommufd_hw_pagetable *hwpt,
> - struct iommufd_device *idev)
> -{
> - if (hwpt->fault)
> - return iommufd_fault_domain_attach_dev(hwpt, idev);
> -
> - return iommu_attach_group(hwpt->domain, idev->igroup->group);
> -}
> -
> -static inline void iommufd_hwpt_detach_device(struct iommufd_hw_pagetable *hwpt,
> - struct iommufd_device *idev)
> -{
> - if (hwpt->fault) {
> - iommufd_fault_domain_detach_dev(hwpt, idev);
> - return;
> - }
> -
> - iommu_detach_group(hwpt->domain, idev->igroup->group);
> -}
> -
> -static inline int iommufd_hwpt_replace_device(struct iommufd_device *idev,
> - struct iommufd_hw_pagetable *hwpt,
> - struct iommufd_hw_pagetable *old)
> -{
> - if (old->fault || hwpt->fault)
> - return iommufd_fault_domain_replace_dev(idev, hwpt, old);
> -
> - return iommu_group_replace_domain(idev->igroup->group, hwpt->domain);
> -}
> + struct iommufd_device *idev,
> + struct iommufd_attach_handle *handle,
> + bool disable_iopf);
>
> static inline struct iommufd_viommu *
> iommufd_get_viommu(struct iommufd_ucmd *ucmd, u32 id)
> diff --git a/drivers/iommu/iommufd/device.c b/drivers/iommu/iommufd/device.c
> index dfd0898fb6c1..38b31b652147 100644
> --- a/drivers/iommu/iommufd/device.c
> +++ b/drivers/iommu/iommufd/device.c
> @@ -352,6 +352,111 @@ iommufd_device_attach_reserved_iova(struct iommufd_device *idev,
> return 0;
> }
>
> +/* The device attach/detach/replace helpers for attach_handle */
> +
> +static int iommufd_hwpt_attach_device(struct iommufd_hw_pagetable *hwpt,
> + struct iommufd_device *idev)
> +{
> + struct iommufd_attach_handle *handle;
> + int rc;
> +
> + if (hwpt->fault) {
> + rc = iommufd_fault_domain_attach_dev(hwpt, idev, true);
Why don't we simply call iommufd_fault_iopf_enable(idev) here?
Also, it looks like there is a redundant check of hwpt->fault here and in
iommufd_fault_domain_attach_dev.
Besides, the addition of the enable_iopf parameter is not documented anywhere.
> + if (rc)
> + return rc;
> + }
> +
> + handle = kzalloc(sizeof(*handle), GFP_KERNEL);
> + if (!handle) {
> + rc = -ENOMEM;
> + goto out_fault_detach;
> + }
> +
> + handle->idev = idev;
> + rc = iommu_attach_group_handle(hwpt->domain, idev->igroup->group,
> + &handle->handle);
> + if (rc)
> + goto out_free_handle;
> +
> + return 0;
> +
> +out_free_handle:
> + kfree(handle);
> + handle = NULL;
> +out_fault_detach:
> + if (hwpt->fault)
> + iommufd_fault_domain_detach_dev(hwpt, idev, handle, true);
> + return rc;
> +}
> +
> +static struct iommufd_attach_handle *
> +iommufd_device_get_attach_handle(struct iommufd_device *idev)
> +{
> + struct iommu_attach_handle *handle;
> +
> + handle =
> + iommu_attach_handle_get(idev->igroup->group, IOMMU_NO_PASID, 0);
> + if (IS_ERR(handle))
> + return NULL;
> + return to_iommufd_handle(handle);
> +}
> +
> +static void iommufd_hwpt_detach_device(struct iommufd_hw_pagetable *hwpt,
> + struct iommufd_device *idev)
> +{
> + struct iommufd_attach_handle *handle;
> +
> + handle = iommufd_device_get_attach_handle(idev);
> + iommu_detach_group_handle(hwpt->domain, idev->igroup->group);
> + if (hwpt->fault)
> + iommufd_fault_domain_detach_dev(hwpt, idev, handle, true);
Same here, it is pretty difficult to understand what this
iommufd_fault_domain_detach_dev does.
To me, calling iommufd_auto_response_faults and iommufd_fault_iopf_disable directly would be more readable; otherwise, rename iommufd_fault_domain_detach_dev().
Also, compared to the original code, there is a new check on handle. Why is it necessary?
Globally, I find this patch pretty hard to read. It would be nice to split it, if possible, to ease the review process.
Thanks
Eric
> + kfree(handle);
> +}
> +
> +static int iommufd_hwpt_replace_device(struct iommufd_device *idev,
> + struct iommufd_hw_pagetable *hwpt,
> + struct iommufd_hw_pagetable *old)
> +{
> + struct iommufd_attach_handle *old_handle =
> + iommufd_device_get_attach_handle(idev);
> + struct iommufd_attach_handle *handle;
> + int rc;
> +
> + if (hwpt->fault) {
> + rc = iommufd_fault_domain_attach_dev(hwpt, idev, !old->fault);
> + if (rc)
> + return rc;
> + }
> +
> + handle = kzalloc(sizeof(*handle), GFP_KERNEL);
> + if (!handle) {
> + rc = -ENOMEM;
> + goto out_fault_detach;
> + }
> +
> + handle->idev = idev;
> + rc = iommu_replace_group_handle(idev->igroup->group, hwpt->domain,
> + &handle->handle);
> + if (rc)
> + goto out_free_handle;
> +
> + if (old->fault)
> + iommufd_fault_domain_detach_dev(old, idev, old_handle,
> + !hwpt->fault);
> + kfree(old_handle);
> +
> + return 0;
> +
> +out_free_handle:
> + kfree(handle);
> + handle = NULL;
> +out_fault_detach:
> + if (hwpt->fault)
> + iommufd_fault_domain_detach_dev(hwpt, idev, handle,
> + !old->fault);
> + return rc;
> +}
> +
> int iommufd_hw_pagetable_attach(struct iommufd_hw_pagetable *hwpt,
> struct iommufd_device *idev)
> {
> diff --git a/drivers/iommu/iommufd/fault.c b/drivers/iommu/iommufd/fault.c
> index 06aa83a75e94..1d9bd3024b57 100644
> --- a/drivers/iommu/iommufd/fault.c
> +++ b/drivers/iommu/iommufd/fault.c
> @@ -60,42 +60,17 @@ static void iommufd_fault_iopf_disable(struct iommufd_device *idev)
> mutex_unlock(&idev->iopf_lock);
> }
>
> -static int __fault_domain_attach_dev(struct iommufd_hw_pagetable *hwpt,
> - struct iommufd_device *idev)
> -{
> - struct iommufd_attach_handle *handle;
> - int ret;
> -
> - handle = kzalloc(sizeof(*handle), GFP_KERNEL);
> - if (!handle)
> - return -ENOMEM;
> -
> - handle->idev = idev;
> - ret = iommu_attach_group_handle(hwpt->domain, idev->igroup->group,
> - &handle->handle);
> - if (ret)
> - kfree(handle);
> -
> - return ret;
> -}
> -
> int iommufd_fault_domain_attach_dev(struct iommufd_hw_pagetable *hwpt,
> - struct iommufd_device *idev)
> + struct iommufd_device *idev,
> + bool enable_iopf)
> {
> - int ret;
> + int rc = 0;
>
> if (!hwpt->fault)
> return -EINVAL;
> -
> - ret = iommufd_fault_iopf_enable(idev);
> - if (ret)
> - return ret;
> -
> - ret = __fault_domain_attach_dev(hwpt, idev);
> - if (ret)
> - iommufd_fault_iopf_disable(idev);
> -
> - return ret;
> + if (enable_iopf)
> + rc = iommufd_fault_iopf_enable(idev);
> + return rc;
> }
>
> static void iommufd_auto_response_faults(struct iommufd_hw_pagetable *hwpt,
> @@ -127,86 +102,15 @@ static void iommufd_auto_response_faults(struct iommufd_hw_pagetable *hwpt,
> mutex_unlock(&fault->mutex);
> }
>
> -static struct iommufd_attach_handle *
> -iommufd_device_get_attach_handle(struct iommufd_device *idev)
> -{
> - struct iommu_attach_handle *handle;
> -
> - handle = iommu_attach_handle_get(idev->igroup->group, IOMMU_NO_PASID, 0);
> - if (IS_ERR(handle))
> - return NULL;
> -
> - return to_iommufd_handle(handle);
> -}
> -
> void iommufd_fault_domain_detach_dev(struct iommufd_hw_pagetable *hwpt,
> - struct iommufd_device *idev)
> + struct iommufd_device *idev,
> + struct iommufd_attach_handle *handle,
> + bool disable_iopf)
> {
> - struct iommufd_attach_handle *handle;
> -
> - handle = iommufd_device_get_attach_handle(idev);
> - iommu_detach_group_handle(hwpt->domain, idev->igroup->group);
> - iommufd_auto_response_faults(hwpt, handle);
> - iommufd_fault_iopf_disable(idev);
> - kfree(handle);
> -}
> -
> -static int __fault_domain_replace_dev(struct iommufd_device *idev,
> - struct iommufd_hw_pagetable *hwpt,
> - struct iommufd_hw_pagetable *old)
> -{
> - struct iommufd_attach_handle *handle, *curr = NULL;
> - int ret;
> -
> - if (old->fault)
> - curr = iommufd_device_get_attach_handle(idev);
> -
> - if (hwpt->fault) {
> - handle = kzalloc(sizeof(*handle), GFP_KERNEL);
> - if (!handle)
> - return -ENOMEM;
> -
> - handle->idev = idev;
> - ret = iommu_replace_group_handle(idev->igroup->group,
> - hwpt->domain, &handle->handle);
> - } else {
> - ret = iommu_replace_group_handle(idev->igroup->group,
> - hwpt->domain, NULL);
> - }
> -
> - if (!ret && curr) {
> - iommufd_auto_response_faults(old, curr);
> - kfree(curr);
> - }
> -
> - return ret;
> -}
> -
> -int iommufd_fault_domain_replace_dev(struct iommufd_device *idev,
> - struct iommufd_hw_pagetable *hwpt,
> - struct iommufd_hw_pagetable *old)
> -{
> - bool iopf_off = !hwpt->fault && old->fault;
> - bool iopf_on = hwpt->fault && !old->fault;
> - int ret;
> -
> - if (iopf_on) {
> - ret = iommufd_fault_iopf_enable(idev);
> - if (ret)
> - return ret;
> - }
> -
> - ret = __fault_domain_replace_dev(idev, hwpt, old);
> - if (ret) {
> - if (iopf_on)
> - iommufd_fault_iopf_disable(idev);
> - return ret;
> - }
> -
> - if (iopf_off)
> + if (handle)
> + iommufd_auto_response_faults(hwpt, handle);
> + if (disable_iopf)
> iommufd_fault_iopf_disable(idev);
> -
> - return 0;
> }
>
> void iommufd_fault_destroy(struct iommufd_object *obj)
^ permalink raw reply [flat|nested] 64+ messages in thread
* Re: [PATCH RFCv2 09/13] iommufd: Add IOMMU_OPTION_SW_MSI_START/SIZE ioctls
2025-01-11 3:32 ` [PATCH RFCv2 09/13] iommufd: Add IOMMU_OPTION_SW_MSI_START/SIZE ioctls Nicolin Chen
2025-01-23 10:07 ` Tian, Kevin
@ 2025-01-29 13:44 ` Eric Auger
2025-01-29 14:58 ` Jason Gunthorpe
1 sibling, 1 reply; 64+ messages in thread
From: Eric Auger @ 2025-01-29 13:44 UTC (permalink / raw)
To: Nicolin Chen, will, robin.murphy, jgg, kevin.tian, tglx, maz,
alex.williamson
Cc: joro, shuah, reinette.chatre, yebin10, apatel,
shivamurthy.shastri, bhelgaas, anna-maria, yury.norov,
nipun.gupta, iommu, linux-kernel, linux-arm-kernel, kvm,
linux-kselftest, patches, jean-philippe, mdf, mshavit,
shameerali.kolothum.thodi, smostafa, ddutile
Hi,
On 1/11/25 4:32 AM, Nicolin Chen wrote:
> For systems that require MSI pages to be mapped into the IOMMU translation
> the IOMMU driver provides an IOMMU_RESV_SW_MSI range, which is the default
> recommended IOVA window to place these mappings. However, there is nothing
> special about this address. And to support the RMR trick in VMM for nested
Well, at least it shall not overlap the VMM's RAM, so it was not random either.
> translation, the VMM needs to know what sw_msi window the kernel is using.
> As there is no particular reason to force VMM to adopt the kernel default,
> provide a simple IOMMU_OPTION_SW_MSI_START/SIZE ioctl that the VMM can use
> to directly specify the sw_msi window that it wants to use, which replaces
> and disables the default IOMMU_RESV_SW_MSI from the driver to avoid having
> to build an API to discover the default IOMMU_RESV_SW_MSI.
IIUC the MSI window will then be different when using legacy VFIO
assignment and iommufd backend.
MSI reserved regions are exposed in
/sys/kernel/iommu_groups/<n>/reserved_regions
0x0000000008000000 0x00000000080fffff msi
Is that configurability reflected accordingly?
How do you make sure it does not collide with other resv regions? I
don't see any check here.
>
> Since iommufd now has its own sw_msi function, this is easy to implement.
>
> To keep things simple, the parameters are global to the entire iommufd FD,
> and will directly replace the IOMMU_RESV_SW_MSI values. The VMM must set
> the values before creating any hwpt's to have any effect.
>
> Suggested-by: Jason Gunthorpe <jgg@nvidia.com>
> Signed-off-by: Nicolin Chen <nicolinc@nvidia.com>
> ---
> drivers/iommu/iommufd/iommufd_private.h | 4 +++
> include/uapi/linux/iommufd.h | 18 ++++++++++++-
> drivers/iommu/iommufd/device.c | 4 +++
> drivers/iommu/iommufd/io_pagetable.c | 4 ++-
> drivers/iommu/iommufd/ioas.c | 34 +++++++++++++++++++++++++
> drivers/iommu/iommufd/main.c | 6 +++++
> 6 files changed, 68 insertions(+), 2 deletions(-)
>
> diff --git a/drivers/iommu/iommufd/iommufd_private.h b/drivers/iommu/iommufd/iommufd_private.h
> index 3e83bbb5912c..9f071609f00b 100644
> --- a/drivers/iommu/iommufd/iommufd_private.h
> +++ b/drivers/iommu/iommufd/iommufd_private.h
> @@ -45,6 +45,9 @@ struct iommufd_ctx {
> struct mutex sw_msi_lock;
> struct list_head sw_msi_list;
> unsigned int sw_msi_id;
> + /* User-programmed SW_MSI region, to override igroup->sw_msi_start */
> + phys_addr_t sw_msi_start;
> + size_t sw_msi_size;
>
> u8 account_mode;
> /* Compatibility with VFIO no iommu */
> @@ -281,6 +284,7 @@ int iommufd_ioas_unmap(struct iommufd_ucmd *ucmd);
> int iommufd_ioas_option(struct iommufd_ucmd *ucmd);
> int iommufd_option_rlimit_mode(struct iommu_option *cmd,
> struct iommufd_ctx *ictx);
> +int iommufd_option_sw_msi(struct iommu_option *cmd, struct iommufd_ctx *ictx);
>
> int iommufd_vfio_ioas(struct iommufd_ucmd *ucmd);
> int iommufd_check_iova_range(struct io_pagetable *iopt,
> diff --git a/include/uapi/linux/iommufd.h b/include/uapi/linux/iommufd.h
> index 34810f6ae2b5..c864a201e502 100644
> --- a/include/uapi/linux/iommufd.h
> +++ b/include/uapi/linux/iommufd.h
> @@ -294,7 +294,9 @@ struct iommu_ioas_unmap {
>
> /**
> * enum iommufd_option - ioctl(IOMMU_OPTION_RLIMIT_MODE) and
> - * ioctl(IOMMU_OPTION_HUGE_PAGES)
> + * ioctl(IOMMU_OPTION_HUGE_PAGES) and
> + * ioctl(IOMMU_OPTION_SW_MSI_START) and
> + * ioctl(IOMMU_OPTION_SW_MSI_SIZE)
> * @IOMMU_OPTION_RLIMIT_MODE:
> * Change how RLIMIT_MEMLOCK accounting works. The caller must have privilege
> * to invoke this. Value 0 (default) is user based accounting, 1 uses process
> @@ -304,10 +306,24 @@ struct iommu_ioas_unmap {
> * iommu mappings. Value 0 disables combining, everything is mapped to
> * PAGE_SIZE. This can be useful for benchmarking. This is a per-IOAS
> * option, the object_id must be the IOAS ID.
> + * @IOMMU_OPTION_SW_MSI_START:
> + * Change the base address of the IOMMU mapping region for MSI doorbell(s).
> + * It must be set this before attaching a device to an IOAS/HWPT, otherwise
> + * this option will be not effective on that IOAS/HWPT. User can choose to
> + * let kernel pick a base address, by simply ignoring this option or setting
> + * a value 0 to IOMMU_OPTION_SW_MSI_SIZE. Global option, object_id must be 0
I think we should document it cannot be put at a random place either.
> + * @IOMMU_OPTION_SW_MSI_SIZE:
> + * Change the size of the IOMMU mapping region for MSI doorbell(s). It must
> + * be set this before attaching a device to an IOAS/HWPT, otherwise it won't
> + * be effective on that IOAS/HWPT. The value is in MB, and the minimum value
> + * is 1 MB. A value 0 (default) will invalidate the MSI doorbell base address
> + * value set to IOMMU_OPTION_SW_MSI_START. Global option, object_id must be 0
> */
> enum iommufd_option {
> IOMMU_OPTION_RLIMIT_MODE = 0,
> IOMMU_OPTION_HUGE_PAGES = 1,
> + IOMMU_OPTION_SW_MSI_START = 2,
> + IOMMU_OPTION_SW_MSI_SIZE = 3,
> };
>
> /**
> diff --git a/drivers/iommu/iommufd/device.c b/drivers/iommu/iommufd/device.c
> index f75b3c23cd41..093a3bd798db 100644
> --- a/drivers/iommu/iommufd/device.c
> +++ b/drivers/iommu/iommufd/device.c
> @@ -445,10 +445,14 @@ static int
> iommufd_device_attach_reserved_iova(struct iommufd_device *idev,
> struct iommufd_hwpt_paging *hwpt_paging)
> {
> + struct iommufd_ctx *ictx = idev->ictx;
> int rc;
>
> lockdep_assert_held(&idev->igroup->lock);
>
> + /* Override it with a user-programmed SW_MSI region */
> + if (ictx->sw_msi_size && ictx->sw_msi_start != PHYS_ADDR_MAX)
> + idev->igroup->sw_msi_start = ictx->sw_msi_start;
> rc = iopt_table_enforce_dev_resv_regions(&hwpt_paging->ioas->iopt,
> idev->dev,
> &idev->igroup->sw_msi_start);
> diff --git a/drivers/iommu/iommufd/io_pagetable.c b/drivers/iommu/iommufd/io_pagetable.c
> index 8a790e597e12..5d7f5ca1eecf 100644
> --- a/drivers/iommu/iommufd/io_pagetable.c
> +++ b/drivers/iommu/iommufd/io_pagetable.c
> @@ -1446,7 +1446,9 @@ int iopt_table_enforce_dev_resv_regions(struct io_pagetable *iopt,
> if (sw_msi_start && resv->type == IOMMU_RESV_MSI)
> num_hw_msi++;
> if (sw_msi_start && resv->type == IOMMU_RESV_SW_MSI) {
> - *sw_msi_start = resv->start;
> + /* Bypass the driver-defined SW_MSI region, if preset */
> + if (*sw_msi_start == PHYS_ADDR_MAX)
> + *sw_msi_start = resv->start;
> num_sw_msi++;
> }
>
> diff --git a/drivers/iommu/iommufd/ioas.c b/drivers/iommu/iommufd/ioas.c
> index 1542c5fd10a8..3f4e25b660f9 100644
> --- a/drivers/iommu/iommufd/ioas.c
> +++ b/drivers/iommu/iommufd/ioas.c
> @@ -620,6 +620,40 @@ int iommufd_option_rlimit_mode(struct iommu_option *cmd,
> return -EOPNOTSUPP;
> }
>
> +int iommufd_option_sw_msi(struct iommu_option *cmd, struct iommufd_ctx *ictx)
> +{
> + if (cmd->object_id)
> + return -EOPNOTSUPP;
> +
> + if (cmd->op == IOMMU_OPTION_OP_GET) {
> + switch (cmd->option_id) {
> + case IOMMU_OPTION_SW_MSI_START:
> + cmd->val64 = (u64)ictx->sw_msi_start;
> + break;
> + case IOMMU_OPTION_SW_MSI_SIZE:
> + cmd->val64 = (u64)ictx->sw_msi_size;
> + break;
> + default:
> + return -EOPNOTSUPP;
> + }
> + return 0;
> + }
> + if (cmd->op == IOMMU_OPTION_OP_SET) {
> + switch (cmd->option_id) {
> + case IOMMU_OPTION_SW_MSI_START:
> + ictx->sw_msi_start = (phys_addr_t)cmd->val64;
> + break;
> + case IOMMU_OPTION_SW_MSI_SIZE:
> + ictx->sw_msi_size = (size_t)cmd->val64;
> + break;
> + default:
> + return -EOPNOTSUPP;
> + }
> + return 0;
> + }
> + return -EOPNOTSUPP;
> +}
> +
> static int iommufd_ioas_option_huge_pages(struct iommu_option *cmd,
> struct iommufd_ioas *ioas)
> {
> diff --git a/drivers/iommu/iommufd/main.c b/drivers/iommu/iommufd/main.c
> index 7cc9497b7193..026297265c71 100644
> --- a/drivers/iommu/iommufd/main.c
> +++ b/drivers/iommu/iommufd/main.c
> @@ -229,6 +229,8 @@ static int iommufd_fops_open(struct inode *inode, struct file *filp)
> init_waitqueue_head(&ictx->destroy_wait);
> mutex_init(&ictx->sw_msi_lock);
> INIT_LIST_HEAD(&ictx->sw_msi_list);
> + ictx->sw_msi_start = PHYS_ADDR_MAX;
> + ictx->sw_msi_size = 0;
> filp->private_data = ictx;
> return 0;
> }
> @@ -287,6 +289,10 @@ static int iommufd_option(struct iommufd_ucmd *ucmd)
> case IOMMU_OPTION_RLIMIT_MODE:
> rc = iommufd_option_rlimit_mode(cmd, ucmd->ictx);
> break;
> + case IOMMU_OPTION_SW_MSI_START:
> + case IOMMU_OPTION_SW_MSI_SIZE:
> + rc = iommufd_option_sw_msi(cmd, ucmd->ictx);
> + break;
> case IOMMU_OPTION_HUGE_PAGES:
> rc = iommufd_ioas_option(ucmd);
> break;
Eric
* Re: [PATCH RFCv2 00/13] iommu: Add MSI mapping support with nested SMMU
2025-01-23 13:24 ` Jason Gunthorpe
@ 2025-01-29 14:54 ` Eric Auger
2025-01-29 15:04 ` Jason Gunthorpe
0 siblings, 1 reply; 64+ messages in thread
From: Eric Auger @ 2025-01-29 14:54 UTC (permalink / raw)
To: Jason Gunthorpe, Shameerali Kolothum Thodi
Cc: Nicolin Chen, will@kernel.org, robin.murphy@arm.com,
kevin.tian@intel.com, tglx@linutronix.de, maz@kernel.org,
alex.williamson@redhat.com, joro@8bytes.org, shuah@kernel.org,
reinette.chatre@intel.com, yebin (H), apatel@ventanamicro.com,
shivamurthy.shastri@linutronix.de, bhelgaas@google.com,
anna-maria@linutronix.de, yury.norov@gmail.com,
nipun.gupta@amd.com, iommu@lists.linux.dev,
linux-kernel@vger.kernel.org,
linux-arm-kernel@lists.infradead.org, kvm@vger.kernel.org,
linux-kselftest@vger.kernel.org, patches@lists.linux.dev,
jean-philippe@linaro.org, mdf@kernel.org, mshavit@google.com,
smostafa@google.com, ddutile@redhat.com
Hi Jason,
On 1/23/25 2:24 PM, Jason Gunthorpe wrote:
> On Thu, Jan 23, 2025 at 09:06:49AM +0000, Shameerali Kolothum Thodi wrote:
>
>> One confusion I have about the above text is, do we still plan to support the
>> approach-1 (using RMR in Qemu),
> Yes, it remains an option. The VMM would use the
> IOMMU_OPTION_SW_MSI_START/SIZE ioctls to tell the kernel where it
> wants to put the RMR region then it would send the RMR into the VM
> through ACPI.
>
> The kernel side promises that the RMR region will have a consistent
> (but unpredictable!) layout of ITS pages (however many are required)
> within that RMR space, regardless of what devices/domain are attached.
>
> I would like to start with patches up to #10 for this part as it
> solves two of the three problems here.
>
>> or you are just mentioning it here because
>> it is still possible to make use of that. I think from previous discussions the
>> argument was to adopt a more dedicated MSI pass-through model which I
>> think is approach-2 here.
> The basic flow of the pass through model is shown in the last two
> patches, it is not fully complete but is testable. It assumes a single
> ITS page. The VM would use IOMMU_OPTION_SW_MSI_START/SIZE to put the
> ITS page at the correct S2 location and then describe it in the ACPI
> as an ITS page not a RMR.
This is a nice-to-have feature, but it is not mandated in the first place, is it?
>
> The VMM will capture the MSI writes and use
> VFIO_IRQ_SET_ACTION_PREPARE to convey the guest's S1 translation to
> the IRQ subsystem.
>
> The missing piece is cleaning up the ITS mapping to allow for
> multiple ITS pages. I've imagined that kvm would somehow give iommufd
> an FD that holds the specific ITS pages instead of the
> IOMMU_OPTION_SW_MSI_START/SIZE flow.
That's what I don't get: at the moment you only pass the gIOVA. With
technique 2, how can you build the nested mapping, i.e.
S1 S2
gIOVA -> gDB -> hDB
without passing the full gIOVA/gDB S1 mapping to the host?
Eric
>
> Jason
>
* Re: [PATCH RFCv2 09/13] iommufd: Add IOMMU_OPTION_SW_MSI_START/SIZE ioctls
2025-01-29 13:44 ` Eric Auger
@ 2025-01-29 14:58 ` Jason Gunthorpe
2025-01-29 17:23 ` Eric Auger
2025-02-07 4:26 ` Nicolin Chen
0 siblings, 2 replies; 64+ messages in thread
From: Jason Gunthorpe @ 2025-01-29 14:58 UTC (permalink / raw)
To: Eric Auger
Cc: Nicolin Chen, will, robin.murphy, kevin.tian, tglx, maz,
alex.williamson, joro, shuah, reinette.chatre, yebin10, apatel,
shivamurthy.shastri, bhelgaas, anna-maria, yury.norov,
nipun.gupta, iommu, linux-kernel, linux-arm-kernel, kvm,
linux-kselftest, patches, jean-philippe, mdf, mshavit,
shameerali.kolothum.thodi, smostafa, ddutile
On Wed, Jan 29, 2025 at 02:44:12PM +0100, Eric Auger wrote:
> Hi,
>
>
> On 1/11/25 4:32 AM, Nicolin Chen wrote:
> > For systems that require MSI pages to be mapped into the IOMMU translation
> > the IOMMU driver provides an IOMMU_RESV_SW_MSI range, which is the default
> > recommended IOVA window to place these mappings. However, there is nothing
> > special about this address. And to support the RMR trick in VMM for nested
> well at least it shall not overlap VMM's RAM. So it was not random either.
> > translation, the VMM needs to know what sw_msi window the kernel is using.
> > As there is no particular reason to force VMM to adopt the kernel default,
> > provide a simple IOMMU_OPTION_SW_MSI_START/SIZE ioctl that the VMM can use
> > to directly specify the sw_msi window that it wants to use, which replaces
> > and disables the default IOMMU_RESV_SW_MSI from the driver to avoid having
> > to build an API to discover the default IOMMU_RESV_SW_MSI.
> IIUC the MSI window will then be different when using legacy VFIO
> assignment and iommufd backend.
? They use the same, iommufd can have userspace override it. Then it
will ignore the reserved region.
> MSI reserved regions are exposed in
> /sys/kernel/iommu_groups/<n>/reserved_regions
> 0x0000000008000000 0x00000000080fffff msi
> Is that configurability reflected accordingly?
?
Nothing using iommufd should parse that sysfs file.
> How do you make sure it does not collide with other resv regions? I
> don't see any check here.
Yes, this does need to be checked; it does look to be missing. It still needs
to create a reserved region in the ioas when attaching to keep the
areas safe, and the range has to be checked against the incoming reserved
regions from the driver.
> > + * @IOMMU_OPTION_SW_MSI_START:
> > + * Change the base address of the IOMMU mapping region for MSI doorbell(s).
> > + * It must be set this before attaching a device to an IOAS/HWPT, otherwise
> > + * this option will be not effective on that IOAS/HWPT. User can choose to
> > + * let kernel pick a base address, by simply ignoring this option or setting
> > + * a value 0 to IOMMU_OPTION_SW_MSI_SIZE. Global option, object_id must be 0
> I think we should document it cannot be put at a random place either.
It can be put at any place a map can be placed.
That also needs to be checked when creating a domain, it can't be
outside the geometry.
Jason
* Re: [PATCH RFCv2 00/13] iommu: Add MSI mapping support with nested SMMU
2025-01-29 14:54 ` Eric Auger
@ 2025-01-29 15:04 ` Jason Gunthorpe
2025-01-29 17:46 ` Eric Auger
0 siblings, 1 reply; 64+ messages in thread
From: Jason Gunthorpe @ 2025-01-29 15:04 UTC (permalink / raw)
To: Eric Auger
Cc: Shameerali Kolothum Thodi, Nicolin Chen, will@kernel.org,
robin.murphy@arm.com, kevin.tian@intel.com, tglx@linutronix.de,
maz@kernel.org, alex.williamson@redhat.com, joro@8bytes.org,
shuah@kernel.org, reinette.chatre@intel.com, yebin (H),
apatel@ventanamicro.com, shivamurthy.shastri@linutronix.de,
bhelgaas@google.com, anna-maria@linutronix.de,
yury.norov@gmail.com, nipun.gupta@amd.com, iommu@lists.linux.dev,
linux-kernel@vger.kernel.org,
linux-arm-kernel@lists.infradead.org, kvm@vger.kernel.org,
linux-kselftest@vger.kernel.org, patches@lists.linux.dev,
jean-philippe@linaro.org, mdf@kernel.org, mshavit@google.com,
smostafa@google.com, ddutile@redhat.com
On Wed, Jan 29, 2025 at 03:54:48PM +0100, Eric Auger wrote:
> >> or you are just mentioning it here because
> >> it is still possible to make use of that. I think from previous discussions the
> >> argument was to adopt a more dedicated MSI pass-through model which I
> >> think is approach-2 here.
> > The basic flow of the pass through model is shown in the last two
> > patches, it is not fully complete but is testable. It assumes a single
> > ITS page. The VM would use IOMMU_OPTION_SW_MSI_START/SIZE to put the
> > ITS page at the correct S2 location and then describe it in the ACPI
> > as an ITS page not a RMR.
> This is a nice to have feature but not mandated in the first place,
> is it?
Not mandated. It just sort of happens because of the design. IMHO
nothing should use it because there is no way for userspace to
discover how many ITS pages there may be.
> > The missing piece is cleaning up the ITS mapping to allow for
> > multiple ITS pages. I've imagined that kvm would somehow give iommufd
> > an FD that holds the specific ITS pages instead of the
> > IOMMU_OPTION_SW_MSI_START/SIZE flow.
> That's what I don't get: at the moment you only pass the gIOVA. With
> technique 2, how can you build the nested mapping, ie.
>
> S1 S2
> gIOVA -> gDB -> hDB
>
> without passing the full gIOVA/gDB S1 mapping to the host?
The nested S2 mapping is already set up before the VM boots:
- The VMM puts the ITS page (hDB) into the S2 at a fixed address (gDB)
- The ACPI tells the VM that the GIC has an ITS page at the S2's
address (hDB)
- The VM sets up its S1 with a gIOVA that points to the S2's ITS
page (gDB). The S2 already has gDB -> hDB.
- The VMM traps the gIOVA write to the MSI-X table. Both the S1 and
S2 are populated at this moment.
If you have multiple ITS pages then the ACPI has to tell the guest GIC
about them, what their gDB address is, and what devices use which ITS.
Jason
* Re: [PATCH RFCv2 09/13] iommufd: Add IOMMU_OPTION_SW_MSI_START/SIZE ioctls
2025-01-29 14:58 ` Jason Gunthorpe
@ 2025-01-29 17:23 ` Eric Auger
2025-01-29 17:39 ` Jason Gunthorpe
2025-02-07 4:26 ` Nicolin Chen
1 sibling, 1 reply; 64+ messages in thread
From: Eric Auger @ 2025-01-29 17:23 UTC (permalink / raw)
To: Jason Gunthorpe
Cc: Nicolin Chen, will, robin.murphy, kevin.tian, tglx, maz,
alex.williamson, joro, shuah, reinette.chatre, yebin10, apatel,
shivamurthy.shastri, bhelgaas, anna-maria, yury.norov,
nipun.gupta, iommu, linux-kernel, linux-arm-kernel, kvm,
linux-kselftest, patches, jean-philippe, mdf, mshavit,
shameerali.kolothum.thodi, smostafa, ddutile
On 1/29/25 3:58 PM, Jason Gunthorpe wrote:
> On Wed, Jan 29, 2025 at 02:44:12PM +0100, Eric Auger wrote:
>> Hi,
>>
>>
>> On 1/11/25 4:32 AM, Nicolin Chen wrote:
>>> For systems that require MSI pages to be mapped into the IOMMU translation
>>> the IOMMU driver provides an IOMMU_RESV_SW_MSI range, which is the default
>>> recommended IOVA window to place these mappings. However, there is nothing
>>> special about this address. And to support the RMR trick in VMM for nested
>> well at least it shall not overlap VMM's RAM. So it was not random either.
>>> translation, the VMM needs to know what sw_msi window the kernel is using.
>>> As there is no particular reason to force VMM to adopt the kernel default,
>>> provide a simple IOMMU_OPTION_SW_MSI_START/SIZE ioctl that the VMM can use
>>> to directly specify the sw_msi window that it wants to use, which replaces
>>> and disables the default IOMMU_RESV_SW_MSI from the driver to avoid having
>>> to build an API to discover the default IOMMU_RESV_SW_MSI.
>> IIUC the MSI window will then be different when using legacy VFIO
>> assignment and iommufd backend.
> ? They use the same, iommufd can have userspace override it. Then it
> will ignore the reserved region.
In current arm-smmu-v3.c you have
region = iommu_alloc_resv_region(MSI_IOVA_BASE, MSI_IOVA_LENGTH,
prot, IOMMU_RESV_SW_MSI,
GFP_KERNEL);
in arm_smmu_get_resv_regions()
If you overwrite the default region, don't you need to expose the user
defined resv region?
>
>> MSI reserved regions are exposed in
>> /sys/kernel/iommu_groups/<n>/reserved_regions
>> 0x0000000008000000 0x00000000080fffff msi
>
>> Is that configurability reflected accordingly?
> ?
>
> Nothing using iommufd should parse that sysfs file.
Right, but aren't you still supposed to populate the sysfs files
properly? This region must be carved out from the IOVA space, right?
>
>> How do you make sure it does not collide with other resv regions? I
>> don't see any check here.
> Yes this does need to be checked, it does look missing. It still needs
> to create a reserved region in the ioas when attaching to keep the
> areas safe and it has to intersect with the incoming reserved
> regions from the driver.
>
>>> + * @IOMMU_OPTION_SW_MSI_START:
>>> + * Change the base address of the IOMMU mapping region for MSI doorbell(s).
>>> + * It must be set this before attaching a device to an IOAS/HWPT, otherwise
>>> + * this option will be not effective on that IOAS/HWPT. User can choose to
>>> + * let kernel pick a base address, by simply ignoring this option or setting
>>> + * a value 0 to IOMMU_OPTION_SW_MSI_SIZE. Global option, object_id must be 0
>> I think we should document it cannot be put at a random place either.
> It can be put at any place a map can be placed.
To me, it cannot overlap with guest RAM IPA, so userspace needs to be
cautious about that.
Eric
>
> That also needs to be checked when creating a domain, it can't be
> outside the geometry.
>
> Jason
>
* Re: [PATCH RFCv2 09/13] iommufd: Add IOMMU_OPTION_SW_MSI_START/SIZE ioctls
2025-01-29 17:23 ` Eric Auger
@ 2025-01-29 17:39 ` Jason Gunthorpe
2025-01-29 17:49 ` Eric Auger
0 siblings, 1 reply; 64+ messages in thread
From: Jason Gunthorpe @ 2025-01-29 17:39 UTC (permalink / raw)
To: Eric Auger
Cc: Nicolin Chen, will, robin.murphy, kevin.tian, tglx, maz,
alex.williamson, joro, shuah, reinette.chatre, yebin10, apatel,
shivamurthy.shastri, bhelgaas, anna-maria, yury.norov,
nipun.gupta, iommu, linux-kernel, linux-arm-kernel, kvm,
linux-kselftest, patches, jean-philippe, mdf, mshavit,
shameerali.kolothum.thodi, smostafa, ddutile
On Wed, Jan 29, 2025 at 06:23:33PM +0100, Eric Auger wrote:
> >> IIUC the MSI window will then be different when using legacy VFIO
> >> assignment and iommufd backend.
> > ? They use the same, iommufd can have userspace override it. Then it
> > will ignore the reserved region.
> In current arm-smmu-v3.c you have
> region = iommu_alloc_resv_region(MSI_IOVA_BASE, MSI_IOVA_LENGTH,
> prot, IOMMU_RESV_SW_MSI,
> GFP_KERNEL);
>
> in arm_smmu_get_resv_regions()
> If you overwrite the default region, don't you need to expose the user
> defined resv region?
If it was overriden inside iommufd then the user told the kernel what
range to use to override it. I don't need to go back and report back
to userspace information that it already gave to the kernel..
> > Nothing using iommufd should parse that sysfs file.
> Right but aren't you still supposed to populate the sysfs files
> properly. This region must be carved out from the IOVA space, right?
The sysfs shouldn't be changed here based on how iommufd decides to
use the iova space. The sysfs reflects the information reported from
the driver and sw_msi should be understood as the driver's
recommendation when you view it from sysfs.
The actual reserved regions in effect for an iommufd object are
queried directly in iommufd and do not have a sysfs representation.
> >>> + * @IOMMU_OPTION_SW_MSI_START:
> >>> + * Change the base address of the IOMMU mapping region for MSI doorbell(s).
> >>> + * It must be set this before attaching a device to an IOAS/HWPT, otherwise
> >>> + * this option will be not effective on that IOAS/HWPT. User can choose to
> >>> + * let kernel pick a base address, by simply ignoring this option or setting
> >>> + * a value 0 to IOMMU_OPTION_SW_MSI_SIZE. Global option, object_id must be 0
> >> I think we should document it cannot be put at a random place either.
> > It can be put at any place a map can be placed.
> to me It cannot overlap with guest RAM IPA so userspace needs to be
> cautious about that
Yes, userspace needs to manage its own VM memory map to avoid
overlaps, but from an API perspective it can be placed anywhere that a
map can be placed.
Jason
* Re: [PATCH RFCv2 00/13] iommu: Add MSI mapping support with nested SMMU
2025-01-29 15:04 ` Jason Gunthorpe
@ 2025-01-29 17:46 ` Eric Auger
2025-01-29 20:13 ` Jason Gunthorpe
0 siblings, 1 reply; 64+ messages in thread
From: Eric Auger @ 2025-01-29 17:46 UTC (permalink / raw)
To: Jason Gunthorpe
Cc: Shameerali Kolothum Thodi, Nicolin Chen, will@kernel.org,
robin.murphy@arm.com, kevin.tian@intel.com, tglx@linutronix.de,
maz@kernel.org, alex.williamson@redhat.com, joro@8bytes.org,
shuah@kernel.org, reinette.chatre@intel.com, yebin (H),
apatel@ventanamicro.com, shivamurthy.shastri@linutronix.de,
bhelgaas@google.com, anna-maria@linutronix.de,
yury.norov@gmail.com, nipun.gupta@amd.com, iommu@lists.linux.dev,
linux-kernel@vger.kernel.org,
linux-arm-kernel@lists.infradead.org, kvm@vger.kernel.org,
linux-kselftest@vger.kernel.org, patches@lists.linux.dev,
jean-philippe@linaro.org, mdf@kernel.org, mshavit@google.com,
smostafa@google.com, ddutile@redhat.com
On 1/29/25 4:04 PM, Jason Gunthorpe wrote:
> On Wed, Jan 29, 2025 at 03:54:48PM +0100, Eric Auger wrote:
>>>> or you are just mentioning it here because
>>>> it is still possible to make use of that. I think from previous discussions the
>>>> argument was to adopt a more dedicated MSI pass-through model which I
>>>> think is approach-2 here.
>>> The basic flow of the pass through model is shown in the last two
>>> patches, it is not fully complete but is testable. It assumes a single
>>> ITS page. The VM would use IOMMU_OPTION_SW_MSI_START/SIZE to put the
>>> ITS page at the correct S2 location and then describe it in the ACPI
>>> as an ITS page not a RMR.
>> This is a nice to have feature but not mandated in the first place,
>> is it?
> Not mandated. It just sort of happens because of the design. IMHO
> nothing should use it because there is no way for userspace to
> discover how many ITS pages there may be.
>
> >>> This missing piece is cleaning up the ITS mapping to allow for
> >>> multiple ITS pages. I've imagined that kvm would somehow give iommufd
>>> a FD that holds the specific ITS pages instead of the
>>> IOMMU_OPTION_SW_MSI_START/SIZE flow.
>> That's what I don't get: at the moment you only pass the gIOVA. With
>> technique 2, how can you build the nested mapping, ie.
>>
>> S1 S2
>> gIOVA -> gDB -> hDB
>>
>> without passing the full gIOVA/gDB S1 mapping to the host?
> The nested S2 mapping is already setup before the VM boots:
>
> - The VMM puts the ITS page (hDB) into the S2 at a fixed address (gDB)
Ah OK. Your gDB has nothing to do with the actual S1 guest gDB, right?
It is computed in iommufd_sw_msi_get_map() from the sw_msi_start pool.
Is that correct? In
https://lore.kernel.org/all/20210411111228.14386-9-eric.auger@redhat.com/
I was passing both the gIOVA and the "true" gDB
Eric
> - The ACPI tells the VM that the GIC has an ITS page at the S2's
> address (hDB)
> - The VM sets up its S1 with a gIOVA that points to the S2's ITS
> page (gDB). The S2 already has gDB -> hDB.
> - The VMM traps the gIOVA write to the MSI-X table. Both the S1 and
> S2 are populated at this moment.
>
> If you have multiple ITS pages then the ACPI has to tell the guest GIC
> about them, what their gDB address is, and what devices use which ITS.
>
> Jason
>
* Re: [PATCH RFCv2 09/13] iommufd: Add IOMMU_OPTION_SW_MSI_START/SIZE ioctls
2025-01-29 17:39 ` Jason Gunthorpe
@ 2025-01-29 17:49 ` Eric Auger
2025-01-29 20:15 ` Jason Gunthorpe
0 siblings, 1 reply; 64+ messages in thread
From: Eric Auger @ 2025-01-29 17:49 UTC (permalink / raw)
To: Jason Gunthorpe
Cc: Nicolin Chen, will, robin.murphy, kevin.tian, tglx, maz,
alex.williamson, joro, shuah, reinette.chatre, yebin10, apatel,
shivamurthy.shastri, bhelgaas, anna-maria, yury.norov,
nipun.gupta, iommu, linux-kernel, linux-arm-kernel, kvm,
linux-kselftest, patches, jean-philippe, mdf, mshavit,
shameerali.kolothum.thodi, smostafa, ddutile
On 1/29/25 6:39 PM, Jason Gunthorpe wrote:
> On Wed, Jan 29, 2025 at 06:23:33PM +0100, Eric Auger wrote:
>>>> IIUC the MSI window will then be different when using legacy VFIO
>>>> assignment and iommufd backend.
>>> ? They use the same, iommufd can have userspace override it. Then it
>>> will ignore the reserved region.
>> In current arm-smmu-v3.c you have
>> region = iommu_alloc_resv_region(MSI_IOVA_BASE, MSI_IOVA_LENGTH,
>> prot, IOMMU_RESV_SW_MSI,
>> GFP_KERNEL);
>>
>> in arm_smmu_get_resv_regions()
>> If you overwrite the default region, don't you need to expose the user
>> defined resv region?
> If it was overridden inside iommufd then the user told the kernel what
> range to use to override it. I don't need to go back and report back
> to userspace information that it already gave to the kernel.
Looks strange to me because info exposed in sysfs is wrong then. What if
someone else relies on this info, either at kernel level through the
get_resv_regions callback or from user space.
>
>>> Nothing using iommufd should parse that sysfs file.
>> Right but aren't you still supposed to populate the sysfs files
>> properly. This region must be carved out from the IOVA space, right?
> The sysfs shouldn't be changed here based on how iommufd decides to
> use the iova space. The sysfs reflects the information reported from
> the driver and sw_msi should be understood as the driver's
> recommendation when you view it from sysfs.
>
> The actual reserved regions in effect for an iommufd object are
> queried directly in iommufd and do not have a sysfs representation.
>
>>>>> + * @IOMMU_OPTION_SW_MSI_START:
>>>>> + * Change the base address of the IOMMU mapping region for MSI doorbell(s).
>>>>> + * It must be set this before attaching a device to an IOAS/HWPT, otherwise
>>>>> + * this option will be not effective on that IOAS/HWPT. User can choose to
>>>>> + * let kernel pick a base address, by simply ignoring this option or setting
>>>>> + * a value 0 to IOMMU_OPTION_SW_MSI_SIZE. Global option, object_id must be 0
>>>> I think we should document it cannot be put at a random place either.
>>> It can be put at any place a map can be placed.
>> To me, it cannot overlap with guest RAM IPA, so userspace needs to be
>> cautious about that
> Yes, userspace needs to manage its own VM memory map to avoid
> overlaps, but from an API perspective it can be placed anywhere that a
> map can be placed.
OK
Eric
>
> Jason
>
* Re: [PATCH RFCv2 00/13] iommu: Add MSI mapping support with nested SMMU
2025-01-29 17:46 ` Eric Auger
@ 2025-01-29 20:13 ` Jason Gunthorpe
2025-02-04 12:55 ` Eric Auger
0 siblings, 1 reply; 64+ messages in thread
From: Jason Gunthorpe @ 2025-01-29 20:13 UTC (permalink / raw)
To: Eric Auger
Cc: Shameerali Kolothum Thodi, Nicolin Chen, will@kernel.org,
robin.murphy@arm.com, kevin.tian@intel.com, tglx@linutronix.de,
maz@kernel.org, alex.williamson@redhat.com, joro@8bytes.org,
shuah@kernel.org, reinette.chatre@intel.com, yebin (H),
apatel@ventanamicro.com, shivamurthy.shastri@linutronix.de,
bhelgaas@google.com, anna-maria@linutronix.de,
yury.norov@gmail.com, nipun.gupta@amd.com, iommu@lists.linux.dev,
linux-kernel@vger.kernel.org,
linux-arm-kernel@lists.infradead.org, kvm@vger.kernel.org,
linux-kselftest@vger.kernel.org, patches@lists.linux.dev,
jean-philippe@linaro.org, mdf@kernel.org, mshavit@google.com,
smostafa@google.com, ddutile@redhat.com
On Wed, Jan 29, 2025 at 06:46:20PM +0100, Eric Auger wrote:
> >>> This missing piece is cleaning up the ITS mapping to allow for
> >>> multiple ITS pages. I've imagined that kvm would somehow give iommufd
> >>> a FD that holds the specific ITS pages instead of the
> >>> IOMMU_OPTION_SW_MSI_START/SIZE flow.
> >> That's what I don't get: at the moment you only pass the gIOVA. With
> >> technique 2, how can you build the nested mapping, ie.
> >>
> >> S1 S2
> >> gIOVA -> gDB -> hDB
> >>
> >> without passing the full gIOVA/gDB S1 mapping to the host?
> > The nested S2 mapping is already setup before the VM boots:
> >
> > - The VMM puts the ITS page (hDB) into the S2 at a fixed address (gDB)
> Ah OK. Your gDB has nothing to do with the actual S1 guest gDB,
> right?
I'm not totally sure what you mean by gDB? The above diagram suggests
it is the ITS page address in the S2? Ie the guest physical address of
the ITS.
Within the VM, when it goes to call iommu_dma_prepare_msi(), it will
provide the gDB address as the phys_addr_t msi_addr.
This happens because the GIC driver will have been informed of the ITS
page at the gDB address, and it will use
iommu_dma_prepare_msi(). Exactly the same as bare metal.
> It is computed in iommufd_sw_msi_get_map() from the sw_msi_start pool.
> Is that correct?
Yes, for a single ITS page it will reliably be put at sw_msi_start.
Since the VMM can provide sw_msi_start through the OPTION, the VMM can
place the ITS page where it wants and then program the ACPI to tell
the VM to call iommu_dma_prepare_msi(). (don't use this flow, it
doesn't work for multi ITS, for testing only)
> https://lore.kernel.org/all/20210411111228.14386-9-eric.auger@redhat.com/
> I was passing both the gIOVA and the "true" gDB
> Eric
If I understand this right, it still had the hypervisor dynamically
setting up the S2, here it is pre-set and static?
Jason
* Re: [PATCH RFCv2 09/13] iommufd: Add IOMMU_OPTION_SW_MSI_START/SIZE ioctls
2025-01-29 17:49 ` Eric Auger
@ 2025-01-29 20:15 ` Jason Gunthorpe
0 siblings, 0 replies; 64+ messages in thread
From: Jason Gunthorpe @ 2025-01-29 20:15 UTC (permalink / raw)
To: Eric Auger
Cc: Nicolin Chen, will, robin.murphy, kevin.tian, tglx, maz,
alex.williamson, joro, shuah, reinette.chatre, yebin10, apatel,
shivamurthy.shastri, bhelgaas, anna-maria, yury.norov,
nipun.gupta, iommu, linux-kernel, linux-arm-kernel, kvm,
linux-kselftest, patches, jean-philippe, mdf, mshavit,
shameerali.kolothum.thodi, smostafa, ddutile
On Wed, Jan 29, 2025 at 06:49:22PM +0100, Eric Auger wrote:
> > If it was overridden inside iommufd then the user told the kernel what
> > range to use to override it. I don't need to go back and report back
> > to userspace information that it already gave to the kernel.
>
> Looks strange to me because info exposed in sysfs is wrong then. What if
> someone else relies on this info, either at kernel level through the
> get_resv_regions callback or from user space.
Nothing else should call get_resv_regions() because VFIO is bound to
the device and iommufd owns the domain. We expect some exclusivity
here :)
sysfs for sw_msi should be understood as reporting the driver
recommendation, not anything to do with the current MSI operation of
the device.
Jason
* Re: [PATCH RFCv2 05/13] iommu: Turn fault_data to iommufd private pointer
2025-01-29 12:40 ` Eric Auger
@ 2025-02-03 17:48 ` Nicolin Chen
0 siblings, 0 replies; 64+ messages in thread
From: Nicolin Chen @ 2025-02-03 17:48 UTC (permalink / raw)
To: Eric Auger
Cc: will, robin.murphy, jgg, kevin.tian, tglx, maz, alex.williamson,
joro, shuah, reinette.chatre, yebin10, apatel,
shivamurthy.shastri, bhelgaas, anna-maria, yury.norov,
nipun.gupta, iommu, linux-kernel, linux-arm-kernel, kvm,
linux-kselftest, patches, jean-philippe, mdf, mshavit,
shameerali.kolothum.thodi, smostafa, ddutile
Hi Eric,
On Wed, Jan 29, 2025 at 01:40:54PM +0100, Eric Auger wrote:
> On 1/11/25 4:32 AM, Nicolin Chen wrote:
> > A "fault_data" was added exclusively for the iommufd_fault_iopf_handler()
> > used by IOPF/PRI use cases, along with the attach_handle. Now, the iommufd
> > version of sw_msi function will resue the attach_handle and fault_data for
> reuse
Ack.
> > diff --git a/drivers/iommu/iommufd/hw_pagetable.c b/drivers/iommu/iommufd/hw_pagetable.c
> > index ce03c3804651..f7c0d7b214b6 100644
> > --- a/drivers/iommu/iommufd/hw_pagetable.c
> > +++ b/drivers/iommu/iommufd/hw_pagetable.c
> > @@ -402,10 +402,10 @@ int iommufd_hwpt_alloc(struct iommufd_ucmd *ucmd)
> > }
> > hwpt->fault = fault;
> > hwpt->domain->iopf_handler = iommufd_fault_iopf_handler;
> > - hwpt->domain->fault_data = hwpt;
> > refcount_inc(&fault->obj.users);
> > iommufd_put_object(ucmd->ictx, &fault->obj);
> > }
> > + hwpt->domain->iommufd_hwpt = hwpt;
> don't we want to reset this somewhere on release path?
We do iommu_domain_free() entirely on HWPT's release path.
This basically sets the domain's "owner data" as Jason remarked:
https://lore.kernel.org/linux-iommu/20250113164037.GO5556@nvidia.com/
Thanks
Nicolin
* Re: [PATCH RFCv2 06/13] iommufd: Make attach_handle generic
2025-01-29 13:14 ` Eric Auger
@ 2025-02-03 18:08 ` Nicolin Chen
0 siblings, 0 replies; 64+ messages in thread
From: Nicolin Chen @ 2025-02-03 18:08 UTC (permalink / raw)
To: Eric Auger
Cc: will, robin.murphy, jgg, kevin.tian, tglx, maz, alex.williamson,
joro, shuah, reinette.chatre, yebin10, apatel,
shivamurthy.shastri, bhelgaas, anna-maria, yury.norov,
nipun.gupta, iommu, linux-kernel, linux-arm-kernel, kvm,
linux-kselftest, patches, jean-philippe, mdf, mshavit,
shameerali.kolothum.thodi, smostafa, ddutile
On Wed, Jan 29, 2025 at 02:14:47PM +0100, Eric Auger wrote:
> > @@ -352,6 +352,111 @@ iommufd_device_attach_reserved_iova(struct iommufd_device *idev,
> > return 0;
> > }
> >
> > +/* The device attach/detach/replace helpers for attach_handle */
> > +
> > +static int iommufd_hwpt_attach_device(struct iommufd_hw_pagetable *hwpt,
> > + struct iommufd_device *idev)
> > +{
> > + struct iommufd_attach_handle *handle;
> > + int rc;
> > +
> > + if (hwpt->fault) {
> > + rc = iommufd_fault_domain_attach_dev(hwpt, idev, true);
> why don't we simply call iommufd_fault_iopf_enable(idev)
> also it looks like there is a redundant check of hwpt->fault here and in
>
> iommufd_fault_domain_attach_dev
>
> Besides, the addition of the enable_iopf param is not documented anywhere
OK. I will try unwrapping that.
> > +static void iommufd_hwpt_detach_device(struct iommufd_hw_pagetable *hwpt,
> > + struct iommufd_device *idev)
> > +{
> > + struct iommufd_attach_handle *handle;
> > +
> > + handle = iommufd_device_get_attach_handle(idev);
> > + iommu_detach_group_handle(hwpt->domain, idev->igroup->group);
> > + if (hwpt->fault)
> > + iommufd_fault_domain_detach_dev(hwpt, idev, handle, true);
> same here, pretty difficult to understand what this
>
> iommufd_fault_domain_detach_dev does
> To me calling iommufd_auto_response_faults and iommufd_fault_iopf_disable would be more readable or rename iommufd_fault_domain_detach_dev().
> Also compared to the original code,
This is basically a cleanup call for the fault specific items as
the patch's commit message describes. And you read it correct..
I will see what I can do with the naming.
> there is a new check on handle. Why is it necessary?
It was to avoid the error path that has a handle=NULL entering the
auto response function. We can change that a bit to drop the check
to make it slightly clearer, though it would waste some extra CPU
cycles on scanning the two fault lists against an empty handle.
> Globally I feel that patch is pretty hard to read. Would be nice to split it if possible to ease the review process.
This patch is needed by both this series and Yi's PASID series too,
so I was planning to send it individually. I will see what I can do
to make it easy to read.
Thanks
Nicolin
* Re: [PATCH RFCv2 09/13] iommufd: Add IOMMU_OPTION_SW_MSI_START/SIZE ioctls
2025-01-23 10:07 ` Tian, Kevin
@ 2025-02-03 18:36 ` Nicolin Chen
0 siblings, 0 replies; 64+ messages in thread
From: Nicolin Chen @ 2025-02-03 18:36 UTC (permalink / raw)
To: Tian, Kevin
Cc: will@kernel.org, robin.murphy@arm.com, jgg@nvidia.com,
tglx@linutronix.de, maz@kernel.org, alex.williamson@redhat.com,
joro@8bytes.org, shuah@kernel.org, Chatre, Reinette,
eric.auger@redhat.com, yebin10@huawei.com,
apatel@ventanamicro.com, shivamurthy.shastri@linutronix.de,
bhelgaas@google.com, anna-maria@linutronix.de,
yury.norov@gmail.com, nipun.gupta@amd.com, iommu@lists.linux.dev,
linux-kernel@vger.kernel.org,
linux-arm-kernel@lists.infradead.org, kvm@vger.kernel.org,
linux-kselftest@vger.kernel.org, patches@lists.linux.dev,
jean-philippe@linaro.org, mdf@kernel.org, mshavit@google.com,
shameerali.kolothum.thodi@huawei.com, smostafa@google.com,
ddutile@redhat.com
On Thu, Jan 23, 2025 at 10:07:13AM +0000, Tian, Kevin wrote:
> > From: Nicolin Chen <nicolinc@nvidia.com>
> > Sent: Saturday, January 11, 2025 11:32 AM
> >
> > @@ -294,7 +294,9 @@ struct iommu_ioas_unmap {
> >
> > /**
> > * enum iommufd_option - ioctl(IOMMU_OPTION_RLIMIT_MODE) and
> > - * ioctl(IOMMU_OPTION_HUGE_PAGES)
> > + * ioctl(IOMMU_OPTION_HUGE_PAGES) and
> > + * ioctl(IOMMU_OPTION_SW_MSI_START) and
> > + * ioctl(IOMMU_OPTION_SW_MSI_SIZE)
> > * @IOMMU_OPTION_RLIMIT_MODE:
> > * Change how RLIMIT_MEMLOCK accounting works. The caller must have
> > privilege
> > * to invoke this. Value 0 (default) is user based accounting, 1 uses process
> > @@ -304,10 +306,24 @@ struct iommu_ioas_unmap {
> > * iommu mappings. Value 0 disables combining, everything is mapped to
> > * PAGE_SIZE. This can be useful for benchmarking. This is a per-IOAS
> > * option, the object_id must be the IOAS ID.
> > + * @IOMMU_OPTION_SW_MSI_START:
> > + * Change the base address of the IOMMU mapping region for MSI
> > doorbell(s).
> > + * It must be set this before attaching a device to an IOAS/HWPT,
>
> remove 'this'
Ack.
> > otherwise
> > + * this option will be not effective on that IOAS/HWPT. User can
>
> Do we want to explicitly check this instead of leaving it with no effect
> silently?
So, the idea here is:
If this option is unset, use the default SW_MSI from the driver
If this option is set, use it over the default SW_MSI from the driver
That's what the following statement "User can choose to let.." means.
> > choose to
> > + * let kernel pick a base address, by simply ignoring this option or setting
> > + * a value 0 to IOMMU_OPTION_SW_MSI_SIZE. Global option, object_id
> > must be 0
> > + * @IOMMU_OPTION_SW_MSI_SIZE:
> > + * Change the size of the IOMMU mapping region for MSI doorbell(s). It
> > must
> > + * be set this before attaching a device to an IOAS/HWPT, otherwise it
> > won't
> > + * be effective on that IOAS/HWPT. The value is in MB, and the minimum
> > value
> > + * is 1 MB. A value 0 (default) will invalidate the MSI doorbell base address
> > + * value set to IOMMU_OPTION_SW_MSI_START. Global option, object_id
> > must be 0
>
> hmm there is no check on the minimal value and enable the effect
> of value 0 in this patch.
Well, it's somewhat enforced by __aligned_u64 since it can't be any
value between 0 (disable) and 1 (minimal)?
And the override code checks "ctx->sw_msi_size".
> > iommufd_device_attach_reserved_iova(struct iommufd_device *idev,
> > struct iommufd_hwpt_paging
> > *hwpt_paging)
> > {
> > + struct iommufd_ctx *ictx = idev->ictx;
> > int rc;
> >
> > lockdep_assert_held(&idev->igroup->lock);
> >
> > + /* Override it with a user-programmed SW_MSI region */
> > + if (ictx->sw_msi_size && ictx->sw_msi_start != PHYS_ADDR_MAX)
> > + idev->igroup->sw_msi_start = ictx->sw_msi_start;
> > rc = iopt_table_enforce_dev_resv_regions(&hwpt_paging->ioas->iopt,
> > idev->dev,
> > &idev->igroup-
> > >sw_msi_start);
>
> what about moving above additions into
> iopt_table_enforce_dev_resv_regions() which is all about finding
> a sw_msi address and can check the user setting internally?
We could. Probably would be cleaner by doing that in one place.
> > diff --git a/drivers/iommu/iommufd/io_pagetable.c
> > b/drivers/iommu/iommufd/io_pagetable.c
> > index 8a790e597e12..5d7f5ca1eecf 100644
> > --- a/drivers/iommu/iommufd/io_pagetable.c
> > +++ b/drivers/iommu/iommufd/io_pagetable.c
> > @@ -1446,7 +1446,9 @@ int iopt_table_enforce_dev_resv_regions(struct
> > io_pagetable *iopt,
> > if (sw_msi_start && resv->type == IOMMU_RESV_MSI)
> > num_hw_msi++;
> > if (sw_msi_start && resv->type == IOMMU_RESV_SW_MSI) {
> > - *sw_msi_start = resv->start;
> > + /* Bypass the driver-defined SW_MSI region, if preset
> > */
> > + if (*sw_msi_start == PHYS_ADDR_MAX)
> > + *sw_msi_start = resv->start;
>
> the code is not about bypass. Instead it's to use the driver-defined
> region if user doesn't set it.
Ack:
/* If unset, use the default IOMMU_RESV_SW_MSI */
Thanks
Nicolin
* Re: [PATCH RFCv2 00/13] iommu: Add MSI mapping support with nested SMMU
2025-01-29 20:13 ` Jason Gunthorpe
@ 2025-02-04 12:55 ` Eric Auger
2025-02-04 13:02 ` Jason Gunthorpe
0 siblings, 1 reply; 64+ messages in thread
From: Eric Auger @ 2025-02-04 12:55 UTC (permalink / raw)
To: Jason Gunthorpe
Cc: Shameerali Kolothum Thodi, Nicolin Chen, will@kernel.org,
robin.murphy@arm.com, kevin.tian@intel.com, tglx@linutronix.de,
maz@kernel.org, alex.williamson@redhat.com, joro@8bytes.org,
shuah@kernel.org, reinette.chatre@intel.com, yebin (H),
apatel@ventanamicro.com, shivamurthy.shastri@linutronix.de,
bhelgaas@google.com, anna-maria@linutronix.de,
yury.norov@gmail.com, nipun.gupta@amd.com, iommu@lists.linux.dev,
linux-kernel@vger.kernel.org,
linux-arm-kernel@lists.infradead.org, kvm@vger.kernel.org,
linux-kselftest@vger.kernel.org, patches@lists.linux.dev,
jean-philippe@linaro.org, mdf@kernel.org, mshavit@google.com,
smostafa@google.com, ddutile@redhat.com
Hi Jason,
On 1/29/25 9:13 PM, Jason Gunthorpe wrote:
> On Wed, Jan 29, 2025 at 06:46:20PM +0100, Eric Auger wrote:
>>>>> This missing piece is cleaning up the ITS mapping to allow for
>>>>> multiple ITS pages. I've imagined that kvm would somehow give iommufd
>>>>> a FD that holds the specific ITS pages instead of the
>>>>> IOMMU_OPTION_SW_MSI_START/SIZE flow.
>>>> That's what I don't get: at the moment you only pass the gIOVA. With
>>>> technique 2, how can you build the nested mapping, ie.
>>>>
>>>> S1 S2
>>>> gIOVA -> gDB -> hDB
>>>>
>>>> without passing the full gIOVA/gDB S1 mapping to the host?
>>> The nested S2 mapping is already setup before the VM boots:
>>>
>>> - The VMM puts the ITS page (hDB) into the S2 at a fixed address (gDB)
>> Ah OK. Your gDB has nothing to do with the actual S1 guest gDB,
>> right?
> I'm not totally sure what you mean by gDB? The above diagram suggests
> it is the ITS page address in the S2? Ie the guest physical address of
> the ITS.
Yes this is what I meant, ie. the guest ITS doorbell GPA
>
> Within the VM, when it goes to call iommu_dma_prepare_msi(), it will
> provide the gDB adress as the phys_addr_t msi_addr.
>
> This happens because the GIC driver will have been informed of the ITS
> page at the gDB address, and it will use
> iommu_dma_prepare_msi(). Exactly the same as bare metal.
understood this is the standard MSI binding scheme.
>
>> It is computed in iommufd_sw_msi_get_map() from the sw_msi_start pool.
>> Is that correct?
> Yes, for a single ITS page it will reliably be put at sw_msi_start.
> Since the VMM can provide sw_msi_start through the OPTION, the VMM can
> place the ITS page where it wants and then program the ACPI to tell
> the VM to call iommu_dma_prepare_msi(). (don't use this flow, it
> doesn't work for multi ITS, for testing only)
OK so you need to set host sw_msi_start to the guest doorbell GPA which
is currently set, in qemu, at
GITS_TRANSLATER 0x08080000 + 0x10000
In my original integration, I passed pairs of S1 gIOVA/gDB used by the
guest and this gDB was directly reused for mapping hDB.
I think I get it now.
Eric
>
>> https://lore.kernel.org/all/20210411111228.14386-9-eric.auger@redhat.com/
>> I was passing both the gIOVA and the "true" gDB
>> Eric
> If I understand this right, it still had the hypervisor dynamically
> setting up the S2, here it is pre-set and static?
>
> Jason
>
* Re: [PATCH RFCv2 00/13] iommu: Add MSI mapping support with nested SMMU
2025-02-04 12:55 ` Eric Auger
@ 2025-02-04 13:02 ` Jason Gunthorpe
0 siblings, 0 replies; 64+ messages in thread
From: Jason Gunthorpe @ 2025-02-04 13:02 UTC (permalink / raw)
To: Eric Auger
Cc: Shameerali Kolothum Thodi, Nicolin Chen, will@kernel.org,
robin.murphy@arm.com, kevin.tian@intel.com, tglx@linutronix.de,
maz@kernel.org, alex.williamson@redhat.com, joro@8bytes.org,
shuah@kernel.org, reinette.chatre@intel.com, yebin (H),
apatel@ventanamicro.com, shivamurthy.shastri@linutronix.de,
bhelgaas@google.com, anna-maria@linutronix.de,
yury.norov@gmail.com, nipun.gupta@amd.com, iommu@lists.linux.dev,
linux-kernel@vger.kernel.org,
linux-arm-kernel@lists.infradead.org, kvm@vger.kernel.org,
linux-kselftest@vger.kernel.org, patches@lists.linux.dev,
jean-philippe@linaro.org, mdf@kernel.org, mshavit@google.com,
smostafa@google.com, ddutile@redhat.com
On Tue, Feb 04, 2025 at 01:55:01PM +0100, Eric Auger wrote:
> OK so you need to set host sw_msi_start to the guest doorbell GPA which
> is currently set, in qemu, at
> GITS_TRANSLATER 0x08080000 + 0x10000
Yes (but don't do this except for testing)
The challenge that remains is how to build an API to get each ITS page
mapped into the S2 at the right position - ideally statically before
the VM is booted.
Jason
* Re: [PATCH RFCv2 00/13] iommu: Add MSI mapping support with nested SMMU
2025-01-11 3:32 [PATCH RFCv2 00/13] iommu: Add MSI mapping support with nested SMMU Nicolin Chen
` (13 preceding siblings ...)
2025-01-23 9:06 ` [PATCH RFCv2 00/13] iommu: Add MSI mapping support with nested SMMU Shameerali Kolothum Thodi
@ 2025-02-05 22:49 ` Jacob Pan
2025-02-05 22:56 ` Nicolin Chen
2025-02-07 14:34 ` Jason Gunthorpe
15 siblings, 1 reply; 64+ messages in thread
From: Jacob Pan @ 2025-02-05 22:49 UTC (permalink / raw)
To: Nicolin Chen
Cc: will, robin.murphy, jgg, kevin.tian, tglx, maz, alex.williamson,
joro, shuah, reinette.chatre, eric.auger, yebin10, apatel,
shivamurthy.shastri, bhelgaas, anna-maria, yury.norov,
nipun.gupta, iommu, linux-kernel, linux-arm-kernel, kvm,
linux-kselftest, patches, jean-philippe, mdf, mshavit,
shameerali.kolothum.thodi, smostafa, ddutile, jacob.pan
Hi Nicolin,
On Fri, 10 Jan 2025 19:32:16 -0800
Nicolin Chen <nicolinc@nvidia.com> wrote:
> [ Background ]
> On ARM GIC systems and others, the target address of the MSI is
> translated by the IOMMU. For GIC, the MSI address page is called
> "ITS" page. When the IOMMU is disabled, the MSI address is programmed
> to the physical location of the GIC ITS page (e.g. 0x20200000). When
> the IOMMU is enabled, the ITS page is behind the IOMMU, so the MSI
> address is programmed to an allocated IO virtual address (a.k.a
> IOVA), e.g. 0xFFFF0000, which must be mapped to the physical ITS
> page: IOVA (0xFFFF0000) ===> PA (0x20200000). When a 2-stage
> translation is enabled, IOVA will be still used to program the MSI
> address, though the mappings will be in two stages: IOVA (0xFFFF0000)
> ===> IPA (e.g. 0x80900000) ===> PA (0x20200000) (IPA stands for
> Intermediate Physical Address).
>
> If the device that generates MSI is attached to an IOMMU_DOMAIN_DMA,
> the IOVA is dynamically allocated from the top of the IOVA space. If
> attached to an IOMMU_DOMAIN_UNMANAGED (e.g. a VFIO passthrough
> device), the IOVA is fixed to an MSI window reported by the IOMMU
> driver via IOMMU_RESV_SW_MSI, which is hardwired to MSI_IOVA_BASE
> (IOVA==0x8000000) for ARM IOMMUs.
>
> So far, this IOMMU_RESV_SW_MSI works well as kernel is entirely in
> charge of the IOMMU translation (1-stage translation), since the IOVA
> for the ITS page is fixed and known by kernel. However, with virtual
> machine enabling a nested IOMMU translation (2-stage), a guest kernel
> directly controls the stage-1 translation with an IOMMU_DOMAIN_DMA,
> mapping a vITS page (at an IPA 0x80900000) onto its own IOVA space
> (e.g. 0xEEEE0000). Then, the host kernel can't know that guest-level
> IOVA to program the MSI address.
>
> There have been two approaches to solve this problem:
> 1. Create an identity mapping in the stage-1. VMM could insert a few
> RMRs (Reserved Memory Regions) in guest's IORT. Then the guest kernel
> would fetch these RMR entries from the IORT and create an
> IOMMU_RESV_DIRECT region per iommu group for a direct mapping.
> Eventually, the mappings would look like: IOVA (0x8000000) === IPA
> (0x8000000) ===> 0x20200000 This requires an IOMMUFD ioctl for kernel
> and VMM to agree on the IPA.
Should this RMR be in a separate range than MSI_IOVA_BASE? The guest
will have MSI_IOVA_BASE in a reserved region already, no?
e.g. # cat
/sys/bus/pci/devices/0015\:01\:00.0/iommu_group/reserved_regions
0x0000000008000000 0x00000000080fffff msi
* Re: [PATCH RFCv2 00/13] iommu: Add MSI mapping support with nested SMMU
2025-02-05 22:49 ` Jacob Pan
@ 2025-02-05 22:56 ` Nicolin Chen
0 siblings, 0 replies; 64+ messages in thread
From: Nicolin Chen @ 2025-02-05 22:56 UTC (permalink / raw)
To: Jacob Pan
Cc: will, robin.murphy, jgg, kevin.tian, tglx, maz, alex.williamson,
joro, shuah, reinette.chatre, eric.auger, yebin10, apatel,
shivamurthy.shastri, bhelgaas, anna-maria, yury.norov,
nipun.gupta, iommu, linux-kernel, linux-arm-kernel, kvm,
linux-kselftest, patches, jean-philippe, mdf, mshavit,
shameerali.kolothum.thodi, smostafa, ddutile
On Wed, Feb 05, 2025 at 02:49:04PM -0800, Jacob Pan wrote:
> > There have been two approaches to solve this problem:
> > 1. Create an identity mapping in the stage-1. VMM could insert a few
> > RMRs (Reserved Memory Regions) in guest's IORT. Then the guest kernel
> > would fetch these RMR entries from the IORT and create an
> > IOMMU_RESV_DIRECT region per iommu group for a direct mapping.
> > Eventually, the mappings would look like: IOVA (0x8000000) === IPA
> > (0x8000000) ===> 0x20200000 This requires an IOMMUFD ioctl for kernel
> > and VMM to agree on the IPA.
>
> Should this RMR be in a separate range than MSI_IOVA_BASE? The guest
> will have MSI_IOVA_BASE in a reserved region already, no?
> e.g. # cat
> /sys/bus/pci/devices/0015\:01\:00.0/iommu_group/reserved_regions
> 0x0000000008000000 0x00000000080fffff msi
No. In Patch-9, the driver-defined MSI_IOVA_BASE will be ignored if
userspace has assigned IOMMU_OPTION_SW_MSI_START/SIZE, even if they
might have the same values as the MSI_IOVA_BASE window.
The idea of MSI_IOVA_BASE in this series is a kernel default that'd
be only effective when user space doesn't care to set anything.
Thanks
Nicolin
* Re: [PATCH RFCv2 09/13] iommufd: Add IOMMU_OPTION_SW_MSI_START/SIZE ioctls
2025-01-29 14:58 ` Jason Gunthorpe
2025-01-29 17:23 ` Eric Auger
@ 2025-02-07 4:26 ` Nicolin Chen
2025-02-07 14:30 ` Jason Gunthorpe
1 sibling, 1 reply; 64+ messages in thread
From: Nicolin Chen @ 2025-02-07 4:26 UTC (permalink / raw)
To: Jason Gunthorpe
Cc: Eric Auger, will, robin.murphy, kevin.tian, tglx, maz,
alex.williamson, joro, shuah, reinette.chatre, yebin10, apatel,
shivamurthy.shastri, bhelgaas, anna-maria, yury.norov,
nipun.gupta, iommu, linux-kernel, linux-arm-kernel, kvm,
linux-kselftest, patches, jean-philippe, mdf, mshavit,
shameerali.kolothum.thodi, smostafa, ddutile
On Wed, Jan 29, 2025 at 10:58:00AM -0400, Jason Gunthorpe wrote:
> On Wed, Jan 29, 2025 at 02:44:12PM +0100, Eric Auger wrote:
> > On 1/11/25 4:32 AM, Nicolin Chen wrote:
> > > For systems that require MSI pages to be mapped into the IOMMU translation
> > > the IOMMU driver provides an IOMMU_RESV_SW_MSI range, which is the default
> > > recommended IOVA window to place these mappings. However, there is nothing
> > > special about this address. And to support the RMR trick in VMM for nested
> > well at least it shall not overlap VMM's RAM. So it was not random either.
> > > translation, the VMM needs to know what sw_msi window the kernel is using.
> > > As there is no particular reason to force VMM to adopt the kernel default,
> > > provide a simple IOMMU_OPTION_SW_MSI_START/SIZE ioctl that the VMM can use
> > > to directly specify the sw_msi window that it wants to use, which replaces
> > > and disables the default IOMMU_RESV_SW_MSI from the driver to avoid having
> > > to build an API to discover the default IOMMU_RESV_SW_MSI.
> > IIUC the MSI window will then be different when using legacy VFIO
> > assignment and iommufd backend.
>
> ? They use the same, iommufd can have userspace override it. Then it
> will ignore the reserved region.
>
> > MSI reserved regions are exposed in
> > /sys/kernel/iommu_groups/<n>/reserved_regions
> > 0x0000000008000000 0x00000000080fffff msi
>
> > Is that configurability reflected accordingly?
>
> ?
>
> Nothing using iommufd should parse that sysfs file.
>
> > How do you make sure it does not collide with other resv regions? I
> > don't see any check here.
>
> Yes this does need to be checked, it does look missing. It still needs
> to create a reserved region in the ioas when attaching to keep the
> areas safe and it has to intersect with the incoming reserved
> regions from the driver.
Yea, I found iopt_reserve_iova() is actually missing entirely...
While fixing this, I see a way to turn the OPTIONs back to per-
idev, if you still prefer them to be per-idev(?). Then, we can
check a given input in set_option() against the device's
reserved region list from the driver, prior to the device
attaching to any HWPT.
Otherwise, we just rely on iopt_enforce_device_reserve_region()
during an attach, keeping the option global to simplify VMMs.
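The collision check being discussed is just an interval-intersection
test between the user-chosen window and each driver-reported reserved
region. A standalone model (resv_region/window_collides are
illustrative names, not the iommufd implementation):

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* One reserved region as reported by the IOMMU driver,
 * e.g. the 0x8000000-0x80fffff msi region from sysfs. */
struct resv_region {
	uint64_t start;
	uint64_t last;	/* inclusive */
};

/* Return true if the userspace SW_MSI window [msi_start,
 * msi_start + msi_size - 1] intersects any reserved region. */
static bool window_collides(uint64_t msi_start, uint64_t msi_size,
			    const struct resv_region *resv, size_t n)
{
	uint64_t msi_last = msi_start + msi_size - 1;

	for (size_t i = 0; i < n; i++) {
		/* Two intervals overlap iff each starts before the
		 * other ends */
		if (msi_start <= resv[i].last &&
		    resv[i].start <= msi_last)
			return true;
	}
	return false;
}
```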
Thanks
Nicolin
* Re: [PATCH RFCv2 09/13] iommufd: Add IOMMU_OPTION_SW_MSI_START/SIZE ioctls
2025-02-07 4:26 ` Nicolin Chen
@ 2025-02-07 14:30 ` Jason Gunthorpe
2025-02-07 15:28 ` Jason Gunthorpe
0 siblings, 1 reply; 64+ messages in thread
From: Jason Gunthorpe @ 2025-02-07 14:30 UTC (permalink / raw)
To: Nicolin Chen
Cc: Eric Auger, will, robin.murphy, kevin.tian, tglx, maz,
alex.williamson, joro, shuah, reinette.chatre, yebin10, apatel,
shivamurthy.shastri, bhelgaas, anna-maria, yury.norov,
nipun.gupta, iommu, linux-kernel, linux-arm-kernel, kvm,
linux-kselftest, patches, jean-philippe, mdf, mshavit,
shameerali.kolothum.thodi, smostafa, ddutile
On Thu, Feb 06, 2025 at 08:26:05PM -0800, Nicolin Chen wrote:
> Yea, I found iopt_reserve_iova() is actually missed entirely...
>
> While fixing this, I see a way to turn the OPTIONs back to per-
> idev, if you still prefer them to be per-idev(?). Then, we can
> check a given input in the set_option() against the device's
> reserved region list from the driver, prior to device attaching
> to any HWPT.
I don't have a strong opinion; if the idev approach works without
complexity then I'd stick with that, on the basis that narrower
scope is usually better.
Jason
* Re: [PATCH RFCv2 00/13] iommu: Add MSI mapping support with nested SMMU
2025-01-11 3:32 [PATCH RFCv2 00/13] iommu: Add MSI mapping support with nested SMMU Nicolin Chen
` (14 preceding siblings ...)
2025-02-05 22:49 ` Jacob Pan
@ 2025-02-07 14:34 ` Jason Gunthorpe
2025-02-07 14:42 ` Thomas Gleixner
15 siblings, 1 reply; 64+ messages in thread
From: Jason Gunthorpe @ 2025-02-07 14:34 UTC (permalink / raw)
To: Nicolin Chen, tglx, robin.murphy, maz
Cc: will, kevin.tian, alex.williamson, joro, shuah, reinette.chatre,
eric.auger, yebin10, apatel, shivamurthy.shastri, bhelgaas,
anna-maria, yury.norov, nipun.gupta, iommu, linux-kernel,
linux-arm-kernel, kvm, linux-kselftest, patches, jean-philippe,
mdf, mshavit, shameerali.kolothum.thodi, smostafa, ddutile
On Fri, Jan 10, 2025 at 07:32:16PM -0800, Nicolin Chen wrote:
> Though these two approaches feel very different on the surface, they can
> share some underlying common infrastructure. Currently, only one pair of
> sw_msi functions (prepare/compose) are provided by dma-iommu for irqchip
> drivers to directly use. There could be different versions of functions
> from different domain owners: for existing VFIO passthrough cases and in-
> kernel DMA domain cases, reuse the existing dma-iommu's version of sw_msi
> functions; for nested translation use cases, there can be another version
> of sw_msi functions to handle mapping and msi_msg(s) differently.
>
> To support both approaches, in this series
> - Get rid of the duplication in the "compose" function
> - Introduce a function pointer for the previously "prepare" function
> - Allow different domain owners to set their own "sw_msi" implementations
> - Implement an iommufd_sw_msi function to additionally support a nested
> translation use case using the approach (2), i.e. the RMR solution
> - Add a pair of IOMMUFD options for a SW_MSI window for kernel and VMM to
> agree on (for approach 1)
> - Add a new VFIO ioctl to set the MSI(x) vector(s) for iommufd_sw_msi()
> to update the msi_desc structure accordingly (for approach 2)
Thomas/Marc/Robin, are we comfortable with this general approach?
Nicolin can send something non-RFC for a proper review.
I like it, it solves many of the problems iommufd had here and it
seems logical from the irq side.
Thanks,
Jason
* Re: [PATCH RFCv2 00/13] iommu: Add MSI mapping support with nested SMMU
2025-02-07 14:34 ` Jason Gunthorpe
@ 2025-02-07 14:42 ` Thomas Gleixner
0 siblings, 0 replies; 64+ messages in thread
From: Thomas Gleixner @ 2025-02-07 14:42 UTC (permalink / raw)
To: Jason Gunthorpe, Nicolin Chen, robin.murphy, maz
Cc: will, kevin.tian, alex.williamson, joro, shuah, reinette.chatre,
eric.auger, yebin10, apatel, shivamurthy.shastri, bhelgaas,
anna-maria, yury.norov, nipun.gupta, iommu, linux-kernel,
linux-arm-kernel, kvm, linux-kselftest, patches, jean-philippe,
mdf, mshavit, shameerali.kolothum.thodi, smostafa, ddutile
On Fri, Feb 07 2025 at 10:34, Jason Gunthorpe wrote:
> On Fri, Jan 10, 2025 at 07:32:16PM -0800, Nicolin Chen wrote:
>> Though these two approaches feel very different on the surface, they can
>> share some underlying common infrastructure. Currently, only one pair of
>> sw_msi functions (prepare/compose) are provided by dma-iommu for irqchip
>> drivers to directly use. There could be different versions of functions
>> from different domain owners: for existing VFIO passthrough cases and in-
>> kernel DMA domain cases, reuse the existing dma-iommu's version of sw_msi
>> functions; for nested translation use cases, there can be another version
>> of sw_msi functions to handle mapping and msi_msg(s) differently.
>>
>> To support both approaches, in this series
>> - Get rid of the duplication in the "compose" function
>> - Introduce a function pointer for the previously "prepare" function
>> - Allow different domain owners to set their own "sw_msi" implementations
>> - Implement an iommufd_sw_msi function to additionally support a nested
>> translation use case using the approach (2), i.e. the RMR solution
>> - Add a pair of IOMMUFD options for a SW_MSI window for kernel and VMM to
>> agree on (for approach 1)
>> - Add a new VFIO ioctl to set the MSI(x) vector(s) for iommufd_sw_msi()
>> to update the msi_desc structure accordingly (for approach 2)
>
> Thomas/Marc/Robin, are we comfortable with this general approach?
> Nicolin can send something non-RFC for a proper review.
>
> I like it, it solves many of the problems iommufd had here and it
> seems logical from the irq side.
I haven't seen anything horrible. My main concern, having a properly
cached and writeable message, is addressed.
Thanks,
tglx
* Re: [PATCH RFCv2 09/13] iommufd: Add IOMMU_OPTION_SW_MSI_START/SIZE ioctls
2025-02-07 14:30 ` Jason Gunthorpe
@ 2025-02-07 15:28 ` Jason Gunthorpe
2025-02-07 18:59 ` Nicolin Chen
0 siblings, 1 reply; 64+ messages in thread
From: Jason Gunthorpe @ 2025-02-07 15:28 UTC (permalink / raw)
To: Nicolin Chen
Cc: Eric Auger, will, robin.murphy, kevin.tian, tglx, maz,
alex.williamson, joro, shuah, reinette.chatre, yebin10, apatel,
shivamurthy.shastri, bhelgaas, anna-maria, yury.norov,
nipun.gupta, iommu, linux-kernel, linux-arm-kernel, kvm,
linux-kselftest, patches, jean-philippe, mdf, mshavit,
shameerali.kolothum.thodi, smostafa, ddutile
On Fri, Feb 07, 2025 at 10:30:20AM -0400, Jason Gunthorpe wrote:
> On Thu, Feb 06, 2025 at 08:26:05PM -0800, Nicolin Chen wrote:
> > Yea, I found iopt_reserve_iova() is actually missed entirely...
> >
> > While fixing this, I see a way to turn the OPTIONs back to per-
> > idev, if you still prefer them to be per-idev(?). Then, we can
> > check a given input in the set_option() against the device's
> > reserved region list from the driver, prior to device attaching
> > to any HWPT.
>
> I didn't have a strong opinion, if the idev works without complexity
> then I'd stick with that on the basis of narrower scope is usually
> better.
If you make it per-idev then it is implicitly per-GIC as well,
since each idev has exactly one GIC.
This would make it useful as a way to get each ITS page mapped into a
single fixed location.
Really hard to use from the VMM, though.
Jason
* Re: [PATCH RFCv2 09/13] iommufd: Add IOMMU_OPTION_SW_MSI_START/SIZE ioctls
2025-02-07 15:28 ` Jason Gunthorpe
@ 2025-02-07 18:59 ` Nicolin Chen
2025-02-09 18:09 ` Jason Gunthorpe
0 siblings, 1 reply; 64+ messages in thread
From: Nicolin Chen @ 2025-02-07 18:59 UTC (permalink / raw)
To: Jason Gunthorpe
Cc: Eric Auger, will, robin.murphy, kevin.tian, tglx, maz,
alex.williamson, joro, shuah, reinette.chatre, yebin10, apatel,
shivamurthy.shastri, bhelgaas, anna-maria, yury.norov,
nipun.gupta, iommu, linux-kernel, linux-arm-kernel, kvm,
linux-kselftest, patches, jean-philippe, mdf, mshavit,
shameerali.kolothum.thodi, smostafa, ddutile
On Fri, Feb 07, 2025 at 11:28:01AM -0400, Jason Gunthorpe wrote:
> On Fri, Feb 07, 2025 at 10:30:20AM -0400, Jason Gunthorpe wrote:
> > On Thu, Feb 06, 2025 at 08:26:05PM -0800, Nicolin Chen wrote:
> > > Yea, I found iopt_reserve_iova() is actually missed entirely...
> > >
> > > While fixing this, I see a way to turn the OPTIONs back to per-
> > > idev, if you still prefer them to be per-idev(?). Then, we can
> > > check a given input in the set_option() against the device's
> > > reserved region list from the driver, prior to device attaching
> > > to any HWPT.
> >
> > I didn't have a strong opinion, if the idev works without complexity
> > then I'd stick with that on the basis of narrower scope is usually
> > better.
We could forward the "SW_MSI_START" location or the vITS's IPA into
each vSMMU module in QEMU; the vSMMU module would then initiate
SET_OPTION when an iommufd-enabled VFIO device attaches to it.
That's how I tested it. And since there is one vITS in the VM, we'd
only have one fixed location for all devices.
> If you make it per-idev then it is also implicitly per-GIC as well
> since each idev has exactly one GIC
>
> This would make it useful as a way to get each ITS page mapped into a
> single fixed location..
Hmm, is this for approach-2 (i.e. the vITS solution)?
Do you mean a use case:
Multiple vITS pages <=> One pITS page?
> Really hard to use from the VMM though
I could imagine. The caller initiating a SET_OPTION call in the VMM
will have to know which vITS page belongs to which device. So, this
info has to go through the KVM/IRQ module to get processed and then
be forwarded to the caller (the vSMMU module at this moment).
Thanks
Nicolin
* Re: [PATCH RFCv2 09/13] iommufd: Add IOMMU_OPTION_SW_MSI_START/SIZE ioctls
2025-02-07 18:59 ` Nicolin Chen
@ 2025-02-09 18:09 ` Jason Gunthorpe
0 siblings, 0 replies; 64+ messages in thread
From: Jason Gunthorpe @ 2025-02-09 18:09 UTC (permalink / raw)
To: Nicolin Chen
Cc: Eric Auger, will, robin.murphy, kevin.tian, tglx, maz,
alex.williamson, joro, shuah, reinette.chatre, yebin10, apatel,
shivamurthy.shastri, bhelgaas, anna-maria, yury.norov,
nipun.gupta, iommu, linux-kernel, linux-arm-kernel, kvm,
linux-kselftest, patches, jean-philippe, mdf, mshavit,
shameerali.kolothum.thodi, smostafa, ddutile
On Fri, Feb 07, 2025 at 10:59:48AM -0800, Nicolin Chen wrote:
> On Fri, Feb 07, 2025 at 11:28:01AM -0400, Jason Gunthorpe wrote:
> > On Fri, Feb 07, 2025 at 10:30:20AM -0400, Jason Gunthorpe wrote:
> > > On Thu, Feb 06, 2025 at 08:26:05PM -0800, Nicolin Chen wrote:
> > > > Yea, I found iopt_reserve_iova() is actually missed entirely...
> > > >
> > > > While fixing this, I see a way to turn the OPTIONs back to per-
> > > > idev, if you still prefer them to be per-idev(?). Then, we can
> > > > check a given input in the set_option() against the device's
> > > > reserved region list from the driver, prior to device attaching
> > > > to any HWPT.
> > >
> > > I didn't have a strong opinion, if the idev works without complexity
> > > then I'd stick with that on the basis of narrower scope is usually
> > > better.
On reflection I don't think per-idev is going to work very
well. Part of the design was to keep track of a bitmap of already-
mapped pages in the single hwpt that unions all of the devices. If it
is per-device then that basic thing doesn't work and it becomes much
more complicated.
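The hwpt-wide dedup being described can be sketched as a toy model
(hwpt_sw_msi/hwpt_sw_msi_map and the fixed-size table are made up for
illustration; the real iommufd code is different):

```c
#include <assert.h>
#include <stdint.h>

/* Toy model: one table per hwpt recording which ITS page PA is
 * mapped in each slot of the SW_MSI window, shared by every device
 * attached to that hwpt. Per-device state would break this reuse. */
#define SW_MSI_SLOTS 16

struct hwpt_sw_msi {
	uint64_t phys[SW_MSI_SLOTS];	/* ITS page PA per slot, 0 = free */
};

/* Return the slot already holding this ITS page, or map it into the
 * first free slot; -1 if the window is exhausted. A second device
 * whose GIC uses the same ITS page reuses the existing mapping. */
static int hwpt_sw_msi_map(struct hwpt_sw_msi *h, uint64_t its_pa)
{
	int free = -1;

	for (int i = 0; i < SW_MSI_SLOTS; i++) {
		if (h->phys[i] == its_pa)
			return i;		/* already mapped, reuse */
		if (h->phys[i] == 0 && free < 0)
			free = i;		/* remember first free slot */
	}
	if (free >= 0)
		h->phys[free] = its_pa;		/* first user maps the page */
	return free;
}
```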
> I could imagine. The caller initiating a SET_OPTION call in VMM
> will have to know what vITS page for what device. So, this info
> has to go through the KVM/IRQ module to get processed and then
> forwarded to the caller (vSMMU module at this moment)..
Ultimately, as we saw in the other conversation, the QEMU command line
will need to describe the GIC(s) and all their ITS pages directly,
somehow.
Jason
Thread overview: 64+ messages
2025-01-11 3:32 [PATCH RFCv2 00/13] iommu: Add MSI mapping support with nested SMMU Nicolin Chen
2025-01-11 3:32 ` [PATCH RFCv2 01/13] genirq/msi: Store the IOMMU IOVA directly in msi_desc instead of iommu_cookie Nicolin Chen
2025-01-23 17:10 ` Eric Auger
2025-01-23 18:48 ` Jason Gunthorpe
2025-01-29 12:11 ` Eric Auger
2025-01-11 3:32 ` [PATCH RFCv2 02/13] genirq/msi: Rename iommu_dma_compose_msi_msg() to msi_msg_set_msi_addr() Nicolin Chen
2025-01-23 17:10 ` Eric Auger
2025-01-23 18:50 ` Jason Gunthorpe
2025-01-29 10:44 ` Eric Auger
2025-01-11 3:32 ` [PATCH RFCv2 03/13] iommu: Make iommu_dma_prepare_msi() into a generic operation Nicolin Chen
2025-01-23 17:10 ` Eric Auger
2025-01-23 18:16 ` Jason Gunthorpe
2025-01-29 12:29 ` Eric Auger
2025-01-11 3:32 ` [PATCH RFCv2 04/13] irqchip: Have CONFIG_IRQ_MSI_IOMMU be selected by the irqchips that need it Nicolin Chen
2025-01-11 3:32 ` [PATCH RFCv2 05/13] iommu: Turn fault_data to iommufd private pointer Nicolin Chen
2025-01-23 9:54 ` Tian, Kevin
2025-01-23 13:25 ` Jason Gunthorpe
2025-01-29 12:40 ` Eric Auger
2025-02-03 17:48 ` Nicolin Chen
2025-01-11 3:32 ` [PATCH RFCv2 06/13] iommufd: Make attach_handle generic Nicolin Chen
2025-01-18 8:23 ` Yi Liu
2025-01-18 20:32 ` Nicolin Chen
2025-01-19 10:40 ` Yi Liu
2025-01-20 5:54 ` Nicolin Chen
2025-01-24 13:31 ` Yi Liu
2025-01-20 14:20 ` Jason Gunthorpe
2025-01-29 13:14 ` Eric Auger
2025-02-03 18:08 ` Nicolin Chen
2025-01-11 3:32 ` [PATCH RFCv2 07/13] iommufd: Implement sw_msi support natively Nicolin Chen
2025-01-15 4:21 ` Yury Norov
2025-01-16 20:21 ` Jason Gunthorpe
2025-01-23 19:30 ` Jason Gunthorpe
2025-01-11 3:32 ` [PATCH RFCv2 08/13] iommu: Turn iova_cookie to dma-iommu private pointer Nicolin Chen
2025-01-13 16:40 ` Jason Gunthorpe
2025-01-11 3:32 ` [PATCH RFCv2 09/13] iommufd: Add IOMMU_OPTION_SW_MSI_START/SIZE ioctls Nicolin Chen
2025-01-23 10:07 ` Tian, Kevin
2025-02-03 18:36 ` Nicolin Chen
2025-01-29 13:44 ` Eric Auger
2025-01-29 14:58 ` Jason Gunthorpe
2025-01-29 17:23 ` Eric Auger
2025-01-29 17:39 ` Jason Gunthorpe
2025-01-29 17:49 ` Eric Auger
2025-01-29 20:15 ` Jason Gunthorpe
2025-02-07 4:26 ` Nicolin Chen
2025-02-07 14:30 ` Jason Gunthorpe
2025-02-07 15:28 ` Jason Gunthorpe
2025-02-07 18:59 ` Nicolin Chen
2025-02-09 18:09 ` Jason Gunthorpe
2025-01-11 3:32 ` [PATCH RFCv2 10/13] iommufd/selftes: Add coverage for IOMMU_OPTION_SW_MSI_START/SIZE Nicolin Chen
2025-01-11 3:32 ` [PATCH RFCv2 11/13] iommufd/device: Allow setting IOVAs for MSI(x) vectors Nicolin Chen
2025-01-11 3:32 ` [PATCH RFCv2 12/13] vfio-iommufd: Provide another layer of msi_iova helpers Nicolin Chen
2025-01-11 3:32 ` [PATCH RFCv2 13/13] vfio/pci: Allow preset MSI IOVAs via VFIO_IRQ_SET_ACTION_PREPARE Nicolin Chen
2025-01-23 9:06 ` [PATCH RFCv2 00/13] iommu: Add MSI mapping support with nested SMMU Shameerali Kolothum Thodi
2025-01-23 13:24 ` Jason Gunthorpe
2025-01-29 14:54 ` Eric Auger
2025-01-29 15:04 ` Jason Gunthorpe
2025-01-29 17:46 ` Eric Auger
2025-01-29 20:13 ` Jason Gunthorpe
2025-02-04 12:55 ` Eric Auger
2025-02-04 13:02 ` Jason Gunthorpe
2025-02-05 22:49 ` Jacob Pan
2025-02-05 22:56 ` Nicolin Chen
2025-02-07 14:34 ` Jason Gunthorpe
2025-02-07 14:42 ` Thomas Gleixner