* [PATCH v1 00/13] iommu: Add MSI mapping support with nested SMMU
@ 2025-02-08 9:02 Nicolin Chen
2025-02-08 9:02 ` [PATCH v1 01/13] genirq/msi: Store the IOMMU IOVA directly in msi_desc instead of iommu_cookie Nicolin Chen
` (13 more replies)
0 siblings, 14 replies; 28+ messages in thread
From: Nicolin Chen @ 2025-02-08 9:02 UTC (permalink / raw)
To: jgg, kevin.tian, tglx, maz
Cc: joro, will, robin.murphy, shuah, iommu, linux-kernel,
linux-arm-kernel, linux-kselftest, eric.auger, baolu.lu, yi.l.liu,
yury.norov, jacob.pan, patches
[ Background ]
On ARM GIC systems and others, the target address of the MSI is translated
by the IOMMU. For GIC, the MSI address page is called the "ITS" page. When
the IOMMU is disabled, the MSI address is programmed to the physical
location of the GIC ITS page (e.g. 0x20200000). When the IOMMU is enabled,
the ITS page is behind the IOMMU, so the MSI address is programmed to an
allocated IO virtual address (a.k.a. IOVA), e.g. 0xFFFF0000, which must be
mapped to the physical ITS page: IOVA (0xFFFF0000) ===> PA (0x20200000).
When 2-stage translation is enabled, an IOVA is still used to program the
MSI address, though the mapping now goes through two stages:
IOVA (0xFFFF0000) ===> IPA (e.g. 0x80900000) ===> PA (0x20200000)
(IPA stands for Intermediate Physical Address).
If the device that generates MSI is attached to an IOMMU_DOMAIN_DMA, the
IOVA is dynamically allocated from the top of the IOVA space. If attached
to an IOMMU_DOMAIN_UNMANAGED (e.g. a VFIO passthrough device), the IOVA is
fixed to an MSI window reported by the IOMMU driver via IOMMU_RESV_SW_MSI,
which is hardwired to MSI_IOVA_BASE (IOVA==0x8000000) for ARM IOMMUs.
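For reference, this is roughly how the ARM SMMU drivers report that SW_MSI
window today, via their reserved-region callback (a simplified sketch of the
existing kernel pattern; the surrounding driver glue is omitted):

#define MSI_IOVA_BASE          0x8000000
#define MSI_IOVA_LENGTH        0x100000

static void arm_smmu_get_resv_regions(struct device *dev,
                                      struct list_head *head)
{
        struct iommu_resv_region *region;
        int prot = IOMMU_WRITE | IOMMU_NOEXEC | IOMMU_MMIO;

        /* Reserve a fixed IOVA window for software-managed MSI mappings */
        region = iommu_alloc_resv_region(MSI_IOVA_BASE, MSI_IOVA_LENGTH,
                                         prot, IOMMU_RESV_SW_MSI, GFP_KERNEL);
        if (!region)
                return;

        list_add_tail(&region->list, head);
}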
So far, this IOMMU_RESV_SW_MSI works well, as the kernel is entirely in
charge of the IOMMU translation (1-stage translation) and the IOVA for the
ITS page is fixed and known by the kernel. However, with a virtual machine
enabling nested IOMMU translation (2-stage), the guest kernel directly
controls the stage-1 translation with an IOMMU_DOMAIN_DMA, mapping a vITS
page (at IPA 0x80900000) into its own IOVA space (e.g. 0xEEEE0000). The
host kernel then has no way to know that guest-level IOVA when programming
the MSI address.
There have been two approaches to solve this problem:
1. Create an identity mapping in the stage-1. The VMM could insert a few
RMRs (Reserved Memory Regions) in the guest's IORT. The guest kernel would
then fetch these RMR entries from the IORT and create an IOMMU_RESV_DIRECT
region per iommu group for a direct mapping. Eventually, the mappings
would look like: IOVA (0x8000000) === IPA (0x8000000) ===> PA (0x20200000)
This requires an IOMMUFD ioctl for the kernel and the VMM to agree on the IPA.
2. Forward the guest-level MSI IOVA captured by the VMM to the host-level
GIC driver, to program the correct MSI IOVA. Also forward the VMM-defined
vITS page location (IPA) to the kernel for the stage-2 mapping. Eventually:
IOVA (0xFFFF0000) ===> IPA (0x80900000) ===> PA (0x20200000)
This requires a VFIO ioctl (for the IOVA) and an IOMMUFD ioctl (for the IPA).
Worth mentioning: when Eric Auger was working on the same topic with the
VFIO iommu uAPI, he implemented approach (2) first and then switched to
approach (1), as suggested by Jean-Philippe, to reduce complexity.
Approach (1) basically feels like the existing VFIO passthrough case that
has a 1-stage mapping for the unmanaged domain, merely shifting the MSI
mapping from stage 1 (guest-has-no-iommu case) to stage 2 (guest-has-iommu
case). So, it can reuse the existing IOMMU_RESV_SW_MSI piece, sharing the
same idea of "the VMM leaving everything to the kernel".
Approach (2) is the ideal solution, yet it requires additional effort for
the kernel to be aware of the stage-1 gIOVA(s) and stage-2 IPA(s) for the
vITS page(s), which demands close cooperation from the VMM.
* It also brings some complicated use cases to the table where the host
or/and guest system(s) has/have multiple ITS pages.
[ Execution ]
Though these two approaches feel very different on the surface, they can
share some underlying common infrastructure. Currently, only one pair of
sw_msi functions (prepare/compose) is provided by dma-iommu for irqchip
drivers to use directly. There could be different versions of these
functions from different domain owners: for existing VFIO passthrough cases
and in-kernel DMA domain cases, reuse dma-iommu's existing version of the
sw_msi functions; for nested translation use cases, another version of the
sw_msi functions can handle the mapping and the msi_msg(s) differently.
As part 1, supporting approach (1), i.e. the RMR solution:
- Get rid of the duplication in the "compose" function
- Introduce a function pointer for the previous "prepare" function
- Allow different domain owners to set their own "sw_msi" implementations
- Implement an iommufd_sw_msi function to additionally support a nested
translation use case using approach (1)
- Add a pair of IOMMUFD options for a SW_MSI window for the kernel and the
VMM to agree on (for approach 1; a userspace sketch follows this list)
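A rough sketch of how a VMM might program that window from userspace through
the existing IOMMU_OPTION ioctl. The IOMMU_OPTION_SW_MSI_START/SIZE option
IDs are introduced by patch 10 of this series (whose uAPI diff is not shown
here), so the exact semantics and the units of val64 below are assumptions
for illustration only:

#include <fcntl.h>
#include <sys/ioctl.h>
#include <linux/iommufd.h>

static int set_sw_msi_window(int iommufd, __u32 idev_id,
                             __u64 start, __u64 size)
{
        struct iommu_option opt = {
                .size = sizeof(opt),
                .op = IOMMU_OPTION_OP_SET,
                .object_id = idev_id,   /* per-idev, per the v1 changelog */
        };
        int rc;

        opt.option_id = IOMMU_OPTION_SW_MSI_START;  /* added by patch 10 */
        opt.val64 = start;                          /* e.g. 0x8000000 */
        rc = ioctl(iommufd, IOMMU_OPTION, &opt);
        if (rc)
                return rc;

        opt.option_id = IOMMU_OPTION_SW_MSI_SIZE;   /* added by patch 10 */
        opt.val64 = size;          /* units as defined by the new uAPI */
        return ioctl(iommufd, IOMMU_OPTION, &opt);
}

Presumably this runs after binding the device but before attaching it, so the
requested window can be checked against the device's reserved region list
(per the v1 changelog).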
[ Future Plan ]
Part-2 and beyond will continue the effort of supporting approach (2) for a
complete vITS-to-pITS mapping:
1) Map the physical ITS page (potentially via IOMMUFD_CMD_IOAS_MAP_MSI)
2) Convey the IOVAs per-irq (potentially via VFIO_IRQ_SET_ACTION_PREPARE)
(Note that the set_option uAPI in this series might not fit, since this
would require an array of MSI IOVAs.)
---
This is a joint effort that includes Jason's rework at the irq/iommu/iommufd
base level and my additional patches on top of that for the new uAPIs.
This series is on github:
https://github.com/nicolinc/iommufd/commits/iommufd_msi_p1-v1
Pairing QEMU branch for testing (approach 1):
https://github.com/nicolinc/qemu/commits/wip/for_iommufd_msi_p1-v1-rmr
(Note: the QEMU virt command no longer requires an iommufd object, vs. RFCv2)
Changelog
v1
* Rebase on v6.14-rc1 and iommufd_attach_handle-v1 series
https://lore.kernel.org/all/cover.1738645017.git.nicolinc@nvidia.com/
* Correct typos
* Replace set_bit with __set_bit
* Use a common helper to get iommufd_handle
* Add kdoc for iommu_msi_iova/iommu_msi_page_shift
* Rename msi_msg_set_msi_addr() to msi_msg_set_addr()
* Update selftest for a better coverage for the new options
* Change IOMMU_OPTION_SW_MSI_START/SIZE to be per-idev and properly
check against device's reserved region list
RFCv2
https://lore.kernel.org/kvm/cover.1736550979.git.nicolinc@nvidia.com/
* Rebase on v6.13-rc6
* Drop all the irq/pci patches and rework the compose function instead
* Add a new sw_msi op to iommu_domain for a per-type implementation and
let the iommufd core have its own implementation to support both approaches
* Add RMR-solution (approach 1) support since it is straightforward and
has been widely used in some out-of-tree projects
RFCv1
https://lore.kernel.org/kvm/cover.1731130093.git.nicolinc@nvidia.com/
Thanks!
Nicolin
Jason Gunthorpe (5):
genirq/msi: Store the IOMMU IOVA directly in msi_desc instead of
iommu_cookie
genirq/msi: Rename iommu_dma_compose_msi_msg() to msi_msg_set_addr()
iommu: Make iommu_dma_prepare_msi() into a generic operation
irqchip: Have CONFIG_IRQ_MSI_IOMMU be selected by the irqchips that
need it
iommufd: Implement sw_msi support natively
Nicolin Chen (8):
iommu: Turn fault_data to iommufd private pointer
iommu: Turn iova_cookie to dma-iommu private pointer
iommufd/device: Move sw_msi_start from igroup to idev
iommufd: Pass in idev to iopt_table_enforce_dev_resv_regions
iommufd: Add IOMMU_OPTION_SW_MSI_START/SIZE ioctls
iommufd/selftest: Add MOCK_FLAGS_DEVICE_NO_ATTACH
iommufd/selftest: Add a testing reserved region
iommufd/selftest: Add coverage for IOMMU_OPTION_SW_MSI_START/SIZE
drivers/iommu/Kconfig | 1 -
drivers/irqchip/Kconfig | 4 +
kernel/irq/Kconfig | 1 +
drivers/iommu/iommufd/iommufd_private.h | 29 ++-
drivers/iommu/iommufd/iommufd_test.h | 4 +
include/linux/iommu.h | 58 ++++--
include/linux/msi.h | 47 +++--
include/uapi/linux/iommufd.h | 20 +-
drivers/iommu/dma-iommu.c | 63 ++----
drivers/iommu/iommu.c | 29 +++
drivers/iommu/iommufd/device.c | 196 ++++++++++++++----
drivers/iommu/iommufd/fault.c | 2 +-
drivers/iommu/iommufd/hw_pagetable.c | 5 +-
drivers/iommu/iommufd/io_pagetable.c | 18 +-
drivers/iommu/iommufd/ioas.c | 97 +++++++++
drivers/iommu/iommufd/main.c | 13 ++
drivers/iommu/iommufd/selftest.c | 41 +++-
drivers/irqchip/irq-gic-v2m.c | 5 +-
drivers/irqchip/irq-gic-v3-its.c | 13 +-
drivers/irqchip/irq-gic-v3-mbi.c | 12 +-
drivers/irqchip/irq-ls-scfg-msi.c | 5 +-
tools/testing/selftests/iommu/iommufd.c | 97 +++++++++
.../selftests/iommu/iommufd_fail_nth.c | 21 ++
23 files changed, 608 insertions(+), 173 deletions(-)
base-commit: 2b5bc8c9425fd87e094a08f72498536133da80e1
--
2.43.0
* [PATCH v1 01/13] genirq/msi: Store the IOMMU IOVA directly in msi_desc instead of iommu_cookie
2025-02-08 9:02 [PATCH v1 00/13] iommu: Add MSI mapping support with nested SMMU Nicolin Chen
@ 2025-02-08 9:02 ` Nicolin Chen
2025-02-13 11:54 ` Thomas Gleixner
2025-02-13 20:28 ` Jacob Pan
2025-02-08 9:02 ` [PATCH v1 02/13] genirq/msi: Rename iommu_dma_compose_msi_msg() to msi_msg_set_addr() Nicolin Chen
` (12 subsequent siblings)
13 siblings, 2 replies; 28+ messages in thread
From: Nicolin Chen @ 2025-02-08 9:02 UTC (permalink / raw)
To: jgg, kevin.tian, tglx, maz
Cc: joro, will, robin.murphy, shuah, iommu, linux-kernel,
linux-arm-kernel, linux-kselftest, eric.auger, baolu.lu, yi.l.liu,
yury.norov, jacob.pan, patches
From: Jason Gunthorpe <jgg@nvidia.com>
All the iommu cases simply want to override the MSI page's address with
the IOVA that was mapped through the iommu. This doesn't need a cookie
pointer; we just need to store the IOVA and its page size in the msi_desc.
Instead, provide msi_desc_set_iommu_msi_iova(), which allows the IOMMU side
to specify the IOVA at which the MSI page is placed during
iommu_dma_prepare_msi(). This is stored in the msi_desc, and then
iommu_dma_compose_msi_msg() is a simple inline that sets address_hi/lo.
The next patch will correct the naming.
This is done because we cannot correctly lock access to group->domain in
the atomic context that iommu_dma_compose_msi_msg() is called under. Today
the locking miss is tolerable because dma-iommu.c operates under the
assumption that the domain does not change while a driver is probed.
However, iommufd now permits the domain to change while the driver is
probed, and VFIO userspace can create races between IRQ changes calling
iommu_dma_prepare_msi/compose_msi_msg() and changing/freeing the
iommu_domain.
Removing the pointer, and critically, the call to
iommu_get_domain_for_dev() during compose resolves this race.
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Signed-off-by: Nicolin Chen <nicolinc@nvidia.com>
---
include/linux/iommu.h | 6 -----
include/linux/msi.h | 49 +++++++++++++++++++++++++--------------
drivers/iommu/dma-iommu.c | 30 ++++--------------------
3 files changed, 36 insertions(+), 49 deletions(-)
diff --git a/include/linux/iommu.h b/include/linux/iommu.h
index 38c65e92ecd0..caee952febd4 100644
--- a/include/linux/iommu.h
+++ b/include/linux/iommu.h
@@ -1508,7 +1508,6 @@ static inline void iommu_debugfs_setup(void) {}
int iommu_get_msi_cookie(struct iommu_domain *domain, dma_addr_t base);
int iommu_dma_prepare_msi(struct msi_desc *desc, phys_addr_t msi_addr);
-void iommu_dma_compose_msi_msg(struct msi_desc *desc, struct msi_msg *msg);
#else /* CONFIG_IOMMU_DMA */
@@ -1524,11 +1523,6 @@ static inline int iommu_dma_prepare_msi(struct msi_desc *desc, phys_addr_t msi_a
{
return 0;
}
-
-static inline void iommu_dma_compose_msi_msg(struct msi_desc *desc, struct msi_msg *msg)
-{
-}
-
#endif /* CONFIG_IOMMU_DMA */
/*
diff --git a/include/linux/msi.h b/include/linux/msi.h
index b10093c4d00e..74c6a823f157 100644
--- a/include/linux/msi.h
+++ b/include/linux/msi.h
@@ -166,6 +166,10 @@ struct msi_desc_data {
* @dev: Pointer to the device which uses this descriptor
* @msg: The last set MSI message cached for reuse
* @affinity: Optional pointer to a cpu affinity mask for this descriptor
+ * @iommu_msi_iova: Optional IOVA from the IOMMU to override the msi_addr.
+ * Only used if iommu_msi_page_shift != 0
+ * @iommu_msi_page_shift: Indicates how many bits of the original address
+ * should be preserved when using iommu_msi_iova.
* @sysfs_attr: Pointer to sysfs device attribute
*
* @write_msi_msg: Callback that may be called when the MSI message
@@ -184,7 +188,8 @@ struct msi_desc {
struct msi_msg msg;
struct irq_affinity_desc *affinity;
#ifdef CONFIG_IRQ_MSI_IOMMU
- const void *iommu_cookie;
+ u64 iommu_msi_iova : 58;
+ u64 iommu_msi_page_shift : 6;
#endif
#ifdef CONFIG_SYSFS
struct device_attribute *sysfs_attrs;
@@ -285,28 +290,36 @@ struct msi_desc *msi_next_desc(struct device *dev, unsigned int domid,
#define msi_desc_to_dev(desc) ((desc)->dev)
-#ifdef CONFIG_IRQ_MSI_IOMMU
-static inline const void *msi_desc_get_iommu_cookie(struct msi_desc *desc)
-{
- return desc->iommu_cookie;
-}
-
-static inline void msi_desc_set_iommu_cookie(struct msi_desc *desc,
- const void *iommu_cookie)
+static inline void msi_desc_set_iommu_msi_iova(struct msi_desc *desc,
+ u64 msi_iova,
+ unsigned int page_shift)
{
- desc->iommu_cookie = iommu_cookie;
-}
-#else
-static inline const void *msi_desc_get_iommu_cookie(struct msi_desc *desc)
-{
- return NULL;
+#ifdef CONFIG_IRQ_MSI_IOMMU
+ desc->iommu_msi_iova = msi_iova >> page_shift;
+ desc->iommu_msi_page_shift = page_shift;
+#endif
}
-static inline void msi_desc_set_iommu_cookie(struct msi_desc *desc,
- const void *iommu_cookie)
+/**
+ * iommu_dma_compose_msi_msg() - Apply translation to an MSI message
+ * @desc: MSI descriptor prepared by iommu_dma_prepare_msi()
+ * @msg: MSI message containing target physical address
+ */
+static inline void iommu_dma_compose_msi_msg(struct msi_desc *desc,
+ struct msi_msg *msg)
{
-}
+#ifdef CONFIG_IRQ_MSI_IOMMU
+ if (desc->iommu_msi_page_shift) {
+ u64 msi_iova = desc->iommu_msi_iova
+ << desc->iommu_msi_page_shift;
+
+ msg->address_hi = upper_32_bits(msi_iova);
+ msg->address_lo = lower_32_bits(msi_iova) |
+ (msg->address_lo &
+ ((1 << desc->iommu_msi_page_shift) - 1));
+ }
#endif
+}
int msi_domain_insert_msi_desc(struct device *dev, unsigned int domid,
struct msi_desc *init_desc);
diff --git a/drivers/iommu/dma-iommu.c b/drivers/iommu/dma-iommu.c
index 2a9fa0c8cc00..bf91e014d179 100644
--- a/drivers/iommu/dma-iommu.c
+++ b/drivers/iommu/dma-iommu.c
@@ -1815,7 +1815,7 @@ int iommu_dma_prepare_msi(struct msi_desc *desc, phys_addr_t msi_addr)
static DEFINE_MUTEX(msi_prepare_lock); /* see below */
if (!domain || !domain->iova_cookie) {
- desc->iommu_cookie = NULL;
+ msi_desc_set_iommu_msi_iova(desc, 0, 0);
return 0;
}
@@ -1827,33 +1827,13 @@ int iommu_dma_prepare_msi(struct msi_desc *desc, phys_addr_t msi_addr)
mutex_lock(&msi_prepare_lock);
msi_page = iommu_dma_get_msi_page(dev, msi_addr, domain);
mutex_unlock(&msi_prepare_lock);
-
- msi_desc_set_iommu_cookie(desc, msi_page);
-
if (!msi_page)
return -ENOMEM;
- return 0;
-}
-/**
- * iommu_dma_compose_msi_msg() - Apply translation to an MSI message
- * @desc: MSI descriptor prepared by iommu_dma_prepare_msi()
- * @msg: MSI message containing target physical address
- */
-void iommu_dma_compose_msi_msg(struct msi_desc *desc, struct msi_msg *msg)
-{
- struct device *dev = msi_desc_to_dev(desc);
- const struct iommu_domain *domain = iommu_get_domain_for_dev(dev);
- const struct iommu_dma_msi_page *msi_page;
-
- msi_page = msi_desc_get_iommu_cookie(desc);
-
- if (!domain || !domain->iova_cookie || WARN_ON(!msi_page))
- return;
-
- msg->address_hi = upper_32_bits(msi_page->iova);
- msg->address_lo &= cookie_msi_granule(domain->iova_cookie) - 1;
- msg->address_lo += lower_32_bits(msi_page->iova);
+ msi_desc_set_iommu_msi_iova(
+ desc, msi_page->iova,
+ ilog2(cookie_msi_granule(domain->iova_cookie)));
+ return 0;
}
static int iommu_dma_init(void)
--
2.43.0
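To make the new encoding concrete, here is a small standalone sketch (plain
userspace C, not kernel code) of the round trip: the prepare side stores the
page number and shift the way msi_desc_set_iommu_msi_iova() does, and the
compose side rebuilds the IOVA while preserving the intra-page offset of the
original doorbell address (the fallthrough branch matches what the next patch
renames to msi_msg_set_addr()). The addresses reuse the illustrative values
from the cover letter:

#include <stdint.h>
#include <stdio.h>

/* Mirrors the two bitfields this patch adds to struct msi_desc */
struct demo_desc {
        uint64_t iommu_msi_iova : 58;
        uint64_t iommu_msi_page_shift : 6;
};

static void demo_set_addr(const struct demo_desc *desc, uint64_t msi_addr,
                          uint32_t *hi, uint32_t *lo)
{
        if (desc->iommu_msi_page_shift) {
                uint64_t iova = desc->iommu_msi_iova
                                << desc->iommu_msi_page_shift;

                /* Replace the address but keep the intra-page offset */
                *hi = (uint32_t)(iova >> 32);
                *lo = (uint32_t)iova |
                      (uint32_t)(msi_addr &
                                 ((1u << desc->iommu_msi_page_shift) - 1));
                return;
        }
        *hi = (uint32_t)(msi_addr >> 32);
        *lo = (uint32_t)msi_addr;
}

int main(void)
{
        struct demo_desc desc = { 0 };
        uint32_t hi, lo;

        /* prepare: ITS page mapped at IOVA 0xFFFF0000 with a 4K granule */
        desc.iommu_msi_iova = 0xFFFF0000ULL >> 12;
        desc.iommu_msi_page_shift = 12;

        /* compose: physical doorbell 0x20200040 -> IOVA 0xFFFF0040 */
        demo_set_addr(&desc, 0x20200040, &hi, &lo);
        printf("address_hi=%#x address_lo=%#x\n", hi, lo);
        return 0;
}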
* [PATCH v1 02/13] genirq/msi: Rename iommu_dma_compose_msi_msg() to msi_msg_set_addr()
2025-02-08 9:02 [PATCH v1 00/13] iommu: Add MSI mapping support with nested SMMU Nicolin Chen
2025-02-08 9:02 ` [PATCH v1 01/13] genirq/msi: Store the IOMMU IOVA directly in msi_desc instead of iommu_cookie Nicolin Chen
@ 2025-02-08 9:02 ` Nicolin Chen
2025-02-13 12:11 ` Thomas Gleixner
2025-02-08 9:02 ` [PATCH v1 03/13] iommu: Make iommu_dma_prepare_msi() into a generic operation Nicolin Chen
` (11 subsequent siblings)
13 siblings, 1 reply; 28+ messages in thread
From: Nicolin Chen @ 2025-02-08 9:02 UTC (permalink / raw)
To: jgg, kevin.tian, tglx, maz
Cc: joro, will, robin.murphy, shuah, iommu, linux-kernel,
linux-arm-kernel, linux-kselftest, eric.auger, baolu.lu, yi.l.liu,
yury.norov, jacob.pan, patches
From: Jason Gunthorpe <jgg@nvidia.com>
The new function takes in a u64 MSI address and stores it in the msi_msg.
If the iommu has provided an alternative address, then that is used
instead.
All callers already have a tidy u64, so this also consolidates the repeated
low/high code into a small helper.
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Signed-off-by: Nicolin Chen <nicolinc@nvidia.com>
---
include/linux/msi.h | 18 ++++++++----------
drivers/irqchip/irq-gic-v2m.c | 5 +----
drivers/irqchip/irq-gic-v3-its.c | 13 +++----------
drivers/irqchip/irq-gic-v3-mbi.c | 12 ++++--------
drivers/irqchip/irq-ls-scfg-msi.c | 5 ++---
5 files changed, 18 insertions(+), 35 deletions(-)
diff --git a/include/linux/msi.h b/include/linux/msi.h
index 74c6a823f157..2514116fa5dd 100644
--- a/include/linux/msi.h
+++ b/include/linux/msi.h
@@ -300,13 +300,8 @@ static inline void msi_desc_set_iommu_msi_iova(struct msi_desc *desc,
#endif
}
-/**
- * iommu_dma_compose_msi_msg() - Apply translation to an MSI message
- * @desc: MSI descriptor prepared by iommu_dma_prepare_msi()
- * @msg: MSI message containing target physical address
- */
-static inline void iommu_dma_compose_msi_msg(struct msi_desc *desc,
- struct msi_msg *msg)
+static inline void msi_msg_set_addr(struct msi_desc *desc, struct msi_msg *msg,
+ u64 msi_addr)
{
#ifdef CONFIG_IRQ_MSI_IOMMU
if (desc->iommu_msi_page_shift) {
@@ -314,11 +309,14 @@ static inline void iommu_dma_compose_msi_msg(struct msi_desc *desc,
<< desc->iommu_msi_page_shift;
msg->address_hi = upper_32_bits(msi_iova);
- msg->address_lo = lower_32_bits(msi_iova) |
- (msg->address_lo &
- ((1 << desc->iommu_msi_page_shift) - 1));
+ msg->address_lo =
+ lower_32_bits(msi_iova) |
+ (msi_addr & ((1 << desc->iommu_msi_page_shift) - 1));
+ return;
}
#endif
+ msg->address_hi = upper_32_bits(msi_addr);
+ msg->address_lo = lower_32_bits(msi_addr);
}
int msi_domain_insert_msi_desc(struct device *dev, unsigned int domid,
diff --git a/drivers/irqchip/irq-gic-v2m.c b/drivers/irqchip/irq-gic-v2m.c
index be35c5349986..57e0470e0d13 100644
--- a/drivers/irqchip/irq-gic-v2m.c
+++ b/drivers/irqchip/irq-gic-v2m.c
@@ -87,9 +87,6 @@ static void gicv2m_compose_msi_msg(struct irq_data *data, struct msi_msg *msg)
struct v2m_data *v2m = irq_data_get_irq_chip_data(data);
phys_addr_t addr = gicv2m_get_msi_addr(v2m, data->hwirq);
- msg->address_hi = upper_32_bits(addr);
- msg->address_lo = lower_32_bits(addr);
-
if (v2m->flags & GICV2M_GRAVITON_ADDRESS_ONLY)
msg->data = 0;
else
@@ -97,7 +94,7 @@ static void gicv2m_compose_msi_msg(struct irq_data *data, struct msi_msg *msg)
if (v2m->flags & GICV2M_NEEDS_SPI_OFFSET)
msg->data -= v2m->spi_offset;
- iommu_dma_compose_msi_msg(irq_data_get_msi_desc(data), msg);
+ msi_msg_set_addr(irq_data_get_msi_desc(data), msg, addr);
}
static struct irq_chip gicv2m_irq_chip = {
diff --git a/drivers/irqchip/irq-gic-v3-its.c b/drivers/irqchip/irq-gic-v3-its.c
index 8c3ec5734f1e..ce0bf70b9eaf 100644
--- a/drivers/irqchip/irq-gic-v3-its.c
+++ b/drivers/irqchip/irq-gic-v3-its.c
@@ -1809,17 +1809,10 @@ static u64 its_irq_get_msi_base(struct its_device *its_dev)
static void its_irq_compose_msi_msg(struct irq_data *d, struct msi_msg *msg)
{
struct its_device *its_dev = irq_data_get_irq_chip_data(d);
- struct its_node *its;
- u64 addr;
-
- its = its_dev->its;
- addr = its->get_msi_base(its_dev);
-
- msg->address_lo = lower_32_bits(addr);
- msg->address_hi = upper_32_bits(addr);
- msg->data = its_get_event_id(d);
- iommu_dma_compose_msi_msg(irq_data_get_msi_desc(d), msg);
+ msg->data = its_get_event_id(d);
+ msi_msg_set_addr(irq_data_get_msi_desc(d), msg,
+ its_dev->its->get_msi_base(its_dev));
}
static int its_irq_set_irqchip_state(struct irq_data *d,
diff --git a/drivers/irqchip/irq-gic-v3-mbi.c b/drivers/irqchip/irq-gic-v3-mbi.c
index 3fe870f8ee17..a6510128611e 100644
--- a/drivers/irqchip/irq-gic-v3-mbi.c
+++ b/drivers/irqchip/irq-gic-v3-mbi.c
@@ -147,22 +147,18 @@ static const struct irq_domain_ops mbi_domain_ops = {
static void mbi_compose_msi_msg(struct irq_data *data, struct msi_msg *msg)
{
- msg[0].address_hi = upper_32_bits(mbi_phys_base + GICD_SETSPI_NSR);
- msg[0].address_lo = lower_32_bits(mbi_phys_base + GICD_SETSPI_NSR);
msg[0].data = data->parent_data->hwirq;
-
- iommu_dma_compose_msi_msg(irq_data_get_msi_desc(data), msg);
+ msi_msg_set_addr(irq_data_get_msi_desc(data), &msg[0],
+ mbi_phys_base + GICD_SETSPI_NSR);
}
static void mbi_compose_mbi_msg(struct irq_data *data, struct msi_msg *msg)
{
mbi_compose_msi_msg(data, msg);
- msg[1].address_hi = upper_32_bits(mbi_phys_base + GICD_CLRSPI_NSR);
- msg[1].address_lo = lower_32_bits(mbi_phys_base + GICD_CLRSPI_NSR);
msg[1].data = data->parent_data->hwirq;
-
- iommu_dma_compose_msi_msg(irq_data_get_msi_desc(data), &msg[1]);
+ msi_msg_set_addr(irq_data_get_msi_desc(data), &msg[1],
+ mbi_phys_base + GICD_CLRSPI_NSR);
}
static bool mbi_init_dev_msi_info(struct device *dev, struct irq_domain *domain,
diff --git a/drivers/irqchip/irq-ls-scfg-msi.c b/drivers/irqchip/irq-ls-scfg-msi.c
index c0e1aafe468c..3cb80796cc7c 100644
--- a/drivers/irqchip/irq-ls-scfg-msi.c
+++ b/drivers/irqchip/irq-ls-scfg-msi.c
@@ -87,8 +87,6 @@ static void ls_scfg_msi_compose_msg(struct irq_data *data, struct msi_msg *msg)
{
struct ls_scfg_msi *msi_data = irq_data_get_irq_chip_data(data);
- msg->address_hi = upper_32_bits(msi_data->msiir_addr);
- msg->address_lo = lower_32_bits(msi_data->msiir_addr);
msg->data = data->hwirq;
if (msi_affinity_flag) {
@@ -98,7 +96,8 @@ static void ls_scfg_msi_compose_msg(struct irq_data *data, struct msi_msg *msg)
msg->data |= cpumask_first(mask);
}
- iommu_dma_compose_msi_msg(irq_data_get_msi_desc(data), msg);
+ msi_msg_set_addr(irq_data_get_msi_desc(data), msg,
+ msi_data->msiir_addr);
}
static int ls_scfg_msi_set_affinity(struct irq_data *irq_data,
--
2.43.0
* [PATCH v1 03/13] iommu: Make iommu_dma_prepare_msi() into a generic operation
2025-02-08 9:02 [PATCH v1 00/13] iommu: Add MSI mapping support with nested SMMU Nicolin Chen
2025-02-08 9:02 ` [PATCH v1 01/13] genirq/msi: Store the IOMMU IOVA directly in msi_desc instead of iommu_cookie Nicolin Chen
2025-02-08 9:02 ` [PATCH v1 02/13] genirq/msi: Rename iommu_dma_compose_msi_msg() to msi_msg_set_addr() Nicolin Chen
@ 2025-02-08 9:02 ` Nicolin Chen
2025-02-08 9:02 ` [PATCH v1 04/13] irqchip: Have CONFIG_IRQ_MSI_IOMMU be selected by the irqchips that need it Nicolin Chen
` (10 subsequent siblings)
13 siblings, 0 replies; 28+ messages in thread
From: Nicolin Chen @ 2025-02-08 9:02 UTC (permalink / raw)
To: jgg, kevin.tian, tglx, maz
Cc: joro, will, robin.murphy, shuah, iommu, linux-kernel,
linux-arm-kernel, linux-kselftest, eric.auger, baolu.lu, yi.l.liu,
yury.norov, jacob.pan, patches
From: Jason Gunthorpe <jgg@nvidia.com>
SW_MSI allows the IOMMU to translate an MSI message before the MSI message
is delivered to the interrupt controller. On such systems the iommu_domain
must have a translation for the MSI message for interrupts to work.
The IRQ subsystem calls into the IOMMU to request that a physical page be
set up to receive MSI messages, and the IOMMU then sets up an IOVA that
maps to that physical page. Ultimately the IOVA is programmed into the
device via the msi_msg.
Generalize this to allow the iommu_domain owner to provide its own
implementation of this mapping. Add a function pointer to struct
iommu_domain to allow the domain owner to provide an implementation.
Have dma-iommu supply its implementation for IOMMU_DOMAIN_DMA types during
the iommu_get_dma_cookie() path. For IOMMU_DOMAIN_UNMANAGED types used by
VFIO (and iommufd for now), have the same iommu_dma_sw_msi set as well in
the iommu_get_msi_cookie() path.
Hold the group mutex while in iommu_dma_prepare_msi() to ensure the domain
doesn't change or become freed while running. Races with IRQ operations
from VFIO and domain changes from iommufd are possible here.
Replace the msi_prepare_lock with a lockdep assertion for the group mutex
as documentation. For dma-iommu.c, each iommu_domain is unique to a group.
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
[nicolinc: move iommu_domain_set_sw_msi() from iommu_dma_init_domain() to
iommu_get_dma_cookie(); add in iommu_put_dma_cookie() an sw_msi test]
Signed-off-by: Nicolin Chen <nicolinc@nvidia.com>
---
include/linux/iommu.h | 44 ++++++++++++++++++++++++++-------------
drivers/iommu/dma-iommu.c | 33 +++++++++++++----------------
drivers/iommu/iommu.c | 29 ++++++++++++++++++++++++++
3 files changed, 73 insertions(+), 33 deletions(-)
diff --git a/include/linux/iommu.h b/include/linux/iommu.h
index caee952febd4..761c5e186de9 100644
--- a/include/linux/iommu.h
+++ b/include/linux/iommu.h
@@ -44,6 +44,8 @@ struct iommu_dma_cookie;
struct iommu_fault_param;
struct iommufd_ctx;
struct iommufd_viommu;
+struct msi_desc;
+struct msi_msg;
#define IOMMU_FAULT_PERM_READ (1 << 0) /* read */
#define IOMMU_FAULT_PERM_WRITE (1 << 1) /* write */
@@ -216,6 +218,12 @@ struct iommu_domain {
struct iommu_domain_geometry geometry;
struct iommu_dma_cookie *iova_cookie;
int (*iopf_handler)(struct iopf_group *group);
+
+#if IS_ENABLED(CONFIG_IRQ_MSI_IOMMU)
+ int (*sw_msi)(struct iommu_domain *domain, struct msi_desc *desc,
+ phys_addr_t msi_addr);
+#endif
+
void *fault_data;
union {
struct {
@@ -234,6 +242,16 @@ struct iommu_domain {
};
};
+static inline void iommu_domain_set_sw_msi(
+ struct iommu_domain *domain,
+ int (*sw_msi)(struct iommu_domain *domain, struct msi_desc *desc,
+ phys_addr_t msi_addr))
+{
+#if IS_ENABLED(CONFIG_IRQ_MSI_IOMMU)
+ domain->sw_msi = sw_msi;
+#endif
+}
+
static inline bool iommu_is_dma_domain(struct iommu_domain *domain)
{
return domain->type & __IOMMU_DOMAIN_DMA_API;
@@ -1470,6 +1488,18 @@ static inline ioasid_t iommu_alloc_global_pasid(struct device *dev)
static inline void iommu_free_global_pasid(ioasid_t pasid) {}
#endif /* CONFIG_IOMMU_API */
+#ifdef CONFIG_IRQ_MSI_IOMMU
+#ifdef CONFIG_IOMMU_API
+int iommu_dma_prepare_msi(struct msi_desc *desc, phys_addr_t msi_addr);
+#else
+static inline int iommu_dma_prepare_msi(struct msi_desc *desc,
+ phys_addr_t msi_addr)
+{
+ return 0;
+}
+#endif /* CONFIG_IOMMU_API */
+#endif /* CONFIG_IRQ_MSI_IOMMU */
+
#if IS_ENABLED(CONFIG_LOCKDEP) && IS_ENABLED(CONFIG_IOMMU_API)
void iommu_group_mutex_assert(struct device *dev);
#else
@@ -1503,26 +1533,12 @@ static inline void iommu_debugfs_setup(void) {}
#endif
#ifdef CONFIG_IOMMU_DMA
-#include <linux/msi.h>
-
int iommu_get_msi_cookie(struct iommu_domain *domain, dma_addr_t base);
-
-int iommu_dma_prepare_msi(struct msi_desc *desc, phys_addr_t msi_addr);
-
#else /* CONFIG_IOMMU_DMA */
-
-struct msi_desc;
-struct msi_msg;
-
static inline int iommu_get_msi_cookie(struct iommu_domain *domain, dma_addr_t base)
{
return -ENODEV;
}
-
-static inline int iommu_dma_prepare_msi(struct msi_desc *desc, phys_addr_t msi_addr)
-{
- return 0;
-}
#endif /* CONFIG_IOMMU_DMA */
/*
diff --git a/drivers/iommu/dma-iommu.c b/drivers/iommu/dma-iommu.c
index bf91e014d179..3b58244e6344 100644
--- a/drivers/iommu/dma-iommu.c
+++ b/drivers/iommu/dma-iommu.c
@@ -24,6 +24,7 @@
#include <linux/memremap.h>
#include <linux/mm.h>
#include <linux/mutex.h>
+#include <linux/msi.h>
#include <linux/of_iommu.h>
#include <linux/pci.h>
#include <linux/scatterlist.h>
@@ -102,6 +103,9 @@ static int __init iommu_dma_forcedac_setup(char *str)
}
early_param("iommu.forcedac", iommu_dma_forcedac_setup);
+static int iommu_dma_sw_msi(struct iommu_domain *domain, struct msi_desc *desc,
+ phys_addr_t msi_addr);
+
/* Number of entries per flush queue */
#define IOVA_DEFAULT_FQ_SIZE 256
#define IOVA_SINGLE_FQ_SIZE 32768
@@ -398,6 +402,7 @@ int iommu_get_dma_cookie(struct iommu_domain *domain)
return -ENOMEM;
mutex_init(&domain->iova_cookie->mutex);
+ iommu_domain_set_sw_msi(domain, iommu_dma_sw_msi);
return 0;
}
@@ -429,6 +434,7 @@ int iommu_get_msi_cookie(struct iommu_domain *domain, dma_addr_t base)
cookie->msi_iova = base;
domain->iova_cookie = cookie;
+ iommu_domain_set_sw_msi(domain, iommu_dma_sw_msi);
return 0;
}
EXPORT_SYMBOL(iommu_get_msi_cookie);
@@ -443,6 +449,9 @@ void iommu_put_dma_cookie(struct iommu_domain *domain)
struct iommu_dma_cookie *cookie = domain->iova_cookie;
struct iommu_dma_msi_page *msi, *tmp;
+ if (domain->sw_msi != iommu_dma_sw_msi)
+ return;
+
if (!cookie)
return;
@@ -1800,33 +1809,19 @@ static struct iommu_dma_msi_page *iommu_dma_get_msi_page(struct device *dev,
return NULL;
}
-/**
- * iommu_dma_prepare_msi() - Map the MSI page in the IOMMU domain
- * @desc: MSI descriptor, will store the MSI page
- * @msi_addr: MSI target address to be mapped
- *
- * Return: 0 on success or negative error code if the mapping failed.
- */
-int iommu_dma_prepare_msi(struct msi_desc *desc, phys_addr_t msi_addr)
+static int iommu_dma_sw_msi(struct iommu_domain *domain, struct msi_desc *desc,
+ phys_addr_t msi_addr)
{
struct device *dev = msi_desc_to_dev(desc);
- struct iommu_domain *domain = iommu_get_domain_for_dev(dev);
- struct iommu_dma_msi_page *msi_page;
- static DEFINE_MUTEX(msi_prepare_lock); /* see below */
+ const struct iommu_dma_msi_page *msi_page;
- if (!domain || !domain->iova_cookie) {
+ if (!domain->iova_cookie) {
msi_desc_set_iommu_msi_iova(desc, 0, 0);
return 0;
}
- /*
- * In fact the whole prepare operation should already be serialised by
- * irq_domain_mutex further up the callchain, but that's pretty subtle
- * on its own, so consider this locking as failsafe documentation...
- */
- mutex_lock(&msi_prepare_lock);
+ iommu_group_mutex_assert(dev);
msi_page = iommu_dma_get_msi_page(dev, msi_addr, domain);
- mutex_unlock(&msi_prepare_lock);
if (!msi_page)
return -ENOMEM;
diff --git a/drivers/iommu/iommu.c b/drivers/iommu/iommu.c
index 870c3cdbd0f6..022bf96a18c5 100644
--- a/drivers/iommu/iommu.c
+++ b/drivers/iommu/iommu.c
@@ -3596,3 +3596,32 @@ int iommu_replace_group_handle(struct iommu_group *group,
return ret;
}
EXPORT_SYMBOL_NS_GPL(iommu_replace_group_handle, "IOMMUFD_INTERNAL");
+
+#if IS_ENABLED(CONFIG_IRQ_MSI_IOMMU)
+/**
+ * iommu_dma_prepare_msi() - Map the MSI page in the IOMMU domain
+ * @desc: MSI descriptor, will store the MSI page
+ * @msi_addr: MSI target address to be mapped
+ *
+ * The implementation of sw_msi() should take msi_addr and map it to
+ * an IOVA in the domain and call msi_desc_set_iommu_msi_iova() with the
+ * mapping information.
+ *
+ * Return: 0 on success or negative error code if the mapping failed.
+ */
+int iommu_dma_prepare_msi(struct msi_desc *desc, phys_addr_t msi_addr)
+{
+ struct device *dev = msi_desc_to_dev(desc);
+ struct iommu_group *group = dev->iommu_group;
+ int ret = 0;
+
+ if (!group)
+ return 0;
+
+ mutex_lock(&group->mutex);
+ if (group->domain && group->domain->sw_msi)
+ ret = group->domain->sw_msi(group->domain, desc, msi_addr);
+ mutex_unlock(&group->mutex);
+ return ret;
+}
+#endif /* CONFIG_IRQ_MSI_IOMMU */
--
2.43.0
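For a picture of what this op enables beyond dma-iommu, below is a minimal
sketch of a hypothetical domain owner plugging in its own MSI mapping policy.
The my_owner_* names and the doorbell-mapping helper do not exist; the real
iommufd-side implementation comes later in this series:

/*
 * Hypothetical handler, invoked through domain->sw_msi by
 * iommu_dma_prepare_msi() with the group mutex held.
 */
static int my_owner_sw_msi(struct iommu_domain *domain,
                           struct msi_desc *desc, phys_addr_t msi_addr)
{
        dma_addr_t iova;
        int rc;

        /*
         * Owner-specific: pick an IOVA and map the doorbell page into
         * the domain.
         */
        rc = my_owner_map_doorbell(domain, msi_addr & PAGE_MASK, &iova);
        if (rc)
                return rc;

        /* Record the IOVA so the irqchip composes the msi_msg with it */
        msi_desc_set_iommu_msi_iova(desc, iova, PAGE_SHIFT);
        return 0;
}

The owner would then register it with iommu_domain_set_sw_msi(domain,
my_owner_sw_msi) when allocating the domain, mirroring what dma-iommu does
in iommu_get_dma_cookie() above.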
* [PATCH v1 04/13] irqchip: Have CONFIG_IRQ_MSI_IOMMU be selected by the irqchips that need it
2025-02-08 9:02 [PATCH v1 00/13] iommu: Add MSI mapping support with nested SMMU Nicolin Chen
` (2 preceding siblings ...)
2025-02-08 9:02 ` [PATCH v1 03/13] iommu: Make iommu_dma_prepare_msi() into a generic operation Nicolin Chen
@ 2025-02-08 9:02 ` Nicolin Chen
2025-02-08 9:02 ` [PATCH v1 05/13] iommu: Turn fault_data to iommufd private pointer Nicolin Chen
` (9 subsequent siblings)
13 siblings, 0 replies; 28+ messages in thread
From: Nicolin Chen @ 2025-02-08 9:02 UTC (permalink / raw)
To: jgg, kevin.tian, tglx, maz
Cc: joro, will, robin.murphy, shuah, iommu, linux-kernel,
linux-arm-kernel, linux-kselftest, eric.auger, baolu.lu, yi.l.liu,
yury.norov, jacob.pan, patches
From: Jason Gunthorpe <jgg@nvidia.com>
Currently IRQ_MSI_IOMMU is selected if IOMMU_DMA is available, to provide
an implementation for iommu_dma_prepare/compose_msi_msg(). However, it
makes more sense for the irqchips that call prepare/compose to select it,
and to have that selection trigger all the additional code and data to be
compiled into the kernel.
If IRQ_MSI_IOMMU is selected with no IOMMU side implementation then
prepare/compose will be NOP stubs.
If IRQ_MSI_IOMMU is not selected by an irqchip then the related code on
the iommu side is compiled out.
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Signed-off-by: Nicolin Chen <nicolinc@nvidia.com>
---
drivers/iommu/Kconfig | 1 -
drivers/irqchip/Kconfig | 4 ++++
kernel/irq/Kconfig | 1 +
3 files changed, 5 insertions(+), 1 deletion(-)
diff --git a/drivers/iommu/Kconfig b/drivers/iommu/Kconfig
index ec1b5e32b972..5124e7431fe3 100644
--- a/drivers/iommu/Kconfig
+++ b/drivers/iommu/Kconfig
@@ -154,7 +154,6 @@ config IOMMU_DMA
select DMA_OPS_HELPERS
select IOMMU_API
select IOMMU_IOVA
- select IRQ_MSI_IOMMU
select NEED_SG_DMA_LENGTH
select NEED_SG_DMA_FLAGS if SWIOTLB
diff --git a/drivers/irqchip/Kconfig b/drivers/irqchip/Kconfig
index be063bfb50c4..84015decc3a4 100644
--- a/drivers/irqchip/Kconfig
+++ b/drivers/irqchip/Kconfig
@@ -28,6 +28,7 @@ config ARM_GIC_V2M
select ARM_GIC
select IRQ_MSI_LIB
select PCI_MSI
+ select IRQ_MSI_IOMMU
config GIC_NON_BANKED
bool
@@ -38,12 +39,14 @@ config ARM_GIC_V3
select PARTITION_PERCPU
select GENERIC_IRQ_EFFECTIVE_AFF_MASK if SMP
select HAVE_ARM_SMCCC_DISCOVERY
+ select IRQ_MSI_IOMMU
config ARM_GIC_V3_ITS
bool
select GENERIC_MSI_IRQ
select IRQ_MSI_LIB
default ARM_GIC_V3
+ select IRQ_MSI_IOMMU
config ARM_GIC_V3_ITS_FSL_MC
bool
@@ -407,6 +410,7 @@ config LS_EXTIRQ
config LS_SCFG_MSI
def_bool y if SOC_LS1021A || ARCH_LAYERSCAPE
+ select IRQ_MSI_IOMMU
depends on PCI_MSI
config PARTITION_PERCPU
diff --git a/kernel/irq/Kconfig b/kernel/irq/Kconfig
index 5432418c0fea..9636aed20401 100644
--- a/kernel/irq/Kconfig
+++ b/kernel/irq/Kconfig
@@ -100,6 +100,7 @@ config GENERIC_MSI_IRQ
bool
select IRQ_DOMAIN_HIERARCHY
+# irqchip drivers should select this if they call iommu_dma_prepare_msi()
config IRQ_MSI_IOMMU
bool
--
2.43.0
* [PATCH v1 05/13] iommu: Turn fault_data to iommufd private pointer
2025-02-08 9:02 [PATCH v1 00/13] iommu: Add MSI mapping support with nested SMMU Nicolin Chen
` (3 preceding siblings ...)
2025-02-08 9:02 ` [PATCH v1 04/13] irqchip: Have CONFIG_IRQ_MSI_IOMMU be selected by the irqchips that need it Nicolin Chen
@ 2025-02-08 9:02 ` Nicolin Chen
2025-02-08 9:02 ` [PATCH v1 06/13] iommufd: Implement sw_msi support natively Nicolin Chen
` (8 subsequent siblings)
13 siblings, 0 replies; 28+ messages in thread
From: Nicolin Chen @ 2025-02-08 9:02 UTC (permalink / raw)
To: jgg, kevin.tian, tglx, maz
Cc: joro, will, robin.murphy, shuah, iommu, linux-kernel,
linux-arm-kernel, linux-kselftest, eric.auger, baolu.lu, yi.l.liu,
yury.norov, jacob.pan, patches
A "fault_data" was added exclusively for the iommufd_fault_iopf_handler()
used by IOPF/PRI use cases, along with the attach_handle. Now, the iommufd
version of sw_msi function will reuse the attach_handle and fault_data for
a non-fault case.
Rename "fault_data" to "iommufd_hwpt" so as not to confine it to a "fault"
case. Move it into a union to be the iommufd private pointer. A following
patch will move the iova_cookie to the union for dma-iommu too, after the
iommufd_sw_msi implementation is added.
Since we have two unions now, add some simple comments for readability.
Signed-off-by: Nicolin Chen <nicolinc@nvidia.com>
---
include/linux/iommu.h | 6 ++++--
drivers/iommu/iommufd/fault.c | 2 +-
drivers/iommu/iommufd/hw_pagetable.c | 2 +-
3 files changed, 6 insertions(+), 4 deletions(-)
diff --git a/include/linux/iommu.h b/include/linux/iommu.h
index 761c5e186de9..e93d2e918599 100644
--- a/include/linux/iommu.h
+++ b/include/linux/iommu.h
@@ -224,8 +224,10 @@ struct iommu_domain {
phys_addr_t msi_addr);
#endif
- void *fault_data;
- union {
+ union { /* Pointer usable by owner of the domain */
+ struct iommufd_hw_pagetable *iommufd_hwpt; /* iommufd */
+ };
+ union { /* Fault handler */
struct {
iommu_fault_handler_t handler;
void *handler_token;
diff --git a/drivers/iommu/iommufd/fault.c b/drivers/iommu/iommufd/fault.c
index 931a3fbe6e32..c48d72c9668c 100644
--- a/drivers/iommu/iommufd/fault.c
+++ b/drivers/iommu/iommufd/fault.c
@@ -329,7 +329,7 @@ int iommufd_fault_iopf_handler(struct iopf_group *group)
struct iommufd_hw_pagetable *hwpt;
struct iommufd_fault *fault;
- hwpt = group->attach_handle->domain->fault_data;
+ hwpt = group->attach_handle->domain->iommufd_hwpt;
fault = hwpt->fault;
spin_lock(&fault->lock);
diff --git a/drivers/iommu/iommufd/hw_pagetable.c b/drivers/iommu/iommufd/hw_pagetable.c
index 598be26a14e2..2641d50f46cf 100644
--- a/drivers/iommu/iommufd/hw_pagetable.c
+++ b/drivers/iommu/iommufd/hw_pagetable.c
@@ -406,10 +406,10 @@ int iommufd_hwpt_alloc(struct iommufd_ucmd *ucmd)
}
hwpt->fault = fault;
hwpt->domain->iopf_handler = iommufd_fault_iopf_handler;
- hwpt->domain->fault_data = hwpt;
refcount_inc(&fault->obj.users);
iommufd_put_object(ucmd->ictx, &fault->obj);
}
+ hwpt->domain->iommufd_hwpt = hwpt;
cmd->out_hwpt_id = hwpt->obj.id;
rc = iommufd_ucmd_respond(ucmd, sizeof(*cmd));
--
2.43.0
* [PATCH v1 06/13] iommufd: Implement sw_msi support natively
2025-02-08 9:02 [PATCH v1 00/13] iommu: Add MSI mapping support with nested SMMU Nicolin Chen
` (4 preceding siblings ...)
2025-02-08 9:02 ` [PATCH v1 05/13] iommu: Turn fault_data to iommufd private pointer Nicolin Chen
@ 2025-02-08 9:02 ` Nicolin Chen
2025-02-11 18:16 ` Jason Gunthorpe
2025-02-08 9:02 ` [PATCH v1 07/13] iommu: Turn iova_cookie to dma-iommu private pointer Nicolin Chen
` (7 subsequent siblings)
13 siblings, 1 reply; 28+ messages in thread
From: Nicolin Chen @ 2025-02-08 9:02 UTC (permalink / raw)
To: jgg, kevin.tian, tglx, maz
Cc: joro, will, robin.murphy, shuah, iommu, linux-kernel,
linux-arm-kernel, linux-kselftest, eric.auger, baolu.lu, yi.l.liu,
yury.norov, jacob.pan, patches
From: Jason Gunthorpe <jgg@nvidia.com>
iommufd has a model where the iommu_domain can be changed while the VFIO
device is attached. In this case the MSI should continue to work. This
corner case has not worked because the dma-iommu implementation of sw_msi
is tied to a single domain.
Implement the sw_msi mapping directly and use a global per-fd table to
associate the assigned IOVA with the MSI pages. This allows the MSI pages
to be loaded into a domain before it is attached, ensuring that MSI is not
disrupted.
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Signed-off-by: Nicolin Chen <nicolinc@nvidia.com>
---
drivers/iommu/iommufd/iommufd_private.h | 23 +++-
drivers/iommu/iommufd/device.c | 174 +++++++++++++++++++-----
drivers/iommu/iommufd/hw_pagetable.c | 3 +
drivers/iommu/iommufd/main.c | 9 ++
4 files changed, 175 insertions(+), 34 deletions(-)
diff --git a/drivers/iommu/iommufd/iommufd_private.h b/drivers/iommu/iommufd/iommufd_private.h
index 8e0e3ab64747..246297452a44 100644
--- a/drivers/iommu/iommufd/iommufd_private.h
+++ b/drivers/iommu/iommufd/iommufd_private.h
@@ -19,6 +19,22 @@ struct iommu_group;
struct iommu_option;
struct iommufd_device;
+struct iommufd_sw_msi_map {
+ struct list_head sw_msi_item;
+ phys_addr_t sw_msi_start;
+ phys_addr_t msi_addr;
+ unsigned int pgoff;
+ unsigned int id;
+};
+
+/* Bitmap of struct iommufd_sw_msi_map::id */
+struct iommufd_sw_msi_maps {
+ DECLARE_BITMAP(bitmap, 64);
+};
+
+int iommufd_sw_msi(struct iommu_domain *domain, struct msi_desc *desc,
+ phys_addr_t msi_addr);
+
struct iommufd_ctx {
struct file *file;
struct xarray objects;
@@ -26,6 +42,10 @@ struct iommufd_ctx {
wait_queue_head_t destroy_wait;
struct rw_semaphore ioas_creation_lock;
+ struct mutex sw_msi_lock;
+ struct list_head sw_msi_list;
+ unsigned int sw_msi_id;
+
u8 account_mode;
/* Compatibility with VFIO no iommu */
u8 no_iommu_mode;
@@ -283,10 +303,10 @@ struct iommufd_hwpt_paging {
struct iommufd_ioas *ioas;
bool auto_domain : 1;
bool enforce_cache_coherency : 1;
- bool msi_cookie : 1;
bool nest_parent : 1;
/* Head at iommufd_ioas::hwpt_list */
struct list_head hwpt_item;
+ struct iommufd_sw_msi_maps present_sw_msi;
};
struct iommufd_hwpt_nested {
@@ -383,6 +403,7 @@ struct iommufd_group {
struct iommu_group *group;
struct iommufd_hw_pagetable *hwpt;
struct list_head device_list;
+ struct iommufd_sw_msi_maps required_sw_msi;
phys_addr_t sw_msi_start;
};
diff --git a/drivers/iommu/iommufd/device.c b/drivers/iommu/iommufd/device.c
index 360ba3ed8545..e435ba13a8a5 100644
--- a/drivers/iommu/iommufd/device.c
+++ b/drivers/iommu/iommufd/device.c
@@ -5,6 +5,7 @@
#include <linux/iommufd.h>
#include <linux/slab.h>
#include <uapi/linux/iommufd.h>
+#include <linux/msi.h>
#include "../iommu-priv.h"
#include "io_pagetable.h"
@@ -293,36 +294,152 @@ u32 iommufd_device_to_id(struct iommufd_device *idev)
}
EXPORT_SYMBOL_NS_GPL(iommufd_device_to_id, "IOMMUFD");
+/*
+ * Get a iommufd_sw_msi_map for the msi physical address requested by the irq
+ * layer. The mapping to IOVA is global to the iommufd file descriptor, every
+ * domain that is attached to a device using the same MSI parameters will use
+ * the same IOVA.
+ */
+static struct iommufd_sw_msi_map *
+iommufd_sw_msi_get_map(struct iommufd_ctx *ictx, phys_addr_t msi_addr,
+ phys_addr_t sw_msi_start)
+{
+ struct iommufd_sw_msi_map *cur;
+ unsigned int max_pgoff = 0;
+
+ lockdep_assert_held(&ictx->sw_msi_lock);
+
+ list_for_each_entry(cur, &ictx->sw_msi_list, sw_msi_item) {
+ if (cur->sw_msi_start != sw_msi_start)
+ continue;
+ max_pgoff = max(max_pgoff, cur->pgoff + 1);
+ if (cur->msi_addr == msi_addr)
+ return cur;
+ }
+
+ if (ictx->sw_msi_id >=
+ BITS_PER_BYTE * sizeof_field(struct iommufd_sw_msi_maps, bitmap))
+ return ERR_PTR(-EOVERFLOW);
+
+ cur = kzalloc(sizeof(*cur), GFP_KERNEL);
+ if (!cur)
+ return ERR_PTR(-ENOMEM);
+ cur->sw_msi_start = sw_msi_start;
+ cur->msi_addr = msi_addr;
+ cur->pgoff = max_pgoff;
+ cur->id = ictx->sw_msi_id++;
+ list_add_tail(&cur->sw_msi_item, &ictx->sw_msi_list);
+ return cur;
+}
+
+static int iommufd_sw_msi_install(struct iommufd_ctx *ictx,
+ struct iommufd_hwpt_paging *hwpt_paging,
+ struct iommufd_sw_msi_map *msi_map)
+{
+ unsigned long iova;
+
+ lockdep_assert_held(&ictx->sw_msi_lock);
+
+ iova = msi_map->sw_msi_start + msi_map->pgoff * PAGE_SIZE;
+ if (!test_bit(msi_map->id, hwpt_paging->present_sw_msi.bitmap)) {
+ int rc;
+
+ rc = iommu_map(hwpt_paging->common.domain, iova,
+ msi_map->msi_addr, PAGE_SIZE,
+ IOMMU_WRITE | IOMMU_READ | IOMMU_MMIO,
+ GFP_KERNEL_ACCOUNT);
+ if (rc)
+ return rc;
+ __set_bit(msi_map->id, hwpt_paging->present_sw_msi.bitmap);
+ }
+ return 0;
+}
+
+static struct iommufd_attach_handle *
+iommu_group_get_iommufd_handle(struct iommu_group *group)
+{
+ struct iommu_attach_handle *handle;
+
+ handle = iommu_attach_handle_get(group, IOMMU_NO_PASID, 0);
+ if (IS_ERR(handle))
+ return NULL;
+ return to_iommufd_handle(handle);
+}
+
+/*
+ * Called by the irq code if the platform translates the MSI address through the
+ * IOMMU. msi_addr is the physical address of the MSI page. iommufd will
+ * allocate a fd global iova for the physical page that is the same on all
+ * domains and devices.
+ */
+#ifdef CONFIG_IRQ_MSI_IOMMU
+int iommufd_sw_msi(struct iommu_domain *domain, struct msi_desc *desc,
+ phys_addr_t msi_addr)
+{
+ struct device *dev = msi_desc_to_dev(desc);
+ struct iommufd_hwpt_paging *hwpt_paging;
+ struct iommufd_attach_handle *handle;
+ struct iommufd_sw_msi_map *msi_map;
+ struct iommufd_ctx *ictx;
+ unsigned long iova;
+ int rc;
+
+ handle = iommu_group_get_iommufd_handle(dev->iommu_group);
+ if (!handle)
+ return 0;
+ hwpt_paging = find_hwpt_paging(domain->iommufd_hwpt);
+
+ /* No IOMMU_RESV_SW_MSI means no change to the msi_msg */
+ if (handle->idev->igroup->sw_msi_start == PHYS_ADDR_MAX)
+ return 0;
+
+ ictx = handle->idev->ictx;
+ guard(mutex)(&ictx->sw_msi_lock);
+ /*
+ * The input msi_addr is the exact byte offset of the MSI doorbell. We
+ * assume the caller has checked that it is contained within an MMIO
+ * region that is secure to map at PAGE_SIZE.
+ */
+ msi_map = iommufd_sw_msi_get_map(handle->idev->ictx,
+ msi_addr & PAGE_MASK,
+ handle->idev->igroup->sw_msi_start);
+ if (IS_ERR(msi_map))
+ return PTR_ERR(msi_map);
+
+ rc = iommufd_sw_msi_install(ictx, hwpt_paging, msi_map);
+ if (rc)
+ return rc;
+ __set_bit(msi_map->id, handle->idev->igroup->required_sw_msi.bitmap);
+
+ iova = msi_map->sw_msi_start + msi_map->pgoff * PAGE_SIZE;
+ msi_desc_set_iommu_msi_iova(desc, iova, PAGE_SHIFT);
+ return 0;
+}
+#endif
+
static int iommufd_group_setup_msi(struct iommufd_group *igroup,
struct iommufd_hwpt_paging *hwpt_paging)
{
- phys_addr_t sw_msi_start = igroup->sw_msi_start;
- int rc;
+ struct iommufd_ctx *ictx = igroup->ictx;
+ struct iommufd_sw_msi_map *cur;
+
+ if (igroup->sw_msi_start == PHYS_ADDR_MAX)
+ return 0;
/*
- * If the IOMMU driver gives a IOMMU_RESV_SW_MSI then it is asking us to
- * call iommu_get_msi_cookie() on its behalf. This is necessary to setup
- * the MSI window so iommu_dma_prepare_msi() can install pages into our
- * domain after request_irq(). If it is not done interrupts will not
- * work on this domain.
- *
- * FIXME: This is conceptually broken for iommufd since we want to allow
- * userspace to change the domains, eg switch from an identity IOAS to a
- * DMA IOAS. There is currently no way to create a MSI window that
- * matches what the IRQ layer actually expects in a newly created
- * domain.
+ * Install all the MSI pages the device has been using into the domain
*/
- if (sw_msi_start != PHYS_ADDR_MAX && !hwpt_paging->msi_cookie) {
- rc = iommu_get_msi_cookie(hwpt_paging->common.domain,
- sw_msi_start);
+ guard(mutex)(&ictx->sw_msi_lock);
+ list_for_each_entry(cur, &ictx->sw_msi_list, sw_msi_item) {
+ int rc;
+
+ if (cur->sw_msi_start != igroup->sw_msi_start ||
+ !test_bit(cur->id, igroup->required_sw_msi.bitmap))
+ continue;
+
+ rc = iommufd_sw_msi_install(ictx, hwpt_paging, cur);
if (rc)
return rc;
-
- /*
- * iommu_get_msi_cookie() can only be called once per domain,
- * it returns -EBUSY on later calls.
- */
- hwpt_paging->msi_cookie = true;
}
return 0;
}
@@ -386,17 +503,8 @@ static int iommufd_hwpt_attach_device(struct iommufd_hw_pagetable *hwpt,
return rc;
}
-static struct iommufd_attach_handle *
-iommufd_device_get_attach_handle(struct iommufd_device *idev)
-{
- struct iommu_attach_handle *handle;
-
- handle =
- iommu_attach_handle_get(idev->igroup->group, IOMMU_NO_PASID, 0);
- if (IS_ERR(handle))
- return NULL;
- return to_iommufd_handle(handle);
-}
+#define iommufd_device_get_attach_handle(idev) \
+ iommu_group_get_iommufd_handle(idev->igroup->group)
static void iommufd_hwpt_detach_device(struct iommufd_hw_pagetable *hwpt,
struct iommufd_device *idev)
diff --git a/drivers/iommu/iommufd/hw_pagetable.c b/drivers/iommu/iommufd/hw_pagetable.c
index 2641d50f46cf..7de6e914232e 100644
--- a/drivers/iommu/iommufd/hw_pagetable.c
+++ b/drivers/iommu/iommufd/hw_pagetable.c
@@ -156,6 +156,7 @@ iommufd_hwpt_paging_alloc(struct iommufd_ctx *ictx, struct iommufd_ioas *ioas,
goto out_abort;
}
}
+ iommu_domain_set_sw_msi(hwpt->domain, iommufd_sw_msi);
/*
* Set the coherency mode before we do iopt_table_add_domain() as some
@@ -251,6 +252,7 @@ iommufd_hwpt_nested_alloc(struct iommufd_ctx *ictx,
goto out_abort;
}
hwpt->domain->owner = ops;
+ iommu_domain_set_sw_msi(hwpt->domain, iommufd_sw_msi);
if (WARN_ON_ONCE(hwpt->domain->type != IOMMU_DOMAIN_NESTED)) {
rc = -EINVAL;
@@ -307,6 +309,7 @@ iommufd_viommu_alloc_hwpt_nested(struct iommufd_viommu *viommu, u32 flags,
goto out_abort;
}
hwpt->domain->owner = viommu->iommu_dev->ops;
+ iommu_domain_set_sw_msi(hwpt->domain, iommufd_sw_msi);
if (WARN_ON_ONCE(hwpt->domain->type != IOMMU_DOMAIN_NESTED)) {
rc = -EINVAL;
diff --git a/drivers/iommu/iommufd/main.c b/drivers/iommu/iommufd/main.c
index ccf616462a1c..b6fa9fd11bc1 100644
--- a/drivers/iommu/iommufd/main.c
+++ b/drivers/iommu/iommufd/main.c
@@ -227,6 +227,8 @@ static int iommufd_fops_open(struct inode *inode, struct file *filp)
xa_init(&ictx->groups);
ictx->file = filp;
init_waitqueue_head(&ictx->destroy_wait);
+ mutex_init(&ictx->sw_msi_lock);
+ INIT_LIST_HEAD(&ictx->sw_msi_list);
filp->private_data = ictx;
return 0;
}
@@ -234,6 +236,8 @@ static int iommufd_fops_open(struct inode *inode, struct file *filp)
static int iommufd_fops_release(struct inode *inode, struct file *filp)
{
struct iommufd_ctx *ictx = filp->private_data;
+ struct iommufd_sw_msi_map *next;
+ struct iommufd_sw_msi_map *cur;
struct iommufd_object *obj;
/*
@@ -262,6 +266,11 @@ static int iommufd_fops_release(struct inode *inode, struct file *filp)
break;
}
WARN_ON(!xa_empty(&ictx->groups));
+
+ mutex_destroy(&ictx->sw_msi_lock);
+ list_for_each_entry_safe(cur, next, &ictx->sw_msi_list, sw_msi_item)
+ kfree(cur);
+
kfree(ictx);
return 0;
}
--
2.43.0
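To visualize the per-fd table, here is a small standalone sketch (plain
userspace C, not kernel code) of the slot layout that iommufd_sw_msi_get_map()
and iommufd_sw_msi_install() produce: every distinct doorbell physical address
under the same SW_MSI window gets its own page slot, and each domain attached
through this iommufd maps the same IOVA for it. The addresses are
illustrative:

#include <stdint.h>
#include <stdio.h>

#define DEMO_PAGE_SIZE 4096ULL

/* Simplified model of struct iommufd_sw_msi_map */
struct demo_map {
        uint64_t sw_msi_start;  /* from the device's IOMMU_RESV_SW_MSI */
        uint64_t msi_addr;      /* physical doorbell page */
        unsigned int pgoff;     /* slot index inside the window */
};

int main(void)
{
        /* Two ITS doorbell pages sharing one SW_MSI window */
        struct demo_map maps[] = {
                { 0x8000000, 0x20200000, 0 },
                { 0x8000000, 0x20210000, 1 },
        };

        for (unsigned int i = 0; i < 2; i++) {
                uint64_t iova = maps[i].sw_msi_start +
                                maps[i].pgoff * DEMO_PAGE_SIZE;

                /* Every attached domain installs this same PA -> IOVA pair */
                printf("PA 0x%llx -> IOVA 0x%llx\n",
                       (unsigned long long)maps[i].msi_addr,
                       (unsigned long long)iova);
        }
        return 0;
}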
* [PATCH v1 07/13] iommu: Turn iova_cookie to dma-iommu private pointer
2025-02-08 9:02 [PATCH v1 00/13] iommu: Add MSI mapping support with nested SMMU Nicolin Chen
` (5 preceding siblings ...)
2025-02-08 9:02 ` [PATCH v1 06/13] iommufd: Implement sw_msi support natively Nicolin Chen
@ 2025-02-08 9:02 ` Nicolin Chen
2025-02-08 9:02 ` [PATCH v1 08/13] iommufd/device: Move sw_msi_start from igroup to idev Nicolin Chen
` (6 subsequent siblings)
13 siblings, 0 replies; 28+ messages in thread
From: Nicolin Chen @ 2025-02-08 9:02 UTC (permalink / raw)
To: jgg, kevin.tian, tglx, maz
Cc: joro, will, robin.murphy, shuah, iommu, linux-kernel,
linux-arm-kernel, linux-kselftest, eric.auger, baolu.lu, yi.l.liu,
yury.norov, jacob.pan, patches
Now that iommufd does not rely on dma-iommu.c for any purpose, we can
combine the dma-iommu.c iova_cookie and the iommufd_hwpt under the same
union. This union is effectively 'owner data' that can be used by the
entity that allocated the domain. Note that legacy VFIO type1 flows
continue to use dma-iommu.c for sw_msi and still need the iova_cookie.
Signed-off-by: Nicolin Chen <nicolinc@nvidia.com>
---
include/linux/iommu.h | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/include/linux/iommu.h b/include/linux/iommu.h
index e93d2e918599..99dd72998cb7 100644
--- a/include/linux/iommu.h
+++ b/include/linux/iommu.h
@@ -216,7 +216,6 @@ struct iommu_domain {
const struct iommu_ops *owner; /* Whose domain_alloc we came from */
unsigned long pgsize_bitmap; /* Bitmap of page sizes in use */
struct iommu_domain_geometry geometry;
- struct iommu_dma_cookie *iova_cookie;
int (*iopf_handler)(struct iopf_group *group);
#if IS_ENABLED(CONFIG_IRQ_MSI_IOMMU)
@@ -225,6 +224,7 @@ struct iommu_domain {
#endif
union { /* Pointer usable by owner of the domain */
+ struct iommu_dma_cookie *iova_cookie; /* dma-iommu */
struct iommufd_hw_pagetable *iommufd_hwpt; /* iommufd */
};
union { /* Fault handler */
--
2.43.0
* [PATCH v1 08/13] iommufd/device: Move sw_msi_start from igroup to idev
2025-02-08 9:02 [PATCH v1 00/13] iommu: Add MSI mapping support with nested SMMU Nicolin Chen
` (6 preceding siblings ...)
2025-02-08 9:02 ` [PATCH v1 07/13] iommu: Turn iova_cookie to dma-iommu private pointer Nicolin Chen
@ 2025-02-08 9:02 ` Nicolin Chen
2025-02-09 18:41 ` Jason Gunthorpe
2025-02-08 9:02 ` [PATCH v1 09/13] iommufd: Pass in idev to iopt_table_enforce_dev_resv_regions Nicolin Chen
` (5 subsequent siblings)
13 siblings, 1 reply; 28+ messages in thread
From: Nicolin Chen @ 2025-02-08 9:02 UTC (permalink / raw)
To: jgg, kevin.tian, tglx, maz
Cc: joro, will, robin.murphy, shuah, iommu, linux-kernel,
linux-arm-kernel, linux-kselftest, eric.auger, baolu.lu, yi.l.liu,
yury.norov, jacob.pan, patches
The sw_msi_start is set from the per-device reserved region, so it makes
sense to store it in the iommufd_device structure too. This will also ease
a following patch that adds a SET_OPTION uAPI to set per-idev sw_msi
arguments.
Signed-off-by: Nicolin Chen <nicolinc@nvidia.com>
---
drivers/iommu/iommufd/iommufd_private.h | 2 +-
drivers/iommu/iommufd/device.c | 31 +++++++++++++------------
2 files changed, 17 insertions(+), 16 deletions(-)
diff --git a/drivers/iommu/iommufd/iommufd_private.h b/drivers/iommu/iommufd/iommufd_private.h
index 246297452a44..c0df549c7d34 100644
--- a/drivers/iommu/iommufd/iommufd_private.h
+++ b/drivers/iommu/iommufd/iommufd_private.h
@@ -404,7 +404,6 @@ struct iommufd_group {
struct iommufd_hw_pagetable *hwpt;
struct list_head device_list;
struct iommufd_sw_msi_maps required_sw_msi;
- phys_addr_t sw_msi_start;
};
/*
@@ -423,6 +422,7 @@ struct iommufd_device {
/* protect iopf_enabled counter */
struct mutex iopf_lock;
unsigned int iopf_enabled;
+ phys_addr_t sw_msi_start;
};
static inline struct iommufd_device *
diff --git a/drivers/iommu/iommufd/device.c b/drivers/iommu/iommufd/device.c
index e435ba13a8a5..882cc51a3feb 100644
--- a/drivers/iommu/iommufd/device.c
+++ b/drivers/iommu/iommufd/device.c
@@ -91,7 +91,6 @@ static struct iommufd_group *iommufd_get_group(struct iommufd_ctx *ictx,
kref_init(&new_igroup->ref);
mutex_init(&new_igroup->lock);
INIT_LIST_HEAD(&new_igroup->device_list);
- new_igroup->sw_msi_start = PHYS_ADDR_MAX;
/* group reference moves into new_igroup */
new_igroup->group = group;
@@ -217,6 +216,7 @@ struct iommufd_device *iommufd_device_bind(struct iommufd_ctx *ictx,
/* igroup refcount moves into iommufd_device */
idev->igroup = igroup;
mutex_init(&idev->iopf_lock);
+ idev->sw_msi_start = PHYS_ADDR_MAX;
/*
* If the caller fails after this success it must call
@@ -390,7 +390,7 @@ int iommufd_sw_msi(struct iommu_domain *domain, struct msi_desc *desc,
hwpt_paging = find_hwpt_paging(domain->iommufd_hwpt);
/* No IOMMU_RESV_SW_MSI means no change to the msi_msg */
- if (handle->idev->igroup->sw_msi_start == PHYS_ADDR_MAX)
+ if (handle->idev->sw_msi_start == PHYS_ADDR_MAX)
return 0;
ictx = handle->idev->ictx;
@@ -402,7 +402,7 @@ int iommufd_sw_msi(struct iommu_domain *domain, struct msi_desc *desc,
*/
msi_map = iommufd_sw_msi_get_map(handle->idev->ictx,
msi_addr & PAGE_MASK,
- handle->idev->igroup->sw_msi_start);
+ handle->idev->sw_msi_start);
if (IS_ERR(msi_map))
return PTR_ERR(msi_map);
@@ -417,13 +417,13 @@ int iommufd_sw_msi(struct iommu_domain *domain, struct msi_desc *desc,
}
#endif
-static int iommufd_group_setup_msi(struct iommufd_group *igroup,
- struct iommufd_hwpt_paging *hwpt_paging)
+static int iommufd_device_setup_msi(struct iommufd_device *idev,
+ struct iommufd_hwpt_paging *hwpt_paging)
{
- struct iommufd_ctx *ictx = igroup->ictx;
+ struct iommufd_ctx *ictx = idev->ictx;
struct iommufd_sw_msi_map *cur;
- if (igroup->sw_msi_start == PHYS_ADDR_MAX)
+ if (idev->sw_msi_start == PHYS_ADDR_MAX)
return 0;
/*
@@ -433,8 +433,8 @@ static int iommufd_group_setup_msi(struct iommufd_group *igroup,
list_for_each_entry(cur, &ictx->sw_msi_list, sw_msi_item) {
int rc;
- if (cur->sw_msi_start != igroup->sw_msi_start ||
- !test_bit(cur->id, igroup->required_sw_msi.bitmap))
+ if (cur->sw_msi_start != idev->sw_msi_start ||
+ !test_bit(cur->id, idev->igroup->required_sw_msi.bitmap))
continue;
rc = iommufd_sw_msi_install(ictx, hwpt_paging, cur);
@@ -454,12 +454,12 @@ iommufd_device_attach_reserved_iova(struct iommufd_device *idev,
rc = iopt_table_enforce_dev_resv_regions(&hwpt_paging->ioas->iopt,
idev->dev,
- &idev->igroup->sw_msi_start);
+ &idev->sw_msi_start);
if (rc)
return rc;
if (list_empty(&idev->igroup->device_list)) {
- rc = iommufd_group_setup_msi(idev->igroup, hwpt_paging);
+ rc = iommufd_device_setup_msi(idev, hwpt_paging);
if (rc) {
iopt_remove_reserved_iova(&hwpt_paging->ioas->iopt,
idev->dev);
@@ -650,9 +650,10 @@ iommufd_group_remove_reserved_iova(struct iommufd_group *igroup,
}
static int
-iommufd_group_do_replace_reserved_iova(struct iommufd_group *igroup,
- struct iommufd_hwpt_paging *hwpt_paging)
+iommufd_device_do_replace_reserved_iova(struct iommufd_device *idev,
+ struct iommufd_hwpt_paging *hwpt_paging)
{
+ struct iommufd_group *igroup = idev->igroup;
struct iommufd_hwpt_paging *old_hwpt_paging;
struct iommufd_device *cur;
int rc;
@@ -669,7 +670,7 @@ iommufd_group_do_replace_reserved_iova(struct iommufd_group *igroup,
}
}
- rc = iommufd_group_setup_msi(igroup, hwpt_paging);
+ rc = iommufd_device_setup_msi(idev, hwpt_paging);
if (rc)
goto err_unresv;
return 0;
@@ -704,7 +705,7 @@ iommufd_device_do_replace(struct iommufd_device *idev,
old_hwpt = igroup->hwpt;
if (hwpt_paging) {
- rc = iommufd_group_do_replace_reserved_iova(igroup, hwpt_paging);
+ rc = iommufd_device_do_replace_reserved_iova(idev, hwpt_paging);
if (rc)
goto err_unlock;
}
--
2.43.0
^ permalink raw reply related [flat|nested] 28+ messages in thread
* [PATCH v1 09/13] iommufd: Pass in idev to iopt_table_enforce_dev_resv_regions
2025-02-08 9:02 [PATCH v1 00/13] iommu: Add MSI mapping support with nested SMMU Nicolin Chen
` (7 preceding siblings ...)
2025-02-08 9:02 ` [PATCH v1 08/13] iommufd/device: Move sw_msi_start from igroup to idev Nicolin Chen
@ 2025-02-08 9:02 ` Nicolin Chen
2025-02-08 9:02 ` [PATCH v1 10/13] iommufd: Add IOMMU_OPTION_SW_MSI_START/SIZE ioctls Nicolin Chen
` (4 subsequent siblings)
13 siblings, 0 replies; 28+ messages in thread
From: Nicolin Chen @ 2025-02-08 9:02 UTC (permalink / raw)
To: jgg, kevin.tian, tglx, maz
Cc: joro, will, robin.murphy, shuah, iommu, linux-kernel,
linux-arm-kernel, linux-kselftest, eric.auger, baolu.lu, yi.l.liu,
yury.norov, jacob.pan, patches
The iopt_table_enforce_dev_resv_regions() function needs to access the
sw_msi_start and sw_msi_size stored in the idev, which are set by user
space. So, pass in the idev pointer instead of the device pointer.
Signed-off-by: Nicolin Chen <nicolinc@nvidia.com>
---
drivers/iommu/iommufd/iommufd_private.h | 2 +-
drivers/iommu/iommufd/device.c | 5 ++---
drivers/iommu/iommufd/io_pagetable.c | 3 ++-
3 files changed, 5 insertions(+), 5 deletions(-)
diff --git a/drivers/iommu/iommufd/iommufd_private.h b/drivers/iommu/iommufd/iommufd_private.h
index c0df549c7d34..7a9cc6e61152 100644
--- a/drivers/iommu/iommufd/iommufd_private.h
+++ b/drivers/iommu/iommufd/iommufd_private.h
@@ -115,7 +115,7 @@ int iopt_table_add_domain(struct io_pagetable *iopt,
void iopt_table_remove_domain(struct io_pagetable *iopt,
struct iommu_domain *domain);
int iopt_table_enforce_dev_resv_regions(struct io_pagetable *iopt,
- struct device *dev,
+ struct iommufd_device *idev,
phys_addr_t *sw_msi_start);
int iopt_set_allow_iova(struct io_pagetable *iopt,
struct rb_root_cached *allowed_iova);
diff --git a/drivers/iommu/iommufd/device.c b/drivers/iommu/iommufd/device.c
index 882cc51a3feb..5ecd96c04e89 100644
--- a/drivers/iommu/iommufd/device.c
+++ b/drivers/iommu/iommufd/device.c
@@ -452,8 +452,7 @@ iommufd_device_attach_reserved_iova(struct iommufd_device *idev,
lockdep_assert_held(&idev->igroup->lock);
- rc = iopt_table_enforce_dev_resv_regions(&hwpt_paging->ioas->iopt,
- idev->dev,
+ rc = iopt_table_enforce_dev_resv_regions(&hwpt_paging->ioas->iopt, idev,
&idev->sw_msi_start);
if (rc)
return rc;
@@ -664,7 +663,7 @@ iommufd_device_do_replace_reserved_iova(struct iommufd_device *idev,
if (!old_hwpt_paging || hwpt_paging->ioas != old_hwpt_paging->ioas) {
list_for_each_entry(cur, &igroup->device_list, group_item) {
rc = iopt_table_enforce_dev_resv_regions(
- &hwpt_paging->ioas->iopt, cur->dev, NULL);
+ &hwpt_paging->ioas->iopt, cur, NULL);
if (rc)
goto err_unresv;
}
diff --git a/drivers/iommu/iommufd/io_pagetable.c b/drivers/iommu/iommufd/io_pagetable.c
index 8a790e597e12..441da0314a54 100644
--- a/drivers/iommu/iommufd/io_pagetable.c
+++ b/drivers/iommu/iommufd/io_pagetable.c
@@ -1423,9 +1423,10 @@ void iopt_remove_access(struct io_pagetable *iopt,
/* Narrow the valid_iova_itree to include reserved ranges from a device. */
int iopt_table_enforce_dev_resv_regions(struct io_pagetable *iopt,
- struct device *dev,
+ struct iommufd_device *idev,
phys_addr_t *sw_msi_start)
{
+ struct device *dev = idev->dev;
struct iommu_resv_region *resv;
LIST_HEAD(resv_regions);
unsigned int num_hw_msi = 0;
--
2.43.0
^ permalink raw reply related [flat|nested] 28+ messages in thread
* [PATCH v1 10/13] iommufd: Add IOMMU_OPTION_SW_MSI_START/SIZE ioctls
2025-02-08 9:02 [PATCH v1 00/13] iommu: Add MSI mapping support with nested SMMU Nicolin Chen
` (8 preceding siblings ...)
2025-02-08 9:02 ` [PATCH v1 09/13] iommufd: Pass in idev to iopt_table_enforce_dev_resv_regions Nicolin Chen
@ 2025-02-08 9:02 ` Nicolin Chen
2025-02-08 9:02 ` [PATCH v1 11/13] iommufd/selftest: Add MOCK_FLAGS_DEVICE_NO_ATTACH Nicolin Chen
` (3 subsequent siblings)
13 siblings, 0 replies; 28+ messages in thread
From: Nicolin Chen @ 2025-02-08 9:02 UTC (permalink / raw)
To: jgg, kevin.tian, tglx, maz
Cc: joro, will, robin.murphy, shuah, iommu, linux-kernel,
linux-arm-kernel, linux-kselftest, eric.auger, baolu.lu, yi.l.liu,
yury.norov, jacob.pan, patches
For systems that require MSI pages to be mapped into the IOMMU translation,
the IOMMU driver provides an IOMMU_RESV_SW_MSI range, which is the default
recommended IOVA window to place these mappings. However, there is nothing
special about this address. To support the RMR trick in the VMM for nested
translation, the VMM needs to know which sw_msi window the kernel is using.
As there is no particular reason to force the VMM to adopt the kernel
default, provide a simple IOMMU_OPTION_SW_MSI_START/SIZE ioctl pair that the
VMM can use to directly specify its desired sw_msi window. This replaces and
disables the default IOMMU_RESV_SW_MSI from the driver, avoiding the need to
build an API to discover the default IOMMU_RESV_SW_MSI.
Since iommufd now has its own sw_msi function, this is easy to implement.
Keep these two options per iommufd_device, so each device can set its own
desired MSI window. The VMM must set the values before attaching the device
to any IOAS/HWPT for them to take effect.
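As a quick illustration of the intended flow, here is a hedged userspace
sketch that mirrors the selftest pattern added later in this series; the
iommufd fd, the device object id, and the 0x70000000/2MB window are
placeholder assumptions, not values required by the API:
	#include <err.h>
	#include <stdint.h>
	#include <sys/ioctl.h>
	#include <linux/iommufd.h>

	/*
	 * Sketch only: set a 2MB sw_msi window at 0x70000000 for one device.
	 * "iommufd" is the /dev/iommu fd and "idev_id" the bound device
	 * object; both are assumed to already exist. This must run before
	 * the device is attached to any IOAS/HWPT.
	 */
	static void set_sw_msi_window(int iommufd, uint32_t idev_id)
	{
		struct iommu_option cmd = {
			.size = sizeof(cmd),
			.op = IOMMU_OPTION_OP_SET,
			.object_id = idev_id,		/* per-device option */
			.option_id = IOMMU_OPTION_SW_MSI_START,
			.val64 = 0x70000000,		/* must be 1MB aligned */
		};

		if (ioctl(iommufd, IOMMU_OPTION, &cmd))
			err(1, "IOMMU_OPTION_SW_MSI_START");

		cmd.option_id = IOMMU_OPTION_SW_MSI_SIZE;
		cmd.val64 = 2;				/* size in MB */
		if (ioctl(iommufd, IOMMU_OPTION, &cmd))
			err(1, "IOMMU_OPTION_SW_MSI_SIZE");
	}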
Suggested-by: Jason Gunthorpe <jgg@nvidia.com>
Signed-off-by: Nicolin Chen <nicolinc@nvidia.com>
---
drivers/iommu/iommufd/iommufd_private.h | 2 +
include/uapi/linux/iommufd.h | 20 ++++-
drivers/iommu/iommufd/io_pagetable.c | 15 +++-
drivers/iommu/iommufd/ioas.c | 97 +++++++++++++++++++++++++
drivers/iommu/iommufd/main.c | 4 +
5 files changed, 134 insertions(+), 4 deletions(-)
diff --git a/drivers/iommu/iommufd/iommufd_private.h b/drivers/iommu/iommufd/iommufd_private.h
index 7a9cc6e61152..2d1aae7c8610 100644
--- a/drivers/iommu/iommufd/iommufd_private.h
+++ b/drivers/iommu/iommufd/iommufd_private.h
@@ -279,6 +279,7 @@ int iommufd_ioas_change_process(struct iommufd_ucmd *ucmd);
int iommufd_ioas_copy(struct iommufd_ucmd *ucmd);
int iommufd_ioas_unmap(struct iommufd_ucmd *ucmd);
int iommufd_ioas_option(struct iommufd_ucmd *ucmd);
+int iommufd_option_sw_msi(struct iommufd_ucmd *ucmd);
int iommufd_option_rlimit_mode(struct iommu_option *cmd,
struct iommufd_ctx *ictx);
@@ -423,6 +424,7 @@ struct iommufd_device {
struct mutex iopf_lock;
unsigned int iopf_enabled;
phys_addr_t sw_msi_start;
+ size_t sw_msi_size;
};
static inline struct iommufd_device *
diff --git a/include/uapi/linux/iommufd.h b/include/uapi/linux/iommufd.h
index 78747b24bd0f..310256bc3dbf 100644
--- a/include/uapi/linux/iommufd.h
+++ b/include/uapi/linux/iommufd.h
@@ -294,7 +294,9 @@ struct iommu_ioas_unmap {
/**
* enum iommufd_option - ioctl(IOMMU_OPTION_RLIMIT_MODE) and
- * ioctl(IOMMU_OPTION_HUGE_PAGES)
+ * ioctl(IOMMU_OPTION_HUGE_PAGES) and
+ * ioctl(IOMMU_OPTION_SW_MSI_START) and
+ * ioctl(IOMMU_OPTION_SW_MSI_SIZE)
* @IOMMU_OPTION_RLIMIT_MODE:
* Change how RLIMIT_MEMLOCK accounting works. The caller must have privilege
* to invoke this. Value 0 (default) is user based accounting, 1 uses process
@@ -304,10 +306,26 @@ struct iommu_ioas_unmap {
* iommu mappings. Value 0 disables combining, everything is mapped to
* PAGE_SIZE. This can be useful for benchmarking. This is a per-IOAS
* option, the object_id must be the IOAS ID.
+ * @IOMMU_OPTION_SW_MSI_START:
+ * Change the base address of the IOMMU mapping region for MSI doorbell(s).
+ * This option being unset or @IOMMU_OPTION_SW_MSI_SIZE being value 0 tells
+ * the kernel to pick its default MSI doorbell window, ignoring these two
+ * options. To set this option, userspace must do so before attaching a device
+ * to an IOAS/HWPT. Otherwise, the kernel will return an error (-EBUSY). An address
+ * must be 1MB aligned. This option is per-device, the object_id must be the
+ * device ID.
+ * @IOMMU_OPTION_SW_MSI_SIZE:
+ * Change the size (in MB) of the IOMMU mapping region for MSI doorbell(s).
+ * The minimum value is 1 MB. A value 0 (default) tells the kernel to ignore
+ * the base address value set to @IOMMU_OPTION_SW_MSI_START, and to pick its
+ * default MSI doorbell window. The same requirements apply to this option
+ * too, so check @IOMMU_OPTION_SW_MSI_START for details.
*/
enum iommufd_option {
IOMMU_OPTION_RLIMIT_MODE = 0,
IOMMU_OPTION_HUGE_PAGES = 1,
+ IOMMU_OPTION_SW_MSI_START = 2,
+ IOMMU_OPTION_SW_MSI_SIZE = 3,
};
/**
diff --git a/drivers/iommu/iommufd/io_pagetable.c b/drivers/iommu/iommufd/io_pagetable.c
index 441da0314a54..6e6dcc480922 100644
--- a/drivers/iommu/iommufd/io_pagetable.c
+++ b/drivers/iommu/iommufd/io_pagetable.c
@@ -1441,18 +1441,27 @@ int iopt_table_enforce_dev_resv_regions(struct io_pagetable *iopt,
iommu_get_resv_regions(dev, &resv_regions);
list_for_each_entry(resv, &resv_regions, list) {
+ unsigned long start = PHYS_ADDR_MAX, last = 0;
+
if (resv->type == IOMMU_RESV_DIRECT_RELAXABLE)
continue;
if (sw_msi_start && resv->type == IOMMU_RESV_MSI)
num_hw_msi++;
if (sw_msi_start && resv->type == IOMMU_RESV_SW_MSI) {
- *sw_msi_start = resv->start;
+ if (idev->sw_msi_size) {
+ start = *sw_msi_start;
+ last = idev->sw_msi_size - 1 + start;
+ }
num_sw_msi++;
}
- rc = iopt_reserve_iova(iopt, resv->start,
- resv->length - 1 + resv->start, dev);
+ if (start == PHYS_ADDR_MAX) {
+ start = resv->start;
+ last = resv->length - 1 + start;
+ }
+
+ rc = iopt_reserve_iova(iopt, start, last, dev);
if (rc)
goto out_reserved;
}
diff --git a/drivers/iommu/iommufd/ioas.c b/drivers/iommu/iommufd/ioas.c
index 1542c5fd10a8..1fc93bc70bf4 100644
--- a/drivers/iommu/iommufd/ioas.c
+++ b/drivers/iommu/iommufd/ioas.c
@@ -620,6 +620,103 @@ int iommufd_option_rlimit_mode(struct iommu_option *cmd,
return -EOPNOTSUPP;
}
+static inline int iommufd_option_sw_msi_test(struct iommufd_device *idev,
+ phys_addr_t start, size_t size)
+{
+ const phys_addr_t alignment = SZ_1M - 1;
+ struct iommu_resv_region *resv;
+ phys_addr_t resv_last, last;
+ LIST_HEAD(resv_regions);
+ int rc = 0;
+
+ /* Alignment Test */
+ if (start & alignment)
+ return -EINVAL;
+
+ /* Overlap Test */
+ if (!size)
+ size = SZ_1M;
+ last = size - 1 + start;
+ /* FIXME: drivers allocate memory but there is no failure propagated */
+ iommu_get_resv_regions(idev->dev, &resv_regions);
+ list_for_each_entry(resv, &resv_regions, list) {
+ if (resv->type == IOMMU_RESV_SW_MSI || /* iommufd will bypass */
+ resv->type == IOMMU_RESV_DIRECT_RELAXABLE)
+ continue;
+ resv_last = resv->length - 1 + resv->start;
+ if (resv->start <= last && resv_last >= start) {
+ rc = -EADDRINUSE;
+ break;
+ }
+ }
+ iommu_put_resv_regions(idev->dev, &resv_regions);
+ return rc;
+}
+
+int iommufd_option_sw_msi(struct iommufd_ucmd *ucmd)
+{
+ struct iommu_option *cmd = ucmd->cmd;
+ struct iommufd_device *idev;
+ int rc = 0;
+
+ idev = iommufd_get_device(ucmd, cmd->object_id);
+ if (IS_ERR(idev))
+ return PTR_ERR(idev);
+
+ mutex_lock(&idev->igroup->lock);
+ /* Device cannot enforce the sw_msi window if already attached */
+ if (idev->igroup->hwpt) {
+ rc = -EBUSY;
+ goto out_unlock;
+ }
+
+ if (cmd->op == IOMMU_OPTION_OP_GET) {
+ switch (cmd->option_id) {
+ case IOMMU_OPTION_SW_MSI_START:
+ cmd->val64 = (u64)idev->sw_msi_start;
+ break;
+ case IOMMU_OPTION_SW_MSI_SIZE:
+ cmd->val64 = (u64)idev->sw_msi_size / SZ_1M;
+ break;
+ default:
+ rc = -EOPNOTSUPP;
+ break;
+ }
+ }
+ if (cmd->op == IOMMU_OPTION_OP_SET) {
+ phys_addr_t start = idev->sw_msi_start;
+ size_t size = idev->sw_msi_size;
+
+ switch (cmd->option_id) {
+ case IOMMU_OPTION_SW_MSI_START:
+ start = (phys_addr_t)cmd->val64;
+ rc = iommufd_option_sw_msi_test(idev, start, size);
+ if (rc)
+ break;
+ idev->sw_msi_start = start;
+ break;
+ case IOMMU_OPTION_SW_MSI_SIZE:
+ size = (size_t)cmd->val64 * SZ_1M;
+ if (size) {
+ rc = iommufd_option_sw_msi_test(idev, start,
+ size);
+ if (rc)
+ break;
+ }
+ idev->sw_msi_size = size;
+ break;
+ default:
+ rc = -EOPNOTSUPP;
+ break;
+ }
+ }
+
+out_unlock:
+ mutex_unlock(&idev->igroup->lock);
+ iommufd_put_object(ucmd->ictx, &idev->obj);
+ return rc;
+}
+
static int iommufd_ioas_option_huge_pages(struct iommu_option *cmd,
struct iommufd_ioas *ioas)
{
diff --git a/drivers/iommu/iommufd/main.c b/drivers/iommu/iommufd/main.c
index b6fa9fd11bc1..f92fb03ca3c1 100644
--- a/drivers/iommu/iommufd/main.c
+++ b/drivers/iommu/iommufd/main.c
@@ -287,6 +287,10 @@ static int iommufd_option(struct iommufd_ucmd *ucmd)
case IOMMU_OPTION_RLIMIT_MODE:
rc = iommufd_option_rlimit_mode(cmd, ucmd->ictx);
break;
+ case IOMMU_OPTION_SW_MSI_START:
+ case IOMMU_OPTION_SW_MSI_SIZE:
+ rc = iommufd_option_sw_msi(ucmd);
+ break;
case IOMMU_OPTION_HUGE_PAGES:
rc = iommufd_ioas_option(ucmd);
break;
--
2.43.0
^ permalink raw reply related [flat|nested] 28+ messages in thread
* [PATCH v1 11/13] iommufd/selftest: Add MOCK_FLAGS_DEVICE_NO_ATTACH
2025-02-08 9:02 [PATCH v1 00/13] iommu: Add MSI mapping support with nested SMMU Nicolin Chen
` (9 preceding siblings ...)
2025-02-08 9:02 ` [PATCH v1 10/13] iommufd: Add IOMMU_OPTION_SW_MSI_START/SIZE ioctls Nicolin Chen
@ 2025-02-08 9:02 ` Nicolin Chen
2025-02-08 9:02 ` [PATCH v1 12/13] iommufd/selftest: Add a testing reserved region Nicolin Chen
` (2 subsequent siblings)
13 siblings, 0 replies; 28+ messages in thread
From: Nicolin Chen @ 2025-02-08 9:02 UTC (permalink / raw)
To: jgg, kevin.tian, tglx, maz
Cc: joro, will, robin.murphy, shuah, iommu, linux-kernel,
linux-arm-kernel, linux-kselftest, eric.auger, baolu.lu, yi.l.liu,
yury.norov, jacob.pan, patches
Add a new MOCK_FLAGS_DEVICE_NO_ATTACH flag to allow the mock_domain cmd to
bypass the attach step, since IOMMU_OPTION_SW_MSI_START/SIZE can only be set
by users prior to an IOAS/HWPT attachment.
Signed-off-by: Nicolin Chen <nicolinc@nvidia.com>
---
drivers/iommu/iommufd/iommufd_test.h | 1 +
drivers/iommu/iommufd/selftest.c | 22 +++++++++++++++-------
2 files changed, 16 insertions(+), 7 deletions(-)
diff --git a/drivers/iommu/iommufd/iommufd_test.h b/drivers/iommu/iommufd/iommufd_test.h
index a6b7a163f636..02be242f8f34 100644
--- a/drivers/iommu/iommufd/iommufd_test.h
+++ b/drivers/iommu/iommufd/iommufd_test.h
@@ -48,6 +48,7 @@ enum {
enum {
MOCK_FLAGS_DEVICE_NO_DIRTY = 1 << 0,
MOCK_FLAGS_DEVICE_HUGE_IOVA = 1 << 1,
+ MOCK_FLAGS_DEVICE_NO_ATTACH = 1 << 2,
};
enum {
diff --git a/drivers/iommu/iommufd/selftest.c b/drivers/iommu/iommufd/selftest.c
index d40deb0a4f06..37a5cb89e27c 100644
--- a/drivers/iommu/iommufd/selftest.c
+++ b/drivers/iommu/iommufd/selftest.c
@@ -839,11 +839,13 @@ static void mock_dev_release(struct device *dev)
static struct mock_dev *mock_dev_create(unsigned long dev_flags)
{
+ const u32 SUPPORTED_FLAGS = MOCK_FLAGS_DEVICE_NO_DIRTY |
+ MOCK_FLAGS_DEVICE_HUGE_IOVA |
+ MOCK_FLAGS_DEVICE_NO_ATTACH;
struct mock_dev *mdev;
int rc, i;
- if (dev_flags &
- ~(MOCK_FLAGS_DEVICE_NO_DIRTY | MOCK_FLAGS_DEVICE_HUGE_IOVA))
+ if (dev_flags & ~SUPPORTED_FLAGS)
return ERR_PTR(-EINVAL);
mdev = kzalloc(sizeof(*mdev), GFP_KERNEL);
@@ -921,9 +923,13 @@ static int iommufd_test_mock_domain(struct iommufd_ucmd *ucmd,
}
sobj->idev.idev = idev;
- rc = iommufd_device_attach(idev, &pt_id);
- if (rc)
- goto out_unbind;
+ if (dev_flags & MOCK_FLAGS_DEVICE_NO_ATTACH) {
+ pt_id = 0;
+ } else {
+ rc = iommufd_device_attach(idev, &pt_id);
+ if (rc)
+ goto out_unbind;
+ }
/* Userspace must destroy the device_id to destroy the object */
cmd->mock_domain.out_hwpt_id = pt_id;
@@ -936,7 +942,8 @@ static int iommufd_test_mock_domain(struct iommufd_ucmd *ucmd,
return 0;
out_detach:
- iommufd_device_detach(idev);
+ if (!(dev_flags & MOCK_FLAGS_DEVICE_NO_ATTACH))
+ iommufd_device_detach(idev);
out_unbind:
iommufd_device_unbind(idev);
out_mdev:
@@ -1603,7 +1610,8 @@ void iommufd_selftest_destroy(struct iommufd_object *obj)
switch (sobj->type) {
case TYPE_IDEV:
- iommufd_device_detach(sobj->idev.idev);
+ if (!(sobj->idev.mock_dev->flags & MOCK_FLAGS_DEVICE_NO_ATTACH))
+ iommufd_device_detach(sobj->idev.idev);
iommufd_device_unbind(sobj->idev.idev);
mock_dev_destroy(sobj->idev.mock_dev);
break;
--
2.43.0
^ permalink raw reply related [flat|nested] 28+ messages in thread
* [PATCH v1 12/13] iommufd/selftest: Add a testing reserved region
2025-02-08 9:02 [PATCH v1 00/13] iommu: Add MSI mapping support with nested SMMU Nicolin Chen
` (10 preceding siblings ...)
2025-02-08 9:02 ` [PATCH v1 11/13] iommufd/selftest: Add MOCK_FLAGS_DEVICE_NO_ATTACH Nicolin Chen
@ 2025-02-08 9:02 ` Nicolin Chen
2025-02-08 9:02 ` [PATCH v1 13/13] iommufd/selftest: Add coverage for IOMMU_OPTION_SW_MSI_START/SIZE Nicolin Chen
2025-02-19 13:37 ` [PATCH v1 00/13] iommu: Add MSI mapping support with nested SMMU Jason Gunthorpe
13 siblings, 0 replies; 28+ messages in thread
From: Nicolin Chen @ 2025-02-08 9:02 UTC (permalink / raw)
To: jgg, kevin.tian, tglx, maz
Cc: joro, will, robin.murphy, shuah, iommu, linux-kernel,
linux-arm-kernel, linux-kselftest, eric.auger, baolu.lu, yi.l.liu,
yury.norov, jacob.pan, patches
The new IOMMU_OPTION_SW_MSI_START/SIZE window must not overlap with any
existing device reserved region, so add a testing reserved region
[0x80000000, 0x8fffffff] on top of the normal IOVA aperture, for the
selftest program to run overlap tests against.
Signed-off-by: Nicolin Chen <nicolinc@nvidia.com>
---
drivers/iommu/iommufd/iommufd_test.h | 3 +++
drivers/iommu/iommufd/selftest.c | 19 ++++++++++++++++++-
2 files changed, 21 insertions(+), 1 deletion(-)
diff --git a/drivers/iommu/iommufd/iommufd_test.h b/drivers/iommu/iommufd/iommufd_test.h
index 02be242f8f34..53e2e30570fc 100644
--- a/drivers/iommu/iommufd/iommufd_test.h
+++ b/drivers/iommu/iommufd/iommufd_test.h
@@ -213,4 +213,7 @@ struct iommu_viommu_invalidate_selftest {
__u32 cache_id;
};
+#define IOMMU_TEST_RESV_BASE 0x80000000UL
+#define IOMMU_TEST_RESV_LENGTH 0x10000000UL
+
#endif
diff --git a/drivers/iommu/iommufd/selftest.c b/drivers/iommu/iommufd/selftest.c
index 37a5cb89e27c..f4ac443d73d6 100644
--- a/drivers/iommu/iommufd/selftest.c
+++ b/drivers/iommu/iommufd/selftest.c
@@ -13,6 +13,7 @@
#include <linux/xarray.h>
#include <uapi/linux/iommufd.h>
+#include "../dma-iommu.h"
#include "../iommu-priv.h"
#include "io_pagetable.h"
#include "iommufd_private.h"
@@ -379,7 +380,8 @@ mock_domain_alloc_paging_flags(struct device *dev, u32 flags,
if (!mock)
return ERR_PTR(-ENOMEM);
mock->domain.geometry.aperture_start = MOCK_APERTURE_START;
- mock->domain.geometry.aperture_end = MOCK_APERTURE_LAST;
+ mock->domain.geometry.aperture_end =
+ MOCK_APERTURE_LAST + IOMMU_TEST_RESV_LENGTH;
mock->domain.pgsize_bitmap = MOCK_IO_PAGE_SIZE;
if (dev && mdev->flags & MOCK_FLAGS_DEVICE_HUGE_IOVA)
mock->domain.pgsize_bitmap |= MOCK_HUGE_PAGE_SIZE;
@@ -567,6 +569,20 @@ static int mock_dev_disable_feat(struct device *dev, enum iommu_dev_features fea
return 0;
}
+static void mock_dev_get_resv_regions(struct device *dev,
+ struct list_head *head)
+{
+ const int prot = IOMMU_WRITE | IOMMU_NOEXEC | IOMMU_MMIO;
+ struct iommu_resv_region *region;
+
+ region = iommu_alloc_resv_region(IOMMU_TEST_RESV_BASE,
+ IOMMU_TEST_RESV_LENGTH, prot,
+ IOMMU_RESV_RESERVED, GFP_KERNEL);
+ if (!region)
+ return;
+ list_add_tail(&region->list, head);
+}
+
static void mock_viommu_destroy(struct iommufd_viommu *viommu)
{
struct mock_iommu_device *mock_iommu = container_of(
@@ -711,6 +727,7 @@ static const struct iommu_ops mock_ops = {
.page_response = mock_domain_page_response,
.dev_enable_feat = mock_dev_enable_feat,
.dev_disable_feat = mock_dev_disable_feat,
+ .get_resv_regions = mock_dev_get_resv_regions,
.user_pasid_table = true,
.viommu_alloc = mock_viommu_alloc,
.default_domain_ops =
--
2.43.0
^ permalink raw reply related [flat|nested] 28+ messages in thread
* [PATCH v1 13/13] iommufd/selftest: Add coverage for IOMMU_OPTION_SW_MSI_START/SIZE
2025-02-08 9:02 [PATCH v1 00/13] iommu: Add MSI mapping support with nested SMMU Nicolin Chen
` (11 preceding siblings ...)
2025-02-08 9:02 ` [PATCH v1 12/13] iommufd/selftest: Add a testing reserved region Nicolin Chen
@ 2025-02-08 9:02 ` Nicolin Chen
2025-02-19 13:37 ` [PATCH v1 00/13] iommu: Add MSI mapping support with nested SMMU Jason Gunthorpe
13 siblings, 0 replies; 28+ messages in thread
From: Nicolin Chen @ 2025-02-08 9:02 UTC (permalink / raw)
To: jgg, kevin.tian, tglx, maz
Cc: joro, will, robin.murphy, shuah, iommu, linux-kernel,
linux-arm-kernel, linux-kselftest, eric.auger, baolu.lu, yi.l.liu,
yury.norov, jacob.pan, patches
Add coverage for the new IOMMU_OPTION_SW_MSI_START/SIZE options. Also add
fail_nth coverage.
Signed-off-by: Nicolin Chen <nicolinc@nvidia.com>
---
tools/testing/selftests/iommu/iommufd.c | 97 +++++++++++++++++++
.../selftests/iommu/iommufd_fail_nth.c | 21 ++++
2 files changed, 118 insertions(+)
diff --git a/tools/testing/selftests/iommu/iommufd.c b/tools/testing/selftests/iommu/iommufd.c
index a1b2b657999d..db7dfc5ae56a 100644
--- a/tools/testing/selftests/iommu/iommufd.c
+++ b/tools/testing/selftests/iommu/iommufd.c
@@ -334,6 +334,103 @@ TEST_F(change_process, basic)
ASSERT_EQ(child, waitpid(child, NULL, 0));
}
+FIXTURE(iommufd_sw_msi)
+{
+ int fd;
+ uint32_t ioas_id;
+ uint32_t idev_id[2];
+};
+
+FIXTURE_SETUP(iommufd_sw_msi)
+{
+ self->fd = open("/dev/iommu", O_RDWR);
+ ASSERT_NE(-1, self->fd);
+
+ test_ioctl_ioas_alloc(&self->ioas_id);
+ test_cmd_mock_domain(self->ioas_id, NULL, NULL, &self->idev_id[0]);
+ test_cmd_mock_domain_flags(self->ioas_id, MOCK_FLAGS_DEVICE_NO_ATTACH,
+ NULL, NULL, &self->idev_id[1]);
+}
+
+FIXTURE_TEARDOWN(iommufd_sw_msi)
+{
+ teardown_iommufd(self->fd, _metadata);
+}
+
+TEST_F(iommufd_sw_msi, basic)
+{
+ struct iommu_option cmd = {
+ .size = sizeof(cmd),
+ .op = IOMMU_OPTION_OP_SET,
+ };
+
+ /* Negative case: object_id must be a device id */
+ cmd.object_id = self->ioas_id;
+ cmd.option_id = IOMMU_OPTION_SW_MSI_START;
+ cmd.val64 = 0x70000000;
+ EXPECT_ERRNO(ENOENT, ioctl(self->fd, IOMMU_OPTION, &cmd));
+ cmd.object_id = 0;
+ cmd.option_id = IOMMU_OPTION_SW_MSI_SIZE;
+ cmd.val64 = 2;
+ EXPECT_ERRNO(ENOENT, ioctl(self->fd, IOMMU_OPTION, &cmd));
+
+ /* Negative case: device must not be attached already */
+ if (self->idev_id[0]) {
+ cmd.object_id = self->idev_id[0];
+ cmd.option_id = IOMMU_OPTION_SW_MSI_START;
+ cmd.val64 = 0x70000000;
+ EXPECT_ERRNO(EBUSY, ioctl(self->fd, IOMMU_OPTION, &cmd));
+ }
+
+ /* Device attached to nothing */
+ if (self->idev_id[1]) {
+ /* Negative case: alignment failures */
+ cmd.object_id = self->idev_id[1];
+ cmd.option_id = IOMMU_OPTION_SW_MSI_START;
+ cmd.val64 = 0x7fffffff;
+ EXPECT_ERRNO(EINVAL, ioctl(self->fd, IOMMU_OPTION, &cmd));
+ cmd.val64 = 0x7fffff00;
+ EXPECT_ERRNO(EINVAL, ioctl(self->fd, IOMMU_OPTION, &cmd));
+ cmd.val64 = 0x7fff0000;
+ EXPECT_ERRNO(EINVAL, ioctl(self->fd, IOMMU_OPTION, &cmd));
+
+ /* Negative case: overlap against [0x80000000, 0x8fffffff] */
+ cmd.option_id = IOMMU_OPTION_SW_MSI_START;
+ cmd.val64 = 0x80000000;
+ EXPECT_ERRNO(EADDRINUSE, ioctl(self->fd, IOMMU_OPTION, &cmd));
+ cmd.val64 = 0x80400000;
+ EXPECT_ERRNO(EADDRINUSE, ioctl(self->fd, IOMMU_OPTION, &cmd));
+ cmd.val64 = 0x80800000;
+ EXPECT_ERRNO(EADDRINUSE, ioctl(self->fd, IOMMU_OPTION, &cmd));
+ cmd.val64 = 0x80c00000;
+ EXPECT_ERRNO(EADDRINUSE, ioctl(self->fd, IOMMU_OPTION, &cmd));
+ /* Though an address that starts 1MB below will be okay ... */
+ cmd.val64 = 0x7ff00000;
+ ASSERT_EQ(0, ioctl(self->fd, IOMMU_OPTION, &cmd));
+ /* ... but not with a 2MB size */
+ cmd.option_id = IOMMU_OPTION_SW_MSI_SIZE;
+ cmd.val64 = 2;
+ EXPECT_ERRNO(EADDRINUSE, ioctl(self->fd, IOMMU_OPTION, &cmd));
+
+ /* Set a safe 2MB window */
+ cmd.option_id = IOMMU_OPTION_SW_MSI_START;
+ cmd.val64 = 0x70000000;
+ ASSERT_EQ(0, ioctl(self->fd, IOMMU_OPTION, &cmd));
+ cmd.option_id = IOMMU_OPTION_SW_MSI_SIZE;
+ cmd.val64 = 2;
+ ASSERT_EQ(0, ioctl(self->fd, IOMMU_OPTION, &cmd));
+
+ /* Read them back to verify */
+ cmd.op = IOMMU_OPTION_OP_GET;
+ cmd.option_id = IOMMU_OPTION_SW_MSI_START;
+ ASSERT_EQ(0, ioctl(self->fd, IOMMU_OPTION, &cmd));
+ ASSERT_EQ(cmd.val64, 0x70000000);
+ cmd.option_id = IOMMU_OPTION_SW_MSI_SIZE;
+ ASSERT_EQ(0, ioctl(self->fd, IOMMU_OPTION, &cmd));
+ ASSERT_EQ(cmd.val64, 2);
+ }
+}
+
FIXTURE(iommufd_ioas)
{
int fd;
diff --git a/tools/testing/selftests/iommu/iommufd_fail_nth.c b/tools/testing/selftests/iommu/iommufd_fail_nth.c
index 64b1f8e1b0cf..be0d7735dfeb 100644
--- a/tools/testing/selftests/iommu/iommufd_fail_nth.c
+++ b/tools/testing/selftests/iommu/iommufd_fail_nth.c
@@ -615,6 +615,10 @@ TEST_FAIL_NTH(basic_fail_nth, access_pin_domain)
/* device.c */
TEST_FAIL_NTH(basic_fail_nth, device)
{
+ struct iommu_option cmd = {
+ .size = sizeof(cmd),
+ .op = IOMMU_OPTION_OP_SET,
+ };
struct iommu_hwpt_selftest data = {
.iotlb = IOMMU_TEST_IOTLB_DEFAULT,
};
@@ -624,6 +628,7 @@ TEST_FAIL_NTH(basic_fail_nth, device)
uint32_t ioas_id;
uint32_t ioas_id2;
uint32_t stdev_id;
+ uint32_t idev_id2;
uint32_t idev_id;
uint32_t hwpt_id;
uint32_t viommu_id;
@@ -692,6 +697,22 @@ TEST_FAIL_NTH(basic_fail_nth, device)
IOMMU_HWPT_DATA_SELFTEST, &data, sizeof(data)))
return -1;
+ if (_test_cmd_mock_domain_flags(self->fd, ioas_id,
+ MOCK_FLAGS_DEVICE_NO_ATTACH, NULL, NULL,
+ &idev_id2))
+ return -1;
+
+ cmd.object_id = idev_id2;
+ cmd.option_id = IOMMU_OPTION_SW_MSI_START;
+ cmd.val64 = 0x70000000;
+ if (ioctl(self->fd, IOMMU_OPTION, &cmd))
+ return -1;
+
+ cmd.option_id = IOMMU_OPTION_SW_MSI_SIZE;
+ cmd.val64 = 2;
+ if (ioctl(self->fd, IOMMU_OPTION, &cmd))
+ return -1;
+
return 0;
}
--
2.43.0
^ permalink raw reply related [flat|nested] 28+ messages in thread
* Re: [PATCH v1 08/13] iommufd/device: Move sw_msi_start from igroup to idev
2025-02-08 9:02 ` [PATCH v1 08/13] iommufd/device: Move sw_msi_start from igroup to idev Nicolin Chen
@ 2025-02-09 18:41 ` Jason Gunthorpe
2025-02-10 23:25 ` Nicolin Chen
0 siblings, 1 reply; 28+ messages in thread
From: Jason Gunthorpe @ 2025-02-09 18:41 UTC (permalink / raw)
To: Nicolin Chen
Cc: kevin.tian, tglx, maz, joro, will, robin.murphy, shuah, iommu,
linux-kernel, linux-arm-kernel, linux-kselftest, eric.auger,
baolu.lu, yi.l.liu, yury.norov, jacob.pan, patches
On Sat, Feb 08, 2025 at 01:02:41AM -0800, Nicolin Chen wrote:
> @@ -433,8 +433,8 @@ static int iommufd_group_setup_msi(struct iommufd_group *igroup,
> list_for_each_entry(cur, &ictx->sw_msi_list, sw_msi_item) {
> int rc;
>
> - if (cur->sw_msi_start != igroup->sw_msi_start ||
> - !test_bit(cur->id, igroup->required_sw_msi.bitmap))
> + if (cur->sw_msi_start != idev->sw_msi_start ||
> + !test_bit(cur->id, idev->igroup->required_sw_msi.bitmap))
> continue;
So we end up creating separate sw_msi_list items with unique IDs for
every sw_msi_start?
That indeed might work well, I will try to check it and think about
this harder.
Jason
^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: [PATCH v1 08/13] iommufd/device: Move sw_msi_start from igroup to idev
2025-02-09 18:41 ` Jason Gunthorpe
@ 2025-02-10 23:25 ` Nicolin Chen
0 siblings, 0 replies; 28+ messages in thread
From: Nicolin Chen @ 2025-02-10 23:25 UTC (permalink / raw)
To: Jason Gunthorpe
Cc: kevin.tian, tglx, maz, joro, will, robin.murphy, shuah, iommu,
linux-kernel, linux-arm-kernel, linux-kselftest, eric.auger,
baolu.lu, yi.l.liu, yury.norov, jacob.pan, patches
On Sun, Feb 09, 2025 at 02:41:52PM -0400, Jason Gunthorpe wrote:
> On Sat, Feb 08, 2025 at 01:02:41AM -0800, Nicolin Chen wrote:
> > @@ -433,8 +433,8 @@ static int iommufd_group_setup_msi(struct iommufd_group *igroup,
> > list_for_each_entry(cur, &ictx->sw_msi_list, sw_msi_item) {
> > int rc;
> >
> > - if (cur->sw_msi_start != igroup->sw_msi_start ||
> > - !test_bit(cur->id, igroup->required_sw_msi.bitmap))
> > + if (cur->sw_msi_start != idev->sw_msi_start ||
> > + !test_bit(cur->id, idev->igroup->required_sw_msi.bitmap))
> > continue;
>
> So we end up creating separate sw_msi_list items with unique IDs for
> every sw_msi_start?
>
> That indeed might work well, I will try to check it and think about
> this harder.
The sw_msi_list is still per-ictx, so there won't be items/ids
that overlap with their sw_msi windows, right?
Then, the per-HWPT bitmap could still protect the iommu_map(),
as the design wanted to? No?
Nicolin
^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: [PATCH v1 06/13] iommufd: Implement sw_msi support natively
2025-02-08 9:02 ` [PATCH v1 06/13] iommufd: Implement sw_msi support natively Nicolin Chen
@ 2025-02-11 18:16 ` Jason Gunthorpe
2025-02-11 19:04 ` Nicolin Chen
0 siblings, 1 reply; 28+ messages in thread
From: Jason Gunthorpe @ 2025-02-11 18:16 UTC (permalink / raw)
To: Nicolin Chen
Cc: kevin.tian, tglx, maz, joro, will, robin.murphy, shuah, iommu,
linux-kernel, linux-arm-kernel, linux-kselftest, eric.auger,
baolu.lu, yi.l.liu, yury.norov, jacob.pan, patches
On Sat, Feb 08, 2025 at 01:02:39AM -0800, Nicolin Chen wrote:
> +static struct iommufd_attach_handle *
> +iommu_group_get_iommufd_handle(struct iommu_group *group)
> +{
> + struct iommu_attach_handle *handle;
> +
> + handle = iommu_attach_handle_get(group, IOMMU_NO_PASID, 0);
> + if (IS_ERR(handle))
> + return NULL;
> + return to_iommufd_handle(handle);
> +}
> +
> +/*
> + * Called by the irq code if the platform translates the MSI address through the
> + * IOMMU. msi_addr is the physical address of the MSI page. iommufd will
> + * allocate a fd global iova for the physical page that is the same on all
> + * domains and devices.
> + */
> +#ifdef CONFIG_IRQ_MSI_IOMMU
> +int iommufd_sw_msi(struct iommu_domain *domain, struct msi_desc *desc,
> + phys_addr_t msi_addr)
> +{
> + struct device *dev = msi_desc_to_dev(desc);
> + struct iommufd_hwpt_paging *hwpt_paging;
> + struct iommufd_attach_handle *handle;
> + struct iommufd_sw_msi_map *msi_map;
> + struct iommufd_ctx *ictx;
> + unsigned long iova;
> + int rc;
> +
> + handle = iommu_group_get_iommufd_handle(dev->iommu_group);
> + if (!handle)
> + return 0;
I think you should open code this and leave the other function
alone. The locking rules are different here.
iommufd_device_get_attach_handle() should be locked under the
igroup->lock
While in this context we are locked under the iommu core group mutex.
A comment will help
/*
* It is safe to call iommu_attach_handle_get() here because the iommu
* core code invokes this under the group mutex which also prevents any
* change of the attach handle for the duration of this function.
*/
iommu_group_mutex_assert(dev);
Jason
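For clarity, a rough sketch of what the open-coded lookup inside
iommufd_sw_msi() might look like after this suggestion is applied — an
assumed shape, not the committed code, presuming iommu_group_mutex_assert()
is reachable from iommufd and that "dev" and "handle" are the locals of the
function quoted above:
	struct iommu_attach_handle *raw_handle;

	/*
	 * It is safe to call iommu_attach_handle_get() here because the iommu
	 * core code invokes this under the group mutex, which also prevents
	 * any change of the attach handle for the duration of this function.
	 */
	iommu_group_mutex_assert(dev);
	raw_handle = iommu_attach_handle_get(dev->iommu_group, IOMMU_NO_PASID, 0);
	if (IS_ERR(raw_handle))
		return 0;
	handle = to_iommufd_handle(raw_handle);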
^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: [PATCH v1 06/13] iommufd: Implement sw_msi support natively
2025-02-11 18:16 ` Jason Gunthorpe
@ 2025-02-11 19:04 ` Nicolin Chen
0 siblings, 0 replies; 28+ messages in thread
From: Nicolin Chen @ 2025-02-11 19:04 UTC (permalink / raw)
To: Jason Gunthorpe
Cc: kevin.tian, tglx, maz, joro, will, robin.murphy, shuah, iommu,
linux-kernel, linux-arm-kernel, linux-kselftest, eric.auger,
baolu.lu, yi.l.liu, yury.norov, jacob.pan, patches
On Tue, Feb 11, 2025 at 02:16:20PM -0400, Jason Gunthorpe wrote:
> On Sat, Feb 08, 2025 at 01:02:39AM -0800, Nicolin Chen wrote:
>
> > +static struct iommufd_attach_handle *
> > +iommu_group_get_iommufd_handle(struct iommu_group *group)
> > +{
> > + struct iommu_attach_handle *handle;
> > +
> > + handle = iommu_attach_handle_get(group, IOMMU_NO_PASID, 0);
> > + if (IS_ERR(handle))
> > + return NULL;
> > + return to_iommufd_handle(handle);
> > +}
> > +
> > +/*
> > + * Called by the irq code if the platform translates the MSI address through the
> > + * IOMMU. msi_addr is the physical address of the MSI page. iommufd will
> > + * allocate a fd global iova for the physical page that is the same on all
> > + * domains and devices.
> > + */
> > +#ifdef CONFIG_IRQ_MSI_IOMMU
> > +int iommufd_sw_msi(struct iommu_domain *domain, struct msi_desc *desc,
> > + phys_addr_t msi_addr)
> > +{
> > + struct device *dev = msi_desc_to_dev(desc);
> > + struct iommufd_hwpt_paging *hwpt_paging;
> > + struct iommufd_attach_handle *handle;
> > + struct iommufd_sw_msi_map *msi_map;
> > + struct iommufd_ctx *ictx;
> > + unsigned long iova;
> > + int rc;
> > +
> > + handle = iommu_group_get_iommufd_handle(dev->iommu_group);
> > + if (!handle)
> > + return 0;
>
> I think you should open code this and leave the other function
> alone. The locking rules are different here.
>
> iommufd_device_get_attach_handle() should be locked under the
> igroup->lock
>
> While in this context we are locked under the iommu core group mutex.
>
> A comment will help
>
> /*
> * It is safe to call iommu_attach_handle_get() here because the iommu
> * core code invokes this under the group mutex which also prevents any
> * change of the attach handle for the duration of this function.
> */
> iommu_group_mutex_assert(dev);
Ack. I reverted that part and added this piece.
Thanks
Nicolin
^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: [PATCH v1 01/13] genirq/msi: Store the IOMMU IOVA directly in msi_desc instead of iommu_cookie
2025-02-08 9:02 ` [PATCH v1 01/13] genirq/msi: Store the IOMMU IOVA directly in msi_desc instead of iommu_cookie Nicolin Chen
@ 2025-02-13 11:54 ` Thomas Gleixner
2025-02-13 13:57 ` Jason Gunthorpe
2025-02-13 20:28 ` Jacob Pan
1 sibling, 1 reply; 28+ messages in thread
From: Thomas Gleixner @ 2025-02-13 11:54 UTC (permalink / raw)
To: Nicolin Chen, jgg, kevin.tian, maz
Cc: joro, will, robin.murphy, shuah, iommu, linux-kernel,
linux-arm-kernel, linux-kselftest, eric.auger, baolu.lu, yi.l.liu,
yury.norov, jacob.pan, patches
On Sat, Feb 08 2025 at 01:02, Nicolin Chen wrote:
> From: Jason Gunthorpe <jgg@nvidia.com>
>
> All the iommu cases simply want to override the MSI page's address with
> the IOVA that was mapped through the iommu. This doesn't need a cookie
> pointer, we just need to store the IOVA and its page size in the
> msi_desc.
We need to do nothing :)
> Instead provide msi_desc_set_iommu_msi_iova() which allows the IOMMU side
> to specify the IOVA that the MSI page is placed during
> iommu_dma_prepare_msi(). This is stored in the msi_desc and then
> iommu_dma_compose_msi_msg() is a simple inline that sets address_hi/lo.
>
> The next patch will correct the naming.
>
> This is done because we cannot correctly lock access to group->domain
> in
Now this gets really confusing. How is the next patch which corrects the
naming related to the locking problem?
> the atomic context that iommu_dma_compose_msi_msg() is called under. Today
> the locking miss is tolerable because dma_iommu.c operates under an
> assumption that the domain does not change while a driver is probed.
>
> However iommufd now permits the domain to change while the driver is
> probed and VFIO userspace can create races with IRQ changes calling
> iommu_dma_prepare_msi/compose_msi_msg() and changing/freeing the
> iommu_domain.
>
> Removing the pointer, and critically, the call to
> iommu_get_domain_for_dev() during compose resolves this race.
So this change log really fails to follow the basic structure:
The context, the problem and the solution
Something like this:
The IOMMU translation for MSI message addresses is a two stage
process:
1) A cookie containing the IOVA address is stored in the MSI
descriptor, when a MSI interrupt is allocated
2) The compose callback uses this cookie to apply the translation
to the message address.
This worked so far because .......
With the iommufd rework this becomes racy, because ...
Fix this by storing ... instead of ... ....
Hmm?
tglx
^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: [PATCH v1 02/13] genirq/msi: Rename iommu_dma_compose_msi_msg() to msi_msg_set_addr()
2025-02-08 9:02 ` [PATCH v1 02/13] genirq/msi: Rename iommu_dma_compose_msi_msg() to msi_msg_set_addr() Nicolin Chen
@ 2025-02-13 12:11 ` Thomas Gleixner
2025-02-13 14:37 ` Jason Gunthorpe
0 siblings, 1 reply; 28+ messages in thread
From: Thomas Gleixner @ 2025-02-13 12:11 UTC (permalink / raw)
To: Nicolin Chen, jgg, kevin.tian, maz
Cc: joro, will, robin.murphy, shuah, iommu, linux-kernel,
linux-arm-kernel, linux-kselftest, eric.auger, baolu.lu, yi.l.liu,
yury.norov, jacob.pan, patches
On Sat, Feb 08 2025 at 01:02, Nicolin Chen wrote:
> From: Jason Gunthorpe <jgg@nvidia.com>
>
> The new function is used to take in a u64 MSI address and store it in the
Which new function? The subject claims this is a rename. That's
confusing at best.
> msi_msg. If the iommu has provided an alternative address then that is
> replaced instead.
>
> All callers have a tidy u64 already so this also consolidates the repeated
> low/high code into a small helper.
Now I'm really confused. Something like:
genirq/msi: Refactor iommu_dma_compose_msi_msg()
The call sites of iommu_dma_compose_msi_msg() have the following
pattern:
1) Set the MSI message address to the non-translated address
2) Invoke iommu_dma_compose_msi_msg() to apply IOVA translation if
available.
Rework and rename iommu_dma_compose_msi_msg() to handle both
address setup variants in one consolidated helper.
Or something to that effect.
> -/**
> - * iommu_dma_compose_msi_msg() - Apply translation to an MSI message
> - * @desc: MSI descriptor prepared by iommu_dma_prepare_msi()
> - * @msg: MSI message containing target physical address
> - */
Please keep and rework the documentation comment.
> -static inline void iommu_dma_compose_msi_msg(struct msi_desc *desc,
> - struct msi_msg *msg)
> +static inline void msi_msg_set_addr(struct msi_desc *desc, struct msi_msg *msg,
> + u64 msi_addr)
No line break required. You have 100 characters available.
> {
> #ifdef CONFIG_IRQ_MSI_IOMMU
> if (desc->iommu_msi_page_shift) {
> @@ -314,11 +309,14 @@ static inline void iommu_dma_compose_msi_msg(struct msi_desc *desc,
> << desc->iommu_msi_page_shift;
>
> msg->address_hi = upper_32_bits(msi_iova);
> - msg->address_lo = lower_32_bits(msi_iova) |
> - (msg->address_lo &
> - ((1 << desc->iommu_msi_page_shift) - 1));
> + msg->address_lo =
> + lower_32_bits(msi_iova) |
> + (msi_addr & ((1 << desc->iommu_msi_page_shift) - 1));
Please avoid unrelated random formatting changes. Especially in this
case this is even more non-sensical as you introduced the original
formatting in the previous patch. So fix it up there and make it:
msg->address_lo = lower_32_bits(msi_iova) |
(msg->address_lo & ((1 << desc->iommu_msi_page_shift) - 1));
which is way more readable than this displaced variant above.
> + return;
>
> - iommu_dma_compose_msi_msg(irq_data_get_msi_desc(d), msg);
> + msg->data = its_get_event_id(d);
> + msi_msg_set_addr(irq_data_get_msi_desc(d), msg,
> + its_dev->its->get_msi_base(its_dev));
No line break required here and at the other places.
Thanks,
tglx
^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: [PATCH v1 01/13] genirq/msi: Store the IOMMU IOVA directly in msi_desc instead of iommu_cookie
2025-02-13 11:54 ` Thomas Gleixner
@ 2025-02-13 13:57 ` Jason Gunthorpe
2025-02-13 21:34 ` Nicolin Chen
0 siblings, 1 reply; 28+ messages in thread
From: Jason Gunthorpe @ 2025-02-13 13:57 UTC (permalink / raw)
To: Thomas Gleixner
Cc: Nicolin Chen, kevin.tian, maz, joro, will, robin.murphy, shuah,
iommu, linux-kernel, linux-arm-kernel, linux-kselftest,
eric.auger, baolu.lu, yi.l.liu, yury.norov, jacob.pan, patches
On Thu, Feb 13, 2025 at 12:54:15PM +0100, Thomas Gleixner wrote:
> So this change log really fails to follow the basic structure:
>
> The context, the problem and the solution
The IOMMU translation for MSI message addresses is a two step
process, separated in time:
1) iommu_dma_prepare_msi(): A cookie pointer containing the IOVA
address is stored in the MSI descriptor, when a MSI interrupt is
allocated.
2) iommu_dma_compose_msi_msg(): The compose callback uses this
cookie pointer to compute the translated message address.
This has an inherent lifetime problem for the pointer stored in the
cookie. It must remain valid between the two steps. There is no
locking at the irq layer that helps protect the lifetime. Today this
only works under the assumption that the iommu domain is not changed
while MSI interrupts are being programmed. This is true for normal DMA
API users within the kernel as the iommu domain is attached before the
driver is probed and cannot be changed while a driver is attached.
Classic VFIO type1 also prevented changing the iommu domain while VFIO
was running as it does not support changing the "container" after
starting up.
However, iommufd has improved this and we can change the iommu domain
during VFIO operation. This allows userspace to directly race
VFIO_DEVICE_ATTACH_IOMMUFD_PT (which calls iommu_attach_group()) and
VFIO_DEVICE_SET_IRQS (which calls into iommu_dma_compose_msi_msg()).
This causes both the cookie pointer and the unlocked call to
iommu_get_domain_for_dev() on the MSI translation path to become a
potential UAF.
Fix the MSI cookie UAF by removing the cookie pointer. The translated
message address is already known during iommu_dma_prepare_msi() and
can not change. It can simply be stored as an integer in the MSI
descriptor.
A following patch will correct the iommu_get_domain_for_dev() UAF
using the IOMMU group mutex.
Ok?
Nicolin - let's change the patch structure a little bit: can you adjust
this patch to leave iommu_dma_compose_msi_msg() in dma-iommu.c, and make
the next patch all about renaming and moving it to the MSI core code
instead? Easier to explain.
Thanks,
Jason
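To make the storage change described above concrete, the helper that patch
01 introduces (msi_desc_set_iommu_msi_iova(), per the changelog) might look
roughly like the sketch below; the field names follow the hunks quoted later
in this thread, while the exact signature here is an assumption:
	static inline void msi_desc_set_iommu_msi_iova(struct msi_desc *desc,
						       u64 msi_iova,
						       unsigned int page_shift)
	{
	#ifdef CONFIG_IRQ_MSI_IOMMU
		/* Store the IOVA as an integer (pfn + shift), not a cookie pointer */
		desc->iommu_msi_page_shift = page_shift;
		desc->iommu_msi_iova = msi_iova >> page_shift;
	#endif
	}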
^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: [PATCH v1 02/13] genirq/msi: Rename iommu_dma_compose_msi_msg() to msi_msg_set_addr()
2025-02-13 12:11 ` Thomas Gleixner
@ 2025-02-13 14:37 ` Jason Gunthorpe
0 siblings, 0 replies; 28+ messages in thread
From: Jason Gunthorpe @ 2025-02-13 14:37 UTC (permalink / raw)
To: Thomas Gleixner
Cc: Nicolin Chen, kevin.tian, maz, joro, will, robin.murphy, shuah,
iommu, linux-kernel, linux-arm-kernel, linux-kselftest,
eric.auger, baolu.lu, yi.l.liu, yury.norov, jacob.pan, patches
On Thu, Feb 13, 2025 at 01:11:37PM +0100, Thomas Gleixner wrote:
> On Sat, Feb 08 2025 at 01:02, Nicolin Chen wrote:
>
> > From: Jason Gunthorpe <jgg@nvidia.com>
> >
> > The new function is used to take in a u64 MSI address and store it in the
Assuming Nicolin moves the hunk as I suggested for patch 1 lets say:
genirq/msi: Refactor iommu_dma_compose_msi_msg()
The two step process to translate the MSI address involves two
functions, iommu_dma_prepare_msi() and iommu_dma_compose_msi_msg().
Previously iommu_dma_compose_msi_msg() needed to be in the iommu layer
as it had to dereference the opaque cookie pointer. The previous patch
changed the cookie pointer into an integer so there is no longer any
need for the iommu layer to be involved.
Further, the call sites of iommu_dma_compose_msi_msg() all follow the
same pattern of setting the MSI message address_hi/lo to the
non-translated address and then immediately calling
iommu_dma_compose_msi_msg().
Refactor iommu_dma_compose_msi_msg() into msi_msg_set_addr() which
directly accepts the u64 version of the address and simplifies all the
callers.
Move the implementation to linux/msi.h since it has nothing to do with
iommu.
Aside from refactoring, this logically prepares for the next patch
which allows multiple implementation options for
iommu_dma_prepare_msi(). It does not make sense to have the
iommu_dma_compose_msi_msg() in dma-iommu.c when it no longer provides
the only iommu_dma_prepare_msi() implementation.
Ok?
> > -static inline void iommu_dma_compose_msi_msg(struct msi_desc *desc,
> > - struct msi_msg *msg)
> > +static inline void msi_msg_set_addr(struct msi_desc *desc, struct msi_msg *msg,
> > + u64 msi_addr)
>
> No line break required. You have 100 characters available.
Sure, I make a lot of patches for places that only accept 80 :\ It is
hard to keep track of everyone's preferences. Thank you for having
patience, it will get fixed - I think following a 100 char limit will
resolve all of these notes.
Thanks,
Jason
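Putting the quoted pieces together, the consolidated helper discussed in
this sub-thread might end up looking roughly like the sketch below — an
assumed final shape, formatted to the 100-character limit Thomas mentions,
not the committed code:
	static inline void msi_msg_set_addr(struct msi_desc *desc, struct msi_msg *msg, u64 msi_addr)
	{
	#ifdef CONFIG_IRQ_MSI_IOMMU
		if (desc->iommu_msi_page_shift) {
			u64 msi_iova = desc->iommu_msi_iova << desc->iommu_msi_page_shift;

			/* Keep the in-page offset, replace the page address with the IOVA */
			msg->address_hi = upper_32_bits(msi_iova);
			msg->address_lo = lower_32_bits(msi_iova) |
					  (msi_addr & ((1 << desc->iommu_msi_page_shift) - 1));
			return;
		}
	#endif
		msg->address_hi = upper_32_bits(msi_addr);
		msg->address_lo = lower_32_bits(msi_addr);
	}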
^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: [PATCH v1 01/13] genirq/msi: Store the IOMMU IOVA directly in msi_desc instead of iommu_cookie
2025-02-08 9:02 ` [PATCH v1 01/13] genirq/msi: Store the IOMMU IOVA directly in msi_desc instead of iommu_cookie Nicolin Chen
2025-02-13 11:54 ` Thomas Gleixner
@ 2025-02-13 20:28 ` Jacob Pan
2025-02-13 21:02 ` Nicolin Chen
1 sibling, 1 reply; 28+ messages in thread
From: Jacob Pan @ 2025-02-13 20:28 UTC (permalink / raw)
To: Nicolin Chen
Cc: jgg, kevin.tian, tglx, maz, joro, will, robin.murphy, shuah,
iommu, linux-kernel, linux-arm-kernel, linux-kselftest,
eric.auger, baolu.lu, yi.l.liu, yury.norov, patches, jacob.pan
Hi Nicolin,
On Sat, 8 Feb 2025 01:02:34 -0800
Nicolin Chen <nicolinc@nvidia.com> wrote:
> -static inline void msi_desc_set_iommu_cookie(struct msi_desc *desc,
> -					      const void *iommu_cookie)
> +/**
> + * iommu_dma_compose_msi_msg() - Apply translation to an MSI message
> + * @desc: MSI descriptor prepared by iommu_dma_prepare_msi()
> + * @msg: MSI message containing target physical address
> + */
Is it IOVA not PA?
> +static inline void iommu_dma_compose_msi_msg(struct msi_desc *desc,
> + struct msi_msg *msg)
> {
> -}
> +#ifdef CONFIG_IRQ_MSI_IOMMU
> + if (desc->iommu_msi_page_shift) {
> + u64 msi_iova = desc->iommu_msi_iova
> + << desc->iommu_msi_page_shift;
> +
> + msg->address_hi = upper_32_bits(msi_iova);
> + msg->address_lo = lower_32_bits(msi_iova) |
> + (msg->address_lo &
> + ((1 <<
> desc->iommu_msi_page_shift) - 1));
> + }
> #endif
> +}
^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: [PATCH v1 01/13] genirq/msi: Store the IOMMU IOVA directly in msi_desc instead of iommu_cookie
2025-02-13 20:28 ` Jacob Pan
@ 2025-02-13 21:02 ` Nicolin Chen
2025-02-13 21:33 ` Jacob Pan
0 siblings, 1 reply; 28+ messages in thread
From: Nicolin Chen @ 2025-02-13 21:02 UTC (permalink / raw)
To: Jacob Pan
Cc: jgg, kevin.tian, tglx, maz, joro, will, robin.murphy, shuah,
iommu, linux-kernel, linux-arm-kernel, linux-kselftest,
eric.auger, baolu.lu, yi.l.liu, yury.norov, patches
On Thu, Feb 13, 2025 at 12:28:49PM -0800, Jacob Pan wrote:
> Hi Nicolin,
>
> On Sat, 8 Feb 2025 01:02:34 -0800
> Nicolin Chen <nicolinc@nvidia.com> wrote:
>
> > -static inline void msi_desc_set_iommu_cookie(struct msi_desc *desc,
> > > -					      const void *iommu_cookie)
> > > +/**
> > + * iommu_dma_compose_msi_msg() - Apply translation to an MSI message
> > + * @desc: MSI descriptor prepared by iommu_dma_prepare_msi()
> > + * @msg: MSI message containing target physical address
> > + */
> Is it IOVA not PA?
This is moved from dma-iommu.c so we didn't change that.
And I think it's correct to say "target physical address" as the
irqchip driver does pass in a PA via @msg.
Then iommu_dma_compose_msi_msg() kind of reverse-translates that,
overwriting the msg with the "IOVA" from @desc.
Thanks
Nicolin
> > +static inline void iommu_dma_compose_msi_msg(struct msi_desc *desc,
> > + struct msi_msg *msg)
> > {
> > -}
> > +#ifdef CONFIG_IRQ_MSI_IOMMU
> > + if (desc->iommu_msi_page_shift) {
> > + u64 msi_iova = desc->iommu_msi_iova
> > + << desc->iommu_msi_page_shift;
> > +
> > + msg->address_hi = upper_32_bits(msi_iova);
> > + msg->address_lo = lower_32_bits(msi_iova) |
> > +			  (msg->address_lo & ((1 << desc->iommu_msi_page_shift) - 1));
> > + }
> > #endif
> > +}
^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: [PATCH v1 01/13] genirq/msi: Store the IOMMU IOVA directly in msi_desc instead of iommu_cookie
2025-02-13 21:02 ` Nicolin Chen
@ 2025-02-13 21:33 ` Jacob Pan
0 siblings, 0 replies; 28+ messages in thread
From: Jacob Pan @ 2025-02-13 21:33 UTC (permalink / raw)
To: Nicolin Chen
Cc: jgg, kevin.tian, tglx, maz, joro, will, robin.murphy, shuah,
iommu, linux-kernel, linux-arm-kernel, linux-kselftest,
eric.auger, baolu.lu, yi.l.liu, yury.norov, patches, jacob.pan
Hi Nicolin,
On Thu, 13 Feb 2025 13:02:27 -0800
Nicolin Chen <nicolinc@nvidia.com> wrote:
> On Thu, Feb 13, 2025 at 12:28:49PM -0800, Jacob Pan wrote:
> > Hi Nicolin,
> >
> > On Sat, 8 Feb 2025 01:02:34 -0800
> > Nicolin Chen <nicolinc@nvidia.com> wrote:
> >
> > > -static inline void msi_desc_set_iommu_cookie(struct msi_desc *desc,
> > > -					      const void *iommu_cookie)
> > > +/**
> > > + * iommu_dma_compose_msi_msg() - Apply translation to an MSI message
> > > + * @desc: MSI descriptor prepared by iommu_dma_prepare_msi()
> > > + * @msg: MSI message containing target physical address
> > > + */
> > Is it IOVA not PA?
>
> This is moved from dma-iommu.c so we didn't change that.
>
> And I think it's correct to say "target physical address" as the
> irqchip driver does pass in a PA via @msg.
>
> Then iommu_dma_compose_msi_msg() kind of reverse-translates that,
> overwriting the msg with the "IOVA" from @desc.
>
It is just confusing that the msg can be an IOVA, not always a PA as the
comment says. I see it gets deleted anyway in the next patch :)
^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: [PATCH v1 01/13] genirq/msi: Store the IOMMU IOVA directly in msi_desc instead of iommu_cookie
2025-02-13 13:57 ` Jason Gunthorpe
@ 2025-02-13 21:34 ` Nicolin Chen
0 siblings, 0 replies; 28+ messages in thread
From: Nicolin Chen @ 2025-02-13 21:34 UTC (permalink / raw)
To: Jason Gunthorpe
Cc: Thomas Gleixner, kevin.tian, maz, joro, will, robin.murphy, shuah,
iommu, linux-kernel, linux-arm-kernel, linux-kselftest,
eric.auger, baolu.lu, yi.l.liu, yury.norov, jacob.pan, patches
On Thu, Feb 13, 2025 at 09:57:52AM -0400, Jason Gunthorpe wrote:
> Nicolin - lets change the patch structure a little bit can you adjust
> this patch to leave iommu_dma_compose_msi_msg() in dma-iommu.c and the
> next patch will be all about renaming and moving it to the MSI core
> code instead? Easier to explain
Ack. I moved that and updated the commit message too.
Thanks
Nicolin
^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: [PATCH v1 00/13] iommu: Add MSI mapping support with nested SMMU
2025-02-08 9:02 [PATCH v1 00/13] iommu: Add MSI mapping support with nested SMMU Nicolin Chen
` (12 preceding siblings ...)
2025-02-08 9:02 ` [PATCH v1 13/13] iommufd/selftest: Add coverage for IOMMU_OPTION_SW_MSI_START/SIZE Nicolin Chen
@ 2025-02-19 13:37 ` Jason Gunthorpe
2025-02-19 16:06 ` Nicolin Chen
13 siblings, 1 reply; 28+ messages in thread
From: Jason Gunthorpe @ 2025-02-19 13:37 UTC (permalink / raw)
To: Nicolin Chen
Cc: kevin.tian, tglx, maz, joro, will, robin.murphy, shuah, iommu,
linux-kernel, linux-arm-kernel, linux-kselftest, eric.auger,
baolu.lu, yi.l.liu, yury.norov, jacob.pan, patches
On Sat, Feb 08, 2025 at 01:02:33AM -0800, Nicolin Chen wrote:
> iommu: Turn iova_cookie to dma-iommu private pointer
I suggest you respin this with the updates and stop here at this
patch, it is enough to get the core infrastructure updated and does
not bring any uAPI changes. The remainder can be done after we get to
this merged.
Jason
^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: [PATCH v1 00/13] iommu: Add MSI mapping support with nested SMMU
2025-02-19 13:37 ` [PATCH v1 00/13] iommu: Add MSI mapping support with nested SMMU Jason Gunthorpe
@ 2025-02-19 16:06 ` Nicolin Chen
0 siblings, 0 replies; 28+ messages in thread
From: Nicolin Chen @ 2025-02-19 16:06 UTC (permalink / raw)
To: Jason Gunthorpe
Cc: kevin.tian, tglx, maz, joro, will, robin.murphy, shuah, iommu,
linux-kernel, linux-arm-kernel, linux-kselftest, eric.auger,
baolu.lu, yi.l.liu, yury.norov, jacob.pan, patches
On Wed, Feb 19, 2025 at 09:37:23AM -0400, Jason Gunthorpe wrote:
> On Sat, Feb 08, 2025 at 01:02:33AM -0800, Nicolin Chen wrote:
> > iommu: Turn iova_cookie to dma-iommu private pointer
>
> I suggest you respin this with the updates and stop here at this
> patch, it is enough to get the core infrastructure updated and does
> not bring any uAPI changes. The remainder can be done after we get to
> this merged.
OK. I will split that and send the first 7 patches as Part-1.
Thanks
Nicolin
^ permalink raw reply [flat|nested] 28+ messages in thread
end of thread, other threads:[~2025-02-19 16:08 UTC | newest]
Thread overview: 28+ messages
2025-02-08 9:02 [PATCH v1 00/13] iommu: Add MSI mapping support with nested SMMU Nicolin Chen
2025-02-08 9:02 ` [PATCH v1 01/13] genirq/msi: Store the IOMMU IOVA directly in msi_desc instead of iommu_cookie Nicolin Chen
2025-02-13 11:54 ` Thomas Gleixner
2025-02-13 13:57 ` Jason Gunthorpe
2025-02-13 21:34 ` Nicolin Chen
2025-02-13 20:28 ` Jacob Pan
2025-02-13 21:02 ` Nicolin Chen
2025-02-13 21:33 ` Jacob Pan
2025-02-08 9:02 ` [PATCH v1 02/13] genirq/msi: Rename iommu_dma_compose_msi_msg() to msi_msg_set_addr() Nicolin Chen
2025-02-13 12:11 ` Thomas Gleixner
2025-02-13 14:37 ` Jason Gunthorpe
2025-02-08 9:02 ` [PATCH v1 03/13] iommu: Make iommu_dma_prepare_msi() into a generic operation Nicolin Chen
2025-02-08 9:02 ` [PATCH v1 04/13] irqchip: Have CONFIG_IRQ_MSI_IOMMU be selected by the irqchips that need it Nicolin Chen
2025-02-08 9:02 ` [PATCH v1 05/13] iommu: Turn fault_data to iommufd private pointer Nicolin Chen
2025-02-08 9:02 ` [PATCH v1 06/13] iommufd: Implement sw_msi support natively Nicolin Chen
2025-02-11 18:16 ` Jason Gunthorpe
2025-02-11 19:04 ` Nicolin Chen
2025-02-08 9:02 ` [PATCH v1 07/13] iommu: Turn iova_cookie to dma-iommu private pointer Nicolin Chen
2025-02-08 9:02 ` [PATCH v1 08/13] iommufd/device: Move sw_msi_start from igroup to idev Nicolin Chen
2025-02-09 18:41 ` Jason Gunthorpe
2025-02-10 23:25 ` Nicolin Chen
2025-02-08 9:02 ` [PATCH v1 09/13] iommufd: Pass in idev to iopt_table_enforce_dev_resv_regions Nicolin Chen
2025-02-08 9:02 ` [PATCH v1 10/13] iommufd: Add IOMMU_OPTION_SW_MSI_START/SIZE ioctls Nicolin Chen
2025-02-08 9:02 ` [PATCH v1 11/13] iommufd/selftest: Add MOCK_FLAGS_DEVICE_NO_ATTACH Nicolin Chen
2025-02-08 9:02 ` [PATCH v1 12/13] iommufd/selftest: Add a testing reserved region Nicolin Chen
2025-02-08 9:02 ` [PATCH v1 13/13] iommufd/selftest: Add coverage for IOMMU_OPTION_SW_MSI_START/SIZE Nicolin Chen
2025-02-19 13:37 ` [PATCH v1 00/13] iommu: Add MSI mapping support with nested SMMU Jason Gunthorpe
2025-02-19 16:06 ` Nicolin Chen