All of lore.kernel.org
 help / color / mirror / Atom feed
From: "Christophe Leroy (CS GROUP)" <chleroy@kernel.org>
To: Shivaprasad G Bhat <sbhat@linux.ibm.com>,
	linux-kernel@vger.kernel.org, linuxppc-dev@lists.ozlabs.org,
	kvm@vger.kernel.org, iommu@lists.linux.dev
Cc: mpe@ellerman.id.au, maddy@linux.ibm.com, npiggin@gmail.com,
	alex@shazbot.org, joerg.roedel@amd.com, kevin.tian@intel.com,
	gbatra@linux.ibm.com, jgg@nvidia.com, clg@kaod.org,
	vaibhav@linux.ibm.com, brking@linux.vnet.ibm.com,
	nnmlinux@linux.ibm.com, amachhiw@linux.ibm.com,
	tpearson@raptorengineering.com
Subject: Re: [RFC PATCH] powerpc: iommu: Initial IOMMUFD support for PPC64
Date: Wed, 28 Jan 2026 11:53:48 +0100	[thread overview]
Message-ID: <974a95d4-0ae5-400a-992f-9e468a0666d6@kernel.org> (raw)
In-Reply-To: <176953894915.725.1102545144304639827.stgit@linux.ibm.com>



Le 27/01/2026 à 19:35, Shivaprasad G Bhat a écrit :
> The RFC attempts to implement the IOMMUFD support on PPC64 by
> adding new iommu_ops for paging domain. The existing platform
> domain continues to be the default domain for in-kernel use.
> The domain ownership transfer ensures the reset of iommu states
> for the new paging domain and in-kernel usage.
> 
> On PPC64, IOVA ranges are based on the type of the DMA window
> and their properties. Currently, there is no way to expose the
> attributes of the non-default 64-bit DMA window, which the platform
> supports. The platform allows the operating system to select the
> starting offset(at 4GiB or 512PiB default offset), pagesize and
> window size for the non-default 64-bit DMA window. For example,
> with VFIO, this is handled via VFIO_IOMMU_SPAPR_TCE_GET_INFO
> and VFIO_IOMMU_SPAPR_TCE_CREATE|REMOVE ioctls. While I am exploring
> the ways to expose and configure these DMA window attributes as
> per user input, any suggestions in this regard will be very helpful.
> 
> Currently existing vfio type1 specific vfio-compat driver even
> with this patch will not work for PPC64. I believe we need to have
> a separate "vfio-spapr-compat" driver to make it work.
> 
> So brief list of current open problems and ongoing reworks:
>   - Second DMA window support as mentioned above.
>   - KVM support.
>   - EEH support.
>   - The vfio compat driver for the spapr tce iommu.
>   - Multiple devices (multifunction, same/different iommu group checks,
>     SRIOV VF assignment) support.
>   - Race conditions, device plug/unplug.
>   - self|tests.
> 
> The patch currently works for single device and exposes only the
> default DMA window of 1GB to the user. It has been tested for
> both PowerNV and pSeries machine tce iommu backends. The testing
> was done using a Qemu[1] and TCG guest having a NVME device
> passthrough. One can use the command like below to try:
> 
> qemu-system-ppc64 -machine pseries -accel tcg \
> -device spapr-pci-host-bridge,index=1,id=pci.1,ddw=off \
> -device vfio-pci,host=<hostdev>,id=hostdev0,\
> bus=pci.1.0,addr=0x1,iommufd=iommufd0 \
> -object iommufd,id=iommufd0 <...>
> ...
> root:localhost# mount /dev/nvme0n1 /mnt
> root:localhost# ls /mnt
> ...
> 
> The current patch is based on linux kernel 6.19-rc6 tree.

Getting the following build failure on linuxppc-dev patchwork with 
g5_defconfig or ppc64_defconfig:

Error: /linux/arch/powerpc/sysdev/dart_iommu.c:325:9: error: 
initialization of 'int (*)(struct iommu_table *, long int,  long int, 
long unsigned int,  enum dma_data_direction,  long unsigned int,  bool)' 
{aka 'int (*)(struct iommu_table *, long int,  long int,  long unsigned 
int,  enum dma_data_direction,  long unsigned int,  _Bool)'} from 
incompatible pointer type 'int (*)(struct iommu_table *, long int,  long 
int,  long unsigned int,  enum dma_data_direction,  long unsigned int)' 
[-Werror=incompatible-pointer-types]
   .set = dart_build,
          ^~~~~~~~~~
/linux/arch/powerpc/sysdev/dart_iommu.c:325:9: note: (near 
initialization for 'iommu_dart_ops.set')
cc1: all warnings being treated as errors
make[5]: *** [/linux/scripts/Makefile.build:287: 
arch/powerpc/sysdev/dart_iommu.o] Error 1
make[4]: *** [/linux/scripts/Makefile.build:544: arch/powerpc/sysdev] 
Error 2

Christophe

> 
> Signed-off-by: Shivaprasad G Bhat <sbhat@linux.ibm.com>
> 
> References:
> 1 : https://eur01.safelinks.protection.outlook.com/?url=https%3A%2F%2Fgithub.com%2Fshivaprasadbhat%2Fqemu%2Ftree%2Fiommufd-wip&data=05%7C02%7Cchristophe.leroy%40csgroup.eu%7C4b6054524dcf4d42f24308de5dd2fc27%7C8b87af7d86474dc78df45f69a2011bb5%7C0%7C0%7C639051357920885715%7CUnknown%7CTWFpbGZsb3d8eyJFbXB0eU1hcGkiOnRydWUsIlYiOiIwLjAuMDAwMCIsIlAiOiJXaW4zMiIsIkFOIjoiTWFpbCIsIldUIjoyfQ%3D%3D%7C0%7C%7C%7C&sdata=NBGzjiMaEskySEDGCZHhPwQ5VzADQXPCpH45d5p4Cuk%3D&reserved=0
> ---
>   arch/powerpc/include/asm/iommu.h              |    2
>   arch/powerpc/kernel/iommu.c                   |  181 +++++++++++++++++++++++++
>   arch/powerpc/platforms/powernv/pci-ioda-tce.c |    4 -
>   arch/powerpc/platforms/powernv/pci-ioda.c     |    4 -
>   arch/powerpc/platforms/powernv/pci.h          |    2
>   arch/powerpc/platforms/pseries/iommu.c        |    6 -
>   drivers/vfio/Kconfig                          |    4 -
>   7 files changed, 190 insertions(+), 13 deletions(-)
> 
> diff --git a/arch/powerpc/include/asm/iommu.h b/arch/powerpc/include/asm/iommu.h
> index eafdd63cd6c4..1dc72fbb89e7 100644
> --- a/arch/powerpc/include/asm/iommu.h
> +++ b/arch/powerpc/include/asm/iommu.h
> @@ -46,7 +46,7 @@ struct iommu_table_ops {
>   			long index, long npages,
>   			unsigned long uaddr,
>   			enum dma_data_direction direction,
> -			unsigned long attrs);
> +			unsigned long attrs, bool is_phys);
>   #ifdef CONFIG_IOMMU_API
>   	/*
>   	 * Exchanges existing TCE with new TCE plus direction bits;
> diff --git a/arch/powerpc/kernel/iommu.c b/arch/powerpc/kernel/iommu.c
> index 0ce71310b7d9..e6543480c461 100644
> --- a/arch/powerpc/kernel/iommu.c
> +++ b/arch/powerpc/kernel/iommu.c
> @@ -365,7 +365,7 @@ static dma_addr_t iommu_alloc(struct device *dev, struct iommu_table *tbl,
>   	/* Put the TCEs in the HW table */
>   	build_fail = tbl->it_ops->set(tbl, entry, npages,
>   				      (unsigned long)page &
> -				      IOMMU_PAGE_MASK(tbl), direction, attrs);
> +				      IOMMU_PAGE_MASK(tbl), direction, attrs, false);
>   
>   	/* tbl->it_ops->set() only returns non-zero for transient errors.
>   	 * Clean up the table bitmap in this case and return
> @@ -539,7 +539,7 @@ int ppc_iommu_map_sg(struct device *dev, struct iommu_table *tbl,
>   		/* Insert into HW table */
>   		build_fail = tbl->it_ops->set(tbl, entry, npages,
>   					      vaddr & IOMMU_PAGE_MASK(tbl),
> -					      direction, attrs);
> +					      direction, attrs, false);
>   		if(unlikely(build_fail))
>   			goto failure;
>   
> @@ -1201,7 +1201,15 @@ spapr_tce_blocked_iommu_attach_dev(struct iommu_domain *platform_domain,
>   	 * also sets the dma_api ops
>   	 */
>   	table_group = iommu_group_get_iommudata(grp);
> +
> +	if (old && old->type == IOMMU_DOMAIN_DMA) {
> +		ret = table_group->ops->unset_window(table_group, 0);
> +		if (ret)
> +			goto exit;
> +	}
> +
>   	ret = table_group->ops->take_ownership(table_group, dev);
> +exit:
>   	iommu_group_put(grp);
>   
>   	return ret;
> @@ -1260,6 +1268,167 @@ static struct iommu_group *spapr_tce_iommu_device_group(struct device *dev)
>   	return hose->controller_ops.device_group(hose, pdev);
>   }
>   
> +struct ppc64_domain {
> +	struct iommu_domain  domain;
> +	struct device        *device; /* Make it a list */
> +	struct iommu_table   *table;
> +	spinlock_t           list_lock;
> +	struct rcu_head      rcu;
> +};
> +
> +static struct ppc64_domain *to_ppc64_domain(struct iommu_domain *dom)
> +{
> +	return container_of(dom, struct ppc64_domain, domain);
> +}
> +
> +static void spapr_tce_domain_free(struct iommu_domain *domain)
> +{
> +	struct ppc64_domain *ppc64_domain = to_ppc64_domain(domain);
> +
> +	kfree(ppc64_domain);
> +}
> +
> +static const struct iommu_ops spapr_tce_iommu_ops;
> +static struct iommu_domain *spapr_tce_domain_alloc_paging(struct device *dev)
> +{
> +	struct iommu_group *grp = iommu_group_get(dev);
> +	struct iommu_table_group *table_group;
> +	struct ppc64_domain *ppc64_domain;
> +	struct iommu_table *ptbl;
> +	int ret = -1;
> +
> +	table_group = iommu_group_get_iommudata(grp);
> +	ppc64_domain = kzalloc(sizeof(*ppc64_domain), GFP_KERNEL);
> +	if (!ppc64_domain)
> +		return NULL;
> +
> +	/* Just the default window hardcode for now */
> +	ret = table_group->ops->create_table(table_group, 0, 0xc, 0x40000000, 1, &ptbl);
> +	iommu_tce_table_get(ptbl);
> +	ppc64_domain->table = ptbl; /* REVISIT: Single device for now */
> +	if (!ppc64_domain->table) {
> +		kfree(ppc64_domain);
> +		iommu_tce_table_put(ptbl);
> +		iommu_group_put(grp);
> +		return NULL;
> +	}
> +
> +	table_group->ops->set_window(table_group, 0, ptbl);
> +	iommu_group_put(grp);
> +
> +	ppc64_domain->domain.pgsize_bitmap = SZ_4K;
> +	ppc64_domain->domain.geometry.force_aperture = true;
> +	ppc64_domain->domain.geometry.aperture_start = 0;
> +	ppc64_domain->domain.geometry.aperture_end = 0x40000000; /*default window */
> +	ppc64_domain->domain.ops = spapr_tce_iommu_ops.default_domain_ops;
> +
> +	spin_lock_init(&ppc64_domain->list_lock);
> +
> +	return &ppc64_domain->domain;
> +}
> +
> +static size_t spapr_tce_iommu_unmap_pages(struct iommu_domain *domain,
> +				unsigned long iova,
> +				size_t pgsize, size_t pgcount,
> +				struct iommu_iotlb_gather *gather)
> +{
> +	struct ppc64_domain *ppc64_domain = to_ppc64_domain(domain);
> +	struct iommu_table *tbl = ppc64_domain->table;
> +	unsigned long pgshift = __ffs(pgsize);
> +	size_t size = pgcount << pgshift;
> +	size_t mapped = 0;
> +	unsigned int tcenum;
> +	int  mask;
> +
> +	if (pgsize != SZ_4K)
> +		return -EINVAL;
> +
> +	size = PAGE_ALIGN(size);
> +
> +	mask = IOMMU_PAGE_MASK(tbl);
> +	tcenum = iova >> tbl->it_page_shift;
> +
> +	tbl->it_ops->clear(tbl, tcenum, pgcount);
> +
> +	mapped = pgsize * pgcount;
> +
> +	return mapped;
> +}
> +
> +static phys_addr_t spapr_tce_iommu_iova_to_phys(struct iommu_domain *domain, dma_addr_t iova)
> +{
> +	struct ppc64_domain *ppc64_domain = to_ppc64_domain(domain);
> +	struct iommu_table *tbl = ppc64_domain->table;
> +	phys_addr_t paddr, rpn, tceval;
> +	unsigned int tcenum;
> +
> +	tcenum = iova >> tbl->it_page_shift;
> +	tceval = tbl->it_ops->get(tbl, tcenum);
> +
> +	/* Ignore the direction bits */
> +	rpn = tceval >> tbl->it_page_shift;
> +	paddr = rpn << tbl->it_page_shift;
> +
> +	return paddr;
> +}
> +
> +static int spapr_tce_iommu_map_pages(struct iommu_domain *domain,
> +				unsigned long iova, phys_addr_t paddr,
> +				size_t pgsize, size_t pgcount,
> +				int prot, gfp_t gfp, size_t *mapped)
> +{
> +	struct ppc64_domain *ppc64_domain = to_ppc64_domain(domain);
> +	enum dma_data_direction direction = DMA_BIDIRECTIONAL;
> +	struct iommu_table *tbl = ppc64_domain->table;
> +	unsigned long pgshift = __ffs(pgsize);
> +	size_t size = pgcount << pgshift;
> +	unsigned int tcenum;
> +	int ret;
> +
> +	if (pgsize != SZ_4K)
> +		return -EINVAL;
> +
> +	if (iova < ppc64_domain->domain.geometry.aperture_start ||
> +	    (iova + size - 1) > ppc64_domain->domain.geometry.aperture_end)
> +		return -EINVAL;
> +
> +	if (!IS_ALIGNED(iova | paddr, pgsize))
> +		return -EINVAL;
> +
> +	if (!(prot & IOMMU_WRITE))
> +		direction = DMA_FROM_DEVICE;
> +
> +	if (!(prot & IOMMU_READ))
> +		direction = DMA_TO_DEVICE;
> +
> +	size = PAGE_ALIGN(size);
> +	tcenum = iova >> tbl->it_page_shift;
> +
> +	/* Put the TCEs in the HW table */
> +	ret = tbl->it_ops->set(tbl, tcenum, pgcount,
> +				paddr, direction, 0, true);
> +	if (!ret && mapped)
> +		*mapped = pgsize;
> +
> +	return 0;
> +}
> +
> +static int spapr_tce_iommu_attach_device(struct iommu_domain *domain,
> +				    struct device *dev, struct iommu_domain *old)
> +{
> +	struct ppc64_domain *ppc64_domain = to_ppc64_domain(domain);
> +
> +	/* REVISIT */
> +	if (!domain)
> +		return 0;
> +
> +	/* REVISIT: Check table group, list handling */
> +	ppc64_domain->device = dev;
> +
> +	return 0;
> +}
> +
> +
>   static const struct iommu_ops spapr_tce_iommu_ops = {
>   	.default_domain = &spapr_tce_platform_domain,
>   	.blocked_domain = &spapr_tce_blocked_domain,
> @@ -1267,6 +1436,14 @@ static const struct iommu_ops spapr_tce_iommu_ops = {
>   	.probe_device = spapr_tce_iommu_probe_device,
>   	.release_device = spapr_tce_iommu_release_device,
>   	.device_group = spapr_tce_iommu_device_group,
> +	.domain_alloc_paging = spapr_tce_domain_alloc_paging,
> +	.default_domain_ops = &(const struct iommu_domain_ops) {
> +		.attach_dev     = spapr_tce_iommu_attach_device,
> +		.map_pages      = spapr_tce_iommu_map_pages,
> +		.unmap_pages    = spapr_tce_iommu_unmap_pages,
> +		.iova_to_phys   = spapr_tce_iommu_iova_to_phys,
> +		.free           = spapr_tce_domain_free,
> +	}
>   };
>   
>   static struct attribute *spapr_tce_iommu_attrs[] = {
> diff --git a/arch/powerpc/platforms/powernv/pci-ioda-tce.c b/arch/powerpc/platforms/powernv/pci-ioda-tce.c
> index e96324502db0..8800bf86d17a 100644
> --- a/arch/powerpc/platforms/powernv/pci-ioda-tce.c
> +++ b/arch/powerpc/platforms/powernv/pci-ioda-tce.c
> @@ -123,10 +123,10 @@ static __be64 *pnv_tce(struct iommu_table *tbl, bool user, long idx, bool alloc)
>   
>   int pnv_tce_build(struct iommu_table *tbl, long index, long npages,
>   		unsigned long uaddr, enum dma_data_direction direction,
> -		unsigned long attrs)
> +		unsigned long attrs, bool is_phys)
>   {
>   	u64 proto_tce = iommu_direction_to_tce_perm(direction);
> -	u64 rpn = __pa(uaddr) >> tbl->it_page_shift;
> +	u64 rpn = !is_phys ? __pa(uaddr) >> tbl->it_page_shift : uaddr >> tbl->it_page_shift;
>   	long i;
>   
>   	if (proto_tce & TCE_PCI_WRITE)
> diff --git a/arch/powerpc/platforms/powernv/pci-ioda.c b/arch/powerpc/platforms/powernv/pci-ioda.c
> index b0c1d9d16fb5..610146a63e3b 100644
> --- a/arch/powerpc/platforms/powernv/pci-ioda.c
> +++ b/arch/powerpc/platforms/powernv/pci-ioda.c
> @@ -1241,10 +1241,10 @@ static void pnv_pci_ioda2_tce_invalidate(struct iommu_table *tbl,
>   static int pnv_ioda2_tce_build(struct iommu_table *tbl, long index,
>   		long npages, unsigned long uaddr,
>   		enum dma_data_direction direction,
> -		unsigned long attrs)
> +		unsigned long attrs, bool is_phys)
>   {
>   	int ret = pnv_tce_build(tbl, index, npages, uaddr, direction,
> -			attrs);
> +			attrs, is_phys);
>   
>   	if (!ret)
>   		pnv_pci_ioda2_tce_invalidate(tbl, index, npages);
> diff --git a/arch/powerpc/platforms/powernv/pci.h b/arch/powerpc/platforms/powernv/pci.h
> index 42075501663b..3579ecd55d00 100644
> --- a/arch/powerpc/platforms/powernv/pci.h
> +++ b/arch/powerpc/platforms/powernv/pci.h
> @@ -300,7 +300,7 @@ extern void pe_level_printk(const struct pnv_ioda_pe *pe, const char *level,
>   
>   extern int pnv_tce_build(struct iommu_table *tbl, long index, long npages,
>   		unsigned long uaddr, enum dma_data_direction direction,
> -		unsigned long attrs);
> +		unsigned long attrs, bool is_phys);
>   extern void pnv_tce_free(struct iommu_table *tbl, long index, long npages);
>   extern int pnv_tce_xchg(struct iommu_table *tbl, long index,
>   		unsigned long *hpa, enum dma_data_direction *direction);
> diff --git a/arch/powerpc/platforms/pseries/iommu.c b/arch/powerpc/platforms/pseries/iommu.c
> index eec333dd2e59..8c6f9f18e462 100644
> --- a/arch/powerpc/platforms/pseries/iommu.c
> +++ b/arch/powerpc/platforms/pseries/iommu.c
> @@ -122,7 +122,7 @@ static void iommu_pseries_free_group(struct iommu_table_group *table_group,
>   static int tce_build_pSeries(struct iommu_table *tbl, long index,
>   			      long npages, unsigned long uaddr,
>   			      enum dma_data_direction direction,
> -			      unsigned long attrs)
> +			      unsigned long attrs, bool false)
>   {
>   	u64 proto_tce;
>   	__be64 *tcep;
> @@ -250,7 +250,7 @@ static DEFINE_PER_CPU(__be64 *, tce_page);
>   static int tce_buildmulti_pSeriesLP(struct iommu_table *tbl, long tcenum,
>   				     long npages, unsigned long uaddr,
>   				     enum dma_data_direction direction,
> -				     unsigned long attrs)
> +				     unsigned long attrs, bool is_phys)
>   {
>   	u64 rc = 0;
>   	u64 proto_tce;
> @@ -287,7 +287,7 @@ static int tce_buildmulti_pSeriesLP(struct iommu_table *tbl, long tcenum,
>   		__this_cpu_write(tce_page, tcep);
>   	}
>   
> -	rpn = __pa(uaddr) >> tceshift;
> +	rpn = !is_phys ? __pa(uaddr) >> tceshift : uaddr >> tceshift;
>   	proto_tce = TCE_PCI_READ;
>   	if (direction != DMA_TO_DEVICE)
>   		proto_tce |= TCE_PCI_WRITE;
> diff --git a/drivers/vfio/Kconfig b/drivers/vfio/Kconfig
> index ceae52fd7586..9929aa78a5da 100644
> --- a/drivers/vfio/Kconfig
> +++ b/drivers/vfio/Kconfig
> @@ -4,7 +4,7 @@ menuconfig VFIO
>   	select IOMMU_API
>   	depends on IOMMUFD || !IOMMUFD
>   	select INTERVAL_TREE
> -	select VFIO_GROUP if SPAPR_TCE_IOMMU || IOMMUFD=n
> +	select VFIO_GROUP if IOMMUFD=n
>   	select VFIO_DEVICE_CDEV if !VFIO_GROUP
>   	select VFIO_CONTAINER if IOMMUFD=n
>   	help
> @@ -16,7 +16,7 @@ menuconfig VFIO
>   if VFIO
>   config VFIO_DEVICE_CDEV
>   	bool "Support for the VFIO cdev /dev/vfio/devices/vfioX"
> -	depends on IOMMUFD && !SPAPR_TCE_IOMMU
> +	depends on IOMMUFD
>   	default !VFIO_GROUP
>   	help
>   	  The VFIO device cdev is another way for userspace to get device
> 
> 


  parent reply	other threads:[~2026-01-28 10:53 UTC|newest]

Thread overview: 8+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2026-01-27 18:35 [RFC PATCH] powerpc: iommu: Initial IOMMUFD support for PPC64 Shivaprasad G Bhat
2026-01-27 19:16 ` Jason Gunthorpe
2026-02-03 15:52   ` Shivaprasad G Bhat
2026-02-03 18:07     ` Jason Gunthorpe
2026-02-04 15:23       ` Shivaprasad G Bhat
2026-01-28 10:53 ` Christophe Leroy (CS GROUP) [this message]
2026-02-03 15:54   ` Shivaprasad G Bhat
2026-02-01 22:44 ` kernel test robot

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=974a95d4-0ae5-400a-992f-9e468a0666d6@kernel.org \
    --to=chleroy@kernel.org \
    --cc=alex@shazbot.org \
    --cc=amachhiw@linux.ibm.com \
    --cc=brking@linux.vnet.ibm.com \
    --cc=clg@kaod.org \
    --cc=gbatra@linux.ibm.com \
    --cc=iommu@lists.linux.dev \
    --cc=jgg@nvidia.com \
    --cc=joerg.roedel@amd.com \
    --cc=kevin.tian@intel.com \
    --cc=kvm@vger.kernel.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linuxppc-dev@lists.ozlabs.org \
    --cc=maddy@linux.ibm.com \
    --cc=mpe@ellerman.id.au \
    --cc=nnmlinux@linux.ibm.com \
    --cc=npiggin@gmail.com \
    --cc=sbhat@linux.ibm.com \
    --cc=tpearson@raptorengineering.com \
    --cc=vaibhav@linux.ibm.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.