Re: [PATCH kernel v4 10/10] KVM: PPC: VFIO: Add in-kernel acceleration for VFIO

All of lore.kernel.org
 help / color / mirror / Atom feed

From: David Gibson <david@gibson.dropbear.id.au>
To: Alexey Kardashevskiy <aik@ozlabs.ru>
Cc: Paul Mackerras <paulus@samba.org>,
	Alex Williamson <alex.williamson@redhat.com>,
	linuxppc-dev@lists.ozlabs.org, kvm@vger.kernel.org,
	kvm-ppc@vger.kernel.org
Subject: Re: [PATCH kernel v4 10/10] KVM: PPC: VFIO: Add in-kernel acceleration for VFIO
Date: Thu, 09 Feb 2017 06:41:12 +0000	[thread overview]
Message-ID: <20170209064112.GD14524@umbus> (raw)
In-Reply-To: <20170207071711.28938-11-aik@ozlabs.ru>

[-- Attachment #1: Type: text/plain, Size: 32625 bytes --]

On Tue, Feb 07, 2017 at 06:17:11PM +1100, Alexey Kardashevskiy wrote:
> This allows the host kernel to handle H_PUT_TCE, H_PUT_TCE_INDIRECT
> and H_STUFF_TCE requests targeted an IOMMU TCE table used for VFIO
> without passing them to user space which saves time on switching
> to user space and back.
> 
> This adds H_PUT_TCE/H_PUT_TCE_INDIRECT/H_STUFF_TCE handlers to KVM.
> KVM tries to handle a TCE request in the real mode, if failed
> it passes the request to the virtual mode to complete the operation.
> If it a virtual mode handler fails, the request is passed to
> the user space; this is not expected to happen though.
> 
> To avoid dealing with page use counters (which is tricky in real mode),
> this only accelerates SPAPR TCE IOMMU v2 clients which are required
> to pre-register the userspace memory. The very first TCE request will
> be handled in the VFIO SPAPR TCE driver anyway as the userspace view
> of the TCE table (iommu_table::it_userspace) is not allocated till
> the very first mapping happens and we cannot call vmalloc in real mode.
> 
> This adds new attribute - KVM_DEV_VFIO_GROUP_SET_SPAPR_TCE - to
> the VFIO KVM device. It takes a VFIO group fd and SPAPR TCE table fd
> and associates a physical IOMMU table with the SPAPR TCE table (which
> is a guest view of the hardware IOMMU table). The iommu_table object
> is cached and referenced so we do not have to look up for it in real mode.
> 
> This does not implement the UNSET counterpart as there is no use for it -
> once the acceleration is enabled, the existing userspace won't
> disable it unless a VFIO container is destroyed; this adds necessary
> cleanup to the KVM_DEV_VFIO_GROUP_DEL handler.
> 
> As this creates a descriptor per IOMMU table-LIOBN couple (called
> kvmppc_spapr_tce_iommu_table), it is possible to have several
> descriptors with the same iommu_table (hardware IOMMU table) attached
> to the same LIOBN; we do not remove duplicates though as
> iommu_table_ops::exchange not just update a TCE entry (which is
> shared among IOMMU groups) but also invalidates the TCE cache
> (one per IOMMU group).
> 
> This advertises the new KVM_CAP_SPAPR_TCE_VFIO capability to the user
> space.
> 
> This finally makes use of vfio_external_user_iommu_id() which was
> introduced quite some time ago and was considered for removal.
> 
> Tests show that this patch increases transmission speed from 220MB/s
> to 750..1020MB/s on 10Gb network (Chelsea CXGB3 10Gb ethernet card).
> 
> Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
> ---
> Changes:
> v4:
> * added note to the commit log about allowing multiple updates of
> the same IOMMU table;
> * instead of checking for if any memory was preregistered, this
> returns H_TOO_HARD if a specific page was not;
> * fixed comments from v3 about error handling in many places;
> * simplified TCE handlers and merged IOMMU parts inline - for example,
> there used to be kvmppc_h_put_tce_iommu(), now it is merged into
> kvmppc_h_put_tce(); this allows to check IOBA boundaries against
> the first attached table only (makes the code simpler);
> 
> v3:
> * simplified not to use VFIO group notifiers
> * reworked cleanup, should be cleaner/simpler now
> 
> v2:
> * reworked to use new VFIO notifiers
> * now same iommu_table may appear in the list several times, to be fixed later
> ---
> 
> This has separate copies of handlers for real and virtual modes as
> in fact H_PUT_TCE and H_STUFF_TCE could share a lot (common helpers
> would take a "realmode" flag) but H_PUT_TCE_INDIRECT uses get_user()
> in virtual mode and direct access in real mode and having a common
> helper for it would make things uglier imho.
> 
> 
> ---
>  Documentation/virtual/kvm/devices/vfio.txt |  22 +-
>  arch/powerpc/include/asm/kvm_host.h        |   8 +
>  arch/powerpc/include/asm/kvm_ppc.h         |   4 +
>  include/uapi/linux/kvm.h                   |   8 +
>  arch/powerpc/kvm/book3s_64_vio.c           | 319 ++++++++++++++++++++++++++++-
>  arch/powerpc/kvm/book3s_64_vio_hv.c        | 172 +++++++++++++++-
>  arch/powerpc/kvm/powerpc.c                 |   2 +
>  virt/kvm/vfio.c                            |  60 ++++++
>  8 files changed, 590 insertions(+), 5 deletions(-)
> 
> diff --git a/Documentation/virtual/kvm/devices/vfio.txt b/Documentation/virtual/kvm/devices/vfio.txt
> index ef51740c67ca..f95d867168ea 100644
> --- a/Documentation/virtual/kvm/devices/vfio.txt
> +++ b/Documentation/virtual/kvm/devices/vfio.txt
> @@ -16,7 +16,25 @@ Groups:
>  
>  KVM_DEV_VFIO_GROUP attributes:
>    KVM_DEV_VFIO_GROUP_ADD: Add a VFIO group to VFIO-KVM device tracking
> +	kvm_device_attr.addr points to an int32_t file descriptor
> +	for the VFIO group.
>    KVM_DEV_VFIO_GROUP_DEL: Remove a VFIO group from VFIO-KVM device tracking
> +	kvm_device_attr.addr points to an int32_t file descriptor
> +	for the VFIO group.
> +  KVM_DEV_VFIO_GROUP_SET_SPAPR_TCE: attaches a guest visible TCE table
> +	allocated by sPAPR KVM.
> +	kvm_device_attr.addr points to a struct:
>  
> -For each, kvm_device_attr.addr points to an int32_t file descriptor
> -for the VFIO group.
> +	struct kvm_vfio_spapr_tce {
> +		__u32	argsz;
> +		__u32	flags;
> +		__s32	groupfd;
> +		__s32	tablefd;
> +	};
> +
> +	where
> +	@argsz is the size of kvm_vfio_spapr_tce_liobn;
> +	@flags are not supported now, must be zero;
> +	@groupfd is a file descriptor for a VFIO group;
> +	@tablefd is a file descriptor for a TCE table allocated via
> +		KVM_CREATE_SPAPR_TCE.
> diff --git a/arch/powerpc/include/asm/kvm_host.h b/arch/powerpc/include/asm/kvm_host.h
> index e59b172666cd..a827006941f8 100644
> --- a/arch/powerpc/include/asm/kvm_host.h
> +++ b/arch/powerpc/include/asm/kvm_host.h
> @@ -191,6 +191,13 @@ struct kvmppc_pginfo {
>  	atomic_t refcnt;
>  };
>  
> +struct kvmppc_spapr_tce_iommu_table {
> +	struct rcu_head rcu;
> +	struct list_head next;
> +	struct vfio_group *group;
> +	struct iommu_table *tbl;
> +};
> +
>  struct kvmppc_spapr_tce_table {
>  	struct list_head list;
>  	struct kvm *kvm;
> @@ -199,6 +206,7 @@ struct kvmppc_spapr_tce_table {
>  	u32 page_shift;
>  	u64 offset;		/* in pages */
>  	u64 size;		/* window size in pages */
> +	struct list_head iommu_tables;
>  	struct page *pages[0];
>  };
>  
> diff --git a/arch/powerpc/include/asm/kvm_ppc.h b/arch/powerpc/include/asm/kvm_ppc.h
> index 37bc9e7e90ba..da1410bd6b36 100644
> --- a/arch/powerpc/include/asm/kvm_ppc.h
> +++ b/arch/powerpc/include/asm/kvm_ppc.h
> @@ -163,6 +163,10 @@ extern long kvmppc_prepare_vrma(struct kvm *kvm,
>  extern void kvmppc_map_vrma(struct kvm_vcpu *vcpu,
>  			struct kvm_memory_slot *memslot, unsigned long porder);
>  extern int kvmppc_pseries_do_hcall(struct kvm_vcpu *vcpu);
> +extern long kvm_spapr_tce_attach_iommu_group(struct kvm *kvm, int tablefd,
> +		struct vfio_group *group);
> +extern void kvm_spapr_tce_release_iommu_group(struct kvm *kvm,
> +		struct vfio_group *group);
>  
>  extern long kvm_vm_ioctl_create_spapr_tce(struct kvm *kvm,
>  				struct kvm_create_spapr_tce_64 *args);
> diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
> index a2c9bb5a0ead..cdfa01169bd2 100644
> --- a/include/uapi/linux/kvm.h
> +++ b/include/uapi/linux/kvm.h
> @@ -1076,6 +1076,7 @@ struct kvm_device_attr {
>  #define  KVM_DEV_VFIO_GROUP			1
>  #define   KVM_DEV_VFIO_GROUP_ADD			1
>  #define   KVM_DEV_VFIO_GROUP_DEL			2
> +#define   KVM_DEV_VFIO_GROUP_SET_SPAPR_TCE		3
>  
>  enum kvm_device_type {
>  	KVM_DEV_TYPE_FSL_MPIC_20	= 1,
> @@ -1097,6 +1098,13 @@ enum kvm_device_type {
>  	KVM_DEV_TYPE_MAX,
>  };
>  
> +struct kvm_vfio_spapr_tce {
> +	__u32	argsz;
> +	__u32	flags;
> +	__s32	groupfd;
> +	__s32	tablefd;
> +};
> +
>  /*
>   * ioctls for VM fds
>   */
> diff --git a/arch/powerpc/kvm/book3s_64_vio.c b/arch/powerpc/kvm/book3s_64_vio.c
> index 9a7b7fca5e84..cb0469151e35 100644
> --- a/arch/powerpc/kvm/book3s_64_vio.c
> +++ b/arch/powerpc/kvm/book3s_64_vio.c
> @@ -27,6 +27,10 @@
>  #include <linux/hugetlb.h>
>  #include <linux/list.h>
>  #include <linux/anon_inodes.h>
> +#include <linux/iommu.h>
> +#include <linux/file.h>
> +#include <linux/vfio.h>
> +#include <linux/module.h>
>  
>  #include <asm/tlbflush.h>
>  #include <asm/kvm_ppc.h>
> @@ -39,6 +43,36 @@
>  #include <asm/udbg.h>
>  #include <asm/iommu.h>
>  #include <asm/tce.h>
> +#include <asm/mmu_context.h>
> +
> +static void kvm_vfio_group_put_external_user(struct vfio_group *vfio_group)
> +{
> +	void (*fn)(struct vfio_group *);
> +
> +	fn = symbol_get(vfio_group_put_external_user);
> +	if (WARN_ON(!fn))
> +		return;
> +
> +	fn(vfio_group);
> +
> +	symbol_put(vfio_group_put_external_user);
> +}
> +
> +static int kvm_vfio_external_user_iommu_id(struct vfio_group *vfio_group)
> +{
> +	int (*fn)(struct vfio_group *);
> +	int ret = -1;
> +
> +	fn = symbol_get(vfio_external_user_iommu_id);
> +	if (!fn)
> +		return ret;
> +
> +	ret = fn(vfio_group);
> +
> +	symbol_put(vfio_external_user_iommu_id);
> +
> +	return ret;
> +}
>  
>  static unsigned long kvmppc_tce_pages(unsigned long iommu_pages)
>  {
> @@ -90,6 +124,123 @@ static long kvmppc_account_memlimit(unsigned long stt_pages, bool inc)
>  	return ret;
>  }
>  
> +static void kvm_spapr_tce_iommu_table_free(struct rcu_head *head)
> +{
> +	struct kvmppc_spapr_tce_iommu_table *stit = container_of(head,
> +			struct kvmppc_spapr_tce_iommu_table, rcu);
> +
> +	iommu_table_put(stit->tbl);
> +	kvm_vfio_group_put_external_user(stit->group);
> +
> +	kfree(stit);
> +}
> +
> +static void kvm_spapr_tce_liobn_release_iommu_group(
> +		struct kvmppc_spapr_tce_table *stt,
> +		struct vfio_group *group)
> +{
> +	struct kvmppc_spapr_tce_iommu_table *stit, *tmp;
> +
> +	list_for_each_entry_safe(stit, tmp, &stt->iommu_tables, next) {
> +		if (group && (stit->group != group))
> +			continue;
> +
> +		list_del_rcu(&stit->next);
> +
> +		call_rcu(&stit->rcu, kvm_spapr_tce_iommu_table_free);
> +	}
> +}
> +
> +extern void kvm_spapr_tce_release_iommu_group(struct kvm *kvm,
> +		struct vfio_group *group)
> +{
> +	struct kvmppc_spapr_tce_table *stt;
> +
> +	list_for_each_entry_rcu(stt, &kvm->arch.spapr_tce_tables, list)
> +		kvm_spapr_tce_liobn_release_iommu_group(stt, group);
> +}
> +
> +extern long kvm_spapr_tce_attach_iommu_group(struct kvm *kvm, int tablefd,
> +		struct vfio_group *group)
> +{
> +	struct kvmppc_spapr_tce_table *stt = NULL;
> +	bool found = false;
> +	struct iommu_table *tbl = NULL;
> +	struct iommu_table_group *table_group;
> +	long i, ret = 0;
> +	struct kvmppc_spapr_tce_iommu_table *stit;
> +	struct fd f;
> +	int group_id;
> +	struct iommu_group *grp;
> +
> +	group_id = kvm_vfio_external_user_iommu_id(group);
> +	grp = iommu_group_get_by_id(group_id);
> +	if (!grp)
> +		return -EFAULT;

EFAULT doesn't look right, that's usually means userspace has give us
a bad address.  What does failure to look up the iommu group by id
mean here?


> +
> +	f = fdget(tablefd);
> +	if (!f.file) {
> +		ret = -EBADF;
> +		goto put_exit;
> +	}
> +
> +	list_for_each_entry_rcu(stt, &kvm->arch.spapr_tce_tables, list) {
> +		if (stt == f.file->private_data) {
> +			found = true;
> +			break;
> +		}
> +	}
> +
> +	fdput(f);
> +
> +	if (!found) {
> +		ret = -ENODEV;

ENODEV doesn't look right either.  That generally means you're trying
to use a device or facility that doesn't exist.  This case just means
you've passed a file handle that either isn't a TCE table at all, or
os one associated with a different VM.  -EINVAL, I guess, overloaded
as it is.

> +		goto put_exit;

Don't you need to put the table fd as well as the iommu group which
you put in that exit path?

> +	}
> +
> +	table_group = iommu_group_get_iommudata(grp);
> +	if (WARN_ON(!table_group)) {
> +		ret = -EFAULT;
> +		goto put_exit;

Again don't you need to put the table fd as well.

> +	}
> +
> +	for (i = 0; i < IOMMU_TABLE_GROUP_MAX_TABLES; ++i) {
> +		struct iommu_table *tbltmp = table_group->tables[i];
> +
> +		if (!tbltmp)
> +			continue;
> +
> +		/*
> +		 * Make sure hardware table parameters are exactly the same;
> +		 * this is used in the TCE handlers where boundary checks
> +		 * use only the first attached table.
> +		 */
> +		if ((tbltmp->it_page_shift == stt->page_shift) &&
> +				(tbltmp->it_offset == stt->offset) &&
> +				(tbltmp->it_size == stt->size)) {
> +			tbl = tbltmp;
> +			break;
> +		}
> +	}
> +	if (!tbl) {
> +		ret = -ENODEV;

Again, ENODEV doesn't seem right.  Here the problem is that the host
hardware constraints don't match the guest hardware constraints.
Hmm.  EIO?  ENOSPC?

> +		goto put_exit;
> +	}
> +
> +	iommu_table_get(tbl);
> +
> +	stit = kzalloc(sizeof(*stit), GFP_KERNEL);
> +	stit->tbl = tbl;
> +	stit->group = group;
> +
> +	list_add_rcu(&stit->next, &stt->iommu_tables);

So if you add the same group to the same liobn multiple times, you'll
get multiple identical entries in this list.

I guess that's mostly harmless... although.. does it allow the user to
force the allocation of arbitrary amounts of kernel memory in that
list?

> +put_exit:
> +	iommu_group_put(grp);
> +
> +	return ret;
> +}
> +
>  static void release_spapr_tce_table(struct rcu_head *head)
>  {
>  	struct kvmppc_spapr_tce_table *stt = container_of(head,
> @@ -132,6 +283,8 @@ static int kvm_spapr_tce_release(struct inode *inode, struct file *filp)
>  
>  	list_del_rcu(&stt->list);
>  
> +	kvm_spapr_tce_liobn_release_iommu_group(stt, NULL /* release all */);
> +
>  	kvm_put_kvm(stt->kvm);
>  
>  	kvmppc_account_memlimit(
> @@ -181,6 +334,7 @@ long kvm_vm_ioctl_create_spapr_tce(struct kvm *kvm,
>  	stt->offset = args->offset;
>  	stt->size = size;
>  	stt->kvm = kvm;
> +	INIT_LIST_HEAD_RCU(&stt->iommu_tables);
>  
>  	for (i = 0; i < npages; i++) {
>  		stt->pages[i] = alloc_page(GFP_KERNEL | __GFP_ZERO);
> @@ -209,11 +363,94 @@ long kvm_vm_ioctl_create_spapr_tce(struct kvm *kvm,
>  	return ret;
>  }
>  
> +static long kvmppc_tce_iommu_mapped_dec(struct kvm *kvm,
> +		struct iommu_table *tbl, unsigned long entry)
> +{
> +	struct mm_iommu_table_group_mem_t *mem = NULL;
> +	const unsigned long pgsize = 1ULL << tbl->it_page_shift;
> +	unsigned long *pua = IOMMU_TABLE_USERSPACE_ENTRY(tbl, entry);
> +
> +	if (!pua)
> +		return H_HARDWARE;

What could trigger this error?  Should it be a WARN_ON?

> +	mem = mm_iommu_lookup(kvm->mm, *pua, pgsize);
> +	if (!mem)
> +		return H_TOO_HARD;
> +
> +	mm_iommu_mapped_dec(mem);
> +
> +	*pua = 0;
> +
> +	return H_SUCCESS;
> +}
> +
> +static long kvmppc_tce_iommu_unmap(struct kvm *kvm,
> +		struct iommu_table *tbl, unsigned long entry)
> +{
> +	enum dma_data_direction dir = DMA_NONE;
> +	unsigned long hpa = 0;
> +	long ret;
> +
> +	if (iommu_tce_xchg(tbl, entry, &hpa, &dir))
> +		return H_HARDWARE;
> +
> +	if (dir == DMA_NONE)
> +		return H_SUCCESS;
> +
> +	ret = kvmppc_tce_iommu_mapped_dec(kvm, tbl, entry);
> +	if (ret != H_SUCCESS)
> +		iommu_tce_xchg(tbl, entry, &hpa, &dir);
> +
> +	return ret;
> +}
> +
> +long kvmppc_tce_iommu_map(struct kvm *kvm, struct iommu_table *tbl,
> +		unsigned long entry, unsigned long gpa,
> +		enum dma_data_direction dir)
> +{
> +	long ret;
> +	unsigned long hpa, ua, *pua = IOMMU_TABLE_USERSPACE_ENTRY(tbl, entry);
> +	struct mm_iommu_table_group_mem_t *mem;
> +
> +	if (!pua)
> +		/* it_userspace allocation might be delayed */
> +		return H_TOO_HARD;
> +
> +	if (kvmppc_gpa_to_ua(kvm, gpa, &ua, NULL))
> +		return H_PARAMETER;
> +
> +	mem = mm_iommu_lookup(kvm->mm, ua, 1ULL << tbl->it_page_shift);
> +	if (!mem)
> +		return H_TOO_HARD;
> +
> +	if (mm_iommu_ua_to_hpa(mem, ua, &hpa))
> +		return H_HARDWARE;

IIUC this would happen if qemu had failed to preregister all of guest
RAM, making this indeed an H_HARDWARE.

> +	if (mm_iommu_mapped_inc(mem))
> +		return H_HARDWARE;

I'm less clear on when this one would happen.

> +
> +	ret = iommu_tce_xchg(tbl, entry, &hpa, &dir);
> +	if (ret) {
> +		mm_iommu_mapped_dec(mem);
> +		return H_TOO_HARD;
> +	}
> +
> +	if (dir != DMA_NONE)
> +		kvmppc_tce_iommu_mapped_dec(kvm, tbl, entry);
> +
> +	*pua = ua;
> +
> +	return 0;
> +}
> +
>  long kvmppc_h_put_tce(struct kvm_vcpu *vcpu, unsigned long liobn,
>  		      unsigned long ioba, unsigned long tce)
>  {
>  	struct kvmppc_spapr_tce_table *stt;
> -	long ret;
> +	long ret, idx;
> +	struct kvmppc_spapr_tce_iommu_table *stit;
> +	unsigned long entry, gpa;
> +	enum dma_data_direction dir;
>  
>  	/* udbg_printf("H_PUT_TCE(): liobn=0x%lx ioba=0x%lx, tce=0x%lx\n", */
>  	/* 	    liobn, ioba, tce); */
> @@ -230,6 +467,36 @@ long kvmppc_h_put_tce(struct kvm_vcpu *vcpu, unsigned long liobn,
>  	if (ret != H_SUCCESS)
>  		return ret;
>  
> +	stit = list_first_entry_or_null(&stt->iommu_tables,
> +			struct kvmppc_spapr_tce_iommu_table, next);
> +	if (stit) {
> +		entry = ioba >> stit->tbl->it_page_shift;
> +		gpa = tce & ~(TCE_PCI_READ | TCE_PCI_WRITE);
> +		dir = iommu_tce_direction(tce);
> +
> +		if (dir == DMA_NONE) {
> +			if (iommu_tce_clear_param_check(stit->tbl, ioba, 0, 1))
> +				return H_PARAMETER;
> +		} else {
> +			if (iommu_tce_put_param_check(stit->tbl, ioba, gpa))

Any way you could make these param check functions based on stt
instead of stit->tbl?  That would let you do them before checking if
there are any hw tables to update, avaoiding the somewhat awkward
	if (at least one)
		for (each one)
construct.

> +				return H_PARAMETER;
> +		}
> +
> +		list_for_each_entry_lockless(stit, &stt->iommu_tables, next) {
> +			if (dir == DMA_NONE) {
> +				ret = kvmppc_tce_iommu_unmap(vcpu->kvm,
> +						stit->tbl, entry);
> +			} else {
> +				idx = srcu_read_lock(&vcpu->kvm->srcu);
> +				ret = kvmppc_tce_iommu_map(vcpu->kvm, stit->tbl,
> +						entry, gpa, dir);
> +				srcu_read_unlock(&vcpu->kvm->srcu, idx);
> +			}
> +			if (ret != H_SUCCESS)
> +				return ret;

Doesn't this error path need to clean up for the case where you
managed to update some backing TCE tables, but then failed later ones?

> +		}
> +	}
> +
>  	kvmppc_tce_put(stt, ioba >> stt->page_shift, tce);
>  
>  	return H_SUCCESS;
> @@ -242,9 +509,10 @@ long kvmppc_h_put_tce_indirect(struct kvm_vcpu *vcpu,
>  {
>  	struct kvmppc_spapr_tce_table *stt;
>  	long i, ret = H_SUCCESS, idx;
> -	unsigned long entry, ua = 0;
> +	unsigned long entry, gpa, ua = 0;
>  	u64 __user *tces;
>  	u64 tce;
> +	struct kvmppc_spapr_tce_iommu_table *stit;
>  
>  	stt = kvmppc_find_table(vcpu->kvm, liobn);
>  	if (!stt)
> @@ -272,6 +540,9 @@ long kvmppc_h_put_tce_indirect(struct kvm_vcpu *vcpu,
>  	}
>  	tces = (u64 __user *) ua;
>  
> +	stit = list_first_entry_or_null(&stt->iommu_tables,
> +			struct kvmppc_spapr_tce_iommu_table, next);
> +
>  	for (i = 0; i < npages; ++i) {
>  		if (get_user(tce, tces + i)) {
>  			ret = H_TOO_HARD;
> @@ -282,6 +553,15 @@ long kvmppc_h_put_tce_indirect(struct kvm_vcpu *vcpu,
>  		ret = kvmppc_tce_validate(stt, tce);
>  		if (ret != H_SUCCESS)
>  			goto unlock_exit;
> +
> +		if (stit) {
> +			gpa = tce & ~(TCE_PCI_READ | TCE_PCI_WRITE);
> +			ret = iommu_tce_put_param_check(stit->tbl,
> +					ioba + (i << stit->tbl->it_page_shift),
> +					gpa);
> +			if (ret != H_SUCCESS)
> +				goto unlock_exit;
> +		}
>  	}
>  
>  	for (i = 0; i < npages; ++i) {
> @@ -291,6 +571,21 @@ long kvmppc_h_put_tce_indirect(struct kvm_vcpu *vcpu,
>  		}
>  		tce = be64_to_cpu(tce);
>  
> +		if (stit) {
> +			for (i = 0; i < npages; ++i) {
> +				gpa = tce & ~(TCE_PCI_READ | TCE_PCI_WRITE);
> +
> +				list_for_each_entry_lockless(stit,
> +						&stt->iommu_tables, next) {
> +					ret = kvmppc_tce_iommu_map(vcpu->kvm,
> +						stit->tbl, entry + i, gpa,
> +						iommu_tce_direction(tce));
> +					if (ret != H_SUCCESS)
> +						goto unlock_exit;
> +				}

Um.. what value will this for_each leave in stit after completion?  I
suspect it will be something bogus, which means re-using stit in the
next 0..npages loop iteration won't be safe (you only initialize stit
with the first entry outside that loop).

> +			}
> +		}
> +
>  		kvmppc_tce_put(stt, entry + i, tce);
>  	}
>  
> @@ -307,6 +602,7 @@ long kvmppc_h_stuff_tce(struct kvm_vcpu *vcpu,
>  {
>  	struct kvmppc_spapr_tce_table *stt;
>  	long i, ret;
> +	struct kvmppc_spapr_tce_iommu_table *stit;
>  
>  	stt = kvmppc_find_table(vcpu->kvm, liobn);
>  	if (!stt)
> @@ -320,6 +616,25 @@ long kvmppc_h_stuff_tce(struct kvm_vcpu *vcpu,
>  	if (tce_value & (TCE_PCI_WRITE | TCE_PCI_READ))
>  		return H_PARAMETER;
>  
> +	stit = list_first_entry_or_null(&stt->iommu_tables,
> +			struct kvmppc_spapr_tce_iommu_table, next);
> +	if (stit) {
> +		if (iommu_tce_clear_param_check(stit->tbl, ioba,
> +					tce_value, npages))
> +			return H_PARAMETER;
> +
> +		list_for_each_entry_lockless(stit, &stt->iommu_tables, next) {
> +			unsigned long entry = ioba >> stit->tbl->it_page_shift;
> +
> +			for (i = 0; i < npages; ++i) {
> +				ret = kvmppc_tce_iommu_unmap(vcpu->kvm,
> +						stit->tbl, entry + i);
> +				if (ret)
> +					return ret;

Again do you need some sort of cleanup for partial completion?


> +			}
> +		}
> +	}
> +
>  	for (i = 0; i < npages; ++i, ioba += (1ULL << stt->page_shift))
>  		kvmppc_tce_put(stt, ioba >> stt->page_shift, tce_value);
>  
> diff --git a/arch/powerpc/kvm/book3s_64_vio_hv.c b/arch/powerpc/kvm/book3s_64_vio_hv.c
> index dc1c66fda941..018c7d94a575 100644
> --- a/arch/powerpc/kvm/book3s_64_vio_hv.c
> +++ b/arch/powerpc/kvm/book3s_64_vio_hv.c
> @@ -178,11 +178,104 @@ long kvmppc_gpa_to_ua(struct kvm *kvm, unsigned long gpa,
>  EXPORT_SYMBOL_GPL(kvmppc_gpa_to_ua);
>  
>  #ifdef CONFIG_KVM_BOOK3S_HV_POSSIBLE
> +static long kvmppc_rm_tce_iommu_mapped_dec(struct kvm *kvm,
> +		struct iommu_table *tbl, unsigned long entry)
> +{
> +	struct mm_iommu_table_group_mem_t *mem = NULL;
> +	const unsigned long pgsize = 1ULL << tbl->it_page_shift;
> +	unsigned long *pua = IOMMU_TABLE_USERSPACE_ENTRY(tbl, entry);
> +
> +	if (!pua)
> +		return H_HARDWARE;
> +
> +	pua = (void *) vmalloc_to_phys(pua);
> +	if (!pua)
> +		return H_TOO_HARD;
> +
> +	mem = mm_iommu_lookup_rm(kvm->mm, *pua, pgsize);
> +	if (!mem)
> +		return H_TOO_HARD;
> +
> +	mm_iommu_mapped_dec(mem);
> +
> +	*pua = 0;
> +
> +	return H_SUCCESS;
> +}
> +
> +static long kvmppc_rm_tce_iommu_unmap(struct kvm *kvm,
> +		struct iommu_table *tbl, unsigned long entry)
> +{
> +	enum dma_data_direction dir = DMA_NONE;
> +	unsigned long hpa = 0;
> +	long ret;
> +
> +	if (iommu_tce_xchg_rm(tbl, entry, &hpa, &dir))
> +		return H_HARDWARE;
> +
> +	if (dir == DMA_NONE)
> +		return H_SUCCESS;
> +
> +	ret = kvmppc_rm_tce_iommu_mapped_dec(kvm, tbl, entry);
> +	if (ret)
> +		iommu_tce_xchg_rm(tbl, entry, &hpa, &dir);
> +
> +	return ret;
> +}
> +
> +long kvmppc_rm_tce_iommu_map(struct kvm_vcpu *vcpu, struct iommu_table *tbl,
> +		unsigned long entry, unsigned long gpa,
> +		enum dma_data_direction dir)
> +{
> +	long ret;
> +	unsigned long hpa = 0, ua;
> +	unsigned long *pua = IOMMU_TABLE_USERSPACE_ENTRY(tbl, entry);
> +	struct mm_iommu_table_group_mem_t *mem;
> +
> +	if (!pua)
> +		/* it_userspace allocation might be delayed */
> +		return H_TOO_HARD;
> +
> +	if (kvmppc_gpa_to_ua(vcpu->kvm, gpa, &ua, NULL))
> +		return H_PARAMETER;
> +
> +	mem = mm_iommu_lookup_rm(vcpu->kvm->mm, ua, 1ULL << tbl->it_page_shift);
> +	if (!mem)
> +		return H_TOO_HARD;
> +
> +	if (mm_iommu_ua_to_hpa_rm(mem, ua, &hpa))
> +		return H_HARDWARE;
> +
> +	pua = (void *) vmalloc_to_phys(pua);
> +	if (!pua)
> +		return H_HARDWARE;

What circumstances can this fail under?  Does it need to be H_TOO_HARD instead?

> +
> +	if (mm_iommu_mapped_inc(mem))
> +		return H_HARDWARE;
> +
> +	ret = iommu_tce_xchg_rm(tbl, entry, &hpa, &dir);
> +	if (ret) {
> +		mm_iommu_mapped_dec(mem);
> +		return H_TOO_HARD;
> +	}
> +
> +	if (dir != DMA_NONE)
> +		kvmppc_rm_tce_iommu_mapped_dec(vcpu->kvm, tbl, entry);
> +
> +	*pua = ua;
> +
> +	return 0;
> +}
> +EXPORT_SYMBOL_GPL(kvmppc_rm_tce_iommu_map);
> +
>  long kvmppc_rm_h_put_tce(struct kvm_vcpu *vcpu, unsigned long liobn,
>  		unsigned long ioba, unsigned long tce)
>  {
>  	struct kvmppc_spapr_tce_table *stt;
>  	long ret;
> +	struct kvmppc_spapr_tce_iommu_table *stit;
> +	unsigned long entry, gpa;
> +	enum dma_data_direction dir;
>  
>  	/* udbg_printf("H_PUT_TCE(): liobn=0x%lx ioba=0x%lx, tce=0x%lx\n", */
>  	/* 	    liobn, ioba, tce); */
> @@ -199,6 +292,33 @@ long kvmppc_rm_h_put_tce(struct kvm_vcpu *vcpu, unsigned long liobn,
>  	if (ret != H_SUCCESS)
>  		return ret;
>  
> +	stit = list_first_entry_or_null(&stt->iommu_tables,
> +			struct kvmppc_spapr_tce_iommu_table, next);
> +	if (stit) {
> +		entry = ioba >> stit->tbl->it_page_shift;
> +		gpa = tce & ~(TCE_PCI_READ | TCE_PCI_WRITE);
> +		dir = iommu_tce_direction(tce);
> +
> +		if (dir == DMA_NONE) {
> +			if (iommu_tce_clear_param_check(stit->tbl, ioba, 0, 1))
> +				return H_PARAMETER;
> +		} else {
> +			if (iommu_tce_put_param_check(stit->tbl, ioba, gpa))
> +				return H_PARAMETER;
> +		}
> +
> +		list_for_each_entry_lockless(stit, &stt->iommu_tables, next) {
> +			if (dir == DMA_NONE)
> +				ret = kvmppc_rm_tce_iommu_unmap(vcpu->kvm,
> +						stit->tbl, entry);
> +			else
> +				ret = kvmppc_rm_tce_iommu_map(vcpu, stit->tbl,
> +						entry, gpa, dir);
> +			if (ret != H_SUCCESS)
> +				return ret;
> +		}
> +	}
> +
>  	kvmppc_tce_put(stt, ioba >> stt->page_shift, tce);
>  
>  	return H_SUCCESS;
> @@ -237,9 +357,10 @@ long kvmppc_rm_h_put_tce_indirect(struct kvm_vcpu *vcpu,
>  {
>  	struct kvmppc_spapr_tce_table *stt;
>  	long i, ret = H_SUCCESS;
> -	unsigned long tces, entry, tce, ua = 0;
> +	unsigned long tces, entry, gpa, tce, ua = 0;
>  	unsigned long *rmap = NULL;
>  	bool prereg = false;
> +	struct kvmppc_spapr_tce_iommu_table *stit;
>  
>  	stt = kvmppc_find_table(vcpu->kvm, liobn);
>  	if (!stt)
> @@ -303,17 +424,45 @@ long kvmppc_rm_h_put_tce_indirect(struct kvm_vcpu *vcpu,
>  		}
>  	}
>  
> +	stit = list_first_entry_or_null(&stt->iommu_tables,
> +			struct kvmppc_spapr_tce_iommu_table, next);
> +
>  	for (i = 0; i < npages; ++i) {
>  		tce = be64_to_cpu(((u64 *)tces)[i]);
>  
>  		ret = kvmppc_tce_validate(stt, tce);
>  		if (ret != H_SUCCESS)
>  			goto unlock_exit;
> +
> +		if (stit) {
> +			gpa = tce & ~(TCE_PCI_READ | TCE_PCI_WRITE);
> +			ret = iommu_tce_put_param_check(stit->tbl,
> +					ioba + (i << stit->tbl->it_page_shift),
> +					gpa);
> +			if (ret != H_SUCCESS)
> +				goto unlock_exit;
> +
> +		}
>  	}
>  
>  	for (i = 0; i < npages; ++i) {
>  		tce = be64_to_cpu(((u64 *)tces)[i]);

As noted in the earlier patch this is really dangerous - by reloading
the tce from userspace you've thrown away the verification above.

> +		if (stit) {
> +			for (i = 0; i < npages; ++i) {
> +				gpa = tce & ~(TCE_PCI_READ | TCE_PCI_WRITE);
> +
> +				list_for_each_entry_lockless(stit,
> +						&stt->iommu_tables, next) {
> +					ret = kvmppc_rm_tce_iommu_map(vcpu,
> +						stit->tbl, entry + i, gpa,
> +						iommu_tce_direction(tce));
> +					if (ret != H_SUCCESS)
> +						goto unlock_exit;
> +				}
> +			}
> +		}
> +
>  		kvmppc_tce_put(stt, entry + i, tce);
>  	}
>  
> @@ -330,6 +479,8 @@ long kvmppc_rm_h_stuff_tce(struct kvm_vcpu *vcpu,
>  {
>  	struct kvmppc_spapr_tce_table *stt;
>  	long i, ret;
> +	struct kvmppc_spapr_tce_iommu_table *stit;
> +
>  
>  	stt = kvmppc_find_table(vcpu->kvm, liobn);
>  	if (!stt)
> @@ -343,6 +494,25 @@ long kvmppc_rm_h_stuff_tce(struct kvm_vcpu *vcpu,
>  	if (tce_value & (TCE_PCI_WRITE | TCE_PCI_READ))
>  		return H_PARAMETER;
>  
> +	stit = list_first_entry_or_null(&stt->iommu_tables,
> +			struct kvmppc_spapr_tce_iommu_table, next);
> +	if (stit) {
> +		if (iommu_tce_clear_param_check(stit->tbl, ioba,
> +					tce_value, npages))
> +			return H_PARAMETER;
> +
> +		list_for_each_entry_lockless(stit, &stt->iommu_tables, next) {
> +			unsigned long entry = ioba >> stit->tbl->it_page_shift;
> +
> +			for (i = 0; i < npages; ++i) {
> +				ret = kvmppc_rm_tce_iommu_unmap(vcpu->kvm,
> +						stit->tbl, entry + i);
> +				if (ret)
> +					return ret;
> +			}
> +		}
> +	}
> +
>  	for (i = 0; i < npages; ++i, ioba += (1ULL << stt->page_shift))
>  		kvmppc_tce_put(stt, ioba >> stt->page_shift, tce_value);
>  
> diff --git a/arch/powerpc/kvm/powerpc.c b/arch/powerpc/kvm/powerpc.c
> index cd892dec7cb6..f3127dc87912 100644
> --- a/arch/powerpc/kvm/powerpc.c
> +++ b/arch/powerpc/kvm/powerpc.c
> @@ -536,6 +536,8 @@ int kvm_vm_ioctl_check_extension(struct kvm *kvm, long ext)
>  #ifdef CONFIG_PPC_BOOK3S_64
>  	case KVM_CAP_SPAPR_TCE:
>  	case KVM_CAP_SPAPR_TCE_64:
> +		/* fallthrough */

I'm not sure why this one should get a fallthrough comment, when none
of the other cases do.

> +	case KVM_CAP_SPAPR_TCE_VFIO:
>  	case KVM_CAP_PPC_RTAS:
>  	case KVM_CAP_PPC_FIXUP_HCALL:
>  	case KVM_CAP_PPC_ENABLE_HCALL:
> diff --git a/virt/kvm/vfio.c b/virt/kvm/vfio.c
> index d32f239eb471..2b7dc22265fe 100644
> --- a/virt/kvm/vfio.c
> +++ b/virt/kvm/vfio.c
> @@ -20,6 +20,10 @@
>  #include <linux/vfio.h>
>  #include "vfio.h"
>  
> +#ifdef CONFIG_SPAPR_TCE_IOMMU
> +#include <asm/kvm_ppc.h>
> +#endif
> +
>  struct kvm_vfio_group {
>  	struct list_head node;
>  	struct vfio_group *vfio_group;
> @@ -211,6 +215,9 @@ static int kvm_vfio_set_group(struct kvm_device *dev, long attr, u64 arg)
>  
>  		mutex_unlock(&kv->lock);
>  
> +#ifdef CONFIG_SPAPR_TCE_IOMMU
> +		kvm_spapr_tce_release_iommu_group(dev->kvm, vfio_group);
> +#endif
>  		kvm_vfio_group_set_kvm(vfio_group, NULL);
>  
>  		kvm_vfio_group_put_external_user(vfio_group);
> @@ -218,6 +225,53 @@ static int kvm_vfio_set_group(struct kvm_device *dev, long attr, u64 arg)
>  		kvm_vfio_update_coherency(dev);
>  
>  		return ret;
> +
> +#ifdef CONFIG_SPAPR_TCE_IOMMU
> +	case KVM_DEV_VFIO_GROUP_SET_SPAPR_TCE: {
> +		struct kvm_vfio_spapr_tce param;
> +		unsigned long minsz;
> +		struct kvm_vfio *kv = dev->private;
> +		struct vfio_group *vfio_group;
> +		struct kvm_vfio_group *kvg;
> +		struct fd f;
> +
> +		minsz = offsetofend(struct kvm_vfio_spapr_tce, tablefd);
> +
> +		if (copy_from_user(&param, (void __user *)arg, minsz))
> +			return -EFAULT;
> +
> +		if (param.argsz < minsz || param.flags)
> +			return -EINVAL;
> +
> +		f = fdget(param.groupfd);
> +		if (!f.file)
> +			return -EBADF;
> +
> +		vfio_group = kvm_vfio_group_get_external_user(f.file);
> +		fdput(f);
> +
> +		if (IS_ERR(vfio_group))
> +			return PTR_ERR(vfio_group);
> +


Is there any particular reason you unwrap the group fd here, but the
table fd inside kvm__spapr_tce_attach_iommu_group()?

> +		ret = -ENOENT;
> +
> +		mutex_lock(&kv->lock);
> +
> +		list_for_each_entry(kvg, &kv->group_list, node) {
> +			if (kvg->vfio_group != vfio_group)
> +				continue;
> +
> +			ret = kvm_spapr_tce_attach_iommu_group(dev->kvm,
> +					param.tablefd, vfio_group);
> +
> +			break;
> +		}
> +
> +		mutex_unlock(&kv->lock);
> +
> +		return ret;
> +	}
> +#endif /* CONFIG_SPAPR_TCE_IOMMU */
>  	}
>  
>  	return -ENXIO;
> @@ -242,6 +296,9 @@ static int kvm_vfio_has_attr(struct kvm_device *dev,
>  		switch (attr->attr) {
>  		case KVM_DEV_VFIO_GROUP_ADD:
>  		case KVM_DEV_VFIO_GROUP_DEL:
> +#ifdef CONFIG_SPAPR_TCE_IOMMU
> +		case KVM_DEV_VFIO_GROUP_SET_SPAPR_TCE:
> +#endif
>  			return 0;
>  		}
>  
> @@ -257,6 +314,9 @@ static void kvm_vfio_destroy(struct kvm_device *dev)
>  	struct kvm_vfio_group *kvg, *tmp;
>  
>  	list_for_each_entry_safe(kvg, tmp, &kv->group_list, node) {
> +#ifdef CONFIG_SPAPR_TCE_IOMMU
> +		kvm_spapr_tce_release_iommu_group(dev->kvm, kvg->vfio_group);
> +#endif
>  		kvm_vfio_group_set_kvm(kvg->vfio_group, NULL);
>  		kvm_vfio_group_put_external_user(kvg->vfio_group);
>  		list_del(&kvg->node);

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 819 bytes --]

WARNING: multiple messages have this Message-ID (diff)

From: David Gibson <david@gibson.dropbear.id.au>
To: Alexey Kardashevskiy <aik@ozlabs.ru>
Cc: linuxppc-dev@lists.ozlabs.org,
	Alex Williamson <alex.williamson@redhat.com>,
	Paul Mackerras <paulus@samba.org>,
	kvm-ppc@vger.kernel.org, kvm@vger.kernel.org
Subject: Re: [PATCH kernel v4 10/10] KVM: PPC: VFIO: Add in-kernel acceleration for VFIO
Date: Thu, 9 Feb 2017 17:41:12 +1100	[thread overview]
Message-ID: <20170209064112.GD14524@umbus> (raw)
In-Reply-To: <20170207071711.28938-11-aik@ozlabs.ru>

[-- Attachment #1: Type: text/plain, Size: 32625 bytes --]

On Tue, Feb 07, 2017 at 06:17:11PM +1100, Alexey Kardashevskiy wrote:
> This allows the host kernel to handle H_PUT_TCE, H_PUT_TCE_INDIRECT
> and H_STUFF_TCE requests targeted an IOMMU TCE table used for VFIO
> without passing them to user space which saves time on switching
> to user space and back.
> 
> This adds H_PUT_TCE/H_PUT_TCE_INDIRECT/H_STUFF_TCE handlers to KVM.
> KVM tries to handle a TCE request in the real mode, if failed
> it passes the request to the virtual mode to complete the operation.
> If it a virtual mode handler fails, the request is passed to
> the user space; this is not expected to happen though.
> 
> To avoid dealing with page use counters (which is tricky in real mode),
> this only accelerates SPAPR TCE IOMMU v2 clients which are required
> to pre-register the userspace memory. The very first TCE request will
> be handled in the VFIO SPAPR TCE driver anyway as the userspace view
> of the TCE table (iommu_table::it_userspace) is not allocated till
> the very first mapping happens and we cannot call vmalloc in real mode.
> 
> This adds new attribute - KVM_DEV_VFIO_GROUP_SET_SPAPR_TCE - to
> the VFIO KVM device. It takes a VFIO group fd and SPAPR TCE table fd
> and associates a physical IOMMU table with the SPAPR TCE table (which
> is a guest view of the hardware IOMMU table). The iommu_table object
> is cached and referenced so we do not have to look up for it in real mode.
> 
> This does not implement the UNSET counterpart as there is no use for it -
> once the acceleration is enabled, the existing userspace won't
> disable it unless a VFIO container is destroyed; this adds necessary
> cleanup to the KVM_DEV_VFIO_GROUP_DEL handler.
> 
> As this creates a descriptor per IOMMU table-LIOBN couple (called
> kvmppc_spapr_tce_iommu_table), it is possible to have several
> descriptors with the same iommu_table (hardware IOMMU table) attached
> to the same LIOBN; we do not remove duplicates though as
> iommu_table_ops::exchange not just update a TCE entry (which is
> shared among IOMMU groups) but also invalidates the TCE cache
> (one per IOMMU group).
> 
> This advertises the new KVM_CAP_SPAPR_TCE_VFIO capability to the user
> space.
> 
> This finally makes use of vfio_external_user_iommu_id() which was
> introduced quite some time ago and was considered for removal.
> 
> Tests show that this patch increases transmission speed from 220MB/s
> to 750..1020MB/s on 10Gb network (Chelsea CXGB3 10Gb ethernet card).
> 
> Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
> ---
> Changes:
> v4:
> * added note to the commit log about allowing multiple updates of
> the same IOMMU table;
> * instead of checking for if any memory was preregistered, this
> returns H_TOO_HARD if a specific page was not;
> * fixed comments from v3 about error handling in many places;
> * simplified TCE handlers and merged IOMMU parts inline - for example,
> there used to be kvmppc_h_put_tce_iommu(), now it is merged into
> kvmppc_h_put_tce(); this allows to check IOBA boundaries against
> the first attached table only (makes the code simpler);
> 
> v3:
> * simplified not to use VFIO group notifiers
> * reworked cleanup, should be cleaner/simpler now
> 
> v2:
> * reworked to use new VFIO notifiers
> * now same iommu_table may appear in the list several times, to be fixed later
> ---
> 
> This has separate copies of handlers for real and virtual modes as
> in fact H_PUT_TCE and H_STUFF_TCE could share a lot (common helpers
> would take a "realmode" flag) but H_PUT_TCE_INDIRECT uses get_user()
> in virtual mode and direct access in real mode and having a common
> helper for it would make things uglier imho.
> 
> 
> ---
>  Documentation/virtual/kvm/devices/vfio.txt |  22 +-
>  arch/powerpc/include/asm/kvm_host.h        |   8 +
>  arch/powerpc/include/asm/kvm_ppc.h         |   4 +
>  include/uapi/linux/kvm.h                   |   8 +
>  arch/powerpc/kvm/book3s_64_vio.c           | 319 ++++++++++++++++++++++++++++-
>  arch/powerpc/kvm/book3s_64_vio_hv.c        | 172 +++++++++++++++-
>  arch/powerpc/kvm/powerpc.c                 |   2 +
>  virt/kvm/vfio.c                            |  60 ++++++
>  8 files changed, 590 insertions(+), 5 deletions(-)
> 
> diff --git a/Documentation/virtual/kvm/devices/vfio.txt b/Documentation/virtual/kvm/devices/vfio.txt
> index ef51740c67ca..f95d867168ea 100644
> --- a/Documentation/virtual/kvm/devices/vfio.txt
> +++ b/Documentation/virtual/kvm/devices/vfio.txt
> @@ -16,7 +16,25 @@ Groups:
>  
>  KVM_DEV_VFIO_GROUP attributes:
>    KVM_DEV_VFIO_GROUP_ADD: Add a VFIO group to VFIO-KVM device tracking
> +	kvm_device_attr.addr points to an int32_t file descriptor
> +	for the VFIO group.
>    KVM_DEV_VFIO_GROUP_DEL: Remove a VFIO group from VFIO-KVM device tracking
> +	kvm_device_attr.addr points to an int32_t file descriptor
> +	for the VFIO group.
> +  KVM_DEV_VFIO_GROUP_SET_SPAPR_TCE: attaches a guest visible TCE table
> +	allocated by sPAPR KVM.
> +	kvm_device_attr.addr points to a struct:
>  
> -For each, kvm_device_attr.addr points to an int32_t file descriptor
> -for the VFIO group.
> +	struct kvm_vfio_spapr_tce {
> +		__u32	argsz;
> +		__u32	flags;
> +		__s32	groupfd;
> +		__s32	tablefd;
> +	};
> +
> +	where
> +	@argsz is the size of kvm_vfio_spapr_tce_liobn;
> +	@flags are not supported now, must be zero;
> +	@groupfd is a file descriptor for a VFIO group;
> +	@tablefd is a file descriptor for a TCE table allocated via
> +		KVM_CREATE_SPAPR_TCE.
> diff --git a/arch/powerpc/include/asm/kvm_host.h b/arch/powerpc/include/asm/kvm_host.h
> index e59b172666cd..a827006941f8 100644
> --- a/arch/powerpc/include/asm/kvm_host.h
> +++ b/arch/powerpc/include/asm/kvm_host.h
> @@ -191,6 +191,13 @@ struct kvmppc_pginfo {
>  	atomic_t refcnt;
>  };
>  
> +struct kvmppc_spapr_tce_iommu_table {
> +	struct rcu_head rcu;
> +	struct list_head next;
> +	struct vfio_group *group;
> +	struct iommu_table *tbl;
> +};
> +
>  struct kvmppc_spapr_tce_table {
>  	struct list_head list;
>  	struct kvm *kvm;
> @@ -199,6 +206,7 @@ struct kvmppc_spapr_tce_table {
>  	u32 page_shift;
>  	u64 offset;		/* in pages */
>  	u64 size;		/* window size in pages */
> +	struct list_head iommu_tables;
>  	struct page *pages[0];
>  };
>  
> diff --git a/arch/powerpc/include/asm/kvm_ppc.h b/arch/powerpc/include/asm/kvm_ppc.h
> index 37bc9e7e90ba..da1410bd6b36 100644
> --- a/arch/powerpc/include/asm/kvm_ppc.h
> +++ b/arch/powerpc/include/asm/kvm_ppc.h
> @@ -163,6 +163,10 @@ extern long kvmppc_prepare_vrma(struct kvm *kvm,
>  extern void kvmppc_map_vrma(struct kvm_vcpu *vcpu,
>  			struct kvm_memory_slot *memslot, unsigned long porder);
>  extern int kvmppc_pseries_do_hcall(struct kvm_vcpu *vcpu);
> +extern long kvm_spapr_tce_attach_iommu_group(struct kvm *kvm, int tablefd,
> +		struct vfio_group *group);
> +extern void kvm_spapr_tce_release_iommu_group(struct kvm *kvm,
> +		struct vfio_group *group);
>  
>  extern long kvm_vm_ioctl_create_spapr_tce(struct kvm *kvm,
>  				struct kvm_create_spapr_tce_64 *args);
> diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
> index a2c9bb5a0ead..cdfa01169bd2 100644
> --- a/include/uapi/linux/kvm.h
> +++ b/include/uapi/linux/kvm.h
> @@ -1076,6 +1076,7 @@ struct kvm_device_attr {
>  #define  KVM_DEV_VFIO_GROUP			1
>  #define   KVM_DEV_VFIO_GROUP_ADD			1
>  #define   KVM_DEV_VFIO_GROUP_DEL			2
> +#define   KVM_DEV_VFIO_GROUP_SET_SPAPR_TCE		3
>  
>  enum kvm_device_type {
>  	KVM_DEV_TYPE_FSL_MPIC_20	= 1,
> @@ -1097,6 +1098,13 @@ enum kvm_device_type {
>  	KVM_DEV_TYPE_MAX,
>  };
>  
> +struct kvm_vfio_spapr_tce {
> +	__u32	argsz;
> +	__u32	flags;
> +	__s32	groupfd;
> +	__s32	tablefd;
> +};
> +
>  /*
>   * ioctls for VM fds
>   */
> diff --git a/arch/powerpc/kvm/book3s_64_vio.c b/arch/powerpc/kvm/book3s_64_vio.c
> index 9a7b7fca5e84..cb0469151e35 100644
> --- a/arch/powerpc/kvm/book3s_64_vio.c
> +++ b/arch/powerpc/kvm/book3s_64_vio.c
> @@ -27,6 +27,10 @@
>  #include <linux/hugetlb.h>
>  #include <linux/list.h>
>  #include <linux/anon_inodes.h>
> +#include <linux/iommu.h>
> +#include <linux/file.h>
> +#include <linux/vfio.h>
> +#include <linux/module.h>
>  
>  #include <asm/tlbflush.h>
>  #include <asm/kvm_ppc.h>
> @@ -39,6 +43,36 @@
>  #include <asm/udbg.h>
>  #include <asm/iommu.h>
>  #include <asm/tce.h>
> +#include <asm/mmu_context.h>
> +
> +static void kvm_vfio_group_put_external_user(struct vfio_group *vfio_group)
> +{
> +	void (*fn)(struct vfio_group *);
> +
> +	fn = symbol_get(vfio_group_put_external_user);
> +	if (WARN_ON(!fn))
> +		return;
> +
> +	fn(vfio_group);
> +
> +	symbol_put(vfio_group_put_external_user);
> +}
> +
> +static int kvm_vfio_external_user_iommu_id(struct vfio_group *vfio_group)
> +{
> +	int (*fn)(struct vfio_group *);
> +	int ret = -1;
> +
> +	fn = symbol_get(vfio_external_user_iommu_id);
> +	if (!fn)
> +		return ret;
> +
> +	ret = fn(vfio_group);
> +
> +	symbol_put(vfio_external_user_iommu_id);
> +
> +	return ret;
> +}
>  
>  static unsigned long kvmppc_tce_pages(unsigned long iommu_pages)
>  {
> @@ -90,6 +124,123 @@ static long kvmppc_account_memlimit(unsigned long stt_pages, bool inc)
>  	return ret;
>  }
>  
> +static void kvm_spapr_tce_iommu_table_free(struct rcu_head *head)
> +{
> +	struct kvmppc_spapr_tce_iommu_table *stit = container_of(head,
> +			struct kvmppc_spapr_tce_iommu_table, rcu);
> +
> +	iommu_table_put(stit->tbl);
> +	kvm_vfio_group_put_external_user(stit->group);
> +
> +	kfree(stit);
> +}
> +
> +static void kvm_spapr_tce_liobn_release_iommu_group(
> +		struct kvmppc_spapr_tce_table *stt,
> +		struct vfio_group *group)
> +{
> +	struct kvmppc_spapr_tce_iommu_table *stit, *tmp;
> +
> +	list_for_each_entry_safe(stit, tmp, &stt->iommu_tables, next) {
> +		if (group && (stit->group != group))
> +			continue;
> +
> +		list_del_rcu(&stit->next);
> +
> +		call_rcu(&stit->rcu, kvm_spapr_tce_iommu_table_free);
> +	}
> +}
> +
> +extern void kvm_spapr_tce_release_iommu_group(struct kvm *kvm,
> +		struct vfio_group *group)
> +{
> +	struct kvmppc_spapr_tce_table *stt;
> +
> +	list_for_each_entry_rcu(stt, &kvm->arch.spapr_tce_tables, list)
> +		kvm_spapr_tce_liobn_release_iommu_group(stt, group);
> +}
> +
> +extern long kvm_spapr_tce_attach_iommu_group(struct kvm *kvm, int tablefd,
> +		struct vfio_group *group)
> +{
> +	struct kvmppc_spapr_tce_table *stt = NULL;
> +	bool found = false;
> +	struct iommu_table *tbl = NULL;
> +	struct iommu_table_group *table_group;
> +	long i, ret = 0;
> +	struct kvmppc_spapr_tce_iommu_table *stit;
> +	struct fd f;
> +	int group_id;
> +	struct iommu_group *grp;
> +
> +	group_id = kvm_vfio_external_user_iommu_id(group);
> +	grp = iommu_group_get_by_id(group_id);
> +	if (!grp)
> +		return -EFAULT;

EFAULT doesn't look right, that's usually means userspace has give us
a bad address.  What does failure to look up the iommu group by id
mean here?


> +
> +	f = fdget(tablefd);
> +	if (!f.file) {
> +		ret = -EBADF;
> +		goto put_exit;
> +	}
> +
> +	list_for_each_entry_rcu(stt, &kvm->arch.spapr_tce_tables, list) {
> +		if (stt == f.file->private_data) {
> +			found = true;
> +			break;
> +		}
> +	}
> +
> +	fdput(f);
> +
> +	if (!found) {
> +		ret = -ENODEV;

ENODEV doesn't look right either.  That generally means you're trying
to use a device or facility that doesn't exist.  This case just means
you've passed a file handle that either isn't a TCE table at all, or
os one associated with a different VM.  -EINVAL, I guess, overloaded
as it is.

> +		goto put_exit;

Don't you need to put the table fd as well as the iommu group which
you put in that exit path?

> +	}
> +
> +	table_group = iommu_group_get_iommudata(grp);
> +	if (WARN_ON(!table_group)) {
> +		ret = -EFAULT;
> +		goto put_exit;

Again don't you need to put the table fd as well.

> +	}
> +
> +	for (i = 0; i < IOMMU_TABLE_GROUP_MAX_TABLES; ++i) {
> +		struct iommu_table *tbltmp = table_group->tables[i];
> +
> +		if (!tbltmp)
> +			continue;
> +
> +		/*
> +		 * Make sure hardware table parameters are exactly the same;
> +		 * this is used in the TCE handlers where boundary checks
> +		 * use only the first attached table.
> +		 */
> +		if ((tbltmp->it_page_shift == stt->page_shift) &&
> +				(tbltmp->it_offset == stt->offset) &&
> +				(tbltmp->it_size == stt->size)) {
> +			tbl = tbltmp;
> +			break;
> +		}
> +	}
> +	if (!tbl) {
> +		ret = -ENODEV;

Again, ENODEV doesn't seem right.  Here the problem is that the host
hardware constraints don't match the guest hardware constraints.
Hmm.  EIO?  ENOSPC?

> +		goto put_exit;
> +	}
> +
> +	iommu_table_get(tbl);
> +
> +	stit = kzalloc(sizeof(*stit), GFP_KERNEL);
> +	stit->tbl = tbl;
> +	stit->group = group;
> +
> +	list_add_rcu(&stit->next, &stt->iommu_tables);

So if you add the same group to the same liobn multiple times, you'll
get multiple identical entries in this list.

I guess that's mostly harmless... although.. does it allow the user to
force the allocation of arbitrary amounts of kernel memory in that
list?

> +put_exit:
> +	iommu_group_put(grp);
> +
> +	return ret;
> +}
> +
>  static void release_spapr_tce_table(struct rcu_head *head)
>  {
>  	struct kvmppc_spapr_tce_table *stt = container_of(head,
> @@ -132,6 +283,8 @@ static int kvm_spapr_tce_release(struct inode *inode, struct file *filp)
>  
>  	list_del_rcu(&stt->list);
>  
> +	kvm_spapr_tce_liobn_release_iommu_group(stt, NULL /* release all */);
> +
>  	kvm_put_kvm(stt->kvm);
>  
>  	kvmppc_account_memlimit(
> @@ -181,6 +334,7 @@ long kvm_vm_ioctl_create_spapr_tce(struct kvm *kvm,
>  	stt->offset = args->offset;
>  	stt->size = size;
>  	stt->kvm = kvm;
> +	INIT_LIST_HEAD_RCU(&stt->iommu_tables);
>  
>  	for (i = 0; i < npages; i++) {
>  		stt->pages[i] = alloc_page(GFP_KERNEL | __GFP_ZERO);
> @@ -209,11 +363,94 @@ long kvm_vm_ioctl_create_spapr_tce(struct kvm *kvm,
>  	return ret;
>  }
>  
> +static long kvmppc_tce_iommu_mapped_dec(struct kvm *kvm,
> +		struct iommu_table *tbl, unsigned long entry)
> +{
> +	struct mm_iommu_table_group_mem_t *mem = NULL;
> +	const unsigned long pgsize = 1ULL << tbl->it_page_shift;
> +	unsigned long *pua = IOMMU_TABLE_USERSPACE_ENTRY(tbl, entry);
> +
> +	if (!pua)
> +		return H_HARDWARE;

What could trigger this error?  Should it be a WARN_ON?

> +	mem = mm_iommu_lookup(kvm->mm, *pua, pgsize);
> +	if (!mem)
> +		return H_TOO_HARD;
> +
> +	mm_iommu_mapped_dec(mem);
> +
> +	*pua = 0;
> +
> +	return H_SUCCESS;
> +}
> +
> +static long kvmppc_tce_iommu_unmap(struct kvm *kvm,
> +		struct iommu_table *tbl, unsigned long entry)
> +{
> +	enum dma_data_direction dir = DMA_NONE;
> +	unsigned long hpa = 0;
> +	long ret;
> +
> +	if (iommu_tce_xchg(tbl, entry, &hpa, &dir))
> +		return H_HARDWARE;
> +
> +	if (dir == DMA_NONE)
> +		return H_SUCCESS;
> +
> +	ret = kvmppc_tce_iommu_mapped_dec(kvm, tbl, entry);
> +	if (ret != H_SUCCESS)
> +		iommu_tce_xchg(tbl, entry, &hpa, &dir);
> +
> +	return ret;
> +}
> +
> +long kvmppc_tce_iommu_map(struct kvm *kvm, struct iommu_table *tbl,
> +		unsigned long entry, unsigned long gpa,
> +		enum dma_data_direction dir)
> +{
> +	long ret;
> +	unsigned long hpa, ua, *pua = IOMMU_TABLE_USERSPACE_ENTRY(tbl, entry);
> +	struct mm_iommu_table_group_mem_t *mem;
> +
> +	if (!pua)
> +		/* it_userspace allocation might be delayed */
> +		return H_TOO_HARD;
> +
> +	if (kvmppc_gpa_to_ua(kvm, gpa, &ua, NULL))
> +		return H_PARAMETER;
> +
> +	mem = mm_iommu_lookup(kvm->mm, ua, 1ULL << tbl->it_page_shift);
> +	if (!mem)
> +		return H_TOO_HARD;
> +
> +	if (mm_iommu_ua_to_hpa(mem, ua, &hpa))
> +		return H_HARDWARE;

IIUC this would happen if qemu had failed to preregister all of guest
RAM, making this indeed an H_HARDWARE.

> +	if (mm_iommu_mapped_inc(mem))
> +		return H_HARDWARE;

I'm less clear on when this one would happen.

> +
> +	ret = iommu_tce_xchg(tbl, entry, &hpa, &dir);
> +	if (ret) {
> +		mm_iommu_mapped_dec(mem);
> +		return H_TOO_HARD;
> +	}
> +
> +	if (dir != DMA_NONE)
> +		kvmppc_tce_iommu_mapped_dec(kvm, tbl, entry);
> +
> +	*pua = ua;
> +
> +	return 0;
> +}
> +
>  long kvmppc_h_put_tce(struct kvm_vcpu *vcpu, unsigned long liobn,
>  		      unsigned long ioba, unsigned long tce)
>  {
>  	struct kvmppc_spapr_tce_table *stt;
> -	long ret;
> +	long ret, idx;
> +	struct kvmppc_spapr_tce_iommu_table *stit;
> +	unsigned long entry, gpa;
> +	enum dma_data_direction dir;
>  
>  	/* udbg_printf("H_PUT_TCE(): liobn=0x%lx ioba=0x%lx, tce=0x%lx\n", */
>  	/* 	    liobn, ioba, tce); */
> @@ -230,6 +467,36 @@ long kvmppc_h_put_tce(struct kvm_vcpu *vcpu, unsigned long liobn,
>  	if (ret != H_SUCCESS)
>  		return ret;
>  
> +	stit = list_first_entry_or_null(&stt->iommu_tables,
> +			struct kvmppc_spapr_tce_iommu_table, next);
> +	if (stit) {
> +		entry = ioba >> stit->tbl->it_page_shift;
> +		gpa = tce & ~(TCE_PCI_READ | TCE_PCI_WRITE);
> +		dir = iommu_tce_direction(tce);
> +
> +		if (dir == DMA_NONE) {
> +			if (iommu_tce_clear_param_check(stit->tbl, ioba, 0, 1))
> +				return H_PARAMETER;
> +		} else {
> +			if (iommu_tce_put_param_check(stit->tbl, ioba, gpa))

Any way you could make these param check functions based on stt
instead of stit->tbl?  That would let you do them before checking if
there are any hw tables to update, avaoiding the somewhat awkward
	if (at least one)
		for (each one)
construct.

> +				return H_PARAMETER;
> +		}
> +
> +		list_for_each_entry_lockless(stit, &stt->iommu_tables, next) {
> +			if (dir == DMA_NONE) {
> +				ret = kvmppc_tce_iommu_unmap(vcpu->kvm,
> +						stit->tbl, entry);
> +			} else {
> +				idx = srcu_read_lock(&vcpu->kvm->srcu);
> +				ret = kvmppc_tce_iommu_map(vcpu->kvm, stit->tbl,
> +						entry, gpa, dir);
> +				srcu_read_unlock(&vcpu->kvm->srcu, idx);
> +			}
> +			if (ret != H_SUCCESS)
> +				return ret;

Doesn't this error path need to clean up for the case where you
managed to update some backing TCE tables, but then failed later ones?

> +		}
> +	}
> +
>  	kvmppc_tce_put(stt, ioba >> stt->page_shift, tce);
>  
>  	return H_SUCCESS;
> @@ -242,9 +509,10 @@ long kvmppc_h_put_tce_indirect(struct kvm_vcpu *vcpu,
>  {
>  	struct kvmppc_spapr_tce_table *stt;
>  	long i, ret = H_SUCCESS, idx;
> -	unsigned long entry, ua = 0;
> +	unsigned long entry, gpa, ua = 0;
>  	u64 __user *tces;
>  	u64 tce;
> +	struct kvmppc_spapr_tce_iommu_table *stit;
>  
>  	stt = kvmppc_find_table(vcpu->kvm, liobn);
>  	if (!stt)
> @@ -272,6 +540,9 @@ long kvmppc_h_put_tce_indirect(struct kvm_vcpu *vcpu,
>  	}
>  	tces = (u64 __user *) ua;
>  
> +	stit = list_first_entry_or_null(&stt->iommu_tables,
> +			struct kvmppc_spapr_tce_iommu_table, next);
> +
>  	for (i = 0; i < npages; ++i) {
>  		if (get_user(tce, tces + i)) {
>  			ret = H_TOO_HARD;
> @@ -282,6 +553,15 @@ long kvmppc_h_put_tce_indirect(struct kvm_vcpu *vcpu,
>  		ret = kvmppc_tce_validate(stt, tce);
>  		if (ret != H_SUCCESS)
>  			goto unlock_exit;
> +
> +		if (stit) {
> +			gpa = tce & ~(TCE_PCI_READ | TCE_PCI_WRITE);
> +			ret = iommu_tce_put_param_check(stit->tbl,
> +					ioba + (i << stit->tbl->it_page_shift),
> +					gpa);
> +			if (ret != H_SUCCESS)
> +				goto unlock_exit;
> +		}
>  	}
>  
>  	for (i = 0; i < npages; ++i) {
> @@ -291,6 +571,21 @@ long kvmppc_h_put_tce_indirect(struct kvm_vcpu *vcpu,
>  		}
>  		tce = be64_to_cpu(tce);
>  
> +		if (stit) {
> +			for (i = 0; i < npages; ++i) {
> +				gpa = tce & ~(TCE_PCI_READ | TCE_PCI_WRITE);
> +
> +				list_for_each_entry_lockless(stit,
> +						&stt->iommu_tables, next) {
> +					ret = kvmppc_tce_iommu_map(vcpu->kvm,
> +						stit->tbl, entry + i, gpa,
> +						iommu_tce_direction(tce));
> +					if (ret != H_SUCCESS)
> +						goto unlock_exit;
> +				}

Um.. what value will this for_each leave in stit after completion?  I
suspect it will be something bogus, which means re-using stit in the
next 0..npages loop iteration won't be safe (you only initialize stit
with the first entry outside that loop).

> +			}
> +		}
> +
>  		kvmppc_tce_put(stt, entry + i, tce);
>  	}
>  
> @@ -307,6 +602,7 @@ long kvmppc_h_stuff_tce(struct kvm_vcpu *vcpu,
>  {
>  	struct kvmppc_spapr_tce_table *stt;
>  	long i, ret;
> +	struct kvmppc_spapr_tce_iommu_table *stit;
>  
>  	stt = kvmppc_find_table(vcpu->kvm, liobn);
>  	if (!stt)
> @@ -320,6 +616,25 @@ long kvmppc_h_stuff_tce(struct kvm_vcpu *vcpu,
>  	if (tce_value & (TCE_PCI_WRITE | TCE_PCI_READ))
>  		return H_PARAMETER;
>  
> +	stit = list_first_entry_or_null(&stt->iommu_tables,
> +			struct kvmppc_spapr_tce_iommu_table, next);
> +	if (stit) {
> +		if (iommu_tce_clear_param_check(stit->tbl, ioba,
> +					tce_value, npages))
> +			return H_PARAMETER;
> +
> +		list_for_each_entry_lockless(stit, &stt->iommu_tables, next) {
> +			unsigned long entry = ioba >> stit->tbl->it_page_shift;
> +
> +			for (i = 0; i < npages; ++i) {
> +				ret = kvmppc_tce_iommu_unmap(vcpu->kvm,
> +						stit->tbl, entry + i);
> +				if (ret)
> +					return ret;

Again do you need some sort of cleanup for partial completion?


> +			}
> +		}
> +	}
> +
>  	for (i = 0; i < npages; ++i, ioba += (1ULL << stt->page_shift))
>  		kvmppc_tce_put(stt, ioba >> stt->page_shift, tce_value);
>  
> diff --git a/arch/powerpc/kvm/book3s_64_vio_hv.c b/arch/powerpc/kvm/book3s_64_vio_hv.c
> index dc1c66fda941..018c7d94a575 100644
> --- a/arch/powerpc/kvm/book3s_64_vio_hv.c
> +++ b/arch/powerpc/kvm/book3s_64_vio_hv.c
> @@ -178,11 +178,104 @@ long kvmppc_gpa_to_ua(struct kvm *kvm, unsigned long gpa,
>  EXPORT_SYMBOL_GPL(kvmppc_gpa_to_ua);
>  
>  #ifdef CONFIG_KVM_BOOK3S_HV_POSSIBLE
> +static long kvmppc_rm_tce_iommu_mapped_dec(struct kvm *kvm,
> +		struct iommu_table *tbl, unsigned long entry)
> +{
> +	struct mm_iommu_table_group_mem_t *mem = NULL;
> +	const unsigned long pgsize = 1ULL << tbl->it_page_shift;
> +	unsigned long *pua = IOMMU_TABLE_USERSPACE_ENTRY(tbl, entry);
> +
> +	if (!pua)
> +		return H_HARDWARE;
> +
> +	pua = (void *) vmalloc_to_phys(pua);
> +	if (!pua)
> +		return H_TOO_HARD;
> +
> +	mem = mm_iommu_lookup_rm(kvm->mm, *pua, pgsize);
> +	if (!mem)
> +		return H_TOO_HARD;
> +
> +	mm_iommu_mapped_dec(mem);
> +
> +	*pua = 0;
> +
> +	return H_SUCCESS;
> +}
> +
> +static long kvmppc_rm_tce_iommu_unmap(struct kvm *kvm,
> +		struct iommu_table *tbl, unsigned long entry)
> +{
> +	enum dma_data_direction dir = DMA_NONE;
> +	unsigned long hpa = 0;
> +	long ret;
> +
> +	if (iommu_tce_xchg_rm(tbl, entry, &hpa, &dir))
> +		return H_HARDWARE;
> +
> +	if (dir == DMA_NONE)
> +		return H_SUCCESS;
> +
> +	ret = kvmppc_rm_tce_iommu_mapped_dec(kvm, tbl, entry);
> +	if (ret)
> +		iommu_tce_xchg_rm(tbl, entry, &hpa, &dir);
> +
> +	return ret;
> +}
> +
> +long kvmppc_rm_tce_iommu_map(struct kvm_vcpu *vcpu, struct iommu_table *tbl,
> +		unsigned long entry, unsigned long gpa,
> +		enum dma_data_direction dir)
> +{
> +	long ret;
> +	unsigned long hpa = 0, ua;
> +	unsigned long *pua = IOMMU_TABLE_USERSPACE_ENTRY(tbl, entry);
> +	struct mm_iommu_table_group_mem_t *mem;
> +
> +	if (!pua)
> +		/* it_userspace allocation might be delayed */
> +		return H_TOO_HARD;
> +
> +	if (kvmppc_gpa_to_ua(vcpu->kvm, gpa, &ua, NULL))
> +		return H_PARAMETER;
> +
> +	mem = mm_iommu_lookup_rm(vcpu->kvm->mm, ua, 1ULL << tbl->it_page_shift);
> +	if (!mem)
> +		return H_TOO_HARD;
> +
> +	if (mm_iommu_ua_to_hpa_rm(mem, ua, &hpa))
> +		return H_HARDWARE;
> +
> +	pua = (void *) vmalloc_to_phys(pua);
> +	if (!pua)
> +		return H_HARDWARE;

What circumstances can this fail under?  Does it need to be H_TOO_HARD instead?

> +
> +	if (mm_iommu_mapped_inc(mem))
> +		return H_HARDWARE;
> +
> +	ret = iommu_tce_xchg_rm(tbl, entry, &hpa, &dir);
> +	if (ret) {
> +		mm_iommu_mapped_dec(mem);
> +		return H_TOO_HARD;
> +	}
> +
> +	if (dir != DMA_NONE)
> +		kvmppc_rm_tce_iommu_mapped_dec(vcpu->kvm, tbl, entry);
> +
> +	*pua = ua;
> +
> +	return 0;
> +}
> +EXPORT_SYMBOL_GPL(kvmppc_rm_tce_iommu_map);
> +
>  long kvmppc_rm_h_put_tce(struct kvm_vcpu *vcpu, unsigned long liobn,
>  		unsigned long ioba, unsigned long tce)
>  {
>  	struct kvmppc_spapr_tce_table *stt;
>  	long ret;
> +	struct kvmppc_spapr_tce_iommu_table *stit;
> +	unsigned long entry, gpa;
> +	enum dma_data_direction dir;
>  
>  	/* udbg_printf("H_PUT_TCE(): liobn=0x%lx ioba=0x%lx, tce=0x%lx\n", */
>  	/* 	    liobn, ioba, tce); */
> @@ -199,6 +292,33 @@ long kvmppc_rm_h_put_tce(struct kvm_vcpu *vcpu, unsigned long liobn,
>  	if (ret != H_SUCCESS)
>  		return ret;
>  
> +	stit = list_first_entry_or_null(&stt->iommu_tables,
> +			struct kvmppc_spapr_tce_iommu_table, next);
> +	if (stit) {
> +		entry = ioba >> stit->tbl->it_page_shift;
> +		gpa = tce & ~(TCE_PCI_READ | TCE_PCI_WRITE);
> +		dir = iommu_tce_direction(tce);
> +
> +		if (dir == DMA_NONE) {
> +			if (iommu_tce_clear_param_check(stit->tbl, ioba, 0, 1))
> +				return H_PARAMETER;
> +		} else {
> +			if (iommu_tce_put_param_check(stit->tbl, ioba, gpa))
> +				return H_PARAMETER;
> +		}
> +
> +		list_for_each_entry_lockless(stit, &stt->iommu_tables, next) {
> +			if (dir == DMA_NONE)
> +				ret = kvmppc_rm_tce_iommu_unmap(vcpu->kvm,
> +						stit->tbl, entry);
> +			else
> +				ret = kvmppc_rm_tce_iommu_map(vcpu, stit->tbl,
> +						entry, gpa, dir);
> +			if (ret != H_SUCCESS)
> +				return ret;
> +		}
> +	}
> +
>  	kvmppc_tce_put(stt, ioba >> stt->page_shift, tce);
>  
>  	return H_SUCCESS;
> @@ -237,9 +357,10 @@ long kvmppc_rm_h_put_tce_indirect(struct kvm_vcpu *vcpu,
>  {
>  	struct kvmppc_spapr_tce_table *stt;
>  	long i, ret = H_SUCCESS;
> -	unsigned long tces, entry, tce, ua = 0;
> +	unsigned long tces, entry, gpa, tce, ua = 0;
>  	unsigned long *rmap = NULL;
>  	bool prereg = false;
> +	struct kvmppc_spapr_tce_iommu_table *stit;
>  
>  	stt = kvmppc_find_table(vcpu->kvm, liobn);
>  	if (!stt)
> @@ -303,17 +424,45 @@ long kvmppc_rm_h_put_tce_indirect(struct kvm_vcpu *vcpu,
>  		}
>  	}
>  
> +	stit = list_first_entry_or_null(&stt->iommu_tables,
> +			struct kvmppc_spapr_tce_iommu_table, next);
> +
>  	for (i = 0; i < npages; ++i) {
>  		tce = be64_to_cpu(((u64 *)tces)[i]);
>  
>  		ret = kvmppc_tce_validate(stt, tce);
>  		if (ret != H_SUCCESS)
>  			goto unlock_exit;
> +
> +		if (stit) {
> +			gpa = tce & ~(TCE_PCI_READ | TCE_PCI_WRITE);
> +			ret = iommu_tce_put_param_check(stit->tbl,
> +					ioba + (i << stit->tbl->it_page_shift),
> +					gpa);
> +			if (ret != H_SUCCESS)
> +				goto unlock_exit;
> +
> +		}
>  	}
>  
>  	for (i = 0; i < npages; ++i) {
>  		tce = be64_to_cpu(((u64 *)tces)[i]);

As noted in the earlier patch this is really dangerous - by reloading
the tce from userspace you've thrown away the verification above.

> +		if (stit) {
> +			for (i = 0; i < npages; ++i) {
> +				gpa = tce & ~(TCE_PCI_READ | TCE_PCI_WRITE);
> +
> +				list_for_each_entry_lockless(stit,
> +						&stt->iommu_tables, next) {
> +					ret = kvmppc_rm_tce_iommu_map(vcpu,
> +						stit->tbl, entry + i, gpa,
> +						iommu_tce_direction(tce));
> +					if (ret != H_SUCCESS)
> +						goto unlock_exit;
> +				}
> +			}
> +		}
> +
>  		kvmppc_tce_put(stt, entry + i, tce);
>  	}
>  
> @@ -330,6 +479,8 @@ long kvmppc_rm_h_stuff_tce(struct kvm_vcpu *vcpu,
>  {
>  	struct kvmppc_spapr_tce_table *stt;
>  	long i, ret;
> +	struct kvmppc_spapr_tce_iommu_table *stit;
> +
>  
>  	stt = kvmppc_find_table(vcpu->kvm, liobn);
>  	if (!stt)
> @@ -343,6 +494,25 @@ long kvmppc_rm_h_stuff_tce(struct kvm_vcpu *vcpu,
>  	if (tce_value & (TCE_PCI_WRITE | TCE_PCI_READ))
>  		return H_PARAMETER;
>  
> +	stit = list_first_entry_or_null(&stt->iommu_tables,
> +			struct kvmppc_spapr_tce_iommu_table, next);
> +	if (stit) {
> +		if (iommu_tce_clear_param_check(stit->tbl, ioba,
> +					tce_value, npages))
> +			return H_PARAMETER;
> +
> +		list_for_each_entry_lockless(stit, &stt->iommu_tables, next) {
> +			unsigned long entry = ioba >> stit->tbl->it_page_shift;
> +
> +			for (i = 0; i < npages; ++i) {
> +				ret = kvmppc_rm_tce_iommu_unmap(vcpu->kvm,
> +						stit->tbl, entry + i);
> +				if (ret)
> +					return ret;
> +			}
> +		}
> +	}
> +
>  	for (i = 0; i < npages; ++i, ioba += (1ULL << stt->page_shift))
>  		kvmppc_tce_put(stt, ioba >> stt->page_shift, tce_value);
>  
> diff --git a/arch/powerpc/kvm/powerpc.c b/arch/powerpc/kvm/powerpc.c
> index cd892dec7cb6..f3127dc87912 100644
> --- a/arch/powerpc/kvm/powerpc.c
> +++ b/arch/powerpc/kvm/powerpc.c
> @@ -536,6 +536,8 @@ int kvm_vm_ioctl_check_extension(struct kvm *kvm, long ext)
>  #ifdef CONFIG_PPC_BOOK3S_64
>  	case KVM_CAP_SPAPR_TCE:
>  	case KVM_CAP_SPAPR_TCE_64:
> +		/* fallthrough */

I'm not sure why this one should get a fallthrough comment, when none
of the other cases do.

> +	case KVM_CAP_SPAPR_TCE_VFIO:
>  	case KVM_CAP_PPC_RTAS:
>  	case KVM_CAP_PPC_FIXUP_HCALL:
>  	case KVM_CAP_PPC_ENABLE_HCALL:
> diff --git a/virt/kvm/vfio.c b/virt/kvm/vfio.c
> index d32f239eb471..2b7dc22265fe 100644
> --- a/virt/kvm/vfio.c
> +++ b/virt/kvm/vfio.c
> @@ -20,6 +20,10 @@
>  #include <linux/vfio.h>
>  #include "vfio.h"
>  
> +#ifdef CONFIG_SPAPR_TCE_IOMMU
> +#include <asm/kvm_ppc.h>
> +#endif
> +
>  struct kvm_vfio_group {
>  	struct list_head node;
>  	struct vfio_group *vfio_group;
> @@ -211,6 +215,9 @@ static int kvm_vfio_set_group(struct kvm_device *dev, long attr, u64 arg)
>  
>  		mutex_unlock(&kv->lock);
>  
> +#ifdef CONFIG_SPAPR_TCE_IOMMU
> +		kvm_spapr_tce_release_iommu_group(dev->kvm, vfio_group);
> +#endif
>  		kvm_vfio_group_set_kvm(vfio_group, NULL);
>  
>  		kvm_vfio_group_put_external_user(vfio_group);
> @@ -218,6 +225,53 @@ static int kvm_vfio_set_group(struct kvm_device *dev, long attr, u64 arg)
>  		kvm_vfio_update_coherency(dev);
>  
>  		return ret;
> +
> +#ifdef CONFIG_SPAPR_TCE_IOMMU
> +	case KVM_DEV_VFIO_GROUP_SET_SPAPR_TCE: {
> +		struct kvm_vfio_spapr_tce param;
> +		unsigned long minsz;
> +		struct kvm_vfio *kv = dev->private;
> +		struct vfio_group *vfio_group;
> +		struct kvm_vfio_group *kvg;
> +		struct fd f;
> +
> +		minsz = offsetofend(struct kvm_vfio_spapr_tce, tablefd);
> +
> +		if (copy_from_user(&param, (void __user *)arg, minsz))
> +			return -EFAULT;
> +
> +		if (param.argsz < minsz || param.flags)
> +			return -EINVAL;
> +
> +		f = fdget(param.groupfd);
> +		if (!f.file)
> +			return -EBADF;
> +
> +		vfio_group = kvm_vfio_group_get_external_user(f.file);
> +		fdput(f);
> +
> +		if (IS_ERR(vfio_group))
> +			return PTR_ERR(vfio_group);
> +


Is there any particular reason you unwrap the group fd here, but the
table fd inside kvm__spapr_tce_attach_iommu_group()?

> +		ret = -ENOENT;
> +
> +		mutex_lock(&kv->lock);
> +
> +		list_for_each_entry(kvg, &kv->group_list, node) {
> +			if (kvg->vfio_group != vfio_group)
> +				continue;
> +
> +			ret = kvm_spapr_tce_attach_iommu_group(dev->kvm,
> +					param.tablefd, vfio_group);
> +
> +			break;
> +		}
> +
> +		mutex_unlock(&kv->lock);
> +
> +		return ret;
> +	}
> +#endif /* CONFIG_SPAPR_TCE_IOMMU */
>  	}
>  
>  	return -ENXIO;
> @@ -242,6 +296,9 @@ static int kvm_vfio_has_attr(struct kvm_device *dev,
>  		switch (attr->attr) {
>  		case KVM_DEV_VFIO_GROUP_ADD:
>  		case KVM_DEV_VFIO_GROUP_DEL:
> +#ifdef CONFIG_SPAPR_TCE_IOMMU
> +		case KVM_DEV_VFIO_GROUP_SET_SPAPR_TCE:
> +#endif
>  			return 0;
>  		}
>  
> @@ -257,6 +314,9 @@ static void kvm_vfio_destroy(struct kvm_device *dev)
>  	struct kvm_vfio_group *kvg, *tmp;
>  
>  	list_for_each_entry_safe(kvg, tmp, &kv->group_list, node) {
> +#ifdef CONFIG_SPAPR_TCE_IOMMU
> +		kvm_spapr_tce_release_iommu_group(dev->kvm, kvg->vfio_group);
> +#endif
>  		kvm_vfio_group_set_kvm(kvg->vfio_group, NULL);
>  		kvm_vfio_group_put_external_user(kvg->vfio_group);
>  		list_del(&kvg->node);

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 819 bytes --]

WARNING: multiple messages have this Message-ID (diff)

From: David Gibson <david@gibson.dropbear.id.au>
To: Alexey Kardashevskiy <aik@ozlabs.ru>
Cc: Paul Mackerras <paulus@samba.org>,
	Alex Williamson <alex.williamson@redhat.com>,
	linuxppc-dev@lists.ozlabs.org, kvm@vger.kernel.org,
	kvm-ppc@vger.kernel.org
Subject: Re: [PATCH kernel v4 10/10] KVM: PPC: VFIO: Add in-kernel acceleration for VFIO
Date: Thu, 9 Feb 2017 17:41:12 +1100	[thread overview]
Message-ID: <20170209064112.GD14524@umbus> (raw)
In-Reply-To: <20170207071711.28938-11-aik@ozlabs.ru>

[-- Attachment #1: Type: text/plain, Size: 32625 bytes --]

On Tue, Feb 07, 2017 at 06:17:11PM +1100, Alexey Kardashevskiy wrote:
> This allows the host kernel to handle H_PUT_TCE, H_PUT_TCE_INDIRECT
> and H_STUFF_TCE requests targeted an IOMMU TCE table used for VFIO
> without passing them to user space which saves time on switching
> to user space and back.
> 
> This adds H_PUT_TCE/H_PUT_TCE_INDIRECT/H_STUFF_TCE handlers to KVM.
> KVM tries to handle a TCE request in the real mode, if failed
> it passes the request to the virtual mode to complete the operation.
> If it a virtual mode handler fails, the request is passed to
> the user space; this is not expected to happen though.
> 
> To avoid dealing with page use counters (which is tricky in real mode),
> this only accelerates SPAPR TCE IOMMU v2 clients which are required
> to pre-register the userspace memory. The very first TCE request will
> be handled in the VFIO SPAPR TCE driver anyway as the userspace view
> of the TCE table (iommu_table::it_userspace) is not allocated till
> the very first mapping happens and we cannot call vmalloc in real mode.
> 
> This adds new attribute - KVM_DEV_VFIO_GROUP_SET_SPAPR_TCE - to
> the VFIO KVM device. It takes a VFIO group fd and SPAPR TCE table fd
> and associates a physical IOMMU table with the SPAPR TCE table (which
> is a guest view of the hardware IOMMU table). The iommu_table object
> is cached and referenced so we do not have to look up for it in real mode.
> 
> This does not implement the UNSET counterpart as there is no use for it -
> once the acceleration is enabled, the existing userspace won't
> disable it unless a VFIO container is destroyed; this adds necessary
> cleanup to the KVM_DEV_VFIO_GROUP_DEL handler.
> 
> As this creates a descriptor per IOMMU table-LIOBN couple (called
> kvmppc_spapr_tce_iommu_table), it is possible to have several
> descriptors with the same iommu_table (hardware IOMMU table) attached
> to the same LIOBN; we do not remove duplicates though as
> iommu_table_ops::exchange not just update a TCE entry (which is
> shared among IOMMU groups) but also invalidates the TCE cache
> (one per IOMMU group).
> 
> This advertises the new KVM_CAP_SPAPR_TCE_VFIO capability to the user
> space.
> 
> This finally makes use of vfio_external_user_iommu_id() which was
> introduced quite some time ago and was considered for removal.
> 
> Tests show that this patch increases transmission speed from 220MB/s
> to 750..1020MB/s on 10Gb network (Chelsea CXGB3 10Gb ethernet card).
> 
> Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
> ---
> Changes:
> v4:
> * added note to the commit log about allowing multiple updates of
> the same IOMMU table;
> * instead of checking for if any memory was preregistered, this
> returns H_TOO_HARD if a specific page was not;
> * fixed comments from v3 about error handling in many places;
> * simplified TCE handlers and merged IOMMU parts inline - for example,
> there used to be kvmppc_h_put_tce_iommu(), now it is merged into
> kvmppc_h_put_tce(); this allows to check IOBA boundaries against
> the first attached table only (makes the code simpler);
> 
> v3:
> * simplified not to use VFIO group notifiers
> * reworked cleanup, should be cleaner/simpler now
> 
> v2:
> * reworked to use new VFIO notifiers
> * now same iommu_table may appear in the list several times, to be fixed later
> ---
> 
> This has separate copies of handlers for real and virtual modes as
> in fact H_PUT_TCE and H_STUFF_TCE could share a lot (common helpers
> would take a "realmode" flag) but H_PUT_TCE_INDIRECT uses get_user()
> in virtual mode and direct access in real mode and having a common
> helper for it would make things uglier imho.
> 
> 
> ---
>  Documentation/virtual/kvm/devices/vfio.txt |  22 +-
>  arch/powerpc/include/asm/kvm_host.h        |   8 +
>  arch/powerpc/include/asm/kvm_ppc.h         |   4 +
>  include/uapi/linux/kvm.h                   |   8 +
>  arch/powerpc/kvm/book3s_64_vio.c           | 319 ++++++++++++++++++++++++++++-
>  arch/powerpc/kvm/book3s_64_vio_hv.c        | 172 +++++++++++++++-
>  arch/powerpc/kvm/powerpc.c                 |   2 +
>  virt/kvm/vfio.c                            |  60 ++++++
>  8 files changed, 590 insertions(+), 5 deletions(-)
> 
> diff --git a/Documentation/virtual/kvm/devices/vfio.txt b/Documentation/virtual/kvm/devices/vfio.txt
> index ef51740c67ca..f95d867168ea 100644
> --- a/Documentation/virtual/kvm/devices/vfio.txt
> +++ b/Documentation/virtual/kvm/devices/vfio.txt
> @@ -16,7 +16,25 @@ Groups:
>  
>  KVM_DEV_VFIO_GROUP attributes:
>    KVM_DEV_VFIO_GROUP_ADD: Add a VFIO group to VFIO-KVM device tracking
> +	kvm_device_attr.addr points to an int32_t file descriptor
> +	for the VFIO group.
>    KVM_DEV_VFIO_GROUP_DEL: Remove a VFIO group from VFIO-KVM device tracking
> +	kvm_device_attr.addr points to an int32_t file descriptor
> +	for the VFIO group.
> +  KVM_DEV_VFIO_GROUP_SET_SPAPR_TCE: attaches a guest visible TCE table
> +	allocated by sPAPR KVM.
> +	kvm_device_attr.addr points to a struct:
>  
> -For each, kvm_device_attr.addr points to an int32_t file descriptor
> -for the VFIO group.
> +	struct kvm_vfio_spapr_tce {
> +		__u32	argsz;
> +		__u32	flags;
> +		__s32	groupfd;
> +		__s32	tablefd;
> +	};
> +
> +	where
> +	@argsz is the size of kvm_vfio_spapr_tce_liobn;
> +	@flags are not supported now, must be zero;
> +	@groupfd is a file descriptor for a VFIO group;
> +	@tablefd is a file descriptor for a TCE table allocated via
> +		KVM_CREATE_SPAPR_TCE.
> diff --git a/arch/powerpc/include/asm/kvm_host.h b/arch/powerpc/include/asm/kvm_host.h
> index e59b172666cd..a827006941f8 100644
> --- a/arch/powerpc/include/asm/kvm_host.h
> +++ b/arch/powerpc/include/asm/kvm_host.h
> @@ -191,6 +191,13 @@ struct kvmppc_pginfo {
>  	atomic_t refcnt;
>  };
>  
> +struct kvmppc_spapr_tce_iommu_table {
> +	struct rcu_head rcu;
> +	struct list_head next;
> +	struct vfio_group *group;
> +	struct iommu_table *tbl;
> +};
> +
>  struct kvmppc_spapr_tce_table {
>  	struct list_head list;
>  	struct kvm *kvm;
> @@ -199,6 +206,7 @@ struct kvmppc_spapr_tce_table {
>  	u32 page_shift;
>  	u64 offset;		/* in pages */
>  	u64 size;		/* window size in pages */
> +	struct list_head iommu_tables;
>  	struct page *pages[0];
>  };
>  
> diff --git a/arch/powerpc/include/asm/kvm_ppc.h b/arch/powerpc/include/asm/kvm_ppc.h
> index 37bc9e7e90ba..da1410bd6b36 100644
> --- a/arch/powerpc/include/asm/kvm_ppc.h
> +++ b/arch/powerpc/include/asm/kvm_ppc.h
> @@ -163,6 +163,10 @@ extern long kvmppc_prepare_vrma(struct kvm *kvm,
>  extern void kvmppc_map_vrma(struct kvm_vcpu *vcpu,
>  			struct kvm_memory_slot *memslot, unsigned long porder);
>  extern int kvmppc_pseries_do_hcall(struct kvm_vcpu *vcpu);
> +extern long kvm_spapr_tce_attach_iommu_group(struct kvm *kvm, int tablefd,
> +		struct vfio_group *group);
> +extern void kvm_spapr_tce_release_iommu_group(struct kvm *kvm,
> +		struct vfio_group *group);
>  
>  extern long kvm_vm_ioctl_create_spapr_tce(struct kvm *kvm,
>  				struct kvm_create_spapr_tce_64 *args);
> diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
> index a2c9bb5a0ead..cdfa01169bd2 100644
> --- a/include/uapi/linux/kvm.h
> +++ b/include/uapi/linux/kvm.h
> @@ -1076,6 +1076,7 @@ struct kvm_device_attr {
>  #define  KVM_DEV_VFIO_GROUP			1
>  #define   KVM_DEV_VFIO_GROUP_ADD			1
>  #define   KVM_DEV_VFIO_GROUP_DEL			2
> +#define   KVM_DEV_VFIO_GROUP_SET_SPAPR_TCE		3
>  
>  enum kvm_device_type {
>  	KVM_DEV_TYPE_FSL_MPIC_20	= 1,
> @@ -1097,6 +1098,13 @@ enum kvm_device_type {
>  	KVM_DEV_TYPE_MAX,
>  };
>  
> +struct kvm_vfio_spapr_tce {
> +	__u32	argsz;
> +	__u32	flags;
> +	__s32	groupfd;
> +	__s32	tablefd;
> +};
> +
>  /*
>   * ioctls for VM fds
>   */
> diff --git a/arch/powerpc/kvm/book3s_64_vio.c b/arch/powerpc/kvm/book3s_64_vio.c
> index 9a7b7fca5e84..cb0469151e35 100644
> --- a/arch/powerpc/kvm/book3s_64_vio.c
> +++ b/arch/powerpc/kvm/book3s_64_vio.c
> @@ -27,6 +27,10 @@
>  #include <linux/hugetlb.h>
>  #include <linux/list.h>
>  #include <linux/anon_inodes.h>
> +#include <linux/iommu.h>
> +#include <linux/file.h>
> +#include <linux/vfio.h>
> +#include <linux/module.h>
>  
>  #include <asm/tlbflush.h>
>  #include <asm/kvm_ppc.h>
> @@ -39,6 +43,36 @@
>  #include <asm/udbg.h>
>  #include <asm/iommu.h>
>  #include <asm/tce.h>
> +#include <asm/mmu_context.h>
> +
> +static void kvm_vfio_group_put_external_user(struct vfio_group *vfio_group)
> +{
> +	void (*fn)(struct vfio_group *);
> +
> +	fn = symbol_get(vfio_group_put_external_user);
> +	if (WARN_ON(!fn))
> +		return;
> +
> +	fn(vfio_group);
> +
> +	symbol_put(vfio_group_put_external_user);
> +}
> +
> +static int kvm_vfio_external_user_iommu_id(struct vfio_group *vfio_group)
> +{
> +	int (*fn)(struct vfio_group *);
> +	int ret = -1;
> +
> +	fn = symbol_get(vfio_external_user_iommu_id);
> +	if (!fn)
> +		return ret;
> +
> +	ret = fn(vfio_group);
> +
> +	symbol_put(vfio_external_user_iommu_id);
> +
> +	return ret;
> +}
>  
>  static unsigned long kvmppc_tce_pages(unsigned long iommu_pages)
>  {
> @@ -90,6 +124,123 @@ static long kvmppc_account_memlimit(unsigned long stt_pages, bool inc)
>  	return ret;
>  }
>  
> +static void kvm_spapr_tce_iommu_table_free(struct rcu_head *head)
> +{
> +	struct kvmppc_spapr_tce_iommu_table *stit = container_of(head,
> +			struct kvmppc_spapr_tce_iommu_table, rcu);
> +
> +	iommu_table_put(stit->tbl);
> +	kvm_vfio_group_put_external_user(stit->group);
> +
> +	kfree(stit);
> +}
> +
> +static void kvm_spapr_tce_liobn_release_iommu_group(
> +		struct kvmppc_spapr_tce_table *stt,
> +		struct vfio_group *group)
> +{
> +	struct kvmppc_spapr_tce_iommu_table *stit, *tmp;
> +
> +	list_for_each_entry_safe(stit, tmp, &stt->iommu_tables, next) {
> +		if (group && (stit->group != group))
> +			continue;
> +
> +		list_del_rcu(&stit->next);
> +
> +		call_rcu(&stit->rcu, kvm_spapr_tce_iommu_table_free);
> +	}
> +}
> +
> +extern void kvm_spapr_tce_release_iommu_group(struct kvm *kvm,
> +		struct vfio_group *group)
> +{
> +	struct kvmppc_spapr_tce_table *stt;
> +
> +	list_for_each_entry_rcu(stt, &kvm->arch.spapr_tce_tables, list)
> +		kvm_spapr_tce_liobn_release_iommu_group(stt, group);
> +}
> +
> +extern long kvm_spapr_tce_attach_iommu_group(struct kvm *kvm, int tablefd,
> +		struct vfio_group *group)
> +{
> +	struct kvmppc_spapr_tce_table *stt = NULL;
> +	bool found = false;
> +	struct iommu_table *tbl = NULL;
> +	struct iommu_table_group *table_group;
> +	long i, ret = 0;
> +	struct kvmppc_spapr_tce_iommu_table *stit;
> +	struct fd f;
> +	int group_id;
> +	struct iommu_group *grp;
> +
> +	group_id = kvm_vfio_external_user_iommu_id(group);
> +	grp = iommu_group_get_by_id(group_id);
> +	if (!grp)
> +		return -EFAULT;

EFAULT doesn't look right, that's usually means userspace has give us
a bad address.  What does failure to look up the iommu group by id
mean here?


> +
> +	f = fdget(tablefd);
> +	if (!f.file) {
> +		ret = -EBADF;
> +		goto put_exit;
> +	}
> +
> +	list_for_each_entry_rcu(stt, &kvm->arch.spapr_tce_tables, list) {
> +		if (stt == f.file->private_data) {
> +			found = true;
> +			break;
> +		}
> +	}
> +
> +	fdput(f);
> +
> +	if (!found) {
> +		ret = -ENODEV;

ENODEV doesn't look right either.  That generally means you're trying
to use a device or facility that doesn't exist.  This case just means
you've passed a file handle that either isn't a TCE table at all, or
os one associated with a different VM.  -EINVAL, I guess, overloaded
as it is.

> +		goto put_exit;

Don't you need to put the table fd as well as the iommu group which
you put in that exit path?

> +	}
> +
> +	table_group = iommu_group_get_iommudata(grp);
> +	if (WARN_ON(!table_group)) {
> +		ret = -EFAULT;
> +		goto put_exit;

Again don't you need to put the table fd as well.

> +	}
> +
> +	for (i = 0; i < IOMMU_TABLE_GROUP_MAX_TABLES; ++i) {
> +		struct iommu_table *tbltmp = table_group->tables[i];
> +
> +		if (!tbltmp)
> +			continue;
> +
> +		/*
> +		 * Make sure hardware table parameters are exactly the same;
> +		 * this is used in the TCE handlers where boundary checks
> +		 * use only the first attached table.
> +		 */
> +		if ((tbltmp->it_page_shift == stt->page_shift) &&
> +				(tbltmp->it_offset == stt->offset) &&
> +				(tbltmp->it_size == stt->size)) {
> +			tbl = tbltmp;
> +			break;
> +		}
> +	}
> +	if (!tbl) {
> +		ret = -ENODEV;

Again, ENODEV doesn't seem right.  Here the problem is that the host
hardware constraints don't match the guest hardware constraints.
Hmm.  EIO?  ENOSPC?

> +		goto put_exit;
> +	}
> +
> +	iommu_table_get(tbl);
> +
> +	stit = kzalloc(sizeof(*stit), GFP_KERNEL);
> +	stit->tbl = tbl;
> +	stit->group = group;
> +
> +	list_add_rcu(&stit->next, &stt->iommu_tables);

So if you add the same group to the same liobn multiple times, you'll
get multiple identical entries in this list.

I guess that's mostly harmless... although.. does it allow the user to
force the allocation of arbitrary amounts of kernel memory in that
list?

> +put_exit:
> +	iommu_group_put(grp);
> +
> +	return ret;
> +}
> +
>  static void release_spapr_tce_table(struct rcu_head *head)
>  {
>  	struct kvmppc_spapr_tce_table *stt = container_of(head,
> @@ -132,6 +283,8 @@ static int kvm_spapr_tce_release(struct inode *inode, struct file *filp)
>  
>  	list_del_rcu(&stt->list);
>  
> +	kvm_spapr_tce_liobn_release_iommu_group(stt, NULL /* release all */);
> +
>  	kvm_put_kvm(stt->kvm);
>  
>  	kvmppc_account_memlimit(
> @@ -181,6 +334,7 @@ long kvm_vm_ioctl_create_spapr_tce(struct kvm *kvm,
>  	stt->offset = args->offset;
>  	stt->size = size;
>  	stt->kvm = kvm;
> +	INIT_LIST_HEAD_RCU(&stt->iommu_tables);
>  
>  	for (i = 0; i < npages; i++) {
>  		stt->pages[i] = alloc_page(GFP_KERNEL | __GFP_ZERO);
> @@ -209,11 +363,94 @@ long kvm_vm_ioctl_create_spapr_tce(struct kvm *kvm,
>  	return ret;
>  }
>  
> +static long kvmppc_tce_iommu_mapped_dec(struct kvm *kvm,
> +		struct iommu_table *tbl, unsigned long entry)
> +{
> +	struct mm_iommu_table_group_mem_t *mem = NULL;
> +	const unsigned long pgsize = 1ULL << tbl->it_page_shift;
> +	unsigned long *pua = IOMMU_TABLE_USERSPACE_ENTRY(tbl, entry);
> +
> +	if (!pua)
> +		return H_HARDWARE;

What could trigger this error?  Should it be a WARN_ON?

> +	mem = mm_iommu_lookup(kvm->mm, *pua, pgsize);
> +	if (!mem)
> +		return H_TOO_HARD;
> +
> +	mm_iommu_mapped_dec(mem);
> +
> +	*pua = 0;
> +
> +	return H_SUCCESS;
> +}
> +
> +static long kvmppc_tce_iommu_unmap(struct kvm *kvm,
> +		struct iommu_table *tbl, unsigned long entry)
> +{
> +	enum dma_data_direction dir = DMA_NONE;
> +	unsigned long hpa = 0;
> +	long ret;
> +
> +	if (iommu_tce_xchg(tbl, entry, &hpa, &dir))
> +		return H_HARDWARE;
> +
> +	if (dir == DMA_NONE)
> +		return H_SUCCESS;
> +
> +	ret = kvmppc_tce_iommu_mapped_dec(kvm, tbl, entry);
> +	if (ret != H_SUCCESS)
> +		iommu_tce_xchg(tbl, entry, &hpa, &dir);
> +
> +	return ret;
> +}
> +
> +long kvmppc_tce_iommu_map(struct kvm *kvm, struct iommu_table *tbl,
> +		unsigned long entry, unsigned long gpa,
> +		enum dma_data_direction dir)
> +{
> +	long ret;
> +	unsigned long hpa, ua, *pua = IOMMU_TABLE_USERSPACE_ENTRY(tbl, entry);
> +	struct mm_iommu_table_group_mem_t *mem;
> +
> +	if (!pua)
> +		/* it_userspace allocation might be delayed */
> +		return H_TOO_HARD;
> +
> +	if (kvmppc_gpa_to_ua(kvm, gpa, &ua, NULL))
> +		return H_PARAMETER;
> +
> +	mem = mm_iommu_lookup(kvm->mm, ua, 1ULL << tbl->it_page_shift);
> +	if (!mem)
> +		return H_TOO_HARD;
> +
> +	if (mm_iommu_ua_to_hpa(mem, ua, &hpa))
> +		return H_HARDWARE;

IIUC this would happen if qemu had failed to preregister all of guest
RAM, making this indeed an H_HARDWARE.

> +	if (mm_iommu_mapped_inc(mem))
> +		return H_HARDWARE;

I'm less clear on when this one would happen.

> +
> +	ret = iommu_tce_xchg(tbl, entry, &hpa, &dir);
> +	if (ret) {
> +		mm_iommu_mapped_dec(mem);
> +		return H_TOO_HARD;
> +	}
> +
> +	if (dir != DMA_NONE)
> +		kvmppc_tce_iommu_mapped_dec(kvm, tbl, entry);
> +
> +	*pua = ua;
> +
> +	return 0;
> +}
> +
>  long kvmppc_h_put_tce(struct kvm_vcpu *vcpu, unsigned long liobn,
>  		      unsigned long ioba, unsigned long tce)
>  {
>  	struct kvmppc_spapr_tce_table *stt;
> -	long ret;
> +	long ret, idx;
> +	struct kvmppc_spapr_tce_iommu_table *stit;
> +	unsigned long entry, gpa;
> +	enum dma_data_direction dir;
>  
>  	/* udbg_printf("H_PUT_TCE(): liobn=0x%lx ioba=0x%lx, tce=0x%lx\n", */
>  	/* 	    liobn, ioba, tce); */
> @@ -230,6 +467,36 @@ long kvmppc_h_put_tce(struct kvm_vcpu *vcpu, unsigned long liobn,
>  	if (ret != H_SUCCESS)
>  		return ret;
>  
> +	stit = list_first_entry_or_null(&stt->iommu_tables,
> +			struct kvmppc_spapr_tce_iommu_table, next);
> +	if (stit) {
> +		entry = ioba >> stit->tbl->it_page_shift;
> +		gpa = tce & ~(TCE_PCI_READ | TCE_PCI_WRITE);
> +		dir = iommu_tce_direction(tce);
> +
> +		if (dir == DMA_NONE) {
> +			if (iommu_tce_clear_param_check(stit->tbl, ioba, 0, 1))
> +				return H_PARAMETER;
> +		} else {
> +			if (iommu_tce_put_param_check(stit->tbl, ioba, gpa))

Any way you could make these param check functions based on stt
instead of stit->tbl?  That would let you do them before checking if
there are any hw tables to update, avaoiding the somewhat awkward
	if (at least one)
		for (each one)
construct.

> +				return H_PARAMETER;
> +		}
> +
> +		list_for_each_entry_lockless(stit, &stt->iommu_tables, next) {
> +			if (dir == DMA_NONE) {
> +				ret = kvmppc_tce_iommu_unmap(vcpu->kvm,
> +						stit->tbl, entry);
> +			} else {
> +				idx = srcu_read_lock(&vcpu->kvm->srcu);
> +				ret = kvmppc_tce_iommu_map(vcpu->kvm, stit->tbl,
> +						entry, gpa, dir);
> +				srcu_read_unlock(&vcpu->kvm->srcu, idx);
> +			}
> +			if (ret != H_SUCCESS)
> +				return ret;

Doesn't this error path need to clean up for the case where you
managed to update some backing TCE tables, but then failed later ones?

> +		}
> +	}
> +
>  	kvmppc_tce_put(stt, ioba >> stt->page_shift, tce);
>  
>  	return H_SUCCESS;
> @@ -242,9 +509,10 @@ long kvmppc_h_put_tce_indirect(struct kvm_vcpu *vcpu,
>  {
>  	struct kvmppc_spapr_tce_table *stt;
>  	long i, ret = H_SUCCESS, idx;
> -	unsigned long entry, ua = 0;
> +	unsigned long entry, gpa, ua = 0;
>  	u64 __user *tces;
>  	u64 tce;
> +	struct kvmppc_spapr_tce_iommu_table *stit;
>  
>  	stt = kvmppc_find_table(vcpu->kvm, liobn);
>  	if (!stt)
> @@ -272,6 +540,9 @@ long kvmppc_h_put_tce_indirect(struct kvm_vcpu *vcpu,
>  	}
>  	tces = (u64 __user *) ua;
>  
> +	stit = list_first_entry_or_null(&stt->iommu_tables,
> +			struct kvmppc_spapr_tce_iommu_table, next);
> +
>  	for (i = 0; i < npages; ++i) {
>  		if (get_user(tce, tces + i)) {
>  			ret = H_TOO_HARD;
> @@ -282,6 +553,15 @@ long kvmppc_h_put_tce_indirect(struct kvm_vcpu *vcpu,
>  		ret = kvmppc_tce_validate(stt, tce);
>  		if (ret != H_SUCCESS)
>  			goto unlock_exit;
> +
> +		if (stit) {
> +			gpa = tce & ~(TCE_PCI_READ | TCE_PCI_WRITE);
> +			ret = iommu_tce_put_param_check(stit->tbl,
> +					ioba + (i << stit->tbl->it_page_shift),
> +					gpa);
> +			if (ret != H_SUCCESS)
> +				goto unlock_exit;
> +		}
>  	}
>  
>  	for (i = 0; i < npages; ++i) {
> @@ -291,6 +571,21 @@ long kvmppc_h_put_tce_indirect(struct kvm_vcpu *vcpu,
>  		}
>  		tce = be64_to_cpu(tce);
>  
> +		if (stit) {
> +			for (i = 0; i < npages; ++i) {
> +				gpa = tce & ~(TCE_PCI_READ | TCE_PCI_WRITE);
> +
> +				list_for_each_entry_lockless(stit,
> +						&stt->iommu_tables, next) {
> +					ret = kvmppc_tce_iommu_map(vcpu->kvm,
> +						stit->tbl, entry + i, gpa,
> +						iommu_tce_direction(tce));
> +					if (ret != H_SUCCESS)
> +						goto unlock_exit;
> +				}

Um.. what value will this for_each leave in stit after completion?  I
suspect it will be something bogus, which means re-using stit in the
next 0..npages loop iteration won't be safe (you only initialize stit
with the first entry outside that loop).

> +			}
> +		}
> +
>  		kvmppc_tce_put(stt, entry + i, tce);
>  	}
>  
> @@ -307,6 +602,7 @@ long kvmppc_h_stuff_tce(struct kvm_vcpu *vcpu,
>  {
>  	struct kvmppc_spapr_tce_table *stt;
>  	long i, ret;
> +	struct kvmppc_spapr_tce_iommu_table *stit;
>  
>  	stt = kvmppc_find_table(vcpu->kvm, liobn);
>  	if (!stt)
> @@ -320,6 +616,25 @@ long kvmppc_h_stuff_tce(struct kvm_vcpu *vcpu,
>  	if (tce_value & (TCE_PCI_WRITE | TCE_PCI_READ))
>  		return H_PARAMETER;
>  
> +	stit = list_first_entry_or_null(&stt->iommu_tables,
> +			struct kvmppc_spapr_tce_iommu_table, next);
> +	if (stit) {
> +		if (iommu_tce_clear_param_check(stit->tbl, ioba,
> +					tce_value, npages))
> +			return H_PARAMETER;
> +
> +		list_for_each_entry_lockless(stit, &stt->iommu_tables, next) {
> +			unsigned long entry = ioba >> stit->tbl->it_page_shift;
> +
> +			for (i = 0; i < npages; ++i) {
> +				ret = kvmppc_tce_iommu_unmap(vcpu->kvm,
> +						stit->tbl, entry + i);
> +				if (ret)
> +					return ret;

Again do you need some sort of cleanup for partial completion?


> +			}
> +		}
> +	}
> +
>  	for (i = 0; i < npages; ++i, ioba += (1ULL << stt->page_shift))
>  		kvmppc_tce_put(stt, ioba >> stt->page_shift, tce_value);
>  
> diff --git a/arch/powerpc/kvm/book3s_64_vio_hv.c b/arch/powerpc/kvm/book3s_64_vio_hv.c
> index dc1c66fda941..018c7d94a575 100644
> --- a/arch/powerpc/kvm/book3s_64_vio_hv.c
> +++ b/arch/powerpc/kvm/book3s_64_vio_hv.c
> @@ -178,11 +178,104 @@ long kvmppc_gpa_to_ua(struct kvm *kvm, unsigned long gpa,
>  EXPORT_SYMBOL_GPL(kvmppc_gpa_to_ua);
>  
>  #ifdef CONFIG_KVM_BOOK3S_HV_POSSIBLE
> +static long kvmppc_rm_tce_iommu_mapped_dec(struct kvm *kvm,
> +		struct iommu_table *tbl, unsigned long entry)
> +{
> +	struct mm_iommu_table_group_mem_t *mem = NULL;
> +	const unsigned long pgsize = 1ULL << tbl->it_page_shift;
> +	unsigned long *pua = IOMMU_TABLE_USERSPACE_ENTRY(tbl, entry);
> +
> +	if (!pua)
> +		return H_HARDWARE;
> +
> +	pua = (void *) vmalloc_to_phys(pua);
> +	if (!pua)
> +		return H_TOO_HARD;
> +
> +	mem = mm_iommu_lookup_rm(kvm->mm, *pua, pgsize);
> +	if (!mem)
> +		return H_TOO_HARD;
> +
> +	mm_iommu_mapped_dec(mem);
> +
> +	*pua = 0;
> +
> +	return H_SUCCESS;
> +}
> +
> +static long kvmppc_rm_tce_iommu_unmap(struct kvm *kvm,
> +		struct iommu_table *tbl, unsigned long entry)
> +{
> +	enum dma_data_direction dir = DMA_NONE;
> +	unsigned long hpa = 0;
> +	long ret;
> +
> +	if (iommu_tce_xchg_rm(tbl, entry, &hpa, &dir))
> +		return H_HARDWARE;
> +
> +	if (dir == DMA_NONE)
> +		return H_SUCCESS;
> +
> +	ret = kvmppc_rm_tce_iommu_mapped_dec(kvm, tbl, entry);
> +	if (ret)
> +		iommu_tce_xchg_rm(tbl, entry, &hpa, &dir);
> +
> +	return ret;
> +}
> +
> +long kvmppc_rm_tce_iommu_map(struct kvm_vcpu *vcpu, struct iommu_table *tbl,
> +		unsigned long entry, unsigned long gpa,
> +		enum dma_data_direction dir)
> +{
> +	long ret;
> +	unsigned long hpa = 0, ua;
> +	unsigned long *pua = IOMMU_TABLE_USERSPACE_ENTRY(tbl, entry);
> +	struct mm_iommu_table_group_mem_t *mem;
> +
> +	if (!pua)
> +		/* it_userspace allocation might be delayed */
> +		return H_TOO_HARD;
> +
> +	if (kvmppc_gpa_to_ua(vcpu->kvm, gpa, &ua, NULL))
> +		return H_PARAMETER;
> +
> +	mem = mm_iommu_lookup_rm(vcpu->kvm->mm, ua, 1ULL << tbl->it_page_shift);
> +	if (!mem)
> +		return H_TOO_HARD;
> +
> +	if (mm_iommu_ua_to_hpa_rm(mem, ua, &hpa))
> +		return H_HARDWARE;
> +
> +	pua = (void *) vmalloc_to_phys(pua);
> +	if (!pua)
> +		return H_HARDWARE;

What circumstances can this fail under?  Does it need to be H_TOO_HARD instead?

> +
> +	if (mm_iommu_mapped_inc(mem))
> +		return H_HARDWARE;
> +
> +	ret = iommu_tce_xchg_rm(tbl, entry, &hpa, &dir);
> +	if (ret) {
> +		mm_iommu_mapped_dec(mem);
> +		return H_TOO_HARD;
> +	}
> +
> +	if (dir != DMA_NONE)
> +		kvmppc_rm_tce_iommu_mapped_dec(vcpu->kvm, tbl, entry);
> +
> +	*pua = ua;
> +
> +	return 0;
> +}
> +EXPORT_SYMBOL_GPL(kvmppc_rm_tce_iommu_map);
> +
>  long kvmppc_rm_h_put_tce(struct kvm_vcpu *vcpu, unsigned long liobn,
>  		unsigned long ioba, unsigned long tce)
>  {
>  	struct kvmppc_spapr_tce_table *stt;
>  	long ret;
> +	struct kvmppc_spapr_tce_iommu_table *stit;
> +	unsigned long entry, gpa;
> +	enum dma_data_direction dir;
>  
>  	/* udbg_printf("H_PUT_TCE(): liobn=0x%lx ioba=0x%lx, tce=0x%lx\n", */
>  	/* 	    liobn, ioba, tce); */
> @@ -199,6 +292,33 @@ long kvmppc_rm_h_put_tce(struct kvm_vcpu *vcpu, unsigned long liobn,
>  	if (ret != H_SUCCESS)
>  		return ret;
>  
> +	stit = list_first_entry_or_null(&stt->iommu_tables,
> +			struct kvmppc_spapr_tce_iommu_table, next);
> +	if (stit) {
> +		entry = ioba >> stit->tbl->it_page_shift;
> +		gpa = tce & ~(TCE_PCI_READ | TCE_PCI_WRITE);
> +		dir = iommu_tce_direction(tce);
> +
> +		if (dir == DMA_NONE) {
> +			if (iommu_tce_clear_param_check(stit->tbl, ioba, 0, 1))
> +				return H_PARAMETER;
> +		} else {
> +			if (iommu_tce_put_param_check(stit->tbl, ioba, gpa))
> +				return H_PARAMETER;
> +		}
> +
> +		list_for_each_entry_lockless(stit, &stt->iommu_tables, next) {
> +			if (dir == DMA_NONE)
> +				ret = kvmppc_rm_tce_iommu_unmap(vcpu->kvm,
> +						stit->tbl, entry);
> +			else
> +				ret = kvmppc_rm_tce_iommu_map(vcpu, stit->tbl,
> +						entry, gpa, dir);
> +			if (ret != H_SUCCESS)
> +				return ret;
> +		}
> +	}
> +
>  	kvmppc_tce_put(stt, ioba >> stt->page_shift, tce);
>  
>  	return H_SUCCESS;
> @@ -237,9 +357,10 @@ long kvmppc_rm_h_put_tce_indirect(struct kvm_vcpu *vcpu,
>  {
>  	struct kvmppc_spapr_tce_table *stt;
>  	long i, ret = H_SUCCESS;
> -	unsigned long tces, entry, tce, ua = 0;
> +	unsigned long tces, entry, gpa, tce, ua = 0;
>  	unsigned long *rmap = NULL;
>  	bool prereg = false;
> +	struct kvmppc_spapr_tce_iommu_table *stit;
>  
>  	stt = kvmppc_find_table(vcpu->kvm, liobn);
>  	if (!stt)
> @@ -303,17 +424,45 @@ long kvmppc_rm_h_put_tce_indirect(struct kvm_vcpu *vcpu,
>  		}
>  	}
>  
> +	stit = list_first_entry_or_null(&stt->iommu_tables,
> +			struct kvmppc_spapr_tce_iommu_table, next);
> +
>  	for (i = 0; i < npages; ++i) {
>  		tce = be64_to_cpu(((u64 *)tces)[i]);
>  
>  		ret = kvmppc_tce_validate(stt, tce);
>  		if (ret != H_SUCCESS)
>  			goto unlock_exit;
> +
> +		if (stit) {
> +			gpa = tce & ~(TCE_PCI_READ | TCE_PCI_WRITE);
> +			ret = iommu_tce_put_param_check(stit->tbl,
> +					ioba + (i << stit->tbl->it_page_shift),
> +					gpa);
> +			if (ret != H_SUCCESS)
> +				goto unlock_exit;
> +
> +		}
>  	}
>  
>  	for (i = 0; i < npages; ++i) {
>  		tce = be64_to_cpu(((u64 *)tces)[i]);

As noted in the earlier patch this is really dangerous - by reloading
the tce from userspace you've thrown away the verification above.

> +		if (stit) {
> +			for (i = 0; i < npages; ++i) {
> +				gpa = tce & ~(TCE_PCI_READ | TCE_PCI_WRITE);
> +
> +				list_for_each_entry_lockless(stit,
> +						&stt->iommu_tables, next) {
> +					ret = kvmppc_rm_tce_iommu_map(vcpu,
> +						stit->tbl, entry + i, gpa,
> +						iommu_tce_direction(tce));
> +					if (ret != H_SUCCESS)
> +						goto unlock_exit;
> +				}
> +			}
> +		}
> +
>  		kvmppc_tce_put(stt, entry + i, tce);
>  	}
>  
> @@ -330,6 +479,8 @@ long kvmppc_rm_h_stuff_tce(struct kvm_vcpu *vcpu,
>  {
>  	struct kvmppc_spapr_tce_table *stt;
>  	long i, ret;
> +	struct kvmppc_spapr_tce_iommu_table *stit;
> +
>  
>  	stt = kvmppc_find_table(vcpu->kvm, liobn);
>  	if (!stt)
> @@ -343,6 +494,25 @@ long kvmppc_rm_h_stuff_tce(struct kvm_vcpu *vcpu,
>  	if (tce_value & (TCE_PCI_WRITE | TCE_PCI_READ))
>  		return H_PARAMETER;
>  
> +	stit = list_first_entry_or_null(&stt->iommu_tables,
> +			struct kvmppc_spapr_tce_iommu_table, next);
> +	if (stit) {
> +		if (iommu_tce_clear_param_check(stit->tbl, ioba,
> +					tce_value, npages))
> +			return H_PARAMETER;
> +
> +		list_for_each_entry_lockless(stit, &stt->iommu_tables, next) {
> +			unsigned long entry = ioba >> stit->tbl->it_page_shift;
> +
> +			for (i = 0; i < npages; ++i) {
> +				ret = kvmppc_rm_tce_iommu_unmap(vcpu->kvm,
> +						stit->tbl, entry + i);
> +				if (ret)
> +					return ret;
> +			}
> +		}
> +	}
> +
>  	for (i = 0; i < npages; ++i, ioba += (1ULL << stt->page_shift))
>  		kvmppc_tce_put(stt, ioba >> stt->page_shift, tce_value);
>  
> diff --git a/arch/powerpc/kvm/powerpc.c b/arch/powerpc/kvm/powerpc.c
> index cd892dec7cb6..f3127dc87912 100644
> --- a/arch/powerpc/kvm/powerpc.c
> +++ b/arch/powerpc/kvm/powerpc.c
> @@ -536,6 +536,8 @@ int kvm_vm_ioctl_check_extension(struct kvm *kvm, long ext)
>  #ifdef CONFIG_PPC_BOOK3S_64
>  	case KVM_CAP_SPAPR_TCE:
>  	case KVM_CAP_SPAPR_TCE_64:
> +		/* fallthrough */

I'm not sure why this one should get a fallthrough comment, when none
of the other cases do.

> +	case KVM_CAP_SPAPR_TCE_VFIO:
>  	case KVM_CAP_PPC_RTAS:
>  	case KVM_CAP_PPC_FIXUP_HCALL:
>  	case KVM_CAP_PPC_ENABLE_HCALL:
> diff --git a/virt/kvm/vfio.c b/virt/kvm/vfio.c
> index d32f239eb471..2b7dc22265fe 100644
> --- a/virt/kvm/vfio.c
> +++ b/virt/kvm/vfio.c
> @@ -20,6 +20,10 @@
>  #include <linux/vfio.h>
>  #include "vfio.h"
>  
> +#ifdef CONFIG_SPAPR_TCE_IOMMU
> +#include <asm/kvm_ppc.h>
> +#endif
> +
>  struct kvm_vfio_group {
>  	struct list_head node;
>  	struct vfio_group *vfio_group;
> @@ -211,6 +215,9 @@ static int kvm_vfio_set_group(struct kvm_device *dev, long attr, u64 arg)
>  
>  		mutex_unlock(&kv->lock);
>  
> +#ifdef CONFIG_SPAPR_TCE_IOMMU
> +		kvm_spapr_tce_release_iommu_group(dev->kvm, vfio_group);
> +#endif
>  		kvm_vfio_group_set_kvm(vfio_group, NULL);
>  
>  		kvm_vfio_group_put_external_user(vfio_group);
> @@ -218,6 +225,53 @@ static int kvm_vfio_set_group(struct kvm_device *dev, long attr, u64 arg)
>  		kvm_vfio_update_coherency(dev);
>  
>  		return ret;
> +
> +#ifdef CONFIG_SPAPR_TCE_IOMMU
> +	case KVM_DEV_VFIO_GROUP_SET_SPAPR_TCE: {
> +		struct kvm_vfio_spapr_tce param;
> +		unsigned long minsz;
> +		struct kvm_vfio *kv = dev->private;
> +		struct vfio_group *vfio_group;
> +		struct kvm_vfio_group *kvg;
> +		struct fd f;
> +
> +		minsz = offsetofend(struct kvm_vfio_spapr_tce, tablefd);
> +
> +		if (copy_from_user(&param, (void __user *)arg, minsz))
> +			return -EFAULT;
> +
> +		if (param.argsz < minsz || param.flags)
> +			return -EINVAL;
> +
> +		f = fdget(param.groupfd);
> +		if (!f.file)
> +			return -EBADF;
> +
> +		vfio_group = kvm_vfio_group_get_external_user(f.file);
> +		fdput(f);
> +
> +		if (IS_ERR(vfio_group))
> +			return PTR_ERR(vfio_group);
> +


Is there any particular reason you unwrap the group fd here, but the
table fd inside kvm__spapr_tce_attach_iommu_group()?

> +		ret = -ENOENT;
> +
> +		mutex_lock(&kv->lock);
> +
> +		list_for_each_entry(kvg, &kv->group_list, node) {
> +			if (kvg->vfio_group != vfio_group)
> +				continue;
> +
> +			ret = kvm_spapr_tce_attach_iommu_group(dev->kvm,
> +					param.tablefd, vfio_group);
> +
> +			break;
> +		}
> +
> +		mutex_unlock(&kv->lock);
> +
> +		return ret;
> +	}
> +#endif /* CONFIG_SPAPR_TCE_IOMMU */
>  	}
>  
>  	return -ENXIO;
> @@ -242,6 +296,9 @@ static int kvm_vfio_has_attr(struct kvm_device *dev,
>  		switch (attr->attr) {
>  		case KVM_DEV_VFIO_GROUP_ADD:
>  		case KVM_DEV_VFIO_GROUP_DEL:
> +#ifdef CONFIG_SPAPR_TCE_IOMMU
> +		case KVM_DEV_VFIO_GROUP_SET_SPAPR_TCE:
> +#endif
>  			return 0;
>  		}
>  
> @@ -257,6 +314,9 @@ static void kvm_vfio_destroy(struct kvm_device *dev)
>  	struct kvm_vfio_group *kvg, *tmp;
>  
>  	list_for_each_entry_safe(kvg, tmp, &kv->group_list, node) {
> +#ifdef CONFIG_SPAPR_TCE_IOMMU
> +		kvm_spapr_tce_release_iommu_group(dev->kvm, kvg->vfio_group);
> +#endif
>  		kvm_vfio_group_set_kvm(kvg->vfio_group, NULL);
>  		kvm_vfio_group_put_external_user(kvg->vfio_group);
>  		list_del(&kvg->node);

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 819 bytes --]

next prev parent reply	other threads:[~2017-02-09  6:41 UTC|newest]

Thread overview: 49+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2017-02-07  7:17 [PATCH kernel v4 00/10] powerpc/kvm/vfio: Enable in-kernel acceleration Alexey Kardashevskiy
2017-02-07  7:17 ` Alexey Kardashevskiy
2017-02-07  7:17 ` Alexey Kardashevskiy
2017-02-07  7:17 ` [PATCH kernel v4 01/10] powerpc/mmu: Add real mode support for IOMMU preregistered memory Alexey Kardashevskiy
2017-02-07  7:17   ` Alexey Kardashevskiy
2017-02-07  7:17 ` [PATCH kernel v4 02/10] powerpc/powernv/iommu: Add real mode version of iommu_table_ops::exchange() Alexey Kardashevskiy
2017-02-07  7:17   ` Alexey Kardashevskiy
2017-02-07  7:17 ` [PATCH kernel v4 03/10] powerpc/iommu/vfio_spapr_tce: Cleanup iommu_table disposal Alexey Kardashevskiy
2017-02-07  7:17   ` Alexey Kardashevskiy
2017-02-07  7:17 ` [PATCH kernel v4 04/10] powerpc/vfio_spapr_tce: Add reference counting to iommu_table Alexey Kardashevskiy
2017-02-07  7:17   ` Alexey Kardashevskiy
2017-02-07  7:17 ` [PATCH kernel v4 05/10] KVM: PPC: Reserve KVM_CAP_SPAPR_TCE_VFIO capability number Alexey Kardashevskiy
2017-02-07  7:17   ` Alexey Kardashevskiy
2017-02-07  7:17 ` [PATCH kernel v4 06/10] KVM: PPC: Enable IOMMU_API for KVM_BOOK3S_64 permanently Alexey Kardashevskiy
2017-02-07  7:17   ` Alexey Kardashevskiy
2017-02-07  7:17 ` [PATCH kernel v4 07/10] KVM: PPC: Pass kvm* to kvmppc_find_table() Alexey Kardashevskiy
2017-02-07  7:17   ` Alexey Kardashevskiy
2017-02-07  7:17 ` [PATCH kernel v4 08/10] KVM: PPC: Separate TCE validation from update Alexey Kardashevskiy
2017-02-07  7:17   ` Alexey Kardashevskiy
2017-02-09  3:51   ` David Gibson
2017-02-09  3:51     ` David Gibson
2017-02-09  3:51     ` David Gibson
2017-02-09  8:20     ` Alexey Kardashevskiy
2017-02-09  8:20       ` Alexey Kardashevskiy
2017-02-10  3:07       ` David Gibson
2017-02-10  3:07         ` David Gibson
2017-02-10  4:09         ` Alexey Kardashevskiy
2017-02-10  4:09           ` Alexey Kardashevskiy
2017-02-10  4:50           ` David Gibson
2017-02-10  4:50             ` David Gibson
2017-02-10  4:50             ` David Gibson
2017-02-10  7:58             ` Alexey Kardashevskiy
2017-02-10  7:58               ` Alexey Kardashevskiy
2017-02-07  7:17 ` [PATCH kernel v4 09/10] KVM: PPC: Use preregistered memory API to access TCE list Alexey Kardashevskiy
2017-02-07  7:17   ` Alexey Kardashevskiy
2017-02-09  4:00   ` David Gibson
2017-02-09  4:00     ` David Gibson
2017-02-09  4:00     ` David Gibson
2017-02-07  7:17 ` [PATCH kernel v4 10/10] KVM: PPC: VFIO: Add in-kernel acceleration for VFIO Alexey Kardashevskiy
2017-02-07  7:17   ` Alexey Kardashevskiy
2017-02-09  6:41   ` David Gibson [this message]
2017-02-09  6:41     ` David Gibson
2017-02-09  6:41     ` David Gibson
2017-02-10  2:50     ` Alexey Kardashevskiy
2017-02-10  2:50       ` Alexey Kardashevskiy
2017-02-10  2:50       ` Alexey Kardashevskiy
2017-02-10  4:02       ` David Gibson
2017-02-10  4:02         ` David Gibson
2017-02-10  4:02         ` David Gibson

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20170209064112.GD14524@umbus \
    --to=david@gibson.dropbear.id.au \
    --cc=aik@ozlabs.ru \
    --cc=alex.williamson@redhat.com \
    --cc=kvm-ppc@vger.kernel.org \
    --cc=kvm@vger.kernel.org \
    --cc=linuxppc-dev@lists.ozlabs.org \
    --cc=paulus@samba.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.