* [PATCH 00/10] KVM PCIe/MSI passthrough on ARM/ARM64
@ 2016-01-26 13:12 Eric Auger
  2016-01-26 13:12 ` [PATCH 01/10] iommu: Add DOMAIN_ATTR_MSI_MAPPING attribute Eric Auger
                   ` (11 more replies)
  0 siblings, 12 replies; 32+ messages in thread
From: Eric Auger @ 2016-01-26 13:12 UTC (permalink / raw)
  To: linux-arm-kernel
This series addresses KVM PCIe passthrough with MSI enabled on ARM/ARM64.
It continues the efforts made in [1], [2], [3]. It also aims to cover the
same need on some PowerPC platforms.
On x86 all accesses to the 1MB PA region [FEE0_0000h - FEF0_0000h] are treated
as interrupt messages: accesses to this special PA window directly target the
APIC configuration space and not DRAM, meaning the downstream IOMMU is bypassed.
This is not the case on the above-mentioned platforms, where MSI messages emitted
by devices are conveyed through the IOMMU. This means an IOVA/host PA mapping
must exist for the MSI to reach the MSI controller. The normal way to create
IOVA bindings is to use the VFIO DMA MAP API. However, in this case
the MSI IOVA is not mapped onto guest RAM but onto a host physical page (the MSI
controller frame).
Following the first comments, the spirit of [2] is kept: the guest registers
an IOVA range reserved for MSI mapping. When the VFIO-PCIe driver allocates
its MSI vectors, it overwrites the MSI controller physical address with an IOVA
allocated within the window provided by userspace. This IOVA is mapped
onto the MSI controller frame physical page.
The series does not yet address the problem of telling userspace how
much IOVA it should provision.
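To illustrate the address rewrite described above, here is a minimal
user-space sketch, not the code from this series. The struct demo_msi_msg
type and the demo_msi_use_iova() helper are hypothetical names introduced
only for this example; it assumes the IOVA page has already been allocated
in the reserved window and mapped onto the doorbell page.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical mirror of an MSI message: a 64-bit address split in two
 * halves plus the data payload.
 */
struct demo_msi_msg {
	uint32_t address_lo;
	uint32_t address_hi;
	uint32_t data;
};

/*
 * Replace the doorbell physical address in msg with an address inside
 * the IOVA page that is mapped onto the doorbell page. The offset within
 * the page is preserved; the data payload is left untouched.
 */
static void demo_msi_use_iova(struct demo_msi_msg *msg, uint64_t iova_page,
			      uint64_t page_size)
{
	uint64_t mask = page_size - 1;
	uint64_t pa = ((uint64_t)msg->address_hi << 32) | msg->address_lo;
	uint64_t iova;

	assert(page_size && !(page_size & mask)); /* power of two */
	assert(!(iova_page & mask));              /* page aligned */

	iova = iova_page | (pa & mask);
	msg->address_lo = (uint32_t)iova;
	msg->address_hi = (uint32_t)(iova >> 32);
}
```

The device is then programmed with the rewritten message, so its MSI write
is translated by the IOMMU back to the physical doorbell.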
Best Regards
Eric
Testing:
This is currently tested on ARM64 AMD Overdrive HW (single GICv2m frame)
with an e1000e PCIe card. This is not tested on PPC.
References:
[1] [RFC 0/2] VFIO: Add virtual MSI doorbell support
    (https://lkml.org/lkml/2015/7/24/135)
[2] [RFC PATCH 0/6] vfio: Add interface to map MSI pages
    (https://lists.cs.columbia.edu/pipermail/kvmarm/2015-September/016607.html)
[3] [PATCH v2 0/3] Introduce MSI hardware mapping for VFIO
    (http://permalink.gmane.org/gmane.comp.emulators.kvm.arm.devel/3858)
Git:
https://git.linaro.org/people/eric.auger/linux.git/shortlog/refs/heads/v4.5-rc1-pcie-passthrough-v1
History:
RFC v1 [2] -> PATCH v1:
- use the existing dma map/unmap ioctl interface with a flag to register a
  reserved IOVA range. Use the legacy RB tree to store this special vfio_dma.
- a single contiguous reserved IOVA region is now allowed
- use of an RB tree indexed by PA to store allocated reserved slots
- use of a vfio_domain iova_domain to manage iova allocation within the
  window provided by the userspace
- vfio alloc_map/unmap_free take a vfio_group handle
- vfio_group handle is cached in vfio_pci_device
- add ref counting to bindings
- user modality enabled at the end of the series
Eric Auger (10):
  iommu: Add DOMAIN_ATTR_MSI_MAPPING attribute
  vfio: expose MSI mapping requirement through VFIO_IOMMU_GET_INFO
  vfio_iommu_type1: add reserved binding RB tree management
  vfio: introduce VFIO_IOVA_RESERVED vfio_dma type
  vfio/type1: attach a reserved iova domain to vfio_domain
  vfio: introduce vfio_group_alloc_map_/unmap_free_reserved_iova
  vfio: pci: cache the vfio_group in vfio_pci_device
  vfio: introduce vfio_group_require_msi_mapping
  vfio-pci: create an iommu mapping for msi address
  vfio: allow the user to register reserved iova range for MSI mapping
 drivers/iommu/arm-smmu.c            |   2 +
 drivers/iommu/fsl_pamu_domain.c     |   3 +
 drivers/vfio/pci/vfio_pci.c         |   8 +
 drivers/vfio/pci/vfio_pci_intrs.c   |  73 ++++++-
 drivers/vfio/pci/vfio_pci_private.h |   1 +
 drivers/vfio/vfio.c                 |  64 ++++++
 drivers/vfio/vfio_iommu_type1.c     | 412 +++++++++++++++++++++++++++++++++++-
 include/linux/iommu.h               |   1 +
 include/linux/vfio.h                |  39 +++-
 include/uapi/linux/vfio.h           |  10 +
 10 files changed, 598 insertions(+), 15 deletions(-)
-- 
1.9.1
* [PATCH 01/10] iommu: Add DOMAIN_ATTR_MSI_MAPPING attribute
  2016-01-26 13:12 [PATCH 00/10] KVM PCIe/MSI passthrough on ARM/ARM64 Eric Auger
@ 2016-01-26 13:12 ` Eric Auger
  2016-01-26 13:12 ` [PATCH 02/10] vfio: expose MSI mapping requirement through VFIO_IOMMU_GET_INFO Eric Auger
                   ` (10 subsequent siblings)
  11 siblings, 0 replies; 32+ messages in thread
From: Eric Auger @ 2016-01-26 13:12 UTC (permalink / raw)
  To: linux-arm-kernel
Introduce a new DOMAIN_ATTR_MSI_MAPPING domain attribute. If supported,
this means the MSI addresses need to be mapped in the IOMMU. ARM SMMUs
and FSL PAMU, at least, expose this attribute.
x86 IOMMUs typically don't expose the attribute since on x86 MSI write
transaction addresses are always within the 1MB PA region [FEE0_0000h -
FEF0_0000h], which directly targets the APIC configuration space and
hence bypasses the IOMMU.
Signed-off-by: Bharat Bhushan <Bharat.Bhushan@freescale.com>
Signed-off-by: Eric Auger <eric.auger@linaro.org>
---
To be honest, I wonder whether this is a property of the SMMU. Isn't it
a platform property?
RFC v1 -> RFC v2:
- the data field is not used
- for this attribute domain_get_attr simply returns 0 if the MSI_MAPPING
  capability is needed or <0 if not.
- removed struct iommu_domain_msi_maps
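
The return convention listed above can be modelled outside the kernel as
follows. This is only a sketch: the demo_* types and functions are
hypothetical stand-ins, not the real iommu API, and the choice of -ENODEV
for the "not needed" case is an assumption made for the example.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins for the kernel types; not the real definitions. */
enum demo_attr { DEMO_ATTR_NESTING, DEMO_ATTR_MSI_MAPPING, DEMO_ATTR_MAX };

struct demo_domain {
	bool translates_msi; /* true when MSI writes go through the IOMMU */
};

/* Mirrors the convention above: 0 if the mapping is needed, <0 if not. */
static int demo_domain_get_attr(const struct demo_domain *d,
				enum demo_attr attr, void *data)
{
	assert(d);
	(void)data; /* the data field is not used for this attribute */

	switch (attr) {
	case DEMO_ATTR_MSI_MAPPING:
		return d->translates_msi ? 0 : -ENODEV;
	default:
		return -ENODEV; /* every other attribute is unsupported here */
	}
}

/* Caller side: the attribute is a pure yes/no query. */
static bool demo_requires_msi_mapping(const struct demo_domain *d)
{
	return demo_domain_get_attr(d, DEMO_ATTR_MSI_MAPPING, NULL) == 0;
}
```

A caller only tests the sign of the return value, which is why no data
structure needs to be passed back.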
---
 drivers/iommu/arm-smmu.c        | 2 ++
 drivers/iommu/fsl_pamu_domain.c | 3 +++
 include/linux/iommu.h           | 1 +
 3 files changed, 6 insertions(+)
diff --git a/drivers/iommu/arm-smmu.c b/drivers/iommu/arm-smmu.c
index 59ee4b8..c8b7e71 100644
--- a/drivers/iommu/arm-smmu.c
+++ b/drivers/iommu/arm-smmu.c
@@ -1409,6 +1409,8 @@ static int arm_smmu_domain_get_attr(struct iommu_domain *domain,
 	case DOMAIN_ATTR_NESTING:
 		*(int *)data = (smmu_domain->stage == ARM_SMMU_DOMAIN_NESTED);
 		return 0;
+	case DOMAIN_ATTR_MSI_MAPPING:
+		return 0;
 	default:
 		return -ENODEV;
 	}
diff --git a/drivers/iommu/fsl_pamu_domain.c b/drivers/iommu/fsl_pamu_domain.c
index da0e1e3..dd2e0d6 100644
--- a/drivers/iommu/fsl_pamu_domain.c
+++ b/drivers/iommu/fsl_pamu_domain.c
@@ -856,6 +856,9 @@ static int fsl_pamu_get_domain_attr(struct iommu_domain *domain,
 	case DOMAIN_ATTR_FSL_PAMUV1:
 		*(int *)data = DOMAIN_ATTR_FSL_PAMUV1;
 		break;
+	case DOMAIN_ATTR_MSI_MAPPING:
+		ret = 0;
+		break;
 	default:
 		pr_debug("Unsupported attribute type\n");
 		ret = -EINVAL;
diff --git a/include/linux/iommu.h b/include/linux/iommu.h
index f28dff3..3ae2fb6 100644
--- a/include/linux/iommu.h
+++ b/include/linux/iommu.h
@@ -112,6 +112,7 @@ enum iommu_attr {
 	DOMAIN_ATTR_FSL_PAMU_ENABLE,
 	DOMAIN_ATTR_FSL_PAMUV1,
 	DOMAIN_ATTR_NESTING,	/* two stages of translation */
+	DOMAIN_ATTR_MSI_MAPPING, /* Require MSIs mapping in iommu */
 	DOMAIN_ATTR_MAX,
 };
 
-- 
1.9.1
* [PATCH 02/10] vfio: expose MSI mapping requirement through VFIO_IOMMU_GET_INFO
  2016-01-26 13:12 [PATCH 00/10] KVM PCIe/MSI passthrough on ARM/ARM64 Eric Auger
  2016-01-26 13:12 ` [PATCH 01/10] iommu: Add DOMAIN_ATTR_MSI_MAPPING attribute Eric Auger
@ 2016-01-26 13:12 ` Eric Auger
  2016-01-26 13:12 ` [PATCH 03/10] vfio_iommu_type1: add reserved binding RB tree management Eric Auger
                   ` (9 subsequent siblings)
  11 siblings, 0 replies; 32+ messages in thread
From: Eric Auger @ 2016-01-26 13:12 UTC (permalink / raw)
  To: linux-arm-kernel
This patch allows userspace to retrieve whether MSI write
transaction addresses must be mapped. This is returned through the
VFIO_IOMMU_GET_INFO API using a new flag: VFIO_IOMMU_INFO_REQUIRE_MSI_MAP.
Signed-off-by: Bharat Bhushan <Bharat.Bhushan@freescale.com>
Signed-off-by: Eric Auger <eric.auger@linaro.org>
---
RFC v1 -> RFC v2:
- derived from
  [RFC PATCH 3/6] vfio: Extend iommu-info to return MSIs automap state
- renamed allow_msi_reconfig into require_msi_mapping
- fixed VFIO_IOMMU_GET_INFO
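
As a rough illustration of how the new flag is meant to be produced and
consumed, here is a self-contained sketch. The bit positions are the ones
proposed in this patch, but they are redefined locally with a DEMO_ prefix,
and struct demo_iommu_info and both helpers are hypothetical; no real ioctl
is issued.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Bit positions as proposed in this patch, redefined locally. */
#define DEMO_INFO_PGSIZES		(1u << 0)
#define DEMO_INFO_REQUIRE_MSI_MAP	(1u << 1)

/* Hypothetical mirror of the fields userspace reads back. */
struct demo_iommu_info {
	uint32_t argsz;
	uint32_t flags;
	uint64_t iova_pgsizes;
};

/* Kernel side: fill the info structure (sketch of the GET_INFO path). */
static void demo_fill_info(struct demo_iommu_info *info, uint64_t pgsizes,
			   bool msi_needs_mapping)
{
	assert(info);
	info->argsz = (uint32_t)sizeof(*info);
	info->flags = DEMO_INFO_PGSIZES;
	if (msi_needs_mapping)
		info->flags |= DEMO_INFO_REQUIRE_MSI_MAP;
	info->iova_pgsizes = pgsizes;
}

/* Userspace side: must a reserved IOVA window be registered? */
static bool demo_must_register_msi_window(const struct demo_iommu_info *info)
{
	return (info->flags & DEMO_INFO_REQUIRE_MSI_MAP) != 0;
}
```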
---
 drivers/vfio/vfio_iommu_type1.c | 26 ++++++++++++++++++++++++++
 include/uapi/linux/vfio.h       |  1 +
 2 files changed, 27 insertions(+)
diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c
index 6f1ea3d..c5b57e1 100644
--- a/drivers/vfio/vfio_iommu_type1.c
+++ b/drivers/vfio/vfio_iommu_type1.c
@@ -255,6 +255,29 @@ static int vaddr_get_pfn(unsigned long vaddr, int prot, unsigned long *pfn)
 }
 
 /*
+ * vfio_domains_require_msi_mapping: indicates whether MSI write transaction
+ * addresses must be mapped
+ *
+ * returns true if it does
+ */
+static bool vfio_domains_require_msi_mapping(struct vfio_iommu *iommu)
+{
+	struct vfio_domain *d;
+	bool ret;
+
+	mutex_lock(&iommu->lock);
+	/* All domains have same require_msi_map property, pick first */
+	d = list_first_entry(&iommu->domain_list, struct vfio_domain, next);
+	if (iommu_domain_get_attr(d->domain, DOMAIN_ATTR_MSI_MAPPING, NULL) < 0)
+		ret = false;
+	else
+		ret = true;
+	mutex_unlock(&iommu->lock);
+
+	return ret;
+}
+
+/*
  * Attempt to pin pages.  We really don't want to track all the pfns and
  * the iommu can only map chunks of consecutive pfns anyway, so get the
  * first page and all consecutive pages with the same locking.
@@ -997,6 +1020,9 @@ static long vfio_iommu_type1_ioctl(void *iommu_data,
 
 		info.flags = VFIO_IOMMU_INFO_PGSIZES;
 
+		if (vfio_domains_require_msi_mapping(iommu))
+			info.flags |= VFIO_IOMMU_INFO_REQUIRE_MSI_MAP;
+
 		info.iova_pgsizes = vfio_pgsize_bitmap(iommu);
 
 		return copy_to_user((void __user *)arg, &info, minsz);
diff --git a/include/uapi/linux/vfio.h b/include/uapi/linux/vfio.h
index 7d7a4c6..43e183b 100644
--- a/include/uapi/linux/vfio.h
+++ b/include/uapi/linux/vfio.h
@@ -400,6 +400,7 @@ struct vfio_iommu_type1_info {
 	__u32	argsz;
 	__u32	flags;
 #define VFIO_IOMMU_INFO_PGSIZES (1 << 0)	/* supported page sizes info */
+#define VFIO_IOMMU_INFO_REQUIRE_MSI_MAP (1 << 1)/* MSI must be mapped */
 	__u64	iova_pgsizes;		/* Bitmap of supported page sizes */
 };
 
-- 
1.9.1
* [PATCH 03/10] vfio_iommu_type1: add reserved binding RB tree management
  2016-01-26 13:12 [PATCH 00/10] KVM PCIe/MSI passthrough on ARM/ARM64 Eric Auger
  2016-01-26 13:12 ` [PATCH 01/10] iommu: Add DOMAIN_ATTR_MSI_MAPPING attribute Eric Auger
  2016-01-26 13:12 ` [PATCH 02/10] vfio: expose MSI mapping requirement through VFIO_IOMMU_GET_INFO Eric Auger
@ 2016-01-26 13:12 ` Eric Auger
  2016-01-26 13:12 ` [PATCH 04/10] vfio: introduce VFIO_IOVA_RESERVED vfio_dma type Eric Auger
                   ` (8 subsequent siblings)
  11 siblings, 0 replies; 32+ messages in thread
From: Eric Auger @ 2016-01-26 13:12 UTC (permalink / raw)
  To: linux-arm-kernel
The legacy dma_list is only used to insert the reserved IOVA region and
to check that reserved IOVA mappings fall within this window. As opposed to
other vfio_dma slots, the reserved one is not necessarily mapped.
We will need to track which host physical addresses are mapped to
reserved IOVAs. To that end we introduce a new RB tree indexed
by physical address. This reverse RB tree is only used for reserved
IOVA bindings. It belongs to a given iommu domain.
It is expected this RB tree will contain very few bindings. Those
generally correspond to a single page mapping one MSI frame (at least
for ARM, containing the GICv2m frame or the ITS GITS_TRANSLATER frame).
Signed-off-by: Eric Auger <eric.auger@linaro.org>
---
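The lookup-by-physical-range logic used by this patch can be tried in
user space with the sketch below. It deliberately swaps the kernel RB tree
for a plain unbalanced binary search tree to stay self-contained; the
demo_* names are hypothetical, and only the overlap comparisons follow the
patch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified binding: an unbalanced BST node keyed by PA. */
struct demo_binding {
	struct demo_binding *left, *right;
	uint64_t addr; /* physical base address */
	uint64_t iova; /* IOVA mapped onto addr */
	uint64_t size; /* size in bytes */
};

/* Return the binding overlapping [start, start + size), or NULL. */
static struct demo_binding *demo_find(struct demo_binding *root,
				      uint64_t start, uint64_t size)
{
	while (root) {
		if (start + size <= root->addr)
			root = root->left;
		else if (start >= root->addr + root->size)
			root = root->right;
		else
			return root;
	}
	return NULL;
}

/* Insert a binding that must not overlap any existing one. */
static void demo_link(struct demo_binding **root, struct demo_binding *new)
{
	assert(!demo_find(*root, new->addr, new->size));
	new->left = new->right = NULL;
	while (*root) {
		if (new->addr + new->size <= (*root)->addr)
			root = &(*root)->left;
		else
			root = &(*root)->right;
	}
	*root = new;
}
```

Because bindings never overlap, any query range that intersects a node
stops the descent at that node, so a lookup by a single page is enough to
find the binding that covers it.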
 drivers/vfio/vfio_iommu_type1.c | 63 +++++++++++++++++++++++++++++++++++++++++
 1 file changed, 63 insertions(+)
diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c
index c5b57e1..32438d9 100644
--- a/drivers/vfio/vfio_iommu_type1.c
+++ b/drivers/vfio/vfio_iommu_type1.c
@@ -56,6 +56,7 @@ MODULE_PARM_DESC(disable_hugepages,
 struct vfio_iommu {
 	struct list_head	domain_list;
 	struct mutex		lock;
+	/* rb tree indexed by IOVA */
 	struct rb_root		dma_list;
 	bool			v2;
 	bool			nesting;
@@ -65,6 +66,8 @@ struct vfio_domain {
 	struct iommu_domain	*domain;
 	struct list_head	next;
 	struct list_head	group_list;
+	/* rb tree indexed by PA, for reserved bindings only */
+	struct rb_root		reserved_binding_list;
 	int			prot;		/* IOMMU_CACHE */
 	bool			fgsp;		/* Fine-grained super pages */
 };
@@ -77,11 +80,70 @@ struct vfio_dma {
 	int			prot;		/* IOMMU_READ/WRITE */
 };
 
+struct vfio_reserved_binding {
+	struct kref 		kref;
+	struct rb_node		node;
+	struct vfio_domain	*domain;
+	phys_addr_t		addr;
+	dma_addr_t		iova;
+	size_t			size;
+};
+
 struct vfio_group {
 	struct iommu_group	*iommu_group;
 	struct list_head	next;
 };
 
+/* Reserved binding RB-tree manipulation */
+
+static struct vfio_reserved_binding *vfio_find_reserved_binding(
+				    struct vfio_domain *d,
+				    phys_addr_t start, size_t size)
+{
+	struct rb_node *node = d->reserved_binding_list.rb_node;
+
+	while (node) {
+		struct vfio_reserved_binding *binding =
+			rb_entry(node, struct vfio_reserved_binding, node);
+
+		if (start + size <= binding->addr)
+			node = node->rb_left;
+		else if (start >= binding->addr + binding->size)
+			node = node->rb_right;
+		else
+			return binding;
+	}
+
+	return NULL;
+}
+
+static void vfio_link_reserved_binding(struct vfio_domain *d,
+				       struct vfio_reserved_binding *new)
+{
+	struct rb_node **link = &d->reserved_binding_list.rb_node;
+	struct rb_node *parent = NULL;
+	struct vfio_reserved_binding *binding;
+
+	while (*link) {
+		parent = *link;
+		binding = rb_entry(parent, struct vfio_reserved_binding, node);
+
+		if (new->addr + new->size <= binding->addr)
+			link = &(*link)->rb_left;
+		else
+			link = &(*link)->rb_right;
+	}
+
+	rb_link_node(&new->node, parent, link);
+	rb_insert_color(&new->node, &d->reserved_binding_list);
+}
+
+static void vfio_unlink_reserved_binding(struct vfio_domain *d,
+					 struct vfio_reserved_binding *old)
+{
+	rb_erase(&old->node, &d->reserved_binding_list);
+}
+
 /*
  * This code handles mapping and unmapping of user data buffers
  * into DMA'ble space using the IOMMU
@@ -784,6 +846,7 @@ static int vfio_iommu_type1_attach_group(void *iommu_data,
 		ret = -ENOMEM;
 		goto out_free;
 	}
+	domain->reserved_binding_list = RB_ROOT;
 
 	group->iommu_group = iommu_group;
 
-- 
1.9.1
* [PATCH 04/10] vfio: introduce VFIO_IOVA_RESERVED vfio_dma type
  2016-01-26 13:12 [PATCH 00/10] KVM PCIe/MSI passthrough on ARM/ARM64 Eric Auger
                   ` (2 preceding siblings ...)
  2016-01-26 13:12 ` [PATCH 03/10] vfio_iommu_type1: add reserved binding RB tree management Eric Auger
@ 2016-01-26 13:12 ` Eric Auger
  2016-01-26 13:12 ` [PATCH 05/10] vfio/type1: attach a reserved iova domain to vfio_domain Eric Auger
                   ` (7 subsequent siblings)
  11 siblings, 0 replies; 32+ messages in thread
From: Eric Auger @ 2016-01-26 13:12 UTC (permalink / raw)
  To: linux-arm-kernel
We introduce a vfio_dma type since we will need to discriminate
legacy vfio_dma's from new reserved ones. Since the latter are
not mapped at registration, some handling must be skipped for them:
removal and replay. This will be enhanced later on.
Signed-off-by: Eric Auger <eric.auger@linaro.org>
---
 drivers/vfio/vfio_iommu_type1.c | 17 ++++++++++++++++-
 1 file changed, 16 insertions(+), 1 deletion(-)
diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c
index 32438d9..33a9ce4 100644
--- a/drivers/vfio/vfio_iommu_type1.c
+++ b/drivers/vfio/vfio_iommu_type1.c
@@ -53,6 +53,15 @@ module_param_named(disable_hugepages,
 MODULE_PARM_DESC(disable_hugepages,
 		 "Disable VFIO IOMMU support for IOMMU hugepages.");
 
+enum vfio_iova_type {
+	VFIO_IOVA_USER = 0, /* standard IOVA used to map user vaddr */
+	/*
+	 * IOVA reserved to map special host physical addresses,
+	 * MSI frames for instance
+	 */
+	VFIO_IOVA_RESERVED,
+};
+
 struct vfio_iommu {
 	struct list_head	domain_list;
 	struct mutex		lock;
@@ -78,6 +87,7 @@ struct vfio_dma {
 	unsigned long		vaddr;		/* Process virtual addr */
 	size_t			size;		/* Map size (bytes) */
 	int			prot;		/* IOMMU_READ/WRITE */
+	enum vfio_iova_type	type;		/* type of IOVA */
 };
 
 struct vfio_reserved_binding {
@@ -480,7 +490,8 @@ static void vfio_unmap_unpin(struct vfio_iommu *iommu, struct vfio_dma *dma)
 
 static void vfio_remove_dma(struct vfio_iommu *iommu, struct vfio_dma *dma)
 {
-	vfio_unmap_unpin(iommu, dma);
+	if (likely(dma->type != VFIO_IOVA_RESERVED))
+		vfio_unmap_unpin(iommu, dma);
 	vfio_unlink_dma(iommu, dma);
 	kfree(dma);
 }
@@ -756,6 +767,10 @@ static int vfio_iommu_replay(struct vfio_iommu *iommu,
 		dma_addr_t iova;
 
 		dma = rb_entry(n, struct vfio_dma, node);
+
+		if (unlikely(dma->type == VFIO_IOVA_RESERVED))
+			continue;
+
 		iova = dma->iova;
 
 		while (iova < dma->iova + dma->size) {
-- 
1.9.1
* [PATCH 05/10] vfio/type1: attach a reserved iova domain to vfio_domain
  2016-01-26 13:12 [PATCH 00/10] KVM PCIe/MSI passthrough on ARM/ARM64 Eric Auger
                   ` (3 preceding siblings ...)
  2016-01-26 13:12 ` [PATCH 04/10] vfio: introduce VFIO_IOVA_RESERVED vfio_dma type Eric Auger
@ 2016-01-26 13:12 ` Eric Auger
  2016-01-26 13:12 ` [PATCH 06/10] vfio: introduce vfio_group_alloc_map_/unmap_free_reserved_iova Eric Auger
                   ` (6 subsequent siblings)
  11 siblings, 0 replies; 32+ messages in thread
From: Eric Auger @ 2016-01-26 13:12 UTC (permalink / raw)
  To: linux-arm-kernel
This patch adds a reserved iova_domain to the vfio_domain struct.
This iova domain makes it possible to allocate IOVAs within the reserved
IOVA region.
alloc_reserved_iova_domain allocates and initializes
this iova domain. Its caller will be introduced in subsequent patches.
Signed-off-by: Eric Auger <eric.auger@linaro.org>
---
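The alignment checks and the page-frame limits computed for the reserved
window can be checked in isolation with the sketch below. The
struct demo_pfn_range type and demo_window_to_pfns() are hypothetical names
used only here; the arithmetic follows what alloc_reserved_iova_domain
passes to the iova allocator.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

/* Hypothetical result type: the PFN range handed to the IOVA allocator. */
struct demo_pfn_range {
	uint64_t start_pfn; /* first page frame number of the window */
	uint64_t last_pfn;  /* last page frame number, inclusive */
};

/*
 * Validate a reserved window against the IOMMU page size (1 << order)
 * and convert it to an inclusive PFN range.
 */
static int demo_window_to_pfns(uint64_t iova, uint64_t size,
			       unsigned int order, struct demo_pfn_range *r)
{
	uint64_t granule, mask;

	assert(order < 64);
	granule = UINT64_C(1) << order;
	mask = granule - 1;

	if (iova & mask)
		return -EINVAL; /* base is not page aligned */
	if (!size || (size & mask))
		return -EINVAL; /* empty or not a whole number of pages */

	r->start_pfn = iova >> order;
	r->last_pfn = (iova + size - 1) >> order;
	return 0;
}
```

For instance, a 64kB window at IOVA 0x8000000 with 4kB pages yields the
PFN range 0x8000 to 0x800f.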
 drivers/vfio/vfio_iommu_type1.c | 37 +++++++++++++++++++++++++++++++++++++
 1 file changed, 37 insertions(+)
diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c
index 33a9ce4..33304c0 100644
--- a/drivers/vfio/vfio_iommu_type1.c
+++ b/drivers/vfio/vfio_iommu_type1.c
@@ -36,6 +36,7 @@
 #include <linux/uaccess.h>
 #include <linux/vfio.h>
 #include <linux/workqueue.h>
+#include <linux/iova.h>
 
 #define DRIVER_VERSION  "0.2"
 #define DRIVER_AUTHOR   "Alex Williamson <alex.williamson@redhat.com>"
@@ -77,6 +78,7 @@ struct vfio_domain {
 	struct list_head	group_list;
 	/* rb tree indexed by PA, for reserved bindings only */
 	struct rb_root		reserved_binding_list;
+	struct iova_domain	*reserved_iova_domain;
 	int			prot;		/* IOMMU_CACHE */
 	bool			fgsp;		/* Fine-grained super pages */
 };
@@ -178,6 +180,41 @@ static struct vfio_dma *vfio_find_dma(struct vfio_iommu *iommu,
 	return NULL;
 }
 
+/* alloc_reserved_iova_domain: allocate an iova domain used to manage
+ * reserved iova
+ * @iova: base iova of the domain
+ * @size: size of the domain
+ * @order: iommu page size order
+ */
+static int alloc_reserved_iova_domain(struct vfio_domain *domain,
+				      dma_addr_t iova, size_t size,
+				      unsigned long order)
+{
+	unsigned long granule, mask;
+	int ret = 0;
+
+	granule = 1UL << order;
+	mask = granule - 1;
+	if (iova & mask)
+		return -EINVAL;
+	if ((!size) || (size & mask))
+		return -EINVAL;
+
+	domain->reserved_iova_domain =
+		kzalloc(sizeof(struct iova_domain), GFP_KERNEL);
+	if (!domain->reserved_iova_domain) {
+		ret = -ENOMEM;
+		goto out_free;
+	}
+	init_iova_domain(domain->reserved_iova_domain,
+			 granule, iova >> order, (iova + size - 1) >> order);
+	return ret;
+
+out_free:
+	kfree(domain->reserved_iova_domain);
+	return ret;
+}
+
 static void vfio_link_dma(struct vfio_iommu *iommu, struct vfio_dma *new)
 {
 	struct rb_node **link = &iommu->dma_list.rb_node, *parent = NULL;
-- 
1.9.1
* [PATCH 06/10] vfio: introduce vfio_group_alloc_map_/unmap_free_reserved_iova
  2016-01-26 13:12 [PATCH 00/10] KVM PCIe/MSI passthrough on ARM/ARM64 Eric Auger
                   ` (4 preceding siblings ...)
  2016-01-26 13:12 ` [PATCH 05/10] vfio/type1: attach a reserved iova domain to vfio_domain Eric Auger
@ 2016-01-26 13:12 ` Eric Auger
  2016-01-26 16:17   ` kbuild test robot
  2016-01-26 13:12 ` [PATCH 07/10] vfio: pci: cache the vfio_group in vfio_pci_device Eric Auger
                   ` (5 subsequent siblings)
  11 siblings, 1 reply; 32+ messages in thread
From: Eric Auger @ 2016-01-26 13:12 UTC (permalink / raw)
  To: linux-arm-kernel
This patch introduces vfio_group_alloc_map_/unmap_free_reserved_iova
and implements the corresponding vfio_iommu_type1 operations.
alloc_map allocates a new reserved iova page and maps it
onto the physical page that contains a given PA. It returns the iova
that is mapped onto the provided PA. If a mapping already exists
between both pages, the IOVA corresponding to the PA is directly returned.
Signed-off-by: Eric Auger <eric.auger@linaro.org>
Signed-off-by: Ankit Jindal <ajindal@apm.com>
Signed-off-by: Pranavkumar Sawargaonkar <pranavkumar@linaro.org>
Signed-off-by: Bharat Bhushan <Bharat.Bhushan@freescale.com>
---
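The intended alloc_map/unmap_free semantics (reuse of an existing binding,
offset preserved, reference counting) can be modelled with the sketch
below. It is a simplification under stated assumptions: a fixed array of
slots replaces the RB tree, the iova allocator and kref, and all demo_*
names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

#define DEMO_SLOTS 4

/* Hypothetical binding table standing in for the RB tree and kref. */
struct demo_slot {
	uint64_t pa_page;  /* aligned physical page, valid when refs > 0 */
	unsigned int refs; /* number of users of this binding */
};

struct demo_window {
	uint64_t base;      /* IOVA of slot 0 */
	uint64_t page_size; /* IOMMU page size, power of two */
	struct demo_slot slot[DEMO_SLOTS];
};

/* Map the page containing pa, reusing an existing binding when possible. */
static int demo_alloc_map(struct demo_window *w, uint64_t pa, uint64_t *iova)
{
	uint64_t mask = w->page_size - 1;
	uint64_t page = pa & ~mask, offset = pa & mask;
	int i, free_idx = -1;

	assert(w->page_size && !(w->page_size & mask));
	for (i = 0; i < DEMO_SLOTS; i++) {
		if (w->slot[i].refs && w->slot[i].pa_page == page) {
			w->slot[i].refs++;
			*iova = w->base + (uint64_t)i * w->page_size + offset;
			return 0;
		}
		if (!w->slot[i].refs && free_idx < 0)
			free_idx = i;
	}
	if (free_idx < 0)
		return -1; /* window exhausted */

	w->slot[free_idx].pa_page = page;
	w->slot[free_idx].refs = 1;
	*iova = w->base + (uint64_t)free_idx * w->page_size + offset;
	return 0;
}

/* Drop one reference; the slot becomes free when the count reaches zero. */
static int demo_unmap_free(struct demo_window *w, uint64_t iova)
{
	uint64_t idx;

	if (iova < w->base)
		return -1;
	idx = (iova - w->base) / w->page_size;
	if (idx >= DEMO_SLOTS || !w->slot[idx].refs)
		return -1;
	w->slot[idx].refs--;
	return 0;
}
```

Two MSI vectors whose doorbells live in the same physical page therefore
share one IOVA page, and the page is only released once both have been
unmapped.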
 drivers/vfio/vfio.c             |  39 ++++++++++
 drivers/vfio/vfio_iommu_type1.c | 163 ++++++++++++++++++++++++++++++++++++++--
 include/linux/vfio.h            |  34 ++++++++-
 3 files changed, 228 insertions(+), 8 deletions(-)
diff --git a/drivers/vfio/vfio.c b/drivers/vfio/vfio.c
index 82f25cc..3d9de00 100644
--- a/drivers/vfio/vfio.c
+++ b/drivers/vfio/vfio.c
@@ -268,6 +268,45 @@ void vfio_unregister_iommu_driver(const struct vfio_iommu_driver_ops *ops)
 }
 EXPORT_SYMBOL_GPL(vfio_unregister_iommu_driver);
 
+int vfio_group_alloc_map_reserved_iova(struct vfio_group *group,
+				       phys_addr_t addr, int prot,
+				       dma_addr_t *iova)
+{
+	struct vfio_container *container = group->container;
+	const struct vfio_iommu_driver_ops *ops = container->iommu_driver->ops;
+	int ret;
+
+	if (!ops->alloc_map_reserved_iova)
+		return -EINVAL;
+
+	down_read(&container->group_lock);
+	ret = ops->alloc_map_reserved_iova(container->iommu_data,
+					   group->iommu_group,
+					   addr, prot, iova);
+	up_read(&container->group_lock);
+	return ret;
+
+}
+EXPORT_SYMBOL_GPL(vfio_group_alloc_map_reserved_iova);
+
+int vfio_group_unmap_free_reserved_iova(struct vfio_group *group,
+					dma_addr_t iova)
+{
+	struct vfio_container *container = group->container;
+	const struct vfio_iommu_driver_ops *ops = container->iommu_driver->ops;
+	int ret;
+
+	if (!ops->unmap_free_reserved_iova)
+		return -EINVAL;
+
+	down_read(&container->group_lock);
+	ret = ops->unmap_free_reserved_iova(container->iommu_data,
+					    group->iommu_group, iova);
+	up_read(&container->group_lock);
+	return ret;
+}
+EXPORT_SYMBOL_GPL(vfio_group_unmap_free_reserved_iova);
+
 /**
  * Group minor allocation/free - both called with vfio.group_lock held
  */
diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c
index 33304c0..a79e2a8 100644
--- a/drivers/vfio/vfio_iommu_type1.c
+++ b/drivers/vfio/vfio_iommu_type1.c
@@ -156,6 +156,19 @@ static void vfio_unlink_reserved_binding(struct vfio_domain *d,
 	rb_erase(&old->node, &d->reserved_binding_list);
 }
 
+static void vfio_reserved_binding_release(struct kref *kref)
+{
+	struct vfio_reserved_binding *b =
+		container_of(kref, struct vfio_reserved_binding, kref);
+	struct vfio_domain *d = b->domain;
+	unsigned long order = __ffs(b->size);
+
+	iommu_unmap(d->domain, b->iova, b->size);
+	free_iova(d->reserved_iova_domain, b->iova >> order);
+	vfio_unlink_reserved_binding(d, b);
+	kfree(b);
+}
+
 /*
  * This code handles mapping and unmapping of user data buffers
  * into DMA'ble space using the IOMMU
@@ -1034,6 +1047,138 @@ done:
 	mutex_unlock(&iommu->lock);
 }
 
+static struct vfio_domain *vfio_find_iommu_domain(void *iommu_data,
+						   struct iommu_group *group)
+{
+	struct vfio_iommu *iommu = iommu_data;
+	struct vfio_group *g;
+	struct vfio_domain *d;
+
+	list_for_each_entry(d, &iommu->domain_list, next) {
+		list_for_each_entry(g, &d->group_list, next) {
+			if (g->iommu_group == group)
+				return d;
+		}
+	}
+	return NULL;
+}
+
+static int vfio_iommu_type1_alloc_map_reserved_iova(void *iommu_data,
+						    struct iommu_group *group,
+						    phys_addr_t addr, int prot,
+						    dma_addr_t *iova)
+{
+	struct vfio_iommu *iommu = iommu_data;
+	struct vfio_domain *d;
+	uint64_t mask, iommu_page_size;
+	struct vfio_reserved_binding *b;
+	unsigned long order;
+	struct iova *p_iova;
+	phys_addr_t aligned_addr, offset;
+	int ret = 0;
+
+	order = __ffs(vfio_pgsize_bitmap(iommu));
+	iommu_page_size = (uint64_t)1 << order;
+	mask = iommu_page_size - 1;
+	aligned_addr = addr & ~mask;
+	offset = addr - aligned_addr;
+
+	mutex_lock(&iommu->lock);
+
+	d = vfio_find_iommu_domain(iommu_data, group);
+	if (!d) {
+		ret = -EINVAL;
+		goto unlock;
+	}
+
+	b = vfio_find_reserved_binding(d, aligned_addr, iommu_page_size);
+	if (b) {
+		ret = 0;
+		*iova = b->iova + offset;
+		kref_get(&b->kref);
+		goto unlock;
+	}
+
+	/* allocate a new reserved IOVA page and a new binding node */
+	p_iova = alloc_iova(d->reserved_iova_domain, 1,
+			    d->reserved_iova_domain->dma_32bit_pfn, true);
+	if (!p_iova) {
+		ret = -ENOMEM;
+		goto unlock;
+	}
+	*iova = p_iova->pfn_lo << order;
+
+	b = kzalloc(sizeof(*b), GFP_KERNEL);
+	if (!b) {
+		ret = -ENOMEM;
+		goto free_iova_unlock;
+	}
+
+	ret = iommu_map(d->domain, *iova, aligned_addr, iommu_page_size, prot);
+	if (ret)
+		goto free_binding_iova_unlock;
+
+	kref_init(&b->kref);
+	kref_get(&b->kref);
+	b->domain = d;
+	b->addr = aligned_addr;
+	b->iova = *iova;
+	b->size = iommu_page_size;
+	vfio_link_reserved_binding(d, b);
+	*iova += offset;
+
+	goto unlock;
+
+free_binding_iova_unlock:
+	kfree(b);
+free_iova_unlock:
+	free_iova(d->reserved_iova_domain, *iova >> order);
+unlock:
+	mutex_unlock(&iommu->lock);
+	return ret;
+}
+
+static int vfio_iommu_type1_unmap_free_reserved_iova(void *iommu_data,
+						     struct iommu_group *group,
+						     dma_addr_t iova)
+{
+	struct vfio_iommu *iommu = iommu_data;
+	struct vfio_reserved_binding *b;
+	struct vfio_domain *d;
+	phys_addr_t aligned_addr;
+	dma_addr_t aligned_iova, iommu_page_size, mask, offset;
+	unsigned long order;
+	int ret = 0;
+
+	order = __ffs(vfio_pgsize_bitmap(iommu));
+	iommu_page_size = (uint64_t)1 << order;
+	mask = iommu_page_size - 1;
+	aligned_iova = iova & ~mask;
+	offset = iova - aligned_iova;
+
+	mutex_lock(&iommu->lock);
+
+	d = vfio_find_iommu_domain(iommu_data, group);
+	if (!d) {
+		ret = -EINVAL;
+		goto unlock;
+	}
+
+	aligned_addr = iommu_iova_to_phys(d->domain, aligned_iova);
+
+	b = vfio_find_reserved_binding(d, aligned_addr, iommu_page_size);
+	if (!b) {
+		ret = -EINVAL;
+		goto unlock;
+	}
+
+	kref_put(&b->kref, vfio_reserved_binding_release);
+
+unlock:
+	mutex_unlock(&iommu->lock);
+	return ret;
+}
+
 static void *vfio_iommu_type1_open(unsigned long arg)
 {
 	struct vfio_iommu *iommu;
@@ -1180,13 +1325,17 @@ static long vfio_iommu_type1_ioctl(void *iommu_data,
 }
 
 static const struct vfio_iommu_driver_ops vfio_iommu_driver_ops_type1 = {
-	.name		= "vfio-iommu-type1",
-	.owner		= THIS_MODULE,
-	.open		= vfio_iommu_type1_open,
-	.release	= vfio_iommu_type1_release,
-	.ioctl		= vfio_iommu_type1_ioctl,
-	.attach_group	= vfio_iommu_type1_attach_group,
-	.detach_group	= vfio_iommu_type1_detach_group,
+	.name				= "vfio-iommu-type1",
+	.owner				= THIS_MODULE,
+	.open				= vfio_iommu_type1_open,
+	.release			= vfio_iommu_type1_release,
+	.ioctl				= vfio_iommu_type1_ioctl,
+	.attach_group			= vfio_iommu_type1_attach_group,
+	.detach_group			= vfio_iommu_type1_detach_group,
+	.alloc_map_reserved_iova	=
+		vfio_iommu_type1_alloc_map_reserved_iova,
+	.unmap_free_reserved_iova	=
+		vfio_iommu_type1_unmap_free_reserved_iova,
 };
 
 static int __init vfio_iommu_type1_init(void)
diff --git a/include/linux/vfio.h b/include/linux/vfio.h
index 610a86a..0020f81 100644
--- a/include/linux/vfio.h
+++ b/include/linux/vfio.h
@@ -75,7 +75,13 @@ struct vfio_iommu_driver_ops {
 					struct iommu_group *group);
 	void		(*detach_group)(void *iommu_data,
 					struct iommu_group *group);
-
+	int		(*alloc_map_reserved_iova)(void *iommu_data,
+						   struct iommu_group *group,
+						   phys_addr_t addr, int prot,
+						   dma_addr_t *iova);
+	int		(*unmap_free_reserved_iova)(void *iommu_data,
+						    struct iommu_group *group,
+						    dma_addr_t iova);
 };
 
 extern int vfio_register_iommu_driver(const struct vfio_iommu_driver_ops *ops);
@@ -138,4 +144,30 @@ extern int vfio_virqfd_enable(void *opaque,
 			      void *data, struct virqfd **pvirqfd, int fd);
 extern void vfio_virqfd_disable(struct virqfd **pvirqfd);
 
+/**
+ * vfio_group_alloc_map_reserved_iova: allocates a new iova page and map
+ * it onto the aligned physical page that contains a given physical addr.
+ * page size is the domain iommu page size.
+ *
+ * @group: vfio group handle
+ * @addr: physical address to map
+ * @prot: protection attribute
+ * @iova: returned iova that is mapped onto addr
+ *
+ * returns 0 on success, < 0 on failure
+ */
+extern int vfio_group_alloc_map_reserved_iova(struct vfio_group *group,
+					      phys_addr_t addr, int prot,
+					      dma_addr_t *iova);
+/**
+ * vfio_group_unmap_free_reserved_iova: unmap and free the reserved iova page
+ *
+ * @group: vfio group handle
+ * @iova: base iova, must be aligned on the IOMMU page size
+ *
+ * returns 0 on success, < 0 on failure
+ */
+extern int vfio_group_unmap_free_reserved_iova(struct vfio_group *group,
+					       dma_addr_t iova);
+
 #endif /* VFIO_H */
-- 
1.9.1
* [PATCH 07/10] vfio: pci: cache the vfio_group in vfio_pci_device
  2016-01-26 13:12 [PATCH 00/10] KVM PCIe/MSI passthrough on ARM/ARM64 Eric Auger
                   ` (5 preceding siblings ...)
  2016-01-26 13:12 ` [PATCH 06/10] vfio: introduce vfio_group_alloc_map_/unmap_free_reserved_iova Eric Auger
@ 2016-01-26 13:12 ` Eric Auger
  2016-01-26 13:12 ` [PATCH 08/10] vfio: introduce vfio_group_require_msi_mapping Eric Auger
                   ` (4 subsequent siblings)
  11 siblings, 0 replies; 32+ messages in thread
From: Eric Auger @ 2016-01-26 13:12 UTC (permalink / raw)
  To: linux-arm-kernel
Add a new set_group operation which caches the vfio_group
handle within the vfio_pci_device struct. This is useful for performing
iommu operations from the vfio_pci device.
Signed-off-by: Eric Auger <eric.auger@linaro.org>
---
 drivers/vfio/pci/vfio_pci.c         | 8 ++++++++
 drivers/vfio/pci/vfio_pci_private.h | 1 +
 drivers/vfio/vfio.c                 | 3 +++
 include/linux/vfio.h                | 3 +++
 4 files changed, 15 insertions(+)
diff --git a/drivers/vfio/pci/vfio_pci.c b/drivers/vfio/pci/vfio_pci.c
index 2760a7b..2a32856 100644
--- a/drivers/vfio/pci/vfio_pci.c
+++ b/drivers/vfio/pci/vfio_pci.c
@@ -920,6 +920,13 @@ static void vfio_pci_request(void *device_data, unsigned int count)
 	mutex_unlock(&vdev->igate);
 }
 
+static void vfio_pci_set_group(void *device_data, struct vfio_group *group)
+{
+	struct vfio_pci_device *vdev = device_data;
+
+	vdev->vfio_group = group;
+}
+
 static const struct vfio_device_ops vfio_pci_ops = {
 	.name		= "vfio-pci",
 	.open		= vfio_pci_open,
@@ -929,6 +936,7 @@ static const struct vfio_device_ops vfio_pci_ops = {
 	.write		= vfio_pci_write,
 	.mmap		= vfio_pci_mmap,
 	.request	= vfio_pci_request,
+	.set_group	= vfio_pci_set_group,
 };
 
 static int vfio_pci_probe(struct pci_dev *pdev, const struct pci_device_id *id)
diff --git a/drivers/vfio/pci/vfio_pci_private.h b/drivers/vfio/pci/vfio_pci_private.h
index 0e7394f..2893b10 100644
--- a/drivers/vfio/pci/vfio_pci_private.h
+++ b/drivers/vfio/pci/vfio_pci_private.h
@@ -35,6 +35,7 @@ struct vfio_pci_irq_ctx {
 
 struct vfio_pci_device {
 	struct pci_dev		*pdev;
+	struct vfio_group	*vfio_group;
 	void __iomem		*barmap[PCI_STD_RESOURCE_END + 1];
 	u8			*pci_config_map;
 	u8			*vconfig;
diff --git a/drivers/vfio/vfio.c b/drivers/vfio/vfio.c
index 3d9de00..2760d4c 100644
--- a/drivers/vfio/vfio.c
+++ b/drivers/vfio/vfio.c
@@ -548,6 +548,9 @@ struct vfio_device *vfio_group_create_device(struct vfio_group *group,
 	/* No need to get group_lock, caller has group reference */
 	vfio_group_get(group);
 
+	if (device->ops->set_group)
+		device->ops->set_group(device_data, group);
+
 	mutex_lock(&group->device_lock);
 	list_add(&device->group_next, &group->device_list);
 	mutex_unlock(&group->device_lock);
diff --git a/include/linux/vfio.h b/include/linux/vfio.h
index 0020f81..f7eaf30 100644
--- a/include/linux/vfio.h
+++ b/include/linux/vfio.h
@@ -18,6 +18,8 @@
 #include <linux/poll.h>
 #include <uapi/linux/vfio.h>
 
+struct vfio_group;
+
 /**
  * struct vfio_device_ops - VFIO bus driver device callbacks
  *
@@ -42,6 +44,7 @@ struct vfio_device_ops {
 			 unsigned long arg);
 	int	(*mmap)(void *device_data, struct vm_area_struct *vma);
 	void	(*request)(void *device_data, unsigned int count);
+	void	(*set_group)(void *device_data, struct vfio_group *group);
 };
 
 extern struct iommu_group *vfio_iommu_group_get(struct device *dev);
-- 
1.9.1
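The optional-callback pattern this patch relies on can be sketched in plain userspace C. This is a minimal illustration, not the kernel code: `struct group`, `struct pci_device`, `struct device_ops` and `group_create_device()` are hypothetical stand-ins for `struct vfio_group`, `struct vfio_pci_device`, `struct vfio_device_ops` and `vfio_group_create_device()`.

```c
#include <assert.h>
#include <stddef.h>

struct group { int id; };		/* stand-in for struct vfio_group */

struct device_ops {
	const char *name;
	/* optional: bus drivers that do not need the group leave this NULL */
	void (*set_group)(void *device_data, struct group *group);
};

struct pci_device {			/* stand-in for struct vfio_pci_device */
	struct group *cached_group;
};

/* Bus driver side: remember the group handle for later IOMMU operations. */
static void pci_set_group(void *device_data, struct group *group)
{
	struct pci_device *vdev = device_data;

	vdev->cached_group = group;
}

static const struct device_ops pci_ops = {
	.name		= "sketch-pci",
	.set_group	= pci_set_group,
};

/* Core side: invoke the hook only when the bus driver provides one. */
static void group_create_device(struct group *group,
				const struct device_ops *ops,
				void *device_data)
{
	assert(group != NULL && ops != NULL);

	if (ops->set_group)
		ops->set_group(device_data, group);
}
```

Calling `group_create_device(&g, &pci_ops, &vdev)` leaves `vdev.cached_group` pointing at `g`, while an ops table without the hook leaves the device untouched. Note that the sketch, like the patch, caches a bare pointer and takes no extra reference on the group.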
* [PATCH 08/10] vfio: introduce vfio_group_require_msi_mapping
  2016-01-26 13:12 [PATCH 00/10] KVM PCIe/MSI passthrough on ARM/ARM64 Eric Auger
                   ` (6 preceding siblings ...)
  2016-01-26 13:12 ` [PATCH 07/10] vfio: pci: cache the vfio_group in vfio_pci_device Eric Auger
@ 2016-01-26 13:12 ` Eric Auger
  2016-01-26 13:12 ` [PATCH 09/10] vfio-pci: create an iommu mapping for msi address Eric Auger
                   ` (3 subsequent siblings)
  11 siblings, 0 replies; 32+ messages in thread
From: Eric Auger @ 2016-01-26 13:12 UTC (permalink / raw)
  To: linux-arm-kernel
This new function reports whether MSI write transaction
addresses must be mapped.
Signed-off-by: Eric Auger <eric.auger@linaro.org>
---
 drivers/vfio/vfio.c             | 22 ++++++++++++++++++++++
 drivers/vfio/vfio_iommu_type1.c |  8 ++++++++
 include/linux/vfio.h            |  2 ++
 3 files changed, 32 insertions(+)
diff --git a/drivers/vfio/vfio.c b/drivers/vfio/vfio.c
index 2760d4c..f3df5a10 100644
--- a/drivers/vfio/vfio.c
+++ b/drivers/vfio/vfio.c
@@ -793,6 +793,28 @@ static int vfio_iommu_group_notifier(struct notifier_block *nb,
 	return NOTIFY_OK;
 }
 
+bool vfio_group_require_msi_mapping(struct vfio_group *group)
+{
+	struct vfio_container *container = group->container;
+	struct vfio_iommu_driver *driver;
+	bool ret;
+
+	down_read(&container->group_lock);
+
+	driver = container->iommu_driver;
+	if (!driver || !driver->ops || !driver->ops->require_msi_mapping) {
+		ret = false;
+		goto up;
+	}
+
+	ret = driver->ops->require_msi_mapping(container->iommu_data);
+
+up:
+	up_read(&container->group_lock);
+	return ret;
+}
+EXPORT_SYMBOL_GPL(vfio_group_require_msi_mapping);
+
 /**
  * VFIO driver API
  */
diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c
index a79e2a8..2f085d3 100644
--- a/drivers/vfio/vfio_iommu_type1.c
+++ b/drivers/vfio/vfio_iommu_type1.c
@@ -1179,6 +1179,13 @@ unlock:
 	return ret;
 }
 
+static bool vfio_iommu_type1_require_msi_mapping(void *iommu_data)
+{
+	struct vfio_iommu *iommu = iommu_data;
+
+	return vfio_domains_require_msi_mapping(iommu);
+}
+
 static void *vfio_iommu_type1_open(unsigned long arg)
 {
 	struct vfio_iommu *iommu;
@@ -1336,6 +1343,7 @@ static const struct vfio_iommu_driver_ops vfio_iommu_driver_ops_type1 = {
 		vfio_iommu_type1_alloc_map_reserved_iova,
 	.unmap_free_reserved_iova	=
 		vfio_iommu_type1_unmap_free_reserved_iova,
+	.require_msi_mapping		= vfio_iommu_type1_require_msi_mapping,
 };
 
 static int __init vfio_iommu_type1_init(void)
diff --git a/include/linux/vfio.h b/include/linux/vfio.h
index f7eaf30..3e6cbeb 100644
--- a/include/linux/vfio.h
+++ b/include/linux/vfio.h
@@ -58,6 +58,7 @@ extern void *vfio_del_group_dev(struct device *dev);
 extern struct vfio_device *vfio_device_get_from_dev(struct device *dev);
 extern void vfio_device_put(struct vfio_device *device);
 extern void *vfio_device_data(struct vfio_device *device);
+extern bool vfio_group_require_msi_mapping(struct vfio_group *group);
 
 /**
  * struct vfio_iommu_driver_ops - VFIO IOMMU driver callbacks
@@ -85,6 +86,7 @@ struct vfio_iommu_driver_ops {
 	int		(*unmap_free_reserved_iova)(void *iommu_data,
 						    struct iommu_group *group,
 						    dma_addr_t iova);
+	bool		(*require_msi_mapping)(void *iommu_data);
 };
 
 extern int vfio_register_iommu_driver(const struct vfio_iommu_driver_ops *ops);
-- 
1.9.1
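One detail worth keeping in mind with a bool-returning wrapper around an optional callback: in C, any non-zero value converts to true, so a negative errno stored in a bool reads back as "mapping required". The following userspace sketch shows the conversion and a wrapper that reports false when no callback is registered. The names (`struct iommu_ops`, `sketch_require_msi_mapping()`) are hypothetical, and the choice of false as the fallback is an assumption about the intended semantics.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* A negative errno converts to true when stored in a bool. */
static_assert((bool)-EINVAL == true, "negative errno converts to true");

struct iommu_ops {
	/* optional query: does this backend need MSI addresses mapped? */
	bool (*require_msi_mapping)(void *iommu_data);
};

/* Wrapper with an explicit boolean fallback instead of an errno. */
static bool sketch_require_msi_mapping(const struct iommu_ops *ops,
				       void *iommu_data)
{
	if (!ops || !ops->require_msi_mapping)
		return false;	/* assumption: no callback means no mapping */

	return ops->require_msi_mapping(iommu_data);
}

/* Example backend that always asks for the mapping. */
static bool sketch_always_required(void *iommu_data)
{
	(void)iommu_data;
	return true;
}
```

If callers need to distinguish "not required" from "cannot tell", an int return carrying 0, 1 or a negative errno would be the alternative design.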
* [PATCH 09/10] vfio-pci: create an iommu mapping for msi address
  2016-01-26 13:12 [PATCH 00/10] KVM PCIe/MSI passthrough on ARM/ARM64 Eric Auger
                   ` (7 preceding siblings ...)
  2016-01-26 13:12 ` [PATCH 08/10] vfio: introduce vfio_group_require_msi_mapping Eric Auger
@ 2016-01-26 13:12 ` Eric Auger
  2016-01-26 14:43   ` kbuild test robot
  2016-01-26 13:12 ` [PATCH 10/10] vfio: allow the user to register reserved iova range for MSI mapping Eric Auger
                   ` (2 subsequent siblings)
  11 siblings, 1 reply; 32+ messages in thread
From: Eric Auger @ 2016-01-26 13:12 UTC (permalink / raw)
  To: linux-arm-kernel
Some platforms require the MSI address programmed in the PCI
device to be an IOVA and not a host physical address.
This is typically the case for ARM and PowerPC, as opposed
to x86. This patch allocates an IOVA page and maps it onto
the physical page which contains the target MSI write transaction
address. In case an IOMMU binding already exists between those 2,
simply reprogram the device.
The binding is destroyed by the VFIO IOMMU backend.
Signed-off-by: Bharat Bhushan <Bharat.Bhushan@freescale.com>
Signed-off-by: Eric Auger <eric.auger@linaro.org>
---
 drivers/vfio/pci/vfio_pci_intrs.c | 73 ++++++++++++++++++++++++++++++++++++---
 1 file changed, 68 insertions(+), 5 deletions(-)
diff --git a/drivers/vfio/pci/vfio_pci_intrs.c b/drivers/vfio/pci/vfio_pci_intrs.c
index 3b3ba15..bac24c9 100644
--- a/drivers/vfio/pci/vfio_pci_intrs.c
+++ b/drivers/vfio/pci/vfio_pci_intrs.c
@@ -305,6 +305,57 @@ static int vfio_msi_enable(struct vfio_pci_device *vdev, int nvec, bool msix)
 	return 0;
 }
 
+/**
+ * vfio_set_mapped_msi_addr: overwrites the msi physical address with an iova
+ *
+ * @vdev: vfio pci device handle
+ * @irq: irq linux number
+ * returns 0 upon success, < 0 on failure
+ */
+static int vfio_set_mapped_msi_addr(struct vfio_pci_device *vdev, int irq)
+{
+	phys_addr_t msi_addr;
+	dma_addr_t msi_iova;
+	struct vfio_group *group = vdev->vfio_group;
+	struct msi_msg msg;
+	int ret;
+
+	get_cached_msi_msg(irq, &msg);
+	msi_addr = (phys_addr_t)(msg.address_hi) << 32 |
+			(phys_addr_t)(msg.address_lo);
+
+	ret = vfio_group_alloc_map_reserved_iova(group, msi_addr,
+						 IOMMU_WRITE, &msi_iova);
+	if (ret)
+		goto out;
+
+	/* Re-program the msi-address with the iova */
+	msg.address_hi = (u32)(msi_iova >> 32);
+	msg.address_lo = (u32)(msi_iova & 0xffffffff);
+	pci_write_msi_msg(irq, &msg);
+
+out:
+	return ret;
+}
+
+/**
+ * vfio_unset_mapped_msi_addr: decrement the ref counter of the msi iova page
+ * associated to the linux irq (in case it is null unmaps and frees resources)
+ *
+ * @vdev: vfio pci device handle
+ * @irq: irq linux number
+ */
+static void vfio_unset_mapped_msi_addr(struct vfio_pci_device *vdev, int irq)
+{
+	dma_addr_t msi_iova;
+	struct vfio_group *group = vdev->vfio_group;
+	struct msi_msg msg;
+
+	get_cached_msi_msg(irq, &msg);
+	msi_iova = (u64)(msg.address_hi) << 32 | (u64)(msg.address_lo);
+	vfio_group_unmap_free_reserved_iova(group, msi_iova);
+}
+
 static int vfio_msi_set_vector_signal(struct vfio_pci_device *vdev,
 				      int vector, int fd, bool msix)
 {
@@ -318,6 +369,7 @@ static int vfio_msi_set_vector_signal(struct vfio_pci_device *vdev,
 		return -EINVAL;
 
 	if (vdev->ctx[vector].trigger) {
+		vfio_unset_mapped_msi_addr(vdev, irq);
 		free_irq(irq, vdev->ctx[vector].trigger);
 		irq_bypass_unregister_producer(&vdev->ctx[vector].producer);
 		kfree(vdev->ctx[vector].name);
@@ -355,11 +407,8 @@ static int vfio_msi_set_vector_signal(struct vfio_pci_device *vdev,
 
 	ret = request_irq(irq, vfio_msihandler, 0,
 			  vdev->ctx[vector].name, trigger);
-	if (ret) {
-		kfree(vdev->ctx[vector].name);
-		eventfd_ctx_put(trigger);
-		return ret;
-	}
+	if (ret)
+		goto error_free;
 
 	vdev->ctx[vector].producer.token = trigger;
 	vdev->ctx[vector].producer.irq = irq;
@@ -369,9 +418,23 @@ static int vfio_msi_set_vector_signal(struct vfio_pci_device *vdev,
 		"irq bypass producer (token %p) registration fails: %d\n",
 		vdev->ctx[vector].producer.token, ret);
 
+	if (vfio_group_require_msi_mapping(vdev->vfio_group)) {
+		ret = vfio_set_mapped_msi_addr(vdev, irq);
+		if (ret)
+			goto error_free_irq;
+	}
+
 	vdev->ctx[vector].trigger = trigger;
 
 	return 0;
+
+error_free_irq:
+	free_irq(irq, trigger);
+error_free:
+	kfree(vdev->ctx[vector].name);
+	eventfd_ctx_put(trigger);
+	return ret;
+
 }
 
 static int vfio_msi_set_block(struct vfio_pci_device *vdev, unsigned start,
-- 
1.9.1
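The commit message above describes reusing an existing binding when the MSI doorbell page is already mapped, and the unset helper describes dropping a reference on teardown. A minimal userspace sketch of that reference-counted lookup follows. It is an illustration under several assumptions: a fixed 4K page size, a tiny fixed-size slot array instead of an IOVA allocator and rb-tree, hypothetical names throughout, and a returned IOVA that keeps the page offset of the physical address.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define SKETCH_PAGE_SHIFT 12	/* assumption: 4K IOMMU pages */
#define SKETCH_NR_SLOTS   4	/* assumption: tiny fixed-size window */

struct binding {
	uint64_t phys_page;	/* page-aligned physical address */
	uint64_t iova_page;	/* page-aligned IOVA it is bound to */
	unsigned int refcount;	/* 0 means the slot is free */
};

struct reserved_window {
	uint64_t base;		/* page-aligned start of the reserved IOVA window */
	struct binding slots[SKETCH_NR_SLOTS];
};

/* Returns 0 and the IOVA for phys, or -1 when the window is exhausted. */
static int sketch_alloc_map(struct reserved_window *w, uint64_t phys,
			    uint64_t *iova)
{
	const uint64_t offset_mask = ((uint64_t)1 << SKETCH_PAGE_SHIFT) - 1;
	const uint64_t page = phys & ~offset_mask;
	struct binding *free_slot = NULL;
	size_t i;

	assert(w != NULL && iova != NULL);

	for (i = 0; i < SKETCH_NR_SLOTS; i++) {
		struct binding *b = &w->slots[i];

		if (b->refcount && b->phys_page == page) {
			/* binding already exists: take a reference, reuse it */
			b->refcount++;
			*iova = b->iova_page | (phys & offset_mask);
			return 0;
		}
		if (!b->refcount && !free_slot)
			free_slot = b;
	}
	if (!free_slot)
		return -1;

	free_slot->phys_page = page;
	free_slot->iova_page = w->base +
		((uint64_t)(free_slot - w->slots) << SKETCH_PAGE_SHIFT);
	free_slot->refcount = 1;
	*iova = free_slot->iova_page | (phys & offset_mask);
	return 0;
}

/* Drops one reference; the slot becomes reusable when the count hits 0. */
static void sketch_unmap_free(struct reserved_window *w, uint64_t iova)
{
	const uint64_t offset_mask = ((uint64_t)1 << SKETCH_PAGE_SHIFT) - 1;
	const uint64_t page = iova & ~offset_mask;
	size_t i;

	assert(w != NULL);

	for (i = 0; i < SKETCH_NR_SLOTS; i++) {
		struct binding *b = &w->slots[i];

		if (b->refcount && b->iova_page == page) {
			b->refcount--;
			return;
		}
	}
}
```

Two vectors that target the same doorbell page end up sharing one slot with a reference count of 2, which is why the teardown path only needs the IOVA read back from the cached MSI message.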
* [PATCH 10/10] vfio: allow the user to register reserved iova range for MSI mapping
  2016-01-26 13:12 [PATCH 00/10] KVM PCIe/MSI passthrough on ARM/ARM64 Eric Auger
                   ` (8 preceding siblings ...)
  2016-01-26 13:12 ` [PATCH 09/10] vfio-pci: create an iommu mapping for msi address Eric Auger
@ 2016-01-26 13:12 ` Eric Auger
  2016-01-26 16:42   ` kbuild test robot
  2016-01-26 18:32   ` kbuild test robot
  2016-01-26 17:25 ` [PATCH 00/10] KVM PCIe/MSI passthrough on ARM/ARM64 Pavel Fedin
  2016-01-28 21:51 ` Alex Williamson
  11 siblings, 2 replies; 32+ messages in thread
From: Eric Auger @ 2016-01-26 13:12 UTC (permalink / raw)
  To: linux-arm-kernel
The user is allowed to register a reserved IOVA range by using the
DMA MAP API and setting the new flag: VFIO_DMA_MAP_FLAG_MSI_RESERVED_IOVA.
The user provides the base address and the size. This region is stored in the
vfio_dma rb tree. At that point the IOVA range is not yet mapped to any target
address. The host kernel will use those IOVAs when needed, typically
when the VFIO-PCI device allocates its MSIs.
This patch also handles the destruction of the reserved binding RB-tree and
each domain's iova_domain.
Signed-off-by: Eric Auger <eric.auger@linaro.org>
Signed-off-by: Bharat Bhushan <Bharat.Bhushan@freescale.com>
---
- Currently the user is not yet informed about the number of pages to
  provide
RFC v1 -> RFC v2:
- takes into account Alex comments, based on
  [RFC PATCH 1/6] vfio: Add interface for add/del reserved iova region:
- use the existing dma map/unmap ioctl interface with a flag to register
  a reserved IOVA range. A single reserved iova region is allowed.
---
 drivers/vfio/vfio_iommu_type1.c | 98 ++++++++++++++++++++++++++++++++++++++++-
 include/uapi/linux/vfio.h       |  9 ++++
 2 files changed, 106 insertions(+), 1 deletion(-)
diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c
index 2f085d3..37c7d78 100644
--- a/drivers/vfio/vfio_iommu_type1.c
+++ b/drivers/vfio/vfio_iommu_type1.c
@@ -538,10 +538,40 @@ static void vfio_unmap_unpin(struct vfio_iommu *iommu, struct vfio_dma *dma)
 	vfio_lock_acct(-unlocked);
 }
 
+/* vfio_unmap_reserved: unmap and free all reserved binding nodes
+ * for all domains and destroy their iova_domain
+ *
+ * @iommu: iommu handle
+ */
+static void vfio_unmap_reserved(struct vfio_iommu *iommu)
+{
+	struct vfio_domain *d;
+
+	list_for_each_entry(d, &iommu->domain_list, next) {
+		struct rb_node *node;
+
+		while ((node = rb_first(&d->reserved_binding_list))) {
+			struct vfio_reserved_binding *b =
+				rb_entry(node,
+					 struct vfio_reserved_binding, node);
+
+			while (!kref_put(&b->kref,
+				vfio_reserved_binding_release)) {
+			}
+		}
+		d->reserved_binding_list = RB_ROOT;
+
+		put_iova_domain(d->reserved_iova_domain);
+		kfree(d->reserved_iova_domain);
+	}
+}
+
 static void vfio_remove_dma(struct vfio_iommu *iommu, struct vfio_dma *dma)
 {
 	if (likely(dma->type != VFIO_IOVA_RESERVED))
 		vfio_unmap_unpin(iommu, dma);
+	else
+		vfio_unmap_reserved(iommu);
 	vfio_unlink_dma(iommu, dma);
 	kfree(dma);
 }
@@ -785,6 +815,68 @@ static int vfio_dma_do_map(struct vfio_iommu *iommu,
 	return ret;
 }
 
+static int vfio_register_reserved_iova_range(struct vfio_iommu *iommu,
+			   struct vfio_iommu_type1_dma_map *map)
+{
+	dma_addr_t iova = map->iova;
+	size_t size = map->size;
+	uint64_t mask;
+	struct vfio_dma *dma;
+	int ret = 0;
+	struct vfio_domain *d;
+	unsigned long order;
+
+	/* Verify that none of our __u64 fields overflow */
+	if (map->size != size || map->iova != iova)
+		return -EINVAL;
+
+	order =  __ffs(vfio_pgsize_bitmap(iommu));
+	mask = ((uint64_t)1 << order) - 1;
+
+	WARN_ON(mask & PAGE_MASK);
+
+	/* we currently only support MSI_RESERVED_IOVA */
+	if (!(map->flags & VFIO_DMA_MAP_FLAG_MSI_RESERVED_IOVA))
+		return -EINVAL;
+
+	if (!size || (size | iova) & mask)
+		return -EINVAL;
+
+	/* Don't allow IOVA address wrap */
+	if (iova + size - 1 < iova)
+		return -EINVAL;
+
+	mutex_lock(&iommu->lock);
+
+	/* check if the iova domain has not been instantiated already*/
+	d = list_first_entry(&iommu->domain_list,
+				  struct vfio_domain, next);
+
+	if (d->reserved_iova_domain || vfio_find_dma(iommu, iova, size)) {
+		ret =  -EEXIST;
+		goto out;
+	}
+
+	dma = kzalloc(sizeof(*dma), GFP_KERNEL);
+	if (!dma) {
+		ret = -ENOMEM;
+		goto out;
+	}
+
+	dma->iova = iova;
+	dma->size = size;
+	dma->type = VFIO_IOVA_RESERVED;
+
+	vfio_link_dma(iommu, dma);
+
+	list_for_each_entry(d, &iommu->domain_list, next)
+		alloc_reserved_iova_domain(d, iova, size, order);
+
+out:
+	mutex_unlock(&iommu->lock);
+	return ret;
+}
+
 static int vfio_bus_type(struct device *dev, void *data)
 {
 	struct bus_type **bus = data;
@@ -1297,7 +1389,8 @@ static long vfio_iommu_type1_ioctl(void *iommu_data,
 	} else if (cmd == VFIO_IOMMU_MAP_DMA) {
 		struct vfio_iommu_type1_dma_map map;
 		uint32_t mask = VFIO_DMA_MAP_FLAG_READ |
-				VFIO_DMA_MAP_FLAG_WRITE;
+				VFIO_DMA_MAP_FLAG_WRITE |
+				VFIO_DMA_MAP_FLAG_MSI_RESERVED_IOVA;
 
 		minsz = offsetofend(struct vfio_iommu_type1_dma_map, size);
 
@@ -1307,6 +1400,9 @@ static long vfio_iommu_type1_ioctl(void *iommu_data,
 		if (map.argsz < minsz || map.flags & ~mask)
 			return -EINVAL;
 
+		if (map.flags & VFIO_DMA_MAP_FLAG_MSI_RESERVED_IOVA)
+			return vfio_register_reserved_iova_range(iommu, &map);
+
 		return vfio_dma_do_map(iommu, &map);
 
 	} else if (cmd == VFIO_IOMMU_UNMAP_DMA) {
diff --git a/include/uapi/linux/vfio.h b/include/uapi/linux/vfio.h
index 43e183b..982e326 100644
--- a/include/uapi/linux/vfio.h
+++ b/include/uapi/linux/vfio.h
@@ -411,12 +411,21 @@ struct vfio_iommu_type1_info {
  *
  * Map process virtual addresses to IO virtual addresses using the
  * provided struct vfio_dma_map. Caller sets argsz. READ &/ WRITE required.
+ *
+ * In case MSI_RESERVED_IOVA is set, the API only aims at registering an IOVA
+ * region which will be used on some platforms to map the host MSI frame.
+ * In that specific case, vaddr and prot are ignored. The requirement for
+ * provisioning such IOVA range can be checked by calling VFIO_IOMMU_GET_INFO
+ * with the VFIO_IOMMU_INFO_REQUIRE_MSI_MAP attribute. A single
+ * MSI_RESERVED_IOVA region can be registered.
  */
 struct vfio_iommu_type1_dma_map {
 	__u32	argsz;
 	__u32	flags;
 #define VFIO_DMA_MAP_FLAG_READ (1 << 0)		/* readable from device */
 #define VFIO_DMA_MAP_FLAG_WRITE (1 << 1)	/* writable from device */
+/* reserved iova for MSI vectors*/
+#define VFIO_DMA_MAP_FLAG_MSI_RESERVED_IOVA (1 << 2)
 	__u64	vaddr;				/* Process virtual address */
 	__u64	iova;				/* IO virtual address */
 	__u64	size;				/* Size of mapping (bytes) */
-- 
1.9.1
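The argument checks performed when registering the reserved range (non-zero size, alignment on the smallest supported IOMMU page size, no address wrap) can be exercised in userspace with the sketch below. The function name is hypothetical, and `__builtin_ctzll()` (a GCC/Clang builtin) stands in for the kernel's `__ffs()`.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

/*
 * Returns 0 when [iova, iova + size) is acceptable, -EINVAL otherwise.
 * pgsize_bitmap has one bit set per supported IOMMU page size.
 */
static int sketch_check_reserved_range(uint64_t iova, uint64_t size,
				       uint64_t pgsize_bitmap)
{
	uint64_t mask;

	assert(pgsize_bitmap != 0);	/* ctz of 0 is undefined */

	/* lowest set bit gives the smallest supported page size */
	mask = ((uint64_t)1 << __builtin_ctzll(pgsize_bitmap)) - 1;

	/* size must be non-zero, and both values page aligned */
	if (!size || ((size | iova) & mask))
		return -EINVAL;

	/* reject ranges that wrap around the end of the address space */
	if (iova + size - 1 < iova)
		return -EINVAL;

	return 0;
}
```

A range ending exactly at the top of the 64-bit space is accepted, because the check compares the last byte address (`iova + size - 1`) rather than the exclusive end.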
* [PATCH 09/10] vfio-pci: create an iommu mapping for msi address
  2016-01-26 13:12 ` [PATCH 09/10] vfio-pci: create an iommu mapping for msi address Eric Auger
@ 2016-01-26 14:43   ` kbuild test robot
  2016-01-26 15:14     ` Eric Auger
  0 siblings, 1 reply; 32+ messages in thread
From: kbuild test robot @ 2016-01-26 14:43 UTC (permalink / raw)
  To: linux-arm-kernel
Hi Eric,
[auto build test WARNING on v4.5-rc1]
[also build test WARNING on next-20160125]
[cannot apply to iommu/next]
[if your patch is applied to the wrong git tree, please drop us a note to help improving the system]
url:    https://github.com/0day-ci/linux/commits/Eric-Auger/KVM-PCIe-MSI-passthrough-on-ARM-ARM64/20160126-211921
config: i386-allmodconfig (attached as .config)
reproduce:
        # save the attached .config to linux build tree
        make ARCH=i386 
All warnings (new ones prefixed by >>):
   drivers/vfio/pci/vfio_pci_intrs.c: In function 'vfio_set_mapped_msi_addr':
>> drivers/vfio/pci/vfio_pci_intrs.c:324:43: warning: left shift count >= width of type [-Wshift-count-overflow]
     msi_addr = (phys_addr_t)(msg.address_hi) << 32 |
                                              ^
>> drivers/vfio/pci/vfio_pci_intrs.c:333:34: warning: right shift count >= width of type [-Wshift-count-overflow]
     msg.address_hi = (u32)(msi_iova >> 32);
                                     ^
vim +324 drivers/vfio/pci/vfio_pci_intrs.c
   318		dma_addr_t msi_iova;
   319		struct vfio_group *group = vdev->vfio_group;
   320		struct msi_msg msg;
   321		int ret;
   322	
   323		get_cached_msi_msg(irq, &msg);
 > 324		msi_addr = (phys_addr_t)(msg.address_hi) << 32 |
   325				(phys_addr_t)(msg.address_lo);
   326	
   327		ret = vfio_group_alloc_map_reserved_iova(group, msi_addr,
   328							 IOMMU_WRITE, &msi_iova);
   329		if (ret)
   330			goto out;
   331	
   332		/* Re-program the msi-address with the iova */
 > 333		msg.address_hi = (u32)(msi_iova >> 32);
   334		msg.address_lo = (u32)(msi_iova & 0xffffffff);
   335		pci_write_msi_msg(irq, &msg);
   336	
---
0-DAY kernel test infrastructure                Open Source Technology Center
https://lists.01.org/pipermail/kbuild-all                   Intel Corporation
* [PATCH 09/10] vfio-pci: create an iommu mapping for msi address
  2016-01-26 14:43   ` kbuild test robot
@ 2016-01-26 15:14     ` Eric Auger
  0 siblings, 0 replies; 32+ messages in thread
From: Eric Auger @ 2016-01-26 15:14 UTC (permalink / raw)
  To: linux-arm-kernel
Hi,
On 01/26/2016 03:43 PM, kbuild test robot wrote:
> Hi Eric,
> 
> [auto build test WARNING on v4.5-rc1]
> [also build test WARNING on next-20160125]
> [cannot apply to iommu/next]
> [if your patch is applied to the wrong git tree, please drop us a note to help improving the system]
> 
> url:    https://github.com/0day-ci/linux/commits/Eric-Auger/KVM-PCIe-MSI-passthrough-on-ARM-ARM64/20160126-211921
> config: i386-allmodconfig (attached as .config)
> reproduce:
>         # save the attached .config to linux build tree
>         make ARCH=i386 
> 
> All warnings (new ones prefixed by >>):
> 
>    drivers/vfio/pci/vfio_pci_intrs.c: In function 'vfio_set_mapped_msi_addr':
>>> drivers/vfio/pci/vfio_pci_intrs.c:324:43: warning: left shift count >= width of type [-Wshift-count-overflow]
>      msi_addr = (phys_addr_t)(msg.address_hi) << 32 |
>                                               ^
>>> drivers/vfio/pci/vfio_pci_intrs.c:333:34: warning: right shift count >= width of type [-Wshift-count-overflow]
>      msg.address_hi = (u32)(msi_iova >> 32);
I definitely need to revisit that code. I understand Alex's
comment better now :-(
Thanks
Eric
>                                      ^
> 
> vim +324 drivers/vfio/pci/vfio_pci_intrs.c
> 
>    318		dma_addr_t msi_iova;
>    319		struct vfio_group *group = vdev->vfio_group;
>    320		struct msi_msg msg;
>    321		int ret;
>    322	
>    323		get_cached_msi_msg(irq, &msg);
>  > 324		msi_addr = (phys_addr_t)(msg.address_hi) << 32 |
>    325				(phys_addr_t)(msg.address_lo);
>    326	
>    327		ret = vfio_group_alloc_map_reserved_iova(group, msi_addr,
>    328							 IOMMU_WRITE, &msi_iova);
>    329		if (ret)
>    330			goto out;
>    331	
>    332		/* Re-program the msi-address with the iova */
>  > 333		msg.address_hi = (u32)(msi_iova >> 32);
>    334		msg.address_lo = (u32)(msi_iova & 0xffffffff);
>    335		pci_write_msi_msg(irq, &msg);
>    336	
> 
> ---
> 0-DAY kernel test infrastructure                Open Source Technology Center
> https://lists.01.org/pipermail/kbuild-all                   Intel Corporation
> 
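The two warnings come from shifting a value by 32 when phys_addr_t or dma_addr_t is only 32 bits wide, as on this i386 config. A userspace sketch of one way to avoid them follows: widen to 64 bits before joining the halves, and split with two 16-bit shifts, which is the shape of the kernel's upper_32_bits()/lower_32_bits() helpers. The macro and function names below are local to the sketch, and this is not necessarily how the series will be reworked.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Shifting twice by 16 stays well defined even when n is a 32-bit type,
 * whereas a single shift by 32 does not.
 */
#define SKETCH_UPPER_32_BITS(n) ((uint32_t)(((n) >> 16) >> 16))
#define SKETCH_LOWER_32_BITS(n) ((uint32_t)((n) & 0xffffffff))

/* Join the two halves of an MSI address in a type that is always 64 bits. */
static uint64_t sketch_msi_addr_join(uint32_t hi, uint32_t lo)
{
	return ((uint64_t)hi << 32) | lo;
}

/* Split a 64-bit address back into the two message fields. */
static void sketch_msi_addr_split(uint64_t addr, uint32_t *hi, uint32_t *lo)
{
	assert(hi != NULL && lo != NULL);

	*hi = SKETCH_UPPER_32_BITS(addr);
	*lo = SKETCH_LOWER_32_BITS(addr);
}
```

With a 32-bit address type the upper half simply evaluates to 0, so the same source builds cleanly for both widths.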
* [PATCH 06/10] vfio: introduce vfio_group_alloc_map_/unmap_free_reserved_iova
  2016-01-26 13:12 ` [PATCH 06/10] vfio: introduce vfio_group_alloc_map_/unmap_free_reserved_iova Eric Auger
@ 2016-01-26 16:17   ` kbuild test robot
  2016-01-26 16:37     ` Eric Auger
  0 siblings, 1 reply; 32+ messages in thread
From: kbuild test robot @ 2016-01-26 16:17 UTC (permalink / raw)
  To: linux-arm-kernel
Hi Eric,
[auto build test ERROR on v4.5-rc1]
[also build test ERROR on next-20160125]
[cannot apply to iommu/next]
[if your patch is applied to the wrong git tree, please drop us a note to help improving the system]
url:    https://github.com/0day-ci/linux/commits/Eric-Auger/KVM-PCIe-MSI-passthrough-on-ARM-ARM64/20160126-211921
config: x86_64-randconfig-s3-01262306 (attached as .config)
reproduce:
        # save the attached .config to linux build tree
        make ARCH=x86_64 
All errors (new ones prefixed by >>):
>> ERROR: "alloc_iova" [drivers/vfio/vfio_iommu_type1.ko] undefined!
>> ERROR: "free_iova" [drivers/vfio/vfio_iommu_type1.ko] undefined!
---
0-DAY kernel test infrastructure                Open Source Technology Center
https://lists.01.org/pipermail/kbuild-all                   Intel Corporation
* [PATCH 06/10] vfio: introduce vfio_group_alloc_map_/unmap_free_reserved_iova
  2016-01-26 16:17   ` kbuild test robot
@ 2016-01-26 16:37     ` Eric Auger
  0 siblings, 0 replies; 32+ messages in thread
From: Eric Auger @ 2016-01-26 16:37 UTC (permalink / raw)
  To: linux-arm-kernel
Hi,
On 01/26/2016 05:17 PM, kbuild test robot wrote:
> Hi Eric,
> 
> [auto build test ERROR on v4.5-rc1]
> [also build test ERROR on next-20160125]
> [cannot apply to iommu/next]
> [if your patch is applied to the wrong git tree, please drop us a note to help improving the system]
> 
> url:    https://github.com/0day-ci/linux/commits/Eric-Auger/KVM-PCIe-MSI-passthrough-on-ARM-ARM64/20160126-211921
> config: x86_64-randconfig-s3-01262306 (attached as .config)
> reproduce:
>         # save the attached .config to linux build tree
>         make ARCH=x86_64 
> 
> All errors (new ones prefixed by >>):
> 
>>> ERROR: "alloc_iova" [drivers/vfio/vfio_iommu_type1.ko] undefined!
>>> ERROR: "free_iova" [drivers/vfio/vfio_iommu_type1.ko] undefined!
I will protect the code with CONFIG_IOMMU_IOVA if the usage of those
functions is acknowledged.
Thanks
Eric
> 
> ---
> 0-DAY kernel test infrastructure                Open Source Technology Center
> https://lists.01.org/pipermail/kbuild-all                   Intel Corporation
> 
* [PATCH 10/10] vfio: allow the user to register reserved iova range for MSI mapping
  2016-01-26 13:12 ` [PATCH 10/10] vfio: allow the user to register reserved iova range for MSI mapping Eric Auger
@ 2016-01-26 16:42   ` kbuild test robot
  2016-01-26 18:32   ` kbuild test robot
  1 sibling, 0 replies; 32+ messages in thread
From: kbuild test robot @ 2016-01-26 16:42 UTC (permalink / raw)
  To: linux-arm-kernel
Hi Eric,
[auto build test ERROR on v4.5-rc1]
[also build test ERROR on next-20160125]
[cannot apply to iommu/next]
[if your patch is applied to the wrong git tree, please drop us a note to help improving the system]
url:    https://github.com/0day-ci/linux/commits/Eric-Auger/KVM-PCIe-MSI-passthrough-on-ARM-ARM64/20160126-211921
config: x86_64-randconfig-s3-01262306 (attached as .config)
reproduce:
        # save the attached .config to linux build tree
        make ARCH=x86_64 
All errors (new ones prefixed by >>):
   ERROR: "alloc_iova" [drivers/vfio/vfio_iommu_type1.ko] undefined!
>> ERROR: "init_iova_domain" [drivers/vfio/vfio_iommu_type1.ko] undefined!
>> ERROR: "put_iova_domain" [drivers/vfio/vfio_iommu_type1.ko] undefined!
   ERROR: "free_iova" [drivers/vfio/vfio_iommu_type1.ko] undefined!
---
0-DAY kernel test infrastructure                Open Source Technology Center
https://lists.01.org/pipermail/kbuild-all                   Intel Corporation
* [PATCH 00/10] KVM PCIe/MSI passthrough on ARM/ARM64
  2016-01-26 13:12 [PATCH 00/10] KVM PCIe/MSI passthrough on ARM/ARM64 Eric Auger
                   ` (9 preceding siblings ...)
  2016-01-26 13:12 ` [PATCH 10/10] vfio: allow the user to register reserved iova range for MSI mapping Eric Auger
@ 2016-01-26 17:25 ` Pavel Fedin
  2016-01-27  8:52   ` Eric Auger
  2016-01-28 21:51 ` Alex Williamson
  11 siblings, 1 reply; 32+ messages in thread
From: Pavel Fedin @ 2016-01-26 17:25 UTC (permalink / raw)
  To: linux-arm-kernel
 Hello!
 I'd like just to clarify some things for myself and better wrap my head around it...
> On x86 all accesses to the 1MB PA region [FEE0_0000h - FEF0_000h] are directed
> as interrupt messages: accesses to this special PA window directly target the
> APIC configuration space and not DRAM, meaning the downstream IOMMU is bypassed.
 So, this is effectively the same as always having hardwired 1:1 mappings on all IOMMUs, isn't it?
 If so, then can't we just do the same by forcing a similar 1:1 mapping? This is what I tried to do in my patchset. All of
you are talking about a situation which arises when we are emulating a different machine with a different physical address layout. And
e.g. if our host has MSI at 0xABADCAFE, our target could have valid RAM at the same location, and we need to handle it somehow,
therefore we have to move our MSI window out of the target's RAM. But how does this work on a PC then? What if our host is a PC, and we
want to emulate some ARM board, which has RAM at FE00 0000? Or does it mean that the PC architecture is flawed and can reliably handle
PCI passthrough only for itself?
Kind regards,
Pavel Fedin
Senior Engineer
Samsung Electronics Research center Russia
* [PATCH 10/10] vfio: allow the user to register reserved iova range for MSI mapping
  2016-01-26 13:12 ` [PATCH 10/10] vfio: allow the user to register reserved iova range for MSI mapping Eric Auger
  2016-01-26 16:42   ` kbuild test robot
@ 2016-01-26 18:32   ` kbuild test robot
  1 sibling, 0 replies; 32+ messages in thread
From: kbuild test robot @ 2016-01-26 18:32 UTC (permalink / raw)
  To: linux-arm-kernel
Hi Eric,
[auto build test ERROR on v4.5-rc1]
[also build test ERROR on next-20160125]
[cannot apply to iommu/next]
[if your patch is applied to the wrong git tree, please drop us a note to help improving the system]
url:    https://github.com/0day-ci/linux/commits/Eric-Auger/KVM-PCIe-MSI-passthrough-on-ARM-ARM64/20160126-211921
config: x86_64-randconfig-s3-01270003 (attached as .config)
reproduce:
        # save the attached .config to linux build tree
        make ARCH=x86_64 
All errors (new ones prefixed by >>):
   drivers/built-in.o: In function `vfio_reserved_binding_release':
   vfio_iommu_type1.c:(.text+0x1d537b): undefined reference to `free_iova'
   drivers/built-in.o: In function `vfio_iommu_type1_alloc_map_reserved_iova':
   vfio_iommu_type1.c:(.text+0x1d5b32): undefined reference to `alloc_iova'
   vfio_iommu_type1.c:(.text+0x1d5d67): undefined reference to `free_iova'
   drivers/built-in.o: In function `vfio_remove_dma':
>> vfio_iommu_type1.c:(.text+0x1d6c80): undefined reference to `put_iova_domain'
   drivers/built-in.o: In function `vfio_iommu_type1_ioctl':
>> vfio_iommu_type1.c:(.text+0x1d81f8): undefined reference to `init_iova_domain'
---
0-DAY kernel test infrastructure                Open Source Technology Center
https://lists.01.org/pipermail/kbuild-all                   Intel Corporation
* [PATCH 00/10] KVM PCIe/MSI passthrough on ARM/ARM64
  2016-01-26 17:25 ` [PATCH 00/10] KVM PCIe/MSI passthrough on ARM/ARM64 Pavel Fedin
@ 2016-01-27  8:52   ` Eric Auger
  2016-01-28  7:13     ` Pavel Fedin
  0 siblings, 1 reply; 32+ messages in thread
From: Eric Auger @ 2016-01-27  8:52 UTC (permalink / raw)
  To: linux-arm-kernel
Hi Pavel,
On 01/26/2016 06:25 PM, Pavel Fedin wrote:
>  Hello!
>  I'd like just to clarify some things for myself and better wrap my head around it...
> 
>> On x86 all accesses to the 1MB PA region [FEE0_0000h - FEF0_000h] are directed
>> as interrupt messages: accesses to this special PA window directly target the
>> APIC configuration space and not DRAM, meaning the downstream IOMMU is bypassed.
> 
>  So, this is effectively the same as always having hardwired 1:1 mappings on all IOMMUs, isn't it?
>  If so, then can't we just do the same by forcing a similar 1:1 mapping? This is what I tried to do in my patchset. All of
> you are talking about a situation which arises when we are emulating a different machine with a different physical address layout. And
> e.g. if our host has MSI at 0xABADCAFE, our target could have valid RAM at the same location, and we need to handle it somehow,
> therefore we have to move our MSI window out of the target's RAM. But how does this work on a PC then? What if our host is a PC, and we
> want to emulate some ARM board, which has RAM at FE00 0000? Or does it mean that the PC architecture is flawed and can reliably handle
> PCI passthrough only for itself?
Alex answered this, I think:
"
x86 isn't problem-free in this space.  An x86 VM is going to know that
the 0xfee00000 address range is special, it won't be backed by RAM and
won't be a DMA target, thus we'll never attempt to map it for an iova
address.  However, if we run a non-x86 VM or a userspace driver, it
doesn't necessarily know that there's anything special about that range
of iovas.  I intend to resolve this with an extension to the iommu info
ioctl that describes the available iova space for the iommu.  The
interrupt region would simply be excluded.
"
I am not sure I've addressed this requirement yet but it seems more
future proof to have an IOMMU mapping for those addresses.
For the ARM use case I think Marc gave guidance:
"
We want userspace to be in control of the memory map, and it
is the kernel's job to tell us whether or not this matches the HW
capabilities or not. A fixed mapping may completely clash with the
memory map I want (think emulating HW x on platform y), and there is no
reason why we should have the restrictions x86 has.
"
That's the rationale behind respinning the series that way.
While waiting for other comments and discussion, I am going to address the
iova and dma_addr_t compilation issues reported by kbuild. Apologies for
those.
Best Regards
Eric
> 
> Kind regards,
> Pavel Fedin
> Senior Engineer
> Samsung Electronics Research center Russia
> 
> 
^ permalink raw reply	[flat|nested] 32+ messages in thread
* [PATCH 00/10] KVM PCIe/MSI passthrough on ARM/ARM64
  2016-01-27  8:52   ` Eric Auger
@ 2016-01-28  7:13     ` Pavel Fedin
  2016-01-28  9:50       ` Eric Auger
  0 siblings, 1 reply; 32+ messages in thread
From: Pavel Fedin @ 2016-01-28  7:13 UTC (permalink / raw)
  To: linux-arm-kernel
 Hello!
> x86 isn't problem-free in this space.  An x86 VM is going to know that
> the 0xfee00000 address range is special, it won't be backed by RAM and
> won't be a DMA target, thus we'll never attempt to map it for an iova
> address.  However, if we run a non-x86 VM or a userspace driver, it
> doesn't necessarily know that there's anything special about that range
> of iovas.  I intend to resolve this with an extension to the iommu info
> ioctl that describes the available iova space for the iommu.  The
> interrupt region would simply be excluded.
 I see now, but I still don't understand how it would work. How can we tell the guest OS that we cannot do DMA to this particular
area? Just exclude it from RAM altogether? But this means we would have to modify the machine model...
 I know that this is a slightly different story from what we are implementing now. Just curious.
Kind regards,
Pavel Fedin
Senior Engineer
Samsung Electronics Research center Russia
^ permalink raw reply	[flat|nested] 32+ messages in thread
* [PATCH 00/10] KVM PCIe/MSI passthrough on ARM/ARM64
  2016-01-28  7:13     ` Pavel Fedin
@ 2016-01-28  9:50       ` Eric Auger
  0 siblings, 0 replies; 32+ messages in thread
From: Eric Auger @ 2016-01-28  9:50 UTC (permalink / raw)
  To: linux-arm-kernel
Hi Pavel,
On 01/28/2016 08:13 AM, Pavel Fedin wrote:
>  Hello!
> 
>> x86 isn't problem-free in this space.  An x86 VM is going to know that
>> the 0xfee00000 address range is special, it won't be backed by RAM and
>> won't be a DMA target, thus we'll never attempt to map it for an iova
>> address.  However, if we run a non-x86 VM or a userspace driver, it
>> doesn't necessarily know that there's anything special about that range
>> of iovas.  I intend to resolve this with an extension to the iommu info
>> ioctl that describes the available iova space for the iommu.  The
>> interrupt region would simply be excluded.
> 
>  I see now, but i still don't understand how it would work. How can we tell the guest OS that we cannot do DMA to this particular
> area? Just exclude it from RAM at all? But this means we would have to modify machine's model...
>  I know that this is a bit different story from what we are implementing now. Just curious.
Well, in QEMU mach-virt we have a static guest PA memory map. Maybe in
some other virt machines this is different and it is possible to take
into account the fact that an IOVA range cannot be used?
Regards
Eric
> 
> Kind regards,
> Pavel Fedin
> Senior Engineer
> Samsung Electronics Research center Russia
> 
> 
^ permalink raw reply	[flat|nested] 32+ messages in thread
* [PATCH 00/10] KVM PCIe/MSI passthrough on ARM/ARM64
  2016-01-26 13:12 [PATCH 00/10] KVM PCIe/MSI passthrough on ARM/ARM64 Eric Auger
                   ` (10 preceding siblings ...)
  2016-01-26 17:25 ` [PATCH 00/10] KVM PCIe/MSI passthrough on ARM/ARM64 Pavel Fedin
@ 2016-01-28 21:51 ` Alex Williamson
  2016-01-29 14:35   ` Eric Auger
  11 siblings, 1 reply; 32+ messages in thread
From: Alex Williamson @ 2016-01-28 21:51 UTC (permalink / raw)
  To: linux-arm-kernel
On Tue, 2016-01-26 at 13:12 +0000, Eric Auger wrote:
> This series addresses KVM PCIe passthrough with MSI enabled on ARM/ARM64.
> It pursues the efforts done on [1], [2], [3]. It also aims at covering the
> same need on some PowerPC platforms.
>
> On x86 all accesses to the 1MB PA region [FEE0_0000h - FEF0_000h] are directed
> as interrupt messages: accesses to this special PA window directly target the
> APIC configuration space and not DRAM, meaning the downstream IOMMU is bypassed.
>
> This is not the case on above mentionned platforms where MSI messages emitted
> by devices are conveyed through the IOMMU. This means an IOVA/host PA mapping
> must exist for the MSI to reach the MSI controller. Normal way to create
> IOVA bindings consists in using VFIO DMA MAP API. However in this case
> the MSI IOVA is not mapped onto guest RAM but on host physical page (the MSI
> controller frame).
>
> Following first comments, the spirit of [2] is kept: the guest registers
> an IOVA range reserved for MSI mapping. When the VFIO-PCIe driver allocates
> its MSI vectors, it overwrites the MSI controller physical address with an IOVA,
> allocated within the window provided by the userspace. This IOVA is mapped
> onto the MSI controller frame physical page.
>
> The series does not address yet the problematic of telling the userspace how
> much IOVA he should provision.
I'm sort of on a think-different approach today, so bear with me; how is
it that x86 can make interrupt remapping so transparent to drivers like
vfio-pci while for ARM and ppc we seem to be stuck with doing these
fixups of the physical vector ourselves, implying ugly (no offense)
paths bouncing through vfio to connect the driver and iommu backends?
We know that x86 handles MSI vectors specially, so there is some
hardware that helps the situation.  It's not just that x86 has a fixed
range for MSI, it's how it manages that range when interrupt remapping
hardware is enabled.  A device table indexed by source-ID references a
per device table indexed by data from the MSI write itself.  So we get
much, much finer granularity, but there's still effectively an interrupt
domain per device that's being transparently managed under the covers
whenever we request an MSI vector for a device.
So why can't we do something more like that here?  There's no predefined
MSI vector range, so defining an interface for the user to specify that
is unavoidable.  But why shouldn't everything else be transparent?  We
could add an interface to the IOMMU API that allows us to register that
reserved range for the IOMMU domain.  IOMMU-core (or maybe interrupt
remapping) code might allocate an IOVA domain for this just as you've
done in the type1 code here.  But rather than having any interaction
with vfio-pci, why not do this at lower levels such that the platform
interrupt vector allocation code automatically uses one of those IOVA
ranges and returns the IOVA rather than the physical address for the PCI
code to program into the device?  I think we know what needs to be done,
but we're taking the approach of managing the space ourselves and doing
a fixup of the device after the core code has done its job when we
really ought to be letting the core code manage a space that we define
and programming the device so that it doesn't need a fixup in the
vfio-pci code.  Wouldn't it be nicer if pci_enable_msix_range() returned
with the device properly programmed or generate an error if there's not
enough reserved mapping space in IOMMU domain?  Can it be done?  Thanks,
Alex
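
The flow proposed above (userspace registers a reserved IOVA window, the allocation path hands back an IOVA for the doorbell, and the request fails when the window is full) can be sketched as a small Python model. This is only an illustration under stated assumptions: a 4kB page size, one IOVA page per distinct doorbell page, and invented names (`MsiIovaWindow`, `map_doorbell`, `MsiWindowExhausted`) that do not correspond to any existing kernel interface.

```python
PAGE_SIZE = 0x1000  # assumption: 4kB IOMMU page granularity


class MsiWindowExhausted(Exception):
    """Raised when the reserved window has no free page left (hypothetical)."""


class MsiIovaWindow:
    """Hypothetical reserved IOVA window registered by userspace for MSI doorbells."""

    def __init__(self, base, size):
        if size <= 0 or base % PAGE_SIZE or size % PAGE_SIZE:
            raise ValueError("window must be page aligned and non-empty")
        self.base = base
        self.npages = size // PAGE_SIZE
        self.by_phys = {}  # doorbell physical page -> IOVA page

    def map_doorbell(self, phys_addr):
        """Return the IOVA a device should be programmed with for this doorbell."""
        page = phys_addr & ~(PAGE_SIZE - 1)
        offset = phys_addr & (PAGE_SIZE - 1)
        if page not in self.by_phys:
            if len(self.by_phys) >= self.npages:
                raise MsiWindowExhausted("reserved MSI IOVA window exhausted")
            # Pages are handed out in order; an existing mapping is reused.
            self.by_phys[page] = self.base + len(self.by_phys) * PAGE_SIZE
        return self.by_phys[page] + offset
```

In this model the caller never sees the physical doorbell address, which is the transparency being asked for: the value returned is what would be written into the device's MSI address field.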
^ permalink raw reply	[flat|nested] 32+ messages in thread
* [PATCH 00/10] KVM PCIe/MSI passthrough on ARM/ARM64
  2016-01-28 21:51 ` Alex Williamson
@ 2016-01-29 14:35   ` Eric Auger
  2016-01-29 19:33     ` Alex Williamson
  0 siblings, 1 reply; 32+ messages in thread
From: Eric Auger @ 2016-01-29 14:35 UTC (permalink / raw)
  To: linux-arm-kernel
Hi Alex,
On 01/28/2016 10:51 PM, Alex Williamson wrote:
> On Tue, 2016-01-26 at 13:12 +0000, Eric Auger wrote:
>> This series addresses KVM PCIe passthrough with MSI enabled on ARM/ARM64.
>> It pursues the efforts done on [1], [2], [3]. It also aims at covering the
>> same need on some PowerPC platforms.
>>  
>> On x86 all accesses to the 1MB PA region [FEE0_0000h - FEF0_000h] are directed
>> as interrupt messages: accesses to this special PA window directly target the
>> APIC configuration space and not DRAM, meaning the downstream IOMMU is bypassed.
>>  
>> This is not the case on above mentionned platforms where MSI messages emitted
>> by devices are conveyed through the IOMMU. This means an IOVA/host PA mapping
>> must exist for the MSI to reach the MSI controller. Normal way to create
>> IOVA bindings consists in using VFIO DMA MAP API. However in this case
>> the MSI IOVA is not mapped onto guest RAM but on host physical page (the MSI
>> controller frame).
>>  
>> Following first comments, the spirit of [2] is kept: the guest registers
>> an IOVA range reserved for MSI mapping. When the VFIO-PCIe driver allocates
>> its MSI vectors, it overwrites the MSI controller physical address with an IOVA,
>> allocated within the window provided by the userspace. This IOVA is mapped
>> onto the MSI controller frame physical page.
>>  
>> The series does not address yet the problematic of telling the userspace how
>> much IOVA he should provision.
> 
> I'm sort of on a think-different approach today, so bear with me; how is
> it that x86 can make interrupt remapping so transparent to drivers like
> vfio-pci while for ARM and ppc we seem to be stuck with doing these
> fixups of the physical vector ourselves, implying ugly (no offense)
> paths bouncing through vfio to connect the driver and iommu backends?
> 
> We know that x86 handles MSI vectors specially, so there is some
> hardware that helps the situation.  It's not just that x86 has a fixed
> range for MSI, it's how it manages that range when interrupt remapping
> hardware is enabled.  A device table indexed by source-ID references a
> per device table indexed by data from the MSI write itself.  So we get
> much, much finer granularity,
About the granularity, I think ARM GICv3 now provides a similar
capability with the GICv3 ITS (Interrupt Translation Service). Along with
the MSI write transaction, the device outputs a DeviceID conveyed on
the bus. This DeviceID (~ your source-ID) is used to index a device
table. The entry in the device table points to a per-DeviceID interrupt
translation table indexed by the EventID found in the MSI message. The
entry in the interrupt translation table then gives you the
final interrupt ID targeted by the MSI message.
This translation capability is not available in GICv2m though, i.e. the
one I am currently using.
Those tables are currently built by the ITS irqchip driver (irq-gic-v3-its.c).
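
As a rough illustration of the two-level lookup described above, here is a toy Python model (not the irq-gic-v3-its.c implementation). The class and method names are invented, dictionaries stand in for the in-memory tables, and the interrupt IDs used with it are arbitrary.

```python
class ItsModel:
    """Toy model: DeviceID -> per-device translation table -> interrupt ID."""

    def __init__(self):
        # Device table: DeviceID -> interrupt translation table (EventID -> interrupt ID)
        self.device_table = {}

    def map_device(self, device_id):
        """Declare a device by giving it an (initially empty) translation table."""
        self.device_table.setdefault(device_id, {})

    def map_event(self, device_id, event_id, intid):
        """Bind one EventID of a declared device to a final interrupt ID."""
        self.device_table[device_id][event_id] = intid

    def translate(self, device_id, event_id):
        """Resolve an MSI write; return None when no translation exists."""
        itt = self.device_table.get(device_id)
        if itt is None:
            return None  # undeclared device: no table to index
        return itt.get(event_id)
```

Two devices can use the same EventID and still end up on different interrupt IDs, because the DeviceID selects the table before the EventID is looked at.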
>  but there's still effectively an interrupt
> domain per device that's being transparently managed under the covers
> whenever we request an MSI vector for a device.
> 
> So why can't we do something more like that here?  There's no predefined
> MSI vector range, so defining an interface for the user to specify that
> is unavoidable.
Can you confirm that the VFIO user API is still the right choice to
provide that IOVA range?
>  But why shouldn't everything else be transparent?  We
> could add an interface to the IOMMU API that allows us to register that
> reserved range for the IOMMU domain.  IOMMU-core (or maybe interrupt
> remapping) code might allocate an IOVA domain for this just as you've
> done in the type1 code here.
I have no objection to moving that IOVA allocation scheme somewhere else.
I just need to figure out how to deal with the fact that iova.c is not
compiled everywhere, as I noticed too late ;-)
>  But rather than having any interaction
> with vfio-pci, why not do this at lower levels such that the platform
> interrupt vector allocation code automatically uses one of those IOVA
> ranges and returns the IOVA rather than the physical address for the PCI
> code to program into the device?  I think we know what needs to be done,
> but we're taking the approach of managing the space ourselves and doing
> a fixup of the device after the core code has done its job when we
> really ought to be letting the core code manage a space that we define
> and programming the device so that it doesn't need a fixup in the
> vfio-pci code.  Wouldn't it be nicer if pci_enable_msix_range() returned
> with the device properly programmed or generate an error if there's not
> enough reserved mapping space in IOMMU domain?  Can it be done?
I agree that it would be cleaner to manage that natively
at the MSI controller level instead of patching the address value in
vfio_pci_intrs.c. I will investigate in that direction, but I need some
more time to understand the links between the MSI controller, the PCI
device and the IOMMU.
Best Regards
Eric
>  Thanks,
> 
> Alex
> 
^ permalink raw reply	[flat|nested] 32+ messages in thread
* [PATCH 00/10] KVM PCIe/MSI passthrough on ARM/ARM64
  2016-01-29 14:35   ` Eric Auger
@ 2016-01-29 19:33     ` Alex Williamson
  2016-01-29 21:25       ` Eric Auger
  0 siblings, 1 reply; 32+ messages in thread
From: Alex Williamson @ 2016-01-29 19:33 UTC (permalink / raw)
  To: linux-arm-kernel
On Fri, 2016-01-29 at 15:35 +0100, Eric Auger wrote:
> Hi Alex,
> On 01/28/2016 10:51 PM, Alex Williamson wrote:
> > On Tue, 2016-01-26 at 13:12 +0000, Eric Auger wrote:
> > > This series addresses KVM PCIe passthrough with MSI enabled on ARM/ARM64.
> > > It pursues the efforts done on [1], [2], [3]. It also aims at covering the
> > > same need on some PowerPC platforms.
> > >
> > > On x86 all accesses to the 1MB PA region [FEE0_0000h - FEF0_000h] are directed
> > > as interrupt messages: accesses to this special PA window directly target the
> > > APIC configuration space and not DRAM, meaning the downstream IOMMU is bypassed.
> > >
> > > This is not the case on above mentionned platforms where MSI messages emitted
> > > by devices are conveyed through the IOMMU. This means an IOVA/host PA mapping
> > > must exist for the MSI to reach the MSI controller. Normal way to create
> > > IOVA bindings consists in using VFIO DMA MAP API. However in this case
> > > the MSI IOVA is not mapped onto guest RAM but on host physical page (the MSI
> > > controller frame).
> > >
> > > Following first comments, the spirit of [2] is kept: the guest registers
> > > an IOVA range reserved for MSI mapping. When the VFIO-PCIe driver allocates
> > > its MSI vectors, it overwrites the MSI controller physical address with an IOVA,
> > > allocated within the window provided by the userspace. This IOVA is mapped
> > > onto the MSI controller frame physical page.
> > >
> > > The series does not address yet the problematic of telling the userspace how
> > > much IOVA he should provision.
> >
> > I'm sort of on a think-different approach today, so bear with me; how is
> > it that x86 can make interrupt remapping so transparent to drivers like
> > vfio-pci while for ARM and ppc we seem to be stuck with doing these
> > fixups of the physical vector ourselves, implying ugly (no offense)
> > paths bouncing through vfio to connect the driver and iommu backends?
> >
> > We know that x86 handles MSI vectors specially, so there is some
> > hardware that helps the situation.  It's not just that x86 has a fixed
> > range for MSI, it's how it manages that range when interrupt remapping
> > hardware is enabled.  A device table indexed by source-ID references a
> > per device table indexed by data from the MSI write itself.  So we get
> > much, much finer granularity,
> About the granularity, I think ARM GICv3 now provides a similar
> capability with GICv3 ITS (interrupt translation service). Along with
> the MSI MSG write transaction, the device outputs a DeviceID conveyed on
> the bus. This DeviceID (~ your source-ID) enables to index a device
> table. The entry in the device table points to a DeviceId interrupt
> translation table indexed by the EventID found in the msi msg. So the
> entry in the interrupt translation table eventually gives you the
> eventual interrupt ID targeted by the MSI MSG.
> This translation capability if not available in GICv2M though, ie. the
> one I am currently using.
>
> Those tables currently are built by the ITS irqchip (irq-gic-v3-its.c)
So it sounds like the interrupt remapping plumbing needs to be
implemented for those chips.  How does ITS identify an MSI versus any
other DMA write?  Does it need to be within a preconfigured address
space like on x86 or does it know this implicitly by the transaction
(which doesn't seem possible on PCIe)?
Along with this discussion, we should probably be revisiting whether
existing ARM SMMUs should be exposing the IOMMU_CAP_INTR_REMAP
capability.  This capability is meant to indicate interrupt isolation,
but if an entire page of IOVA space is mapped through the IOMMU to a
range of interrupts and some of those interrupts are shared with host
devices or other VMs, then we really don't have that isolation and the
system is susceptible to one VM interfering with another or with the
host.  If that's the case, the SMMU should not be claiming
IOMMU_CAP_INTR_REMAP.
>  but there's still effectively an interrupt
> > domain per device that's being transparently managed under the covers
> > whenever we request an MSI vector for a device.
> >
> > So why can't we do something more like that here?  There's no predefined
> > MSI vector range, so defining an interface for the user to specify that
> > is unavoidable.
> Do you confirm that VFIO user API still still is the good choice to
> provide that IOVA range?
I don't see that we have an option there unless ARM wants to
retroactively reserve a range of IOVA space in the spec, which is
certainly not going to happen.  The only other thing that comes to mind
would be if there was an existing address space which could never be
backed by RAM or other DMA capable targets.  But that seems far fetched
as well.
>  But why shouldn't everything else be transparent?  We
> > could add an interface to the IOMMU API that allows us to register that
> > reserved range for the IOMMU domain.  IOMMU-core (or maybe interrupt
> > remapping) code might allocate an IOVA domain for this just as you've
> > done in the type1 code here.
> I have no objection to move that iova allocation scheme somewhere else.
> I just need to figure out how to deal with the fact iova.c is not
> compiled everywhere as I noticed too late ;-)
>  But rather than having any interaction
> > with vfio-pci, why not do this at lower levels such that the platform
> > interrupt vector allocation code automatically uses one of those IOVA
> > ranges and returns the IOVA rather than the physical address for the PCI
> > code to program into the device?  I think we know what needs to be done,
> > but we're taking the approach of managing the space ourselves and doing
> > a fixup of the device after the core code has done its job when we
> > really ought to be letting the core code manage a space that we define
> > and programming the device so that it doesn't need a fixup in the
> > vfio-pci code.  Wouldn't it be nicer if pci_enable_msix_range() returned
> > with the device properly programmed or generate an error if there's not
> > enough reserved mapping space in IOMMU domain?  Can it be done?
> I agree with you on the fact it would be cleaner to manage that natively
> at MSI controller level instead of patching the address value in
> vfio_pci_intrs.c. I will investigate in that direction but I need some
> more time to understand the links between the MSI controller, the PCI
> device and the IOMMU.
Since the current interrupt remapping schemes seem to operate in a
different address space, I expect there will be work to do to fit the
interrupt remapping within a provided address space, but it seems like a
very reasonable constraint to add.  Thanks,
Alex
^ permalink raw reply	[flat|nested] 32+ messages in thread
* [PATCH 00/10] KVM PCIe/MSI passthrough on ARM/ARM64
  2016-01-29 19:33     ` Alex Williamson
@ 2016-01-29 21:25       ` Eric Auger
  2016-02-01 14:03         ` Will Deacon
  0 siblings, 1 reply; 32+ messages in thread
From: Eric Auger @ 2016-01-29 21:25 UTC (permalink / raw)
  To: linux-arm-kernel
Hi Alex,
On 01/29/2016 08:33 PM, Alex Williamson wrote:
> On Fri, 2016-01-29 at 15:35 +0100, Eric Auger wrote:
>> Hi Alex,
>> On 01/28/2016 10:51 PM, Alex Williamson wrote:
>>> On Tue, 2016-01-26 at 13:12 +0000, Eric Auger wrote:
>>>> This series addresses KVM PCIe passthrough with MSI enabled on ARM/ARM64.
>>>> It pursues the efforts done on [1], [2], [3]. It also aims at covering the
>>>> same need on some PowerPC platforms.
>>>>  
>>>> On x86 all accesses to the 1MB PA region [FEE0_0000h - FEF0_000h] are directed
>>>> as interrupt messages: accesses to this special PA window directly target the
>>>> APIC configuration space and not DRAM, meaning the downstream IOMMU is bypassed.
>>>>  
>>>> This is not the case on above mentionned platforms where MSI messages emitted
>>>> by devices are conveyed through the IOMMU. This means an IOVA/host PA mapping
>>>> must exist for the MSI to reach the MSI controller. Normal way to create
>>>> IOVA bindings consists in using VFIO DMA MAP API. However in this case
>>>> the MSI IOVA is not mapped onto guest RAM but on host physical page (the MSI
>>>> controller frame).
>>>>  
>>>> Following first comments, the spirit of [2] is kept: the guest registers
>>>> an IOVA range reserved for MSI mapping. When the VFIO-PCIe driver allocates
>>>> its MSI vectors, it overwrites the MSI controller physical address with an IOVA,
>>>> allocated within the window provided by the userspace. This IOVA is mapped
>>>> onto the MSI controller frame physical page.
>>>>  
>>>> The series does not address yet the problematic of telling the userspace how
>>>> much IOVA he should provision.
>>>  
>>> I'm sort of on a think-different approach today, so bear with me; how is
>>> it that x86 can make interrupt remapping so transparent to drivers like
>>> vfio-pci while for ARM and ppc we seem to be stuck with doing these
>>> fixups of the physical vector ourselves, implying ugly (no offense)
>>> paths bouncing through vfio to connect the driver and iommu backends?
>>>  
>>> We know that x86 handles MSI vectors specially, so there is some
>>> hardware that helps the situation.  It's not just that x86 has a fixed
>>> range for MSI, it's how it manages that range when interrupt remapping
>>> hardware is enabled.  A device table indexed by source-ID references a
>>> per device table indexed by data from the MSI write itself.  So we get
>>> much, much finer granularity,
>> About the granularity, I think ARM GICv3 now provides a similar
>> capability with GICv3 ITS (interrupt translation service). Along with
>> the MSI MSG write transaction, the device outputs a DeviceID conveyed on
>> the bus. This DeviceID (~ your source-ID) enables to index a device
>> table. The entry in the device table points to a DeviceId interrupt
>> translation table indexed by the EventID found in the msi msg. So the
>> entry in the interrupt translation table eventually gives you the
>> eventual interrupt ID targeted by the MSI MSG.
>> This translation capability if not available in GICv2M though, ie. the
>> one I am currently using.
>>  
>> Those tables currently are built by the ITS irqchip (irq-gic-v3-its.c)
> 
> So it sounds like the interrupt remapping plumbing needs to be
> implemented for those chips.  How does ITS identify an MSI versus any
> other DMA write?  Does it need to be within a preconfigured address
> space like on x86 or does it know this implicitly by the transaction
> (which doesn't seem possible on PCIe)?
There seems to be a misunderstanding here. Assuming a "simple"
system with a single ITS, all devices likely to produce MSIs must write
those messages to a single register, located in the ITS 64kB MSI frame
(this register is called GITS_TRANSLATER). The ITS then discriminates
between senders using the DeviceID conveyed out-of-band on the bus (or
by other implementation-defined means). For each DeviceID, a per-device
interrupt translation table is supposed to exist, otherwise the write is
going to fault. If any "undeclared" device writes to that register, its
DeviceID will be unknown. It looks like on Intel the interrupt remapping
HW is instead abstracted on the IOMMU side; I have not yet taken the time
to read the VT-d spec carefully, but maybe the Intel interrupt remapping HW
acts as an IOMMU that takes an input MSI address within the
famous window and applies a translation scheme based on the MSI address and
data? On ARM the input MSI address is always GITS_TRANSLATER and
the translation scheme is based on out-of-band info (DeviceID) plus
data content (EventID). I hope this clarifies things.
> 
> Along with this discussion, we should probably be revisiting whether
> existing ARM SMMUs should be exposing the IOMMU_CAP_INTR_REMAP
> capability.
So, given the above explanation, I am not sure it is relevant. Will/Marc
may correct me if I got something wrong.
>  This capability is meant to indicate interrupt isolation,
> but if an entire page of IOVA space is mapped through the IOMMU to a
> range of interrupts and some of those interrupts are shared with host
> devices or other VMs, then we really don't have that isolation and the
> system is susceptible to one VM interfering with another or with the
> host.  If that's the case, the SMMU should not be claiming
> IOMMU_CAP_INTR_REMAP.
My understanding is that a PCI device working for the host must have its own
DeviceID translation table, while another one assigned to a guest needs
to have a separate one. Each of those will then trigger different final
interrupt IDs in separate domains.
To be honest, for the time being I was not addressing the ITS case but
just the simpler GICv2m case where we do not have interrupt translation.
In GICv2m with a single 4kB MSI frame you still have a single register
written by devices. The msg data content then determines the interrupt ID.
My kernel series "just" aimed at allowing the device to reach the
physical address of the GICv2m MSI frame through the IOMMU.
But you're right, I think I should have a broader view of what is
targeted with ITS. In GICv2m with a single MSI frame the discrimination
relies only on the MSI data (there is no DeviceID). However, it is also
possible to have several GICv2m 4kB MSI frames, in which case you can
give one 4kB MSI frame per VM, but that is yet another use case. My AMD
system currently exposes a single MSI frame, in which case we have poor
isolation, as you say.
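
To show why a single GICv2m frame gives poor isolation, here is a toy Python model. It assumes, as a simplification, that the data written to the frame's doorbell register is the SPI number itself; the class name, the `sender` argument and the SPI range are invented for the example.

```python
class Gicv2mFrame:
    """Toy model of one GICv2m MSI frame owning SPIs [spi_base, spi_base + nr_spis)."""

    def __init__(self, spi_base, nr_spis):
        self.spi_base = spi_base
        self.nr_spis = nr_spis

    def doorbell_write(self, data, sender=None):
        """Return the SPI raised by a write of `data`, or None if out of range.

        `sender` is accepted only to make the point: the frame has no
        DeviceID to check, so the argument is never consulted.
        """
        if self.spi_base <= data < self.spi_base + self.nr_spis:
            return data
        return None
```

Any device that can reach the frame and writes the same data raises the same interrupt, whoever it belongs to. With one frame per VM, the IOMMU mapping itself would decide which frame a device can reach.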
> 
>>  but there's still effectively an interrupt
>>> domain per device that's being transparently managed under the covers
>>> whenever we request an MSI vector for a device.
>>>  
>>> So why can't we do something more like that here?  There's no predefined
>>> MSI vector range, so defining an interface for the user to specify that
>>> is unavoidable.
>> Do you confirm that VFIO user API still still is the good choice to
>> provide that IOVA range?
> 
> I don't see that we have an option there unless ARM wants to
> retroactively reserve a range of IOVA space in the spec, which is
> certainly not going to happen.  The only other thing that comes to mind
> would be if there was an existing address space which could never be
> backed by RAM or other DMA capable targets.  But that seems far fetched
> as well.
I don't think there is a plan for such a change, and I am afraid we need to
support the above configurations (GICv2m with a single frame, GICv2m with
several frames, ITS, and possibly others not covered here that I am
not aware of).
> 
>>   But why shouldn't everything else be transparent?  We
>>> could add an interface to the IOMMU API that allows us to register that
>>> reserved range for the IOMMU domain.  IOMMU-core (or maybe interrupt
>>> remapping) code might allocate an IOVA domain for this just as you've
>>> done in the type1 code here.
>> I have no objection to move that iova allocation scheme somewhere else.
>> I just need to figure out how to deal with the fact iova.c is not
>> compiled everywhere as I noticed too late ;-)
>>   But rather than having any interaction
>>> with vfio-pci, why not do this at lower levels such that the platform
>>> interrupt vector allocation code automatically uses one of those IOVA
>>> ranges and returns the IOVA rather than the physical address for the PCI
>>> code to program into the device?  I think we know what needs to be done,
>>> but we're taking the approach of managing the space ourselves and doing
>>> a fixup of the device after the core code has done its job when we
>>> really ought to be letting the core code manage a space that we define
>>> and programming the device so that it doesn't need a fixup in the
>>> vfio-pci code.  Wouldn't it be nicer if pci_enable_msix_range() returned
>>> with the device properly programmed or generate an error if there's not
>>> enough reserved mapping space in IOMMU domain?  Can it be done?
>> I agree with you on the fact it would be cleaner to manage that natively
>> at MSI controller level instead of patching the address value in
>> vfio_pci_intrs.c. I will investigate in that direction but I need some
>> more time to understand the links between the MSI controller, the PCI
>> device and the IOMMU.
> 
> Since the current interrupt remapping schemes seem to operate in a
> different address space, I expect there will be work to do to fit the
> interrupt remapping within a provided address space, but it seems like a
> very reasonable constraint to add.  Thanks,
I hope this discussion helps. ARM folks, please correct me if anything
is unclear or wrong.
Thank you for reading this far and have a nice weekend!
Best Regards
Eric
> 
> Alex
> 
^ permalink raw reply	[flat|nested] 32+ messages in thread
* [PATCH 00/10] KVM PCIe/MSI passthrough on ARM/ARM64
  2016-01-29 21:25       ` Eric Auger
@ 2016-02-01 14:03         ` Will Deacon
  2016-02-03 12:50           ` Christoffer Dall
  0 siblings, 1 reply; 32+ messages in thread
From: Will Deacon @ 2016-02-01 14:03 UTC (permalink / raw)
  To: linux-arm-kernel
On Fri, Jan 29, 2016 at 10:25:52PM +0100, Eric Auger wrote:
> On 01/29/2016 08:33 PM, Alex Williamson wrote:
> >>> We know that x86 handles MSI vectors specially, so there is some
> >>> hardware that helps the situation.  It's not just that x86 has a fixed
> >>> range for MSI, it's how it manages that range when interrupt remapping
> >>> hardware is enabled.  A device table indexed by source-ID references a
> >>> per device table indexed by data from the MSI write itself.  So we get
> >>> much, much finer granularity,
> >> About the granularity, I think ARM GICv3 now provides a similar
> >> capability with GICv3 ITS (interrupt translation service). Along with
> >> the MSI MSG write transaction, the device outputs a DeviceID conveyed on
> >> the bus. This DeviceID (~ your source-ID) enables to index a device
> >> table. The entry in the device table points to a DeviceId interrupt
> >> translation table indexed by the EventID found in the msi msg. So the
> >> entry in the interrupt translation table eventually gives you the
> >> eventual interrupt ID targeted by the MSI MSG.
> >> This translation capability if not available in GICv2M though, ie. the
> >> one I am currently using.
> >>  
> >> Those tables currently are built by the ITS irqchip (irq-gic-v3-its.c)
That's right. GICv3/ITS disambiguates the interrupt source using the
DeviceID, which for PCI is derived from the Requester ID of the endpoint.
GICv2m is less flexible and requires a separate physical frame per guest
to achieve isolation.
Will
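
For reference, the PCI Requester ID mentioned above is the 16-bit bus/device/function tuple. A minimal Python sketch of how it is packed follows; the function name is made up, and how a platform turns this value into an ITS DeviceID is platform-specific and deliberately left out.

```python
def pci_requester_id(bus, device, function):
    """Pack bus (8 bits), device (5 bits) and function (3 bits) into a 16-bit RID."""
    if not (0 <= bus <= 0xFF and 0 <= device <= 0x1F and 0 <= function <= 0x7):
        raise ValueError("bus/device/function out of range")
    return (bus << 8) | (device << 3) | function
```

For example, the endpoint at 01:00.0 has Requester ID 0x0100.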
^ permalink raw reply	[flat|nested] 32+ messages in thread
* [PATCH 00/10] KVM PCIe/MSI passthrough on ARM/ARM64
  2016-02-01 14:03         ` Will Deacon
@ 2016-02-03 12:50           ` Christoffer Dall
  2016-02-03 13:10             ` Will Deacon
  0 siblings, 1 reply; 32+ messages in thread
From: Christoffer Dall @ 2016-02-03 12:50 UTC (permalink / raw)
  To: linux-arm-kernel
On Mon, Feb 01, 2016 at 02:03:51PM +0000, Will Deacon wrote:
> On Fri, Jan 29, 2016 at 10:25:52PM +0100, Eric Auger wrote:
> > On 01/29/2016 08:33 PM, Alex Williamson wrote:
> > >>> We know that x86 handles MSI vectors specially, so there is some
> > >>> hardware that helps the situation.  It's not just that x86 has a fixed
> > >>> range for MSI, it's how it manages that range when interrupt remapping
> > >>> hardware is enabled.  A device table indexed by source-ID references a
> > >>> per device table indexed by data from the MSI write itself.  So we get
> > >>> much, much finer granularity,
> > >> About the granularity, I think ARM GICv3 now provides a similar
> > >> capability with GICv3 ITS (interrupt translation service). Along with
> > >> the MSI MSG write transaction, the device outputs a DeviceID conveyed on
> > >> the bus. This DeviceID (~ your source-ID) enables to index a device
> > >> table. The entry in the device table points to a DeviceId interrupt
> > >> translation table indexed by the EventID found in the msi msg. So the
> > >> entry in the interrupt translation table eventually gives you the
> > >> eventual interrupt ID targeted by the MSI MSG.
> > >> This translation capability if not available in GICv2M though, ie. the
> > >> one I am currently using.
> > >>  
> > >> Those tables currently are built by the ITS irqchip (irq-gic-v3-its.c)
> 
> That's right. GICv3/ITS disambiguates the interrupt source using the
> DeviceID, which for PCI is derived from the Requester ID of the endpoint.
> GICv2m is less flexible and requires a separate physical frame per guest
> to achieve isolation.
> 
We should still support MSI passthrough with a single MSI frame host
system though, right?
(Users should just be aware that guests are not fully protected against
misbehaving hardware in that case).
-Christoffer
* [PATCH 00/10] KVM PCIe/MSI passthrough on ARM/ARM64
  2016-02-03 12:50           ` Christoffer Dall
@ 2016-02-03 13:10             ` Will Deacon
  2016-02-03 15:36               ` Christoffer Dall
  0 siblings, 1 reply; 32+ messages in thread
From: Will Deacon @ 2016-02-03 13:10 UTC (permalink / raw)
  To: linux-arm-kernel
On Wed, Feb 03, 2016 at 01:50:47PM +0100, Christoffer Dall wrote:
> On Mon, Feb 01, 2016 at 02:03:51PM +0000, Will Deacon wrote:
> > On Fri, Jan 29, 2016 at 10:25:52PM +0100, Eric Auger wrote:
> > > On 01/29/2016 08:33 PM, Alex Williamson wrote:
> > > >>> We know that x86 handles MSI vectors specially, so there is some
> > > >>> hardware that helps the situation.  It's not just that x86 has a fixed
> > > >>> range for MSI, it's how it manages that range when interrupt remapping
> > > >>> hardware is enabled.  A device table indexed by source-ID references a
> > > >>> per device table indexed by data from the MSI write itself.  So we get
> > > >>> much, much finer granularity,
> > > >> About the granularity, I think ARM GICv3 now provides a similar
> > > >> capability with GICv3 ITS (interrupt translation service). Along with
> > > >> the MSI MSG write transaction, the device outputs a DeviceID conveyed on
> > > >> the bus. This DeviceID (~ your source-ID) enables to index a device
> > > >> table. The entry in the device table points to a DeviceId interrupt
> > > >> translation table indexed by the EventID found in the msi msg. So the
> > > >> entry in the interrupt translation table eventually gives you the
> > > >> eventual interrupt ID targeted by the MSI MSG.
> > > >> This translation capability if not available in GICv2M though, ie. the
> > > >> one I am currently using.
> > > >>  
> > > >> Those tables currently are built by the ITS irqchip (irq-gic-v3-its.c)
> > 
> > That's right. GICv3/ITS disambiguates the interrupt source using the
> > DeviceID, which for PCI is derived from the Requester ID of the endpoint.
> > GICv2m is less flexible and requires a separate physical frame per guest
> > to achieve isolation.
> > 
> We should still support MSI passthrough with a single MSI frame host
> system though, right?
I think we should treat the frame as an exclusive resource and assign it
to a single VM.
> (Users should just be aware that guests are not fully protected against
> misbehaving hardware in that case).
Is it confined to misbehaving hardware? What if a malicious/buggy guest
configures its device to DMA all over the doorbell?
Will
* [PATCH 00/10] KVM PCIe/MSI passthrough on ARM/ARM64
  2016-02-03 13:10             ` Will Deacon
@ 2016-02-03 15:36               ` Christoffer Dall
       [not found]                 ` <56B4DC97.60904@linaro.org>
  0 siblings, 1 reply; 32+ messages in thread
From: Christoffer Dall @ 2016-02-03 15:36 UTC (permalink / raw)
  To: linux-arm-kernel
On Wed, Feb 03, 2016 at 01:10:58PM +0000, Will Deacon wrote:
> On Wed, Feb 03, 2016 at 01:50:47PM +0100, Christoffer Dall wrote:
> > On Mon, Feb 01, 2016 at 02:03:51PM +0000, Will Deacon wrote:
> > > On Fri, Jan 29, 2016 at 10:25:52PM +0100, Eric Auger wrote:
> > > > On 01/29/2016 08:33 PM, Alex Williamson wrote:
> > > > >>> We know that x86 handles MSI vectors specially, so there is some
> > > > >>> hardware that helps the situation.  It's not just that x86 has a fixed
> > > > >>> range for MSI, it's how it manages that range when interrupt remapping
> > > > >>> hardware is enabled.  A device table indexed by source-ID references a
> > > > >>> per device table indexed by data from the MSI write itself.  So we get
> > > > >>> much, much finer granularity,
> > > > >> About the granularity, I think ARM GICv3 now provides a similar
> > > > >> capability with GICv3 ITS (interrupt translation service). Along with
> > > > >> the MSI MSG write transaction, the device outputs a DeviceID conveyed on
> > > > >> the bus. This DeviceID (~ your source-ID) enables to index a device
> > > > >> table. The entry in the device table points to a DeviceId interrupt
> > > > >> translation table indexed by the EventID found in the msi msg. So the
> > > > >> entry in the interrupt translation table eventually gives you the
> > > > >> eventual interrupt ID targeted by the MSI MSG.
> > > > >> This translation capability if not available in GICv2M though, ie. the
> > > > >> one I am currently using.
> > > > >>  
> > > > >> Those tables currently are built by the ITS irqchip (irq-gic-v3-its.c)
> > > 
> > > That's right. GICv3/ITS disambiguates the interrupt source using the
> > > DeviceID, which for PCI is derived from the Requester ID of the endpoint.
> > > GICv2m is less flexible and requires a separate physical frame per guest
> > > to achieve isolation.
> > > 
> > We should still support MSI passthrough with a single MSI frame host
> > system though, right?
> 
> I think we should treat the frame as an exclusive resource and assign it
> to a single VM.
So on a single-frame GICv2m system, either the host or a single VM gets
to do MSIs...
> 
> > (Users should just be aware that guests are not fully protected against
> > misbehaving hardware in that case).
> 
> Is it confined to misbehaving hardware? What if a malicious/buggy guest
> configures its device to DMA all over the doorbell?
> 
I guess not. I suppose we can't trap every configuration access and
mediate it for arbitrary devices.  Bummer.
-Christoffer
* ARM PCI/MSI KVM passthrough with GICv2M
       [not found]                 ` <56B4DC97.60904@linaro.org>
@ 2016-02-05 18:17                   ` Alex Williamson
  2016-02-08  9:48                     ` Christoffer Dall
  0 siblings, 1 reply; 32+ messages in thread
From: Alex Williamson @ 2016-02-05 18:17 UTC (permalink / raw)
  To: linux-arm-kernel
On Fri, 5 Feb 2016 18:32:07 +0100
Eric Auger <eric.auger@linaro.org> wrote:
> Hi Alex,
> 
> I tried to sketch a proposal for guaranteeing the IRQ integrity when
> doing ARM PCI/MSI passthrough with ARM GICv2M msi-controller. This is
> based on extended VFIO group viability control, as detailed below.
> 
> As opposed to ARM GICv3 ITS, this MSI controller does *not* support IRQ
> remapping. It can expose 1 or more 4kB MSI frame. Each frame contains a
> single register where the msi data is written.
> 
> I would be grateful to you if you could tell me whether it makes any sense.
> 
> Thanks in advance
> 
> Best Regards
> 
> Eric
> 
> 
> 1) GICv2m with a single 4kB single frame
>    all devices having this msi-controller as msi-parent share this
>    single MSI frame. Those devices can work on behalf of the host
>    or work on behalf of 1 or more guests (KVM assigned devices). We
>    must make sure either the host only or 1 single VM can access to the
>    single frame to guarantee interrupt integrity: a device assigned
>    to 1 VM should not be able to trigger MSI targeted to the host
>    or another VM.
> 
>    I would propose to extend the VFIO notion of group viability.
>    Currently a VFIO group is viable if:
>    all devices belonging to the same group are bound to a VFIO driver
>    or unbound.
> 
>    Let's imagine we extend the viability check as follows:
> 
>    0) keep the current viable check: all the devices belonging to
>       the group must be vfio bound or unbound.
>    1) retrieve the MSI parent of the device and list all the
>       other devices using that MSI controller as MSI-parent (does not
>       look straightforward):
>    2) they must be VFIO driver bound or unbound as well (meaning
>       they are not used by the host). If not, reject device attachment
>    - in case they are VFIO bound (a VFIO group is set):
>      x if all VFIO containers are the same as the one of the device's
>        we try to attach, that's OK. This means the other devices
>        use different IOMMU mappings, eventually will target the
>        MSI frame but they all work for the same user space client/VM.
>      x 1 or more devices has a different container than the device
>        under attachment:
>        It works on behalf of a different user space client/VM,
>        we can't attach the new device. I think there is a case however
>        where several containers can be opened by a single QEMU.
> 
> Of course the dynamic aspects, ie a new device showing up or an unbind
> event bring significant complexity.
> 
> 2) GICv2M with multiple 4kB frames
>    Each msi-frame is enumerated as msi-controller. The device tree
>    statically defines which device is attached to each msi frame.
>    In case devices are assigned we cannot change this attachment
>    anyway since there might be physical constraints behind.
>    So devices likely to be assigned to guests should be linked to a
>    different MSI frame than devices that are not.
> 
>    I think extended viability concept can be used as well.
> 
>    This model still is not ideal: in case we have a SR-IOV device
>    plugged onto an host bridge attached to a single MSI parent you won't
>    be able anyway to have 1 Virtual Function working for host and 1 VF
>    working for a guest. Only Interrupt translation (ITS) will bring that
>    feature.
> 
> 3) GICv3 ITS
>    This one supports interrupt translation service ~ Intel
>    IRQ remapping.
>    This means a single frame can be used by all devices. A deviceID is
>    used exclusively by the host or a guest. I assume the ITS driver
>    allocates/populates deviceid interrupt translation table featuring
>    separate LPI spaces ie by construction different ITT cannot feature
>    same LPIs. So no need to do the extended viability test.
> 
>    The MSI controller should have a property telling whether
>    it supports interrupt translation. This kind of property currently
>    exists on IOMMU side for INTEL remapping.
> 
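
The extended viability check quoted above can be sketched as follows. This is a minimal illustration under stated assumptions, not VFIO code: the Dev record, its fields and can_attach() are hypothetical names, a driver of None stands for "unbound", the string "vfio" stands for any VFIO driver, and step 0 (the existing group viability check) is assumed to have passed already.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Dev:
    name: str
    msi_parent: str                  # MSI controller / frame the device targets
    driver: Optional[str]            # None = unbound, "vfio" = VFIO bound
    container: Optional[int] = None  # VFIO container id, if attached to one


def can_attach(candidate, target_container, all_devices):
    """Return True if attaching candidate keeps its MSI frame exclusive."""
    for other in all_devices:
        # Step 1: only devices sharing the candidate's MSI parent matter.
        if other is candidate or other.msi_parent != candidate.msi_parent:
            continue
        # Step 2: unbound devices are harmless.
        if other.driver is None:
            continue
        # Step 2: a host driver is using the frame, reject the attachment.
        if other.driver != "vfio":
            return False
        # VFIO bound, but serving a different container (another client/VM).
        if other.container is not None and other.container != target_container:
            return False
    return True
```

The sketch treats one container as one user space client, so it does not cover the case mentioned in the proposal where a single QEMU opens several containers, nor the dynamic cases (hotplug, unbind) that the proposal flags as the hard part.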
Hi Eric,
Would anyone be terribly upset if we simply assume the worst case
scenario on GICv2m, have the IOMMU not claim IOMMU_CAP_INTR_REMAP, and
require the user to opt in via the allow_unsafe_interrupts option of the
vfio_iommu_type1 module?  That would make it very compatible with what
we already do on x86, where it really is all or nothing.  My assumption
is that GICv2 would be phased out in favor of GICv3, so there's always
a hardware upgrade path to having more complete isolation, but the
return on investment for figuring out whether a given device really has
this sort of isolation seems pretty low.  Often users already have some
degree of trust in the VMs they use for device assignment anyway.  An
especially prudent user can still look at the hardware specs for their
specific system to understand whether any devices are fully isolated
and only make use of those for device assignment.  Does that seem like
a reasonable alternative?  Thanks,
Alex
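
The all-or-nothing policy proposed above reduces to a single gate at attach time. The following is a sketch only: check_attach() and its parameters are hypothetical names, and the PermissionError merely stands in for the kernel refusing the attachment.

```python
def check_attach(iommu_claims_intr_remap, allow_unsafe_interrupts):
    """Model the opt-in: attach only if MSIs are isolated or the user agreed.

    iommu_claims_intr_remap: whether the IOMMU reports IOMMU_CAP_INTR_REMAP.
    allow_unsafe_interrupts: the vfio_iommu_type1 module option, set by the
    administrator to accept the lack of interrupt isolation.
    """
    if iommu_claims_intr_remap:
        return "safe"
    if allow_unsafe_interrupts:
        return "unsafe-but-allowed"
    # Neither isolation nor an explicit opt-in: refuse device assignment.
    raise PermissionError("no interrupt remapping; "
                          "set allow_unsafe_interrupts to proceed")
```

Under this model a GICv2m system never reports the capability, so assignment works only after the administrator has explicitly set the option.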
* ARM PCI/MSI KVM passthrough with GICv2M
  2016-02-05 18:17                   ` ARM PCI/MSI KVM passthrough with GICv2M Alex Williamson
@ 2016-02-08  9:48                     ` Christoffer Dall
  2016-02-08 13:27                       ` Eric Auger
  0 siblings, 1 reply; 32+ messages in thread
From: Christoffer Dall @ 2016-02-08  9:48 UTC (permalink / raw)
  To: linux-arm-kernel
On Fri, Feb 05, 2016 at 11:17:00AM -0700, Alex Williamson wrote:
> On Fri, 5 Feb 2016 18:32:07 +0100
> Eric Auger <eric.auger@linaro.org> wrote:
> 
> > Hi Alex,
> > 
> > I tried to sketch a proposal for guaranteeing the IRQ integrity when
> > doing ARM PCI/MSI passthrough with ARM GICv2M msi-controller. This is
> > based on extended VFIO group viability control, as detailed below.
> > 
> > As opposed to ARM GICv3 ITS, this MSI controller does *not* support IRQ
> > remapping. It can expose 1 or more 4kB MSI frame. Each frame contains a
> > single register where the msi data is written.
> > 
> > I would be grateful to you if you could tell me whether it makes any sense.
> > 
> > Thanks in advance
> > 
> > Best Regards
> > 
> > Eric
> > 
> > 
> > 1) GICv2m with a single 4kB single frame
> >    all devices having this msi-controller as msi-parent share this
> >    single MSI frame. Those devices can work on behalf of the host
> >    or work on behalf of 1 or more guests (KVM assigned devices). We
> >    must make sure either the host only or 1 single VM can access to the
> >    single frame to guarantee interrupt integrity: a device assigned
> >    to 1 VM should not be able to trigger MSI targeted to the host
> >    or another VM.
> > 
> >    I would propose to extend the VFIO notion of group viability.
> >    Currently a VFIO group is viable if:
> >    all devices belonging to the same group are bound to a VFIO driver
> >    or unbound.
> > 
> >    Let's imagine we extend the viability check as follows:
> > 
> >    0) keep the current viable check: all the devices belonging to
> >       the group must be vfio bound or unbound.
> >    1) retrieve the MSI parent of the device and list all the
> >       other devices using that MSI controller as MSI-parent (does not
> >       look straightforward):
> >    2) they must be VFIO driver bound or unbound as well (meaning
> >       they are not used by the host). If not, reject device attachment
> >    - in case they are VFIO bound (a VFIO group is set):
> >      x if all VFIO containers are the same as the one of the device's
> >        we try to attach, that's OK. This means the other devices
> >        use different IOMMU mappings, eventually will target the
> >        MSI frame but they all work for the same user space client/VM.
> >      x 1 or more devices has a different container than the device
> >        under attachment:
> >        It works on behalf of a different user space client/VM,
> >        we can't attach the new device. I think there is a case however
> >        where several containers can be opened by a single QEMU.
> > 
> > Of course the dynamic aspects, ie a new device showing up or an unbind
> > event bring significant complexity.
> > 
> > 2) GICv2M with multiple 4kB frames
> >    Each msi-frame is enumerated as msi-controller. The device tree
> >    statically defines which device is attached to each msi frame.
> >    In case devices are assigned we cannot change this attachment
> >    anyway since there might be physical constraints behind.
> >    So devices likely to be assigned to guests should be linked to a
> >    different MSI frame than devices that are not.
> > 
> >    I think extended viability concept can be used as well.
> > 
> >    This model still is not ideal: in case we have a SR-IOV device
> >    plugged onto an host bridge attached to a single MSI parent you won't
> >    be able anyway to have 1 Virtual Function working for host and 1 VF
> >    working for a guest. Only Interrupt translation (ITS) will bring that
> >    feature.
> > 
> > 3) GICv3 ITS
> >    This one supports interrupt translation service ~ Intel
> >    IRQ remapping.
> >    This means a single frame can be used by all devices. A deviceID is
> >    used exclusively by the host or a guest. I assume the ITS driver
> >    allocates/populates deviceid interrupt translation table featuring
> >    separate LPI spaces ie by construction different ITT cannot feature
> >    same LPIs. So no need to do the extended viability test.
> > 
> >    The MSI controller should have a property telling whether
> >    it supports interrupt translation. This kind of property currently
> >    exists on IOMMU side for INTEL remapping.
> > 
> 
> Hi Eric,
> 
> Would anyone be terribly upset if we simply assume the worst case
> scenario on GICv2m/M, have the IOMMU not claim IOMMU_CAP_INTR_REMAP, and
> require the user to opt-in via the allow_unsafe_interrupts on the
> vfio_iommu_type1 module?  That would make it very compatible with what
> we already do on x86, where it really is all or nothing.  
Meaning you either allow passthrough with unsafe multiplexing in every
flavor, or you don't allow it at all?
I didn't know such an option existed, but it seems to me that this fits
the bill exactly.
> My assumption
> is that GICv2 would be phased out in favor of GICv3, so there's always
> a hardware upgrade path to having more complete isolation, but the
> return on investment for figuring out whether a given device really has
> this sort of isolation seems pretty low.  Often users already have some
> degree of trust in the VMs they use for device assignment anyway.  An
> especially prudent user can still look at the hardware specs for their
> specific system to understand whether any devices are fully isolated
> and only make use of those for device assignment.  Does that seem like
> a reasonable alternative?
> 
It sounds good to me, that would allow us to release a GICv2m-based
solution for MSI passthrough on currently available hardware like the
Seattle.
Thanks,
-Christoffer
* ARM PCI/MSI KVM passthrough with GICv2M
  2016-02-08  9:48                     ` Christoffer Dall
@ 2016-02-08 13:27                       ` Eric Auger
  0 siblings, 0 replies; 32+ messages in thread
From: Eric Auger @ 2016-02-08 13:27 UTC (permalink / raw)
  To: linux-arm-kernel
Hi Alex, Christoffer,
On 02/08/2016 10:48 AM, Christoffer Dall wrote:
> On Fri, Feb 05, 2016 at 11:17:00AM -0700, Alex Williamson wrote:
>> On Fri, 5 Feb 2016 18:32:07 +0100
>> Eric Auger <eric.auger@linaro.org> wrote:
>>
>>> Hi Alex,
>>>
>>> I tried to sketch a proposal for guaranteeing the IRQ integrity when
>>> doing ARM PCI/MSI passthrough with ARM GICv2M msi-controller. This is
>>> based on extended VFIO group viability control, as detailed below.
>>>
>>> As opposed to ARM GICv3 ITS, this MSI controller does *not* support IRQ
>>> remapping. It can expose 1 or more 4kB MSI frame. Each frame contains a
>>> single register where the msi data is written.
>>>
>>> I would be grateful to you if you could tell me whether it makes any sense.
>>>
>>> Thanks in advance
>>>
>>> Best Regards
>>>
>>> Eric
>>>
>>>
>>> 1) GICv2m with a single 4kB single frame
>>>    all devices having this msi-controller as msi-parent share this
>>>    single MSI frame. Those devices can work on behalf of the host
>>>    or work on behalf of 1 or more guests (KVM assigned devices). We
>>>    must make sure either the host only or 1 single VM can access to the
>>>    single frame to guarantee interrupt integrity: a device assigned
>>>    to 1 VM should not be able to trigger MSI targeted to the host
>>>    or another VM.
>>>
>>>    I would propose to extend the VFIO notion of group viability.
>>>    Currently a VFIO group is viable if:
>>>    all devices belonging to the same group are bound to a VFIO driver
>>>    or unbound.
>>>
>>>    Let's imagine we extend the viability check as follows:
>>>
>>>    0) keep the current viable check: all the devices belonging to
>>>       the group must be vfio bound or unbound.
>>>    1) retrieve the MSI parent of the device and list all the
>>>       other devices using that MSI controller as MSI-parent (does not
>>>       look straightforward):
>>>    2) they must be VFIO driver bound or unbound as well (meaning
>>>       they are not used by the host). If not, reject device attachment
>>>    - in case they are VFIO bound (a VFIO group is set):
>>>      x if all VFIO containers are the same as the one of the device's
>>>        we try to attach, that's OK. This means the other devices
>>>        use different IOMMU mappings, eventually will target the
>>>        MSI frame but they all work for the same user space client/VM.
>>>      x 1 or more devices has a different container than the device
>>>        under attachment:
>>>        It works on behalf of a different user space client/VM,
>>>        we can't attach the new device. I think there is a case however
>>>        where several containers can be opened by a single QEMU.
>>>
>>> Of course the dynamic aspects, ie a new device showing up or an unbind
>>> event bring significant complexity.
>>>
>>> 2) GICv2M with multiple 4kB frames
>>>    Each msi-frame is enumerated as msi-controller. The device tree
>>>    statically defines which device is attached to each msi frame.
>>>    In case devices are assigned we cannot change this attachment
>>>    anyway since there might be physical constraints behind.
>>>    So devices likely to be assigned to guests should be linked to a
>>>    different MSI frame than devices that are not.
>>>
>>>    I think extended viability concept can be used as well.
>>>
>>>    This model still is not ideal: in case we have a SR-IOV device
>>>    plugged onto an host bridge attached to a single MSI parent you won't
>>>    be able anyway to have 1 Virtual Function working for host and 1 VF
>>>    working for a guest. Only Interrupt translation (ITS) will bring that
>>>    feature.
>>>
>>> 3) GICv3 ITS
>>>    This one supports interrupt translation service ~ Intel
>>>    IRQ remapping.
>>>    This means a single frame can be used by all devices. A deviceID is
>>>    used exclusively by the host or a guest. I assume the ITS driver
>>>    allocates/populates deviceid interrupt translation table featuring
>>>    separate LPI spaces ie by construction different ITT cannot feature
>>>    same LPIs. So no need to do the extended viability test.
>>>
>>>    The MSI controller should have a property telling whether
>>>    it supports interrupt translation. This kind of property currently
>>>    exists on IOMMU side for INTEL remapping.
>>>
>>
>> Hi Eric,
>>
>> Would anyone be terribly upset if we simply assume the worst case
>> scenario on GICv2m/M, have the IOMMU not claim IOMMU_CAP_INTR_REMAP, and
>> require the user to opt-in via the allow_unsafe_interrupts on the
>> vfio_iommu_type1 module?  That would make it very compatible with what
>> we already do on x86, where it really is all or nothing.  
> 
> meaning either you allow unsafe multiplexing with passthrough in every
> flavor (unsafely) or you don't allow it at all?
That's my understanding. If the IOMMU does not expose
IOMMU_CAP_INTR_REMAP, the end user must explicitly turn
allow_unsafe_interrupts on. On ARM we will have to handle the fact that
interrupt translation is done on the interrupt controller side and not on
the IOMMU side, though.
>
> I didn't know such on option existed, but it seems to me that this fits
> the bill exactly.
Well, I think support for multiple GICv2m MSI frames was devised to
allow safe interrupts, but extending the VFIO viability notion as
described above seems to be a lot of work for little benefit, since
I am afraid we don't have much HW featuring multiple frames. So I think
it is a good compromise to have a minimal integration with GICv2m and
the full feature set with the best-fitted HW, i.e. GICv3 ITS.
> 
> 
>> My assumption
>> is that GICv2 would be phased out in favor of GICv3, so there's always
>> a hardware upgrade path to having more complete isolation, but the
>> return on investment for figuring out whether a given device really has
>> this sort of isolation seems pretty low.  Often users already have some
>> degree of trust in the VMs they use for device assignment anyway.  An
>> especially prudent user can still look at the hardware specs for their
>> specific system to understand whether any devices are fully isolated
>> and only make use of those for device assignment.  Does that seem like
>> a reasonable alternative?
>>
> 
> It sounds good to me, that would allow us to release a GICv2m-based
> solution for MSI passthrough on currently available hardware like the
> Seattle.
Sounds good to me too. I am going to respin the kernel series according
to this discussion and previous comments.
Thanks for your comments!
Best Regards
Eric
> 
> Thanks,
> -Christoffer
> 
Thread overview: 32+ messages
2016-01-26 13:12 [PATCH 00/10] KVM PCIe/MSI passthrough on ARM/ARM64 Eric Auger
2016-01-26 13:12 ` [PATCH 01/10] iommu: Add DOMAIN_ATTR_MSI_MAPPING attribute Eric Auger
2016-01-26 13:12 ` [PATCH 02/10] vfio: expose MSI mapping requirement through VFIO_IOMMU_GET_INFO Eric Auger
2016-01-26 13:12 ` [PATCH 03/10] vfio_iommu_type1: add reserved binding RB tree management Eric Auger
2016-01-26 13:12 ` [PATCH 04/10] vfio: introduce VFIO_IOVA_RESERVED vfio_dma type Eric Auger
2016-01-26 13:12 ` [PATCH 05/10] vfio/type1: attach a reserved iova domain to vfio_domain Eric Auger
2016-01-26 13:12 ` [PATCH 06/10] vfio: introduce vfio_group_alloc_map_/unmap_free_reserved_iova Eric Auger
2016-01-26 16:17   ` kbuild test robot
2016-01-26 16:37     ` Eric Auger
2016-01-26 13:12 ` [PATCH 07/10] vfio: pci: cache the vfio_group in vfio_pci_device Eric Auger
2016-01-26 13:12 ` [PATCH 08/10] vfio: introduce vfio_group_require_msi_mapping Eric Auger
2016-01-26 13:12 ` [PATCH 09/10] vfio-pci: create an iommu mapping for msi address Eric Auger
2016-01-26 14:43   ` kbuild test robot
2016-01-26 15:14     ` Eric Auger
2016-01-26 13:12 ` [PATCH 10/10] vfio: allow the user to register reserved iova range for MSI mapping Eric Auger
2016-01-26 16:42   ` kbuild test robot
2016-01-26 18:32   ` kbuild test robot
2016-01-26 17:25 ` [PATCH 00/10] KVM PCIe/MSI passthrough on ARM/ARM64 Pavel Fedin
2016-01-27  8:52   ` Eric Auger
2016-01-28  7:13     ` Pavel Fedin
2016-01-28  9:50       ` Eric Auger
2016-01-28 21:51 ` Alex Williamson
2016-01-29 14:35   ` Eric Auger
2016-01-29 19:33     ` Alex Williamson
2016-01-29 21:25       ` Eric Auger
2016-02-01 14:03         ` Will Deacon
2016-02-03 12:50           ` Christoffer Dall
2016-02-03 13:10             ` Will Deacon
2016-02-03 15:36               ` Christoffer Dall
     [not found]                 ` <56B4DC97.60904@linaro.org>
2016-02-05 18:17                   ` ARM PCI/MSI KVM passthrough with GICv2M Alex Williamson
2016-02-08  9:48                     ` Christoffer Dall
2016-02-08 13:27                       ` Eric Auger