linux-s390.vger.kernel.org archive mirror
* [PATCH v5 0/3] virtio: Use the DMA API when appropriate
@ 2014-09-17  5:22 Andy Lutomirski
  2014-09-17  5:22 ` [PATCH v5 1/3] virtio_ring: Support DMA APIs if requested Andy Lutomirski
                   ` (3 more replies)
  0 siblings, 4 replies; 42+ messages in thread
From: Andy Lutomirski @ 2014-09-17  5:22 UTC (permalink / raw)
  To: Rusty Russell, Michael S. Tsirkin
  Cc: linux-s390, Konrad Rzeszutek Wilk, Benjamin Herrenschmidt,
	virtualization, Christian Borntraeger, Paolo Bonzini, linux390,
	Andy Lutomirski

This fixes virtio on Xen guests as well as on any other platform
that uses virtio_pci on which physical addresses don't match bus
addresses.

This can be tested with:

    virtme-run --xen xen --kimg arch/x86/boot/bzImage --console

using virtme from here:

    https://git.kernel.org/cgit/utils/kernel/virtme/virtme.git

Without these patches, the guest hangs forever.  With these patches,
everything works.

This should be safe on all platforms that I'm aware of, although I
may well have missed something.

Applies to net-next.

Changes from v4:
 - Rebased onto net-next.
 - Dropped the sg end mark changes from virtio_net, as that issue
   is solved differently in net-next.
 - virtio_pci does not use the DMA API on powerpc, for reasons that
   are explained in a detailed comment in virtio_ring.

Changes from v3:
 - virtio_pci only asks virtio_ring to use the DMA API if
   !PCI_DMA_BUS_IS_PHYS.
 - Reduce tools/virtio breakage.  It's now merely as broken as before
   instead of being even more broken.
 - Drop the sg_next changes -- Rusty's version is better.

Changes from v2:
 - Reordered patches.
 - Fixed a virtio_net OOPS.

Changes from v1:
 - Using the DMA API is optional now.  It would be nice to improve the
   DMA API to the point that it could be used unconditionally, but s390
   proves that we're not there yet.
 - Includes patch 4, which fixes DMA debugging warnings from virtio_net.

Andy Lutomirski (3):
  virtio_ring: Support DMA APIs if requested
  virtio_pci: Use the DMA API for virtqueues when possible
  virtio_net: Stop doing DMA from the stack

 drivers/lguest/lguest_device.c         |   3 +-
 drivers/misc/mic/card/mic_virtio.c     |   2 +-
 drivers/net/virtio_net.c               |  53 ++++++---
 drivers/remoteproc/remoteproc_virtio.c |   4 +-
 drivers/s390/kvm/kvm_virtio.c          |   2 +-
 drivers/s390/kvm/virtio_ccw.c          |   4 +-
 drivers/virtio/virtio_mmio.c           |   5 +-
 drivers/virtio/virtio_pci.c            |  91 ++++++++++++++--
 drivers/virtio/virtio_ring.c           | 194 +++++++++++++++++++++++++++------
 include/linux/virtio_ring.h            |   1 +
 tools/virtio/linux/dma-mapping.h       |  17 +++
 tools/virtio/linux/virtio.h            |   1 +
 tools/virtio/virtio_test.c             |   2 +-
 tools/virtio/vringh_test.c             |   3 +-
 14 files changed, 314 insertions(+), 68 deletions(-)
 create mode 100644 tools/virtio/linux/dma-mapping.h

-- 
1.9.3

^ permalink raw reply	[flat|nested] 42+ messages in thread

* [PATCH v5 1/3] virtio_ring: Support DMA APIs if requested
  2014-09-17  5:22 [PATCH v5 0/3] virtio: Use the DMA API when appropriate Andy Lutomirski
@ 2014-09-17  5:22 ` Andy Lutomirski
  2014-09-17  5:22 ` [PATCH v5 2/3] virtio_pci: Use the DMA API for virtqueues when possible Andy Lutomirski
                   ` (2 subsequent siblings)
  3 siblings, 0 replies; 42+ messages in thread
From: Andy Lutomirski @ 2014-09-17  5:22 UTC (permalink / raw)
  To: Rusty Russell, Michael S. Tsirkin
  Cc: linux-s390, Konrad Rzeszutek Wilk, Benjamin Herrenschmidt,
	virtualization, Christian Borntraeger, Paolo Bonzini, linux390,
	Andy Lutomirski

virtio_ring currently sends the device (usually a hypervisor)
physical addresses of its I/O buffers.  This is okay when DMA
addresses and physical addresses are the same thing, but this isn't
always the case.  For example, this never works on Xen guests, and
it is likely to fail if a physical "virtio" device ever ends up
behind an IOMMU or swiotlb.

The immediate use case for me is to enable virtio on Xen guests.
For that to work, we need vring to support DMA address translation
as well as a corresponding change to virtio_pci or to another
driver.

With this patch, if enabled, virtfs survives kmemleak and
CONFIG_DMA_API_DEBUG.  virtio-net warns (correctly) about DMA from
the stack in virtnet_set_rx_mode.

This explicitly supports !CONFIG_HAS_DMA.  If vring is asked to use
the DMA API and CONFIG_HAS_DMA is not set, then vring will refuse to
create the virtqueue.
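
For illustration only (not part of this patch): a transport that wants
translated addresses would allocate its ring via the DMA API and pass
use_dma_api = true.  In the sketch below, everything named my_*, the
choice of PAGE_SIZE for the ring alignment, and the omitted error
unwinding are placeholders:

	#include <linux/dma-mapping.h>
	#include <linux/virtio.h>
	#include <linux/virtio_ring.h>

	static bool my_notify(struct virtqueue *vq);
	static void my_callback(struct virtqueue *vq);

	/* Hypothetical transport setup; not taken from any real driver. */
	static struct virtqueue *my_setup_vq(struct virtio_device *vdev,
					     unsigned int index,
					     unsigned int num)
	{
		size_t size = PAGE_ALIGN(vring_size(num, PAGE_SIZE));
		dma_addr_t ring_dma;
		void *ring;

		/* Allocate the ring itself through the DMA API. */
		ring = dma_zalloc_coherent(vdev->dev.parent, size,
					   &ring_dma, GFP_KERNEL);
		if (!ring)
			return NULL;

		/* ... tell the device about ring_dma (transport-specific) ... */

		/*
		 * use_dma_api = true makes vring map each buffer with the
		 * DMA API; the call returns NULL if CONFIG_HAS_DMA is not
		 * set.
		 */
		return vring_new_virtqueue(index, num, PAGE_SIZE, vdev,
					   true /* weak_barriers */,
					   true /* use_dma_api */,
					   ring, my_notify, my_callback,
					   "my-vq");
	}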

Signed-off-by: Andy Lutomirski <luto@amacapital.net>
---
 drivers/lguest/lguest_device.c         |   3 +-
 drivers/misc/mic/card/mic_virtio.c     |   2 +-
 drivers/remoteproc/remoteproc_virtio.c |   4 +-
 drivers/s390/kvm/kvm_virtio.c          |   2 +-
 drivers/s390/kvm/virtio_ccw.c          |   4 +-
 drivers/virtio/virtio_mmio.c           |   5 +-
 drivers/virtio/virtio_pci.c            |   3 +-
 drivers/virtio/virtio_ring.c           | 194 +++++++++++++++++++++++++++------
 include/linux/virtio_ring.h            |   1 +
 tools/virtio/linux/dma-mapping.h       |  17 +++
 tools/virtio/linux/virtio.h            |   1 +
 tools/virtio/virtio_test.c             |   2 +-
 tools/virtio/vringh_test.c             |   3 +-
 13 files changed, 198 insertions(+), 43 deletions(-)
 create mode 100644 tools/virtio/linux/dma-mapping.h

diff --git a/drivers/lguest/lguest_device.c b/drivers/lguest/lguest_device.c
index d0a1d8a45c81..f0eafbe82ed4 100644
--- a/drivers/lguest/lguest_device.c
+++ b/drivers/lguest/lguest_device.c
@@ -301,7 +301,8 @@ static struct virtqueue *lg_find_vq(struct virtio_device *vdev,
 	 * barriers.
 	 */
 	vq = vring_new_virtqueue(index, lvq->config.num, LGUEST_VRING_ALIGN, vdev,
-				 true, lvq->pages, lg_notify, callback, name);
+				 true, false, lvq->pages,
+				 lg_notify, callback, name);
 	if (!vq) {
 		err = -ENOMEM;
 		goto unmap;
diff --git a/drivers/misc/mic/card/mic_virtio.c b/drivers/misc/mic/card/mic_virtio.c
index f14b60080c21..d633964417b1 100644
--- a/drivers/misc/mic/card/mic_virtio.c
+++ b/drivers/misc/mic/card/mic_virtio.c
@@ -256,7 +256,7 @@ static struct virtqueue *mic_find_vq(struct virtio_device *vdev,
 	mvdev->vr[index] = va;
 	memset_io(va, 0x0, _vr_size);
 	vq = vring_new_virtqueue(index, le16_to_cpu(config.num),
-				 MIC_VIRTIO_RING_ALIGN, vdev, false,
+				 MIC_VIRTIO_RING_ALIGN, vdev, false, false,
 				 (void __force *)va, mic_notify, callback,
 				 name);
 	if (!vq) {
diff --git a/drivers/remoteproc/remoteproc_virtio.c b/drivers/remoteproc/remoteproc_virtio.c
index a34b50690b4e..e31f2fefa76e 100644
--- a/drivers/remoteproc/remoteproc_virtio.c
+++ b/drivers/remoteproc/remoteproc_virtio.c
@@ -107,8 +107,8 @@ static struct virtqueue *rp_find_vq(struct virtio_device *vdev,
 	 * Create the new vq, and tell virtio we're not interested in
 	 * the 'weak' smp barriers, since we're talking with a real device.
 	 */
-	vq = vring_new_virtqueue(id, len, rvring->align, vdev, false, addr,
-					rproc_virtio_notify, callback, name);
+	vq = vring_new_virtqueue(id, len, rvring->align, vdev, false, false,
+				 addr, rproc_virtio_notify, callback, name);
 	if (!vq) {
 		dev_err(dev, "vring_new_virtqueue %s failed\n", name);
 		rproc_free_vring(rvring);
diff --git a/drivers/s390/kvm/kvm_virtio.c b/drivers/s390/kvm/kvm_virtio.c
index a1349653c6d9..91abcdc196d0 100644
--- a/drivers/s390/kvm/kvm_virtio.c
+++ b/drivers/s390/kvm/kvm_virtio.c
@@ -206,7 +206,7 @@ static struct virtqueue *kvm_find_vq(struct virtio_device *vdev,
 		goto out;
 
 	vq = vring_new_virtqueue(index, config->num, KVM_S390_VIRTIO_RING_ALIGN,
-				 vdev, true, (void *) config->address,
+				 vdev, true, false, (void *) config->address,
 				 kvm_notify, callback, name);
 	if (!vq) {
 		err = -ENOMEM;
diff --git a/drivers/s390/kvm/virtio_ccw.c b/drivers/s390/kvm/virtio_ccw.c
index d2c0b442bce5..2462a443358a 100644
--- a/drivers/s390/kvm/virtio_ccw.c
+++ b/drivers/s390/kvm/virtio_ccw.c
@@ -478,8 +478,8 @@ static struct virtqueue *virtio_ccw_setup_vq(struct virtio_device *vdev,
 	}
 
 	vq = vring_new_virtqueue(i, info->num, KVM_VIRTIO_CCW_RING_ALIGN, vdev,
-				 true, info->queue, virtio_ccw_kvm_notify,
-				 callback, name);
+				 true, false, info->queue,
+				 virtio_ccw_kvm_notify, callback, name);
 	if (!vq) {
 		/* For now, we fail if we can't get the requested size. */
 		dev_warn(&vcdev->cdev->dev, "no vq\n");
diff --git a/drivers/virtio/virtio_mmio.c b/drivers/virtio/virtio_mmio.c
index c600ccfd6922..693254e52a5d 100644
--- a/drivers/virtio/virtio_mmio.c
+++ b/drivers/virtio/virtio_mmio.c
@@ -366,8 +366,9 @@ static struct virtqueue *vm_setup_vq(struct virtio_device *vdev, unsigned index,
 			vm_dev->base + VIRTIO_MMIO_QUEUE_PFN);
 
 	/* Create the vring */
-	vq = vring_new_virtqueue(index, info->num, VIRTIO_MMIO_VRING_ALIGN, vdev,
-				 true, info->queue, vm_notify, callback, name);
+	vq = vring_new_virtqueue(index, info->num, VIRTIO_MMIO_VRING_ALIGN,
+				 vdev, true, false, info->queue,
+				 vm_notify, callback, name);
 	if (!vq) {
 		err = -ENOMEM;
 		goto error_new_virtqueue;
diff --git a/drivers/virtio/virtio_pci.c b/drivers/virtio/virtio_pci.c
index 3d1463c6b120..a1f299fa4626 100644
--- a/drivers/virtio/virtio_pci.c
+++ b/drivers/virtio/virtio_pci.c
@@ -430,7 +430,8 @@ static struct virtqueue *setup_vq(struct virtio_device *vdev, unsigned index,
 
 	/* create the vring */
 	vq = vring_new_virtqueue(index, info->num, VIRTIO_PCI_VRING_ALIGN, vdev,
-				 true, info->queue, vp_notify, callback, name);
+				 true, false, info->queue,
+				 vp_notify, callback, name);
 	if (!vq) {
 		err = -ENOMEM;
 		goto out_activate_queue;
diff --git a/drivers/virtio/virtio_ring.c b/drivers/virtio/virtio_ring.c
index 3b1f89b6e743..babcaea1afa0 100644
--- a/drivers/virtio/virtio_ring.c
+++ b/drivers/virtio/virtio_ring.c
@@ -24,6 +24,7 @@
 #include <linux/module.h>
 #include <linux/hrtimer.h>
 #include <linux/kmemleak.h>
+#include <linux/dma-mapping.h>
 
 #ifdef DEBUG
 /* For development, we want to crash whenever the ring is screwed. */
@@ -54,6 +55,12 @@
 #define END_USE(vq)
 #endif
 
+struct vring_desc_state
+{
+	void *data;			/* Data for callback. */
+	struct vring_desc *indir_desc;	/* Indirect descriptor, if any. */
+};
+
 struct vring_virtqueue
 {
 	struct virtqueue vq;
@@ -64,6 +71,9 @@ struct vring_virtqueue
 	/* Can we use weak barriers? */
 	bool weak_barriers;
 
+	/* Should we use the DMA API? */
+	bool use_dma_api;
+
 	/* Other side has made a mess, don't try any more. */
 	bool broken;
 
@@ -93,12 +103,81 @@ struct vring_virtqueue
 	ktime_t last_add_time;
 #endif
 
-	/* Tokens for callbacks. */
-	void *data[];
+	/* Per-descriptor state. */
+	struct vring_desc_state desc_state[];
 };
 
 #define to_vvq(_vq) container_of(_vq, struct vring_virtqueue, vq)
 
+/* Map one sg entry. */
+static dma_addr_t vring_map_one_sg(const struct vring_virtqueue *vq,
+				   struct scatterlist *sg,
+				   enum dma_data_direction direction)
+{
+#ifdef CONFIG_HAS_DMA
+	/*
+	 * We can't use dma_map_sg, because we don't use scatterlists in
+	 * the way it expects (we sometimes use unterminated
+	 * scatterlists, and we don't guarantee that the scatterlist
+	 * will exist for the lifetime of the mapping).
+	 */
+	if (vq->use_dma_api)
+		return dma_map_page(vq->vq.vdev->dev.parent,
+				    sg_page(sg), sg->offset, sg->length,
+				    direction);
+#endif
+
+	return sg_phys(sg);
+}
+
+static dma_addr_t vring_map_single(const struct vring_virtqueue *vq,
+				   void *cpu_addr, size_t size,
+				   enum dma_data_direction direction)
+{
+#ifdef CONFIG_HAS_DMA
+	if (vq->use_dma_api)
+		return dma_map_single(vq->vq.vdev->dev.parent,
+				      cpu_addr, size,
+				      direction);
+#endif
+
+	/* avoid kmemleak false positive (hidden by virt_to_phys) */
+	kmemleak_ignore(cpu_addr);
+	return virt_to_phys(cpu_addr);
+}
+
+static void vring_unmap_one(const struct vring_virtqueue *vq,
+			    struct vring_desc *desc)
+{
+#ifdef CONFIG_HAS_DMA
+	if (!vq->use_dma_api)
+		return;		/* Nothing to do. */
+
+	if (desc->flags & VRING_DESC_F_INDIRECT) {
+		dma_unmap_single(vq->vq.vdev->dev.parent,
+				 desc->addr, desc->len,
+				 (desc->flags & VRING_DESC_F_WRITE) ?
+				 DMA_FROM_DEVICE : DMA_TO_DEVICE);
+	} else {
+		dma_unmap_page(vq->vq.vdev->dev.parent,
+			       desc->addr, desc->len,
+			       (desc->flags & VRING_DESC_F_WRITE) ?
+			       DMA_FROM_DEVICE : DMA_TO_DEVICE);
+	}
+#endif
+}
+
+static int vring_mapping_error(const struct vring_virtqueue *vq,
+			       dma_addr_t addr)
+{
+#ifdef CONFIG_HAS_DMA
+	return vq->use_dma_api &&
+		dma_mapping_error(vq->vq.vdev->dev.parent, addr);
+#else
+	return 0;
+#endif
+}
+
 static struct vring_desc *alloc_indirect(unsigned int total_sg, gfp_t gfp)
 {
 	struct vring_desc *desc;
@@ -131,7 +210,7 @@ static inline int virtqueue_add(struct virtqueue *_vq,
 	struct vring_virtqueue *vq = to_vvq(_vq);
 	struct scatterlist *sg;
 	struct vring_desc *desc;
-	unsigned int i, n, avail, descs_used, uninitialized_var(prev);
+	unsigned int i, n, avail, descs_used, uninitialized_var(prev), err_idx;
 	int head;
 	bool indirect;
 
@@ -171,21 +250,25 @@ static inline int virtqueue_add(struct virtqueue *_vq,
 
 	if (desc) {
 		/* Use a single buffer which doesn't continue */
+		indirect = true;
 		vq->vring.desc[head].flags = VRING_DESC_F_INDIRECT;
-		vq->vring.desc[head].addr = virt_to_phys(desc);
-		/* avoid kmemleak false positive (hidden by virt_to_phys) */
-		kmemleak_ignore(desc);
+		vq->vring.desc[head].addr = vring_map_single(
+			vq,
+			desc, total_sg * sizeof(struct vring_desc),
+			DMA_TO_DEVICE);
+		if (vring_mapping_error(vq, vq->vring.desc[head].addr))
+			goto unmap_free_indir;
+
 		vq->vring.desc[head].len = total_sg * sizeof(struct vring_desc);
 
 		/* Set up rest to use this indirect table. */
 		i = 0;
 		descs_used = 1;
-		indirect = true;
 	} else {
+		indirect = false;
 		desc = vq->vring.desc;
 		i = head;
 		descs_used = total_sg;
-		indirect = false;
 	}
 
 	if (vq->vq.num_free < descs_used) {
@@ -200,14 +283,13 @@ static inline int virtqueue_add(struct virtqueue *_vq,
 		return -ENOSPC;
 	}
 
-	/* We're about to use some buffers from the free list. */
-	vq->vq.num_free -= descs_used;
-
 	for (n = 0; n < out_sgs; n++) {
 		for (sg = sgs[n]; sg; sg = sg_next(sg)) {
 			desc[i].flags = VRING_DESC_F_NEXT;
-			desc[i].addr = sg_phys(sg);
+			desc[i].addr = vring_map_one_sg(vq, sg, DMA_TO_DEVICE);
 			desc[i].len = sg->length;
+			if (vring_mapping_error(vq, desc[i].addr))
+				goto unmap_release;
 			prev = i;
 			i = desc[i].next;
 		}
@@ -215,8 +297,10 @@ static inline int virtqueue_add(struct virtqueue *_vq,
 	for (; n < (out_sgs + in_sgs); n++) {
 		for (sg = sgs[n]; sg; sg = sg_next(sg)) {
 			desc[i].flags = VRING_DESC_F_NEXT|VRING_DESC_F_WRITE;
-			desc[i].addr = sg_phys(sg);
+			desc[i].addr = vring_map_one_sg(vq, sg, DMA_FROM_DEVICE);
 			desc[i].len = sg->length;
+			if (vring_mapping_error(vq, desc[i].addr))
+				goto unmap_release;
 			prev = i;
 			i = desc[i].next;
 		}
@@ -224,14 +308,19 @@ static inline int virtqueue_add(struct virtqueue *_vq,
 	/* Last one doesn't continue. */
 	desc[prev].flags &= ~VRING_DESC_F_NEXT;
 
+	/* We're using some buffers from the free list. */
+	vq->vq.num_free -= descs_used;
+
 	/* Update free pointer */
 	if (indirect)
 		vq->free_head = vq->vring.desc[head].next;
 	else
 		vq->free_head = i;
 
-	/* Set token. */
-	vq->data[head] = data;
+	/* Store token and indirect buffer state. */
+	vq->desc_state[head].data = data;
+	if (indirect)
+		vq->desc_state[head].indir_desc = desc;
 
 	/* Put entry in available array (but don't update avail->idx until they
 	 * do sync). */
@@ -253,6 +342,28 @@ static inline int virtqueue_add(struct virtqueue *_vq,
 	END_USE(vq);
 
 	return 0;
+
+unmap_release:
+	err_idx = i;
+	i = head;
+
+	for (n = 0; n < total_sg; n++) {
+		if (i == err_idx)
+			break;
+		vring_unmap_one(vq, &desc[i]);
+		i = vq->vring.desc[i].next;
+	}
+
+	vq->vq.num_free += total_sg;
+
+	if (indirect)
+		vring_unmap_one(vq, desc);
+
+unmap_free_indir:
+	if (indirect)
+		kfree(desc);
+
+	return -EIO;
 }
 
 /**
@@ -423,27 +534,42 @@ EXPORT_SYMBOL_GPL(virtqueue_kick);
 
 static void detach_buf(struct vring_virtqueue *vq, unsigned int head)
 {
-	unsigned int i;
+	unsigned int i, j;
 
 	/* Clear data ptr. */
-	vq->data[head] = NULL;
+	vq->desc_state[head].data = NULL;
 
-	/* Put back on free list: find end */
+	/* Put back on free list: unmap first-level descriptors and find end */
 	i = head;
 
-	/* Free the indirect table */
-	if (vq->vring.desc[i].flags & VRING_DESC_F_INDIRECT)
-		kfree(phys_to_virt(vq->vring.desc[i].addr));
-
 	while (vq->vring.desc[i].flags & VRING_DESC_F_NEXT) {
+		vring_unmap_one(vq, &vq->vring.desc[i]);
 		i = vq->vring.desc[i].next;
 		vq->vq.num_free++;
 	}
 
+	vring_unmap_one(vq, &vq->vring.desc[i]);
 	vq->vring.desc[i].next = vq->free_head;
 	vq->free_head = head;
+
 	/* Plus final descriptor */
 	vq->vq.num_free++;
+
+	/* Free the indirect table, if any, now that it's unmapped. */
+	if (vq->desc_state[head].indir_desc) {
+		struct vring_desc *indir_desc = vq->desc_state[head].indir_desc;
+		u32 len = vq->vring.desc[head].len;
+
+		BUG_ON(!(vq->vring.desc[head].flags & VRING_DESC_F_INDIRECT));
+		BUG_ON(len == 0 || len % sizeof(struct vring_desc));
+
+		if (vq->use_dma_api)
+			for (j = 0; j < len / sizeof(struct vring_desc); j++)
+				vring_unmap_one(vq, &indir_desc[j]);
+
+		kfree(vq->desc_state[head].indir_desc);
+		vq->desc_state[head].indir_desc = NULL;
+	}
 }
 
 static inline bool more_used(const struct vring_virtqueue *vq)
@@ -498,13 +624,13 @@ void *virtqueue_get_buf(struct virtqueue *_vq, unsigned int *len)
 		BAD_RING(vq, "id %u out of range\n", i);
 		return NULL;
 	}
-	if (unlikely(!vq->data[i])) {
+	if (unlikely(!vq->desc_state[i].data)) {
 		BAD_RING(vq, "id %u is not a head!\n", i);
 		return NULL;
 	}
 
 	/* detach_buf clears data, so grab it now. */
-	ret = vq->data[i];
+	ret = vq->desc_state[i].data;
 	detach_buf(vq, i);
 	vq->last_used_idx++;
 	/* If we expect an interrupt for the next entry, tell host
@@ -665,10 +791,10 @@ void *virtqueue_detach_unused_buf(struct virtqueue *_vq)
 	START_USE(vq);
 
 	for (i = 0; i < vq->vring.num; i++) {
-		if (!vq->data[i])
+		if (!vq->desc_state[i].data)
 			continue;
 		/* detach_buf clears data, so grab it now. */
-		buf = vq->data[i];
+		buf = vq->desc_state[i].data;
 		detach_buf(vq, i);
 		vq->vring.avail->idx--;
 		END_USE(vq);
@@ -707,6 +833,7 @@ struct virtqueue *vring_new_virtqueue(unsigned int index,
 				      unsigned int vring_align,
 				      struct virtio_device *vdev,
 				      bool weak_barriers,
+				      bool use_dma_api,
 				      void *pages,
 				      bool (*notify)(struct virtqueue *),
 				      void (*callback)(struct virtqueue *),
@@ -721,7 +848,13 @@ struct virtqueue *vring_new_virtqueue(unsigned int index,
 		return NULL;
 	}
 
-	vq = kmalloc(sizeof(*vq) + sizeof(void *)*num, GFP_KERNEL);
+#ifndef CONFIG_HAS_DMA
+	if (use_dma_api)
+		return NULL;
+#endif
+
+	vq = kmalloc(sizeof(*vq) + num * sizeof(struct vring_desc_state),
+		     GFP_KERNEL);
 	if (!vq)
 		return NULL;
 
@@ -733,6 +866,7 @@ struct virtqueue *vring_new_virtqueue(unsigned int index,
 	vq->vq.index = index;
 	vq->notify = notify;
 	vq->weak_barriers = weak_barriers;
+	vq->use_dma_api = use_dma_api;
 	vq->broken = false;
 	vq->last_used_idx = 0;
 	vq->num_added = 0;
@@ -751,11 +885,9 @@ struct virtqueue *vring_new_virtqueue(unsigned int index,
 
 	/* Put everything in free lists. */
 	vq->free_head = 0;
-	for (i = 0; i < num-1; i++) {
+	for (i = 0; i < num-1; i++)
 		vq->vring.desc[i].next = i+1;
-		vq->data[i] = NULL;
-	}
-	vq->data[i] = NULL;
+	memset(vq->desc_state, 0, num * sizeof(struct vring_desc_state));
 
 	return &vq->vq;
 }
diff --git a/include/linux/virtio_ring.h b/include/linux/virtio_ring.h
index 67e06fe18c03..60f761a38a09 100644
--- a/include/linux/virtio_ring.h
+++ b/include/linux/virtio_ring.h
@@ -70,6 +70,7 @@ struct virtqueue *vring_new_virtqueue(unsigned int index,
 				      unsigned int vring_align,
 				      struct virtio_device *vdev,
 				      bool weak_barriers,
+				      bool use_dma_api,
 				      void *pages,
 				      bool (*notify)(struct virtqueue *vq),
 				      void (*callback)(struct virtqueue *vq),
diff --git a/tools/virtio/linux/dma-mapping.h b/tools/virtio/linux/dma-mapping.h
new file mode 100644
index 000000000000..4f93af89ae16
--- /dev/null
+++ b/tools/virtio/linux/dma-mapping.h
@@ -0,0 +1,17 @@
+#ifndef _LINUX_DMA_MAPPING_H
+#define _LINUX_DMA_MAPPING_H
+
+#ifdef CONFIG_HAS_DMA
+# error Virtio userspace code does not support CONFIG_HAS_DMA
+#endif
+
+#define PCI_DMA_BUS_IS_PHYS 1
+
+enum dma_data_direction {
+	DMA_BIDIRECTIONAL = 0,
+	DMA_TO_DEVICE = 1,
+	DMA_FROM_DEVICE = 2,
+	DMA_NONE = 3,
+};
+
+#endif
diff --git a/tools/virtio/linux/virtio.h b/tools/virtio/linux/virtio.h
index 5a2d1f0f6bc7..5d42dc6a6201 100644
--- a/tools/virtio/linux/virtio.h
+++ b/tools/virtio/linux/virtio.h
@@ -78,6 +78,7 @@ struct virtqueue *vring_new_virtqueue(unsigned int index,
 				      unsigned int vring_align,
 				      struct virtio_device *vdev,
 				      bool weak_barriers,
+				      bool use_dma_api,
 				      void *pages,
 				      bool (*notify)(struct virtqueue *vq),
 				      void (*callback)(struct virtqueue *vq),
diff --git a/tools/virtio/virtio_test.c b/tools/virtio/virtio_test.c
index 00ea679b3826..860cc89900a7 100644
--- a/tools/virtio/virtio_test.c
+++ b/tools/virtio/virtio_test.c
@@ -99,7 +99,7 @@ static void vq_info_add(struct vdev_info *dev, int num)
 	vring_init(&info->vring, num, info->ring, 4096);
 	info->vq = vring_new_virtqueue(info->idx,
 				       info->vring.num, 4096, &dev->vdev,
-				       true, info->ring,
+				       true, false, info->ring,
 				       vq_notify, vq_callback, "test");
 	assert(info->vq);
 	info->vq->priv = info;
diff --git a/tools/virtio/vringh_test.c b/tools/virtio/vringh_test.c
index 14a4f4cab5b9..67d3c3a1ba88 100644
--- a/tools/virtio/vringh_test.c
+++ b/tools/virtio/vringh_test.c
@@ -312,7 +312,8 @@ static int parallel_test(unsigned long features,
 		if (sched_setaffinity(getpid(), sizeof(cpu_set), &cpu_set))
 			err(1, "Could not set affinity to cpu %u", first_cpu);
 
-		vq = vring_new_virtqueue(0, RINGSIZE, ALIGN, &gvdev.vdev, true,
+		vq = vring_new_virtqueue(0, RINGSIZE, ALIGN, &gvdev.vdev,
+					 true, false,
 					 guest_map, fast_vringh ? no_notify_host
 					 : parallel_notify_host,
 					 never_callback_guest, "guest vq");
-- 
1.9.3

^ permalink raw reply related	[flat|nested] 42+ messages in thread

* [PATCH v5 2/3] virtio_pci: Use the DMA API for virtqueues when possible
  2014-09-17  5:22 [PATCH v5 0/3] virtio: Use the DMA API when appropriate Andy Lutomirski
  2014-09-17  5:22 ` [PATCH v5 1/3] virtio_ring: Support DMA APIs if requested Andy Lutomirski
@ 2014-09-17  5:22 ` Andy Lutomirski
  2014-09-17 12:02   ` Benjamin Herrenschmidt
  2014-09-17 16:09   ` Ira W. Snyder
  2014-09-17  5:22 ` [PATCH v5 3/3] virtio_net: Stop doing DMA from the stack Andy Lutomirski
  2014-09-19 18:25 ` [PATCH v5 0/3] virtio: Use the DMA API when appropriate Konrad Rzeszutek Wilk
  3 siblings, 2 replies; 42+ messages in thread
From: Andy Lutomirski @ 2014-09-17  5:22 UTC (permalink / raw)
  To: Rusty Russell, Michael S. Tsirkin
  Cc: linux-s390, Konrad Rzeszutek Wilk, Benjamin Herrenschmidt,
	virtualization, Christian Borntraeger, Paolo Bonzini, linux390,
	Andy Lutomirski

On non-PPC systems, virtio_pci should use the DMA API.  This fixes
virtio_pci on Xen.  On PPC, using the DMA API would break things, so
we need to preserve the old behavior.

The big comment in this patch explains the considerations in more
detail.

Signed-off-by: Andy Lutomirski <luto@amacapital.net>
---
 drivers/virtio/virtio_pci.c | 90 ++++++++++++++++++++++++++++++++++++++++-----
 1 file changed, 81 insertions(+), 9 deletions(-)

diff --git a/drivers/virtio/virtio_pci.c b/drivers/virtio/virtio_pci.c
index a1f299fa4626..8ddb0a641878 100644
--- a/drivers/virtio/virtio_pci.c
+++ b/drivers/virtio/virtio_pci.c
@@ -80,8 +80,10 @@ struct virtio_pci_vq_info
 	/* the number of entries in the queue */
 	int num;
 
-	/* the virtual address of the ring queue */
-	void *queue;
+	/* the ring queue */
+	void *queue;			/* virtual address */
+	dma_addr_t queue_dma_addr;	/* bus address */
+	bool use_dma_api;		/* are we using the DMA API? */
 
 	/* the list node for the virtqueues list */
 	struct list_head node;
@@ -388,6 +390,50 @@ static int vp_request_intx(struct virtio_device *vdev)
 	return err;
 }
 
+static bool vp_use_dma_api(void)
+{
+	/*
+	 * Due to limitations of the DMA API, we only have two choices:
+	 * use the DMA API (e.g. set up IOMMU mappings or apply Xen's
+	 * physical-to-machine translation) or use direct physical
+	 * addressing.  Furthermore, there's no sensible way yet for the
+	 * PCI bus code to tell us whether we're supposed to act like a
+	 * normal PCI device (and use the DMA API) or to do something
+	 * else.  So we're stuck with heuristics here.
+	 *
+	 * In general, we would prefer to use the DMA API, since we
+	 * might be driving a physical device, and such devices *must*
+	 * use the DMA API if there is an IOMMU involved.
+	 *
+	 * On x86, there are no physically-mapped emulated virtio PCI
+	 * devices that live behind an IOMMU.  On ARM, there don't seem
+	 * to be any hypervisors that use virtio_pci (as opposed to
+	 * virtio_mmio) that also emulate an IOMMU.  So using the DMA
+	 * API is safe.
+	 *
+	 * On PowerPC, it's the other way around.  There usually is an
+	 * IOMMU between us and the virtio PCI device, but the device is
+	 * probably emulated and ignores the IOMMU.  Unfortunately, we
+	 * can't tell whether we're talking to an emulated device or to
+	 * a physical device that really lives behind the IOMMU.  That
+	 * means that we're stuck with ignoring the DMA API.
+	 */
+
+#ifdef CONFIG_PPC
+	return false;
+#else
+	/*
+	 * Minor optimization: if the platform promises to have physical
+	 * PCI DMA, we turn off DMA mapping in virtio_ring.  If the
+	 * platform's DMA API implementation is well optimized, this
+	 * should have almost no effect, but we already have a branch in
+	 * the vring code, and we can avoid any further indirection with
+	 * very little effort.
+	 */
+	return !PCI_DMA_BUS_IS_PHYS;
+#endif
+}
+
 static struct virtqueue *setup_vq(struct virtio_device *vdev, unsigned index,
 				  void (*callback)(struct virtqueue *vq),
 				  const char *name,
@@ -416,21 +462,30 @@ static struct virtqueue *setup_vq(struct virtio_device *vdev, unsigned index,
 
 	info->num = num;
 	info->msix_vector = msix_vec;
+	info->use_dma_api = vp_use_dma_api();
 
-	size = PAGE_ALIGN(vring_size(num, VIRTIO_PCI_VRING_ALIGN));
-	info->queue = alloc_pages_exact(size, GFP_KERNEL|__GFP_ZERO);
+	size = vring_size(num, VIRTIO_PCI_VRING_ALIGN);
+	if (info->use_dma_api) {
+		info->queue = dma_zalloc_coherent(vdev->dev.parent, size,
+						  &info->queue_dma_addr,
+						  GFP_KERNEL);
+	} else {
+		info->queue = alloc_pages_exact(PAGE_ALIGN(size),
+						GFP_KERNEL|__GFP_ZERO);
+		info->queue_dma_addr = virt_to_phys(info->queue);
+	}
 	if (info->queue == NULL) {
 		err = -ENOMEM;
 		goto out_info;
 	}
 
 	/* activate the queue */
-	iowrite32(virt_to_phys(info->queue) >> VIRTIO_PCI_QUEUE_ADDR_SHIFT,
+	iowrite32(info->queue_dma_addr >> VIRTIO_PCI_QUEUE_ADDR_SHIFT,
 		  vp_dev->ioaddr + VIRTIO_PCI_QUEUE_PFN);
 
 	/* create the vring */
 	vq = vring_new_virtqueue(index, info->num, VIRTIO_PCI_VRING_ALIGN, vdev,
-				 true, false, info->queue,
+				 true, info->use_dma_api, info->queue,
 				 vp_notify, callback, name);
 	if (!vq) {
 		err = -ENOMEM;
@@ -463,7 +518,12 @@ out_assign:
 	vring_del_virtqueue(vq);
 out_activate_queue:
 	iowrite32(0, vp_dev->ioaddr + VIRTIO_PCI_QUEUE_PFN);
-	free_pages_exact(info->queue, size);
+	if (info->use_dma_api) {
+		dma_free_coherent(vdev->dev.parent, size,
+				  info->queue, info->queue_dma_addr);
+	} else {
+		free_pages_exact(info->queue, PAGE_ALIGN(size));
+	}
 out_info:
 	kfree(info);
 	return ERR_PTR(err);
@@ -493,8 +553,13 @@ static void vp_del_vq(struct virtqueue *vq)
 	/* Select and deactivate the queue */
 	iowrite32(0, vp_dev->ioaddr + VIRTIO_PCI_QUEUE_PFN);
 
-	size = PAGE_ALIGN(vring_size(info->num, VIRTIO_PCI_VRING_ALIGN));
-	free_pages_exact(info->queue, size);
+	size = vring_size(info->num, VIRTIO_PCI_VRING_ALIGN);
+	if (info->use_dma_api) {
+		dma_free_coherent(vq->vdev->dev.parent, size,
+				  info->queue, info->queue_dma_addr);
+	} else {
+		free_pages_exact(info->queue, PAGE_ALIGN(size));
+	}
 	kfree(info);
 }
 
@@ -713,6 +778,13 @@ static int virtio_pci_probe(struct pci_dev *pci_dev,
 	if (err)
 		goto out;
 
+	err = dma_set_mask_and_coherent(&pci_dev->dev, DMA_BIT_MASK(64));
+	if (err)
+		err = dma_set_mask_and_coherent(&pci_dev->dev,
+						DMA_BIT_MASK(32));
+	if (err)
+		dev_warn(&pci_dev->dev, "Failed to enable 64-bit or 32-bit DMA.  Trying to continue, but this might not work.\n");
+
 	err = pci_request_regions(pci_dev, "virtio-pci");
 	if (err)
 		goto out_enable_device;
-- 
1.9.3

^ permalink raw reply related	[flat|nested] 42+ messages in thread

* [PATCH v5 3/3] virtio_net: Stop doing DMA from the stack
  2014-09-17  5:22 [PATCH v5 0/3] virtio: Use the DMA API when appropriate Andy Lutomirski
  2014-09-17  5:22 ` [PATCH v5 1/3] virtio_ring: Support DMA APIs if requested Andy Lutomirski
  2014-09-17  5:22 ` [PATCH v5 2/3] virtio_pci: Use the DMA API for virtqueues when possible Andy Lutomirski
@ 2014-09-17  5:22 ` Andy Lutomirski
  2014-09-19 18:25 ` [PATCH v5 0/3] virtio: Use the DMA API when appropriate Konrad Rzeszutek Wilk
  3 siblings, 0 replies; 42+ messages in thread
From: Andy Lutomirski @ 2014-09-17  5:22 UTC (permalink / raw)
  To: Rusty Russell, Michael S. Tsirkin
  Cc: linux-s390, Konrad Rzeszutek Wilk, Benjamin Herrenschmidt,
	virtualization, Christian Borntraeger, Paolo Bonzini, linux390,
	Andy Lutomirski

Now that virtio supports real DMA, drivers should play by the rules.
For virtio_net, that means that DMA should be done to and from
dynamically-allocated memory, not the kernel stack.

This should have no effect on any performance-critical code paths.
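
As a generic illustration of the rule (a hypothetical fragment, not
code from virtio_net; "cmd" and the bool return value are made up for
the example):

	struct scatterlist sg;

	/* Broken: the buffer lives in the caller's stack frame. */
	u8 cmd_on_stack = 0;
	sg_init_one(&sg, &cmd_on_stack, sizeof(cmd_on_stack));

	/* OK: the buffer is heap memory, which is valid for DMA. */
	u8 *cmd = kmalloc(sizeof(*cmd), GFP_ATOMIC);
	if (!cmd)
		return false;
	*cmd = 0;
	sg_init_one(&sg, cmd, sizeof(*cmd));
	/* ... queue the sg, wait for the device to finish ... */
	kfree(cmd);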

Signed-off-by: Andy Lutomirski <luto@amacapital.net>
---
 drivers/net/virtio_net.c | 53 ++++++++++++++++++++++++++++++++----------------
 1 file changed, 36 insertions(+), 17 deletions(-)

diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
index 3d0ce4468ce6..0bda62be5fe0 100644
--- a/drivers/net/virtio_net.c
+++ b/drivers/net/virtio_net.c
@@ -971,31 +971,43 @@ static bool virtnet_send_command(struct virtnet_info *vi, u8 class, u8 cmd,
 				 struct scatterlist *out)
 {
 	struct scatterlist *sgs[4], hdr, stat;
-	struct virtio_net_ctrl_hdr ctrl;
-	virtio_net_ctrl_ack status = ~0;
+
+	struct {
+		struct virtio_net_ctrl_hdr ctrl;
+		virtio_net_ctrl_ack status;
+	} *buf;
+
 	unsigned out_num = 0, tmp;
+	bool ret;
 
 	/* Caller should know better */
 	BUG_ON(!virtio_has_feature(vi->vdev, VIRTIO_NET_F_CTRL_VQ));
 
-	ctrl.class = class;
-	ctrl.cmd = cmd;
+	buf = kmalloc(sizeof(*buf), GFP_ATOMIC);
+	if (!buf)
+		return false;
+	buf->status = ~0;
+
+	buf->ctrl.class = class;
+	buf->ctrl.cmd = cmd;
 	/* Add header */
-	sg_init_one(&hdr, &ctrl, sizeof(ctrl));
+	sg_init_one(&hdr, &buf->ctrl, sizeof(buf->ctrl));
 	sgs[out_num++] = &hdr;
 
 	if (out)
 		sgs[out_num++] = out;
 
 	/* Add return status. */
-	sg_init_one(&stat, &status, sizeof(status));
+	sg_init_one(&stat, &buf->status, sizeof(buf->status));
 	sgs[out_num] = &stat;
 
 	BUG_ON(out_num + 1 > ARRAY_SIZE(sgs));
 	virtqueue_add_sgs(vi->cvq, sgs, out_num, 1, vi, GFP_ATOMIC);
 
-	if (unlikely(!virtqueue_kick(vi->cvq)))
-		return status == VIRTIO_NET_OK;
+	if (unlikely(!virtqueue_kick(vi->cvq))) {
+		ret = (buf->status == VIRTIO_NET_OK);
+		goto out;
+	}
 
 	/* Spin for a response, the kick causes an ioport write, trapping
 	 * into the hypervisor, so the request should be handled immediately.
@@ -1004,7 +1016,11 @@ static bool virtnet_send_command(struct virtnet_info *vi, u8 class, u8 cmd,
 	       !virtqueue_is_broken(vi->cvq))
 		cpu_relax();
 
-	return status == VIRTIO_NET_OK;
+	ret = (buf->status == VIRTIO_NET_OK);
+
+out:
+	kfree(buf);
+	return ret;
 }
 
 static int virtnet_set_mac_address(struct net_device *dev, void *p)
@@ -1145,7 +1161,7 @@ static void virtnet_set_rx_mode(struct net_device *dev)
 {
 	struct virtnet_info *vi = netdev_priv(dev);
 	struct scatterlist sg[2];
-	u8 promisc, allmulti;
+	u8 *cmdbyte;
 	struct virtio_net_ctrl_mac *mac_data;
 	struct netdev_hw_addr *ha;
 	int uc_count;
@@ -1157,22 +1173,25 @@ static void virtnet_set_rx_mode(struct net_device *dev)
 	if (!virtio_has_feature(vi->vdev, VIRTIO_NET_F_CTRL_RX))
 		return;
 
-	promisc = ((dev->flags & IFF_PROMISC) != 0);
-	allmulti = ((dev->flags & IFF_ALLMULTI) != 0);
+	cmdbyte = kmalloc(sizeof(*cmdbyte), GFP_ATOMIC);
+	if (!cmdbyte)
+		return;
 
-	sg_init_one(sg, &promisc, sizeof(promisc));
+	sg_init_one(sg, cmdbyte, sizeof(*cmdbyte));
 
+	*cmdbyte = ((dev->flags & IFF_PROMISC) != 0);
 	if (!virtnet_send_command(vi, VIRTIO_NET_CTRL_RX,
 				  VIRTIO_NET_CTRL_RX_PROMISC, sg))
 		dev_warn(&dev->dev, "Failed to %sable promisc mode.\n",
-			 promisc ? "en" : "dis");
-
-	sg_init_one(sg, &allmulti, sizeof(allmulti));
+			 *cmdbyte ? "en" : "dis");
 
+	*cmdbyte = ((dev->flags & IFF_ALLMULTI) != 0);
 	if (!virtnet_send_command(vi, VIRTIO_NET_CTRL_RX,
 				  VIRTIO_NET_CTRL_RX_ALLMULTI, sg))
 		dev_warn(&dev->dev, "Failed to %sable allmulti mode.\n",
-			 allmulti ? "en" : "dis");
+			 *cmdbyte ? "en" : "dis");
+
+	kfree(cmdbyte);
 
 	uc_count = netdev_uc_count(dev);
 	mc_count = netdev_mc_count(dev);
-- 
1.9.3

^ permalink raw reply related	[flat|nested] 42+ messages in thread

* Re: [PATCH v5 2/3] virtio_pci: Use the DMA API for virtqueues when possible
  2014-09-17  5:22 ` [PATCH v5 2/3] virtio_pci: Use the DMA API for virtqueues when possible Andy Lutomirski
@ 2014-09-17 12:02   ` Benjamin Herrenschmidt
  2014-09-17 14:16     ` Michael S. Tsirkin
  2014-09-17 16:09   ` Ira W. Snyder
  1 sibling, 1 reply; 42+ messages in thread
From: Benjamin Herrenschmidt @ 2014-09-17 12:02 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: linux-s390, Konrad Rzeszutek Wilk, Michael S. Tsirkin,
	virtualization, Christian Borntraeger, Paolo Bonzini, linux390

On Tue, 2014-09-16 at 22:22 -0700, Andy Lutomirski wrote:
> On non-PPC systems, virtio_pci should use the DMA API.  This fixes
> virtio_pci on Xen.  On PPC, using the DMA API would break things, so
> we need to preserve the old behavior.
> 
> The big comment in this patch explains the considerations in more
> detail.

I still disagree with using CONFIG_PPC as a trigger here.

Fundamentally, the qemu implementation today bypasses IOMMUs on all
platforms as far as I can tell.

If that changes, we'll have a backward compatibility problem.

The virtio device should advertise whether it's using that bypass
mode of operation and virtio_pci should react accordingly.
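
Something along these lines, say (VIRTIO_F_RING_BYPASS_IOMMU is just a
placeholder name for this sketch, not a feature bit that exists today,
and the bit number is arbitrary):

	/* Placeholder feature bit, purely illustrative. */
	#define VIRTIO_F_RING_BYPASS_IOMMU	33

	static bool vp_use_dma_api(struct virtio_device *vdev)
	{
		/* Trust the device's own claim rather than CONFIG_PPC. */
		return !virtio_has_feature(vdev, VIRTIO_F_RING_BYPASS_IOMMU);
	}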

There is a demand for being able to operate on top of an IOMMU on
powerpc as well for some embedded stuff using PCI as a transport so your
patch precludes that.

Cheers,
Ben.

> 
> Signed-off-by: Andy Lutomirski <luto@amacapital.net>
> ---
>  drivers/virtio/virtio_pci.c | 90 ++++++++++++++++++++++++++++++++++++++++-----
>  1 file changed, 81 insertions(+), 9 deletions(-)
> 
> diff --git a/drivers/virtio/virtio_pci.c b/drivers/virtio/virtio_pci.c
> index a1f299fa4626..8ddb0a641878 100644
> --- a/drivers/virtio/virtio_pci.c
> +++ b/drivers/virtio/virtio_pci.c
> @@ -80,8 +80,10 @@ struct virtio_pci_vq_info
>  	/* the number of entries in the queue */
>  	int num;
>  
> -	/* the virtual address of the ring queue */
> -	void *queue;
> +	/* the ring queue */
> +	void *queue;			/* virtual address */
> +	dma_addr_t queue_dma_addr;	/* bus address */
> +	bool use_dma_api;		/* are we using the DMA API? */
>  
>  	/* the list node for the virtqueues list */
>  	struct list_head node;
> @@ -388,6 +390,50 @@ static int vp_request_intx(struct virtio_device *vdev)
>  	return err;
>  }
>  
> +static bool vp_use_dma_api(void)
> +{
> +	/*
> +	 * Due to limitations of the DMA API, we only have two choices:
> +	 * use the DMA API (e.g. set up IOMMU mappings or apply Xen's
> +	 * physical-to-machine translation) or use direct physical
> +	 * addressing.  Furthermore, there's no sensible way yet for the
> +	 * PCI bus code to tell us whether we're supposed to act like a
> +	 * normal PCI device (and use the DMA API) or to do something
> +	 * else.  So we're stuck with heuristics here.
> +	 *
> +	 * In general, we would prefer to use the DMA API, since we
> +	 * might be driving a physical device, and such devices *must*
> +	 * use the DMA API if there is an IOMMU involved.
> +	 *
> +	 * On x86, there are no physically-mapped emulated virtio PCI
> +	 * devices that live behind an IOMMU.  On ARM, there don't seem
> +	 * to be any hypervisors that use virtio_pci (as opposed to
> +	 * virtio_mmio) that also emulate an IOMMU.  So using the DMA
> +	 * API is safe.
> +	 *
> +	 * On PowerPC, it's the other way around.  There usually is an
> +	 * IOMMU between us and the virtio PCI device, but the device is
> +	 * probably emulated and ignores the IOMMU.  Unfortunately, we
> +	 * can't tell whether we're talking to an emulated device or to
> +	 * a physical device that really lives behind the IOMMU.  That
> +	 * means that we're stuck with ignoring the DMA API.
> +	 */
> +
> +#ifdef CONFIG_PPC
> +	return false;
> +#else
> +	/*
> +	 * Minor optimization: if the platform promises to have physical
> +	 * PCI DMA, we turn off DMA mapping in virtio_ring.  If the
> +	 * platform's DMA API implementation is well optimized, this
> +	 * should have almost no effect, but we already have a branch in
> +	 * the vring code, and we can avoid any further indirection with
> +	 * very little effort.
> +	 */
> +	return !PCI_DMA_BUS_IS_PHYS;
> +#endif
> +}
> +
>  static struct virtqueue *setup_vq(struct virtio_device *vdev, unsigned index,
>  				  void (*callback)(struct virtqueue *vq),
>  				  const char *name,
> @@ -416,21 +462,30 @@ static struct virtqueue *setup_vq(struct virtio_device *vdev, unsigned index,
>  
>  	info->num = num;
>  	info->msix_vector = msix_vec;
> +	info->use_dma_api = vp_use_dma_api();
>  
> -	size = PAGE_ALIGN(vring_size(num, VIRTIO_PCI_VRING_ALIGN));
> -	info->queue = alloc_pages_exact(size, GFP_KERNEL|__GFP_ZERO);
> +	size = vring_size(num, VIRTIO_PCI_VRING_ALIGN);
> +	if (info->use_dma_api) {
> +		info->queue = dma_zalloc_coherent(vdev->dev.parent, size,
> +						  &info->queue_dma_addr,
> +						  GFP_KERNEL);
> +	} else {
> +		info->queue = alloc_pages_exact(PAGE_ALIGN(size),
> +						GFP_KERNEL|__GFP_ZERO);
> +		info->queue_dma_addr = virt_to_phys(info->queue);
> +	}
>  	if (info->queue == NULL) {
>  		err = -ENOMEM;
>  		goto out_info;
>  	}
>  
>  	/* activate the queue */
> -	iowrite32(virt_to_phys(info->queue) >> VIRTIO_PCI_QUEUE_ADDR_SHIFT,
> +	iowrite32(info->queue_dma_addr >> VIRTIO_PCI_QUEUE_ADDR_SHIFT,
>  		  vp_dev->ioaddr + VIRTIO_PCI_QUEUE_PFN);
>  
>  	/* create the vring */
>  	vq = vring_new_virtqueue(index, info->num, VIRTIO_PCI_VRING_ALIGN, vdev,
> -				 true, false, info->queue,
> +				 true, info->use_dma_api, info->queue,
>  				 vp_notify, callback, name);
>  	if (!vq) {
>  		err = -ENOMEM;
> @@ -463,7 +518,12 @@ out_assign:
>  	vring_del_virtqueue(vq);
>  out_activate_queue:
>  	iowrite32(0, vp_dev->ioaddr + VIRTIO_PCI_QUEUE_PFN);
> -	free_pages_exact(info->queue, size);
> +	if (info->use_dma_api) {
> +		dma_free_coherent(vdev->dev.parent, size,
> +				  info->queue, info->queue_dma_addr);
> +	} else {
> +		free_pages_exact(info->queue, PAGE_ALIGN(size));
> +	}
>  out_info:
>  	kfree(info);
>  	return ERR_PTR(err);
> @@ -493,8 +553,13 @@ static void vp_del_vq(struct virtqueue *vq)
>  	/* Select and deactivate the queue */
>  	iowrite32(0, vp_dev->ioaddr + VIRTIO_PCI_QUEUE_PFN);
>  
> -	size = PAGE_ALIGN(vring_size(info->num, VIRTIO_PCI_VRING_ALIGN));
> -	free_pages_exact(info->queue, size);
> +	size = vring_size(info->num, VIRTIO_PCI_VRING_ALIGN);
> +	if (info->use_dma_api) {
> +		dma_free_coherent(vq->vdev->dev.parent, size,
> +				  info->queue, info->queue_dma_addr);
> +	} else {
> +		free_pages_exact(info->queue, PAGE_ALIGN(size));
> +	}
>  	kfree(info);
>  }
>  
> @@ -713,6 +778,13 @@ static int virtio_pci_probe(struct pci_dev *pci_dev,
>  	if (err)
>  		goto out;
>  
> +	err = dma_set_mask_and_coherent(&pci_dev->dev, DMA_BIT_MASK(64));
> +	if (err)
> +		err = dma_set_mask_and_coherent(&pci_dev->dev,
> +						DMA_BIT_MASK(32));
> +	if (err)
> +		dev_warn(&pci_dev->dev, "Failed to enable 64-bit or 32-bit DMA.  Trying to continue, but this might not work.\n");
> +
>  	err = pci_request_regions(pci_dev, "virtio-pci");
>  	if (err)
>  		goto out_enable_device;

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH v5 2/3] virtio_pci: Use the DMA API for virtqueues when possible
  2014-09-17 12:02   ` Benjamin Herrenschmidt
@ 2014-09-17 14:16     ` Michael S. Tsirkin
  2014-09-17 16:07       ` Andy Lutomirski
                         ` (2 more replies)
  0 siblings, 3 replies; 42+ messages in thread
From: Michael S. Tsirkin @ 2014-09-17 14:16 UTC (permalink / raw)
  To: Benjamin Herrenschmidt
  Cc: linux-s390, Konrad Rzeszutek Wilk, virtualization,
	Christian Borntraeger, Paolo Bonzini, linux390, Andy Lutomirski

On Wed, Sep 17, 2014 at 08:02:31AM -0400, Benjamin Herrenschmidt wrote:
> On Tue, 2014-09-16 at 22:22 -0700, Andy Lutomirski wrote:
> > On non-PPC systems, virtio_pci should use the DMA API.  This fixes
> > virtio_pci on Xen.  On PPC, using the DMA API would break things, so
> > we need to preserve the old behavior.
> > 
> > The big comment in this patch explains the considerations in more
> > detail.
> 
> I still disagree with using CONFIG_PPC as a trigger here.
> 
> Fundamentally, the qemu implementation today bypasses IOMMUs on all
> platforms as far as I can tell.
> 
> If that changes, we'll have a backward compatibility problem.
> 
> The virtio device should advertise whether it's using that bypass
> mode of operation and virtio_pci should react accordingly.

Well, if there's a way to detect that from outside the device, then
we probably shouldn't use device-specific interfaces for this
capability.


> There is a demand for being able to operate on top of an IOMMU on
> powerpc as well for some embedded stuff using PCI as a transport so your
> patch precludes that.
> 
> Cheers,
> Ben.

As far as I can see, nothing changes on PPC so this patch
does not preclude anything that was working?


> > 
> > Signed-off-by: Andy Lutomirski <luto@amacapital.net>
> > ---
> >  drivers/virtio/virtio_pci.c | 90 ++++++++++++++++++++++++++++++++++++++++-----
> >  1 file changed, 81 insertions(+), 9 deletions(-)
> > 
> > diff --git a/drivers/virtio/virtio_pci.c b/drivers/virtio/virtio_pci.c
> > index a1f299fa4626..8ddb0a641878 100644
> > --- a/drivers/virtio/virtio_pci.c
> > +++ b/drivers/virtio/virtio_pci.c
> > @@ -80,8 +80,10 @@ struct virtio_pci_vq_info
> >  	/* the number of entries in the queue */
> >  	int num;
> >  
> > -	/* the virtual address of the ring queue */
> > -	void *queue;
> > +	/* the ring queue */
> > +	void *queue;			/* virtual address */
> > +	dma_addr_t queue_dma_addr;	/* bus address */
> > +	bool use_dma_api;		/* are we using the DMA API? */
> >  
> >  	/* the list node for the virtqueues list */
> >  	struct list_head node;
> > @@ -388,6 +390,50 @@ static int vp_request_intx(struct virtio_device *vdev)
> >  	return err;
> >  }
> >  
> > +static bool vp_use_dma_api(void)
> > +{
> > +	/*
> > +	 * Due to limitations of the DMA API, we only have two choices:
> > +	 * use the DMA API (e.g. set up IOMMU mappings or apply Xen's
> > +	 * physical-to-machine translation) or use direct physical
> > +	 * addressing.  Furthermore, there's no sensible way yet for the
> > +	 * PCI bus code to tell us whether we're supposed to act like a
> > +	 * normal PCI device (and use the DMA API) or to do something
> > +	 * else.  So we're stuck with heuristics here.
> > +	 *
> > +	 * In general, we would prefer to use the DMA API, since we
> > +	 * might be driving a physical device, and such devices *must*
> > +	 * use the DMA API if there is an IOMMU involved.
> > +	 *
> > +	 * On x86, there are no physically-mapped emulated virtio PCI
> > +	 * devices that live behind an IOMMU.  On ARM, there don't seem
> > +	 * to be any hypervisors that use virtio_pci (as opposed to
> > +	 * virtio_mmio) that also emulate an IOMMU.  So using the DMA
> > +	 * API is safe.
> > +	 *
> > +	 * On PowerPC, it's the other way around.  There usually is an
> > +	 * IOMMU between us and the virtio PCI device, but the device is
> > +	 * probably emulated and ignores the IOMMU.  Unfortunately, we
> > +	 * can't tell whether we're talking to an emulated device or to
> > +	 * a physical device that really lives behind the IOMMU.  That
> > +	 * means that we're stuck with ignoring the DMA API.
> > +	 */
> > +
> > +#ifdef CONFIG_PPC
> > +	return false;
> > +#else
> > +	/*
> > +	 * Minor optimization: if the platform promises to have physical
> > +	 * PCI DMA, we turn off DMA mapping in virtio_ring.  If the
> > +	 * platform's DMA API implementation is well optimized, this
> > +	 * should have almost no effect, but we already have a branch in
> > +	 * the vring code, and we can avoid any further indirection with
> > +	 * very little effort.
> > +	 */
> > +	return !PCI_DMA_BUS_IS_PHYS;
> > +#endif
> > +}
> > +
> >  static struct virtqueue *setup_vq(struct virtio_device *vdev, unsigned index,
> >  				  void (*callback)(struct virtqueue *vq),
> >  				  const char *name,
> > @@ -416,21 +462,30 @@ static struct virtqueue *setup_vq(struct virtio_device *vdev, unsigned index,
> >  
> >  	info->num = num;
> >  	info->msix_vector = msix_vec;
> > +	info->use_dma_api = vp_use_dma_api();
> >  
> > -	size = PAGE_ALIGN(vring_size(num, VIRTIO_PCI_VRING_ALIGN));
> > -	info->queue = alloc_pages_exact(size, GFP_KERNEL|__GFP_ZERO);
> > +	size = vring_size(num, VIRTIO_PCI_VRING_ALIGN);
> > +	if (info->use_dma_api) {
> > +		info->queue = dma_zalloc_coherent(vdev->dev.parent, size,
> > +						  &info->queue_dma_addr,
> > +						  GFP_KERNEL);
> > +	} else {
> > +		info->queue = alloc_pages_exact(PAGE_ALIGN(size),
> > +						GFP_KERNEL|__GFP_ZERO);
> > +		info->queue_dma_addr = virt_to_phys(info->queue);
> > +	}
> >  	if (info->queue == NULL) {
> >  		err = -ENOMEM;
> >  		goto out_info;
> >  	}
> >  
> >  	/* activate the queue */
> > -	iowrite32(virt_to_phys(info->queue) >> VIRTIO_PCI_QUEUE_ADDR_SHIFT,
> > +	iowrite32(info->queue_dma_addr >> VIRTIO_PCI_QUEUE_ADDR_SHIFT,
> >  		  vp_dev->ioaddr + VIRTIO_PCI_QUEUE_PFN);
> >  
> >  	/* create the vring */
> >  	vq = vring_new_virtqueue(index, info->num, VIRTIO_PCI_VRING_ALIGN, vdev,
> > -				 true, false, info->queue,
> > +				 true, info->use_dma_api, info->queue,
> >  				 vp_notify, callback, name);
> >  	if (!vq) {
> >  		err = -ENOMEM;
> > @@ -463,7 +518,12 @@ out_assign:
> >  	vring_del_virtqueue(vq);
> >  out_activate_queue:
> >  	iowrite32(0, vp_dev->ioaddr + VIRTIO_PCI_QUEUE_PFN);
> > -	free_pages_exact(info->queue, size);
> > +	if (info->use_dma_api) {
> > +		dma_free_coherent(vdev->dev.parent, size,
> > +				  info->queue, info->queue_dma_addr);
> > +	} else {
> > +		free_pages_exact(info->queue, PAGE_ALIGN(size));
> > +	}
> >  out_info:
> >  	kfree(info);
> >  	return ERR_PTR(err);
> > @@ -493,8 +553,13 @@ static void vp_del_vq(struct virtqueue *vq)
> >  	/* Select and deactivate the queue */
> >  	iowrite32(0, vp_dev->ioaddr + VIRTIO_PCI_QUEUE_PFN);
> >  
> > -	size = PAGE_ALIGN(vring_size(info->num, VIRTIO_PCI_VRING_ALIGN));
> > -	free_pages_exact(info->queue, size);
> > +	size = vring_size(info->num, VIRTIO_PCI_VRING_ALIGN);
> > +	if (info->use_dma_api) {
> > +		dma_free_coherent(vq->vdev->dev.parent, size,
> > +				  info->queue, info->queue_dma_addr);
> > +	} else {
> > +		free_pages_exact(info->queue, PAGE_ALIGN(size));
> > +	}
> >  	kfree(info);
> >  }
> >  
> > @@ -713,6 +778,13 @@ static int virtio_pci_probe(struct pci_dev *pci_dev,
> >  	if (err)
> >  		goto out;
> >  
> > +	err = dma_set_mask_and_coherent(&pci_dev->dev, DMA_BIT_MASK(64));
> > +	if (err)
> > +		err = dma_set_mask_and_coherent(&pci_dev->dev,
> > +						DMA_BIT_MASK(32));
> > +	if (err)
> > +		dev_warn(&pci_dev->dev, "Failed to enable 64-bit or 32-bit DMA.  Trying to continue, but this might not work.\n");
> > +
> >  	err = pci_request_regions(pci_dev, "virtio-pci");
> >  	if (err)
> >  		goto out_enable_device;
> 

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH v5 2/3] virtio_pci: Use the DMA API for virtqueues when possible
  2014-09-17 14:16     ` Michael S. Tsirkin
@ 2014-09-17 16:07       ` Andy Lutomirski
  2014-09-17 16:49         ` David Woodhouse
  2014-09-19 21:33         ` Benjamin Herrenschmidt
  2014-09-19 21:31       ` Benjamin Herrenschmidt
  2014-09-29 18:55       ` Andy Lutomirski
  2 siblings, 2 replies; 42+ messages in thread
From: Andy Lutomirski @ 2014-09-17 16:07 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: linux-s390@vger.kernel.org, Konrad Rzeszutek Wilk,
	Benjamin Herrenschmidt, Linux Virtualization,
	Christian Borntraeger, linux390@de.ibm.com, Paolo Bonzini,
	David Woodhouse

On Sep 17, 2014 7:13 AM, "Michael S. Tsirkin" <mst@redhat.com> wrote:
>
> On Wed, Sep 17, 2014 at 08:02:31AM -0400, Benjamin Herrenschmidt wrote:
> > On Tue, 2014-09-16 at 22:22 -0700, Andy Lutomirski wrote:
> > > On non-PPC systems, virtio_pci should use the DMA API.  This fixes
> > > virtio_pci on Xen.  On PPC, using the DMA API would break things, so
> > > we need to preserve the old behavior.
> > >
> > > The big comment in this patch explains the considerations in more
> > > detail.
> >
> > I still disagree with using CONFIG_PPC as a trigger here.
> >
> > Fundamentally, the qemu implementation today bypasses IOMMUs on all
> > platforms as far as I can tell.
> >
> > If that changes, we'll have a backward compatibility problem.
> >
> > The virtio device should advertise whether it's using that bypass
> > mode of operation and virtio_pci should react accordingly.
>
> Well if there's a way to detect that - that's outside the
> device, then we probably shouldn't use device-specific
> interfaces for this capability.

I still think that this is a property of the bus, not the device.  x86
has such a mechanism, and this patch uses it transparently.

(NB: According to David Woodhouse about ten minutes ago, this should
work with a bit of care right now on x86 via ACPI as long as QEMU
configures the tables correctly.)

>
>
> > There is a demand for being able to operate on top of an IOMMU on
> > powerpc as well for some embedded stuff using PCI as a transport so your
> > patch precludes that.
> >
> > Cheers,
> > Ben.
>
> As far as I can see, nothing changes on PPC so this patch
> does not preclude anything that was working?
>

It shouldn't.  That being said, at some point this problem will need
solving on PPC, and this patch doesn't help much, other than adding
the virtio_ring piece.

I'd really like to see the generic or arch IOMMU code handle this so
that the PPC special case in the virtio-pci driver can go away.

Another reason that this kind of special casing doesn't belong in
virtio-pci: if anyone ever binds vfio to a virtio-pci device and the
DMA API doesn't work right, then the kernel will go boom or be
compromised.

--Andy

>
> > >
> > > Signed-off-by: Andy Lutomirski <luto@amacapital.net>
> > > ---
> > >  drivers/virtio/virtio_pci.c | 90 ++++++++++++++++++++++++++++++++++++++++-----
> > >  1 file changed, 81 insertions(+), 9 deletions(-)
> > >
> > > diff --git a/drivers/virtio/virtio_pci.c b/drivers/virtio/virtio_pci.c
> > > index a1f299fa4626..8ddb0a641878 100644
> > > --- a/drivers/virtio/virtio_pci.c
> > > +++ b/drivers/virtio/virtio_pci.c
> > > @@ -80,8 +80,10 @@ struct virtio_pci_vq_info
> > >     /* the number of entries in the queue */
> > >     int num;
> > >
> > > -   /* the virtual address of the ring queue */
> > > -   void *queue;
> > > +   /* the ring queue */
> > > +   void *queue;                    /* virtual address */
> > > +   dma_addr_t queue_dma_addr;      /* bus address */
> > > +   bool use_dma_api;               /* are we using the DMA API? */
> > >
> > >     /* the list node for the virtqueues list */
> > >     struct list_head node;
> > > @@ -388,6 +390,50 @@ static int vp_request_intx(struct virtio_device *vdev)
> > >     return err;
> > >  }
> > >
> > > +static bool vp_use_dma_api(void)
> > > +{
> > > +   /*
> > > +    * Due to limitations of the DMA API, we only have two choices:
> > > +    * use the DMA API (e.g. set up IOMMU mappings or apply Xen's
> > > +    * physical-to-machine translation) or use direct physical
> > > +    * addressing.  Furthermore, there's no sensible way yet for the
> > > +    * PCI bus code to tell us whether we're supposed to act like a
> > > +    * normal PCI device (and use the DMA API) or to do something
> > > +    * else.  So we're stuck with heuristics here.
> > > +    *
> > > +    * In general, we would prefer to use the DMA API, since we
> > > +    * might be driving a physical device, and such devices *must*
> > > +    * use the DMA API if there is an IOMMU involved.
> > > +    *
> > > +    * On x86, there are no physically-mapped emulated virtio PCI
> > > +    * devices that live behind an IOMMU.  On ARM, there don't seem
> > > +    * to be any hypervisors that use virtio_pci (as opposed to
> > > +    * virtio_mmio) that also emulate an IOMMU.  So using the DMA
> > > +    * API is safe.
> > > +    *
> > > +    * On PowerPC, it's the other way around.  There usually is an
> > > +    * IOMMU between us and the virtio PCI device, but the device is
> > > +    * probably emulated and ignores the IOMMU.  Unfortunately, we
> > > +    * can't tell whether we're talking to an emulated device or to
> > > +    * a physical device that really lives behind the IOMMU.  That
> > > +    * means that we're stuck with ignoring the DMA API.
> > > +    */
> > > +
> > > +#ifdef CONFIG_PPC
> > > +   return false;
> > > +#else
> > > +   /*
> > > +    * Minor optimization: if the platform promises to have physical
> > > +    * PCI DMA, we turn off DMA mapping in virtio_ring.  If the
> > > +    * platform's DMA API implementation is well optimized, this
> > > +    * should have almost no effect, but we already have a branch in
> > > +    * the vring code, and we can avoid any further indirection with
> > > +    * very little effort.
> > > +    */
> > > +   return !PCI_DMA_BUS_IS_PHYS;
> > > +#endif
> > > +}
> > > +
> > >  static struct virtqueue *setup_vq(struct virtio_device *vdev, unsigned index,
> > >                               void (*callback)(struct virtqueue *vq),
> > >                               const char *name,
> > > @@ -416,21 +462,30 @@ static struct virtqueue *setup_vq(struct virtio_device *vdev, unsigned index,
> > >
> > >     info->num = num;
> > >     info->msix_vector = msix_vec;
> > > +   info->use_dma_api = vp_use_dma_api();
> > >
> > > -   size = PAGE_ALIGN(vring_size(num, VIRTIO_PCI_VRING_ALIGN));
> > > -   info->queue = alloc_pages_exact(size, GFP_KERNEL|__GFP_ZERO);
> > > +   size = vring_size(num, VIRTIO_PCI_VRING_ALIGN);
> > > +   if (info->use_dma_api) {
> > > +           info->queue = dma_zalloc_coherent(vdev->dev.parent, size,
> > > +                                             &info->queue_dma_addr,
> > > +                                             GFP_KERNEL);
> > > +   } else {
> > > +           info->queue = alloc_pages_exact(PAGE_ALIGN(size),
> > > +                                           GFP_KERNEL|__GFP_ZERO);
> > > +           info->queue_dma_addr = virt_to_phys(info->queue);
> > > +   }
> > >     if (info->queue == NULL) {
> > >             err = -ENOMEM;
> > >             goto out_info;
> > >     }
> > >
> > >     /* activate the queue */
> > > -   iowrite32(virt_to_phys(info->queue) >> VIRTIO_PCI_QUEUE_ADDR_SHIFT,
> > > +   iowrite32(info->queue_dma_addr >> VIRTIO_PCI_QUEUE_ADDR_SHIFT,
> > >               vp_dev->ioaddr + VIRTIO_PCI_QUEUE_PFN);
> > >
> > >     /* create the vring */
> > >     vq = vring_new_virtqueue(index, info->num, VIRTIO_PCI_VRING_ALIGN, vdev,
> > > -                            true, false, info->queue,
> > > +                            true, info->use_dma_api, info->queue,
> > >                              vp_notify, callback, name);
> > >     if (!vq) {
> > >             err = -ENOMEM;
> > > @@ -463,7 +518,12 @@ out_assign:
> > >     vring_del_virtqueue(vq);
> > >  out_activate_queue:
> > >     iowrite32(0, vp_dev->ioaddr + VIRTIO_PCI_QUEUE_PFN);
> > > -   free_pages_exact(info->queue, size);
> > > +   if (info->use_dma_api) {
> > > +           dma_free_coherent(vdev->dev.parent, size,
> > > +                             info->queue, info->queue_dma_addr);
> > > +   } else {
> > > +           free_pages_exact(info->queue, PAGE_ALIGN(size));
> > > +   }
> > >  out_info:
> > >     kfree(info);
> > >     return ERR_PTR(err);
> > > @@ -493,8 +553,13 @@ static void vp_del_vq(struct virtqueue *vq)
> > >     /* Select and deactivate the queue */
> > >     iowrite32(0, vp_dev->ioaddr + VIRTIO_PCI_QUEUE_PFN);
> > >
> > > -   size = PAGE_ALIGN(vring_size(info->num, VIRTIO_PCI_VRING_ALIGN));
> > > -   free_pages_exact(info->queue, size);
> > > +   size = vring_size(info->num, VIRTIO_PCI_VRING_ALIGN);
> > > +   if (info->use_dma_api) {
> > > +           dma_free_coherent(vq->vdev->dev.parent, size,
> > > +                             info->queue, info->queue_dma_addr);
> > > +   } else {
> > > +           free_pages_exact(info->queue, PAGE_ALIGN(size));
> > > +   }
> > >     kfree(info);
> > >  }
> > >
> > > @@ -713,6 +778,13 @@ static int virtio_pci_probe(struct pci_dev *pci_dev,
> > >     if (err)
> > >             goto out;
> > >
> > > +   err = dma_set_mask_and_coherent(&pci_dev->dev, DMA_BIT_MASK(64));
> > > +   if (err)
> > > +           err = dma_set_mask_and_coherent(&pci_dev->dev,
> > > +                                           DMA_BIT_MASK(32));
> > > +   if (err)
> > > +           dev_warn(&pci_dev->dev, "Failed to enable 64-bit or 32-bit DMA.  Trying to continue, but this might not work.\n");
> > > +
> > >     err = pci_request_regions(pci_dev, "virtio-pci");
> > >     if (err)
> > >             goto out_enable_device;
> >

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH v5 2/3] virtio_pci: Use the DMA API for virtqueues when possible
  2014-09-17  5:22 ` [PATCH v5 2/3] virtio_pci: Use the DMA API for virtqueues when possible Andy Lutomirski
  2014-09-17 12:02   ` Benjamin Herrenschmidt
@ 2014-09-17 16:09   ` Ira W. Snyder
  2014-09-17 16:15     ` Andy Lutomirski
  1 sibling, 1 reply; 42+ messages in thread
From: Ira W. Snyder @ 2014-09-17 16:09 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: linux-s390, Konrad Rzeszutek Wilk, Benjamin Herrenschmidt,
	virtualization, Christian Borntraeger, Michael S. Tsirkin,
	linux390, Paolo Bonzini

On Tue, Sep 16, 2014 at 10:22:27PM -0700, Andy Lutomirski wrote:
> On non-PPC systems, virtio_pci should use the DMA API.  This fixes
> virtio_pci on Xen.  On PPC, using the DMA API would break things, so
> we need to preserve the old behavior.
> 
> The big comment in this patch explains the considerations in more
> detail.
> 
> Signed-off-by: Andy Lutomirski <luto@amacapital.net>
> ---
>  drivers/virtio/virtio_pci.c | 90 ++++++++++++++++++++++++++++++++++++++++-----
>  1 file changed, 81 insertions(+), 9 deletions(-)
> 
> diff --git a/drivers/virtio/virtio_pci.c b/drivers/virtio/virtio_pci.c
> index a1f299fa4626..8ddb0a641878 100644
> --- a/drivers/virtio/virtio_pci.c
> +++ b/drivers/virtio/virtio_pci.c
> @@ -80,8 +80,10 @@ struct virtio_pci_vq_info
>  	/* the number of entries in the queue */
>  	int num;
>  
> -	/* the virtual address of the ring queue */
> -	void *queue;
> +	/* the ring queue */
> +	void *queue;			/* virtual address */
> +	dma_addr_t queue_dma_addr;	/* bus address */
> +	bool use_dma_api;		/* are we using the DMA API? */
>  
>  	/* the list node for the virtqueues list */
>  	struct list_head node;
> @@ -388,6 +390,50 @@ static int vp_request_intx(struct virtio_device *vdev)
>  	return err;
>  }
>  
> +static bool vp_use_dma_api(void)
> +{
> +	/*
> +	 * Due to limitations of the DMA API, we only have two choices:
> +	 * use the DMA API (e.g. set up IOMMU mappings or apply Xen's
> +	 * physical-to-machine translation) or use direct physical
> +	 * addressing.  Furthermore, there's no sensible way yet for the
> +	 * PCI bus code to tell us whether we're supposed to act like a
> +	 * normal PCI device (and use the DMA API) or to do something
> +	 * else.  So we're stuck with heuristics here.
> +	 *
> +	 * In general, we would prefer to use the DMA API, since we
> +	 * might be driving a physical device, and such devices *must*
> +	 * use the DMA API if there is an IOMMU involved.
> +	 *
> +	 * On x86, there are no physically-mapped emulated virtio PCI
> +	 * devices that live behind an IOMMU.  On ARM, there don't seem
> +	 * to be any hypervisors that use virtio_pci (as opposed to
> +	 * virtio_mmio) that also emulate an IOMMU.  So using the DMI

Hi,

I noticed a typo here. It should say "DMA" not "DMI". Just thought I'd
point it out.

Ira

> +	 * API is safe.
> +	 *
> +	 * On PowerPC, it's the other way around.  There usually is an
> +	 * IOMMU between us and the virtio PCI device, but the device is
> +	 * probably emulated and ignores the IOMMU.  Unfortunately, we
> +	 * can't tell whether we're talking to an emulated device or to
> +	 * a physical device that really lives behind the IOMMU.  That
> +	 * means that we're stuck with ignoring the DMA API.
> +	 */
> +
> +#ifdef CONFIG_PPC
> +	return false;
> +#else
> +	/*
> +	 * Minor optimization: if the platform promises to have physical
> +	 * PCI DMA, we turn off DMA mapping in virtio_ring.  If the
> +	 * platform's DMA API implementation is well optimized, this
> +	 * should have almost no effect, but we already have a branch in
> +	 * the vring code, and we can avoid any further indirection with
> +	 * very little effort.
> +	 */
> +	return !PCI_DMA_BUS_IS_PHYS;
> +#endif
> +}
> +
>  static struct virtqueue *setup_vq(struct virtio_device *vdev, unsigned index,
>  				  void (*callback)(struct virtqueue *vq),
>  				  const char *name,
> @@ -416,21 +462,30 @@ static struct virtqueue *setup_vq(struct virtio_device *vdev, unsigned index,
>  
>  	info->num = num;
>  	info->msix_vector = msix_vec;
> +	info->use_dma_api = vp_use_dma_api();
>  
> -	size = PAGE_ALIGN(vring_size(num, VIRTIO_PCI_VRING_ALIGN));
> -	info->queue = alloc_pages_exact(size, GFP_KERNEL|__GFP_ZERO);
> +	size = vring_size(num, VIRTIO_PCI_VRING_ALIGN);
> +	if (info->use_dma_api) {
> +		info->queue = dma_zalloc_coherent(vdev->dev.parent, size,
> +						  &info->queue_dma_addr,
> +						  GFP_KERNEL);
> +	} else {
> +		info->queue = alloc_pages_exact(PAGE_ALIGN(size),
> +						GFP_KERNEL|__GFP_ZERO);
> +		info->queue_dma_addr = virt_to_phys(info->queue);
> +	}
>  	if (info->queue == NULL) {
>  		err = -ENOMEM;
>  		goto out_info;
>  	}
>  
>  	/* activate the queue */
> -	iowrite32(virt_to_phys(info->queue) >> VIRTIO_PCI_QUEUE_ADDR_SHIFT,
> +	iowrite32(info->queue_dma_addr >> VIRTIO_PCI_QUEUE_ADDR_SHIFT,
>  		  vp_dev->ioaddr + VIRTIO_PCI_QUEUE_PFN);
>  
>  	/* create the vring */
>  	vq = vring_new_virtqueue(index, info->num, VIRTIO_PCI_VRING_ALIGN, vdev,
> -				 true, false, info->queue,
> +				 true, info->use_dma_api, info->queue,
>  				 vp_notify, callback, name);
>  	if (!vq) {
>  		err = -ENOMEM;
> @@ -463,7 +518,12 @@ out_assign:
>  	vring_del_virtqueue(vq);
>  out_activate_queue:
>  	iowrite32(0, vp_dev->ioaddr + VIRTIO_PCI_QUEUE_PFN);
> -	free_pages_exact(info->queue, size);
> +	if (info->use_dma_api) {
> +		dma_free_coherent(vdev->dev.parent, size,
> +				  info->queue, info->queue_dma_addr);
> +	} else {
> +		free_pages_exact(info->queue, PAGE_ALIGN(size));
> +	}
>  out_info:
>  	kfree(info);
>  	return ERR_PTR(err);
> @@ -493,8 +553,13 @@ static void vp_del_vq(struct virtqueue *vq)
>  	/* Select and deactivate the queue */
>  	iowrite32(0, vp_dev->ioaddr + VIRTIO_PCI_QUEUE_PFN);
>  
> -	size = PAGE_ALIGN(vring_size(info->num, VIRTIO_PCI_VRING_ALIGN));
> -	free_pages_exact(info->queue, size);
> +	size = vring_size(info->num, VIRTIO_PCI_VRING_ALIGN);
> +	if (info->use_dma_api) {
> +		dma_free_coherent(vq->vdev->dev.parent, size,
> +				  info->queue, info->queue_dma_addr);
> +	} else {
> +		free_pages_exact(info->queue, PAGE_ALIGN(size));
> +	}
>  	kfree(info);
>  }
>  
> @@ -713,6 +778,13 @@ static int virtio_pci_probe(struct pci_dev *pci_dev,
>  	if (err)
>  		goto out;
>  
> +	err = dma_set_mask_and_coherent(&pci_dev->dev, DMA_BIT_MASK(64));
> +	if (err)
> +		err = dma_set_mask_and_coherent(&pci_dev->dev,
> +						DMA_BIT_MASK(32));
> +	if (err)
> +		dev_warn(&pci_dev->dev, "Failed to enable 64-bit or 32-bit DMA.  Trying to continue, but this might not work.\n");
> +
>  	err = pci_request_regions(pci_dev, "virtio-pci");
>  	if (err)
>  		goto out_enable_device;
> -- 
> 1.9.3
> 

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH v5 2/3] virtio_pci: Use the DMA API for virtqueues when possible
  2014-09-17 16:09   ` Ira W. Snyder
@ 2014-09-17 16:15     ` Andy Lutomirski
  0 siblings, 0 replies; 42+ messages in thread
From: Andy Lutomirski @ 2014-09-17 16:15 UTC (permalink / raw)
  To: Ira W. Snyder
  Cc: linux-s390@vger.kernel.org, Konrad Rzeszutek Wilk,
	Benjamin Herrenschmidt, Linux Virtualization,
	Christian Borntraeger, Michael S. Tsirkin, linux390@de.ibm.com,
	Paolo Bonzini

On Wed, Sep 17, 2014 at 9:09 AM, Ira W. Snyder <iws@ovro.caltech.edu> wrote:
> On Tue, Sep 16, 2014 at 10:22:27PM -0700, Andy Lutomirski wrote:
>> On non-PPC systems, virtio_pci should use the DMA API.  This fixes
>> virtio_pci on Xen.  On PPC, using the DMA API would break things, so
>> we need to preserve the old behavior.
>>
>> The big comment in this patch explains the considerations in more
>> detail.
>>
>> Signed-off-by: Andy Lutomirski <luto@amacapital.net>
>> ---
>>  drivers/virtio/virtio_pci.c | 90 ++++++++++++++++++++++++++++++++++++++++-----
>>  1 file changed, 81 insertions(+), 9 deletions(-)
>>
>> diff --git a/drivers/virtio/virtio_pci.c b/drivers/virtio/virtio_pci.c
>> index a1f299fa4626..8ddb0a641878 100644
>> --- a/drivers/virtio/virtio_pci.c
>> +++ b/drivers/virtio/virtio_pci.c
>> @@ -80,8 +80,10 @@ struct virtio_pci_vq_info
>>       /* the number of entries in the queue */
>>       int num;
>>
>> -     /* the virtual address of the ring queue */
>> -     void *queue;
>> +     /* the ring queue */
>> +     void *queue;                    /* virtual address */
>> +     dma_addr_t queue_dma_addr;      /* bus address */
>> +     bool use_dma_api;               /* are we using the DMA API? */
>>
>>       /* the list node for the virtqueues list */
>>       struct list_head node;
>> @@ -388,6 +390,50 @@ static int vp_request_intx(struct virtio_device *vdev)
>>       return err;
>>  }
>>
>> +static bool vp_use_dma_api(void)
>> +{
>> +     /*
>> +      * Due to limitations of the DMA API, we only have two choices:
>> +      * use the DMA API (e.g. set up IOMMU mappings or apply Xen's
>> +      * physical-to-machine translation) or use direct physical
>> +      * addressing.  Furthermore, there's no sensible way yet for the
>> +      * PCI bus code to tell us whether we're supposed to act like a
>> +      * normal PCI device (and use the DMA API) or to do something
>> +      * else.  So we're stuck with heuristics here.
>> +      *
>> +      * In general, we would prefer to use the DMA API, since we
>> +      * might be driving a physical device, and such devices *must*
>> +      * use the DMA API if there is an IOMMU involved.
>> +      *
>> +      * On x86, there are no physically-mapped emulated virtio PCI
>> +      * devices that live behind an IOMMU.  On ARM, there don't seem
>> +      * to be any hypervisors that use virtio_pci (as opposed to
>> +      * virtio_mmio) that also emulate an IOMMU.  So using the DMI
>
> Hi,
>
> I noticed a typo here. It should say "DMA" not "DMI". Just thought I'd
> point it out.

Thanks.  I'll fix this in v6 if there is a v6.

--Andy

>
> Ira
>
>> +      * API is safe.
>> +      *
>> +      * On PowerPC, it's the other way around.  There usually is an
>> +      * IOMMU between us and the virtio PCI device, but the device is
>> +      * probably emulated and ignores the IOMMU.  Unfortunately, we
>> +      * can't tell whether we're talking to an emulated device or to
>> +      * a physical device that really lives behind the IOMMU.  That
>> +      * means that we're stuck with ignoring the DMA API.
>> +      */
>> +
>> +#ifdef CONFIG_PPC
>> +     return false;
>> +#else
>> +     /*
>> +      * Minor optimization: if the platform promises to have physical
>> +      * PCI DMA, we turn off DMA mapping in virtio_ring.  If the
>> +      * platform's DMA API implementation is well optimized, this
>> +      * should have almost no effect, but we already have a branch in
>> +      * the vring code, and we can avoid any further indirection with
>> +      * very little effort.
>> +      */
>> +     return !PCI_DMA_BUS_IS_PHYS;
>> +#endif
>> +}
>> +
>>  static struct virtqueue *setup_vq(struct virtio_device *vdev, unsigned index,
>>                                 void (*callback)(struct virtqueue *vq),
>>                                 const char *name,
>> @@ -416,21 +462,30 @@ static struct virtqueue *setup_vq(struct virtio_device *vdev, unsigned index,
>>
>>       info->num = num;
>>       info->msix_vector = msix_vec;
>> +     info->use_dma_api = vp_use_dma_api();
>>
>> -     size = PAGE_ALIGN(vring_size(num, VIRTIO_PCI_VRING_ALIGN));
>> -     info->queue = alloc_pages_exact(size, GFP_KERNEL|__GFP_ZERO);
>> +     size = vring_size(num, VIRTIO_PCI_VRING_ALIGN);
>> +     if (info->use_dma_api) {
>> +             info->queue = dma_zalloc_coherent(vdev->dev.parent, size,
>> +                                               &info->queue_dma_addr,
>> +                                               GFP_KERNEL);
>> +     } else {
>> +             info->queue = alloc_pages_exact(PAGE_ALIGN(size),
>> +                                             GFP_KERNEL|__GFP_ZERO);
>> +             info->queue_dma_addr = virt_to_phys(info->queue);
>> +     }
>>       if (info->queue == NULL) {
>>               err = -ENOMEM;
>>               goto out_info;
>>       }
>>
>>       /* activate the queue */
>> -     iowrite32(virt_to_phys(info->queue) >> VIRTIO_PCI_QUEUE_ADDR_SHIFT,
>> +     iowrite32(info->queue_dma_addr >> VIRTIO_PCI_QUEUE_ADDR_SHIFT,
>>                 vp_dev->ioaddr + VIRTIO_PCI_QUEUE_PFN);
>>
>>       /* create the vring */
>>       vq = vring_new_virtqueue(index, info->num, VIRTIO_PCI_VRING_ALIGN, vdev,
>> -                              true, false, info->queue,
>> +                              true, info->use_dma_api, info->queue,
>>                                vp_notify, callback, name);
>>       if (!vq) {
>>               err = -ENOMEM;
>> @@ -463,7 +518,12 @@ out_assign:
>>       vring_del_virtqueue(vq);
>>  out_activate_queue:
>>       iowrite32(0, vp_dev->ioaddr + VIRTIO_PCI_QUEUE_PFN);
>> -     free_pages_exact(info->queue, size);
>> +     if (info->use_dma_api) {
>> +             dma_free_coherent(vdev->dev.parent, size,
>> +                               info->queue, info->queue_dma_addr);
>> +     } else {
>> +             free_pages_exact(info->queue, PAGE_ALIGN(size));
>> +     }
>>  out_info:
>>       kfree(info);
>>       return ERR_PTR(err);
>> @@ -493,8 +553,13 @@ static void vp_del_vq(struct virtqueue *vq)
>>       /* Select and deactivate the queue */
>>       iowrite32(0, vp_dev->ioaddr + VIRTIO_PCI_QUEUE_PFN);
>>
>> -     size = PAGE_ALIGN(vring_size(info->num, VIRTIO_PCI_VRING_ALIGN));
>> -     free_pages_exact(info->queue, size);
>> +     size = vring_size(info->num, VIRTIO_PCI_VRING_ALIGN);
>> +     if (info->use_dma_api) {
>> +             dma_free_coherent(vq->vdev->dev.parent, size,
>> +                               info->queue, info->queue_dma_addr);
>> +     } else {
>> +             free_pages_exact(info->queue, PAGE_ALIGN(size));
>> +     }
>>       kfree(info);
>>  }
>>
>> @@ -713,6 +778,13 @@ static int virtio_pci_probe(struct pci_dev *pci_dev,
>>       if (err)
>>               goto out;
>>
>> +     err = dma_set_mask_and_coherent(&pci_dev->dev, DMA_BIT_MASK(64));
>> +     if (err)
>> +             err = dma_set_mask_and_coherent(&pci_dev->dev,
>> +                                             DMA_BIT_MASK(32));
>> +     if (err)
>> +             dev_warn(&pci_dev->dev, "Failed to enable 64-bit or 32-bit DMA.  Trying to continue, but this might not work.\n");
>> +
>>       err = pci_request_regions(pci_dev, "virtio-pci");
>>       if (err)
>>               goto out_enable_device;
>> --
>> 1.9.3
>>



-- 
Andy Lutomirski
AMA Capital Management, LLC

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH v5 2/3] virtio_pci: Use the DMA API for virtqueues when possible
  2014-09-17 16:07       ` Andy Lutomirski
@ 2014-09-17 16:49         ` David Woodhouse
  2014-09-19 21:28           ` Benjamin Herrenschmidt
  2014-09-19 21:33         ` Benjamin Herrenschmidt
  1 sibling, 1 reply; 42+ messages in thread
From: David Woodhouse @ 2014-09-17 16:49 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: linux-s390@vger.kernel.org, Konrad Rzeszutek Wilk,
	Michael S. Tsirkin, Benjamin Herrenschmidt, Linux Virtualization,
	Christian Borntraeger, linux390@de.ibm.com, Paolo Bonzini


On Wed, 2014-09-17 at 09:07 -0700, Andy Lutomirski wrote:
> 
> I still think that this is a property of the bus, not the device.  x86
> has such a mechanism, and this patch uses it transparently.

Right. A device driver should use the DMA API. Always.

The platform's implementation of the DMA API is then free to observe
that the device in question isn't actually served by any IOMMU and give
a trivial 1:1 mapping.
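
To make that concrete -- purely a sketch, with 'dev', 'buf' and 'len' standing
in for whatever the driver actually has -- the driver side never changes:

    dma_addr_t addr;

    addr = dma_map_single(dev, buf, len, DMA_TO_DEVICE);
    if (dma_mapping_error(dev, addr))
            return -ENOMEM;         /* mapping failed */

    /* ... hand 'addr' (a bus address) to the device ... */

    dma_unmap_single(dev, addr, len, DMA_TO_DEVICE);

On a platform with no IOMMU, 'addr' simply comes back equal to the buffer's
physical address; behind an IOMMU it's whatever mapping was set up.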

> > > The virtio device should advertise whether it's using that bypass
> > > mode of operation and virtio_pci should react accordingly.
> >
> > Well if there's a way to detect that - that's outside the
> > device, then we probably shouldn't use device-specific
> > interfaces for this capability.
> 
> (NB: According to David Woodhouse about ten minutes ago, this should
> work with a bit of care right now on x86 via ACPI as long as QEMU
> configures the tables correctly.)

Well, we don't currently expose a virtual IOMMU to x86 guests at the
moment although we're going to want to in the near future.

My answer was based on the assumption that it'll actually be an Intel
IOMMU that we expose, rather than some paravirtualised IOMMU interface
which allows us to support both AMD and Intel hosts. That may not be a
valid assumption; I don't think we've really explored this at all.

But *given* that assumption, the VT-d ACPI tables allow a given IOMMU
unit to be associated either with a list of specific devices, *or* to
have a 'catch-all' bit set which means that it owns all
otherwise-unmatched devices in that PCI domain.

So... if you want some devices to be seen as *not* behind an IOMMU then
they need to be in a PCI domain which *doesn't* have a catch-all IOMMU
unit. So either you put them in a separate PCI domain all by themselves,
or you explicitly list all (possible) PCI BDFs under the IOMMU unit and
don't use the catch-all facility.
 
Having said that... don't. Just put the damn thing behind an IOMMU. The
guest OS doesn't *have* to use it; it can either eschew the IOMMU completely
or choose to use pass-through mode for specific devices.

-- 
David Woodhouse                            Open Source Technology Centre
David.Woodhouse@intel.com                              Intel Corporation

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH v5 0/3] virtio: Use the DMA API when appropriate
  2014-09-17  5:22 [PATCH v5 0/3] virtio: Use the DMA API when appropriate Andy Lutomirski
                   ` (2 preceding siblings ...)
  2014-09-17  5:22 ` [PATCH v5 3/3] virtio_net: Stop doing DMA from the stack Andy Lutomirski
@ 2014-09-19 18:25 ` Konrad Rzeszutek Wilk
  3 siblings, 0 replies; 42+ messages in thread
From: Konrad Rzeszutek Wilk @ 2014-09-19 18:25 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: linux-s390, xen-devel, Michael S. Tsirkin, Benjamin Herrenschmidt,
	virtualization, Christian Borntraeger, Paolo Bonzini, linux390

On Tue, Sep 16, 2014 at 10:22:25PM -0700, Andy Lutomirski wrote:
> This fixes virtio on Xen guests as well as on any other platform
> that uses virtio_pci on which physical addresses don't match bus
> addresses.

I can do 'Reviewed-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>'
but I'm not sure whose tree this patch should go through?


> 
> This can be tested with:
> 
>     virtme-run --xen xen --kimg arch/x86/boot/bzImage --console
> 
> using virtme from here:
> 
>     https://git.kernel.org/cgit/utils/kernel/virtme/virtme.git
> 
> Without these patches, the guest hangs forever.  With these patches,
> everything works.
> 
> This should be safe on all platforms that I'm aware of.  That
> doesn't mean that there isn't anything that I missed.
> 
> Applies to net-next.
> 
> Changes from v4:
>  - Rebased onto net-next.
>  - Dropped the sg end mark changes from virtio_net, as that issue
>    is solved differently in net-next.
>  - virtio_pci does not use the DMA API on powerpc, for reasons that
>    are explained in a detailed comment in virtio_ring.
> 
> Changes from v3:
>  - virtio_pci only asks virtio_ring to use the DMA API if
>    !PCI_DMA_BUS_IS_PHYS.
>  - Reduce tools/virtio breakage.  It's now merely as broken as before
>    instead of being even more broken.
>  - Drop the sg_next changes -- Rusty's version is better.
> 
> Changes from v2:
>  - Reordered patches.
>  - Fixed a virtio_net OOPS.
> 
> Changes from v1:
>  - Using the DMA API is optional now.  It would be nice to improve the
>    DMA API to the point that it could be used unconditionally, but s390
>    proves that we're not there yet.
>  - Includes patch 4, which fixes DMA debugging warnings from virtio_net.
> 
> Andy Lutomirski (3):
>   virtio_ring: Support DMA APIs if requested
>   virtio_pci: Use the DMA API for virtqueues when possible
>   virtio_net: Stop doing DMA from the stack
> 
>  drivers/lguest/lguest_device.c         |   3 +-
>  drivers/misc/mic/card/mic_virtio.c     |   2 +-
>  drivers/net/virtio_net.c               |  53 ++++++---
>  drivers/remoteproc/remoteproc_virtio.c |   4 +-
>  drivers/s390/kvm/kvm_virtio.c          |   2 +-
>  drivers/s390/kvm/virtio_ccw.c          |   4 +-
>  drivers/virtio/virtio_mmio.c           |   5 +-
>  drivers/virtio/virtio_pci.c            |  91 ++++++++++++++--
>  drivers/virtio/virtio_ring.c           | 194 +++++++++++++++++++++++++++------
>  include/linux/virtio_ring.h            |   1 +
>  tools/virtio/linux/dma-mapping.h       |  17 +++
>  tools/virtio/linux/virtio.h            |   1 +
>  tools/virtio/virtio_test.c             |   2 +-
>  tools/virtio/vringh_test.c             |   3 +-
>  14 files changed, 314 insertions(+), 68 deletions(-)
>  create mode 100644 tools/virtio/linux/dma-mapping.h
> 
> -- 
> 1.9.3
> 

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH v5 2/3] virtio_pci: Use the DMA API for virtqueues when possible
  2014-09-17 16:49         ` David Woodhouse
@ 2014-09-19 21:28           ` Benjamin Herrenschmidt
  0 siblings, 0 replies; 42+ messages in thread
From: Benjamin Herrenschmidt @ 2014-09-19 21:28 UTC (permalink / raw)
  To: David Woodhouse
  Cc: linux-s390@vger.kernel.org, Konrad Rzeszutek Wilk,
	Michael S. Tsirkin, Andy Lutomirski, Christian Borntraeger,
	linux390@de.ibm.com, Paolo Bonzini, Linux Virtualization

On Wed, 2014-09-17 at 09:49 -0700, David Woodhouse wrote:
> On Wed, 2014-09-17 at 09:07 -0700, Andy Lutomirski wrote:
> > 
> > I still think that this is a property of the bus, not the device.  x86
> > has such a mechanism, and this patch uses it transparently.
> 
> Right. A device driver should use the DMA API. Always.

I do agree, and that's why I suggested that, instead of doing what it
does, that patch should:

  - Always use dma_map_*

  - Make the dma "ops" of the virtio bus/device point to a set of "null
ops" when necessary.

This would also make my life easier on powerpc since I too will have
cases where I will want virtio to use the dma mapping API; it's just
that the *current* implementation is such that qemu always bypasses
iommus, and we want to keep that mode of operation as an option (even
the default) for performance.

But qemu isn't the only other implementation of virtio ...
 
> The platform's implementation of the DMA API is then free to observe
> that the device in question isn't actually served by any IOMMU and give
> a trivial 1:1 mapping.

Right.

> > > > The virtio device should advertise whether it's using that bypass
> > > > mode of operation and virtio_pci should react accordingly.
> > >
> > > Well if there's a way to detect that - that's outside the
> > > device, then we probably shouldn't use device-specific
> > > interfaces for this capability.
> > 
> > (NB: According to David Woodhouse about ten minutes ago, this should
> > work with a bit of care right now on x86 via ACPI as long as QEMU
> > configures the tables correctly.)
> 
> Well, we don't currently expose a virtual IOMMU to x86 guests at the
> moment although we're going to want to in the near future.
> 
> My answer was based on the assumption that it'll actually be an Intel
> IOMMU that we expose, rather than some paravirtualised IOMMU interface
> which allows us to support both AMD and Intel hosts. That may not be a
> valid assumption; I don't think we've really explored this at all.
> 
> But *given* that assumption, the VT-d ACPI tables allow a given IOMMU
> unit to be associated either with a list of specific devices, *or* to
> have a 'catch-all' bit set which means that it owns all
> otherwise-unmatched devices in that PCI domain.
> 
> So... if you want some devices to be seen as *not* behind an IOMMU then
> they need to be in a PCI domain which *doesn't* have a catch-all IOMMU
> unit. So either you put them in a separate PCI domain all by themselves,
> or you explicitly list all (possible) PCI BDFs under the IOMMU unit and
> don't use the catch-all facility.
>  
> Having said that... don't. Just put the damn thing behind an IOMMU. The
> guest OS doesn't *have* to use it; it either eschew the IOMMU completely
> or it can choose to use pass-through mode for specific devices.

Ben.

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH v5 2/3] virtio_pci: Use the DMA API for virtqueues when possible
  2014-09-17 14:16     ` Michael S. Tsirkin
  2014-09-17 16:07       ` Andy Lutomirski
@ 2014-09-19 21:31       ` Benjamin Herrenschmidt
  2014-09-29 18:55       ` Andy Lutomirski
  2 siblings, 0 replies; 42+ messages in thread
From: Benjamin Herrenschmidt @ 2014-09-19 21:31 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: linux-s390, Konrad Rzeszutek Wilk, virtualization,
	Christian Borntraeger, Paolo Bonzini, linux390, Andy Lutomirski

On Wed, 2014-09-17 at 17:16 +0300, Michael S. Tsirkin wrote:
> On Wed, Sep 17, 2014 at 08:02:31AM -0400, Benjamin Herrenschmidt wrote:
> > On Tue, 2014-09-16 at 22:22 -0700, Andy Lutomirski wrote:
> > > On non-PPC systems, virtio_pci should use the DMA API.  This fixes
> > > virtio_pci on Xen.  On PPC, using the DMA API would break things, so
> > > we need to preserve the old behavior.
> > > 
> > > The big comment in this patch explains the considerations in more
> > > detail.
> > 
> > I still disagree with using CONFIG_PPC as a trigger here.
> > 
> > Fundamentally, the qemu implementation today bypasses IOMMUs on all
> > platforms as far as I can tell.
> > 
> > If that changes, we'll have a backward compatibility problem.
> > 
> > The virtio device should advertise whether it's using that bypass
> > mode of operation and virtio_pci should react accordingly.
> 
> Well if there's a way to detect that - that's outside the
> device, then we probably shouldn't use device-specific
> interfaces for this capability.

Not necessarily. virtio is "special" in that regard, and it's an
attribute of a given virtio "server" implementation as to whether it
honors or bypasses the system or bus iommu.

> > There is a demand for being able to operate on top of an IOMMU on
> > powerpc as well for some embedded stuff using PCI as a transport so your
> > patch precludes that.
> > 
> > Cheers,
> > Ben.
> 
> As far as I can see, nothing changes on PPC so this patch
> does not preclude anything that was working?

Today, yes, but it breaks some other intended usages.  "PPC" is a very wide scope.
Some "PPC" users want to use virtio for communicating between machines
over a non-transparent PCI setup, and that requires using the iommu, for
example. We have other things coming down the pipe that will require the
use of the DMA mapping API as well.

> > > 
> > > Signed-off-by: Andy Lutomirski <luto@amacapital.net>
> > > ---
> > >  drivers/virtio/virtio_pci.c | 90 ++++++++++++++++++++++++++++++++++++++++-----
> > >  1 file changed, 81 insertions(+), 9 deletions(-)
> > > 
> > > diff --git a/drivers/virtio/virtio_pci.c b/drivers/virtio/virtio_pci.c
> > > index a1f299fa4626..8ddb0a641878 100644
> > > --- a/drivers/virtio/virtio_pci.c
> > > +++ b/drivers/virtio/virtio_pci.c
> > > @@ -80,8 +80,10 @@ struct virtio_pci_vq_info
> > >  	/* the number of entries in the queue */
> > >  	int num;
> > >  
> > > -	/* the virtual address of the ring queue */
> > > -	void *queue;
> > > +	/* the ring queue */
> > > +	void *queue;			/* virtual address */
> > > +	dma_addr_t queue_dma_addr;	/* bus address */
> > > +	bool use_dma_api;		/* are we using the DMA API? */
> > >  
> > >  	/* the list node for the virtqueues list */
> > >  	struct list_head node;
> > > @@ -388,6 +390,50 @@ static int vp_request_intx(struct virtio_device *vdev)
> > >  	return err;
> > >  }
> > >  
> > > +static bool vp_use_dma_api(void)
> > > +{
> > > +	/*
> > > +	 * Due to limitations of the DMA API, we only have two choices:
> > > +	 * use the DMA API (e.g. set up IOMMU mappings or apply Xen's
> > > +	 * physical-to-machine translation) or use direct physical
> > > +	 * addressing.  Furthermore, there's no sensible way yet for the
> > > +	 * PCI bus code to tell us whether we're supposed to act like a
> > > +	 * normal PCI device (and use the DMA API) or to do something
> > > +	 * else.  So we're stuck with heuristics here.
> > > +	 *
> > > +	 * In general, we would prefer to use the DMA API, since we
> > > +	 * might be driving a physical device, and such devices *must*
> > > +	 * use the DMA API if there is an IOMMU involved.
> > > +	 *
> > > +	 * On x86, there are no physically-mapped emulated virtio PCI
> > > +	 * devices that live behind an IOMMU.  On ARM, there don't seem
> > > +	 * to be any hypervisors that use virtio_pci (as opposed to
> > > +	 * virtio_mmio) that also emulate an IOMMU.  So using the DMI
> > > +	 * API is safe.
> > > +	 *
> > > +	 * On PowerPC, it's the other way around.  There usually is an
> > > +	 * IOMMU between us and the virtio PCI device, but the device is
> > > +	 * probably emulated and ignores the IOMMU.  Unfortunately, we
> > > +	 * can't tell whether we're talking to an emulated device or to
> > > +	 * a physical device that really lives behind the IOMMU.  That
> > > +	 * means that we're stuck with ignoring the DMA API.
> > > +	 */
> > > +
> > > +#ifdef CONFIG_PPC
> > > +	return false;
> > > +#else
> > > +	/*
> > > +	 * Minor optimization: if the platform promises to have physical
> > > +	 * PCI DMA, we turn off DMA mapping in virtio_ring.  If the
> > > +	 * platform's DMA API implementation is well optimized, this
> > > +	 * should have almost no effect, but we already have a branch in
> > > +	 * the vring code, and we can avoid any further indirection with
> > > +	 * very little effort.
> > > +	 */
> > > +	return !PCI_DMA_BUS_IS_PHYS;
> > > +#endif
> > > +}
> > > +
> > >  static struct virtqueue *setup_vq(struct virtio_device *vdev, unsigned index,
> > >  				  void (*callback)(struct virtqueue *vq),
> > >  				  const char *name,
> > > @@ -416,21 +462,30 @@ static struct virtqueue *setup_vq(struct virtio_device *vdev, unsigned index,
> > >  
> > >  	info->num = num;
> > >  	info->msix_vector = msix_vec;
> > > +	info->use_dma_api = vp_use_dma_api();
> > >  
> > > -	size = PAGE_ALIGN(vring_size(num, VIRTIO_PCI_VRING_ALIGN));
> > > -	info->queue = alloc_pages_exact(size, GFP_KERNEL|__GFP_ZERO);
> > > +	size = vring_size(num, VIRTIO_PCI_VRING_ALIGN);
> > > +	if (info->use_dma_api) {
> > > +		info->queue = dma_zalloc_coherent(vdev->dev.parent, size,
> > > +						  &info->queue_dma_addr,
> > > +						  GFP_KERNEL);
> > > +	} else {
> > > +		info->queue = alloc_pages_exact(PAGE_ALIGN(size),
> > > +						GFP_KERNEL|__GFP_ZERO);
> > > +		info->queue_dma_addr = virt_to_phys(info->queue);
> > > +	}
> > >  	if (info->queue == NULL) {
> > >  		err = -ENOMEM;
> > >  		goto out_info;
> > >  	}
> > >  
> > >  	/* activate the queue */
> > > -	iowrite32(virt_to_phys(info->queue) >> VIRTIO_PCI_QUEUE_ADDR_SHIFT,
> > > +	iowrite32(info->queue_dma_addr >> VIRTIO_PCI_QUEUE_ADDR_SHIFT,
> > >  		  vp_dev->ioaddr + VIRTIO_PCI_QUEUE_PFN);
> > >  
> > >  	/* create the vring */
> > >  	vq = vring_new_virtqueue(index, info->num, VIRTIO_PCI_VRING_ALIGN, vdev,
> > > -				 true, false, info->queue,
> > > +				 true, info->use_dma_api, info->queue,
> > >  				 vp_notify, callback, name);
> > >  	if (!vq) {
> > >  		err = -ENOMEM;
> > > @@ -463,7 +518,12 @@ out_assign:
> > >  	vring_del_virtqueue(vq);
> > >  out_activate_queue:
> > >  	iowrite32(0, vp_dev->ioaddr + VIRTIO_PCI_QUEUE_PFN);
> > > -	free_pages_exact(info->queue, size);
> > > +	if (info->use_dma_api) {
> > > +		dma_free_coherent(vdev->dev.parent, size,
> > > +				  info->queue, info->queue_dma_addr);
> > > +	} else {
> > > +		free_pages_exact(info->queue, PAGE_ALIGN(size));
> > > +	}
> > >  out_info:
> > >  	kfree(info);
> > >  	return ERR_PTR(err);
> > > @@ -493,8 +553,13 @@ static void vp_del_vq(struct virtqueue *vq)
> > >  	/* Select and deactivate the queue */
> > >  	iowrite32(0, vp_dev->ioaddr + VIRTIO_PCI_QUEUE_PFN);
> > >  
> > > -	size = PAGE_ALIGN(vring_size(info->num, VIRTIO_PCI_VRING_ALIGN));
> > > -	free_pages_exact(info->queue, size);
> > > +	size = vring_size(info->num, VIRTIO_PCI_VRING_ALIGN);
> > > +	if (info->use_dma_api) {
> > > +		dma_free_coherent(vq->vdev->dev.parent, size,
> > > +				  info->queue, info->queue_dma_addr);
> > > +	} else {
> > > +		free_pages_exact(info->queue, PAGE_ALIGN(size));
> > > +	}
> > >  	kfree(info);
> > >  }
> > >  
> > > @@ -713,6 +778,13 @@ static int virtio_pci_probe(struct pci_dev *pci_dev,
> > >  	if (err)
> > >  		goto out;
> > >  
> > > +	err = dma_set_mask_and_coherent(&pci_dev->dev, DMA_BIT_MASK(64));
> > > +	if (err)
> > > +		err = dma_set_mask_and_coherent(&pci_dev->dev,
> > > +						DMA_BIT_MASK(32));
> > > +	if (err)
> > > +		dev_warn(&pci_dev->dev, "Failed to enable 64-bit or 32-bit DMA.  Trying to continue, but this might not work.\n");
> > > +
> > >  	err = pci_request_regions(pci_dev, "virtio-pci");
> > >  	if (err)
> > >  		goto out_enable_device;
> > 

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH v5 2/3] virtio_pci: Use the DMA API for virtqueues when possible
  2014-09-17 16:07       ` Andy Lutomirski
  2014-09-17 16:49         ` David Woodhouse
@ 2014-09-19 21:33         ` Benjamin Herrenschmidt
  2014-09-20  5:59           ` Andy Lutomirski
  1 sibling, 1 reply; 42+ messages in thread
From: Benjamin Herrenschmidt @ 2014-09-19 21:33 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: linux-s390@vger.kernel.org, Konrad Rzeszutek Wilk,
	Michael S. Tsirkin, Linux Virtualization, Christian Borntraeger,
	linux390@de.ibm.com, Paolo Bonzini, David Woodhouse

On Wed, 2014-09-17 at 09:07 -0700, Andy Lutomirski wrote:
> It shouldn't.  That being said, at some point this problem will need
> solving on PPC, and this patch doesn't help much, other than adding
> the virtio_ring piece.
> 
> I'd really like to see the generic or arch IOMMU code handle this so
> that the PPC special case in the virtio-pci driver can go away.
> 
> Another reason that this kind of special casing doesn't belong in
> virtio-pci: if anyone ever binds vfio to a virtio-pci device, if the
> DMA API doesn't work right, then the kernel will go boom or be
> compromised.

Well, it's also for these reasons that I think your patch shouldn't go
through those virtio_* variants that test that "use_dma_api" flag or
whatever you called it last time I looked :-)

You should just unconditionally call the DMA API and we should fixup the
special cases by overriding the hooks.

I can help you do the override hack for PPC for now, until we can do
something saner across all architectures.

The main problem is s390 but I had a quick chat with Utz Bacher earlier
this week and he seemed to think having s390 use the DMA API shouldn't
be a big deal; they can also hook up some transparent set of ops if
necessary.

Cheers,
Ben.

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH v5 2/3] virtio_pci: Use the DMA API for virtqueues when possible
  2014-09-19 21:33         ` Benjamin Herrenschmidt
@ 2014-09-20  5:59           ` Andy Lutomirski
  2014-09-21  5:03             ` Benjamin Herrenschmidt
  0 siblings, 1 reply; 42+ messages in thread
From: Andy Lutomirski @ 2014-09-20  5:59 UTC (permalink / raw)
  To: Benjamin Herrenschmidt
  Cc: linux-s390@vger.kernel.org, Konrad Rzeszutek Wilk,
	Michael S. Tsirkin, Linux Virtualization, Christian Borntraeger,
	linux390@de.ibm.com, Paolo Bonzini, David Woodhouse

On Sep 19, 2014 2:33 PM, "Benjamin Herrenschmidt"
<benh@kernel.crashing.org> wrote:
>
> On Wed, 2014-09-17 at 09:07 -0700, Andy Lutomirski wrote:
> > It shouldn't.  That being said, at some point this problem will need
> > solving on PPC, and this patch doesn't help much, other than adding
> > the virtio_ring piece.
> >
> > I'd really like to see the generic or arch IOMMU code handle this so
> > that the PPC special case in the virtio-pci driver can go away.
> >
> > Another reason that this kind of special casing doesn't belong in
> > virtio-pci: if anyone ever binds vfio to a virtio-pci device, if the
> > DMA API doesn't work right, then the kernel will go boom or be
> > compromised.
>
> Well, that's also for these reasons that I think your patch shouldn't go
> through those virtio_* variants that test that "use_dma_api" flag or
> whatever you called it last time I looked :-)
>
> You should just unconditionally call the DMA API and we should fixup the
> special cases by overriding the hooks.

Unconditionally for virtio-pci or unconditionally for all drivers?  I
wouldn't be surprised if some of the virtio-mmio archs have trouble,
too.  Aarch64, at least, seems to be okay with using the DMA API on
virtio_mmio devices, even with IOMMU support compiled in.

Are there other relevant architectures that are easy to test?

>
> I can help you do the override hack for PPC for now, until we can do
> something saner accross all architectures.

Sure.

The question is: should the patches go into 3.18 as is, or should
they wait?  It would be straightforward to remove the use_dma_api
switch once PPC, s390, and virtio_mmio are ready.

--Andy

>
> The main problem is s390 but I had a quick chat with Utz Bacher earlier
> this week and he seemed to think having s390 use the DMA API shouldn't
> be a big deal, they can also hookup some transparent set of ops if
> necessary.
>
> Cheers,
> Ben.
>
>

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH v5 2/3] virtio_pci: Use the DMA API for virtqueues when possible
  2014-09-20  5:59           ` Andy Lutomirski
@ 2014-09-21  5:03             ` Benjamin Herrenschmidt
  2014-09-21  5:05               ` Benjamin Herrenschmidt
  0 siblings, 1 reply; 42+ messages in thread
From: Benjamin Herrenschmidt @ 2014-09-21  5:03 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: linux-s390@vger.kernel.org, Konrad Rzeszutek Wilk,
	Michael S. Tsirkin, Linux Virtualization, Christian Borntraeger,
	linux390@de.ibm.com, Paolo Bonzini, David Woodhouse

On Fri, 2014-09-19 at 22:59 -0700, Andy Lutomirski wrote:
> Sure.
> 
> The question is: should the patches go in to 3.18 as is our should
> they wait?  It would be straightforward to remove the use_dma_api
> switch PPC, s390, and virtio_mmio are ready.

I don't mind the patches going in now in their current form with one
exception (see below), but I'm keen on moving all the way to an
unconditional use of the DMA API with a simple tweaking of the dma ops
of the device based on need/iommu/etc...

What I might do is first experiment with powerpc to try to set
use_dma_api to true and do the dma_ops manipulation.
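
If that pans out, the dma_ops manipulation could be as small as pointing the
virtio PCI device at the 1:1 ops powerpc already has -- a sketch, not
something in these patches:

    /* powerpc's existing direct (1:1) ops, hung off the virtio PCI device */
    set_dma_ops(&pci_dev->dev, &dma_direct_ops);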

The exception I mentioned is that I would really like the virtio device
to expose, via whatever transport we choose to use (though capability
exchange sounds like a reasonable one), whether the "server"
implementation is bypassing IOMMUs or not, instead of relying on client
side heuristics.

IE. Basically, we are trying to "guess" with an ifdef CONFIG_PPC, what
is essentially an attribute of the server side, ie, whether it bypasses
the iommu for the PCI bus it resides on.

I believe all the arguments about whether this should be a bus property
or whether the x86 case can be worked around via ACPI tables etc... are
all moot. Today, qemu implementation can put virtio devices on busses
with an iommu and bypass it, so at the very least for backward
compatibility, we should expose that appropriately from the "server"
side.

Cheers,
Ben.

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH v5 2/3] virtio_pci: Use the DMA API for virtqueues when possible
  2014-09-21  5:03             ` Benjamin Herrenschmidt
@ 2014-09-21  5:05               ` Benjamin Herrenschmidt
  2014-09-21  5:48                 ` Andy Lutomirski
  2014-09-24 21:41                 ` Andy Lutomirski
  0 siblings, 2 replies; 42+ messages in thread
From: Benjamin Herrenschmidt @ 2014-09-21  5:05 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: linux-s390@vger.kernel.org, Konrad Rzeszutek Wilk,
	Michael S. Tsirkin, Linux Virtualization, Christian Borntraeger,
	linux390@de.ibm.com, Paolo Bonzini, David Woodhouse

On Sun, 2014-09-21 at 15:03 +1000, Benjamin Herrenschmidt wrote:

> The exception I mentioned is that I would really like the virtio device
> to expose via whatever transport we chose to use (though capability
> exchange sounds like a reasonable one) whether the "server"
> implementation is bypassing IOMMUs or not instead on relying on client
> side heuristics.
> 
> IE. Basically, we are trying to "guess" with an ifdef CONFIG_PPC, what
> is essentially an attribute of the server-side, ie, whether is bypasses
> the iommu for the PCI bus it resides on.

> I believe all the arguments about whether this should be a bus property
> or whether the x86 case can be worked around via ACPI tables etc... are
> all moot. Today, qemu implementation can put virtio devices on busses
> with an iommu and bypass it, so at the very least for backward
> compatibility, we should expose that appropriately from the "server"
> side.

And of course, since we are talking about backward compatibility with
existing qemus here, the capability should be the opposite, ie "honor
iommu", with the assumption that without it, the implementation bypasses
it, which reflects what the current qemu implementation does on any
architecture, whether you configure the bus to have an iommu emulated on
it or not.
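
In code, the guest side of such a capability would boil down to something
like the sketch below. VIRTIO_F_IOMMU_HONOR is an invented name (and an
arbitrary bit number) purely for illustration; nothing like it is defined
anywhere today:

    /* Hypothetical feature bit -- not in any spec or header. */
    #define VIRTIO_F_IOMMU_HONOR    33

    info->use_dma_api = virtio_has_feature(vdev, VIRTIO_F_IOMMU_HONOR);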

Cheers,
Ben.

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH v5 2/3] virtio_pci: Use the DMA API for virtqueues when possible
  2014-09-21  5:05               ` Benjamin Herrenschmidt
@ 2014-09-21  5:48                 ` Andy Lutomirski
  2014-09-21  6:01                   ` David Woodhouse
  2014-09-24 21:41                 ` Andy Lutomirski
  1 sibling, 1 reply; 42+ messages in thread
From: Andy Lutomirski @ 2014-09-21  5:48 UTC (permalink / raw)
  To: Benjamin Herrenschmidt
  Cc: linux-s390@vger.kernel.org, Konrad Rzeszutek Wilk,
	Michael S. Tsirkin, Linux Virtualization, Christian Borntraeger,
	linux390@de.ibm.com, Paolo Bonzini, David Woodhouse

On Sat, Sep 20, 2014 at 10:05 PM, Benjamin Herrenschmidt
<benh@kernel.crashing.org> wrote:
> On Sun, 2014-09-21 at 15:03 +1000, Benjamin Herrenschmidt wrote:
>
>> The exception I mentioned is that I would really like the virtio device
>> to expose via whatever transport we chose to use (though capability
>> exchange sounds like a reasonable one) whether the "server"
>> implementation is bypassing IOMMUs or not instead on relying on client
>> side heuristics.
>>
>> IE. Basically, we are trying to "guess" with an ifdef CONFIG_PPC, what
>> is essentially an attribute of the server-side, ie, whether is bypasses
>> the iommu for the PCI bus it resides on.
>
>> I believe all the arguments about whether this should be a bus property
>> or whether the x86 case can be worked around via ACPI tables etc... are
>> all moot. Today, qemu implementation can put virtio devices on busses
>> with an iommu and bypass it, so at the very least for backward
>> compatibility, we should expose that appropriately from the "server"
>> side.
>

I'm all for doing this as some kind of bus property, if needed,
although David Woodhouse may be right that the best solution is to
*always* have the IOMMU there and to just ask it for an identity map
if desired.

> And of course, since we are talking about backward compatibility with
> existing qemus here, the capability should be the opposite, ie "honor
> iommu", with the assumption that without it, the implementation bypasses
> it, which reflects what the current qemu implementation does on any
> architecture, whether you configure the bus to have an iommu emulated on
> it or not.
>

If I'm understanding you correctly, this won't work for Xen on QEMU.
QEMU's virtio-pci device doesn't have a real IOMMU, but it is still
behind Xen's translation layer, and QEMU doesn't even know that Xen is
there.

--Andy

> Cheers,
> Ben.
>
>



-- 
Andy Lutomirski
AMA Capital Management, LLC

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH v5 2/3] virtio_pci: Use the DMA API for virtqueues when possible
  2014-09-21  5:48                 ` Andy Lutomirski
@ 2014-09-21  6:01                   ` David Woodhouse
  0 siblings, 0 replies; 42+ messages in thread
From: David Woodhouse @ 2014-09-21  6:01 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: linux-s390@vger.kernel.org, Konrad Rzeszutek Wilk,
	Michael S. Tsirkin, Benjamin Herrenschmidt, Linux Virtualization,
	Christian Borntraeger, linux390@de.ibm.com, Paolo Bonzini


On Sat, 2014-09-20 at 22:48 -0700, Andy Lutomirski wrote:
> 
> I'm all for doing this as some kind of bus property, if needed,
> although David Woodhouse may be right that the best solution is to
> *always* have the IOMMU there and to just ask it for an identity map
> if desired.

That's the right solution for x86, certainly. For POWER it's a little
less clear-cut because I don't think the IOMMU normally supports
identity mapping there. You could still do an identity mapping (hell, we
do at the moment) but it'd be a very special case.

-- 
dwmw2


^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH v5 2/3] virtio_pci: Use the DMA API for virtqueues when possible
  2014-09-21  5:05               ` Benjamin Herrenschmidt
  2014-09-21  5:48                 ` Andy Lutomirski
@ 2014-09-24 21:41                 ` Andy Lutomirski
  2014-09-24 21:50                   ` Benjamin Herrenschmidt
  1 sibling, 1 reply; 42+ messages in thread
From: Andy Lutomirski @ 2014-09-24 21:41 UTC (permalink / raw)
  To: Benjamin Herrenschmidt
  Cc: linux-s390@vger.kernel.org, Konrad Rzeszutek Wilk,
	Michael S. Tsirkin, Linux Virtualization, Christian Borntraeger,
	linux390@de.ibm.com, Paolo Bonzini, David Woodhouse

On Sat, Sep 20, 2014 at 10:05 PM, Benjamin Herrenschmidt
<benh@kernel.crashing.org> wrote:
> On Sun, 2014-09-21 at 15:03 +1000, Benjamin Herrenschmidt wrote:
>
>> The exception I mentioned is that I would really like the virtio device
>> to expose via whatever transport we chose to use (though capability
>> exchange sounds like a reasonable one) whether the "server"
>> implementation is bypassing IOMMUs or not instead on relying on client
>> side heuristics.
>>
>> IE. Basically, we are trying to "guess" with an ifdef CONFIG_PPC, what
>> is essentially an attribute of the server-side, ie, whether is bypasses
>> the iommu for the PCI bus it resides on.
>
>> I believe all the arguments about whether this should be a bus property
>> or whether the x86 case can be worked around via ACPI tables etc... are
>> all moot. Today, qemu implementation can put virtio devices on busses
>> with an iommu and bypass it, so at the very least for backward
>> compatibility, we should expose that appropriately from the "server"
>> side.
>
> And of course, since we are talking about backward compatibility with
> existing qemus here, the capability should be the opposite, ie "honor
> iommu", with the assumption that without it, the implementation bypasses
> it, which reflects what the current qemu implementation does on any
> architecture, whether you configure the bus to have an iommu emulated on
> it or not.

Can PPC do this using a new devicetree property?

--Andy

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH v5 2/3] virtio_pci: Use the DMA API for virtqueues when possible
  2014-09-24 21:41                 ` Andy Lutomirski
@ 2014-09-24 21:50                   ` Benjamin Herrenschmidt
  2014-09-24 21:59                     ` Andy Lutomirski
  0 siblings, 1 reply; 42+ messages in thread
From: Benjamin Herrenschmidt @ 2014-09-24 21:50 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: linux-s390@vger.kernel.org, Konrad Rzeszutek Wilk,
	Michael S. Tsirkin, Linux Virtualization, Christian Borntraeger,
	linux390@de.ibm.com, Paolo Bonzini, David Woodhouse

On Wed, 2014-09-24 at 14:41 -0700, Andy Lutomirski wrote:
> On Sat, Sep 20, 2014 at 10:05 PM, Benjamin Herrenschmidt
> <benh@kernel.crashing.org> wrote:
> > On Sun, 2014-09-21 at 15:03 +1000, Benjamin Herrenschmidt wrote:
> >
> >> The exception I mentioned is that I would really like the virtio device
> >> to expose via whatever transport we chose to use (though capability
> >> exchange sounds like a reasonable one) whether the "server"
> >> implementation is bypassing IOMMUs or not instead on relying on client
> >> side heuristics.
> >>
> >> IE. Basically, we are trying to "guess" with an ifdef CONFIG_PPC, what
> >> is essentially an attribute of the server-side, ie, whether is bypasses
> >> the iommu for the PCI bus it resides on.
> >
> >> I believe all the arguments about whether this should be a bus property
> >> or whether the x86 case can be worked around via ACPI tables etc... are
> >> all moot. Today, qemu implementation can put virtio devices on busses
> >> with an iommu and bypass it, so at the very least for backward
> >> compatibility, we should expose that appropriately from the "server"
> >> side.
> >
> > And of course, since we are talking about backward compatibility with
> > existing qemus here, the capability should be the opposite, ie "honor
> > iommu", with the assumption that without it, the implementation bypasses
> > it, which reflects what the current qemu implementation does on any
> > architecture, whether you configure the bus to have an iommu emulated on
> > it or not.
> 
> Can PPC do this using a new devicetree property?

The DT props for PCI devices are created by the FW inside the guest from
standard PCI probing, so it would have to get the info from qemu via
config or register space.

Cheers,
Ben.

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH v5 2/3] virtio_pci: Use the DMA API for virtqueues when possible
  2014-09-24 21:50                   ` Benjamin Herrenschmidt
@ 2014-09-24 21:59                     ` Andy Lutomirski
  2014-09-24 22:04                       ` Benjamin Herrenschmidt
  0 siblings, 1 reply; 42+ messages in thread
From: Andy Lutomirski @ 2014-09-24 21:59 UTC (permalink / raw)
  To: Benjamin Herrenschmidt
  Cc: linux-s390@vger.kernel.org, Konrad Rzeszutek Wilk,
	Michael S. Tsirkin, Linux Virtualization, Christian Borntraeger,
	linux390@de.ibm.com, Paolo Bonzini, David Woodhouse

On Wed, Sep 24, 2014 at 2:50 PM, Benjamin Herrenschmidt
<benh@kernel.crashing.org> wrote:
> On Wed, 2014-09-24 at 14:41 -0700, Andy Lutomirski wrote:
>> On Sat, Sep 20, 2014 at 10:05 PM, Benjamin Herrenschmidt
>> <benh@kernel.crashing.org> wrote:
>> > On Sun, 2014-09-21 at 15:03 +1000, Benjamin Herrenschmidt wrote:
>> >
>> >> The exception I mentioned is that I would really like the virtio device
>> >> to expose via whatever transport we chose to use (though capability
>> >> exchange sounds like a reasonable one) whether the "server"
>> >> implementation is bypassing IOMMUs or not instead on relying on client
>> >> side heuristics.
>> >>
>> >> IE. Basically, we are trying to "guess" with an ifdef CONFIG_PPC, what
>> >> is essentially an attribute of the server-side, ie, whether is bypasses
>> >> the iommu for the PCI bus it resides on.
>> >
>> >> I believe all the arguments about whether this should be a bus property
>> >> or whether the x86 case can be worked around via ACPI tables etc... are
>> >> all moot. Today, qemu implementation can put virtio devices on busses
>> >> with an iommu and bypass it, so at the very least for backward
>> >> compatibility, we should expose that appropriately from the "server"
>> >> side.
>> >
>> > And of course, since we are talking about backward compatibility with
>> > existing qemus here, the capability should be the opposite, ie "honor
>> > iommu", with the assumption that without it, the implementation bypasses
>> > it, which reflects what the current qemu implementation does on any
>> > architecture, whether you configure the bus to have an iommu emulated on
>> > it or not.
>>
>> Can PPC do this using a new devicetree property?
>
> The DT props for PCI devices are created by the FW inside the guest from
> standard PCI probing, so it would have to get the info from qemu via
> config or register space.
>

Scratch that idea, then.

The best that I can currently come up with is to say that pre-1.0
devices on PPC bypass the IOMMU and that 1.0 devices on PPC and all
devices on all other architectures do not bypass the IOMMU.
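
As a sketch -- assuming a feature bit for 1.0 devices along the lines of the
proposed VIRTIO_F_VERSION_1, which virtio_pci doesn't negotiate today -- the
heuristic would be roughly:

    static bool vp_use_dma_api(struct virtio_device *vdev)
    {
    #ifdef CONFIG_PPC
            /* Legacy (pre-1.0) devices on PPC keep bypassing the IOMMU. */
            if (!virtio_has_feature(vdev, VIRTIO_F_VERSION_1))
                    return false;
    #endif
            return true;
    }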

I agree that this is far from ideal.  Any ideas?  Is there any other
channel that QEMU can use to signal to a PPC guest that a specific PCI
virtio device isn't really behind an IOMMU?  From my POV, the main
consideration is that existing QEMU versions hosting Xen hypervisors
should work, which is not the case in current kernels, but which is
the case with my patches without any known regressions.

Is there some evil trick that a PPC guest could use to detect whether
the IOMMU is honored?  As an example that I don't like at all, the
guest could program the IOMMU so that the ring's physical address maps
to a second copy of the ring and then see which one works.

--Andy

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH v5 2/3] virtio_pci: Use the DMA API for virtqueues when possible
  2014-09-24 21:59                     ` Andy Lutomirski
@ 2014-09-24 22:04                       ` Benjamin Herrenschmidt
  2014-09-24 22:15                         ` Andy Lutomirski
  0 siblings, 1 reply; 42+ messages in thread
From: Benjamin Herrenschmidt @ 2014-09-24 22:04 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: linux-s390@vger.kernel.org, Konrad Rzeszutek Wilk,
	Michael S. Tsirkin, Linux Virtualization, Christian Borntraeger,
	linux390@de.ibm.com, Paolo Bonzini, David Woodhouse

On Wed, 2014-09-24 at 14:59 -0700, Andy Lutomirski wrote:

> Scratch that idea, then.
> 
> The best that I can currently come up with is to say that pre-1.0
> devices on PPC bypass the IOMMU and that 1.0 devices on PPC and all
> devices on all other architectures do not bypass the IOMMU.

Well, the thing is, we *want* them to bypass the IOMMU for performance
reasons in the long run. Today we have no way to tell our guests that
a PCI bus doesn't have an IOMMU; they always do!

Also qemu can mix and match devices on a given PCI bus so making this a
bus property wouldn't work either.

> I agree that this is far from ideal.  Any ideas?  Is there any other
> channel that QEMU can use to signal to a PPC guest that a specific PCI
> virtio device isn't really behind an IOMMU?  

Any reason why this can't be a virtio capability ?

> From my POV, the main
> consideration is that existing QEMU versions hosting Xen hypervisors
> should work, which is not the case in current kernels, but which is
> the case with my patches without any known regressions.
> 
> Is there some evil trick that a PPC guest could use to detect whether
> the IOMMU is honored?  As an example that I don't like at all, the
> guest could program the IOMMU so that the ring's physical address maps
> to a second copy of the ring and then see which one works.

Cheers,
Ben.

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH v5 2/3] virtio_pci: Use the DMA API for virtqueues when possible
  2014-09-24 22:04                       ` Benjamin Herrenschmidt
@ 2014-09-24 22:15                         ` Andy Lutomirski
  2014-09-24 22:38                           ` Benjamin Herrenschmidt
  0 siblings, 1 reply; 42+ messages in thread
From: Andy Lutomirski @ 2014-09-24 22:15 UTC (permalink / raw)
  To: Benjamin Herrenschmidt
  Cc: linux-s390@vger.kernel.org, Konrad Rzeszutek Wilk,
	Michael S. Tsirkin, Linux Virtualization, Christian Borntraeger,
	linux390@de.ibm.com, Paolo Bonzini, David Woodhouse

On Wed, Sep 24, 2014 at 3:04 PM, Benjamin Herrenschmidt
<benh@kernel.crashing.org> wrote:
> On Wed, 2014-09-24 at 14:59 -0700, Andy Lutomirski wrote:
>
>> Scratch that idea, then.
>>
>> The best that I can currently come up with is to say that pre-1.0
>> devices on PPC bypass the IOMMU and that 1.0 devices on PPC and all
>> devices on all other architectures do not bypass the IOMMU.
>
> Well, the thing is, we *want* them to bypass the IOMMU for performance
> reasons in the long run. Today we have no ways to tell our guests that
> a PCI bus doesn't have an IOMMU, they always do !

Alternatively, the IOMMU could be extended to allow an efficient
identity map of the entire physical address space.

>
> Also qemu can mix and match devices on a given PCI bus so making this a
> bus property wouldn't work either.
>
>> I agree that this is far from ideal.  Any ideas?  Is there any other
>> channel that QEMU can use to signal to a PPC guest that a specific PCI
>> virtio device isn't really behind an IOMMU?
>
> Any reason why this can't be a virtio capability ?
>

For Xen.

On Xen, unless we do something quite kludgey, QEMU *cannot* bypass the
DMA API, and QEMU doesn't even know that Xen is there.  On Linux
kernels on Xen, the "physical" address (as returned by sg_phys, etc)
is not a real physical address.  The DMA API translates back and
forth.  Without using the DMA API, all of the DMA ends up in the wrong
place, and everything breaks.

We could teach the driver to bypass the IOMMU but to still use Xen's
mapping correctly, but that's complicated, and I think it serves
little purpose.  As I understand it, there is no reason to ever bypass
an IOMMU on x86, since x86 IOMMUs can be programmed to efficiently
identity map everything.  So, on x86, I think that the right long-term
solution is to just use the DMA API always.
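
To make the address-space point concrete, this is roughly the shape of
the change (simplified, not a literal excerpt from the patches): today
virtio_ring puts the raw "physical" address into the descriptor, which
on Xen is a pseudo-physical address the device cannot actually reach,
whereas the DMA API hands back a bus address that swiotlb-xen (or an
IOMMU driver) has translated.

    #include <linux/dma-mapping.h>
    #include <linux/scatterlist.h>

    /* What virtio_ring effectively does today: the raw address goes
     * straight into the descriptor.  Wrong on Xen. */
    static u64 legacy_buf_addr(struct scatterlist *sg)
    {
        return sg_phys(sg);
    }

    /* With the DMA API, the platform's dma_map_ops (swiotlb-xen on a
     * Xen guest) return an address the device can really use. */
    static u64 dma_api_buf_addr(struct device *dma_dev,
                                struct scatterlist *sg,
                                enum dma_data_direction dir)
    {
        dma_addr_t addr = dma_map_page(dma_dev, sg_page(sg),
                                       sg->offset, sg->length, dir);

        if (dma_mapping_error(dma_dev, addr))
            return 0;   /* caller must handle failure and later unmap */
        return addr;
    }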

There is a potential security issue, though.  If virtio-pci sets up an
identity map, then a malicious PCI device could pretend to speak
virtio and then take over the system using that identity map.  Using a
virtio capability here doesn't help, since a malicious device will
just set that capability to indicate that it should be identity
mapped.

I guess that we ideally want a secure way for the guest to distinguish
between three cases (a rough sketch of how a driver might act on them
follows below):

a) IOMMU should be used.
b) Device is trusted but the IOMMU still works.  The guest may want to
set up an identity map.
c) (PPC only) Device bypasses the IOMMU.  There's no security issue
here: a malicious device wouldn't be able to bypass the IOMMU, so it
would be unable to do DMA at all.
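
Something like this (hypothetical; the enum and how the case gets
discovered are made up for illustration) is what I have in mind for how
a guest driver would act on that distinction:

    #include <linux/dma-mapping.h>
    #include <linux/io.h>

    enum virtio_dma_mode {
        VIRTIO_DMA_IOMMU,       /* (a) translate through the IOMMU */
        VIRTIO_DMA_IDENTITY,    /* (b) trusted device, 1:1 map is fine */
        VIRTIO_DMA_BYPASS,      /* (c) device bypasses the IOMMU */
    };

    static dma_addr_t virtio_buf_to_bus(struct device *dev, void *buf,
                                        size_t len,
                                        enum dma_data_direction dir,
                                        enum virtio_dma_mode mode)
    {
        switch (mode) {
        case VIRTIO_DMA_IOMMU:
        case VIRTIO_DMA_IDENTITY:
            /* Let the DMA API do the work; for (b) the platform would
             * have installed identity-map dma_ops. */
            return dma_map_single(dev, buf, len, dir);
        case VIRTIO_DMA_BYPASS:
            /* Device sees raw physical addresses, as today. */
            return virt_to_phys(buf);
        }
        return 0;
    }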

--Andy

>> From my POV, the main
>> consideration is that existing QEMU versions hosting Xen hypervisors
>> should work, which is not the case in current kernels, but which is
>> the case with my patches without any known regressions.
>>
>> Is there some evil trick that a PPC guest could use to detect whether
>> the IOMMU is honored?  As an example that I don't like at all, the
>> guest could program the IOMMU so that the ring's physical address maps
>> to a second copy of the ring and then see which one works.
>
> Cheers,
> Ben.
>
>



-- 
Andy Lutomirski
AMA Capital Management, LLC

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH v5 2/3] virtio_pci: Use the DMA API for virtqueues when possible
  2014-09-24 22:15                         ` Andy Lutomirski
@ 2014-09-24 22:38                           ` Benjamin Herrenschmidt
  2014-09-24 22:49                             ` Andy Lutomirski
  0 siblings, 1 reply; 42+ messages in thread
From: Benjamin Herrenschmidt @ 2014-09-24 22:38 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: linux-s390@vger.kernel.org, Konrad Rzeszutek Wilk,
	Michael S. Tsirkin, Linux Virtualization, Christian Borntraeger,
	linux390@de.ibm.com, Paolo Bonzini, David Woodhouse

On Wed, 2014-09-24 at 15:15 -0700, Andy Lutomirski wrote:
> On Wed, Sep 24, 2014 at 3:04 PM, Benjamin Herrenschmidt
> <benh@kernel.crashing.org> wrote:
> > On Wed, 2014-09-24 at 14:59 -0700, Andy Lutomirski wrote:
> >
> >> Scratch that idea, then.
> >>
> >> The best that I can currently come up with is to say that pre-1.0
> >> devices on PPC bypass the IOMMU and that 1.0 devices on PPC and all
> >> devices on all other architectures do not bypass the IOMMU.
> >
> > Well, the thing is, we *want* them to bypass the IOMMU for performance
> > reasons in the long run. Today we have no ways to tell our guests that
> > a PCI bus doesn't have an IOMMU, they always do !
> 
> Alternatively, the IOMMU could be extended to allow an efficient
> identity map of the entire physical address space.

We have that option, but I'm a bit reluctant to rely on it; it has its
issues. Still, it's something we can look into.

> > Also qemu can mix and match devices on a given PCI bus so making this a
> > bus property wouldn't work either.
> >
> >> I agree that this is far from ideal.  Any ideas?  Is there any other
> >> channel that QEMU can use to signal to a PPC guest that a specific PCI
> >> virtio device isn't really behind an IOMMU?
> >
> > Any reason why this can't be a virtio capability ?
> >
> 
> For Xen.

And? I don't see the problem... When running Xen, you don't put the
capability in. They are negotiated... Why can't qemu set the capability
accordingly based on whether it's using KVM or Xen? It's not like we
care about backward compat of ppc on Xen anyway...


 .../...

> a) IOMMU should be used.
> b) Device is trusted but the IOMMU still works.  The guest may want to
> set up an identity map.
> c) (PPC only) Device bypasses the IOMMU.  There's no security issue
> here: a malicious device wouldn't be able to bypass the IOMMU, so it
> would be unable to do DMA at all.

I wouldn't make it "ppc only" at this point but yes. Also keep in mind
that we do have ppc cases where we will want to use the iommu (in which
case the capability would be enforced by the other half).

Cheers,
Ben.

> --Andy
> 
> >> From my POV, the main
> >> consideration is that existing QEMU versions hosting Xen hypervisors
> >> should work, which is not the case in current kernels, but which is
> >> the case with my patches without any known regressions.
> >>
> >> Is there some evil trick that a PPC guest could use to detect whether
> >> the IOMMU is honored?  As an example that I don't like at all, the
> >> guest could program the IOMMU so that the ring's physical address maps
> >> to a second copy of the ring and then see which one works.
> >
> > Cheers,
> > Ben.
> >
> >
> 
> 
> 

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH v5 2/3] virtio_pci: Use the DMA API for virtqueues when possible
  2014-09-24 22:38                           ` Benjamin Herrenschmidt
@ 2014-09-24 22:49                             ` Andy Lutomirski
  0 siblings, 0 replies; 42+ messages in thread
From: Andy Lutomirski @ 2014-09-24 22:49 UTC (permalink / raw)
  To: Benjamin Herrenschmidt
  Cc: linux-s390@vger.kernel.org, Konrad Rzeszutek Wilk,
	Michael S. Tsirkin, Linux Virtualization, Christian Borntraeger,
	linux390@de.ibm.com, Paolo Bonzini, David Woodhouse

On Wed, Sep 24, 2014 at 3:38 PM, Benjamin Herrenschmidt
<benh@kernel.crashing.org> wrote:
> On Wed, 2014-09-24 at 15:15 -0700, Andy Lutomirski wrote:
>> On Wed, Sep 24, 2014 at 3:04 PM, Benjamin Herrenschmidt
>> <benh@kernel.crashing.org> wrote:
>> > On Wed, 2014-09-24 at 14:59 -0700, Andy Lutomirski wrote:
>> >
>> >> Scratch that idea, then.
>> >>
>> >> The best that I can currently come up with is to say that pre-1.0
>> >> devices on PPC bypass the IOMMU and that 1.0 devices on PPC and all
>> >> devices on all other architectures do not bypass the IOMMU.
>> >
>> > Well, the thing is, we *want* them to bypass the IOMMU for performance
>> > reasons in the long run. Today we have no ways to tell our guests that
>> > a PCI bus doesn't have an IOMMU, they always do !
>>
>> Alternatively, the IOMMU could be extended to allow an efficient
>> identity map of the entire physical address space.
>
> We have that option, but I'm a bit reluctant to rely on it, it has its
> issues, but it's something we can look into.
>
>> > Also qemu can mix and match devices on a given PCI bus so making this a
>> > bus property wouldn't work either.
>> >
>> >> I agree that this is far from ideal.  Any ideas?  Is there any other
>> >> channel that QEMU can use to signal to a PPC guest that a specific PCI
>> >> virtio device isn't really behind an IOMMU?
>> >
>> > Any reason why this can't be a virtio capability ?
>> >
>>
>> For Xen.
>
> And ? I don't see the problem... When running Xen, you don't put the
> capability in. They are negociated... Why can't qemu set the capability
> accordingly based on whether it's using KVM or Xen ? It's not like we
> care about backward compat of ppc on Xen anyway...
>

QEMU doesn't know that Xen is running.  I can do qemu-system-x86_64
-kernel /path/to/xen (virtme-run --xen does this) or
qemu-system-x86_64 -hda disk_with_xen.img and Xen will run.

>
>  .../...
>
>> a) IOMMU should be used.
>> b) Device is trusted but the IOMMU still works.  The guest may want to
>> set up an identity map.
>> c) (PPC only) Device bypasses the IOMMU.  There's no security issue
>> here: a malicious device wouldn't be able to bypass the IOMMU, so it
>> would be unable to do DMA at all.
>
> I wouldn't make it "ppc only" at this point but yes. Also keep in mind
> that we do have ppc cases where we will want to use the iommu (in which
> case the capability would be enforced by the other half).

The only reason I say "PPC only" is because I don't think that other
architectures would want to choose c.

We could plausibly use paravirt tricks for this, but that gets messy,
since e.g. the KVM kernel code might not even know what QEMU wants to
do.

--Andy

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH v5 2/3] virtio_pci: Use the DMA API for virtqueues when possible
  2014-09-17 14:16     ` Michael S. Tsirkin
  2014-09-17 16:07       ` Andy Lutomirski
  2014-09-19 21:31       ` Benjamin Herrenschmidt
@ 2014-09-29 18:55       ` Andy Lutomirski
  2014-09-29 20:49         ` Benjamin Herrenschmidt
  2014-10-06  9:59         ` Christian Borntraeger
  2 siblings, 2 replies; 42+ messages in thread
From: Andy Lutomirski @ 2014-09-29 18:55 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: linux-s390@vger.kernel.org, Konrad Rzeszutek Wilk,
	Benjamin Herrenschmidt, Linux Virtualization,
	Christian Borntraeger, Paolo Bonzini, linux390@de.ibm.com

On Wed, Sep 17, 2014 at 7:16 AM, Michael S. Tsirkin <mst@redhat.com> wrote:
> On Wed, Sep 17, 2014 at 08:02:31AM -0400, Benjamin Herrenschmidt wrote:
>> On Tue, 2014-09-16 at 22:22 -0700, Andy Lutomirski wrote:
>> > On non-PPC systems, virtio_pci should use the DMA API.  This fixes
>> > virtio_pci on Xen.  On PPC, using the DMA API would break things, so
>> > we need to preserve the old behavior.
>> >
>> > The big comment in this patch explains the considerations in more
>> > detail.
>>
>> I still disagree with using CONFIG_PPC as a trigger here.
>>
>> Fundamentally, the qemu implementation today bypasses IOMMUs on all
>> platforms as far as I can tell.
>>
>> If that changes, we'll have a backward compatibility problem.
>>
>> The virtio device should advertise whether it's using that bypass
>> mode of operation and virtio_pci should react accordingly.
>
> Well if there's a way to detect that - that's outside the
> device, then we probably shouldn't use device-specific
> interfaces for this capability.
>
>
>> There is a demand for being able to operate on top of an IOMMU on
>> powerpc as well for some embedded stuff using PCI as a transport so your
>> patch precludes that.
>>
>> Cheers,
>> Ben.
>
> As far as I can see, nothing changes on PPC so this patch
> does not preclude anything that was working?

Rusty and Michael, what's the status of this?

I think that (aside from the trivial DMI/DMA typo) the only real issue
here is that the situation on PPC is ugly.  We're failing to enable
physical virtio hardware on PPC with these patches, but that never
worked anyway.  I don't think that there are any regressions other
than ugliness.

My preference would be to apply the patches as is (or with "DMA"
spelled correctly), and then to:

 - Make sure that all virtio-mmio systems have working DMA ops so that
virtio-mmio can use the DMA API

 - Fix the DMA API on s390 (probably easy) and on PPC (not necessarily so easy)

 - Remove the non-DMA-API code, which would be a very small change on
top of these patches.

--Andy

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH v5 2/3] virtio_pci: Use the DMA API for virtqueues when possible
  2014-09-29 18:55       ` Andy Lutomirski
@ 2014-09-29 20:49         ` Benjamin Herrenschmidt
  2014-09-29 20:55           ` Andy Lutomirski
  2014-10-06  9:59         ` Christian Borntraeger
  1 sibling, 1 reply; 42+ messages in thread
From: Benjamin Herrenschmidt @ 2014-09-29 20:49 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: linux-s390@vger.kernel.org, Konrad Rzeszutek Wilk,
	Michael S. Tsirkin, Linux Virtualization, Christian Borntraeger,
	Paolo Bonzini, linux390@de.ibm.com

On Mon, 2014-09-29 at 11:55 -0700, Andy Lutomirski wrote:

> Rusty and Michael, what's the status of this?

The status is that I still think we need *a* way to actually inform the
guest whether the virtio implementation will or will not bypass the
IOMMU. I don't know Xen well enough to figure out how to do that, and we could
maybe just make it something qemu puts in the device-tree on powerpc
only.

However I dislike making it global or per-bus, we could have a
combination of qemu and HW virtio on the same guest, so I really think
this needs to be a capability of the virtio device.

I don't completely understand what games Xen is playing here, but from
what I can tell, it's pretty clear that today's qemu implementation
always bypasses any iommu and so should always be exported as such on
all platforms, at least all kvm and pure qemu ones.

> I think that (aside from the trivial DMI/DMA typo) the only real issue
> here is that the situation on PPC is ugly.  We're failing to enable
> physical virtio hardware on PPC with these patches, but that never
> worked anyway.  I don't think that there are any regressions other
> than ugliness.
> 
> My preference would be to apply the patches as is (or with "DMA"
> spelled correctly), and then to:
> 
>  - Make sure that all virtio-mmio systems have working DMA ops so that
> virtio-mmio can the DMA API
> 
>  - Fix the DMA API on s390 (probably easy) and on PPC (not necessarily so easy)
> 
>  - Remove the non-DMA-API code, which would be a very small change on
> top of these patches.
> 
> --Andy

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH v5 2/3] virtio_pci: Use the DMA API for virtqueues when possible
  2014-09-29 20:49         ` Benjamin Herrenschmidt
@ 2014-09-29 20:55           ` Andy Lutomirski
  2014-09-29 21:06             ` Benjamin Herrenschmidt
  2014-09-30 15:38             ` Michael S. Tsirkin
  0 siblings, 2 replies; 42+ messages in thread
From: Andy Lutomirski @ 2014-09-29 20:55 UTC (permalink / raw)
  To: Benjamin Herrenschmidt
  Cc: linux-s390@vger.kernel.org, Konrad Rzeszutek Wilk,
	Michael S. Tsirkin, Linux Virtualization, Christian Borntraeger,
	Paolo Bonzini, linux390@de.ibm.com

On Mon, Sep 29, 2014 at 1:49 PM, Benjamin Herrenschmidt
<benh@kernel.crashing.org> wrote:
> On Mon, 2014-09-29 at 11:55 -0700, Andy Lutomirski wrote:
>
>> Rusty and Michael, what's the status of this?
>
> The status is that I still think we need *a* way to actually inform the
> guest whether the virtio implementation will or will not bypass the
> IOMMU. I don't know Xen enough to figure out how to do that and we could
> maybe just make it something qemu puts in the device-tree on powerpc
> only.
>
> However I dislike making it global or per-bus, we could have a
> combination of qemu and HW virtio on the same guest, so I really think
> this needs to be a capability of the virtio device.

Or a capability of the PCI slot, somehow.  I don't understand PCI
topology very well.

>
> I don't completely understand what games Xen is playing here, but from
> what I can tell, it's pretty clear that today's qemu implementation
> always bypasses any iommu and so should always be exported as such on
> all platforms, at least all kvm and pure qemu ones.

Except that I think that PPC is the only platform on which QEMU's code
actually bypasses any IOMMU.  Unless we've all missed something, there
is no QEMU release that will put a virtio device behind an IOMMU on
any platform other than PPC.

If the eventual solution is to say that virtio 1.0 PCI devices always
respect an IOMMU unless they set a magic flag saying "I'm not real
hardware and I bypass the IOMMU", then I don't really object to that,
except that it'll be a mess if the guest is running Xen.  But even Xen
would (I think) be okay if it actually worked by having a new DMA API
operation that says "this device is magically identity mapped" and
then just teaching Xen to implement that.
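
Concretely, I'm imagining something like the sketch below.  The query
does not exist today; the name and the helper are made up purely to
show the shape of the idea.

    #include <linux/device.h>
    #include <linux/dma-mapping.h>

    /* Hypothetical new DMA API query, implemented by the platform's
     * dma_map_ops: "DMA from this device is identity mapped". */
    bool dma_is_identity_mapped(struct device *dev);

    static bool virtio_should_use_dma_api(struct device *dma_dev)
    {
        /*
         * If the platform promises an identity map, virtio_ring could
         * keep using physical addresses; otherwise it must go through
         * dma_map_*() so that Xen/IOMMU translation is honoured.
         */
        return !dma_is_identity_mapped(dma_dev);
    }

Xen would then have to implement whatever makes that promise true for
such devices, which is the "teaching Xen" part.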

But I'm not an OASIS member, so I can't really do this.  I agree that
this issue needs to be addressed somehow, but I don't think it needs
to block these patches.

--Andy

>
>> I think that (aside from the trivial DMI/DMA typo) the only real issue
>> here is that the situation on PPC is ugly.  We're failing to enable
>> physical virtio hardware on PPC with these patches, but that never
>> worked anyway.  I don't think that there are any regressions other
>> than ugliness.
>>
>> My preference would be to apply the patches as is (or with "DMA"
>> spelled correctly), and then to:
>>
>>  - Make sure that all virtio-mmio systems have working DMA ops so that
>> virtio-mmio can the DMA API
>>
>>  - Fix the DMA API on s390 (probably easy) and on PPC (not necessarily so easy)
>>
>>  - Remove the non-DMA-API code, which would be a very small change on
>> top of these patches.
>>
>> --Andy
>
>



-- 
Andy Lutomirski
AMA Capital Management, LLC

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH v5 2/3] virtio_pci: Use the DMA API for virtqueues when possible
  2014-09-29 20:55           ` Andy Lutomirski
@ 2014-09-29 21:06             ` Benjamin Herrenschmidt
  2014-09-30 15:38             ` Michael S. Tsirkin
  1 sibling, 0 replies; 42+ messages in thread
From: Benjamin Herrenschmidt @ 2014-09-29 21:06 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: linux-s390@vger.kernel.org, Konrad Rzeszutek Wilk,
	Michael S. Tsirkin, Linux Virtualization, Christian Borntraeger,
	Paolo Bonzini, linux390@de.ibm.com

On Mon, 2014-09-29 at 13:55 -0700, Andy Lutomirski wrote:
> If the eventual solution is to say that virtio 1.0 PCI devices always
> respect an IOMMU unless they set a magic flag saying "I'm not real
> hardware and I bypass the IOMMU", then I don't really object to that,
> except that it'll be a mess if the guest is running Xen.  But even Xen
> would (I think) be okay if it actually worked by having a new DMA API
> operation that says "this device is magically identity mapped" and
> then just teaching Xen to implement that.
> 
> But I'm not an OASIS member, so I can't really do this.  I agree that
> this issue needs to be addressed somehow, but I don't think it needs
> to block these patches.

I'll let Rusty be the final judge of that (well, when he's back from his
vacation that is).

Cheers,
Ben.

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH v5 2/3] virtio_pci: Use the DMA API for virtqueues when possible
  2014-09-29 20:55           ` Andy Lutomirski
  2014-09-29 21:06             ` Benjamin Herrenschmidt
@ 2014-09-30 15:38             ` Michael S. Tsirkin
  2014-09-30 15:48               ` Andy Lutomirski
                                 ` (2 more replies)
  1 sibling, 3 replies; 42+ messages in thread
From: Michael S. Tsirkin @ 2014-09-30 15:38 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: linux-s390@vger.kernel.org, Konrad Rzeszutek Wilk,
	Benjamin Herrenschmidt, Linux Virtualization,
	Christian Borntraeger, Paolo Bonzini, linux390@de.ibm.com

On Mon, Sep 29, 2014 at 01:55:11PM -0700, Andy Lutomirski wrote:
> On Mon, Sep 29, 2014 at 1:49 PM, Benjamin Herrenschmidt
> <benh@kernel.crashing.org> wrote:
> > On Mon, 2014-09-29 at 11:55 -0700, Andy Lutomirski wrote:
> >
> >> Rusty and Michael, what's the status of this?
> >
> > The status is that I still think we need *a* way to actually inform the
> > guest whether the virtio implementation will or will not bypass the
> > IOMMU. I don't know Xen enough to figure out how to do that and we could
> > maybe just make it something qemu puts in the device-tree on powerpc
> > only.
> >
> > However I dislike making it global or per-bus, we could have a
> > combination of qemu and HW virtio on the same guest, so I really think
> > this needs to be a capability of the virtio device.
> 
> Or a capability of the PCI slot, somehow.  I don't understand PCI
> topology very well.
> 
> >
> > I don't completely understand what games Xen is playing here, but from
> > what I can tell, it's pretty clear that today's qemu implementation
> > always bypasses any iommu and so should always be exported as such on
> > all platforms, at least all kvm and pure qemu ones.
> 
> Except that I think that PPC is the only platform on which QEMU's code
> actually bypasses any IOMMU.  Unless we've all missed something, there
> is no QEMU release that will put a virtio device behind an IOMMU on
> any platform other than PPC.

I think that is true but it seems that this will be true for x86 for
QEMU 2.2 unless we make some changes there.
Which we might not have the time for since 2.2 is feature frozen
from tomorrow.
Maybe we should disable the IOMMU in 2.2, this is worth considering.


> If the eventual solution is to say that virtio 1.0 PCI devices always
> respect an IOMMU unless they set a magic flag saying "I'm not real
> hardware and I bypass the IOMMU", then I don't really object to that,
> except that it'll be a mess if the guest is running Xen.  But even Xen
> would (I think) be okay if it actually worked by having a new DMA API
> operation that says "this device is magically identity mapped" and
> then just teaching Xen to implement that.
> 
> But I'm not an OASIS member, so I can't really do this.  I agree that
> this issue needs to be addressed somehow, but I don't think it needs
> to block these patches.
> 
> --Andy

I thought hard about this; I think we are better off waiting till the
next release: there's a chance QEMU will have IOMMU support for KVM x86
then, and this will make it easier to judge which way the wind
blows.

It seems that we lose nothing substantial keeping the status quo a bit longer,
but if we make an incompatible change in guests now we might
create nasty compatibility headaches going forward.

> >
> >> I think that (aside from the trivial DMI/DMA typo) the only real issue
> >> here is that the situation on PPC is ugly.  We're failing to enable
> >> physical virtio hardware on PPC with these patches, but that never
> >> worked anyway.  I don't think that there are any regressions other
> >> than ugliness.
> >>
> >> My preference would be to apply the patches as is (or with "DMA"
> >> spelled correctly), and then to:
> >>
> >>  - Make sure that all virtio-mmio systems have working DMA ops so that
> >> virtio-mmio can the DMA API
> >>
> >>  - Fix the DMA API on s390 (probably easy) and on PPC (not necessarily so easy)
> >>
> >>  - Remove the non-DMA-API code, which would be a very small change on
> >> top of these patches.
> >>
> >> --Andy
> >
> >
> 
> 
> 
> -- 
> Andy Lutomirski
> AMA Capital Management, LLC

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH v5 2/3] virtio_pci: Use the DMA API for virtqueues when possible
  2014-09-30 15:38             ` Michael S. Tsirkin
@ 2014-09-30 15:48               ` Andy Lutomirski
  2014-09-30 16:19                 ` Andy Lutomirski
                                   ` (2 more replies)
  2014-09-30 15:53               ` Paolo Bonzini
  2014-09-30 20:05               ` Andy Lutomirski
  2 siblings, 3 replies; 42+ messages in thread
From: Andy Lutomirski @ 2014-09-30 15:48 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: linux-s390@vger.kernel.org, Konrad Rzeszutek Wilk,
	Benjamin Herrenschmidt, Linux Virtualization,
	Christian Borntraeger, Paolo Bonzini, linux390@de.ibm.com

On Tue, Sep 30, 2014 at 8:38 AM, Michael S. Tsirkin <mst@redhat.com> wrote:
> I thought hard about this, I think we are better off waiting till the
> next release: there's a chance QEMU will have IOMMU support for KVM x86
> then, and this will make it easier to judge which way does the wind
> blow.
>
> It seems that we lose nothing substantial keeping the status quo a bit longer,
> but if we make an incompatible change in guests now we might
> create nasty compatibility headaches going forward.
>

I would argue for the opposite approach.  Having a QEMU release that
supports an IOMMU on x86 and exposes a commonly used PCI device that
bypasses that IOMMU without any explicit notification to the guest
(and specification!) that this is happening is IMO insane.  Once that
happens, we'll have to address the nasty case on both x86 and PPC.
This will suck.

If we accept the guest change and make sure that there is never a QEMU
release that has a visible IOMMU cheat on any arch other than PPC,
then at least the damage will be contained.

x86 will be worse than PPC, too: the special case needed to support
QEMU 2.2 with IOMMU and virtio enabled with a Xen guest will be fairly
large and disgusting and will only exist to support something that IMO
should never have existed in the first place.

PPC at least avoids *that* problem by virtue of not having Xen
paravirt.  (And please don't add Xen paravirt to PPC -- x86 is trying
to kill it off, but this is a 5-10 year project.)

[..., reordered]

>>
>> Except that I think that PPC is the only platform on which QEMU's code
>> actually bypasses any IOMMU.  Unless we've all missed something, there
>> is no QEMU release that will put a virtio device behind an IOMMU on
>> any platform other than PPC.
>
> I think that is true but it seems that this will be true for x86 for
> QEMU 2.2 unless we make some changes there.
> Which we might not have the time for since 2.2 is feature frozen
> from tomorrow.
> Maybe we should disable the IOMMU in 2.2, this is worth considering.
>

Please do.

Also, try booting this 2.2 QEMU candidate with nested virtualization
on.  Then bind vfio to a virtio-pci device and watch the guest get
corrupted.  QEMU will blame Linux for incorrectly programming the
hardware, and Linux will blame QEMU for its blatant violation of the
ACPI spec.  Given that this is presumably most of the point of adding
IOMMU support, it seems like a terrible idea to let code like that
into the wild.

If this happens, Linux may also end up needing a quirk to prevent vfio
from binding to QEMU 2.2's virtio-pci devices.

--Andy

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH v5 2/3] virtio_pci: Use the DMA API for virtqueues when possible
  2014-09-30 15:38             ` Michael S. Tsirkin
  2014-09-30 15:48               ` Andy Lutomirski
@ 2014-09-30 15:53               ` Paolo Bonzini
  2014-10-01  7:36                 ` Michael S. Tsirkin
  2014-09-30 20:05               ` Andy Lutomirski
  2 siblings, 1 reply; 42+ messages in thread
From: Paolo Bonzini @ 2014-09-30 15:53 UTC (permalink / raw)
  To: Michael S. Tsirkin, Andy Lutomirski
  Cc: linux-s390@vger.kernel.org, Konrad Rzeszutek Wilk,
	Benjamin Herrenschmidt, Linux Virtualization,
	Christian Borntraeger, linux390@de.ibm.com

Il 30/09/2014 17:38, Michael S. Tsirkin ha scritto:
> I think that is true but it seems that this will be true for x86 for
> QEMU 2.2 unless we make some changes there.
> Which we might not have the time for since 2.2 is feature frozen
> from tomorrow.
> Maybe we should disable the IOMMU in 2.2, this is worth considering.

It is disabled by default, no?  And only supported by Q35 which does not
yet have migration compatibility (though we are versioning it, not sure
why that is the case).

Paolo

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH v5 2/3] virtio_pci: Use the DMA API for virtqueues when possible
  2014-09-30 15:48               ` Andy Lutomirski
@ 2014-09-30 16:19                 ` Andy Lutomirski
  2014-09-30 17:53                 ` Konrad Rzeszutek Wilk
  2014-10-01  6:42                 ` Michael S. Tsirkin
  2 siblings, 0 replies; 42+ messages in thread
From: Andy Lutomirski @ 2014-09-30 16:19 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: linux-s390@vger.kernel.org, Konrad Rzeszutek Wilk,
	Benjamin Herrenschmidt, Linux Virtualization,
	Christian Borntraeger, Paolo Bonzini, linux390@de.ibm.com

On Tue, Sep 30, 2014 at 8:48 AM, Andy Lutomirski <luto@amacapital.net> wrote:
> On Tue, Sep 30, 2014 at 8:38 AM, Michael S. Tsirkin <mst@redhat.com> wrote:
>> Maybe we should disable the IOMMU in 2.2, this is worth considering.
>>
>
> Please do.
>
> Also, try booting this 2.2 QEMU candidate with nested virtualization
> on.  Then bind vfio to a virtio-pci device and watch the guest get
> corrupted.  QEMU will blame Linux for incorrectly programming the
> hardware, and Linux will blame QEMU for its blatant violation of the
> ACPI spec.  Given that this is presumably most of the point of adding
> IOMMU support, it seems like a terrible idea to let code like that
> into the wild.
>
> If this happens, Linux may also end up needing a quirk to prevent vfio
> from binding to QEMU 2.2's virtio-pci devices.

I just confirmed that a guest with my patches blows up if I run it
like this against QEMU master from today:

PATH=.:$PATH virtme-run --kimg
~/apps/linux-devel/arch/x86/boot/bzImage -a intel_iommu=on --qemu-opts
-machine q35,iommu=on

IOW, QEMU master is indeed presenting an IOMMU that blows up the guest
if the guest tries to use it as specified.

--Andy

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH v5 2/3] virtio_pci: Use the DMA API for virtqueues when possible
  2014-09-30 15:48               ` Andy Lutomirski
  2014-09-30 16:19                 ` Andy Lutomirski
@ 2014-09-30 17:53                 ` Konrad Rzeszutek Wilk
  2014-09-30 18:01                   ` Andy Lutomirski
  2014-10-01  6:42                 ` Michael S. Tsirkin
  2 siblings, 1 reply; 42+ messages in thread
From: Konrad Rzeszutek Wilk @ 2014-09-30 17:53 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: linux-s390@vger.kernel.org, Michael S. Tsirkin,
	Benjamin Herrenschmidt, Linux Virtualization,
	Christian Borntraeger, Paolo Bonzini, linux390@de.ibm.com

> x86 will be worse than PPC, too: the special case needed to support
> QEMU 2.2 with IOMMU and virtio enabled with a Xen guest will be fairly
> large and disgusting and will only exist to support something that IMO
> should never have existed in the first place.

<scratches his head> I don't follow.
> 
> PPC at least avoids *that* problem by virtue of not having Xen
> paravirt.  (And please don't add Xen paravirt to PPC -- x86 is trying
> to kill it off, but this is a 5-10 year project.)

Correction:
 - The Xen project is trying to kill some of the paravirts off.
 - KVM uses paravirts as well (and then added some)

> 
> [..., reordered]
> 
> >>
> >> Except that I think that PPC is the only platform on which QEMU's code
> >> actually bypasses any IOMMU.  Unless we've all missed something, there
> >> is no QEMU release that will put a virtio device behind an IOMMU on
> >> any platform other than PPC.
> >
> > I think that is true but it seems that this will be true for x86 for
> > QEMU 2.2 unless we make some changes there.
> > Which we might not have the time for since 2.2 is feature frozen
> > from tomorrow.
> > Maybe we should disable the IOMMU in 2.2, this is worth considering.
> >
> 
> Please do.
> 
> Also, try booting this 2.2 QEMU candidate with nested virtualization
> on.  Then bind vfio to a virtio-pci device and watch the guest get
> corrupted.  QEMU will blame Linux for incorrectly programming the

Hehe.
> hardware, and Linux will blame QEMU for its blatant violation of the
> ACPI spec.  Given that this is presumably most of the point of adding
> IOMMU support, it seems like a terrible idea to let code like that
> into the wild.
> 
> If this happens, Linux may also end up needing a quirk to prevent vfio
> from binding to QEMU 2.2's virtio-pci devices.
> 
> --Andy

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH v5 2/3] virtio_pci: Use the DMA API for virtqueues when possible
  2014-09-30 17:53                 ` Konrad Rzeszutek Wilk
@ 2014-09-30 18:01                   ` Andy Lutomirski
  2014-10-02 16:36                     ` Konrad Rzeszutek Wilk
  0 siblings, 1 reply; 42+ messages in thread
From: Andy Lutomirski @ 2014-09-30 18:01 UTC (permalink / raw)
  To: Konrad Rzeszutek Wilk
  Cc: linux-s390@vger.kernel.org, Michael S. Tsirkin,
	Benjamin Herrenschmidt, Linux Virtualization,
	Christian Borntraeger, Paolo Bonzini, linux390@de.ibm.com

On Tue, Sep 30, 2014 at 10:53 AM, Konrad Rzeszutek Wilk
<konrad.wilk@oracle.com> wrote:
>> x86 will be worse than PPC, too: the special case needed to support
>> QEMU 2.2 with IOMMU and virtio enabled with a Xen guest will be fairly
>> large and disgusting and will only exist to support something that IMO
>> should never have existed in the first place.
>
> <scratches his head> I don't follow.

If you boot a Xen PV dom0 on QEMU master with -machine q35,iommu=on
and you add a virtio device, dom0 will end up with a PCI device that
does DMA to "machine" addresses.  These addresses are not compatible
with the DMA API (which works with bus addresses), nor are they the
same as physical addresses.

So virtio in current kernels won't work for the same reason they never
work on Xen.  But virtio-pci with my patches won't work either,
because they (or the Xen hypervisor) will try to program the IOMMU
with a non-identity mapping, causing everything to explode.

Hacking up the virtio-pci driver to explicitly ask Xen for machine
addresses might work, but, at the very least, it will be a giant
security hole if anyone binds a virtio device to a domain other than
dom0 (which, again, is kind of the point of having an IOMMU).
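
For anyone not steeped in Xen PV, the address spaces in play look
roughly like this (simplified sketch, x86 PV guest assumed; the
function exists only for illustration):

    #include <linux/kernel.h>
    #include <linux/dma-mapping.h>
    #include <linux/pfn.h>
    #include <linux/io.h>
    #include <asm/xen/page.h>

    static void show_xen_pv_addresses(struct device *dev, void *buf,
                                      size_t len)
    {
        phys_addr_t pphys = virt_to_phys(buf);            /* pseudo-physical */
        unsigned long mfn = pfn_to_mfn(PFN_DOWN(pphys));  /* machine frame */
        dma_addr_t bus = dma_map_single(dev, buf, len, DMA_TO_DEVICE);

        if (dma_mapping_error(dev, bus))
            return;

        /* In general pphys != (mfn << PAGE_SHIFT), and neither has to
         * equal 'bus', which is why a device that bypasses the DMA API
         * (or whose IOMMU programming is wrong) scribbles on the wrong
         * memory. */
        pr_info("pseudo-phys %llx machine %llx bus %llx\n",
                (unsigned long long)pphys,
                (unsigned long long)mfn << PAGE_SHIFT,
                (unsigned long long)bus);

        dma_unmap_single(dev, bus, len, DMA_TO_DEVICE);
    }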

>>
>> PPC at least avoids *that* problem by virtue of not having Xen
>> paravirt.  (And please don't add Xen paravirt to PPC -- x86 is trying
>> to kill it off, but this is a 5-10 year project.)
>
> Correction:
>  - The Xen project is trying to kill some of the paravirts off.
>  - KVM uses paravirts as well (and then added some)

By "paravirt" I meant PV, where there's the weird physical/machine
address discrepancy that's visible to the guest.  This is not to say
that Xen PVH wouldn't also be screwed running on QEMU master.

--Andy

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH v5 2/3] virtio_pci: Use the DMA API for virtqueues when possible
  2014-09-30 15:38             ` Michael S. Tsirkin
  2014-09-30 15:48               ` Andy Lutomirski
  2014-09-30 15:53               ` Paolo Bonzini
@ 2014-09-30 20:05               ` Andy Lutomirski
  2 siblings, 0 replies; 42+ messages in thread
From: Andy Lutomirski @ 2014-09-30 20:05 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: linux-s390@vger.kernel.org, Konrad Rzeszutek Wilk,
	Benjamin Herrenschmidt, Linux Virtualization,
	Christian Borntraeger, Paolo Bonzini, linux390@de.ibm.com

On Tue, Sep 30, 2014 at 8:38 AM, Michael S. Tsirkin <mst@redhat.com> wrote:
> On Mon, Sep 29, 2014 at 01:55:11PM -0700, Andy Lutomirski wrote:
>> On Mon, Sep 29, 2014 at 1:49 PM, Benjamin Herrenschmidt
>> <benh@kernel.crashing.org> wrote:
>> > On Mon, 2014-09-29 at 11:55 -0700, Andy Lutomirski wrote:
>> >
>> >> Rusty and Michael, what's the status of this?
>> >
>> > The status is that I still think we need *a* way to actually inform the
>> > guest whether the virtio implementation will or will not bypass the
>> > IOMMU. I don't know Xen enough to figure out how to do that and we could
>> > maybe just make it something qemu puts in the device-tree on powerpc
>> > only.
>> >
>> > However I dislike making it global or per-bus, we could have a
>> > combination of qemu and HW virtio on the same guest, so I really think
>> > this needs to be a capability of the virtio device.
>>
>> Or a capability of the PCI slot, somehow.  I don't understand PCI
>> topology very well.
>>
>> >
>> > I don't completely understand what games Xen is playing here, but from
>> > what I can tell, it's pretty clear that today's qemu implementation
>> > always bypasses any iommu and so should always be exported as such on
>> > all platforms, at least all kvm and pure qemu ones.
>>
>> Except that I think that PPC is the only platform on which QEMU's code
>> actually bypasses any IOMMU.  Unless we've all missed something, there
>> is no QEMU release that will put a virtio device behind an IOMMU on
>> any platform other than PPC.
>
> I think that is true but it seems that this will be true for x86 for
> QEMU 2.2 unless we make some changes there.
> Which we might not have the time for since 2.2 is feature frozen
> from tomorrow.
> Maybe we should disable the IOMMU in 2.2, this is worth considering.
>
>
>> If the eventual solution is to say that virtio 1.0 PCI devices always
>> respect an IOMMU unless they set a magic flag saying "I'm not real
>> hardware and I bypass the IOMMU", then I don't really object to that,
>> except that it'll be a mess if the guest is running Xen.  But even Xen
>> would (I think) be okay if it actually worked by having a new DMA API
>> operation that says "this device is magically identity mapped" and
>> then just teaching Xen to implement that.
>>
>> But I'm not an OASIS member, so I can't really do this.  I agree that
>> this issue needs to be addressed somehow, but I don't think it needs
>> to block these patches.
>>
>> --Andy
>
> I thought hard about this, I think we are better off waiting till the
> next release: there's a chance QEMU will have IOMMU support for KVM x86
> then, and this will make it easier to judge which way does the wind
> blow.

If QEMU wants to fix this, it looks like it wouldn't be so bad.  The
virtio code could keep track of the address space in use.  It looks
like a lot of the changes would be simplifications, but vring_map
would have to go away.  That would cause some performance hit due to
the loss of the permanent ring mapping, but it would be no worse than
the temporary mappings that already exist for indirect descriptors and
actual data.
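
Roughly, instead of a long-lived pointer obtained from vring_map, each
descriptor access would be translated through the device's DMA address
space.  A sketch (not against any particular QEMU tree; it assumes
QEMU's existing dma_memory_read()/pci_get_address_space() helpers):

    #include "hw/pci/pci.h"
    #include "sysemu/dma.h"
    #include "qemu/bswap.h"

    /* Mirrors the layout of a virtio ring descriptor. */
    typedef struct SketchVRingDesc {
        uint64_t addr;
        uint32_t len;
        uint16_t flags;
        uint16_t next;
    } SketchVRingDesc;

    /* Read one descriptor through the device's DMA address space, so
     * any IOMMU translation is applied per access instead of relying
     * on a permanently mapped ring. */
    static void sketch_read_desc(PCIDevice *pci_dev, hwaddr desc_pa,
                                 unsigned idx, SketchVRingDesc *desc)
    {
        AddressSpace *as = pci_get_address_space(pci_dev);

        dma_memory_read(as, desc_pa + idx * sizeof(*desc),
                        desc, sizeof(*desc));

        desc->addr  = le64_to_cpu(desc->addr);
        desc->len   = le32_to_cpu(desc->len);
        desc->flags = le16_to_cpu(desc->flags);
        desc->next  = le16_to_cpu(desc->next);
    }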

--Andy

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH v5 2/3] virtio_pci: Use the DMA API for virtqueues when possible
  2014-09-30 15:48               ` Andy Lutomirski
  2014-09-30 16:19                 ` Andy Lutomirski
  2014-09-30 17:53                 ` Konrad Rzeszutek Wilk
@ 2014-10-01  6:42                 ` Michael S. Tsirkin
  2 siblings, 0 replies; 42+ messages in thread
From: Michael S. Tsirkin @ 2014-10-01  6:42 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: linux-s390@vger.kernel.org, Konrad Rzeszutek Wilk,
	Benjamin Herrenschmidt, Linux Virtualization,
	Christian Borntraeger, Paolo Bonzini, linux390@de.ibm.com

On Tue, Sep 30, 2014 at 08:48:45AM -0700, Andy Lutomirski wrote:
> On Tue, Sep 30, 2014 at 8:38 AM, Michael S. Tsirkin <mst@redhat.com> wrote:
> > I thought hard about this, I think we are better off waiting till the
> > next release: there's a chance QEMU will have IOMMU support for KVM x86
> > then, and this will make it easier to judge which way does the wind
> > blow.
> >
> > It seems that we lose nothing substantial keeping the status quo a bit longer,
> > but if we make an incompatible change in guests now we might
> > create nasty compatibility headaches going forward.
> >
> 
> I would argue for the opposite approach.  Having a QEMU release that
> supports an IOMMU on x86 and exposes a commonly used PCI device that
> bypasses that IOMMU without any explicit notification to the guest
> (and specification!) that this is happening is IMO insane.  Once that
> happens, we'll have to address the nasty case on both x86 and PPC.
> This will suck.
>
> If we accept the guest change and make sure that there is never a QEMU
> release that has a visible IOMMU cheat on any arch other than PPC,
> then at least the damage will be contained.

Wrt QEMU this sounds reasonable. Wrt the guest, deferring guest changes
a bit until we have a better idea about how the host side behaves
sounds better to me than saying "this is how guests will behave,
let the host adapt to that".

> x86 will be worse than PPC, too: the special case needed to support
> QEMU 2.2 with IOMMU and virtio enabled with a Xen guest will be fairly
> large and disgusting and will only exist to support something that IMO
> should never have existed in the first place.
> 
> PPC at least avoids *that* problem by virtue of not having Xen
> paravirt.  (And please don't add Xen paravirt to PPC -- x86 is trying
> to kill it off, but this is a 5-10 year project.)
> 
> [..., reordered]
> 
> >>
> >> Except that I think that PPC is the only platform on which QEMU's code
> >> actually bypasses any IOMMU.  Unless we've all missed something, there
> >> is no QEMU release that will put a virtio device behind an IOMMU on
> >> any platform other than PPC.
> >
> > I think that is true but it seems that this will be true for x86 for
> > QEMU 2.2 unless we make some changes there.
> > Which we might not have the time for since 2.2 is feature frozen
> > from tomorrow.
> > Maybe we should disable the IOMMU in 2.2, this is worth considering.
> >
> 
> Please do.

Or at least disable it only if there are virtio devices.

> Also, try booting this 2.2 QEMU candidate with nested virtualization
> on.  Then bind vfio to a virtio-pci device and watch the guest get
> corrupted.  QEMU will blame Linux for incorrectly programming the
> hardware, and Linux will blame QEMU for its blatant violation of the
> ACPI spec.  Given that this is presumably most of the point of adding
> IOMMU support, it seems like a terrible idea to let code like that
> into the wild.
> 
> If this happens, Linux may also end up needing a quirk to prevent vfio
> from binding to QEMU 2.2's virtio-pci devices.
> 
> --Andy

This specific item wouldn't worry me too much.

-- 
MST

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH v5 2/3] virtio_pci: Use the DMA API for virtqueues when possible
  2014-09-30 15:53               ` Paolo Bonzini
@ 2014-10-01  7:36                 ` Michael S. Tsirkin
  0 siblings, 0 replies; 42+ messages in thread
From: Michael S. Tsirkin @ 2014-10-01  7:36 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: linux-s390@vger.kernel.org, Konrad Rzeszutek Wilk,
	Benjamin Herrenschmidt, Andy Lutomirski, Christian Borntraeger,
	linux390@de.ibm.com, Linux Virtualization

On Tue, Sep 30, 2014 at 05:53:37PM +0200, Paolo Bonzini wrote:
> Il 30/09/2014 17:38, Michael S. Tsirkin ha scritto:
> > I think that is true but it seems that this will be true for x86 for
> > QEMU 2.2 unless we make some changes there.
> > Which we might not have the time for since 2.2 is feature frozen
> > from tomorrow.
> > Maybe we should disable the IOMMU in 2.2, this is worth considering.
> 
> It is disabled by default, no?  And only supported by Q35 which does not
> yet have migration compatibility (though we are versioning it, not sure
> why that is the case).
> 
> Paolo

I'm not sure what it has to do with migration.
It's disabled by default, of course, but if it's known not to
work correctly with virtio, we are better off disabling
virtio when the iommu is enabled or hiding the feature
from users altogether until we figure it out.

-- 
MST

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH v5 2/3] virtio_pci: Use the DMA API for virtqueues when possible
  2014-09-30 18:01                   ` Andy Lutomirski
@ 2014-10-02 16:36                     ` Konrad Rzeszutek Wilk
  0 siblings, 0 replies; 42+ messages in thread
From: Konrad Rzeszutek Wilk @ 2014-10-02 16:36 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: linux-s390@vger.kernel.org, Michael S. Tsirkin,
	Benjamin Herrenschmidt, Linux Virtualization,
	Christian Borntraeger, Paolo Bonzini, linux390@de.ibm.com

On Tue, Sep 30, 2014 at 11:01:29AM -0700, Andy Lutomirski wrote:
> On Tue, Sep 30, 2014 at 10:53 AM, Konrad Rzeszutek Wilk
> <konrad.wilk@oracle.com> wrote:
> >> x86 will be worse than PPC, too: the special case needed to support
> >> QEMU 2.2 with IOMMU and virtio enabled with a Xen guest will be fairly
> >> large and disgusting and will only exist to support something that IMO
> >> should never have existed in the first place.
> >
> > <scratches his head> I don't follow.
> 
> If you boot a Xen PV dom0 on QEMU master with -machine q35,iommu=on
> and you add a virtio device, dom0 will end up with a PCI device that
> does DMA to "machine" addresses.  These addresses are not compatible
> with the DMA API (which works with bus addresses), nor are they the
> same as physical addresses.

That is presumably because the IOMMU assumes the virtio devices are real
devices, not fake ones.
> 
> So virtio in current kernels won't work for the same reason they never
> work on Xen.  But virtio-pci with my patches won't work either,
> because they (or the Xen hypervisor) will try to program the IOMMU
> with a non-identity mapping, causing everything to explode.
> 
> Hacking up the virtio-pci driver to explicitly ask Xen for machine
> addresses might work, but, at the very least, it will be a giant
> security hole if anyone binds a virtio device to a domain other than
> dom0 (which, again, is kind of the point of having an IOMMU).
> 
> >>
> >> PPC at least avoids *that* problem by virtue of not having Xen
> >> paravirt.  (And please don't add Xen paravirt to PPC -- x86 is trying
> >> to kill it off, but this is a 5-10 year project.)
> >
> > Correction:
> >  - The Xen project is trying to kill some of the paravirts off.
> >  - KVM uses paravirts as well (and then added some)
> 
> By "paravirt" I meant PV, where there's the weird physical/machine
> address discrepancy that's visible to the guest.  This is not to say
> that Xen PVH wouldn't also be screwed running on QEMU master.
> 
> --Andy

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH v5 2/3] virtio_pci: Use the DMA API for virtqueues when possible
  2014-09-29 18:55       ` Andy Lutomirski
  2014-09-29 20:49         ` Benjamin Herrenschmidt
@ 2014-10-06  9:59         ` Christian Borntraeger
  2014-10-06 10:48           ` Benjamin Herrenschmidt
  1 sibling, 1 reply; 42+ messages in thread
From: Christian Borntraeger @ 2014-10-06  9:59 UTC (permalink / raw)
  To: Andy Lutomirski, Michael S. Tsirkin
  Cc: linux-s390@vger.kernel.org, Konrad Rzeszutek Wilk,
	Benjamin Herrenschmidt, Linux Virtualization, Paolo Bonzini,
	linux390@de.ibm.com

Am 29.09.2014 20:55, schrieb Andy Lutomirski:
> On Wed, Sep 17, 2014 at 7:16 AM, Michael S. Tsirkin <mst@redhat.com> wrote:
>> On Wed, Sep 17, 2014 at 08:02:31AM -0400, Benjamin Herrenschmidt wrote:
>>> On Tue, 2014-09-16 at 22:22 -0700, Andy Lutomirski wrote:
>>>> On non-PPC systems, virtio_pci should use the DMA API.  This fixes
>>>> virtio_pci on Xen.  On PPC, using the DMA API would break things, so
>>>> we need to preserve the old behavior.
>>>>
>>>> The big comment in this patch explains the considerations in more
>>>> detail.
>>>
>>> I still disagree with using CONFIG_PPC as a trigger here.
>>>
>>> Fundamentally, the qemu implementation today bypasses IOMMUs on all
>>> platforms as far as I can tell.
>>>
>>> If that changes, we'll have a backward compatibility problem.
>>>
>>> The virtio device should advertise whether it's using that bypass
>>> mode of operation and virtio_pci should react accordingly.
>>
>> Well if there's a way to detect that - that's outside the
>> device, then we probably shouldn't use device-specific
>> interfaces for this capability.
>>
>>
>>> There is a demand for being able to operate on top of an IOMMU on
>>> powerpc as well for some embedded stuff using PCI as a transport so your
>>> patch precludes that.
>>>
>>> Cheers,
>>> Ben.
>>
>> As far as I can see, nothing changes on PPC so this patch
>> does not preclude anything that was working?
> 
> Rusty and Michael, what's the status of this?
> 
> I think that (aside from the trivial DMI/DMA typo) the only real issue
> here is that the situation on PPC is ugly.  We're failing to enable
> physical virtio hardware on PPC with these patches, but that never
> worked anyway.  I don't think that there are any regressions other
> than ugliness.
> 
> My preference would be to apply the patches as is (or with "DMA"
> spelled correctly), and then to:
> 
>  - Make sure that all virtio-mmio systems have working DMA ops so that
> virtio-mmio can the DMA API
> 
>  - Fix the DMA API on s390 (probably easy) and on PPC (not necessarily so easy)

Just as a comment: On s390 we always considered the memory access as access to real memory (not device memory) for virtio accesses. I prefer not to touch the DMA API on s390, as it is quite s390-PCI specific and would be somewhat strange for CCW devices.

We would basically have two kinds of DMA mappings on s390 then, one iommu-like for the PCI devices and one that maps DMA memory 1:1 to real memory. We would also need some buy-in from the s390 arch maintainers then.


> 
>  - Remove the non-DMA-API code, which would be a very small change on
> top of these patches.
> 
> --Andy
> --
> To unsubscribe from this list: send the line "unsubscribe linux-s390" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> 

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH v5 2/3] virtio_pci: Use the DMA API for virtqueues when possible
  2014-10-06  9:59         ` Christian Borntraeger
@ 2014-10-06 10:48           ` Benjamin Herrenschmidt
  0 siblings, 0 replies; 42+ messages in thread
From: Benjamin Herrenschmidt @ 2014-10-06 10:48 UTC (permalink / raw)
  To: Christian Borntraeger
  Cc: linux-s390@vger.kernel.org, Konrad Rzeszutek Wilk,
	Michael S. Tsirkin, Andy Lutomirski, Paolo Bonzini,
	linux390@de.ibm.com, Linux Virtualization

On Mon, 2014-10-06 at 11:59 +0200, Christian Borntraeger wrote:
> Just as a comment: On s390 we always considered the memory access as
> access to real memory (not device memory) for virtio accesses. I
> prefer to not touch the DMA API on s390 as it is quite s390-PCI
> specific but it is somewhat strange for CCW devices.
> 
> We would basically have two kinds of DMA mappings on s390 then, one
> iommu like for the PCI devices and one that maps DMA memory 1:1 to
> real memory. We would also need some buy-in from the s390 arch
> maintainers then.

It's pretty standard to have the DMA API go via dma-ops hanging off
struct device; you can then easily make them do the right thing for the
type of device... provided we have a way to identify "bypass everything"
virtio vs. virtio going through normal PCI translation.
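
For illustration, a minimal sketch of 1:1 "bypass" dma_map_ops that an
arch could hang off such a device (assuming the current dma_map_ops
signatures; error handling and the remaining ops are omitted):

    #include <linux/device.h>
    #include <linux/dma-mapping.h>
    #include <linux/io.h>

    /* 1:1 mapping: the bus address is simply the physical address. */
    static dma_addr_t bypass_map_page(struct device *dev, struct page *page,
                                      unsigned long offset, size_t size,
                                      enum dma_data_direction dir,
                                      struct dma_attrs *attrs)
    {
        return page_to_phys(page) + offset;
    }

    static void bypass_unmap_page(struct device *dev, dma_addr_t addr,
                                  size_t size, enum dma_data_direction dir,
                                  struct dma_attrs *attrs)
    {
        /* nothing to undo for a 1:1 mapping */
    }

    static struct dma_map_ops virtio_bypass_dma_ops = {
        .map_page   = bypass_map_page,
        .unmap_page = bypass_unmap_page,
        /* .map_sg, .unmap_sg, .alloc, ... filled in similarly */
    };

On powerpc something like set_dma_ops(dev, &virtio_bypass_dma_ops) at
probe time could install it, while PCI devices behind a real or
emulated IOMMU keep their normal translating ops, provided (as above)
we can identify which devices really are in bypass mode.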

Cheers,
Ben.

^ permalink raw reply	[flat|nested] 42+ messages in thread

end of thread, other threads:[~2014-10-06 10:48 UTC | newest]

Thread overview: 42+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2014-09-17  5:22 [PATCH v5 0/3] virtio: Use the DMA API when appropriate Andy Lutomirski
2014-09-17  5:22 ` [PATCH v5 1/3] virtio_ring: Support DMA APIs if requested Andy Lutomirski
2014-09-17  5:22 ` [PATCH v5 2/3] virtio_pci: Use the DMA API for virtqueues when possible Andy Lutomirski
2014-09-17 12:02   ` Benjamin Herrenschmidt
2014-09-17 14:16     ` Michael S. Tsirkin
2014-09-17 16:07       ` Andy Lutomirski
2014-09-17 16:49         ` David Woodhouse
2014-09-19 21:28           ` Benjamin Herrenschmidt
2014-09-19 21:33         ` Benjamin Herrenschmidt
2014-09-20  5:59           ` Andy Lutomirski
2014-09-21  5:03             ` Benjamin Herrenschmidt
2014-09-21  5:05               ` Benjamin Herrenschmidt
2014-09-21  5:48                 ` Andy Lutomirski
2014-09-21  6:01                   ` David Woodhouse
2014-09-24 21:41                 ` Andy Lutomirski
2014-09-24 21:50                   ` Benjamin Herrenschmidt
2014-09-24 21:59                     ` Andy Lutomirski
2014-09-24 22:04                       ` Benjamin Herrenschmidt
2014-09-24 22:15                         ` Andy Lutomirski
2014-09-24 22:38                           ` Benjamin Herrenschmidt
2014-09-24 22:49                             ` Andy Lutomirski
2014-09-19 21:31       ` Benjamin Herrenschmidt
2014-09-29 18:55       ` Andy Lutomirski
2014-09-29 20:49         ` Benjamin Herrenschmidt
2014-09-29 20:55           ` Andy Lutomirski
2014-09-29 21:06             ` Benjamin Herrenschmidt
2014-09-30 15:38             ` Michael S. Tsirkin
2014-09-30 15:48               ` Andy Lutomirski
2014-09-30 16:19                 ` Andy Lutomirski
2014-09-30 17:53                 ` Konrad Rzeszutek Wilk
2014-09-30 18:01                   ` Andy Lutomirski
2014-10-02 16:36                     ` Konrad Rzeszutek Wilk
2014-10-01  6:42                 ` Michael S. Tsirkin
2014-09-30 15:53               ` Paolo Bonzini
2014-10-01  7:36                 ` Michael S. Tsirkin
2014-09-30 20:05               ` Andy Lutomirski
2014-10-06  9:59         ` Christian Borntraeger
2014-10-06 10:48           ` Benjamin Herrenschmidt
2014-09-17 16:09   ` Ira W. Snyder
2014-09-17 16:15     ` Andy Lutomirski
2014-09-17  5:22 ` [PATCH v5 3/3] virtio_net: Stop doing DMA from the stack Andy Lutomirski
2014-09-19 18:25 ` [PATCH v5 0/3] virtio: Use the DMA API when appropriate Konrad Rzeszutek Wilk

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).