[PATCH 0/0][RFC] KVM use of vringfd

public inbox for kvm@vger.kernel.org
 help / color / mirror / Atom feed

* [PATCH 0/0][RFC] KVM use of vringfd
@ 2008-06-13 13:57 Mark McLoughlin
  2008-06-13 13:57 ` [PATCH 1/5] vring: Replace mmap() interface with ioctl() Mark McLoughlin
  0 siblings, 1 reply; 23+ messages in thread
From: Mark McLoughlin @ 2008-06-13 13:57 UTC (permalink / raw)
  To: Anthony Liguori, Avi Kivity, Rusty Russell; +Cc: kvm

Hey,
        Since this came up at the KVM forum, I thought I'd post these
patches here for discussion.

        The first issue with making KVM use vringfd is that the mmap()
interface doesn't work for KVM since it is the guest that allocates
the ring descriptor, so it doesn't make much sense for the host to
remap those pages.

        The second issue is a little more tricky; QEMU's networking
implementation is based around the notion of connecting a number
of interfaces to a VLAN. So, in less common cases where you don't
just have a tap interface and a virtio interface connected to the
VLAN, you can't take just lguest's simple approach of sharing the
same vrings between the tap driver in the host and the virtio driver
in the guest and having the host userspace notify each side of buffers
as they come and go. I know Anthony has ideas here, so I'll just let
him outline them.

Cheers,
Mark.



^ permalink raw reply	[flat|nested] 23+ messages in thread

* [PATCH 1/5] vring: Replace mmap() interface with ioctl()
  2008-06-13 13:57 [PATCH 0/0][RFC] KVM use of vringfd Mark McLoughlin
@ 2008-06-13 13:57 ` Mark McLoughlin
  2008-06-13 13:57   ` [PATCH 2/5] lguest: Use VRINGSETINFO ioctl() instead of mmap() Mark McLoughlin
                     ` (2 more replies)
  0 siblings, 3 replies; 23+ messages in thread
From: Mark McLoughlin @ 2008-06-13 13:57 UTC (permalink / raw)
  To: Anthony Liguori, Avi Kivity, Rusty Russell; +Cc: kvm, Mark McLoughlin

/dev/vring's mmap() interface is a strange creature. It
serves as a way for userland to supply the address of the
already allocated ring descriptors, but causes those pages
to be re-maped as a natural side effect of the mmap()

This is not an issue for lguest because it does the mmap()
before even starting the guest. However, in the case of kvm,
the guest allocates the ring and informs the host of its
addresss. If we then mmap() it, we cause it to be remapped
to new pages which the vring driver will then use.

Now, KVM guests don't actually use the ring pages before
informing the host of its address, so we could probably just
invalidate the guest's shadow page table and have the new
pfns picked up. That would be an odd requirement to impose
on the guest ABI, though.

Since the mmap() semantics are so strange, switch to using a
single ioctl() for setting up the ring.

(Against misc:dev_vring.patch and misc:ringfd-base-limit.patch)

Signed-off-by: Mark McLoughlin <markmc@redhat.com>
---
 drivers/char/vring.c  |   61 +++++++++++++++++++++++-------------------------
 include/linux/vring.h |   10 ++++++-
 2 files changed, 37 insertions(+), 34 deletions(-)

diff --git a/drivers/char/vring.c b/drivers/char/vring.c
index 0b9bdf5..a66b890 100644
--- a/drivers/char/vring.c
+++ b/drivers/char/vring.c
@@ -22,6 +22,7 @@
 #include <linux/mutex.h>
 #include <linux/wait.h>
 #include <linux/fs.h>
+#include <linux/compat.h>
 #include <linux/poll.h>
 #include <linux/module.h>
 #include <linux/miscdevice.h>
@@ -126,22 +127,21 @@ static int vring_release(struct inode *inode, struct file *filp)
 	return 0;
 }
 
-static int vring_mmap(struct file *filp, struct vm_area_struct *vma)
+static long vring_ioctl(struct file *filp, unsigned int cmd, unsigned long arg)
 {
-	unsigned long size, num_descs;
 	struct vring_info *vr = filp->private_data;
+	void __user *argp = (void __user *)arg;
+	struct vring_ioctl_info info;
+	unsigned long descs;
 	int err;
 
-	/* We overload mmap's offset to hold the ring number. */
-	num_descs = vma->vm_pgoff;
+	if (cmd != VRINGSETINFO)
+		return -ENOTTY;
 
-	/* Must be a power of two, and limit indices to a u16. */
-	if (!is_power_of_2(num_descs) || num_descs > 65536)
-		return -EINVAL;
+	if (copy_from_user(&info, argp, sizeof(info)))
+		return -EFAULT;
 
-	/* mmap size must be what we expect for such a ring. */
-	size = vma->vm_end - vma->vm_start;
-	if (size != ALIGN(vring_size(num_descs, PAGE_SIZE), PAGE_SIZE))
+	if (!is_power_of_2(info.num_descs))
 		return -EINVAL;
 
 	/* We only let them map this in one place. */
@@ -151,9 +151,14 @@ static int vring_mmap(struct file *filp, struct vm_area_struct *vma)
 		goto unlock;
 	}
 
-	vring_init(&vr->ring, num_descs, (void *)vma->vm_start, PAGE_SIZE);
+	descs = info.descs;
+	vring_init(&vr->ring, info.num_descs, (void *)descs, PAGE_SIZE);
 
-	vr->mask = num_descs - 1;
+	vr->mask = info.num_descs - 1;
+	vr->base = info.base;
+	vr->limit = info.limit;
+	if (vr->limit == 0)
+		vr->limit = -1UL;
 	err = 0;
 
 unlock:
@@ -161,6 +166,16 @@ unlock:
 	return err;
 }
 
+#ifdef CONFIG_COMPAT
+static long vring_compat_ioctl(struct file *filp, unsigned int cmd,
+			       unsigned long arg)
+{
+	return vring_ioctl(filp, cmd, (unsigned long)compat_ptr(arg));
+}
+#else
+#define vring_compat_ioctl NULL
+#endif
+
 static int vring_open(struct inode *in, struct file *filp)
 {
 	struct vring_info *vr;
@@ -176,32 +191,14 @@ static int vring_open(struct inode *in, struct file *filp)
 	return 0;
 }
 
-static int vring_ioctl(struct inode *in, struct file *filp,
-		       unsigned int cmd, unsigned long arg)
-{
-	struct vring_info *vr = filp->private_data;
-
-	switch (cmd) {
-	case VRINGSETBASE:
-		vr->base = arg;
-		break;
-	case VRINGSETLIMIT:
-		vr->limit = arg;
-		break;
-	default:
-		return -ENOTTY;
-	}
-	return 0;
-}
-
 static const struct file_operations vring_fops = {
 	.open		= vring_open,
 	.release	= vring_release,
-	.mmap		= vring_mmap,
 	.read		= vring_read,
 	.write		= vring_write,
 	.poll		= vring_poll,
-	.ioctl		= vring_ioctl,
+	.unlocked_ioctl	= vring_ioctl,
+	.compat_ioctl	= vring_compat_ioctl,
 };
 
 /**
diff --git a/include/linux/vring.h b/include/linux/vring.h
index 47c8848..de4125d 100644
--- a/include/linux/vring.h
+++ b/include/linux/vring.h
@@ -21,8 +21,14 @@
 #include <linux/types.h>
 
 /* Ioctl defines. */
-#define VRINGSETBASE	_IO(0xAD, 0)
-#define VRINGSETLIMIT	_IO(0xAD, 1)
+#define VRINGSETINFO _IO(0xAD, 0)
+
+struct vring_ioctl_info {
+	__u16 num_descs;
+	__u64 descs;
+	__u64 base;
+	__u64 limit;
+};
 
 #ifdef __KERNEL__
 
-- 
1.5.4.1


^ permalink raw reply related	[flat|nested] 23+ messages in thread

* [PATCH 2/5] lguest: Use VRINGSETINFO ioctl() instead of mmap()
  2008-06-13 13:57 ` [PATCH 1/5] vring: Replace mmap() interface with ioctl() Mark McLoughlin
@ 2008-06-13 13:57   ` Mark McLoughlin
  2008-06-13 13:57     ` [PATCH 3/5] kvm: qemu: Publish last_avail index in the ring Mark McLoughlin
  2008-06-13 14:09   ` [PATCH 1/5] vring: Replace mmap() interface with ioctl() Avi Kivity
  2008-06-14  9:02   ` Rusty Russell
  2 siblings, 1 reply; 23+ messages in thread
From: Mark McLoughlin @ 2008-06-13 13:57 UTC (permalink / raw)
  To: Anthony Liguori, Avi Kivity, Rusty Russell; +Cc: kvm, Mark McLoughlin

(Against lguest:use-tun-ringfd.patch)

Signed-off-by: Mark McLoughlin <markmc@redhat.com>
---
 Documentation/lguest/lguest.c |   24 +++++++++---------------
 1 files changed, 9 insertions(+), 15 deletions(-)

diff --git a/Documentation/lguest/lguest.c b/Documentation/lguest/lguest.c
index 630e159..d37fa2d 100644
--- a/Documentation/lguest/lguest.c
+++ b/Documentation/lguest/lguest.c
@@ -1373,22 +1373,16 @@ static bool recvfd_used(int fd, struct device *dev)
 static int map_vring(struct vring *vr)
 {
 	int fd = open_or_die("/dev/vring", O_RDWR);
+	struct vring_ioctl_info info;
 
-	/* Map the rings over where they belong in Guest. */
-	if (mmap(vr->desc, page_align(vring_size(vr->num, getpagesize())),
-		 PROT_READ|PROT_WRITE, MAP_FIXED|MAP_SHARED, fd,
-		 vr->num * getpagesize()) != vr->desc)
-		err(1, "mmaping /dev/vring");
-
-	/* This is subtle and nasty.  If we lazily map this, Waker may
-	 * see different pages when we touch it, and hence it will get
-	 * a different result for poll(). */
-	memset(vr->desc, 0, vring_size(vr->num, getpagesize()));
-
-	/* Set offset & limit. */
-	if (ioctl(fd, VRINGSETBASE, guest_base) != 0
-	    || ioctl(fd, VRINGSETLIMIT, guest_limit) != 0)
-		err(1, "Setting vring offset and limit");
+	info.num_descs = vr->num;
+	info.descs = (unsigned long)vr->desc;
+	info.base = (unsigned long)guest_base;
+	info.limit = guest_limit;
+
+	/* Tell /dev/vring where the ring is */
+	if (ioctl(fd, VRINGSETINFO, &info) != 0)
+		err(1, "Setting up /dev/vring fd");
 
 	return fd;
 }
-- 
1.5.4.1


^ permalink raw reply related	[flat|nested] 23+ messages in thread

* [PATCH 3/5] kvm: qemu: Publish last_avail index in the ring
  2008-06-13 13:57   ` [PATCH 2/5] lguest: Use VRINGSETINFO ioctl() instead of mmap() Mark McLoughlin
@ 2008-06-13 13:57     ` Mark McLoughlin
  2008-06-13 13:58       ` [PATCH 4/5] kvm: qemu: Use vringfd to eliminate copies Mark McLoughlin
  0 siblings, 1 reply; 23+ messages in thread
From: Mark McLoughlin @ 2008-06-13 13:57 UTC (permalink / raw)
  To: Anthony Liguori, Avi Kivity, Rusty Russell; +Cc: kvm, Mark McLoughlin

Set the VIRTIO_RING_F_PUBLISH_INDICES feature and publish
the last_avail index within the ring itself.

This is important for save/restore when using vringfd
because the kernel is the one tracking last_avail, but
we need to be able to peek at that state.

Signed-off-by: Mark McLoughlin <markmc@redhat.com>
---
 qemu/hw/virtio-net.c |    4 ++--
 qemu/hw/virtio.c     |   16 ++++++++--------
 qemu/hw/virtio.h     |    7 ++++++-
 3 files changed, 16 insertions(+), 11 deletions(-)

diff --git a/qemu/hw/virtio-net.c b/qemu/hw/virtio-net.c
index a61fdb1..6c42bf0 100644
--- a/qemu/hw/virtio-net.c
+++ b/qemu/hw/virtio-net.c
@@ -106,7 +106,7 @@ static int virtio_net_can_receive(void *opaque)
 	!(n->vdev.status & VIRTIO_CONFIG_S_DRIVER_OK))
 	return 0;
 
-    if (n->rx_vq->vring.avail->idx == n->rx_vq->last_avail_idx)
+    if (n->rx_vq->vring.avail->idx == vq_last_avail(n->rx_vq))
 	return 0;
 
     return 1;
@@ -178,7 +178,7 @@ static void virtio_net_handle_tx(VirtIODevice *vdev, VirtQueue *vq)
     VirtIONet *n = to_virtio_net(vdev);
 
     if (n->tx_timer_active &&
-	(vq->vring.avail->idx - vq->last_avail_idx) == 64) {
+	(vq->vring.avail->idx - vq_last_avail(vq)) == 64) {
 	vq->vring.used->flags &= ~VRING_USED_F_NO_NOTIFY;
 	qemu_del_timer(n->tx_timer);
 	n->tx_timer_active = 0;
diff --git a/qemu/hw/virtio.c b/qemu/hw/virtio.c
index 3119ea9..a1ee93f 100644
--- a/qemu/hw/virtio.c
+++ b/qemu/hw/virtio.c
@@ -147,17 +147,17 @@ int virtqueue_pop(VirtQueue *vq, VirtQueueElement *elem)
     unsigned int position;
 
     /* Check it isn't doing very strange things with descriptor numbers. */
-    if ((uint16_t)(vq->vring.avail->idx - vq->last_avail_idx) > vq->vring.num)
+    if ((uint16_t)(vq->vring.avail->idx - vq_last_avail(vq)) > vq->vring.num)
 	errx(1, "Guest moved used index from %u to %u",
-	     vq->last_avail_idx, vq->vring.avail->idx);
+	     vq_last_avail(vq), vq->vring.avail->idx);
 
     /* If there's nothing new since last we looked, return invalid. */
-    if (vq->vring.avail->idx == vq->last_avail_idx)
+    if (vq->vring.avail->idx == vq_last_avail(vq))
 	return 0;
 
     /* Grab the next descriptor number they're advertising, and increment
      * the index we've seen. */
-    head = vq->vring.avail->ring[vq->last_avail_idx++ % vq->vring.num];
+    head = vq->vring.avail->ring[vq_last_avail(vq)++ % vq->vring.num];
 
     /* If their number is silly, that's a fatal mistake. */
     if (head >= vq->vring.num)
@@ -222,7 +222,6 @@ void virtio_reset(void *opaque)
         vdev->vq[i].vring.desc = NULL;
         vdev->vq[i].vring.avail = NULL;
         vdev->vq[i].vring.used = NULL;
-        vdev->vq[i].last_avail_idx = 0;
         vdev->vq[i].pfn = 0;
     }
 }
@@ -278,6 +277,7 @@ static uint32_t virtio_ioport_read(void *opaque, uint32_t addr)
     case VIRTIO_PCI_HOST_FEATURES:
 	ret = vdev->get_features(vdev);
 	ret |= (1 << VIRTIO_F_NOTIFY_ON_EMPTY);
+	ret |= (1 << VIRTIO_RING_F_PUBLISH_INDICES);
 	break;
     case VIRTIO_PCI_GUEST_FEATURES:
 	ret = vdev->features;
@@ -434,7 +434,7 @@ VirtQueue *virtio_add_queue(VirtIODevice *vdev, int queue_size,
 void virtio_notify(VirtIODevice *vdev, VirtQueue *vq)
 {
     /* Always notify when queue is empty */
-    if ((vq->inuse || vq->vring.avail->idx != vq->last_avail_idx)) &&
+    if ((vq->inuse || vq->vring.avail->idx != vq_last_avail(vq)) &&
 	(vq->vring.avail->flags & VRING_AVAIL_F_NO_INTERRUPT))
 	return;
 
@@ -469,7 +469,7 @@ void virtio_save(VirtIODevice *vdev, QEMUFile *f)
 
 	qemu_put_be32(f, vdev->vq[i].vring.num);
 	qemu_put_be32s(f, &vdev->vq[i].pfn);
-	qemu_put_be16s(f, &vdev->vq[i].last_avail_idx);
+	qemu_put_be16(f, vq_last_avail(&vdev->vq[i]));
     }
 }
 
@@ -492,7 +492,7 @@ void virtio_load(VirtIODevice *vdev, QEMUFile *f)
     for (i = 0; i < num; i++) {
 	vdev->vq[i].vring.num = qemu_get_be32(f);
 	qemu_get_be32s(f, &vdev->vq[i].pfn);
-	qemu_get_be16s(f, &vdev->vq[i].last_avail_idx);
+	vq_last_avail(&vdev->vq[i]) = qemu_get_be16(f);
 
 	if (vdev->vq[i].pfn) {
 	    size_t size;
diff --git a/qemu/hw/virtio.h b/qemu/hw/virtio.h
index 1adaed3..142ecbd 100644
--- a/qemu/hw/virtio.h
+++ b/qemu/hw/virtio.h
@@ -46,6 +46,9 @@
 /* This means don't interrupt guest when buffer consumed. */
 #define VRING_AVAIL_F_NO_INTERRUPT	1
 
+/* We publish the last-seen available index at the end of the used ring */
+#define VIRTIO_RING_F_PUBLISH_INDICES	28
+
 typedef struct VirtQueue VirtQueue;
 typedef struct VirtIODevice VirtIODevice;
 
@@ -89,11 +92,13 @@ struct VirtQueue
 {
     VRing vring;
     uint32_t pfn;
-    uint16_t last_avail_idx;
     int inuse;
     void (*handle_output)(VirtIODevice *vdev, VirtQueue *vq);
 };
 
+/* We publish the last-seen available index at the end of the used ring */
+#define vq_last_avail(q) (*(uint16_t *)&(q)->vring.used->ring[(q)->vring.num])
+
 #define VIRTQUEUE_MAX_SIZE 1024
 
 typedef struct VirtQueueElement
-- 
1.5.4.1


^ permalink raw reply related	[flat|nested] 23+ messages in thread

* [PATCH 4/5] kvm: qemu: Use vringfd to eliminate copies
  2008-06-13 13:57     ` [PATCH 3/5] kvm: qemu: Publish last_avail index in the ring Mark McLoughlin
@ 2008-06-13 13:58       ` Mark McLoughlin
  2008-06-13 13:58         ` [PATCH 5/5] kvm: qemu: Add support for partial csums and GSO Mark McLoughlin
  2008-06-14 23:28         ` [PATCH 4/5] kvm: qemu: Use vringfd to eliminate copies Anthony Liguori
  0 siblings, 2 replies; 23+ messages in thread
From: Mark McLoughlin @ 2008-06-13 13:58 UTC (permalink / raw)
  To: Anthony Liguori, Avi Kivity, Rusty Russell; +Cc: kvm, Mark McLoughlin

Where a VLAN consists only of a single tap interface on the
host side and a single virtio interface on the guest side,
we can use the tun/tap driver's vringfd support to eliminate
copies.

We set up a vringfd for both rings and pass them to the
tun/tap driver. When the guest informs us of the location of
the ring pages, we use the VRINGSETINFO ioctl() to inform
the /dev/vring driver.

On xmit, we write() to the vringfd to notify tun/tap of new
buffers. Once the buffers have been sent, the vringfd
becomes readable and we read() from it, causing the buffers
to be placed back on the used ring. We can then notify the
guest that that we're done with them.

On the recv side, the guest makes buffers available via the
ring and we merely poll() the vringfd for notification of
tun/tap having buffers available. When the vringfd becomes
readable, we read() from it to make tun/tap copy its buffers
into the guest buffers and, finally, we notify the guest.

Signed-off-by: Mark McLoughlin <markmc@redhat.com>
---
 qemu/hw/virtio-net.c |   78 ++++++++++++++++++++++++++++++++++++++++++++++++++
 qemu/hw/virtio.c     |   58 +++++++++++++++++++++++++++++++++++++
 qemu/hw/virtio.h     |    4 ++
 qemu/net.h           |    4 ++
 qemu/vl.c            |   32 ++++++++++++++++++++
 5 files changed, 176 insertions(+), 0 deletions(-)

diff --git a/qemu/hw/virtio-net.c b/qemu/hw/virtio-net.c
index 6c42bf0..f3ff356 100644
--- a/qemu/hw/virtio-net.c
+++ b/qemu/hw/virtio-net.c
@@ -15,6 +15,7 @@
 #include "net.h"
 #include "qemu-timer.h"
 #include "qemu-kvm.h"
+#include "qemu-char.h"
 
 /* from Linux's virtio_net.h */
 
@@ -156,6 +157,12 @@ static void virtio_net_flush_tx(VirtIONet *n, VirtQueue *vq)
     if (!(n->vdev.status & VIRTIO_CONFIG_S_DRIVER_OK))
         return;
 
+    if (vq->vringfd > 0) {
+	if (write(vq->vringfd, "", 0) != 0)
+	    fprintf(stderr, "Failed to notify host kernel of xmit buffers\n");
+	return;
+    }
+
     while (virtqueue_pop(vq, &elem)) {
 	ssize_t len = 0;
 
@@ -205,6 +212,65 @@ static void virtio_net_tx_timer(void *opaque)
     virtio_net_flush_tx(n, n->tx_vq);
 }
 
+static void virtio_net_rx_used(void *opaque)
+{
+    VirtIONet *n = opaque;
+
+    if (read(n->rx_vq->vringfd, NULL, 0) != 0) {
+	fprintf(stderr, "Failed to pull used rx buffers\n");
+	return;
+    }
+
+    virtio_notify(&n->vdev, n->rx_vq);
+}
+
+static void virtio_net_tx_used(void *opaque)
+{
+    VirtIONet *n = opaque;
+
+    if (read(n->tx_vq->vringfd, NULL, 0) != 0) {
+	fprintf(stderr, "Failed to pull used tx buffers\n");
+	return;
+    }
+
+    virtio_notify(&n->vdev, n->tx_vq);
+}
+
+static void virtio_net_close_vrings(VirtIONet *n)
+{
+    if (n->rx_vq->vringfd > 0) {
+	qemu_set_fd_handler(n->rx_vq->vringfd, NULL, NULL, NULL);
+	virtqueue_close_vringfd(n->rx_vq);
+    }
+    if (n->tx_vq->vringfd > 0) {
+	qemu_set_fd_handler(n->tx_vq->vringfd, NULL, NULL, NULL);
+	virtqueue_close_vringfd(n->tx_vq);
+    }
+}
+
+static void virtio_net_init_vrings(VirtIONet *n)
+{
+    struct VLANState *vlan = n->vc->vlan;
+    VLANClientState *host = vlan->first_client;
+
+    if (vlan->nb_host_devs != 1 || vlan->nb_guest_devs != 1)
+	return;
+
+    if (!host->set_vring_fds)
+	return;
+
+    if (!virtqueue_open_vringfd(n->rx_vq) || !virtqueue_open_vringfd(n->tx_vq)) {
+	virtio_net_close_vrings(n);
+	return;
+    }
+
+    qemu_set_fd_handler(n->rx_vq->vringfd, virtio_net_rx_used, NULL, n);
+    qemu_set_fd_handler(n->tx_vq->vringfd, virtio_net_tx_used, NULL, n);
+
+    if (!host->set_vring_fds(host, n->rx_vq->vringfd, n->tx_vq->vringfd))
+	virtio_net_close_vrings(n);
+}
+
 static void virtio_net_save(QEMUFile *f, void *opaque)
 {
     VirtIONet *n = opaque;
@@ -224,6 +290,16 @@ static int virtio_net_load(QEMUFile *f, void *opaque, int version_id)
 
     virtio_load(&n->vdev, f);
 
+    if (n->rx_vq->vringfd > 0 && n->tx_vq->vringfd > 0) {
+	VLANClientState *host = n->vc->vlan->first_client;
+
+	qemu_set_fd_handler(n->rx_vq->vringfd, virtio_net_rx_used, NULL, n);
+	qemu_set_fd_handler(n->tx_vq->vringfd, virtio_net_tx_used, NULL, n);
+
+	if (!host->set_vring_fds(host, n->rx_vq->vringfd, n->tx_vq->vringfd))
+	    virtio_net_close_vrings(n);
+    }
+
     qemu_get_buffer(f, n->mac, 6);
     n->tx_timer_active = qemu_get_be32(f);
 
@@ -258,6 +334,8 @@ PCIDevice *virtio_net_init(PCIBus *bus, NICInfo *nd, int devfn)
     n->tx_timer = qemu_new_timer(vm_clock, virtio_net_tx_timer, n);
     n->tx_timer_active = 0;
 
+    virtio_net_init_vrings(n);
+
     register_savevm("virtio-net", virtio_net_id++, 1,
 		    virtio_net_save, virtio_net_load, n);
 
diff --git a/qemu/hw/virtio.c b/qemu/hw/virtio.c
index a1ee93f..e9ec1ad 100644
--- a/qemu/hw/virtio.c
+++ b/qemu/hw/virtio.c
@@ -13,6 +13,7 @@
 
 #include <inttypes.h>
 #include <err.h>
+#include <sys/ioctl.h>
 
 #include "virtio.h"
 #include "sysemu.h"
@@ -193,6 +194,56 @@ int virtqueue_pop(VirtQueue *vq, VirtQueueElement *elem)
     return elem->in_num + elem->out_num;
 }
 
+/* vringfd functions */
+
+static int virtqueue_setup_vringfd(VirtQueue *vq)
+{
+    struct vring_ioctl_info info;
+
+    if (vq->vringfd <= 0 || !vq->vring.desc)
+	return 1;
+
+    info.num_descs = vq->vring.num;
+    info.descs	   = (unsigned long)vq->vring.desc;
+    info.base	   = (unsigned long)phys_ram_base;
+    info.limit	   = phys_ram_size;
+
+    if (ioctl(vq->vringfd, VRINGSETINFO, &info) != 0) {
+	fprintf(stderr, "Failed to setup vring: %s\n", strerror(errno));
+	return 0;
+    }
+
+    return 1;
+}
+
+int virtqueue_open_vringfd(VirtQueue *vq)
+{
+    int fd;
+
+    fd = open("/dev/vring", O_RDWR);
+    if (fd < 0) {
+	static int warned = 0;
+	if (!warned)
+	    fprintf(stderr, "Warning: no vringfd support available: "
+		    "Failed to open /dev/vring\n");
+	warned = 1;
+	return 0;
+    }
+
+    vq->vringfd = fd;
+
+    return virtqueue_setup_vringfd(vq);
+}
+
+void virtqueue_close_vringfd(VirtQueue *vq)
+{
+    if (vq->vringfd <= 0)
+	return;
+
+    close(vq->vringfd);
+    vq->vringfd = 0;
+}
+
 /* virtio device */
 
 static VirtIODevice *to_virtio_device(PCIDevice *pci_dev)
@@ -248,6 +299,7 @@ static void virtio_ioport_write(void *opaque, uint32_t addr, uint32_t val)
 	    size_t size = virtqueue_size(vdev->vq[vdev->queue_sel].vring.num);
 	    virtqueue_init(&vdev->vq[vdev->queue_sel],
 			   virtio_map_gpa(pa, size));
+	    virtqueue_setup_vringfd(&vdev->vq[vdev->queue_sel]);
 	}
 	break;
     case VIRTIO_PCI_QUEUE_SEL:
@@ -464,6 +516,8 @@ void virtio_save(VirtIODevice *vdev, QEMUFile *f)
     qemu_put_be32(f, i);
 
     for (i = 0; i < VIRTIO_PCI_QUEUE_MAX; i++) {
+	qemu_put_byte(f, vdev->vq[i].vringfd > 0);
+
 	if (vdev->vq[i].vring.num == 0)
 	    break;
 
@@ -490,6 +544,9 @@ void virtio_load(VirtIODevice *vdev, QEMUFile *f)
     num = qemu_get_be32(f);
 
     for (i = 0; i < num; i++) {
+	if (qemu_get_byte(f))
+	    virtqueue_open_vringfd(&vdev->vq[i]);
+
 	vdev->vq[i].vring.num = qemu_get_be32(f);
 	qemu_get_be32s(f, &vdev->vq[i].pfn);
 	vq_last_avail(&vdev->vq[i]) = qemu_get_be16(f);
@@ -501,6 +558,7 @@ void virtio_load(VirtIODevice *vdev, QEMUFile *f)
 	    pa = (ram_addr_t)vdev->vq[i].pfn << TARGET_PAGE_BITS;
 	    size = virtqueue_size(vdev->vq[i].vring.num);
 	    virtqueue_init(&vdev->vq[i], virtio_map_gpa(pa, size));
+	    virtqueue_setup_vringfd(&vdev->vq[i]);
 	}
     }
 
diff --git a/qemu/hw/virtio.h b/qemu/hw/virtio.h
index 142ecbd..d3cf5fb 100644
--- a/qemu/hw/virtio.h
+++ b/qemu/hw/virtio.h
@@ -93,6 +93,7 @@ struct VirtQueue
     VRing vring;
     uint32_t pfn;
     int inuse;
+    int vringfd;
     void (*handle_output)(VirtIODevice *vdev, VirtQueue *vq);
 };
 
@@ -152,4 +153,7 @@ void virtio_save(VirtIODevice *vdev, QEMUFile *f);
 
 void virtio_load(VirtIODevice *vdev, QEMUFile *f);
 
+int virtqueue_open_vringfd(VirtQueue *vq);
+void virtqueue_close_vringfd(VirtQueue *vq);
+
 #endif
diff --git a/qemu/net.h b/qemu/net.h
index 9dc8b7c..e8ed926 100644
--- a/qemu/net.h
+++ b/qemu/net.h
@@ -9,12 +9,15 @@ typedef ssize_t (IOReadvHandler)(void *, const struct iovec *, int);
 
 typedef struct VLANClientState VLANClientState;
 
+typedef int (SetVringFDs)(VLANClientState *, int, int);
+
 struct VLANClientState {
     IOReadHandler *fd_read;
     IOReadvHandler *fd_readv;
     /* Packets may still be sent if this returns zero.  It's used to
        rate-limit the slirp code.  */
     IOCanRWHandler *fd_can_read;
+    SetVringFDs *set_vring_fds;
     void *opaque;
     struct VLANClientState *next;
     struct VLANState *vlan;
@@ -26,6 +29,7 @@ struct VLANState {
     VLANClientState *first_client;
     struct VLANState *next;
     unsigned int nb_guest_devs, nb_host_devs;
+    unsigned int hotplug_disabled : 1;
 };
 
 VLANState *qemu_find_vlan(int id);
diff --git a/qemu/vl.c b/qemu/vl.c
index f573dce..d043ccd 100644
--- a/qemu/vl.c
+++ b/qemu/vl.c
@@ -4166,6 +4166,31 @@ static void tap_send(void *opaque)
     } while (s->size > 0);
 }
 
+#ifdef TUNSETRECVVRING
+static int tap_set_vring_fds(VLANClientState *vc, int recv, int xmit)
+{
+    TAPState *s = vc->opaque;
+
+    if (ioctl(s->fd, TUNSETRECVVRING, recv) != 0) {
+	fprintf(stderr, "TUNSETRECVVRING ioctl() failed: %s\n",
+		strerror(errno));
+	return 0;
+    }
+
+    if (ioctl(s->fd, TUNSETXMITVRING, xmit) != 0) {
+	fprintf(stderr, "TUNSETXMITVRING ioctl() failed: %s\n",
+		strerror(errno));
+	exit(1);
+    }
+
+    qemu_set_fd_handler(s->fd, NULL, NULL, NULL);
+
+    vc->vlan->hotplug_disabled = 1;
+
+    return 1;
+}
+#endif
+
 /* fd support */
 
 static TAPState *net_tap_fd_init(VLANState *vlan, int fd)
@@ -4178,6 +4203,9 @@ static TAPState *net_tap_fd_init(VLANState *vlan, int fd)
     s->fd = fd;
     s->vc = qemu_new_vlan_client(vlan, tap_receive, NULL, s);
     s->vc->fd_readv = tap_readv;
+#ifdef TUNSETRECVVRING
+    s->vc->set_vring_fds = tap_set_vring_fds;
+#endif
     qemu_set_fd_handler2(s->fd, tap_can_send, tap_send, NULL, s);
     snprintf(s->vc->info_str, sizeof(s->vc->info_str), "tap: fd=%d", fd);
     return s;
@@ -4974,6 +5002,10 @@ int net_client_init(const char *str)
         fprintf(stderr, "Could not create vlan %d\n", vlan_id);
         return -1;
     }
+    if (vlan->hotplug_disabled) {
+        fprintf(stderr, "Hotplug disabled on vlan %d\n", vlan_id);
+        return -1;
+    }
     if (!strcmp(device, "nic")) {
         NICInfo *nd;
         uint8_t *macaddr;
-- 
1.5.4.1


^ permalink raw reply related	[flat|nested] 23+ messages in thread

* [PATCH 5/5] kvm: qemu: Add support for partial csums and GSO
  2008-06-13 13:58       ` [PATCH 4/5] kvm: qemu: Use vringfd to eliminate copies Mark McLoughlin
@ 2008-06-13 13:58         ` Mark McLoughlin
  2008-06-14 23:28         ` [PATCH 4/5] kvm: qemu: Use vringfd to eliminate copies Anthony Liguori
  1 sibling, 0 replies; 23+ messages in thread
From: Mark McLoughlin @ 2008-06-13 13:58 UTC (permalink / raw)
  To: Anthony Liguori, Avi Kivity, Rusty Russell; +Cc: kvm, Mark McLoughlin

Signed-off-by: Mark McLoughlin <markmc@redhat.com>
---
 qemu/hw/virtio-net.c |   51 ++++++++++++++++++++++++++++++++++++++++++++-----
 qemu/net.h           |    2 +
 qemu/vl.c            |   24 +++++++++++++++++++++++
 3 files changed, 71 insertions(+), 6 deletions(-)

diff --git a/qemu/hw/virtio-net.c b/qemu/hw/virtio-net.c
index f3ff356..3952e90 100644
--- a/qemu/hw/virtio-net.c
+++ b/qemu/hw/virtio-net.c
@@ -23,9 +23,18 @@
 #define VIRTIO_ID_NET	1
 
 /* The feature bitmap for virtio net */
-#define VIRTIO_NET_F_NO_CSUM	0
-#define VIRTIO_NET_F_MAC	5
-#define VIRTIO_NET_F_GS0	6
+#define VIRTIO_NET_F_CSUM	0	/* Host handles pkts w/ partial csum */
+#define VIRTIO_NET_F_GUEST_CSUM	1	/* Guest handles pkts w/ partial csum */
+#define VIRTIO_NET_F_MAC	5	/* Host has given MAC address. */
+#define VIRTIO_NET_F_GSO	6	/* Host handles pkts w/ any GSO type */
+#define VIRTIO_NET_F_GUEST_TSO4	7	/* Guest can handle TSOv4 in. */
+#define VIRTIO_NET_F_GUEST_TSO6	8	/* Guest can handle TSOv6 in. */
+#define VIRTIO_NET_F_GUEST_ECN	9	/* Guest can handle TSO[6] w/ ECN in. */
+#define VIRTIO_NET_F_GUEST_UFO	10	/* Guest can handle UFO in. */
+#define VIRTIO_NET_F_HOST_TSO4	11	/* Host can handle TSOv4 in. */
+#define VIRTIO_NET_F_HOST_TSO6	12	/* Host can handle TSOv6 in. */
+#define VIRTIO_NET_F_HOST_ECN	13	/* Host can handle TSO[6] w/ ECN in. */
+#define VIRTIO_NET_F_HOST_UFO	14	/* Host can handle UFO in. */
 
 #define TX_TIMER_INTERVAL (1000 / 500)
 
@@ -43,8 +52,6 @@ struct virtio_net_hdr
     uint8_t flags;
 #define VIRTIO_NET_HDR_GSO_NONE		0	// Not a GSO frame
 #define VIRTIO_NET_HDR_GSO_TCPV4	1	// GSO frame, IPv4 TCP (TSO)
-/* FIXME: Do we need this?  If they said they can handle ECN, do they care? */
-#define VIRTIO_NET_HDR_GSO_TCPV4_ECN	2	// GSO frame, IPv4 TCP w/ ECN
 #define VIRTIO_NET_HDR_GSO_UDP		3	// GSO frame, IPv4 UDP (UFO)
 #define VIRTIO_NET_HDR_GSO_TCPV6	4	// GSO frame, IPv6 TCP
 #define VIRTIO_NET_HDR_GSO_ECN		0x80	// TCP has ECN set
@@ -86,7 +93,38 @@ static void virtio_net_update_config(VirtIODevice *vdev, uint8_t *config)
 
 static uint32_t virtio_net_get_features(VirtIODevice *vdev)
 {
-    return (1 << VIRTIO_NET_F_MAC);
+    uint32_t features = (1 << VIRTIO_NET_F_MAC);
+
+    if (vdev->vq[0].vringfd > 0) {
+	features |= (1 << VIRTIO_NET_F_CSUM);
+	features |= (1 << VIRTIO_NET_F_GUEST_CSUM);
+	features |= (1 << VIRTIO_NET_F_GUEST_TSO4);
+	features |= (1 << VIRTIO_NET_F_GUEST_TSO6);
+	features |= (1 << VIRTIO_NET_F_GUEST_ECN);
+	features |= (1 << VIRTIO_NET_F_GUEST_UFO);
+	features |= (1 << VIRTIO_NET_F_HOST_TSO4);
+	features |= (1 << VIRTIO_NET_F_HOST_TSO6);
+	features |= (1 << VIRTIO_NET_F_HOST_ECN);
+	/* Kernel can't actually handle UFO in software currently. */
+    }
+
+    return features;
+}
+
+static void virtio_net_set_features(VirtIODevice *vdev, uint32_t features)
+{
+    VirtIONet *n = to_virtio_net(vdev);
+    VLANClientState *host = n->vc->vlan->first_client;
+
+    if (n->rx_vq->vringfd <= 0 || !host->set_vring_features)
+	return;
+
+    host->set_vring_features(host,
+                             (features >> VIRTIO_NET_F_GUEST_CSUM) & 1,
+                             (features >> VIRTIO_NET_F_GUEST_TSO4) & 1,
+                             (features >> VIRTIO_NET_F_GUEST_TSO6) & 1,
+                             (features >> VIRTIO_NET_F_GUEST_ECN)  & 1,
+                             (features >> VIRTIO_NET_F_GUEST_UFO)  & 1);
 }
 
 /* RX */
@@ -325,6 +363,7 @@ PCIDevice *virtio_net_init(PCIBus *bus, NICInfo *nd, int devfn)
 
     n->vdev.update_config = virtio_net_update_config;
     n->vdev.get_features = virtio_net_get_features;
+    n->vdev.set_features = virtio_net_set_features;
     n->rx_vq = virtio_add_queue(&n->vdev, 128, virtio_net_handle_rx);
     n->tx_vq = virtio_add_queue(&n->vdev, 128, virtio_net_handle_tx);
     memcpy(n->mac, nd->macaddr, 6);
diff --git a/qemu/net.h b/qemu/net.h
index e8ed926..dd250ae 100644
--- a/qemu/net.h
+++ b/qemu/net.h
@@ -10,6 +10,7 @@ typedef ssize_t (IOReadvHandler)(void *, const struct iovec *, int);
 typedef struct VLANClientState VLANClientState;
 
 typedef int (SetVringFDs)(VLANClientState *, int, int);
+typedef void (SetVringFeatures)(VLANClientState *, int, int, int, int, int);
 
 struct VLANClientState {
     IOReadHandler *fd_read;
@@ -18,6 +19,7 @@ struct VLANClientState {
        rate-limit the slirp code.  */
     IOCanRWHandler *fd_can_read;
     SetVringFDs *set_vring_fds;
+    SetVringFeatures *set_vring_features;
     void *opaque;
     struct VLANClientState *next;
     struct VLANState *vlan;
diff --git a/qemu/vl.c b/qemu/vl.c
index d043ccd..249041c 100644
--- a/qemu/vl.c
+++ b/qemu/vl.c
@@ -4189,6 +4189,29 @@ static int tap_set_vring_fds(VLANClientState *vc, int recv, int xmit)
 
     return 1;
 }
+
+static void tap_set_vring_features(VLANClientState *vc, int csum, int tso4,
+                                   int tso6, int ecn, int ufo)
+{
+    TAPState *s = vc->opaque;
+    unsigned int features = 0;
+
+    if (csum) {
+        features |= TUN_F_CSUM;
+        if (tso4)
+            features |= TUN_F_TSO4;
+        if (tso6)
+            features |= TUN_F_TSO6;
+        if ((tso4 || tso6) && ecn)
+            features |= TUN_F_TSO_ECN;
+        if (ufo)
+            features |= TUN_F_UFO;
+    }
+
+    if (ioctl(s->fd, TUNSETFEATURES, features) != 0)
+        fprintf(stderr, "TUNSETFEATURES ioctl() failed: %s\n",
+                strerror(errno));
+}
 #endif
 
 /* fd support */
@@ -4205,6 +4228,7 @@ static TAPState *net_tap_fd_init(VLANState *vlan, int fd)
     s->vc->fd_readv = tap_readv;
 #ifdef TUNSETRECVVRING
     s->vc->set_vring_fds = tap_set_vring_fds;
+    s->vc->set_vring_features = tap_set_vring_features;
 #endif
     qemu_set_fd_handler2(s->fd, tap_can_send, tap_send, NULL, s);
     snprintf(s->vc->info_str, sizeof(s->vc->info_str), "tap: fd=%d", fd);
-- 
1.5.4.1


^ permalink raw reply related	[flat|nested] 23+ messages in thread

* Re: [PATCH 4/5] kvm: qemu: Use vringfd to eliminate copies
  2008-06-13 13:58       ` [PATCH 4/5] kvm: qemu: Use vringfd to eliminate copies Mark McLoughlin
  2008-06-13 13:58         ` [PATCH 5/5] kvm: qemu: Add support for partial csums and GSO Mark McLoughlin
@ 2008-06-14 23:28         ` Anthony Liguori
  2008-06-16  2:10           ` Rusty Russell
  2008-06-17 14:08           ` Mark McLoughlin
  1 sibling, 2 replies; 23+ messages in thread
From: Anthony Liguori @ 2008-06-14 23:28 UTC (permalink / raw)
  To: Mark McLoughlin; +Cc: Avi Kivity, Rusty Russell, kvm

Mark McLoughlin wrote:
> Where a VLAN consists only of a single tap interface on the
> host side and a single virtio interface on the guest side,
> we can use the tun/tap driver's vringfd support to eliminate
> copies.
>
> We set up a vringfd for both rings and pass them to the
> tun/tap driver. When the guest informs us of the location of
> the ring pages, we use the VRINGSETINFO ioctl() to inform
> the /dev/vring driver.
>   

This patch set is useful for testing (I have one too in my patch 
queue).  We need to make some more pervasive changes to QEMU though to 
take advantage of vringfd upstream.

Specifically, we need to introduce a RX/TX buffer adding/polling API for 
VLANClientState.  We can then use this within a vringfd VLAN client to 
push the indexes to vringfd.  We can't use the base/limit stuff in QEMU 
so we have to do translation.  Not a big deal really.

Have you benchmarked the driver?  I wasn't seeing great performance 
myself although I think that was due to some bugs in the vringfd code.

Regards,

Anthony Liguori

> On xmit, we write() to the vringfd to notify tun/tap of new
> buffers. Once the buffers have been sent, the vringfd
> becomes readable and we read() from it, causing the buffers
> to be placed back on the used ring. We can then notify the
> guest that that we're done with them.
>
> On the recv side, the guest makes buffers available via the
> ring and we merely poll() the vringfd for notification of
> tun/tap having buffers available. When the vringfd becomes
> readable, we read() from it to make tun/tap copy its buffers
> into the guest buffers and, finally, we notify the guest.
>
> Signed-off-by: Mark McLoughlin <markmc@redhat.com>
> ---
>  qemu/hw/virtio-net.c |   78 ++++++++++++++++++++++++++++++++++++++++++++++++++
>  qemu/hw/virtio.c     |   58 +++++++++++++++++++++++++++++++++++++
>  qemu/hw/virtio.h     |    4 ++
>  qemu/net.h           |    4 ++
>  qemu/vl.c            |   32 ++++++++++++++++++++
>  5 files changed, 176 insertions(+), 0 deletions(-)
>
> diff --git a/qemu/hw/virtio-net.c b/qemu/hw/virtio-net.c
> index 6c42bf0..f3ff356 100644
> --- a/qemu/hw/virtio-net.c
> +++ b/qemu/hw/virtio-net.c
> @@ -15,6 +15,7 @@
>  #include "net.h"
>  #include "qemu-timer.h"
>  #include "qemu-kvm.h"
> +#include "qemu-char.h"
>
>  /* from Linux's virtio_net.h */
>
> @@ -156,6 +157,12 @@ static void virtio_net_flush_tx(VirtIONet *n, VirtQueue *vq)
>      if (!(n->vdev.status & VIRTIO_CONFIG_S_DRIVER_OK))
>          return;
>
> +    if (vq->vringfd > 0) {
> +	if (write(vq->vringfd, "", 0) != 0)
> +	    fprintf(stderr, "Failed to notify host kernel of xmit buffers\n");
> +	return;
> +    }
> +
>      while (virtqueue_pop(vq, &elem)) {
>  	ssize_t len = 0;
>
> @@ -205,6 +212,65 @@ static void virtio_net_tx_timer(void *opaque)
>      virtio_net_flush_tx(n, n->tx_vq);
>  }
>
> +static void virtio_net_rx_used(void *opaque)
> +{
> +    VirtIONet *n = opaque;
> +
> +    if (read(n->rx_vq->vringfd, NULL, 0) != 0) {
> +	fprintf(stderr, "Failed to pull used rx buffers\n");
> +	return;
> +    }
> +
> +    virtio_notify(&n->vdev, n->rx_vq);
> +}
> +
> +static void virtio_net_tx_used(void *opaque)
> +{
> +    VirtIONet *n = opaque;
> +
> +    if (read(n->tx_vq->vringfd, NULL, 0) != 0) {
> +	fprintf(stderr, "Failed to pull used tx buffers\n");
> +	return;
> +    }
> +
> +    virtio_notify(&n->vdev, n->tx_vq);
> +}
> +
> +static void virtio_net_close_vrings(VirtIONet *n)
> +{
> +    if (n->rx_vq->vringfd > 0) {
> +	qemu_set_fd_handler(n->rx_vq->vringfd, NULL, NULL, NULL);
> +	virtqueue_close_vringfd(n->rx_vq);
> +    }
> +    if (n->tx_vq->vringfd > 0) {
> +	qemu_set_fd_handler(n->tx_vq->vringfd, NULL, NULL, NULL);
> +	virtqueue_close_vringfd(n->tx_vq);
> +    }
> +}
> +
> +static void virtio_net_init_vrings(VirtIONet *n)
> +{
> +    struct VLANState *vlan = n->vc->vlan;
> +    VLANClientState *host = vlan->first_client;
> +
> +    if (vlan->nb_host_devs != 1 || vlan->nb_guest_devs != 1)
> +	return;
> +
> +    if (!host->set_vring_fds)
> +	return;
> +
> +    if (!virtqueue_open_vringfd(n->rx_vq) || !virtqueue_open_vringfd(n->tx_vq)) {
> +	virtio_net_close_vrings(n);
> +	return;
> +    }
> +
> +    qemu_set_fd_handler(n->rx_vq->vringfd, virtio_net_rx_used, NULL, n);
> +    qemu_set_fd_handler(n->tx_vq->vringfd, virtio_net_tx_used, NULL, n);
> +
> +    if (!host->set_vring_fds(host, n->rx_vq->vringfd, n->tx_vq->vringfd))
> +	virtio_net_close_vrings(n);
> +}
> +
>  static void virtio_net_save(QEMUFile *f, void *opaque)
>  {
>      VirtIONet *n = opaque;
> @@ -224,6 +290,16 @@ static int virtio_net_load(QEMUFile *f, void *opaque, int version_id)
>
>      virtio_load(&n->vdev, f);
>
> +    if (n->rx_vq->vringfd > 0 && n->tx_vq->vringfd > 0) {
> +	VLANClientState *host = n->vc->vlan->first_client;
> +
> +	qemu_set_fd_handler(n->rx_vq->vringfd, virtio_net_rx_used, NULL, n);
> +	qemu_set_fd_handler(n->tx_vq->vringfd, virtio_net_tx_used, NULL, n);
> +
> +	if (!host->set_vring_fds(host, n->rx_vq->vringfd, n->tx_vq->vringfd))
> +	    virtio_net_close_vrings(n);
> +    }
> +
>      qemu_get_buffer(f, n->mac, 6);
>      n->tx_timer_active = qemu_get_be32(f);
>
> @@ -258,6 +334,8 @@ PCIDevice *virtio_net_init(PCIBus *bus, NICInfo *nd, int devfn)
>      n->tx_timer = qemu_new_timer(vm_clock, virtio_net_tx_timer, n);
>      n->tx_timer_active = 0;
>
> +    virtio_net_init_vrings(n);
> +
>      register_savevm("virtio-net", virtio_net_id++, 1,
>  		    virtio_net_save, virtio_net_load, n);
>
> diff --git a/qemu/hw/virtio.c b/qemu/hw/virtio.c
> index a1ee93f..e9ec1ad 100644
> --- a/qemu/hw/virtio.c
> +++ b/qemu/hw/virtio.c
> @@ -13,6 +13,7 @@
>
>  #include <inttypes.h>
>  #include <err.h>
> +#include <sys/ioctl.h>
>
>  #include "virtio.h"
>  #include "sysemu.h"
> @@ -193,6 +194,56 @@ int virtqueue_pop(VirtQueue *vq, VirtQueueElement *elem)
>      return elem->in_num + elem->out_num;
>  }
>
> +/* vringfd functions */
> +
> +static int virtqueue_setup_vringfd(VirtQueue *vq)
> +{
> +    struct vring_ioctl_info info;
> +
> +    if (vq->vringfd <= 0 || !vq->vring.desc)
> +	return 1;
> +
> +    info.num_descs = vq->vring.num;
> +    info.descs	   = (unsigned long)vq->vring.desc;
> +    info.base	   = (unsigned long)phys_ram_base;
> +    info.limit	   = phys_ram_size;
> +
> +    if (ioctl(vq->vringfd, VRINGSETINFO, &info) != 0) {
> +	fprintf(stderr, "Failed to setup vring: %s\n", strerror(errno));
> +	return 0;
> +    }
> +
> +    return 1;
> +}
> +
> +int virtqueue_open_vringfd(VirtQueue *vq)
> +{
> +    int fd;
> +
> +    fd = open("/dev/vring", O_RDWR);
> +    if (fd < 0) {
> +	static int warned = 0;
> +	if (!warned)
> +	    fprintf(stderr, "Warning: no vringfd support available: "
> +		    "Failed to open /dev/vring\n");
> +	warned = 1;
> +	return 0;
> +    }
> +
> +    vq->vringfd = fd;
> +
> +    return virtqueue_setup_vringfd(vq);
> +}
> +
> +void virtqueue_close_vringfd(VirtQueue *vq)
> +{
> +    if (vq->vringfd <= 0)
> +	return;
> +
> +    close(vq->vringfd);
> +    vq->vringfd = 0;
> +}
> +
>  /* virtio device */
>
>  static VirtIODevice *to_virtio_device(PCIDevice *pci_dev)
> @@ -248,6 +299,7 @@ static void virtio_ioport_write(void *opaque, uint32_t addr, uint32_t val)
>  	    size_t size = virtqueue_size(vdev->vq[vdev->queue_sel].vring.num);
>  	    virtqueue_init(&vdev->vq[vdev->queue_sel],
>  			   virtio_map_gpa(pa, size));
> +	    virtqueue_setup_vringfd(&vdev->vq[vdev->queue_sel]);
>  	}
>  	break;
>      case VIRTIO_PCI_QUEUE_SEL:
> @@ -464,6 +516,8 @@ void virtio_save(VirtIODevice *vdev, QEMUFile *f)
>      qemu_put_be32(f, i);
>
>      for (i = 0; i < VIRTIO_PCI_QUEUE_MAX; i++) {
> +	qemu_put_byte(f, vdev->vq[i].vringfd > 0);
> +
>  	if (vdev->vq[i].vring.num == 0)
>  	    break;
>
> @@ -490,6 +544,9 @@ void virtio_load(VirtIODevice *vdev, QEMUFile *f)
>      num = qemu_get_be32(f);
>
>      for (i = 0; i < num; i++) {
> +	if (qemu_get_byte(f))
> +	    virtqueue_open_vringfd(&vdev->vq[i]);
> +
>  	vdev->vq[i].vring.num = qemu_get_be32(f);
>  	qemu_get_be32s(f, &vdev->vq[i].pfn);
>  	vq_last_avail(&vdev->vq[i]) = qemu_get_be16(f);
> @@ -501,6 +558,7 @@ void virtio_load(VirtIODevice *vdev, QEMUFile *f)
>  	    pa = (ram_addr_t)vdev->vq[i].pfn << TARGET_PAGE_BITS;
>  	    size = virtqueue_size(vdev->vq[i].vring.num);
>  	    virtqueue_init(&vdev->vq[i], virtio_map_gpa(pa, size));
> +	    virtqueue_setup_vringfd(&vdev->vq[i]);
>  	}
>      }
>
> diff --git a/qemu/hw/virtio.h b/qemu/hw/virtio.h
> index 142ecbd..d3cf5fb 100644
> --- a/qemu/hw/virtio.h
> +++ b/qemu/hw/virtio.h
> @@ -93,6 +93,7 @@ struct VirtQueue
>      VRing vring;
>      uint32_t pfn;
>      int inuse;
> +    int vringfd;
>      void (*handle_output)(VirtIODevice *vdev, VirtQueue *vq);
>  };
>
> @@ -152,4 +153,7 @@ void virtio_save(VirtIODevice *vdev, QEMUFile *f);
>
>  void virtio_load(VirtIODevice *vdev, QEMUFile *f);
>
> +int virtqueue_open_vringfd(VirtQueue *vq);
> +void virtqueue_close_vringfd(VirtQueue *vq);
> +
>  #endif
> diff --git a/qemu/net.h b/qemu/net.h
> index 9dc8b7c..e8ed926 100644
> --- a/qemu/net.h
> +++ b/qemu/net.h
> @@ -9,12 +9,15 @@ typedef ssize_t (IOReadvHandler)(void *, const struct iovec *, int);
>
>  typedef struct VLANClientState VLANClientState;
>
> +typedef int (SetVringFDs)(VLANClientState *, int, int);
> +
>  struct VLANClientState {
>      IOReadHandler *fd_read;
>      IOReadvHandler *fd_readv;
>      /* Packets may still be sent if this returns zero.  It's used to
>         rate-limit the slirp code.  */
>      IOCanRWHandler *fd_can_read;
> +    SetVringFDs *set_vring_fds;
>      void *opaque;
>      struct VLANClientState *next;
>      struct VLANState *vlan;
> @@ -26,6 +29,7 @@ struct VLANState {
>      VLANClientState *first_client;
>      struct VLANState *next;
>      unsigned int nb_guest_devs, nb_host_devs;
> +    unsigned int hotplug_disabled : 1;
>  };
>
>  VLANState *qemu_find_vlan(int id);
> diff --git a/qemu/vl.c b/qemu/vl.c
> index f573dce..d043ccd 100644
> --- a/qemu/vl.c
> +++ b/qemu/vl.c
> @@ -4166,6 +4166,31 @@ static void tap_send(void *opaque)
>      } while (s->size > 0);
>  }
>
> +#ifdef TUNSETRECVVRING
> +static int tap_set_vring_fds(VLANClientState *vc, int recv, int xmit)
> +{
> +    TAPState *s = vc->opaque;
> +
> +    if (ioctl(s->fd, TUNSETRECVVRING, recv) != 0) {
> +	fprintf(stderr, "TUNSETRECVVRING ioctl() failed: %s\n",
> +		strerror(errno));
> +	return 0;
> +    }
> +
> +    if (ioctl(s->fd, TUNSETXMITVRING, xmit) != 0) {
> +	fprintf(stderr, "TUNSETXMITVRING ioctl() failed: %s\n",
> +		strerror(errno));
> +	exit(1);
> +    }
> +
> +    qemu_set_fd_handler(s->fd, NULL, NULL, NULL);
> +
> +    vc->vlan->hotplug_disabled = 1;
> +
> +    return 1;
> +}
> +#endif
> +
>  /* fd support */
>
>  static TAPState *net_tap_fd_init(VLANState *vlan, int fd)
> @@ -4178,6 +4203,9 @@ static TAPState *net_tap_fd_init(VLANState *vlan, int fd)
>      s->fd = fd;
>      s->vc = qemu_new_vlan_client(vlan, tap_receive, NULL, s);
>      s->vc->fd_readv = tap_readv;
> +#ifdef TUNSETRECVVRING
> +    s->vc->set_vring_fds = tap_set_vring_fds;
> +#endif
>      qemu_set_fd_handler2(s->fd, tap_can_send, tap_send, NULL, s);
>      snprintf(s->vc->info_str, sizeof(s->vc->info_str), "tap: fd=%d", fd);
>      return s;
> @@ -4974,6 +5002,10 @@ int net_client_init(const char *str)
>          fprintf(stderr, "Could not create vlan %d\n", vlan_id);
>          return -1;
>      }
> +    if (vlan->hotplug_disabled) {
> +        fprintf(stderr, "Hotplug disabled on vlan %d\n", vlan_id);
> +        return -1;
> +    }
>      if (!strcmp(device, "nic")) {
>          NICInfo *nd;
>          uint8_t *macaddr;
>   


^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH 4/5] kvm: qemu: Use vringfd to eliminate copies
  2008-06-14 23:28         ` [PATCH 4/5] kvm: qemu: Use vringfd to eliminate copies Anthony Liguori
@ 2008-06-16  2:10           ` Rusty Russell
  2008-06-16 14:02             ` Anthony Liguori
  2008-06-18 14:01             ` Avi Kivity
  2008-06-17 14:08           ` Mark McLoughlin
  1 sibling, 2 replies; 23+ messages in thread
From: Rusty Russell @ 2008-06-16  2:10 UTC (permalink / raw)
  To: Anthony Liguori; +Cc: Mark McLoughlin, Avi Kivity, kvm

On Sunday 15 June 2008 09:28:34 Anthony Liguori wrote:
> Have you benchmarked the driver?  I wasn't seeing great performance
> myself although I think that was due to some bugs in the vringfd code.

Yeah, every time I get close to benchmarking I find another bug :(  But I've 
spent some time optimising the normal lguest net device, so we'll have a fair 
comparison.

In theory vringfd will get us zero copy from guest sendfile out to external 
machines.  For anything else we're doing a copy anyway, so avoiding copying 
has no great benefit.

The interface is still worthwhile to provide zero-copy receive on intelligent 
or bound NICs, but that's science fiction at the moment...

Cheers,
Rusty.

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH 4/5] kvm: qemu: Use vringfd to eliminate copies
  2008-06-16  2:10           ` Rusty Russell
@ 2008-06-16 14:02             ` Anthony Liguori
  2008-06-16 14:58               ` Avi Kivity
  2008-06-18  5:43               ` Rusty Russell
  2008-06-18 14:01             ` Avi Kivity
  1 sibling, 2 replies; 23+ messages in thread
From: Anthony Liguori @ 2008-06-16 14:02 UTC (permalink / raw)
  To: Rusty Russell; +Cc: Mark McLoughlin, Avi Kivity, kvm

Rusty Russell wrote:
> On Sunday 15 June 2008 09:28:34 Anthony Liguori wrote:
>   
>> Have you benchmarked the driver?  I wasn't seeing great performance
>> myself although I think that was due to some bugs in the vringfd code.
>>     
>
> Yeah, every time I get close to benchmarking I find another bug :(  But I've 
> spent some time optimising the normal lguest net device, so we'll have a fair 
> comparison.
>
> In theory vringfd will get us zero copy from guest sendfile out to external 
> machines.  For anything else we're doing a copy anyway, so avoiding copying 
> has no great benefit.
>   

There's nothing that prevents zero-copy to be implemented for tun 
without vringfd.  In fact, I seem to recall that your earlier patches 
implemented zero-copy :-)

I like the vringfd model and I think it's a good way to move forward.  
My concern is that it introduces an extra syscall in the TX path.  Right 
now, we do a single write call whereas with vringfd we need to insert 
the TX packet into the queue, do a notify, and then wait for indication 
that the TX has succeeded.

I know we'll win with TSO but we don't need vringfd for TSO.  The jury's 
still out IMHO as to whether we should do vringfd or just try to merge 
TSO tun patches.

Regards,

Anthony Liguori

> The interface is still worthwhile to provide zero-copy receive on intelligent 
> or bound NICs, but that's science fiction at the moment...
>
> Cheers,
> Rusty.
>   

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH 4/5] kvm: qemu: Use vringfd to eliminate copies
  2008-06-16 14:02             ` Anthony Liguori
@ 2008-06-16 14:58               ` Avi Kivity
  2008-06-18  5:43               ` Rusty Russell
  1 sibling, 0 replies; 23+ messages in thread
From: Avi Kivity @ 2008-06-16 14:58 UTC (permalink / raw)
  To: Anthony Liguori; +Cc: Rusty Russell, Mark McLoughlin, kvm

Anthony Liguori wrote:
>>
>> In theory vringfd will get us zero copy from guest sendfile out to 
>> external machines.  For anything else we're doing a copy anyway, so 
>> avoiding copying has no great benefit.
>>   
>
> There's nothing that prevents zero-copy to be implemented for tun 
> without vringfd.  In fact, I seem to recall that your earlier patches 
> implemented zero-copy :-)
>
> I like the vringfd model and I think it's a good way to move forward.  
> My concern is that it introduces an extra syscall in the TX path.  
> Right now, we do a single write call whereas with vringfd we need to 
> insert the TX packet into the queue, do a notify, and then wait for 
> indication that the TX has succeeded.
>
> I know we'll win with TSO but we don't need vringfd for TSO.  The 
> jury's still out IMHO as to whether we should do vringfd or just try 
> to merge TSO tun patches.

tun+tso still doesn't give you zerocopy (unless you change it to use 
aio, which re-introduces the syscall).

btw, the two vringfd syscalls are amortized over a potentially large 
number of packets, whereas the single tun syscall is per-packet.


(note: we can get rid of the two syscalls as well by having each side 
opportunistically pick up ring entries, like Xen does)

-- 
I have a truly marvellous patch that fixes the bug which this
signature is too narrow to contain.


^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH 4/5] kvm: qemu: Use vringfd to eliminate copies
  2008-06-16 14:02             ` Anthony Liguori
  2008-06-16 14:58               ` Avi Kivity
@ 2008-06-18  5:43               ` Rusty Russell
  1 sibling, 0 replies; 23+ messages in thread
From: Rusty Russell @ 2008-06-18  5:43 UTC (permalink / raw)
  To: Anthony Liguori; +Cc: Mark McLoughlin, Avi Kivity, kvm

On Tuesday 17 June 2008 00:02:55 Anthony Liguori wrote:
> There's nothing that prevents zero-copy to be implemented for tun
> without vringfd.  In fact, I seem to recall that your earlier patches
> implemented zero-copy :-)

They didn't actually work.  You need to block until the data isn't being used 
any more (thread pool anyone?), or implement an aio interface.

> I like the vringfd model and I think it's a good way to move forward.
> My concern is that it introduces an extra syscall in the TX path.  Right
> now, we do a single write call whereas with vringfd we need to insert
> the TX packet into the queue, do a notify, and then wait for indication
> that the TX has succeeded.

If the guest wants notification of xmit, yes you need another syscall for 
that.  But it often doesn't (note: current vring tun ignored the NO_NOTIFY 
flag, but one thing at a time).

> I know we'll win with TSO but we don't need vringfd for TSO.  The jury's
> still out IMHO as to whether we should do vringfd or just try to merge
> TSO tun patches.

Note that we can do TSO in userspace, too.  No syscall reduction, but an VM 
exit reduction.

Cheers,
Rusty.

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH 4/5] kvm: qemu: Use vringfd to eliminate copies
  2008-06-16  2:10           ` Rusty Russell
  2008-06-16 14:02             ` Anthony Liguori
@ 2008-06-18 14:01             ` Avi Kivity
  1 sibling, 0 replies; 23+ messages in thread
From: Avi Kivity @ 2008-06-18 14:01 UTC (permalink / raw)
  To: Rusty Russell; +Cc: Anthony Liguori, Mark McLoughlin, kvm

Rusty Russell wrote:
> Yeah, every time I get close to benchmarking I find another bug :(  But I've 
> spent some time optimising the normal lguest net device, so we'll have a fair 
> comparison.
>
> In theory vringfd will get us zero copy from guest sendfile out to external 
> machines.  For anything else we're doing a copy anyway, so avoiding copying 
> has no great benefit.
>
> The interface is still worthwhile to provide zero-copy receive on intelligent 
> or bound NICs, but that's science fiction at the moment...
>
>   

Eddie is planning exactly this...

-- 
I have a truly marvellous patch that fixes the bug which this
signature is too narrow to contain.


^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH 4/5] kvm: qemu: Use vringfd to eliminate copies
  2008-06-14 23:28         ` [PATCH 4/5] kvm: qemu: Use vringfd to eliminate copies Anthony Liguori
  2008-06-16  2:10           ` Rusty Russell
@ 2008-06-17 14:08           ` Mark McLoughlin
  2008-06-17 14:54             ` Anthony Liguori
  1 sibling, 1 reply; 23+ messages in thread
From: Mark McLoughlin @ 2008-06-17 14:08 UTC (permalink / raw)
  To: Anthony Liguori; +Cc: Avi Kivity, Rusty Russell, kvm

Hi Anthony,

On Sat, 2008-06-14 at 18:28 -0500, Anthony Liguori wrote:
> Mark McLoughlin wrote:
> > Where a VLAN consists only of a single tap interface on the
> > host side and a single virtio interface on the guest side,
> > we can use the tun/tap driver's vringfd support to eliminate
> > copies.
> >
> > We set up a vringfd for both rings and pass them to the
> > tun/tap driver. When the guest informs us of the location of
> > the ring pages, we use the VRINGSETINFO ioctl() to inform
> > the /dev/vring driver.
> >   
> 
> This patch set is useful for testing (I have one too in my patch 
> queue).

Ah, didn't know of your queue ... Is it (http://hg.codemonkey.ws/) down
atm?

>   We need to make some more pervasive changes to QEMU though to 
> take advantage of vringfd upstream.
> 
> Specifically, we need to introduce a RX/TX buffer adding/polling API for 
> VLANClientState.  We can then use this within a vringfd VLAN client to 
> push the indexes to vringfd.

I don't think I'm following you fully on this.

The TX side is fine - guest adds buffer to ring, virtio VLANClient calls
->add_tx_buffer() on every other VLANClient, waits until all are
finished sending and notifies the guest that we're done.

But the RX side? The guest allocates the buffers, so does the virtio
VLANClient divide those buffers between every other VLANClient? Or does
it make all buffers available to all clients and have a way of locking a
buffer just before using it? The former would be a waste, and we don't
have any way of doing the latter right now with vringfd.

Also, since a client could be supplied with RX buffers from multiple
other clients, tun/tap would need to support multiple RX rings.

It really makes one wonder whether QEMU's VLAN feature is really worth
all this bother.

Oh yes, there's also GSO feature negotiation; you'd need to have a way
of figuring out what clients support what GSO features, which is
fine ... expect for what to do in the case of hotplug.

>   We can't use the base/limit stuff in QEMU so we have to do
> translation.  Not a big deal really.

Yeah, that's not a problem.

> Have you benchmarked the driver?  I wasn't seeing great performance 
> myself although I think that was due to some bugs in the vringfd code.

Nope, I haven't done any real benchmarking with it yet.

Cheers,
Mark.

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH 4/5] kvm: qemu: Use vringfd to eliminate copies
  2008-06-17 14:08           ` Mark McLoughlin
@ 2008-06-17 14:54             ` Anthony Liguori
  2008-06-17 15:45               ` Mark McLoughlin
  0 siblings, 1 reply; 23+ messages in thread
From: Anthony Liguori @ 2008-06-17 14:54 UTC (permalink / raw)
  To: Mark McLoughlin; +Cc: Avi Kivity, Rusty Russell, kvm

Mark McLoughlin wrote:
> Hi Anthony,
>
> On Sat, 2008-06-14 at 18:28 -0500, Anthony Liguori wrote:
>   
>> This patch set is useful for testing (I have one too in my patch 
>> queue).
>>     
>
> Ah, didn't know of your queue ... Is it (http://hg.codemonkey.ws/) down
> atm?
>   

It's up now.

>>   We need to make some more pervasive changes to QEMU though to 
>> take advantage of vringfd upstream.
>>
>> Specifically, we need to introduce a RX/TX buffer adding/polling API for 
>> VLANClientState.  We can then use this within a vringfd VLAN client to 
>> push the indexes to vringfd.
>>     
>
> I don't think I'm following you fully on this.
>
> The TX side is fine - guest adds buffer to ring, virtio VLANClient calls
> ->add_tx_buffer() on every other VLANClient, waits until all are
> finished sending and notifies the guest that we're done.
>
> But the RX side? The guest allocates the buffers, so does the virtio
> VLANClient divide those buffers between every other VLANClient?

This is where things get tricky.  Internally, it will have to copy the 
TX buffer into each of the clients RX buffers.  We need to special case 
the circumstance where the only other VLANClientState is a vringfd 
client so that we can pass the RX buffer directly to it.  Haven't come 
up with a perfect API just yet but that's what we need to do.

>  Or does
> it make all buffers available to all clients and have a way of locking a
> buffer just before using it? The former would be a waste, and we don't
> have any way of doing the latter right now with vringfd.
>
> Also, since a client could be supplied with RX buffers from multiple
> other clients, tun/tap would need to support multiple RX rings.
>
> It really makes one wonder whether QEMU's VLAN feature is really worth
> all this bother.
>   

It's necessary for upstream acceptance and solving the problem right 
will give us vringfd support for the e1000 for free.

> Oh yes, there's also GSO feature negotiation; you'd need to have a way
> of figuring out what clients support what GSO features, which is
> fine ... expect for what to do in the case of hotplug.
>   

Right, we need a feature API.

Regards,

Anthony Liguori

>>   We can't use the base/limit stuff in QEMU so we have to do
>> translation.  Not a big deal really.
>>     
>
> Yeah, that's not a problem.
>
>   
>> Have you benchmarked the driver?  I wasn't seeing great performance 
>> myself although I think that was due to some bugs in the vringfd code.
>>     
>
> Nope, I haven't done any real benchmarking with it yet.
>
> Cheers,
> Mark.
>
>   


^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH 4/5] kvm: qemu: Use vringfd to eliminate copies
  2008-06-17 14:54             ` Anthony Liguori
@ 2008-06-17 15:45               ` Mark McLoughlin
  0 siblings, 0 replies; 23+ messages in thread
From: Mark McLoughlin @ 2008-06-17 15:45 UTC (permalink / raw)
  To: Anthony Liguori; +Cc: Avi Kivity, Rusty Russell, kvm

On Tue, 2008-06-17 at 09:54 -0500, Anthony Liguori wrote:
> Mark McLoughlin wrote:
> >
> > On Sat, 2008-06-14 at 18:28 -0500, Anthony Liguori wrote:

> >>   We need to make some more pervasive changes to QEMU though to 
> >> take advantage of vringfd upstream.
> >>
> >> Specifically, we need to introduce a RX/TX buffer adding/polling API for 
> >> VLANClientState.  We can then use this within a vringfd VLAN client to 
> >> push the indexes to vringfd.
> >>     
> >
> > I don't think I'm following you fully on this.
> >
> > The TX side is fine - guest adds buffer to ring, virtio VLANClient calls
> > ->add_tx_buffer() on every other VLANClient, waits until all are
> > finished sending and notifies the guest that we're done.
> >
> > But the RX side? The guest allocates the buffers, so does the virtio
> > VLANClient divide those buffers between every other VLANClient?
> 
> This is where things get tricky.  Internally, it will have to copy the 
> TX buffer into each of the clients RX buffers.  We need to special case 
> the circumstance where the only other VLANClientState is a vringfd 
> client so that we can pass the RX buffer directly to it.  Haven't come 
> up with a perfect API just yet but that's what we need to do.

How about if we reversed the tun recv vringfd?

i.e. rather than passing the guest's recv buffers down to the tun driver
where it copies the skbs into the buffers, have the tun driver pass the
skb buffers directly via the vringfd up to qemu where it can then pass
them to each of the other clients. The virtio client would then do copy
into the guest's buffer.

That would make the internal qemu API pretty simple and eliminate the
need for special cases.

Cheers,
Mark.


^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH 1/5] vring: Replace mmap() interface with ioctl()
  2008-06-13 13:57 ` [PATCH 1/5] vring: Replace mmap() interface with ioctl() Mark McLoughlin
  2008-06-13 13:57   ` [PATCH 2/5] lguest: Use VRINGSETINFO ioctl() instead of mmap() Mark McLoughlin
@ 2008-06-13 14:09   ` Avi Kivity
  2008-06-17 12:19     ` Mark McLoughlin
  2008-06-14  9:02   ` Rusty Russell
  2 siblings, 1 reply; 23+ messages in thread
From: Avi Kivity @ 2008-06-13 14:09 UTC (permalink / raw)
  To: Mark McLoughlin; +Cc: Anthony Liguori, Rusty Russell, kvm

Mark McLoughlin wrote:
> /dev/vring's mmap() interface is a strange creature. It
> serves as a way for userland to supply the address of the
> already allocated ring descriptors, but causes those pages
> to be re-maped as a natural side effect of the mmap()
>
> This is not an issue for lguest because it does the mmap()
> before even starting the guest. However, in the case of kvm,
> the guest allocates the ring and informs the host of its
> addresss. If we then mmap() it, we cause it to be remapped
> to new pages which the vring driver will then use.
>
> Now, KVM guests don't actually use the ring pages before
> informing the host of its address, so we could probably just
> invalidate the guest's shadow page table and have the new
> pfns picked up. That would be an odd requirement to impose
> on the guest ABI, though.
>
> Since the mmap() semantics are so strange, switch to using a
> single ioctl() for setting up the ring.
>
>   

Definitely an improvement; iterfaces which seem familiar but aren't are 
not user friendly.

In any case the guest can't use the ring directly since physical memory 
is discontiguous.

> -	.ioctl		= vring_ioctl,
> +	.unlocked_ioctl	= vring_ioctl,
> +	.compat_ioctl	= vring_compat_ioctl,
>   

I think you can set compat_ioctl = vring_ioctl (that's what kvm does).

> diff --git a/include/linux/vring.h b/include/linux/vring.h
> index 47c8848..de4125d 100644
> --- a/include/linux/vring.h
> +++ b/include/linux/vring.h
> @@ -21,8 +21,14 @@
>  #include <linux/types.h>
>  
>  /* Ioctl defines. */
> -#define VRINGSETBASE	_IO(0xAD, 0)
> -#define VRINGSETLIMIT	_IO(0xAD, 1)
> +#define VRINGSETINFO _IO(0xAD, 0)
> +
> +struct vring_ioctl_info {
> +	__u16 num_descs;
>   

Padding for 64-bits here, otherwise compat_ioctl breaks.

> +	__u64 descs;
> +	__u64 base;
> +	__u64 limit;
> +};


-- 
I have a truly marvellous patch that fixes the bug which this
signature is too narrow to contain.


^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH 1/5] vring: Replace mmap() interface with ioctl()
  2008-06-13 14:09   ` [PATCH 1/5] vring: Replace mmap() interface with ioctl() Avi Kivity
@ 2008-06-17 12:19     ` Mark McLoughlin
  2008-06-18 14:05       ` Avi Kivity
  0 siblings, 1 reply; 23+ messages in thread
From: Mark McLoughlin @ 2008-06-17 12:19 UTC (permalink / raw)
  To: Avi Kivity; +Cc: Anthony Liguori, Rusty Russell, kvm

On Fri, 2008-06-13 at 17:09 +0300, Avi Kivity wrote:
> Mark McLoughlin wrote:
> > -	.ioctl		= vring_ioctl,
> > +	.unlocked_ioctl	= vring_ioctl,
> > +	.compat_ioctl	= vring_compat_ioctl,
> >   
> 
> I think you can set compat_ioctl = vring_ioctl (that's what kvm does).

Don't pointer args need the compat_ptr() conversion on s390x?

> > diff --git a/include/linux/vring.h b/include/linux/vring.h
> > index 47c8848..de4125d 100644
> > --- a/include/linux/vring.h
> > +++ b/include/linux/vring.h
> > @@ -21,8 +21,14 @@
> >  #include <linux/types.h>
> >  
> >  /* Ioctl defines. */
> > -#define VRINGSETBASE	_IO(0xAD, 0)
> > -#define VRINGSETLIMIT	_IO(0xAD, 1)
> > +#define VRINGSETINFO _IO(0xAD, 0)
> > +
> > +struct vring_ioctl_info {
> > +	__u16 num_descs;
> >   
> 
> Padding for 64-bits here, otherwise compat_ioctl breaks.

Nice catch. Fixed below.

Thanks,
Mark.

From: Mark McLoughlin <markmc@redhat.com>
Subject: vring: Replace mmap() interface with ioctl()

/dev/vring's mmap() interface is a strange creature. It
serves as a way for userland to supply the address of the
already allocated ring descriptors, but causes those pages
to be re-maped as a natural side effect of the mmap()

This is not an issue for lguest because it does the mmap()
before even starting the guest. However, in the case of kvm,
the guest allocates the ring and informs the host of its
addresss. If we then mmap() it, we cause it to be remapped
to new pages which the vring driver will then use.

Now, KVM guests don't actually use the ring pages before
informing the host of its address, so we could probably just
invalidate the guest's shadow page table and have the new
pfns picked up. That would be an odd requirement to impose
on the guest ABI, though.

Since the mmap() semantics are so strange, switch to using a
single ioctl() for setting up the ring.

(Against misc:dev_vring.patch and misc:ringfd-base-limit.patch)

Signed-off-by: Mark McLoughlin <markmc@redhat.com>

Index: linux-2.6/drivers/char/vring.c
===================================================================
--- linux-2.6.orig/drivers/char/vring.c	2008-06-19 09:03:11.000000000 +0100
+++ linux-2.6/drivers/char/vring.c	2008-06-19 09:04:36.000000000 +0100
@@ -22,6 +22,7 @@
 #include <linux/mutex.h>
 #include <linux/wait.h>
 #include <linux/fs.h>
+#include <linux/compat.h>
 #include <linux/poll.h>
 #include <linux/module.h>
 #include <linux/miscdevice.h>
@@ -126,22 +127,21 @@
 	return 0;
 }
 
-static int vring_mmap(struct file *filp, struct vm_area_struct *vma)
+static long vring_ioctl(struct file *filp, unsigned int cmd, unsigned long arg)
 {
-	unsigned long size, num_descs;
 	struct vring_info *vr = filp->private_data;
+	void __user *argp = (void __user *)arg;
+	struct vring_ioctl_info info;
+	unsigned long descs;
 	int err;
 
-	/* We overload mmap's offset to hold the ring number. */
-	num_descs = vma->vm_pgoff;
+	if (cmd != VRINGSETINFO)
+		return -ENOTTY;
 
-	/* Must be a power of two, and limit indices to a u16. */
-	if (!is_power_of_2(num_descs) || num_descs > 65536)
-		return -EINVAL;
+	if (copy_from_user(&info, argp, sizeof(info)))
+		return -EFAULT;
 
-	/* mmap size must be what we expect for such a ring. */
-	size = vma->vm_end - vma->vm_start;
-	if (size != ALIGN(vring_size(num_descs, PAGE_SIZE), PAGE_SIZE))
+	if (!is_power_of_2(info.num_descs))
 		return -EINVAL;
 
 	/* We only let them map this in one place. */
@@ -151,9 +151,14 @@
 		goto unlock;
 	}
 
-	vring_init(&vr->ring, num_descs, (void *)vma->vm_start, PAGE_SIZE);
+	descs = info.descs;
+	vring_init(&vr->ring, info.num_descs, (void *)descs, PAGE_SIZE);
 
-	vr->mask = num_descs - 1;
+	vr->mask = info.num_descs - 1;
+	vr->base = info.base;
+	vr->limit = info.limit;
+	if (vr->limit == 0)
+		vr->limit = -1UL;
 	err = 0;
 
 unlock:
@@ -161,6 +166,16 @@
 	return err;
 }
 
+#ifdef CONFIG_COMPAT
+static long vring_compat_ioctl(struct file *filp, unsigned int cmd,
+			       unsigned long arg)
+{
+	return vring_ioctl(filp, cmd, (unsigned long)compat_ptr(arg));
+}
+#else
+#define vring_compat_ioctl NULL
+#endif
+
 static int vring_open(struct inode *in, struct file *filp)
 {
 	struct vring_info *vr;
@@ -176,32 +191,14 @@
 	return 0;
 }
 
-static int vring_ioctl(struct inode *in, struct file *filp,
-		       unsigned int cmd, unsigned long arg)
-{
-	struct vring_info *vr = filp->private_data;
-
-	switch (cmd) {
-	case VRINGSETBASE:
-		vr->base = arg;
-		break;
-	case VRINGSETLIMIT:
-		vr->limit = arg;
-		break;
-	default:
-		return -ENOTTY;
-	}
-	return 0;
-}
-
 static const struct file_operations vring_fops = {
 	.open		= vring_open,
 	.release	= vring_release,
-	.mmap		= vring_mmap,
 	.read		= vring_read,
 	.write		= vring_write,
 	.poll		= vring_poll,
-	.ioctl		= vring_ioctl,
+	.unlocked_ioctl	= vring_ioctl,
+	.compat_ioctl	= vring_compat_ioctl,
 };
 
 /**
Index: linux-2.6/include/linux/vring.h
===================================================================
--- linux-2.6.orig/include/linux/vring.h	2008-06-19 09:03:11.000000000 +0100
+++ linux-2.6/include/linux/vring.h	2008-06-19 09:05:56.000000000 +0100
@@ -21,8 +21,15 @@
 #include <linux/types.h>
 
 /* Ioctl defines. */
-#define VRINGSETBASE	_IO(0xAD, 0)
-#define VRINGSETLIMIT	_IO(0xAD, 1)
+#define VRINGSETINFO _IO(0xAD, 0)
+
+struct vring_ioctl_info {
+	__u16 num_descs;
+	__u8 padding[6];
+	__u64 descs;
+	__u64 base;
+	__u64 limit;
+};
 
 #ifdef __KERNEL__
 



^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH 1/5] vring: Replace mmap() interface with ioctl()
  2008-06-17 12:19     ` Mark McLoughlin
@ 2008-06-18 14:05       ` Avi Kivity
  0 siblings, 0 replies; 23+ messages in thread
From: Avi Kivity @ 2008-06-18 14:05 UTC (permalink / raw)
  To: Mark McLoughlin; +Cc: Anthony Liguori, Rusty Russell, kvm

Mark McLoughlin wrote:
> On Fri, 2008-06-13 at 17:09 +0300, Avi Kivity wrote:
>   
>> Mark McLoughlin wrote:
>>     
>>> -	.ioctl		= vring_ioctl,
>>> +	.unlocked_ioctl	= vring_ioctl,
>>> +	.compat_ioctl	= vring_compat_ioctl,
>>>   
>>>       
>> I think you can set compat_ioctl = vring_ioctl (that's what kvm does).
>>     
>
> Don't pointer args need the compat_ptr() conversion on s390x?
>
>   

Ouch.

-- 
I have a truly marvellous patch that fixes the bug which this
signature is too narrow to contain.


^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH 1/5] vring: Replace mmap() interface with ioctl()
  2008-06-13 13:57 ` [PATCH 1/5] vring: Replace mmap() interface with ioctl() Mark McLoughlin
  2008-06-13 13:57   ` [PATCH 2/5] lguest: Use VRINGSETINFO ioctl() instead of mmap() Mark McLoughlin
  2008-06-13 14:09   ` [PATCH 1/5] vring: Replace mmap() interface with ioctl() Avi Kivity
@ 2008-06-14  9:02   ` Rusty Russell
  2008-06-14 14:20     ` Avi Kivity
  2 siblings, 1 reply; 23+ messages in thread
From: Rusty Russell @ 2008-06-14  9:02 UTC (permalink / raw)
  To: Mark McLoughlin; +Cc: Anthony Liguori, Avi Kivity, kvm

On Friday 13 June 2008 23:57:57 Mark McLoughlin wrote:
> /dev/vring's mmap() interface is a strange creature. It
> serves as a way for userland to supply the address of the
> already allocated ring descriptors, but causes those pages
> to be re-maped as a natural side effect of the mmap()
>
> This is not an issue for lguest because it does the mmap()
> before even starting the guest. However, in the case of kvm,
> the guest allocates the ring and informs the host of its
> addresss. If we then mmap() it, we cause it to be remapped
> to new pages which the vring driver will then use.

Yeah, ugly, huh?  I didn't like it either, but my ioctl aversion forced me to 
consider mmap.  Then my aversion to rewriting the mmap code lead me to 
this "simply replace the user pages" hack :)

Fix the couple of minor issues reported here and I'll happily apply it.

BTW, including base and limit makes sense.  I didn't do this to start since 
Avi doesn't want to guarentee kvm's guest mapping linear (of course he's only 
doing this so he can fix it later).

Cheers,
Rusty.

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH 1/5] vring: Replace mmap() interface with ioctl()
  2008-06-14  9:02   ` Rusty Russell
@ 2008-06-14 14:20     ` Avi Kivity
  2008-06-14 23:23       ` Anthony Liguori
  0 siblings, 1 reply; 23+ messages in thread
From: Avi Kivity @ 2008-06-14 14:20 UTC (permalink / raw)
  To: Rusty Russell; +Cc: Mark McLoughlin, Anthony Liguori, kvm

Rusty Russell wrote:
> BTW, including base and limit makes sense.  I didn't do this to start since 
> Avi doesn't want to guarentee kvm's guest mapping linear (of course he's only 
> doing this so he can fix it later).
>   

And also, because memory hotplug and 64-bit PCI BARs require reserving 
an infinite virtual address space range.  Not to mention that someone 
needs to update the dirty bitmap in case we're live migrating.

-- 
I have a truly marvellous patch that fixes the bug which this
signature is too narrow to contain.


^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH 1/5] vring: Replace mmap() interface with ioctl()
  2008-06-14 14:20     ` Avi Kivity
@ 2008-06-14 23:23       ` Anthony Liguori
  2008-06-15 15:24         ` Avi Kivity
  0 siblings, 1 reply; 23+ messages in thread
From: Anthony Liguori @ 2008-06-14 23:23 UTC (permalink / raw)
  To: Avi Kivity; +Cc: Rusty Russell, Mark McLoughlin, kvm

Avi Kivity wrote:
> Rusty Russell wrote:
>> BTW, including base and limit makes sense.  I didn't do this to start 
>> since Avi doesn't want to guarentee kvm's guest mapping linear (of 
>> course he's only doing this so he can fix it later).
>>   
>
> And also, because memory hotplug and 64-bit PCI BARs require reserving 
> an infinite virtual address space range.  Not to mention that someone 
> needs to update the dirty bitmap in case we're live migrating.

You can certainly hotplug to the next RAM address so it doesn't require 
infinite space.  You wouldn't send a packet from/to a PCI IO region so I 
don't think that practically speaking that's a problem.

But the base/limit thing is a bit of a hack.  I don't think the 
userspace translation is going to be a problem.

Regards,

Anthony Liguori


^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH 1/5] vring: Replace mmap() interface with ioctl()
  2008-06-14 23:23       ` Anthony Liguori
@ 2008-06-15 15:24         ` Avi Kivity
  2008-06-15 19:13           ` Anthony Liguori
  0 siblings, 1 reply; 23+ messages in thread
From: Avi Kivity @ 2008-06-15 15:24 UTC (permalink / raw)
  To: Anthony Liguori; +Cc: Rusty Russell, Mark McLoughlin, kvm

Anthony Liguori wrote:
>>
>> And also, because memory hotplug and 64-bit PCI BARs require 
>> reserving an infinite virtual address space range.  Not to mention 
>> that someone needs to update the dirty bitmap in case we're live 
>> migrating.
>
> You can certainly hotplug to the next RAM address so it doesn't 
> require infinite space.  

But you need to reserve that space to prevent mallocs from going there.  
How much space do you reserve?

> You wouldn't send a packet from/to a PCI IO region so I don't think 
> that practically speaking that's a problem.

If we implement interguest shared memory as a pci device, then it 
becomes a problem.

-- 
I have a truly marvellous patch that fixes the bug which this
signature is too narrow to contain.


^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH 1/5] vring: Replace mmap() interface with ioctl()
  2008-06-15 15:24         ` Avi Kivity
@ 2008-06-15 19:13           ` Anthony Liguori
  0 siblings, 0 replies; 23+ messages in thread
From: Anthony Liguori @ 2008-06-15 19:13 UTC (permalink / raw)
  To: Avi Kivity; +Cc: Rusty Russell, Mark McLoughlin, kvm

Avi Kivity wrote:
> Anthony Liguori wrote:
>>>
>>> And also, because memory hotplug and 64-bit PCI BARs require 
>>> reserving an infinite virtual address space range.  Not to mention 
>>> that someone needs to update the dirty bitmap in case we're live 
>>> migrating.
>>
>> You can certainly hotplug to the next RAM address so it doesn't 
>> require infinite space.  
>
> But you need to reserve that space to prevent mallocs from going 
> there.  How much space do you reserve?

It's deterministic at least.  Any physical system has a maximum amount 
of physical memory that it supports so hotplug cannot exceed that amount.

>> You wouldn't send a packet from/to a PCI IO region so I don't think 
>> that practically speaking that's a problem.
>
> If we implement interguest shared memory as a pci device, then it 
> becomes a problem.

We can set an explicit BAR to keep the physical address reasonable.

I'm not arguing that we should enforce base/limit, just that it is 
possible.  I think the burden is to prove that enforcing base/limit 
provides a significant performance boost to warrant the complexity.

The nastiest bit of manipulating the ring in-kernel is that it requires 
TX mitigation to be within the kernel.  This means that adjusting the 
heuristics for adaptive TX mitigation will require host kernel 
modifications which stinks IMHO.

Regards,

Anthony Liguori

^ permalink raw reply	[flat|nested] 23+ messages in thread

end of thread, other threads:[~2008-06-18 14:06 UTC | newest]

Thread overview: 23+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2008-06-13 13:57 [PATCH 0/0][RFC] KVM use of vringfd Mark McLoughlin
2008-06-13 13:57 ` [PATCH 1/5] vring: Replace mmap() interface with ioctl() Mark McLoughlin
2008-06-13 13:57   ` [PATCH 2/5] lguest: Use VRINGSETINFO ioctl() instead of mmap() Mark McLoughlin
2008-06-13 13:57     ` [PATCH 3/5] kvm: qemu: Publish last_avail index in the ring Mark McLoughlin
2008-06-13 13:58       ` [PATCH 4/5] kvm: qemu: Use vringfd to eliminate copies Mark McLoughlin
2008-06-13 13:58         ` [PATCH 5/5] kvm: qemu: Add support for partial csums and GSO Mark McLoughlin
2008-06-14 23:28         ` [PATCH 4/5] kvm: qemu: Use vringfd to eliminate copies Anthony Liguori
2008-06-16  2:10           ` Rusty Russell
2008-06-16 14:02             ` Anthony Liguori
2008-06-16 14:58               ` Avi Kivity
2008-06-18  5:43               ` Rusty Russell
2008-06-18 14:01             ` Avi Kivity
2008-06-17 14:08           ` Mark McLoughlin
2008-06-17 14:54             ` Anthony Liguori
2008-06-17 15:45               ` Mark McLoughlin
2008-06-13 14:09   ` [PATCH 1/5] vring: Replace mmap() interface with ioctl() Avi Kivity
2008-06-17 12:19     ` Mark McLoughlin
2008-06-18 14:05       ` Avi Kivity
2008-06-14  9:02   ` Rusty Russell
2008-06-14 14:20     ` Avi Kivity
2008-06-14 23:23       ` Anthony Liguori
2008-06-15 15:24         ` Avi Kivity
2008-06-15 19:13           ` Anthony Liguori

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox