public inbox for kvm@vger.kernel.org
* [PATCH RFC 1/3] virtio infrastructure
@ 2007-05-31 12:19 Rusty Russell
  2007-06-01 23:35 ` Santos, Jose Renato G
       [not found] ` <1180613947.11133.58.camel-bi+AKbBUZKY6gyzm1THtWbp2dZbC/Bob@public.gmane.org>
  0 siblings, 2 replies; 41+ messages in thread
From: Rusty Russell @ 2007-05-31 12:19 UTC (permalink / raw)
  To: kvm-devel, Xen Mailing List, virtualization
  Cc: Jimi Xenidis, Stephen Rothwell,
	jmk-zzFmDc4TPjtKvsKVC3L/VUEOCMrvLtNR@public.gmane.org, Herbert Xu,
	Christian Borntraeger, Suzanne McIntosh, Martin Schwidefsky

This attempts to implement a "virtual I/O" layer which should allow
common drivers to be efficiently used across most virtual I/O
mechanisms.  It will no doubt need further enhancement.

The details of probing the device are left to hypervisor-specific
code: it simply constructs the "struct virtio_device" and hands it to
the probe function (eg. virtnet_probe() or virtblk_probe()).

The virtio drivers add and detach input and output buffers; as the
buffers are used up their associated "used" pointers are filled in.

I have written two virtio device drivers (net and block) and two
virtio implementations (for lguest): a read-write socket-style
implementation, and a more efficient descriptor-based implementation.

Signed-off-by: Rusty Russell <rusty-8n+1lVoiYb80n/F98K4Iww@public.gmane.org>
---
 include/linux/virtio.h |   66 ++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 66 insertions(+)

diff -r 2db2135723b0 include/linux/virtio.h
--- /dev/null	Thu Jan 01 00:00:00 1970 +0000
+++ b/include/linux/virtio.h	Thu May 31 17:52:19 2007 +1000
@@ -0,0 +1,66 @@
+#ifndef _LINUX_VIRTIO_H
+#define _LINUX_VIRTIO_H
+#include <linux/types.h>
+#include <linux/scatterlist.h>
+
+/**
+ * virtio_device - description and routines to drive a virtual device.
+ * @dev: the underlying struct device.
+ * @ops: the operations for this virtual device.
+ */
+struct virtio_device {
+	struct device *dev;
+	struct virtio_ops *ops;
+};
+
+/**
+ * virtio_ops - virtio abstraction layer
+ * @add_outbuf: prepare to send data to the other end:
+ *	vdev: the virtio_device
+ *	sg: the description of the buffer(s).
+ *	num: the size of the sg array.
+ *	used: the length sent (set once sending is done).
+ *      Returns an identifier or an error.
+ * @add_inbuf: prepare to receive data from the other end:
+ *	vdev: the virtio_device
+ *	sg: the description of the buffer(s).
+ *	num: the size of the sg array.
+ *	used: the length received (set once data is received).
+ *      Returns an identifier or an error (eg. -ENOSPC).
+ * @sync: update after add_inbuf/add_outbuf
+ *	vdev: the virtio_device we're talking about.
+ *	Use the virtio_sync() wrapper to avoid unnecessary calls.
+ * @detach_outbuf: make sure sent sg can no longer be read.
+ *	vdev: the virtio_device we're talking about.
+ *	id: the identifier returned from add_outbuf.
+ * @detach_inbuf: make sure the sg can no longer be written to.
+ *	vdev: the virtio_device we're talking about.
+ *	id: the identifier returned from add_inbuf.
+ */
+struct virtio_ops {
+	unsigned long (*add_outbuf)(struct virtio_device *vdev,
+				    const struct scatterlist sg[],
+				    unsigned int num,
+				    unsigned long *used);
+
+	unsigned long (*add_inbuf)(struct virtio_device *vdev,
+				   struct scatterlist sg[],
+				   unsigned int num,
+				   unsigned long *used);
+
+	void (*sync)(struct virtio_device *vdev);
+
+	void (*detach_outbuf)(struct virtio_device *vdev, unsigned long id);
+	void (*detach_inbuf)(struct virtio_device *vdev, unsigned long id);
+};
+
+/**
+ * virtio_sync - start sending/receiving data from the other end.
+ * @vdev: the virtio_device we're talking about.
+ */
+static inline void virtio_sync(struct virtio_device *vdev)
+{
+	if (vdev->ops->sync)
+		vdev->ops->sync(vdev);
+}
+#endif /* _LINUX_VIRTIO_H */



-------------------------------------------------------------------------
This SF.net email is sponsored by DB2 Express
Download DB2 Express C - the FREE version of DB2 express and take
control of your XML. No limits. Just data. Click to get it now.
http://sourceforge.net/powerbar/db2/


* [PATCH RFC 2/3] virtio infrastructure: example net driver
       [not found] ` <1180613947.11133.58.camel-bi+AKbBUZKY6gyzm1THtWbp2dZbC/Bob@public.gmane.org>
@ 2007-05-31 12:20   ` Rusty Russell
       [not found]     ` <1180614044.11133.61.camel-bi+AKbBUZKY6gyzm1THtWbp2dZbC/Bob@public.gmane.org>
  2007-05-31 12:53   ` [PATCH RFC 1/3] virtio infrastructure Carsten Otte
                     ` (2 subsequent siblings)
  3 siblings, 1 reply; 41+ messages in thread
From: Rusty Russell @ 2007-05-31 12:20 UTC (permalink / raw)
  To: kvm-devel, Xen Mailing List, virtualization
  Cc: Jimi Xenidis, Stephen Rothwell,
	jmk-zzFmDc4TPjtKvsKVC3L/VUEOCMrvLtNR@public.gmane.org, Herbert Xu,
	Christian Borntraeger, Suzanne McIntosh, Martin Schwidefsky

Example net driver using virtio

TODO:
	1) Locking (see #2).
	2) NAPI.
	3) GSO.
	4) Checksum options.

Signed-off-by: Rusty Russell <rusty-8n+1lVoiYb80n/F98K4Iww@public.gmane.org>
---
 drivers/net/Makefile       |    2
 drivers/net/virtio_net.c   |  236 ++++++++++++++++++++++++++++++++++++++++++++
 include/linux/virtio_net.h |   13 ++
 3 files changed, 250 insertions(+), 1 deletion(-)

diff -r 96df1769fce9 drivers/net/Makefile
--- a/drivers/net/Makefile	Thu May 31 17:52:38 2007 +1000
+++ b/drivers/net/Makefile	Thu May 31 17:52:40 2007 +1000
@@ -37,7 +37,7 @@ obj-$(CONFIG_CASSINI) += cassini.o
 
 obj-$(CONFIG_MACE) += mace.o
 obj-$(CONFIG_BMAC) += bmac.o
-
+obj-y += virtio_net.o
 obj-$(CONFIG_DGRS) += dgrs.o
 obj-$(CONFIG_VORTEX) += 3c59x.o
 obj-$(CONFIG_TYPHOON) += typhoon.o
diff -r 96df1769fce9 drivers/net/virtio_net.c
--- /dev/null	Thu Jan 01 00:00:00 1970 +0000
+++ b/drivers/net/virtio_net.c	Thu May 31 17:53:54 2007 +1000
@@ -0,0 +1,236 @@
+/* A simple network driver using virtio.
+ *
+ * Copyright 2007 Rusty Russell <rusty-8n+1lVoiYb80n/F98K4Iww@public.gmane.org> IBM Corporation
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write to the Free Software
+ * Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA  02111-1307  USA
+ */
+//#define DEBUG
+#include <linux/netdevice.h>
+#include <linux/etherdevice.h>
+#include <linux/module.h>
+#include <linux/virtio.h>
+#include <linux/scatterlist.h>
+
+#define NET_BUFS 128
+/* FIXME: Make dynamic */
+#define MAX_PACKET_LEN (ETH_HLEN+ETH_DATA_LEN)
+
+struct virtnet_info
+{
+	struct virtio_device *vdev;
+	struct net_device *ndev;
+
+	/* Receive queue. */
+	struct sk_buff *in[NET_BUFS];
+	/* Send queue. */
+	struct sk_buff *out[NET_BUFS];
+
+	/* Lengths for input buffers as they are used. */
+	unsigned long in_used[NET_BUFS];
+
+	/* Lengths for output buffers as they are used. */
+	unsigned long out_used[NET_BUFS];
+
+	/* IDs for buffers. */
+	long in_ids[NET_BUFS];
+	long out_ids[NET_BUFS];
+};
+
+static int start_xmit(struct sk_buff *skb, struct net_device *dev)
+{
+	struct virtnet_info *vi = netdev_priv(dev);
+	unsigned int n, i;
+	/* FIXME: What *is* the max here? */
+	struct scatterlist sg[MAX_SKB_FRAGS + 2];
+	const unsigned char *dest = ((struct ethhdr *)skb->data)->h_dest;
+
+	pr_debug("%s: xmit %02x:%02x:%02x:%02x:%02x:%02x\n",
+		 dev->name, dest[0],dest[1],dest[2],dest[3],dest[4],dest[5]);
+
+	/* Find a free output slot, reclaiming a completed buffer on the way. */
+	for (i = 0; i < NET_BUFS; i++) {
+		if (!vi->out[i])
+			break;
+
+		if (!vi->out_used[i])
+			continue;
+
+		vi->vdev->ops->detach_outbuf(vi->vdev, vi->out_ids[i]);
+		dev->stats.tx_bytes += vi->out_used[i];
+		dev->stats.tx_packets++;
+		dev_kfree_skb(vi->out[i]);
+		vi->out[i] = NULL;
+		break;
+	}
+	if (unlikely(i == NET_BUFS)) {
+		pr_debug("%s: ring full\n", dev->name);
+		goto stop;
+	}
+
+	n = skb_to_sgvec(skb, sg, 0, skb->len);
+	vi->out_used[i] = 0;
+	vi->out_ids[i] = vi->vdev->ops->add_outbuf(vi->vdev, sg, n,
+						   &vi->out_used[i]);
+	if (IS_ERR_VALUE(vi->out_ids[i])) {
+		pr_debug("%s: virtio not prepared to send\n", dev->name);
+		goto stop;
+	}
+	vi->out[i] = skb;
+	virtio_sync(vi->vdev);
+	return 0;
+stop:
+	netif_stop_queue(dev);
+	return NETDEV_TX_BUSY;
+}
+
+static void receive_skb(struct net_device *dev, struct sk_buff *skb,
+			unsigned len)
+{
+	if (unlikely(len < ETH_HLEN)) {
+		pr_debug("%s: short packet %i\n", dev->name, len);
+		dev->stats.rx_length_errors++;
+		dev_kfree_skb(skb);
+		return;
+	}
+	BUG_ON(len > MAX_PACKET_LEN);
+
+	skb_trim(skb, len);
+	skb->protocol = eth_type_trans(skb, dev);
+	pr_debug("Receiving skb proto 0x%04x len %i type %i\n",
+		 ntohs(skb->protocol), skb->len, skb->pkt_type);
+	dev->stats.rx_bytes += skb->len;
+	dev->stats.rx_packets++;
+	netif_rx(skb);
+}
+
+static void try_fill_recv(struct virtnet_info *vi, unsigned int i)
+{
+	/* FIXME: What *is* the max here? */
+	struct scatterlist sg[MAX_SKB_FRAGS + 1];
+	unsigned int n;
+
+	vi->in[i] = netdev_alloc_skb(vi->ndev, MAX_PACKET_LEN);
+	if (unlikely(!vi->in[i]))
+		return;
+
+	skb_put(vi->in[i], MAX_PACKET_LEN);
+	n = skb_to_sgvec(vi->in[i], sg, 0, vi->in[i]->len);
+	vi->in_used[i] = 0;
+	vi->in_ids[i] = vi->vdev->ops->add_inbuf(vi->vdev, sg, n,
+						 &vi->in_used[i]);
+	if (IS_ERR_VALUE(vi->in_ids[i])) {
+		kfree_skb(vi->in[i]);
+		vi->in[i] = NULL;
+	}
+}
+
+static int virtnet_open(struct net_device *dev)
+{
+	struct virtnet_info *vi = netdev_priv(dev);
+	unsigned int i;
+
+	for (i = 0; i < NET_BUFS; i++)
+		try_fill_recv(vi, i);
+
+	virtio_sync(vi->vdev);
+	return 0;
+}
+
+static int virtnet_close(struct net_device *dev)
+{
+	struct virtnet_info *vi = netdev_priv(dev);
+	unsigned int i;
+
+	for (i = 0; i < NET_BUFS; i++) {
+		if (vi->in[i]) {
+			vi->vdev->ops->detach_inbuf(vi->vdev, vi->in_ids[i]);
+			kfree_skb(vi->in[i]);
+			vi->in[i] = NULL;
+		}
+		if (vi->out[i]) {
+			vi->vdev->ops->detach_outbuf(vi->vdev, vi->out_ids[i]);
+			kfree_skb(vi->out[i]);
+			vi->out[i] = NULL;
+		}
+	}
+	return 0;
+}
+
+struct net_device *virtnet_probe(struct virtio_device *vdev,
+				 const u8 mac[ETH_ALEN])
+{
+	int err;
+	struct net_device *dev;
+	struct virtnet_info *vi;
+
+	dev = alloc_etherdev(sizeof(struct virtnet_info));
+	if (!dev)
+		return ERR_PTR(-ENOMEM);
+
+	SET_MODULE_OWNER(dev);
+
+	ether_setup(dev);
+	memcpy(dev->dev_addr, mac, ETH_ALEN);
+	dev->open = virtnet_open;
+	dev->stop = virtnet_close;
+	dev->hard_start_xmit = start_xmit;
+	SET_NETDEV_DEV(dev, vdev->dev);
+
+	vi = netdev_priv(dev);
+	vi->vdev = vdev;
+	vi->ndev = dev;
+
+	err = register_netdev(dev);
+	if (err) {
+		pr_debug("virtio_net: registering device failed\n");
+		goto free;
+	}
+	pr_debug("virtnet: registered device %s\n", dev->name);
+	return dev;
+
+free:
+	free_netdev(dev);
+	return ERR_PTR(err);
+}
+EXPORT_SYMBOL_GPL(virtnet_probe);
+
+/* We get this when the other side sends buffers, or consumes them. */
+int virtnet_interrupt(struct net_device *dev)
+{
+	struct virtnet_info *vi = netdev_priv(dev);
+	unsigned int i;
+
+	for (i = 0; i < NET_BUFS; i++) {
+		if (vi->in[i] && vi->in_used[i]) {
+			vi->vdev->ops->detach_inbuf(vi->vdev, vi->in_ids[i]);
+			receive_skb(dev, vi->in[i], vi->in_used[i]);
+			try_fill_recv(vi, i);
+		}
+	}
+
+	netif_wake_queue(dev);
+	return IRQ_HANDLED;
+}
+EXPORT_SYMBOL_GPL(virtnet_interrupt);
+
+void virtnet_remove(struct net_device *dev)
+{
+	unregister_netdev(dev);
+	free_netdev(dev);
+}
+EXPORT_SYMBOL_GPL(virtnet_remove);
+
+MODULE_DESCRIPTION("Virtio network driver");
+MODULE_LICENSE("GPL");
diff -r 96df1769fce9 include/linux/virtio_net.h
--- /dev/null	Thu Jan 01 00:00:00 1970 +0000
+++ b/include/linux/virtio_net.h	Thu May 31 17:52:40 2007 +1000
@@ -0,0 +1,13 @@
+#ifndef _LINUX_VIRTIO_NET_H
+#define _LINUX_VIRTIO_NET_H
+#include <linux/types.h>
+#include <linux/etherdevice.h>
+struct net_device;
+struct virtio_device;
+
+struct net_device *virtnet_probe(struct virtio_device *vdev,
+				 const u8 mac[ETH_ALEN]);
+int virtnet_interrupt(struct net_device *dev);
+void virtnet_remove(struct net_device *dev);
+
+#endif /* _LINUX_VIRTIO_NET_H */




* [PATCH RFC 3/3] virtio infrastructure: example block driver
       [not found]     ` <1180614044.11133.61.camel-bi+AKbBUZKY6gyzm1THtWbp2dZbC/Bob@public.gmane.org>
@ 2007-05-31 12:21       ` Rusty Russell
       [not found]         ` <1180614091.11133.63.camel-bi+AKbBUZKY6gyzm1THtWbp2dZbC/Bob@public.gmane.org>
  0 siblings, 1 reply; 41+ messages in thread
From: Rusty Russell @ 2007-05-31 12:21 UTC (permalink / raw)
  To: kvm-devel
  Cc: Jimi Xenidis, Stephen Rothwell, Xen Mailing List,
	jmk-zzFmDc4TPjtKvsKVC3L/VUEOCMrvLtNR@public.gmane.org, Herbert Xu,
	Christian Borntraeger, virtualization, Suzanne McIntosh,
	Martin Schwidefsky

Example block driver using virtio.

The block driver uses outbufs with sg[0] being the request information
(struct virtio_blk_outhdr) with the type, sector and inbuf id.  For a
write, the rest of the sg will contain the data to be written.

The first segment of the inbuf is a result code (struct
virtio_blk_inhdr).  For a read, the rest of the sg points to the input
buffer.

TODO:
	1) Ordered tag support.

Signed-off-by: Rusty Russell <rusty-8n+1lVoiYb80n/F98K4Iww@public.gmane.org>
---
 drivers/block/Makefile     |    1
 drivers/block/virtio_blk.c |  270 ++++++++++++++++++++++++++++++++++++++++++++
 include/linux/virtio_blk.h |   29 ++++
 3 files changed, 300 insertions(+)

diff -r 8f6c1b0efb6a drivers/block/Makefile
--- a/drivers/block/Makefile	Thu May 31 17:54:08 2007 +1000
+++ b/drivers/block/Makefile	Thu May 31 17:54:13 2007 +1000
@@ -20,6 +20,7 @@ obj-$(CONFIG_BLK_CPQ_CISS_DA)  += cciss.
 obj-$(CONFIG_BLK_CPQ_CISS_DA)  += cciss.o
 obj-$(CONFIG_BLK_DEV_DAC960)	+= DAC960.o
 obj-$(CONFIG_CDROM_PKTCDVD)	+= pktcdvd.o
+obj-y				+= virtio_blk.o
 
 obj-$(CONFIG_BLK_DEV_UMEM)	+= umem.o
 obj-$(CONFIG_BLK_DEV_NBD)	+= nbd.o
diff -r 8f6c1b0efb6a drivers/block/virtio_blk.c
--- /dev/null	Thu Jan 01 00:00:00 1970 +0000
+++ b/drivers/block/virtio_blk.c	Thu May 31 17:55:12 2007 +1000
@@ -0,0 +1,270 @@
+#include <linux/spinlock.h>
+#include <linux/blkdev.h>
+#include <linux/hdreg.h>
+#include <linux/virtio.h>
+#include <linux/virtio_blk.h>
+
+static unsigned char virtblk_index = 'a';
+struct virtio_blk
+{
+	struct virtio_device *vdev;
+
+	spinlock_t lock;
+
+	/* The disk structure for the kernel. */
+	struct gendisk *disk;
+
+	/* Request tracking (by inbuf id). */
+	struct list_head reqs;
+
+	mempool_t *pool;
+
+	/* Scatterlist: can be too big for stack. */
+	struct scatterlist sg[1+MAX_PHYS_SEGMENTS];
+};
+
+struct virtblk_req
+{
+	struct list_head list;
+	struct request *req;
+	unsigned long out_id;
+	struct virtio_blk_outhdr out_hdr;
+	struct virtio_blk_inhdr in_hdr;
+	unsigned long in_used;
+};
+
+/* Jens gave me this nice helper to end all chunks of a request. */
+static void end_entire_request(struct request *req, int uptodate)
+{
+	if (end_that_request_first(req, uptodate, req->hard_nr_sectors))
+		BUG();
+	add_disk_randomness(req->rq_disk);
+	end_that_request_last(req, uptodate);
+}
+
+static bool do_write(request_queue_t *q, struct virtio_blk *vblk,
+		     struct virtblk_req *vbr)
+{
+	unsigned long num;
+
+	/* Set up for reply. */
+	vblk->sg[0].page = virt_to_page(&vbr->in_hdr);
+	vblk->sg[0].offset = offset_in_page(&vbr->in_hdr);
+	vblk->sg[0].length = sizeof(vbr->in_hdr);
+	vbr->out_hdr.id = vblk->vdev->ops->add_inbuf(vblk->vdev, vblk->sg, 1,
+						     &vbr->in_used);
+	if (IS_ERR_VALUE(vbr->out_hdr.id))
+		goto full;
+
+	/* First sg element points to output header. */
+	vblk->sg[0].page = virt_to_page(&vbr->out_hdr);
+	vblk->sg[0].offset = offset_in_page(&vbr->out_hdr);
+	vblk->sg[0].length = sizeof(vbr->out_hdr);
+
+	num = blk_rq_map_sg(q, vbr->req, vblk->sg+1);
+	vbr->out_id = vblk->vdev->ops->add_outbuf(vblk->vdev, vblk->sg, 1+num,
+						  NULL);
+	if (IS_ERR_VALUE(vbr->out_id))
+		goto detach_inbuf_full;
+
+	list_add_tail(&vbr->list, &vblk->reqs);
+	return true;
+
+detach_inbuf_full:
+	vblk->vdev->ops->detach_inbuf(vblk->vdev, vbr->out_hdr.id);
+full:
+	return false;
+}
+
+static bool do_read(request_queue_t *q, struct virtio_blk *vblk,
+		    struct virtblk_req *vbr)
+{
+	unsigned long num;
+
+	/* Set up for reply. */
+	vblk->sg[0].page = virt_to_page(&vbr->in_hdr);
+	vblk->sg[0].offset = offset_in_page(&vbr->in_hdr);
+	vblk->sg[0].length = sizeof(vbr->in_hdr);
+	num = blk_rq_map_sg(q, vbr->req, vblk->sg+1);
+	vbr->out_hdr.id = vblk->vdev->ops->add_inbuf(vblk->vdev, vblk->sg,
+						     1+num, &vbr->in_used);
+	if (IS_ERR_VALUE(vbr->out_hdr.id))
+		goto full;
+
+	vblk->sg[0].page = virt_to_page(&vbr->out_hdr);
+	vblk->sg[0].offset = offset_in_page(&vbr->out_hdr);
+	vblk->sg[0].length = sizeof(vbr->out_hdr);
+
+	vbr->out_id = vblk->vdev->ops->add_outbuf(vblk->vdev, vblk->sg, 1,
+						  NULL);
+	if (IS_ERR_VALUE(vbr->out_id))
+		goto detach_inbuf_full;
+
+	pr_debug("%s: read at offset %lu %lu->%lu\n",
+		 vblk->disk->disk_name,
+		 vbr->out_hdr.sector*512, vbr->out_id, vbr->out_hdr.id);
+	list_add_tail(&vbr->list, &vblk->reqs);
+	return true;
+
+detach_inbuf_full:
+	vblk->vdev->ops->detach_inbuf(vblk->vdev, vbr->out_hdr.id);
+full:
+	return false;
+}
+
+static void do_virtblk_request(request_queue_t *q)
+{
+	struct virtio_blk *vblk = NULL;
+	struct request *req;
+	struct virtblk_req *vbr;
+
+	while ((req = elv_next_request(q)) != NULL) {
+		vblk = req->rq_disk->private_data;
+
+		/* FIXME: handle these iff capable. */
+		if (!blk_fs_request(req)) {
+			pr_debug("Got non-command 0x%08x\n", req->cmd_type);
+			req->errors++;
+			end_entire_request(req, 0);
+			continue;
+		}
+
+		vbr = mempool_alloc(vblk->pool, GFP_ATOMIC);
+		if (!vbr)
+			goto stop;
+
+		BUG_ON(req->nr_phys_segments > ARRAY_SIZE(vblk->sg));
+		vbr->req = req;
+		vbr->out_hdr.type = rq_data_dir(req);
+		vbr->out_hdr.sector = req->sector;
+		vbr->in_used = 0;
+
+		if (rq_data_dir(req) == WRITE) {
+			if (!do_write(q, vblk, vbr))
+				goto stop;
+		} else {
+			if (!do_read(q, vblk, vbr))
+				goto stop;
+		}
+		blkdev_dequeue_request(req);
+	}
+
+sync:
+	if (vblk)
+		virtio_sync(vblk->vdev);
+	return;
+
+stop:
+	/* Queue full?  Wait. */
+	blk_stop_queue(q);
+	mempool_free(vbr, vblk->pool);
+	goto sync;
+}
+
+int virtblk_interrupt(struct gendisk *disk)
+{
+	struct virtio_blk *vblk = disk->private_data;
+	struct virtblk_req *i, *next;
+	unsigned long flags;
+
+	spin_lock_irqsave(&vblk->lock, flags);
+	list_for_each_entry_safe(i, next, &vblk->reqs, list) {
+		/* Reply hasn't come back? */
+		if (!i->in_used)
+			continue;
+		/* Make sure other side can no longer read/write */
+		vblk->vdev->ops->detach_outbuf(vblk->vdev, i->out_id);
+		vblk->vdev->ops->detach_inbuf(vblk->vdev, i->out_hdr.id);
+		pr_debug("%s: finished request %lu %s\n", disk->disk_name,
+			  i->out_id, i->in_hdr.status == 1 ? "OK" : "FAILED");
+		/* Finish and free request */
+		end_entire_request(i->req, i->in_hdr.status == 1);
+		list_del(&i->list);
+		mempool_free(i, vblk->pool);
+	}
+	blk_start_queue(disk->queue);
+	spin_unlock_irqrestore(&vblk->lock, flags);
+	return IRQ_HANDLED;
+}
+EXPORT_SYMBOL_GPL(virtblk_interrupt);
+
+static struct block_device_operations virtblk_fops = {
+	.owner = THIS_MODULE,
+};
+
+struct gendisk *virtblk_probe(struct virtio_device *vdev)
+{
+	struct virtio_blk *vblk;
+	int err, major;
+
+	vblk = kmalloc(sizeof(*vblk), GFP_KERNEL);
+	if (!vblk) {
+		err = -ENOMEM;
+		goto out;
+	}
+
+	spin_lock_init(&vblk->lock);
+	INIT_LIST_HEAD(&vblk->reqs);
+	vblk->vdev = vdev;
+
+	vblk->pool = mempool_create_kmalloc_pool(1, sizeof(struct virtblk_req));
+	if (!vblk->pool) {
+		err = -ENOMEM;
+		goto out_free_vblk;
+	}
+
+	major = register_blkdev(0, "virtblk");
+	if (major < 0) {
+		err = major;
+		goto out_mempool;
+	}
+
+	/* FIXME: How many partitions?  How long is a piece of string? */
+	vblk->disk = alloc_disk(1 << 3);
+	if (!vblk->disk) {
+		err = -ENOMEM;
+		goto out_unregister_blkdev;
+	}
+
+	vblk->disk->queue = blk_init_queue(do_virtblk_request, &vblk->lock);
+	if (!vblk->disk->queue) {
+		err = -ENOMEM;
+		goto out_put_disk;
+	}
+
+	sprintf(vblk->disk->disk_name, "vb%c", virtblk_index++);
+	vblk->disk->major = major;
+	vblk->disk->first_minor = 0;
+	vblk->disk->private_data = vblk;
+	vblk->disk->fops = &virtblk_fops;
+
+	/* Caller can do blk_queue_max_hw_segments(), set_capacity()
+	 * etc then add_disk(). */
+	return vblk->disk;
+
+out_put_disk:
+	put_disk(vblk->disk);
+out_unregister_blkdev:
+	unregister_blkdev(major, "virtblk");
+out_mempool:
+	mempool_destroy(vblk->pool);
+out_free_vblk:
+	kfree(vblk);
+out:
+	return ERR_PTR(err);
+}
+EXPORT_SYMBOL_GPL(virtblk_probe);
+
+void virtblk_remove(struct gendisk *disk)
+{
+	struct virtio_blk *vblk = disk->private_data;
+	int major = vblk->disk->major;
+
+	BUG_ON(!list_empty(&vblk->reqs));
+	blk_cleanup_queue(vblk->disk->queue);
+	put_disk(vblk->disk);
+	unregister_blkdev(major, "virtblk");
+	mempool_destroy(vblk->pool);
+	kfree(vblk);
+}
+EXPORT_SYMBOL_GPL(virtblk_remove);
diff -r 8f6c1b0efb6a include/linux/virtio_blk.h
--- /dev/null	Thu Jan 01 00:00:00 1970 +0000
+++ b/include/linux/virtio_blk.h	Thu May 31 17:54:13 2007 +1000
@@ -0,0 +1,29 @@
+#ifndef _LINUX_VIRTIO_BLK_H
+#define _LINUX_VIRTIO_BLK_H
+#include <linux/types.h>
+struct gendisk;
+struct virtio_device;
+struct hd_geometry;
+
+/* This is the first element of the scatter-gather list. */
+struct virtio_blk_outhdr
+{
+	/* 0 == read, 1 == write */
+	u32 type;
+	/* Sector (ie. 512 byte offset) */
+	unsigned long sector;
+	/* Where to put reply. */
+	unsigned long id;
+};
+
+struct virtio_blk_inhdr
+{
+	/* 1 = OK, 0 = not ok. */
+	unsigned long status;
+};
+
+struct gendisk *virtblk_probe(struct virtio_device *vdev);
+int virtblk_interrupt(struct gendisk *disk);
+void virtblk_remove(struct gendisk *disk);
+
+#endif /* _LINUX_VIRTIO_BLK_H */




* Re: [PATCH RFC 1/3] virtio infrastructure
       [not found] ` <1180613947.11133.58.camel-bi+AKbBUZKY6gyzm1THtWbp2dZbC/Bob@public.gmane.org>
  2007-05-31 12:20   ` [PATCH RFC 2/3] virtio infrastructure: example net driver Rusty Russell
@ 2007-05-31 12:53   ` Carsten Otte
  2007-05-31 13:01   ` Dor Laor
  2007-06-02  6:30   ` Avi Kivity
  3 siblings, 0 replies; 41+ messages in thread
From: Carsten Otte @ 2007-05-31 12:53 UTC (permalink / raw)
  To: Rusty Russell
  Cc: Jimi Xenidis, Stephen Rothwell, Xen Mailing List,
	jmk-zzFmDc4TPjtKvsKVC3L/VUEOCMrvLtNR@public.gmane.org, Herbert Xu,
	kvm-devel, mschwid2-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8,
	virtualization, Christian Borntraeger, Suzanne McIntosh

Rusty Russell wrote:
> This attempts to implement a "virtual I/O" layer which should allow
> common drivers to be efficiently used across most virtual I/O
> mechanisms.  It will no doubt need further enhancement.
> 
> The details of probing the device are left to hypervisor-specific
> code: it simply constructs the "struct virtio_device" and hands it to
> the probe function (eg. virtnet_probe() or virtblk_probe()).
> 
> The virtio drivers add and detach input and output buffers; as the
> buffers are used up their associated "used" pointers are filled in.
> 
> I have written two virtio device drivers (net and block) and two
> virtio implementations (for lguest): a read-write socket-style
> implementation, and a more efficient descriptor-based implementation.
These should work for s390 afaics. They seem to fit the requirements 
of network IO.


* Re: [PATCH RFC 3/3] virtio infrastructure: example block driver
       [not found]         ` <1180614091.11133.63.camel-bi+AKbBUZKY6gyzm1THtWbp2dZbC/Bob@public.gmane.org>
@ 2007-05-31 12:57           ` Carsten Otte
       [not found]             ` <465EC637.7020504-tA70FqPdS9bQT0dZR+AlfA@public.gmane.org>
  0 siblings, 1 reply; 41+ messages in thread
From: Carsten Otte @ 2007-05-31 12:57 UTC (permalink / raw)
  To: Rusty Russell
  Cc: Jimi Xenidis, Stephen Rothwell, Xen Mailing List,
	jmk-zzFmDc4TPjtKvsKVC3L/VUEOCMrvLtNR@public.gmane.org, Herbert Xu,
	kvm-devel, mschwid2-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8,
	virtualization, Christian Borntraeger, Suzanne McIntosh

Rusty Russell wrote:
> Example block driver using virtio.
> 
> The block driver uses outbufs with sg[0] being the request information
> (struct virtio_blk_outhdr) with the type, sector and inbuf id.  For a
> write, the rest of the sg will contain the data to be written.
> 
> The first segment of the inbuf is a result code (struct
> virtio_blk_inhdr).  For a read, the rest of the sg points to the input
> buffer.
> 
> TODO:
> 	1) Ordered tag support.
Implementing a do_request function has quite a few disadvantages over 
hooking into q->make_request_fn. This way, we have the device plug 
(latency), request merging, and I/O scheduling inside the guest.
It seems preferable to do that in the host, especially when requests 
of multiple guests end up on the same physical media (shared access, 
or partitioned).


* Re: [PATCH RFC 1/3] virtio infrastructure
       [not found] ` <1180613947.11133.58.camel-bi+AKbBUZKY6gyzm1THtWbp2dZbC/Bob@public.gmane.org>
  2007-05-31 12:20   ` [PATCH RFC 2/3] virtio infrastructure: example net driver Rusty Russell
  2007-05-31 12:53   ` [PATCH RFC 1/3] virtio infrastructure Carsten Otte
@ 2007-05-31 13:01   ` Dor Laor
  2007-06-02  6:30   ` Avi Kivity
  3 siblings, 0 replies; 41+ messages in thread
From: Dor Laor @ 2007-05-31 13:01 UTC (permalink / raw)
  To: Rusty Russell, kvm-devel, Xen Mailing List, virtualization
  Cc: Jimi Xenidis, Stephen Rothwell,
	jmk-zzFmDc4TPjtKvsKVC3L/VUEOCMrvLtNR, Herbert Xu,
	Christian Borntraeger, Suzanne McIntosh, Martin Schwidefsky

>This attempts to implement a "virtual I/O" layer which should allow
>common drivers to be efficiently used across most virtual I/O
>mechanisms.  It will no doubt need further enhancement.
>
>The details of probing the device are left to hypervisor-specific
>code: it simply constructs the "struct virtio_device" and hands it to
>the probe function (eg. virtnet_probe() or virtblk_probe()).
>
>The virtio drivers add and detach input and output buffers; as the
>buffers are used up their associated "used" pointers are filled in.
>
>I have written two virtio device drivers (net and block) and two
>virtio implementations (for lguest): a read-write socket-style
>implementation, and a more efficient descriptor-based implementation.
>
>Signed-off-by: Rusty Russell <rusty-8n+1lVoiYb80n/F98K4Iww@public.gmane.org>

That's exactly what I was planning to add to KVM/Linux.
All virtual I/O devices should have a common interface and share the core
functionality. Since the Xen PV drivers are already performance-optimized
and feature-rich, we were planning to generalize the hypervisor-specific
backend in order to reuse them.
This is a good step toward such sharing.
Cheers, Dor.


* Re: [PATCH RFC 3/3] virtio infrastructure: example block driver
       [not found]             ` <465EC637.7020504-tA70FqPdS9bQT0dZR+AlfA@public.gmane.org>
@ 2007-05-31 15:02               ` Troy Benjegerdes
       [not found]                 ` <20070531150206.GB6474-na1kE3HDu0idQnJuSAr7PQ@public.gmane.org>
  2007-05-31 23:39               ` Rusty Russell
  1 sibling, 1 reply; 41+ messages in thread
From: Troy Benjegerdes @ 2007-05-31 15:02 UTC (permalink / raw)
  To: carsteno-tA70FqPdS9bQT0dZR+AlfA
  Cc: Jimi Xenidis, Stephen Rothwell, Xen Mailing List,
	jmk-zzFmDc4TPjtKvsKVC3L/VUEOCMrvLtNR@public.gmane.org, Herbert Xu,
	kvm-devel, mschwid2-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8,
	virtualization, Christian Borntraeger, Suzanne McIntosh

On Thu, May 31, 2007 at 02:57:27PM +0200, Carsten Otte wrote:
> Rusty Russell wrote:
> > Example block driver using virtio.
> > 
> > The block driver uses outbufs with sg[0] being the request information
> > (struct virtio_blk_outhdr) with the type, sector and inbuf id.  For a
> > write, the rest of the sg will contain the data to be written.
> > 
> > The first segment of the inbuf is a result code (struct
> > virtio_blk_inhdr).  For a read, the rest of the sg points to the input
> > buffer.
> > 
> > TODO:
> > 	1) Ordered tag support.
> Implementing a do_request function has quite a few disadvantages over 
> hooking into q->make_request_fn. This way, we have the device plug 
> (latency), request merging, and I/O scheduling inside the guest.
> It seems preferable to do that in the host, especially when requests 
> of multiple guests end up on the same physical media (shared access, 
> or partitioned).

This kind of claim needs some benchmark data to document it.

I'll make the counterclaim that you *should* be doing I/O scheduling in
the guest, both to be able to test new I/O schedulers, and to provide
a set of pre-scheduled I/Os so the host has to do less work.

If someone really needs the host to be doing I/O scheduling, and not the
guest, then why are they using virtualization in the first place?


* Re: [PATCH RFC 3/3] virtio infrastructure: example block driver
       [not found]                 ` <20070531150206.GB6474-na1kE3HDu0idQnJuSAr7PQ@public.gmane.org>
@ 2007-05-31 17:01                   ` Carsten Otte
       [not found]                     ` <465EFF72.5070100-tA70FqPdS9bQT0dZR+AlfA@public.gmane.org>
  0 siblings, 1 reply; 41+ messages in thread
From: Carsten Otte @ 2007-05-31 17:01 UTC (permalink / raw)
  To: Troy Benjegerdes
  Cc: Jimi Xenidis, Stephen Rothwell, Xen Mailing List,
	jmk-zzFmDc4TPjtKvsKVC3L/VUEOCMrvLtNR@public.gmane.org, Herbert Xu,
	carsteno-tA70FqPdS9bQT0dZR+AlfA,
	mschwid2-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8, virtualization,
	kvm-devel, Suzanne McIntosh, Christian Borntraeger

Troy Benjegerdes wrote:
> This kind of a claim needs some benchmark data to document it.
We've implemented both for our vdisk driver on s390. At least on our 
platform, merging in the host is preferable because vmenter/vmexit is 
very fast, and we would otherwise merge twice, since we submit the 
result via the io_submit() system call from host userland.
In the end you'd need to measure on every platform this is gonna run 
on, if you want competitive benchmarks. That makes the benchmark 
argument somewhat fuzzy.

> I'll make the counterclaim that you *should* be doing I/O scheduling in
> the guest, both to be able to test new I/O schedulers, and to provide
> a set of pre-scheduled I/Os so the host has to do less work.
What makes merging requests less work if it is done in the host? Isn't 
it really the same code if you run Linux in both host and guest?

> If someone really needs the host to be doing I/O scheduling, and not the
> guest, then why are they using virtualization in the first place?
Just to run multiple operating systems on one box. Funny question...

so long,
Carsten


^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH RFC 3/3] virtio infrastructure: example block driver
       [not found]                     ` <465EFF72.5070100-tA70FqPdS9bQT0dZR+AlfA@public.gmane.org>
@ 2007-05-31 17:36                       ` Avi Kivity
  0 siblings, 0 replies; 41+ messages in thread
From: Avi Kivity @ 2007-05-31 17:36 UTC (permalink / raw)
  To: carsteno-tA70FqPdS9bQT0dZR+AlfA
  Cc: Jimi Xenidis, Stephen Rothwell, Xen Mailing List,
	jmk-zzFmDc4TPjtKvsKVC3L/VUEOCMrvLtNR@public.gmane.org, Herbert Xu,
	kvm-devel, mschwid2-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8,
	virtualization, Suzanne McIntosh, Christian Borntraeger

Carsten Otte wrote:
> Troy Benjegerdes wrote:
>   
>> This kind of a claim needs some benchmark data to document it.
>>     
> We've implemented both for our vdisk driver on 390. At least on our 
> platform, merging in the host is preferable because vmenter/vmexit is 
> very fast and we would merge twice because we submit the result via 
> io_submit() system call from host userland.
>   


If a single hypercall can submit multiple requests, then this may work
well for the newcomer archs too.

-- 
Do not meddle in the internals of kernels, for they are subtle and quick to panic.



^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH RFC 3/3] virtio infrastructure: example block driver
       [not found]             ` <465EC637.7020504-tA70FqPdS9bQT0dZR+AlfA@public.gmane.org>
  2007-05-31 15:02               ` Troy Benjegerdes
@ 2007-05-31 23:39               ` Rusty Russell
       [not found]                 ` <1180654765.10999.6.camel-bi+AKbBUZKY6gyzm1THtWbp2dZbC/Bob@public.gmane.org>
  1 sibling, 1 reply; 41+ messages in thread
From: Rusty Russell @ 2007-05-31 23:39 UTC (permalink / raw)
  To: carsteno-tA70FqPdS9bQT0dZR+AlfA
  Cc: Jimi Xenidis, Stephen Rothwell, Xen Mailing List,
	jmk-zzFmDc4TPjtKvsKVC3L/VUEOCMrvLtNR@public.gmane.org, Herbert Xu,
	kvm-devel, mschwid2-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8,
	virtualization, Christian Borntraeger, Suzanne McIntosh,
	Jens Axboe

On Thu, 2007-05-31 at 14:57 +0200, Carsten Otte wrote:
> Rusty Russell wrote:
> > Example block driver using virtio.
> > 
> > The block driver uses outbufs with sg[0] being the request information
> > (struct virtio_blk_outhdr) with the type, sector and inbuf id.  For a
> > write, the rest of the sg will contain the data to be written.
> > 
> > The first segment of the inbuf is a result code (struct
> > virtio_blk_inhdr).  For a read, the rest of the sg points to the input
> > buffer.
> > 
> > TODO:
> > 	1) Ordered tag support.
> Implementing a do_request function has quite a few disadvantages over 
> hooking into q->make_request_fn. This way, we have the device plug 
> (latency), request merging, and I/O scheduling inside the guest.

Now my lack of block-layer knowledge is showing.  I would have thought
that if we want things like ionice(1) to work, we have to do some
guest scheduling or pass that information down to the host.

> It seems preferable to do that in the host, especially when requests 
> of multiple guests end up on the same physical media (shared access, 
> or partitioned).

What's the overhead in doing both?

Rusty.



^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH RFC 3/3] virtio infrastructure: example block driver
       [not found]                 ` <1180654765.10999.6.camel-bi+AKbBUZKY6gyzm1THtWbp2dZbC/Bob@public.gmane.org>
@ 2007-06-01  7:10                   ` Carsten Otte
       [not found]                     ` <465FC65C.6020905-tA70FqPdS9bQT0dZR+AlfA@public.gmane.org>
  0 siblings, 1 reply; 41+ messages in thread
From: Carsten Otte @ 2007-06-01  7:10 UTC (permalink / raw)
  To: Rusty Russell
  Cc: Jimi Xenidis, Stephen Rothwell, Xen Mailing List,
	jmk-zzFmDc4TPjtKvsKVC3L/VUEOCMrvLtNR@public.gmane.org, Herbert Xu,
	carsteno-tA70FqPdS9bQT0dZR+AlfA,
	mschwid2-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8, virtualization,
	kvm-devel, Suzanne McIntosh, Jens Axboe, Christian Borntraeger

Rusty Russell wrote:
> Now my lack of block-layer knowledge is showing.  I would have thought
> that if we want to do things like ionice(1) to work, we have to do some
> guest scheduling or pass that information down to the host.
Yeah, that would only work on the host: one can use ionice to set the 
I/O niceness of the entire guest. Individual processes inside the guest 
are opaque to the host, and thus opaque to its I/O scheduler.

>> It seems preferable to do that in the host, especially when requests 
>> of multiple guests end up on the same physical media (shared access, 
>> or partitioned).
> 
> What's the overhead in doing both?
With regard to compute power needed, almost none. The penalty is 
latency, not overhead: a small request may sit on the request queue, 
waiting for other work to arrive, until the queue gets unplugged. This 
penalty is compensated by the good chance that more requests will be 
merged during this time period.
If we have this mechanism in both host and guest, we pay the penalty 
twice with no added benefit.

On the other hand, if we choose to hook into q->make_request_fn, we do 
end up doing far more hypercalls: bios do not get merged on the guest 
side.  We must do a hypercall per bio in this scenario, or we'll end 
up adding latency again. In contrast, we can submit the entire content 
of the queue with a single hypercall when calling at do_request().

A third way out of that situation is to do queueing between guest and 
host: on the first bio, guest does a hypercall. When the next bio 
arrives, guest sees that the host has not finished processing the 
queue yet and pushes another buffer without doing a notification. 
We've also implemented this, with the result that our host stack was 
quick enough to practically always process the bio before the guest 
had the chance to submit another one. Performance was a nightmare, so 
we discontinued pursuing that idea.

so long,
Carsten


^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH RFC 3/3] virtio infrastructure: example block driver
       [not found]                     ` <465FC65C.6020905-tA70FqPdS9bQT0dZR+AlfA@public.gmane.org>
@ 2007-06-01 13:13                       ` Jens Axboe
       [not found]                         ` <20070601131315.GW32105-tSWWG44O7X1aa/9Udqfwiw@public.gmane.org>
  2007-06-02  9:28                       ` Rusty Russell
  1 sibling, 1 reply; 41+ messages in thread
From: Jens Axboe @ 2007-06-01 13:13 UTC (permalink / raw)
  To: carsteno-tA70FqPdS9bQT0dZR+AlfA
  Cc: Jimi Xenidis, Stephen Rothwell, Xen Mailing List,
	jmk-zzFmDc4TPjtKvsKVC3L/VUEOCMrvLtNR@public.gmane.org, Herbert Xu,
	kvm-devel, mschwid2-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8,
	virtualization, Christian Borntraeger, Suzanne McIntosh

On Fri, Jun 01 2007, Carsten Otte wrote:
> Rusty Russell wrote:
> >Now my lack of block-layer knowledge is showing.  I would have thought
> >that if we want to do things like ionice(1) to work, we have to do some
> >guest scheduling or pass that information down to the host.
> Yea that would only work on the host: one can use ionice to set the io 
> niceness of the entire guest. Individual processes inside the guest 
> are opaque to the host, and thus are opaque to its io scheduler.
> 
> >>It seems preferable to do that in the host, especially when requests 
> >>of multiple guests end up on the same physical media (shared access, 
> >>or partitioned).
> >
> >What's the overhead in doing both?
> With regard to compute power needed, almost none. The penalty is 
> latency, not overhead: A small request may sit on the request queue to 
> wait for other work to arrive until the queue gets unplugged. This 
> penalty is compensated by the benefit of a good chance that more 
> requests will be merged during this time period.
> If we have this method both in host and guest, we have twice the 
> penalty with no added benefit.

I don't buy that argument. We can easily expose the unplug delay, so you
can kill it at what ever level you want. Or you could just do it in the
driver right now, but that is a bit hackish.

> On the other hand, if we choose to hook into q->make_request_fn, we do 
> end up doing far more hypercalls: bios do not get merged on the guest 
> side.  We must do a hypercall per bio in this scenario, or we'll end 
> up adding latency again. In contrast, we can submit the entire content 
> of the queue with a single hypercall when calling at do_request().
> 
> A third way out of that situation is to do queueing between guest and 
> host: on the first bio, guest does a hypercall. When the next bio 
> arrives, guest sees that the host has not finished processing the 
> queue yet and pushes another buffer without doing a notification. 
> We've also implemented this, with the result that our host stack was 
> quick enough to practically always process the bio before the guest 
> had the chance to submit another one. Performance was a nightmare, so 
> we discontinued pursuing that idea.

I'd greatly prefer maintaining a request_fn interface for this. The
make_request_fn/request_fn call ratio is at least 1, and typically a lot
larger (with 4KB bios, a 128KB request is not uncommon -> a ratio of 32).

-- 
Jens Axboe



^ permalink raw reply	[flat|nested] 41+ messages in thread

* RE: [PATCH RFC 1/3] virtio infrastructure
  2007-05-31 12:19 [PATCH RFC 1/3] virtio infrastructure Rusty Russell
@ 2007-06-01 23:35 ` Santos, Jose Renato G
  2007-06-02 10:12   ` Rusty Russell
       [not found] ` <1180613947.11133.58.camel-bi+AKbBUZKY6gyzm1THtWbp2dZbC/Bob@public.gmane.org>
  1 sibling, 1 reply; 41+ messages in thread
From: Santos, Jose Renato G @ 2007-06-01 23:35 UTC (permalink / raw)
  To: Rusty Russell, kvm-devel, Xen Mailing List, virtualization
  Cc: Jimi Xenidis, Stephen Rothwell, jmk, Herbert Xu,
	Christian Borntraeger, Suzanne McIntosh, Anthony Liguori,
	Martin Schwidefsky

 

> -----Original Message-----
> From: xen-devel-bounces@lists.xensource.com 
> [mailto:xen-devel-bounces@lists.xensource.com] On Behalf Of 
> Rusty Russell
> Sent: Thursday, May 31, 2007 5:19 AM
> To: kvm-devel; Xen Mailing List; virtualization
> Cc: Jimi Xenidis; Stephen Rothwell; jmk@plan9.bell-labs.com; 
> Herbert Xu; Christian Borntraeger; Suzanne McIntosh; Anthony 
> Liguori; Martin Schwidefsky
> Subject: [Xen-devel] [PATCH RFC 1/3] virtio infrastructure
> 
> This attempts to implement a "virtual I/O" layer which should 
> allow common drivers to be efficiently used across most 
> virtual I/O mechanisms.  It will no-doubt need further enhancement.
> 
>

Rusty

Could you please clarify the purpose of this "virtual I/O" layer?
At least for networking, why isn't the current Linux net device
abstraction sufficient for hiding the details of different virtual
device implementations? What am I missing?

Thanks

Renato

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH RFC 1/3] virtio infrastructure
       [not found] ` <1180613947.11133.58.camel-bi+AKbBUZKY6gyzm1THtWbp2dZbC/Bob@public.gmane.org>
                     ` (2 preceding siblings ...)
  2007-05-31 13:01   ` Dor Laor
@ 2007-06-02  6:30   ` Avi Kivity
       [not found]     ` <46610E8D.10706-atKUWr5tajBWk0Htik3J/w@public.gmane.org>
  3 siblings, 1 reply; 41+ messages in thread
From: Avi Kivity @ 2007-06-02  6:30 UTC (permalink / raw)
  To: Rusty Russell
  Cc: Jimi Xenidis, Stephen Rothwell, Xen Mailing List,
	jmk-zzFmDc4TPjtKvsKVC3L/VUEOCMrvLtNR@public.gmane.org, Herbert Xu,
	kvm-devel, virtualization, Christian Borntraeger,
	Suzanne McIntosh, Martin Schwidefsky

Rusty Russell wrote:
> This attempts to implement a "virtual I/O" layer which should allow
> common drivers to be efficiently used across most virtual I/O
> mechanisms.  It will no-doubt need further enhancement.
>
> The details of probing the device are left to hypervisor-specific
> code: it simply constructs the "struct virtio_device" and hands it to
> the probe function (eg. virtnet_probe() or virtblk_probe()).
>
> The virtio drivers add and detach input and output buffers; as the
> buffers are used up their associated "used" pointers are filled in.
>
>   

Good stuff.

> +/**
> + * virtio_ops - virtio abstraction layer
> + * @add_outbuf: prepare to send data to the other end:
> + *	vdev: the virtio_device
> + *	sg: the description of the buffer(s).
> + *	num: the size of the sg array.
> + *	used: the length sent (set once sending is done).
> + *      Returns an identifier or an error.
> + * @add_inbuf: prepare to receive data from the other end:
> + *	vdev: the virtio_device
> + *	sg: the description of the buffer(s).
> + *	num: the size of the sg array.
> + *	used: the length sent (set once data received).
> + *      Returns an identifier or an error (eg. -ENOSPC).
>   

Instead of 'used', how about a completion callback (with associated data
pointer)?  A new helper, virtio_complete(), would call the callback for
all completed requests.  It would eliminate all the tedious scanning
used to match the identifier.

It would also be nice to support a bit of non-buffer data, like a set of
bitflags.
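Concretely, the suggestion might look something like the sketch below. virtio_complete(), the callback signature, and the ring layout are all invented here to illustrate the idea; none of it is from the patch:

```c
#include <stddef.h>

/* Sketch of the suggested completion-callback scheme.  All names and
 * signatures are assumptions, not real virtio API. */
struct virtio_completion {
	void (*cb)(void *data, unsigned long used);	/* per-request callback */
	void *data;					/* caller's cookie */
	unsigned long used;				/* length transferred */
	int done;					/* set by the "host" side */
};

static unsigned long total_used;

/* Demo callback: just accumulate the transferred lengths. */
static void count_cb(void *data, unsigned long used)
{
	(void)data;
	total_used += used;
}

/* Fire callbacks for all finished requests in one pass, instead of each
 * driver scanning for its own identifiers.  Returns the number of
 * completions delivered; each is delivered at most once. */
static int virtio_complete(struct virtio_completion *ring, int n)
{
	int i, fired = 0;

	for (i = 0; i < n; i++) {
		if (ring[i].done && ring[i].cb) {
			ring[i].cb(ring[i].data, ring[i].used);
			ring[i].cb = NULL;
			fired++;
		}
	}
	return fired;
}
```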


-- 
Do not meddle in the internals of kernels, for they are subtle and quick to panic.



^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH RFC 3/3] virtio infrastructure: example block driver
       [not found]                     ` <465FC65C.6020905-tA70FqPdS9bQT0dZR+AlfA@public.gmane.org>
  2007-06-01 13:13                       ` Jens Axboe
@ 2007-06-02  9:28                       ` Rusty Russell
       [not found]                         ` <1180776482.9228.22.camel-bi+AKbBUZKY6gyzm1THtWbp2dZbC/Bob@public.gmane.org>
  1 sibling, 1 reply; 41+ messages in thread
From: Rusty Russell @ 2007-06-02  9:28 UTC (permalink / raw)
  To: carsteno-tA70FqPdS9bQT0dZR+AlfA
  Cc: Jimi Xenidis, Stephen Rothwell, Xen Mailing List,
	jmk-zzFmDc4TPjtKvsKVC3L/VUEOCMrvLtNR@public.gmane.org, Herbert Xu,
	kvm-devel, mschwid2-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8,
	virtualization, Christian Borntraeger, Suzanne McIntosh,
	Jens Axboe

On Fri, 2007-06-01 at 09:10 +0200, Carsten Otte wrote:
> Rusty Russell wrote:
> > What's the overhead in doing both?
> With regard to compute power needed, almost none. The penalty is 
> latency, not overhead: A small request may sit on the request queue to 
> wait for other work to arrive until the queue gets unplugged. This 
> penalty is compensated by the benefit of a good chance that more 
> requests will be merged during this time period.
> If we have this method both in host and guest, we have twice the 
> penalty with no added benefit.

Indeed, but as it turns out the draft block driver is appealingly
naive in this respect: the caller can invoke elevator_init(disk->queue,
"noop").  See the extract from the lguest implementation below (which
doesn't do this, but could).

Is the noop scheduler significantly worse than hooking directly into
q->make_request_fn?

> A third way out of that situation is to do queueing between guest and 
> host: on the first bio, guest does a hypercall. When the next bio 
> arrives, guest sees that the host has not finished processing the 
> queue yet and pushes another buffer without doing a notification. 
> We've also implemented this, with the result that our host stack was 
> quick enough to practically always process the bio before the guest 
> had the chance to submit another one. Performance was a nightmare, so 
> we discontinued pursuing that idea.

Interesting!  This kind of implementation becomes quite natural with
shared memory so the guest can see an "ack" from the host: if the
previous notification hasn't been acked, it doesn't send another one.

Such a scheme has application beyond block devices and (this is what I'm
really interested in) should be easy to implement under virtio_ops.

Thanks!
Rusty.

+/* Example block driver code. */
+#include <linux/virtio_blk.h>
+#include <linux/genhd.h>
+#include <linux/blkdev.h>
+static irqreturn_t lguest_virtblk_interrupt(int irq, void *_lgv)
+{
+	struct lguest_virtio_device *lgv = _lgv;
+
+	return virtblk_interrupt(lgv->priv);
+}
+
+static int lguest_virtblk_probe(struct lguest_device *lgdev)
+{
+	struct lguest_virtio_device *lgv;
+	struct gendisk *disk;
+	unsigned long sectors;
+	int err, irqf, i;
+
+	lgv = kmalloc(sizeof(*lgv), GFP_KERNEL);
+	if (!lgv)
+		return -ENOMEM;
+
+	memset(lgv, 0, sizeof(*lgv));
+
+	lgdev->private = lgv;
+	lgv->lg = lgdev;
+
+	/* Map is input page followed by output page */
+	lgv->in.p = lguest_map(lguest_devices[lgdev->index].pfn<<PAGE_SHIFT,2);
+	if (!lgv->in.p) {
+		err = -ENOMEM;
+		goto free_lgv;
+	}
+	lgv->out.p = lgv->in.p + 1;
+	/* Page is initially used to pass capacity. */
+	sectors = *(unsigned long *)lgv->in.p;
+	*(unsigned long *)lgv->in.p = 0;
+
+	/* Put everything in free lists. */
+	lgv->in.avail = lgv->out.avail = NUM_DESCS;
+	for (i = 0; i < NUM_DESCS-1; i++) {
+		lgv->in.p->desc[i].next = i+1;
+		lgv->out.p->desc[i].next = i+1;
+	}
+
+	lgv->vdev.ops = &lguest_virtio_ops;
+	lgv->vdev.dev = &lgdev->dev;
+
+	lgv->priv = disk = virtblk_probe(&lgv->vdev);
+	if (IS_ERR(lgv->priv)) {
+		err = PTR_ERR(lgv->priv);
+		goto unmap;
+	}
+	set_capacity(disk, sectors);
+	blk_queue_max_hw_segments(disk->queue, NUM_DESCS-1);
+
+	if (lguest_devices[lgv->lg->index].features&LGUEST_DEVICE_F_RANDOMNESS)
+		irqf = IRQF_SAMPLE_RANDOM;
+	else
+		irqf = 0;
+
+	err = request_irq(lgdev_irq(lgv->lg), lguest_virtblk_interrupt, irqf,
+			  disk->disk_name, lgv);
+	if (err)
+		goto remove;
+
+	add_disk(disk);
+	printk("Virtblk device %s registered\n", disk->disk_name);
+	return 0;
+
+remove:
+	virtblk_remove(disk);
+unmap:
+	lguest_unmap(lgv->in.p);
+free_lgv:
+	kfree(lgv);
+	return err;
+}
+
+static struct lguest_driver lguest_virtblk_drv = {
+	.name = "lguestvirtblk",
+	.owner = THIS_MODULE,
+	.device_type = LGUEST_DEVICE_T_VIRTBLK,
+	.probe = lguest_virtblk_probe,
+};
+
+static __init int lguest_virtblk_init(void)
+{
+	return register_lguest_driver(&lguest_virtblk_drv);
+}
+device_initcall(lguest_virtblk_init);
+
+MODULE_LICENSE("GPL");



^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH RFC 1/3] virtio infrastructure
       [not found]     ` <46610E8D.10706-atKUWr5tajBWk0Htik3J/w@public.gmane.org>
@ 2007-06-02  9:50       ` Rusty Russell
       [not found]         ` <1180777836.9228.44.camel-bi+AKbBUZKY6gyzm1THtWbp2dZbC/Bob@public.gmane.org>
  0 siblings, 1 reply; 41+ messages in thread
From: Rusty Russell @ 2007-06-02  9:50 UTC (permalink / raw)
  To: Avi Kivity
  Cc: Jimi Xenidis, Stephen Rothwell, Xen Mailing List,
	jmk-zzFmDc4TPjtKvsKVC3L/VUEOCMrvLtNR@public.gmane.org, Herbert Xu,
	kvm-devel, virtualization, Christian Borntraeger,
	Suzanne McIntosh, Martin Schwidefsky

On Sat, 2007-06-02 at 09:30 +0300, Avi Kivity wrote:
> Rusty Russell wrote:
> > + * virtio_ops - virtio abstraction layer
> > + * @add_outbuf: prepare to send data to the other end:
> > + *	vdev: the virtio_device
> > + *	sg: the description of the buffer(s).
> > + *	num: the size of the sg array.
> > + *	used: the length sent (set once sending is done).
> > + *      Returns an identifier or an error.
> > + * @add_inbuf: prepare to receive data from the other end:
> > + *	vdev: the virtio_device
> > + *	sg: the description of the buffer(s).
> > + *	num: the size of the sg array.
> > + *	used: the length sent (set once data received).
> > + *      Returns an identifier or an error (eg. -ENOSPC).
> >   
> 
> Instead of 'used', how about a completion callback (with associated data
> pointer)?  A new helper, virtio_complete(), would call the callback for
> all completed requests.  It would eliminate all the tedious scanning
> used to match the identifier.

Hi Avi,

	There were several considerations here.  My first was that the drivers
look much more like normal devices than getting a callback for every
buffer.   Secondly, "used" batches much more nicely than a completion.
Finally, it's also something you really want to know, so the driver
doesn't have to zero its inbufs (an untrusted other side says it sends
you 1500 bytes but actually sent nothing, and now you spray kernel
memory out the NIC).

	I also considered some scheme like:

	struct virtio_used_info
	{
		unsigned long len;
		void *next_token;
	};
	...
	unsigned long (*add_outbuf)(struct virtio_device *vdev,
				    const struct scatterlist sg[],
				    unsigned int num,
				    void *token,
				    struct virtio_used_info *used_info);

So the "used" becomes a "used/next" pair and you can just walk the
linked list.  But I wasn't convinced that walking the buffers is going
to be a performance issue (though the net driver puts them in a contiguous
array for cache friendliness as a nod to this concern).
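As a userspace illustration, walking such a used/next chain could look like this. Only struct virtio_used_info comes from the mail above; the request wrapper and the walking helper are assumptions:

```c
#include <stddef.h>

/* From the mail above: each buffer's "used" slot grows a link to the
 * token of the next completed buffer. */
struct virtio_used_info {
	unsigned long len;	/* length transferred, 0 = not yet used */
	void *next_token;	/* token of the next completed buffer */
};

/* Hypothetical request wrapper: the token handed to add_outbuf() is a
 * pointer to this, so next_token chains completed requests together. */
struct request {
	struct virtio_used_info used;
	/* ... driver-private data ... */
};

/* Walk the completion chain from the first token, summing lengths.
 * This is the "just walk the linked list" step described above. */
static unsigned long walk_used(struct request *first)
{
	unsigned long total = 0;
	struct request *r = first;

	while (r) {
		total += r->used.len;
		r = (struct request *)r->used.next_token;
	}
	return total;
}
```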

> It would also be nice to support a bit of non-buffer data, like a set of
> bitflags.

	I expect this might be necessary, but it wasn't so far.  The non-buffer
data tends to go in sg[0]: the block driver works this way, and the
network driver will for GSO.  Of course, a specialized virtio_ops
backend might well take this and put the info somewhere else.

	I also considered a separate "publish"/"examine" interface for things
which aren't really messages, but again, haven't needed it yet.

Thanks!
Rusty.



^ permalink raw reply	[flat|nested] 41+ messages in thread

* RE: [PATCH RFC 1/3] virtio infrastructure
  2007-06-01 23:35 ` Santos, Jose Renato G
@ 2007-06-02 10:12   ` Rusty Russell
       [not found]     ` <1180779167.9228.66.camel-bi+AKbBUZKY6gyzm1THtWbp2dZbC/Bob@public.gmane.org>
  0 siblings, 1 reply; 41+ messages in thread
From: Rusty Russell @ 2007-06-02 10:12 UTC (permalink / raw)
  To: Santos, Jose Renato G
  Cc: Jimi Xenidis, Stephen Rothwell, Xen Mailing List, jmk, Herbert Xu,
	kvm-devel, virtualization, Christian Borntraeger,
	Suzanne McIntosh, Anthony Liguori, Martin Schwidefsky

On Fri, 2007-06-01 at 23:35 +0000, Santos, Jose Renato G wrote:
> 
> > -----Original Message-----
> > From: xen-devel-bounces@lists.xensource.com 
> > [mailto:xen-devel-bounces@lists.xensource.com] On Behalf Of 
> > Rusty Russell
> > Sent: Thursday, May 31, 2007 5:19 AM
> > To: kvm-devel; Xen Mailing List; virtualization
> > Cc: Jimi Xenidis; Stephen Rothwell; jmk@plan9.bell-labs.com; 
> > Herbert Xu; Christian Borntraeger; Suzanne McIntosh; Anthony 
> > Liguori; Martin Schwidefsky
> > Subject: [Xen-devel] [PATCH RFC 1/3] virtio infrastructure
> > 
> > This attempts to implement a "virtual I/O" layer which should 
> > allow common drivers to be efficiently used across most 
> > virtual I/O mechanisms.  It will no-doubt need further enhancement.
> 
> Rusty
> 
> Could you please clarify what is the purpose of this "virtual I/O"
> layer?
> At least for networking, why isn't the current linux net device
> abstraction sufficient for hiding the details of different virtual
> devices implementations? What am I missing?

Hi Renato!

	This is a very good question.  For currently-existing and working
devices it just seems like another layer of indirection, and not much
code saving.

	There are several reasons for having a common layer.  The obvious one
is to reduce the amount of code, but that's the least important.  Slightly
more important is that no one wants to maintain and tune 5 different
virtual net drivers, 5 different virtual block drivers, etc.  In fact,
the kernel community will start to rebel at some point, especially if
they want different optimization hooks deeper into the kernel.

	We expect to see new device types (entropy device, anyone?), and so the
5 different drivers become 5 x N different drivers.  Eventually we may
come up with a common bus and probing scheme which everyone loves, but until
then we can at least make it simple for each virtualization technology
to implement the new XYZZY device.

	Finally, virtual I/O techniques are *moving*.  KVM doesn't really have
one, XenSource are discussing more changes to theirs (at least for
networking) and lguest is still looking for an efficient one.  No doubt
the others would love to use clean and optimal Linux drivers, too (UML,
S/390, Power, VMware?).

	When Linux doesn't provide a model of what it expects from virtual
devices, one end of the design process is untethered: we've already seen
some fairly random and gratuitously different results.  If we can
provide a simple interface which we like, it encourages new
implementations to produce Linux-friendly interfaces.

	Of course, the new interface must not suck.

I hope that clarifies my thinking?
Rusty.

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH RFC 1/3] virtio infrastructure
       [not found]         ` <1180777836.9228.44.camel-bi+AKbBUZKY6gyzm1THtWbp2dZbC/Bob@public.gmane.org>
@ 2007-06-03  5:28           ` Avi Kivity
       [not found]             ` <46625192.5020108-atKUWr5tajBWk0Htik3J/w@public.gmane.org>
  0 siblings, 1 reply; 41+ messages in thread
From: Avi Kivity @ 2007-06-03  5:28 UTC (permalink / raw)
  To: Rusty Russell
  Cc: Jimi Xenidis, Stephen Rothwell, Xen Mailing List,
	jmk-zzFmDc4TPjtKvsKVC3L/VUEOCMrvLtNR@public.gmane.org, Herbert Xu,
	kvm-devel, virtualization, Christian Borntraeger,
	Suzanne McIntosh, Martin Schwidefsky

Rusty Russell wrote:
> On Sat, 2007-06-02 at 09:30 +0300, Avi Kivity wrote:
>   
>> Rusty Russell wrote:
>>     
>>> + * virtio_ops - virtio abstraction layer
>>> + * @add_outbuf: prepare to send data to the other end:
>>> + *	vdev: the virtio_device
>>> + *	sg: the description of the buffer(s).
>>> + *	num: the size of the sg array.
>>> + *	used: the length sent (set once sending is done).
>>> + *      Returns an identifier or an error.
>>> + * @add_inbuf: prepare to receive data from the other end:
>>> + *	vdev: the virtio_device
>>> + *	sg: the description of the buffer(s).
>>> + *	num: the size of the sg array.
>>> + *	used: the length sent (set once data received).
>>> + *      Returns an identifier or an error (eg. -ENOSPC).
>>>   
>>>       
>> Instead of 'used', how about a completion callback (with associated data
>> pointer)?  A new helper, virtio_complete(), would call the callback for
>> all completed requests.  It would eliminate all the tedious scanning
>> used to match the identifier.
>>     
>
> Hi Avi,
>
> 	There were several considerations here.  My first was that the drivers
> look much more like normal devices than getting a callback for every
> buffer.   Secondly, "used" batches much more nicely than a completion.
> Finally, it's also something you really want to know, so the driver
> doesn't have to zero its inbufs (an untrusted other side says it sends
> you 1500 bytes but actually sent nothing, and now you spray kernel
> memory out the NIC).
>   

Sure, 'used' is important (how else do you get the packet size on
receive?); I'm just bitching about the linear scan.

> 	I also considered some scheme like:
>
> 	struct virtio_used_info
> 	{
> 		unsigned long len;
> 		void *next_token;
> 	};
> 	...
> 	unsigned long (*add_outbuf)(struct virtio_device *vdev,
> 				    const struct scatterlist sg[],
> 				    unsigned int num,
> 				    void *token,
> 				    struct virtio_used_info *used_info);
>
> So the "used" becomes a "used/next" pair and you can just walk the
> linked list.  But I wasn't convinced that walking the buffers is going
> to be a performance issue (tho the net driver puts them in a contiguous
> array for cache friendliness as a nod to this concern).
>   

Well, if you have 256 slots per direction, you will scan 4KB of memory
per interrupt (256 slots x 2 directions x 8 bytes).  You may need a
queue length greater than 256 for a high bandwidth interface, if you
want to reduce your interrupt rate, or if your guest is scheduled out
for lengthy periods (say, a millisecond).
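The arithmetic is easy to check (the slot count and the 8-byte word size are the assumptions stated in the paragraph above):

```c
/* Bytes of "used" state scanned per interrupt: one word per slot,
 * for both the in and out directions. */
static unsigned long scan_bytes(unsigned long slots_per_dir,
				unsigned long word_size)
{
	return slots_per_dir * 2 * word_size;
}
```

With 256 slots per direction and 8-byte words this gives 4096 bytes, i.e. the 4KB figure quoted above.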


>   
>> It would also be nice to support a bit of non-buffer data, like a set of
>> bitflags.
>>     
>
> 	I expect this might be necessary, but it wasn't so far.  The non-buffer
> data tends to go in sg[0]: the block driver works this way, and the
> network driver will for GSO.  Of course, a specialized virtio_ops
> backend might well take this and put the info somewhere else.
>   

Yeah.  Well, it's a somewhat roundabout way of doing things.


-- 
Do not meddle in the internals of kernels, for they are subtle and quick to panic.



^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH RFC 1/3] virtio infrastructure
       [not found]             ` <46625192.5020108-atKUWr5tajBWk0Htik3J/w@public.gmane.org>
@ 2007-06-03 10:29               ` Rusty Russell
       [not found]                 ` <1180866540.17442.74.camel-bi+AKbBUZKY6gyzm1THtWbp2dZbC/Bob@public.gmane.org>
  0 siblings, 1 reply; 41+ messages in thread
From: Rusty Russell @ 2007-06-03 10:29 UTC (permalink / raw)
  To: Avi Kivity
  Cc: Jimi Xenidis, Stephen Rothwell, Xen Mailing List,
	jmk-zzFmDc4TPjtKvsKVC3L/VUEOCMrvLtNR@public.gmane.org, Herbert Xu,
	kvm-devel, virtualization, Christian Borntraeger,
	Suzanne McIntosh, Martin Schwidefsky

On Sun, 2007-06-03 at 08:28 +0300, Avi Kivity wrote:
> Sure, 'used' is important (how else do you get the packet size on
> receive?)

Well, the sender can prepend the length, or for networking just leave
Linux to trim packets.  It's the trust issue which bugs me.

> ,  I'm just bitching about the linear scan.

If you're doing interrupt mitigation on a high-bandwidth device, I think
it's optimal.  And if you're not, you obviously don't care 8)

That said, thousands of descriptors may not be insane...

> Well, if you have 256 slots per direction, you will scan 4KB of memory
> per interrupt (256 slots x 2 directions x 8 bytes).  You may need a
> queue length greater than 256 for a high bandwidth interface, if you
> want to reduce your interrupt rate, or if your guest is scheduled out
> for lengthy periods (say, a millisecond).

Hmm... Perhaps I should move the used arrays into the "struct
virtio_device" and guarantee that the id (returned by add_*buf) is an
index into that.  Then we can trivially add a corresponding bit array.

This also makes the "id" easier to use from the driver side: if we add a
"max_id" field it can size its own arrays if it wants to.  Gets rid of
the linked-list in the block driver...

Thoughts?
Rusty.


-------------------------------------------------------------------------
This SF.net email is sponsored by DB2 Express
Download DB2 Express C - the FREE version of DB2 express and take
control of your XML. No limits. Just data. Click to get it now.
http://sourceforge.net/powerbar/db2/

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [Xen-devel] Re: [PATCH RFC 1/3] virtio infrastructure
       [not found]                 ` <1180866540.17442.74.camel-bi+AKbBUZKY6gyzm1THtWbp2dZbC/Bob@public.gmane.org>
@ 2007-06-03 11:39                   ` Avi Kivity
       [not found]                     ` <4662A86A.7010706-atKUWr5tajBWk0Htik3J/w@public.gmane.org>
  0 siblings, 1 reply; 41+ messages in thread
From: Avi Kivity @ 2007-06-03 11:39 UTC (permalink / raw)
  To: Rusty Russell
  Cc: Jimi Xenidis, Stephen Rothwell, Xen Mailing List,
	jmk-zzFmDc4TPjtKvsKVC3L/VUEOCMrvLtNR@public.gmane.org, Herbert Xu,
	kvm-devel, virtualization, Christian Borntraeger,
	Suzanne McIntosh, Martin Schwidefsky

Rusty Russell wrote:
> Hmm... Perhaps I should move the used arrays into the "struct
> virtio_device" and guarantee that the id (returned by add_*buf) is an
> index into that.  Then we can trivially add a corresponding bit array.
>
>   

That may force the virtio backend to do things it doesn't want to.

> This also makes the "id" easier to use from the driver side: if we add a
> "max_id" field it can size its own arrays if it wants to.  Gets rid of
> the linked-list in the block driver...
>   

How about

    struct virtio_completion {
        unsigned long length;
        void *data;
    };

    int virtio_complete(struct virtio_completion *completions, int nr);


where 'data' is an opaque pointer passed by the device during add_*buf()?

-- 
error compiling committee.c: too many arguments to function


-------------------------------------------------------------------------
This SF.net email is sponsored by DB2 Express
Download DB2 Express C - the FREE version of DB2 express and take
control of your XML. No limits. Just data. Click to get it now.
http://sourceforge.net/powerbar/db2/

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [Xen-devel] Re: [PATCH RFC 1/3] virtio infrastructure
       [not found]                     ` <4662A86A.7010706-atKUWr5tajBWk0Htik3J/w@public.gmane.org>
@ 2007-06-03 23:53                       ` Rusty Russell
       [not found]                         ` <1180914836.17442.104.camel-bi+AKbBUZKY6gyzm1THtWbp2dZbC/Bob@public.gmane.org>
  0 siblings, 1 reply; 41+ messages in thread
From: Rusty Russell @ 2007-06-03 23:53 UTC (permalink / raw)
  To: Avi Kivity
  Cc: Jimi Xenidis, Stephen Rothwell, Xen Mailing List,
	jmk-zzFmDc4TPjtKvsKVC3L/VUEOCMrvLtNR@public.gmane.org, Herbert Xu,
	kvm-devel, virtualization, Christian Borntraeger,
	Suzanne McIntosh, Martin Schwidefsky

On Sun, 2007-06-03 at 14:39 +0300, Avi Kivity wrote:
> Rusty Russell wrote:
> > Hmm... Perhaps I should move the used arrays into the "struct
> > virtio_device" and guarantee that the id (returned by add_*buf) is an
> > index into that.  Then we can trivially add a corresponding bit array. 
> 
> That may force the virtio backend to do things it doesn't want to.

Well, I have two implementations, and both ids already fit this model so
I have some confidence.  I just need to add the set_bit/clear_bit to the
read/write one, and use sync_clear_bit to the descriptor-based one
(which I suspect will actually get a little cleaner).

So I think this constraint is a reasonable thing to add anyway.

> > This also makes the "id" easier to use from the driver side: if we add a
> > "max_id" field it can size its own arrays if it wants to.  Gets rid of
> > the linked-list in the block driver...
> >   
> 
> How about
> 
>     struct virtio_completion {
>         unsigned long length;
>         void *data;
>     };
> 
>     int virtio_complete(struct virtio_completion *completions, int nr);
> 
> where 'data' is an opaque pointer passed by the device during add_*buf()?

Other than the fact that the driver will look less like a normal driver,
it's actually harder to write the net driver if that complete call can
happen off an interrupt: we can't free skbs in interrupt context, which
is why the used-outbuf cleanup is currently done (inefficiently, but
that's orthogonal) in the xmit routine.

I'll code up the modifications now and check if I'm right about the
impact on the code.  I might have to come up with something else...

Cheers,
Rusty.


-------------------------------------------------------------------------
This SF.net email is sponsored by DB2 Express
Download DB2 Express C - the FREE version of DB2 express and take
control of your XML. No limits. Just data. Click to get it now.
http://sourceforge.net/powerbar/db2/

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [Xen-devel] Re: [PATCH RFC 1/3] virtio infrastructure
       [not found]                         ` <1180914836.17442.104.camel-bi+AKbBUZKY6gyzm1THtWbp2dZbC/Bob@public.gmane.org>
@ 2007-06-04  9:55                           ` Avi Kivity
       [not found]                             ` <4663E18D.4030007-atKUWr5tajBWk0Htik3J/w@public.gmane.org>
  0 siblings, 1 reply; 41+ messages in thread
From: Avi Kivity @ 2007-06-04  9:55 UTC (permalink / raw)
  To: Rusty Russell
  Cc: Jimi Xenidis, Stephen Rothwell, Xen Mailing List,
	jmk-zzFmDc4TPjtKvsKVC3L/VUEOCMrvLtNR@public.gmane.org, Herbert Xu,
	kvm-devel, virtualization, Christian Borntraeger,
	Suzanne McIntosh, Martin Schwidefsky

Rusty Russell wrote:
> On Sun, 2007-06-03 at 14:39 +0300, Avi Kivity wrote:
>   
>> Rusty Russell wrote:
>>     
>>> Hmm... Perhaps I should move the used arrays into the "struct
>>> virtio_device" and guarantee that the id (returned by add_*buf) is an
>>> index into that.  Then we can trivially add a corresponding bit array. 
>>>       
>> That may force the virtio backend to do things it doesn't want to.
>>     
>
> Well, I have two implementations, and both ids already fit this model so
> I have some confidence.  I just need to add the set_bit/clear_bit to the
> read/write one, and use sync_clear_bit to the descriptor-based one
> (which I suspect will actually get a little cleaner).
>
> So I think this constraint is a reasonable thing to add anyway.
>
>   


Some hardware (tulip IIRC, but long time ago) allows linked list based 
descriptors.  If a virtio implementation takes a similar approach, it 
may not have an array of descriptors.

Networking hardware generally services descriptors in a FIFO manner.  
virtio may not (for example, it may offload copies of larger packets to 
a dma engine such as I/OAT, resulting in a delay, but copy smaller 
packets immediately).  that means that there will be some mismatch 
between virtio drivers and real hardware drivers.

For block devices, reordering is a matter of course, and virtio needs to 
efficiently support that.  Queues are generally shorter for block, though.

-- 
error compiling committee.c: too many arguments to function


-------------------------------------------------------------------------
This SF.net email is sponsored by DB2 Express
Download DB2 Express C - the FREE version of DB2 express and take
control of your XML. No limits. Just data. Click to get it now.
http://sourceforge.net/powerbar/db2/

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [Xen-devel] Re: [PATCH RFC 1/3] virtio infrastructure
       [not found]                             ` <4663E18D.4030007-atKUWr5tajBWk0Htik3J/w@public.gmane.org>
@ 2007-06-04 10:32                               ` Herbert Xu
  2007-06-04 11:19                               ` Rusty Russell
  1 sibling, 0 replies; 41+ messages in thread
From: Herbert Xu @ 2007-06-04 10:32 UTC (permalink / raw)
  To: Avi Kivity
  Cc: Jimi Xenidis, Stephen Rothwell, Xen Mailing List,
	jmk-zzFmDc4TPjtKvsKVC3L/VUEOCMrvLtNR@public.gmane.org, kvm-devel,
	virtualization, Christian Borntraeger, Suzanne McIntosh,
	Martin Schwidefsky

On Mon, Jun 04, 2007 at 12:55:25PM +0300, Avi Kivity wrote:
> 
> Networking hardware generally services descriptors in a FIFO manner.  
> virtio may not (for example, it may offload copies of larger packets to 
> a dma engine such as I/OAT, resulting in a delay, but copy smaller 
> packets immediately).  that means that there will be some mismatch 
> between virtio drivers and real hardware drivers.

You're free to do that in the process but before your packets leave
the backend you've got to make sure that they haven't been reordered.

Cheers,
-- 
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <herbert-lOAM2aK0SrRLBo1qDEOMRrpzq4S04n8Q@public.gmane.org>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

-------------------------------------------------------------------------
This SF.net email is sponsored by DB2 Express
Download DB2 Express C - the FREE version of DB2 express and take
control of your XML. No limits. Just data. Click to get it now.
http://sourceforge.net/powerbar/db2/

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH RFC 1/3] virtio infrastructure
       [not found]                             ` <4663E18D.4030007-atKUWr5tajBWk0Htik3J/w@public.gmane.org>
  2007-06-04 10:32                               ` Herbert Xu
@ 2007-06-04 11:19                               ` Rusty Russell
       [not found]                                 ` <1180955964.25878.27.camel-bi+AKbBUZKY6gyzm1THtWbp2dZbC/Bob@public.gmane.org>
  1 sibling, 1 reply; 41+ messages in thread
From: Rusty Russell @ 2007-06-04 11:19 UTC (permalink / raw)
  To: Avi Kivity
  Cc: Jimi Xenidis, Stephen Rothwell, Xen Mailing List,
	jmk-zzFmDc4TPjtKvsKVC3L/VUEOCMrvLtNR@public.gmane.org, Herbert Xu,
	kvm-devel, virtualization, Christian Borntraeger,
	Suzanne McIntosh, Martin Schwidefsky

On Mon, 2007-06-04 at 12:55 +0300, Avi Kivity wrote:
> Rusty Russell wrote:
> > On Sun, 2007-06-03 at 14:39 +0300, Avi Kivity wrote:
> >   
> >> Rusty Russell wrote:
> >>     
> >>> Hmm... Perhaps I should move the used arrays into the "struct
> >>> virtio_device" and guarantee that the id (returned by add_*buf) is an
> >>> index into that.  Then we can trivially add a corresponding bit array. 
> >>>       
> >> That may force the virtio backend to do things it doesn't want to.
> >>     
> >
> > Well, I have two implementations, and both ids already fit this model so
> > I have some confidence.  I just need to add the set_bit/clear_bit to the
> > read/write one, and use sync_clear_bit to the descriptor-based one
> > (which I suspect will actually get a little cleaner).
> >
> > So I think this constraint is a reasonable thing to add anyway.
> 
> Some hardware (tulip IIRC, but long time ago) allows linked list based 
> descriptors.  If a virtio implementation takes a similar approach, it 
> may not have an array of descriptors.

Sure, but the mapping becomes non-trivial only if you have no ceiling on
the number in flight at any time.  OTOH, it's a very nice property for
driver authors.  I'll send the diffs once debugging is done...

> Networking hardware generally services descriptors in a FIFO manner.

Well, ethernet guarantees order.  Not sure about others tho...

> virtio may not (for example, it may offload copies of larger packets to 
> a dma engine such as I/OAT, resulting in a delay, but copy smaller 
> packets immediately).  that means that there will be some mismatch 
> between virtio drivers and real hardware drivers.

I think your point is that the completion bitmap (or indeed, the current
approach) does not maintain order?  Hmm, this is more convincing to me
than cache arguments, since some devices might want ordering and want
more than a single io in flight.

I shall ponder this more deeply...

Thanks!
Rusty.


-------------------------------------------------------------------------
This SF.net email is sponsored by DB2 Express
Download DB2 Express C - the FREE version of DB2 express and take
control of your XML. No limits. Just data. Click to get it now.
http://sourceforge.net/powerbar/db2/

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH RFC 1/3] virtio infrastructure
       [not found]                                 ` <1180955964.25878.27.camel-bi+AKbBUZKY6gyzm1THtWbp2dZbC/Bob@public.gmane.org>
@ 2007-06-04 11:25                                   ` Avi Kivity
       [not found]                                     ` <4663F6AC.7070100-atKUWr5tajBWk0Htik3J/w@public.gmane.org>
  0 siblings, 1 reply; 41+ messages in thread
From: Avi Kivity @ 2007-06-04 11:25 UTC (permalink / raw)
  To: Rusty Russell
  Cc: Jimi Xenidis, Stephen Rothwell, Xen Mailing List,
	jmk-zzFmDc4TPjtKvsKVC3L/VUEOCMrvLtNR@public.gmane.org, Herbert Xu,
	kvm-devel, virtualization, Christian Borntraeger,
	Suzanne McIntosh, Martin Schwidefsky

Rusty Russell wrote:
>   
>> Networking hardware generally services descriptors in a FIFO manner.
>>     
>
> Well, ethernet guarantees order.  Not sure about others tho...
>
>   

OT: Does that hold for bonded interfaces too?

>> virtio may not (for example, it may offload copies of larger packets to 
>> a dma engine such as I/OAT, resulting in a delay, but copy smaller 
>> packets immediately).  that means that there will be some mismatch 
>> between virtio drivers and real hardware drivers.
>>     
>
> I think your point is that the completion bitmap (or indeed, the current
> approach) does not maintain order?  Hmm, this is more convincing to me
> than cache arguments, since some devices might want ordering and want
> more than a single io in flight.
>   

Well, it wasn't really; sorry for being unclear.  My point was that 
virtio interfaces will not match hardware exactly.

My objection is to "scan all slots, occupied or not, for completion".  I 
think virtio should present completed descriptors without the need for 
scanning, even if it means looking a bit different from a typical 
ethernet driver.


-- 
error compiling committee.c: too many arguments to function


-------------------------------------------------------------------------
This SF.net email is sponsored by DB2 Express
Download DB2 Express C - the FREE version of DB2 express and take
control of your XML. No limits. Just data. Click to get it now.
http://sourceforge.net/powerbar/db2/

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH RFC 1/3] virtio infrastructure
       [not found]                                     ` <4663F6AC.7070100-atKUWr5tajBWk0Htik3J/w@public.gmane.org>
@ 2007-06-04 11:40                                       ` Rusty Russell
  2007-06-04 12:17                                       ` Herbert Xu
  1 sibling, 0 replies; 41+ messages in thread
From: Rusty Russell @ 2007-06-04 11:40 UTC (permalink / raw)
  To: Avi Kivity
  Cc: Jimi Xenidis, Stephen Rothwell, Xen Mailing List,
	jmk-zzFmDc4TPjtKvsKVC3L/VUEOCMrvLtNR@public.gmane.org, Herbert Xu,
	kvm-devel, virtualization, Christian Borntraeger,
	Suzanne McIntosh, Martin Schwidefsky

On Mon, 2007-06-04 at 14:25 +0300, Avi Kivity wrote:
> Rusty Russell wrote:
> >   
> >> Networking hardware generally services descriptors in a FIFO manner.
> >>     
> >
> > Well, ethernet guarantees order.  Not sure about others tho...
> 
> OT: Does that hold for bonded interfaces too?

Sorry, I don't know.  The ethernet standard promises in-order, but I'd
imagine you'd need to prepend a header to get this to work with bonding
in general...

> >> virtio may not (for example, it may offload copies of larger packets to 
> >> a dma engine such as I/OAT, resulting in a delay, but copy smaller 
> >> packets immediately).  that means that there will be some mismatch 
> >> between virtio drivers and real hardware drivers.
> >>     
> >
> > I think your point is that the completion bitmap (or indeed, the current
> > approach) does not maintain order?  Hmm, this is more convincing to me
> > than cache arguments, since some devices might want ordering and want
> > more than a single io in flight.
> 
> Well, it wasn't really; sorry for being unclear.  My point was that 
> virtio interfaces will not match hardware exactly.
> 
> My objection is to "scan all slots, occupied or not, for completion".  I 
> think virtio should present completed descriptors without the need for 
> scanning, even if it means looking a bit different from a typical 
> ethernet driver.

It's not just the ethernet driver, it's virtio drivers in general.  One
reason the Xen drivers are viewed with such horror is that they look
nothing like normal Linux drivers.

But that just means that the linked list(s) should be in the "struct
virtio_device" rather than an arg to the "interrupt" handler.  I think,
given that the network code doesn't want to process used outbufs in the
interrupt handler, this is the Right Thing anyway.

I'll send here once it's done...

Thanks,
Rusty.


-------------------------------------------------------------------------
This SF.net email is sponsored by DB2 Express
Download DB2 Express C - the FREE version of DB2 express and take
control of your XML. No limits. Just data. Click to get it now.
http://sourceforge.net/powerbar/db2/

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH RFC 1/3] virtio infrastructure
       [not found]                                     ` <4663F6AC.7070100-atKUWr5tajBWk0Htik3J/w@public.gmane.org>
  2007-06-04 11:40                                       ` Rusty Russell
@ 2007-06-04 12:17                                       ` Herbert Xu
  1 sibling, 0 replies; 41+ messages in thread
From: Herbert Xu @ 2007-06-04 12:17 UTC (permalink / raw)
  To: Avi Kivity
  Cc: Jimi Xenidis, Stephen Rothwell, Xen Mailing List,
	jmk-zzFmDc4TPjtKvsKVC3L/VUEOCMrvLtNR@public.gmane.org, kvm-devel,
	virtualization, Christian Borntraeger, Suzanne McIntosh,
	Martin Schwidefsky

On Mon, Jun 04, 2007 at 02:25:32PM +0300, Avi Kivity wrote:
> 
> OT: Does that hold for bonded interfaces too?

Yes.  By default traffic to the same destination MAC always stick to
one interface.  You could select a layer3+4 hashing policy but even
that guarantees a single flow will stick to one physical interface
unless it contains IP fragments which should never happen for TCP.

Cheers,
-- 
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <herbert-lOAM2aK0SrRLBo1qDEOMRrpzq4S04n8Q@public.gmane.org>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

-------------------------------------------------------------------------
This SF.net email is sponsored by DB2 Express
Download DB2 Express C - the FREE version of DB2 express and take
control of your XML. No limits. Just data. Click to get it now.
http://sourceforge.net/powerbar/db2/

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH RFC 3/3] virtio infrastructure: example block driver
       [not found]                         ` <1180776482.9228.22.camel-bi+AKbBUZKY6gyzm1THtWbp2dZbC/Bob@public.gmane.org>
@ 2007-06-04 12:53                           ` Carsten Otte
  0 siblings, 0 replies; 41+ messages in thread
From: Carsten Otte @ 2007-06-04 12:53 UTC (permalink / raw)
  To: Rusty Russell
  Cc: Jimi Xenidis, Stephen Rothwell, Xen Mailing List,
	jmk-zzFmDc4TPjtKvsKVC3L/VUEOCMrvLtNR@public.gmane.org, Herbert Xu,
	carsteno-tA70FqPdS9bQT0dZR+AlfA,
	mschwid2-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8, virtualization,
	kvm-devel, Suzanne McIntosh, Jens Axboe, Christian Borntraeger

Rusty Russell wrote:
> Is the noop scheduler significantly worse than hooking directly into
> q->make_request_fn?
The noop scheduler does do request merging, and has the same device 
plug latency as other schedulers.

so long,
Carsten

-------------------------------------------------------------------------
This SF.net email is sponsored by DB2 Express
Download DB2 Express C - the FREE version of DB2 express and take
control of your XML. No limits. Just data. Click to get it now.
http://sourceforge.net/powerbar/db2/

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH RFC 3/3] virtio infrastructure: example block driver
       [not found]                         ` <20070601131315.GW32105-tSWWG44O7X1aa/9Udqfwiw@public.gmane.org>
@ 2007-06-04 13:40                           ` Carsten Otte
       [not found]                             ` <4664164A.40604-tA70FqPdS9bQT0dZR+AlfA@public.gmane.org>
  0 siblings, 1 reply; 41+ messages in thread
From: Carsten Otte @ 2007-06-04 13:40 UTC (permalink / raw)
  To: Jens Axboe
  Cc: Jimi Xenidis, Stephen Rothwell, Xen Mailing List,
	jmk-zzFmDc4TPjtKvsKVC3L/VUEOCMrvLtNR@public.gmane.org, Herbert Xu,
	carsteno-tA70FqPdS9bQT0dZR+AlfA,
	mschwid2-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8, virtualization,
	kvm-devel, Suzanne McIntosh, Christian Borntraeger

Jens Axboe wrote:
> On Fri, Jun 01 2007, Carsten Otte wrote:
>> With regard to compute power needed, almost none. The penalty is 
>> latency, not overhead: A small request may sit on the request queue to 
>> wait for other work to arrive until the queue gets unplugged. This 
>> penality is compensated by the benefit of a good chance that more 
>> requests will be merged during this time period.
>> If we have this method both in host and guest, we have twice the 
>> penalty with no added benefit.
> 
> I don't buy that argument. We can easily expose the unplug delay, so you
> can kill it at what ever level you want. Or you could just do it in the
> driver right now, but that is a bit hackish.
That would be preferable if the device driver can chose the unplug 
delay, or even better it could be (guest)sysfs tuneable.

so long,
Carsten

-------------------------------------------------------------------------
This SF.net email is sponsored by DB2 Express
Download DB2 Express C - the FREE version of DB2 express and take
control of your XML. No limits. Just data. Click to get it now.
http://sourceforge.net/powerbar/db2/

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH RFC 3/3] virtio infrastructure: example block driver
       [not found]                             ` <4664164A.40604-tA70FqPdS9bQT0dZR+AlfA@public.gmane.org>
@ 2007-06-04 13:43                               ` Jens Axboe
       [not found]                                 ` <20070604134322.GC32105-tSWWG44O7X1aa/9Udqfwiw@public.gmane.org>
  0 siblings, 1 reply; 41+ messages in thread
From: Jens Axboe @ 2007-06-04 13:43 UTC (permalink / raw)
  To: carsteno-tA70FqPdS9bQT0dZR+AlfA
  Cc: Jimi Xenidis, Stephen Rothwell, Xen Mailing List,
	jmk-zzFmDc4TPjtKvsKVC3L/VUEOCMrvLtNR@public.gmane.org, Herbert Xu,
	kvm-devel, mschwid2-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8,
	virtualization, Christian Borntraeger, Suzanne McIntosh

On Mon, Jun 04 2007, Carsten Otte wrote:
> Jens Axboe wrote:
> >On Fri, Jun 01 2007, Carsten Otte wrote:
> >>With regard to compute power needed, almost none. The penalty is 
> >>latency, not overhead: A small request may sit on the request queue to 
> >>wait for other work to arrive until the queue gets unplugged. This 
> >>penality is compensated by the benefit of a good chance that more 
> >>requests will be merged during this time period.
> >>If we have this method both in host and guest, we have twice the 
> >>penalty with no added benefit.
> >
> >I don't buy that argument. We can easily expose the unplug delay, so you
> >can kill it at what ever level you want. Or you could just do it in the
> >driver right now, but that is a bit hackish.
> That would be preferable if the device driver can chose the unplug 
> delay, or even better it could be (guest)sysfs tuneable.

Right. We probably should make it sysfs configurable in any case, right
now it's a (somewhat) policy decision in the kernel with the delay and
unplug depth.

-- 
Jens Axboe


-------------------------------------------------------------------------
This SF.net email is sponsored by DB2 Express
Download DB2 Express C - the FREE version of DB2 express and take
control of your XML. No limits. Just data. Click to get it now.
http://sourceforge.net/powerbar/db2/

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH RFC 3/3] virtio infrastructure: example block driver
       [not found]                                 ` <20070604134322.GC32105-tSWWG44O7X1aa/9Udqfwiw@public.gmane.org>
@ 2007-06-04 13:52                                   ` Rusty Russell
       [not found]                                     ` <1180965161.25878.47.camel-bi+AKbBUZKY6gyzm1THtWbp2dZbC/Bob@public.gmane.org>
  0 siblings, 1 reply; 41+ messages in thread
From: Rusty Russell @ 2007-06-04 13:52 UTC (permalink / raw)
  To: Jens Axboe
  Cc: Jimi Xenidis, Stephen Rothwell, Xen Mailing List,
	jmk-zzFmDc4TPjtKvsKVC3L/VUEOCMrvLtNR@public.gmane.org, Herbert Xu,
	carsteno-tA70FqPdS9bQT0dZR+AlfA,
	mschwid2-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8, virtualization,
	kvm-devel, Suzanne McIntosh, Christian Borntraeger

On Mon, 2007-06-04 at 15:43 +0200, Jens Axboe wrote:
> On Mon, Jun 04 2007, Carsten Otte wrote:
> > Jens Axboe wrote:
> > >On Fri, Jun 01 2007, Carsten Otte wrote:
> > >>With regard to compute power needed, almost none. The penalty is 
> > >>latency, not overhead: A small request may sit on the request queue to 
> > >>wait for other work to arrive until the queue gets unplugged. This 
> > >>penality is compensated by the benefit of a good chance that more 
> > >>requests will be merged during this time period.
> > >>If we have this method both in host and guest, we have twice the 
> > >>penalty with no added benefit.
> > >
> > >I don't buy that argument. We can easily expose the unplug delay, so you
> > >can kill it at what ever level you want. Or you could just do it in the
> > >driver right now, but that is a bit hackish.
> > That would be preferable if the device driver can chose the unplug 
> > delay, or even better it could be (guest)sysfs tuneable.
> 
> Right. We probably should make it sysfs configurable in any case, right
> now it's a (somewhat) policy decision in the kernel with the delay and
> unplug depth.

The danger is that it's just another knob noone knows how to use.
Perhaps simply setting it to 0 for the noop scheduler will cover all
known cases?

Cheers,
Rusty.


-------------------------------------------------------------------------
This SF.net email is sponsored by DB2 Express
Download DB2 Express C - the FREE version of DB2 express and take
control of your XML. No limits. Just data. Click to get it now.
http://sourceforge.net/powerbar/db2/

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH RFC 3/3] virtio infrastructure: example block driver
       [not found]                                     ` <1180965161.25878.47.camel-bi+AKbBUZKY6gyzm1THtWbp2dZbC/Bob@public.gmane.org>
@ 2007-06-04 13:54                                       ` Jens Axboe
       [not found]                                         ` <20070604135433.GG32105-tSWWG44O7X1aa/9Udqfwiw@public.gmane.org>
  0 siblings, 1 reply; 41+ messages in thread
From: Jens Axboe @ 2007-06-04 13:54 UTC (permalink / raw)
  To: Rusty Russell
  Cc: Jimi Xenidis, Stephen Rothwell, Xen Mailing List,
	jmk-zzFmDc4TPjtKvsKVC3L/VUEOCMrvLtNR@public.gmane.org, Herbert Xu,
	carsteno-tA70FqPdS9bQT0dZR+AlfA,
	mschwid2-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8, virtualization,
	kvm-devel, Suzanne McIntosh, Christian Borntraeger

On Mon, Jun 04 2007, Rusty Russell wrote:
> On Mon, 2007-06-04 at 15:43 +0200, Jens Axboe wrote:
> > On Mon, Jun 04 2007, Carsten Otte wrote:
> > > Jens Axboe wrote:
> > > >On Fri, Jun 01 2007, Carsten Otte wrote:
> > > >>With regard to compute power needed, almost none. The penalty is 
> > > >>latency, not overhead: A small request may sit on the request queue to 
> > > >>wait for other work to arrive until the queue gets unplugged. This 
> > > >>penality is compensated by the benefit of a good chance that more 
> > > >>requests will be merged during this time period.
> > > >>If we have this method both in host and guest, we have twice the 
> > > >>penalty with no added benefit.
> > > >
> > > >I don't buy that argument. We can easily expose the unplug delay, so you
> > > >can kill it at what ever level you want. Or you could just do it in the
> > > >driver right now, but that is a bit hackish.
> > > That would be preferable if the device driver can chose the unplug 
> > > delay, or even better it could be (guest)sysfs tuneable.
> > 
> > Right. We probably should make it sysfs configurable in any case, right
> > now it's a (somewhat) policy decision in the kernel with the delay and
> > unplug depth.
> 
> The danger is that it's just another knob noone knows how to use.
> Perhaps simply setting it to 0 for the noop scheduler will cover all
> known cases?

Most people should not fiddle with it, the defaults are there for good
reason. I can provide a blk_queue_unplug_thresholds(q, depth, delay)
helper that you could use for the virtualized drivers, perhaps that
would be better for that use?

-- 
Jens Axboe


-------------------------------------------------------------------------
This SF.net email is sponsored by DB2 Express
Download DB2 Express C - the FREE version of DB2 express and take
control of your XML. No limits. Just data. Click to get it now.
http://sourceforge.net/powerbar/db2/

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH RFC 3/3] virtio infrastructure: example block driver
       [not found]                                         ` <20070604135433.GG32105-tSWWG44O7X1aa/9Udqfwiw@public.gmane.org>
@ 2007-06-04 14:12                                           ` Carsten Otte
       [not found]                                             ` <46641DC9.5050600-tA70FqPdS9bQT0dZR+AlfA@public.gmane.org>
  0 siblings, 1 reply; 41+ messages in thread
From: Carsten Otte @ 2007-06-04 14:12 UTC (permalink / raw)
  To: Jens Axboe
  Cc: Jimi Xenidis, Stephen Rothwell, Xen Mailing List,
	jmk-zzFmDc4TPjtKvsKVC3L/VUEOCMrvLtNR@public.gmane.org, Herbert Xu,
	carsteno-tA70FqPdS9bQT0dZR+AlfA,
	mschwid2-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8, virtualization,
	kvm-devel, Suzanne McIntosh, Christian Borntraeger

Jens Axboe wrote:
> Most people should not fiddle with it, the defaults are there for good
> reason. I can provide a blk_queue_unplug_thresholds(q, depth, delay)
> helper that you could use for the virtualized drivers, perhaps that
> would be better for that use?
Yeah, we shouldn't change the defaults without a good reason. That
would change things for all device drivers.
This interface provides all the functionality we need. I think we need a
knob in /sys/block/mydevice/queue/ in addition to that.

so long,
Carsten


^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH RFC 3/3] virtio infrastructure: example block driver
       [not found]                                             ` <46641DC9.5050600-tA70FqPdS9bQT0dZR+AlfA@public.gmane.org>
@ 2007-06-04 14:31                                               ` Jens Axboe
  0 siblings, 0 replies; 41+ messages in thread
From: Jens Axboe @ 2007-06-04 14:31 UTC (permalink / raw)
  To: carsteno-tA70FqPdS9bQT0dZR+AlfA
  Cc: Jimi Xenidis, Stephen Rothwell, Xen Mailing List,
	jmk-zzFmDc4TPjtKvsKVC3L/VUEOCMrvLtNR@public.gmane.org, Herbert Xu,
	kvm-devel, mschwid2-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8,
	virtualization, Christian Borntraeger, Suzanne McIntosh

On Mon, Jun 04 2007, Carsten Otte wrote:
> Jens Axboe wrote:
> >Most people should not fiddle with it, the defaults are there for good
> >reason. I can provide a blk_queue_unplug_thresholds(q, depth, delay)
> >helper that you could use for the virtualized drivers, perhaps that
> >would be better for that use?
> Yea, we should'nt change the defaults without a good reason. That 
> would change things for all device drivers.
> This interface provides all functionality we need. I think we need a 
> knob in /sys/block/mydevice/queue/ in addition to that.

Something like this, totally untested (but trivial, so it should work
:-)

diff --git a/block/elevator.c b/block/elevator.c
index ce866eb..81e2a2d 100644
--- a/block/elevator.c
+++ b/block/elevator.c
@@ -638,7 +638,7 @@ void elv_insert(request_queue_t *q, struct request *rq, int where)
 		int nrq = q->rq.count[READ] + q->rq.count[WRITE]
 			- q->in_flight;
 
-		if (nrq >= q->unplug_thresh)
+		if (nrq >= q->unplug_thresh || !q->unplug_delay)
 			__generic_unplug_device(q);
 	}
 }
diff --git a/block/ll_rw_blk.c b/block/ll_rw_blk.c
index 6b5173a..aaefb32 100644
--- a/block/ll_rw_blk.c
+++ b/block/ll_rw_blk.c
@@ -785,6 +785,30 @@ void blk_queue_dma_alignment(request_queue_t *q, int mask)
 EXPORT_SYMBOL(blk_queue_dma_alignment);
 
 /**
+ * blk_queue_unplug_threshold - set automatic unplug thresholds for the queue
+ * @q:     the request queue for the device
+ * @depth: the queue depth at which to do unplug
+ * @delay: maximum unplug timer delay
+ *
+ * Description:
+ *    Set the desired unplug depth/threshold and delay for a given queue.
+ *    The block layer has a set of good defaults for this, so this function
+ *    should ONLY be used by drivers for virtualized environments, where
+ *    you could potentially have several layers of queues that each do their
+ *    own delay.
+ *
+ *    If in doubt, don't use this function! The settings can also be
+ *    tweaked from sysfs.
+ *
+ **/
+void blk_queue_unplug_threshold(request_queue_t *q, unsigned int depth,
+				unsigned long delay)
+{
+	q->unplug_thresh = depth;
+	q->unplug_delay = delay;
+}
+
+/**
  * blk_queue_find_tag - find a request by its tag and queue
  * @q:	 The request queue for the device
  * @tag: The tag of the request
@@ -1550,7 +1574,8 @@ void blk_plug_device(request_queue_t *q)
 		return;
 
 	if (!test_and_set_bit(QUEUE_FLAG_PLUGGED, &q->queue_flags)) {
-		mod_timer(&q->unplug_timer, jiffies + q->unplug_delay);
+		if (q->unplug_delay)
+			mod_timer(&q->unplug_timer, jiffies + q->unplug_delay);
 		blk_add_trace_generic(q, NULL, 0, BLK_TA_PLUG);
 	}
 }
@@ -3975,6 +4000,54 @@ static ssize_t queue_max_hw_sectors_show(struct request_queue *q, char *page)
 	return queue_var_show(max_hw_sectors_kb, (page));
 }
 
+static ssize_t queue_unplug_delay_show(struct request_queue *q, char *page)
+{
+	return queue_var_show(jiffies_to_msecs(q->unplug_delay), page);
+}
+
+/*
+ * We don't bother rearming a running timer. It's just not worth it, the
+ * next unplug will get it right.
+ */
+static ssize_t queue_unplug_delay_store(struct request_queue *q,
+					const char *page, size_t count)
+{
+	unsigned long delay;
+	int ret;
+
+	ret = queue_var_store(&delay, page, count);
+
+	spin_lock_irq(q->queue_lock);
+	q->unplug_delay = msecs_to_jiffies(delay);
+	spin_unlock_irq(q->queue_lock);
+
+	return ret;
+}
+
+static ssize_t queue_unplug_depth_show(struct request_queue *q, char *page)
+{
+	return queue_var_show(q->unplug_thresh, page);
+}
+
+/*
+ * We don't bother unplugging if the depth was reduced and we just happened
+ * to have a current queue depth somewhere in between the old and new
+ * value.
+ */
+static ssize_t queue_unplug_depth_store(struct request_queue *q,
+					const char *page, size_t count)
+{
+	unsigned long depth;
+	int ret;
+
+	ret = queue_var_store(&depth, page, count);
+
+	spin_lock_irq(q->queue_lock);
+	q->unplug_thresh = depth;
+	spin_unlock_irq(q->queue_lock);
+
+	return ret;
+}
 
 static struct queue_sysfs_entry queue_requests_entry = {
 	.attr = {.name = "nr_requests", .mode = S_IRUGO | S_IWUSR },
@@ -4005,12 +4078,26 @@ static struct queue_sysfs_entry queue_iosched_entry = {
 	.store = elv_iosched_store,
 };
 
+static struct queue_sysfs_entry queue_unplug_depth_entry = {
+	.attr = {.name = "unplug_depth", .mode = S_IRUGO | S_IWUSR },
+	.show = queue_unplug_depth_show,
+	.store = queue_unplug_depth_store,
+};
+
+static struct queue_sysfs_entry queue_unplug_delay_entry = {
+	.attr = {.name = "unplug_delay_ms", .mode = S_IRUGO | S_IWUSR },
+	.show = queue_unplug_delay_show,
+	.store = queue_unplug_delay_store,
+};
+
 static struct attribute *default_attrs[] = {
 	&queue_requests_entry.attr,
 	&queue_ra_entry.attr,
 	&queue_max_hw_sectors_entry.attr,
 	&queue_max_sectors_entry.attr,
 	&queue_iosched_entry.attr,
+	&queue_unplug_delay_entry.attr,
+	&queue_unplug_depth_entry.attr,
 	NULL,
 };
 
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index db5b00a..04c09d6 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -747,6 +747,7 @@ extern void blk_queue_prep_rq(request_queue_t *, prep_rq_fn *pfn);
 extern void blk_queue_merge_bvec(request_queue_t *, merge_bvec_fn *);
 extern void blk_queue_dma_alignment(request_queue_t *, int);
 extern void blk_queue_softirq_done(request_queue_t *, softirq_done_fn *);
+extern void blk_queue_unplug_threshold(request_queue_t *q, unsigned int, unsigned long);
 extern struct backing_dev_info *blk_get_backing_dev_info(struct block_device *bdev);
 extern int blk_queue_ordered(request_queue_t *, unsigned, prepare_flush_fn *);
 extern void blk_queue_issue_flush_fn(request_queue_t *, issue_flush_fn *);

-- 
Jens Axboe



^ permalink raw reply related	[flat|nested] 41+ messages in thread

* Re: [Xen-devel] [PATCH RFC 1/3] virtio infrastructure
       [not found]     ` <1180779167.9228.66.camel-bi+AKbBUZKY6gyzm1THtWbp2dZbC/Bob@public.gmane.org>
@ 2007-06-04 21:14       ` Santos, Jose Renato G
       [not found]         ` <08CA2245AFCF444DB3AC415E47CC40AFBD2FD4-VylnnfFjWmASZAcGdq5asR6epYMZPwEe5NbjCUgZEJk@public.gmane.org>
  0 siblings, 1 reply; 41+ messages in thread
From: Santos, Jose Renato G @ 2007-06-04 21:14 UTC (permalink / raw)
  To: Rusty Russell
  Cc: Jimi Xenidis, Stephen Rothwell, Xen Mailing List,
	jmk-zzFmDc4TPjtKvsKVC3L/VUEOCMrvLtNR, Herbert Xu, kvm-devel,
	virtualization, Christian Borntraeger, Suzanne McIntosh,
	Martin Schwidefsky

 

> -----Original Message-----
> From: Rusty Russell [mailto:rusty-8n+1lVoiYb80n/F98K4Iww@public.gmane.org] 
> Sent: Saturday, June 02, 2007 3:13 AM
> To: Santos, Jose Renato G
> Cc: kvm-devel; Xen Mailing List; virtualization; Jimi 
> Xenidis; Stephen Rothwell; jmk-zzFmDc4TPjtKvsKVC3L/VUEOCMrvLtNR@public.gmane.org; Herbert 
> Xu; Christian Borntraeger; Suzanne McIntosh; Anthony Liguori; 
> Martin Schwidefsky
> Subject: RE: [Xen-devel] [PATCH RFC 1/3] virtio infrastructure
> 
> 
> Hi Renato!
> 
> 	This is a very good question.  For currently-existing 
> and working devices it just seems like another layer of 
> indirection, and not much code saving.
> 
> 	There are several reasons for having a common layer.  
> The obvious is to reduce the amount of code, but that's the 
> least important.  Slightly more important is that noone wants 
> to maintain and tune 5 different virtual net drivers, 5 
> different virtual block drivers, etc.  In fact, the kernel 
> community will start to rebel at some point, especially if 
> they want different optimization hooks deeper into the kernel.
> 

  Rusty,

  Thanks for clarifying your thinking. This helped me understand
  your goals better.
  I agree it would be nice to reduce the number of drivers, as it
  improves maintainability. However, I am not convinced that
  adding an I/O virtualization layer will remove the need
  for having different drivers for different virtualization
  technologies.
  It seems that we will still need specific device drivers
  for each virtualization flavor. For example,
  we will still need a specific Xen netfront
  device that talks to a backend device in dom0, using
  page grants and other Xen-specific mechanisms.
  It looks like we will still need to maintain all the virtual
  device drivers, and in addition we will now have to maintain
  another virtualization layer.
  I confess I don't know any of the other virtualization
  technologies besides Xen well. Maybe for some of them there are
  enough similarities that you could benefit from a common
  virtualization layer, but I just can't see it yet.
  
  Regards

  Renato
  
> 	We expect to see new device types (entropy device, 
> anyone?), and so the
> 5 different drivers becomes the 5 x N different drivers.  
> Eventually we may come up with common bus and probing which 
> everyone loves, but until then we can at least make it simple 
> for each virtualization technology to implement the new XYZZY device.
> 
> 	Finally, virtual I/O techniques are *moving*.  KVM 
> doesn't really have one, XenSource are discussing more 
> changes to theirs (at least for
> networking) and lguest is still looking for an efficient one. 
>  No doubt the others would love to use clean and optimal 
> Linux drivers, too (UML, S/390, Power, VMWare?).
> 
> 	When Linux doesn't provide a model of what it expects 
> from virtual devices, one end of the design process is 
> untethered: we've already seen some fairly random and 
> gratuitously different results.  If we can provide a simple 
> interface which we like it encourages new implementations to 
> produce Linux-friendly interfaces.
> 
> 	Of course, the new interface must not suck.
> 
> I hope that clarifies my thinking?
> Rusty.
> 
> 


^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [Xen-devel] [PATCH RFC 1/3] virtio infrastructure
       [not found]         ` <08CA2245AFCF444DB3AC415E47CC40AFBD2FD4-VylnnfFjWmASZAcGdq5asR6epYMZPwEe5NbjCUgZEJk@public.gmane.org>
@ 2007-06-04 21:19           ` Jeremy Fitzhardinge
       [not found]             ` <466481ED.8050209-TSDbQ3PG+2Y@public.gmane.org>
  2007-06-05  1:44           ` [Xen-devel] " Rusty Russell
  1 sibling, 1 reply; 41+ messages in thread
From: Jeremy Fitzhardinge @ 2007-06-04 21:19 UTC (permalink / raw)
  To: Santos, Jose Renato G
  Cc: Jimi Xenidis, Stephen Rothwell, Xen Mailing List,
	jmk-zzFmDc4TPjtKvsKVC3L/VUEOCMrvLtNR, Herbert Xu, kvm-devel,
	virtualization, Christian Borntraeger, Suzanne McIntosh,
	Martin Schwidefsky

Santos, Jose Renato G wrote:
>   It seems that we will still need specific devices drivers
>   for each different virtualization flavor. For example,
>   we will still need to have a specific Xen netfront 
>   device that talks to a backend device in dom0, using
>   page grants, and other Xen specific mechanisms.
>   It looks like  will still need to maintain all the virtual device 
>   drivers and in addition we will now have to maintain 
>   another virtualization layer. 
>   

The hope is that the Xen-specific elements of the driver will be
restricted to Xen-specific things like grant tables, and the bulk of the
driver and its logic can be common.  Whether that can be achieved and
still retain the full performance/features of the entirely Xen-specific
netfront driver remains to be seen.  I haven't had a chance to look at
doing any Xen-virtio glue yet, so I'm not really sure how it will work out.

    J


^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [Xen-devel] [PATCH RFC 1/3] virtio infrastructure
       [not found]         ` <08CA2245AFCF444DB3AC415E47CC40AFBD2FD4-VylnnfFjWmASZAcGdq5asR6epYMZPwEe5NbjCUgZEJk@public.gmane.org>
  2007-06-04 21:19           ` Jeremy Fitzhardinge
@ 2007-06-05  1:44           ` Rusty Russell
       [not found]             ` <1181007892.25878.112.camel-bi+AKbBUZKY6gyzm1THtWbp2dZbC/Bob@public.gmane.org>
  1 sibling, 1 reply; 41+ messages in thread
From: Rusty Russell @ 2007-06-05  1:44 UTC (permalink / raw)
  To: Santos, Jose Renato G
  Cc: Jimi Xenidis, Stephen Rothwell, Xen Mailing List,
	jmk-zzFmDc4TPjtKvsKVC3L/VUEOCMrvLtNR, Herbert Xu, kvm-devel,
	virtualization, Christian Borntraeger, Suzanne McIntosh,
	Martin Schwidefsky

On Mon, 2007-06-04 at 21:14 +0000, Santos, Jose Renato G wrote:
>   Thanks for clarifying your thinking. This helped me understand
>   your goals better.
>   I agree it would be nice to reduce the number of drivers as it
>   improves mantainability. However I am not convinced that
>   adding an IO virtualization layer will remove the need
>   for having different drivers for different virtualization
>   technologies.

>   It seems that we will still need specific devices drivers
>   for each different virtualization flavor. For example,
>   we will still need to have a specific Xen netfront 
>   device that talks to a backend device in dom0, using
>   page grants, and other Xen specific mechanisms.

Hi Renato,

	That definitely should be implementable as a virtio layer; it was one
of the design points.  I consulted with Herbert Xu early on in the
process, and I don't think it would be too painful.  The devil, of
course, is in the details.

>   It looks like  will still need to maintain all the virtual device 
>   drivers and in addition we will now have to maintain 
>   another virtualization layer. 

	That would be silly, yes.

>   I confess I don't know well any of the other virtualization
>   technologies besides Xen. Maybe for some of them there is
>   enough similarities that you could benefit from a common
>   virtualization layer, but I just can't see it yet. 

Well, S/390, PowerPC and UML all have virtual I/O already in the kernel
tree, as does Xen.  I believe VMWare have out-of-tree drivers.  KVM is
in tree, but currently doesn't have paravirtualized drivers. 
lguest is sitting in the -mm tree for merging in 2.6.23 with its own
drivers.

None of these drivers is optimal.  The Xen ones are closest, but they're
very Xen-specific and quite complex.  This is good, as it gives
virtio drivers a target to beat 8)

Cheers,
Rusty.



^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [Xen-devel] [PATCH RFC 1/3] virtio infrastructure
       [not found]             ` <466481ED.8050209-TSDbQ3PG+2Y@public.gmane.org>
@ 2007-06-05  2:05               ` Santos, Jose Renato G
       [not found]                 ` <08CA2245AFCF444DB3AC415E47CC40AFBD30B4-VylnnfFjWmASZAcGdq5asR6epYMZPwEe5NbjCUgZEJk@public.gmane.org>
  0 siblings, 1 reply; 41+ messages in thread
From: Santos, Jose Renato G @ 2007-06-05  2:05 UTC (permalink / raw)
  To: Jeremy Fitzhardinge
  Cc: Jimi Xenidis, Stephen Rothwell, Xen Mailing List,
	jmk-zzFmDc4TPjtKvsKVC3L/VUEOCMrvLtNR, Herbert Xu, kvm-devel,
	virtualization, Christian Borntraeger, Suzanne McIntosh,
	Martin Schwidefsky

 

> -----Original Message-----
> From: Jeremy Fitzhardinge [mailto:jeremy-TSDbQ3PG+2Y@public.gmane.org] 
> Sent: Monday, June 04, 2007 2:20 PM
> To: Santos, Jose Renato G
> Cc: Rusty Russell; Jimi Xenidis; Stephen Rothwell; Xen 
> Mailing List; jmk-zzFmDc4TPjtKvsKVC3L/VUEOCMrvLtNR@public.gmane.org; Herbert Xu; kvm-devel; 
> virtualization; Christian Borntraeger; Suzanne McIntosh; 
> Anthony Liguori; Martin Schwidefsky
> Subject: Re: [Xen-devel] [PATCH RFC 1/3] virtio infrastructure
> 
> Santos, Jose Renato G wrote:
> >   It seems that we will still need specific devices drivers
> >   for each different virtualization flavor. For example,
> >   we will still need to have a specific Xen netfront 
> >   device that talks to a backend device in dom0, using
> >   page grants, and other Xen specific mechanisms.
> >   It looks like  will still need to maintain all the virtual device 
> >   drivers and in addition we will now have to maintain 
> >   another virtualization layer. 
> >   
> 
> The hope is that the Xen-specific elements of the driver will 
> be restricted to Xen-specific things like grant tables, and 
> the bulk of the driver and its logic can be common.  Whether 
> that can be achieved and still retain the full 
> performance/features of the entirely Xen-specific netfront 
> driver remains to be seen.  I haven't had a chance to look at 
> doing any Xen-virtio glue yet, so I'm not really sure how it 
> will work out.
> 
>     J
> 

  Ok, if you share some common code this could be beneficial, but
  in the specific case of Xen networking I believe most of netfront
  code is Xen specific.  I think that a generic "virtual-IO" layer 
  would not be beneficial in this case, but instead it would 
  only add extra complexity to glue the layers.
  I don't know if this will be the case for other
  virtualization technologies, but I think it would be useful to  
  have a better understanding of exactly which code could
  be shared before proposing something like this to the Linux
  maintainers.
  I think it will probably not be accepted upstream if
  there is no clear evidence of code sharing. And if the sharing
  is not uniform across all virtualization technologies, 
  we should keep an option of bypassing the virtio layer when
  it does not help or when it makes code more complex (which
  I suspect will be the case for Xen networking).

  Just my 2 cents,

  Regards
  Renato


^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [Xen-devel] [PATCH RFC 1/3] virtio infrastructure
       [not found]                 ` <08CA2245AFCF444DB3AC415E47CC40AFBD30B4-VylnnfFjWmASZAcGdq5asR6epYMZPwEe5NbjCUgZEJk@public.gmane.org>
@ 2007-06-05  2:27                   ` Rusty Russell
  2007-06-05  2:49                     ` Santos, Jose Renato G
  0 siblings, 1 reply; 41+ messages in thread
From: Rusty Russell @ 2007-06-05  2:27 UTC (permalink / raw)
  To: Santos, Jose Renato G
  Cc: Jimi Xenidis, Stephen Rothwell, Jeremy Fitzhardinge,
	Xen Mailing List, jmk-zzFmDc4TPjtKvsKVC3L/VUEOCMrvLtNR,
	Herbert Xu, kvm-devel, virtualization, Christian Borntraeger,
	Suzanne McIntosh, Martin Schwidefsky

On Tue, 2007-06-05 at 02:05 +0000, Santos, Jose Renato G wrote:

> > From: Jeremy Fitzhardinge [mailto:jeremy-TSDbQ3PG+2Y@public.gmane.org] 
> > The hope is that the Xen-specific elements of the driver will 
> > be restricted to Xen-specific things like grant tables, and 
> > the bulk of the driver and its logic can be common.  Whether 
> > that can be achieved and still retain the full 
> > performance/features of the entirely Xen-specific netfront 
> > driver remains to be seen.  I haven't had a chance to look at 
> > doing any Xen-virtio glue yet, so I'm not really sure how it 
> > will work out.
> > 
> >     J
> > 
> 
>   Ok, if you share some common code this could be beneficial, but
>   in the specific case of Xen networking I believe most of netfront
>   code is Xen specific.  I think that a generic "virtual-IO" layer 
>   would not be beneficial in this case, but instead it would 
>   only add extra complexity to glue the layers.

This is precisely the plan: to code it up and look hard at it.  If
performance gets hurt, it's a lose.  If performance is equal and the
code is clearer, it's a win, and this is my goal.

Perhaps you underestimate how much of the Xen netfront driver is
actually dealing with things which *any* efficient virtio I/O netdriver
will have to wrestle with.

Thanks!
Rusty.



^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [Xen-devel] [PATCH RFC 1/3] virtio infrastructure
       [not found]             ` <1181007892.25878.112.camel-bi+AKbBUZKY6gyzm1THtWbp2dZbC/Bob@public.gmane.org>
@ 2007-06-05  2:39               ` Santos, Jose Renato G
  0 siblings, 0 replies; 41+ messages in thread
From: Santos, Jose Renato G @ 2007-06-05  2:39 UTC (permalink / raw)
  To: Rusty Russell
  Cc: Jimi Xenidis, Stephen Rothwell, Xen Mailing List,
	jmk-zzFmDc4TPjtKvsKVC3L/VUEOCMrvLtNR, Herbert Xu, kvm-devel,
	virtualization, Christian Borntraeger, Suzanne McIntosh,
	Martin Schwidefsky

 

> -----Original Message-----
> From: Rusty Russell [mailto:rusty-8n+1lVoiYb80n/F98K4Iww@public.gmane.org] 
> Sent: Monday, June 04, 2007 6:45 PM
> To: Santos, Jose Renato G
> Cc: kvm-devel; Xen Mailing List; virtualization; Jimi 
> Xenidis; Stephen Rothwell; jmk-zzFmDc4TPjtKvsKVC3L/VUEOCMrvLtNR@public.gmane.org; Herbert 
> Xu; Christian Borntraeger; Suzanne McIntosh; Anthony Liguori; 
> Martin Schwidefsky
> Subject: RE: [Xen-devel] [PATCH RFC 1/3] virtio infrastructure
> 
> On Mon, 2007-06-04 at 21:14 +0000, Santos, Jose Renato G wrote:
> >   Thanks for clarifying your thinking. This helped me understand
> >   your goals better.
> >   I agree it would be nice to reduce the number of drivers as it
> >   improves mantainability. However I am not convinced that
> >   adding an IO virtualization layer will remove the need
> >   for having different drivers for different virtualization
> >   technologies.
> 
> >   It seems that we will still need specific devices drivers
> >   for each different virtualization flavor. For example,
> >   we will still need to have a specific Xen netfront 
> >   device that talks to a backend device in dom0, using
> >   page grants, and other Xen specific mechanisms.
> 
> Hi Renato,
> 
> 	That definitely should be implementable as a virtio 
> layer; it was one of the design points.  I consulted with 
> Herbert Xu early on in the process, and I don't think it 
> would be too painful.  The devil, of course, is in the details.
> 

 Hi Rusty,
 Yes. I believe it should be implementable. The question is,
 does it achieve the desired benefits? Will the code be easier
 to maintain because it is sharing some generic code with
 other drivers? 

> >   It looks like  will still need to maintain all the virtual device 
> >   drivers and in addition we will now have to maintain 
> >   another virtualization layer. 
> 
> 	That would be silly, yes.
> 
> >   I confess I don't know well any of the other virtualization
> >   technologies besides Xen. Maybe for some of them there is
> >   enough similarities that you could benefit from a common
> >   virtualization layer, but I just can't see it yet. 
> 
> Well, S/390, PowerPC and UML both have virtual I/O already in 
> the kernel tree, as does Xen.  I believe VMWare have 
> out-of-tree drivers.  KVM is in tree, but currently doesn't 
> have paravirtualized drivers. 
> lguest is sitting in the -mm tree for merging in 2.6.23 with 
> its own drivers.
> 
> None of these drivers is optimal.  The Xen ones are closest, 
> and they're very Xen-specific and quite complex.  This is 
> good, and as it gives virtio drivers a target to beat 8)
> 

 Yes. It will be interesting to see how virtio could improve
 the situation. I hope you are right and we can simplify
 this complexity.
 I apologize if I sounded negative. 
 Just wanted to have a reality check :)

 Regards

 Renato 

> Cheers,
> Rusty.
> 
> 


^ permalink raw reply	[flat|nested] 41+ messages in thread

* RE: [PATCH RFC 1/3] virtio infrastructure
  2007-06-05  2:27                   ` Rusty Russell
@ 2007-06-05  2:49                     ` Santos, Jose Renato G
  0 siblings, 0 replies; 41+ messages in thread
From: Santos, Jose Renato G @ 2007-06-05  2:49 UTC (permalink / raw)
  To: Rusty Russell
  Cc: Jimi Xenidis, Stephen Rothwell, Jeremy Fitzhardinge,
	Xen Mailing List, jmk, Herbert Xu, kvm-devel, virtualization,
	Christian Borntraeger, Suzanne McIntosh, Anthony Liguori,
	Martin Schwidefsky

 

> -----Original Message-----
> From: Rusty Russell [mailto:rusty@rustcorp.com.au] 
> Sent: Monday, June 04, 2007 7:27 PM
> To: Santos, Jose Renato G
> Cc: Jeremy Fitzhardinge; Jimi Xenidis; Stephen Rothwell; Xen 
> Mailing List; jmk@plan9.bell-labs.com; Herbert Xu; kvm-devel; 
> virtualization; Christian Borntraeger; Suzanne McIntosh; 
> Anthony Liguori; Martin Schwidefsky
> Subject: RE: [Xen-devel] [PATCH RFC 1/3] virtio infrastructure
> 
> On Tue, 2007-06-05 at 02:05 +0000, Santos, Jose Renato G wrote:
> 
> > > From: Jeremy Fitzhardinge [mailto:jeremy@goop.org] The 
> hope is that 
> > > the Xen-specific elements of the driver will be restricted to 
> > > Xen-specific things like grant tables, and the bulk of the driver 
> > > and its logic can be common.  Whether that can be 
> achieved and still 
> > > retain the full performance/features of the entirely Xen-specific 
> > > netfront driver remains to be seen.  I haven't had a 
> chance to look 
> > > at doing any Xen-virtio glue yet, so I'm not really sure 
> how it will 
> > > work out.
> > > 
> > >     J
> > > 
> > 
> >   Ok, if you share some common code this could be beneficial, but
> >   in the specific case of Xen networking I believe most of netfront
> >   code is Xen specific.  I think that a generic "virtual-IO" layer 
> >   would not be beneficial in this case, but instead it would 
> >   only add extra complexity to glue the layers.
> 
> This is precisely the plan: to code it up and look hard at 
> it.  If performance gets hurt, it's a lose.  If performance 
> is equal and the code is clearer, it's a win, and this is my goal.
> 
  I was not really thinking about performance. I was questioning
  how much code could really be shared. I expected that just a small
  fraction of the code could be shared. So even if performance does
  not suffer, it would not be providing any benefit, or worse, it
  could make the code more complex due to interaction with an
  additional layer.

> Perhaps you underestimate how much of the Xen netfront driver 
> is actually dealing with things which *any* efficient virtio 
> I/O netdriver will have to wrestle with.
> 
  Perhaps you are right.
  I think we will know when you code it up.
  
  Regards

  Renato
  
  
> Thanks!
> Rusty.
> 
> 

^ permalink raw reply	[flat|nested] 41+ messages in thread

end of thread, other threads:[~2007-06-05  2:49 UTC | newest]

Thread overview: 41+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2007-05-31 12:19 [PATCH RFC 1/3] virtio infrastructure Rusty Russell
2007-06-01 23:35 ` Santos, Jose Renato G
2007-06-02 10:12   ` Rusty Russell
     [not found]     ` <1180779167.9228.66.camel-bi+AKbBUZKY6gyzm1THtWbp2dZbC/Bob@public.gmane.org>
2007-06-04 21:14       ` [Xen-devel] " Santos, Jose Renato G
     [not found]         ` <08CA2245AFCF444DB3AC415E47CC40AFBD2FD4-VylnnfFjWmASZAcGdq5asR6epYMZPwEe5NbjCUgZEJk@public.gmane.org>
2007-06-04 21:19           ` Jeremy Fitzhardinge
2007-06-05  2:05               ` Santos, Jose Renato G
2007-06-05  2:27                   ` Rusty Russell
2007-06-05  2:49                     ` Santos, Jose Renato G
2007-06-05  1:44           ` [Xen-devel] " Rusty Russell
2007-06-05  2:39               ` Santos, Jose Renato G
2007-05-31 12:20   ` [PATCH RFC 2/3] virtio infrastructure: example net driver Rusty Russell
2007-05-31 12:21       ` [PATCH RFC 3/3] virtio infrastructure: example block driver Rusty Russell
2007-05-31 12:57           ` Carsten Otte
2007-05-31 15:02               ` Troy Benjegerdes
2007-05-31 17:01                   ` Carsten Otte
2007-05-31 17:36                       ` Avi Kivity
2007-05-31 23:39               ` Rusty Russell
2007-06-01  7:10                   ` Carsten Otte
2007-06-01 13:13                       ` Jens Axboe
2007-06-04 13:40                           ` Carsten Otte
2007-06-04 13:43                               ` Jens Axboe
2007-06-04 13:52                                   ` Rusty Russell
2007-06-04 13:54                                       ` Jens Axboe
2007-06-04 14:12                                           ` Carsten Otte
2007-06-04 14:31                                               ` Jens Axboe
2007-06-02  9:28                       ` Rusty Russell
2007-06-04 12:53                           ` Carsten Otte
2007-05-31 12:53   ` [PATCH RFC 1/3] virtio infrastructure Carsten Otte
2007-05-31 13:01   ` Dor Laor
2007-06-02  6:30   ` Avi Kivity
2007-06-02  9:50       ` Rusty Russell
2007-06-03  5:28           ` Avi Kivity
2007-06-03 10:29               ` Rusty Russell
2007-06-03 11:39                   ` [Xen-devel] " Avi Kivity
2007-06-03 23:53                       ` Rusty Russell
2007-06-04  9:55                           ` Avi Kivity
2007-06-04 10:32                               ` Herbert Xu
2007-06-04 11:19                               ` Rusty Russell
2007-06-04 11:25                                   ` Avi Kivity
2007-06-04 11:40                                       ` Rusty Russell
2007-06-04 12:17                                       ` Herbert Xu

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox