public inbox for kvm@vger.kernel.org
* [PATCH RFC 0/3] Virtio draft II
@ 2007-06-07 12:02 Rusty Russell
  2007-06-07 12:04 ` [PATCH RFC 1/3] Virtio draft II: virtio.h Rusty Russell
       [not found] ` <1181217762.14054.192.camel-bi+AKbBUZKY6gyzm1THtWbp2dZbC/Bob@public.gmane.org>
  0 siblings, 2 replies; 26+ messages in thread
From: Rusty Russell @ 2007-06-07 12:02 UTC (permalink / raw)
  To: virtualization
  Cc: Jimi Xenidis, Stephen Rothwell, Xen Mailing List,
	jmk-zzFmDc4TPjtKvsKVC3L/VUEOCMrvLtNR@public.gmane.org, Herbert Xu,
	kvm-devel, Christian Borntraeger, Latchesar Ionkov,
	Suzanne McIntosh, Martin Schwidefsky

Hi again all,

	It turns out that networking really wants ordered requests, which the
previous patches didn't allow.  This patch changes it to a callback
mechanism; kudos to Avi.

	The downside is that locking is more complicated, and after a few dead
ends I implemented the simplest solution: the struct virtio_device
contains the spinlock to use, and it's held when your callbacks get
called.

Core changes:
1) struct virtio_device gains "lock" and "priv" fields (the latter for
the driver to use, esp. from callbacks).
2) add_outbuf and add_inbuf take a cb function ptr and void *, rather
than a used pointer.
3) lengths of buffers must now fit in an unsigned int, not long.
4) the virtio_sync wrapper is gone.

Block driver changes (+30 lines)
1) Convert to callbacks not interrupt.
2) Ensure both outbuf & inbuf have been used up before finishing
request.  This avoids potential remote access to freed memory.

Net driver changes (+6 lines)
1) Convert to callbacks.
2) Store id in skb->cb, not array.
3) Remove NET_BUFS limit: we queue packets until virtio chokes.
4) Locking should now be correct, thanks mainly to virtio changes.

BTW, the lguest implementations (very rough) are in the lguest patch
repo for your viewing pleasure:

Implementation over dumb read/write interface:
http://lguest.ozlabs.org/patches/?file/tip/new-io-lguest-readwrite.patch
Implementation using descriptors (assuming host can access guest mem):
http://lguest.ozlabs.org/patches/?file/tip/new-io-lguest-desc.patch

Cheers,
Rusty.


-------------------------------------------------------------------------
This SF.net email is sponsored by DB2 Express
Download DB2 Express C - the FREE version of DB2 express and take
control of your XML. No limits. Just data. Click to get it now.
http://sourceforge.net/powerbar/db2/


* [PATCH RFC 1/3] Virtio draft II: virtio.h
  2007-06-07 12:02 [PATCH RFC 0/3] Virtio draft II Rusty Russell
@ 2007-06-07 12:04 ` Rusty Russell
       [not found]   ` <1181217867.14054.195.camel-bi+AKbBUZKY6gyzm1THtWbp2dZbC/Bob@public.gmane.org>
       [not found] ` <1181217762.14054.192.camel-bi+AKbBUZKY6gyzm1THtWbp2dZbC/Bob@public.gmane.org>
  1 sibling, 1 reply; 26+ messages in thread
From: Rusty Russell @ 2007-06-07 12:04 UTC (permalink / raw)
  To: virtualization
  Cc: Jimi Xenidis, Stephen Rothwell, Xen Mailing List,
	jmk@plan9.bell-labs.com, Herbert Xu, kvm-devel, Avi Kivity,
	Christian Borntraeger, Latchesar Ionkov, Suzanne McIntosh,
	Martin Schwidefsky

This attempts to implement a "virtual I/O" layer which should allow
common drivers to be efficiently used across most virtual I/O
mechanisms.  It will no doubt need further enhancement.

The details of probing the device are left to hypervisor-specific
code: it simply constructs the "struct virtio_device" and hands it to
the probe function (eg. virtnet_probe() or virtblk_probe()).

The virtio drivers add and detach input and output buffers; as the
buffers are used up, their associated callbacks are invoked.

I have written two virtio device drivers (net and block) and two
virtio implementations (for lguest): a read-write socket-style
implementation, and a more efficient descriptor-based implementation.

Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
---
 include/linux/virtio.h |   75 ++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 75 insertions(+)

===================================================================
--- /dev/null
+++ b/include/linux/virtio.h
@@ -0,0 +1,75 @@
+#ifndef _LINUX_VIRTIO_H
+#define _LINUX_VIRTIO_H
+#include <linux/types.h>
+#include <linux/scatterlist.h>
+#include <linux/spinlock.h>
+
+/**
+ * virtio_device - description and routines to drive a virtual device.
+ * @lock: the lock to hold before calling any functions.
+ * @dev: the underlying struct device.
+ * @ops: the operations for this virtual device.
+ * @priv: private pointer for the driver to use.
+ */
+struct virtio_device {
+	spinlock_t lock;
+	struct device *dev;
+	struct virtio_ops *ops;
+	void *priv;
+};
+
+/**
+ * virtio_ops - virtio abstraction layer
+ * @add_outbuf: prepare to send data to the other end:
+ *	vdev: the virtio_device
+ *	sg: the description of the buffer(s).
+ *	num: the size of the sg array.
+ *	cb: the function to call once the outbuf is finished & detached.
+ *	data: the token to hand to the cb function.
+ *      Returns a unique id or an error.  Note that the callback will be
+ *	called with the lock held, and possibly in an interrupt handler.
+ * @add_inbuf: prepare to receive data from the other end:
+ *	vdev: the virtio_device
+ *	sg: the description of the buffer(s).
+ *	num: the size of the sg array.
+ *	cb: the function to call once the inbuf is finished & detached.
+ *	data: the token to hand to the cb function.
+ *      Returns a unique id or an error (eg. -ENOSPC).  Note that the
+ *	callback will be called with the lock held, and possibly in an
+ *	interrupt handler.
+ * @sync: update after add_inbuf/add_outbuf
+ *	vdev: the virtio_device we're talking about.
+ *	After one or more add_inbuf/add_outbuf calls, invoke this to kick
+ *	the virtio layer.
+ * @detach_outbuf: make sure sent sg can no longer be read.
+ *	vdev: the virtio_device we're talking about.
+ *	id: the id returned from add_outbuf.
+ *	This is not necessary (or valid!) if the outbuf callback has
+ *	already fired.
+ * @detach_inbuf: make sure sent sg can no longer be written to.
+ *	vdev: the virtio_device we're talking about.
+ *	id: the id returned from add_inbuf.
+ *	This is not necessary (or valid!) if the inbuf callback has
+ *	already fired.
+ */
+struct virtio_ops {
+	unsigned long (*add_outbuf)(struct virtio_device *vdev,
+				    const struct scatterlist sg[],
+				    unsigned int num,
+				    void (*cb)(struct virtio_device *vdev,
+					       void *data, unsigned len),
+				    void *data);
+
+	unsigned long (*add_inbuf)(struct virtio_device *vdev,
+				   struct scatterlist sg[],
+				   unsigned int num,
+				   void (*cb)(struct virtio_device *vdev,
+					      void *data, unsigned len),
+				   void *data);
+
+	void (*sync)(struct virtio_device *vdev);
+
+	void (*detach_outbuf)(struct virtio_device *vdev, unsigned long id);
+	void (*detach_inbuf)(struct virtio_device *vdev, unsigned long id);
+};
+#endif /* _LINUX_VIRTIO_H */


* [PATCH RFC 2/3] Virtio draft II: example block driver
       [not found]   ` <1181217867.14054.195.camel-bi+AKbBUZKY6gyzm1THtWbp2dZbC/Bob@public.gmane.org>
@ 2007-06-07 12:05     ` Rusty Russell
       [not found]       ` <1181217920.14054.196.camel-bi+AKbBUZKY6gyzm1THtWbp2dZbC/Bob@public.gmane.org>
  0 siblings, 1 reply; 26+ messages in thread
From: Rusty Russell @ 2007-06-07 12:05 UTC (permalink / raw)
  To: virtualization
  Cc: Jimi Xenidis, Stephen Rothwell, Xen Mailing List,
	jmk-zzFmDc4TPjtKvsKVC3L/VUEOCMrvLtNR@public.gmane.org, Herbert Xu,
	kvm-devel, Christian Borntraeger, Latchesar Ionkov,
	Suzanne McIntosh, Martin Schwidefsky

The block driver uses outbufs with sg[0] being the request information
(struct virtio_blk_outhdr) with the type, sector and inbuf id.  For a
write, the rest of the sg will contain the data to be written.

The first segment of the inbuf is a result code (struct
virtio_blk_inhdr).  For a read, the rest of the sg points to the input
buffer.

TODO:
	1) Ordered tag support.

Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
---
 drivers/block/Makefile     |    1 
 drivers/block/virtio_blk.c |  299 ++++++++++++++++++++++++++++++++++++++++++++
 include/linux/virtio_blk.h |   28 ++++
 3 files changed, 328 insertions(+)

===================================================================
--- a/drivers/block/Makefile
+++ b/drivers/block/Makefile
@@ -20,6 +20,7 @@ obj-$(CONFIG_BLK_CPQ_CISS_DA)  += cciss.
 obj-$(CONFIG_BLK_CPQ_CISS_DA)  += cciss.o
 obj-$(CONFIG_BLK_DEV_DAC960)	+= DAC960.o
 obj-$(CONFIG_CDROM_PKTCDVD)	+= pktcdvd.o
+obj-y				+= virtio_blk.o
 
 obj-$(CONFIG_BLK_DEV_UMEM)	+= umem.o
 obj-$(CONFIG_BLK_DEV_NBD)	+= nbd.o
===================================================================
--- /dev/null
+++ b/drivers/block/virtio_blk.c
@@ -0,0 +1,299 @@
+//#define DEBUG
+#include <linux/spinlock.h>
+#include <linux/blkdev.h>
+#include <linux/hdreg.h>
+#include <linux/virtio.h>
+#include <linux/virtio_blk.h>
+
+static unsigned char virtblk_index = 'a';
+struct virtio_blk
+{
+	struct virtio_device *vdev;
+
+	/* The disk structure for the kernel. */
+	struct gendisk *disk;
+
+	/* Request tracking. */
+	struct list_head reqs;
+
+	mempool_t *pool;
+
+	/* Scatterlist: can be too big for stack. */
+	struct scatterlist sg[1+MAX_PHYS_SEGMENTS];
+};
+
+struct virtblk_req
+{
+	struct list_head list;
+	struct request *req;
+	unsigned long out_id;
+	bool out_done, in_done;
+	bool failed;
+	struct virtio_blk_outhdr out_hdr;
+	struct virtio_blk_inhdr in_hdr;
+};
+
+/* Jens gave me this nice helper to end all chunks of a request. */
+static void end_dequeued_request(struct request *req, int uptodate)
+{
+	if (end_that_request_first(req, uptodate, req->hard_nr_sectors))
+		BUG();
+	add_disk_randomness(req->rq_disk);
+	end_that_request_last(req, uptodate);
+}
+
+static void finish(struct virtio_blk *vblk, struct virtblk_req *vbr)
+{
+	end_dequeued_request(vbr->req, !vbr->failed);
+	list_del(&vbr->list);
+	mempool_free(vbr, vblk->pool);
+	/* In case queue is stopped waiting for more buffers. */
+	blk_start_queue(vblk->disk->queue);
+}
+
+/* We make sure they finished both the input and output buffers: otherwise
+ * they might still have read access after we free them. */
+static void blk_out_done(struct virtio_device *vdev, void *_vbr, unsigned len)
+{
+	struct virtblk_req *vbr = _vbr;
+	struct virtio_blk *vblk = vdev->priv;
+
+	assert_spin_locked(&vblk->vdev->lock);
+
+	BUG_ON(vbr->out_done);
+	vbr->out_done = true;
+	if (vbr->in_done)
+		finish(vblk, vbr);
+}
+
+static void blk_in_done(struct virtio_device *vdev, void *_vbr, unsigned len)
+{
+	struct virtblk_req *vbr = _vbr;
+	struct virtio_blk *vblk = vdev->priv;
+	unsigned long expected_len;
+
+	assert_spin_locked(&vblk->vdev->lock);
+
+	expected_len = sizeof(vbr->in_hdr);
+	if (vbr->out_hdr.type == READ)
+		expected_len += vbr->req->hard_nr_sectors*512;
+
+	if (unlikely(len != expected_len)) {
+		dev_err(vblk->vdev->dev, "short reply %u not %lu\n",
+			len, expected_len);
+		vbr->failed = true;
+	} else if (unlikely(vbr->in_hdr.status != 1)) {
+		vbr->failed = true;
+	}
+
+	BUG_ON(vbr->in_done);
+	vbr->in_done = true;
+	if (vbr->out_done)
+		finish(vblk, vbr);
+}
+
+static bool do_write(request_queue_t *q, struct virtio_blk *vblk,
+		     struct virtblk_req *vbr)
+{
+	unsigned long num;
+
+	/* Set up for reply. */
+	vblk->sg[0].page = virt_to_page(&vbr->in_hdr);
+	vblk->sg[0].offset = offset_in_page(&vbr->in_hdr);
+	vblk->sg[0].length = sizeof(vbr->in_hdr);
+	vbr->out_hdr.id = vblk->vdev->ops->add_inbuf(vblk->vdev, vblk->sg, 1,
+						     blk_in_done, vbr);
+	if (IS_ERR_VALUE(vbr->out_hdr.id))
+		goto full;
+
+	/* First sg element points to output header. */
+	vblk->sg[0].page = virt_to_page(&vbr->out_hdr);
+	vblk->sg[0].offset = offset_in_page(&vbr->out_hdr);
+	vblk->sg[0].length = sizeof(vbr->out_hdr);
+
+	num = blk_rq_map_sg(q, vbr->req, vblk->sg+1);
+	vbr->out_done = vbr->in_done = false;
+	vbr->failed = false;
+	vbr->out_id = vblk->vdev->ops->add_outbuf(vblk->vdev, vblk->sg, 1+num,
+						  blk_out_done, vbr);
+	if (IS_ERR_VALUE(vbr->out_id))
+		goto detach_inbuf_full;
+
+	pr_debug("Write: %p in=%lu out=%lu\n", vbr,
+		 vbr->out_hdr.id, vbr->out_id);
+	list_add_tail(&vbr->list, &vblk->reqs);
+	return true;
+
+detach_inbuf_full:
+	vblk->vdev->ops->detach_inbuf(vblk->vdev, vbr->out_hdr.id);
+full:
+	return false;
+}
+
+static bool do_read(request_queue_t *q, struct virtio_blk *vblk,
+		    struct virtblk_req *vbr)
+{
+	unsigned long num;
+
+	/* Set up for reply. */
+	vblk->sg[0].page = virt_to_page(&vbr->in_hdr);
+	vblk->sg[0].offset = offset_in_page(&vbr->in_hdr);
+	vblk->sg[0].length = sizeof(vbr->in_hdr);
+	num = blk_rq_map_sg(q, vbr->req, vblk->sg+1);
+	vbr->out_hdr.id = vblk->vdev->ops->add_inbuf(vblk->vdev, vblk->sg,
+						     1+num, blk_in_done, vbr);
+	if (IS_ERR_VALUE(vbr->out_hdr.id))
+		goto full;
+
+	vblk->sg[0].page = virt_to_page(&vbr->out_hdr);
+	vblk->sg[0].offset = offset_in_page(&vbr->out_hdr);
+	vblk->sg[0].length = sizeof(vbr->out_hdr);
+
+	vbr->out_done = vbr->in_done = false;
+	vbr->failed = false;
+	vbr->out_id = vblk->vdev->ops->add_outbuf(vblk->vdev, vblk->sg, 1,
+						  blk_out_done, vbr);
+	if (IS_ERR_VALUE(vbr->out_id))
+		goto detach_inbuf_full;
+
+	pr_debug("Read: %p in=%lu out=%lu\n", vbr,
+		 vbr->out_hdr.id, vbr->out_id);
+	list_add_tail(&vbr->list, &vblk->reqs);
+	return true;
+
+detach_inbuf_full:
+	vblk->vdev->ops->detach_inbuf(vblk->vdev, vbr->out_hdr.id);
+full:
+	return false;
+}
+
+static void do_virtblk_request(request_queue_t *q)
+{
+	struct virtio_blk *vblk = NULL;
+	struct request *req;
+	struct virtblk_req *vbr;
+
+	while ((req = elv_next_request(q)) != NULL) {
+		vblk = req->rq_disk->private_data;
+
+		/* FIXME: handle these iff capable. */
+		if (!blk_fs_request(req)) {
+			pr_debug("Got non-command 0x%08x\n", req->cmd_type);
+			req->errors++;
+			blkdev_dequeue_request(req);
+			end_dequeued_request(req, 0);
+			continue;
+		}
+
+		vbr = mempool_alloc(vblk->pool, GFP_ATOMIC);
+		if (!vbr)
+			goto stop;
+
+		BUG_ON(req->nr_phys_segments > ARRAY_SIZE(vblk->sg));
+		vbr->req = req;
+		vbr->out_hdr.type = rq_data_dir(req);
+		vbr->out_hdr.sector = req->sector;
+
+		if (rq_data_dir(req) == WRITE) {
+			if (!do_write(q, vblk, vbr))
+				goto stop;
+		} else {
+			if (!do_read(q, vblk, vbr))
+				goto stop;
+		}
+		blkdev_dequeue_request(req);
+	}
+
+sync:
+	if (vblk)
+		vblk->vdev->ops->sync(vblk->vdev);
+	return;
+
+stop:
+	/* Queue full?  Wait. */
+	blk_stop_queue(q);
+	mempool_free(vbr, vblk->pool);
+	goto sync;
+}
+
+static struct block_device_operations virtblk_fops = {
+	.owner = THIS_MODULE,
+};
+
+struct gendisk *virtblk_probe(struct virtio_device *vdev)
+{
+	struct virtio_blk *vblk;
+	int err, major;
+
+	vblk = kmalloc(sizeof(*vblk), GFP_KERNEL);
+	if (!vblk) {
+		err = -ENOMEM;
+		goto out;
+	}
+
+	INIT_LIST_HEAD(&vblk->reqs);
+	vblk->vdev = vdev;
+	vdev->priv = vblk;
+
+	vblk->pool = mempool_create_kmalloc_pool(1,sizeof(struct virtblk_req));
+	if (!vblk->pool) {
+		err = -ENOMEM;
+		goto out_free_vblk;
+	}
+
+	major = register_blkdev(0, "virtblk");
+	if (major < 0) {
+		err = major;
+		goto out_mempool;
+	}
+
+	/* FIXME: How many partitions?  How long is a piece of string? */
+	vblk->disk = alloc_disk(1 << 3);
+	if (!vblk->disk) {
+		err = -ENOMEM;
+		goto out_unregister_blkdev;
+	}
+
+	vblk->disk->queue = blk_init_queue(do_virtblk_request,
+					   &vblk->vdev->lock);
+	if (!vblk->disk->queue) {
+		err = -ENOMEM;
+		goto out_put_disk;
+	}
+
+	sprintf(vblk->disk->disk_name, "vb%c", virtblk_index++);
+	vblk->disk->major = major;
+	vblk->disk->first_minor = 0;
+	vblk->disk->private_data = vblk;
+	vblk->disk->fops = &virtblk_fops;
+
+	/* Caller can do blk_queue_max_hw_segments(), set_capacity()
+	 * etc then add_disk(). */
+	return vblk->disk;
+
+out_put_disk:
+	put_disk(vblk->disk);
+out_unregister_blkdev:
+	unregister_blkdev(major, "virtblk");
+out_mempool:
+	mempool_destroy(vblk->pool);
+out_free_vblk:
+	kfree(vblk);
+out:
+	return ERR_PTR(err);
+}
+EXPORT_SYMBOL_GPL(virtblk_probe);
+
+void virtblk_remove(struct gendisk *disk)
+{
+	struct virtio_blk *vblk = disk->private_data;
+	int major = vblk->disk->major;
+
+	BUG_ON(!list_empty(&vblk->reqs));
+	blk_cleanup_queue(vblk->disk->queue);
+	put_disk(vblk->disk);
+	unregister_blkdev(major, "virtblk");
+	mempool_destroy(vblk->pool);
+	kfree(vblk);
+}
+EXPORT_SYMBOL_GPL(virtblk_remove);
===================================================================
--- /dev/null
+++ b/include/linux/virtio_blk.h
@@ -0,0 +1,28 @@
+#ifndef _LINUX_VIRTIO_BLK_H
+#define _LINUX_VIRTIO_BLK_H
+#include <linux/types.h>
+struct gendisk;
+struct virtio_device;
+struct hd_geometry;
+
+/* This is the first element of the scatter-gather list. */
+struct virtio_blk_outhdr
+{
+	/* 0 == read, 1 == write */
+	u32 type;
+	/* Sector (ie. 512 byte offset) */
+	unsigned long sector;
+	/* Where to put reply. */
+	unsigned long id;
+};
+
+struct virtio_blk_inhdr
+{
+	/* 1 = OK, 0 = not ok. */
+	unsigned long status;
+};
+
+struct gendisk *virtblk_probe(struct virtio_device *vdev);
+void virtblk_remove(struct gendisk *disk);
+
+#endif /* _LINUX_VIRTIO_BLK_H */





* [PATCH RFC 3/3] Virtio draft II: example net driver
       [not found]       ` <1181217920.14054.196.camel-bi+AKbBUZKY6gyzm1THtWbp2dZbC/Bob@public.gmane.org>
@ 2007-06-07 12:07         ` Rusty Russell
  2007-06-07 12:19         ` [PATCH RFC 2/3] Virtio draft II: example block driver Avi Kivity
  1 sibling, 0 replies; 26+ messages in thread
From: Rusty Russell @ 2007-06-07 12:07 UTC (permalink / raw)
  To: virtualization
  Cc: Jimi Xenidis, Stephen Rothwell, Xen Mailing List,
	jmk-zzFmDc4TPjtKvsKVC3L/VUEOCMrvLtNR@public.gmane.org, Herbert Xu,
	kvm-devel, Christian Borntraeger, Latchesar Ionkov,
	Suzanne McIntosh, Martin Schwidefsky

TODO:
	1) NAPI (will require way of suppressing virtio callbacks)
	2) GSO.
	3) Checksum options.

Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
---
 drivers/net/Makefile       |    2 
 drivers/net/virtio_net.c   |  243 ++++++++++++++++++++++++++++++++++++++++++++
 include/linux/virtio_net.h |   12 ++
 3 files changed, 256 insertions(+), 1 deletion(-)

===================================================================
--- a/drivers/net/Makefile
+++ b/drivers/net/Makefile
@@ -37,7 +37,7 @@ obj-$(CONFIG_CASSINI) += cassini.o
 
 obj-$(CONFIG_MACE) += mace.o
 obj-$(CONFIG_BMAC) += bmac.o
-
+obj-y += virtio_net.o
 obj-$(CONFIG_DGRS) += dgrs.o
 obj-$(CONFIG_VORTEX) += 3c59x.o
 obj-$(CONFIG_TYPHOON) += typhoon.o
===================================================================
--- /dev/null
+++ b/drivers/net/virtio_net.c
@@ -0,0 +1,243 @@
+/* A simple network driver using virtio.
+ *
+ * Copyright 2007 Rusty Russell <rusty@rustcorp.com.au> IBM Corporation
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write to the Free Software
+ * Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA  02111-1307  USA
+ */
+//#define DEBUG
+#include <linux/netdevice.h>
+#include <linux/etherdevice.h>
+#include <linux/module.h>
+#include <linux/virtio.h>
+#include <linux/scatterlist.h>
+
+/* FIXME: Make dynamic */
+#define MAX_PACKET_LEN (ETH_HLEN+ETH_DATA_LEN)
+
+#define SKB_ID(skb) (*(unsigned long *)(skb)->cb)
+
+struct virtnet_info
+{
+	struct virtio_device *vdev;
+	struct net_device *ndev;
+
+	/* Receive & send queues. */
+	struct sk_buff_head recv;
+	struct sk_buff_head send;
+
+	/* Transmitted packets waiting to be freed */
+	struct sk_buff_head free;
+};
+
+static void skb_xmit_done(struct virtio_device *vdev, void *_skb, unsigned len)
+{
+	struct virtnet_info *vi = vdev->priv;
+	struct sk_buff *skb = _skb;
+
+	assert_spin_locked(&vdev->lock);
+
+	__skb_unlink(skb, &vi->send);
+	vi->ndev->stats.tx_bytes += len;
+	vi->ndev->stats.tx_packets++;
+	__skb_queue_head(&vi->free, skb);
+
+	pr_debug("Sent skb %p\n", skb);
+	/* In case we were waiting for output buffers. */
+	netif_wake_queue(vi->ndev);
+}
+
+static void receive_skb(struct net_device *dev, struct sk_buff *skb,
+			unsigned len)
+{
+	if (unlikely(len < ETH_HLEN)) {
+		pr_debug("%s: short packet %i\n", dev->name, len);
+		dev->stats.rx_length_errors++;
+		dev_kfree_skb(skb);
+		return;
+	}
+	BUG_ON(len > MAX_PACKET_LEN);
+
+	skb_trim(skb, len);
+	skb->protocol = eth_type_trans(skb, dev);
+	pr_debug("Receiving skb proto 0x%04x len %i type %i\n",
+		 ntohs(skb->protocol), skb->len, skb->pkt_type);
+	dev->stats.rx_bytes += skb->len;
+	dev->stats.rx_packets++;
+	netif_rx(skb);
+}
+
+static void skb_recv_done(struct virtio_device *, void *, unsigned);
+static int try_fill_recv(struct virtnet_info *vi)
+{
+	struct sk_buff *skb;
+	struct scatterlist sg[MAX_SKB_FRAGS];
+	unsigned long num, id;
+
+	assert_spin_locked(&vi->vdev->lock);
+
+	skb = netdev_alloc_skb(vi->ndev, MAX_PACKET_LEN);
+	if (unlikely(!skb))
+		return -ENOMEM;
+
+	skb_put(skb, MAX_PACKET_LEN);
+	num = skb_to_sgvec(skb, sg, 0, skb->len);
+	skb_queue_head(&vi->recv, skb);
+
+	id = vi->vdev->ops->add_inbuf(vi->vdev, sg, num, skb_recv_done, skb);
+	if (IS_ERR_VALUE(id)) {
+		skb_unlink(skb, &vi->recv);
+		kfree_skb(skb);
+		return id;
+	}
+	SKB_ID(skb) = id;	/* recorded so virtnet_close() can detach it */
+	return 0;
+}
+
+static void skb_recv_done(struct virtio_device *vdev, void *_skb, unsigned len)
+{
+	struct virtnet_info *vi = vdev->priv;
+	struct sk_buff *skb = _skb;
+
+	assert_spin_locked(&vdev->lock);
+	__skb_unlink(skb, &vi->recv);
+	receive_skb(vi->ndev, skb, len);
+	try_fill_recv(vi);
+}
+
+static void free_old_skbs(struct sk_buff_head *free)
+{
+	struct sk_buff *skb;
+	while ((skb = __skb_dequeue(free)) != NULL)
+		kfree_skb(skb);
+}
+
+static int start_xmit(struct sk_buff *skb, struct net_device *dev)
+{
+	struct virtnet_info *vi = netdev_priv(dev);
+	unsigned long num, id;
+	struct scatterlist sg[MAX_SKB_FRAGS];
+	const unsigned char *dest = ((struct ethhdr *)skb->data)->h_dest;
+	unsigned long flags;
+
+	pr_debug("%s: xmit %p %02x:%02x:%02x:%02x:%02x:%02x\n",
+		 dev->name, skb,
+		 dest[0], dest[1], dest[2], dest[3], dest[4], dest[5]);
+
+	spin_lock_irqsave(&vi->vdev->lock, flags);
+	/* Free any transmitted packets: not supposed to do it in interrupt */
+	free_old_skbs(&vi->free);
+
+	num = skb_to_sgvec(skb, sg, 0, skb->len);
+	__skb_queue_head(&vi->send, skb);
+	id = vi->vdev->ops->add_outbuf(vi->vdev, sg, num, skb_xmit_done, skb);
+	if (IS_ERR_VALUE(id)) {
+		pr_debug("%s: virtio not prepared to send\n", dev->name);
+		skb_unlink(skb, &vi->send);
+		netif_stop_queue(dev);
+		return NETDEV_TX_BUSY;
+	}
+	SKB_ID(skb) = id;
+	vi->vdev->ops->sync(vi->vdev);
+	spin_unlock_irqrestore(&vi->vdev->lock, flags);
+
+	return 0;
+}
+
+static int virtnet_open(struct net_device *dev)
+{
+	struct virtnet_info *vi = netdev_priv(dev);
+	int i, err;
+
+	spin_lock_irq(&vi->vdev->lock);
+	for (i = 0; (err = try_fill_recv(vi)) == 0; i++);
+	vi->vdev->ops->sync(vi->vdev);
+	spin_unlock_irq(&vi->vdev->lock);
+
+	/* If we didn't even get one input buffer, we're useless. */
+	if (i == 0)
+		return err;
+
+	return 0;
+}
+
+static int virtnet_close(struct net_device *dev)
+{
+	struct virtnet_info *vi = netdev_priv(dev);
+	struct sk_buff *skb;
+
+	spin_lock_irq(&vi->vdev->lock);
+	while ((skb = __skb_dequeue(&vi->recv)) != NULL) {
+		vi->vdev->ops->detach_inbuf(vi->vdev, SKB_ID(skb));
+		kfree_skb(skb);
+	}
+	while ((skb = __skb_dequeue(&vi->send)) != NULL) {
+		vi->vdev->ops->detach_outbuf(vi->vdev, SKB_ID(skb));
+		kfree_skb(skb);
+	}
+	free_old_skbs(&vi->free);
+	spin_unlock_irq(&vi->vdev->lock);
+	return 0;
+}
+
+struct net_device *virtnet_probe(struct virtio_device *vdev,
+				 const u8 mac[ETH_ALEN])
+{
+	int err;
+	struct net_device *dev;
+	struct virtnet_info *vi;
+
+	dev = alloc_etherdev(sizeof(struct virtnet_info));
+	if (!dev)
+		return ERR_PTR(-ENOMEM);
+
+	SET_MODULE_OWNER(dev);
+
+	ether_setup(dev);
+	memcpy(dev->dev_addr, mac, ETH_ALEN);
+	dev->open = virtnet_open;
+	dev->stop = virtnet_close;
+	dev->hard_start_xmit = start_xmit;
+	SET_NETDEV_DEV(dev, vdev->dev);
+
+	vi = netdev_priv(dev);
+	vi->vdev = vdev;
+	vi->ndev = dev;
+	vdev->priv = vi;
+	skb_queue_head_init(&vi->recv);
+	skb_queue_head_init(&vi->send);
+	skb_queue_head_init(&vi->free);
+
+	err = register_netdev(dev);
+	if (err) {
+		pr_debug("virtio_net: registering device failed\n");
+		goto free;
+	}
+	pr_debug("virtnet: registered device %s\n", dev->name);
+	return dev;
+
+free:
+	free_netdev(dev);
+	return ERR_PTR(err);
+}
+EXPORT_SYMBOL_GPL(virtnet_probe);
+
+void virtnet_remove(struct net_device *dev)
+{
+	unregister_netdev(dev);
+	free_netdev(dev);
+}
+EXPORT_SYMBOL_GPL(virtnet_remove);
+
+MODULE_DESCRIPTION("Virtio network driver");
+MODULE_LICENSE("GPL");
===================================================================
--- /dev/null
+++ b/include/linux/virtio_net.h
@@ -0,0 +1,12 @@
+#ifndef _LINUX_VIRTIO_NET_H
+#define _LINUX_VIRTIO_NET_H
+#include <linux/types.h>
+#include <linux/etherdevice.h>
+struct net_device;
+struct virtio_device;
+
+struct net_device *virtnet_probe(struct virtio_device *vdev,
+				 const u8 mac[ETH_ALEN]);
+void virtnet_remove(struct net_device *dev);
+
+#endif /* _LINUX_VIRTIO_NET_H */





* Re: [PATCH RFC 2/3] Virtio draft II: example block driver
       [not found]       ` <1181217920.14054.196.camel-bi+AKbBUZKY6gyzm1THtWbp2dZbC/Bob@public.gmane.org>
  2007-06-07 12:07         ` [PATCH RFC 3/3] Virtio draft II: example net driver Rusty Russell
@ 2007-06-07 12:19         ` Avi Kivity
       [not found]           ` <4667F7C0.3070604-atKUWr5tajBWk0Htik3J/w@public.gmane.org>
  1 sibling, 1 reply; 26+ messages in thread
From: Avi Kivity @ 2007-06-07 12:19 UTC (permalink / raw)
  To: Rusty Russell
  Cc: Jimi Xenidis, Stephen Rothwell, Xen Mailing List,
	jmk-zzFmDc4TPjtKvsKVC3L/VUEOCMrvLtNR@public.gmane.org, Herbert Xu,
	kvm-devel, virtualization, Christian Borntraeger,
	Latchesar Ionkov, Suzanne McIntosh, Martin Schwidefsky

Rusty Russell wrote:
> +	sprintf(vblk->disk->disk_name, "vb%c", virtblk_index++);
>   

vd%c, in keeping with hd%c and sd%c?

-- 
error compiling committee.c: too many arguments to function




* Re: [PATCH RFC 2/3] Virtio draft II: example block driver
       [not found]           ` <4667F7C0.3070604-atKUWr5tajBWk0Htik3J/w@public.gmane.org>
@ 2007-06-08  1:11             ` Rusty Russell
  0 siblings, 0 replies; 26+ messages in thread
From: Rusty Russell @ 2007-06-08  1:11 UTC (permalink / raw)
  To: Avi Kivity
  Cc: Jimi Xenidis, Stephen Rothwell, Xen Mailing List,
	jmk-zzFmDc4TPjtKvsKVC3L/VUEOCMrvLtNR@public.gmane.org, Herbert Xu,
	kvm-devel, virtualization, Christian Borntraeger,
	Latchesar Ionkov, Suzanne McIntosh, Jens Axboe,
	Martin Schwidefsky

On Thu, 2007-06-07 at 15:19 +0300, Avi Kivity wrote:
> Rusty Russell wrote:
> > +	sprintf(vblk->disk->disk_name, "vb%c", virtblk_index++);
> >   
> 
> vd%c, in keeping with hd%c and sd%c?

Should work, since iSeries' devices are /dev/iseries/vd%c.  Is there
already a magic routine to take the base and assign the a... aaa etc?

Rusty.






* [PATCH RFC 0/3] Virtio draft III
       [not found] ` <1181217762.14054.192.camel-bi+AKbBUZKY6gyzm1THtWbp2dZbC/Bob@public.gmane.org>
@ 2007-06-16 13:12   ` Rusty Russell
       [not found]     ` <1181999552.6237.255.camel-bi+AKbBUZKY6gyzm1THtWbp2dZbC/Bob@public.gmane.org>
  0 siblings, 1 reply; 26+ messages in thread
From: Rusty Russell @ 2007-06-16 13:12 UTC (permalink / raw)
  To: virtualization
  Cc: Stephen Rothwell, Xen Mailing List,
	jmk-zzFmDc4TPjtKvsKVC3L/VUEOCMrvLtNR@public.gmane.org, Herbert Xu,
	kvm-devel, Christian Borntraeger, Latchesar Ionkov,
	Suzanne McIntosh, Martin Schwidefsky

In this episode, Rusty tries to NAPI-ize the driver and discovers that
virtio callbacks are a bad idea: NAPI needs to turn interrupts off and
still be able to query for new incoming packets.

Changes to core:
1) Back to "interrupt" model with get_inbuf()/get_outbuf() calls.
2) Clearer rules for locking: in calls cannot overlap, out calls cannot
overlap, but in can overlap out.
3) Methods for suppressing/enabling "interrupt" calls.

Changes to example net driver:
1) NAPI, locking is now correct (and there is none)

Changes to example block driver:
1) Relay SCSI ioctls (particularly CDROMEJECT) for optional server
support (VIRTIO_BLK_T_SCSI_CMD).
2) /dev/vb -> /dev/vd.
3) Barrier support.
4) Header w/ definitions can be included from userspace.

Here's the inter-diff for the three:

diff -u b/include/linux/virtio.h b/include/linux/virtio.h
--- b/include/linux/virtio.h
+++ b/include/linux/virtio.h
@@ -7,69 +7,109 @@
 /**
  * virtio_device - description and routines to drive a virtual device.
- * @lock: the lock to hold before calling any functions.
  * @dev: the underlying struct device.
  * @ops: the operations for this virtual device.
+ * @driver_ops: set by the driver for callbacks.
  * @priv: private pointer for the driver to use.
  */
 struct virtio_device {
-	spinlock_t lock;
 	struct device *dev;
 	struct virtio_ops *ops;
+	struct virtio_driver_ops *driver_ops;
 	void *priv;
 };
 
 /**
+ * virtio_driver_ops - driver callbacks for a virtual device.
+ * @in: inbufs have been completed.
+ *	Usually called from an interrupt handler.
+ *	Return false to suppress further inbuf callbacks.
+ * @out: outbufs have been completed.
+ *	Usually called from an interrupt handler.
+ *	Return false to suppress further outbuf callbacks.
+ */
+struct virtio_driver_ops {
+	bool (*in)(struct virtio_device *dev);
+	bool (*out)(struct virtio_device *dev);
+};
+
+enum virtio_dir {
+	VIRTIO_IN = 0x1,
+	VIRTIO_OUT = 0x2,
+};
+
+/**
  * virtio_ops - virtio abstraction layer
  * @add_outbuf: prepare to send data to the other end:
  *	vdev: the virtio_device
  *	sg: the description of the buffer(s).
  *	num: the size of the sg array.
- *	cb: the function to call once the outbuf is finished & detached.
- *	data: the token to hand to the cb function.
- *      Returns a unique id or an error.  Note that the callback will be
- *	called with the lock held, and possibly in an interrupt handler.
+ *	data: the token returned by the get_outbuf function.
+ *      Returns a unique id or an error.
  * @add_inbuf: prepare to receive data from the other end:
  *	vdev: the virtio_device
  *	sg: the description of the buffer(s).
  *	num: the size of the sg array.
- *	cb: the function to call once the inbuf is finished & detached.
- *	data: the token to hand to the cb function.
- *      Returns a unique id or an error (eg. -ENOSPC).  Note that the
- *	callback will be called with the lock held, and possibly in an
- *	interrupt handler.
- * @sync: update after add_inbuf/add_outbuf
+ *	data: the token returned by the get_inbuf function.
+ *      Returns a unique id or an error (eg. -ENOSPC).
+ * @sync: update after add_inbuf and/or add_outbuf
  *	vdev: the virtio_device we're talking about.
+ *	inout: VIRTIO_IN and/or VIRTIO_OUT
  *	After one or more add_inbuf/add_outbuf calls, invoke this to kick
  *	the virtio layer.
+ * @get_outbuf: get the next used outbuf.
+ *	vdev: the virtio_device we're talking about.
+ *	len: the length written into the outbuf
+ *	Returns NULL or the "data" token handed to add_outbuf (which has been
+ *	detached).
+ * @get_inbuf: get the next used inbuf.
+ *	vdev: the virtio_device we're talking about.
+ *	len: the length read from the inbuf
+ *	Returns NULL or the "data" token handed to add_inbuf (which has been
+ *	detached).
  * @detach_outbuf: make sure sent sg can no longer be read.
  *	vdev: the virtio_device we're talking about.
  *	id: the id returned from add_outbuf.
- *	This is not necessary (or valid!) if the outbuf callback has
- *	already fired.
+ *	This is usually used for shutdown.  Don't try to detach twice.
  * @detach_inbuf: make sure sent sg can no longer be written to.
  *	vdev: the virtio_device we're talking about.
  *	id: the id returned from add_inbuf.
- *	This is not necessary (or valid!) if the outbuf callback has
- *	already fired.
+ *	This is usually used for shutdown.  Don't try to detach twice.
+ * @restart_in: restart calls to driver_ops->in after it returned false.
+ *	vdev: the virtio_device we're talking about.
+ *	This returns "false" (and doesn't re-enable) if there are pending
+ *	inbufs, to avoid a race.
+ * @restart_out: restart calls to driver_ops->out after it returned false.
+ *	vdev: the virtio_device we're talking about.
+ *	This returns "false" (and doesn't re-enable) if there are pending
+ *	outbufs, to avoid a race.
+ *
+ * Locking rules are straightforward: the driver is responsible for
+ * locking.  Outbuf operations can be called in parallel to inbuf
+ * operations, but no two outbuf operations nor two inbuf operations
+ * may be invoked simultaneously.
+ *
+ * All operations can be called in any context.
  */
 struct virtio_ops {
 	unsigned long (*add_outbuf)(struct virtio_device *vdev,
 				    const struct scatterlist sg[],
 				    unsigned int num,
-				    void (*cb)(struct virtio_device *vdev,
-					       void *data, unsigned len),
 				    void *data);
 
 	unsigned long (*add_inbuf)(struct virtio_device *vdev,
 				   struct scatterlist sg[],
 				   unsigned int num,
-				   void (*cb)(struct virtio_device *vdev,
-					      void *data, unsigned len),
 				   void *data);
 
-	void (*sync)(struct virtio_device *vdev);
+	void (*sync)(struct virtio_device *vdev, enum virtio_dir inout);
+
+	void *(*get_outbuf)(struct virtio_device *vdev, unsigned int *len);
+	void *(*get_inbuf)(struct virtio_device *vdev, unsigned int *len);
 
 	void (*detach_outbuf)(struct virtio_device *vdev, unsigned long id);
 	void (*detach_inbuf)(struct virtio_device *vdev, unsigned long id);
+
+	bool (*restart_in)(struct virtio_device *vdev);
+	bool (*restart_out)(struct virtio_device *vdev);
 };
 #endif /* _LINUX_VIRTIO_H */
diff -u b/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
--- b/drivers/net/virtio_net.c
+++ b/drivers/net/virtio_net.c
@@ -33,29 +33,21 @@
 	struct virtio_device *vdev;
 	struct net_device *ndev;
 
+	/* Number of input buffers, and max we've ever had. */
+	unsigned int num, max;
+
 	/* Receive & send queues. */
 	struct sk_buff_head recv;
 	struct sk_buff_head send;
-
-	/* Transmitted packets waiting to be freed */
-	struct sk_buff_head free;
 };
 
-static void skb_xmit_done(struct virtio_device *vdev, void *_skb, unsigned len)
+static bool skb_xmit_done(struct virtio_device *vdev)
 {
 	struct virtnet_info *vi = vdev->priv;
-	struct sk_buff *skb = _skb;
-
-	assert_spin_locked(&vdev->lock);
-
-	__skb_unlink(skb, &vi->send);
-	vi->ndev->stats.tx_bytes += len;
-	vi->ndev->stats.tx_packets++;
-	__skb_queue_head(&vi->free, skb);
 
-	pr_debug("Sent skb %p\n", skb);
 	/* In case we were waiting for output buffers. */
 	netif_wake_queue(vi->ndev);
+	return true;
 }
 
 static void receive_skb(struct net_device *dev, struct sk_buff *skb,
@@ -78,48 +70,90 @@
 	netif_rx(skb);
 }
 
-static void skb_recv_done(struct virtio_device *, void *, unsigned);
-static int try_fill_recv(struct virtnet_info *vi)
+static void try_fill_recv(struct virtnet_info *vi)
 {
 	struct sk_buff *skb;
 	struct scatterlist sg[MAX_SKB_FRAGS];
-	unsigned long num, id;
-
-	assert_spin_locked(&vi->vdev->lock);
+	unsigned long sgnum, id;
 
-	skb = netdev_alloc_skb(vi->ndev, MAX_PACKET_LEN);
-	if (unlikely(!skb))
-		return -ENOMEM;
+	for (;;) {
+		skb = netdev_alloc_skb(vi->ndev, MAX_PACKET_LEN);
+		if (unlikely(!skb))
+			break;
+
+		skb_put(skb, MAX_PACKET_LEN);
+		sgnum = skb_to_sgvec(skb, sg, 0, skb->len);
+		skb_queue_head(&vi->recv, skb);
+
+		id = vi->vdev->ops->add_inbuf(vi->vdev, sg, sgnum, skb);
+		if (IS_ERR_VALUE(id)) {
+			skb_unlink(skb, &vi->recv);
+			kfree_skb(skb);
+			break;
+		}
+		vi->num++;
+	}
+	if (unlikely(vi->num > vi->max))
+		vi->max = vi->num;
+	vi->vdev->ops->sync(vi->vdev, VIRTIO_IN);
+}
 
-	skb_put(skb, MAX_PACKET_LEN);
-	num = skb_to_sgvec(skb, sg, 0, skb->len);
-	skb_queue_head(&vi->recv, skb);
+static bool skb_recv_done(struct virtio_device *vdev)
+{
+	struct virtnet_info *vi = vdev->priv;
 
-	id = vi->vdev->ops->add_inbuf(vi->vdev, sg, num, skb_recv_done, skb);
-	if (IS_ERR_VALUE(id)) {
-		skb_unlink(skb, &vi->recv);
-		kfree_skb(skb);
-		return id;
-	}
-	return 0;
+	netif_rx_schedule(vi->ndev);
+	/* Suppress further interrupts. */
+	return false;
 }
 
-static void skb_recv_done(struct virtio_device *vdev, void *_skb, unsigned len)
+static int virtnet_poll(struct net_device *dev, int *budget)
 {
-	struct virtnet_info *vi = vdev->priv;
-	struct sk_buff *skb = _skb;
+	struct virtnet_info *vi = netdev_priv(dev);
+	struct sk_buff *skb = NULL;
+	unsigned int len, received = 0;
 
-	assert_spin_locked(&vdev->lock);
-	__skb_unlink(skb, &vi->recv);
-	receive_skb(vi->ndev, skb, len);
-	try_fill_recv(vi);
+again:
+	while (received < dev->quota &&
+	       (skb = vi->vdev->ops->get_inbuf(vi->vdev, &len)) != NULL) {
+		__skb_unlink(skb, &vi->recv);
+		receive_skb(vi->ndev, skb, len);
+		vi->num--;
+		received++;
+	}
+
+	dev->quota -= received;
+	*budget -= received;
+
+	/* FIXME: If we oom and completely run out of inbufs, we need
+	 * to start a timer trying to fill more. */
+	if (vi->num < vi->max / 2)
+		try_fill_recv(vi);
+
+	/* Still more work to do? */
+	if (skb)
+		return 1; /* not done */
+
+	netif_rx_complete(dev);
+	if (unlikely(!vi->vdev->ops->restart_in(vi->vdev))
+	    && netif_rx_reschedule(dev, received))
+		goto again;
+
+	return 0;
 }
 
-static void free_old_skbs(struct sk_buff_head *free)
+static void free_old_xmit_skbs(struct virtnet_info *vi)
 {
 	struct sk_buff *skb;
-	while ((skb = __skb_dequeue(free)) != NULL)
+	unsigned int len;
+
+	while ((skb = vi->vdev->ops->get_outbuf(vi->vdev, &len)) != NULL) {
+		pr_debug("Sent skb %p\n", skb);
+		__skb_unlink(skb, &vi->send);
+		vi->ndev->stats.tx_bytes += len;
+		vi->ndev->stats.tx_packets++;
 		kfree_skb(skb);
+	}
 }
 
 static int start_xmit(struct sk_buff *skb, struct net_device *dev)
@@ -128,19 +162,16 @@
 	unsigned long num, id;
 	struct scatterlist sg[MAX_SKB_FRAGS];
 	const unsigned char *dest = ((struct ethhdr *)skb->data)->h_dest;
-	unsigned long flags;
 
 	pr_debug("%s: xmit %p %02x:%02x:%02x:%02x:%02x:%02x\n",
 		 dev->name, skb,
 		 dest[0], dest[1], dest[2], dest[3], dest[4], dest[5]);
 
-	spin_lock_irqsave(&vi->vdev->lock, flags);
-	/* Free any transmitted packets: not supposed to do it in interrupt */
-	free_old_skbs(&vi->free);
+	free_old_xmit_skbs(vi);
 
 	num = skb_to_sgvec(skb, sg, 0, skb->len);
 	__skb_queue_head(&vi->send, skb);
-	id = vi->vdev->ops->add_outbuf(vi->vdev, sg, num, skb_xmit_done, skb);
+	id = vi->vdev->ops->add_outbuf(vi->vdev, sg, num, skb);
 	if (IS_ERR_VALUE(id)) {
 		pr_debug("%s: virtio not prepared to send\n", dev->name);
 		skb_unlink(skb, &vi->send);
@@ -148,8 +179,7 @@
 		return NETDEV_TX_BUSY;
 	}
 	SKB_ID(skb) = id;
-	vi->vdev->ops->sync(vi->vdev);
-	spin_unlock_irqrestore(&vi->vdev->lock, flags);
+	vi->vdev->ops->sync(vi->vdev, VIRTIO_OUT);
 
 	return 0;
 }
@@ -157,16 +187,12 @@
 static int virtnet_open(struct net_device *dev)
 {
 	struct virtnet_info *vi = netdev_priv(dev);
-	int i, err;
 
-	spin_lock_irq(&vi->vdev->lock);
-	for (i = 0; (err = try_fill_recv(vi)) == 0; i++);
-	vi->vdev->ops->sync(vi->vdev);
-	spin_unlock_irq(&vi->vdev->lock);
+	try_fill_recv(vi);
 
 	/* If we didn't even get one input buffer, we're useless. */
-	if (i == 0)
-		return err;
+	if (vi->num == 0)
+		return -ENOMEM;
 
 	return 0;
 }
@@ -176,20 +202,26 @@
 	struct virtnet_info *vi = netdev_priv(dev);
 	struct sk_buff *skb;
 
-	spin_lock_irq(&vi->vdev->lock);
+	/* networking core has neutered skb_xmit_done/skb_recv_done, so don't
+	 * worry about races vs. get_buf(). */
 	while ((skb = __skb_dequeue(&vi->recv)) != NULL) {
 		vi->vdev->ops->detach_inbuf(vi->vdev, SKB_ID(skb));
 		kfree_skb(skb);
+		vi->num--;
 	}
 	while ((skb = __skb_dequeue(&vi->send)) != NULL) {
 		vi->vdev->ops->detach_outbuf(vi->vdev, SKB_ID(skb));
 		kfree_skb(skb);
 	}
-	free_old_skbs(&vi->free);
-	spin_unlock_irq(&vi->vdev->lock);
+	BUG_ON(vi->num != 0);
 	return 0;
 }
 
+static struct virtio_driver_ops virtnet_ops = {
+	.in = skb_recv_done,
+	.out = skb_xmit_done,
+};
+
 struct net_device *virtnet_probe(struct virtio_device *vdev,
 				 const u8 mac[ETH_ALEN])
 {
@@ -207,16 +239,18 @@
 	memcpy(dev->dev_addr, mac, ETH_ALEN);
 	dev->open = virtnet_open;
 	dev->stop = virtnet_close;
+	dev->poll = virtnet_poll;
 	dev->hard_start_xmit = start_xmit;
+	dev->weight = 16;
 	SET_NETDEV_DEV(dev, vdev->dev);
 
 	vi = netdev_priv(dev);
 	vi->vdev = vdev;
 	vi->ndev = dev;
 	vdev->priv = vi;
+	vdev->driver_ops = &virtnet_ops;
 	skb_queue_head_init(&vi->recv);
 	skb_queue_head_init(&vi->send);
-	skb_queue_head_init(&vi->free);
 
 	err = register_netdev(dev);
 	if (err) {
diff -u b/drivers/block/virtio_blk.c b/drivers/block/virtio_blk.c
--- b/drivers/block/virtio_blk.c
+++ b/drivers/block/virtio_blk.c
@@ -1,4 +1,4 @@
-//#define DEBUG
+#define DEBUG
 #include <linux/spinlock.h>
 #include <linux/blkdev.h>
 #include <linux/hdreg.h>
@@ -8,6 +8,8 @@
 static unsigned char virtblk_index = 'a';
 struct virtio_blk
 {
+	spinlock_t lock;
+
 	struct virtio_device *vdev;
 
 	/* The disk structure for the kernel. */
@@ -19,7 +21,7 @@
 	mempool_t *pool;
 
 	/* Scatterlist: can be too big for stack. */
-	struct scatterlist sg[1+MAX_PHYS_SEGMENTS];
+	struct scatterlist sg[2+MAX_PHYS_SEGMENTS];
 };
 
 struct virtblk_req
@@ -28,68 +30,94 @@
 	struct request *req;
 	unsigned long out_id;
 	bool out_done, in_done;
-	bool failed;
+	int uptodate;
 	struct virtio_blk_outhdr out_hdr;
 	struct virtio_blk_inhdr in_hdr;
 };
 
-/* Jens gave me this nice helper to end all chunks of a request. */
-static void end_dequeued_request(struct request *req, int uptodate)
+static void end_dequeued_request(struct request *req,
+				 request_queue_t *q, int uptodate)
 {
-	if (end_that_request_first(req, uptodate, req->hard_nr_sectors))
+	/* And so the insanity of the block layer infects us here. */
+	int nsectors = req->hard_nr_sectors;
+
+	if (blk_pc_request(req)) {
+		nsectors = (req->data_len + 511) >> 9;
+		if (!nsectors)
+			nsectors = 1;
+		printk("uptodate = %i\n", uptodate);
+	}
+	if (end_that_request_first(req, uptodate, nsectors))
 		BUG();
 	add_disk_randomness(req->rq_disk);
 	end_that_request_last(req, uptodate);
 }
 
-static void finish(struct virtio_blk *vblk, struct virtblk_req *vbr)
+static bool finish(struct virtio_blk *vblk, struct virtblk_req *vbr)
 {
-	end_dequeued_request(vbr->req, !vbr->failed);
+	if (!vbr->in_done || !vbr->out_done)
+		return false;
+	end_dequeued_request(vbr->req, vblk->disk->queue, vbr->uptodate);
 	list_del(&vbr->list);
 	mempool_free(vbr, vblk->pool);
-	/* In case queue is stopped waiting for more buffers. */
-	blk_start_queue(vblk->disk->queue);
+	return true;
 }
 
 /* We make sure they finished both the input and output buffers: otherwise
  * they might still have read access after we free them. */
-static void blk_out_done(struct virtio_device *vdev, void *_vbr, unsigned len)
+static bool blk_out_done(struct virtio_device *vdev)
 {
-	struct virtblk_req *vbr = _vbr;
 	struct virtio_blk *vblk = vdev->priv;
+	struct virtblk_req *vbr;
+	unsigned int len, finished = 0;
+	unsigned long flags;
 
-	assert_spin_locked(&vblk->vdev->lock);
-
-	BUG_ON(vbr->out_done);
-	vbr->out_done = true;
-	if (vbr->in_done)
-		finish(vblk, vbr);
+	spin_lock_irqsave(&vblk->lock, flags);
+	while ((vbr = vdev->ops->get_outbuf(vdev, &len)) != NULL) {
+		BUG_ON(vbr->out_done);
+		vbr->out_done = true;
+		finished += finish(vblk, vbr);
+	}
+	/* In case queue is stopped waiting for more buffers. */
+	if (finished)
+		blk_start_queue(vblk->disk->queue);
+	spin_unlock_irqrestore(&vblk->lock, flags);
+	return true;
 }
 
-static void blk_in_done(struct virtio_device *vdev, void *_vbr, unsigned len)
+static bool blk_in_done(struct virtio_device *vdev)
 {
-	struct virtblk_req *vbr = _vbr;
 	struct virtio_blk *vblk = vdev->priv;
-	unsigned long expected_len;
+	struct virtblk_req *vbr;
+	unsigned int len, finished = 0;
+	unsigned long flags;
+
+	spin_lock_irqsave(&vblk->lock, flags);
 
-	assert_spin_locked(&vblk->vdev->lock);
+	while ((vbr = vdev->ops->get_inbuf(vdev, &len)) != NULL) {
+		BUG_ON(vbr->in_done);
 
-	expected_len = sizeof(vbr->in_hdr);
-	if (vbr->out_hdr.type == READ)
-		expected_len += vbr->req->hard_nr_sectors*512;
-
-	if (unlikely(len != expected_len)) {
-		dev_err(vblk->vdev->dev, "short reply %u not %lu",
-			len, expected_len);
-		vbr->failed = true;
-	} else if (unlikely(vbr->in_hdr.status != 1)) {
-		vbr->failed = true;
+		switch (vbr->in_hdr.status) {
+		case VIRTIO_BLK_S_OK:
+			vbr->uptodate = 1;
+			break;
+		case VIRTIO_BLK_S_UNSUPP:
+			printk("Request was unsupported\n");
+			vbr->uptodate = -ENOTTY;
+			break;
+		default:
+			vbr->uptodate = 0;
+			break;
+		}
+		vbr->in_done = true;
+		finished += finish(vblk, vbr);
 	}
 
-	BUG_ON(vbr->in_done);
-	vbr->in_done = true;
-	if (vbr->out_done)
-		finish(vblk, vbr);
+	/* In case queue is stopped waiting for more buffers. */
+	if (finished)
+		blk_start_queue(vblk->disk->queue);
+	spin_unlock_irqrestore(&vblk->lock, flags);
+	return true;
 }
 
 static bool do_write(request_queue_t *q, struct virtio_blk *vblk,
@@ -97,12 +125,14 @@
 {
 	unsigned long num;
 
+	vbr->out_hdr.type |= VIRTIO_BLK_T_WRITE;
+
 	/* Set up for reply. */
 	vblk->sg[0].page = virt_to_page(&vbr->in_hdr);
 	vblk->sg[0].offset = offset_in_page(&vbr->in_hdr);
 	vblk->sg[0].length = sizeof(vbr->in_hdr);
 	vbr->out_hdr.id = vblk->vdev->ops->add_inbuf(vblk->vdev, vblk->sg, 1,
-						     blk_in_done, vbr);
+						     vbr);
 	if (IS_ERR_VALUE(vbr->out_hdr.id))
 		goto full;
 
@@ -112,15 +142,13 @@
 	vblk->sg[0].length = sizeof(vbr->out_hdr);
 
 	num = blk_rq_map_sg(q, vbr->req, vblk->sg+1);
-	vbr->out_done = vbr->in_done = false;
-	vbr->failed = false;
 	vbr->out_id = vblk->vdev->ops->add_outbuf(vblk->vdev, vblk->sg, 1+num,
-						  blk_out_done, vbr);
+						  vbr);
 	if (IS_ERR_VALUE(vbr->out_id))
 		goto detach_inbuf_full;
 
 	pr_debug("Write: %p in=%lu out=%lu\n", vbr,
-		 vbr->out_hdr.id, vbr->out_id);
+		 (long)vbr->out_hdr.id, (long)vbr->out_id);
 	list_add_tail(&vbr->list, &vblk->reqs);
 	return true;
 
@@ -135,13 +163,15 @@
 {
 	unsigned long num;
 
+	vbr->out_hdr.type |= VIRTIO_BLK_T_READ;
+
 	/* Set up for reply. */
 	vblk->sg[0].page = virt_to_page(&vbr->in_hdr);
 	vblk->sg[0].offset = offset_in_page(&vbr->in_hdr);
 	vblk->sg[0].length = sizeof(vbr->in_hdr);
 	num = blk_rq_map_sg(q, vbr->req, vblk->sg+1);
 	vbr->out_hdr.id = vblk->vdev->ops->add_inbuf(vblk->vdev, vblk->sg,
-						     1+num, blk_in_done, vbr);
+						     1+num, vbr);
 	if (IS_ERR_VALUE(vbr->out_hdr.id))
 		goto full;
 
@@ -149,15 +179,53 @@
 	vblk->sg[0].offset = offset_in_page(&vbr->out_hdr);
 	vblk->sg[0].length = sizeof(vbr->out_hdr);
 
-	vbr->out_done = vbr->in_done = false;
-	vbr->failed = false;
 	vbr->out_id = vblk->vdev->ops->add_outbuf(vblk->vdev, vblk->sg, 1,
-						  blk_out_done, vbr);
+						  vbr);
 	if (IS_ERR_VALUE(vbr->out_id))
 		goto detach_inbuf_full;
 
 	pr_debug("Read: %p in=%lu out=%lu\n", vbr,
-		 vbr->out_hdr.id, vbr->out_id);
+		 (long)vbr->out_hdr.id, (long)vbr->out_id);
+	list_add_tail(&vbr->list, &vblk->reqs);
+	return true;
+
+detach_inbuf_full:
+	vblk->vdev->ops->detach_inbuf(vblk->vdev, vbr->out_hdr.id);
+full:
+	return false;
+}
+
+static bool do_scsi(request_queue_t *q, struct virtio_blk *vblk,
+		    struct virtblk_req *vbr)
+{
+	unsigned long num;
+
+	vbr->out_hdr.type |= VIRTIO_BLK_T_SCSI_CMD;
+
+	/* Set up for reply. */
+	vblk->sg[0].page = virt_to_page(&vbr->in_hdr);
+	vblk->sg[0].offset = offset_in_page(&vbr->in_hdr);
+	vblk->sg[0].length = sizeof(vbr->in_hdr);
+	vbr->out_hdr.id = vblk->vdev->ops->add_inbuf(vblk->vdev, vblk->sg, 1,
+						     vbr);
+	if (IS_ERR_VALUE(vbr->out_hdr.id))
+		goto full;
+
+	vblk->sg[0].page = virt_to_page(&vbr->out_hdr);
+	vblk->sg[0].offset = offset_in_page(&vbr->out_hdr);
+	vblk->sg[0].length = sizeof(vbr->out_hdr);
+	vblk->sg[1].page = virt_to_page(vbr->req->cmd);
+	vblk->sg[1].offset = offset_in_page(vbr->req->cmd);
+	vblk->sg[1].length = vbr->req->cmd_len;
+
+	num = blk_rq_map_sg(q, vbr->req, vblk->sg+1);
+	vbr->out_id = vblk->vdev->ops->add_outbuf(vblk->vdev, vblk->sg, 2+num,
+						  vbr);
+	if (IS_ERR_VALUE(vbr->out_id))
+		goto detach_inbuf_full;
+
+	pr_debug("Scsi: %p in=%lu out=%lu\n", vbr,
+		 (long)vbr->out_hdr.id, (long)vbr->out_id);
 	list_add_tail(&vbr->list, &vblk->reqs);
 	return true;
 
@@ -176,37 +244,38 @@
 	while ((req = elv_next_request(q)) != NULL) {
 		vblk = req->rq_disk->private_data;
 
-		/* FIXME: handle these iff capable. */
-		if (!blk_fs_request(req)) {
-			pr_debug("Got non-command 0x%08x\n", req->cmd_type);
-			req->errors++;
-			blkdev_dequeue_request(req);
-			end_dequeued_request(req, 0);
-			continue;
-		}
-
 		vbr = mempool_alloc(vblk->pool, GFP_ATOMIC);
 		if (!vbr)
 			goto stop;
 
 		BUG_ON(req->nr_phys_segments > ARRAY_SIZE(vblk->sg));
 		vbr->req = req;
-		vbr->out_hdr.type = rq_data_dir(req);
+		/* Actual type gets or'ed in do_scsi/do_write/do_read */
+		vbr->out_hdr.type = blk_barrier_rq(req)?VIRTIO_BLK_T_BARRIER:0;
 		vbr->out_hdr.sector = req->sector;
+		vbr->out_hdr.ioprio = req->ioprio;
+		vbr->out_done = vbr->in_done = false;
 
-		if (rq_data_dir(req) == WRITE) {
-			if (!do_write(q, vblk, vbr))
-				goto stop;
-		} else {
-			if (!do_read(q, vblk, vbr))
+		if (blk_pc_request(req)) {
+			if (!do_scsi(q, vblk, vbr))
 				goto stop;
-		}
+		} else if (blk_fs_request(req)) {
+			if (rq_data_dir(req) == WRITE) {
+				if (!do_write(q, vblk, vbr))
+					goto stop;
+			} else {
+				if (!do_read(q, vblk, vbr))
+					goto stop;
+			}
+		} else
+			/* We don't put anything else in the queue. */
+			BUG();
 		blkdev_dequeue_request(req);
 	}
 
 sync:
 	if (vblk)
-		vblk->vdev->ops->sync(vblk->vdev);
+		vblk->vdev->ops->sync(vblk->vdev, VIRTIO_IN|VIRTIO_OUT);
 	return;
 
 stop:
@@ -216,7 +285,21 @@
 	goto sync;
 }
 
+static int virtblk_ioctl(struct inode *inode, struct file *filp,
+			 unsigned cmd, unsigned long data)
+{
+	return scsi_cmd_ioctl(filp, inode->i_bdev->bd_disk, cmd,
+			      (void __user *)data);
+}
+
+static struct virtio_driver_ops virtblk_ops = {
+	.in = blk_in_done,
+	.out = blk_out_done,
+};
+
+
 static struct block_device_operations virtblk_fops = {
+	.ioctl = virtblk_ioctl,
 	.owner = THIS_MODULE,
 };
 
@@ -232,8 +315,10 @@
 	}
 
 	INIT_LIST_HEAD(&vblk->reqs);
+	spin_lock_init(&vblk->lock);
 	vblk->vdev = vdev;
 	vdev->priv = vblk;
+	vdev->driver_ops = &virtblk_ops;
 
 	vblk->pool = mempool_create_kmalloc_pool(1,sizeof(struct virtblk_req));
 	if (!vblk->pool) {
@@ -254,19 +339,20 @@
 		goto out_unregister_blkdev;
 	}
 
-	vblk->disk->queue = blk_init_queue(do_virtblk_request,
-					   &vblk->vdev->lock);
+	vblk->disk->queue = blk_init_queue(do_virtblk_request, &vblk->lock);
 	if (!vblk->disk->queue) {
 		err = -ENOMEM;
 		goto out_put_disk;
 	}
 
-	sprintf(vblk->disk->disk_name, "vb%c", virtblk_index++);
+	sprintf(vblk->disk->disk_name, "vd%c", virtblk_index++);
 	vblk->disk->major = major;
 	vblk->disk->first_minor = 0;
 	vblk->disk->private_data = vblk;
 	vblk->disk->fops = &virtblk_fops;
 
+	blk_queue_ordered(vblk->disk->queue, QUEUE_ORDERED_TAG, NULL);
+
 	/* Caller can do blk_queue_max_hw_segments(), set_capacity()
 	 * etc then add_disk(). */
 	return vblk->disk;
diff -u b/include/linux/virtio_blk.h b/include/linux/virtio_blk.h
--- b/include/linux/virtio_blk.h
+++ b/include/linux/virtio_blk.h
@@ -3,26 +3,37 @@
 #include <linux/types.h>
-struct gendisk;
-struct virtio_device;
-struct hd_geometry;
+
+#define VIRTIO_BLK_T_READ	0
+#define VIRTIO_BLK_T_WRITE	1
+#define VIRTIO_BLK_T_SCSI_CMD	3
+#define VIRTIO_BLK_T_BARRIER	0x80000000 /* Barrier before this op. */
 
 /* This is the first element of the scatter-gather list. */
 struct virtio_blk_outhdr
 {
-	/* 0 == read, 1 == write */
-	u32 type;
+	/* VIRTIO_BLK_T* */
+	__u32 type;
+	/* io priority. */
+	__u32 ioprio;
 	/* Sector (ie. 512 byte offset) */
-	unsigned long sector;
+	__u64 sector;
 	/* Where to put reply. */
-	unsigned long id;
+	__u64 id;
 };
 
+#define VIRTIO_BLK_S_OK		0
+#define VIRTIO_BLK_S_IOERR	1
+#define VIRTIO_BLK_S_UNSUPP	2
+
 struct virtio_blk_inhdr
 {
-	/* 1 = OK, 0 = not ok. */
-	unsigned long status;
+	unsigned char status;
 };
 
+#ifdef __KERNEL__
+struct gendisk;
+struct virtio_device;
+
 struct gendisk *virtblk_probe(struct virtio_device *vdev);
 void virtblk_remove(struct gendisk *disk);
-
+#endif /* __KERNEL__ */
 #endif /* _LINUX_VIRTIO_BLK_H */
only in patch2:
unchanged:
--- a/include/linux/Kbuild
+++ b/include/linux/Kbuild
@@ -341,6 +341,7 @@ unifdef-y += utsname.h
 unifdef-y += utsname.h
 unifdef-y += videodev2.h
 unifdef-y += videodev.h
+unifdef-y += virtio_blk.h
 unifdef-y += wait.h
 unifdef-y += wanrouter.h
 unifdef-y += watchdog.h





-------------------------------------------------------------------------
This SF.net email is sponsored by DB2 Express
Download DB2 Express C - the FREE version of DB2 express and take
control of your XML. No limits. Just data. Click to get it now.
http://sourceforge.net/powerbar/db2/

^ permalink raw reply	[flat|nested] 26+ messages in thread

* [PATCH RFC 1/3] Virtio draft III: virtio.h
       [not found]     ` <1181999552.6237.255.camel-bi+AKbBUZKY6gyzm1THtWbp2dZbC/Bob@public.gmane.org>
@ 2007-06-16 13:14       ` Rusty Russell
       [not found]         ` <1181999669.6237.257.camel-bi+AKbBUZKY6gyzm1THtWbp2dZbC/Bob@public.gmane.org>
  2007-06-17 14:14       ` [PATCH RFC 0/3] Virtio draft III Avi Kivity
  1 sibling, 1 reply; 26+ messages in thread
From: Rusty Russell @ 2007-06-16 13:14 UTC (permalink / raw)
  To: virtualization
  Cc: Stephen Rothwell, Xen Mailing List,
	jmk-zzFmDc4TPjtKvsKVC3L/VUEOCMrvLtNR@public.gmane.org, Herbert Xu,
	kvm-devel, Christian Borntraeger, Latchesar Ionkov,
	Suzanne McIntosh, Martin Schwidefsky

Draft III of virtio interface

This attempts to implement a "virtual I/O" layer which should allow
common drivers to be efficiently used across most virtual I/O
mechanisms.  It will no doubt need further enhancement.

The details of probing the device are left to hypervisor-specific
code: it simply constructs the "struct virtio_device" and hands it to
the probe function (eg. virtnet_probe() or virtblk_probe()).

The virtio drivers add and get input and output buffers; as the
buffers are used up the driver "interrupt" callbacks are invoked.

I have written two virtio device drivers (net and block) and two
virtio implementations (for lguest): a read-write socket-style
implementation, and a more efficient descriptor-based implementation.

Signed-off-by: Rusty Russell <rusty-8n+1lVoiYb80n/F98K4Iww@public.gmane.org>
---
 include/linux/virtio.h |  115 ++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 115 insertions(+)

===================================================================
--- /dev/null
+++ b/include/linux/virtio.h
@@ -0,0 +1,115 @@
+#ifndef _LINUX_VIRTIO_H
+#define _LINUX_VIRTIO_H
+#include <linux/types.h>
+#include <linux/scatterlist.h>
+#include <linux/spinlock.h>
+
+/**
+ * virtio_device - description and routines to drive a virtual device.
+ * @dev: the underlying struct device.
+ * @ops: the operations for this virtual device.
+ * @driver_ops: set by the driver for callbacks.
+ * @priv: private pointer for the driver to use.
+ */
+struct virtio_device {
+	struct device *dev;
+	struct virtio_ops *ops;
+	struct virtio_driver_ops *driver_ops;
+	void *priv;
+};
+
+/**
+ * virtio_driver_ops - driver callbacks for a virtual device.
+ * @in: inbufs have been completed.
+ *	Usually called from an interrupt handler.
+ *	Return false to suppress further inbuf callbacks.
+ * @out: outbufs have been completed.
+ *	Usually called from an interrupt handler.
+ *	Return false to suppress further outbuf callbacks.
+ */
+struct virtio_driver_ops {
+	bool (*in)(struct virtio_device *dev);
+	bool (*out)(struct virtio_device *dev);
+};
+
+enum virtio_dir {
+	VIRTIO_IN = 0x1,
+	VIRTIO_OUT = 0x2,
+};
+
+/**
+ * virtio_ops - virtio abstraction layer
+ * @add_outbuf: prepare to send data to the other end:
+ *	vdev: the virtio_device
+ *	sg: the description of the buffer(s).
+ *	num: the size of the sg array.
+ *	data: the token returned by the get_outbuf function.
+ *      Returns a unique id or an error.
+ * @add_inbuf: prepare to receive data from the other end:
+ *	vdev: the virtio_device
+ *	sg: the description of the buffer(s).
+ *	num: the size of the sg array.
+ *	data: the token returned by the get_inbuf function.
+ *      Returns a unique id or an error (eg. -ENOSPC).
+ * @sync: update after add_inbuf and/or add_outbuf
+ *	vdev: the virtio_device we're talking about.
+ *	inout: VIRTIO_IN and/or VIRTIO_OUT
+ *	After one or more add_inbuf/add_outbuf calls, invoke this to kick
+ *	the virtio layer.
+ * @get_outbuf: get the next used outbuf.
+ *	vdev: the virtio_device we're talking about.
+ *	len: the length written into the outbuf
+ *	Returns NULL or the "data" token handed to add_outbuf (which has been
+ *	detached).
+ * @get_inbuf: get the next used inbuf.
+ *	vdev: the virtio_device we're talking about.
+ *	len: the length read from the inbuf
+ *	Returns NULL or the "data" token handed to add_inbuf (which has been
+ *	detached).
+ * @detach_outbuf: make sure sent sg can no longer be read.
+ *	vdev: the virtio_device we're talking about.
+ *	id: the id returned from add_outbuf.
+ *	This is usually used for shutdown.  Don't try to detach twice.
+ * @detach_inbuf: make sure sent sg can no longer be written to.
+ *	vdev: the virtio_device we're talking about.
+ *	id: the id returned from add_inbuf.
+ *	This is usually used for shutdown.  Don't try to detach twice.
+ * @restart_in: restart calls to driver_ops->in after it returned false.
+ *	vdev: the virtio_device we're talking about.
+ *	This returns "false" (and doesn't re-enable) if there are pending
+ *	inbufs, to avoid a race.
+ * @restart_out: restart calls to driver_ops->out after it returned false.
+ *	vdev: the virtio_device we're talking about.
+ *	This returns "false" (and doesn't re-enable) if there are pending
+ *	outbufs, to avoid a race.
+ *
+ * Locking rules are straightforward: the driver is responsible for
+ * locking.  Outbuf operations can be called in parallel to inbuf
+ * operations, but no two outbuf operations nor two inbuf operations
+ * may be invoked simultaneously.
+ *
+ * All operations can be called in any context.
+ */
+struct virtio_ops {
+	unsigned long (*add_outbuf)(struct virtio_device *vdev,
+				    const struct scatterlist sg[],
+				    unsigned int num,
+				    void *data);
+
+	unsigned long (*add_inbuf)(struct virtio_device *vdev,
+				   struct scatterlist sg[],
+				   unsigned int num,
+				   void *data);
+
+	void (*sync)(struct virtio_device *vdev, enum virtio_dir inout);
+
+	void *(*get_outbuf)(struct virtio_device *vdev, unsigned int *len);
+	void *(*get_inbuf)(struct virtio_device *vdev, unsigned int *len);
+
+	void (*detach_outbuf)(struct virtio_device *vdev, unsigned long id);
+	void (*detach_inbuf)(struct virtio_device *vdev, unsigned long id);
+
+	bool (*restart_in)(struct virtio_device *vdev);
+	bool (*restart_out)(struct virtio_device *vdev);
+};
+#endif /* _LINUX_VIRTIO_H */




^ permalink raw reply	[flat|nested] 26+ messages in thread

* [PATCH RFC 2/3] Virtio draft III: example net driver
       [not found]         ` <1181999669.6237.257.camel-bi+AKbBUZKY6gyzm1THtWbp2dZbC/Bob@public.gmane.org>
@ 2007-06-16 13:17           ` Rusty Russell
       [not found]             ` <1181999825.6237.260.camel-bi+AKbBUZKY6gyzm1THtWbp2dZbC/Bob@public.gmane.org>
  0 siblings, 1 reply; 26+ messages in thread
From: Rusty Russell @ 2007-06-16 13:17 UTC (permalink / raw)
  To: virtualization
  Cc: Stephen Rothwell, Xen Mailing List,
	jmk-zzFmDc4TPjtKvsKVC3L/VUEOCMrvLtNR@public.gmane.org, Herbert Xu,
	kvm-devel, Christian Borntraeger, Latchesar Ionkov,
	Suzanne McIntosh, Martin Schwidefsky

Net driver using virtio

TODO:
	1) GSO.
	2) Checksum options.
	3) Big packets.

Signed-off-by: Rusty Russell <rusty-8n+1lVoiYb80n/F98K4Iww@public.gmane.org>
---
 drivers/net/Makefile       |    2 
 drivers/net/virtio_net.c   |  277 ++++++++++++++++++++++++++++++++++++++++++++
 include/linux/virtio_net.h |   12 +
 3 files changed, 290 insertions(+), 1 deletion(-)

===================================================================
--- a/drivers/net/Makefile
+++ b/drivers/net/Makefile
@@ -37,7 +37,7 @@ obj-$(CONFIG_CASSINI) += cassini.o
 
 obj-$(CONFIG_MACE) += mace.o
 obj-$(CONFIG_BMAC) += bmac.o
-
+obj-y += virtio_net.o
 obj-$(CONFIG_DGRS) += dgrs.o
 obj-$(CONFIG_VORTEX) += 3c59x.o
 obj-$(CONFIG_TYPHOON) += typhoon.o
===================================================================
--- /dev/null
+++ b/drivers/net/virtio_net.c
@@ -0,0 +1,277 @@
+/* A simple network driver using virtio.
+ *
+ * Copyright 2007 Rusty Russell <rusty-8n+1lVoiYb80n/F98K4Iww@public.gmane.org> IBM Corporation
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write to the Free Software
+ * Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA  02111-1307  USA
+ */
+//#define DEBUG
+#include <linux/netdevice.h>
+#include <linux/etherdevice.h>
+#include <linux/module.h>
+#include <linux/virtio.h>
+#include <linux/scatterlist.h>
+
+/* FIXME: Make dynamic */
+#define MAX_PACKET_LEN (ETH_HLEN+ETH_DATA_LEN)
+
+#define SKB_ID(skb) (*(unsigned long *)(skb)->cb)
+
+struct virtnet_info
+{
+	struct virtio_device *vdev;
+	struct net_device *ndev;
+
+	/* Number of input buffers, and max we've ever had. */
+	unsigned int num, max;
+
+	/* Receive & send queues. */
+	struct sk_buff_head recv;
+	struct sk_buff_head send;
+};
+
+static bool skb_xmit_done(struct virtio_device *vdev)
+{
+	struct virtnet_info *vi = vdev->priv;
+
+	/* In case we were waiting for output buffers. */
+	netif_wake_queue(vi->ndev);
+	return true;
+}
+
+static void receive_skb(struct net_device *dev, struct sk_buff *skb,
+			unsigned len)
+{
+	if (unlikely(len < ETH_HLEN)) {
+		pr_debug("%s: short packet %i\n", dev->name, len);
+		dev->stats.rx_length_errors++;
+		dev_kfree_skb(skb);
+		return;
+	}
+	BUG_ON(len > MAX_PACKET_LEN);
+
+	skb_trim(skb, len);
+	skb->protocol = eth_type_trans(skb, dev);
+	pr_debug("Receiving skb proto 0x%04x len %i type %i\n",
+		 ntohs(skb->protocol), skb->len, skb->pkt_type);
+	dev->stats.rx_bytes += skb->len;
+	dev->stats.rx_packets++;
+	netif_rx(skb);
+}
+
+static void try_fill_recv(struct virtnet_info *vi)
+{
+	struct sk_buff *skb;
+	struct scatterlist sg[MAX_SKB_FRAGS];
+	unsigned long sgnum, id;
+
+	for (;;) {
+		skb = netdev_alloc_skb(vi->ndev, MAX_PACKET_LEN);
+		if (unlikely(!skb))
+			break;
+
+		skb_put(skb, MAX_PACKET_LEN);
+		sgnum = skb_to_sgvec(skb, sg, 0, skb->len);
+		skb_queue_head(&vi->recv, skb);
+
+		id = vi->vdev->ops->add_inbuf(vi->vdev, sg, sgnum, skb);
+		if (IS_ERR_VALUE(id)) {
+			skb_unlink(skb, &vi->recv);
+			kfree_skb(skb);
+			break;
+		}
+		vi->num++;
+	}
+	if (unlikely(vi->num > vi->max))
+		vi->max = vi->num;
+	vi->vdev->ops->sync(vi->vdev, VIRTIO_IN);
+}
+
+static bool skb_recv_done(struct virtio_device *vdev)
+{
+	struct virtnet_info *vi = vdev->priv;
+
+	netif_rx_schedule(vi->ndev);
+	/* Suppress further interrupts. */
+	return false;
+}
+
+static int virtnet_poll(struct net_device *dev, int *budget)
+{
+	struct virtnet_info *vi = netdev_priv(dev);
+	struct sk_buff *skb = NULL;
+	unsigned int len, received = 0;
+
+again:
+	while (received < dev->quota &&
+	       (skb = vi->vdev->ops->get_inbuf(vi->vdev, &len)) != NULL) {
+		__skb_unlink(skb, &vi->recv);
+		receive_skb(vi->ndev, skb, len);
+		vi->num--;
+		received++;
+	}
+
+	dev->quota -= received;
+	*budget -= received;
+
+	/* FIXME: If we oom and completely run out of inbufs, we need
+	 * to start a timer trying to fill more. */
+	if (vi->num < vi->max / 2)
+		try_fill_recv(vi);
+
+	/* Still more work to do? */
+	if (skb)
+		return 1; /* not done */
+
+	netif_rx_complete(dev);
+	if (unlikely(!vi->vdev->ops->restart_in(vi->vdev))
+	    && netif_rx_reschedule(dev, received))
+		goto again;
+
+	return 0;
+}
+
+static void free_old_xmit_skbs(struct virtnet_info *vi)
+{
+	struct sk_buff *skb;
+	unsigned int len;
+
+	while ((skb = vi->vdev->ops->get_outbuf(vi->vdev, &len)) != NULL) {
+		pr_debug("Sent skb %p\n", skb);
+		__skb_unlink(skb, &vi->send);
+		vi->ndev->stats.tx_bytes += len;
+		vi->ndev->stats.tx_packets++;
+		kfree_skb(skb);
+	}
+}
+
+static int start_xmit(struct sk_buff *skb, struct net_device *dev)
+{
+	struct virtnet_info *vi = netdev_priv(dev);
+	unsigned long num, id;
+	struct scatterlist sg[MAX_SKB_FRAGS];
+	const unsigned char *dest = ((struct ethhdr *)skb->data)->h_dest;
+
+	pr_debug("%s: xmit %p %02x:%02x:%02x:%02x:%02x:%02x\n",
+		 dev->name, skb,
+		 dest[0], dest[1], dest[2], dest[3], dest[4], dest[5]);
+
+	free_old_xmit_skbs(vi);
+
+	num = skb_to_sgvec(skb, sg, 0, skb->len);
+	__skb_queue_head(&vi->send, skb);
+	id = vi->vdev->ops->add_outbuf(vi->vdev, sg, num, skb);
+	if (IS_ERR_VALUE(id)) {
+		pr_debug("%s: virtio not prepared to send\n", dev->name);
+		skb_unlink(skb, &vi->send);
+		netif_stop_queue(dev);
+		return NETDEV_TX_BUSY;
+	}
+	SKB_ID(skb) = id;
+	vi->vdev->ops->sync(vi->vdev, VIRTIO_OUT);
+
+	return 0;
+}
+
+static int virtnet_open(struct net_device *dev)
+{
+	struct virtnet_info *vi = netdev_priv(dev);
+
+	try_fill_recv(vi);
+
+	/* If we didn't even get one input buffer, we're useless. */
+	if (vi->num == 0)
+		return -ENOMEM;
+
+	return 0;
+}
+
+static int virtnet_close(struct net_device *dev)
+{
+	struct virtnet_info *vi = netdev_priv(dev);
+	struct sk_buff *skb;
+
+	/* networking core has neutered skb_xmit_done/skb_recv_done, so don't
+	 * worry about races vs. get_buf(). */
+	while ((skb = __skb_dequeue(&vi->recv)) != NULL) {
+		vi->vdev->ops->detach_inbuf(vi->vdev, SKB_ID(skb));
+		kfree_skb(skb);
+		vi->num--;
+	}
+	while ((skb = __skb_dequeue(&vi->send)) != NULL) {
+		vi->vdev->ops->detach_outbuf(vi->vdev, SKB_ID(skb));
+		kfree_skb(skb);
+	}
+	BUG_ON(vi->num != 0);
+	return 0;
+}
+
+static struct virtio_driver_ops virtnet_ops = {
+	.in = skb_recv_done,
+	.out = skb_xmit_done,
+};
+
+struct net_device *virtnet_probe(struct virtio_device *vdev,
+				 const u8 mac[ETH_ALEN])
+{
+	int err;
+	struct net_device *dev;
+	struct virtnet_info *vi;
+
+	dev = alloc_etherdev(sizeof(struct virtnet_info));
+	if (!dev)
+		return ERR_PTR(-ENOMEM);
+
+	SET_MODULE_OWNER(dev);
+
+	ether_setup(dev);
+	memcpy(dev->dev_addr, mac, ETH_ALEN);
+	dev->open = virtnet_open;
+	dev->stop = virtnet_close;
+	dev->poll = virtnet_poll;
+	dev->hard_start_xmit = start_xmit;
+	dev->weight = 16;
+	SET_NETDEV_DEV(dev, vdev->dev);
+
+	vi = netdev_priv(dev);
+	vi->vdev = vdev;
+	vi->ndev = dev;
+	vdev->priv = vi;
+	vdev->driver_ops = &virtnet_ops;
+	skb_queue_head_init(&vi->recv);
+	skb_queue_head_init(&vi->send);
+
+	err = register_netdev(dev);
+	if (err) {
+		pr_debug("virtio_net: registering device failed\n");
+		goto free;
+	}
+	pr_debug("virtnet: registered device %s\n", dev->name);
+	return dev;
+
+free:
+	free_netdev(dev);
+	return ERR_PTR(err);
+}
+EXPORT_SYMBOL_GPL(virtnet_probe);
+
+void virtnet_remove(struct net_device *dev)
+{
+	unregister_netdev(dev);
+	free_netdev(dev);
+}
+EXPORT_SYMBOL_GPL(virtnet_remove);
+
+MODULE_DESCRIPTION("Virtio network driver");
+MODULE_LICENSE("GPL");
===================================================================
--- /dev/null
+++ b/include/linux/virtio_net.h
@@ -0,0 +1,12 @@
+#ifndef _LINUX_VIRTIO_NET_H
+#define _LINUX_VIRTIO_NET_H
+#include <linux/types.h>
+#include <linux/etherdevice.h>
+struct net_device;
+struct virtio_device;
+
+struct net_device *virtnet_probe(struct virtio_device *vdev,
+				 const u8 mac[ETH_ALEN]);
+void virtnet_remove(struct net_device *dev);
+
+#endif /* _LINUX_VIRTIO_NET_H */



-------------------------------------------------------------------------
This SF.net email is sponsored by DB2 Express
Download DB2 Express C - the FREE version of DB2 express and take
control of your XML. No limits. Just data. Click to get it now.
http://sourceforge.net/powerbar/db2/

^ permalink raw reply	[flat|nested] 26+ messages in thread

* [PATCH RFC 3/3] Virtio draft III: example block driver
       [not found]             ` <1181999825.6237.260.camel-bi+AKbBUZKY6gyzm1THtWbp2dZbC/Bob@public.gmane.org>
@ 2007-06-16 13:18               ` Rusty Russell
       [not found]                 ` <1181999920.6237.263.camel-bi+AKbBUZKY6gyzm1THtWbp2dZbC/Bob@public.gmane.org>
  2007-06-25 15:26               ` [PATCH RFC 2/3] Virtio draft III: example net driver Brian King
  1 sibling, 1 reply; 26+ messages in thread
From: Rusty Russell @ 2007-06-16 13:18 UTC (permalink / raw)
  To: virtualization
  Cc: Stephen Rothwell, Xen Mailing List,
	jmk-zzFmDc4TPjtKvsKVC3L/VUEOCMrvLtNR@public.gmane.org, kvm-devel,
	Christian Borntraeger, Latchesar Ionkov, Suzanne McIntosh,
	Jens Axboe, Martin Schwidefsky

The block driver uses outbufs whose sg[0] holds the request information
(struct virtio_blk_outhdr): the type, sector and inbuf id.  For a
write, the rest of the sg contains the data to be written.

The first segment of the inbuf is a result code (struct
virtio_blk_inhdr).  For a read, the rest of the sg points to the input
buffer.

We accept the normal SCSI ioctls: they get handed through to the other
side, which can then handle them or reply that they're unsupported.

Although we try to reply -ENOTTY on unsupported commands, the block
layer in its infinite wisdom suppresses the error, so ioctl(fd,
CDROMEJECT) returns success to userspace.

Signed-off-by: Rusty Russell <rusty-8n+1lVoiYb80n/F98K4Iww@public.gmane.org>
---
 drivers/block/Makefile     |    1 
 drivers/block/virtio_blk.c |  385 ++++++++++++++++++++++++++++++++++++++++++++
 include/linux/Kbuild       |    1 
 include/linux/virtio_blk.h |   39 ++++
 4 files changed, 426 insertions(+)

===================================================================
--- a/drivers/block/Makefile
+++ b/drivers/block/Makefile
@@ -20,6 +20,7 @@ obj-$(CONFIG_BLK_CPQ_CISS_DA)  += cciss.
 obj-$(CONFIG_BLK_CPQ_CISS_DA)  += cciss.o
 obj-$(CONFIG_BLK_DEV_DAC960)	+= DAC960.o
 obj-$(CONFIG_CDROM_PKTCDVD)	+= pktcdvd.o
+obj-y				+= virtio_blk.o
 
 obj-$(CONFIG_BLK_DEV_UMEM)	+= umem.o
 obj-$(CONFIG_BLK_DEV_NBD)	+= nbd.o
===================================================================
--- /dev/null
+++ b/drivers/block/virtio_blk.c
@@ -0,0 +1,383 @@
+#define DEBUG
+#include <linux/spinlock.h>
+#include <linux/blkdev.h>
+#include <linux/hdreg.h>
+#include <linux/virtio.h>
+#include <linux/virtio_blk.h>
+
+static unsigned char virtblk_index = 'a';
+struct virtio_blk
+{
+	spinlock_t lock;
+
+	struct virtio_device *vdev;
+
+	/* The disk structure for the kernel. */
+	struct gendisk *disk;
+
+	/* Request tracking. */
+	struct list_head reqs;
+
+	mempool_t *pool;
+
+	/* Scatterlist: can be too big for stack. */
+	struct scatterlist sg[2+MAX_PHYS_SEGMENTS];
+};
+
+struct virtblk_req
+{
+	struct list_head list;
+	struct request *req;
+	unsigned long out_id;
+	bool out_done, in_done;
+	int uptodate;
+	struct virtio_blk_outhdr out_hdr;
+	struct virtio_blk_inhdr in_hdr;
+};
+
+static void end_dequeued_request(struct request *req,
+				 request_queue_t *q, int uptodate)
+{
+	/* And so the insanity of the block layer infects us here. */
+	int nsectors = req->hard_nr_sectors;
+
+	if (blk_pc_request(req)) {
+		nsectors = (req->data_len + 511) >> 9;
+		if (!nsectors)
+			nsectors = 1;
+	}
+	if (end_that_request_first(req, uptodate, nsectors))
+		BUG();
+	add_disk_randomness(req->rq_disk);
+	end_that_request_last(req, uptodate);
+}
+
+static bool finish(struct virtio_blk *vblk, struct virtblk_req *vbr)
+{
+	if (!vbr->in_done || !vbr->out_done)
+		return false;
+	end_dequeued_request(vbr->req, vblk->disk->queue, vbr->uptodate);
+	list_del(&vbr->list);
+	mempool_free(vbr, vblk->pool);
+	return true;
+}
+
+/* We make sure they finished both the input and output buffers: otherwise
+ * they might still have read access after we free them. */
+static bool blk_out_done(struct virtio_device *vdev)
+{
+	struct virtio_blk *vblk = vdev->priv;
+	struct virtblk_req *vbr;
+	unsigned int len, finished = 0;
+	unsigned long flags;
+
+	spin_lock_irqsave(&vblk->lock, flags);
+	while ((vbr = vdev->ops->get_outbuf(vdev, &len)) != NULL) {
+		BUG_ON(vbr->out_done);
+		vbr->out_done = true;
+		finished += finish(vblk, vbr);
+	}
+	/* In case queue is stopped waiting for more buffers. */
+	if (finished)
+		blk_start_queue(vblk->disk->queue);
+	spin_unlock_irqrestore(&vblk->lock, flags);
+	return true;
+}
+
+static bool blk_in_done(struct virtio_device *vdev)
+{
+	struct virtio_blk *vblk = vdev->priv;
+	struct virtblk_req *vbr;
+	unsigned int len, finished = 0;
+	unsigned long flags;
+
+	spin_lock_irqsave(&vblk->lock, flags);
+
+	while ((vbr = vdev->ops->get_inbuf(vdev, &len)) != NULL) {
+		BUG_ON(vbr->in_done);
+
+		switch (vbr->in_hdr.status) {
+		case VIRTIO_BLK_S_OK:
+			vbr->uptodate = 1;
+			break;
+		case VIRTIO_BLK_S_UNSUPP:
+			vbr->uptodate = -ENOTTY;
+			break;
+		default:
+			vbr->uptodate = 0;
+			break;
+		}
+		vbr->in_done = true;
+		finished += finish(vblk, vbr);
+	}
+
+	/* In case queue is stopped waiting for more buffers. */
+	if (finished)
+		blk_start_queue(vblk->disk->queue);
+	spin_unlock_irqrestore(&vblk->lock, flags);
+	return true;
+}
+
+static bool do_write(request_queue_t *q, struct virtio_blk *vblk,
+		     struct virtblk_req *vbr)
+{
+	unsigned long num;
+
+	vbr->out_hdr.type |= VIRTIO_BLK_T_WRITE;
+
+	/* Set up for reply. */
+	vblk->sg[0].page = virt_to_page(&vbr->in_hdr);
+	vblk->sg[0].offset = offset_in_page(&vbr->in_hdr);
+	vblk->sg[0].length = sizeof(vbr->in_hdr);
+	vbr->out_hdr.id = vblk->vdev->ops->add_inbuf(vblk->vdev, vblk->sg, 1,
+						     vbr);
+	if (IS_ERR_VALUE(vbr->out_hdr.id))
+		goto full;
+
+	/* First sg element points to output header. */
+	vblk->sg[0].page = virt_to_page(&vbr->out_hdr);
+	vblk->sg[0].offset = offset_in_page(&vbr->out_hdr);
+	vblk->sg[0].length = sizeof(vbr->out_hdr);
+
+	num = blk_rq_map_sg(q, vbr->req, vblk->sg+1);
+	vbr->out_id = vblk->vdev->ops->add_outbuf(vblk->vdev, vblk->sg, 1+num,
+						  vbr);
+	if (IS_ERR_VALUE(vbr->out_id))
+		goto detach_inbuf_full;
+
+	pr_debug("Write: %p in=%lu out=%lu\n", vbr,
+		 (long)vbr->out_hdr.id, (long)vbr->out_id);
+	list_add_tail(&vbr->list, &vblk->reqs);
+	return true;
+
+detach_inbuf_full:
+	vblk->vdev->ops->detach_inbuf(vblk->vdev, vbr->out_hdr.id);
+full:
+	return false;
+}
+
+static bool do_read(request_queue_t *q, struct virtio_blk *vblk,
+		    struct virtblk_req *vbr)
+{
+	unsigned long num;
+
+	vbr->out_hdr.type |= VIRTIO_BLK_T_READ;
+
+	/* Set up for reply. */
+	vblk->sg[0].page = virt_to_page(&vbr->in_hdr);
+	vblk->sg[0].offset = offset_in_page(&vbr->in_hdr);
+	vblk->sg[0].length = sizeof(vbr->in_hdr);
+	num = blk_rq_map_sg(q, vbr->req, vblk->sg+1);
+	vbr->out_hdr.id = vblk->vdev->ops->add_inbuf(vblk->vdev, vblk->sg,
+						     1+num, vbr);
+	if (IS_ERR_VALUE(vbr->out_hdr.id))
+		goto full;
+
+	vblk->sg[0].page = virt_to_page(&vbr->out_hdr);
+	vblk->sg[0].offset = offset_in_page(&vbr->out_hdr);
+	vblk->sg[0].length = sizeof(vbr->out_hdr);
+
+	vbr->out_id = vblk->vdev->ops->add_outbuf(vblk->vdev, vblk->sg, 1,
+						  vbr);
+	if (IS_ERR_VALUE(vbr->out_id))
+		goto detach_inbuf_full;
+
+	pr_debug("Read: %p in=%lu out=%lu\n", vbr,
+		 (long)vbr->out_hdr.id, (long)vbr->out_id);
+	list_add_tail(&vbr->list, &vblk->reqs);
+	return true;
+
+detach_inbuf_full:
+	vblk->vdev->ops->detach_inbuf(vblk->vdev, vbr->out_hdr.id);
+full:
+	return false;
+}
+
+static bool do_scsi(request_queue_t *q, struct virtio_blk *vblk,
+		    struct virtblk_req *vbr)
+{
+	unsigned long num;
+
+	vbr->out_hdr.type |= VIRTIO_BLK_T_SCSI_CMD;
+
+	/* Set up for reply. */
+	vblk->sg[0].page = virt_to_page(&vbr->in_hdr);
+	vblk->sg[0].offset = offset_in_page(&vbr->in_hdr);
+	vblk->sg[0].length = sizeof(vbr->in_hdr);
+	vbr->out_hdr.id = vblk->vdev->ops->add_inbuf(vblk->vdev, vblk->sg, 1,
+						     vbr);
+	if (IS_ERR_VALUE(vbr->out_hdr.id))
+		goto full;
+
+	vblk->sg[0].page = virt_to_page(&vbr->out_hdr);
+	vblk->sg[0].offset = offset_in_page(&vbr->out_hdr);
+	vblk->sg[0].length = sizeof(vbr->out_hdr);
+	vblk->sg[1].page = virt_to_page(vbr->req->cmd);
+	vblk->sg[1].offset = offset_in_page(vbr->req->cmd);
+	vblk->sg[1].length = vbr->req->cmd_len;
+
+	num = blk_rq_map_sg(q, vbr->req, vblk->sg+1);
+	vbr->out_id = vblk->vdev->ops->add_outbuf(vblk->vdev, vblk->sg, 2+num,
+						  vbr);
+	if (IS_ERR_VALUE(vbr->out_id))
+		goto detach_inbuf_full;
+
+	pr_debug("Scsi: %p in=%lu out=%lu\n", vbr,
+		 (long)vbr->out_hdr.id, (long)vbr->out_id);
+	list_add_tail(&vbr->list, &vblk->reqs);
+	return true;
+
+detach_inbuf_full:
+	vblk->vdev->ops->detach_inbuf(vblk->vdev, vbr->out_hdr.id);
+full:
+	return false;
+}
+
+static void do_virtblk_request(request_queue_t *q)
+{
+	struct virtio_blk *vblk = NULL;
+	struct request *req;
+	struct virtblk_req *vbr;
+
+	while ((req = elv_next_request(q)) != NULL) {
+		vblk = req->rq_disk->private_data;
+
+		vbr = mempool_alloc(vblk->pool, GFP_ATOMIC);
+		if (!vbr)
+			goto stop;
+
+		BUG_ON(req->nr_phys_segments > ARRAY_SIZE(vblk->sg));
+		vbr->req = req;
+		/* Actual type gets or'ed in do_scsi/do_write/do_read */
+		vbr->out_hdr.type = blk_barrier_rq(req)?VIRTIO_BLK_T_BARRIER:0;
+		vbr->out_hdr.sector = req->sector;
+		vbr->out_hdr.ioprio = req->ioprio;
+		vbr->out_done = vbr->in_done = false;
+
+		if (blk_pc_request(req)) {
+			if (!do_scsi(q, vblk, vbr))
+				goto stop;
+		} else if (blk_fs_request(req)) {
+			if (rq_data_dir(req) == WRITE) {
+				if (!do_write(q, vblk, vbr))
+					goto stop;
+			} else {
+				if (!do_read(q, vblk, vbr))
+					goto stop;
+			}
+		} else
+			/* We don't put anything else in the queue. */
+			BUG();
+		blkdev_dequeue_request(req);
+	}
+
+sync:
+	if (vblk)
+		vblk->vdev->ops->sync(vblk->vdev, VIRTIO_IN|VIRTIO_OUT);
+	return;
+
+stop:
+	/* Queue full?  Wait. */
+	blk_stop_queue(q);
+	mempool_free(vbr, vblk->pool);
+	goto sync;
+}
+
+static int virtblk_ioctl(struct inode *inode, struct file *filp,
+			 unsigned cmd, unsigned long data)
+{
+	return scsi_cmd_ioctl(filp, inode->i_bdev->bd_disk, cmd,
+			      (void __user *)data);
+}
+
+static struct virtio_driver_ops virtblk_ops = {
+	.in = blk_in_done,
+	.out = blk_out_done,
+};
+
+
+static struct block_device_operations virtblk_fops = {
+	.ioctl = virtblk_ioctl,
+	.owner = THIS_MODULE,
+};
+
+struct gendisk *virtblk_probe(struct virtio_device *vdev)
+{
+	struct virtio_blk *vblk;
+	int err, major;
+
+	vblk = kmalloc(sizeof(*vblk), GFP_KERNEL);
+	if (!vblk) {
+		err = -ENOMEM;
+		goto out;
+	}
+
+	INIT_LIST_HEAD(&vblk->reqs);
+	spin_lock_init(&vblk->lock);
+	vblk->vdev = vdev;
+	vdev->priv = vblk;
+	vdev->driver_ops = &virtblk_ops;
+
+	vblk->pool = mempool_create_kmalloc_pool(1,sizeof(struct virtblk_req));
+	if (!vblk->pool) {
+		err = -ENOMEM;
+		goto out_free_vblk;
+	}
+
+	major = register_blkdev(0, "virtblk");
+	if (major < 0) {
+		err = major;
+		goto out_mempool;
+	}
+
+	/* FIXME: How many partitions?  How long is a piece of string? */
+	vblk->disk = alloc_disk(1 << 3);
+	if (!vblk->disk) {
+		err = -ENOMEM;
+		goto out_unregister_blkdev;
+	}
+
+	vblk->disk->queue = blk_init_queue(do_virtblk_request, &vblk->lock);
+	if (!vblk->disk->queue) {
+		err = -ENOMEM;
+		goto out_put_disk;
+	}
+
+	sprintf(vblk->disk->disk_name, "vd%c", virtblk_index++);
+	vblk->disk->major = major;
+	vblk->disk->first_minor = 0;
+	vblk->disk->private_data = vblk;
+	vblk->disk->fops = &virtblk_fops;
+
+	blk_queue_ordered(vblk->disk->queue, QUEUE_ORDERED_TAG, NULL);
+
+	/* Caller can do blk_queue_max_hw_segments(), set_capacity()
+	 * etc then add_disk(). */
+	return vblk->disk;
+
+out_put_disk:
+	put_disk(vblk->disk);
+out_unregister_blkdev:
+	unregister_blkdev(major, "virtblk");
+out_mempool:
+	mempool_destroy(vblk->pool);
+out_free_vblk:
+	kfree(vblk);
+out:
+	return ERR_PTR(err);
+}
+EXPORT_SYMBOL_GPL(virtblk_probe);
+
+void virtblk_remove(struct gendisk *disk)
+{
+	struct virtio_blk *vblk = disk->private_data;
+	int major = vblk->disk->major;
+
+	BUG_ON(!list_empty(&vblk->reqs));
+	blk_cleanup_queue(vblk->disk->queue);
+	put_disk(vblk->disk);
+	unregister_blkdev(major, "virtblk");
+	mempool_destroy(vblk->pool);
+	kfree(vblk);
+}
+EXPORT_SYMBOL_GPL(virtblk_remove);
===================================================================
--- a/include/linux/Kbuild
+++ b/include/linux/Kbuild
@@ -341,6 +341,7 @@ unifdef-y += utsname.h
 unifdef-y += utsname.h
 unifdef-y += videodev2.h
 unifdef-y += videodev.h
+unifdef-y += virtio_blk.h
 unifdef-y += wait.h
 unifdef-y += wanrouter.h
 unifdef-y += watchdog.h
===================================================================
--- /dev/null
+++ b/include/linux/virtio_blk.h
@@ -0,0 +1,39 @@
+#ifndef _LINUX_VIRTIO_BLK_H
+#define _LINUX_VIRTIO_BLK_H
+#include <linux/types.h>
+
+#define VIRTIO_BLK_T_READ	0
+#define VIRTIO_BLK_T_WRITE	1
+#define VIRTIO_BLK_T_SCSI_CMD	3
+#define VIRTIO_BLK_T_BARRIER	0x80000000 /* Barrier before this op. */
+
+/* This is the first element of the scatter-gather list. */
+struct virtio_blk_outhdr
+{
+	/* VIRTIO_BLK_T* */
+	__u32 type;
+	/* io priority. */
+	__u32 ioprio;
+	/* Sector (ie. 512 byte offset) */
+	__u64 sector;
+	/* Where to put reply. */
+	__u64 id;
+};
+
+#define VIRTIO_BLK_S_OK		0
+#define VIRTIO_BLK_S_IOERR	1
+#define VIRTIO_BLK_S_UNSUPP	2
+
+struct virtio_blk_inhdr
+{
+	unsigned char status;
+};
+
+#ifdef __KERNEL__
+struct gendisk;
+struct virtio_device;
+
+struct gendisk *virtblk_probe(struct virtio_device *vdev);
+void virtblk_remove(struct gendisk *disk);
+#endif /* __KERNEL__ */
+#endif /* _LINUX_VIRTIO_BLK_H */




^ permalink raw reply	[flat|nested] 26+ messages in thread

* [PATCH] Lguest implemention of virtio draft III
       [not found]                 ` <1181999920.6237.263.camel-bi+AKbBUZKY6gyzm1THtWbp2dZbC/Bob@public.gmane.org>
@ 2007-06-16 13:28                   ` Rusty Russell
       [not found]                     ` <1182000514.6237.273.camel-bi+AKbBUZKY6gyzm1THtWbp2dZbC/Bob@public.gmane.org>
  2007-06-17 14:25                   ` [PATCH RFC 3/3] Virtio draft III: example block driver Avi Kivity
  1 sibling, 1 reply; 26+ messages in thread
From: Rusty Russell @ 2007-06-16 13:28 UTC (permalink / raw)
  To: virtualization
  Cc: Stephen Rothwell, Xen Mailing List,
	jmk-zzFmDc4TPjtKvsKVC3L/VUEOCMrvLtNR@public.gmane.org, kvm-devel,
	Christian Borntraeger, Latchesar Ionkov, Suzanne McIntosh,
	Martin Schwidefsky

This is a bonus patch for those wondering what a virtio implementation
can look like.  I have two; this is the more efficient one (it needs
some modification for inter-guest use, though: it assumes the other end
does all the accessing of our memory).  It's currently tacked on to the
existing lguest I/O mechanism as a demonstration, rather than replacing
it.

It shows that it's possible to implement virtio without internal
locking.

Userspace server-side code isn't included.
===
This allows zero-copy transfers between guest and host.  It uses a page
of descriptors, a page to say which descriptors are available, and a
page to say which have been used: one set each for inbufs and outbufs.

TODO:
1) More polishing
2) Get rid of old I/O
3) Inter-guest I/O implementation

Signed-off-by: Rusty Russell <rusty-8n+1lVoiYb80n/F98K4Iww@public.gmane.org>
---
 drivers/lguest/Makefile         |    2 
 drivers/lguest/hypercalls.c     |    4 
 drivers/lguest/lguest_virtio.c  |  511 +++++++++++++++++++++++++++++++++++++++
 include/linux/lguest.h          |    3 
 include/linux/lguest_launcher.h |   24 +
 6 files changed, 948 insertions(+), 5 deletions(-)

--- a/drivers/lguest/Makefile
+++ b/drivers/lguest/Makefile
@@ -1,5 +1,5 @@
 # Guest requires the paravirt_ops replacement and the bus driver.
-obj-$(CONFIG_LGUEST_GUEST) += lguest.o lguest_asm.o lguest_bus.o
+obj-$(CONFIG_LGUEST_GUEST) += lguest.o lguest_asm.o lguest_bus.o lguest_virtio.o
 
 # Host requires the other files, which can be a module.
 obj-$(CONFIG_LGUEST)	+= lg.o
===================================================================
--- a/drivers/lguest/hypercalls.c
+++ b/drivers/lguest/hypercalls.c
@@ -86,6 +86,10 @@ static void do_hcall(struct lguest *lg, 
 		break;
 	case LHCALL_HALT:
 		lg->halted = 1;
+		break;
+	case LHCALL_NOTIFY:
+		lg->pending_key = regs->edx << PAGE_SHIFT;
+		lg->dma_is_pending = 1;
 		break;
 	default:
 		kill_guest(lg, "Bad hypercall %li\n", regs->eax);
===================================================================
--- /dev/null
+++ b/drivers/lguest/lguest_virtio.c
@@ -0,0 +1,511 @@
+/* Descriptor-based virtio backend using lguest. */
+
+/* FIXME: Put "running" in shared page so other side really doesn't
+ * send us interrupts.  Then we would never need to "fail" restart.
+ * If there are more buffers when we set "running", simply ping other
+ * side.  It would interrupt us back again.
+ */
+#define DEBUG
+#include <linux/lguest.h>
+#include <linux/lguest_bus.h>
+#include <linux/virtio.h>
+#include <linux/interrupt.h>
+#include <asm/io.h>
+
+#define NUM_DESCS (PAGE_SIZE / sizeof(struct lguest_desc))
+
+#ifdef DEBUG
+/* For development, we want to crash whenever the other side is bad. */
+#define BAD_SIDE(lgv, fmt...)			\
+	do { dev_err(lgv->vdev.dev, fmt); BUG(); } while(0)
+#define START_USE(di) \
+	do { if ((di)->in_use) panic("in_use = %i\n", (di)->in_use); (di)->in_use = __LINE__; mb(); } while(0)
+#define END_USE(di) \
+	do { BUG_ON(!(di)->in_use); (di)->in_use = 0; mb(); } while(0)
+#else
+#define BAD_SIDE(lgv, fmt...)			\
+	do { dev_err(lgv->vdev.dev, fmt); (lgv)->broken = true; } while(0)
+#define START_USE(di)
+#define END_USE(di)
+#endif
+
+/* FIXME: make the device mem layout a struct, not a set of pointers */
+struct desc_info
+{
+	/* Page of descriptors. */
+	struct lguest_desc *desc;
+	/* How we tell other side what buffers are available. */
+	unsigned int *avail_idx;
+	unsigned int *available;
+	/* How other side tells us what's used. */
+	unsigned int *used_idx;
+	struct lguest_used *used;
+
+	/* Number of free buffers */
+	unsigned int num_free;
+	/* Head of free buffer list. */
+	unsigned int free_head;
+	/* Number we've added since last sync. */
+	unsigned int num_added;
+
+	/* Last used index we've seen. */
+	unsigned int last_used_idx;
+
+	/* Unless they told us to stop */
+	bool running;
+
+#ifdef DEBUG
+	/* They're supposed to lock for us. */
+	unsigned int in_use;
+#endif
+
+	/* Tokens for callbacks. */
+	void *data[NUM_DESCS];
+};
+
+/* FIXME: When doing this for real, vdev will go straight into lguest_device */
+struct lguest_virtio_device
+{
+	struct virtio_device vdev;
+	struct lguest_device *lg;
+	void *priv;
+
+	/* Other side has made a mess, don't try any more. */
+	bool broken;
+
+	struct desc_info in, out;
+};
+
+static inline struct lguest_virtio_device *
+vdev_to_lgv(struct virtio_device *vdev)
+{
+	return container_of(vdev, struct lguest_virtio_device, vdev);
+}
+
+static unsigned long add_buf(struct desc_info *di,
+			     const struct scatterlist *sg,
+			     unsigned int num,
+			     void *data)
+{
+	unsigned int i, head, uninitialized_var(prev);
+
+	BUG_ON(data == NULL);
+	START_USE(di);
+
+	if (di->num_free < num) {
+		pr_debug("Can't add buf len %i - avail = %i\n", num,
+			 di->num_free);
+		END_USE(di);
+		return -ENOSPC;
+	}
+
+	/* We're about to use some buffers from the free list. */
+	di->num_free -= num;
+
+	head = di->free_head;
+	for (i = di->free_head; num; i = di->desc[i].next, num--) {
+		di->desc[i].flags |= LGUEST_DESC_F_NEXT;
+		di->desc[i].pfn = page_to_pfn(sg[0].page);
+		di->desc[i].offset = sg[0].offset;
+		di->desc[i].len = sg[0].length;
+		prev = i;
+		sg++;
+	}
+	/* Last one doesn't continue. */
+	di->desc[prev].flags &= ~LGUEST_DESC_F_NEXT;
+
+	/* Update free pointer */
+	di->free_head = i;
+
+	di->data[head] = data;
+
+	/* Make sure it's all visible to other side before setting head. */
+	wmb();
+	di->desc[head].flags |= LGUEST_DESC_F_HEAD;
+
+	/* Put it in available array for advertising. */
+	di->available[(*di->avail_idx + di->num_added++) % NUM_DESCS] = head;
+
+	pr_debug("Added buffer head %i\n", head);
+	END_USE(di);
+	return head;
+}
+
+static unsigned long lguest_add_outbuf(struct virtio_device *vdev,
+				       const struct scatterlist sg[],
+				       unsigned int num,
+				       void *data)
+{
+	struct lguest_virtio_device *lgv = vdev_to_lgv(vdev);
+
+	BUG_ON(num > NUM_DESCS);
+	BUG_ON(num == 0);
+
+	return add_buf(&lgv->out, sg, num, data);
+}
+
+static unsigned long lguest_add_inbuf(struct virtio_device *vdev,
+				      struct scatterlist sg[],
+				      unsigned int num,
+				      void *data)
+{
+	struct lguest_virtio_device *lgv = vdev_to_lgv(vdev);
+
+	BUG_ON(num > NUM_DESCS);
+	BUG_ON(num == 0);
+
+	return add_buf(&lgv->in, sg, num, data);
+}
+
+static void lguest_sync(struct virtio_device *vdev, enum virtio_dir inout)
+{
+	struct lguest_virtio_device *lgv = vdev_to_lgv(vdev);
+
+	if (inout & VIRTIO_IN)
+		START_USE(&lgv->in);
+	if (inout & VIRTIO_OUT)
+		START_USE(&lgv->out);
+	/* LGUEST_DESC_F_HEAD needs to be set before we say they're avail. */
+	wmb();
+
+	if (inout & VIRTIO_IN) {
+		*lgv->in.avail_idx += lgv->in.num_added;
+		lgv->in.num_added = 0;
+	}
+	if (inout & VIRTIO_OUT) {
+		*lgv->out.avail_idx += lgv->out.num_added;
+		lgv->out.num_added = 0;
+	}
+
+	/* Prod other side to tell it about changes. */
+	hcall(LHCALL_NOTIFY, lguest_devices[lgv->lg->index].pfn, 0, 0);
+	if (inout & VIRTIO_IN)
+		END_USE(&lgv->in);
+	if (inout & VIRTIO_OUT)
+		END_USE(&lgv->out);
+}
+
+static void detach_buf(struct desc_info *di, int id)
+{
+	unsigned int i;
+
+	BUG_ON(id >= NUM_DESCS);
+	BUG_ON(!(di->desc[id].flags & LGUEST_DESC_F_HEAD));
+
+	di->desc[id].flags &= ~LGUEST_DESC_F_HEAD;
+	/* Make sure other side has seen that it's detached. */
+	wmb();
+
+	/* Put back on free list: find end */
+	for (i = id; di->desc[i].flags&LGUEST_DESC_F_NEXT; i=di->desc[i].next)
+		di->num_free++;
+
+	di->desc[i].next = di->free_head;
+	di->free_head = id;
+	/* Plus final descriptor */
+	di->num_free++;
+}
+
+static void lguest_detach_outbuf(struct virtio_device *vdev, unsigned long id)
+{
+	struct lguest_virtio_device *lgv = vdev_to_lgv(vdev);
+
+	START_USE(&lgv->out);
+	detach_buf(&lgv->out, id);
+	END_USE(&lgv->out);
+}
+
+static void lguest_detach_inbuf(struct virtio_device *vdev, unsigned long id)
+{
+	struct lguest_virtio_device *lgv = vdev_to_lgv(vdev);
+
+	START_USE(&lgv->in);
+	detach_buf(&lgv->in, id);
+	END_USE(&lgv->in);
+}
+
+static bool more_used(struct desc_info *di)
+{
+	return di->last_used_idx != *di->used_idx;
+}
+
+static void *get_buf(struct desc_info *di, struct lguest_virtio_device *lgv,
+		     unsigned int *len)
+{
+	unsigned int id;
+
+	START_USE(di);
+
+	if (!more_used(di)) {
+		END_USE(di);
+		return NULL;
+	}
+
+	/* Don't let them make us do infinite work. */
+	if (unlikely(*di->used_idx > di->last_used_idx + NUM_DESCS)) {
+		BAD_SIDE(lgv, "Too many descriptors");
+		return NULL;
+	}
+
+	id = di->used[di->last_used_idx%NUM_DESCS].id;
+	*len = di->used[di->last_used_idx%NUM_DESCS].len;
+
+	if (unlikely(id >= NUM_DESCS)) {
+		BAD_SIDE(lgv, "id %u out of range\n", id);
+		return NULL;
+	}
+	if (unlikely(!(di->desc[id].flags & LGUEST_DESC_F_HEAD))) {
+		BAD_SIDE(lgv, "id %u is not a head!\n", id);
+		return NULL;
+	}
+
+	detach_buf(di, id);
+	di->last_used_idx++;
+	BUG_ON(!di->data[id]);
+	END_USE(di);
+	return di->data[id];
+}
+
+static void *lguest_get_outbuf(struct virtio_device *vdev, unsigned int *len)
+{
+	struct lguest_virtio_device *lgv = vdev_to_lgv(vdev);
+
+	return get_buf(&lgv->out, lgv, len);
+}
+
+static void *lguest_get_inbuf(struct virtio_device *vdev, unsigned int *len)
+{
+	struct lguest_virtio_device *lgv = vdev_to_lgv(vdev);
+
+	return get_buf(&lgv->in, lgv, len);
+}
+
+static bool lguest_restart_in(struct virtio_device *vdev)
+{
+	struct lguest_virtio_device *lgv = vdev_to_lgv(vdev);
+
+	START_USE(&lgv->in);
+	BUG_ON(lgv->in.running);
+
+	if (likely(!more_used(&lgv->in)) || unlikely(lgv->broken))
+		lgv->in.running = true;
+
+	END_USE(&lgv->in);
+	return lgv->in.running;
+}
+
+static bool lguest_restart_out(struct virtio_device *vdev)
+{
+	struct lguest_virtio_device *lgv = vdev_to_lgv(vdev);
+
+	START_USE(&lgv->out);
+	BUG_ON(lgv->out.running);
+
+	if (likely(!more_used(&lgv->out)) || unlikely(lgv->broken))
+		lgv->out.running = true;
+
+	END_USE(&lgv->out);
+	return lgv->out.running;
+}
+
+static irqreturn_t lguest_virtio_interrupt(int irq, void *_lgv)
+{
+	struct lguest_virtio_device *lgv = _lgv;
+
+	if (unlikely(lgv->broken))
+		return IRQ_HANDLED;
+
+	if (lgv->out.running && more_used(&lgv->out))
+		lgv->out.running = lgv->vdev.driver_ops->out(&lgv->vdev);
+
+	if (lgv->in.running && more_used(&lgv->in))
+		lgv->in.running = lgv->vdev.driver_ops->in(&lgv->vdev);
+
+	return IRQ_HANDLED;
+}
+
+static struct virtio_ops lguest_virtio_ops = {
+	.add_outbuf = lguest_add_outbuf,
+	.add_inbuf = lguest_add_inbuf,
+	.sync = lguest_sync,
+	.detach_outbuf = lguest_detach_outbuf,
+	.detach_inbuf = lguest_detach_inbuf,
+	.get_outbuf = lguest_get_outbuf,
+	.get_inbuf = lguest_get_inbuf,
+	.restart_in = lguest_restart_in,
+	.restart_out = lguest_restart_out,
+};
+
+static struct lguest_virtio_device *lg_new_virtio(struct lguest_device *lgdev)
+{
+	struct lguest_virtio_device *lgv;
+	void *mem;
+	unsigned int i;
+
+	lgv = kzalloc(sizeof(*lgv), GFP_KERNEL);
+	if (!lgv)
+		return NULL;
+
+	lgdev->private = lgv;
+	lgv->lg = lgdev;
+
+	/* Device mem is input pages followed by output pages */
+	mem = lguest_map(lguest_devices[lgdev->index].pfn<<PAGE_SHIFT, 6);
+	if (!mem)
+		goto free_lgv;
+	lgv->in.desc = mem;
+	lgv->in.avail_idx = mem + PAGE_SIZE;
+	lgv->in.available = (void *)(lgv->in.avail_idx + 1);
+	lgv->in.used_idx = mem + PAGE_SIZE*2;
+	lgv->in.used = (void *)(lgv->in.used_idx + 1);
+	lgv->out.desc = mem + PAGE_SIZE*3;
+	lgv->out.avail_idx = mem + PAGE_SIZE*4;
+	lgv->out.available = (void *)(lgv->out.avail_idx + 1);
+	lgv->out.used_idx = mem + PAGE_SIZE*5;
+	lgv->out.used = (void *)(lgv->out.used_idx + 1);
+
+	lgv->in.last_used_idx = lgv->out.last_used_idx = 0;
+	lgv->in.num_added = lgv->out.num_added = 0;
+	lgv->in.running = lgv->out.running = true;
+
+	/* Put everything in free lists. */
+	lgv->in.num_free = lgv->out.num_free = NUM_DESCS;
+	for (i = 0; i < NUM_DESCS-1; i++) {
+		lgv->in.desc[i].next = i+1;
+		lgv->out.desc[i].next = i+1;
+	}
+
+	lgv->vdev.ops = &lguest_virtio_ops;
+	lgv->vdev.dev = &lgdev->dev;
+	lgv->broken = false;
+	return lgv;
+
+free_lgv:
+	kfree(lgv);
+	return NULL;
+}
+
+static void lg_destroy_virtio(struct lguest_virtio_device *lgv)
+{
+	lguest_unmap(lgv->in.desc);
+	kfree(lgv);
+}
+
+/* It's nice to have the name for the interrupt, so we do this separately
+ * from lg_new_virtio(). */
+static int lg_setup_interrupt(struct lguest_virtio_device *lgv,
+			      const char *name)
+{
+	int irqf;
+
+	if (lguest_devices[lgv->lg->index].features&LGUEST_DEVICE_F_RANDOMNESS)
+		irqf = IRQF_SAMPLE_RANDOM;
+	else
+		irqf = 0;
+
+	return request_irq(lgdev_irq(lgv->lg), lguest_virtio_interrupt, irqf,
+			   name, lgv);
+}
+
+/* Example network driver code. */
+#include <linux/virtio_net.h>
+#include <linux/etherdevice.h>
+
+static int lguest_virtnet_probe(struct lguest_device *lgdev)
+{
+	struct lguest_virtio_device *lgv;
+	struct net_device *dev;
+	u8 mac[ETH_ALEN];
+	int err;
+
+	lgv = lg_new_virtio(lgdev);
+	if (!lgv)
+		return -ENOMEM;
+
+	random_ether_addr(mac);
+	lgv->priv = dev = virtnet_probe(&lgv->vdev, mac);
+	if (IS_ERR(lgv->priv)) {
+		err = PTR_ERR(lgv->priv);
+		goto destroy;
+	}
+	err = lg_setup_interrupt(lgv, dev->name);
+	if (err)
+		goto unprobe;
+	return 0;
+
+unprobe:
+	virtnet_remove(dev);
+destroy:
+	lg_destroy_virtio(lgv);
+	return err;
+}
+
+static struct lguest_driver lguest_virtnet_drv = {
+	.name = "lguestvirtnet",
+	.owner = THIS_MODULE,
+	.device_type = LGUEST_DEVICE_T_VIRTNET,
+	.probe = lguest_virtnet_probe,
+};
+
+static __init int lguest_virtnet_init(void)
+{
+	return register_lguest_driver(&lguest_virtnet_drv);
+}
+device_initcall(lguest_virtnet_init);
+
+/* Example block driver code. */
+#include <linux/virtio_blk.h>
+#include <linux/genhd.h>
+#include <linux/blkdev.h>
+static int lguest_virtblk_probe(struct lguest_device *lgdev)
+{
+	struct lguest_virtio_device *lgv;
+	struct gendisk *disk;
+	unsigned long sectors;
+	int err;
+
+	lgv = lg_new_virtio(lgdev);
+	if (!lgv)
+		return -ENOMEM;
+
+	/* Page is initially used to pass capacity. */
+	sectors = *(unsigned long *)lgv->in.desc;
+	*(unsigned long *)lgv->in.desc = 0;
+
+	lgv->priv = disk = virtblk_probe(&lgv->vdev);
+	if (IS_ERR(lgv->priv)) {
+		err = PTR_ERR(lgv->priv);
+		goto destroy;
+	}
+	set_capacity(disk, sectors);
+	blk_queue_max_hw_segments(disk->queue, NUM_DESCS-1);
+
+	err = lg_setup_interrupt(lgv, disk->disk_name);
+	if (err)
+		goto unprobe;
+	add_disk(disk);
+	return 0;
+
+unprobe:
+	virtblk_remove(disk);
+destroy:
+	lg_destroy_virtio(lgv);
+	return err;
+}
+
+static struct lguest_driver lguest_virtblk_drv = {
+	.name = "lguestvirtblk",
+	.owner = THIS_MODULE,
+	.device_type = LGUEST_DEVICE_T_VIRTBLK,
+	.probe = lguest_virtblk_probe,
+};
+
+static __init int lguest_virtblk_init(void)
+{
+	return register_lguest_driver(&lguest_virtblk_drv);
+}
+device_initcall(lguest_virtblk_init);
+
+MODULE_LICENSE("GPL");
===================================================================
--- a/include/linux/lguest.h
+++ b/include/linux/lguest.h
@@ -23,6 +23,9 @@
 #define LHCALL_SET_PTE		14
 #define LHCALL_SET_PMD		15
 #define LHCALL_LOAD_TLS		16
+
+/* Experimental hcalls for new I/O */
+#define LHCALL_NOTIFY	100 /* pfn */
 
 #define LG_CLOCK_MIN_DELTA	100UL
 #define LG_CLOCK_MAX_DELTA	ULONG_MAX
===================================================================
--- a/include/linux/lguest_launcher.h
+++ b/include/linux/lguest_launcher.h
@@ -44,6 +44,8 @@ struct lguest_device_desc {
 #define LGUEST_DEVICE_T_CONSOLE	1
 #define LGUEST_DEVICE_T_NET	2
 #define LGUEST_DEVICE_T_BLOCK	3
+#define LGUEST_DEVICE_T_VIRTNET	8
+#define LGUEST_DEVICE_T_VIRTBLK	9
 
 	u16 features;
 #define LGUEST_NET_F_NOCSUM		0x4000 /* Don't bother checksumming */
@@ -70,4 +72,26 @@ enum lguest_req
 	LHREQ_IRQ, /* + irq */
 	LHREQ_BREAK, /* + on/off flag (on blocks until someone does off) */
 };
+
+/* This marks a buffer as being the start (and active) */
+#define LGUEST_DESC_F_HEAD	1
+/* This marks a buffer as continuing via the next field. */
+#define LGUEST_DESC_F_NEXT	2
+
+/* Virtio descriptors */
+struct lguest_desc
+{
+	unsigned long pfn;
+	unsigned long len;
+	u16 offset;
+	u16 flags;
+	/* We chain unused descriptors via this, too */
+	u32 next;
+};
+
+struct lguest_used
+{
+	unsigned int id;
+	unsigned int len;
+};
 #endif /* _ASM_LGUEST_USER */



-------------------------------------------------------------------------
This SF.net email is sponsored by DB2 Express
Download DB2 Express C - the FREE version of DB2 express and take
control of your XML. No limits. Just data. Click to get it now.
http://sourceforge.net/powerbar/db2/

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH] Lguest implementation of virtio draft III
       [not found]                     ` <1182000514.6237.273.camel-bi+AKbBUZKY6gyzm1THtWbp2dZbC/Bob@public.gmane.org>
@ 2007-06-16 13:50                       ` Arnd Bergmann
  0 siblings, 0 replies; 26+ messages in thread
From: Arnd Bergmann @ 2007-06-16 13:50 UTC (permalink / raw)
  To: kvm-devel-5NWGOfrQmneRv+LV9MX5uipxlwaOVQ5f
  Cc: Stephen Rothwell, Xen Mailing List,
	jmk-zzFmDc4TPjtKvsKVC3L/VUEOCMrvLtNR@public.gmane.org,
	Christian Borntraeger, virtualization, Latchesar Ionkov,
	Suzanne McIntosh, Martin Schwidefsky

On Saturday 16 June 2007, Rusty Russell wrote:
> This is a bonus patch for those wondering how a virtio implementation
> can look.

Thanks, I assumed it was something like this, but having the code
in front of me makes it a lot easier to comment on it.

> +static struct lguest_driver lguest_virtnet_drv = {
> +	.name = "lguestvirtnet",
> +	.owner = THIS_MODULE,
> +	.device_type = LGUEST_DEVICE_T_VIRTNET,
> +	.probe = lguest_virtnet_probe,
> +};
> +
> +static __init int lguest_virtnet_init(void)
> +{
> +	return register_lguest_driver(&lguest_virtnet_drv);
> +}
> +device_initcall(lguest_virtnet_init);
> +
> +/* Example block driver code. */
> +#include <linux/virtio_blk.h>
> +#include <linux/genhd.h>
> +#include <linux/blkdev.h>
> +static int lguest_virtblk_probe(struct lguest_device *lgdev)
> +{
> +	struct lguest_virtio_device *lgv;
> +	struct gendisk *disk;
> +	unsigned long sectors;
> +	int err;
> +
> +	lgv = lg_new_virtio(lgdev);
> +	if (!lgv)
> +		return -ENOMEM;
> +
> +	/* Page is initially used to pass capacity. */
> +	sectors = *(unsigned long *)lgv->in.desc;
> +	*(unsigned long *)lgv->in.desc = 0;
> +
> +	lgv->priv = disk = virtblk_probe(&lgv->vdev);
> +	if (IS_ERR(lgv->priv)) {
> +		err = PTR_ERR(lgv->priv);
> +		goto destroy;
> +	}
> +	set_capacity(disk, sectors);
> +	blk_queue_max_hw_segments(disk->queue, NUM_DESCS-1);
> +
> +	err = lg_setup_interrupt(lgv, disk->disk_name);
> +	if (err)
> +		goto unprobe;
> +	add_disk(disk);
> +	return 0;
> +
> +unprobe:
> +	virtblk_remove(disk);
> +destroy:
> +	lg_destroy_virtio(lgv);
> +	return err;
> +}
> +
> +static struct lguest_driver lguest_virtblk_drv = {
> +	.name = "lguestvirtblk",
> +	.owner = THIS_MODULE,
> +	.device_type = LGUEST_DEVICE_T_VIRTBLK,
> +	.probe = lguest_virtblk_probe,
> +};
> +
> +static __init int lguest_virtblk_init(void)
> +{
> +	return register_lguest_driver(&lguest_virtblk_drv);
> +}
> +device_initcall(lguest_virtblk_init);

This is the part that I find puzzling. My idea would be that the drivers
and devices you register are _not_ specific to the type of virtio bus.

The problem I see with your approach is that drivers for specific devices
can't be loaded as separate modules, right now all of them have to be
loaded or statically linked into the kernel in order to load the
lguest_virtio module.

I'm sure you know how this works for other buses, but I'll explain anyway
so that everyone knows what I'm talking about. What I have in mind is
to have a single 'struct bus_type virtio_bus_type', which is used by
all virtio implementations. The 'struct virtio_device' is derived from
'struct device', meaning that a 'struct device' is embedded in it.
Each virtio implementation can either use the virtio_device directly,
or derive its own kvm_virtio_device/lguest_virtio_device/... from it.
When lguest scans its devices, it registers the lguest_virtio_device
with the generic virtio_bus_type, not an lguest specific one.
Similarly, the block/net/... drivers each register themselves as
a virtio_driver with the generic virtio_bus_type, not with an
implementation-specific bus.

Do you see a problem with the approach that I described here, or did
you just not bother implementing it so far?

	Arnd <><


* Re: [PATCH RFC 0/3] Virtio draft III
       [not found]     ` <1181999552.6237.255.camel-bi+AKbBUZKY6gyzm1THtWbp2dZbC/Bob@public.gmane.org>
  2007-06-16 13:14       ` [PATCH RFC 1/3] Virtio draft III: virtio.h Rusty Russell
@ 2007-06-17 14:14       ` Avi Kivity
       [not found]         ` <467541DF.5060907-atKUWr5tajBWk0Htik3J/w@public.gmane.org>
  1 sibling, 1 reply; 26+ messages in thread
From: Avi Kivity @ 2007-06-17 14:14 UTC (permalink / raw)
  To: Rusty Russell
  Cc: Stephen Rothwell, Xen Mailing List,
	jmk-zzFmDc4TPjtKvsKVC3L/VUEOCMrvLtNR@public.gmane.org, Herbert Xu,
	kvm-devel, virtualization, Christian Borntraeger,
	Latchesar Ionkov, Suzanne McIntosh, Martin Schwidefsky

Rusty Russell wrote:
> In this episode, Rusty tries to NAPI-ize the driver and discovers that
> virtio callbacks are a bad idea: NAPI needs to turn interrupts off and
> still be able to query for new incoming packets.
>
> Changes to core:
> 1) Back to "interrupt" model with get_inbuf()/get_outbuf() calls.
>   

Seems to me these are just two different ways of iterating over the 
pending buffers, and one could be implemented in terms of the other in a 
few lines.  I think the new way is better, though, due to less 
indirection.

btw, I'm not sure that the name 'virtio' is correct.  Probably any 
network or block driver which has support for batching (the vast 
majority) could use this to good effect.


-- 
error compiling committee.c: too many arguments to function



* Re: [PATCH RFC 3/3] Virtio draft III: example block driver
       [not found]                 ` <1181999920.6237.263.camel-bi+AKbBUZKY6gyzm1THtWbp2dZbC/Bob@public.gmane.org>
  2007-06-16 13:28                   ` [PATCH] Lguest implementation of virtio draft III Rusty Russell
@ 2007-06-17 14:25                   ` Avi Kivity
       [not found]                     ` <46754451.2010305-atKUWr5tajBWk0Htik3J/w@public.gmane.org>
  1 sibling, 1 reply; 26+ messages in thread
From: Avi Kivity @ 2007-06-17 14:25 UTC (permalink / raw)
  To: Rusty Russell
  Cc: Stephen Rothwell, Xen Mailing List,
	jmk-zzFmDc4TPjtKvsKVC3L/VUEOCMrvLtNR@public.gmane.org, kvm-devel,
	virtualization, Christian Borntraeger, Latchesar Ionkov,
	Suzanne McIntosh, Jens Axboe, Martin Schwidefsky

Rusty Russell wrote:
> +static bool do_read(request_queue_t *q, struct virtio_blk *vblk,
> +		    struct virtblk_req *vbr)
> +{
> +	unsigned long num;
> +
> +	vbr->out_hdr.type |= VIRTIO_BLK_T_READ;
> +
> +	/* Set up for reply. */
> +	vblk->sg[0].page = virt_to_page(&vbr->in_hdr);
> +	vblk->sg[0].offset = offset_in_page(&vbr->in_hdr);
> +	vblk->sg[0].length = sizeof(vbr->in_hdr);
> +	num = blk_rq_map_sg(q, vbr->req, vblk->sg+1);
> +	vbr->out_hdr.id = vblk->vdev->ops->add_inbuf(vblk->vdev, vblk->sg,
> +						     1+num, vbr);
> +	if (IS_ERR_VALUE(vbr->out_hdr.id))
> +		goto full;
> +
> +	vblk->sg[0].page = virt_to_page(&vbr->out_hdr);
> +	vblk->sg[0].offset = offset_in_page(&vbr->out_hdr);
> +	vblk->sg[0].length = sizeof(vbr->out_hdr);
> +
> +	vbr->out_id = vblk->vdev->ops->add_outbuf(vblk->vdev, vblk->sg, 1,
> +						  vbr);
>   

This strikes me as wasteful.  Why not set up a single descriptor which 
contains both placement and the data itself?

-- 
error compiling committee.c: too many arguments to function



* Re: [Xen-devel] Re:  [PATCH RFC 0/3] Virtio draft III
       [not found]         ` <467541DF.5060907-atKUWr5tajBWk0Htik3J/w@public.gmane.org>
@ 2007-06-18  7:48           ` Rusty Russell
  0 siblings, 0 replies; 26+ messages in thread
From: Rusty Russell @ 2007-06-18  7:48 UTC (permalink / raw)
  To: Avi Kivity
  Cc: Stephen Rothwell, Xen Mailing List,
	jmk-zzFmDc4TPjtKvsKVC3L/VUEOCMrvLtNR@public.gmane.org, Herbert Xu,
	kvm-devel, virtualization, Christian Borntraeger,
	Latchesar Ionkov, Suzanne McIntosh, Martin Schwidefsky

On Sun, 2007-06-17 at 17:14 +0300, Avi Kivity wrote:
> Rusty Russell wrote:
> > In this episode, Rusty tries to NAPI-ize the driver and discovers that
> > virtio callbacks are a bad idea: NAPI needs to turn interrupts off and
> > still be able to query for new incoming packets.
> >
> > Changes to core:
> > 1) Back to "interrupt" model with get_inbuf()/get_outbuf() calls.
> >   
> 
> Seems to me these are just two different ways of iterating over the 
> pending buffers, and one could be implemented in terms of the other in a 
> few lines.  I think the new way is better, though, due to less 
> indirection.

Yes, the lguest implementation points them to a common routine.  We
could do some batching here too, but it seemed like premature
optimization.

> btw, I'm not sure that the name 'virtio' is correct.  Probably any 
> network or block driver which has support for batching (the vast 
> majority) could use this to good effect.

Trying to avoid boiling the ocean 8)

Cheers,
Rusty.




* Re: [PATCH RFC 3/3] Virtio draft III: example block driver
       [not found]                     ` <46754451.2010305-atKUWr5tajBWk0Htik3J/w@public.gmane.org>
@ 2007-06-18  8:08                       ` Rusty Russell
       [not found]                         ` <1182154095.19064.24.camel-bi+AKbBUZKY6gyzm1THtWbp2dZbC/Bob@public.gmane.org>
  0 siblings, 1 reply; 26+ messages in thread
From: Rusty Russell @ 2007-06-18  8:08 UTC (permalink / raw)
  To: Avi Kivity
  Cc: Stephen Rothwell, Xen Mailing List,
	jmk-zzFmDc4TPjtKvsKVC3L/VUEOCMrvLtNR@public.gmane.org, kvm-devel,
	virtualization, Christian Borntraeger, Latchesar Ionkov,
	Suzanne McIntosh, Jens Axboe, Martin Schwidefsky

On Sun, 2007-06-17 at 17:25 +0300, Avi Kivity wrote:
> Rusty Russell wrote:
> > +	/* Set up for reply. */
> > +	vblk->sg[0].page = virt_to_page(&vbr->in_hdr);
> > +	vblk->sg[0].offset = offset_in_page(&vbr->in_hdr);
> > +	vblk->sg[0].length = sizeof(vbr->in_hdr);
> > +	num = blk_rq_map_sg(q, vbr->req, vblk->sg+1);
> > +	vbr->out_hdr.id = vblk->vdev->ops->add_inbuf(vblk->vdev, vblk->sg,
> > +						     1+num, vbr);
> > +	if (IS_ERR_VALUE(vbr->out_hdr.id))
> > +		goto full;
> > +
> > +	vblk->sg[0].page = virt_to_page(&vbr->out_hdr);
> > +	vblk->sg[0].offset = offset_in_page(&vbr->out_hdr);
> > +	vblk->sg[0].length = sizeof(vbr->out_hdr);
> > +
> > +	vbr->out_id = vblk->vdev->ops->add_outbuf(vblk->vdev, vblk->sg, 1,
> > +						  vbr);
> 
> This strikes me as wasteful.  Why not set up a single descriptor which 
> contains both placement and the data itself?

We could actually do this for write, but not for read (where the length
& sector need to be sent to other end, and the data comes from other
end).

The strict separation of in and out in virtio is to support both
untrusted inter-guest I/O (admittedly not useful for block devices) and
socket-style hypercall interfaces.

Thanks,
Rusty.



* Re: [PATCH RFC 3/3] Virtio draft III: example block driver
       [not found]                         ` <1182154095.19064.24.camel-bi+AKbBUZKY6gyzm1THtWbp2dZbC/Bob@public.gmane.org>
@ 2007-06-18  9:09                           ` Avi Kivity
       [not found]                             ` <46764BB5.6070704-atKUWr5tajBWk0Htik3J/w@public.gmane.org>
  0 siblings, 1 reply; 26+ messages in thread
From: Avi Kivity @ 2007-06-18  9:09 UTC (permalink / raw)
  To: Rusty Russell
  Cc: Stephen Rothwell, Xen Mailing List,
	jmk-zzFmDc4TPjtKvsKVC3L/VUEOCMrvLtNR@public.gmane.org, kvm-devel,
	virtualization, Christian Borntraeger, Latchesar Ionkov,
	Suzanne McIntosh, Jens Axboe, Martin Schwidefsky

Rusty Russell wrote:
> On Sun, 2007-06-17 at 17:25 +0300, Avi Kivity wrote:
>   
>> Rusty Russell wrote:
>>     
>>> +	/* Set up for reply. */
>>> +	vblk->sg[0].page = virt_to_page(&vbr->in_hdr);
>>> +	vblk->sg[0].offset = offset_in_page(&vbr->in_hdr);
>>> +	vblk->sg[0].length = sizeof(vbr->in_hdr);
>>> +	num = blk_rq_map_sg(q, vbr->req, vblk->sg+1);
>>> +	vbr->out_hdr.id = vblk->vdev->ops->add_inbuf(vblk->vdev, vblk->sg,
>>> +						     1+num, vbr);
>>> +	if (IS_ERR_VALUE(vbr->out_hdr.id))
>>> +		goto full;
>>> +
>>> +	vblk->sg[0].page = virt_to_page(&vbr->out_hdr);
>>> +	vblk->sg[0].offset = offset_in_page(&vbr->out_hdr);
>>> +	vblk->sg[0].length = sizeof(vbr->out_hdr);
>>> +
>>> +	vbr->out_id = vblk->vdev->ops->add_outbuf(vblk->vdev, vblk->sg, 1,
>>> +						  vbr);
>>>       
>> This strikes me as wasteful.  Why not set up a single descriptor which 
>> contains both placement and the data itself?
>>     
>
> We could actually do this for write, but not for read (where the length
> & sector need to be sent to other end, and the data comes from other
> end).
>
>   

So you map the first sg entry for output, and the rest for input.

Less pretty, but more efficient.

-- 
error compiling committee.c: too many arguments to function



* Re: [PATCH RFC 3/3] Virtio draft III: example block driver
       [not found]                             ` <46764BB5.6070704-atKUWr5tajBWk0Htik3J/w@public.gmane.org>
@ 2007-06-19  6:27                               ` Rusty Russell
       [not found]                                 ` <1182234466.19064.51.camel-bi+AKbBUZKY6gyzm1THtWbp2dZbC/Bob@public.gmane.org>
  0 siblings, 1 reply; 26+ messages in thread
From: Rusty Russell @ 2007-06-19  6:27 UTC (permalink / raw)
  To: Avi Kivity
  Cc: Stephen Rothwell, Xen Mailing List,
	jmk-zzFmDc4TPjtKvsKVC3L/VUEOCMrvLtNR@public.gmane.org, kvm-devel,
	virtualization, Christian Borntraeger, Latchesar Ionkov,
	Suzanne McIntosh, Jens Axboe, Martin Schwidefsky

On Mon, 2007-06-18 at 12:09 +0300, Avi Kivity wrote:
> Rusty Russell wrote:
> > On Sun, 2007-06-17 at 17:25 +0300, Avi Kivity wrote:
> >   
> >> Rusty Russell wrote:
> >>     
> >>> +	/* Set up for reply. */
> >>> +	vblk->sg[0].page = virt_to_page(&vbr->in_hdr);
> >>> +	vblk->sg[0].offset = offset_in_page(&vbr->in_hdr);
> >>> +	vblk->sg[0].length = sizeof(vbr->in_hdr);
> >>> +	num = blk_rq_map_sg(q, vbr->req, vblk->sg+1);
> >>> +	vbr->out_hdr.id = vblk->vdev->ops->add_inbuf(vblk->vdev, vblk->sg,
> >>> +						     1+num, vbr);
> >>> +	if (IS_ERR_VALUE(vbr->out_hdr.id))
> >>> +		goto full;
> >>> +
> >>> +	vblk->sg[0].page = virt_to_page(&vbr->out_hdr);
> >>> +	vblk->sg[0].offset = offset_in_page(&vbr->out_hdr);
> >>> +	vblk->sg[0].length = sizeof(vbr->out_hdr);
> >>> +
> >>> +	vbr->out_id = vblk->vdev->ops->add_outbuf(vblk->vdev, vblk->sg, 1,
> >>> +						  vbr);
> >>>       
> >> This strikes me as wasteful.  Why not set up a single descriptor which 
> >> contains both placement and the data itself?
> >>     
> >
> > We could actually do this for write, but not for read (where the length
> > & sector need to be sent to other end, and the data comes from other
> > end).
> 
> So you map the first sg entry for output, and the rest for input.
> 
> Less pretty, but more efficient.

At the moment there are two kinds of sgs: input and output.  You're
proposing instead that we have a single kind, and a flag on each segment
which indicates direction.  That may make block a little more efficient,
depending on the particular implementation of virtio.

Yet network really wants in and out separate.  I'm not even sure how
we'd implement a mixed-sg scheme efficiently for that case.  At the
moment it's very straightforward.

Of course, I'm carrying two implementations and two drivers, so it's
natural for me to try to resist changes 8)

Cheers,
Rusty.



* Re: [PATCH RFC 3/3] Virtio draft III: example block driver
       [not found]                                 ` <1182234466.19064.51.camel-bi+AKbBUZKY6gyzm1THtWbp2dZbC/Bob@public.gmane.org>
@ 2007-06-19  8:34                                   ` Avi Kivity
  0 siblings, 0 replies; 26+ messages in thread
From: Avi Kivity @ 2007-06-19  8:34 UTC (permalink / raw)
  To: Rusty Russell
  Cc: Stephen Rothwell, Xen Mailing List,
	jmk-zzFmDc4TPjtKvsKVC3L/VUEOCMrvLtNR@public.gmane.org, kvm-devel,
	virtualization, Christian Borntraeger, Latchesar Ionkov,
	Suzanne McIntosh, Jens Axboe, Martin Schwidefsky

Rusty Russell wrote:
> On Mon, 2007-06-18 at 12:09 +0300, Avi Kivity wrote:
>   
>> Rusty Russell wrote:
>>     
>>> On Sun, 2007-06-17 at 17:25 +0300, Avi Kivity wrote:
>>>   
>>>       
>>>> Rusty Russell wrote:
>>>>     
>>>>         
>>>>> +	/* Set up for reply. */
>>>>> +	vblk->sg[0].page = virt_to_page(&vbr->in_hdr);
>>>>> +	vblk->sg[0].offset = offset_in_page(&vbr->in_hdr);
>>>>> +	vblk->sg[0].length = sizeof(vbr->in_hdr);
>>>>> +	num = blk_rq_map_sg(q, vbr->req, vblk->sg+1);
>>>>> +	vbr->out_hdr.id = vblk->vdev->ops->add_inbuf(vblk->vdev, vblk->sg,
>>>>> +						     1+num, vbr);
>>>>> +	if (IS_ERR_VALUE(vbr->out_hdr.id))
>>>>> +		goto full;
>>>>> +
>>>>> +	vblk->sg[0].page = virt_to_page(&vbr->out_hdr);
>>>>> +	vblk->sg[0].offset = offset_in_page(&vbr->out_hdr);
>>>>> +	vblk->sg[0].length = sizeof(vbr->out_hdr);
>>>>> +
>>>>> +	vbr->out_id = vblk->vdev->ops->add_outbuf(vblk->vdev, vblk->sg, 1,
>>>>> +						  vbr);
>>>>>       
>>>>>           
>>>> This strikes me as wasteful.  Why not set up a single descriptor which 
>>>> contains both placement and the data itself?
>>>>     
>>>>         
>>> We could actually do this for write, but not for read (where the length
>>> & sector need to be sent to other end, and the data comes from other
>>> end).
>>>       
>> So you map the first sg entry for output, and the rest for input.
>>
>> Less pretty, but more efficient.
>>     
>
> At the moment there are two kinds of sgs: input and output.  You're
> proposing instead that we have a single kind, and a flag on each segment
> which indicates direction.  That may make block a little more efficient,
> depending on the particular implementation of virtio.
>
> Yet network really wants in and out separate.  I'm not even sure how
> we'd implement a mixed-sg scheme efficiently for that case.  At the
> moment it's very straightforward.
>
>   

I think the differences are deeper than that.  Network wants two queues, 
disk wants one queue.  The directionality of the dma transfer is 
orthogonal to that.

For a disk, the initiator is always the guest; you issue a request and 
expect a completion within a few milliseconds.  Read and write requests 
are similar in nature; you want them in a single queue because you want 
the elevator to run on both read and write requests.  When idling, the 
queue is empty.

For a network interface, the guest initiates transmits, but the host 
initiates receives.  Yes, the guest preloads the queue with rx 
descriptors, but it can't expect a response in any given time, hence the 
need for two queues.  When idling, the tx queue is empty, the rx queue 
is full.

So I'd suggest having a struct virtio_queue to represent a queue (so a 
network interface would have two of them), and the dma direction to be 
specified separately:

+	unsigned long (*add_buf)(struct virtio_queue *q,
+				 const struct scatterlist out_sg[], int out_nr,
+				 const struct scatterlist in_sg[], int in_nr,
+				 void *data);


I think this fits better with the Xen model (which has one ring for 
block, two each for network and console, if memory serves).

> Of course, I'm carrying two implementations and two drivers, so it's
> natural for me to try to resist changes 8)

Life is rough on the frontier.  But all the glory will be yours!

-- 
error compiling committee.c: too many arguments to function



* Re: [PATCH RFC 2/3] Virtio draft III: example net driver
       [not found]             ` <1181999825.6237.260.camel-bi+AKbBUZKY6gyzm1THtWbp2dZbC/Bob@public.gmane.org>
  2007-06-16 13:18               ` [PATCH RFC 3/3] Virtio draft III: example block driver Rusty Russell
@ 2007-06-25 15:26               ` Brian King
       [not found]                 ` <467FDEAD.4030204-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8@public.gmane.org>
  2007-06-28 11:20                 ` Rusty Russell
  1 sibling, 2 replies; 26+ messages in thread
From: Brian King @ 2007-06-25 15:26 UTC (permalink / raw)
  To: Rusty Russell
  Cc: Stephen Rothwell, Xen Mailing List,
	jmk-zzFmDc4TPjtKvsKVC3L/VUEOCMrvLtNR@public.gmane.org, Herbert Xu,
	kvm-devel, virtualization, Christian Borntraeger,
	Latchesar Ionkov, Suzanne McIntosh, Martin Schwidefsky

I've started to look at these patches to figure out how easily I
could convert the ibmveth driver to use this new infrastructure
and have run into a few issues. So far the two biggest issues I've
encountered are:

1. The add_inbuf interface passes an sg list. This causes problems for
   ibmveth since its rx buffers cannot be scatterlists.
2. The user of this API does not have access to the sk_buff. This causes
   issues for ibmveth since it needs to reserve the first 8 bytes of the
   rx buffer for use by the firmware. It currently uses skb_reserve to do this.

Would it be simpler to just pass an sk_buff rather than the scatterlist
on these interfaces for virtio_net users?

-Brian

Rusty Russell wrote:
> Net driver using virtio
> 
> TODO:
> 	1) GSO.
> 	2) Checksum options.
> 	3) Big packets.
> 
> Signed-off-by: Rusty Russell <rusty-8n+1lVoiYb80n/F98K4Iww@public.gmane.org>
> ---
>  drivers/net/Makefile       |    2 
>  drivers/net/virtio_net.c   |  277 ++++++++++++++++++++++++++++++++++++++++++++
>  include/linux/virtio_net.h |   12 +
>  3 files changed, 290 insertions(+), 1 deletion(-)
> 
> ===================================================================
> --- a/drivers/net/Makefile
> +++ b/drivers/net/Makefile
> @@ -37,7 +37,7 @@ obj-$(CONFIG_CASSINI) += cassini.o
> 
>  obj-$(CONFIG_MACE) += mace.o
>  obj-$(CONFIG_BMAC) += bmac.o
> -
> +obj-y += virtio_net.o
>  obj-$(CONFIG_DGRS) += dgrs.o
>  obj-$(CONFIG_VORTEX) += 3c59x.o
>  obj-$(CONFIG_TYPHOON) += typhoon.o
> ===================================================================
> --- /dev/null
> +++ b/drivers/net/virtio_net.c
> @@ -0,0 +1,277 @@
> +/* A simple network driver using virtio.
> + *
> + * Copyright 2007 Rusty Russell <rusty-8n+1lVoiYb80n/F98K4Iww@public.gmane.org> IBM Corporation
> + *
> + * This program is free software; you can redistribute it and/or modify
> + * it under the terms of the GNU General Public License as published by
> + * the Free Software Foundation; either version 2 of the License, or
> + * (at your option) any later version.
> + *
> + * This program is distributed in the hope that it will be useful,
> + * but WITHOUT ANY WARRANTY; without even the implied warranty of
> + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
> + * GNU General Public License for more details.
> + *
> + * You should have received a copy of the GNU General Public License
> + * along with this program; if not, write to the Free Software
> + * Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA  02111-1307  USA
> + */
> +//#define DEBUG
> +#include <linux/netdevice.h>
> +#include <linux/etherdevice.h>
> +#include <linux/module.h>
> +#include <linux/virtio.h>
> +#include <linux/scatterlist.h>
> +
> +/* FIXME: Make dynamic */
> +#define MAX_PACKET_LEN (ETH_HLEN+ETH_DATA_LEN)
> +
> +#define SKB_ID(skb) (*(unsigned long *)(skb)->cb)
> +
> +struct virtnet_info
> +{
> +	struct virtio_device *vdev;
> +	struct net_device *ndev;
> +
> +	/* Number of input buffers, and max we've ever had. */
> +	unsigned int num, max;
> +
> +	/* Receive & send queues. */
> +	struct sk_buff_head recv;
> +	struct sk_buff_head send;
> +};
> +
> +static bool skb_xmit_done(struct virtio_device *vdev)
> +{
> +	struct virtnet_info *vi = vdev->priv;
> +
> +	/* In case we were waiting for output buffers. */
> +	netif_wake_queue(vi->ndev);
> +	return true;
> +}
> +
> +static void receive_skb(struct net_device *dev, struct sk_buff *skb,
> +			unsigned len)
> +{
> +	if (unlikely(len < ETH_HLEN)) {
> +		pr_debug("%s: short packet %i\n", dev->name, len);
> +		dev->stats.rx_length_errors++;
> +		dev_kfree_skb(skb);
> +		return;
> +	}
> +	BUG_ON(len > MAX_PACKET_LEN);
> +
> +	skb_trim(skb, len);
> +	skb->protocol = eth_type_trans(skb, dev);
> +	pr_debug("Receiving skb proto 0x%04x len %i type %i\n",
> +		 ntohs(skb->protocol), skb->len, skb->pkt_type);
> +	dev->stats.rx_bytes += skb->len;
> +	dev->stats.rx_packets++;
> +	netif_rx(skb);
> +}
> +
> +static void try_fill_recv(struct virtnet_info *vi)
> +{
> +	struct sk_buff *skb;
> +	struct scatterlist sg[MAX_SKB_FRAGS];
> +	unsigned long sgnum, id;
> +
> +	for (;;) {
> +		skb = netdev_alloc_skb(vi->ndev, MAX_PACKET_LEN);
> +		if (unlikely(!skb))
> +			break;
> +
> +		skb_put(skb, MAX_PACKET_LEN);
> +		sgnum = skb_to_sgvec(skb, sg, 0, skb->len);
> +		skb_queue_head(&vi->recv, skb);
> +
> +		id = vi->vdev->ops->add_inbuf(vi->vdev, sg, sgnum, skb);
> +		if (IS_ERR_VALUE(id)) {
> +			skb_unlink(skb, &vi->recv);
> +			kfree_skb(skb);
> +			break;
> +		}
> +		/* Remember id so virtnet_close() can detach_inbuf() this skb. */
> +		SKB_ID(skb) = id;
> +	}
> +	if (unlikely(vi->num > vi->max))
> +		vi->max = vi->num;
> +	vi->vdev->ops->sync(vi->vdev, VIRTIO_IN);
> +}
> +
> +static bool skb_recv_done(struct virtio_device *vdev)
> +{
> +	struct virtnet_info *vi = vdev->priv;
> +
> +	netif_rx_schedule(vi->ndev);
> +	/* Suppress further interrupts. */
> +	return false;
> +}
> +
> +static int virtnet_poll(struct net_device *dev, int *budget)
> +{
> +	struct virtnet_info *vi = netdev_priv(dev);
> +	struct sk_buff *skb = NULL;
> +	unsigned int len, received = 0;
> +
> +again:
> +	while (received < dev->quota &&
> +	       (skb = vi->vdev->ops->get_inbuf(vi->vdev, &len)) != NULL) {
> +		__skb_unlink(skb, &vi->recv);
> +		receive_skb(vi->ndev, skb, len);
> +		vi->num--;
> +		received++;
> +	}
> +
> +	dev->quota -= received;
> +	*budget -= received;
> +
> +	/* FIXME: If we oom and completely run out of inbufs, we need
> +	 * to start a timer trying to fill more. */
> +	if (vi->num < vi->max / 2)
> +		try_fill_recv(vi);
> +
> +	/* Still more work to do? */
> +	if (skb)
> +		return 1; /* not done */
> +
> +	netif_rx_complete(dev);
> +	if (unlikely(!vi->vdev->ops->restart_in(vi->vdev))
> +	    && netif_rx_reschedule(dev, received))
> +		goto again;
> +
> +	return 0;
> +}
> +
> +static void free_old_xmit_skbs(struct virtnet_info *vi)
> +{
> +	struct sk_buff *skb;
> +	unsigned int len;
> +
> +	while ((skb = vi->vdev->ops->get_outbuf(vi->vdev, &len)) != NULL) {
> +		pr_debug("Sent skb %p\n", skb);
> +		__skb_unlink(skb, &vi->send);
> +		vi->ndev->stats.tx_bytes += len;
> +		vi->ndev->stats.tx_packets++;
> +		kfree_skb(skb);
> +	}
> +}
> +
> +static int start_xmit(struct sk_buff *skb, struct net_device *dev)
> +{
> +	struct virtnet_info *vi = netdev_priv(dev);
> +	unsigned long num, id;
> +	struct scatterlist sg[MAX_SKB_FRAGS];
> +	const unsigned char *dest = ((struct ethhdr *)skb->data)->h_dest;
> +
> +	pr_debug("%s: xmit %p %02x:%02x:%02x:%02x:%02x:%02x\n",
> +		 dev->name, skb,
> +		 dest[0], dest[1], dest[2], dest[3], dest[4], dest[5]);
> +
> +	free_old_xmit_skbs(vi);
> +
> +	num = skb_to_sgvec(skb, sg, 0, skb->len);
> +	__skb_queue_head(&vi->send, skb);
> +	id = vi->vdev->ops->add_outbuf(vi->vdev, sg, num, skb);
> +	if (IS_ERR_VALUE(id)) {
> +		pr_debug("%s: virtio not prepared to send\n", dev->name);
> +		skb_unlink(skb, &vi->send);
> +		netif_stop_queue(dev);
> +		return NETDEV_TX_BUSY;
> +	}
> +	SKB_ID(skb) = id;
> +	vi->vdev->ops->sync(vi->vdev, VIRTIO_OUT);
> +
> +	return 0;
> +}
> +
> +static int virtnet_open(struct net_device *dev)
> +{
> +	struct virtnet_info *vi = netdev_priv(dev);
> +
> +	try_fill_recv(vi);
> +
> +	/* If we didn't even get one input buffer, we're useless. */
> +	if (vi->num == 0)
> +		return -ENOMEM;
> +
> +	return 0;
> +}
> +
> +static int virtnet_close(struct net_device *dev)
> +{
> +	struct virtnet_info *vi = netdev_priv(dev);
> +	struct sk_buff *skb;
> +
> +	/* networking core has neutered skb_xmit_done/skb_recv_done, so don't
> +	 * worry about races vs. get_buf(). */
> +	while ((skb = __skb_dequeue(&vi->recv)) != NULL) {
> +		vi->vdev->ops->detach_inbuf(vi->vdev, SKB_ID(skb));
> +		kfree_skb(skb);
> +		vi->num--;
> +	}
> +	while ((skb = __skb_dequeue(&vi->send)) != NULL) {
> +		vi->vdev->ops->detach_outbuf(vi->vdev, SKB_ID(skb));
> +		kfree_skb(skb);
> +	}
> +	BUG_ON(vi->num != 0);
> +	return 0;
> +}
> +
> +static struct virtio_driver_ops virtnet_ops = {
> +	.in = skb_recv_done,
> +	.out = skb_xmit_done,
> +};
> +
> +struct net_device *virtnet_probe(struct virtio_device *vdev,
> +				 const u8 mac[ETH_ALEN])
> +{
> +	int err;
> +	struct net_device *dev;
> +	struct virtnet_info *vi;
> +
> +	dev = alloc_etherdev(sizeof(struct virtnet_info));
> +	if (!dev)
> +		return ERR_PTR(-ENOMEM);
> +
> +	SET_MODULE_OWNER(dev);
> +
> +	ether_setup(dev);
> +	memcpy(dev->dev_addr, mac, ETH_ALEN);
> +	dev->open = virtnet_open;
> +	dev->stop = virtnet_close;
> +	dev->poll = virtnet_poll;
> +	dev->hard_start_xmit = start_xmit;
> +	dev->weight = 16;
> +	SET_NETDEV_DEV(dev, vdev->dev);
> +
> +	vi = netdev_priv(dev);
> +	vi->vdev = vdev;
> +	vi->ndev = dev;
> +	vdev->priv = vi;
> +	vdev->driver_ops = &virtnet_ops;
> +	skb_queue_head_init(&vi->recv);
> +	skb_queue_head_init(&vi->send);
> +
> +	err = register_netdev(dev);
> +	if (err) {
> +		pr_debug("virtio_net: registering device failed\n");
> +		goto free;
> +	}
> +	pr_debug("virtnet: registered device %s\n", dev->name);
> +	return dev;
> +
> +free:
> +	free_netdev(dev);
> +	return ERR_PTR(err);
> +}
> +EXPORT_SYMBOL_GPL(virtnet_probe);
> +
> +void virtnet_remove(struct net_device *dev)
> +{
> +	unregister_netdev(dev);
> +	free_netdev(dev);
> +}
> +EXPORT_SYMBOL_GPL(virtnet_remove);
> +
> +MODULE_DESCRIPTION("Virtio network driver");
> +MODULE_LICENSE("GPL");
> ===================================================================
> --- /dev/null
> +++ b/include/linux/virtio_net.h
> @@ -0,0 +1,12 @@
> +#ifndef _LINUX_VIRTIO_NET_H
> +#define _LINUX_VIRTIO_NET_H
> +#include <linux/types.h>
> +#include <linux/etherdevice.h>
> +struct net_device;
> +struct virtio_device;
> +
> +struct net_device *virtnet_probe(struct virtio_device *vdev,
> +				 const u8 mac[ETH_ALEN]);
> +void virtnet_remove(struct net_device *dev);
> +
> +#endif /* _LINUX_VIRTIO_NET_H */
> 
> 
> _______________________________________________
> Virtualization mailing list
> Virtualization-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA@public.gmane.org
> https://lists.linux-foundation.org/mailman/listinfo/virtualization


-- 
Brian King
eServer Storage I/O
IBM Linux Technology Center


-------------------------------------------------------------------------
This SF.net email is sponsored by DB2 Express
Download DB2 Express C - the FREE version of DB2 express and take
control of your XML. No limits. Just data. Click to get it now.
http://sourceforge.net/powerbar/db2/

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH RFC 2/3] Virtio draft III: example net driver
       [not found]                 ` <467FDEAD.4030204-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8@public.gmane.org>
@ 2007-06-25 19:33                   ` Christoph Hellwig
       [not found]                     ` <20070625193304.GB25736-wEGCiKHe2LqWVfeAwA7xHQ@public.gmane.org>
  0 siblings, 1 reply; 26+ messages in thread
From: Christoph Hellwig @ 2007-06-25 19:33 UTC (permalink / raw)
  To: Brian King
  Cc: Stephen Rothwell, Xen Mailing List,
	jmk-zzFmDc4TPjtKvsKVC3L/VUEOCMrvLtNR@public.gmane.org, Herbert Xu,
	kvm-devel, virtualization, Christian Borntraeger,
	Latchesar Ionkov, Suzanne McIntosh, Martin Schwidefsky

On Mon, Jun 25, 2007 at 10:26:37AM -0500, Brian King wrote:
> 1. The add_inbuf interface passes an sg list. This causes problems for
>    ibmveth since its rx buffers cannot be scatterlists.
> 2. The user of this API does not have access to the sk_buf. This causes
>    issues for ibmveth since it needs to reserve the first 8 bytes of the
>    rx buffer for use by the firmware. It currently uses skb_reserve to do this.
> 
> Would it be simpler to just pass an sk_buf rather than the scatterlist
> on these interfaces for virtio_net users?

It probably should pass the sk_buff.  Then again, the network layer really
should be using struct scatterlist for skb fragments - on a lot of IOMMUs
a dma_map_sg is a lot more efficient than a series of dma_map_page calls,
not to mention the benefit of a common data structure for things like
network attached storage protocols that have to talk to both worlds.

And yes, this is getting a little out of scope for the virtualization
discussion.



^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH RFC 2/3] Virtio draft III: example net driver
       [not found]                     ` <20070625193304.GB25736-wEGCiKHe2LqWVfeAwA7xHQ@public.gmane.org>
@ 2007-06-25 21:54                       ` Avi Kivity
       [not found]                         ` <46803999.4040500-atKUWr5tajBWk0Htik3J/w@public.gmane.org>
  0 siblings, 1 reply; 26+ messages in thread
From: Avi Kivity @ 2007-06-25 21:54 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Stephen Rothwell, Xen Mailing List,
	jmk-zzFmDc4TPjtKvsKVC3L/VUEOCMrvLtNR@public.gmane.org, Herbert Xu,
	kvm-devel, virtualization, Christian Borntraeger,
	Latchesar Ionkov, Suzanne McIntosh, Brian King,
	Martin Schwidefsky

Christoph Hellwig wrote:
> On Mon, Jun 25, 2007 at 10:26:37AM -0500, Brian King wrote:
>   
>> 1. The add_inbuf interface passes an sg list. This causes problems for
>>    ibmveth since its rx buffers cannot be scatterlists.
>>     

I'm probably missing something.  What is wrong with a scatterlist of 
length 1?

>> 2. The user of this API does not have access to the sk_buf. This causes
>>    issues for ibmveth since it needs to reserve the first 8 bytes of the
>>    rx buffer for use by the firmware. It currently uses skb_reserve to do this.
>>
>> Would it be simpler to just pass an sk_buf rather than the scatterlist
>> on these interfaces for virtio_net users?
>>     
>
> It probably should pass the sk_buff.

But virtio is a generic layer that is useful for more than just networking.




^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH RFC 2/3] Virtio draft III: example net driver
       [not found]                         ` <46803999.4040500-atKUWr5tajBWk0Htik3J/w@public.gmane.org>
@ 2007-06-25 22:13                           ` Brian King
       [not found]                             ` <46803E0E.7080103-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8@public.gmane.org>
  0 siblings, 1 reply; 26+ messages in thread
From: Brian King @ 2007-06-25 22:13 UTC (permalink / raw)
  To: Avi Kivity
  Cc: Stephen Rothwell, Xen Mailing List,
	jmk-zzFmDc4TPjtKvsKVC3L/VUEOCMrvLtNR@public.gmane.org, Herbert Xu,
	kvm-devel, virtualization, Christoph Hellwig,
	Christian Borntraeger, Latchesar Ionkov, Suzanne McIntosh,
	Martin Schwidefsky

Avi Kivity wrote:
> Christoph Hellwig wrote:
>> On Mon, Jun 25, 2007 at 10:26:37AM -0500, Brian King wrote:
>>   
>>> 1. The add_inbuf interface passes an sg list. This causes problems for
>>>    ibmveth since its rx buffers cannot be scatterlists.
>>>     
> 
> I'm probably missing something.  What is wrong with a scatterlist of 
> length 1?

Absolutely nothing, as long as the interface guarantees I don't see a larger
scatterlist. With the current implementation it looks to me like this is
always true, but if we were to stick to using a scatterlist, it would
make sense to allow the user to specify a max sg length supported. This
value could be different for in vs. out buffers.

>>> 2. The user of this API does not have access to the sk_buf. This causes
>>>    issues for ibmveth since it needs to reserve the first 8 bytes of the
>>>    rx buffer for use by the firmware. It currently uses skb_reserve to do this.
>>>
>>> Would it be simpler to just pass an sk_buf rather than the scatterlist
>>> on these interfaces for virtio_net users?
>>>     
>> It probably should pass the sk_buff.
> 
> But virtio is a generic layer that is useful for more than just networking.

Agreed. However, this one is a huge issue for ibmveth. ibmveth must write a tag
to the first 8 bytes of each rx buffer before passing the buffer to the pSeries
hypervisor. The hypervisor then writes this tag to the rx queue when the buffer
is used. These first 8 bytes of the rx buffer are not overlaid with data. Additionally,
the hypervisor may offset further than 8 bytes into the buffer for the start
of the payload if it so desires. The offset is also written to the rx queue.

Another option which would solve this problem would be if the get_inbuf
interface also returned an offset into the original data buffer where the
payload starts.

-Brian

-- 
Brian King
Linux on Power Virtualization
IBM Linux Technology Center


^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH RFC 2/3] Virtio draft III: example net driver
       [not found]                             ` <46803E0E.7080103-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8@public.gmane.org>
@ 2007-06-25 22:20                               ` Avi Kivity
  0 siblings, 0 replies; 26+ messages in thread
From: Avi Kivity @ 2007-06-25 22:20 UTC (permalink / raw)
  To: brking-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8
  Cc: Stephen Rothwell, Xen Mailing List,
	jmk-zzFmDc4TPjtKvsKVC3L/VUEOCMrvLtNR@public.gmane.org, Herbert Xu,
	kvm-devel, virtualization, Christoph Hellwig,
	Christian Borntraeger, Latchesar Ionkov, Suzanne McIntosh,
	Martin Schwidefsky

Brian King wrote:

>> I'm probably missing something.  What is wrong with a scatterlist of 
>> length 1?
>>     
>
> Absolutely nothing, as long as the interface guarantees I don't see a larger
> scatterlist. With the current implementation it looks to me like this is
> always true, but if we were to stick to using a scatterlist, it would
> make sense to allow the user to specify a max sg length supported. This
> value could be different for in vs. out buffers.
>
>   

Okay; so virtio net should have configurable scatterlist max sizes.

>>>> 2. The user of this API does not have access to the sk_buf. This causes
>>>>    issues for ibmveth since it needs to reserve the first 8 bytes of the
>>>>    rx buffer for use by the firmware. It currently uses skb_reserve to do this.
>>>>
>>>> Would it be simpler to just pass an sk_buf rather than the scatterlist
>>>> on these interfaces for virtio_net users?
>>>>     
>>>>         
>>> It probably should pass the sk_buff.
>>>       
>> But virtio is a generic layer that is useful for more than just networking.
>>     
>
> Agreed. However, this one is a huge issue for ibmveth. ibmveth must write a tag
> to the first 8 bytes of each rx buffer before passing the buffer to the pSeries
> hypervisor. The hypervisor then writes this tag to the rx queue when the buffer
> is used. These first 8 bytes of the rx buffer are not overlaid with data. Additionally,
> the hypervisor may offset further than 8 bytes into the buffer for the start
> of the payload if it so desires. The offset is also written to the rx queue.
>
> Another option which would solve this problem would be if the get_inbuf
> interface also returned an offset into the original data buffer where the
> payload starts.
>
>   

I guess.  Rusty was concerned that virtio will explode to the sum of all 
options on all hypervisors, and we now see it begin.



^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH RFC 2/3] Virtio draft III: example net driver
  2007-06-25 15:26               ` [PATCH RFC 2/3] Virtio draft III: example net driver Brian King
       [not found]                 ` <467FDEAD.4030204-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8@public.gmane.org>
@ 2007-06-28 11:20                 ` Rusty Russell
       [not found]                   ` <1183029641.12401.36.camel-bi+AKbBUZKY6gyzm1THtWbp2dZbC/Bob@public.gmane.org>
  1 sibling, 1 reply; 26+ messages in thread
From: Rusty Russell @ 2007-06-28 11:20 UTC (permalink / raw)
  To: brking
  Cc: Stephen Rothwell, Xen Mailing List, jmk@plan9.bell-labs.com,
	Herbert Xu, kvm-devel, virtualization, Christian Borntraeger,
	Latchesar Ionkov, Suzanne McIntosh, Martin Schwidefsky

On Mon, 2007-06-25 at 10:26 -0500, Brian King wrote: 
> I've started to look at these patches to figure out how easily I
> could convert the ibmveth driver to use this new infrastructure
> and have run into a few issues. So far the two biggest issues I've
> encountered are:

Hi Brian,

	Thanks for looking at this!  Sorry for the slow reply, I've been
travelling.

> 1. The add_inbuf interface passes an sg list. This causes problems for
>    ibmveth since its rx buffers cannot be scatterlists.
> 2. The user of this API does not have access to the sk_buf. This causes
>    issues for ibmveth since it needs to reserve the first 8 bytes of the
>    rx buffer for use by the firmware. It currently uses skb_reserve to do this.
> 
> Would it be simpler to just pass an sk_buf rather than the scatterlist
> on these interfaces for virtio_net users?

It would be easier, but not suitable for block and other devices which
use the same interface.  Fortunately, I think there are solutions to
your issues.

1. 1500 byte packets will always be 1 sg entry long (ie. you can BUG_ON()
if you ever see more than one).  The real issue is MTU > PAGE_SIZE, where
I was planning to allocate discontiguous pages.  I can't quite see what
your maximum supported MTU is, though?

2. The header problem is difficult (I can't resist pointing out that if
you had sg receive capability, it would be trivial 8).  The two
possibilities are to have get_buf take an "unsigned long *off" as well,
or to have ibmveth do a memmove.

memmove sounds horrible at first glance, but since the hypervisor has
just copied the packet I'm not sure we'll notice in practice.
Benchmarking should tell...

BTW, after Avi's comments I have a new virtio draft, but still debugging
my implementation.  It doesn't affect this discussion, but would involve
churn for an actual implementation if you've gotten that far...

Thanks,
Rusty.

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH RFC 2/3] Virtio draft III: example net driver
       [not found]                   ` <1183029641.12401.36.camel-bi+AKbBUZKY6gyzm1THtWbp2dZbC/Bob@public.gmane.org>
@ 2007-06-28 15:55                     ` Brian King
  0 siblings, 0 replies; 26+ messages in thread
From: Brian King @ 2007-06-28 15:55 UTC (permalink / raw)
  To: Rusty Russell
  Cc: Stephen Rothwell, Xen Mailing List,
	jmk-zzFmDc4TPjtKvsKVC3L/VUEOCMrvLtNR@public.gmane.org, Herbert Xu,
	kvm-devel, virtualization, Christian Borntraeger,
	Latchesar Ionkov, Suzanne McIntosh, Martin Schwidefsky

Rusty Russell wrote:
> 1. 1500 byte packets will always be 1 sg long (ie. you should BUG_ON()
> if that ever happens).  The issue is MTU > PAGE_SIZE, where I was
> planning to allocate discontig pages.  I can't quite see what your
> maximum supported MTU is, though?

The current maximum supported MTU for ibmveth is 64k. Additionally,
the firmware interface for ibmveth allows for multiple rx "pools" of
different sizes to be allocated and given to the hypervisor. Depending
on the incoming packet size, the smallest buffer is then chosen. Just
something to keep in mind as you look at adding large frame support. I'm
not sure what requirements other users would have...

> 2. The header problem is difficult (I can't resist pointing out that if
> you had sg receive capability, it would be trivial 8).  The two
> possibilities are to have get_buf take an "unsigned long *off" as well,
> or to have ibmveth do a memmove.
> 
> memmove sounds horrible at first glance, but since the hypervisor has
> just copied the packet I'm not sure we'll notice in practice.
> Benchmarking should tell...

memmove certainly does not sound optimal. I would definitely prefer returning
an offset on get_buf, but when I get something up and running I can
run some benchmarks.

> BTW, after Avi's comments I have a new virtio draft, but still debugging
> my implementation.  It doesn't affect this discussion, but would involve
> churn for an actual implementation if you've gotten that far...

I've started working on some code, but only to get a better handle on the API
and see what issues ibmveth might run into. I was expecting some code churn
at this stage anyway...

-Brian


-- 
Brian King
Linux on Power Virtualization
IBM Linux Technology Center


^ permalink raw reply	[flat|nested] 26+ messages in thread

end of thread, other threads:[~2007-06-28 15:55 UTC | newest]

Thread overview: 26+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2007-06-07 12:02 [PATCH RFC 0/3] Virtio draft II Rusty Russell
2007-06-07 12:04 ` [PATCH RFC 1/3] Virtio draft II: virtio.h Rusty Russell
     [not found]   ` <1181217867.14054.195.camel-bi+AKbBUZKY6gyzm1THtWbp2dZbC/Bob@public.gmane.org>
2007-06-07 12:05     ` [PATCH RFC 2/3] Virtio draft II: example block driver Rusty Russell
     [not found]       ` <1181217920.14054.196.camel-bi+AKbBUZKY6gyzm1THtWbp2dZbC/Bob@public.gmane.org>
2007-06-07 12:07         ` [PATCH RFC 3/3] Virtio draft II: example net driver Rusty Russell
2007-06-07 12:19         ` [PATCH RFC 2/3] Virtio draft II: example block driver Avi Kivity
     [not found]           ` <4667F7C0.3070604-atKUWr5tajBWk0Htik3J/w@public.gmane.org>
2007-06-08  1:11             ` Rusty Russell
     [not found] ` <1181217762.14054.192.camel-bi+AKbBUZKY6gyzm1THtWbp2dZbC/Bob@public.gmane.org>
2007-06-16 13:12   ` [PATCH RFC 0/3] Virtio draft III Rusty Russell
     [not found]     ` <1181999552.6237.255.camel-bi+AKbBUZKY6gyzm1THtWbp2dZbC/Bob@public.gmane.org>
2007-06-16 13:14       ` [PATCH RFC 1/3] Virtio draft III: virtio.h Rusty Russell
     [not found]         ` <1181999669.6237.257.camel-bi+AKbBUZKY6gyzm1THtWbp2dZbC/Bob@public.gmane.org>
2007-06-16 13:17           ` [PATCH RFC 2/3] Virtio draft III: example net driver Rusty Russell
     [not found]             ` <1181999825.6237.260.camel-bi+AKbBUZKY6gyzm1THtWbp2dZbC/Bob@public.gmane.org>
2007-06-16 13:18               ` [PATCH RFC 3/3] Virtio draft III: example block driver Rusty Russell
     [not found]                 ` <1181999920.6237.263.camel-bi+AKbBUZKY6gyzm1THtWbp2dZbC/Bob@public.gmane.org>
2007-06-16 13:28                   ` [PATCH] Lguest implemention of virtio draft III Rusty Russell
     [not found]                     ` <1182000514.6237.273.camel-bi+AKbBUZKY6gyzm1THtWbp2dZbC/Bob@public.gmane.org>
2007-06-16 13:50                       ` Arnd Bergmann
2007-06-17 14:25                   ` [PATCH RFC 3/3] Virtio draft III: example block driver Avi Kivity
     [not found]                     ` <46754451.2010305-atKUWr5tajBWk0Htik3J/w@public.gmane.org>
2007-06-18  8:08                       ` Rusty Russell
     [not found]                         ` <1182154095.19064.24.camel-bi+AKbBUZKY6gyzm1THtWbp2dZbC/Bob@public.gmane.org>
2007-06-18  9:09                           ` Avi Kivity
     [not found]                             ` <46764BB5.6070704-atKUWr5tajBWk0Htik3J/w@public.gmane.org>
2007-06-19  6:27                               ` Rusty Russell
     [not found]                                 ` <1182234466.19064.51.camel-bi+AKbBUZKY6gyzm1THtWbp2dZbC/Bob@public.gmane.org>
2007-06-19  8:34                                   ` Avi Kivity
2007-06-25 15:26               ` [PATCH RFC 2/3] Virtio draft III: example net driver Brian King
     [not found]                 ` <467FDEAD.4030204-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8@public.gmane.org>
2007-06-25 19:33                   ` Christoph Hellwig
     [not found]                     ` <20070625193304.GB25736-wEGCiKHe2LqWVfeAwA7xHQ@public.gmane.org>
2007-06-25 21:54                       ` Avi Kivity
     [not found]                         ` <46803999.4040500-atKUWr5tajBWk0Htik3J/w@public.gmane.org>
2007-06-25 22:13                           ` Brian King
     [not found]                             ` <46803E0E.7080103-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8@public.gmane.org>
2007-06-25 22:20                               ` Avi Kivity
2007-06-28 11:20                 ` Rusty Russell
     [not found]                   ` <1183029641.12401.36.camel-bi+AKbBUZKY6gyzm1THtWbp2dZbC/Bob@public.gmane.org>
2007-06-28 15:55                     ` Brian King
2007-06-17 14:14       ` [PATCH RFC 0/3] Virtio draft III Avi Kivity
     [not found]         ` <467541DF.5060907-atKUWr5tajBWk0Htik3J/w@public.gmane.org>
2007-06-18  7:48           ` [Xen-devel] " Rusty Russell

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox