virtualization.lists.linux-foundation.org archive mirror
 help / color / mirror / Atom feed
* [PATCH 1/6] virtio interace
       [not found] <1190289808.7262.223.camel@localhost.localdomain>
@ 2007-09-20 12:09 ` Rusty Russell
       [not found] ` <1190290140.7262.228.camel@localhost.localdomain>
                   ` (2 subsequent siblings)
  3 siblings, 0 replies; 41+ messages in thread
From: Rusty Russell @ 2007-09-20 12:09 UTC (permalink / raw)
  To: kvm-devel; +Cc: lguest, Dor Laor, virtualization

(Changes: 
 - renamed sync to kick as Dor suggested
 - added new_vq and free_vq hooks to create virtqueues
 - define a simple virtio driver, which uses PCI ids
 - provide register/unregister_virtio_driver hooks)

This attempts to implement a "virtual I/O" layer which should allow
common drivers to be efficiently used across most virtual I/O
mechanisms.  It will no-doubt need further enhancement.

The virtio drivers add and get I/O buffers; as the buffers are consumed
the driver "interrupt" callbacks are invoked.

It also provides driver register and unregister hooks, which are
simply overridden at run time (eg. for a guest kernel which supports
KVM paravirt and lguest).

Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
---
 drivers/virtio/Kconfig  |    3 +
 drivers/virtio/Makefile |    1 
 drivers/virtio/virtio.c |   20 ++++++++
 include/linux/virtio.h  |  114 +++++++++++++++++++++++++++++++++++++++++++++++
 4 files changed, 138 insertions(+)

===================================================================
--- /dev/null
+++ b/drivers/virtio/Kconfig
@@ -0,0 +1,3 @@
+# Virtio always gets selected by whoever wants it.
+config VIRTIO
+	bool
===================================================================
--- /dev/null
+++ b/drivers/virtio/Makefile
@@ -0,0 +1,1 @@
+obj-$(CONFIG_VIRTIO) += virtio.o
===================================================================
--- /dev/null
+++ b/drivers/virtio/virtio.c
@@ -0,0 +1,20 @@
+#include <linux/virtio.h>
+
+struct virtio_backend_ops virtio_backend_ops;
+EXPORT_SYMBOL_GPL(virtio_backend_ops);
+
+/* register and unregister simply punt through to the specific backend. */
+int register_virtio_driver(struct virtio_driver *drv)
+{
+	if (!virtio_backend_ops.register_driver)
+		return -ENOSYS;
+
+	return virtio_backend_ops.register_driver(drv);
+}
+EXPORT_SYMBOL_GPL(register_virtio_driver);
+
+void unregister_virtio_driver(struct virtio_driver *drv)
+{
+	virtio_backend_ops.unregister_driver(drv);
+}
+EXPORT_SYMBOL_GPL(unregister_virtio_driver);
===================================================================
--- /dev/null
+++ b/include/linux/virtio.h
@@ -0,0 +1,114 @@
+#ifndef _LINUX_VIRTIO_H
+#define _LINUX_VIRTIO_H
+/* Everything a virtio driver needs to work with any particular virtio
+ * implementation. */
+#include <linux/types.h>
+#include <linux/scatterlist.h>
+#include <linux/spinlock.h>
+#include <linux/pci.h>
+
+struct device;
+struct virtio_config_space;
+
+/**
+ * virtqueue_ops - operations for virtqueue abstraction layer
+ * @new_vq: create a new virtqueue
+ *	config: the virtio_config_space field describing the queue
+ *	off: the offset in the config space of the queue configuration
+ *	len: the length of the virtio_config_space field
+ *	callback: the driver callback when the queue is used.
+ *	cb_data: the pointer for the callback.
+ *	Returns a new virtqueue or an ERR_PTR.
+ * @free_vq: free a virtqueue
+ *	vq: the struct virtqueue to free (must be unused or empty).
+ * @add_buf: expose buffer to other end
+ *	vq: the struct virtqueue we're talking about.
+ *	sg: the description of the buffer(s).
+ *	out_num: the number of sg readable by other side
+ *	in_num: the number of sg which are writable (after readable ones)
+ *	data: the token identifying the buffer.
+ *      Returns 0 or an error.
+ * @kick: update after add_buf
+ *	vq: the struct virtqueue
+ *	After one or more add_buf calls, invoke this to kick the virtio layer.
+ * @get_buf: get the next used buffer
+ *	vq: the struct virtqueue we're talking about.
+ *	len: the length written into the buffer
+ *	Returns NULL or the "data" token handed to add_buf.
+ * @restart: restart callbacks ater callback returned false.
+ *	vq: the struct virtqueue we're talking about.
+ *	This returns "false" (and doesn't re-enable) if there are pending
+ *	buffers in the queue, to avoid a race.
+ * @shutdown: "unadd" all buffers.
+ *	vq: the struct virtqueue we're talking about.
+ *	Remove everything from the queue.
+ *
+ * Locking rules are straightforward: the driver is responsible for
+ * locking.  No two operations may be invoked simultaneously.
+ *
+ * All operations can be called in any context.
+ */
+struct virtqueue_ops {
+	struct virtqueue *(*new_vq)(struct device *dev,
+				    struct virtio_config_space *config,
+				    int off, unsigned len,
+				    bool (*callback)(void *),
+				    void *cb_data);
+
+	void (*free_vq)(struct virtqueue *vq);
+
+	int (*add_buf)(struct virtqueue *vq,
+		       struct scatterlist sg[],
+		       unsigned int out_num,
+		       unsigned int in_num,
+		       void *data);
+
+	void (*kick)(struct virtqueue *vq);
+
+	void *(*get_buf)(struct virtqueue *vq, unsigned int *len);
+
+	bool (*restart)(struct virtqueue *vq);
+
+	void (*shutdown)(struct virtqueue *vq);
+};
+
+/* FIXME: Get a real PCI vendor ID. */
+#define PCI_VENDOR_ID_VIRTIO 0x5000
+
+#define VIRTIO_DEV_ID(_devid, _class) {		\
+	.vendor = PCI_VENDOR_ID_VIRTIO,		\
+	.device = (_devid),			\
+	.subvendor = PCI_ANY_ID,		\
+	.subdevice = PCI_ANY_ID,		\
+	.class = (_class)<<8,			\
+	.class_mask = 0xFFFF00 }
+
+/**
+ * virtio_driver - operations for a virtio I/O driver
+ * @name: the name of the driver (KBUILD_MODNAME).
+ * @owner: the module which contains these routines (ie. THIS_MODULE).
+ * @id_table: the ids (we re-use PCI ids) serviced by this driver.
+ * @probe: the function to call when a device is found.  Returns a token for
+ *    remove, or PTR_ERR().
+ * @remove: the function when a device is removed.
+ */
+struct virtio_driver {
+	const char *name;
+	struct module *owner;
+	struct pci_device_id *id_table;
+	void *(*probe)(struct device *device,
+		       struct virtio_config_space *config,
+		       struct virtqueue_ops *vqops);
+	void (*remove)(void *dev);
+};
+
+int register_virtio_driver(struct virtio_driver *drv);
+void unregister_virtio_driver(struct virtio_driver *drv);
+
+/* The particular virtio backend supplies these. */
+struct virtio_backend_ops {
+	int (*register_driver)(struct virtio_driver *drv);
+	void (*unregister_driver)(struct virtio_driver *drv);
+};
+extern struct virtio_backend_ops virtio_backend_ops;
+#endif /* _LINUX_VIRTIO_H */

^ permalink raw reply	[flat|nested] 41+ messages in thread

* [PATCH 2/6] virtio_config
       [not found] ` <1190290140.7262.228.camel@localhost.localdomain>
@ 2007-09-20 12:12   ` Rusty Russell
  2007-09-20 12:27   ` [kvm-devel] [PATCH 1/6] virtio interace Avi Kivity
                     ` (4 subsequent siblings)
  5 siblings, 0 replies; 41+ messages in thread
From: Rusty Russell @ 2007-09-20 12:12 UTC (permalink / raw)
  To: kvm-devel; +Cc: lguest, Dor Laor, virtualization

Previous versions of virtio didn't commonalize probing.  For every
driver, every virtio implementation (KVM, lguest, etc) needed an
in-kernel stub to join their bus to the probe code.

To solve this, we introduce a "virtio_config" mechanism, which is
simply a set of [u8 type][u8 len][...data...] fields for each device.
Some convenient wrapper functions make this fairly easy for drivers to
unpack their configuration data themselves.

This configuration data could be PCI config space, or an unrelated bus
(eg. lguest) or manufactured by the kernel itself.  It's designed to
be extensible: fields get marked as the device reads them so a host
supporting some future extension can tell if the guest driver
understood it.  This also applies to bitfields: the guest explicitly
acks the bits it understands.

There's also a simple status bitmask useful for low-level host
analysis of what the guest is doing with the device.

Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
---
 drivers/Kconfig               |    2 
 drivers/Makefile              |    1 
 drivers/virtio/Makefile       |    2 
 drivers/virtio/config.c       |  109 +++++++++++++++++++++++++++++++
 include/linux/virtio_config.h |  141 +++++++++++++++++++++++++++++++++++++++++
 5 files changed, 254 insertions(+), 1 deletion(-)

===================================================================
--- a/drivers/Kconfig
+++ b/drivers/Kconfig
@@ -87,4 +87,6 @@ source "drivers/kvm/Kconfig"
 source "drivers/kvm/Kconfig"
 
 source "drivers/uio/Kconfig"
+
+source "drivers/virtio/Kconfig"
 endmenu
===================================================================
--- a/drivers/Makefile
+++ b/drivers/Makefile
@@ -88,3 +88,4 @@ obj-$(CONFIG_HID)		+= hid/
 obj-$(CONFIG_HID)		+= hid/
 obj-$(CONFIG_PPC_PS3)		+= ps3/
 obj-$(CONFIG_OF)		+= of/
+obj-$(CONFIG_VIRTIO)		+= virtio/
===================================================================
--- a/drivers/virtio/Makefile
+++ b/drivers/virtio/Makefile
@@ -1,1 +1,1 @@ obj-$(CONFIG_VIRTIO) += virtio.o
-obj-$(CONFIG_VIRTIO) += virtio.o
+obj-$(CONFIG_VIRTIO) += virtio.o config.o
===================================================================
--- /dev/null
+++ b/drivers/virtio/config.c
@@ -0,0 +1,109 @@
+/* Configuration space parsing helpers for virtio.
+ *
+ * The configuration is [type][len][... len bytes ...] fields.
+ *
+ * Copyright 2007 Rusty Russell, IBM Corporation.
+ * GPL v2 or later.
+ */
+#include <linux/err.h>
+#include <linux/virtio.h>
+#include <linux/virtio_config.h>
+#include <linux/bug.h>
+#include <asm/system.h>
+
+#define HDR_LEN 2
+int virtio_config_find(struct virtio_config_space *config, u8 type,
+		       unsigned int *len)
+{
+	int i;
+
+	for (i = 0; i < config->len; i += HDR_LEN+config->readb(config,i+1)) {
+		if (config->readb(config, i) == type) {
+			/* Mark it used, so host can know we looked at it. */
+			config->writeb(config, i, type | VIRTIO_CONFIG_F_USED);
+			*len = config->readb(config, i+1);
+			return i;
+		}
+	}
+	return -ENOENT;
+}
+
+void virtio_config_get(struct virtio_config_space *config, int off,
+		       unsigned int len, void *buf)
+{
+	unsigned int i;
+
+	BUG_ON(off < 0 || off + HDR_LEN + len > config->len);
+	for (i = 0; i < len; i++)
+		((u8 *)buf)[i] = config->readb(config, off + HDR_LEN + i);
+
+}
+
+void virtio_config_set(struct virtio_config_space *config, int off,
+		       unsigned int len, const void *buf)
+{
+	unsigned int i;
+
+	BUG_ON(off < 0 || off + HDR_LEN + len > config->len);
+	for (i = 0; i < len; i++)
+		config->writeb(config, off + HDR_LEN + i, ((u8 *)buf)[i]);
+}
+
+int __virtio_config_val(struct virtio_config_space *config,
+			u8 type, void *val, size_t size)
+{
+	int off;
+	unsigned int len;
+
+	off = virtio_config_find(config, type, &len);
+	if (off < 0)
+		return off;
+
+	if (len != size)
+		return -EIO;
+
+	virtio_config_get(config, off, size, val);
+	return 0;
+}
+
+int virtio_use_bit(struct virtio_config_space *config,
+		   int off, unsigned int len, unsigned int bitnum)
+{
+	u8 ack, mask = 1 << (bitnum % 8);
+
+	/* This makes it convenient to pass-through virtio_config_find. */
+	if (off < 0)
+		return 0;
+
+	/* bit not in range of this bitfield? */
+	if (bitnum * 8 >= len / 2)
+		return 0;
+
+	/* bit not set? */
+	if (!(config->readb(config, bitnum / 8) & mask))
+		return 0;
+
+	/* Set acknowledge bit. */
+	ack = config->readb(config, bitnum / 8 + len / 2) | mask;
+	config->writeb(config, bitnum / 8 + len / 2, ack);
+	return 1;
+}
+
+struct virtqueue *virtio_config_vq(struct virtio_config_space *config,
+				   struct virtqueue_ops *ops,
+				   struct device *dev,
+				   bool (*callback)(void *), void *cb_data)
+{
+	unsigned int len;
+	int off = virtio_config_find(config, VIRTIO_CONFIG_F_VIRTQUEUE, &len);
+
+	if (off < 0)
+		return ERR_PTR(off);
+
+	return ops->new_vq(config, off, len, dev, callback, cb_data);
+}
+
+void virtio_add_status(struct virtio_config_space *config, u32 mask)
+{
+	config->set_status(config, config->get_status(config) | mask);
+}
===================================================================
--- /dev/null
+++ b/include/linux/virtio_config.h
@@ -0,0 +1,141 @@
+#ifndef _LINUX_VIRTIO_CONFIG_H
+#define _LINUX_VIRTIO_CONFIG_H
+/* Virtio devices use a standardized array of bytes to define their
+ * features and pass configuration information.
+ *
+ * The first byte is the status byte, after that come type/len/data fields.
+ * Each field understood by the guest is acknowledged by setting the USED bit.
+ * Some fields are bitfields: the first half is uses to advertise features to
+ * the guest, and the second half is used by the guest to acknowledge. */
+#include <linux/types.h>
+
+/* Status byte for guest to report progress, and synchronize config. */
+/* We have seen device and processed generic fields (VIRTIO_CONFIG_F_VIRTIO) */
+#define VIRTIO_CONFIG_S_ACKNOWLEDGE	1
+/* We have found a driver for the device. */
+#define VIRTIO_CONFIG_S_DRIVER		2
+/* Driver has used its parts of the config, and is happy */
+#define VIRTIO_CONFIG_S_DRIVER_OK	4
+/* We've given up on this device. */
+#define VIRTIO_CONFIG_S_FAILED		128
+
+/* Requirements/features of the virtio implementation. */
+#define VIRTIO_CONFIG_F_VIRTIO 1
+/* Requirements/features of the virtqueue (may have more than one). */
+#define VIRTIO_CONFIG_F_VIRTQUEUE 2
+
+/* Used config: guest sets this bit to say it understood the field. */
+#define VIRTIO_CONFIG_F_USED 0x80
+
+#ifdef __KERNEL__
+/* Low level opertions on this virtio device's configuration space. */
+struct virtio_config_space
+{
+	unsigned int len;
+	void (*writeb)(struct virtio_config_space *, unsigned off, u8 b);
+	u8 (*readb)(struct virtio_config_space *, unsigned off);
+	void (*set_status)(struct virtio_config_space *, u32 status);
+	u32 (*get_status)(struct virtio_config_space *);
+};
+
+/**
+ * virtio_config_find - look for a virtio config.
+ * @config: the virtio config space
+ * @type: the type to search for.
+ * @len: the length of the field.
+ *
+ * This returns the offset of the first field of the given type (or -ENOENT).
+ * This offset and length can then be handed to virtio_config_get().
+ *
+ * This function marks the field type with VIRTIO_CONFIG_F_USED, so it can't be
+ * found again, and the host knows we've seen the field. */
+int virtio_config_find(struct virtio_config_space *config, u8 type,
+		       unsigned int *len);
+
+/**
+ * virtio_config_get - get a virtio config field, and mark it used.
+ * @config: the virtio config space
+ * @off: the offset as returned from virtio_config_find.
+ * @len: the length of the field.
+ * @buf: the buffer to copy the field into.
+ *
+ * Note that virtio config entries are conventionally little-endian. */
+void virtio_config_get(struct virtio_config_space *config, int off,
+		       unsigned int len, void *buf);
+
+/**
+ * virtio_config_set - set a virtio config field, and mark it used.
+ * @config: the virtio config space
+ * @off: the offset as returned from virtio_config_find.
+ * @len: the length of the field.
+ * @buf: the buffer to copy into the field.
+ *
+ * Note that virtio config entries are conventionally little-endian. */
+void virtio_config_set(struct virtio_config_space *config, int off,
+		       unsigned int len, const void *buf);
+
+/**
+ * virtio_config_val - get a single virtio config and mark it used.
+ * @config: the virtio config space
+ * @type: the type to search for.
+ * @val: a pointer to the value to fill in.
+ *
+ * Once used, the config type is marked with VIRTIO_CONFIG_F_USED so it can't
+ * be found again.  This version does endian conversion. */
+#define virtio_config_val(config, type, v) ({				\
+	int _err = __virtio_config_val((config),(type),(v),sizeof(*(v))); \
+									\
+	BUILD_BUG_ON(sizeof(*(v)) != 1 && sizeof(*(v)) != 2		\
+		     && sizeof(*(v)) != 4 && sizeof(*(v)) != 8);	\
+	if (!_err) {							\
+		switch (sizeof(*(v))) {					\
+		case 2: le16_to_cpus(v); break;				\
+		case 4: le32_to_cpus(v); break;				\
+		case 8: le64_to_cpus(v); break;				\
+		}							\
+	}								\
+	_err;								\
+})
+
+int __virtio_config_val(struct virtio_config_space *config,
+			u8 type, void *val, size_t size);
+
+/**
+ * virtio_use_bit - helper to use a feature bit in a bitfield value.
+ * @config: the virtio config space
+ * @off: the offset as returned from virtio_config_find.
+ * @len: the length of the field.
+ * @bitnum: the bit to test.
+ *
+ * If handed a negative offset, it returns false, otherwise returns bit status.
+ * If it's one, it sets the mirroring acknowledgement bit. */
+int virtio_use_bit(struct virtio_config_space *config,
+		   int off, unsigned int len, unsigned int bitnum);
+
+struct device;
+struct virtqueue_ops;
+
+/**
+ * virtio_config_vq - find a virtqueue in the config and call ops->new_vq.
+ * @config: the virtio config space
+ * @ops: the virtqueue_ops for the queue.
+ * @dev: the struct device this virtqueue is attached to.
+ * @callback: the driver callback when the queue is used.
+ * @cb_data: the pointer for the callback.
+ *
+ * This returns a new virtqueue or ERR_PTR(). */
+struct virtqueue *virtio_config_vq(struct virtio_config_space *config,
+				   struct virtqueue_ops *ops,
+				   struct device *dev,
+				   bool (*callback)(void *), void *cb_data);
+
+/**
+ * virtio_add_status - set a bit in the status field.
+ * @config: the virtio config space
+ * @mask: the mask to set (one bit).
+ *
+ * This is not usually set by the driver itself. */
+void virtio_add_status(struct virtio_config_space *config, u32 mask);
+
+#endif /* __KERNEL__ */
+#endif /* _LINUX_VIRTIO_CONFIG_H */

^ permalink raw reply	[flat|nested] 41+ messages in thread

* [PATCH 3/6] virtio net driver
       [not found]   ` <1190290369.7262.231.camel@localhost.localdomain>
@ 2007-09-20 12:14     ` Rusty Russell
  2007-09-20 12:36     ` [kvm-devel] [PATCH 2/6] virtio_config Avi Kivity
                       ` (2 subsequent siblings)
  3 siblings, 0 replies; 41+ messages in thread
From: Rusty Russell @ 2007-09-20 12:14 UTC (permalink / raw)
  To: kvm-devel; +Cc: Herbert Xu, lguest, Jeff Garzik, Dor Laor, virtualization

The network driver uses *two* virtqueues: one for input packets and
one for output packets.  This has nice locking properties (ie. we
don't do any for recv vs send).

TODO:
	1) Big packets.
	2) Multi-client devices (maybe separate driver?).
	3) Resolve freeing of old xmit skbs (someone sent patch IIRC?)

Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
---
 drivers/net/Kconfig        |    6 
 drivers/net/Makefile       |    2 
 drivers/net/virtio_net.c   |  438 ++++++++++++++++++++++++++++++++++++++++++++
 include/linux/virtio_net.h |   36 +++
 4 files changed, 481 insertions(+), 1 deletion(-)

===================================================================
--- a/drivers/net/Kconfig
+++ b/drivers/net/Kconfig
@@ -2998,4 +2998,10 @@ config NET_POLL_CONTROLLER
 config NET_POLL_CONTROLLER
 	def_bool NETPOLL
 
+config VIRTIO_NET
+	tristate "Virtio network driver (EXPERIMENTAL)"
+	depends on EXPERIMENTAL && VIRTIO
+	---help---
+	  This is the virtual network driver for lguest.  Say Y or M.
+
 endif # NETDEVICES
===================================================================
--- a/drivers/net/Makefile
+++ b/drivers/net/Makefile
@@ -38,7 +38,7 @@ obj-$(CONFIG_SUNVNET) += sunvnet.o
 
 obj-$(CONFIG_MACE) += mace.o
 obj-$(CONFIG_BMAC) += bmac.o
-
+obj-$(CONFIG_VIRTIO_NET) += virtio_net.o
 obj-$(CONFIG_DGRS) += dgrs.o
 obj-$(CONFIG_VORTEX) += 3c59x.o
 obj-$(CONFIG_TYPHOON) += typhoon.o
===================================================================
--- /dev/null
+++ b/drivers/net/virtio_net.c
@@ -0,0 +1,438 @@
+/* A simple network driver using virtio.
+ *
+ * Copyright 2007 Rusty Russell <rusty@rustcorp.com.au> IBM Corporation
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write to the Free Software
+ * Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA  02111-1307  USA
+ */
+//#define DEBUG
+#include <linux/netdevice.h>
+#include <linux/etherdevice.h>
+#include <linux/module.h>
+#include <linux/virtio.h>
+#include <linux/virtio_net.h>
+#include <linux/scatterlist.h>
+
+/* FIXME: MTU in config. */
+#define MAX_PACKET_LEN (ETH_HLEN+ETH_DATA_LEN)
+
+struct virtnet_info
+{
+	struct virtqueue_ops *vq_ops;
+	struct virtqueue *vq_recv;
+	struct virtqueue *vq_send;
+	struct net_device *ndev;
+
+	/* Number of input buffers, and max we've ever had. */
+	unsigned int num, max;
+
+	/* Receive & send queues. */
+	struct sk_buff_head recv;
+	struct sk_buff_head send;
+};
+
+static inline struct virtio_net_hdr *skb_vnet_hdr(struct sk_buff *skb)
+{
+	return (struct virtio_net_hdr *)skb->cb;
+}
+
+static inline void vnet_hdr_to_sg(struct scatterlist *sg, struct sk_buff *skb)
+{
+	sg_init_one(sg, skb_vnet_hdr(skb), sizeof(struct virtio_net_hdr));
+}
+
+static bool skb_xmit_done(void *ndev)
+{
+	/* In case we were waiting for output buffers. */
+	netif_wake_queue(ndev);
+	return true;
+}
+
+static void receive_skb(struct net_device *dev, struct sk_buff *skb,
+			unsigned len)
+{
+	struct virtio_net_hdr *hdr = skb_vnet_hdr(skb);
+
+	if (unlikely(len < sizeof(struct virtio_net_hdr) + ETH_HLEN)) {
+		pr_debug("%s: short packet %i\n", dev->name, len);
+		dev->stats.rx_length_errors++;
+		goto drop;
+	}
+	len -= sizeof(struct virtio_net_hdr);
+	BUG_ON(len > MAX_PACKET_LEN);
+
+	skb_trim(skb, len);
+	skb->protocol = eth_type_trans(skb, dev);
+	pr_debug("Receiving skb proto 0x%04x len %i type %i\n",
+		 ntohs(skb->protocol), skb->len, skb->pkt_type);
+	dev->stats.rx_bytes += skb->len;
+	dev->stats.rx_packets++;
+
+	if (hdr->flags & VIRTIO_NET_HDR_F_NEEDS_CSUM) {
+		pr_debug("Needs csum!\n");
+		skb->ip_summed = CHECKSUM_PARTIAL;
+		skb->csum_start = hdr->csum_start;
+		skb->csum_offset = hdr->csum_offset;
+		if (skb->csum_start > skb->len - 2
+		    || skb->csum_offset > skb->len - 2) {
+			if (net_ratelimit())
+				printk(KERN_WARNING "%s: csum=%u/%u len=%u\n",
+				       dev->name, skb->csum_start,
+				       skb->csum_offset, skb->len);
+			goto frame_err;
+		}
+	}
+
+	if (hdr->gso_type != VIRTIO_NET_HDR_GSO_NONE) {
+		pr_debug("GSO!\n");
+		switch (hdr->gso_type) {
+		case VIRTIO_NET_HDR_GSO_TCPV4:
+			skb_shinfo(skb)->gso_type = SKB_GSO_TCPV4;
+			break;
+		case VIRTIO_NET_HDR_GSO_TCPV4_ECN:
+			skb_shinfo(skb)->gso_type = SKB_GSO_TCP_ECN;
+			break;
+		case VIRTIO_NET_HDR_GSO_UDP:
+			skb_shinfo(skb)->gso_type = SKB_GSO_UDP;
+			break;
+		case VIRTIO_NET_HDR_GSO_TCPV6:
+			skb_shinfo(skb)->gso_type = SKB_GSO_TCPV6;
+			break;
+		default:
+			if (net_ratelimit())
+				printk(KERN_WARNING "%s: bad gso type %u.\n",
+				       dev->name, hdr->gso_type);
+			goto frame_err;
+		}
+
+		skb_shinfo(skb)->gso_size = hdr->gso_size;
+		if (skb_shinfo(skb)->gso_size == 0) {
+			if (net_ratelimit())
+				printk(KERN_WARNING "%s: zero gso size.\n",
+				       dev->name);
+			goto frame_err;
+		}
+
+		/* Header must be checked, and gso_segs computed. */
+		skb_shinfo(skb)->gso_type |= SKB_GSO_DODGY;
+		skb_shinfo(skb)->gso_segs = 0;
+	}
+
+	netif_receive_skb(skb);
+	return;
+
+frame_err:
+	dev->stats.rx_frame_errors++;
+drop:
+	dev_kfree_skb(skb);
+}
+
+static void try_fill_recv(struct virtnet_info *vi)
+{
+	struct sk_buff *skb;
+	struct scatterlist sg[1+MAX_SKB_FRAGS];
+	int num, err;
+
+	for (;;) {
+		skb = netdev_alloc_skb(vi->ndev, MAX_PACKET_LEN);
+		if (unlikely(!skb))
+			break;
+
+		skb_put(skb, MAX_PACKET_LEN);
+		vnet_hdr_to_sg(sg, skb);
+		num = skb_to_sgvec(skb, sg+1, 0, skb->len) + 1;
+		skb_queue_head(&vi->recv, skb);
+
+		err = vi->vq_ops->add_buf(vi->vq_recv, sg, 0, num, skb);
+		if (err) {
+			skb_unlink(skb, &vi->recv);
+			kfree_skb(skb);
+			break;
+		}
+		vi->num++;
+	}
+	if (unlikely(vi->num > vi->max))
+		vi->max = vi->num;
+	vi->vq_ops->kick(vi->vq_recv);
+}
+
+static bool skb_recv_done(void *ndev)
+{
+	netif_rx_schedule(ndev);
+	/* Suppress further interrupts. */
+	return false;
+}
+
+static int virtnet_poll(struct net_device *dev, int *budget)
+{
+	struct virtnet_info *vi = netdev_priv(dev);
+	struct sk_buff *skb = NULL;
+	unsigned int len, received = 0;
+
+again:
+	while (received < dev->quota &&
+	       (skb = vi->vq_ops->get_buf(vi->vq_recv, &len)) != NULL) {
+		__skb_unlink(skb, &vi->recv);
+		receive_skb(vi->ndev, skb, len);
+		vi->num--;
+		received++;
+	}
+
+        dev->quota -= received;
+        *budget -= received;
+
+	/* FIXME: If we oom and completely run out of inbufs, we need
+	 * to start a timer trying to fill more. */
+	if (vi->num < vi->max / 2)
+		try_fill_recv(vi);
+
+	/* Still more work to do? */
+	if (skb)
+		return 1; /* not done */
+
+	netif_rx_complete(dev);
+	if (unlikely(!vi->vq_ops->restart(vi->vq_recv))
+	    && netif_rx_reschedule(dev, received))
+		goto again;
+
+	return 0;
+}
+
+static void free_old_xmit_skbs(struct virtnet_info *vi)
+{
+	struct sk_buff *skb;
+	unsigned int len;
+
+	while ((skb = vi->vq_ops->get_buf(vi->vq_send, &len)) != NULL) {
+		pr_debug("Sent skb %p\n", skb);
+		__skb_unlink(skb, &vi->send);
+		vi->ndev->stats.tx_bytes += len;
+		vi->ndev->stats.tx_packets++;
+		kfree_skb(skb);
+	}
+}
+
+static int start_xmit(struct sk_buff *skb, struct net_device *dev)
+{
+	struct virtnet_info *vi = netdev_priv(dev);
+	int num, err;
+	struct scatterlist sg[1+MAX_SKB_FRAGS];
+	struct virtio_net_hdr *hdr;
+	const unsigned char *dest = ((struct ethhdr *)skb->data)->h_dest;
+
+	pr_debug("%s: xmit %p %02x:%02x:%02x:%02x:%02x:%02x\n",
+		 dev->name, skb,
+		 dest[0], dest[1], dest[2], dest[3], dest[4], dest[5]);
+
+	free_old_xmit_skbs(vi);
+
+	/* Encode metadata header at front. */
+	hdr = skb_vnet_hdr(skb);
+	if (skb->ip_summed == CHECKSUM_PARTIAL) {
+		hdr->flags = VIRTIO_NET_HDR_F_NEEDS_CSUM;
+		hdr->csum_start = skb->csum_start - skb_headroom(skb);
+		hdr->csum_offset = skb->csum_offset;
+	} else {
+		hdr->flags = 0;
+		hdr->csum_offset = hdr->csum_start = 0;
+	}
+
+	if (skb_is_gso(skb)) {
+		hdr->gso_size = skb_shinfo(skb)->gso_size;
+		if (skb_shinfo(skb)->gso_type & SKB_GSO_TCP_ECN)
+			hdr->gso_type = VIRTIO_NET_HDR_GSO_TCPV4_ECN;
+		else if (skb_shinfo(skb)->gso_type & SKB_GSO_TCPV4)
+			hdr->gso_type = VIRTIO_NET_HDR_GSO_TCPV4;
+		else if (skb_shinfo(skb)->gso_type & SKB_GSO_TCPV6)
+			hdr->gso_type = VIRTIO_NET_HDR_GSO_TCPV6;
+		else if (skb_shinfo(skb)->gso_type & SKB_GSO_UDP)
+			hdr->gso_type = VIRTIO_NET_HDR_GSO_UDP;
+		else
+			BUG();
+	} else {
+		hdr->gso_type = VIRTIO_NET_HDR_GSO_NONE;
+		hdr->gso_size = 0;
+	}
+
+	vnet_hdr_to_sg(sg, skb);
+	num = skb_to_sgvec(skb, sg+1, 0, skb->len) + 1;
+	__skb_queue_head(&vi->send, skb);
+	err = vi->vq_ops->add_buf(vi->vq_send, sg, num, 0, skb);
+	if (err) {
+		pr_debug("%s: virtio not prepared to send\n", dev->name);
+		skb_unlink(skb, &vi->send);
+		netif_stop_queue(dev);
+		return NETDEV_TX_BUSY;
+	}
+	vi->vq_ops->kick(vi->vq_send);
+
+	return 0;
+}
+
+static int virtnet_open(struct net_device *dev)
+{
+	struct virtnet_info *vi = netdev_priv(dev);
+
+	try_fill_recv(vi);
+
+	/* If we didn't even get one input buffer, we're useless. */
+	if (vi->num == 0)
+		return -ENOMEM;
+
+	return 0;
+}
+
+static int virtnet_close(struct net_device *dev)
+{
+	struct virtnet_info *vi = netdev_priv(dev);
+	struct sk_buff *skb;
+
+	/* networking core has neutered skb_xmit_done/skb_recv_done, so don't
+	 * worry about races vs. get(). */
+	vi->vq_ops->shutdown(vi->vq_recv);
+	while ((skb = __skb_dequeue(&vi->recv)) != NULL) {
+		kfree_skb(skb);
+		vi->num--;
+	}
+	vi->vq_ops->shutdown(vi->vq_send);
+	while ((skb = __skb_dequeue(&vi->send)) != NULL)
+		kfree_skb(skb);
+
+	BUG_ON(vi->num != 0);
+	return 0;
+}
+
+static void *virtnet_probe(struct device *device,
+			   struct virtio_config_space *config,
+			   struct virtqueue_ops *vq_ops)
+{
+	int err, coff;
+	unsigned int len;
+	struct net_device *dev;
+	struct virtnet_info *vi;
+
+	/* Allocate ourselves a network device with room for our info */
+	dev = alloc_etherdev(sizeof(struct virtnet_info));
+	if (!dev)
+		return ERR_PTR(-ENOMEM);
+
+	/* Set up network device as normal. */
+	SET_MODULE_OWNER(dev);
+	ether_setup(dev);
+	dev->open = virtnet_open;
+	dev->stop = virtnet_close;
+	dev->poll = virtnet_poll;
+	dev->hard_start_xmit = start_xmit;
+	dev->weight = 16;
+	dev->features = NETIF_F_HIGHDMA;
+	SET_NETDEV_DEV(dev, device);
+
+	/* Do we support "hardware" checksums? */
+	coff = virtio_config_find(config, VIRTIO_CONFIG_NET_F, &len);
+	if (virtio_use_bit(config, coff, len, VIRTIO_NET_F_NO_CSUM)) {
+		/* This opens up the world of extra features. */
+		dev->features |= NETIF_F_HW_CSUM|NETIF_F_SG|NETIF_F_FRAGLIST;
+		if (virtio_use_bit(config, coff, len, VIRTIO_NET_F_TSO4))
+			dev->features |= NETIF_F_TSO;
+		if (virtio_use_bit(config, coff, len, VIRTIO_NET_F_UFO))
+			dev->features |= NETIF_F_UFO;
+		if (virtio_use_bit(config, coff, len, VIRTIO_NET_F_TSO4_ECN))
+			dev->features |= NETIF_F_TSO_ECN;
+		if (virtio_use_bit(config, coff, len, VIRTIO_NET_F_TSO6))
+			dev->features |= NETIF_F_TSO6;
+	}
+
+	/* Configuration may specify what MAC to use.  Otherwise random. */
+	coff = virtio_config_find(config, VIRTIO_CONFIG_NET_MAC_F, &len);
+	if (coff >= 0) {
+		dev->addr_len = len;
+		virtio_config_get(config, coff, len, dev->dev_addr);
+	} else
+		random_ether_addr(dev->dev_addr);
+
+	/* Set up our device-specific information */
+	vi = netdev_priv(dev);
+	vi->ndev = dev;
+	vi->vq_ops = vq_ops;
+
+	/* We expect two virtqueues, receive then send. */
+	vi->vq_recv = virtio_config_vq(config, vq_ops, device,
+				       skb_recv_done, dev);
+	if (IS_ERR(vi->vq_recv)) {
+		err = PTR_ERR(vi->vq_recv);
+		goto free;
+	}
+
+	vi->vq_send = virtio_config_vq(config, vq_ops, device,
+				       skb_xmit_done, dev);
+	if (IS_ERR(vi->vq_send)) {
+		err = PTR_ERR(vi->vq_send);
+		goto free_recv;
+	}
+
+	/* Initialize our empty receive and send queues. */
+	skb_queue_head_init(&vi->recv);
+	skb_queue_head_init(&vi->send);
+
+	err = register_netdev(dev);
+	if (err) {
+		pr_debug("virtio_net: registering device failed\n");
+		goto free_send;
+	}
+	pr_debug("virtnet: registered device %s\n", dev->name);
+	return dev;
+
+free_send:
+	vq_ops->free_vq(vi->vq_send);
+free_recv:
+	vq_ops->free_vq(vi->vq_recv);
+free:
+	free_netdev(dev);
+	return ERR_PTR(err);
+}
+
+static void virtnet_remove(void *dev)
+{
+	unregister_netdev(dev);
+	free_netdev(dev);
+}
+
+static struct pci_device_id id_table[] = {
+	VIRTIO_DEV_ID(VIRTIO_ID_NET, PCI_CLASS_NETWORK_OTHER),
+	{ 0 },
+};
+
+static struct virtio_driver virtio_net = {
+	.name =         KBUILD_MODNAME,
+	.owner =        THIS_MODULE,
+	.id_table =     id_table,
+	.probe =        virtnet_probe,
+	.remove =       __devexit_p(virtnet_remove),
+};
+
+static int __init init(void)
+{
+	return register_virtio_driver(&virtio_net);
+}
+
+static void __exit fini(void)
+{
+	unregister_virtio_driver(&virtio_net);
+}
+module_init(init);
+module_exit(fini);
+
+MODULE_DEVICE_TABLE(pci, id_table);
+MODULE_DESCRIPTION("Virtio network driver");
+MODULE_LICENSE("GPL");
===================================================================
--- /dev/null
+++ b/include/linux/virtio_net.h
@@ -0,0 +1,36 @@
+#ifndef _LINUX_VIRTIO_NET_H
+#define _LINUX_VIRTIO_NET_H
+#include <linux/virtio_config.h>
+
+/* The ID for virtio_net */
+#define VIRTIO_ID_NET	1
+
+/* The bitmap of config for virtio net */
+#define VIRTIO_CONFIG_NET_F	0x40
+#define VIRTIO_NET_F_NO_CSUM	1
+#define VIRTIO_NET_F_TSO4	2
+#define VIRTIO_NET_F_UFO	4
+#define VIRTIO_NET_F_TSO4_ECN	8
+#define VIRTIO_NET_F_TSO6	16
+
+/* The config defining mac address. */
+#define VIRTIO_CONFIG_NET_MAC_F	0x41
+
+/* This is the first element of the scatter-gather list.  If you don't
+ * specify GSO or CSUM features, you can simply ignore the header. */
+struct virtio_net_hdr
+{
+#define VIRTIO_NET_HDR_F_NEEDS_CSUM	1	// Use csum_start, csum_offset
+      __u8 flags;
+#define VIRTIO_NET_HDR_GSO_NONE		0	// Not a GSO frame
+#define VIRTIO_NET_HDR_GSO_TCPV4	1	// GSO frame, IPv4 TCP (TSO)
+/* FIXME: Do we need this?  If they said they can handle ECN, do they care? */
+#define VIRTIO_NET_HDR_GSO_TCPV4_ECN	2	// GSO frame, IPv4 TCP w/ ECN
+#define VIRTIO_NET_HDR_GSO_UDP		3	// GSO frame, IPv4 UDP (UFO)
+#define VIRTIO_NET_HDR_GSO_TCPV6	4	// GSO frame, IPv6 TCP
+      __u8 gso_type;
+      __u16 gso_size;
+      __u16 csum_start;
+      __u16 csum_offset;
+};
+#endif /* _LINUX_VIRTIO_NET_H */

^ permalink raw reply	[flat|nested] 41+ messages in thread

* [PATCH 4/6] virtio block driver
       [not found]     ` <1190290495.7262.235.camel@localhost.localdomain>
@ 2007-09-20 12:16       ` Rusty Russell
       [not found]       ` <1190290606.7262.239.camel@localhost.localdomain>
                         ` (2 subsequent siblings)
  3 siblings, 0 replies; 41+ messages in thread
From: Rusty Russell @ 2007-09-20 12:16 UTC (permalink / raw)
  To: kvm-devel; +Cc: lguest, Jens Axboe, Dor Laor, virtualization

The block driver uses scatter-gather lists with sg[0] being the
request information (struct virtio_blk_outhdr) with the type, sector
and inbuf id.  The next N sg entries are the bio itself, then the last
sg is the status byte.  Whether the N entries are in or out depends on
whether it's a read or a write.

We accept the normal (SCSI) ioctls: they get handed through to the other
side which can then handle it or reply that it's unsupported.  It's
not clear that this actually works in general, since I don't know
if blk_pc_request() requests have an accurate rq_data_dir().

Although we try to reply -ENOTTY on unsupported commands, the block
layer in its infinite wisdom suppressed the error so ioctl(fd,
CDROMEJECT) returns success to userspace.

Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
---
 drivers/block/Kconfig      |    6 
 drivers/block/Makefile     |    1 
 drivers/block/virtio_blk.c |  327 ++++++++++++++++++++++++++++++++++++++++++++
 include/linux/Kbuild       |    1 
 include/linux/virtio_blk.h |   51 ++++++
 5 files changed, 386 insertions(+)

===================================================================
--- a/drivers/block/Kconfig
+++ b/drivers/block/Kconfig
@@ -443,4 +443,10 @@ config XEN_BLKDEV_FRONTEND
 	  block device driver.  It communicates with a back-end driver
 	  in another domain which drives the actual block device.
 
+config VIRTIO_BLK
+	tristate "Virtio block driver (EXPERIMENTAL)"
+	depends on EXPERIMENTAL && VIRTIO
+	---help---
+	  This is the virtual block driver for lguest.  Say Y or M.
+
 endif # BLK_DEV
===================================================================
--- a/drivers/block/Makefile
+++ b/drivers/block/Makefile
@@ -25,6 +25,7 @@ obj-$(CONFIG_BLK_DEV_UMEM)	+= umem.o
 obj-$(CONFIG_BLK_DEV_UMEM)	+= umem.o
 obj-$(CONFIG_BLK_DEV_NBD)	+= nbd.o
 obj-$(CONFIG_BLK_DEV_CRYPTOLOOP) += cryptoloop.o
+obj-$(CONFIG_VIRTIO_BLK)	+= virtio_blk.o
 
 obj-$(CONFIG_VIODASD)		+= viodasd.o
 obj-$(CONFIG_BLK_DEV_SX8)	+= sx8.o
===================================================================
--- /dev/null
+++ b/drivers/block/virtio_blk.c
@@ -0,0 +1,327 @@
+#define DEBUG
+#include <linux/spinlock.h>
+#include <linux/blkdev.h>
+#include <linux/hdreg.h>
+#include <linux/virtio.h>
+#include <linux/virtio_blk.h>
+#include <linux/virtio_blk.h>
+
+static unsigned char virtblk_index = 'a';
+struct virtio_blk
+{
+	spinlock_t lock;
+
+	struct virtqueue *vq;
+	struct virtqueue_ops *vq_ops;
+
+	/* The disk structure for the kernel. */
+	struct gendisk *disk;
+
+	/* Request tracking. */
+	struct list_head reqs;
+
+	mempool_t *pool;
+
+	/* Scatterlist: can be too big for stack. */
+	struct scatterlist sg[3+MAX_PHYS_SEGMENTS];
+};
+
+struct virtblk_req
+{
+	struct list_head list;
+	struct request *req;
+	struct virtio_blk_outhdr out_hdr;
+	struct virtio_blk_inhdr in_hdr;
+};
+
+static void end_dequeued_request(struct request *req,
+				 struct request_queue *q, int uptodate)
+{
+	/* And so the insanity of the block layer infects us here. */
+	int nsectors = req->hard_nr_sectors;
+
+	if (blk_pc_request(req)) {
+		nsectors = (req->data_len + 511) >> 9;
+		if (!nsectors)
+			nsectors = 1;
+	}
+	if (end_that_request_first(req, uptodate, nsectors))
+		BUG();
+	add_disk_randomness(req->rq_disk);
+	end_that_request_last(req, uptodate);
+}
+
+static bool blk_done(void *_vblk)
+{
+	struct virtio_blk *vblk = _vblk;
+	struct virtblk_req *vbr;
+	unsigned int len;
+	unsigned long flags;
+
+	spin_lock_irqsave(&vblk->lock, flags);
+	while ((vbr = vblk->vq_ops->get_buf(vblk->vq, &len)) != NULL) {
+		int uptodate;
+		switch (vbr->in_hdr.status) {
+		case VIRTIO_BLK_S_OK:
+			uptodate = 1;
+			break;
+		case VIRTIO_BLK_S_UNSUPP:
+			uptodate = -ENOTTY;
+			break;
+		default:
+			uptodate = 0;
+			break;
+		}
+
+		end_dequeued_request(vbr->req, vblk->disk->queue, uptodate);
+		list_del(&vbr->list);
+		mempool_free(vbr, vblk->pool);
+	}
+	/* In case queue is stopped waiting for more buffers. */
+	blk_start_queue(vblk->disk->queue);
+	spin_unlock_irqrestore(&vblk->lock, flags);
+	return true;
+}
+
+static bool do_req(struct request_queue *q, struct virtio_blk *vblk,
+		   struct request *req)
+{
+	unsigned long num, out_num, in_num;
+	struct virtblk_req *vbr;
+
+	vbr = mempool_alloc(vblk->pool, GFP_ATOMIC);
+	if (!vbr)
+		/* When another request finishes we'll try again. */
+		return false;
+
+	vbr->req = req;
+	if (blk_fs_request(vbr->req)) {
+		vbr->out_hdr.type = 0;
+		vbr->out_hdr.sector = vbr->req->sector;
+		vbr->out_hdr.ioprio = vbr->req->ioprio;
+	} else if (blk_pc_request(vbr->req)) {
+		vbr->out_hdr.type = VIRTIO_BLK_T_SCSI_CMD;
+		vbr->out_hdr.sector = 0;
+		vbr->out_hdr.ioprio = vbr->req->ioprio;
+	} else {
+		/* We don't put anything else in the queue. */
+		BUG();
+	}
+
+	if (blk_barrier_rq(vbr->req))
+		vbr->out_hdr.type |= VIRTIO_BLK_T_BARRIER;
+
+	vblk->sg[0].page = virt_to_page(&vbr->out_hdr);
+	vblk->sg[0].offset = offset_in_page(&vbr->out_hdr);
+	vblk->sg[0].length = sizeof(vbr->out_hdr);
+	num = blk_rq_map_sg(q, vbr->req, vblk->sg+1);
+	vblk->sg[num+1].page = virt_to_page(&vbr->in_hdr);
+	vblk->sg[num+1].offset = offset_in_page(&vbr->in_hdr);
+	vblk->sg[num+1].length = sizeof(vbr->in_hdr);
+
+	if (rq_data_dir(vbr->req) == WRITE) {
+		vbr->out_hdr.type |= VIRTIO_BLK_T_OUT;
+		out_num = 1 + num;
+		in_num = 1;
+	} else {
+		vbr->out_hdr.type |= VIRTIO_BLK_T_IN;
+		out_num = 1;
+		in_num = 1 + num;
+	}
+
+	if (vblk->vq_ops->add_buf(vblk->vq, vblk->sg, out_num, in_num, vbr)) {
+		mempool_free(vbr, vblk->pool);
+		return false;
+	}
+
+	list_add_tail(&vbr->list, &vblk->reqs);
+	return true;
+}
+
+static void do_virtblk_request(struct request_queue *q)
+{
+	struct virtio_blk *vblk = NULL;
+	struct request *req;
+	unsigned int issued = 0;
+
+	while ((req = elv_next_request(q)) != NULL) {
+		vblk = req->rq_disk->private_data;
+		BUG_ON(req->nr_phys_segments > ARRAY_SIZE(vblk->sg));
+
+		if (!do_req(q, vblk, req)) {
+			/* Queue full?  Wait. */
+			blk_stop_queue(q);
+			break;
+		}
+		blkdev_dequeue_request(req);
+		issued++;
+	}
+
+	if (issued)
+		vblk->vq_ops->kick(vblk->vq);
+}
+
+static int virtblk_ioctl(struct inode *inode, struct file *filp,
+			 unsigned cmd, unsigned long data)
+{
+	return scsi_cmd_ioctl(filp, inode->i_bdev->bd_disk->queue,
+			      inode->i_bdev->bd_disk, cmd,
+			      (void __user *)data);
+}
+
+static struct block_device_operations virtblk_fops = {
+	.ioctl = virtblk_ioctl,
+	.owner = THIS_MODULE,
+};
+
+static void *virtblk_probe(struct device *device,
+			   struct virtio_config_space *config,
+			   struct virtqueue_ops *vq_ops)
+{
+	struct virtio_blk *vblk;
+	int err, major, coff;
+	unsigned int len;
+	u64 cap;
+	u32 v;
+
+	vblk = kmalloc(sizeof(*vblk), GFP_KERNEL);
+	if (!vblk) {
+		err = -ENOMEM;
+		goto out;
+	}
+
+	INIT_LIST_HEAD(&vblk->reqs);
+	spin_lock_init(&vblk->lock);
+	vblk->vq_ops = vq_ops;
+
+	/* We expect one virtqueue, for output. */
+	vblk->vq = virtio_config_vq(config, vq_ops, device, blk_done, vblk);
+	if (IS_ERR(vblk->vq)) {
+		err = PTR_ERR(vblk->vq);
+		goto out_free_vblk;
+	}
+
+	vblk->pool = mempool_create_kmalloc_pool(1,sizeof(struct virtblk_req));
+	if (!vblk->pool) {
+		err = -ENOMEM;
+		goto out_free_vq;
+	}
+
+	major = register_blkdev(0, "virtblk");
+	if (major < 0) {
+		err = major;
+		goto out_mempool;
+	}
+
+	/* FIXME: How many partitions?  How long is a piece of string? */
+	vblk->disk = alloc_disk(1 << 4);
+	if (!vblk->disk) {
+		err = -ENOMEM;
+		goto out_unregister_blkdev;
+	}
+
+	vblk->disk->queue = blk_init_queue(do_virtblk_request, &vblk->lock);
+	if (!vblk->disk->queue) {
+		err = -ENOMEM;
+		goto out_put_disk;
+	}
+
+	sprintf(vblk->disk->disk_name, "vd%c", virtblk_index++);
+	vblk->disk->major = major;
+	vblk->disk->first_minor = 0;
+	vblk->disk->private_data = vblk;
+	vblk->disk->fops = &virtblk_fops;
+
+	/* If barriers are supported, tell block layer that queue is ordered */
+	coff = virtio_config_find(config, VIRTIO_CONFIG_BLK_F, &len);
+	if (virtio_use_bit(config, coff, len, VIRTIO_BLK_F_BARRIER))
+		blk_queue_ordered(vblk->disk->queue, QUEUE_ORDERED_TAG, NULL);
+
+	err = virtio_config_val(config, VIRTIO_CONFIG_BLK_F_CAPACITY, &cap);
+	if (err) {
+		dev_err(device, "Bad/missing capacity in config\n");
+		goto out_put_disk;
+	}
+
+	/* If capacity is too big, truncate with warning. */
+	if ((sector_t)cap != cap) {
+		dev_warn(device, "Capacity %llu is too large: truncating\n",
+			 (unsigned long long)cap);
+		cap = (sector_t)-1;
+	}
+	set_capacity(vblk->disk, cap);
+
+	err = virtio_config_val(config, VIRTIO_CONFIG_BLK_F_SIZE_MAX, &v);
+	if (!err)
+		blk_queue_max_segment_size(vblk->disk->queue, v);
+	else if (err != -ENOENT) {
+		dev_err(device, "Bad SIZE_MAX in config\n");
+		goto out_put_disk;
+	}
+
+	err = virtio_config_val(config, VIRTIO_CONFIG_BLK_F_SEG_MAX, &v);
+	if (!err)
+		blk_queue_max_hw_segments(vblk->disk->queue, v);
+	else if (err != -ENOENT) {
+		dev_err(device, "Bad SEG_MAX in config\n");
+		goto out_put_disk;
+	}
+
+	add_disk(vblk->disk);
+	return vblk;
+
+out_put_disk:
+	put_disk(vblk->disk);
+out_unregister_blkdev:
+	unregister_blkdev(major, "virtblk");
+out_mempool:
+	mempool_destroy(vblk->pool);
+out_free_vq:
+	vq_ops->free_vq(vblk->vq);
+out_free_vblk:
+	kfree(vblk);
+out:
+	return ERR_PTR(err);
+}
+
+static void virtblk_remove(void *_vblk)
+{
+	struct virtio_blk *vblk = _vblk;
+	int major = vblk->disk->major;
+
+	BUG_ON(!list_empty(&vblk->reqs));
+	blk_cleanup_queue(vblk->disk->queue);
+	put_disk(vblk->disk);
+	unregister_blkdev(major, "virtblk");
+	mempool_destroy(vblk->pool);
+	kfree(vblk);
+}
+
+static struct pci_device_id id_table[] = {
+	VIRTIO_DEV_ID(VIRTIO_ID_BLOCK, PCI_CLASS_STORAGE_OTHER),
+	{ 0 },
+};
+
+static struct virtio_driver virtio_blk = {
+	.name =         KBUILD_MODNAME,
+	.owner =        THIS_MODULE,
+	.id_table =     id_table,
+	.probe =        virtblk_probe,
+	.remove =       __devexit_p(virtblk_remove),
+};
+
+static int __init init(void)
+{
+	return register_virtio_driver(&virtio_blk);
+}
+
+static void __exit fini(void)
+{
+	unregister_virtio_driver(&virtio_blk);
+}
+module_init(init);
+module_exit(fini);
+
+MODULE_DEVICE_TABLE(pci, id_table);
+MODULE_DESCRIPTION("Virtio block driver");
+MODULE_LICENSE("GPL");
===================================================================
--- a/include/linux/Kbuild
+++ b/include/linux/Kbuild
@@ -343,6 +343,7 @@ unifdef-y += utsname.h
 unifdef-y += utsname.h
 unifdef-y += videodev2.h
 unifdef-y += videodev.h
+unifdef-y += virtio_blk.h
 unifdef-y += wait.h
 unifdef-y += wanrouter.h
 unifdef-y += watchdog.h
===================================================================
--- /dev/null
+++ b/include/linux/virtio_blk.h
@@ -0,0 +1,51 @@
+#ifndef _LINUX_VIRTIO_BLK_H
+#define _LINUX_VIRTIO_BLK_H
+#include <linux/virtio_config.h>
+
+/* The ID for virtio_block */
+#define VIRTIO_ID_BLOCK	2
+
+/* Feature bits */
+#define VIRTIO_CONFIG_BLK_F	0x40
+#define VIRTIO_BLK_F_BARRIER	1	/* Does host support barriers? */
+
+/* The capacity (in 512-byte sectors). */
+#define VIRTIO_CONFIG_BLK_F_CAPACITY	0x41
+/* The maximum segment size. */
+#define VIRTIO_CONFIG_BLK_F_SIZE_MAX	0x42
+/* The maximum number of segments. */
+#define VIRTIO_CONFIG_BLK_F_SEG_MAX	0x43
+
+/* These two define direction. */
+#define VIRTIO_BLK_T_IN		0
+#define VIRTIO_BLK_T_OUT	1
+
+/* This bit says it's a scsi command, not an actual read or write. */
+#define VIRTIO_BLK_T_SCSI_CMD	2
+
+/* Barrier before this op. */
+#define VIRTIO_BLK_T_BARRIER	0x80000000
+
+/* This is the first element of the read scatter-gather list. */
+struct virtio_blk_outhdr
+{
+	/* VIRTIO_BLK_T* */
+	__u32 type;
+	/* io priority. */
+	__u32 ioprio;
+	/* Sector (ie. 512 byte offset) */
+	__u64 sector;
+	/* Where to put reply. */
+	__u64 id;
+};
+
+#define VIRTIO_BLK_S_OK		0
+#define VIRTIO_BLK_S_IOERR	1
+#define VIRTIO_BLK_S_UNSUPP	2
+
+/* This is the first element of the write scatter-gather list */
+struct virtio_blk_inhdr
+{
+	unsigned char status;
+};
+#endif /* _LINUX_VIRTIO_BLK_H */

^ permalink raw reply	[flat|nested] 41+ messages in thread

* [PATCH 5/6] virtio console driver
       [not found]       ` <1190290606.7262.239.camel@localhost.localdomain>
@ 2007-09-20 12:19         ` Rusty Russell
  2007-09-20 12:27         ` [PATCH 4/6] virtio block driver Jens Axboe
                           ` (4 subsequent siblings)
  5 siblings, 0 replies; 41+ messages in thread
From: Rusty Russell @ 2007-09-20 12:19 UTC (permalink / raw)
  To: kvm-devel; +Cc: lguest, Jens Axboe, Dor Laor, virtualization

This is an hvc-based virtio console driver.  It's suboptimal becuase
hvc expects to have raw access to interrupts and virtio doesn't assume
that, so it currently polls.

There are two solutions: expose hvc's "kick" interface, or wean off hvc.

Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
---
 drivers/char/Kconfig           |    4 
 drivers/char/Makefile          |    1 
 drivers/char/virtio_console.c  |  232 ++++++++++++++++++++++++++++++++++++++++
 include/linux/virtio_console.h |   12 ++
 4 files changed, 249 insertions(+)

===================================================================
--- a/drivers/char/Kconfig
+++ b/drivers/char/Kconfig
@@ -613,6 +613,10 @@ config HVC_XEN
 	help
 	  Xen virtual console device driver
 
+config VIRTIO_CONSOLE
+	bool
+	select HVC_DRIVER
+
 config HVCS
 	tristate "IBM Hypervisor Virtual Console Server support"
 	depends on PPC_PSERIES
===================================================================
--- a/drivers/char/Makefile
+++ b/drivers/char/Makefile
@@ -50,6 +50,7 @@ obj-$(CONFIG_HVC_BEAT)		+= hvc_beat.o
 obj-$(CONFIG_HVC_BEAT)		+= hvc_beat.o
 obj-$(CONFIG_HVC_DRIVER)	+= hvc_console.o
 obj-$(CONFIG_HVC_XEN)		+= hvc_xen.o
+obj-$(CONFIG_VIRTIO_CONSOLE)	+= virtio_console.o
 obj-$(CONFIG_RAW_DRIVER)	+= raw.o
 obj-$(CONFIG_SGI_SNSC)		+= snsc.o snsc_event.o
 obj-$(CONFIG_MSPEC)		+= mspec.o
===================================================================
--- /dev/null
+++ b/drivers/char/virtio_console.c
@@ -0,0 +1,232 @@
+/*D:300
+ * The Guest console driver
+ *
+ * Writing console drivers is one of the few remaining Dark Arts in Linux.
+ * Fortunately for us, the path of virtual consoles has been well-trodden by
+ * the PowerPC folks, who wrote "hvc_console.c" to generically support any
+ * virtual console.  We use that infrastructure which only requires us to write
+ * the basic put_chars and get_chars functions and call the right register
+ * functions.
+ :*/
+
+/*M:002 The console can be flooded: while the Guest is processing input the
+ * Host can send more.  Buffering in the Host could alleviate this, but it is a
+ * difficult problem in general. :*/
+/* Copyright (C) 2006, 2007 Rusty Russell, IBM Corporation
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write to the Free Software
+ * Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA  02111-1307  USA
+ */
+#include <linux/err.h>
+#include <linux/init.h>
+#include <linux/virtio.h>
+#include <linux/virtio_console.h>
+#include "hvc_console.h"
+
+/*D:340 These represent our input and output console queues, and the virtio
+ * operations for them. */
+static struct virtqueue *in_vq, *out_vq;
+static struct virtqueue_ops *vq_ops;
+
+/* This is our input buffer, and how much data is left in it. */
+static unsigned int in_len;
+static char *in, *inbuf;
+
+/* The operations for our console. */
+static struct hv_ops virtio_cons;
+
+/*D:310 The put_chars() callback is pretty straightforward.
+ *
+ * We turn the characters into a scatter-gather list, add it to the output
+ * queue and then kick the Host.  Then we sit here waiting for it to finish:
+ * inefficient in theory, but in practice implementations will do it
+ * immediately (lguest's Launcher does). */
+static int put_chars(u32 vtermno, const char *buf, int count)
+{
+	struct scatterlist sg[1];
+	unsigned int len;
+
+	/* This is a convenient routine to initialize a single-elem sg list */
+	sg_init_one(sg, buf, count);
+
+	/* add_buf wants a token to identify this buffer: we hand it any
+	 * non-NULL pointer, since there's only ever one buffer. */
+	if (vq_ops->add_buf(out_vq, sg, 1, 0, (void *)1) == 0) {
+		/* Tell Host to go! */
+		vq_ops->kick(out_vq);
+		/* Chill out until it's done with the buffer. */
+		while (!vq_ops->get_buf(out_vq, &len))
+			cpu_relax();
+	}
+
+	/* We're expected to return the amount of data we wrote: all of it. */
+	return count;
+}
+
+/* Create a scatter-gather list representing our input buffer and put it in the
+ * queue. */
+static void add_inbuf(void)
+{
+	struct scatterlist sg[1];
+	sg_init_one(sg, inbuf, PAGE_SIZE);
+
+	/* We should always be able to add one buffer to an empty queue. */
+	if (vq_ops->add_buf(in_vq, sg, 0, 1, inbuf) != 0)
+		BUG();
+	vq_ops->kick(in_vq);
+}
+
+/*D:350 get_chars() is the callback from the hvc_console infrastructure when
+ * an interrupt is received.
+ *
+ * Most of the code deals with the fact that the hvc_console() infrastructure
+ * only asks us for 16 bytes at a time.  We keep in_offset and in_used fields
+ * for partially-filled buffers. */
+static int get_chars(u32 vtermno, char *buf, int count)
+{
+	/* If we don't have an input queue yet, we can't get input. */
+	BUG_ON(!in_vq);
+
+	/* No buffer?  Try to get one. */
+	if (!in_len) {
+		in = vq_ops->get_buf(in_vq, &in_len);
+		if (!in)
+			return 0;
+	}
+
+	/* You want more than we have to give?  Well, try wanting less! */
+	if (in_len < count)
+		count = in_len;
+
+	/* Copy across to their buffer and increment offset. */
+	memcpy(buf, in, count);
+	in += count;
+	in_len -= count;
+
+	/* Finished?  Re-register buffer so Host will use it again. */
+	if (in_len == 0)
+		add_inbuf();
+
+	return count;
+}
+/*:*/
+
+/*D:320 Console drivers are initialized very early so boot messages can go out,
+ * so we do things slightly differently from the generic virtio initialization
+ * of the net and block drivers.
+ *
+ * At this stage, the console is output-only.  It's too early to set up a
+ * virtqueue, so we let the drivers do some boutique early-output thing. */
+int __init virtio_cons_early_init(int (*put_chars)(u32, const char *, int))
+{
+	virtio_cons.put_chars = put_chars;
+	return hvc_instantiate(0, 0, &virtio_cons);
+}
+
+/* FIXME: This is why we want to wean off hvc: we do nothing when input comes
+ * in. */
+static bool do_nothing(void *data)
+{
+	return true;
+}
+
+/*D:370 Once we're further in boot, we get probed like any other virtio device.
+ * At this stage we set up the output virtqueue.
+ *
+ * To set up and manage our virtual console, we call hvc_alloc().  Since we
+ * never remove the console device we never need this pointer again.
+ *
+ * Finally we put our input buffer in the input queue, ready to receive. */
+static void *virtcons_probe(struct device *device,
+			    struct virtio_config_space *config,
+			    struct virtqueue_ops *new_vq_ops)
+{
+	int err;
+	struct hvc_struct *hvc;
+
+	/* This is the scratch page we use to receive console input */
+	inbuf = kmalloc(PAGE_SIZE, GFP_KERNEL);
+	if (!inbuf) {
+		err = -ENOMEM;
+		goto fail;
+	}
+
+	vq_ops = new_vq_ops;
+
+	/* Find the input queue. */
+	in_vq = virtio_config_vq(config, vq_ops, device, do_nothing, NULL);
+	if (IS_ERR(in_vq)) {
+		err = PTR_ERR(in_vq);
+		goto free_out_vq;
+	}
+
+	out_vq = virtio_config_vq(config, vq_ops, device, do_nothing, NULL);
+	if (IS_ERR(out_vq)) {
+		err = PTR_ERR(out_vq);
+		goto free;
+	}
+
+	/* Start using the new console output. */
+	virtio_cons.get_chars = get_chars;
+	virtio_cons.put_chars = put_chars;
+
+	/* The first argument of hvc_alloc() is the virtual console number, so
+	 * we use zero.  The second argument is the interrupt number; we
+	 * currently leave this as zero: it would be better not to use the
+	 * hvc mechanism and fix this (FIXME!).
+	 *
+	 * The third argument is a "struct hv_ops" containing the put_chars()
+	 * and get_chars() pointers.  The final argument is the output buffer
+	 * size: we can do any size, so we put PAGE_SIZE here. */
+	hvc = hvc_alloc(0, 0, &virtio_cons, PAGE_SIZE);
+	if (IS_ERR(hvc)) {
+		err = PTR_ERR(hvc);
+		goto free_in_vq;
+	}
+
+	/* Register the input buffer the first time. */
+	add_inbuf();
+	return NULL;
+
+free_in_vq:
+	new_vq_ops->free_vq(in_vq);
+free_out_vq:
+	new_vq_ops->free_vq(out_vq);
+free:
+	kfree(inbuf);
+fail:
+	return ERR_PTR(err);
+}
+
+static struct pci_device_id id_table[] = {
+	VIRTIO_DEV_ID(VIRTIO_ID_CONSOLE, PCI_CLASS_COMMUNICATION_SERIAL),
+	{ 0 },
+};
+
+static struct virtio_driver virtio_console = {
+	.name =         KBUILD_MODNAME,
+	.owner =        THIS_MODULE,
+	.id_table =     id_table,
+	.probe =        virtcons_probe,
+};
+
+static int __init init(void)
+{
+	return register_virtio_driver(&virtio_console);
+}
+module_init(init);
+
+MODULE_DEVICE_TABLE(pci, id_table);
+MODULE_DESCRIPTION("Virtio console driver");
+MODULE_LICENSE("GPL");
===================================================================
--- /dev/null
+++ b/include/linux/virtio_console.h
@@ -0,0 +1,12 @@
+#ifndef _LINUX_VIRTIO_CONSOLE_H
+#define _LINUX_VIRTIO_CONSOLE_H
+#include <linux/virtio_config.h>
+
+/* The ID for virtio console */
+#define VIRTIO_ID_CONSOLE	3
+
+#ifdef __KERNEL__
+int __init virtio_cons_early_init(int (*put_chars)(u32, const char *, int));
+#endif /* __KERNEL__ */
+
+#endif /* _LINUX_VIRTIO_CONSOLE_H */

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH 4/6] virtio block driver
       [not found]       ` <1190290606.7262.239.camel@localhost.localdomain>
  2007-09-20 12:19         ` [PATCH 5/6] virtio console driver Rusty Russell
@ 2007-09-20 12:27         ` Jens Axboe
       [not found]         ` <1190290761.7262.242.camel@localhost.localdomain>
                           ` (3 subsequent siblings)
  5 siblings, 0 replies; 41+ messages in thread
From: Jens Axboe @ 2007-09-20 12:27 UTC (permalink / raw)
  To: Rusty Russell; +Cc: kvm-devel, lguest, Dor Laor, virtualization

On Thu, Sep 20 2007, Rusty Russell wrote:
> The block driver uses scatter-gather lists with sg[0] being the
> request information (struct virtio_blk_outhdr) with the type, sector
> and inbuf id.  The next N sg entries are the bio itself, then the last
> sg is the status byte.  Whether the N entries are in or out depends on
> whether it's a read or a write.
> 
> We accept the normal (SCSI) ioctls: they get handed through to the other
> side which can then handle it or reply that it's unsupported.  It's
> not clear that this actually works in general, since I don't know
> if blk_pc_request() requests have an accurate rq_data_dir().

They should, if they imply a data transfer.

> Although we try to reply -ENOTTY on unsupported commands, the block
> layer in its infinite wisdom suppressed the error so ioctl(fd,
> CDROMEJECT) returns success to userspace.

How about ever submitting a patch for that, instead of just repeatedly
complaining about it?

-- 
Jens Axboe

^ permalink raw reply	[flat|nested] 41+ messages in thread

* [PATCH 6/6] virtio ring helper
       [not found]         ` <1190290761.7262.242.camel@localhost.localdomain>
@ 2007-09-20 12:27           ` Rusty Russell
       [not found]           ` <1190291234.7262.246.camel@localhost.localdomain>
  1 sibling, 0 replies; 41+ messages in thread
From: Rusty Russell @ 2007-09-20 12:27 UTC (permalink / raw)
  To: kvm-devel; +Cc: lguest, Dor Laor, virtualization

These helper routines supply most of the virtqueue_ops for hypervisors
which want to use a ring for virtio.  Unlike the previous lguest
implementation:

1) The rings are variable sized (2^n-1 elements).
2) They have an unfortunate limit of 65535 bytes per sg element.
3) The page numbers are always 64 bit (PAE anyone?)
4) They no longer place used[] on a separate page, just a separate
   cacheline.
5) We do a modulo on a variable.  We could be tricky if we cared.

Users need only implement the new_vq and free_vq hooks (KVM wants the
guest to allocate the rings, lguest does it sanely).

Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
---
 arch/i386/lguest/Kconfig     |    1 
 drivers/virtio/Kconfig       |    5 
 drivers/virtio/Makefile      |    1 
 drivers/virtio/virtio_ring.c |  301 ++++++++++++++++++++++++++++++++++++++++++
 drivers/virtio/vring.h       |   50 ++++++
 include/linux/virtio_ring.h  |   96 +++++++++++++
 6 files changed, 454 insertions(+)

===================================================================
--- a/arch/i386/lguest/Kconfig
+++ b/arch/i386/lguest/Kconfig
@@ -2,6 +2,7 @@ config LGUEST_GUEST
 	bool "Lguest guest support"
 	select PARAVIRT
 	depends on !X86_PAE
+	select VIRTIO_RING
 	help
 	  Lguest is a tiny in-kernel hypervisor.  Selecting this will
 	  allow your kernel to boot under lguest.  This option will increase
===================================================================
--- a/drivers/virtio/Kconfig
+++ b/drivers/virtio/Kconfig
@@ -1,3 +1,8 @@
 # Virtio always gets selected by whoever wants it.
 config VIRTIO
 	bool
+
+# Similarly the virtio ring implementation.
+config VIRTIO_RING
+	bool
+	depends on VIRTIO
===================================================================
--- a/drivers/virtio/Makefile
+++ b/drivers/virtio/Makefile
@@ -1,1 +1,2 @@ obj-$(CONFIG_VIRTIO) += virtio.o config.
 obj-$(CONFIG_VIRTIO) += virtio.o config.o
+obj-$(CONFIG_VIRTIO_RING) += virtio_ring.o
===================================================================
--- /dev/null
+++ b/drivers/virtio/virtio_ring.c
@@ -0,0 +1,301 @@
+/* Virtio ring implementation.
+ *
+ *  Copyright 2007 Rusty Russell IBM Corporation
+ *
+ *  This program is free software; you can redistribute it and/or modify
+ *  it under the terms of the GNU General Public License as published by
+ *  the Free Software Foundation; either version 2 of the License, or
+ *  (at your option) any later version.
+ *
+ *  This program is distributed in the hope that it will be useful,
+ *  but WITHOUT ANY WARRANTY; without even the implied warranty of
+ *  MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ *  GNU General Public License for more details.
+ *
+ *  You should have received a copy of the GNU General Public License
+ *  along with this program; if not, write to the Free Software
+ *  Foundation, Inc., 51 Franklin St, Fifth Floor, Boston, MA  02110-1301  USA
+ */
+#include <linux/virtio.h>
+#include <linux/virtio_ring.h>
+#include <linux/device.h>
+#include "vring.h"
+
+#ifdef DEBUG
+/* For development, we want to crash whenever the ring is screwed. */
+#define BAD_RING(vvq, fmt...)			\
+	do { dev_err(vvq->dev, fmt); BUG(); } while(0)
+#define START_USE(vvq) \
+	do { if ((vvq)->in_use) panic("in_use = %i\n", (vvq)->in_use); (vvq)->in_use = __LINE__; mb(); } while(0)
+#define END_USE(vvq) \
+	do { BUG_ON(!(vvq)->in_use); (vvq)->in_use = 0; mb(); } while(0)
+#else
+#define BAD_RING(vvq, fmt...)			\
+	do { dev_err(vvq->dev, fmt); (vvq)->broken = true; } while(0)
+#define START_USE(vvq)
+#define END_USE(vvq)
+#endif
+
+struct virtqueue
+{
+	/* Callback for driver when data is consumed. */
+	bool (*callback)(void *);
+	void *cb_data;
+
+	/* Who to blame this on. */
+	struct device *dev;
+
+	/* Actual memory layout for this queue */
+	struct vring vring;
+
+	/* Other side has made a mess, don't try any more. */
+	bool broken;
+
+	/* Number of free buffers */
+	unsigned int num_free;
+	/* Head of free buffer list. */
+	unsigned int free_head;
+	/* Number we've added since last sync. */
+	unsigned int num_added;
+
+	/* Last used index we've seen. */
+	unsigned int last_used_idx;
+
+	/* How to notify other side. FIXME: commonalize hcalls! */
+	void (*notify)(void *priv);
+	void *priv;
+
+	/* Unless they told us to stop */
+	bool running;
+
+#ifdef DEBUG
+	/* They're supposed to lock for us. */
+	unsigned int in_use;
+#endif
+
+	/* Tokens for callbacks. */
+	void *data[];
+};
+
+void *vring_priv(struct virtqueue *vq)
+{
+	return vq->priv;
+}
+
+int vring_add_buf(struct virtqueue *vq,
+		  struct scatterlist sg[],
+		  unsigned int out,
+		  unsigned int in,
+		  void *data)
+{
+	unsigned int i, avail, head, uninitialized_var(prev);
+
+	BUG_ON(data == NULL);
+	BUG_ON(out + in > vq->vring.num);
+	BUG_ON(out + in == 0);
+
+	START_USE(vq);
+
+	if (vq->num_free < out + in) {
+		pr_debug("Can't add buf len %i - avail = %i\n",
+			 out + in, vq->num_free);
+		END_USE(vq);
+		return -ENOSPC;
+	}
+
+	/* We're about to use some buffers from the free list. */
+	vq->num_free -= out + in;
+
+	head = vq->free_head;
+	for (i = vq->free_head; out; i = vq->vring.desc[i].next, out--) {
+		vq->vring.desc[i].flags = VRING_DESC_F_NEXT;
+		vq->vring.desc[i].pfn = page_to_pfn(sg[0].page);
+		BUG_ON(sg[0].offset > 65535);
+		BUG_ON(sg[0].length > 65535);
+		vq->vring.desc[i].offset = sg[0].offset;
+		vq->vring.desc[i].len = sg[0].length;
+		prev = i;
+		sg++;
+	}
+	for (; in; i = vq->vring.desc[i].next, in--) {
+		vq->vring.desc[i].flags=VRING_DESC_F_NEXT|VRING_DESC_F_WRITE;
+		vq->vring.desc[i].pfn = page_to_pfn(sg[0].page);
+		BUG_ON(sg[0].offset > 65535);
+		BUG_ON(sg[0].length > 65535);
+		vq->vring.desc[i].offset = sg[0].offset;
+		vq->vring.desc[i].len = sg[0].length;
+		prev = i;
+		sg++;
+	}
+	/* Last one doesn't continue. */
+	vq->vring.desc[prev].flags &= ~VRING_DESC_F_NEXT;
+
+	/* Update free pointer */
+	vq->free_head = i;
+
+	/* Set token. */
+	vq->data[head] = data;
+
+	/* Put entry in available array (but don't update avail->idx until they
+	 * do sync).  FIXME: avoid modulus here? */
+	avail = (vq->vring.avail->idx + vq->num_added++) % vq->vring.num;
+	vq->vring.avail->ring[avail] = head;
+
+	pr_debug("Added buffer head %i to %p\n", head, vq);
+	END_USE(vq);
+	return 0;
+}
+
+void vring_kick(struct virtqueue *vq)
+{
+	START_USE(vq);
+	/* Descriptors and available array need to be set before we expose the
+	 * new available array entries. */
+	wmb();
+
+	vq->vring.avail->idx += vq->num_added;
+	vq->num_added = 0;
+
+	/* Prod other side to tell it about changes. */
+	vq->notify(vq->priv);
+	END_USE(vq);
+}
+
+static void detach_buf(struct virtqueue *vq, unsigned int head)
+{
+	unsigned int i;
+
+	/* Clear data ptr. */
+	vq->data[head] = NULL;
+
+	/* Put back on free list: find end */
+	i = head;
+	while (vq->vring.desc[i].flags & VRING_DESC_F_NEXT) {
+		i = vq->vring.desc[i].next;
+		vq->num_free++;
+	}
+
+	vq->vring.desc[i].next = vq->free_head;
+	vq->free_head = head;
+	/* Plus final descriptor */
+	vq->num_free++;
+}
+
+/* FIXME: We need to tell other side about removal, to synchronize. */
+void vring_shutdown(struct virtqueue *vq)
+{
+	unsigned int i;
+
+	for (i = 0; i < vq->vring.num; i++)
+		detach_buf(vq, i);
+}
+
+static bool more_used(const struct virtqueue *vq)
+{
+	return vq->last_used_idx != vq->vring.used->idx;
+}
+
+void *vring_get_buf(struct virtqueue *vq, unsigned int *len)
+{
+	void *ret;
+	unsigned int i;
+
+	START_USE(vq);
+
+	if (!more_used(vq)) {
+		pr_debug("No more buffers in queue\n");
+		END_USE(vq);
+		return NULL;
+	}
+
+	i = vq->vring.used->ring[vq->last_used_idx%vq->vring.num].id;
+	*len = vq->vring.used->ring[vq->last_used_idx%vq->vring.num].len;
+
+	if (unlikely(i >= vq->vring.num)) {
+		BAD_RING(vq, "id %u out of range\n", i);
+		return NULL;
+	}
+	if (unlikely(!vq->data[i])) {
+		BAD_RING(vq, "id %u is not a head!\n", i);
+		return NULL;
+	}
+
+	/* detach_buf clears data, so grab it now. */
+	ret = vq->data[i];
+	detach_buf(vq, i);
+	vq->last_used_idx++;
+	END_USE(vq);
+	return ret;
+}
+
+/* FIXME: Put running in shmem */
+bool vring_restart(struct virtqueue *vq)
+{
+	START_USE(vq);
+	BUG_ON(vq->running);
+
+	if (likely(!more_used(vq)) || unlikely(vq->broken))
+		vq->running = true;
+
+	END_USE(vq);
+	return vq->running;
+}
+
+irqreturn_t vring_interrupt(int irq, void *_vq)
+{
+	struct virtqueue *vq = _vq;
+
+	pr_debug("virtqueue interrupt for %p\n", vq);
+
+	if (unlikely(vq->broken))
+		return IRQ_HANDLED;
+
+	if (vq->running && more_used(vq)) {
+		pr_debug("virtqueue callback for %p (%p)\n", vq, vq->callback);
+		vq->running = vq->callback(vq->cb_data);
+	} else
+		pr_debug("virtqueue %p no more used\n", vq);
+
+	return IRQ_HANDLED;
+}
+
+struct virtqueue *vring_new_virtqueue(unsigned int num,
+				      struct device *dev,
+				      void *pages,
+				      void (*notify)(void *priv), void *priv,
+				      bool (*callback)(void *), void *cb_data)
+{
+	struct virtqueue *vq;
+	unsigned int i;
+
+	vq = kmalloc(sizeof(*vq) + sizeof(void *)*num, GFP_KERNEL);
+	if (!vq)
+		return NULL;
+
+	vring_init(&vq->vring, num, pages);
+	vq->callback = callback;
+	vq->cb_data = cb_data;
+	vq->dev = dev;
+	vq->notify = notify;
+	vq->priv = priv;
+	vq->broken = false;
+	vq->last_used_idx = 0;
+	vq->num_added = 0;
+	vq->running = true;
+#ifdef DEBUG
+	vq->in_use = false;
+#endif
+
+	/* Put everything in free lists. */
+	vq->num_free = num;
+	vq->free_head = 0;
+	for (i = 0; i < num-1; i++)
+		vq->vring.desc[i].next = i+1;
+
+	return vq;
+}
+
+void vring_destroy_virtqueue(struct virtqueue *vq)
+{
+	kfree(vq);
+}
===================================================================
--- /dev/null
+++ b/drivers/virtio/vring.h
@@ -0,0 +1,50 @@
+#ifndef _VRING_VRING_H
+#define _VRING_VRING_H
+#include <linux/virtio.h>
+#include <linux/irqreturn.h>
+
+/* A vring consists of a continuous chunk of memory which looks like so:
+ * (use num == 2^n-1 to make it nicely cacheline aligned).
+ *
+ * struct vring
+ * {
+ *	// The actual descriptors.
+ *	struct vring_desc desc[num];
+ *
+ *	// A ring of available descriptor heads with free-running index.
+ *	__u16 avail_idx;
+ *	__u16 available[num];
+ *
+ *	// Padding so a correctly-chosen num value will cache-align used_idx.
+ *	char pad[sizeof(struct vring_desc) - sizeof(__u16)];
+ *
+ *	// A ring of used descriptor heads with free-running index.
+ *	__u16 used_idx;
+ *	struct vring_used_elem used[num];
+ * };
+ */
+struct virtqueue *vring_new_virtqueue(unsigned int num,
+				      struct device *dev,
+				      void *pages,
+				      void (*notify)(void *priv), void *priv,
+				      bool (*callback)(void *), void *cb_data);
+void vring_destroy_virtqueue(struct virtqueue *vq);
+
+irqreturn_t vring_interrupt(int irq, void *_vq);
+
+void *vring_priv(struct virtqueue *vq);
+
+int vring_add_buf(struct virtqueue *vq,
+		  struct scatterlist sg[],
+		  unsigned int out,
+		  unsigned int in,
+		  void *data);
+
+void vring_kick(struct virtqueue *vq);
+
+void *vring_get_buf(struct virtqueue *vq, unsigned int *len);
+
+bool vring_restart(struct virtqueue *vq);
+
+void vring_shutdown(struct virtqueue *vq);
+#endif /* _VRING_VRING_H */
===================================================================
--- /dev/null
+++ b/include/linux/virtio_ring.h
@@ -0,0 +1,96 @@
+#ifndef _LINUX_VIRTIO_RING_H
+#define _LINUX_VIRTIO_RING_H
+/* An interface for efficient virtio implementation, currently for use by KVM
+ * and lguest, but hopefully others soon.  Do NOT change this since it will
+ * break existing servers and clients.
+ *
+ * This header is BSD licensed so anyone can use the definitions to implement
+ * compatible drivers/servers.
+ *
+ * Copyright Rusty Russell IBM Corporation 2007. */
+#include <linux/types.h>
+
+/* This marks a buffer as continuing via the next field. */
+#define VRING_DESC_F_NEXT	1
+/* This marks a buffer as write-only (otherwise read-only). */
+#define VRING_DESC_F_WRITE	2
+
+/* Virtio ring descriptors: 16 bytes.  These can chain together via "next". */
+struct vring_desc
+{
+	/* Page number. */
+	__u64 pfn;
+	/* Length and offset. */
+	__u16 len;
+	__u16 offset;
+	/* The flags as indicated above. */
+	__u16 flags;
+	/* We chain unused descriptors via this, too */
+	__u16 next;
+};
+
+struct vring_avail
+{
+	__u16 idx;
+	__u16 ring[];
+};
+
+/* u32 is used here for ids for padding reasons. */
+struct vring_used_elem
+{
+	/* Index of start of used descriptor chain. */
+	__u32 id;
+	/* Total length of the descriptor chain which was used (written to) */
+	__u32 len;
+};
+
+struct vring_used
+{
+	__u32 idx;
+	struct vring_used_elem ring[];
+};
+
+struct vring {
+	unsigned int num;
+
+	struct vring_desc *desc;
+
+	struct vring_avail *avail;
+
+	struct vring_used *used;
+};
+
+/* The standard layout for the ring is a continuous chunk of memory which looks
+ * like this.  The used fields will be aligned to a "num+1" boundary.
+ *
+ * struct vring
+ * {
+ *	// The actual descriptors (16 bytes each)
+ *	struct vring_desc desc[num];
+ *
+ *	// A ring of available descriptor heads with free-running index.
+ *	__u16 avail_idx;
+ *	__u16 available[num];
+ *
+ *	// Padding so a correctly-chosen num value will cache-align used_idx.
+ *	char pad[sizeof(struct vring_desc)];
+ *
+ *	// A ring of used descriptor heads with free-running index.
+ *	__u32 used_idx;
+ *	struct vring_used_elem used[num];
+ * };
+ */
+static inline void vring_init(struct vring *vr, unsigned int num, void *p)
+{
+	vr->num = num;
+	vr->desc = p;
+	vr->avail = p + num*sizeof(struct vring);
+	vr->used = p + (num+1)*(sizeof(struct vring) + sizeof(__u16));
+}
+
+static inline unsigned vring_size(unsigned int num)
+{
+	return (num + 1) * (sizeof(struct vring_desc) + sizeof(__u16))
+		+ sizeof(__u32) + num * sizeof(struct vring_used_elem);
+}
+#endif /* _LINUX_VIRTIO_RING_H */

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [kvm-devel] [PATCH 1/6] virtio interace
       [not found] ` <1190290140.7262.228.camel@localhost.localdomain>
  2007-09-20 12:12   ` [PATCH 2/6] virtio_config Rusty Russell
@ 2007-09-20 12:27   ` Avi Kivity
       [not found]   ` <1190290369.7262.231.camel@localhost.localdomain>
                     ` (3 subsequent siblings)
  5 siblings, 0 replies; 41+ messages in thread
From: Avi Kivity @ 2007-09-20 12:27 UTC (permalink / raw)
  To: Rusty Russell; +Cc: kvm-devel, lguest, virtualization

Rusty Russell wrote:
> (Changes: 
>  - renamed sync to kick as Dor suggested
>  - added new_vq and free_vq hooks to create virtqueues
>  - define a simple virtio driver, which uses PCI ids
>  - provide register/unregister_virtio_driver hooks)
>
> This attempts to implement a "virtual I/O" layer which should allow
> common drivers to be efficiently used across most virtual I/O
> mechanisms.  It will no-doubt need further enhancement.
>
> The virtio drivers add and get I/O buffers; as the buffers are consumed
> the driver "interrupt" callbacks are invoked.
>
> It also provides driver register and unregister hooks, which are
> simply overridden at run time (eg. for a guest kernel which supports
> KVM paravirt and lguest).
>   

> +++ b/drivers/virtio/virtio.c
> @@ -0,0 +1,20 @@
> +#include <linux/virtio.h>
> +
> +struct virtio_backend_ops virtio_backend_ops;
> +EXPORT_SYMBOL_GPL(virtio_backend_ops);
>   

Suggest calling this virtio_transport_ops rather than the too-generic 
virtio_backend_ops.  Especially since Xen uses backend for something 
completely different.

> +
> +/**
> + * virtqueue_ops - operations for virtqueue abstraction layer
> + * @new_vq: create a new virtqueue
> + *	config: the virtio_config_space field describing the queue
> + *	off: the offset in the config space of the queue configuration
> + *	len: the length of the virtio_config_space field
>   

'off, len' are really a magic cookie.  Why does the interface care about 
their meaning?

> + *	callback: the driver callback when the queue is used.
>   

Missing callback return value description.
> + * @kick: update after add_buf
> + *	vq: the struct virtqueue
> + *	After one or more add_buf calls, invoke this to kick the virtio layer.
>   

'the other side'


I'm not thrilled about reusing pci ids.  Maybe the s390 can say whether 
this is a real issue.

-- 
error compiling committee.c: too many arguments to function

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [kvm-devel] [PATCH 2/6] virtio_config
       [not found]   ` <1190290369.7262.231.camel@localhost.localdomain>
  2007-09-20 12:14     ` [PATCH 3/6] virtio net driver Rusty Russell
@ 2007-09-20 12:36     ` Avi Kivity
       [not found]     ` <46F26958.4080102@qumranet.com>
       [not found]     ` <1190290495.7262.235.camel@localhost.localdomain>
  3 siblings, 0 replies; 41+ messages in thread
From: Avi Kivity @ 2007-09-20 12:36 UTC (permalink / raw)
  To: Rusty Russell; +Cc: kvm-devel, lguest, virtualization

Rusty Russell wrote:
> Previous versions of virtio didn't commonalize probing.  For every
> driver, every virtio implementation (KVM, lguest, etc) needed an
> in-kernel stub to join their bus to the probe code.
>
> To solve this, we introduce a "virtio_config" mechanism, which is
> simply a set of [u8 type][u8 len][...data...] fields for each device.
> Some convenient wrapper functions make this fairly easy for drivers to
> unpack their configuration data themselves.
>
> This configuration data could be PCI config space, or an unrelated bus
> (eg. lguest) or manufactured by the kernel itself.  It's designed to
> be extensible: fields get marked as the device reads them so a host
> supporting some future extension can tell if the guest driver
> understood it.  This also applies to bitfields: the guest explicitly
> acks the bits it understands.
>
> There's also a simple status bitmask useful for low-level host
> analysis of what the guest is doing with the device.
>
>   

Lovely.

-- 
error compiling committee.c: too many arguments to function

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [kvm-devel] [PATCH 6/6] virtio ring helper
       [not found]           ` <1190291234.7262.246.camel@localhost.localdomain>
@ 2007-09-20 12:43             ` Avi Kivity
       [not found]             ` <46F26AF6.60904@qumranet.com>
  1 sibling, 0 replies; 41+ messages in thread
From: Avi Kivity @ 2007-09-20 12:43 UTC (permalink / raw)
  To: Rusty Russell; +Cc: kvm-devel, lguest, virtualization

Rusty Russell wrote:
> These helper routines supply most of the virtqueue_ops for hypervisors
> which want to use a ring for virtio.  Unlike the previous lguest
> implementation:
>
> 3) The page numbers are always 64 bit (PAE anyone?)
>   

32 bits of page numbers give 44 bits of physical address on x86.  That's 
16TB per guest.  Admittedly it's smaller on a VAX.

I don't like the chaining and would prefer the descriptor to refer to an 
array of page descriptors.  However I can't complain since this isn't 
common code.

-- 
error compiling committee.c: too many arguments to function

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [kvm-devel] [PATCH 2/6] virtio_config
       [not found]     ` <46F26958.4080102@qumranet.com>
@ 2007-09-20 12:55       ` Avi Kivity
       [not found]       ` <46F26DC7.9040001@qumranet.com>
  1 sibling, 0 replies; 41+ messages in thread
From: Avi Kivity @ 2007-09-20 12:55 UTC (permalink / raw)
  To: Rusty Russell; +Cc: kvm-devel, lguest, virtualization

Avi Kivity wrote:
> Rusty Russell wrote:
>> Previous versions of virtio didn't commonalize probing.  For every
>> driver, every virtio implementation (KVM, lguest, etc) needed an
>> in-kernel stub to join their bus to the probe code.
>>
>> To solve this, we introduce a "virtio_config" mechanism, which is
>> simply a set of [u8 type][u8 len][...data...] fields for each device.
>> Some convenient wrapper functions make this fairly easy for drivers to
>> unpack their configuration data themselves.
>>
>> This configuration data could be PCI config space, or an unrelated bus
>> (eg. lguest) or manufactured by the kernel itself.  It's designed to
>> be extensible: fields get marked as the device reads them so a host
>> supporting some future extension can tell if the guest driver
>> understood it.  This also applies to bitfields: the guest explicitly
>> acks the bits it understands.
>>
>> There's also a simple status bitmask useful for low-level host
>> analysis of what the guest is doing with the device.
>>
>>   
>
> Lovely.
>

A side effect of this is that Xen drivers can no longer use virtio.


-- 
error compiling committee.c: too many arguments to function

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH 4/6] virtio block driver
       [not found]       ` <1190290606.7262.239.camel@localhost.localdomain>
                           ` (2 preceding siblings ...)
       [not found]         ` <1190290761.7262.242.camel@localhost.localdomain>
@ 2007-09-20 13:05         ` Jens Axboe
       [not found]         ` <20070920130519.GL2367@kernel.dk>
       [not found]         ` <20070920122713.GK2367@kernel.dk>
  5 siblings, 0 replies; 41+ messages in thread
From: Jens Axboe @ 2007-09-20 13:05 UTC (permalink / raw)
  To: Rusty Russell; +Cc: kvm-devel, lguest, Dor Laor, virtualization

On Thu, Sep 20 2007, Rusty Russell wrote:
> +static void end_dequeued_request(struct request *req,
> +				 struct request_queue *q, int uptodate)
> +{
> +	/* And so the insanity of the block layer infects us here. */
> +	int nsectors = req->hard_nr_sectors;
> +
> +	if (blk_pc_request(req)) {
> +		nsectors = (req->data_len + 511) >> 9;
> +		if (!nsectors)
> +			nsectors = 1;
> +	}
> +	if (end_that_request_first(req, uptodate, nsectors))
> +		BUG();
> +	add_disk_randomness(req->rq_disk);
> +	end_that_request_last(req, uptodate);
> +}

We have end_queued_request(), lets add end_dequeued_request(). Below.

> +	vblk->sg[0].page = virt_to_page(&vbr->out_hdr);
> +	vblk->sg[0].offset = offset_in_page(&vbr->out_hdr);
> +	vblk->sg[0].length = sizeof(vbr->out_hdr);
> +	num = blk_rq_map_sg(q, vbr->req, vblk->sg+1);

This wont work for chained sglists, but I gather (I'm so funny) that it
wont be an issue for you. How large are your sglists?

> +		if (!do_req(q, vblk, req)) {
> +			/* Queue full?  Wait. */
> +			blk_stop_queue(q);
> +			break;
> +		}

Personally I think this bool stuff is foul. You return false/true, but
still use ! to test. That is just more confusing than the canonical 0/1
good/bad return imho.


diff --git a/block/ll_rw_blk.c b/block/ll_rw_blk.c
index f6d3994..c1dc23a 100644
--- a/block/ll_rw_blk.c
+++ b/block/ll_rw_blk.c
@@ -3719,15 +3719,24 @@ void end_that_request_last(struct request *req, int uptodate)
 EXPORT_SYMBOL(end_that_request_last);
 
 static inline void __end_request(struct request *rq, int uptodate,
-				 unsigned int nr_bytes)
+				 unsigned int nr_bytes, int dequeue)
 {
 	if (!end_that_request_chunk(rq, uptodate, nr_bytes)) {
-		blkdev_dequeue_request(rq);
+		if (dequeue)
+			blkdev_dequeue_request(rq);
 		add_disk_randomness(rq->rq_disk);
 		end_that_request_last(rq, uptodate);
 	}
 }
 
+static unsigned int rq_byte_size(struct request *rq)
+{
+	if (blk_fs_request(rq))
+		return rq->hard_nr_sectors << 9;
+
+	return rq->data_len;
+}
+
 /**
  * end_queued_request - end all I/O on a queued request
  * @rq:		the request being processed
@@ -3741,18 +3750,29 @@ static inline void __end_request(struct request *rq, int uptodate,
  **/
 void end_queued_request(struct request *rq, int uptodate)
 {
-	unsigned int nr_bytes;
-
-	if (blk_fs_request(rq))
-		nr_bytes = rq->hard_nr_sectors << 9;
-	else
-		nr_bytes = rq->data_len;
-
-	__end_request(rq, uptodate, nr_bytes);
+	__end_request(rq, uptodate, rq_byte_size(rq), 1);
 }
 EXPORT_SYMBOL(end_queued_request);
 
 /**
+ * end_dequeued_request - end all I/O on a dequeued request
+ * @rq:		the request being processed
+ * @uptodate:	error value or 0/1 uptodate flag
+ *
+ * Description:
+ *     Ends all I/O on a request. The request must already have been
+ *     dequeued using blkdev_dequeue_request(), as is normally the case
+ *     for most drivers.
+ *
+ **/
+void end_dequeued_request(struct request *rq, int uptodate)
+{
+	__end_request(rq, uptodate, rq_byte_size(rq), 0);
+}
+EXPORT_SYMBOL(end_dequeued_request);
+
+
+/**
  * end_request - end I/O on the current segment of the request
  * @rq:		the request being processed
  * @uptodate:	error value or 0/1 uptodate flag
@@ -3773,7 +3793,7 @@ EXPORT_SYMBOL(end_queued_request);
  **/
 void end_request(struct request *req, int uptodate)
 {
-	__end_request(req, uptodate, req->hard_cur_sectors << 9);
+	__end_request(req, uptodate, req->hard_cur_sectors << 9, 1);
 }
 EXPORT_SYMBOL(end_request);
 
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index 12ab4d4..4b24265 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -732,6 +732,7 @@ extern int end_that_request_chunk(struct request *, int, int);
 extern void end_that_request_last(struct request *, int);
 extern void end_request(struct request *, int);
 extern void end_queued_request(struct request *, int);
+extern void end_dequeued_request(struct request *, int);
 extern void blk_complete_request(struct request *);
 
 /*

-- 
Jens Axboe

^ permalink raw reply related	[flat|nested] 41+ messages in thread

* Re: [PATCH 0/6] virtio with config abstraction and ring implementation
       [not found] <1190289808.7262.223.camel@localhost.localdomain>
  2007-09-20 12:09 ` [PATCH 1/6] virtio interace Rusty Russell
       [not found] ` <1190290140.7262.228.camel@localhost.localdomain>
@ 2007-09-20 13:43 ` Dor Laor
       [not found] ` <46F2791A.8070601@qumranet.com>
  3 siblings, 0 replies; 41+ messages in thread
From: Dor Laor @ 2007-09-20 13:43 UTC (permalink / raw)
  To: Rusty Russell; +Cc: kvm-devel, lguest, virtualization


[-- Attachment #1.1: Type: text/plain, Size: 1581 bytes --]

Rusty Russell wrote:
>
> Hi all,
>
>         This patch series attempts to come closer to unifying kvm and 
> lguest's
> usage of virtio.  As these two are the first implementations we've seen,
> I hope making them closer will make future ones closer too.
>
>         Drivers now unpack their own configuration: their probe() 
> methods are
> uniform.  The configuration mechanism is extensible and can be backed by
> PCI, a string of bytes, or something else.
>
>         I've abstracted out the lguest ring buffer code into a common 
> library.
> The format has changed slightly (mainly because I had an epiphany about
> inter-guest I/O).
>
>         I also implemented a console (lguest needs one).
>
>         Finally, there is a working lguest implementation.  Unfortunately,
> lguest is being refactored for non-i386 ports, so the virtio patches sit
> at the end of the (quite long) for-2.6.24 patchqueue.  Nonetheless, they
> can be found at http://lguest.ozlabs.org/patches (click on bz2 to get
> the series).
>
> Cheers!
> Rusty.
>
Superb job, it saved me the burden of try to merge the in-house 
virtio_backend.

I like the separation of the ring code, the improved descriptors and the 
notify too.
Regarding the pci config space, I rather see config_ops type of 
operations to let
the 390/xen/other implementations jump on our wagon.

Maybe change the offset/len type into a handle pointer and function 
pointers.
The best would be to let them comment on that. I glimpsed over xen 
netfront.c and I think that
the config space can be used without too many hassles
Dor.


[-- Attachment #1.2: Type: text/html, Size: 2555 bytes --]

[-- Attachment #2: Type: text/plain, Size: 184 bytes --]

_______________________________________________
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/virtualization

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [Lguest] [PATCH 0/6] virtio with config abstraction and ringimplementation
       [not found] ` <46F2791A.8070601@qumranet.com>
@ 2007-09-20 13:50   ` Dor Laor
  2007-09-21  3:20   ` [PATCH 0/6] virtio with config abstraction and ring implementation Rusty Russell
  1 sibling, 0 replies; 41+ messages in thread
From: Dor Laor @ 2007-09-20 13:50 UTC (permalink / raw)
  To: Dor Laor; +Cc: kvm-devel, lguest, virtualization

Dor Laor wrote:
> Rusty Russell wrote:
>>
>> Hi all,
>>
>>         This patch series attempts to come closer to unifying kvm and 
>> lguest's
>> usage of virtio.  As these two are the first implementations we've seen,
>> I hope making them closer will make future ones closer too.
>>
>>         Drivers now unpack their own configuration: their probe() 
>> methods are
>> uniform.  The configuration mechanism is extensible and can be backed by
>> PCI, a string of bytes, or something else.
>>
>>         I've abstracted out the lguest ring buffer code into a common 
>> library.
>> The format has changed slightly (mainly because I had an epiphany about
>> inter-guest I/O).
>>
>>         I also implemented a console (lguest needs one).
>>
>>         Finally, there is a working lguest implementation.  
>> Unfortunately,
>> lguest is being refactored for non-i386 ports, so the virtio patches sit
>> at the end of the (quite long) for-2.6.24 patchqueue.  Nonetheless, they
>> can be found at http://lguest.ozlabs.org/patches (click on bz2 to get
>> the series).
>>
>> Cheers!
>> Rusty.
>>
> Superb job, it saved me the burden of try to merge the in-house 
> virtio_backend.
>
> I like the separation of the ring code, the improved descriptors and 
> the notify too.
> Regarding the pci config space, I rather see config_ops type of 
> operations to let
> the 390/xen/other implementations jump on our wagon.
>
> Maybe change the offset/len type into a handle pointer and function 
> pointers.
> The best would be to let them comment on that. I glimpsed over xen 
> netfront.c and I think that
> the config space can be used without too many hassles
> Dor.
>
Of course I'll rebase what's left of my code over the new patches.

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [kvm-devel] [PATCH 2/6] virtio_config
       [not found]       ` <46F26DC7.9040001@qumranet.com>
@ 2007-09-21  1:50         ` Rusty Russell
       [not found]         ` <1190339435.19451.23.camel@localhost.localdomain>
  1 sibling, 0 replies; 41+ messages in thread
From: Rusty Russell @ 2007-09-21  1:50 UTC (permalink / raw)
  To: Avi Kivity; +Cc: kvm-devel, lguest, virtualization

On Thu, 2007-09-20 at 14:55 +0200, Avi Kivity wrote:
> Avi Kivity wrote:
> > Rusty Russell wrote:
> >> Previous versions of virtio didn't commonalize probing.  For every
> >> driver, every virtio implementation (KVM, lguest, etc) needed an
> >> in-kernel stub to join their bus to the probe code.
> >>
> >> To solve this, we introduce a "virtio_config" mechanism, which is
> >> simply a set of [u8 type][u8 len][...data...] fields for each device.
> >> Some convenient wrapper functions make this fairly easy for drivers to
> >> unpack their configuration data themselves.
> >>
> >> This configuration data could be PCI config space, or an unrelated bus
> >> (eg. lguest) or manufactured by the kernel itself.  It's designed to
> >> be extensible: fields get marked as the device reads them so a host
> >> supporting some future extension can tell if the guest driver
> >> understood it.  This also applies to bitfields: the guest explicitly
> >> acks the bits it understands.
> >>
> >> There's also a simple status bitmask useful for low-level host
> >> analysis of what the guest is doing with the device.
> >>
> >>   
> >
> > Lovely.
> >
> 
> A side effect of this is that Xen drivers can no longer use virtio.

I'm not so sure.

We were always assuming that Xen could do state management in its virtio
layer.  If this is not true, it implies we need hooks in the virtio
drivers, and I don't think we've made it worse by making it translate
xenbus config info into virtio_config.

Rusty.

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [kvm-devel] [PATCH 6/6] virtio ring helper
       [not found]             ` <46F26AF6.60904@qumranet.com>
@ 2007-09-21  2:04               ` Rusty Russell
       [not found]               ` <1190340251.19451.36.camel@localhost.localdomain>
  1 sibling, 0 replies; 41+ messages in thread
From: Rusty Russell @ 2007-09-21  2:04 UTC (permalink / raw)
  To: Avi Kivity; +Cc: kvm-devel, lguest, virtualization

On Thu, 2007-09-20 at 14:43 +0200, Avi Kivity wrote:
> Rusty Russell wrote:
> > These helper routines supply most of the virtqueue_ops for hypervisors
> > which want to use a ring for virtio.  Unlike the previous lguest
> > implementation:
> >
> > 3) The page numbers are always 64 bit (PAE anyone?)
> >   
> 
> 32 bits of page numbers give 44 bits of physical address on x86.  That's 
> 16TB per guest.  Admittedly it's smaller on a VAX.

I like to feel that I make these mistakes to ensure others are paying
attention.  However, it does mean that I can just put an address in
there and increase the length field to 32 bits.  Much rejoicing.

Will fix and resend tomorrow (Friday is in-office-away-from-test-machine
day).

> I don't like the chaining and would prefer the descriptor to refer to an 
> array of page descriptors.  However I can't complain since this isn't 
> common code.

The intent is for kvm to use it.  I'll certainly consider your patches,
although I suspect that managing descriptor pages will make things ugly
enough to cause you to reconsider.

Thanks!
Rusty.

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH 4/6] virtio block driver
       [not found]         ` <20070920130519.GL2367@kernel.dk>
@ 2007-09-21  2:06           ` Rusty Russell
  2007-09-21 11:47             ` Jens Axboe
       [not found]             ` <20070921114703.GN2367@kernel.dk>
  0 siblings, 2 replies; 41+ messages in thread
From: Rusty Russell @ 2007-09-21  2:06 UTC (permalink / raw)
  To: Jens Axboe; +Cc: kvm-devel, lguest, Dor Laor, virtualization

On Thu, 2007-09-20 at 15:05 +0200, Jens Axboe wrote:
> On Thu, Sep 20 2007, Rusty Russell wrote:
> > +static void end_dequeued_request(struct request *req,
> > +				 struct request_queue *q, int uptodate)
> > +{
> > +	/* And so the insanity of the block layer infects us here. */
> > +	int nsectors = req->hard_nr_sectors;
> > +
> > +	if (blk_pc_request(req)) {
> > +		nsectors = (req->data_len + 511) >> 9;
> > +		if (!nsectors)
> > +			nsectors = 1;
> > +	}
> > +	if (end_that_request_first(req, uptodate, nsectors))
> > +		BUG();
> > +	add_disk_randomness(req->rq_disk);
> > +	end_that_request_last(req, uptodate);
> > +}
> 
> We have end_queued_request(), lets add end_dequeued_request(). Below.

OK, thanks, I'll throw that in the mix and test...

> > +	vblk->sg[0].page = virt_to_page(&vbr->out_hdr);
> > +	vblk->sg[0].offset = offset_in_page(&vbr->out_hdr);
> > +	vblk->sg[0].length = sizeof(vbr->out_hdr);
> > +	num = blk_rq_map_sg(q, vbr->req, vblk->sg+1);
> 
> This wont work for chained sglists, but I gather (I'm so funny) that it
> wont be an issue for you. How large are your sglists?

Hmm, potentially extremely large.  What do I need to do for chained
sglists?

> > +		if (!do_req(q, vblk, req)) {
> > +			/* Queue full?  Wait. */
> > +			blk_stop_queue(q);
> > +			break;
> > +		}
> 
> Personally I think this bool stuff is foul. You return false/true, but
> still use ! to test. That is just more confusing than the canonical 0/1
> good/bad return imho.

Except "0/1" is not canonical in the kernel.  Arguably, "-errno/0" is
canonical.  OTOH, bool is clear.

if do_req() fails, we assume the queue is full.  I shall change the
comment to that effect.

Thanks!
Rusty.

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH 0/6] virtio with config abstraction and ring implementation
       [not found] ` <46F2791A.8070601@qumranet.com>
  2007-09-20 13:50   ` [Lguest] [PATCH 0/6] virtio with config abstraction and ringimplementation Dor Laor
@ 2007-09-21  3:20   ` Rusty Russell
  1 sibling, 0 replies; 41+ messages in thread
From: Rusty Russell @ 2007-09-21  3:20 UTC (permalink / raw)
  To: dor.laor; +Cc: kvm-devel, lguest, virtualization

On Thu, 2007-09-20 at 15:43 +0200, Dor Laor wrote:
> Rusty Russell wrote: 

> >         Drivers now unpack their own configuration: their probe() methods are
> > uniform.  The configuration mechanism is extensible and can be backed by
> > PCI, a string of bytes, or something else.

> I like the separation of the ring code, the improved descriptors and
> the notify too.
> Regarding the pci config space, I rather see config_ops type of
> operations to let 
> the 390/xen/other implementations jump on our wagon.

It's possible to abstract at the find_config() level, but it's also not
too bad to linearize any existing configuration.  I chose to change the
lguest device page to use a linearized format, but I could have adapted
the old device info struct in the kernel without too much hassle:

FYI, here's the lguest snippet, for example:

+struct lguest_config {
+	struct virtio_config_space v;
+
+	/* Status pointer: 4 bytes, then comes the config space itself. */
+	u8 *status;
+};
...
+static void lguest_writeb(struct virtio_config_space *v, unsigned off, u8 b)
+{
+	struct lguest_config *c = container_of(v, struct lguest_config, v);
+
+	c->status[4 + off] = b;
+}
+
+static u8 lguest_readb(struct virtio_config_space *v, unsigned off)
+{
+	struct lguest_config *c = container_of(v, struct lguest_config, v);
+
+	return c->status[4 + off];
+}
+
+static void lguest_set_status(struct virtio_config_space *v, u32 status)
+{
+	struct lguest_config *c = container_of(v, struct lguest_config, v);
+
+	memcpy(c->status, &status, sizeof(status));
+}
+
+static u32 lguest_get_status(struct virtio_config_space *v)
+{
+	struct lguest_config *c = container_of(v, struct lguest_config, v);
+	u32 status;
+
+	memcpy(&status, c->status, sizeof(status));
+	return status;
+}

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [kvm-devel] [PATCH 3/6] virtio net driver
       [not found]     ` <1190290495.7262.235.camel@localhost.localdomain>
  2007-09-20 12:16       ` [PATCH 4/6] virtio block driver Rusty Russell
       [not found]       ` <1190290606.7262.239.camel@localhost.localdomain>
@ 2007-09-21 10:48       ` Christian Borntraeger
       [not found]       ` <200709211248.11783.borntraeger@de.ibm.com>
  3 siblings, 0 replies; 41+ messages in thread
From: Christian Borntraeger @ 2007-09-21 10:48 UTC (permalink / raw)
  To: kvm-devel; +Cc: lguest, Herbert Xu, Jeff Garzik, virtualization

Am Donnerstag, 20. September 2007 schrieb Rusty Russell:
> The network driver uses *two* virtqueues: one for input packets and
> one for output packets.  This has nice locking properties (ie. we
> don't do any for recv vs send).
[...]
> 	3) Resolve freeing of old xmit skbs (someone sent patch IIRC?)

Yes, that was me. I am quite busy at the moment but I will send a refreshed 
patch soon. The most annoying fact of my current patch is, that I have to add 
locking. (Because of the only one operation per virtqueue at a time rule)

[...]
> +struct virtnet_info
> +{
> +	struct virtqueue_ops *vq_ops;
> +	struct virtqueue *vq_recv;
> +	struct virtqueue *vq_send;
> +	struct net_device *ndev;

This is only a matter of taste, but I like netdev or dev more than ndev.

[...]

Everything else looks sane.

20-minutes-code-review-by: Christian Borntraeger <borntraeger@de.ibm.com>

Christian

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [kvm-devel] [PATCH 1/6] virtio interace
       [not found]   ` <46F2674A.2050403@qumranet.com>
@ 2007-09-21 11:37     ` Rusty Russell
  0 siblings, 0 replies; 41+ messages in thread
From: Rusty Russell @ 2007-09-21 11:37 UTC (permalink / raw)
  To: Avi Kivity; +Cc: kvm-devel, lguest, virtualization

On Thu, 2007-09-20 at 14:27 +0200, Avi Kivity wrote: 
> Rusty Russell wrote:
> > +struct virtio_backend_ops virtio_backend_ops;
> > +EXPORT_SYMBOL_GPL(virtio_backend_ops);
> 
> Suggest calling this virtio_transport_ops rather than the too-generic 
> virtio_backend_ops.  Especially since Xen uses backend for something 
> completely different.

Hi Avi,

Agreed, fixed.  I actually don't like this interface at all, but it is
simple until we determine something better.

> > +/**
> > + * virtqueue_ops - operations for virtqueue abstraction layer
> > + * @new_vq: create a new virtqueue
> > + *	config: the virtio_config_space field describing the queue
> > + *	off: the offset in the config space of the queue configuration
> > + *	len: the length of the virtio_config_space field
> >   
> 
> 'off, len' are really a magic cookie.  Why does the interface care about 
> their meaning?

off is a cookie, len isn't.  The driver does "get me the foo field" and
it returns off + len.  If it wants to read or write the foo field it
needs that length.

In practice, drivers use the virtio_config_vq() convenience interface
which does "find field, hand to ops->new_vq".

> > + *	callback: the driver callback when the queue is used.
> >   
> 
> Missing callback return value description.

Indeed, that's always made me uncomfortable.  Fixed.  Although it's
possible that an explicit disable hook is better...

> > + * @kick: update after add_buf
> > + *	vq: the struct virtqueue
> > + *	After one or more add_buf calls, invoke this to kick the virtio layer.
> >   
> 
> 'the other side'

Thanks, fixed.

> I'm not thrilled about reusing pci ids.  Maybe the s390 can say whether 
> this is a real issue.

I doubt it: it's not a problem for lguest.  But I wonder if I've chosen
the right fields...

Rusty.

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH 4/6] virtio block driver
  2007-09-21  2:06           ` Rusty Russell
@ 2007-09-21 11:47             ` Jens Axboe
       [not found]             ` <20070921114703.GN2367@kernel.dk>
  1 sibling, 0 replies; 41+ messages in thread
From: Jens Axboe @ 2007-09-21 11:47 UTC (permalink / raw)
  To: Rusty Russell; +Cc: kvm-devel, lguest, Dor Laor, virtualization

On Fri, Sep 21 2007, Rusty Russell wrote:
> On Thu, 2007-09-20 at 15:05 +0200, Jens Axboe wrote:
> > On Thu, Sep 20 2007, Rusty Russell wrote:
> > > +static void end_dequeued_request(struct request *req,
> > > +				 struct request_queue *q, int uptodate)
> > > +{
> > > +	/* And so the insanity of the block layer infects us here. */
> > > +	int nsectors = req->hard_nr_sectors;
> > > +
> > > +	if (blk_pc_request(req)) {
> > > +		nsectors = (req->data_len + 511) >> 9;
> > > +		if (!nsectors)
> > > +			nsectors = 1;
> > > +	}
> > > +	if (end_that_request_first(req, uptodate, nsectors))
> > > +		BUG();
> > > +	add_disk_randomness(req->rq_disk);
> > > +	end_that_request_last(req, uptodate);
> > > +}
> > 
> > We have end_queued_request(), lets add end_dequeued_request(). Below.
> 
> OK, thanks, I'll throw that in the mix and test...

I've queued it for 2.6.24 as well, so should be in mainline in that time
frame.

> > > +	vblk->sg[0].page = virt_to_page(&vbr->out_hdr);
> > > +	vblk->sg[0].offset = offset_in_page(&vbr->out_hdr);
> > > +	vblk->sg[0].length = sizeof(vbr->out_hdr);
> > > +	num = blk_rq_map_sg(q, vbr->req, vblk->sg+1);
> > 
> > This wont work for chained sglists, but I gather (I'm so funny) that it
> > wont be an issue for you. How large are your sglists?
> 
> Hmm, potentially extremely large.  What do I need to do for chained
> sglists?

Not a lot, actually. You snipped the problematic part in your reply,
though:

        vblk->sg[num+1].page = virt_to_page(&vbr->in_hdr);

which assumes that sg is a contigious piece of memory, for chained sg
lists that isn't true. sg chaining will be in 2.6.24, so if you really
do need large sglists, then that's the way to go.

blk_rq_map_sg() maps correctly for you, no changes needed there. But you
want to use sg_last() for adding to the end of the sglist. And then use
sg_next() to retrieve the next sg segment instead of sg + 1, and
for_each_sg() to loop over all segments.

Just something to keep in mind, if you plan on merging this post 2.6.23.

> > > +		if (!do_req(q, vblk, req)) {
> > > +			/* Queue full?  Wait. */
> > > +			blk_stop_queue(q);
> > > +			break;
> > > +		}
> > 
> > Personally I think this bool stuff is foul. You return false/true, but
> > still use ! to test. That is just more confusing than the canonical 0/1
> > good/bad return imho.
> 
> Except "0/1" is not canonical in the kernel.  Arguably, "-errno/0" is
> canonical.  OTOH, bool is clear.

-errno/0, then. 1 is typically used for 'failure without specific error
number' when -Exx doesn't apply. Like the above :-)

But lets just agree to disagree on the bool.

> if do_req() fails, we assume the queue is full.  I shall change the
> comment to that effect.

Thanks!

-- 
Jens Axboe

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [kvm-devel] [PATCH 3/6] virtio net driver
       [not found]       ` <200709211248.11783.borntraeger@de.ibm.com>
@ 2007-09-21 11:53         ` Rusty Russell
       [not found]         ` <1190375615.27805.9.camel@localhost.localdomain>
  1 sibling, 0 replies; 41+ messages in thread
From: Rusty Russell @ 2007-09-21 11:53 UTC (permalink / raw)
  To: Christian Borntraeger
  Cc: kvm-devel, lguest, Herbert Xu, Jeff Garzik, virtualization

On Fri, 2007-09-21 at 12:48 +0200, Christian Borntraeger wrote:
> Am Donnerstag, 20. September 2007 schrieb Rusty Russell:
> > The network driver uses *two* virtqueues: one for input packets and
> > one for output packets.  This has nice locking properties (ie. we
> > don't do any for recv vs send).
> [...]
> > 	3) Resolve freeing of old xmit skbs (someone sent patch IIRC?)
> 
> Yes, that was me. I am quite busy at the moment but I will send a refreshed 
> patch soon. The most annoying fact of my current patch is, that I have to add 
> locking. (Because of the only one operation per virtqueue at a time rule)

Sorry Christian, I thought it was you but was in a hurry to send out the
patches so didn't go back and check.

Can't we just re-use the default xmit lock?

> [...]
> > +struct virtnet_info
> > +{
> > +	struct virtqueue_ops *vq_ops;
> > +	struct virtqueue *vq_recv;
> > +	struct virtqueue *vq_send;
> > +	struct net_device *ndev;
> 
> This is only a matter of taste, but I like netdev or dev more than ndev.

Yeah, I agreed.  It was a moment of weakness: I've renamed it to "dev".

Thanks!
Rusty.

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH 4/6] virtio block driver
       [not found]         ` <20070920122713.GK2367@kernel.dk>
@ 2007-09-21 12:00           ` Rusty Russell
       [not found]           ` <1190376007.27805.19.camel@localhost.localdomain>
  1 sibling, 0 replies; 41+ messages in thread
From: Rusty Russell @ 2007-09-21 12:00 UTC (permalink / raw)
  To: Jens Axboe; +Cc: kvm-devel, lguest, Dor Laor, virtualization

On Thu, 2007-09-20 at 14:27 +0200, Jens Axboe wrote:
> On Thu, Sep 20 2007, Rusty Russell wrote:
> > The block driver uses scatter-gather lists with sg[0] being the
> > request information (struct virtio_blk_outhdr) with the type, sector
> > and inbuf id.  The next N sg entries are the bio itself, then the last
> > sg is the status byte.  Whether the N entries are in or out depends on
> > whether it's a read or a write.
> > 
> > We accept the normal (SCSI) ioctls: they get handed through to the other
> > side which can then handle it or reply that it's unsupported.  It's
> > not clear that this actually works in general, since I don't know
> > if blk_pc_request() requests have an accurate rq_data_dir().
> 
> They should, if they imply a data transfer.

OK, good.  We rely on that to mark input vs output sg elements.  I was
wondering if there was some weird requests which did both input and
output.

> > Although we try to reply -ENOTTY on unsupported commands, the block
> > layer in its infinite wisdom suppressed the error so ioctl(fd,
> > CDROMEJECT) returns success to userspace.
> 
> How about ever submitting a patch for that, instead of just repeatedly
> complaining about it?

To be fair, I've left the complaint in that same patch, you're just
reading it repeatedly 8)

I shall look through the code and see if I can figure out how to fix it.
I'm assuming from your response that there's not some strange reason to
preserve current behaviour.

Thanks!
Rusty.

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH 1/6] virtio interace
       [not found] ` <1190290140.7262.228.camel@localhost.localdomain>
                     ` (3 preceding siblings ...)
       [not found]   ` <46F2674A.2050403@qumranet.com>
@ 2007-09-21 12:05   ` Arnd Bergmann
       [not found]   ` <200709211405.32116.arnd@arndb.de>
  5 siblings, 0 replies; 41+ messages in thread
From: Arnd Bergmann @ 2007-09-21 12:05 UTC (permalink / raw)
  To: virtualization; +Cc: kvm-devel, lguest, Dor Laor

On Thursday 20 September 2007, Rusty Russell wrote:
> + * virtio_driver - operations for a virtio I/O driver
> + * @name: the name of the driver (KBUILD_MODNAME).
> + * @owner: the module which contains these routines (ie. THIS_MODULE).
> + * @id_table: the ids (we re-use PCI ids) serviced by this driver.
> + * @probe: the function to call when a device is found.  Returns a token for
> + *    remove, or PTR_ERR().
> + * @remove: the function when a device is removed.
> + */
> +struct virtio_driver {
> +       const char *name;
> +       struct module *owner;
> +       struct pci_device_id *id_table;
> +       void *(*probe)(struct device *device,
> +                      struct virtio_config_space *config,
> +                      struct virtqueue_ops *vqops);
> +       void (*remove)(void *dev);
> +};
> +
> +int register_virtio_driver(struct virtio_driver *drv);
> +void unregister_virtio_driver(struct virtio_driver *drv);
> +
> +/* The particular virtio backend supplies these. */
> +struct virtio_backend_ops {
> +       int (*register_driver)(struct virtio_driver *drv);
> +       void (*unregister_driver)(struct virtio_driver *drv);
> +};
> +extern struct virtio_backend_ops virtio_backend_ops;

This still seems a little awkward. From what I understand, you register
a virtio_driver, which leads to a pci_driver (or whatever you are based on)
to be registered behind the covers, so that the pci_device can be used
directly as the virtio device.

I think there should instead be a pci_driver that automatically binds
to all PCI based virtio imlpementations and creates a child device for
the actual virtio_device. Then you can have the virtio_driver itself
be based on a device_driver, and you can get rid of the global
virtio_backend_ops. That will be useful when a virtual machine
has two ways to get at the virtio devices, e.g. a KVM guest that
has both hcall based probing for virtio devices and some other virtio
devices that are exported through PCI.

	Arnd <><

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH 4/6] virtio block driver
       [not found]           ` <1190376007.27805.19.camel@localhost.localdomain>
@ 2007-09-21 12:27             ` Jens Axboe
       [not found]             ` <20070921122746.GO2367@kernel.dk>
  1 sibling, 0 replies; 41+ messages in thread
From: Jens Axboe @ 2007-09-21 12:27 UTC (permalink / raw)
  To: Rusty Russell; +Cc: kvm-devel, lguest, Dor Laor, virtualization

On Fri, Sep 21 2007, Rusty Russell wrote:
> On Thu, 2007-09-20 at 14:27 +0200, Jens Axboe wrote:
> > On Thu, Sep 20 2007, Rusty Russell wrote:
> > > The block driver uses scatter-gather lists with sg[0] being the
> > > request information (struct virtio_blk_outhdr) with the type, sector
> > > and inbuf id.  The next N sg entries are the bio itself, then the last
> > > sg is the status byte.  Whether the N entries are in or out depends on
> > > whether it's a read or a write.
> > > 
> > > We accept the normal (SCSI) ioctls: they get handed through to the other
> > > side which can then handle it or reply that it's unsupported.  It's
> > > not clear that this actually works in general, since I don't know
> > > if blk_pc_request() requests have an accurate rq_data_dir().
> > 
> > They should, if they imply a data transfer.
> 
> OK, good.  We rely on that to mark input vs output sg elements.  I was
> wondering if there was some weird requests which did both input and
> output.

There very soon will be bidirectional requests with both in and out
elements, but they will be clearly marked as such (and the driver needs
to be marked capable). So nothing to worry about for now.

> > > Although we try to reply -ENOTTY on unsupported commands, the block
> > > layer in its infinite wisdom suppressed the error so ioctl(fd,
> > > CDROMEJECT) returns success to userspace.
> > 
> > How about ever submitting a patch for that, instead of just repeatedly
> > complaining about it?
> 
> To be fair, I've left the complaint in that same patch, you're just
> reading it repeatedly 8)

That may be the case :-)

> I shall look through the code and see if I can figure out how to fix it.
> I'm assuming from your response that there's not some strange reason to
> preserve current behaviour.

It surely sounds like a bug, if you issue ioctl(fd, CDROMEJECT), the
driver sees it and returns -ENOTTY, but userland sees a 0 retval. So if
you have time, please do poke at it a bit.

-- 
Jens Axboe

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH 4/6] virtio block driver
       [not found]             ` <20070921114703.GN2367@kernel.dk>
@ 2007-09-21 12:30               ` Rusty Russell
  0 siblings, 0 replies; 41+ messages in thread
From: Rusty Russell @ 2007-09-21 12:30 UTC (permalink / raw)
  To: Jens Axboe; +Cc: kvm-devel, lguest, Dor Laor, virtualization

On Fri, 2007-09-21 at 13:47 +0200, Jens Axboe wrote:
> On Fri, Sep 21 2007, Rusty Russell wrote:
> > On Thu, 2007-09-20 at 15:05 +0200, Jens Axboe wrote:
> > > We have end_queued_request(), lets add end_dequeued_request(). Below.
> > 
> > OK, thanks, I'll throw that in the mix and test...
> 
> I've queued it for 2.6.24 as well, so should be in mainline in that time
> frame.

OK, I'll sit it in my queue so my patches work until then.

> > > > +	vblk->sg[0].page = virt_to_page(&vbr->out_hdr);
> > > > +	vblk->sg[0].offset = offset_in_page(&vbr->out_hdr);
> > > > +	vblk->sg[0].length = sizeof(vbr->out_hdr);
> > > > +	num = blk_rq_map_sg(q, vbr->req, vblk->sg+1);
> > > 
> > > This wont work for chained sglists, but I gather (I'm so funny) that it
> > > wont be an issue for you. How large are your sglists?
> > 
> > Hmm, potentially extremely large.  What do I need to do for chained
> > sglists?
> 
> Not a lot, actually. You snipped the problematic part in your reply,
> though:
> 
>         vblk->sg[num+1].page = virt_to_page(&vbr->in_hdr);
> 
> which assumes that sg is a contigious piece of memory, for chained sg
> lists that isn't true. sg chaining will be in 2.6.24, so if you really
> do need large sglists, then that's the way to go.

I'm not sure if I need large sglists here.  Obviously I want to handle
anything the block layer can give to me (assume you're going to increase
MAX_PHYS_SEGMENTS soon?).  This might make "vblk" too big to kmalloc
easily:

        struct virtio_blk
        {
        ...
        	/* Scatterlist: can be too big for stack. */
        	struct scatterlist sg[3+MAX_PHYS_SEGMENTS];
        };

The sglist for virtio is really a scratch space to represent the request
buffers (with one element for metadata at start and one at end) which we
hand through to the virtio layer which turns it into its internal
descriptors for the host to read.

Have you thought of putting an sglist inside the req itself?  Then
instead of blk_rq_map_sg() it'd just be req->sg?  It could just be a
chain of sg's in each bio I guess.

This would allow me to just to borrow that request sg chain, rather than
doing the req -> sg -> virtio double-conversion.

> > Except "0/1" is not canonical in the kernel.  Arguably, "-errno/0" is
> > canonical.  OTOH, bool is clear.
> 
> -errno/0, then. 1 is typically used for 'failure without specific error
> number' when -Exx doesn't apply. Like the above :-)

See, I'd be absolutely sure that >= 0 is success (like all syscalls),
and I'd be appalled by code which used it as failure.

So I think you're right that we should agree to disagree :)

Thanks!
Rusty.

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [kvm-devel] [PATCH 3/6] virtio net driver
       [not found]         ` <1190375615.27805.9.camel@localhost.localdomain>
@ 2007-09-21 12:36           ` Christian Borntraeger
       [not found]           ` <200709211436.43964.borntraeger@de.ibm.com>
  1 sibling, 0 replies; 41+ messages in thread
From: Christian Borntraeger @ 2007-09-21 12:36 UTC (permalink / raw)
  To: Rusty Russell; +Cc: kvm-devel, lguest, Herbert Xu, Jeff Garzik, virtualization

Am Freitag, 21. September 2007 schrieb Rusty Russell:
> Can't we just re-use the default xmit lock?

yes we can.

This patch is untested but is basically an refresh of an ealier version. I 
currently have no code testable with the latest virtio version from this mail 
thread, so if you could apply this (It still has the ndev name) and test the 
patch? Thanks

Christian
---

Use the sent interrupt for xmit reclaim.

---
 drivers/net/virtio_net.c |   43 ++++++++++++++++++++++++++-----------------
 1 file changed, 26 insertions(+), 17 deletions(-)

Index: virtio/drivers/net/virtio_net.c
===================================================================
--- virtio.orig/drivers/net/virtio_net.c
+++ virtio/drivers/net/virtio_net.c
@@ -52,10 +52,29 @@ static inline void vnet_hdr_to_sg(struct
 	sg_init_one(sg, skb_vnet_hdr(skb), sizeof(struct virtio_net_hdr));
 }
 
+static void free_old_xmit_skbs(struct net_device *dev)
+{
+	struct virtnet_info *vi = netdev_priv(dev);
+	struct sk_buff *skb;
+	unsigned int len;
+
+	netif_tx_lock(dev);
+	while ((skb = vi->vq_ops->get_buf(vi->vq_send, &len)) != NULL) {
+		pr_debug("Sent skb %p\n", skb);
+		__skb_unlink(skb, &vi->send);
+		dev->stats.tx_bytes += len;
+		dev->stats.tx_packets++;
+		dev_kfree_skb_irq(skb);
+	}
+	netif_tx_unlock(dev);
+}
+
 static bool skb_xmit_done(void *ndev)
 {
 	/* In case we were waiting for output buffers. */
 	netif_wake_queue(ndev);
+	/* reclaim sent skbs */
+	fre_old_xmit_skbs(ndev);
 	return true;
 }
 
@@ -209,20 +228,6 @@ again:
 	return 0;
 }
 
-static void free_old_xmit_skbs(struct virtnet_info *vi)
-{
-	struct sk_buff *skb;
-	unsigned int len;
-
-	while ((skb = vi->vq_ops->get_buf(vi->vq_send, &len)) != NULL) {
-		pr_debug("Sent skb %p\n", skb);
-		__skb_unlink(skb, &vi->send);
-		vi->ndev->stats.tx_bytes += len;
-		vi->ndev->stats.tx_packets++;
-		kfree_skb(skb);
-	}
-}
-
 static int start_xmit(struct sk_buff *skb, struct net_device *dev)
 {
 	struct virtnet_info *vi = netdev_priv(dev);
@@ -235,8 +240,6 @@ static int start_xmit(struct sk_buff *sk
 		 dev->name, skb,
 		 dest[0], dest[1], dest[2], dest[3], dest[4], dest[5]);
 
-	free_old_xmit_skbs(vi);
-
 	/* Encode metadata header at front. */
 	hdr = skb_vnet_hdr(skb);
 	if (skb->ip_summed == CHECKSUM_PARTIAL) {
@@ -267,15 +270,21 @@ static int start_xmit(struct sk_buff *sk
 
 	vnet_hdr_to_sg(sg, skb);
 	num = skb_to_sgvec(skb, sg+1, 0, skb->len) + 1;
+	local_irq_disable();
+	netif_tx_lock(dev);
 	__skb_queue_head(&vi->send, skb);
 	err = vi->vq_ops->add_buf(vi->vq_send, sg, num, 0, skb);
 	if (err) {
 		pr_debug("%s: virtio not prepared to send\n", dev->name);
 		skb_unlink(skb, &vi->send);
 		netif_stop_queue(dev);
+		netif_tx_unlock(dev);
+		local_irq_enable();
 		return NETDEV_TX_BUSY;
 	}
 	vi->vq_ops->kick(vi->vq_send);
+	netif_tx_unlock(dev);
+	local_irq_enable();
 
 	return 0;
 }
@@ -335,7 +344,7 @@ static void *virtnet_probe(struct device
 	dev->poll = virtnet_poll;
 	dev->hard_start_xmit = start_xmit;
 	dev->weight = 16;
-	dev->features = NETIF_F_HIGHDMA;
+	dev->features = NETIF_F_HIGHDMA | NETIF_F_LLTX;
 	SET_NETDEV_DEV(dev, device);
 
 	/* Do we support "hardware" checksums? */

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH 1/6] virtio interace
       [not found]   ` <200709211405.32116.arnd@arndb.de>
@ 2007-09-21 12:45     ` Rusty Russell
       [not found]     ` <1190378736.27805.54.camel@localhost.localdomain>
  1 sibling, 0 replies; 41+ messages in thread
From: Rusty Russell @ 2007-09-21 12:45 UTC (permalink / raw)
  To: Arnd Bergmann; +Cc: kvm-devel, lguest, Dor Laor, virtualization

On Fri, 2007-09-21 at 14:05 +0200, Arnd Bergmann wrote:
> On Thursday 20 September 2007, Rusty Russell wrote:
> > +int register_virtio_driver(struct virtio_driver *drv);
> > +void unregister_virtio_driver(struct virtio_driver *drv);
> > +
> > +/* The particular virtio backend supplies these. */
> > +struct virtio_backend_ops {
> > +       int (*register_driver)(struct virtio_driver *drv);
> > +       void (*unregister_driver)(struct virtio_driver *drv);
> > +};
> > +extern struct virtio_backend_ops virtio_backend_ops;
> 
> This still seems a little awkward. From what I understand, you register
> a virtio_driver, which leads to a pci_driver (or whatever you are based on)
> to be registered behind the covers, so that the pci_device can be used
> directly as the virtio device.

Hi Arnd,

	Yes, and I dislike it too. 

> I think there should instead be a pci_driver that automatically binds
> to all PCI based virtio imlpementations and creates a child device for
> the actual virtio_device.

I'm not sure I understand.  For PCI probing to work, you want to have
identified yourself as a driver for each PCI id claimed by a virtio
device.

Hmm, I guess we could have a PCI driver which claims all VIRTIO vendor
devices.  Then it can call virtio_find_driver() (?) at the top of its
probe function to find if there's a matching virtio driver.  This PCI
driver would have to be initialized after all the virtio drivers are
registered, but that's easy.

The virtio layer would simply maintain a linked list of drivers and
implement the virtio_find_driver() matching function.

And since we've suppressed normal PCI driver request_module (since it
always finds "the driver") then we can implement that in
virtio_find_driver(), and not use a PCI MODULE_DEVICE_TABLE.  Then we
don't need (full) PCI ids at all.

OK, I have some coding to do now...
Rusty.

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [kvm-devel] [PATCH 3/6] virtio net driver
       [not found]           ` <200709211436.43964.borntraeger@de.ibm.com>
@ 2007-09-21 14:08             ` Herbert Xu
       [not found]             ` <20070921140833.GA12242@gondor.apana.org.au>
  1 sibling, 0 replies; 41+ messages in thread
From: Herbert Xu @ 2007-09-21 14:08 UTC (permalink / raw)
  To: Christian Borntraeger; +Cc: kvm-devel, lguest, Jeff Garzik, virtualization

On Fri, Sep 21, 2007 at 02:36:43PM +0200, Christian Borntraeger wrote:
>
> @@ -335,7 +344,7 @@ static void *virtnet_probe(struct device
>  	dev->poll = virtnet_poll;
>  	dev->hard_start_xmit = start_xmit;
>  	dev->weight = 16;
> -	dev->features = NETIF_F_HIGHDMA;
> +	dev->features = NETIF_F_HIGHDMA | NETIF_F_LLTX;
>  	SET_NETDEV_DEV(dev, device);
>  
>  	/* Do we support "hardware" checksums? */

Please don't use LLTX in new drivers.  We're trying to get rid
of it since it's

1) unnecessary;
2) causes problems with AF_PACKET seeing things twice.

Thanks,
-- 
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH 1/6] virtio interace
       [not found]     ` <1190378736.27805.54.camel@localhost.localdomain>
@ 2007-09-21 14:22       ` Arnd Bergmann
       [not found]       ` <200709211622.22005.arnd@arndb.de>
  1 sibling, 0 replies; 41+ messages in thread
From: Arnd Bergmann @ 2007-09-21 14:22 UTC (permalink / raw)
  To: virtualization; +Cc: kvm-devel, lguest, Dor Laor

On Friday 21 September 2007, Rusty Russell wrote:
> Hmm, I guess we could have a PCI driver which claims all VIRTIO vendor
> devices.  

yes, that was the idea.

> Then it can call virtio_find_driver() (?) at the top of its 
> probe function to find if there's a matching virtio driver.  
> This PCI  driver would have to be initialized after all the virtio
> drivers are registered, but that's easy.

No, just use the driver model, instead of working against it:

struct pci_virtio_device {
	struct pci_dev *pdev;
	char __iomem *mmio_space;
	struct virtio_device vdev;
};

static int __devinit pci_virtio_probe(struct pci_dev *pdev,
				const struct pci_device_id *ent)
{
	struct pci_virtio_device *dev = kzalloc(sizeof (*dev), GFP_KERNEL);
	dev->pdev = pdev;
	dev->mmio_space = pcim_iomap(pdev, 0, PCI_VIRTIO_BUFSIZE);
	dev->vdev->ops = &pci_virtqueue_ops;
	dev->vdev->config = &pci_virtio_config_ops;
	dev->vdev->type = ent->device;
	dev->vdev->class = ent->class;
	dev->vdev.dev.parent = &pdev->dev;

	return virtio_device_register(&dev->vdev;
}

> The virtio layer would simply maintain a linked list of drivers and
> implement the virtio_find_driver() matching function.

nonono, just a virtio_bus that all virtio drivers register to:

static int virtio_net_probe(struct device *dev)
{
	struct virtio_device *vdev = to_virtio_dev(dev);
	struct virtqueue_ops *vq_ops = vdev->ops;

	/* same as current code */
	...

	return 0;
}

static struct virtio_device_id virtio_net_ids[] = {
	{ .type = VIRTIO_ID_NET, .class = PCI_CLASS_NETWORK_OTHER },
	{ },
};

static struct virtio_driver virtio_net = {
	.id_table = &virtio_net_ids,
	.driver = {
		.name = "virtionet",
		.probe = virtio_net_probe,
		.remove = virtionet_remove,
		.bus = &virtio_bus,        /* <- look here */
	},
};

static int __init virtio_net_init(void)
{
	return driver_register(&virtio_net.driver);
}
module_init(virtio_net_init);

> And since we've suppressed normal PCI driver request_module (since it
> always finds "the driver") then we can implement that in
> virtio_find_driver(), and not use a PCI MODULE_DEVICE_TABLE.  Then we
> don't need (full) PCI ids at all.

right, as shown in my example above.

	Arnd <><

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [kvm-devel] [PATCH 3/6] virtio net driver
       [not found]             ` <20070921140833.GA12242@gondor.apana.org.au>
@ 2007-09-21 14:42               ` Christian Borntraeger
       [not found]               ` <200709211642.25208.borntraeger@de.ibm.com>
  1 sibling, 0 replies; 41+ messages in thread
From: Christian Borntraeger @ 2007-09-21 14:42 UTC (permalink / raw)
  To: kvm-devel; +Cc: lguest, Herbert Xu, Jeff Garzik, virtualization

Am Freitag, 21. September 2007 schrieb Herbert Xu:
> Please don't use LLTX in new drivers.  We're trying to get rid
> of it since it's
> 
> 1) unnecessary;
> 2) causes problems with AF_PACKET seeing things twice.

Ok, but then I cannot reuse the xmit lock in an interrupt handler. Otherwise 
deadlock can occur because hard_start_xmit has interrupts enabled. 

Rusty, seems we need an additional lock.

Christian

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH 1/6] virtio interace
       [not found]       ` <200709211622.22005.arnd@arndb.de>
@ 2007-09-22  9:55         ` Rusty Russell
  2007-09-22 10:01           ` Arnd Bergmann
       [not found]           ` <200709221201.33865.arnd@arndb.de>
  0 siblings, 2 replies; 41+ messages in thread
From: Rusty Russell @ 2007-09-22  9:55 UTC (permalink / raw)
  To: Arnd Bergmann; +Cc: kvm-devel, lguest, Dor Laor, virtualization

On Fri, 2007-09-21 at 16:22 +0200, Arnd Bergmann wrote:
> On Friday 21 September 2007, Rusty Russell wrote:
> > Hmm, I guess we could have a PCI driver which claims all VIRTIO vendor
> > devices.  
> 
> yes, that was the idea.
> 
> > Then it can call virtio_find_driver() (?) at the top of its 
> > probe function to find if there's a matching virtio driver.  
> > This PCI  driver would have to be initialized after all the virtio
> > drivers are registered, but that's easy.
> 
> No, just use the driver model, instead of working against it:

But now each virtio device has two "struct device"s, not one.   And
you've made up a fictional bus to do it.

Yet for PCI systems, it really is a PCI device; exposing a second bus to
userspace just because we put a layer in our implementation is poor
form.

Perhaps this is the easiest way of doing it.  But it's still wrong.

Rusty.

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH 1/6] virtio interace
  2007-09-22  9:55         ` Rusty Russell
@ 2007-09-22 10:01           ` Arnd Bergmann
       [not found]           ` <200709221201.33865.arnd@arndb.de>
  1 sibling, 0 replies; 41+ messages in thread
From: Arnd Bergmann @ 2007-09-22 10:01 UTC (permalink / raw)
  To: Rusty Russell; +Cc: kvm-devel, lguest, Dor Laor, virtualization

On Saturday 22 September 2007, Rusty Russell wrote:
> But now each virtio device has two "struct device"s, not one.   And
> you've made up a fictional bus to do it.
> 
> Yet for PCI systems, it really is a PCI device; exposing a second bus to
> userspace just because we put a layer in our implementation is poor
> form.
> 
> Perhaps this is the easiest way of doing it.  But it's still wrong.

I think it's just a matter of perspective. In the model I'm advocating,
the PCI device is not the same as the virtio device but rather a virtio
host bridge, very much like USB or SATA works.

We could easily have multiple virtio devices behind one PCI device, but
since virtual PCI devices are cheap, a one-to-one mapping makes sense
for simplicity.

Having two struct devices for an actual device is the price to pay for
this, but I do think that creating a bus_type for virtio is the logical
abstraction and not something we just make up to work around a
deficiency in our device reprentation.

	Arnd <><

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [kvm-devel] [PATCH 2/6] virtio_config
       [not found]         ` <1190339435.19451.23.camel@localhost.localdomain>
@ 2007-09-22 13:03           ` Avi Kivity
  0 siblings, 0 replies; 41+ messages in thread
From: Avi Kivity @ 2007-09-22 13:03 UTC (permalink / raw)
  To: Rusty Russell; +Cc: kvm-devel, lguest, virtualization

Rusty Russell wrote:
>>
>> A side effect of this is that Xen drivers can no longer use virtio.
>>     
>
> I'm not so sure.
>
> We were always assuming that Xen could do state management in its virtio
> layer.  If this is not true, it implies we need hooks in the virtio
> drivers, and I don't think we've made it worse by making it translate
> xenbus config info into virtio_config.
>   

This means writing a low-level ABI in terms of a high level API.  And
while that describes virtualization fairly will, I prefer not to do this
more than strictly necessary.

However, this can be rewritten if/when Xen or VMware show interest in
porting their drivers to virtio.

-- 
Do not meddle in the internals of kernels, for they are subtle and quick to panic.

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH 4/6] virtio block driver
       [not found]             ` <20070921122746.GO2367@kernel.dk>
@ 2007-09-23  6:47               ` Rusty Russell
  0 siblings, 0 replies; 41+ messages in thread
From: Rusty Russell @ 2007-09-23  6:47 UTC (permalink / raw)
  To: Jens Axboe; +Cc: kvm-devel, lguest, Dor Laor, virtualization

On Fri, 2007-09-21 at 14:27 +0200, Jens Axboe wrote:
> On Fri, Sep 21 2007, Rusty Russell wrote:
> > I shall look through the code and see if I can figure out how to fix it.
> > I'm assuming from your response that there's not some strange reason to
> > preserve current behaviour.
> 
> It surely sounds like a bug, if you issue ioctl(fd, CDROMEJECT), the
> driver sees it and returns -ENOTTY, but userland sees a 0 retval. So if
> you have time, please do poke at it a bit.

OK, error (-ENOTTY) gets to blk_end_sync_rq(), but nothing happens to
it.  This patch makes the ioctl return -EIO: I can't see a way to pass
the errno back properly.

diff -r 99e125262d6a block/ll_rw_blk.c
--- a/block/ll_rw_blk.c	Sat Sep 22 15:37:22 2007 +1000
+++ b/block/ll_rw_blk.c	Sun Sep 23 16:43:42 2007 +1000
@@ -2792,6 +2792,9 @@ void blk_end_sync_rq(struct request *rq,
 	rq->end_io_data = NULL;
 	__blk_put_request(rq->q, rq);
 
+	if (error)
+		rq->errors++;
+
 	/*
 	 * complete last, if this is a stack request the process (and thus
 	 * the rq pointer) could be invalid right after this complete()

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [kvm-devel] [PATCH 3/6] virtio net driver
       [not found]               ` <200709211642.25208.borntraeger@de.ibm.com>
@ 2007-09-23  7:13                 ` Rusty Russell
  0 siblings, 0 replies; 41+ messages in thread
From: Rusty Russell @ 2007-09-23  7:13 UTC (permalink / raw)
  To: Christian Borntraeger
  Cc: kvm-devel, lguest, Herbert Xu, Jeff Garzik, virtualization

On Fri, 2007-09-21 at 16:42 +0200, Christian Borntraeger wrote:
> Am Freitag, 21. September 2007 schrieb Herbert Xu:
> > Please don't use LLTX in new drivers.  We're trying to get rid
> > of it since it's
> > 
> > 1) unnecessary;
> > 2) causes problems with AF_PACKET seeing things twice.
> 
> Ok, but then I cannot reuse the xmit lock in an interrupt handler. Otherwise 
> deadlock can occur because hard_start_xmit has interrupts enabled. 
> 
> Rusty, seems we need an additional lock.

Can we use a tasklet?  Seems like it'd be safe at a glance...

Thanks,
Rusty.

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH 1/6] virtio interace
       [not found]           ` <200709221201.33865.arnd@arndb.de>
@ 2007-09-23  8:33             ` Rusty Russell
       [not found]             ` <1190536431.27805.109.camel@localhost.localdomain>
  1 sibling, 0 replies; 41+ messages in thread
From: Rusty Russell @ 2007-09-23  8:33 UTC (permalink / raw)
  To: Arnd Bergmann; +Cc: kvm-devel, lguest, Dor Laor, virtualization

On Sat, 2007-09-22 at 12:01 +0200, Arnd Bergmann wrote:
> On Saturday 22 September 2007, Rusty Russell wrote:
> > But now each virtio device has two "struct device"s, not one.   And
> > you've made up a fictional bus to do it.
> > 
> > Yet for PCI systems, it really is a PCI device; exposing a second bus to
> > userspace just because we put a layer in our implementation is poor
> > form.
> > 
> > Perhaps this is the easiest way of doing it.  But it's still wrong.
> 
> I think it's just a matter of perspective. In the model I'm advocating,
> the PCI device is not the same as the virtio device but rather a virtio
> host bridge, very much like USB or SATA works.
> 
> We could easily have multiple virtio devices behind one PCI device, but
> since virtual PCI devices are cheap, a one-to-one mapping makes sense
> for simplicity.

This is still retro-justification.  The simplest way for PCI systems to
represent a virtio device as a PCI device; this makes life easy for any
guest OS.

I just know we're going to regret this...
Rusty.

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [kvm-devel] [PATCH 6/6] virtio ring helper
       [not found]               ` <1190340251.19451.36.camel@localhost.localdomain>
@ 2007-09-23 10:05                 ` Avi Kivity
       [not found]                 ` <46F63A64.9070200@qumranet.com>
  1 sibling, 0 replies; 41+ messages in thread
From: Avi Kivity @ 2007-09-23 10:05 UTC (permalink / raw)
  To: Rusty Russell; +Cc: kvm-devel, lguest, virtualization

Rusty Russell wrote:
> On Thu, 2007-09-20 at 14:43 +0200, Avi Kivity wrote:
>   
>> Rusty Russell wrote:
>>     
>>> These helper routines supply most of the virtqueue_ops for hypervisors
>>> which want to use a ring for virtio.  Unlike the previous lguest
>>> implementation:
>>>
>>> 3) The page numbers are always 64 bit (PAE anyone?)
>>>   
>>>       
>> 32 bits of page numbers give 44 bits of physical address on x86.  That's 
>> 16TB per guest.  Admittedly it's smaller on a VAX.
>>     
>
> I like to feel that I make these mistakes to ensure others are paying
> attention.  However, it does mean that I can just put an address in
> there and increase the length field to 32 bits.  Much rejoicing.
>
>   

Why are we sending page numbers anyway?  See below.

>> I don't like the chaining and would prefer the descriptor to refer to an 
>> array of page descriptors.  However I can't complain since this isn't 
>> common code.
>>     
>
> The intent is for kvm to use it.  I'll certainly consider your patches,
> although I suspect that managing descriptor pages will make things ugly
> enough to cause you to reconsider.
>   

How about, instead:

    struct vring_desc {
         __u64 address;
         __u32 length;
         __u32 flags;
    };

Where one of the flags is VRING_DESC_INDIRECT, which means that the 
memory within (address, length) is a bunch of descriptors instead of raw 
data.

This has the following nice properties:
- no dependency on page size, for archs where the page size definition 
is muddy
- no indirection for ethernet rx on Linux, even with jumbograms (a 
buffer can span pages in one descriptor, as long as they're physically 
contiguous)
- reward OSes that can do contiguous I/O (with a 32-bit length, too)
- keeps the interface simple at the expense of the implementation 
(address vs. page+offset)
- no nasty 16-bit fields, which some archs don't like
- be free! cast off your chains!


-- 
error compiling committee.c: too many arguments to function

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH 1/6] virtio interace
       [not found]             ` <1190536431.27805.109.camel@localhost.localdomain>
@ 2007-09-23 11:20               ` Dor Laor
  0 siblings, 0 replies; 41+ messages in thread
From: Dor Laor @ 2007-09-23 11:20 UTC (permalink / raw)
  To: Rusty Russell; +Cc: kvm-devel, lguest, Arnd Bergmann, virtualization


[-- Attachment #1.1: Type: text/plain, Size: 1435 bytes --]

Rusty Russell wrote:
> On Sat, 2007-09-22 at 12:01 +0200, Arnd Bergmann wrote:
>   
>> On Saturday 22 September 2007, Rusty Russell wrote:
>>     
>>> But now each virtio device has two "struct device"s, not one.   And
>>> you've made up a fictional bus to do it.
>>>
>>> Yet for PCI systems, it really is a PCI device; exposing a second bus to
>>> userspace just because we put a layer in our implementation is poor
>>> form.
>>>
>>> Perhaps this is the easiest way of doing it.  But it's still wrong.
>>>       
>> I think it's just a matter of perspective. In the model I'm advocating,
>> the PCI device is not the same as the virtio device but rather a virtio
>> host bridge, very much like USB or SATA works.
>>
>> We could easily have multiple virtio devices behind one PCI device, but
>> since virtual PCI devices are cheap, a one-to-one mapping makes sense
>> for simplicity.
>>     
>
> This is still retro-justification.  The simplest way for PCI systems to
> represent a virtio device as a PCI device; this makes life easy for any
> guest OS.
>
> I just know we're going to regret this...
> Rusty.
>
>
>   
I'm not sure what's best, maybe this can help:
With the virtio_find_driver you need to place all virtio drivers under 
the same pci device or suggest new mapping.
Using Arnd's solution (actually I implemented it for KVM before the new 
patch set) and  1-1 mapping for pci and virtio
devices makes life simpler.
Dor.


[-- Attachment #1.2: Type: text/html, Size: 1906 bytes --]

[-- Attachment #2: Type: text/plain, Size: 184 bytes --]

_______________________________________________
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/virtualization

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [kvm-devel] [PATCH 6/6] virtio ring helper
       [not found]                 ` <46F63A64.9070200@qumranet.com>
@ 2007-09-23 11:40                   ` Rusty Russell
       [not found]                   ` <1190547607.27805.120.camel@localhost.localdomain>
  1 sibling, 0 replies; 41+ messages in thread
From: Rusty Russell @ 2007-09-23 11:40 UTC (permalink / raw)
  To: Avi Kivity; +Cc: kvm-devel, lguest, virtualization

On Sun, 2007-09-23 at 12:05 +0200, Avi Kivity wrote:
> Rusty Russell wrote:
> > On Thu, 2007-09-20 at 14:43 +0200, Avi Kivity wrote:  
> >> 32 bits of page numbers give 44 bits of physical address on x86.  That's 
> >> 16TB per guest.  Admittedly it's smaller on a VAX.
> >
> > I like to feel that I make these mistakes to ensure others are paying
> > attention.  However, it does mean that I can just put an address in
> > there and increase the length field to 32 bits.  Much rejoicing.
> >   
> 
> Why are we sending page numbers anyway?  See below.

Perhaps I was unclear.  I already changed to a 64-bit address.  I
haven't send out another set of patches because I'm changing to Arnd's
explicit virtio bus too.  Will send out a new set tomorrow at this rate.

> Where one of the flags is VRING_DESC_INDIRECT, which means that the 
> memory within (address, length) is a bunch of descriptors instead of raw 
> data.

If that's all we wanted, it's fairly easy to do as a future extension
even if we didn't change it today.  My concern was the allocation and
management of those sg pages; hence my desire for a patch 8)

Rusty.

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [kvm-devel] [PATCH 6/6] virtio ring helper
       [not found]                   ` <1190547607.27805.120.camel@localhost.localdomain>
@ 2007-09-23 11:46                     ` Avi Kivity
  0 siblings, 0 replies; 41+ messages in thread
From: Avi Kivity @ 2007-09-23 11:46 UTC (permalink / raw)
  To: Rusty Russell; +Cc: kvm-devel, lguest, virtualization

Rusty Russell wrote:
> On Sun, 2007-09-23 at 12:05 +0200, Avi Kivity wrote:
>   
>> Rusty Russell wrote:
>>     
>>> On Thu, 2007-09-20 at 14:43 +0200, Avi Kivity wrote:  
>>>       
>>>> 32 bits of page numbers give 44 bits of physical address on x86.  That's 
>>>> 16TB per guest.  Admittedly it's smaller on a VAX.
>>>>         
>>> I like to feel that I make these mistakes to ensure others are paying
>>> attention.  However, it does mean that I can just put an address in
>>> there and increase the length field to 32 bits.  Much rejoicing.
>>>   
>>>       
>> Why are we sending page numbers anyway?  See below.
>>     
>
> Perhaps I was unclear.  I already changed to a 64-bit address.  I
> haven't send out another set of patches because I'm changing to Arnd's
> explicit virtio bus too.  Will send out a new set tomorrow at this rate.
>
>   

It does say so quite explicitly in the quoted text.  Sorry.

>> Where one of the flags is VRING_DESC_INDIRECT, which means that the 
>> memory within (address, length) is a bunch of descriptors instead of raw 
>> data.
>>     
>
> If that's all we wanted, it's fairly easy to do as a future extension
> even if we didn't change it today.  My concern was the allocation and
> management of those sg pages; hence my desire for a patch 8)
>   

Won't kmalloc()/kfree() suffice?  IMO the tradeoff (compared to chaining 
with its reduction in ring size, and handling ood) is positive.

I'll try a patch based on the next patchset.

-- 
error compiling committee.c: too many arguments to function

^ permalink raw reply	[flat|nested] 41+ messages in thread

end of thread, other threads:[~2007-09-23 11:46 UTC | newest]

Thread overview: 41+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
     [not found] <1190289808.7262.223.camel@localhost.localdomain>
2007-09-20 12:09 ` [PATCH 1/6] virtio interace Rusty Russell
     [not found] ` <1190290140.7262.228.camel@localhost.localdomain>
2007-09-20 12:12   ` [PATCH 2/6] virtio_config Rusty Russell
2007-09-20 12:27   ` [kvm-devel] [PATCH 1/6] virtio interace Avi Kivity
     [not found]   ` <1190290369.7262.231.camel@localhost.localdomain>
2007-09-20 12:14     ` [PATCH 3/6] virtio net driver Rusty Russell
2007-09-20 12:36     ` [kvm-devel] [PATCH 2/6] virtio_config Avi Kivity
     [not found]     ` <46F26958.4080102@qumranet.com>
2007-09-20 12:55       ` Avi Kivity
     [not found]       ` <46F26DC7.9040001@qumranet.com>
2007-09-21  1:50         ` Rusty Russell
     [not found]         ` <1190339435.19451.23.camel@localhost.localdomain>
2007-09-22 13:03           ` Avi Kivity
     [not found]     ` <1190290495.7262.235.camel@localhost.localdomain>
2007-09-20 12:16       ` [PATCH 4/6] virtio block driver Rusty Russell
     [not found]       ` <1190290606.7262.239.camel@localhost.localdomain>
2007-09-20 12:19         ` [PATCH 5/6] virtio console driver Rusty Russell
2007-09-20 12:27         ` [PATCH 4/6] virtio block driver Jens Axboe
     [not found]         ` <1190290761.7262.242.camel@localhost.localdomain>
2007-09-20 12:27           ` [PATCH 6/6] virtio ring helper Rusty Russell
     [not found]           ` <1190291234.7262.246.camel@localhost.localdomain>
2007-09-20 12:43             ` [kvm-devel] " Avi Kivity
     [not found]             ` <46F26AF6.60904@qumranet.com>
2007-09-21  2:04               ` Rusty Russell
     [not found]               ` <1190340251.19451.36.camel@localhost.localdomain>
2007-09-23 10:05                 ` Avi Kivity
     [not found]                 ` <46F63A64.9070200@qumranet.com>
2007-09-23 11:40                   ` Rusty Russell
     [not found]                   ` <1190547607.27805.120.camel@localhost.localdomain>
2007-09-23 11:46                     ` Avi Kivity
2007-09-20 13:05         ` [PATCH 4/6] virtio block driver Jens Axboe
     [not found]         ` <20070920130519.GL2367@kernel.dk>
2007-09-21  2:06           ` Rusty Russell
2007-09-21 11:47             ` Jens Axboe
     [not found]             ` <20070921114703.GN2367@kernel.dk>
2007-09-21 12:30               ` Rusty Russell
     [not found]         ` <20070920122713.GK2367@kernel.dk>
2007-09-21 12:00           ` Rusty Russell
     [not found]           ` <1190376007.27805.19.camel@localhost.localdomain>
2007-09-21 12:27             ` Jens Axboe
     [not found]             ` <20070921122746.GO2367@kernel.dk>
2007-09-23  6:47               ` Rusty Russell
2007-09-21 10:48       ` [kvm-devel] [PATCH 3/6] virtio net driver Christian Borntraeger
     [not found]       ` <200709211248.11783.borntraeger@de.ibm.com>
2007-09-21 11:53         ` Rusty Russell
     [not found]         ` <1190375615.27805.9.camel@localhost.localdomain>
2007-09-21 12:36           ` Christian Borntraeger
     [not found]           ` <200709211436.43964.borntraeger@de.ibm.com>
2007-09-21 14:08             ` Herbert Xu
     [not found]             ` <20070921140833.GA12242@gondor.apana.org.au>
2007-09-21 14:42               ` Christian Borntraeger
     [not found]               ` <200709211642.25208.borntraeger@de.ibm.com>
2007-09-23  7:13                 ` Rusty Russell
     [not found]   ` <46F2674A.2050403@qumranet.com>
2007-09-21 11:37     ` [kvm-devel] [PATCH 1/6] virtio interace Rusty Russell
2007-09-21 12:05   ` Arnd Bergmann
     [not found]   ` <200709211405.32116.arnd@arndb.de>
2007-09-21 12:45     ` Rusty Russell
     [not found]     ` <1190378736.27805.54.camel@localhost.localdomain>
2007-09-21 14:22       ` Arnd Bergmann
     [not found]       ` <200709211622.22005.arnd@arndb.de>
2007-09-22  9:55         ` Rusty Russell
2007-09-22 10:01           ` Arnd Bergmann
     [not found]           ` <200709221201.33865.arnd@arndb.de>
2007-09-23  8:33             ` Rusty Russell
     [not found]             ` <1190536431.27805.109.camel@localhost.localdomain>
2007-09-23 11:20               ` Dor Laor
2007-09-20 13:43 ` [PATCH 0/6] virtio with config abstraction and ring implementation Dor Laor
     [not found] ` <46F2791A.8070601@qumranet.com>
2007-09-20 13:50   ` [Lguest] [PATCH 0/6] virtio with config abstraction and ringimplementation Dor Laor
2007-09-21  3:20   ` [PATCH 0/6] virtio with config abstraction and ring implementation Rusty Russell

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).