* [PATCH 00/10] PV-IO v3
@ 2007-08-16 23:13 Gregory Haskins
From: Gregory Haskins @ 2007-08-16 23:13 UTC (permalink / raw)
To: kvm-devel-5NWGOfrQmneRv+LV9MX5uipxlwaOVQ5f
Here is the v3 release of the patch series for a generalized PV-IO
infrastructure. It has v2 plus the following changes:
1) The big change is that PVBUS is now based on the bus/device_register
APIs. The code is inspired by lguest_bus, except it has been decoupled
from the hypervisor. Also, the "device" object provides an actual
interface to the device, which yields a tight coupling to the underlying
device provider. This offers some interesting features, as evidenced in patch
#4 where we register some in-host virtual devices for the purposes of
testing the IOQ/PVBUS/IOQNET infrastructure.
2) The KVM specific portions of the code have been adapted to the new model
and hotplug support has been added.
The test harness in #4 has helped find some bugs. Some more remain, which I
will fix after this mail goes out. There is also some work to do on the
"shutdown" portion of the code (e.g. you will see a bus panic if you try to
rmmod a pvbus device/driver right now).
Note that the first four patches are not specific to KVM. Once we get a
better picture of what we can use, I will obviously start cross-posting those
particular patches to LKML so they can be reviewed by a broader community.
For now I will keep them on kvm-devel since their status is in flux.
I am looking forward to seeing Dor's virtio based patch series. Please send
it to me ASAP, even if it's not complete or even compiling.
Regards,
-Greg
* [PATCH 01/10] IOQ: Adding basic definitions for IO-Queue logic
From: Gregory Haskins @ 2007-08-16 23:14 UTC (permalink / raw)
To: kvm-devel-5NWGOfrQmneRv+LV9MX5uipxlwaOVQ5f
IOQ is a generic shared-memory-queue mechanism that happens to be friendly
to virtualization boundaries. Note that it is not virtualization specific,
thanks to its flexible signaling layer.
Signed-off-by: Gregory Haskins <ghaskins-Et1tbQHTxzrQT0dZR+AlfA@public.gmane.org>
---
include/linux/ioq.h | 176 +++++++++++++++++++++++++++++++++++++++
lib/Kconfig | 11 ++
lib/Makefile | 1
lib/ioq.c | 228 +++++++++++++++++++++++++++++++++++++++++++++++++++
4 files changed, 416 insertions(+), 0 deletions(-)
diff --git a/include/linux/ioq.h b/include/linux/ioq.h
new file mode 100644
index 0000000..d3a18a1
--- /dev/null
+++ b/include/linux/ioq.h
@@ -0,0 +1,176 @@
+/*
+ * Copyright 2007 Novell. All Rights Reserved.
+ *
+ * IOQ is a generic shared-memory-queue mechanism that happens to be friendly
+ * to virtualization boundaries. It can be used in a variety of ways, though
+ * its intended purpose is to become the low-level communication path for
+ * paravirtualized drivers. Note that it is not virtualization specific
+ * due to its flexible signaling layer.
+ *
+ * The following are a list of key design points:
+ *
+ * #) All shared memory is always allocated explicitly on one side of the
+ * link. This would typically be the guest side in a VM/VMM scenario.
+ * #) The code has the concept of "north" and "south", where north denotes the
+ * memory-owner side (e.g. guest).
+ * #) An IOQ is "created" on the north side (which generates a unique ID), and
+ * is "connected" on the remote side via its ID. This facilitates call-path
+ * setup in a manner that is friendly across VM/VMM boundaries.
+ * #) An IOQ is manipulated using an iterator idiom.
+ * #) An "IOQ Manager" abstraction handles the translation between two
+ * endpoints. E.g. allocating "north" memory, signaling, translating
+ * addresses (e.g. GPA to PA)
+ *
+ * Author:
+ * Gregory Haskins <ghaskins-Et1tbQHTxzrQT0dZR+AlfA@public.gmane.org>
+ *
+ * This file is free software; you can redistribute it and/or modify
+ * it under the terms of version 2 of the GNU General Public License
+ * as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write to the Free Software Foundation,
+ * Inc., 51 Franklin St, Fifth Floor, Boston, MA 02110-1301, USA.
+ */
+
+#ifndef _LINUX_IOQ_H
+#define _LINUX_IOQ_H
+
+#include <linux/sched.h>
+#include <linux/wait.h>
+#include <asm/types.h>
+
+struct ioq_mgr;
+
+/*
+ *---------
+ * The following structures represent data that is shared across boundaries
+ * which may be quite disparate from one another (e.g. Windows vs Linux,
+ * 32 vs 64 bit, etc). Therefore, care has been taken to make sure they
+ * present data in a manner that is independent of the environment.
+ *-----------
+ */
+typedef u64 ioq_id_t;
+
+struct ioq_ring_desc {
+ u64 cookie; /* for arbitrary use by north-side */
+ u64 ptr;
+ u64 len;
+ u64 alen;
+ u8 valid;
+ u8 sown; /* South owned = 1, North owned = 0 */
+};
+
+#define IOQ_RING_MAGIC 0x47fa2fe4
+#define IOQ_RING_VER 1
+
+struct ioq_ring_idx {
+ u32 head; /* 0 based index to head of ptr array */
+ u32 tail; /* 0 based index to tail of ptr array */
+ u8 full;
+};
+
+struct ioq_irq {
+ u8 enabled;
+ u8 pending;
+};
+
+enum ioq_locality {
+ ioq_locality_north,
+ ioq_locality_south,
+};
+
+struct ioq_ring_head {
+ u32 magic;
+ u32 ver;
+ ioq_id_t id;
+ u32 count;
+ u64 ptr; /* ptr to array of ioq_ring_desc[count] */
+ struct ioq_ring_idx idx[2];
+ struct ioq_irq irq[2];
+ u8 padding[16];
+};
+
+/* --- END SHARED STRUCTURES --- */
+
+enum ioq_idx_type {
+ ioq_idxtype_valid,
+ ioq_idxtype_inuse,
+ ioq_idxtype_invalid,
+};
+
+enum ioq_seek_type {
+ ioq_seek_tail,
+ ioq_seek_next,
+ ioq_seek_head,
+ ioq_seek_set
+};
+
+struct ioq_iterator {
+ struct ioq *ioq;
+ struct ioq_ring_idx *idx;
+ u32 pos;
+ struct ioq_ring_desc *desc;
+ int update;
+};
+
+int ioq_iter_seek(struct ioq_iterator *iter, enum ioq_seek_type type,
+ long offset, int flags);
+int ioq_iter_push(struct ioq_iterator *iter, int flags);
+int ioq_iter_pop(struct ioq_iterator *iter, int flags);
+
+struct ioq_notifier {
+ void (*signal)(struct ioq_notifier*);
+};
+
+struct ioq {
+ void (*destroy)(struct ioq *ioq);
+ int (*signal)(struct ioq *ioq);
+
+ ioq_id_t id;
+ enum ioq_locality locale;
+ struct ioq_mgr *mgr;
+ struct ioq_ring_head *head_desc;
+ struct ioq_ring_desc *ring;
+ wait_queue_head_t wq;
+ struct ioq_notifier *notifier;
+};
+
+static inline void ioq_init(struct ioq *ioq)
+{
+ memset(ioq, 0, sizeof(*ioq));
+ init_waitqueue_head(&ioq->wq);
+}
+
+int ioq_start(struct ioq *ioq, int flags);
+int ioq_stop(struct ioq *ioq, int flags);
+int ioq_signal(struct ioq *ioq, int flags);
+void ioq_wakeup(struct ioq *ioq); /* This should only be used internally */
+int ioq_count(struct ioq *ioq, enum ioq_idx_type type);
+int ioq_full(struct ioq *ioq, enum ioq_idx_type type);
+
+static inline int ioq_empty(struct ioq *ioq, enum ioq_idx_type type)
+{
+ return !ioq_count(ioq, type);
+}
+
+
+
+#define IOQ_ITER_AUTOUPDATE (1 << 0)
+int ioq_iter_init(struct ioq *ioq, struct ioq_iterator *iter,
+ enum ioq_idx_type type, int flags);
+
+struct ioq_mgr {
+ int (*create)(struct ioq_mgr *t, struct ioq **ioq,
+ size_t ringsize, int flags);
+ int (*connect)(struct ioq_mgr *t, ioq_id_t id, struct ioq **ioq,
+ int flags);
+};
+
+
+#endif /* _LINUX_IOQ_H */
diff --git a/lib/Kconfig b/lib/Kconfig
index 2e7ae6b..65c6d5d 100644
--- a/lib/Kconfig
+++ b/lib/Kconfig
@@ -124,4 +124,15 @@ config HAS_DMA
depends on !NO_DMA
default y
+config IOQ
+ boolean "IO-Queue library - Generic shared-memory queue"
+ default n
+ help
+ IOQ is a generic shared-memory-queue mechanism that happens to be
+ friendly to virtualization boundaries. It can be used in a variety
+ of ways, though its intended purpose is to become the low-level
+ communication path for paravirtualized drivers.
+
+ If unsure, say N.
+
endmenu
diff --git a/lib/Makefile b/lib/Makefile
index c8c8e20..2bf3b5d 100644
--- a/lib/Makefile
+++ b/lib/Makefile
@@ -56,6 +56,7 @@ obj-$(CONFIG_TEXTSEARCH_BM) += ts_bm.o
obj-$(CONFIG_TEXTSEARCH_FSM) += ts_fsm.o
obj-$(CONFIG_SMP) += percpu_counter.o
obj-$(CONFIG_AUDIT_GENERIC) += audit.o
+obj-$(CONFIG_IOQ) += ioq.o
obj-$(CONFIG_SWIOTLB) += swiotlb.o
obj-$(CONFIG_FAULT_INJECTION) += fault-inject.o
diff --git a/lib/ioq.c b/lib/ioq.c
new file mode 100644
index 0000000..b9ef75e
--- /dev/null
+++ b/lib/ioq.c
@@ -0,0 +1,228 @@
+/*
+ * Copyright 2007 Novell. All Rights Reserved.
+ *
+ * See include/linux/ioq.h for documentation
+ *
+ * Author:
+ * Gregory Haskins <ghaskins-Et1tbQHTxzrQT0dZR+AlfA@public.gmane.org>
+ *
+ * This file is free software; you can redistribute it and/or modify
+ * it under the terms of version 2 of the GNU General Public License
+ * as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write to the Free Software Foundation,
+ * Inc., 51 Franklin St, Fifth Floor, Boston, MA 02110-1301, USA.
+ */
+
+#include <linux/sched.h>
+#include <linux/ioq.h>
+#include <asm/bitops.h>
+#include <linux/module.h>
+
+#ifndef NULL
+#define NULL 0
+#endif
+
+static int ioq_iter_setpos(struct ioq_iterator *iter, u32 pos)
+{
+ struct ioq *ioq = iter->ioq;
+
+ BUG_ON(pos >= ioq->head_desc->count);
+
+ iter->pos = pos;
+ iter->desc = &ioq->ring[pos];
+
+ return 0;
+}
+
+int ioq_iter_seek(struct ioq_iterator *iter, enum ioq_seek_type type,
+ long offset, int flags)
+{
+ struct ioq_ring_head *head_desc = iter->ioq->head_desc;
+ struct ioq_ring_idx *idx = iter->idx;
+ u32 pos;
+
+ switch (type) {
+ case ioq_seek_next:
+ pos = iter->pos + 1;
+ pos %= head_desc->count;
+ break;
+ case ioq_seek_tail:
+ pos = idx->tail;
+ break;
+ case ioq_seek_head:
+ pos = idx->head;
+ break;
+ case ioq_seek_set:
+ if (offset >= head_desc->count)
+ return -1;
+ pos = offset;
+ break;
+ default:
+ return -EINVAL;
+ }
+
+ return ioq_iter_setpos(iter, pos);
+}
+EXPORT_SYMBOL(ioq_iter_seek);
+
+static int ioq_ring_count(struct ioq_ring_idx *idx, int count)
+{
+ if (idx->full && (idx->head == idx->tail))
+ return count;
+ else if (idx->head >= idx->tail)
+ return idx->head - idx->tail;
+ else
+ return (idx->head + count) - idx->tail;
+}
+
+int ioq_iter_push(struct ioq_iterator *iter, int flags)
+{
+ struct ioq_ring_head *head_desc = iter->ioq->head_desc;
+ struct ioq_ring_idx *idx = iter->idx;
+ int ret = -ENOSPC;
+
+ /*
+ * It's only valid to push if we are currently pointed at the head
+ */
+ if (iter->pos != idx->head)
+ return -EINVAL;
+
+ if (ioq_ring_count(idx, head_desc->count) < head_desc->count) {
+ idx->head++;
+ idx->head %= head_desc->count;
+
+ if (idx->head == idx->tail)
+ idx->full = 1;
+
+ mb();
+
+ ret = ioq_iter_seek(iter, ioq_seek_next, 0, flags);
+
+ if (iter->update)
+ ioq_signal(iter->ioq, 0);
+ }
+
+ return ret;
+}
+EXPORT_SYMBOL(ioq_iter_push);
+
+int ioq_iter_pop(struct ioq_iterator *iter, int flags)
+{
+ struct ioq_ring_head *head_desc = iter->ioq->head_desc;
+ struct ioq_ring_idx *idx = iter->idx;
+ int ret = -ENOSPC;
+
+ /*
+ * It's only valid to pop if we are currently pointed at the tail
+ */
+ if (iter->pos != idx->tail)
+ return -EINVAL;
+
+ if (ioq_ring_count(idx, head_desc->count) != 0) {
+ idx->tail++;
+ idx->tail %= head_desc->count;
+
+ idx->full = 0;
+
+ mb();
+
+ ret = ioq_iter_seek(iter, ioq_seek_next, 0, flags);
+
+ if (iter->update)
+ ioq_signal(iter->ioq, 0);
+ }
+
+ return ret;
+}
+EXPORT_SYMBOL(ioq_iter_pop);
+
+int ioq_iter_init(struct ioq *ioq, struct ioq_iterator *iter,
+ enum ioq_idx_type type, int flags)
+{
+ BUG_ON((type < 0) || (type >= ioq_idxtype_invalid));
+
+ iter->ioq = ioq;
+ iter->update = (flags & IOQ_ITER_AUTOUPDATE);
+ iter->idx = &ioq->head_desc->idx[type];
+ iter->pos = -1;
+ iter->desc = NULL;
+
+ return 0;
+}
+EXPORT_SYMBOL(ioq_iter_init);
+
+int ioq_start(struct ioq *ioq, int flags)
+{
+ struct ioq_irq *irq = &ioq->head_desc->irq[ioq->locale];
+
+ irq->enabled = 1;
+ mb();
+
+ if (irq->pending)
+ ioq_wakeup(ioq);
+
+ return 0;
+}
+EXPORT_SYMBOL(ioq_start);
+
+int ioq_stop(struct ioq *ioq, int flags)
+{
+ struct ioq_irq *irq = &ioq->head_desc->irq[ioq->locale];
+
+ irq->enabled = 0;
+ mb();
+
+ return 0;
+}
+EXPORT_SYMBOL(ioq_stop);
+
+int ioq_signal(struct ioq *ioq, int flags)
+{
+ /* Load the irq structure from the other locale */
+ struct ioq_irq *irq = &ioq->head_desc->irq[!ioq->locale];
+
+ irq->pending = 1;
+ mb();
+
+ if (irq->enabled)
+ ioq->signal(ioq);
+
+ return 0;
+}
+EXPORT_SYMBOL(ioq_signal);
+
+int ioq_count(struct ioq *ioq, enum ioq_idx_type type)
+{
+ BUG_ON((type < 0) || (type >= ioq_idxtype_invalid));
+
+ return ioq_ring_count(&ioq->head_desc->idx[type], ioq->head_desc->count);
+}
+EXPORT_SYMBOL(ioq_count);
+
+int ioq_full(struct ioq *ioq, enum ioq_idx_type type)
+{
+ BUG_ON((type < 0) || (type >= ioq_idxtype_invalid));
+
+ return ioq->head_desc->idx[type].full;
+}
+EXPORT_SYMBOL(ioq_full);
+
+void ioq_wakeup(struct ioq *ioq)
+{
+ struct ioq_irq *irq = &ioq->head_desc->irq[ioq->locale];
+
+ irq->pending = 0;
+ mb();
+
+ wake_up(&ioq->wq);
+ if (ioq->notifier)
+ ioq->notifier->signal(ioq->notifier);
+}
+EXPORT_SYMBOL(ioq_wakeup);
* [PATCH 02/10] PARAVIRTUALIZATION: Add support for a bus abstraction
From: Gregory Haskins @ 2007-08-16 23:14 UTC (permalink / raw)
To: kvm-devel-5NWGOfrQmneRv+LV9MX5uipxlwaOVQ5f
PV usually comes in two flavors: device PV and "core" PV. The existing PV
ops deal in terms of the latter. However, it would be useful to add an
interface for a virtual bus with provisions for discovery/configuration of
backend PV devices. It is often desirable to run PV devices even if the
entire core is not operating with PVOPS. Therefore, we introduce a separate
interface to deal with the devices.
Signed-off-by: Gregory Haskins <ghaskins-Et1tbQHTxzrQT0dZR+AlfA@public.gmane.org>
---
arch/i386/Kconfig | 2 +
arch/x86_64/Kconfig | 2 +
drivers/Makefile | 1
drivers/pvbus/Kconfig | 7 ++
drivers/pvbus/Makefile | 6 ++
drivers/pvbus/pvbus-driver.c | 120 ++++++++++++++++++++++++++++++++++++++++++
include/linux/pvbus.h | 59 +++++++++++++++++++++
7 files changed, 197 insertions(+), 0 deletions(-)
diff --git a/arch/i386/Kconfig b/arch/i386/Kconfig
index c2d54b8..acf4506 100644
--- a/arch/i386/Kconfig
+++ b/arch/i386/Kconfig
@@ -1125,6 +1125,8 @@ source "drivers/pci/pcie/Kconfig"
source "drivers/pci/Kconfig"
+source "drivers/pvbus/Kconfig"
+
config ISA_DMA_API
bool
default y
diff --git a/arch/x86_64/Kconfig b/arch/x86_64/Kconfig
index 145bb82..17d6c78 100644
--- a/arch/x86_64/Kconfig
+++ b/arch/x86_64/Kconfig
@@ -721,6 +721,8 @@ source "drivers/pcmcia/Kconfig"
source "drivers/pci/hotplug/Kconfig"
+source "drivers/pvbus/Kconfig"
+
endmenu
diff --git a/drivers/Makefile b/drivers/Makefile
index adad2f3..179e669 100644
--- a/drivers/Makefile
+++ b/drivers/Makefile
@@ -81,3 +81,4 @@ obj-$(CONFIG_GENERIC_TIME) += clocksource/
obj-$(CONFIG_DMA_ENGINE) += dma/
obj-$(CONFIG_HID) += hid/
obj-$(CONFIG_PPC_PS3) += ps3/
+obj-$(CONFIG_PVBUS) += pvbus/
diff --git a/drivers/pvbus/Kconfig b/drivers/pvbus/Kconfig
new file mode 100644
index 0000000..1ca094d
--- /dev/null
+++ b/drivers/pvbus/Kconfig
@@ -0,0 +1,7 @@
+#
+# PVBUS configuration
+#
+
+config PVBUS
+ bool "Paravirtual Bus"
+
diff --git a/drivers/pvbus/Makefile b/drivers/pvbus/Makefile
new file mode 100644
index 0000000..0df2c2e
--- /dev/null
+++ b/drivers/pvbus/Makefile
@@ -0,0 +1,6 @@
+#
+# Makefile for the PVBUS bus specific drivers.
+#
+
+obj-y += pvbus-driver.o
+
diff --git a/drivers/pvbus/pvbus-driver.c b/drivers/pvbus/pvbus-driver.c
new file mode 100644
index 0000000..3f6687d
--- /dev/null
+++ b/drivers/pvbus/pvbus-driver.c
@@ -0,0 +1,120 @@
+/*
+ * Copyright 2007 Novell. All Rights Reserved.
+ *
+ * Paravirtualized-Bus - This is a generic infrastructure for virtual devices
+ * and their drivers. It is inspired by Rusty Russell's lguest_bus, but with
+ * the key difference that the bus is decoupled from the underlying hypervisor
+ * in both name and function.
+ *
+ * Instead, it is intended that external hypervisor support will register
+ * arbitrary devices. Generic drivers can then monitor this bus for
+ * compatible devices regardless of the hypervisor implementation.
+ *
+ * Author:
+ * Gregory Haskins <ghaskins-Et1tbQHTxzrQT0dZR+AlfA@public.gmane.org>
+ *
+ * This file is free software; you can redistribute it and/or modify
+ * it under the terms of version 2 of the GNU General Public License
+ * as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write to the Free Software Foundation,
+ * Inc., 51 Franklin St, Fifth Floor, Boston, MA 02110-1301, USA.
+ */
+
+#include <linux/pvbus.h>
+
+#define PVBUS_NAME "pvbus"
+
+/*
+ * This function is invoked whenever a new driver and/or device is added
+ * to check if there is a match
+ */
+static int pvbus_dev_match(struct device *_dev, struct device_driver *_drv)
+{
+ struct pvbus_device *dev = container_of(_dev, struct pvbus_device, dev);
+ struct pvbus_driver *drv = container_of(_drv, struct pvbus_driver, drv);
+
+ return !strcmp(dev->name, drv->name);
+}
+
+/*
+ * This function is invoked after the bus infrastructure has already made a
+ * match. The device will contain a reference to the paired driver which
+ * we will extract.
+ */
+static int pvbus_dev_probe(struct device *_dev)
+{
+ int ret = 0;
+ struct pvbus_device *dev = container_of(_dev, struct pvbus_device, dev);
+ struct pvbus_driver *drv = container_of(_dev->driver,
+ struct pvbus_driver, drv);
+
+ if (drv->probe)
+ ret = drv->probe(dev);
+
+ return ret;
+}
+
+static struct bus_type pv_bus = {
+ .name = PVBUS_NAME,
+ .match = pvbus_dev_match,
+};
+
+static struct device pvbus_rootdev = {
+ .parent = NULL,
+ .bus_id = PVBUS_NAME,
+};
+
+static int __init pvbus_init(void)
+{
+ int ret;
+
+ ret = bus_register(&pv_bus);
+ BUG_ON(ret < 0);
+
+ ret = device_register(&pvbus_rootdev);
+ BUG_ON(ret < 0);
+
+ return 0;
+}
+
+postcore_initcall(pvbus_init);
+
+int pvbus_device_register(struct pvbus_device *new)
+{
+ new->dev.parent = &pvbus_rootdev;
+ new->dev.bus = &pv_bus;
+
+ return device_register(&new->dev);
+}
+EXPORT_SYMBOL(pvbus_device_register);
+
+void pvbus_device_unregister(struct pvbus_device *dev)
+{
+ device_unregister(&dev->dev);
+}
+EXPORT_SYMBOL(pvbus_device_unregister);
+
+int pvbus_driver_register(struct pvbus_driver *new)
+{
+ new->drv.bus = &pv_bus;
+ new->drv.name = new->name;
+ new->drv.owner = new->owner;
+ new->drv.probe = pvbus_dev_probe;
+
+ return driver_register(&new->drv);
+}
+EXPORT_SYMBOL(pvbus_driver_register);
+
+void pvbus_driver_unregister(struct pvbus_driver *drv)
+{
+ driver_unregister(&drv->drv);
+}
+EXPORT_SYMBOL(pvbus_driver_unregister);
+
diff --git a/include/linux/pvbus.h b/include/linux/pvbus.h
new file mode 100644
index 0000000..471f500
--- /dev/null
+++ b/include/linux/pvbus.h
@@ -0,0 +1,59 @@
+/*
+ * Copyright 2007 Novell. All Rights Reserved.
+ *
+ * Paravirtualized-Bus
+ *
+ * Author:
+ * Gregory Haskins <ghaskins-Et1tbQHTxzrQT0dZR+AlfA@public.gmane.org>
+ *
+ * This file is free software; you can redistribute it and/or modify
+ * it under the terms of version 2 of the GNU General Public License
+ * as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write to the Free Software Foundation,
+ * Inc., 51 Franklin St, Fifth Floor, Boston, MA 02110-1301, USA.
+ */
+
+#ifndef _LINUX_PVBUS_H
+#define _LINUX_PVBUS_H
+
+#include <linux/device.h>
+#include <linux/ioq.h>
+
+struct pvbus_device {
+ char *name;
+ u64 id;
+
+ void *priv; /* Used by drivers that allocated the dev */
+
+ int (*createqueue)(struct pvbus_device *dev, struct ioq **ioq,
+ size_t ringsize, int flags);
+ int (*call)(struct pvbus_device *dev, u32 func,
+ void *data, size_t len, int flags);
+
+ struct device dev;
+};
+
+int pvbus_device_register(struct pvbus_device *dev);
+void pvbus_device_unregister(struct pvbus_device *dev);
+
+struct pvbus_driver {
+ char *name;
+ struct module *owner;
+
+ int (*probe)(struct pvbus_device *pdev);
+ int (*remove)(struct pvbus_device *pdev);
+
+ struct device_driver drv;
+};
+
+int pvbus_driver_register(struct pvbus_driver *drv);
+void pvbus_driver_unregister(struct pvbus_driver *drv);
+
+#endif /* _LINUX_PVBUS_H */
* [PATCH 03/10] IOQ: Add an IOQ network driver
From: Gregory Haskins @ 2007-08-16 23:14 UTC (permalink / raw)
To: kvm-devel-5NWGOfrQmneRv+LV9MX5uipxlwaOVQ5f
Signed-off-by: Gregory Haskins <ghaskins-Et1tbQHTxzrQT0dZR+AlfA@public.gmane.org>
---
drivers/net/Kconfig | 10 +
drivers/net/Makefile | 2
drivers/net/ioqnet/Makefile | 11 +
drivers/net/ioqnet/driver.c | 658 +++++++++++++++++++++++++++++++++++++++++++
include/linux/ioqnet.h | 44 +++
5 files changed, 725 insertions(+), 0 deletions(-)
diff --git a/drivers/net/Kconfig b/drivers/net/Kconfig
index fb99cd4..7ee7454 100644
--- a/drivers/net/Kconfig
+++ b/drivers/net/Kconfig
@@ -2947,6 +2947,16 @@ config NETCONSOLE
If you want to log kernel messages over the network, enable this.
See <file:Documentation/networking/netconsole.txt> for details.
+config IOQNET
+ tristate "IOQNET (IOQ based paravirtualized network driver)"
+ select IOQ
+ select PVBUS
+
+config IOQNET_DEBUG
+ bool "IOQNET debugging"
+ depends on IOQNET
+ default n
+
endif #NETDEVICES
config NETPOLL
diff --git a/drivers/net/Makefile b/drivers/net/Makefile
index a77affa..4c8a918 100644
--- a/drivers/net/Makefile
+++ b/drivers/net/Makefile
@@ -224,6 +224,8 @@ obj-$(CONFIG_ENP2611_MSF_NET) += ixp2000/
obj-$(CONFIG_NETCONSOLE) += netconsole.o
+obj-$(CONFIG_IOQNET) += ioqnet/
+
obj-$(CONFIG_FS_ENET) += fs_enet/
obj-$(CONFIG_NETXEN_NIC) += netxen/
diff --git a/drivers/net/ioqnet/Makefile b/drivers/net/ioqnet/Makefile
new file mode 100644
index 0000000..d7020ee
--- /dev/null
+++ b/drivers/net/ioqnet/Makefile
@@ -0,0 +1,11 @@
+#
+# Makefile for the IOQNET ethernet driver
+#
+
+ioqnet-objs = driver.o
+obj-$(CONFIG_IOQNET) += ioqnet.o
+
+
+ifeq ($(CONFIG_IOQNET_DEBUG),y)
+EXTRA_CFLAGS += -DIOQNET_DEBUG
+endif
diff --git a/drivers/net/ioqnet/driver.c b/drivers/net/ioqnet/driver.c
new file mode 100644
index 0000000..8352029
--- /dev/null
+++ b/drivers/net/ioqnet/driver.c
@@ -0,0 +1,658 @@
+/*
+ * ioqnet - A paravirtualized network device based on the IOQ interface
+ *
+ * Copyright (C) 2007 Novell, Gregory Haskins <ghaskins-Et1tbQHTxzrQT0dZR+AlfA@public.gmane.org>
+ *
+ * Derived from the SNULL example from the book "Linux Device
+ * Drivers" by Alessandro Rubini and Jonathan Corbet, published
+ * by O'Reilly & Associates.
+ */
+
+#include <linux/module.h>
+#include <linux/init.h>
+#include <linux/moduleparam.h>
+
+#include <linux/sched.h>
+#include <linux/kernel.h> /* printk() */
+#include <linux/slab.h> /* kmalloc() */
+#include <linux/errno.h> /* error codes */
+#include <linux/types.h> /* size_t */
+#include <linux/interrupt.h> /* mark_bh */
+
+#include <linux/in.h>
+#include <linux/netdevice.h> /* struct device, and other headers */
+#include <linux/etherdevice.h> /* eth_type_trans */
+#include <linux/ip.h> /* struct iphdr */
+#include <linux/tcp.h> /* struct tcphdr */
+#include <linux/skbuff.h>
+#include <linux/ioq.h>
+#include <linux/pvbus.h>
+
+#include <linux/in6.h>
+#include <asm/checksum.h>
+
+#include <linux/ioqnet.h>
+
+MODULE_AUTHOR("Gregory Haskins");
+MODULE_LICENSE("GPL");
+
+#undef PDEBUG /* undef it, just in case */
+#ifdef IOQNET_DEBUG
+# define PDEBUG(fmt, args...) printk( KERN_DEBUG "ioqnet: " fmt, ## args)
+#else
+# define PDEBUG(fmt, args...) /* not debugging: nothing */
+#endif
+
+#define RX_RINGLEN 64
+#define TX_RINGLEN 64
+#define TX_PTRS_PER_DESC 64
+
+struct ioqnet_queue {
+ struct ioq *queue;
+ struct ioq_notifier notifier;
+};
+
+struct ioqnet_tx_desc {
+ struct sk_buff *skb;
+ struct ioqnet_tx_ptr data[TX_PTRS_PER_DESC];
+};
+
+struct ioqnet_priv {
+ spinlock_t lock;
+ struct net_device *dev;
+ struct pvbus_device *pdev;
+ struct net_device_stats stats;
+ struct ioqnet_queue rxq;
+ struct ioqnet_queue txq;
+ struct tasklet_struct txtask;
+};
+
+static int ioqnet_queue_init(struct ioqnet_priv *priv,
+ struct ioqnet_queue *q,
+ size_t ringsize,
+ void (*func)(struct ioq_notifier*))
+{
+ int ret = priv->pdev->createqueue(priv->pdev, &q->queue, ringsize, 0);
+ if (ret < 0)
+ return ret;
+
+ q->notifier.signal = func;
+ q->queue->notifier = &q->notifier;
+
+ return 0;
+}
+
+/* Perform a hypercall to register/connect our queues */
+static int ioqnet_connect(struct ioqnet_priv *priv)
+{
+ struct ioqnet_connect data = {
+ .rxq = priv->rxq.queue->id,
+ .txq = priv->txq.queue->id,
+ };
+
+ return priv->pdev->call(priv->pdev, IOQNET_CONNECT,
+ &data, sizeof(data), 0);
+}
+
+static int ioqnet_disconnect(struct ioqnet_priv *priv)
+{
+ return priv->pdev->call(priv->pdev, IOQNET_DISCONNECT, NULL, 0, 0);
+}
+
+/* Perform a hypercall to get the assigned MAC addr */
+static int ioqnet_query_mac(struct ioqnet_priv *priv)
+{
+ return priv->pdev->call(priv->pdev,
+ IOQNET_QUERY_MAC,
+ priv->dev->dev_addr,
+ ETH_ALEN, 0);
+}
+
+
+/*
+ * Enable and disable receive interrupts.
+ */
+static void ioqnet_rx_ints(struct net_device *dev, int enable)
+{
+ struct ioqnet_priv *priv = netdev_priv(dev);
+ struct ioq *ioq = priv->rxq.queue;
+ if (enable)
+ ioq_start(ioq, 0);
+ else
+ ioq_stop(ioq, 0);
+}
+
+static void ioqnet_alloc_rx_desc(struct ioq_ring_desc *desc, size_t len)
+{
+ struct sk_buff *skb = dev_alloc_skb(len + 2);
+ BUG_ON(!skb);
+
+ skb_reserve(skb, 2); /* align IP on 16B boundary */
+
+ desc->cookie = (u64)skb;
+ desc->ptr = (u64)__pa(skb->data);
+ desc->len = len; /* total length */
+ desc->alen = 0; /* actual length - to be filled in by host */
+
+ mb();
+ desc->valid = 1;
+ desc->sown = 1; /* give ownership to the south */
+ mb();
+}
+
+static void ioqnet_setup_rx(struct ioqnet_priv *priv)
+{
+ struct ioq *ioq = priv->rxq.queue;
+ struct ioq_iterator iter;
+ int ret;
+
+ /*
+ * We want to iterate on the "valid" index. By default the iterator
+ * will not "autoupdate" which means it will not hypercall the host
+ * with our changes. This is good, because we are really just
+ * initializing stuff here anyway. Note that you can always manually
+ * signal the host with ioq_signal() if the autoupdate feature is not
+ * used.
+ */
+ ret = ioq_iter_init(ioq, &iter, ioq_idxtype_valid, 0);
+ BUG_ON(ret < 0);
+
+ /*
+ * Seek to the head of the valid index (which should be our first
+ * item, since the queue is brand-new)
+ */
+ ret = ioq_iter_seek(&iter, ioq_seek_head, 0, 0);
+ BUG_ON(ret < 0);
+
+ /*
+ * Now populate each descriptor with an empty SKB and mark it valid
+ */
+ while (!iter.desc->valid) {
+ ioqnet_alloc_rx_desc(iter.desc, priv->dev->mtu);
+
+ /*
+ * This push operation will simultaneously advance the
+ * valid-head index and increment our position in the queue
+ * by one.
+ */
+ ret = ioq_iter_push(&iter, 0);
+ BUG_ON(ret < 0);
+ }
+}
+
+static void ioqnet_setup_tx(struct ioqnet_priv *priv)
+{
+ struct ioq *ioq = priv->txq.queue;
+ struct ioq_iterator iter;
+ int ret;
+ int i;
+
+ /*
+ * We setup the tx-desc in a similar way to how we did the rx SKBs
+ */
+ ret = ioq_iter_init(ioq, &iter, ioq_idxtype_valid, 0);
+ BUG_ON(ret < 0);
+
+ ret = ioq_iter_seek(&iter, ioq_seek_head, 0, 0);
+ BUG_ON(ret < 0);
+
+ for (i = 0; i < TX_RINGLEN; ++i) {
+ struct ioq_ring_desc *desc = iter.desc;
+ struct ioqnet_tx_desc *txdesc = kzalloc(sizeof(*txdesc),
+ GFP_KERNEL | GFP_DMA);
+
+ desc->cookie = (u64)txdesc;
+ desc->ptr = (u64)__pa(&txdesc->data[0]);
+ desc->len = TX_PTRS_PER_DESC; /* "len" is "count" */
+ desc->alen = 0;
+ desc->valid = 0; /* mark it "invalid" since payload empty */
+ desc->sown = 0; /* retain ownership until "inuse" */
+
+ /*
+ * One big difference between the RX and TX ring is that
+ * we are going to do an "iter++" here instead of an
+ * "iter->push()". That is because we don't want to actually
+ * advance the valid-index. We use the valid index to
+ * determine the difference between outstanding consumed and
+ * outstanding unconsumed packets
+ */
+ ret = ioq_iter_seek(&iter, ioq_seek_next, 0, 0);
+ BUG_ON(ret < 0);
+ }
+}
+
+/*
+ * Open and close
+ */
+
+static int ioqnet_open(struct net_device *dev)
+{
+ struct ioqnet_priv *priv = netdev_priv(dev);
+
+ if (ioqnet_connect(priv) < 0)
+ printk("IOQNET: Could not initialize instance %lld\n",
+ priv->pdev->id);
+
+
+ netif_start_queue(dev);
+ return 0;
+}
+
+static void ioqnet_destroy_queue(struct ioq *ioq)
+{
+ ioq_stop(ioq, 0);
+ ioq->destroy(ioq);
+}
+
+static int ioqnet_release(struct net_device *dev)
+{
+ struct ioqnet_priv *priv = netdev_priv(dev);
+
+ netif_stop_queue(dev);
+
+ if (ioqnet_disconnect(priv) < 0)
+ printk(KERN_ERR "IOQNET: Could not disconnect instance %lld\n",
+ priv->pdev->id);
+
+ ioqnet_destroy_queue(priv->rxq.queue);
+ ioqnet_destroy_queue(priv->txq.queue);
+
+ return 0;
+}
+
+/*
+ * Configuration changes (passed on by ifconfig)
+ */
+static int ioqnet_config(struct net_device *dev, struct ifmap *map)
+{
+ if (dev->flags & IFF_UP) /* can't act on a running interface */
+ return -EBUSY;
+
+ /* Don't allow changing the I/O address */
+ if (map->base_addr != dev->base_addr) {
+ printk(KERN_WARNING "ioqnet: Can't change I/O address\n");
+ return -EOPNOTSUPP;
+ }
+
+ /* ignore other fields */
+ return 0;
+}
+
+/*
+ * The poll implementation.
+ */
+static int ioqnet_poll(struct net_device *dev, int *budget)
+{
+ int npackets = 0, quota = min(dev->quota, *budget);
+ struct ioqnet_priv *priv = netdev_priv(dev);
+ struct ioq_iterator iter;
+ unsigned long flags;
+ int ret;
+
+ PDEBUG("polling...\n");
+
+ spin_lock_irqsave(&priv->lock, flags);
+
+ /* We want to iterate on the tail of the in-use index */
+ ret = ioq_iter_init(priv->rxq.queue, &iter, ioq_idxtype_inuse, 0);
+ BUG_ON(ret < 0);
+
+ ret = ioq_iter_seek(&iter, ioq_seek_tail, 0, 0);
+ BUG_ON(ret < 0);
+
+ /*
+ * We stop if we have met the quota or there are no more packets.
+ * The EOM is indicated by finding a packet that is still owned by
+ * the south side
+ */
+ while ((npackets < quota) && (!iter.desc->sown)) {
+ struct sk_buff *skb = (struct sk_buff*)iter.desc->cookie;
+
+ skb_push(skb, iter.desc->alen);
+
+ /* Maintain stats */
+ npackets++;
+ priv->stats.rx_packets++;
+ priv->stats.rx_bytes += iter.desc->alen;
+
+ /* Pass the buffer up to the stack */
+ skb->dev = dev;
+ skb->protocol = eth_type_trans(skb, dev);
+ netif_receive_skb(skb);
+
+ mb();
+
+ /* Grab a new buffer to put in the ring */
+ ioqnet_alloc_rx_desc(iter.desc, dev->mtu);
+
+ /* Advance the in-use tail */
+ ret = ioq_iter_pop(&iter, 0);
+ BUG_ON(ret < 0);
+
+ /* Toggle the lock */
+ spin_unlock_irqrestore(&priv->lock, flags);
+ spin_lock_irqsave(&priv->lock, flags);
+ }
+
+ PDEBUG("poll: %d packets received\n", npackets);
+
+ /*
+ * If we processed all packets, we're done; tell the kernel and
+ * reenable ints
+ */
+ *budget -= npackets;
+ dev->quota -= npackets;
+ if (ioq_empty(priv->rxq.queue, ioq_idxtype_inuse)) {
+ /* FIXME: there is a race with enabling interrupts */
+ netif_rx_complete(dev);
+ ioqnet_rx_ints(dev, 1);
+ ret = 0;
+ } else
+ /* We couldn't process everything. */
+ ret = 1;
+
+ spin_unlock_irqrestore(&priv->lock, flags);
+
+ /* And let the south side know that we changed the rx-queue */
+ ioq_signal(priv->rxq.queue, 0);
+
+ return ret;
+}
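For readers unfamiliar with the pre-2.6.24 NAPI contract used here: the poll routine consumes at most min(dev->quota, *budget) packets, decrements both counters, and returns 0 only once the ring is drained. A toy userspace model of that accounting (invented names, not the kernel API):

```c
#include <assert.h>

/* Returns 0 if the backlog drained within the allowance (poll complete),
 * 1 if more work remains; updates *budget and *quota like ioqnet_poll. */
static int toy_poll(int *backlog, int *budget, int *quota)
{
	int limit = *quota < *budget ? *quota : *budget;
	int done = *backlog < limit ? *backlog : limit;

	*backlog -= done;
	*budget -= done;
	*quota -= done;

	return *backlog ? 1 : 0;
}
```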
+
+/*
+ * Transmit a packet (called by the kernel)
+ */
+static int ioqnet_tx_start(struct sk_buff *skb, struct net_device *dev)
+{
+ struct ioqnet_priv *priv = netdev_priv(dev);
+ struct ioq_iterator viter;
+ struct ioq_iterator uiter;
+ struct ioqnet_tx_desc *txdesc;
+ int ret;
+ int i;
+ unsigned long flags;
+
+ if (skb->len < ETH_ZLEN) {
+ dev_kfree_skb(skb); /* runt frame: drop rather than confuse the stack */
+ return 0;
+ }
+
+ PDEBUG("sending %d bytes\n", skb->len);
+
+ spin_lock_irqsave(&priv->lock, flags);
+
+ if (ioq_full(priv->txq.queue, ioq_idxtype_valid)) {
+ /*
+ * We must flow-control the kernel by disabling the queue
+ */
+ spin_unlock_irqrestore(&priv->lock, flags);
+ netif_stop_queue(dev);
+ return NETDEV_TX_BUSY; /* ask the stack to requeue the skb */
+ }
+
+ /*
+ * We want to iterate on the head of both the "inuse" and "valid" index
+ */
+ ret = ioq_iter_init(priv->txq.queue, &viter, ioq_idxtype_valid, 0);
+ BUG_ON(ret < 0);
+ ret = ioq_iter_init(priv->txq.queue, &uiter, ioq_idxtype_inuse, 0);
+ BUG_ON(ret < 0);
+
+ ret = ioq_iter_seek(&viter, ioq_seek_head, 0, 0);
+ BUG_ON(ret < 0);
+ ret = ioq_iter_seek(&uiter, ioq_seek_head, 0, 0);
+ BUG_ON(ret < 0);
+
+ /* The head pointers should move in lockstep */
+ BUG_ON(uiter.pos != viter.pos);
+
+ dev->trans_start = jiffies; /* save the timestamp */
+ skb_get(skb); /* add a refcount */
+
+ txdesc = (struct ioqnet_tx_desc*)uiter.desc->cookie;
+
+ /*
+ * We simply put the skb right onto the ring. We will get an interrupt
+ * later when the data has been consumed and we can reap the pointers
+ * at that time
+ */
+ for (i = 0; i < 1; ++i) { /* Someday we will support SG */
+ txdesc->data[i].len = (u64)skb->len;
+ txdesc->data[i].data = (u64)__pa(skb->data);
+
+ uiter.desc->alen++;
+ }
+
+ txdesc->skb = skb; /* save the skb for future release */
+
+ mb();
+ uiter.desc->valid = 1;
+ uiter.desc->sown = 1; /* give ownership to the south */
+ mb();
+
+ /* Advance both indexes together */
+ ret = ioq_iter_push(&viter, 0);
+ BUG_ON(ret < 0);
+ ret = ioq_iter_push(&uiter, 0);
+ BUG_ON(ret < 0);
+
+ /*
+ * This will signal the south side to consume the packet
+ */
+ ioq_signal(priv->txq.queue, 0);
+
+ spin_unlock_irqrestore(&priv->lock, flags);
+
+ return 0;
+}
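The flow-control handshake in the transmit path — stop the stack's queue when the valid ring fills, and let the completion handler wake it again — can be reduced to the following userspace sketch (the toy_* names are invented for illustration):

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model of the tx flow-control handshake: the producer stops when
 * the ring is full; the completion handler frees a slot and wakes it. */
struct toy_txq {
	unsigned used;
	unsigned capacity;
	bool stopped; /* mirrors netif_queue_stopped() */
};

static int toy_tx_start(struct toy_txq *q)
{
	if (q->used == q->capacity) {
		q->stopped = true; /* netif_stop_queue() */
		return -1;         /* caller must retry later */
	}
	q->used++;
	return 0;
}

static void toy_tx_complete(struct toy_txq *q)
{
	q->used--;
	if (q->stopped && q->used < q->capacity)
		q->stopped = false; /* netif_wake_queue() */
}
```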
+
+/*
+ * called by the tx interrupt handler to indicate that one or more packets
+ * have been consumed
+ */
+static void ioqnet_tx_complete(unsigned long data)
+{
+ struct ioqnet_priv *priv = (struct ioqnet_priv*)data;
+ struct ioq_iterator iter;
+ unsigned long flags;
+ int ret;
+
+ PDEBUG("send complete\n");
+
+ spin_lock_irqsave(&priv->lock, flags);
+
+ /* We want to iterate on the tail of the valid index */
+ ret = ioq_iter_init(priv->txq.queue, &iter, ioq_idxtype_valid, 0);
+ BUG_ON(ret < 0);
+
+ ret = ioq_iter_seek(&iter, ioq_seek_tail, 0, 0);
+ BUG_ON(ret < 0);
+
+ /*
+ * We are done once we find the first packet either invalid or still
+ * owned by the south-side
+ */
+ while (iter.desc->valid && !iter.desc->sown) {
+ struct ioqnet_tx_desc *txdesc;
+ struct sk_buff *skb;
+
+ txdesc = (struct ioqnet_tx_desc*)iter.desc->cookie;
+ skb = txdesc->skb;
+
+ /* Maintain stats */
+ priv->stats.tx_packets++;
+ priv->stats.tx_bytes += skb->len;
+
+ /* Reset the descriptor */
+ mb();
+ iter.desc->alen = 0;
+ iter.desc->valid = 0;
+ mb();
+
+ dev_kfree_skb(skb);
+
+ /* Advance the valid-index tail */
+ ret = ioq_iter_pop(&iter, 0);
+ BUG_ON(ret < 0);
+
+ /* Toggle the lock */
+ spin_unlock_irqrestore(&priv->lock, flags);
+ spin_lock_irqsave(&priv->lock, flags);
+ }
+
+ /*
+ * If we were previously stopped due to flow control, restart the
+ * processing
+ */
+ if (netif_queue_stopped(priv->dev)
+ && !ioq_full(priv->txq.queue, ioq_idxtype_inuse)) {
+
+ netif_wake_queue(priv->dev);
+ }
+
+ spin_unlock_irqrestore(&priv->lock, flags);
+}
+
+/*
+ * Ioctl commands
+ */
+static int ioqnet_ioctl(struct net_device *dev, struct ifreq *rq, int cmd)
+{
+ PDEBUG("ioctl\n");
+ return 0;
+}
+
+/*
+ * Return statistics to the caller
+ */
+struct net_device_stats *ioqnet_stats(struct net_device *dev)
+{
+ struct ioqnet_priv *priv = netdev_priv(dev);
+ return &priv->stats;
+}
+
+static void ioq_rx_notify(struct ioq_notifier *notifier)
+{
+ struct ioqnet_priv *priv;
+ struct net_device *dev;
+
+ priv = container_of(notifier, struct ioqnet_priv, rxq.notifier);
+ dev = priv->dev;
+
+ ioqnet_rx_ints(dev, 0); /* Disable further interrupts */
+ netif_rx_schedule(dev);
+}
+
+static void ioq_tx_notify(struct ioq_notifier *notifier)
+{
+ struct ioqnet_priv *priv;
+
+ priv = container_of(notifier, struct ioqnet_priv, txq.notifier);
+
+ PDEBUG("tx_notify for %lld\n", priv->pdev->id);
+
+ tasklet_schedule(&priv->txtask);
+}
+
+/*
+ * This is called whenever a new pvbus_device is added to the pvbus with
+ * the matching IOQNET_NAME
+ */
+static int ioqnet_probe(struct pvbus_device *pdev)
+{
+ struct net_device *dev;
+ struct ioqnet_priv *priv;
+ int ret;
+
+ printk(KERN_INFO "IOQNET: Found new device at %lld\n", pdev->id);
+
+ dev = alloc_etherdev(sizeof(struct ioqnet_priv));
+ if (!dev)
+ return -ENOMEM;
+
+ priv = netdev_priv(dev);
+ memset(priv, 0, sizeof(*priv));
+
+ spin_lock_init(&priv->lock);
+ priv->dev = dev;
+ priv->pdev = pdev;
+ tasklet_init(&priv->txtask, ioqnet_tx_complete, (unsigned long)priv);
+
+ ioqnet_queue_init(priv, &priv->rxq, RX_RINGLEN, ioq_rx_notify);
+ ioqnet_queue_init(priv, &priv->txq, TX_RINGLEN, ioq_tx_notify);
+
+ ioqnet_setup_rx(priv);
+ ioqnet_setup_tx(priv);
+
+ ioqnet_rx_ints(dev, 1); /* enable receive interrupts */
+ ioq_start(priv->txq.queue, 0); /* enable transmit interrupts */
+
+ ether_setup(dev); /* assign some of the fields */
+
+ dev->open = ioqnet_open;
+ dev->stop = ioqnet_release;
+ dev->set_config = ioqnet_config;
+ dev->hard_start_xmit = ioqnet_tx_start;
+ dev->do_ioctl = ioqnet_ioctl;
+ dev->get_stats = ioqnet_stats;
+ dev->poll = ioqnet_poll;
+ dev->weight = 2;
+ dev->hard_header_cache = NULL; /* Disable caching */
+
+ ret = ioqnet_query_mac(priv);
+ if (ret < 0) {
+ printk(KERN_ERR "IOQNET: Could not obtain MAC address for %lld\n",
+ priv->pdev->id);
+ goto out_free;
+ }
+
+ ret = register_netdev(dev);
+ if (ret < 0) {
+ printk(KERN_ERR "IOQNET: error %i registering device \"%s\"\n",
+ ret, dev->name);
+ goto out_free;
+ }
+
+ pdev->priv = priv;
+
+ return 0;
+
+ out_free:
+ free_netdev(dev);
+
+ return ret;
+}
+
+static int ioqnet_remove(struct pvbus_device *pdev)
+{
+ struct ioqnet_priv *priv = (struct ioqnet_priv*)pdev->priv;
+
+ unregister_netdev(priv->dev);
+ ioqnet_release(priv->dev);
+ free_netdev(priv->dev);
+
+ return 0;
+}
+
+/*
+ * Finally, the module stuff
+ */
+
+static struct pvbus_driver ioqnet_pvbus_driver = {
+ .name = IOQNET_NAME,
+ .owner = THIS_MODULE,
+ .probe = ioqnet_probe,
+ .remove = ioqnet_remove,
+};
+
+int __init ioqnet_init_module(void)
+{
+ return pvbus_driver_register(&ioqnet_pvbus_driver);
+}
+
+void __exit ioqnet_cleanup(void)
+{
+ pvbus_driver_unregister(&ioqnet_pvbus_driver);
+}
+
+
+module_init(ioqnet_init_module);
+module_exit(ioqnet_cleanup);
diff --git a/include/linux/ioqnet.h b/include/linux/ioqnet.h
new file mode 100644
index 0000000..7c73a26
--- /dev/null
+++ b/include/linux/ioqnet.h
@@ -0,0 +1,44 @@
+/*
+ * Copyright 2007 Novell. All Rights Reserved.
+ *
+ * IOQ Network Driver
+ *
+ * Author:
+ * Gregory Haskins <ghaskins-Et1tbQHTxzrQT0dZR+AlfA@public.gmane.org>
+ *
+ * This file is free software; you can redistribute it and/or modify
+ * it under the terms of version 2 of the GNU General Public License
+ * as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write to the Free Software Foundation,
+ * Inc., 51 Franklin St, Fifth Floor, Boston, MA 02110-1301, USA.
+ */
+
+#ifndef _IOQNET_H
+#define _IOQNET_H
+
+#define IOQNET_VERSION 1
+#define IOQNET_NAME "ioqnet"
+
+/* IOQNET functions (invoked via pvbus_device->call()) */
+#define IOQNET_CONNECT 1
+#define IOQNET_DISCONNECT 2
+#define IOQNET_QUERY_MAC 3
+
+struct ioqnet_connect {
+ ioq_id_t rxq;
+ ioq_id_t txq;
+};
+
+struct ioqnet_tx_ptr {
+ u64 len;
+ u64 data;
+};
+
+#endif /* _IOQNET_H */
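As a sanity check of the call-function protocol these constants define, here is a userspace model of a south-side dispatcher shaped like the loopback device's call() handler in patch 04. The toy_call() stub and its MAC value are invented for illustration; only the function codes and length check come from this header:

```c
#include <assert.h>
#include <string.h>

/* Function codes mirroring include/linux/ioqnet.h */
#define IOQNET_CONNECT    1
#define IOQNET_DISCONNECT 2
#define IOQNET_QUERY_MAC  3
#define ETH_ALEN          6

static const char stub_mac[ETH_ALEN] = { 0x00, 0x30, 0xcc, 0x00, 0x00, 0x01 };

/* Toy south-side dispatcher, shaped like ioqnet_lb_dev_call():
 * a function code plus an opaque payload and its length. */
static int toy_call(unsigned func, void *data, unsigned len)
{
	switch (func) {
	case IOQNET_QUERY_MAC:
		if (len != ETH_ALEN)
			return -1; /* -EINVAL in the real driver */
		memcpy(data, stub_mac, ETH_ALEN);
		return 0;
	default:
		return -1;
	}
}
```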
* [PATCH 04/10] IOQNET: Add a test harness infrastructure to IOQNET
From: Gregory Haskins @ 2007-08-16 23:14 UTC
To: kvm-devel-5NWGOfrQmneRv+LV9MX5uipxlwaOVQ5f
We can add an IOQNET loopback device and register it with the PVBUS to test
many aspects of the system (IOQ, PVBUS, and IOQNET itself).
Signed-off-by: Gregory Haskins <ghaskins-Et1tbQHTxzrQT0dZR+AlfA@public.gmane.org>
---
drivers/net/Kconfig | 10 +
drivers/net/ioqnet/Makefile | 3
drivers/net/ioqnet/loopback.c | 502 +++++++++++++++++++++++++++++++++++++++++
3 files changed, 515 insertions(+), 0 deletions(-)
diff --git a/drivers/net/Kconfig b/drivers/net/Kconfig
index 7ee7454..426947d 100644
--- a/drivers/net/Kconfig
+++ b/drivers/net/Kconfig
@@ -2957,6 +2957,16 @@ config IOQNET_DEBUG
depends on IOQNET
default n
+config IOQNET_LOOPBACK
+ tristate "IOQNET loopback device test harness"
+ depends on IOQNET
+ default n
+ ---help---
+ This will install a special PVBUS device that implements two IOQNET
+ devices. The devices are, of course, linked to one another forming a
+ loopback mechanism. This allows many subsystems to be tested: IOQ,
+ PVBUS, and IOQNET itself. If unsure, say N.
+
endif #NETDEVICES
config NETPOLL
diff --git a/drivers/net/ioqnet/Makefile b/drivers/net/ioqnet/Makefile
index d7020ee..7d2d156 100644
--- a/drivers/net/ioqnet/Makefile
+++ b/drivers/net/ioqnet/Makefile
@@ -4,8 +4,11 @@
ioqnet-objs = driver.o
obj-$(CONFIG_IOQNET) += ioqnet.o
+ioqnet-loopback-objs = loopback.o
+obj-$(CONFIG_IOQNET_LOOPBACK) += ioqnet-loopback.o
ifeq ($(CONFIG_IOQNET_DEBUG),y)
EXTRA_CFLAGS += -DIOQNET_DEBUG
endif
+
diff --git a/drivers/net/ioqnet/loopback.c b/drivers/net/ioqnet/loopback.c
new file mode 100644
index 0000000..0e36b43
--- /dev/null
+++ b/drivers/net/ioqnet/loopback.c
@@ -0,0 +1,502 @@
+/*
+ * ioqnet test harness
+ *
+ * Copyright (C) 2007 Novell, Gregory Haskins <ghaskins-Et1tbQHTxzrQT0dZR+AlfA@public.gmane.org>
+ *
+ * This file is free software; you can redistribute it and/or modify
+ * it under the terms of version 2 of the GNU General Public License
+ * as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write to the Free Software Foundation,
+ * Inc., 51 Franklin St, Fifth Floor, Boston, MA 02110-1301, USA.
+ */
+
+#include <linux/module.h>
+#include <linux/pvbus.h>
+#include <linux/ioq.h>
+#include <linux/kthread.h>
+#include <linux/ioqnet.h>
+#include <linux/interrupt.h>
+
+MODULE_AUTHOR("Gregory Haskins");
+MODULE_LICENSE("GPL");
+
+#ifndef ETH_ALEN
+#define ETH_ALEN 6
+#endif
+
+#undef PDEBUG /* undef it, just in case */
+#ifdef IOQNET_DEBUG
# define PDEBUG(fmt, args...) printk(KERN_DEBUG "ioqnet: " fmt, ## args)
+#else
+# define PDEBUG(fmt, args...) /* not debugging: nothing */
+#endif
+
+/*
+ * ---------------------------------------------------------------------
+ * First we must create an IOQ implementation to use while under test
+ * since these operations will all be local to the same host
+ * ---------------------------------------------------------------------
+ */
+
+struct ioqnet_lb_ioq {
+ struct ioq ioq;
+ struct ioqnet_lb_ioq *peer;
+ struct tasklet_struct task;
+};
+
+struct ioqnet_lb_ioqmgr {
+ struct ioq_mgr mgr;
+
+ /*
+ * Since this is just a test harness, we know ahead of time that
+ * we aren't going to need more than a handful of IOQs. So to keep
+ * lookups simple we will simply create a static array of them
+ */
+ struct ioqnet_lb_ioq ioqs[8];
+ int pos;
+};
+
+static struct ioqnet_lb_ioqmgr lb_ioqmgr;
+
+static struct ioqnet_lb_ioq *to_ioq(struct ioq *ioq)
+{
+ return container_of(ioq, struct ioqnet_lb_ioq, ioq);
+}
+
+static struct ioqnet_lb_ioqmgr *to_mgr(struct ioq_mgr *mgr)
+{
+ return container_of(mgr, struct ioqnet_lb_ioqmgr, mgr);
+}
+
+/*
+ * ------------------
+ * ioq implementation
+ * ------------------
+ */
+static void ioqnet_lb_ioq_wake(unsigned long data)
+{
+ struct ioqnet_lb_ioq *_ioq = (struct ioqnet_lb_ioq*)data;
+
+ if (_ioq->peer)
+ ioq_wakeup(&_ioq->peer->ioq);
+}
+
+static int ioqnet_lb_ioq_signal(struct ioq *ioq)
+{
+ struct ioqnet_lb_ioq *_ioq = to_ioq(ioq);
+
+ if (_ioq->peer)
+ tasklet_schedule(&_ioq->task);
+
+ return 0;
+}
+
+static void ioqnet_lb_ioq_destroy(struct ioq *ioq)
+{
+ struct ioqnet_lb_ioq *_ioq = to_ioq(ioq);
+
+ if (_ioq->peer) {
+ _ioq->peer->peer = NULL;
+ _ioq->peer = NULL;
+ }
+
+ if (_ioq->ioq.locale == ioq_locality_north) {
+ kfree(_ioq->ioq.ring);
+ kfree(_ioq->ioq.head_desc);
+ } else
+ kfree(_ioq);
+}
+
+/*
+ * ------------------
+ * ioqmgr implementation
+ * ------------------
+ */
+static int ioqnet_lb_ioq_create(struct ioq_mgr *t, struct ioq **ioq,
+ size_t ringsize, int flags)
+{
+ struct ioqnet_lb_ioqmgr *mgr = to_mgr(t);
+ struct ioqnet_lb_ioq *_ioq = NULL;
+ struct ioq_ring_head *head_desc = NULL;
+ void *ring = NULL;
+ int ret = -ENOMEM;
+ size_t ringlen;
+ ioq_id_t id;
+
+ ringlen = sizeof(struct ioq_ring_desc) * ringsize;
+
+ BUG_ON(mgr->pos >= ARRAY_SIZE(mgr->ioqs));
+
+ id = (ioq_id_t)mgr->pos++;
+
+ _ioq = &mgr->ioqs[id];
+
+ head_desc = kzalloc(sizeof(*head_desc), GFP_KERNEL);
+ if (!head_desc)
+ goto error;
+
+ ring = kzalloc(ringlen, GFP_KERNEL);
+ if (!ring)
+ goto error;
+
+ head_desc->magic = IOQ_RING_MAGIC;
+ head_desc->ver = IOQ_RING_VER;
+ head_desc->id = id;
+ head_desc->count = ringsize;
+ head_desc->ptr = (u64)ring;
+
+ ioq_init(&_ioq->ioq);
+
+ _ioq->ioq.signal = ioqnet_lb_ioq_signal;
+ _ioq->ioq.destroy = ioqnet_lb_ioq_destroy;
+
+ _ioq->ioq.id = head_desc->id;
+ _ioq->ioq.locale = ioq_locality_north;
+ _ioq->ioq.mgr = t;
+ _ioq->ioq.head_desc = head_desc;
+ _ioq->ioq.ring = ring;
+
+ tasklet_init(&_ioq->task, ioqnet_lb_ioq_wake, (unsigned long)_ioq);
+
+ *ioq = &_ioq->ioq;
+
+ return 0;
+
+ error:
+ kfree(head_desc); /* kfree(NULL) is a no-op */
+ kfree(ring);
+
+ return ret;
+}
+
+static int ioqnet_lb_ioq_connect(struct ioq_mgr *t, ioq_id_t id,
+ struct ioq **ioq, int flags)
+{
+ struct ioqnet_lb_ioqmgr *mgr = to_mgr(t);
+ struct ioqnet_lb_ioq *peer_ioq = &mgr->ioqs[id];
+ struct ioqnet_lb_ioq *_ioq;
+
+ if (peer_ioq->peer)
+ return -EEXIST;
+
+ _ioq = kzalloc(sizeof(*_ioq), GFP_KERNEL);
+ if (!_ioq)
+ return -ENOMEM;
+
+ ioq_init(&_ioq->ioq);
+
+ _ioq->ioq.signal = ioqnet_lb_ioq_signal;
+ _ioq->ioq.destroy = ioqnet_lb_ioq_destroy;
+
+ _ioq->ioq.id = id;
+ _ioq->ioq.locale = ioq_locality_south;
+ _ioq->ioq.mgr = t;
+ _ioq->ioq.head_desc = peer_ioq->ioq.head_desc;
+ _ioq->ioq.ring = peer_ioq->ioq.ring;
+
+ _ioq->peer = peer_ioq;
+ peer_ioq->peer = _ioq;
+ tasklet_init(&_ioq->task, ioqnet_lb_ioq_wake, (unsigned long)_ioq);
+
+ *ioq = &_ioq->ioq;
+
+ return 0;
+}
+
+static void ioqnet_lb_ioqmgr_init(void)
+{
+ struct ioqnet_lb_ioqmgr *mgr = &lb_ioqmgr;
+
+ memset(mgr, 0, sizeof(*mgr));
+
+ mgr->mgr.create = ioqnet_lb_ioq_create;
+ mgr->mgr.connect = ioqnet_lb_ioq_connect;
+}
+
+/*
+ * ---------------------------------------------------------------------
+ * Next we create the loopback device in terms of our ioqnet_lb_ioq
+ * subsystem
+ * ---------------------------------------------------------------------
+ */
+
+struct ioqnet_lb_device {
+ int idx; /* 0 or 1 */
+ struct ioq *rxq;
+ struct ioq *txq;
+ char mac[ETH_ALEN];
+ struct task_struct *task;
+ struct ioqnet_lb_device *peer;
+
+ struct pvbus_device dev;
+};
+
+static struct ioqnet_lb_device *to_dev(struct pvbus_device *dev)
+{
+ return container_of(dev, struct ioqnet_lb_device, dev);
+}
+
+static void ioqnet_lb_xmit(struct ioqnet_lb_device *dev,
+ struct ioqnet_tx_ptr *ptr,
+ size_t count)
+{
+ DECLARE_WAITQUEUE(wait, current);
+ struct ioq_iterator iter;
+ int ret;
+ int i;
+ char *dest;
+
+ add_wait_queue(&dev->txq->wq, &wait);
+
+ /* We want to iterate on the head of the in-use index for reading */
+ ret = ioq_iter_init(dev->txq, &iter, ioq_idxtype_inuse,
+ IOQ_ITER_AUTOUPDATE);
+ BUG_ON(ret < 0);
+
+ ret = ioq_iter_seek(&iter, ioq_seek_head, 0, 0);
+ BUG_ON(ret < 0);
+
+ set_current_state(TASK_UNINTERRUPTIBLE);
+
+ while (!iter.desc->valid || !iter.desc->sown) {
+ schedule();
+ set_current_state(TASK_UNINTERRUPTIBLE);
+ }
+
+ set_current_state(TASK_RUNNING);
+
+ dest = __va(iter.desc->ptr);
+
+ for (i = 0; i < count; ++i) {
+ struct ioqnet_tx_ptr *p = &ptr[i];
+ void *d = __va(p->data);
+
+ memcpy(dest, d, p->len);
+ dest += p->len;
+ }
+
+ mb();
+ iter.desc->sown = 0;
+ mb();
+
+ /* Advance the in-use head */
+ ret = ioq_iter_push(&iter, 0);
+ BUG_ON(ret < 0);
+}
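Stripped of the ring machinery, the copy loop above is a simple gather of the {len, data} pointer list into the peer's contiguous receive buffer. A minimal userspace sketch of just that step (toy_* names invented):

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Shape mirrors struct ioqnet_tx_ptr, but with a host pointer instead
 * of a physical address, so it runs in userspace. */
struct toy_tx_ptr {
	size_t len;
	const void *data;
};

/* Gather a pointer list into one contiguous buffer, as the loopback
 * xmit path does; returns the total bytes copied. The caller must
 * guarantee dest is large enough, just like the real ring buffer. */
static size_t toy_gather(char *dest, const struct toy_tx_ptr *ptr,
			 size_t count)
{
	size_t total = 0;
	size_t i;

	for (i = 0; i < count; ++i) {
		memcpy(dest + total, ptr[i].data, ptr[i].len);
		total += ptr[i].len;
	}
	return total;
}
```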
+
+/*
+ * This is the daemon thread for each device that gets created once the guest
+ * side connects to us (via the pvbus_device->call(IOQNET_CONNECT) operation).
+ * We want to wait on packets to arrive on the rxq, and then send them to our
+ * peer's txq.
+ */
+static int ioqnet_lb_thread(void *data)
+{
+ DECLARE_WAITQUEUE(wait, current);
+ struct ioqnet_lb_device *dev = (struct ioqnet_lb_device*)data;
+ struct ioq_iterator iter;
+ int ret;
+
+ add_wait_queue(&dev->rxq->wq, &wait);
+
+ /* We want to iterate on the tail of the in-use index for reading */
+ ret = ioq_iter_init(dev->rxq, &iter, ioq_idxtype_inuse,
+ IOQ_ITER_AUTOUPDATE);
+ BUG_ON(ret < 0);
+
+ ret = ioq_iter_seek(&iter, ioq_seek_tail, 0, 0);
+ BUG_ON(ret < 0);
+
+ while (1) {
+ struct ioq_ring_desc *desc = iter.desc;
+ struct ioqnet_tx_ptr *ptr;
+
+ PDEBUG("%d: Waiting...\n", dev->idx);
+
+ set_current_state(TASK_UNINTERRUPTIBLE);
+
+ while (!desc->sown) {
+ schedule();
+ set_current_state(TASK_UNINTERRUPTIBLE);
+ }
+
+ set_current_state(TASK_RUNNING);
+
+ PDEBUG("%d: Got a packet\n", dev->idx);
+
+ ptr = __va(desc->ptr);
+
+ /*
+ * If the peer is connected, we transmit the packet to its
+ * queue; otherwise we just drop it on the floor
+ */
+ if (dev->peer->txq)
+ ioqnet_lb_xmit(dev->peer, ptr, desc->alen);
+
+ mb();
+ desc->sown = 0;
+ mb();
+
+ /* Advance the in-use tail */
+ ret = ioq_iter_pop(&iter, 0);
+ BUG_ON(ret < 0);
+ }
+
+ return 0;
+}
+
+static int ioqnet_lb_dev_createqueue(struct pvbus_device *dev,
+ struct ioq **ioq,
+ size_t ringsize, int flags)
+{
+ struct ioq_mgr *ioqmgr = &lb_ioqmgr.mgr;
+
+ return ioqmgr->create(ioqmgr, ioq, ringsize, flags);
+}
+
+static int ioqnet_lb_queue_connect(ioq_id_t id, struct ioq **ioq)
+{
+ int ret;
+ struct ioq_mgr *ioqmgr = &lb_ioqmgr.mgr;
+
+ ret = ioqmgr->connect(ioqmgr, id, ioq, 0);
+ if (ret < 0)
+ return ret;
+
+ ioq_start(*ioq, 0);
+
+ return 0;
+}
+
+static int ioqnet_lb_dev_connect(struct ioqnet_lb_device *dev,
+ void *data, size_t len)
+{
+ struct ioqnet_connect *cnct = (struct ioqnet_connect*)data;
+ int ret;
+
+ /* We connect the north's rxq to our txq */
+ ret = ioqnet_lb_queue_connect(cnct->rxq, &dev->txq);
+ if (ret < 0)
+ return ret;
+
+ /* And vice-versa */
+ ret = ioqnet_lb_queue_connect(cnct->txq, &dev->rxq);
+ if (ret < 0)
+ return ret;
+
+ dev->task = kthread_create(ioqnet_lb_thread, dev,
+ "ioqnet-lb/%d", dev->idx);
+ if (IS_ERR(dev->task))
+ return PTR_ERR(dev->task);
+
+ wake_up_process(dev->task);
+
+ return 0;
+}
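The cross-wiring performed here — the north's rxq becomes our txq and vice-versa, so each side's transmit feeds the other's receive — can be illustrated with a toy userspace model (names invented; the real ids arrive in struct ioqnet_connect):

```c
#include <assert.h>

/* Toy model of the loopback connect step: the south side records the
 * queue ids the north passed down, swapped, so that traffic it sends
 * lands in the north's receive queue and vice-versa. */
struct toy_endpoint {
	int rxq_id;
	int txq_id;
};

static void toy_connect(struct toy_endpoint *south,
			int north_rxq, int north_txq)
{
	south->txq_id = north_rxq; /* our tx is their rx */
	south->rxq_id = north_txq; /* our rx is their tx */
}
```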
+
+static int ioqnet_lb_dev_query_mac(struct ioqnet_lb_device *dev,
+ void *data, size_t len)
+{
+ if (len != ETH_ALEN)
+ return -EINVAL;
+
+ memcpy(data, dev->mac, ETH_ALEN);
+
+ return 0;
+}
+
+/*
+ * This function is invoked whenever a guest calls pvbus_ops->call() against
+ * our instance ID
+ */
+static int ioqnet_lb_dev_call(struct pvbus_device *dev, u32 func, void *data,
+ size_t len, int flags)
+{
+ struct ioqnet_lb_device *_dev = to_dev(dev);
+ int ret = -EINVAL;
+
+ switch (func) {
+ case IOQNET_CONNECT:
+ ret = ioqnet_lb_dev_connect(_dev, data, len);
+ break;
+ case IOQNET_QUERY_MAC:
+ ret = ioqnet_lb_dev_query_mac(_dev, data, len);
+ break;
+ }
+
+ return ret;
+}
+
+static int ioqnet_lb_dev_init(struct ioqnet_lb_device *dev,
+ int idx,
+ struct ioqnet_lb_device *peer)
+{
+ char mac[] = { 0x00, 0x30, 0xcc, 0x00, 0x00, idx };
+
+ memset(dev, 0, sizeof(*dev));
+ dev->idx = idx;
+ dev->peer = peer;
+ memcpy(dev->mac, mac, ETH_ALEN);
+
+ dev->dev.name = IOQNET_NAME;
+ dev->dev.id = idx;
+ dev->dev.createqueue = ioqnet_lb_dev_createqueue;
+ dev->dev.call = ioqnet_lb_dev_call;
+ sprintf(dev->dev.dev.bus_id, "%d", idx);
+
+ return 0;
+}
+
+/*
+ * ---------------------------------------------------------------------
+ * Finally we create the top-level object that binds it all together
+ * ---------------------------------------------------------------------
+ */
+
+
+struct ioqnet_lb {
+ struct ioqnet_lb_device devs[2];
+};
+
+static struct ioqnet_lb ioqnet_lb;
+
+int __init ioqnet_lb_init_module(void)
+{
+ int ret;
+ int i;
+
+ ioqnet_lb_ioqmgr_init();
+
+ /* First initialize both devices */
+ for (i = 0; i < 2; i++) {
+ ret = ioqnet_lb_dev_init(&ioqnet_lb.devs[i],
+ i,
+ &ioqnet_lb.devs[!i]);
+ BUG_ON(ret < 0);
+ }
+
+ /* Then register them together */
+ for (i = 0; i < 2; i++) {
+ ret = pvbus_device_register(&ioqnet_lb.devs[i].dev);
+ BUG_ON(ret < 0);
+ }
+
+ return 0;
+}
+
+void __exit ioqnet_lb_cleanup(void)
+{
+ int i;
+
+ for (i = 0; i < 2; i++)
+ pvbus_device_unregister(&ioqnet_lb.devs[i].dev);
+
+}
+
+
+module_init(ioqnet_lb_init_module);
+module_exit(ioqnet_lb_cleanup);
* [PATCH 05/10] IRQ: Export create_irq/destroy_irq
From: Gregory Haskins @ 2007-08-16 23:14 UTC
To: kvm-devel-5NWGOfrQmneRv+LV9MX5uipxlwaOVQ5f
Signed-off-by: Gregory Haskins <ghaskins-Et1tbQHTxzrQT0dZR+AlfA@public.gmane.org>
---
arch/x86_64/kernel/io_apic.c | 2 ++
1 files changed, 2 insertions(+), 0 deletions(-)
diff --git a/arch/x86_64/kernel/io_apic.c b/arch/x86_64/kernel/io_apic.c
index d8bfe31..6bf8794 100644
--- a/arch/x86_64/kernel/io_apic.c
+++ b/arch/x86_64/kernel/io_apic.c
@@ -1849,6 +1849,7 @@ int create_irq(void)
}
return irq;
}
+EXPORT_SYMBOL(create_irq);
void destroy_irq(unsigned int irq)
{
@@ -1860,6 +1861,7 @@ void destroy_irq(unsigned int irq)
__clear_irq_vector(irq);
spin_unlock_irqrestore(&vector_lock, flags);
}
+EXPORT_SYMBOL(destroy_irq);
/*
* MSI mesage composition
* [PATCH 06/10] KVM: Add a guest side driver for IOQ
From: Gregory Haskins @ 2007-08-16 23:14 UTC
To: kvm-devel-5NWGOfrQmneRv+LV9MX5uipxlwaOVQ5f
Signed-off-by: Gregory Haskins <ghaskins-Et1tbQHTxzrQT0dZR+AlfA@public.gmane.org>
---
drivers/kvm/Kconfig | 28 +++
drivers/kvm/Makefile | 3
drivers/kvm/ioq.h | 39 +++++
drivers/kvm/ioq_guest.c | 195 +++++++++++++++++++++++
drivers/kvm/pvbus.h | 63 +++++++
drivers/kvm/pvbus_guest.c | 382 +++++++++++++++++++++++++++++++++++++++++++++
include/linux/kvm.h | 4
7 files changed, 706 insertions(+), 8 deletions(-)
diff --git a/drivers/kvm/Kconfig b/drivers/kvm/Kconfig
index 22d0eb4..aca79d1 100644
--- a/drivers/kvm/Kconfig
+++ b/drivers/kvm/Kconfig
@@ -47,16 +47,32 @@ config KVM_BALLOON
The driver inflate/deflate guest physical memory on demand.
This ability provides memory over commit for the host
-config KVM_NET
- tristate "Para virtual network device"
- depends on KVM
- ---help---
- Provides support for guest paravirtualization networking
-
config KVM_NET_HOST
tristate "Para virtual network host device"
depends on KVM
---help---
Provides support for host paravirtualization networking
+config KVM_GUEST
+ bool "KVM Guest support"
+ depends on X86
+ default y
+
+config KVM_PVBUS_GUEST
+ tristate "Paravirtualized Bus (PVBUS) support"
+ depends on KVM_GUEST
+ select IOQ
+ select PVBUS
+ ---help---
+ PVBUS is an infrastructure for generic PV drivers to take advantage
+ of an underlying hypervisor without having to understand the details
+ of the hypervisor itself. You only need this option if you plan to
+ run this kernel as a KVM guest.
+
+config KVM_NET
+ tristate "Para virtual network device"
+ depends on KVM && KVM_GUEST
+ ---help---
+ Provides support for guest paravirtualization networking
+
endif # VIRTUALIZATION
diff --git a/drivers/kvm/Makefile b/drivers/kvm/Makefile
index 92600d8..c6a59bb 100644
--- a/drivers/kvm/Makefile
+++ b/drivers/kvm/Makefile
@@ -14,4 +14,5 @@ kvm-net-objs = kvm_net.o
obj-$(CONFIG_KVM_NET) += kvm-net.o
kvm-net-host-objs = kvm_net_host.o
obj-$(CONFIG_KVM_NET_HOST) += kvm_net_host.o
-
+kvm-pvbus-objs := ioq_guest.o pvbus_guest.o
+obj-$(CONFIG_KVM_PVBUS_GUEST) += kvm-pvbus.o
diff --git a/drivers/kvm/ioq.h b/drivers/kvm/ioq.h
new file mode 100644
index 0000000..7e955f1
--- /dev/null
+++ b/drivers/kvm/ioq.h
@@ -0,0 +1,39 @@
+/*
+ * Copyright 2007 Novell. All Rights Reserved.
+ *
+ * See include/linux/ioq.h for documentation
+ *
+ * Author:
+ * Gregory Haskins <ghaskins-Et1tbQHTxzrQT0dZR+AlfA@public.gmane.org>
+ *
+ * This file is free software; you can redistribute it and/or modify
+ * it under the terms of version 2 of the GNU General Public License
+ * as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write to the Free Software Foundation,
+ * Inc., 51 Franklin St, Fifth Floor, Boston, MA 02110-1301, USA.
+ */
+
+#ifndef _KVM_IOQ_H_
+#define _KVM_IOQ_H_
+
+#include <linux/ioq.h>
+
+#define IOQHC_REGISTER 1
+#define IOQHC_UNREGISTER 2
+#define IOQHC_SIGNAL 3
+
+struct ioq_register {
+ ioq_id_t id;
+ u32 irq;
+ u64 ring;
+};
+
+
+#endif /* _KVM_IOQ_H_ */
diff --git a/drivers/kvm/ioq_guest.c b/drivers/kvm/ioq_guest.c
new file mode 100644
index 0000000..068aeb1
--- /dev/null
+++ b/drivers/kvm/ioq_guest.c
@@ -0,0 +1,195 @@
+/*
+ * Copyright 2007 Novell. All Rights Reserved.
+ *
+ * See include/linux/ioq.h for documentation
+ *
+ * Author:
+ * Gregory Haskins <ghaskins-Et1tbQHTxzrQT0dZR+AlfA@public.gmane.org>
+ *
+ * This file is free software; you can redistribute it and/or modify
+ * it under the terms of version 2 of the GNU General Public License
+ * as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write to the Free Software Foundation,
+ * Inc., 51 Franklin St, Fifth Floor, Boston, MA 02110-1301, USA.
+ */
+
+#include <linux/ioq.h>
+#include <asm/hypercall.h>
+
+#include "ioq.h"
+#include "kvm.h"
+
+struct kvmguest_ioq {
+ struct ioq ioq;
+ int irq;
+};
+
+static struct kvmguest_ioq *to_ioq(struct ioq *ioq)
+{
+ return container_of(ioq, struct kvmguest_ioq, ioq);
+}
+
+static int ioq_hypercall(unsigned long nr, void *data)
+{
+ return hypercall(2, __NR_hypercall_ioq, nr, __pa(data));
+}
+
+/*
+ * ------------------
+ * interrupt handler
+ * ------------------
+ */
+static irqreturn_t kvmguest_ioq_intr(int irq, void *dev)
+{
+	struct kvmguest_ioq *_ioq = dev; /* dev_id passed to request_irq() */
+
+ ioq_wakeup(&_ioq->ioq);
+
+ return IRQ_HANDLED;
+}
+
+/*
+ * ------------------
+ * ioq implementation
+ * ------------------
+ */
+
+static int kvmguest_ioq_signal(struct ioq *ioq)
+{
+ return ioq_hypercall(IOQHC_SIGNAL, &ioq->id);
+}
+
+static void kvmguest_ioq_destroy(struct ioq *ioq)
+{
+ struct kvmguest_ioq *_ioq = to_ioq(ioq);
+ int ret;
+
+ ret = ioq_hypercall(IOQHC_UNREGISTER, &ioq->id);
+	BUG_ON(ret < 0);
+
+	free_irq(_ioq->irq, _ioq); /* dev_id must match the one given to request_irq() */
+ destroy_irq(_ioq->irq);
+
+ kfree(_ioq->ioq.ring);
+ kfree(_ioq->ioq.head_desc);
+ kfree(_ioq);
+}
+
+/*
+ * ------------------
+ * ioqmgr implementation
+ * ------------------
+ */
+static int kvmguest_ioq_register(struct kvmguest_ioq *ioq, ioq_id_t id,
+ int irq, void *ring)
+{
+ struct ioq_register data = {
+ .id = id,
+ .irq = irq,
+ .ring = (u64)__pa(ring),
+ };
+
+ return ioq_hypercall(IOQHC_REGISTER, &data);
+}
+
+static int kvmguest_ioq_create(struct ioq_mgr *t, struct ioq **ioq,
+ size_t ringsize, int flags)
+{
+ struct kvmguest_ioq *_ioq = NULL;
+ struct ioq_ring_head *head_desc = NULL;
+ void *ring = NULL;
+ size_t ringlen = sizeof(struct ioq_ring_desc) * ringsize;
+ int ret = -ENOMEM;
+
+ _ioq = kzalloc(sizeof(*_ioq), GFP_KERNEL);
+ if (!_ioq)
+ goto error;
+
+ head_desc = kzalloc(sizeof(*head_desc), GFP_KERNEL | GFP_DMA);
+ if (!head_desc)
+ goto error;
+
+ ring = kzalloc(ringlen, GFP_KERNEL | GFP_DMA);
+ if (!ring)
+ goto error;
+
+ head_desc->magic = IOQ_RING_MAGIC;
+ head_desc->ver = IOQ_RING_VER;
+ head_desc->id = (ioq_id_t)_ioq;
+ head_desc->count = ringsize;
+ head_desc->ptr = (u64)__pa(ring);
+
+	/* Dynamically assign a free IRQ to this resource */
+	_ioq->irq = create_irq();
+	if (_ioq->irq < 0) {
+		ret = _ioq->irq;
+		goto error;
+	}
+
+ ioq_init(&_ioq->ioq);
+
+ _ioq->ioq.signal = kvmguest_ioq_signal;
+ _ioq->ioq.destroy = kvmguest_ioq_destroy;
+
+ _ioq->ioq.id = head_desc->id;
+ _ioq->ioq.locale = ioq_locality_north;
+ _ioq->ioq.mgr = t;
+ _ioq->ioq.head_desc = head_desc;
+ _ioq->ioq.ring = ring;
+
+	ret = request_irq(_ioq->irq, kvmguest_ioq_intr, 0, "KVM-IOQ", _ioq);
+	if (ret < 0)
+		goto error_irq;
+
+	ret = kvmguest_ioq_register(_ioq, _ioq->ioq.id, _ioq->irq, ring);
+	if (ret < 0)
+		goto error_free_irq;
+
+	*ioq = &_ioq->ioq;
+
+	return 0;
+
+ error_free_irq:
+	free_irq(_ioq->irq, _ioq);
+ error_irq:
+	destroy_irq(_ioq->irq);
+ error:
+	/* kfree(NULL) is a no-op, so the allocations can be freed unconditionally */
+	kfree(_ioq);
+	kfree(head_desc);
+	kfree(ring);
+
+	return ret;
+}
+
+static int kvmguest_ioq_connect(struct ioq_mgr *t, ioq_id_t id,
+ struct ioq **ioq, int flags)
+{
+ /* You cannot connect to queues on the guest */
+ return -EINVAL;
+
+}
+
+int kvmguest_ioqmgr_alloc(struct ioq_mgr **mgr)
+{
+ struct ioq_mgr *_mgr = kzalloc(sizeof(*_mgr), GFP_KERNEL);
+ if (!_mgr)
+ return -ENOMEM;
+
+ _mgr->create = kvmguest_ioq_create;
+ _mgr->connect = kvmguest_ioq_connect;
+
+ *mgr = _mgr;
+
+ return 0;
+}
+
+void kvmguest_ioqmgr_free(struct ioq_mgr *mgr)
+{
+ kfree(mgr);
+}
+
diff --git a/drivers/kvm/pvbus.h b/drivers/kvm/pvbus.h
new file mode 100644
index 0000000..3241ef0
--- /dev/null
+++ b/drivers/kvm/pvbus.h
@@ -0,0 +1,63 @@
+/*
+ * Copyright 2007 Novell. All Rights Reserved.
+ *
+ * Author:
+ * Gregory Haskins <ghaskins-Et1tbQHTxzrQT0dZR+AlfA@public.gmane.org>
+ *
+ * This file is free software; you can redistribute it and/or modify
+ * it under the terms of version 2 of the GNU General Public License
+ * as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write to the Free Software Foundation,
+ * Inc., 51 Franklin St, Fifth Floor, Boston, MA 02110-1301, USA.
+ */
+
+#ifndef _KVM_PVBUS_H
+#define _KVM_PVBUS_H
+
+#include <linux/ioq.h>
+
+#define KVM_PVBUS_OP_REGISTER 1
+#define KVM_PVBUS_OP_UNREGISTER 2
+#define KVM_PVBUS_OP_CALL 3
+
+struct pvbus_register_params {
+ ioq_id_t qid;
+};
+
+struct pvbus_call_params {
+ u64 inst;
+ u32 func;
+ u64 data;
+ u64 len;
+};
+
+#define KVM_PVBUS_EVENT_ADD 1
+#define KVM_PVBUS_EVENT_DROP 2
+
+#define PVBUS_MAX_NAME 128
+
+struct pvbus_add_event {
+ char name[PVBUS_MAX_NAME];
+ u64 id;
+};
+
+struct pvbus_drop_event {
+ u64 id;
+};
+
+struct pvbus_event {
+ u32 eventid;
+ union {
+ struct pvbus_add_event add;
+ struct pvbus_drop_event drop;
+	} data;
+};
+
+#endif /* _KVM_PVBUS_H */
diff --git a/drivers/kvm/pvbus_guest.c b/drivers/kvm/pvbus_guest.c
new file mode 100644
index 0000000..56c3b50
--- /dev/null
+++ b/drivers/kvm/pvbus_guest.c
@@ -0,0 +1,382 @@
+/*
+ * Copyright 2007 Novell. All Rights Reserved.
+ *
+ * Author:
+ * Gregory Haskins <ghaskins-Et1tbQHTxzrQT0dZR+AlfA@public.gmane.org>
+ *
+ * This file is free software; you can redistribute it and/or modify
+ * it under the terms of version 2 of the GNU General Public License
+ * as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write to the Free Software Foundation,
+ * Inc., 51 Franklin St, Fifth Floor, Boston, MA 02110-1301, USA.
+ */
+
+#include <linux/module.h>
+#include <linux/pvbus.h>
+#include <linux/kvm_para.h>
+#include <linux/kvm.h>
+#include <linux/mm.h>
+#include <linux/ioq.h>
+#include <linux/interrupt.h>
+
+#include <asm/hypercall.h>
+
+#include "pvbus.h"
+
+MODULE_AUTHOR ("Gregory Haskins");
+MODULE_LICENSE("GPL");
+MODULE_VERSION("1");
+
+int kvmguest_ioqmgr_alloc(struct ioq_mgr **mgr);
+void kvmguest_ioqmgr_free(struct ioq_mgr *mgr);
+
+static int kvm_pvbus_hypercall(unsigned long nr, void *data, unsigned long len)
+{
+ return hypercall(3, __NR_hypercall_pvbus, nr, __pa(data), len);
+}
+
+/*
+ * This is the vm-syscall address - to be patched by the host to
+ * VMCALL (Intel) or VMMCALL (AMD), depending on the CPU model:
+ */
+asm (
+ " .globl hypercall_addr \n"
+ " .align 4 \n"
+ " hypercall_addr: \n"
+ " movl $-38, %eax \n"
+ " ret \n"
+);
+
+extern unsigned char hypercall_addr[6];
+
+#ifndef CONFIG_X86_64
+static DEFINE_PER_CPU(struct kvm_vcpu_para_state, para_state);
+#endif
+
+static int __init kvm_pvbus_probe(void)
+{
+ struct page *hypercall_addr_page;
+ struct kvm_vcpu_para_state *para_state;
+
+#ifdef CONFIG_X86_64
+ struct page *pstate_page;
+ if ((pstate_page = alloc_page(GFP_KERNEL)) == NULL)
+ return -ENOMEM;
+ para_state = (struct kvm_vcpu_para_state*)page_address(pstate_page);
+#else
+	para_state = &per_cpu(para_state, raw_smp_processor_id());
+#endif
+ /*
+ * Try to write to a magic MSR (which is invalid on any real CPU),
+	 * and thus signal to KVM that we wish to enter para-virtualized
+ * mode:
+ */
+ para_state->guest_version = KVM_PARA_API_VERSION;
+ para_state->host_version = -1;
+ para_state->size = sizeof(*para_state);
+ para_state->ret = -1;
+
+ hypercall_addr_page = vmalloc_to_page(hypercall_addr);
+ para_state->hypercall_gpa = page_to_pfn(hypercall_addr_page)
+ << PAGE_SHIFT | offset_in_page(hypercall_addr);
+ printk(KERN_DEBUG "kvm guest: hypercall gpa is 0x%lx\n",
+ (long)para_state->hypercall_gpa);
+
+ if (wrmsr_safe(MSR_KVM_API_MAGIC, __pa(para_state), 0)) {
+ printk(KERN_INFO "KVM guest: WRMSR probe failed.\n");
+ return -1;
+ }
+
+ printk(KERN_DEBUG "kvm guest: host returned %d\n",
+ para_state->ret);
+ printk(KERN_DEBUG "kvm guest: host version: %d\n",
+ para_state->host_version);
+ printk(KERN_DEBUG "kvm guest: syscall entry: %02x %02x %02x %02x\n",
+ hypercall_addr[0], hypercall_addr[1],
+ hypercall_addr[2], hypercall_addr[3]);
+
+ if (para_state->ret) {
+ printk(KERN_ERR "kvm guest: host refused registration.\n");
+ return -1;
+ }
+
+ return 0;
+
+}
+
+struct kvm_pvbus {
+ int connected;
+ struct ioq_mgr *ioqmgr;
+ struct ioq *ioq;
+ struct ioq_notifier ioqn;
+ struct tasklet_struct task;
+};
+
+static struct kvm_pvbus kvm_pvbus;
+
+struct kvm_pvbus_device {
+ struct pvbus_device pvbdev;
+ char name[PVBUS_MAX_NAME];
+};
+
+static int kvm_pvbus_createqueue(struct pvbus_device *dev, struct ioq **ioq,
+ size_t ringsize, int flags)
+{
+ struct ioq_mgr *ioqmgr = kvm_pvbus.ioqmgr;
+
+ return ioqmgr->create(ioqmgr, ioq, ringsize, flags);
+}
+
+static int kvm_pvbus_call(struct pvbus_device *dev, u32 func, void *data,
+ size_t len, int flags)
+{
+ struct pvbus_call_params params = {
+ .inst = dev->id,
+ .func = func,
+ .data = (u64)__pa(data),
+ .len = len,
+ };
+
+	return kvm_pvbus_hypercall(KVM_PVBUS_OP_CALL, &params, sizeof(params));
+}
+
+static void kvm_pvbus_add_event(struct pvbus_add_event *event)
+{
+ int ret;
+ struct kvm_pvbus_device *new = kzalloc(sizeof(*new), GFP_KERNEL);
+ if (!new) {
+		printk(KERN_ERR "KVM_PVBUS: Out of memory on add_event\n");
+ return;
+ }
+
+ memcpy(new->name, event->name, PVBUS_MAX_NAME);
+ new->pvbdev.name = new->name;
+ new->pvbdev.id = event->id;
+ new->pvbdev.createqueue = kvm_pvbus_createqueue;
+ new->pvbdev.call = kvm_pvbus_call;
+
+	sprintf(new->pvbdev.dev.bus_id, "%llu", (unsigned long long)event->id);
+
+ ret = pvbus_device_register(&new->pvbdev);
+ BUG_ON(ret < 0);
+}
+
+static void kvm_pvbus_drop_event(struct pvbus_drop_event *event)
+{
+#if 0 /* FIXME */
+ int ret = pvbus_device_unregister(event->id);
+ BUG_ON(ret < 0);
+#endif
+}
+
+/* INTR-Layer2: Invoked whenever layer 1 schedules our tasklet */
+static void kvm_pvbus_intr_l2(unsigned long _data)
+{
+ struct ioq_iterator iter;
+ int ret;
+
+ /* We want to iterate on the tail of the in-use index */
+ ret = ioq_iter_init(kvm_pvbus.ioq, &iter, ioq_idxtype_inuse, 0);
+ BUG_ON(ret < 0);
+
+ ret = ioq_iter_seek(&iter, ioq_seek_tail, 0, 0);
+ BUG_ON(ret < 0);
+
+ /*
+ * The EOM is indicated by finding a packet that is still owned by
+ * the south side.
+ *
+	 * FIXME: This could in theory run indefinitely if the host keeps
+	 * feeding us events, since there is nothing like a NAPI budget. We
+	 * might need to address that.
+ */
+ while (!iter.desc->sown) {
+ struct ioq_ring_desc *desc = iter.desc;
+		struct pvbus_event *event = (struct pvbus_event *)(unsigned long)desc->cookie;
+
+ switch (event->eventid) {
+ case KVM_PVBUS_EVENT_ADD:
+ kvm_pvbus_add_event(&event->data.add);
+ break;
+ case KVM_PVBUS_EVENT_DROP:
+ kvm_pvbus_drop_event(&event->data.drop);
+ break;
+ default:
+ printk(KERN_WARNING "KVM_PVBUS: Unexpected event %d\n",
+ event->eventid);
+ break;
+ };
+
+ memset(event, 0, sizeof(*event));
+
+ mb();
+ desc->sown = 1; /* give ownership back to the south */
+ mb();
+
+ /* Advance the in-use tail */
+ ret = ioq_iter_pop(&iter, 0);
+ BUG_ON(ret < 0);
+ }
+
+ /* And let the south side know that we changed the rx-queue */
+ ioq_signal(kvm_pvbus.ioq, 0);
+}
+
+/* INTR-Layer1: Invoked whenever the host issues an ioq_signal() */
+static void kvm_pvbus_intr_l1(struct ioq_notifier *ioqn)
+{
+ tasklet_schedule(&kvm_pvbus.task);
+}
+
+static int __init kvm_pvbus_register(void)
+{
+ struct pvbus_register_params params = {
+ .qid = kvm_pvbus.ioq->id,
+ };
+
+	return kvm_pvbus_hypercall(KVM_PVBUS_OP_REGISTER,
+				   &params, sizeof(params));
+}
+
+static int __init kvm_pvbus_setup_ring(void)
+{
+ struct ioq *ioq = kvm_pvbus.ioq;
+ struct ioq_iterator iter;
+ int ret;
+
+ /*
+ * We want to iterate on the "valid" index. By default the iterator
+ * will not "autoupdate" which means it will not hypercall the host
+ * with our changes. This is good, because we are really just
+ * initializing stuff here anyway. Note that you can always manually
+ * signal the host with ioq_signal() if the autoupdate feature is not
+ * used.
+ */
+ ret = ioq_iter_init(ioq, &iter, ioq_idxtype_valid, 0);
+ BUG_ON(ret < 0);
+
+ /*
+ * Seek to the head of the valid index (which should be our first
+ * item since the queue is brand-new)
+ */
+ ret = ioq_iter_seek(&iter, ioq_seek_head, 0, 0);
+ BUG_ON(ret < 0);
+
+ /*
+ * Now populate each descriptor with an empty pvbus_event and mark it
+ * valid
+ */
+ while (!iter.desc->valid) {
+ struct pvbus_event *event;
+ size_t len = sizeof(*event);
+ struct ioq_ring_desc *desc = iter.desc;
+
+ event = kzalloc(sizeof(*event), GFP_KERNEL);
+ if (!event)
+ return -ENOMEM;
+
+		desc->cookie = (u64)(unsigned long)event;
+ desc->ptr = (u64)__pa(event);
+ desc->len = len; /* total length */
+ desc->alen = 0; /* actual length - filled in by host */
+
+ /*
+ * We don't need any barriers here because the ring is not used
+ * yet
+ */
+ desc->valid = 1;
+ desc->sown = 1; /* give ownership to the south */
+
+ /*
+ * This push operation will simultaneously advance the
+ * valid-head index and increment our position in the queue
+ * by one.
+ */
+ ret = ioq_iter_push(&iter, 0);
+ BUG_ON(ret < 0);
+ }
+
+ return 0;
+}
+
+int __init kvm_pvbus_init(void)
+{
+ struct ioq_mgr *ioqmgr = NULL;
+ int ret;
+
+ memset(&kvm_pvbus, 0, sizeof(kvm_pvbus));
+
+ ret = kvm_pvbus_probe();
+ if (ret < 0)
+ return ret;
+
+ kvm_pvbus.connected = 1;
+
+ /* Allocate an IOQ-manager to use for all operations */
+ ret = kvmguest_ioqmgr_alloc(&ioqmgr);
+ if (ret < 0) {
+ printk(KERN_ERR "KVM_PVBUS: Could not create ioqmgr\n");
+ return ret;
+ }
+
+ kvm_pvbus.ioqmgr = ioqmgr;
+
+ /* Now allocate an IOQ to use for hotplug notification */
+ ret = ioqmgr->create(ioqmgr, &kvm_pvbus.ioq, 32, 0);
+ if (ret < 0) {
+		printk(KERN_ERR "KVM_PVBUS: Could not create hotplug ioq\n");
+ goto out_fail;
+ }
+
+ ret = kvm_pvbus_setup_ring();
+ if (ret < 0) {
+		printk(KERN_ERR "KVM_PVBUS: Could not set up ring\n");
+ goto out_fail;
+ }
+
+ /* Setup our interrupt callback */
+ kvm_pvbus.ioqn.signal = kvm_pvbus_intr_l1;
+ kvm_pvbus.ioq->notifier = &kvm_pvbus.ioqn;
+ tasklet_init(&kvm_pvbus.task, kvm_pvbus_intr_l2, 0);
+
+ /*
+ * Finally register our queue on the host to start receiving hotplug
+ * updates
+ */
+ ret = kvm_pvbus_register();
+ if (ret < 0) {
+ printk(KERN_ERR "KVM_PVBUS: Could not register with host\n");
+ goto out_fail;
+ }
+
+ return 0;
+
+ out_fail:
+ kvmguest_ioqmgr_free(ioqmgr);
+
+ return ret;
+
+}
+
+static void __exit kvm_pvbus_exit(void)
+{
+ if (kvm_pvbus.connected)
+ kvm_pvbus_hypercall(KVM_PVBUS_OP_UNREGISTER, NULL, 0);
+
+ if (kvm_pvbus.ioq)
+ kvm_pvbus.ioq->destroy(kvm_pvbus.ioq);
+
+ kvmguest_ioqmgr_free(kvm_pvbus.ioqmgr);
+}
+
+module_init(kvm_pvbus_init);
+module_exit(kvm_pvbus_exit);
diff --git a/include/linux/kvm.h b/include/linux/kvm.h
index 992aeec..bc2b51e 100755
--- a/include/linux/kvm.h
+++ b/include/linux/kvm.h
@@ -377,13 +377,15 @@ struct kvm_pvnet_config {
* No registers are clobbered by the hypercall, except that the
* return value is in RAX.
*/
-#define KVM_NR_HYPERCALLS 5
+#define KVM_NR_HYPERCALLS 7
#define __NR_hypercall_test 0
#define __NR_hypercall_register_eth 1
#define __NR_hypercall_send_eth 2
#define __NR_hypercall_set_multicast_eth 3
#define __NR_hypercall_start_stop_eth 4
+#define __NR_hypercall_ioq 5
+#define __NR_hypercall_pvbus 6
#define __NR_hypercall_balloon (KVM_NR_HYPERCALLS + 0)
-------------------------------------------------------------------------
This SF.net email is sponsored by: Splunk Inc.
Still grepping through log files to find problems? Stop.
Now Search log events and configuration files using AJAX and a browser.
Download your FREE copy of Splunk now >> http://get.splunk.com/
* [PATCH 07/10] KVM: Add a gpa_to_hva helper function
[not found] ` <20070816231357.8044.55943.stgit-5CR4LY5GPkvLDviKLk5550HKjMygAv58XqFh9Ls21Oc@public.gmane.org>
` (5 preceding siblings ...)
2007-08-16 23:14 ` [PATCH 06/10] KVM: Add a guest side driver for IOQ Gregory Haskins
@ 2007-08-16 23:14 ` Gregory Haskins
2007-08-16 23:14 ` [PATCH 08/10] KVM: Add support for IOQ Gregory Haskins
` (3 subsequent siblings)
10 siblings, 0 replies; 41+ messages in thread
From: Gregory Haskins @ 2007-08-16 23:14 UTC (permalink / raw)
To: kvm-devel-5NWGOfrQmneRv+LV9MX5uipxlwaOVQ5f
Signed-off-by: Gregory Haskins <ghaskins-Et1tbQHTxzrQT0dZR+AlfA@public.gmane.org>
---
drivers/kvm/kvm.h | 1 +
drivers/kvm/mmu.c | 12 ++++++++++++
2 files changed, 13 insertions(+), 0 deletions(-)
diff --git a/drivers/kvm/kvm.h b/drivers/kvm/kvm.h
index 9934f11..05d5be1 100755
--- a/drivers/kvm/kvm.h
+++ b/drivers/kvm/kvm.h
@@ -475,6 +475,7 @@ void vcpu_load(struct kvm_vcpu *vcpu);
void vcpu_put(struct kvm_vcpu *vcpu);
hpa_t gpa_to_hpa(struct kvm_vcpu *vcpu, gpa_t gpa);
+void *gpa_to_hva(struct kvm *kvm, gpa_t gpa);
#define HPA_MSB ((sizeof(hpa_t) * 8) - 1)
#define HPA_ERR_MASK ((hpa_t)1 << HPA_MSB)
static inline int is_error_hpa(hpa_t hpa) { return hpa >> HPA_MSB; }
diff --git a/drivers/kvm/mmu.c b/drivers/kvm/mmu.c
index e84c599..daaf0d2 100644
--- a/drivers/kvm/mmu.c
+++ b/drivers/kvm/mmu.c
@@ -766,6 +766,18 @@ hpa_t gpa_to_hpa(struct kvm_vcpu *vcpu, gpa_t gpa)
}
EXPORT_SYMBOL_GPL(gpa_to_hpa);
+void *gpa_to_hva(struct kvm *kvm, gpa_t gpa)
+{
+	struct page *page;
+
+	if (gpa & HPA_ERR_MASK)
+		return NULL;
+
+	page = gfn_to_page(kvm, gpa >> PAGE_SHIFT);
+	return kmap_atomic(page, KM_USER0) + (gpa & ~PAGE_MASK);
+}
+EXPORT_SYMBOL_GPL(gpa_to_hva);
+
hpa_t gva_to_hpa(struct kvm_vcpu *vcpu, gva_t gva)
{
gpa_t gpa = vcpu->mmu.gva_to_gpa(vcpu, gva);
* [PATCH 08/10] KVM: Add support for IOQ
[not found] ` <20070816231357.8044.55943.stgit-5CR4LY5GPkvLDviKLk5550HKjMygAv58XqFh9Ls21Oc@public.gmane.org>
` (6 preceding siblings ...)
2007-08-16 23:14 ` [PATCH 07/10] KVM: Add a gpa_to_hva helper function Gregory Haskins
@ 2007-08-16 23:14 ` Gregory Haskins
2007-08-16 23:14 ` [PATCH 09/10] KVM: Add PVBUS support to the KVM host Gregory Haskins
` (2 subsequent siblings)
10 siblings, 0 replies; 41+ messages in thread
From: Gregory Haskins @ 2007-08-16 23:14 UTC (permalink / raw)
To: kvm-devel-5NWGOfrQmneRv+LV9MX5uipxlwaOVQ5f
IOQ is a shared-memory-queue interface for implementing PV driver
communication.
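As a rough illustration of the idea (the names and layout below are simplified stand-ins, not the actual ioq ABI from this patch), a producer only writes into ring slots it owns and flips an ownership bit last, mirroring the "sown" handshake used throughout this series:

```c
#include <assert.h>
#include <string.h>

/*
 * Hypothetical, simplified descriptor: just the ownership-bit concept.
 * sown == 1 means the south (host) side owns the slot.
 */
struct demo_desc {
	unsigned long long ptr;	/* physical address of the payload */
	unsigned int len;	/* payload length */
	unsigned int sown;	/* 1 = host-owned, 0 = guest-owned */
};

#define DEMO_RING_SIZE 4

/* Guest-side enqueue: claim the head slot only if the guest owns it. */
static int demo_enqueue(struct demo_desc *ring, int *head,
			unsigned long long ptr, unsigned int len)
{
	struct demo_desc *d = &ring[*head % DEMO_RING_SIZE];

	if (d->sown)
		return -1;	/* host still owns the slot: ring is full */

	d->ptr = ptr;
	d->len = len;
	d->sown = 1;		/* hand the slot to the host last */
	(*head)++;
	return 0;
}
```

In the real code the ownership flip is bracketed by memory barriers and followed by a hypercall-backed ioq_signal(); this sketch omits both.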
Signed-off-by: Gregory Haskins <ghaskins-Et1tbQHTxzrQT0dZR+AlfA@public.gmane.org>
---
drivers/kvm/Kconfig | 5 +
drivers/kvm/Makefile | 3
drivers/kvm/ioq.h | 12 +-
drivers/kvm/ioq_host.c | 365 ++++++++++++++++++++++++++++++++++++++++++++++++
drivers/kvm/kvm.h | 5 +
drivers/kvm/kvm_main.c | 3
include/linux/kvm.h | 1
7 files changed, 393 insertions(+), 1 deletions(-)
diff --git a/drivers/kvm/Kconfig b/drivers/kvm/Kconfig
index aca79d1..d9def33 100644
--- a/drivers/kvm/Kconfig
+++ b/drivers/kvm/Kconfig
@@ -47,6 +47,11 @@ config KVM_BALLOON
The driver inflate/deflate guest physical memory on demand.
This ability provides memory over commit for the host
+config KVM_IOQ_HOST
+ boolean "Add IOQ support to KVM"
+ depends on KVM
+ select IOQ
+
config KVM_NET_HOST
tristate "Para virtual network host device"
depends on KVM
diff --git a/drivers/kvm/Makefile b/drivers/kvm/Makefile
index c6a59bb..2095061 100644
--- a/drivers/kvm/Makefile
+++ b/drivers/kvm/Makefile
@@ -4,6 +4,9 @@
EXTRA_CFLAGS :=
kvm-objs := kvm_main.o mmu.o x86_emulate.o
+ifeq ($(CONFIG_KVM_IOQ_HOST),y)
+kvm-objs += ioq_host.o
+endif
obj-$(CONFIG_KVM) += kvm.o
kvm-intel-objs = vmx.o
obj-$(CONFIG_KVM_INTEL) += kvm-intel.o
diff --git a/drivers/kvm/ioq.h b/drivers/kvm/ioq.h
index 7e955f1..b942113 100644
--- a/drivers/kvm/ioq.h
+++ b/drivers/kvm/ioq.h
@@ -25,7 +25,17 @@
#include <linux/ioq.h>
-#define IOQHC_REGISTER 1
+struct kvm;
+
+#ifdef CONFIG_KVM_IOQ_HOST
+int kvmhost_ioqmgr_init(struct kvm *kvm);
+int kvmhost_ioqmgr_module_init(void);
+#else
+static inline int kvmhost_ioqmgr_init(struct kvm *kvm) { return 0; }
+static inline int kvmhost_ioqmgr_module_init(void) { return 0; }
+#endif
+
+#define IOQHC_REGISTER 1
#define IOQHC_UNREGISTER 2
#define IOQHC_SIGNAL 3
diff --git a/drivers/kvm/ioq_host.c b/drivers/kvm/ioq_host.c
new file mode 100644
index 0000000..413f103
--- /dev/null
+++ b/drivers/kvm/ioq_host.c
@@ -0,0 +1,365 @@
+/*
+ * Copyright 2007 Novell. All Rights Reserved.
+ *
+ * See include/linux/ioq.h for documentation
+ *
+ * Author:
+ * Gregory Haskins <ghaskins-Et1tbQHTxzrQT0dZR+AlfA@public.gmane.org>
+ *
+ * This file is free software; you can redistribute it and/or modify
+ * it under the terms of version 2 of the GNU General Public License
+ * as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write to the Free Software Foundation,
+ * Inc., 51 Franklin St, Fifth Floor, Boston, MA 02110-1301, USA.
+ */
+
+#include <linux/ioq.h>
+#include <linux/rbtree.h>
+#include <linux/spinlock.h>
+#include <linux/highmem.h>
+
+#include <asm/atomic.h>
+
+#include "ioq.h"
+#include "kvm.h"
+
+struct kvmhost_ioq {
+ struct ioq ioq;
+ struct rb_node node;
+ atomic_t refcnt;
+ struct kvm_vcpu *vcpu;
+ int irq;
+};
+
+struct kvmhost_map {
+ spinlock_t lock;
+ struct rb_root root;
+};
+
+struct kvmhost_ioq_mgr {
+ struct ioq_mgr mgr;
+ struct kvm *kvm;
+ struct kvmhost_map map;
+};
+
+static struct kvmhost_ioq *to_ioq(struct ioq *ioq)
+{
+ return container_of(ioq, struct kvmhost_ioq, ioq);
+}
+
+static struct kvmhost_ioq_mgr *to_mgr(struct ioq_mgr *mgr)
+{
+ return container_of(mgr, struct kvmhost_ioq_mgr, mgr);
+}
+
+/*
+ * ------------------
+ * rb map management
+ * ------------------
+ */
+
+static void kvmhost_map_init(struct kvmhost_map *map)
+{
+ spin_lock_init(&map->lock);
+ map->root = RB_ROOT;
+}
+
+static int kvmhost_map_register(struct kvmhost_map *map,
+ struct kvmhost_ioq *ioq)
+{
+ int ret = 0;
+ struct rb_root *root;
+ struct rb_node **new, *parent = NULL;
+
+ spin_lock(&map->lock);
+
+ root = &map->root;
+ new = &(root->rb_node);
+
+ /* Figure out where to put new node */
+ while (*new) {
+ struct kvmhost_ioq *this;
+
+ this = container_of(*new, struct kvmhost_ioq, node);
+ parent = *new;
+
+ if (ioq->ioq.id < this->ioq.id)
+ new = &((*new)->rb_left);
+ else if (ioq->ioq.id > this->ioq.id)
+ new = &((*new)->rb_right);
+ else {
+ ret = -EEXIST;
+ break;
+ }
+ }
+
+ if (!ret) {
+ /* Add new node and rebalance tree. */
+ rb_link_node(&ioq->node, parent, new);
+ rb_insert_color(&ioq->node, root);
+ }
+
+ spin_unlock(&map->lock);
+
+ return ret;
+}
+
+static struct kvmhost_ioq* kvmhost_map_find(struct kvmhost_map *map,
+ ioq_id_t id)
+{
+ struct rb_node *node;
+ struct kvmhost_ioq *ioq = NULL;
+
+ spin_lock(&map->lock);
+
+ node = map->root.rb_node;
+
+ while (node) {
+ struct kvmhost_ioq *_ioq;
+
+ _ioq = container_of(node, struct kvmhost_ioq, node);
+
+		if (_ioq->ioq.id < id)
+ node = node->rb_left;
+ else if (_ioq->ioq.id > id)
+ node = node->rb_right;
+ else {
+ ioq = _ioq;
+ break;
+ }
+ }
+
+ spin_unlock(&map->lock);
+
+ return ioq;
+}
+
+static void kvmhost_map_erase(struct kvmhost_map *map,
+ struct kvmhost_ioq *ioq)
+{
+ spin_lock(&map->lock);
+ rb_erase(&ioq->node, &map->root);
+ spin_unlock(&map->lock);
+}
+
+/*
+ * ------------------
+ * ioq implementation
+ * ------------------
+ */
+
+static int kvmhost_ioq_signal(struct ioq *ioq)
+{
+ struct kvmhost_ioq *_ioq = to_ioq(ioq);
+ BUG_ON(!_ioq);
+
+ /*
+ * FIXME: Inject an interrupt to the guest for "id"
+ *
+ * We will have to decide if we will have 1:1 IOQ:IRQ, or if we
+ * will aggregate all IOQs through a single IRQ. For purposes of
+ * example, we will assume 1:1.
+ */
+
+ /* kvm_vcpu_send_interrupt(_ioq->vcpu, _ioq->irq); */
+
+ return 0;
+}
+
+static void kvmhost_ioq_destroy(struct ioq *ioq)
+{
+ struct kvmhost_ioq *_ioq = to_ioq(ioq);
+
+ if (atomic_dec_and_test(&_ioq->refcnt))
+ kfree(_ioq);
+}
+
+static struct kvmhost_ioq* kvmhost_ioq_alloc(struct ioq_mgr *t,
+ struct kvm_vcpu *vcpu,
+ ioq_id_t id, int irq, gpa_t ring)
+{
+ struct kvmhost_ioq *_ioq;
+ struct ioq *ioq;
+
+ _ioq = kzalloc(sizeof(*_ioq), GFP_KERNEL);
+ if (!_ioq)
+ return NULL;
+
+ ioq = &_ioq->ioq;
+
+ atomic_set(&_ioq->refcnt, 1);
+ _ioq->vcpu = vcpu;
+ _ioq->irq = irq;
+
+ ioq_init(&_ioq->ioq);
+
+ ioq->signal = kvmhost_ioq_signal;
+ ioq->destroy = kvmhost_ioq_destroy;
+
+ ioq->id = id;
+ ioq->locale = ioq_locality_south;
+ ioq->mgr = t;
+ ioq->head_desc = (struct ioq_ring_head*)gpa_to_hva(vcpu->kvm, ring);
+ ioq->ring = (struct ioq_ring_desc*)gpa_to_hva(vcpu->kvm,
+ ioq->head_desc->ptr);
+
+ return _ioq;
+}
+
+/*
+ * ------------------
+ * hypercall implementation
+ * ------------------
+ */
+
+static int kvmhost_ioq_hc_register(struct ioq_mgr *t, struct kvm_vcpu *vcpu,
+ ioq_id_t id, int irq, gpa_t ring)
+{
+ struct kvmhost_ioq *_ioq = kvmhost_ioq_alloc(t, vcpu, id, irq, ring);
+ int ret;
+
+ if (!_ioq)
+ return -ENOMEM;
+
+	ret = kvmhost_map_register(&to_mgr(t)->map, _ioq);
+	if (ret < 0)
+		kvmhost_ioq_destroy(&_ioq->ioq);
+
+	return ret;
+}
+
+static int kvmhost_ioq_hc_unregister(struct ioq_mgr *t, ioq_id_t id)
+{
+ struct kvmhost_ioq_mgr *_mgr = to_mgr(t);
+ struct kvmhost_ioq *_ioq = kvmhost_map_find(&_mgr->map, id);
+
+ if (!_ioq)
+ return -ENOENT;
+
+ kvmhost_map_erase(&_mgr->map, _ioq);
+ kvmhost_ioq_destroy(&_ioq->ioq);
+
+ return 0;
+}
+
+static int kvmhost_ioq_hc_signal(struct ioq_mgr *t, ioq_id_t id)
+{
+ struct kvmhost_ioq *_ioq = kvmhost_map_find(&to_mgr(t)->map, id);
+
+	if (!_ioq)
+		return -ENOENT;
+
+ ioq_wakeup(&_ioq->ioq);
+
+ return 0;
+}
+
+/*
+ * Our hypercall format will always follow with the call-id in arg[0] and
+ * a pointer to the arguments in arg[1]
+ */
+static unsigned long kvmhost_hc(struct kvm_vcpu *vcpu, unsigned long args[])
+{
+ struct ioq_mgr *t = vcpu->kvm->ioqmgr;
+ void *vdata = gpa_to_hva(vcpu->kvm, args[1]);
+ int ret = -EINVAL;
+
+ if (!vdata)
+ return -EINVAL;
+
+ /*
+ * FIXME: we need to make sure that the pointer is sane
+ * so a malicious guest cannot crash the host.
+ */
+
+	switch (args[0]) {
+	case IOQHC_REGISTER: {
+		struct ioq_register *data = (struct ioq_register*)vdata;
+		ret = kvmhost_ioq_hc_register(t, vcpu,
+					      data->id,
+					      data->irq,
+					      data->ring);
+		break;
+	}
+	case IOQHC_UNREGISTER: {
+		ioq_id_t *id = (ioq_id_t*)vdata;
+		ret = kvmhost_ioq_hc_unregister(t, *id);
+		break;
+	}
+	case IOQHC_SIGNAL: {
+		ioq_id_t *id = (ioq_id_t*)vdata;
+		ret = kvmhost_ioq_hc_signal(t, *id);
+		break;
+	}
+	}
+
+ /* FIXME: unmap the vdata? */
+
+ return ret;
+}
+
+/*
+ * ------------------
+ * ioqmgr implementation
+ * ------------------
+ */
+
+static int kvmhost_ioq_create(struct ioq_mgr *t, struct ioq **ioq,
+ size_t ringsize, int flags)
+{
+ /* You cannot create queues on the host */
+ return -EINVAL;
+}
+
+static int kvmhost_ioq_connect(struct ioq_mgr *t, ioq_id_t id,
+ struct ioq **ioq, int flags)
+{
+ struct kvmhost_ioq *_ioq = kvmhost_map_find(&to_mgr(t)->map, id);
+
+	if (!_ioq)
+		return -ENOENT;
+
+ atomic_inc(&_ioq->refcnt);
+ *ioq = &_ioq->ioq;
+
+ return 0;
+
+}
+
+int kvmhost_ioqmgr_init(struct kvm *kvm)
+{
+ struct kvmhost_ioq_mgr *_mgr = kzalloc(sizeof(*_mgr), GFP_KERNEL);
+ if (!_mgr)
+ return -ENOMEM;
+
+ _mgr->kvm = kvm;
+ kvmhost_map_init(&_mgr->map);
+
+ _mgr->mgr.create = kvmhost_ioq_create;
+ _mgr->mgr.connect = kvmhost_ioq_connect;
+
+ kvm->ioqmgr = &_mgr->mgr;
+
+ return 0;
+}
+
+__init int kvmhost_ioqmgr_module_init(void)
+{
+ struct kvm_hypercall hc;
+
+ hc.hypercall = kvmhost_hc;
+ hc.idx = __NR_hypercall_ioq;
+
+ kvm_register_hypercall(THIS_MODULE, &hc);
+
+ return 0;
+}
+
diff --git a/drivers/kvm/kvm.h b/drivers/kvm/kvm.h
index 05d5be1..c38c84f 100755
--- a/drivers/kvm/kvm.h
+++ b/drivers/kvm/kvm.h
@@ -16,6 +16,7 @@
#include <linux/netdevice.h>
#include "vmx.h"
+#include "ioq.h"
#include <linux/kvm.h>
#include <linux/kvm_para.h>
@@ -389,6 +390,10 @@ struct kvm {
struct list_head vm_list;
struct net_device *netdev;
struct file *filp;
+#ifdef CONFIG_KVM_IOQ_HOST
+ struct ioq_mgr *ioqmgr;
+#endif
+
};
struct descriptor_table {
diff --git a/drivers/kvm/kvm_main.c b/drivers/kvm/kvm_main.c
index f252b39..fbffd2f 100644
--- a/drivers/kvm/kvm_main.c
+++ b/drivers/kvm/kvm_main.c
@@ -349,6 +349,7 @@ static struct kvm *kvm_create_vm(void)
list_add(&kvm->vm_list, &vm_list);
spin_unlock(&kvm_lock);
}
+ kvmhost_ioqmgr_init(kvm);
return kvm;
}
@@ -3614,6 +3615,8 @@ static __init int kvm_init(void)
bad_page_address = page_to_pfn(bad_page) << PAGE_SHIFT;
memset(__va(bad_page_address), 0, PAGE_SIZE);
+ kvmhost_ioqmgr_module_init();
+
return 0;
out:
diff --git a/include/linux/kvm.h b/include/linux/kvm.h
index bc2b51e..2cceae3 100755
--- a/include/linux/kvm.h
+++ b/include/linux/kvm.h
@@ -377,6 +377,7 @@ struct kvm_pvnet_config {
* No registers are clobbered by the hypercall, except that the
* return value is in RAX.
*/
+
#define KVM_NR_HYPERCALLS 7
#define __NR_hypercall_test 0
* [PATCH 09/10] KVM: Add PVBUS support to the KVM host
[not found] ` <20070816231357.8044.55943.stgit-5CR4LY5GPkvLDviKLk5550HKjMygAv58XqFh9Ls21Oc@public.gmane.org>
` (7 preceding siblings ...)
2007-08-16 23:14 ` [PATCH 08/10] KVM: Add support for IOQ Gregory Haskins
@ 2007-08-16 23:14 ` Gregory Haskins
2007-08-16 23:14 ` [PATCH 10/10] KVM: Add an IOQNET backend driver Gregory Haskins
2007-08-17 1:25 ` [PATCH 00/10] PV-IO v3 Rusty Russell
10 siblings, 0 replies; 41+ messages in thread
From: Gregory Haskins @ 2007-08-16 23:14 UTC (permalink / raw)
To: kvm-devel-5NWGOfrQmneRv+LV9MX5uipxlwaOVQ5f
PVBUS allows VMM-agnostic PV drivers to discover and configure virtual resources.
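The hotplug flow the guest drives off these events can be sketched as follows (illustrative only: the DEMO_* names are hypothetical stand-ins for the KVM_PVBUS_EVENT_* values, and the handlers are stubs, not the real registration code):

```c
#include <assert.h>

/* Trimmed-down model of the pvbus_event dispatch the guest performs when
 * the host posts hotplug notifications over the registered IOQ. */
#define DEMO_PVBUS_EVENT_ADD  1
#define DEMO_PVBUS_EVENT_DROP 2

struct demo_pvbus_event {
	unsigned int eventid;
	unsigned long long id;	/* device instance id */
};

/* Returns 0 on success, -1 for an unknown event (which is only logged). */
static int demo_dispatch(const struct demo_pvbus_event *ev)
{
	switch (ev->eventid) {
	case DEMO_PVBUS_EVENT_ADD:
		/* real code: allocate a pvbus_device, pvbus_device_register() */
		return 0;
	case DEMO_PVBUS_EVENT_DROP:
		/* real code: unregister the device matching ev->id */
		return 0;
	default:
		return -1;
	}
}
```

After dispatching, the real code clears the descriptor, hands ownership back to the south side, and pops the in-use tail; the sketch covers only the switch itself.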
Signed-off-by: Gregory Haskins <ghaskins-Et1tbQHTxzrQT0dZR+AlfA@public.gmane.org>
---
drivers/kvm/Kconfig | 10 +
drivers/kvm/Makefile | 3
drivers/kvm/kvm.h | 4
drivers/kvm/kvm_main.c | 4
drivers/kvm/pvbus_host.c | 636 ++++++++++++++++++++++++++++++++++++++++++++++
drivers/kvm/pvbus_host.h | 66 +++++
6 files changed, 723 insertions(+), 0 deletions(-)
diff --git a/drivers/kvm/Kconfig b/drivers/kvm/Kconfig
index d9def33..9f2ef22 100644
--- a/drivers/kvm/Kconfig
+++ b/drivers/kvm/Kconfig
@@ -52,6 +52,16 @@ config KVM_IOQ_HOST
depends on KVM
select IOQ
+config KVM_PVBUS_HOST
+ boolean "Paravirtualized Bus (PVBUS) host support"
+ depends on KVM
+ select KVM_IOQ_HOST
+ ---help---
+ PVBUS is an infrastructure for generic PV drivers to take advantage
+ of an underlying hypervisor without having to understand the details
+ of the hypervisor itself. You only need this option if you plan to
+ run PVBUS based PV guests in KVM.
+
config KVM_NET_HOST
tristate "Para virtual network host device"
depends on KVM
diff --git a/drivers/kvm/Makefile b/drivers/kvm/Makefile
index 2095061..8926fa9 100644
--- a/drivers/kvm/Makefile
+++ b/drivers/kvm/Makefile
@@ -7,6 +7,9 @@ kvm-objs := kvm_main.o mmu.o x86_emulate.o
ifeq ($(CONFIG_KVM_IOQ_HOST),y)
kvm-objs += ioq_host.o
endif
+ifeq ($(CONFIG_KVM_PVBUS_HOST),y)
+kvm-objs += pvbus_host.o
+endif
obj-$(CONFIG_KVM) += kvm.o
kvm-intel-objs = vmx.o
obj-$(CONFIG_KVM_INTEL) += kvm-intel.o
diff --git a/drivers/kvm/kvm.h b/drivers/kvm/kvm.h
index c38c84f..8dc9ac3 100755
--- a/drivers/kvm/kvm.h
+++ b/drivers/kvm/kvm.h
@@ -14,6 +14,7 @@
#include <linux/sched.h>
#include <linux/mm.h>
#include <linux/netdevice.h>
+#include <linux/pvbus.h>
#include "vmx.h"
#include "ioq.h"
@@ -393,6 +394,9 @@ struct kvm {
#ifdef CONFIG_KVM_IOQ_HOST
struct ioq_mgr *ioqmgr;
#endif
+#ifdef CONFIG_KVM_PVBUS_HOST
+ struct kvm_pvbus *pvbus;
+#endif
};
diff --git a/drivers/kvm/kvm_main.c b/drivers/kvm/kvm_main.c
index fbffd2f..d35ce8d 100644
--- a/drivers/kvm/kvm_main.c
+++ b/drivers/kvm/kvm_main.c
@@ -44,6 +44,7 @@
#include "x86_emulate.h"
#include "segment_descriptor.h"
+#include "pvbus_host.h"
MODULE_AUTHOR("Qumranet");
MODULE_LICENSE("GPL");
@@ -350,6 +351,7 @@ static struct kvm *kvm_create_vm(void)
spin_unlock(&kvm_lock);
}
kvmhost_ioqmgr_init(kvm);
+ kvm_pvbus_init(kvm);
return kvm;
}
@@ -3616,6 +3618,7 @@ static __init int kvm_init(void)
memset(__va(bad_page_address), 0, PAGE_SIZE);
kvmhost_ioqmgr_module_init();
+ kvm_pvbus_module_init();
return 0;
@@ -3637,6 +3640,7 @@ static __exit void kvm_exit(void)
mntput(kvmfs_mnt);
unregister_filesystem(&kvm_fs_type);
kvm_mmu_module_exit();
+ kvm_pvbus_module_exit();
}
module_init(kvm_init)
diff --git a/drivers/kvm/pvbus_host.c b/drivers/kvm/pvbus_host.c
new file mode 100644
index 0000000..cc506f4
--- /dev/null
+++ b/drivers/kvm/pvbus_host.c
@@ -0,0 +1,636 @@
+/*
+ * Copyright 2007 Novell. All Rights Reserved.
+ *
+ * Author:
+ * Gregory Haskins <ghaskins-Et1tbQHTxzrQT0dZR+AlfA@public.gmane.org>
+ *
+ * This file is free software; you can redistribute it and/or modify
+ * it under the terms of version 2 of the GNU General Public License
+ * as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write to the Free Software Foundation,
+ * Inc., 51 Franklin St, Fifth Floor, Boston, MA 02110-1301, USA.
+ */
+
+#include <linux/module.h>
+#include <linux/rbtree.h>
+#include <linux/spinlock.h>
+#include <linux/highmem.h>
+#include <linux/workqueue.h>
+
+#include "pvbus.h"
+#include "pvbus_host.h"
+#include "kvm.h"
+
+struct pvbus_map {
+ int (*compare)(const void *left, const void *right);
+ const void* (*getkey)(struct rb_node *node);
+
+ struct mutex lock;
+ struct rb_root root;
+ size_t count;
+};
+
+struct _pv_devtype {
+ struct kvm_pv_devtype *item;
+ struct rb_node node;
+};
+
+struct _pv_device {
+ struct kvm_pv_device *item;
+ struct rb_node node;
+ struct _pv_devtype *parent;
+ int synced;
+};
+
+static struct pvbus_map pvbus_typemap;
+
+struct kvm_pvbus_eventq {
+ struct mutex lock;
+ struct ioq *ioq;
+
+};
+
+struct kvm_pvbus {
+ struct mutex lock;
+ struct kvm *kvm;
+ struct pvbus_map devmap;
+ struct kvm_pvbus_eventq eventq;
+};
+
+/*
+ * ------------------
+ * generic rb map management
+ * ------------------
+ */
+
+static void pvbus_map_init(struct pvbus_map *map)
+{
+ mutex_init(&map->lock);
+ map->root = RB_ROOT;
+}
+
+static int pvbus_map_register(struct pvbus_map *map, struct rb_node *node)
+{
+ int ret = 0;
+ struct rb_root *root;
+ struct rb_node **new, *parent = NULL;
+
+ mutex_lock(&map->lock);
+
+ root = &map->root;
+ new = &(root->rb_node);
+
+ /* Figure out where to put new node */
+ while (*new) {
+ int result = map->compare(map->getkey(node),
+ map->getkey(*new));
+
+ parent = *new;
+
+ if (result < 0)
+ new = &((*new)->rb_left);
+ else if (result > 0)
+ new = &((*new)->rb_right);
+ else {
+ ret = -EEXIST;
+ break;
+ }
+ }
+
+ if (!ret) {
+ /* Add new node and rebalance tree. */
+ rb_link_node(node, parent, new);
+ rb_insert_color(node, root);
+ map->count++;
+ }
+
+ mutex_unlock(&map->lock);
+
+ return ret;
+}
+
+static struct rb_node* pvbus_map_find(struct pvbus_map *map, const void *key)
+{
+ struct rb_node *node;
+
+ mutex_lock(&map->lock);
+
+ node = map->root.rb_node;
+
+ while (node) {
+ int result = map->compare(key, map->getkey(node));
+
+ if (result < 0)
+ node = node->rb_left;
+ else if (result > 0)
+ node = node->rb_right;
+ else {
+ break;
+ }
+ }
+
+ mutex_unlock(&map->lock);
+
+ return node;
+}
+
+static void pvbus_map_erase(struct pvbus_map *map, struct rb_node *node)
+{
+ mutex_lock(&map->lock);
+ rb_erase(node, &map->root);
+ map->count--;
+ mutex_unlock(&map->lock);
+}
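The map code above layers type-specific behavior over one generic tree walker via two callbacks, `compare()` and `getkey()`. A userspace sketch of the same pattern (hypothetical names; a plain unbalanced BST stands in for the kernel's rbtree):

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Minimal intrusive node, standing in for struct rb_node */
struct node {
	struct node *left, *right;
};

struct map {
	int (*compare)(const void *l, const void *r);
	const void *(*getkey)(struct node *n);
	struct node *root;
	size_t count;
};

/* Returns -1 on duplicate key (the -EEXIST analog) */
static int map_register(struct map *map, struct node *node)
{
	struct node **slot = &map->root;

	node->left = node->right = NULL;
	while (*slot) {
		int c = map->compare(map->getkey(node), map->getkey(*slot));
		if (c < 0)
			slot = &(*slot)->left;
		else if (c > 0)
			slot = &(*slot)->right;
		else
			return -1;
	}
	*slot = node;
	map->count++;
	return 0;
}

/* Lookup must compare in the same key order as insertion */
static struct node *map_find(struct map *map, const void *key)
{
	struct node *n = map->root;

	while (n) {
		int c = map->compare(key, map->getkey(n));
		if (c == 0)
			break;
		n = (c < 0) ? n->left : n->right;
	}
	return n;
}

/* A devtype-like payload keyed by name */
struct devtype {
	const char *name;
	struct node node;
};

static int cmp_name(const void *l, const void *r) { return strcmp(l, r); }

static const void *key_name(struct node *n)
{
	return ((struct devtype *)((char *)n - offsetof(struct devtype, node)))->name;
}
```

Note that `map_find()` must pass the search key and node key to `compare()` in the same order that `map_register()` established, otherwise lookups descend the wrong subtree.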
+
+/*
+ * ------------------
+ * pv_devtype rb map
+ * ------------------
+ */
+static int pv_devtype_map_compare(const void *left, const void *right)
+{
+ return strcmp((char*)left, (char*)right);
+}
+
+static const void* pv_devtype_map_getkey(struct rb_node *node)
+{
+ struct _pv_devtype *dt;
+
+ dt = container_of(node, struct _pv_devtype, node);
+
+ return dt->item->name;
+}
+
+static void pv_devtype_map_init(struct pvbus_map *map)
+{
+ pvbus_map_init(map);
+
+ map->compare = pv_devtype_map_compare;
+ map->getkey = pv_devtype_map_getkey;
+}
+
+static struct _pv_devtype* devtype_map_find(struct pvbus_map *map,
+ const void *key)
+{
+ struct rb_node *node = pvbus_map_find(map, key);
+ if (!node)
+ return NULL;
+
+ return container_of(node, struct _pv_devtype, node);
+}
+
+/*
+ * ------------------
+ * pv_device rb map
+ * ------------------
+ */
+static int pv_device_map_compare(const void *left, const void *right)
+{
+ u64 lid = *(const u64*)left;
+ u64 rid = *(const u64*)right;
+
+ if (lid < rid)
+ return -1;
+ if (lid > rid)
+ return 1;
+ return 0;
+}
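Comparators over 64-bit ids must not return the raw difference truncated to `int`: once the true difference overflows 32 bits, the sign of the truncated result is wrong and the tree misorders keys. A standalone demonstration of the pitfall and the safe three-way form:

```c
#include <assert.h>
#include <stdint.h>

/* Buggy: the 64-bit difference is truncated to int */
static int cmp_sub(uint64_t l, uint64_t r)
{
	return (int)(l - r);
}

/* Safe: explicit three-way comparison, no overflow possible */
static int cmp_3way(uint64_t l, uint64_t r)
{
	if (l < r)
		return -1;
	if (l > r)
		return 1;
	return 0;
}
```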
+
+static const void* pv_device_map_getkey(struct rb_node *node)
+{
+ struct _pv_device *dev;
+
+ dev = container_of(node, struct _pv_device, node);
+
+ return &dev->item->id;
+}
+
+static void pv_device_map_init(struct pvbus_map *map)
+{
+ pvbus_map_init(map);
+
+ map->compare = pv_device_map_compare;
+ map->getkey = pv_device_map_getkey;
+}
+
+static struct _pv_device* device_map_find(struct pvbus_map *map,
+ const void *key)
+{
+ struct rb_node *node = pvbus_map_find(map, key);
+ if (!node)
+ return NULL;
+
+ return container_of(node, struct _pv_device, node);
+}
+
+/*
+ * ------------------
+ * event-inject code
+ * ------------------
+ */
+static void kvm_pvbus_inject_event(struct kvm_pvbus *pvbus, u32 eventid,
+ void *data, size_t len)
+{
+ DECLARE_WAITQUEUE(wait, current);
+ struct kvm_pvbus_eventq *eventq = &pvbus->eventq;
+ struct ioq_iterator iter;
+ struct pvbus_event *entry;
+ int ret;
+
+ add_wait_queue(&eventq->ioq->wq, &wait);
+
+ mutex_lock(&eventq->lock);
+
+ /* We want to iterate on the head of the in-use index */
+ ret = ioq_iter_init(eventq->ioq, &iter,
+ ioq_idxtype_inuse, IOQ_ITER_AUTOUPDATE);
+ BUG_ON(ret < 0);
+
+ ret = ioq_iter_seek(&iter, ioq_seek_head, 0, 0);
+ BUG_ON(ret < 0);
+
+ for (;;) {
+ set_current_state(TASK_UNINTERRUPTIBLE);
+ if (iter.desc->sown)
+ break;
+ schedule();
+ }
+ __set_current_state(TASK_RUNNING);
+
+ entry = (struct pvbus_event*)gpa_to_hva(pvbus->kvm, iter.desc->ptr);
+
+ entry->eventid = eventid;
+ memcpy(&entry->data, data, len);
+
+ mb();
+ iter.desc->sown = 0;
+ mb();
+
+ /*
+ * This will push the index AND signal the guest since AUTOUPDATE is
+ * enabled
+ */
+ ret = ioq_iter_push(&iter, 0);
+ BUG_ON(ret < 0);
+
+ /* FIXME: Unmap the entry */
+
+ mutex_unlock(&eventq->lock);
+
+ remove_wait_queue(&eventq->ioq->wq, &wait);
+}
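The `sown` handoff above follows a simple ownership protocol: fill the descriptor only while we own the slot, then release ownership with barriers so the other side never observes a half-written entry. A userspace sketch of the producer half, using C11 release/acquire ordering in place of `mb()` (names hypothetical):

```c
#include <assert.h>
#include <stdatomic.h>
#include <string.h>

struct desc {
	char data[32];
	atomic_int sown;	/* 1 = producer-owned, 0 = consumer-owned */
};

/* Producer: write the payload only while we own the slot, then release it */
static int produce(struct desc *d, const char *msg)
{
	if (!atomic_load_explicit(&d->sown, memory_order_acquire))
		return -1;	/* not ours; inject_event sleeps here instead */

	strncpy(d->data, msg, sizeof(d->data) - 1);

	/* Release store: payload writes become visible before ownership flips */
	atomic_store_explicit(&d->sown, 0, memory_order_release);
	return 0;
}
```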
+
+static void kvm_pvbus_inject_add(struct kvm_pvbus *pvbus,
+ const char *name, u64 id)
+{
+ struct pvbus_add_event data = {
+ .id = id,
+ };
+
+ strncpy(data.name, name, PVBUS_MAX_NAME);
+
+ kvm_pvbus_inject_event(pvbus, KVM_PVBUS_EVENT_ADD,
+ &data, sizeof(data));
+}
+
+/*
+ * ------------------
+ * add-event code
+ * ------------------
+ */
+
+struct deferred_add {
+ struct kvm_pvbus *pvbus;
+ struct work_struct work;
+ size_t count;
+ struct pvbus_add_event data[1];
+};
+
+static void kvm_pvbus_deferred_resync(struct work_struct *work)
+{
+ struct deferred_add *event = container_of(work,
+ struct deferred_add,
+ work);
+ int i;
+
+
+ for (i = 0; i < event->count; i++) {
+ struct pvbus_add_event *entry = &event->data[i];
+
+ kvm_pvbus_inject_add(event->pvbus, entry->name, entry->id);
+ }
+
+ kfree(event);
+}
+
+#define for_each_rbnode(node, root) \
+ for (node = rb_first(root); node != NULL; node = rb_next(node))
+
+/*
+ * This function builds a list of all currently registered devices and
+ * sends it to a work-queue to be placed on the ioq. We do this as a
+ * two-step operation because work-queues can queue infinitely deep
+ * (assuming enough memory), whereas the IOQ is only as deep as the
+ * guest's allocation, at which point we must sleep. Since we cannot
+ * sleep during registration, we have no real choice but to defer
+ * things here.
+ */
+static int kvm_pvbus_resync(struct kvm_pvbus *pvbus)
+{
+ struct pvbus_map *map = &pvbus->devmap;
+ struct deferred_add *event;
+ struct rb_node *node;
+ size_t len;
+ int i = 0;
+ int ret = 0;
+
+ mutex_lock(&map->lock);
+
+ if (!map->count)
+ /* There are no items currently registered so just exit */
+ goto out;
+
+ /*
+ * First allocate a structure large enough to hold our map->count
+ * number of entries that are pending
+ */
+
+ /* we subtract 1 because of item already in struct */
+ len = sizeof(struct pvbus_add_event) * (map->count - 1);
+ event = kzalloc(sizeof(*event) + len, GFP_KERNEL);
+ if (!event) {
+ ret = -ENOMEM;
+ goto out;
+ }
+
+ event->pvbus = pvbus;
+ event->count = map->count;
+ INIT_WORK(&event->work, kvm_pvbus_deferred_resync);
+
+ /*
+ * Then cycle through the map and load each node discovered into
+ * the event
+ */
+ for_each_rbnode(node, &map->root) {
+ struct pvbus_add_event *entry = &event->data[i++];
+ struct _pv_device *dev = container_of(node,
+ struct _pv_device,
+ node);
+
+ strncpy(entry->name, dev->parent->item->name, PVBUS_MAX_NAME);
+ entry->id = dev->item->id;
+ }
+
+ /* Finally, fire off the work */
+ schedule_work(&event->work);
+
+ out:
+ mutex_unlock(&map->lock);
+
+ return ret;
+}
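The allocation above uses the old `data[1]` trailing-array idiom, which is why the size math subtracts one element: `data[0]` is already counted in `sizeof(struct deferred_add)`. A standalone check of that sizing (stub types; modern code would use a C99 flexible array member instead):

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

struct add_event {
	char name[32];
	unsigned long long id;
};

struct deferred {
	size_t count;
	struct add_event data[1];	/* one element already in the struct */
};

static struct deferred *deferred_alloc(size_t count)
{
	/* subtract 1 because data[0] is counted in sizeof(struct deferred) */
	size_t len = sizeof(struct add_event) * (count - 1);
	struct deferred *d = calloc(1, sizeof(*d) + len);

	if (d)
		d->count = count;
	return d;
}
```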
+
+
+/*
+ * ------------------
+ * hypercall implementation
+ * ------------------
+ */
+
+/*
+ * This function is invoked when the guest wants to start getting hotplug
+ * events from us to publish on the pvbus
+ */
+static int kvm_pvbus_register(struct kvm_pvbus *pvbus, ioq_id_t id)
+{
+ struct ioq_mgr *ioqmgr = pvbus->kvm->ioqmgr;
+ int ret = 0;
+
+ mutex_lock(&pvbus->lock);
+
+ /*
+ * Trying to register while someone else is already registered
+ * is just plain illegal
+ */
+ if (pvbus->eventq.ioq) {
+ ret = -EINVAL;
+ goto out;
+ }
+
+ /*
+ * Open the IOQ channel back to the guest so we can deliver hotplug
+ * events as devices are registered
+ */
+ ret = ioqmgr->connect(ioqmgr, id, &pvbus->eventq.ioq, 0);
+ if (ret < 0)
+ goto out;
+
+ /*
+ * Enable interrupts on the queue
+ */
+ ioq_start(pvbus->eventq.ioq, 0);
+
+ /*
+ * Now we need to backfill the guest by sending any of our currently
+ * registered devices up as hotplug events as if they just happened
+ */
+ ret = kvm_pvbus_resync(pvbus);
+
+ out:
+ mutex_unlock(&pvbus->lock);
+
+ return ret;
+}
+
+/*
+ * This function is invoked whenever a driver calls pvbus_device->call()
+ */
+static int kvm_pvbus_call(struct kvm_pvbus *pvbus,
+ u64 instance, u32 func, void *data, size_t len)
+{
+ struct kvm_pv_device *dev;
+ struct _pv_device *_dev = device_map_find(&pvbus->devmap,
+ &instance);
+ if (!_dev)
+ return -ENOENT;
+
+ dev = _dev->item;
+
+ return dev->call(dev, func, data, len);
+}
+
+/*
+ * Our hypercall format will always follow with the call-id in arg[0],
+ * a pointer to the arguments in arg[1], and the argument length in arg[2]
+ */
+static unsigned long kvm_pvbus_hc(struct kvm_vcpu *vcpu,
+ unsigned long args[])
+{
+ struct kvm_pvbus *pvbus = vcpu->kvm->pvbus;
+ void *vdata = (void*)gpa_to_hva(vcpu->kvm, args[1]);
+ int ret = -EINVAL;
+
+ /* FIXME: We need to validate vdata so that malicious guests cannot
+ cause the host to segfault */
+
+ switch (args[0]) {
+ case KVM_PVBUS_OP_REGISTER: {
+ struct pvbus_register_params *params;
+
+ params = (struct pvbus_register_params*)vdata;
+
+ ret = kvm_pvbus_register(pvbus, params->qid);
+ break;
+ }
+ case KVM_PVBUS_OP_CALL: {
+ struct pvbus_call_params *params;
+ void *data;
+
+ params = (struct pvbus_call_params*)vdata;
+ data = gpa_to_hva(vcpu->kvm, params->data);
+
+ /*
+ * FIXME: Again, we should validate that
+ *
+ * params->data to params->data+len
+ *
+ * is a valid region owned by the guest
+ */
+
+ ret = kvm_pvbus_call(pvbus, params->inst, params->func,
+ data, params->len);
+
+ /* FIXME: Do we need to unmap the data */
+ break;
+ }
+ }
+
+ /* FIXME: Do we need to kunmap the vdata? */
+
+ return ret;
+}
+
+int kvm_pvbus_registertype(struct kvm_pv_devtype *devtype)
+{
+ struct _pv_devtype *_devtype = kzalloc(sizeof(*_devtype), GFP_KERNEL);
+ if (!_devtype)
+ return -ENOMEM;
+
+ _devtype->item = devtype;
+
+ return pvbus_map_register(&pvbus_typemap, &_devtype->node);
+}
+EXPORT_SYMBOL_GPL(kvm_pvbus_registertype);
+
+int kvm_pvbus_unregistertype(const char *name)
+{
+ /* FIXME: */
+ return -ENOSYS;
+}
+EXPORT_SYMBOL_GPL(kvm_pvbus_unregistertype);
+
+/*
+ * This function is invoked by an administrative operation which wants to
+ * instantiate a registered type into a device associated with a specific VM.
+ *
+ * For instance, QEMU may one day issue an ioctl that says
+ *
+ * createinstance("ioqnet", "mac = 00:30:cc:00:20:10");
+ *
+ * This would cause the system to search for any registered types called
+ * "ioqnet". If found, it would instantiate the device with a config string
+ * set to give it a specific MAC. Obviously the name and config string are
+ * specific to a particular driver type.
+ */
+int kvm_pvbus_createinstance(struct kvm *kvm, const char *name,
+ const char *cfg, u64 *id)
+{
+ struct kvm_pvbus *pvbus = kvm->pvbus;
+ struct _pv_devtype *_devtype;
+ struct kvm_pv_devtype *devtype;
+ struct _pv_device *_dev = NULL;
+ struct kvm_pv_device *dev;
+ u64 _id;
+ int ret = 0;
+
+ mutex_lock(&pvbus->lock);
+
+ _devtype = devtype_map_find(&pvbus_typemap, name);
+ if (!_devtype) {
+ ret = -ENOENT;
+ goto out_err;
+ }
+
+ devtype = _devtype->item;
+
+ _dev = kzalloc(sizeof(*_dev), GFP_KERNEL);
+ if (!_dev) {
+ ret = -ENOMEM;
+ goto out_err;
+ }
+
+ /* We just use the pointer address as a unique id */
+ _id = (u64)_dev;
+
+ ret = devtype->create(kvm, devtype, _id, cfg, &dev);
+ if (ret < 0)
+ goto out_err;
+
+ _dev->item = dev;
+ _dev->parent = _devtype;
+
+ pvbus_map_register(&pvbus->devmap, &_dev->node);
+
+ mutex_unlock(&pvbus->lock);
+
+ *id = _id;
+
+ kvm_pvbus_inject_add(pvbus, name, _id);
+
+ return 0;
+
+ out_err:
+ mutex_unlock(&pvbus->lock);
+
+ kfree(_dev);
+
+ return ret;
+}
+
+int kvm_pvbus_init(struct kvm *kvm)
+{
+ struct kvm_pvbus *pvbus = kzalloc(sizeof(*pvbus), GFP_KERNEL);
+ if (!pvbus)
+ return -ENOMEM;
+
+ mutex_init(&pvbus->lock);
+ pvbus->kvm = kvm;
+ pv_device_map_init(&pvbus->devmap);
+ mutex_init(&pvbus->eventq.lock);
+
+ kvm->pvbus = pvbus;
+
+ return 0;
+}
+
+__init int kvm_pvbus_module_init(void)
+{
+ struct kvm_hypercall hc;
+
+ pv_devtype_map_init(&pvbus_typemap);
+
+ /* Register our hypercall */
+ hc.hypercall = kvm_pvbus_hc;
+ hc.idx = __NR_hypercall_pvbus;
+
+ kvm_register_hypercall(THIS_MODULE, &hc);
+
+ return 0;
+}
+
+__exit void kvm_pvbus_module_exit(void)
+{
+ /* FIXME: Unregister our hypercall */
+}
+
diff --git a/drivers/kvm/pvbus_host.h b/drivers/kvm/pvbus_host.h
new file mode 100644
index 0000000..a3cc7a0
--- /dev/null
+++ b/drivers/kvm/pvbus_host.h
@@ -0,0 +1,66 @@
+/*
+ * Copyright 2007 Novell. All Rights Reserved.
+ *
+ * Author:
+ * Gregory Haskins <ghaskins-Et1tbQHTxzrQT0dZR+AlfA@public.gmane.org>
+ *
+ * This file is free software; you can redistribute it and/or modify
+ * it under the terms of version 2 of the GNU General Public License
+ * as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write to the Free Software Foundation,
+ * Inc., 51 Franklin St, Fifth Floor, Boston, MA 02110-1301, USA.
+ */
+
+#ifndef _KVM_PVBUS_HOST_H
+#define _KVM_PVBUS_HOST_H
+
+#include <linux/rbtree.h>
+
+#ifdef CONFIG_KVM_PVBUS_HOST
+
+struct kvm;
+
+struct kvm_pvbus;
+
+struct kvm_pv_device {
+ int (*call)(struct kvm_pv_device *t, u32 func, void *data, size_t len);
+ void (*destroy)(struct kvm_pv_device *t);
+
+ u64 id;
+ u32 ver;
+
+};
+
+struct kvm_pv_devtype {
+ int (*create)(struct kvm *kvm,
+ struct kvm_pv_devtype *t, u64 id, const char *cfg,
+ struct kvm_pv_device **dev);
+ void (*destroy)(struct kvm_pv_devtype *t);
+
+ const char *name;
+};
+
+int kvm_pvbus_init(struct kvm *kvm);
+int kvm_pvbus_module_init(void);
+void kvm_pvbus_module_exit(void);
+int kvm_pvbus_registertype(struct kvm_pv_devtype *devtype);
+int kvm_pvbus_unregistertype(const char *name);
+int kvm_pvbus_createinstance(struct kvm *kvm, const char *name,
+ const char *config, u64 *id);
+
+#else /* CONFIG_KVM_PVBUS_HOST */
+
+#define kvm_pvbus_init(kvm) (0)
+#define kvm_pvbus_module_init() (0)
+#define kvm_pvbus_module_exit() do {} while (0)
+
+#endif /* CONFIG_KVM_PVBUS_HOST */
+
+#endif /* _KVM_PVBUS_HOST_H */
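A backend supplies a `kvm_pv_devtype` ops table, and the bus instantiates devices through its `create()` hook; drivers then reach the device through `call()`. A minimal userspace sketch of that create/dispatch flow (stub types and hypothetical names, not the kernel structures above):

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

struct pv_device {
	int (*call)(struct pv_device *t, unsigned func, void *data, size_t len);
	unsigned long long id;
};

struct pv_devtype {
	int (*create)(struct pv_devtype *t, unsigned long long id,
		      const char *cfg, struct pv_device **dev);
	const char *name;
};

/* A toy backend: its one call() just reports the instance id */
static int toy_call(struct pv_device *t, unsigned func, void *data, size_t len)
{
	(void)func;
	if (len < sizeof(t->id))
		return -1;
	memcpy(data, &t->id, sizeof(t->id));
	return 0;
}

static int toy_create(struct pv_devtype *t, unsigned long long id,
		      const char *cfg, struct pv_device **dev)
{
	struct pv_device *d = calloc(1, sizeof(*d));

	(void)t; (void)cfg;
	if (!d)
		return -1;
	d->call = toy_call;
	d->id = id;
	*dev = d;
	return 0;
}

static struct pv_devtype toy_type = { .create = toy_create, .name = "toy" };
```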
* [PATCH 10/10] KVM: Add an IOQNET backend driver
[not found] ` <20070816231357.8044.55943.stgit-5CR4LY5GPkvLDviKLk5550HKjMygAv58XqFh9Ls21Oc@public.gmane.org>
` (8 preceding siblings ...)
2007-08-16 23:14 ` [PATCH 09/10] KVM: Add PVBUS support to the KVM host Gregory Haskins
@ 2007-08-16 23:14 ` Gregory Haskins
2007-08-17 1:25 ` [PATCH 00/10] PV-IO v3 Rusty Russell
10 siblings, 0 replies; 41+ messages in thread
From: Gregory Haskins @ 2007-08-16 23:14 UTC (permalink / raw)
To: kvm-devel-5NWGOfrQmneRv+LV9MX5uipxlwaOVQ5f
Signed-off-by: Gregory Haskins <ghaskins-Et1tbQHTxzrQT0dZR+AlfA@public.gmane.org>
---
drivers/kvm/Kconfig | 5
drivers/kvm/Makefile | 2
drivers/kvm/ioqnet_host.c | 566 +++++++++++++++++++++++++++++++++++++++++++++
3 files changed, 573 insertions(+), 0 deletions(-)
diff --git a/drivers/kvm/Kconfig b/drivers/kvm/Kconfig
index 9f2ef22..19551a2 100644
--- a/drivers/kvm/Kconfig
+++ b/drivers/kvm/Kconfig
@@ -62,6 +62,11 @@ config KVM_PVBUS_HOST
of the hypervisor itself. You only need this option if you plan to
run PVBUS based PV guests in KVM.
+config KVM_IOQNET
+ tristate "IOQNET host support"
+ depends on KVM
+ select KVM_PVBUS_HOST
+
config KVM_NET_HOST
tristate "Para virtual network host device"
depends on KVM
diff --git a/drivers/kvm/Makefile b/drivers/kvm/Makefile
index 8926fa9..66e5272 100644
--- a/drivers/kvm/Makefile
+++ b/drivers/kvm/Makefile
@@ -22,3 +22,5 @@ kvm-net-host-objs = kvm_net_host.o
obj-$(CONFIG_KVM_NET_HOST) += kvm_net_host.o
kvm-pvbus-objs := ioq_guest.o pvbus_guest.o
obj-$(CONFIG_KVM_PVBUS_GUEST) += kvm-pvbus.o
+kvm-ioqnet-objs := ioqnet_host.o
+obj-$(CONFIG_KVM_IOQNET) += kvm-ioqnet.o
\ No newline at end of file
diff --git a/drivers/kvm/ioqnet_host.c b/drivers/kvm/ioqnet_host.c
new file mode 100644
index 0000000..0f4d055
--- /dev/null
+++ b/drivers/kvm/ioqnet_host.c
@@ -0,0 +1,566 @@
+/*
+ * Copyright 2007 Novell. All Rights Reserved.
+ *
+ * ioqnet - A paravirtualized network device based on the IOQ interface.
+ *
+ * This module represents the backend driver for an IOQNET driver on the KVM
+ * platform.
+ *
+ * Author:
+ * Gregory Haskins <ghaskins-Et1tbQHTxzrQT0dZR+AlfA@public.gmane.org>
+ *
+ * Derived in part from the SNULL example from the book "Linux Device
+ * Drivers" by Alessandro Rubini and Jonathan Corbet, published
+ * by O'Reilly & Associates.
+ *
+ * This file is free software; you can redistribute it and/or modify
+ * it under the terms of version 2 of the GNU General Public License
+ * as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write to the Free Software Foundation,
+ * Inc., 51 Franklin St, Fifth Floor, Boston, MA 02110-1301, USA.
+ */
+
+#include <linux/module.h>
+#include <linux/init.h>
+#include <linux/moduleparam.h>
+
+#include <linux/sched.h>
+#include <linux/kernel.h> /* printk() */
+#include <linux/slab.h> /* kmalloc() */
+#include <linux/errno.h> /* error codes */
+#include <linux/types.h> /* size_t */
+#include <linux/interrupt.h> /* mark_bh */
+
+#include <linux/in.h>
+#include <linux/netdevice.h> /* struct device, and other headers */
+#include <linux/etherdevice.h> /* eth_type_trans */
+#include <linux/ip.h> /* struct iphdr */
+#include <linux/tcp.h> /* struct tcphdr */
+#include <linux/skbuff.h>
+#include <linux/ioq.h>
+#include <linux/pvbus.h>
+
+#include <linux/in6.h>
+#include <asm/checksum.h>
+#include <linux/ioq.h>
+#include <linux/ioqnet.h>
+#include <linux/highmem.h>
+
+#include "pvbus_host.h"
+#include "kvm.h"
+
+MODULE_AUTHOR("Gregory Haskins");
+MODULE_LICENSE("GPL");
+
+#define IOQNET_NAME "ioqnet"
+
+/*
+ * FIXME: Any "BUG_ON" code that can be triggered by a malicious guest must
+ * be turned into an inject_gp()
+ */
+
+struct ioqnet_queue {
+ struct ioq *queue;
+ struct ioq_notifier notifier;
+};
+
+struct ioqnet_priv {
+ spinlock_t lock;
+ struct kvm *kvm;
+ struct kvm_pv_device pvdev;
+ struct net_device *netdev;
+ struct net_device_stats stats;
+ struct ioqnet_queue rxq;
+ struct ioqnet_queue txq;
+ struct tasklet_struct txtask;
+ int connected;
+ int opened;
+};
+
+#undef PDEBUG /* undef it, just in case */
+#ifdef IOQNET_DEBUG
+# define PDEBUG(fmt, args...) printk(KERN_DEBUG "ioqnet: " fmt, ## args)
+#else
+# define PDEBUG(fmt, args...) /* not debugging: nothing */
+#endif
+
+/*
+ * Enable and disable receive interrupts.
+ */
+static void ioqnet_rx_ints(struct net_device *dev, int enable)
+{
+ struct ioqnet_priv *priv = netdev_priv(dev);
+ struct ioq *ioq = priv->rxq.queue;
+
+ if (priv->connected) {
+ if (enable)
+ ioq_start(ioq, 0);
+ else
+ ioq_stop(ioq, 0);
+ }
+}
+
+/*
+ * Open and close
+ */
+
+int ioqnet_open(struct net_device *dev)
+{
+ struct ioqnet_priv *priv = netdev_priv(dev);
+
+ priv->opened = 1;
+ netif_start_queue(dev);
+
+ return 0;
+}
+
+int ioqnet_release(struct net_device *dev)
+{
+ struct ioqnet_priv *priv = netdev_priv(dev);
+
+ priv->opened = 0;
+ netif_stop_queue(dev);
+
+ return 0;
+}
+
+/*
+ * Configuration changes (passed on by ifconfig)
+ */
+int ioqnet_config(struct net_device *dev, struct ifmap *map)
+{
+ if (dev->flags & IFF_UP) /* can't act on a running interface */
+ return -EBUSY;
+
+ /* Don't allow changing the I/O address */
+ if (map->base_addr != dev->base_addr) {
+ printk(KERN_WARNING "ioqnet: Can't change I/O address\n");
+ return -EOPNOTSUPP;
+ }
+
+ /* ignore other fields */
+ return 0;
+}
+
+/*
+ * The poll implementation.
+ */
+static int ioqnet_poll(struct net_device *dev, int *budget)
+{
+ int npackets = 0, quota = min(dev->quota, *budget);
+ struct ioqnet_priv *priv = netdev_priv(dev);
+ struct ioq_iterator iter;
+ unsigned long flags;
+ int ret;
+
+ if (!priv->connected)
+ return 0;
+
+ spin_lock_irqsave(&priv->lock, flags);
+
+ /* We want to iterate on the tail of the in-use index */
+ ret = ioq_iter_init(priv->rxq.queue, &iter, ioq_idxtype_inuse, 0);
+ BUG_ON(ret < 0);
+
+ ret = ioq_iter_seek(&iter, ioq_seek_tail, 0, 0);
+ BUG_ON(ret < 0);
+
+ /*
+ * We stop if we have met the quota or there are no more packets.
+ * The EOM is indicated by finding a packet that is still owned by
+ * the north side
+ */
+ while ((npackets < quota) && iter.desc->sown) {
+ struct ioq_ring_desc *desc = iter.desc;
+ struct ioqnet_tx_ptr *ptr = gpa_to_hva(priv->kvm, desc->ptr);
+ struct sk_buff *skb;
+ int i;
+ size_t len = 0;
+
+ /* First figure out how much of an skb we need */
+ for (i = 0; i < desc->alen; ++i) {
+ len += ptr[i].len;
+ }
+
+ skb = dev_alloc_skb(len + 2);
+ if (!skb) {
+ /* FIXME: This leaks... */
+ printk(KERN_ERR "FATAL: Out of memory on IOQNET\n");
+ netif_stop_queue(dev);
+ return -ENOMEM;
+ }
+
+ skb_reserve(skb, 2);
+
+ /* Then copy the data out to our fresh SKB */
+ for (i = 0; i < desc->alen; ++i) {
+ struct ioqnet_tx_ptr *p = &ptr[i];
+ void *d = gpa_to_hva(priv->kvm,
+ p->data);
+
+ memcpy(skb_push(skb, p->len), d, p->len);
+ kunmap(d);
+ }
+
+ /* Maintain stats */
+ npackets++;
+ priv->stats.rx_packets++;
+ priv->stats.rx_bytes += len;
+
+ /* Pass the buffer up to the stack */
+ skb->dev = dev;
+ skb->protocol = eth_type_trans(skb, dev);
+ netif_receive_skb(skb);
+
+ /*
+ * Ensure that we have finished reading before marking the
+ * state of the queue
+ */
+ mb();
+ desc->sown = 0;
+ mb();
+
+ /* Advance the in-use tail */
+ ret = ioq_iter_pop(&iter, 0);
+ BUG_ON(ret < 0);
+
+ /* Toggle the lock */
+ spin_unlock_irqrestore(&priv->lock, flags);
+ spin_lock_irqsave(&priv->lock, flags);
+ }
+
+ /*
+ * If we processed all packets, we're done; tell the kernel and
+ * reenable ints
+ */
+ *budget -= npackets;
+ dev->quota -= npackets;
+ if (ioq_empty(priv->rxq.queue, ioq_idxtype_inuse)) {
+ /* FIXME: there is a race with enabling interrupts */
+ netif_rx_complete(dev);
+ ioqnet_rx_ints(dev, 1);
+ ret = 0;
+ } else
+ /* We couldn't process everything. */
+ ret = 1;
+
+ spin_unlock_irqrestore(&priv->lock, flags);
+
+ /* And let the north side know that we changed the rx-queue */
+ ioq_signal(priv->rxq.queue, 0);
+
+ return ret;
+}
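The poll function above implements the old-style NAPI contract: consume at most `min(dev->quota, *budget)` packets, decrement both counters by the number processed, and return 0 only when the queue is drained (after which interrupts are re-enabled). The accounting in isolation (hypothetical names):

```c
#include <assert.h>

static int min_int(int a, int b) { return a < b ? a : b; }

/*
 * Returns 0 if the queue was drained (the caller would then call
 * netif_rx_complete() and re-enable interrupts), 1 if work remains.
 */
static int poll_account(int *budget, int *quota, int pending)
{
	int limit = min_int(*quota, *budget);
	int npackets = min_int(pending, limit);

	*budget -= npackets;
	*quota -= npackets;

	return (pending > npackets) ? 1 : 0;
}
```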
+
+/*
+ * Transmit a packet (called by the kernel)
+ */
+int ioqnet_tx(struct sk_buff *skb, struct net_device *dev)
+{
+ struct ioqnet_priv *priv = netdev_priv(dev);
+ struct ioq_iterator iter;
+ int ret;
+ unsigned long flags;
+ char *data;
+
+ if (skb->len < ETH_ZLEN)
+ return -EINVAL;
+
+ if (!priv->connected)
+ return 0;
+
+ spin_lock_irqsave(&priv->lock, flags);
+
+ if (ioq_full(priv->txq.queue, ioq_idxtype_valid)) {
+ /*
+ * We must flow-control the kernel by disabling the queue
+ */
+ spin_unlock_irqrestore(&priv->lock, flags);
+ netif_stop_queue(dev);
+ return 0;
+ }
+
+ /*
+ * We want to iterate on the head of the "inuse" index
+ */
+ ret = ioq_iter_init(priv->txq.queue, &iter, ioq_idxtype_inuse, 0);
+ BUG_ON(ret < 0);
+
+ ret = ioq_iter_seek(&iter, ioq_seek_head, 0, 0);
+ BUG_ON(ret < 0);
+
+ if (skb->len > iter.desc->len) {
+ spin_unlock_irqrestore(&priv->lock, flags);
+ return -EINVAL;
+ }
+
+ dev->trans_start = jiffies; /* save the timestamp */
+
+ /* Copy the data to the north-side buffer */
+ data = (char*)gpa_to_hva(priv->kvm, iter.desc->ptr);
+ memcpy(data, skb->data, skb->len);
+ kunmap(data);
+
+ /* Give ownership back to the north */
+ mb();
+ iter.desc->sown = 0;
+ mb();
+
+ /* Advance the index */
+ ret = ioq_iter_push(&iter, 0);
+ BUG_ON(ret < 0);
+
+ /*
+ * This will signal the north side to consume the packet
+ */
+ ioq_signal(priv->txq.queue, 0);
+
+ spin_unlock_irqrestore(&priv->lock, flags);
+
+ return 0;
+}
+
+void ioqnet_tx_intr(unsigned long data)
+{
+ struct ioqnet_priv *priv = (struct ioqnet_priv*)data;
+ unsigned long flags;
+
+ spin_lock_irqsave(&priv->lock, flags);
+
+ /*
+ * If we were previously stopped due to flow control, restart the
+ * processing
+ */
+ if (netif_queue_stopped(priv->netdev)
+ && !ioq_full(priv->txq.queue, ioq_idxtype_inuse)) {
+
+ netif_wake_queue(priv->netdev);
+ }
+
+ spin_unlock_irqrestore(&priv->lock, flags);
+}
+
+/*
+ * Ioctl commands
+ */
+int ioqnet_ioctl(struct net_device *dev, struct ifreq *rq, int cmd)
+{
+ PDEBUG("ioctl\n");
+ return 0;
+}
+
+/*
+ * Return statistics to the caller
+ */
+struct net_device_stats *ioqnet_stats(struct net_device *dev)
+{
+ struct ioqnet_priv *priv = netdev_priv(dev);
+ return &priv->stats;
+}
+
+static void ioq_rx_notify(struct ioq_notifier *notifier)
+{
+ struct ioqnet_priv *priv;
+ struct net_device *dev;
+
+ priv = container_of(notifier, struct ioqnet_priv, rxq.notifier);
+ dev = priv->netdev;
+
+ ioqnet_rx_ints(dev, 0); /* Disable further interrupts */
+ netif_rx_schedule(dev);
+}
+
+static void ioq_tx_notify(struct ioq_notifier *notifier)
+{
+ struct ioqnet_priv *priv;
+
+ priv = container_of(notifier, struct ioqnet_priv, txq.notifier);
+
+ tasklet_schedule(&priv->txtask);
+}
+
+/*
+ * The init function (sometimes called probe).
+ * It is invoked by register_netdev()
+ */
+void ioqnet_init(struct net_device *dev)
+{
+ ether_setup(dev); /* assign some of the fields */
+
+ dev->open = ioqnet_open;
+ dev->stop = ioqnet_release;
+ dev->set_config = ioqnet_config;
+ dev->hard_start_xmit = ioqnet_tx;
+ dev->do_ioctl = ioqnet_ioctl;
+ dev->get_stats = ioqnet_stats;
+ dev->poll = ioqnet_poll;
+ dev->weight = 2;
+ dev->hard_header_cache = NULL; /* Disable caching */
+
+ /* We go "link down" until the guest connects to us */
+ netif_carrier_off(dev);
+
+}
+
+
+/* -------------------------------------------------------------- */
+
+static inline struct ioqnet_priv* to_priv(struct kvm_pv_device *t)
+{
+ return container_of(t, struct ioqnet_priv, pvdev);
+}
+
+
+static int ioqnet_connect(struct ioqnet_priv *priv,
+ ioq_id_t id,
+ struct ioqnet_queue *q,
+ void (*func)(struct ioq_notifier*))
+{
+ int ret;
+ struct ioq_mgr *ioqmgr = priv->kvm->ioqmgr;
+
+ ret = ioqmgr->connect(ioqmgr, id, &q->queue, 0);
+ if (ret < 0)
+ return ret;
+
+ q->notifier.signal = func;
+
+ return 0;
+}
+
+static int ioqnet_pvbus_connect(struct ioqnet_priv *priv,
+ void *data, size_t len)
+{
+ struct ioqnet_connect *cnct = (struct ioqnet_connect*)data;
+ int ret;
+
+ /* We connect the north's rxq to our txq */
+ ret = ioqnet_connect(priv, cnct->rxq, &priv->txq, ioq_tx_notify);
+ if (ret < 0)
+ return ret;
+
+ /* And vice-versa */
+ ret = ioqnet_connect(priv, cnct->txq, &priv->rxq, ioq_rx_notify);
+ if (ret < 0)
+ return ret;
+
+ /*
+ * So now that the guest has connected we can send a "link up" event
+ * to the kernel.
+ */
+ netif_carrier_on(priv->netdev);
+
+ priv->connected = 1;
+
+ return 0;
+}
+
+static int ioqnet_pvbus_query_mac(struct ioqnet_priv *priv,
+ void *data, size_t len)
+{
+ if (len != ETH_ALEN)
+ return -EINVAL;
+
+ memcpy(data, priv->netdev->dev_addr, ETH_ALEN);
+
+ return 0;
+}
+
+/*
+ * This function is invoked whenever a guest calls pvbus_ops->call() against
+ * our instance ID
+ */
+static int ioqnet_pvbus_device_call(struct kvm_pv_device *t, u32 func,
+ void *data, size_t len)
+{
+ struct ioqnet_priv *priv = to_priv(t);
+ int ret = -EINVAL;
+
+ switch (func) {
+ case IOQNET_CONNECT:
+ ret = ioqnet_pvbus_connect(priv, data, len);
+ break;
+ case IOQNET_QUERY_MAC:
+ ret = ioqnet_pvbus_query_mac(priv, data, len);
+ break;
+ }
+
+ return ret;
+}
+
+static void ioqnet_pvbus_device_destroy(struct kvm_pv_device *t)
+{
+
+}
+
+/*
+ * This function is invoked whenever someone instantiates an IOQNET object
+ */
+static int ioqnet_pvbus_devtype_create(struct kvm *kvm,
+ struct kvm_pv_devtype *t, u64 id,
+ const char *cfg,
+ struct kvm_pv_device **pvdev)
+{
+ struct net_device *dev;
+ struct ioqnet_priv *priv;
+ int ret;
+
+ dev = alloc_netdev(sizeof(struct ioqnet_priv), "ioq%d",
+ ioqnet_init);
+ if (!dev)
+ return -ENOMEM;
+
+ priv = netdev_priv(dev);
+
+ memset(priv, 0, sizeof(*priv));
+
+ priv->pvdev.call = ioqnet_pvbus_device_call;
+ priv->pvdev.destroy = ioqnet_pvbus_device_destroy;
+ priv->pvdev.id = id;
+ priv->pvdev.ver = IOQNET_VERSION;
+
+ spin_lock_init(&priv->lock);
+ priv->kvm = kvm;
+ priv->netdev = dev;
+ tasklet_init(&priv->txtask, ioqnet_tx_intr, (unsigned long)priv);
+
+ ret = register_netdev(dev);
+ if (ret < 0) {
+ printk(KERN_ERR "ioqnet: error %i registering device \"%s\"\n",
+ ret, dev->name);
+ free_netdev(dev);
+ return ret;
+ }
+
+ *pvdev = &priv->pvdev;
+
+ return 0;
+}
+
+static void ioqnet_pvbus_devtype_destroy(struct kvm_pv_devtype *t)
+{
+
+}
+
+static struct kvm_pv_devtype ioqnet_devtype = {
+ .create = ioqnet_pvbus_devtype_create,
+ .destroy = ioqnet_pvbus_devtype_destroy,
+ .name = IOQNET_NAME,
+};
+
+static int __init ioqnet_init_module(void)
+{
+ return kvm_pvbus_registertype(&ioqnet_devtype);
+}
+
+static void __exit ioqnet_cleanup_module(void)
+{
+ kvm_pvbus_unregistertype(IOQNET_NAME);
+}
+
+module_init(ioqnet_init_module);
+module_exit(ioqnet_cleanup_module);
-------------------------------------------------------------------------
This SF.net email is sponsored by: Splunk Inc.
Still grepping through log files to find problems? Stop.
Now Search log events and configuration files using AJAX and a browser.
Download your FREE copy of Splunk now >> http://get.splunk.com/
^ permalink raw reply related [flat|nested] 41+ messages in thread
* Re: [PATCH 00/10] PV-IO v3
[not found] ` <20070816231357.8044.55943.stgit-5CR4LY5GPkvLDviKLk5550HKjMygAv58XqFh9Ls21Oc@public.gmane.org>
` (9 preceding siblings ...)
2007-08-16 23:14 ` [PATCH 10/10] KVM: Add an IOQNET backend driver Gregory Haskins
@ 2007-08-17 1:25 ` Rusty Russell
[not found] ` <1187313953.6449.70.camel-bi+AKbBUZKY6gyzm1THtWbp2dZbC/Bob@public.gmane.org>
10 siblings, 1 reply; 41+ messages in thread
From: Rusty Russell @ 2007-08-17 1:25 UTC (permalink / raw)
To: Gregory Haskins
Cc: kvm-devel-5NWGOfrQmneRv+LV9MX5uipxlwaOVQ5f, virtualization
On Thu, 2007-08-16 at 19:13 -0400, Gregory Haskins wrote:
> Here is the v3 release of the patch series for a generalized PV-IO
> infrastructure. It has v2 plus the following changes:
Hi Gregory,
This is a lot of code. I'm having trouble taking it all in, TBH. It
might help me if we could go back to the basic transport
implementation questions.
Transport has several parts. What the hypervisor knows about (usually
shared memory and some interrupt mechanism and possibly "DMA") and what
is convention between users (eg. ringbuffer layouts). Whether it's 1:1
or n-way (if 1:1, is it symmetrical?). Whether it has to be host <->
guest, or can be inter-guest. Whether it requires trust between the
sides.
My personal thoughts are that we should be aiming for 1:1 untrusting. I
like N-way, but it adds complexity. And not having inter-guest is just
poor form (and putting it in later is impossible, as we'll see).
It seems that a shared-memory "ring-buffer of descriptors" is the
simplest implementation. But there are two problems with a simple
descriptor ring:
1) A ring buffer doesn't work well for things which process
out-of-order, such as a block device.
2) We either need huge descriptors or some chaining mechanism to
handle scatter-gather.
So we end up with an array of descriptors with next pointers, and two
ring buffers which refer to those descriptors: one for what descriptors
are pending, and one for what descriptors have been used (by the other
end).
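The descriptor-array-plus-two-rings arrangement just described can be sketched in C roughly as follows. This is only an illustration of the shape of the scheme; the names (pvio_desc, pvio_ring, and so on) are hypothetical and not from any posted patch, and real code would need memory barriers at the marked points.

```c
#include <assert.h>
#include <string.h>

#define NUM_DESCS 64
#define DESC_END  NUM_DESCS          /* sentinel: no next descriptor */

/* One scatter-gather element; chained via 'next'. */
struct pvio_desc {
    unsigned long long addr;         /* guest-physical address of the buffer */
    unsigned int       len;
    unsigned int       next;         /* index of next desc, or DESC_END */
};

/* A one-producer index ring holding head-descriptor indices. */
struct pvio_ring {
    unsigned int idx;                /* free-running producer count */
    unsigned int ring[NUM_DESCS];
};

struct pvio_queue {
    struct pvio_desc desc[NUM_DESCS];
    struct pvio_ring pending;        /* written by the submitting side */
    struct pvio_ring used;           /* written by the completing side */
};

/* Submitting side: publish a chain whose head descriptor is 'head'. */
static void pvio_submit(struct pvio_queue *q, unsigned int head)
{
    q->pending.ring[q->pending.idx % NUM_DESCS] = head;
    q->pending.idx++;     /* a real implementation needs a write barrier here */
}

/* Completing side: hand the chain at 'head' back to the submitter. */
static void pvio_complete(struct pvio_queue *q, unsigned int head)
{
    q->used.ring[q->used.idx % NUM_DESCS] = head;
    q->used.idx++;
}
```

Note that each ring has exactly one writer, which matters for both cache behavior and the untrusted inter-guest case discussed below.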
This is sufficient for guest<->host, but care must be taken for guest
<-> guest. Let's dig down:
Consider a transport from A -> B. A populates the descriptor entries
corresponding to its sg, then puts the head descriptor entry in the
"pending" ring buffer and sends B an interrupt. B sees the new pending
entry, reads the descriptors, does the operation and reads or writes
into the memory pointed to by the descriptors. It then updates the
"used" ring buffer and sends A an interrupt.
Now, if B is untrusted, this is more difficult. It needs to read the
descriptor entries and the "pending" ring buffer, and write to the
"used" ring buffer. We can use page protection to share these if we
arrange things carefully, like so:
struct desc_pages
{
/* Page of descriptors. */
struct lguest_desc desc[NUM_DESCS];
/* Next page: how we tell other side what buffers are available. */
unsigned int avail_idx;
unsigned int available[NUM_DESCS];
char pad[PAGE_SIZE - (NUM_DESCS+1) * sizeof(unsigned int)];
/* Third page: how other side tells us what's used. */
unsigned int used_idx;
struct lguest_used used[NUM_DESCS];
};
But we still have the problem of an untrusted B having to read/write A's
memory pointed to by A's descriptors. At this point, my preferred solution
so far is as follows (note: have not implemented this!):
(1) have the hypervisor be aware of the descriptor page format, location
and which guest can access it.
(2) have the descriptors themselves contain a type (read/write) and a
valid bit.
(3) have a "DMA" hypercall to copy to/from someone else's descriptors.
Note that this means we do a copy for the untrusted case which doesn't
exist for the trusted case. In theory the hypervisor could do some
tricky copy-on-write page-sharing for very large well-aligned buffers,
but it remains to be seen if that is actually useful.
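A minimal sketch of what steps (2) and (3) might amount to, with hypothetical names (Rusty says above this is unimplemented): the "DMA" hypercall only has to check two descriptor flags before copying, rather than understand the whole ring protocol.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define DESC_F_VALID 0x1   /* descriptor is offered to the other side */
#define DESC_F_WRITE 0x2   /* other side may write; otherwise read-only */

struct pv_desc {
    unsigned long long addr;
    unsigned int       len;
    unsigned int       flags;
};

/*
 * The "DMA" hypercall, reduced to its enforcement rules: copy into a
 * remote descriptor's buffer only if the descriptor is valid and marked
 * writable, and never past its stated length.  The memcpy stands in for
 * the hypervisor's cross-guest copy.
 */
static int hv_dma_write(const struct pv_desc *d, void *dest_mem,
                        const void *src, size_t len)
{
    if (!(d->flags & DESC_F_VALID))
        return -1;               /* not offered to us */
    if (!(d->flags & DESC_F_WRITE))
        return -1;               /* read-only buffer */
    if (len > d->len)
        return -1;               /* would overrun the buffer */
    memcpy(dest_mem, src, len);
    return 0;
}
```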
Sorry for the long mail, but I really want to get the mechanism correct.
Cheers,
Rusty.
^ permalink raw reply [flat|nested] 41+ messages in thread
* Re: [PATCH 00/10] PV-IO v3
[not found] ` <1187313953.6449.70.camel-bi+AKbBUZKY6gyzm1THtWbp2dZbC/Bob@public.gmane.org>
@ 2007-08-17 5:26 ` Gregory Haskins
[not found] ` <1187328402.4363.110.camel-5CR4LY5GPkvLDviKLk5550HKjMygAv58XqFh9Ls21Oc@public.gmane.org>
2007-08-19 9:24 ` Avi Kivity
1 sibling, 1 reply; 41+ messages in thread
From: Gregory Haskins @ 2007-08-17 5:26 UTC (permalink / raw)
To: Rusty Russell; +Cc: kvm-devel-5NWGOfrQmneRv+LV9MX5uipxlwaOVQ5f, virtualization
Hi Rusty,
Comments inline...
On Fri, 2007-08-17 at 11:25 +1000, Rusty Russell wrote:
>
> Transport has several parts. What the hypervisor knows about (usually
> shared memory and some interrupt mechanism and possibly "DMA") and what
> is convention between users (eg. ringbuffer layouts). Whether it's 1:1
> or n-way (if 1:1, is it symmetrical?).
TBH, I am not sure what you mean by 1:1 vs n-way ringbuffers (it's
probably just lack of sleep and tomorrow I will smack myself for
asking ;)
But could you elaborate here?
> Whether it has to be host <->
> guest, or can be inter-guest. Whether it requires trust between the
> sides.
>
> My personal thoughts are that we should be aiming for 1:1 untrusting.
Untrusting I understand, and I agree with you there. Obviously the host
is implicitly trusted (you have no choice, really) but I think the
guests should be validated just as you would for a standard
userspace/kernel interaction (e.g. validate pointer arguments and their
range, etc).
> And not having inter-guest is just
> poor form (and putting it in later is impossible, as we'll see).
I agree that having an ability to do inter-guest is a good idea.
However, I don't know if I am convinced that it has to be done in a
direct, zero-copy way. Mediating through the host certainly can work and
is probably acceptable for most things. In this way the host is
essentially acting as a DMA agent to copy from one guests memory to the
other. It solves the "trust" issue and simplifies the need to have a
"grant table" like mechanism which can get pretty hairy, IMHO.
I *could* be convinced otherwise, but that is my current thought. This
would essentially look very similar to how my patch #4 (loopback) works.
It takes a pointer from a tx-queue and copies the data to a pointer
from an empty descriptor in the other side's rx-queue. If you move that
concept down into the host this is how I was envisioning it working.
>
> It seems that a shared-memory "ring-buffer of descriptors" is the
> simplest implementation. But there are two problems with a simple
> descriptor ring:
>
> 1) A ring buffer doesn't work well for things which process
> out-of-order, such as a block device.
> 2) We either need huge descriptors or some chaining mechanism to
> handle scatter-gather.
>
I definitely agree that a simple descriptor-ring in and of itself doesn't
solve all possible patterns directly. I don't know if you had a chance
to look too deeply into the IOQ code yet, but it essentially is a very
simple descriptor-ring as you mention.
However, I don't view that as a limitation because I envision this type
of thing to be just one "tool" or layer in a larger puzzle. One that
can be applied many different ways to solve more complex problems.
(The following is a long and boring story about my train of thought and
how I got to where I am today with this code)
<boring-story>What I was seeing as a general problem is efficient basic
event movement. <obvious-statement>Each guest->host or host->guest
transition is expensive so we want to minimize the number of these
occurring </obvious-statement> (ideally down to 1 (or less!) per
operation).
Now moving events out of a guest in one (or fewer) IO operations is
fairly straightforward (the hypercall namespace is typically pretty large
and they can have accompanying parameters (including pointers)
associated with them). However, moving events *into* the guest in one
(or fewer) shots is difficult because by default you really only have a
single parameter (interrupt vector) to convey any meaning. To make
matters worse, the namespace for vectors can be rather small (e.g. 256
on x86).
Now traditionally we would of course solve the latter problem by
turning around and doing some kind of additional IO operation to get
more details about the event. And why not? It's dirt cheap on
bare-metal. Of course, in a VM this is particularly expensive and we
want to avoid it.
Enter the shared memory concept: E.g. put details about the event
somewhere in memory that can be read in the guest without a VMEXIT. Now
your interrupt vector is simply ringing the doorbell on your
shared-memory. The question becomes: how do you synchronize access to
the memory without necessitating as much overhead as you had to begin
with? E.g. how does one side know when the other side is done and wants
more data, etc. What if you want to parallelize things, etc.
Enter the shared-memory queue: Now you have a way to organize your
memory such that both sides can use it effectively and simultaneously.
So there you have it: We can use a simple shared-memory-queue to
efficiently move event data into a guest. And we can use hypercalls to
efficiently move it out. As it turns out, there are also cases where
using a queue for the output side makes sense too, but the basic case is
for input. But long story short, that is the basic fundamental purpose
of this subsystem.
Now enter the more complex usage patterns:
For instance, a block device driver could do two hypercalls ("write
sglist-A[] as transaction X to position P", and "write sglist-B[] as
transaction Y to position Q"), and the host process them out of order
and write "completed transaction Y", and "completed transaction X" into
the driver's event queue. (The block driver might also use a tx-queue
instead of hypercalls if it wanted, but this may or may not make sense).
Or a network driver might push a sglist of a packet to write into a
txqueue entry, and the host might copy a received packet into the
driver's rx-queue. (This is essentially what IOQNET is.) The guest would
see interrupts for its tx-queue to say "i finished the send, reclaim
your skbs", and it would see interrupts on the rx-queue to say "here's
data to receive".
Etc. etc. In the first case, the events were "please write this" and
"write completed". In the second they were "please write this", "im
done writing" and "please read this". Sure there is data associated
with these events and they are utilized in drastically different
patterns. But either way they were just events and the event stream can
be looked at as a simple ordered sequence....even if their underlying
constructs are not per se. Does this make any sense?
</boring-story>
> So we end up with an array of descriptors with next pointers, and two
> ring buffers which refer to those descriptors: one for what descriptors
> are pending, and one for what descriptors have been used (by the other
> end).
That's certainly one way to do it. IOQ (coming from the "simple ordered
event sequence" mindset) has one logically linear ring. It uses a set
of two "head/tail" indices ("valid" and "inuse") and an ownership flag
(per descriptor) to essentially offer similar services as you mention.
Producers "push" items at the index head, and consumers "pop" items from
the index tail. Only the guest side can manipulate the valid index.
Only the producer can manipulate the inuse-head. And only the consumer
can manipulate the inuse-tail. Either side can manipulate the ownership
bit, but only in strict accordance with the production or consumption of
data.
Generally speaking, a driver (guest or host side) is seeking to either
the head or tail (depending on whether it's a producer or consumer) and then
waiting for the ownership bit to change in its favor. Once it's changed,
the data is produced or consumed, the bit is flipped back to the other
side, and the index is advanced. That, in a nutshell, is how the whole
deal works.... coupled with the fact that a basic "ioq_signal" operation
will kick the other side (which would typically be either a hypercall or
an interrupt depending on which side of the link you were on).
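As a toy model, greatly simplified from the actual IOQ patches (field names are illustrative, and all synchronization and signaling is omitted), the ownership-bit discipline looks like:

```c
#include <assert.h>

#define RING_SIZE 8

enum owner { OWNER_NORTH, OWNER_SOUTH };   /* e.g. guest and host */

struct ioq_desc {
    enum owner owner;
    void      *data;
};

struct ioq {
    struct ioq_desc ring[RING_SIZE];
    unsigned int head;   /* only the producer advances this */
    unsigned int tail;   /* only the consumer advances this */
};

/* Producer: claim the descriptor at head if we own it, fill it,
 * then flip ownership to the other side and advance. */
static int ioq_push(struct ioq *q, enum owner self, void *data)
{
    struct ioq_desc *d = &q->ring[q->head % RING_SIZE];

    if (d->owner != self)
        return -1;                 /* ring full: other side still owns it */
    d->data  = data;
    d->owner = (self == OWNER_NORTH) ? OWNER_SOUTH : OWNER_NORTH;
    q->head++;
    return 0;
}

/* Consumer: take the descriptor at tail once ownership flips to us. */
static int ioq_pop(struct ioq *q, enum owner self, void **data)
{
    struct ioq_desc *d = &q->ring[q->tail % RING_SIZE];

    if (d->owner != self)
        return -1;                 /* nothing produced yet */
    *data    = d->data;
    d->owner = (self == OWNER_NORTH) ? OWNER_SOUTH : OWNER_NORTH;
    q->tail++;
    return 0;
}
```

Each index has a single writer, but as Rusty points out below, the per-descriptor ownership flag itself is written by both sides.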
One thing that is particularly cool about the IOQ design is that its
possible to get to 0 IO events for certain circumstances. For instance,
if you look at the IOQNET driver, it has what I would call
"bidirectional NAPI". I think everyone here probably understands how
standard NAPI disables RX interrupts after the first packet is received
Well, IOQNET can also disable TX hypercalls after the first one goes
down to the host. Any subsequent writes will simply post to the queue
until the host catches up and re-enables "interrupts". Maybe all of
these queue schemes typically do that... I'm not sure... but I thought it
was pretty cool.
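The mitigation idea above boils down to a single enable flag per direction. A sketch, with hypothetical names (this is not the IOQNET code, and a real implementation must recheck the queue after re-enabling to close the race):

```c
#include <assert.h>

struct notify_state {
    int enabled;     /* consumer wants a kick when new work arrives */
    int kicks;       /* expensive notifications actually sent */
};

/*
 * Producer side: queue the work (elided), and only perform the
 * expensive notification (hypercall or injected interrupt) while
 * notifications are enabled.  The first kick disables them, modeling
 * the consumer running in polling mode until it drains the queue.
 */
static void produce(struct notify_state *ns)
{
    if (ns->enabled) {
        ns->kicks++;         /* would be the hypercall / interrupt */
        ns->enabled = 0;
    }
}

/* Consumer side: queue drained, ask to be kicked for the next burst. */
static void consumer_done(struct notify_state *ns)
{
    ns->enabled = 1;
}
```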
>
> This is sufficient for guest<->host, but care must be taken for guest
> <-> guest. Let's dig down:
>
> Consider a transport from A -> B. A populates the descriptor entries
> corresponding to its sg, then puts the head descriptor entry in the
> "pending" ring buffer and sends B an interrupt. B sees the new pending
> entry, reads the descriptors, does the operation and reads or writes
> into the memory pointed to by the descriptors. It then updates the
> "used" ring buffer and sends A an interrupt.
>
> Now, if B is untrusted, this is more difficult. It needs to read the
> descriptor entries and the "pending" ring buffer, and write to the
> "used" ring buffer. We can use page protection to share these if we
> arrange things carefully, like so:
>
> struct desc_pages
> {
> /* Page of descriptors. */
> struct lguest_desc desc[NUM_DESCS];
>
> /* Next page: how we tell other side what buffers are available. */
> unsigned int avail_idx;
> unsigned int available[NUM_DESCS];
> char pad[PAGE_SIZE - (NUM_DESCS+1) * sizeof(unsigned int)];
>
> /* Third page: how other side tells us what's used. */
> unsigned int used_idx;
> struct lguest_used used[NUM_DESCS];
> };
>
> But we still have the problem of an untrusted B having to read/write A's
> memory pointed to A's descriptors. At this point, my preferred solution
> so far is as follows (note: have not implemented this!):
>
> (1) have the hypervisor be aware of the descriptor page format, location
> and which guest can access it.
> (2) have the descriptors themselves contains a type (read/write) and a
> valid bit.
> (3) have a "DMA" hypercall to copy to/from someone else's descriptors.
>
> Note that this means we do a copy for the untrusted case which doesn't
> exist for the trusted case. In theory the hypervisor could do some
> tricky copy-on-write page-sharing for very large well-aligned buffers,
> but it remains to be seen if that is actually useful.
That sounds *somewhat* similar to what I was getting at above with the
dma/loopback thingy. Though you are talking about that "grant table"
stuff and are scaring me ;) But in all seriousness, it would be pretty
darn cool to get that to work. I am still trying to wrap my head around
all of this....
>
> Sorry for the long mail, but I really want to get the mechanism correct.
I see your long mail, and I raise you 10x ;)
Regards,
-Greg
^ permalink raw reply [flat|nested] 41+ messages in thread
* Re: [PATCH 00/10] PV-IO v3
[not found] ` <1187328402.4363.110.camel-5CR4LY5GPkvLDviKLk5550HKjMygAv58XqFh9Ls21Oc@public.gmane.org>
@ 2007-08-17 7:43 ` Rusty Russell
[not found] ` <1187336618.6449.106.camel-bi+AKbBUZKY6gyzm1THtWbp2dZbC/Bob@public.gmane.org>
0 siblings, 1 reply; 41+ messages in thread
From: Rusty Russell @ 2007-08-17 7:43 UTC (permalink / raw)
To: Gregory Haskins
Cc: kvm-devel-5NWGOfrQmneRv+LV9MX5uipxlwaOVQ5f, virtualization
On Fri, 2007-08-17 at 01:26 -0400, Gregory Haskins wrote:
> Hi Rusty,
>
> Comments inline...
>
> On Fri, 2007-08-17 at 11:25 +1000, Rusty Russell wrote:
> >
> > Transport has several parts. What the hypervisor knows about (usually
> > shared memory and some interrupt mechanism and possibly "DMA") and what
> > is convention between users (eg. ringbuffer layouts). Whether it's 1:1
> > or n-way (if 1:1, is it symmetrical?).
>
> TBH, I am not sure what you mean by 1:1 vs n-way ringbuffers (its
> probably just lack of sleep and tomorrow I will smack myself for
> asking ;)
>
> But could you elaborate here?
Hi Gregory,
Sure, these discussions can get pretty esoteric. The question is
whether you want a point-to-point transport (as we discuss here), or an
N-way. Lguest has N-way, but I'm not convinced it's worthwhile, as
there's some overhead involved in looking up recipients (basically futex
code).
> > And not having inter-guest is just
> > poor form (and putting it in later is impossible, as we'll see).
>
> I agree that having an ability to do inter-guest is a good idea.
> However, I don't know if I am convinced if it has to be done in a
> direct, zero-copy way. Mediating through the host certainly can work and
> is probably acceptable for most things. In this way the host is
> essentially acting as a DMA agent to copy from one guests memory to the
> other. It solves the "trust" issue and simplifies the need to have a
> "grant table" like mechanism which can get pretty hairy, IMHO.
I agree that page sharing is silly. But we can design a mechanism where
such a "DMA agent" need only enforce a few very simple rules, not the
whole protocol, and yet the guest doesn't know whether it's talking to
an agent or the host.
> > So we end up with an array of descriptors with next pointers, and two
> > ring buffers which refer to those descriptors: one for what descriptors
> > are pending, and one for what descriptors have been used (by the other
> > end).
>
> That's certainly one way to do it. IOQ (coming from the "simple ordered
> event sequence" mindset) has one logically linear ring. It uses a set
> of two "head/tail" indices ("valid" and "inuse") and an ownership flag
> (per descriptor) to essentially offer similar services as you mention.
> Producers "push" items at the index head, and consumers "pop" items from
> the index tail. Only the guest side can manipulate the valid index.
> Only the producer can manipulate the inuse-head. And only the consumer
> can manipulate the inuse-tail. Either side can manipulate the ownership
> bit, but only in strict accordance with the production or consumption of
> data.
Well, for cache reasons you should really try to avoid having both sides
write to the same data. Hence two separate cache-aligned regions is
better than one region and a flip bit. And if you make them separate
pages, then this can also be inter-guest safe 8)
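Rusty's point about separate single-writer regions can be made concrete with a layout like this (a sketch, assuming 64-byte cache lines; padding the indices out to page size would give the inter-guest page-protection property he mentions):

```c
#include <assert.h>
#include <stddef.h>

#define CACHELINE 64

/*
 * Each index lives in its own cache line, so side A and side B never
 * write to the same line (no false sharing, no ping-ponging), unlike
 * a scheme where both sides flip a bit in a shared descriptor.
 */
struct split_ring_indices {
    unsigned int avail_idx;                      /* written by A only */
    char pad1[CACHELINE - sizeof(unsigned int)];
    unsigned int used_idx;                       /* written by B only */
    char pad2[CACHELINE - sizeof(unsigned int)];
};
```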
> One thing that is particularly cool about the IOQ design is that its
> possible to get to 0 IO events for certain circumstances. For instance,
> if you look at the IOQNET driver, it has what I would call
> "bidirectional NAPI". I think everyone here probably understands how
> standard NAPI disables RX interrupts after the first packet is received
> Well, IOQNET can also disable TX hypercalls after the first one goes
> down to the host. Any subsequent writes will simply post to the queue
> until the host catches up and re-enables "interrupts". Maybe all of
> these queue schemes typically do that...im not sure...but I thought it
> was pretty cool.
Yeah, I agree. I'm not sure how important it is IRL, but it *feels*
clever 8)
> > (1) have the hypervisor be aware of the descriptor page format, location
> > and which guest can access it.
> > (2) have the descriptors themselves contains a type (read/write) and a
> > valid bit.
> > (3) have a "DMA" hypercall to copy to/from someone else's descriptors.
> >
> > Note that this means we do a copy for the untrusted case which doesn't
> > exist for the trusted case. In theory the hypervisor could do some
> > tricky copy-on-write page-sharing for very large well-aligned buffers,
> > but it remains to be seen if that is actually useful.
>
> That sounds *somewhat* similar to what I was getting at above with the
> dma/loopback thingy. Though you are talking about that "grant table"
> stuff and are scaring me ;)
Yeah, I fear grant tables too. But in any scheme, the descriptors imply
permission, so with a little careful design and implementation it should
"just work"...
Cheers,
Rusty.
^ permalink raw reply [flat|nested] 41+ messages in thread
* Re: [PATCH 00/10] PV-IO v3
[not found] ` <1187336618.6449.106.camel-bi+AKbBUZKY6gyzm1THtWbp2dZbC/Bob@public.gmane.org>
@ 2007-08-17 13:50 ` Gregory Haskins
[not found] ` <1187358614.4363.135.camel-5CR4LY5GPkvLDviKLk5550HKjMygAv58XqFh9Ls21Oc@public.gmane.org>
0 siblings, 1 reply; 41+ messages in thread
From: Gregory Haskins @ 2007-08-17 13:50 UTC (permalink / raw)
To: Rusty Russell; +Cc: kvm-devel-5NWGOfrQmneRv+LV9MX5uipxlwaOVQ5f, virtualization
On Fri, 2007-08-17 at 17:43 +1000, Rusty Russell wrote:
> Sure, these discussions can get pretty esoteric. The question is
> whether you want a point-to-point transport (as we discuss here), or an
> N-way. Lguest has N-way, but I'm not convinced it's worthwhile, as
> there's some overhead involved in looking up recipients (basically futex
> code).
Ah, ok I get it. In that case: yeah, I agree 1:1 is probably the way to
go. We can always build some kind of N-way transport in terms of 1:1
primitives if it's desirable (though it's probably better in most cases to
just reuse something like an ethernet transport/bridge than invent
something new).
>
> I agree that page sharing is silly. But we can design a mechanism where
> such a "DMA agent" need only enforce a few very simple rules, not the
> whole protocol, and yet the guest doesn't know whether it's talking to
> an agent or the host.
Interesting....I would love to hear more of your ideas surrounding this.
> Well, for cache reasons you should really try to avoid having both sides
> write to the same data. Hence two separate cache-aligned regions is
> better than one region and a flip bit.
While I certainly can see what you mean about the cache implications for
a bit-flip design, I don't see how you can get away with not having both
sides write to the same memory in other designs either. Wouldn't you
still have to adjust descriptors from one ring to the other? E.g.
wouldn't both sides be writing descriptor pointer data in this case, or
am I missing something?
> And if you make them separate pages, then this can also be inter-guest
> safe 8)
Ok, now you are making my head hurt 8)
>
> Yeah, I agree. I'm not sure how important it is IRL, but it *feels*
> clever 8)
Heh, yeah, I agree I don't know how much it saves. I kind of got it for
free based on the general design of the queue, so I thought "hey that's
pretty cool". It shouldn't *hurt*, anyway ;)
>
> Yeah, I fear grant tables too. But in any scheme, the descriptors imply
> permission, so with a little careful design and implementation it should
> "just work"...
>
I am certainly looking forward to hearing more of your ideas in this
area. Very interesting, indeed....
Regards,
-Greg
^ permalink raw reply [flat|nested] 41+ messages in thread
* Re: [PATCH 00/10] PV-IO v3
[not found] ` <1187313953.6449.70.camel-bi+AKbBUZKY6gyzm1THtWbp2dZbC/Bob@public.gmane.org>
2007-08-17 5:26 ` Gregory Haskins
@ 2007-08-19 9:24 ` Avi Kivity
[not found] ` <46C80C5B.7070009-atKUWr5tajBWk0Htik3J/w@public.gmane.org>
1 sibling, 1 reply; 41+ messages in thread
From: Avi Kivity @ 2007-08-19 9:24 UTC (permalink / raw)
To: Rusty Russell; +Cc: kvm-devel-5NWGOfrQmneRv+LV9MX5uipxlwaOVQ5f, virtualization
Rusty Russell wrote:
> 2) We either need huge descriptors or some chaining mechanism to
> handle scatter-gather.
>
Or, my preference, have a small sglist in the descriptor; if the buffer
doesn't fit in the sglist follow a pointer and size (stored in the same
place as the immediate sglist) to an external sglist.
The advantage of this is that it doesn't complicate descriptor
allocation and accounting (if you have large sglists, suddenly you have
fewer than expected effective descriptors).
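One way to lay out the inline-sglist-with-overflow idea Avi describes (types and the inline count of 4 are illustrative assumptions, not a posted interface):

```c
#include <assert.h>
#include <stddef.h>

#define INLINE_SG 4   /* arbitrary small inline count for illustration */

struct sg_entry {
    unsigned long long addr;
    unsigned int       len;
};

/*
 * Short requests fit entirely in 'inline_sg' and 'ext' is unused;
 * larger requests spill into an external list pointed to by 'ext'.
 * Either way the request consumes exactly one descriptor slot, so
 * descriptor accounting stays simple.
 */
struct sg_desc {
    unsigned int     num;                   /* total sg entries */
    struct sg_entry  inline_sg[INLINE_SG];
    struct sg_entry *ext;                   /* overflow list, or NULL */
};

static const struct sg_entry *sg_entry_at(const struct sg_desc *d,
                                          unsigned int i)
{
    if (d->num <= INLINE_SG)
        return &d->inline_sg[i];
    return &d->ext[i];
}
```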
--
error compiling committee.c: too many arguments to function
^ permalink raw reply [flat|nested] 41+ messages in thread
* Re: [PATCH 00/10] PV-IO v3
[not found] ` <46C80C5B.7070009-atKUWr5tajBWk0Htik3J/w@public.gmane.org>
@ 2007-08-20 13:50 ` Gregory Haskins
2007-08-20 14:03 ` [kvm-devel] " Dor Laor
[not found] ` <1187617806.4363.179.camel-5CR4LY5GPkvLDviKLk5550HKjMygAv58XqFh9Ls21Oc@public.gmane.org>
0 siblings, 2 replies; 41+ messages in thread
From: Gregory Haskins @ 2007-08-20 13:50 UTC (permalink / raw)
To: Avi Kivity; +Cc: kvm-devel-5NWGOfrQmneRv+LV9MX5uipxlwaOVQ5f, virtualization
On Sun, 2007-08-19 at 12:24 +0300, Avi Kivity wrote:
> Rusty Russell wrote:
> > 2) We either need huge descriptors or some chaining mechanism to
> > handle scatter-gather.
> >
>
> Or, my preference, have a small sglist in the descriptor;
Define "small" ;)
There are certainly patterns that cannot/will-not take advantage of SG
(for instance, your typical network rx path), and therefore the sg
entries are wasted in some cases. Since they need to be (IMHO) u64,
they suck down at least 8 bytes a piece. Because of this I elected to
use the model of one pointer per descriptor, with an external descriptor
for SG. What are your thoughts on this?
-Greg
^ permalink raw reply [flat|nested] 41+ messages in thread
* RE: [kvm-devel] [PATCH 00/10] PV-IO v3
2007-08-20 13:50 ` Gregory Haskins
@ 2007-08-20 14:03 ` Dor Laor
[not found] ` <64F9B87B6B770947A9F8391472E032160D4649E2-yEcIvxbTEBqsx+V+t5oei8rau4O3wl8o3fe8/T/H7NteoWH0uzbU5w@public.gmane.org>
[not found] ` <1187617806.4363.179.camel-5CR4LY5GPkvLDviKLk5550HKjMygAv58XqFh9Ls21Oc@public.gmane.org>
1 sibling, 1 reply; 41+ messages in thread
From: Dor Laor @ 2007-08-20 14:03 UTC (permalink / raw)
To: Gregory Haskins, Avi Kivity; +Cc: kvm-devel, virtualization
>> > 2) We either need huge descriptors or some chaining
>mechanism to
>> > handle scatter-gather.
>> >
>>
>> Or, my preference, have a small sglist in the descriptor;
>
>
>Define "small" ;)
>
>There a certainly patterns that cannot/will-not take advantage of SG
>(for instance, your typical network rx path), and therefore the sg
>entries are wasted in some cases. Since they need to be (IMHO) u64,
>they suck down at least 8 bytes a piece. Because of this I elected to
>use the model of one pointer per descriptor, with an external
descriptor
>for SG. What are your thoughts on this?
Using Rusty's code there is no waste.
Each descriptor has a flag (head|next). The next flag stands for a pointer to
the next descriptor, with a u32 next index. So the waste is 4 bytes.
Sg descriptors are chained on the same descriptor ring.
Sg descriptors are chained on the same descriptor ring.
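The chaining Dor describes corresponds roughly to this layout (a sketch with hypothetical names; Rusty's actual code may differ in detail):

```c
#include <assert.h>

#define F_NEXT 0x1   /* more entries follow in this chain */

struct chain_desc {
    unsigned long long addr;
    unsigned int       len;
    unsigned short     flags;
    unsigned short     next;   /* index of next desc when F_NEXT is set */
};

/* Walk a scatter-gather chain starting at 'head', counting entries.
 * Chained entries live in the same descriptor table as the head, so a
 * long sglist consumes many slots of the shared ring. */
static unsigned int chain_len(const struct chain_desc *table,
                              unsigned int head)
{
    unsigned int n = 1;

    while (table[head].flags & F_NEXT) {
        head = table[head].next;
        n++;
    }
    return n;
}
```

This keeps descriptors small, at the cost Avi raises next: a large request eats many ring slots.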
^ permalink raw reply [flat|nested] 41+ messages in thread
* Re: [PATCH 00/10] PV-IO v3
[not found] ` <64F9B87B6B770947A9F8391472E032160D4649E2-yEcIvxbTEBqsx+V+t5oei8rau4O3wl8o3fe8/T/H7NteoWH0uzbU5w@public.gmane.org>
@ 2007-08-20 14:12 ` Avi Kivity
[not found] ` <46C9A150.60101-atKUWr5tajBWk0Htik3J/w@public.gmane.org>
2007-08-20 14:17 ` Gregory Haskins
1 sibling, 1 reply; 41+ messages in thread
From: Avi Kivity @ 2007-08-20 14:12 UTC (permalink / raw)
To: Dor Laor; +Cc: kvm-devel-5NWGOfrQmneRv+LV9MX5uipxlwaOVQ5f, virtualization
Dor Laor wrote:
>>>> 2) We either need huge descriptors or some chaining
>>>>
>> mechanism to
>>
>>>> handle scatter-gather.
>>>>
>>>>
>>> Or, my preference, have a small sglist in the descriptor;
>>>
>> Define "small" ;)
>>
>> There are certainly patterns that cannot/will-not take advantage of SG
>> (for instance, your typical network rx path), and therefore the sg
>> entries are wasted in some cases. Since they need to be (IMHO) u64,
>> they suck down at least 8 bytes a piece. Because of this I elected to
>> use the model of one pointer per descriptor, with an external
>>
> descriptor
>
>> for SG. What are your thoughts on this?
>>
>
> Using Rusty's code there is no waste.
> Each descriptor has a flag (head|next). Next flag stands for pointer to
> the
> next descriptor with u32 next index. So the waste is 4 bytes.
> Sg descriptors are chained on the same descriptor ring.
>
Block I/O can easily require 256 sglist entries, eating up your ring.
--
error compiling committee.c: too many arguments to function
^ permalink raw reply [flat|nested] 41+ messages in thread
* Re: [PATCH 00/10] PV-IO v3
[not found] ` <1187617806.4363.179.camel-5CR4LY5GPkvLDviKLk5550HKjMygAv58XqFh9Ls21Oc@public.gmane.org>
@ 2007-08-20 14:14 ` Avi Kivity
0 siblings, 0 replies; 41+ messages in thread
From: Avi Kivity @ 2007-08-20 14:14 UTC (permalink / raw)
To: Gregory Haskins
Cc: kvm-devel-5NWGOfrQmneRv+LV9MX5uipxlwaOVQ5f, virtualization
Gregory Haskins wrote:
>>>
>>>
>> Or, my preference, have a small sglist in the descriptor;
>>
>
>
> Define "small" ;)
>
4.
> There a certainly patterns that cannot/will-not take advantage of SG
> (for instance, your typical network rx path), and therefore the sg
> entries are wasted in some cases. Since they need to be (IMHO) u64,
> they suck down at least 8 bytes a piece. Because of this I elected to
> use the model of one pointer per descriptor, with an external descriptor
> for SG. What are your thoughts on this?
>
Measurement will tell. I have a feeling that it won't really matter, so
maybe an external sglist is best.
--
error compiling committee.c: too many arguments to function
* Re: [PATCH 00/10] PV-IO v3
[not found] ` <64F9B87B6B770947A9F8391472E032160D4649E2-yEcIvxbTEBqsx+V+t5oei8rau4O3wl8o3fe8/T/H7NteoWH0uzbU5w@public.gmane.org>
2007-08-20 14:12 ` Avi Kivity
@ 2007-08-20 14:17 ` Gregory Haskins
1 sibling, 0 replies; 41+ messages in thread
From: Gregory Haskins @ 2007-08-20 14:17 UTC (permalink / raw)
To: Dor Laor
Cc: kvm-devel-5NWGOfrQmneRv+LV9MX5uipxlwaOVQ5f, Avi Kivity,
virtualization
On Mon, 2007-08-20 at 07:03 -0700, Dor Laor wrote:
> >> > 2) We either need huge descriptors or some chaining mechanism to
> >> > handle scatter-gather.
> >> >
> >>
> >> Or, my preference, have a small sglist in the descriptor;
> >
> >
> >Define "small" ;)
> >
> >There are certainly patterns that cannot/will-not take advantage of SG
> >(for instance, your typical network rx path), and therefore the sg
> >entries are wasted in some cases. Since they need to be (IMHO) u64,
> >they suck down at least 8 bytes apiece. Because of this I elected to
> >use the model of one pointer per descriptor, with an external
> >descriptor for SG. What are your thoughts on this?
>
> Using Rusty's code there is no waste.
> Each descriptor has a flag (head|next). The next flag indicates a
> pointer to the next descriptor via a u32 next index, so the waste is 4
> bytes. Sg descriptors are chained on the same descriptor ring.
Right, so he is using a chaining mechanism and I was using a
single-pointer + external-descriptor mechanism. (Actually you can chain
with IOQ too if you want but I chose to implement the IOQNET example
with external-descriptors). I'm not sure if either way is particularly
better than the other. The important thing (IMO) is that either way you
avoid waste for the (not so uncommon) non-sg case.
You still owe me some code, BTW ;)
* Re: [PATCH 00/10] PV-IO v3
[not found] ` <46C9A150.60101-atKUWr5tajBWk0Htik3J/w@public.gmane.org>
@ 2007-08-20 23:24 ` Rusty Russell
0 siblings, 0 replies; 41+ messages in thread
From: Rusty Russell @ 2007-08-20 23:24 UTC (permalink / raw)
To: Avi Kivity; +Cc: kvm-devel-5NWGOfrQmneRv+LV9MX5uipxlwaOVQ5f, virtualization
On Mon, 2007-08-20 at 17:12 +0300, Avi Kivity wrote:
> Dor Laor wrote:
> > Using Rusty's code there is no waste.
> > Each descriptor has a flag (head|next). Next flag stands for pointer to
> > the
> > next descriptor with u32 next index. So the waste is 4 bytes.
> > Sg descriptors are chained on the same descriptor ring.
> >
>
> Block I/O can easily require 256 sglist entries, eating up your ring.
Absolutely; I think we'd want this variable sized, rather than a single
page as the current implementation does for no really sound reason.
That said, I don't have a problem with Avi's out-of-band mechanism. A
"DMA" hypercall can follow that almost as easily, and it does give that
nice "fixed number of slots" effect.
Cheers,
Rusty.
* Re: [PATCH 00/10] PV-IO v3
[not found] ` <1187358614.4363.135.camel-5CR4LY5GPkvLDviKLk5550HKjMygAv58XqFh9Ls21Oc@public.gmane.org>
@ 2007-08-20 23:28 ` Rusty Russell
[not found] ` <1187652496.19435.141.camel-bi+AKbBUZKY6gyzm1THtWbp2dZbC/Bob@public.gmane.org>
0 siblings, 1 reply; 41+ messages in thread
From: Rusty Russell @ 2007-08-20 23:28 UTC (permalink / raw)
To: Gregory Haskins
Cc: kvm-devel-5NWGOfrQmneRv+LV9MX5uipxlwaOVQ5f, virtualization
On Fri, 2007-08-17 at 09:50 -0400, Gregory Haskins wrote:
> On Fri, 2007-08-17 at 17:43 +1000, Rusty Russell wrote:
> > Well, for cache reasons you should really try to avoid having both sides
> > write to the same data. Hence two separate cache-aligned regions is
> > better than one region and a flip bit.
>
> While I certainly can see what you mean about the cache implications for
> a bit-flip design, I don't see how you can get away with not having both
> sides write to the same memory in other designs either. Wouldn't you
> still have to adjust descriptors from one ring to the other? E.g.
> wouldn't both sides be writing descriptor pointer data in this case, or
> am I missing something?
Hi Gregory,
You can have separate produced and consumed counters: see for example
Van Jacobson's Netchannels presentation
http://www.lemis.com/grog/Documentation/vj/lca06vj.pdf page 23.
This single consumed count isn't sufficient if you can consume
out-of-order: for that you really want a second "reply" ringbuffer
indicating what buffers are consumed.
> > Yeah, I fear grant tables too. But in any scheme, the descriptors imply
> > permission, so with a little careful design and implementation it should
> > "just work"...
> >
>
> I am certainly looking forward to hearing more of your ideas in this
> area. Very interesting, indeed....
Well, the simplest scheme I think is a ring buffer of descriptors, eg:
struct io_desc {
	unsigned long pfn;
	u16 len;
	u16 offset;
};
struct io_ring {
	unsigned int prod_idx;
	struct io_desc desc[NUM_DESCS];
};
Now if we want to chain buffers but differentiate separate buffers, we
need a "continues" flag, but we can probably overload bits somehow for
that (no 32 bit machine has 64k pages, and 64 bit machines have space
for a 32-bit flag). I ended up using a separate page of descriptors and
the ring simply referred to them, but I'm not really sure.
A second "used" ring for the receiver to say what's finished completes
the picture. So much so that we don't need an explicit "consumed" ring,
see code:
--- a/include/linux/lguest_launcher.h
+++ b/include/linux/lguest_launcher.h
@@ -90,6 +90,8 @@ struct lguest_device_desc {
#define LGUEST_DEVICE_T_CONSOLE 1
#define LGUEST_DEVICE_T_NET 2
#define LGUEST_DEVICE_T_BLOCK 3
+#define LGUEST_DEVICE_T_VIRTNET 8
+#define LGUEST_DEVICE_T_VIRTBLK 9
/* The specific features of this device: these depends on device type
* except for LGUEST_DEVICE_F_RANDOMNESS. */
@@ -124,4 +126,28 @@ enum lguest_req
LHREQ_IRQ, /* + irq */
LHREQ_BREAK, /* + on/off flag (on blocks until someone does off) */
};
+
+/* This marks a buffer as being the start (and active) */
+#define LGUEST_DESC_F_HEAD 1
+/* This marks a buffer as continuing via the next field. */
+#define LGUEST_DESC_F_NEXT 2
+/* This marks a buffer as write-only (otherwise read-only). */
+#define LGUEST_DESC_F_WRITE 4
+
+/* Virtio descriptors */
+struct lguest_desc
+{
+ unsigned long pfn;
+ unsigned long len;
+ u16 offset;
+ u16 flags;
+ /* We chain unused descriptors via this, too */
+ u32 next;
+};
+
+struct lguest_used
+{
+ unsigned int id;
+ unsigned int len;
+};
#endif /* _ASM_LGUEST_USER */
--- /dev/null
+++ b/drivers/lguest/lguest_virtio.c
+/* Descriptor-based virtio backend using lguest. */
+
+/* FIXME: Put "running" in shared page so other side really doesn't
+ * send us interrupts. Then we would never need to "fail" restart.
+ * If there are more buffers when we set "running", simply ping other
+ * side. It would interrupt us back again.
+ */
+#define DEBUG
+#include <linux/lguest.h>
+#include <linux/lguest_bus.h>
+#include <linux/virtio.h>
+#include <linux/interrupt.h>
+#include <asm/io.h>
+
+#define NUM_DESCS (PAGE_SIZE / sizeof(struct lguest_desc))
+
+#ifdef DEBUG
+/* For development, we want to crash whenever the other side is bad. */
+#define BAD_SIDE(lvq, fmt...) \
+ do { dev_err(&lvq->lg->dev, fmt); BUG(); } while(0)
+#define START_USE(lvq) \
+ do { if ((lvq)->in_use) panic("in_use = %i\n", (lvq)->in_use); (lvq)->in_use = __LINE__; mb(); } while(0)
+#define END_USE(lvq) \
+ do { BUG_ON(!(lvq)->in_use); (lvq)->in_use = 0; mb(); } while(0)
+#else
+#define BAD_SIDE(lvq, fmt...) \
+ do { dev_err(&lvq->lg->dev, fmt); (lvq)->broken = true; } while(0)
+#define START_USE(lvq)
+#define END_USE(lvq)
+#endif
+
+struct desc_pages
+{
+ /* Page of descriptors. */
+ struct lguest_desc desc[NUM_DESCS];
+
+ /* Next page: how we tell other side what buffers are available. */
+ unsigned int avail_idx;
+ unsigned int available[NUM_DESCS];
+ char pad[PAGE_SIZE - (NUM_DESCS+1) * sizeof(unsigned int)];
+
+ /* Third page: how other side tells us what's used. */
+ unsigned int used_idx;
+ struct lguest_used used[NUM_DESCS];
+};
+
+struct lguest_virtqueue
+{
+ struct virtqueue vq;
+
+ /* Actual memory layout for this queue */
+ struct desc_pages *d;
+
+ struct lguest_device *lg;
+
+ /* Other side has made a mess, don't try any more. */
+ bool broken;
+
+ /* Number of free buffers */
+ unsigned int num_free;
+ /* Head of free buffer list. */
+ unsigned int free_head;
+ /* Number we've added since last sync. */
+ unsigned int num_added;
+
+ /* Last used index we've seen. */
+ unsigned int last_used_idx;
+
+ /* Unless they told us to stop */
+ bool running;
+
+#ifdef DEBUG
+ /* They're supposed to lock for us. */
+ unsigned int in_use;
+#endif
+
+ /* Tokens for callbacks. */
+ void *data[NUM_DESCS];
+};
+
+static inline struct lguest_virtqueue *vq_to_lvq(struct virtqueue *vq)
+{
+ return container_of(vq, struct lguest_virtqueue, vq);
+}
+
+static int lguest_add_buf(struct virtqueue *vq,
+ struct scatterlist sg[],
+ unsigned int out_num,
+ unsigned int in_num,
+ void *data)
+{
+ struct lguest_virtqueue *lvq = vq_to_lvq(vq);
+ unsigned int i, head, uninitialized_var(prev);
+
+ BUG_ON(data == NULL);
+ BUG_ON(out_num + in_num > NUM_DESCS);
+ BUG_ON(out_num + in_num == 0);
+
+ START_USE(lvq);
+
+ if (lvq->num_free < out_num + in_num) {
+ pr_debug("Can't add buf len %i - avail = %i\n",
+ out_num + in_num, lvq->num_free);
+ END_USE(lvq);
+ return -ENOSPC;
+ }
+
+ /* We're about to use some buffers from the free list. */
+ lvq->num_free -= out_num + in_num;
+
+ head = lvq->free_head;
+ for (i = lvq->free_head; out_num; i=lvq->d->desc[i].next, out_num--) {
+ lvq->d->desc[i].flags = LGUEST_DESC_F_NEXT;
+ lvq->d->desc[i].pfn = page_to_pfn(sg[0].page);
+ lvq->d->desc[i].offset = sg[0].offset;
+ lvq->d->desc[i].len = sg[0].length;
+ prev = i;
+ sg++;
+ }
+ for (; in_num; i = lvq->d->desc[i].next, in_num--) {
+ lvq->d->desc[i].flags = LGUEST_DESC_F_NEXT|LGUEST_DESC_F_WRITE;
+ lvq->d->desc[i].pfn = page_to_pfn(sg[0].page);
+ lvq->d->desc[i].offset = sg[0].offset;
+ lvq->d->desc[i].len = sg[0].length;
+ prev = i;
+ sg++;
+ }
+ /* Last one doesn't continue. */
+ lvq->d->desc[prev].flags &= ~LGUEST_DESC_F_NEXT;
+
+ /* Update free pointer */
+ lvq->free_head = i;
+
+ lvq->data[head] = data;
+
+ /* Make sure head is only set after the descriptor has been written. */
+ wmb();
+ lvq->d->desc[head].flags |= LGUEST_DESC_F_HEAD;
+
+ /* Advertise it in available array. */
+ lvq->d->available[(lvq->d->avail_idx + lvq->num_added++) % NUM_DESCS]
+ = head;
+
+ pr_debug("Added buffer head %i to %p\n", head, lvq);
+ END_USE(lvq);
+ return 0;
+}
+
+static void lguest_sync(struct virtqueue *vq)
+{
+ struct lguest_virtqueue *lvq = vq_to_lvq(vq);
+
+ START_USE(lvq);
+ /* LGUEST_DESC_F_HEAD needs to be set before we say they're avail. */
+ wmb();
+
+ lvq->d->avail_idx += lvq->num_added;
+ lvq->num_added = 0;
+
+ /* Prod other side to tell it about changes. */
+ hcall(LHCALL_NOTIFY, lguest_devices[lvq->lg->index].pfn, 0, 0);
+ END_USE(lvq);
+}
+
+static void __detach_buf(struct lguest_virtqueue *lvq, unsigned int head)
+{
+ unsigned int i;
+
+ lvq->d->desc[head].flags &= ~LGUEST_DESC_F_HEAD;
+ /* Make sure other side has seen that it's detached. */
+ wmb();
+ /* Put back on free list: find end */
+ i = head;
+ while (lvq->d->desc[i].flags&LGUEST_DESC_F_NEXT) {
+ i = lvq->d->desc[i].next;
+ lvq->num_free++;
+ }
+
+ lvq->d->desc[i].next = lvq->free_head;
+ lvq->free_head = head;
+ /* Plus final descriptor */
+ lvq->num_free++;
+}
+
+static int lguest_detach_buf(struct virtqueue *vq, void *data)
+{
+ struct lguest_virtqueue *lvq = vq_to_lvq(vq);
+ unsigned int i;
+
+ for (i = 0; i < NUM_DESCS; i++) {
+ if (lvq->data[i] == data
+ && (lvq->d->desc[i].flags & LGUEST_DESC_F_HEAD)) {
+ __detach_buf(lvq, i);
+ return 0;
+ }
+ }
+ return -ENOENT;
+}
+
+static bool more_used(const struct lguest_virtqueue *lvq)
+{
+ return lvq->last_used_idx != lvq->d->used_idx;
+}
+
+static void *lguest_get_buf(struct virtqueue *vq, unsigned int *len)
+{
+ struct lguest_virtqueue *lvq = vq_to_lvq(vq);
+ unsigned int i;
+
+ START_USE(lvq);
+
+ if (!more_used(lvq)) {
+ END_USE(lvq);
+ return NULL;
+ }
+
+ /* Don't let them make us do infinite work. */
+ if (unlikely(lvq->d->used_idx > lvq->last_used_idx + NUM_DESCS)) {
+ BAD_SIDE(lvq, "Too many descriptors");
+ return NULL;
+ }
+
+ i = lvq->d->used[lvq->last_used_idx%NUM_DESCS].id;
+ *len = lvq->d->used[lvq->last_used_idx%NUM_DESCS].len;
+
+ if (unlikely(i >= NUM_DESCS)) {
+ BAD_SIDE(lvq, "id %u out of range\n", i);
+ return NULL;
+ }
+ if (unlikely(!(lvq->d->desc[i].flags & LGUEST_DESC_F_HEAD))) {
+ BAD_SIDE(lvq, "id %u is not a head!\n", i);
+ return NULL;
+ }
+
+ __detach_buf(lvq, i);
+ lvq->last_used_idx++;
+ BUG_ON(!lvq->data[i]);
+ END_USE(lvq);
+ return lvq->data[i];
+}
+
+static bool lguest_restart(struct virtqueue *vq)
+{
+ struct lguest_virtqueue *lvq = vq_to_lvq(vq);
+
+ START_USE(lvq);
+ BUG_ON(lvq->running);
+
+ if (likely(!more_used(lvq)) || unlikely(lvq->broken))
+ lvq->running = true;
+
+ END_USE(lvq);
+ return lvq->running;
+}
+
+static irqreturn_t lguest_virtqueue_interrupt(int irq, void *_lvq)
+{
+ struct lguest_virtqueue *lvq = _lvq;
+
+ pr_debug("virtqueue interrupt for %p\n", lvq);
+
+ if (unlikely(lvq->broken))
+ return IRQ_HANDLED;
+
+ if (lvq->running && more_used(lvq)) {
+ pr_debug("virtqueue callback for %p (%p)\n", lvq, lvq->vq.cb);
+ lvq->running = lvq->vq.cb(&lvq->vq);
+ } else
+ pr_debug("virtqueue %p no more used\n", lvq);
+
+ return IRQ_HANDLED;
+}
* Re: [PATCH 00/10] PV-IO v3
[not found] ` <1187652496.19435.141.camel-bi+AKbBUZKY6gyzm1THtWbp2dZbC/Bob@public.gmane.org>
@ 2007-08-21 7:33 ` Dor Laor
[not found] ` <64F9B87B6B770947A9F8391472E032160D464FEB-yEcIvxbTEBqsx+V+t5oei8rau4O3wl8o3fe8/T/H7NteoWH0uzbU5w@public.gmane.org>
0 siblings, 1 reply; 41+ messages in thread
From: Dor Laor @ 2007-08-21 7:33 UTC (permalink / raw)
To: Rusty Russell, Gregory Haskins
Cc: kvm-devel-5NWGOfrQmneRv+LV9MX5uipxlwaOVQ5f, virtualization
>> > Well, for cache reasons you should really try to avoid having both
>> > sides write to the same data. Hence two separate cache-aligned
>> > regions is better than one region and a flip bit.
>>
>> While I certainly can see what you mean about the cache implications
>> for a bit-flip design, I don't see how you can get away with not
>> having both sides write to the same memory in other designs either.
>> Wouldn't you still have to adjust descriptors from one ring to the
>> other? E.g. wouldn't both sides be writing descriptor pointer data in
>> this case, or am I missing something?
>
>Hi Gregory,
>
> You can have separate produced and consumed counters: see for
>example
>Van Jacobson's Netchannels presentation
>http://www.lemis.com/grog/Documentation/vj/lca06vj.pdf page 23.
>
> This single consumed count isn't sufficient if you can consume
>out-of-order: for that you really want a second "reply" ringbuffer
>indicating what buffers are consumed.
>
Rusty, your code works pretty nicely (I'll send some raw patches later
on today with kvm support for virtio), but I was wondering: why didn't
you use Xen's ring implementation? It has separate counters and also a
union for the request/response structure in the same descriptor.
Here is some of it + lxr link:
http://81.161.245.2/lxr/http/source/xen/include/public/io/ring.h?v=xen-3.1.0-src;a=m68k
#define DEFINE_RING_TYPES(__name, __req_t, __rsp_t)                     \
                                                                        \
/* Shared ring entry */                                                 \
union __name##_sring_entry {                                            \
    __req_t req;                                                        \
    __rsp_t rsp;                                                        \
};                                                                      \
                                                                        \
/* Shared ring page */                                                  \
struct __name##_sring {                                                 \
    RING_IDX req_prod, req_event;                                       \
    RING_IDX rsp_prod, rsp_event;                                       \
    uint8_t pad[48];                                                    \
    union __name##_sring_entry ring[1]; /* variable-length */           \
};                                                                      \
                                                                        \
/* "Front" end's private variables */                                   \
struct __name##_front_ring {                                            \
    RING_IDX req_prod_pvt;                                              \
    RING_IDX rsp_cons;                                                  \
    unsigned int nr_ents;                                               \
    struct __name##_sring *sring;                                       \
};                                                                      \
                                                                        \
/* "Back" end's private variables */                                    \
struct __name##_back_ring {                                             \
    RING_IDX rsp_prod_pvt;                                              \
    RING_IDX req_cons;                                                  \
    unsigned int nr_ents;                                               \
    struct __name##_sring *sring;                                       \
};                                                                      \
                                                                        \
/* Syntactic sugar */                                                   \
typedef struct __name##_sring __name##_sring_t;                         \
typedef struct __name##_front_ring __name##_front_ring_t;               \
typedef struct __name##_back_ring __name##_back_ring_t
* Re: [PATCH 00/10] PV-IO v3
[not found] ` <64F9B87B6B770947A9F8391472E032160D464FEB-yEcIvxbTEBqsx+V+t5oei8rau4O3wl8o3fe8/T/H7NteoWH0uzbU5w@public.gmane.org>
@ 2007-08-21 7:58 ` Rusty Russell
[not found] ` <1187683122.19435.171.camel-bi+AKbBUZKY6gyzm1THtWbp2dZbC/Bob@public.gmane.org>
0 siblings, 1 reply; 41+ messages in thread
From: Rusty Russell @ 2007-08-21 7:58 UTC (permalink / raw)
To: Dor Laor; +Cc: kvm-devel-5NWGOfrQmneRv+LV9MX5uipxlwaOVQ5f, virtualization
On Tue, 2007-08-21 at 00:33 -0700, Dor Laor wrote:
> >> > Well, for cache reasons you should really try to avoid having
> >> > both sides write to the same data. Hence two separate
> >> > cache-aligned regions is better than one region and a flip bit.
> >>
> >> While I certainly can see what you mean about the cache implications
> >> for a bit-flip design, I don't see how you can get away with not
> >> having both sides write to the same memory in other designs either.
> >> Wouldn't you still have to adjust descriptors from one ring to the
> >> other? E.g. wouldn't both sides be writing descriptor pointer data
> >> in this case, or am I missing something?
> >
> >Hi Gregory,
> >
> > You can have separate produced and consumed counters: see for
> >example Van Jacobson's Netchannels presentation
> >http://www.lemis.com/grog/Documentation/vj/lca06vj.pdf page 23.
> >
> > This single consumed count isn't sufficient if you can consume
> >out-of-order: for that you really want a second "reply" ringbuffer
> >indicating what buffers are consumed.
> >
>
> Rusty, your code works pretty nicely (I'll send some raw patches later
> on today with kvm support for virtio), but I was wondering: why didn't
> you use Xen's ring implementation? It has separate counters and also a
> union for the request/response structure in the same descriptor.
Partly the horror of the code, but mainly because it is an in-order
ring. You'll note that we use a reply ring, so we don't need to know
how much the other side has consumed (and it needn't do so in order).
Cheers,
Rusty.
* Re: [PATCH 00/10] PV-IO v3
[not found] ` <1187683122.19435.171.camel-bi+AKbBUZKY6gyzm1THtWbp2dZbC/Bob@public.gmane.org>
@ 2007-08-21 12:00 ` Gregory Haskins
[not found] ` <1187697638.4363.277.camel-5CR4LY5GPkvLDviKLk5550HKjMygAv58XqFh9Ls21Oc@public.gmane.org>
2007-08-21 12:29 ` Avi Kivity
1 sibling, 1 reply; 41+ messages in thread
From: Gregory Haskins @ 2007-08-21 12:00 UTC (permalink / raw)
To: Rusty Russell; +Cc: kvm-devel-5NWGOfrQmneRv+LV9MX5uipxlwaOVQ5f, virtualization
On Tue, 2007-08-21 at 17:58 +1000, Rusty Russell wrote:
> Partly the horror of the code, but mainly because it is an in-order
> ring. You'll note that we use a reply ring, so we don't need to know
> how much the other side has consumed (and it needn't do so in order).
>
I have certainly been known to take a similar stance when looking at Xen
code ;) (recall the lapic work I did). However, that said I am not yet
convinced that an out-of-order ring (at least as a fundamental
primitive) buys us much. I think the use of rings for the tx-path in and
of itself is questionable unless you can implement something like the bidir
NAPI that I demonstrated in ioqnet. Otherwise, you end up having to
hypercall on each update to the ring anyway and you might as well
hypercall directly w/o using a ring.
At a fundamental level, I think we simply need an efficient and in-order
(read: simple) ring to move data in, and a context associated hypercall
to get out. We can also use that simple ring to move data out if its
advantageous to do so (read: tx NAPI can be used). From there, we can
build more complex constructs from these primitives, like out-of-order
sg block-io.
OTOH, its possible that its redundant to have a simple low-level
infrastructure and then build a more complex ring for out-of-order
processing on top of it. I'm not sure. My gut feeling is that it will
probably result in a cleaner implementation: The higher-layered ring can
stop worrying about the interrupt/hypercall details (it would use the
simple ring as its transport)....and implementations that don't need
out-of-order (e.g. networks) don't have to deal with the associated
complexity.
What are your thoughts to this layering approach?
Regards,
-Greg
* Re: [PATCH 00/10] PV-IO v3
[not found] ` <1187697638.4363.277.camel-5CR4LY5GPkvLDviKLk5550HKjMygAv58XqFh9Ls21Oc@public.gmane.org>
@ 2007-08-21 12:25 ` Avi Kivity
[not found] ` <46CAD9CC.6050209-atKUWr5tajBWk0Htik3J/w@public.gmane.org>
2007-08-21 13:47 ` Rusty Russell
1 sibling, 1 reply; 41+ messages in thread
From: Avi Kivity @ 2007-08-21 12:25 UTC (permalink / raw)
To: Gregory Haskins
Cc: kvm-devel-5NWGOfrQmneRv+LV9MX5uipxlwaOVQ5f, virtualization
Gregory Haskins wrote:
> On Tue, 2007-08-21 at 17:58 +1000, Rusty Russell wrote:
>
>
>> Partly the horror of the code, but mainly because it is an in-order
>> ring. You'll note that we use a reply ring, so we don't need to know
>> how much the other side has consumed (and it needn't do so in order).
>>
>>
>
> I have certainly been known to take a similar stance when looking at Xen
> code ;) (recall the lapic work I did). However, that said I am not yet
> convinced that an out-of-order ring (at least as a fundamental
> primitive) buys us much.
It's pretty much required for block I/O into disk arrays.
Xen does out-of-order, btw, on its single ring, but at the cost of some
complexity. I don't believe it is worthwhile and prefer split
request/reply rings.
With my VJ T-shirt on, I can even say it's more efficient, as each side
of the ring will have a single writer and a single reader, reducing
ping-pong effects if the interrupt completions happens to land on the
wrong cpu.
> I think the use of rings for the tx-path in and of
> itself is questionable unless you can implement something like the bidir
> NAPI that I demonstrated in ioqnet. Otherwise, you end up having to
> hypercall on each update to the ring anyway and you might as well
> hypercall directly w/o using a ring.
>
Network tx can be out of order too (with some traffic destined to other
guests, some to the host, and some to external interfaces, completions
will be out of order).
--
error compiling committee.c: too many arguments to function
* Re: [PATCH 00/10] PV-IO v3
[not found] ` <1187683122.19435.171.camel-bi+AKbBUZKY6gyzm1THtWbp2dZbC/Bob@public.gmane.org>
2007-08-21 12:00 ` Gregory Haskins
@ 2007-08-21 12:29 ` Avi Kivity
1 sibling, 0 replies; 41+ messages in thread
From: Avi Kivity @ 2007-08-21 12:29 UTC (permalink / raw)
To: Rusty Russell; +Cc: kvm-devel-5NWGOfrQmneRv+LV9MX5uipxlwaOVQ5f, virtualization
Rusty Russell wrote:
> Partly the horror of the code, but mainly because it is an in-order
> ring. You'll note that we use a reply ring, so we don't need to know
> how much the other side has consumed (and it needn't do so in order).
>
Yes, it's quite nice: by using two in-order rings, you get out-of-order
completions. Simple _and_ efficient.
--
error compiling committee.c: too many arguments to function
* Re: [PATCH 00/10] PV-IO v3
[not found] ` <46CAD9CC.6050209-atKUWr5tajBWk0Htik3J/w@public.gmane.org>
@ 2007-08-21 13:11 ` Gregory Haskins
0 siblings, 0 replies; 41+ messages in thread
From: Gregory Haskins @ 2007-08-21 13:11 UTC (permalink / raw)
To: Avi Kivity; +Cc: kvm-devel-5NWGOfrQmneRv+LV9MX5uipxlwaOVQ5f, virtualization
On Tue, 2007-08-21 at 15:25 +0300, Avi Kivity wrote:
> Gregory Haskins wrote:
> > On Tue, 2007-08-21 at 17:58 +1000, Rusty Russell wrote:
> >
> >
> >> Partly the horror of the code, but mainly because it is an in-order
> >> ring. You'll note that we use a reply ring, so we don't need to know
> >> how much the other side has consumed (and it needn't do so in order).
> >>
> >>
> >
> > I have certainly been known to take a similar stance when looking at Xen
> > code ;) (recall the lapic work I did). However, that said I am not yet
> > convinced that an out-of-order ring (at least as a fundamental
> > primitive) buys us much.
>
> It's pretty much required for block I/O into disk arrays.
You are misunderstanding me. I totally agree that block io is
inherently out-of-order. What I am trying to convey is that at a
fundamental level *everything* (including block-io) can be viewed as an
ordered sequence of events.
For instance, consider that a block-io driver is making requests like
"perform read transaction X", and "perform write transaction Y".
Likewise, the host side can pass events like "completed transaction Y"
and "completed transaction X". At this level, everything is *always*
ordered, regardless of the fact that X and Y were temporally rearranged
by the host.
This is what the ioq/pvbus series is trying to address: These low-level
primitives for moving events in and out of the guest in a VMM agnostic
way. From there, you could apply higher level constructs such as an
out-of-order sg descriptor ring to represent your block-io data. The
low-level primitives simply become a way to convey changes to that
construct.
In a nutshell, IOQ provides a simple bi-directional ordered event
channel and a context associated hypercall mechanism (see
pvbus_device->call()) to accomplish these low-level chores.
I am also advocating caution on the tx path, as I think indirection
(e.g. queuing) as opposed to direct access (e.g. contextual hypercall)
has limited applicability. Trying to come up with a complex
"one-size-fits-all" queue for the tx path may not be worthwhile since in
the end there is still a 1:1 with queue-insert:hypercall. You might as
well just pass the descriptor directly via the contextual hypercall.
Where this ends up being a win is where you can do the bi-dir NAPI-like
tricks like IOQNET and have the queue-insert to hypercall ratio become >
1.
>
> Xen does out-of-order, btw, on its single ring, but at the cost of some
> complexity. I don't believe it is worthwhile and prefer split
> request/reply rings.
I am not against the split rings either. The article that Rusty
forwarded was very interesting indeed. But if I understood the article
and Rusty, there are kind of two aspects to it. A) Using two rings to
make a cache-thrash-friendly ordered ring, or B) adding out-of-order
capability to these two rings. I am certainly in favor of (A) for use
as the low-level event transport. I just question whether the
complexity of (B) is justified as the one and only queuing mechanism
when there are plenty of patterns that simply cannot take advantage of
it.
What I am wondering is if we should have a set of low-level primitives
that deal primarily with ordered event sequencing and VMM abstraction,
and a higher set of code expressed in terms of these primitives for
implementing the constructs such as (B) for block-io.
>
> With my VJ T-shirt on, I can even say it's more efficient, as each side
> of the ring will have a single writer and a single reader, reducing
> ping-pong effects if the interrupt completions happens to land on the
> wrong cpu.
Agreed.
>
> Network tx can be out of order too (with some traffic destined to other
> guests, some to the host, and some to external interfaces, completions
> will be out of order).
Well, not with respect to the 1:1 event delivery channel as I envision
it (unless I am misunderstanding you?)
Regards,
-Greg
* Re: [PATCH 00/10] PV-IO v3
[not found] ` <1187697638.4363.277.camel-5CR4LY5GPkvLDviKLk5550HKjMygAv58XqFh9Ls21Oc@public.gmane.org>
2007-08-21 12:25 ` Avi Kivity
@ 2007-08-21 13:47 ` Rusty Russell
[not found] ` <1187704038.19435.194.camel-bi+AKbBUZKY6gyzm1THtWbp2dZbC/Bob@public.gmane.org>
1 sibling, 1 reply; 41+ messages in thread
From: Rusty Russell @ 2007-08-21 13:47 UTC (permalink / raw)
To: Gregory Haskins
Cc: kvm-devel-5NWGOfrQmneRv+LV9MX5uipxlwaOVQ5f, virtualization
On Tue, 2007-08-21 at 08:00 -0400, Gregory Haskins wrote:
> On Tue, 2007-08-21 at 17:58 +1000, Rusty Russell wrote:
>
> > Partly the horror of the code, but mainly because it is an in-order
> > ring. You'll note that we use a reply ring, so we don't need to know
> > how much the other side has consumed (and it needn't do so in order).
> >
>
> I have certainly been known to take a similar stance when looking at Xen
> code ;) (recall the lapic work I did). However, that said I am not yet
> convinced that an out-of-order ring (at least as a fundamental
> primitive) buys us much.
Hi Gregory,
The main current use is disk drivers: they process out-of-order.
> I think the use of rings for the tx-path in and of
> itself is questionable unless you can implement something like the bidir
> NAPI that I demonstrated in ioqnet. Otherwise, you end up having to
> hypercall on each update to the ring anyway and you might as well
> hypercall directly w/o using a ring.
In the guest -> host direction, an interface like virtio is designed
for batching, with the explicit distinction between add_buf & sync. On
the receive side, you can have explicit interrupt suppression or
implicit mitigation caused by scheduling effects.
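The batching split described here can be sketched in a few lines of C.
This is a toy model, not the virtio API: add_buf(), kick(), and the
hypercall counter are illustrative names, and the "hypercall" is just a
counter standing in for a real guest exit.

```c
#include <assert.h>

/* Toy model of the add_buf/kick (sync) split: queue several buffers,
 * then notify the host once.  All names here are illustrative. */
#define QSZ 16

static int queue[QSZ];
static unsigned q_head;      /* published up to here */
static unsigned num_added;   /* queued but not yet published */
static unsigned hypercalls;  /* how many exits the batch cost */

static int add_buf(int buf)
{
	if (num_added == QSZ)
		return -1;	/* ring full */
	queue[(q_head + num_added++) % QSZ] = buf;
	return 0;
}

/* Publish the whole batch with a single notification. */
static unsigned kick(void)
{
	unsigned n = num_added;

	q_head += num_added;
	num_added = 0;
	hypercalls++;		/* one exit for the entire batch */
	return n;
}
```

Three add_buf() calls followed by one kick() cost a single exit, where
notifying the host on every enqueue would have cost three.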
> OTOH, it's possible that it's redundant to have a simple low-level
> infrastructure and then build a more complex ring for out-of-order
> processing on top of it. I'm not sure. My gut feeling is that it will
> probably result in a cleaner implementation: The higher-layered ring can
> stop worrying about the interrupt/hypercall details (it would use the
> simple ring as its transport)....and implementations that don't need
> out-of-order (e.g. networks) don't have to deal with the associated
> complexity.
But in fact as we can see, two rings need less from each ring than one
ring. One ring must have producer and consumer indices, so the producer
doesn't overrun the consumer. But if the second ring is used to feed
consumption, the consumer index isn't required any more: in fact, it's
just confusing to have.
I really think that a table of descriptors, a ring for produced
descriptors and a ring for used descriptors is the most cache-friendly,
bidir-non-trusting simple implementation possible. Of course, the
produced and used rings might be the same format, which allows code
sharing and if you squint a little, that's your "lowest level" simple
ringbuffer.
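The layout described above (a table of descriptors, a ring of produced
descriptors, a ring of used descriptors) can be sketched roughly as
below. This is a simplified single-writer-per-ring model, not the
actual lguest or virtio structures, and it consumes in-order for
brevity even though the design permits out-of-order completion.

```c
#include <assert.h>
#include <stdint.h>

/* Sketch of the table + avail ring + used ring layout.  Each ring has
 * exactly one writer: the producer writes avail_ring/avail_idx, the
 * consumer writes used_ring/used_idx.  No shared consumer index is
 * needed; the used ring itself reports which descriptors are free. */
#define NUM_DESCS 8

struct desc { uint64_t addr; uint32_t len; };

static struct desc table[NUM_DESCS];

static unsigned avail_ring[NUM_DESCS];
static unsigned avail_idx;	/* written by producer only */

static unsigned used_ring[NUM_DESCS];
static unsigned used_idx;	/* written by consumer only */

static unsigned last_avail;	/* consumer's private cursor */
static unsigned last_used;	/* producer's private cursor */

/* Producer: fill a descriptor and publish its index. */
static unsigned produce(unsigned d, uint64_t addr, uint32_t len)
{
	table[d].addr = addr;
	table[d].len = len;
	avail_ring[avail_idx++ % NUM_DESCS] = d;
	return avail_idx;
}

/* Consumer: take one available descriptor and hand it back as used
 * (in-order here; a real device could complete out of order). */
static int consume(void)
{
	unsigned d;

	if (last_avail == avail_idx)
		return -1;	/* nothing pending */
	d = avail_ring[last_avail++ % NUM_DESCS];
	used_ring[used_idx++ % NUM_DESCS] = d;
	return d;
}

/* Producer: reclaim a completed descriptor from the used ring. */
static int reclaim(void)
{
	if (last_used == used_idx)
		return -1;
	return used_ring[last_used++ % NUM_DESCS];
}
```

Note that neither side ever writes the other side's index, which is the
single-writer, single-reader property being argued for.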
Thanks for the discussion,
Rusty.
* Re: [PATCH 00/10] PV-IO v3
[not found] ` <1187704038.19435.194.camel-bi+AKbBUZKY6gyzm1THtWbp2dZbC/Bob@public.gmane.org>
@ 2007-08-21 14:06 ` Gregory Haskins
[not found] ` <1187705162.4363.323.camel-5CR4LY5GPkvLDviKLk5550HKjMygAv58XqFh9Ls21Oc@public.gmane.org>
0 siblings, 1 reply; 41+ messages in thread
From: Gregory Haskins @ 2007-08-21 14:06 UTC (permalink / raw)
To: Rusty Russell; +Cc: kvm-devel-5NWGOfrQmneRv+LV9MX5uipxlwaOVQ5f, virtualization
On Tue, 2007-08-21 at 23:47 +1000, Rusty Russell wrote:
> Hi Gregory,
>
> The main current use is disk drivers: they process out-of-order.
Maybe for you ;) I am working on the networking/IVMC side.
>
> > I think the use of rings for the tx-path in and of
> > itself is questionable unless you can implement something like the bidir
> > NAPI that I demonstrated in ioqnet. Otherwise, you end up having to
> > hypercall on each update to the ring anyway and you might as well
> > hypercall directly w/o using a ring.
>
> In the guest -> host direction, an interface like virtio is designed
> for batching, with the explicit distinction between add_buf & sync.
Right. IOQ has "iter_push()" and "signal()" as synonymous operations.
But note that batching via deferred synchronization does not inherently
require a shared queue. E.g. you could batch internally and then
hypercall at the "sync" point. However, batching via a queue is still
nice because at least you give the host side a chance to independently
"notice" the changes concurrently before the sync. But I digress...
> On
> the receive side, you can have explicit interrupt suppression or
> implicit mitigation caused by scheduling effects.
Agreed. This is precisely what the bidir NAPI stuff is doing and I
didn't mean to imply that virtio wasn't capable of it too. All I meant
is that if you *don't* take advantage of it, the guest->host path via a
queue is likely overkill. E.g. you might as well hypercall instead.
> But in fact as we can see, two rings need less from each ring than one
> ring. One ring must have producer and consumer indices, so the producer
> doesn't overrun the consumer. But if the second ring is used to feed
> consumption, the consumer index isn't required any more: in fact, it's
> just confusing to have.
Don't get me wrong. I am totally in favor of the two ring approach.
You have enlightened me on that front. :) I was under the impression
that making the two-ring approach support out-of-order added
significantly more complexity. Did I understand that wrong?
>
> I really think that a table of descriptors, a ring for produced
> descriptors and a ring for used descriptors is the most cache-friendly,
> bidir-non-trusting simple implementation possible. Of course, the
> produced and used rings might be the same format, which allows code
> sharing and if you squint a little, that's your "lowest level" simple
> ringbuffer.
Sounds reasonable to me.
>
> Thanks for the discussion,
Ditto!
-Greg
* Re: [PATCH 00/10] PV-IO v3
[not found] ` <1187705162.4363.323.camel-5CR4LY5GPkvLDviKLk5550HKjMygAv58XqFh9Ls21Oc@public.gmane.org>
@ 2007-08-21 16:47 ` Gregory Haskins
[not found] ` <1187714864.4363.358.camel-5CR4LY5GPkvLDviKLk5550HKjMygAv58XqFh9Ls21Oc@public.gmane.org>
0 siblings, 1 reply; 41+ messages in thread
From: Gregory Haskins @ 2007-08-21 16:47 UTC (permalink / raw)
To: Rusty Russell; +Cc: kvm-devel-5NWGOfrQmneRv+LV9MX5uipxlwaOVQ5f, virtualization
On Tue, 2007-08-21 at 10:06 -0400, Gregory Haskins wrote:
> On Tue, 2007-08-21 at 23:47 +1000, Rusty Russell wrote:
> >
> > In the guest -> host direction, an interface like virtio is designed
> > for batching, with the explicit distinction between add_buf & sync.
>
> Right. IOQ has "iter_push()" and "signal()" as synonymous operations.
Hi Rusty,
This reminded me of an area that I thought might have been missing in
virtio compared to IOQ. That is, flexibility in the io-completion via
the distinction between "signal" and "sync". sync() implies that it's a
blocking call based on the full drain of the queue, correct? The
ioq_signal() operation is purely a "kick". You can, of course, still
implement synchronous functions with a higher-layer construct such as
the ioq->wq. For example:
void send_sync(struct ioq *ioq, struct sk_buff *skb)
{
	DECLARE_WAITQUEUE(wait, current);
	struct ioq_iterator iter;

	ioq_iter_init(ioq, &iter, ioq_idxtype_inuse, IOQ_ITER_AUTOUPDATE);
	ioq_iter_seek(&iter, ioq_seek_head, 0, 0);

	/* Update the iter.desc->ptr with skb details */

	mb();
	iter.desc->valid = 1;
	iter.desc->sown = 1; /* give ownership to the south */
	mb();

	ioq_iter_push(&iter, 0);

	add_wait_queue(&ioq->wq, &wait);
	/* Wait until the south side hands ownership back to us */
	for (;;) {
		set_current_state(TASK_UNINTERRUPTIBLE);
		if (!iter.desc->sown)
			break;
		schedule();
	}
	set_current_state(TASK_RUNNING);
	remove_wait_queue(&ioq->wq, &wait);
}
But really the goal behind this design was to allow for fine-grained
selection of how io-completion is notified: e.g. callback-based
(interrupt-driven) deferred reclamation/reaping (see
ioqnet_tx_complete), sleeping-wait via ioq->wq, busy-wait, etc.
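As a rough illustration of that fine-grained selection: the submitter
could either attach a completion callback for interrupt-driven reaping,
or pass NULL and poll the ownership flag itself. Only the sown
(south-owned) flag is taken from the discussion here; every other name
in this sketch is invented for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical descriptor with a per-submission completion hook. */
struct xdesc {
	int sown;			   /* 1 = owned by the other side */
	void (*complete)(struct xdesc *);  /* NULL = caller will poll */
};

static int reclaimed;

static void reap(struct xdesc *d)
{
	reclaimed++;			   /* e.g. free the skb here */
}

static void submit(struct xdesc *d, void (*cb)(struct xdesc *))
{
	d->complete = cb;
	d->sown = 1;			   /* hand ownership south */
}

/* What the interrupt path would do when the other side finishes. */
static void on_completion(struct xdesc *d)
{
	d->sown = 0;			   /* ownership returns north */
	if (d->complete)
		d->complete(d);		   /* interrupt-driven reclaim */
	/* else: the submitter is busy/sleep-waiting on d->sown */
}

/* Exercise the callback path once. */
static int demo(void)
{
	struct xdesc d = { 0 };

	submit(&d, reap);
	on_completion(&d);
	return reclaimed;
}
```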
Is there a way to do something similar in virtio? (And forgive me if
there is; I still haven't seen the code.) And if not, and people like
that idea, what would be a good way to add it to the interface?
Regards,
-Greg
* Re: [PATCH 00/10] PV-IO v3
[not found] ` <1187714864.4363.358.camel-5CR4LY5GPkvLDviKLk5550HKjMygAv58XqFh9Ls21Oc@public.gmane.org>
@ 2007-08-21 17:12 ` Avi Kivity
[not found] ` <46CB1D06.1040005-atKUWr5tajBWk0Htik3J/w@public.gmane.org>
2007-08-22 3:29 ` Rusty Russell
1 sibling, 1 reply; 41+ messages in thread
From: Avi Kivity @ 2007-08-21 17:12 UTC (permalink / raw)
To: Gregory Haskins
Cc: kvm-devel-5NWGOfrQmneRv+LV9MX5uipxlwaOVQ5f, virtualization
Gregory Haskins wrote:
> On Tue, 2007-08-21 at 10:06 -0400, Gregory Haskins wrote:
>
>> On Tue, 2007-08-21 at 23:47 +1000, Rusty Russell wrote:
>>
>>> In the guest -> host direction, an interface like virtio is designed
>>> for batching, with the explicit distinction between add_buf & sync.
>>>
>> Right. IOQ has "iter_push()" and "signal()" as synonymous operations.
>>
>
> Hi Rusty,
> This reminded me of an area that I thought might have been missing in
> virtio compared to IOQ. That is, flexibility in the io-completion via
> the distinction between "signal" and "sync". sync() implies that it's a
> blocking call based on the full drain of the queue, correct?
No, sync() means "make the other side aware that there's work to be done".
--
Do not meddle in the internals of kernels, for they are subtle and quick to panic.
* Re: [PATCH 00/10] PV-IO v3
[not found] ` <46CB1D06.1040005-atKUWr5tajBWk0Htik3J/w@public.gmane.org>
@ 2007-08-21 17:17 ` Gregory Haskins
0 siblings, 0 replies; 41+ messages in thread
From: Gregory Haskins @ 2007-08-21 17:17 UTC (permalink / raw)
To: Avi Kivity; +Cc: kvm-devel-5NWGOfrQmneRv+LV9MX5uipxlwaOVQ5f, virtualization
On Tue, 2007-08-21 at 20:12 +0300, Avi Kivity wrote:
> No, sync() means "make the other side aware that there's work to be done".
>
Ok, but still the important thing isn't the kick per se, but the
resulting completion. Can we do interrupt-driven reclamation? Some
of those virtio_net emails I saw kicking around earlier today implied
buffers are reclaimed on the next xmit (e.g. polling), which violates the
netif rules for avoiding deadlock. I suppose that could have just been
an implementation decision... but I remember wondering how reaping would
work when virtio first came out.
Regards,
-Greg
* Re: [PATCH 00/10] PV-IO v3
[not found] ` <1187714864.4363.358.camel-5CR4LY5GPkvLDviKLk5550HKjMygAv58XqFh9Ls21Oc@public.gmane.org>
2007-08-21 17:12 ` Avi Kivity
@ 2007-08-22 3:29 ` Rusty Russell
[not found] ` <1187753365.6174.26.camel-bi+AKbBUZKY6gyzm1THtWbp2dZbC/Bob@public.gmane.org>
1 sibling, 1 reply; 41+ messages in thread
From: Rusty Russell @ 2007-08-22 3:29 UTC (permalink / raw)
To: Gregory Haskins
Cc: kvm-devel-5NWGOfrQmneRv+LV9MX5uipxlwaOVQ5f, virtualization
On Tue, 2007-08-21 at 12:47 -0400, Gregory Haskins wrote:
> On Tue, 2007-08-21 at 10:06 -0400, Gregory Haskins wrote:
> > On Tue, 2007-08-21 at 23:47 +1000, Rusty Russell wrote:
> > >
> > > In the guest -> host direction, an interface like virtio is designed
> > > for batching, with the explicit distinction between add_buf & sync.
> >
> > Right. IOQ has "iter_push()" and "signal()" as synonymous operations.
>
> Hi Rusty,
> This reminded me of an area that I thought might have been missing in
> virtio compared to IOQ. That is, flexibility in the io-completion via
> the distinction between "signal" and "sync". sync() implies that it's a
> blocking call based on the full drain of the queue, correct? The
> ioq_signal() operation is purely a "kick". You can, of course, still
> implement synchronous functions with a higher layer construct such as
> the ioq->wq.
Hi Gregory,
You raise a good point. We should rename "sync" to "kick". Clear
names are very important.
> Is there a way to do something similar in virtio? (and forgive me if
> there is..I still haven't seen the code). And if not and people like
> that idea, what would be a good way to add it to the interface?
I had two implementations: an efficient descriptor-based one and a dumb
dumb dumb 1-char copying-based one. I let the latter one rot; it was
sufficient for me to convince myself that it was possible to create an
implementation which uses such a transport.
(Nonetheless, it's kinda boring to maintain, so it wasn't updated for the
latest draft of the virtio API.)
Here's the lguest "efficient" implementation, which could still use some
love:
===
More efficient lguest implementation of virtio, using descriptors.
This allows zero-copy from guest <-> host. It uses a page of
descriptors, a page to say what descriptors to use, and a page to say
what's been used: one each set for inbufs and one for outbufs.
TODO:
1) More polishing
2) Get rid of old I/O
3) Inter-guest I/O implementation
Signed-off-by: Rusty Russell <rusty-8n+1lVoiYb80n/F98K4Iww@public.gmane.org>
---
Documentation/lguest/lguest.c | 412 +++++++++++++++++++++++++++++++++
drivers/lguest/Makefile | 2
drivers/lguest/hypercalls.c | 4
drivers/lguest/lguest_virtio.c | 476 +++++++++++++++++++++++++++++++++++++++
include/asm-i386/lguest_hcall.h | 3
include/linux/lguest_launcher.h | 26 ++
6 files changed, 914 insertions(+), 9 deletions(-)
===================================================================
--- a/Documentation/lguest/lguest.c
+++ b/Documentation/lguest/lguest.c
@@ -5,6 +5,8 @@
#define _LARGEFILE64_SOURCE
#define _GNU_SOURCE
#include <stdio.h>
+#include <sched.h>
+#include <assert.h>
#include <string.h>
#include <unistd.h>
#include <err.h>
@@ -43,6 +45,7 @@ typedef uint16_t u16;
typedef uint16_t u16;
typedef uint8_t u8;
#include "../../include/linux/lguest_launcher.h"
+#include "../../include/linux/virtio_blk.h"
#include "../../include/asm/e820.h"
/*:*/
@@ -55,6 +58,8 @@ typedef uint8_t u8;
/* We can have up to 256 pages for devices. */
#define DEVICE_PAGES 256
+#define descs_per_page() (getpagesize() / sizeof(struct lguest_desc))
+
/*L:120 verbose is both a global flag and a macro. The C preprocessor allows
* this, and although I wouldn't recommend it, it works quite nicely here. */
static bool verbose;
@@ -106,6 +111,8 @@ struct device
unsigned long watch_key;
u32 (*handle_output)(int fd, const struct iovec *iov,
unsigned int num, struct device *me);
+ /* Alternative to handle_output */
+ void (*handle_notify)(int fd, struct device *me);
/* Device-specific data. */
void *priv;
@@ -956,17 +963,21 @@ static void handle_output(int fd, unsign
struct iovec iov[LGUEST_MAX_DMA_SECTIONS];
unsigned num = 0;
- /* Convert the "struct lguest_dma" they're sending to a "struct
- * iovec". */
- lenp = dma2iov(dma, iov, &num);
-
/* Check each device: if they expect output to this key, tell them to
* handle it. */
for (i = devices->dev; i; i = i->next) {
- if (i->handle_output && key == i->watch_key) {
- /* We write the result straight into the used_len field
- * for them. */
+ if (key != i->watch_key)
+ continue;
+
+ if (i->handle_output) {
+ /* Convert the "struct lguest_dma" they're sending to a
+ * "struct iovec". */
+ lenp = dma2iov(dma, iov, &num);
*lenp = i->handle_output(fd, iov, num, i);
+ return;
+ } else if (i->handle_notify) {
+ /* virtio-style notify. */
+ i->handle_notify(fd, i);
return;
}
}
@@ -1079,6 +1090,7 @@ static struct device *new_device(struct
dev->handle_input = handle_input;
dev->watch_key = to_guest_phys(dev->mem) + watch_off;
dev->handle_output = handle_output;
+ dev->handle_notify = NULL;
return dev;
}
@@ -1354,7 +1366,383 @@ static void setup_tun_net(const char *ar
if (br_name)
verbose("attached to bridge: %s\n", br_name);
}
-/* That's the end of device setup. */
+/* That's the end of device setup. :*/
+
+struct virtqueue_info
+{
+ /* Their page of descriptors. */
+ struct lguest_desc *desc;
+ /* How they tell us what buffers are available. */
+ unsigned int *avail_idx;
+ unsigned int *available;
+ /* How we tell them what we've used. */
+ unsigned int *used_idx;
+ struct lguest_used *used;
+
+ /* Last available index we saw. */
+ unsigned int last_avail_idx;
+};
+
+static unsigned int irq_of(struct device *dev)
+{
+ /* Interrupt is index of device + 1 */
+ return ((unsigned long)dev->desc % getpagesize())
+ / sizeof(struct lguest_device_desc) + 1;
+}
+
+/* Descriptors consist of output then input descs. */
+static void gather_desc(struct lguest_desc *desc,
+ unsigned int i,
+ struct iovec iov[],
+ unsigned int *out_num, unsigned int *in_num)
+{
+ *out_num = *in_num = 0;
+
+ for (;;) {
+ iov[*out_num + *in_num].iov_len = desc[i].len;
+ iov[*out_num + *in_num].iov_base
+ = check_pointer(desc[i].pfn * getpagesize()
+ + desc[i].offset,
+ desc[i].len);
+ if (desc[i].flags & LGUEST_DESC_F_WRITE)
+ (*in_num)++;
+ else {
+ if (*in_num)
+ errx(1, "Descriptor has out after in");
+ (*out_num)++;
+ }
+ if (!(desc[i].flags & LGUEST_DESC_F_NEXT))
+ break;
+ if (*out_num + *in_num == descs_per_page())
+ errx(1, "Looped descriptor");
+ i = desc[i].next;
+ if (i >= descs_per_page())
+ errx(1, "Desc next is %u", i);
+ if (desc[i].flags & LGUEST_DESC_F_HEAD)
+ errx(1, "Descriptor has middle head at %i", i);
+ }
+}
+
+/* We've used a buffer, tell them about it. */
+static void add_used(struct virtqueue_info *vqi, unsigned int id, int len)
+{
+ struct lguest_used *used;
+
+ used = &vqi->used[(*vqi->used_idx)++ % descs_per_page()];
+ used->id = id;
+ used->len = len;
+}
+
+/* See if they have a buffer for us. */
+static unsigned int get_available(struct virtqueue_info *vqi)
+{
+ unsigned int num;
+
+ if (*vqi->avail_idx - vqi->last_avail_idx > descs_per_page())
+ errx(1, "Guest moved used index from %u to %u",
+ vqi->last_avail_idx, *vqi->avail_idx);
+
+ if (*vqi->avail_idx == vqi->last_avail_idx)
+ return descs_per_page();
+
+ num = vqi->available[vqi->last_avail_idx++ % descs_per_page()];
+ if (num >= descs_per_page())
+ errx(1, "Guest says index %u is available", num);
+ return num;
+}
+
+static void setup_virtqueue_info(struct virtqueue_info *vqi, void *mem)
+{
+ /* Descriptor page, available page, other side's used page */
+ vqi->desc = mem;
+ vqi->avail_idx = mem + getpagesize();
+ vqi->available = (void *)(vqi->avail_idx + 1);
+ vqi->used_idx = mem + getpagesize()*2;
+ vqi->used = (void *)(vqi->used_idx + 1);
+ vqi->last_avail_idx = 0;
+}
+
+struct virtnet_info
+{
+ struct virtqueue_info in, out;
+};
+
+static bool handle_virtnet_input(int fd, struct device *dev)
+{
+ int len;
+ unsigned out_num, in_num, desc;
+ struct virtnet_info *vni = dev->priv;
+ struct iovec iov[descs_per_page()];
+
+ /* Find any input descriptor head. */
+ desc = get_available(&vni->in);
+ if (desc == descs_per_page()) {
+ if (dev->desc->status & LGUEST_DEVICE_S_DRIVER_OK)
+ warnx("network: no dma buffer!");
+ discard_iovec(iov, &in_num);
+ } else {
+ gather_desc(vni->in.desc, desc, iov, &out_num, &in_num);
+ if (out_num != 0)
+ errx(1, "network: output in receive queue?");
+ }
+
+ len = readv(dev->fd, iov, in_num);
+ if (len <= 0)
+ err(1, "reading network");
+
+ if (desc != descs_per_page()) {
+ add_used(&vni->in, desc, len);
+ trigger_irq(fd, irq_of(dev));
+ }
+ verbose("virt input packet len %i [%02x %02x] (%s)\n", len,
+ ((u8 *)iov[0].iov_base)[0], ((u8 *)iov[0].iov_base)[1],
+ desc == descs_per_page() ? "discarded" : "sent");
+ return true;
+}
+
+static void handle_virtnet_notify(int fd, struct device *dev)
+{
+ unsigned desc, out_num, in_num;
+ int len;
+ struct virtnet_info *vni = dev->priv;
+ struct iovec iov[descs_per_page()];
+
+ /* Send all output descriptors. */
+ while ((desc = get_available(&vni->out)) < descs_per_page()) {
+ gather_desc(vni->out.desc, desc, iov, &out_num, &in_num);
+ if (in_num != 0)
+ errx(1, "network: recv descs in output queue?");
+ len = writev(dev->fd, iov, out_num);
+ add_used(&vni->out, desc, 0);
+ }
+ trigger_irq(fd, irq_of(dev));
+}
+
+static void setup_virtnet(const char *arg, struct device_list *devices)
+{
+ struct device *dev;
+ struct virtnet_info *vni;
+ struct ifreq ifr;
+ int netfd, ipfd;
+ unsigned char mac[6];
+ u32 ip;
+
+ netfd = open_or_die("/dev/net/tun", O_RDWR);
+ memset(&ifr, 0, sizeof(ifr));
+ ifr.ifr_flags = IFF_TAP | IFF_NO_PI;
+ strcpy(ifr.ifr_name, "tap%d");
+ if (ioctl(netfd, TUNSETIFF, &ifr) != 0)
+ err(1, "configuring /dev/net/tun");
+ ioctl(netfd, TUNSETNOCSUM, 1);
+
+ /* Three pages for in, three for out. */
+ dev = new_device(devices, LGUEST_DEVICE_T_VIRTNET, 6,
+ LGUEST_DEVICE_F_RANDOMNESS, netfd,
+ handle_virtnet_input, 0, NULL);
+ dev->handle_notify = handle_virtnet_notify;
+ dev->priv = vni = malloc(sizeof(*vni));
+
+ setup_virtqueue_info(&vni->in, dev->mem);
+ setup_virtqueue_info(&vni->out, dev->mem + 3 * getpagesize());
+
+ ipfd = socket(PF_INET, SOCK_DGRAM, IPPROTO_IP);
+ if (ipfd < 0)
+ err(1, "opening IP socket");
+
+ ip = str2ip(arg);
+
+ configure_device(ipfd, ifr.ifr_name, ip, mac);
+
+ close(ipfd);
+
+ verbose("device %p: virt net %u.%u.%u.%u\n",
+ (void *)(dev->desc->pfn * getpagesize()),
+ (u8)(ip>>24), (u8)(ip>>16), (u8)(ip>>8), (u8)ip);
+}
+
+static unsigned long iovec_len(const struct iovec iov[], unsigned int num)
+{
+ unsigned int i;
+ unsigned long len = 0;
+
+ for (i = 0; i < num; i++) {
+ if (len + iov[i].iov_len < len)
+ errx(1, "iovec length wrap");
+ len += iov[i].iov_len;
+ }
+ return len;
+}
+
+struct vblk_info
+{
+ struct virtqueue_info vqi;
+ const char *blkname;
+ off64_t len;
+ u16 last_tag;
+ unsigned int in_progress;
+ int finished_fd;
+ int workpipe[2];
+};
+
+static void do_vblk_seek(int blkfd, off64_t maxlen, u64 sector, unsigned len)
+{
+ if (sector * 512 > maxlen || sector * 512 + len > maxlen)
+ errx(1, "Bad length %u at offset %llu", len, sector * 512);
+
+ if (lseek64(blkfd, sector * 512, SEEK_SET) != sector * 512)
+ err(1, "Bad seek to sector %llu", sector);
+}
+
+static unsigned service_io(struct vblk_info *vblk, int blkfd, unsigned desc)
+{
+ unsigned int wlen, out_num, in_num;
+ int len, ret;
+ struct virtio_blk_inhdr *in;
+ struct virtio_blk_outhdr *out;
+ struct iovec iov[descs_per_page()];
+
+ gather_desc(vblk->vqi.desc, desc, iov, &out_num, &in_num);
+ if (out_num == 0 || in_num == 0)
+ errx(1, "Bad virtblk cmd %u out=%u in=%u",
+ desc, out_num, in_num);
+
+ if (iov[0].iov_len != sizeof(*out))
+ errx(1, "Bad virtblk cmd len %i", iov[0].iov_len);
+ out = iov[0].iov_base;
+
+ if (iov[out_num+in_num-1].iov_len != sizeof(*in))
+ errx(1, "Bad virtblk input len %i for %u",
+ iov[out_num+in_num-1].iov_len, desc);
+ in = iov[out_num+in_num-1].iov_base;
+
+ if (out->type & VIRTIO_BLK_T_SCSI_CMD) {
+ fprintf(stderr, "Scsi commands unsupported\n");
+ in->status = VIRTIO_BLK_S_UNSUPP;
+ wlen = sizeof(in);
+ } else if (out->type & VIRTIO_BLK_T_OUT) {
+ /* Write */
+ len = iovec_len(iov+1, out_num-1);
+ do_vblk_seek(blkfd, vblk->len, out->sector, len);
+
+ verbose("WRITE %u to sector %llu\n", len, out->sector);
+ ret = writev(blkfd, iov+1, out_num-1);
+ in->status = (ret==len ? VIRTIO_BLK_S_OK : VIRTIO_BLK_S_IOERR);
+ wlen = sizeof(in);
+ } else {
+ /* Read */
+ len = iovec_len(iov+1, in_num-1);
+ do_vblk_seek(blkfd, vblk->len, out->sector, len);
+
+ verbose("READ %u to sector %llu\n", len, out->sector);
+ ret = readv(blkfd, iov+1, in_num-1);
+ in->status = (ret==len ? VIRTIO_BLK_S_OK : VIRTIO_BLK_S_IOERR);
+ wlen = sizeof(in) + len;
+ }
+
+ return wlen;
+}
+
+static struct virtio_blk_outhdr *get_outhdr(struct lguest_desc *desc,
+ unsigned int i)
+{
+ return check_pointer(desc[i].pfn * getpagesize() + desc[i].offset,
+ sizeof(struct virtio_blk_outhdr));
+}
+
+static bool handle_io_finish(int fd, struct device *dev)
+{
+ unsigned int nums[2];
+ struct vblk_info *vblk = dev->priv;
+
+ /* Find out what finished. */
+ if (read(dev->fd, nums, sizeof(nums)) != sizeof(nums))
+ err(1, "Short read from threads");
+
+ add_used(&vblk->vqi, nums[0], nums[1]);
+ trigger_irq(fd, irq_of(dev));
+ vblk->in_progress--;
+ return true;
+}
+
+static void handle_virtblk_notify(int fd, struct device *dev)
+{
+ unsigned desc;
+ struct vblk_info *vblk = dev->priv;
+
+ /* Send all output descriptors to threads to service. */
+ while ((desc = get_available(&vblk->vqi)) < descs_per_page()) {
+ struct virtio_blk_outhdr *outhdr;
+
+ outhdr = get_outhdr(vblk->vqi.desc, desc);
+ if (outhdr->type & VIRTIO_BLK_T_BARRIER) {
+ /* This sucks, goes sync to flush. */
+ while (vblk->in_progress)
+ handle_io_finish(fd, dev);
+ fdatasync(fd);
+ }
+ write(vblk->workpipe[1], &desc, sizeof(desc));
+ vblk->in_progress++;
+ }
+}
+
+static int io_thread(void *_dev)
+{
+ struct device *dev = _dev;
+ struct vblk_info *vblk = dev->priv;
+ unsigned num[2];
+ int fd;
+
+ fd = open_or_die(vblk->blkname, O_RDWR|O_LARGEFILE|O_DIRECT);
+
+ /* Close other side of workpipe so we get 0 read when main dies. */
+ close(vblk->workpipe[1]);
+ close(dev->fd);
+ close(STDIN_FILENO);
+ while (read(vblk->workpipe[0], &num[0], sizeof(num[0]))
+ == sizeof(num[0])) {
+ num[1] = service_io(vblk, fd, num[0]);
+ if (write(vblk->finished_fd, num, sizeof(num)) != sizeof(num))
+ err(1, "Bad finish write");
+ }
+ return 0;
+}
+
+static void setup_virtblk(const char *filename, struct device_list *devices)
+{
+ int fd, p[2];
+ struct device *dev;
+ struct vblk_info *vblk;
+ unsigned int i;
+
+ fd = open_or_die(filename, O_RDWR|O_LARGEFILE);
+ pipe(p);
+ dev = new_device(devices, LGUEST_DEVICE_T_VIRTBLK, 6,
+ LGUEST_DEVICE_F_RANDOMNESS,
+ p[0], handle_io_finish, 0, NULL);
+ dev->handle_notify = handle_virtblk_notify;
+ vblk = dev->priv = malloc(sizeof(*vblk));
+
+ setup_virtqueue_info(&vblk->vqi, dev->mem);
+
+ vblk->blkname = filename;
+ vblk->len = lseek64(fd, 0, SEEK_END);
+ close(fd);
+ vblk->finished_fd = p[1];
+ vblk->last_tag = 0;
+ vblk->in_progress = 0;
+ pipe(vblk->workpipe);
+
+ for (i = 0; i < 4; i++) {
+ void *stack = malloc(32768);
+ if (clone(io_thread, stack + 32768, CLONE_VM, dev) == -1)
+ err(1, "Creating clone");
+ }
+
+ *(unsigned long *)dev->mem = vblk->len/512;
+ verbose("device %p: virtblock %lu sectors\n",
+ (void *)(dev->desc->pfn * getpagesize()),
+ *(unsigned long *)dev->mem);
+}
/*L:220 Finally we reach the core of the Launcher, which runs the Guest, serves
* its input and output, and finally, lays it to rest. */
@@ -1406,6 +1794,8 @@ static struct option opts[] = {
{ "sharenet", 1, NULL, 's' },
{ "tunnet", 1, NULL, 't' },
{ "block", 1, NULL, 'b' },
+ { "virtnet", 1, NULL, 'V' },
+ { "virtblock", 1, NULL, 'B' },
{ "initrd", 1, NULL, 'i' },
{ NULL },
};
@@ -1477,6 +1867,12 @@ int main(int argc, char *argv[])
case 'b':
setup_block_file(optarg, &device_list);
break;
+ case 'V':
+ setup_virtnet(optarg, &device_list);
+ break;
+ case 'B':
+ setup_virtblk(optarg, &device_list);
+ break;
case 'i':
initrd_name = optarg;
break;
===================================================================
--- a/drivers/lguest/Makefile
+++ b/drivers/lguest/Makefile
@@ -1,5 +1,5 @@
# Guest requires the arch-specific paravirt code, the bus driver and dma code.
-obj-$(CONFIG_LGUEST_GUEST) += lguest_bus.o lguest_dma.o
+obj-$(CONFIG_LGUEST_GUEST) += lguest_bus.o lguest_dma.o lguest_virtio.o
# Host requires the other files, which can be a module.
obj-$(CONFIG_LGUEST) += lg.o
===================================================================
--- a/drivers/lguest/hypercalls.c
+++ b/drivers/lguest/hypercalls.c
@@ -112,6 +112,10 @@ static void do_hcall(struct lguest *lg,
case LHCALL_HALT:
/* Similarly, this sets the halted flag for run_guest(). */
lg->halted = 1;
+ break;
+ case LHCALL_NOTIFY:
+ lg->pending_key = regs->edx << PAGE_SHIFT;
+ lg->dma_is_pending = 1;
break;
default:
kill_guest(lg, "Bad hypercall %li\n", regs->eax);
===================================================================
--- /dev/null
+++ b/drivers/lguest/lguest_virtio.c
@@ -0,0 +1,476 @@
+/* Descriptor-based virtio backend using lguest. */
+
+/* FIXME: Put "running" in shared page so other side really doesn't
+ * send us interrupts. Then we would never need to "fail" restart.
+ * If there are more buffers when we set "running", simply ping other
+ * side. It would interrupt us back again.
+ */
+#define DEBUG
+#include <linux/lguest.h>
+#include <linux/lguest_bus.h>
+#include <linux/virtio.h>
+#include <linux/interrupt.h>
+#include <asm/io.h>
+
+#define NUM_DESCS (PAGE_SIZE / sizeof(struct lguest_desc))
+
+#ifdef DEBUG
+/* For development, we want to crash whenever the other side is bad. */
+#define BAD_SIDE(lvq, fmt...) \
+ do { dev_err(&lvq->lg->dev, fmt); BUG(); } while(0)
+#define START_USE(lvq) \
+ do { if ((lvq)->in_use) panic("in_use = %i\n", (lvq)->in_use); (lvq)->in_use = __LINE__; mb(); } while(0)
+#define END_USE(lvq) \
+ do { BUG_ON(!(lvq)->in_use); (lvq)->in_use = 0; mb(); } while(0)
+#else
+#define BAD_SIDE(lvq, fmt...) \
+ do { dev_err(&lvq->lg->dev, fmt); (lvq)->broken = true; } while(0)
+#define START_USE(lvq)
+#define END_USE(lvq)
+#endif
+
+struct desc_pages
+{
+ /* Page of descriptors. */
+ struct lguest_desc desc[NUM_DESCS];
+
+ /* Next page: how we tell other side what buffers are available. */
+ unsigned int avail_idx;
+ unsigned int available[NUM_DESCS];
+ char pad[PAGE_SIZE - (NUM_DESCS+1) * sizeof(unsigned int)];
+
+ /* Third page: how other side tells us what's used. */
+ unsigned int used_idx;
+ struct lguest_used used[NUM_DESCS];
+};
+
+struct lguest_virtqueue
+{
+ struct virtqueue vq;
+
+ /* Actual memory layout for this queue */
+ struct desc_pages *d;
+
+ struct lguest_device *lg;
+
+ /* Other side has made a mess, don't try any more. */
+ bool broken;
+
+ /* Number of free buffers */
+ unsigned int num_free;
+ /* Head of free buffer list. */
+ unsigned int free_head;
+ /* Number we've added since last sync. */
+ unsigned int num_added;
+
+ /* Last used index we've seen. */
+ unsigned int last_used_idx;
+
+ /* Unless they told us to stop */
+ bool running;
+
+#ifdef DEBUG
+ /* They're supposed to lock for us. */
+ unsigned int in_use;
+#endif
+
+ /* Tokens for callbacks. */
+ void *data[NUM_DESCS];
+};
+
+static inline struct lguest_virtqueue *vq_to_lvq(struct virtqueue *vq)
+{
+ return container_of(vq, struct lguest_virtqueue, vq);
+}
+
+static int lguest_add_buf(struct virtqueue *vq,
+ struct scatterlist sg[],
+ unsigned int out_num,
+ unsigned int in_num,
+ void *data)
+{
+ struct lguest_virtqueue *lvq = vq_to_lvq(vq);
+ unsigned int i, head, uninitialized_var(prev);
+
+ BUG_ON(data == NULL);
+ BUG_ON(out_num + in_num > NUM_DESCS);
+ BUG_ON(out_num + in_num == 0);
+
+ START_USE(lvq);
+
+ if (lvq->num_free < out_num + in_num) {
+ pr_debug("Can't add buf len %i - avail = %i\n",
+ out_num + in_num, lvq->num_free);
+ END_USE(lvq);
+ return -ENOSPC;
+ }
+
+ /* We're about to use some buffers from the free list. */
+ lvq->num_free -= out_num + in_num;
+
+ head = lvq->free_head;
+ for (i = lvq->free_head; out_num; i=lvq->d->desc[i].next, out_num--) {
+ lvq->d->desc[i].flags = LGUEST_DESC_F_NEXT;
+ lvq->d->desc[i].pfn = page_to_pfn(sg[0].page);
+ lvq->d->desc[i].offset = sg[0].offset;
+ lvq->d->desc[i].len = sg[0].length;
+ prev = i;
+ sg++;
+ }
+ for (; in_num; i = lvq->d->desc[i].next, in_num--) {
+ lvq->d->desc[i].flags = LGUEST_DESC_F_NEXT|LGUEST_DESC_F_WRITE;
+ lvq->d->desc[i].pfn = page_to_pfn(sg[0].page);
+ lvq->d->desc[i].offset = sg[0].offset;
+ lvq->d->desc[i].len = sg[0].length;
+ prev = i;
+ sg++;
+ }
+ /* Last one doesn't continue. */
+ lvq->d->desc[prev].flags &= ~LGUEST_DESC_F_NEXT;
+
+ /* Update free pointer */
+ lvq->free_head = i;
+
+ lvq->data[head] = data;
+
+ /* Make sure head is only set after the descriptor has been written. */
+ wmb();
+ lvq->d->desc[head].flags |= LGUEST_DESC_F_HEAD;
+
+ /* Advertise it in available array. */
+ lvq->d->available[(lvq->d->avail_idx + lvq->num_added++) % NUM_DESCS]
+ = head;
+
+ pr_debug("Added buffer head %i to %p\n", head, lvq);
+ END_USE(lvq);
+ return 0;
+}
+
+static void lguest_sync(struct virtqueue *vq)
+{
+ struct lguest_virtqueue *lvq = vq_to_lvq(vq);
+
+ START_USE(lvq);
+ /* LGUEST_DESC_F_HEAD needs to be set before we say they're avail. */
+ wmb();
+
+ lvq->d->avail_idx += lvq->num_added;
+ lvq->num_added = 0;
+
+ /* Prod other side to tell it about changes. */
+ hcall(LHCALL_NOTIFY, lguest_devices[lvq->lg->index].pfn, 0, 0);
+ END_USE(lvq);
+}
+
+static void __detach_buf(struct lguest_virtqueue *lvq, unsigned int head)
+{
+ unsigned int i;
+
+ lvq->d->desc[head].flags &= ~LGUEST_DESC_F_HEAD;
+ /* Make sure other side has seen that it's detached. */
+ wmb();
+ /* Put back on free list: find end */
+ i = head;
+ while (lvq->d->desc[i].flags&LGUEST_DESC_F_NEXT) {
+ i = lvq->d->desc[i].next;
+ lvq->num_free++;
+ }
+
+ lvq->d->desc[i].next = lvq->free_head;
+ lvq->free_head = head;
+ /* Plus final descriptor */
+ lvq->num_free++;
+}
+
+static int lguest_detach_buf(struct virtqueue *vq, void *data)
+{
+ struct lguest_virtqueue *lvq = vq_to_lvq(vq);
+ unsigned int i;
+
+ for (i = 0; i < NUM_DESCS; i++) {
+ if (lvq->data[i] == data
+ && (lvq->d->desc[i].flags & LGUEST_DESC_F_HEAD)) {
+ __detach_buf(lvq, i);
+ return 0;
+ }
+ }
+ return -ENOENT;
+}
+
+static bool more_used(const struct lguest_virtqueue *lvq)
+{
+ return lvq->last_used_idx != lvq->d->used_idx;
+}
+
+static void *lguest_get_buf(struct virtqueue *vq, unsigned int *len)
+{
+ struct lguest_virtqueue *lvq = vq_to_lvq(vq);
+ unsigned int i;
+
+ START_USE(lvq);
+
+ if (!more_used(lvq)) {
+ END_USE(lvq);
+ return NULL;
+ }
+
+ /* Don't let them make us do infinite work. */
+ if (unlikely(lvq->d->used_idx > lvq->last_used_idx + NUM_DESCS)) {
+ BAD_SIDE(lvq, "Too many descriptors");
+ return NULL;
+ }
+
+ i = lvq->d->used[lvq->last_used_idx%NUM_DESCS].id;
+ *len = lvq->d->used[lvq->last_used_idx%NUM_DESCS].len;
+
+ if (unlikely(i >= NUM_DESCS)) {
+ BAD_SIDE(lvq, "id %u out of range\n", i);
+ return NULL;
+ }
+ if (unlikely(!(lvq->d->desc[i].flags & LGUEST_DESC_F_HEAD))) {
+ BAD_SIDE(lvq, "id %u is not a head!\n", i);
+ return NULL;
+ }
+
+ __detach_buf(lvq, i);
+ lvq->last_used_idx++;
+ BUG_ON(!lvq->data[i]);
+ END_USE(lvq);
+ return lvq->data[i];
+}
+
+static bool lguest_restart(struct virtqueue *vq)
+{
+ struct lguest_virtqueue *lvq = vq_to_lvq(vq);
+
+ START_USE(lvq);
+ BUG_ON(lvq->running);
+
+ if (likely(!more_used(lvq)) || unlikely(lvq->broken))
+ lvq->running = true;
+
+ END_USE(lvq);
+ return lvq->running;
+}
+
+static irqreturn_t lguest_virtqueue_interrupt(int irq, void *_lvq)
+{
+ struct lguest_virtqueue *lvq = _lvq;
+
+ pr_debug("virtqueue interrupt for %p\n", lvq);
+
+ if (unlikely(lvq->broken))
+ return IRQ_HANDLED;
+
+ if (lvq->running && more_used(lvq)) {
+ pr_debug("virtqueue callback for %p (%p)\n", lvq, lvq->vq.cb);
+ lvq->running = lvq->vq.cb(&lvq->vq);
+ } else
+ pr_debug("virtqueue %p no more used\n", lvq);
+
+ return IRQ_HANDLED;
+}
+
+struct lguest_virtqueue_pair
+{
+ struct lguest_virtqueue *in, *out;
+};
+
+static irqreturn_t lguest_virtqueue_pair_interrupt(int irq, void *_lvqp)
+{
+ struct lguest_virtqueue_pair *lvqp = _lvqp;
+
+ lguest_virtqueue_interrupt(irq, lvqp->in);
+ lguest_virtqueue_interrupt(irq, lvqp->out);
+
+ return IRQ_HANDLED;
+}
+
+static struct virtqueue_ops lguest_virtqueue_ops = {
+ .add_buf = lguest_add_buf,
+ .get_buf = lguest_get_buf,
+ .sync = lguest_sync,
+ .detach_buf = lguest_detach_buf,
+ .restart = lguest_restart,
+};
+
+static struct lguest_virtqueue *lg_new_virtqueue(struct lguest_device *lgdev,
+ unsigned long pfn)
+{
+ struct lguest_virtqueue *lvq;
+ unsigned int i;
+
+ lvq = kmalloc(sizeof(*lvq), GFP_KERNEL);
+ if (!lvq)
+ return NULL;
+
+ /* Queue takes three pages */
+ lvq->d = lguest_map(pfn << PAGE_SHIFT, 3);
+ if (!lvq->d)
+ goto free_lvq;
+
+ lvq->lg = lgdev;
+ lvq->broken = false;
+ lvq->last_used_idx = 0;
+ lvq->num_added = 0;
+ lvq->running = true;
+#ifdef DEBUG
+ lvq->in_use = false;
+#endif
+
+ /* Put everything in free lists. */
+ lvq->num_free = NUM_DESCS;
+ lvq->free_head = 0;
+ for (i = 0; i < NUM_DESCS-1; i++)
+ lvq->d->desc[i].next = i+1;
+
+ lvq->vq.ops = &lguest_virtqueue_ops;
+ return lvq;
+
+free_lvq:
+ kfree(lvq);
+ return NULL;
+}
+
+static void lg_destroy_virtqueue(struct lguest_virtqueue *lvq)
+{
+ lguest_unmap(lvq->d);
+ kfree(lvq);
+}
+
+/* Example network driver code. */
+#include <linux/virtio_net.h>
+#include <linux/etherdevice.h>
+
+static int lguest_virtnet_probe(struct lguest_device *lgdev)
+{
+ struct net_device *dev;
+ u8 mac[ETH_ALEN];
+ int err, irqf;
+ struct lguest_virtqueue_pair *pair;
+
+ pair = kmalloc(sizeof(*pair), GFP_KERNEL);
+ if (!pair) {
+ err = -ENOMEM;
+ goto fail;
+ }
+
+ pair->in = lg_new_virtqueue(lgdev, lguest_devices[lgdev->index].pfn);
+ if (!pair->in) {
+ err = -ENOMEM;
+ goto free_pair;
+ }
+ pair->out = lg_new_virtqueue(lgdev, lguest_devices[lgdev->index].pfn + 3);
+ if (!pair->out) {
+ err = -ENOMEM;
+ goto free_pair_in;
+ }
+
+ random_ether_addr(mac);
+ dev = virtnet_probe(&pair->in->vq, &pair->out->vq, &lgdev->dev, mac);
+ if (IS_ERR(dev)) {
+ err = PTR_ERR(dev);
+ goto free_pair_out;
+ }
+
+ if (lguest_devices[lgdev->index].features&LGUEST_DEVICE_F_RANDOMNESS)
+ irqf = IRQF_SAMPLE_RANDOM;
+ else
+ irqf = 0;
+
+ err = request_irq(lgdev_irq(lgdev),
+ lguest_virtqueue_pair_interrupt, irqf, dev->name,
+ pair);
+
+ if (err)
+ goto unprobe;
+
+ lgdev->private = pair;
+ return 0;
+
+unprobe:
+ virtnet_remove(dev);
+free_pair_out:
+ lg_destroy_virtqueue(pair->out);
+free_pair_in:
+ lg_destroy_virtqueue(pair->in);
+free_pair:
+ kfree(pair);
+fail:
+ return err;
+}
+
+static struct lguest_driver lguest_virtnet_drv = {
+ .name = "lguestvirtnet",
+ .owner = THIS_MODULE,
+ .device_type = LGUEST_DEVICE_T_VIRTNET,
+ .probe = lguest_virtnet_probe,
+};
+
+static __init int lguest_virtnet_init(void)
+{
+ return register_lguest_driver(&lguest_virtnet_drv);
+}
+device_initcall(lguest_virtnet_init);
+
+/* Example block driver code. */
+#include <linux/virtio_blk.h>
+#include <linux/genhd.h>
+#include <linux/blkdev.h>
+static int lguest_virtblk_probe(struct lguest_device *lgdev)
+{
+ struct lguest_virtqueue *lvq;
+ struct gendisk *disk;
+ unsigned long sectors;
+ int err, irqf;
+
+ lvq = lg_new_virtqueue(lgdev, lguest_devices[lgdev->index].pfn);
+ if (!lvq)
+ return -ENOMEM;
+
+ /* Page is initially used to pass capacity. */
+ sectors = *(unsigned long *)lvq->d;
+ *(unsigned long *)lvq->d = 0;
+
+ lgdev->private = disk = virtblk_probe(&lvq->vq);
+ if (IS_ERR(disk)) {
+ err = PTR_ERR(disk);
+ goto destroy;
+ }
+ set_capacity(disk, sectors);
+ blk_queue_max_hw_segments(disk->queue, NUM_DESCS-1);
+
+ if (lguest_devices[lgdev->index].features&LGUEST_DEVICE_F_RANDOMNESS)
+ irqf = IRQF_SAMPLE_RANDOM;
+ else
+ irqf = 0;
+
+ err = request_irq(lgdev_irq(lgdev), lguest_virtqueue_interrupt, irqf,
+ disk->disk_name, lvq);
+ if (err)
+ goto unprobe;
+
+ add_disk(disk);
+ return 0;
+
+unprobe:
+ virtblk_remove(disk);
+destroy:
+ lg_destroy_virtqueue(lvq);
+ return err;
+}
+
+static struct lguest_driver lguest_virtblk_drv = {
+ .name = "lguestvirtblk",
+ .owner = THIS_MODULE,
+ .device_type = LGUEST_DEVICE_T_VIRTBLK,
+ .probe = lguest_virtblk_probe,
+};
+
+static __init int lguest_virtblk_init(void)
+{
+ return register_lguest_driver(&lguest_virtblk_drv);
+}
+device_initcall(lguest_virtblk_init);
+
+MODULE_LICENSE("GPL");
===================================================================
--- a/include/asm-i386/lguest_hcall.h
+++ b/include/asm-i386/lguest_hcall.h
@@ -18,6 +18,9 @@
#define LHCALL_SET_PTE 14
#define LHCALL_SET_PMD 15
#define LHCALL_LOAD_TLS 16
+
+/* Experimental hcalls for new I/O */
+#define LHCALL_NOTIFY 100 /* pfn */
/*G:031 First, how does our Guest contact the Host to ask for privileged
* operations? There are two ways: the direct way is to make a "hypercall",
===================================================================
--- a/include/linux/lguest_launcher.h
+++ b/include/linux/lguest_launcher.h
@@ -90,6 +90,8 @@ struct lguest_device_desc {
#define LGUEST_DEVICE_T_CONSOLE 1
#define LGUEST_DEVICE_T_NET 2
#define LGUEST_DEVICE_T_BLOCK 3
+#define LGUEST_DEVICE_T_VIRTNET 8
+#define LGUEST_DEVICE_T_VIRTBLK 9
/* The specific features of this device: these depends on device type
* except for LGUEST_DEVICE_F_RANDOMNESS. */
@@ -124,4 +126,28 @@ enum lguest_req
LHREQ_IRQ, /* + irq */
LHREQ_BREAK, /* + on/off flag (on blocks until someone does off) */
};
+
+/* This marks a buffer as being the start (and active) */
+#define LGUEST_DESC_F_HEAD 1
+/* This marks a buffer as continuing via the next field. */
+#define LGUEST_DESC_F_NEXT 2
+/* This marks a buffer as write-only (otherwise read-only). */
+#define LGUEST_DESC_F_WRITE 4
+
+/* Virtio descriptors */
+struct lguest_desc
+{
+ unsigned long pfn;
+ unsigned long len;
+ u16 offset;
+ u16 flags;
+ /* We chain unused descriptors via this, too */
+ u32 next;
+};
+
+struct lguest_used
+{
+ unsigned int id;
+ unsigned int len;
+};
#endif /* _ASM_LGUEST_USER */
-------------------------------------------------------------------------
This SF.net email is sponsored by: Splunk Inc.
Still grepping through log files to find problems? Stop.
Now Search log events and configuration files using AJAX and a browser.
Download your FREE copy of Splunk now >> http://get.splunk.com/
* Re: [PATCH 00/10] PV-IO v3
[not found] ` <1187753365.6174.26.camel-bi+AKbBUZKY6gyzm1THtWbp2dZbC/Bob@public.gmane.org>
@ 2007-08-22 9:18 ` Christian Borntraeger
[not found] ` <200708221118.00990.borntraeger-tA70FqPdS9bQT0dZR+AlfA@public.gmane.org>
0 siblings, 1 reply; 41+ messages in thread
From: Christian Borntraeger @ 2007-08-22 9:18 UTC (permalink / raw)
To: virtualization-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA
Cc: kvm-devel-5NWGOfrQmneRv+LV9MX5uipxlwaOVQ5f
Am Mittwoch, 22. August 2007 schrieb Rusty Russell:
> +struct desc_pages
> +{
> + /* Page of descriptors. */
> + struct lguest_desc desc[NUM_DESCS];
> +
> + /* Next page: how we tell other side what buffers are available. */
> + unsigned int avail_idx;
> + unsigned int available[NUM_DESCS];
> + char pad[PAGE_SIZE - (NUM_DESCS+1) * sizeof(unsigned int)];
> +
> + /* Third page: how other side tells us what's used. */
> + unsigned int used_idx;
> + struct lguest_used used[NUM_DESCS];
> +};
Please consider adding this patch to make this data structure work on 64 bit:
it makes the second page really page-aligned. On 32 bit this should be a
no-op.
Signed-Off-by: Christian Borntraeger <borntraeger-tA70FqPdS9bQT0dZR+AlfA@public.gmane.org>
---
drivers/lguest/lguest_virtio.c | 3 ++-
1 file changed, 2 insertions(+), 1 deletion(-)
Index: lguest/drivers/lguest/lguest_virtio.c
===================================================================
--- lguest.orig/drivers/lguest/lguest_virtio.c
+++ lguest/drivers/lguest/lguest_virtio.c
@@ -33,11 +33,12 @@ struct desc_pages
{
/* Page of descriptors. */
struct lguest_desc desc[NUM_DESCS];
+ char pad0[PAGE_SIZE - NUM_DESCS * sizeof(struct lguest_desc)];
/* Next page: how we tell other side what buffers are available. */
unsigned int avail_idx;
unsigned int available[NUM_DESCS];
- char pad[PAGE_SIZE - (NUM_DESCS+1) * sizeof(unsigned int)];
+ char pad1[PAGE_SIZE - (NUM_DESCS+1) * sizeof(unsigned int)];
/* Third page: how other side tells us what's used. */
unsigned int used_idx;
--
IBM Deutschland Entwicklung GmbH
Vorsitzender des Aufsichtsrats: Martin Jetter
Geschäftsführung: Herbert Kircher
Sitz der Gesellschaft: Böblingen
Registergericht: Amtsgericht Stuttgart, HRB 243294
* Re: [PATCH 00/10] PV-IO v3
[not found] ` <200708221118.00990.borntraeger-tA70FqPdS9bQT0dZR+AlfA@public.gmane.org>
@ 2007-08-22 9:26 ` Dor Laor
[not found] ` <64F9B87B6B770947A9F8391472E032160D503D81-yEcIvxbTEBqsx+V+t5oei8rau4O3wl8o3fe8/T/H7NteoWH0uzbU5w@public.gmane.org>
0 siblings, 1 reply; 41+ messages in thread
From: Dor Laor @ 2007-08-22 9:26 UTC (permalink / raw)
To: Christian Borntraeger,
virtualization-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA
Cc: kvm-devel-5NWGOfrQmneRv+LV9MX5uipxlwaOVQ5f
>> +struct desc_pages
>> +{
>> + /* Page of descriptors. */
>> + struct lguest_desc desc[NUM_DESCS];
>> +
>> + /* Next page: how we tell other side what buffers are available. */
>> + unsigned int avail_idx;
>> + unsigned int available[NUM_DESCS];
>> + char pad[PAGE_SIZE - (NUM_DESCS+1) * sizeof(unsigned int)];
>> +
>> + /* Third page: how other side tells us what's used. */
>> + unsigned int used_idx;
>> + struct lguest_used used[NUM_DESCS];
>> +};
>
>Please consider to add this patch to make this data structure work on 64 bit
>to make the second page, really page aligned. On 32 bit this should be a no-op.
>
>Signed-Off-by: Christian Borntraeger <borntraeger-tA70FqPdS9bQT0dZR+AlfA@public.gmane.org>
>
>---
> drivers/lguest/lguest_virtio.c | 3 ++-
> 1 file changed, 2 insertions(+), 1 deletion(-)
>
>Index: lguest/drivers/lguest/lguest_virtio.c
>===================================================================
>--- lguest.orig/drivers/lguest/lguest_virtio.c
>+++ lguest/drivers/lguest/lguest_virtio.c
>@@ -33,11 +33,12 @@ struct desc_pages
> {
> /* Page of descriptors. */
> struct lguest_desc desc[NUM_DESCS];
>+ char pad0[PAGE_SIZE - NUM_DESCS * sizeof(struct lguest_desc)];
>
> /* Next page: how we tell other side what buffers are available. */
> unsigned int avail_idx;
> unsigned int available[NUM_DESCS];
>- char pad[PAGE_SIZE - (NUM_DESCS+1) * sizeof(unsigned int)];
>+ char pad1[PAGE_SIZE - (NUM_DESCS+1) * sizeof(unsigned int)];
>
> /* Third page: how other side tells us what's used. */
> unsigned int used_idx;
Actually while playing with virtio for kvm Avi saw that and recommended
to do the following:
struct desc_pages
{
/* Page of descriptors. */
union {
struct virtio_desc desc[NUM_DESCS];
char pad1[PAGE_SIZE];
};
/* Next page: how we tell other side what buffers are available.
*/
union {
struct {
unsigned int avail_idx;
unsigned int available[NUM_DESCS];
};
char pad2[PAGE_SIZE];
};
/* Third page: how other side tells us what's used. */
union {
struct {
unsigned int used_idx;
struct virtio_used used[NUM_DESCS];
};
char pad3[PAGE_SIZE];
};
};
It saves useless pointer arithmetic.
--Dor
* Re: [PATCH 00/10] PV-IO v3
[not found] ` <64F9B87B6B770947A9F8391472E032160D503D81-yEcIvxbTEBqsx+V+t5oei8rau4O3wl8o3fe8/T/H7NteoWH0uzbU5w@public.gmane.org>
@ 2007-08-22 9:30 ` Christian Borntraeger
[not found] ` <200708221130.17364.borntraeger-tA70FqPdS9bQT0dZR+AlfA@public.gmane.org>
2007-08-22 10:40 ` Rusty Russell
1 sibling, 1 reply; 41+ messages in thread
From: Christian Borntraeger @ 2007-08-22 9:30 UTC (permalink / raw)
To: Dor Laor
Cc: kvm-devel-5NWGOfrQmneRv+LV9MX5uipxlwaOVQ5f,
virtualization-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA
Am Mittwoch, 22. August 2007 schrieb Dor Laor:
> Actually while playing with virtio for kvm Avi saw that and recommended
> to do the following:
> struct desc_pages
> {
> /* Page of descriptors. */
> union {
> struct virtio_desc desc[NUM_DESCS];
> char pad1[PAGE_SIZE];
> };
[...]
Fine with me. Is anybody currently working on descriptor based transport for
virtio for kvm?
Christian
* Re: [PATCH 00/10] PV-IO v3
[not found] ` <200708221130.17364.borntraeger-tA70FqPdS9bQT0dZR+AlfA@public.gmane.org>
@ 2007-08-22 10:05 ` Dor Laor
0 siblings, 0 replies; 41+ messages in thread
From: Dor Laor @ 2007-08-22 10:05 UTC (permalink / raw)
To: Christian Borntraeger
Cc: kvm-devel-5NWGOfrQmneRv+LV9MX5uipxlwaOVQ5f,
virtualization-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA
>> Actually while playing with virtio for kvm Avi saw that and recommended
>> to do the following:
>> struct desc_pages
>> {
>> /* Page of descriptors. */
>> union {
>> struct virtio_desc desc[NUM_DESCS];
>> char pad1[PAGE_SIZE];
>> };
>[...]
>
>Fine with me. Is anybody currently working on descriptor based transport
>for virtio for kvm?
>
Yep, I'm working on it: I have a network driver up and running, and a block
device plus patches are coming soon.
Besides me, Gregory Haskins is working on a competing implementation; the two
might be merged later on.
* Re: [PATCH 00/10] PV-IO v3
[not found] ` <64F9B87B6B770947A9F8391472E032160D503D81-yEcIvxbTEBqsx+V+t5oei8rau4O3wl8o3fe8/T/H7NteoWH0uzbU5w@public.gmane.org>
2007-08-22 9:30 ` Christian Borntraeger
@ 2007-08-22 10:40 ` Rusty Russell
[not found] ` <1187779205.6174.87.camel-bi+AKbBUZKY6gyzm1THtWbp2dZbC/Bob@public.gmane.org>
1 sibling, 1 reply; 41+ messages in thread
From: Rusty Russell @ 2007-08-22 10:40 UTC (permalink / raw)
To: Dor Laor
Cc: kvm-devel-5NWGOfrQmneRv+LV9MX5uipxlwaOVQ5f, Christian Borntraeger,
virtualization-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA
On Wed, 2007-08-22 at 02:26 -0700, Dor Laor wrote:
> Actually while playing with virtio for kvm Avi saw that and recommended
> to do the following:
> struct desc_pages
> {
> /* Page of descriptors. */
> union {
> struct virtio_desc desc[NUM_DESCS];
> char pad1[PAGE_SIZE];
> };
>
> /* Next page: how we tell other side what buffers are available. */
> union {
> struct {
> unsigned int avail_idx;
> unsigned int available[NUM_DESCS];
> };
> char pad2[PAGE_SIZE];
> };
>
> /* Third page: how other side tells us what's used. */
> union {
> struct {
> unsigned int used_idx;
> struct virtio_used used[NUM_DESCS];
> };
> char pad3[PAGE_SIZE];
> };
> };
>
> It saves useless pointer arithmetic.
Hi Dor!
Please consider moving the "used" field into the descriptor (maybe as a
ptr for cache reasons, 'cept I'd really like to trim descriptor size).
That makes the avail and used rings identical, plus the current model
means we can't trust the used length if we don't trust the other side
(this is one of my FIXMEs).
Of course, we could go further and break the fixed structure: there's no
reason for the first and second part to be on separate pages, nor for
the third to be consecutive. But there is no need until be have an
untrusted demonstration...
Cheers,
Rusty.
* Re: [PATCH 00/10] PV-IO v3
[not found] ` <1187779205.6174.87.camel-bi+AKbBUZKY6gyzm1THtWbp2dZbC/Bob@public.gmane.org>
@ 2007-08-22 11:47 ` Avi Kivity
0 siblings, 0 replies; 41+ messages in thread
From: Avi Kivity @ 2007-08-22 11:47 UTC (permalink / raw)
To: Rusty Russell
Cc: kvm-devel-5NWGOfrQmneRv+LV9MX5uipxlwaOVQ5f, Christian Borntraeger,
virtualization-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA
Rusty Russell wrote:
> On Wed, 2007-08-22 at 02:26 -0700, Dor Laor wrote:
>
>> Actually while playing with virtio for kvm Avi saw that and recommended
>> to do the following:
>> struct desc_pages
>> {
>> /* Page of descriptors. */
>> union {
>> struct virtio_desc desc[NUM_DESCS];
>> char pad1[PAGE_SIZE];
>> };
>>
>> /* Next page: how we tell other side what buffers are available. */
>> union {
>> struct {
>> unsigned int avail_idx;
>> unsigned int available[NUM_DESCS];
>> };
>> char pad2[PAGE_SIZE];
>> };
>>
>> /* Third page: how other side tells us what's used. */
>> union {
>> struct {
>> unsigned int used_idx;
>> struct virtio_used used[NUM_DESCS];
>> };
>> char pad3[PAGE_SIZE];
>> };
>> };
>>
>> It saves useless pointer arithmetic.
>>
>
> Hi Dor!
>
> Please consider moving the "used" field into the descriptor (maybe as a
> ptr for cache reasons, 'cept I'd really like to trim descriptor size).
> That makes the avail and used rings identical, plus the current model
> means we can't trust the used length if we don't trust the other side
> (this is one of my FIXMEs).
>
> Of course, we could go further and break the fixed structure: there's no
> reason for the first and second part to be on separate pages, nor for
> the third to be consecutive. But there is no need until we have an
> untrusted demonstration...
>
>
Another thing that occurs to me is that alignment should be explicit to
64 bits, so that mixed 32/64 bit guest/hosts can be used.
--
error compiling committee.c: too many arguments to function
end of thread, other threads:[~2007-08-22 11:47 UTC | newest]
Thread overview: 41+ messages
2007-08-16 23:13 [PATCH 00/10] PV-IO v3 Gregory Haskins
[not found] ` <20070816231357.8044.55943.stgit-5CR4LY5GPkvLDviKLk5550HKjMygAv58XqFh9Ls21Oc@public.gmane.org>
2007-08-16 23:14 ` [PATCH 01/10] IOQ: Adding basic definitions for IO-Queue logic Gregory Haskins
2007-08-16 23:14 ` [PATCH 02/10] PARAVIRTUALIZATION: Add support for a bus abstraction Gregory Haskins
2007-08-16 23:14 ` [PATCH 03/10] IOQ: Add an IOQ network driver Gregory Haskins
2007-08-16 23:14 ` [PATCH 04/10] IOQNET: Add a test harness infrastructure to IOQNET Gregory Haskins
2007-08-16 23:14 ` [PATCH 05/10] IRQ: Export create_irq/destroy_irq Gregory Haskins
2007-08-16 23:14 ` [PATCH 06/10] KVM: Add a guest side driver for IOQ Gregory Haskins
2007-08-16 23:14 ` [PATCH 07/10] KVM: Add a gpa_to_hva helper function Gregory Haskins
2007-08-16 23:14 ` [PATCH 08/10] KVM: Add support for IOQ Gregory Haskins
2007-08-16 23:14 ` [PATCH 09/10] KVM: Add PVBUS support to the KVM host Gregory Haskins
2007-08-16 23:14 ` [PATCH 10/10] KVM: Add an IOQNET backend driver Gregory Haskins
2007-08-17 1:25 ` [PATCH 00/10] PV-IO v3 Rusty Russell
[not found] ` <1187313953.6449.70.camel-bi+AKbBUZKY6gyzm1THtWbp2dZbC/Bob@public.gmane.org>
2007-08-17 5:26 ` Gregory Haskins
[not found] ` <1187328402.4363.110.camel-5CR4LY5GPkvLDviKLk5550HKjMygAv58XqFh9Ls21Oc@public.gmane.org>
2007-08-17 7:43 ` Rusty Russell
[not found] ` <1187336618.6449.106.camel-bi+AKbBUZKY6gyzm1THtWbp2dZbC/Bob@public.gmane.org>
2007-08-17 13:50 ` Gregory Haskins
[not found] ` <1187358614.4363.135.camel-5CR4LY5GPkvLDviKLk5550HKjMygAv58XqFh9Ls21Oc@public.gmane.org>
2007-08-20 23:28 ` Rusty Russell
[not found] ` <1187652496.19435.141.camel-bi+AKbBUZKY6gyzm1THtWbp2dZbC/Bob@public.gmane.org>
2007-08-21 7:33 ` Dor Laor
[not found] ` <64F9B87B6B770947A9F8391472E032160D464FEB-yEcIvxbTEBqsx+V+t5oei8rau4O3wl8o3fe8/T/H7NteoWH0uzbU5w@public.gmane.org>
2007-08-21 7:58 ` Rusty Russell
[not found] ` <1187683122.19435.171.camel-bi+AKbBUZKY6gyzm1THtWbp2dZbC/Bob@public.gmane.org>
2007-08-21 12:00 ` Gregory Haskins
[not found] ` <1187697638.4363.277.camel-5CR4LY5GPkvLDviKLk5550HKjMygAv58XqFh9Ls21Oc@public.gmane.org>
2007-08-21 12:25 ` Avi Kivity
[not found] ` <46CAD9CC.6050209-atKUWr5tajBWk0Htik3J/w@public.gmane.org>
2007-08-21 13:11 ` Gregory Haskins
2007-08-21 13:47 ` Rusty Russell
[not found] ` <1187704038.19435.194.camel-bi+AKbBUZKY6gyzm1THtWbp2dZbC/Bob@public.gmane.org>
2007-08-21 14:06 ` Gregory Haskins
[not found] ` <1187705162.4363.323.camel-5CR4LY5GPkvLDviKLk5550HKjMygAv58XqFh9Ls21Oc@public.gmane.org>
2007-08-21 16:47 ` Gregory Haskins
[not found] ` <1187714864.4363.358.camel-5CR4LY5GPkvLDviKLk5550HKjMygAv58XqFh9Ls21Oc@public.gmane.org>
2007-08-21 17:12 ` Avi Kivity
[not found] ` <46CB1D06.1040005-atKUWr5tajBWk0Htik3J/w@public.gmane.org>
2007-08-21 17:17 ` Gregory Haskins
2007-08-22 3:29 ` Rusty Russell
[not found] ` <1187753365.6174.26.camel-bi+AKbBUZKY6gyzm1THtWbp2dZbC/Bob@public.gmane.org>
2007-08-22 9:18 ` Christian Borntraeger
[not found] ` <200708221118.00990.borntraeger-tA70FqPdS9bQT0dZR+AlfA@public.gmane.org>
2007-08-22 9:26 ` Dor Laor
[not found] ` <64F9B87B6B770947A9F8391472E032160D503D81-yEcIvxbTEBqsx+V+t5oei8rau4O3wl8o3fe8/T/H7NteoWH0uzbU5w@public.gmane.org>
2007-08-22 9:30 ` Christian Borntraeger
[not found] ` <200708221130.17364.borntraeger-tA70FqPdS9bQT0dZR+AlfA@public.gmane.org>
2007-08-22 10:05 ` Dor Laor
2007-08-22 10:40 ` Rusty Russell
[not found] ` <1187779205.6174.87.camel-bi+AKbBUZKY6gyzm1THtWbp2dZbC/Bob@public.gmane.org>
2007-08-22 11:47 ` Avi Kivity
2007-08-21 12:29 ` Avi Kivity
2007-08-19 9:24 ` Avi Kivity
[not found] ` <46C80C5B.7070009-atKUWr5tajBWk0Htik3J/w@public.gmane.org>
2007-08-20 13:50 ` Gregory Haskins
2007-08-20 14:03 ` [kvm-devel] " Dor Laor
[not found] ` <64F9B87B6B770947A9F8391472E032160D4649E2-yEcIvxbTEBqsx+V+t5oei8rau4O3wl8o3fe8/T/H7NteoWH0uzbU5w@public.gmane.org>
2007-08-20 14:12 ` Avi Kivity
[not found] ` <46C9A150.60101-atKUWr5tajBWk0Htik3J/w@public.gmane.org>
2007-08-20 23:24 ` Rusty Russell
2007-08-20 14:17 ` Gregory Haskins
[not found] ` <1187617806.4363.179.camel-5CR4LY5GPkvLDviKLk5550HKjMygAv58XqFh9Ls21Oc@public.gmane.org>
2007-08-20 14:14 ` Avi Kivity