* [PATCH/RFC 2/9] s390 virtualization interface
@ 2007-05-11 17:35 ` Carsten Otte
From: Carsten Otte @ 2007-05-11 17:35 UTC (permalink / raw)
To: kvm-devel-5NWGOfrQmneRv+LV9MX5uipxlwaOVQ5f@public.gmane.org
Cc: Christian Borntraeger, Martin Schwidefsky

From: Heiko Carstens <heiko.carstens-tA70FqPdS9bQT0dZR+AlfA@public.gmane.org>

Add an interface which allows a process to start a virtual machine. To keep things easy, each thread group is allowed to have only one virtual machine, and each thread of the thread group can control only one virtual cpu of the virtual machine. All the information about the virtual machines/cpus can be found via the thread_info structures of the participating threads.

This patch adds three new s390 specific system calls:

long sys_s390host_add_cpu(unsigned long addr, unsigned long flags, struct sie_block __user *sie_template)

Adds a new cpu to the virtual machine that belongs to the current thread group. If no virtual machine exists, it will be created. In addition, two pages will be allocated and mapped at <addr> into the address space of the process. These two pages are used so user space and kernel space can easily exchange/modify the state of the corresponding virtual cpu without a ton of copy_from/to_user calls. sie_template is a pointer to a data structure that contains initial information on how the virtual cpu should be set up. The resulting block will be used as a parameter to issue the sie (start interpretive execution) instruction, which starts a virtual cpu.

int sys_s390host_remove_cpu(void)

Removes a virtual cpu from a virtual machine.

int sys_s390host_sie(unsigned long action)

Starts / re-enters the virtual cpu of the virtual machine that the thread belongs to, if any.
Please note that this patch is nothing more than a proof of concept and may contain quite a few bugs. Since we want to convert to using kvm instead, most of this will be dropped anyway. But maybe this is of interest for others as well.

Signed-off-by: Heiko Carstens <heiko.carstens-tA70FqPdS9bQT0dZR+AlfA@public.gmane.org>
Signed-off-by: Carsten Otte <cotte-tA70FqPdS9bQT0dZR+AlfA@public.gmane.org>
---
 arch/s390/Kconfig               |    7
 arch/s390/Makefile              |    2
 arch/s390/host/Makefile         |    5
 arch/s390/host/s390_intercept.c |   42 ++++
 arch/s390/host/s390host.c       |  418 ++++++++++++++++++++++++++++++++++++++++
 arch/s390/host/s390host.h       |   16 +
 arch/s390/host/sie64a.S         |   38 +++
 arch/s390/kernel/asm-offsets.c  |    2
 arch/s390/kernel/process.c      |   15 +
 arch/s390/kernel/setup.c        |    4
 arch/s390/kernel/syscalls.S     |    3
 include/asm-s390/sie64.h        |  279 ++++++++++++++++++++++++++
 include/asm-s390/thread_info.h  |    8
 include/asm-s390/unistd.h       |    5
 kernel/sys_ni.c                 |    3
 15 files changed, 842 insertions(+), 5 deletions(-)

Index: linux-2.6.21/arch/s390/kernel/asm-offsets.c =================================================================== --- linux-2.6.21.orig/arch/s390/kernel/asm-offsets.c +++ linux-2.6.21/arch/s390/kernel/asm-offsets.c @@ -44,5 +44,7 @@ int main(void) DEFINE(__SF_BACKCHAIN, offsetof(struct stack_frame, back_chain),); DEFINE(__SF_GPRS, offsetof(struct stack_frame, gprs),); DEFINE(__SF_EMPTY, offsetof(struct stack_frame, empty1),); + BLANK(); + DEFINE(__SIE_USER_gprs, offsetof(struct sie_user, gprs),); return 0; } Index: linux-2.6.21/arch/s390/kernel/syscalls.S =================================================================== --- linux-2.6.21.orig/arch/s390/kernel/syscalls.S +++ linux-2.6.21/arch/s390/kernel/syscalls.S @@ -322,3 +322,6 @@ NI_SYSCALL /* 310 sys_move_pages */ SYSCALL(sys_getcpu,sys_getcpu,sys_getcpu_wrapper) SYSCALL(sys_epoll_pwait,sys_epoll_pwait,compat_sys_epoll_pwait_wrapper) SYSCALL(sys_utimes,sys_utimes,compat_sys_utimes_wrapper)
+SYSCALL(sys_ni_syscall,sys_s390host_add_cpu,sys_ni_syscall) +SYSCALL(sys_ni_syscall,sys_s390host_remove_cpu,sys_ni_syscall) +SYSCALL(sys_ni_syscall,sys_s390host_sie,sys_ni_syscall) Index: linux-2.6.21/arch/s390/host/Makefile =================================================================== --- /dev/null +++ linux-2.6.21/arch/s390/host/Makefile @@ -0,0 +1,5 @@ +# +# Makefile for the s390host components. +# + +obj-$(CONFIG_S390_HOST) += s390host.o sie64a.o s390_intercept.o Index: linux-2.6.21/arch/s390/host/sie64a.S =================================================================== --- /dev/null +++ linux-2.6.21/arch/s390/host/sie64a.S @@ -0,0 +1,38 @@ +/* + * arch/s390/host/sie64a.S + * low level sie call + * + * Copyright IBM Corp. 2007 + * Author(s): Heiko Carstens <heiko.carstens-tA70FqPdS9bQT0dZR+AlfA@public.gmane.org> + * License : GPL + */ + +#include <linux/errno.h> +#include <asm/asm-offsets.h> + +SP_R6 = 6 * 8 # offset into stackframe + + .globl sie64a +sie64a: + stmg %r6,%r15,SP_R6(%r15) # save register on entry + lgr %r14,%r2 # pointer to program parms + aghi %r2,4096 + lmg %r0,%r13,__SIE_USER_gprs(%r2) # load guest gprs 0-13 +sie_inst: + sie 0(%r14) + aghi %r14,4096 + stmg %r0,%r13,__SIE_USER_gprs(%r14) # save guest gprs 0-13 + lghi %r2,0 + lmg %r6,%r15,SP_R6(%r15) + br %r14 + +sie_err: + aghi %r14,4096 + stmg %r0,%r13,__SIE_USER_gprs(%r14) # save guest gprs 0-13 + lghi %r2,-EFAULT + lmg %r6,%r15,SP_R6(%r15) + br %r14 + + .section __ex_table,"a" + .quad sie_inst,sie_err + .previous Index: linux-2.6.21/arch/s390/host/s390host.c =================================================================== --- /dev/null +++ linux-2.6.21/arch/s390/host/s390host.c @@ -0,0 +1,418 @@ +/* + * s390host.c -- hosting zSeries Linux virtual engines + * + * Copyright IBM Corp. 
2007 + * Author(s): Carsten Otte <cotte-tA70FqPdS9bQT0dZR+AlfA@public.gmane.org>, + * Christian Borntraeger <cborntra-tA70FqPdS9bQT0dZR+AlfA@public.gmane.org>, + * Heiko Carstens <heiko.carstens-tA70FqPdS9bQT0dZR+AlfA@public.gmane.org> + * License : GPL + */ + +#include <linux/pagemap.h> +#include <linux/module.h> +#include <linux/fs.h> +#include <linux/mm.h> +#include <linux/init.h> +#include <linux/file.h> +#include <linux/mman.h> +#include <linux/mutex.h> +#include <asm/uaccess.h> +#include <asm/processor.h> +#include <asm/tlbflush.h> +#include <asm/semaphore.h> +#include <asm/sie64.h> +#include "s390host.h" + +static int s390host_do_action(unsigned long, struct sie_io *); + +static DEFINE_MUTEX(s390host_init_mutex); + +static void s390host_get_data(struct s390host_data *data) +{ + atomic_inc(&data->count); +} + +void s390host_put_data(struct s390host_data *data) +{ + int cpu; + + if (atomic_dec_return(&data->count)) + return; + + for (cpu = 0; cpu < S390HOST_MAX_CPUS; cpu++) + if (data->sie_io[cpu]) + free_page((unsigned long)data->sie_io[cpu]); + + if (data->sca_block) + free_page((unsigned long)data->sca_block); + + kfree(data); +} + +static void s390host_vma_close(struct vm_area_struct *vma) +{ + s390host_put_data(vma->vm_private_data); +} + +static struct page *s390host_vma_nopage(struct vm_area_struct *vma, + unsigned long address, int *type) +{ + return NOPAGE_SIGBUS; +} + +static struct vm_operations_struct s390host_vmops = { + .close = s390host_vma_close, + .nopage = s390host_vma_nopage, +}; + +static struct s390host_data *get_s390host_context(void) +{ + struct thread_info *tif; + struct sca_block *sca_block = NULL; + struct s390host_data *data = NULL; + struct task_struct *tsk; + + /* zlh context for current thread already created? */ + tif = current_thread_info(); + if (tif->s390host_data) + return tif->s390host_data; + + /* zlh context in thread group available? 
*/ + write_lock_irq(&tasklist_lock); + tsk = next_thread(current); + for (; tsk != current; tsk = next_thread(tsk)) { + data = tsk->thread_info->s390host_data; + if (data) { + s390host_get_data(data); + tif->s390host_data = data; + break; + } + } + write_unlock_irq(&tasklist_lock); + + if (data) + return data; + + /* create new context */ + data = kzalloc(sizeof(*data), GFP_KERNEL); + + if (!data) + return NULL; + + sca_block = (struct sca_block *)get_zeroed_page(GFP_KERNEL); + + if (!sca_block) { + kfree(data); + return NULL; + } + + data->sca_block = sca_block; + tif->s390host_data = data; + s390host_get_data(data); + + return data; +} + +static unsigned long +s390host_create_io_area(unsigned long addr, unsigned long flags, + unsigned long io_addr, struct s390host_data *data) +{ + struct mm_struct *mm = current->mm; + struct vm_area_struct *vma; + unsigned long ret; + + flags &= MAP_FIXED; + addr = get_unmapped_area(NULL, addr, 2 * PAGE_SIZE, 0, flags); + + if (addr & ~PAGE_MASK) + return addr; + + vma = kmem_cache_zalloc(vm_area_cachep, GFP_KERNEL); + + if (!vma) + return -ENOMEM; + + vma->vm_mm = mm; + vma->vm_start = addr; + vma->vm_end = addr + 2 * PAGE_SIZE; + vma->vm_flags = VM_READ | VM_MAYREAD | VM_IO | VM_RESERVED; + vma->vm_flags |= VM_SHARED | VM_MAYSHARE | VM_DONTCOPY; + +#if 1 /* FIXME: write access until sys_s390host_sie interface is final */ + vma->vm_flags |= VM_WRITE | VM_MAYWRITE; +#endif + + vma->vm_page_prot = protection_map[vma->vm_flags & 0xf]; + vma->vm_private_data = data; + vma->vm_ops = &s390host_vmops; + + down_write(&mm->mmap_sem); + ret = insert_vm_struct(mm, vma); + if (ret) { + kmem_cache_free(vm_area_cachep, vma); + goto out; + } + s390host_get_data(data); + mm->total_vm += 2; + vm_insert_page(vma, addr, virt_to_page(io_addr)); + + ret = split_vma(mm, vma, addr + PAGE_SIZE, 0); + if (ret) + goto out; + s390host_get_data(data); + + vma = find_vma(mm, addr + PAGE_SIZE); + vma->vm_flags |= VM_WRITE | VM_MAYWRITE; + vma->vm_page_prot = 
protection_map[vma->vm_flags & 0xf]; + vm_insert_page(vma, addr + PAGE_SIZE, + virt_to_page(io_addr + PAGE_SIZE)); + ret = addr; +out: + up_write(&mm->mmap_sem); + return ret; +} + +long sys_s390host_add_cpu(unsigned long addr, unsigned long flags, + struct sie_block __user *sie_template) +{ + struct sie_block *sie_block; + struct sie_io *sie_io; + struct sca_block *sca_block; + struct s390host_data *data = NULL; + unsigned long ret; + __u16 cpu; + + if (current_thread_info()->sie_cpu != -1) + return -EINVAL; + + if (copy_from_user(&cpu, &sie_template->icpua, sizeof(u16))) + return -EFAULT; + + if (cpu >= S390HOST_MAX_CPUS) + return -EINVAL; + + mutex_lock(&s390host_init_mutex); + + data = get_s390host_context(); + if (!data) { + ret = -ENOMEM; + goto out_err; + } + + sca_block = data->sca_block; + if (sca_block->mcn & (1UL << (S390HOST_MAX_CPUS - 1 - cpu))) { + ret = -EINVAL; + goto out_err; + } + + if (!data->sie_io[cpu]) { + unsigned long tmp; + + /* allocate two pages: 1st is r/o 2nd r/w area */ + tmp = __get_free_pages(GFP_KERNEL, 1); + if (!tmp) { + ret = -ENOMEM; + goto out_err; + } + split_page(virt_to_page(tmp), 1); + data->sie_io[cpu] = (struct sie_io *)tmp; + } + + sie_io = data->sie_io[cpu]; + memset(sie_io, 0, 2 * PAGE_SIZE); + + sie_block = &sie_io->sie_kernel.sie_block; + sca_block->cpu[cpu].sda = (__u64)sie_block; + + if (copy_from_user(sie_block, sie_template, sizeof(struct sie_block))) { + ret = -EFAULT; + goto out_err; + } + sie_block->icpua = cpu; + + ret = s390host_create_io_area(addr, flags, (unsigned long)sie_io, data); + + if (ret & ~PAGE_MASK) + goto out_err; + + sca_block->mcn |= 1UL << (S390HOST_MAX_CPUS - 1 - cpu); + sie_block->scaoh = (__u32)(((__u64)sca_block) >> 32); + sie_block->scaol = (__u32)(__u64)sca_block; + current_thread_info()->sie_cpu = cpu; + goto out; +out_err: + if (data) + s390host_put_data(data); +out: + mutex_unlock(&s390host_init_mutex); + return ret; +} + +int sys_s390host_remove_cpu(void) +{ + struct sca_block 
*sca_block; + int cpu; + + cpu = current_thread_info()->sie_cpu; + if (cpu == -1) + return -EINVAL; + + mutex_lock(&s390host_init_mutex); + sca_block = current_thread_info()->s390host_data->sca_block; + sca_block->mcn &= ~(1UL << (S390HOST_MAX_CPUS - 1 - cpu)); + current_thread_info()->sie_cpu = -1; + mutex_unlock(&s390host_init_mutex); + return 0; +} + +int sys_s390host_sie(unsigned long action) +{ + struct sie_kernel *sie_kernel; + struct sie_user *sie_user; + struct sie_io *sie_io; + int cpu; + int ret = 0; + + cpu = current_thread_info()->sie_cpu; + if (cpu == -1) + return -EINVAL; + + sie_io = current_thread_info()->s390host_data->sie_io[cpu]; + + if (action) + ret = s390host_do_action(action, sie_io); + if (ret) + goto out_err; + sie_kernel = &sie_io->sie_kernel; + sie_user = &sie_io->sie_user; + + save_fp_regs(&sie_kernel->host_fpregs); + save_access_regs(sie_kernel->host_acrs); + sie_user->guest_fpregs.fpc &= FPC_VALID_MASK; + restore_fp_regs(&sie_user->guest_fpregs); + restore_access_regs(sie_user->guest_acrs); + memcpy(&sie_kernel->sie_block.gg14, &sie_user->gprs[14], 16); +again: + if (need_resched()) + schedule(); + + sie_kernel->sie_block.icptcode = 0; + ret = sie64a(sie_kernel); + if (ret) + goto out; + + if (signal_pending(current)) { + ret = -EINTR; + goto out; + } + + ret = s390host_handle_intercept(sie_kernel); + + /* intercept reason was handled, enter SIE again */ + if (!ret) + goto again; + + /* if the kernel cannot handle the intercept, pass it to the user */ + if (ret == -ENOTSUPP) + ret = 0; + +out: + memcpy(&sie_user->gprs[14], &sie_kernel->sie_block.gg14, 16); + save_fp_regs(&sie_user->guest_fpregs); + save_access_regs(sie_user->guest_acrs); + restore_fp_regs(&sie_kernel->host_fpregs); + restore_access_regs(sie_kernel->host_acrs); +out_err: + return ret; +} + +static void s390host_vsmxm_local_update(struct sie_io *sie_io) +{ + struct sie_kernel *local_sie_kernel; + struct sie_user *sie_user; + atomic_t *cpuflags; + int old, new; +
mutex_lock(&s390host_init_mutex); + + sie_user = &sie_io->sie_user; + local_sie_kernel = &sie_io->sie_kernel; + + cpuflags = &local_sie_kernel->sie_block.cpuflags; + do { + old = atomic_read(cpuflags); + new = old | sie_user->vsmxm_or_local; + new &= sie_user->vsmxm_and_local; + } while (atomic_cmpxchg(cpuflags, old, new) != old); + + mutex_unlock(&s390host_init_mutex); + return; +} + +static int s390host_vsmxm_dist_update(struct sie_io *sie_io) +{ + struct sie_kernel *dist_sie_kernel; + struct sie_user *sie_user; + struct sca_block *sca_block; + struct thread_info *tif; + atomic_t *cpuflags; + int cpu; + int old, new; + int rc = -EINVAL; + + mutex_lock(&s390host_init_mutex); + + sie_user = &sie_io->sie_user; + cpu = sie_user->vsmxm_cpuid; + + if (cpu >= S390HOST_MAX_CPUS) + goto out; + + tif = current_thread_info(); + sca_block = tif->s390host_data->sca_block; + if (!(sca_block->mcn & (1UL << (S390HOST_MAX_CPUS - 1 - cpu)))) + goto out; + + dist_sie_kernel = &((tif->s390host_data->sie_io[cpu])->sie_kernel); + + cpuflags = &dist_sie_kernel->sie_block.cpuflags; + do { + old = atomic_read(cpuflags); + new = old | sie_user->vsmxm_or; + new &= sie_user->vsmxm_and; + } while (atomic_cmpxchg(cpuflags, old, new) != old); + + rc = 0; +out: + mutex_unlock(&s390host_init_mutex); + return rc; +} + +static int s390host_do_action(unsigned long action, struct sie_io *sie_io) +{ + void *src; + void *dest; + int rc = 0; + + if (action & SIE_BLOCK_UPDATE) { + src = &(sie_io->sie_user.sie_block); + dest = &(sie_io->sie_kernel.sie_block); + + memcpy(dest + 4, src + 4, 88); + memcpy(dest + 96, src + 96, 4); + memcpy(dest + 104, src + 104, 408); + } + + if (action & SIE_UPDATE_PSW) + sie_io->sie_kernel.sie_block.psw.gpsw = sie_io->sie_user.psw; + + if (action & SIE_FLUSH_TLB) + flush_tlb_mm(current->mm); + + if (action & SIE_VSMXM_LOCAL_UPDATE) + s390host_vsmxm_local_update(sie_io); + + if (action & SIE_VSMXM_DIST_UPDATE) + rc = s390host_vsmxm_dist_update(sie_io); + return rc; +} 
Index: linux-2.6.21/include/asm-s390/unistd.h =================================================================== --- linux-2.6.21.orig/include/asm-s390/unistd.h +++ linux-2.6.21/include/asm-s390/unistd.h @@ -251,8 +251,11 @@ #define __NR_getcpu 311 #define __NR_epoll_pwait 312 #define __NR_utimes 313 +#define __NR_s390host_add_cpu 314 +#define __NR_s390host_remove_cpu 315 +#define __NR_s390host_sie 316 -#define NR_syscalls 314 +#define NR_syscalls 317 /* * There are some system calls that are not present on 64 bit, some Index: linux-2.6.21/include/asm-s390/sie64.h =================================================================== --- /dev/null +++ linux-2.6.21/include/asm-s390/sie64.h @@ -0,0 +1,279 @@ +/* + * include/asm-s390/sie64.h + * + * Copyright IBM Corp. 2007 + * Author(s): Carsten Otte <cotte-tA70FqPdS9bQT0dZR+AlfA@public.gmane.org>, + * Christian Borntraeger <cborntra-tA70FqPdS9bQT0dZR+AlfA@public.gmane.org>, + * Heiko Carstens <heiko.carstens-tA70FqPdS9bQT0dZR+AlfA@public.gmane.org> + * + */ + +#ifndef _ASM_S390_SIE64_H +#define _ASM_S390_SIE64_H + +#include <asm/atomic.h> +#include <asm/ptrace.h> //FIXME: psw_t definition needs relocation + +struct sie_block { + atomic_t cpuflags; /* 0x0000 */ + __u32 prefix; /* 0x0004 */ + __u32 :32; /* 0x0008 */ + __u32 :32; /* 0x000c */ + __u64 :64; /* 0x0010 */ + __u64 :64; /* 0x0018 */ + __u64 :64; /* 0x0020 */ + __u64 cputm; /* 0x0028 */ + __u64 ckc; /* 0x0030 */ + __u64 epoch; /* 0x0038 */ + __u8 svcnn :1, /* 0x0040 */ + svc1c :1, + svc2c :1, + svc3c :1, + :4; + __u8 svc1n; /* 0x0041 */ + __u8 svc2n; /* 0x0042 */ + __u8 svc3n; /* 0x0043 */ + __u16 lctl0 :1, /* 0x0044 */ + lctl1 :1, + lctl2 :1, + lctl3 :1, + lctl4 :1, + lctl5 :1, + lctl6 :1, + lctl7 :1, + lctl8 :1, + lctl9 :1, + lctla :1, + lctlb :1, + lctlc :1, + lctld :1, + lctle :1, + lctlf :1; + __s16 icpua; /* 0x0046 */ + __u32 icpop :1, /* 0x0048 */ + icpro :1, + icprg :1, + :4, + icipte :1, + :1, /* 0x0049 */ + iclpsw :1, + icptlb :1, + icssm :1, + icbsa 
:1, + icstctl :1, + icstnsm :1, + icstosm :1, + icstck :1, /* 0x004a */ + iciske :1, + icsske :1, + icrrbe :1, + icpc :1, + icpt :1, + ictprot :1, + iclasp :1, + :1, /* 0x004b */ + icstpt :1, + icsckc :1, + :1, + icpr :1, + icbakr :1, + icpg :1, + :1; + __u32 ecext :1, /* 0x004c */ + ecint :1, + ecwait :1, + ecsigp :1, + ecalt :1, + ecio2 :1, + :1, + ecmvp :1; + __u8 eca1; /* 0x004d */ + __u8 eca2; /* 0x004e */ + __u8 eca3; /* 0x004f */ + __u8 icptcode; /* 0x0050 */ + __u8 :6, /* 0x0051 */ + icif :1, + icex :1; + __u16 ihcpu; /* 0x0052 */ + __u16 :16; /* 0x0054 */ + struct { + union { + __u16 ipa; /* 0x0056 */ + __u16 inst; /* 0x0056 */ + struct { + union { + __u8 ipa0; /* 0x0056 */ + __u8 viwho; /* 0x0056 */ + }; + union { + __u8 ipa1; /* 0x0057 */ + __u8 viwhen; /* 0x0057 */ + }; + }; + }; + union { + __u32 ipb; /* 0x0058 */ + struct { + union { + __u16 ipbh0; /* 0x0058 */ + __u16 viwhy; /* 0x0058 */ + struct { + __u8 ipb0; /* 0x0058 */ + __u8 ipb1; /* 0x0059 */ + }; + }; + union { + __u16 ipbh1; /* 0x005a */ + struct { + __u8 ipb2; /* 0x005a */ + __u8 ipb3; /* 0x005b */ + }; + }; + }; + }; + } __attribute__((packed)); + __u32 scaoh; /* 0x005c */ + union { + __u32 rcp; /* 0x0060 */ + struct { + __u8 ska :1, /* 0x0060 */ + skaip :1, + :6; + __u8 ecb :8; /* 0x0061 */ + __u8 :3, /* 0x0062 */ + cpby :1, + :4; + __u8 :8; /* 0x0063 */ + }; + }; + __u32 scaol; /* 0x0064 */ + __u32 :32; /* 0x0068 */ + union { + __u32 todpr; /* 0x006c */ + struct { + __u16 :16; /* 0x006c */ + __u16 todpf; /* 0x006e */ + }; + } __attribute__((packed)); + __u32 gisa; /* 0x0070 */ + __u32 iopct; /* 0x0074 */ + __u32 rsvd3; /* 0x0078 */ + __u32 :32; /* 0x007c */ + __u64 gmsor; /* 0x0080 */ + __u64 gmslm; /* 0x0088 */ + union { + psw_t gpsw; /* 0x0090 */ + struct { + __u64 pswh; /* 0x0090 */ + + __u64 pswl; /* 0x0098 */ + }; + } psw; + __u64 gg14; /* 0x00a0 */ + __u64 gg15; /* 0x00a8 */ + __u64 :64; /* 0x00b0 */ + __u64 :16, /* 0x00b8 */ + xso :24, + xsl :24; + union { + __u8 uzp0[56]; /* 
0x00c0 */ + struct { + __u32 exmsf; /* 0x00c0 */ + union { + __u32 iexcf; /* 0x00c4 */ + struct { + __u16 iexca; /* 0x00c4 */ + __u16 iexcd; /* 0x00c6 */ + }; + }; + __u16 svcil; /* 0x00c8 */ + __u16 svcnt; /* 0x00ca */ + __u16 iprcl; /* 0x00cc */ + __u16 iprcc; /* 0x00ce */ + __u32 itrad; /* 0x00d0 */ + __u32 imncl; /* 0x00d4 */ + __u64 gpera; /* 0x00d8 */ + __u8 excpar; /* 0x00e0 */ + __u8 perar; /* 0x00e1 */ + __u8 oprid; /* 0x00e2 */ + __u8 :8; /* 0x00e3 */ + __u32 :32; /* 0x00e4 */ + __u64 gtrad; /* 0x00e8 */ + __u32 :32; /* 0x00f0 */ + __u32 :32; /* 0x00f4 */ + }; + }; + __u16 :16; /* 0x00f8 */ + __u16 ief; /* 0x00fa */ + __u32 apcbk; /* 0x00fc */ + __u64 gcr[16]; /* 0x0100 */ + __u8 reserved[128]; /* 0x0180 */ +} __attribute__((packed)); + +struct sie_kernel { + struct sie_block sie_block; + s390_fp_regs host_fpregs; + int host_acrs[NUM_ACRS]; +} __attribute__((packed,aligned(4096))); + +#define SIE_UPDATE_PSW (1UL << 0) +#define SIE_FLUSH_TLB (1UL << 1) +#define SIE_ISKE (1UL << 2) +#define SIE_SSKE (1UL << 3) +#define SIE_BLOCK_UPDATE (1UL << 4) +#define SIE_VSMXM_LOCAL_UPDATE (1UL << 5) +#define SIE_VSMXM_DIST_UPDATE (1UL << 6) + +struct sie_skey_parm { + unsigned long sk_reg; + unsigned long sk_addr; +}; + +struct sie_user { + struct sie_block sie_block; + psw_t psw; + unsigned long gprs[NUM_GPRS]; + s390_fp_regs guest_fpregs; + int guest_acrs[NUM_ACRS]; + struct sie_skey_parm iske_parm; + struct sie_skey_parm sske_parm; + int vsmxm_or_local; + int vsmxm_and_local; + int vsmxm_or; + int vsmxm_and; + int vsmxm_cpuid; +} __attribute__((packed,aligned(4096))); + +struct sie_io { + struct sie_kernel sie_kernel; + struct sie_user sie_user; +}; + +struct sca_entry { + atomic_t scn; + __u64 reserved; + __u64 sda; + __u64 reserved2[2]; +}__attribute__((packed)); + +struct sca_block { + __u64 ipte_control; + __u64 reserved[5]; + __u64 mcn; + __u64 reserved2; + struct sca_entry cpu[64]; +}__attribute__((packed)); + +#define S390HOST_MAX_CPUS 64 + +struct 
s390host_data { + atomic_t count; + struct sie_io *sie_io[S390HOST_MAX_CPUS]; + struct sca_block *sca_block; +}; + +/* function definitions */ +extern int sie64a(struct sie_kernel *); +extern void s390host_put_data(struct s390host_data *); + +#endif /* _ASM_S390_SIE64_H */ Index: linux-2.6.21/arch/s390/Makefile =================================================================== --- linux-2.6.21.orig/arch/s390/Makefile +++ linux-2.6.21/arch/s390/Makefile @@ -85,7 +85,7 @@ LDFLAGS_vmlinux := -e start head-y := arch/s390/kernel/head.o arch/s390/kernel/init_task.o core-y += arch/s390/mm/ arch/s390/kernel/ arch/s390/crypto/ \ - arch/s390/appldata/ arch/s390/hypfs/ + arch/s390/appldata/ arch/s390/hypfs/ arch/s390/host/ libs-y += arch/s390/lib/ drivers-y += drivers/s390/ drivers-$(CONFIG_MATHEMU) += arch/s390/math-emu/ Index: linux-2.6.21/kernel/sys_ni.c =================================================================== --- linux-2.6.21.orig/kernel/sys_ni.c +++ linux-2.6.21/kernel/sys_ni.c @@ -122,6 +122,9 @@ cond_syscall(sys32_sysctl); cond_syscall(ppc_rtas); cond_syscall(sys_spu_run); cond_syscall(sys_spu_create); +cond_syscall(sys_s390host_add_cpu); +cond_syscall(sys_s390host_remove_cpu); +cond_syscall(sys_s390host_sie); /* mmu depending weak syscall entries */ cond_syscall(sys_mprotect); Index: linux-2.6.21/arch/s390/kernel/process.c =================================================================== --- linux-2.6.21.orig/arch/s390/kernel/process.c +++ linux-2.6.21/arch/s390/kernel/process.c @@ -274,12 +274,23 @@ int copy_thread(int nr, unsigned long cl #endif /* CONFIG_64BIT */ /* start new process with ar4 pointing to the correct address space */ p->thread.mm_segment = get_fs(); - /* Don't copy debug registers */ - memset(&p->thread.per_info,0,sizeof(p->thread.per_info)); + /* Don't copy debug registers */ + memset(&p->thread.per_info,0,sizeof(p->thread.per_info)); + p->thread_info->s390host_data = NULL; + p->thread_info->sie_cpu = -1; return 0; } +void 
free_thread_info(struct thread_info *ti) +{ +#ifdef CONFIG_S390_HOST + if (ti->s390host_data) + s390host_put_data(ti->s390host_data); +#endif + free_pages((unsigned long) (ti),THREAD_ORDER); +} + asmlinkage long sys_fork(struct pt_regs regs) { return do_fork(SIGCHLD, regs.gprs[15], ®s, 0, NULL, NULL); Index: linux-2.6.21/include/asm-s390/thread_info.h =================================================================== --- linux-2.6.21.orig/include/asm-s390/thread_info.h +++ linux-2.6.21/include/asm-s390/thread_info.h @@ -38,6 +38,7 @@ #ifndef __ASSEMBLY__ #include <asm/processor.h> #include <asm/lowcore.h> +#include <asm/sie64.h> /* * low level task data that entry.S needs immediate access to @@ -52,6 +53,8 @@ struct thread_info { unsigned int cpu; /* current CPU */ int preempt_count; /* 0 => preemptable, <0 => BUG */ struct restart_block restart_block; + struct s390host_data *s390host_data; /* s390host data */ + int sie_cpu; /* sie cpu number */ }; /* @@ -67,6 +70,8 @@ struct thread_info { .restart_block = { \ .fn = do_no_restart_syscall, \ }, \ + .s390host_data = NULL, \ + .sie_cpu = 0, \ } #define init_thread_info (init_thread_union.thread_info) @@ -81,7 +86,8 @@ static inline struct thread_info *curren /* thread information allocation */ #define alloc_thread_info(tsk) ((struct thread_info *) \ __get_free_pages(GFP_KERNEL,THREAD_ORDER)) -#define free_thread_info(ti) free_pages((unsigned long) (ti),THREAD_ORDER) + +extern void free_thread_info(struct thread_info *); #endif Index: linux-2.6.21/arch/s390/Kconfig =================================================================== --- linux-2.6.21.orig/arch/s390/Kconfig +++ linux-2.6.21/arch/s390/Kconfig @@ -153,6 +153,7 @@ config S390_SWITCH_AMODE config S390_EXEC_PROTECT bool "Data execute protection" + depends on !S390_HOST select S390_SWITCH_AMODE help This option allows to enable a buffer overflow protection for user @@ -514,6 +515,12 @@ config KEXEC current kernel, and to start another kernel. 
It is like a reboot but is independent of hardware/microcode support. +config S390_HOST + bool "s390 host support (EXPERIMENTAL)" + depends on 64BIT && EXPERIMENTAL + select S390_SWITCH_AMODE + help + Select this option if you want to host guest Linux images endmenu source "net/Kconfig" Index: linux-2.6.21/arch/s390/kernel/setup.c =================================================================== --- linux-2.6.21.orig/arch/s390/kernel/setup.c +++ linux-2.6.21/arch/s390/kernel/setup.c @@ -394,7 +394,11 @@ static int __init early_parse_ipldelay(c early_param("ipldelay", early_parse_ipldelay); #ifdef CONFIG_S390_SWITCH_AMODE +#ifdef CONFIG_S390_HOST +unsigned int switch_amode = 1; +#else unsigned int switch_amode = 0; +#endif EXPORT_SYMBOL_GPL(switch_amode); static void set_amode_and_uaccess(unsigned long user_amode, Index: linux-2.6.21/arch/s390/host/s390_intercept.c =================================================================== --- /dev/null +++ linux-2.6.21/arch/s390/host/s390_intercept.c @@ -0,0 +1,42 @@ +/* + * s390_intercept.c -- handle SIE intercept codes + * + * Copyright IBM Corp. 
2007 + * Author(s): Carsten Otte <cotte-tA70FqPdS9bQT0dZR+AlfA@public.gmane.org> + */ + +#include <linux/kernel.h> +#include <linux/errno.h> +#include <asm/sie64.h> +#include <linux/pagemap.h> +#include "s390host.h" + +static int s390host_handle_validity (struct sie_kernel *sie_kernel) +{ + if (sie_kernel->sie_block.viwhy == 0x37) { + //debug message here + fault_in_pages_writeable((void*)0 + S390HOST_ORIGIN, + PAGE_SIZE); + fault_in_pages_writeable((void*)(unsigned long) + sie_kernel->sie_block.prefix+ + S390HOST_ORIGIN, 2*PAGE_SIZE); + return 0; + } + // debug message here + return -ENOTSUPP; +} + +int s390host_handle_intercept(struct sie_kernel *sie_kernel) +{ + switch (sie_kernel->sie_block.icptcode) { + case 0x00: + case 0x24: + return 0; + case 0x20: + return s390host_handle_validity(sie_kernel); + default: + // debug message here + return -ENOTSUPP; + } +} + Index: linux-2.6.21/arch/s390/host/s390host.h =================================================================== --- /dev/null +++ linux-2.6.21/arch/s390/host/s390host.h @@ -0,0 +1,16 @@ +/* + * s390host.h -- hosting zSeries Linux virtual engines + * + * Copyright IBM Corp. 2007 + * Author(s): Carsten Otte <cotte-tA70FqPdS9bQT0dZR+AlfA@public.gmane.org>, + * Christian Borntraeger <cborntra-tA70FqPdS9bQT0dZR+AlfA@public.gmane.org>, + * Heiko Carstens <heiko.carstens-tA70FqPdS9bQT0dZR+AlfA@public.gmane.org> + */ + +#ifndef __S390HOST_H +#define __S390HOST_H +#include <asm/sie64.h> +#define S390HOST_ORIGIN 0 + +int s390host_handle_intercept(struct sie_kernel *sie_kernel); +#endif // defined __S390HOST_H
* [PATCH/RFC 3/9] s390 guest detection
@ 2007-05-11 17:35 ` Carsten Otte
From: Carsten Otte @ 2007-05-11 17:35 UTC (permalink / raw)
To: kvm-devel-5NWGOfrQmneRv+LV9MX5uipxlwaOVQ5f@public.gmane.org
Cc: Christian Borntraeger, Martin Schwidefsky

From: Christian Borntraeger <cborntra-tA70FqPdS9bQT0dZR+AlfA@public.gmane.org>

This patch adds functionality to detect whether the kernel runs under an s390host hypervisor. A macro MACHINE_IS_GUEST is exported for device drivers. This allows drivers to skip device detection if the system runs non-virtualized.

Signed-off-by: Christian Borntraeger <cborntra-tA70FqPdS9bQT0dZR+AlfA@public.gmane.org>
Signed-off-by: Carsten Otte <cotte-tA70FqPdS9bQT0dZR+AlfA@public.gmane.org>
---
 arch/s390/kernel/early.c |    4 ++++
 arch/s390/kernel/setup.c |    9 ++++++---
 include/asm-s390/setup.h |    1 +
 3 files changed, 11 insertions(+), 3 deletions(-)

Index: linux-2.6.21/arch/s390/kernel/setup.c =================================================================== --- linux-2.6.21.orig/arch/s390/kernel/setup.c +++ linux-2.6.21/arch/s390/kernel/setup.c @@ -744,9 +744,12 @@ setup_arch(char **cmdline_p) "This machine has an IEEE fpu\n" : "This machine has no IEEE fpu\n"); #else /* CONFIG_64BIT */ - printk((MACHINE_IS_VM) ?
- "We are running under VM (64 bit mode)\n" : - "We are running native (64 bit mode)\n"); + if (MACHINE_IS_VM) + printk("We are running under VM (64 bit mode)\n"); + else if (MACHINE_IS_GUEST) + printk("We are running on a non z/VM host\n"); + else + printk("We are running native (64 bit mode)\n"); #endif /* CONFIG_64BIT */ /* Save unparsed command line copy for /proc/cmdline */ Index: linux-2.6.21/include/asm-s390/setup.h =================================================================== --- linux-2.6.21.orig/include/asm-s390/setup.h +++ linux-2.6.21/include/asm-s390/setup.h @@ -61,6 +61,7 @@ extern unsigned long machine_flags; #define MACHINE_IS_VM (machine_flags & 1) #define MACHINE_IS_P390 (machine_flags & 4) #define MACHINE_HAS_MVPG (machine_flags & 16) +#define MACHINE_IS_GUEST (machine_flags & 64) #define MACHINE_HAS_IDTE (machine_flags & 128) #define MACHINE_HAS_DIAG9C (machine_flags & 256) Index: linux-2.6.21/arch/s390/kernel/early.c =================================================================== --- linux-2.6.21.orig/arch/s390/kernel/early.c +++ linux-2.6.21/arch/s390/kernel/early.c @@ -139,6 +139,10 @@ static noinline __init void detect_machi /* Running on a P/390 ? */ if (cpuinfo->cpu_id.machine == 0x7490) machine_flags |= 4; + + /* Running under a host ? */ + if (cpuinfo->cpu_id.version == 0xfe) + machine_flags |= 64; } #ifdef CONFIG_64BIT
* [PATCH/RFC 4/9] Basic guest virtual devices infrastructure
@ 2007-05-11 17:35 ` Carsten Otte
From: Carsten Otte @ 2007-05-11 17:35 UTC (permalink / raw)
To: kvm-devel-5NWGOfrQmneRv+LV9MX5uipxlwaOVQ5f@public.gmane.org
Cc: Christian Borntraeger, Martin Schwidefsky

From: Carsten Otte <cotte-tA70FqPdS9bQT0dZR+AlfA@public.gmane.org>

This patch adds support for a new bus type that manages paravirtualized devices. The bus uses the s390 diagnose instruction to query devices and match them with the corresponding drivers. Future enhancements should include hotplug and hot removal of virtual devices triggered by the host, and suspend/resume of virtual devices for migration. This code is s390 architecture specific, please review.
Signed-off-by: Carsten Otte <cotte-tA70FqPdS9bQT0dZR+AlfA@public.gmane.org> --- arch/s390/Kconfig | 6 + drivers/s390/Makefile | 2 drivers/s390/guest/Makefile | 6 + drivers/s390/guest/vdev.c | 158 +++++++++++++++++++++++++++++++++++++++ drivers/s390/guest/vdev_device.c | 50 ++++++++++++ include/asm-s390/vdev.h | 53 +++++++++++++ 6 files changed, 274 insertions(+), 1 deletion(-) Index: linux-2.6.21/arch/s390/Kconfig =================================================================== --- linux-2.6.21.orig/arch/s390/Kconfig +++ linux-2.6.21/arch/s390/Kconfig @@ -521,6 +521,12 @@ config S390_HOST select S390_SWITCH_AMODE help Select this option if you want to host guest Linux images + +config S390_GUEST + bool "s390 guest support (EXPERIMENTAL)" + depends on 64BIT && EXPERIMENTAL + help + Select this option if you want to run the kernel under s390 linux endmenu source "net/Kconfig" Index: linux-2.6.21/drivers/s390/Makefile =================================================================== --- linux-2.6.21.orig/drivers/s390/Makefile +++ linux-2.6.21/drivers/s390/Makefile @@ -5,7 +5,7 @@ CFLAGS_sysinfo.o += -Iinclude/math-emu -Iarch/s390/math-emu -w obj-y += s390mach.o sysinfo.o s390_rdev.o -obj-y += cio/ block/ char/ crypto/ net/ scsi/ +obj-y += cio/ block/ char/ crypto/ net/ scsi/ guest/ drivers-y += drivers/s390/built-in.o Index: linux-2.6.21/drivers/s390/guest/Makefile =================================================================== --- /dev/null +++ linux-2.6.21/drivers/s390/guest/Makefile @@ -0,0 +1,6 @@ +# +# s390 Linux virtual environment +# + +obj-$(CONFIG_S390_GUEST) += vdev.o vdev_device.o + Index: linux-2.6.21/drivers/s390/guest/vdev.c =================================================================== --- /dev/null +++ linux-2.6.21/drivers/s390/guest/vdev.c @@ -0,0 +1,158 @@ +/* + * vdev - guest os layer for device virtualization + * + * Copyright IBM Corp. 
2007 + * Author: Carsten Otte <cotte-tA70FqPdS9bQT0dZR+AlfA@public.gmane.org> + * + */ + +#include <asm/vdev.h> + +static void vdev_bus_release(struct device *); + +struct bus_type vdev_bus_type = { + .name = "vdev", + .match = vdev_match, + .probe = vdev_probe, +}; + +struct device vdev_bus = { + .bus_id = "vdev0", + .release = vdev_bus_release +}; + +int vdev_match(struct device * dev, struct device_driver *drv) +{ + struct vdev *vdev = to_vdev(dev); + struct vdev_driver *vdrv = to_vdrv(drv); + + if (vdev->vdev_type == vdrv->vdev_type) + return 1; + + return 0; +} + +int vdev_probe(struct device * dev) +{ + struct vdev *vdev = to_vdev(dev); + struct vdev_driver *vdrv = to_vdrv(dev->driver); + + return vdrv->probe(vdev); +} + +static void vdev_bus_release (struct device *device) +{ + /* noop, static bus object */ +} + +static inline int vdev_diag_hotplug(char symname[128], char hostid[128]) +{ + register char * __arg1 asm("2") = symname; + register char * __arg2 asm("3") = hostid; + register int __svcres asm("2"); + int __res; + + __asm__ __volatile__ ( + "diag 0,0,0x1e" + : "=d" (__svcres) + : "0" (__arg1), + "d" (__arg2) + : "cc", "memory"); + __res = __svcres; + return __res; +} + + +static int vdev_scan_coldplug(void) +{ + int rc; + struct vdev *device; + + do { + device = kzalloc(sizeof(struct vdev), GFP_ATOMIC); + if (!device) { + rc = -ENOMEM; + goto out; + } + rc = vdev_diag_hotplug(device->symname, device->hostid); + if (rc == -ENODEV) + break; + if (rc < 0) { + printk (KERN_WARNING "vdev: error %d detecting" \ + " initial devices\n", rc); + break; + } + device->vdev_type = rc; + + //sanity: are strings terminated? 
+ if ((strnlen(device->symname, 128) == 128) || + (strnlen(device->hostid, 128) == 128)) { + // warn and discard device + printk ("vdev: illegal device entry received\n"); + break; + } + + rc = vdevice_register(device); + if (rc) { + kfree(device); + } else + switch (device->vdev_type) { + case VDEV_TYPE_DISK: + printk (KERN_INFO "vdev: storage device " \ + "detected: %s\n", device->symname); + break; + case VDEV_TYPE_NET: + printk (KERN_INFO "vdev: network device " \ + "detected: %s\n", device->symname); + break; + default: + printk (KERN_INFO "vdev: unknown device " \ + "detected: %s\n", device->symname); + } + } while(1); + kfree (device); + out: + return 0; +} + + +int __init vdev_init(void) +{ + int rc; + + rc = bus_register(&vdev_bus_type); + if (rc) { + printk (KERN_WARNING "vdev: failed to register bus type\n"); + goto out; + } + rc = device_register(&vdev_bus); + if (rc) { + printk (KERN_WARNING "vdev: failed to register bus device\n"); + goto bunregister; + } + printk (KERN_INFO "vdev: initialization complete\n"); + rc = vdev_scan_coldplug(); + if (rc) { + printk (KERN_WARNING "vdev: failed to scan devices\n"); + goto dunregister; + } + goto out; + dunregister: + device_unregister(&vdev_bus); + + bunregister: + bus_unregister(&vdev_bus_type); + out: + return rc; +} + +void vdev_exit(void) +{ + bus_unregister(&vdev_bus_type); +} + +module_init(vdev_init); +module_exit(vdev_exit); +MODULE_DESCRIPTION("Guest layer for device virtualization"); +MODULE_AUTHOR("Copyright IBM Corp. 2007"); +MODULE_LICENSE("GPL"); Index: linux-2.6.21/drivers/s390/guest/vdev_device.c =================================================================== --- /dev/null +++ linux-2.6.21/drivers/s390/guest/vdev_device.c @@ -0,0 +1,50 @@ +/* + * vdev - guest layer for device virtualization + * + * Copyright IBM Corp. 
2007 + * Author: Carsten Otte <cotte-tA70FqPdS9bQT0dZR+AlfA@public.gmane.org> + * + */ + +#include <asm/vdev.h> + +int vdev_driver_register (struct vdev_driver *vdrv) +{ + struct device_driver *drv = &vdrv->driver; + + drv->bus = &vdev_bus_type; + drv->name = vdrv->name; + + return driver_register(drv); +} + +int vdevice_register(struct vdev *vdev) +{ + struct device *dev = &vdev->dev; + int ret,typesize; + + dev->bus = &vdev_bus_type; + dev->parent = &vdev_bus; + memset(dev->bus_id, 0, BUS_ID_SIZE); + switch (vdev->vdev_type) { + case VDEV_TYPE_DISK: + strncpy (dev->bus_id, "block:", 6); + typesize=6; + break; + case VDEV_TYPE_NET: + strncpy (dev->bus_id, "net:", 4); + typesize=4; + break; + default: + strncpy (dev->bus_id, "unknown:", 8); + typesize=8; + break; + } + strncpy (dev->bus_id+typesize, vdev->symname, BUS_ID_SIZE-typesize-1); + + ret = device_register(dev); + + //FIXME: add device attribs here + + return ret; +} Index: linux-2.6.21/include/asm-s390/vdev.h =================================================================== --- /dev/null +++ linux-2.6.21/include/asm-s390/vdev.h @@ -0,0 +1,53 @@ +/* + * vdev - guest layer for device virtualization + * + * Copyright IBM Corp. 
2007 + * Author: Carsten Otte <cotte-tA70FqPdS9bQT0dZR+AlfA@public.gmane.org> + * + */ + +#ifndef __VDEV_H +#define __VDEV_H +#include <linux/device.h> + +/* in vdev.c */ +extern int vdev_match(struct device *, struct device_driver *); +extern int vdev_probe (struct device *); + +extern struct device vdev_bus; +extern struct bus_type vdev_bus_type; + +#define VDEV_TYPE_DISK 0 +#define VDEV_TYPE_NET 1 + +struct vdev { + unsigned int vdev_type; + char symname[128]; + char hostid[128]; + struct vdev_driver *driver; + struct device dev; + void *drv_private; +}; + +struct vdev_driver { + struct module *owner; + int vdev_type; + int (*probe) (struct vdev *); + int (*set_online) (struct vdev *); + int (*set_offline) (struct vdev *); + int (*suspend) (struct vdev *); + int (*resume) (struct vdev *); + struct device_driver driver; /* higher level structure, don't init + this from your driver */ + char *name; + void *drv_private; +}; + +#define to_vdev(n) container_of(n, struct vdev, dev) +#define to_vdrv(n) container_of(n, struct vdev_driver, driver) + + +/* in vdevice.c */ +extern int vdevice_register(struct vdev *); +extern int vdev_driver_register(struct vdev_driver *); +#endif ------------------------------------------------------------------------- This SF.net email is sponsored by DB2 Express Download DB2 Express C - the FREE version of DB2 express and take control of your XML. No limits. Just data. Click to get it now. http://sourceforge.net/powerbar/db2/ ^ permalink raw reply [flat|nested] 104+ messages in thread

* Re: [PATCH/RFC 4/9] Basic guest virtual devices infrastructure [not found] ` <1178904958.25135.31.camel-WIxn4w2hgUz3YA32ykw5MLlKpX0K8NHHQQ4Iyu8u01E@public.gmane.org> @ 2007-05-11 20:06 ` Arnd Bergmann 2007-05-14 11:26 ` Avi Kivity 1 sibling, 0 replies; 104+ messages in thread From: Arnd Bergmann @ 2007-05-11 20:06 UTC (permalink / raw) To: kvm-devel-5NWGOfrQmneRv+LV9MX5uipxlwaOVQ5f Cc: Christian Borntraeger, Martin Schwidefsky On Friday 11 May 2007, Carsten Otte wrote: > This patch adds support for a new bus type that manages paravirtualized > devices. The bus uses the s390 diagnose instruction to query devices, and > match them with the corresponding drivers. It seems that the diagnose instruction is really the only s390 specific thing in here, right? I guess this part of your series is the first one that we should have in an architecture independent way. There may also be the chance of merging this with existing virtual buses like the one for the ps3, which also just exists using hypercalls. > +int vdev_match(struct device * dev, struct device_driver *drv) > +{ > + struct vdev *vdev = to_vdev(dev); > + struct vdev_driver *vdrv = to_vdrv(drv); > + > + if (vdev->vdev_type == vdrv->vdev_type) > + return 1; > + > + return 0; > +} Why invent device type numbers? On open firmware, we just do a string compare, which more intuitive, and means you don't need any further > +int vdev_probe(struct device * dev) > +{ > + struct vdev *vdev = to_vdev(dev); > + struct vdev_driver *vdrv = to_vdrv(dev->driver); > + > + return vdrv->probe(vdev); > +} This abstraction is unnecessary, just do the do_vdev() conversion inside of the individual drivers. > + > +struct device vdev_bus = { > + .bus_id = "vdev0", > + .release = vdev_bus_release > +}; > > +static void vdev_bus_release (struct device *device) > +{ > + /* noop, static bus object */ > +} Just make the root of your devices a platform_device, then you don't need to do dirty tricks like this. 
> +static int vdev_scan_coldplug(void) > +{ > + int rc; > + struct vdev *device; > + > + do { > + device = kzalloc(sizeof(struct vdev), GFP_ATOMIC); > + if (!device) { > + rc = -ENOMEM; > + goto out; > + } > + rc = vdev_diag_hotplug(device->symname, device->hostid); > + if (rc == -ENODEV) > + break; > + if (rc < 0) { > + printk (KERN_WARNING "vdev: error %d detecting" \ > + " initial devices\n", rc); > + break; > + } > + device->vdev_type = rc; > + > + //sanity: are strings terminated? > + if ((strnlen(device->symname, 128) == 128) || > + (strnlen(device->hostid, 128) == 128)) { > + // warn and discard device > + printk ("vdev: illegal device entry received\n"); > + break; > + } > + > + rc = vdevice_register(device); > + if (rc) { > + kfree(device); > + } else > + switch (device->vdev_type) { > + case VDEV_TYPE_DISK: > + printk (KERN_INFO "vdev: storage device " \ > + "detected: %s\n", device->symname); > + break; > + case VDEV_TYPE_NET: > + printk (KERN_INFO "vdev: network device " \ > + "detected: %s\n", device->symname); > + break; > + default: > + printk (KERN_INFO "vdev: unknown device " \ > + "detected: %s\n", device->symname); > + } > + } while(1); > + kfree (device); > + out: > + return 0; > +} Interesting concept of probing the bus -- so you just ask if there are any new devices, right? > +#define VDEV_TYPE_DISK 0 > +#define VDEV_TYPE_NET 1 > + > +struct vdev { > + unsigned int vdev_type; > + char symname[128]; > + char hostid[128]; > + struct vdev_driver *driver; > + struct device dev; > + void *drv_private; > +}; You shouldn't need the driver and drv_private fields -- they are already present in struct device. Arnd <>< ------------------------------------------------------------------------- This SF.net email is sponsored by DB2 Express Download DB2 Express C - the FREE version of DB2 express and take control of your XML. No limits. Just data. Click to get it now. 
http://sourceforge.net/powerbar/db2/ ^ permalink raw reply [flat|nested] 104+ messages in thread
* Re: [PATCH/RFC 4/9] Basic guest virtual devices infrastructure [not found] ` <1178904958.25135.31.camel-WIxn4w2hgUz3YA32ykw5MLlKpX0K8NHHQQ4Iyu8u01E@public.gmane.org> 2007-05-11 20:06 ` Arnd Bergmann @ 2007-05-14 11:26 ` Avi Kivity [not found] ` <46484753.30602-atKUWr5tajBWk0Htik3J/w@public.gmane.org> 1 sibling, 1 reply; 104+ messages in thread From: Avi Kivity @ 2007-05-14 11:26 UTC (permalink / raw) To: Carsten Otte Cc: kvm-devel-5NWGOfrQmneRv+LV9MX5uipxlwaOVQ5f@public.gmane.org, Christian Borntraeger, Martin Schwidefsky Carsten Otte wrote: > From: Carsten Otte <cotte-tA70FqPdS9bQT0dZR+AlfA@public.gmane.org> > > This patch adds support for a new bus type that manages paravirtualized > devices. The bus uses the s390 diagnose instruction to query devices, and > match them with the corresponding drivers. > Future enhancements should include hotplug and hotremoval of virtual devices > triggered by the host, and supend/resume of virtual devices for migration. > > Interesting. We could use a variation this for x86 as well, but I'm not sure how easy it is to integrate it into closed source OSes (Windows). The diag instruction could be replaced by a hypercall which would make the code generic. -- error compiling committee.c: too many arguments to function ------------------------------------------------------------------------- This SF.net email is sponsored by DB2 Express Download DB2 Express C - the FREE version of DB2 express and take control of your XML. No limits. Just data. Click to get it now. http://sourceforge.net/powerbar/db2/ ^ permalink raw reply [flat|nested] 104+ messages in thread
* Re: [PATCH/RFC 4/9] Basic guest virtual devices infrastructure [not found] ` <46484753.30602-atKUWr5tajBWk0Htik3J/w@public.gmane.org> @ 2007-05-14 11:43 ` Carsten Otte [not found] ` <46484B5D.6080605-tA70FqPdS9bQT0dZR+AlfA@public.gmane.org> 0 siblings, 1 reply; 104+ messages in thread From: Carsten Otte @ 2007-05-14 11:43 UTC (permalink / raw) To: Avi Kivity Cc: kvm-devel-5NWGOfrQmneRv+LV9MX5uipxlwaOVQ5f@public.gmane.org, Christian Borntraeger, mschwid2-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8 Avi Kivity wrote: > Interesting. We could use a variation this for x86 as well, but I'm not > sure how easy it is to integrate it into closed source OSes (Windows). > The diag instruction could be replaced by a hypercall which would make > the code generic. I think we need to freeze the hypercall API at some time, and consider it a stable kernel external API. We do then need to document these calls, and non-GPL hypervisors can implement it. We could eventually have a similar situation with one of the other non-GPL hypervisors on s390 that run Linux. so long, Carsten ------------------------------------------------------------------------- This SF.net email is sponsored by DB2 Express Download DB2 Express C - the FREE version of DB2 express and take control of your XML. No limits. Just data. Click to get it now. http://sourceforge.net/powerbar/db2/ ^ permalink raw reply [flat|nested] 104+ messages in thread
* Re: [PATCH/RFC 4/9] Basic guest virtualdevices infrastructure [not found] ` <46484B5D.6080605-tA70FqPdS9bQT0dZR+AlfA@public.gmane.org> @ 2007-05-14 12:00 ` Dor Laor [not found] ` <64F9B87B6B770947A9F8391472E032160BC7483D-yEcIvxbTEBqsx+V+t5oei8rau4O3wl8o3fe8/T/H7NteoWH0uzbU5w@public.gmane.org> 0 siblings, 1 reply; 104+ messages in thread From: Dor Laor @ 2007-05-14 12:00 UTC (permalink / raw) To: carsteno-tA70FqPdS9bQT0dZR+AlfA, Avi Kivity Cc: kvm-devel-5NWGOfrQmneRv+LV9MX5uipxlwaOVQ5f, Christian Borntraeger, mschwid2-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8 >Avi Kivity wrote: >> Interesting. We could use a variation this for x86 as well, but I'm not >> sure how easy it is to integrate it into closed source OSes (Windows). >> The diag instruction could be replaced by a hypercall which would make >> the code generic. >I think we need to freeze the hypercall API at some time, and consider >it a stable kernel external API. We do then need to document these >calls, and non-GPL hypervisors can implement it. We could eventually >have a similar situation with one of the other non-GPL hypervisors on >s390 that run Linux. I think Avi meant using a virtual bus as an option for HVMs too (windows especially). Currently we're using the cpi bus. Using a new virtualized bus might be a good idea, it's easy & clean for open source. The question is it make life easier for HVMs. For instance, on windows we'll need Pnp support for these devices. ------------------------------------------------------------------------- This SF.net email is sponsored by DB2 Express Download DB2 Express C - the FREE version of DB2 express and take control of your XML. No limits. Just data. Click to get it now. http://sourceforge.net/powerbar/db2/ ^ permalink raw reply [flat|nested] 104+ messages in thread
* Re: [PATCH/RFC 4/9] Basic guest virtualdevices infrastructure [not found] ` <64F9B87B6B770947A9F8391472E032160BC7483D-yEcIvxbTEBqsx+V+t5oei8rau4O3wl8o3fe8/T/H7NteoWH0uzbU5w@public.gmane.org> @ 2007-05-14 13:32 ` Carsten Otte 0 siblings, 0 replies; 104+ messages in thread From: Carsten Otte @ 2007-05-14 13:32 UTC (permalink / raw) To: Dor Laor Cc: carsteno-tA70FqPdS9bQT0dZR+AlfA, kvm-devel-5NWGOfrQmneRv+LV9MX5uipxlwaOVQ5f, Avi Kivity, Christian Borntraeger, mschwid2-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8 Dor Laor wrote: > I think Avi meant using a virtual bus as an option for HVMs too (windows > especially). Currently we're using the cpi bus. Using a new virtualized > bus might be a good idea, it's easy & clean for open source. The > question is it make life easier for HVMs. For instance, on windows we'll > need Pnp support for these devices. Oh that way around. Thanks for clarification. As far as I see, a stable hypercall API would also be good for maintaining non-GPL HVMs. Probably we should forge the API with respect to other HVMs needs then. so long, Carsten ------------------------------------------------------------------------- This SF.net email is sponsored by DB2 Express Download DB2 Express C - the FREE version of DB2 express and take control of your XML. No limits. Just data. Click to get it now. http://sourceforge.net/powerbar/db2/ ^ permalink raw reply [flat|nested] 104+ messages in thread
* [PATCH/RFC 5/9] s390 virtual console for guests [not found] ` <1178903957.25135.13.camel-WIxn4w2hgUz3YA32ykw5MLlKpX0K8NHHQQ4Iyu8u01E@public.gmane.org> ` (2 preceding siblings ...) 2007-05-11 17:35 ` [PATCH/RFC 4/9] Basic guest virtual devices infrastructure Carsten Otte @ 2007-05-11 17:36 ` Carsten Otte [not found] ` <1178904960.25135.32.camel-WIxn4w2hgUz3YA32ykw5MLlKpX0K8NHHQQ4Iyu8u01E@public.gmane.org> 2007-05-11 17:36 ` [PATCH/RFC 6/9] virtual block device driver Carsten Otte ` (3 subsequent siblings) 7 siblings, 1 reply; 104+ messages in thread From: Carsten Otte @ 2007-05-11 17:36 UTC (permalink / raw) To: kvm-devel-5NWGOfrQmneRv+LV9MX5uipxlwaOVQ5f@public.gmane.org Cc: Christian Borntraeger, Martin Schwidefsky From: Carsten Otte <cotte-tA70FqPdS9bQT0dZR+AlfA@public.gmane.org> This driver provides a simple virtualized console. Userspace can use read/write to its console to pass the data to the host. Signed-off-by: Carsten Otte <cotte-tA70FqPdS9bQT0dZR+AlfA@public.gmane.org> --- drivers/s390/Kconfig | 5 + drivers/s390/guest/Makefile | 1 drivers/s390/guest/guest_console.c | 72 +++++++++++++++++ drivers/s390/guest/guest_console.h | 47 +++++++++++ drivers/s390/guest/guest_tty.c | 153 +++++++++++++++++++++++++++++++++++++ 5 files changed, 278 insertions(+) Index: linux-2.6.21/drivers/s390/guest/guest_console.c =================================================================== --- /dev/null +++ linux-2.6.21/drivers/s390/guest/guest_console.c @@ -0,0 +1,72 @@ +/* + * guest console device driver + * Copyright IBM Corp. 
2007 + * Author: Carsten Otte <cotte-tA70FqPdS9bQT0dZR+AlfA@public.gmane.org> + */ + +#include "linux/kernel.h" +#include "linux/types.h" +#include "linux/console.h" +#include "linux/string.h" +#include "linux/init.h" +#include "linux/errno.h" +#include "guest_console.h" + +#define guest_console_major 4 /* TTYAUX_MAJOR */ +#define guest_console_minor 65 +#define guest_console_name "ttyS" + +static void guest_console_write(struct console *console, const char *string, + unsigned len) +{ + int ret; + size_t pos; + + for(pos=0; pos < strlen(string); pos += ret) { + ret = diag_write(1, string + pos, len - pos); + if (ret <= 0) + break; + } +} + +static struct tty_driver * +guest_console_device(struct console *c, int *index) +{ + *index = c->index; + return guest_tty_driver; +} + +static void +guest_console_unblank(void) +{ + return; +} + +static struct console guest_console = +{ + .name = guest_console_name, + .write = guest_console_write, + .device = guest_console_device, + .unblank = guest_console_unblank, + .flags = CON_PRINTBUFFER, + .index = 0 /* ttyS0 */ +}; + +/* + * called by console_init() in drivers/char/tty_io.c at boot-time. + */ +static int __init +guest_console_init(void) +{ + if (!MACHINE_IS_GUEST) + return 0; + + printk (KERN_INFO "z/Live console initialized\n"); + + /* enable printk-access to this driver */ + register_console(&guest_console); + return 0; +} + +console_initcall(guest_console_init); + Index: linux-2.6.21/drivers/s390/guest/guest_console.h =================================================================== --- /dev/null +++ linux-2.6.21/drivers/s390/guest/guest_console.h @@ -0,0 +1,47 @@ +/* + * guest console device driver + * Copyright IBM Corp. 
2007 + * Author: Carsten Otte <cotte-tA70FqPdS9bQT0dZR+AlfA@public.gmane.org> + */ + + +#ifndef __GCONSOLE_H +#define __GCONSOLE_H +extern struct tty_driver *guest_tty_driver; +static inline int diag_write(int fd, const void *buffer, size_t count) +{ + register long __arg1 asm("2") = fd; + register const void * __arg2 asm("3") = buffer; + register size_t __arg3 asm("4") = count; + register long __svcres asm("2"); + long __res; + asm volatile ( + "diag 0,0,2" + : "=d" (__svcres) + : "0" (__arg1), + "d" (__arg2), + "d" (__arg3) + : "cc", "memory"); + __res = __svcres; + return __res; +} + +static inline int diag_read(int fd, const void *buffer, size_t count) +{ + register long __arg1 asm("2") = fd; + register const void * __arg2 asm("3") = buffer; + register size_t __arg3 asm("4") = count; + register long __svcres asm("2"); + long __res; + asm volatile ( + "diag 0,0,1" + : "=d" (__svcres) + : "0" (__arg1), + "d" (__arg2), + "d" (__arg3) + : "cc", "memory"); + __res = __svcres; + return __res; +} +#endif + Index: linux-2.6.21/drivers/s390/guest/guest_tty.c =================================================================== --- /dev/null +++ linux-2.6.21/drivers/s390/guest/guest_tty.c @@ -0,0 +1,153 @@ +/* + * guest console tty device driver + * Copyright IBM Corp. 
2007 + * Author: Carsten Otte <cotte-tA70FqPdS9bQT0dZR+AlfA@public.gmane.org> + */ + +#include <linux/fs.h> +#include <linux/tty.h> +#include <linux/tty_flip.h> +#include <linux/module.h> +#include <asm/s390_ext.h> +#include "guest_console.h" + +struct tty_driver *guest_tty_driver; +static struct tty_struct *guest_tty; + +MODULE_DESCRIPTION("Guest console for linux guests"); +MODULE_AUTHOR("Carsten Otte <cotte-tA70FqPdS9bQT0dZR+AlfA@public.gmane.org>"); +MODULE_LICENSE("GPL"); + +static int +guest_tty_open(struct tty_struct *tty, struct file *filp) +{ + guest_tty = tty; + tty->driver_data = NULL; + return 0; +} + +static void +guest_tty_close(struct tty_struct *tty, struct file *filp) +{ + if (tty->count > 1) + return; + guest_tty = NULL; +} + +static int +guest_tty_ioctl(struct tty_struct *tty, struct file * file, + unsigned int cmd, unsigned long arg) +{ + return -ENOIOCTLCMD; +} + +static int +guest_tty_write(struct tty_struct *tty, const unsigned char *str, int count) +{ + int ret; + size_t pos; + + for(pos=0; pos < count; pos += ret) { + ret = diag_write(1, str + pos, count - pos); + if (ret <= 0) + break; + } + return pos; +} + +static void +guest_tty_put_char(struct tty_struct *tty, unsigned char ch) +{ + guest_tty_write (tty, &ch, 1); +} + +static void +guest_tty_flush_chars(struct tty_struct *tty) +{ + int nop; + nop=0; // :) +} + +static int +guest_tty_chars_in_buffer(struct tty_struct *tty) +{ + return 0; +} + +static void +guest_tty_flush_buffer(struct tty_struct *tty) +{ + guest_tty_flush_chars(tty); // :) +} + +static int +guest_tty_write_room (struct tty_struct *tty) +{ + return 65536; +} + +static struct tty_operations guest_ops = { + .open = guest_tty_open, + .close = guest_tty_close, + .write = guest_tty_write, + .put_char = guest_tty_put_char, + .flush_chars = guest_tty_flush_chars, + .write_room = guest_tty_write_room, + .chars_in_buffer = guest_tty_chars_in_buffer, + .flush_buffer = guest_tty_flush_buffer, + .ioctl = guest_tty_ioctl, +}; + 
+static void +guest_tty_ext_handler(__u16 code) +{ + char buffer[256]; + int count; + + count = diag_read(0, buffer, 256); + if (count <= 0) + return; + + if (!guest_tty) + return; + tty_insert_flip_string(guest_tty, buffer, count); + tty_flip_buffer_push(guest_tty); +} + +int __init +guest_tty_init(void) +{ + struct tty_driver *driver; + int rc; + + if (!MACHINE_IS_GUEST) + return 0; + register_external_interrupt(0x1234, guest_tty_ext_handler); + driver = alloc_tty_driver(1); + if (!driver) + return -ENOMEM; + guest_tty = NULL; + driver->owner = THIS_MODULE; + driver->driver_name = "guest_line"; + driver->name = "guest_line"; + driver->major = TTY_MAJOR; + driver->minor_start = 65; + driver->type = TTY_DRIVER_TYPE_SYSTEM; + driver->subtype = SYSTEM_TYPE_TTY; + driver->init_termios = tty_std_termios; + driver->init_termios.c_iflag = IGNBRK | IGNPAR; + driver->init_termios.c_oflag = ONLCR | XTABS; + driver->init_termios.c_lflag = ISIG | ECHO; + driver->flags = TTY_DRIVER_REAL_RAW; + tty_set_operations(driver, &guest_ops); + rc = tty_register_driver(driver); + if (rc) { + printk(KERN_ERR "guest tty driver: could not register tty - " + "tty_register_driver returned %d\n", rc); + put_tty_driver(driver); + return rc; + } + guest_tty_driver = driver; + return 0; +} +module_init(guest_tty_init); Index: linux-2.6.21/drivers/s390/Kconfig =================================================================== --- linux-2.6.21.orig/drivers/s390/Kconfig +++ linux-2.6.21/drivers/s390/Kconfig @@ -211,6 +211,11 @@ config MONWRITER help Character device driver for writing z/VM monitor service records +config GUEST_CONSOLE + bool "Guest console support" + depends on S390_GUEST + help + Select this option if you want to run as an s390 guest endmenu menu "Cryptographic devices" Index: linux-2.6.21/drivers/s390/guest/Makefile =================================================================== --- linux-2.6.21.orig/drivers/s390/guest/Makefile +++ linux-2.6.21/drivers/s390/guest/Makefile @@ 
-2,5 +2,6 @@ # s390 Linux virtual environment # +obj-$(CONFIG_GUEST_CONSOLE) += guest_console.o guest_tty.o obj-$(CONFIG_S390_GUEST) += vdev.o vdev_device.o ------------------------------------------------------------------------- This SF.net email is sponsored by DB2 Express Download DB2 Express C - the FREE version of DB2 express and take control of your XML. No limits. Just data. Click to get it now. http://sourceforge.net/powerbar/db2/ ^ permalink raw reply [flat|nested] 104+ messages in thread
* Re: [PATCH/RFC 5/9] s390 virtual console for guests [not found] ` <1178904960.25135.32.camel-WIxn4w2hgUz3YA32ykw5MLlKpX0K8NHHQQ4Iyu8u01E@public.gmane.org> @ 2007-05-11 19:00 ` Anthony Liguori [not found] ` <4644BD3C.8040901-rdkfGonbjUSkNkDKm+mE6A@public.gmane.org> 0 siblings, 1 reply; 104+ messages in thread From: Anthony Liguori @ 2007-05-11 19:00 UTC (permalink / raw) To: Carsten Otte Cc: kvm-devel-5NWGOfrQmneRv+LV9MX5uipxlwaOVQ5f@public.gmane.org, Christian Borntraeger, Martin Schwidefsky I think it would be better to use hvc_console as Xen now uses it too. Carsten Otte wrote: > + if (!MACHINE_IS_GUEST) > + return 0; > + register_external_interrupt(0x1234, guest_tty_ext_handler); > This is an interesting way to get input data from the console :-) How many interrupts does s390 support (the x86 only supports 256)? Can you afford to burn interrupts like this? Is there not a better way to assign interrupts such that conflict isn't an issue? Regards, Anthony Liguori ------------------------------------------------------------------------- This SF.net email is sponsored by DB2 Express Download DB2 Express C - the FREE version of DB2 express and take control of your XML. No limits. Just data. Click to get it now. http://sourceforge.net/powerbar/db2/ ^ permalink raw reply [flat|nested] 104+ messages in thread
* Re: [PATCH/RFC 5/9] s390 virtual console for guests [not found] ` <4644BD3C.8040901-rdkfGonbjUSkNkDKm+mE6A@public.gmane.org> @ 2007-05-11 19:42 ` Christian Bornträger 2007-05-12 8:07 ` Carsten Otte 2007-05-14 16:23 ` Christian Bornträger 2 siblings, 0 replies; 104+ messages in thread From: Christian Bornträger @ 2007-05-11 19:42 UTC (permalink / raw) To: kvm-devel-5NWGOfrQmneRv+LV9MX5uipxlwaOVQ5f; +Cc: Martin Schwidefsky On Friday 11 May 2007 21:00, Anthony Liguori wrote: > I think it would be better to use hvc_console as Xen now uses it too. I dont know hvc_console, but I will have a look at it. > Carsten Otte wrote: > > + if (!MACHINE_IS_GUEST) > > + return 0; > > + register_external_interrupt(0x1234, guest_tty_ext_handler); > > > > This is an interesting way to get input data from the console :-) How > many interrupts does s390 support (the x86 only supports 256)? Can you > afford to burn interrupts like this? Is there not a better way to > assign interrupts such that conflict isn't an issue? On s390 we have a 16 bit interrupt code, so we actually have plenty of numbers... But, yes its a very good point, burning interrupts wont work cross-platform. Our patches are prototypes and need rework anyway. Take these patches as discussion contribution in the spirit of release early. :-) cheers Christian ------------------------------------------------------------------------- This SF.net email is sponsored by DB2 Express Download DB2 Express C - the FREE version of DB2 express and take control of your XML. No limits. Just data. Click to get it now. http://sourceforge.net/powerbar/db2/ ^ permalink raw reply [flat|nested] 104+ messages in thread
* Re: [PATCH/RFC 5/9] s390 virtual console for guests [not found] ` <4644BD3C.8040901-rdkfGonbjUSkNkDKm+mE6A@public.gmane.org> 2007-05-11 19:42 ` Christian Bornträger @ 2007-05-12 8:07 ` Carsten Otte 2007-05-14 16:23 ` Christian Bornträger 2 siblings, 0 replies; 104+ messages in thread From: Carsten Otte @ 2007-05-12 8:07 UTC (permalink / raw) To: Anthony Liguori Cc: kvm-devel-5NWGOfrQmneRv+LV9MX5uipxlwaOVQ5f@public.gmane.org, Christian Borntraeger, mschwid2-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8 Anthony Liguori wrote: > I think it would be better to use hvc_console as Xen now uses it too. This console driver is pretty basic indeed. > This is an interesting way to get input data from the console :-) How > many interrupts does s390 support (the x86 only supports 256)? Can you > afford to burn interrupts like this? Is there not a better way to > assign interrupts such that conflict isn't an issue? We have 2^16 external interrupts on 390, plus IO interrupts, multiplied by the fact that each interrupt can be used in various interrupt subclasses. We can burn irqs indeed, but as Christian mentioned this cannot go into the portable approach. so long, Carsten ------------------------------------------------------------------------- This SF.net email is sponsored by DB2 Express Download DB2 Express C - the FREE version of DB2 express and take control of your XML. No limits. Just data. Click to get it now. http://sourceforge.net/powerbar/db2/ ^ permalink raw reply [flat|nested] 104+ messages in thread
* Re: [PATCH/RFC 5/9] s390 virtual console for guests [not found] ` <4644BD3C.8040901-rdkfGonbjUSkNkDKm+mE6A@public.gmane.org> 2007-05-11 19:42 ` Christian Bornträger 2007-05-12 8:07 ` Carsten Otte @ 2007-05-14 16:23 ` Christian Bornträger [not found] ` <200705141823.13424.cborntra-tA70FqPdS9bQT0dZR+AlfA@public.gmane.org> 2 siblings, 1 reply; 104+ messages in thread From: Christian Bornträger @ 2007-05-14 16:23 UTC (permalink / raw) To: kvm-devel-5NWGOfrQmneRv+LV9MX5uipxlwaOVQ5f; +Cc: Martin Schwidefsky On Friday 11 May 2007 21:00, Anthony Liguori wrote: > I think it would be better to use hvc_console as Xen now uses it too. I just had a look at hvc_console, and indeed this driver looks appropriate for us. Looking at the xen-frontend driver (~130 lines of code) and the simple interface (get_char and put_char), it should be reasonably easy to convert our driver to an hvc_console user. Christian
[parent not found: <200705141823.13424.cborntra-tA70FqPdS9bQT0dZR+AlfA@public.gmane.org>]
* Re: [PATCH/RFC 5/9] s390 virtual console for guests [not found] ` <200705141823.13424.cborntra-tA70FqPdS9bQT0dZR+AlfA@public.gmane.org> @ 2007-05-14 16:48 ` Christian Borntraeger [not found] ` <200705141848.18996.borntrae-tA70FqPdS9bQT0dZR+AlfA@public.gmane.org> 0 siblings, 1 reply; 104+ messages in thread From: Christian Borntraeger @ 2007-05-14 16:48 UTC (permalink / raw) To: kvm-devel-5NWGOfrQmneRv+LV9MX5uipxlwaOVQ5f; +Cc: Martin Schwidefsky On Monday 14 May 2007 18:23, Christian Bornträger wrote: > On Friday 11 May 2007 21:00, Anthony Liguori wrote: > > I think it would be better to use hvc_console as Xen now uses it too. > I just had a look at hvc_console, and indeed this driver looks appropriate As I started prototyping this frontend, I realized that hvc_console requires some interfaces that are not present on s390; e.g., we have no request_irq and free_irq. I don't know if hvc_console is still the right way to go for us. This needs more thinking. Christian
[parent not found: <200705141848.18996.borntrae-tA70FqPdS9bQT0dZR+AlfA@public.gmane.org>]
* Re: [PATCH/RFC 5/9] s390 virtual console for guests [not found] ` <200705141848.18996.borntrae-tA70FqPdS9bQT0dZR+AlfA@public.gmane.org> @ 2007-05-14 17:49 ` Anthony Liguori [not found] ` <4648A11D.3050607-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org> 0 siblings, 1 reply; 104+ messages in thread From: Anthony Liguori @ 2007-05-14 17:49 UTC (permalink / raw) To: Christian Borntraeger Cc: kvm-devel-5NWGOfrQmneRv+LV9MX5uipxlwaOVQ5f, Martin Schwidefsky Christian Borntraeger wrote: > On Monday 14 May 2007 18:23, Christian Bornträger wrote: > >> On Friday 11 May 2007 21:00, Anthony Liguori wrote: >> >>> I think it would be better to use hvc_console as Xen now uses it too. >>> >> I just had a look at hvc_console, and indeed this driver looks appropriate >> > > As I started prototyping this frontend I realized that hvc_console requires > some interfaces, which are not present on s390, e.g. we have no request_irq > and free_irq. Dont know if hvc_console is still the right way to go for us. > It seems like request_irq is roughly the same as register_external_interrupt. I suspect that you could get away with either patching hvc_console to use register_external_interrupt if CONFIG_S390 or perhaps providing a common interface. I suspect that this is going to come up again for sharing other paravirt drivers. Regards, Anthony Liguori > This needs more thinking. > > Christian > > ------------------------------------------------------------------------- This SF.net email is sponsored by DB2 Express Download DB2 Express C - the FREE version of DB2 express and take control of your XML. No limits. Just data. Click to get it now. http://sourceforge.net/powerbar/db2/ ^ permalink raw reply [flat|nested] 104+ messages in thread
[parent not found: <4648A11D.3050607-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>]
* Re: [PATCH/RFC 5/9] s390 virtual console for guests [not found] ` <4648A11D.3050607-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org> @ 2007-05-15 0:27 ` Arnd Bergmann 2007-05-15 7:54 ` Carsten Otte 1 sibling, 0 replies; 104+ messages in thread From: Arnd Bergmann @ 2007-05-15 0:27 UTC (permalink / raw) To: kvm-devel-5NWGOfrQmneRv+LV9MX5uipxlwaOVQ5f; +Cc: Martin Schwidefsky On Monday 14 May 2007, Anthony Liguori wrote: > It seems like request_irq is roughly the same as > register_external_interrupt. I suspect that you could get away with > either patching hvc_console to use register_external_interrupt if > CONFIG_S390 or perhaps providing a common interface. > > I suspect that this is going to come up again for sharing other paravirt > drivers. request_irq() is not a nice interface for s390, but it will probably make sense to convert the two existing users of register_external_interrupt to use that instead, in order to get something that can be shared across architectures for virtual drivers. It basically means extending struct ext_int_info_t to include a name and a void* member that gets passed back to the interrupt handler, and to check for invalid flags passed to request_irq. You might want to show these in /proc/interrupts then as well, as per-interrupt values. Arnd <><
* Re: [PATCH/RFC 5/9] s390 virtual console for guests [not found] ` <4648A11D.3050607-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org> 2007-05-15 0:27 ` Arnd Bergmann @ 2007-05-15 7:54 ` Carsten Otte 1 sibling, 0 replies; 104+ messages in thread From: Carsten Otte @ 2007-05-15 7:54 UTC (permalink / raw) To: Anthony Liguori Cc: kvm-devel-5NWGOfrQmneRv+LV9MX5uipxlwaOVQ5f, mschwid2-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8, borntrae-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8 Anthony Liguori wrote: > It seems like request_irq is roughly the same as > register_external_interrupt. I suspect that you could get away with > either patching hvc_console to use register_external_interrupt if > CONFIG_S390 or perhaps providing a common interface. > > I suspect that this is going to come up again for sharing other paravirt > drivers. Maybe we should have wrappers for request_irq/free_irq in arch/ rather than #ifdefs in each paravirtual driver. We need to talk this over with Martin (our arch maintainer). so long, Carsten
* [PATCH/RFC 6/9] virtual block device driver [not found] ` <1178903957.25135.13.camel-WIxn4w2hgUz3YA32ykw5MLlKpX0K8NHHQQ4Iyu8u01E@public.gmane.org> ` (3 preceding siblings ...) 2007-05-11 17:36 ` [PATCH/RFC 5/9] s390 virtual console for guests Carsten Otte @ 2007-05-11 17:36 ` Carsten Otte [not found] ` <1178904963.25135.33.camel-WIxn4w2hgUz3YA32ykw5MLlKpX0K8NHHQQ4Iyu8u01E@public.gmane.org> 2007-05-11 17:36 ` [PATCH/RFC 7/9] Virtual network guest " Carsten Otte ` (2 subsequent siblings) 7 siblings, 1 reply; 104+ messages in thread From: Carsten Otte @ 2007-05-11 17:36 UTC (permalink / raw) To: kvm-devel-5NWGOfrQmneRv+LV9MX5uipxlwaOVQ5f@public.gmane.org Cc: Christian Borntraeger, Martin Schwidefsky From: Carsten Otte <cotte-tA70FqPdS9bQT0dZR+AlfA@public.gmane.org> This driver provides access to virtual block devices. It uses its own make_request function, which passes the bio to a workqueue thread. The workqueue thread uses the diagnose hypervisor call to call the hosting Linux. The hypervisor code in host userspace uses io_submit to initiate the IO. Once the IO is done, the host will use io_getevents and then generate an interrupt to the guest. The interrupt handler calls bio_endio. This device driver is currently architecture-dependent. We intend to move the host API to a hypercall instead of the diagnose instruction. Please review. Signed-off-by: Carsten Otte <cotte-tA70FqPdS9bQT0dZR+AlfA@public.gmane.org> --- drivers/s390/block/Kconfig | 7 drivers/s390/guest/Makefile | 1 drivers/s390/guest/vdisk.c | 153 +++++++++++++++++ drivers/s390/guest/vdisk.h | 230 ++++++++++++++++++++++++++ drivers/s390/guest/vdisk_blk.c | 355 +++++++++++++++++++++++++++++++++++++++++ 5 files changed, 746 insertions(+) Index: linux-2.6.21/drivers/s390/guest/vdisk.c =================================================================== --- /dev/null +++ linux-2.6.21/drivers/s390/guest/vdisk.c @@ -0,0 +1,153 @@ +/* + * guest virtual block device driver + * Copyright IBM Corp. 
2007 + * Author: Carsten Otte <cotte-tA70FqPdS9bQT0dZR+AlfA@public.gmane.org> + */ + +#include <linux/blkdev.h> +#include <linux/init.h> +#include <linux/module.h> +#include <linux/spinlock.h> +#include <linux/types.h> +#include <asm/ptrace.h> +#include <asm/s390_ext.h> +#include <asm/vdev.h> +#include "vdisk.h" + +MODULE_DESCRIPTION("Guest virtual block device driver"); +MODULE_AUTHOR("Carsten Otte <cotte-tA70FqPdS9bQT0dZR+AlfA@public.gmane.org>"); +MODULE_LICENSE("GPL"); + +static struct vdev_driver vdisk_driver; + +static int __find_fd(struct device *dev, void* fdptr) { + int fd = (long)fdptr; + + struct vdev *vdev = to_vdev(dev); + struct vdisk_device *vdisk = (struct vdisk_device *)vdev->drv_private; + + if (vdisk->vfd == fd) + return 1; + else + return 0; +} + +vdisk_irq_t vdisk_get_irqpage(int fd) +{ + struct device *dev; + struct vdev *vdev; + struct vdisk_device *vdisk; + + dev = driver_find_device(&vdisk_driver.driver, NULL, (void*)(long)fd, __find_fd); + if (!dev) + return NULL; + vdev = to_vdev(dev); + vdisk = (struct vdisk_device *)vdev->drv_private; + return vdisk->irq_page; +} + +struct vdisk_device * vdisk_get_device_by_fd(int fd) +{ + struct device *dev; + struct vdev *vdev; + struct vdisk_device *vdisk; + + dev = driver_find_device(&vdisk_driver.driver, NULL, (void*)(long)fd, __find_fd); + if (!dev) + return NULL; + vdev = to_vdev(dev); + vdisk = (struct vdisk_device *)vdev->drv_private; + return vdisk; +} + + +static int vdisk_probe(struct vdev *vdev) +{ + struct vdisk_device *vdisk; + int rc; + + vdisk = kzalloc(sizeof(struct vdisk_device), GFP_ATOMIC); + if (!vdisk) + return -ENOMEM; + + vdisk->vdev = vdev; + vdev->drv_private = vdisk; + vdisk->submit_page = (void*)get_zeroed_page(GFP_KERNEL); + + if (!vdisk->submit_page) { + rc = -ENOMEM; + goto out_free; + } + + vdisk->irq_page = (void*)get_zeroed_page(GFP_KERNEL); + + if (!vdisk->irq_page) { + rc = -ENOMEM; + goto out_free; + } + + rc = diag_vdisk_disk_info(vdisk->vdev->hostid, + 
&vdisk->blocksize, &vdisk->size, + &vdisk->read_only); + if (rc) + goto out_free; + spin_lock_init(&vdisk->lock); + init_rwsem(&vdisk->pump_sem); + init_waitqueue_head(&vdisk->wait); + + vdisk_init_blockdev(vdisk); + goto out; + +out_free: + if (vdisk->irq_page) + free_page((unsigned long)(vdisk->irq_page)); + if (vdisk->submit_page) + free_page((unsigned long)(vdisk->submit_page)); + kfree(vdisk); +out: + return rc; +} + +static struct vdev_driver vdisk_driver = { + .name = "vdisk", + .owner = THIS_MODULE, + .vdev_type = VDEV_TYPE_DISK, + .probe = vdisk_probe, +}; + +static int __init vdisk_init(void) +{ + int rc; + if (!MACHINE_IS_GUEST) + return -ENODEV; + + rc = register_blkdev(VDISK_MAJOR, "vdisk"); + if (rc) { + printk(KERN_WARNING "vdisk: cannot register block device\n"); + return rc; + } + rc = register_external_interrupt(0x1235, vdisk_ext_handler); + if (rc) + goto unregister_blk; + rc = vdev_driver_register(&vdisk_driver); + if (rc) + goto unregister_irq; + goto out; +unregister_irq: + unregister_external_interrupt(0x1235, vdisk_ext_handler); +unregister_blk: + unregister_blkdev(VDISK_MAJOR, "vdisk"); + printk (KERN_WARNING "vdisk: initialization failed\n"); +out: + return rc; +} + +static void __exit vdisk_exit(void) +{ + unregister_external_interrupt(0x1235, vdisk_ext_handler); + unregister_blkdev(VDISK_MAJOR, "vdisk"); + return; +} + +module_init(vdisk_init); +module_exit(vdisk_exit); Index: linux-2.6.21/drivers/s390/guest/vdisk.h =================================================================== --- /dev/null +++ linux-2.6.21/drivers/s390/guest/vdisk.h @@ -0,0 +1,230 @@ +/* + * guest virtual block device driver header file + * Copyright IBM Corp. 
+ * Author: Carsten Otte <cotte-tA70FqPdS9bQT0dZR+AlfA@public.gmane.org> + */ + +#include <linux/list.h> +#include <linux/genhd.h> +#include <linux/types.h> +#include <linux/aio_abi.h> +#include <linux/wait.h> +#include <asm/vdev.h> + +#define VDISK_MAJOR 95 +#define SECTOR_SHIFT 9 +#define VDISK_NR_REQ 256 +#define VDISK_NR_RES 170 + +#define VDISK_WRITE 1 +#define VDISK_READ 0 + +struct vdisk_request { + unsigned long buf; + unsigned long count; +}; + +typedef struct vdisk_request (*vdisk_req_t)[VDISK_NR_REQ]; + +struct vdisk_response { + unsigned long intparm; + unsigned long count; + unsigned long failed; +}; + +typedef struct vdisk_response (*vdisk_irq_t)[VDISK_NR_RES]; + +struct vdisk_device { + struct list_head head; + int blocksize; + long size; + int read_only; + struct gendisk *gd; + struct vdev *vdev; + spinlock_t lock; + struct rw_semaphore pump_sem; + int open_count; + int vfd; + struct vdisk_request (*submit_page)[VDISK_NR_REQ]; + struct workqueue_struct *wq; + vdisk_irq_t irq_page; + wait_queue_head_t wait; +}; + +struct vdisk_work { + struct work_struct work; + struct bio* bio; +}; + +struct vdisk_elem { + unsigned int fd; + unsigned int command; + unsigned long offset; + unsigned long buffer; + unsigned long nbytes; +}; + +struct vdisk_iocb_container { + struct iocb iocb; + struct bio *bio; + struct vdisk_device *dev; + int ctx_index; + unsigned long context; + struct list_head list; +}; + +// from aio_abi.h +typedef enum io_iocb_cmd { + IO_CMD_PREAD = 0, + IO_CMD_PWRITE = 1, + + IO_CMD_FSYNC = 2, + IO_CMD_FDSYNC = 3, + + IO_CMD_POLL = 5, + IO_CMD_NOOP = 6, +} io_iocb_cmd_t; + +static inline int +diag_vdisk_disk_info(char name[256], int* blocksize, + long* size, int* read_only) +{ + register char* __arg1 asm("2") = name; + register int* __arg2 asm("3") = blocksize; + register long* __arg3 asm("4") = size; + register int* __arg4 asm("5") = read_only; + register int __svcres asm("2"); + int __res; + + __asm__ __volatile__ ( + "diag 0,0,5" + : "=d" 
(__svcres) + : "0" (__arg1), + "d" (__arg2), + "d" (__arg3), + "d" (__arg4) + : "cc", "memory"); + __res = __svcres; + return __res; +} + +static inline int +diag_vdisk_open(const char* name, int read_only, void* irq_page) +{ + register const char* __arg1 asm("2") = name; + register long __arg2 asm("3") = read_only; + register unsigned long __arg3 asm("4") = (unsigned long) irq_page; + register int __svcres asm("2"); + int __res; + + __asm__ __volatile__ ( + "diag 0,0,7" + : "=d" (__svcres) + : "0" (__arg1), + "d" (__arg2), + "d" (__arg3) + : "cc", "memory"); + __res = __svcres; + return __res; +} + +static inline int +diag_vdisk_close(int fd) +{ + register long __arg1 asm("2") = fd; + register int __svcres asm("2"); + int __res; + + __asm__ __volatile__ ( + "diag 0,0,9" + : "=d" (__svcres) + : "0" (__arg1) + : "cc", "memory"); + __res = __svcres; + return __res; +} + +static inline int +diag_vdisk_aio_setup(unsigned int index, unsigned int num_events, + unsigned long *context, void *int_page) +{ + register unsigned long __arg1 asm("2") = index; + register unsigned long __arg2 asm("3") = num_events; + register unsigned long* __arg3 asm("4") = context; + register void* __arg4 asm("5") = int_page; + register int __svcres asm("2"); + int __res; + + __asm__ __volatile__ ( + "diag 0,0,0x0a" + : "=d" (__svcres) + : "0" (__arg1), + "d" (__arg2), + "d" (__arg3), + "d" (__arg4) + : "cc", "memory"); + __res = __svcres; + return __res; +} + +static inline void +diag_vdisk_aio_destroy(unsigned int index) +{ + register unsigned long __arg1 asm("2") = index; + __asm__ __volatile__ ( + "diag 0,0,0x12" + :: "d" (__arg1) + : "cc", "memory"); +} + +static inline int +diag_vdisk_submit_request(int fd, void *submit_page, int op, + loff_t start_offset, int nrreq, void* parm) +{ + register long __arg1 asm("2") = fd; + register unsigned long __arg2 asm("3") = (unsigned long)submit_page; + register long __arg3 asm("4") = op; + register unsigned long __arg4 asm("5") = start_offset; + 
register long __arg5 asm("6") = nrreq; + register unsigned long __arg6 asm("7") = (unsigned long)parm; + register int __svcres asm("2"); + int __res; + + __asm__ __volatile__ ( + "diag 0,0,0x0b" + : "=d" (__svcres) + : "0" (__arg1), + "d" (__arg2), + "d" (__arg3), + "d" (__arg4), + "d" (__arg5), + "d" (__arg6) + : "cc", "memory"); + __res = __svcres; + return __res; +} + +static inline int +diag_vdisk_getevents(int index) { + register long __arg1 asm("2") = index; + register int __svcres asm("2"); + int __res; + + __asm__ __volatile__ ( + "diag 0,0,0x0d" + : "=d" (__svcres) + : "0" (__arg1) + : "cc", "memory"); + __res = __svcres; + return __res; +} + +// in vdisk.c +extern struct device *vdisk_sysfs_root; +int vdisk_disk_info(struct vdisk_device *dev); +vdisk_irq_t vdisk_get_irqpage(int fd); +struct vdisk_device * vdisk_get_device_by_fd(int fd); + +// in vdisk_blk.c +void vdisk_init_blockdev(struct vdisk_device *dev); +void vdisk_ext_handler(__u16 code); Index: linux-2.6.21/drivers/s390/guest/vdisk_blk.c =================================================================== --- /dev/null +++ linux-2.6.21/drivers/s390/guest/vdisk_blk.c @@ -0,0 +1,355 @@ +/* + * guest virtual block device driver + * Copyright IBM Corp. 
2007 + * Author: Carsten Otte <cotte-tA70FqPdS9bQT0dZR+AlfA@public.gmane.org> + */ + +#include <linux/blkdev.h> +#include "vdisk.h" + +static int vdisk_open(struct inode *inode, struct file *filp); +static int vdisk_release(struct inode *inode, struct file *filp); +static int vdisk_make_request(request_queue_t *q, struct bio *bio); + +static struct block_device_operations vdisk_fops = { + .owner = THIS_MODULE, + .open = vdisk_open, + .release = vdisk_release, +}; + +void vdisk_init_blockdev(struct vdisk_device *dev) +{ + static int lastminor = 0; + + dev->gd = alloc_disk(1); + if (!dev->gd) { + printk (KERN_WARNING "vdisk: out of memory while " \ + "initializing %s\n", dev->vdev->symname); + return; + } + dev->open_count = 0; + dev->gd->first_minor = lastminor++; + dev->gd->queue = blk_alloc_queue(GFP_KERNEL); + if (!dev->gd->queue) { + printk (KERN_WARNING "vdisk: out of memory while " \ + "initializing %s\n", dev->vdev->symname); + goto free_gd; + } + blk_queue_max_segment_size(dev->gd->queue, 15 * dev->blocksize); + strlcpy(dev->gd->disk_name, dev->vdev->symname, 32); + dev->gd->disk_name[32] = '\0'; + dev->gd->major = VDISK_MAJOR; + dev->gd->fops = &vdisk_fops; + dev->gd->private_data = dev; + dev->gd->driverfs_dev = &dev->vdev->dev; + set_capacity(dev->gd, dev->size); + get_device(&dev->vdev->dev); + add_disk(dev->gd); + blk_queue_make_request(dev->gd->queue, vdisk_make_request); + blk_queue_hardsect_size(dev->gd->queue, dev->blocksize); + set_disk_ro(dev->gd, dev->read_only); + if (dev->blocksize) + printk(KERN_INFO "vdisk: device %s(%d:%d) active with " \ + "block size %d and %ld sectors\n", dev->vdev->symname, + VDISK_MAJOR, dev->gd->first_minor, dev->blocksize, + dev->size); + else + printk(KERN_INFO "vdisk: device %s(%d:%d) inactive\n", + dev->vdev->symname, VDISK_MAJOR, dev->gd->first_minor); + return; + free_gd: + kfree(dev->gd); + dev->gd = NULL; + return; +} + +static int vdisk_open(struct inode *inode, struct file *filp) +{ + struct vdisk_device *dev 
= inode->i_bdev->bd_disk->private_data; + unsigned long flags; + char* wq_name; + struct workqueue_struct *new_wq; + int rc; + + if (!dev) { + rc = -ENODEV; + goto out; + } + + down_write(&dev->pump_sem); + spin_lock_irqsave(&dev->lock, flags); + if (dev->open_count) { + dev->open_count++; + rc = 0; + goto unlock; + } + + + wq_name = kmalloc(strlen(dev->gd->disk_name)+9, GFP_ATOMIC); + if (!wq_name) { + rc = -ENOMEM; + goto unlock; + } + memcpy(wq_name, "IO_pump_", 8); + strcpy(wq_name+8, dev->gd->disk_name); + + spin_unlock_irqrestore(&dev->lock, flags); + new_wq = create_singlethread_workqueue(wq_name); + spin_lock_irqsave(&dev->lock, flags); + + dev->wq = new_wq; + kfree (wq_name); + + if (!dev->wq) { + rc = -EIO; + goto unlock; + } + + rc = diag_vdisk_disk_info(dev->vdev->hostid, &dev->blocksize, + &dev->size, &dev->read_only); + if (rc) { + printk(KERN_ERR "vdisk: error querying %s\n", dev->vdev->hostid); + goto cleanup; + } + inode->i_bdev->bd_block_size = dev->blocksize; + dev->vfd = diag_vdisk_open(dev->vdev->hostid, dev->read_only, + dev->irq_page); + + if (dev->vfd < 0) { + rc = dev->vfd; + printk(KERN_ERR "vdisk: error opening %s\n", dev->vdev->hostid); + goto cleanup; + } else { + dev->open_count++; + rc = 0; + } + goto unlock; + + cleanup: + spin_unlock_irqrestore(&dev->lock, flags); + destroy_workqueue(new_wq); + spin_lock_irqsave(&dev->lock, flags); + unlock: + spin_unlock_irqrestore(&dev->lock, flags); + up_write(&dev->pump_sem); + out: + return rc; +} + +static int +vdisk_release(struct inode *inode, struct file *filp) +{ + int rc; + unsigned long flags; + struct vdisk_device *dev = inode->i_bdev->bd_disk->private_data; + struct workqueue_struct *old_wq; + + if (!dev) { + rc = -ENODEV; + goto out; + } + + down_write(&dev->pump_sem); + spin_lock_irqsave(&dev->lock, flags); + dev->open_count--; + + if (dev->open_count) { + rc = 0; + spin_unlock_irqrestore(&dev->lock, flags); + goto up; + } + rc = diag_vdisk_close(dev->vfd); + + old_wq = dev->wq; + 
dev->wq = NULL; + + spin_unlock_irqrestore(&dev->lock, flags); + + destroy_workqueue(old_wq); + + up: + up_write(&dev->pump_sem); + out: + return rc; +} + +static void vdisk_pump_bvecs(struct vdisk_device *dev, int op, + loff_t start_offset, int requestno, + struct bio* bio, struct bio_vec *(vectors[256])) +{ + int i, rc; + loff_t offset = start_offset; + int nr_done = 0; + long size; + long flags=0; + DEFINE_WAIT(wait); + + spin_lock_irqsave(&dev->lock, flags); + prepare_to_wait_exclusive(&dev->wait, &wait, + TASK_UNINTERRUPTIBLE); + + while (nr_done < requestno) { + memset(dev->submit_page, 0, PAGE_SIZE); + for (i=nr_done; i<requestno; i++) { + (*dev->submit_page)[i-nr_done].buf = + (unsigned long)page_address(vectors[i]->bv_page) + + vectors[i]->bv_offset; + (*dev->submit_page)[i-nr_done].count = vectors[i]->bv_len; + } + + rc = diag_vdisk_submit_request(dev->vfd, + dev->submit_page, + op, offset, + requestno-nr_done, bio); + + if (rc < 0) { + // error case + size = 0; + for (i=0; i<(requestno-nr_done); i++) + size += (*dev->submit_page)[i].count; + bio_io_error(bio, size); + break; + } + + if (rc == requestno - nr_done) + // everything was submitted properly + break; + + if (rc) { + // request was partly submitted + for (i=0; i<rc; i++) + offset += (*dev->submit_page)[i].count; + nr_done += rc; + } + // we need to throttle IO, and retry submission later + spin_unlock_irqrestore(&dev->lock, flags); + io_schedule(); + spin_lock_irqsave(&dev->lock, flags); + } + finish_wait(&dev->wait, &wait); + spin_unlock_irqrestore(&dev->lock, flags); + return; +} + +static void vdisk_pump_bio(struct work_struct *zw) +{ + struct vdisk_work *work = + container_of(zw, struct vdisk_work, work); + + struct bio *bio = work->bio; + struct bio_vec *bvec; + struct bio_vec *(vectors[256]); + struct vdisk_device *dev = bio->bi_bdev->bd_disk->private_data; + int i, op, requestno=0; + loff_t start_offset, offset; + + BUG_ON(!dev); + + kfree (zw); + + if (bio_data_dir(bio)) + op = 
VDISK_WRITE; + else + op = VDISK_READ; + + offset = start_offset = ((loff_t)bio->bi_sector)<<SECTOR_SHIFT; + + bio_for_each_segment(bvec, bio, i) { + if (bvec->bv_len & (dev->blocksize - 1)) //FIXME: access to dev without holding the lock + goto out; + + vectors[requestno] = bvec; + offset += bvec->bv_len; + requestno++; + if (requestno == 255) { + vdisk_pump_bvecs(dev, op, start_offset, requestno, + bio, vectors); + start_offset = offset; + requestno = 0; + } + } + + if (requestno) + vdisk_pump_bvecs(dev, op, start_offset, requestno, bio, vectors); + +out: + return; +} + +static int vdisk_make_request(request_queue_t *q, struct bio *bio) +{ + struct vdisk_device *dev = bio->bi_bdev->bd_disk->private_data; + struct vdisk_work *work; + int rc; + + if (!dev) { + rc = -ENODEV; + goto out; + } + + if (bio_barrier(bio)) { + rc = -EOPNOTSUPP; + goto out; + } + + work = kmalloc(sizeof(struct vdisk_work), GFP_KERNEL); + if (!work) { + rc = -ENOMEM; + goto out; + } + + work->bio = bio; + + INIT_WORK(&work->work, vdisk_pump_bio); + + if (!queue_work(dev->wq, &work->work)) { + rc = -EIO; + kfree(work); + } else + rc = 0; + +out: + return rc; +} + +static void __post_response(vdisk_irq_t irq_page) +{ + int i; + struct bio *bio; + + for (i=0; i<VDISK_NR_RES; i++) { + if (!(*irq_page)[i].intparm) + break; + bio = (struct bio *)((*irq_page)[i].intparm); + if ((*irq_page)[i].count) + bio_endio(bio, (*irq_page)[i].count, 0); + if ((*irq_page)[i].failed) + bio_endio(bio, (*irq_page)[i].failed, 1); + } +} + +void vdisk_ext_handler(__u16 code) +{ + int rc=0; //FIXME: no initialization here + int fd = S390_lowcore.ext_params; + vdisk_irq_t irq_page; + struct vdisk_device *vdev; + + irq_page = vdisk_get_irqpage(fd); + + if (irq_page) { + do { + __post_response(irq_page); + rc = diag_vdisk_getevents(fd); // get more interrupts + } while(rc > 0); + vdev = vdisk_get_device_by_fd(fd); + if (!vdev) + panic("cannot find vdisk device while in interrupt"); + spin_lock(&vdev->lock); + if 
(waitqueue_active(&vdev->wait)) + wake_up(&vdev->wait); + spin_unlock(&vdev->lock); + } else + printk (KERN_WARNING "vdisk got interrupt for non-existing" \ + " aio context id %d\n", fd); +} Index: linux-2.6.21/drivers/s390/block/Kconfig =================================================================== --- linux-2.6.21.orig/drivers/s390/block/Kconfig +++ linux-2.6.21/drivers/s390/block/Kconfig @@ -63,4 +63,11 @@ config DASD_EER DASD extended error reporting. This is only needed if you want to use applications written for the EER facility. + +config VDISK + tristate "guest disk device support" + depends on S390_GUEST + help + This driver provides access to block devices for Linux systems running + under non z/VM hosts. If you are running on LPAR or z/VM only, say N. endif Index: linux-2.6.21/drivers/s390/guest/Makefile =================================================================== --- linux-2.6.21.orig/drivers/s390/guest/Makefile +++ linux-2.6.21/drivers/s390/guest/Makefile @@ -4,4 +4,5 @@ obj-$(CONFIG_GUEST_CONSOLE) += guest_console.o guest_tty.o obj-$(CONFIG_S390_GUEST) += vdev.o vdev_device.o +obj-$(CONFIG_VDISK) += vdisk.o vdisk_blk.o ------------------------------------------------------------------------- This SF.net email is sponsored by DB2 Express Download DB2 Express C - the FREE version of DB2 express and take control of your XML. No limits. Just data. Click to get it now. http://sourceforge.net/powerbar/db2/ ^ permalink raw reply [flat|nested] 104+ messages in thread
[parent not found: <1178904963.25135.33.camel-WIxn4w2hgUz3YA32ykw5MLlKpX0K8NHHQQ4Iyu8u01E@public.gmane.org>]
* Re: [PATCH/RFC 6/9] virtual block device driver [not found] ` <1178904963.25135.33.camel-WIxn4w2hgUz3YA32ykw5MLlKpX0K8NHHQQ4Iyu8u01E@public.gmane.org> @ 2007-05-14 11:49 ` Avi Kivity [not found] ` <46484CDF.505-atKUWr5tajBWk0Htik3J/w@public.gmane.org> 2007-05-14 11:52 ` Avi Kivity 1 sibling, 1 reply; 104+ messages in thread From: Avi Kivity @ 2007-05-14 11:49 UTC (permalink / raw) To: Carsten Otte Cc: kvm-devel-5NWGOfrQmneRv+LV9MX5uipxlwaOVQ5f@public.gmane.org, Christian Borntraeger, Martin Schwidefsky Carsten Otte wrote: > From: Carsten Otte <cotte-tA70FqPdS9bQT0dZR+AlfA@public.gmane.org> > > This driver provides access to virtual block devices. It does use its own > make_request function which passes the bio to a workqueue thread. The workqueue > thread does use the diagnose hypervisor call to call the hosting Linux. > The hypervisor code in host userspace does use aio_submit to initiate the IO. > Once the IO is done, the host will use io_getevents and then generate an > interrupt to the guest. The interrupt handler calls bio_endio. > This device driver is currently architecture dependent. We intend to move the > host API to hypercall instead of the diagnose instuction. Please review. > > Signed-off-by: Carsten Otte <cotte-tA70FqPdS9bQT0dZR+AlfA@public.gmane.org> > > +struct vdisk_device * vdisk_get_device_by_fd(int fd) > +{ > + struct device *dev; > + struct vdev *vdev; > + struct vdisk_device *vdisk; > + > + dev = driver_find_device(&vdisk_driver.driver, NULL, (void*)(long)fd, __find_fd); > + if (!dev) > + return NULL; > + vdev = to_vdev(dev); > + vdisk = (struct vdisk_device *)vdev->drv_private; > + return vdisk; > +} > Is this the host file descriptor? If so, we want to use something more abstract (if the host side is in kernel, there will be no fd, or if the device is implemented using >1 files (or <1 files)). 
> + > +#define VDISK_WRITE 1 > +#define VDISK_READ 0 > + > +struct vdisk_request { > + unsigned long buf; > + unsigned long count; > +}; > + > +typedef struct vdisk_request (*vdisk_req_t)[VDISK_NR_REQ]; > + > +struct vdisk_response { > + unsigned long intparm; > + unsigned long count; > + unsigned long failed; > +}; > + > +typedef struct vdisk_response (*vdisk_irq_t)[VDISK_NR_RES]; > + > +struct vdisk_device { > + struct list_head head; > + int blocksize; > + long size; > + int read_only; > + struct gendisk *gd; > + struct vdev *vdev; > + spinlock_t lock; > + struct rw_semaphore pump_sem; > + int open_count; > + int vfd; > + struct vdisk_request (*submit_page)[VDISK_NR_REQ]; > > + struct workqueue_struct *wq; > + vdisk_irq_t irq_page; > + wait_queue_head_t wait; > +}; > + > +struct vdisk_work { > + struct work_struct work; > + struct bio* bio; > +}; > + > +struct vdisk_elem { > + unsigned int fd; > + unsigned int command; > + unsigned long offset; > + unsigned long buffer; > + unsigned long nbytes; > We'll want scatter/gather here. > +}; > + > +struct vdisk_iocb_container { > + struct iocb iocb; > + struct bio *bio; > + struct vdisk_device *dev; > + int ctx_index; > + unsigned long context; > + struct list_head list; > +}; > + > +// from aio_abi.h > +typedef enum io_iocb_cmd { > + IO_CMD_PREAD = 0, > + IO_CMD_PWRITE = 1, > + > + IO_CMD_FSYNC = 2, > + IO_CMD_FDSYNC = 3, > + > + IO_CMD_POLL = 5, > + IO_CMD_NOOP = 6, > +} io_iocb_cmd_t; > Our own commands, please. We need READV, WRITEV, and a barrier for journalling filesystems. FDSYNC should work as a barrier, but is wasteful. The FSYNC/FDSYNC distinction is meaningless. POLL/NOOP are irrelevant. 
> +static void vdisk_pump_bvecs(struct vdisk_device *dev, int op, > + loff_t start_offset, int requestno, > + struct bio* bio, struct bio_vec *(vectors[256])) > +{ > + int i, rc; > + loff_t offset = start_offset; > + int nr_done = 0; > + long size; > + long flags=0; > + DEFINE_WAIT(wait); > + > + spin_lock_irqsave(&dev->lock, flags); > + prepare_to_wait_exclusive(&dev->wait, &wait, > + TASK_UNINTERRUPTIBLE); > + > + while (nr_done < requestno) { > + memset(dev->submit_page, 0, PAGE_SIZE); > + for (i=nr_done; i<requestno; i++) { > + (*dev->submit_page)[i-nr_done].buf = > + (unsigned long)page_address(vectors[i]->bv_page) + > + vectors[i]->bv_offset; > + (*dev->submit_page)[i-nr_done].count = vectors[i]->bv_len; > + } > + > + rc = diag_vdisk_submit_request(dev->vfd, > + dev->submit_page, > + op, offset, > + requestno-nr_done, bio); > + > + if (rc < 0) { > + // error case > + size = 0; > + for (i=0; i<(requestno-nr_done); i++) > + size += (*dev->submit_page)[i].count; > + bio_io_error(bio, size); > + break; > + } > + > + if (rc == requestno - nr_done) > + // everything was submitted propper > + break; > + > + if (rc) { > + //request was partly submitted > + for (i=0; i<rc; i++) > + offset += (*dev->submit_page)[i].count; > + nr_done += rc; > + } > + // we need to throttle IO, and retry submission later > + spin_unlock_irqrestore(&dev->lock, flags); > + io_schedule(); > + spin_lock_irqsave(&dev->lock, flags); > + } > + finish_wait(&dev->wait, &wait); > + spin_unlock_irqrestore(&dev->lock, flags); > + return; > +} > We want to amortize the hypercall over multiple bios (but maybe you're doing that -- I'm not 100% up to speed on the block layer) > + > +static void vdisk_pump_bio(struct work_struct *zw) > +{ > + struct vdisk_work *work = > + container_of(zw, struct vdisk_work, work); > + > + struct bio *bio = work->bio; > + struct bio_vec *bvec; > + struct bio_vec *(vectors[256]); > + struct vdisk_device *dev = bio->bi_bdev->bd_disk->private_data; > + int i, op, 
requestno=0; > + loff_t start_offset, offset; > + > + BUG_ON(!dev); > + > + kfree (zw); > + > + if (bio_data_dir(bio)) > + op = VDISK_WRITE; > + else > + op = VDISK_READ; > + > + offset = start_offset = ((loff_t)bio->bi_sector)<<SECTOR_SHIFT; > + > + bio_for_each_segment(bvec, bio, i) { > + if (bvec->bv_len & (dev->blocksize - 1)) // FIXME: access to dev without holding the lock > + goto out; > + > + vectors[requestno] = bvec; > + offset += bvec->bv_len; > + requestno++; > + if (requestno == 255) { > + vdisk_pump_bvecs(dev, op, start_offset, requestno, > + bio, vectors); > + start_offset = offset; > + requestno = 0; > + } > + } > + > + if (requestno) > + vdisk_pump_bvecs(dev, op, start_offset, requestno, bio, vectors); > + > +out: > + return; > +} > + > +static int vdisk_make_request(request_queue_t *q, struct bio *bio) > +{ > + struct vdisk_device *dev = bio->bi_bdev->bd_disk->private_data; > + struct vdisk_work *work; > + int rc; > + > + if (!dev) { > + rc = -ENODEV; > + goto out; > + } > + > + if (bio_barrier(bio)) { > + rc = -EOPNOTSUPP; > + goto out; > + } > + > + work = kmalloc(sizeof(struct vdisk_work), GFP_KERNEL); > + if (!work) { > + rc = -ENOMEM; > + goto out; > + } > + > + work->bio = bio; > + > + INIT_WORK(&work->work, vdisk_pump_bio); > + > + if (!queue_work(dev->wq, &work->work)) { > + rc = -EIO; > + kfree(work); > + } else > + rc = 0; > Any reason not to perform the work directly? -- error compiling committee.c: too many arguments to function ------------------------------------------------------------------------- This SF.net email is sponsored by DB2 Express Download DB2 Express C - the FREE version of DB2 express and take control of your XML. No limits. Just data. Click to get it now. http://sourceforge.net/powerbar/db2/ ^ permalink raw reply [flat|nested] 104+ messages in thread
* Re: [PATCH/RFC 6/9] virtual block device driver [not found] ` <46484CDF.505-atKUWr5tajBWk0Htik3J/w@public.gmane.org> @ 2007-05-14 13:23 ` Carsten Otte [not found] ` <464862E9.7020105-tA70FqPdS9bQT0dZR+AlfA@public.gmane.org> 0 siblings, 1 reply; 104+ messages in thread From: Carsten Otte @ 2007-05-14 13:23 UTC (permalink / raw) To: Avi Kivity Cc: kvm-devel-5NWGOfrQmneRv+LV9MX5uipxlwaOVQ5f@public.gmane.org, Christian Borntraeger, mschwid2-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8 Avi Kivity wrote: > Is this the host file descriptor? If so, we want to use something more > abstract (if the host side is in kernel, there will be no fd, or if the > device is implemented using >1 files (or <1 files)). This is indeed the host file descriptor. Host userland uses sys_open to retrieve it. I see the beauty of having the remote side in the kernel, however I fail to see why we would want to reinvent the wheel: asynchronous IO with O_DIRECT (to avoid host caching) does just what we want. System call latency adds to the in-kernel approach here. > We'll want scatter/gather here. If you want scatter/gather, you have to do request merging in the guest and use the do_request function of the block queue. That is because in make_request you only have a single chunk at hand. With do_request, you would do that request merging twice and get twice the block device plug latency for nothing. The host is the better place to do IO scheduling, because it can optimize over IO from all guest machines. > >> +}; >> + >> +struct vdisk_iocb_container { >> + struct iocb iocb; >> + struct bio *bio; >> + struct vdisk_device *dev; >> + int ctx_index; >> + unsigned long context; >> + struct list_head list; >> +}; >> + >> +// from aio_abi.h >> +typedef enum io_iocb_cmd { >> + IO_CMD_PREAD = 0, >> + IO_CMD_PWRITE = 1, >> + >> + IO_CMD_FSYNC = 2, >> + IO_CMD_FDSYNC = 3, >> + >> + IO_CMD_POLL = 5, >> + IO_CMD_NOOP = 6, >> +} io_iocb_cmd_t; >> > > Our own commands, please. 
We need READV, WRITEV, and a barrier for > journalling filesystems. FDSYNC should work as a barrier, but is > wasteful. The FSYNC/FDSYNC distinction is meaningless. POLL/NOOP are > irrelevant. This matches the API of libaio. If userland translates this into struct iocb, this makes sense. The barrier however is a general problem with this approach: today, the asynchronous IO userspace API does not allow submitting a barrier. Therefore, our make_request function in the guest returns -EOPNOTSUPP, which forces the file system to wait for IO completion. This does sacrifice some performance. The right thing to do would be to add the possibility to submit a barrier to the kernel aio interface. > We want to amortize the hypercall over multiple bios (but maybe you're > doing that -- I'm not 100% up to speed on the block layer) We don't. We do one per bio, and I agree that this is a major disadvantage of this approach. Since IO is slow (compared to vmenter/vmexit), it pays back through better IO scheduling. On our platform, this approach outperforms the scatter/gather do_request one. > Any reason not to perform the work directly? I owe you an answer to this one, I have to revisit our CVS logs to find out. We used to call from make_request without a workqueue before, and I cannot remember why we changed that. so long, Carsten ^ permalink raw reply [flat|nested] 104+ messages in thread
* Re: [PATCH/RFC 6/9] virtual block device driver [not found] ` <464862E9.7020105-tA70FqPdS9bQT0dZR+AlfA@public.gmane.org> @ 2007-05-14 14:39 ` Avi Kivity [not found] ` <46487494.1070802-atKUWr5tajBWk0Htik3J/w@public.gmane.org> 0 siblings, 1 reply; 104+ messages in thread From: Avi Kivity @ 2007-05-14 14:39 UTC (permalink / raw) To: carsteno-tA70FqPdS9bQT0dZR+AlfA Cc: kvm-devel-5NWGOfrQmneRv+LV9MX5uipxlwaOVQ5f@public.gmane.org, Christian Borntraeger, mschwid2-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8 Carsten Otte wrote: > > Avi Kivity wrote: >> Is this the host file descriptor? If so, we want to use something >> more abstract (if the host side is in kernel, there will be no fd, or >> if the device is implemented using >1 files (or <1 files)). > This is indeed the host file descriptor. Host userland uses sys_open > to retrieve it. I see the beauty of having the remote side in the > kernel, however I fail to see why we would want to reinvent the wheel: > asynchronous IO with O_DIRECT (to avoid host caching) does just what > we want. I don't see an immediate need to put the host-side driver in the kernel, but I don't want to embed the host fd (which is an implementation detail) into the host/guest ABI. There may not even be a host fd. > System call latency adds to the in-kernel approach here. I don't understand this. > >> We'll want scatter/gather here. > If you want scatter/gather, you have to do request merging in the > guest and use the do_request function of the block queue. That is > because in make_request you only have a single chunk at hand. > With do_request, you would do that request merging twice and get twice > the block device plug latency for nothing. The host is the better > place to do IO scheduling, because it can optimize over IO from all > guest machines. The bio layer already has scatter/gather (basically, a biovec), but the aio api (which you copy) doesn't. The basic request should be a bio, not a bio page. 
I don't think the guest driver needs to do its own merging. >> >>> +}; >>> + >>> +struct vdisk_iocb_container { >>> + struct iocb iocb; >>> + struct bio *bio; >>> + struct vdisk_device *dev; >>> + int ctx_index; >>> + unsigned long context; >>> + struct list_head list; >>> +}; >>> + >>> +// from aio_abi.h >>> +typedef enum io_iocb_cmd { >>> + IO_CMD_PREAD = 0, >>> + IO_CMD_PWRITE = 1, >>> + >>> + IO_CMD_FSYNC = 2, >>> + IO_CMD_FDSYNC = 3, >>> + >>> + IO_CMD_POLL = 5, >>> + IO_CMD_NOOP = 6, >>> +} io_iocb_cmd_t; >>> >> >> Our own commands, please. We need READV, WRITEV, and a barrier for >> journalling filesystems. FDSYNC should work as a barrier, but is >> wasteful. The FSYNC/FDSYNC distinction is meaningless. POLL/NOOP >> are irrelevant. > This matches the api of libaio. If userland translates this into > struct iocp, this makes sense. The barrier however is a general > problem with this approach: today, the asynchronous IO userspace api > does not allow to submit a barrier. Therefore, our make_request > function in the guest returns -ENOTSUPP in the guest which forces the > file system to wait for IO completion. This does sacrifice some > performance. The right thing to do would be to add the possibility to > submit a barrier to the kernel aio interface. Right. But the ABI needs to support barriers regardless of host kernel support. When unavailable, barriers can be emulated by waiting for the request queue to flush itself. If we do implement the host side in the kernel, then barriers become available. > >> We want to amortize the hypercall over multiple bios (but maybe >> you're doing that -- I'm not 100% up to speed on the block layer) > We don't. We do one per bio, and I agree that this is a major > disadvantage of this approach. Since IO is slow (compared to > vmenter/vmexit), it pays back from to better IO scheduling. On our > platform, this approach outperforms the scatter/gather do_request one. I/O may be slow, but you can have a lot more disks than cpus. 
For example, if an I/O takes 1ms, and you have 100 disks, then you can issue 100K IOPS. With one hypercall per request, that's ~50% of a cpu (at about 5us per hypercall that goes all the way to userspace). That's not counting the overhead of calling io_submit(). -- error compiling committee.c: too many arguments to function ^ permalink raw reply [flat|nested] 104+ messages in thread
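Avi's 50% figure follows from a short back-of-the-envelope calculation; the throwaway helpers below (a sketch, all numbers taken from the mail) make the arithmetic explicit:

```c
#include <math.h>

/* Requests per second a set of disks can sustain when each disk
 * completes one I/O every io_latency_sec seconds. */
static double iops_for(int disks, double io_latency_sec)
{
	return disks / io_latency_sec;
}

/* Fraction of one cpu consumed when every request costs exactly
 * one hypercall of hypercall_sec seconds. */
static double hypercall_cpu_fraction(double iops, double hypercall_sec)
{
	return iops * hypercall_sec;
}
```

With 100 disks at 1ms per I/O this gives 100,000 IOPS; at 5us per hypercall that is 0.5 of a cpu, i.e. the ~50% quoted above.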
* Re: [PATCH/RFC 6/9] virtual block device driver [not found] ` <46487494.1070802-atKUWr5tajBWk0Htik3J/w@public.gmane.org> @ 2007-05-15 11:47 ` Carsten Otte [not found] ` <46499DE9.9090202-tA70FqPdS9bQT0dZR+AlfA@public.gmane.org> 0 siblings, 1 reply; 104+ messages in thread From: Carsten Otte @ 2007-05-15 11:47 UTC (permalink / raw) To: Avi Kivity Cc: carsteno-tA70FqPdS9bQT0dZR+AlfA, kvm-devel-5NWGOfrQmneRv+LV9MX5uipxlwaOVQ5f@public.gmane.org, Christian Borntraeger, mschwid2-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8 Avi Kivity wrote: > I don't see an immediate need to put the host-side driver in the kernel, > but I don't want to embed the host fd (which is an implementation > detail) into the host/guest ABI. There may not even be a host fd. Your point is taken, it also punches a hole in the security barrier between guest kernel and userspace which our usage scenario of multiple guests per uid requires. >> System call latency adds to the in-kernel approach here. > I don't understand this. What I meant to state was: If the host side of the block driver runs in userspace, we have the extra latency to leave the kernel system call context, compute on behalf of the user process, and do another system call (to drive the IO). This extra overhead does not show when handling IO requests from the guest in the kernel. > The bio layer already has scatter/gather (basically, a biovec), but the > aio api (which you copy) doesn't. The basic request should be a bio, > not a bio page. With our block driver it is, we submit an entire bio which may contain multiple biovecs at one hypercall. > Right. But the ABI needs to support barriers regardless of host kernel > support. When unavailable, barriers can be emulated by waiting for the > request queue to flush itself. If we do implement the host side in the > kernel, then barriers become available. Agreed. > I/O may be slow, but you can have a lot more disks than cpus. 
> > For example, if an I/O takes 1ms, and you have 100 disks, then you can > issue 100K IOPS. With one hypercall per request, that's ~50% of a cpu > (at about 5us per hypercall that goes all the way to userspace). That's > not counting the overhead of calling io_submit(). Even when a hypercall round-trip takes as long as 5us, and even if you have only 512 bytes per biovec (we use 4k blocksize), I don't see how this becomes a performance problem: With linear read/write you get 200,000 hypercalls per second with 128 kbyte per hypercall. That's 25.6 GByte per second per CPU. With random read (worst case: 512 byte per hypercall) you still get 100 MByte per second per CPU. There are tighter bottlenecks in the IO hardware afaics. so long, Carsten ^ permalink raw reply [flat|nested] 104+ messages in thread
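Carsten's throughput numbers can be checked the same way. A sketch (note that reproducing the 25.6 GByte figure requires reading "kbyte" as 1000 bytes, an assumption on our part):

```c
#include <math.h>

/* Bytes per second moved when each hypercall carries `bytes` of
 * payload and one round trip costs hypercall_sec seconds, with the
 * cpu doing nothing but hypercalls. */
static double bytes_per_sec(double hypercall_sec, double bytes)
{
	return bytes / hypercall_sec;
}
```

At 5us per call (200,000 calls per second), 128,000 bytes per call gives 25.6e9 bytes/s, the 25.6 GByte per second above; 512 bytes per call gives 102.4e6 bytes/s, i.e. the quoted ~100 MByte per second per CPU.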
* Re: [PATCH/RFC 6/9] virtual block device driver [not found] ` <46499DE9.9090202-tA70FqPdS9bQT0dZR+AlfA@public.gmane.org> @ 2007-05-16 10:01 ` Avi Kivity 0 siblings, 0 replies; 104+ messages in thread From: Avi Kivity @ 2007-05-16 10:01 UTC (permalink / raw) To: carsteno-tA70FqPdS9bQT0dZR+AlfA Cc: kvm-devel-5NWGOfrQmneRv+LV9MX5uipxlwaOVQ5f@public.gmane.org, Christian Borntraeger, mschwid2-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8 Carsten Otte wrote: > >>> System call latency adds to the in-kernel approach here. >> I don't understand this. > What I meant to state was: If the host side of the block driver runs > in userspace, we have the extra latency to leave the kernel system > call context, compute on behalf of the user process, and do another > system call (to drive the IO). This extra overhead does not show when > handling IO requests from the guest in the kernel. > Well, this argument seems to be in favor of not using an fd ;) Actually, an fd is usable when storing a disk in a raw file. But qemu supports non-raw (formatted) disk images, which have additional features like snapshots. The fd alone does not contain enough information. > >> I/O may be slow, but you can have a lot more disks than cpus. >> >> For example, if an I/O takes 1ms, and you have 100 disks, then you >> can issue 100K IOPS. With one hypercall per request, that's ~50% of >> a cpu (at about 5us per hypercall that goes all the way to >> userspace). That's not counting the overhead of calling io_submit(). > Even when a hypercall round-trip takes as long as 5us, and even if you > have 512byte per biovec only (we use 4k blocksize), I don't see how > this gets a performance problem: > With linear read/write you get 200.000 hypercalls per second with 128 > kbyte per hypercall. That's 25.6 GByte per second per CPU. With random > read (worst case: 512 byte per hypercall) you still get 100 MByte per > second per CPU. There are tighter bottlenecks in the IO hardware afaics. > If all you do is I/O, sure. 
If you want to limit I/O cpu overhead to 10%, the raw bandwidth becomes 10 MB/s/cpu. (bandwidth isn't a good measure for random I/O, IOPS is the interesting metric) Both the guest and host will favor batched requests. It's a shame to deny that because of a simplistic protocol. -- error compiling committee.c: too many arguments to function ^ permalink raw reply [flat|nested] 104+ messages in thread
* Re: [PATCH/RFC 6/9] virtual block device driver [not found] ` <1178904963.25135.33.camel-WIxn4w2hgUz3YA32ykw5MLlKpX0K8NHHQQ4Iyu8u01E@public.gmane.org> 2007-05-14 11:49 ` Avi Kivity @ 2007-05-14 11:52 ` Avi Kivity [not found] ` <46484D84.3060601-atKUWr5tajBWk0Htik3J/w@public.gmane.org> 1 sibling, 1 reply; 104+ messages in thread From: Avi Kivity @ 2007-05-14 11:52 UTC (permalink / raw) To: Carsten Otte Cc: kvm-devel-5NWGOfrQmneRv+LV9MX5uipxlwaOVQ5f@public.gmane.org, Christian Borntraeger, Martin Schwidefsky Carsten Otte wrote: > From: Carsten Otte <cotte-tA70FqPdS9bQT0dZR+AlfA@public.gmane.org> > > This driver provides access to virtual block devices. It does use its own > make_request function which passes the bio to a workqueue thread. The workqueue > thread does use the diagnose hypervisor call to call the hosting Linux. > The hypervisor code in host userspace does use io_submit to initiate the IO. > Once the IO is done, the host will use io_getevents and then generate an > interrupt to the guest. The interrupt handler calls bio_endio. > This device driver is currently architecture dependent. We intend to move the > host API to hypercall instead of the diagnose instruction. Please review. > > Oh. Why not use Xen's pending block driver? It probably has everything needed. -- error compiling committee.c: too many arguments to function ^ permalink raw reply [flat|nested] 104+ messages in thread
* Re: [PATCH/RFC 6/9] virtual block device driver [not found] ` <46484D84.3060601-atKUWr5tajBWk0Htik3J/w@public.gmane.org> @ 2007-05-14 13:26 ` Carsten Otte 0 siblings, 0 replies; 104+ messages in thread From: Carsten Otte @ 2007-05-14 13:26 UTC (permalink / raw) To: Avi Kivity Cc: kvm-devel-5NWGOfrQmneRv+LV9MX5uipxlwaOVQ5f@public.gmane.org, Christian Borntraeger, mschwid2-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8 Avi Kivity wrote: > Oh. Why not use Xen's pending block driver? It probably has everything > needed. We're not too eager to have our own device drivers become the solution of choice. I haven't looked at it so far, will do. so long, Carsten ^ permalink raw reply [flat|nested] 104+ messages in thread
* [PATCH/RFC 7/9] Virtual network guest device driver [not found] ` <1178903957.25135.13.camel-WIxn4w2hgUz3YA32ykw5MLlKpX0K8NHHQQ4Iyu8u01E@public.gmane.org> ` (4 preceding siblings ...) 2007-05-11 17:36 ` [PATCH/RFC 6/9] virtual block device driver Carsten Otte @ 2007-05-11 17:36 ` Carsten Otte [not found] ` <1178904965.25135.34.camel-WIxn4w2hgUz3YA32ykw5MLlKpX0K8NHHQQ4Iyu8u01E@public.gmane.org> 2007-05-11 17:36 ` [PATCH/RFC 8/9] Virtual network host switch support Carsten Otte 2007-05-11 17:36 ` [PATCH/RFC 9/9] Fix system<->user misaccount of interpreted execution Carsten Otte 7 siblings, 1 reply; 104+ messages in thread From: Carsten Otte @ 2007-05-11 17:36 UTC (permalink / raw) To: kvm-devel-5NWGOfrQmneRv+LV9MX5uipxlwaOVQ5f@public.gmane.org Cc: Christian Borntraeger, Martin Schwidefsky From: Christian Borntraeger <cborntra-tA70FqPdS9bQT0dZR+AlfA@public.gmane.org> This is a work-in-progress paravirtualized network driver for Linux systems running under a hypervisor. The basic idea of this network driver is to have a shared memory pool between host and guest. The guest allocates the buffer and registers its memory with the host. There are two queues, one for guest-to-host traffic and one for the reverse direction. The queue state is tracked via a 32-bit atomic: the upper 16 bits indicate the slot where the next packet is put in, the lower 16 bits indicate the slot where the next packet is taken out. Macros are provided to check the queue for empty and full. We only notify the other side when the queue _was_ empty or full. Guest-to-host notification is done via the diagnose instruction. This is basically a hypervisor call instruction, similar to vmmcall and vmcall. For the reverse notification the host sends an interrupt to the guest. Using NAPI, we react on changes of the queue with netif_rx_schedule, netif_wake_queue, netif_stop_queue and netif_rx_complete.
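The 32-bit queue-state encoding just described can be exercised stand-alone. The sketch below mirrors the macros and predicates from the patch's vnet.h (copied here only for illustration):

```c
#define VNET_QUEUE_LEN 80

/* Pack/unpack the 32-bit queue state: upper 16 bits = next slot the
 * producer fills, lower 16 bits = next slot the consumer drains. */
#define __nextx(val) (((val) & 0xffff0000) >> 16)
#define __nextr(val) ((val) & 0xffff)
#define __mkxr(x, r) ((((x) & 0xffff) << 16) | ((r) & 0xffff))

/* Queue is empty when producer and consumer point at the same slot. */
static int vnet_q_empty(int val)
{
	return __nextx(val) == __nextr(val);
}

/* Queue is full when the producer is one slot behind the consumer,
 * modulo VNET_QUEUE_LEN (one slot is kept unused to tell the two
 * states apart). */
static int vnet_q_full(int val)
{
	if (__nextx(val) == __nextr(val) - 1)
		return 1;
	if (__nextr(val) == 0 && __nextx(val) == VNET_QUEUE_LEN - 1)
		return 1;
	return 0;
}
```

Keeping one slot free is what lets a single 32-bit word distinguish "empty" from "full" without a separate counter.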
As the host only sends an interrupt if the queue was empty, we also need to check for a race in the poll function and use netif_rx_reschedule. Otherwise we would miss a wakeup and sleep forever. Our s390 network drivers support cards that do IP packets only and provide no MAC header at all. Therefore, the driver currently supports a layer2 based mode (like ethernet) and a layer3 mode (we claim to be ptp) depending on the host. So we have several prototypes and device drivers for paravirtualized network: KVM, Xen, our prototype, lguest... In the long term we want to have one driver to rule^w drive them all, right? This driver currently has some s390-specific aspects. I think we could get rid of the diagnose calls with an architecture-defined hypervisor call. Please review, comments on the design are very welcome. Signed-off-by: Christian Borntraeger <cborntra-tA70FqPdS9bQT0dZR+AlfA@public.gmane.org> Signed-off-by: Carsten Otte <cotte-tA70FqPdS9bQT0dZR+AlfA@public.gmane.org> --- drivers/s390/guest/Makefile | 2 drivers/s390/guest/vnet.h | 147 +++++++++++ drivers/s390/guest/vnet_guest.c | 509 ++++++++++++++++++++++++++++++++++++++++ drivers/s390/guest/vnet_guest.h | 111 ++++++++ drivers/s390/net/Kconfig | 9 5 files changed, 777 insertions(+), 1 deletion(-) Index: linux-2.6.21/drivers/s390/guest/vnet.h =================================================================== --- /dev/null +++ linux-2.6.21/drivers/s390/guest/vnet.h @@ -0,0 +1,147 @@ +/* + * vnet - virtual networking for guest systems + * + * Copyright IBM Corp.
2007 + * Authors: Carsten Otte <cotte-tA70FqPdS9bQT0dZR+AlfA@public.gmane.org> + * Christian Borntraeger <borntrae-tA70FqPdS9bQT0dZR+AlfA@public.gmane.org> + * + */ + +#ifndef __VNET_H +#define __VNET_H +#include <linux/crc32.h> +#include <linux/ioctl.h> +#include <linux/if_ether.h> +#include <linux/netdevice.h> +#include <asm/atomic.h> +#include <asm/page.h> + +#define VNET_MAJOR 12 //COFIXME + +#define VNET_QUEUE_LEN 80 // careful, vnet_control must be < PAGE +#define VNET_BUFFER_SIZE 65536 +#define VNET_BUFFER_ORDER get_order(VNET_BUFFER_SIZE) +#define VNET_BUFFER_PAGES (((VNET_BUFFER_SIZE-1)>>PAGE_SHIFT)+1) + +#define VNET_TIMEOUT (5*HZ) + +#define VNET_IRQ_START_RX 0 +#define VNET_IRQ_START_TX 1 + +struct vnet_info { + int linktype; + int maxmtu; +}; + +#define VNET_IOCTL_ID 'Z' +#define VNET_REGISTER_CTL _IOW(VNET_IOCTL_ID ,0, unsigned long long) +#define VNET_INTERRUPT _IOW(VNET_IOCTL_ID, 1, int) +#define VNET_INFO _IOR(VNET_IOCTL_ID, 2, struct vnet_info*) + +#define QUEUE_IS_EMPTY 1 +#define QUEUE_WAS_EMPTY 2 +#define QUEUE_IS_FULL 4 +#define QUEUE_WAS_FULL 8 + + +struct xmit_buffer { + __be16 len; + __be16 proto; + void *data; +}; + +struct vnet_control { + int buffer_size; + char mac[ETH_ALEN]; + atomic_t p2smit __attribute__((__aligned__(SMP_CACHE_BYTES))); + atomic_t s2pmit __attribute__((__aligned__(SMP_CACHE_BYTES))); + struct xmit_buffer p2sbufs[VNET_QUEUE_LEN] __attribute__((__aligned__(SMP_CACHE_BYTES))); + struct xmit_buffer s2pbufs[VNET_QUEUE_LEN] __attribute__((__aligned__(SMP_CACHE_BYTES))); +}; + + +#define __nextx(val) (((val) & 0xffff0000)>>16) +#define __nextr(val) ((val) & 0xffff) +#define __mkxr(x,r) ((((x) & 0xffff)<<16) | ((r) & 0xffff)) + +static inline int +vnet_q_empty(int val) +{ + return (__nextx(val) == __nextr(val)); +} + +static inline int +vnet_q_half(int val) +{ + if (__nextx(val) == __nextr(val)) + return 0; + if (__nextx(val) < __nextr(val)) { + if ((__nextr(val) - __nextx(val)) < VNET_QUEUE_LEN / 2) + return 1; + } else { + 
if ((__nextx(val) - __nextr(val)) > VNET_QUEUE_LEN / 2) + return 1; + } + return 0; +} + + +static inline int +vnet_q_full(int val) +{ + if (__nextx(val) == __nextr(val) - 1) + return 1; + if ((__nextr(val) == 0) && (__nextx(val) == VNET_QUEUE_LEN - 1)) + return 1; + return 0; +} + +/* returns values: + * bit RX_QUEUE_NOW_EMPTY (1) + * and/or RX_QUEUE_WAS_FULL (2) + */ +static inline int +vnet_rx_packet(atomic_t *ato) +{ + int oldval, newval, rc; + + do { + oldval = atomic_read(ato); + + //do we wrap? + if (__nextr(oldval)+1 == VNET_QUEUE_LEN) + newval = __mkxr(__nextx(oldval),0); + else + newval = __mkxr(__nextx(oldval),__nextr(oldval)+1); + } while (atomic_cmpxchg(ato, oldval, newval) != oldval); + + rc = 0; + if (vnet_q_empty(newval)) + rc |= QUEUE_IS_EMPTY; + if (vnet_q_full(oldval)) + rc |= QUEUE_WAS_FULL; + return rc; +} + +static inline int +vnet_tx_packet(atomic_t *ato) +{ + int oldval, newval, rc; + + do { + oldval = atomic_read(ato); + + //do we wrap? + if (__nextx(oldval)+1 == VNET_QUEUE_LEN) + newval = __mkxr(0, __nextr(oldval)); + else + newval = __mkxr(__nextx(oldval)+1,__nextr(oldval)); + } while (atomic_cmpxchg(ato, oldval, newval) != oldval); + rc = 0; + if (vnet_q_empty(oldval)) + rc |= QUEUE_WAS_EMPTY; + if (vnet_q_full(newval)) + rc |= QUEUE_IS_FULL; + return rc; +} +#endif Index: linux-2.6.21/drivers/s390/guest/vnet_guest.c =================================================================== --- /dev/null +++ linux-2.6.21/drivers/s390/guest/vnet_guest.c @@ -0,0 +1,509 @@ +/* + * vnet - virtual network device driver + * + * Copyright IBM Corp. 
2007 + * Author: Carsten Otte <cotte-tA70FqPdS9bQT0dZR+AlfA@public.gmane.org> + * Christian Borntraeger <borntrae-tA70FqPdS9bQT0dZR+AlfA@public.gmane.org> + * + */ + +#include <linux/kernel.h> +#include <linux/list.h> +#include <linux/module.h> +#include <linux/netdevice.h> +#include <linux/inetdevice.h> +#include <linux/etherdevice.h> +#include <linux/if.h> +#include <linux/if_ether.h> +#include <linux/if_arp.h> +#include <linux/rtnetlink.h> +#include <linux/hardirq.h> +#include <linux/mm.h> +#include <linux/notifier.h> +#include <asm/s390_ext.h> +#include <asm/atomic.h> +#include <asm/vdev.h> + +#include "vnet.h" +#include "vnet_guest.h" + +static LIST_HEAD(vnet_devices); +static rwlock_t vnet_devices_lock = RW_LOCK_UNLOCKED; + + +static int +vnet_net_open(struct net_device *dev) +{ + struct vnet_guest_device *guest; + struct vnet_control *control; + + guest = netdev_priv(dev); + control = guest->control; + atomic_set(&control->s2pmit, 0); + netif_start_queue(dev); + return 0; +} + +static int +vnet_net_stop(struct net_device *dev) +{ + netif_stop_queue(dev); + return 0; +} + +static void vnet_net_tx_timeout(struct net_device *dev) +{ + struct vnet_guest_device *zk = netdev_priv(dev); + struct vnet_control *control = zk->control; + + printk(KERN_ERR "problems in xmit for device %s\n Resetting...\n", + dev->name); + atomic_set(&control->p2smit, 0); + atomic_set(&control->s2pmit, 0); + diag_vnet_send_interrupt(zk->hostfd, VNET_IRQ_START_RX); + netif_wake_queue(dev); +} + + +static void skbcopy(char *dest, struct sk_buff *skb) +{ + int i; + + memcpy(dest, skb->data, skb_headlen(skb)); + dest += skb_headlen(skb); + for (i = 0; i < skb_shinfo(skb)->nr_frags; i++) { + skb_frag_t *frag = &skb_shinfo(skb)->frags[i]; + memcpy(dest, page_address(frag->page) + + frag->page_offset, frag->size); + dest+=frag->size; + } +} + +static int +vnet_net_xmit(struct sk_buff *skb, struct net_device *dev) +{ + struct vnet_guest_device *zk = netdev_priv(dev); + struct vnet_control 
*control = zk->control; + struct xmit_buffer *buf; + int pkid; + int buffer_status; + + if (!spin_trylock(&zk->lock)) + return NETDEV_TX_LOCKED; + if(vnet_q_full(atomic_read(&control->p2smit))) { + netif_stop_queue(dev); + goto full; + } + pkid = __nextx(atomic_read(&control->p2smit)); + buf = &control->p2sbufs[pkid]; + buf->len = skb->len; + buf->proto = skb->protocol; + skbcopy(buf->data, skb); + buffer_status = vnet_tx_packet(&control->p2smit); + spin_unlock(&zk->lock); + zk->stats.tx_packets++; + zk->stats.tx_bytes += skb->len; + dev_kfree_skb_any(skb); + dev->trans_start = jiffies; + if (buffer_status & QUEUE_WAS_EMPTY) + diag_vnet_send_interrupt(zk->hostfd, VNET_IRQ_START_RX); + if (!(buffer_status & QUEUE_IS_FULL)) + return NETDEV_TX_OK; + netif_stop_queue(dev); + spin_lock(&zk->lock); +full: + if (!vnet_q_full(atomic_read(&control->p2smit))) + netif_start_queue(dev); + spin_unlock(&zk->lock); + return NETDEV_TX_OK; +} + +static int +vnet_l2_poll(struct net_device *dev, int *budget) +{ + struct vnet_guest_device *zk = netdev_priv(dev); + struct vnet_control *control = zk->control; + struct xmit_buffer *buf; + struct sk_buff *skb; + int pkid, count, numpackets = min(dev->quota, *budget); + int buffer_status; + + if (vnet_q_empty(atomic_read(&control->s2pmit))) { + count = 0; + goto empty; + } +loop: + count = 0; + while(numpackets) { + pkid = __nextr(atomic_read(&control->s2pmit)); + buf = &control->s2pbufs[pkid]; /* kernel pointer!*/ + skb = dev_alloc_skb(buf->len); + if (likely(skb)) { + memcpy(skb_put(skb, buf->len), buf->data, buf->len); + skb->dev = dev; + skb->protocol = eth_type_trans(skb, dev); + zk->stats.rx_packets++; + zk->stats.rx_bytes += buf->len; + netif_receive_skb(skb); + numpackets--; + (*budget)--; + dev->quota--; + count++; + } else + zk->stats.rx_dropped++; + buffer_status = vnet_rx_packet(&control->s2pmit); + if (buffer_status & QUEUE_WAS_FULL) + diag_vnet_send_interrupt(zk->hostfd, + VNET_IRQ_START_TX); + if (buffer_status & 
QUEUE_IS_EMPTY) + goto empty; + } + return 1; /* please ask us again */ + empty: + netif_rx_complete(dev); + /* we might have raced against a wakup */ + if (!vnet_q_empty(atomic_read(&control->s2pmit))) { + if (netif_rx_reschedule(dev, count)) + goto loop; + } + return 0; /* we're done for now */ +} + + +static int +vnet_l3_poll(struct net_device *dev, int *budget) +{ + struct vnet_guest_device *zk = dev->priv; + struct vnet_control *control = zk->control; + struct xmit_buffer *buf; + struct sk_buff *skb; + int pkid, count, numpackets = min(dev->quota, *budget); + int buffer_status; + + if (vnet_q_empty(atomic_read(&control->s2pmit))) { + count = 0; + goto empty; + } +loop: + count = 0; + while(numpackets) { + pkid = __nextr(atomic_read(&control->s2pmit)); + buf = &control->s2pbufs[pkid]; /*kernel pointer*/ + skb = dev_alloc_skb(buf->len + NET_IP_ALIGN); + if (likely(skb)) { + skb_reserve(skb, NET_IP_ALIGN); + memcpy(skb_put(skb, buf->len), buf->data, buf->len); + skb->dev = dev; + skb->protocol = buf->proto; + skb->mac.raw = skb->data; + zk->stats.rx_packets++; + zk->stats.rx_bytes += buf->len; + netif_receive_skb(skb); + numpackets--; + (*budget)--; + dev->quota--; + count++; + } else + zk->stats.rx_dropped++; + buffer_status = vnet_rx_packet(&control->s2pmit); + if (buffer_status & QUEUE_WAS_FULL) + diag_vnet_send_interrupt(zk->hostfd, + VNET_IRQ_START_TX); + if (buffer_status & QUEUE_IS_EMPTY) + goto empty; + } + return 1; /* please ask us again */ + empty: + netif_rx_complete(dev); + /* we might have raced against a wakup */ + if (!vnet_q_empty(atomic_read(&control->s2pmit))) { + if (netif_rx_reschedule(dev, count)) + goto loop; + } + return 0; /* we're done for now */ +} + +static struct net_device_stats * +vnet_net_stats(struct net_device *dev) +{ + struct vnet_guest_device *zk = netdev_priv(dev); + return &zk->stats; +} + +static int +vnet_net_change_mtu(struct net_device *dev, int new_mtu) +{ + if (new_mtu <= ETH_ZLEN) + return -ERANGE; + if (new_mtu > 
VNET_BUFFER_SIZE-ETH_HLEN) + return -ERANGE; + dev->mtu = new_mtu; + return 0; +} + + +static void +__vnet_common_init(struct net_device *dev) +{ + dev->open = vnet_net_open; + dev->stop = vnet_net_stop; + dev->hard_start_xmit = vnet_net_xmit; + dev->get_stats = vnet_net_stats; + dev->tx_timeout = vnet_net_tx_timeout; + dev->watchdog_timeo = VNET_TIMEOUT; + dev->change_mtu = vnet_net_change_mtu; + dev->weight = 64; + dev->features |= NETIF_F_SG | NETIF_F_LLTX; +} + +static void +__vnet_layer3_init(struct net_device *dev) +{ + dev->mtu = ETH_DATA_LEN; + dev->tx_queue_len = 1000; + dev->flags = IFF_BROADCAST|IFF_MULTICAST|IFF_NOARP; + dev->type = ARPHRD_PPP; + dev->mtu = 1492; + dev->poll = vnet_l3_poll; + __vnet_common_init(dev); +} + +static void +__vnet_layer2_init(struct net_device *dev) +{ + ether_setup(dev); + dev->mtu = 1492; + dev->poll = vnet_l2_poll; + __vnet_common_init(dev); +} + +static struct vnet_guest_device * +__get_vnet_dev_by_fd(int fd) +{ + struct vnet_guest_device *zk; + + read_lock(&vnet_devices_lock); + list_for_each_entry(zk, &vnet_devices, lh) { + if (zk->hostfd == fd) + goto found; + } + zk = NULL; + found: + read_unlock (&vnet_devices_lock); + return zk; +} + +void vnet_ext_handler(__u16 code) +{ + unsigned int type = S390_lowcore.ext_params & 3; + unsigned int fd = S390_lowcore.ext_params >> 2; + + struct vnet_guest_device *zk = __get_vnet_dev_by_fd(fd); + + BUG_ON(!zk); + switch (type) { + case VNET_IRQ_START_RX: + netif_rx_schedule(zk->netdev); + break; + case VNET_IRQ_START_TX: + netif_wake_queue(zk->netdev); + break; + default: + BUG(); + } +} + +static void +vnet_delete_device(struct vnet_guest_device *zd) +{ + int i; + unsigned long flags; + + if (zd->hostfd >= 0) + diag_vnet_release(zd->hostfd); + write_lock_irqsave(&vnet_devices_lock, flags); + list_del(&zd->lh); + write_unlock_irqrestore(&vnet_devices_lock, flags); + + for (i=0; i<VNET_QUEUE_LEN; i++) { + if (zd->control->s2pbufs[i].data) { + free_pages((unsigned long) 
zd->control->s2pbufs[i].data, VNET_BUFFER_ORDER);
+			zd->control->s2pbufs[i].data = NULL;
+		}
+		if (zd->control->p2sbufs[i].data) {
+			free_pages((unsigned long) zd->control->p2sbufs[i].data,
+				   VNET_BUFFER_ORDER);
+			zd->control->p2sbufs[i].data = NULL;
+		}
+	}
+	if (zd->control) {
+		kfree(zd->control);
+		zd->control = NULL;
+	}
+	if (zd->netdev) /* CAUTION: this also frees zd */
+		free_netdev(zd->netdev);
+}
+
+
+static int vnet_device_alloc(struct vnet_guest_device *zd)
+{
+	int i;
+
+	zd->control = kzalloc(sizeof(struct vnet_control), GFP_KERNEL);
+	if (!zd->control)
+		return -ENOMEM;
+	for (i = 0; i < VNET_QUEUE_LEN; i++) {
+		zd->control->s2pbufs[i].data = (void *)
+			__get_free_pages(GFP_KERNEL, VNET_BUFFER_ORDER);
+		if (!zd->control->s2pbufs[i].data)
+			return -ENOMEM;
+		zd->control->p2sbufs[i].data = (void *)
+			__get_free_pages(GFP_KERNEL, VNET_BUFFER_ORDER);
+		if (!zd->control->p2sbufs[i].data)
+			return -ENOMEM;
+	}
+	return 0;
+}
+
+static int vnet_probe(struct vdev *vdev)
+{
+	int ret;
+	long flags;
+	struct vnet_guest_device *zd;
+	struct net_device *netdev;
+	int linktype;
+
+	if (strlen(vdev->symname) > IFNAMSIZ) {
+		printk(KERN_ERR "vnet: %s is too long for a network device, "
+		       "discarding it\n", vdev->symname);
+		return -EINVAL;
+	}
+	ret = diag_vnet_info(vdev->hostid, &linktype);
+	if (ret)
+		return ret;
+	if (linktype == 3)
+		netdev = alloc_netdev(sizeof(*zd), vdev->symname,
+				      __vnet_layer3_init);
+	else
+		netdev = alloc_netdev(sizeof(*zd), vdev->symname,
+				      __vnet_layer2_init);
+	if (!netdev)
+		return -ENOMEM;
+	zd = netdev_priv(netdev);
+	zd->netdev = netdev;
+
+	ret = vnet_device_alloc(zd);
+	if (ret)
+		goto out;
+	zd->control->buffer_size = VNET_BUFFER_SIZE;
+	zd->linktype = linktype;
+	memcpy(zd->ifname, vdev->symname, IFNAMSIZ);
+	INIT_LIST_HEAD(&zd->lh);
+
+	write_lock_irqsave(&vnet_devices_lock, flags);
+	zd->hostfd = diag_vnet_open(vdev->hostid, zd->control);
+	if (zd->hostfd < 0) {
+		write_unlock_irqrestore(&vnet_devices_lock, flags);
+		goto out;
+	}
+	list_add_tail(&zd->lh, &vnet_devices);
+	write_unlock_irqrestore(&vnet_devices_lock, flags);
+
+	/* host is ready, now we can set up our local network interface */
+	rtnl_lock();
+	memcpy(netdev->dev_addr, zd->control->mac, 6);
+	spin_lock_init(&zd->lock);
+
+	if (!(ret = register_netdevice(zd->netdev))) {
+		/* good case */
+		rtnl_unlock();
+		printk("vnet: Successfully registered %s\n", vdev->symname);
+		return 0;
+	}
+	printk("vnet: Could not register network interface %s\n", vdev->symname);
+	rtnl_unlock();
+ out:
+	vnet_delete_device(zd);
+	return ret;
+}
+
+static struct vdev_driver vnet_driver = {
+	.name = "vnet",
+	.owner = THIS_MODULE,
+	.vdev_type = VDEV_TYPE_NET,
+	.probe = vnet_probe,
+};
+
+static int vnet_ip_event(struct notifier_block *this,
+			 unsigned long event, void *ptr)
+{
+	struct in_ifaddr *ifa = (struct in_ifaddr *) ptr;
+	struct net_device *dev = (struct net_device *) ifa->ifa_dev->dev;
+	struct vnet_guest_device *zk;
+
+	read_lock(&vnet_devices_lock);
+	list_for_each_entry(zk, &vnet_devices, lh)
+		if (zk->netdev == dev) {
+			read_unlock(&vnet_devices_lock);
+			if (event == NETDEV_UP)
+				diag_vnet_ip(1, ifa->ifa_address,
+					     ifa->ifa_mask,
+					     ifa->ifa_broadcast);
+			if (event == NETDEV_DOWN)
+				diag_vnet_ip(0, ifa->ifa_address,
+					     ifa->ifa_mask,
+					     ifa->ifa_broadcast);
+			return NOTIFY_OK;
+		}
+	read_unlock(&vnet_devices_lock);
+	return NOTIFY_DONE;
+}
+
+static struct notifier_block vnet_ip_notifier = {
+	vnet_ip_event,
+	NULL
+};
+
+/* module related section */
+int
+vnet_guest_init(void)
+{
+	int ret;
+
+	if (!MACHINE_IS_GUEST)
+		return -ENODEV;
+	BUILD_BUG_ON(sizeof(struct vnet_control) > PAGE_SIZE);
+	register_external_interrupt(0x1236, vnet_ext_handler);
+	if (register_inetaddr_notifier(&vnet_ip_notifier)) {
+		printk(KERN_ERR "vnet: Could not register ip callback\n");
+		unregister_external_interrupt(0x1236, vnet_ext_handler);
+	}
+	ret = vdev_driver_register(&vnet_driver);
+	if (ret) {
+		printk(KERN_ERR "vnet: Could not register driver\n");
+		unregister_external_interrupt(0x1236, vnet_ext_handler);
+		unregister_inetaddr_notifier(&vnet_ip_notifier);
+		return ret;
+	}
+	return ret;
+}
+
+void
+vnet_guest_exit(void)
+{
+	struct vnet_guest_device *zk;
+	struct vnet_guest_device *temp;
+
+	unregister_external_interrupt(0x1236, vnet_ext_handler);
+	unregister_inetaddr_notifier(&vnet_ip_notifier);
+	rtnl_lock();
+	write_lock(&vnet_devices_lock);
+	list_for_each_entry_safe(zk, temp, &vnet_devices, lh) {
+		netif_stop_queue(zk->netdev);
+		unregister_netdevice(zk->netdev);
+		vnet_delete_device(zk);
+	}
+	write_unlock(&vnet_devices_lock);
+	rtnl_unlock();
+}
+
+module_init(vnet_guest_init);
+module_exit(vnet_guest_exit);
+MODULE_DESCRIPTION("VNET: Virtual network driver");
+MODULE_AUTHOR("Christian Borntraeger <borntrae-tA70FqPdS9bQT0dZR+AlfA@public.gmane.org>");
+MODULE_LICENSE("GPL");
Index: linux-2.6.21/drivers/s390/guest/vnet_guest.h
===================================================================
--- /dev/null
+++ linux-2.6.21/drivers/s390/guest/vnet_guest.h
@@ -0,0 +1,111 @@
+/*
+ * vnet - zlive insular communication knack
+ *
+ * Copyright (C) 2005 IBM Corporation
+ * Author: Carsten Otte <cotte-tA70FqPdS9bQT0dZR+AlfA@public.gmane.org>
+ * Author: Christian Borntraeger <cborntra-tA70FqPdS9bQT0dZR+AlfA@public.gmane.org>
+ *
+ */
+
+#ifndef __VNET_GUEST_H
+#define __VNET_GUEST_H
+
+#include <linux/netdevice.h>
+#include <linux/workqueue.h>
+#include <linux/spinlock.h>
+#include "vnet.h"
+
+struct vnet_guest_device {
+	struct list_head lh;
+	int hostfd;
+	char ifname[IFNAMSIZ];
+	struct net_device *netdev;
+	struct vnet_control *control;
+	struct net_device_stats stats;
+	struct work_struct work;
+	int linktype;
+	spinlock_t lock;
+};
+
+static inline int
+diag_vnet_info(char *ifname, int *linktype)
+{
+	register char *__arg1 asm("2") = ifname;
+	register int *__arg2 asm("3") = linktype;
+	register int __svcres asm("2");
+	int __res;
+
+	__asm__ __volatile__ (
+		"diag 0,0,0x0e"
+		: "=d" (__svcres)
+		: "0" (__arg1),
+		  "d" (__arg2)
+		: "cc", "memory");
+	__res = __svcres;
+	return __res;
+}
+
+static inline int
+diag_vnet_open(char *ifname, struct vnet_control *ctrl)
+{
+	register char *__arg1 asm("2") = ifname;
+	register struct vnet_control *__arg2 asm("3") = ctrl;
+	register int __svcres asm("2");
+	int __res;
+
+	__asm__ __volatile__ (
+		"diag 0,0,0x0f"
+		: "=d" (__svcres)
+		: "0" (__arg1),
+		  "d" (__arg2)
+		: "cc", "memory");
+	__res = __svcres;
+	return __res;
+}
+
+static inline void
+diag_vnet_send_interrupt(int fd, int type)
+{
+	register long __arg1 asm("2") = fd;
+	register long __arg2 asm("3") = type;
+
+	__asm__ __volatile__ (
+		"diag 0,0,0x11"
+		: : "d" (__arg1),
+		    "d" (__arg2)
+		: "cc", "memory");
+}
+
+static inline void
+diag_vnet_release(int fd)
+{
+	register long __arg1 asm("2") = fd;
+
+	__asm__ __volatile__ (
+		"diag 0,0,0x13"
+		: : "d" (__arg1)
+		: "cc", "memory");
+}
+
+static inline int
+diag_vnet_ip(int add, u32 addr, u32 mask, u32 broadcast)
+{
+	register long __arg1 asm("2") = add;
+	register long __arg2 asm("3") = addr;
+	register long __arg3 asm("4") = mask;
+	register long __arg4 asm("5") = broadcast;
+	register int __svcres asm("2");
+	int __res;
+
+	__asm__ __volatile__ (
+		"diag 0,0,0x1f"
+		: "=d" (__svcres)
+		: "d" (__arg1),
+		  "d" (__arg2),
+		  "d" (__arg3),
+		  "d" (__arg4)
+		: "cc", "memory");
+	__res = __svcres;
+	return __res;
+}
+#endif
Index: linux-2.6.21/drivers/s390/guest/Makefile
===================================================================
--- linux-2.6.21.orig/drivers/s390/guest/Makefile
+++ linux-2.6.21/drivers/s390/guest/Makefile
@@ -5,4 +5,4 @@
 obj-$(CONFIG_GUEST_CONSOLE) += guest_console.o guest_tty.o
 obj-$(CONFIG_S390_GUEST) += vdev.o vdev_device.o
 obj-$(CONFIG_VDISK) += vdisk.o vdisk_blk.o
-
+obj-$(CONFIG_VNET_GUEST) += vnet_guest.o
Index: linux-2.6.21/drivers/s390/net/Kconfig
===================================================================
--- linux-2.6.21.orig/drivers/s390/net/Kconfig
+++
linux-2.6.21/drivers/s390/net/Kconfig
@@ -86,4 +86,13 @@
 config CCWGROUP
 	tristate
 	default (LCS || CTC || QETH)
+config VNET_GUEST
+	tristate "virtual networking support (GUEST)"
+	depends on S390_GUEST
+	help
+	  This is the guest part of the vnet guest network connection.
+	  Say Y if you plan to run this kernel as a guest with a network
+	  connection.
+	  If you're not using host/guest support, say N.
+
 endmenu

-------------------------------------------------------------------------
This SF.net email is sponsored by DB2 Express
Download DB2 Express C - the FREE version of DB2 express and take
control of your XML. No limits. Just data. Click to get it now.
http://sourceforge.net/powerbar/db2/

^ permalink raw reply	[flat|nested] 104+ messages in thread
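The l2/l3 poll routines in the patch above rely on queue helpers (vnet_q_empty(), __nextr(), vnet_rx_packet()) whose definitions are not part of this excerpt. The sketch below models one plausible encoding -- a read index and a fill count packed into the single atomic word the driver samples with atomic_read() -- purely to illustrate the control flow. The bit layout, the QUEUE_WAS_FULL/QUEUE_IS_EMPTY values, and the helper bodies are assumptions for illustration, not the patch's actual definitions.

```c
#include <assert.h>

#define VNET_QUEUE_LEN 16

#define QUEUE_WAS_FULL 1	/* producer may have been blocked */
#define QUEUE_IS_EMPTY 2	/* nothing left; poll loop should stop */

/* Hypothetical packing: low 8 bits = next read slot, high bits = fill count. */
static inline int __nextr(int state)      { return state & 0xff; }
static inline int q_count(int state)      { return state >> 8; }
static inline int vnet_q_empty(int state) { return q_count(state) == 0; }

/*
 * Consume one packet from the host-to-guest queue and return the status
 * flags the poll loop checks (full before the take, empty after it).
 */
static inline int vnet_rx_packet(int *state)
{
	int count = q_count(*state);
	int rd = __nextr(*state);
	int flags = 0;

	if (count == VNET_QUEUE_LEN)
		flags |= QUEUE_WAS_FULL;
	count--;
	rd = (rd + 1) % VNET_QUEUE_LEN;
	*state = (count << 8) | rd;
	if (count == 0)
		flags |= QUEUE_IS_EMPTY;
	return flags;
}
```

With this encoding, the poll loop's QUEUE_WAS_FULL check is what triggers the VNET_IRQ_START_TX kick back to the host, since the host may have stopped producing while the queue was full.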
* Re: [PATCH/RFC 7/9] Virtual network guest device driver
@ 2007-05-11 19:44 ` ron minnich
From: ron minnich @ 2007-05-11 19:44 UTC (permalink / raw)
To: Carsten Otte
Cc: Jimi Xenidis, jmk-zzFmDc4TPjtKvsKVC3L/VUEOCMrvLtNR@public.gmane.org, kvm-devel-5NWGOfrQmneRv+LV9MX5uipxlwaOVQ5f@public.gmane.org, Christian Borntraeger, Martin Schwidefsky

Let me ask what may seem to be a naive question to the linux world.

I see you are doing a lot of solid work on adding block and network devices. The code for block and network devices is implemented in different ways. I've also seen this difference of interface/implementation on Xen. Hence my question:

Why are the INTERFACES to the block and network devices different? I can understand that the implementation -- what goes on "inside the box" -- would be different. But, again, why is the interface to the resource different in each case? Will every distinct type of I/O device end up with a different interface?

These questions doubtless seem naive, I suppose, except I use a system (Plan 9) in which a common interface is in fact used for the different resources. I have been hoping that we could bring this model -- same interface, different resource -- to the inter-vm communications. I would like to at least raise the idea that it could be used on KVM.

Avoiding too much detail: in the Plan 9 world, read and write of data to a disk is via file read and write system calls. Same for a network. Same for the mouse, the window system, the serial port, the console, USB, and so on.
Please see this note from IBM on what is possible: http://domino.watson.ibm.com/library/CyberDig.nsf/0/c6c779bbf1650fa4852570670054f3ca?OpenDocument or http://plan9.escet.urjc.es/iwp9/cready/PROSE_iwp9_2006.pdf

Different resources, same interface. In the hypervisor world, you build one shared memory queue as a basic abstraction. On top of that queue, you run 9P. The provider (network, block device, etc.) provides certain resources to you, the guest domain. The resources have names. A network can look like this, to a kvm guest (this command from a Plan 9 system):

cpu% ls /net/ether0
/net/ether0/0
/net/ether0/1
/net/ether0/2
/net/ether0/addr
/net/ether0/clone
/net/ether0/ifstats
/net/ether0/stats

To get network stats, or do I/O, one simply gains access to the appropriate ring buffer by finding the name, and does the sends and receives via shared memory queues. The I/O operations can be very efficient.

Disk looks like this:

cpu% ls -l /dev/sdC0
--rw-r----- S 0 bootes bootes   104857600 Jan 22 15:49 /dev/sdC0/9fat
--rw-r----- S 0 bootes bootes 65361213440 Jan 22 15:49 /dev/sdC0/arenas
--rw-r----- S 0 bootes bootes           0 Jan 22 15:49 /dev/sdC0/ctl
--rw-r----- S 0 bootes bootes 82348277760 Jan 22 15:49 /dev/sdC0/data
--rw-r----- S 0 bootes bootes 13072242688 Jan 22 15:49 /dev/sdC0/fossil
--rw-r----- S 0 bootes bootes  3268060672 Jan 22 15:49 /dev/sdC0/isect
--rw-r----- S 0 bootes bootes         512 Jan 22 15:49 /dev/sdC0/nvram
--rw-r----- S 0 bootes bootes 82343245824 Jan 22 15:49 /dev/sdC0/plan9
-lrw------- S 0 bootes bootes           0 Jan 22 15:49 /dev/sdC0/raw
--rw-r----- S 0 bootes bootes   536870912 Jan 22 15:49 /dev/sdC0/swap
cpu%

So the disk partitions are "files", with the "data" file being the whole disk. Again, on a hypervisor system, to do I/O, software could create a connection to the "file" and establish the in-memory ring buffer for that partition. This I/O can be very efficient; IBM research is working on zero-copy mechanisms for moving data between domains.
The result is a single, consistent mechanism for accessing all resources from a guest domain. The resources have names, and it is easy to examine their status -- binary interfaces can be minimized. The resources can be provided by in-kernel servers -- Linux drivers -- or by out-of-kernel servers -- processes. Same interface, and yet the implementation of the provider of the resource can be utterly different.

We had hoped to get something like this into Xen. On Xen, for example, the block device and ethernet device interfaces are as different as one could imagine. Disk I/O does not steal pages from the guest; the network does. Disk I/O is in 4k chunks, period, with a bitmap describing which of the 8 512-byte subunits are being sent. The enet device, on read, returns a page with your packet, but also potentially containing bits of other domains' packets too. The interfaces are as dissimilar as they can be, and I see no reason for such a huge variance between what are basically read/write devices.

Another issue is that kvm, in its current form (-24), is beautifully simple. These additions seem to detract from the beauty a bit. Might it be worth taking a little time to consider these ideas in order to preserve the basic elegance of KVM?

So, before we go too far down the Xen-like paravirtualized device route, can we discuss the way this ought to look a bit?

thanks

ron
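To make the "everything is a file" model described above concrete: reading a device statistic then needs nothing beyond open/read plus a trivial text parser -- no ioctl, no device-specific binary structures. The sketch below is hypothetical; the "name: value" line format and the read_stat() helper are illustrative assumptions, and Plan 9's real ifstats format differs in detail.

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/*
 * Parse one "name: value" line of the kind an ifstats-style file might
 * contain, e.g. "in: 12345".  Returns -1 if the name does not match.
 */
long parse_stat(const char *line, const char *name)
{
	size_t n = strlen(name);

	if (strncmp(line, name, n) != 0 || line[n] != ':')
		return -1;
	return strtol(line + n + 1, NULL, 10);
}

/*
 * With a file interface, fetching one statistic is just open/read/scan;
 * the same code works no matter what kind of device backs the file.
 */
long read_stat(const char *path, const char *name)
{
	char line[256];
	long val = -1;
	FILE *f = fopen(path, "r");

	if (!f)
		return -1;
	while (fgets(line, sizeof(line), f)) {
		val = parse_stat(line, name);
		if (val >= 0)
			break;
	}
	fclose(f);
	return val;
}
```

A guest would then call something like read_stat("/net/ether0/ifstats", "in") where today it would issue a device-specific query; the path and stat name here are illustrative.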
* Re: [PATCH/RFC 7/9] Virtual network guest device driver
@ 2007-05-11 20:12 ` Anthony Liguori
From: Anthony Liguori @ 2007-05-11 20:12 UTC (permalink / raw)
To: ron minnich
Cc: Jimi Xenidis, jmk-zzFmDc4TPjtKvsKVC3L/VUEOCMrvLtNR@public.gmane.org, Christian Borntraeger, kvm-devel-5NWGOfrQmneRv+LV9MX5uipxlwaOVQ5f@public.gmane.org, Martin Schwidefsky

ron minnich wrote:
> Avoiding too much detail, in the plan 9 world, read and write of data
> to a disk is via file read and write system calls.

For low-speed devices, I think paravirtualization doesn't make a lot of sense unless it's absolutely required. I don't know enough about s390 to know if it supports things like uarts, but if so, then emulating a uart would in my mind make a lot more sense than a PV console device.

> Same for a network.
> Same for the mouse, the window system, the serial port, the console,
> USB, and so on. Please see this note from IBM on what is possible:
> http://domino.watson.ibm.com/library/CyberDig.nsf/0/c6c779bbf1650fa4852570670054f3ca?OpenDocument
> or http://plan9.escet.urjc.es/iwp9/cready/PROSE_iwp9_2006.pdf
> Different resources, same interface. In the hypervisor world, you
> build one shared memory queue as a basic abstraction. On top of that
> queue, you run 9P. The provider (network, block device, etc.) provides
> certain resources to you, the guest domain. The resources have names.
> A network can look like this, to a kvm guest (this command from a
> Plan 9 system):
> cpu% ls /net/ether0
> /net/ether0/0
> /net/ether0/1
> /net/ether0/2
> /net/ether0/addr
> /net/ether0/clone
> /net/ether0/ifstats
> /net/ether0/stats
>

This smells a bit like XenStore, which I think most will agree was an unmitigated disaster.
This sort of thing gets terribly complicated to deal with in the corner cases. Atomic operation of multiple read/write operations is difficult to express. Moreover, quite a lot of things are naturally expressed as a state machine, which is not straightforward to do in this sort of model. This may have all been figured out in 9P, but it's certainly not a simple thing to get right.

I think a general rule of thumb for a virtualized environment is that the closer you stick to the way hardware tends to do things, the less likely you are to screw yourself up and the easier it will be for other platforms to support your devices. Implementing a full 9P client just to get console access in something like mini-os would be unfortunate. At least the posted s390 console driver behaves roughly like a uart, so it's pretty obvious that it will be easy to implement in any OS that supports uarts already.

Regards,

Anthony Liguori

> To get network stats, or do I/O, one simply gains access to the
> appropriate ring buffer, by finding the name, and does the ring buffer
> sends and receives via shared memory queues. The I/O operations can be
> very efficient.
>
> Disk looks like this:
> cpu% ls -l /dev/sdC0
> --rw-r----- S 0 bootes bootes 104857600 Jan 22 15:49 /dev/sdC0/9fat
> --rw-r----- S 0 bootes bootes 65361213440 Jan 22 15:49 /dev/sdC0/arenas
> --rw-r----- S 0 bootes bootes 0 Jan 22 15:49 /dev/sdC0/ctl
> --rw-r----- S 0 bootes bootes 82348277760 Jan 22 15:49 /dev/sdC0/data
> --rw-r----- S 0 bootes bootes 13072242688 Jan 22 15:49 /dev/sdC0/fossil
> --rw-r----- S 0 bootes bootes 3268060672 Jan 22 15:49 /dev/sdC0/isect
> --rw-r----- S 0 bootes bootes 512 Jan 22 15:49 /dev/sdC0/nvram
> --rw-r----- S 0 bootes bootes 82343245824 Jan 22 15:49 /dev/sdC0/plan9
> -lrw------- S 0 bootes bootes 0 Jan 22 15:49 /dev/sdC0/raw
> --rw-r----- S 0 bootes bootes 536870912 Jan 22 15:49 /dev/sdC0/swap
> cpu%
>
> So the disk partitions are "files", with the "data" file being the
> whole disk.
> Again, on a hypervisor system, to do I/O, software could
> create a connection to the "file" and establish the in-memory ring
> buffer, for that partition. This I/O can be very efficient; IBM
> research is working on zero-copy mechanisms for moving data between
> domains.
>
> The result is a single, consistent mechanism for accessing all
> resources from a guest domain. The resources have names, and it is
> easy to examine the status -- binary interfaces can be minimized. The
> resources can be provided by in-kernel servers -- Linux drivers -- or
> out-of-kernel servers -- proceses. Same interface, and yet the
> implementation of the provider of the resource can be utterly
> different.
>
> We had hoped to get something like this into Xen. On Xen, for example,
> the block device and ethernet device interfaces are as different as
> one could imagine. Disk I/O does not steal pages from the guest. The
> network does. Disk I/O is in 4k chunks, period, with a bitmap
> describing which of the 8 512-byte subunits are being sent. The enet
> device, on read, returns a page with your packet, but also potentially
> containing bits of other domain's packets too. The interfaces are as
> dissimilar as they can be, and I see no reason for such a huge
> variance between what are basically read/write devices.
>
> Another issue is that kvm, in its current form (-24) is beautifully
> simple. These additions seem to detract from the beauty a bit. Might
> it be worth taking a little time to consider these ideas in order to
> preserve the basic elegance of KVM?
>
> So, before we go too far down the Xen-like paravirtualized device
> route, can we discuss the way this ought to look a bit?
>
> thanks
>
> ron
>
* Re: [PATCH/RFC 7/9] Virtual network guest device driver [not found] ` <4644CE15.6080505-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org> @ 2007-05-11 21:15 ` Eric Van Hensbergen [not found] ` <a4e6962a0705111415n47e77a15o331b59cf2a03b4-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org> 2007-05-11 21:51 ` ron minnich 1 sibling, 1 reply; 104+ messages in thread From: Eric Van Hensbergen @ 2007-05-11 21:15 UTC (permalink / raw) To: Anthony Liguori Cc: Jimi Xenidis, jmk-zzFmDc4TPjtKvsKVC3L/VUEOCMrvLtNR@public.gmane.org, Christian Borntraeger, Martin Schwidefsky, kvm-devel-5NWGOfrQmneRv+LV9MX5uipxlwaOVQ5f@public.gmane.org On 5/11/07, Anthony Liguori <aliguori-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org> wrote: > > cpu% ls /net/ether0 > > /net/ether0/0 > > /net/ether0/1 > > /net/ether0/2 > > /net/ether0/addr > > /net/ether0/clone > > /net/ether0/ifstats > > /net/ether0/stats > > > > This smells a bit like XenStore which I think most will agree was an > unmitigated disaster. > I'd have to disagree with you Anthony. The Plan 9 interfaces are simple and built into the kernel - they don't have the multi-layered-stack-python-xmlrpc garbage that made up the Xen interfaces. >This sort of thing gets terribly complicated to deal with in the corner cases. >Atomic operation of multiple read/write operations is difficult to express. > Moreover, quite a lot of things are naturally expressed as a state machine which > is not straight forward to do in this sort of model. This may have been all > figured out in 9P but it's certainly not a simple thing to get right. > That's true, but we have been doing it for over 20 years - I think we have a good model to base stuff on. > I think a general rule of thumb for a virtualized environment is that > the closer you stick to the way hardware tends to do things, the less > likely you are to screw yourself up and the easier it will be for other > platforms to support your devices. 
> Implementing a full 9P client just
> to get console access in something like mini-os would be unfortunate.
> At least the posted s390 console driver behaves roughly like a uart so
> it's pretty obvious that it will be easy to implement in any OS that
> supports uarts already.
>

If it were just console access, I would agree with you, but it's really about implementing a single solution for all drivers you are accessing across the interface: a single client versus dozens of different driver variants. Our existing 9p client for mini-os is ~3000 LOC, and it is a pretty naive port from the p9p code base, so it could probably be reduced even further. It is a very small percentage of our existing mini-os kernels and gives us console, disk, network, IP stack, file system, and control interfaces. Of course Linux clients could just use v9fs with a hypervisor-shared-memory transport, which I haven't merged yet. We'll also be using the same set of interfaces for the simulator shortly.

Oh yeah, and don't forget the fact that resource access can bridge seamlessly over any network, and the protocol has provisions to be secured with authentication/encryption/digesting if desired.

Los Alamos will be presenting 9p-based control interfaces for KVM at OLS.

-eric
* Re: [PATCH/RFC 7/9] Virtual network guest device driver [not found] ` <a4e6962a0705111415n47e77a15o331b59cf2a03b4-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org> @ 2007-05-11 21:47 ` Anthony Liguori [not found] ` <4644E456.2060507-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org> 0 siblings, 1 reply; 104+ messages in thread From: Anthony Liguori @ 2007-05-11 21:47 UTC (permalink / raw) To: Eric Van Hensbergen Cc: Jimi Xenidis, Christian Borntraeger, jmk-zzFmDc4TPjtKvsKVC3L/VUEOCMrvLtNR@public.gmane.org, Martin Schwidefsky, kvm-devel-5NWGOfrQmneRv+LV9MX5uipxlwaOVQ5f@public.gmane.org Eric Van Hensbergen wrote: > On 5/11/07, Anthony Liguori <aliguori-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org> wrote: > >>> cpu% ls /net/ether0 >>> /net/ether0/0 >>> /net/ether0/1 >>> /net/ether0/2 >>> /net/ether0/addr >>> /net/ether0/clone >>> /net/ether0/ifstats >>> /net/ether0/stats >>> >>> >> This smells a bit like XenStore which I think most will agree was an >> unmitigated disaster. >> >> > > I'd have to disagree with you Anthony. The Plan 9 interfaces are > simple and built into the kernel - they don't have the > multi-layered-stack-python-xmlrpc garbage that made up the Xen > interfaces. > My point isn't that 9p is just like XenStore but rather that turning this idea into something that is useful and elegant is non-trivial. > If it were just console access, I would agree with you, but its really > about implementing a single solution for all drivers you are accessing > across the interface. A single client versus dozens of different > driver variants. There's definitely a conversation to have here. There are going to be a lot of small devices that would benefit from a common transport mechanism. Someone mentioned a PV entropy device on LKML. A host=>guest filesystem is another consumer of such an interface. I'm inclined to think though that the abstraction point should be the transport and not the actual protocol. 
My concern with standardizing on a protocol like 9p would be that one would lose some potential optimizations (like passing PFN's directly between guest and host). > Our existing 9p client for mini-os is ~3000 LOC and > it is a pretty naive port from the p9p code base so it could probably > be reduced even further. It is a very small percentage of our > existing mini-os kernels and gives us console, disk, network, IP > stack, file system, and control interfaces. Of course Linux clients > could just use v9fs with a hypervisor-shared-memory transport which I > haven't merged yet. We'll also be using the same set of interfaces > for the simulator shortly. > So is there any reason to even tie 9p to KVM? Why not just have a common PV transport that 9p can use. For certain things, it may make sense (like v9fs). Regards, Anthony Liguori > Oh yeah, and don't forget the fact that resource access can bridge > seamlessly over any network and the protocol has provisions to be > secured with authentication/encryption/digesting if desired. > > Los Alamos will be presenting 9p based control interfaces for KVM at OLS. > > -eric > > ------------------------------------------------------------------------- > This SF.net email is sponsored by DB2 Express > Download DB2 Express C - the FREE version of DB2 express and take > control of your XML. No limits. Just data. Click to get it now. > http://sourceforge.net/powerbar/db2/ > _______________________________________________ > kvm-devel mailing list > kvm-devel-5NWGOfrQmneRv+LV9MX5uipxlwaOVQ5f@public.gmane.org > https://lists.sourceforge.net/lists/listinfo/kvm-devel > > ------------------------------------------------------------------------- This SF.net email is sponsored by DB2 Express Download DB2 Express C - the FREE version of DB2 express and take control of your XML. No limits. Just data. Click to get it now. http://sourceforge.net/powerbar/db2/ ^ permalink raw reply [flat|nested] 104+ messages in thread
* Re: [PATCH/RFC 7/9] Virtual network guest device driver
@ 2007-05-11 22:21 ` Eric Van Hensbergen
From: Eric Van Hensbergen @ 2007-05-11 22:21 UTC (permalink / raw)
To: Anthony Liguori
Cc: Jimi Xenidis, Christian Borntraeger, jmk-zzFmDc4TPjtKvsKVC3L/VUEOCMrvLtNR@public.gmane.org, Martin Schwidefsky, kvm-devel-5NWGOfrQmneRv+LV9MX5uipxlwaOVQ5f@public.gmane.org

On 5/11/07, Anthony Liguori <aliguori-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org> wrote:
>
> There's definitely a conversation to have here. There are going to be a
> lot of small devices that would benefit from a common transport
> mechanism. Someone mentioned a PV entropy device on LKML. A
> host=>guest filesystem is another consumer of such an interface.
>
> I'm inclined to think though that the abstraction point should be the
> transport and not the actual protocol. My concern with standardizing on
> a protocol like 9p would be that one would lose some potential
> optimizations (like passing PFN's directly between guest and host).
>

I think that there are two layers: having a standard, well defined, simple shared memory transport between partitions (or between emulators and the host system) is certainly a prerequisite. There are lots of different decisions to be made here:

a) does it communicate with userspace, kernelspace, or both?
b) is it multi-channel? prioritized? interrupt driven or poll driven?
c) how big are the buffers? is it packetized?
d) can all of these parameters be something controllable from userspace?
e) I'm sure there are many others that I can't be bothered to think of on a Friday

Regardless of the details, I think we can definitely come together on a common mechanism here and avoid lots of duplication in the drivers that are already there and in those which will follow.
My personal preference is to keep things as simple and flat as possible: no XML, no multiple stacks and daemons to contend with.

What runs on top of the transport is no doubt going to be a touchy subject for some time to come. Many of Ron's arguments for 9p mostly apply to this upper level. I/we will be pursuing this as a unified PV resource sharing mechanism over the next few months, in combination with reorganization and optimization of the Linux 9p code. LANL has also been making progress in this same direction. I'd have gotten started sooner, but I was waiting for my new Thinkpad so that I can actually run KVM ;)

> So is there any reason to even tie 9p to KVM? Why not just have a
> common PV transport that 9p can use. For certain things, it may make
> sense (like v9fs).
>

Well, I think we were discussing tying KVM to 9p, not vice-versa. My personal view is that developing a generalized solution for resource sharing of all manner of devices and services across virtualization, emulation, and network boundaries is a better way to spend our time than writing a bunch of specific drivers/protocols/interfaces for each type of device and each type of interconnect.

-eric
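One way to anchor the design questions listed in the message above (packetized or not, buffer sizes, interrupt vs. poll driven) is a minimal packetized single-producer/single-consumer ring over what would be a shared page. Everything below -- the struct layout, slot count, slot size, and index scheme -- is a toy sketch for discussion, not a proposed ABI; a real inter-partition version would also need memory barriers and a kick/interrupt mechanism around the index updates.

```c
#include <string.h>

/* Illustrative sizes only; a real layout would be sized to the page. */
#define RING_SLOTS 8
#define SLOT_SIZE  64

struct ring {
	unsigned int prod;	/* written only by the producer */
	unsigned int cons;	/* written only by the consumer */
	struct {
		unsigned int len;
		unsigned char data[SLOT_SIZE];
	} slot[RING_SLOTS];
};

/* prod and cons are free-running; their difference is the fill level. */
static inline int ring_full(const struct ring *r)
{
	return r->prod - r->cons == RING_SLOTS;
}

static inline int ring_empty(const struct ring *r)
{
	return r->prod == r->cons;
}

/* Returns 0 on success, -1 if full (caller would then kick or poll). */
static inline int ring_put(struct ring *r, const void *buf, unsigned int len)
{
	unsigned int i = r->prod % RING_SLOTS;

	if (len > SLOT_SIZE || ring_full(r))
		return -1;
	r->slot[i].len = len;
	memcpy(r->slot[i].data, buf, len);
	r->prod++;	/* a real transport needs a write barrier first */
	return 0;
}

/* Returns the packet length, or -1 if empty or the buffer is too small. */
static inline int ring_get(struct ring *r, void *buf, unsigned int maxlen)
{
	unsigned int i = r->cons % RING_SLOTS;
	unsigned int len;

	if (ring_empty(r))
		return -1;
	len = r->slot[i].len;
	if (len > maxlen)
		return -1;
	memcpy(buf, r->slot[i].data, len);
	r->cons++;
	return (int) len;
}
```

Each of the lettered questions maps onto a knob here: packetization is the per-slot len field, buffer size is SLOT_SIZE, and interrupt-vs-poll is what the caller does when ring_put()/ring_get() report full/empty.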
* Re: [PATCH/RFC 7/9] Virtual network guest device driver [not found] ` <a4e6962a0705111521v2d451ddcjecf209e2031c85af-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org> @ 2007-05-16 17:28 ` Anthony Liguori [not found] ` <464B3F20.4030904-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org> 0 siblings, 1 reply; 104+ messages in thread From: Anthony Liguori @ 2007-05-16 17:28 UTC (permalink / raw) To: Eric Van Hensbergen Cc: Jimi Xenidis, Christian Borntraeger, jmk-zzFmDc4TPjtKvsKVC3L/VUEOCMrvLtNR@public.gmane.org, Martin Schwidefsky, kvm-devel-5NWGOfrQmneRv+LV9MX5uipxlwaOVQ5f@public.gmane.org Eric Van Hensbergen wrote: > On 5/11/07, Anthony Liguori <aliguori-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org> wrote: >> >> There's definitely a conversation to have here. There are going to be a >> lot of small devices that would benefit from a common transport >> mechanism. Someone mentioned a PV entropy device on LKML. A >> host=>guest filesystem is another consumer of such an interface. >> >> I'm inclined to think though that the abstraction point should be the >> transport and not the actual protocol. My concern with standardizing on >> a protocol like 9p would be that one would lose some potential >> optimizations (like passing PFN's directly between guest and host). >> > > I think that there are two layers - having a standard, well defined, > simple shared memory transport between partitions (or between > emulators and the host system) is certainly a prerequisite. There are > lots of different decisions to made here: What do you think about a socket interface? I'm not sure how discovery would work yet, but there are a few PV socket implementations for Xen at the moment. > a) does it communicate with userspace, kernelspace, or both? Sockets are usable for both userspace/kernelspace. > b) is it multi-channel? prioritized? interrupt driven or poll driven? Of course, arguments can be made for any of these depending on the circumstance.
I think you'd have to start with something simple that would cover the largest number of users (non-multiplexed, interrupt driven). > c) how big are the buffers? is it packetized? This could probably be tweaked with sockopts. I suspect you would have an implementation for Xen, KVM, etc. and support a common set of options (and possibly some per-VM options). > d) can all of these parameters be something controllable from userspace? > e) I'm sure there are many others that I can't be bothered to think > of on a Friday The biggest point of contention would probably be what goes in the sockaddr structure. Thoughts? Regards, Anthony Liguori > Regardless of the details, I think we can definitely come together on > a common mechanism here and avoid lots of duplication in the drivers > are already there and which will follow. My personal preference is to > keep things as simple and flat as possible. No XML, no multiple > stacks and daemons to contend with. > > What runs on top of the transport is no doubt going to be a touchy > subject for some time to come. Many of Ron's arguments for 9p mostly > apply to this upper level. I/we will be pursuing this as a unified PV > resource sharing mechanism over the next few months in combination > with reorganization and optimization of the Linux 9p code. LANL has > also been making progress in this same direction. I'd have gotten > started sooner, but I was waiting for my new Thinkpad so that I can > actually run KVM ;) > >> >> So is there any reason to even tie 9p to KVM? Why not just have a >> common PV transport that 9p can use. For certain things, it may make >> sense (like v9fs). >> > > Well, I think we were discussing tying KVM to 9p, not vice-versa.
> > My personal view is that developing a generalized solution for > resource sharing of all manner of devices and services across > virtualization, emulation, and network boundaries is a better way to > spend our time than writing a bunch of specific > drivers/protocols/interfaces for each type of device and each type of > interconnect. > > -eric
* Re: [PATCH/RFC 7/9] Virtual network guest device driver [not found] ` <464B3F20.4030904-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org> @ 2007-05-16 17:38 ` Daniel P. Berrange [not found] ` <20070516173822.GD16863-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> 2007-05-16 17:41 ` Eric Van Hensbergen ` (2 subsequent siblings) 3 siblings, 1 reply; 104+ messages in thread From: Daniel P. Berrange @ 2007-05-16 17:38 UTC (permalink / raw) To: Anthony Liguori Cc: Jimi Xenidis, jmk-zzFmDc4TPjtKvsKVC3L/VUEOCMrvLtNR@public.gmane.org, Christian Borntraeger, Martin Schwidefsky, kvm-devel-5NWGOfrQmneRv+LV9MX5uipxlwaOVQ5f@public.gmane.org On Wed, May 16, 2007 at 12:28:00PM -0500, Anthony Liguori wrote: > Eric Van Hensbergen wrote: > > On 5/11/07, Anthony Liguori <aliguori-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org> wrote: > >> > >> There's definitely a conversation to have here. There are going to be a > >> lot of small devices that would benefit from a common transport > >> mechanism. Someone mentioned a PV entropy device on LKML. A > >> host=>guest filesystem is another consumer of such an interface. > >> > >> I'm inclined to think though that the abstraction point should be the > >> transport and not the actual protocol. My concern with standardizing on > >> a protocol like 9p would be that one would lose some potential > >> optimizations (like passing PFN's directly between guest and host). > >> > > > > I think that there are two layers - having a standard, well defined, > > simple shared memory transport between partitions (or between > > emulators and the host system) is certainly a prerequisite. There are > > lots of different decisions to made here: > > What do you think about a socket interface? I'm not sure how discovery > would work yet, but there are a few PV socket implementations for Xen at > the moment. As a userspace apps service, I'd very much like to see a common sockets interface for inter-VM communication that is portable across virt systems like Xen & KVM. 
I'd see it as similar to UNIX domain sockets in style. So basically any app which could do UNIX domain sockets could be ported to inter-VM sockets by just changing PF_UNIX to, say, PF_VIRT. Lots of interesting details around impl & security (what VMs are allowed to talk to each other, whether this policy should be controlled by the host, or allow VMs to decide for themselves). > > a) does it communicate with userspace, kernelspace, or both? > > sockets are usable for both userspace/kernespace. For userspace, it would be very easy to adapt existing sockets-based apps using IP or UNIX sockets to use inter-VM sockets, which is a big positive. > > d) can all of these parameters be something controllable from userspace? > > e) I'm sure there are many others that I can't be bothered to think > > of on a Friday > > The biggest point of contention would probably be what goes in the > sockaddr structure. Keeping it very simple would be some arbitrary 'path', similar to UNIX domain sockets in the abstract namespace? Regards, Dan. -- |=- Red Hat, Engineering, Emerging Technologies, Boston. +1 978 392 2496 -=| |=- Perl modules: http://search.cpan.org/~danberr/ -=| |=- Projects: http://freshmeat.net/~danielpb/ -=| |=- GnuPG: 7D3B9505 F3C9 553F A1DA 4AC2 5648 23C1 B3DF F742 7D3B 9505 -=|
* Re: [PATCH/RFC 7/9] Virtual network guest device driver [not found] ` <20070516173822.GD16863-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> @ 2007-05-17 9:29 ` Carsten Otte [not found] ` <464C2069.20909-tA70FqPdS9bQT0dZR+AlfA@public.gmane.org> 0 siblings, 1 reply; 104+ messages in thread From: Carsten Otte @ 2007-05-17 9:29 UTC (permalink / raw) To: Daniel P. Berrange Cc: Jimi Xenidis, jmk-zzFmDc4TPjtKvsKVC3L/VUEOCMrvLtNR@public.gmane.org, Christian Borntraeger, mschwid2-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8, kvm-devel-5NWGOfrQmneRv+LV9MX5uipxlwaOVQ5f@public.gmane.org Daniel P. Berrange wrote: > As a userspace apps service, I'd very much like to see a common sockets > interface for inter-VM communication that is portable across virt systems > like Xen & KVM. I'd see it as similar to UNIX domain sockets in style. So > basically any app which could do UNIX domain sockets, could be ported to > inter-VM sockets by just changing PF_UNIX to say, PF_VIRT > Lots of interesting details around impl & security (what VMs are allowed > to talk to each other, whether this policy should be controlled by the > host, or allow VMs to decide for themselves). z/VM, the premium hypervisor on 390, has had this capability for decades. This is called IUCV (inter user communication vehicle), where user really means virtual machine. It so happens that the support for AF_IUCV was recently merged to Linux mainline. It may be worth a look, either for using it or because learning from existing solutions is always a good idea. so long, Carsten
* Re: [PATCH/RFC 7/9] Virtual network guest device driver [not found] ` <464C2069.20909-tA70FqPdS9bQT0dZR+AlfA@public.gmane.org> @ 2007-05-17 14:22 ` Anthony Liguori [not found] ` <464C651F.5070700-rdkfGonbjUSkNkDKm+mE6A@public.gmane.org> 0 siblings, 1 reply; 104+ messages in thread From: Anthony Liguori @ 2007-05-17 14:22 UTC (permalink / raw) To: carsteno-tA70FqPdS9bQT0dZR+AlfA Cc: Jimi Xenidis, jmk-zzFmDc4TPjtKvsKVC3L/VUEOCMrvLtNR@public.gmane.org, Christian Borntraeger, mschwid2-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8, kvm-devel-5NWGOfrQmneRv+LV9MX5uipxlwaOVQ5f@public.gmane.org Carsten Otte wrote: > Daniel P. Berrange wrote: > >> As a userspace apps service, I'd very much like to see a common sockets >> interface for inter-VM communication that is portable across virt systems >> like Xen & KVM. I'd see it as similar to UNIX domain sockets in style. So >> basically any app which could do UNIX domain sockets, could be ported to >> inter-VM sockets by just changing PF_UNIX to say, PF_VIRT >> Lots of interesting details around impl & security (what VMs are allowed >> to talk to each other, whether this policy should be controlled by the >> host, or allow VMs to decide for themselves). >> > z/VM, the premium hypervisor on 390 already has this capability for > decades. This is called IUCV (inter user communication vehicle), where > user really means virtual machine. It so happens the support for > AF_IUCV was recently merged to Linux mainline. It may be worth a look, > either for using it or because learning from existing solutions is > always a good idea. 
> Is there anything that explains what the fields in sockaddr mean: sa_family_t siucv_family; unsigned short siucv_port; /* Reserved */ unsigned int siucv_addr; /* Reserved */ char siucv_nodeid[8]; /* Reserved */ char siucv_user_id[8]; /* Guest User Id */ char siucv_name[8]; /* Application Name */ Regards, Anthony Liguori > so long, > Carsten
* Re: [PATCH/RFC 7/9] Virtual network guest device driver [not found] ` <464C651F.5070700-rdkfGonbjUSkNkDKm+mE6A@public.gmane.org> @ 2007-05-21 11:11 ` Christian Borntraeger 0 siblings, 0 replies; 104+ messages in thread From: Christian Borntraeger @ 2007-05-21 11:11 UTC (permalink / raw) To: Anthony Liguori Cc: Jimi Xenidis, carsteno-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8, jmk-zzFmDc4TPjtKvsKVC3L/VUEOCMrvLtNR@public.gmane.org, kvm-devel-5NWGOfrQmneRv+LV9MX5uipxlwaOVQ5f@public.gmane.org, mschwid2-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8 [-- Attachment #1.1: Type: text/plain, Size: 1507 bytes --] Anthony Liguori <anthony-rdkfGonbjUSkNkDKm+mE6A@public.gmane.org> wrote on 17.05.2007 16:22:23: > Is there anything that explains what the fields in sockaddr mean: > > sa_family_t siucv_family; > unsigned short siucv_port; /* Reserved */ > unsigned int siucv_addr; /* Reserved */ > char siucv_nodeid[8]; /* Reserved */ > char siucv_user_id[8]; /* Guest User Id */ > char siucv_name[8]; /* Application Name */ There is a small description in "Device Drivers, Features, and Commands SC33-8289-03" on page 211 (it's page 235 if you use the PDF viewer's page numbering) http://download.boulder.ibm.com/ibmdl/pub/software/dw/linux390/docu/l26cdd03.pdf (the file is 6.7 MB) More generic information about iucv can be found in http://www-03.ibm.com/servers/eserver/zseries/zos/bkserv/zvmpdf/zvm52.html or, to be precise, http://publibz.boulder.ibm.com/epubs/pdf/hcse5b11.pdf part 2. (11 MB) That said, AF_IUCV builds on top of iucv and therefore requires z/VM as hypervisor. I don't think that KVM should implement (af_)iucv. But (af_)iucv shows several aspects of how to do things well and badly. (e.g. AF_IUCV as a protocol on top of iucv was first defined in CMS several years ago and is, therefore, not very smp-friendly. On the other hand iucv itself offers modern features like scatter/gather). Back to the old question of whether shared memory or sockets are better - I don't know.
z/VM has both; see dcss for its shared memory support.
* Re: [PATCH/RFC 7/9] Virtual network guest device driver [not found] ` <464B3F20.4030904-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org> 2007-05-16 17:38 ` Daniel P. Berrange @ 2007-05-16 17:41 ` Eric Van Hensbergen [not found] ` <a4e6962a0705161041s5393c1a6wc455b20ff3fe8106-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org> 2007-05-16 17:45 ` Gregory Haskins 2007-05-18 5:31 ` ron minnich 3 siblings, 1 reply; 104+ messages in thread From: Eric Van Hensbergen @ 2007-05-16 17:41 UTC (permalink / raw) To: Anthony Liguori Cc: Jimi Xenidis, Christian Borntraeger, jmk-zzFmDc4TPjtKvsKVC3L/VUEOCMrvLtNR@public.gmane.org, Martin Schwidefsky, kvm-devel-5NWGOfrQmneRv+LV9MX5uipxlwaOVQ5f@public.gmane.org On 5/16/07, Anthony Liguori <aliguori-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org> wrote: > > What do you think about a socket interface? I'm not sure how discovery > would work yet, but there are a few PV socket implementations for Xen at > the moment. > From a functional standpoint I don't have a huge problem with it, particularly if it's more of a pure socket and not something that tries to look like a TCP/IP endpoint -- I would prefer something closer to netlink. Sockets would allow the existing 9p stuff to pretty much work as-is. However, all that being said, I noticed some pretty big differences between sockets and shared memory in terms of overhead under Linux. If you take a look at the RPC latency graph in: http://plan9.escet.urjc.es/iwp9/cready/PROSE_iwp9_2006.pdf You'll see that a local socket implementation has about an order of magnitude worse latency than a PROSE/Libra inter-partition shared memory channel. Furthermore it will really limit our ability to trim the fat of unnecessary copies in order to have competitive performance. But perhaps there's magic you can do to eliminate that. Of course, you could always layer a socket interface for userspace simplicity on top of a more performance-optimized underlying transport that could be used directly by kernel-modules.
-eric
* Re: [PATCH/RFC 7/9] Virtual network guest device driver [not found] ` <a4e6962a0705161041s5393c1a6wc455b20ff3fe8106-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org> @ 2007-05-16 18:47 ` Anthony Liguori [not found] ` <464B51A8.7050307-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org> 0 siblings, 1 reply; 104+ messages in thread From: Anthony Liguori @ 2007-05-16 18:47 UTC (permalink / raw) To: Eric Van Hensbergen Cc: Jimi Xenidis, Christian Borntraeger, jmk-zzFmDc4TPjtKvsKVC3L/VUEOCMrvLtNR@public.gmane.org, Martin Schwidefsky, kvm-devel-5NWGOfrQmneRv+LV9MX5uipxlwaOVQ5f@public.gmane.org Eric Van Hensbergen wrote: > On 5/16/07, Anthony Liguori <aliguori-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org> wrote: >> >> What do you think about a socket interface? I'm not sure how discovery >> would work yet, but there are a few PV socket implementations for Xen at >> the moment. >> > > From a functional standpoint I don't have a huge problem with it, > particularly if its more of a pure socket and not something that tries > to look like a TCP/IP endpoint -- I would prefer something closer to > netlink. Sockets would allow the exisitng 9p stuff to pretty much > work as-is. So you would prefer assigning out types instead of using an identifier string in the sockaddr? > However, all that being said, I noticed some pretty big differences > between sockets and shared memory in terms of overhead under Linux. > > If you take a look at the RPC latency graph in: > http://plan9.escet.urjc.es/iwp9/cready/PROSE_iwp9_2006.pdf > > You'll see that a local socket implementation has about an order of > magnitude worse latency than a PROSE/Libra inter-partition shared > memory channel. You seem to suggest that the low latency is due to a very greedy (CPU hungry) polling algorithm. A poll vs. interrupt model would seem to me to be orthogonal to using sockets as an interface. > Furthermore it will really limit our ability to trim > the fat of unnecessary copies in order to have competitive > performance. 
> But perhaps there's magic you can do to eliminate that. Sockets do add copies. My initial thinking is that one can work around this by passing guest PFNs (or grant references in Xen). I'm also happy to start out focusing on "low-speed" devices. > Of course, you could always layer a socket interface for userspace > simplicity on top of a more performance-optimized underlying transport > that could be used directly by kernel-modules. Right. Regards, Anthony Liguori > -eric
* Re: [PATCH/RFC 7/9] Virtual network guest device driver [not found] ` <464B51A8.7050307-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org> @ 2007-05-16 19:33 ` Eric Van Hensbergen 0 siblings, 0 replies; 104+ messages in thread From: Eric Van Hensbergen @ 2007-05-16 19:33 UTC (permalink / raw) To: Anthony Liguori Cc: Jimi Xenidis, Christian Borntraeger, jmk-zzFmDc4TPjtKvsKVC3L/VUEOCMrvLtNR@public.gmane.org, Martin Schwidefsky, kvm-devel-5NWGOfrQmneRv+LV9MX5uipxlwaOVQ5f@public.gmane.org On 5/16/07, Anthony Liguori <aliguori-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org> wrote: > Eric Van Hensbergen wrote: > > > > From a functional standpoint I don't have a huge problem with it, > > particularly if its more of a pure socket and not something that tries > > to look like a TCP/IP endpoint -- I would prefer something closer to > > netlink. Sockets would allow the exisitng 9p stuff to pretty much > > work as-is. > > So you would prefer assigning out types instead of using an identifier > string in the sockaddr? > I wasn't really thinking that extreme, just having an assigned type for the vm sockets so that we can minimize baggage. Perhaps I'm being overzealous. > > However, all that being said, I noticed some pretty big differences > > between sockets and shared memory in terms of overhead under Linux. > > > > If you take a look at the RPC latency graph in: > > http://plan9.escet.urjc.es/iwp9/cready/PROSE_iwp9_2006.pdf > > > > You'll see that a local socket implementation has about an order of > > magnitude worse latency than a PROSE/Libra inter-partition shared > > memory channel. > > You seem to suggest that the low latency is due to a very greedy (CPU > hungry) polling algorithm. A poll vs. interrupt model would seem to me > to be orthogonal to using sockets as an interface. 
> That certainly was a theory -- I never did detailed measurements; however, there is certainly extra overhead associated with the socket path due to kernel-user space boundary crossings and additional code path length associated with socket operations. Still I'm game for comparing the alternatives. -eric
* Re: [PATCH/RFC 7/9] Virtual network guest device driver [not found] ` <464B3F20.4030904-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org> 2007-05-16 17:38 ` Daniel P. Berrange 2007-05-16 17:41 ` Eric Van Hensbergen @ 2007-05-16 17:45 ` Gregory Haskins [not found] ` <464B0ADB.BA47.005A.0-Et1tbQHTxzrQT0dZR+AlfA@public.gmane.org> 2007-05-18 5:31 ` ron minnich 3 siblings, 1 reply; 104+ messages in thread From: Gregory Haskins @ 2007-05-16 17:45 UTC (permalink / raw) To: Eric Van Hensbergen, Anthony Liguori Cc: Christian Borntraeger, Martin Schwidefsky, Jimi Xenidis, jmk-zzFmDc4TPjtKvsKVC3L/VUEOCMrvLtNR@public.gmane.org, kvm-devel-5NWGOfrQmneRv+LV9MX5uipxlwaOVQ5f@public.gmane.org >>> On Wed, May 16, 2007 at 1:28 PM, in message <464B3F20.4030904-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>, Anthony Liguori <aliguori-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org> wrote: > > What do you think about a socket interface? I'm not sure how discovery > would work yet, but there are a few PV socket implementations for Xen at > the moment. FYI: The work I am doing is exactly that. I am going to extend host-based unix domain sockets up to the KVM guest. Not sure how well it will work yet, as I had to lay the LAPIC work down first for IO-completion. -Greg ------------------------------------------------------------------------- This SF.net email is sponsored by DB2 Express Download DB2 Express C - the FREE version of DB2 express and take control of your XML. No limits. Just data. Click to get it now. http://sourceforge.net/powerbar/db2/ ^ permalink raw reply [flat|nested] 104+ messages in thread
* Re: [PATCH/RFC 7/9] Virtual network guest device driver [not found] ` <464B0ADB.BA47.005A.0-Et1tbQHTxzrQT0dZR+AlfA@public.gmane.org> @ 2007-05-16 18:39 ` Anthony Liguori [not found] ` <464B4FEB.7070300-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org> 0 siblings, 1 reply; 104+ messages in thread From: Anthony Liguori @ 2007-05-16 18:39 UTC (permalink / raw) To: Gregory Haskins Cc: Jimi Xenidis, jmk-zzFmDc4TPjtKvsKVC3L/VUEOCMrvLtNR@public.gmane.org, Christian Borntraeger, Suzanne McIntosh, Martin Schwidefsky, kvm-devel-5NWGOfrQmneRv+LV9MX5uipxlwaOVQ5f@public.gmane.org Gregory Haskins wrote: >>>> On Wed, May 16, 2007 at 1:28 PM, in message <464B3F20.4030904-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>, >>>> > Anthony Liguori <aliguori-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org> wrote: > >> What do you think about a socket interface? I'm not sure how discovery >> would work yet, but there are a few PV socket implementations for Xen at >> the moment. >> > > FYI: The work I am doing is exactly that. I am going to extend host-based unix domain sockets up to the KVM guest. Not sure how well it will work yet, as I had to lay the LAPIC work down first for IO-completion. > Do you plan on introducing a new address family in the guest? Regards, Anthony Liguori > -Greg
* Re: [PATCH/RFC 7/9] Virtual network guest device driver [not found] ` <464B4FEB.7070300-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org> @ 2007-05-16 18:57 ` Gregory Haskins [not found] ` <464B1B9C.BA47.005A.0-Et1tbQHTxzrQT0dZR+AlfA@public.gmane.org> 0 siblings, 1 reply; 104+ messages in thread From: Gregory Haskins @ 2007-05-16 18:57 UTC (permalink / raw) To: Anthony Liguori Cc: Jimi Xenidis, jmk-zzFmDc4TPjtKvsKVC3L/VUEOCMrvLtNR@public.gmane.org, Christian Borntraeger, Suzanne McIntosh, Martin Schwidefsky, kvm-devel-5NWGOfrQmneRv+LV9MX5uipxlwaOVQ5f@public.gmane.org >>> On Wed, May 16, 2007 at 2:39 PM, in message <464B4FEB.7070300-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>, Anthony Liguori <aliguori-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org> wrote: > Gregory Haskins wrote: >>>>> On Wed, May 16, 2007 at 1:28 PM, in message <464B3F20.4030904-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>, >>>>> >> Anthony Liguori <aliguori-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org> wrote: >> >>> What do you think about a socket interface? I'm not sure how discovery >>> would work yet, but there are a few PV socket implementations for Xen at >>> the moment. >>> >> >> FYI: The work I am doing is exactly that. I am going to extend host-based > unix domain sockets up to the KVM guest. Not sure how well it will work yet, > as I had to lay the LAPIC work down first for IO-completion. >> > > Do you plan on introducing a new address family in the guest? Well, since I had to step back and lay some infrastructure groundwork I haven't vetted this approach yet...so it's possible what I am about to say is relatively naive: But my primary application is to create a guest-kernel to host IVMC. For that you can just think of the guest as any other process on the host, and it will just use the sockets normally as any host-process would. There might be some thunking that has to happen to deal with gpa vs va, etc, but otherwise it's a standard consumer.
If you want to extend IVMC up to guest-userspace, I think making some kind of new socket family makes sense in the guest's stack. PF_VIRT, like someone else suggested, for instance. But since I don't need this type of IVMC, I haven't really thought about this too much. -Greg
* Re: [PATCH/RFC 7/9] Virtual network guest device driver [not found] ` <464B1B9C.BA47.005A.0-Et1tbQHTxzrQT0dZR+AlfA@public.gmane.org> @ 2007-05-16 19:10 ` Anthony Liguori [not found] ` <464B572C.6090800-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org> 0 siblings, 1 reply; 104+ messages in thread From: Anthony Liguori @ 2007-05-16 19:10 UTC (permalink / raw) To: Gregory Haskins Cc: Jimi Xenidis, jmk-zzFmDc4TPjtKvsKVC3L/VUEOCMrvLtNR@public.gmane.org, Christian Borntraeger, Suzanne McIntosh, Martin Schwidefsky, kvm-devel-5NWGOfrQmneRv+LV9MX5uipxlwaOVQ5f@public.gmane.org Gregory Haskins wrote: >>>> On Wed, May 16, 2007 at 2:39 PM, in message <464B4FEB.7070300-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>, >>>> > Anthony Liguori <aliguori-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org> wrote: > >> Gregory Haskins wrote: >> >>>>>> On Wed, May 16, 2007 at 1:28 PM, in message <464B3F20.4030904-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>, >>>>>> >>>>>> >>> Anthony Liguori <aliguori-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org> wrote: >>> >>> >>>> What do you think about a socket interface? I'm not sure how discovery >>>> would work yet, but there are a few PV socket implementations for Xen at >>>> the moment. >>>> >>>> >>> FYI: The work I am doing is exactly that. I am going to extend host- based >>> >> unix domain sockets up to the KVM guest. Not sure how well it will work yet, >> as I had to lay the LAPIC work down first for IO- completion. >> >>> >>> >> Do you plan on introducing a new address family in the guest? >> > > Well, since I had to step back and lay some infrastructure groundwork I haven't vetted this approach yet...so its possible what I am about to say is relatively naive: But my primary application is to create a guest-kernel to host IVMC. This is quite easy with KVM. I like the approach that vmchannel has taken. A simple PCI device. 
That gives you a discovery mechanism for shared memory and an interrupt and then you can just implement a ring queue using those mechanisms (along with a PIO port for signaling from the guest to the host). So given that underlying mechanism, the question is how to expose that within the guest kernel/userspace and within the host. For the host, you can probably stay entirely within QEMU. Interguest communication would be a bit tricky but guest->host communication is real simple. You could stop at exposing the channel as a socket within the guest kernel/userspace. That would work, but you may also want to expose the ring queue within the kernel at least if there are consumers that need to avoid the copy. A tricky bit of this is how to do discovery. If you want to support interguest communication, it's not really sufficient to just use strings since the identifiers would have to be unique throughout the entire system. Maybe you just leave it as a guest=>host channel and be done with it. Regards, Anthony Liguori > For that you can just think of the guest as any other process on the host, and it will just use the sockets normally as any host-process would. There might be some thunking that has to happen to deal with gpa vs va, etc, but otherwise its a standard consumer. If you want to extend IVMC up to guest-userspace, I think making some kind of new socket family makes sense in the guests stack. PF_VIRT like someone else suggested, for instance. But since I dont need this type of IVMC I haven't really thought about this too much. > > -Greg
* Re: [PATCH/RFC 7/9] Virtual network guest device driver [not found] ` <464B572C.6090800-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org> @ 2007-05-17 4:24 ` Rusty Russell [not found] ` <1179375881.21871.83.camel-bi+AKbBUZKY6gyzm1THtWbp2dZbC/Bob@public.gmane.org> 2007-05-21 9:07 ` Christian Borntraeger 1 sibling, 1 reply; 104+ messages in thread From: Rusty Russell @ 2007-05-17 4:24 UTC (permalink / raw) To: Anthony Liguori Cc: Jimi Xenidis, jmk-zzFmDc4TPjtKvsKVC3L/VUEOCMrvLtNR@public.gmane.org, Christian Borntraeger, kvm-devel-5NWGOfrQmneRv+LV9MX5uipxlwaOVQ5f@public.gmane.org, Suzanne McIntosh, Martin Schwidefsky On Wed, 2007-05-16 at 14:10 -0500, Anthony Liguori wrote: > For the host, you can probably stay entirely within QEMU. Interguest > communication would be a bit tricky but guest->host communication is > real simple. guest->host is always simple. But it'd be great if it didn't matter to the guest whether it's talking to the host or another guest. I think shared memory is an obvious start, but it's not enough for inter-guest where they can't freely access each other's memory. So you really want a ring-buffer of descriptors with a hypervisor-assist to say "read/write this into the memory referred to by that descriptor". I think this can be done as a simple variation of the current schemes in existence. But I'm shutting up until I have some demonstration code 8) > A tricky bit of this is how to do discovery. If you want to support > interguest communication, it's not really sufficient to just use strings > since they identifiers would have to be unique throughout the entire > system. Maybe you just leave it as a guest=>host channel and be done > with it. Hmm, I was going to leave that unspecified. One thing at a time... Rusty. ------------------------------------------------------------------------- This SF.net email is sponsored by DB2 Express Download DB2 Express C - the FREE version of DB2 express and take control of your XML. No limits. Just data. 
Click to get it now. http://sourceforge.net/powerbar/db2/
* Re: [PATCH/RFC 7/9] Virtual network guest device driver [not found] ` <1179375881.21871.83.camel-bi+AKbBUZKY6gyzm1THtWbp2dZbC/Bob@public.gmane.org> @ 2007-05-17 16:13 ` Anthony Liguori [not found] ` <464C7F45.50908-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org> 0 siblings, 1 reply; 104+ messages in thread From: Anthony Liguori @ 2007-05-17 16:13 UTC (permalink / raw) To: Rusty Russell Cc: Jimi Xenidis, Anthony Liguori, jmk-zzFmDc4TPjtKvsKVC3L/VUEOCMrvLtNR@public.gmane.org, Christian Borntraeger, kvm-devel-5NWGOfrQmneRv+LV9MX5uipxlwaOVQ5f@public.gmane.org, Suzanne McIntosh

Rusty Russell wrote:
> On Wed, 2007-05-16 at 14:10 -0500, Anthony Liguori wrote:
>
>> For the host, you can probably stay entirely within QEMU. Interguest
>> communication would be a bit tricky but guest->host communication is
>> real simple.
>
> guest->host is always simple. But it'd be great if it didn't matter to
> the guest whether it's talking to the host or another guest.
>
> I think shared memory is an obvious start, but it's not enough for
> inter-guest where they can't freely access each other's memory. So you
> really want a ring-buffer of descriptors with a hypervisor-assist to say
> "read/write this into the memory referred to by that descriptor".

I think this is getting a little ahead of ourselves. An example of this idea is pretty straight-forward, but it gets more complicated when trying to support the existing memory sharing mechanisms on various hypervisors. There are a few cases to consider:

1) The target VM can access all of the memory of the guest VM with no penalty. This is the case when going from guest=>QEMU in KVM or going from guest=>kernel (ignoring highmem) in KVM. For this, you can send arbitrary memory to the host.

2) The target VM can access all of the memory of the guest VM with a penalty. For guest=>other userspace process in KVM, an mmap() would be required. This would work for Xen provided the target VM was domain-0, but it would incur a xc_map_foreign_range().

3) The target and source VM can only share memory based on an existing pool. This is the case with Xen and grant tables.

I think an API that covers these three cases is a bit tricky and will likely make undesired trade-offs. I think it's easier to start out focusing on the "low-speed" case where there's a mandatory data-copy. You can still pass gntref's or PFNs down this transport if you like, and perhaps down the road we'll find that we can make a common interface for doing this sort of thing.

Regards, Anthony Liguori

> I think this can be done as a simple variation of the current schemes in
> existence.
>
> But I'm shutting up until I have some demonstration code 8)
>
>> A tricky bit of this is how to do discovery. If you want to support
>> interguest communication, it's not really sufficient to just use strings
>> since the identifiers would have to be unique throughout the entire
>> system. Maybe you just leave it as a guest=>host channel and be done
>> with it.
>
> Hmm, I was going to leave that unspecified. One thing at a time...
>
> Rusty.
* Re: [PATCH/RFC 7/9] Virtual network guest device driver [not found] ` <464C7F45.50908-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org> @ 2007-05-17 23:34 ` Rusty Russell 0 siblings, 0 replies; 104+ messages in thread From: Rusty Russell @ 2007-05-17 23:34 UTC (permalink / raw) To: Anthony Liguori Cc: Jimi Xenidis, jmk-zzFmDc4TPjtKvsKVC3L/VUEOCMrvLtNR@public.gmane.org, Christian Borntraeger, kvm-devel-5NWGOfrQmneRv+LV9MX5uipxlwaOVQ5f@public.gmane.org, Suzanne McIntosh, Martin Schwidefsky On Thu, 2007-05-17 at 11:13 -0500, Anthony Liguori wrote: > Rusty Russell wrote: > > I think shared memory is an obvious start, but it's not enough for > > inter-guest where they can't freely access each other's memory. So you > > really want a ring-buffer of descriptors with a hypervisor-assist to say > > "read/write this into the memory referred to by that descriptor". > > I think this is getting a little ahead of ourselves. An example of this > idea is pretty straight-forward but it gets more complicated when trying > to support the existing memory sharing mechanisms on various > hypervisors. There are a few cases to consider: To clarify, I'm not overly interested in existing mechanisms. I'm first trying for something sane from a Linux driver POV, then see if it can be implemented in terms of legacy systems. This reflects my belief that we will see more virtualization solutions in the medium term, so it's reasonable to look at a new system. Cheers, Rusty. ------------------------------------------------------------------------- This SF.net email is sponsored by DB2 Express Download DB2 Express C - the FREE version of DB2 express and take control of your XML. No limits. Just data. Click to get it now. http://sourceforge.net/powerbar/db2/ ^ permalink raw reply [flat|nested] 104+ messages in thread
* Re: [PATCH/RFC 7/9] Virtual network guest device driver [not found] ` <464B572C.6090800-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org> 2007-05-17 4:24 ` Rusty Russell @ 2007-05-21 9:07 ` Christian Borntraeger [not found] ` <OFC1AADF6F.DB57C7AC-ON422572E2.0030BE22-422572E2.0032174C-tA70FqPdS9bQT0dZR+AlfA@public.gmane.org> 1 sibling, 1 reply; 104+ messages in thread From: Christian Borntraeger @ 2007-05-21 9:07 UTC (permalink / raw) To: Anthony Liguori Cc: Jimi Xenidis, jmk-zzFmDc4TPjtKvsKVC3L/VUEOCMrvLtNR@public.gmane.org, mschwid2-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8, kvm-devel-5NWGOfrQmneRv+LV9MX5uipxlwaOVQ5f@public.gmane.org, Suzanne McIntosh

> This is quite easy with KVM. I like the approach that vmchannel has
> taken. A simple PCI device. That gives you a discovery mechanism for
> shared memory and an interrupt and then you can just implement a ring
> queue using those mechanisms (along with a PIO port for signaling from
> the guest to the host). So given that underlying mechanism, the
> question is how to expose that within the guest kernel/userspace and
> within the host.

Sorry for answering late, but I don't like PCI as a device bus for all platforms. s390 has no PCI and s390 has no PIO. I would prefer a new simple hypercall-based virtual bus. I don't know much about Windows driver programming, but I guess it is not that hard to add a new bus.
* Re: [PATCH/RFC 7/9] Virtual network guest device driver [not found] ` <OFC1AADF6F.DB57C7AC-ON422572E2.0030BE22-422572E2.0032174C-tA70FqPdS9bQT0dZR+AlfA@public.gmane.org> @ 2007-05-21 9:27 ` Cornelia Huck 2007-05-21 11:28 ` Arnd Bergmann 1 sibling, 0 replies; 104+ messages in thread From: Cornelia Huck @ 2007-05-21 9:27 UTC (permalink / raw) To: Christian Borntraeger Cc: Jimi Xenidis, jmk-zzFmDc4TPjtKvsKVC3L/VUEOCMrvLtNR@public.gmane.org, kvm-devel-5NWGOfrQmneRv+LV9MX5uipxlwaOVQ5f@public.gmane.org, mschwid2-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8, Suzanne McIntosh On Mon, 21 May 2007 11:07:07 +0200, Christian Borntraeger <CBORNTRA-tA70FqPdS9bQT0dZR+AlfA@public.gmane.org> wrote: > > This is quite easy with KVM. I like the approach that vmchannel has > > taken. A simple PCI device. That gives you a discovery mechanism for > > shared memory and an interrupt and then you can just implement a ring > > queue using those mechanisms (along with a PIO port for signaling from > > the guest to the host). So given that underlying mechanism, the > > question is how to expose that within the guest kernel/userspace and > > within the host. > > Sorry for answering late, but I dont like PCI as a device bus for all > platforms. s390 has no PCI and s390 has no PIO. I would prefer a new > simple hypercall based virtual bus. I dont know much about windows > driver programming, but I guess it it is not that hard to add a new bus. Agreed. Moreover, if you have an existing OS running on a non-pci platform, it will be far more likely that they will be able to write a driver against a simple hypercall-based bus than to cook up a full-blown pci interface. ------------------------------------------------------------------------- This SF.net email is sponsored by DB2 Express Download DB2 Express C - the FREE version of DB2 express and take control of your XML. No limits. Just data. Click to get it now. http://sourceforge.net/powerbar/db2/ ^ permalink raw reply [flat|nested] 104+ messages in thread
* Re: [PATCH/RFC 7/9] Virtual network guest device driver [not found] ` <OFC1AADF6F.DB57C7AC-ON422572E2.0030BE22-422572E2.0032174C-tA70FqPdS9bQT0dZR+AlfA@public.gmane.org> 2007-05-21 9:27 ` Cornelia Huck @ 2007-05-21 11:28 ` Arnd Bergmann [not found] ` <200705211328.04565.arnd-r2nGTMty4D4@public.gmane.org> 1 sibling, 1 reply; 104+ messages in thread From: Arnd Bergmann @ 2007-05-21 11:28 UTC (permalink / raw) To: kvm-devel-5NWGOfrQmneRv+LV9MX5uipxlwaOVQ5f Cc: Jimi Xenidis, jmk-zzFmDc4TPjtKvsKVC3L/VUEOCMrvLtNR@public.gmane.org, Christian Borntraeger, mschwid2-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8, Suzanne McIntosh On Monday 21 May 2007, Christian Borntraeger wrote: > > This is quite easy with KVM. I like the approach that vmchannel has > > taken. A simple PCI device. That gives you a discovery mechanism for > > shared memory and an interrupt and then you can just implement a ring > > queue using those mechanisms (along with a PIO port for signaling from > > the guest to the host). So given that underlying mechanism, the > > question is how to expose that within the guest kernel/userspace and > > within the host. > > Sorry for answering late, but I dont like PCI as a device bus for all > platforms. s390 has no PCI and s390 has no PIO. I would prefer a new > simple hypercall based virtual bus. I dont know much about windows > driver programming, but I guess it it is not that hard to add a new bus. We've had the same discussion about PCI as virtual device abstraction recently when hpa made the suggestions to get a set of PCI device numbers registered for Linux. IIRC, the conclusion to which we came was that it is indeed helpful for most architecture to have a PCI device as one way to probe for the functionality, but not to rely on it. s390 is the obvious example where you can't have PCI, but you may also want to build a guest kernel without PCI support because of space constraints in a many-guests machine. 
What I think would be ideal is to have a new bus type in Linux that does not have any dependency on PCI itself, but can be easily implemented as a child of a PCI device.

If we only need the stuff mentioned by Anthony, the interface could look like

	struct vmchannel_device {
		struct resource virt_mem;
		struct vm_device_id id;
		int irq;
		int (*signal)(struct vmchannel_device *);
		int (*irq_ack)(struct vmchannel_device *);
		struct device dev;
	};

Such a device can easily be provided as a child of a PCI device, or as something that is purely virtual based on an hcall interface.

	Arnd <><
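The point of Arnd's abstraction can be modeled in a few lines of userspace C: driver code calls dev->signal(dev) without knowing whether the backend kicks a PIO port or issues a hypercall. The counter-based stand-ins for outb() and the hypercall, and all names here, are illustrative assumptions rather than kernel API.

```c
#include <assert.h>

/* Userspace model of the proposed bus: the driver sees only the
 * vmchannel_device and its ops; how signaling reaches the hypervisor
 * is hidden in the backend that filled in the function pointer. */
struct vmchannel_device {
	int irq;
	int (*signal)(struct vmchannel_device *);
};

/* Backend 1: PCI-style, "writes" to an I/O port. */
static int pio_kicks;
static int pci_signal(struct vmchannel_device *dev)
{
	(void)dev;
	pio_kicks++;      /* stands in for outb(1, signal_ioport) */
	return 0;
}

/* Backend 2: hypercall-style, as s390 would need. */
static int hcall_kicks;
static int hcall_signal(struct vmchannel_device *dev)
{
	(void)dev;
	hcall_kicks++;    /* stands in for a diagnose/hypercall */
	return 0;
}

/* Driver code: identical regardless of which backend created the device. */
static int driver_send(struct vmchannel_device *dev)
{
	/* ...fill the shared ring here, then notify the other side... */
	return dev->signal(dev);
}
```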
* Re: [PATCH/RFC 7/9] Virtual network guest device driver [not found] ` <200705211328.04565.arnd-r2nGTMty4D4@public.gmane.org> @ 2007-05-21 11:56 ` Cornelia Huck [not found] ` <20070521135628.17a4f9cc-XQvu0L+U/CiXI4yAdoq52KN5r0PSdgG1zG2AekJRRhI@public.gmane.org> 2007-05-21 18:45 ` Anthony Liguori 1 sibling, 1 reply; 104+ messages in thread From: Cornelia Huck @ 2007-05-21 11:56 UTC (permalink / raw) To: Arnd Bergmann Cc: Jimi Xenidis, jmk-zzFmDc4TPjtKvsKVC3L/VUEOCMrvLtNR@public.gmane.org, kvm-devel-5NWGOfrQmneRv+LV9MX5uipxlwaOVQ5f, mschwid2-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8, Christian Borntraeger, Suzanne McIntosh On Mon, 21 May 2007 13:28:03 +0200, Arnd Bergmann <arnd-r2nGTMty4D4@public.gmane.org> wrote: > We've had the same discussion about PCI as virtual device abstraction > recently when hpa made the suggestions to get a set of PCI device > numbers registered for Linux. (If you want to read it up, it's the thread at http://marc.info/?t=117554525400003&r=1&w=2) > > IIRC, the conclusion to which we came was that it is indeed helpful > for most architecture to have a PCI device as one way to probe for > the functionality, but not to rely on it. s390 is the obvious > example where you can't have PCI, but you may also want to build > a guest kernel without PCI support because of space constraints > in a many-guests machine. > > What I think would be ideal is to have a new bus type in Linux > that does not have any dependency on PCI itself, but can be > easily implemented as a child of a PCI device. 
>
> If we only need the stuff mentioned by Anthony, the interface could
> look like
>
> struct vmchannel_device {
> 	struct resource virt_mem;
> 	struct vm_device_id id;
> 	int irq;
  	^^^^^^^^
> 	int (*signal)(struct vmchannel_device *);
> 	int (*irq_ack)(struct vmchannel_device *);
> 	struct device dev;
> };

IRQ numbers are evil :) It should be more like a

	void *vmchannel_device_handle;

which could be different things depending on what we want the vmchannel_device to be a child of (it could be an IRQ number for PCI devices, or something like subchannel_id if we wanted to support channel devices).

> Such a device can easily be provided as a child of a PCI device,
> or as something that is purely virtual based on an hcall interface.

This looks like a flexible approach.
* Re: [PATCH/RFC 7/9] Virtual network guest device driver [not found] ` <20070521135628.17a4f9cc-XQvu0L+U/CiXI4yAdoq52KN5r0PSdgG1zG2AekJRRhI@public.gmane.org> @ 2007-05-21 13:53 ` Arnd Bergmann 0 siblings, 0 replies; 104+ messages in thread From: Arnd Bergmann @ 2007-05-21 13:53 UTC (permalink / raw) To: Cornelia Huck Cc: Jimi Xenidis, jmk-zzFmDc4TPjtKvsKVC3L/VUEOCMrvLtNR@public.gmane.org, kvm-devel-5NWGOfrQmneRv+LV9MX5uipxlwaOVQ5f, mschwid2-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8, Christian Borntraeger, Suzanne McIntosh On Monday 21 May 2007, Cornelia Huck wrote: > IRQ numbers are evil :) yes, but getting rid of them is an entirely different discussion. I really think that in the first step, you should be able to use its "external interrupts" with the same request_irq interface as the other architectures. Fundamentally, the s390 architecture has external interrupt numbers as well, you're just using a different interface for registering them. The ccw devices obviously have a better interface already, but that doesn't help you here. > It should be more like a > void *vmchannel_device_handle; > which could be different things depending on what we want the > vmchannel_device to be a child of (it could be an IRQ number for > PCI devices, or something like subchannel_id if we wanted to > support channel devices). No, the driver needs to know how to get at the interrupt without caring about the bus implementation, that's why you either need to have a callback function set by the driver (like s390 CCW or USB have it), or visible interrupt number (like everyone does). There is no need for a pointer back to a vmchannel_device_handle, all information needed by the bus layer can simply be in a subclass derived from the vmchannel_device, e.g. 
	struct vmchannel_pci {
		struct pci_device *parent; /* shortcut, same as to_pci_dev(&this.vmdev.dev.parent) */
		unsigned long signal_ioport; /* for interrupt generation */
		struct vmchannel_device vmdev;
	};

You would allocate this structure in the pci_driver that registers the vmchannel_device.

	Arnd <><
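The subclassing Arnd describes relies on the container_of idiom: the generic struct is embedded inside the bus-private one, and the bus recovers its wrapper by pointer arithmetic instead of storing a back-pointer in the generic struct. A minimal userspace sketch, with kernel types dropped and names purely illustrative:

```c
#include <assert.h>
#include <stddef.h>

/* Userspace re-definition of the kernel's container_of: given a pointer
 * to a member, recover a pointer to the structure embedding it. */
#define container_of(ptr, type, member) \
	((type *)((char *)(ptr) - offsetof(type, member)))

struct vmchannel_device {
	int irq;
};

/* Bus-private wrapper embedding the generic device. */
struct vmchannel_pci {
	unsigned long signal_ioport;   /* for interrupt generation */
	struct vmchannel_device vmdev; /* embedded generic part */
};

/* Bus code gets the generic pointer back from the driver and recovers
 * its private state without any back-pointer stored in vmdev. */
static unsigned long ioport_of(struct vmchannel_device *vmdev)
{
	struct vmchannel_pci *pdev =
		container_of(vmdev, struct vmchannel_pci, vmdev);
	return pdev->signal_ioport;
}
```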
* Re: [PATCH/RFC 7/9] Virtual network guest device driver [not found] ` <200705211328.04565.arnd-r2nGTMty4D4@public.gmane.org> 2007-05-21 11:56 ` Cornelia Huck @ 2007-05-21 18:45 ` Anthony Liguori [not found] ` <4651E8D1.4010208-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org> 1 sibling, 1 reply; 104+ messages in thread From: Anthony Liguori @ 2007-05-21 18:45 UTC (permalink / raw) To: Arnd Bergmann Cc: Jimi Xenidis, jmk-zzFmDc4TPjtKvsKVC3L/VUEOCMrvLtNR@public.gmane.org, kvm-devel-5NWGOfrQmneRv+LV9MX5uipxlwaOVQ5f, mschwid2-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8, Christian Borntraeger, Suzanne McIntosh Arnd Bergmann wrote: > On Monday 21 May 2007, Christian Borntraeger wrote: > >>> This is quite easy with KVM. I like the approach that vmchannel has >>> taken. A simple PCI device. That gives you a discovery mechanism for >>> shared memory and an interrupt and then you can just implement a ring >>> queue using those mechanisms (along with a PIO port for signaling from >>> the guest to the host). So given that underlying mechanism, the >>> question is how to expose that within the guest kernel/userspace and >>> within the host. >>> >> Sorry for answering late, but I dont like PCI as a device bus for all >> platforms. s390 has no PCI and s390 has no PIO. Right, I'm not interested in the lowest level implementation (PCI device + PIO). I'm more interested in the higher level interface. The goal is to allow drivers to be able to be written to the higher level interface so that they work on any platform that implements the lower level interface. On x86, that would be PCI/PIO. On s390, that could be hypercall based. >> I would prefer a new >> simple hypercall based virtual bus. I dont know much about windows >> driver programming, but I guess it it is not that hard to add a new bus. >> > > We've had the same discussion about PCI as virtual device abstraction > recently when hpa made the suggestions to get a set of PCI device > numbers registered for Linux. 
>
> IIRC, the conclusion to which we came was that it is indeed helpful
> for most architectures to have a PCI device as one way to probe for
> the functionality, but not to rely on it. s390 is the obvious
> example where you can't have PCI, but you may also want to build
> a guest kernel without PCI support because of space constraints
> in a many-guests machine.
>
> What I think would be ideal is to have a new bus type in Linux
> that does not have any dependency on PCI itself, but can be
> easily implemented as a child of a PCI device.
>
> If we only need the stuff mentioned by Anthony, the interface could
> look like
>
> struct vmchannel_device {
> 	struct resource virt_mem;
> 	struct vm_device_id id;
> 	int irq;
> 	int (*signal)(struct vmchannel_device *);
> 	int (*irq_ack)(struct vmchannel_device *);
> 	struct device dev;
> };
>
> Such a device can easily be provided as a child of a PCI device,
> or as something that is purely virtual based on an hcall interface.

Yes, this is close to what I was thinking. I'm not sure that this particular interface can encompass the variety of memory sharing mechanisms though. When I mentioned shared memory via the PCI device, I was referring to the memory needed for bootstrapping the device. You still need a mechanism to transfer memory for things like zero-copy disk IO and network devices. This may involve passing memory addresses directly, copying data, or page flipping.

This leads me to think that a higher-level interface that provided a data passing interface would be more useful. Something like:

	struct vmchannel_device {
		struct vm_device_id id;
		int (*open)(struct vmchannel_device *, const char *name, const char *service);
		int (*release)(struct vmchannel_device *);
		ssize_t (*sendmsg)(struct vmchannel_device *, const void *, size_t);
		ssize_t (*recvmsg)(struct vmchannel_device *, void *, size_t);
		struct device dev;
	};

The consuming interface of this would be a socket (PF_VIRTLINK).
The sockaddr would contain a name identifying a VM and a service description. This doesn't address the memory issues I raised above, but I think it would be easier to special-case the drivers where it mattered.

For instance, on x86 KVM, a PV disk driver front end would consist of connecting to a virtlink socket, and then transferring struct bio's. QEMU instances would listen on the virtlink socket in the host, and service them directly (QEMU can access all of the guest's memory directly in userspace). A PV graphics device could just be a VNC server that listened on a virtlink socket.

Regards, Anthony Liguori

> Arnd <><
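The copy-based contract behind this interface can be sketched in userspace C. This only models the sendmsg/recvmsg semantics proposed here, assuming a mandatory data copy on both sides; the struct and function names are illustrative, and a real PF_VIRTLINK implementation would of course sit behind the kernel's socket layer.

```c
#include <assert.h>
#include <string.h>
#include <sys/types.h>

/* Toy single-message channel: sendmsg copies into a per-channel buffer,
 * recvmsg copies out.  No blocking, no queueing -- just enough to show
 * the "low-speed" data-copy contract. */
struct vmchannel {
	char buf[256];
	size_t len;
};

static ssize_t vmc_sendmsg(struct vmchannel *c, const void *msg, size_t n)
{
	if (n > sizeof(c->buf))
		return -1;              /* would be -EMSGSIZE in the kernel */
	memcpy(c->buf, msg, n);         /* mandatory copy: no page sharing assumed */
	c->len = n;
	return (ssize_t)n;
}

static ssize_t vmc_recvmsg(struct vmchannel *c, void *msg, size_t n)
{
	size_t take = c->len < n ? c->len : n;
	memcpy(msg, c->buf, take);      /* second copy, out to the receiver */
	c->len = 0;
	return (ssize_t)take;
}
```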
* Re: [PATCH/RFC 7/9] Virtual network guest device driver [not found] ` <4651E8D1.4010208-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org> @ 2007-05-21 23:09 ` ron minnich [not found] ` <13426df10705211609j613032c6j373d9a4660f8ec6c-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org> 0 siblings, 1 reply; 104+ messages in thread From: ron minnich @ 2007-05-21 23:09 UTC (permalink / raw) To: Anthony Liguori Cc: Jimi Xenidis, jmk-zzFmDc4TPjtKvsKVC3L/VUEOCMrvLtNR@public.gmane.org, kvm-devel-5NWGOfrQmneRv+LV9MX5uipxlwaOVQ5f, mschwid2-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8, Christian Borntraeger, Suzanne McIntosh OK, so what are we doing here? We're using a PCI abstraction, as a common abstraction,which is not common really, because we don't have a common abstraction? So we describe all these non-pci resources with a pci abstraction? I don't get it at all. I really think the resource interface idea I mentioned, which is borrowed from Plan 9, makes a whole lot more sense. IBM Austin has already shown it in practice in the papers I referenced. It can work. A memory channel at the bottom, with a resource sharing protocol (9p) above it, and then you describe your resources via names and a simple file-directory model. Note that PCI sort of tries to do this tree model, but it's all binary, and, as noted, it's hardly universal. All of this is trivially exported over a network, so the use of shared memory channels in no way rules out network access. Plan 9 exports devices over the network routinely. If you're using a PCI abstraction, something has gone badly wrong I think. thanks ron ------------------------------------------------------------------------- This SF.net email is sponsored by DB2 Express Download DB2 Express C - the FREE version of DB2 express and take control of your XML. No limits. Just data. Click to get it now. http://sourceforge.net/powerbar/db2/ ^ permalink raw reply [flat|nested] 104+ messages in thread
* Re: [PATCH/RFC 7/9] Virtual network guest device driver [not found] ` <13426df10705211609j613032c6j373d9a4660f8ec6c-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org> @ 2007-05-22 0:29 ` Anthony Liguori [not found] ` <46523952.7070405-rdkfGonbjUSkNkDKm+mE6A@public.gmane.org> 0 siblings, 1 reply; 104+ messages in thread From: Anthony Liguori @ 2007-05-22 0:29 UTC (permalink / raw) To: ron minnich Cc: Jimi Xenidis, jmk-zzFmDc4TPjtKvsKVC3L/VUEOCMrvLtNR@public.gmane.org, kvm-devel-5NWGOfrQmneRv+LV9MX5uipxlwaOVQ5f, mschwid2-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8, Christian Borntraeger, Suzanne McIntosh ron minnich wrote: > OK, so what are we doing here? We're using a PCI abstraction, as a > common abstraction,which is not common really, because we don't have a > common abstraction? So we describe all these non-pci resources with a > pci abstraction? > No. You're confusing PV device discovery with the actual paravirtual transport. In a fully virtual environment like KVM, a PCI bus is present. You need some way for the guest to detect that a PV device is present. The most natural way to do this IMHO is to have an entry for the PV device in the PCI bus. That will make a lot of existing code happy. Once you've identified that the device exists, you're free to do whatever you want with it. Regards, Anthony Liguori > I don't get it at all. I really think the resource interface idea I > mentioned, which is borrowed from Plan 9, makes a whole lot more > sense. IBM Austin has already shown it in practice in the papers I > referenced. It can work. A memory channel at the bottom, with a > resource sharing protocol (9p) above it, and then you describe your > resources via names and a simple file-directory model. Note that PCI > sort of tries to do this tree model, but it's all binary, and, as > noted, it's hardly universal. > > All of this is trivially exported over a network, so the use of shared > memory channels in no way rules out network access. 
Plan 9 exports > devices over the network routinely. > > If you're using a PCI abstraction, something has gone badly wrong I think. > > thanks > > ron
* Re: [PATCH/RFC 7/9] Virtual network guest device driver [not found] ` <46523952.7070405-rdkfGonbjUSkNkDKm+mE6A@public.gmane.org> @ 2007-05-22 0:45 ` ron minnich [not found] ` <13426df10705211745r69acc95ai458b2192fe0d0132-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org> 2007-05-22 1:34 ` Eric Van Hensbergen 1 sibling, 1 reply; 104+ messages in thread From: ron minnich @ 2007-05-22 0:45 UTC (permalink / raw) To: Anthony Liguori Cc: Jimi Xenidis, jmk-zzFmDc4TPjtKvsKVC3L/VUEOCMrvLtNR@public.gmane.org, kvm-devel-5NWGOfrQmneRv+LV9MX5uipxlwaOVQ5f, mschwid2-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8, Christian Borntraeger, Suzanne McIntosh On 5/21/07, Anthony Liguori <anthony-rdkfGonbjUSkNkDKm+mE6A@public.gmane.org> wrote: > ron minnich wrote: > > OK, so what are we doing here? We're using a PCI abstraction, as a > > common abstraction,which is not common really, because we don't have a > > common abstraction? So we describe all these non-pci resources with a > > pci abstraction? > > > > No. You're confusing PV device discovery with the actual paravirtual > transport. In a fully virtual environment like KVM, a PCI bus is > present. You need some way for the guest to detect that a PV device is > present. The most natural way to do this IMHO is to have an entry for > the PV device in the PCI bus. That will make a lot of existing code happy. > I don't think I am confusing it, now that you've explained it more fully. I'm even less happy with it :-) How will I explain this sort of thing to my grandchildren? :-) "grandpop, why do those PV devices look like a bus defined in 1994?" Why would you not have, e.g., a 9p server for PV device "config space" as well? I actually implemented that on Xen -- it was quite trivial, and it makes more sense -- to me anyway -- than pretending a PV device is something it's not. 
What is happening, it seems to me, is that people are still trying to use an abstraction -- "PCI device" -- which is not really an abstraction, to model aspects of PV device discovery, enumeration, configuration and operation. I'm still pretty uncomfortable with it -- well, honestly, it seems kind of gross to me. It's just as easy to build the right abstraction underneath all this, and then, for those OSes that have existing code that needs to be happy, present that abstraction as a PCI bus. But making the PCI bus the underlying abstraction is getting the order inverted, I believe. I realize that PCI device space is a pretty handy way to do this, that it is very convenient. I wonder what happens when you get a system without enough "holes" in the config space for you to hide the PV devices in, or that has some other weird property that breaks this model. I've already worked with one system that had 32 PCI busses. There are other hypervisors that made convenient choices over the right choice, and they are paying for it. Let's try to avoid that on kvm. Kvm has so much going for it right now. thanks ron
* Re: [PATCH/RFC 7/9] Virtual network guest device driver [not found] ` <13426df10705211745r69acc95ai458b2192fe0d0132-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org> @ 2007-05-22 1:13 ` Anthony Liguori 0 siblings, 0 replies; 104+ messages in thread From: Anthony Liguori @ 2007-05-22 1:13 UTC (permalink / raw) To: ron minnich Cc: Jimi Xenidis, jmk-zzFmDc4TPjtKvsKVC3L/VUEOCMrvLtNR@public.gmane.org, kvm-devel-5NWGOfrQmneRv+LV9MX5uipxlwaOVQ5f, mschwid2-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8, Christian Borntraeger, Suzanne McIntosh ron minnich wrote: > On 5/21/07, Anthony Liguori <anthony-rdkfGonbjUSkNkDKm+mE6A@public.gmane.org> wrote: >> No. You're confusing PV device discovery with the actual paravirtual >> transport. In a fully virtual environment like KVM, a PCI bus is >> present. You need some way for the guest to detect that a PV device is >> present. The most natural way to do this IMHO is to have an entry for >> the PV device in the PCI bus. That will make a lot of existing code >> happy. >> > > I don't think I am confusing it, now that you've explained it more > fully. I'm even less happy with it :-) Sometimes I think the best way to make you happy is to just stop talking :-) > How will I explain this sort of thing to my grandchildren? :-) > "grandpop, why do those PV devices look like a bus defined in 1994?" > > Why would you not have, e.g., a 9p server for PV device "config space" > as well? I actually implemented that on Xen -- it was quite trivial, > and it makes more sense -- to me anyway -- than pretending a PV device > is something it's not. > > What it happening, it seems to me, is that people are still trying to > use an abstraction -- "PCI device" -- which is not really an > abstraction, to model aspects of PV device discovery, enumeration, > configuration and operation. I'm still pretty uncomfortable with it -- > well, honestly, it seems kind of gross to me. 
It's just as easy to > build the right abstraction underneath all this, and then, for those > OSes that have existing code that needs to be happy, present that > abstraction as a PCI bus. But making the PCI bus the underlying > abstraction is getting the order inverted, I believe. Okay. The first problem here is that you're assuming that I'm suggesting that this whole thing mandate a PCI bus. I'm not. I'm merely saying that one possible way to implement this is by using a PCI bus to discover the existence of a VIRTLINK socket. Clearly, the s390 guys would have to use something else. For PV Xen where there is no PCI bus, XenBus would be used. So very concretely, there are three separate classes of problems: 1) How to determine that a VM can use virtlink sockets 2) How to enumerate paravirtual devices 3) The various PV protocols for each device Whatever Linux implements, it has to allow multiple implementations for #1. For x86 VMs, PCI is just the easiest thing to do here. You could do hypercalls but it gets messy on different hypervisors (vmcall with 0 in eax may do something funky in Xen but be the probing hypercall on KVM). For #2, I'm not really proposing anything concrete. One possibility is to allow virtlink sockets to be addressed with a "service" and to use that. That doesn't allow for enumeration though so it may not be perfect. I'm not proposing anything at all for #3. That's outside the scope of this discussion in my mind. Now, once you have a virtlink socket, could you use p9 to implement #2 and #3? Sounds like something you could write a paper about :-) But that's a later argument. Right now, I'm just focused on solving the bootstrap issue. Hope this clarifies things a bit. Regards, Anthony Liguori > I realize that PCI device space is a pretty handy way to do this, that > it is very convenient.
I wonder what happens when you get a system > without enough "holes" in the config space for you to hide the PV > devices in, or that has some other weird property that breaks this > model. I've already worked with one system that had 32 PCI busses. > > There are other hypervisors that made convenient choices over the > right choice, and they are paying for it. Let's try to avoid that on > kvm. Kvm has so much going for it right now. > > thanks > > ron
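[Editorial note: Anthony's problem #1 above -- letting an x86 guest determine that a PV transport exists -- is what the PCI entry buys. A minimal sketch of what such a probe involves, using the standard PCI configuration mechanism #1 address encoding; the vendor/device IDs here are purely hypothetical placeholders, not real assignments, and real probing would additionally need port I/O to 0xCF8/0xCFC:]

```c
#include <stdint.h>

/* Standard PCI configuration mechanism #1 address encoding. A guest
 * probing for a PV device writes this address to port 0xCF8 and reads
 * the vendor/device word back from port 0xCFC. */
static uint32_t pci_cfg_addr(uint8_t bus, uint8_t dev, uint8_t fn, uint8_t reg)
{
    return 0x80000000u                      /* enable bit */
         | ((uint32_t)bus << 16)
         | ((uint32_t)(dev & 0x1f) << 11)
         | ((uint32_t)(fn & 0x07) << 8)
         | (reg & 0xfc);                    /* dword-aligned register offset */
}

/* Hypothetical IDs for illustration only -- not real assignments. */
#define PV_VENDOR_ID 0x5002u
#define PV_DEVICE_ID 0x2258u

/* Config offset 0 reads back as device ID in the high 16 bits and
 * vendor ID in the low 16 bits. */
static int is_pv_device(uint32_t id_word)
{
    return (id_word & 0xffffu) == PV_VENDOR_ID
        && (id_word >> 16) == PV_DEVICE_ID;
}
```

[This is the enumeration cost Eric and ron object to: even this "simple" path drags the config-address encoding and a bus/device/function scan loop into every guest kernel.]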
* Re: [PATCH/RFC 7/9] Virtual network guest device driver [not found] ` <46523952.7070405-rdkfGonbjUSkNkDKm+mE6A@public.gmane.org> 2007-05-22 0:45 ` ron minnich @ 2007-05-22 1:34 ` Eric Van Hensbergen [not found] ` <a4e6962a0705211834s4db19c7t3b95765bf2c092d7-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org> 1 sibling, 1 reply; 104+ messages in thread From: Eric Van Hensbergen @ 2007-05-22 1:34 UTC (permalink / raw) To: Anthony Liguori Cc: Jimi Xenidis, jmk-zzFmDc4TPjtKvsKVC3L/VUEOCMrvLtNR@public.gmane.org, Christian Borntraeger, mschwid2-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8, Suzanne McIntosh, kvm-devel-5NWGOfrQmneRv+LV9MX5uipxlwaOVQ5f On 5/21/07, Anthony Liguori <anthony-rdkfGonbjUSkNkDKm+mE6A@public.gmane.org> wrote: > ron minnich wrote: > > OK, so what are we doing here? We're using a PCI abstraction, as a > > common abstraction, which is not common really, because we don't have a > > common abstraction? So we describe all these non-pci resources with a > > pci abstraction? > > > > No. You're confusing PV device discovery with the actual paravirtual > transport. In a PV environment why not just pass an initial cookie/hash/whatever as a command-line argument/register/memory-space to the underlying kernel? The presence of such a kernel argument would suggest the existence of a hypercall interface or other such mechanism to "attach" to the initial transport(s). Command-line arguments may be a bit too Linux-centric to Ron's taste, but if we are going to choose something arbitrary like PCI, I'd prefer we choose something a bit more straightforward to interact with instead of doing crazy ritual dances to extract what should be straightforward information. I really don't want to have to integrate PCI parsing into my testOS/libOS kernels. -eric
* Re: [PATCH/RFC 7/9] Virtual network guest device driver [not found] ` <a4e6962a0705211834s4db19c7t3b95765bf2c092d7-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org> @ 2007-05-22 1:42 ` Anthony Liguori [not found] ` <46524A79.8070004-rdkfGonbjUSkNkDKm+mE6A@public.gmane.org> 0 siblings, 1 reply; 104+ messages in thread From: Anthony Liguori @ 2007-05-22 1:42 UTC (permalink / raw) To: Eric Van Hensbergen Cc: Jimi Xenidis, jmk-zzFmDc4TPjtKvsKVC3L/VUEOCMrvLtNR@public.gmane.org, Christian Borntraeger, mschwid2-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8, Suzanne McIntosh, kvm-devel-5NWGOfrQmneRv+LV9MX5uipxlwaOVQ5f Eric Van Hensbergen wrote: > On 5/21/07, Anthony Liguori <anthony-rdkfGonbjUSkNkDKm+mE6A@public.gmane.org> wrote: >> ron minnich wrote: >> > OK, so what are we doing here? We're using a PCI abstraction, as a >> > common abstraction,which is not common really, because we don't have a >> > common abstraction? So we describe all these non-pci resources with a >> > pci abstraction? >> > >> >> No. You're confusing PV device discovery with the actual paravirtual >> transport. > > In a PV environment why not just pass an initial cookie/hash/whatever > as a command-line argument/register/memory-space to the underlying > kernel? You can't pass a command line argument to Windows (at least, not easily AFAIK). You could get away with an MSR/CPUID flag but then you're relying on uniqueness which isn't guaranteed. > The presence of such a kernel argument would suggest the > existence of a hypercall interface or other such mechanism to "attach" > to the initial transport(s). Command-line arguments may be a bit too > linux-centric to Ron's taste, but if we are going to chose something > arbitrary like PCI, I'd prefer we chose something a bit more > straightforward to interact with instead of doing crazy ritual dances > to extract what should be straightforward information. I really don't > want to have integrate PCI parsing into my testOS/libOS kernels. 
You could just hardcode a PIC interrupt and rely on some static memory address for IO and avoid the PCI bus entirely. The whole point of the PCI bus is to avoid hardcoding this sort of thing but if you don't want the complexity associated with PCI, then using the "older" mechanisms seems like the obvious thing to do. Regards, Anthony Liguori > -eric
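[Editorial note: the MSR/CPUID route Anthony mentions was later standardized in practice: hypervisors expose a 12-byte vendor signature in ebx/ecx/edx of CPUID leaf 0x40000000 (KVM's is "KVMKVMKVM", adopted after this thread). A sketch of decoding such a signature; actually issuing CPUID requires inline assembly on a real guest, and the memcpy decoding assumes x86's little-endian byte order:]

```c
#include <stdint.h>
#include <string.h>

/* Reassemble the 12-byte hypervisor vendor signature that CPUID leaf
 * 0x40000000 returns in ebx/ecx/edx, and compare it against an
 * expected string. Assumes little-endian byte order, as on x86. */
static int hv_sig_matches(uint32_t ebx, uint32_t ecx, uint32_t edx,
                          const char *want)
{
    char sig[13];
    memcpy(sig + 0, &ebx, 4);
    memcpy(sig + 4, &ecx, 4);
    memcpy(sig + 8, &edx, 4);
    sig[12] = '\0';
    return strcmp(sig, want) == 0;
}
```

[This addresses the uniqueness worry: the signature string, not a bare flag bit, identifies which hypervisor's hypercall ABI is safe to use.]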
* Re: [PATCH/RFC 7/9] Virtual network guest device driver [not found] ` <46524A79.8070004-rdkfGonbjUSkNkDKm+mE6A@public.gmane.org> @ 2007-05-22 5:17 ` Avi Kivity [not found] ` <46527CD9.5000603-atKUWr5tajBWk0Htik3J/w@public.gmane.org> 0 siblings, 1 reply; 104+ messages in thread From: Avi Kivity @ 2007-05-22 5:17 UTC (permalink / raw) To: Anthony Liguori Cc: Jimi Xenidis, jmk-zzFmDc4TPjtKvsKVC3L/VUEOCMrvLtNR@public.gmane.org, mschwid2-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8, Christian Borntraeger, Suzanne McIntosh, kvm-devel-5NWGOfrQmneRv+LV9MX5uipxlwaOVQ5f Anthony Liguori wrote: >> >> In a PV environment why not just pass an initial cookie/hash/whatever >> as a command-line argument/register/memory-space to the underlying >> kernel? >> > > You can't pass a command line argument to Windows (at least, not easily > AFAIK). You could get away with an MSR/CPUID flag but then you're > relying on uniqueness which isn't guaranteed. > In the general case, you can't pass a command line argument to Linux either. kvm doesn't boot Linux; it boots the bios, which boots the boot sector, which boots grub, which boots Linux. Relying on the user to edit the command line in grub is wrong. -- Do not meddle in the internals of kernels, for they are subtle and quick to panic.
* Re: [PATCH/RFC 7/9] Virtual network guest device driver [not found] ` <46527CD9.5000603-atKUWr5tajBWk0Htik3J/w@public.gmane.org> @ 2007-05-22 12:49 ` Eric Van Hensbergen [not found] ` <a4e6962a0705220549j1c9565f2ic160c672b74aea35-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org> 0 siblings, 1 reply; 104+ messages in thread From: Eric Van Hensbergen @ 2007-05-22 12:49 UTC (permalink / raw) To: Avi Kivity Cc: Jimi Xenidis, jmk-zzFmDc4TPjtKvsKVC3L/VUEOCMrvLtNR@public.gmane.org, Christian Borntraeger, mschwid2-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8, kvm-devel-5NWGOfrQmneRv+LV9MX5uipxlwaOVQ5f, Suzanne McIntosh On 5/22/07, Avi Kivity <avi-atKUWr5tajBWk0Htik3J/w@public.gmane.org> wrote: > Anthony Liguori wrote: > >> > >> In a PV environment why not just pass an initial cookie/hash/whatever > >> as a command-line argument/register/memory-space to the underlying > >> kernel? > >> > > > > You can't pass a command line argument to Windows (at least, not easily > > AFAIK). You could get away with an MSR/CPUID flag but then you're > > relying on uniqueness which isn't guaranteed. > > > > In the general case, you can't pass a command line argument to Linux > either. kvm doesn't boot Linux; it boots the bios, which boots the boot > sector, which boots grub, which boots Linux. Relying on the user to > edit the command line in grub is wrong. > I didn't think we were talking about the general case, I thought we were discussing the PV case. In the PV case, having a bios/bootloader is unnecessary overhead. To that same end, I don't see Windows in the PV case unless they magically want to coordinate PV standards with us, in which case we certainly can negotiate a more sane discovery mechanism. -eric
* Re: [PATCH/RFC 7/9] Virtual network guest device driver [not found] ` <a4e6962a0705220549j1c9565f2ic160c672b74aea35-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org> @ 2007-05-22 12:56 ` Christoph Hellwig [not found] ` <20070522125655.GA4506-wEGCiKHe2LqWVfeAwA7xHQ@public.gmane.org> 2007-05-22 13:08 ` Anthony Liguori 1 sibling, 1 reply; 104+ messages in thread From: Christoph Hellwig @ 2007-05-22 12:56 UTC (permalink / raw) To: Eric Van Hensbergen Cc: Jimi Xenidis, jmk-zzFmDc4TPjtKvsKVC3L/VUEOCMrvLtNR@public.gmane.org, Christian Borntraeger, mschwid2-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8, kvm-devel-5NWGOfrQmneRv+LV9MX5uipxlwaOVQ5f, Suzanne McIntosh On Tue, May 22, 2007 at 07:49:51AM -0500, Eric Van Hensbergen wrote: > > In the general case, you can't pass a command line argument to Linux > > either. kvm doesn't boot Linux; it boots the bios, which boots the boot > > sector, which boots grub, which boots Linux. Relying on the user to > > edit the command line in grub is wrong. > > > > I didn't think we were talking about the general case, I thought we > were discussing the PV case. In the PV case, having bios/bootloader > is unnecessary overhead. To that same end, I don't see Windows in the > PV case unless they magically want to to coordinate PV standards with > us, in which case we certainly can negotiate a more sane discovery > mechanism. In case of KVM no one is speaking of pure PV. What people have been working on is PV acceleration of a fullvirt host, similar to how s390 has been working for decades. The host emulates the full architecture, but there are some escapes for speedups. Typical escapes would be drivers for storage or networking because those cannot be virtualized very well on x86-style hardware.
* Re: [PATCH/RFC 7/9] Virtual network guest device driver [not found] ` <20070522125655.GA4506-wEGCiKHe2LqWVfeAwA7xHQ@public.gmane.org> @ 2007-05-22 14:50 ` Eric Van Hensbergen [not found] ` <a4e6962a0705220750s5abe380dg8dd8e7d0b84de7cd-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org> 0 siblings, 1 reply; 104+ messages in thread From: Eric Van Hensbergen @ 2007-05-22 14:50 UTC (permalink / raw) To: Christoph Hellwig Cc: Jimi Xenidis, jmk-zzFmDc4TPjtKvsKVC3L/VUEOCMrvLtNR@public.gmane.org, Christian Borntraeger, mschwid2-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8, kvm-devel-5NWGOfrQmneRv+LV9MX5uipxlwaOVQ5f, Suzanne McIntosh On 5/22/07, Christoph Hellwig <hch-wEGCiKHe2LqWVfeAwA7xHQ@public.gmane.org> wrote: > > > > I didn't think we were talking about the general case, I thought we > > were discussing the PV case. > > > > In case of KVM no one is speaking of pure PV. > Why not? It seems worthwhile to come up with something that can cover the whole spectrum instead of having different hypervisors (and interfaces). Maybe my view is skewed because I don't care to run windows. -eric
* Re: [PATCH/RFC 7/9] Virtual network guest device driver [not found] ` <a4e6962a0705220750s5abe380dg8dd8e7d0b84de7cd-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org> @ 2007-05-22 15:05 ` Anthony Liguori [not found] ` <465306AE.5080902-rdkfGonbjUSkNkDKm+mE6A@public.gmane.org> 2007-05-23 11:55 ` Avi Kivity 1 sibling, 1 reply; 104+ messages in thread From: Anthony Liguori @ 2007-05-22 15:05 UTC (permalink / raw) To: Eric Van Hensbergen Cc: Jimi Xenidis, jmk-zzFmDc4TPjtKvsKVC3L/VUEOCMrvLtNR@public.gmane.org, Christian Borntraeger, mschwid2-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8, Christoph Hellwig, kvm-devel-5NWGOfrQmneRv+LV9MX5uipxlwaOVQ5f, Suzanne McIntosh Eric Van Hensbergen wrote: > On 5/22/07, Christoph Hellwig <hch-wEGCiKHe2LqWVfeAwA7xHQ@public.gmane.org> wrote: > >>> I didn't think we were talking about the general case, I thought we >>> were discussing the PV case. >>> >>> >> In case of KVM no one is speaking of pure PV. >> >> > > Why not? It seems worthwhile to come up with something that can cover > the whole spectrum instead of having different hypervisors (and > interfaces). > Because in a few years, almost everyone will have hardware capable of doing full virtualization so why bother with pure PV. > Maybe my view is skewed because I don't care to run windows. > It's not just windows. There are a lot of people who want to use virtualization to run RHEL2 or even RH9. Backporting PV to these kernels is a huge effort. Regards, Anthony Liguori > -eric
* Re: [PATCH/RFC 7/9] Virtual network guest device driver [not found] ` <465306AE.5080902-rdkfGonbjUSkNkDKm+mE6A@public.gmane.org> @ 2007-05-22 15:31 ` ron minnich 2007-05-22 16:25 ` Eric Van Hensbergen 1 sibling, 0 replies; 104+ messages in thread From: ron minnich @ 2007-05-22 15:31 UTC (permalink / raw) To: Anthony Liguori Cc: Jimi Xenidis, jmk-zzFmDc4TPjtKvsKVC3L/VUEOCMrvLtNR@public.gmane.org, mschwid2-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8, Christoph Hellwig, Christian Borntraeger, Suzanne McIntosh, kvm-devel-5NWGOfrQmneRv+LV9MX5uipxlwaOVQ5f On 5/22/07, Anthony Liguori <anthony-rdkfGonbjUSkNkDKm+mE6A@public.gmane.org> wrote: > Eric Van Hensbergen wrote: > > On 5/22/07, Christoph Hellwig <hch-wEGCiKHe2LqWVfeAwA7xHQ@public.gmane.org> wrote: > > > >>> I didn't think we were talking about the general case, I thought we > >>> were discussing the PV case. > >>> > >>> > >> In case of KVM no one is speaking of pure PV. > >> > >> > > > > Why not? It seems worthwhile to come up with something that can cover > > the whole spectrum instead of having different hypervisors (and > > interfaces). > > > > Because in a few years, almost everyone will have hardware capable of > doing full virtualization so why bother with pure PV. I don't know, we could shoot for a clean, simple interface that makes PV easy to integrate into any kernel. Pick a common underlying abstraction for all resources. Define a simple, efficient memory channel for the comms. Lay 9p over it. Then take it from there for each device. I agree, from the way (e.g.) the Xen devices work, PV is a pain. But it need not be that way. I think from the Plan 9 side we're happy to run full PV. But we're 0% of the world, so that may bias our importance a bit :-) thanks ron
* Re: [PATCH/RFC 7/9] Virtual network guest device driver [not found] ` <465306AE.5080902-rdkfGonbjUSkNkDKm+mE6A@public.gmane.org> 2007-05-22 15:31 ` ron minnich @ 2007-05-22 16:25 ` Eric Van Hensbergen [not found] ` <a4e6962a0705220925l580f136we269380fe3c9691c-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org> 1 sibling, 1 reply; 104+ messages in thread From: Eric Van Hensbergen @ 2007-05-22 16:25 UTC (permalink / raw) To: Anthony Liguori Cc: Jimi Xenidis, jmk-zzFmDc4TPjtKvsKVC3L/VUEOCMrvLtNR@public.gmane.org, Christian Borntraeger, mschwid2-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8, Christoph Hellwig, kvm-devel-5NWGOfrQmneRv+LV9MX5uipxlwaOVQ5f, Suzanne McIntosh On 5/22/07, Anthony Liguori <anthony-rdkfGonbjUSkNkDKm+mE6A@public.gmane.org> wrote: > Eric Van Hensbergen wrote: > > On 5/22/07, Christoph Hellwig <hch-wEGCiKHe2LqWVfeAwA7xHQ@public.gmane.org> wrote: > >>> > >>> > >> In case of KVM no one is speaking of pure PV. > >> > >> > > > > Why not? It seems worthwhile to come up with something that can cover > > the whole spectrum instead of having different hypervisors (and > > interfaces). > > > > Because in a few years, almost everyone will have hardware capable of > doing full virtualization so why bother with pure PV. > No matter what the capabilities, full device emulation is always going to be wasteful. Just because I have the hardware to run Vista, doesn't mean I should run Vista. > > Maybe my view is skewed because I don't care to run windows. > > > > It's not just windows. There are a lot of people who want to use > virtualization to run RHEL2 or even RH9. Backporting PV to these > kernels is a huge effort. > I'm not opposed to supporting emulation environments, just don't make a large pile of crap the default like Xen -- and having to integrate PCI probing code in my guest domains is a large pile of crap. 
-eric
* Re: [PATCH/RFC 7/9] Virtual network guest device driver [not found] ` <a4e6962a0705220925l580f136we269380fe3c9691c-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org> @ 2007-05-22 17:00 ` ron minnich [not found] ` <13426df10705221000i749badc5h8afe4f2fc95bc2ce-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org> 0 siblings, 1 reply; 104+ messages in thread From: ron minnich @ 2007-05-22 17:00 UTC (permalink / raw) To: Eric Van Hensbergen Cc: Jimi Xenidis, jmk-zzFmDc4TPjtKvsKVC3L/VUEOCMrvLtNR@public.gmane.org, Christian Borntraeger, mschwid2-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8, Christoph Hellwig, kvm-devel-5NWGOfrQmneRv+LV9MX5uipxlwaOVQ5f, Suzanne McIntosh On 5/22/07, Eric Van Hensbergen <ericvh-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org> wrote: > I'm not opposed to supporting emulation environments, just don't make > a large pile of crap the default like Xen -- and having to integrate > PCI probing code in my guest domains is a large pile of crap. Exactly. I'm about to start a pretty large project here, using xen or kvm, not sure. One thing for sure, we are NOT going to use anything but PV devices. Full emulation is nice, but it's just plain silly if you don't have to do it. And we don't have to do it. So let's get the PV devices right, not try to shoehorn them into some framework like PCI. What happens to these schemes if I want to try, e.g., 2^16 PV devices? Or some other crazy thing that doesn't play well with PCI -- simple example -- I want a 256 GB region of memory for a device. PCI rules require me to align it on 256GB boundaries and it must be contiguous address space. This is a hardware rule, done for hardware reasons, and has no place in the PV world. What if I want a bit more than the basic set of BARs that PCI gives me? Why would we apply such rules to a PV? Why limit ourselves this early in the game? 
thanks ron
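[Editorial note: ron's 256GB example follows from how PCI BAR sizing works: software writes all-ones to a BAR and reads it back, and because the device hard-wires the low address bits to zero, a memory BAR is necessarily a power of two in size and naturally aligned. A sketch of the standard sizing arithmetic for a 32-bit memory BAR (bits 3:0 are type flags, not address bits):]

```c
#include <stdint.h>

/* Decode the readback value after writing 0xFFFFFFFF to a 32-bit
 * memory BAR. Only the high bits are writable, so the decoded size is
 * always a power of two and the region is naturally aligned -- the
 * hardware-era constraint ron objects to inheriting for PV devices. */
static uint64_t mem_bar_size(uint32_t readback)
{
    uint32_t mask = readback & 0xfffffff0u; /* drop the flag bits */
    if (mask == 0)
        return 0;                           /* BAR not implemented */
    return (uint64_t)(~mask) + 1;
}
```

[So a 256GB region really would have to sit on a 256GB boundary under PCI rules, purely as an artifact of this decoding scheme, not for any PV reason.]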
* Re: [PATCH/RFC 7/9] Virtual network guest device driver [not found] ` <13426df10705221000i749badc5h8afe4f2fc95bc2ce-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org> @ 2007-05-22 17:06 ` Christoph Hellwig [not found] ` <20070522170628.GA16624-wEGCiKHe2LqWVfeAwA7xHQ@public.gmane.org> 2007-05-23 12:20 ` Avi Kivity 1 sibling, 1 reply; 104+ messages in thread From: Christoph Hellwig @ 2007-05-22 17:06 UTC (permalink / raw) To: ron minnich Cc: Jimi Xenidis, jmk-zzFmDc4TPjtKvsKVC3L/VUEOCMrvLtNR@public.gmane.org, mschwid2-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8, Christoph Hellwig, Christian Borntraeger, Suzanne McIntosh, kvm-devel-5NWGOfrQmneRv+LV9MX5uipxlwaOVQ5f On Tue, May 22, 2007 at 10:00:42AM -0700, ron minnich wrote: > On 5/22/07, Eric Van Hensbergen <ericvh-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org> wrote: > > >I'm not opposed to supporting emulation environments, just don't make > >a large pile of crap the default like Xen -- and having to integrate > >PCI probing code in my guest domains is a large pile of crap. > > Exactly. I'm about to start a pretty large project here, using xen or > kvm, not sure. One thing for sure, we are NOT going to use anything > but PV devices. Full emulation is nice, but it's just plain silly if > you don't have to do it. And we don't have to do it. So let's get the > PV devices right, not try to shoehorn them into some framework like > PCI. If you don't care about full virtualization, kvm is the wrong project for you. You might want to take a look at lguest.
* Re: [PATCH/RFC 7/9] Virtual network guest device driver [not found] ` <20070522170628.GA16624-wEGCiKHe2LqWVfeAwA7xHQ@public.gmane.org> @ 2007-05-22 17:34 ` ron minnich [not found] ` <13426df10705221034k7baf5bccrc77aabca8c9e225c-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org> 2007-05-23 12:16 ` Avi Kivity 1 sibling, 1 reply; 104+ messages in thread From: ron minnich @ 2007-05-22 17:34 UTC (permalink / raw) To: Christoph Hellwig Cc: Jimi Xenidis, jmk-zzFmDc4TPjtKvsKVC3L/VUEOCMrvLtNR@public.gmane.org, mschwid2-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8, Christian Borntraeger, Suzanne McIntosh, kvm-devel-5NWGOfrQmneRv+LV9MX5uipxlwaOVQ5f On 5/22/07, Christoph Hellwig <hch-wEGCiKHe2LqWVfeAwA7xHQ@public.gmane.org> wrote: > If you don't care about full virtualization kvm is the wrong project for > you. You might want to take a look at lguest. Ah, I had not realized that KVM was purely a full-virt environment with no real use for PV-only users. I'll move on. Thanks for the tip! ron
* Re: [PATCH/RFC 7/9] Virtual network guest device driver [not found] ` <13426df10705221034k7baf5bccrc77aabca8c9e225c-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org> @ 2007-05-22 20:03 ` Dor Laor [not found] ` <64F9B87B6B770947A9F8391472E032160BF29F1E-yEcIvxbTEBqsx+V+t5oei8rau4O3wl8o3fe8/T/H7NteoWH0uzbU5w@public.gmane.org> 0 siblings, 1 reply; 104+ messages in thread From: Dor Laor @ 2007-05-22 20:03 UTC (permalink / raw) To: ron minnich, Christoph Hellwig Cc: Jimi Xenidis, jmk-zzFmDc4TPjtKvsKVC3L/VUEOCMrvLtNR, Christian Borntraeger, mschwid2-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8, kvm-devel-5NWGOfrQmneRv+LV9MX5uipxlwaOVQ5f, Suzanne McIntosh >> If you don't care about full virtualization kvm is the wrong project for >> you. You might want to take a look at lguest. > >Ah, I had not realized that KVM was purely a full-virt environment >with no real use for PV-only users. I'll move on. Thanks for the tip! >ron Don't quit so soon on us. KVM already has PV kernel capabilities (in Ingo Molnar's tree) and has network and block PV drivers. We do plan on supporting/improving the PV kernel capabilities. The near future change is direct guest paging. Although all new x86 cpus now ship with hardware support, software PV can always find spots for acceleration. Regarding PV drivers, our initial approach was to try not to reinvent the wheel and implement the PV discovery using pci. For full-virt OSs, especially Windows, it was simpler. Now that more platforms might be kvm based, I agree we should switch to a generic solution. Dor.
[parent not found: <64F9B87B6B770947A9F8391472E032160BF29F1E-yEcIvxbTEBqsx+V+t5oei8rau4O3wl8o3fe8/T/H7NteoWH0uzbU5w@public.gmane.org>]
* Re: [PATCH/RFC 7/9] Virtual network guest device driver [not found] ` <64F9B87B6B770947A9F8391472E032160BF29F1E-yEcIvxbTEBqsx+V+t5oei8rau4O3wl8o3fe8/T/H7NteoWH0uzbU5w@public.gmane.org> @ 2007-05-22 20:10 ` ron minnich 2007-05-22 22:56 ` Nakajima, Jun 1 sibling, 0 replies; 104+ messages in thread From: ron minnich @ 2007-05-22 20:10 UTC (permalink / raw) To: Dor Laor Cc: Jimi Xenidis, jmk-zzFmDc4TPjtKvsKVC3L/VUEOCMrvLtNR, Christian Borntraeger, mschwid2-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8, Christoph Hellwig, kvm-devel-5NWGOfrQmneRv+LV9MX5uipxlwaOVQ5f, Suzanne McIntosh On 5/22/07, Dor Laor <dor.laor-atKUWr5tajBWk0Htik3J/w@public.gmane.org> wrote: > Don't quit so soon on us. OK. I'll go look at Ingo's stuff. Thanks again ron ------------------------------------------------------------------------- This SF.net email is sponsored by DB2 Express Download DB2 Express C - the FREE version of DB2 express and take control of your XML. No limits. Just data. Click to get it now. http://sourceforge.net/powerbar/db2/ ^ permalink raw reply [flat|nested] 104+ messages in thread
* Re: [PATCH/RFC 7/9] Virtual network guest device driver [not found] ` <64F9B87B6B770947A9F8391472E032160BF29F1E-yEcIvxbTEBqsx+V+t5oei8rau4O3wl8o3fe8/T/H7NteoWH0uzbU5w@public.gmane.org> 2007-05-22 20:10 ` ron minnich @ 2007-05-22 22:56 ` Nakajima, Jun [not found] ` <8FFF7E42E93CC646B632AB40643802A8032793AC-1a9uaKK1+wJcIJlls4ac1rfspsVTdybXVpNB7YpNyf8@public.gmane.org> 1 sibling, 1 reply; 104+ messages in thread From: Nakajima, Jun @ 2007-05-22 22:56 UTC (permalink / raw) To: Dor Laor, ron minnich, Christoph Hellwig Cc: Jimi Xenidis, jmk-zzFmDc4TPjtKvsKVC3L/VUEOCMrvLtNR, Christian Borntraeger, mschwid2-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8, kvm-devel-5NWGOfrQmneRv+LV9MX5uipxlwaOVQ5f, Suzanne McIntosh Dor Laor wrote: > > > If you don't care about full virtualization kvm is the wrong project for > > > you. You might want to take a look at lguest. > > > > Ah, I had not realized that KVM was purely a full-virt environment > > with no real use for PV-only users. I'll move on. Thanks for the tip! > > ron > > Don't quit so soon on us. > KVM has already PV kernel capabilities (in Ingo Molnar's tree) and has > network and block PV drivers. > > We do plan on supporting/improving the PV kernel capabilities. The near > future change is direct guest paging. > Although all new x86 cpus now ship with hardware support, software PV > can always find spots for acceleration. BTW, I'm presenting this at OLS: http://www.linuxsymposium.org/2007/view_abstract.php?content_key=192 This uses direct paging mode today. > > Regarding PV drivers, our initial approach was try not to invent the > wheel and implement the PV discovery using pci. For full-virt OSs, > especially windows it was simpler. Now that more platforms might be kvm > based, I agree we should switch to a generic solution. > Dor. 
> Jun --- Intel Open Source Technology Center
[parent not found: <8FFF7E42E93CC646B632AB40643802A8032793AC-1a9uaKK1+wJcIJlls4ac1rfspsVTdybXVpNB7YpNyf8@public.gmane.org>]
* Re: [PATCH/RFC 7/9] Virtual network guest device driver [not found] ` <8FFF7E42E93CC646B632AB40643802A8032793AC-1a9uaKK1+wJcIJlls4ac1rfspsVTdybXVpNB7YpNyf8@public.gmane.org> @ 2007-05-23 8:15 ` Carsten Otte [not found] ` <4653F807.2010209-tA70FqPdS9bQT0dZR+AlfA@public.gmane.org> 2007-05-23 12:21 ` Avi Kivity 1 sibling, 1 reply; 104+ messages in thread From: Carsten Otte @ 2007-05-23 8:15 UTC (permalink / raw) To: Nakajima, Jun Cc: Jimi Xenidis, jmk-zzFmDc4TPjtKvsKVC3L/VUEOCMrvLtNR, Christian Borntraeger, mschwid2-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8, Christoph Hellwig, Suzanne McIntosh, kvm-devel-5NWGOfrQmneRv+LV9MX5uipxlwaOVQ5f

I have been closely following this very interesting discussion. Here's my summary:
- PV capabilities are something we'll want
- being able to surface virtual devices to the guest as PCI is preferable for Windows
- we need an additional way to surface virtual devices to the guest. We don't have PCI on s390, and Ron doesn't want PCI in his guests.
- complex interfaces are a mess to implement and maintain in different hypervisors and guest operating systems; we need a simple and clear structure like plan9 has today

To me, it looks like we need a virtual device abstraction both in the guest kernel and in kvm/qemu. This abstraction needs to be simple and fast, and needs to be representable both as a PCI device and in a simpler way. PCI obstacles are supposed to be transparent to the virtual device. For me, plan9 does provide answers to a lot of the above requirements. However, it does not provide capabilities for shared memory, and it adds extra complexity. It's been designed to solve a different problem.
I think the virtual device abstraction should provide the following functionality:
- hypercall guest to host with parameters and return value
- interrupt from host to guest with parameters
- thin interrupt from host to guest, no parameters
- shared memory between guest and host
- dma access to guest memory, possibly via kmap on the host
- copy from/to guest memory

so long, Carsten
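[Editor's note] Carsten's list maps naturally onto a guest-side ops table. The sketch below is purely illustrative — `struct vdev_ops`, the `fake_*` backend, and every other name in it are invented for this example and appear nowhere in the patch set; the idea is only that a PCI transport and an s390 transport would each fill in such a table with their own primitives:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical guest-visible operations, roughly one entry per item in
 * the list above (interrupt delivery would be host->guest and is not
 * modeled here). Drivers program only against this table. */
struct vdev_ops {
	/* hypercall guest -> host with parameters and a return value */
	long (*hypercall)(unsigned long nr, unsigned long p1, unsigned long p2);
	/* copy between guest driver buffers and host/device memory */
	int (*copy_to_host)(uint64_t dst, const void *src, size_t len);
	int (*copy_from_host)(void *dst, uint64_t src, size_t len);
	/* map a region shared between guest and host */
	void *(*map_shared)(uint64_t host_token, size_t len);
};

/* A trivial in-process "host" used only to demonstrate the API shape. */
static unsigned char fake_host_mem[64];

static long fake_hypercall(unsigned long nr, unsigned long p1, unsigned long p2)
{
	return (long)(nr + p1 + p2);	/* the fake host just echoes a sum */
}

static int fake_copy_to_host(uint64_t dst, const void *src, size_t len)
{
	if (dst + len > sizeof(fake_host_mem))
		return -1;
	memcpy(fake_host_mem + dst, src, len);
	return 0;
}

static int fake_copy_from_host(void *dst, uint64_t src, size_t len)
{
	if (src + len > sizeof(fake_host_mem))
		return -1;
	memcpy(dst, fake_host_mem + src, len);
	return 0;
}

static void *fake_map_shared(uint64_t host_token, size_t len)
{
	(void)len;
	return fake_host_mem + host_token;
}

static const struct vdev_ops fake_ops = {
	.hypercall	= fake_hypercall,
	.copy_to_host	= fake_copy_to_host,
	.copy_from_host	= fake_copy_from_host,
	.map_shared	= fake_map_shared,
};
```

The point of the indirection is that whether the hypercall is a PCI doorbell write or an s390 instruction stays hidden behind one function pointer, which is exactly the "representable as PCI and in a simpler way" property asked for above.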
[parent not found: <4653F807.2010209-tA70FqPdS9bQT0dZR+AlfA@public.gmane.org>]
* Re: [PATCH/RFC 7/9] Virtual network guest device driver [not found] ` <4653F807.2010209-tA70FqPdS9bQT0dZR+AlfA@public.gmane.org> @ 2007-05-23 12:25 ` Avi Kivity 2007-05-23 14:12 ` Eric Van Hensbergen 1 sibling, 0 replies; 104+ messages in thread From: Avi Kivity @ 2007-05-23 12:25 UTC (permalink / raw) To: carsteno-tA70FqPdS9bQT0dZR+AlfA Cc: Jimi Xenidis, jmk-zzFmDc4TPjtKvsKVC3L/VUEOCMrvLtNR, Christian Borntraeger, mschwid2-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8, Christoph Hellwig, kvm-devel-5NWGOfrQmneRv+LV9MX5uipxlwaOVQ5f, Suzanne McIntosh Carsten Otte wrote: > I have been closely following thisvery interresting discussion. Here's > my summary: > - PV capabilities is something we'll want > - being able to surface virtual devices to the guest as PCI is > preferable to Windows > - we need an additional way to surface virtual devices to the guest. > We don't have PCI on s390, and Ron doesn't want PCI in his guests. > - complex interfaces are a mess to implement and maintain in different > hypervisors and guest operating systems, we need a simple and clear > structure like plan9 has today > > To me, it looks like we need a virtual device abstraction both in the > guest kernel and in the kvm/qemu. This abstraction needs to be simple > and fast, and needs to be representable as PCI device and in a simpler > way. PCI obstacles are supposed to be transparent to the virutal device. > For me, plan9 does provide answers to a lot of above requirements. > However, it does not provide capabilities for shared memory and it > adds extra complexity. It's been designed to solve a different problem. 
> > I think the virtual device abstraction should provide the following > functionality: > - hypercall guest to host with parameters and return value > - interrupt from host to guest with parameters > - thin interrupt from host to guest, no parameters > - shared memory between guest and host > - dma access to guest memory, possibly via kmap on the host > - copy from/to guest memory > > I agree with all of the above. In addition, it would be nice if we can share this interface with other hypervisors. Unfortunately Xen is riding the XenBus, but maybe we can share the interface with lguest and VMI. -- Do not meddle in the internals of kernels, for they are subtle and quick to panic. ------------------------------------------------------------------------- This SF.net email is sponsored by DB2 Express Download DB2 Express C - the FREE version of DB2 express and take control of your XML. No limits. Just data. Click to get it now. http://sourceforge.net/powerbar/db2/ ^ permalink raw reply [flat|nested] 104+ messages in thread
* Re: [PATCH/RFC 7/9] Virtual network guest device driver [not found] ` <4653F807.2010209-tA70FqPdS9bQT0dZR+AlfA@public.gmane.org> 2007-05-23 12:25 ` Avi Kivity @ 2007-05-23 14:12 ` Eric Van Hensbergen [not found] ` <a4e6962a0705230712pd8c2958m9dee6b2ccec0899d-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org> 1 sibling, 1 reply; 104+ messages in thread From: Eric Van Hensbergen @ 2007-05-23 14:12 UTC (permalink / raw) To: carsteno-tA70FqPdS9bQT0dZR+AlfA Cc: Jimi Xenidis, jmk-zzFmDc4TPjtKvsKVC3L/VUEOCMrvLtNR, Christian Borntraeger, mschwid2-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8, Christoph Hellwig, kvm-devel-5NWGOfrQmneRv+LV9MX5uipxlwaOVQ5f, Suzanne McIntosh On 5/23/07, Carsten Otte <cotte-tA70FqPdS9bQT0dZR+AlfA@public.gmane.org> wrote: > > For me, plan9 does provide answers to a lot of above requirements. > However, it does not provide capabilities for shared memory and it > adds extra complexity. It's been designed to solve a different problem. > As a point of clarification, plan9 protocols have been used over shared memory for resource access on virtualized systems for the past 3 years. There are certainly ways it can be further optimized, but it is not a restriction. As far as complexity goes, our guest-side stack is around 2000 lines of code (with an additional 1000 lines of support routines that could likely be replaced by standard library or OS services in more conventional platforms) and supports console, file system, network, and block device access. > I think the virtual device abstraction should provide the following > functionality: > - hypercall guest to host with parameters and return value > - interrupt from host to guest with parameters > - thin interrupt from host to guest, no parameters > - shared memory between guest and host > - dma access to guest memory, possibly via kmap on the host > - copy from/to guest memory > Good list. We can certainly work within these parameters. 
It would be nice to have some facility for direct guest<->guest communication -- however, I understand the difficulties in doing that in a secure and safe way. Still, having the ability to provision such a direct interface would be nice for those that can take advantage of it. -eric
[parent not found: <a4e6962a0705230712pd8c2958m9dee6b2ccec0899d-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>]
* Re: [PATCH/RFC 7/9] Virtual network guest device driver [not found] ` <a4e6962a0705230712pd8c2958m9dee6b2ccec0899d-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org> @ 2007-05-23 23:02 ` Arnd Bergmann [not found] ` <200705240102.40795.arnd-r2nGTMty4D4@public.gmane.org> 0 siblings, 1 reply; 104+ messages in thread From: Arnd Bergmann @ 2007-05-23 23:02 UTC (permalink / raw) To: kvm-devel-5NWGOfrQmneRv+LV9MX5uipxlwaOVQ5f Cc: Jimi Xenidis, Christian Borntraeger, jmk-zzFmDc4TPjtKvsKVC3L/VUEOCMrvLtNR, mschwid2-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8, Christoph Hellwig, carsteno-tA70FqPdS9bQT0dZR+AlfA, Suzanne McIntosh On Wednesday 23 May 2007, Eric Van Hensbergen wrote: > On 5/23/07, Carsten Otte <cotte-tA70FqPdS9bQT0dZR+AlfA@public.gmane.org> wrote: > > > > For me, plan9 does provide answers to a lot of above requirements. > > However, it does not provide capabilities for shared memory and it > > adds extra complexity. It's been designed to solve a different problem. > > > As a point of clarification, plan9 protocols have been used over > shared memory for resource access on virtualized systems for the past > 3 years. There are certainly ways it can be further optimized, but it > is not a restriction. I think what Carsten means is to have a mmap interface over 9p, not implementing 9p by means of shared memory, which is what I guess you are referring to. If you want to share memory areas between a guest and the host or another guest, you can't do that with the regular Tread/Twrite interface that 9p has on a file. > As far as complexity goes, our guest-side stack > is around 2000 lines of code (with an additional 1000 lines of support > routines that could likely be replaced by standard library or OS > services in more conventional platforms) and supports console, file > system, network, and block device access. Another interface that I think is missing in 9p is a notification for hotplugging. 
Of course you can have a long-running read on a special file that returns the file names for virtual devices that have been added or removed in the guest, but that sounds a little clumsy compared to a specialized interface (e.g. Tnotify). Arnd <><
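[Editor's note] The long-running-read scheme Arnd describes needs very little guest code. Both the `add <name>` / `remove <name>` wire format and the function below are hypothetical — 9p defines no such file; this only sketches what the guest-side consumer of such a file could look like:

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Parse one line of a hypothetical hotplug-event file. Returns +1 for
 * "add <name>", -1 for "remove <name>", 0 for a malformed line; the
 * device name is copied into 'name' (at most 'cap' bytes incl. NUL). */
static int parse_hotplug_line(const char *line, char *name, size_t cap)
{
	const char *p;
	int kind;

	if (strncmp(line, "add ", 4) == 0) {
		p = line + 4;
		kind = 1;
	} else if (strncmp(line, "remove ", 7) == 0) {
		p = line + 7;
		kind = -1;
	} else {
		return 0;
	}
	if (*p == '\0' || strlen(p) >= cap)
		return 0;	/* empty or oversized name */
	strcpy(name, p);
	return kind;
}
```

A guest thread would block in read(2) on the special file and feed each returned line through a parser like this, binding or unbinding the named device.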
[parent not found: <200705240102.40795.arnd-r2nGTMty4D4@public.gmane.org>]
* Re: [PATCH/RFC 7/9] Virtual network guest device driver [not found] ` <200705240102.40795.arnd-r2nGTMty4D4@public.gmane.org> @ 2007-05-23 23:57 ` Eric Van Hensbergen [not found] ` <a4e6962a0705231657n65946ba4n74393f7028b6d61c-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org> 0 siblings, 1 reply; 104+ messages in thread From: Eric Van Hensbergen @ 2007-05-23 23:57 UTC (permalink / raw) To: Arnd Bergmann Cc: Jimi Xenidis, Christian Borntraeger, jmk-zzFmDc4TPjtKvsKVC3L/VUEOCMrvLtNR, kvm-devel-5NWGOfrQmneRv+LV9MX5uipxlwaOVQ5f, mschwid2-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8, Christoph Hellwig, carsteno-tA70FqPdS9bQT0dZR+AlfA, Suzanne McIntosh On 5/23/07, Arnd Bergmann <arnd-r2nGTMty4D4@public.gmane.org> wrote: > On Wednesday 23 May 2007, Eric Van Hensbergen wrote: > > On 5/23/07, Carsten Otte <cotte-tA70FqPdS9bQT0dZR+AlfA@public.gmane.org> wrote: > > > > > > For me, plan9 does provide answers to a lot of above requirements. > > > However, it does not provide capabilities for shared memory and it > > > adds extra complexity. It's been designed to solve a different problem. > > > > > As a point of clarification, plan9 protocols have been used over > > shared memory for resource access on virtualized systems for the past > > 3 years. There are certainly ways it can be further optimized, but it > > is not a restriction. > > I think what Carsten means is to have a mmap interface over 9p, not > implementing 9p by means of shared memory, which is what I guess > you are referring to. > > If you want to share memory areas between a guest and the host > or another guest, you can't do that with the regular Tread/Twrite > interface that 9p has on a file. > Well, there's nothing strictly preventing a mmap interface over 9p (in fact we are working with that in a Cell project internally) -- however, I'm not sure that makes the best sense for device access anyways. 
The real thing missing from the current implementation is a better underlying transport which can pass payloads by reference to shared memory as opposed to marshaling operations through a shared memory transport -- however, this is what Los Alamos and IBM are working on right now. > > As far as complexity goes, our guest-side stack > > is around 2000 lines of code (with an additional 1000 lines of support > > routines that could likely be replaced by standard library or OS > > services in more conventional platforms) and supports console, file > > system, network, and block device access. > > Another interface that I think is missing in 9p is a notification > for hotplugging. Of course you can have a long-running read on a > special file that returns the file names for virtual devices that > have been added or removed in the guest, but that sounds a little > clumsy compared to an specialized interface (e.g. Tnotify). > Discovery and hot-plugging would be synthetic file system semantic issues that need to be resolved and in general are probably, as Rusty and others suggested, best handled as a separate set of topics. That being said, specialized interfaces always seemed a bit more clunky to me (just look at ioctl), but I suppose that's largely a matter of taste. The advantage of having a file system interface to event notification is it creates a much more flexible environment, allowing even simple shell scripting languages to resolve events versus having to build a complex infrastructure -- and since 9p can be transitively mounted over a network, you can build cluster management suites without secondary layers of gorp for such things. The LANL guys will probably have more to say about this at their OLS talk on the KVM management synthetic file system interface they build with 9p. 
-eric
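[Editor's note] The pass-by-reference transport Eric mentions usually boils down to queuing references to guest pages instead of copying payload bytes through the shared ring. The descriptor layout and helpers below are an illustrative guess at such a scheme, not the design Los Alamos and IBM were working on:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define GUEST_PAGE_SIZE 4096ULL	/* assumed page size for this example */

/* Hypothetical ring descriptor: instead of marshaling a 9p payload
 * through shared memory, the guest queues references to its own pages
 * and the host maps (kmaps) them directly. */
struct payload_desc {
	uint64_t guest_pfn;	/* guest page frame holding the data */
	uint32_t offset;	/* byte offset within that page */
	uint32_t len;		/* payload bytes in this segment */
};

/* Split a guest buffer [addr, addr+len) into per-page segments.
 * Returns the number of descriptors written (at most 'max'). */
static size_t payload_split(uint64_t addr, uint32_t len,
			    struct payload_desc *out, size_t max)
{
	size_t n = 0;

	while (len > 0 && n < max) {
		uint32_t off = (uint32_t)(addr % GUEST_PAGE_SIZE);
		uint32_t chunk = (uint32_t)(GUEST_PAGE_SIZE - off);

		if (chunk > len)
			chunk = len;
		out[n].guest_pfn = addr / GUEST_PAGE_SIZE;
		out[n].offset = off;
		out[n].len = chunk;
		addr += chunk;
		len -= chunk;
		n++;
	}
	return n;
}

/* Total payload carried by a scatter list of segments. */
static size_t payload_total(const struct payload_desc *d, size_t n)
{
	size_t i, sum = 0;

	for (i = 0; i < n; i++)
		sum += d[i].len;
	return sum;
}
```

On the host side each `guest_pfn` would be translated and kmapped, which is the "dma access to guest memory" primitive from Carsten's earlier list.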
[parent not found: <a4e6962a0705231657n65946ba4n74393f7028b6d61c-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>]
* Re: [PATCH/RFC 7/9] Virtual network guest device driver [not found] ` <a4e6962a0705231657n65946ba4n74393f7028b6d61c-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org> @ 2007-05-24 0:07 ` Eric Van Hensbergen 0 siblings, 0 replies; 104+ messages in thread From: Eric Van Hensbergen @ 2007-05-24 0:07 UTC (permalink / raw) To: Arnd Bergmann Cc: Jimi Xenidis, Christian Borntraeger, jmk-zzFmDc4TPjtKvsKVC3L/VUEOCMrvLtNR, kvm-devel-5NWGOfrQmneRv+LV9MX5uipxlwaOVQ5f, mschwid2-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8, Christoph Hellwig, carsteno-tA70FqPdS9bQT0dZR+AlfA, Suzanne McIntosh On 5/23/07, Eric Van Hensbergen <ericvh-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org> wrote: > On 5/23/07, Arnd Bergmann <arnd-r2nGTMty4D4@public.gmane.org> wrote: > > On Wednesday 23 May 2007, Eric Van Hensbergen wrote: > > > On 5/23/07, Carsten Otte <cotte-tA70FqPdS9bQT0dZR+AlfA@public.gmane.org> wrote: > > > > > > > > For me, plan9 does provide answers to a lot of above requirements. > > > > However, it does not provide capabilities for shared memory and it > > > > adds extra complexity. It's been designed to solve a different problem. > > > > > > > As a point of clarification, plan9 protocols have been used over > > > shared memory for resource access on virtualized systems for the past > > > 3 years. There are certainly ways it can be further optimized, but it > > > is not a restriction. > > > > I think what Carsten means is to have a mmap interface over 9p, not > > implementing 9p by means of shared memory, which is what I guess > > you are referring to. > > > > If you want to share memory areas between a guest and the host > > or another guest, you can't do that with the regular Tread/Twrite > > interface that 9p has on a file. > > ugh. I'm tired. Its been a long week -- I realized after I fired off that last message that you mean establishing a shared mapping versus support for mmap operations over 9p (which devolve into Tread/Twrite). Sorry. 
Yes -- that's correct, 9p wouldn't necessarily buy you something like that. In fact, the current 9p code relies on someone else providing that basic mechanism in order for us to establish our shared memory transport. What Carsten described as his virtual device abstraction sounded like a good foundation -- just don't make me use ioctl :) -eric
* Re: [PATCH/RFC 7/9] Virtual network guest device driver [not found] ` <8FFF7E42E93CC646B632AB40643802A8032793AC-1a9uaKK1+wJcIJlls4ac1rfspsVTdybXVpNB7YpNyf8@public.gmane.org> 2007-05-23 8:15 ` Carsten Otte @ 2007-05-23 12:21 ` Avi Kivity 1 sibling, 0 replies; 104+ messages in thread From: Avi Kivity @ 2007-05-23 12:21 UTC (permalink / raw) To: Nakajima, Jun Cc: Jimi Xenidis, jmk-zzFmDc4TPjtKvsKVC3L/VUEOCMrvLtNR, Christian Borntraeger, mschwid2-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8, Christoph Hellwig, Suzanne McIntosh, kvm-devel-5NWGOfrQmneRv+LV9MX5uipxlwaOVQ5f Nakajima, Jun wrote: > BTW, I'm presenting this at OLS: > http://www.linuxsymposium.org/2007/view_abstract.php?content_key=192 > > This uses direct paging mode today. > Are there patches available anywhere? -- Do not meddle in the internals of kernels, for they are subtle and quick to panic. ------------------------------------------------------------------------- This SF.net email is sponsored by DB2 Express Download DB2 Express C - the FREE version of DB2 express and take control of your XML. No limits. Just data. Click to get it now. http://sourceforge.net/powerbar/db2/ ^ permalink raw reply [flat|nested] 104+ messages in thread
* Re: [PATCH/RFC 7/9] Virtual network guest device driver [not found] ` <20070522170628.GA16624-wEGCiKHe2LqWVfeAwA7xHQ@public.gmane.org> 2007-05-22 17:34 ` ron minnich @ 2007-05-23 12:16 ` Avi Kivity [not found] ` <465430B2.7050101-atKUWr5tajBWk0Htik3J/w@public.gmane.org> 1 sibling, 1 reply; 104+ messages in thread From: Avi Kivity @ 2007-05-23 12:16 UTC (permalink / raw) To: Christoph Hellwig Cc: Jimi Xenidis, jmk-zzFmDc4TPjtKvsKVC3L/VUEOCMrvLtNR@public.gmane.org, Christian Borntraeger, mschwid2-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8, Suzanne McIntosh, kvm-devel-5NWGOfrQmneRv+LV9MX5uipxlwaOVQ5f Christoph Hellwig wrote: > On Tue, May 22, 2007 at 10:00:42AM -0700, ron minnich wrote: > >> On 5/22/07, Eric Van Hensbergen <ericvh-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org> wrote: >> >> >>> I'm not opposed to supporting emulation environments, just don't make >>> a large pile of crap the default like Xen -- and having to integrate >>> PCI probing code in my guest domains is a large pile of crap. >>> >> Exactly. I'm about to start a pretty large project here, using xen or >> kvm, not sure. One thing for sure, we are NOT going to use anything >> but PV devices. Full emulation is nice, but it's just plain silly if >> you don't have to do it. And we don't have to do it. So let's get the >> PV devices right, not try to shoehorn them into some framework like >> PCI. >> > > If you don't care about full virtualization kvm is the wrong project for > you. You might want to take a look at lguest. > > This is incorrect. While kvm started out as a full virtualization project, it will expand with I/O PV and core PV. Eventually most of the paravirt_ops interface will have a kvm implementation. -- Do not meddle in the internals of kernels, for they are subtle and quick to panic. ------------------------------------------------------------------------- This SF.net email is sponsored by DB2 Express Download DB2 Express C - the FREE version of DB2 express and take control of your XML. No limits. 
[parent not found: <465430B2.7050101-atKUWr5tajBWk0Htik3J/w@public.gmane.org>]
* Re: [PATCH/RFC 7/9] Virtual network guest device driver [not found] ` <465430B2.7050101-atKUWr5tajBWk0Htik3J/w@public.gmane.org> @ 2007-05-23 12:20 ` Christoph Hellwig 0 siblings, 0 replies; 104+ messages in thread From: Christoph Hellwig @ 2007-05-23 12:20 UTC (permalink / raw) To: Avi Kivity Cc: Jimi Xenidis, jmk-zzFmDc4TPjtKvsKVC3L/VUEOCMrvLtNR@public.gmane.org, Christian Borntraeger, mschwid2-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8, Christoph Hellwig, Suzanne McIntosh, kvm-devel-5NWGOfrQmneRv+LV9MX5uipxlwaOVQ5f On Wed, May 23, 2007 at 03:16:50PM +0300, Avi Kivity wrote: > Christoph Hellwig wrote: > > On Tue, May 22, 2007 at 10:00:42AM -0700, ron minnich wrote: > > > >> On 5/22/07, Eric Van Hensbergen <ericvh-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org> wrote: > >> > >> > >>> I'm not opposed to supporting emulation environments, just don't make > >>> a large pile of crap the default like Xen -- and having to integrate > >>> PCI probing code in my guest domains is a large pile of crap. > >>> > >> Exactly. I'm about to start a pretty large project here, using xen or > >> kvm, not sure. One thing for sure, we are NOT going to use anything > >> but PV devices. Full emulation is nice, but it's just plain silly if > >> you don't have to do it. And we don't have to do it. So let's get the > >> PV devices right, not try to shoehorn them into some framework like > >> PCI. > >> > > > > If you don't care about full virtualization kvm is the wrong project for > > you. You might want to take a look at lguest. > > > > > > This is incorrect. While kvm started out as a full virtualization > project, it will expand with I/O PV and core PV. Eventually most of the > paravirt_ops interface will have a kvm implementation. The statement above was a little misworded I think. It should have been a "if you care about pure PV ..." 
* Re: [PATCH/RFC 7/9] Virtual network guest device driver [not found] ` <13426df10705221000i749badc5h8afe4f2fc95bc2ce-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org> 2007-05-22 17:06 ` Christoph Hellwig @ 2007-05-23 12:20 ` Avi Kivity 1 sibling, 0 replies; 104+ messages in thread From: Avi Kivity @ 2007-05-23 12:20 UTC (permalink / raw) To: ron minnich Cc: Jimi Xenidis, jmk-zzFmDc4TPjtKvsKVC3L/VUEOCMrvLtNR@public.gmane.org, mschwid2-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8, Christoph Hellwig, Christian Borntraeger, Suzanne McIntosh, kvm-devel-5NWGOfrQmneRv+LV9MX5uipxlwaOVQ5f ron minnich wrote: > On 5/22/07, Eric Van Hensbergen <ericvh-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org> wrote: > > >> I'm not opposed to supporting emulation environments, just don't make >> a large pile of crap the default like Xen -- and having to integrate >> PCI probing code in my guest domains is a large pile of crap. >> > > Exactly. I'm about to start a pretty large project here, using xen or > kvm, not sure. One thing for sure, we are NOT going to use anything > but PV devices. Full emulation is nice, but it's just plain silly if > you don't have to do it. And we don't have to do it. So let's get the > PV devices right, not try to shoehorn them into some framework like > PCI. > > What happens to these schemes if I want to try, e.g., 2^16 PV devices? > Or some other crazy thing that doesn't play well with PCI -- simple > example -- I want a 256 GB region of memory for a device. PCI rules > require me to align it on 256GB boundaries and it must be contiguous > address space. This is a hardware rule, done for hardware reasons, and > has no place in the PV world. What if I want a bit more than the basic > set of BARs that PCI gives me? Why would we apply such rules to a PV? > Why limit ourselves this early in the game? > > Device discovery and device operation are separate. 
Closed operating systems and older Linuces will need pci as a way to have easy plug'n'play discovery with no modifications to the kernel. Virtualization-friendly systems like newer Linux and s390 can have a virtual bus for discovery. -- Do not meddle in the internals of kernels, for they are subtle and quick to panic.
* Re: [PATCH/RFC 7/9] Virtual network guest device driver [not found] ` <a4e6962a0705220750s5abe380dg8dd8e7d0b84de7cd-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org> 2007-05-22 15:05 ` Anthony Liguori @ 2007-05-23 11:55 ` Avi Kivity 1 sibling, 0 replies; 104+ messages in thread From: Avi Kivity @ 2007-05-23 11:55 UTC (permalink / raw) To: Eric Van Hensbergen Cc: Jimi Xenidis, jmk-zzFmDc4TPjtKvsKVC3L/VUEOCMrvLtNR@public.gmane.org, Christian Borntraeger, mschwid2-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8, Christoph Hellwig, kvm-devel-5NWGOfrQmneRv+LV9MX5uipxlwaOVQ5f, Suzanne McIntosh Eric Van Hensbergen wrote: > On 5/22/07, Christoph Hellwig <hch-wEGCiKHe2LqWVfeAwA7xHQ@public.gmane.org> wrote: >> > >> > I didn't think we were talking about the general case, I thought we >> > were discussing the PV case. >> > >> >> In case of KVM no one is speaking of pure PV. >> > > Why not? It seems worthwhile to come up with something that can cover > the whole spectrum instead of having different hypervisors (and > interfaces). > That's the plan. PV I/O and PV mmu are on the roadmap. PV timers and interrupts should be easily doable too. The far end of the spectrum (PV with no hardware virtualization extensions) is possible, but no one is planning to do it AFAIK. -- Do not meddle in the internals of kernels, for they are subtle and quick to panic. ------------------------------------------------------------------------- This SF.net email is sponsored by DB2 Express Download DB2 Express C - the FREE version of DB2 express and take control of your XML. No limits. Just data. Click to get it now. http://sourceforge.net/powerbar/db2/ ^ permalink raw reply [flat|nested] 104+ messages in thread
* Re: [PATCH/RFC 7/9] Virtual network guest device driver [not found] ` <a4e6962a0705220549j1c9565f2ic160c672b74aea35-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org> 2007-05-22 12:56 ` Christoph Hellwig @ 2007-05-22 13:08 ` Anthony Liguori 1 sibling, 0 replies; 104+ messages in thread From: Anthony Liguori @ 2007-05-22 13:08 UTC (permalink / raw) To: Eric Van Hensbergen Cc: Jimi Xenidis, jmk-zzFmDc4TPjtKvsKVC3L/VUEOCMrvLtNR@public.gmane.org, Christian Borntraeger, mschwid2-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8, kvm-devel-5NWGOfrQmneRv+LV9MX5uipxlwaOVQ5f, Suzanne McIntosh Eric Van Hensbergen wrote: > On 5/22/07, Avi Kivity <avi-atKUWr5tajBWk0Htik3J/w@public.gmane.org> wrote: >> Anthony Liguori wrote: >> >> >> >> In a PV environment why not just pass an initial cookie/hash/whatever >> >> as a command-line argument/register/memory-space to the underlying >> >> kernel? >> >> >> > >> > You can't pass a command line argument to Windows (at least, not >> easily >> > AFAIK). You could get away with an MSR/CPUID flag but then you're >> > relying on uniqueness which isn't guaranteed. >> > >> >> In the general case, you can't pass a command line argument to Linux >> either. kvm doesn't boot Linux; it boots the bios, which boots the boot >> sector, which boots grub, which boots Linux. Relying on the user to >> edit the command line in grub is wrong. >> > > I didn't think we were talking about the general case, I thought we > were discussing the PV case. It is still useful to use PV drivers with full virtualization so it's something that ought to be considered. Regards, Anthony Liguori > In the PV case, having bios/bootloader > is unnecessary overhead. To that same end, I don't see Windows in the > PV case unless they magically want to to coordinate PV standards with > us, in which case we certainly can negotiate a more sane discovery > mechanism. 
> > -eric
* Re: [PATCH/RFC 7/9] Virtual network guest device driver [not found] ` <464B3F20.4030904-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org> ` (2 preceding siblings ...) 2007-05-16 17:45 ` Gregory Haskins @ 2007-05-18 5:31 ` ron minnich [not found] ` <13426df10705172231y5e93d1f5y398d4f187a8978e1-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org> 3 siblings, 1 reply; 104+ messages in thread From: ron minnich @ 2007-05-18 5:31 UTC (permalink / raw) To: Anthony Liguori Cc: Jimi Xenidis, jmk-zzFmDc4TPjtKvsKVC3L/VUEOCMrvLtNR@public.gmane.org, Christian Borntraeger, Martin Schwidefsky, kvm-devel-5NWGOfrQmneRv+LV9MX5uipxlwaOVQ5f@public.gmane.org On 5/16/07, Anthony Liguori <aliguori-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org> wrote: > What do you think about a socket interface? I'm not sure how discovery > would work yet, but there are a few PV socket implementations for Xen at > the moment. Hi Anthony, I still feel that "how about a socket interface" is focused on the "how to implement", and not "what the interface should be". I am also not sure the socket system call interface is quite what we want, although it's a neat idea. It's also not that portable outside the "everything is a Linux variant" world. So how about this as an interface design. The communications channels are visible in our name space at a mountpoint of our choice. Let's call this mount point, for sake of argument, vmic. When we mount on vmic, we see one file:
/vmic/clone
When we open and read /vmic/clone, we get a number; let's pretend for this example we get '0'. The numbers are not important, except to distinguish connections. Opening the clone file gets us a connection endpoint. An ls of the directory now shows this:
/vmic/clone
/vmic/0
The "directory", and the "files" in it, are owned by me, mode 700 or 600 or 400 as the file requires. The mode can be changed, of course, if I wish to allow wider access to the channel. Here, already, we see some advantage to the use of the file system for this type of capability.
What is in the directory? Here is one proposal.
/vmic/0/data
/vmic/0/status
/vmic/0/ctl
/vmic/0/local
/vmic/0/remote
What can we do with this? Data is pretty obvious: we can read it or write it, and that data is received/sent from the other endpoint. Note that I'm not saying how the data flows: it can be done in whatever manner is most efficient, by the kernel, including zero copy. It can be different for many reasons, but the point is that the interface is basically unchanging. Of course, it is an error to read or write data until something at the other end connects to the local end! What is status? We cat it and it gets us status in some meaningful text string. E.g.:
cat /vmic/0/status
connected /domain/name
What is local? It's our local name for the resource in this domain. What is remote? It's the name of the other endpoint. What's a name look like? I'm thinking it might look like /domain/name, but that is just a guess ... What is ctl? Here is where the fun begins. We might do things such as
echo bind somename > /vmic/0/ctl
This names the vmic. We might want to wait for a connection:
echo listen 1> /vmic/0/ctl
We might want to restrict it somehow:
echo key somekey > /vmic/0/ctl
echo listendomain domainnumber > /vmic/0/ctl
or we might know there is something out there:
echo connect /domainname/somename > /vmic/0/ctl
Once it is connected, we can move data. This is similar to your socket idea, but consider that:
o to see active vmics, I use 'ls'
o I don't have to create a new sockaddr address type
o I can control access with chmod
o I am separating the interface from the implementation
o This is, of course, not really 'files', but in-memory data structures; this can (and will) be fast
o No binary data structures. For different domains, even on the same machine, alignment rules etc. are not always the same -- I hit this when I ported Plan 9 to Xen, esp. back when Xen relied so heavily on gcc tricks such as __align__ and packed.
Using character strings eliminates that problem. This is, I think, the kind of thing Eric would also like to see, but he can correct me.
Thanks
ron
* Re: [PATCH/RFC 7/9] Virtual network guest device driver [not found] ` <13426df10705172231y5e93d1f5y398d4f187a8978e1-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org> @ 2007-05-18 14:31 ` Anthony Liguori [not found] ` <464DB8A5.6080503-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org> 0 siblings, 1 reply; 104+ messages in thread From: Anthony Liguori @ 2007-05-18 14:31 UTC (permalink / raw) To: ron minnich Cc: Jimi Xenidis, jmk-zzFmDc4TPjtKvsKVC3L/VUEOCMrvLtNR@public.gmane.org, Christian Borntraeger, Martin Schwidefsky, kvm-devel-5NWGOfrQmneRv+LV9MX5uipxlwaOVQ5f@public.gmane.org ron minnich wrote: > Hi Anthony, > > I still feel that "how about a socket interface" is still focused on > the "how to implement", and not "what the interface should be". Right. I'm not trying to answer that question ATM. There are a number of paravirt devices that would be useful in a virtual setting. For instance, a PV device for providing the guest with entropy and a shared PV clipboard. These devices should be simple but all current communication mechanisms are far too complicated. > I also > am not sure the socket system call interface is quite what we want, > although it's a neat idea. It's also not that portable outside the > "everything is a Linux variant" world. A filesystem interface certainly isn't very portable outside the POSIX world :-) > Once it is connected, we can move data. > > This is similar to your socket idea, but consider that: > o to see active vmics, I use 'ls' > o I don't have to create a new sockaddr address type > o I can control access with chmod > o I am seperating the interface from the implementation > o This is, of course, not really 'files', but in-memory data > structures; this can > (and will) be fast > o No binary data structures. > For different domains, even on the same machine, alignment rules etc. > are not > always the same -- I hit this when I ported Plan 9 to Xen, esp. back > when Xen > relied so heavily on gcc tricks such as __align__ and packed. 
> Using character strings eliminates that problem. The interface you're proposing is almost functionally identical to a socket. In fact, once you open /data you've got an fd that you interact with in the same way as you would interact with a socket. It's not that there's a unique value for this sort of interface in virtualization; I don't think you're making that argument. Instead, you're making a general argument as to why this way of doing things is better than what Unix has been doing forever (with things like sockets). That's fine, I think you have a valid point, but that's a larger argument to have on LKML or at a conference. This isn't the place to shoe-horn this sort of thing. A socket interface would provide a simple, well-understood interface that few people in the Linux community would disagree with (it's already there for s390). It should also be easy enough to stream 9p over the socket so you can build these interfaces easily and continue your attempts to expose the world as a virtual filesystem :-) Regards, Anthony Liguori > This is, I think, the kind of thing Eric would also like to see, but > he can correct me. > Thanks > > ron
* Re: [PATCH/RFC 7/9] Virtual network guest device driver [not found] ` <464DB8A5.6080503-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org> @ 2007-05-18 15:14 ` ron minnich 0 siblings, 0 replies; 104+ messages in thread From: ron minnich @ 2007-05-18 15:14 UTC (permalink / raw) To: Anthony Liguori Cc: Jimi Xenidis, jmk-zzFmDc4TPjtKvsKVC3L/VUEOCMrvLtNR@public.gmane.org, Christian Borntraeger, Martin Schwidefsky, kvm-devel-5NWGOfrQmneRv+LV9MX5uipxlwaOVQ5f@public.gmane.org On 5/18/07, Anthony Liguori <aliguori-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org> wrote: > > > I also > > am not sure the socket system call interface is quite what we want, > > although it's a neat idea. It's also not that portable outside the > > "everything is a Linux variant" world. > > A filesystem interface certainly isn't very portable outside the POSIX > world :-) Actually, it's probably the most portable thing you can have. > The interface you're proposing is almost functionally identical to a > socket. In fact, once you open /data you've got an fd that you interact > with in the same way as you would interact with a socket. Well, sure, I stole the interface from Plan 9, and they use this interface to do sockets, among *many* other things -- and there's the point. The interface is not just sockets. But if you're used to sockets, it looks familiar. I only steal from the best :-) Note, btw, that the fd has a path, and can be examined easily, and also passed to other programs for use. That's messy and ugly with sockets. > > It's not that there's a unique value for this sort of interface in > virtualization; I don't think you're making that argument. Instead, > you're making a general argument as to why this way of doing things is > better than what Unix has been doing forever (with things like > sockets) Yes, Unix has been "doing it this way" forever. The interface I am proposing was the one designed by the Unix guys -- once they realized how deficient the Unix way of doing things had become.
But, forgetting all this argument, it still seems to me that the file system interface is far simpler than a socket interface. No binary structures. No new sockaddr structures needed. No alignment/padding rules. You can actually set up a link from a shell script, or perl, or python, or whatever, without a special set of bindings. > A socket interface would provide a simple, well-understood interface > that few people in the Linux community would disagree with (it's already > there for s390). Yes, but ... well understood to the Linux community. Can we look at a broader scope? We've got a golden opportunity here to build a really flexible VMIC interface. I would hate to lose it. Anyway, thanks for discussing this. ron
* Re: [PATCH/RFC 7/9] Virtual network guest device driver [not found] ` <4644CE15.6080505-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org> 2007-05-11 21:15 ` Eric Van Hensbergen @ 2007-05-11 21:51 ` ron minnich 1 sibling, 0 replies; 104+ messages in thread From: ron minnich @ 2007-05-11 21:51 UTC (permalink / raw) To: Anthony Liguori Cc: Jimi Xenidis, jmk-zzFmDc4TPjtKvsKVC3L/VUEOCMrvLtNR@public.gmane.org, Christian Borntraeger, kvm-devel-5NWGOfrQmneRv+LV9MX5uipxlwaOVQ5f@public.gmane.org, Martin Schwidefsky On 5/11/07, Anthony Liguori <aliguori-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org> wrote: > For low speed devices, I think paravirtualization doesn't make a lot of > sense unless it's absolutely required. I don't know enough about s390 > to know if it supports things like uarts but if so, then emulating a > uart would in my mind make a lot more sense than a PV console device. I don't see how. Paravirtualization is pretty trivial for a console. I think emulating hardware is always worth avoiding. A PV console driver is going to be much more flexible than a uart emulator. > This smells a bit like XenStore which I think most will agree was an > unmitigated disaster. No, not at all. Just because we represent resources as directories and files, does that imply or require xenstore? Is /proc a xenstore entity? Is /sys? Not at all. These resources, which are represented over 9p as files and directories, are simply a representation of kernel data structures. I think you are jumping ahead too far, because that's not what I'm talking about. What I'm trying to propose is that the kvm host use a standard model for paravirt resources, and, since we've had 20 years of very good luck on Plan 9 using 9p and a directory/file model for all resources, including devices, I am hoping we can use that for the way that kvm communicates with its guests about devices. Consider /proc. It works. It's not a thing on disk, or a python glob like xenstore.
There are not even really tree-like data structures in /proc. The proc outputs are generated on demand as programs do operations on files in the /proc file system. This idea is similar -- not the same code, or implementation technique, but similar. Our proposal (it was Eric's idea, really, and he has in fact shown it in practice on IBM hypervisors) is that we define a standard memory channel for comms, as in Eric's paper; we define a standard request/response protocol to run over that channel, i.e. 9p, again, as in Eric's paper(s); and then, what you layer over it is up to the provider of the resource. This gives us one interface, and it can be efficient. Again, in this way, we get a common interface to diverse resources. This is a basic technique in computer science, and I was sorry to see Xen ignore it. Eric and I tried to get the Xen team to look at this, but they were too far along with their myriad interfaces, and it was too late to change. It's not too late for KVM. I am hoping we can use this model on KVM, before we have a whole pile of totally different interfaces to different PV devices. > This sort of thing gets terribly complicated to > deal with in the corner cases. Atomic operation of multiple read/write > operations is difficult to express. Moreover, quite a lot of things are > naturally expressed as a state machine which is not straight forward to > do in this sort of model. This may have been all figured out in 9P but > it's certainly not a simple thing to get right. We have the QED. It's called Plan 9. Then we have the second QED. It's called Inferno. They are each a reliable, simple, industrial-strength kernel running in a router near you. I accept it is hard to get right. I think you'd have to accept that it can, and in fact has, been gotten right for quite some time -- 20 years in the case of Plan 9. 
> I think a general rule of thumb for a virtualized environment is that > the closer you stick to the way hardware tends to do things, You mean like level interrupt emulation in Xen? That was easy? Or not screwed up? It was one of the messiest things I had to deal with in the Plan 9 port to Xen. And it made no sense, whatsoever, to have a level interrupt emulation. Except, of course, that the edge interrupts were even less fun :-) I believe that PV can buy us a very clean interface if done right. Emulating hardware is easy for the simple bits, and very hard to get perfect for the messy bits. Do we really want to emulate a 10G PHY, for example? > Implementing a full 9P client just > to get console access in something like mini-os would be unfortunate. 9p clients are trivial. newsham's 9p python client is a whopping 352 lines, 20 of them comments. A 9p client is far less code than the sum of the Linux uart code. > At least the posted s390 console driver behaves roughly like a uart so > it's pretty obvious that it will be easy to implement in any OS that > supports uarts already. Including all the fifo bugs? Because to really emulate hardware, to match a driver, you have to correctly emulate the *bugs*, not just the spec. That's where the fun begins. I think KVM has a great opportunity here to do a better job than Xen did with devices. So, I'll keep arguing and see if I can convince you :-) thanks ron
* Re: [PATCH/RFC 7/9] Virtual network guest device driver [not found] ` <13426df10705111244w1578ebedy8259bc42ca1f588d-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org> 2007-05-11 20:12 ` Anthony Liguori @ 2007-05-12 8:46 ` Carsten Otte [not found] ` <46457EF9.2070706-tA70FqPdS9bQT0dZR+AlfA@public.gmane.org> 2007-05-14 12:05 ` Avi Kivity 2 siblings, 1 reply; 104+ messages in thread From: Carsten Otte @ 2007-05-12 8:46 UTC (permalink / raw) To: ron minnich Cc: Jimi Xenidis, jmk-zzFmDc4TPjtKvsKVC3L/VUEOCMrvLtNR@public.gmane.org, kvm-devel-5NWGOfrQmneRv+LV9MX5uipxlwaOVQ5f@public.gmane.org, mschwid2-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8, Christian Borntraeger ron minnich wrote: > Let me ask what may seem to be a naive question to the linux world. I > see you are doing a lot off solid work on adding block and network > devices. The code for block and network devices > is implemented in different ways. I've also seen this difference of > inerface/implementation on Xen. Actually, the difference derives from the fact that block and network are indeed different: - block submits requests that ask the host to transfer from/to preallocated guest data buffers via dma (request driven) - net transmits packets that should end up in an skb on the remote side (two way, push driven) - net is sensitive to round-trip times, block is not due to the device plug for request merging We tried different access methods for both block and network. We have selected the current communication mechanics after doing performance measurements. I believe for a portable solution we need to develop a set of primitives for sending signals (read: interrupts) back and forth, for copying data to guest memory, and for establishing shared memory between guests and between guest+host. These primitives need to be implemented for each platform, and paravirtual drivers should build on top of that. At this point in time, we are aware that these device drivers don't do what we'd want for a portable solution. 
We'll focus on getting the kernel interfaces to sie/vt/svm proper and portable first.
so long,
Carsten
* Re: [PATCH/RFC 7/9] Virtual network guest device driver [not found] ` <46457EF9.2070706-tA70FqPdS9bQT0dZR+AlfA@public.gmane.org> @ 2007-05-13 12:04 ` Dor Laor [not found] ` <64F9B87B6B770947A9F8391472E032160BC74612-yEcIvxbTEBqsx+V+t5oei8rau4O3wl8o3fe8/T/H7NteoWH0uzbU5w@public.gmane.org> 0 siblings, 1 reply; 104+ messages in thread From: Dor Laor @ 2007-05-13 12:04 UTC (permalink / raw) To: carsteno-tA70FqPdS9bQT0dZR+AlfA, ron minnich Cc: Jimi Xenidis, kvm-devel-5NWGOfrQmneRv+LV9MX5uipxlwaOVQ5f, Christian Borntraeger, mschwid2-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8, jmk-zzFmDc4TPjtKvsKVC3L/VUEOCMrvLtNR >ron minnich wrote: >> Let me ask what may seem to be a naive question to the linux world. I >> see you are doing a lot off solid work on adding block and network >> devices. The code for block and network devices >> is implemented in different ways. I've also seen this difference of >> inerface/implementation on Xen. >Actually, the difference derives from the fact that block and network >are indeed different: >- block submits requests that ask the host to transfer from/to >preallocated guest data buffers via dma (request driven) >- net transmits packets that should end up in an skb on the remote >side (two way, push driven) >- net is sensitive to round-trip times, block is not due to the device >plug for request merging > >We tried different access methods for both block and network. We have >selected the current communication mechanics after doing performance >measurements. >I believe for a portable solution we need to develop a set of >primitives for sending signals (read: interrupts) back and forth, for >copying data to guest memory, and for establishing shared memory >between guests and between guest+host. These primitives need to be >implemented for each platform, and paravirtual drivers should build on >top of that. >At this point in time, we are aware that these device drivers don't do >what we'd want for a portable solution. 
>We'll focus on getting the kernel interfaces to sie/vt/svm proper and portable first.
>
>so long,
>Carsten

Based on the previous discussion and the s390 PV drivers I have more gasoline to pour on the flames: We have a working PV driver with 1Gbit performance. The reasons we don't push it into the kernel are:
a. We should perform much better
b. It would be a painful task getting all the code review that a complicated network interface should get.
c. There's already a PV driver that answers a and b.
Xen's PV network driver is now being pushed into the kernel. It is optimized and supports TSO. By adding generic ops calls we can all enjoy the above. Using Xen's core PV code doesn't imply that we will have their interface {xenstore}; the interface creation and tear-down would be kvm specific. They could even have a plain directory structure.
* Re: [PATCH/RFC 7/9] Virtual network guest device driver [not found] ` <64F9B87B6B770947A9F8391472E032160BC74612-yEcIvxbTEBqsx+V+t5oei8rau4O3wl8o3fe8/T/H7NteoWH0uzbU5w@public.gmane.org> @ 2007-05-13 14:49 ` Anthony Liguori [not found] ` <4647257F.4020900-rdkfGonbjUSkNkDKm+mE6A@public.gmane.org> 0 siblings, 1 reply; 104+ messages in thread From: Anthony Liguori @ 2007-05-13 14:49 UTC (permalink / raw) To: Dor Laor Cc: Jimi Xenidis, Christian Borntraeger, jmk-zzFmDc4TPjtKvsKVC3L/VUEOCMrvLtNR, carsteno-tA70FqPdS9bQT0dZR+AlfA, mschwid2-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8, kvm-devel-5NWGOfrQmneRv+LV9MX5uipxlwaOVQ5f Dor Laor wrote: > push it into the kernel are: > a. We should perform much better > b. It would be a painful task getting all the code review that a > > complicated network interface should get. > c. There's already a PV driver that answers a,b. > The Xen's PV network driver is now pushed into the kernel. > Actually, it's not (at least not as of a few moments ago). Furthermore, the plan is to completely rearchitect the netback/netfront protocol for the next Xen release (this effort is referred to netchannel2). See some of the XenSummit slides as to why this is necessary. Regards, Anthony Liguori > It is optimized, and support tso. > By adding a generic ops calls we can make enjoy all the above. > > Using Xen's core PV code doesn't imply that we will have their interface > {xenstore} the interface creation and tear-down would be kvm specific. > They could even have a plain directory structure. > > ------------------------------------------------------------------------- > This SF.net email is sponsored by DB2 Express > Download DB2 Express C - the FREE version of DB2 express and take > control of your XML. No limits. Just data. Click to get it now. 
* Re: [PATCH/RFC 7/9] Virtual network guest device driver [not found] ` <4647257F.4020900-rdkfGonbjUSkNkDKm+mE6A@public.gmane.org> @ 2007-05-13 16:23 ` Dor Laor [not found] ` <64F9B87B6B770947A9F8391472E032160BC74675-yEcIvxbTEBqsx+V+t5oei8rau4O3wl8o3fe8/T/H7NteoWH0uzbU5w@public.gmane.org> 0 siblings, 1 reply; 104+ messages in thread From: Dor Laor @ 2007-05-13 16:23 UTC (permalink / raw) To: Anthony Liguori Cc: Jimi Xenidis, Christian Borntraeger, jmk-zzFmDc4TPjtKvsKVC3L/VUEOCMrvLtNR, carsteno-tA70FqPdS9bQT0dZR+AlfA, mschwid2-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8, kvm-devel-5NWGOfrQmneRv+LV9MX5uipxlwaOVQ5f >Dor Laor wrote: >> push it into the kernel are: >> a. We should perform much better >> b. It would be a painful task getting all the code review that a >> complicated network interface should get. >> c. There's already a PV driver that answers a,b. >> The Xen's PV network driver is now pushed into the kernel. > >Actually, it's not (at least not as of a few moments ago). Furthermore, >the plan is to completely rearchitect the netback/netfront protocol for >the next Xen release (this effort is referred to netchannel2). But isn't Jeremy Fitzhardinge pushing a big patch queue into the kernel? If we manage to plant net_ops hooks into netback/netfront and the code gets into the kernel, they will have to keep the hooks for netchannel2. > >See some of the XenSummit slides as to why this is necessary. It looks like generalizing all the level 0,1,2 features plus performance optimizations. It's not something we couldn't upgrade to. >Regards, > >Anthony Liguori > >> It is optimized, and supports tso. >> By adding generic ops calls we can all enjoy the above. >> >> Using Xen's core PV code doesn't imply that we will have their interface >> {xenstore}; the interface creation and tear-down would be kvm specific. >> They could even have a plain directory structure.
* Re: [PATCH/RFC 7/9] Virtual network guest device driver [not found] ` <64F9B87B6B770947A9F8391472E032160BC74675-yEcIvxbTEBqsx+V+t5oei8rau4O3wl8o3fe8/T/H7NteoWH0uzbU5w@public.gmane.org> @ 2007-05-13 16:49 ` Anthony Liguori [not found] ` <4647418A.2040201-rdkfGonbjUSkNkDKm+mE6A@public.gmane.org> 0 siblings, 1 reply; 104+ messages in thread From: Anthony Liguori @ 2007-05-13 16:49 UTC (permalink / raw) To: Dor Laor Cc: Jimi Xenidis, Christian Borntraeger, jmk-zzFmDc4TPjtKvsKVC3L/VUEOCMrvLtNR, carsteno-tA70FqPdS9bQT0dZR+AlfA, mschwid2-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8, kvm-devel-5NWGOfrQmneRv+LV9MX5uipxlwaOVQ5f Dor Laor wrote: > Furthermore, > >> the plan is to completely rearchitect the netback/netfront protocol for >> the next Xen release (this effort is referred to netchannel2). >> > > But isn't Jeremy Fitzhardinge pushing a big patch queue into the > kernel? > Yes, but it's not in the kernel yet and there's no guarantee it'll get there in time for KVM's consumption. > If we manage to plant net_ops hooks into netback/netfront and the code > gets into the kernel, they will have to keep the hooks for netchannel2. > > >> See some of the XenSummit slides as to why this is necessary. >> > > It looks like generalizing all the level 0,1,2 features plus > performance optimizations. It's not something we couldn't upgrade to. > I'm curious what Rusty thinks as I do not know nearly enough about the networking subsystem to make an educated statement here. Would it be better to just try and generalize netback/netfront or build something from scratch? Could the lguest driver be generalized more easily? Regards, Anthony Liguori >> Regards, >> >> Anthony Liguori >> >> >>> It is optimized, and supports tso. >>> By adding generic ops calls we can all enjoy the above. >>> >>> Using Xen's core PV code doesn't imply that we will have their >>> interface {xenstore}; the interface creation and tear-down would be >>> kvm specific.
>>> They could even have a plain directory structure.
* Re: [PATCH/RFC 7/9] Virtual network guest device driver [not found] ` <4647418A.2040201-rdkfGonbjUSkNkDKm+mE6A@public.gmane.org> @ 2007-05-13 17:06 ` Muli Ben-Yehuda [not found] ` <20070513170608.GA4343-WD1JZD8MxeCTrf4lBMg6DdBPR1lH4CV8@public.gmane.org> 2007-05-14 2:39 ` Rusty Russell 2007-05-14 11:53 ` Avi Kivity 2 siblings, 1 reply; 104+ messages in thread From: Muli Ben-Yehuda @ 2007-05-13 17:06 UTC (permalink / raw) To: Anthony Liguori Cc: Jimi Xenidis, jmk-zzFmDc4TPjtKvsKVC3L/VUEOCMrvLtNR, Christian Borntraeger, mschwid2-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8, carsteno-tA70FqPdS9bQT0dZR+AlfA, kvm-devel-5NWGOfrQmneRv+LV9MX5uipxlwaOVQ5f On Sun, May 13, 2007 at 11:49:14AM -0500, Anthony Liguori wrote: > Dor Laor wrote: > > Furthermore, > > > >> the plan is to completely rearchitect the netback/netfront > >> protocol for the next Xen release (this effort is referred to > >> netchannel2). > >> > > > > But isn't Jeremy Fitzhardinge pushing a big patch queue into the > > kernel? > > > > Yes, but it's not in the kernel yet and there's no guarantee it'll > get there in time for KVM's consumption. On the other hand, there's strong interest in having unified virtual drivers. Given that the Xen drivers are out there, have been submitted and have been reasonably optimized, there will be some resistance to putting in "yet another" set of PV drivers. Also, the contentious merge point as I understand it is xenbus needing review, rather than the drivers themselves which are in pretty good shape. Cheers, Muli
* Re: [PATCH/RFC 7/9] Virtual network guest device driver
  [not found] ` <20070513170608.GA4343-WD1JZD8MxeCTrf4lBMg6DdBPR1lH4CV8@public.gmane.org>
@ 2007-05-13 20:31 ` Dor Laor
  0 siblings, 0 replies; 104+ messages in thread
From: Dor Laor @ 2007-05-13 20:31 UTC (permalink / raw)
To: Muli Ben-Yehuda, Anthony Liguori
Cc: Jimi Xenidis, jmk-zzFmDc4TPjtKvsKVC3L/VUEOCMrvLtNR,
    Christian Borntraeger, mschwid2-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8,
    carsteno-tA70FqPdS9bQT0dZR+AlfA, kvm-devel-5NWGOfrQmneRv+LV9MX5uipxlwaOVQ5f

> On Sun, May 13, 2007 at 11:49:14AM -0500, Anthony Liguori wrote:
> > Dor Laor wrote:
> > > Furthermore,
> > >
> > >> the plan is to completely rearchitect the netback/netfront
> > >> protocol for the next Xen release (this effort is referred to as
> > >> netchannel2).
> > >
> > > But isn't Jeremy Fitzhardinge pushing a big patch queue into the
> > > kernel?
> >
> > Yes, but it's not in the kernel yet and there's no guarantee it'll
> > get there in time for KVM's consumption.
>
> On the other hand, there's strong interest in having unified virtual
> drivers. Given that the Xen drivers are out there, have been submitted,
> and have been reasonably optimized, there will be some resistance to
> putting in "yet another" set of PV drivers. Also, the contentious merge
> point as I understand it is xenbus needing review, rather than the
> drivers themselves, which are in pretty good shape.

Moreover, it's not that it is too complex to write a set of back/front
ends; it's just that it's already written and optimized down to the bit.
Our current implementation has all the regular bells and whistles
(rings, delayed notifications, NAPI). It is simpler than Xen's, but it
lacks further optimizations and TSO/scatter-gather. If we ever use
NetChannel2 we should enjoy smart NIC features too. It's more tempting
and fun to continue supporting our own implementation, but it's more
correct to reuse code.

Nevertheless, we'll be happy to hear and discuss what others are
thinking. If the current Xen code fails to make it into the kernel,
then it would be even easier for us - we'll just rip out all the Xen
wrapping; the grant tables and the flipping would go away, leaving
clean, optimized network code.

Regards, Dor.
* Re: [PATCH/RFC 7/9] Virtual network guest device driver
  [not found] ` <4647418A.2040201-rdkfGonbjUSkNkDKm+mE6A@public.gmane.org>
  2007-05-13 17:06 ` Muli Ben-Yehuda
@ 2007-05-14 2:39 ` Rusty Russell
  2007-05-14 11:53 ` Avi Kivity
  2 siblings, 0 replies; 104+ messages in thread
From: Rusty Russell @ 2007-05-14 2:39 UTC (permalink / raw)
To: Anthony Liguori
Cc: Jimi Xenidis, Christian Borntraeger, jmk-zzFmDc4TPjtKvsKVC3L/VUEOCMrvLtNR,
    Herbert Xu, carsteno-tA70FqPdS9bQT0dZR+AlfA,
    mschwid2-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8, kvm-devel-5NWGOfrQmneRv+LV9MX5uipxlwaOVQ5f

On Sun, 2007-05-13 at 11:49 -0500, Anthony Liguori wrote:
> Dor Laor wrote:
> > Furthermore,
> >
> >> the plan is to completely rearchitect the netback/netfront protocol for
> >> the next Xen release (this effort is referred to as netchannel2).
> > It looks like generalizing all the level 0,1,2 features plus
> > performance optimizations. It's not something we couldn't upgrade to.
>
> I'm curious what Rusty thinks as I do not know nearly enough about the
> networking subsystem to make an educated statement here. Would it be
> better to just try and generalize netback/netfront or build something
> from scratch? Could the lguest driver be generalized more easily?

In turn, I'm curious as to Herbert's opinions on this. The lguest net
driver has only two features: it's small, and it does multi-way
inter-guest networking as well as guest<->host. It's not clear how much
the latter wins in real life over a point-to-point comms system.

My interest is in a common low-level transport. My experience is that
it's easy to create an efficient comms channel between a guest and host
(i.e. one side can access the other's memory), but it's worthwhile
trying for a model which transparently allows untrusted comms (i.e.
hypervisor-assisted access to the other guest's memory). That's easier
if you only want point-to-point (see lguest's io.c for a more general
solution).

Cheers, Rusty.
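[Editor's note: Rusty's "common low-level transport" is, at bottom, a shared-memory descriptor ring that one side produces into and the other consumes from. The following is a hypothetical sketch of such a single-producer/single-consumer ring with invented names — it is not the lguest or Xen code, and it inlines the payload instead of referencing it by guest-physical address:]

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Fixed-size descriptor ring: the producer writes at `head`, the
 * consumer reads at `tail`. Both indices are free-running and only
 * masked on access, so full/empty are unambiguous and no lock is
 * needed for a single producer and a single consumer. */
#define RING_LEN 8                 /* must be a power of two */

struct ring {
	uint32_t head;             /* next slot the producer fills  */
	uint32_t tail;             /* next slot the consumer drains */
	uint32_t len[RING_LEN];    /* descriptor: payload length    */
	char     buf[RING_LEN][64];/* descriptor: inline payload    */
};

static int ring_full(const struct ring *r)  { return r->head - r->tail == RING_LEN; }
static int ring_empty(const struct ring *r) { return r->head == r->tail; }

static int ring_put(struct ring *r, const char *data, uint32_t n)
{
	if (ring_full(r) || n > sizeof r->buf[0])
		return -1;
	uint32_t slot = r->head & (RING_LEN - 1);
	memcpy(r->buf[slot], data, n);
	r->len[slot] = n;
	r->head++;	/* publish; a real transport needs a memory barrier here */
	return 0;
}

static int ring_get(struct ring *r, char *out)
{
	if (ring_empty(r))
		return -1;
	uint32_t slot = r->tail & (RING_LEN - 1);
	uint32_t n = r->len[slot];
	memcpy(out, r->buf[slot], n);
	r->tail++;	/* free the slot for the producer */
	return (int)n;
}
```

A real transport keeps the payload in separately shared (or hypervisor-granted) pages and raises an interrupt only on empty-to-non-empty transitions; the s2pmit/p2smit counters in patch 8/9 below appear to implement the same head/tail idea.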
* Re: [PATCH/RFC 7/9] Virtual network guest device driver
  [not found] ` <4647418A.2040201-rdkfGonbjUSkNkDKm+mE6A@public.gmane.org>
  2007-05-13 17:06 ` Muli Ben-Yehuda
  2007-05-14 2:39 ` Rusty Russell
@ 2007-05-14 11:53 ` Avi Kivity
  2 siblings, 0 replies; 104+ messages in thread
From: Avi Kivity @ 2007-05-14 11:53 UTC (permalink / raw)
To: Anthony Liguori
Cc: Jimi Xenidis, jmk-zzFmDc4TPjtKvsKVC3L/VUEOCMrvLtNR,
    Christian Borntraeger, mschwid2-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8,
    carsteno-tA70FqPdS9bQT0dZR+AlfA, kvm-devel-5NWGOfrQmneRv+LV9MX5uipxlwaOVQ5f

Anthony Liguori wrote:
> Dor Laor wrote:
> > Furthermore,
> >
> >> the plan is to completely rearchitect the netback/netfront protocol for
> >> the next Xen release (this effort is referred to as netchannel2).
> >
> > But isn't Jeremy Fitzhardinge pushing a big patch queue into the
> > kernel?
>
> Yes, but it's not in the kernel yet and there's no guarantee it'll get
> there in time for KVM's consumption.

I doubt we could add the missing features to kvmnet, test, optimize,
submit to netdev, apply comments, re-submit, re-write, update to the
latest netdev API, and fix all the bugs much faster.

--
error compiling committee.c: too many arguments to function
* Re: [PATCH/RFC 7/9] Virtual network guest device driver
  [not found] ` <13426df10705111244w1578ebedy8259bc42ca1f588d-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
  2007-05-11 20:12 ` Anthony Liguori
  2007-05-12 8:46 ` Carsten Otte
@ 2007-05-14 12:05 ` Avi Kivity
  [not found] ` <46485070.3000106-atKUWr5tajBWk0Htik3J/w@public.gmane.org>
  2 siblings, 1 reply; 104+ messages in thread
From: Avi Kivity @ 2007-05-14 12:05 UTC (permalink / raw)
To: ron minnich
Cc: Jimi Xenidis, jmk-zzFmDc4TPjtKvsKVC3L/VUEOCMrvLtNR@public.gmane.org,
    Christian Borntraeger, kvm-devel-5NWGOfrQmneRv+LV9MX5uipxlwaOVQ5f@public.gmane.org,
    Martin Schwidefsky

ron minnich wrote:
> We had hoped to get something like this into Xen. On Xen, for example,
> the block device and ethernet device interfaces are as different as
> one could imagine. Disk I/O does not steal pages from the guest. The
> network does. Disk I/O is in 4k chunks, period, with a bitmap
> describing which of the 8 512-byte subunits are being sent. The enet
> device, on read, returns a page with your packet, but also potentially
> containing bits of other domains' packets too. The interfaces are as
> dissimilar as they can be, and I see no reason for such a huge
> variance between what are basically read/write devices.

The reason for the variance is that hardware capabilities are very
different for disk and network. Block device requests are always
guest-initiated and sector-aligned, and often span many pages. On the
other hand, network packets are byte-aligned, and rx packets are
host-initiated, triggering the stolen-pages concept (which
unsurprisingly turned out not to be a win). Network has such esoteric
features as TSO. Block is very interested in actually getting things
onto the disk (barrier support). In short, "everything is a stream of
bytes" grossly oversimplifies things.

> Another issue is that kvm, in its current form (-24) is beautifully
> simple. These additions seem to detract from the beauty a bit. Might
> it be worth taking a little time to consider these ideas in order to
> preserve the basic elegance of KVM?

kvm? elegant and simple? it's basically a pile of special cases. But I
agree that the growing code base is a problem. With the block driver we
can probably keep the host side in userspace, but to do the same for
networking is much more work. I do think (now) that it is doable.

--
error compiling committee.c: too many arguments to function
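[Editor's note: the disk-versus-network contrast above can be made concrete by looking at the shape of the request descriptors each device class needs. These structs are purely illustrative — invented for this sketch and loosely modeled on the Xen blkif/netif layouts, not taken from the posted patches:]

```c
#include <assert.h>
#include <stdint.h>

/* Block: guest-initiated, sector-aligned, often spans many pages. */
struct blk_request {
	uint64_t sector;                    /* 512-byte sector number      */
	uint32_t nr_segments;               /* request may span many pages */
	struct {
		uint64_t page_addr;         /* page-aligned data segment   */
		uint8_t  first_sect;        /* 512-byte subunits actually  */
		uint8_t  last_sect;         /* used within that page       */
	} seg[11];
	uint8_t  write;                     /* plus barrier/flush flags    */
};

/* Network rx: host-initiated, byte-aligned, arbitrary length. */
struct net_rx_desc {
	uint64_t buf_addr;                  /* guest-posted buffer         */
	uint16_t len;                       /* byte-aligned packet length  */
	uint16_t flags;                     /* e.g. checksum/TSO hints     */
};

/* Sector numbers address 512-byte units, so every block request is
 * inherently 512-byte aligned -- unlike a network packet's length. */
static uint64_t blk_byte_offset(uint64_t sector)
{
	return sector * 512;
}
```

The asymmetry is visible immediately: the block descriptor cannot express a 61-byte transfer at an odd offset, and the net descriptor has no use for segment lists of whole pages — which is the argument against a single "stream of bytes" interface.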
* Re: [PATCH/RFC 7/9] Virtual network guest device driver
  [not found] ` <46485070.3000106-atKUWr5tajBWk0Htik3J/w@public.gmane.org>
@ 2007-05-14 12:24 ` Christian Bornträger
  [not found] ` <200705141424.44423.cborntra-tA70FqPdS9bQT0dZR+AlfA@public.gmane.org>
  2007-05-14 13:36 ` Carsten Otte
  1 sibling, 1 reply; 104+ messages in thread
From: Christian Bornträger @ 2007-05-14 12:24 UTC (permalink / raw)
To: kvm-devel-5NWGOfrQmneRv+LV9MX5uipxlwaOVQ5f
Cc: Jimi Xenidis, Carsten Otte, jmk-zzFmDc4TPjtKvsKVC3L/VUEOCMrvLtNR@public.gmane.org,
    Martin Schwidefsky

On Monday 14 May 2007 14:05, Avi Kivity wrote:
> But I agree that the growing code base is a problem. With the block
> driver we can probably keep the host side in userspace, but to do the
> same for networking is much more work. I do think (now) that it is doable.

Interesting. What kind of userspace networking do you have in mind?

One of the first tries from Carsten was to use tun/tap, which proved to
be slow performance-wise.

What I had in mind was some kind of switch in userspace. That would
allow non-root guests to define their own private networks. We could
use Linux's fast pipe implementation for guest-to-guest communication.

The question is how to connect userspace networks to the host ones:
- tun/tap is quite slow
- last time we checked, netfilter offered only IP hooks (if you don't
  use the bridging code)
- raw sockets get tricky if you do in/out at the same time, because
  you have to manually deal with loops

This reminds me that we actually have another party doing virtual
networking between guests: UML. User Mode Linux can do
networking/switching in userspace, but I cannot tell how well UML's
concept works out.

Christian
* Re: [PATCH/RFC 7/9] Virtual network guest device driver
  [not found] ` <200705141424.44423.cborntra-tA70FqPdS9bQT0dZR+AlfA@public.gmane.org>
@ 2007-05-14 12:32 ` Avi Kivity
  0 siblings, 0 replies; 104+ messages in thread
From: Avi Kivity @ 2007-05-14 12:32 UTC (permalink / raw)
To: Christian Bornträger
Cc: Jimi Xenidis, Carsten Otte, jmk-zzFmDc4TPjtKvsKVC3L/VUEOCMrvLtNR@public.gmane.org,
    kvm-devel-5NWGOfrQmneRv+LV9MX5uipxlwaOVQ5f, Martin Schwidefsky

Christian Bornträger wrote:
> On Monday 14 May 2007 14:05, Avi Kivity wrote:
> > But I agree that the growing code base is a problem. With the block
> > driver we can probably keep the host side in userspace, but to do the
> > same for networking is much more work. I do think (now) that it is doable.
>
> Interesting. What kind of userspace networking do you have in mind?
>
> One of the first tries from Carsten was to use tun/tap, which proved to
> be slow performance-wise.

tun/tap, but extended with:
- true aio
- aio with scatter/gather (IO_CMD_PWRITEV/IO_CMD_PREADV)
- qemu support for native Linux aio (not the glibc hackaround currently
  in place), so we get event coalescing and cheap multi-request
  submission
- tap support for tso

With these, we could conceivably reach speeds close to an in-kernel
driver. Unfortunately we'd only know after all the hard work was done.

> What I had in mind was some kind of switch in userspace. That would
> allow non-root guests to define their own private networks. We could
> use Linux's fast pipe implementation for guest-to-guest communication.
>
> The question is how to connect userspace networks to the host ones:
> - tun/tap is quite slow
> - last time we checked, netfilter offered only IP hooks (if you don't
>   use the bridging code)
> - raw sockets get tricky if you do in/out at the same time, because
>   you have to manually deal with loops

qemu has some support for this, see the '-net socket' option.

--
error compiling committee.c: too many arguments to function
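[Editor's note: the '-net socket' option mentioned above lets two qemu instances join their NICs over a TCP (or multicast UDP) socket, keeping the whole virtual network in userspace. A sketch of the invocation as documented for qemu of that era; disk image names and MAC addresses are placeholders:]

```shell
# Guest A: create a NIC and listen for a peer on TCP port 1234
qemu -hda guest-a.img -net nic,macaddr=52:54:00:12:34:56 \
     -net socket,listen=:1234

# Guest B: connect its NIC to guest A's socket; the two guests now
# share a virtual ethernet segment, entirely in userspace
qemu -hda guest-b.img -net nic,macaddr=52:54:00:12:34:57 \
     -net socket,connect=127.0.0.1:1234
```

A multicast variant (-net socket,mcast=230.0.0.1:1234) joins more than two instances to the same segment, which is roughly the "dumb switch" topology of patch 8/9 below without any kernel code.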
* Re: [PATCH/RFC 7/9] Virtual network guest device driver
  [not found] ` <46485070.3000106-atKUWr5tajBWk0Htik3J/w@public.gmane.org>
  2007-05-14 12:24 ` Christian Bornträger
@ 2007-05-14 13:36 ` Carsten Otte
  1 sibling, 0 replies; 104+ messages in thread
From: Carsten Otte @ 2007-05-14 13:36 UTC (permalink / raw)
To: Avi Kivity
Cc: Jimi Xenidis, jmk-zzFmDc4TPjtKvsKVC3L/VUEOCMrvLtNR@public.gmane.org,
    Christian Borntraeger, mschwid2-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8,
    kvm-devel-5NWGOfrQmneRv+LV9MX5uipxlwaOVQ5f@public.gmane.org

Avi Kivity wrote:
> But I agree that the growing code base is a problem. With the block
> driver we can probably keep the host side in userspace, but to do the
> same for networking is much more work. I do think (now) that it is doable.

I agree that networking needs to be handled in the host kernel. We go
out to userspace for signaling at this time, but that's simply broken.
All our userspace does is do a system call next.

so long,
Carsten
* [PATCH/RFC 8/9] Virtual network host switch support
  [not found] ` <1178903957.25135.13.camel-WIxn4w2hgUz3YA32ykw5MLlKpX0K8NHHQQ4Iyu8u01E@public.gmane.org>
  ` (5 preceding siblings ...)
  2007-05-11 17:36 ` [PATCH/RFC 7/9] Virtual network guest " Carsten Otte
@ 2007-05-11 17:36 ` Carsten Otte
  [not found] ` <1178904968.25135.35.camel-WIxn4w2hgUz3YA32ykw5MLlKpX0K8NHHQQ4Iyu8u01E@public.gmane.org>
  2007-05-11 17:36 ` [PATCH/RFC 9/9] Fix system<->user misaccount of interpreted execution Carsten Otte
  7 siblings, 1 reply; 104+ messages in thread
From: Carsten Otte @ 2007-05-11 17:36 UTC (permalink / raw)
To: kvm-devel-5NWGOfrQmneRv+LV9MX5uipxlwaOVQ5f@public.gmane.org
Cc: Christian Borntraeger, Martin Schwidefsky

From: Christian Borntraeger <cborntra-tA70FqPdS9bQT0dZR+AlfA@public.gmane.org>

This is the host counterpart for the virtual network device driver.
This driver has a char device node where the hypervisor can attach. It
also has a kind of dumb switch that passes packets between guests. Last
but not least, it contains a host network interface. Patches for
attaching other host network devices to the switch via raw sockets,
extensions to qeth, or netfilter are currently being tested but not
ready yet. We did not use the Linux bridging code in order to allow
non-root users to create virtual networks between guests.
Signed-off-by: Christian Borntraeger <cborntra-tA70FqPdS9bQT0dZR+AlfA@public.gmane.org> Signed-off-by: Carsten Otte <cotte-tA70FqPdS9bQT0dZR+AlfA@public.gmane.org> --- drivers/s390/guest/Makefile | 3 drivers/s390/guest/vnet_port_guest.c | 302 ++++++++++++ drivers/s390/guest/vnet_port_guest.h | 21 drivers/s390/guest/vnet_port_host.c | 418 +++++++++++++++++ drivers/s390/guest/vnet_port_host.h | 18 drivers/s390/guest/vnet_switch.c | 828 +++++++++++++++++++++++++++++++++++ drivers/s390/guest/vnet_switch.h | 119 +++++ drivers/s390/net/Kconfig | 12 8 files changed, 1721 insertions(+) Index: linux-2.6.21/drivers/s390/guest/vnet_port_guest.c =================================================================== --- /dev/null +++ linux-2.6.21/drivers/s390/guest/vnet_port_guest.c @@ -0,0 +1,302 @@ +/* + * Copyright (C) 2005 IBM Corporation + * Authors: Carsten Otte <cotte-tA70FqPdS9bQT0dZR+AlfA@public.gmane.org> + * Christian Borntraeger <borntrae-tA70FqPdS9bQT0dZR+AlfA@public.gmane.org> + * + */ +#include <linux/etherdevice.h> +#include <linux/fs.h> +#include <linux/kernel.h> +#include <linux/list.h> +#include <linux/module.h> +#include <linux/pagemap.h> +#include <linux/poll.h> +#include <linux/spinlock.h> + +#include "vnet.h" +#include "vnet_port_guest.h" +#include "vnet_switch.h" + +static void COFIXME_add_irq(struct vnet_guest_port *zgp, int data) +{ + int oldval, newval; + + do { + oldval = atomic_read(&zgp->pending_irqs); + newval = oldval | data; + } while (atomic_cmpxchg(&zgp->pending_irqs, oldval, newval) != oldval); +} + +static int COFIXME_get_irq(struct vnet_guest_port *zgp) +{ + int oldval; + + do { + oldval = atomic_read(&zgp->pending_irqs); + } while (atomic_cmpxchg(&zgp->pending_irqs, oldval, 0) != oldval); + + return oldval; +} + +static void +vnet_guest_interrupt(struct vnet_port *port, int type) +{ + struct vnet_guest_port *priv; + + priv = port->priv; + + if (!priv->fasync) { + printk (KERN_WARNING "vnet: cannot send interrupt," + "fd not async\n"); + 
return; + } + switch (type) { + case VNET_IRQ_START_RX: + COFIXME_add_irq(priv, POLLIN); + kill_fasync(&priv->fasync, SIGIO, POLL_IN); + break; + case VNET_IRQ_START_TX: + COFIXME_add_irq(priv, POLLOUT); + kill_fasync(&priv->fasync, SIGIO, POLL_OUT); + break; + default: + BUG(); + } +} + +/* release all pinned user pages*/ +static void +vnet_guest_release_pages(struct vnet_port *port) +{ + int i,j; + + for (i=0; i<VNET_QUEUE_LEN; i++) + for (j=0; j<VNET_BUFFER_PAGES; j++) { + if (port->s2p_data[i][j]) { + page_cache_release(virt_to_page(port->s2p_data[i][j])); + port->s2p_data[i][j] = NULL; + } + if (port->p2s_data[i][j]) { + page_cache_release(virt_to_page(port->p2s_data[i][j])); + port->p2s_data[i][j] = NULL; + } + } + if (port->control) { + page_cache_release(virt_to_page(port->control)); + port->control = NULL; + } +} + +static int +vnet_chr_open(struct inode *ino, struct file *filp) +{ + int minor; + struct vnet_port *port; + char name[BUS_ID_SIZE]; + + minor = iminor(filp->f_dentry->d_inode); + snprintf(name, BUS_ID_SIZE, "guest:%d", current->pid); + port = vnet_port_get(minor, name); + if (!port) + return -ENODEV; + port->priv = kzalloc(sizeof(struct vnet_guest_port), GFP_KERNEL); + if (!port->priv) { + vnet_port_put(port); + return -ENOMEM; + } + port->interrupt = vnet_guest_interrupt; + filp->private_data = port; + return nonseekable_open(ino, filp); +} + +static int +vnet_chr_release (struct inode *ino, struct file *filp) +{ + struct vnet_port *port; + port = (struct vnet_port *) filp->private_data; + +//FIXME: what about open close? We unregister non exisiting mac addresses +// in vnet_port_detach! 
+ vnet_port_detach(port); + vnet_guest_release_pages(port); + vnet_port_put(port); + return 0; +} + + +/* helper function which maps a user page into the kernel + * the memory must be free with page_cache_release */ +static void *user_to_kernel(char __user *user) +{ + struct page *temp_page; + int rc; + + BUG_ON(((unsigned long) user) % PAGE_SIZE); + rc = fault_in_pages_writeable(user, PAGE_SIZE); + if (rc) + return NULL; + rc = get_user_pages(current, current->mm, (unsigned long) user, + 1, 1, 1, &temp_page, NULL); + if (rc != 1) + return NULL; + return page_address(temp_page); +} + +/* this function pins the userspace buffers into memory*/ +static int +vnet_guest_alloc_pages(struct vnet_port *port) +{ + int i,j; + + down_read(¤t->mm->mmap_sem); + for (i=0; i<VNET_QUEUE_LEN; i++) + for (j=0; j<VNET_BUFFER_PAGES; j++) { + port->s2p_data[i][j] = user_to_kernel(port->control-> + s2pbufs[i].data + j*PAGE_SIZE); + if (!port->s2p_data[i][j]) + goto cleanup; + port->p2s_data[i][j] = user_to_kernel(port->control-> + p2sbufs[i].data + j*PAGE_SIZE); + if (!port->p2s_data[i][j]) + goto cleanup; + + } + up_read(¤t->mm->mmap_sem); + return 0; +cleanup: + up_read(¤t->mm->mmap_sem); + vnet_guest_release_pages(port); + return -ENOMEM; +} + +/* userspace control data structure stuff */ +static int +vnet_register_control(struct vnet_port *port, unsigned long user_addr) +{ + u64 uaddr; + int rc; + struct page *control_page; + + rc = copy_from_user(&uaddr, (void __user *) user_addr, sizeof(uaddr)); + if (rc) + return -EFAULT; + if (uaddr % PAGE_SIZE) + return -EFAULT; + down_read(¤t->mm->mmap_sem); + rc = get_user_pages(current, current->mm, (unsigned long)uaddr, + 1, 1, 1, &control_page, NULL); + up_read(¤t->mm->mmap_sem); + if (rc!=1) + return -EFAULT; + port->control = (struct vnet_control *) page_address(control_page); + rc = vnet_guest_alloc_pages(port); + if (rc) { + printk("vnet: could not get buffers\n"); + return rc; + } + random_ether_addr(port->mac); + 
memcpy(port->control->mac, port->mac,6); + vnet_port_attach(port); + return 0; +} + +static int +vnet_interrupt(struct vnet_port *port, int __user *u_type) +{ + int type, rc; + + rc = copy_from_user (&type, u_type, sizeof(int)); + if (rc) + return -EFAULT; + switch (type) { + case VNET_IRQ_START_RX: + vnet_port_rx(port); + break; + case VNET_IRQ_START_TX: /* noop with current drop packet approach*/ + break; + default: + printk(KERN_ERR "vnet: Unknown interrupt type %d\n", type); + rc = -EINVAL; + } + return rc; +} + + + + +//this is a HACK. >>COFIXME<< +unsigned int +vnet_poll(struct file *filp, poll_table * wait) +{ + struct vnet_port *port; + struct vnet_guest_port *zgp; + + port = filp->private_data; + zgp = port->priv; + return COFIXME_get_irq(zgp); +} + +static int vnet_fill_info(struct vnet_port *zp, void __user *data) +{ + struct vnet_info info; + + info.linktype = zp->zs->linktype; + info.maxmtu=32768; //FIXME + return copy_to_user(data, &info, sizeof(info)); +} +long +vnet_ioctl(struct file *filp, unsigned int no, unsigned long data) +{ + struct vnet_port *port = + (struct vnet_port *) filp->private_data; + int rc; + + switch (no) { + case VNET_REGISTER_CTL: + rc = vnet_register_control(port, data); + break; + case VNET_INTERRUPT: + rc = vnet_interrupt(port, (int __user *) data); + break; + case VNET_INFO: + rc = vnet_fill_info(port, (void __user *) data); + break; + default: + rc = -ENOTTY; + } + return rc; +} + +int vnet_fasync(int fd, struct file *filp, int on) +{ + struct vnet_port *port; + struct vnet_guest_port *zgp; + int rc; + + port = filp->private_data; + zgp = port->priv; + + if ((rc = fasync_helper(fd, filp, on, &zgp->fasync)) < 0) + return rc; + + if (on) + rc = f_setown(filp, current->pid, 0); + return rc; +} + + +static struct file_operations vnet_char_fops = { + .owner = THIS_MODULE, + .open = vnet_chr_open, + .release = vnet_chr_release, + .unlocked_ioctl = vnet_ioctl, + .fasync = vnet_fasync, + .poll = vnet_poll, +}; + + + +void 
vnet_cdev_init(struct cdev *cdev) +{ + cdev_init(cdev, &vnet_char_fops); +} Index: linux-2.6.21/drivers/s390/guest/vnet_port_guest.h =================================================================== --- /dev/null +++ linux-2.6.21/drivers/s390/guest/vnet_port_guest.h @@ -0,0 +1,21 @@ +/* + * Copyright (C) 2005 IBM Corporation + * Authors: Carsten Otte <cotte-tA70FqPdS9bQT0dZR+AlfA@public.gmane.org> + * Christian Borntraeger <cborntra-tA70FqPdS9bQT0dZR+AlfA@public.gmane.org> + * + */ + +#ifndef __VNET_PORTS_GUEST_H +#define __VNET_PORTS_GUEST_H + +#include <linux/fs.h> +#include <linux/cdev.h> +#include <asm/atomic.h> + +struct vnet_guest_port { + struct fasync_struct *fasync; + atomic_t pending_irqs; +}; + +extern void vnet_cdev_init(struct cdev *cdev); +#endif Index: linux-2.6.21/drivers/s390/guest/vnet_port_host.c =================================================================== --- /dev/null +++ linux-2.6.21/drivers/s390/guest/vnet_port_host.c @@ -0,0 +1,418 @@ +/* + * vnet zlswitch handling + * + * Copyright (C) 2005 IBM Corporation + * Authors: Carsten Otte <cotte-tA70FqPdS9bQT0dZR+AlfA@public.gmane.org> + * Christian Borntraeger <borntrae-tA70FqPdS9bQT0dZR+AlfA@public.gmane.org> + * + */ + +#include <linux/etherdevice.h> +#include <linux/if.h> +#include <linux/if_ether.h> +#include <linux/if_arp.h> +#include <linux/kernel.h> +#include <linux/list.h> +#include <linux/module.h> +#include <linux/netdevice.h> +#include <linux/rtnetlink.h> +#include <linux/pagemap.h> +#include <linux/spinlock.h> + +#include "vnet.h" +#include "vnet_switch.h" +#include "vnet_port_host.h" + +static void +vnet_host_interrupt(struct vnet_port *zp, int type) +{ + struct vnet_host_port *zhp; + + zhp = zp->priv; + + BUG_ON(!zhp->netdev); + + switch (type) { + case VNET_IRQ_START_RX: + netif_rx_schedule(zhp->netdev); + break; + case VNET_IRQ_START_TX: + netif_wake_queue(zhp->netdev); + break; + default: + BUG(); + } + /* we are called via system call path. 
enforce softirq handling */ + do_softirq(); +} + +static void +vnet_host_free(struct vnet_port *zp) +{ + int i,j; + + for (i=0; i<VNET_QUEUE_LEN; i++) + for (j=0; j<VNET_BUFFER_PAGES; j++) { + if (zp->s2p_data[i][j]) { + free_page((unsigned long) zp->s2p_data[i][j]); + zp->s2p_data[i][j] = NULL; + } + if (zp->p2s_data[i][j]) { + free_page((unsigned long) zp->p2s_data[i][j]); + zp->p2s_data[i][j] = NULL; + } + } + if (zp->control) { + kfree(zp->control); + zp->control = NULL; + } +} + +static int +vnet_port_hostsetup(struct vnet_port *zp) +{ + int i,j; + + zp->control = kzalloc(sizeof(*zp->control), GFP_KERNEL); + if (!zp->control) + return -ENOMEM; + for (i=0; i<VNET_QUEUE_LEN; i++) + for (j=0; j<VNET_BUFFER_PAGES; j++) { + zp->s2p_data[i][j] = (void *) __get_free_pages(GFP_KERNEL,0); + if (!zp->s2p_data[i][j]) + goto oom; + zp->p2s_data[i][j] = (void *) __get_free_pages(GFP_KERNEL,0); + if (!zp->p2s_data[i][j]) { + free_page((unsigned long) zp->s2p_data[i][j]); + goto oom; + } + } + zp->control->buffer_size = VNET_BUFFER_SIZE; + return 0; +oom: + printk(KERN_WARNING "vnet: No memory for buffer space of host device\n"); + vnet_host_free(zp); + return -ENOMEM; +} + +/* host interface specific parts */ + + +static int +vnet_net_open(struct net_device *dev) +{ + struct vnet_port *port; + struct vnet_control *control; + + port = dev->priv; + control = port->control; + atomic_set(&control->s2pmit, 0); + netif_start_queue(dev); + return 0; +} + +static int +vnet_net_stop(struct net_device *dev) +{ + netif_stop_queue(dev); + return 0; +} + +static void vnet_net_tx_timeout(struct net_device *dev) +{ + struct vnet_port *port = dev->priv; + struct vnet_control *control = port->control; + + printk(KERN_ERR "problems in xmit for device %s\n Resetting...\n", + dev->name); + atomic_set(&control->p2smit, 0); + atomic_set(&control->s2pmit, 0); + vnet_port_rx(port); + netif_wake_queue(dev); +} + + +static int +vnet_net_xmit(struct sk_buff *skb, struct net_device *dev) +{ + struct 
vnet_port *zhost; + struct vnet_host_port *zhp; + struct vnet_control *control; + struct xmit_buffer *buf; + int buffer_status; + int pkid; + + zhost = dev->priv; + zhp = zhost->priv; + control = zhost->control; + + if (!spin_trylock(&zhost->txlock)) + return NETDEV_TX_LOCKED; + if (vnet_q_full(atomic_read(&control->p2smit))) { + netif_stop_queue(dev); + goto full; + } + pkid = __nextx(atomic_read(&control->p2smit)); + buf = &control->p2sbufs[pkid]; + buf->len = skb->len; + buf->proto = skb->protocol; + vnet_copy_buf_to_pages(zhost->p2s_data[pkid], skb->data, skb->len); + buffer_status = vnet_tx_packet(&control->p2smit); + spin_unlock(&zhost->txlock); + zhp->stats.tx_packets++; + zhp->stats.tx_bytes += skb->len; + dev_kfree_skb(skb); + dev->trans_start = jiffies; + if (buffer_status & QUEUE_WAS_EMPTY) + vnet_port_rx(zhost); + if (buffer_status & QUEUE_IS_FULL) { + netif_stop_queue(dev); + spin_lock(&zhost->txlock); + } else + return NETDEV_TX_OK; +full: + /* we might have raced against the wakeup */ + if (!vnet_q_full(atomic_read(&control->p2smit))) + netif_start_queue(dev); + spin_unlock(&zhost->txlock); + return NETDEV_TX_OK; +} + +static int +vnet_l3_poll(struct net_device *dev, int *budget) +{ + struct vnet_port *zp = dev->priv; + struct vnet_host_port *zhp = zp->priv; + struct vnet_control *control = zp->control; + struct xmit_buffer *buf; + struct sk_buff *skb; + int pkid, count, numpackets = min(64, min(dev->quota, *budget)); + int buffer_status; + + if (vnet_q_empty(atomic_read(&control->s2pmit))) { + count = 0; + goto empty; + } +loop: + count = 0; + while(numpackets) { + pkid = __nextr(atomic_read(&control->s2pmit)); + buf = &control->s2pbufs[pkid]; + skb = dev_alloc_skb(buf->len + 2); + if (likely(skb)) { + skb_reserve(skb, 2); + vnet_copy_pages_to_buf(skb_put(skb, buf->len), + zp->s2p_data[pkid], buf->len); + skb->dev = dev; + skb->protocol = buf->proto; +// skb->ip_summed = CHECKSUM_UNNECESSARY; + zhp->stats.rx_packets++; + zhp->stats.rx_bytes += 
buf->len; + netif_receive_skb(skb); + numpackets--; + (*budget)--; + dev->quota--; + count++; + } else + zhp->stats.rx_dropped++; + buffer_status = vnet_rx_packet(&control->s2pmit); + if (buffer_status & QUEUE_IS_EMPTY) + goto empty; + } + return 1; //please ask us again +empty: + netif_rx_complete(dev); + /* we might have raced against a wakup*/ + if (!vnet_q_empty(atomic_read(&control->s2pmit))) { + if (netif_rx_reschedule(dev, count)) + goto loop; + } + return 0; +} + + +static int +vnet_l2_poll(struct net_device *dev, int *budget) +{ + struct vnet_port *zp = dev->priv; + struct vnet_host_port *zhp = zp->priv; + struct vnet_control *control = zp->control; + struct xmit_buffer *buf; + struct sk_buff *skb; + int pkid, count, numpackets = min(64, min(dev->quota, *budget)); + int buffer_status; + + if (vnet_q_empty(atomic_read(&control->s2pmit))) { + count = 0; + goto empty; + } +loop: + count = 0; + while(numpackets) { + pkid = __nextr(atomic_read(&control->s2pmit)); + buf = &control->s2pbufs[pkid]; + skb = dev_alloc_skb(buf->len + 2); + if (likely(skb)) { + skb_reserve(skb, 2); + vnet_copy_pages_to_buf(skb_put(skb, buf->len), + zp->s2p_data[pkid], buf->len); + skb->dev = dev; + skb->protocol = eth_type_trans(skb, dev); +// skb->ip_summed = CHECKSUM_UNNECESSARY; + zhp->stats.rx_packets++; + zhp->stats.rx_bytes += buf->len; + netif_receive_skb(skb); + numpackets--; + (*budget)--; + dev->quota--; + count++; + } else + zhp->stats.rx_dropped++; + buffer_status = vnet_rx_packet(&control->s2pmit); + if (buffer_status & QUEUE_IS_EMPTY) + goto empty; + } + return 1; //please ask us again +empty: + netif_rx_complete(dev); + /* we might have raced against a wakup*/ + if (!vnet_q_empty(atomic_read(&control->s2pmit))) { + if (netif_rx_reschedule(dev, count)) + goto loop; + } + return 0; +} + +static struct net_device_stats * +vnet_net_stats(struct net_device *dev) +{ + struct vnet_port *zp; + struct vnet_host_port *zhp; + + zp = dev->priv; + zhp = zp->priv; + return 
&zhp->stats; +} + +static int +vnet_net_change_mtu(struct net_device *dev, int new_mtu) +{ + if (new_mtu <= ETH_ZLEN) + return -ERANGE; + if (new_mtu > VNET_BUFFER_SIZE-ETH_HLEN) + return -ERANGE; + dev->mtu = new_mtu; + return 0; +} + +static void +__vnet_common_init(struct net_device *dev) +{ + dev->open = vnet_net_open; + dev->stop = vnet_net_stop; + dev->hard_start_xmit = vnet_net_xmit; + dev->get_stats = vnet_net_stats; + dev->tx_timeout = vnet_net_tx_timeout; + dev->watchdog_timeo = VNET_TIMEOUT; + dev->change_mtu = vnet_net_change_mtu; + dev->weight = 64; + //dev->features |= NETIF_F_NO_CSUM | NETIF_F_LLTX; + dev->features |= NETIF_F_LLTX; +} + +static void +__vnet_layer3_init(struct net_device *dev) +{ + dev->mtu = ETH_DATA_LEN; + dev->tx_queue_len = 1000; + dev->flags = IFF_BROADCAST|IFF_MULTICAST|IFF_NOARP; + dev->type = ARPHRD_PPP; + dev->mtu = 1492; + dev->poll = vnet_l3_poll; + __vnet_common_init(dev); +} + +static void +__vnet_layer2_init(struct net_device *dev) +{ + ether_setup(dev); + random_ether_addr(dev->dev_addr); + dev->mtu = 1492; + dev->poll = vnet_l2_poll; + __vnet_common_init(dev); +} + +static void +vnet_host_destroy(struct vnet_port *zhost) +{ + struct vnet_host_port *zhp; + zhp = zhost->priv; + + vnet_port_detach(zhost); + unregister_netdev(zhp->netdev); + free_netdev(zhp->netdev); + zhp->netdev = NULL; + vnet_host_free(zhost); + kfree(zhp); + vnet_port_put(zhost); +} + + + +struct vnet_port * +vnet_host_create(char *name) +{ + int rc; + struct vnet_port *port; + struct vnet_host_port *host; + char busname[BUS_ID_SIZE]; + int minor; + + snprintf(busname, BUS_ID_SIZE, "host:%s", name); + + minor = vnet_minor_by_name(name); + if (minor < 0) + return NULL; + port = vnet_port_get(minor, busname); + if (!port) + goto out; + host = kzalloc(sizeof(struct vnet_host_port), GFP_KERNEL); + if (!host) { + kfree(port); + port = NULL; + goto out; + } + port->priv = host; + rc =vnet_port_hostsetup(port); + if (rc) + goto out_free_host; + rtnl_lock(); + 
if (port->zs->linktype == 2) + host->netdev = alloc_netdev(0, name, __vnet_layer2_init); + else + host->netdev = alloc_netdev(0, name, __vnet_layer3_init); + if (!host->netdev) + goto out_unlock; + memcpy(port->mac, host->netdev->dev_addr, ETH_ALEN); + + host->netdev->priv = port; + port->interrupt = vnet_host_interrupt; + port->destroy = vnet_host_destroy; + + if (!register_netdevice(host->netdev)) { + /* good case */ + rtnl_unlock(); + return port; + } + host->netdev->priv = NULL; + free_netdev(host->netdev); + host->netdev = NULL; +out_unlock: + rtnl_unlock(); + vnet_host_free(port); +out_free_host: + vnet_port_put(port); + port = NULL; +out: + return port; +} Index: linux-2.6.21/drivers/s390/guest/vnet_port_host.h =================================================================== --- /dev/null +++ linux-2.6.21/drivers/s390/guest/vnet_port_host.h @@ -0,0 +1,18 @@ +/* + * Copyright (C) 2005 IBM Corporation + * Christian Borntraeger <cborntra-tA70FqPdS9bQT0dZR+AlfA@public.gmane.org> + * + */ + +#ifndef __VNET_PORTS_HOST_H +#define __VNET_PORTS_HOST_H + +#include <linux/netdevice.h> +#include "vnet_switch.h" + +struct vnet_host_port { + struct net_device_stats stats; + struct net_device *netdev; +}; +extern struct vnet_port * vnet_host_create(char *name); +#endif Index: linux-2.6.21/drivers/s390/guest/vnet_switch.c =================================================================== --- /dev/null +++ linux-2.6.21/drivers/s390/guest/vnet_switch.c @@ -0,0 +1,828 @@ +/* + * vnet zlswitch handling + * + * Copyright (C) 2005 IBM Corporation + * Author: Carsten Otte <cotte-tA70FqPdS9bQT0dZR+AlfA@public.gmane.org> + * Author: Christian Borntraeger <borntrae-tA70FqPdS9bQT0dZR+AlfA@public.gmane.org> + * + */ + +#include <linux/device.h> +#include <linux/etherdevice.h> +#include <linux/fs.h> +#include <linux/if.h> +#include <linux/if_ether.h> +#include <linux/kernel.h> +#include <linux/list.h> +#include <linux/miscdevice.h> +#include <linux/module.h> +#include 
<linux/netdevice.h> +#include <linux/rtnetlink.h> +#include <linux/pagemap.h> +#include <linux/spinlock.h> + +#include "vnet.h" +#include "vnet_port_guest.h" +#include "vnet_port_host.h" +#include "vnet_switch.h" + +#define NUM_MINORS 1024 + +/* devices housekeeping, creation & destruction */ +static LIST_HEAD(vnet_switches); +static rwlock_t vnet_switches_lock = RW_LOCK_UNLOCKED; +static struct class *zwitch_class; +static int vnet_major; +static struct device *root_dev; + + +/* The following functions allow ports of the switch to know about + * the MAC addresses of other ports. This is necessary for special + * hardware like OSA express which silently drops incoming packets + * that not match known MAC addresses and do not support promiscous + * mode as well. We have to register all guest MAC addresses at OSA + * make packet receive working */ + +/* Announces the own MAC address to all other ports + * this function is called if a new port is added */ +static void vnet_switch_add_mac(struct vnet_port *port) +{ + struct vnet_port *other_port; + + read_lock(&port->zs->ports_lock); + list_for_each_entry(other_port, &port->zs->switch_ports, lh) + if ((other_port != port) && (other_port->set_mac)) + other_port->set_mac(other_port,port->mac, 1); + read_unlock(&port->zs->ports_lock); +} + +/* Removes the own MAC address from all other ports + * this function is called if a port is detached*/ +static void vnet_switch_del_mac(struct vnet_port *port) +{ + struct vnet_port *other_port; + + read_lock(&port->zs->ports_lock); + list_for_each_entry(other_port, &port->zs->switch_ports, lh) + if (other_port->set_mac) + other_port->set_mac(other_port, port->mac, 0); + read_unlock(&port->zs->ports_lock); +} + +/* Learn MACs from other ports on the same zwitch and forward + * the MAC addresses to the set_mac function of the port.*/ +static void __vnet_port_learn_macs(struct vnet_port *port) +{ + struct vnet_port *other_port; + + if (!port->set_mac) + return; + 
list_for_each_entry(other_port, &port->zs->switch_ports, lh) + if (other_port != port) + port->set_mac(port, other_port->mac, 1); +} + +/* Unlearn MACS from other ports on the same zwitch */ +static void __vnet_port_unlearn_macs(struct vnet_port *port) +{ + struct vnet_port *other_port; + + if (!port->set_mac) + return; + list_for_each_entry(other_port, &port->zs->switch_ports, lh) + if (other_port != port) + port->set_mac(port, other_port->mac, 0); +} + + +static struct vnet_switch *__vnet_switch_by_minor(int minor) +{ + struct vnet_switch *zs; + + list_for_each_entry(zs, &vnet_switches, lh) { + if (MINOR(zs->cdev.dev) == minor) + return zs; + } + return NULL; +} + +static struct vnet_switch *__vnet_switch_by_name(char *name) +{ + struct vnet_switch *zs; + + list_for_each_entry(zs, &vnet_switches, lh) + if (strncmp(zs->name, name, ZWITCH_NAME_SIZE) == 0) + return zs; + return NULL; +} + +/* Returns a switch structure and increases the reference count. If no such + * switch exists a new one is created with reference count 1 */ +static struct vnet_switch *zwitch_get(int minor) +{ + struct vnet_switch *zs; + + read_lock(&vnet_switches_lock); + zs = __vnet_switch_by_minor(minor); + if (!zs) { + read_unlock(&vnet_switches_lock); + return zs; + } + get_device(&zs->dev); + read_unlock(&vnet_switches_lock); + return zs; +} + +/* reduces the reference count of the switch. */ +static void zwitch_put(struct vnet_switch * zs) +{ + put_device(&zs->dev); +} + +/* looks into the packet and searches a matching MAC address + * return NULL if unknown or broadcast */ +static struct vnet_port *__vnet_find_l2(struct vnet_switch *zs, char *data) +{ + //FIXME: make this a hash lookup, more macs per device? 
+ struct vnet_port *port; + + if (is_multicast_ether_addr(data)) + return NULL; + list_for_each_entry(port, &zs->switch_ports, lh) { + if (compare_ether_addr(port->mac, data)==0) + goto out; + } + port = NULL; + out: + return port; +} + +/* searches the destination for IP only interfaces. Normally routing + * is the way to go, but guests should see the net transparently without + * a hop in between*/ +static struct vnet_port *__vnet_find_l3(struct vnet_switch *zs, char *data) +{ + return NULL; +} + +static struct vnet_port * __vnet_find_destination(struct vnet_switch *zs, + char *data) +{ + switch (zs->linktype) { + case 2: + return __vnet_find_l2(zs, data); + case 3: + return __vnet_find_l3(zs, data); + default: + BUG(); + } +} + +/* copies len bytes of data from the memory specified by the list of + * pointers **from into the memory specified by the list of pointers **to + * with each pointer pointing to a page */ +static void +vnet_switch_page_copy(void **to, void **from, int len) +{ + int remaining=len; + int pageid = 0; + int amount; + + while(remaining) { + amount = min((int)PAGE_SIZE, remaining); + memcpy(to[pageid], from[pageid], amount); + pageid++; + remaining -= amount; + } +} + +/* copies to data into a buffer of destination + * returns 0 if ok*/ +static int +vnet_unicast(struct vnet_port *destination, void **from_data, int len, int proto) +{ + int pkid; + int buffer_status; + void **to_data; + struct vnet_control *control; + + control = destination->control; + spin_lock_bh(&destination->rxlock); + if (vnet_q_full(atomic_read(&control->s2pmit))) { + destination->rx_dropped++; + spin_unlock_bh(&destination->rxlock); + return -ENOBUFS; + } + pkid = __nextx(atomic_read(&control->s2pmit)); + to_data = destination->s2p_data[pkid]; + vnet_switch_page_copy(to_data, from_data, len); + control->s2pbufs[pkid].len = len; + control->s2pbufs[pkid].proto = proto; + buffer_status = vnet_tx_packet(&control->s2pmit); + spin_unlock_bh(&destination->rxlock); + if 
(buffer_status & QUEUE_WAS_EMPTY) + destination->interrupt(destination, VNET_IRQ_START_RX); + destination->rx_bytes += len; + destination->rx_packets++; + return 0; +} + +/* send packets to all ports and emulate broadcasts via unicasts*/ +static int vnet_allcast(struct vnet_port *from_port, void **fromdata, + int len, int proto) +{ + struct vnet_port *destination; + int failure = 0; + + list_for_each_entry(destination, &from_port->zs->switch_ports, lh) + if (destination != from_port) + failure |= vnet_unicast(destination, fromdata, + len, proto); + return failure; +} + +/* takes an incoming packet and forwards it to the right port + * if a failure occurs, increase the tx_dropped count of the sender*/ +static void vnet_switch_packet(struct vnet_port *from_port, + void **from_data, int len, int proto) +{ + struct vnet_port *destination; + int failure; + + read_lock(&from_port->zs->ports_lock); + destination = __vnet_find_destination(from_port->zs, from_data[0]); + /* we dont want to loop. 
FIXME: document when this can happen*/ + if (destination == from_port) { + read_unlock(&from_port->zs->ports_lock); + return; + } + if (destination) + failure = vnet_unicast(destination, from_data, len, proto); + else + failure = vnet_allcast(from_port, from_data, len, proto); + read_unlock(&from_port->zs->ports_lock); + if (failure) + from_port->tx_dropped++; + else { + from_port->tx_packets++; + from_port->tx_bytes += len; + } +} + +static void vnet_port_release(struct device *dev) +{ + struct vnet_port *port; + + port = container_of(dev, struct vnet_port, dev); + zwitch_put(port->zs); + kfree(port); +} + +static ssize_t vnet_port_read_mac(struct device *dev, + struct device_attribute *attr, + char *buf) +{ + struct vnet_port *port; + + port = container_of(dev, struct vnet_port, dev); + return sprintf(buf,"%02X:%02X:%02X:%02X:%02X:%02X", port->mac[0], + port->mac[1], port->mac[2], port->mac[3], + port->mac[4], port->mac[5]); +} + +static ssize_t vnet_port_read_tx_bytes(struct device *dev, + struct device_attribute *attr, + char *buf) +{ + struct vnet_port *port; + + port = container_of(dev, struct vnet_port, dev); + return sprintf(buf,"%lu", port->tx_bytes); +} + +static ssize_t vnet_port_read_rx_bytes(struct device *dev, + struct device_attribute *attr, + char *buf) +{ + struct vnet_port *port; + + port = container_of(dev, struct vnet_port, dev); + return sprintf(buf,"%lu", port->rx_bytes); +} + +static ssize_t vnet_port_read_tx_packets(struct device *dev, + struct device_attribute *attr, + char *buf) +{ + struct vnet_port *port; + + port = container_of(dev, struct vnet_port, dev); + return sprintf(buf,"%lu", port->tx_packets); +} + +static ssize_t vnet_port_read_rx_packets(struct device *dev, + struct device_attribute *attr, + char *buf) +{ + struct vnet_port *port; + + port = container_of(dev, struct vnet_port, dev); + return sprintf(buf,"%lu", port->rx_packets); +} + +static ssize_t vnet_port_read_tx_dropped(struct device *dev, + struct device_attribute 
*attr, + char *buf) +{ + struct vnet_port *port; + + port = container_of(dev, struct vnet_port, dev); + return sprintf(buf,"%lu", port->tx_dropped); +} + +static ssize_t vnet_port_read_rx_dropped(struct device *dev, + struct device_attribute *attr, + char *buf) +{ + struct vnet_port *port; + + port = container_of(dev, struct vnet_port, dev); + return sprintf(buf,"%lu", port->rx_dropped); +} + +static DEVICE_ATTR(mac, S_IRUSR, vnet_port_read_mac, NULL); +static DEVICE_ATTR(tx_bytes, S_IRUSR, vnet_port_read_tx_bytes, NULL); +static DEVICE_ATTR(rx_bytes, S_IRUSR, vnet_port_read_rx_bytes, NULL); +static DEVICE_ATTR(tx_packets, S_IRUSR, vnet_port_read_tx_packets, NULL); +static DEVICE_ATTR(rx_packets, S_IRUSR, vnet_port_read_rx_packets, NULL); +static DEVICE_ATTR(tx_dropped, S_IRUSR, vnet_port_read_tx_dropped, NULL); +static DEVICE_ATTR(rx_dropped, S_IRUSR, vnet_port_read_rx_dropped, NULL); + +static int vnet_port_attributes(struct device *dev) +{ + int rc; + rc = device_create_file(dev, &dev_attr_mac); + if (rc) + return rc; + rc = device_create_file(dev, &dev_attr_tx_dropped); + if (rc) + return rc; + rc = device_create_file(dev, &dev_attr_rx_dropped); + if (rc) + return rc; + rc = device_create_file(dev, &dev_attr_rx_bytes); + if (rc) + return rc; + rc = device_create_file(dev, &dev_attr_tx_bytes); + if (rc) + return rc; + rc = device_create_file(dev, &dev_attr_rx_packets); + if (rc) + return rc; + rc = device_create_file(dev, &dev_attr_tx_packets); + return rc; +} + + +//FIXME implement this +static int vnet_port_exists(struct vnet_switch *zs, char *name) +{ + read_lock(&zs->ports_lock); + read_unlock(&zs->ports_lock); + return 0; + +} + +static struct vnet_port *vnet_port_create(struct vnet_switch *zs, + char *name) +{ + struct vnet_port *port; + + if (vnet_port_exists(zs, name)) + return NULL; + + port = kzalloc(sizeof(*port), GFP_KERNEL); + if (port) { + spin_lock_init(&port->rxlock); + spin_lock_init(&port->txlock); + INIT_LIST_HEAD(&port->lh); + port->zs = zs; 
+ } else + return NULL; + port->dev.parent = &zs->dev; + port->dev.release = vnet_port_release; + strncpy(port->dev.bus_id, name, BUS_ID_SIZE); + if (device_register(&port->dev)) { + kfree(port); + return NULL; + } + if (vnet_port_attributes(&port->dev)) { + device_unregister(&port->dev); + kfree(port); + return NULL; + } + return port; +} + +/*------------------------ switch creation/Destruction/housekeeping---------*/ + +static void zwitch_destroy_ports(struct vnet_switch *zs) +{ + struct vnet_port *port, *tmp; + + list_for_each_entry_safe(port, tmp, &zs->switch_ports, lh) { + if (port->destroy) + port->destroy(port); + else + printk("No destroy function for port\n"); + } +} + + +static void zwitch_destroy(struct vnet_switch *zs) +{ + class_device_destroy(zwitch_class, zs->cdev.dev); + cdev_del(&zs->cdev); + device_unregister(&zs->dev); +} + +static void zwitch_release(struct device *dev) +{ + struct vnet_switch *zs; + + zs = container_of(dev, struct vnet_switch, dev); + kfree(zs); +} + +static int __zwitch_get_minor(void) +{ + int d, found; + struct vnet_switch *zs; + + for (d=0; d< NUM_MINORS; d++) { + found = 0; + list_for_each_entry(zs, &vnet_switches, lh) + if (MINOR(zs->cdev.dev) == d) + found++; + if (!found) break; + } + if (found) return -ENODEV; + return d; +} + +/* + * checks if this name already exists for a zwitch + */ +static int __zwitch_check_name(char *name) +{ + struct vnet_switch *zs; + + list_for_each_entry(zs, &vnet_switches, lh) + if (!strncmp(name, zs->name, ZWITCH_NAME_SIZE)) + return -EEXIST; + return 0; +} + +static int zwitch_create(char *name, int linktype) +{ + struct vnet_switch *zs; + int minor; + int ret; + + if ((linktype < 2) || (linktype > 3)) + return -EINVAL; + zs = kzalloc(sizeof(*zs), GFP_KERNEL); + if (!zs) { + printk("Creation of %s failed: out of memory\n", name); + return -ENOMEM; + } + zs->linktype = linktype; + strncpy(zs->name, name, ZWITCH_NAME_SIZE); + rwlock_init(&zs->ports_lock); + 
INIT_LIST_HEAD(&zs->switch_ports); + + write_lock(&vnet_switches_lock); + minor = __zwitch_get_minor(); + if (minor < 0) { + write_unlock(&vnet_switches_lock); + printk("Creation of %s failed: No free minor number\n", name); + kfree(zs); + return minor; + } + if (__zwitch_check_name(zs->name)) { + write_unlock(&vnet_switches_lock); + printk("Creation of %s failed: name exists\n", name); + kfree(zs); + return -EEXIST; + } + list_add_tail(&zs->lh, &vnet_switches); + write_unlock(&vnet_switches_lock); + strncpy(zs->dev.bus_id, name, min((int) strlen(name), + ZWITCH_NAME_SIZE)); + zs->dev.parent = root_dev; + zs->dev.release = zwitch_release; + ret = device_register(&zs->dev); + if (ret) { + write_lock(&vnet_switches_lock); + list_del(&zs->lh); + write_unlock(&vnet_switches_lock); + printk("Creation of %s failed: no device\n",name); + return ret; + } + vnet_cdev_init(&zs->cdev); + cdev_add(&zs->cdev, MKDEV(vnet_major, minor), 1); + zs->class_device = class_device_create(zwitch_class, NULL, + zs->cdev.dev, &zs->dev, name); + if (IS_ERR(zs->class_device)) { + cdev_del(&zs->cdev); + write_lock(&vnet_switches_lock); + list_del(&zs->lh); + write_unlock(&vnet_switches_lock); + printk("Creation of %s failed: no class_device\n", name); + device_unregister(&zs->dev); + return PTR_ERR(zs->class_device); + } + return 0; +} + + +static int zwitch_delete(char *name) +{ + struct vnet_switch *zs; + + write_lock(&vnet_switches_lock); + zs = __vnet_switch_by_name(name); + if (!zs) { + write_unlock(&vnet_switches_lock); + return -ENOENT; + } + list_del(&zs->lh); + write_unlock(&vnet_switches_lock); + zwitch_destroy_ports(zs); + zwitch_destroy(zs); + return 0; +} + +/* checks if a switch for the given minor exists + * if yes, create an unconnected port on this switch + * if no, return NULL */ +struct vnet_port *vnet_port_get(int minor, char *port_name) +{ + struct vnet_switch *zs; + struct vnet_port *port; + + zs = zwitch_get(minor); + if (!zs) + return NULL; + port = 
vnet_port_create(zs, port_name); + if (!port) + zwitch_put(zs); + return port; +} + +/* attaches the port to the switch. The port must be + * fully initialized, as it may get called immediately afterwards */ +void vnet_port_attach(struct vnet_port *port) +{ + write_lock_bh(&port->zs->ports_lock); + __vnet_port_learn_macs(port); + list_add(&port->lh, &port->zs->switch_ports); + write_unlock_bh(&port->zs->ports_lock); + vnet_switch_add_mac(port); + return; +} + +/* detaches the port from the switch. After that, + * no calls into the port are made */ +void vnet_port_detach(struct vnet_port *port) +{ + vnet_switch_del_mac(port); + write_lock_bh(&port->zs->ports_lock); + if (!list_empty(&port->lh)) + list_del(&port->lh); + __vnet_port_unlearn_macs(port); + write_unlock_bh(&port->zs->ports_lock); +} + +/* releases all ressources allocated with vnet_port_get */ +void vnet_port_put(struct vnet_port *port) +{ + BUG_ON(!list_empty(&port->lh) &&( port->lh.next != LIST_POISON1)); + device_unregister(&port->dev); +} + +/* tell the switch that new data is available */ +void vnet_port_rx(struct vnet_port *port) +{ + struct vnet_control *control; + int pkid, rc; + + control = port->control; + if (vnet_q_empty(atomic_read(&control->p2smit))) { + printk(KERN_WARNING "vnet_switch: Empty buffer" + "on interrupt\n"); + return; + } + do { + pkid = __nextr(atomic_read(&control->p2smit)); + /* fire and forget. 
Let the switch care about lost packets*/ + vnet_switch_packet(port, port->p2s_data[pkid], + control->p2sbufs[pkid].len, + control->p2sbufs[pkid].proto); + rc = vnet_rx_packet(&control->p2smit); + if (rc & QUEUE_WAS_FULL) { + port->interrupt(port, VNET_IRQ_START_TX); + } + } while (!(rc & QUEUE_IS_EMPTY)); + return; +} + +/* checks if the given address is locally attached to the switch*/ +int vnet_address_is_local(struct vnet_switch *zs, char *address) +{ + struct vnet_port *port; + + read_lock(&zs->ports_lock); + port = __vnet_find_destination(zs, address); + read_unlock(&zs->ports_lock); + return (port != NULL); +} + + +int vnet_minor_by_name(char *name) +{ + struct vnet_switch *zs; + int ret; + + read_lock(&vnet_switches_lock); + zs = __vnet_switch_by_name(name); + if (zs) + ret = MINOR(zs->cdev.dev); + else + ret = -ENODEV; + read_unlock(&vnet_switches_lock); + return ret; +} + +static void vnet_root_release(struct device *dev) +{ + kfree(dev); +} + + +struct command { + char *string1; + char *string2; +}; + +/*FIXME this is ugly. Dont worry: as soon as we have finalized the interface, + this crap is going away. 
Still, it works.......*/ +static long vnet_control_ioctl(struct file *f, unsigned int command, + unsigned long data) +{ + char string1[BUS_ID_SIZE]; + char string2[BUS_ID_SIZE]; + struct command com; + struct vnet_port *port; + + if (!capable(CAP_NET_ADMIN)) + return -EPERM; + if (copy_from_user(&com, (__user struct command*) data, sizeof(struct command))) + return -EFAULT; + if (copy_from_user(string1, (__user char *) com.string1, ZWITCH_NAME_SIZE)) + return -EFAULT; + if (command >=2) + if (copy_from_user(string2, (__user char *) com.string2, ZWITCH_NAME_SIZE)) + return -EFAULT; + if (strnlen(string1, ZWITCH_NAME_SIZE) == ZWITCH_NAME_SIZE) + return -EINVAL; + switch(command) { + case ADD_SWITCH: + return zwitch_create(string1,3); + case DEL_SWITCH: + return zwitch_delete(string1); + case ADD_HOST: + port = vnet_host_create(string1); + if (port) { + vnet_port_attach(port); + return 0; + } else + return -ENODEV; + default: + return -EINVAL; + } + return 0; +} + +static int vnet_control_open(struct inode *inode, struct file *file) +{ + return 0; +} + +static int vnet_control_release(struct inode *inode, struct file *file) +{ + return 0; +} + +struct file_operations vnet_control_fops = { + .open = vnet_control_open, + .release = vnet_control_release, + .unlocked_ioctl = &vnet_control_ioctl, + .compat_ioctl = &vnet_control_ioctl, +}; + +struct miscdevice vnet_control_device = { + .minor = MISC_DYNAMIC_MINOR, + .name = "vnet", + .fops = &vnet_control_fops, +}; + +int vnet_register_control_device(void) +{ + return misc_register(&vnet_control_device); +} + +int __init vnet_switch_init(void) +{ + int ret; + dev_t dev; + + zwitch_class = class_create(THIS_MODULE, "vnet"); + if (IS_ERR(zwitch_class)) { + printk(KERN_ERR "vnet_switch: class_create failed!\n"); + ret = PTR_ERR(zwitch_class); + goto out; + } + ret = alloc_chrdev_region(&dev, 0, NUM_MINORS, "vnet"); + if (ret) { + printk(KERN_ERR "vnet_switch: alloc_chrdev_region failed\n"); + goto out_class; + } + vnet_major = 
MAJOR(dev); + root_dev = kzalloc(sizeof(*root_dev), GFP_KERNEL); + if (!root_dev) { + printk(KERN_ERR "vnet_switch:allocation of device failed\n"); + ret = -ENOMEM; + goto out_chrdev; + } + strncpy(root_dev->bus_id, "vnet", 5); + root_dev->release = vnet_root_release; + ret =device_register(root_dev); + if (ret) { + printk(KERN_ERR "vnet_switch: could not register device\n"); + kfree(root_dev); + goto out_chrdev; + } + ret = vnet_register_control_device(); + if (ret) { + printk("vnet_switch: could not create control device\n"); + goto out_dev; + } + printk ("vnet_switch loaded\n"); +/* FIXME ---------- remove these static defines as soon as everyone has the + * user tools */ + { + struct vnet_port *port; + zwitch_create("myswitch0",2); + zwitch_create("myswitch1",3); + + port = vnet_host_create("myswitch0"); + if (port) + vnet_port_attach(port); + port = vnet_host_create("myswitch1"); + if (port) + vnet_port_attach(port); + } +/*-----------------------------------------------------------*/ + return 0; +out_dev: + device_unregister(root_dev); +out_chrdev: + unregister_chrdev_region(MKDEV(vnet_major,0), NUM_MINORS); +out_class: + class_destroy(zwitch_class); +out: + return ret; +} + +/* remove all existing vnet_zwitches in the system and unregister the + * character device from the system */ +void vnet_switch_exit(void) +{ + struct vnet_switch *zs, *tmp; + list_for_each_entry_safe(zs, tmp, &vnet_switches, lh) { + zwitch_destroy_ports(zs); + zwitch_destroy(zs); + } + device_unregister(root_dev); + misc_deregister(&vnet_control_device); + unregister_chrdev_region(MKDEV(vnet_major,0), NUM_MINORS); + class_destroy(zwitch_class); + printk ("vnet_switch unloaded\n"); +} + +module_init(vnet_switch_init); +module_exit(vnet_switch_exit); +MODULE_DESCRIPTION("VNET: Virtual switch for vnet interfaces"); +MODULE_AUTHOR("Christian Borntraeger <borntrae-tA70FqPdS9bQT0dZR+AlfA@public.gmane.org>"); +MODULE_LICENSE("GPL"); Index: linux-2.6.21/drivers/s390/guest/vnet_switch.h 
=================================================================== --- /dev/null +++ linux-2.6.21/drivers/s390/guest/vnet_switch.h @@ -0,0 +1,119 @@ +/* + * vnet_switch - zlive insular communication knack switch + * infrastructure for virtual switching of Linux guests running under Linux + * + * Copyright (C) 2005 IBM Corporation + * Author: Carsten Otte <cotte-tA70FqPdS9bQT0dZR+AlfA@public.gmane.org> + * Christian Borntraeger <borntrae-tA70FqPdS9bQT0dZR+AlfA@public.gmane.org> + * + */ + +#ifndef __VNET_SWITCH_H +#define __VNET_SWITCH_H + +#include <linux/cdev.h> +#include <linux/device.h> +#include <linux/if_ether.h> +#include <linux/spinlock.h> + +#include "vnet.h" + +/* defines for IOCTLs. interface should be replaced by something better */ +#define ADD_SWITCH 0 +#define DEL_SWITCH 1 +#define ADD_OSA 2 +#define DEL_OSA 3 +#define ADD_HOST 4 +#define DEL_HOST 5 + +/* min(IFNAMSIZ, BUS_ID_SIZE)*/ +#define ZWITCH_NAME_SIZE 16 + +/* This structure describes a virtual switch for ports to userspace network + * interfaces, e.g. 
in Linux under Linux environments*/ +struct vnet_switch { + struct list_head lh; + char name[ZWITCH_NAME_SIZE]; + struct list_head switch_ports; /* list of ports */ + rwlock_t ports_lock; /* lock for switch_ports */ + struct class_device *class_device; + struct cdev cdev; + struct device dev; + struct vnet_port *osa; + int linktype; /* 2=ethernet 3=IP */ +}; + +/* description of a port of the vnet_switch */ +struct vnet_port { + struct list_head lh; + struct vnet_switch *zs; + struct vnet_control *control; + void *s2p_data[VNET_QUEUE_LEN][(VNET_BUFFER_SIZE>>PAGE_SHIFT)]; + void *p2s_data[VNET_QUEUE_LEN][(VNET_BUFFER_SIZE>>PAGE_SHIFT)]; + char mac[ETH_ALEN]; + void *priv; + int (*set_mac) (struct vnet_port *port, char mac[ETH_ALEN], int add); + void (*interrupt) (struct vnet_port *port, int type); + void (*destroy) (struct vnet_port *port); + struct device dev; + unsigned long rx_packets; /* total packets received */ + unsigned long tx_packets; /* total packets transmitted */ + unsigned long rx_bytes; /* total bytes received */ + unsigned long tx_bytes; /* total bytes transmitted */ + unsigned long rx_dropped; /* no space in receive buffer */ + unsigned long tx_dropped; /* no space in destination buffer */ + spinlock_t rxlock; + spinlock_t txlock; +}; + + +static inline int +vnet_copy_buf_to_pages(void **data, char *buf, int len) +{ + int i; + + if (len == 0) + return 0; + for (i=0; i <= ((len - 1) >> PAGE_SHIFT); i++ ) + memcpy(data[i], buf + i*PAGE_SIZE, min(PAGE_SIZE, len - i*PAGE_SIZE)); + return len; +} + +static inline int +vnet_copy_pages_to_buf(char *buf, void **data, int len) +{ + int i; + + if (len == 0) + return 0; + for (i=0; i <= ((len -1) >> PAGE_SHIFT); i++ ) + memcpy(buf + i*PAGE_SIZE, data[i], min(PAGE_SIZE, len - i*PAGE_SIZE)); + return len; +} + + +/* checks if a switch with the given minor exists + * if yes, create a named and unconnected port on + * this switch with the given name. 
if no, return NULL */ +extern struct vnet_port *vnet_port_get(int minor, char *port_name); + +/* attaches the port to the switch. The port must be + * fully initialized, as it may get data immediately afterwards */ +extern void vnet_port_attach(struct vnet_port *port); + +/* detaches the port from the switch. After that, + * no calls into the port are made */ +extern void vnet_port_detach(struct vnet_port *port); + +/* releases all ressources allocated with vnet_port_get */ +extern void vnet_port_put(struct vnet_port *port); + +/* tell the switch that new data is available */ +extern void vnet_port_rx(struct vnet_port *port); + +/* get the minor for a given name */ +extern int vnet_minor_by_name(char *name); + +/* checks if the given address is locally attached to the switch*/ +extern int vnet_address_is_local(struct vnet_switch *zs, char *address); +#endif Index: linux-2.6.21/drivers/s390/guest/Makefile =================================================================== --- linux-2.6.21.orig/drivers/s390/guest/Makefile +++ linux-2.6.21/drivers/s390/guest/Makefile @@ -6,3 +6,6 @@ obj-$(CONFIG_GUEST_CONSOLE) += guest_con obj-$(CONFIG_S390_GUEST) += vdev.o vdev_device.o obj-$(CONFIG_VDISK) += vdisk.o vdisk_blk.o obj-$(CONFIG_VNET_GUEST) += vnet_guest.o +vnet_host-objs := vnet_switch.o vnet_port_guest.o vnet_port_host.o +obj-$(CONFIG_VNET_HOST) += vnet_host.o + Index: linux-2.6.21/drivers/s390/net/Kconfig =================================================================== --- linux-2.6.21.orig/drivers/s390/net/Kconfig +++ linux-2.6.21/drivers/s390/net/Kconfig @@ -95,4 +95,16 @@ config VNET_GUEST connection. If you're not using host/guest support, say N. +config VNET_HOST + tristate "virtual networking support (HOST)" + depends on QETH && S390_HOST + help + This is the host part of the vnet guest network connection. + Say Y if you plan to host guests with network + connection. 
The host part consists of a virtual switch + a host device as well as a connection to the qeth + driver. + If you're not using this kernel for hosting guest, say N. + + endmenu
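[Editorial note: the transmit and poll paths in the patch above drive the shared ring buffers entirely through helpers — __nextx, __nextr, vnet_q_full, vnet_q_empty, vnet_tx_packet, vnet_rx_packet — whose definitions live in vnet.h and are not part of this chunk. The following is a hypothetical userspace sketch of the single-producer/single-consumer index-and-status contract those call sites imply; the queue depth, flag values, struct name, and packing of both indices into one word are assumptions, not taken from the patch.]

```c
#include <assert.h>

#define QLEN 8			/* assumed depth; real VNET_QUEUE_LEN unknown */
#define QUEUE_WAS_EMPTY	0x1	/* flag values are illustrative only */
#define QUEUE_IS_FULL	0x2
#define QUEUE_WAS_FULL	0x4
#define QUEUE_IS_EMPTY	0x8

/*
 * The patch reads one atomic word (p2smit/s2pmit) and derives everything
 * from it, which suggests free-running producer/consumer indices packed
 * together. Here they are plain fields; QLEN should be a power of two so
 * the modulo stays correct across unsigned wraparound.
 */
struct mit { unsigned head, tail; };

static unsigned q_count(struct mit *m) { return m->tail - m->head; }
static int q_full(struct mit *m)  { return q_count(m) == QLEN; }
static int q_empty(struct mit *m) { return q_count(m) == 0; }

/* slot the producer fills next (__nextx analogue) */
static int next_tx(struct mit *m) { return m->tail % QLEN; }
/* slot the consumer drains next (__nextr analogue) */
static int next_rx(struct mit *m) { return m->head % QLEN; }

/* vnet_tx_packet analogue: publish one slot, report edge transitions.
 * QUEUE_WAS_EMPTY tells the caller to kick the peer (vnet_port_rx). */
static int tx_packet(struct mit *m)
{
	int st = 0;
	if (q_empty(m))
		st |= QUEUE_WAS_EMPTY;
	m->tail++;
	if (q_full(m))
		st |= QUEUE_IS_FULL;	/* caller stops its netdev queue */
	return st;
}

/* vnet_rx_packet analogue: consume one slot, report edge transitions.
 * QUEUE_WAS_FULL tells the consumer to wake the stalled producer. */
static int rx_packet(struct mit *m)
{
	int st = 0;
	if (q_full(m))
		st |= QUEUE_WAS_FULL;
	m->head++;
	if (q_empty(m))
		st |= QUEUE_IS_EMPTY;	/* poll loop exits via its empty: label */
	return st;
}
```

This also makes the "we might have raced against the wakeup" re-checks in vnet_net_xmit and the poll functions legible: the edge flags are computed from a snapshot, so after stopping or completing a queue the code must re-test fullness/emptiness under the lock to catch a transition that raced in between.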
* Re: [PATCH/RFC 8/9] Virtual network host switch support [not found] ` <1178904968.25135.35.camel-WIxn4w2hgUz3YA32ykw5MLlKpX0K8NHHQQ4Iyu8u01E@public.gmane.org> @ 2007-05-11 20:21 ` Anthony Liguori [not found] ` <4644D048.7060106-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org> 0 siblings, 1 reply; 104+ messages in thread From: Anthony Liguori @ 2007-05-11 20:21 UTC (permalink / raw) To: Carsten Otte Cc: kvm-devel-5NWGOfrQmneRv+LV9MX5uipxlwaOVQ5f@public.gmane.org, Christian Borntraeger, Martin Schwidefsky Carsten Otte wrote: > From: Christian Borntraeger <cborntra-tA70FqPdS9bQT0dZR+AlfA@public.gmane.org> > > This is the host counterpart for the virtual network device driver. This driver > has an char device node where the hypervisor can attach. It also > has a kind of dumb switch that passes packets between guests. Last but not least > it contains a host network interface. Patches for attaching other host network > devices to the switch via raw sockets, extensions to qeth or netfilter are > Any feel for the performance relative to the bridging code? The bridging code is a pretty big bottle neck in guest=>guest communications in Xen at least. > currently tested but not ready yet. We did not use the linux bridging code to > allow non-root users to create virtual networks between guests. > Is that the primary reason? If so, that seems like a rather large hammer for something that a userspace suid wrapper could have addressed... 
Regards, Anthony Liguori > Signed-off-by: Christian Borntraeger <cborntra-tA70FqPdS9bQT0dZR+AlfA@public.gmane.org> > Signed-off-by: Carsten Otte <cotte-tA70FqPdS9bQT0dZR+AlfA@public.gmane.org> > > --- > drivers/s390/guest/Makefile | 3 > drivers/s390/guest/vnet_port_guest.c | 302 ++++++++++++ > drivers/s390/guest/vnet_port_guest.h | 21 > drivers/s390/guest/vnet_port_host.c | 418 +++++++++++++++++ > drivers/s390/guest/vnet_port_host.h | 18 > drivers/s390/guest/vnet_switch.c | 828 +++++++++++++++++++++++++++++++++++ > drivers/s390/guest/vnet_switch.h | 119 +++++ > drivers/s390/net/Kconfig | 12 > 8 files changed, 1721 insertions(+) > > Index: linux-2.6.21/drivers/s390/guest/vnet_port_guest.c > =================================================================== > --- /dev/null > +++ linux-2.6.21/drivers/s390/guest/vnet_port_guest.c > @@ -0,0 +1,302 @@ > +/* > + * Copyright (C) 2005 IBM Corporation > + * Authors: Carsten Otte <cotte-tA70FqPdS9bQT0dZR+AlfA@public.gmane.org> > + * Christian Borntraeger <borntrae-tA70FqPdS9bQT0dZR+AlfA@public.gmane.org> > + * > + */ > +#include <linux/etherdevice.h> > +#include <linux/fs.h> > +#include <linux/kernel.h> > +#include <linux/list.h> > +#include <linux/module.h> > +#include <linux/pagemap.h> > +#include <linux/poll.h> > +#include <linux/spinlock.h> > + > +#include "vnet.h" > +#include "vnet_port_guest.h" > +#include "vnet_switch.h" > + > +static void COFIXME_add_irq(struct vnet_guest_port *zgp, int data) > +{ > + int oldval, newval; > + > + do { > + oldval = atomic_read(&zgp->pending_irqs); > + newval = oldval | data; > + } while (atomic_cmpxchg(&zgp->pending_irqs, oldval, newval) != oldval); > +} > + > +static int COFIXME_get_irq(struct vnet_guest_port *zgp) > +{ > + int oldval; > + > + do { > + oldval = atomic_read(&zgp->pending_irqs); > + } while (atomic_cmpxchg(&zgp->pending_irqs, oldval, 0) != oldval); > + > + return oldval; > +} > + > +static void > +vnet_guest_interrupt(struct vnet_port *port, int type) > +{ > + struct 
vnet_guest_port *priv; > + > + priv = port->priv; > + > + if (!priv->fasync) { > + printk (KERN_WARNING "vnet: cannot send interrupt," > + "fd not async\n"); > + return; > + } > + switch (type) { > + case VNET_IRQ_START_RX: > + COFIXME_add_irq(priv, POLLIN); > + kill_fasync(&priv->fasync, SIGIO, POLL_IN); > + break; > + case VNET_IRQ_START_TX: > + COFIXME_add_irq(priv, POLLOUT); > + kill_fasync(&priv->fasync, SIGIO, POLL_OUT); > + break; > + default: > + BUG(); > + } > +} > + > +/* release all pinned user pages*/ > +static void > +vnet_guest_release_pages(struct vnet_port *port) > +{ > + int i,j; > + > + for (i=0; i<VNET_QUEUE_LEN; i++) > + for (j=0; j<VNET_BUFFER_PAGES; j++) { > + if (port->s2p_data[i][j]) { > + page_cache_release(virt_to_page(port->s2p_data[i][j])); > + port->s2p_data[i][j] = NULL; > + } > + if (port->p2s_data[i][j]) { > + page_cache_release(virt_to_page(port->p2s_data[i][j])); > + port->p2s_data[i][j] = NULL; > + } > + } > + if (port->control) { > + page_cache_release(virt_to_page(port->control)); > + port->control = NULL; > + } > +} > + > +static int > +vnet_chr_open(struct inode *ino, struct file *filp) > +{ > + int minor; > + struct vnet_port *port; > + char name[BUS_ID_SIZE]; > + > + minor = iminor(filp->f_dentry->d_inode); > + snprintf(name, BUS_ID_SIZE, "guest:%d", current->pid); > + port = vnet_port_get(minor, name); > + if (!port) > + return -ENODEV; > + port->priv = kzalloc(sizeof(struct vnet_guest_port), GFP_KERNEL); > + if (!port->priv) { > + vnet_port_put(port); > + return -ENOMEM; > + } > + port->interrupt = vnet_guest_interrupt; > + filp->private_data = port; > + return nonseekable_open(ino, filp); > +} > + > +static int > +vnet_chr_release (struct inode *ino, struct file *filp) > +{ > + struct vnet_port *port; > + port = (struct vnet_port *) filp->private_data; > + > +//FIXME: what about open close? We unregister non exisiting mac addresses > +// in vnet_port_detach! 
> + vnet_port_detach(port); > + vnet_guest_release_pages(port); > + vnet_port_put(port); > + return 0; > +} > + > + > +/* helper function which maps a user page into the kernel > + * the memory must be free with page_cache_release */ > +static void *user_to_kernel(char __user *user) > +{ > + struct page *temp_page; > + int rc; > + > + BUG_ON(((unsigned long) user) % PAGE_SIZE); > + rc = fault_in_pages_writeable(user, PAGE_SIZE); > + if (rc) > + return NULL; > + rc = get_user_pages(current, current->mm, (unsigned long) user, > + 1, 1, 1, &temp_page, NULL); > + if (rc != 1) > + return NULL; > + return page_address(temp_page); > +} > + > +/* this function pins the userspace buffers into memory*/ > +static int > +vnet_guest_alloc_pages(struct vnet_port *port) > +{ > + int i,j; > + > + down_read(¤t->mm->mmap_sem); > + for (i=0; i<VNET_QUEUE_LEN; i++) > + for (j=0; j<VNET_BUFFER_PAGES; j++) { > + port->s2p_data[i][j] = user_to_kernel(port->control-> > + s2pbufs[i].data + j*PAGE_SIZE); > + if (!port->s2p_data[i][j]) > + goto cleanup; > + port->p2s_data[i][j] = user_to_kernel(port->control-> > + p2sbufs[i].data + j*PAGE_SIZE); > + if (!port->p2s_data[i][j]) > + goto cleanup; > + > + } > + up_read(¤t->mm->mmap_sem); > + return 0; > +cleanup: > + up_read(¤t->mm->mmap_sem); > + vnet_guest_release_pages(port); > + return -ENOMEM; > +} > + > +/* userspace control data structure stuff */ > +static int > +vnet_register_control(struct vnet_port *port, unsigned long user_addr) > +{ > + u64 uaddr; > + int rc; > + struct page *control_page; > + > + rc = copy_from_user(&uaddr, (void __user *) user_addr, sizeof(uaddr)); > + if (rc) > + return -EFAULT; > + if (uaddr % PAGE_SIZE) > + return -EFAULT; > + down_read(¤t->mm->mmap_sem); > + rc = get_user_pages(current, current->mm, (unsigned long)uaddr, > + 1, 1, 1, &control_page, NULL); > + up_read(¤t->mm->mmap_sem); > + if (rc!=1) > + return -EFAULT; > + port->control = (struct vnet_control *) page_address(control_page); > + rc = 
vnet_guest_alloc_pages(port); > + if (rc) { > + printk("vnet: could not get buffers\n"); > + return rc; > + } > + random_ether_addr(port->mac); > + memcpy(port->control->mac, port->mac,6); > + vnet_port_attach(port); > + return 0; > +} > + > +static int > +vnet_interrupt(struct vnet_port *port, int __user *u_type) > +{ > + int type, rc; > + > + rc = copy_from_user (&type, u_type, sizeof(int)); > + if (rc) > + return -EFAULT; > + switch (type) { > + case VNET_IRQ_START_RX: > + vnet_port_rx(port); > + break; > + case VNET_IRQ_START_TX: /* noop with current drop packet approach*/ > + break; > + default: > + printk(KERN_ERR "vnet: Unknown interrupt type %d\n", type); > + rc = -EINVAL; > + } > + return rc; > +} > + > + > + > + > +//this is a HACK. >>COFIXME<< > +unsigned int > +vnet_poll(struct file *filp, poll_table * wait) > +{ > + struct vnet_port *port; > + struct vnet_guest_port *zgp; > + > + port = filp->private_data; > + zgp = port->priv; > + return COFIXME_get_irq(zgp); > +} > + > +static int vnet_fill_info(struct vnet_port *zp, void __user *data) > +{ > + struct vnet_info info; > + > + info.linktype = zp->zs->linktype; > + info.maxmtu=32768; //FIXME > + return copy_to_user(data, &info, sizeof(info)); > +} > +long > +vnet_ioctl(struct file *filp, unsigned int no, unsigned long data) > +{ > + struct vnet_port *port = > + (struct vnet_port *) filp->private_data; > + int rc; > + > + switch (no) { > + case VNET_REGISTER_CTL: > + rc = vnet_register_control(port, data); > + break; > + case VNET_INTERRUPT: > + rc = vnet_interrupt(port, (int __user *) data); > + break; > + case VNET_INFO: > + rc = vnet_fill_info(port, (void __user *) data); > + break; > + default: > + rc = -ENOTTY; > + } > + return rc; > +} > + > +int vnet_fasync(int fd, struct file *filp, int on) > +{ > + struct vnet_port *port; > + struct vnet_guest_port *zgp; > + int rc; > + > + port = filp->private_data; > + zgp = port->priv; > + > + if ((rc = fasync_helper(fd, filp, on, &zgp->fasync)) < 0) > + 
return rc; > + > + if (on) > + rc = f_setown(filp, current->pid, 0); > + return rc; > +} > + > + > +static struct file_operations vnet_char_fops = { > + .owner = THIS_MODULE, > + .open = vnet_chr_open, > + .release = vnet_chr_release, > + .unlocked_ioctl = vnet_ioctl, > + .fasync = vnet_fasync, > + .poll = vnet_poll, > +}; > + > + > + > +void vnet_cdev_init(struct cdev *cdev) > +{ > + cdev_init(cdev, &vnet_char_fops); > +} > Index: linux-2.6.21/drivers/s390/guest/vnet_port_guest.h > =================================================================== > --- /dev/null > +++ linux-2.6.21/drivers/s390/guest/vnet_port_guest.h > @@ -0,0 +1,21 @@ > +/* > + * Copyright (C) 2005 IBM Corporation > + * Authors: Carsten Otte <cotte-tA70FqPdS9bQT0dZR+AlfA@public.gmane.org> > + * Christian Borntraeger <cborntra-tA70FqPdS9bQT0dZR+AlfA@public.gmane.org> > + * > + */ > + > +#ifndef __VNET_PORTS_GUEST_H > +#define __VNET_PORTS_GUEST_H > + > +#include <linux/fs.h> > +#include <linux/cdev.h> > +#include <asm/atomic.h> > + > +struct vnet_guest_port { > + struct fasync_struct *fasync; > + atomic_t pending_irqs; > +}; > + > +extern void vnet_cdev_init(struct cdev *cdev); > +#endif > Index: linux-2.6.21/drivers/s390/guest/vnet_port_host.c > =================================================================== > --- /dev/null > +++ linux-2.6.21/drivers/s390/guest/vnet_port_host.c > @@ -0,0 +1,418 @@ > +/* > + * vnet zlswitch handling > + * > + * Copyright (C) 2005 IBM Corporation > + * Authors: Carsten Otte <cotte-tA70FqPdS9bQT0dZR+AlfA@public.gmane.org> > + * Christian Borntraeger <borntrae-tA70FqPdS9bQT0dZR+AlfA@public.gmane.org> > + * > + */ > + > +#include <linux/etherdevice.h> > +#include <linux/if.h> > +#include <linux/if_ether.h> > +#include <linux/if_arp.h> > +#include <linux/kernel.h> > +#include <linux/list.h> > +#include <linux/module.h> > +#include <linux/netdevice.h> > +#include <linux/rtnetlink.h> > +#include <linux/pagemap.h> > +#include <linux/spinlock.h> > + > +#include 
"vnet.h" > +#include "vnet_switch.h" > +#include "vnet_port_host.h" > + > +static void > +vnet_host_interrupt(struct vnet_port *zp, int type) > +{ > + struct vnet_host_port *zhp; > + > + zhp = zp->priv; > + > + BUG_ON(!zhp->netdev); > + > + switch (type) { > + case VNET_IRQ_START_RX: > + netif_rx_schedule(zhp->netdev); > + break; > + case VNET_IRQ_START_TX: > + netif_wake_queue(zhp->netdev); > + break; > + default: > + BUG(); > + } > + /* we are called via system call path. enforce softirq handling */ > + do_softirq(); > +} > + > +static void > +vnet_host_free(struct vnet_port *zp) > +{ > + int i,j; > + > + for (i=0; i<VNET_QUEUE_LEN; i++) > + for (j=0; j<VNET_BUFFER_PAGES; j++) { > + if (zp->s2p_data[i][j]) { > + free_page((unsigned long) zp->s2p_data[i][j]); > + zp->s2p_data[i][j] = NULL; > + } > + if (zp->p2s_data[i][j]) { > + free_page((unsigned long) zp->p2s_data[i][j]); > + zp->p2s_data[i][j] = NULL; > + } > + } > + if (zp->control) { > + kfree(zp->control); > + zp->control = NULL; > + } > +} > + > +static int > +vnet_port_hostsetup(struct vnet_port *zp) > +{ > + int i,j; > + > + zp->control = kzalloc(sizeof(*zp->control), GFP_KERNEL); > + if (!zp->control) > + return -ENOMEM; > + for (i=0; i<VNET_QUEUE_LEN; i++) > + for (j=0; j<VNET_BUFFER_PAGES; j++) { > + zp->s2p_data[i][j] = (void *) __get_free_pages(GFP_KERNEL,0); > + if (!zp->s2p_data[i][j]) > + goto oom; > + zp->p2s_data[i][j] = (void *) __get_free_pages(GFP_KERNEL,0); > + if (!zp->p2s_data[i][j]) { > + free_page((unsigned long) zp->s2p_data[i][j]); > + goto oom; > + } > + } > + zp->control->buffer_size = VNET_BUFFER_SIZE; > + return 0; > +oom: > + printk(KERN_WARNING "vnet: No memory for buffer space of host device\n"); > + vnet_host_free(zp); > + return -ENOMEM; > +} > + > +/* host interface specific parts */ > + > + > +static int > +vnet_net_open(struct net_device *dev) > +{ > + struct vnet_port *port; > + struct vnet_control *control; > + > + port = dev->priv; > + control = port->control; > + 
atomic_set(&control->s2pmit, 0); > + netif_start_queue(dev); > + return 0; > +} > + > +static int > +vnet_net_stop(struct net_device *dev) > +{ > + netif_stop_queue(dev); > + return 0; > +} > + > +static void vnet_net_tx_timeout(struct net_device *dev) > +{ > + struct vnet_port *port = dev->priv; > + struct vnet_control *control = port->control; > + > + printk(KERN_ERR "problems in xmit for device %s\n Resetting...\n", > + dev->name); > + atomic_set(&control->p2smit, 0); > + atomic_set(&control->s2pmit, 0); > + vnet_port_rx(port); > + netif_wake_queue(dev); > +} > + > + > +static int > +vnet_net_xmit(struct sk_buff *skb, struct net_device *dev) > +{ > + struct vnet_port *zhost; > + struct vnet_host_port *zhp; > + struct vnet_control *control; > + struct xmit_buffer *buf; > + int buffer_status; > + int pkid; > + > + zhost = dev->priv; > + zhp = zhost->priv; > + control = zhost->control; > + > + if (!spin_trylock(&zhost->txlock)) > + return NETDEV_TX_LOCKED; > + if (vnet_q_full(atomic_read(&control->p2smit))) { > + netif_stop_queue(dev); > + goto full; > + } > + pkid = __nextx(atomic_read(&control->p2smit)); > + buf = &control->p2sbufs[pkid]; > + buf->len = skb->len; > + buf->proto = skb->protocol; > + vnet_copy_buf_to_pages(zhost->p2s_data[pkid], skb->data, skb->len); > + buffer_status = vnet_tx_packet(&control->p2smit); > + spin_unlock(&zhost->txlock); > + zhp->stats.tx_packets++; > + zhp->stats.tx_bytes += skb->len; > + dev_kfree_skb(skb); > + dev->trans_start = jiffies; > + if (buffer_status & QUEUE_WAS_EMPTY) > + vnet_port_rx(zhost); > + if (buffer_status & QUEUE_IS_FULL) { > + netif_stop_queue(dev); > + spin_lock(&zhost->txlock); > + } else > + return NETDEV_TX_OK; > +full: > + /* we might have raced against the wakeup */ > + if (!vnet_q_full(atomic_read(&control->p2smit))) > + netif_start_queue(dev); > + spin_unlock(&zhost->txlock); > + return NETDEV_TX_OK; > +} > + > +static int > +vnet_l3_poll(struct net_device *dev, int *budget) > +{ > + struct vnet_port 
*zp = dev->priv; > + struct vnet_host_port *zhp = zp->priv; > + struct vnet_control *control = zp->control; > + struct xmit_buffer *buf; > + struct sk_buff *skb; > + int pkid, count, numpackets = min(64, min(dev->quota, *budget)); > + int buffer_status; > + > + if (vnet_q_empty(atomic_read(&control->s2pmit))) { > + count = 0; > + goto empty; > + } > +loop: > + count = 0; > + while(numpackets) { > + pkid = __nextr(atomic_read(&control->s2pmit)); > + buf = &control->s2pbufs[pkid]; > + skb = dev_alloc_skb(buf->len + 2); > + if (likely(skb)) { > + skb_reserve(skb, 2); > + vnet_copy_pages_to_buf(skb_put(skb, buf->len), > + zp->s2p_data[pkid], buf->len); > + skb->dev = dev; > + skb->protocol = buf->proto; > +// skb->ip_summed = CHECKSUM_UNNECESSARY; > + zhp->stats.rx_packets++; > + zhp->stats.rx_bytes += buf->len; > + netif_receive_skb(skb); > + numpackets--; > + (*budget)--; > + dev->quota--; > + count++; > + } else > + zhp->stats.rx_dropped++; > + buffer_status = vnet_rx_packet(&control->s2pmit); > + if (buffer_status & QUEUE_IS_EMPTY) > + goto empty; > + } > + return 1; //please ask us again > +empty: > + netif_rx_complete(dev); > + /* we might have raced against a wakup*/ > + if (!vnet_q_empty(atomic_read(&control->s2pmit))) { > + if (netif_rx_reschedule(dev, count)) > + goto loop; > + } > + return 0; > +} > + > + > +static int > +vnet_l2_poll(struct net_device *dev, int *budget) > +{ > + struct vnet_port *zp = dev->priv; > + struct vnet_host_port *zhp = zp->priv; > + struct vnet_control *control = zp->control; > + struct xmit_buffer *buf; > + struct sk_buff *skb; > + int pkid, count, numpackets = min(64, min(dev->quota, *budget)); > + int buffer_status; > + > + if (vnet_q_empty(atomic_read(&control->s2pmit))) { > + count = 0; > + goto empty; > + } > +loop: > + count = 0; > + while(numpackets) { > + pkid = __nextr(atomic_read(&control->s2pmit)); > + buf = &control->s2pbufs[pkid]; > + skb = dev_alloc_skb(buf->len + 2); > + if (likely(skb)) { > + skb_reserve(skb, 2); > 
+ vnet_copy_pages_to_buf(skb_put(skb, buf->len), > + zp->s2p_data[pkid], buf->len); > + skb->dev = dev; > + skb->protocol = eth_type_trans(skb, dev); > +// skb->ip_summed = CHECKSUM_UNNECESSARY; > + zhp->stats.rx_packets++; > + zhp->stats.rx_bytes += buf->len; > + netif_receive_skb(skb); > + numpackets--; > + (*budget)--; > + dev->quota--; > + count++; > + } else > + zhp->stats.rx_dropped++; > + buffer_status = vnet_rx_packet(&control->s2pmit); > + if (buffer_status & QUEUE_IS_EMPTY) > + goto empty; > + } > + return 1; //please ask us again > +empty: > + netif_rx_complete(dev); > + /* we might have raced against a wakup*/ > + if (!vnet_q_empty(atomic_read(&control->s2pmit))) { > + if (netif_rx_reschedule(dev, count)) > + goto loop; > + } > + return 0; > +} > + > +static struct net_device_stats * > +vnet_net_stats(struct net_device *dev) > +{ > + struct vnet_port *zp; > + struct vnet_host_port *zhp; > + > + zp = dev->priv; > + zhp = zp->priv; > + return &zhp->stats; > +} > + > +static int > +vnet_net_change_mtu(struct net_device *dev, int new_mtu) > +{ > + if (new_mtu <= ETH_ZLEN) > + return -ERANGE; > + if (new_mtu > VNET_BUFFER_SIZE-ETH_HLEN) > + return -ERANGE; > + dev->mtu = new_mtu; > + return 0; > +} > + > +static void > +__vnet_common_init(struct net_device *dev) > +{ > + dev->open = vnet_net_open; > + dev->stop = vnet_net_stop; > + dev->hard_start_xmit = vnet_net_xmit; > + dev->get_stats = vnet_net_stats; > + dev->tx_timeout = vnet_net_tx_timeout; > + dev->watchdog_timeo = VNET_TIMEOUT; > + dev->change_mtu = vnet_net_change_mtu; > + dev->weight = 64; > + //dev->features |= NETIF_F_NO_CSUM | NETIF_F_LLTX; > + dev->features |= NETIF_F_LLTX; > +} > + > +static void > +__vnet_layer3_init(struct net_device *dev) > +{ > + dev->mtu = ETH_DATA_LEN; > + dev->tx_queue_len = 1000; > + dev->flags = IFF_BROADCAST|IFF_MULTICAST|IFF_NOARP; > + dev->type = ARPHRD_PPP; > + dev->mtu = 1492; > + dev->poll = vnet_l3_poll; > + __vnet_common_init(dev); > +} > + > +static void > 
+__vnet_layer2_init(struct net_device *dev) > +{ > + ether_setup(dev); > + random_ether_addr(dev->dev_addr); > + dev->mtu = 1492; > + dev->poll = vnet_l2_poll; > + __vnet_common_init(dev); > +} > + > +static void > +vnet_host_destroy(struct vnet_port *zhost) > +{ > + struct vnet_host_port *zhp; > + zhp = zhost->priv; > + > + vnet_port_detach(zhost); > + unregister_netdev(zhp->netdev); > + free_netdev(zhp->netdev); > + zhp->netdev = NULL; > + vnet_host_free(zhost); > + kfree(zhp); > + vnet_port_put(zhost); > +} > + > + > + > +struct vnet_port * > +vnet_host_create(char *name) > +{ > + int rc; > + struct vnet_port *port; > + struct vnet_host_port *host; > + char busname[BUS_ID_SIZE]; > + int minor; > + > + snprintf(busname, BUS_ID_SIZE, "host:%s", name); > + > + minor = vnet_minor_by_name(name); > + if (minor < 0) > + return NULL; > + port = vnet_port_get(minor, busname); > + if (!port) > + goto out; > + host = kzalloc(sizeof(struct vnet_host_port), GFP_KERNEL); > + if (!host) { > + kfree(port); > + port = NULL; > + goto out; > + } > + port->priv = host; > + rc =vnet_port_hostsetup(port); > + if (rc) > + goto out_free_host; > + rtnl_lock(); > + if (port->zs->linktype == 2) > + host->netdev = alloc_netdev(0, name, __vnet_layer2_init); > + else > + host->netdev = alloc_netdev(0, name, __vnet_layer3_init); > + if (!host->netdev) > + goto out_unlock; > + memcpy(port->mac, host->netdev->dev_addr, ETH_ALEN); > + > + host->netdev->priv = port; > + port->interrupt = vnet_host_interrupt; > + port->destroy = vnet_host_destroy; > + > + if (!register_netdevice(host->netdev)) { > + /* good case */ > + rtnl_unlock(); > + return port; > + } > + host->netdev->priv = NULL; > + free_netdev(host->netdev); > + host->netdev = NULL; > +out_unlock: > + rtnl_unlock(); > + vnet_host_free(port); > +out_free_host: > + vnet_port_put(port); > + port = NULL; > +out: > + return port; > +} > Index: linux-2.6.21/drivers/s390/guest/vnet_port_host.h > 
=================================================================== > --- /dev/null > +++ linux-2.6.21/drivers/s390/guest/vnet_port_host.h > @@ -0,0 +1,18 @@ > +/* > + * Copyright (C) 2005 IBM Corporation > + * Christian Borntraeger <cborntra-tA70FqPdS9bQT0dZR+AlfA@public.gmane.org> > + * > + */ > + > +#ifndef __VNET_PORTS_HOST_H > +#define __VNET_PORTS_HOST_H > + > +#include <linux/netdevice.h> > +#include "vnet_switch.h" > + > +struct vnet_host_port { > + struct net_device_stats stats; > + struct net_device *netdev; > +}; > +extern struct vnet_port * vnet_host_create(char *name); > +#endif > Index: linux-2.6.21/drivers/s390/guest/vnet_switch.c > =================================================================== > --- /dev/null > +++ linux-2.6.21/drivers/s390/guest/vnet_switch.c > @@ -0,0 +1,828 @@ > +/* > + * vnet zlswitch handling > + * > + * Copyright (C) 2005 IBM Corporation > + * Author: Carsten Otte <cotte-tA70FqPdS9bQT0dZR+AlfA@public.gmane.org> > + * Author: Christian Borntraeger <borntrae-tA70FqPdS9bQT0dZR+AlfA@public.gmane.org> > + * > + */ > + > +#include <linux/device.h> > +#include <linux/etherdevice.h> > +#include <linux/fs.h> > +#include <linux/if.h> > +#include <linux/if_ether.h> > +#include <linux/kernel.h> > +#include <linux/list.h> > +#include <linux/miscdevice.h> > +#include <linux/module.h> > +#include <linux/netdevice.h> > +#include <linux/rtnetlink.h> > +#include <linux/pagemap.h> > +#include <linux/spinlock.h> > + > +#include "vnet.h" > +#include "vnet_port_guest.h" > +#include "vnet_port_host.h" > +#include "vnet_switch.h" > + > +#define NUM_MINORS 1024 > + > +/* devices housekeeping, creation & destruction */ > +static LIST_HEAD(vnet_switches); > +static rwlock_t vnet_switches_lock = RW_LOCK_UNLOCKED; > +static struct class *zwitch_class; > +static int vnet_major; > +static struct device *root_dev; > + > + > +/* The following functions allow ports of the switch to know about > + * the MAC addresses of other ports. 
This is necessary for special > + * hardware like OSA express which silently drops incoming packets > + * that not match known MAC addresses and do not support promiscous > + * mode as well. We have to register all guest MAC addresses at OSA > + * make packet receive working */ > + > +/* Announces the own MAC address to all other ports > + * this function is called if a new port is added */ > +static void vnet_switch_add_mac(struct vnet_port *port) > +{ > + struct vnet_port *other_port; > + > + read_lock(&port->zs->ports_lock); > + list_for_each_entry(other_port, &port->zs->switch_ports, lh) > + if ((other_port != port) && (other_port->set_mac)) > + other_port->set_mac(other_port,port->mac, 1); > + read_unlock(&port->zs->ports_lock); > +} > + > +/* Removes the own MAC address from all other ports > + * this function is called if a port is detached*/ > +static void vnet_switch_del_mac(struct vnet_port *port) > +{ > + struct vnet_port *other_port; > + > + read_lock(&port->zs->ports_lock); > + list_for_each_entry(other_port, &port->zs->switch_ports, lh) > + if (other_port->set_mac) > + other_port->set_mac(other_port, port->mac, 0); > + read_unlock(&port->zs->ports_lock); > +} > + > +/* Learn MACs from other ports on the same zwitch and forward > + * the MAC addresses to the set_mac function of the port.*/ > +static void __vnet_port_learn_macs(struct vnet_port *port) > +{ > + struct vnet_port *other_port; > + > + if (!port->set_mac) > + return; > + list_for_each_entry(other_port, &port->zs->switch_ports, lh) > + if (other_port != port) > + port->set_mac(port, other_port->mac, 1); > +} > + > +/* Unlearn MACS from other ports on the same zwitch */ > +static void __vnet_port_unlearn_macs(struct vnet_port *port) > +{ > + struct vnet_port *other_port; > + > + if (!port->set_mac) > + return; > + list_for_each_entry(other_port, &port->zs->switch_ports, lh) > + if (other_port != port) > + port->set_mac(port, other_port->mac, 0); > +} > + > + > +static struct vnet_switch 
*__vnet_switch_by_minor(int minor) > +{ > + struct vnet_switch *zs; > + > + list_for_each_entry(zs, &vnet_switches, lh) { > + if (MINOR(zs->cdev.dev) == minor) > + return zs; > + } > + return NULL; > +} > + > +static struct vnet_switch *__vnet_switch_by_name(char *name) > +{ > + struct vnet_switch *zs; > + > + list_for_each_entry(zs, &vnet_switches, lh) > + if (strncmp(zs->name, name, ZWITCH_NAME_SIZE) == 0) > + return zs; > + return NULL; > +} > + > +/* Returns a switch structure and increases the reference count. If no such > + * switch exists a new one is created with reference count 1 */ > +static struct vnet_switch *zwitch_get(int minor) > +{ > + struct vnet_switch *zs; > + > + read_lock(&vnet_switches_lock); > + zs = __vnet_switch_by_minor(minor); > + if (!zs) { > + read_unlock(&vnet_switches_lock); > + return zs; > + } > + get_device(&zs->dev); > + read_unlock(&vnet_switches_lock); > + return zs; > +} > + > +/* reduces the reference count of the switch. */ > +static void zwitch_put(struct vnet_switch * zs) > +{ > + put_device(&zs->dev); > +} > + > +/* looks into the packet and searches a matching MAC address > + * return NULL if unknown or broadcast */ > +static struct vnet_port *__vnet_find_l2(struct vnet_switch *zs, char *data) > +{ > + //FIXME: make this a hash lookup, more macs per device? > + struct vnet_port *port; > + > + if (is_multicast_ether_addr(data)) > + return NULL; > + list_for_each_entry(port, &zs->switch_ports, lh) { > + if (compare_ether_addr(port->mac, data)==0) > + goto out; > + } > + port = NULL; > + out: > + return port; > +} > + > +/* searches the destination for IP only interfaces. 
Normally routing > + * is the way to go, but guests should see the net transparently without > + * a hop in between*/ > +static struct vnet_port *__vnet_find_l3(struct vnet_switch *zs, char *data) > +{ > + return NULL; > +} > + > +static struct vnet_port * __vnet_find_destination(struct vnet_switch *zs, > + char *data) > +{ > + switch (zs->linktype) { > + case 2: > + return __vnet_find_l2(zs, data); > + case 3: > + return __vnet_find_l3(zs, data); > + default: > + BUG(); > + } > +} > + > +/* copies len bytes of data from the memory specified by the list of > + * pointers **from into the memory specified by the list of pointers **to > + * with each pointer pointing to a page */ > +static void > +vnet_switch_page_copy(void **to, void **from, int len) > +{ > + int remaining=len; > + int pageid = 0; > + int amount; > + > + while(remaining) { > + amount = min((int)PAGE_SIZE, remaining); > + memcpy(to[pageid], from[pageid], amount); > + pageid++; > + remaining -= amount; > + } > +} > + > +/* copies to data into a buffer of destination > + * returns 0 if ok*/ > +static int > +vnet_unicast(struct vnet_port *destination, void **from_data, int len, int proto) > +{ > + int pkid; > + int buffer_status; > + void **to_data; > + struct vnet_control *control; > + > + control = destination->control; > + spin_lock_bh(&destination->rxlock); > + if (vnet_q_full(atomic_read(&control->s2pmit))) { > + destination->rx_dropped++; > + spin_unlock_bh(&destination->rxlock); > + return -ENOBUFS; > + } > + pkid = __nextx(atomic_read(&control->s2pmit)); > + to_data = destination->s2p_data[pkid]; > + vnet_switch_page_copy(to_data, from_data, len); > + control->s2pbufs[pkid].len = len; > + control->s2pbufs[pkid].proto = proto; > + buffer_status = vnet_tx_packet(&control->s2pmit); > + spin_unlock_bh(&destination->rxlock); > + if (buffer_status & QUEUE_WAS_EMPTY) > + destination->interrupt(destination, VNET_IRQ_START_RX); > + destination->rx_bytes += len; > + destination->rx_packets++; > + return 0; 
> +} > + > +/* send packets to all ports and emulate broadcasts via unicasts*/ > +static int vnet_allcast(struct vnet_port *from_port, void **fromdata, > + int len, int proto) > +{ > + struct vnet_port *destination; > + int failure = 0; > + > + list_for_each_entry(destination, &from_port->zs->switch_ports, lh) > + if (destination != from_port) > + failure |= vnet_unicast(destination, fromdata, > + len, proto); > + return failure; > +} > + > +/* takes an incoming packet and forwards it to the right port > + * if a failure occurs, increase the tx_dropped count of the sender*/ > +static void vnet_switch_packet(struct vnet_port *from_port, > + void **from_data, int len, int proto) > +{ > + struct vnet_port *destination; > + int failure; > + > + read_lock(&from_port->zs->ports_lock); > + destination = __vnet_find_destination(from_port->zs, from_data[0]); > + /* we dont want to loop. FIXME: document when this can happen*/ > + if (destination == from_port) { > + read_unlock(&from_port->zs->ports_lock); > + return; > + } > + if (destination) > + failure = vnet_unicast(destination, from_data, len, proto); > + else > + failure = vnet_allcast(from_port, from_data, len, proto); > + read_unlock(&from_port->zs->ports_lock); > + if (failure) > + from_port->tx_dropped++; > + else { > + from_port->tx_packets++; > + from_port->tx_bytes += len; > + } > +} > + > +static void vnet_port_release(struct device *dev) > +{ > + struct vnet_port *port; > + > + port = container_of(dev, struct vnet_port, dev); > + zwitch_put(port->zs); > + kfree(port); > +} > + > +static ssize_t vnet_port_read_mac(struct device *dev, > + struct device_attribute *attr, > + char *buf) > +{ > + struct vnet_port *port; > + > + port = container_of(dev, struct vnet_port, dev); > + return sprintf(buf,"%02X:%02X:%02X:%02X:%02X:%02X", port->mac[0], > + port->mac[1], port->mac[2], port->mac[3], > + port->mac[4], port->mac[5]); > +} > + > +static ssize_t vnet_port_read_tx_bytes(struct device *dev, > + struct 
device_attribute *attr, > + char *buf) > +{ > + struct vnet_port *port; > + > + port = container_of(dev, struct vnet_port, dev); > + return sprintf(buf,"%lu", port->tx_bytes); > +} > + > +static ssize_t vnet_port_read_rx_bytes(struct device *dev, > + struct device_attribute *attr, > + char *buf) > +{ > + struct vnet_port *port; > + > + port = container_of(dev, struct vnet_port, dev); > + return sprintf(buf,"%lu", port->rx_bytes); > +} > + > +static ssize_t vnet_port_read_tx_packets(struct device *dev, > + struct device_attribute *attr, > + char *buf) > +{ > + struct vnet_port *port; > + > + port = container_of(dev, struct vnet_port, dev); > + return sprintf(buf,"%lu", port->tx_packets); > +} > + > +static ssize_t vnet_port_read_rx_packets(struct device *dev, > + struct device_attribute *attr, > + char *buf) > +{ > + struct vnet_port *port; > + > + port = container_of(dev, struct vnet_port, dev); > + return sprintf(buf,"%lu", port->rx_packets); > +} > + > +static ssize_t vnet_port_read_tx_dropped(struct device *dev, > + struct device_attribute *attr, > + char *buf) > +{ > + struct vnet_port *port; > + > + port = container_of(dev, struct vnet_port, dev); > + return sprintf(buf,"%lu", port->tx_dropped); > +} > + > +static ssize_t vnet_port_read_rx_dropped(struct device *dev, > + struct device_attribute *attr, > + char *buf) > +{ > + struct vnet_port *port; > + > + port = container_of(dev, struct vnet_port, dev); > + return sprintf(buf,"%lu", port->rx_dropped); > +} > + > +static DEVICE_ATTR(mac, S_IRUSR, vnet_port_read_mac, NULL); > +static DEVICE_ATTR(tx_bytes, S_IRUSR, vnet_port_read_tx_bytes, NULL); > +static DEVICE_ATTR(rx_bytes, S_IRUSR, vnet_port_read_rx_bytes, NULL); > +static DEVICE_ATTR(tx_packets, S_IRUSR, vnet_port_read_tx_packets, NULL); > +static DEVICE_ATTR(rx_packets, S_IRUSR, vnet_port_read_rx_packets, NULL); > +static DEVICE_ATTR(tx_dropped, S_IRUSR, vnet_port_read_tx_dropped, NULL); > +static DEVICE_ATTR(rx_dropped, S_IRUSR, 
vnet_port_read_rx_dropped, NULL); > + > +static int vnet_port_attributes(struct device *dev) > +{ > + int rc; > + rc = device_create_file(dev, &dev_attr_mac); > + if (rc) > + return rc; > + rc = device_create_file(dev, &dev_attr_tx_dropped); > + if (rc) > + return rc; > + rc = device_create_file(dev, &dev_attr_rx_dropped); > + if (rc) > + return rc; > + rc = device_create_file(dev, &dev_attr_rx_bytes); > + if (rc) > + return rc; > + rc = device_create_file(dev, &dev_attr_tx_bytes); > + if (rc) > + return rc; > + rc = device_create_file(dev, &dev_attr_rx_packets); > + if (rc) > + return rc; > + rc = device_create_file(dev, &dev_attr_tx_packets); > + return rc; > +} > + > + > +//FIXME implement this > +static int vnet_port_exists(struct vnet_switch *zs, char *name) > +{ > + read_lock(&zs->ports_lock); > + read_unlock(&zs->ports_lock); > + return 0; > + > +} > + > +static struct vnet_port *vnet_port_create(struct vnet_switch *zs, > + char *name) > +{ > + struct vnet_port *port; > + > + if (vnet_port_exists(zs, name)) > + return NULL; > + > + port = kzalloc(sizeof(*port), GFP_KERNEL); > + if (port) { > + spin_lock_init(&port->rxlock); > + spin_lock_init(&port->txlock); > + INIT_LIST_HEAD(&port->lh); > + port->zs = zs; > + } else > + return NULL; > + port->dev.parent = &zs->dev; > + port->dev.release = vnet_port_release; > + strncpy(port->dev.bus_id, name, BUS_ID_SIZE); > + if (device_register(&port->dev)) { > + kfree(port); > + return NULL; > + } > + if (vnet_port_attributes(&port->dev)) { > + device_unregister(&port->dev); > + kfree(port); > + return NULL; > + } > + return port; > +} > + > +/*------------------------ switch creation/Destruction/housekeeping---------*/ > + > +static void zwitch_destroy_ports(struct vnet_switch *zs) > +{ > + struct vnet_port *port, *tmp; > + > + list_for_each_entry_safe(port, tmp, &zs->switch_ports, lh) { > + if (port->destroy) > + port->destroy(port); > + else > + printk("No destroy function for port\n"); > + } > +} > + > + > +static 
void zwitch_destroy(struct vnet_switch *zs) > +{ > + class_device_destroy(zwitch_class, zs->cdev.dev); > + cdev_del(&zs->cdev); > + device_unregister(&zs->dev); > +} > + > +static void zwitch_release(struct device *dev) > +{ > + struct vnet_switch *zs; > + > + zs = container_of(dev, struct vnet_switch, dev); > + kfree(zs); > +} > + > +static int __zwitch_get_minor(void) > +{ > + int d, found; > + struct vnet_switch *zs; > + > + for (d=0; d< NUM_MINORS; d++) { > + found = 0; > + list_for_each_entry(zs, &vnet_switches, lh) > + if (MINOR(zs->cdev.dev) == d) > + found++; > + if (!found) break; > + } > + if (found) return -ENODEV; > + return d; > +} > + > +/* > + * checks if this name already exists for a zwitch > + */ > +static int __zwitch_check_name(char *name) > +{ > + struct vnet_switch *zs; > + > + list_for_each_entry(zs, &vnet_switches, lh) > + if (!strncmp(name, zs->name, ZWITCH_NAME_SIZE)) > + return -EEXIST; > + return 0; > +} > + > +static int zwitch_create(char *name, int linktype) > +{ > + struct vnet_switch *zs; > + int minor; > + int ret; > + > + if ((linktype < 2) || (linktype > 3)) > + return -EINVAL; > + zs = kzalloc(sizeof(*zs), GFP_KERNEL); > + if (!zs) { > + printk("Creation of %s failed: out of memory\n", name); > + return -ENOMEM; > + } > + zs->linktype = linktype; > + strncpy(zs->name, name, ZWITCH_NAME_SIZE); > + rwlock_init(&zs->ports_lock); > + INIT_LIST_HEAD(&zs->switch_ports); > + > + write_lock(&vnet_switches_lock); > + minor = __zwitch_get_minor(); > + if (minor < 0) { > + write_unlock(&vnet_switches_lock); > + printk("Creation of %s failed: No free minor number\n", name); > + kfree(zs); > + return minor; > + } > + if (__zwitch_check_name(zs->name)) { > + write_unlock(&vnet_switches_lock); > + printk("Creation of %s failed: name exists\n", name); > + kfree(zs); > + return -EEXIST; > + } > + list_add_tail(&zs->lh, &vnet_switches); > + write_unlock(&vnet_switches_lock); > + strncpy(zs->dev.bus_id, name, min((int) strlen(name), > + 
ZWITCH_NAME_SIZE)); > + zs->dev.parent = root_dev; > + zs->dev.release = zwitch_release; > + ret = device_register(&zs->dev); > + if (ret) { > + write_lock(&vnet_switches_lock); > + list_del(&zs->lh); > + write_unlock(&vnet_switches_lock); > + printk("Creation of %s failed: no device\n",name); > + return ret; > + } > + vnet_cdev_init(&zs->cdev); > + cdev_add(&zs->cdev, MKDEV(vnet_major, minor), 1); > + zs->class_device = class_device_create(zwitch_class, NULL, > + zs->cdev.dev, &zs->dev, name); > + if (IS_ERR(zs->class_device)) { > + cdev_del(&zs->cdev); > + write_lock(&vnet_switches_lock); > + list_del(&zs->lh); > + write_unlock(&vnet_switches_lock); > + printk("Creation of %s failed: no class_device\n", name); > + device_unregister(&zs->dev); > + return PTR_ERR(zs->class_device); > + } > + return 0; > +} > + > + > +static int zwitch_delete(char *name) > +{ > + struct vnet_switch *zs; > + > + write_lock(&vnet_switches_lock); > + zs = __vnet_switch_by_name(name); > + if (!zs) { > + write_unlock(&vnet_switches_lock); > + return -ENOENT; > + } > + list_del(&zs->lh); > + write_unlock(&vnet_switches_lock); > + zwitch_destroy_ports(zs); > + zwitch_destroy(zs); > + return 0; > +} > + > +/* checks if a switch for the given minor exists > + * if yes, create an unconnected port on this switch > + * if no, return NULL */ > +struct vnet_port *vnet_port_get(int minor, char *port_name) > +{ > + struct vnet_switch *zs; > + struct vnet_port *port; > + > + zs = zwitch_get(minor); > + if (!zs) > + return NULL; > + port = vnet_port_create(zs, port_name); > + if (!port) > + zwitch_put(zs); > + return port; > +} > + > +/* attaches the port to the switch. 
The port must be > + * fully initialized, as it may get called immediately afterwards */ > +void vnet_port_attach(struct vnet_port *port) > +{ > + write_lock_bh(&port->zs->ports_lock); > + __vnet_port_learn_macs(port); > + list_add(&port->lh, &port->zs->switch_ports); > + write_unlock_bh(&port->zs->ports_lock); > + vnet_switch_add_mac(port); > + return; > +} > + > +/* detaches the port from the switch. After that, > + * no calls into the port are made */ > +void vnet_port_detach(struct vnet_port *port) > +{ > + vnet_switch_del_mac(port); > + write_lock_bh(&port->zs->ports_lock); > + if (!list_empty(&port->lh)) > + list_del(&port->lh); > + __vnet_port_unlearn_macs(port); > + write_unlock_bh(&port->zs->ports_lock); > +} > + > +/* releases all resources allocated with vnet_port_get */ > +void vnet_port_put(struct vnet_port *port) > +{ > + BUG_ON(!list_empty(&port->lh) && (port->lh.next != LIST_POISON1)); > + device_unregister(&port->dev); > +} > + > +/* tell the switch that new data is available */ > +void vnet_port_rx(struct vnet_port *port) > +{ > + struct vnet_control *control; > + int pkid, rc; > + > + control = port->control; > + if (vnet_q_empty(atomic_read(&control->p2smit))) { > + printk(KERN_WARNING "vnet_switch: Empty buffer " > + "on interrupt\n"); > + return; > + } > + do { > + pkid = __nextr(atomic_read(&control->p2smit)); > + /* fire and forget.
Let the switch care about lost packets*/ > + vnet_switch_packet(port, port->p2s_data[pkid], > + control->p2sbufs[pkid].len, > + control->p2sbufs[pkid].proto); > + rc = vnet_rx_packet(&control->p2smit); > + if (rc & QUEUE_WAS_FULL) { > + port->interrupt(port, VNET_IRQ_START_TX); > + } > + } while (!(rc & QUEUE_IS_EMPTY)); > + return; > +} > + > +/* checks if the given address is locally attached to the switch*/ > +int vnet_address_is_local(struct vnet_switch *zs, char *address) > +{ > + struct vnet_port *port; > + > + read_lock(&zs->ports_lock); > + port = __vnet_find_destination(zs, address); > + read_unlock(&zs->ports_lock); > + return (port != NULL); > +} > + > + > +int vnet_minor_by_name(char *name) > +{ > + struct vnet_switch *zs; > + int ret; > + > + read_lock(&vnet_switches_lock); > + zs = __vnet_switch_by_name(name); > + if (zs) > + ret = MINOR(zs->cdev.dev); > + else > + ret = -ENODEV; > + read_unlock(&vnet_switches_lock); > + return ret; > +} > + > +static void vnet_root_release(struct device *dev) > +{ > + kfree(dev); > +} > + > + > +struct command { > + char *string1; > + char *string2; > +}; > + > +/*FIXME this is ugly. Dont worry: as soon as we have finalized the interface, > + this crap is going away. 
Still, it works.......*/ > +static long vnet_control_ioctl(struct file *f, unsigned int command, > + unsigned long data) > +{ > + char string1[BUS_ID_SIZE]; > + char string2[BUS_ID_SIZE]; > + struct command com; > + struct vnet_port *port; > + > + if (!capable(CAP_NET_ADMIN)) > + return -EPERM; > + if (copy_from_user(&com, (__user struct command*) data, sizeof(struct command))) > + return -EFAULT; > + if (copy_from_user(string1, (__user char *) com.string1, ZWITCH_NAME_SIZE)) > + return -EFAULT; > + if (command >=2) > + if (copy_from_user(string2, (__user char *) com.string2, ZWITCH_NAME_SIZE)) > + return -EFAULT; > + if (strnlen(string1, ZWITCH_NAME_SIZE) == ZWITCH_NAME_SIZE) > + return -EINVAL; > + switch(command) { > + case ADD_SWITCH: > + return zwitch_create(string1,3); > + case DEL_SWITCH: > + return zwitch_delete(string1); > + case ADD_HOST: > + port = vnet_host_create(string1); > + if (port) { > + vnet_port_attach(port); > + return 0; > + } else > + return -ENODEV; > + default: > + return -EINVAL; > + } > + return 0; > +} > + > +static int vnet_control_open(struct inode *inode, struct file *file) > +{ > + return 0; > +} > + > +static int vnet_control_release(struct inode *inode, struct file *file) > +{ > + return 0; > +} > + > +struct file_operations vnet_control_fops = { > + .open = vnet_control_open, > + .release = vnet_control_release, > + .unlocked_ioctl = &vnet_control_ioctl, > + .compat_ioctl = &vnet_control_ioctl, > +}; > + > +struct miscdevice vnet_control_device = { > + .minor = MISC_DYNAMIC_MINOR, > + .name = "vnet", > + .fops = &vnet_control_fops, > +}; > + > +int vnet_register_control_device(void) > +{ > + return misc_register(&vnet_control_device); > +} > + > +int __init vnet_switch_init(void) > +{ > + int ret; > + dev_t dev; > + > + zwitch_class = class_create(THIS_MODULE, "vnet"); > + if (IS_ERR(zwitch_class)) { > + printk(KERN_ERR "vnet_switch: class_create failed!\n"); > + ret = PTR_ERR(zwitch_class); > + goto out; > + } > + ret = 
alloc_chrdev_region(&dev, 0, NUM_MINORS, "vnet"); > + if (ret) { > + printk(KERN_ERR "vnet_switch: alloc_chrdev_region failed\n"); > + goto out_class; > + } > + vnet_major = MAJOR(dev); > + root_dev = kzalloc(sizeof(*root_dev), GFP_KERNEL); > + if (!root_dev) { > + printk(KERN_ERR "vnet_switch:allocation of device failed\n"); > + ret = -ENOMEM; > + goto out_chrdev; > + } > + strncpy(root_dev->bus_id, "vnet", 5); > + root_dev->release = vnet_root_release; > + ret =device_register(root_dev); > + if (ret) { > + printk(KERN_ERR "vnet_switch: could not register device\n"); > + kfree(root_dev); > + goto out_chrdev; > + } > + ret = vnet_register_control_device(); > + if (ret) { > + printk("vnet_switch: could not create control device\n"); > + goto out_dev; > + } > + printk ("vnet_switch loaded\n"); > +/* FIXME ---------- remove these static defines as soon as everyone has the > + * user tools */ > + { > + struct vnet_port *port; > + zwitch_create("myswitch0",2); > + zwitch_create("myswitch1",3); > + > + port = vnet_host_create("myswitch0"); > + if (port) > + vnet_port_attach(port); > + port = vnet_host_create("myswitch1"); > + if (port) > + vnet_port_attach(port); > + } > +/*-----------------------------------------------------------*/ > + return 0; > +out_dev: > + device_unregister(root_dev); > +out_chrdev: > + unregister_chrdev_region(MKDEV(vnet_major,0), NUM_MINORS); > +out_class: > + class_destroy(zwitch_class); > +out: > + return ret; > +} > + > +/* remove all existing vnet_zwitches in the system and unregister the > + * character device from the system */ > +void vnet_switch_exit(void) > +{ > + struct vnet_switch *zs, *tmp; > + list_for_each_entry_safe(zs, tmp, &vnet_switches, lh) { > + zwitch_destroy_ports(zs); > + zwitch_destroy(zs); > + } > + device_unregister(root_dev); > + misc_deregister(&vnet_control_device); > + unregister_chrdev_region(MKDEV(vnet_major,0), NUM_MINORS); > + class_destroy(zwitch_class); > + printk ("vnet_switch unloaded\n"); > +} > + > 
+module_init(vnet_switch_init); > +module_exit(vnet_switch_exit); > +MODULE_DESCRIPTION("VNET: Virtual switch for vnet interfaces"); > +MODULE_AUTHOR("Christian Borntraeger <borntrae-tA70FqPdS9bQT0dZR+AlfA@public.gmane.org>"); > +MODULE_LICENSE("GPL"); > Index: linux-2.6.21/drivers/s390/guest/vnet_switch.h > =================================================================== > --- /dev/null > +++ linux-2.6.21/drivers/s390/guest/vnet_switch.h > @@ -0,0 +1,119 @@ > +/* > + * vnet_switch - zlive insular communication knack switch > + * infrastructure for virtual switching of Linux guests running under Linux > + * > + * Copyright (C) 2005 IBM Corporation > + * Author: Carsten Otte <cotte-tA70FqPdS9bQT0dZR+AlfA@public.gmane.org> > + * Christian Borntraeger <borntrae-tA70FqPdS9bQT0dZR+AlfA@public.gmane.org> > + * > + */ > + > +#ifndef __VNET_SWITCH_H > +#define __VNET_SWITCH_H > + > +#include <linux/cdev.h> > +#include <linux/device.h> > +#include <linux/if_ether.h> > +#include <linux/spinlock.h> > + > +#include "vnet.h" > + > +/* defines for IOCTLs. interface should be replaced by something better */ > +#define ADD_SWITCH 0 > +#define DEL_SWITCH 1 > +#define ADD_OSA 2 > +#define DEL_OSA 3 > +#define ADD_HOST 4 > +#define DEL_HOST 5 > + > +/* min(IFNAMSIZ, BUS_ID_SIZE)*/ > +#define ZWITCH_NAME_SIZE 16 > + > +/* This structure describes a virtual switch for ports to userspace network > + * interfaces, e.g. 
in Linux under Linux environments*/ > +struct vnet_switch { > + struct list_head lh; > + char name[ZWITCH_NAME_SIZE]; > + struct list_head switch_ports; /* list of ports */ > + rwlock_t ports_lock; /* lock for switch_ports */ > + struct class_device *class_device; > + struct cdev cdev; > + struct device dev; > + struct vnet_port *osa; > + int linktype; /* 2=ethernet 3=IP */ > +}; > + > +/* description of a port of the vnet_switch */ > +struct vnet_port { > + struct list_head lh; > + struct vnet_switch *zs; > + struct vnet_control *control; > + void *s2p_data[VNET_QUEUE_LEN][(VNET_BUFFER_SIZE>>PAGE_SHIFT)]; > + void *p2s_data[VNET_QUEUE_LEN][(VNET_BUFFER_SIZE>>PAGE_SHIFT)]; > + char mac[ETH_ALEN]; > + void *priv; > + int (*set_mac) (struct vnet_port *port, char mac[ETH_ALEN], int add); > + void (*interrupt) (struct vnet_port *port, int type); > + void (*destroy) (struct vnet_port *port); > + struct device dev; > + unsigned long rx_packets; /* total packets received */ > + unsigned long tx_packets; /* total packets transmitted */ > + unsigned long rx_bytes; /* total bytes received */ > + unsigned long tx_bytes; /* total bytes transmitted */ > + unsigned long rx_dropped; /* no space in receive buffer */ > + unsigned long tx_dropped; /* no space in destination buffer */ > + spinlock_t rxlock; > + spinlock_t txlock; > +}; > + > + > +static inline int > +vnet_copy_buf_to_pages(void **data, char *buf, int len) > +{ > + int i; > + > + if (len == 0) > + return 0; > + for (i=0; i <= ((len - 1) >> PAGE_SHIFT); i++ ) > + memcpy(data[i], buf + i*PAGE_SIZE, min(PAGE_SIZE, len - i*PAGE_SIZE)); > + return len; > +} > + > +static inline int > +vnet_copy_pages_to_buf(char *buf, void **data, int len) > +{ > + int i; > + > + if (len == 0) > + return 0; > + for (i=0; i <= ((len -1) >> PAGE_SHIFT); i++ ) > + memcpy(buf + i*PAGE_SIZE, data[i], min(PAGE_SIZE, len - i*PAGE_SIZE)); > + return len; > +} > + > + > +/* checks if a switch with the given minor exists > + * if yes, create a named 
and unconnected port on > + this switch with the given name. If no, return NULL */ > +extern struct vnet_port *vnet_port_get(int minor, char *port_name); > + > +/* attaches the port to the switch. The port must be > + * fully initialized, as it may get data immediately afterwards */ > +extern void vnet_port_attach(struct vnet_port *port); > + > +/* detaches the port from the switch. After that, > + * no calls into the port are made */ > +extern void vnet_port_detach(struct vnet_port *port); > + > +/* releases all resources allocated with vnet_port_get */ > +extern void vnet_port_put(struct vnet_port *port); > + > +/* tell the switch that new data is available */ > +extern void vnet_port_rx(struct vnet_port *port); > + > +/* get the minor for a given name */ > +extern int vnet_minor_by_name(char *name); > + > +/* checks if the given address is locally attached to the switch */ > +extern int vnet_address_is_local(struct vnet_switch *zs, char *address); > +#endif > Index: linux-2.6.21/drivers/s390/guest/Makefile > =================================================================== > --- linux-2.6.21.orig/drivers/s390/guest/Makefile > +++ linux-2.6.21/drivers/s390/guest/Makefile > @@ -6,3 +6,6 @@ obj-$(CONFIG_GUEST_CONSOLE) += guest_con > obj-$(CONFIG_S390_GUEST) += vdev.o vdev_device.o > obj-$(CONFIG_VDISK) += vdisk.o vdisk_blk.o > obj-$(CONFIG_VNET_GUEST) += vnet_guest.o > +vnet_host-objs := vnet_switch.o vnet_port_guest.o vnet_port_host.o > +obj-$(CONFIG_VNET_HOST) += vnet_host.o > + > Index: linux-2.6.21/drivers/s390/net/Kconfig > =================================================================== > --- linux-2.6.21.orig/drivers/s390/net/Kconfig > +++ linux-2.6.21/drivers/s390/net/Kconfig > @@ -95,4 +95,16 @@ config VNET_GUEST > connection. > If you're not using host/guest support, say N. > > +config VNET_HOST > + tristate "virtual networking support (HOST)" > + depends on QETH && S390_HOST > + help > + This is the host part of the vnet guest network connection.
> + Say Y if you plan to host guests with network > + connection. The host part consists of a virtual switch, > + a host device, as well as a connection to the qeth > + driver. > + If you're not using this kernel for hosting guests, say N. > + > + > endmenu > > > > ------------------------------------------------------------------------- > This SF.net email is sponsored by DB2 Express > Download DB2 Express C - the FREE version of DB2 express and take > control of your XML. No limits. Just data. Click to get it now. > http://sourceforge.net/powerbar/db2/ > _______________________________________________ > kvm-devel mailing list > kvm-devel-5NWGOfrQmneRv+LV9MX5uipxlwaOVQ5f@public.gmane.org > https://lists.sourceforge.net/lists/listinfo/kvm-devel > > ^ permalink raw reply [flat|nested] 104+ messages in thread
[parent not found: <4644D048.7060106-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>]
* Re: [PATCH/RFC 8/9] Virtual network host switch support [not found] ` <4644D048.7060106-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org> @ 2007-05-11 20:50 ` Christian Bornträger 0 siblings, 0 replies; 104+ messages in thread From: Christian Bornträger @ 2007-05-11 20:50 UTC (permalink / raw) To: kvm-devel-5NWGOfrQmneRv+LV9MX5uipxlwaOVQ5f; +Cc: Martin Schwidefsky On Friday 11 May 2007 22:21, Anthony Liguori wrote: > Any feel for the performance relative to the bridging code? The > bridging code is a pretty big bottle neck in guest=>guest communications > in Xen at least. Last time I checked we had quite decent guest-to-guest performance, in the gigabits/sec range. On the downside, the switch is quite aggressive about dropping packets, as the inbound buffer of the virtual network adapters only has space for 80 packets (that can be changed). > > > currently tested but not ready yet. We did not use the linux bridging code to > > allow non-root users to create virtual networks between guests. > > > > Is that the primary reason? If so, that seems like a rather large > hammer for something that a userspace suid wrapper could have addressed... Actually there are several reasons why we did not use the bridging code: - One is that a lot of OSA network cards do not support promiscuous mode. There is also the issue that a lot of OSA cards are in layer 3 mode (we get IP packets, not Ethernet frames), so bridging won't work towards the host interface. - non-root switches - the performance of bridging (we copy directly from one guest buffer to another without allocating an skb on the host) - we considered hooking into the qeth driver (for OSA cards) to deal with layer 3 mode. The first shot was actually a point-to-point driver (guest netif <--> host netif); we added the switch at a later time. Hmm, if we can make bridging work (with decent performance) on s390, that would reduce the maintenance work for us, as this network switch is far from complete.
cheers Christian ^ permalink raw reply [flat|nested] 104+ messages in thread
* [PATCH/RFC 9/9] Fix system<->user misaccount of interpreted execution [not found] ` <1178903957.25135.13.camel-WIxn4w2hgUz3YA32ykw5MLlKpX0K8NHHQQ4Iyu8u01E@public.gmane.org> ` (6 preceding siblings ...) 2007-05-11 17:36 ` [PATCH/RFC 8/9] Virtual network host switch support Carsten Otte @ 2007-05-11 17:36 ` Carsten Otte 7 siblings, 0 replies; 104+ messages in thread From: Carsten Otte @ 2007-05-11 17:36 UTC (permalink / raw) To: kvm-devel-5NWGOfrQmneRv+LV9MX5uipxlwaOVQ5f@public.gmane.org Cc: Christian Borntraeger, Martin Schwidefsky From: Christian Borntraeger <cborntra-tA70FqPdS9bQT0dZR+AlfA@public.gmane.org> This patch fixes the accounting of guest cpu time. As sie is executed via a system call, all guest operations were accounted as system time. To fix this we define a per-thread "sie context". Before issuing the sie instruction we enter this context and leave it afterwards. enter_sie and exit_sie call account_system_vtime, which now checks whether we are in sie context. We define time spent in sie context to be accounted as user time. Possible future enhancement: we could add an additional field, "interpretation time", to cpu stat and process time. Thus we could differentiate between user time in the host and host user time spent for guests. The main challenge is the necessary user space change. Therefore, we could export the interpretation time with a new interface. To be defined.
Signed-off-By: Christian Borntraeger <cborntra-tA70FqPdS9bQT0dZR+AlfA@public.gmane.org> Signed-off-By: Carsten Otte <cotte-tA70FqPdS9bQT0dZR+AlfA@public.gmane.org> --- arch/s390/Kconfig | 1 + arch/s390/host/s390host.c | 15 +++++++++++++++ arch/s390/kernel/process.c | 1 + arch/s390/kernel/vtime.c | 11 ++++++++++- include/asm-s390/thread_info.h | 2 ++ 5 files changed, 29 insertions(+), 1 deletion(-) Index: linux-2.6.21/arch/s390/kernel/vtime.c =================================================================== --- linux-2.6.21.orig/arch/s390/kernel/vtime.c +++ linux-2.6.21/arch/s390/kernel/vtime.c @@ -97,6 +97,11 @@ void account_vtime(struct task_struct *t account_system_time(tsk, 0, cputime); } +static inline int task_is_in_sie(struct thread_info *thread) +{ + return thread->in_sie; +} + /* * Update process times based on virtual cpu times stored by entry.S * to the lowcore fields user_timer, system_timer & steal_clock. @@ -114,7 +119,11 @@ void account_system_vtime(struct task_st cputime = S390_lowcore.system_timer >> 12; S390_lowcore.system_timer -= cputime << 12; S390_lowcore.steal_clock -= cputime << 12; - account_system_time(tsk, 0, cputime); + + if (task_is_in_sie(tsk->thread_info) && !hardirq_count() && !softirq_count()) + account_user_time(tsk, cputime); + else + account_system_time(tsk, 0, cputime); } static inline void set_vtimer(__u64 expires) Index: linux-2.6.21/arch/s390/host/s390host.c =================================================================== --- linux-2.6.21.orig/arch/s390/host/s390host.c +++ linux-2.6.21/arch/s390/host/s390host.c @@ -27,6 +27,19 @@ static int s390host_do_action(unsigned l static DEFINE_MUTEX(s390host_init_mutex); +static void enter_sie(void) +{ + account_system_vtime(current); + current_thread_info()->in_sie = 1; +} + +static void exit_sie(void) +{ + account_system_vtime(current); + current_thread_info()->in_sie = 0; +} + + static void s390host_get_data(struct s390host_data *data) { atomic_inc(&data->count); @@ -297,7 
+310,9 @@ again: schedule(); sie_kernel->sie_block.icptcode = 0; + enter_sie(); ret = sie64a(sie_kernel); + exit_sie(); if (ret) goto out; Index: linux-2.6.21/include/asm-s390/thread_info.h =================================================================== --- linux-2.6.21.orig/include/asm-s390/thread_info.h +++ linux-2.6.21/include/asm-s390/thread_info.h @@ -55,6 +55,7 @@ struct thread_info { struct restart_block restart_block; struct s390host_data *s390host_data; /* s390host data */ int sie_cpu; /* sie cpu number */ + int in_sie; /* 1 => cpu is in sie*/ }; /* @@ -72,6 +73,7 @@ struct thread_info { }, \ .s390host_data = NULL, \ .sie_cpu = 0, \ + .in_sie = 0, \ } #define init_thread_info (init_thread_union.thread_info) Index: linux-2.6.21/arch/s390/kernel/process.c =================================================================== --- linux-2.6.21.orig/arch/s390/kernel/process.c +++ linux-2.6.21/arch/s390/kernel/process.c @@ -278,6 +278,7 @@ int copy_thread(int nr, unsigned long cl memset(&p->thread.per_info,0,sizeof(p->thread.per_info)); p->thread_info->s390host_data = NULL; p->thread_info->sie_cpu = -1; + p->thread_info->in_sie = 0; return 0; } Index: linux-2.6.21/arch/s390/Kconfig =================================================================== --- linux-2.6.21.orig/arch/s390/Kconfig +++ linux-2.6.21/arch/s390/Kconfig @@ -519,6 +519,7 @@ config S390_HOST bool "s390 host support (EXPERIMENTAL)" depends on 64BIT && EXPERIMENTAL select S390_SWITCH_AMODE + select VIRT_CPU_ACCOUNTING help Select this option if you want to host guest Linux images ^ permalink raw reply [flat|nested] 104+ messages in thread
end of thread, other threads:[~2007-05-24 0:07 UTC | newest]
Thread overview: 104+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
[not found] <1178903957.25135.13.camel@cotte.boeblingen.de.ibm.com>
[not found] ` <1178903957.25135.13.camel-WIxn4w2hgUz3YA32ykw5MLlKpX0K8NHHQQ4Iyu8u01E@public.gmane.org>
2007-05-11 17:35 ` [PATCH/RFC 2/9] s390 virtualization interface Carsten Otte
2007-05-11 17:35 ` [PATCH/RFC 3/9] s390 guest detection Carsten Otte
2007-05-11 17:35 ` [PATCH/RFC 4/9] Basic guest virtual devices infrastructure Carsten Otte
[not found] ` <1178904958.25135.31.camel-WIxn4w2hgUz3YA32ykw5MLlKpX0K8NHHQQ4Iyu8u01E@public.gmane.org>
2007-05-11 20:06 ` Arnd Bergmann
2007-05-14 11:26 ` Avi Kivity
[not found] ` <46484753.30602-atKUWr5tajBWk0Htik3J/w@public.gmane.org>
2007-05-14 11:43 ` Carsten Otte
[not found] ` <46484B5D.6080605-tA70FqPdS9bQT0dZR+AlfA@public.gmane.org>
2007-05-14 12:00 ` [PATCH/RFC 4/9] Basic guest virtualdevices infrastructure Dor Laor
[not found] ` <64F9B87B6B770947A9F8391472E032160BC7483D-yEcIvxbTEBqsx+V+t5oei8rau4O3wl8o3fe8/T/H7NteoWH0uzbU5w@public.gmane.org>
2007-05-14 13:32 ` Carsten Otte
2007-05-11 17:36 ` [PATCH/RFC 5/9] s390 virtual console for guests Carsten Otte
[not found] ` <1178904960.25135.32.camel-WIxn4w2hgUz3YA32ykw5MLlKpX0K8NHHQQ4Iyu8u01E@public.gmane.org>
2007-05-11 19:00 ` Anthony Liguori
[not found] ` <4644BD3C.8040901-rdkfGonbjUSkNkDKm+mE6A@public.gmane.org>
2007-05-11 19:42 ` Christian Bornträger
2007-05-12 8:07 ` Carsten Otte
2007-05-14 16:23 ` Christian Bornträger
[not found] ` <200705141823.13424.cborntra-tA70FqPdS9bQT0dZR+AlfA@public.gmane.org>
2007-05-14 16:48 ` Christian Borntraeger
[not found] ` <200705141848.18996.borntrae-tA70FqPdS9bQT0dZR+AlfA@public.gmane.org>
2007-05-14 17:49 ` Anthony Liguori
[not found] ` <4648A11D.3050607-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>
2007-05-15 0:27 ` Arnd Bergmann
2007-05-15 7:54 ` Carsten Otte
2007-05-11 17:36 ` [PATCH/RFC 6/9] virtual block device driver Carsten Otte
[not found] ` <1178904963.25135.33.camel-WIxn4w2hgUz3YA32ykw5MLlKpX0K8NHHQQ4Iyu8u01E@public.gmane.org>
2007-05-14 11:49 ` Avi Kivity
[not found] ` <46484CDF.505-atKUWr5tajBWk0Htik3J/w@public.gmane.org>
2007-05-14 13:23 ` Carsten Otte
[not found] ` <464862E9.7020105-tA70FqPdS9bQT0dZR+AlfA@public.gmane.org>
2007-05-14 14:39 ` Avi Kivity
[not found] ` <46487494.1070802-atKUWr5tajBWk0Htik3J/w@public.gmane.org>
2007-05-15 11:47 ` Carsten Otte
[not found] ` <46499DE9.9090202-tA70FqPdS9bQT0dZR+AlfA@public.gmane.org>
2007-05-16 10:01 ` Avi Kivity
2007-05-14 11:52 ` Avi Kivity
[not found] ` <46484D84.3060601-atKUWr5tajBWk0Htik3J/w@public.gmane.org>
2007-05-14 13:26 ` Carsten Otte
2007-05-11 17:36 ` [PATCH/RFC 7/9] Virtual network guest " Carsten Otte
[not found] ` <1178904965.25135.34.camel-WIxn4w2hgUz3YA32ykw5MLlKpX0K8NHHQQ4Iyu8u01E@public.gmane.org>
2007-05-11 19:44 ` ron minnich
[not found] ` <13426df10705111244w1578ebedy8259bc42ca1f588d-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
2007-05-11 20:12 ` Anthony Liguori
[not found] ` <4644CE15.6080505-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>
2007-05-11 21:15 ` Eric Van Hensbergen
[not found] ` <a4e6962a0705111415n47e77a15o331b59cf2a03b4-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
2007-05-11 21:47 ` Anthony Liguori
[not found] ` <4644E456.2060507-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>
2007-05-11 22:21 ` Eric Van Hensbergen
[not found] ` <a4e6962a0705111521v2d451ddcjecf209e2031c85af-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
2007-05-16 17:28 ` Anthony Liguori
[not found] ` <464B3F20.4030904-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>
2007-05-16 17:38 ` Daniel P. Berrange
[not found] ` <20070516173822.GD16863-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
2007-05-17 9:29 ` Carsten Otte
[not found] ` <464C2069.20909-tA70FqPdS9bQT0dZR+AlfA@public.gmane.org>
2007-05-17 14:22 ` Anthony Liguori
[not found] ` <464C651F.5070700-rdkfGonbjUSkNkDKm+mE6A@public.gmane.org>
2007-05-21 11:11 ` Christian Borntraeger
2007-05-16 17:41 ` Eric Van Hensbergen
[not found] ` <a4e6962a0705161041s5393c1a6wc455b20ff3fe8106-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
2007-05-16 18:47 ` Anthony Liguori
[not found] ` <464B51A8.7050307-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>
2007-05-16 19:33 ` Eric Van Hensbergen
2007-05-16 17:45 ` Gregory Haskins
[not found] ` <464B0ADB.BA47.005A.0-Et1tbQHTxzrQT0dZR+AlfA@public.gmane.org>
2007-05-16 18:39 ` Anthony Liguori
[not found] ` <464B4FEB.7070300-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>
2007-05-16 18:57 ` Gregory Haskins
[not found] ` <464B1B9C.BA47.005A.0-Et1tbQHTxzrQT0dZR+AlfA@public.gmane.org>
2007-05-16 19:10 ` Anthony Liguori
[not found] ` <464B572C.6090800-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>
2007-05-17 4:24 ` Rusty Russell
[not found] ` <1179375881.21871.83.camel-bi+AKbBUZKY6gyzm1THtWbp2dZbC/Bob@public.gmane.org>
2007-05-17 16:13 ` Anthony Liguori
[not found] ` <464C7F45.50908-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>
2007-05-17 23:34 ` Rusty Russell
2007-05-21 9:07 ` Christian Borntraeger
[not found] ` <OFC1AADF6F.DB57C7AC-ON422572E2.0030BE22-422572E2.0032174C-tA70FqPdS9bQT0dZR+AlfA@public.gmane.org>
2007-05-21 9:27 ` Cornelia Huck
2007-05-21 11:28 ` Arnd Bergmann
[not found] ` <200705211328.04565.arnd-r2nGTMty4D4@public.gmane.org>
2007-05-21 11:56 ` Cornelia Huck
[not found] ` <20070521135628.17a4f9cc-XQvu0L+U/CiXI4yAdoq52KN5r0PSdgG1zG2AekJRRhI@public.gmane.org>
2007-05-21 13:53 ` Arnd Bergmann
2007-05-21 18:45 ` Anthony Liguori
2007-05-21 23:09 ` ron minnich
2007-05-22 0:29 ` Anthony Liguori
2007-05-22 0:45 ` ron minnich
2007-05-22 1:13 ` Anthony Liguori
2007-05-22 1:34 ` Eric Van Hensbergen
2007-05-22 1:42 ` Anthony Liguori
2007-05-22 5:17 ` Avi Kivity
2007-05-22 12:49 ` Eric Van Hensbergen
2007-05-22 12:56 ` Christoph Hellwig
2007-05-22 14:50 ` Eric Van Hensbergen
2007-05-22 15:05 ` Anthony Liguori
2007-05-22 15:31 ` ron minnich
2007-05-22 16:25 ` Eric Van Hensbergen
2007-05-22 17:00 ` ron minnich
2007-05-22 17:06 ` Christoph Hellwig
2007-05-22 17:34 ` ron minnich
2007-05-22 20:03 ` Dor Laor
2007-05-22 20:10 ` ron minnich
2007-05-22 22:56 ` Nakajima, Jun
2007-05-23 8:15 ` Carsten Otte
2007-05-23 12:25 ` Avi Kivity
2007-05-23 14:12 ` Eric Van Hensbergen
2007-05-23 23:02 ` Arnd Bergmann
2007-05-23 23:57 ` Eric Van Hensbergen
2007-05-24 0:07 ` Eric Van Hensbergen
2007-05-23 12:21 ` Avi Kivity
2007-05-23 12:16 ` Avi Kivity
2007-05-23 12:20 ` Christoph Hellwig
2007-05-23 12:20 ` Avi Kivity
2007-05-23 11:55 ` Avi Kivity
2007-05-22 13:08 ` Anthony Liguori
2007-05-18 5:31 ` ron minnich
2007-05-18 14:31 ` Anthony Liguori
2007-05-18 15:14 ` ron minnich
2007-05-11 21:51 ` ron minnich
2007-05-12 8:46 ` Carsten Otte
2007-05-13 12:04 ` Dor Laor
2007-05-13 14:49 ` Anthony Liguori
2007-05-13 16:23 ` Dor Laor
2007-05-13 16:49 ` Anthony Liguori
2007-05-13 17:06 ` Muli Ben-Yehuda
2007-05-13 20:31 ` Dor Laor
2007-05-14 2:39 ` Rusty Russell
2007-05-14 11:53 ` Avi Kivity
2007-05-14 12:05 ` Avi Kivity
2007-05-14 12:24 ` Christian Bornträger
2007-05-14 12:32 ` Avi Kivity
2007-05-14 13:36 ` Carsten Otte
2007-05-11 17:36 ` [PATCH/RFC 8/9] Virtual network host switch support Carsten Otte
2007-05-11 20:21 ` Anthony Liguori
2007-05-11 20:50 ` Christian Bornträger
2007-05-11 17:36 ` [PATCH/RFC 9/9] Fix system<->user misaccount of interpreted execution Carsten Otte