* [PATCH/RFC 2/9] s390 virtualization interface
@ 2007-05-11 17:35 ` Carsten Otte
From: Carsten Otte @ 2007-05-11 17:35 UTC (permalink / raw)
To: kvm-devel-5NWGOfrQmneRv+LV9MX5uipxlwaOVQ5f@public.gmane.org
Cc: Christian Borntraeger, Martin Schwidefsky

From: Heiko Carstens <heiko.carstens-tA70FqPdS9bQT0dZR+AlfA@public.gmane.org>

Add an interface which allows a process to start a virtual machine. To keep things easy, each thread group is allowed to have only one virtual machine, and each thread of the thread group can control only one virtual cpu of the virtual machine. All the information about the virtual machines/cpus can be found via the thread_info structures of the participating threads.

This patch adds three new s390 specific system calls:

long sys_s390host_add_cpu(unsigned long addr, unsigned long flags, struct sie_block __user *sie_template)

Adds a new cpu to the virtual machine that belongs to the current thread group. If no virtual machine exists, it will be created. In addition, two pages will be allocated and mapped at <addr> into the address space of the process. These two pages are used so user space and kernel space can easily exchange/modify the state of the corresponding virtual cpu without a ton of copy_from/to_user calls. sie_template is a pointer to a data structure that contains initial information on how the virtual cpu should be set up. The resulting block will be used as a parameter to issue the sie (start interpretive execution) instruction, which starts a virtual cpu.

int sys_s390host_remove_cpu(void)

Removes a virtual cpu from a virtual machine.

int sys_s390host_sie(unsigned long action)

Starts / re-enters the virtual cpu of the virtual machine that the thread belongs to, if any.
Please note that this patch is nothing more than a proof of concept and may contain quite a few bugs. Since we want to convert to using kvm instead, most of this will be dropped anyway. But maybe this is of interest for others as well.

Signed-off-by: Heiko Carstens <heiko.carstens-tA70FqPdS9bQT0dZR+AlfA@public.gmane.org>
Signed-off-by: Carsten Otte <cotte-tA70FqPdS9bQT0dZR+AlfA@public.gmane.org>
---
 arch/s390/Kconfig               |    7
 arch/s390/Makefile              |    2
 arch/s390/host/Makefile         |    5
 arch/s390/host/s390_intercept.c |   42 ++++
 arch/s390/host/s390host.c       |  418 ++++++++++++++++++++++++++++++++++++++++
 arch/s390/host/s390host.h       |   16 +
 arch/s390/host/sie64a.S         |   38 +++
 arch/s390/kernel/asm-offsets.c  |    2
 arch/s390/kernel/process.c      |   15 +
 arch/s390/kernel/setup.c        |    4
 arch/s390/kernel/syscalls.S     |    3
 include/asm-s390/sie64.h        |  279 ++++++++++++++++++++++++++
 include/asm-s390/thread_info.h  |    8
 include/asm-s390/unistd.h       |    5
 kernel/sys_ni.c                 |    3
 15 files changed, 842 insertions(+), 5 deletions(-)

Index: linux-2.6.21/arch/s390/kernel/asm-offsets.c =================================================================== --- linux-2.6.21.orig/arch/s390/kernel/asm-offsets.c +++ linux-2.6.21/arch/s390/kernel/asm-offsets.c @@ -44,5 +44,7 @@ int main(void) DEFINE(__SF_BACKCHAIN, offsetof(struct stack_frame, back_chain),); DEFINE(__SF_GPRS, offsetof(struct stack_frame, gprs),); DEFINE(__SF_EMPTY, offsetof(struct stack_frame, empty1),); + BLANK(); + DEFINE(__SIE_USER_gprs, offsetof(struct sie_user, gprs),); return 0; } Index: linux-2.6.21/arch/s390/kernel/syscalls.S =================================================================== --- linux-2.6.21.orig/arch/s390/kernel/syscalls.S +++ linux-2.6.21/arch/s390/kernel/syscalls.S @@ -322,3 +322,6 @@ NI_SYSCALL /* 310 sys_move_pages */ SYSCALL(sys_getcpu,sys_getcpu,sys_getcpu_wrapper) SYSCALL(sys_epoll_pwait,sys_epoll_pwait,compat_sys_epoll_pwait_wrapper) SYSCALL(sys_utimes,sys_utimes,compat_sys_utimes_wrapper)
+SYSCALL(sys_ni_syscall,sys_s390host_add_cpu,sys_ni_syscall) +SYSCALL(sys_ni_syscall,sys_s390host_remove_cpu,sys_ni_syscall) +SYSCALL(sys_ni_syscall,sys_s390host_sie,sys_ni_syscall) Index: linux-2.6.21/arch/s390/host/Makefile =================================================================== --- /dev/null +++ linux-2.6.21/arch/s390/host/Makefile @@ -0,0 +1,5 @@ +# +# Makefile for the s390host components. +# + +obj-$(CONFIG_S390_HOST) += s390host.o sie64a.o s390_intercept.o Index: linux-2.6.21/arch/s390/host/sie64a.S =================================================================== --- /dev/null +++ linux-2.6.21/arch/s390/host/sie64a.S @@ -0,0 +1,38 @@ +/* + * arch/s390/host/sie64a.S + * low level sie call + * + * Copyright IBM Corp. 2007 + * Author(s): Heiko Carstens <heiko.carstens-tA70FqPdS9bQT0dZR+AlfA@public.gmane.org> + * License : GPL + */ + +#include <linux/errno.h> +#include <asm/asm-offsets.h> + +SP_R6 = 6 * 8 # offset into stackframe + + .globl sie64a +sie64a: + stmg %r6,%r15,SP_R6(%r15) # save register on entry + lgr %r14,%r2 # pointer to program parms + aghi %r2,4096 + lmg %r0,%r13,__SIE_USER_gprs(%r2) # load guest gprs 0-13 +sie_inst: + sie 0(%r14) + aghi %r14,4096 + stmg %r0,%r13,__SIE_USER_gprs(%r14) # save guest gprs 0-13 + lghi %r2,0 + lmg %r6,%r15,SP_R6(%r15) + br %r14 + +sie_err: + aghi %r14,4096 + stmg %r0,%r13,__SIE_USER_gprs(%r14) # save guest gprs 0-13 + lghi %r2,-EFAULT + lmg %r6,%r15,SP_R6(%r15) + br %r14 + + .section __ex_table,"a" + .quad sie_inst,sie_err + .previous Index: linux-2.6.21/arch/s390/host/s390host.c =================================================================== --- /dev/null +++ linux-2.6.21/arch/s390/host/s390host.c @@ -0,0 +1,418 @@ +/* + * s390host.c -- hosting zSeries Linux virtual engines + * + * Copyright IBM Corp. 
2007 + * Author(s): Carsten Otte <cotte-tA70FqPdS9bQT0dZR+AlfA@public.gmane.org>, + * Christian Borntraeger <cborntra-tA70FqPdS9bQT0dZR+AlfA@public.gmane.org>, + * Heiko Carstens <heiko.carstens-tA70FqPdS9bQT0dZR+AlfA@public.gmane.org> + * License : GPL + */ + +#include <linux/pagemap.h> +#include <linux/module.h> +#include <linux/fs.h> +#include <linux/mm.h> +#include <linux/init.h> +#include <linux/file.h> +#include <linux/mman.h> +#include <linux/mutex.h> +#include <asm/uaccess.h> +#include <asm/processor.h> +#include <asm/tlbflush.h> +#include <asm/semaphore.h> +#include <asm/sie64.h> +#include "s390host.h" + +static int s390host_do_action(unsigned long, struct sie_io *); + +static DEFINE_MUTEX(s390host_init_mutex); + +static void s390host_get_data(struct s390host_data *data) +{ + atomic_inc(&data->count); +} + +void s390host_put_data(struct s390host_data *data) +{ + int cpu; + + if (atomic_dec_return(&data->count)) + return; + + for (cpu = 0; cpu < S390HOST_MAX_CPUS; cpu++) + if (data->sie_io[cpu]) + free_page((unsigned long)data->sie_io[cpu]); + + if (data->sca_block) + free_page((unsigned long)data->sca_block); + + kfree(data); +} + +static void s390host_vma_close(struct vm_area_struct *vma) +{ + s390host_put_data(vma->vm_private_data); +} + +static struct page *s390host_vma_nopage(struct vm_area_struct *vma, + unsigned long address, int *type) +{ + return NOPAGE_SIGBUS; +} + +static struct vm_operations_struct s390host_vmops = { + .close = s390host_vma_close, + .nopage = s390host_vma_nopage, +}; + +static struct s390host_data *get_s390host_context(void) +{ + struct thread_info *tif; + struct sca_block *sca_block = NULL; + struct s390host_data *data = NULL; + struct task_struct *tsk; + + /* zlh context for current thread already created? */ + tif = current_thread_info(); + if (tif->s390host_data) + return tif->s390host_data; + + /* zlh context in thread group available? 
*/ + write_lock_irq(&tasklist_lock); + tsk = next_thread(current); + for (; tsk != current; tsk = next_thread(tsk)) { + data = tsk->thread_info->s390host_data; + if (data) { + s390host_get_data(data); + tif->s390host_data = data; + break; + } + } + write_unlock_irq(&tasklist_lock); + + if (data) + return data; + + /* create new context */ + data = kzalloc(sizeof(*data), GFP_KERNEL); + + if (!data) + return NULL; + + sca_block = (struct sca_block *)get_zeroed_page(GFP_KERNEL); + + if (!sca_block) { + kfree(data); + return NULL; + } + + data->sca_block = sca_block; + tif->s390host_data = data; + s390host_get_data(data); + + return data; +} + +static unsigned long +s390host_create_io_area(unsigned long addr, unsigned long flags, + unsigned long io_addr, struct s390host_data *data) +{ + struct mm_struct *mm = current->mm; + struct vm_area_struct *vma; + unsigned long ret; + + flags &= MAP_FIXED; + addr = get_unmapped_area(NULL, addr, 2 * PAGE_SIZE, 0, flags); + + if (addr & ~PAGE_MASK) + return addr; + + vma = kmem_cache_zalloc(vm_area_cachep, GFP_KERNEL); + + if (!vma) + return -ENOMEM; + + vma->vm_mm = mm; + vma->vm_start = addr; + vma->vm_end = addr + 2 * PAGE_SIZE; + vma->vm_flags = VM_READ | VM_MAYREAD | VM_IO | VM_RESERVED; + vma->vm_flags |= VM_SHARED | VM_MAYSHARE | VM_DONTCOPY; + +#if 1 /* FIXME: write access until sys_s390host_sie interface is final */ + vma->vm_flags |= VM_WRITE | VM_MAYWRITE; +#endif + + vma->vm_page_prot = protection_map[vma->vm_flags & 0xf]; + vma->vm_private_data = data; + vma->vm_ops = &s390host_vmops; + + down_write(&mm->mmap_sem); + ret = insert_vm_struct(mm, vma); + if (ret) { + kmem_cache_free(vm_area_cachep, vma); + goto out; + } + s390host_get_data(data); + mm->total_vm += 2; + vm_insert_page(vma, addr, virt_to_page(io_addr)); + + ret = split_vma(mm, vma, addr + PAGE_SIZE, 0); + if (ret) + goto out; + s390host_get_data(data); + + vma = find_vma(mm, addr + PAGE_SIZE); + vma->vm_flags |= VM_WRITE | VM_MAYWRITE; + vma->vm_page_prot = 
protection_map[vma->vm_flags & 0xf]; + vm_insert_page(vma, addr + PAGE_SIZE, + virt_to_page(io_addr + PAGE_SIZE)); + ret = addr; +out: + up_write(&mm->mmap_sem); + return ret; +} + +long sys_s390host_add_cpu(unsigned long addr, unsigned long flags, + struct sie_block __user *sie_template) +{ + struct sie_block *sie_block; + struct sie_io *sie_io; + struct sca_block *sca_block; + struct s390host_data *data = NULL; + unsigned long ret; + __u16 cpu; + + if (current_thread_info()->sie_cpu != -1) + return -EINVAL; + + if (copy_from_user(&cpu, &sie_template->icpua, sizeof(u16))) + return -EFAULT; + + if (cpu >= S390HOST_MAX_CPUS) + return -EINVAL; + + mutex_lock(&s390host_init_mutex); + + data = get_s390host_context(); + if (!data) { + ret = -ENOMEM; + goto out_err; + } + + sca_block = data->sca_block; + if (sca_block->mcn & (1UL << (S390HOST_MAX_CPUS - 1 - cpu))) { + ret = -EINVAL; + goto out_err; + } + + if (!data->sie_io[cpu]) { + unsigned long tmp; + + /* allocate two pages: 1st is r/o 2nd r/w area */ + tmp = __get_free_pages(GFP_KERNEL, 1); + if (!tmp) { + ret = -ENOMEM; + goto out_err; + } + split_page(virt_to_page(tmp), 1); + data->sie_io[cpu] = (struct sie_io *)tmp; + } + + sie_io = data->sie_io[cpu]; + memset(sie_io, 0, 2 * PAGE_SIZE); + + sie_block = &sie_io->sie_kernel.sie_block; + sca_block->cpu[cpu].sda = (__u64)sie_block; + + if (copy_from_user(sie_block, sie_template, sizeof(struct sie_block))) { + ret = -EFAULT; + goto out_err; + } + sie_block->icpua = cpu; + + ret = s390host_create_io_area(addr, flags, (unsigned long)sie_io, data); + + if (ret & ~PAGE_MASK) + goto out_err; + + sca_block->mcn |= 1UL << (S390HOST_MAX_CPUS - 1 - cpu); + sie_block->scaoh = (__u32)(((__u64)sca_block) >> 32); + sie_block->scaol = (__u32)(__u64)sca_block; + current_thread_info()->sie_cpu = cpu; + goto out; +out_err: + if (data) + s390host_put_data(data); +out: + mutex_unlock(&s390host_init_mutex); + return ret; +} + +int sys_s390host_remove_cpu(void) +{ + struct sca_block 
*sca_block; + int cpu; + + cpu = current_thread_info()->sie_cpu; + if (cpu == -1) + return -EINVAL; + + mutex_lock(&s390host_init_mutex); + sca_block = current_thread_info()->s390host_data->sca_block; + sca_block->mcn &= ~(1UL << (S390HOST_MAX_CPUS - 1 - cpu)); + current_thread_info()->sie_cpu = -1; + mutex_unlock(&s390host_init_mutex); + return 0; +} + +int sys_s390host_sie(unsigned long action) +{ + struct sie_kernel *sie_kernel; + struct sie_user *sie_user; + struct sie_io *sie_io; + int cpu; + int ret = 0; + + cpu = current_thread_info()->sie_cpu; + if (cpu == -1) + return -EINVAL; + + sie_io = current_thread_info()->s390host_data->sie_io[cpu]; + + if (action) + ret = s390host_do_action(action, sie_io); + if (ret) + goto out_err; + sie_kernel = &sie_io->sie_kernel; + sie_user = &sie_io->sie_user; + + save_fp_regs(&sie_kernel->host_fpregs); + save_access_regs(sie_kernel->host_acrs); + sie_user->guest_fpregs.fpc &= FPC_VALID_MASK; + restore_fp_regs(&sie_user->guest_fpregs); + restore_access_regs(sie_user->guest_acrs); + memcpy(&sie_kernel->sie_block.gg14, &sie_user->gprs[14], 16); +again: + if (need_resched()) + schedule(); + + sie_kernel->sie_block.icptcode = 0; + ret = sie64a(sie_kernel); + if (ret) + goto out; + + if (signal_pending(current)) { + ret = -EINTR; + goto out; + } + + ret = s390host_handle_intercept(sie_kernel); + + /* intercept reason was handled, enter SIE again */ + if (!ret) + goto again; + + /* if the kernel cannot handle the intercept, pass it to the user */ + if (ret == -ENOTSUPP) + ret = 0; + +out: + memcpy(&sie_user->gprs[14], &sie_kernel->sie_block.gg14, 16); + save_fp_regs(&sie_user->guest_fpregs); + save_access_regs(sie_user->guest_acrs); + restore_fp_regs(&sie_kernel->host_fpregs); + restore_access_regs(sie_kernel->host_acrs); +out_err: + return ret; +} + +static void s390host_vsmxm_local_update(struct sie_io *sie_io) +{ + struct sie_kernel *local_sie_kernel; + struct sie_user *sie_user; + atomic_t *cpuflags; + int old, new; +
mutex_lock(&s390host_init_mutex); + + sie_user = &sie_io->sie_user; + local_sie_kernel = &sie_io->sie_kernel; + + cpuflags = &local_sie_kernel->sie_block.cpuflags; + do { + old = atomic_read(cpuflags); + new = old | sie_user->vsmxm_or_local; + new &= sie_user->vsmxm_and_local; + } while (atomic_cmpxchg(cpuflags, old, new) != old); + + mutex_unlock(&s390host_init_mutex); + return; +} + +static int s390host_vsmxm_dist_update(struct sie_io *sie_io) +{ + struct sie_kernel *dist_sie_kernel; + struct sie_user *sie_user; + struct sca_block *sca_block; + struct thread_info *tif; + atomic_t *cpuflags; + int cpu; + int old, new; + int rc = -EINVAL; + + mutex_lock(&s390host_init_mutex); + + sie_user = &sie_io->sie_user; + cpu = sie_user->vsmxm_cpuid; + + if (cpu >= S390HOST_MAX_CPUS) + goto out; + + tif = current_thread_info(); + sca_block = tif->s390host_data->sca_block; + if (!(sca_block->mcn & (1UL << (S390HOST_MAX_CPUS - 1 - cpu)))) + goto out; + + dist_sie_kernel = &((tif->s390host_data->sie_io[cpu])->sie_kernel); + + cpuflags = &dist_sie_kernel->sie_block.cpuflags; + do { + old = atomic_read(cpuflags); + new = old | sie_user->vsmxm_or; + new &= sie_user->vsmxm_and; + } while (atomic_cmpxchg(cpuflags, old, new) != old); + + rc = 0; +out: + mutex_unlock(&s390host_init_mutex); + return rc; +} + +static int s390host_do_action(unsigned long action, struct sie_io *sie_io) +{ + void *src; + void *dest; + int rc = 0; + + if (action & SIE_BLOCK_UPDATE) { + src = &(sie_io->sie_user.sie_block); + dest = &(sie_io->sie_kernel.sie_block); + + memcpy(dest + 4, src + 4, 88); + memcpy(dest + 96, src + 96, 4); + memcpy(dest + 104, src + 104, 408); + } + + if (action & SIE_UPDATE_PSW) + sie_io->sie_kernel.sie_block.psw.gpsw = sie_io->sie_user.psw; + + if (action & SIE_FLUSH_TLB) + flush_tlb_mm(current->mm); + + if (action & SIE_VSMXM_LOCAL_UPDATE) + s390host_vsmxm_local_update(sie_io); + + if (action & SIE_VSMXM_DIST_UPDATE) + rc = s390host_vsmxm_dist_update(sie_io); + return rc; +} 
Index: linux-2.6.21/include/asm-s390/unistd.h =================================================================== --- linux-2.6.21.orig/include/asm-s390/unistd.h +++ linux-2.6.21/include/asm-s390/unistd.h @@ -251,8 +251,11 @@ #define __NR_getcpu 311 #define __NR_epoll_pwait 312 #define __NR_utimes 313 +#define __NR_s390host_add_cpu 314 +#define __NR_s390host_remove_cpu 315 +#define __NR_s390host_sie 316 -#define NR_syscalls 314 +#define NR_syscalls 317 /* * There are some system calls that are not present on 64 bit, some Index: linux-2.6.21/include/asm-s390/sie64.h =================================================================== --- /dev/null +++ linux-2.6.21/include/asm-s390/sie64.h @@ -0,0 +1,279 @@ +/* + * include/asm-s390/sie64.h + * + * Copyright IBM Corp. 2007 + * Author(s): Carsten Otte <cotte-tA70FqPdS9bQT0dZR+AlfA@public.gmane.org>, + * Christian Borntraeger <cborntra-tA70FqPdS9bQT0dZR+AlfA@public.gmane.org>, + * Heiko Carstens <heiko.carstens-tA70FqPdS9bQT0dZR+AlfA@public.gmane.org> + * + */ + +#ifndef _ASM_S390_SIE64_H +#define _ASM_S390_SIE64_H + +#include <asm/atomic.h> +#include <asm/ptrace.h> //FIXME: psw_t definition needs relocation + +struct sie_block { + atomic_t cpuflags; /* 0x0000 */ + __u32 prefix; /* 0x0004 */ + __u32 :32; /* 0x0008 */ + __u32 :32; /* 0x000c */ + __u64 :64; /* 0x0010 */ + __u64 :64; /* 0x0018 */ + __u64 :64; /* 0x0020 */ + __u64 cputm; /* 0x0028 */ + __u64 ckc; /* 0x0030 */ + __u64 epoch; /* 0x0038 */ + __u8 svcnn :1, /* 0x0040 */ + svc1c :1, + svc2c :1, + svc3c :1, + :4; + __u8 svc1n; /* 0x0041 */ + __u8 svc2n; /* 0x0042 */ + __u8 svc3n; /* 0x0043 */ + __u16 lctl0 :1, /* 0x0044 */ + lctl1 :1, + lctl2 :1, + lctl3 :1, + lctl4 :1, + lctl5 :1, + lctl6 :1, + lctl7 :1, + lctl8 :1, + lctl9 :1, + lctla :1, + lctlb :1, + lctlc :1, + lctld :1, + lctle :1, + lctlf :1; + __s16 icpua; /* 0x0046 */ + __u32 icpop :1, /* 0x0048 */ + icpro :1, + icprg :1, + :4, + icipte :1, + :1, /* 0x0049 */ + iclpsw :1, + icptlb :1, + icssm :1, + icbsa 
:1, + icstctl :1, + icstnsm :1, + icstosm :1, + icstck :1, /* 0x004a */ + iciske :1, + icsske :1, + icrrbe :1, + icpc :1, + icpt :1, + ictprot :1, + iclasp :1, + :1, /* 0x004b */ + icstpt :1, + icsckc :1, + :1, + icpr :1, + icbakr :1, + icpg :1, + :1; + __u32 ecext :1, /* 0x004c */ + ecint :1, + ecwait :1, + ecsigp :1, + ecalt :1, + ecio2 :1, + :1, + ecmvp :1; + __u8 eca1; /* 0x004d */ + __u8 eca2; /* 0x004e */ + __u8 eca3; /* 0x004f */ + __u8 icptcode; /* 0x0050 */ + __u8 :6, /* 0x0051 */ + icif :1, + icex :1; + __u16 ihcpu; /* 0x0052 */ + __u16 :16; /* 0x0054 */ + struct { + union { + __u16 ipa; /* 0x0056 */ + __u16 inst; /* 0x0056 */ + struct { + union { + __u8 ipa0; /* 0x0056 */ + __u8 viwho; /* 0x0056 */ + }; + union { + __u8 ipa1; /* 0x0057 */ + __u8 viwhen; /* 0x0057 */ + }; + }; + }; + union { + __u32 ipb; /* 0x0058 */ + struct { + union { + __u16 ipbh0; /* 0x0058 */ + __u16 viwhy; /* 0x0058 */ + struct { + __u8 ipb0; /* 0x0058 */ + __u8 ipb1; /* 0x0059 */ + }; + }; + union { + __u16 ipbh1; /* 0x005a */ + struct { + __u8 ipb2; /* 0x005a */ + __u8 ipb3; /* 0x005b */ + }; + }; + }; + }; + } __attribute__((packed)); + __u32 scaoh; /* 0x005c */ + union { + __u32 rcp; /* 0x0060 */ + struct { + __u8 ska :1, /* 0x0060 */ + skaip :1, + :6; + __u8 ecb :8; /* 0x0061 */ + __u8 :3, /* 0x0062 */ + cpby :1, + :4; + __u8 :8; /* 0x0063 */ + }; + }; + __u32 scaol; /* 0x0064 */ + __u32 :32; /* 0x0068 */ + union { + __u32 todpr; /* 0x006c */ + struct { + __u16 :16; /* 0x006c */ + __u16 todpf; /* 0x006e */ + }; + } __attribute__((packed)); + __u32 gisa; /* 0x0070 */ + __u32 iopct; /* 0x0074 */ + __u32 rsvd3; /* 0x0078 */ + __u32 :32; /* 0x007c */ + __u64 gmsor; /* 0x0080 */ + __u64 gmslm; /* 0x0088 */ + union { + psw_t gpsw; /* 0x0090 */ + struct { + __u64 pswh; /* 0x0090 */ + + __u64 pswl; /* 0x0098 */ + }; + } psw; + __u64 gg14; /* 0x00a0 */ + __u64 gg15; /* 0x00a8 */ + __u64 :64; /* 0x00b0 */ + __u64 :16, /* 0x00b8 */ + xso :24, + xsl :24; + union { + __u8 uzp0[56]; /* 
0x00c0 */ + struct { + __u32 exmsf; /* 0x00c0 */ + union { + __u32 iexcf; /* 0x00c4 */ + struct { + __u16 iexca; /* 0x00c4 */ + __u16 iexcd; /* 0x00c6 */ + }; + }; + __u16 svcil; /* 0x00c8 */ + __u16 svcnt; /* 0x00ca */ + __u16 iprcl; /* 0x00cc */ + __u16 iprcc; /* 0x00ce */ + __u32 itrad; /* 0x00d0 */ + __u32 imncl; /* 0x00d4 */ + __u64 gpera; /* 0x00d8 */ + __u8 excpar; /* 0x00e0 */ + __u8 perar; /* 0x00e1 */ + __u8 oprid; /* 0x00e2 */ + __u8 :8; /* 0x00e3 */ + __u32 :32; /* 0x00e4 */ + __u64 gtrad; /* 0x00e8 */ + __u32 :32; /* 0x00f0 */ + __u32 :32; /* 0x00f4 */ + }; + }; + __u16 :16; /* 0x00f8 */ + __u16 ief; /* 0x00fa */ + __u32 apcbk; /* 0x00fc */ + __u64 gcr[16]; /* 0x0100 */ + __u8 reserved[128]; /* 0x0180 */ +} __attribute__((packed)); + +struct sie_kernel { + struct sie_block sie_block; + s390_fp_regs host_fpregs; + int host_acrs[NUM_ACRS]; +} __attribute__((packed,aligned(4096))); + +#define SIE_UPDATE_PSW (1UL << 0) +#define SIE_FLUSH_TLB (1UL << 1) +#define SIE_ISKE (1UL << 2) +#define SIE_SSKE (1UL << 3) +#define SIE_BLOCK_UPDATE (1UL << 4) +#define SIE_VSMXM_LOCAL_UPDATE (1UL << 5) +#define SIE_VSMXM_DIST_UPDATE (1UL << 6) + +struct sie_skey_parm { + unsigned long sk_reg; + unsigned long sk_addr; +}; + +struct sie_user { + struct sie_block sie_block; + psw_t psw; + unsigned long gprs[NUM_GPRS]; + s390_fp_regs guest_fpregs; + int guest_acrs[NUM_ACRS]; + struct sie_skey_parm iske_parm; + struct sie_skey_parm sske_parm; + int vsmxm_or_local; + int vsmxm_and_local; + int vsmxm_or; + int vsmxm_and; + int vsmxm_cpuid; +} __attribute__((packed,aligned(4096))); + +struct sie_io { + struct sie_kernel sie_kernel; + struct sie_user sie_user; +}; + +struct sca_entry { + atomic_t scn; + __u64 reserved; + __u64 sda; + __u64 reserved2[2]; +}__attribute__((packed)); + +struct sca_block { + __u64 ipte_control; + __u64 reserved[5]; + __u64 mcn; + __u64 reserved2; + struct sca_entry cpu[64]; +}__attribute__((packed)); + +#define S390HOST_MAX_CPUS 64 + +struct 
s390host_data { + atomic_t count; + struct sie_io *sie_io[S390HOST_MAX_CPUS]; + struct sca_block *sca_block; +}; + +/* function definitions */ +extern int sie64a(struct sie_kernel *); +extern void s390host_put_data(struct s390host_data *); + +#endif /* _ASM_S390_SIE64_H */ Index: linux-2.6.21/arch/s390/Makefile =================================================================== --- linux-2.6.21.orig/arch/s390/Makefile +++ linux-2.6.21/arch/s390/Makefile @@ -85,7 +85,7 @@ LDFLAGS_vmlinux := -e start head-y := arch/s390/kernel/head.o arch/s390/kernel/init_task.o core-y += arch/s390/mm/ arch/s390/kernel/ arch/s390/crypto/ \ - arch/s390/appldata/ arch/s390/hypfs/ + arch/s390/appldata/ arch/s390/hypfs/ arch/s390/host/ libs-y += arch/s390/lib/ drivers-y += drivers/s390/ drivers-$(CONFIG_MATHEMU) += arch/s390/math-emu/ Index: linux-2.6.21/kernel/sys_ni.c =================================================================== --- linux-2.6.21.orig/kernel/sys_ni.c +++ linux-2.6.21/kernel/sys_ni.c @@ -122,6 +122,9 @@ cond_syscall(sys32_sysctl); cond_syscall(ppc_rtas); cond_syscall(sys_spu_run); cond_syscall(sys_spu_create); +cond_syscall(sys_s390host_add_cpu); +cond_syscall(sys_s390host_remove_cpu); +cond_syscall(sys_s390host_sie); /* mmu depending weak syscall entries */ cond_syscall(sys_mprotect); Index: linux-2.6.21/arch/s390/kernel/process.c =================================================================== --- linux-2.6.21.orig/arch/s390/kernel/process.c +++ linux-2.6.21/arch/s390/kernel/process.c @@ -274,12 +274,23 @@ int copy_thread(int nr, unsigned long cl #endif /* CONFIG_64BIT */ /* start new process with ar4 pointing to the correct address space */ p->thread.mm_segment = get_fs(); - /* Don't copy debug registers */ - memset(&p->thread.per_info,0,sizeof(p->thread.per_info)); + /* Don't copy debug registers */ + memset(&p->thread.per_info,0,sizeof(p->thread.per_info)); + p->thread_info->s390host_data = NULL; + p->thread_info->sie_cpu = -1; return 0; } +void 
free_thread_info(struct thread_info *ti) +{ +#ifdef CONFIG_S390_HOST + if (ti->s390host_data) + s390host_put_data(ti->s390host_data); +#endif + free_pages((unsigned long) (ti),THREAD_ORDER); +} + asmlinkage long sys_fork(struct pt_regs regs) { return do_fork(SIGCHLD, regs.gprs[15], ®s, 0, NULL, NULL); Index: linux-2.6.21/include/asm-s390/thread_info.h =================================================================== --- linux-2.6.21.orig/include/asm-s390/thread_info.h +++ linux-2.6.21/include/asm-s390/thread_info.h @@ -38,6 +38,7 @@ #ifndef __ASSEMBLY__ #include <asm/processor.h> #include <asm/lowcore.h> +#include <asm/sie64.h> /* * low level task data that entry.S needs immediate access to @@ -52,6 +53,8 @@ struct thread_info { unsigned int cpu; /* current CPU */ int preempt_count; /* 0 => preemptable, <0 => BUG */ struct restart_block restart_block; + struct s390host_data *s390host_data; /* s390host data */ + int sie_cpu; /* sie cpu number */ }; /* @@ -67,6 +70,8 @@ struct thread_info { .restart_block = { \ .fn = do_no_restart_syscall, \ }, \ + .s390host_data = NULL, \ + .sie_cpu = 0, \ } #define init_thread_info (init_thread_union.thread_info) @@ -81,7 +86,8 @@ static inline struct thread_info *curren /* thread information allocation */ #define alloc_thread_info(tsk) ((struct thread_info *) \ __get_free_pages(GFP_KERNEL,THREAD_ORDER)) -#define free_thread_info(ti) free_pages((unsigned long) (ti),THREAD_ORDER) + +extern void free_thread_info(struct thread_info *); #endif Index: linux-2.6.21/arch/s390/Kconfig =================================================================== --- linux-2.6.21.orig/arch/s390/Kconfig +++ linux-2.6.21/arch/s390/Kconfig @@ -153,6 +153,7 @@ config S390_SWITCH_AMODE config S390_EXEC_PROTECT bool "Data execute protection" + depends on !S390_HOST select S390_SWITCH_AMODE help This option allows to enable a buffer overflow protection for user @@ -514,6 +515,12 @@ config KEXEC current kernel, and to start another kernel. 
It is like a reboot but is independent of hardware/microcode support. +config S390_HOST + bool "s390 host support (EXPERIMENTAL)" + depends on 64BIT && EXPERIMENTAL + select S390_SWITCH_AMODE + help + Select this option if you want to host guest Linux images endmenu source "net/Kconfig" Index: linux-2.6.21/arch/s390/kernel/setup.c =================================================================== --- linux-2.6.21.orig/arch/s390/kernel/setup.c +++ linux-2.6.21/arch/s390/kernel/setup.c @@ -394,7 +394,11 @@ static int __init early_parse_ipldelay(c early_param("ipldelay", early_parse_ipldelay); #ifdef CONFIG_S390_SWITCH_AMODE +#ifdef CONFIG_S390_HOST +unsigned int switch_amode = 1; +#else unsigned int switch_amode = 0; +#endif EXPORT_SYMBOL_GPL(switch_amode); static void set_amode_and_uaccess(unsigned long user_amode, Index: linux-2.6.21/arch/s390/host/s390_intercept.c =================================================================== --- /dev/null +++ linux-2.6.21/arch/s390/host/s390_intercept.c @@ -0,0 +1,42 @@ +/* + * s390_intercept.c -- handle SIE intercept codes + * + * Copyright IBM Corp. 
2007 + * Author(s): Carsten Otte <cotte-tA70FqPdS9bQT0dZR+AlfA@public.gmane.org> + */ + +#include <linux/kernel.h> +#include <linux/errno.h> +#include <asm/sie64.h> +#include <linux/pagemap.h> +#include "s390host.h" + +static int s390host_handle_validity (struct sie_kernel *sie_kernel) +{ + if (sie_kernel->sie_block.viwhy == 0x37) { + //debug message here + fault_in_pages_writeable((void*)0 + S390HOST_ORIGIN, + PAGE_SIZE); + fault_in_pages_writeable((void*)(unsigned long) + sie_kernel->sie_block.prefix+ + S390HOST_ORIGIN, 2*PAGE_SIZE); + return 0; + } + // debug message here + return -ENOTSUPP; +} + +int s390host_handle_intercept(struct sie_kernel *sie_kernel) +{ + switch (sie_kernel->sie_block.icptcode) { + case 0x00: + case 0x24: + return 0; + case 0x20: + return s390host_handle_validity(sie_kernel); + default: + // debug message here + return -ENOTSUPP; + } +} + Index: linux-2.6.21/arch/s390/host/s390host.h =================================================================== --- /dev/null +++ linux-2.6.21/arch/s390/host/s390host.h @@ -0,0 +1,16 @@ +/* + * s390host.h -- hosting zSeries Linux virtual engines + * + * Copyright IBM Corp. 2007 + * Author(s): Carsten Otte <cotte-tA70FqPdS9bQT0dZR+AlfA@public.gmane.org>, + * Christian Borntraeger <cborntra-tA70FqPdS9bQT0dZR+AlfA@public.gmane.org>, + * Heiko Carstens <heiko.carstens-tA70FqPdS9bQT0dZR+AlfA@public.gmane.org> + */ + +#ifndef __S390HOST_H +#define __S390HOST_H +#include <asm/sie64.h> +#define S390HOST_ORIGIN 0 + +int s390host_handle_intercept(struct sie_kernel *sie_kernel); +#endif // defined __S390HOST_H
* [PATCH/RFC 3/9] s390 guest detection
@ 2007-05-11 17:35 ` Carsten Otte
From: Carsten Otte @ 2007-05-11 17:35 UTC (permalink / raw)
To: kvm-devel-5NWGOfrQmneRv+LV9MX5uipxlwaOVQ5f@public.gmane.org
Cc: Christian Borntraeger, Martin Schwidefsky

From: Christian Borntraeger <cborntra-tA70FqPdS9bQT0dZR+AlfA@public.gmane.org>

This patch adds functionality to detect whether the kernel runs under an s390host hypervisor. A macro MACHINE_IS_GUEST is exported for device drivers. This allows drivers to skip device detection if the system runs non-virtualized.

Signed-off-by: Christian Borntraeger <cborntra-tA70FqPdS9bQT0dZR+AlfA@public.gmane.org>
Signed-off-by: Carsten Otte <cotte-tA70FqPdS9bQT0dZR+AlfA@public.gmane.org>
---
 arch/s390/kernel/early.c |    4 ++++
 arch/s390/kernel/setup.c |    9 ++++++---
 include/asm-s390/setup.h |    1 +
 3 files changed, 11 insertions(+), 3 deletions(-)

Index: linux-2.6.21/arch/s390/kernel/setup.c =================================================================== --- linux-2.6.21.orig/arch/s390/kernel/setup.c +++ linux-2.6.21/arch/s390/kernel/setup.c @@ -744,9 +744,12 @@ setup_arch(char **cmdline_p) "This machine has an IEEE fpu\n" : "This machine has no IEEE fpu\n"); #else /* CONFIG_64BIT */ - printk((MACHINE_IS_VM) ?
- "We are running under VM (64 bit mode)\n" : - "We are running native (64 bit mode)\n"); + if (MACHINE_IS_VM) + printk("We are running under VM (64 bit mode)\n"); + else if (MACHINE_IS_GUEST) + printk("We are running on a non z/VM host\n"); + else + printk("We are running native (64 bit mode)\n"); #endif /* CONFIG_64BIT */ /* Save unparsed command line copy for /proc/cmdline */ Index: linux-2.6.21/include/asm-s390/setup.h =================================================================== --- linux-2.6.21.orig/include/asm-s390/setup.h +++ linux-2.6.21/include/asm-s390/setup.h @@ -61,6 +61,7 @@ extern unsigned long machine_flags; #define MACHINE_IS_VM (machine_flags & 1) #define MACHINE_IS_P390 (machine_flags & 4) #define MACHINE_HAS_MVPG (machine_flags & 16) +#define MACHINE_IS_GUEST (machine_flags & 64) #define MACHINE_HAS_IDTE (machine_flags & 128) #define MACHINE_HAS_DIAG9C (machine_flags & 256) Index: linux-2.6.21/arch/s390/kernel/early.c =================================================================== --- linux-2.6.21.orig/arch/s390/kernel/early.c +++ linux-2.6.21/arch/s390/kernel/early.c @@ -139,6 +139,10 @@ static noinline __init void detect_machi /* Running on a P/390 ? */ if (cpuinfo->cpu_id.machine == 0x7490) machine_flags |= 4; + + /* Running under a host ? */ + if (cpuinfo->cpu_id.version == 0xfe) + machine_flags |= 64; } #ifdef CONFIG_64BIT
* [PATCH/RFC 4/9] Basic guest virtual devices infrastructure
@ 2007-05-11 17:35 ` Carsten Otte
From: Carsten Otte @ 2007-05-11 17:35 UTC (permalink / raw)
To: kvm-devel-5NWGOfrQmneRv+LV9MX5uipxlwaOVQ5f@public.gmane.org
Cc: Christian Borntraeger, Martin Schwidefsky

From: Carsten Otte <cotte-tA70FqPdS9bQT0dZR+AlfA@public.gmane.org>

This patch adds support for a new bus type that manages paravirtualized devices. The bus uses the s390 diagnose instruction to query devices and match them with the corresponding drivers. Future enhancements should include hotplug and hot removal of virtual devices triggered by the host, and suspend/resume of virtual devices for migration. This code is s390 architecture specific, please review.
Signed-off-by: Carsten Otte <cotte-tA70FqPdS9bQT0dZR+AlfA@public.gmane.org> --- arch/s390/Kconfig | 6 + drivers/s390/Makefile | 2 drivers/s390/guest/Makefile | 6 + drivers/s390/guest/vdev.c | 158 +++++++++++++++++++++++++++++++++++++++ drivers/s390/guest/vdev_device.c | 50 ++++++++++++ include/asm-s390/vdev.h | 53 +++++++++++++ 6 files changed, 274 insertions(+), 1 deletion(-) Index: linux-2.6.21/arch/s390/Kconfig =================================================================== --- linux-2.6.21.orig/arch/s390/Kconfig +++ linux-2.6.21/arch/s390/Kconfig @@ -521,6 +521,12 @@ config S390_HOST select S390_SWITCH_AMODE help Select this option if you want to host guest Linux images + +config S390_GUEST + bool "s390 guest support (EXPERIMENTAL)" + depends on 64BIT && EXPERIMENTAL + help + Select this option if you want to run the kernel under s390 linux endmenu source "net/Kconfig" Index: linux-2.6.21/drivers/s390/Makefile =================================================================== --- linux-2.6.21.orig/drivers/s390/Makefile +++ linux-2.6.21/drivers/s390/Makefile @@ -5,7 +5,7 @@ CFLAGS_sysinfo.o += -Iinclude/math-emu -Iarch/s390/math-emu -w obj-y += s390mach.o sysinfo.o s390_rdev.o -obj-y += cio/ block/ char/ crypto/ net/ scsi/ +obj-y += cio/ block/ char/ crypto/ net/ scsi/ guest/ drivers-y += drivers/s390/built-in.o Index: linux-2.6.21/drivers/s390/guest/Makefile =================================================================== --- /dev/null +++ linux-2.6.21/drivers/s390/guest/Makefile @@ -0,0 +1,6 @@ +# +# s390 Linux virtual environment +# + +obj-$(CONFIG_S390_GUEST) += vdev.o vdev_device.o + Index: linux-2.6.21/drivers/s390/guest/vdev.c =================================================================== --- /dev/null +++ linux-2.6.21/drivers/s390/guest/vdev.c @@ -0,0 +1,158 @@ +/* + * vdev - guest os layer for device virtualization + * + * Copyright IBM Corp. 
2007 + * Author: Carsten Otte <cotte-tA70FqPdS9bQT0dZR+AlfA@public.gmane.org> + * + */ + +#include <asm/vdev.h> + +static void vdev_bus_release(struct device *); + +struct bus_type vdev_bus_type = { + .name = "vdev", + .match = vdev_match, + .probe = vdev_probe, +}; + +struct device vdev_bus = { + .bus_id = "vdev0", + .release = vdev_bus_release +}; + +int vdev_match(struct device * dev, struct device_driver *drv) +{ + struct vdev *vdev = to_vdev(dev); + struct vdev_driver *vdrv = to_vdrv(drv); + + if (vdev->vdev_type == vdrv->vdev_type) + return 1; + + return 0; +} + +int vdev_probe(struct device * dev) +{ + struct vdev *vdev = to_vdev(dev); + struct vdev_driver *vdrv = to_vdrv(dev->driver); + + return vdrv->probe(vdev); +} + +static void vdev_bus_release (struct device *device) +{ + /* noop, static bus object */ +} + +static inline int vdev_diag_hotplug(char symname[128], char hostid[128]) +{ + register char * __arg1 asm("2") = symname; + register char * __arg2 asm("3") = hostid; + register int __svcres asm("2"); + int __res; + + __asm__ __volatile__ ( + "diag 0,0,0x1e" + : "=d" (__svcres) + : "0" (__arg1), + "d" (__arg2) + : "cc", "memory"); + __res = __svcres; + return __res; +} + + +static int vdev_scan_coldplug(void) +{ + int rc; + struct vdev *device; + + do { + device = kzalloc(sizeof(struct vdev), GFP_ATOMIC); + if (!device) { + rc = -ENOMEM; + goto out; + } + rc = vdev_diag_hotplug(device->symname, device->hostid); + if (rc == -ENODEV) + break; + if (rc < 0) { + printk (KERN_WARNING "vdev: error %d detecting" \ + " initial devices\n", rc); + break; + } + device->vdev_type = rc; + + //sanity: are strings terminated? 
+ if ((strnlen(device->symname, 128) == 128) || + (strnlen(device->hostid, 128) == 128)) { + // warn and discard device + printk ("vdev: illegal device entry received\n"); + break; + } + + rc = vdevice_register(device); + if (rc) { + kfree(device); + } else + switch (device->vdev_type) { + case VDEV_TYPE_DISK: + printk (KERN_INFO "vdev: storage device " \ + "detected: %s\n", device->symname); + break; + case VDEV_TYPE_NET: + printk (KERN_INFO "vdev: network device " \ + "detected: %s\n", device->symname); + break; + default: + printk (KERN_INFO "vdev: unknown device " \ + "detected: %s\n", device->symname); + } + } while(1); + kfree (device); + out: + return 0; +} + + +int __init vdev_init(void) +{ + int rc; + + rc = bus_register(&vdev_bus_type); + if (rc) { + printk (KERN_WARNING "vdev: failed to register bus type\n"); + goto out; + } + rc = device_register(&vdev_bus); + if (rc) { + printk (KERN_WARNING "vdev: failed to register bus device\n"); + goto bunregister; + } + printk (KERN_INFO "vdev: initialization complete\n"); + rc = vdev_scan_coldplug(); + if (rc) { + printk (KERN_WARNING "vdev: failed to scan devices\n"); + goto dunregister; + } + goto out; + dunregister: + device_unregister(&vdev_bus); + + bunregister: + bus_unregister(&vdev_bus_type); + out: + return rc; +} + +void vdev_exit(void) +{ + bus_unregister(&vdev_bus_type); +} + +module_init(vdev_init); +module_exit(vdev_exit); +MODULE_DESCRIPTION("Guest layer for device virtualization"); +MODULE_AUTHOR("Copyright IBM Corp. 2007"); +MODULE_LICENSE("GPL"); Index: linux-2.6.21/drivers/s390/guest/vdev_device.c =================================================================== --- /dev/null +++ linux-2.6.21/drivers/s390/guest/vdev_device.c @@ -0,0 +1,50 @@ +/* + * vdev - guest layer for device virtualization + * + * Copyright IBM Corp. 
2007 + * Author: Carsten Otte <cotte-tA70FqPdS9bQT0dZR+AlfA@public.gmane.org> + * + */ + +#include <asm/vdev.h> + +int vdev_driver_register (struct vdev_driver *vdrv) +{ + struct device_driver *drv = &vdrv->driver; + + drv->bus = &vdev_bus_type; + drv->name = vdrv->name; + + return driver_register(drv); +} + +int vdevice_register(struct vdev *vdev) +{ + struct device *dev = &vdev->dev; + int ret,typesize; + + dev->bus = &vdev_bus_type; + dev->parent = &vdev_bus; + memset(dev->bus_id, 0, BUS_ID_SIZE); + switch (vdev->vdev_type) { + case VDEV_TYPE_DISK: + strncpy (dev->bus_id, "block:", 6); + typesize=6; + break; + case VDEV_TYPE_NET: + strncpy (dev->bus_id, "net:", 4); + typesize=4; + break; + default: + strncpy (dev->bus_id, "unknown:", 8); + typesize=8; + break; + } + strncpy (dev->bus_id+typesize, vdev->symname, BUS_ID_SIZE-typesize-1); + + ret = device_register(dev); + + //FIXME: add device attribs here + + return ret; +} Index: linux-2.6.21/include/asm-s390/vdev.h =================================================================== --- /dev/null +++ linux-2.6.21/include/asm-s390/vdev.h @@ -0,0 +1,53 @@ +/* + * vdev - guest layer for device virtualization + * + * Copyright IBM Corp. 
2007 + * Author: Carsten Otte <cotte-tA70FqPdS9bQT0dZR+AlfA@public.gmane.org> + * + */ + +#ifndef __VDEV_H +#define __VDEV_H +#include <linux/device.h> + +/* in vdev.c */ +extern int vdev_match(struct device *, struct device_driver *); +extern int vdev_probe (struct device *); + +extern struct device vdev_bus; +extern struct bus_type vdev_bus_type; + +#define VDEV_TYPE_DISK 0 +#define VDEV_TYPE_NET 1 + +struct vdev { + unsigned int vdev_type; + char symname[128]; + char hostid[128]; + struct vdev_driver *driver; + struct device dev; + void *drv_private; +}; + +struct vdev_driver { + struct module *owner; + int vdev_type; + int (*probe) (struct vdev *); + int (*set_online) (struct vdev *); + int (*set_offline) (struct vdev *); + int (*suspend) (struct vdev *); + int (*resume) (struct vdev *); + struct device_driver driver; /* higher level structure, don't init + this from your driver */ + char *name; + void *drv_private; +}; + +#define to_vdev(n) container_of(n, struct vdev, dev) +#define to_vdrv(n) container_of(n, struct vdev_driver, driver) + + +/* in vdevice.c */ +extern int vdevice_register(struct vdev *); +extern int vdev_driver_register(struct vdev_driver *); +#endif ------------------------------------------------------------------------- This SF.net email is sponsored by DB2 Express Download DB2 Express C - the FREE version of DB2 express and take control of your XML. No limits. Just data. Click to get it now. http://sourceforge.net/powerbar/db2/ ^ permalink raw reply [flat|nested] 104+ messages in thread

* Re: [PATCH/RFC 4/9] Basic guest virtual devices infrastructure [not found] ` <1178904958.25135.31.camel-WIxn4w2hgUz3YA32ykw5MLlKpX0K8NHHQQ4Iyu8u01E@public.gmane.org> @ 2007-05-11 20:06 ` Arnd Bergmann 2007-05-14 11:26 ` Avi Kivity 1 sibling, 0 replies; 104+ messages in thread From: Arnd Bergmann @ 2007-05-11 20:06 UTC (permalink / raw) To: kvm-devel-5NWGOfrQmneRv+LV9MX5uipxlwaOVQ5f Cc: Christian Borntraeger, Martin Schwidefsky On Friday 11 May 2007, Carsten Otte wrote: > This patch adds support for a new bus type that manages paravirtualized > devices. The bus uses the s390 diagnose instruction to query devices, and > match them with the corresponding drivers. It seems that the diagnose instruction is really the only s390 specific thing in here, right? I guess this part of your series is the first one that we should have in an architecture independent way. There may also be the chance of merging this with existing virtual buses like the one for the ps3, which also just exists using hypercalls. > +int vdev_match(struct device * dev, struct device_driver *drv) > +{ > + struct vdev *vdev = to_vdev(dev); > + struct vdev_driver *vdrv = to_vdrv(drv); > + > + if (vdev->vdev_type == vdrv->vdev_type) > + return 1; > + > + return 0; > +} Why invent device type numbers? On open firmware, we just do a string compare, which more intuitive, and means you don't need any further > +int vdev_probe(struct device * dev) > +{ > + struct vdev *vdev = to_vdev(dev); > + struct vdev_driver *vdrv = to_vdrv(dev->driver); > + > + return vdrv->probe(vdev); > +} This abstraction is unnecessary, just do the do_vdev() conversion inside of the individual drivers. > + > +struct device vdev_bus = { > + .bus_id = "vdev0", > + .release = vdev_bus_release > +}; > > +static void vdev_bus_release (struct device *device) > +{ > + /* noop, static bus object */ > +} Just make the root of your devices a platform_device, then you don't need to do dirty tricks like this. 
> +static int vdev_scan_coldplug(void) > +{ > + int rc; > + struct vdev *device; > + > + do { > + device = kzalloc(sizeof(struct vdev), GFP_ATOMIC); > + if (!device) { > + rc = -ENOMEM; > + goto out; > + } > + rc = vdev_diag_hotplug(device->symname, device->hostid); > + if (rc == -ENODEV) > + break; > + if (rc < 0) { > + printk (KERN_WARNING "vdev: error %d detecting" \ > + " initial devices\n", rc); > + break; > + } > + device->vdev_type = rc; > + > + //sanity: are strings terminated? > + if ((strnlen(device->symname, 128) == 128) || > + (strnlen(device->hostid, 128) == 128)) { > + // warn and discard device > + printk ("vdev: illegal device entry received\n"); > + break; > + } > + > + rc = vdevice_register(device); > + if (rc) { > + kfree(device); > + } else > + switch (device->vdev_type) { > + case VDEV_TYPE_DISK: > + printk (KERN_INFO "vdev: storage device " \ > + "detected: %s\n", device->symname); > + break; > + case VDEV_TYPE_NET: > + printk (KERN_INFO "vdev: network device " \ > + "detected: %s\n", device->symname); > + break; > + default: > + printk (KERN_INFO "vdev: unknown device " \ > + "detected: %s\n", device->symname); > + } > + } while(1); > + kfree (device); > + out: > + return 0; > +} Interesting concept of probing the bus -- so you just ask if there are any new devices, right? > +#define VDEV_TYPE_DISK 0 > +#define VDEV_TYPE_NET 1 > + > +struct vdev { > + unsigned int vdev_type; > + char symname[128]; > + char hostid[128]; > + struct vdev_driver *driver; > + struct device dev; > + void *drv_private; > +}; You shouldn't need the driver and drv_private fields -- they are already present in struct device. Arnd <>< ------------------------------------------------------------------------- This SF.net email is sponsored by DB2 Express Download DB2 Express C - the FREE version of DB2 express and take control of your XML. No limits. Just data. Click to get it now. 
http://sourceforge.net/powerbar/db2/ ^ permalink raw reply [flat|nested] 104+ messages in thread
* Re: [PATCH/RFC 4/9] Basic guest virtual devices infrastructure [not found] ` <1178904958.25135.31.camel-WIxn4w2hgUz3YA32ykw5MLlKpX0K8NHHQQ4Iyu8u01E@public.gmane.org> 2007-05-11 20:06 ` Arnd Bergmann @ 2007-05-14 11:26 ` Avi Kivity [not found] ` <46484753.30602-atKUWr5tajBWk0Htik3J/w@public.gmane.org> 1 sibling, 1 reply; 104+ messages in thread From: Avi Kivity @ 2007-05-14 11:26 UTC (permalink / raw) To: Carsten Otte Cc: kvm-devel-5NWGOfrQmneRv+LV9MX5uipxlwaOVQ5f@public.gmane.org, Christian Borntraeger, Martin Schwidefsky Carsten Otte wrote: > From: Carsten Otte <cotte-tA70FqPdS9bQT0dZR+AlfA@public.gmane.org> > > This patch adds support for a new bus type that manages paravirtualized > devices. The bus uses the s390 diagnose instruction to query devices, and > match them with the corresponding drivers. > Future enhancements should include hotplug and hotremoval of virtual devices > triggered by the host, and supend/resume of virtual devices for migration. > > Interesting. We could use a variation this for x86 as well, but I'm not sure how easy it is to integrate it into closed source OSes (Windows). The diag instruction could be replaced by a hypercall which would make the code generic. -- error compiling committee.c: too many arguments to function ------------------------------------------------------------------------- This SF.net email is sponsored by DB2 Express Download DB2 Express C - the FREE version of DB2 express and take control of your XML. No limits. Just data. Click to get it now. http://sourceforge.net/powerbar/db2/ ^ permalink raw reply [flat|nested] 104+ messages in thread
* Re: [PATCH/RFC 4/9] Basic guest virtual devices infrastructure [not found] ` <46484753.30602-atKUWr5tajBWk0Htik3J/w@public.gmane.org> @ 2007-05-14 11:43 ` Carsten Otte [not found] ` <46484B5D.6080605-tA70FqPdS9bQT0dZR+AlfA@public.gmane.org> 0 siblings, 1 reply; 104+ messages in thread From: Carsten Otte @ 2007-05-14 11:43 UTC (permalink / raw) To: Avi Kivity Cc: kvm-devel-5NWGOfrQmneRv+LV9MX5uipxlwaOVQ5f@public.gmane.org, Christian Borntraeger, mschwid2-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8 Avi Kivity wrote: > Interesting. We could use a variation this for x86 as well, but I'm not > sure how easy it is to integrate it into closed source OSes (Windows). > The diag instruction could be replaced by a hypercall which would make > the code generic. I think we need to freeze the hypercall API at some time, and consider it a stable kernel external API. We do then need to document these calls, and non-GPL hypervisors can implement it. We could eventually have a similar situation with one of the other non-GPL hypervisors on s390 that run Linux. so long, Carsten ------------------------------------------------------------------------- This SF.net email is sponsored by DB2 Express Download DB2 Express C - the FREE version of DB2 express and take control of your XML. No limits. Just data. Click to get it now. http://sourceforge.net/powerbar/db2/ ^ permalink raw reply [flat|nested] 104+ messages in thread
* Re: [PATCH/RFC 4/9] Basic guest virtualdevices infrastructure [not found] ` <46484B5D.6080605-tA70FqPdS9bQT0dZR+AlfA@public.gmane.org> @ 2007-05-14 12:00 ` Dor Laor [not found] ` <64F9B87B6B770947A9F8391472E032160BC7483D-yEcIvxbTEBqsx+V+t5oei8rau4O3wl8o3fe8/T/H7NteoWH0uzbU5w@public.gmane.org> 0 siblings, 1 reply; 104+ messages in thread From: Dor Laor @ 2007-05-14 12:00 UTC (permalink / raw) To: carsteno-tA70FqPdS9bQT0dZR+AlfA, Avi Kivity Cc: kvm-devel-5NWGOfrQmneRv+LV9MX5uipxlwaOVQ5f, Christian Borntraeger, mschwid2-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8 >Avi Kivity wrote: >> Interesting. We could use a variation this for x86 as well, but I'm not >> sure how easy it is to integrate it into closed source OSes (Windows). >> The diag instruction could be replaced by a hypercall which would make >> the code generic. >I think we need to freeze the hypercall API at some time, and consider >it a stable kernel external API. We do then need to document these >calls, and non-GPL hypervisors can implement it. We could eventually >have a similar situation with one of the other non-GPL hypervisors on >s390 that run Linux. I think Avi meant using a virtual bus as an option for HVMs too (windows especially). Currently we're using the cpi bus. Using a new virtualized bus might be a good idea, it's easy & clean for open source. The question is it make life easier for HVMs. For instance, on windows we'll need Pnp support for these devices. ------------------------------------------------------------------------- This SF.net email is sponsored by DB2 Express Download DB2 Express C - the FREE version of DB2 express and take control of your XML. No limits. Just data. Click to get it now. http://sourceforge.net/powerbar/db2/ ^ permalink raw reply [flat|nested] 104+ messages in thread
* Re: [PATCH/RFC 4/9] Basic guest virtualdevices infrastructure [not found] ` <64F9B87B6B770947A9F8391472E032160BC7483D-yEcIvxbTEBqsx+V+t5oei8rau4O3wl8o3fe8/T/H7NteoWH0uzbU5w@public.gmane.org> @ 2007-05-14 13:32 ` Carsten Otte 0 siblings, 0 replies; 104+ messages in thread From: Carsten Otte @ 2007-05-14 13:32 UTC (permalink / raw) To: Dor Laor Cc: carsteno-tA70FqPdS9bQT0dZR+AlfA, kvm-devel-5NWGOfrQmneRv+LV9MX5uipxlwaOVQ5f, Avi Kivity, Christian Borntraeger, mschwid2-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8 Dor Laor wrote: > I think Avi meant using a virtual bus as an option for HVMs too (windows > especially). Currently we're using the cpi bus. Using a new virtualized > bus might be a good idea, it's easy & clean for open source. The > question is it make life easier for HVMs. For instance, on windows we'll > need Pnp support for these devices. Oh that way around. Thanks for clarification. As far as I see, a stable hypercall API would also be good for maintaining non-GPL HVMs. Probably we should forge the API with respect to other HVMs needs then. so long, Carsten ------------------------------------------------------------------------- This SF.net email is sponsored by DB2 Express Download DB2 Express C - the FREE version of DB2 express and take control of your XML. No limits. Just data. Click to get it now. http://sourceforge.net/powerbar/db2/ ^ permalink raw reply [flat|nested] 104+ messages in thread
* [PATCH/RFC 5/9] s390 virtual console for guests [not found] ` <1178903957.25135.13.camel-WIxn4w2hgUz3YA32ykw5MLlKpX0K8NHHQQ4Iyu8u01E@public.gmane.org> ` (2 preceding siblings ...) 2007-05-11 17:35 ` [PATCH/RFC 4/9] Basic guest virtual devices infrastructure Carsten Otte @ 2007-05-11 17:36 ` Carsten Otte [not found] ` <1178904960.25135.32.camel-WIxn4w2hgUz3YA32ykw5MLlKpX0K8NHHQQ4Iyu8u01E@public.gmane.org> 2007-05-11 17:36 ` [PATCH/RFC 6/9] virtual block device driver Carsten Otte ` (3 subsequent siblings) 7 siblings, 1 reply; 104+ messages in thread From: Carsten Otte @ 2007-05-11 17:36 UTC (permalink / raw) To: kvm-devel-5NWGOfrQmneRv+LV9MX5uipxlwaOVQ5f@public.gmane.org Cc: Christian Borntraeger, Martin Schwidefsky From: Carsten Otte <cotte-tA70FqPdS9bQT0dZR+AlfA@public.gmane.org> This driver provides a simple virtualized console. Userspace can use read/write to its console to pass the data to the host. Signed-off-by: Carsten Otte <cotte-tA70FqPdS9bQT0dZR+AlfA@public.gmane.org> --- drivers/s390/Kconfig | 5 + drivers/s390/guest/Makefile | 1 drivers/s390/guest/guest_console.c | 72 +++++++++++++++++ drivers/s390/guest/guest_console.h | 47 +++++++++++ drivers/s390/guest/guest_tty.c | 153 +++++++++++++++++++++++++++++++++++++ 5 files changed, 278 insertions(+) Index: linux-2.6.21/drivers/s390/guest/guest_console.c =================================================================== --- /dev/null +++ linux-2.6.21/drivers/s390/guest/guest_console.c @@ -0,0 +1,72 @@ +/* + * guest console device driver + * Copyright IBM Corp. 
2007 + * Author: Carsten Otte <cotte-tA70FqPdS9bQT0dZR+AlfA@public.gmane.org> + */ + +#include "linux/kernel.h" +#include "linux/types.h" +#include "linux/console.h" +#include "linux/string.h" +#include "linux/init.h" +#include "linux/errno.h" +#include "guest_console.h" + +#define guest_console_major 4 /* TTYAUX_MAJOR */ +#define guest_console_minor 65 +#define guest_console_name "ttyS" + +static void guest_console_write(struct console *console, const char *string, + unsigned len) +{ + int ret; + size_t pos; + + for(pos=0; pos < strlen(string); pos += ret) { + ret = diag_write(1, string + pos, len - pos); + if (ret <= 0) + break; + } +} + +static struct tty_driver * +guest_console_device(struct console *c, int *index) +{ + *index = c->index; + return guest_tty_driver; +} + +static void +guest_console_unblank(void) +{ + return; +} + +static struct console guest_console = +{ + .name = guest_console_name, + .write = guest_console_write, + .device = guest_console_device, + .unblank = guest_console_unblank, + .flags = CON_PRINTBUFFER, + .index = 0 /* ttyS0 */ +}; + +/* + * called by console_init() in drivers/char/tty_io.c at boot-time. + */ +static int __init +guest_console_init(void) +{ + if (!MACHINE_IS_GUEST) + return 0; + + printk (KERN_INFO "z/Live console initialized\n"); + + /* enable printk-access to this driver */ + register_console(&guest_console); + return 0; +} + +console_initcall(guest_console_init); + Index: linux-2.6.21/drivers/s390/guest/guest_console.h =================================================================== --- /dev/null +++ linux-2.6.21/drivers/s390/guest/guest_console.h @@ -0,0 +1,47 @@ +/* + * guest console device driver + * Copyright IBM Corp. 
2007 + * Author: Carsten Otte <cotte-tA70FqPdS9bQT0dZR+AlfA@public.gmane.org> + */ + + +#ifndef __GCONSOLE_H +#define __GCONSOLE_H +extern struct tty_driver *guest_tty_driver; +static inline int diag_write(int fd, const void *buffer, size_t count) +{ + register long __arg1 asm("2") = fd; + register const void * __arg2 asm("3") = buffer; + register size_t __arg3 asm("4") = count; + register long __svcres asm("2"); + long __res; + asm volatile ( + "diag 0,0,2" + : "=d" (__svcres) + : "0" (__arg1), + "d" (__arg2), + "d" (__arg3) + : "cc", "memory"); + __res = __svcres; + return __res; +} + +static inline int diag_read(int fd, const void *buffer, size_t count) +{ + register long __arg1 asm("2") = fd; + register const void * __arg2 asm("3") = buffer; + register size_t __arg3 asm("4") = count; + register long __svcres asm("2"); + long __res; + asm volatile ( + "diag 0,0,1" + : "=d" (__svcres) + : "0" (__arg1), + "d" (__arg2), + "d" (__arg3) + : "cc", "memory"); + __res = __svcres; + return __res; +} +#endif + Index: linux-2.6.21/drivers/s390/guest/guest_tty.c =================================================================== --- /dev/null +++ linux-2.6.21/drivers/s390/guest/guest_tty.c @@ -0,0 +1,153 @@ +/* + * guest console tty device driver + * Copyright IBM Corp. 
2007 + * Author: Carsten Otte <cotte-tA70FqPdS9bQT0dZR+AlfA@public.gmane.org> + */ + +#include <linux/fs.h> +#include <linux/tty.h> +#include <linux/tty_flip.h> +#include <linux/module.h> +#include <asm/s390_ext.h> +#include "guest_console.h" + +struct tty_driver *guest_tty_driver; +static struct tty_struct *guest_tty; + +MODULE_DESCRIPTION("Guest console for linux guests"); +MODULE_AUTHOR("Carsten Otte <cotte-tA70FqPdS9bQT0dZR+AlfA@public.gmane.org>"); +MODULE_LICENSE("GPL"); + +static int +guest_tty_open(struct tty_struct *tty, struct file *filp) +{ + guest_tty = tty; + tty->driver_data = NULL; + return 0; +} + +static void +guest_tty_close(struct tty_struct *tty, struct file *filp) +{ + if (tty->count > 1) + return; + guest_tty = NULL; +} + +static int +guest_tty_ioctl(struct tty_struct *tty, struct file * file, + unsigned int cmd, unsigned long arg) +{ + return -ENOIOCTLCMD; +} + +static int +guest_tty_write(struct tty_struct *tty, const unsigned char *str, int count) +{ + int ret; + size_t pos; + + for(pos=0; pos < count; pos += ret) { + ret = diag_write(1, str + pos, count - pos); + if (ret <= 0) + break; + } + return pos; +} + +static void +guest_tty_put_char(struct tty_struct *tty, unsigned char ch) +{ + guest_tty_write (tty, &ch, 1); +} + +static void +guest_tty_flush_chars(struct tty_struct *tty) +{ + int nop; + nop=0; // :) +} + +static int +guest_tty_chars_in_buffer(struct tty_struct *tty) +{ + return 0; +} + +static void +guest_tty_flush_buffer(struct tty_struct *tty) +{ + guest_tty_flush_chars(tty); // :) +} + +static int +guest_tty_write_room (struct tty_struct *tty) +{ + return 65536; +} + +static struct tty_operations guest_ops = { + .open = guest_tty_open, + .close = guest_tty_close, + .write = guest_tty_write, + .put_char = guest_tty_put_char, + .flush_chars = guest_tty_flush_chars, + .write_room = guest_tty_write_room, + .chars_in_buffer = guest_tty_chars_in_buffer, + .flush_buffer = guest_tty_flush_buffer, + .ioctl = guest_tty_ioctl, +}; + 
+static void +guest_tty_ext_handler(__u16 code) +{ + char buffer[256]; + int count; + + count = diag_read(0, buffer, 256); + if (count <= 0) + return; + + if (!guest_tty) + return; + tty_insert_flip_string(guest_tty, buffer, count); + tty_flip_buffer_push(guest_tty); +} + +int __init +guest_tty_init(void) +{ + struct tty_driver *driver; + int rc; + + if (!MACHINE_IS_GUEST) + return 0; + register_external_interrupt(0x1234, guest_tty_ext_handler); + driver = alloc_tty_driver(1); + if (!driver) + return -ENOMEM; + guest_tty = NULL; + driver->owner = THIS_MODULE; + driver->driver_name = "guest_line"; + driver->name = "guest_line"; + driver->major = TTY_MAJOR; + driver->minor_start = 65; + driver->type = TTY_DRIVER_TYPE_SYSTEM; + driver->subtype = SYSTEM_TYPE_TTY; + driver->init_termios = tty_std_termios; + driver->init_termios.c_iflag = IGNBRK | IGNPAR; + driver->init_termios.c_oflag = ONLCR | XTABS; + driver->init_termios.c_lflag = ISIG | ECHO; + driver->flags = TTY_DRIVER_REAL_RAW; + tty_set_operations(driver, &guest_ops); + rc = tty_register_driver(driver); + if (rc) { + printk(KERN_ERR "guest tty driver: could not register tty - " + "tty_register_driver returned %d\n", rc); + put_tty_driver(driver); + return rc; + } + guest_tty_driver = driver; + return 0; +} +module_init(guest_tty_init); Index: linux-2.6.21/drivers/s390/Kconfig =================================================================== --- linux-2.6.21.orig/drivers/s390/Kconfig +++ linux-2.6.21/drivers/s390/Kconfig @@ -211,6 +211,11 @@ config MONWRITER help Character device driver for writing z/VM monitor service records +config GUEST_CONSOLE + bool "Guest console support" + depends on S390_GUEST + help + Select this option if you want to run as an s390 guest endmenu menu "Cryptographic devices" Index: linux-2.6.21/drivers/s390/guest/Makefile =================================================================== --- linux-2.6.21.orig/drivers/s390/guest/Makefile +++ linux-2.6.21/drivers/s390/guest/Makefile @@ 
-2,5 +2,6 @@ # s390 Linux virtual environment # +obj-$(CONFIG_GUEST_CONSOLE) += guest_console.o guest_tty.o obj-$(CONFIG_S390_GUEST) += vdev.o vdev_device.o ------------------------------------------------------------------------- This SF.net email is sponsored by DB2 Express Download DB2 Express C - the FREE version of DB2 express and take control of your XML. No limits. Just data. Click to get it now. http://sourceforge.net/powerbar/db2/ ^ permalink raw reply [flat|nested] 104+ messages in thread
* Re: [PATCH/RFC 5/9] s390 virtual console for guests [not found] ` <1178904960.25135.32.camel-WIxn4w2hgUz3YA32ykw5MLlKpX0K8NHHQQ4Iyu8u01E@public.gmane.org> @ 2007-05-11 19:00 ` Anthony Liguori [not found] ` <4644BD3C.8040901-rdkfGonbjUSkNkDKm+mE6A@public.gmane.org> 0 siblings, 1 reply; 104+ messages in thread From: Anthony Liguori @ 2007-05-11 19:00 UTC (permalink / raw) To: Carsten Otte Cc: kvm-devel-5NWGOfrQmneRv+LV9MX5uipxlwaOVQ5f@public.gmane.org, Christian Borntraeger, Martin Schwidefsky I think it would be better to use hvc_console as Xen now uses it too. Carsten Otte wrote: > + if (!MACHINE_IS_GUEST) > + return 0; > + register_external_interrupt(0x1234, guest_tty_ext_handler); > This is an interesting way to get input data from the console :-) How many interrupts does s390 support (the x86 only supports 256)? Can you afford to burn interrupts like this? Is there not a better way to assign interrupts such that conflict isn't an issue? Regards, Anthony Liguori ------------------------------------------------------------------------- This SF.net email is sponsored by DB2 Express Download DB2 Express C - the FREE version of DB2 express and take control of your XML. No limits. Just data. Click to get it now. http://sourceforge.net/powerbar/db2/ ^ permalink raw reply [flat|nested] 104+ messages in thread
* Re: [PATCH/RFC 5/9] s390 virtual console for guests [not found] ` <4644BD3C.8040901-rdkfGonbjUSkNkDKm+mE6A@public.gmane.org> @ 2007-05-11 19:42 ` Christian Bornträger 2007-05-12 8:07 ` Carsten Otte 2007-05-14 16:23 ` Christian Bornträger 2 siblings, 0 replies; 104+ messages in thread From: Christian Bornträger @ 2007-05-11 19:42 UTC (permalink / raw) To: kvm-devel-5NWGOfrQmneRv+LV9MX5uipxlwaOVQ5f; +Cc: Martin Schwidefsky On Friday 11 May 2007 21:00, Anthony Liguori wrote: > I think it would be better to use hvc_console as Xen now uses it too. I dont know hvc_console, but I will have a look at it. > Carsten Otte wrote: > > + if (!MACHINE_IS_GUEST) > > + return 0; > > + register_external_interrupt(0x1234, guest_tty_ext_handler); > > > > This is an interesting way to get input data from the console :-) How > many interrupts does s390 support (the x86 only supports 256)? Can you > afford to burn interrupts like this? Is there not a better way to > assign interrupts such that conflict isn't an issue? On s390 we have a 16 bit interrupt code, so we actually have plenty of numbers... But, yes its a very good point, burning interrupts wont work cross-platform. Our patches are prototypes and need rework anyway. Take these patches as discussion contribution in the spirit of release early. :-) cheers Christian ------------------------------------------------------------------------- This SF.net email is sponsored by DB2 Express Download DB2 Express C - the FREE version of DB2 express and take control of your XML. No limits. Just data. Click to get it now. http://sourceforge.net/powerbar/db2/ ^ permalink raw reply [flat|nested] 104+ messages in thread
* Re: [PATCH/RFC 5/9] s390 virtual console for guests [not found] ` <4644BD3C.8040901-rdkfGonbjUSkNkDKm+mE6A@public.gmane.org> 2007-05-11 19:42 ` Christian Bornträger @ 2007-05-12 8:07 ` Carsten Otte 2007-05-14 16:23 ` Christian Bornträger 2 siblings, 0 replies; 104+ messages in thread From: Carsten Otte @ 2007-05-12 8:07 UTC (permalink / raw) To: Anthony Liguori Cc: kvm-devel-5NWGOfrQmneRv+LV9MX5uipxlwaOVQ5f@public.gmane.org, Christian Borntraeger, mschwid2-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8 Anthony Liguori wrote: > I think it would be better to use hvc_console as Xen now uses it too. This console driver is pretty basic indeed. > This is an interesting way to get input data from the console :-) How > many interrupts does s390 support (the x86 only supports 256)? Can you > afford to burn interrupts like this? Is there not a better way to > assign interrupts such that conflict isn't an issue? We have 2^16 external interrupts on 390, plus IO interrupts, multiplied by the fact that each interrupt can be used in various interrupt subclasses. We can burn irqs indeed, but as Christian mentioned this cannot go into the portable approach. so long, Carsten ------------------------------------------------------------------------- This SF.net email is sponsored by DB2 Express Download DB2 Express C - the FREE version of DB2 express and take control of your XML. No limits. Just data. Click to get it now. http://sourceforge.net/powerbar/db2/ ^ permalink raw reply [flat|nested] 104+ messages in thread
* Re: [PATCH/RFC 5/9] s390 virtual console for guests [not found] ` <4644BD3C.8040901-rdkfGonbjUSkNkDKm+mE6A@public.gmane.org> 2007-05-11 19:42 ` Christian Bornträger 2007-05-12 8:07 ` Carsten Otte @ 2007-05-14 16:23 ` Christian Bornträger [not found] ` <200705141823.13424.cborntra-tA70FqPdS9bQT0dZR+AlfA@public.gmane.org> 2 siblings, 1 reply; 104+ messages in thread From: Christian Bornträger @ 2007-05-14 16:23 UTC (permalink / raw) To: kvm-devel-5NWGOfrQmneRv+LV9MX5uipxlwaOVQ5f; +Cc: Martin Schwidefsky On Friday 11 May 2007 21:00, Anthony Liguori wrote: > I think it would be better to use hvc_console as Xen now uses it too. I just had a look at hvc_console, and indeed this driver looks appropriate for us. Looking at the xen-frontend driver (~130 lines of code) and the simple interface (get_char and put_char), it should be reasonably easy to convert our driver to an hvc_console user. Christian
[parent not found: <200705141823.13424.cborntra-tA70FqPdS9bQT0dZR+AlfA@public.gmane.org>]
* Re: [PATCH/RFC 5/9] s390 virtual console for guests [not found] ` <200705141823.13424.cborntra-tA70FqPdS9bQT0dZR+AlfA@public.gmane.org> @ 2007-05-14 16:48 ` Christian Borntraeger [not found] ` <200705141848.18996.borntrae-tA70FqPdS9bQT0dZR+AlfA@public.gmane.org> 0 siblings, 1 reply; 104+ messages in thread From: Christian Borntraeger @ 2007-05-14 16:48 UTC (permalink / raw) To: kvm-devel-5NWGOfrQmneRv+LV9MX5uipxlwaOVQ5f; +Cc: Martin Schwidefsky On Monday 14 May 2007 18:23, Christian Bornträger wrote: > On Friday 11 May 2007 21:00, Anthony Liguori wrote: > > I think it would be better to use hvc_console as Xen now uses it too. > I just had a look at hvc_console, and indeed this driver looks appropriate As I started prototyping this frontend, I realized that hvc_console requires some interfaces that are not present on s390; e.g., we have no request_irq and free_irq. I don't know if hvc_console is still the right way to go for us. This needs more thinking. Christian
[parent not found: <200705141848.18996.borntrae-tA70FqPdS9bQT0dZR+AlfA@public.gmane.org>]
* Re: [PATCH/RFC 5/9] s390 virtual console for guests [not found] ` <200705141848.18996.borntrae-tA70FqPdS9bQT0dZR+AlfA@public.gmane.org> @ 2007-05-14 17:49 ` Anthony Liguori [not found] ` <4648A11D.3050607-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org> 0 siblings, 1 reply; 104+ messages in thread From: Anthony Liguori @ 2007-05-14 17:49 UTC (permalink / raw) To: Christian Borntraeger Cc: kvm-devel-5NWGOfrQmneRv+LV9MX5uipxlwaOVQ5f, Martin Schwidefsky Christian Borntraeger wrote: > On Monday 14 May 2007 18:23, Christian Bornträger wrote: > >> On Friday 11 May 2007 21:00, Anthony Liguori wrote: >> >>> I think it would be better to use hvc_console as Xen now uses it too. >>> >> I just had a look at hvc_console, and indeed this driver looks appropriate >> > > As I started prototyping this frontend I realized that hvc_console requires > some interfaces, which are not present on s390, e.g. we have no request_irq > and free_irq. Dont know if hvc_console is still the right way to go for us. > It seems like request_irq is roughly the same as register_external_interrupt. I suspect that you could get away with either patching hvc_console to use register_external_interrupt if CONFIG_S390 or perhaps providing a common interface. I suspect that this is going to come up again for sharing other paravirt drivers. Regards, Anthony Liguori > This needs more thinking. > > Christian > > ------------------------------------------------------------------------- This SF.net email is sponsored by DB2 Express Download DB2 Express C - the FREE version of DB2 express and take control of your XML. No limits. Just data. Click to get it now. http://sourceforge.net/powerbar/db2/ ^ permalink raw reply [flat|nested] 104+ messages in thread
[parent not found: <4648A11D.3050607-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>]
* Re: [PATCH/RFC 5/9] s390 virtual console for guests [not found] ` <4648A11D.3050607-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org> @ 2007-05-15 0:27 ` Arnd Bergmann 2007-05-15 7:54 ` Carsten Otte 1 sibling, 0 replies; 104+ messages in thread From: Arnd Bergmann @ 2007-05-15 0:27 UTC (permalink / raw) To: kvm-devel-5NWGOfrQmneRv+LV9MX5uipxlwaOVQ5f; +Cc: Martin Schwidefsky On Monday 14 May 2007, Anthony Liguori wrote: > It seems like request_irq is roughly the same as > register_external_interrupt. I suspect that you could get away with > either patching hvc_console to use register_external_interrupt if > CONFIG_S390 or perhaps providing a common interface. > > I suspect that this is going to come up again for sharing other paravirt > drivers. request_irq() is not a nice interface for s390, but it will probably make sense to convert the two existing users of register_external_interrupt to use that instead, in order to get something that can be shared across architectures for virtual drivers. It basically means extending struct ext_int_info_t to include a name and a void* member that gets passed back to the interrupt handler, and to check for invalid flags passed to request_irq. You might want to show these in /proc/interrupts then as well, as per-interrupt values. Arnd <><
* Re: [PATCH/RFC 5/9] s390 virtual console for guests [not found] ` <4648A11D.3050607-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org> 2007-05-15 0:27 ` Arnd Bergmann @ 2007-05-15 7:54 ` Carsten Otte 1 sibling, 0 replies; 104+ messages in thread From: Carsten Otte @ 2007-05-15 7:54 UTC (permalink / raw) To: Anthony Liguori Cc: kvm-devel-5NWGOfrQmneRv+LV9MX5uipxlwaOVQ5f, mschwid2-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8, borntrae-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8 Anthony Liguori wrote: > It seems like request_irq is roughly the same as > register_external_interrupt. I suspect that you could get away with > either patching hvc_console to use register_external_interrupt if > CONFIG_S390 or perhaps providing a common interface. > > I suspect that this is going to come up again for sharing other paravirt > drivers. Maybe we should have wrappers for request_irq/free_irq in arch/ rather than #ifdefs in each paravirtual driver. We need to talk this over with Martin (our arch maintainer). so long, Carsten
* [PATCH/RFC 6/9] virtual block device driver [not found] ` <1178903957.25135.13.camel-WIxn4w2hgUz3YA32ykw5MLlKpX0K8NHHQQ4Iyu8u01E@public.gmane.org> ` (3 preceding siblings ...) 2007-05-11 17:36 ` [PATCH/RFC 5/9] s390 virtual console for guests Carsten Otte @ 2007-05-11 17:36 ` Carsten Otte [not found] ` <1178904963.25135.33.camel-WIxn4w2hgUz3YA32ykw5MLlKpX0K8NHHQQ4Iyu8u01E@public.gmane.org> 2007-05-11 17:36 ` [PATCH/RFC 7/9] Virtual network guest " Carsten Otte ` (2 subsequent siblings) 7 siblings, 1 reply; 104+ messages in thread From: Carsten Otte @ 2007-05-11 17:36 UTC (permalink / raw) To: kvm-devel-5NWGOfrQmneRv+LV9MX5uipxlwaOVQ5f@public.gmane.org Cc: Christian Borntraeger, Martin Schwidefsky From: Carsten Otte <cotte-tA70FqPdS9bQT0dZR+AlfA@public.gmane.org> This driver provides access to virtual block devices. It uses its own make_request function, which passes the bio to a workqueue thread. The workqueue thread uses the diagnose hypervisor call to call the hosting Linux. The hypervisor code in host userspace uses io_submit to initiate the IO. Once the IO is done, the host will use io_getevents and then generate an interrupt to the guest. The interrupt handler calls bio_endio. This device driver is currently architecture-dependent. We intend to move the host API to a hypercall instead of the diagnose instruction. Please review. Signed-off-by: Carsten Otte <cotte-tA70FqPdS9bQT0dZR+AlfA@public.gmane.org> --- drivers/s390/block/Kconfig | 7 drivers/s390/guest/Makefile | 1 drivers/s390/guest/vdisk.c | 153 +++++++++++++++++ drivers/s390/guest/vdisk.h | 230 ++++++++++++++++++++++++++ drivers/s390/guest/vdisk_blk.c | 355 +++++++++++++++++++++++++++++++++++++++++ 5 files changed, 746 insertions(+) Index: linux-2.6.21/drivers/s390/guest/vdisk.c =================================================================== --- /dev/null +++ linux-2.6.21/drivers/s390/guest/vdisk.c @@ -0,0 +1,153 @@ +/* + * guest virtual block device driver + * Copyright IBM Corp. 
2007 + * Author: Carsten Otte <cotte-tA70FqPdS9bQT0dZR+AlfA@public.gmane.org> + */ + +#include <linux/blkdev.h> +#include <linux/init.h> +#include <linux/module.h> +#include <linux/spinlock.h> +#include <linux/types.h> +#include <asm/ptrace.h> +#include <asm/s390_ext.h> +#include <asm/vdev.h> +#include "vdisk.h" + +MODULE_DESCRIPTION("Guest virtual block device driver"); +MODULE_AUTHOR("Carsten Otte <cotte-tA70FqPdS9bQT0dZR+AlfA@public.gmane.org>"); +MODULE_LICENSE("GPL"); + +static struct vdev_driver vdisk_driver; + +static int __find_fd(struct device *dev, void* fdptr) { + int fd = (long)fdptr; + + struct vdev *vdev = to_vdev(dev); + struct vdisk_device *vdisk = (struct vdisk_device *)vdev->drv_private; + + if (vdisk->vfd == fd) + return 1; + else + return 0; +} + +vdisk_irq_t vdisk_get_irqpage(int fd) +{ + struct device *dev; + struct vdev *vdev; + struct vdisk_device *vdisk; + + dev = driver_find_device(&vdisk_driver.driver, NULL, (void*)(long)fd, __find_fd); + if (!dev) + return NULL; + vdev = to_vdev(dev); + vdisk = (struct vdisk_device *)vdev->drv_private; + return vdisk->irq_page; +} + +struct vdisk_device * vdisk_get_device_by_fd(int fd) +{ + struct device *dev; + struct vdev *vdev; + struct vdisk_device *vdisk; + + dev = driver_find_device(&vdisk_driver.driver, NULL, (void*)(long)fd, __find_fd); + if (!dev) + return NULL; + vdev = to_vdev(dev); + vdisk = (struct vdisk_device *)vdev->drv_private; + return vdisk; +} + + +static int vdisk_probe(struct vdev *vdev) +{ + struct vdisk_device *vdisk; + int rc; + + vdisk = kzalloc(sizeof(struct vdisk_device), GFP_ATOMIC); + if (!vdisk) + return -ENOMEM; + + vdisk->vdev = vdev; + vdev->drv_private = vdisk; + vdisk->submit_page = (void*)get_zeroed_page(GFP_KERNEL); + + if (!vdisk->submit_page) { + rc = -ENOMEM; + goto out_free; + } + + vdisk->irq_page = (void*)get_zeroed_page(GFP_KERNEL); + + if (!vdisk->irq_page) { + rc = -ENOMEM; + goto out_free; + } + + rc = diag_vdisk_disk_info(vdisk->vdev->hostid, + 
&vdisk->blocksize, &vdisk->size, + &vdisk->read_only); + if (rc) + goto out_free; + spin_lock_init(&vdisk->lock); + init_rwsem(&vdisk->pump_sem); + init_waitqueue_head(&vdisk->wait); + + vdisk_init_blockdev(vdisk); + goto out; + +out_free: + if (vdisk->irq_page) + free_page((unsigned long)(vdisk->irq_page)); + if (vdisk->submit_page) + free_page((unsigned long)(vdisk->submit_page)); + kfree(vdisk); +out: + return rc; +} + +static struct vdev_driver vdisk_driver = { + .name = "vdisk", + .owner = THIS_MODULE, + .vdev_type = VDEV_TYPE_DISK, + .probe = vdisk_probe, +}; + +static int __init vdisk_init(void) +{ + int rc; + if (!MACHINE_IS_GUEST) + return -ENODEV; + + rc = register_blkdev(VDISK_MAJOR, "vdisk"); + if (rc) { + printk(KERN_WARNING "vdisk: cannot register block device\n"); + return rc; + } + rc = register_external_interrupt(0x1235, vdisk_ext_handler); + if (rc) + goto unregister_blk; + rc = vdev_driver_register(&vdisk_driver); + if (rc) + goto unregister_irq; + goto out; +unregister_irq: + unregister_external_interrupt(0x1235, vdisk_ext_handler); +unregister_blk: + unregister_blkdev(VDISK_MAJOR, "vdisk"); + printk (KERN_WARNING "vdisk: initialization failed\n"); +out: + return rc; +} + +static void __exit vdisk_exit(void) +{ + unregister_external_interrupt(0x1235, vdisk_ext_handler); + unregister_blkdev(VDISK_MAJOR, "vdisk"); + return; +} + +module_init(vdisk_init); +module_exit(vdisk_exit); Index: linux-2.6.21/drivers/s390/guest/vdisk.h =================================================================== --- /dev/null +++ linux-2.6.21/drivers/s390/guest/vdisk.h @@ -0,0 +1,230 @@ +/* + * guest virtual block device driver header file + * Copyright IBM Corp. 
+ * Author: Carsten Otte <cotte-tA70FqPdS9bQT0dZR+AlfA@public.gmane.org> + */ + +#include <linux/list.h> +#include <linux/genhd.h> +#include <linux/types.h> +#include <linux/aio_abi.h> +#include <linux/wait.h> +#include <asm/vdev.h> + +#define VDISK_MAJOR 95 +#define SECTOR_SHIFT 9 +#define VDISK_NR_REQ 256 +#define VDISK_NR_RES 170 + +#define VDISK_WRITE 1 +#define VDISK_READ 0 + +struct vdisk_request { + unsigned long buf; + unsigned long count; +}; + +typedef struct vdisk_request (*vdisk_req_t)[VDISK_NR_REQ]; + +struct vdisk_response { + unsigned long intparm; + unsigned long count; + unsigned long failed; +}; + +typedef struct vdisk_response (*vdisk_irq_t)[VDISK_NR_RES]; + +struct vdisk_device { + struct list_head head; + int blocksize; + long size; + int read_only; + struct gendisk *gd; + struct vdev *vdev; + spinlock_t lock; + struct rw_semaphore pump_sem; + int open_count; + int vfd; + struct vdisk_request (*submit_page)[VDISK_NR_REQ]; + struct workqueue_struct *wq; + vdisk_irq_t irq_page; + wait_queue_head_t wait; +}; + +struct vdisk_work { + struct work_struct work; + struct bio* bio; +}; + +struct vdisk_elem { + unsigned int fd; + unsigned int command; + unsigned long offset; + unsigned long buffer; + unsigned long nbytes; +}; + +struct vdisk_iocb_container { + struct iocb iocb; + struct bio *bio; + struct vdisk_device *dev; + int ctx_index; + unsigned long context; + struct list_head list; +}; + +// from aio_abi.h +typedef enum io_iocb_cmd { + IO_CMD_PREAD = 0, + IO_CMD_PWRITE = 1, + + IO_CMD_FSYNC = 2, + IO_CMD_FDSYNC = 3, + + IO_CMD_POLL = 5, + IO_CMD_NOOP = 6, +} io_iocb_cmd_t; + +static inline int +diag_vdisk_disk_info(char name[256], int* blocksize, + long* size, int* read_only) +{ + register char* __arg1 asm("2") = name; + register int* __arg2 asm("3") = blocksize; + register long* __arg3 asm("4") = size; + register int* __arg4 asm("5") = read_only; + register int __svcres asm("2"); + int __res; + + __asm__ __volatile__ ( + "diag 0,0,5" + : "=d" 
(__svcres) + : "0" (__arg1), + "d" (__arg2), + "d" (__arg3), + "d" (__arg4) + : "cc", "memory"); + __res = __svcres; + return __res; +} + +static inline int +diag_vdisk_open(const char* name, int read_only, void* irq_page) +{ + register const char* __arg1 asm("2") = name; + register long __arg2 asm("3") = read_only; + register unsigned long __arg3 asm("4") = (unsigned long) irq_page; + register int __svcres asm("2"); + int __res; + + __asm__ __volatile__ ( + "diag 0,0,7" + : "=d" (__svcres) + : "0" (__arg1), + "d" (__arg2), + "d" (__arg3) + : "cc", "memory"); + __res = __svcres; + return __res; +} + +static inline int +diag_vdisk_close(int fd) +{ + register long __arg1 asm("2") = fd; + register int __svcres asm("2"); + int __res; + + __asm__ __volatile__ ( + "diag 0,0,9" + : "=d" (__svcres) + : "0" (__arg1) + : "cc", "memory"); + __res = __svcres; + return __res; +} + +static inline int +diag_vdisk_aio_setup(unsigned int index, unsigned int num_events, + unsigned long *context, void *int_page) +{ + register unsigned long __arg1 asm("2") = index; + register unsigned long __arg2 asm("3") = num_events; + register unsigned long* __arg3 asm("4") = context; + register void* __arg4 asm("5") = int_page; + register int __svcres asm("2"); + int __res; + + __asm__ __volatile__ ( + "diag 0,0,0x0a" + : "=d" (__svcres) + : "0" (__arg1), + "d" (__arg2), + "d" (__arg3), + "d" (__arg4) + : "cc", "memory"); + __res = __svcres; + return __res; +} + +static inline void +diag_vdisk_aio_destroy(unsigned int index) +{ + register unsigned long __arg1 asm("2") = index; + __asm__ __volatile__ ( + "diag 0,0,0x12" + :: "d" (__arg1) + : "cc", "memory"); +} + +static inline int +diag_vdisk_submit_request(int fd, void *submit_page, int op, + loff_t start_offset, int nrreq, void* parm) +{ + register long __arg1 asm("2") = fd; + register unsigned long __arg2 asm("3") = (unsigned long)submit_page; + register long __arg3 asm("4") = op; + register unsigned long __arg4 asm("5") = start_offset; + 
register long __arg5 asm("6") = nrreq; + register unsigned long __arg6 asm("7") = (unsigned long)parm; + register int __svcres asm("2"); + int __res; + + __asm__ __volatile__ ( + "diag 0,0,0x0b" + : "=d" (__svcres) + : "0" (__arg1), + "d" (__arg2), + "d" (__arg3), + "d" (__arg4), + "d" (__arg5), + "d" (__arg6) + : "cc", "memory"); + __res = __svcres; + return __res; +} + +static inline int +diag_vdisk_getevents(int index) { + register long __arg1 asm("2") = index; + register int __svcres asm("2"); + int __res; + + __asm__ __volatile__ ( + "diag 0,0,0x0d" + : "=d" (__svcres) + : "0" (__arg1) + : "cc", "memory"); + __res = __svcres; + return __res; +} + +// in vdisk.c +extern struct device *vdisk_sysfs_root; +int vdisk_disk_info(struct vdisk_device *dev); +vdisk_irq_t vdisk_get_irqpage(int fd); +struct vdisk_device * vdisk_get_device_by_fd(int fd); + +// in vdisk_blk.c +void vdisk_init_blockdev(struct vdisk_device *dev); +void vdisk_ext_handler(__u16 code); Index: linux-2.6.21/drivers/s390/guest/vdisk_blk.c =================================================================== --- /dev/null +++ linux-2.6.21/drivers/s390/guest/vdisk_blk.c @@ -0,0 +1,355 @@ +/* + * guest virtual block device driver + * Copyright IBM Corp. 
2007 + * Author: Carsten Otte <cotte-tA70FqPdS9bQT0dZR+AlfA@public.gmane.org> + */ + +#include <linux/blkdev.h> +#include "vdisk.h" + +static int vdisk_open(struct inode *inode, struct file *filp); +static int vdisk_release(struct inode *inode, struct file *filp); +static int vdisk_make_request(request_queue_t *q, struct bio *bio); + +static struct block_device_operations vdisk_fops = { + .owner = THIS_MODULE, + .open = vdisk_open, + .release = vdisk_release, +}; + +void vdisk_init_blockdev(struct vdisk_device *dev) +{ + static int lastminor = 0; + + dev->gd = alloc_disk(1); + if (!dev->gd) { + printk (KERN_WARNING "vdisk: out of memory while " \ + "initializing %s\n", dev->vdev->symname); + return; + } + dev->open_count = 0; + dev->gd->first_minor = lastminor++; + dev->gd->queue = blk_alloc_queue(GFP_KERNEL); + if (!dev->gd->queue) { + printk (KERN_WARNING "vdisk: out of memory while " \ + "initializing %s\n", dev->vdev->symname); + goto free_gd; + } + blk_queue_max_segment_size(dev->gd->queue, 15 * dev->blocksize); + strlcpy(dev->gd->disk_name, dev->vdev->symname, 32); + dev->gd->disk_name[32] = '\0'; + dev->gd->major = VDISK_MAJOR; + dev->gd->fops = &vdisk_fops; + dev->gd->private_data = dev; + dev->gd->driverfs_dev = &dev->vdev->dev; + set_capacity(dev->gd, dev->size); + get_device(&dev->vdev->dev); + add_disk(dev->gd); + blk_queue_make_request(dev->gd->queue, vdisk_make_request); + blk_queue_hardsect_size(dev->gd->queue, dev->blocksize); + set_disk_ro(dev->gd, dev->read_only); + if (dev->blocksize) + printk(KERN_INFO "vdisk: device %s(%d:%d) active with " \ + "block size %d and %ld sectors\n", dev->vdev->symname, + VDISK_MAJOR, dev->gd->first_minor, dev->blocksize, + dev->size); + else + printk(KERN_INFO "vdisk: device %s(%d:%d) inactive\n", + dev->vdev->symname, VDISK_MAJOR, dev->gd->first_minor); + return; + free_gd: + kfree(dev->gd); + dev->gd = NULL; + return; +} + +static int vdisk_open(struct inode *inode, struct file *filp) +{ + struct vdisk_device *dev 
= inode->i_bdev->bd_disk->private_data; + unsigned long flags; + char* wq_name; + struct workqueue_struct *new_wq; + int rc; + + if (!dev) { + rc = -ENODEV; + goto out; + } + + down_write(&dev->pump_sem); + spin_lock_irqsave(&dev->lock, flags); + if (dev->open_count) { + dev->open_count++; + rc = 0; + goto unlock; + } + + + wq_name = kmalloc(strlen(dev->gd->disk_name)+9, GFP_ATOMIC); + if (!wq_name) { + rc = -ENOMEM; + goto unlock; + } + memcpy(wq_name, "IO_pump_", 8); + strcpy(wq_name+8, dev->gd->disk_name); + + spin_unlock_irqrestore(&dev->lock, flags); + new_wq = create_singlethread_workqueue(wq_name); + spin_lock_irqsave(&dev->lock, flags); + + dev->wq = new_wq; + kfree (wq_name); + + if (!dev->wq) { + rc = -EIO; + goto unlock; + } + + rc = diag_vdisk_disk_info(dev->vdev->hostid, &dev->blocksize, + &dev->size, &dev->read_only); + if (rc) { + printk(KERN_ERR "vdisk: error querying %s\n", dev->vdev->hostid); + goto cleanup; + } + inode->i_bdev->bd_block_size = dev->blocksize; + dev->vfd = diag_vdisk_open(dev->vdev->hostid, dev->read_only, + dev->irq_page); + + if (dev->vfd < 0) { + rc = dev->vfd; + printk(KERN_ERR "vdisk: error opening %s\n", dev->vdev->hostid); + goto cleanup; + } else { + dev->open_count++; + rc = 0; + } + goto unlock; + + cleanup: + spin_unlock_irqrestore(&dev->lock, flags); + destroy_workqueue(new_wq); + spin_lock_irqsave(&dev->lock, flags); + unlock: + spin_unlock_irqrestore(&dev->lock, flags); + up_write(&dev->pump_sem); + out: + return rc; +} + +static int +vdisk_release(struct inode *inode, struct file *filp) +{ + int rc; + unsigned long flags; + struct vdisk_device *dev = inode->i_bdev->bd_disk->private_data; + struct workqueue_struct *old_wq; + + if (!dev) { + rc = -ENODEV; + goto out; + } + + down_write(&dev->pump_sem); + spin_lock_irqsave(&dev->lock, flags); + dev->open_count--; + + if (dev->open_count) { + rc = 0; + spin_unlock_irqrestore(&dev->lock, flags); + goto up; + } + rc = diag_vdisk_close(dev->vfd); + + old_wq = dev->wq; + 
dev->wq = NULL; + + spin_unlock_irqrestore(&dev->lock, flags); + + destroy_workqueue(old_wq); + + up: + up_write(&dev->pump_sem); + out: + return rc; +} + +static void vdisk_pump_bvecs(struct vdisk_device *dev, int op, + loff_t start_offset, int requestno, + struct bio* bio, struct bio_vec *(vectors[256])) +{ + int i, rc; + loff_t offset = start_offset; + int nr_done = 0; + long size; + long flags=0; + DEFINE_WAIT(wait); + + spin_lock_irqsave(&dev->lock, flags); + prepare_to_wait_exclusive(&dev->wait, &wait, + TASK_UNINTERRUPTIBLE); + + while (nr_done < requestno) { + memset(dev->submit_page, 0, PAGE_SIZE); + for (i=nr_done; i<requestno; i++) { + (*dev->submit_page)[i-nr_done].buf = + (unsigned long)page_address(vectors[i]->bv_page) + + vectors[i]->bv_offset; + (*dev->submit_page)[i-nr_done].count = vectors[i]->bv_len; + } + + rc = diag_vdisk_submit_request(dev->vfd, + dev->submit_page, + op, offset, + requestno-nr_done, bio); + + if (rc < 0) { + // error case + size = 0; + for (i=0; i<(requestno-nr_done); i++) + size += (*dev->submit_page)[i].count; + bio_io_error(bio, size); + break; + } + + if (rc == requestno - nr_done) + // everything was submitted properly + break; + + if (rc) { + // request was partly submitted + for (i=0; i<rc; i++) + offset += (*dev->submit_page)[i].count; + nr_done += rc; + } + // we need to throttle IO, and retry submission later + spin_unlock_irqrestore(&dev->lock, flags); + io_schedule(); + spin_lock_irqsave(&dev->lock, flags); + } + finish_wait(&dev->wait, &wait); + spin_unlock_irqrestore(&dev->lock, flags); + return; +} + +static void vdisk_pump_bio(struct work_struct *zw) +{ + struct vdisk_work *work = + container_of(zw, struct vdisk_work, work); + + struct bio *bio = work->bio; + struct bio_vec *bvec; + struct bio_vec *(vectors[256]); + struct vdisk_device *dev = bio->bi_bdev->bd_disk->private_data; + int i, op, requestno=0; + loff_t start_offset, offset; + + BUG_ON(!dev); + + kfree (zw); + + if (bio_data_dir(bio)) + op = 
VDISK_WRITE; + else + op = VDISK_READ; + + offset = start_offset = ((loff_t)bio->bi_sector)<<SECTOR_SHIFT; + + bio_for_each_segment(bvec, bio, i) { + if (bvec->bv_len & (dev->blocksize - 1)) //FIXME: access to dev without holding the lock + goto out; + + vectors[requestno] = bvec; + offset += bvec->bv_len; + requestno++; + if (requestno == 255) { + vdisk_pump_bvecs(dev, op, start_offset, requestno, + bio, vectors); + start_offset = offset; + requestno = 0; + } + } + + if (requestno) + vdisk_pump_bvecs(dev, op, start_offset, requestno, bio, vectors); + +out: + return; +} + +static int vdisk_make_request(request_queue_t *q, struct bio *bio) +{ + struct vdisk_device *dev = bio->bi_bdev->bd_disk->private_data; + struct vdisk_work *work; + int rc; + + if (!dev) { + rc = -ENODEV; + goto out; + } + + if (bio_barrier(bio)) { + rc = -EOPNOTSUPP; + goto out; + } + + work = kmalloc(sizeof(struct vdisk_work), GFP_KERNEL); + if (!work) { + rc = -ENOMEM; + goto out; + } + + work->bio = bio; + + INIT_WORK(&work->work, vdisk_pump_bio); + + if (!queue_work(dev->wq, &work->work)) { + rc = -EIO; + kfree(work); + } else + rc = 0; + +out: + return rc; +} + +static void __post_response(vdisk_irq_t irq_page) +{ + int i; + struct bio *bio; + + for (i=0; i<VDISK_NR_RES; i++) { + if (!(*irq_page)[i].intparm) + break; + bio = (struct bio *)((*irq_page)[i].intparm); + if ((*irq_page)[i].count) + bio_endio(bio, (*irq_page)[i].count, 0); + if ((*irq_page)[i].failed) + bio_endio(bio, (*irq_page)[i].failed, 1); + } +} + +void vdisk_ext_handler(__u16 code) +{ + int rc=0; //FIXME: no initialization here + int fd = S390_lowcore.ext_params; + vdisk_irq_t irq_page; + struct vdisk_device *vdev; + + irq_page = vdisk_get_irqpage(fd); + + if (irq_page) { + do { + __post_response(irq_page); + rc = diag_vdisk_getevents(fd); // get more interrupts + } while(rc > 0); + vdev = vdisk_get_device_by_fd(fd); + if (!vdev) + panic("cannot find vdisk device while in interrupt"); + spin_lock(&vdev->lock); + if 
(waitqueue_active(&vdev->wait)) + wake_up(&vdev->wait); + spin_unlock(&vdev->lock); + } else + printk (KERN_WARNING "vdisk got interrupt for non-existing" \ + " aio context id %d\n", fd); +} Index: linux-2.6.21/drivers/s390/block/Kconfig =================================================================== --- linux-2.6.21.orig/drivers/s390/block/Kconfig +++ linux-2.6.21/drivers/s390/block/Kconfig @@ -63,4 +63,11 @@ config DASD_EER DASD extended error reporting. This is only needed if you want to use applications written for the EER facility. + +config VDISK + tristate "guest disk device support" + depends on S390_GUEST + help + This driver provides access to block devices for Linux systems running + under non z/VM hosts. If you are running on LPAR or z/VM only, say N. endif Index: linux-2.6.21/drivers/s390/guest/Makefile =================================================================== --- linux-2.6.21.orig/drivers/s390/guest/Makefile +++ linux-2.6.21/drivers/s390/guest/Makefile @@ -4,4 +4,5 @@ obj-$(CONFIG_GUEST_CONSOLE) += guest_console.o guest_tty.o obj-$(CONFIG_S390_GUEST) += vdev.o vdev_device.o +obj-$(CONFIG_VDISK) += vdisk.o vdisk_blk.o ------------------------------------------------------------------------- This SF.net email is sponsored by DB2 Express Download DB2 Express C - the FREE version of DB2 express and take control of your XML. No limits. Just data. Click to get it now. http://sourceforge.net/powerbar/db2/ ^ permalink raw reply [flat|nested] 104+ messages in thread
[parent not found: <1178904963.25135.33.camel-WIxn4w2hgUz3YA32ykw5MLlKpX0K8NHHQQ4Iyu8u01E@public.gmane.org>]
* Re: [PATCH/RFC 6/9] virtual block device driver [not found] ` <1178904963.25135.33.camel-WIxn4w2hgUz3YA32ykw5MLlKpX0K8NHHQQ4Iyu8u01E@public.gmane.org> @ 2007-05-14 11:49 ` Avi Kivity [not found] ` <46484CDF.505-atKUWr5tajBWk0Htik3J/w@public.gmane.org> 2007-05-14 11:52 ` Avi Kivity 1 sibling, 1 reply; 104+ messages in thread From: Avi Kivity @ 2007-05-14 11:49 UTC (permalink / raw) To: Carsten Otte Cc: kvm-devel-5NWGOfrQmneRv+LV9MX5uipxlwaOVQ5f@public.gmane.org, Christian Borntraeger, Martin Schwidefsky Carsten Otte wrote: > From: Carsten Otte <cotte-tA70FqPdS9bQT0dZR+AlfA@public.gmane.org> > > This driver provides access to virtual block devices. It does use its own > make_request function which passes the bio to a workqueue thread. The workqueue > thread does use the diagnose hypervisor call to call the hosting Linux. > The hypervisor code in host userspace does use aio_submit to initiate the IO. > Once the IO is done, the host will use io_getevents and then generate an > interrupt to the guest. The interrupt handler calls bio_endio. > This device driver is currently architecture dependent. We intend to move the > host API to hypercall instead of the diagnose instuction. Please review. > > Signed-off-by: Carsten Otte <cotte-tA70FqPdS9bQT0dZR+AlfA@public.gmane.org> > > +struct vdisk_device * vdisk_get_device_by_fd(int fd) > +{ > + struct device *dev; > + struct vdev *vdev; > + struct vdisk_device *vdisk; > + > + dev = driver_find_device(&vdisk_driver.driver, NULL, (void*)(long)fd, __find_fd); > + if (!dev) > + return NULL; > + vdev = to_vdev(dev); > + vdisk = (struct vdisk_device *)vdev->drv_private; > + return vdisk; > +} > Is this the host file descriptor? If so, we want to use something more abstract (if the host side is in kernel, there will be no fd, or if the device is implemented using >1 files (or <1 files)). 
> + > +#define VDISK_WRITE 1 > +#define VDISK_READ 0 > + > +struct vdisk_request { > + unsigned long buf; > + unsigned long count; > +}; > + > +typedef struct vdisk_request (*vdisk_req_t)[VDISK_NR_REQ]; > + > +struct vdisk_response { > + unsigned long intparm; > + unsigned long count; > + unsigned long failed; > +}; > + > +typedef struct vdisk_response (*vdisk_irq_t)[VDISK_NR_RES]; > + > +struct vdisk_device { > + struct list_head head; > + int blocksize; > + long size; > + int read_only; > + struct gendisk *gd; > + struct vdev *vdev; > + spinlock_t lock; > + struct rw_semaphore pump_sem; > + int open_count; > + int vfd; > + struct vdisk_request (*submit_page)[VDISK_NR_REQ]; > > + struct workqueue_struct *wq; > + vdisk_irq_t irq_page; > + wait_queue_head_t wait; > +}; > + > +struct vdisk_work { > + struct work_struct work; > + struct bio* bio; > +}; > + > +struct vdisk_elem { > + unsigned int fd; > + unsigned int command; > + unsigned long offset; > + unsigned long buffer; > + unsigned long nbytes; > We'll want scatter/gather here. > +}; > + > +struct vdisk_iocb_container { > + struct iocb iocb; > + struct bio *bio; > + struct vdisk_device *dev; > + int ctx_index; > + unsigned long context; > + struct list_head list; > +}; > + > +// from aio_abi.h > +typedef enum io_iocb_cmd { > + IO_CMD_PREAD = 0, > + IO_CMD_PWRITE = 1, > + > + IO_CMD_FSYNC = 2, > + IO_CMD_FDSYNC = 3, > + > + IO_CMD_POLL = 5, > + IO_CMD_NOOP = 6, > +} io_iocb_cmd_t; > Our own commands, please. We need READV, WRITEV, and a barrier for journalling filesystems. FDSYNC should work as a barrier, but is wasteful. The FSYNC/FDSYNC distinction is meaningless. POLL/NOOP are irrelevant. 
> +static void vdisk_pump_bvecs(struct vdisk_device *dev, int op, > + loff_t start_offset, int requestno, > + struct bio* bio, struct bio_vec *(vectors[256])) > +{ > + int i, rc; > + loff_t offset = start_offset; > + int nr_done = 0; > + long size; > + long flags=0; > + DEFINE_WAIT(wait); > + > + spin_lock_irqsave(&dev->lock, flags); > + prepare_to_wait_exclusive(&dev->wait, &wait, > + TASK_UNINTERRUPTIBLE); > + > + while (nr_done < requestno) { > + memset(dev->submit_page, 0, PAGE_SIZE); > + for (i=nr_done; i<requestno; i++) { > + (*dev->submit_page)[i-nr_done].buf = > + (unsigned long)page_address(vectors[i]->bv_page) + > + vectors[i]->bv_offset; > + (*dev->submit_page)[i-nr_done].count = vectors[i]->bv_len; > + } > + > + rc = diag_vdisk_submit_request(dev->vfd, > + dev->submit_page, > + op, offset, > + requestno-nr_done, bio); > + > + if (rc < 0) { > + // error case > + size = 0; > + for (i=0; i<(requestno-nr_done); i++) > + size += (*dev->submit_page)[i].count; > + bio_io_error(bio, size); > + break; > + } > + > + if (rc == requestno - nr_done) > + // everything was submitted propper > + break; > + > + if (rc) { > + //request was partly submitted > + for (i=0; i<rc; i++) > + offset += (*dev->submit_page)[i].count; > + nr_done += rc; > + } > + // we need to throttle IO, and retry submission later > + spin_unlock_irqrestore(&dev->lock, flags); > + io_schedule(); > + spin_lock_irqsave(&dev->lock, flags); > + } > + finish_wait(&dev->wait, &wait); > + spin_unlock_irqrestore(&dev->lock, flags); > + return; > +} > We want to amortize the hypercall over multiple bios (but maybe you're doing that -- I'm not 100% up to speed on the block layer) > + > +static void vdisk_pump_bio(struct work_struct *zw) > +{ > + struct vdisk_work *work = > + container_of(zw, struct vdisk_work, work); > + > + struct bio *bio = work->bio; > + struct bio_vec *bvec; > + struct bio_vec *(vectors[256]); > + struct vdisk_device *dev = bio->bi_bdev->bd_disk->private_data; > + int i, op, 
requestno=0; > + loff_t start_offset, offset; > + > + BUG_ON(!dev); > + > + kfree (zw); > + > + if (bio_data_dir(bio)) > + op = VDISK_WRITE; > + else > + op = VDISK_READ; > + > + offset = start_offset = ((loff_t)bio->bi_sector)<<SECTOR_SHIFT; > + > + bio_for_each_segment(bvec, bio, i) { > + if (bvec->bv_len & (dev->blocksize - 1)) // FIXME: access to dev without holding the lock > + goto out; > + > + vectors[requestno] = bvec; > + offset += bvec->bv_len; > + requestno++; > + if (requestno == 255) { > + vdisk_pump_bvecs(dev, op, start_offset, requestno, > + bio, vectors); > + start_offset = offset; > + requestno = 0; > + } > + } > + > + if (requestno) > + vdisk_pump_bvecs(dev, op, start_offset, requestno, bio, vectors); > + > +out: > + return; > +} > + > +static int vdisk_make_request(request_queue_t *q, struct bio *bio) > +{ > + struct vdisk_device *dev = bio->bi_bdev->bd_disk->private_data; > + struct vdisk_work *work; > + int rc; > + > + if (!dev) { > + rc = -ENODEV; > + goto out; > + } > + > + if (bio_barrier(bio)) { > + rc = -EOPNOTSUPP; > + goto out; > + } > + > + work = kmalloc(sizeof(struct vdisk_work), GFP_KERNEL); > + if (!work) { > + rc = -ENOMEM; > + goto out; > + } > + > + work->bio = bio; > + > + INIT_WORK(&work->work, vdisk_pump_bio); > + > + if (!queue_work(dev->wq, &work->work)) { > + rc = -EIO; > + kfree(work); > + } else > + rc = 0; > Any reason not to perform the work directly? -- error compiling committee.c: too many arguments to function ------------------------------------------------------------------------- This SF.net email is sponsored by DB2 Express Download DB2 Express C - the FREE version of DB2 express and take control of your XML. No limits. Just data. Click to get it now. http://sourceforge.net/powerbar/db2/ ^ permalink raw reply [flat|nested] 104+ messages in thread
* Re: [PATCH/RFC 6/9] virtual block device driver [not found] ` <46484CDF.505-atKUWr5tajBWk0Htik3J/w@public.gmane.org> @ 2007-05-14 13:23 ` Carsten Otte [not found] ` <464862E9.7020105-tA70FqPdS9bQT0dZR+AlfA@public.gmane.org> 0 siblings, 1 reply; 104+ messages in thread From: Carsten Otte @ 2007-05-14 13:23 UTC (permalink / raw) To: Avi Kivity Cc: kvm-devel-5NWGOfrQmneRv+LV9MX5uipxlwaOVQ5f@public.gmane.org, Christian Borntraeger, mschwid2-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8 Avi Kivity wrote: > Is this the host file descriptor? If so, we want to use something more > abstract (if the host side is in kernel, there will be no fd, or if the > device is implemented using >1 files (or <1 files)). This is indeed the host file descriptor. Host userland uses sys_open to retrieve it. I see the beauty of having the remote side in the kernel, however I fail to see why we would want to reinvent the wheel: asynchronous IO with O_DIRECT (to avoid host caching) does just what we want. System call latency adds to the in-kernel approach here. > We'll want scatter/gather here. If you want scatter/gather, you have to do request merging in the guest and use the do_request function of the block queue. That is because in make_request you only have a single chunk at hand. With do_request, you would do that request merging twice and get twice the block device plug latency for nothing. The host is the better place to do IO scheduling, because it can optimize over IO from all guest machines. > >> +}; >> + >> +struct vdisk_iocb_container { >> + struct iocb iocb; >> + struct bio *bio; >> + struct vdisk_device *dev; >> + int ctx_index; >> + unsigned long context; >> + struct list_head list; >> +}; >> + >> +// from aio_abi.h >> +typedef enum io_iocb_cmd { >> + IO_CMD_PREAD = 0, >> + IO_CMD_PWRITE = 1, >> + >> + IO_CMD_FSYNC = 2, >> + IO_CMD_FDSYNC = 3, >> + >> + IO_CMD_POLL = 5, >> + IO_CMD_NOOP = 6, >> +} io_iocb_cmd_t; >> > > Our own commands, please. 
We need READV, WRITEV, and a barrier for > journalling filesystems. FDSYNC should work as a barrier, but is > wasteful. The FSYNC/FDSYNC distinction is meaningless. POLL/NOOP are > irrelevant. This matches the API of libaio. If userland translates this into struct iocb, this makes sense. The barrier however is a general problem with this approach: today, the asynchronous IO userspace API does not allow submitting a barrier. Therefore, our make_request function in the guest returns -EOPNOTSUPP, which forces the file system to wait for IO completion. This does sacrifice some performance. The right thing to do would be to add the possibility to submit a barrier to the kernel aio interface. > We want to amortize the hypercall over multiple bios (but maybe you're > doing that -- I'm not 100% up to speed on the block layer) We don't. We do one per bio, and I agree that this is a major disadvantage of this approach. Since IO is slow (compared to vmenter/vmexit), it pays back through better IO scheduling. On our platform, this approach outperforms the scatter/gather do_request one. > Any reason not to perform the work directly? I owe you an answer to this one, I have to revisit our CVS logs to find out. We used to call from make_request without a workqueue before, and I cannot remember why we changed that. so long, Carsten ^ permalink raw reply [flat|nested] 104+ messages in thread
* Re: [PATCH/RFC 6/9] virtual block device driver [not found] ` <464862E9.7020105-tA70FqPdS9bQT0dZR+AlfA@public.gmane.org> @ 2007-05-14 14:39 ` Avi Kivity [not found] ` <46487494.1070802-atKUWr5tajBWk0Htik3J/w@public.gmane.org> 0 siblings, 1 reply; 104+ messages in thread From: Avi Kivity @ 2007-05-14 14:39 UTC (permalink / raw) To: carsteno-tA70FqPdS9bQT0dZR+AlfA Cc: kvm-devel-5NWGOfrQmneRv+LV9MX5uipxlwaOVQ5f@public.gmane.org, Christian Borntraeger, mschwid2-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8 Carsten Otte wrote: > > Avi Kivity wrote: >> Is this the host file descriptor? If so, we want to use something >> more abstract (if the host side is in kernel, there will be no fd, or >> if the device is implemented using >1 files (or <1 files)). > This is indeed the host file descriptor. Host userland uses sys_open > to retrieve it. I see the beauty of having the remote side in the > kernel, however I fail to see why we would want to reinvent the wheel: > asynchronous IO with O_DIRECT (to avoid host caching) does just what > we want. I don't see an immediate need to put the host-side driver in the kernel, but I don't want to embed the host fd (which is an implementation detail) into the host/guest ABI. There may not even be a host fd. > System call latency adds to the in-kernel approach here. I don't understand this. > >> We'll want scatter/gather here. > If you want scatter/gather, you have to do request merging in the > guest and use the do_request function of the block queue. That is > because in make_request you only have a single chunk at hand. > With do_request, you would do that request merging twice and get twice > the block device plug latency for nothing. The host is the better > place to do IO scheduling, because it can optimize over IO from all > guest machines. The bio layer already has scatter/gather (basically, a biovec), but the aio api (which you copy) doesn't. The basic request should be a bio, not a bio page. 
I don't think the guest driver needs to do its own merging. >> >>> +}; >>> + >>> +struct vdisk_iocb_container { >>> + struct iocb iocb; >>> + struct bio *bio; >>> + struct vdisk_device *dev; >>> + int ctx_index; >>> + unsigned long context; >>> + struct list_head list; >>> +}; >>> + >>> +// from aio_abi.h >>> +typedef enum io_iocb_cmd { >>> + IO_CMD_PREAD = 0, >>> + IO_CMD_PWRITE = 1, >>> + >>> + IO_CMD_FSYNC = 2, >>> + IO_CMD_FDSYNC = 3, >>> + >>> + IO_CMD_POLL = 5, >>> + IO_CMD_NOOP = 6, >>> +} io_iocb_cmd_t; >>> >> >> Our own commands, please. We need READV, WRITEV, and a barrier for >> journalling filesystems. FDSYNC should work as a barrier, but is >> wasteful. The FSYNC/FDSYNC distinction is meaningless. POLL/NOOP >> are irrelevant. > This matches the api of libaio. If userland translates this into > struct iocp, this makes sense. The barrier however is a general > problem with this approach: today, the asynchronous IO userspace api > does not allow to submit a barrier. Therefore, our make_request > function in the guest returns -ENOTSUPP in the guest which forces the > file system to wait for IO completion. This does sacrifice some > performance. The right thing to do would be to add the possibility to > submit a barrier to the kernel aio interface. Right. But the ABI needs to support barriers regardless of host kernel support. When unavailable, barriers can be emulated by waiting for the request queue to flush itself. If we do implement the host side in the kernel, then barriers become available. > >> We want to amortize the hypercall over multiple bios (but maybe >> you're doing that -- I'm not 100% up to speed on the block layer) > We don't. We do one per bio, and I agree that this is a major > disadvantage of this approach. Since IO is slow (compared to > vmenter/vmexit), it pays back from to better IO scheduling. On our > platform, this approach outperforms the scatter/gather do_request one. I/O may be slow, but you can have a lot more disks than cpus. 
For example, if an I/O takes 1ms, and you have 100 disks, then you can issue 100K IOPS. With one hypercall per request, that's ~50% of a cpu (at about 5us per hypercall that goes all the way to userspace). That's not counting the overhead of calling io_submit(). -- error compiling committee.c: too many arguments to function ^ permalink raw reply [flat|nested] 104+ messages in thread
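Avi's 50% figure follows from a short back-of-the-envelope calculation; the throwaway helpers below (a sketch, all numbers taken from the mail) make the arithmetic explicit:

```c
#include <math.h>

/* Requests per second a set of disks can sustain when each disk
 * completes one I/O every io_latency_sec seconds. */
static double iops_for(int disks, double io_latency_sec)
{
	return disks / io_latency_sec;
}

/* Fraction of one cpu consumed when every request costs exactly
 * one hypercall of hypercall_sec seconds. */
static double hypercall_cpu_fraction(double iops, double hypercall_sec)
{
	return iops * hypercall_sec;
}
```

With 100 disks at 1ms per I/O this gives 100,000 IOPS; at 5us per hypercall that is 0.5 of a cpu, i.e. the ~50% quoted above.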
* Re: [PATCH/RFC 6/9] virtual block device driver [not found] ` <46487494.1070802-atKUWr5tajBWk0Htik3J/w@public.gmane.org> @ 2007-05-15 11:47 ` Carsten Otte [not found] ` <46499DE9.9090202-tA70FqPdS9bQT0dZR+AlfA@public.gmane.org> 0 siblings, 1 reply; 104+ messages in thread From: Carsten Otte @ 2007-05-15 11:47 UTC (permalink / raw) To: Avi Kivity Cc: carsteno-tA70FqPdS9bQT0dZR+AlfA, kvm-devel-5NWGOfrQmneRv+LV9MX5uipxlwaOVQ5f@public.gmane.org, Christian Borntraeger, mschwid2-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8 Avi Kivity wrote: > I don't see an immediate need to put the host-side driver in the kernel, > but I don't want to embed the host fd (which is an implementation > detail) into the host/guest ABI. There may not even be a host fd. Your point is taken, it also punches a hole in the security barrier between guest kernel and userspace which our usage scenario of multiple guests per uid requires. >> System call latency adds to the in-kernel approach here. > I don't understand this. What I meant to state was: If the host side of the block driver runs in userspace, we have the extra latency to leave the kernel system call context, compute on behalf of the user process, and do another system call (to drive the IO). This extra overhead does not show when handling IO requests from the guest in the kernel. > The bio layer already has scatter/gather (basically, a biovec), but the > aio api (which you copy) doesn't. The basic request should be a bio, > not a bio page. With our block driver it is, we submit an entire bio which may contain multiple biovecs at one hypercall. > Right. But the ABI needs to support barriers regardless of host kernel > support. When unavailable, barriers can be emulated by waiting for the > request queue to flush itself. If we do implement the host side in the > kernel, then barriers become available. Agreed. > I/O may be slow, but you can have a lot more disks than cpus. 
> > For example, if an I/O takes 1ms, and you have 100 disks, then you can > issue 100K IOPS. With one hypercall per request, that's ~50% of a cpu > (at about 5us per hypercall that goes all the way to userspace). That's > not counting the overhead of calling io_submit(). Even when a hypercall round-trip takes as long as 5us, and even if you have only 512 bytes per biovec (we use 4k blocksize), I don't see how this becomes a performance problem: With linear read/write you get 200,000 hypercalls per second with 128 kbyte per hypercall. That's 25.6 GByte per second per CPU. With random read (worst case: 512 byte per hypercall) you still get 100 MByte per second per CPU. There are tighter bottlenecks in the IO hardware afaics. so long, Carsten ^ permalink raw reply [flat|nested] 104+ messages in thread
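Carsten's throughput numbers can be checked the same way. A sketch (note that reproducing the 25.6 GByte figure requires reading "kbyte" as 1000 bytes, an assumption on our part):

```c
#include <math.h>

/* Bytes per second moved when each hypercall carries `bytes` of
 * payload and one round trip costs hypercall_sec seconds, with the
 * cpu doing nothing but hypercalls. */
static double bytes_per_sec(double hypercall_sec, double bytes)
{
	return bytes / hypercall_sec;
}
```

At 5us per call (200,000 calls per second), 128,000 bytes per call gives 25.6e9 bytes/s, the 25.6 GByte per second above; 512 bytes per call gives 102.4e6 bytes/s, i.e. the quoted ~100 MByte per second per CPU.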
* Re: [PATCH/RFC 6/9] virtual block device driver [not found] ` <46499DE9.9090202-tA70FqPdS9bQT0dZR+AlfA@public.gmane.org> @ 2007-05-16 10:01 ` Avi Kivity 0 siblings, 0 replies; 104+ messages in thread From: Avi Kivity @ 2007-05-16 10:01 UTC (permalink / raw) To: carsteno-tA70FqPdS9bQT0dZR+AlfA Cc: kvm-devel-5NWGOfrQmneRv+LV9MX5uipxlwaOVQ5f@public.gmane.org, Christian Borntraeger, mschwid2-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8 Carsten Otte wrote: > >>> System call latency adds to the in-kernel approach here. >> I don't understand this. > What I meant to state was: If the host side of the block driver runs > in userspace, we have the extra latency to leave the kernel system > call context, compute on behalf of the user process, and do another > system call (to drive the IO). This extra overhead does not show when > handling IO requests from the guest in the kernel. > Well, this argument seems to be in favor of not using an fd ;) Actually, an fd is usable when storing a disk in a raw file. But qemu supports non-raw (formatted) disk images, which have additional features like snapshots. The fd alone does not contain enough information. > >> I/O may be slow, but you can have a lot more disks than cpus. >> >> For example, if an I/O takes 1ms, and you have 100 disks, then you >> can issue 100K IOPS. With one hypercall per request, that's ~50% of >> a cpu (at about 5us per hypercall that goes all the way to >> userspace). That's not counting the overhead of calling io_submit(). > Even when a hypercall round-trip takes as long as 5us, and even if you > have 512byte per biovec only (we use 4k blocksize), I don't see how > this gets a performance problem: > With linear read/write you get 200.000 hypercalls per second with 128 > kbyte per hypercall. That's 25.6 GByte per second per CPU. With random > read (worst case: 512 byte per hypercall) you still get 100 MByte per > second per CPU. There are tighter bottlenecks in the IO hardware afaics. > If all you do is I/O, sure. 
If you want to limit I/O cpu overhead to 10%, the raw bandwidth becomes 10 MB/s/cpu. (bandwidth isn't a good measure for random I/O, IOPS is the interesting metric) Both the guest and host will favor batched requests. It's a shame to deny that because of a simplistic protocol. -- error compiling committee.c: too many arguments to function ^ permalink raw reply [flat|nested] 104+ messages in thread
* Re: [PATCH/RFC 6/9] virtual block device driver [not found] ` <1178904963.25135.33.camel-WIxn4w2hgUz3YA32ykw5MLlKpX0K8NHHQQ4Iyu8u01E@public.gmane.org> 2007-05-14 11:49 ` Avi Kivity @ 2007-05-14 11:52 ` Avi Kivity [not found] ` <46484D84.3060601-atKUWr5tajBWk0Htik3J/w@public.gmane.org> 1 sibling, 1 reply; 104+ messages in thread From: Avi Kivity @ 2007-05-14 11:52 UTC (permalink / raw) To: Carsten Otte Cc: kvm-devel-5NWGOfrQmneRv+LV9MX5uipxlwaOVQ5f@public.gmane.org, Christian Borntraeger, Martin Schwidefsky Carsten Otte wrote: > From: Carsten Otte <cotte-tA70FqPdS9bQT0dZR+AlfA@public.gmane.org> > > This driver provides access to virtual block devices. It does use its own > make_request function which passes the bio to a workqueue thread. The workqueue > thread does use the diagnose hypervisor call to call the hosting Linux. > The hypervisor code in host userspace does use io_submit to initiate the IO. > Once the IO is done, the host will use io_getevents and then generate an > interrupt to the guest. The interrupt handler calls bio_endio. > This device driver is currently architecture dependent. We intend to move the > host API to hypercall instead of the diagnose instruction. Please review. > > Oh. Why not use Xen's pending block driver? It probably has everything needed. -- error compiling committee.c: too many arguments to function ^ permalink raw reply [flat|nested] 104+ messages in thread
* Re: [PATCH/RFC 6/9] virtual block device driver [not found] ` <46484D84.3060601-atKUWr5tajBWk0Htik3J/w@public.gmane.org> @ 2007-05-14 13:26 ` Carsten Otte 0 siblings, 0 replies; 104+ messages in thread From: Carsten Otte @ 2007-05-14 13:26 UTC (permalink / raw) To: Avi Kivity Cc: kvm-devel-5NWGOfrQmneRv+LV9MX5uipxlwaOVQ5f@public.gmane.org, Christian Borntraeger, mschwid2-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8 Avi Kivity wrote: > Oh. Why not use Xen's pending block driver? It probably has everything > needed. We're not too eager to have our own device drivers become the solution of choice. I haven't looked at it so far, will do. so long, Carsten ^ permalink raw reply [flat|nested] 104+ messages in thread
* [PATCH/RFC 7/9] Virtual network guest device driver [not found] ` <1178903957.25135.13.camel-WIxn4w2hgUz3YA32ykw5MLlKpX0K8NHHQQ4Iyu8u01E@public.gmane.org> ` (4 preceding siblings ...) 2007-05-11 17:36 ` [PATCH/RFC 6/9] virtual block device driver Carsten Otte @ 2007-05-11 17:36 ` Carsten Otte [not found] ` <1178904965.25135.34.camel-WIxn4w2hgUz3YA32ykw5MLlKpX0K8NHHQQ4Iyu8u01E@public.gmane.org> 2007-05-11 17:36 ` [PATCH/RFC 8/9] Virtual network host switch support Carsten Otte 2007-05-11 17:36 ` [PATCH/RFC 9/9] Fix system<->user misaccount of interpreted execution Carsten Otte 7 siblings, 1 reply; 104+ messages in thread From: Carsten Otte @ 2007-05-11 17:36 UTC (permalink / raw) To: kvm-devel-5NWGOfrQmneRv+LV9MX5uipxlwaOVQ5f@public.gmane.org Cc: Christian Borntraeger, Martin Schwidefsky From: Christian Borntraeger <cborntra-tA70FqPdS9bQT0dZR+AlfA@public.gmane.org> This is a work-in-progress paravirtualized network driver for Linux systems running under a hypervisor. The basic idea of this network driver is to have a shared memory pool between host and guest. The guest allocates the buffer and registers its memory with the host. There are two queues, one for guest-to-host traffic and one for the reverse direction. The queue state is tracked via a 32-bit atomic: the upper 16 bits indicate the slot where the next packet is put in, the lower 16 bits indicate the slot where the next packet is taken out. Macros are provided to check the queue for empty and full. We only notify the other side when the queue _was_ empty or full. Guest-to-host notification is done via the diagnose instruction. This is basically a hypervisor call instruction, similar to vmmcall and vmcall. For the reverse notification the host sends an interrupt to the guest. Using NAPI, we react on changes of the queue with netif_rx_schedule, netif_wake_queue, netif_stop_queue and netif_rx_complete.
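The 32-bit queue-state encoding just described can be exercised stand-alone. The sketch below mirrors the macros and predicates from the patch's vnet.h (copied here only for illustration):

```c
#define VNET_QUEUE_LEN 80

/* Pack/unpack the 32-bit queue state: upper 16 bits = next slot the
 * producer fills, lower 16 bits = next slot the consumer drains. */
#define __nextx(val) (((val) & 0xffff0000) >> 16)
#define __nextr(val) ((val) & 0xffff)
#define __mkxr(x, r) ((((x) & 0xffff) << 16) | ((r) & 0xffff))

/* Queue is empty when producer and consumer point at the same slot. */
static int vnet_q_empty(int val)
{
	return __nextx(val) == __nextr(val);
}

/* Queue is full when the producer is one slot behind the consumer,
 * modulo VNET_QUEUE_LEN (one slot is kept unused to tell the two
 * states apart). */
static int vnet_q_full(int val)
{
	if (__nextx(val) == __nextr(val) - 1)
		return 1;
	if (__nextr(val) == 0 && __nextx(val) == VNET_QUEUE_LEN - 1)
		return 1;
	return 0;
}
```

Keeping one slot free is what lets a single 32-bit word distinguish "empty" from "full" without a separate counter.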
As the host only sends an interrupt if the queue was empty, we also need to check for a race in the poll function and use netif_rx_reschedule. Otherwise we would miss a wakeup and sleep forever. Our s390 network drivers support cards that do IP packets only and provide no MAC header at all. Therefore, the driver currently supports a layer2 based mode (like ethernet) and a layer3 mode (we claim to be ptp) depending on the host. So we have several prototypes and device drivers for paravirtualized network: KVM, Xen, our prototype, lguest... In the long term we want to have one driver to rule^w drive them all, right? This driver currently has some s390-specific aspects. I think we could get rid of the diagnose calls with an architecture-defined hypervisor call. Please review, comments on the design are very welcome. Signed-off-by: Christian Borntraeger <cborntra-tA70FqPdS9bQT0dZR+AlfA@public.gmane.org> Signed-off-by: Carsten Otte <cotte-tA70FqPdS9bQT0dZR+AlfA@public.gmane.org> --- drivers/s390/guest/Makefile | 2 drivers/s390/guest/vnet.h | 147 +++++++++++ drivers/s390/guest/vnet_guest.c | 509 ++++++++++++++++++++++++++++++++++++++++ drivers/s390/guest/vnet_guest.h | 111 ++++++++ drivers/s390/net/Kconfig | 9 5 files changed, 777 insertions(+), 1 deletion(-) Index: linux-2.6.21/drivers/s390/guest/vnet.h =================================================================== --- /dev/null +++ linux-2.6.21/drivers/s390/guest/vnet.h @@ -0,0 +1,147 @@ +/* + * vnet - virtual networking for guest systems + * + * Copyright IBM Corp.
2007 + * Authors: Carsten Otte <cotte-tA70FqPdS9bQT0dZR+AlfA@public.gmane.org> + * Christian Borntraeger <borntrae-tA70FqPdS9bQT0dZR+AlfA@public.gmane.org> + * + */ + +#ifndef __VNET_H +#define __VNET_H +#include <linux/crc32.h> +#include <linux/ioctl.h> +#include <linux/if_ether.h> +#include <linux/netdevice.h> +#include <asm/atomic.h> +#include <asm/page.h> + +#define VNET_MAJOR 12 //COFIXME + +#define VNET_QUEUE_LEN 80 // careful, vnet_control must be < PAGE +#define VNET_BUFFER_SIZE 65536 +#define VNET_BUFFER_ORDER get_order(VNET_BUFFER_SIZE) +#define VNET_BUFFER_PAGES (((VNET_BUFFER_SIZE-1)>>PAGE_SHIFT)+1) + +#define VNET_TIMEOUT (5*HZ) + +#define VNET_IRQ_START_RX 0 +#define VNET_IRQ_START_TX 1 + +struct vnet_info { + int linktype; + int maxmtu; +}; + +#define VNET_IOCTL_ID 'Z' +#define VNET_REGISTER_CTL _IOW(VNET_IOCTL_ID ,0, unsigned long long) +#define VNET_INTERRUPT _IOW(VNET_IOCTL_ID, 1, int) +#define VNET_INFO _IOR(VNET_IOCTL_ID, 2, struct vnet_info*) + +#define QUEUE_IS_EMPTY 1 +#define QUEUE_WAS_EMPTY 2 +#define QUEUE_IS_FULL 4 +#define QUEUE_WAS_FULL 8 + + +struct xmit_buffer { + __be16 len; + __be16 proto; + void *data; +}; + +struct vnet_control { + int buffer_size; + char mac[ETH_ALEN]; + atomic_t p2smit __attribute__((__aligned__(SMP_CACHE_BYTES))); + atomic_t s2pmit __attribute__((__aligned__(SMP_CACHE_BYTES))); + struct xmit_buffer p2sbufs[VNET_QUEUE_LEN] __attribute__((__aligned__(SMP_CACHE_BYTES))); + struct xmit_buffer s2pbufs[VNET_QUEUE_LEN] __attribute__((__aligned__(SMP_CACHE_BYTES))); +}; + + +#define __nextx(val) (((val) & 0xffff0000)>>16) +#define __nextr(val) ((val) & 0xffff) +#define __mkxr(x,r) ((((x) & 0xffff)<<16) | ((r) & 0xffff)) + +static inline int +vnet_q_empty(int val) +{ + return (__nextx(val) == __nextr(val)); +} + +static inline int +vnet_q_half(int val) +{ + if (__nextx(val) == __nextr(val)) + return 0; + if (__nextx(val) < __nextr(val)) { + if ((__nextr(val) - __nextx(val)) < VNET_QUEUE_LEN / 2) + return 1; + } else { + 
if ((__nextx(val) - __nextr(val)) > VNET_QUEUE_LEN / 2) + return 1; + } + return 0; +} + + +static inline int +vnet_q_full(int val) +{ + if (__nextx(val) == __nextr(val) - 1) + return 1; + if ((__nextr(val) == 0) && (__nextx(val) == VNET_QUEUE_LEN - 1)) + return 1; + return 0; +} + +/* returns values: + * bit RX_QUEUE_NOW_EMPTY (1) + * and/or RX_QUEUE_WAS_FULL (2) + */ +static inline int +vnet_rx_packet(atomic_t *ato) +{ + int oldval, newval, rc; + + do { + oldval = atomic_read(ato); + + //do we wrap? + if (__nextr(oldval)+1 == VNET_QUEUE_LEN) + newval = __mkxr(__nextx(oldval),0); + else + newval = __mkxr(__nextx(oldval),__nextr(oldval)+1); + } while (atomic_cmpxchg(ato, oldval, newval) != oldval); + + rc = 0; + if (vnet_q_empty(newval)) + rc |= QUEUE_IS_EMPTY; + if (vnet_q_full(oldval)) + rc |= QUEUE_WAS_FULL; + return rc; +} + +static inline int +vnet_tx_packet(atomic_t *ato) +{ + int oldval, newval, rc; + + do { + oldval = atomic_read(ato); + + //do we wrap? + if (__nextx(oldval)+1 == VNET_QUEUE_LEN) + newval = __mkxr(0, __nextr(oldval)); + else + newval = __mkxr(__nextx(oldval)+1,__nextr(oldval)); + } while (atomic_cmpxchg(ato, oldval, newval) != oldval); + rc = 0; + if (vnet_q_empty(oldval)) + rc |= QUEUE_WAS_EMPTY; + if (vnet_q_full(newval)) + rc |= QUEUE_IS_FULL; + return rc; +} +#endif Index: linux-2.6.21/drivers/s390/guest/vnet_guest.c =================================================================== --- /dev/null +++ linux-2.6.21/drivers/s390/guest/vnet_guest.c @@ -0,0 +1,509 @@ +/* + * vnet - virtual network device driver + * + * Copyright IBM Corp. 
2007 + * Author: Carsten Otte <cotte-tA70FqPdS9bQT0dZR+AlfA@public.gmane.org> + * Christian Borntraeger <borntrae-tA70FqPdS9bQT0dZR+AlfA@public.gmane.org> + * + */ + +#include <linux/kernel.h> +#include <linux/list.h> +#include <linux/module.h> +#include <linux/netdevice.h> +#include <linux/inetdevice.h> +#include <linux/etherdevice.h> +#include <linux/if.h> +#include <linux/if_ether.h> +#include <linux/if_arp.h> +#include <linux/rtnetlink.h> +#include <linux/hardirq.h> +#include <linux/mm.h> +#include <linux/notifier.h> +#include <asm/s390_ext.h> +#include <asm/atomic.h> +#include <asm/vdev.h> + +#include "vnet.h" +#include "vnet_guest.h" + +static LIST_HEAD(vnet_devices); +static rwlock_t vnet_devices_lock = RW_LOCK_UNLOCKED; + + +static int +vnet_net_open(struct net_device *dev) +{ + struct vnet_guest_device *guest; + struct vnet_control *control; + + guest = netdev_priv(dev); + control = guest->control; + atomic_set(&control->s2pmit, 0); + netif_start_queue(dev); + return 0; +} + +static int +vnet_net_stop(struct net_device *dev) +{ + netif_stop_queue(dev); + return 0; +} + +static void vnet_net_tx_timeout(struct net_device *dev) +{ + struct vnet_guest_device *zk = netdev_priv(dev); + struct vnet_control *control = zk->control; + + printk(KERN_ERR "problems in xmit for device %s\n Resetting...\n", + dev->name); + atomic_set(&control->p2smit, 0); + atomic_set(&control->s2pmit, 0); + diag_vnet_send_interrupt(zk->hostfd, VNET_IRQ_START_RX); + netif_wake_queue(dev); +} + + +static void skbcopy(char *dest, struct sk_buff *skb) +{ + int i; + + memcpy(dest, skb->data, skb_headlen(skb)); + dest += skb_headlen(skb); + for (i = 0; i < skb_shinfo(skb)->nr_frags; i++) { + skb_frag_t *frag = &skb_shinfo(skb)->frags[i]; + memcpy(dest, page_address(frag->page) + + frag->page_offset, frag->size); + dest+=frag->size; + } +} + +static int +vnet_net_xmit(struct sk_buff *skb, struct net_device *dev) +{ + struct vnet_guest_device *zk = netdev_priv(dev); + struct vnet_control 
*control = zk->control; + struct xmit_buffer *buf; + int pkid; + int buffer_status; + + if (!spin_trylock(&zk->lock)) + return NETDEV_TX_LOCKED; + if(vnet_q_full(atomic_read(&control->p2smit))) { + netif_stop_queue(dev); + goto full; + } + pkid = __nextx(atomic_read(&control->p2smit)); + buf = &control->p2sbufs[pkid]; + buf->len = skb->len; + buf->proto = skb->protocol; + skbcopy(buf->data, skb); + buffer_status = vnet_tx_packet(&control->p2smit); + spin_unlock(&zk->lock); + zk->stats.tx_packets++; + zk->stats.tx_bytes += skb->len; + dev_kfree_skb_any(skb); + dev->trans_start = jiffies; + if (buffer_status & QUEUE_WAS_EMPTY) + diag_vnet_send_interrupt(zk->hostfd, VNET_IRQ_START_RX); + if (!(buffer_status & QUEUE_IS_FULL)) + return NETDEV_TX_OK; + netif_stop_queue(dev); + spin_lock(&zk->lock); +full: + if (!vnet_q_full(atomic_read(&control->p2smit))) + netif_start_queue(dev); + spin_unlock(&zk->lock); + return NETDEV_TX_OK; +} + +static int +vnet_l2_poll(struct net_device *dev, int *budget) +{ + struct vnet_guest_device *zk = netdev_priv(dev); + struct vnet_control *control = zk->control; + struct xmit_buffer *buf; + struct sk_buff *skb; + int pkid, count, numpackets = min(dev->quota, *budget); + int buffer_status; + + if (vnet_q_empty(atomic_read(&control->s2pmit))) { + count = 0; + goto empty; + } +loop: + count = 0; + while(numpackets) { + pkid = __nextr(atomic_read(&control->s2pmit)); + buf = &control->s2pbufs[pkid]; /* kernel pointer!*/ + skb = dev_alloc_skb(buf->len); + if (likely(skb)) { + memcpy(skb_put(skb, buf->len), buf->data, buf->len); + skb->dev = dev; + skb->protocol = eth_type_trans(skb, dev); + zk->stats.rx_packets++; + zk->stats.rx_bytes += buf->len; + netif_receive_skb(skb); + numpackets--; + (*budget)--; + dev->quota--; + count++; + } else + zk->stats.rx_dropped++; + buffer_status = vnet_rx_packet(&control->s2pmit); + if (buffer_status & QUEUE_WAS_FULL) + diag_vnet_send_interrupt(zk->hostfd, + VNET_IRQ_START_TX); + if (buffer_status & 
QUEUE_IS_EMPTY) + goto empty; + } + return 1; /* please ask us again */ + empty: + netif_rx_complete(dev); + /* we might have raced against a wakup */ + if (!vnet_q_empty(atomic_read(&control->s2pmit))) { + if (netif_rx_reschedule(dev, count)) + goto loop; + } + return 0; /* we're done for now */ +} + + +static int +vnet_l3_poll(struct net_device *dev, int *budget) +{ + struct vnet_guest_device *zk = dev->priv; + struct vnet_control *control = zk->control; + struct xmit_buffer *buf; + struct sk_buff *skb; + int pkid, count, numpackets = min(dev->quota, *budget); + int buffer_status; + + if (vnet_q_empty(atomic_read(&control->s2pmit))) { + count = 0; + goto empty; + } +loop: + count = 0; + while(numpackets) { + pkid = __nextr(atomic_read(&control->s2pmit)); + buf = &control->s2pbufs[pkid]; /*kernel pointer*/ + skb = dev_alloc_skb(buf->len + NET_IP_ALIGN); + if (likely(skb)) { + skb_reserve(skb, NET_IP_ALIGN); + memcpy(skb_put(skb, buf->len), buf->data, buf->len); + skb->dev = dev; + skb->protocol = buf->proto; + skb->mac.raw = skb->data; + zk->stats.rx_packets++; + zk->stats.rx_bytes += buf->len; + netif_receive_skb(skb); + numpackets--; + (*budget)--; + dev->quota--; + count++; + } else + zk->stats.rx_dropped++; + buffer_status = vnet_rx_packet(&control->s2pmit); + if (buffer_status & QUEUE_WAS_FULL) + diag_vnet_send_interrupt(zk->hostfd, + VNET_IRQ_START_TX); + if (buffer_status & QUEUE_IS_EMPTY) + goto empty; + } + return 1; /* please ask us again */ + empty: + netif_rx_complete(dev); + /* we might have raced against a wakup */ + if (!vnet_q_empty(atomic_read(&control->s2pmit))) { + if (netif_rx_reschedule(dev, count)) + goto loop; + } + return 0; /* we're done for now */ +} + +static struct net_device_stats * +vnet_net_stats(struct net_device *dev) +{ + struct vnet_guest_device *zk = netdev_priv(dev); + return &zk->stats; +} + +static int +vnet_net_change_mtu(struct net_device *dev, int new_mtu) +{ + if (new_mtu <= ETH_ZLEN) + return -ERANGE; + if (new_mtu > 
VNET_BUFFER_SIZE-ETH_HLEN) + return -ERANGE; + dev->mtu = new_mtu; + return 0; +} + + +static void +__vnet_common_init(struct net_device *dev) +{ + dev->open = vnet_net_open; + dev->stop = vnet_net_stop; + dev->hard_start_xmit = vnet_net_xmit; + dev->get_stats = vnet_net_stats; + dev->tx_timeout = vnet_net_tx_timeout; + dev->watchdog_timeo = VNET_TIMEOUT; + dev->change_mtu = vnet_net_change_mtu; + dev->weight = 64; + dev->features |= NETIF_F_SG | NETIF_F_LLTX; +} + +static void +__vnet_layer3_init(struct net_device *dev) +{ + dev->mtu = ETH_DATA_LEN; + dev->tx_queue_len = 1000; + dev->flags = IFF_BROADCAST|IFF_MULTICAST|IFF_NOARP; + dev->type = ARPHRD_PPP; + dev->mtu = 1492; + dev->poll = vnet_l3_poll; + __vnet_common_init(dev); +} + +static void +__vnet_layer2_init(struct net_device *dev) +{ + ether_setup(dev); + dev->mtu = 1492; + dev->poll = vnet_l2_poll; + __vnet_common_init(dev); +} + +static struct vnet_guest_device * +__get_vnet_dev_by_fd(int fd) +{ + struct vnet_guest_device *zk; + + read_lock(&vnet_devices_lock); + list_for_each_entry(zk, &vnet_devices, lh) { + if (zk->hostfd == fd) + goto found; + } + zk = NULL; + found: + read_unlock (&vnet_devices_lock); + return zk; +} + +void vnet_ext_handler(__u16 code) +{ + unsigned int type = S390_lowcore.ext_params & 3; + unsigned int fd = S390_lowcore.ext_params >> 2; + + struct vnet_guest_device *zk = __get_vnet_dev_by_fd(fd); + + BUG_ON(!zk); + switch (type) { + case VNET_IRQ_START_RX: + netif_rx_schedule(zk->netdev); + break; + case VNET_IRQ_START_TX: + netif_wake_queue(zk->netdev); + break; + default: + BUG(); + } +} + +static void +vnet_delete_device(struct vnet_guest_device *zd) +{ + int i; + unsigned long flags; + + if (zd->hostfd >= 0) + diag_vnet_release(zd->hostfd); + write_lock_irqsave(&vnet_devices_lock, flags); + list_del(&zd->lh); + write_unlock_irqrestore(&vnet_devices_lock, flags); + + for (i=0; i<VNET_QUEUE_LEN; i++) { + if (zd->control->s2pbufs[i].data) { + free_pages((unsigned long) 
zd->control->s2pbufs[i].data, VNET_BUFFER_ORDER);
+			zd->control->s2pbufs[i].data = NULL;
+		}
+		if (zd->control->p2sbufs[i].data) {
+			free_pages((unsigned long) zd->control->p2sbufs[i].data,
+				   VNET_BUFFER_ORDER);
+			zd->control->p2sbufs[i].data = NULL;
+		}
+	}
+	if (zd->control) {
+		kfree(zd->control);
+		zd->control = NULL;
+	}
+	if (zd->netdev) /* CAUTION: this also frees zd */
+		free_netdev(zd->netdev);
+}
+
+
+static int vnet_device_alloc(struct vnet_guest_device *zd)
+{
+	int i;
+
+	zd->control = kzalloc(sizeof(struct vnet_control), GFP_KERNEL);
+	if (!zd->control)
+		return -ENOMEM;
+	for (i = 0; i < VNET_QUEUE_LEN; i++) {
+		zd->control->s2pbufs[i].data = (void *)
+			__get_free_pages(GFP_KERNEL, VNET_BUFFER_ORDER);
+		if (!zd->control->s2pbufs[i].data)
+			return -ENOMEM;
+		zd->control->p2sbufs[i].data = (void *)
+			__get_free_pages(GFP_KERNEL, VNET_BUFFER_ORDER);
+		if (!zd->control->p2sbufs[i].data)
+			return -ENOMEM;
+	}
+	return 0;
+}
+
+static int vnet_probe(struct vdev *vdev)
+{
+	int ret;
+	long flags;
+	struct vnet_guest_device *zd;
+	struct net_device *netdev;
+	int linktype;
+
+	if (strlen(vdev->symname) > IFNAMSIZ) {
+		printk(KERN_ERR "vnet: %s is too long for a network device, "
+		       "discarding it\n", vdev->symname);
+		return -EINVAL;
+	}
+	ret = diag_vnet_info(vdev->hostid, &linktype);
+	if (ret)
+		return ret;
+	if (linktype == 3)
+		netdev = alloc_netdev(sizeof(*zd), vdev->symname,
+				      __vnet_layer3_init);
+	else
+		netdev = alloc_netdev(sizeof(*zd), vdev->symname,
+				      __vnet_layer2_init);
+	if (!netdev)
+		return -ENOMEM;
+	zd = netdev_priv(netdev);
+	zd->netdev = netdev;
+
+	ret = vnet_device_alloc(zd);
+	if (ret)
+		goto out;
+	zd->control->buffer_size = VNET_BUFFER_SIZE;
+	zd->linktype = linktype;
+	memcpy(zd->ifname, vdev->symname, IFNAMSIZ);
+	INIT_LIST_HEAD(&zd->lh);
+
+	write_lock_irqsave(&vnet_devices_lock, flags);
+	zd->hostfd = diag_vnet_open(vdev->hostid, zd->control);
+	if (zd->hostfd < 0) {
+		write_unlock_irqrestore(&vnet_devices_lock, flags);
+		goto out;
+	}
+	list_add_tail(&zd->lh, &vnet_devices);
+	write_unlock_irqrestore(&vnet_devices_lock, flags);
+
+	/* host is ready, now we can set up our local network interface */
+	rtnl_lock();
+	memcpy(netdev->dev_addr, zd->control->mac, 6);
+	spin_lock_init(&zd->lock);
+
+	if (!(ret = register_netdevice(zd->netdev))) {
+		/* good case */
+		rtnl_unlock();
+		printk("vnet: Successfully registered %s\n", vdev->symname);
+		return 0;
+	}
+	printk("vnet: Could not register network interface %s\n", vdev->symname);
+	rtnl_unlock();
+ out:
+	vnet_delete_device(zd);
+	return ret;
+}
+
+static struct vdev_driver vnet_driver = {
+	.name = "vnet",
+	.owner = THIS_MODULE,
+	.vdev_type = VDEV_TYPE_NET,
+	.probe = vnet_probe,
+};
+
+static int vnet_ip_event(struct notifier_block *this,
+			 unsigned long event, void *ptr)
+{
+	struct in_ifaddr *ifa = (struct in_ifaddr *) ptr;
+	struct net_device *dev = (struct net_device *) ifa->ifa_dev->dev;
+	struct vnet_guest_device *zk;
+
+	read_lock(&vnet_devices_lock);
+	list_for_each_entry(zk, &vnet_devices, lh)
+		if (zk->netdev == dev) {
+			read_unlock(&vnet_devices_lock);
+			if (event == NETDEV_UP)
+				diag_vnet_ip(1, ifa->ifa_address,
+					     ifa->ifa_mask,
+					     ifa->ifa_broadcast);
+			if (event == NETDEV_DOWN)
+				diag_vnet_ip(0, ifa->ifa_address,
+					     ifa->ifa_mask,
+					     ifa->ifa_broadcast);
+			return NOTIFY_OK;
+		}
+	read_unlock(&vnet_devices_lock);
+	return NOTIFY_DONE;
+}
+
+static struct notifier_block vnet_ip_notifier = {
+	vnet_ip_event,
+	NULL
+};
+
+/* module related section */
+int
+vnet_guest_init(void)
+{
+	int ret;
+
+	if (!MACHINE_IS_GUEST)
+		return -ENODEV;
+	BUILD_BUG_ON(sizeof(struct vnet_control) > PAGE_SIZE);
+	register_external_interrupt(0x1236, vnet_ext_handler);
+	if (register_inetaddr_notifier(&vnet_ip_notifier)) {
+		printk(KERN_ERR "vnet: Could not register ip callback\n");
+		unregister_external_interrupt(0x1236, vnet_ext_handler);
+	}
+	ret = vdev_driver_register(&vnet_driver);
+	if (ret) {
+		printk(KERN_ERR "vnet: Could not register driver\n");
+		unregister_external_interrupt(0x1236, vnet_ext_handler);
+		unregister_inetaddr_notifier(&vnet_ip_notifier);
+		return ret;
+	}
+	return ret;
+}
+
+void
+vnet_guest_exit(void)
+{
+	struct vnet_guest_device *zk;
+	struct vnet_guest_device *temp;
+
+	unregister_external_interrupt(0x1236, vnet_ext_handler);
+	unregister_inetaddr_notifier(&vnet_ip_notifier);
+	rtnl_lock();
+	write_lock(&vnet_devices_lock);
+	list_for_each_entry_safe(zk, temp, &vnet_devices, lh) {
+		netif_stop_queue(zk->netdev);
+		unregister_netdevice(zk->netdev);
+		vnet_delete_device(zk);
+	}
+	write_unlock(&vnet_devices_lock);
+	rtnl_unlock();
+}
+
+module_init(vnet_guest_init);
+module_exit(vnet_guest_exit);
+MODULE_DESCRIPTION("VNET: Virtual network driver");
+MODULE_AUTHOR("Christian Borntraeger <borntrae-tA70FqPdS9bQT0dZR+AlfA@public.gmane.org>");
+MODULE_LICENSE("GPL");
Index: linux-2.6.21/drivers/s390/guest/vnet_guest.h
===================================================================
--- /dev/null
+++ linux-2.6.21/drivers/s390/guest/vnet_guest.h
@@ -0,0 +1,111 @@
+/*
+ * vnet - zlive insular communication knack
+ *
+ * Copyright (C) 2005 IBM Corporation
+ * Author: Carsten Otte <cotte-tA70FqPdS9bQT0dZR+AlfA@public.gmane.org>
+ * Author: Christian Borntraeger <cborntra-tA70FqPdS9bQT0dZR+AlfA@public.gmane.org>
+ *
+ */
+
+#ifndef __VNET_GUEST_H
+#define __VNET_GUEST_H
+
+#include <linux/netdevice.h>
+#include <linux/workqueue.h>
+#include <linux/spinlock.h>
+#include "vnet.h"
+
+struct vnet_guest_device {
+	struct list_head lh;
+	int hostfd;
+	char ifname[IFNAMSIZ];
+	struct net_device *netdev;
+	struct vnet_control *control;
+	struct net_device_stats stats;
+	struct work_struct work;
+	int linktype;
+	spinlock_t lock;
+};
+
+static inline int
+diag_vnet_info(char *ifname, int *linktype)
+{
+	register char *__arg1 asm("2") = ifname;
+	register int *__arg2 asm("3") = linktype;
+	register int __svcres asm("2");
+	int __res;
+
+	__asm__ __volatile__ (
+		"diag 0,0,0x0e"
+		: "=d" (__svcres)
+		: "0" (__arg1),
+		  "d" (__arg2)
+		: "cc", "memory");
+	__res = __svcres;
+	return __res;
+}
+
+static inline int
+diag_vnet_open(char *ifname, struct vnet_control *ctrl)
+{
+	register char *__arg1 asm("2") = ifname;
+	register struct vnet_control *__arg2 asm("3") = ctrl;
+	register int __svcres asm("2");
+	int __res;
+
+	__asm__ __volatile__ (
+		"diag 0,0,0x0f"
+		: "=d" (__svcres)
+		: "0" (__arg1),
+		  "d" (__arg2)
+		: "cc", "memory");
+	__res = __svcres;
+	return __res;
+}
+
+static inline void
+diag_vnet_send_interrupt(int fd, int type)
+{
+	register long __arg1 asm("2") = fd;
+	register long __arg2 asm("3") = type;
+
+	__asm__ __volatile__ (
+		"diag 0,0,0x11"
+		: : "d" (__arg1),
+		    "d" (__arg2)
+		: "cc", "memory");
+}
+
+static inline void
+diag_vnet_release(int fd)
+{
+	register long __arg1 asm("2") = fd;
+
+	__asm__ __volatile__ (
+		"diag 0,0,0x13"
+		: : "d" (__arg1)
+		: "cc", "memory");
+}
+
+static inline int
+diag_vnet_ip(int add, u32 addr, u32 mask, u32 broadcast)
+{
+	register long __arg1 asm("2") = add;
+	register long __arg2 asm("3") = addr;
+	register long __arg3 asm("4") = mask;
+	register long __arg4 asm("5") = broadcast;
+	register int __svcres asm("2");
+	int __res;
+
+	__asm__ __volatile__ (
+		"diag 0,0,0x1f"
+		: "=d" (__svcres)
+		: "d" (__arg1),
+		  "d" (__arg2),
+		  "d" (__arg3),
+		  "d" (__arg4)
+		: "cc", "memory");
+	__res = __svcres;
+	return __res;
+}
+#endif
Index: linux-2.6.21/drivers/s390/guest/Makefile
===================================================================
--- linux-2.6.21.orig/drivers/s390/guest/Makefile
+++ linux-2.6.21/drivers/s390/guest/Makefile
@@ -5,4 +5,4 @@
 obj-$(CONFIG_GUEST_CONSOLE) += guest_console.o guest_tty.o
 obj-$(CONFIG_S390_GUEST) += vdev.o vdev_device.o
 obj-$(CONFIG_VDISK) += vdisk.o vdisk_blk.o
-
+obj-$(CONFIG_VNET_GUEST) += vnet_guest.o
Index: linux-2.6.21/drivers/s390/net/Kconfig
===================================================================
--- linux-2.6.21.orig/drivers/s390/net/Kconfig
+++
linux-2.6.21/drivers/s390/net/Kconfig
@@ -86,4 +86,13 @@
 config CCWGROUP
 	tristate
 	default (LCS || CTC || QETH)
+config VNET_GUEST
+	tristate "virtual networking support (GUEST)"
+	depends on S390_GUEST
+	help
+	  This is the guest part of the vnet guest network connection.
+	  Say Y if you plan to run this kernel as a guest with a network
+	  connection.
+	  If you're not using host/guest support, say N.
+
 endmenu

-------------------------------------------------------------------------
This SF.net email is sponsored by DB2 Express
Download DB2 Express C - the FREE version of DB2 express and take
control of your XML. No limits. Just data. Click to get it now.
http://sourceforge.net/powerbar/db2/

^ permalink raw reply	[flat|nested] 104+ messages in thread
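The l2/l3 poll routines in the patch above rely on queue helpers (vnet_q_empty(), __nextr(), vnet_rx_packet()) whose definitions are not part of this excerpt. The sketch below models one plausible encoding -- a read index and a fill count packed into the single atomic word the driver samples with atomic_read() -- purely to illustrate the control flow. The bit layout, the QUEUE_WAS_FULL/QUEUE_IS_EMPTY values, and the helper bodies are assumptions for illustration, not the patch's actual definitions.

```c
#include <assert.h>

#define VNET_QUEUE_LEN 16

#define QUEUE_WAS_FULL 1	/* producer may have been blocked */
#define QUEUE_IS_EMPTY 2	/* nothing left; poll loop should stop */

/* Hypothetical packing: low 8 bits = next read slot, high bits = fill count. */
static inline int __nextr(int state)      { return state & 0xff; }
static inline int q_count(int state)      { return state >> 8; }
static inline int vnet_q_empty(int state) { return q_count(state) == 0; }

/*
 * Consume one packet from the host-to-guest queue and return the status
 * flags the poll loop checks (full before the take, empty after it).
 */
static inline int vnet_rx_packet(int *state)
{
	int count = q_count(*state);
	int rd = __nextr(*state);
	int flags = 0;

	if (count == VNET_QUEUE_LEN)
		flags |= QUEUE_WAS_FULL;
	count--;
	rd = (rd + 1) % VNET_QUEUE_LEN;
	*state = (count << 8) | rd;
	if (count == 0)
		flags |= QUEUE_IS_EMPTY;
	return flags;
}
```

With this encoding, the poll loop's QUEUE_WAS_FULL check is what triggers the VNET_IRQ_START_TX kick back to the host, since the host may have stopped producing while the queue was full.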
* Re: [PATCH/RFC 7/9] Virtual network guest device driver
@ 2007-05-11 19:44 ` ron minnich
From: ron minnich @ 2007-05-11 19:44 UTC (permalink / raw)
To: Carsten Otte
Cc: Jimi Xenidis, jmk-zzFmDc4TPjtKvsKVC3L/VUEOCMrvLtNR@public.gmane.org, kvm-devel-5NWGOfrQmneRv+LV9MX5uipxlwaOVQ5f@public.gmane.org, Christian Borntraeger, Martin Schwidefsky

Let me ask what may seem to be a naive question to the linux world.

I see you are doing a lot of solid work on adding block and network devices. The code for block and network devices is implemented in different ways. I've also seen this difference of interface/implementation on Xen. Hence my question:

Why are the INTERFACES to the block and network devices different? I can understand that the implementation -- what goes on "inside the box" -- would be different. But, again, why is the interface to the resource different in each case? Will every distinct type of I/O device end up with a different interface?

These questions doubtless seem naive, I suppose, except I use a system (Plan 9) in which a common interface is in fact used for the different resources. I have been hoping that we could bring this model -- same interface, different resource -- to the inter-vm communications. I would like to at least raise the idea that it could be used on KVM.

Avoiding too much detail: in the Plan 9 world, read and write of data to a disk is via file read and write system calls. Same for a network. Same for the mouse, the window system, the serial port, the console, USB, and so on.
Please see this note from IBM on what is possible: http://domino.watson.ibm.com/library/CyberDig.nsf/0/c6c779bbf1650fa4852570670054f3ca?OpenDocument or http://plan9.escet.urjc.es/iwp9/cready/PROSE_iwp9_2006.pdf

Different resources, same interface. In the hypervisor world, you build one shared memory queue as a basic abstraction. On top of that queue, you run 9P. The provider (network, block device, etc.) provides certain resources to you, the guest domain. The resources have names. A network can look like this, to a kvm guest (this command from a Plan 9 system):

cpu% ls /net/ether0
/net/ether0/0
/net/ether0/1
/net/ether0/2
/net/ether0/addr
/net/ether0/clone
/net/ether0/ifstats
/net/ether0/stats

To get network stats, or do I/O, one simply gains access to the appropriate ring buffer by finding the name, and does the sends and receives via shared memory queues. The I/O operations can be very efficient.

Disk looks like this:

cpu% ls -l /dev/sdC0
--rw-r----- S 0 bootes bootes   104857600 Jan 22 15:49 /dev/sdC0/9fat
--rw-r----- S 0 bootes bootes 65361213440 Jan 22 15:49 /dev/sdC0/arenas
--rw-r----- S 0 bootes bootes           0 Jan 22 15:49 /dev/sdC0/ctl
--rw-r----- S 0 bootes bootes 82348277760 Jan 22 15:49 /dev/sdC0/data
--rw-r----- S 0 bootes bootes 13072242688 Jan 22 15:49 /dev/sdC0/fossil
--rw-r----- S 0 bootes bootes  3268060672 Jan 22 15:49 /dev/sdC0/isect
--rw-r----- S 0 bootes bootes         512 Jan 22 15:49 /dev/sdC0/nvram
--rw-r----- S 0 bootes bootes 82343245824 Jan 22 15:49 /dev/sdC0/plan9
-lrw------- S 0 bootes bootes           0 Jan 22 15:49 /dev/sdC0/raw
--rw-r----- S 0 bootes bootes   536870912 Jan 22 15:49 /dev/sdC0/swap
cpu%

So the disk partitions are "files", with the "data" file being the whole disk. Again, on a hypervisor system, to do I/O, software could create a connection to the "file" and establish the in-memory ring buffer for that partition. This I/O can be very efficient; IBM research is working on zero-copy mechanisms for moving data between domains.
The result is a single, consistent mechanism for accessing all resources from a guest domain. The resources have names, and it is easy to examine their status -- binary interfaces can be minimized. The resources can be provided by in-kernel servers -- Linux drivers -- or by out-of-kernel servers -- processes. Same interface, and yet the implementation of the provider of the resource can be utterly different.

We had hoped to get something like this into Xen. On Xen, for example, the block device and ethernet device interfaces are as different as one could imagine. Disk I/O does not steal pages from the guest; the network does. Disk I/O is in 4k chunks, period, with a bitmap describing which of the 8 512-byte subunits are being sent. The enet device, on read, returns a page with your packet, but also potentially containing bits of other domains' packets too. The interfaces are as dissimilar as they can be, and I see no reason for such a huge variance between what are basically read/write devices.

Another issue is that kvm, in its current form (-24), is beautifully simple. These additions seem to detract from the beauty a bit. Might it be worth taking a little time to consider these ideas in order to preserve the basic elegance of KVM?

So, before we go too far down the Xen-like paravirtualized device route, can we discuss the way this ought to look a bit?

thanks

ron
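To make the "everything is a file" model described above concrete: reading a device statistic then needs nothing beyond open/read plus a trivial text parser -- no ioctl, no device-specific binary structures. The sketch below is hypothetical; the "name: value" line format and the read_stat() helper are illustrative assumptions, and Plan 9's real ifstats format differs in detail.

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/*
 * Parse one "name: value" line of the kind an ifstats-style file might
 * contain, e.g. "in: 12345".  Returns -1 if the name does not match.
 */
long parse_stat(const char *line, const char *name)
{
	size_t n = strlen(name);

	if (strncmp(line, name, n) != 0 || line[n] != ':')
		return -1;
	return strtol(line + n + 1, NULL, 10);
}

/*
 * With a file interface, fetching one statistic is just open/read/scan;
 * the same code works no matter what kind of device backs the file.
 */
long read_stat(const char *path, const char *name)
{
	char line[256];
	long val = -1;
	FILE *f = fopen(path, "r");

	if (!f)
		return -1;
	while (fgets(line, sizeof(line), f)) {
		val = parse_stat(line, name);
		if (val >= 0)
			break;
	}
	fclose(f);
	return val;
}
```

A guest would then call something like read_stat("/net/ether0/ifstats", "in") where today it would issue a device-specific query; the path and stat name here are illustrative.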
* Re: [PATCH/RFC 7/9] Virtual network guest device driver
@ 2007-05-11 20:12 ` Anthony Liguori
From: Anthony Liguori @ 2007-05-11 20:12 UTC (permalink / raw)
To: ron minnich
Cc: Jimi Xenidis, jmk-zzFmDc4TPjtKvsKVC3L/VUEOCMrvLtNR@public.gmane.org, Christian Borntraeger, kvm-devel-5NWGOfrQmneRv+LV9MX5uipxlwaOVQ5f@public.gmane.org, Martin Schwidefsky

ron minnich wrote:
> Avoiding too much detail, in the plan 9 world, read and write of data
> to a disk is via file read and write system calls.

For low-speed devices, I think paravirtualization doesn't make a lot of sense unless it's absolutely required. I don't know enough about s390 to know if it supports things like uarts, but if so, then emulating a uart would in my mind make a lot more sense than a PV console device.

> Same for a network.
> Same for the mouse, the window system, the serial port, the console,
> USB, and so on. Please see this note from IBM on what is possible:
> http://domino.watson.ibm.com/library/CyberDig.nsf/0/c6c779bbf1650fa4852570670054f3ca?OpenDocument
> or http://plan9.escet.urjc.es/iwp9/cready/PROSE_iwp9_2006.pdf
> Different resources, same interface. In the hypervisor world, you
> build one shared memory queue as a basic abstraction. On top of that
> queue, you run 9P. The provider (network, block device, etc.) provides
> certain resources to you, the guest domain. The resources have names.
> A network can look like this, to a kvm guest (this command from a
> Plan 9 system):
> cpu% ls /net/ether0
> /net/ether0/0
> /net/ether0/1
> /net/ether0/2
> /net/ether0/addr
> /net/ether0/clone
> /net/ether0/ifstats
> /net/ether0/stats
>

This smells a bit like XenStore, which I think most will agree was an unmitigated disaster.
This sort of thing gets terribly complicated to deal with in the corner cases. Atomic operation of multiple read/write operations is difficult to express. Moreover, quite a lot of things are naturally expressed as a state machine, which is not straightforward to do in this sort of model. This may have all been figured out in 9P, but it's certainly not a simple thing to get right.

I think a general rule of thumb for a virtualized environment is that the closer you stick to the way hardware tends to do things, the less likely you are to screw yourself up and the easier it will be for other platforms to support your devices. Implementing a full 9P client just to get console access in something like mini-os would be unfortunate. At least the posted s390 console driver behaves roughly like a uart, so it's pretty obvious that it will be easy to implement in any OS that supports uarts already.

Regards,

Anthony Liguori

> To get network stats, or do I/O, one simply gains access to the
> appropriate ring buffer, by finding the name, and does the ring buffer
> sends and receives via shared memory queues. The I/O operations can be
> very efficient.
>
> Disk looks like this:
> cpu% ls -l /dev/sdC0
> --rw-r----- S 0 bootes bootes 104857600 Jan 22 15:49 /dev/sdC0/9fat
> --rw-r----- S 0 bootes bootes 65361213440 Jan 22 15:49 /dev/sdC0/arenas
> --rw-r----- S 0 bootes bootes 0 Jan 22 15:49 /dev/sdC0/ctl
> --rw-r----- S 0 bootes bootes 82348277760 Jan 22 15:49 /dev/sdC0/data
> --rw-r----- S 0 bootes bootes 13072242688 Jan 22 15:49 /dev/sdC0/fossil
> --rw-r----- S 0 bootes bootes 3268060672 Jan 22 15:49 /dev/sdC0/isect
> --rw-r----- S 0 bootes bootes 512 Jan 22 15:49 /dev/sdC0/nvram
> --rw-r----- S 0 bootes bootes 82343245824 Jan 22 15:49 /dev/sdC0/plan9
> -lrw------- S 0 bootes bootes 0 Jan 22 15:49 /dev/sdC0/raw
> --rw-r----- S 0 bootes bootes 536870912 Jan 22 15:49 /dev/sdC0/swap
> cpu%
>
> So the disk partitions are "files", with the "data" file being the
> whole disk.
> Again, on a hypervisor system, to do I/O, software could
> create a connection to the "file" and establish the in-memory ring
> buffer, for that partition. This I/O can be very efficient; IBM
> research is working on zero-copy mechanisms for moving data between
> domains.
>
> The result is a single, consistent mechanism for accessing all
> resources from a guest domain. The resources have names, and it is
> easy to examine the status -- binary interfaces can be minimized. The
> resources can be provided by in-kernel servers -- Linux drivers -- or
> out-of-kernel servers -- proceses. Same interface, and yet the
> implementation of the provider of the resource can be utterly
> different.
>
> We had hoped to get something like this into Xen. On Xen, for example,
> the block device and ethernet device interfaces are as different as
> one could imagine. Disk I/O does not steal pages from the guest. The
> network does. Disk I/O is in 4k chunks, period, with a bitmap
> describing which of the 8 512-byte subunits are being sent. The enet
> device, on read, returns a page with your packet, but also potentially
> containing bits of other domain's packets too. The interfaces are as
> dissimilar as they can be, and I see no reason for such a huge
> variance between what are basically read/write devices.
>
> Another issue is that kvm, in its current form (-24) is beautifully
> simple. These additions seem to detract from the beauty a bit. Might
> it be worth taking a little time to consider these ideas in order to
> preserve the basic elegance of KVM?
>
> So, before we go too far down the Xen-like paravirtualized device
> route, can we discuss the way this ought to look a bit?
>
> thanks
>
> ron
>
* Re: [PATCH/RFC 7/9] Virtual network guest device driver [not found] ` <4644CE15.6080505-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org> @ 2007-05-11 21:15 ` Eric Van Hensbergen [not found] ` <a4e6962a0705111415n47e77a15o331b59cf2a03b4-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org> 2007-05-11 21:51 ` ron minnich 1 sibling, 1 reply; 104+ messages in thread From: Eric Van Hensbergen @ 2007-05-11 21:15 UTC (permalink / raw) To: Anthony Liguori Cc: Jimi Xenidis, jmk-zzFmDc4TPjtKvsKVC3L/VUEOCMrvLtNR@public.gmane.org, Christian Borntraeger, Martin Schwidefsky, kvm-devel-5NWGOfrQmneRv+LV9MX5uipxlwaOVQ5f@public.gmane.org On 5/11/07, Anthony Liguori <aliguori-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org> wrote: > > cpu% ls /net/ether0 > > /net/ether0/0 > > /net/ether0/1 > > /net/ether0/2 > > /net/ether0/addr > > /net/ether0/clone > > /net/ether0/ifstats > > /net/ether0/stats > > > > This smells a bit like XenStore which I think most will agree was an > unmitigated disaster. > I'd have to disagree with you Anthony. The Plan 9 interfaces are simple and built into the kernel - they don't have the multi-layered-stack-python-xmlrpc garbage that made up the Xen interfaces. >This sort of thing gets terribly complicated to deal with in the corner cases. >Atomic operation of multiple read/write operations is difficult to express. > Moreover, quite a lot of things are naturally expressed as a state machine which > is not straight forward to do in this sort of model. This may have been all > figured out in 9P but it's certainly not a simple thing to get right. > That's true, but we have been doing it for over 20 years - I think we have a good model to base stuff on. > I think a general rule of thumb for a virtualized environment is that > the closer you stick to the way hardware tends to do things, the less > likely you are to screw yourself up and the easier it will be for other > platforms to support your devices. 
> Implementing a full 9P client just
> to get console access in something like mini-os would be unfortunate.
> At least the posted s390 console driver behaves roughly like a uart so
> it's pretty obvious that it will be easy to implement in any OS that
> supports uarts already.
>

If it were just console access, I would agree with you, but it's really about implementing a single solution for all drivers you are accessing across the interface: a single client versus dozens of different driver variants. Our existing 9p client for mini-os is ~3000 LOC, and it is a pretty naive port from the p9p code base, so it could probably be reduced even further. It is a very small percentage of our existing mini-os kernels and gives us console, disk, network, IP stack, file system, and control interfaces. Of course Linux clients could just use v9fs with a hypervisor-shared-memory transport, which I haven't merged yet. We'll also be using the same set of interfaces for the simulator shortly.

Oh yeah, and don't forget the fact that resource access can bridge seamlessly over any network, and the protocol has provisions to be secured with authentication/encryption/digesting if desired.

Los Alamos will be presenting 9p-based control interfaces for KVM at OLS.

-eric
* Re: [PATCH/RFC 7/9] Virtual network guest device driver [not found] ` <a4e6962a0705111415n47e77a15o331b59cf2a03b4-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org> @ 2007-05-11 21:47 ` Anthony Liguori [not found] ` <4644E456.2060507-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org> 0 siblings, 1 reply; 104+ messages in thread From: Anthony Liguori @ 2007-05-11 21:47 UTC (permalink / raw) To: Eric Van Hensbergen Cc: Jimi Xenidis, Christian Borntraeger, jmk-zzFmDc4TPjtKvsKVC3L/VUEOCMrvLtNR@public.gmane.org, Martin Schwidefsky, kvm-devel-5NWGOfrQmneRv+LV9MX5uipxlwaOVQ5f@public.gmane.org Eric Van Hensbergen wrote: > On 5/11/07, Anthony Liguori <aliguori-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org> wrote: > >>> cpu% ls /net/ether0 >>> /net/ether0/0 >>> /net/ether0/1 >>> /net/ether0/2 >>> /net/ether0/addr >>> /net/ether0/clone >>> /net/ether0/ifstats >>> /net/ether0/stats >>> >>> >> This smells a bit like XenStore which I think most will agree was an >> unmitigated disaster. >> >> > > I'd have to disagree with you Anthony. The Plan 9 interfaces are > simple and built into the kernel - they don't have the > multi-layered-stack-python-xmlrpc garbage that made up the Xen > interfaces. > My point isn't that 9p is just like XenStore but rather that turning this idea into something that is useful and elegant is non-trivial. > If it were just console access, I would agree with you, but its really > about implementing a single solution for all drivers you are accessing > across the interface. A single client versus dozens of different > driver variants. There's definitely a conversation to have here. There are going to be a lot of small devices that would benefit from a common transport mechanism. Someone mentioned a PV entropy device on LKML. A host=>guest filesystem is another consumer of such an interface. I'm inclined to think though that the abstraction point should be the transport and not the actual protocol. 
My concern with standardizing on a protocol like 9p would be that one would lose some potential optimizations (like passing PFN's directly between guest and host). > Our existing 9p client for mini-os is ~3000 LOC and > it is a pretty naive port from the p9p code base so it could probably > be reduced even further. It is a very small percentage of our > existing mini-os kernels and gives us console, disk, network, IP > stack, file system, and control interfaces. Of course Linux clients > could just use v9fs with a hypervisor-shared-memory transport which I > haven't merged yet. We'll also be using the same set of interfaces > for the simulator shortly. > So is there any reason to even tie 9p to KVM? Why not just have a common PV transport that 9p can use. For certain things, it may make sense (like v9fs). Regards, Anthony Liguori > Oh yeah, and don't forget the fact that resource access can bridge > seamlessly over any network and the protocol has provisions to be > secured with authentication/encryption/digesting if desired. > > Los Alamos will be presenting 9p based control interfaces for KVM at OLS. > > -eric > > ------------------------------------------------------------------------- > This SF.net email is sponsored by DB2 Express > Download DB2 Express C - the FREE version of DB2 express and take > control of your XML. No limits. Just data. Click to get it now. > http://sourceforge.net/powerbar/db2/ > _______________________________________________ > kvm-devel mailing list > kvm-devel-5NWGOfrQmneRv+LV9MX5uipxlwaOVQ5f@public.gmane.org > https://lists.sourceforge.net/lists/listinfo/kvm-devel > > ------------------------------------------------------------------------- This SF.net email is sponsored by DB2 Express Download DB2 Express C - the FREE version of DB2 express and take control of your XML. No limits. Just data. Click to get it now. http://sourceforge.net/powerbar/db2/ ^ permalink raw reply [flat|nested] 104+ messages in thread
* Re: [PATCH/RFC 7/9] Virtual network guest device driver
@ 2007-05-11 22:21 ` Eric Van Hensbergen
From: Eric Van Hensbergen @ 2007-05-11 22:21 UTC (permalink / raw)
To: Anthony Liguori
Cc: Jimi Xenidis, Christian Borntraeger, jmk-zzFmDc4TPjtKvsKVC3L/VUEOCMrvLtNR@public.gmane.org, Martin Schwidefsky, kvm-devel-5NWGOfrQmneRv+LV9MX5uipxlwaOVQ5f@public.gmane.org

On 5/11/07, Anthony Liguori <aliguori-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org> wrote:
>
> There's definitely a conversation to have here. There are going to be a
> lot of small devices that would benefit from a common transport
> mechanism. Someone mentioned a PV entropy device on LKML. A
> host=>guest filesystem is another consumer of such an interface.
>
> I'm inclined to think though that the abstraction point should be the
> transport and not the actual protocol. My concern with standardizing on
> a protocol like 9p would be that one would lose some potential
> optimizations (like passing PFN's directly between guest and host).
>

I think that there are two layers: having a standard, well defined, simple shared memory transport between partitions (or between emulators and the host system) is certainly a prerequisite. There are lots of different decisions to be made here:

a) does it communicate with userspace, kernelspace, or both?
b) is it multi-channel? prioritized? interrupt driven or poll driven?
c) how big are the buffers? is it packetized?
d) can all of these parameters be something controllable from userspace?
e) I'm sure there are many others that I can't be bothered to think of on a Friday

Regardless of the details, I think we can definitely come together on a common mechanism here and avoid lots of duplication in the drivers that are already there and in those which will follow.
My personal preference is to keep things as simple and flat as possible: no XML, no multiple stacks and daemons to contend with.

What runs on top of the transport is no doubt going to be a touchy subject for some time to come. Many of Ron's arguments for 9p mostly apply to this upper level. I/we will be pursuing this as a unified PV resource sharing mechanism over the next few months, in combination with reorganization and optimization of the Linux 9p code. LANL has also been making progress in this same direction. I'd have gotten started sooner, but I was waiting for my new Thinkpad so that I can actually run KVM ;)

> So is there any reason to even tie 9p to KVM? Why not just have a
> common PV transport that 9p can use. For certain things, it may make
> sense (like v9fs).
>

Well, I think we were discussing tying KVM to 9p, not vice-versa. My personal view is that developing a generalized solution for resource sharing of all manner of devices and services across virtualization, emulation, and network boundaries is a better way to spend our time than writing a bunch of specific drivers/protocols/interfaces for each type of device and each type of interconnect.

-eric
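One way to anchor the design questions listed in the message above (packetized or not, buffer sizes, interrupt vs. poll driven) is a minimal packetized single-producer/single-consumer ring over what would be a shared page. Everything below -- the struct layout, slot count, slot size, and index scheme -- is a toy sketch for discussion, not a proposed ABI; a real inter-partition version would also need memory barriers and a kick/interrupt mechanism around the index updates.

```c
#include <string.h>

/* Illustrative sizes only; a real layout would be sized to the page. */
#define RING_SLOTS 8
#define SLOT_SIZE  64

struct ring {
	unsigned int prod;	/* written only by the producer */
	unsigned int cons;	/* written only by the consumer */
	struct {
		unsigned int len;
		unsigned char data[SLOT_SIZE];
	} slot[RING_SLOTS];
};

/* prod and cons are free-running; their difference is the fill level. */
static inline int ring_full(const struct ring *r)
{
	return r->prod - r->cons == RING_SLOTS;
}

static inline int ring_empty(const struct ring *r)
{
	return r->prod == r->cons;
}

/* Returns 0 on success, -1 if full (caller would then kick or poll). */
static inline int ring_put(struct ring *r, const void *buf, unsigned int len)
{
	unsigned int i = r->prod % RING_SLOTS;

	if (len > SLOT_SIZE || ring_full(r))
		return -1;
	r->slot[i].len = len;
	memcpy(r->slot[i].data, buf, len);
	r->prod++;	/* a real transport needs a write barrier first */
	return 0;
}

/* Returns the packet length, or -1 if empty or the buffer is too small. */
static inline int ring_get(struct ring *r, void *buf, unsigned int maxlen)
{
	unsigned int i = r->cons % RING_SLOTS;
	unsigned int len;

	if (ring_empty(r))
		return -1;
	len = r->slot[i].len;
	if (len > maxlen)
		return -1;
	memcpy(buf, r->slot[i].data, len);
	r->cons++;
	return (int) len;
}
```

Each of the lettered questions maps onto a knob here: packetization is the per-slot len field, buffer size is SLOT_SIZE, and interrupt-vs-poll is what the caller does when ring_put()/ring_get() report full/empty.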
* Re: [PATCH/RFC 7/9] Virtual network guest device driver [not found] ` <a4e6962a0705111521v2d451ddcjecf209e2031c85af-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org> @ 2007-05-16 17:28 ` Anthony Liguori [not found] ` <464B3F20.4030904-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org> 0 siblings, 1 reply; 104+ messages in thread From: Anthony Liguori @ 2007-05-16 17:28 UTC (permalink / raw) To: Eric Van Hensbergen Cc: Jimi Xenidis, Christian Borntraeger, jmk-zzFmDc4TPjtKvsKVC3L/VUEOCMrvLtNR@public.gmane.org, Martin Schwidefsky, kvm-devel-5NWGOfrQmneRv+LV9MX5uipxlwaOVQ5f@public.gmane.org Eric Van Hensbergen wrote: > On 5/11/07, Anthony Liguori <aliguori-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org> wrote: >> >> There's definitely a conversation to have here. There are going to be a >> lot of small devices that would benefit from a common transport >> mechanism. Someone mentioned a PV entropy device on LKML. A >> host=>guest filesystem is another consumer of such an interface. >> >> I'm inclined to think though that the abstraction point should be the >> transport and not the actual protocol. My concern with standardizing on >> a protocol like 9p would be that one would lose some potential >> optimizations (like passing PFN's directly between guest and host). >> > > I think that there are two layers - having a standard, well defined, > simple shared memory transport between partitions (or between > emulators and the host system) is certainly a prerequisite. There are > lots of different decisions to made here: What do you think about a socket interface? I'm not sure how discovery would work yet, but there are a few PV socket implementations for Xen at the moment. > a) does it communicate with userspace, kernelspace, or both? Sockets are usable for both userspace/kernelspace. > b) is it multi-channel? prioritized? interrupt driven or poll driven? Of course, arguments can be made for any of these depending on the circumstance.
I think you'd have to start with something simple that would cover the largest number of users (non-multiplexed, interrupt driven). > c) how big are the buffers? is it packetized? This could probably be tweaked with sockopts. I suspect you would have an implementation for Xen, KVM, etc. and support a common set of options (and possibly some per-VM options). > d) can all of these parameters be something controllable from userspace? > e) I'm sure there are many others that I can't be bothered to think > of on a Friday The biggest point of contention would probably be what goes in the sockaddr structure. Thoughts? Regards, Anthony Liguori > Regardless of the details, I think we can definitely come together on > a common mechanism here and avoid lots of duplication in the drivers > are already there and which will follow. My personal preference is to > keep things as simple and flat as possible. No XML, no multiple > stacks and daemons to contend with. > > What runs on top of the transport is no doubt going to be a touchy > subject for some time to come. Many of Ron's arguments for 9p mostly > apply to this upper level. I/we will be pursuing this as a unified PV > resource sharing mechanism over the next few months in combination > with reorganization and optimization of the Linux 9p code. LANL has > also been making progress in this same direction. I'd have gotten > started sooner, but I was waiting for my new Thinkpad so that I can > actually run KVM ;) > >> >> So is there any reason to even tie 9p to KVM? Why not just have a >> common PV transport that 9p can use. For certain things, it may make >> sense (like v9fs). >> > > Well, I think we were discussing tying KVM to 9p, not vice-versa.
> > My personal view is that developing a generalized solution for > resource sharing of all manner of devices and services across > virtualization, emulation, and network boundaries is a better way to > spend our time than writing a bunch of specific > drivers/protocols/interfaces for each type of device and each type of > interconnect. > > -eric
* Re: [PATCH/RFC 7/9] Virtual network guest device driver [not found] ` <464B3F20.4030904-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org> @ 2007-05-16 17:38 ` Daniel P. Berrange [not found] ` <20070516173822.GD16863-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> 2007-05-16 17:41 ` Eric Van Hensbergen ` (2 subsequent siblings) 3 siblings, 1 reply; 104+ messages in thread From: Daniel P. Berrange @ 2007-05-16 17:38 UTC (permalink / raw) To: Anthony Liguori Cc: Jimi Xenidis, jmk-zzFmDc4TPjtKvsKVC3L/VUEOCMrvLtNR@public.gmane.org, Christian Borntraeger, Martin Schwidefsky, kvm-devel-5NWGOfrQmneRv+LV9MX5uipxlwaOVQ5f@public.gmane.org On Wed, May 16, 2007 at 12:28:00PM -0500, Anthony Liguori wrote: > Eric Van Hensbergen wrote: > > On 5/11/07, Anthony Liguori <aliguori-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org> wrote: > >> > >> There's definitely a conversation to have here. There are going to be a > >> lot of small devices that would benefit from a common transport > >> mechanism. Someone mentioned a PV entropy device on LKML. A > >> host=>guest filesystem is another consumer of such an interface. > >> > >> I'm inclined to think though that the abstraction point should be the > >> transport and not the actual protocol. My concern with standardizing on > >> a protocol like 9p would be that one would lose some potential > >> optimizations (like passing PFN's directly between guest and host). > >> > > > > I think that there are two layers - having a standard, well defined, > > simple shared memory transport between partitions (or between > > emulators and the host system) is certainly a prerequisite. There are > > lots of different decisions to made here: > > What do you think about a socket interface? I'm not sure how discovery > would work yet, but there are a few PV socket implementations for Xen at > the moment. As a userspace apps service, I'd very much like to see a common sockets interface for inter-VM communication that is portable across virt systems like Xen & KVM. 
I'd see it as similar to UNIX domain sockets in style. So basically any app which could do UNIX domain sockets could be ported to inter-VM sockets by just changing PF_UNIX to, say, PF_VIRT. Lots of interesting details around impl & security (what VMs are allowed to talk to each other, whether this policy should be controlled by the host, or allow VMs to decide for themselves). > > a) does it communicate with userspace, kernelspace, or both? > > sockets are usable for both userspace/kernespace. For userspace, it would be very easy to adapt existing sockets-based apps using IP or UNIX sockets to use inter-VM sockets, which is a big positive. > > d) can all of these parameters be something controllable from userspace? > > e) I'm sure there are many others that I can't be bothered to think > > of on a Friday > > The biggest point of contention would probably be what goes in the > sockaddr structure. Keeping it very simple would be some arbitrary 'path', similar to UNIX domain sockets in the abstract namespace? Regards, Dan. -- |=- Red Hat, Engineering, Emerging Technologies, Boston. +1 978 392 2496 -=| |=- Perl modules: http://search.cpan.org/~danberr/ -=| |=- Projects: http://freshmeat.net/~danielpb/ -=| |=- GnuPG: 7D3B9505 F3C9 553F A1DA 4AC2 5648 23C1 B3DF F742 7D3B 9505 -=|
* Re: [PATCH/RFC 7/9] Virtual network guest device driver [not found] ` <20070516173822.GD16863-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> @ 2007-05-17 9:29 ` Carsten Otte [not found] ` <464C2069.20909-tA70FqPdS9bQT0dZR+AlfA@public.gmane.org> 0 siblings, 1 reply; 104+ messages in thread From: Carsten Otte @ 2007-05-17 9:29 UTC (permalink / raw) To: Daniel P. Berrange Cc: Jimi Xenidis, jmk-zzFmDc4TPjtKvsKVC3L/VUEOCMrvLtNR@public.gmane.org, Christian Borntraeger, mschwid2-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8, kvm-devel-5NWGOfrQmneRv+LV9MX5uipxlwaOVQ5f@public.gmane.org Daniel P. Berrange wrote: > As a userspace apps service, I'd very much like to see a common sockets > interface for inter-VM communication that is portable across virt systems > like Xen & KVM. I'd see it as similar to UNIX domain sockets in style. So > basically any app which could do UNIX domain sockets, could be ported to > inter-VM sockets by just changing PF_UNIX to say, PF_VIRT > Lots of interesting details around impl & security (what VMs are allowed > to talk to each other, whether this policy should be controlled by the > host, or allow VMs to decide for themselves). z/VM, the premium hypervisor on 390, has had this capability for decades. This is called IUCV (inter user communication vehicle), where user really means virtual machine. It so happens that the support for AF_IUCV was recently merged to Linux mainline. It may be worth a look, either for using it or because learning from existing solutions is always a good idea. so long, Carsten
* Re: [PATCH/RFC 7/9] Virtual network guest device driver [not found] ` <464C2069.20909-tA70FqPdS9bQT0dZR+AlfA@public.gmane.org> @ 2007-05-17 14:22 ` Anthony Liguori [not found] ` <464C651F.5070700-rdkfGonbjUSkNkDKm+mE6A@public.gmane.org> 0 siblings, 1 reply; 104+ messages in thread From: Anthony Liguori @ 2007-05-17 14:22 UTC (permalink / raw) To: carsteno-tA70FqPdS9bQT0dZR+AlfA Cc: Jimi Xenidis, jmk-zzFmDc4TPjtKvsKVC3L/VUEOCMrvLtNR@public.gmane.org, Christian Borntraeger, mschwid2-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8, kvm-devel-5NWGOfrQmneRv+LV9MX5uipxlwaOVQ5f@public.gmane.org Carsten Otte wrote: > Daniel P. Berrange wrote: > >> As a userspace apps service, I'd very much like to see a common sockets >> interface for inter-VM communication that is portable across virt systems >> like Xen & KVM. I'd see it as similar to UNIX domain sockets in style. So >> basically any app which could do UNIX domain sockets, could be ported to >> inter-VM sockets by just changing PF_UNIX to say, PF_VIRT >> Lots of interesting details around impl & security (what VMs are allowed >> to talk to each other, whether this policy should be controlled by the >> host, or allow VMs to decide for themselves). >> > z/VM, the premium hypervisor on 390 already has this capability for > decades. This is called IUCV (inter user communication vehicle), where > user really means virtual machine. It so happens the support for > AF_IUCV was recently merged to Linux mainline. It may be worth a look, > either for using it or because learning from existing solutions is > always a good idea. 
> Is there anything that explains what the fields in sockaddr mean: sa_family_t siucv_family; unsigned short siucv_port; /* Reserved */ unsigned int siucv_addr; /* Reserved */ char siucv_nodeid[8]; /* Reserved */ char siucv_user_id[8]; /* Guest User Id */ char siucv_name[8]; /* Application Name */ Regards, Anthony Liguori > so long, > Carsten
* Re: [PATCH/RFC 7/9] Virtual network guest device driver [not found] ` <464C651F.5070700-rdkfGonbjUSkNkDKm+mE6A@public.gmane.org> @ 2007-05-21 11:11 ` Christian Borntraeger 0 siblings, 0 replies; 104+ messages in thread From: Christian Borntraeger @ 2007-05-21 11:11 UTC (permalink / raw) To: Anthony Liguori Cc: Jimi Xenidis, carsteno-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8, jmk-zzFmDc4TPjtKvsKVC3L/VUEOCMrvLtNR@public.gmane.org, kvm-devel-5NWGOfrQmneRv+LV9MX5uipxlwaOVQ5f@public.gmane.org, mschwid2-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8 [-- Attachment #1.1: Type: text/plain, Size: 1507 bytes --] Anthony Liguori <anthony-rdkfGonbjUSkNkDKm+mE6A@public.gmane.org> wrote on 17.05.2007 16:22:23: > Is there anything that explains what the fields in sockaddr mean: > > sa_family_t siucv_family; > unsigned short siucv_port; /* Reserved */ > unsigned int siucv_addr; /* Reserved */ > char siucv_nodeid[8]; /* Reserved */ > char siucv_user_id[8]; /* Guest User Id */ > char siucv_name[8]; /* Application Name */ There is a small description in "Device Drivers, Features, and Commands SC33-8289-03" on page 211 (it's page 235 if you use the PDF viewer's page numbering) http://download.boulder.ibm.com/ibmdl/pub/software/dw/linux390/docu/l26cdd03.pdf (the file is 6.7 MB) More generic information about iucv can be found in http://www-03.ibm.com/servers/eserver/zseries/zos/bkserv/zvmpdf/zvm52.html or, to be precise, http://publibz.boulder.ibm.com/epubs/pdf/hcse5b11.pdf part 2. (11 MB) That said, AF_IUCV builds on top of iucv and therefore requires z/VM as hypervisor. I don't think that KVM should implement (af_)iucv. But (af_)iucv shows several aspects of how to do things well and badly. (e.g. AF_IUCV as a protocol on top of iucv was first defined in CMS several years ago and is, therefore, not very smp-friendly. On the other hand iucv itself offers modern features like scatter/gather). Back to the old question of whether shared memory or sockets are better - I don't know.
z/VM has both; see dcss for its shared memory support.
* Re: [PATCH/RFC 7/9] Virtual network guest device driver [not found] ` <464B3F20.4030904-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org> 2007-05-16 17:38 ` Daniel P. Berrange @ 2007-05-16 17:41 ` Eric Van Hensbergen [not found] ` <a4e6962a0705161041s5393c1a6wc455b20ff3fe8106-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org> 2007-05-16 17:45 ` Gregory Haskins 2007-05-18 5:31 ` ron minnich 3 siblings, 1 reply; 104+ messages in thread From: Eric Van Hensbergen @ 2007-05-16 17:41 UTC (permalink / raw) To: Anthony Liguori Cc: Jimi Xenidis, Christian Borntraeger, jmk-zzFmDc4TPjtKvsKVC3L/VUEOCMrvLtNR@public.gmane.org, Martin Schwidefsky, kvm-devel-5NWGOfrQmneRv+LV9MX5uipxlwaOVQ5f@public.gmane.org On 5/16/07, Anthony Liguori <aliguori-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org> wrote: > > What do you think about a socket interface? I'm not sure how discovery > would work yet, but there are a few PV socket implementations for Xen at > the moment. > From a functional standpoint I don't have a huge problem with it, particularly if it's more of a pure socket and not something that tries to look like a TCP/IP endpoint -- I would prefer something closer to netlink. Sockets would allow the existing 9p stuff to pretty much work as-is. However, all that being said, I noticed some pretty big differences between sockets and shared memory in terms of overhead under Linux. If you take a look at the RPC latency graph in: http://plan9.escet.urjc.es/iwp9/cready/PROSE_iwp9_2006.pdf You'll see that a local socket implementation has about an order of magnitude worse latency than a PROSE/Libra inter-partition shared memory channel. Furthermore it will really limit our ability to trim the fat of unnecessary copies in order to have competitive performance. But perhaps there's magic you can do to eliminate that. Of course, you could always layer a socket interface for userspace simplicity on top of a more performance-optimized underlying transport that could be used directly by kernel-modules.
-eric
* Re: [PATCH/RFC 7/9] Virtual network guest device driver [not found] ` <a4e6962a0705161041s5393c1a6wc455b20ff3fe8106-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org> @ 2007-05-16 18:47 ` Anthony Liguori [not found] ` <464B51A8.7050307-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org> 0 siblings, 1 reply; 104+ messages in thread From: Anthony Liguori @ 2007-05-16 18:47 UTC (permalink / raw) To: Eric Van Hensbergen Cc: Jimi Xenidis, Christian Borntraeger, jmk-zzFmDc4TPjtKvsKVC3L/VUEOCMrvLtNR@public.gmane.org, Martin Schwidefsky, kvm-devel-5NWGOfrQmneRv+LV9MX5uipxlwaOVQ5f@public.gmane.org Eric Van Hensbergen wrote: > On 5/16/07, Anthony Liguori <aliguori-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org> wrote: >> >> What do you think about a socket interface? I'm not sure how discovery >> would work yet, but there are a few PV socket implementations for Xen at >> the moment. >> > > From a functional standpoint I don't have a huge problem with it, > particularly if its more of a pure socket and not something that tries > to look like a TCP/IP endpoint -- I would prefer something closer to > netlink. Sockets would allow the exisitng 9p stuff to pretty much > work as-is. So you would prefer assigning out types instead of using an identifier string in the sockaddr? > However, all that being said, I noticed some pretty big differences > between sockets and shared memory in terms of overhead under Linux. > > If you take a look at the RPC latency graph in: > http://plan9.escet.urjc.es/iwp9/cready/PROSE_iwp9_2006.pdf > > You'll see that a local socket implementation has about an order of > magnitude worse latency than a PROSE/Libra inter-partition shared > memory channel. You seem to suggest that the low latency is due to a very greedy (CPU hungry) polling algorithm. A poll vs. interrupt model would seem to me to be orthogonal to using sockets as an interface. > Furthermore it will really limit our ability to trim > the fat of unnecessary copies in order to have competitive > performance. 
> But perhaps there's magic you can do to eliminate that. Sockets do add copies. My initial thinking is that one can work around this by passing guest PFNs (or grant references in Xen). I'm also happy to start out focusing on "low-speed" devices. > Of course, you could always layer a socket interface for userspace > simplicity on top of a more performance-optimized underlying transport > that could be used directly by kernel-modules. Right. Regards, Anthony Liguori > -eric
* Re: [PATCH/RFC 7/9] Virtual network guest device driver [not found] ` <464B51A8.7050307-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org> @ 2007-05-16 19:33 ` Eric Van Hensbergen 0 siblings, 0 replies; 104+ messages in thread From: Eric Van Hensbergen @ 2007-05-16 19:33 UTC (permalink / raw) To: Anthony Liguori Cc: Jimi Xenidis, Christian Borntraeger, jmk-zzFmDc4TPjtKvsKVC3L/VUEOCMrvLtNR@public.gmane.org, Martin Schwidefsky, kvm-devel-5NWGOfrQmneRv+LV9MX5uipxlwaOVQ5f@public.gmane.org On 5/16/07, Anthony Liguori <aliguori-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org> wrote: > Eric Van Hensbergen wrote: > > > > From a functional standpoint I don't have a huge problem with it, > > particularly if its more of a pure socket and not something that tries > > to look like a TCP/IP endpoint -- I would prefer something closer to > > netlink. Sockets would allow the exisitng 9p stuff to pretty much > > work as-is. > > So you would prefer assigning out types instead of using an identifier > string in the sockaddr? > I wasn't really thinking that extreme, just having an assigned type for the vm sockets so that we can minimize baggage. Perhaps I'm being overzealous. > > However, all that being said, I noticed some pretty big differences > > between sockets and shared memory in terms of overhead under Linux. > > > > If you take a look at the RPC latency graph in: > > http://plan9.escet.urjc.es/iwp9/cready/PROSE_iwp9_2006.pdf > > > > You'll see that a local socket implementation has about an order of > > magnitude worse latency than a PROSE/Libra inter-partition shared > > memory channel. > > You seem to suggest that the low latency is due to a very greedy (CPU > hungry) polling algorithm. A poll vs. interrupt model would seem to me > to be orthogonal to using sockets as an interface. 
> That certainly was a theory -- I never did detailed measurements; however, there is certainly extra overhead associated with the socket path due to kernel-user space boundary crossings and additional code path length associated with socket operations. Still I'm game for comparing the alternatives. -eric
* Re: [PATCH/RFC 7/9] Virtual network guest device driver [not found] ` <464B3F20.4030904-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org> 2007-05-16 17:38 ` Daniel P. Berrange 2007-05-16 17:41 ` Eric Van Hensbergen @ 2007-05-16 17:45 ` Gregory Haskins [not found] ` <464B0ADB.BA47.005A.0-Et1tbQHTxzrQT0dZR+AlfA@public.gmane.org> 2007-05-18 5:31 ` ron minnich 3 siblings, 1 reply; 104+ messages in thread From: Gregory Haskins @ 2007-05-16 17:45 UTC (permalink / raw) To: Eric Van Hensbergen, Anthony Liguori Cc: Christian Borntraeger, Martin Schwidefsky, Jimi Xenidis, jmk-zzFmDc4TPjtKvsKVC3L/VUEOCMrvLtNR@public.gmane.org, kvm-devel-5NWGOfrQmneRv+LV9MX5uipxlwaOVQ5f@public.gmane.org >>> On Wed, May 16, 2007 at 1:28 PM, in message <464B3F20.4030904-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>, Anthony Liguori <aliguori-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org> wrote: > > What do you think about a socket interface? I'm not sure how discovery > would work yet, but there are a few PV socket implementations for Xen at > the moment. FYI: The work I am doing is exactly that. I am going to extend host-based unix domain sockets up to the KVM guest. Not sure how well it will work yet, as I had to lay the LAPIC work down first for IO-completion. -Greg ------------------------------------------------------------------------- This SF.net email is sponsored by DB2 Express Download DB2 Express C - the FREE version of DB2 express and take control of your XML. No limits. Just data. Click to get it now. http://sourceforge.net/powerbar/db2/ ^ permalink raw reply [flat|nested] 104+ messages in thread
* Re: [PATCH/RFC 7/9] Virtual network guest device driver [not found] ` <464B0ADB.BA47.005A.0-Et1tbQHTxzrQT0dZR+AlfA@public.gmane.org> @ 2007-05-16 18:39 ` Anthony Liguori [not found] ` <464B4FEB.7070300-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org> 0 siblings, 1 reply; 104+ messages in thread From: Anthony Liguori @ 2007-05-16 18:39 UTC (permalink / raw) To: Gregory Haskins Cc: Jimi Xenidis, jmk-zzFmDc4TPjtKvsKVC3L/VUEOCMrvLtNR@public.gmane.org, Christian Borntraeger, Suzanne McIntosh, Martin Schwidefsky, kvm-devel-5NWGOfrQmneRv+LV9MX5uipxlwaOVQ5f@public.gmane.org Gregory Haskins wrote: >>>> On Wed, May 16, 2007 at 1:28 PM, in message <464B3F20.4030904-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>, >>>> > Anthony Liguori <aliguori-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org> wrote: > >> What do you think about a socket interface? I'm not sure how discovery >> would work yet, but there are a few PV socket implementations for Xen at >> the moment. >> > > FYI: The work I am doing is exactly that. I am going to extend host-based unix domain sockets up to the KVM guest. Not sure how well it will work yet, as I had to lay the LAPIC work down first for IO-completion. > Do you plan on introducing a new address family in the guest? Regards, Anthony Liguori > -Greg
* Re: [PATCH/RFC 7/9] Virtual network guest device driver [not found] ` <464B4FEB.7070300-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org> @ 2007-05-16 18:57 ` Gregory Haskins [not found] ` <464B1B9C.BA47.005A.0-Et1tbQHTxzrQT0dZR+AlfA@public.gmane.org> 0 siblings, 1 reply; 104+ messages in thread From: Gregory Haskins @ 2007-05-16 18:57 UTC (permalink / raw) To: Anthony Liguori Cc: Jimi Xenidis, jmk-zzFmDc4TPjtKvsKVC3L/VUEOCMrvLtNR@public.gmane.org, Christian Borntraeger, Suzanne McIntosh, Martin Schwidefsky, kvm-devel-5NWGOfrQmneRv+LV9MX5uipxlwaOVQ5f@public.gmane.org >>> On Wed, May 16, 2007 at 2:39 PM, in message <464B4FEB.7070300-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>, Anthony Liguori <aliguori-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org> wrote: > Gregory Haskins wrote: >>>>> On Wed, May 16, 2007 at 1:28 PM, in message <464B3F20.4030904-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>, >>>>> >> Anthony Liguori <aliguori-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org> wrote: >> >>> What do you think about a socket interface? I'm not sure how discovery >>> would work yet, but there are a few PV socket implementations for Xen at >>> the moment. >>> >> >> FYI: The work I am doing is exactly that. I am going to extend host-based > unix domain sockets up to the KVM guest. Not sure how well it will work yet, > as I had to lay the LAPIC work down first for IO-completion. >> > > Do you plan on introducing a new address family in the guest? Well, since I had to step back and lay some infrastructure groundwork I haven't vetted this approach yet...so it's possible what I am about to say is relatively naive: But my primary application is to create a guest-kernel to host IVMC. For that you can just think of the guest as any other process on the host, and it will just use the sockets normally as any host-process would. There might be some thunking that has to happen to deal with gpa vs va, etc, but otherwise it's a standard consumer.
If you want to extend IVMC up to guest-userspace, I think making some kind of new socket family makes sense in the guest's stack. PF_VIRT, like someone else suggested, for instance. But since I don't need this type of IVMC, I haven't really thought about this too much. -Greg
* Re: [PATCH/RFC 7/9] Virtual network guest device driver [not found] ` <464B1B9C.BA47.005A.0-Et1tbQHTxzrQT0dZR+AlfA@public.gmane.org> @ 2007-05-16 19:10 ` Anthony Liguori [not found] ` <464B572C.6090800-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org> 0 siblings, 1 reply; 104+ messages in thread From: Anthony Liguori @ 2007-05-16 19:10 UTC (permalink / raw) To: Gregory Haskins Cc: Jimi Xenidis, jmk-zzFmDc4TPjtKvsKVC3L/VUEOCMrvLtNR@public.gmane.org, Christian Borntraeger, Suzanne McIntosh, Martin Schwidefsky, kvm-devel-5NWGOfrQmneRv+LV9MX5uipxlwaOVQ5f@public.gmane.org Gregory Haskins wrote: >>>> On Wed, May 16, 2007 at 2:39 PM, in message <464B4FEB.7070300-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>, >>>> > Anthony Liguori <aliguori-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org> wrote: > >> Gregory Haskins wrote: >> >>>>>> On Wed, May 16, 2007 at 1:28 PM, in message <464B3F20.4030904-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>, >>>>>> >>>>>> >>> Anthony Liguori <aliguori-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org> wrote: >>> >>> >>>> What do you think about a socket interface? I'm not sure how discovery >>>> would work yet, but there are a few PV socket implementations for Xen at >>>> the moment. >>>> >>>> >>> FYI: The work I am doing is exactly that. I am going to extend host- based >>> >> unix domain sockets up to the KVM guest. Not sure how well it will work yet, >> as I had to lay the LAPIC work down first for IO- completion. >> >>> >>> >> Do you plan on introducing a new address family in the guest? >> > > Well, since I had to step back and lay some infrastructure groundwork I haven't vetted this approach yet...so its possible what I am about to say is relatively naive: But my primary application is to create a guest-kernel to host IVMC. This is quite easy with KVM. I like the approach that vmchannel has taken. A simple PCI device. 
That gives you a discovery mechanism for shared memory and an interrupt and then you can just implement a ring queue using those mechanisms (along with a PIO port for signaling from the guest to the host). So given that underlying mechanism, the question is how to expose that within the guest kernel/userspace and within the host. For the host, you can probably stay entirely within QEMU. Interguest communication would be a bit tricky but guest->host communication is real simple. You could stop at exposing the channel as a socket within the guest kernel/userspace. That would work, but you may also want to expose the ring queue within the kernel at least if there are consumers that need to avoid the copy. A tricky bit of this is how to do discovery. If you want to support interguest communication, it's not really sufficient to just use strings since the identifiers would have to be unique throughout the entire system. Maybe you just leave it as a guest=>host channel and be done with it. Regards, Anthony Liguori > For that you can just think of the guest as any other process on the host, and it will just use the sockets normally as any host-process would. There might be some thunking that has to happen to deal with gpa vs va, etc, but otherwise its a standard consumer. If you want to extend IVMC up to guest-userspace, I think making some kind of new socket family makes sense in the guests stack. PF_VIRT like someone else suggested, for instance. But since I dont need this type of IVMC I haven't really thought about this too much. > > -Greg
* Re: [PATCH/RFC 7/9] Virtual network guest device driver [not found] ` <464B572C.6090800-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org> @ 2007-05-17 4:24 ` Rusty Russell [not found] ` <1179375881.21871.83.camel-bi+AKbBUZKY6gyzm1THtWbp2dZbC/Bob@public.gmane.org> 2007-05-21 9:07 ` Christian Borntraeger 1 sibling, 1 reply; 104+ messages in thread From: Rusty Russell @ 2007-05-17 4:24 UTC (permalink / raw) To: Anthony Liguori Cc: Jimi Xenidis, jmk-zzFmDc4TPjtKvsKVC3L/VUEOCMrvLtNR@public.gmane.org, Christian Borntraeger, kvm-devel-5NWGOfrQmneRv+LV9MX5uipxlwaOVQ5f@public.gmane.org, Suzanne McIntosh, Martin Schwidefsky On Wed, 2007-05-16 at 14:10 -0500, Anthony Liguori wrote: > For the host, you can probably stay entirely within QEMU. Interguest > communication would be a bit tricky but guest->host communication is > real simple. guest->host is always simple. But it'd be great if it didn't matter to the guest whether it's talking to the host or another guest. I think shared memory is an obvious start, but it's not enough for inter-guest where they can't freely access each other's memory. So you really want a ring-buffer of descriptors with a hypervisor-assist to say "read/write this into the memory referred to by that descriptor". I think this can be done as a simple variation of the current schemes in existence. But I'm shutting up until I have some demonstration code 8) > A tricky bit of this is how to do discovery. If you want to support > interguest communication, it's not really sufficient to just use strings > since they identifiers would have to be unique throughout the entire > system. Maybe you just leave it as a guest=>host channel and be done > with it. Hmm, I was going to leave that unspecified. One thing at a time... Rusty. ------------------------------------------------------------------------- This SF.net email is sponsored by DB2 Express Download DB2 Express C - the FREE version of DB2 express and take control of your XML. No limits. Just data. 
Click to get it now. http://sourceforge.net/powerbar/db2/
* Re: [PATCH/RFC 7/9] Virtual network guest device driver [not found] ` <1179375881.21871.83.camel-bi+AKbBUZKY6gyzm1THtWbp2dZbC/Bob@public.gmane.org> @ 2007-05-17 16:13 ` Anthony Liguori [not found] ` <464C7F45.50908-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org> 0 siblings, 1 reply; 104+ messages in thread From: Anthony Liguori @ 2007-05-17 16:13 UTC (permalink / raw) To: Rusty Russell Cc: Jimi Xenidis, Anthony Liguori, jmk-zzFmDc4TPjtKvsKVC3L/VUEOCMrvLtNR@public.gmane.org, Christian Borntraeger, kvm-devel-5NWGOfrQmneRv+LV9MX5uipxlwaOVQ5f@public.gmane.org, Suzanne McIntosh

Rusty Russell wrote:
> On Wed, 2007-05-16 at 14:10 -0500, Anthony Liguori wrote:
>
>> For the host, you can probably stay entirely within QEMU. Interguest
>> communication would be a bit tricky but guest->host communication is
>> real simple.
>
> guest->host is always simple. But it'd be great if it didn't matter to
> the guest whether it's talking to the host or another guest.
>
> I think shared memory is an obvious start, but it's not enough for
> inter-guest where they can't freely access each other's memory. So you
> really want a ring-buffer of descriptors with a hypervisor-assist to say
> "read/write this into the memory referred to by that descriptor".

I think this is getting a little ahead of ourselves. An example of this idea is pretty straight-forward, but it gets more complicated when trying to support the existing memory sharing mechanisms on various hypervisors. There are a few cases to consider:

1) The target VM can access all of the memory of the guest VM with no penalty. This is the case when going from guest=>QEMU in KVM or going from guest=>kernel (ignoring highmem) in KVM. For this, you can send arbitrary memory to the host.

2) The target VM can access all of the memory of the guest VM with a penalty. For guest=>other userspace process in KVM, an mmap() would be required. This would work for Xen provided the target VM was domain-0, but it would incur a xc_map_foreign_range().

3) The target and source VM can only share memory based on an existing pool. This is the case with Xen and grant tables.

I think an API that covers these three cases is a bit tricky and will likely make undesired trade-offs. I think it's easier to start out focusing on the "low-speed" case where there's a mandatory data-copy. You can still pass gntref's or PFNs down this transport if you like, and perhaps down the road we'll find that we can make a common interface for doing this sort of thing.

Regards, Anthony Liguori

> I think this can be done as a simple variation of the current schemes in
> existence.
>
> But I'm shutting up until I have some demonstration code 8)
>
>> A tricky bit of this is how to do discovery. If you want to support
>> interguest communication, it's not really sufficient to just use strings
>> since the identifiers would have to be unique throughout the entire
>> system. Maybe you just leave it as a guest=>host channel and be done
>> with it.
>
> Hmm, I was going to leave that unspecified. One thing at a time...
>
> Rusty.
* Re: [PATCH/RFC 7/9] Virtual network guest device driver [not found] ` <464C7F45.50908-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org> @ 2007-05-17 23:34 ` Rusty Russell 0 siblings, 0 replies; 104+ messages in thread From: Rusty Russell @ 2007-05-17 23:34 UTC (permalink / raw) To: Anthony Liguori Cc: Jimi Xenidis, jmk-zzFmDc4TPjtKvsKVC3L/VUEOCMrvLtNR@public.gmane.org, Christian Borntraeger, kvm-devel-5NWGOfrQmneRv+LV9MX5uipxlwaOVQ5f@public.gmane.org, Suzanne McIntosh, Martin Schwidefsky On Thu, 2007-05-17 at 11:13 -0500, Anthony Liguori wrote: > Rusty Russell wrote: > > I think shared memory is an obvious start, but it's not enough for > > inter-guest where they can't freely access each other's memory. So you > > really want a ring-buffer of descriptors with a hypervisor-assist to say > > "read/write this into the memory referred to by that descriptor". > > I think this is getting a little ahead of ourselves. An example of this > idea is pretty straight-forward but it gets more complicated when trying > to support the existing memory sharing mechanisms on various > hypervisors. There are a few cases to consider: To clarify, I'm not overly interested in existing mechanisms. I'm first trying for something sane from a Linux driver POV, then see if it can be implemented in terms of legacy systems. This reflects my belief that we will see more virtualization solutions in the medium term, so it's reasonable to look at a new system. Cheers, Rusty. ------------------------------------------------------------------------- This SF.net email is sponsored by DB2 Express Download DB2 Express C - the FREE version of DB2 express and take control of your XML. No limits. Just data. Click to get it now. http://sourceforge.net/powerbar/db2/ ^ permalink raw reply [flat|nested] 104+ messages in thread
* Re: [PATCH/RFC 7/9] Virtual network guest device driver [not found] ` <464B572C.6090800-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org> 2007-05-17 4:24 ` Rusty Russell @ 2007-05-21 9:07 ` Christian Borntraeger [not found] ` <OFC1AADF6F.DB57C7AC-ON422572E2.0030BE22-422572E2.0032174C-tA70FqPdS9bQT0dZR+AlfA@public.gmane.org> 1 sibling, 1 reply; 104+ messages in thread From: Christian Borntraeger @ 2007-05-21 9:07 UTC (permalink / raw) To: Anthony Liguori Cc: Jimi Xenidis, jmk-zzFmDc4TPjtKvsKVC3L/VUEOCMrvLtNR@public.gmane.org, mschwid2-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8, kvm-devel-5NWGOfrQmneRv+LV9MX5uipxlwaOVQ5f@public.gmane.org, Suzanne McIntosh

> This is quite easy with KVM. I like the approach that vmchannel has
> taken. A simple PCI device. That gives you a discovery mechanism for
> shared memory and an interrupt and then you can just implement a ring
> queue using those mechanisms (along with a PIO port for signaling from
> the guest to the host). So given that underlying mechanism, the
> question is how to expose that within the guest kernel/userspace and
> within the host.

Sorry for answering late, but I don't like PCI as a device bus for all platforms. s390 has no PCI and s390 has no PIO. I would prefer a new simple hypercall-based virtual bus. I don't know much about Windows driver programming, but I guess it is not that hard to add a new bus.
* Re: [PATCH/RFC 7/9] Virtual network guest device driver [not found] ` <OFC1AADF6F.DB57C7AC-ON422572E2.0030BE22-422572E2.0032174C-tA70FqPdS9bQT0dZR+AlfA@public.gmane.org> @ 2007-05-21 9:27 ` Cornelia Huck 2007-05-21 11:28 ` Arnd Bergmann 1 sibling, 0 replies; 104+ messages in thread From: Cornelia Huck @ 2007-05-21 9:27 UTC (permalink / raw) To: Christian Borntraeger Cc: Jimi Xenidis, jmk-zzFmDc4TPjtKvsKVC3L/VUEOCMrvLtNR@public.gmane.org, kvm-devel-5NWGOfrQmneRv+LV9MX5uipxlwaOVQ5f@public.gmane.org, mschwid2-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8, Suzanne McIntosh On Mon, 21 May 2007 11:07:07 +0200, Christian Borntraeger <CBORNTRA-tA70FqPdS9bQT0dZR+AlfA@public.gmane.org> wrote: > > This is quite easy with KVM. I like the approach that vmchannel has > > taken. A simple PCI device. That gives you a discovery mechanism for > > shared memory and an interrupt and then you can just implement a ring > > queue using those mechanisms (along with a PIO port for signaling from > > the guest to the host). So given that underlying mechanism, the > > question is how to expose that within the guest kernel/userspace and > > within the host. > > Sorry for answering late, but I dont like PCI as a device bus for all > platforms. s390 has no PCI and s390 has no PIO. I would prefer a new > simple hypercall based virtual bus. I dont know much about windows > driver programming, but I guess it it is not that hard to add a new bus. Agreed. Moreover, if you have an existing OS running on a non-pci platform, it will be far more likely that they will be able to write a driver against a simple hypercall-based bus than to cook up a full-blown pci interface. ------------------------------------------------------------------------- This SF.net email is sponsored by DB2 Express Download DB2 Express C - the FREE version of DB2 express and take control of your XML. No limits. Just data. Click to get it now. http://sourceforge.net/powerbar/db2/ ^ permalink raw reply [flat|nested] 104+ messages in thread
* Re: [PATCH/RFC 7/9] Virtual network guest device driver [not found] ` <OFC1AADF6F.DB57C7AC-ON422572E2.0030BE22-422572E2.0032174C-tA70FqPdS9bQT0dZR+AlfA@public.gmane.org> 2007-05-21 9:27 ` Cornelia Huck @ 2007-05-21 11:28 ` Arnd Bergmann [not found] ` <200705211328.04565.arnd-r2nGTMty4D4@public.gmane.org> 1 sibling, 1 reply; 104+ messages in thread From: Arnd Bergmann @ 2007-05-21 11:28 UTC (permalink / raw) To: kvm-devel-5NWGOfrQmneRv+LV9MX5uipxlwaOVQ5f Cc: Jimi Xenidis, jmk-zzFmDc4TPjtKvsKVC3L/VUEOCMrvLtNR@public.gmane.org, Christian Borntraeger, mschwid2-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8, Suzanne McIntosh On Monday 21 May 2007, Christian Borntraeger wrote: > > This is quite easy with KVM. I like the approach that vmchannel has > > taken. A simple PCI device. That gives you a discovery mechanism for > > shared memory and an interrupt and then you can just implement a ring > > queue using those mechanisms (along with a PIO port for signaling from > > the guest to the host). So given that underlying mechanism, the > > question is how to expose that within the guest kernel/userspace and > > within the host. > > Sorry for answering late, but I dont like PCI as a device bus for all > platforms. s390 has no PCI and s390 has no PIO. I would prefer a new > simple hypercall based virtual bus. I dont know much about windows > driver programming, but I guess it it is not that hard to add a new bus. We've had the same discussion about PCI as virtual device abstraction recently when hpa made the suggestions to get a set of PCI device numbers registered for Linux. IIRC, the conclusion to which we came was that it is indeed helpful for most architecture to have a PCI device as one way to probe for the functionality, but not to rely on it. s390 is the obvious example where you can't have PCI, but you may also want to build a guest kernel without PCI support because of space constraints in a many-guests machine. 
What I think would be ideal is to have a new bus type in Linux that does not have any dependency on PCI itself, but can be easily implemented as a child of a PCI device.

If we only need the stuff mentioned by Anthony, the interface could look like

	struct vmchannel_device {
		struct resource virt_mem;
		struct vm_device_id id;
		int irq;
		int (*signal)(struct vmchannel_device *);
		int (*irq_ack)(struct vmchannel_device *);
		struct device dev;
	};

Such a device can easily be provided as a child of a PCI device, or as something that is purely virtual based on an hcall interface.

	Arnd <><
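The point of Arnd's abstraction can be modeled in a few lines of userspace C: driver code calls dev->signal(dev) without knowing whether the backend kicks a PIO port or issues a hypercall. The counter-based stand-ins for outb() and the hypercall, and all names here, are illustrative assumptions rather than kernel API.

```c
#include <assert.h>

/* Userspace model of the proposed bus: the driver sees only the
 * vmchannel_device and its ops; how signaling reaches the hypervisor
 * is hidden in the backend that filled in the function pointer. */
struct vmchannel_device {
	int irq;
	int (*signal)(struct vmchannel_device *);
};

/* Backend 1: PCI-style, "writes" to an I/O port. */
static int pio_kicks;
static int pci_signal(struct vmchannel_device *dev)
{
	(void)dev;
	pio_kicks++;      /* stands in for outb(1, signal_ioport) */
	return 0;
}

/* Backend 2: hypercall-style, as s390 would need. */
static int hcall_kicks;
static int hcall_signal(struct vmchannel_device *dev)
{
	(void)dev;
	hcall_kicks++;    /* stands in for a diagnose/hypercall */
	return 0;
}

/* Driver code: identical regardless of which backend created the device. */
static int driver_send(struct vmchannel_device *dev)
{
	/* ...fill the shared ring here, then notify the other side... */
	return dev->signal(dev);
}
```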
* Re: [PATCH/RFC 7/9] Virtual network guest device driver [not found] ` <200705211328.04565.arnd-r2nGTMty4D4@public.gmane.org> @ 2007-05-21 11:56 ` Cornelia Huck [not found] ` <20070521135628.17a4f9cc-XQvu0L+U/CiXI4yAdoq52KN5r0PSdgG1zG2AekJRRhI@public.gmane.org> 2007-05-21 18:45 ` Anthony Liguori 1 sibling, 1 reply; 104+ messages in thread From: Cornelia Huck @ 2007-05-21 11:56 UTC (permalink / raw) To: Arnd Bergmann Cc: Jimi Xenidis, jmk-zzFmDc4TPjtKvsKVC3L/VUEOCMrvLtNR@public.gmane.org, kvm-devel-5NWGOfrQmneRv+LV9MX5uipxlwaOVQ5f, mschwid2-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8, Christian Borntraeger, Suzanne McIntosh On Mon, 21 May 2007 13:28:03 +0200, Arnd Bergmann <arnd-r2nGTMty4D4@public.gmane.org> wrote: > We've had the same discussion about PCI as virtual device abstraction > recently when hpa made the suggestions to get a set of PCI device > numbers registered for Linux. (If you want to read it up, it's the thread at http://marc.info/?t=117554525400003&r=1&w=2) > > IIRC, the conclusion to which we came was that it is indeed helpful > for most architecture to have a PCI device as one way to probe for > the functionality, but not to rely on it. s390 is the obvious > example where you can't have PCI, but you may also want to build > a guest kernel without PCI support because of space constraints > in a many-guests machine. > > What I think would be ideal is to have a new bus type in Linux > that does not have any dependency on PCI itself, but can be > easily implemented as a child of a PCI device. 
>
> If we only need the stuff mentioned by Anthony, the interface could
> look like
>
> struct vmchannel_device {
> 	struct resource virt_mem;
> 	struct vm_device_id id;
> 	int irq;
  	^^^^^^^^
> 	int (*signal)(struct vmchannel_device *);
> 	int (*irq_ack)(struct vmchannel_device *);
> 	struct device dev;
> };

IRQ numbers are evil :) It should be more like a

	void *vmchannel_device_handle;

which could be different things depending on what we want the vmchannel_device to be a child of (it could be an IRQ number for PCI devices, or something like subchannel_id if we wanted to support channel devices).

> Such a device can easily be provided as a child of a PCI device,
> or as something that is purely virtual based on an hcall interface.

This looks like a flexible approach.
* Re: [PATCH/RFC 7/9] Virtual network guest device driver [not found] ` <20070521135628.17a4f9cc-XQvu0L+U/CiXI4yAdoq52KN5r0PSdgG1zG2AekJRRhI@public.gmane.org> @ 2007-05-21 13:53 ` Arnd Bergmann 0 siblings, 0 replies; 104+ messages in thread From: Arnd Bergmann @ 2007-05-21 13:53 UTC (permalink / raw) To: Cornelia Huck Cc: Jimi Xenidis, jmk-zzFmDc4TPjtKvsKVC3L/VUEOCMrvLtNR@public.gmane.org, kvm-devel-5NWGOfrQmneRv+LV9MX5uipxlwaOVQ5f, mschwid2-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8, Christian Borntraeger, Suzanne McIntosh On Monday 21 May 2007, Cornelia Huck wrote: > IRQ numbers are evil :) yes, but getting rid of them is an entirely different discussion. I really think that in the first step, you should be able to use its "external interrupts" with the same request_irq interface as the other architectures. Fundamentally, the s390 architecture has external interrupt numbers as well, you're just using a different interface for registering them. The ccw devices obviously have a better interface already, but that doesn't help you here. > It should be more like a > void *vmchannel_device_handle; > which could be different things depending on what we want the > vmchannel_device to be a child of (it could be an IRQ number for > PCI devices, or something like subchannel_id if we wanted to > support channel devices). No, the driver needs to know how to get at the interrupt without caring about the bus implementation, that's why you either need to have a callback function set by the driver (like s390 CCW or USB have it), or visible interrupt number (like everyone does). There is no need for a pointer back to a vmchannel_device_handle, all information needed by the bus layer can simply be in a subclass derived from the vmchannel_device, e.g. 
	struct vmchannel_pci {
		struct pci_device *parent; /* shortcut, same as to_pci_dev(&this.vmdev.dev.parent) */
		unsigned long signal_ioport; /* for interrupt generation */
		struct vmchannel_device vmdev;
	};

You would allocate this structure in the pci_driver that registers the vmchannel_device.

	Arnd <><
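The subclassing Arnd describes relies on the container_of idiom: the generic struct is embedded inside the bus-private one, and the bus recovers its wrapper by pointer arithmetic instead of storing a back-pointer in the generic struct. A minimal userspace sketch, with kernel types dropped and names purely illustrative:

```c
#include <assert.h>
#include <stddef.h>

/* Userspace re-definition of the kernel's container_of: given a pointer
 * to a member, recover a pointer to the structure embedding it. */
#define container_of(ptr, type, member) \
	((type *)((char *)(ptr) - offsetof(type, member)))

struct vmchannel_device {
	int irq;
};

/* Bus-private wrapper embedding the generic device. */
struct vmchannel_pci {
	unsigned long signal_ioport;   /* for interrupt generation */
	struct vmchannel_device vmdev; /* embedded generic part */
};

/* Bus code gets the generic pointer back from the driver and recovers
 * its private state without any back-pointer stored in vmdev. */
static unsigned long ioport_of(struct vmchannel_device *vmdev)
{
	struct vmchannel_pci *pdev =
		container_of(vmdev, struct vmchannel_pci, vmdev);
	return pdev->signal_ioport;
}
```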
* Re: [PATCH/RFC 7/9] Virtual network guest device driver [not found] ` <200705211328.04565.arnd-r2nGTMty4D4@public.gmane.org> 2007-05-21 11:56 ` Cornelia Huck @ 2007-05-21 18:45 ` Anthony Liguori [not found] ` <4651E8D1.4010208-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org> 1 sibling, 1 reply; 104+ messages in thread From: Anthony Liguori @ 2007-05-21 18:45 UTC (permalink / raw) To: Arnd Bergmann Cc: Jimi Xenidis, jmk-zzFmDc4TPjtKvsKVC3L/VUEOCMrvLtNR@public.gmane.org, kvm-devel-5NWGOfrQmneRv+LV9MX5uipxlwaOVQ5f, mschwid2-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8, Christian Borntraeger, Suzanne McIntosh Arnd Bergmann wrote: > On Monday 21 May 2007, Christian Borntraeger wrote: > >>> This is quite easy with KVM. I like the approach that vmchannel has >>> taken. A simple PCI device. That gives you a discovery mechanism for >>> shared memory and an interrupt and then you can just implement a ring >>> queue using those mechanisms (along with a PIO port for signaling from >>> the guest to the host). So given that underlying mechanism, the >>> question is how to expose that within the guest kernel/userspace and >>> within the host. >>> >> Sorry for answering late, but I dont like PCI as a device bus for all >> platforms. s390 has no PCI and s390 has no PIO. Right, I'm not interested in the lowest level implementation (PCI device + PIO). I'm more interested in the higher level interface. The goal is to allow drivers to be able to be written to the higher level interface so that they work on any platform that implements the lower level interface. On x86, that would be PCI/PIO. On s390, that could be hypercall based. >> I would prefer a new >> simple hypercall based virtual bus. I dont know much about windows >> driver programming, but I guess it it is not that hard to add a new bus. >> > > We've had the same discussion about PCI as virtual device abstraction > recently when hpa made the suggestions to get a set of PCI device > numbers registered for Linux. 
>
> IIRC, the conclusion to which we came was that it is indeed helpful
> for most architectures to have a PCI device as one way to probe for
> the functionality, but not to rely on it. s390 is the obvious
> example where you can't have PCI, but you may also want to build
> a guest kernel without PCI support because of space constraints
> in a many-guests machine.
>
> What I think would be ideal is to have a new bus type in Linux
> that does not have any dependency on PCI itself, but can be
> easily implemented as a child of a PCI device.
>
> If we only need the stuff mentioned by Anthony, the interface could
> look like
>
> struct vmchannel_device {
> 	struct resource virt_mem;
> 	struct vm_device_id id;
> 	int irq;
> 	int (*signal)(struct vmchannel_device *);
> 	int (*irq_ack)(struct vmchannel_device *);
> 	struct device dev;
> };
>
> Such a device can easily be provided as a child of a PCI device,
> or as something that is purely virtual based on an hcall interface.

Yes, this is close to what I was thinking. I'm not sure that this particular interface can encompass the variety of memory sharing mechanisms though. When I mentioned shared memory via the PCI device, I was referring to the memory needed for bootstrapping the device. You still need a mechanism to transfer memory for things like zero-copy disk IO and network devices. This may involve passing memory addresses directly, copying data, or page flipping.

This leads me to think that a higher-level interface that provided a data passing interface would be more useful. Something like:

	struct vmchannel_device {
		struct vm_device_id id;
		int (*open)(struct vmchannel_device *, const char *name, const char *service);
		int (*release)(struct vmchannel_device *);
		ssize_t (*sendmsg)(struct vmchannel_device *, const void *, size_t);
		ssize_t (*recvmsg)(struct vmchannel_device *, void *, size_t);
		struct device dev;
	};

The consuming interface of this would be a socket (PF_VIRTLINK).
The sockaddr would contain a name identifying a VM and a service description. This doesn't address the memory issues I raised above, but I think it would be easier to special-case the drivers where it mattered.

For instance, on x86 KVM, a PV disk driver front end would consist of connecting to a virtlink socket, and then transferring struct bio's. QEMU instances would listen on the virtlink socket in the host, and service them directly (QEMU can access all of the guest's memory directly in userspace). A PV graphics device could just be a VNC server that listened on a virtlink socket.

Regards, Anthony Liguori

> Arnd <><
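The copy-based contract behind this interface can be sketched in userspace C. This only models the sendmsg/recvmsg semantics proposed here, assuming a mandatory data copy on both sides; the struct and function names are illustrative, and a real PF_VIRTLINK implementation would of course sit behind the kernel's socket layer.

```c
#include <assert.h>
#include <string.h>
#include <sys/types.h>

/* Toy single-message channel: sendmsg copies into a per-channel buffer,
 * recvmsg copies out.  No blocking, no queueing -- just enough to show
 * the "low-speed" data-copy contract. */
struct vmchannel {
	char buf[256];
	size_t len;
};

static ssize_t vmc_sendmsg(struct vmchannel *c, const void *msg, size_t n)
{
	if (n > sizeof(c->buf))
		return -1;              /* would be -EMSGSIZE in the kernel */
	memcpy(c->buf, msg, n);         /* mandatory copy: no page sharing assumed */
	c->len = n;
	return (ssize_t)n;
}

static ssize_t vmc_recvmsg(struct vmchannel *c, void *msg, size_t n)
{
	size_t take = c->len < n ? c->len : n;
	memcpy(msg, c->buf, take);      /* second copy, out to the receiver */
	c->len = 0;
	return (ssize_t)take;
}
```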
* Re: [PATCH/RFC 7/9] Virtual network guest device driver [not found] ` <4651E8D1.4010208-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org> @ 2007-05-21 23:09 ` ron minnich [not found] ` <13426df10705211609j613032c6j373d9a4660f8ec6c-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org> 0 siblings, 1 reply; 104+ messages in thread From: ron minnich @ 2007-05-21 23:09 UTC (permalink / raw) To: Anthony Liguori Cc: Jimi Xenidis, jmk-zzFmDc4TPjtKvsKVC3L/VUEOCMrvLtNR@public.gmane.org, kvm-devel-5NWGOfrQmneRv+LV9MX5uipxlwaOVQ5f, mschwid2-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8, Christian Borntraeger, Suzanne McIntosh OK, so what are we doing here? We're using a PCI abstraction, as a common abstraction,which is not common really, because we don't have a common abstraction? So we describe all these non-pci resources with a pci abstraction? I don't get it at all. I really think the resource interface idea I mentioned, which is borrowed from Plan 9, makes a whole lot more sense. IBM Austin has already shown it in practice in the papers I referenced. It can work. A memory channel at the bottom, with a resource sharing protocol (9p) above it, and then you describe your resources via names and a simple file-directory model. Note that PCI sort of tries to do this tree model, but it's all binary, and, as noted, it's hardly universal. All of this is trivially exported over a network, so the use of shared memory channels in no way rules out network access. Plan 9 exports devices over the network routinely. If you're using a PCI abstraction, something has gone badly wrong I think. thanks ron ------------------------------------------------------------------------- This SF.net email is sponsored by DB2 Express Download DB2 Express C - the FREE version of DB2 express and take control of your XML. No limits. Just data. Click to get it now. http://sourceforge.net/powerbar/db2/ ^ permalink raw reply [flat|nested] 104+ messages in thread
* Re: [PATCH/RFC 7/9] Virtual network guest device driver [not found] ` <13426df10705211609j613032c6j373d9a4660f8ec6c-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org> @ 2007-05-22 0:29 ` Anthony Liguori [not found] ` <46523952.7070405-rdkfGonbjUSkNkDKm+mE6A@public.gmane.org> 0 siblings, 1 reply; 104+ messages in thread From: Anthony Liguori @ 2007-05-22 0:29 UTC (permalink / raw) To: ron minnich Cc: Jimi Xenidis, jmk-zzFmDc4TPjtKvsKVC3L/VUEOCMrvLtNR@public.gmane.org, kvm-devel-5NWGOfrQmneRv+LV9MX5uipxlwaOVQ5f, mschwid2-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8, Christian Borntraeger, Suzanne McIntosh ron minnich wrote: > OK, so what are we doing here? We're using a PCI abstraction, as a > common abstraction,which is not common really, because we don't have a > common abstraction? So we describe all these non-pci resources with a > pci abstraction? > No. You're confusing PV device discovery with the actual paravirtual transport. In a fully virtual environment like KVM, a PCI bus is present. You need some way for the guest to detect that a PV device is present. The most natural way to do this IMHO is to have an entry for the PV device in the PCI bus. That will make a lot of existing code happy. Once you've identified that the device exists, you're free to do whatever you want with it. Regards, Anthony Liguori > I don't get it at all. I really think the resource interface idea I > mentioned, which is borrowed from Plan 9, makes a whole lot more > sense. IBM Austin has already shown it in practice in the papers I > referenced. It can work. A memory channel at the bottom, with a > resource sharing protocol (9p) above it, and then you describe your > resources via names and a simple file-directory model. Note that PCI > sort of tries to do this tree model, but it's all binary, and, as > noted, it's hardly universal. > > All of this is trivially exported over a network, so the use of shared > memory channels in no way rules out network access. 
Plan 9 exports > devices over the network routinely. > > If you're using a PCI abstraction, something has gone badly wrong I think. > > thanks > > ron
* Re: [PATCH/RFC 7/9] Virtual network guest device driver [not found] ` <46523952.7070405-rdkfGonbjUSkNkDKm+mE6A@public.gmane.org> @ 2007-05-22 0:45 ` ron minnich [not found] ` <13426df10705211745r69acc95ai458b2192fe0d0132-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org> 2007-05-22 1:34 ` Eric Van Hensbergen 1 sibling, 1 reply; 104+ messages in thread From: ron minnich @ 2007-05-22 0:45 UTC (permalink / raw) To: Anthony Liguori Cc: Jimi Xenidis, jmk-zzFmDc4TPjtKvsKVC3L/VUEOCMrvLtNR@public.gmane.org, kvm-devel-5NWGOfrQmneRv+LV9MX5uipxlwaOVQ5f, mschwid2-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8, Christian Borntraeger, Suzanne McIntosh On 5/21/07, Anthony Liguori <anthony-rdkfGonbjUSkNkDKm+mE6A@public.gmane.org> wrote: > ron minnich wrote: > > OK, so what are we doing here? We're using a PCI abstraction, as a > > common abstraction,which is not common really, because we don't have a > > common abstraction? So we describe all these non-pci resources with a > > pci abstraction? > > > > No. You're confusing PV device discovery with the actual paravirtual > transport. In a fully virtual environment like KVM, a PCI bus is > present. You need some way for the guest to detect that a PV device is > present. The most natural way to do this IMHO is to have an entry for > the PV device in the PCI bus. That will make a lot of existing code happy. > I don't think I am confusing it, now that you've explained it more fully. I'm even less happy with it :-) How will I explain this sort of thing to my grandchildren? :-) "grandpop, why do those PV devices look like a bus defined in 1994?" Why would you not have, e.g., a 9p server for PV device "config space" as well? I actually implemented that on Xen -- it was quite trivial, and it makes more sense -- to me anyway -- than pretending a PV device is something it's not. 
What is happening, it seems to me, is that people are still trying to use an abstraction -- "PCI device" -- which is not really an abstraction, to model aspects of PV device discovery, enumeration, configuration and operation. I'm still pretty uncomfortable with it -- well, honestly, it seems kind of gross to me. It's just as easy to build the right abstraction underneath all this, and then, for those OSes that have existing code that needs to be happy, present that abstraction as a PCI bus. But making the PCI bus the underlying abstraction is getting the order inverted, I believe. I realize that PCI device space is a pretty handy way to do this, that it is very convenient. I wonder what happens when you get a system without enough "holes" in the config space for you to hide the PV devices in, or that has some other weird property that breaks this model. I've already worked with one system that had 32 PCI busses. There are other hypervisors that made convenient choices over the right choice, and they are paying for it. Let's try to avoid that on kvm. Kvm has so much going for it right now. thanks ron
* Re: [PATCH/RFC 7/9] Virtual network guest device driver [not found] ` <13426df10705211745r69acc95ai458b2192fe0d0132-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org> @ 2007-05-22 1:13 ` Anthony Liguori 0 siblings, 0 replies; 104+ messages in thread From: Anthony Liguori @ 2007-05-22 1:13 UTC (permalink / raw) To: ron minnich Cc: Jimi Xenidis, jmk-zzFmDc4TPjtKvsKVC3L/VUEOCMrvLtNR@public.gmane.org, kvm-devel-5NWGOfrQmneRv+LV9MX5uipxlwaOVQ5f, mschwid2-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8, Christian Borntraeger, Suzanne McIntosh ron minnich wrote: > On 5/21/07, Anthony Liguori <anthony-rdkfGonbjUSkNkDKm+mE6A@public.gmane.org> wrote: >> No. You're confusing PV device discovery with the actual paravirtual >> transport. In a fully virtual environment like KVM, a PCI bus is >> present. You need some way for the guest to detect that a PV device is >> present. The most natural way to do this IMHO is to have an entry for >> the PV device in the PCI bus. That will make a lot of existing code >> happy. >> > > I don't think I am confusing it, now that you've explained it more > fully. I'm even less happy with it :-) Sometimes I think the best way to make you happy is to just stop talking :-) > How will I explain this sort of thing to my grandchildren? :-) > "grandpop, why do those PV devices look like a bus defined in 1994?" > > Why would you not have, e.g., a 9p server for PV device "config space" > as well? I actually implemented that on Xen -- it was quite trivial, > and it makes more sense -- to me anyway -- than pretending a PV device > is something it's not. > > What it happening, it seems to me, is that people are still trying to > use an abstraction -- "PCI device" -- which is not really an > abstraction, to model aspects of PV device discovery, enumeration, > configuration and operation. I'm still pretty uncomfortable with it -- > well, honestly, it seems kind of gross to me. 
It's just as easy to > build the right abstraction underneath all this, and then, for those > OSes that have existing code that needs to be happy, present that > abstraction as a PCI bus. But making the PCI bus the underlying > abstraction is getting the order inverted, I believe. Okay. The first problem here is that you're assuming that I'm suggesting that this whole thing mandate a PCI bus. I'm not. I'm merely saying that one possible way to implement this is by using a PCI bus to discover the existence of a VIRTLINK socket. Clearly, the s390 guys would have to use something else. For PV Xen where there is no PCI bus, XenBus would be used. So very concretely, there are three separate classes of problems: 1) How to determine that a VM can use virtlink sockets 2) How to enumerate paravirtual devices 3) The various PV protocols for each device Whatever Linux implements, it has to allow multiple implementations for #1. For x86 VMs, PCI is just the easiest thing to do here. You could do hypercalls but it gets messy on different hypervisors (vmcall with 0 in eax may do something funky in Xen but be the probing hypercall on KVM). For #2, I'm not really proposing anything concrete. One possibility is to allow virtlink sockets to be addressed with a "service" and to use that. That doesn't allow for enumeration though so it may not be perfect. I'm not proposing anything at all for #3. That's outside the scope of this discussion in my mind. Now, once you have a virtlink socket, could you use p9 to implement #2 and #3? Sounds like something you could write a paper about :-) But that's a later argument. Right now, I'm just focused on solving the bootstrap issue. Hope this clarifies things a bit. Regards, Anthony Liguori > I realize that PCI device space is a pretty handy way to do this, that > it is very convenient.
I wonder what happens when you get a system > without enough "holes" in the config space for you to hide the PV > devices in, or that has some other weird property that breaks this > model. I've already worked with one system that had 32 PCI busses. > > There are other hypervisors that made convenient choices over the > right choice, and they are paying for it. Let's try to avoid that on > kvm. Kvm has so much going for it right now. > > thanks > > ron
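[Editorial note: Anthony's problem #1 above -- letting an x86 guest determine that a PV transport exists -- is what the PCI entry buys. A minimal sketch of what such a probe involves, using the standard PCI configuration mechanism #1 address encoding; the vendor/device IDs here are purely hypothetical placeholders, not real assignments, and real probing would additionally need port I/O to 0xCF8/0xCFC:]

```c
#include <stdint.h>

/* Standard PCI configuration mechanism #1 address encoding. A guest
 * probing for a PV device writes this address to port 0xCF8 and reads
 * the vendor/device word back from port 0xCFC. */
static uint32_t pci_cfg_addr(uint8_t bus, uint8_t dev, uint8_t fn, uint8_t reg)
{
    return 0x80000000u                      /* enable bit */
         | ((uint32_t)bus << 16)
         | ((uint32_t)(dev & 0x1f) << 11)
         | ((uint32_t)(fn & 0x07) << 8)
         | (reg & 0xfc);                    /* dword-aligned register offset */
}

/* Hypothetical IDs for illustration only -- not real assignments. */
#define PV_VENDOR_ID 0x5002u
#define PV_DEVICE_ID 0x2258u

/* Config offset 0 reads back as device ID in the high 16 bits and
 * vendor ID in the low 16 bits. */
static int is_pv_device(uint32_t id_word)
{
    return (id_word & 0xffffu) == PV_VENDOR_ID
        && (id_word >> 16) == PV_DEVICE_ID;
}
```

[This is the enumeration cost Eric and ron object to: even this "simple" path drags the config-address encoding and a bus/device/function scan loop into every guest kernel.]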
* Re: [PATCH/RFC 7/9] Virtual network guest device driver [not found] ` <46523952.7070405-rdkfGonbjUSkNkDKm+mE6A@public.gmane.org> 2007-05-22 0:45 ` ron minnich @ 2007-05-22 1:34 ` Eric Van Hensbergen [not found] ` <a4e6962a0705211834s4db19c7t3b95765bf2c092d7-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org> 1 sibling, 1 reply; 104+ messages in thread From: Eric Van Hensbergen @ 2007-05-22 1:34 UTC (permalink / raw) To: Anthony Liguori Cc: Jimi Xenidis, jmk-zzFmDc4TPjtKvsKVC3L/VUEOCMrvLtNR@public.gmane.org, Christian Borntraeger, mschwid2-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8, Suzanne McIntosh, kvm-devel-5NWGOfrQmneRv+LV9MX5uipxlwaOVQ5f On 5/21/07, Anthony Liguori <anthony-rdkfGonbjUSkNkDKm+mE6A@public.gmane.org> wrote: > ron minnich wrote: > > OK, so what are we doing here? We're using a PCI abstraction, as a > > common abstraction, which is not common really, because we don't have a > > common abstraction? So we describe all these non-pci resources with a > > pci abstraction? > > > > No. You're confusing PV device discovery with the actual paravirtual > transport. In a PV environment why not just pass an initial cookie/hash/whatever as a command-line argument/register/memory-space to the underlying kernel? The presence of such a kernel argument would suggest the existence of a hypercall interface or other such mechanism to "attach" to the initial transport(s). Command-line arguments may be a bit too Linux-centric to Ron's taste, but if we are going to choose something arbitrary like PCI, I'd prefer we choose something a bit more straightforward to interact with instead of doing crazy ritual dances to extract what should be straightforward information. I really don't want to have to integrate PCI parsing into my testOS/libOS kernels. -eric
* Re: [PATCH/RFC 7/9] Virtual network guest device driver [not found] ` <a4e6962a0705211834s4db19c7t3b95765bf2c092d7-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org> @ 2007-05-22 1:42 ` Anthony Liguori [not found] ` <46524A79.8070004-rdkfGonbjUSkNkDKm+mE6A@public.gmane.org> 0 siblings, 1 reply; 104+ messages in thread From: Anthony Liguori @ 2007-05-22 1:42 UTC (permalink / raw) To: Eric Van Hensbergen Cc: Jimi Xenidis, jmk-zzFmDc4TPjtKvsKVC3L/VUEOCMrvLtNR@public.gmane.org, Christian Borntraeger, mschwid2-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8, Suzanne McIntosh, kvm-devel-5NWGOfrQmneRv+LV9MX5uipxlwaOVQ5f Eric Van Hensbergen wrote: > On 5/21/07, Anthony Liguori <anthony-rdkfGonbjUSkNkDKm+mE6A@public.gmane.org> wrote: >> ron minnich wrote: >> > OK, so what are we doing here? We're using a PCI abstraction, as a >> > common abstraction,which is not common really, because we don't have a >> > common abstraction? So we describe all these non-pci resources with a >> > pci abstraction? >> > >> >> No. You're confusing PV device discovery with the actual paravirtual >> transport. > > In a PV environment why not just pass an initial cookie/hash/whatever > as a command-line argument/register/memory-space to the underlying > kernel? You can't pass a command line argument to Windows (at least, not easily AFAIK). You could get away with an MSR/CPUID flag but then you're relying on uniqueness which isn't guaranteed. > The presence of such a kernel argument would suggest the > existence of a hypercall interface or other such mechanism to "attach" > to the initial transport(s). Command-line arguments may be a bit too > linux-centric to Ron's taste, but if we are going to chose something > arbitrary like PCI, I'd prefer we chose something a bit more > straightforward to interact with instead of doing crazy ritual dances > to extract what should be straightforward information. I really don't > want to have integrate PCI parsing into my testOS/libOS kernels. 
You could just hardcode a PIC interrupt and rely on some static memory address for IO and avoid the PCI bus entirely. The whole point of the PCI bus is to avoid hardcoding this sort of thing but if you don't want the complexity associated with PCI, then using the "older" mechanisms seems like the obvious thing to do. Regards, Anthony Liguori > -eric
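[Editorial note: the MSR/CPUID route Anthony mentions was later standardized in practice: hypervisors expose a 12-byte vendor signature in ebx/ecx/edx of CPUID leaf 0x40000000 (KVM's is "KVMKVMKVM", adopted after this thread). A sketch of decoding such a signature; actually issuing CPUID requires inline assembly on a real guest, and the memcpy decoding assumes x86's little-endian byte order:]

```c
#include <stdint.h>
#include <string.h>

/* Reassemble the 12-byte hypervisor vendor signature that CPUID leaf
 * 0x40000000 returns in ebx/ecx/edx, and compare it against an
 * expected string. Assumes little-endian byte order, as on x86. */
static int hv_sig_matches(uint32_t ebx, uint32_t ecx, uint32_t edx,
                          const char *want)
{
    char sig[13];
    memcpy(sig + 0, &ebx, 4);
    memcpy(sig + 4, &ecx, 4);
    memcpy(sig + 8, &edx, 4);
    sig[12] = '\0';
    return strcmp(sig, want) == 0;
}
```

[This addresses the uniqueness worry: the signature string, not a bare flag bit, identifies which hypervisor's hypercall ABI is safe to use.]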
* Re: [PATCH/RFC 7/9] Virtual network guest device driver [not found] ` <46524A79.8070004-rdkfGonbjUSkNkDKm+mE6A@public.gmane.org> @ 2007-05-22 5:17 ` Avi Kivity [not found] ` <46527CD9.5000603-atKUWr5tajBWk0Htik3J/w@public.gmane.org> 0 siblings, 1 reply; 104+ messages in thread From: Avi Kivity @ 2007-05-22 5:17 UTC (permalink / raw) To: Anthony Liguori Cc: Jimi Xenidis, jmk-zzFmDc4TPjtKvsKVC3L/VUEOCMrvLtNR@public.gmane.org, mschwid2-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8, Christian Borntraeger, Suzanne McIntosh, kvm-devel-5NWGOfrQmneRv+LV9MX5uipxlwaOVQ5f Anthony Liguori wrote: >> >> In a PV environment why not just pass an initial cookie/hash/whatever >> as a command-line argument/register/memory-space to the underlying >> kernel? >> > > You can't pass a command line argument to Windows (at least, not easily > AFAIK). You could get away with an MSR/CPUID flag but then you're > relying on uniqueness which isn't guaranteed. > In the general case, you can't pass a command line argument to Linux either. kvm doesn't boot Linux; it boots the bios, which boots the boot sector, which boots grub, which boots Linux. Relying on the user to edit the command line in grub is wrong. -- Do not meddle in the internals of kernels, for they are subtle and quick to panic.
* Re: [PATCH/RFC 7/9] Virtual network guest device driver [not found] ` <46527CD9.5000603-atKUWr5tajBWk0Htik3J/w@public.gmane.org> @ 2007-05-22 12:49 ` Eric Van Hensbergen [not found] ` <a4e6962a0705220549j1c9565f2ic160c672b74aea35-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org> 0 siblings, 1 reply; 104+ messages in thread From: Eric Van Hensbergen @ 2007-05-22 12:49 UTC (permalink / raw) To: Avi Kivity Cc: Jimi Xenidis, jmk-zzFmDc4TPjtKvsKVC3L/VUEOCMrvLtNR@public.gmane.org, Christian Borntraeger, mschwid2-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8, kvm-devel-5NWGOfrQmneRv+LV9MX5uipxlwaOVQ5f, Suzanne McIntosh On 5/22/07, Avi Kivity <avi-atKUWr5tajBWk0Htik3J/w@public.gmane.org> wrote: > Anthony Liguori wrote: > >> > >> In a PV environment why not just pass an initial cookie/hash/whatever > >> as a command-line argument/register/memory-space to the underlying > >> kernel? > >> > > > > You can't pass a command line argument to Windows (at least, not easily > > AFAIK). You could get away with an MSR/CPUID flag but then you're > > relying on uniqueness which isn't guaranteed. > > > > In the general case, you can't pass a command line argument to Linux > either. kvm doesn't boot Linux; it boots the bios, which boots the boot > sector, which boots grub, which boots Linux. Relying on the user to > edit the command line in grub is wrong. > I didn't think we were talking about the general case, I thought we were discussing the PV case. In the PV case, having a bios/bootloader is unnecessary overhead. To that same end, I don't see Windows in the PV case unless they magically want to coordinate PV standards with us, in which case we certainly can negotiate a more sane discovery mechanism. -eric
* Re: [PATCH/RFC 7/9] Virtual network guest device driver [not found] ` <a4e6962a0705220549j1c9565f2ic160c672b74aea35-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org> @ 2007-05-22 12:56 ` Christoph Hellwig [not found] ` <20070522125655.GA4506-wEGCiKHe2LqWVfeAwA7xHQ@public.gmane.org> 2007-05-22 13:08 ` Anthony Liguori 1 sibling, 1 reply; 104+ messages in thread From: Christoph Hellwig @ 2007-05-22 12:56 UTC (permalink / raw) To: Eric Van Hensbergen Cc: Jimi Xenidis, jmk-zzFmDc4TPjtKvsKVC3L/VUEOCMrvLtNR@public.gmane.org, Christian Borntraeger, mschwid2-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8, kvm-devel-5NWGOfrQmneRv+LV9MX5uipxlwaOVQ5f, Suzanne McIntosh On Tue, May 22, 2007 at 07:49:51AM -0500, Eric Van Hensbergen wrote: > > In the general case, you can't pass a command line argument to Linux > > either. kvm doesn't boot Linux; it boots the bios, which boots the boot > > sector, which boots grub, which boots Linux. Relying on the user to > > edit the command line in grub is wrong. > > > > I didn't think we were talking about the general case, I thought we > were discussing the PV case. In the PV case, having bios/bootloader > is unnecessary overhead. To that same end, I don't see Windows in the > PV case unless they magically want to to coordinate PV standards with > us, in which case we certainly can negotiate a more sane discovery > mechanism. In case of KVM no one is speaking of pure PV. What people have been working on is PV acceleration of a fullvirt host, similar to how s390 has been working for decades. The host emulates the full architecture, but there are some escapes for speedups. Typical escapes would be drivers for storage or networking because those cannot be virtualized very well on x86-style hardware.
* Re: [PATCH/RFC 7/9] Virtual network guest device driver [not found] ` <20070522125655.GA4506-wEGCiKHe2LqWVfeAwA7xHQ@public.gmane.org> @ 2007-05-22 14:50 ` Eric Van Hensbergen [not found] ` <a4e6962a0705220750s5abe380dg8dd8e7d0b84de7cd-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org> 0 siblings, 1 reply; 104+ messages in thread From: Eric Van Hensbergen @ 2007-05-22 14:50 UTC (permalink / raw) To: Christoph Hellwig Cc: Jimi Xenidis, jmk-zzFmDc4TPjtKvsKVC3L/VUEOCMrvLtNR@public.gmane.org, Christian Borntraeger, mschwid2-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8, kvm-devel-5NWGOfrQmneRv+LV9MX5uipxlwaOVQ5f, Suzanne McIntosh On 5/22/07, Christoph Hellwig <hch-wEGCiKHe2LqWVfeAwA7xHQ@public.gmane.org> wrote: > > > > I didn't think we were talking about the general case, I thought we > > were discussing the PV case. > > > > In case of KVM no one is speaking of pure PV. > Why not? It seems worthwhile to come up with something that can cover the whole spectrum instead of having different hypervisors (and interfaces). Maybe my view is skewed because I don't care to run windows. -eric
* Re: [PATCH/RFC 7/9] Virtual network guest device driver [not found] ` <a4e6962a0705220750s5abe380dg8dd8e7d0b84de7cd-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org> @ 2007-05-22 15:05 ` Anthony Liguori [not found] ` <465306AE.5080902-rdkfGonbjUSkNkDKm+mE6A@public.gmane.org> 2007-05-23 11:55 ` Avi Kivity 1 sibling, 1 reply; 104+ messages in thread From: Anthony Liguori @ 2007-05-22 15:05 UTC (permalink / raw) To: Eric Van Hensbergen Cc: Jimi Xenidis, jmk-zzFmDc4TPjtKvsKVC3L/VUEOCMrvLtNR@public.gmane.org, Christian Borntraeger, mschwid2-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8, Christoph Hellwig, kvm-devel-5NWGOfrQmneRv+LV9MX5uipxlwaOVQ5f, Suzanne McIntosh Eric Van Hensbergen wrote: > On 5/22/07, Christoph Hellwig <hch-wEGCiKHe2LqWVfeAwA7xHQ@public.gmane.org> wrote: > >>> I didn't think we were talking about the general case, I thought we >>> were discussing the PV case. >>> >>> >> In case of KVM no one is speaking of pure PV. >> >> > > Why not? It seems worthwhile to come up with something that can cover > the whole spectrum instead of having different hypervisors (and > interfaces). > Because in a few years, almost everyone will have hardware capable of doing full virtualization so why bother with pure PV. > Maybe my view is skewed because I don't care to run windows. > It's not just windows. There are a lot of people who want to use virtualization to run RHEL2 or even RH9. Backporting PV to these kernels is a huge effort. Regards, Anthony Liguori > -eric
* Re: [PATCH/RFC 7/9] Virtual network guest device driver [not found] ` <465306AE.5080902-rdkfGonbjUSkNkDKm+mE6A@public.gmane.org> @ 2007-05-22 15:31 ` ron minnich 2007-05-22 16:25 ` Eric Van Hensbergen 1 sibling, 0 replies; 104+ messages in thread From: ron minnich @ 2007-05-22 15:31 UTC (permalink / raw) To: Anthony Liguori Cc: Jimi Xenidis, jmk-zzFmDc4TPjtKvsKVC3L/VUEOCMrvLtNR@public.gmane.org, mschwid2-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8, Christoph Hellwig, Christian Borntraeger, Suzanne McIntosh, kvm-devel-5NWGOfrQmneRv+LV9MX5uipxlwaOVQ5f On 5/22/07, Anthony Liguori <anthony-rdkfGonbjUSkNkDKm+mE6A@public.gmane.org> wrote: > Eric Van Hensbergen wrote: > > On 5/22/07, Christoph Hellwig <hch-wEGCiKHe2LqWVfeAwA7xHQ@public.gmane.org> wrote: > > > >>> I didn't think we were talking about the general case, I thought we > >>> were discussing the PV case. > >>> > >>> > >> In case of KVM no one is speaking of pure PV. > >> > >> > > > > Why not? It seems worthwhile to come up with something that can cover > > the whole spectrum instead of having different hypervisors (and > > interfaces). > > > > Because in a few years, almost everyone will have hardware capable of > doing full virtualization so why bother with pure PV. I don't know, we could shoot for a clean, simple interface that makes PV easy to integrate into any kernel. Pick a common underlying abstraction for all resources. Define a simple, efficient memory channel for the comms. Lay 9p over it. Then take it from there for each device. I agree, from the way (e.g.) the Xen devices work, PV is a pain. But it need not be that way. I think from the Plan 9 side we're happy to run full PV. But we're 0% of the world, so that may bias our importance a bit :-) thanks ron
* Re: [PATCH/RFC 7/9] Virtual network guest device driver [not found] ` <465306AE.5080902-rdkfGonbjUSkNkDKm+mE6A@public.gmane.org> 2007-05-22 15:31 ` ron minnich @ 2007-05-22 16:25 ` Eric Van Hensbergen [not found] ` <a4e6962a0705220925l580f136we269380fe3c9691c-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org> 1 sibling, 1 reply; 104+ messages in thread From: Eric Van Hensbergen @ 2007-05-22 16:25 UTC (permalink / raw) To: Anthony Liguori Cc: Jimi Xenidis, jmk-zzFmDc4TPjtKvsKVC3L/VUEOCMrvLtNR@public.gmane.org, Christian Borntraeger, mschwid2-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8, Christoph Hellwig, kvm-devel-5NWGOfrQmneRv+LV9MX5uipxlwaOVQ5f, Suzanne McIntosh On 5/22/07, Anthony Liguori <anthony-rdkfGonbjUSkNkDKm+mE6A@public.gmane.org> wrote: > Eric Van Hensbergen wrote: > > On 5/22/07, Christoph Hellwig <hch-wEGCiKHe2LqWVfeAwA7xHQ@public.gmane.org> wrote: > >>> > >>> > >> In case of KVM no one is speaking of pure PV. > >> > >> > > > > Why not? It seems worthwhile to come up with something that can cover > > the whole spectrum instead of having different hypervisors (and > > interfaces). > > > > Because in a few years, almost everyone will have hardware capable of > doing full virtualization so why bother with pure PV. > No matter what the capabilities, full device emulation is always going to be wasteful. Just because I have the hardware to run Vista, doesn't mean I should run Vista. > > Maybe my view is skewed because I don't care to run windows. > > > > It's not just windows. There are a lot of people who want to use > virtualization to run RHEL2 or even RH9. Backporting PV to these > kernels is a huge effort. > I'm not opposed to supporting emulation environments, just don't make a large pile of crap the default like Xen -- and having to integrate PCI probing code in my guest domains is a large pile of crap. 
-eric
* Re: [PATCH/RFC 7/9] Virtual network guest device driver [not found] ` <a4e6962a0705220925l580f136we269380fe3c9691c-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org> @ 2007-05-22 17:00 ` ron minnich [not found] ` <13426df10705221000i749badc5h8afe4f2fc95bc2ce-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org> 0 siblings, 1 reply; 104+ messages in thread From: ron minnich @ 2007-05-22 17:00 UTC (permalink / raw) To: Eric Van Hensbergen Cc: Jimi Xenidis, jmk-zzFmDc4TPjtKvsKVC3L/VUEOCMrvLtNR@public.gmane.org, Christian Borntraeger, mschwid2-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8, Christoph Hellwig, kvm-devel-5NWGOfrQmneRv+LV9MX5uipxlwaOVQ5f, Suzanne McIntosh On 5/22/07, Eric Van Hensbergen <ericvh-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org> wrote: > I'm not opposed to supporting emulation environments, just don't make > a large pile of crap the default like Xen -- and having to integrate > PCI probing code in my guest domains is a large pile of crap. Exactly. I'm about to start a pretty large project here, using xen or kvm, not sure. One thing for sure, we are NOT going to use anything but PV devices. Full emulation is nice, but it's just plain silly if you don't have to do it. And we don't have to do it. So let's get the PV devices right, not try to shoehorn them into some framework like PCI. What happens to these schemes if I want to try, e.g., 2^16 PV devices? Or some other crazy thing that doesn't play well with PCI -- simple example -- I want a 256 GB region of memory for a device. PCI rules require me to align it on 256GB boundaries and it must be contiguous address space. This is a hardware rule, done for hardware reasons, and has no place in the PV world. What if I want a bit more than the basic set of BARs that PCI gives me? Why would we apply such rules to a PV? Why limit ourselves this early in the game? 
thanks ron
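[Editorial note: ron's 256GB example follows from how PCI BAR sizing works: software writes all-ones to a BAR and reads it back, and because the device hard-wires the low address bits to zero, a memory BAR is necessarily a power of two in size and naturally aligned. A sketch of the standard sizing arithmetic for a 32-bit memory BAR (bits 3:0 are type flags, not address bits):]

```c
#include <stdint.h>

/* Decode the readback value after writing 0xFFFFFFFF to a 32-bit
 * memory BAR. Only the high bits are writable, so the decoded size is
 * always a power of two and the region is naturally aligned -- the
 * hardware-era constraint ron objects to inheriting for PV devices. */
static uint64_t mem_bar_size(uint32_t readback)
{
    uint32_t mask = readback & 0xfffffff0u; /* drop the flag bits */
    if (mask == 0)
        return 0;                           /* BAR not implemented */
    return (uint64_t)(~mask) + 1;
}
```

[So a 256GB region really would have to sit on a 256GB boundary under PCI rules, purely as an artifact of this decoding scheme, not for any PV reason.]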
* Re: [PATCH/RFC 7/9] Virtual network guest device driver [not found] ` <13426df10705221000i749badc5h8afe4f2fc95bc2ce-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org> @ 2007-05-22 17:06 ` Christoph Hellwig [not found] ` <20070522170628.GA16624-wEGCiKHe2LqWVfeAwA7xHQ@public.gmane.org> 2007-05-23 12:20 ` Avi Kivity 1 sibling, 1 reply; 104+ messages in thread From: Christoph Hellwig @ 2007-05-22 17:06 UTC (permalink / raw) To: ron minnich Cc: Jimi Xenidis, jmk-zzFmDc4TPjtKvsKVC3L/VUEOCMrvLtNR@public.gmane.org, mschwid2-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8, Christoph Hellwig, Christian Borntraeger, Suzanne McIntosh, kvm-devel-5NWGOfrQmneRv+LV9MX5uipxlwaOVQ5f On Tue, May 22, 2007 at 10:00:42AM -0700, ron minnich wrote: > On 5/22/07, Eric Van Hensbergen <ericvh-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org> wrote: > > >I'm not opposed to supporting emulation environments, just don't make > >a large pile of crap the default like Xen -- and having to integrate > >PCI probing code in my guest domains is a large pile of crap. > > Exactly. I'm about to start a pretty large project here, using xen or > kvm, not sure. One thing for sure, we are NOT going to use anything > but PV devices. Full emulation is nice, but it's just plain silly if > you don't have to do it. And we don't have to do it. So let's get the > PV devices right, not try to shoehorn them into some framework like > PCI. If you don't care about full virtualization, kvm is the wrong project for you. You might want to take a look at lguest.
* Re: [PATCH/RFC 7/9] Virtual network guest device driver [not found] ` <20070522170628.GA16624-wEGCiKHe2LqWVfeAwA7xHQ@public.gmane.org> @ 2007-05-22 17:34 ` ron minnich [not found] ` <13426df10705221034k7baf5bccrc77aabca8c9e225c-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org> 2007-05-23 12:16 ` Avi Kivity 1 sibling, 1 reply; 104+ messages in thread From: ron minnich @ 2007-05-22 17:34 UTC (permalink / raw) To: Christoph Hellwig Cc: Jimi Xenidis, jmk-zzFmDc4TPjtKvsKVC3L/VUEOCMrvLtNR@public.gmane.org, mschwid2-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8, Christian Borntraeger, Suzanne McIntosh, kvm-devel-5NWGOfrQmneRv+LV9MX5uipxlwaOVQ5f On 5/22/07, Christoph Hellwig <hch-wEGCiKHe2LqWVfeAwA7xHQ@public.gmane.org> wrote: > If you don't care about full virtualization kvm is the wrong project for > you. You might want to take a look at lguest. Ah, I had not realized that KVM was purely a full-virt environment with no real use for PV-only users. I'll move on. Thanks for the tip! ron
* Re: [PATCH/RFC 7/9] Virtual network guest device driver [not found] ` <13426df10705221034k7baf5bccrc77aabca8c9e225c-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org> @ 2007-05-22 20:03 ` Dor Laor [not found] ` <64F9B87B6B770947A9F8391472E032160BF29F1E-yEcIvxbTEBqsx+V+t5oei8rau4O3wl8o3fe8/T/H7NteoWH0uzbU5w@public.gmane.org> 0 siblings, 1 reply; 104+ messages in thread From: Dor Laor @ 2007-05-22 20:03 UTC (permalink / raw) To: ron minnich, Christoph Hellwig Cc: Jimi Xenidis, jmk-zzFmDc4TPjtKvsKVC3L/VUEOCMrvLtNR, Christian Borntraeger, mschwid2-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8, kvm-devel-5NWGOfrQmneRv+LV9MX5uipxlwaOVQ5f, Suzanne McIntosh >> If you don't care about full virtualization kvm is the wrong project for >> you. You might want to take a look at lguest. > >Ah, I had not realized that KVM was purely a full-virt environment >with no real use for PV-only users. I'll move on. Thanks for the tip! >ron Don't quit so soon on us. KVM already has PV kernel capabilities (in Ingo Molnar's tree) and has network and block PV drivers. We do plan on supporting/improving the PV kernel capabilities. The near future change is direct guest paging. Although all new x86 cpus now ship with hardware support, software PV can always find spots for acceleration. Regarding PV drivers, our initial approach was to try not to reinvent the wheel and implement the PV discovery using pci. For full-virt OSs, especially Windows, it was simpler. Now that more platforms might be kvm based, I agree we should switch to a generic solution. Dor.
[parent not found: <64F9B87B6B770947A9F8391472E032160BF29F1E-yEcIvxbTEBqsx+V+t5oei8rau4O3wl8o3fe8/T/H7NteoWH0uzbU5w@public.gmane.org>]
* Re: [PATCH/RFC 7/9] Virtual network guest device driver [not found] ` <64F9B87B6B770947A9F8391472E032160BF29F1E-yEcIvxbTEBqsx+V+t5oei8rau4O3wl8o3fe8/T/H7NteoWH0uzbU5w@public.gmane.org> @ 2007-05-22 20:10 ` ron minnich 2007-05-22 22:56 ` Nakajima, Jun 1 sibling, 0 replies; 104+ messages in thread From: ron minnich @ 2007-05-22 20:10 UTC (permalink / raw) To: Dor Laor Cc: Jimi Xenidis, jmk-zzFmDc4TPjtKvsKVC3L/VUEOCMrvLtNR, Christian Borntraeger, mschwid2-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8, Christoph Hellwig, kvm-devel-5NWGOfrQmneRv+LV9MX5uipxlwaOVQ5f, Suzanne McIntosh On 5/22/07, Dor Laor <dor.laor-atKUWr5tajBWk0Htik3J/w@public.gmane.org> wrote: > Don't quit so soon on us. OK. I'll go look at Ingo's stuff. Thanks again ron ------------------------------------------------------------------------- This SF.net email is sponsored by DB2 Express Download DB2 Express C - the FREE version of DB2 express and take control of your XML. No limits. Just data. Click to get it now. http://sourceforge.net/powerbar/db2/ ^ permalink raw reply [flat|nested] 104+ messages in thread
* Re: [PATCH/RFC 7/9] Virtual network guest device driver [not found] ` <64F9B87B6B770947A9F8391472E032160BF29F1E-yEcIvxbTEBqsx+V+t5oei8rau4O3wl8o3fe8/T/H7NteoWH0uzbU5w@public.gmane.org> 2007-05-22 20:10 ` ron minnich @ 2007-05-22 22:56 ` Nakajima, Jun [not found] ` <8FFF7E42E93CC646B632AB40643802A8032793AC-1a9uaKK1+wJcIJlls4ac1rfspsVTdybXVpNB7YpNyf8@public.gmane.org> 1 sibling, 1 reply; 104+ messages in thread From: Nakajima, Jun @ 2007-05-22 22:56 UTC (permalink / raw) To: Dor Laor, ron minnich, Christoph Hellwig Cc: Jimi Xenidis, jmk-zzFmDc4TPjtKvsKVC3L/VUEOCMrvLtNR, Christian Borntraeger, mschwid2-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8, kvm-devel-5NWGOfrQmneRv+LV9MX5uipxlwaOVQ5f, Suzanne McIntosh Dor Laor wrote: > > > If you don't care about full virtualization kvm is the wrong project for > > > you. You might want to take a look at lguest. > > > > Ah, I had not realized that KVM was purely a full-virt environment > > with no real use for PV-only users. I'll move on. Thanks for the tip! > > ron > > Don't quit so soon on us. > KVM has already PV kernel capabilities (in Ingo Molnar's tree) and has > network and block PV drivers. > > We do plan on supporting/improving the PV kernel capabilities. The near > future change is direct guest paging. > Although all new x86 cpus now ship with hardware support, software PV > can always find spots for acceleration. BTW, I'm presenting this at OLS: http://www.linuxsymposium.org/2007/view_abstract.php?content_key=192 This uses direct paging mode today. > > Regarding PV drivers, our initial approach was try not to invent the > wheel and implement the PV discovery using pci. For full-virt OSs, > especially windows it was simpler. Now that more platforms might be kvm > based, I agree we should switch to a generic solution. > Dor. 
> Jun --- Intel Open Source Technology Center
[parent not found: <8FFF7E42E93CC646B632AB40643802A8032793AC-1a9uaKK1+wJcIJlls4ac1rfspsVTdybXVpNB7YpNyf8@public.gmane.org>]
* Re: [PATCH/RFC 7/9] Virtual network guest device driver [not found] ` <8FFF7E42E93CC646B632AB40643802A8032793AC-1a9uaKK1+wJcIJlls4ac1rfspsVTdybXVpNB7YpNyf8@public.gmane.org> @ 2007-05-23 8:15 ` Carsten Otte [not found] ` <4653F807.2010209-tA70FqPdS9bQT0dZR+AlfA@public.gmane.org> 2007-05-23 12:21 ` Avi Kivity 1 sibling, 1 reply; 104+ messages in thread From: Carsten Otte @ 2007-05-23 8:15 UTC (permalink / raw) To: Nakajima, Jun Cc: Jimi Xenidis, jmk-zzFmDc4TPjtKvsKVC3L/VUEOCMrvLtNR, Christian Borntraeger, mschwid2-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8, Christoph Hellwig, Suzanne McIntosh, kvm-devel-5NWGOfrQmneRv+LV9MX5uipxlwaOVQ5f

I have been closely following this very interesting discussion. Here's my summary:
- PV capabilities are something we'll want
- being able to surface virtual devices to the guest as PCI is preferable for Windows
- we need an additional way to surface virtual devices to the guest. We don't have PCI on s390, and Ron doesn't want PCI in his guests.
- complex interfaces are a mess to implement and maintain in different hypervisors and guest operating systems; we need a simple and clear structure like plan9 has today

To me, it looks like we need a virtual device abstraction both in the guest kernel and in kvm/qemu. This abstraction needs to be simple and fast, and needs to be representable both as a PCI device and in a simpler way. PCI obstacles are supposed to be transparent to the virtual device. For me, plan9 does provide answers to a lot of the above requirements. However, it does not provide capabilities for shared memory, and it adds extra complexity. It's been designed to solve a different problem.
I think the virtual device abstraction should provide the following functionality:
- hypercall guest to host with parameters and return value
- interrupt from host to guest with parameters
- thin interrupt from host to guest, no parameters
- shared memory between guest and host
- dma access to guest memory, possibly via kmap on the host
- copy from/to guest memory

so long, Carsten
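[Editor's note] Carsten's list maps naturally onto a guest-side ops table. The sketch below is purely illustrative — `struct vdev_ops`, the `fake_*` backend, and every other name in it are invented for this example and appear nowhere in the patch set; the idea is only that a PCI transport and an s390 transport would each fill in such a table with their own primitives:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical guest-visible operations, roughly one entry per item in
 * the list above (interrupt delivery would be host->guest and is not
 * modeled here). Drivers program only against this table. */
struct vdev_ops {
	/* hypercall guest -> host with parameters and a return value */
	long (*hypercall)(unsigned long nr, unsigned long p1, unsigned long p2);
	/* copy between guest driver buffers and host/device memory */
	int (*copy_to_host)(uint64_t dst, const void *src, size_t len);
	int (*copy_from_host)(void *dst, uint64_t src, size_t len);
	/* map a region shared between guest and host */
	void *(*map_shared)(uint64_t host_token, size_t len);
};

/* A trivial in-process "host" used only to demonstrate the API shape. */
static unsigned char fake_host_mem[64];

static long fake_hypercall(unsigned long nr, unsigned long p1, unsigned long p2)
{
	return (long)(nr + p1 + p2);	/* the fake host just echoes a sum */
}

static int fake_copy_to_host(uint64_t dst, const void *src, size_t len)
{
	if (dst + len > sizeof(fake_host_mem))
		return -1;
	memcpy(fake_host_mem + dst, src, len);
	return 0;
}

static int fake_copy_from_host(void *dst, uint64_t src, size_t len)
{
	if (src + len > sizeof(fake_host_mem))
		return -1;
	memcpy(dst, fake_host_mem + src, len);
	return 0;
}

static void *fake_map_shared(uint64_t host_token, size_t len)
{
	(void)len;
	return fake_host_mem + host_token;
}

static const struct vdev_ops fake_ops = {
	.hypercall	= fake_hypercall,
	.copy_to_host	= fake_copy_to_host,
	.copy_from_host	= fake_copy_from_host,
	.map_shared	= fake_map_shared,
};
```

The point of the indirection is that whether the hypercall is a PCI doorbell write or an s390 instruction stays hidden behind one function pointer, which is exactly the "representable as PCI and in a simpler way" property asked for above.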
[parent not found: <4653F807.2010209-tA70FqPdS9bQT0dZR+AlfA@public.gmane.org>]
* Re: [PATCH/RFC 7/9] Virtual network guest device driver [not found] ` <4653F807.2010209-tA70FqPdS9bQT0dZR+AlfA@public.gmane.org> @ 2007-05-23 12:25 ` Avi Kivity 2007-05-23 14:12 ` Eric Van Hensbergen 1 sibling, 0 replies; 104+ messages in thread From: Avi Kivity @ 2007-05-23 12:25 UTC (permalink / raw) To: carsteno-tA70FqPdS9bQT0dZR+AlfA Cc: Jimi Xenidis, jmk-zzFmDc4TPjtKvsKVC3L/VUEOCMrvLtNR, Christian Borntraeger, mschwid2-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8, Christoph Hellwig, kvm-devel-5NWGOfrQmneRv+LV9MX5uipxlwaOVQ5f, Suzanne McIntosh Carsten Otte wrote: > I have been closely following thisvery interresting discussion. Here's > my summary: > - PV capabilities is something we'll want > - being able to surface virtual devices to the guest as PCI is > preferable to Windows > - we need an additional way to surface virtual devices to the guest. > We don't have PCI on s390, and Ron doesn't want PCI in his guests. > - complex interfaces are a mess to implement and maintain in different > hypervisors and guest operating systems, we need a simple and clear > structure like plan9 has today > > To me, it looks like we need a virtual device abstraction both in the > guest kernel and in the kvm/qemu. This abstraction needs to be simple > and fast, and needs to be representable as PCI device and in a simpler > way. PCI obstacles are supposed to be transparent to the virutal device. > For me, plan9 does provide answers to a lot of above requirements. > However, it does not provide capabilities for shared memory and it > adds extra complexity. It's been designed to solve a different problem. 
> > I think the virtual device abstraction should provide the following > functionality: > - hypercall guest to host with parameters and return value > - interrupt from host to guest with parameters > - thin interrupt from host to guest, no parameters > - shared memory between guest and host > - dma access to guest memory, possibly via kmap on the host > - copy from/to guest memory > > I agree with all of the above. In addition, it would be nice if we can share this interface with other hypervisors. Unfortunately Xen is riding the XenBus, but maybe we can share the interface with lguest and VMI. -- Do not meddle in the internals of kernels, for they are subtle and quick to panic. ------------------------------------------------------------------------- This SF.net email is sponsored by DB2 Express Download DB2 Express C - the FREE version of DB2 express and take control of your XML. No limits. Just data. Click to get it now. http://sourceforge.net/powerbar/db2/ ^ permalink raw reply [flat|nested] 104+ messages in thread
* Re: [PATCH/RFC 7/9] Virtual network guest device driver [not found] ` <4653F807.2010209-tA70FqPdS9bQT0dZR+AlfA@public.gmane.org> 2007-05-23 12:25 ` Avi Kivity @ 2007-05-23 14:12 ` Eric Van Hensbergen [not found] ` <a4e6962a0705230712pd8c2958m9dee6b2ccec0899d-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org> 1 sibling, 1 reply; 104+ messages in thread From: Eric Van Hensbergen @ 2007-05-23 14:12 UTC (permalink / raw) To: carsteno-tA70FqPdS9bQT0dZR+AlfA Cc: Jimi Xenidis, jmk-zzFmDc4TPjtKvsKVC3L/VUEOCMrvLtNR, Christian Borntraeger, mschwid2-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8, Christoph Hellwig, kvm-devel-5NWGOfrQmneRv+LV9MX5uipxlwaOVQ5f, Suzanne McIntosh On 5/23/07, Carsten Otte <cotte-tA70FqPdS9bQT0dZR+AlfA@public.gmane.org> wrote: > > For me, plan9 does provide answers to a lot of above requirements. > However, it does not provide capabilities for shared memory and it > adds extra complexity. It's been designed to solve a different problem. > As a point of clarification, plan9 protocols have been used over shared memory for resource access on virtualized systems for the past 3 years. There are certainly ways it can be further optimized, but it is not a restriction. As far as complexity goes, our guest-side stack is around 2000 lines of code (with an additional 1000 lines of support routines that could likely be replaced by standard library or OS services in more conventional platforms) and supports console, file system, network, and block device access. > I think the virtual device abstraction should provide the following > functionality: > - hypercall guest to host with parameters and return value > - interrupt from host to guest with parameters > - thin interrupt from host to guest, no parameters > - shared memory between guest and host > - dma access to guest memory, possibly via kmap on the host > - copy from/to guest memory > Good list. We can certainly work within these parameters. 
It would be nice to have some facility for direct guest<->guest communication -- however, I understand the difficulties in doing that in a secure and safe way. Still, having the ability to provision such a direct interface would be nice for those that can take advantage of it. -eric
[parent not found: <a4e6962a0705230712pd8c2958m9dee6b2ccec0899d-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>]
* Re: [PATCH/RFC 7/9] Virtual network guest device driver [not found] ` <a4e6962a0705230712pd8c2958m9dee6b2ccec0899d-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org> @ 2007-05-23 23:02 ` Arnd Bergmann [not found] ` <200705240102.40795.arnd-r2nGTMty4D4@public.gmane.org> 0 siblings, 1 reply; 104+ messages in thread From: Arnd Bergmann @ 2007-05-23 23:02 UTC (permalink / raw) To: kvm-devel-5NWGOfrQmneRv+LV9MX5uipxlwaOVQ5f Cc: Jimi Xenidis, Christian Borntraeger, jmk-zzFmDc4TPjtKvsKVC3L/VUEOCMrvLtNR, mschwid2-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8, Christoph Hellwig, carsteno-tA70FqPdS9bQT0dZR+AlfA, Suzanne McIntosh On Wednesday 23 May 2007, Eric Van Hensbergen wrote: > On 5/23/07, Carsten Otte <cotte-tA70FqPdS9bQT0dZR+AlfA@public.gmane.org> wrote: > > > > For me, plan9 does provide answers to a lot of above requirements. > > However, it does not provide capabilities for shared memory and it > > adds extra complexity. It's been designed to solve a different problem. > > > As a point of clarification, plan9 protocols have been used over > shared memory for resource access on virtualized systems for the past > 3 years. There are certainly ways it can be further optimized, but it > is not a restriction. I think what Carsten means is to have a mmap interface over 9p, not implementing 9p by means of shared memory, which is what I guess you are referring to. If you want to share memory areas between a guest and the host or another guest, you can't do that with the regular Tread/Twrite interface that 9p has on a file. > As far as complexity goes, our guest-side stack > is around 2000 lines of code (with an additional 1000 lines of support > routines that could likely be replaced by standard library or OS > services in more conventional platforms) and supports console, file > system, network, and block device access. Another interface that I think is missing in 9p is a notification for hotplugging. 
Of course you can have a long-running read on a special file that returns the file names for virtual devices that have been added or removed in the guest, but that sounds a little clumsy compared to a specialized interface (e.g. Tnotify). Arnd <><
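[Editor's note] The long-running-read scheme Arnd describes needs very little guest code. Both the `add <name>` / `remove <name>` wire format and the function below are hypothetical — 9p defines no such file; this only sketches what the guest-side consumer of such a file could look like:

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Parse one line of a hypothetical hotplug-event file. Returns +1 for
 * "add <name>", -1 for "remove <name>", 0 for a malformed line; the
 * device name is copied into 'name' (at most 'cap' bytes incl. NUL). */
static int parse_hotplug_line(const char *line, char *name, size_t cap)
{
	const char *p;
	int kind;

	if (strncmp(line, "add ", 4) == 0) {
		p = line + 4;
		kind = 1;
	} else if (strncmp(line, "remove ", 7) == 0) {
		p = line + 7;
		kind = -1;
	} else {
		return 0;
	}
	if (*p == '\0' || strlen(p) >= cap)
		return 0;	/* empty or oversized name */
	strcpy(name, p);
	return kind;
}
```

A guest thread would block in read(2) on the special file and feed each returned line through a parser like this, binding or unbinding the named device.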
[parent not found: <200705240102.40795.arnd-r2nGTMty4D4@public.gmane.org>]
* Re: [PATCH/RFC 7/9] Virtual network guest device driver [not found] ` <200705240102.40795.arnd-r2nGTMty4D4@public.gmane.org> @ 2007-05-23 23:57 ` Eric Van Hensbergen [not found] ` <a4e6962a0705231657n65946ba4n74393f7028b6d61c-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org> 0 siblings, 1 reply; 104+ messages in thread From: Eric Van Hensbergen @ 2007-05-23 23:57 UTC (permalink / raw) To: Arnd Bergmann Cc: Jimi Xenidis, Christian Borntraeger, jmk-zzFmDc4TPjtKvsKVC3L/VUEOCMrvLtNR, kvm-devel-5NWGOfrQmneRv+LV9MX5uipxlwaOVQ5f, mschwid2-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8, Christoph Hellwig, carsteno-tA70FqPdS9bQT0dZR+AlfA, Suzanne McIntosh On 5/23/07, Arnd Bergmann <arnd-r2nGTMty4D4@public.gmane.org> wrote: > On Wednesday 23 May 2007, Eric Van Hensbergen wrote: > > On 5/23/07, Carsten Otte <cotte-tA70FqPdS9bQT0dZR+AlfA@public.gmane.org> wrote: > > > > > > For me, plan9 does provide answers to a lot of above requirements. > > > However, it does not provide capabilities for shared memory and it > > > adds extra complexity. It's been designed to solve a different problem. > > > > > As a point of clarification, plan9 protocols have been used over > > shared memory for resource access on virtualized systems for the past > > 3 years. There are certainly ways it can be further optimized, but it > > is not a restriction. > > I think what Carsten means is to have a mmap interface over 9p, not > implementing 9p by means of shared memory, which is what I guess > you are referring to. > > If you want to share memory areas between a guest and the host > or another guest, you can't do that with the regular Tread/Twrite > interface that 9p has on a file. > Well, there's nothing strictly preventing a mmap interface over 9p (in fact we are working with that in a Cell project internally) -- however, I'm not sure that makes the best sense for device access anyways. 
The real thing missing from the current implementation is a better underlying transport which can pass payloads by reference to shared memory as opposed to marshaling operations through a shared memory transport -- however, this is what Los Alamos and IBM are working on right now. > > As far as complexity goes, our guest-side stack > > is around 2000 lines of code (with an additional 1000 lines of support > > routines that could likely be replaced by standard library or OS > > services in more conventional platforms) and supports console, file > > system, network, and block device access. > > Another interface that I think is missing in 9p is a notification > for hotplugging. Of course you can have a long-running read on a > special file that returns the file names for virtual devices that > have been added or removed in the guest, but that sounds a little > clumsy compared to an specialized interface (e.g. Tnotify). > Discovery and hot-plugging would be synthetic file system semantic issues that need to be resolved and in general are probably, as Rusty and others suggested, best handled as a separate set of topics. That being said, specialized interfaces always seemed a bit more clunky to me (just look at ioctl), but I suppose that's largely a matter of taste. The advantage of having a file system interface to event notification is it creates a much more flexible environment, allowing even simple shell scripting languages to resolve events versus having to build a complex infrastructure -- and since 9p can be transitively mounted over a network, you can build cluster management suites without secondary layers of gorp for such things. The LANL guys will probably have more to say about this at their OLS talk on the KVM management synthetic file system interface they build with 9p. 
-eric
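[Editor's note] The pass-by-reference transport Eric mentions usually boils down to queuing references to guest pages instead of copying payload bytes through the shared ring. The descriptor layout and helpers below are an illustrative guess at such a scheme, not the design Los Alamos and IBM were working on:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define GUEST_PAGE_SIZE 4096ULL	/* assumed page size for this example */

/* Hypothetical ring descriptor: instead of marshaling a 9p payload
 * through shared memory, the guest queues references to its own pages
 * and the host maps (kmaps) them directly. */
struct payload_desc {
	uint64_t guest_pfn;	/* guest page frame holding the data */
	uint32_t offset;	/* byte offset within that page */
	uint32_t len;		/* payload bytes in this segment */
};

/* Split a guest buffer [addr, addr+len) into per-page segments.
 * Returns the number of descriptors written (at most 'max'). */
static size_t payload_split(uint64_t addr, uint32_t len,
			    struct payload_desc *out, size_t max)
{
	size_t n = 0;

	while (len > 0 && n < max) {
		uint32_t off = (uint32_t)(addr % GUEST_PAGE_SIZE);
		uint32_t chunk = (uint32_t)(GUEST_PAGE_SIZE - off);

		if (chunk > len)
			chunk = len;
		out[n].guest_pfn = addr / GUEST_PAGE_SIZE;
		out[n].offset = off;
		out[n].len = chunk;
		addr += chunk;
		len -= chunk;
		n++;
	}
	return n;
}

/* Total payload carried by a scatter list of segments. */
static size_t payload_total(const struct payload_desc *d, size_t n)
{
	size_t i, sum = 0;

	for (i = 0; i < n; i++)
		sum += d[i].len;
	return sum;
}
```

On the host side each `guest_pfn` would be translated and kmapped, which is the "dma access to guest memory" primitive from Carsten's earlier list.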
[parent not found: <a4e6962a0705231657n65946ba4n74393f7028b6d61c-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>]
* Re: [PATCH/RFC 7/9] Virtual network guest device driver [not found] ` <a4e6962a0705231657n65946ba4n74393f7028b6d61c-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org> @ 2007-05-24 0:07 ` Eric Van Hensbergen 0 siblings, 0 replies; 104+ messages in thread From: Eric Van Hensbergen @ 2007-05-24 0:07 UTC (permalink / raw) To: Arnd Bergmann Cc: Jimi Xenidis, Christian Borntraeger, jmk-zzFmDc4TPjtKvsKVC3L/VUEOCMrvLtNR, kvm-devel-5NWGOfrQmneRv+LV9MX5uipxlwaOVQ5f, mschwid2-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8, Christoph Hellwig, carsteno-tA70FqPdS9bQT0dZR+AlfA, Suzanne McIntosh On 5/23/07, Eric Van Hensbergen <ericvh-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org> wrote: > On 5/23/07, Arnd Bergmann <arnd-r2nGTMty4D4@public.gmane.org> wrote: > > On Wednesday 23 May 2007, Eric Van Hensbergen wrote: > > > On 5/23/07, Carsten Otte <cotte-tA70FqPdS9bQT0dZR+AlfA@public.gmane.org> wrote: > > > > > > > > For me, plan9 does provide answers to a lot of above requirements. > > > > However, it does not provide capabilities for shared memory and it > > > > adds extra complexity. It's been designed to solve a different problem. > > > > > > > As a point of clarification, plan9 protocols have been used over > > > shared memory for resource access on virtualized systems for the past > > > 3 years. There are certainly ways it can be further optimized, but it > > > is not a restriction. > > > > I think what Carsten means is to have a mmap interface over 9p, not > > implementing 9p by means of shared memory, which is what I guess > > you are referring to. > > > > If you want to share memory areas between a guest and the host > > or another guest, you can't do that with the regular Tread/Twrite > > interface that 9p has on a file. > > ugh. I'm tired. Its been a long week -- I realized after I fired off that last message that you mean establishing a shared mapping versus support for mmap operations over 9p (which devolve into Tread/Twrite). Sorry. 
Yes -- that's correct, 9p wouldn't necessarily buy you something like that. In fact, the current 9p code relies on someone else providing that basic mechanism in order for us to establish our shared memory transport. What Carsten described as his virtual device abstraction sounded like a good foundation -- just don't make me use ioctl :) -eric
* Re: [PATCH/RFC 7/9] Virtual network guest device driver [not found] ` <8FFF7E42E93CC646B632AB40643802A8032793AC-1a9uaKK1+wJcIJlls4ac1rfspsVTdybXVpNB7YpNyf8@public.gmane.org> 2007-05-23 8:15 ` Carsten Otte @ 2007-05-23 12:21 ` Avi Kivity 1 sibling, 0 replies; 104+ messages in thread From: Avi Kivity @ 2007-05-23 12:21 UTC (permalink / raw) To: Nakajima, Jun Cc: Jimi Xenidis, jmk-zzFmDc4TPjtKvsKVC3L/VUEOCMrvLtNR, Christian Borntraeger, mschwid2-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8, Christoph Hellwig, Suzanne McIntosh, kvm-devel-5NWGOfrQmneRv+LV9MX5uipxlwaOVQ5f Nakajima, Jun wrote: > BTW, I'm presenting this at OLS: > http://www.linuxsymposium.org/2007/view_abstract.php?content_key=192 > > This uses direct paging mode today. > Are there patches available anywhere? -- Do not meddle in the internals of kernels, for they are subtle and quick to panic. ------------------------------------------------------------------------- This SF.net email is sponsored by DB2 Express Download DB2 Express C - the FREE version of DB2 express and take control of your XML. No limits. Just data. Click to get it now. http://sourceforge.net/powerbar/db2/ ^ permalink raw reply [flat|nested] 104+ messages in thread
* Re: [PATCH/RFC 7/9] Virtual network guest device driver [not found] ` <20070522170628.GA16624-wEGCiKHe2LqWVfeAwA7xHQ@public.gmane.org> 2007-05-22 17:34 ` ron minnich @ 2007-05-23 12:16 ` Avi Kivity [not found] ` <465430B2.7050101-atKUWr5tajBWk0Htik3J/w@public.gmane.org> 1 sibling, 1 reply; 104+ messages in thread From: Avi Kivity @ 2007-05-23 12:16 UTC (permalink / raw) To: Christoph Hellwig Cc: Jimi Xenidis, jmk-zzFmDc4TPjtKvsKVC3L/VUEOCMrvLtNR@public.gmane.org, Christian Borntraeger, mschwid2-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8, Suzanne McIntosh, kvm-devel-5NWGOfrQmneRv+LV9MX5uipxlwaOVQ5f Christoph Hellwig wrote: > On Tue, May 22, 2007 at 10:00:42AM -0700, ron minnich wrote: > >> On 5/22/07, Eric Van Hensbergen <ericvh-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org> wrote: >> >> >>> I'm not opposed to supporting emulation environments, just don't make >>> a large pile of crap the default like Xen -- and having to integrate >>> PCI probing code in my guest domains is a large pile of crap. >>> >> Exactly. I'm about to start a pretty large project here, using xen or >> kvm, not sure. One thing for sure, we are NOT going to use anything >> but PV devices. Full emulation is nice, but it's just plain silly if >> you don't have to do it. And we don't have to do it. So let's get the >> PV devices right, not try to shoehorn them into some framework like >> PCI. >> > > If you don't care about full virtualization kvm is the wrong project for > you. You might want to take a look at lguest. > > This is incorrect. While kvm started out as a full virtualization project, it will expand with I/O PV and core PV. Eventually most of the paravirt_ops interface will have a kvm implementation. -- Do not meddle in the internals of kernels, for they are subtle and quick to panic. ------------------------------------------------------------------------- This SF.net email is sponsored by DB2 Express Download DB2 Express C - the FREE version of DB2 express and take control of your XML. No limits. 
[parent not found: <465430B2.7050101-atKUWr5tajBWk0Htik3J/w@public.gmane.org>]
* Re: [PATCH/RFC 7/9] Virtual network guest device driver [not found] ` <465430B2.7050101-atKUWr5tajBWk0Htik3J/w@public.gmane.org> @ 2007-05-23 12:20 ` Christoph Hellwig 0 siblings, 0 replies; 104+ messages in thread From: Christoph Hellwig @ 2007-05-23 12:20 UTC (permalink / raw) To: Avi Kivity Cc: Jimi Xenidis, jmk-zzFmDc4TPjtKvsKVC3L/VUEOCMrvLtNR@public.gmane.org, Christian Borntraeger, mschwid2-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8, Christoph Hellwig, Suzanne McIntosh, kvm-devel-5NWGOfrQmneRv+LV9MX5uipxlwaOVQ5f On Wed, May 23, 2007 at 03:16:50PM +0300, Avi Kivity wrote: > Christoph Hellwig wrote: > > On Tue, May 22, 2007 at 10:00:42AM -0700, ron minnich wrote: > > > >> On 5/22/07, Eric Van Hensbergen <ericvh-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org> wrote: > >> > >> > >>> I'm not opposed to supporting emulation environments, just don't make > >>> a large pile of crap the default like Xen -- and having to integrate > >>> PCI probing code in my guest domains is a large pile of crap. > >>> > >> Exactly. I'm about to start a pretty large project here, using xen or > >> kvm, not sure. One thing for sure, we are NOT going to use anything > >> but PV devices. Full emulation is nice, but it's just plain silly if > >> you don't have to do it. And we don't have to do it. So let's get the > >> PV devices right, not try to shoehorn them into some framework like > >> PCI. > >> > > > > If you don't care about full virtualization kvm is the wrong project for > > you. You might want to take a look at lguest. > > > > > > This is incorrect. While kvm started out as a full virtualization > project, it will expand with I/O PV and core PV. Eventually most of the > paravirt_ops interface will have a kvm implementation. The statement above was a little misworded I think. It should have been a "if you care about pure PV ..." 
* Re: [PATCH/RFC 7/9] Virtual network guest device driver [not found] ` <13426df10705221000i749badc5h8afe4f2fc95bc2ce-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org> 2007-05-22 17:06 ` Christoph Hellwig @ 2007-05-23 12:20 ` Avi Kivity 1 sibling, 0 replies; 104+ messages in thread From: Avi Kivity @ 2007-05-23 12:20 UTC (permalink / raw) To: ron minnich Cc: Jimi Xenidis, jmk-zzFmDc4TPjtKvsKVC3L/VUEOCMrvLtNR@public.gmane.org, mschwid2-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8, Christoph Hellwig, Christian Borntraeger, Suzanne McIntosh, kvm-devel-5NWGOfrQmneRv+LV9MX5uipxlwaOVQ5f ron minnich wrote: > On 5/22/07, Eric Van Hensbergen <ericvh-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org> wrote: > > >> I'm not opposed to supporting emulation environments, just don't make >> a large pile of crap the default like Xen -- and having to integrate >> PCI probing code in my guest domains is a large pile of crap. >> > > Exactly. I'm about to start a pretty large project here, using xen or > kvm, not sure. One thing for sure, we are NOT going to use anything > but PV devices. Full emulation is nice, but it's just plain silly if > you don't have to do it. And we don't have to do it. So let's get the > PV devices right, not try to shoehorn them into some framework like > PCI. > > What happens to these schemes if I want to try, e.g., 2^16 PV devices? > Or some other crazy thing that doesn't play well with PCI -- simple > example -- I want a 256 GB region of memory for a device. PCI rules > require me to align it on 256GB boundaries and it must be contiguous > address space. This is a hardware rule, done for hardware reasons, and > has no place in the PV world. What if I want a bit more than the basic > set of BARs that PCI gives me? Why would we apply such rules to a PV? > Why limit ourselves this early in the game? > > Device discovery and device operation are separate. 
Closed operating systems and older Linuces will need pci as a way to have easy plug'n'play discovery with no modifications to the kernel. Virtualization-friendly systems like newer Linux and s390 can have a virtual bus for discovery. -- Do not meddle in the internals of kernels, for they are subtle and quick to panic.
* Re: [PATCH/RFC 7/9] Virtual network guest device driver [not found] ` <a4e6962a0705220750s5abe380dg8dd8e7d0b84de7cd-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org> 2007-05-22 15:05 ` Anthony Liguori @ 2007-05-23 11:55 ` Avi Kivity 1 sibling, 0 replies; 104+ messages in thread From: Avi Kivity @ 2007-05-23 11:55 UTC (permalink / raw) To: Eric Van Hensbergen Cc: Jimi Xenidis, jmk-zzFmDc4TPjtKvsKVC3L/VUEOCMrvLtNR@public.gmane.org, Christian Borntraeger, mschwid2-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8, Christoph Hellwig, kvm-devel-5NWGOfrQmneRv+LV9MX5uipxlwaOVQ5f, Suzanne McIntosh Eric Van Hensbergen wrote: > On 5/22/07, Christoph Hellwig <hch-wEGCiKHe2LqWVfeAwA7xHQ@public.gmane.org> wrote: >> > >> > I didn't think we were talking about the general case, I thought we >> > were discussing the PV case. >> > >> >> In case of KVM no one is speaking of pure PV. >> > > Why not? It seems worthwhile to come up with something that can cover > the whole spectrum instead of having different hypervisors (and > interfaces). > That's the plan. PV I/O and PV mmu are on the roadmap. PV timers and interrupts should be easily doable too. The far end of the spectrum (PV with no hardware virtualization extensions) is possible, but no one is planning to do it AFAIK. -- Do not meddle in the internals of kernels, for they are subtle and quick to panic. ------------------------------------------------------------------------- This SF.net email is sponsored by DB2 Express Download DB2 Express C - the FREE version of DB2 express and take control of your XML. No limits. Just data. Click to get it now. http://sourceforge.net/powerbar/db2/ ^ permalink raw reply [flat|nested] 104+ messages in thread
* Re: [PATCH/RFC 7/9] Virtual network guest device driver [not found] ` <a4e6962a0705220549j1c9565f2ic160c672b74aea35-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org> 2007-05-22 12:56 ` Christoph Hellwig @ 2007-05-22 13:08 ` Anthony Liguori 1 sibling, 0 replies; 104+ messages in thread From: Anthony Liguori @ 2007-05-22 13:08 UTC (permalink / raw) To: Eric Van Hensbergen Cc: Jimi Xenidis, jmk-zzFmDc4TPjtKvsKVC3L/VUEOCMrvLtNR@public.gmane.org, Christian Borntraeger, mschwid2-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8, kvm-devel-5NWGOfrQmneRv+LV9MX5uipxlwaOVQ5f, Suzanne McIntosh Eric Van Hensbergen wrote: > On 5/22/07, Avi Kivity <avi-atKUWr5tajBWk0Htik3J/w@public.gmane.org> wrote: >> Anthony Liguori wrote: >> >> >> >> In a PV environment why not just pass an initial cookie/hash/whatever >> >> as a command-line argument/register/memory-space to the underlying >> >> kernel? >> >> >> > >> > You can't pass a command line argument to Windows (at least, not >> easily >> > AFAIK). You could get away with an MSR/CPUID flag but then you're >> > relying on uniqueness which isn't guaranteed. >> > >> >> In the general case, you can't pass a command line argument to Linux >> either. kvm doesn't boot Linux; it boots the bios, which boots the boot >> sector, which boots grub, which boots Linux. Relying on the user to >> edit the command line in grub is wrong. >> > > I didn't think we were talking about the general case, I thought we > were discussing the PV case. It is still useful to use PV drivers with full virtualization so it's something that ought to be considered. Regards, Anthony Liguori > In the PV case, having bios/bootloader > is unnecessary overhead. To that same end, I don't see Windows in the > PV case unless they magically want to to coordinate PV standards with > us, in which case we certainly can negotiate a more sane discovery > mechanism. 
> > -eric
* Re: [PATCH/RFC 7/9] Virtual network guest device driver [not found] ` <464B3F20.4030904-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org> ` (2 preceding siblings ...) 2007-05-16 17:45 ` Gregory Haskins @ 2007-05-18 5:31 ` ron minnich [not found] ` <13426df10705172231y5e93d1f5y398d4f187a8978e1-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org> 3 siblings, 1 reply; 104+ messages in thread From: ron minnich @ 2007-05-18 5:31 UTC (permalink / raw) To: Anthony Liguori Cc: Jimi Xenidis, jmk-zzFmDc4TPjtKvsKVC3L/VUEOCMrvLtNR@public.gmane.org, Christian Borntraeger, Martin Schwidefsky, kvm-devel-5NWGOfrQmneRv+LV9MX5uipxlwaOVQ5f@public.gmane.org On 5/16/07, Anthony Liguori <aliguori-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org> wrote: > What do you think about a socket interface? I'm not sure how discovery > would work yet, but there are a few PV socket implementations for Xen at > the moment. Hi Anthony, I still feel that "how about a socket interface" is focused on the "how to implement", and not "what the interface should be". I am also not sure the socket system call interface is quite what we want, although it's a neat idea. It's also not that portable outside the "everything is a Linux variant" world. So how about this as an interface design. The communications channels are visible in our name space at a mountpoint of our choice. Let's call this mount point, for sake of argument, vmic. When we mount on vmic, we see one file:
/vmic/clone
When we open and read /vmic/clone, we get a number; let's pretend for this example we get '0'. The numbers are not important, except to distinguish connections. Opening the clone file gets us a connection endpoint. An ls of the directory now shows this:
/vmic/clone
/vmic/0
The "directory", and the "files" in it, are owned by me, mode 700 or 600 or 400 as the file requires. The mode can be changed, of course, if I wish to allow wider access to the channel. Here, already, we see some advantage to the use of the file system for this type of capability.
What is in the directory? Here is one proposal.
/vmic/0/data
/vmic/0/status
/vmic/0/ctl
/vmic/0/local
/vmic/0/remote
What can we do with this? Data is pretty obvious: we can read it or write it, and that data is received/sent from the other endpoint. Note that I'm not saying how the data flows: it can be done in whatever manner is most efficient, by the kernel, including zero copy. It can be different for many reasons, but the point is that the interface is basically unchanging. Of course, it is an error to read or write data until something at the other end connects to the local end! What is status? We cat it and it gets us status in some meaningful text string. E.g.:
cat /vmic/0/status
connected /domain/name
What is local? It's our local name for the resource in this domain. What is remote? It's the name of the other endpoint. What's a name look like? I'm thinking it might look like /domain/name, but that is just a guess ... What is ctl? Here is where the fun begins. We might do things such as
echo bind somename > /vmic/0/ctl
This names the vmic. We might want to wait for a connection:
echo listen 1> /vmic/0/ctl
We might want to restrict it somehow:
echo key somekey > /vmic/0/ctl
echo listendomain domainnumber > /vmic/0/ctl
or we might know there is something out there:
echo connect /domainname/somename > /vmic/0/ctl
Once it is connected, we can move data. This is similar to your socket idea, but consider that:
o to see active vmics, I use 'ls'
o I don't have to create a new sockaddr address type
o I can control access with chmod
o I am separating the interface from the implementation
o This is, of course, not really 'files', but in-memory data structures; this can (and will) be fast
o No binary data structures. For different domains, even on the same machine, alignment rules etc. are not always the same -- I hit this when I ported Plan 9 to Xen, esp. back when Xen relied so heavily on gcc tricks such as __align__ and packed.
Using character strings eliminates that problem. This is, I think, the kind of thing Eric would also like to see, but he can correct me.
Thanks
ron
* Re: [PATCH/RFC 7/9] Virtual network guest device driver [not found] ` <13426df10705172231y5e93d1f5y398d4f187a8978e1-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org> @ 2007-05-18 14:31 ` Anthony Liguori [not found] ` <464DB8A5.6080503-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org> 0 siblings, 1 reply; 104+ messages in thread From: Anthony Liguori @ 2007-05-18 14:31 UTC (permalink / raw) To: ron minnich Cc: Jimi Xenidis, jmk-zzFmDc4TPjtKvsKVC3L/VUEOCMrvLtNR@public.gmane.org, Christian Borntraeger, Martin Schwidefsky, kvm-devel-5NWGOfrQmneRv+LV9MX5uipxlwaOVQ5f@public.gmane.org ron minnich wrote: > Hi Anthony, > > I still feel that "how about a socket interface" is still focused on > the "how to implement", and not "what the interface should be". Right. I'm not trying to answer that question ATM. There are a number of paravirt devices that would be useful in a virtual setting. For instance, a PV device for providing the guest with entropy and a shared PV clipboard. These devices should be simple but all current communication mechanisms are far too complicated. > I also > am not sure the socket system call interface is quite what we want, > although it's a neat idea. It's also not that portable outside the > "everything is a Linux variant" world. A filesystem interface certainly isn't very portable outside the POSIX world :-) > Once it is connected, we can move data. > > This is similar to your socket idea, but consider that: > o to see active vmics, I use 'ls' > o I don't have to create a new sockaddr address type > o I can control access with chmod > o I am seperating the interface from the implementation > o This is, of course, not really 'files', but in-memory data > structures; this can > (and will) be fast > o No binary data structures. > For different domains, even on the same machine, alignment rules etc. > are not > always the same -- I hit this when I ported Plan 9 to Xen, esp. back > when Xen > relied so heavily on gcc tricks such as __align__ and packed. 
> Using character strings eliminates that problem. The interface you're proposing is almost functionally identical to a socket. In fact, once you open /data you've got an fd that you interact with in the same way as you would interact with a socket. It's not that there's a unique value for this sort of interface in virtualization; I don't think you're making that argument. Instead, you're making a general argument as to why this way of doing things is better than what Unix has been doing forever (with things like sockets). That's fine, I think you have a valid point, but that's a larger argument to have on LKML or at a conference. This isn't the place to shoe-horn this sort of thing. A socket interface would provide a simple, well-understood interface that few people in the Linux community would disagree with (it's already there for s390). It should also be easy enough to stream 9p over the socket so you can build these interfaces easily and continue your attempts to expose the world as a virtual filesystem :-) Regards, Anthony Liguori > This is, I think, the kind of thing Eric would also like to see, but > he can correct me. > Thanks > > ron
* Re: [PATCH/RFC 7/9] Virtual network guest device driver [not found] ` <464DB8A5.6080503-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org> @ 2007-05-18 15:14 ` ron minnich 0 siblings, 0 replies; 104+ messages in thread From: ron minnich @ 2007-05-18 15:14 UTC (permalink / raw) To: Anthony Liguori Cc: Jimi Xenidis, jmk-zzFmDc4TPjtKvsKVC3L/VUEOCMrvLtNR@public.gmane.org, Christian Borntraeger, Martin Schwidefsky, kvm-devel-5NWGOfrQmneRv+LV9MX5uipxlwaOVQ5f@public.gmane.org On 5/18/07, Anthony Liguori <aliguori-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org> wrote: > > > I also > > am not sure the socket system call interface is quite what we want, > > although it's a neat idea. It's also not that portable outside the > > "everything is a Linux variant" world. > > A filesystem interface certainly isn't very portable outside the POSIX > world :-) Actually, it's probably the most portable thing you can have. > The interface you're proposing is almost functionally identical to a > socket. In fact, once you open /data you've got an fd that you interact > with in the same way as you would interact with a socket. Well, sure, I stole the interface from Plan 9, and they use this interface to do sockets, among *many* other things -- and there's the point. The interface is not just sockets. But if you're used to sockets, it looks familiar. I only steal from the best :-) Note, btw, that the fd has a path, and can be examined easily, and also passed to other programs for use. That's messy and ugly with sockets. > > It's not that there's a unique value for this sort of interface in > virtualization; I don't think you're making that argument. Instead, > you're making a general argument as to why this way of doing things is > better than what Unix has been doing forever (with things like > sockets) Yes, Unix has been "doing it this way" forever. The interface I am proposing was the one designed by the Unix guys -- once they realized how deficient the Unix way of doing things had become.
But, forgetting all this argument, it still seems to me that the file system interface is far simpler than a socket interface. No binary structures. No new sockaddr structures needed. No alignment/padding rules. You can actually set up a link from a shell script, or perl, or python, or whatever, without a special set of bindings. > A socket interface would provide a simple, well-understood interface > that few people in the Linux community would disagree with (it's already > there for s390). Yes, but ... well understood to the Linux community. Can we look at a broader scope? We've got a golden opportunity here to build a really flexible VMIC interface. I would hate to lose it. Anyway, thanks for discussing this. ron
* Re: [PATCH/RFC 7/9] Virtual network guest device driver [not found] ` <4644CE15.6080505-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org> 2007-05-11 21:15 ` Eric Van Hensbergen @ 2007-05-11 21:51 ` ron minnich 1 sibling, 0 replies; 104+ messages in thread From: ron minnich @ 2007-05-11 21:51 UTC (permalink / raw) To: Anthony Liguori Cc: Jimi Xenidis, jmk-zzFmDc4TPjtKvsKVC3L/VUEOCMrvLtNR@public.gmane.org, Christian Borntraeger, kvm-devel-5NWGOfrQmneRv+LV9MX5uipxlwaOVQ5f@public.gmane.org, Martin Schwidefsky On 5/11/07, Anthony Liguori <aliguori-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org> wrote: > For low speed devices, I think paravirtualization doesn't make a lot of > sense unless it's absolutely required. I don't know enough about s390 > to know if it supports things like uarts but if so, then emulating a > uart would in my mind make a lot more sense than a PV console device. I don't see how. Paravirtualization is pretty trivial for a console. I think emulating hardware is always worth avoiding. A PV console driver is going to be much more flexible than a uart emulator. > This smells a bit like XenStore which I think most will agree was an > unmitigated disaster. No, not at all. Just because we represent resources as directories and files, does that imply or require xenstore? Is /proc a xenstore entity? Is /sys? Not at all. These resources, which are represented over 9p as files and directories, are simply a representation of kernel data structures. I think you are jumping ahead too far, because that's not what I'm talking about. What I'm trying to propose is that the kvm host use a standard model for paravirt resources, and, since we've had 20 years of very good luck on Plan 9 using 9p and a directory/file model for all resources, including devices, I am hoping we can use that for the way that kvm communicates with its guests about devices. Consider /proc. It works. It's not a thing on disk, or a python glob like xenstore.
There are not even really tree-like data structures in /proc. The proc outputs are generated on demand as programs do operations on files in the /proc file system. This idea is similar -- not the same code, or implementation technique, but similar. Our proposal (it was Eric's idea, really, and he has in fact shown it in practice on IBM hypervisors) is that we define a standard memory channel for comms, as in Eric's paper; we define a standard request/response protocol to run over that channel, i.e. 9p, again, as in Eric's paper(s); and then, what you layer over it is up to the provider of the resource. This gives us one interface, and it can be efficient. Again, in this way, we get a common interface to diverse resources. This is a basic technique in computer science, and I was sorry to see Xen ignore it. Eric and I tried to get the Xen team to look at this, but they were too far along with their myriad interfaces, and it was too late to change. It's not too late for KVM. I am hoping we can use this model on KVM, before we have a whole pile of totally different interfaces to different PV devices. > This sort of thing gets terribly complicated to > deal with in the corner cases. Atomic operation of multiple read/write > operations is difficult to express. Moreover, quite a lot of things are > naturally expressed as a state machine which is not straight forward to > do in this sort of model. This may have been all figured out in 9P but > it's certainly not a simple thing to get right. We have the QED. It's called Plan 9. Then we have the second QED. It's called Inferno. They are each a reliable, simple, industrial-strength kernel running in a router near you. I accept it is hard to get right. I think you'd have to accept that it can, and in fact has, been gotten right for quite some time -- 20 years in the case of Plan 9. 
> I think a general rule of thumb for a virtualized environment is that > the closer you stick to the way hardware tends to do things, You mean like level interrupt emulation in Xen? That was easy? Or not screwed up? It was one of the messiest things I had to deal with in the Plan 9 port to Xen. And it made no sense, whatsoever, to have a level interrupt emulation. Except, of course, that the edge interrupts were even less fun :-) I believe that PV can buy us a very clean interface if done right. Emulating hardware is easy for the simple bits, and very hard to get perfect for the messy bits. Do we really want to emulate a 10G PHY, for example? > Implementing a full 9P client just > to get console access in something like mini-os would be unfortunate. 9p clients are trivial. newsham's 9p python client is a whopping 352 lines, 20 of them comments. A 9p client is far less code than the sum of the Linux uart code. > At least the posted s390 console driver behaves roughly like a uart so > it's pretty obvious that it will be easy to implement in any OS that > supports uarts already. Including all the fifo bugs? Because to really emulate hardware, to match a driver, you have to correctly emulate the *bugs*, not just the spec. That's where the fun begins. I think KVM has a great opportunity here to do a better job than Xen did with devices. So, I'll keep arguing and see if I can convince you :-) thanks ron
* Re: [PATCH/RFC 7/9] Virtual network guest device driver [not found] ` <13426df10705111244w1578ebedy8259bc42ca1f588d-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org> 2007-05-11 20:12 ` Anthony Liguori @ 2007-05-12 8:46 ` Carsten Otte [not found] ` <46457EF9.2070706-tA70FqPdS9bQT0dZR+AlfA@public.gmane.org> 2007-05-14 12:05 ` Avi Kivity 2 siblings, 1 reply; 104+ messages in thread From: Carsten Otte @ 2007-05-12 8:46 UTC (permalink / raw) To: ron minnich Cc: Jimi Xenidis, jmk-zzFmDc4TPjtKvsKVC3L/VUEOCMrvLtNR@public.gmane.org, kvm-devel-5NWGOfrQmneRv+LV9MX5uipxlwaOVQ5f@public.gmane.org, mschwid2-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8, Christian Borntraeger ron minnich wrote: > Let me ask what may seem to be a naive question to the linux world. I > see you are doing a lot off solid work on adding block and network > devices. The code for block and network devices > is implemented in different ways. I've also seen this difference of > inerface/implementation on Xen. Actually, the difference derives from the fact that block and network are indeed different: - block submits requests that ask the host to transfer from/to preallocated guest data buffers via dma (request driven) - net transmits packets that should end up in an skb on the remote side (two way, push driven) - net is sensitive to round-trip times, block is not due to the device plug for request merging We tried different access methods for both block and network. We have selected the current communication mechanics after doing performance measurements. I believe for a portable solution we need to develop a set of primitives for sending signals (read: interrupts) back and forth, for copying data to guest memory, and for establishing shared memory between guests and between guest+host. These primitives need to be implemented for each platform, and paravirtual drivers should build on top of that. At this point in time, we are aware that these device drivers don't do what we'd want for a portable solution. 
We'll focus on getting the kernel interfaces to sie/vt/svm proper and portable first.
so long,
Carsten
* Re: [PATCH/RFC 7/9] Virtual network guest device driver [not found] ` <46457EF9.2070706-tA70FqPdS9bQT0dZR+AlfA@public.gmane.org> @ 2007-05-13 12:04 ` Dor Laor [not found] ` <64F9B87B6B770947A9F8391472E032160BC74612-yEcIvxbTEBqsx+V+t5oei8rau4O3wl8o3fe8/T/H7NteoWH0uzbU5w@public.gmane.org> 0 siblings, 1 reply; 104+ messages in thread From: Dor Laor @ 2007-05-13 12:04 UTC (permalink / raw) To: carsteno-tA70FqPdS9bQT0dZR+AlfA, ron minnich Cc: Jimi Xenidis, kvm-devel-5NWGOfrQmneRv+LV9MX5uipxlwaOVQ5f, Christian Borntraeger, mschwid2-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8, jmk-zzFmDc4TPjtKvsKVC3L/VUEOCMrvLtNR >ron minnich wrote: >> Let me ask what may seem to be a naive question to the linux world. I >> see you are doing a lot off solid work on adding block and network >> devices. The code for block and network devices >> is implemented in different ways. I've also seen this difference of >> inerface/implementation on Xen. >Actually, the difference derives from the fact that block and network >are indeed different: >- block submits requests that ask the host to transfer from/to >preallocated guest data buffers via dma (request driven) >- net transmits packets that should end up in an skb on the remote >side (two way, push driven) >- net is sensitive to round-trip times, block is not due to the device >plug for request merging > >We tried different access methods for both block and network. We have >selected the current communication mechanics after doing performance >measurements. >I believe for a portable solution we need to develop a set of >primitives for sending signals (read: interrupts) back and forth, for >copying data to guest memory, and for establishing shared memory >between guests and between guest+host. These primitives need to be >implemented for each platform, and paravirtual drivers should build on >top of that. >At this point in time, we are aware that these device drivers don't do >what we'd want for a portable solution. 
>We'll focus on getting the kernel interfaces to sie/vt/svm proper and portable first.
>
>so long,
>Carsten

Based on the previous discussion and the s390 PV drivers I have more gasoline to pour on the flames: We have a working PV driver with 1Gbit performance. The reasons we don't push it into the kernel are:
a. We should perform much better
b. It would be a painful task getting all the code review that a complicated network interface should get.
c. There's already a PV driver that answers a and b.
Xen's PV network driver is now being pushed into the kernel. It is optimized and supports TSO. By adding generic ops calls we can all enjoy the above. Using Xen's core PV code doesn't imply that we will have their interface {xenstore}; the interface creation and tear-down would be kvm specific. They could even have a plain directory structure.
* Re: [PATCH/RFC 7/9] Virtual network guest device driver [not found] ` <64F9B87B6B770947A9F8391472E032160BC74612-yEcIvxbTEBqsx+V+t5oei8rau4O3wl8o3fe8/T/H7NteoWH0uzbU5w@public.gmane.org> @ 2007-05-13 14:49 ` Anthony Liguori [not found] ` <4647257F.4020900-rdkfGonbjUSkNkDKm+mE6A@public.gmane.org> 0 siblings, 1 reply; 104+ messages in thread From: Anthony Liguori @ 2007-05-13 14:49 UTC (permalink / raw) To: Dor Laor Cc: Jimi Xenidis, Christian Borntraeger, jmk-zzFmDc4TPjtKvsKVC3L/VUEOCMrvLtNR, carsteno-tA70FqPdS9bQT0dZR+AlfA, mschwid2-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8, kvm-devel-5NWGOfrQmneRv+LV9MX5uipxlwaOVQ5f Dor Laor wrote: > push it into the kernel are: > a. We should perform much better > b. It would be a painful task getting all the code review that a > > complicated network interface should get. > c. There's already a PV driver that answers a,b. > The Xen's PV network driver is now pushed into the kernel. > Actually, it's not (at least not as of a few moments ago). Furthermore, the plan is to completely rearchitect the netback/netfront protocol for the next Xen release (this effort is referred to netchannel2). See some of the XenSummit slides as to why this is necessary. Regards, Anthony Liguori > It is optimized, and support tso. > By adding a generic ops calls we can make enjoy all the above. > > Using Xen's core PV code doesn't imply that we will have their interface > {xenstore} the interface creation and tear-down would be kvm specific. > They could even have a plain directory structure. > > ------------------------------------------------------------------------- > This SF.net email is sponsored by DB2 Express > Download DB2 Express C - the FREE version of DB2 express and take > control of your XML. No limits. Just data. Click to get it now. 
* Re: [PATCH/RFC 7/9] Virtual network guest device driver [not found] ` <4647257F.4020900-rdkfGonbjUSkNkDKm+mE6A@public.gmane.org> @ 2007-05-13 16:23 ` Dor Laor [not found] ` <64F9B87B6B770947A9F8391472E032160BC74675-yEcIvxbTEBqsx+V+t5oei8rau4O3wl8o3fe8/T/H7NteoWH0uzbU5w@public.gmane.org> 0 siblings, 1 reply; 104+ messages in thread From: Dor Laor @ 2007-05-13 16:23 UTC (permalink / raw) To: Anthony Liguori Cc: Jimi Xenidis, Christian Borntraeger, jmk-zzFmDc4TPjtKvsKVC3L/VUEOCMrvLtNR, carsteno-tA70FqPdS9bQT0dZR+AlfA, mschwid2-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8, kvm-devel-5NWGOfrQmneRv+LV9MX5uipxlwaOVQ5f >Dor Laor wrote: >> push it into the kernel are: >> a. We should perform much better >> b. It would be a painful task getting all the code review that a >> complicated network interface should get. >> c. There's already a PV driver that answers a,b. >> The Xen's PV network driver is now pushed into the kernel. > >Actually, it's not (at least not as of a few moments ago). Furthermore, >the plan is to completely rearchitect the netback/netfront protocol for >the next Xen release (this effort is referred to netchannel2). But isn't Jeremy Fitzhardinge pushing a big patch queue into the kernel? If we manage to plant net_ops hooks into netback/netfront and the code gets into the kernel, they will have to keep the hooks for netchannel2. > >See some of the XenSummit slides as to why this is necessary. It looks like generalizing all the level 0,1,2 features plus performance optimizations. It's not something we couldn't upgrade to. >Regards, > >Anthony Liguori > >> It is optimized, and supports tso. >> By adding generic ops calls we can all enjoy the above. >> >> Using Xen's core PV code doesn't imply that we will have their interface >> {xenstore}; the interface creation and tear-down would be kvm specific. >> They could even have a plain directory structure.
* Re: [PATCH/RFC 7/9] Virtual network guest device driver [not found] ` <64F9B87B6B770947A9F8391472E032160BC74675-yEcIvxbTEBqsx+V+t5oei8rau4O3wl8o3fe8/T/H7NteoWH0uzbU5w@public.gmane.org> @ 2007-05-13 16:49 ` Anthony Liguori [not found] ` <4647418A.2040201-rdkfGonbjUSkNkDKm+mE6A@public.gmane.org> 0 siblings, 1 reply; 104+ messages in thread From: Anthony Liguori @ 2007-05-13 16:49 UTC (permalink / raw) To: Dor Laor Cc: Jimi Xenidis, Christian Borntraeger, jmk-zzFmDc4TPjtKvsKVC3L/VUEOCMrvLtNR, carsteno-tA70FqPdS9bQT0dZR+AlfA, mschwid2-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8, kvm-devel-5NWGOfrQmneRv+LV9MX5uipxlwaOVQ5f Dor Laor wrote: > Furthermore, > >> the plan is to completely rearchitect the netback/netfront protocol for >> the next Xen release (this effort is referred to netchannel2). >> > > But isn't Jeremy Fitzhardinge pushing a big patch queue into the > kernel? > Yes, but it's not in the kernel yet and there's no guarantee it'll get there in time for KVM's consumption. > If we manage to plant net_ops hooks into netback/netfront and the code > gets into the kernel, they will have to keep the hooks for netchannel2. > > >> See some of the XenSummit slides as to why this is necessary. >> > > It looks like generalizing all the level 0,1,2 features plus > performance optimizations. It's not something we couldn't upgrade to. > I'm curious what Rusty thinks as I do not know nearly enough about the networking subsystem to make an educated statement here. Would it be better to just try and generalize netback/netfront or build something from scratch? Could the lguest driver be generalized more easily? Regards, Anthony Liguori >> Regards, >> >> Anthony Liguori >> >> >>> It is optimized, and supports tso. >>> By adding generic ops calls we can all enjoy the above. >>> >>> Using Xen's core PV code doesn't imply that we will have their >>> interface {xenstore}; the interface creation and tear-down would be >>> kvm specific.
>>> They could even have a plain directory structure.
* Re: [PATCH/RFC 7/9] Virtual network guest device driver [not found] ` <4647418A.2040201-rdkfGonbjUSkNkDKm+mE6A@public.gmane.org> @ 2007-05-13 17:06 ` Muli Ben-Yehuda [not found] ` <20070513170608.GA4343-WD1JZD8MxeCTrf4lBMg6DdBPR1lH4CV8@public.gmane.org> 2007-05-14 2:39 ` Rusty Russell 2007-05-14 11:53 ` Avi Kivity 2 siblings, 1 reply; 104+ messages in thread From: Muli Ben-Yehuda @ 2007-05-13 17:06 UTC (permalink / raw) To: Anthony Liguori Cc: Jimi Xenidis, jmk-zzFmDc4TPjtKvsKVC3L/VUEOCMrvLtNR, Christian Borntraeger, mschwid2-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8, carsteno-tA70FqPdS9bQT0dZR+AlfA, kvm-devel-5NWGOfrQmneRv+LV9MX5uipxlwaOVQ5f On Sun, May 13, 2007 at 11:49:14AM -0500, Anthony Liguori wrote: > Dor Laor wrote: > > Furthermore, > > > >> the plan is to completely rearchitect the netback/netfront > >> protocol for the next Xen release (this effort is referred to > >> netchannel2). > >> > > > > But isn't Jeremy Fitzhardinge pushing a big patch queue into the > > kernel? > > > > Yes, but it's not in the kernel yet and there's no guarantee it'll > get there in time for KVM's consumption. On the other hand, there's strong interest in having unified virtual drivers. Given that the Xen drivers are out there, have been submitted and have been reasonably optimized, there will be some resistance to putting in "yet another" set of PV drivers. Also, the contentious merge point as I understand it is xenbus needing review, rather than the drivers themselves which are in pretty good shape. Cheers, Muli
* Re: [PATCH/RFC 7/9] Virtual network guest device driver
  [not found] ` <20070513170608.GA4343-WD1JZD8MxeCTrf4lBMg6DdBPR1lH4CV8@public.gmane.org>
@ 2007-05-13 20:31 ` Dor Laor
  0 siblings, 0 replies; 104+ messages in thread
From: Dor Laor @ 2007-05-13 20:31 UTC (permalink / raw)
To: Muli Ben-Yehuda, Anthony Liguori
Cc: Jimi Xenidis, jmk-zzFmDc4TPjtKvsKVC3L/VUEOCMrvLtNR,
    Christian Borntraeger, mschwid2-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8,
    carsteno-tA70FqPdS9bQT0dZR+AlfA, kvm-devel-5NWGOfrQmneRv+LV9MX5uipxlwaOVQ5f

> On Sun, May 13, 2007 at 11:49:14AM -0500, Anthony Liguori wrote:
> > Dor Laor wrote:
> > > Furthermore,
> > >
> > >> the plan is to completely rearchitect the netback/netfront
> > >> protocol for the next Xen release (this effort is referred to as
> > >> netchannel2).
> > >
> > > But isn't Jeremy Fitzhardinge pushing a big patch queue into the
> > > kernel?
> >
> > Yes, but it's not in the kernel yet and there's no guarantee it'll
> > get there in time for KVM's consumption.
>
> On the other hand, there's strong interest in having unified virtual
> drivers. Given that the Xen drivers are out there, have been submitted,
> and have been reasonably optimized, there will be some resistance to
> putting in "yet another" set of PV drivers. Also, the contentious merge
> point as I understand it is xenbus needing review, rather than the
> drivers themselves, which are in pretty good shape.

Moreover, it's not that it is too complex to write a set of back/front
ends; it's just that it's already written and optimized down to the bit.
Our current implementation has all the regular bells and whistles
(rings, delayed notifications, NAPI). It is simpler than Xen's, but it
lacks further optimizations and TSO/scatter-gather. If we ever use
NetChannel2 we should enjoy smart NIC features too. It's more tempting
and fun to continue supporting our own implementation, but it's more
correct to reuse code.

Nevertheless, we'll be happy to hear and discuss what others are
thinking. If the current Xen code fails to make it into the kernel,
then it would be even easier for us - we'll just rip out all the Xen
wrapping; the grant tables and the flipping would go away, leaving
clean, optimized network code.

Regards, Dor.
* Re: [PATCH/RFC 7/9] Virtual network guest device driver
  [not found] ` <4647418A.2040201-rdkfGonbjUSkNkDKm+mE6A@public.gmane.org>
  2007-05-13 17:06 ` Muli Ben-Yehuda
@ 2007-05-14 2:39 ` Rusty Russell
  2007-05-14 11:53 ` Avi Kivity
  2 siblings, 0 replies; 104+ messages in thread
From: Rusty Russell @ 2007-05-14 2:39 UTC (permalink / raw)
To: Anthony Liguori
Cc: Jimi Xenidis, Christian Borntraeger, jmk-zzFmDc4TPjtKvsKVC3L/VUEOCMrvLtNR,
    Herbert Xu, carsteno-tA70FqPdS9bQT0dZR+AlfA,
    mschwid2-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8, kvm-devel-5NWGOfrQmneRv+LV9MX5uipxlwaOVQ5f

On Sun, 2007-05-13 at 11:49 -0500, Anthony Liguori wrote:
> Dor Laor wrote:
> > Furthermore,
> >
> >> the plan is to completely rearchitect the netback/netfront protocol for
> >> the next Xen release (this effort is referred to as netchannel2).
> > It looks like generalizing all the level 0,1,2 features plus
> > performance optimizations. It's not something we couldn't upgrade to.
>
> I'm curious what Rusty thinks as I do not know nearly enough about the
> networking subsystem to make an educated statement here. Would it be
> better to just try and generalize netback/netfront or build something
> from scratch? Could the lguest driver be generalized more easily?

In turn, I'm curious as to Herbert's opinions on this. The lguest net
driver has only two features: it's small, and it does multi-way
inter-guest networking as well as guest<->host. It's not clear how much
the latter wins in real life over a point-to-point comms system.

My interest is in a common low-level transport. My experience is that
it's easy to create an efficient comms channel between a guest and host
(i.e. one side can access the other's memory), but it's worthwhile
trying for a model which transparently allows untrusted comms (i.e.
hypervisor-assisted access to the other guest's memory). That's easier
if you only want point-to-point (see lguest's io.c for a more general
solution).

Cheers, Rusty.
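[Editor's note: Rusty's "common low-level transport" is, at bottom, a shared-memory descriptor ring that one side produces into and the other consumes from. The following is a hypothetical sketch of such a single-producer/single-consumer ring with invented names — it is not the lguest or Xen code, and it inlines the payload instead of referencing it by guest-physical address:]

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Fixed-size descriptor ring: the producer writes at `head`, the
 * consumer reads at `tail`. Both indices are free-running and only
 * masked on access, so full/empty are unambiguous and no lock is
 * needed for a single producer and a single consumer. */
#define RING_LEN 8                 /* must be a power of two */

struct ring {
	uint32_t head;             /* next slot the producer fills  */
	uint32_t tail;             /* next slot the consumer drains */
	uint32_t len[RING_LEN];    /* descriptor: payload length    */
	char     buf[RING_LEN][64];/* descriptor: inline payload    */
};

static int ring_full(const struct ring *r)  { return r->head - r->tail == RING_LEN; }
static int ring_empty(const struct ring *r) { return r->head == r->tail; }

static int ring_put(struct ring *r, const char *data, uint32_t n)
{
	if (ring_full(r) || n > sizeof r->buf[0])
		return -1;
	uint32_t slot = r->head & (RING_LEN - 1);
	memcpy(r->buf[slot], data, n);
	r->len[slot] = n;
	r->head++;	/* publish; a real transport needs a memory barrier here */
	return 0;
}

static int ring_get(struct ring *r, char *out)
{
	if (ring_empty(r))
		return -1;
	uint32_t slot = r->tail & (RING_LEN - 1);
	uint32_t n = r->len[slot];
	memcpy(out, r->buf[slot], n);
	r->tail++;	/* free the slot for the producer */
	return (int)n;
}
```

A real transport keeps the payload in separately shared (or hypervisor-granted) pages and raises an interrupt only on empty-to-non-empty transitions; the s2pmit/p2smit counters in patch 8/9 below appear to implement the same head/tail idea.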
* Re: [PATCH/RFC 7/9] Virtual network guest device driver
  [not found] ` <4647418A.2040201-rdkfGonbjUSkNkDKm+mE6A@public.gmane.org>
  2007-05-13 17:06 ` Muli Ben-Yehuda
  2007-05-14 2:39 ` Rusty Russell
@ 2007-05-14 11:53 ` Avi Kivity
  2 siblings, 0 replies; 104+ messages in thread
From: Avi Kivity @ 2007-05-14 11:53 UTC (permalink / raw)
To: Anthony Liguori
Cc: Jimi Xenidis, jmk-zzFmDc4TPjtKvsKVC3L/VUEOCMrvLtNR,
    Christian Borntraeger, mschwid2-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8,
    carsteno-tA70FqPdS9bQT0dZR+AlfA, kvm-devel-5NWGOfrQmneRv+LV9MX5uipxlwaOVQ5f

Anthony Liguori wrote:
> Dor Laor wrote:
> > Furthermore,
> >
> >> the plan is to completely rearchitect the netback/netfront protocol for
> >> the next Xen release (this effort is referred to as netchannel2).
> >
> > But isn't Jeremy Fitzhardinge pushing a big patch queue into the
> > kernel?
>
> Yes, but it's not in the kernel yet and there's no guarantee it'll get
> there in time for KVM's consumption.

I doubt we could add the missing features to kvmnet, test, optimize,
submit to netdev, apply comments, re-submit, re-write, update to the
latest netdev API, and fix all the bugs much faster.

--
error compiling committee.c: too many arguments to function
* Re: [PATCH/RFC 7/9] Virtual network guest device driver
  [not found] ` <13426df10705111244w1578ebedy8259bc42ca1f588d-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
  2007-05-11 20:12 ` Anthony Liguori
  2007-05-12 8:46 ` Carsten Otte
@ 2007-05-14 12:05 ` Avi Kivity
  [not found] ` <46485070.3000106-atKUWr5tajBWk0Htik3J/w@public.gmane.org>
  2 siblings, 1 reply; 104+ messages in thread
From: Avi Kivity @ 2007-05-14 12:05 UTC (permalink / raw)
To: ron minnich
Cc: Jimi Xenidis, jmk-zzFmDc4TPjtKvsKVC3L/VUEOCMrvLtNR@public.gmane.org,
    Christian Borntraeger, kvm-devel-5NWGOfrQmneRv+LV9MX5uipxlwaOVQ5f@public.gmane.org,
    Martin Schwidefsky

ron minnich wrote:
> We had hoped to get something like this into Xen. On Xen, for example,
> the block device and ethernet device interfaces are as different as
> one could imagine. Disk I/O does not steal pages from the guest. The
> network does. Disk I/O is in 4k chunks, period, with a bitmap
> describing which of the 8 512-byte subunits are being sent. The enet
> device, on read, returns a page with your packet, but also potentially
> containing bits of other domains' packets too. The interfaces are as
> dissimilar as they can be, and I see no reason for such a huge
> variance between what are basically read/write devices.

The reason for the variance is that hardware capabilities are very
different for disk and network. Block device requests are always
guest-initiated and sector-aligned, and often span many pages. On the
other hand, network packets are byte-aligned, and rx packets are
host-initiated, triggering the stolen-pages concept (which
unsurprisingly turned out not to be a win). Network has such esoteric
features as TSO. Block is very interested in actually getting things
onto the disk (barrier support). In short, "everything is a stream of
bytes" grossly oversimplifies things.

> Another issue is that kvm, in its current form (-24) is beautifully
> simple. These additions seem to detract from the beauty a bit. Might
> it be worth taking a little time to consider these ideas in order to
> preserve the basic elegance of KVM?

kvm? elegant and simple? it's basically a pile of special cases. But I
agree that the growing code base is a problem. With the block driver we
can probably keep the host side in userspace, but to do the same for
networking is much more work. I do think (now) that it is doable.

--
error compiling committee.c: too many arguments to function
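[Editor's note: the disk-versus-network contrast above can be made concrete by looking at the shape of the request descriptors each device class needs. These structs are purely illustrative — invented for this sketch and loosely modeled on the Xen blkif/netif layouts, not taken from the posted patches:]

```c
#include <assert.h>
#include <stdint.h>

/* Block: guest-initiated, sector-aligned, often spans many pages. */
struct blk_request {
	uint64_t sector;                    /* 512-byte sector number      */
	uint32_t nr_segments;               /* request may span many pages */
	struct {
		uint64_t page_addr;         /* page-aligned data segment   */
		uint8_t  first_sect;        /* 512-byte subunits actually  */
		uint8_t  last_sect;         /* used within that page       */
	} seg[11];
	uint8_t  write;                     /* plus barrier/flush flags    */
};

/* Network rx: host-initiated, byte-aligned, arbitrary length. */
struct net_rx_desc {
	uint64_t buf_addr;                  /* guest-posted buffer         */
	uint16_t len;                       /* byte-aligned packet length  */
	uint16_t flags;                     /* e.g. checksum/TSO hints     */
};

/* Sector numbers address 512-byte units, so every block request is
 * inherently 512-byte aligned -- unlike a network packet's length. */
static uint64_t blk_byte_offset(uint64_t sector)
{
	return sector * 512;
}
```

The asymmetry is visible immediately: the block descriptor cannot express a 61-byte transfer at an odd offset, and the net descriptor has no use for segment lists of whole pages — which is the argument against a single "stream of bytes" interface.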
* Re: [PATCH/RFC 7/9] Virtual network guest device driver
  [not found] ` <46485070.3000106-atKUWr5tajBWk0Htik3J/w@public.gmane.org>
@ 2007-05-14 12:24 ` Christian Bornträger
  [not found] ` <200705141424.44423.cborntra-tA70FqPdS9bQT0dZR+AlfA@public.gmane.org>
  2007-05-14 13:36 ` Carsten Otte
  1 sibling, 1 reply; 104+ messages in thread
From: Christian Bornträger @ 2007-05-14 12:24 UTC (permalink / raw)
To: kvm-devel-5NWGOfrQmneRv+LV9MX5uipxlwaOVQ5f
Cc: Jimi Xenidis, Carsten Otte, jmk-zzFmDc4TPjtKvsKVC3L/VUEOCMrvLtNR@public.gmane.org,
    Martin Schwidefsky

On Monday 14 May 2007 14:05, Avi Kivity wrote:
> But I agree that the growing code base is a problem. With the block
> driver we can probably keep the host side in userspace, but to do the
> same for networking is much more work. I do think (now) that it is doable.

Interesting. What kind of userspace networking do you have in mind?

One of the first tries from Carsten was to use tun/tap, which proved to
be slow performance-wise.

What I had in mind was some kind of switch in userspace. That would
allow non-root guests to define their own private networks. We could
use Linux's fast pipe implementation for guest-to-guest communication.

The question is how to connect userspace networks to the host ones:
- tun/tap is quite slow
- last time we checked, netfilter offered only IP hooks (if you don't
  use the bridging code)
- raw sockets get tricky if you do in/out at the same time, because
  you have to manually deal with loops

This reminds me that we actually have another party doing virtual
networking between guests: UML. User Mode Linux can do
networking/switching in userspace, but I cannot tell how well UML's
concept works out.

Christian
* Re: [PATCH/RFC 7/9] Virtual network guest device driver
  [not found] ` <200705141424.44423.cborntra-tA70FqPdS9bQT0dZR+AlfA@public.gmane.org>
@ 2007-05-14 12:32 ` Avi Kivity
  0 siblings, 0 replies; 104+ messages in thread
From: Avi Kivity @ 2007-05-14 12:32 UTC (permalink / raw)
To: Christian Bornträger
Cc: Jimi Xenidis, Carsten Otte, jmk-zzFmDc4TPjtKvsKVC3L/VUEOCMrvLtNR@public.gmane.org,
    kvm-devel-5NWGOfrQmneRv+LV9MX5uipxlwaOVQ5f, Martin Schwidefsky

Christian Bornträger wrote:
> On Monday 14 May 2007 14:05, Avi Kivity wrote:
> > But I agree that the growing code base is a problem. With the block
> > driver we can probably keep the host side in userspace, but to do the
> > same for networking is much more work. I do think (now) that it is doable.
>
> Interesting. What kind of userspace networking do you have in mind?
>
> One of the first tries from Carsten was to use tun/tap, which proved to
> be slow performance-wise.

tun/tap, but extended with:
- true aio
- aio with scatter/gather (IO_CMD_PWRITEV/IO_CMD_PREADV)
- qemu support for native Linux aio (not the glibc hackaround currently
  in place), so we get event coalescing and cheap multi-request
  submission
- tap support for tso

With these, we could conceivably reach speeds close to an in-kernel
driver. Unfortunately we'd only know after all the hard work was done.

> What I had in mind was some kind of switch in userspace. That would
> allow non-root guests to define their own private networks. We could
> use Linux's fast pipe implementation for guest-to-guest communication.
>
> The question is how to connect userspace networks to the host ones:
> - tun/tap is quite slow
> - last time we checked, netfilter offered only IP hooks (if you don't
>   use the bridging code)
> - raw sockets get tricky if you do in/out at the same time, because
>   you have to manually deal with loops

qemu has some support for this, see the '-net socket' option.

--
error compiling committee.c: too many arguments to function
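[Editor's note: the '-net socket' option mentioned above lets two qemu instances join their NICs over a TCP (or multicast UDP) socket, keeping the whole virtual network in userspace. A sketch of the invocation as documented for qemu of that era; disk image names and MAC addresses are placeholders:]

```shell
# Guest A: create a NIC and listen for a peer on TCP port 1234
qemu -hda guest-a.img -net nic,macaddr=52:54:00:12:34:56 \
     -net socket,listen=:1234

# Guest B: connect its NIC to guest A's socket; the two guests now
# share a virtual ethernet segment, entirely in userspace
qemu -hda guest-b.img -net nic,macaddr=52:54:00:12:34:57 \
     -net socket,connect=127.0.0.1:1234
```

A multicast variant (-net socket,mcast=230.0.0.1:1234) joins more than two instances to the same segment, which is roughly the "dumb switch" topology of patch 8/9 below without any kernel code.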
* Re: [PATCH/RFC 7/9] Virtual network guest device driver
  [not found] ` <46485070.3000106-atKUWr5tajBWk0Htik3J/w@public.gmane.org>
  2007-05-14 12:24 ` Christian Bornträger
@ 2007-05-14 13:36 ` Carsten Otte
  1 sibling, 0 replies; 104+ messages in thread
From: Carsten Otte @ 2007-05-14 13:36 UTC (permalink / raw)
To: Avi Kivity
Cc: Jimi Xenidis, jmk-zzFmDc4TPjtKvsKVC3L/VUEOCMrvLtNR@public.gmane.org,
    Christian Borntraeger, mschwid2-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8,
    kvm-devel-5NWGOfrQmneRv+LV9MX5uipxlwaOVQ5f@public.gmane.org

Avi Kivity wrote:
> But I agree that the growing code base is a problem. With the block
> driver we can probably keep the host side in userspace, but to do the
> same for networking is much more work. I do think (now) that it is doable.

I agree that networking needs to be handled in the host kernel. We go
out to userspace for signaling at this time, but that's simply broken.
All our userspace does is do a system call next.

so long,
Carsten
* [PATCH/RFC 8/9] Virtual network host switch support
  [not found] ` <1178903957.25135.13.camel-WIxn4w2hgUz3YA32ykw5MLlKpX0K8NHHQQ4Iyu8u01E@public.gmane.org>
  ` (5 preceding siblings ...)
  2007-05-11 17:36 ` [PATCH/RFC 7/9] Virtual network guest " Carsten Otte
@ 2007-05-11 17:36 ` Carsten Otte
  [not found] ` <1178904968.25135.35.camel-WIxn4w2hgUz3YA32ykw5MLlKpX0K8NHHQQ4Iyu8u01E@public.gmane.org>
  2007-05-11 17:36 ` [PATCH/RFC 9/9] Fix system<->user misaccount of interpreted execution Carsten Otte
  7 siblings, 1 reply; 104+ messages in thread
From: Carsten Otte @ 2007-05-11 17:36 UTC (permalink / raw)
To: kvm-devel-5NWGOfrQmneRv+LV9MX5uipxlwaOVQ5f@public.gmane.org
Cc: Christian Borntraeger, Martin Schwidefsky

From: Christian Borntraeger <cborntra-tA70FqPdS9bQT0dZR+AlfA@public.gmane.org>

This is the host counterpart for the virtual network device driver.
This driver has a char device node where the hypervisor can attach. It
also has a kind of dumb switch that passes packets between guests. Last
but not least, it contains a host network interface. Patches for
attaching other host network devices to the switch via raw sockets,
extensions to qeth, or netfilter are currently being tested but not
ready yet. We did not use the Linux bridging code in order to allow
non-root users to create virtual networks between guests.
Signed-off-by: Christian Borntraeger <cborntra-tA70FqPdS9bQT0dZR+AlfA@public.gmane.org> Signed-off-by: Carsten Otte <cotte-tA70FqPdS9bQT0dZR+AlfA@public.gmane.org> --- drivers/s390/guest/Makefile | 3 drivers/s390/guest/vnet_port_guest.c | 302 ++++++++++++ drivers/s390/guest/vnet_port_guest.h | 21 drivers/s390/guest/vnet_port_host.c | 418 +++++++++++++++++ drivers/s390/guest/vnet_port_host.h | 18 drivers/s390/guest/vnet_switch.c | 828 +++++++++++++++++++++++++++++++++++ drivers/s390/guest/vnet_switch.h | 119 +++++ drivers/s390/net/Kconfig | 12 8 files changed, 1721 insertions(+) Index: linux-2.6.21/drivers/s390/guest/vnet_port_guest.c =================================================================== --- /dev/null +++ linux-2.6.21/drivers/s390/guest/vnet_port_guest.c @@ -0,0 +1,302 @@ +/* + * Copyright (C) 2005 IBM Corporation + * Authors: Carsten Otte <cotte-tA70FqPdS9bQT0dZR+AlfA@public.gmane.org> + * Christian Borntraeger <borntrae-tA70FqPdS9bQT0dZR+AlfA@public.gmane.org> + * + */ +#include <linux/etherdevice.h> +#include <linux/fs.h> +#include <linux/kernel.h> +#include <linux/list.h> +#include <linux/module.h> +#include <linux/pagemap.h> +#include <linux/poll.h> +#include <linux/spinlock.h> + +#include "vnet.h" +#include "vnet_port_guest.h" +#include "vnet_switch.h" + +static void COFIXME_add_irq(struct vnet_guest_port *zgp, int data) +{ + int oldval, newval; + + do { + oldval = atomic_read(&zgp->pending_irqs); + newval = oldval | data; + } while (atomic_cmpxchg(&zgp->pending_irqs, oldval, newval) != oldval); +} + +static int COFIXME_get_irq(struct vnet_guest_port *zgp) +{ + int oldval; + + do { + oldval = atomic_read(&zgp->pending_irqs); + } while (atomic_cmpxchg(&zgp->pending_irqs, oldval, 0) != oldval); + + return oldval; +} + +static void +vnet_guest_interrupt(struct vnet_port *port, int type) +{ + struct vnet_guest_port *priv; + + priv = port->priv; + + if (!priv->fasync) { + printk (KERN_WARNING "vnet: cannot send interrupt," + "fd not async\n"); + 
return; + } + switch (type) { + case VNET_IRQ_START_RX: + COFIXME_add_irq(priv, POLLIN); + kill_fasync(&priv->fasync, SIGIO, POLL_IN); + break; + case VNET_IRQ_START_TX: + COFIXME_add_irq(priv, POLLOUT); + kill_fasync(&priv->fasync, SIGIO, POLL_OUT); + break; + default: + BUG(); + } +} + +/* release all pinned user pages*/ +static void +vnet_guest_release_pages(struct vnet_port *port) +{ + int i,j; + + for (i=0; i<VNET_QUEUE_LEN; i++) + for (j=0; j<VNET_BUFFER_PAGES; j++) { + if (port->s2p_data[i][j]) { + page_cache_release(virt_to_page(port->s2p_data[i][j])); + port->s2p_data[i][j] = NULL; + } + if (port->p2s_data[i][j]) { + page_cache_release(virt_to_page(port->p2s_data[i][j])); + port->p2s_data[i][j] = NULL; + } + } + if (port->control) { + page_cache_release(virt_to_page(port->control)); + port->control = NULL; + } +} + +static int +vnet_chr_open(struct inode *ino, struct file *filp) +{ + int minor; + struct vnet_port *port; + char name[BUS_ID_SIZE]; + + minor = iminor(filp->f_dentry->d_inode); + snprintf(name, BUS_ID_SIZE, "guest:%d", current->pid); + port = vnet_port_get(minor, name); + if (!port) + return -ENODEV; + port->priv = kzalloc(sizeof(struct vnet_guest_port), GFP_KERNEL); + if (!port->priv) { + vnet_port_put(port); + return -ENOMEM; + } + port->interrupt = vnet_guest_interrupt; + filp->private_data = port; + return nonseekable_open(ino, filp); +} + +static int +vnet_chr_release (struct inode *ino, struct file *filp) +{ + struct vnet_port *port; + port = (struct vnet_port *) filp->private_data; + +//FIXME: what about open close? We unregister non exisiting mac addresses +// in vnet_port_detach! 
+ vnet_port_detach(port); + vnet_guest_release_pages(port); + vnet_port_put(port); + return 0; +} + + +/* helper function which maps a user page into the kernel + * the memory must be free with page_cache_release */ +static void *user_to_kernel(char __user *user) +{ + struct page *temp_page; + int rc; + + BUG_ON(((unsigned long) user) % PAGE_SIZE); + rc = fault_in_pages_writeable(user, PAGE_SIZE); + if (rc) + return NULL; + rc = get_user_pages(current, current->mm, (unsigned long) user, + 1, 1, 1, &temp_page, NULL); + if (rc != 1) + return NULL; + return page_address(temp_page); +} + +/* this function pins the userspace buffers into memory*/ +static int +vnet_guest_alloc_pages(struct vnet_port *port) +{ + int i,j; + + down_read(¤t->mm->mmap_sem); + for (i=0; i<VNET_QUEUE_LEN; i++) + for (j=0; j<VNET_BUFFER_PAGES; j++) { + port->s2p_data[i][j] = user_to_kernel(port->control-> + s2pbufs[i].data + j*PAGE_SIZE); + if (!port->s2p_data[i][j]) + goto cleanup; + port->p2s_data[i][j] = user_to_kernel(port->control-> + p2sbufs[i].data + j*PAGE_SIZE); + if (!port->p2s_data[i][j]) + goto cleanup; + + } + up_read(¤t->mm->mmap_sem); + return 0; +cleanup: + up_read(¤t->mm->mmap_sem); + vnet_guest_release_pages(port); + return -ENOMEM; +} + +/* userspace control data structure stuff */ +static int +vnet_register_control(struct vnet_port *port, unsigned long user_addr) +{ + u64 uaddr; + int rc; + struct page *control_page; + + rc = copy_from_user(&uaddr, (void __user *) user_addr, sizeof(uaddr)); + if (rc) + return -EFAULT; + if (uaddr % PAGE_SIZE) + return -EFAULT; + down_read(¤t->mm->mmap_sem); + rc = get_user_pages(current, current->mm, (unsigned long)uaddr, + 1, 1, 1, &control_page, NULL); + up_read(¤t->mm->mmap_sem); + if (rc!=1) + return -EFAULT; + port->control = (struct vnet_control *) page_address(control_page); + rc = vnet_guest_alloc_pages(port); + if (rc) { + printk("vnet: could not get buffers\n"); + return rc; + } + random_ether_addr(port->mac); + 
memcpy(port->control->mac, port->mac,6); + vnet_port_attach(port); + return 0; +} + +static int +vnet_interrupt(struct vnet_port *port, int __user *u_type) +{ + int type, rc; + + rc = copy_from_user (&type, u_type, sizeof(int)); + if (rc) + return -EFAULT; + switch (type) { + case VNET_IRQ_START_RX: + vnet_port_rx(port); + break; + case VNET_IRQ_START_TX: /* noop with current drop packet approach*/ + break; + default: + printk(KERN_ERR "vnet: Unknown interrupt type %d\n", type); + rc = -EINVAL; + } + return rc; +} + + + + +//this is a HACK. >>COFIXME<< +unsigned int +vnet_poll(struct file *filp, poll_table * wait) +{ + struct vnet_port *port; + struct vnet_guest_port *zgp; + + port = filp->private_data; + zgp = port->priv; + return COFIXME_get_irq(zgp); +} + +static int vnet_fill_info(struct vnet_port *zp, void __user *data) +{ + struct vnet_info info; + + info.linktype = zp->zs->linktype; + info.maxmtu=32768; //FIXME + return copy_to_user(data, &info, sizeof(info)); +} +long +vnet_ioctl(struct file *filp, unsigned int no, unsigned long data) +{ + struct vnet_port *port = + (struct vnet_port *) filp->private_data; + int rc; + + switch (no) { + case VNET_REGISTER_CTL: + rc = vnet_register_control(port, data); + break; + case VNET_INTERRUPT: + rc = vnet_interrupt(port, (int __user *) data); + break; + case VNET_INFO: + rc = vnet_fill_info(port, (void __user *) data); + break; + default: + rc = -ENOTTY; + } + return rc; +} + +int vnet_fasync(int fd, struct file *filp, int on) +{ + struct vnet_port *port; + struct vnet_guest_port *zgp; + int rc; + + port = filp->private_data; + zgp = port->priv; + + if ((rc = fasync_helper(fd, filp, on, &zgp->fasync)) < 0) + return rc; + + if (on) + rc = f_setown(filp, current->pid, 0); + return rc; +} + + +static struct file_operations vnet_char_fops = { + .owner = THIS_MODULE, + .open = vnet_chr_open, + .release = vnet_chr_release, + .unlocked_ioctl = vnet_ioctl, + .fasync = vnet_fasync, + .poll = vnet_poll, +}; + + + +void 
vnet_cdev_init(struct cdev *cdev) +{ + cdev_init(cdev, &vnet_char_fops); +} Index: linux-2.6.21/drivers/s390/guest/vnet_port_guest.h =================================================================== --- /dev/null +++ linux-2.6.21/drivers/s390/guest/vnet_port_guest.h @@ -0,0 +1,21 @@ +/* + * Copyright (C) 2005 IBM Corporation + * Authors: Carsten Otte <cotte-tA70FqPdS9bQT0dZR+AlfA@public.gmane.org> + * Christian Borntraeger <cborntra-tA70FqPdS9bQT0dZR+AlfA@public.gmane.org> + * + */ + +#ifndef __VNET_PORTS_GUEST_H +#define __VNET_PORTS_GUEST_H + +#include <linux/fs.h> +#include <linux/cdev.h> +#include <asm/atomic.h> + +struct vnet_guest_port { + struct fasync_struct *fasync; + atomic_t pending_irqs; +}; + +extern void vnet_cdev_init(struct cdev *cdev); +#endif Index: linux-2.6.21/drivers/s390/guest/vnet_port_host.c =================================================================== --- /dev/null +++ linux-2.6.21/drivers/s390/guest/vnet_port_host.c @@ -0,0 +1,418 @@ +/* + * vnet zlswitch handling + * + * Copyright (C) 2005 IBM Corporation + * Authors: Carsten Otte <cotte-tA70FqPdS9bQT0dZR+AlfA@public.gmane.org> + * Christian Borntraeger <borntrae-tA70FqPdS9bQT0dZR+AlfA@public.gmane.org> + * + */ + +#include <linux/etherdevice.h> +#include <linux/if.h> +#include <linux/if_ether.h> +#include <linux/if_arp.h> +#include <linux/kernel.h> +#include <linux/list.h> +#include <linux/module.h> +#include <linux/netdevice.h> +#include <linux/rtnetlink.h> +#include <linux/pagemap.h> +#include <linux/spinlock.h> + +#include "vnet.h" +#include "vnet_switch.h" +#include "vnet_port_host.h" + +static void +vnet_host_interrupt(struct vnet_port *zp, int type) +{ + struct vnet_host_port *zhp; + + zhp = zp->priv; + + BUG_ON(!zhp->netdev); + + switch (type) { + case VNET_IRQ_START_RX: + netif_rx_schedule(zhp->netdev); + break; + case VNET_IRQ_START_TX: + netif_wake_queue(zhp->netdev); + break; + default: + BUG(); + } + /* we are called via system call path. 
enforce softirq handling */ + do_softirq(); +} + +static void +vnet_host_free(struct vnet_port *zp) +{ + int i,j; + + for (i=0; i<VNET_QUEUE_LEN; i++) + for (j=0; j<VNET_BUFFER_PAGES; j++) { + if (zp->s2p_data[i][j]) { + free_page((unsigned long) zp->s2p_data[i][j]); + zp->s2p_data[i][j] = NULL; + } + if (zp->p2s_data[i][j]) { + free_page((unsigned long) zp->p2s_data[i][j]); + zp->p2s_data[i][j] = NULL; + } + } + if (zp->control) { + kfree(zp->control); + zp->control = NULL; + } +} + +static int +vnet_port_hostsetup(struct vnet_port *zp) +{ + int i,j; + + zp->control = kzalloc(sizeof(*zp->control), GFP_KERNEL); + if (!zp->control) + return -ENOMEM; + for (i=0; i<VNET_QUEUE_LEN; i++) + for (j=0; j<VNET_BUFFER_PAGES; j++) { + zp->s2p_data[i][j] = (void *) __get_free_pages(GFP_KERNEL,0); + if (!zp->s2p_data[i][j]) + goto oom; + zp->p2s_data[i][j] = (void *) __get_free_pages(GFP_KERNEL,0); + if (!zp->p2s_data[i][j]) { + free_page((unsigned long) zp->s2p_data[i][j]); + goto oom; + } + } + zp->control->buffer_size = VNET_BUFFER_SIZE; + return 0; +oom: + printk(KERN_WARNING "vnet: No memory for buffer space of host device\n"); + vnet_host_free(zp); + return -ENOMEM; +} + +/* host interface specific parts */ + + +static int +vnet_net_open(struct net_device *dev) +{ + struct vnet_port *port; + struct vnet_control *control; + + port = dev->priv; + control = port->control; + atomic_set(&control->s2pmit, 0); + netif_start_queue(dev); + return 0; +} + +static int +vnet_net_stop(struct net_device *dev) +{ + netif_stop_queue(dev); + return 0; +} + +static void vnet_net_tx_timeout(struct net_device *dev) +{ + struct vnet_port *port = dev->priv; + struct vnet_control *control = port->control; + + printk(KERN_ERR "problems in xmit for device %s\n Resetting...\n", + dev->name); + atomic_set(&control->p2smit, 0); + atomic_set(&control->s2pmit, 0); + vnet_port_rx(port); + netif_wake_queue(dev); +} + + +static int +vnet_net_xmit(struct sk_buff *skb, struct net_device *dev) +{ + struct 
vnet_port *zhost; + struct vnet_host_port *zhp; + struct vnet_control *control; + struct xmit_buffer *buf; + int buffer_status; + int pkid; + + zhost = dev->priv; + zhp = zhost->priv; + control = zhost->control; + + if (!spin_trylock(&zhost->txlock)) + return NETDEV_TX_LOCKED; + if (vnet_q_full(atomic_read(&control->p2smit))) { + netif_stop_queue(dev); + goto full; + } + pkid = __nextx(atomic_read(&control->p2smit)); + buf = &control->p2sbufs[pkid]; + buf->len = skb->len; + buf->proto = skb->protocol; + vnet_copy_buf_to_pages(zhost->p2s_data[pkid], skb->data, skb->len); + buffer_status = vnet_tx_packet(&control->p2smit); + spin_unlock(&zhost->txlock); + zhp->stats.tx_packets++; + zhp->stats.tx_bytes += skb->len; + dev_kfree_skb(skb); + dev->trans_start = jiffies; + if (buffer_status & QUEUE_WAS_EMPTY) + vnet_port_rx(zhost); + if (buffer_status & QUEUE_IS_FULL) { + netif_stop_queue(dev); + spin_lock(&zhost->txlock); + } else + return NETDEV_TX_OK; +full: + /* we might have raced against the wakeup */ + if (!vnet_q_full(atomic_read(&control->p2smit))) + netif_start_queue(dev); + spin_unlock(&zhost->txlock); + return NETDEV_TX_OK; +} + +static int +vnet_l3_poll(struct net_device *dev, int *budget) +{ + struct vnet_port *zp = dev->priv; + struct vnet_host_port *zhp = zp->priv; + struct vnet_control *control = zp->control; + struct xmit_buffer *buf; + struct sk_buff *skb; + int pkid, count, numpackets = min(64, min(dev->quota, *budget)); + int buffer_status; + + if (vnet_q_empty(atomic_read(&control->s2pmit))) { + count = 0; + goto empty; + } +loop: + count = 0; + while(numpackets) { + pkid = __nextr(atomic_read(&control->s2pmit)); + buf = &control->s2pbufs[pkid]; + skb = dev_alloc_skb(buf->len + 2); + if (likely(skb)) { + skb_reserve(skb, 2); + vnet_copy_pages_to_buf(skb_put(skb, buf->len), + zp->s2p_data[pkid], buf->len); + skb->dev = dev; + skb->protocol = buf->proto; +// skb->ip_summed = CHECKSUM_UNNECESSARY; + zhp->stats.rx_packets++; + zhp->stats.rx_bytes += 
buf->len; + netif_receive_skb(skb); + numpackets--; + (*budget)--; + dev->quota--; + count++; + } else + zhp->stats.rx_dropped++; + buffer_status = vnet_rx_packet(&control->s2pmit); + if (buffer_status & QUEUE_IS_EMPTY) + goto empty; + } + return 1; //please ask us again +empty: + netif_rx_complete(dev); + /* we might have raced against a wakup*/ + if (!vnet_q_empty(atomic_read(&control->s2pmit))) { + if (netif_rx_reschedule(dev, count)) + goto loop; + } + return 0; +} + + +static int +vnet_l2_poll(struct net_device *dev, int *budget) +{ + struct vnet_port *zp = dev->priv; + struct vnet_host_port *zhp = zp->priv; + struct vnet_control *control = zp->control; + struct xmit_buffer *buf; + struct sk_buff *skb; + int pkid, count, numpackets = min(64, min(dev->quota, *budget)); + int buffer_status; + + if (vnet_q_empty(atomic_read(&control->s2pmit))) { + count = 0; + goto empty; + } +loop: + count = 0; + while(numpackets) { + pkid = __nextr(atomic_read(&control->s2pmit)); + buf = &control->s2pbufs[pkid]; + skb = dev_alloc_skb(buf->len + 2); + if (likely(skb)) { + skb_reserve(skb, 2); + vnet_copy_pages_to_buf(skb_put(skb, buf->len), + zp->s2p_data[pkid], buf->len); + skb->dev = dev; + skb->protocol = eth_type_trans(skb, dev); +// skb->ip_summed = CHECKSUM_UNNECESSARY; + zhp->stats.rx_packets++; + zhp->stats.rx_bytes += buf->len; + netif_receive_skb(skb); + numpackets--; + (*budget)--; + dev->quota--; + count++; + } else + zhp->stats.rx_dropped++; + buffer_status = vnet_rx_packet(&control->s2pmit); + if (buffer_status & QUEUE_IS_EMPTY) + goto empty; + } + return 1; //please ask us again +empty: + netif_rx_complete(dev); + /* we might have raced against a wakup*/ + if (!vnet_q_empty(atomic_read(&control->s2pmit))) { + if (netif_rx_reschedule(dev, count)) + goto loop; + } + return 0; +} + +static struct net_device_stats * +vnet_net_stats(struct net_device *dev) +{ + struct vnet_port *zp; + struct vnet_host_port *zhp; + + zp = dev->priv; + zhp = zp->priv; + return 
&zhp->stats; +} + +static int +vnet_net_change_mtu(struct net_device *dev, int new_mtu) +{ + if (new_mtu <= ETH_ZLEN) + return -ERANGE; + if (new_mtu > VNET_BUFFER_SIZE-ETH_HLEN) + return -ERANGE; + dev->mtu = new_mtu; + return 0; +} + +static void +__vnet_common_init(struct net_device *dev) +{ + dev->open = vnet_net_open; + dev->stop = vnet_net_stop; + dev->hard_start_xmit = vnet_net_xmit; + dev->get_stats = vnet_net_stats; + dev->tx_timeout = vnet_net_tx_timeout; + dev->watchdog_timeo = VNET_TIMEOUT; + dev->change_mtu = vnet_net_change_mtu; + dev->weight = 64; + //dev->features |= NETIF_F_NO_CSUM | NETIF_F_LLTX; + dev->features |= NETIF_F_LLTX; +} + +static void +__vnet_layer3_init(struct net_device *dev) +{ + dev->mtu = ETH_DATA_LEN; + dev->tx_queue_len = 1000; + dev->flags = IFF_BROADCAST|IFF_MULTICAST|IFF_NOARP; + dev->type = ARPHRD_PPP; + dev->mtu = 1492; + dev->poll = vnet_l3_poll; + __vnet_common_init(dev); +} + +static void +__vnet_layer2_init(struct net_device *dev) +{ + ether_setup(dev); + random_ether_addr(dev->dev_addr); + dev->mtu = 1492; + dev->poll = vnet_l2_poll; + __vnet_common_init(dev); +} + +static void +vnet_host_destroy(struct vnet_port *zhost) +{ + struct vnet_host_port *zhp; + zhp = zhost->priv; + + vnet_port_detach(zhost); + unregister_netdev(zhp->netdev); + free_netdev(zhp->netdev); + zhp->netdev = NULL; + vnet_host_free(zhost); + kfree(zhp); + vnet_port_put(zhost); +} + + + +struct vnet_port * +vnet_host_create(char *name) +{ + int rc; + struct vnet_port *port; + struct vnet_host_port *host; + char busname[BUS_ID_SIZE]; + int minor; + + snprintf(busname, BUS_ID_SIZE, "host:%s", name); + + minor = vnet_minor_by_name(name); + if (minor < 0) + return NULL; + port = vnet_port_get(minor, busname); + if (!port) + goto out; + host = kzalloc(sizeof(struct vnet_host_port), GFP_KERNEL); + if (!host) { + kfree(port); + port = NULL; + goto out; + } + port->priv = host; + rc =vnet_port_hostsetup(port); + if (rc) + goto out_free_host; + rtnl_lock(); + 
if (port->zs->linktype == 2) + host->netdev = alloc_netdev(0, name, __vnet_layer2_init); + else + host->netdev = alloc_netdev(0, name, __vnet_layer3_init); + if (!host->netdev) + goto out_unlock; + memcpy(port->mac, host->netdev->dev_addr, ETH_ALEN); + + host->netdev->priv = port; + port->interrupt = vnet_host_interrupt; + port->destroy = vnet_host_destroy; + + if (!register_netdevice(host->netdev)) { + /* good case */ + rtnl_unlock(); + return port; + } + host->netdev->priv = NULL; + free_netdev(host->netdev); + host->netdev = NULL; +out_unlock: + rtnl_unlock(); + vnet_host_free(port); +out_free_host: + vnet_port_put(port); + port = NULL; +out: + return port; +} Index: linux-2.6.21/drivers/s390/guest/vnet_port_host.h =================================================================== --- /dev/null +++ linux-2.6.21/drivers/s390/guest/vnet_port_host.h @@ -0,0 +1,18 @@ +/* + * Copyright (C) 2005 IBM Corporation + * Christian Borntraeger <cborntra-tA70FqPdS9bQT0dZR+AlfA@public.gmane.org> + * + */ + +#ifndef __VNET_PORTS_HOST_H +#define __VNET_PORTS_HOST_H + +#include <linux/netdevice.h> +#include "vnet_switch.h" + +struct vnet_host_port { + struct net_device_stats stats; + struct net_device *netdev; +}; +extern struct vnet_port * vnet_host_create(char *name); +#endif Index: linux-2.6.21/drivers/s390/guest/vnet_switch.c =================================================================== --- /dev/null +++ linux-2.6.21/drivers/s390/guest/vnet_switch.c @@ -0,0 +1,828 @@ +/* + * vnet zlswitch handling + * + * Copyright (C) 2005 IBM Corporation + * Author: Carsten Otte <cotte-tA70FqPdS9bQT0dZR+AlfA@public.gmane.org> + * Author: Christian Borntraeger <borntrae-tA70FqPdS9bQT0dZR+AlfA@public.gmane.org> + * + */ + +#include <linux/device.h> +#include <linux/etherdevice.h> +#include <linux/fs.h> +#include <linux/if.h> +#include <linux/if_ether.h> +#include <linux/kernel.h> +#include <linux/list.h> +#include <linux/miscdevice.h> +#include <linux/module.h> +#include 
<linux/netdevice.h> +#include <linux/rtnetlink.h> +#include <linux/pagemap.h> +#include <linux/spinlock.h> + +#include "vnet.h" +#include "vnet_port_guest.h" +#include "vnet_port_host.h" +#include "vnet_switch.h" + +#define NUM_MINORS 1024 + +/* devices housekeeping, creation & destruction */ +static LIST_HEAD(vnet_switches); +static rwlock_t vnet_switches_lock = RW_LOCK_UNLOCKED; +static struct class *zwitch_class; +static int vnet_major; +static struct device *root_dev; + + +/* The following functions allow ports of the switch to know about + * the MAC addresses of other ports. This is necessary for special + * hardware like OSA express which silently drops incoming packets + * that not match known MAC addresses and do not support promiscous + * mode as well. We have to register all guest MAC addresses at OSA + * make packet receive working */ + +/* Announces the own MAC address to all other ports + * this function is called if a new port is added */ +static void vnet_switch_add_mac(struct vnet_port *port) +{ + struct vnet_port *other_port; + + read_lock(&port->zs->ports_lock); + list_for_each_entry(other_port, &port->zs->switch_ports, lh) + if ((other_port != port) && (other_port->set_mac)) + other_port->set_mac(other_port,port->mac, 1); + read_unlock(&port->zs->ports_lock); +} + +/* Removes the own MAC address from all other ports + * this function is called if a port is detached*/ +static void vnet_switch_del_mac(struct vnet_port *port) +{ + struct vnet_port *other_port; + + read_lock(&port->zs->ports_lock); + list_for_each_entry(other_port, &port->zs->switch_ports, lh) + if (other_port->set_mac) + other_port->set_mac(other_port, port->mac, 0); + read_unlock(&port->zs->ports_lock); +} + +/* Learn MACs from other ports on the same zwitch and forward + * the MAC addresses to the set_mac function of the port.*/ +static void __vnet_port_learn_macs(struct vnet_port *port) +{ + struct vnet_port *other_port; + + if (!port->set_mac) + return; + 
list_for_each_entry(other_port, &port->zs->switch_ports, lh) + if (other_port != port) + port->set_mac(port, other_port->mac, 1); +} + +/* Unlearn MACS from other ports on the same zwitch */ +static void __vnet_port_unlearn_macs(struct vnet_port *port) +{ + struct vnet_port *other_port; + + if (!port->set_mac) + return; + list_for_each_entry(other_port, &port->zs->switch_ports, lh) + if (other_port != port) + port->set_mac(port, other_port->mac, 0); +} + + +static struct vnet_switch *__vnet_switch_by_minor(int minor) +{ + struct vnet_switch *zs; + + list_for_each_entry(zs, &vnet_switches, lh) { + if (MINOR(zs->cdev.dev) == minor) + return zs; + } + return NULL; +} + +static struct vnet_switch *__vnet_switch_by_name(char *name) +{ + struct vnet_switch *zs; + + list_for_each_entry(zs, &vnet_switches, lh) + if (strncmp(zs->name, name, ZWITCH_NAME_SIZE) == 0) + return zs; + return NULL; +} + +/* Returns a switch structure and increases the reference count. If no such + * switch exists a new one is created with reference count 1 */ +static struct vnet_switch *zwitch_get(int minor) +{ + struct vnet_switch *zs; + + read_lock(&vnet_switches_lock); + zs = __vnet_switch_by_minor(minor); + if (!zs) { + read_unlock(&vnet_switches_lock); + return zs; + } + get_device(&zs->dev); + read_unlock(&vnet_switches_lock); + return zs; +} + +/* reduces the reference count of the switch. */ +static void zwitch_put(struct vnet_switch * zs) +{ + put_device(&zs->dev); +} + +/* looks into the packet and searches a matching MAC address + * return NULL if unknown or broadcast */ +static struct vnet_port *__vnet_find_l2(struct vnet_switch *zs, char *data) +{ + //FIXME: make this a hash lookup, more macs per device? 
+ struct vnet_port *port; + + if (is_multicast_ether_addr(data)) + return NULL; + list_for_each_entry(port, &zs->switch_ports, lh) { + if (compare_ether_addr(port->mac, data)==0) + goto out; + } + port = NULL; + out: + return port; +} + +/* searches the destination for IP only interfaces. Normally routing + * is the way to go, but guests should see the net transparently without + * a hop in between*/ +static struct vnet_port *__vnet_find_l3(struct vnet_switch *zs, char *data) +{ + return NULL; +} + +static struct vnet_port * __vnet_find_destination(struct vnet_switch *zs, + char *data) +{ + switch (zs->linktype) { + case 2: + return __vnet_find_l2(zs, data); + case 3: + return __vnet_find_l3(zs, data); + default: + BUG(); + } +} + +/* copies len bytes of data from the memory specified by the list of + * pointers **from into the memory specified by the list of pointers **to + * with each pointer pointing to a page */ +static void +vnet_switch_page_copy(void **to, void **from, int len) +{ + int remaining=len; + int pageid = 0; + int amount; + + while(remaining) { + amount = min((int)PAGE_SIZE, remaining); + memcpy(to[pageid], from[pageid], amount); + pageid++; + remaining -= amount; + } +} + +/* copies to data into a buffer of destination + * returns 0 if ok*/ +static int +vnet_unicast(struct vnet_port *destination, void **from_data, int len, int proto) +{ + int pkid; + int buffer_status; + void **to_data; + struct vnet_control *control; + + control = destination->control; + spin_lock_bh(&destination->rxlock); + if (vnet_q_full(atomic_read(&control->s2pmit))) { + destination->rx_dropped++; + spin_unlock_bh(&destination->rxlock); + return -ENOBUFS; + } + pkid = __nextx(atomic_read(&control->s2pmit)); + to_data = destination->s2p_data[pkid]; + vnet_switch_page_copy(to_data, from_data, len); + control->s2pbufs[pkid].len = len; + control->s2pbufs[pkid].proto = proto; + buffer_status = vnet_tx_packet(&control->s2pmit); + spin_unlock_bh(&destination->rxlock); + if 
(buffer_status & QUEUE_WAS_EMPTY) + destination->interrupt(destination, VNET_IRQ_START_RX); + destination->rx_bytes += len; + destination->rx_packets++; + return 0; +} + +/* send packets to all ports and emulate broadcasts via unicasts*/ +static int vnet_allcast(struct vnet_port *from_port, void **fromdata, + int len, int proto) +{ + struct vnet_port *destination; + int failure = 0; + + list_for_each_entry(destination, &from_port->zs->switch_ports, lh) + if (destination != from_port) + failure |= vnet_unicast(destination, fromdata, + len, proto); + return failure; +} + +/* takes an incoming packet and forwards it to the right port + * if a failure occurs, increase the tx_dropped count of the sender*/ +static void vnet_switch_packet(struct vnet_port *from_port, + void **from_data, int len, int proto) +{ + struct vnet_port *destination; + int failure; + + read_lock(&from_port->zs->ports_lock); + destination = __vnet_find_destination(from_port->zs, from_data[0]); + /* we dont want to loop. 
FIXME: document when this can happen*/ + if (destination == from_port) { + read_unlock(&from_port->zs->ports_lock); + return; + } + if (destination) + failure = vnet_unicast(destination, from_data, len, proto); + else + failure = vnet_allcast(from_port, from_data, len, proto); + read_unlock(&from_port->zs->ports_lock); + if (failure) + from_port->tx_dropped++; + else { + from_port->tx_packets++; + from_port->tx_bytes += len; + } +} + +static void vnet_port_release(struct device *dev) +{ + struct vnet_port *port; + + port = container_of(dev, struct vnet_port, dev); + zwitch_put(port->zs); + kfree(port); +} + +static ssize_t vnet_port_read_mac(struct device *dev, + struct device_attribute *attr, + char *buf) +{ + struct vnet_port *port; + + port = container_of(dev, struct vnet_port, dev); + return sprintf(buf,"%02X:%02X:%02X:%02X:%02X:%02X", port->mac[0], + port->mac[1], port->mac[2], port->mac[3], + port->mac[4], port->mac[5]); +} + +static ssize_t vnet_port_read_tx_bytes(struct device *dev, + struct device_attribute *attr, + char *buf) +{ + struct vnet_port *port; + + port = container_of(dev, struct vnet_port, dev); + return sprintf(buf,"%lu", port->tx_bytes); +} + +static ssize_t vnet_port_read_rx_bytes(struct device *dev, + struct device_attribute *attr, + char *buf) +{ + struct vnet_port *port; + + port = container_of(dev, struct vnet_port, dev); + return sprintf(buf,"%lu", port->rx_bytes); +} + +static ssize_t vnet_port_read_tx_packets(struct device *dev, + struct device_attribute *attr, + char *buf) +{ + struct vnet_port *port; + + port = container_of(dev, struct vnet_port, dev); + return sprintf(buf,"%lu", port->tx_packets); +} + +static ssize_t vnet_port_read_rx_packets(struct device *dev, + struct device_attribute *attr, + char *buf) +{ + struct vnet_port *port; + + port = container_of(dev, struct vnet_port, dev); + return sprintf(buf,"%lu", port->rx_packets); +} + +static ssize_t vnet_port_read_tx_dropped(struct device *dev, + struct device_attribute 
*attr, + char *buf) +{ + struct vnet_port *port; + + port = container_of(dev, struct vnet_port, dev); + return sprintf(buf,"%lu", port->tx_dropped); +} + +static ssize_t vnet_port_read_rx_dropped(struct device *dev, + struct device_attribute *attr, + char *buf) +{ + struct vnet_port *port; + + port = container_of(dev, struct vnet_port, dev); + return sprintf(buf,"%lu", port->rx_dropped); +} + +static DEVICE_ATTR(mac, S_IRUSR, vnet_port_read_mac, NULL); +static DEVICE_ATTR(tx_bytes, S_IRUSR, vnet_port_read_tx_bytes, NULL); +static DEVICE_ATTR(rx_bytes, S_IRUSR, vnet_port_read_rx_bytes, NULL); +static DEVICE_ATTR(tx_packets, S_IRUSR, vnet_port_read_tx_packets, NULL); +static DEVICE_ATTR(rx_packets, S_IRUSR, vnet_port_read_rx_packets, NULL); +static DEVICE_ATTR(tx_dropped, S_IRUSR, vnet_port_read_tx_dropped, NULL); +static DEVICE_ATTR(rx_dropped, S_IRUSR, vnet_port_read_rx_dropped, NULL); + +static int vnet_port_attributes(struct device *dev) +{ + int rc; + rc = device_create_file(dev, &dev_attr_mac); + if (rc) + return rc; + rc = device_create_file(dev, &dev_attr_tx_dropped); + if (rc) + return rc; + rc = device_create_file(dev, &dev_attr_rx_dropped); + if (rc) + return rc; + rc = device_create_file(dev, &dev_attr_rx_bytes); + if (rc) + return rc; + rc = device_create_file(dev, &dev_attr_tx_bytes); + if (rc) + return rc; + rc = device_create_file(dev, &dev_attr_rx_packets); + if (rc) + return rc; + rc = device_create_file(dev, &dev_attr_tx_packets); + return rc; +} + + +//FIXME implement this +static int vnet_port_exists(struct vnet_switch *zs, char *name) +{ + read_lock(&zs->ports_lock); + read_unlock(&zs->ports_lock); + return 0; + +} + +static struct vnet_port *vnet_port_create(struct vnet_switch *zs, + char *name) +{ + struct vnet_port *port; + + if (vnet_port_exists(zs, name)) + return NULL; + + port = kzalloc(sizeof(*port), GFP_KERNEL); + if (port) { + spin_lock_init(&port->rxlock); + spin_lock_init(&port->txlock); + INIT_LIST_HEAD(&port->lh); + port->zs = zs; 
+ } else + return NULL; + port->dev.parent = &zs->dev; + port->dev.release = vnet_port_release; + strncpy(port->dev.bus_id, name, BUS_ID_SIZE); + if (device_register(&port->dev)) { + kfree(port); + return NULL; + } + if (vnet_port_attributes(&port->dev)) { + device_unregister(&port->dev); + kfree(port); + return NULL; + } + return port; +} + +/*------------------------ switch creation/Destruction/housekeeping---------*/ + +static void zwitch_destroy_ports(struct vnet_switch *zs) +{ + struct vnet_port *port, *tmp; + + list_for_each_entry_safe(port, tmp, &zs->switch_ports, lh) { + if (port->destroy) + port->destroy(port); + else + printk("No destroy function for port\n"); + } +} + + +static void zwitch_destroy(struct vnet_switch *zs) +{ + class_device_destroy(zwitch_class, zs->cdev.dev); + cdev_del(&zs->cdev); + device_unregister(&zs->dev); +} + +static void zwitch_release(struct device *dev) +{ + struct vnet_switch *zs; + + zs = container_of(dev, struct vnet_switch, dev); + kfree(zs); +} + +static int __zwitch_get_minor(void) +{ + int d, found; + struct vnet_switch *zs; + + for (d=0; d< NUM_MINORS; d++) { + found = 0; + list_for_each_entry(zs, &vnet_switches, lh) + if (MINOR(zs->cdev.dev) == d) + found++; + if (!found) break; + } + if (found) return -ENODEV; + return d; +} + +/* + * checks if this name already exists for a zwitch + */ +static int __zwitch_check_name(char *name) +{ + struct vnet_switch *zs; + + list_for_each_entry(zs, &vnet_switches, lh) + if (!strncmp(name, zs->name, ZWITCH_NAME_SIZE)) + return -EEXIST; + return 0; +} + +static int zwitch_create(char *name, int linktype) +{ + struct vnet_switch *zs; + int minor; + int ret; + + if ((linktype < 2) || (linktype > 3)) + return -EINVAL; + zs = kzalloc(sizeof(*zs), GFP_KERNEL); + if (!zs) { + printk("Creation of %s failed: out of memory\n", name); + return -ENOMEM; + } + zs->linktype = linktype; + strncpy(zs->name, name, ZWITCH_NAME_SIZE); + rwlock_init(&zs->ports_lock); + 
INIT_LIST_HEAD(&zs->switch_ports); + + write_lock(&vnet_switches_lock); + minor = __zwitch_get_minor(); + if (minor < 0) { + write_unlock(&vnet_switches_lock); + printk("Creation of %s failed: No free minor number\n", name); + kfree(zs); + return minor; + } + if (__zwitch_check_name(zs->name)) { + write_unlock(&vnet_switches_lock); + printk("Creation of %s failed: name exists\n", name); + kfree(zs); + return -EEXIST; + } + list_add_tail(&zs->lh, &vnet_switches); + write_unlock(&vnet_switches_lock); + strncpy(zs->dev.bus_id, name, min((int) strlen(name), + ZWITCH_NAME_SIZE)); + zs->dev.parent = root_dev; + zs->dev.release = zwitch_release; + ret = device_register(&zs->dev); + if (ret) { + write_lock(&vnet_switches_lock); + list_del(&zs->lh); + write_unlock(&vnet_switches_lock); + printk("Creation of %s failed: no device\n",name); + return ret; + } + vnet_cdev_init(&zs->cdev); + cdev_add(&zs->cdev, MKDEV(vnet_major, minor), 1); + zs->class_device = class_device_create(zwitch_class, NULL, + zs->cdev.dev, &zs->dev, name); + if (IS_ERR(zs->class_device)) { + cdev_del(&zs->cdev); + write_lock(&vnet_switches_lock); + list_del(&zs->lh); + write_unlock(&vnet_switches_lock); + printk("Creation of %s failed: no class_device\n", name); + device_unregister(&zs->dev); + return PTR_ERR(zs->class_device); + } + return 0; +} + + +static int zwitch_delete(char *name) +{ + struct vnet_switch *zs; + + write_lock(&vnet_switches_lock); + zs = __vnet_switch_by_name(name); + if (!zs) { + write_unlock(&vnet_switches_lock); + return -ENOENT; + } + list_del(&zs->lh); + write_unlock(&vnet_switches_lock); + zwitch_destroy_ports(zs); + zwitch_destroy(zs); + return 0; +} + +/* checks if a switch for the given minor exists + * if yes, create an unconnected port on this switch + * if no, return NULL */ +struct vnet_port *vnet_port_get(int minor, char *port_name) +{ + struct vnet_switch *zs; + struct vnet_port *port; + + zs = zwitch_get(minor); + if (!zs) + return NULL; + port = 
vnet_port_create(zs, port_name); + if (!port) + zwitch_put(zs); + return port; +} + +/* attaches the port to the switch. The port must be + * fully initialized, as it may get called immediately afterwards */ +void vnet_port_attach(struct vnet_port *port) +{ + write_lock_bh(&port->zs->ports_lock); + __vnet_port_learn_macs(port); + list_add(&port->lh, &port->zs->switch_ports); + write_unlock_bh(&port->zs->ports_lock); + vnet_switch_add_mac(port); + return; +} + +/* detaches the port from the switch. After that, + * no calls into the port are made */ +void vnet_port_detach(struct vnet_port *port) +{ + vnet_switch_del_mac(port); + write_lock_bh(&port->zs->ports_lock); + if (!list_empty(&port->lh)) + list_del(&port->lh); + __vnet_port_unlearn_macs(port); + write_unlock_bh(&port->zs->ports_lock); +} + +/* releases all ressources allocated with vnet_port_get */ +void vnet_port_put(struct vnet_port *port) +{ + BUG_ON(!list_empty(&port->lh) &&( port->lh.next != LIST_POISON1)); + device_unregister(&port->dev); +} + +/* tell the switch that new data is available */ +void vnet_port_rx(struct vnet_port *port) +{ + struct vnet_control *control; + int pkid, rc; + + control = port->control; + if (vnet_q_empty(atomic_read(&control->p2smit))) { + printk(KERN_WARNING "vnet_switch: Empty buffer" + "on interrupt\n"); + return; + } + do { + pkid = __nextr(atomic_read(&control->p2smit)); + /* fire and forget. 
Let the switch care about lost packets*/ + vnet_switch_packet(port, port->p2s_data[pkid], + control->p2sbufs[pkid].len, + control->p2sbufs[pkid].proto); + rc = vnet_rx_packet(&control->p2smit); + if (rc & QUEUE_WAS_FULL) { + port->interrupt(port, VNET_IRQ_START_TX); + } + } while (!(rc & QUEUE_IS_EMPTY)); + return; +} + +/* checks if the given address is locally attached to the switch*/ +int vnet_address_is_local(struct vnet_switch *zs, char *address) +{ + struct vnet_port *port; + + read_lock(&zs->ports_lock); + port = __vnet_find_destination(zs, address); + read_unlock(&zs->ports_lock); + return (port != NULL); +} + + +int vnet_minor_by_name(char *name) +{ + struct vnet_switch *zs; + int ret; + + read_lock(&vnet_switches_lock); + zs = __vnet_switch_by_name(name); + if (zs) + ret = MINOR(zs->cdev.dev); + else + ret = -ENODEV; + read_unlock(&vnet_switches_lock); + return ret; +} + +static void vnet_root_release(struct device *dev) +{ + kfree(dev); +} + + +struct command { + char *string1; + char *string2; +}; + +/*FIXME this is ugly. Dont worry: as soon as we have finalized the interface, + this crap is going away. 
Still, it works.......*/ +static long vnet_control_ioctl(struct file *f, unsigned int command, + unsigned long data) +{ + char string1[BUS_ID_SIZE]; + char string2[BUS_ID_SIZE]; + struct command com; + struct vnet_port *port; + + if (!capable(CAP_NET_ADMIN)) + return -EPERM; + if (copy_from_user(&com, (__user struct command*) data, sizeof(struct command))) + return -EFAULT; + if (copy_from_user(string1, (__user char *) com.string1, ZWITCH_NAME_SIZE)) + return -EFAULT; + if (command >=2) + if (copy_from_user(string2, (__user char *) com.string2, ZWITCH_NAME_SIZE)) + return -EFAULT; + if (strnlen(string1, ZWITCH_NAME_SIZE) == ZWITCH_NAME_SIZE) + return -EINVAL; + switch(command) { + case ADD_SWITCH: + return zwitch_create(string1,3); + case DEL_SWITCH: + return zwitch_delete(string1); + case ADD_HOST: + port = vnet_host_create(string1); + if (port) { + vnet_port_attach(port); + return 0; + } else + return -ENODEV; + default: + return -EINVAL; + } + return 0; +} + +static int vnet_control_open(struct inode *inode, struct file *file) +{ + return 0; +} + +static int vnet_control_release(struct inode *inode, struct file *file) +{ + return 0; +} + +struct file_operations vnet_control_fops = { + .open = vnet_control_open, + .release = vnet_control_release, + .unlocked_ioctl = &vnet_control_ioctl, + .compat_ioctl = &vnet_control_ioctl, +}; + +struct miscdevice vnet_control_device = { + .minor = MISC_DYNAMIC_MINOR, + .name = "vnet", + .fops = &vnet_control_fops, +}; + +int vnet_register_control_device(void) +{ + return misc_register(&vnet_control_device); +} + +int __init vnet_switch_init(void) +{ + int ret; + dev_t dev; + + zwitch_class = class_create(THIS_MODULE, "vnet"); + if (IS_ERR(zwitch_class)) { + printk(KERN_ERR "vnet_switch: class_create failed!\n"); + ret = PTR_ERR(zwitch_class); + goto out; + } + ret = alloc_chrdev_region(&dev, 0, NUM_MINORS, "vnet"); + if (ret) { + printk(KERN_ERR "vnet_switch: alloc_chrdev_region failed\n"); + goto out_class; + } + vnet_major = 
MAJOR(dev); + root_dev = kzalloc(sizeof(*root_dev), GFP_KERNEL); + if (!root_dev) { + printk(KERN_ERR "vnet_switch:allocation of device failed\n"); + ret = -ENOMEM; + goto out_chrdev; + } + strncpy(root_dev->bus_id, "vnet", 5); + root_dev->release = vnet_root_release; + ret =device_register(root_dev); + if (ret) { + printk(KERN_ERR "vnet_switch: could not register device\n"); + kfree(root_dev); + goto out_chrdev; + } + ret = vnet_register_control_device(); + if (ret) { + printk("vnet_switch: could not create control device\n"); + goto out_dev; + } + printk ("vnet_switch loaded\n"); +/* FIXME ---------- remove these static defines as soon as everyone has the + * user tools */ + { + struct vnet_port *port; + zwitch_create("myswitch0",2); + zwitch_create("myswitch1",3); + + port = vnet_host_create("myswitch0"); + if (port) + vnet_port_attach(port); + port = vnet_host_create("myswitch1"); + if (port) + vnet_port_attach(port); + } +/*-----------------------------------------------------------*/ + return 0; +out_dev: + device_unregister(root_dev); +out_chrdev: + unregister_chrdev_region(MKDEV(vnet_major,0), NUM_MINORS); +out_class: + class_destroy(zwitch_class); +out: + return ret; +} + +/* remove all existing vnet_zwitches in the system and unregister the + * character device from the system */ +void vnet_switch_exit(void) +{ + struct vnet_switch *zs, *tmp; + list_for_each_entry_safe(zs, tmp, &vnet_switches, lh) { + zwitch_destroy_ports(zs); + zwitch_destroy(zs); + } + device_unregister(root_dev); + misc_deregister(&vnet_control_device); + unregister_chrdev_region(MKDEV(vnet_major,0), NUM_MINORS); + class_destroy(zwitch_class); + printk ("vnet_switch unloaded\n"); +} + +module_init(vnet_switch_init); +module_exit(vnet_switch_exit); +MODULE_DESCRIPTION("VNET: Virtual switch for vnet interfaces"); +MODULE_AUTHOR("Christian Borntraeger <borntrae-tA70FqPdS9bQT0dZR+AlfA@public.gmane.org>"); +MODULE_LICENSE("GPL"); Index: linux-2.6.21/drivers/s390/guest/vnet_switch.h 
=================================================================== --- /dev/null +++ linux-2.6.21/drivers/s390/guest/vnet_switch.h @@ -0,0 +1,119 @@ +/* + * vnet_switch - zlive insular communication knack switch + * infrastructure for virtual switching of Linux guests running under Linux + * + * Copyright (C) 2005 IBM Corporation + * Author: Carsten Otte <cotte-tA70FqPdS9bQT0dZR+AlfA@public.gmane.org> + * Christian Borntraeger <borntrae-tA70FqPdS9bQT0dZR+AlfA@public.gmane.org> + * + */ + +#ifndef __VNET_SWITCH_H +#define __VNET_SWITCH_H + +#include <linux/cdev.h> +#include <linux/device.h> +#include <linux/if_ether.h> +#include <linux/spinlock.h> + +#include "vnet.h" + +/* defines for IOCTLs. interface should be replaced by something better */ +#define ADD_SWITCH 0 +#define DEL_SWITCH 1 +#define ADD_OSA 2 +#define DEL_OSA 3 +#define ADD_HOST 4 +#define DEL_HOST 5 + +/* min(IFNAMSIZ, BUS_ID_SIZE)*/ +#define ZWITCH_NAME_SIZE 16 + +/* This structure describes a virtual switch for ports to userspace network + * interfaces, e.g. 
in Linux under Linux environments*/ +struct vnet_switch { + struct list_head lh; + char name[ZWITCH_NAME_SIZE]; + struct list_head switch_ports; /* list of ports */ + rwlock_t ports_lock; /* lock for switch_ports */ + struct class_device *class_device; + struct cdev cdev; + struct device dev; + struct vnet_port *osa; + int linktype; /* 2=ethernet 3=IP */ +}; + +/* description of a port of the vnet_switch */ +struct vnet_port { + struct list_head lh; + struct vnet_switch *zs; + struct vnet_control *control; + void *s2p_data[VNET_QUEUE_LEN][(VNET_BUFFER_SIZE>>PAGE_SHIFT)]; + void *p2s_data[VNET_QUEUE_LEN][(VNET_BUFFER_SIZE>>PAGE_SHIFT)]; + char mac[ETH_ALEN]; + void *priv; + int (*set_mac) (struct vnet_port *port, char mac[ETH_ALEN], int add); + void (*interrupt) (struct vnet_port *port, int type); + void (*destroy) (struct vnet_port *port); + struct device dev; + unsigned long rx_packets; /* total packets received */ + unsigned long tx_packets; /* total packets transmitted */ + unsigned long rx_bytes; /* total bytes received */ + unsigned long tx_bytes; /* total bytes transmitted */ + unsigned long rx_dropped; /* no space in receive buffer */ + unsigned long tx_dropped; /* no space in destination buffer */ + spinlock_t rxlock; + spinlock_t txlock; +}; + + +static inline int +vnet_copy_buf_to_pages(void **data, char *buf, int len) +{ + int i; + + if (len == 0) + return 0; + for (i=0; i <= ((len - 1) >> PAGE_SHIFT); i++ ) + memcpy(data[i], buf + i*PAGE_SIZE, min(PAGE_SIZE, len - i*PAGE_SIZE)); + return len; +} + +static inline int +vnet_copy_pages_to_buf(char *buf, void **data, int len) +{ + int i; + + if (len == 0) + return 0; + for (i=0; i <= ((len -1) >> PAGE_SHIFT); i++ ) + memcpy(buf + i*PAGE_SIZE, data[i], min(PAGE_SIZE, len - i*PAGE_SIZE)); + return len; +} + + +/* checks if a switch with the given minor exists + * if yes, create a named and unconnected port on + * this switch with the given name. 
if no, return NULL */ +extern struct vnet_port *vnet_port_get(int minor, char *port_name); + +/* attaches the port to the switch. The port must be + * fully initialized, as it may get data immediately afterwards */ +extern void vnet_port_attach(struct vnet_port *port); + +/* detaches the port from the switch. After that, + * no calls into the port are made */ +extern void vnet_port_detach(struct vnet_port *port); + +/* releases all ressources allocated with vnet_port_get */ +extern void vnet_port_put(struct vnet_port *port); + +/* tell the switch that new data is available */ +extern void vnet_port_rx(struct vnet_port *port); + +/* get the minor for a given name */ +extern int vnet_minor_by_name(char *name); + +/* checks if the given address is locally attached to the switch*/ +extern int vnet_address_is_local(struct vnet_switch *zs, char *address); +#endif Index: linux-2.6.21/drivers/s390/guest/Makefile =================================================================== --- linux-2.6.21.orig/drivers/s390/guest/Makefile +++ linux-2.6.21/drivers/s390/guest/Makefile @@ -6,3 +6,6 @@ obj-$(CONFIG_GUEST_CONSOLE) += guest_con obj-$(CONFIG_S390_GUEST) += vdev.o vdev_device.o obj-$(CONFIG_VDISK) += vdisk.o vdisk_blk.o obj-$(CONFIG_VNET_GUEST) += vnet_guest.o +vnet_host-objs := vnet_switch.o vnet_port_guest.o vnet_port_host.o +obj-$(CONFIG_VNET_HOST) += vnet_host.o + Index: linux-2.6.21/drivers/s390/net/Kconfig =================================================================== --- linux-2.6.21.orig/drivers/s390/net/Kconfig +++ linux-2.6.21/drivers/s390/net/Kconfig @@ -95,4 +95,16 @@ config VNET_GUEST connection. If you're not using host/guest support, say N. +config VNET_HOST + tristate "virtual networking support (HOST)" + depends on QETH && S390_HOST + help + This is the host part of the vnet guest network connection. + Say Y if you plan to host guests with network + connection. 
The host part consists of a virtual switch + a host device as well as a connection to the qeth + driver. + If you're not using this kernel for hosting guest, say N. + + endmenu
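[Editorial note: the transmit and poll paths in the patch above drive the shared ring buffers entirely through helpers — __nextx, __nextr, vnet_q_full, vnet_q_empty, vnet_tx_packet, vnet_rx_packet — whose definitions live in vnet.h and are not part of this chunk. The following is a hypothetical userspace sketch of the single-producer/single-consumer index-and-status contract those call sites imply; the queue depth, flag values, struct name, and packing of both indices into one word are assumptions, not taken from the patch.]

```c
#include <assert.h>

#define QLEN 8			/* assumed depth; real VNET_QUEUE_LEN unknown */
#define QUEUE_WAS_EMPTY	0x1	/* flag values are illustrative only */
#define QUEUE_IS_FULL	0x2
#define QUEUE_WAS_FULL	0x4
#define QUEUE_IS_EMPTY	0x8

/*
 * The patch reads one atomic word (p2smit/s2pmit) and derives everything
 * from it, which suggests free-running producer/consumer indices packed
 * together. Here they are plain fields; QLEN should be a power of two so
 * the modulo stays correct across unsigned wraparound.
 */
struct mit { unsigned head, tail; };

static unsigned q_count(struct mit *m) { return m->tail - m->head; }
static int q_full(struct mit *m)  { return q_count(m) == QLEN; }
static int q_empty(struct mit *m) { return q_count(m) == 0; }

/* slot the producer fills next (__nextx analogue) */
static int next_tx(struct mit *m) { return m->tail % QLEN; }
/* slot the consumer drains next (__nextr analogue) */
static int next_rx(struct mit *m) { return m->head % QLEN; }

/* vnet_tx_packet analogue: publish one slot, report edge transitions.
 * QUEUE_WAS_EMPTY tells the caller to kick the peer (vnet_port_rx). */
static int tx_packet(struct mit *m)
{
	int st = 0;
	if (q_empty(m))
		st |= QUEUE_WAS_EMPTY;
	m->tail++;
	if (q_full(m))
		st |= QUEUE_IS_FULL;	/* caller stops its netdev queue */
	return st;
}

/* vnet_rx_packet analogue: consume one slot, report edge transitions.
 * QUEUE_WAS_FULL tells the consumer to wake the stalled producer. */
static int rx_packet(struct mit *m)
{
	int st = 0;
	if (q_full(m))
		st |= QUEUE_WAS_FULL;
	m->head++;
	if (q_empty(m))
		st |= QUEUE_IS_EMPTY;	/* poll loop exits via its empty: label */
	return st;
}
```

This also makes the "we might have raced against the wakeup" re-checks in vnet_net_xmit and the poll functions legible: the edge flags are computed from a snapshot, so after stopping or completing a queue the code must re-test fullness/emptiness under the lock to catch a transition that raced in between.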
* Re: [PATCH/RFC 8/9] Virtual network host switch support [not found] ` <1178904968.25135.35.camel-WIxn4w2hgUz3YA32ykw5MLlKpX0K8NHHQQ4Iyu8u01E@public.gmane.org> @ 2007-05-11 20:21 ` Anthony Liguori [not found] ` <4644D048.7060106-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org> 0 siblings, 1 reply; 104+ messages in thread From: Anthony Liguori @ 2007-05-11 20:21 UTC (permalink / raw) To: Carsten Otte Cc: kvm-devel-5NWGOfrQmneRv+LV9MX5uipxlwaOVQ5f@public.gmane.org, Christian Borntraeger, Martin Schwidefsky Carsten Otte wrote: > From: Christian Borntraeger <cborntra-tA70FqPdS9bQT0dZR+AlfA@public.gmane.org> > > This is the host counterpart for the virtual network device driver. This driver > has an char device node where the hypervisor can attach. It also > has a kind of dumb switch that passes packets between guests. Last but not least > it contains a host network interface. Patches for attaching other host network > devices to the switch via raw sockets, extensions to qeth or netfilter are > Any feel for the performance relative to the bridging code? The bridging code is a pretty big bottle neck in guest=>guest communications in Xen at least. > currently tested but not ready yet. We did not use the linux bridging code to > allow non-root users to create virtual networks between guests. > Is that the primary reason? If so, that seems like a rather large hammer for something that a userspace suid wrapper could have addressed... 
Regards, Anthony Liguori > Signed-off-by: Christian Borntraeger <cborntra-tA70FqPdS9bQT0dZR+AlfA@public.gmane.org> > Signed-off-by: Carsten Otte <cotte-tA70FqPdS9bQT0dZR+AlfA@public.gmane.org> > > --- > drivers/s390/guest/Makefile | 3 > drivers/s390/guest/vnet_port_guest.c | 302 ++++++++++++ > drivers/s390/guest/vnet_port_guest.h | 21 > drivers/s390/guest/vnet_port_host.c | 418 +++++++++++++++++ > drivers/s390/guest/vnet_port_host.h | 18 > drivers/s390/guest/vnet_switch.c | 828 +++++++++++++++++++++++++++++++++++ > drivers/s390/guest/vnet_switch.h | 119 +++++ > drivers/s390/net/Kconfig | 12 > 8 files changed, 1721 insertions(+) > > Index: linux-2.6.21/drivers/s390/guest/vnet_port_guest.c > =================================================================== > --- /dev/null > +++ linux-2.6.21/drivers/s390/guest/vnet_port_guest.c > @@ -0,0 +1,302 @@ > +/* > + * Copyright (C) 2005 IBM Corporation > + * Authors: Carsten Otte <cotte-tA70FqPdS9bQT0dZR+AlfA@public.gmane.org> > + * Christian Borntraeger <borntrae-tA70FqPdS9bQT0dZR+AlfA@public.gmane.org> > + * > + */ > +#include <linux/etherdevice.h> > +#include <linux/fs.h> > +#include <linux/kernel.h> > +#include <linux/list.h> > +#include <linux/module.h> > +#include <linux/pagemap.h> > +#include <linux/poll.h> > +#include <linux/spinlock.h> > + > +#include "vnet.h" > +#include "vnet_port_guest.h" > +#include "vnet_switch.h" > + > +static void COFIXME_add_irq(struct vnet_guest_port *zgp, int data) > +{ > + int oldval, newval; > + > + do { > + oldval = atomic_read(&zgp->pending_irqs); > + newval = oldval | data; > + } while (atomic_cmpxchg(&zgp->pending_irqs, oldval, newval) != oldval); > +} > + > +static int COFIXME_get_irq(struct vnet_guest_port *zgp) > +{ > + int oldval; > + > + do { > + oldval = atomic_read(&zgp->pending_irqs); > + } while (atomic_cmpxchg(&zgp->pending_irqs, oldval, 0) != oldval); > + > + return oldval; > +} > + > +static void > +vnet_guest_interrupt(struct vnet_port *port, int type) > +{ > + struct 
vnet_guest_port *priv; > + > + priv = port->priv; > + > + if (!priv->fasync) { > + printk (KERN_WARNING "vnet: cannot send interrupt," > + "fd not async\n"); > + return; > + } > + switch (type) { > + case VNET_IRQ_START_RX: > + COFIXME_add_irq(priv, POLLIN); > + kill_fasync(&priv->fasync, SIGIO, POLL_IN); > + break; > + case VNET_IRQ_START_TX: > + COFIXME_add_irq(priv, POLLOUT); > + kill_fasync(&priv->fasync, SIGIO, POLL_OUT); > + break; > + default: > + BUG(); > + } > +} > + > +/* release all pinned user pages*/ > +static void > +vnet_guest_release_pages(struct vnet_port *port) > +{ > + int i,j; > + > + for (i=0; i<VNET_QUEUE_LEN; i++) > + for (j=0; j<VNET_BUFFER_PAGES; j++) { > + if (port->s2p_data[i][j]) { > + page_cache_release(virt_to_page(port->s2p_data[i][j])); > + port->s2p_data[i][j] = NULL; > + } > + if (port->p2s_data[i][j]) { > + page_cache_release(virt_to_page(port->p2s_data[i][j])); > + port->p2s_data[i][j] = NULL; > + } > + } > + if (port->control) { > + page_cache_release(virt_to_page(port->control)); > + port->control = NULL; > + } > +} > + > +static int > +vnet_chr_open(struct inode *ino, struct file *filp) > +{ > + int minor; > + struct vnet_port *port; > + char name[BUS_ID_SIZE]; > + > + minor = iminor(filp->f_dentry->d_inode); > + snprintf(name, BUS_ID_SIZE, "guest:%d", current->pid); > + port = vnet_port_get(minor, name); > + if (!port) > + return -ENODEV; > + port->priv = kzalloc(sizeof(struct vnet_guest_port), GFP_KERNEL); > + if (!port->priv) { > + vnet_port_put(port); > + return -ENOMEM; > + } > + port->interrupt = vnet_guest_interrupt; > + filp->private_data = port; > + return nonseekable_open(ino, filp); > +} > + > +static int > +vnet_chr_release (struct inode *ino, struct file *filp) > +{ > + struct vnet_port *port; > + port = (struct vnet_port *) filp->private_data; > + > +//FIXME: what about open close? We unregister non exisiting mac addresses > +// in vnet_port_detach! 
> + vnet_port_detach(port); > + vnet_guest_release_pages(port); > + vnet_port_put(port); > + return 0; > +} > + > + > +/* helper function which maps a user page into the kernel > + * the memory must be free with page_cache_release */ > +static void *user_to_kernel(char __user *user) > +{ > + struct page *temp_page; > + int rc; > + > + BUG_ON(((unsigned long) user) % PAGE_SIZE); > + rc = fault_in_pages_writeable(user, PAGE_SIZE); > + if (rc) > + return NULL; > + rc = get_user_pages(current, current->mm, (unsigned long) user, > + 1, 1, 1, &temp_page, NULL); > + if (rc != 1) > + return NULL; > + return page_address(temp_page); > +} > + > +/* this function pins the userspace buffers into memory*/ > +static int > +vnet_guest_alloc_pages(struct vnet_port *port) > +{ > + int i,j; > + > + down_read(¤t->mm->mmap_sem); > + for (i=0; i<VNET_QUEUE_LEN; i++) > + for (j=0; j<VNET_BUFFER_PAGES; j++) { > + port->s2p_data[i][j] = user_to_kernel(port->control-> > + s2pbufs[i].data + j*PAGE_SIZE); > + if (!port->s2p_data[i][j]) > + goto cleanup; > + port->p2s_data[i][j] = user_to_kernel(port->control-> > + p2sbufs[i].data + j*PAGE_SIZE); > + if (!port->p2s_data[i][j]) > + goto cleanup; > + > + } > + up_read(¤t->mm->mmap_sem); > + return 0; > +cleanup: > + up_read(¤t->mm->mmap_sem); > + vnet_guest_release_pages(port); > + return -ENOMEM; > +} > + > +/* userspace control data structure stuff */ > +static int > +vnet_register_control(struct vnet_port *port, unsigned long user_addr) > +{ > + u64 uaddr; > + int rc; > + struct page *control_page; > + > + rc = copy_from_user(&uaddr, (void __user *) user_addr, sizeof(uaddr)); > + if (rc) > + return -EFAULT; > + if (uaddr % PAGE_SIZE) > + return -EFAULT; > + down_read(¤t->mm->mmap_sem); > + rc = get_user_pages(current, current->mm, (unsigned long)uaddr, > + 1, 1, 1, &control_page, NULL); > + up_read(¤t->mm->mmap_sem); > + if (rc!=1) > + return -EFAULT; > + port->control = (struct vnet_control *) page_address(control_page); > + rc = 
vnet_guest_alloc_pages(port); > + if (rc) { > + printk("vnet: could not get buffers\n"); > + return rc; > + } > + random_ether_addr(port->mac); > + memcpy(port->control->mac, port->mac,6); > + vnet_port_attach(port); > + return 0; > +} > + > +static int > +vnet_interrupt(struct vnet_port *port, int __user *u_type) > +{ > + int type, rc; > + > + rc = copy_from_user (&type, u_type, sizeof(int)); > + if (rc) > + return -EFAULT; > + switch (type) { > + case VNET_IRQ_START_RX: > + vnet_port_rx(port); > + break; > + case VNET_IRQ_START_TX: /* noop with current drop packet approach*/ > + break; > + default: > + printk(KERN_ERR "vnet: Unknown interrupt type %d\n", type); > + rc = -EINVAL; > + } > + return rc; > +} > + > + > + > + > +//this is a HACK. >>COFIXME<< > +unsigned int > +vnet_poll(struct file *filp, poll_table * wait) > +{ > + struct vnet_port *port; > + struct vnet_guest_port *zgp; > + > + port = filp->private_data; > + zgp = port->priv; > + return COFIXME_get_irq(zgp); > +} > + > +static int vnet_fill_info(struct vnet_port *zp, void __user *data) > +{ > + struct vnet_info info; > + > + info.linktype = zp->zs->linktype; > + info.maxmtu=32768; //FIXME > + return copy_to_user(data, &info, sizeof(info)); > +} > +long > +vnet_ioctl(struct file *filp, unsigned int no, unsigned long data) > +{ > + struct vnet_port *port = > + (struct vnet_port *) filp->private_data; > + int rc; > + > + switch (no) { > + case VNET_REGISTER_CTL: > + rc = vnet_register_control(port, data); > + break; > + case VNET_INTERRUPT: > + rc = vnet_interrupt(port, (int __user *) data); > + break; > + case VNET_INFO: > + rc = vnet_fill_info(port, (void __user *) data); > + break; > + default: > + rc = -ENOTTY; > + } > + return rc; > +} > + > +int vnet_fasync(int fd, struct file *filp, int on) > +{ > + struct vnet_port *port; > + struct vnet_guest_port *zgp; > + int rc; > + > + port = filp->private_data; > + zgp = port->priv; > + > + if ((rc = fasync_helper(fd, filp, on, &zgp->fasync)) < 0) > + 
return rc; > + > + if (on) > + rc = f_setown(filp, current->pid, 0); > + return rc; > +} > + > + > +static struct file_operations vnet_char_fops = { > + .owner = THIS_MODULE, > + .open = vnet_chr_open, > + .release = vnet_chr_release, > + .unlocked_ioctl = vnet_ioctl, > + .fasync = vnet_fasync, > + .poll = vnet_poll, > +}; > + > + > + > +void vnet_cdev_init(struct cdev *cdev) > +{ > + cdev_init(cdev, &vnet_char_fops); > +} > Index: linux-2.6.21/drivers/s390/guest/vnet_port_guest.h > =================================================================== > --- /dev/null > +++ linux-2.6.21/drivers/s390/guest/vnet_port_guest.h > @@ -0,0 +1,21 @@ > +/* > + * Copyright (C) 2005 IBM Corporation > + * Authors: Carsten Otte <cotte-tA70FqPdS9bQT0dZR+AlfA@public.gmane.org> > + * Christian Borntraeger <cborntra-tA70FqPdS9bQT0dZR+AlfA@public.gmane.org> > + * > + */ > + > +#ifndef __VNET_PORTS_GUEST_H > +#define __VNET_PORTS_GUEST_H > + > +#include <linux/fs.h> > +#include <linux/cdev.h> > +#include <asm/atomic.h> > + > +struct vnet_guest_port { > + struct fasync_struct *fasync; > + atomic_t pending_irqs; > +}; > + > +extern void vnet_cdev_init(struct cdev *cdev); > +#endif > Index: linux-2.6.21/drivers/s390/guest/vnet_port_host.c > =================================================================== > --- /dev/null > +++ linux-2.6.21/drivers/s390/guest/vnet_port_host.c > @@ -0,0 +1,418 @@ > +/* > + * vnet zlswitch handling > + * > + * Copyright (C) 2005 IBM Corporation > + * Authors: Carsten Otte <cotte-tA70FqPdS9bQT0dZR+AlfA@public.gmane.org> > + * Christian Borntraeger <borntrae-tA70FqPdS9bQT0dZR+AlfA@public.gmane.org> > + * > + */ > + > +#include <linux/etherdevice.h> > +#include <linux/if.h> > +#include <linux/if_ether.h> > +#include <linux/if_arp.h> > +#include <linux/kernel.h> > +#include <linux/list.h> > +#include <linux/module.h> > +#include <linux/netdevice.h> > +#include <linux/rtnetlink.h> > +#include <linux/pagemap.h> > +#include <linux/spinlock.h> > + > +#include 
"vnet.h" > +#include "vnet_switch.h" > +#include "vnet_port_host.h" > + > +static void > +vnet_host_interrupt(struct vnet_port *zp, int type) > +{ > + struct vnet_host_port *zhp; > + > + zhp = zp->priv; > + > + BUG_ON(!zhp->netdev); > + > + switch (type) { > + case VNET_IRQ_START_RX: > + netif_rx_schedule(zhp->netdev); > + break; > + case VNET_IRQ_START_TX: > + netif_wake_queue(zhp->netdev); > + break; > + default: > + BUG(); > + } > + /* we are called via system call path. enforce softirq handling */ > + do_softirq(); > +} > + > +static void > +vnet_host_free(struct vnet_port *zp) > +{ > + int i,j; > + > + for (i=0; i<VNET_QUEUE_LEN; i++) > + for (j=0; j<VNET_BUFFER_PAGES; j++) { > + if (zp->s2p_data[i][j]) { > + free_page((unsigned long) zp->s2p_data[i][j]); > + zp->s2p_data[i][j] = NULL; > + } > + if (zp->p2s_data[i][j]) { > + free_page((unsigned long) zp->p2s_data[i][j]); > + zp->p2s_data[i][j] = NULL; > + } > + } > + if (zp->control) { > + kfree(zp->control); > + zp->control = NULL; > + } > +} > + > +static int > +vnet_port_hostsetup(struct vnet_port *zp) > +{ > + int i,j; > + > + zp->control = kzalloc(sizeof(*zp->control), GFP_KERNEL); > + if (!zp->control) > + return -ENOMEM; > + for (i=0; i<VNET_QUEUE_LEN; i++) > + for (j=0; j<VNET_BUFFER_PAGES; j++) { > + zp->s2p_data[i][j] = (void *) __get_free_pages(GFP_KERNEL,0); > + if (!zp->s2p_data[i][j]) > + goto oom; > + zp->p2s_data[i][j] = (void *) __get_free_pages(GFP_KERNEL,0); > + if (!zp->p2s_data[i][j]) { > + free_page((unsigned long) zp->s2p_data[i][j]); > + goto oom; > + } > + } > + zp->control->buffer_size = VNET_BUFFER_SIZE; > + return 0; > +oom: > + printk(KERN_WARNING "vnet: No memory for buffer space of host device\n"); > + vnet_host_free(zp); > + return -ENOMEM; > +} > + > +/* host interface specific parts */ > + > + > +static int > +vnet_net_open(struct net_device *dev) > +{ > + struct vnet_port *port; > + struct vnet_control *control; > + > + port = dev->priv; > + control = port->control; > + 
atomic_set(&control->s2pmit, 0); > + netif_start_queue(dev); > + return 0; > +} > + > +static int > +vnet_net_stop(struct net_device *dev) > +{ > + netif_stop_queue(dev); > + return 0; > +} > + > +static void vnet_net_tx_timeout(struct net_device *dev) > +{ > + struct vnet_port *port = dev->priv; > + struct vnet_control *control = port->control; > + > + printk(KERN_ERR "problems in xmit for device %s\n Resetting...\n", > + dev->name); > + atomic_set(&control->p2smit, 0); > + atomic_set(&control->s2pmit, 0); > + vnet_port_rx(port); > + netif_wake_queue(dev); > +} > + > + > +static int > +vnet_net_xmit(struct sk_buff *skb, struct net_device *dev) > +{ > + struct vnet_port *zhost; > + struct vnet_host_port *zhp; > + struct vnet_control *control; > + struct xmit_buffer *buf; > + int buffer_status; > + int pkid; > + > + zhost = dev->priv; > + zhp = zhost->priv; > + control = zhost->control; > + > + if (!spin_trylock(&zhost->txlock)) > + return NETDEV_TX_LOCKED; > + if (vnet_q_full(atomic_read(&control->p2smit))) { > + netif_stop_queue(dev); > + goto full; > + } > + pkid = __nextx(atomic_read(&control->p2smit)); > + buf = &control->p2sbufs[pkid]; > + buf->len = skb->len; > + buf->proto = skb->protocol; > + vnet_copy_buf_to_pages(zhost->p2s_data[pkid], skb->data, skb->len); > + buffer_status = vnet_tx_packet(&control->p2smit); > + spin_unlock(&zhost->txlock); > + zhp->stats.tx_packets++; > + zhp->stats.tx_bytes += skb->len; > + dev_kfree_skb(skb); > + dev->trans_start = jiffies; > + if (buffer_status & QUEUE_WAS_EMPTY) > + vnet_port_rx(zhost); > + if (buffer_status & QUEUE_IS_FULL) { > + netif_stop_queue(dev); > + spin_lock(&zhost->txlock); > + } else > + return NETDEV_TX_OK; > +full: > + /* we might have raced against the wakeup */ > + if (!vnet_q_full(atomic_read(&control->p2smit))) > + netif_start_queue(dev); > + spin_unlock(&zhost->txlock); > + return NETDEV_TX_OK; > +} > + > +static int > +vnet_l3_poll(struct net_device *dev, int *budget) > +{ > + struct vnet_port 
*zp = dev->priv; > + struct vnet_host_port *zhp = zp->priv; > + struct vnet_control *control = zp->control; > + struct xmit_buffer *buf; > + struct sk_buff *skb; > + int pkid, count, numpackets = min(64, min(dev->quota, *budget)); > + int buffer_status; > + > + if (vnet_q_empty(atomic_read(&control->s2pmit))) { > + count = 0; > + goto empty; > + } > +loop: > + count = 0; > + while(numpackets) { > + pkid = __nextr(atomic_read(&control->s2pmit)); > + buf = &control->s2pbufs[pkid]; > + skb = dev_alloc_skb(buf->len + 2); > + if (likely(skb)) { > + skb_reserve(skb, 2); > + vnet_copy_pages_to_buf(skb_put(skb, buf->len), > + zp->s2p_data[pkid], buf->len); > + skb->dev = dev; > + skb->protocol = buf->proto; > +// skb->ip_summed = CHECKSUM_UNNECESSARY; > + zhp->stats.rx_packets++; > + zhp->stats.rx_bytes += buf->len; > + netif_receive_skb(skb); > + numpackets--; > + (*budget)--; > + dev->quota--; > + count++; > + } else > + zhp->stats.rx_dropped++; > + buffer_status = vnet_rx_packet(&control->s2pmit); > + if (buffer_status & QUEUE_IS_EMPTY) > + goto empty; > + } > + return 1; //please ask us again > +empty: > + netif_rx_complete(dev); > + /* we might have raced against a wakup*/ > + if (!vnet_q_empty(atomic_read(&control->s2pmit))) { > + if (netif_rx_reschedule(dev, count)) > + goto loop; > + } > + return 0; > +} > + > + > +static int > +vnet_l2_poll(struct net_device *dev, int *budget) > +{ > + struct vnet_port *zp = dev->priv; > + struct vnet_host_port *zhp = zp->priv; > + struct vnet_control *control = zp->control; > + struct xmit_buffer *buf; > + struct sk_buff *skb; > + int pkid, count, numpackets = min(64, min(dev->quota, *budget)); > + int buffer_status; > + > + if (vnet_q_empty(atomic_read(&control->s2pmit))) { > + count = 0; > + goto empty; > + } > +loop: > + count = 0; > + while(numpackets) { > + pkid = __nextr(atomic_read(&control->s2pmit)); > + buf = &control->s2pbufs[pkid]; > + skb = dev_alloc_skb(buf->len + 2); > + if (likely(skb)) { > + skb_reserve(skb, 2); > 
+ vnet_copy_pages_to_buf(skb_put(skb, buf->len), > + zp->s2p_data[pkid], buf->len); > + skb->dev = dev; > + skb->protocol = eth_type_trans(skb, dev); > +// skb->ip_summed = CHECKSUM_UNNECESSARY; > + zhp->stats.rx_packets++; > + zhp->stats.rx_bytes += buf->len; > + netif_receive_skb(skb); > + numpackets--; > + (*budget)--; > + dev->quota--; > + count++; > + } else > + zhp->stats.rx_dropped++; > + buffer_status = vnet_rx_packet(&control->s2pmit); > + if (buffer_status & QUEUE_IS_EMPTY) > + goto empty; > + } > + return 1; //please ask us again > +empty: > + netif_rx_complete(dev); > + /* we might have raced against a wakup*/ > + if (!vnet_q_empty(atomic_read(&control->s2pmit))) { > + if (netif_rx_reschedule(dev, count)) > + goto loop; > + } > + return 0; > +} > + > +static struct net_device_stats * > +vnet_net_stats(struct net_device *dev) > +{ > + struct vnet_port *zp; > + struct vnet_host_port *zhp; > + > + zp = dev->priv; > + zhp = zp->priv; > + return &zhp->stats; > +} > + > +static int > +vnet_net_change_mtu(struct net_device *dev, int new_mtu) > +{ > + if (new_mtu <= ETH_ZLEN) > + return -ERANGE; > + if (new_mtu > VNET_BUFFER_SIZE-ETH_HLEN) > + return -ERANGE; > + dev->mtu = new_mtu; > + return 0; > +} > + > +static void > +__vnet_common_init(struct net_device *dev) > +{ > + dev->open = vnet_net_open; > + dev->stop = vnet_net_stop; > + dev->hard_start_xmit = vnet_net_xmit; > + dev->get_stats = vnet_net_stats; > + dev->tx_timeout = vnet_net_tx_timeout; > + dev->watchdog_timeo = VNET_TIMEOUT; > + dev->change_mtu = vnet_net_change_mtu; > + dev->weight = 64; > + //dev->features |= NETIF_F_NO_CSUM | NETIF_F_LLTX; > + dev->features |= NETIF_F_LLTX; > +} > + > +static void > +__vnet_layer3_init(struct net_device *dev) > +{ > + dev->mtu = ETH_DATA_LEN; > + dev->tx_queue_len = 1000; > + dev->flags = IFF_BROADCAST|IFF_MULTICAST|IFF_NOARP; > + dev->type = ARPHRD_PPP; > + dev->mtu = 1492; > + dev->poll = vnet_l3_poll; > + __vnet_common_init(dev); > +} > + > +static void > 
+__vnet_layer2_init(struct net_device *dev) > +{ > + ether_setup(dev); > + random_ether_addr(dev->dev_addr); > + dev->mtu = 1492; > + dev->poll = vnet_l2_poll; > + __vnet_common_init(dev); > +} > + > +static void > +vnet_host_destroy(struct vnet_port *zhost) > +{ > + struct vnet_host_port *zhp; > + zhp = zhost->priv; > + > + vnet_port_detach(zhost); > + unregister_netdev(zhp->netdev); > + free_netdev(zhp->netdev); > + zhp->netdev = NULL; > + vnet_host_free(zhost); > + kfree(zhp); > + vnet_port_put(zhost); > +} > + > + > + > +struct vnet_port * > +vnet_host_create(char *name) > +{ > + int rc; > + struct vnet_port *port; > + struct vnet_host_port *host; > + char busname[BUS_ID_SIZE]; > + int minor; > + > + snprintf(busname, BUS_ID_SIZE, "host:%s", name); > + > + minor = vnet_minor_by_name(name); > + if (minor < 0) > + return NULL; > + port = vnet_port_get(minor, busname); > + if (!port) > + goto out; > + host = kzalloc(sizeof(struct vnet_host_port), GFP_KERNEL); > + if (!host) { > + kfree(port); > + port = NULL; > + goto out; > + } > + port->priv = host; > + rc =vnet_port_hostsetup(port); > + if (rc) > + goto out_free_host; > + rtnl_lock(); > + if (port->zs->linktype == 2) > + host->netdev = alloc_netdev(0, name, __vnet_layer2_init); > + else > + host->netdev = alloc_netdev(0, name, __vnet_layer3_init); > + if (!host->netdev) > + goto out_unlock; > + memcpy(port->mac, host->netdev->dev_addr, ETH_ALEN); > + > + host->netdev->priv = port; > + port->interrupt = vnet_host_interrupt; > + port->destroy = vnet_host_destroy; > + > + if (!register_netdevice(host->netdev)) { > + /* good case */ > + rtnl_unlock(); > + return port; > + } > + host->netdev->priv = NULL; > + free_netdev(host->netdev); > + host->netdev = NULL; > +out_unlock: > + rtnl_unlock(); > + vnet_host_free(port); > +out_free_host: > + vnet_port_put(port); > + port = NULL; > +out: > + return port; > +} > Index: linux-2.6.21/drivers/s390/guest/vnet_port_host.h > 
=================================================================== > --- /dev/null > +++ linux-2.6.21/drivers/s390/guest/vnet_port_host.h > @@ -0,0 +1,18 @@ > +/* > + * Copyright (C) 2005 IBM Corporation > + * Christian Borntraeger <cborntra-tA70FqPdS9bQT0dZR+AlfA@public.gmane.org> > + * > + */ > + > +#ifndef __VNET_PORTS_HOST_H > +#define __VNET_PORTS_HOST_H > + > +#include <linux/netdevice.h> > +#include "vnet_switch.h" > + > +struct vnet_host_port { > + struct net_device_stats stats; > + struct net_device *netdev; > +}; > +extern struct vnet_port * vnet_host_create(char *name); > +#endif > Index: linux-2.6.21/drivers/s390/guest/vnet_switch.c > =================================================================== > --- /dev/null > +++ linux-2.6.21/drivers/s390/guest/vnet_switch.c > @@ -0,0 +1,828 @@ > +/* > + * vnet zlswitch handling > + * > + * Copyright (C) 2005 IBM Corporation > + * Author: Carsten Otte <cotte-tA70FqPdS9bQT0dZR+AlfA@public.gmane.org> > + * Author: Christian Borntraeger <borntrae-tA70FqPdS9bQT0dZR+AlfA@public.gmane.org> > + * > + */ > + > +#include <linux/device.h> > +#include <linux/etherdevice.h> > +#include <linux/fs.h> > +#include <linux/if.h> > +#include <linux/if_ether.h> > +#include <linux/kernel.h> > +#include <linux/list.h> > +#include <linux/miscdevice.h> > +#include <linux/module.h> > +#include <linux/netdevice.h> > +#include <linux/rtnetlink.h> > +#include <linux/pagemap.h> > +#include <linux/spinlock.h> > + > +#include "vnet.h" > +#include "vnet_port_guest.h" > +#include "vnet_port_host.h" > +#include "vnet_switch.h" > + > +#define NUM_MINORS 1024 > + > +/* devices housekeeping, creation & destruction */ > +static LIST_HEAD(vnet_switches); > +static rwlock_t vnet_switches_lock = RW_LOCK_UNLOCKED; > +static struct class *zwitch_class; > +static int vnet_major; > +static struct device *root_dev; > + > + > +/* The following functions allow ports of the switch to know about > + * the MAC addresses of other ports. 
This is necessary for special > + * hardware like OSA express which silently drops incoming packets > + * that not match known MAC addresses and do not support promiscous > + * mode as well. We have to register all guest MAC addresses at OSA > + * make packet receive working */ > + > +/* Announces the own MAC address to all other ports > + * this function is called if a new port is added */ > +static void vnet_switch_add_mac(struct vnet_port *port) > +{ > + struct vnet_port *other_port; > + > + read_lock(&port->zs->ports_lock); > + list_for_each_entry(other_port, &port->zs->switch_ports, lh) > + if ((other_port != port) && (other_port->set_mac)) > + other_port->set_mac(other_port,port->mac, 1); > + read_unlock(&port->zs->ports_lock); > +} > + > +/* Removes the own MAC address from all other ports > + * this function is called if a port is detached*/ > +static void vnet_switch_del_mac(struct vnet_port *port) > +{ > + struct vnet_port *other_port; > + > + read_lock(&port->zs->ports_lock); > + list_for_each_entry(other_port, &port->zs->switch_ports, lh) > + if (other_port->set_mac) > + other_port->set_mac(other_port, port->mac, 0); > + read_unlock(&port->zs->ports_lock); > +} > + > +/* Learn MACs from other ports on the same zwitch and forward > + * the MAC addresses to the set_mac function of the port.*/ > +static void __vnet_port_learn_macs(struct vnet_port *port) > +{ > + struct vnet_port *other_port; > + > + if (!port->set_mac) > + return; > + list_for_each_entry(other_port, &port->zs->switch_ports, lh) > + if (other_port != port) > + port->set_mac(port, other_port->mac, 1); > +} > + > +/* Unlearn MACS from other ports on the same zwitch */ > +static void __vnet_port_unlearn_macs(struct vnet_port *port) > +{ > + struct vnet_port *other_port; > + > + if (!port->set_mac) > + return; > + list_for_each_entry(other_port, &port->zs->switch_ports, lh) > + if (other_port != port) > + port->set_mac(port, other_port->mac, 0); > +} > + > + > +static struct vnet_switch 
*__vnet_switch_by_minor(int minor) > +{ > + struct vnet_switch *zs; > + > + list_for_each_entry(zs, &vnet_switches, lh) { > + if (MINOR(zs->cdev.dev) == minor) > + return zs; > + } > + return NULL; > +} > + > +static struct vnet_switch *__vnet_switch_by_name(char *name) > +{ > + struct vnet_switch *zs; > + > + list_for_each_entry(zs, &vnet_switches, lh) > + if (strncmp(zs->name, name, ZWITCH_NAME_SIZE) == 0) > + return zs; > + return NULL; > +} > + > +/* Returns a switch structure and increases the reference count. If no such > + * switch exists a new one is created with reference count 1 */ > +static struct vnet_switch *zwitch_get(int minor) > +{ > + struct vnet_switch *zs; > + > + read_lock(&vnet_switches_lock); > + zs = __vnet_switch_by_minor(minor); > + if (!zs) { > + read_unlock(&vnet_switches_lock); > + return zs; > + } > + get_device(&zs->dev); > + read_unlock(&vnet_switches_lock); > + return zs; > +} > + > +/* reduces the reference count of the switch. */ > +static void zwitch_put(struct vnet_switch * zs) > +{ > + put_device(&zs->dev); > +} > + > +/* looks into the packet and searches a matching MAC address > + * return NULL if unknown or broadcast */ > +static struct vnet_port *__vnet_find_l2(struct vnet_switch *zs, char *data) > +{ > + //FIXME: make this a hash lookup, more macs per device? > + struct vnet_port *port; > + > + if (is_multicast_ether_addr(data)) > + return NULL; > + list_for_each_entry(port, &zs->switch_ports, lh) { > + if (compare_ether_addr(port->mac, data)==0) > + goto out; > + } > + port = NULL; > + out: > + return port; > +} > + > +/* searches the destination for IP only interfaces. 
Normally routing > + * is the way to go, but guests should see the net transparently without > + * a hop in between*/ > +static struct vnet_port *__vnet_find_l3(struct vnet_switch *zs, char *data) > +{ > + return NULL; > +} > + > +static struct vnet_port * __vnet_find_destination(struct vnet_switch *zs, > + char *data) > +{ > + switch (zs->linktype) { > + case 2: > + return __vnet_find_l2(zs, data); > + case 3: > + return __vnet_find_l3(zs, data); > + default: > + BUG(); > + } > +} > + > +/* copies len bytes of data from the memory specified by the list of > + * pointers **from into the memory specified by the list of pointers **to > + * with each pointer pointing to a page */ > +static void > +vnet_switch_page_copy(void **to, void **from, int len) > +{ > + int remaining=len; > + int pageid = 0; > + int amount; > + > + while(remaining) { > + amount = min((int)PAGE_SIZE, remaining); > + memcpy(to[pageid], from[pageid], amount); > + pageid++; > + remaining -= amount; > + } > +} > + > +/* copies to data into a buffer of destination > + * returns 0 if ok*/ > +static int > +vnet_unicast(struct vnet_port *destination, void **from_data, int len, int proto) > +{ > + int pkid; > + int buffer_status; > + void **to_data; > + struct vnet_control *control; > + > + control = destination->control; > + spin_lock_bh(&destination->rxlock); > + if (vnet_q_full(atomic_read(&control->s2pmit))) { > + destination->rx_dropped++; > + spin_unlock_bh(&destination->rxlock); > + return -ENOBUFS; > + } > + pkid = __nextx(atomic_read(&control->s2pmit)); > + to_data = destination->s2p_data[pkid]; > + vnet_switch_page_copy(to_data, from_data, len); > + control->s2pbufs[pkid].len = len; > + control->s2pbufs[pkid].proto = proto; > + buffer_status = vnet_tx_packet(&control->s2pmit); > + spin_unlock_bh(&destination->rxlock); > + if (buffer_status & QUEUE_WAS_EMPTY) > + destination->interrupt(destination, VNET_IRQ_START_RX); > + destination->rx_bytes += len; > + destination->rx_packets++; > + return 0; 
> +} > + > +/* send packets to all ports and emulate broadcasts via unicasts*/ > +static int vnet_allcast(struct vnet_port *from_port, void **fromdata, > + int len, int proto) > +{ > + struct vnet_port *destination; > + int failure = 0; > + > + list_for_each_entry(destination, &from_port->zs->switch_ports, lh) > + if (destination != from_port) > + failure |= vnet_unicast(destination, fromdata, > + len, proto); > + return failure; > +} > + > +/* takes an incoming packet and forwards it to the right port > + * if a failure occurs, increase the tx_dropped count of the sender*/ > +static void vnet_switch_packet(struct vnet_port *from_port, > + void **from_data, int len, int proto) > +{ > + struct vnet_port *destination; > + int failure; > + > + read_lock(&from_port->zs->ports_lock); > + destination = __vnet_find_destination(from_port->zs, from_data[0]); > + /* we dont want to loop. FIXME: document when this can happen*/ > + if (destination == from_port) { > + read_unlock(&from_port->zs->ports_lock); > + return; > + } > + if (destination) > + failure = vnet_unicast(destination, from_data, len, proto); > + else > + failure = vnet_allcast(from_port, from_data, len, proto); > + read_unlock(&from_port->zs->ports_lock); > + if (failure) > + from_port->tx_dropped++; > + else { > + from_port->tx_packets++; > + from_port->tx_bytes += len; > + } > +} > + > +static void vnet_port_release(struct device *dev) > +{ > + struct vnet_port *port; > + > + port = container_of(dev, struct vnet_port, dev); > + zwitch_put(port->zs); > + kfree(port); > +} > + > +static ssize_t vnet_port_read_mac(struct device *dev, > + struct device_attribute *attr, > + char *buf) > +{ > + struct vnet_port *port; > + > + port = container_of(dev, struct vnet_port, dev); > + return sprintf(buf,"%02X:%02X:%02X:%02X:%02X:%02X", port->mac[0], > + port->mac[1], port->mac[2], port->mac[3], > + port->mac[4], port->mac[5]); > +} > + > +static ssize_t vnet_port_read_tx_bytes(struct device *dev, > + struct 
device_attribute *attr, > + char *buf) > +{ > + struct vnet_port *port; > + > + port = container_of(dev, struct vnet_port, dev); > + return sprintf(buf,"%lu", port->tx_bytes); > +} > + > +static ssize_t vnet_port_read_rx_bytes(struct device *dev, > + struct device_attribute *attr, > + char *buf) > +{ > + struct vnet_port *port; > + > + port = container_of(dev, struct vnet_port, dev); > + return sprintf(buf,"%lu", port->rx_bytes); > +} > + > +static ssize_t vnet_port_read_tx_packets(struct device *dev, > + struct device_attribute *attr, > + char *buf) > +{ > + struct vnet_port *port; > + > + port = container_of(dev, struct vnet_port, dev); > + return sprintf(buf,"%lu", port->tx_packets); > +} > + > +static ssize_t vnet_port_read_rx_packets(struct device *dev, > + struct device_attribute *attr, > + char *buf) > +{ > + struct vnet_port *port; > + > + port = container_of(dev, struct vnet_port, dev); > + return sprintf(buf,"%lu", port->rx_packets); > +} > + > +static ssize_t vnet_port_read_tx_dropped(struct device *dev, > + struct device_attribute *attr, > + char *buf) > +{ > + struct vnet_port *port; > + > + port = container_of(dev, struct vnet_port, dev); > + return sprintf(buf,"%lu", port->tx_dropped); > +} > + > +static ssize_t vnet_port_read_rx_dropped(struct device *dev, > + struct device_attribute *attr, > + char *buf) > +{ > + struct vnet_port *port; > + > + port = container_of(dev, struct vnet_port, dev); > + return sprintf(buf,"%lu", port->rx_dropped); > +} > + > +static DEVICE_ATTR(mac, S_IRUSR, vnet_port_read_mac, NULL); > +static DEVICE_ATTR(tx_bytes, S_IRUSR, vnet_port_read_tx_bytes, NULL); > +static DEVICE_ATTR(rx_bytes, S_IRUSR, vnet_port_read_rx_bytes, NULL); > +static DEVICE_ATTR(tx_packets, S_IRUSR, vnet_port_read_tx_packets, NULL); > +static DEVICE_ATTR(rx_packets, S_IRUSR, vnet_port_read_rx_packets, NULL); > +static DEVICE_ATTR(tx_dropped, S_IRUSR, vnet_port_read_tx_dropped, NULL); > +static DEVICE_ATTR(rx_dropped, S_IRUSR, 
vnet_port_read_rx_dropped, NULL); > + > +static int vnet_port_attributes(struct device *dev) > +{ > + int rc; > + rc = device_create_file(dev, &dev_attr_mac); > + if (rc) > + return rc; > + rc = device_create_file(dev, &dev_attr_tx_dropped); > + if (rc) > + return rc; > + rc = device_create_file(dev, &dev_attr_rx_dropped); > + if (rc) > + return rc; > + rc = device_create_file(dev, &dev_attr_rx_bytes); > + if (rc) > + return rc; > + rc = device_create_file(dev, &dev_attr_tx_bytes); > + if (rc) > + return rc; > + rc = device_create_file(dev, &dev_attr_rx_packets); > + if (rc) > + return rc; > + rc = device_create_file(dev, &dev_attr_tx_packets); > + return rc; > +} > + > + > +//FIXME implement this > +static int vnet_port_exists(struct vnet_switch *zs, char *name) > +{ > + read_lock(&zs->ports_lock); > + read_unlock(&zs->ports_lock); > + return 0; > + > +} > + > +static struct vnet_port *vnet_port_create(struct vnet_switch *zs, > + char *name) > +{ > + struct vnet_port *port; > + > + if (vnet_port_exists(zs, name)) > + return NULL; > + > + port = kzalloc(sizeof(*port), GFP_KERNEL); > + if (port) { > + spin_lock_init(&port->rxlock); > + spin_lock_init(&port->txlock); > + INIT_LIST_HEAD(&port->lh); > + port->zs = zs; > + } else > + return NULL; > + port->dev.parent = &zs->dev; > + port->dev.release = vnet_port_release; > + strncpy(port->dev.bus_id, name, BUS_ID_SIZE); > + if (device_register(&port->dev)) { > + kfree(port); > + return NULL; > + } > + if (vnet_port_attributes(&port->dev)) { > + device_unregister(&port->dev); > + kfree(port); > + return NULL; > + } > + return port; > +} > + > +/*------------------------ switch creation/Destruction/housekeeping---------*/ > + > +static void zwitch_destroy_ports(struct vnet_switch *zs) > +{ > + struct vnet_port *port, *tmp; > + > + list_for_each_entry_safe(port, tmp, &zs->switch_ports, lh) { > + if (port->destroy) > + port->destroy(port); > + else > + printk("No destroy function for port\n"); > + } > +} > + > + > +static 
void zwitch_destroy(struct vnet_switch *zs) > +{ > + class_device_destroy(zwitch_class, zs->cdev.dev); > + cdev_del(&zs->cdev); > + device_unregister(&zs->dev); > +} > + > +static void zwitch_release(struct device *dev) > +{ > + struct vnet_switch *zs; > + > + zs = container_of(dev, struct vnet_switch, dev); > + kfree(zs); > +} > + > +static int __zwitch_get_minor(void) > +{ > + int d, found; > + struct vnet_switch *zs; > + > + for (d=0; d< NUM_MINORS; d++) { > + found = 0; > + list_for_each_entry(zs, &vnet_switches, lh) > + if (MINOR(zs->cdev.dev) == d) > + found++; > + if (!found) break; > + } > + if (found) return -ENODEV; > + return d; > +} > + > +/* > + * checks if this name already exists for a zwitch > + */ > +static int __zwitch_check_name(char *name) > +{ > + struct vnet_switch *zs; > + > + list_for_each_entry(zs, &vnet_switches, lh) > + if (!strncmp(name, zs->name, ZWITCH_NAME_SIZE)) > + return -EEXIST; > + return 0; > +} > + > +static int zwitch_create(char *name, int linktype) > +{ > + struct vnet_switch *zs; > + int minor; > + int ret; > + > + if ((linktype < 2) || (linktype > 3)) > + return -EINVAL; > + zs = kzalloc(sizeof(*zs), GFP_KERNEL); > + if (!zs) { > + printk("Creation of %s failed: out of memory\n", name); > + return -ENOMEM; > + } > + zs->linktype = linktype; > + strncpy(zs->name, name, ZWITCH_NAME_SIZE); > + rwlock_init(&zs->ports_lock); > + INIT_LIST_HEAD(&zs->switch_ports); > + > + write_lock(&vnet_switches_lock); > + minor = __zwitch_get_minor(); > + if (minor < 0) { > + write_unlock(&vnet_switches_lock); > + printk("Creation of %s failed: No free minor number\n", name); > + kfree(zs); > + return minor; > + } > + if (__zwitch_check_name(zs->name)) { > + write_unlock(&vnet_switches_lock); > + printk("Creation of %s failed: name exists\n", name); > + kfree(zs); > + return -EEXIST; > + } > + list_add_tail(&zs->lh, &vnet_switches); > + write_unlock(&vnet_switches_lock); > + strncpy(zs->dev.bus_id, name, min((int) strlen(name), > + 
ZWITCH_NAME_SIZE)); > + zs->dev.parent = root_dev; > + zs->dev.release = zwitch_release; > + ret = device_register(&zs->dev); > + if (ret) { > + write_lock(&vnet_switches_lock); > + list_del(&zs->lh); > + write_unlock(&vnet_switches_lock); > + printk("Creation of %s failed: no device\n",name); > + return ret; > + } > + vnet_cdev_init(&zs->cdev); > + cdev_add(&zs->cdev, MKDEV(vnet_major, minor), 1); > + zs->class_device = class_device_create(zwitch_class, NULL, > + zs->cdev.dev, &zs->dev, name); > + if (IS_ERR(zs->class_device)) { > + cdev_del(&zs->cdev); > + write_lock(&vnet_switches_lock); > + list_del(&zs->lh); > + write_unlock(&vnet_switches_lock); > + printk("Creation of %s failed: no class_device\n", name); > + device_unregister(&zs->dev); > + return PTR_ERR(zs->class_device); > + } > + return 0; > +} > + > + > +static int zwitch_delete(char *name) > +{ > + struct vnet_switch *zs; > + > + write_lock(&vnet_switches_lock); > + zs = __vnet_switch_by_name(name); > + if (!zs) { > + write_unlock(&vnet_switches_lock); > + return -ENOENT; > + } > + list_del(&zs->lh); > + write_unlock(&vnet_switches_lock); > + zwitch_destroy_ports(zs); > + zwitch_destroy(zs); > + return 0; > +} > + > +/* checks if a switch for the given minor exists > + * if yes, create an unconnected port on this switch > + * if no, return NULL */ > +struct vnet_port *vnet_port_get(int minor, char *port_name) > +{ > + struct vnet_switch *zs; > + struct vnet_port *port; > + > + zs = zwitch_get(minor); > + if (!zs) > + return NULL; > + port = vnet_port_create(zs, port_name); > + if (!port) > + zwitch_put(zs); > + return port; > +} > + > +/* attaches the port to the switch. 
The port must be > + * fully initialized, as it may get called immediately afterwards */ > +void vnet_port_attach(struct vnet_port *port) > +{ > + write_lock_bh(&port->zs->ports_lock); > + __vnet_port_learn_macs(port); > + list_add(&port->lh, &port->zs->switch_ports); > + write_unlock_bh(&port->zs->ports_lock); > + vnet_switch_add_mac(port); > + return; > +} > + > +/* detaches the port from the switch. After that, > + * no calls into the port are made */ > +void vnet_port_detach(struct vnet_port *port) > +{ > + vnet_switch_del_mac(port); > + write_lock_bh(&port->zs->ports_lock); > + if (!list_empty(&port->lh)) > + list_del(&port->lh); > + __vnet_port_unlearn_macs(port); > + write_unlock_bh(&port->zs->ports_lock); > +} > + > +/* releases all resources allocated with vnet_port_get */ > +void vnet_port_put(struct vnet_port *port) > +{ > + BUG_ON(!list_empty(&port->lh) && (port->lh.next != LIST_POISON1)); > + device_unregister(&port->dev); > +} > + > +/* tell the switch that new data is available */ > +void vnet_port_rx(struct vnet_port *port) > +{ > + struct vnet_control *control; > + int pkid, rc; > + > + control = port->control; > + if (vnet_q_empty(atomic_read(&control->p2smit))) { > + printk(KERN_WARNING "vnet_switch: Empty buffer " > + "on interrupt\n"); > + return; > + } > + do { > + pkid = __nextr(atomic_read(&control->p2smit)); > + /* fire and forget.
Let the switch care about lost packets*/ > + vnet_switch_packet(port, port->p2s_data[pkid], > + control->p2sbufs[pkid].len, > + control->p2sbufs[pkid].proto); > + rc = vnet_rx_packet(&control->p2smit); > + if (rc & QUEUE_WAS_FULL) { > + port->interrupt(port, VNET_IRQ_START_TX); > + } > + } while (!(rc & QUEUE_IS_EMPTY)); > + return; > +} > + > +/* checks if the given address is locally attached to the switch*/ > +int vnet_address_is_local(struct vnet_switch *zs, char *address) > +{ > + struct vnet_port *port; > + > + read_lock(&zs->ports_lock); > + port = __vnet_find_destination(zs, address); > + read_unlock(&zs->ports_lock); > + return (port != NULL); > +} > + > + > +int vnet_minor_by_name(char *name) > +{ > + struct vnet_switch *zs; > + int ret; > + > + read_lock(&vnet_switches_lock); > + zs = __vnet_switch_by_name(name); > + if (zs) > + ret = MINOR(zs->cdev.dev); > + else > + ret = -ENODEV; > + read_unlock(&vnet_switches_lock); > + return ret; > +} > + > +static void vnet_root_release(struct device *dev) > +{ > + kfree(dev); > +} > + > + > +struct command { > + char *string1; > + char *string2; > +}; > + > +/*FIXME this is ugly. Dont worry: as soon as we have finalized the interface, > + this crap is going away. 
Still, it works.......*/ > +static long vnet_control_ioctl(struct file *f, unsigned int command, > + unsigned long data) > +{ > + char string1[BUS_ID_SIZE]; > + char string2[BUS_ID_SIZE]; > + struct command com; > + struct vnet_port *port; > + > + if (!capable(CAP_NET_ADMIN)) > + return -EPERM; > + if (copy_from_user(&com, (__user struct command*) data, sizeof(struct command))) > + return -EFAULT; > + if (copy_from_user(string1, (__user char *) com.string1, ZWITCH_NAME_SIZE)) > + return -EFAULT; > + if (command >=2) > + if (copy_from_user(string2, (__user char *) com.string2, ZWITCH_NAME_SIZE)) > + return -EFAULT; > + if (strnlen(string1, ZWITCH_NAME_SIZE) == ZWITCH_NAME_SIZE) > + return -EINVAL; > + switch(command) { > + case ADD_SWITCH: > + return zwitch_create(string1,3); > + case DEL_SWITCH: > + return zwitch_delete(string1); > + case ADD_HOST: > + port = vnet_host_create(string1); > + if (port) { > + vnet_port_attach(port); > + return 0; > + } else > + return -ENODEV; > + default: > + return -EINVAL; > + } > + return 0; > +} > + > +static int vnet_control_open(struct inode *inode, struct file *file) > +{ > + return 0; > +} > + > +static int vnet_control_release(struct inode *inode, struct file *file) > +{ > + return 0; > +} > + > +struct file_operations vnet_control_fops = { > + .open = vnet_control_open, > + .release = vnet_control_release, > + .unlocked_ioctl = &vnet_control_ioctl, > + .compat_ioctl = &vnet_control_ioctl, > +}; > + > +struct miscdevice vnet_control_device = { > + .minor = MISC_DYNAMIC_MINOR, > + .name = "vnet", > + .fops = &vnet_control_fops, > +}; > + > +int vnet_register_control_device(void) > +{ > + return misc_register(&vnet_control_device); > +} > + > +int __init vnet_switch_init(void) > +{ > + int ret; > + dev_t dev; > + > + zwitch_class = class_create(THIS_MODULE, "vnet"); > + if (IS_ERR(zwitch_class)) { > + printk(KERN_ERR "vnet_switch: class_create failed!\n"); > + ret = PTR_ERR(zwitch_class); > + goto out; > + } > + ret = 
alloc_chrdev_region(&dev, 0, NUM_MINORS, "vnet"); > + if (ret) { > + printk(KERN_ERR "vnet_switch: alloc_chrdev_region failed\n"); > + goto out_class; > + } > + vnet_major = MAJOR(dev); > + root_dev = kzalloc(sizeof(*root_dev), GFP_KERNEL); > + if (!root_dev) { > + printk(KERN_ERR "vnet_switch:allocation of device failed\n"); > + ret = -ENOMEM; > + goto out_chrdev; > + } > + strncpy(root_dev->bus_id, "vnet", 5); > + root_dev->release = vnet_root_release; > + ret =device_register(root_dev); > + if (ret) { > + printk(KERN_ERR "vnet_switch: could not register device\n"); > + kfree(root_dev); > + goto out_chrdev; > + } > + ret = vnet_register_control_device(); > + if (ret) { > + printk("vnet_switch: could not create control device\n"); > + goto out_dev; > + } > + printk ("vnet_switch loaded\n"); > +/* FIXME ---------- remove these static defines as soon as everyone has the > + * user tools */ > + { > + struct vnet_port *port; > + zwitch_create("myswitch0",2); > + zwitch_create("myswitch1",3); > + > + port = vnet_host_create("myswitch0"); > + if (port) > + vnet_port_attach(port); > + port = vnet_host_create("myswitch1"); > + if (port) > + vnet_port_attach(port); > + } > +/*-----------------------------------------------------------*/ > + return 0; > +out_dev: > + device_unregister(root_dev); > +out_chrdev: > + unregister_chrdev_region(MKDEV(vnet_major,0), NUM_MINORS); > +out_class: > + class_destroy(zwitch_class); > +out: > + return ret; > +} > + > +/* remove all existing vnet_zwitches in the system and unregister the > + * character device from the system */ > +void vnet_switch_exit(void) > +{ > + struct vnet_switch *zs, *tmp; > + list_for_each_entry_safe(zs, tmp, &vnet_switches, lh) { > + zwitch_destroy_ports(zs); > + zwitch_destroy(zs); > + } > + device_unregister(root_dev); > + misc_deregister(&vnet_control_device); > + unregister_chrdev_region(MKDEV(vnet_major,0), NUM_MINORS); > + class_destroy(zwitch_class); > + printk ("vnet_switch unloaded\n"); > +} > + > 
+module_init(vnet_switch_init); > +module_exit(vnet_switch_exit); > +MODULE_DESCRIPTION("VNET: Virtual switch for vnet interfaces"); > +MODULE_AUTHOR("Christian Borntraeger <borntrae-tA70FqPdS9bQT0dZR+AlfA@public.gmane.org>"); > +MODULE_LICENSE("GPL"); > Index: linux-2.6.21/drivers/s390/guest/vnet_switch.h > =================================================================== > --- /dev/null > +++ linux-2.6.21/drivers/s390/guest/vnet_switch.h > @@ -0,0 +1,119 @@ > +/* > + * vnet_switch - zlive insular communication knack switch > + * infrastructure for virtual switching of Linux guests running under Linux > + * > + * Copyright (C) 2005 IBM Corporation > + * Author: Carsten Otte <cotte-tA70FqPdS9bQT0dZR+AlfA@public.gmane.org> > + * Christian Borntraeger <borntrae-tA70FqPdS9bQT0dZR+AlfA@public.gmane.org> > + * > + */ > + > +#ifndef __VNET_SWITCH_H > +#define __VNET_SWITCH_H > + > +#include <linux/cdev.h> > +#include <linux/device.h> > +#include <linux/if_ether.h> > +#include <linux/spinlock.h> > + > +#include "vnet.h" > + > +/* defines for IOCTLs. interface should be replaced by something better */ > +#define ADD_SWITCH 0 > +#define DEL_SWITCH 1 > +#define ADD_OSA 2 > +#define DEL_OSA 3 > +#define ADD_HOST 4 > +#define DEL_HOST 5 > + > +/* min(IFNAMSIZ, BUS_ID_SIZE)*/ > +#define ZWITCH_NAME_SIZE 16 > + > +/* This structure describes a virtual switch for ports to userspace network > + * interfaces, e.g. 
in Linux under Linux environments*/ > +struct vnet_switch { > + struct list_head lh; > + char name[ZWITCH_NAME_SIZE]; > + struct list_head switch_ports; /* list of ports */ > + rwlock_t ports_lock; /* lock for switch_ports */ > + struct class_device *class_device; > + struct cdev cdev; > + struct device dev; > + struct vnet_port *osa; > + int linktype; /* 2=ethernet 3=IP */ > +}; > + > +/* description of a port of the vnet_switch */ > +struct vnet_port { > + struct list_head lh; > + struct vnet_switch *zs; > + struct vnet_control *control; > + void *s2p_data[VNET_QUEUE_LEN][(VNET_BUFFER_SIZE>>PAGE_SHIFT)]; > + void *p2s_data[VNET_QUEUE_LEN][(VNET_BUFFER_SIZE>>PAGE_SHIFT)]; > + char mac[ETH_ALEN]; > + void *priv; > + int (*set_mac) (struct vnet_port *port, char mac[ETH_ALEN], int add); > + void (*interrupt) (struct vnet_port *port, int type); > + void (*destroy) (struct vnet_port *port); > + struct device dev; > + unsigned long rx_packets; /* total packets received */ > + unsigned long tx_packets; /* total packets transmitted */ > + unsigned long rx_bytes; /* total bytes received */ > + unsigned long tx_bytes; /* total bytes transmitted */ > + unsigned long rx_dropped; /* no space in receive buffer */ > + unsigned long tx_dropped; /* no space in destination buffer */ > + spinlock_t rxlock; > + spinlock_t txlock; > +}; > + > + > +static inline int > +vnet_copy_buf_to_pages(void **data, char *buf, int len) > +{ > + int i; > + > + if (len == 0) > + return 0; > + for (i=0; i <= ((len - 1) >> PAGE_SHIFT); i++ ) > + memcpy(data[i], buf + i*PAGE_SIZE, min(PAGE_SIZE, len - i*PAGE_SIZE)); > + return len; > +} > + > +static inline int > +vnet_copy_pages_to_buf(char *buf, void **data, int len) > +{ > + int i; > + > + if (len == 0) > + return 0; > + for (i=0; i <= ((len -1) >> PAGE_SHIFT); i++ ) > + memcpy(buf + i*PAGE_SIZE, data[i], min(PAGE_SIZE, len - i*PAGE_SIZE)); > + return len; > +} > + > + > +/* checks if a switch with the given minor exists > + * if yes, create a named 
and unconnected port on > + this switch with the given name. If no, return NULL */ > +extern struct vnet_port *vnet_port_get(int minor, char *port_name); > + > +/* attaches the port to the switch. The port must be > + * fully initialized, as it may get data immediately afterwards */ > +extern void vnet_port_attach(struct vnet_port *port); > + > +/* detaches the port from the switch. After that, > + * no calls into the port are made */ > +extern void vnet_port_detach(struct vnet_port *port); > + > +/* releases all resources allocated with vnet_port_get */ > +extern void vnet_port_put(struct vnet_port *port); > + > +/* tell the switch that new data is available */ > +extern void vnet_port_rx(struct vnet_port *port); > + > +/* get the minor for a given name */ > +extern int vnet_minor_by_name(char *name); > + > +/* checks if the given address is locally attached to the switch */ > +extern int vnet_address_is_local(struct vnet_switch *zs, char *address); > +#endif > Index: linux-2.6.21/drivers/s390/guest/Makefile > =================================================================== > --- linux-2.6.21.orig/drivers/s390/guest/Makefile > +++ linux-2.6.21/drivers/s390/guest/Makefile > @@ -6,3 +6,6 @@ obj-$(CONFIG_GUEST_CONSOLE) += guest_con > obj-$(CONFIG_S390_GUEST) += vdev.o vdev_device.o > obj-$(CONFIG_VDISK) += vdisk.o vdisk_blk.o > obj-$(CONFIG_VNET_GUEST) += vnet_guest.o > +vnet_host-objs := vnet_switch.o vnet_port_guest.o vnet_port_host.o > +obj-$(CONFIG_VNET_HOST) += vnet_host.o > + > Index: linux-2.6.21/drivers/s390/net/Kconfig > =================================================================== > --- linux-2.6.21.orig/drivers/s390/net/Kconfig > +++ linux-2.6.21/drivers/s390/net/Kconfig > @@ -95,4 +95,16 @@ config VNET_GUEST > connection. > If you're not using host/guest support, say N. > > +config VNET_HOST > + tristate "virtual networking support (HOST)" > + depends on QETH && S390_HOST > + help > + This is the host part of the vnet guest network connection.
> + Say Y if you plan to host guests with network > + connection. The host part consists of a virtual switch, > + a host device, as well as a connection to the qeth > + driver. > + If you're not using this kernel for hosting guests, say N. > + > + > endmenu > > > > ------------------------------------------------------------------------- > This SF.net email is sponsored by DB2 Express > Download DB2 Express C - the FREE version of DB2 express and take > control of your XML. No limits. Just data. Click to get it now. > http://sourceforge.net/powerbar/db2/ > _______________________________________________ > kvm-devel mailing list > kvm-devel-5NWGOfrQmneRv+LV9MX5uipxlwaOVQ5f@public.gmane.org > https://lists.sourceforge.net/lists/listinfo/kvm-devel > > ^ permalink raw reply [flat|nested] 104+ messages in thread
[parent not found: <4644D048.7060106-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>]
* Re: [PATCH/RFC 8/9] Virtual network host switch support [not found] ` <4644D048.7060106-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org> @ 2007-05-11 20:50 ` Christian Bornträger 0 siblings, 0 replies; 104+ messages in thread From: Christian Bornträger @ 2007-05-11 20:50 UTC (permalink / raw) To: kvm-devel-5NWGOfrQmneRv+LV9MX5uipxlwaOVQ5f; +Cc: Martin Schwidefsky On Friday 11 May 2007 22:21, Anthony Liguori wrote: > Any feel for the performance relative to the bridging code? The > bridging code is a pretty big bottle neck in guest=>guest communications > in Xen at least. Last time I checked we had quite decent guest-to-guest performance, in the gigabits/sec range. On the downside, the switch is quite aggressive about dropping packets, as the inbound buffer of the virtual network adapters only has space for 80 packets (that can be changed). > > > currently tested but not ready yet. We did not use the linux bridging code to > > allow non-root users to create virtual networks between guests. > > > > Is that the primary reason? If so, that seems like a rather large > hammer for something that a userspace suid wrapper could have addressed... Actually there are several reasons why we did not use the bridging code: - One is that a lot of OSA network cards do not support promiscuous mode. There is also the issue that a lot of OSA cards are in layer 3 mode (we get IP packets, not Ethernet frames), so bridging won't work towards the host interface. - non-root switches - the performance of bridging (we copy directly from one guest buffer to another without allocating an skb on the host) - we considered hooking into the qeth driver (for OSA cards) to deal with layer 3 mode. The first shot was actually a point-to-point driver (guest netif <--> host netif); we added the switch at a later time. Hmm, if we can make bridging work (with decent performance) on s390, that would reduce the maintenance work for us, as this network switch is far from complete.
cheers Christian ^ permalink raw reply [flat|nested] 104+ messages in thread
* [PATCH/RFC 9/9] Fix system<->user misaccount of interpreted execution [not found] ` <1178903957.25135.13.camel-WIxn4w2hgUz3YA32ykw5MLlKpX0K8NHHQQ4Iyu8u01E@public.gmane.org> ` (6 preceding siblings ...) 2007-05-11 17:36 ` [PATCH/RFC 8/9] Virtual network host switch support Carsten Otte @ 2007-05-11 17:36 ` Carsten Otte 7 siblings, 0 replies; 104+ messages in thread From: Carsten Otte @ 2007-05-11 17:36 UTC (permalink / raw) To: kvm-devel-5NWGOfrQmneRv+LV9MX5uipxlwaOVQ5f@public.gmane.org Cc: Christian Borntraeger, Martin Schwidefsky From: Christian Borntraeger <cborntra-tA70FqPdS9bQT0dZR+AlfA@public.gmane.org> This patch fixes the accounting of guest cpu time. As sie is executed via a system call, all guest operations were accounted as system time. To fix this we define a per-thread "sie context". Before issuing the sie instruction we enter this context and leave it afterwards. enter_sie and exit_sie call account_system_vtime, which now checks whether we are in sie context. We define time spent in sie context to be accounted as user time. Possible future enhancement: we could add an additional field, "interpretation time", to cpu stat and process time. Thus we could differentiate between user time in the host and host user time spent for guests. The main challenge is the necessary user space change. Therefore, we could export the interpretation time with a new interface. To be defined.
Signed-off-By: Christian Borntraeger <cborntra-tA70FqPdS9bQT0dZR+AlfA@public.gmane.org> Signed-off-By: Carsten Otte <cotte-tA70FqPdS9bQT0dZR+AlfA@public.gmane.org> --- arch/s390/Kconfig | 1 + arch/s390/host/s390host.c | 15 +++++++++++++++ arch/s390/kernel/process.c | 1 + arch/s390/kernel/vtime.c | 11 ++++++++++- include/asm-s390/thread_info.h | 2 ++ 5 files changed, 29 insertions(+), 1 deletion(-) Index: linux-2.6.21/arch/s390/kernel/vtime.c =================================================================== --- linux-2.6.21.orig/arch/s390/kernel/vtime.c +++ linux-2.6.21/arch/s390/kernel/vtime.c @@ -97,6 +97,11 @@ void account_vtime(struct task_struct *t account_system_time(tsk, 0, cputime); } +static inline int task_is_in_sie(struct thread_info *thread) +{ + return thread->in_sie; +} + /* * Update process times based on virtual cpu times stored by entry.S * to the lowcore fields user_timer, system_timer & steal_clock. @@ -114,7 +119,11 @@ void account_system_vtime(struct task_st cputime = S390_lowcore.system_timer >> 12; S390_lowcore.system_timer -= cputime << 12; S390_lowcore.steal_clock -= cputime << 12; - account_system_time(tsk, 0, cputime); + + if (task_is_in_sie(tsk->thread_info) && !hardirq_count() && !softirq_count()) + account_user_time(tsk, cputime); + else + account_system_time(tsk, 0, cputime); } static inline void set_vtimer(__u64 expires) Index: linux-2.6.21/arch/s390/host/s390host.c =================================================================== --- linux-2.6.21.orig/arch/s390/host/s390host.c +++ linux-2.6.21/arch/s390/host/s390host.c @@ -27,6 +27,19 @@ static int s390host_do_action(unsigned l static DEFINE_MUTEX(s390host_init_mutex); +static void enter_sie(void) +{ + account_system_vtime(current); + current_thread_info()->in_sie = 1; +} + +static void exit_sie(void) +{ + account_system_vtime(current); + current_thread_info()->in_sie = 0; +} + + static void s390host_get_data(struct s390host_data *data) { atomic_inc(&data->count); @@ -297,7 
+310,9 @@ again: schedule(); sie_kernel->sie_block.icptcode = 0; + enter_sie(); ret = sie64a(sie_kernel); + exit_sie(); if (ret) goto out; Index: linux-2.6.21/include/asm-s390/thread_info.h =================================================================== --- linux-2.6.21.orig/include/asm-s390/thread_info.h +++ linux-2.6.21/include/asm-s390/thread_info.h @@ -55,6 +55,7 @@ struct thread_info { struct restart_block restart_block; struct s390host_data *s390host_data; /* s390host data */ int sie_cpu; /* sie cpu number */ + int in_sie; /* 1 => cpu is in sie*/ }; /* @@ -72,6 +73,7 @@ struct thread_info { }, \ .s390host_data = NULL, \ .sie_cpu = 0, \ + .in_sie = 0, \ } #define init_thread_info (init_thread_union.thread_info) Index: linux-2.6.21/arch/s390/kernel/process.c =================================================================== --- linux-2.6.21.orig/arch/s390/kernel/process.c +++ linux-2.6.21/arch/s390/kernel/process.c @@ -278,6 +278,7 @@ int copy_thread(int nr, unsigned long cl memset(&p->thread.per_info,0,sizeof(p->thread.per_info)); p->thread_info->s390host_data = NULL; p->thread_info->sie_cpu = -1; + p->thread_info->in_sie = 0; return 0; } Index: linux-2.6.21/arch/s390/Kconfig =================================================================== --- linux-2.6.21.orig/arch/s390/Kconfig +++ linux-2.6.21/arch/s390/Kconfig @@ -519,6 +519,7 @@ config S390_HOST bool "s390 host support (EXPERIMENTAL)" depends on 64BIT && EXPERIMENTAL select S390_SWITCH_AMODE + select VIRT_CPU_ACCOUNTING help Select this option if you want to host guest Linux images ^ permalink raw reply [flat|nested] 104+ messages in thread
end of thread, other threads:[~2007-05-24 0:07 UTC | newest]
Thread overview: 104+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
[not found] <1178903957.25135.13.camel@cotte.boeblingen.de.ibm.com>
[not found] ` <1178903957.25135.13.camel-WIxn4w2hgUz3YA32ykw5MLlKpX0K8NHHQQ4Iyu8u01E@public.gmane.org>
2007-05-11 17:35 ` [PATCH/RFC 2/9] s390 virtualization interface Carsten Otte
2007-05-11 17:35 ` [PATCH/RFC 3/9] s390 guest detection Carsten Otte
2007-05-11 17:35 ` [PATCH/RFC 4/9] Basic guest virtual devices infrastructure Carsten Otte
[not found] ` <1178904958.25135.31.camel-WIxn4w2hgUz3YA32ykw5MLlKpX0K8NHHQQ4Iyu8u01E@public.gmane.org>
2007-05-11 20:06 ` Arnd Bergmann
2007-05-14 11:26 ` Avi Kivity
[not found] ` <46484753.30602-atKUWr5tajBWk0Htik3J/w@public.gmane.org>
2007-05-14 11:43 ` Carsten Otte
[not found] ` <46484B5D.6080605-tA70FqPdS9bQT0dZR+AlfA@public.gmane.org>
2007-05-14 12:00 ` [PATCH/RFC 4/9] Basic guest virtualdevices infrastructure Dor Laor
[not found] ` <64F9B87B6B770947A9F8391472E032160BC7483D-yEcIvxbTEBqsx+V+t5oei8rau4O3wl8o3fe8/T/H7NteoWH0uzbU5w@public.gmane.org>
2007-05-14 13:32 ` Carsten Otte
2007-05-11 17:36 ` [PATCH/RFC 5/9] s390 virtual console for guests Carsten Otte
[not found] ` <1178904960.25135.32.camel-WIxn4w2hgUz3YA32ykw5MLlKpX0K8NHHQQ4Iyu8u01E@public.gmane.org>
2007-05-11 19:00 ` Anthony Liguori
[not found] ` <4644BD3C.8040901-rdkfGonbjUSkNkDKm+mE6A@public.gmane.org>
2007-05-11 19:42 ` Christian Bornträger
2007-05-12 8:07 ` Carsten Otte
2007-05-14 16:23 ` Christian Bornträger
[not found] ` <200705141823.13424.cborntra-tA70FqPdS9bQT0dZR+AlfA@public.gmane.org>
2007-05-14 16:48 ` Christian Borntraeger
[not found] ` <200705141848.18996.borntrae-tA70FqPdS9bQT0dZR+AlfA@public.gmane.org>
2007-05-14 17:49 ` Anthony Liguori
[not found] ` <4648A11D.3050607-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>
2007-05-15 0:27 ` Arnd Bergmann
2007-05-15 7:54 ` Carsten Otte
2007-05-11 17:36 ` [PATCH/RFC 6/9] virtual block device driver Carsten Otte
[not found] ` <1178904963.25135.33.camel-WIxn4w2hgUz3YA32ykw5MLlKpX0K8NHHQQ4Iyu8u01E@public.gmane.org>
2007-05-14 11:49 ` Avi Kivity
[not found] ` <46484CDF.505-atKUWr5tajBWk0Htik3J/w@public.gmane.org>
2007-05-14 13:23 ` Carsten Otte
[not found] ` <464862E9.7020105-tA70FqPdS9bQT0dZR+AlfA@public.gmane.org>
2007-05-14 14:39 ` Avi Kivity
[not found] ` <46487494.1070802-atKUWr5tajBWk0Htik3J/w@public.gmane.org>
2007-05-15 11:47 ` Carsten Otte
[not found] ` <46499DE9.9090202-tA70FqPdS9bQT0dZR+AlfA@public.gmane.org>
2007-05-16 10:01 ` Avi Kivity
2007-05-14 11:52 ` Avi Kivity
[not found] ` <46484D84.3060601-atKUWr5tajBWk0Htik3J/w@public.gmane.org>
2007-05-14 13:26 ` Carsten Otte
2007-05-11 17:36 ` [PATCH/RFC 7/9] Virtual network guest " Carsten Otte
[not found] ` <1178904965.25135.34.camel-WIxn4w2hgUz3YA32ykw5MLlKpX0K8NHHQQ4Iyu8u01E@public.gmane.org>
2007-05-11 19:44 ` ron minnich
[not found] ` <13426df10705111244w1578ebedy8259bc42ca1f588d-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
2007-05-11 20:12 ` Anthony Liguori
[not found] ` <4644CE15.6080505-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>
2007-05-11 21:15 ` Eric Van Hensbergen
[not found] ` <a4e6962a0705111415n47e77a15o331b59cf2a03b4-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
2007-05-11 21:47 ` Anthony Liguori
[not found] ` <4644E456.2060507-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>
2007-05-11 22:21 ` Eric Van Hensbergen
[not found] ` <a4e6962a0705111521v2d451ddcjecf209e2031c85af-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
2007-05-16 17:28 ` Anthony Liguori
[not found] ` <464B3F20.4030904-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>
2007-05-16 17:38 ` Daniel P. Berrange
[not found] ` <20070516173822.GD16863-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
2007-05-17 9:29 ` Carsten Otte
[not found] ` <464C2069.20909-tA70FqPdS9bQT0dZR+AlfA@public.gmane.org>
2007-05-17 14:22 ` Anthony Liguori
[not found] ` <464C651F.5070700-rdkfGonbjUSkNkDKm+mE6A@public.gmane.org>
2007-05-21 11:11 ` Christian Borntraeger
2007-05-16 17:41 ` Eric Van Hensbergen
[not found] ` <a4e6962a0705161041s5393c1a6wc455b20ff3fe8106-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
2007-05-16 18:47 ` Anthony Liguori
[not found] ` <464B51A8.7050307-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>
2007-05-16 19:33 ` Eric Van Hensbergen
2007-05-16 17:45 ` Gregory Haskins
[not found] ` <464B0ADB.BA47.005A.0-Et1tbQHTxzrQT0dZR+AlfA@public.gmane.org>
2007-05-16 18:39 ` Anthony Liguori
[not found] ` <464B4FEB.7070300-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>
2007-05-16 18:57 ` Gregory Haskins
[not found] ` <464B1B9C.BA47.005A.0-Et1tbQHTxzrQT0dZR+AlfA@public.gmane.org>
2007-05-16 19:10 ` Anthony Liguori
[not found] ` <464B572C.6090800-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>
2007-05-17 4:24 ` Rusty Russell
[not found] ` <1179375881.21871.83.camel-bi+AKbBUZKY6gyzm1THtWbp2dZbC/Bob@public.gmane.org>
2007-05-17 16:13 ` Anthony Liguori
[not found] ` <464C7F45.50908-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>
2007-05-17 23:34 ` Rusty Russell
2007-05-21 9:07 ` Christian Borntraeger
[not found] ` <OFC1AADF6F.DB57C7AC-ON422572E2.0030BE22-422572E2.0032174C-tA70FqPdS9bQT0dZR+AlfA@public.gmane.org>
2007-05-21 9:27 ` Cornelia Huck
2007-05-21 11:28 ` Arnd Bergmann
[not found] ` <200705211328.04565.arnd-r2nGTMty4D4@public.gmane.org>
2007-05-21 11:56 ` Cornelia Huck
[not found] ` <20070521135628.17a4f9cc-XQvu0L+U/CiXI4yAdoq52KN5r0PSdgG1zG2AekJRRhI@public.gmane.org>
2007-05-21 13:53 ` Arnd Bergmann
2007-05-21 18:45 ` Anthony Liguori
2007-05-21 23:09 ` ron minnich
2007-05-22 0:29 ` Anthony Liguori
2007-05-22 0:45 ` ron minnich
2007-05-22 1:13 ` Anthony Liguori
2007-05-22 1:34 ` Eric Van Hensbergen
2007-05-22 1:42 ` Anthony Liguori
2007-05-22 5:17 ` Avi Kivity
2007-05-22 12:49 ` Eric Van Hensbergen
2007-05-22 12:56 ` Christoph Hellwig
2007-05-22 14:50 ` Eric Van Hensbergen
2007-05-22 15:05 ` Anthony Liguori
2007-05-22 15:31 ` ron minnich
2007-05-22 16:25 ` Eric Van Hensbergen
2007-05-22 17:00 ` ron minnich
2007-05-22 17:06 ` Christoph Hellwig
2007-05-22 17:34 ` ron minnich
2007-05-22 20:03 ` Dor Laor
2007-05-22 20:10 ` ron minnich
2007-05-22 22:56 ` Nakajima, Jun
2007-05-23 8:15 ` Carsten Otte
2007-05-23 12:25 ` Avi Kivity
2007-05-23 14:12 ` Eric Van Hensbergen
2007-05-23 23:02 ` Arnd Bergmann
2007-05-23 23:57 ` Eric Van Hensbergen
2007-05-24 0:07 ` Eric Van Hensbergen
2007-05-23 12:21 ` Avi Kivity
2007-05-23 12:16 ` Avi Kivity
2007-05-23 12:20 ` Christoph Hellwig
2007-05-23 12:20 ` Avi Kivity
2007-05-23 11:55 ` Avi Kivity
2007-05-22 13:08 ` Anthony Liguori
2007-05-18 5:31 ` ron minnich
2007-05-18 14:31 ` Anthony Liguori
2007-05-18 15:14 ` ron minnich
2007-05-11 21:51 ` ron minnich
2007-05-12 8:46 ` Carsten Otte
2007-05-13 12:04 ` Dor Laor
2007-05-13 14:49 ` Anthony Liguori
2007-05-13 16:23 ` Dor Laor
2007-05-13 16:49 ` Anthony Liguori
2007-05-13 17:06 ` Muli Ben-Yehuda
2007-05-13 20:31 ` Dor Laor
2007-05-14 2:39 ` Rusty Russell
2007-05-14 11:53 ` Avi Kivity
2007-05-14 12:05 ` Avi Kivity
2007-05-14 12:24 ` Christian Bornträger
2007-05-14 12:32 ` Avi Kivity
2007-05-14 13:36 ` Carsten Otte
2007-05-11 17:36 ` [PATCH/RFC 8/9] Virtual network host switch support Carsten Otte
2007-05-11 20:21 ` Anthony Liguori
2007-05-11 20:50 ` Christian Bornträger
2007-05-11 17:36 ` [PATCH/RFC 9/9] Fix system<->user misaccount of interpreted execution Carsten Otte