[PATCH] xen, xen-sparse: modify spinlocks to use directed yield

All of lore.kernel.org
 help / color / mirror / Atom feed

* [PATCH] xen, xen-sparse: modify spinlocks to use directed yield
@ 2005-05-20 16:54 Ryan Harper
  2005-05-20 18:16 ` Ryan Harper
                   ` (2 more replies)
  0 siblings, 3 replies; 25+ messages in thread
From: Ryan Harper @ 2005-05-20 16:54 UTC (permalink / raw)
  To: xen-devel

The following patch creates a new hypercall, do_confer() which allows a
vcpu to yield to another vcpu that is not currently running.  Also
included in the patch is modification to x86 spinlock code to use the
new hcall modeled on the ppc64 implementation which has an identical
hcall.  When a vcpu acquires a spin or write lock, the vcpu id of the
holder is recorded in the lock.  When a different vcpu attempts to
acquire the contested lock, the spinlock code will yield its timeslice
to the lock holder if they are not currently running rather than just
spinning until the lock holder's next timeslice.

There are a couple things not done with this patch.  First, I wasn't
able to determine if there was a guaranteed, scheduler independent way
of switching to another vcpu immediately.  Currently, the implementation
calls domain_wake() on the vcpu that is holding the lock, and then
yields.  This has two issues: 1) there is no guarantee that the
scheduler will pick the domain that has been woken next 2) it is wrong
IMHO to have the woken domain run for more than the remainder of the
caller's slice as this would be preferential to lock-holder vcpus.
Ideally I would like a way to donate the remainder of the current vcpu's
slice to a target vcpu and a SCHED_OP that allows the marking of a vcpu
as high priority that the various scheduler can implement in whatever
way makes sense.

Second, there is a conflict in Linux between CONFIG_PREEMPT and the
spinlock code that yields the vcpu to the lock holder.  That is, when
CONFIG_PREEMPT is enabled, the actually spinlock code that is used never
calls into the code that invokes the spin_yield() function which utilizes
the confer hcall.  I believe that the behavior is intentional as it
doesn't make sense to encourage the lock to be broken (what
CONFIG_PREEMPT spinlocks do) and then yield to a vcpu.  I am currently
investigating the performance difference between CONFIG_PREEMPT and
directed yielding in spinlocks and should have some numbers shortly to
see the perf trade off of these two options.

Trivial, but important, for some reason, the shared_info->shproc value
got reset to 0 at some point after it's initial allocation where I set
it to 1.  I've not tracked down why that is the case, but once that is
fixed, the the SHARED_PROCESSOR define in spinlocks.h will change to
check if shproc=1.  

Comments and questions requested.

-- 
Ryan Harper
Software Engineer; Linux Technology Center
IBM Corp., Austin, Tx
(512) 838-9253   T/L: 678-9253
ryanh@us.ibm.com


diffstat output:
 linux-2.6.11-xen-sparse/arch/i386/lib/Makefile                     |   11 
 linux-2.6.11-xen-sparse/arch/i386/lib/locks.c                      |   76 +++++
 linux-2.6.11-xen-sparse/arch/xen/configs/xenU-smp_defconfig_x86_32 |    4 
 linux-2.6.11-xen-sparse/arch/xen/i386/kernel/entry.S               |    2 
 linux-2.6.11-xen-sparse/include/asm-xen/asm-i386/hypercall.h       |   16 +
 linux-2.6.11-xen-sparse/include/asm-xen/asm-i386/spinlock.h        |  140 +++++++---
 xen/arch/x86/domain.c                                              |    2 
 xen/arch/x86/x86_32/entry.S                                        |    1 
 xen/common/domain.c                                                |    1 
 xen/common/schedule.c                                              |   69 ++++
 xen/include/public/xen.h                                           |   11 
 xen/include/xen/sched.h                                            |    9 
 12 files changed, 302 insertions(+), 40 deletions(-)

Signed-off-by: Ryan Harper <ryanh@us.ibm.com>
---
diff -urN b/linux-2.6.11-xen-sparse/arch/i386/lib/locks.c confer/linux-2.6.11-xen-sparse/arch/i386/lib/locks.c
--- b/linux-2.6.11-xen-sparse/arch/i386/lib/locks.c	1969-12-31 18:00:00.000000000 -0600
+++ confer/linux-2.6.11-xen-sparse/arch/i386/lib/locks.c	2005-05-20 10:37:58.300767080 -0500
@@ -0,0 +1,76 @@
+/*
+ * Spin and read/write lock operations.
+ *
+ * Copyright (C) 2001-2004 Paul Mackerras <paulus@au.ibm.com>, IBM
+ * Copyright (C) 2001 Anton Blanchard <anton@au.ibm.com>, IBM
+ * Copyright (C) 2002 Dave Engebretsen <engebret@us.ibm.com>, IBM
+ *   Rework to support virtual processors
+ * Copyright (C) 2005 Ryan Harper <ryanh@us.ibm.com>, IBM
+ *   Rework for Xen on x86
+ *
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public License
+ * as published by the Free Software Foundation; either version
+ * 2 of the License, or (at your option) any later version.
+ */
+
+#include <linux/config.h>
+#include <linux/kernel.h>
+#include <linux/spinlock.h>
+#include <linux/module.h>
+#include <linux/stringify.h>
+#include <asm/hypercall.h>
+#include <asm/processor.h>
+
+/* waiting for a spinlock... */
+#if defined(CONFIG_XEN) && defined(CONFIG_SMP)
+void __spin_yield(spinlock_t *lock)
+{
+	unsigned int lock_value, holder_cpu, yield_count;
+   shared_info_t *s = HYPERVISOR_shared_info;
+
+	lock_value = lock->slock;
+	if (lock_value == 1)
+		return;
+	holder_cpu = lock->cpu;
+	BUG_ON(holder_cpu >= NR_CPUS);
+	yield_count = s->vcpu_data[holder_cpu].yield_count;
+	if ((yield_count & 1) == 0)
+		return;		/* virtual cpu is currently running */
+	rmb();
+	if (lock->slock != lock_value)
+		return;		/* something has changed */
+   HYPERVISOR_confer(holder_cpu, yield_count);
+}
+
+void __rw_yield(rwlock_t *rw)
+{
+	unsigned int lock_value, holder_cpu, yield_count;
+   shared_info_t *s = HYPERVISOR_shared_info;
+
+	lock_value = rw->lock;
+	if (lock_value == RW_LOCK_BIAS)
+		return;
+	holder_cpu = rw->cpu;
+	BUG_ON(holder_cpu >= NR_CPUS);
+	yield_count = s->vcpu_data[holder_cpu].yield_count;
+	if ((yield_count & 1) == 0)
+		return;		/* virtual cpu is currently running */
+	rmb();
+	if (rw->lock != lock_value)
+		return;		/* something has changed */
+   HYPERVISOR_confer(holder_cpu, yield_count);
+}
+
+void spin_unlock_wait(spinlock_t *lock)
+{
+	while (spin_is_locked(lock)) {
+		cpu_relax();
+		if (SHARED_PROCESSOR)
+			__spin_yield(lock);
+	}
+	cpu_relax();
+}
+EXPORT_SYMBOL(spin_unlock_wait);
+#endif
diff -urN b/linux-2.6.11-xen-sparse/arch/i386/lib/Makefile confer/linux-2.6.11-xen-sparse/arch/i386/lib/Makefile
--- b/linux-2.6.11-xen-sparse/arch/i386/lib/Makefile	1969-12-31 18:00:00.000000000 -0600
+++ confer/linux-2.6.11-xen-sparse/arch/i386/lib/Makefile	2005-05-20 10:37:58.301766928 -0500
@@ -0,0 +1,11 @@
+#
+# Makefile for i386-specific library files..
+#
+
+
+lib-y = checksum.o delay.o usercopy.o getuser.o memcpy.o strstr.o \
+	bitops.o 
+
+lib-$(CONFIG_X86_USE_3DNOW) += mmx.o
+lib-$(CONFIG_HAVE_DEC_LOCK) += dec_and_lock.o
+lib-$(CONFIG_XEN) += locks.o
diff -urN b/linux-2.6.11-xen-sparse/arch/xen/configs/xenU-smp_defconfig_x86_32 confer/linux-2.6.11-xen-sparse/arch/xen/configs/xenU-smp_defconfig_x86_32
--- b/linux-2.6.11-xen-sparse/arch/xen/configs/xenU-smp_defconfig_x86_32	2005-05-20 10:39:36.826788848 -0500
+++ confer/linux-2.6.11-xen-sparse/arch/xen/configs/xenU-smp_defconfig_x86_32	2005-05-20 10:37:58.303766624 -0500
@@ -117,8 +117,8 @@
 CONFIG_SMP=y
 CONFIG_NR_CPUS=8
 # CONFIG_SCHED_SMT is not set
-CONFIG_PREEMPT=y
-CONFIG_PREEMPT_BKL=y
+# CONFIG_PREEMPT is not set
+# CONFIG_PREEMPT_BKL is not set
 CONFIG_X86_CPUID=y
 
 #
diff -urN b/linux-2.6.11-xen-sparse/arch/xen/i386/kernel/entry.S confer/linux-2.6.11-xen-sparse/arch/xen/i386/kernel/entry.S
--- b/linux-2.6.11-xen-sparse/arch/xen/i386/kernel/entry.S	2005-05-19 22:20:32.000000000 -0500
+++ confer/linux-2.6.11-xen-sparse/arch/xen/i386/kernel/entry.S	2005-05-20 10:37:58.304766472 -0500
@@ -80,7 +80,7 @@
 #define evtchn_upcall_pending		/* 0 */
 #define evtchn_upcall_mask		1
 
-#define sizeof_vcpu_shift		3
+#define sizeof_vcpu_shift		4
 
 #ifdef CONFIG_SMP
 #define preempt_disable(reg)	incl TI_preempt_count(reg)
diff -urN b/linux-2.6.11-xen-sparse/include/asm-xen/asm-i386/hypercall.h confer/linux-2.6.11-xen-sparse/include/asm-xen/asm-i386/hypercall.h
--- b/linux-2.6.11-xen-sparse/include/asm-xen/asm-i386/hypercall.h	2005-05-19 22:20:32.000000000 -0500
+++ confer/linux-2.6.11-xen-sparse/include/asm-xen/asm-i386/hypercall.h	2005-05-20 10:37:58.306766168 -0500
@@ -517,4 +517,20 @@
     return ret;
 }
 
+static inline int
+HYPERVISOR_confer(
+	unsigned int vcpu, unsigned int yield_count)
+{
+    int ret;
+    unsigned long ign1, ign2;
+
+    __asm__ __volatile__ (
+        TRAP_INSTR
+        : "=a" (ret), "=b" (ign1), "=c" (ign2)
+	: "0" (__HYPERVISOR_confer), "1" (vcpu), "2" (yield_count)
+	: "memory");
+
+    return ret;
+}
+
 #endif /* __HYPERCALL_H__ */
diff -urN b/linux-2.6.11-xen-sparse/include/asm-xen/asm-i386/spinlock.h confer/linux-2.6.11-xen-sparse/include/asm-xen/asm-i386/spinlock.h
--- b/linux-2.6.11-xen-sparse/include/asm-xen/asm-i386/spinlock.h	2005-05-19 22:20:14.000000000 -0500
+++ confer/linux-2.6.11-xen-sparse/include/asm-xen/asm-i386/spinlock.h	2005-05-20 10:37:58.307766016 -0500
@@ -22,10 +22,36 @@
 #ifdef CONFIG_PREEMPT
 	unsigned int break_lock;
 #endif
+#if defined(CONFIG_XEN) && defined(CONFIG_SMP)
+	unsigned int cpu;
+#endif
 } spinlock_t;
 
 #define SPINLOCK_MAGIC	0xdead4ead
 
+/*
+ * Read-write spinlocks, allowing multiple readers
+ * but only one writer.
+ *
+ * NOTE! it is quite common to have readers in interrupts
+ * but no interrupt writers. For those circumstances we
+ * can "mix" irq-safe locks - any writer needs to get a
+ * irq-safe write-lock, but readers can get non-irqsafe
+ * read-locks.
+ */
+typedef struct {
+	volatile unsigned int lock;
+#ifdef CONFIG_DEBUG_SPINLOCK
+	unsigned magic;
+#endif
+#ifdef CONFIG_PREEMPT
+	unsigned int break_lock;
+#endif
+#if defined(CONFIG_XEN) && defined(CONFIG_SMP)
+	unsigned int cpu;
+#endif
+} rwlock_t;
+
 #ifdef CONFIG_DEBUG_SPINLOCK
 #define SPINLOCK_MAGIC_INIT	, SPINLOCK_MAGIC
 #else
@@ -44,7 +70,20 @@
  */
 
 #define spin_is_locked(x)	(*(volatile signed char *)(&(x)->slock) <= 0)
+#if defined(CONFIG_XEN) && defined(CONFIG_SMP)
+#include <linux/smp.h>
+#define SPINLOCK_CPU (smp_processor_id())
+/* We only yield to the hypervisor if we are in shared processor mode */
+#define SHARED_PROCESSOR (HYPERVISOR_shared_info->shproc == 0)
+extern void __spin_yield(spinlock_t *lock);
+extern void __rw_yield(rwlock_t *rw);
+extern void spin_unlock_wait(spinlock_t *lock);
+#else
+#define __spin_yield(x) barrier()
+#define __rw_yield(x) barrier()
+#define SHARED_PROCESSOR 0
 #define spin_unlock_wait(x)	do { barrier(); } while(spin_is_locked(x))
+#endif
 
 #define spin_lock_string \
 	"\n1:\t" \
@@ -125,6 +164,9 @@
 		"xchgb %b0,%1"
 		:"=q" (oldval), "=m" (lock->slock)
 		:"0" (0) : "memory");
+#if defined(CONFIG_XEN) && defined(CONFIG_SMP)
+	lock->cpu = SPINLOCK_CPU;
+#endif
 	return oldval > 0;
 }
 
@@ -136,43 +178,55 @@
 		BUG();
 	}
 #endif
-	__asm__ __volatile__(
-		spin_lock_string
-		:"=m" (lock->slock) : : "memory");
+#if defined(CONFIG_XEN) && defined(CONFIG_SMP)
+	while (1) {
+		if ( likely(_raw_spin_trylock(lock)) )
+			break;
+		do {
+			cpu_relax();
+			if (SHARED_PROCESSOR)
+				__spin_yield(lock);
+		} while (likely(spin_is_locked(lock)));
+		cpu_relax();
+	}
+#else
+   __asm__ __volatile__(
+      spin_lock_string
+      :"=m" (lock->slock) : : "memory");
+#endif
 }
 
 static inline void _raw_spin_lock_flags (spinlock_t *lock, unsigned long flags)
 {
+#if defined(CONFIG_XEN) && defined(CONFIG_SMP)
+	unsigned long flags_dis;
+#endif
 #ifdef CONFIG_DEBUG_SPINLOCK
 	if (unlikely(lock->magic != SPINLOCK_MAGIC)) {
 		printk("eip: %p\n", __builtin_return_address(0));
 		BUG();
 	}
 #endif
-	__asm__ __volatile__(
-		spin_lock_string_flags
-		:"=m" (lock->slock) : "r" (flags) : "memory");
-}
-
-/*
- * Read-write spinlocks, allowing multiple readers
- * but only one writer.
- *
- * NOTE! it is quite common to have readers in interrupts
- * but no interrupt writers. For those circumstances we
- * can "mix" irq-safe locks - any writer needs to get a
- * irq-safe write-lock, but readers can get non-irqsafe
- * read-locks.
- */
-typedef struct {
-	volatile unsigned int lock;
-#ifdef CONFIG_DEBUG_SPINLOCK
-	unsigned magic;
-#endif
-#ifdef CONFIG_PREEMPT
-	unsigned int break_lock;
+#if defined(CONFIG_XEN) && defined(CONFIG_SMP)
+	while (1) {
+		if ( likely(_raw_spin_trylock(lock)) )
+			break;
+		local_save_flags(flags_dis);
+		local_irq_restore(flags);
+		do {
+			cpu_relax();
+			if (SHARED_PROCESSOR)
+				__spin_yield(lock);
+		} while (likely(spin_is_locked(lock)));
+		cpu_relax();
+		local_irq_restore(flags_dis);
+	}
+#else
+   __asm__ __volatile__(
+      spin_lock_string_flags
+      :"=m" (lock->slock) : "r" (flags) : "memory");
 #endif
-} rwlock_t;
+}
 
 #define RWLOCK_MAGIC	0xdeaf1eed
 
@@ -198,6 +252,18 @@
  */
 #define write_can_lock(x) ((x)->lock == RW_LOCK_BIAS)
 
+static inline int _raw_write_trylock(rwlock_t *lock)
+{
+	atomic_t *count = (atomic_t *)lock;
+	if (atomic_sub_and_test(RW_LOCK_BIAS, count)) {
+#if defined(CONFIG_XEN) && defined(CONFIG_SMP)
+      lock->cpu = SPINLOCK_CPU;
+#endif
+		return 1;
+   }
+	atomic_add(RW_LOCK_BIAS, count);
+	return 0;
+}
 /*
  * On x86, we implement read-write locks as a 32-bit counter
  * with the high bit (sign) being the "contended" bit.
@@ -222,7 +288,20 @@
 #ifdef CONFIG_DEBUG_SPINLOCK
 	BUG_ON(rw->magic != RWLOCK_MAGIC);
 #endif
+#if defined(CONFIG_XEN) && defined(CONFIG_SMP)
+	while (1) {
+		if ( likely(_raw_write_trylock(rw)) )
+			break;
+		do {
+			cpu_relax();
+			if (SHARED_PROCESSOR)
+				__rw_yield(rw);
+		} while ( likely(!write_can_lock(rw)));
+		cpu_relax();
+	}
+#else
 	__build_write_lock(rw, "__write_lock_failed");
+#endif
 }
 
 #define _raw_read_unlock(rw)		asm volatile("lock ; incl %0" :"=m" ((rw)->lock) : : "memory")
@@ -238,13 +317,6 @@
 	return 0;
 }
 
-static inline int _raw_write_trylock(rwlock_t *lock)
-{
-	atomic_t *count = (atomic_t *)lock;
-	if (atomic_sub_and_test(RW_LOCK_BIAS, count))
-		return 1;
-	atomic_add(RW_LOCK_BIAS, count);
-	return 0;
-}
+
 
 #endif /* __ASM_SPINLOCK_H */
diff -urN b/xen/arch/x86/domain.c confer/xen/arch/x86/domain.c
--- b/xen/arch/x86/domain.c	2005-05-19 22:20:28.000000000 -0500
+++ confer/xen/arch/x86/domain.c	2005-05-20 10:38:29.187071648 -0500
@@ -253,6 +253,8 @@
     memset(d->shared_info, 0, PAGE_SIZE);
     ed->vcpu_info = &d->shared_info->vcpu_data[ed->vcpu_id];
     ed->cpumap = CPUMAP_RUNANYWHERE;
+    /* default vcpus to sharing physical cpus */
+    d->shared_info->shproc = 1;
     SHARE_PFN_WITH_DOMAIN(virt_to_page(d->shared_info), d);
     machine_to_phys_mapping[virt_to_phys(d->shared_info) >> 
                            PAGE_SHIFT] = INVALID_M2P_ENTRY;
diff -urN b/xen/arch/x86/x86_32/entry.S confer/xen/arch/x86/x86_32/entry.S
--- b/xen/arch/x86/x86_32/entry.S	2005-05-19 22:20:33.000000000 -0500
+++ confer/xen/arch/x86/x86_32/entry.S	2005-05-20 10:37:58.353759024 -0500
@@ -749,6 +749,7 @@
         .long do_boot_vcpu
         .long do_ni_hypercall       /* 25 */
         .long do_mmuext_op
+        .long do_confer
         .rept NR_hypercalls-((.-hypercall_table)/4)
         .long do_ni_hypercall
         .endr
diff -urN b/xen/common/domain.c confer/xen/common/domain.c
--- b/xen/common/domain.c	2005-05-19 22:20:15.000000000 -0500
+++ confer/xen/common/domain.c	2005-05-20 10:37:58.354758872 -0500
@@ -289,6 +289,7 @@
 
     atomic_set(&ed->pausecnt, 0);
     ed->cpumap = CPUMAP_RUNANYWHERE;
+    set_bit(_VCPUF_canconfer, &ed->vcpu_flags);
 
     memcpy(&ed->arch, &idle0_exec_domain.arch, sizeof(ed->arch));
 
diff -urN b/xen/common/schedule.c confer/xen/common/schedule.c
--- b/xen/common/schedule.c	2005-05-19 22:20:30.000000000 -0500
+++ confer/xen/common/schedule.c	2005-05-20 10:45:41.493351104 -0500
@@ -224,6 +224,11 @@
     spin_lock_irqsave(&schedule_data[ed->processor].schedule_lock, flags);
     if ( likely(domain_runnable(ed)) )
     {
+        /* mark current's confer state */
+        if ( test_bit(_VCPUF_conferring, &current->vcpu_flags) ) {
+            clear_bit(_VCPUF_conferring, &current->vcpu_flags);
+            set_bit(_VCPUF_conferred, &current->vcpu_flags);
+        }
         SCHED_OP(wake, ed);
 #ifdef WAKE_HISTO
         ed->wokenup = NOW();
@@ -273,6 +278,54 @@
     return 0;
 }
 
+/* Confer control to another vcpu */
+long do_confer(unsigned int vcpu, unsigned int yield_count)
+{
+    struct domain *d = current->domain;
+   
+    /* Validate CONFER prereqs:
+    * - vcpu is within bounds
+    * - vcpu is a valid in this domain
+    * - current has not already conferred its slice to vcpu
+    * - vcpu is not already running
+    * - designated vcpu's yield_count matches value from call
+    *
+    * of all are ok, then set conferred value and enter scheduler
+    */
+
+    if (vcpu > MAX_VIRT_CPUS)
+        return 0; 
+
+    if (d->exec_domain[vcpu] == NULL)
+        return 0;
+
+    if (!test_bit(_VCPUF_canconfer, &current->vcpu_flags))
+        return 0;
+
+    /* even counts indicate a running vcpu, odd is preempted/conferred */
+    /* don't confer if holder is currently running */
+    if ((d->exec_domain[vcpu]->vcpu_info->yield_count & 1) == 0)
+        return 0;
+
+    if (d->exec_domain[vcpu]->vcpu_info->yield_count != yield_count)
+        return 0;
+
+    /*
+     * set current's state to conferring, wake target
+     */
+    clear_bit(_VCPUF_canconfer, &current->vcpu_flags);
+    set_bit(_VCPUF_conferring, &current->vcpu_flags);
+    domain_wake(d->exec_domain[vcpu]);
+
+    /* request scheduling for woken domain */
+    raise_softirq(SCHEDULE_SOFTIRQ);
+
+    /* give up my timeslice */
+    do_yield();
+
+    return 0;
+}
+
 /*
  * Demultiplex scheduler-related hypercalls.
  */
@@ -441,7 +494,15 @@
 
     r_time = next_slice.time;
     next = next_slice.task;
-    
+
+    /*
+     * always clear conferred state so this vcpu can confer during its slice
+     * since it can confer, clear all other confer state
+     */
+    set_bit(_VCPUF_canconfer,   &next->vcpu_flags);
+    clear_bit(_VCPUF_conferring, &next->vcpu_flags);
+    clear_bit(_VCPUF_conferred,  &next->vcpu_flags);
+
     schedule_data[cpu].curr = next;
     
     next->lastschd = now;
@@ -455,6 +516,12 @@
 
     spin_unlock_irq(&schedule_data[cpu].schedule_lock);
 
+    /* bump vcpu yield_count when controlling domain is not-idle */
+    if ( !is_idle_task(prev->domain) )
+        prev->vcpu_info->yield_count++;
+    if ( !is_idle_task(next->domain) )
+        next->vcpu_info->yield_count++;
+
     if ( unlikely(prev == next) ) {
 #ifdef ADV_SCHED_HISTO
         adv_sched_hist_to_stop(cpu);
diff -urN b/xen/include/public/xen.h confer/xen/include/public/xen.h
--- b/xen/include/public/xen.h	2005-05-19 22:20:11.000000000 -0500
+++ confer/xen/include/public/xen.h	2005-05-20 10:37:58.368756744 -0500
@@ -58,6 +58,7 @@
 #define __HYPERVISOR_boot_vcpu            24
 #define __HYPERVISOR_set_segment_base     25 /* x86/64 only */
 #define __HYPERVISOR_mmuext_op            26
+#define __HYPERVISOR_confer               27
 
 /*
  * MULTICALLS
@@ -334,8 +335,11 @@
     u8 evtchn_upcall_mask;              /* 1 */
     u8 pad0, pad1;
     u32 evtchn_pending_sel;             /* 4 */
-    arch_vcpu_info_t arch;              /* 8 */
-} PACKED vcpu_info_t;                   /* 8 + arch */
+    /* Even when vcpu is running, Odd when it is preempted/conferred */
+    u32 yield_count;                    /* 8 */
+    u32 pad2;                           /* 12 */
+    arch_vcpu_info_t arch;              /* 16 */
+} PACKED vcpu_info_t;                   /* 16 + arch */
 
 /*
  * Xen/kernel shared data -- pointer provided in start_info.
@@ -347,6 +351,9 @@
 
     u32 n_vcpu;
 
+    /* set if domains' vcpus share physical cpus */
+    int shproc; 
+
     /*
      * A domain can have up to 1024 "event channels" on which it can send
      * and receive asynchronous event notifications. There are three classes
diff -urN b/xen/include/xen/sched.h confer/xen/include/xen/sched.h
--- b/xen/include/xen/sched.h	2005-05-19 22:20:07.000000000 -0500
+++ confer/xen/include/xen/sched.h	2005-05-20 10:37:58.378755224 -0500
@@ -358,6 +358,15 @@
  /* Initialization completed. */
 #define _VCPUF_initialised     8
 #define VCPUF_initialised      (1UL<<_VCPUF_initialised)
+ /* Able to give time slice to another vcpu */
+#define _VCPUF_canconfer       9
+#define VCPUF_canconfer        (1UL<<_VCPUF_canconfer)
+ /* Currently giving time slice to another vcpu */
+#define _VCPUF_conferring      10
+#define VCPUF_conferring       (1UL<<_VCPUF_conferring)
+ /* Already given time slice to another vcpu */
+#define _VCPUF_conferred       11
+#define VCPUF_conferred        (1UL<<_VCPUF_conferred)
 
 /*
  * Per-domain flags (domain_flags).

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH] xen, xen-sparse: modify spinlocks to use directed yield
  2005-05-20 16:54 [PATCH] xen, xen-sparse: modify spinlocks to use directed yield Ryan Harper
@ 2005-05-20 18:16 ` Ryan Harper
  2005-05-20 18:17 ` Ryan Harper
  2005-06-03 19:40 ` [PATCH] Yield to VCPU hcall, spinlock yielding Ryan Harper
  2 siblings, 0 replies; 25+ messages in thread
From: Ryan Harper @ 2005-05-20 18:16 UTC (permalink / raw)
  To: Ryan Harper; +Cc: xen-devel

* Ryan Harper <ryanh@us.ibm.com> [2005-05-20 11:55]:
> The following patch creates a new hypercall, do_confer() which allows a

Oops.  I left in fixes to my domU config which doesnt exist in the main
tree.  I've got that part removed in this version.

--
Ryan Harper
Software Engineer; Linux Technology Center
IBM Corp., Austin, Tx
(512) 838-9253   T/L: 678-9253
ryanh@us.ibm.com


diffstat output:
 linux-2.6.11-xen-sparse/arch/i386/lib/Makefile               |   11 
 linux-2.6.11-xen-sparse/arch/i386/lib/locks.c                |   76 +++++
 linux-2.6.11-xen-sparse/arch/xen/i386/kernel/entry.S         |    2 
 linux-2.6.11-xen-sparse/include/asm-xen/asm-i386/hypercall.h |   16 +
 linux-2.6.11-xen-sparse/include/asm-xen/asm-i386/spinlock.h  |  140 ++++++++---
 xen/arch/x86/domain.c                                        |    2 
 xen/arch/x86/x86_32/entry.S                                  |    1 
 xen/common/domain.c                                          |    1 
 xen/common/schedule.c                                        |   69 +++++
 xen/include/public/xen.h                                     |   11 
 xen/include/xen/sched.h                                      |    9 
 11 files changed, 300 insertions(+), 38 deletions(-)

Signed-off-by: Ryan Harper <ryanh@us.ibm.com>
---
diff -urN b/linux-2.6.11-xen-sparse/arch/i386/lib/locks.c confer/linux-2.6.11-xen-sparse/arch/i386/lib/locks.c
--- b/linux-2.6.11-xen-sparse/arch/i386/lib/locks.c	1969-12-31 18:00:00.000000000 -0600
+++ confer/linux-2.6.11-xen-sparse/arch/i386/lib/locks.c	2005-05-20 10:37:58.300767080 -0500
@@ -0,0 +1,76 @@
+/*
+ * Spin and read/write lock operations.
+ *
+ * Copyright (C) 2001-2004 Paul Mackerras <paulus@au.ibm.com>, IBM
+ * Copyright (C) 2001 Anton Blanchard <anton@au.ibm.com>, IBM
+ * Copyright (C) 2002 Dave Engebretsen <engebret@us.ibm.com>, IBM
+ *   Rework to support virtual processors
+ * Copyright (C) 2005 Ryan Harper <ryanh@us.ibm.com>, IBM
+ *   Rework for Xen on x86
+ *
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public License
+ * as published by the Free Software Foundation; either version
+ * 2 of the License, or (at your option) any later version.
+ */
+
+#include <linux/config.h>
+#include <linux/kernel.h>
+#include <linux/spinlock.h>
+#include <linux/module.h>
+#include <linux/stringify.h>
+#include <asm/hypercall.h>
+#include <asm/processor.h>
+
+/* waiting for a spinlock... */
+#if defined(CONFIG_XEN) && defined(CONFIG_SMP)
+void __spin_yield(spinlock_t *lock)
+{
+	unsigned int lock_value, holder_cpu, yield_count;
+   shared_info_t *s = HYPERVISOR_shared_info;
+
+	lock_value = lock->slock;
+	if (lock_value == 1)
+		return;
+	holder_cpu = lock->cpu;
+	BUG_ON(holder_cpu >= NR_CPUS);
+	yield_count = s->vcpu_data[holder_cpu].yield_count;
+	if ((yield_count & 1) == 0)
+		return;		/* virtual cpu is currently running */
+	rmb();
+	if (lock->slock != lock_value)
+		return;		/* something has changed */
+   HYPERVISOR_confer(holder_cpu, yield_count);
+}
+
+void __rw_yield(rwlock_t *rw)
+{
+	unsigned int lock_value, holder_cpu, yield_count;
+   shared_info_t *s = HYPERVISOR_shared_info;
+
+	lock_value = rw->lock;
+	if (lock_value == RW_LOCK_BIAS)
+		return;
+	holder_cpu = rw->cpu;
+	BUG_ON(holder_cpu >= NR_CPUS);
+	yield_count = s->vcpu_data[holder_cpu].yield_count;
+	if ((yield_count & 1) == 0)
+		return;		/* virtual cpu is currently running */
+	rmb();
+	if (rw->lock != lock_value)
+		return;		/* something has changed */
+   HYPERVISOR_confer(holder_cpu, yield_count);
+}
+
+void spin_unlock_wait(spinlock_t *lock)
+{
+	while (spin_is_locked(lock)) {
+		cpu_relax();
+		if (SHARED_PROCESSOR)
+			__spin_yield(lock);
+	}
+	cpu_relax();
+}
+EXPORT_SYMBOL(spin_unlock_wait);
+#endif
diff -urN b/linux-2.6.11-xen-sparse/arch/i386/lib/Makefile confer/linux-2.6.11-xen-sparse/arch/i386/lib/Makefile
--- b/linux-2.6.11-xen-sparse/arch/i386/lib/Makefile	1969-12-31 18:00:00.000000000 -0600
+++ confer/linux-2.6.11-xen-sparse/arch/i386/lib/Makefile	2005-05-20 10:37:58.301766928 -0500
@@ -0,0 +1,11 @@
+#
+# Makefile for i386-specific library files..
+#
+
+
+lib-y = checksum.o delay.o usercopy.o getuser.o memcpy.o strstr.o \
+	bitops.o 
+
+lib-$(CONFIG_X86_USE_3DNOW) += mmx.o
+lib-$(CONFIG_HAVE_DEC_LOCK) += dec_and_lock.o
+lib-$(CONFIG_XEN) += locks.o
diff -urN b/linux-2.6.11-xen-sparse/arch/xen/i386/kernel/entry.S confer/linux-2.6.11-xen-sparse/arch/xen/i386/kernel/entry.S
--- b/linux-2.6.11-xen-sparse/arch/xen/i386/kernel/entry.S	2005-05-19 22:20:32.000000000 -0500
+++ confer/linux-2.6.11-xen-sparse/arch/xen/i386/kernel/entry.S	2005-05-20 10:37:58.304766472 -0500
@@ -80,7 +80,7 @@
 #define evtchn_upcall_pending		/* 0 */
 #define evtchn_upcall_mask		1
 
-#define sizeof_vcpu_shift		3
+#define sizeof_vcpu_shift		4
 
 #ifdef CONFIG_SMP
 #define preempt_disable(reg)	incl TI_preempt_count(reg)
diff -urN b/linux-2.6.11-xen-sparse/include/asm-xen/asm-i386/hypercall.h confer/linux-2.6.11-xen-sparse/include/asm-xen/asm-i386/hypercall.h
--- b/linux-2.6.11-xen-sparse/include/asm-xen/asm-i386/hypercall.h	2005-05-19 22:20:32.000000000 -0500
+++ confer/linux-2.6.11-xen-sparse/include/asm-xen/asm-i386/hypercall.h	2005-05-20 10:37:58.306766168 -0500
@@ -517,4 +517,20 @@
     return ret;
 }
 
+static inline int
+HYPERVISOR_confer(
+	unsigned int vcpu, unsigned int yield_count)
+{
+    int ret;
+    unsigned long ign1, ign2;
+
+    __asm__ __volatile__ (
+        TRAP_INSTR
+        : "=a" (ret), "=b" (ign1), "=c" (ign2)
+	: "0" (__HYPERVISOR_confer), "1" (vcpu), "2" (yield_count)
+	: "memory");
+
+    return ret;
+}
+
 #endif /* __HYPERCALL_H__ */
diff -urN b/linux-2.6.11-xen-sparse/include/asm-xen/asm-i386/spinlock.h confer/linux-2.6.11-xen-sparse/include/asm-xen/asm-i386/spinlock.h
--- b/linux-2.6.11-xen-sparse/include/asm-xen/asm-i386/spinlock.h	2005-05-19 22:20:14.000000000 -0500
+++ confer/linux-2.6.11-xen-sparse/include/asm-xen/asm-i386/spinlock.h	2005-05-20 10:37:58.307766016 -0500
@@ -22,10 +22,36 @@
 #ifdef CONFIG_PREEMPT
 	unsigned int break_lock;
 #endif
+#if defined(CONFIG_XEN) && defined(CONFIG_SMP)
+	unsigned int cpu;
+#endif
 } spinlock_t;
 
 #define SPINLOCK_MAGIC	0xdead4ead
 
+/*
+ * Read-write spinlocks, allowing multiple readers
+ * but only one writer.
+ *
+ * NOTE! it is quite common to have readers in interrupts
+ * but no interrupt writers. For those circumstances we
+ * can "mix" irq-safe locks - any writer needs to get a
+ * irq-safe write-lock, but readers can get non-irqsafe
+ * read-locks.
+ */
+typedef struct {
+	volatile unsigned int lock;
+#ifdef CONFIG_DEBUG_SPINLOCK
+	unsigned magic;
+#endif
+#ifdef CONFIG_PREEMPT
+	unsigned int break_lock;
+#endif
+#if defined(CONFIG_XEN) && defined(CONFIG_SMP)
+	unsigned int cpu;
+#endif
+} rwlock_t;
+
 #ifdef CONFIG_DEBUG_SPINLOCK
 #define SPINLOCK_MAGIC_INIT	, SPINLOCK_MAGIC
 #else
@@ -44,7 +70,20 @@
  */
 
 #define spin_is_locked(x)	(*(volatile signed char *)(&(x)->slock) <= 0)
+#if defined(CONFIG_XEN) && defined(CONFIG_SMP)
+#include <linux/smp.h>
+#define SPINLOCK_CPU (smp_processor_id())
+/* We only yield to the hypervisor if we are in shared processor mode */
+#define SHARED_PROCESSOR (HYPERVISOR_shared_info->shproc == 0)
+extern void __spin_yield(spinlock_t *lock);
+extern void __rw_yield(rwlock_t *rw);
+extern void spin_unlock_wait(spinlock_t *lock);
+#else
+#define __spin_yield(x) barrier()
+#define __rw_yield(x) barrier()
+#define SHARED_PROCESSOR 0
 #define spin_unlock_wait(x)	do { barrier(); } while(spin_is_locked(x))
+#endif
 
 #define spin_lock_string \
 	"\n1:\t" \
@@ -125,6 +164,9 @@
 		"xchgb %b0,%1"
 		:"=q" (oldval), "=m" (lock->slock)
 		:"0" (0) : "memory");
+#if defined(CONFIG_XEN) && defined(CONFIG_SMP)
+	lock->cpu = SPINLOCK_CPU;
+#endif
 	return oldval > 0;
 }
 
@@ -136,43 +178,55 @@
 		BUG();
 	}
 #endif
-	__asm__ __volatile__(
-		spin_lock_string
-		:"=m" (lock->slock) : : "memory");
+#if defined(CONFIG_XEN) && defined(CONFIG_SMP)
+	while (1) {
+		if ( likely(_raw_spin_trylock(lock)) )
+			break;
+		do {
+			cpu_relax();
+			if (SHARED_PROCESSOR)
+				__spin_yield(lock);
+		} while (likely(spin_is_locked(lock)));
+		cpu_relax();
+	}
+#else
+   __asm__ __volatile__(
+      spin_lock_string
+      :"=m" (lock->slock) : : "memory");
+#endif
 }
 
 static inline void _raw_spin_lock_flags (spinlock_t *lock, unsigned long flags)
 {
+#if defined(CONFIG_XEN) && defined(CONFIG_SMP)
+	unsigned long flags_dis;
+#endif
 #ifdef CONFIG_DEBUG_SPINLOCK
 	if (unlikely(lock->magic != SPINLOCK_MAGIC)) {
 		printk("eip: %p\n", __builtin_return_address(0));
 		BUG();
 	}
 #endif
-	__asm__ __volatile__(
-		spin_lock_string_flags
-		:"=m" (lock->slock) : "r" (flags) : "memory");
-}
-
-/*
- * Read-write spinlocks, allowing multiple readers
- * but only one writer.
- *
- * NOTE! it is quite common to have readers in interrupts
- * but no interrupt writers. For those circumstances we
- * can "mix" irq-safe locks - any writer needs to get a
- * irq-safe write-lock, but readers can get non-irqsafe
- * read-locks.
- */
-typedef struct {
-	volatile unsigned int lock;
-#ifdef CONFIG_DEBUG_SPINLOCK
-	unsigned magic;
-#endif
-#ifdef CONFIG_PREEMPT
-	unsigned int break_lock;
+#if defined(CONFIG_XEN) && defined(CONFIG_SMP)
+	while (1) {
+		if ( likely(_raw_spin_trylock(lock)) )
+			break;
+		local_save_flags(flags_dis);
+		local_irq_restore(flags);
+		do {
+			cpu_relax();
+			if (SHARED_PROCESSOR)
+				__spin_yield(lock);
+		} while (likely(spin_is_locked(lock)));
+		cpu_relax();
+		local_irq_restore(flags_dis);
+	}
+#else
+   __asm__ __volatile__(
+      spin_lock_string_flags
+      :"=m" (lock->slock) : "r" (flags) : "memory");
 #endif
-} rwlock_t;
+}
 
 #define RWLOCK_MAGIC	0xdeaf1eed
 
@@ -198,6 +252,18 @@
  */
 #define write_can_lock(x) ((x)->lock == RW_LOCK_BIAS)
 
+static inline int _raw_write_trylock(rwlock_t *lock)
+{
+	atomic_t *count = (atomic_t *)lock;
+	if (atomic_sub_and_test(RW_LOCK_BIAS, count)) {
+#if defined(CONFIG_XEN) && defined(CONFIG_SMP)
+      lock->cpu = SPINLOCK_CPU;
+#endif
+		return 1;
+   }
+	atomic_add(RW_LOCK_BIAS, count);
+	return 0;
+}
 /*
  * On x86, we implement read-write locks as a 32-bit counter
  * with the high bit (sign) being the "contended" bit.
@@ -222,7 +288,20 @@
 #ifdef CONFIG_DEBUG_SPINLOCK
 	BUG_ON(rw->magic != RWLOCK_MAGIC);
 #endif
+#if defined(CONFIG_XEN) && defined(CONFIG_SMP)
+	while (1) {
+		if ( likely(_raw_write_trylock(rw)) )
+			break;
+		do {
+			cpu_relax();
+			if (SHARED_PROCESSOR)
+				__rw_yield(rw);
+		} while ( likely(!write_can_lock(rw)));
+		cpu_relax();
+	}
+#else
 	__build_write_lock(rw, "__write_lock_failed");
+#endif
 }
 
 #define _raw_read_unlock(rw)		asm volatile("lock ; incl %0" :"=m" ((rw)->lock) : : "memory")
@@ -238,13 +317,6 @@
 	return 0;
 }
 
-static inline int _raw_write_trylock(rwlock_t *lock)
-{
-	atomic_t *count = (atomic_t *)lock;
-	if (atomic_sub_and_test(RW_LOCK_BIAS, count))
-		return 1;
-	atomic_add(RW_LOCK_BIAS, count);
-	return 0;
-}
+
 
 #endif /* __ASM_SPINLOCK_H */
diff -urN b/xen/arch/x86/domain.c confer/xen/arch/x86/domain.c
--- b/xen/arch/x86/domain.c	2005-05-19 22:20:28.000000000 -0500
+++ confer/xen/arch/x86/domain.c	2005-05-20 10:38:29.187071648 -0500
@@ -253,6 +253,8 @@
     memset(d->shared_info, 0, PAGE_SIZE);
     ed->vcpu_info = &d->shared_info->vcpu_data[ed->vcpu_id];
     ed->cpumap = CPUMAP_RUNANYWHERE;
+    /* default vcpus to sharing physical cpus */
+    d->shared_info->shproc = 1;
     SHARE_PFN_WITH_DOMAIN(virt_to_page(d->shared_info), d);
     machine_to_phys_mapping[virt_to_phys(d->shared_info) >> 
                            PAGE_SHIFT] = INVALID_M2P_ENTRY;
diff -urN b/xen/arch/x86/x86_32/entry.S confer/xen/arch/x86/x86_32/entry.S
--- b/xen/arch/x86/x86_32/entry.S	2005-05-19 22:20:33.000000000 -0500
+++ confer/xen/arch/x86/x86_32/entry.S	2005-05-20 10:37:58.353759024 -0500
@@ -749,6 +749,7 @@
         .long do_boot_vcpu
         .long do_ni_hypercall       /* 25 */
         .long do_mmuext_op
+        .long do_confer
         .rept NR_hypercalls-((.-hypercall_table)/4)
         .long do_ni_hypercall
         .endr
diff -urN b/xen/common/domain.c confer/xen/common/domain.c
--- b/xen/common/domain.c	2005-05-19 22:20:15.000000000 -0500
+++ confer/xen/common/domain.c	2005-05-20 10:37:58.354758872 -0500
@@ -289,6 +289,7 @@
 
     atomic_set(&ed->pausecnt, 0);
     ed->cpumap = CPUMAP_RUNANYWHERE;
+    set_bit(_VCPUF_canconfer, &ed->vcpu_flags);
 
     memcpy(&ed->arch, &idle0_exec_domain.arch, sizeof(ed->arch));
 
diff -urN b/xen/common/schedule.c confer/xen/common/schedule.c
--- b/xen/common/schedule.c	2005-05-19 22:20:30.000000000 -0500
+++ confer/xen/common/schedule.c	2005-05-20 10:45:41.493351104 -0500
@@ -224,6 +224,11 @@
     spin_lock_irqsave(&schedule_data[ed->processor].schedule_lock, flags);
     if ( likely(domain_runnable(ed)) )
     {
+        /* mark current's confer state */
+        if ( test_bit(_VCPUF_conferring, &current->vcpu_flags) ) {
+            clear_bit(_VCPUF_conferring, &current->vcpu_flags);
+            set_bit(_VCPUF_conferred, &current->vcpu_flags);
+        }
         SCHED_OP(wake, ed);
 #ifdef WAKE_HISTO
         ed->wokenup = NOW();
@@ -273,6 +278,54 @@
     return 0;
 }
 
+/* Confer control to another vcpu */
+long do_confer(unsigned int vcpu, unsigned int yield_count)
+{
+    struct domain *d = current->domain;
+   
+    /* Validate CONFER prereqs:
+    * - vcpu is within bounds
+    * - vcpu is a valid in this domain
+    * - current has not already conferred its slice to vcpu
+    * - vcpu is not already running
+    * - designated vcpu's yield_count matches value from call
+    *
+    * of all are ok, then set conferred value and enter scheduler
+    */
+
+    if (vcpu > MAX_VIRT_CPUS)
+        return 0; 
+
+    if (d->exec_domain[vcpu] == NULL)
+        return 0;
+
+    if (!test_bit(_VCPUF_canconfer, &current->vcpu_flags))
+        return 0;
+
+    /* even counts indicate a running vcpu, odd is preempted/conferred */
+    /* don't confer if holder is currently running */
+    if ((d->exec_domain[vcpu]->vcpu_info->yield_count & 1) == 0)
+        return 0;
+
+    if (d->exec_domain[vcpu]->vcpu_info->yield_count != yield_count)
+        return 0;
+
+    /*
+     * set current's state to conferring, wake target
+     */
+    clear_bit(_VCPUF_canconfer, &current->vcpu_flags);
+    set_bit(_VCPUF_conferring, &current->vcpu_flags);
+    domain_wake(d->exec_domain[vcpu]);
+
+    /* request scheduling for woken domain */
+    raise_softirq(SCHEDULE_SOFTIRQ);
+
+    /* give up my timeslice */
+    do_yield();
+
+    return 0;
+}
+
 /*
  * Demultiplex scheduler-related hypercalls.
  */
@@ -441,7 +494,15 @@
 
     r_time = next_slice.time;
     next = next_slice.task;
-    
+
+    /*
+     * always clear conferred state so this vcpu can confer during its slice
+     * since it can confer, clear all other confer state
+     */
+    set_bit(_VCPUF_canconfer,   &next->vcpu_flags);
+    clear_bit(_VCPUF_conferring, &next->vcpu_flags);
+    clear_bit(_VCPUF_conferred,  &next->vcpu_flags);
+
     schedule_data[cpu].curr = next;
     
     next->lastschd = now;
@@ -455,6 +516,12 @@
 
     spin_unlock_irq(&schedule_data[cpu].schedule_lock);
 
+    /* bump vcpu yield_count when controlling domain is not-idle */
+    if ( !is_idle_task(prev->domain) )
+        prev->vcpu_info->yield_count++;
+    if ( !is_idle_task(next->domain) )
+        next->vcpu_info->yield_count++;
+
     if ( unlikely(prev == next) ) {
 #ifdef ADV_SCHED_HISTO
         adv_sched_hist_to_stop(cpu);
diff -urN b/xen/include/public/xen.h confer/xen/include/public/xen.h
--- b/xen/include/public/xen.h	2005-05-19 22:20:11.000000000 -0500
+++ confer/xen/include/public/xen.h	2005-05-20 10:37:58.368756744 -0500
@@ -58,6 +58,7 @@
 #define __HYPERVISOR_boot_vcpu            24
 #define __HYPERVISOR_set_segment_base     25 /* x86/64 only */
 #define __HYPERVISOR_mmuext_op            26
+#define __HYPERVISOR_confer               27
 
 /*
  * MULTICALLS
@@ -334,8 +335,11 @@
     u8 evtchn_upcall_mask;              /* 1 */
     u8 pad0, pad1;
     u32 evtchn_pending_sel;             /* 4 */
-    arch_vcpu_info_t arch;              /* 8 */
-} PACKED vcpu_info_t;                   /* 8 + arch */
+    /* Even when vcpu is running, Odd when it is preempted/conferred */
+    u32 yield_count;                    /* 8 */
+    u32 pad2;                           /* 12 */
+    arch_vcpu_info_t arch;              /* 16 */
+} PACKED vcpu_info_t;                   /* 16 + arch */
 
 /*
  * Xen/kernel shared data -- pointer provided in start_info.
@@ -347,6 +351,9 @@
 
     u32 n_vcpu;
 
+    /* set if domains' vcpus share physical cpus */
+    int shproc; 
+
     /*
      * A domain can have up to 1024 "event channels" on which it can send
      * and receive asynchronous event notifications. There are three classes
diff -urN b/xen/include/xen/sched.h confer/xen/include/xen/sched.h
--- b/xen/include/xen/sched.h	2005-05-19 22:20:07.000000000 -0500
+++ confer/xen/include/xen/sched.h	2005-05-20 10:37:58.378755224 -0500
@@ -358,6 +358,15 @@
  /* Initialization completed. */
 #define _VCPUF_initialised     8
 #define VCPUF_initialised      (1UL<<_VCPUF_initialised)
+ /* Able to give time slice to another vcpu */
+#define _VCPUF_canconfer       9
+#define VCPUF_canconfer        (1UL<<_VCPUF_canconfer)
+ /* Currently giving time slice to another vcpu */
+#define _VCPUF_conferring      10
+#define VCPUF_conferring       (1UL<<_VCPUF_conferring)
+ /* Already given time slice to another vcpu */
+#define _VCPUF_conferred       11
+#define VCPUF_conferred        (1UL<<_VCPUF_conferred)
 
 /*
  * Per-domain flags (domain_flags).

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH] xen, xen-sparse: modify spinlocks to use directed yield
  2005-05-20 16:54 [PATCH] xen, xen-sparse: modify spinlocks to use directed yield Ryan Harper
  2005-05-20 18:16 ` Ryan Harper
@ 2005-05-20 18:17 ` Ryan Harper
  2005-06-03 19:40 ` [PATCH] Yield to VCPU hcall, spinlock yielding Ryan Harper
  2 siblings, 0 replies; 25+ messages in thread
From: Ryan Harper @ 2005-05-20 18:17 UTC (permalink / raw)
  To: Ryan Harper; +Cc: xen-devel

* Ryan Harper <ryanh@us.ibm.com> [2005-05-20 11:55]:
> The following patch creates a new hypercall, do_confer() which allows a

Here is a debug patch I've been using to go along with it. Comment out
the printks in the do_confer() routine if you are measuring performance.

--
Ryan Harper
Software Engineer; Linux Technology Center
IBM Corp., Austin, Tx
(512) 838-9253   T/L: 678-9253
ryanh@us.ibm.com


diffstat output:
 common/keyhandler.c |   30 ++++++++++++++++++++++++++++++
 common/schedule.c   |   19 ++++++++++++++++---
 include/xen/sched.h |    3 +++
 3 files changed, 49 insertions(+), 3 deletions(-)

Signed-off-by: Ryan Harper <ryanh@us.ibm.com>
---
diff -urN confer/xen/common/keyhandler.c debug/xen/common/keyhandler.c
--- confer/xen/common/keyhandler.c	2005-05-19 22:20:27.000000000 -0500
+++ debug/xen/common/keyhandler.c	2005-05-20 10:39:03.565845280 -0500
@@ -11,6 +11,7 @@
 #include <xen/sched.h>
 #include <xen/softirq.h>
 #include <asm/debugger.h>
+#include <public/xen.h>
 
 #define KEY_MAX 256
 #define STR_MAX  64
@@ -138,6 +139,33 @@
     read_unlock(&domlist_lock);
 }
 
+static void do_dump_confer(unsigned char key)
+{
+    struct domain *d;
+    struct exec_domain *ed;
+    s_time_t       now = NOW();
+
+    printk("'%c' pressed -> dumping confer stats (now=0x%X:%08X)\n", key,
+           (u32)(now>>32), (u32)now);
+
+    read_lock(&domlist_lock);
+    for_each_domain ( d )
+    {
+        for_each_exec_domain ( d, ed )
+        {
+            printk("Xen: DOM %d, VCPU %d, CPU %d,"
+                   " confers %d, confer_out %d,"
+                   " confer_in %d, yield_count %d\n",
+                   d->domain_id, ed->vcpu_id, ed->processor,
+                   ed->confercnt, ed->confer_out, 
+                   ed->confer_in, ed->vcpu_info->yield_count
+                  );
+        }
+
+    }
+    read_unlock(&domlist_lock);
+}
+
 extern void dump_runq(unsigned char key);
 extern void print_sched_histo(unsigned char key);
 extern void reset_sched_histo(unsigned char key);
@@ -183,6 +211,8 @@
     register_keyhandler(
         'q', do_task_queues, "dump task queues + guest state");
     register_keyhandler(
+        'c', do_dump_confer, "dump confer stats");
+    register_keyhandler(
         'r', dump_runq,      "dump run queues");
     register_irq_keyhandler(
         'R', halt_machine,   "reboot machine"); 
diff -urN confer/xen/common/schedule.c debug/xen/common/schedule.c
--- confer/xen/common/schedule.c	2005-05-20 10:37:58.367756896 -0500
+++ debug/xen/common/schedule.c	2005-05-20 10:39:03.575843760 -0500
@@ -228,7 +228,11 @@
         if ( test_bit(_VCPUF_conferring, &current->vcpu_flags) ) {
             clear_bit(_VCPUF_conferring, &current->vcpu_flags);
             set_bit(_VCPUF_conferred, &current->vcpu_flags);
+            /* increment confer counters */
+            current->confer_out++;
+            ed->confer_in++;
         }
+
         SCHED_OP(wake, ed);
 #ifdef WAKE_HISTO
         ed->wokenup = NOW();
@@ -283,6 +287,9 @@
 {
     struct domain *d = current->domain;
    
+    /* count hcalls */
+    current->confercnt++;
+
     /* Validate CONFER prereqs:
     * - vcpu is within bounds
     * - vcpu is a valid in this domain
@@ -299,16 +306,22 @@
     if (d->exec_domain[vcpu] == NULL)
         return 0;
 
-    if (!test_bit(_VCPUF_canconfer, &current->vcpu_flags))
+    if (!test_bit(_VCPUF_canconfer, &current->vcpu_flags)) {
+        printk("confer: canconfer not set, %d->vcpu-flags = 0x%08lx\n", current->vcpu_id, current->vcpu_flags);
         return 0;
+    }
 
     /* even counts indicate a running vcpu, odd is preempted/conferred */
     /* don't confer if holder is currently running */
-    if ((d->exec_domain[vcpu]->vcpu_info->yield_count & 1) == 0)
+    if ((d->exec_domain[vcpu]->vcpu_info->yield_count & 1) == 0) {
+        printk("confer: vcpu %d already running\n", vcpu);
         return 0;
+    }
 
-    if (d->exec_domain[vcpu]->vcpu_info->yield_count != yield_count)
+    if (d->exec_domain[vcpu]->vcpu_info->yield_count != yield_count) {
+        printk("confer: yield_count mismatch\n");
         return 0;
+    }
 
     /*
      * set current's state to conferring, wake target
diff -urN confer/xen/include/xen/sched.h debug/xen/include/xen/sched.h
--- confer/xen/include/xen/sched.h	2005-05-20 10:37:58.378755224 -0500
+++ debug/xen/include/xen/sched.h	2005-05-20 10:39:03.577843456 -0500
@@ -87,6 +87,9 @@
     atomic_t         pausecnt;
 
     cpumap_t         cpumap;        /* which cpus this domain can run on */
+    u32              confer_out;    /* inc when conferring to another vcpu */
+    u32              confer_in;     /* inc when conferred from another vcpu */
+    u32              confercnt;     /* # of do_confer hcalls */
 
     struct arch_exec_domain arch;
 };

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH] Yield to VCPU hcall, spinlock yielding
  2005-05-20 16:54 [PATCH] xen, xen-sparse: modify spinlocks to use directed yield Ryan Harper
  2005-05-20 18:16 ` Ryan Harper
  2005-05-20 18:17 ` Ryan Harper
@ 2005-06-03 19:40 ` Ryan Harper
  2 siblings, 0 replies; 25+ messages in thread
From: Ryan Harper @ 2005-06-03 19:40 UTC (permalink / raw)
  To: xen-devel

[-- Attachment #1: Type: text/plain, Size: 478 bytes --]

* Ryan Harper <ryanh@us.ibm.com> [2005-05-20 11:55]:
> The following patch creates a new hypercall, do_confer() which allows a

I've not recieved any feedback on this.  Following this patch up with
one that applies against current.  Builds, but haven't tested it since
current SMP domains don't run.  

--
Ryan Harper
Software Engineer; Linux Technology Center
IBM Corp., Austin, Tx
(512) 838-9253   T/L: 678-9253
ryanh@us.ibm.com

Signed-off-by: Ryan Harper <ryanh@us.ibm.com>

[-- Attachment #2: confer.patch --]
[-- Type: text/plain, Size: 16792 bytes --]

diff -urN b/linux-2.6.11-xen-sparse/arch/i386/lib/locks.c c/linux-2.6.11-xen-sparse/arch/i386/lib/locks.c
--- b/linux-2.6.11-xen-sparse/arch/i386/lib/locks.c	1969-12-31 18:00:00.000000000 -0600
+++ c/linux-2.6.11-xen-sparse/arch/i386/lib/locks.c	2005-06-03 09:41:10.933009544 -0500
@@ -0,0 +1,76 @@
+/*
+ * Spin and read/write lock operations.
+ *
+ * Copyright (C) 2001-2004 Paul Mackerras <paulus@au.ibm.com>, IBM
+ * Copyright (C) 2001 Anton Blanchard <anton@au.ibm.com>, IBM
+ * Copyright (C) 2002 Dave Engebretsen <engebret@us.ibm.com>, IBM
+ *   Rework to support virtual processors
+ * Copyright (C) 2005 Ryan Harper <ryanh@us.ibm.com>, IBM
+ *   Rework for Xen on x86
+ *
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public License
+ * as published by the Free Software Foundation; either version
+ * 2 of the License, or (at your option) any later version.
+ */
+
+#include <linux/config.h>
+#include <linux/kernel.h>
+#include <linux/spinlock.h>
+#include <linux/module.h>
+#include <linux/stringify.h>
+#include <asm/hypercall.h>
+#include <asm/processor.h>
+
+/* waiting for a spinlock... */
+#if defined(CONFIG_XEN) && defined(CONFIG_SMP)
+void __spin_yield(spinlock_t *lock)
+{
+	unsigned int lock_value, holder_cpu, yield_count;
+   shared_info_t *s = HYPERVISOR_shared_info;
+
+	lock_value = lock->slock;
+	if (lock_value == 1)
+		return;
+	holder_cpu = lock->cpu;
+	BUG_ON(holder_cpu >= NR_CPUS);
+	yield_count = s->vcpu_data[holder_cpu].yield_count;
+	if ((yield_count & 1) == 0)
+		return;		/* virtual cpu is currently running */
+	rmb();
+	if (lock->slock != lock_value)
+		return;		/* something has changed */
+   HYPERVISOR_confer(holder_cpu, yield_count);
+}
+
+void __rw_yield(rwlock_t *rw)
+{
+	unsigned int lock_value, holder_cpu, yield_count;
+   shared_info_t *s = HYPERVISOR_shared_info;
+
+	lock_value = rw->lock;
+	if (lock_value == RW_LOCK_BIAS)
+		return;
+	holder_cpu = rw->cpu;
+	BUG_ON(holder_cpu >= NR_CPUS);
+	yield_count = s->vcpu_data[holder_cpu].yield_count;
+	if ((yield_count & 1) == 0)
+		return;		/* virtual cpu is currently running */
+	rmb();
+	if (rw->lock != lock_value)
+		return;		/* something has changed */
+   HYPERVISOR_confer(holder_cpu, yield_count);
+}
+
+void spin_unlock_wait(spinlock_t *lock)
+{
+	while (spin_is_locked(lock)) {
+		cpu_relax();
+		if (SHARED_PROCESSOR)
+			__spin_yield(lock);
+	}
+	cpu_relax();
+}
+EXPORT_SYMBOL(spin_unlock_wait);
+#endif
diff -urN b/linux-2.6.11-xen-sparse/arch/i386/lib/Makefile c/linux-2.6.11-xen-sparse/arch/i386/lib/Makefile
--- b/linux-2.6.11-xen-sparse/arch/i386/lib/Makefile	1969-12-31 18:00:00.000000000 -0600
+++ c/linux-2.6.11-xen-sparse/arch/i386/lib/Makefile	2005-06-03 09:41:10.948007446 -0500
@@ -0,0 +1,11 @@
+#
+# Makefile for i386-specific library files..
+#
+
+
+lib-y = checksum.o delay.o usercopy.o getuser.o memcpy.o strstr.o \
+	bitops.o 
+
+lib-$(CONFIG_X86_USE_3DNOW) += mmx.o
+lib-$(CONFIG_HAVE_DEC_LOCK) += dec_and_lock.o
+lib-$(CONFIG_XEN) += locks.o
diff -urN b/linux-2.6.11-xen-sparse/arch/xen/configs/xenU-smp_defconfig_x86_32 c/linux-2.6.11-xen-sparse/arch/xen/configs/xenU-smp_defconfig_x86_32
--- b/linux-2.6.11-xen-sparse/arch/xen/configs/xenU-smp_defconfig_x86_32	2005-06-03 09:02:36.837705157 -0500
+++ c/linux-2.6.11-xen-sparse/arch/xen/configs/xenU-smp_defconfig_x86_32	2005-06-03 09:41:10.949007306 -0500
@@ -117,8 +117,8 @@
 CONFIG_SMP=y
 CONFIG_NR_CPUS=8
 # CONFIG_SCHED_SMT is not set
-CONFIG_PREEMPT=y
-CONFIG_PREEMPT_BKL=y
+# CONFIG_PREEMPT is not set
+# CONFIG_PREEMPT_BKL is not set
 CONFIG_X86_CPUID=y
 
 #
diff -urN b/linux-2.6.11-xen-sparse/arch/xen/i386/kernel/entry.S c/linux-2.6.11-xen-sparse/arch/xen/i386/kernel/entry.S
--- b/linux-2.6.11-xen-sparse/arch/xen/i386/kernel/entry.S	2005-06-02 22:21:42.000000000 -0500
+++ c/linux-2.6.11-xen-sparse/arch/xen/i386/kernel/entry.S	2005-06-03 09:41:10.953006747 -0500
@@ -80,7 +80,7 @@
 #define evtchn_upcall_pending		/* 0 */
 #define evtchn_upcall_mask		1
 
-#define sizeof_vcpu_shift		3
+#define sizeof_vcpu_shift		4
 
 #ifdef CONFIG_SMP
 #define preempt_disable(reg)	incl TI_preempt_count(reg)
diff -urN b/linux-2.6.11-xen-sparse/include/asm-xen/asm-i386/hypercall.h c/linux-2.6.11-xen-sparse/include/asm-xen/asm-i386/hypercall.h
--- b/linux-2.6.11-xen-sparse/include/asm-xen/asm-i386/hypercall.h	2005-06-02 22:21:42.000000000 -0500
+++ c/linux-2.6.11-xen-sparse/include/asm-xen/asm-i386/hypercall.h	2005-06-03 09:41:10.954006607 -0500
@@ -517,4 +517,20 @@
     return ret;
 }
 
+static inline int
+HYPERVISOR_confer(
+	unsigned int vcpu, unsigned int yield_count)
+{
+    int ret;
+    unsigned long ign1, ign2;
+
+    __asm__ __volatile__ (
+        TRAP_INSTR
+        : "=a" (ret), "=b" (ign1), "=c" (ign2)
+	: "0" (__HYPERVISOR_confer), "1" (vcpu), "2" (yield_count)
+	: "memory");
+
+    return ret;
+}
+
 #endif /* __HYPERCALL_H__ */
diff -urN b/linux-2.6.11-xen-sparse/include/asm-xen/asm-i386/spinlock.h c/linux-2.6.11-xen-sparse/include/asm-xen/asm-i386/spinlock.h
--- b/linux-2.6.11-xen-sparse/include/asm-xen/asm-i386/spinlock.h	2005-06-02 22:21:37.000000000 -0500
+++ c/linux-2.6.11-xen-sparse/include/asm-xen/asm-i386/spinlock.h	2005-06-03 09:41:10.975003670 -0500
@@ -22,10 +22,36 @@
 #ifdef CONFIG_PREEMPT
 	unsigned int break_lock;
 #endif
+#if defined(CONFIG_XEN) && defined(CONFIG_SMP)
+	unsigned int cpu;
+#endif
 } spinlock_t;
 
 #define SPINLOCK_MAGIC	0xdead4ead
 
+/*
+ * Read-write spinlocks, allowing multiple readers
+ * but only one writer.
+ *
+ * NOTE! it is quite common to have readers in interrupts
+ * but no interrupt writers. For those circumstances we
+ * can "mix" irq-safe locks - any writer needs to get a
+ * irq-safe write-lock, but readers can get non-irqsafe
+ * read-locks.
+ */
+typedef struct {
+	volatile unsigned int lock;
+#ifdef CONFIG_DEBUG_SPINLOCK
+	unsigned magic;
+#endif
+#ifdef CONFIG_PREEMPT
+	unsigned int break_lock;
+#endif
+#if defined(CONFIG_XEN) && defined(CONFIG_SMP)
+	unsigned int cpu;
+#endif
+} rwlock_t;
+
 #ifdef CONFIG_DEBUG_SPINLOCK
 #define SPINLOCK_MAGIC_INIT	, SPINLOCK_MAGIC
 #else
@@ -44,7 +70,20 @@
  */
 
 #define spin_is_locked(x)	(*(volatile signed char *)(&(x)->slock) <= 0)
+#if defined(CONFIG_XEN) && defined(CONFIG_SMP)
+#include <linux/smp.h>
+#define SPINLOCK_CPU (smp_processor_id())
+/* We only yield to the hypervisor if we are in shared processor mode */
+#define SHARED_PROCESSOR (HYPERVISOR_shared_info->shproc == 0)
+extern void __spin_yield(spinlock_t *lock);
+extern void __rw_yield(rwlock_t *rw);
+extern void spin_unlock_wait(spinlock_t *lock);
+#else
+#define __spin_yield(x) barrier()
+#define __rw_yield(x) barrier()
+#define SHARED_PROCESSOR 0
 #define spin_unlock_wait(x)	do { barrier(); } while(spin_is_locked(x))
+#endif
 
 #define spin_lock_string \
 	"\n1:\t" \
@@ -125,6 +164,9 @@
 		"xchgb %b0,%1"
 		:"=q" (oldval), "=m" (lock->slock)
 		:"0" (0) : "memory");
+#if defined(CONFIG_XEN) && defined(CONFIG_SMP)
+	lock->cpu = SPINLOCK_CPU;
+#endif
 	return oldval > 0;
 }
 
@@ -136,43 +178,55 @@
 		BUG();
 	}
 #endif
-	__asm__ __volatile__(
-		spin_lock_string
-		:"=m" (lock->slock) : : "memory");
+#if defined(CONFIG_XEN) && defined(CONFIG_SMP)
+	while (1) {
+		if ( likely(_raw_spin_trylock(lock)) )
+			break;
+		do {
+			cpu_relax();
+			if (SHARED_PROCESSOR)
+				__spin_yield(lock);
+		} while (likely(spin_is_locked(lock)));
+		cpu_relax();
+	}
+#else
+   __asm__ __volatile__(
+      spin_lock_string
+      :"=m" (lock->slock) : : "memory");
+#endif
 }
 
 static inline void _raw_spin_lock_flags (spinlock_t *lock, unsigned long flags)
 {
+#if defined(CONFIG_XEN) && defined(CONFIG_SMP)
+	unsigned long flags_dis;
+#endif
 #ifdef CONFIG_DEBUG_SPINLOCK
 	if (unlikely(lock->magic != SPINLOCK_MAGIC)) {
 		printk("eip: %p\n", __builtin_return_address(0));
 		BUG();
 	}
 #endif
-	__asm__ __volatile__(
-		spin_lock_string_flags
-		:"=m" (lock->slock) : "r" (flags) : "memory");
-}
-
-/*
- * Read-write spinlocks, allowing multiple readers
- * but only one writer.
- *
- * NOTE! it is quite common to have readers in interrupts
- * but no interrupt writers. For those circumstances we
- * can "mix" irq-safe locks - any writer needs to get a
- * irq-safe write-lock, but readers can get non-irqsafe
- * read-locks.
- */
-typedef struct {
-	volatile unsigned int lock;
-#ifdef CONFIG_DEBUG_SPINLOCK
-	unsigned magic;
-#endif
-#ifdef CONFIG_PREEMPT
-	unsigned int break_lock;
+#if defined(CONFIG_XEN) && defined(CONFIG_SMP)
+	while (1) {
+		if ( likely(_raw_spin_trylock(lock)) )
+			break;
+		local_save_flags(flags_dis);
+		local_irq_restore(flags);
+		do {
+			cpu_relax();
+			if (SHARED_PROCESSOR)
+				__spin_yield(lock);
+		} while (likely(spin_is_locked(lock)));
+		cpu_relax();
+		local_irq_restore(flags_dis);
+	}
+#else
+   __asm__ __volatile__(
+      spin_lock_string_flags
+      :"=m" (lock->slock) : "r" (flags) : "memory");
 #endif
-} rwlock_t;
+}
 
 #define RWLOCK_MAGIC	0xdeaf1eed
 
@@ -198,6 +252,18 @@
  */
 #define write_can_lock(x) ((x)->lock == RW_LOCK_BIAS)
 
+static inline int _raw_write_trylock(rwlock_t *lock)
+{
+	atomic_t *count = (atomic_t *)lock;
+	if (atomic_sub_and_test(RW_LOCK_BIAS, count)) {
+#if defined(CONFIG_XEN) && defined(CONFIG_SMP)
+      lock->cpu = SPINLOCK_CPU;
+#endif
+		return 1;
+   }
+	atomic_add(RW_LOCK_BIAS, count);
+	return 0;
+}
 /*
  * On x86, we implement read-write locks as a 32-bit counter
  * with the high bit (sign) being the "contended" bit.
@@ -222,7 +288,20 @@
 #ifdef CONFIG_DEBUG_SPINLOCK
 	BUG_ON(rw->magic != RWLOCK_MAGIC);
 #endif
+#if defined(CONFIG_XEN) && defined(CONFIG_SMP)
+	while (1) {
+		if ( likely(_raw_write_trylock(rw)) )
+			break;
+		do {
+			cpu_relax();
+			if (SHARED_PROCESSOR)
+				__rw_yield(rw);
+		} while ( likely(!write_can_lock(rw)));
+		cpu_relax();
+	}
+#else
 	__build_write_lock(rw, "__write_lock_failed");
+#endif
 }
 
 #define _raw_read_unlock(rw)		asm volatile("lock ; incl %0" :"=m" ((rw)->lock) : : "memory")
@@ -238,13 +317,6 @@
 	return 0;
 }
 
-static inline int _raw_write_trylock(rwlock_t *lock)
-{
-	atomic_t *count = (atomic_t *)lock;
-	if (atomic_sub_and_test(RW_LOCK_BIAS, count))
-		return 1;
-	atomic_add(RW_LOCK_BIAS, count);
-	return 0;
-}
+
 
 #endif /* __ASM_SPINLOCK_H */
diff -urN b/xen/arch/x86/domain.c c/xen/arch/x86/domain.c
--- b/xen/arch/x86/domain.c	2005-06-02 22:21:41.000000000 -0500
+++ c/xen/arch/x86/domain.c	2005-06-03 09:42:37.487868084 -0500
@@ -240,6 +240,8 @@
     memset(d->shared_info, 0, PAGE_SIZE);
     v->vcpu_info = &d->shared_info->vcpu_data[v->vcpu_id];
     v->cpumap = CPUMAP_RUNANYWHERE;
+    /* default vcpus to sharing physical cpus */
+    d->shared_info->shproc = 1;
     SHARE_PFN_WITH_DOMAIN(virt_to_page(d->shared_info), d);
     machine_to_phys_mapping[virt_to_phys(d->shared_info) >> 
                            PAGE_SHIFT] = INVALID_M2P_ENTRY;
diff -urN b/xen/arch/x86/x86_32/entry.S c/xen/arch/x86/x86_32/entry.S
--- b/xen/arch/x86/x86_32/entry.S	2005-06-02 22:21:43.000000000 -0500
+++ c/xen/arch/x86/x86_32/entry.S	2005-06-03 09:41:11.000000173 -0500
@@ -751,6 +751,7 @@
         .long do_boot_vcpu
         .long do_ni_hypercall       /* 25 */
         .long do_mmuext_op
+        .long do_confer
         .rept NR_hypercalls-((.-hypercall_table)/4)
         .long do_ni_hypercall
         .endr
diff -urN b/xen/common/domain.c c/xen/common/domain.c
--- b/xen/common/domain.c	2005-06-02 22:21:37.000000000 -0500
+++ c/xen/common/domain.c	2005-06-03 09:42:09.839752947 -0500
@@ -392,6 +392,7 @@
 
     atomic_set(&v->pausecnt, 0);
     v->cpumap = CPUMAP_RUNANYWHERE;
+    set_bit(_VCPUF_canconfer, &v->vcpu_flags);
 
     memcpy(&v->arch, &idle0_vcpu.arch, sizeof(v->arch));
 
diff -urN b/xen/common/schedule.c c/xen/common/schedule.c
--- b/xen/common/schedule.c	2005-06-02 22:21:42.000000000 -0500
+++ c/xen/common/schedule.c	2005-06-03 09:41:49.540601494 -0500
@@ -219,6 +219,11 @@
     spin_lock_irqsave(&schedule_data[v->processor].schedule_lock, flags);
     if ( likely(domain_runnable(v)) )
     {
+        /* mark current's confer state */
+        if ( test_bit(_VCPUF_conferring, &current->vcpu_flags) ) {
+            clear_bit(_VCPUF_conferring, &current->vcpu_flags);
+            set_bit(_VCPUF_conferred, &current->vcpu_flags);
+        }
         SCHED_OP(wake, v);
 #ifdef WAKE_HISTO
         v->wokenup = NOW();
@@ -260,6 +265,51 @@
     return 0;
 }
 
+/* Confer control to another vcpu */
+long do_confer(unsigned int vcpu, unsigned int yield_count)
+{
+    struct domain *d = current->domain;
+   
+    /* Validate CONFER prereqs:
+    * - vcpu is within bounds
+    * - vcpu is a valid in this domain
+    * - current has not already conferred its slice to vcpu
+    * - vcpu is not already running
+    * - designated vcpu's yield_count matches value from call
+    *
+    * of all are ok, then set conferred value and enter scheduler
+    */
+
+    if (unlikely(vcpu > MAX_VIRT_CPUS))
+        return 0; 
+
+    if (unlikely(d->vcpu[vcpu] == NULL))
+        return 0;
+
+    if (unlikely(!test_bit(_VCPUF_canconfer, &current->vcpu_flags)))
+        return 0;
+
+    /* even counts indicate a running vcpu, odd is preempted/conferred */
+    /* don't confer if holder is currently running */
+    if (unlikely((d->vcpu[vcpu]->vcpu_info->yield_count & 1) == 0))
+        return 0;
+
+    if (unlikely(d->vcpu[vcpu]->vcpu_info->yield_count != yield_count))
+        return 0;
+
+    /*
+     * set current's state to conferring, wake target
+     */
+    clear_bit(_VCPUF_canconfer, &current->vcpu_flags);
+    set_bit(_VCPUF_conferring, &current->vcpu_flags);
+    domain_wake(d->vcpu[vcpu]);
+
+    /* give up my timeslice */
+    do_yield();
+
+    return 0;
+}
+
 /*
  * Demultiplex scheduler-related hypercalls.
  */
@@ -422,7 +472,15 @@
 
     r_time = next_slice.time;
     next = next_slice.task;
-    
+
+    /*
+     * always clear conferred state so this vcpu can confer during its slice
+     * since it can confer, clear all other confer state
+     */
+    set_bit(_VCPUF_canconfer,   &next->vcpu_flags);
+    clear_bit(_VCPUF_conferring, &next->vcpu_flags);
+    clear_bit(_VCPUF_conferred,  &next->vcpu_flags);
+
     schedule_data[cpu].curr = next;
     
     next->lastschd = now;
@@ -434,6 +492,12 @@
 
     spin_unlock_irq(&schedule_data[cpu].schedule_lock);
 
+     /* bump vcpu yield_count when controlling domain is not-idle */
+     if ( !is_idle_task(prev->domain) )
+         prev->vcpu_info->yield_count++;
+     if ( !is_idle_task(next->domain) )
+         next->vcpu_info->yield_count++;
+ 
     if ( unlikely(prev == next) )
         return continue_running(prev);
 
diff -urN b/xen/include/public/xen.h c/xen/include/public/xen.h
--- b/xen/include/public/xen.h	2005-06-02 22:21:41.000000000 -0500
+++ c/xen/include/public/xen.h	2005-06-03 09:41:11.040994437 -0500
@@ -58,6 +58,7 @@
 #define __HYPERVISOR_boot_vcpu            24
 #define __HYPERVISOR_set_segment_base     25 /* x86/64 only */
 #define __HYPERVISOR_mmuext_op            26
+#define __HYPERVISOR_confer               27
 
 /* 
  * VIRTUAL INTERRUPTS
@@ -324,8 +325,11 @@
     u8 evtchn_upcall_mask;              /* 1 */
     u8 pad0, pad1;
     u32 evtchn_pending_sel;             /* 4 */
-    arch_vcpu_info_t arch;              /* 8 */
-} PACKED vcpu_info_t;                   /* 8 + arch */
+    /* Even when vcpu is running, Odd when it is preempted/conferred */
+    u32 yield_count;                    /* 8 */
+    u32 pad2;                           /* 12 */
+    arch_vcpu_info_t arch;              /* 16 */
+} PACKED vcpu_info_t;                   /* 16 + arch */
 
 /*
  * Xen/kernel shared data -- pointer provided in start_info.
@@ -337,6 +341,9 @@
 
     u32 n_vcpu;
 
+    /* set if domains' vcpus share physical cpus */
+    int shproc; 
+
     /*
      * A domain can have up to 1024 "event channels" on which it can send
      * and receive asynchronous event notifications. There are three classes
diff -urN b/xen/include/xen/sched.h c/xen/include/xen/sched.h
--- b/xen/include/xen/sched.h	2005-06-02 22:21:36.000000000 -0500
+++ c/xen/include/xen/sched.h	2005-06-03 09:41:11.042994158 -0500
@@ -342,6 +342,15 @@
  /* Initialization completed. */
 #define _VCPUF_initialised     8
 #define VCPUF_initialised      (1UL<<_VCPUF_initialised)
+ /* Able to give time slice to another vcpu */
+#define _VCPUF_canconfer       9
+#define VCPUF_canconfer        (1UL<<_VCPUF_canconfer)
+ /* Currently giving time slice to another vcpu */
+#define _VCPUF_conferring      10
+#define VCPUF_conferring       (1UL<<_VCPUF_conferring)
+ /* Already given time slice to another vcpu */
+#define _VCPUF_conferred       11
+#define VCPUF_conferred        (1UL<<_VCPUF_conferred)
 
 /*
  * Per-domain flags (domain_flags).

[-- Attachment #3: confer_debug.patch --]
[-- Type: text/plain, Size: 4031 bytes --]

diff -urN c/xen/common/keyhandler.c d/xen/common/keyhandler.c
--- c/xen/common/keyhandler.c	2005-06-02 22:21:41.000000000 -0500
+++ d/xen/common/keyhandler.c	2005-06-03 09:48:15.075135457 -0500
@@ -12,6 +12,7 @@
 #include <xen/softirq.h>
 #include <xen/domain.h>
 #include <asm/debugger.h>
+#include <public/xen.h>
 
 #define KEY_MAX 256
 #define STR_MAX  64
@@ -139,6 +140,33 @@
     read_unlock(&domlist_lock);
 }
 
+static void do_dump_confer(unsigned char key)
+{
+    struct domain *d;
+    struct vcpu *v;
+    s_time_t       now = NOW();
+
+    printk("'%c' pressed -> dumping confer stats (now=0x%X:%08X)\n", key,
+           (u32)(now>>32), (u32)now);
+
+    read_lock(&domlist_lock);
+    for_each_domain ( d )
+    {
+        for_each_vcpu ( d, v )
+        {
+            printk("Xen: D%d, V%d, C%d,"
+                   " cnfs %d, c_out %d,"
+                   " c_in %d, ycnt %d\n",
+                   d->domain_id, v->vcpu_id, v->processor,
+                   v->confercnt, v->confer_out, 
+                   v->confer_in, v->vcpu_info->yield_count
+                  );
+        }
+
+    }
+    read_unlock(&domlist_lock);
+}
+
 extern void dump_runq(unsigned char key);
 extern void print_sched_histo(unsigned char key);
 extern void reset_sched_histo(unsigned char key);
@@ -184,6 +212,8 @@
     register_keyhandler(
         'q', do_task_queues, "dump task queues + guest state");
     register_keyhandler(
+        'c', do_dump_confer, "dump confer stats");
+    register_keyhandler(
         'r', dump_runq,      "dump run queues");
     register_irq_keyhandler(
         'R', halt_machine,   "reboot machine"); 
diff -urN c/xen/common/schedule.c d/xen/common/schedule.c
--- c/xen/common/schedule.c	2005-06-03 09:41:49.540601494 -0500
+++ d/xen/common/schedule.c	2005-06-03 09:46:01.768032249 -0500
@@ -223,6 +223,9 @@
         if ( test_bit(_VCPUF_conferring, &current->vcpu_flags) ) {
             clear_bit(_VCPUF_conferring, &current->vcpu_flags);
             set_bit(_VCPUF_conferred, &current->vcpu_flags);
+            /* increment confer counters */
+            current->confer_out++;
+            v->confer_in++;
         }
         SCHED_OP(wake, v);
 #ifdef WAKE_HISTO
@@ -270,6 +273,9 @@
 {
     struct domain *d = current->domain;
    
+    /* count hcalls */
+    current->confercnt++;
+
     /* Validate CONFER prereqs:
     * - vcpu is within bounds
     * - vcpu is a valid in this domain
@@ -286,16 +292,22 @@
     if (unlikely(d->vcpu[vcpu] == NULL))
         return 0;
 
-    if (unlikely(!test_bit(_VCPUF_canconfer, &current->vcpu_flags)))
+    if (unlikely(!test_bit(_VCPUF_canconfer, &current->vcpu_flags))) {
+         printk("confer: canconfer not set, %d->vcpu-flags = 0x%08lx\n", current->vcpu_id, current->vcpu_flags);
         return 0;
+    }
 
     /* even counts indicate a running vcpu, odd is preempted/conferred */
     /* don't confer if holder is currently running */
-    if (unlikely((d->vcpu[vcpu]->vcpu_info->yield_count & 1) == 0))
+    if (unlikely((d->vcpu[vcpu]->vcpu_info->yield_count & 1) == 0)) {
+        printk("confer: vcpu %d already running\n", vcpu);
         return 0;
+    }
 
-    if (unlikely(d->vcpu[vcpu]->vcpu_info->yield_count != yield_count))
+    if (unlikely(d->vcpu[vcpu]->vcpu_info->yield_count != yield_count)) {
+        printk("confer: yield_count mismatch\n");
         return 0;
+    }
 
     /*
      * set current's state to conferring, wake target
diff -urN c/xen/include/xen/sched.h d/xen/include/xen/sched.h
--- c/xen/include/xen/sched.h	2005-06-03 09:41:11.042994158 -0500
+++ d/xen/include/xen/sched.h	2005-06-03 09:43:52.881249246 -0500
@@ -76,6 +76,9 @@
     atomic_t         pausecnt;
 
     cpumap_t         cpumap;        /* which cpus this domain can run on */
+    u32              confer_out;    /* inc when conferring to another vcpu */
+    u32              confer_in;     /* inc when conferred from another vcpu */
+    u32              confercnt;     /* # of do_confer hcalls */
 
     struct arch_vcpu arch;
 };

[-- Attachment #4: Type: text/plain, Size: 138 bytes --]

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xensource.com
http://lists.xensource.com/xen-devel

^ permalink raw reply	[flat|nested] 25+ messages in thread

* RE: [PATCH] Yield to VCPU hcall, spinlock yielding
@ 2005-06-03 20:41 Ian Pratt
  2005-06-03 20:59 ` Ryan Harper
  0 siblings, 1 reply; 25+ messages in thread
From: Ian Pratt @ 2005-06-03 20:41 UTC (permalink / raw)
  To: Ryan Harper, xen-devel

> I've not recieved any feedback on this.  Following this patch 
> up with one that applies against current.  Builds, but 
> haven't tested it since current SMP domains don't run.  

Steven Smith has been experimenting with and benchmarking a number of
different variants of this approach, testing a range of different
preemption mitigation and avoidance techniques. I'm sure we'll hear more
next week...

My gut feeling is that we can get away with something simpler than the
confer technique as we only need it as a hint. Anyhow, lets see.

Have you any suggestions for metrics for comparing the schemes? lmbench
is quite good for assessing the no contenion case. Perhaps doing a
kernel build on a guest with VCPUs > phy CPUs is a reasonable way of
assesing the benefit.

Thanks,
Ian

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH] Yield to VCPU hcall, spinlock yielding
  2005-06-03 20:41 Ian Pratt
@ 2005-06-03 20:59 ` Ryan Harper
  0 siblings, 0 replies; 25+ messages in thread
From: Ryan Harper @ 2005-06-03 20:59 UTC (permalink / raw)
  To: Ian Pratt
  Cc: xen-devel, Bryan Rosenburg, Michael Hohnbaum, Orran Krieger,
	Ryan Harper

* Ian Pratt <m+Ian.Pratt@cl.cam.ac.uk> [2005-06-03 15:41]:
> > I've not recieved any feedback on this.  Following this patch 
> > up with one that applies against current.  Builds, but 
> > haven't tested it since current SMP domains don't run.  
> 
> Steven Smith has been experimenting with and benchmarking a number of
> different variants of this approach, testing a range of different
> preemption mitigation and avoidance techniques. I'm sure we'll hear more
> next week...

Great.  I'm looking forward to seeing how this turns out.

> My gut feeling is that we can get away with something simpler than the
> confer technique as we only need it as a hint. Anyhow, lets see.
> 
> Have you any suggestions for metrics for comparing the schemes? lmbench
> is quite good for assessing the no contenion case. Perhaps doing a
> kernel build on a guest with VCPUs > phy CPUs is a reasonable way of
> assesing the benefit.

We have currently been using a lock-intensive program, [1]pft as a
benchmark.  I patched in lockmeter to measure the 'lockiness' of various
benchmarks, and even with 8 VCPUS on backed on a single cpu doesn't
generate a large number of lock contentions.  pft is far more lock
intensive.

However, one of our concerns with confer/directed yielding is that the
lock holder vcpu doesn't know that it was given a time-slice and that
it should voluntarily yield giving other vcpus get a chance at the lock.
With out such a mechanism, one can imagine that the lock holder would
continue on and possibly grab the lock yet again before being preempted
to which another vcpu will then yield, etc.  We could add something
in the vcpu_info array indicating that it was given a slice and in
_raw_spin_unlock() check and call do_yield().  These spinlock changes
certainly affect the speed of the spinlocks in Linux which is one of the
reasons we wanted to avoid directed yielding or any other  mechanism
that required spinlock accounting.

I don't know if you had a chance to see my status on the [2]preemption
notification from about a month ago.  I'm going to bring that patch up
to current and re-run the tests to see where things are again.  Please
take a look at the original results.

1. http://k42.ozlabs.org/wikiattach/PftPerformanceK42/attachments/pft.c.txt
2. http://lists.xensource.com/archives/html/xen-devel/2005-05/msg00139.html

Thanks,
Ryan

-- 
Ryan Harper
Software Engineer; Linux Technology Center
IBM Corp., Austin, Tx
(512) 838-9253   T/L: 678-9253
ryanh@us.ibm.com

^ permalink raw reply	[flat|nested] 25+ messages in thread

* RE: [PATCH] Yield to VCPU hcall, spinlock yielding
@ 2005-06-03 21:17 Ian Pratt
  2005-06-03 21:48 ` Ryan Harper
  2005-06-06 13:14 ` Orran Y Krieger
  0 siblings, 2 replies; 25+ messages in thread
From: Ian Pratt @ 2005-06-03 21:17 UTC (permalink / raw)
  To: Ryan Harper; +Cc: Bryan Rosenburg, xen-devel, Michael Hohnbaum, Orran Krieger


> > Have you any suggestions for metrics for comparing the schemes? 
> > lmbench is quite good for assessing the no contenion case. Perhaps 
> > doing a kernel build on a guest with VCPUs > phy CPUs is a 
> reasonable 
> > way of assesing the benefit.
> 
> We have currently been using a lock-intensive program, [1]pft 
> as a benchmark.  I patched in lockmeter to measure the 
> 'lockiness' of various benchmarks, and even with 8 VCPUS on 
> backed on a single cpu doesn't generate a large number of 
> lock contentions.  pft is far more lock intensive.

I'll take a look at pft. Does it use futexes, or is it just contending
for spinlocks in the kernel?
 
> However, one of our concerns with confer/directed yielding is 
> that the lock holder vcpu doesn't know that it was given a 
> time-slice and that it should voluntarily yield giving other 
> vcpus get a chance at the lock.
> With out such a mechanism, one can imagine that the lock 
> holder would continue on and possibly grab the lock yet again 
> before being preempted to which another vcpu will then yield, 
> etc.  We could add something in the vcpu_info array 
> indicating that it was given a slice and in
> _raw_spin_unlock() check and call do_yield().  These spinlock 
> changes certainly affect the speed of the spinlocks in Linux 
> which is one of the reasons we wanted to avoid directed 
> yielding or any other  mechanism that required spinlock accounting.

Spinlock accounting that doesn't involve lock'ed operations might be OK.

> I don't know if you had a chance to see my status on the 
> [2]preemption notification from about a month ago.  I'm going 
> to bring that patch up to current and re-run the tests to see 
> where things are again.  Please take a look at the original results.

Thanks, I did look at the graphs at the time. As I recall, the
notification mechanism was beginning to look somewhat expensive under
high context switch loads induced by IO. We'll have to see what the cost
of using spinlock accounting is. If there are no locked operations wre
might be OK.

BTW: it would be really great if you could work up a patch to enable
xm/xend to add/remove VCPUs from a domain.

Thanks,
Ian 

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH] Yield to VCPU hcall, spinlock yielding
  2005-06-03 21:17 Ian Pratt
@ 2005-06-03 21:48 ` Ryan Harper
  2005-06-06 13:14 ` Orran Y Krieger
  1 sibling, 0 replies; 25+ messages in thread
From: Ryan Harper @ 2005-06-03 21:48 UTC (permalink / raw)
  To: Ian Pratt
  Cc: xen-devel, Bryan Rosenburg, Michael Hohnbaum, Orran Krieger,
	Ryan Harper

* Ian Pratt <m+Ian.Pratt@cl.cam.ac.uk> [2005-06-03 16:18]:
> 
> > > Have you any suggestions for metrics for comparing the schemes? 
> > > lmbench is quite good for assessing the no contenion case. Perhaps 
> > > doing a kernel build on a guest with VCPUs > phy CPUs is a 
> > reasonable 
> > > way of assesing the benefit.
> > 
> > We have currently been using a lock-intensive program, [1]pft 
> > as a benchmark.  I patched in lockmeter to measure the 
> > 'lockiness' of various benchmarks, and even with 8 VCPUS on 
> > backed on a single cpu doesn't generate a large number of 
> > lock contentions.  pft is far more lock intensive.
> 
> I'll take a look at pft. Does it use futexes, or is it just contending
> for spinlocks in the kernel?

It contends for spinlocks in kernel.
>  
> > However, one of our concerns with confer/directed yielding is 
> > that the lock holder vcpu doesn't know that it was given a 
> > time-slice and that it should voluntarily yield giving other 
> > vcpus get a chance at the lock.
> > With out such a mechanism, one can imagine that the lock 
> > holder would continue on and possibly grab the lock yet again 
> > before being preempted to which another vcpu will then yield, 
> > etc.  We could add something in the vcpu_info array 
> > indicating that it was given a slice and in
> > _raw_spin_unlock() check and call do_yield().  These spinlock 
> > changes certainly affect the speed of the spinlocks in Linux 
> > which is one of the reasons we wanted to avoid directed 
> > yielding or any other  mechanism that required spinlock accounting.
> 
> Spinlock accounting that doesn't involve lock'ed operations might be OK.

Do you mean to do the accouting somewhere besides in the lock routines?

> > I don't know if you had a chance to see my status on the 
> > [2]preemption notification from about a month ago.  I'm going 
> > to bring that patch up to current and re-run the tests to see 
> > where things are again.  Please take a look at the original results.
> 
> Thanks, I did look at the graphs at the time. As I recall, the
> notification mechanism was beginning to look somewhat expensive under
> high context switch loads induced by IO. We'll have to see what the cost

Yes.  One of the tweaks we are looking to do is change the IO operation
from kernel space (responding to an icmp packet happens within the
kernel) to something that is more IO realistic which would involve
more time per operation, like sending a message over tcp (echo server or
something like that).

> BTW: it would be really great if you could work up a patch to enable
> xm/xend to add/remove VCPUs from a domain.

OK.  I have an older patch that I'll bring up-to-date.  Here is a list
of things that I think we should do with add/remove.

1. Fix cpu_down() to tell Xen to remove the vcpu from its list of
runnable domains.  Currently it a "down" vcpu only yields it's timeslice
back.

2. Fix cpu_up() to have Xen make the target vcpu runnable again.

3. Add cpu_remove() which removes the cpu from Linux, and removes the
vcpu in Xen.

4. Add cpu_add() which boots another vcpu and then brings it up another
cpu in Linux.

I expect that cpu_up/cpu_down to be more light-weight than
cpu_add/cpu_remove.

Does that sound reasonable.  Do we want all four or can we live with
just 1 and 2?

-- 
Ryan Harper
Software Engineer; Linux Technology Center
IBM Corp., Austin, Tx
(512) 838-9253   T/L: 678-9253
ryanh@us.ibm.com

^ permalink raw reply	[flat|nested] 25+ messages in thread

* RE: [PATCH] Yield to VCPU hcall, spinlock yielding
  2005-06-03 21:17 Ian Pratt
  2005-06-03 21:48 ` Ryan Harper
@ 2005-06-06 13:14 ` Orran Y Krieger
  1 sibling, 0 replies; 25+ messages in thread
From: Orran Y Krieger @ 2005-06-06 13:14 UTC (permalink / raw)
  To: Ian Pratt
  Cc: Bryan S Rosenburg, ryanh, xen-devel, hohnbaum, xen-devel-bounces

[-- Attachment #1.1: Type: text/plain, Size: 2600 bytes --]

xen-devel-bounces@lists.xensource.com wrote on 06/03/2005 05:17:16 PM:

> > However, one of our concerns with confer/directed yielding is 
> > that the lock holder vcpu doesn't know that it was given a 
> > time-slice and that it should voluntarily yield giving other 
> > vcpus get a chance at the lock.
> > With out such a mechanism, one can imagine that the lock 
> > holder would continue on and possibly grab the lock yet again 
> > before being preempted to which another vcpu will then yield, 
> > etc.  We could add something in the vcpu_info array 
> > indicating that it was given a slice and in
> > _raw_spin_unlock() check and call do_yield().  These spinlock 
> > changes certainly affect the speed of the spinlocks in Linux 
> > which is one of the reasons we wanted to avoid directed 
> > yielding or any other  mechanism that required spinlock accounting.
> 
> Spinlock accounting that doesn't involve lock'ed operations might be OK.

Ian, remember we are not just talking about the kernel, the preemption 
notification scheme also solves in a natural and OS agnostic way 
performance problems from application locking.  The cost of a extra upcall 
to the already running OS should, we believe, be negligable in normal 
operation and solves all the problems in a natural way. 

> 
> > I don't know if you had a chance to see my status on the 
> > [2]preemption notification from about a month ago.  I'm going 
> > to bring that patch up to current and re-run the tests to see 
> > where things are again.  Please take a look at the original results.
> 
> Thanks, I did look at the graphs at the time. As I recall, the
> notification mechanism was beginning to look somewhat expensive under
> high context switch loads induced by IO. 

Yeah, we realize now that the experiment was bogus, not representative of 
anything real, since the ping experiment could: 1) stress the rate of 
interrupts higher than any realistic workload, and 2) did so little work 
in the server that it wasn't worth it to do any preemption 
avoidance/tolerance technique. Ryan is looking at a more realistic 
workload now. 

> We'll have to see what the cost
> of using spinlock accounting is. If there are no locked operations wre
> might be OK.

You can only use spinlock accounting for dealing with locking issues in 
kernel (unless you are willing to change application level programs and 
libs).  If preemption notification overhead is not prohibitive, the fact 
that it solves the application problem as well as the kernel problem seems 
like a compelling advantage over spinlock accounting, doesn't it?

[-- Attachment #1.2: Type: text/html, Size: 3081 bytes --]

[-- Attachment #2: Type: text/plain, Size: 138 bytes --]

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xensource.com
http://lists.xensource.com/xen-devel

^ permalink raw reply	[flat|nested] 25+ messages in thread

* RE: [PATCH] Yield to VCPU hcall, spinlock yielding
@ 2005-06-03 22:06 Ian Pratt
  2005-06-03 22:52 ` Ryan Harper
  2005-06-04  6:55 ` Keir Fraser
  0 siblings, 2 replies; 25+ messages in thread
From: Ian Pratt @ 2005-06-03 22:06 UTC (permalink / raw)
  To: Ryan Harper; +Cc: Bryan Rosenburg, xen-devel, Michael Hohnbaum, Orran Krieger

 
> > I'll take a look at pft. Does it use futexes, or is it just 
> contending 
> > for spinlocks in the kernel?
> 
> It contends for spinlocks in kernel.

Sounds like this will be a good benchmark. Does it generate a
perofrmance figure as it runs? (e.g. iterations a second or such like).
  
> > Thanks, I did look at the graphs at the time. As I recall, the 
> > notification mechanism was beginning to look somewhat 
> expensive under 
> > high context switch loads induced by IO. We'll have to see what the 
> > cost
> 
> Yes.  One of the tweaks we are looking to do is change the IO 
> operation from kernel space (responding to an icmp packet 
> happens within the
> kernel) to something that is more IO realistic which would 
> involve more time per operation, like sending a message over 
> tcp (echo server or something like that).

Running a parallel UDP ping-pong test might be good. 
 
> > BTW: it would be really great if you could work up a patch 
> to enable 
> > xm/xend to add/remove VCPUs from a domain.
> 
> OK.  I have an older patch that I'll bring up-to-date.  

Great, thanks.

> Here 
> is a list of things that I think we should do with add/remove.
> 
> 1. Fix cpu_down() to tell Xen to remove the vcpu from its 
> list of runnable domains.  Currently it a "down" vcpu only 
> yields it's timeslice back.
> 
> 2. Fix cpu_up() to have Xen make the target vcpu runnable again.
> 
> 3. Add cpu_remove() which removes the cpu from Linux, and 
> removes the vcpu in Xen.
> 
> 4. Add cpu_add() which boots another vcpu and then brings it 
> up another cpu in Linux.
> 
> I expect that cpu_up/cpu_down to be more light-weight than 
> cpu_add/cpu_remove.
> 
> Does that sound reasonable.  Do we want all four or can we 
> live with just 1 and 2?

It's been a while since I looked at Xen's boot_vcpu code (which could do
with a bit of refactoring between common and arch anyhow), but I don't
recall there being anything in there that looked particularly expensive.
Having said that, it's only holding down a couple of KB of memory, so
maybe we just need up/down/add.

Thanks,
Ian 

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH] Yield to VCPU hcall, spinlock yielding
  2005-06-03 22:06 Ian Pratt
@ 2005-06-03 22:52 ` Ryan Harper
  2005-06-04  6:55 ` Keir Fraser
  1 sibling, 0 replies; 25+ messages in thread
From: Ryan Harper @ 2005-06-03 22:52 UTC (permalink / raw)
  To: Ian Pratt
  Cc: xen-devel, Bryan Rosenburg, Michael Hohnbaum, Orran Krieger,
	Ryan Harper

* Ian Pratt <m+Ian.Pratt@cl.cam.ac.uk> [2005-06-03 17:41]:
>  
> > > I'll take a look at pft. Does it use futexes, or is it just 
> > contending 
> > > for spinlocks in the kernel?
> > 
> > It contends for spinlocks in kernel.
> 
> Sounds like this will be a good benchmark. Does it generate a
> perofrmance figure as it runs? (e.g. iterations a second or such like).

yes, here some sample output:

#Gb Rep Thr CLine  User      System   Wall      flt/cpu/s fault/wsec
  0  5    8   1    2.30s      0.33s   1.05s     62296.578 104970.599

Gb = Gigabytes of mem (I used 128M)
Reps = repetitions of the test internally
Thr = # of test threads 

I generally run this with one thread per VCPU, and 128M of memory.

> > > Thanks, I did look at the graphs at the time. As I recall, the 
> > > notification mechanism was beginning to look somewhat 
> > expensive under 
> > > high context switch loads induced by IO. We'll have to see what the 
> > > cost
> > 
> > Yes.  One of the tweaks we are looking to do is change the IO 
> > operation from kernel space (responding to an icmp packet 
> > happens within the
> > kernel) to something that is more IO realistic which would 
> > involve more time per operation, like sending a message over 
> > tcp (echo server or something like that).
> 
> Running a parallel UDP ping-pong test might be good. 

OK.

> > Here 
> > is a list of things that I think we should do with add/remove.
> > 
> > 1. Fix cpu_down() to tell Xen to remove the vcpu from its 
> > list of runnable domains.  Currently it a "down" vcpu only 
> > yields it's timeslice back.
> > 
> > 2. Fix cpu_up() to have Xen make the target vcpu runnable again.
> > 
> > 3. Add cpu_remove() which removes the cpu from Linux, and 
> > removes the vcpu in Xen.
> > 
> > 4. Add cpu_add() which boots another vcpu and then brings it 
> > up another cpu in Linux.
> > 
> > I expect that cpu_up/cpu_down to be more light-weight than 
> > cpu_add/cpu_remove.
> > 
> > Does that sound reasonable.  Do we want all four or can we 
> > live with just 1 and 2?
> 
> It's been a while since I looked at Xen's boot_vcpu code (which could do
> with a bit of refactoring between common and arch anyhow), but I don't
> recall there being anything in there that looked particularly expensive.
> Having said that, it's only holding down a couple of KB of memory, so
> maybe we just need up/down/add.

Sounds good.


-- 
Ryan Harper
Software Engineer; Linux Technology Center
IBM Corp., Austin, Tx
(512) 838-9253   T/L: 678-9253
ryanh@us.ibm.com

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH] Yield to VCPU hcall, spinlock yielding
  2005-06-03 22:06 Ian Pratt
  2005-06-03 22:52 ` Ryan Harper
@ 2005-06-04  6:55 ` Keir Fraser
  1 sibling, 0 replies; 25+ messages in thread
From: Keir Fraser @ 2005-06-04  6:55 UTC (permalink / raw)
  To: Ian Pratt
  Cc: Bryan Rosenburg, Orran Krieger, xen-devel, Michael Hohnbaum,
	Ryan Harper


On 3 Jun 2005, at 23:06, Ian Pratt wrote:

> It's been a while since I looked at Xen's boot_vcpu code (which could 
> do
> with a bit of refactoring between common and arch anyhow), but I don't
> recall there being anything in there that looked particularly 
> expensive.
> Having said that, it's only holding down a couple of KB of memory, so
> maybe we just need up/down/add.

You can't do a full remove without auditing the whole of Xen and adding 
a reference count to the vcpu structure. We're probably best off lazily 
allocating vcpus but then destroy them all when the domain finally is 
destructed.

  -- Keir

^ permalink raw reply	[flat|nested] 25+ messages in thread

* RE: [PATCH] Yield to VCPU hcall, spinlock yielding
@ 2005-06-07 10:38 Ian Pratt
  2005-06-08 15:19 ` Bryan S Rosenburg
  2005-06-09 12:54 ` Orran Y Krieger
  0 siblings, 2 replies; 25+ messages in thread
From: Ian Pratt @ 2005-06-07 10:38 UTC (permalink / raw)
  To: Orran Y Krieger
  Cc: xen-devel, Bryan S Rosenburg, hohnbaum, ryanh, xen-devel-bounces

> You can only use spinlock accounting for dealing with locking 
> issues in kernel (unless you are willing to change 
> application level programs and libs).  If preemption 
> notification overhead is not prohibitive, the fact that it 
> solves the application problem as well as the kernel problem 
> seems like a compelling advantage over spinlock accounting, 
> doesn't it? 

Orran, 

Are you assuming the pre-emption notification is going to get propagated
to user-space as a signal, and that user space applications would be
modified to take advantage of the signal? (possibly this could be hidden
in the pthread library?} Since the kernel doesn't know when user space
has an application lock or not, that's going to be a lot of signals. 

Cheers,
Ian

^ permalink raw reply	[flat|nested] 25+ messages in thread

* RE: [PATCH] Yield to VCPU hcall, spinlock yielding
  2005-06-07 10:38 Ian Pratt
@ 2005-06-08 15:19 ` Bryan S Rosenburg
  2005-06-09 12:54 ` Orran Y Krieger
  1 sibling, 0 replies; 25+ messages in thread
From: Bryan S Rosenburg @ 2005-06-08 15:19 UTC (permalink / raw)
  To: Ian Pratt; +Cc: ryanh, xen-devel, hohnbaum, Orran Y Krieger


[-- Attachment #1.1: Type: text/plain, Size: 1483 bytes --]

"Ian Pratt" <m+Ian.Pratt@cl.cam.ac.uk> wrote on 06/07/2005 06:38:14 AM:

> 
> > You can only use spinlock accounting for dealing with locking 
> > issues in kernel (unless you are willing to change 
> > application level programs and libs).  If preemption 
> > notification overhead is not prohibitive, the fact that it 
> > solves the application problem as well as the kernel problem 
> > seems like a compelling advantage over spinlock accounting, 
> > doesn't it? 
> 
> Orran, 
> 
> Are you assuming the pre-emption notification is going to get propagated
> to user-space as a signal, and that user space applications would be
> modified to take advantage of the signal? (possibly this could be hidden
> in the pthread library?} Since the kernel doesn't know when user space
> has an application lock or not, that's going to be a lot of signals. 
> 
> Cheers,
> Ian

We're certainly *not* proposing that preemption notifications be 
propagated to user-space as signals.  We're not proposing any changes at 
all to the user/kernel interface.  Preemption notifications to the kernel 
will allow Linux to support its multiprocessor applications as well as 
(but not better than) it does today running natively.  The key point is 
that with kernel-level preemption notification, VCPUs are always in kernel 
mode when suspended, never in user mode.  Application state is always 
saved in Linux, not in Xen, and is available to be resumed on another VCPU 
if Linux so chooses.

- Bryan

[-- Attachment #1.2: Type: text/html, Size: 1811 bytes --]

[-- Attachment #2: Type: text/plain, Size: 138 bytes --]

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xensource.com
http://lists.xensource.com/xen-devel

^ permalink raw reply	[flat|nested] 25+ messages in thread

* RE: [PATCH] Yield to VCPU hcall, spinlock yielding
  2005-06-07 10:38 Ian Pratt
  2005-06-08 15:19 ` Bryan S Rosenburg
@ 2005-06-09 12:54 ` Orran Y Krieger
  1 sibling, 0 replies; 25+ messages in thread
From: Orran Y Krieger @ 2005-06-09 12:54 UTC (permalink / raw)
  To: Ian Pratt
  Cc: Bryan S Rosenburg, ryanh, xen-devel, hohnbaum, xen-devel-bounces

[-- Attachment #1.1: Type: text/plain, Size: 1550 bytes --]

xen-devel-bounces@lists.xensource.com wrote on 06/07/2005 06:38:14 AM:

> Orran, 
> 
> Are you assuming the pre-emption notification is going to get propagated
> to user-space as a signal, and that user space applications would be
> modified to take advantage of the signal? (possibly this could be hidden
> in the pthread library?} Since the kernel doesn't know when user space
> has an application lock or not, that's going to be a lot of signals. 

Nope, we reviewed a couple of alternatives for notification to user level 
and for now are not persuing any of them.  The pre-emption notification 
goes to the kernel, and its an OS specific thing what it chooses to do 
with it. Our current plan in Linux is to run a high priority thread in the 
kernel which makes a hypervisor call to yield back the processor. Hence, 
the application thread context is not hidden in the hypervisor, but the 
application thread is marked by linux as ready to run.  Now if the thread 
is a high priority thread holding an application lock normal linux 
mechanisms will cause it to be scheduled on some other processor.  Without 
the preemption notification, or if we yield back at interrupt level, the 
application thread register state is buried in the hypervisor, and of 
course the thread appears to Linux to be running...

Everything seems to just fall out nicely in Linux as long as we make the 
call back on a kernel thread. Different OSes, of course, will have 
different techniques they can use given a notification of an attempt to 
preempt. 

    -- Orran

[-- Attachment #1.2: Type: text/html, Size: 1764 bytes --]

[-- Attachment #2: Type: text/plain, Size: 138 bytes --]

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xensource.com
http://lists.xensource.com/xen-devel

^ permalink raw reply	[flat|nested] 25+ messages in thread

* RE: [PATCH] Yield to VCPU hcall, spinlock yielding
@ 2005-06-08 18:25 Ian Pratt
  2005-06-08 18:40 ` Bryan S Rosenburg
  0 siblings, 1 reply; 25+ messages in thread
From: Ian Pratt @ 2005-06-08 18:25 UTC (permalink / raw)
  To: Bryan S Rosenburg; +Cc: ryanh, xen-devel, hohnbaum, Orran Y Krieger

> The key point is that with 
> kernel-level preemption notification, VCPUs are always in 
> kernel mode when suspended, never in user mode.  Application 
> state is always saved in Linux, not in Xen, and is available 
> to be resumed on another VCPU if Linux so chooses. 

In principle, but...

Do you believe this is going to interact well with Linux's work stealing
CPU migration? I haven't looked closely at the current code, but from
Linux's scheduler's POV the de-scheduled (yielded) CPU looks like a
perfectly healthy CPU, so there's no particular reason that another CPU
would steal work from it (without hacking the algorithm, which I suppose
we could do). Also, do you have to do something special in your yield
routine to ensure that no real process is currently running on the
yielded processor so that all processes on the run queue are available
for stealing?

Ian

^ permalink raw reply	[flat|nested] 25+ messages in thread

* RE: [PATCH] Yield to VCPU hcall, spinlock yielding
  2005-06-08 18:25 Ian Pratt
@ 2005-06-08 18:40 ` Bryan S Rosenburg
  2005-06-08 19:11   ` Andrew Theurer
  2005-06-08 19:17   ` Ryan Harper
  0 siblings, 2 replies; 25+ messages in thread
From: Bryan S Rosenburg @ 2005-06-08 18:40 UTC (permalink / raw)
  To: Ian Pratt; +Cc: ryanh, xen-devel, hohnbaum, Orran Y Krieger


[-- Attachment #1.1: Type: text/plain, Size: 1523 bytes --]

"Ian Pratt" <m+Ian.Pratt@cl.cam.ac.uk> wrote on 06/08/2005 02:25:56 PM:

> > The key point is that with 
> > kernel-level preemption notification, VCPUs are always in 
> > kernel mode when suspended, never in user mode.  Application 
> > state is always saved in Linux, not in Xen, and is available 
> > to be resumed on another VCPU if Linux so chooses. 
> 
> In principle, but...
> 
> Do you believe this is going to interact well with Linux's work stealing
> CPU migration? I haven't looked closely at the current code, but from
> Linux's scheduler's POV the de-scheduled (yielded) CPU looks like a
> perfectly healthy CPU, so there's no particular reason that another CPU
> would steal work from it (without hacking the algorithm, which I suppose
> we could do). Also, do you have to do something special in your yield
> routine to ensure that no real process is currently running on the
> yielded processor so that all processes on the run queue are available
> for stealing?
> 
> Ian

In our original posting, we proposed that the Linux interrupt handler for 
preemption notifications would create (or unblock) a high-priority kernel 
thread which would then yield back to the hypervisor.  To Linux on other 
CPUs, the de-scheduled CPU would appear to be busy running the 
high-priority thread, and all real work that that CPU had been doing would 
be eligible for stealing.

I don't think that Ryan has yet implemented the high-priority thread part 
of the proposal, but that's always been part of the plan.

- Bryan

[-- Attachment #1.2: Type: text/html, Size: 1872 bytes --]

[-- Attachment #2: Type: text/plain, Size: 138 bytes --]

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xensource.com
http://lists.xensource.com/xen-devel

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH] Yield to VCPU hcall, spinlock yielding
  2005-06-08 18:40 ` Bryan S Rosenburg
@ 2005-06-08 19:11   ` Andrew Theurer
  2005-06-08 20:49     ` Bryan S Rosenburg
  2005-06-08 19:17   ` Ryan Harper
  1 sibling, 1 reply; 25+ messages in thread
From: Andrew Theurer @ 2005-06-08 19:11 UTC (permalink / raw)
  To: Bryan S Rosenburg, Ian Pratt; +Cc: ryanh, xen-devel, hohnbaum, Orran Y Krieger

On Wednesday 08 June 2005 13:40, Bryan S Rosenburg wrote:
> "Ian Pratt" <m+Ian.Pratt@cl.cam.ac.uk> wrote on 06/08/2005 02:25:56 
PM:
> > > The key point is that with
> > > kernel-level preemption notification, VCPUs are always in
> > > kernel mode when suspended, never in user mode.  Application
> > > state is always saved in Linux, not in Xen, and is available
> > > to be resumed on another VCPU if Linux so chooses.
> >
> > In principle, but...
> >
> > Do you believe this is going to interact well with Linux's work
> > stealing CPU migration? I haven't looked closely at the current
> > code, but from Linux's scheduler's POV the de-scheduled (yielded)
> > CPU looks like a perfectly healthy CPU, so there's no particular
> > reason that another CPU would steal work from it (without hacking
> > the algorithm, which I suppose we could do). Also, do you have to
> > do something special in your yield routine to ensure that no real
> > process is currently running on the yielded processor so that all
> > processes on the run queue are available for stealing?
> >
> > Ian
>
> In our original posting, we proposed that the Linux interrupt handler
> for preemption notifications would create (or unblock) a
> high-priority kernel thread which would then yield back to the
> hypervisor.  To Linux on other CPUs, the de-scheduled CPU would
> appear to be busy running the high-priority thread, and all real work
> that that CPU had been doing would be eligible for stealing.

IMO, I don't think this alone is enough to encourage task migration.  
The primary motivator to steal is a 25% or more load imbalance, and one 
extra fake kernel thread will probably not be enough to trigger this.

To solve this and other issues, I believe we need an extra modifier to 
the Linux kernel cpus' load value, which Xen could modify to hint the 
kernel what cpus' relative processing power is.  The Linux kernel 
scheduler's per cpu load values would be something like (max_cpu_power 
/ cpu_power * nr_running).  Xen could update cpu_power for a number of 
situations, a "long" preemption, a much faster alternative to a vcpu 
hot-unplug (don't unplug, just set cpu_power to 0), and to normalize 
load values for vcpus which have different time-slice lengths on the 
physical cpus.  

I would hope something like this could also be used without Xen on Linux 
so it has wider appeal.  One thing that comes to mind is normalizing 
cpus' load when some cpus may be "speed stepped" down for power 
management or heat issues.

-Andrew

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH] Yield to VCPU hcall, spinlock yielding
  2005-06-08 19:11   ` Andrew Theurer
@ 2005-06-08 20:49     ` Bryan S Rosenburg
  0 siblings, 0 replies; 25+ messages in thread
From: Bryan S Rosenburg @ 2005-06-08 20:49 UTC (permalink / raw)
  To: habanero; +Cc: Ian Pratt, ryanh, xen-devel, hohnbaum, Orran Y Krieger

[-- Attachment #1.1: Type: text/plain, Size: 2683 bytes --]

habanero@us.ltcfwd.linux.ibm.com wrote on 06/08/2005 03:11:06 PM:

> > In our original posting, we proposed that the Linux interrupt handler
> > for preemption notifications would create (or unblock) a
> > high-priority kernel thread which would then yield back to the
> > hypervisor.  To Linux on other CPUs, the de-scheduled CPU would
> > appear to be busy running the high-priority thread, and all real work
> > that that CPU had been doing would be eligible for stealing.
> 
> IMO, I don't think this alone is enough to encourage task migration. 
> The primary motivator to steal is a 25% or more load imbalance, and one 
> extra fake kernel thread will probably not be enough to trigger this.

The kernel thread is needed at the very least to ensure that all user 
programs on the de-scheduled CPU are available for migration.  In an 
important case, a program on the de-scheduled CPU holds a futex, and 
another CPU goes idle because its program blocks on the futex.  We'd want 
the idle CPU to pick up the futex holder, and I'm assuming (with very 
little actual knowledge) that the Linux scheduler would make that happen.

> 
> To solve this and other issues, I believe we need an extra modifier to 
> the Linux kernel cpus' load value, which Xen could modify to hint the 
> kernel what cpus' relative processing power is.  The Linux kernel 
> scheduler's per cpu load values would be something like (max_cpu_power 
> / cpu_power * nr_running).  Xen could update cpu_power for a number of 
> situations, a "long" preemption, a much faster alternative to a vcpu 
> hot-unplug (don't unplug, just set cpu_power to 0), and to normalize 
> load values for vcpus which have different time-slice lengths on the 
> physical cpus. 
> 
> I would hope something like this could also be used without Xen on Linux 

> so it has wider appeal.  One thing that comes to mind is normalizing 
> cpus' load when some cpus may be "speed stepped" down for power 
> management or heat issues.
> 
> -Andrew

I'd view your "cpu_power" proposal as orthogonal to (or perhaps 
complementary to) our ideas on preemption notification.  It's aimed more 
at load-balancing and fair scheduling than specifically at the problems 
that arise with the preemption of lock holders.  On the apparent CPU speed 
issue, does Linux account in any way for different interrupt loads on 
different processors?  Is a program just out of luck if it happens to get 
scheduled on a processor with heavy interrupt traffic, or will Linux 
notice that it's not making the same progress as its peers and shuffle 
things around?  It seems that your cpu_power proposal might have something 
to contribute here.

- Bryan

- Bryan

[-- Attachment #1.2: Type: text/html, Size: 3227 bytes --]

[-- Attachment #2: Type: text/plain, Size: 138 bytes --]

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xensource.com
http://lists.xensource.com/xen-devel

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH] Yield to VCPU hcall, spinlock yielding
  2005-06-08 18:40 ` Bryan S Rosenburg
  2005-06-08 19:11   ` Andrew Theurer
@ 2005-06-08 19:17   ` Ryan Harper
  1 sibling, 0 replies; 25+ messages in thread
From: Ryan Harper @ 2005-06-08 19:17 UTC (permalink / raw)
  To: Bryan S Rosenburg; +Cc: Ian Pratt, xen-devel, hohnbaum, Orran Y Krieger, ryanh

* Bryan Rosenburg <rosnbrg@us.ibm.com> [2005-06-08 13:40]:
> "Ian Pratt" <m+Ian.Pratt@cl.cam.ac.uk> wrote on 06/08/2005 02:25:56 PM:
> 
> > > The key point is that with 
> > > kernel-level preemption notification, VCPUs are always in 
> > > kernel mode when suspended, never in user mode.  Application 
> > > state is always saved in Linux, not in Xen, and is available 
> > > to be resumed on another VCPU if Linux so chooses. 
> > 
> > In principle, but...
> > 
> > Do you believe this is going to interact well with Linux's work stealing
> > CPU migration? I haven't looked closely at the current code, but from
> > Linux's scheduler's POV the de-scheduled (yielded) CPU looks like a
> > perfectly healthy CPU, so there's no particular reason that another CPU
> > would steal work from it (without hacking the algorithm, which I suppose
> > we could do). Also, do you have to do something special in your yield
> > routine to ensure that no real process is currently running on the
> > yielded processor so that all processes on the run queue are available
> > for stealing?
> > 
> > Ian
> 
> In our original posting, we proposed that the Linux interrupt handler for 
> preemption notifications would create (or unblock) a high-priority kernel 
> thread which would then yield back to the hypervisor.  To Linux on other 
> CPUs, the de-scheduled CPU would appear to be busy running the 
> high-priority thread, and all real work that that CPU had been doing would 
> be eligible for stealing.
> 
> I don't think that Ryan has yet implemented the high-priority thread part 
> of the proposal, but that's always been part of the plan.

That is correct.  I attempted to schedule some work on a particular cpu
in the interrupt handler but I couldn't quite get it working.
Currently, we are yielding in the interrupt handler which isn't what we
proposed, but I had hoped that it was close enough to see the general
effectiveness of the approach.

-- 
Ryan Harper
Software Engineer; Linux Technology Center
IBM Corp., Austin, Tx
(512) 838-9253   T/L: 678-9253
ryanh@us.ibm.com

^ permalink raw reply	[flat|nested] 25+ messages in thread

* RE: [PATCH] Yield to VCPU hcall, spinlock yielding
@ 2005-06-08 21:29 Ian Pratt
  2005-06-09 12:37 ` Bryan S Rosenburg
  2005-06-09 18:13 ` Orran Y Krieger
  0 siblings, 2 replies; 25+ messages in thread
From: Ian Pratt @ 2005-06-08 21:29 UTC (permalink / raw)
  To: Bryan S Rosenburg, habanero; +Cc: ryanh, xen-devel, hohnbaum, Orran Y Krieger

> > IMO, I don't think this alone is enough to encourage task 
> migration.  
> > The primary motivator to steal is a 25% or more load imbalance, and 
> > one extra fake kernel thread will probably not be enough to 
> trigger this.
> 
> The kernel thread is needed at the very least to ensure that 
> all user programs on the de-scheduled CPU are available for 
> migration.  In an important case, a program on the 
> de-scheduled CPU holds a futex, and another CPU goes idle 
> because its program blocks on the futex.  We'd want the idle 
> CPU to pick up the futex holder, and I'm assuming (with very 
> little actual knowledge) that the Linux scheduler would make 
> that happen. 

We might be able to come up with a cheaper hack for doing this. The
notifaction scheme is already on the expensive side, and adding two
extra passes through the scheduler could totally doom it.

> I'd view your "cpu_power" proposal as orthogonal to (or 
> perhaps complementary to) our ideas on preemption 
> notification.  It's aimed more at load-balancing and fair 
> scheduling than specifically at the problems that arise with 
> the preemption of lock holders.  On the apparent CPU speed 
> issue, does Linux account in any way for different interrupt 
> loads on different processors?  Is a program just out of luck 
> if it happens to get scheduled on a processor with heavy 
> interrupt traffic, or will Linux notice that it's not making 
> the same progress as its peers and shuffle things around?  It 
> seems that your cpu_power proposal might have something to 
> contribute here. 

I don't see it as orthogonal -- I think something like it is needed to
make the notification scheme result in any benefit, otherwise no work
will get migrated from the de-scheduled CPU.

I'm just not sure how easy it will be to add into the rebalance
function.

Ian  

^ permalink raw reply	[flat|nested] 25+ messages in thread

* RE: [PATCH] Yield to VCPU hcall, spinlock yielding
  2005-06-08 21:29 Ian Pratt
@ 2005-06-09 12:37 ` Bryan S Rosenburg
  2005-06-09 18:13 ` Orran Y Krieger
  1 sibling, 0 replies; 25+ messages in thread
From: Bryan S Rosenburg @ 2005-06-09 12:37 UTC (permalink / raw)
  To: Ian Pratt; +Cc: xen-devel, hohnbaum, Orran Y Krieger, habanero, ryanh

[-- Attachment #1.1: Type: text/plain, Size: 946 bytes --]

"Ian Pratt" <m+Ian.Pratt@cl.cam.ac.uk> wrote on 06/08/2005 05:29:21 PM:

> We might be able to come up with a cheaper hack for doing this. The
> notifaction scheme is already on the expensive side, and adding two
> extra passes through the scheduler could totally doom it.

I agree that whether we can tolerate the cost of preemption notification 
interrupts is still very much an open question.  I doubt that the 
additional costs of unblocking and blocking a kernel thread will be 
decisive.

I don't view the high-priority kernel thread idea as a "hack".  In 
conjunction with CONFIG_PREEMPT, it is a straight-forward solution to a 
real problem:  From the notification interrupt handler, how do you get the 
kernel to yield as quickly as possible but only at such time as no kernel 
locks are held?  A newly-unblocked high-priority thread will run at 
exactly the right time, with no new hacking on Linux locking or scheduling 
code.

- Bryan

[-- Attachment #1.2: Type: text/html, Size: 1180 bytes --]

[-- Attachment #2: Type: text/plain, Size: 138 bytes --]

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xensource.com
http://lists.xensource.com/xen-devel

^ permalink raw reply	[flat|nested] 25+ messages in thread

* RE: [PATCH] Yield to VCPU hcall, spinlock yielding
  2005-06-08 21:29 Ian Pratt
  2005-06-09 12:37 ` Bryan S Rosenburg
@ 2005-06-09 18:13 ` Orran Y Krieger
  2005-06-09 18:59   ` Andrew Theurer
  1 sibling, 1 reply; 25+ messages in thread
From: Orran Y Krieger @ 2005-06-09 18:13 UTC (permalink / raw)
  To: Ian Pratt; +Cc: xen-devel, Bryan S Rosenburg, hohnbaum, habanero, ryanh


[-- Attachment #1.1: Type: text/plain, Size: 1690 bytes --]

"Ian Pratt" <m+Ian.Pratt@cl.cam.ac.uk> wrote on 06/08/2005 05:29:21 PM:
> > I'd view your "cpu_power" proposal as orthogonal to (or 
> > perhaps complementary to) our ideas on preemption 
> > notification.  It's aimed more at load-balancing and fair 
> > scheduling than specifically at the problems that arise with 
> > the preemption of lock holders.  On the apparent CPU speed 
> > issue, does Linux account in any way for different interrupt 
> > loads on different processors?  Is a program just out of luck 
> > if it happens to get scheduled on a processor with heavy 
> > interrupt traffic, or will Linux notice that it's not making 
> > the same progress as its peers and shuffle things around?  It 
> > seems that your cpu_power proposal might have something to 
> > contribute here. 
> 
> I don't see it as orthogonal -- I think something like it is needed to
> make the notification scheme result in any benefit, otherwise no work
> will get migrated from the de-scheduled CPU.

I don't get it. If an application lock is important, and all the app 
threads block on it except the one of the preempted processor, and the 
preempted processor has two threads on it (the high priority kernel and 
the app that was preempted), then we have one (or more) idle processors 
and one processor with two threads.  Hopefully the scheduler is smart 
enough to move the preempted thread over. So, the base scheme should 
result in processors from staying idle....  A scheme like described above 
would be great, but without any changes we have a scheme that keeps 
processors from going idle I think. 

> I'm just not sure how easy it will be to add into the rebalance
> function.
> 
> Ian 

[-- Attachment #1.2: Type: text/html, Size: 2039 bytes --]

[-- Attachment #2: Type: text/plain, Size: 138 bytes --]

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xensource.com
http://lists.xensource.com/xen-devel

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH] Yield to VCPU hcall, spinlock yielding
  2005-06-09 18:13 ` Orran Y Krieger
@ 2005-06-09 18:59   ` Andrew Theurer
  2005-06-09 19:49     ` Orran Y Krieger
  0 siblings, 1 reply; 25+ messages in thread
From: Andrew Theurer @ 2005-06-09 18:59 UTC (permalink / raw)
  To: Orran Y Krieger; +Cc: Ian Pratt, xen-devel, Bryan S Rosenburg, hohnbaum, ryanh

Orran Y Krieger wrote:

>
>
> "Ian Pratt" <m+Ian.Pratt@cl.cam.ac.uk> wrote on 06/08/2005 05:29:21 PM:
> > > I'd view your "cpu_power" proposal as orthogonal to (or
> > > perhaps complementary to) our ideas on preemption
> > > notification.  It's aimed more at load-balancing and fair
> > > scheduling than specifically at the problems that arise with
> > > the preemption of lock holders.  On the apparent CPU speed
> > > issue, does Linux account in any way for different interrupt
> > > loads on different processors?  Is a program just out of luck
> > > if it happens to get scheduled on a processor with heavy
> > > interrupt traffic, or will Linux notice that it's not making
> > > the same progress as its peers and shuffle things around?  It
> > > seems that your cpu_power proposal might have something to
> > > contribute here.
> >
> > I don't see it as orthogonal -- I think something like it is needed to
> > make the notification scheme result in any benefit, otherwise no work
> > will get migrated from the de-scheduled CPU.
>
> I don't get it. If an application lock is important, and all the app 
> threads block on it except the one of the preempted processor, and the 
> preempted processor has two threads on it (the high priority kernel 
> and the app that was preempted), then we have one (or more) idle 
> processors and one processor with two threads.  Hopefully the 
> scheduler is smart enough to move the preempted thread over. So, the 
> base scheme should result in processors from staying idle....  A 
> scheme like described above would be great, but without any changes we 
> have a scheme that keeps processors from going idle I think.  

I take it this assumes nothing else is running on that domain?  I agree 
in this ideal scenerio, it would work (assumming the app threads waiting 
for the lock are not busy waiting), but what about apps which have other 
threads which don't wait on that particular lock, keeping the other cpus 
busy, or a domain with more than one app running?  In those cases the 
load balance would probably not happen, and I would think we need more 
than just the high prio kernel thread to move the desired task to a 
active virtual cpu.

I'm not saying it's all or none either.  Gettng the kernel thread 
implemented is probably the next logial step.  I would just like to 
follow through with some extra logic if possible.

>
> > I'm just not sure how easy it will be to add into the rebalance
> > function.

There is already an abstractoin of "load" for each cpu in 2.6.11.  We 
just need some bits there.  I will take a look.

-Andrew

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH] Yield to VCPU hcall, spinlock yielding
  2005-06-09 18:59   ` Andrew Theurer
@ 2005-06-09 19:49     ` Orran Y Krieger
  0 siblings, 0 replies; 25+ messages in thread
From: Orran Y Krieger @ 2005-06-09 19:49 UTC (permalink / raw)
  To: habanero; +Cc: Ian Pratt, xen-devel, Bryan S Rosenburg, hohnbaum, ryanh

[-- Attachment #1.1: Type: text/plain, Size: 1279 bytes --]

habanero@us.ltcfwd.linux.ibm.com wrote on 06/09/2005 02:59:56 PM:

> I take it this assumes nothing else is running on that domain?  I agree 
> in this ideal scenerio, it would work (assumming the app threads waiting 

> for the lock are not busy waiting), but what about apps which have other 

> threads which don't wait on that particular lock, keeping the other cpus 

> busy, or a domain with more than one app running?  In those cases the 
> load balance would probably not happen, and I would think we need more 
> than just the high prio kernel thread to move the desired task to a 
> active virtual cpu.

The key place where this came up in our internal discussions was running 
big HPC apps and big data bases...  In both those cases, the dominant 
situation is your ideal scenario.  In a multiprogrammed environment, you 
have the problem even without a hypervisor, i.e., a process holding a lock 
can be preempted by another process.   Currently, Linux doesn't do any 
schedular aware synchronizaiton, so presumably this is not viewed as a 
problem.  If some day this is viewed as a problem, then the preemption 
upcall could be used by whatever mechanism linux chooses to implement.  I 
am not convinced that we need a special purpose solution for 
virtualization...

[-- Attachment #1.2: Type: text/html, Size: 1438 bytes --]

[-- Attachment #2: Type: text/plain, Size: 138 bytes --]

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xensource.com
http://lists.xensource.com/xen-devel

^ permalink raw reply	[flat|nested] 25+ messages in thread

end of thread, other threads:[~2005-06-09 19:49 UTC | newest]

Thread overview: 25+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2005-05-20 16:54 [PATCH] xen, xen-sparse: modify spinlocks to use directed yield Ryan Harper
2005-05-20 18:16 ` Ryan Harper
2005-05-20 18:17 ` Ryan Harper
2005-06-03 19:40 ` [PATCH] Yield to VCPU hcall, spinlock yielding Ryan Harper
  -- strict thread matches above, loose matches on Subject: below --
2005-06-03 20:41 Ian Pratt
2005-06-03 20:59 ` Ryan Harper
2005-06-03 21:17 Ian Pratt
2005-06-03 21:48 ` Ryan Harper
2005-06-06 13:14 ` Orran Y Krieger
2005-06-03 22:06 Ian Pratt
2005-06-03 22:52 ` Ryan Harper
2005-06-04  6:55 ` Keir Fraser
2005-06-07 10:38 Ian Pratt
2005-06-08 15:19 ` Bryan S Rosenburg
2005-06-09 12:54 ` Orran Y Krieger
2005-06-08 18:25 Ian Pratt
2005-06-08 18:40 ` Bryan S Rosenburg
2005-06-08 19:11   ` Andrew Theurer
2005-06-08 20:49     ` Bryan S Rosenburg
2005-06-08 19:17   ` Ryan Harper
2005-06-08 21:29 Ian Pratt
2005-06-09 12:37 ` Bryan S Rosenburg
2005-06-09 18:13 ` Orran Y Krieger
2005-06-09 18:59   ` Andrew Theurer
2005-06-09 19:49     ` Orran Y Krieger

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.