* [RFC][PATCH] Scheduler interface changes for credit2
@ 2010-02-15 17:20 George Dunlap
From: George Dunlap @ 2010-02-15 17:20 UTC
  To: xen-devel

[-- Attachment #1: Type: text/plain, Size: 1822 bytes --]

The two attached patches change the scheduler interface to allow
credit2 to have several cpus share the same runqueue.  The patches
should have almost no impact on the current schedulers.  The patches
and the reasoning behind them are below.  I've also attached the patches for
the prototype credit2 scheduler, for reference.

* Add a context switch callback (sched-context_switch-callback.diff)

Add a callback to tell a scheduler that a vcpu has been completely
context-switched off a cpu.

When sharing a runqueue, we can't put a scheduled-out vcpu back on the
runqueue until it's been completely de-scheduled, because it may be
grabbed by another processor before it's ready.  This callback allows
a scheduler to detect when a vcpu on its way out is completely off the
processor, so that it can put the vcpu on the runqueue.
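
As a rough sketch of how a scheduler might use this callback (names
here are hypothetical; the real credit2 version is in the attached
prototype):

    /* Sketch: called once prev's context is fully saved and is_running
     * is clear, so another cpu may now safely pick this vcpu up. */
    static void sketch_context_saved(struct vcpu *v)
    {
        struct sketch_vcpu *sv = v->sched_priv;

        vcpu_schedule_lock_irq(v);

        /* No longer in the middle of a context switch... */
        clear_bit(__SKETCH_scheduled, &sv->flags);

        /* ...so if a wakeup arrived in the meantime, queue the vcpu now. */
        if ( test_bit(__SKETCH_delayed_runq_add, &sv->flags) )
        {
            clear_bit(__SKETCH_delayed_runq_add, &sv->flags);
            runq_insert(v->processor, sv);
        }

        vcpu_schedule_unlock_irq(v);
    }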

* Allow sharing of locks between cpus (sched-spin_lock-pointers.diff)

Introduce per-cpu lock pointers, initialized to point at per-cpu
locks, which a scheduler may redirect during its init to reconfigure
the locking granularity.

There are a number of race conditions involving updates to
v->is_running and v->processor, all stemming from the fact that vcpus
may change cpus without an explicit migrate.  Furthermore, the
scheduler needs its runqueues to be covered by a lock as well.  The
cleanest way to solve all of these is to have the scheduler lock and
the runqueue lock coincide.
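
Concretely, the idea is something like the following (a minimal
sketch; the actual changes are in the attached patch).  Generic code
keeps a per-cpu lock plus a per-cpu pointer to it, and a scheduler
that shares runqueues redirects the pointers in its init hook
("master_cpu" below is a hypothetical choice of lock holder):

    /* Generic init: each cpu's pointer starts out at its own lock. */
    spin_lock_init(&per_cpu(schedule_data, cpu)._lock);
    per_cpu(schedule_data, cpu).schedule_lock =
        &per_cpu(schedule_data, cpu)._lock;

    /* Scheduler init: make several cpus share one lock, so that the
     * scheduler lock and the runqueue lock coincide. */
    per_cpu(schedule_data, cpu).schedule_lock =
        &per_cpu(schedule_data, master_cpu)._lock;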

* Add a "scheduler" trace class (trace-sched-class.diff)
Uses defined on a per-scheduler basis
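
A scheduler then defines its own events inside the new class and emits
them with the usual trace_var() call; for instance (mirroring the
credit2 prototype below, where d is a small struct holding the event
payload):

    #define TRC_CSCHED2_RUNQ_POS (TRC_SCHED_CLASS + 2)

    trace_var(TRC_CSCHED2_RUNQ_POS, 1, sizeof(d), (unsigned char *)&d);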

I've been running parallel kernel compiles on a 16-way box (2x4x2) for
several hours now without deadlocks or BUG()s.  As far as I'm
concerned, with these changes, credit2 is now ready to be checked in,
as long as it's not set to the default scheduler.

All of the above:
Signed-off-by: George Dunlap <george.dunlap@eu.citrix.com>

[-- Attachment #2: sched-context_switch-callback.diff --]
[-- Type: text/x-patch, Size: 1198 bytes --]

Add context_saved scheduler callback.

Because credit2 shares a runqueue between several cpus, it needs
to know when a scheduled-out vcpu has finally been context-switched
away so that it can be added to the runqueue again.  (Otherwise it may
be grabbed by another processor before the context has been properly
saved.)

Signed-off-by: George Dunlap <george.dunlap@eu.citrix.com>

diff -r c44b7b9b6306 xen/common/schedule.c
--- a/xen/common/schedule.c	Wed Jan 13 13:33:57 2010 +0000
+++ b/xen/common/schedule.c	Wed Jan 13 13:36:37 2010 +0000
@@ -877,6 +877,8 @@
     /* Check for migration request /after/ clearing running flag. */
     smp_mb();
 
+    SCHED_OP(context_saved, prev);
+
     if ( unlikely(test_bit(_VPF_migrating, &prev->pause_flags)) )
         vcpu_migrate(prev);
 }
diff -r c44b7b9b6306 xen/include/xen/sched-if.h
--- a/xen/include/xen/sched-if.h	Wed Jan 13 13:33:57 2010 +0000
+++ b/xen/include/xen/sched-if.h	Wed Jan 13 13:36:37 2010 +0000
@@ -69,6 +69,7 @@
 
     void         (*sleep)          (struct vcpu *);
     void         (*wake)           (struct vcpu *);
+    void         (*context_saved)  (struct vcpu *);
 
     struct task_slice (*do_schedule) (s_time_t);
 

[-- Attachment #3: sched-spin_lock-pointers.diff --]
[-- Type: text/x-patch, Size: 7756 bytes --]

diff -r 297dffc6ca65 xen/arch/ia64/vmx/vmmu.c
--- a/xen/arch/ia64/vmx/vmmu.c	Tue Feb 09 14:45:09 2010 +0000
+++ b/xen/arch/ia64/vmx/vmmu.c	Mon Feb 15 13:51:23 2010 +0000
@@ -394,7 +394,7 @@
     if (cpu != current->processor)
         return;
     local_irq_save(flags);
-    if (!spin_trylock(&per_cpu(schedule_data, cpu).schedule_lock))
+    if (!spin_trylock(per_cpu(schedule_data, cpu).schedule_lock))
         goto bail2;
     if (v->processor != cpu)
         goto bail1;
@@ -416,7 +416,7 @@
     ia64_dv_serialize_data();
     args->vcpu = NULL;
 bail1:
-    spin_unlock(&per_cpu(schedule_data, cpu).schedule_lock);
+    spin_unlock(per_cpu(schedule_data, cpu).schedule_lock);
 bail2:
     local_irq_restore(flags);
 }
@@ -446,7 +446,7 @@
         do {
             cpu = v->processor;
             if (cpu != current->processor) {
-                spin_barrier(&per_cpu(schedule_data, cpu).schedule_lock);
+                spin_barrier(per_cpu(schedule_data, cpu).schedule_lock);
                 /* Flush VHPT on remote processors. */
                 smp_call_function_single(cpu, &ptc_ga_remote_func, &args, 1);
             } else {
diff -r 297dffc6ca65 xen/common/sched_credit.c
--- a/xen/common/sched_credit.c	Tue Feb 09 14:45:09 2010 +0000
+++ b/xen/common/sched_credit.c	Mon Feb 15 13:51:23 2010 +0000
@@ -770,7 +770,7 @@
 
     spc->runq_sort_last = sort_epoch;
 
-    spin_lock_irqsave(&per_cpu(schedule_data, cpu).schedule_lock, flags);
+    spin_lock_irqsave(per_cpu(schedule_data, cpu).schedule_lock, flags);
 
     runq = &spc->runq;
     elem = runq->next;
@@ -795,7 +795,7 @@
         elem = next;
     }
 
-    spin_unlock_irqrestore(&per_cpu(schedule_data, cpu).schedule_lock, flags);
+    spin_unlock_irqrestore(per_cpu(schedule_data, cpu).schedule_lock, flags);
 }
 
 static void
@@ -1110,7 +1110,7 @@
          * cause a deadlock if the peer CPU is also load balancing and trying
          * to lock this CPU.
          */
-        if ( !spin_trylock(&per_cpu(schedule_data, peer_cpu).schedule_lock) )
+        if ( !spin_trylock(per_cpu(schedule_data, peer_cpu).schedule_lock) )
         {
             CSCHED_STAT_CRANK(steal_trylock_failed);
             continue;
@@ -1120,7 +1120,7 @@
          * Any work over there to steal?
          */
         speer = csched_runq_steal(peer_cpu, cpu, snext->pri);
-        spin_unlock(&per_cpu(schedule_data, peer_cpu).schedule_lock);
+        spin_unlock(per_cpu(schedule_data, peer_cpu).schedule_lock);
         if ( speer != NULL )
             return speer;
     }
diff -r 297dffc6ca65 xen/common/schedule.c
--- a/xen/common/schedule.c	Tue Feb 09 14:45:09 2010 +0000
+++ b/xen/common/schedule.c	Mon Feb 15 13:51:23 2010 +0000
@@ -108,7 +108,7 @@
     s_time_t delta;
 
     ASSERT(v->runstate.state != new_state);
-    ASSERT(spin_is_locked(&per_cpu(schedule_data,v->processor).schedule_lock));
+    ASSERT(spin_is_locked(per_cpu(schedule_data,v->processor).schedule_lock));
 
     trace_runstate_change(v, new_state);
 
@@ -299,7 +299,7 @@
     old_cpu = v->processor;
     v->processor = SCHED_OP(pick_cpu, v);
     spin_unlock_irqrestore(
-        &per_cpu(schedule_data, old_cpu).schedule_lock, flags);
+        per_cpu(schedule_data, old_cpu).schedule_lock, flags);
 
     /* Wake on new CPU. */
     vcpu_wake(v);
@@ -806,7 +806,7 @@
 
     sd = &this_cpu(schedule_data);
 
-    spin_lock_irq(&sd->schedule_lock);
+    spin_lock_irq(sd->schedule_lock);
 
     stop_timer(&sd->s_timer);
     
@@ -822,7 +822,7 @@
 
     if ( unlikely(prev == next) )
     {
-        spin_unlock_irq(&sd->schedule_lock);
+        spin_unlock_irq(sd->schedule_lock);
         trace_continue_running(next);
         return continue_running(prev);
     }
@@ -850,7 +850,7 @@
     ASSERT(!next->is_running);
     next->is_running = 1;
 
-    spin_unlock_irq(&sd->schedule_lock);
+    spin_unlock_irq(sd->schedule_lock);
 
     perfc_incr(sched_ctx);
 
@@ -922,7 +922,9 @@
 
     for_each_possible_cpu ( i )
     {
-        spin_lock_init(&per_cpu(schedule_data, i).schedule_lock);
+        spin_lock_init(&per_cpu(schedule_data, i)._lock);
+        per_cpu(schedule_data, i).schedule_lock
+            = &per_cpu(schedule_data, i)._lock;
         init_timer(&per_cpu(schedule_data, i).s_timer, s_timer_fn, NULL, i);
     }
 
@@ -956,10 +958,10 @@
 
     for_each_online_cpu ( i )
     {
-        spin_lock(&per_cpu(schedule_data, i).schedule_lock);
+        spin_lock(per_cpu(schedule_data, i).schedule_lock);
         printk("CPU[%02d] ", i);
         SCHED_OP(dump_cpu_state, i);
-        spin_unlock(&per_cpu(schedule_data, i).schedule_lock);
+        spin_unlock(per_cpu(schedule_data, i).schedule_lock);
     }
 
     local_irq_restore(flags);
diff -r 297dffc6ca65 xen/include/xen/sched-if.h
--- a/xen/include/xen/sched-if.h	Tue Feb 09 14:45:09 2010 +0000
+++ b/xen/include/xen/sched-if.h	Mon Feb 15 13:51:23 2010 +0000
@@ -10,8 +10,43 @@
 
 #include <xen/percpu.h>
 
+/* How do we check if the vcpu has migrated since we've grabbed the lock?
+ * Have to add a runqueue ID?  Still have to map vcpu to lock...
+ *
+ * When we need to lock:
+ * + When changing certain values in the vcpu struct
+ *   - runstate
+ *     . Including sleep
+ *   - Pause state... (vcpu_runnable)?
+ *   - v->processor
+ *   - v->is_running (implicitly, by grabbing the schedule lock in schedule)
+ *   - v->affinity
+ *   - Any time we want to avoid a running vcpu being scheduled out while we're doing something
+ *     . e.g., sched_adjust
+ * + When scheduling
+ *   - Implicitly also covers is_running, runstate_change
+ *
+ * + For credit2:
+ *   - Updating runqueue, credits, &c
+ *
+ * Ideas:
+ * + Pointer in the vcpu struct; check to see if it's changed since you
+ *   grabbed it.
+ *   - Big addition to struct
+ *   - Can a pointer to a lock be protected by the lock it points to?!?
+ * + Lock by runq id, map cpu to runq (?)
+ * + Spinlock callback w/ vcpu pointer.
+ *   - Turns spinlocks into indirect function calls.
+ * + Just do the same thing; it won't hurt to grab the same lock twice; if
+ *   it does, we can think about making the loop more efficient.
+ */
+
+/* Idea: For better cache behavior, keep the actual lock in the same cache area
+ * as the rest of the struct.  Just have the scheduler point to the one it wants
+ * (this may be the one right in front of it). */
 struct schedule_data {
-    spinlock_t          schedule_lock;  /* spinlock protecting curr        */
+    spinlock_t         *schedule_lock,
+                       _lock;
     struct vcpu        *curr;           /* current task                    */
     struct vcpu        *idle;           /* idle task for this cpu          */
     void               *sched_priv;
@@ -26,11 +61,19 @@
 
     for ( ; ; )
     {
+        /* NB: For schedulers with multiple cores per runqueue,
+         * a vcpu may change processor w/o changing runqueues;
+         * so we may release a lock only to grab it again.
+         *
+         * If that is measured to be an issue, then the check
+         * should be changed to checking if the locks pointed to
+         * by cpu and v->processor are still the same.
+         */
         cpu = v->processor;
-        spin_lock(&per_cpu(schedule_data, cpu).schedule_lock);
+        spin_lock(per_cpu(schedule_data, cpu).schedule_lock);
         if ( likely(v->processor == cpu) )
             break;
-        spin_unlock(&per_cpu(schedule_data, cpu).schedule_lock);
+        spin_unlock(per_cpu(schedule_data, cpu).schedule_lock);
     }
 }
 
@@ -41,7 +84,7 @@
 
 static inline void vcpu_schedule_unlock(struct vcpu *v)
 {
-    spin_unlock(&per_cpu(schedule_data, v->processor).schedule_lock);
+    spin_unlock(per_cpu(schedule_data, v->processor).schedule_lock);
 }
 
 #define vcpu_schedule_unlock_irq(v) \

[-- Attachment #4: trace-sched-class.diff --]
[-- Type: text/x-patch, Size: 514 bytes --]

diff -r 5eb7dde5b29b xen/include/public/trace.h
--- a/xen/include/public/trace.h	Tue Feb 09 14:45:09 2010 +0000
+++ b/xen/include/public/trace.h	Mon Feb 15 14:03:57 2010 +0000
@@ -53,6 +53,7 @@
 #define TRC_HVM_HANDLER   0x00082000   /* various HVM handlers      */
 
 #define TRC_SCHED_MIN       0x00021000   /* Just runstate changes */
+#define TRC_SCHED_CLASS     0x00022000   /* Scheduler-specific    */
 #define TRC_SCHED_VERBOSE   0x00028000   /* More inclusive scheduling */
 
 /* Trace events per class */

[-- Attachment #5: 20100215-credit2-hypervisor-spinlock-pointers.diff --]
[-- Type: text/x-patch, Size: 32845 bytes --]

diff -r 889bd19dd09d xen/common/Makefile
--- a/xen/common/Makefile	Mon Feb 15 15:23:09 2010 +0000
+++ b/xen/common/Makefile	Mon Feb 15 16:15:16 2010 +0000
@@ -13,6 +13,7 @@
 obj-y += page_alloc.o
 obj-y += rangeset.o
 obj-y += sched_credit.o
+obj-y += sched_credit2.o
 obj-y += sched_sedf.o
 obj-y += schedule.o
 obj-y += shutdown.o
diff -r 889bd19dd09d xen/common/sched_credit2.c
--- /dev/null	Thu Jan 01 00:00:00 1970 +0000
+++ b/xen/common/sched_credit2.c	Mon Feb 15 16:15:16 2010 +0000
@@ -0,0 +1,1121 @@
+
+/****************************************************************************
+ * (C) 2009 - George Dunlap - Citrix Systems R&D UK, Ltd
+ ****************************************************************************
+ *
+ *        File: common/sched_credit2.c
+ *      Author: George Dunlap
+ *
+ * Description: Credit-based SMP CPU scheduler
+ * Based on an earlier version by Emmanuel Ackaouy.
+ */
+
+#include <xen/config.h>
+#include <xen/init.h>
+#include <xen/lib.h>
+#include <xen/sched.h>
+#include <xen/domain.h>
+#include <xen/delay.h>
+#include <xen/event.h>
+#include <xen/time.h>
+#include <xen/perfc.h>
+#include <xen/sched-if.h>
+#include <xen/softirq.h>
+#include <asm/atomic.h>
+#include <xen/errno.h>
+#include <xen/trace.h>
+
+#if __i386__
+#define PRI_stime "lld"
+#else
+#define PRI_stime "ld"
+#endif
+
+#define d2printk(x...)
+//#define d2printk printk
+
+#define TRC_CSCHED2_TICK         (TRC_SCHED_CLASS + 1)
+#define TRC_CSCHED2_RUNQ_POS     (TRC_SCHED_CLASS + 2)
+#define TRC_CSCHED2_CREDIT_BURN  (TRC_SCHED_CLASS + 3)
+#define TRC_CSCHED2_CREDIT_ADD   (TRC_SCHED_CLASS + 4)
+#define TRC_CSCHED2_TICKLE_CHECK (TRC_SCHED_CLASS + 5)
+
+/*
+ * WARNING: This is still in an experimental phase.  Status and work can be found at the
+ * credit2 wiki page:
+ *  http://wiki.xensource.com/xenwiki/Credit2_Scheduler_Development
+ * TODO:
+ * + Immediate bug-fixes
+ *  - Do per-runqueue, grab proper lock for dump debugkey
+ * + Multiple sockets
+ *  - Detect cpu layout and make runqueue map, one per L2 (make_runq_map())
+ *  - Simple load balancer / runqueue assignment
+ *  - Runqueue load measurement
+ *  - Load-based load balancer
+ * + Hyperthreading
+ *  - Look for non-busy core if possible
+ *  - "Discount" time run on a thread with busy siblings
+ * + Algorithm:
+ *  - "Mixed work" problem: if a VM is playing audio (5%) but also burning cpu (e.g.,
+ *    a flash animation in the background), can we schedule it with low enough latency
+ *    so that audio doesn't skip?
+ *  - Cap and reservation: How to implement with the current system?
+ * + Optimizing
+ *  - Profiling, making new algorithms, making math more efficient (no long division)
+ */
+
+/* 
+ * Design:
+ *
+ * VMs "burn" credits based on their weight; higher weight means
+ * credits burn more slowly.  The highest weight vcpu burns credits at
+ * a rate of 1 credit per nanosecond.  Others burn proportionally
+ * more.
+ * 
+ * vcpus are inserted into the runqueue by credit order.
+ *
+ * Credits are "reset" when the next vcpu in the runqueue is less than
+ * or equal to zero.  At that point, everyone's credits are "clipped"
+ * to a small value, and a fixed credit is added to everyone.
+ *
+ * The plan is for all cores that share an L2 will share the same
+ * runqueue.  At the moment, there is one global runqueue for all
+ * cores.
+ */
+
+/*
+ * Locking:
+ * - Schedule-lock is per-runqueue
+ *  + Protects runqueue data, runqueue insertion, &c
+ *  + Also protects updates to private sched vcpu structure
+ *  + Must be grabbed using vcpu_schedule_lock_irq() to make sure vcpu->processor
+ *    doesn't change under our feet.
+ * - Private data lock
+ *  + Protects access to global domain list
+ *  + All other private data is written at init and only read afterwards.
+ * Ordering:
+ * - We grab private->schedule (i.e., the private lock first) when updating
+ *   domain weight; so we must never grab private if a schedule lock is held.
+ */
+
+/*
+ * Basic constants
+ */
+/* Default weight: How much a new domain starts with */
+#define CSCHED_DEFAULT_WEIGHT       256
+/* Min timer: Minimum length for which a timer will be set,
+ * for efficiency. */
+#define CSCHED_MIN_TIMER            MICROSECS(500)
+/* Amount of credit VMs begin with, and are reset to.
+ * ATM, set so that highest-weight VMs can only run for 10ms
+ * before a reset event. */
+#define CSCHED_CREDIT_INIT          MILLISECS(10)
+/* Carryover: How much "extra" credit may be carried over after
+ * a reset. */
+#define CSCHED_CARRYOVER_MAX        CSCHED_MIN_TIMER
+/* Reset: Value below which credit will be reset. */
+#define CSCHED_CREDIT_RESET         0
+/* Max timer: Maximum time a guest can be run for. */
+#define CSCHED_MAX_TIMER            MILLISECS(2)
+
+
+#define CSCHED_IDLE_CREDIT                 (-(1<<30))
+
+/*
+ * Flags
+ */
+/* CSFLAG_scheduled: Is this vcpu either running on, or context-switching off,
+ * a physical cpu?
+ * + Accessed only with runqueue lock held
+ * + Set when chosen as next in csched_schedule().
+ * + Cleared after the context switch has been saved, in csched_context_saved()
+ * + Checked in vcpu_wake to see if we can add to the runqueue, or if we should
+ *   set CSFLAG_delayed_runq_add
+ * + Checked to be false in runq_insert.
+ */
+#define __CSFLAG_scheduled 1
+#define CSFLAG_scheduled (1<<__CSFLAG_scheduled)
+/* CSFLAG_delayed_runq_add: Do we need to add this to the runqueue once it's done
+ * being context-switched out?
+ * + Set when scheduling out in csched_schedule() if prev is runnable
+ * + Set in csched_vcpu_wake if it finds CSFLAG_scheduled set
+ * + Read in csched_context_saved().  If set, it adds prev to the runqueue and
+ *   clears the bit.
+ */
+#define __CSFLAG_delayed_runq_add 2
+#define CSFLAG_delayed_runq_add (1<<__CSFLAG_delayed_runq_add)
+
+
+/*
+ * Useful macros
+ */
+#define CSCHED_VCPU(_vcpu)  ((struct csched_vcpu *) (_vcpu)->sched_priv)
+#define CSCHED_DOM(_dom)    ((struct csched_dom *) (_dom)->sched_priv)
+/* CPU to runq_id macro */
+#define c2r(_cpu)           (csched_priv.runq_map[(_cpu)])
+/* CPU to runqueue struct macro */
+#define RQD(_cpu)          (&csched_priv.rqd[c2r(_cpu)])
+
+/*
+ * Per-runqueue data
+ */
+struct csched_runqueue_data {
+    int id;
+    struct list_head runq; /* Ordered list of runnable vms */
+    struct list_head svc;  /* List of all vcpus assigned to this runqueue */
+    int max_weight;
+    int cpu_min, cpu_max;  /* Range of physical cpus this runqueue runs */
+};
+
+/*
+ * System-wide private data
+ */
+struct csched_private {
+    spinlock_t lock;
+    uint32_t ncpus;
+    struct domain *idle_domain;
+
+    struct list_head sdom; /* Used mostly for dump keyhandler. */
+
+    int runq_map[NR_CPUS];
+    uint32_t runq_count;
+    struct csched_runqueue_data rqd[NR_CPUS];
+};
+
+/*
+ * Virtual CPU
+ */
+struct csched_vcpu {
+    struct list_head rqd_elem;  /* On the runqueue data list */
+    struct list_head sdom_elem; /* On the domain vcpu list */
+    struct list_head runq_elem; /* On the runqueue         */
+
+    /* Up-pointers */
+    struct csched_dom *sdom;
+    struct vcpu *vcpu;
+
+    int weight;
+
+    int credit;
+    s_time_t start_time; /* When we were scheduled (used for credit) */
+    unsigned flags;      /* 16 bits doesn't seem to play well with clear_bit() */
+
+};
+
+/*
+ * Domain
+ */
+struct csched_dom {
+    struct list_head vcpu;
+    struct list_head sdom_elem;
+    struct domain *dom;
+    uint16_t weight;
+    uint16_t nr_vcpus;
+};
+
+
+/*
+ * Global variables
+ */
+static struct csched_private csched_priv;
+
+/*
+ * Time-to-credit, credit-to-time.
+ * FIXME: Do pre-calculated division?
+ */
+static s_time_t t2c(struct csched_runqueue_data *rqd, s_time_t time, struct csched_vcpu *svc)
+{
+    return time * rqd->max_weight / svc->weight;
+}
+
+static s_time_t c2t(struct csched_runqueue_data *rqd, s_time_t credit, struct csched_vcpu *svc)
+{
+    return credit * svc->weight / rqd->max_weight;
+}
+
+/*
+ * Runqueue related code
+ */
+
+static /*inline*/ int
+__vcpu_on_runq(struct csched_vcpu *svc)
+{
+    return !list_empty(&svc->runq_elem);
+}
+
+static /*inline*/ struct csched_vcpu *
+__runq_elem(struct list_head *elem)
+{
+    return list_entry(elem, struct csched_vcpu, runq_elem);
+}
+
+static int
+__runq_insert(struct list_head *runq, struct csched_vcpu *svc)
+{
+    struct list_head *iter;
+    int pos = 0;
+
+    d2printk("rqi d%dv%d\n",
+           svc->vcpu->domain->domain_id,
+           svc->vcpu->vcpu_id);
+
+    /* Idle vcpus not allowed on the runqueue anymore */
+    BUG_ON(is_idle_vcpu(svc->vcpu));
+    BUG_ON(svc->vcpu->is_running);
+    BUG_ON(test_bit(__CSFLAG_scheduled, &svc->flags));
+
+    list_for_each( iter, runq )
+    {
+        struct csched_vcpu * iter_svc = __runq_elem(iter);
+
+        if ( svc->credit > iter_svc->credit )
+        {
+            d2printk(" p%d d%dv%d\n",
+                   pos,
+                   iter_svc->vcpu->domain->domain_id,
+                   iter_svc->vcpu->vcpu_id);
+            break;
+        }
+        pos++;
+    }
+
+    list_add_tail(&svc->runq_elem, iter);
+
+    return pos;
+}
+
+static void
+runq_insert(unsigned int cpu, struct csched_vcpu *svc)
+{
+    struct list_head * runq = &RQD(cpu)->runq;
+    int pos = 0;
+
+    ASSERT( spin_is_locked(per_cpu(schedule_data, cpu).schedule_lock) ); 
+
+    BUG_ON( __vcpu_on_runq(svc) );
+    BUG_ON( c2r(cpu) != c2r(svc->vcpu->processor) ); 
+
+    pos = __runq_insert(runq, svc);
+
+    {
+        struct {
+            unsigned dom:16,vcpu:16;
+            unsigned pos;
+        } d;
+        d.dom = svc->vcpu->domain->domain_id;
+        d.vcpu = svc->vcpu->vcpu_id;
+        d.pos = pos;
+        trace_var(TRC_CSCHED2_RUNQ_POS, 1,
+                  sizeof(d),
+                  (unsigned char *)&d);
+    }
+
+    return;
+}
+
+static inline void
+__runq_remove(struct csched_vcpu *svc)
+{
+    BUG_ON( !__vcpu_on_runq(svc) );
+    list_del_init(&svc->runq_elem);
+}
+
+void burn_credits(struct csched_runqueue_data *rqd, struct csched_vcpu *, s_time_t);
+
+/* Check to see if the item on the runqueue is higher priority than what's
+ * currently running; if so, wake up the processor */
+static /*inline*/ void
+runq_tickle(unsigned int cpu, struct csched_vcpu *new, s_time_t now)
+{
+    int i, ipid=-1;
+    s_time_t lowest=(1<<30);
+    struct csched_runqueue_data *rqd = RQD(cpu);
+
+    d2printk("rqt d%dv%d cd%dv%d\n",
+             new->vcpu->domain->domain_id,
+             new->vcpu->vcpu_id,
+             current->domain->domain_id,
+             current->vcpu_id);
+
+    /* Find the cpu in this queue group that has the lowest credits */
+    for ( i=rqd->cpu_min ; i < rqd->cpu_max ; i++ )
+    {
+        struct csched_vcpu * cur;
+
+        /* Skip cpus that aren't online */
+        if ( !cpu_online(i) )
+            continue;
+
+        cur = CSCHED_VCPU(per_cpu(schedule_data, i).curr);
+
+        /* FIXME: keep track of idlers, choose from the mask */
+        if ( is_idle_vcpu(cur->vcpu) )
+        {
+            ipid = i;
+            lowest = CSCHED_IDLE_CREDIT;
+            break;
+        }
+        else
+        {
+            /* Update credits for current to see if we want to preempt */
+            burn_credits(rqd, cur, now);
+
+            if ( cur->credit < lowest )
+            {
+                ipid = i;
+                lowest = cur->credit;
+            }
+
+            /* TRACE */ {
+                struct {
+                    unsigned dom:16,vcpu:16;
+                    unsigned credit;
+                } d;
+                d.dom = cur->vcpu->domain->domain_id;
+                d.vcpu = cur->vcpu->vcpu_id;
+                d.credit = cur->credit;
+                trace_var(TRC_CSCHED2_TICKLE_CHECK, 1,
+                          sizeof(d),
+                          (unsigned char *)&d);
+            }
+        }
+    }
+
+    if ( ipid != -1 )
+    {
+        int cdiff = lowest - new->credit;
+
+        if ( lowest == CSCHED_IDLE_CREDIT || cdiff < 0 ) {
+            d2printk("si %d\n", ipid);
+            cpu_raise_softirq(ipid, SCHEDULE_SOFTIRQ);
+        }
+        else
+            /* FIXME: Wake up later? */;
+    }
+}
+
+/*
+ * Credit-related code
+ */
+static void reset_credit(int cpu, s_time_t now)
+{
+    struct list_head *iter;
+
+    list_for_each( iter, &RQD(cpu)->svc )
+    {
+        struct csched_vcpu * svc = list_entry(iter, struct csched_vcpu, rqd_elem);
+
+        BUG_ON( is_idle_vcpu(svc->vcpu) );
+
+        /* "Clip" credits to max carryover */
+        if ( svc->credit > CSCHED_CARRYOVER_MAX )
+            svc->credit = CSCHED_CARRYOVER_MAX;
+        /* And add INIT */
+        svc->credit += CSCHED_CREDIT_INIT; 
+        svc->start_time = now;
+
+        /* FIXME: Trace credit */
+    }
+
+    /* No need to re-sort the runqueue, as everyone's order should be the same. */
+}
+
+void burn_credits(struct csched_runqueue_data *rqd, struct csched_vcpu *svc, s_time_t now)
+{
+    s_time_t delta;
+
+    /* Assert svc is current */
+    ASSERT(svc==CSCHED_VCPU(per_cpu(schedule_data, svc->vcpu->processor).curr));
+
+    if ( is_idle_vcpu(svc->vcpu) )
+    {
+        BUG_ON(svc->credit != CSCHED_IDLE_CREDIT);
+        return;
+    }
+
+    delta = now - svc->start_time;
+
+    if ( delta > 0 ) {
+        /* This will round down; should we consider rounding up...? */
+        svc->credit -= t2c(rqd, delta, svc);
+        svc->start_time = now;
+
+        d2printk("b d%dv%d c%d\n",
+                 svc->vcpu->domain->domain_id,
+                 svc->vcpu->vcpu_id,
+                 svc->credit);
+    } else {
+        d2printk("%s: Time went backwards? now %"PRI_stime" start %"PRI_stime"\n",
+               __func__, now, svc->start_time);
+    }
+    
+    /* TRACE */
+    {
+        struct {
+            unsigned dom:16,vcpu:16;
+            unsigned credit;
+            int delta;
+        } d;
+        d.dom = svc->vcpu->domain->domain_id;
+        d.vcpu = svc->vcpu->vcpu_id;
+        d.credit = svc->credit;
+        d.delta = delta;
+        trace_var(TRC_CSCHED2_CREDIT_BURN, 1,
+                  sizeof(d),
+                  (unsigned char *)&d);
+    }
+}
+
+/* Find the domain with the highest weight. */
+void update_max_weight(struct csched_runqueue_data *rqd, int new_weight, int old_weight)
+{
+    /* Try to avoid brute-force search:
+     * - If new_weight is larger, max_weight <- new_weight
+     * - If old_weight != max_weight, someone else is still max_weight
+     *   (No action required)
+     * - If old_weight == max_weight, brute-force search for max weight
+     */
+    if ( new_weight > rqd->max_weight )
+    {
+        rqd->max_weight = new_weight;
+        printk("%s: Runqueue id %d max weight %d\n", __func__, rqd->id, rqd->max_weight);
+    }
+    else if ( old_weight == rqd->max_weight )
+    {
+        struct list_head *iter;
+        int max_weight = 1;
+        
+        list_for_each( iter, &rqd->svc )
+        {
+            struct csched_vcpu * svc = list_entry(iter, struct csched_vcpu, rqd_elem);
+            
+            if ( svc->weight > max_weight )
+                max_weight = svc->weight;
+        }
+        
+        rqd->max_weight = max_weight;
+        printk("%s: Runqueue %d max weight %d\n", __func__, rqd->id, rqd->max_weight);
+    }
+}
+
+#ifndef NDEBUG
+static /*inline*/ void
+__csched_vcpu_check(struct vcpu *vc)
+{
+    struct csched_vcpu * const svc = CSCHED_VCPU(vc);
+    struct csched_dom * const sdom = svc->sdom;
+
+    BUG_ON( svc->vcpu != vc );
+    BUG_ON( sdom != CSCHED_DOM(vc->domain) );
+    if ( sdom )
+    {
+        BUG_ON( is_idle_vcpu(vc) );
+        BUG_ON( sdom->dom != vc->domain );
+    }
+    else
+    {
+        BUG_ON( !is_idle_vcpu(vc) );
+    }
+}
+#define CSCHED_VCPU_CHECK(_vc)  (__csched_vcpu_check(_vc))
+#else
+#define CSCHED_VCPU_CHECK(_vc)
+#endif
+
+static int
+csched_vcpu_init(struct vcpu *vc)
+{
+    struct domain * const dom = vc->domain;
+    struct csched_dom *sdom = CSCHED_DOM(dom);
+    struct csched_vcpu *svc;
+
+    printk("%s: Initializing d%dv%d\n",
+           __func__, dom->domain_id, vc->vcpu_id);
+
+    /* Allocate per-VCPU info */
+    svc = xmalloc(struct csched_vcpu);
+    if ( svc == NULL )
+        return -1;
+
+    INIT_LIST_HEAD(&svc->rqd_elem);
+    INIT_LIST_HEAD(&svc->sdom_elem);
+    INIT_LIST_HEAD(&svc->runq_elem);
+
+    svc->sdom = sdom;
+    svc->vcpu = vc;
+    svc->flags = 0U;
+    vc->sched_priv = svc;
+
+    if ( ! is_idle_vcpu(vc) )
+    {
+        BUG_ON( sdom == NULL );
+
+        svc->credit = CSCHED_CREDIT_INIT;
+        svc->weight = sdom->weight;
+
+        /* FIXME: Do we need the private lock here? */
+        list_add_tail(&svc->sdom_elem, &sdom->vcpu);
+
+        /* Add vcpu to runqueue of initial processor */
+        /* FIXME: Abstract for multiple runqueues */
+        vcpu_schedule_lock_irq(vc);
+
+        list_add_tail(&svc->rqd_elem, &RQD(vc->processor)->svc);
+        update_max_weight(RQD(vc->processor), svc->weight, 0);
+
+        vcpu_schedule_unlock_irq(vc);
+
+        sdom->nr_vcpus++;
+    } 
+    else
+    {
+        BUG_ON( sdom != NULL );
+        svc->credit = CSCHED_IDLE_CREDIT;
+        svc->weight = 0;
+        if ( csched_priv.idle_domain == NULL )
+            csched_priv.idle_domain = dom;
+    }
+
+    CSCHED_VCPU_CHECK(vc);
+    return 0;
+}
+
+static void
+csched_vcpu_destroy(struct vcpu *vc)
+{
+    struct csched_vcpu * const svc = CSCHED_VCPU(vc);
+    struct csched_dom * const sdom = svc->sdom;
+
+    BUG_ON( sdom == NULL );
+    BUG_ON( !list_empty(&svc->runq_elem) );
+
+    /* Remove from runqueue */
+    vcpu_schedule_lock_irq(vc);
+
+    list_del_init(&svc->rqd_elem);
+    update_max_weight(RQD(vc->processor), 0, svc->weight);
+
+    vcpu_schedule_unlock_irq(vc);
+
+    /* Remove from sdom list.  Don't need a lock for this, as it's called
+     * synchronously when nothing else can happen. */
+    list_del_init(&svc->sdom_elem);
+
+    sdom->nr_vcpus--;
+
+    xfree(svc);
+}
+
+static void
+csched_vcpu_sleep(struct vcpu *vc)
+{
+    struct csched_vcpu * const svc = CSCHED_VCPU(vc);
+
+    BUG_ON( is_idle_vcpu(vc) );
+
+    if ( per_cpu(schedule_data, vc->processor).curr == vc )
+        cpu_raise_softirq(vc->processor, SCHEDULE_SOFTIRQ);
+    else if ( __vcpu_on_runq(svc) )
+        __runq_remove(svc);
+}
+
+static void
+csched_vcpu_wake(struct vcpu *vc)
+{
+    struct csched_vcpu * const svc = CSCHED_VCPU(vc);
+    const unsigned int cpu = vc->processor;
+    s_time_t now = 0;
+
+    /* Schedule lock should be held at this point. */
+    
+    d2printk("w d%dv%d\n", vc->domain->domain_id, vc->vcpu_id);
+
+    BUG_ON( is_idle_vcpu(vc) );
+
+    /* Make sure svc priority mod happens before runq check */
+    if ( unlikely(per_cpu(schedule_data, cpu).curr == vc) )
+    {
+        goto out;
+    }
+
+    if ( unlikely(__vcpu_on_runq(svc)) )
+    {
+        /* If we've boosted someone that's already on a runqueue, prioritize
+         * it and inform the cpu in question. */
+        goto out;
+    }
+
+    /* If the context hasn't been saved for this vcpu yet, we can't put it on
+     * another runqueue.  Instead, we set a flag so that it will be put on the runqueue
+     * after the context has been saved. */
+    if ( unlikely (test_bit(__CSFLAG_scheduled, &svc->flags) ) )
+    {
+        set_bit(__CSFLAG_delayed_runq_add, &svc->flags);
+        goto out;
+    }
+
+    now = NOW();
+
+    /* Put the VCPU on the runq */
+    runq_insert(cpu, svc);
+    runq_tickle(cpu, svc, now);
+
+out:
+    d2printk("w-\n");
+    return;
+}
+
+static void
+csched_context_saved(struct vcpu *vc)
+{
+    struct csched_vcpu * const svc = CSCHED_VCPU(vc);
+
+    vcpu_schedule_lock_irq(vc);
+
+    /* This vcpu is now eligible to be put on the runqueue again */
+    clear_bit(__CSFLAG_scheduled, &svc->flags);
+    
+    /* If someone wants it on the runqueue, put it there. */
+    /* 
+     * NB: We can get rid of CSFLAG_scheduled by checking for
+     * vc->is_running and __vcpu_on_runq(svc) here.  However,
+     * since we're accessing the flags cacheline anyway,
+     * it seems a bit pointless; especially as we have plenty of
+     * bits free.
+     */
+    if ( test_bit(__CSFLAG_delayed_runq_add, &svc->flags) )
+    {
+        const unsigned int cpu = vc->processor;
+
+        clear_bit(__CSFLAG_delayed_runq_add, &svc->flags);
+
+        BUG_ON(__vcpu_on_runq(svc));
+        
+        runq_insert(cpu, svc);
+        runq_tickle(cpu, svc, NOW());
+    }
+
+    vcpu_schedule_unlock_irq(vc);
+}
+
+static int
+csched_cpu_pick(struct vcpu *vc)
+{
+    /* FIXME: Choose a schedule group based on load */
+    /* FIXME: Migrate the vcpu to the new runqueue list, updating 
+       max_weight for each runqueue */
+    return 0;
+}
+
+static int
+csched_dom_cntl(
+    struct domain *d,
+    struct xen_domctl_scheduler_op *op)
+{
+    struct csched_dom * const sdom = CSCHED_DOM(d);
+    unsigned long flags;
+
+    if ( op->cmd == XEN_DOMCTL_SCHEDOP_getinfo )
+    {
+        op->u.credit2.weight = sdom->weight;
+    }
+    else
+    {
+        ASSERT(op->cmd == XEN_DOMCTL_SCHEDOP_putinfo);
+
+        if ( op->u.credit2.weight != 0 )
+        {
+            struct list_head *iter;
+            int old_weight;
+
+            /* Must hold csched_priv lock to update sdom, runq lock to
+             * update csvcs. */
+            spin_lock_irqsave(&csched_priv.lock, flags);
+
+            old_weight = sdom->weight;
+
+            sdom->weight = op->u.credit2.weight;
+
+            /* Update weights for vcpus, and max_weight for runqueues on which they reside */
+            list_for_each ( iter, &sdom->vcpu )
+            {
+                struct csched_vcpu *svc = list_entry(iter, struct csched_vcpu, sdom_elem);
+
+                /* NB: Locking order is important here.  Because we grab this lock here, we
+                 * must never lock csched_priv.lock if we're holding a runqueue
+                 * lock. */
+                vcpu_schedule_lock_irq(svc->vcpu);
+
+                svc->weight = sdom->weight;
+                update_max_weight(RQD(svc->vcpu->processor), svc->weight, old_weight);
+
+                vcpu_schedule_unlock_irq(svc->vcpu);
+            }
+
+            spin_unlock_irqrestore(&csched_priv.lock, flags);
+        }
+    }
+
+    return 0;
+}
+
+static int
+csched_dom_init(struct domain *dom)
+{
+    struct csched_dom *sdom;
+    unsigned long flags;
+
+    printk("%s: Initializing domain %d\n", __func__, dom->domain_id);
+
+    if ( is_idle_domain(dom) )
+        return 0;
+
+    sdom = xmalloc(struct csched_dom);
+    if ( sdom == NULL )
+        return -ENOMEM;
+
+    /* Initialize credit and weight */
+    INIT_LIST_HEAD(&sdom->vcpu);
+    INIT_LIST_HEAD(&sdom->sdom_elem);
+    sdom->dom = dom;
+    sdom->weight = CSCHED_DEFAULT_WEIGHT;
+    sdom->nr_vcpus = 0;
+
+    dom->sched_priv = sdom;
+
+    spin_lock_irqsave(&csched_priv.lock, flags);
+
+    list_add_tail(&sdom->sdom_elem, &csched_priv.sdom);
+
+    spin_unlock_irqrestore(&csched_priv.lock, flags);
+
+    return 0;
+}
+
+static void
+csched_dom_destroy(struct domain *dom)
+{
+    struct csched_dom *sdom = CSCHED_DOM(dom);
+    unsigned long flags;
+
+    BUG_ON(!list_empty(&sdom->vcpu));
+
+    spin_lock_irqsave(&csched_priv.lock, flags);
+
+    list_del_init(&sdom->sdom_elem);
+
+    spin_unlock_irqrestore(&csched_priv.lock, flags);
+    
+    xfree(CSCHED_DOM(dom));
+}
+
+/* How long should we let this vcpu run for? */
+static s_time_t
+csched_runtime(int cpu, struct csched_vcpu *snext)
+{
+    s_time_t time = CSCHED_MAX_TIMER;
+    struct csched_runqueue_data *rqd = RQD(cpu);
+    struct list_head *runq = &rqd->runq;
+
+    if ( is_idle_vcpu(snext->vcpu) )
+        return CSCHED_MAX_TIMER;
+
+    /* Basic time */
+    time = c2t(rqd, snext->credit, snext);
+
+    /* Next guy on runqueue */
+    if ( ! list_empty(runq) )
+    {
+        struct csched_vcpu *svc = __runq_elem(runq->next);
+        s_time_t ntime;
+
+        if ( ! is_idle_vcpu(svc->vcpu) )
+        {
+            ntime = c2t(rqd, snext->credit - svc->credit, snext);
+
+            if ( time > ntime )
+                time = ntime;
+        }
+    }
+
+    /* Check limits */
+    if ( time < CSCHED_MIN_TIMER )
+        time = CSCHED_MIN_TIMER;
+    else if ( time > CSCHED_MAX_TIMER )
+        time = CSCHED_MAX_TIMER;
+
+    return time;
+}
+
+void __dump_execstate(void *unused);
+
+/*
+ * This function is in the critical path. It is designed to be simple and
+ * fast for the common case.
+ */
+static struct task_slice
+csched_schedule(s_time_t now)
+{
+    const int cpu = smp_processor_id();
+    struct csched_runqueue_data *rqd = RQD(cpu);
+    struct list_head * const runq = &rqd->runq;
+    struct csched_vcpu * const scurr = CSCHED_VCPU(current);
+    struct csched_vcpu *snext = NULL;
+    struct task_slice ret;
+
+    CSCHED_VCPU_CHECK(current);
+
+    d2printk("sc p%d c d%dv%d now %"PRI_stime"\n",
+             cpu,
+             scurr->vcpu->domain->domain_id,
+             scurr->vcpu->vcpu_id,
+             now);
+
+
+    /* Protected by runqueue lock */
+
+    /* Update credits */
+    burn_credits(rqd, scurr, now);
+
+    /*
+     * Select next runnable local VCPU (ie top of local runq).
+     *
+     * If the current vcpu is runnable, and has higher credit than
+     * the next guy on the queue (or there is no one else), we want to run him again.
+     *
+     * If the current vcpu is runnable, and the next guy on the queue
+     * has higher credit, we want to mark current for delayed runqueue
+     * add, and remove the next guy from the queue.
+     *
+     * If the current vcpu is not runnable, we want to choose the idle
+     * vcpu for this processor. 
+     */
+    if ( list_empty(runq) )
+        snext = CSCHED_VCPU(csched_priv.idle_domain->vcpu[cpu]);
+    else
+        snext = __runq_elem(runq->next);
+
+    if ( !is_idle_vcpu(current) && vcpu_runnable(current) )
+    {
+        /* If the current vcpu is runnable, and has higher credit
+         * than the next on the runqueue, run him again.
+         * Otherwise, set him for delayed runq add. */
+        if ( scurr->credit > snext->credit )
+            snext = scurr;
+        else
+            set_bit(__CSFLAG_delayed_runq_add, &scurr->flags);
+    }
+
+    if ( snext != scurr && !is_idle_vcpu(snext->vcpu) )
+    {
+        __runq_remove(snext);
+        if ( snext->vcpu->is_running )
+        {
+            printk("p%d: snext d%dv%d running on p%d! scurr d%dv%d\n",
+                   cpu,
+                   snext->vcpu->domain->domain_id, snext->vcpu->vcpu_id,
+                   snext->vcpu->processor,
+                   scurr->vcpu->domain->domain_id,
+                   scurr->vcpu->vcpu_id);
+            BUG();
+        }
+        set_bit(__CSFLAG_scheduled, &snext->flags);
+    }
+
+    if ( !is_idle_vcpu(snext->vcpu) && snext->credit <= CSCHED_CREDIT_RESET )
+        reset_credit(cpu, now);
+
+#if 0
+    /*
+     * Update idlers mask if necessary. When we're idling, other CPUs
+     * will tickle us when they get extra work.
+     */
+    if ( is_idle_vcpu(snext->vcpu) )
+    {
+        if ( !cpu_isset(cpu, csched_priv.idlers) )
+            cpu_set(cpu, csched_priv.idlers);
+    }
+    else if ( cpu_isset(cpu, csched_priv.idlers) )
+    {
+        cpu_clear(cpu, csched_priv.idlers);
+    }
+#endif
+
+    if ( !is_idle_vcpu(snext->vcpu) )
+    {
+        snext->start_time = now;
+        snext->vcpu->processor = cpu; /* Safe because lock for old processor is held */
+    }
+    /*
+     * Return task to run next...
+     */
+    ret.time = csched_runtime(cpu, snext);
+    ret.task = snext->vcpu;
+
+    CSCHED_VCPU_CHECK(ret.task);
+    return ret;
+}
+
+static void
+csched_dump_vcpu(struct csched_vcpu *svc)
+{
+    printk("[%i.%i] flags=%x cpu=%i",
+            svc->vcpu->domain->domain_id,
+            svc->vcpu->vcpu_id,
+            svc->flags,
+            svc->vcpu->processor);
+
+    printk(" credit=%" PRIi32" [w=%u]", svc->credit, svc->weight);
+
+    printk("\n");
+}
+
+static void
+csched_dump_pcpu(int cpu)
+{
+    struct list_head *runq, *iter;
+    struct csched_vcpu *svc;
+    int loop;
+    char cpustr[100];
+
+    /* FIXME: Do locking properly for access to runqueue structures */
+
+    runq = &RQD(cpu)->runq;
+
+    cpumask_scnprintf(cpustr, sizeof(cpustr), per_cpu(cpu_sibling_map,cpu));
+    printk(" sibling=%s, ", cpustr);
+    cpumask_scnprintf(cpustr, sizeof(cpustr), per_cpu(cpu_core_map,cpu));
+    printk("core=%s\n", cpustr);
+
+    /* current VCPU */
+    svc = CSCHED_VCPU(per_cpu(schedule_data, cpu).curr);
+    if ( svc )
+    {
+        printk("\trun: ");
+        csched_dump_vcpu(svc);
+    }
+
+    loop = 0;
+    list_for_each( iter, runq )
+    {
+        svc = __runq_elem(iter);
+        if ( svc )
+        {
+            printk("\t%3d: ", ++loop);
+            csched_dump_vcpu(svc);
+        }
+    }
+}
+
+static void
+csched_dump(void)
+{
+    struct list_head *iter_sdom, *iter_svc;
+    int loop;
+
+    printk("info:\n"
+           "\tncpus              = %u\n"
+           "\tdefault-weight     = %d\n",
+           csched_priv.ncpus,
+           CSCHED_DEFAULT_WEIGHT);
+
+    /* FIXME: Locking! */
+
+    printk("active vcpus:\n");
+    loop = 0;
+    list_for_each( iter_sdom, &csched_priv.sdom )
+    {
+        struct csched_dom *sdom;
+        sdom = list_entry(iter_sdom, struct csched_dom, sdom_elem);
+
+        list_for_each( iter_svc, &sdom->vcpu )
+        {
+            struct csched_vcpu *svc;
+            svc = list_entry(iter_svc, struct csched_vcpu, sdom_elem);
+
+            printk("\t%3d: ", ++loop);
+            csched_dump_vcpu(svc);
+        }
+    }
+}
+
+static void
+make_runq_map(void)
+{
+    int cpu, cpu_count=0;
+
+    /* FIXME: Read pcpu layout and do this properly */
+    for_each_possible_cpu( cpu )
+    {
+        csched_priv.runq_map[cpu] = 0;
+        cpu_count++;
+    }
+    csched_priv.runq_count = 1;
+    
+    /* Move to the init code...? */
+    csched_priv.rqd[0].cpu_min = 0;
+    csched_priv.rqd[0].cpu_max = cpu_count;
+}
+
+static void
+csched_init(void)
+{
+    int i;
+
+    spin_lock_init(&csched_priv.lock);
+    INIT_LIST_HEAD(&csched_priv.sdom);
+
+    csched_priv.ncpus = 0;
+
+    make_runq_map();
+
+    for ( i=0; i<csched_priv.runq_count ; i++ )
+    {
+        struct csched_runqueue_data *rqd = csched_priv.rqd + i;
+
+        rqd->max_weight = 1;
+        rqd->id = i;
+        INIT_LIST_HEAD(&rqd->svc);
+        INIT_LIST_HEAD(&rqd->runq);
+    }
+
+    /* Initialize pcpu structures */
+    for_each_possible_cpu(i)
+    {
+        int runq_id;
+        spinlock_t *lock;
+
+        /* Point the per-cpu schedule lock to the runq_id lock */
+        runq_id = csched_priv.runq_map[i];
+        lock = &per_cpu(schedule_data, runq_id)._lock;
+
+        per_cpu(schedule_data, i).schedule_lock = lock;
+
+        csched_priv.ncpus++;
+    }
+}
+
+struct scheduler sched_credit2_def = {
+    .name           = "SMP Credit Scheduler rev2",
+    .opt_name       = "credit2",
+    .sched_id       = XEN_SCHEDULER_CREDIT2,
+
+    .init_domain    = csched_dom_init,
+    .destroy_domain = csched_dom_destroy,
+
+    .init_vcpu      = csched_vcpu_init,
+    .destroy_vcpu   = csched_vcpu_destroy,
+
+    .sleep          = csched_vcpu_sleep,
+    .wake           = csched_vcpu_wake,
+
+    .adjust         = csched_dom_cntl,
+
+    .pick_cpu       = csched_cpu_pick,
+    .do_schedule    = csched_schedule,
+    .context_saved  = csched_context_saved,
+
+    .dump_cpu_state = csched_dump_pcpu,
+    .dump_settings  = csched_dump,
+    .init           = csched_init,
+};
diff -r 889bd19dd09d xen/common/schedule.c
--- a/xen/common/schedule.c	Mon Feb 15 15:23:09 2010 +0000
+++ b/xen/common/schedule.c	Mon Feb 15 16:15:16 2010 +0000
@@ -58,9 +58,11 @@
 
 extern const struct scheduler sched_sedf_def;
 extern const struct scheduler sched_credit_def;
+extern const struct scheduler sched_credit2_def;
 static const struct scheduler *__initdata schedulers[] = {
     &sched_sedf_def,
     &sched_credit_def,
+    &sched_credit2_def,
     NULL
 };
 
diff -r 889bd19dd09d xen/include/public/domctl.h
--- a/xen/include/public/domctl.h	Mon Feb 15 15:23:09 2010 +0000
+++ b/xen/include/public/domctl.h	Mon Feb 15 16:15:16 2010 +0000
@@ -303,6 +303,7 @@
 /* Scheduler types. */
 #define XEN_SCHEDULER_SEDF     4
 #define XEN_SCHEDULER_CREDIT   5
+#define XEN_SCHEDULER_CREDIT2  6
 /* Set or get info? */
 #define XEN_DOMCTL_SCHEDOP_putinfo 0
 #define XEN_DOMCTL_SCHEDOP_getinfo 1
@@ -321,6 +322,9 @@
             uint16_t weight;
             uint16_t cap;
         } credit;
+        struct xen_domctl_sched_credit2 {
+            uint16_t weight;
+        } credit2;
     } u;
 };
 typedef struct xen_domctl_scheduler_op xen_domctl_scheduler_op_t;

[-- Attachment #6: 20100215-credit2-tools.diff --]
[-- Type: text/x-patch, Size: 15599 bytes --]

diff -r 63531e640828 tools/libxc/Makefile
--- a/tools/libxc/Makefile	Mon Dec 07 17:01:11 2009 +0000
+++ b/tools/libxc/Makefile	Mon Dec 21 11:45:00 2009 +0000
@@ -17,6 +17,7 @@
 CTRL_SRCS-y       += xc_private.c
 CTRL_SRCS-y       += xc_sedf.c
 CTRL_SRCS-y       += xc_csched.c
+CTRL_SRCS-y       += xc_csched2.c
 CTRL_SRCS-y       += xc_tbuf.c
 CTRL_SRCS-y       += xc_pm.c
 CTRL_SRCS-y       += xc_cpu_hotplug.c
diff -r 63531e640828 tools/libxc/xc_csched2.c
--- /dev/null	Thu Jan 01 00:00:00 1970 +0000
+++ b/tools/libxc/xc_csched2.c	Mon Dec 21 11:45:00 2009 +0000
@@ -0,0 +1,50 @@
+/****************************************************************************
+ * (C) 2006 - Emmanuel Ackaouy - XenSource Inc.
+ ****************************************************************************
+ *
+ *        File: xc_csched2.c
+ *      Author: Emmanuel Ackaouy
+ *
+ * Description: XC Interface to the credit2 scheduler
+ *
+ */
+#include "xc_private.h"
+
+
+int
+xc_sched_credit2_domain_set(
+    int xc_handle,
+    uint32_t domid,
+    struct xen_domctl_sched_credit2 *sdom)
+{
+    DECLARE_DOMCTL;
+
+    domctl.cmd = XEN_DOMCTL_scheduler_op;
+    domctl.domain = (domid_t) domid;
+    domctl.u.scheduler_op.sched_id = XEN_SCHEDULER_CREDIT2;
+    domctl.u.scheduler_op.cmd = XEN_DOMCTL_SCHEDOP_putinfo;
+    domctl.u.scheduler_op.u.credit2 = *sdom;
+
+    return do_domctl(xc_handle, &domctl);
+}
+
+int
+xc_sched_credit2_domain_get(
+    int xc_handle,
+    uint32_t domid,
+    struct xen_domctl_sched_credit2 *sdom)
+{
+    DECLARE_DOMCTL;
+    int err;
+
+    domctl.cmd = XEN_DOMCTL_scheduler_op;
+    domctl.domain = (domid_t) domid;
+    domctl.u.scheduler_op.sched_id = XEN_SCHEDULER_CREDIT2;
+    domctl.u.scheduler_op.cmd = XEN_DOMCTL_SCHEDOP_getinfo;
+
+    err = do_domctl(xc_handle, &domctl);
+    if ( err == 0 )
+        *sdom = domctl.u.scheduler_op.u.credit2;
+
+    return err;
+}
diff -r 63531e640828 tools/libxc/xenctrl.h
--- a/tools/libxc/xenctrl.h	Mon Dec 07 17:01:11 2009 +0000
+++ b/tools/libxc/xenctrl.h	Mon Dec 21 11:45:00 2009 +0000
@@ -469,6 +469,14 @@
                                uint32_t domid,
                                struct xen_domctl_sched_credit *sdom);
 
+int xc_sched_credit2_domain_set(int xc_handle,
+                               uint32_t domid,
+                               struct xen_domctl_sched_credit2 *sdom);
+
+int xc_sched_credit2_domain_get(int xc_handle,
+                               uint32_t domid,
+                               struct xen_domctl_sched_credit2 *sdom);
+
 /**
  * This function sends a trigger to a domain.
  *
diff -r 63531e640828 tools/python/xen/lowlevel/xc/xc.c
--- a/tools/python/xen/lowlevel/xc/xc.c	Mon Dec 07 17:01:11 2009 +0000
+++ b/tools/python/xen/lowlevel/xc/xc.c	Mon Dec 21 11:45:00 2009 +0000
@@ -1374,6 +1374,45 @@
                          "cap",     sdom.cap);
 }
 
+static PyObject *pyxc_sched_credit2_domain_set(XcObject *self,
+                                              PyObject *args,
+                                              PyObject *kwds)
+{
+    uint32_t domid;
+    uint16_t weight;
+    static char *kwd_list[] = { "domid", "weight", NULL };
+    static char kwd_type[] = "I|H";
+    struct xen_domctl_sched_credit2 sdom;
+    
+    weight = 0;
+    if( !PyArg_ParseTupleAndKeywords(args, kwds, kwd_type, kwd_list, 
+                                     &domid, &weight) )
+        return NULL;
+
+    sdom.weight = weight;
+
+    if ( xc_sched_credit2_domain_set(self->xc_handle, domid, &sdom) != 0 )
+        return pyxc_error_to_exception();
+
+    Py_INCREF(zero);
+    return zero;
+}
+
+static PyObject *pyxc_sched_credit2_domain_get(XcObject *self, PyObject *args)
+{
+    uint32_t domid;
+    struct xen_domctl_sched_credit2 sdom;
+    
+    if( !PyArg_ParseTuple(args, "I", &domid) )
+        return NULL;
+    
+    if ( xc_sched_credit2_domain_get(self->xc_handle, domid, &sdom) != 0 )
+        return pyxc_error_to_exception();
+
+    return Py_BuildValue("{s:H}",
+                         "weight",  sdom.weight);
+}
+
 static PyObject *pyxc_domain_setmaxmem(XcObject *self, PyObject *args)
 {
     uint32_t dom;
@@ -1912,6 +1951,24 @@
       "Returns:   [dict]\n"
       " weight    [short]: domain's scheduling weight\n"},
 
+    { "sched_credit2_domain_set",
+      (PyCFunction)pyxc_sched_credit2_domain_set,
+      METH_KEYWORDS, "\n"
+      "Set the scheduling parameters for a domain when running with the\n"
+      "SMP credit2 scheduler.\n"
+      " domid     [int]:   domain id to set\n"
+      " weight    [short]: domain's scheduling weight\n"
+      "Returns: [int] 0 on success; -1 on error.\n" },
+
+    { "sched_credit2_domain_get",
+      (PyCFunction)pyxc_sched_credit2_domain_get,
+      METH_VARARGS, "\n"
+      "Get the scheduling parameters for a domain when running with the\n"
+      "SMP credit2 scheduler.\n"
+      " domid     [int]:   domain id to get\n"
+      "Returns:   [dict]\n"
+      " weight    [short]: domain's scheduling weight\n"},
+
     { "evtchn_alloc_unbound", 
       (PyCFunction)pyxc_evtchn_alloc_unbound,
       METH_VARARGS | METH_KEYWORDS, "\n"
@@ -2272,6 +2329,7 @@
     /* Expose some libxc constants to Python */
     PyModule_AddIntConstant(m, "XEN_SCHEDULER_SEDF", XEN_SCHEDULER_SEDF);
     PyModule_AddIntConstant(m, "XEN_SCHEDULER_CREDIT", XEN_SCHEDULER_CREDIT);
+    PyModule_AddIntConstant(m, "XEN_SCHEDULER_CREDIT2", XEN_SCHEDULER_CREDIT2);
 
 }
 
diff -r 63531e640828 tools/python/xen/xend/XendAPI.py
--- a/tools/python/xen/xend/XendAPI.py	Mon Dec 07 17:01:11 2009 +0000
+++ b/tools/python/xen/xend/XendAPI.py	Mon Dec 21 11:45:00 2009 +0000
@@ -1626,8 +1626,7 @@
         if 'weight' in xeninfo.info['vcpus_params'] \
            and 'cap' in xeninfo.info['vcpus_params']:
             weight = xeninfo.info['vcpus_params']['weight']
-            cap = xeninfo.info['vcpus_params']['cap']
-            xendom.domain_sched_credit_set(xeninfo.getDomid(), weight, cap)
+            xendom.domain_sched_credit2_set(xeninfo.getDomid(), weight)
 
     def VM_set_VCPUs_number_live(self, _, vm_ref, num):
         dom = XendDomain.instance().get_vm_by_uuid(vm_ref)
diff -r 63531e640828 tools/python/xen/xend/XendDomain.py
--- a/tools/python/xen/xend/XendDomain.py	Mon Dec 07 17:01:11 2009 +0000
+++ b/tools/python/xen/xend/XendDomain.py	Mon Dec 21 11:45:00 2009 +0000
@@ -1757,6 +1757,60 @@
             log.exception(ex)
             raise XendError(str(ex))
 
+    def domain_sched_credit2_get(self, domid):
+        """Get credit2 scheduler parameters for a domain.
+
+        @param domid: Domain ID or Name
+        @type domid: int or string.
+        @rtype: dict with keys 'weight'
+        @return: credit2 scheduler parameters
+        """
+        dominfo = self.domain_lookup_nr(domid)
+        if not dominfo:
+            raise XendInvalidDomain(str(domid))
+        
+        if dominfo._stateGet() in (DOM_STATE_RUNNING, DOM_STATE_PAUSED):
+            try:
+                return xc.sched_credit2_domain_get(dominfo.getDomid())
+            except Exception, ex:
+                raise XendError(str(ex))
+        else:
+            return {'weight' : dominfo.getWeight()}
+    
+    def domain_sched_credit2_set(self, domid, weight = None):
+        """Set credit2 scheduler parameters for a domain.
+
+        @param domid: Domain ID or Name
+        @type domid: int or string.
+        @type weight: int
+        @rtype: 0
+        """
+        set_weight = False
+        dominfo = self.domain_lookup_nr(domid)
+        if not dominfo:
+            raise XendInvalidDomain(str(domid))
+        try:
+            if weight is None:
+                weight = int(0)
+            elif weight < 1 or weight > 65535:
+                raise XendError("weight is out of range")
+            else:
+                set_weight = True
+
+            assert type(weight) == int
+
+            rc = 0
+            if dominfo._stateGet() in (DOM_STATE_RUNNING, DOM_STATE_PAUSED):
+                rc = xc.sched_credit2_domain_set(dominfo.getDomid(), weight)
+            if rc == 0:
+                if set_weight:
+                    dominfo.setWeight(weight)
+                self.managed_config_save(dominfo)
+            return rc
+        except Exception, ex:
+            log.exception(ex)
+            raise XendError(str(ex))
+
     def domain_maxmem_set(self, domid, mem):
         """Set the memory limit for a domain.
 
diff -r 63531e640828 tools/python/xen/xend/XendDomainInfo.py
--- a/tools/python/xen/xend/XendDomainInfo.py	Mon Dec 07 17:01:11 2009 +0000
+++ b/tools/python/xen/xend/XendDomainInfo.py	Mon Dec 21 11:45:00 2009 +0000
@@ -2719,6 +2719,10 @@
             XendDomain.instance().domain_sched_credit_set(self.getDomid(),
                                                           self.getWeight(),
                                                           self.getCap())
+        elif XendNode.instance().xenschedinfo() == 'credit2':
+            from xen.xend import XendDomain
+            XendDomain.instance().domain_sched_credit2_set(self.getDomid(),
+                                                           self.getWeight())
 
     def _initDomain(self):
         log.debug('XendDomainInfo.initDomain: %s %s',
diff -r 63531e640828 tools/python/xen/xend/XendNode.py
--- a/tools/python/xen/xend/XendNode.py	Mon Dec 07 17:01:11 2009 +0000
+++ b/tools/python/xen/xend/XendNode.py	Mon Dec 21 11:45:00 2009 +0000
@@ -760,6 +760,8 @@
             return 'sedf'
         elif sched_id == xen.lowlevel.xc.XEN_SCHEDULER_CREDIT:
             return 'credit'
+        elif sched_id == xen.lowlevel.xc.XEN_SCHEDULER_CREDIT2:
+            return 'credit2'
         else:
             return 'unknown'
 
@@ -961,6 +963,8 @@
             return 'sedf'
         elif sched_id == xen.lowlevel.xc.XEN_SCHEDULER_CREDIT:
             return 'credit'
+        elif sched_id == xen.lowlevel.xc.XEN_SCHEDULER_CREDIT2:
+            return 'credit2'
         else:
             return 'unknown'
 
diff -r 63531e640828 tools/python/xen/xend/XendVMMetrics.py
--- a/tools/python/xen/xend/XendVMMetrics.py	Mon Dec 07 17:01:11 2009 +0000
+++ b/tools/python/xen/xend/XendVMMetrics.py	Mon Dec 21 11:45:00 2009 +0000
@@ -129,6 +129,7 @@
                 params_live['cpumap%i' % i] = \
                     ",".join(map(str, info['cpumap']))
 
+                # FIXME: credit2??
             params_live.update(xc.sched_credit_domain_get(domid))
             
             return params_live
diff -r 63531e640828 tools/python/xen/xend/server/SrvDomain.py
--- a/tools/python/xen/xend/server/SrvDomain.py	Mon Dec 07 17:01:11 2009 +0000
+++ b/tools/python/xen/xend/server/SrvDomain.py	Mon Dec 21 11:45:00 2009 +0000
@@ -163,6 +163,20 @@
         val = fn(req.args, {'dom': self.dom.getName()})
         return val
 
+    def op_domain_sched_credit2_get(self, _, req):
+        fn = FormFn(self.xd.domain_sched_credit2_get,
+                    [['dom', 'str']])
+        val = fn(req.args, {'dom': self.dom.getName()})
+        return val
+
+
+    def op_domain_sched_credit2_set(self, _, req):
+        fn = FormFn(self.xd.domain_sched_credit2_set,
+                    [['dom', 'str'],
+                     ['weight', 'int']])
+        val = fn(req.args, {'dom': self.dom.getName()})
+        return val
+
     def op_maxmem_set(self, _, req):
         return self.call(self.dom.setMemoryMaximum,
                          [['memory', 'int']],
diff -r 63531e640828 tools/python/xen/xm/main.py
--- a/tools/python/xen/xm/main.py	Mon Dec 07 17:01:11 2009 +0000
+++ b/tools/python/xen/xm/main.py	Mon Dec 21 11:45:00 2009 +0000
@@ -150,6 +150,8 @@
     'sched-sedf'  : ('<Domain> [options]', 'Get/set EDF parameters.'),
     'sched-credit': ('[-d <Domain> [-w[=WEIGHT]|-c[=CAP]]]',
                      'Get/set credit scheduler parameters.'),
+    'sched-credit2': ('[-d <Domain> [-w[=WEIGHT]]]',
+                     'Get/set credit2 scheduler parameters.'),
     'sysrq'       : ('<Domain> <letter>', 'Send a sysrq to a domain.'),
     'debug-keys'  : ('<Keys>', 'Send debug keys to Xen.'),
     'trigger'     : ('<Domain> <nmi|reset|init|s3resume|power> [<VCPU>]',
@@ -265,6 +267,10 @@
        ('-w WEIGHT', '--weight=WEIGHT', 'Weight (int)'),
        ('-c CAP',    '--cap=CAP',       'Cap (int)'),
     ),
+    'sched-credit2': (
+       ('-d DOMAIN', '--domain=DOMAIN', 'Domain to modify'),
+       ('-w WEIGHT', '--weight=WEIGHT', 'Weight (int)'),
+    ),
     'list': (
        ('-l', '--long',         'Output all VM details in SXP'),
        ('', '--label',          'Include security labels'),
@@ -406,6 +412,7 @@
     ]
 
 scheduler_commands = [
+    "sched-credit2",
     "sched-credit",
     "sched-sedf",
     ]
@@ -1720,6 +1727,80 @@
             if result != 0:
                 err(str(result))
 
+def xm_sched_credit2(args):
+    """Get/Set options for Credit2 Scheduler."""
+    
+    check_sched_type('credit2')
+
+    try:
+        opts, params = getopt.getopt(args, "d:w:",
+            ["domain=", "weight="])
+    except getopt.GetoptError, opterr:
+        err(opterr)
+        usage('sched-credit2')
+
+    domid = None
+    weight = None
+
+    for o, a in opts:
+        if o in ["-d", "--domain"]:
+            domid = a
+        elif o in ["-w", "--weight"]:
+            weight = int(a)
+
+    doms = filter(lambda x : domid_match(domid, x),
+                  [parse_doms_info(dom)
+                  for dom in getDomains(None, 'all')])
+
+    if weight is None:
+        if domid is not None and doms == []: 
+            err("Domain '%s' does not exist." % domid)
+            usage('sched-credit2')
+        # print header if we aren't setting any parameters
+        print '%-33s %4s %6s' % ('Name','ID','Weight')
+        
+        for d in doms:
+            try:
+                if serverType == SERVER_XEN_API:
+                    info = server.xenapi.VM_metrics.get_VCPUs_params(
+                        server.xenapi.VM.get_metrics(
+                            get_single_vm(d['name'])))
+                else:
+                    info = server.xend.domain.sched_credit2_get(d['name'])
+            except xmlrpclib.Fault:
+                info = {}   # faulted: no parameters for this domain
+
+            if 'weight' not in info:
+                # domain does not support sched-credit2?
+                info = {'weight': -1}
+
+            info['weight'] = int(info['weight'])
+            
+            info['name']  = d['name']
+            info['domid'] = str(d['domid'])
+            print( ("%(name)-32s %(domid)5s %(weight)6d") % info)
+    else:
+        if domid is None:
+            # placeholder for system-wide scheduler parameters
+            err("No domain given.")
+            usage('sched-credit2')
+
+        if serverType == SERVER_XEN_API:
+            if doms[0]['domid']:
+                server.xenapi.VM.add_to_VCPUs_params_live(
+                    get_single_vm(domid),
+                    "weight",
+                    weight)
+            else:
+                server.xenapi.VM.add_to_VCPUs_params(
+                    get_single_vm(domid),
+                    "weight",
+                    weight)
+        else:
+            result = server.xend.domain.sched_credit2_set(domid, weight)
+            if result != 0:
+                err(str(result))
+
 def xm_info(args):
     arg_check(args, "info", 0, 1)
     
@@ -3341,6 +3422,7 @@
     # scheduler
     "sched-sedf": xm_sched_sedf,
     "sched-credit": xm_sched_credit,
+    "sched-credit2": xm_sched_credit2,
     # block
     "block-attach": xm_block_attach,
     "block-detach": xm_block_detach,

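For reference, a session with the new command might look like this once
the patch is applied (domain names and weight values below are
illustrative, not taken from the thread; the column layout is the one
printed by xm_sched_credit2 above):

$ xm sched-credit2
Name                                ID Weight
Domain-0                             0    256
testvm                               1    256

$ xm sched-credit2 -d testvm -w 512
$ xm sched-credit2 -d testvm
Name                                ID Weight
testvm                               1    512

Setting a weight calls server.xend.domain.sched_credit2_set(domid, weight),
which SrvDomain.py exposes as op_domain_sched_credit2_set above.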

* Re: [RFC][PATCH] Scheduler interface changes for credit2
  2010-02-15 17:20 [RFC][PATCH] Scheduler interface changes for credit2 George Dunlap
@ 2010-02-22 15:22 ` George Dunlap
  2010-02-22 15:38   ` Keir Fraser
  0 siblings, 1 reply; 5+ messages in thread
From: George Dunlap @ 2010-02-22 15:22 UTC (permalink / raw)
  To: xen-devel; +Cc: Keir Fraser

Keir, any thoughts?  Does this seem like a reasonable approach?

Thanks,
 -George


* Re: [RFC][PATCH] Scheduler interface changes for credit2
  2010-02-22 15:22 ` George Dunlap
@ 2010-02-22 15:38   ` Keir Fraser
  2010-02-22 16:16     ` George Dunlap
  0 siblings, 1 reply; 5+ messages in thread
From: Keir Fraser @ 2010-02-22 15:38 UTC (permalink / raw)
  To: George Dunlap, xen-devel@lists.xensource.com

They're all fine, probably even sched-context_switch-callback.diff, which I
suppose is your new alternative to leaving vcpus which aren't yet schedulable
on the shared runqueue? Although I reckon it could still be done another way,
using the vcpu_migrate logic, this is a smaller and neater way to do it.

All of these can be applied after 4.0.

 -- Keir


* Re: [RFC][PATCH] Scheduler interface changes for credit2
  2010-02-22 15:38   ` Keir Fraser
@ 2010-02-22 16:16     ` George Dunlap
  2010-02-22 17:32       ` Keir Fraser
  0 siblings, 1 reply; 5+ messages in thread
From: George Dunlap @ 2010-02-22 16:16 UTC (permalink / raw)
  To: Keir Fraser; +Cc: xen-devel@lists.xensource.com

Keir Fraser wrote:
> They're all fine, probably even sched-context_switch-callback.diff, which I
> suppose is your new alternative to leaving vcpus which aren't yet schedulable
> on the shared runqueue? Although I reckon it could still be done another way,
> using the vcpu_migrate logic, this is a smaller and neater way to do it.
>
Exactly.  The migrate logic calls cpu_pick, which is also called from
other contexts.  Having the same function try to do both, and setting
the "migrating" flag when the vcpu isn't migrating, just seems hackish
and likely to be confusing.  The callback does, admittedly, add another
"check whether this function exists" test to the hot path for the other
schedulers.

I suppose we could think about changing things around, so that there's a 
generic "callback on context switch" flag.  Then we'd only have one 
callback in the context switch code, and the check would only happen if 
the flag was set (reducing cache footprint).  It would be a slightly 
larger patch, and make changes to the other schedulers, but those would 
be pretty minimal.

I'll play around with it and see how it looks.
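
A minimal sketch of the shape that could take (every name below is made
up for illustration; none of this is from the posted patches):

/* Sketch: gate the context_saved callback behind a per-scheduler flag,
 * so schedulers that never register it pay only a single flag test on
 * the context-switch path. */
#include <stdio.h>

struct vcpu { int vcpu_id; };

#define SCHED_CB_CONTEXT_SAVED (1u << 0)

struct scheduler {
    unsigned int flags;
    void (*context_saved)(struct vcpu *);
};

/* What credit2 would register: this is where it would put the vcpu
 * back on the shared runqueue. */
static void credit2_context_saved(struct vcpu *v)
{
    printf("vcpu %d fully off the cpu, safe to re-queue\n", v->vcpu_id);
}

static struct scheduler credit2_ops = {
    .flags         = SCHED_CB_CONTEXT_SAVED,
    .context_saved = credit2_context_saved,
};

/* Tail of the context switch: one branch, taken only if the scheduler
 * asked for the callback. */
static void context_saved(struct scheduler *ops, struct vcpu *prev)
{
    if ( ops->flags & SCHED_CB_CONTEXT_SAVED )
        ops->context_saved(prev);
}

int main(void)
{
    struct vcpu prev = { .vcpu_id = 7 };
    context_saved(&credit2_ops, &prev);
    return 0;
}

Functionally this is just a NULL test on the function pointer; presumably
the benefit of a flag is that it can live in a word that's already hot in
cache on the context-switch path.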

-George

* Re: [RFC][PATCH] Scheduler interface changes for credit2
  2010-02-22 16:16     ` George Dunlap
@ 2010-02-22 17:32       ` Keir Fraser
  0 siblings, 0 replies; 5+ messages in thread
From: Keir Fraser @ 2010-02-22 17:32 UTC (permalink / raw)
  To: George Dunlap; +Cc: xen-devel@lists.xensource.com

On 22/02/2010 16:16, "George Dunlap" <George.Dunlap@eu.citrix.com> wrote:

> I suppose we could think about changing things around, so that there's a
> generic "callback on context switch" flag.  Then we'd only have one
> callback in the context switch code, and the check would only happen if
> the flag was set (reducing cache footprint).  It would be a slightly
> larger patch, and make changes to the other schedulers, but those would
> be pretty minimal.
> 
> I'll play around with it and see how it looks.

Nah, I think we'll just go with it as it is. It's clean and the path isn't
*that* hot.

 -- Keir
