[PATCH] [RFC] Credit2 scheduler prototype

xen-devel.lists.xenproject.org archive mirror
 help / color / mirror / Atom feed

* [PATCH] [RFC] Credit2 scheduler prototype
@ 2009-12-07 17:02 George Dunlap
  2009-12-07 17:45 ` Keir Fraser
  0 siblings, 1 reply; 11+ messages in thread
From: George Dunlap @ 2009-12-07 17:02 UTC (permalink / raw)
  To: xen-devel, Keir Fraser

[-- Attachment #1: Type: text/plain, Size: 897 bytes --]

I'm attaching patches for the rudimentary new credit2 scheduler which
I discussed at the Summit.  It's definitely still at a developmental
stage, but it's in a state where people should be able to contribute
now.

I've put up a wiki page to help coordinate development here:
 http://wiki.xensource.com/xenwiki/Credit2_Scheduler_Development

Some caveats:
* There is no inter-socket load balancing.  This is one of the work
items to be done.
* It works on my 2-core box, but deadlocks on my 4-core box; cause
still to be determined.

The wiki page lists a number of semi-independent lines of development
that people can take.  Let me know if you're interested in any of
them, and if you have any questions, and I can elaborate.

Keir (and everyone), I think at this point it would be a good idea to
start a credit2 development branch in ext/ so we can keep a revision
history.  Thoughts?

 -George

[-- Attachment #2: credit2-hypervisor.diff --]
[-- Type: text/x-diff, Size: 30565 bytes --]

diff -r 23d34c3ba4b7 xen/arch/x86/domain.c
--- a/xen/arch/x86/domain.c	Mon Nov 30 16:13:01 2009 -0600
+++ b/xen/arch/x86/domain.c	Mon Dec 07 16:59:53 2009 +0000
@@ -1426,9 +1426,9 @@
 
     set_current(next);
 
-    if ( (per_cpu(curr_vcpu, cpu) == next) || is_idle_vcpu(next) )
+    if ( (per_cpu(curr_vcpu, cpu) == next) /* || is_idle_vcpu(next) */)
     {
-        local_irq_enable();
+        ;//local_irq_enable();
     }
     else
     {
@@ -1445,9 +1445,8 @@
                 write_efer(efer | EFER_SCE);
         }
 #endif
-
         /* Re-enable interrupts before restoring state which may fault. */
-        local_irq_enable();
+        //local_irq_enable();
 
         if ( !is_hvm_vcpu(next) )
         {
@@ -1458,6 +1457,13 @@
 
     context_saved(prev);
 
+    local_irq_enable();
+
+    /* If we've deadlocked somehow and temporarily made a VM unrunnable,
+     * clear the bit and call wake. */
+    if ( test_and_clear_bit(_VPF_deadlock, &prev->pause_flags ) )
+         vcpu_wake(prev);
+
     if (prev != next)
         update_runstate_area(next);
 
diff -r 23d34c3ba4b7 xen/arch/x86/nmi.c
--- a/xen/arch/x86/nmi.c	Mon Nov 30 16:13:01 2009 -0600
+++ b/xen/arch/x86/nmi.c	Mon Dec 07 16:59:53 2009 +0000
@@ -391,7 +391,6 @@
     u32 id = cpu_physical_id(cpu);
 
     printk("Triggering NMI on APIC ID %x\n", id);
-    debugtrace_dump();
 
     local_irq_disable();
     apic_wait_icr_idle();
@@ -426,11 +425,12 @@
         if ( this_cpu(alert_counter) == 5*nmi_hz )
         {
             console_force_unlock();
+            spin_lock(&panic_lock);
             printk("Watchdog timer detects that CPU%d is stuck!\n",
                    smp_processor_id());
-            spin_lock(&panic_lock);
             show_execution_state(regs);
             debugtrace_dump();
+            spin_unlock(&panic_lock);
             atomic_inc(&all_panic);
             {
                 int cpu;
@@ -441,7 +441,6 @@
                     do_nmi_trigger_cpu(cpu);
                 }
             }
-            spin_unlock(&panic_lock);
             while(1);
             //fatal_trap(TRAP_nmi, regs);
         }
diff -r 23d34c3ba4b7 xen/common/Makefile
--- a/xen/common/Makefile	Mon Nov 30 16:13:01 2009 -0600
+++ b/xen/common/Makefile	Mon Dec 07 16:59:53 2009 +0000
@@ -13,6 +13,7 @@
 obj-y += page_alloc.o
 obj-y += rangeset.o
 obj-y += sched_credit.o
+obj-y += sched_credit2.o
 obj-y += sched_sedf.o
 obj-y += schedule.o
 obj-y += shutdown.o
diff -r 23d34c3ba4b7 xen/common/sched_credit2.c
--- /dev/null	Thu Jan 01 00:00:00 1970 +0000
+++ b/xen/common/sched_credit2.c	Mon Dec 07 16:59:53 2009 +0000
@@ -0,0 +1,992 @@
+
+/****************************************************************************
+ * (C) 2009 - George Dunlap - Citrix Systems R&D UK, Ltd
+ ****************************************************************************
+ *
+ *        File: common/csched_credit2.c
+ *      Author: George Dunlap
+ *
+ * Description: Credit-based SMP CPU scheduler
+ * Based on an earlier verson by Emmanuel Ackaouy.
+ */
+
+#include <xen/config.h>
+#include <xen/init.h>
+#include <xen/lib.h>
+#include <xen/sched.h>
+#include <xen/domain.h>
+#include <xen/delay.h>
+#include <xen/event.h>
+#include <xen/time.h>
+#include <xen/perfc.h>
+#include <xen/sched-if.h>
+#include <xen/softirq.h>
+#include <asm/atomic.h>
+#include <xen/errno.h>
+#include <xen/trace.h>
+
+#if __i386__
+#define PRI_stime "lld"
+#else
+#define PRI_stime "ld"
+#endif
+
+#define d2printk(x...)
+//#define d2printk printk
+
+#define TRC_CSCHED2_TICK        TRC_SCHED_CLASS + 1
+#define TRC_CSCHED2_RUNQ_POS    TRC_SCHED_CLASS + 2
+#define TRC_CSCHED2_CREDIT_BURN TRC_SCHED_CLASS + 3
+#define TRC_CSCHED2_CREDIT_ADD  TRC_SCHED_CLASS + 4
+#define TRC_CSCHED2_TICKLE_CHECK TRC_SCHED_CLASS + 5
+
+/* 
+ * Design:
+ *
+ * VMs "burn" credits based on their weight; higher weight means credits burn
+ * more slowly.
+ * 
+ * vcpus are inserted into the runqueue by credit order.
+ *
+ * Credits are "reset" when the next vcpu in the runqueue is less than or equal to zero.  At that
+ * point, everyone's credits are "clipped" to a small value, and a fixed credit is added to everyone.
+ *
+ * The plan is for all cores that share an L2 will share the same runqueue.  At the moment, there is
+ * one global runqueue for all cores.
+ */
+
+/*
+ * Basic constants
+ */
+#define CSCHED_DEFAULT_WEIGHT       256
+#define CSCHED_MIN_TIMER            MICROSECS(500)
+#define CSCHED_CARRYOVER_MAX        CSCHED_MIN_TIMER
+#define CSCHED_CREDIT_RESET         0
+#define CSCHED_CREDIT_INIT          MILLISECS(10)
+#define CSCHED_MAX_TIMER            MILLISECS(2)
+
+#define CSCHED_IDLE_CREDIT                 (-(1<<30))
+
+/*
+ * Flags
+ */
+// Placeholder template for when we need real flags
+//#define __CSFLAG_foo 1
+//#define CSFLAG_foo (1<<__CSFLAG_foo)
+
+
+/*
+ * Useful macros
+ */
+#define CSCHED_PCPU(_c)                                                 \
+    ((struct csched_pcpu *)per_cpu(schedule_data, _c).sched_priv)
+#define CSCHED_VCPU(_vcpu)  ((struct csched_vcpu *) (_vcpu)->sched_priv)
+#define CSCHED_DOM(_dom)    ((struct csched_dom *) (_dom)->sched_priv)
+//#define RUNQ(_cpu)          (&(CSCHED_GROUP(_cpu)->runq))
+#define RUNQ(_cpu)          (&csched_priv.runq)
+
+/*
+ * System-wide private data
+ */
+struct csched_private {
+    spinlock_t lock;
+    struct list_head sdom;
+    struct list_head svc;  /* List of all vcpus */
+    uint32_t ncpus;
+
+    /* Per-runqueue info */
+    struct list_head runq; /* Global runqueue */
+    int max_weight;
+};
+
+struct csched_pcpu {
+    int _dummy;
+};
+
+/*
+ * Virtual CPU
+ */
+struct csched_vcpu {
+    struct list_head global_elem; /* On the global vcpu list */
+    struct list_head sdom_elem;   /* On the domain vcpu list */
+    struct list_head runq_elem;   /* On the runqueue         */
+
+    /* Up-pointers */
+    struct csched_dom *sdom;
+    struct vcpu *vcpu;
+
+    int weight;
+
+    int credit;
+    s_time_t start_time; /* When we were scheduled (used for credit) */
+    unsigned flags;      /* 16 bits doesn't seem to play well with clear_bit() */
+
+};
+
+/*
+ * Domain
+ */
+struct csched_dom {
+    struct list_head vcpu;
+    struct list_head sdom_elem;
+    struct domain *dom;
+    uint16_t weight;
+    uint16_t nr_vcpus;
+};
+
+
+/*
+ * Global variables
+ */
+static struct csched_private csched_priv;
+
+/*
+ * Time-to-credit, credit-to-time.
+ * FIXME: Do pre-calculated division?
+ */
+static s_time_t t2c(s_time_t time, struct csched_vcpu *svc)
+{
+    return time * csched_priv.max_weight / svc->weight;
+}
+
+static s_time_t c2t(s_time_t credit, struct csched_vcpu *svc)
+{
+    return credit * svc->weight / csched_priv.max_weight;
+}
+
+/*
+ * Runqueue related code
+ */
+
+static /*inline*/ int
+__vcpu_on_runq(struct csched_vcpu *svc)
+{
+    return !list_empty(&svc->runq_elem);
+}
+
+static /*inline*/ struct csched_vcpu *
+__runq_elem(struct list_head *elem)
+{
+    return list_entry(elem, struct csched_vcpu, runq_elem);
+}
+
+static int
+__runq_insert(struct list_head *runq, struct csched_vcpu *svc)
+{
+    struct list_head *iter;
+    int pos = 0;
+
+    d2printk("rqi d%dv%d\n",
+           svc->vcpu->domain->domain_id,
+           svc->vcpu->vcpu_id);
+
+    list_for_each( iter, runq )
+    {
+        struct csched_vcpu * iter_svc = __runq_elem(iter);
+
+        if ( svc->credit > iter_svc->credit )
+        {
+            d2printk(" p%d d%dv%d\n",
+                   pos,
+                   iter_svc->vcpu->domain->domain_id,
+                   iter_svc->vcpu->vcpu_id);
+            break;
+        }
+        pos++;
+    }
+
+    list_add_tail(&svc->runq_elem, iter);
+
+    return pos;
+}
+
+static void
+runq_insert(unsigned int cpu, struct csched_vcpu *svc)
+{
+    struct list_head * runq = RUNQ(cpu);
+    int pos = 0;
+
+    /* FIXME: Runqueue per L2 */
+    ASSERT( spin_is_locked(&csched_priv.lock) ); 
+
+    BUG_ON( __vcpu_on_runq(svc) );
+    /* FIXME: Check runqueue handles this cpu*/
+    //BUG_ON( cpu != svc->vcpu->processor ); 
+
+    pos = __runq_insert(runq, svc);
+
+    {
+        struct {
+            unsigned dom:16,vcpu:16;
+            unsigned pos;
+        } d;
+        d.dom = svc->vcpu->domain->domain_id;
+        d.vcpu = svc->vcpu->vcpu_id;
+        d.pos = pos;
+        trace_var(TRC_CSCHED2_RUNQ_POS, 1,
+                  sizeof(d),
+                  (unsigned char *)&d);
+    }
+
+    return;
+}
+
+static inline void
+__runq_remove(struct csched_vcpu *svc)
+{
+    BUG_ON( !__vcpu_on_runq(svc) );
+    list_del_init(&svc->runq_elem);
+}
+
+void burn_credits(struct csched_vcpu *, s_time_t);
+
+/* Check to see if the item on the runqueue is higher priority than what's
+ * currently running; if so, wake up the processor */
+static /*inline*/ void
+runq_tickle(unsigned int cpu, struct csched_vcpu *new, s_time_t now)
+{
+    int i, ipid=-1;
+    s_time_t lowest=(1<<30);
+
+    d2printk("rqt d%dv%d cd%dv%d\n",
+             new->vcpu->domain->domain_id,
+             new->vcpu->vcpu_id,
+             current->domain->domain_id,
+             current->vcpu_id);
+
+    /* Find the cpu in this queue group that has the lowest credits */
+    /* FIXME: separate runqueues */
+    for_each_online_cpu ( i )
+    {
+        struct csched_vcpu * const cur =
+            CSCHED_VCPU(per_cpu(schedule_data, i).curr);
+
+        /* FIXME: keep track of idlers, chose from the mask */
+        if ( is_idle_vcpu(cur->vcpu) )
+        {
+            ipid = i;
+            lowest = CSCHED_IDLE_CREDIT;
+            break;
+        }
+        else
+        {
+            /* Update credits for current to see if we want to preempt */
+            burn_credits(cur, now);
+
+            if ( cur->credit < lowest )
+            {
+                ipid = i;
+                lowest = cur->credit;
+            }
+
+            /* TRACE */ {
+                struct {
+                    unsigned dom:16,vcpu:16;
+                    unsigned credit;
+                } d;
+                d.dom = cur->vcpu->domain->domain_id;
+                d.vcpu = cur->vcpu->vcpu_id;
+                d.credit = cur->credit;
+                trace_var(TRC_CSCHED2_TICKLE_CHECK, 1,
+                          sizeof(d),
+                          (unsigned char *)&d);
+            }
+        }
+    }
+
+    if ( ipid != -1 )
+    {
+        int cdiff = lowest - new->credit;
+
+        if ( lowest == CSCHED_IDLE_CREDIT || cdiff < 0 ) {
+            d2printk("si %d\n", ipid);
+            cpu_raise_softirq(ipid, SCHEDULE_SOFTIRQ);
+        }
+        else
+            /* FIXME: Wake up later? */;
+    }
+}
+
+/*
+ * Credit-related code
+ */
+static void reset_credit(int cpu, s_time_t now)
+{
+    struct list_head *iter;
+
+    list_for_each( iter, &csched_priv.svc )
+    {
+        struct csched_vcpu * svc = list_entry(iter, struct csched_vcpu, global_elem);
+        s_time_t cmax;
+
+        BUG_ON( is_idle_vcpu(svc->vcpu) );
+
+        /* Maximum amount of credit that can be carried over */
+        cmax = CSCHED_CARRYOVER_MAX;
+
+        if ( svc->credit > cmax )
+            svc->credit = cmax;
+        svc->credit += CSCHED_CREDIT_INIT;  /* Find a better name */
+        svc->start_time = now;
+
+        /* Trace credit */
+    }
+
+    /* No need to resort runqueue, as everyone's order should be the same. */
+}
+
+void burn_credits(struct csched_vcpu *svc, s_time_t now)
+{
+    s_time_t delta;
+
+    /* Assert svc is current */
+    ASSERT(svc==CSCHED_VCPU(per_cpu(schedule_data, svc->vcpu->processor).curr));
+
+    if ( is_idle_vcpu(svc->vcpu) )
+    {
+        BUG_ON(svc->credit != CSCHED_IDLE_CREDIT);
+        return;
+    }
+
+    delta = now - svc->start_time;
+
+    if ( delta > 0 ) {
+        /* This will round down; should we consider rounding up...? */
+        svc->credit -= t2c(delta, svc);
+        svc->start_time = now;
+
+        d2printk("b d%dv%d c%d\n",
+                 svc->vcpu->domain->domain_id,
+                 svc->vcpu->vcpu_id,
+                 svc->credit);
+    } else {
+        d2printk("%s: Time went backwards? now %"PRI_stime" start %"PRI_stime"\n",
+               __func__, now, svc->start_time);
+    }
+    
+    /* TRACE */
+    {
+        struct {
+            unsigned dom:16,vcpu:16;
+            unsigned credit;
+            int delta;
+        } d;
+        d.dom = svc->vcpu->domain->domain_id;
+        d.vcpu = svc->vcpu->vcpu_id;
+        d.credit = svc->credit;
+        d.delta = delta;
+        trace_var(TRC_CSCHED2_CREDIT_BURN, 1,
+                  sizeof(d),
+                  (unsigned char *)&d);
+    }
+}
+
+/* Find the domain with the highest weight. */
+void update_max_weight(int new_weight, int old_weight)
+{
+    if ( new_weight > csched_priv.max_weight )
+    {
+        csched_priv.max_weight = new_weight;
+        printk("%s: Max weight %d\n", __func__, csched_priv.max_weight);
+    }
+    else if ( old_weight == csched_priv.max_weight )
+    {
+        struct list_head *iter;
+        int max_weight = 1;
+        
+        list_for_each( iter, &csched_priv.sdom )
+        {
+            struct csched_dom * sdom = list_entry(iter, struct csched_dom, sdom_elem);
+            
+            if ( sdom->weight > max_weight )
+                max_weight = sdom->weight;
+        }
+        
+        csched_priv.max_weight = max_weight;
+        printk("%s: Max weight %d\n", __func__, csched_priv.max_weight);
+    }
+}
+
+/*
+ * Initialization code
+ */
+static int
+csched_pcpu_init(int cpu)
+{
+    unsigned long flags;
+    struct csched_pcpu *spc;
+
+    /* Allocate per-PCPU info */
+    spc = xmalloc(struct csched_pcpu);
+    if ( spc == NULL )
+        return -1;
+
+    spin_lock_irqsave(&csched_priv.lock, flags);
+
+    /* Initialize/update system-wide config */
+    per_cpu(schedule_data, cpu).sched_priv = spc;
+
+    csched_priv.ncpus++;
+
+    /* Start off idling... */
+    BUG_ON(!is_idle_vcpu(per_cpu(schedule_data, cpu).curr));
+
+    spin_unlock_irqrestore(&csched_priv.lock, flags);
+
+    return 0;
+}
+
+#ifndef NDEBUG
+static /*inline*/ void
+__csched_vcpu_check(struct vcpu *vc)
+{
+    struct csched_vcpu * const svc = CSCHED_VCPU(vc);
+    struct csched_dom * const sdom = svc->sdom;
+
+    BUG_ON( svc->vcpu != vc );
+    BUG_ON( sdom != CSCHED_DOM(vc->domain) );
+    if ( sdom )
+    {
+        BUG_ON( is_idle_vcpu(vc) );
+        BUG_ON( sdom->dom != vc->domain );
+    }
+    else
+    {
+        BUG_ON( !is_idle_vcpu(vc) );
+    }
+}
+#define CSCHED_VCPU_CHECK(_vc)  (__csched_vcpu_check(_vc))
+#else
+#define CSCHED_VCPU_CHECK(_vc)
+#endif
+
+static int
+csched_vcpu_init(struct vcpu *vc)
+{
+    struct domain * const dom = vc->domain;
+    struct csched_dom *sdom = CSCHED_DOM(dom);
+    struct csched_vcpu *svc;
+
+    printk("%s: Initializing d%dv%d\n",
+           __func__, dom->domain_id, vc->vcpu_id);
+
+    /* Allocate per-VCPU info */
+    svc = xmalloc(struct csched_vcpu);
+    if ( svc == NULL )
+        return -1;
+
+    INIT_LIST_HEAD(&svc->global_elem);
+    INIT_LIST_HEAD(&svc->sdom_elem);
+    INIT_LIST_HEAD(&svc->runq_elem);
+
+    svc->sdom = sdom;
+    svc->vcpu = vc;
+    svc->flags = 0U;
+    vc->sched_priv = svc;
+
+    if ( ! is_idle_vcpu(vc) )
+    {
+        BUG_ON( sdom == NULL );
+
+        svc->credit = CSCHED_CREDIT_INIT;
+        svc->weight = sdom->weight;
+
+        list_add_tail(&svc->sdom_elem, &sdom->vcpu);
+        list_add_tail(&svc->global_elem, &csched_priv.svc);
+        sdom->nr_vcpus++;
+    } 
+    else
+    {
+        BUG_ON( sdom != NULL );
+        svc->credit = CSCHED_IDLE_CREDIT;
+        svc->weight = 0;
+    }
+
+    /* Allocate per-PCPU info */
+    if ( unlikely(!CSCHED_PCPU(vc->processor)) )
+    {
+        if ( csched_pcpu_init(vc->processor) != 0 )
+            return -1;
+    }
+
+    CSCHED_VCPU_CHECK(vc);
+    return 0;
+}
+
+static void
+csched_vcpu_destroy(struct vcpu *vc)
+{
+    struct csched_vcpu * const svc = CSCHED_VCPU(vc);
+    struct csched_dom * const sdom = svc->sdom;
+    unsigned long flags;
+
+    BUG_ON( sdom == NULL );
+    BUG_ON( !list_empty(&svc->runq_elem) );
+
+    spin_lock_irqsave(&csched_priv.lock, flags);
+
+    /* Remove from sdom list */
+    list_del_init(&svc->global_elem);
+    list_del_init(&svc->sdom_elem);
+
+    sdom->nr_vcpus--;
+
+    spin_unlock_irqrestore(&csched_priv.lock, flags);
+
+    xfree(svc);
+}
+
+static void
+csched_vcpu_sleep(struct vcpu *vc)
+{
+    struct csched_vcpu * const svc = CSCHED_VCPU(vc);
+
+    BUG_ON( is_idle_vcpu(vc) );
+
+    if ( per_cpu(schedule_data, vc->processor).curr == vc )
+        cpu_raise_softirq(vc->processor, SCHEDULE_SOFTIRQ);
+    else if ( __vcpu_on_runq(svc) )
+        __runq_remove(svc);
+}
+
+static void
+csched_vcpu_wake(struct vcpu *vc)
+{
+    struct csched_vcpu * const svc = CSCHED_VCPU(vc);
+    const unsigned int cpu = vc->processor;
+    s_time_t now = 0;
+    int flags;
+
+    d2printk("w d%dv%d\n", vc->domain->domain_id, vc->vcpu_id);
+
+    BUG_ON( is_idle_vcpu(vc) );
+
+    /* FIXME: Runqueue per L2 */
+    spin_lock_irqsave(&csched_priv.lock, flags);
+
+
+    /* Make sure svc priority mod happens before runq check */
+    if ( unlikely(per_cpu(schedule_data, cpu).curr == vc) )
+    {
+        goto out;
+    }
+    if ( unlikely(__vcpu_on_runq(svc)) )
+    {
+        /* If we've boosted someone that's already on a runqueue, prioritize
+         * it and inform the cpu in question. */
+        goto out;
+    }
+
+    now = NOW();
+
+    /* Put the VCPU on the runq */
+    runq_insert(cpu, svc);
+    runq_tickle(cpu, svc, now);
+
+out:
+    spin_unlock_irqrestore(&csched_priv.lock, flags); 
+    d2printk("w-\n");
+    return;
+}
+
+static int
+csched_cpu_pick(struct vcpu *vc)
+{
+    /* FIXME: Chose a schedule group based on load */
+    return 0;
+}
+
+static int
+csched_dom_cntl(
+    struct domain *d,
+    struct xen_domctl_scheduler_op *op)
+{
+    struct csched_dom * const sdom = CSCHED_DOM(d);
+    unsigned long flags;
+
+    if ( op->cmd == XEN_DOMCTL_SCHEDOP_getinfo )
+    {
+        op->u.credit2.weight = sdom->weight;
+    }
+    else
+    {
+        ASSERT(op->cmd == XEN_DOMCTL_SCHEDOP_putinfo);
+
+        if ( op->u.credit2.weight != 0 )
+        {
+            struct list_head *iter;
+            int old_weight;
+
+            spin_lock_irqsave(&csched_priv.lock, flags);
+
+            old_weight = sdom->weight;
+
+            sdom->weight = op->u.credit2.weight;
+
+            /* Update max weight */
+            update_max_weight(sdom->weight, old_weight);
+
+            /* Update weights for vcpus */
+            list_for_each ( iter, &sdom->vcpu )
+            {
+                struct csched_vcpu *svc = list_entry(iter, struct csched_vcpu, sdom_elem);
+
+                svc->weight = sdom->weight;
+            }
+
+            spin_unlock_irqrestore(&csched_priv.lock, flags);
+        }
+    }
+
+    return 0;
+}
+
+static int
+csched_dom_init(struct domain *dom)
+{
+    struct csched_dom *sdom;
+    int flags;
+
+    printk("%s: Initializing domain %d\n", __func__, dom->domain_id);
+
+    if ( is_idle_domain(dom) )
+        return 0;
+
+    sdom = xmalloc(struct csched_dom);
+    if ( sdom == NULL )
+        return -ENOMEM;
+
+    /* Initialize credit and weight */
+    INIT_LIST_HEAD(&sdom->vcpu);
+    INIT_LIST_HEAD(&sdom->sdom_elem);
+    sdom->dom = dom;
+    sdom->weight = CSCHED_DEFAULT_WEIGHT;
+    sdom->nr_vcpus = 0;
+
+    dom->sched_priv = sdom;
+
+    spin_lock_irqsave(&csched_priv.lock, flags);
+
+    update_max_weight(sdom->weight, 0);
+
+    list_add_tail(&sdom->sdom_elem, &csched_priv.sdom);
+
+    spin_unlock_irqrestore(&csched_priv.lock, flags);
+
+    return 0;
+}
+
+static void
+csched_dom_destroy(struct domain *dom)
+{
+    struct csched_dom *sdom = CSCHED_DOM(dom);
+    int flags;
+
+    BUG_ON(!list_empty(&sdom->vcpu));
+
+    spin_lock_irqsave(&csched_priv.lock, flags);
+
+    list_del_init(&sdom->sdom_elem);
+
+    update_max_weight(0, sdom->weight);
+
+    spin_unlock_irqrestore(&csched_priv.lock, flags);
+    
+    xfree(CSCHED_DOM(dom));
+}
+
+#if 0
+static void csched_load_balance(int cpu)
+{
+    /* FIXME: Do something. */
+}
+#endif
+
+/* How long should we let this vcpu run for? */
+static s_time_t
+csched_runtime(int cpu, struct csched_vcpu *snext)
+{
+    s_time_t time = CSCHED_MAX_TIMER;
+    struct list_head *runq = RUNQ(cpu);
+
+    if ( is_idle_vcpu(snext->vcpu) )
+        return CSCHED_MAX_TIMER;
+
+    /* Basic time */
+    time = c2t(snext->credit, snext);
+
+    /* Next guy on runqueue */
+    if ( ! list_empty(runq) )
+    {
+        struct csched_vcpu *svc = __runq_elem(runq->next);
+        s_time_t ntime;
+
+        if ( ! is_idle_vcpu(svc->vcpu) )
+        {
+            ntime = c2t(snext->credit - svc->credit, snext);
+
+            if ( time > ntime )
+                time = ntime;
+        }
+    }
+
+    /* Check limits */
+    if ( time < CSCHED_MIN_TIMER )
+        time = CSCHED_MIN_TIMER;
+    else if ( time > CSCHED_MAX_TIMER )
+        time = CSCHED_MAX_TIMER;
+
+    return time;
+}
+
+void __dump_execstate(void *unused);
+
+/*
+ * This function is in the critical path. It is designed to be simple and
+ * fast for the common case.
+ */
+static struct task_slice
+csched_schedule(s_time_t now)
+{
+    const int cpu = smp_processor_id();
+    struct list_head * const runq = RUNQ(cpu);
+    //struct csched_pcpu *spc = CSCHED_PCPU(cpu);
+    struct csched_vcpu * const scurr = CSCHED_VCPU(current);
+    struct csched_vcpu *snext;
+    struct task_slice ret;
+    int flags;
+
+    CSCHED_VCPU_CHECK(current);
+
+    d2printk("sc p%d c d%dv%d now %"PRI_stime"\n",
+             cpu,
+             scurr->vcpu->domain->domain_id,
+             scurr->vcpu->vcpu_id,
+             now);
+
+
+    /* FIXME: Runqueue per L2 */
+    spin_lock_irqsave(&csched_priv.lock, flags);
+
+    /* Update credits */
+    burn_credits(scurr, now);
+
+    /*
+     * Select next runnable local VCPU (ie top of local runq)
+     * Insert will cause credits to be updated. 
+     */
+    if ( vcpu_runnable(current) )
+        runq_insert(cpu, scurr);
+    else
+        BUG_ON( is_idle_vcpu(current) || list_empty(runq) );
+
+    snext = __runq_elem(runq->next);
+
+    if ( snext->credit <= CSCHED_CREDIT_RESET && !is_idle_vcpu(snext->vcpu) )
+    {
+        /* If the next item has <= 0 credits, update credits and resort */
+        reset_credit(cpu, now);
+    }
+    
+    __runq_remove(snext);
+
+    /* HACK.  Multiple cpus are sharing a runqueue; but due to the way
+     * things are set up, it's possible for a vcpu to be scheduled out on one
+     * cpu and put on the runqueue, and taken off by another cpu, before the first
+     * cpu has actually completed the context switch (indicated by is_running).
+     * 
+     * So in general we just wait for is_running to be false, always checking
+     * to see if it should still be put on the runqueue (i.e., it may be
+     * paused).
+     *
+     * Even so, occasionally we get into a deadlock situation.  I haven't found
+     * out who the other "hold-and-wait"-er is because they seem to have
+     * irqs disabled.  In any case, if we spin for 65K times, we assume there's
+     * a deadlock and put the vcpu on the tail of the runqueue (yes, behind the
+     * idle vcpus).  It will be re-ordered at most 10ms later when we do a
+     * runqueue sort.
+     *
+     * Other hold-and-waiters:
+     * + flush_tlb_mask(), which will try to get a sync_lazy_execstate.
+     * + vcpu_wake(): if an interrupt that causes a wake happens between unlock in schedule
+     *   and irq_disable() in context_switch(), it tries to grab the vcpu's cpu's schedule lock
+     *   (which we're holding).
+     **/
+    if ( snext != scurr && snext->vcpu->is_running )
+    {
+        int count = 0;
+        do {
+            BUG_ON(count < 0);
+            count++;
+
+            if ( (count & 0xffff) == 0 ) {
+                printk("p%d d%dv%d running on p%d, passed %d iterations!\n",
+                         cpu, snext->vcpu->domain->domain_id,
+                         snext->vcpu->vcpu_id,
+                         snext->vcpu->processor,
+                         count);
+                set_bit(_VPF_deadlock, &snext->vcpu->pause_flags);
+                BUG_ON( vcpu_runnable(snext->vcpu) );
+
+            } else if ( vcpu_runnable(snext->vcpu) )
+                runq_insert(cpu, snext);
+
+            BUG_ON(list_empty(runq));
+
+            snext = __runq_elem(runq->next);
+            __runq_remove(snext);
+        } while ( snext != scurr && snext->vcpu->is_running );
+        //printk("done\n");
+    }
+
+    /* FIXME: Think about this some more. */
+    snext->vcpu->processor = cpu;
+
+    spin_unlock_irqrestore(&csched_priv.lock, flags);
+
+#if 0
+    /*
+     * Update idlers mask if necessary. When we're idling, other CPUs
+     * will tickle us when they get extra work.
+     */
+    if ( is_idle_vcpu(snext->vcpu) )
+    {
+        if ( !cpu_isset(cpu, csched_priv.idlers) )
+            cpu_set(cpu, csched_priv.idlers);
+    }
+    else if ( cpu_isset(cpu, csched_priv.idlers) )
+    {
+        cpu_clear(cpu, csched_priv.idlers);
+    }
+#endif
+
+    if ( !is_idle_vcpu(snext->vcpu) )
+        snext->start_time = now;
+    /*
+     * Return task to run next...
+     */
+    ret.time = csched_runtime(cpu, snext);
+    ret.task = snext->vcpu;
+
+    CSCHED_VCPU_CHECK(ret.task);
+    return ret;
+}
+
+static void
+csched_dump_vcpu(struct csched_vcpu *svc)
+{
+    printk("[%i.%i] flags=%x cpu=%i",
+            svc->vcpu->domain->domain_id,
+            svc->vcpu->vcpu_id,
+            svc->flags,
+            svc->vcpu->processor);
+
+    printk(" credit=%" PRIi32" [w=%u]", svc->credit, svc->weight);
+
+    printk("\n");
+}
+
+static void
+csched_dump_pcpu(int cpu)
+{
+    struct list_head *runq, *iter;
+    //struct csched_pcpu *spc;
+    struct csched_vcpu *svc;
+    int loop;
+    char cpustr[100];
+
+    //spc = CSCHED_PCPU(cpu);
+    runq = RUNQ(cpu);
+
+    cpumask_scnprintf(cpustr, sizeof(cpustr), per_cpu(cpu_sibling_map,cpu));
+    printk(" sibling=%s, ", cpustr);
+    cpumask_scnprintf(cpustr, sizeof(cpustr), per_cpu(cpu_core_map,cpu));
+    printk("core=%s\n", cpustr);
+
+    /* current VCPU */
+    svc = CSCHED_VCPU(per_cpu(schedule_data, cpu).curr);
+    if ( svc )
+    {
+        printk("\trun: ");
+        csched_dump_vcpu(svc);
+    }
+
+    loop = 0;
+    list_for_each( iter, runq )
+    {
+        svc = __runq_elem(iter);
+        if ( svc )
+        {
+            printk("\t%3d: ", ++loop);
+            csched_dump_vcpu(svc);
+        }
+    }
+}
+
+static void
+csched_dump(void)
+{
+    struct list_head *iter_sdom, *iter_svc;
+    int loop;
+
+    printk("info:\n"
+           "\tncpus              = %u\n"
+           "\tdefault-weight     = %d\n",
+           csched_priv.ncpus,
+           CSCHED_DEFAULT_WEIGHT);
+
+    printk("active vcpus:\n");
+    loop = 0;
+    list_for_each( iter_sdom, &csched_priv.sdom )
+    {
+        struct csched_dom *sdom;
+        sdom = list_entry(iter_sdom, struct csched_dom, sdom_elem);
+
+        list_for_each( iter_svc, &sdom->vcpu )
+        {
+            struct csched_vcpu *svc;
+            svc = list_entry(iter_svc, struct csched_vcpu, sdom_elem);
+
+            printk("\t%3d: ", ++loop);
+            csched_dump_vcpu(svc);
+        }
+    }
+}
+
+static void
+csched_init(void)
+{
+    spin_lock_init(&csched_priv.lock);
+    INIT_LIST_HEAD(&csched_priv.sdom);
+    INIT_LIST_HEAD(&csched_priv.svc);
+
+    csched_priv.ncpus = 0;
+
+    /* FIXME: Runqueue per l2 */
+    csched_priv.max_weight = 1;
+    INIT_LIST_HEAD(&csched_priv.runq);
+}
+
+struct scheduler sched_credit2_def = {
+    .name           = "SMP Credit Scheduler rev2",
+    .opt_name       = "credit2",
+    .sched_id       = XEN_SCHEDULER_CREDIT2,
+
+    .init_domain    = csched_dom_init,
+    .destroy_domain = csched_dom_destroy,
+
+    .init_vcpu      = csched_vcpu_init,
+    .destroy_vcpu   = csched_vcpu_destroy,
+
+    .sleep          = csched_vcpu_sleep,
+    .wake           = csched_vcpu_wake,
+
+    .adjust         = csched_dom_cntl,
+
+    .pick_cpu       = csched_cpu_pick,
+    .do_schedule    = csched_schedule,
+
+    .dump_cpu_state = csched_dump_pcpu,
+    .dump_settings  = csched_dump,
+    .init           = csched_init,
+};
diff -r 23d34c3ba4b7 xen/common/schedule.c
--- a/xen/common/schedule.c	Mon Nov 30 16:13:01 2009 -0600
+++ b/xen/common/schedule.c	Mon Dec 07 16:59:53 2009 +0000
@@ -58,9 +58,11 @@
 
 extern const struct scheduler sched_sedf_def;
 extern const struct scheduler sched_credit_def;
+extern const struct scheduler sched_credit2_def;
 static const struct scheduler *__initdata schedulers[] = {
     &sched_sedf_def,
     &sched_credit_def,
+    &sched_credit2_def,
     NULL
 };
 
diff -r 23d34c3ba4b7 xen/include/public/domctl.h
--- a/xen/include/public/domctl.h	Mon Nov 30 16:13:01 2009 -0600
+++ b/xen/include/public/domctl.h	Mon Dec 07 16:59:53 2009 +0000
@@ -297,6 +297,7 @@
 /* Scheduler types. */
 #define XEN_SCHEDULER_SEDF     4
 #define XEN_SCHEDULER_CREDIT   5
+#define XEN_SCHEDULER_CREDIT2  6
 /* Set or get info? */
 #define XEN_DOMCTL_SCHEDOP_putinfo 0
 #define XEN_DOMCTL_SCHEDOP_getinfo 1
@@ -315,6 +316,9 @@
             uint16_t weight;
             uint16_t cap;
         } credit;
+        struct xen_domctl_sched_credit2 {
+            uint16_t weight;
+        } credit2;
     } u;
 };
 typedef struct xen_domctl_scheduler_op xen_domctl_scheduler_op_t;
diff -r 23d34c3ba4b7 xen/include/public/trace.h
--- a/xen/include/public/trace.h	Mon Nov 30 16:13:01 2009 -0600
+++ b/xen/include/public/trace.h	Mon Dec 07 16:59:53 2009 +0000
@@ -53,6 +53,7 @@
 #define TRC_HVM_HANDLER   0x00082000   /* various HVM handlers      */
 
 #define TRC_SCHED_MIN       0x00021000   /* Just runstate changes */
+#define TRC_SCHED_CLASS     0x00022000   /* Scheduler-specific    */
 #define TRC_SCHED_VERBOSE   0x00028000   /* More inclusive scheduling */
 
 /* Trace events per class */
diff -r 23d34c3ba4b7 xen/include/xen/sched.h
--- a/xen/include/xen/sched.h	Mon Nov 30 16:13:01 2009 -0600
+++ b/xen/include/xen/sched.h	Mon Dec 07 16:59:53 2009 +0000
@@ -530,6 +530,8 @@
  /* VCPU affinity has changed: migrating to a new CPU. */
 #define _VPF_migrating       3
 #define VPF_migrating        (1UL<<_VPF_migrating)
+#define _VPF_deadlock        4
+#define VPF_deadlock        (1UL<<_VPF_migrating)
 
 static inline int vcpu_runnable(struct vcpu *v)
 {

[-- Attachment #3: credit2-tools.diff --]
[-- Type: text/x-diff, Size: 15599 bytes --]

diff -r d2f0843a38e4 tools/libxc/Makefile
--- a/tools/libxc/Makefile	Wed Oct 28 14:21:37 2009 +0000
+++ b/tools/libxc/Makefile	Wed Oct 28 14:42:17 2009 +0000
@@ -17,6 +17,7 @@
 CTRL_SRCS-y       += xc_private.c
 CTRL_SRCS-y       += xc_sedf.c
 CTRL_SRCS-y       += xc_csched.c
+CTRL_SRCS-y       += xc_csched2.c
 CTRL_SRCS-y       += xc_tbuf.c
 CTRL_SRCS-y       += xc_pm.c
 CTRL_SRCS-y       += xc_cpu_hotplug.c
diff -r d2f0843a38e4 tools/libxc/xc_csched2.c
--- /dev/null	Thu Jan 01 00:00:00 1970 +0000
+++ b/tools/libxc/xc_csched2.c	Wed Oct 28 14:42:17 2009 +0000
@@ -0,0 +1,50 @@
+/****************************************************************************
+ * (C) 2006 - Emmanuel Ackaouy - XenSource Inc.
+ ****************************************************************************
+ *
+ *        File: xc_csched.c
+ *      Author: Emmanuel Ackaouy
+ *
+ * Description: XC Interface to the credit scheduler
+ *
+ */
+#include "xc_private.h"
+
+
+int
+xc_sched_credit2_domain_set(
+    int xc_handle,
+    uint32_t domid,
+    struct xen_domctl_sched_credit2 *sdom)
+{
+    DECLARE_DOMCTL;
+
+    domctl.cmd = XEN_DOMCTL_scheduler_op;
+    domctl.domain = (domid_t) domid;
+    domctl.u.scheduler_op.sched_id = XEN_SCHEDULER_CREDIT2;
+    domctl.u.scheduler_op.cmd = XEN_DOMCTL_SCHEDOP_putinfo;
+    domctl.u.scheduler_op.u.credit2 = *sdom;
+
+    return do_domctl(xc_handle, &domctl);
+}
+
+int
+xc_sched_credit2_domain_get(
+    int xc_handle,
+    uint32_t domid,
+    struct xen_domctl_sched_credit2 *sdom)
+{
+    DECLARE_DOMCTL;
+    int err;
+
+    domctl.cmd = XEN_DOMCTL_scheduler_op;
+    domctl.domain = (domid_t) domid;
+    domctl.u.scheduler_op.sched_id = XEN_SCHEDULER_CREDIT2;
+    domctl.u.scheduler_op.cmd = XEN_DOMCTL_SCHEDOP_getinfo;
+
+    err = do_domctl(xc_handle, &domctl);
+    if ( err == 0 )
+        *sdom = domctl.u.scheduler_op.u.credit2;
+
+    return err;
+}
diff -r d2f0843a38e4 tools/libxc/xenctrl.h
--- a/tools/libxc/xenctrl.h	Wed Oct 28 14:21:37 2009 +0000
+++ b/tools/libxc/xenctrl.h	Wed Oct 28 14:42:17 2009 +0000
@@ -468,6 +468,14 @@
                                uint32_t domid,
                                struct xen_domctl_sched_credit *sdom);
 
+int xc_sched_credit2_domain_set(int xc_handle,
+                               uint32_t domid,
+                               struct xen_domctl_sched_credit2 *sdom);
+
+int xc_sched_credit2_domain_get(int xc_handle,
+                               uint32_t domid,
+                               struct xen_domctl_sched_credit2 *sdom);
+
 /**
  * This function sends a trigger to a domain.
  *
diff -r d2f0843a38e4 tools/python/xen/lowlevel/xc/xc.c
--- a/tools/python/xen/lowlevel/xc/xc.c	Wed Oct 28 14:21:37 2009 +0000
+++ b/tools/python/xen/lowlevel/xc/xc.c	Wed Oct 28 14:42:17 2009 +0000
@@ -1340,6 +1340,45 @@
                          "cap",     sdom.cap);
 }
 
+static PyObject *pyxc_sched_credit2_domain_set(XcObject *self,
+                                              PyObject *args,
+                                              PyObject *kwds)
+{
+    uint32_t domid;
+    uint16_t weight;
+    static char *kwd_list[] = { "domid", "weight", NULL };
+    static char kwd_type[] = "I|H";
+    struct xen_domctl_sched_credit2 sdom;
+    
+    weight = 0;
+    if( !PyArg_ParseTupleAndKeywords(args, kwds, kwd_type, kwd_list, 
+                                     &domid, &weight) )
+        return NULL;
+
+    sdom.weight = weight;
+
+    if ( xc_sched_credit2_domain_set(self->xc_handle, domid, &sdom) != 0 )
+        return pyxc_error_to_exception();
+
+    Py_INCREF(zero);
+    return zero;
+}
+
+static PyObject *pyxc_sched_credit2_domain_get(XcObject *self, PyObject *args)
+{
+    uint32_t domid;
+    struct xen_domctl_sched_credit2 sdom;
+    
+    if( !PyArg_ParseTuple(args, "I", &domid) )
+        return NULL;
+    
+    if ( xc_sched_credit2_domain_get(self->xc_handle, domid, &sdom) != 0 )
+        return pyxc_error_to_exception();
+
+    return Py_BuildValue("{s:H}",
+                         "weight",  sdom.weight);
+}
+
 static PyObject *pyxc_domain_setmaxmem(XcObject *self, PyObject *args)
 {
     uint32_t dom;
@@ -1871,6 +1910,24 @@
       "Returns:   [dict]\n"
       " weight    [short]: domain's scheduling weight\n"},
 
+    { "sched_credit2_domain_set",
+      (PyCFunction)pyxc_sched_credit2_domain_set,
+      METH_KEYWORDS, "\n"
+      "Set the scheduling parameters for a domain when running with the\n"
+      "SMP credit2 scheduler.\n"
+      " domid     [int]:   domain id to set\n"
+      " weight    [short]: domain's scheduling weight\n"
+      "Returns: [int] 0 on success; -1 on error.\n" },
+
+    { "sched_credit2_domain_get",
+      (PyCFunction)pyxc_sched_credit2_domain_get,
+      METH_VARARGS, "\n"
+      "Get the scheduling parameters for a domain when running with the\n"
+      "SMP credit2 scheduler.\n"
+      " domid     [int]:   domain id to get\n"
+      "Returns:   [dict]\n"
+      " weight    [short]: domain's scheduling weight\n"},
+
     { "evtchn_alloc_unbound", 
       (PyCFunction)pyxc_evtchn_alloc_unbound,
       METH_VARARGS | METH_KEYWORDS, "\n"
@@ -2230,6 +2287,7 @@
     /* Expose some libxc constants to Python */
     PyModule_AddIntConstant(m, "XEN_SCHEDULER_SEDF", XEN_SCHEDULER_SEDF);
     PyModule_AddIntConstant(m, "XEN_SCHEDULER_CREDIT", XEN_SCHEDULER_CREDIT);
+    PyModule_AddIntConstant(m, "XEN_SCHEDULER_CREDIT2", XEN_SCHEDULER_CREDIT2);
 
 }
 
diff -r d2f0843a38e4 tools/python/xen/xend/XendAPI.py
--- a/tools/python/xen/xend/XendAPI.py	Wed Oct 28 14:21:37 2009 +0000
+++ b/tools/python/xen/xend/XendAPI.py	Wed Oct 28 14:42:17 2009 +0000
@@ -1613,8 +1613,7 @@
         if 'weight' in xeninfo.info['vcpus_params'] \
            and 'cap' in xeninfo.info['vcpus_params']:
             weight = xeninfo.info['vcpus_params']['weight']
-            cap = xeninfo.info['vcpus_params']['cap']
-            xendom.domain_sched_credit_set(xeninfo.getDomid(), weight, cap)
+            xendom.domain_sched_credit2_set(xeninfo.getDomid(), weight)
 
     def VM_set_VCPUs_number_live(self, _, vm_ref, num):
         dom = XendDomain.instance().get_vm_by_uuid(vm_ref)
diff -r d2f0843a38e4 tools/python/xen/xend/XendDomain.py
--- a/tools/python/xen/xend/XendDomain.py	Wed Oct 28 14:21:37 2009 +0000
+++ b/tools/python/xen/xend/XendDomain.py	Wed Oct 28 14:42:17 2009 +0000
@@ -1757,6 +1757,60 @@
             log.exception(ex)
             raise XendError(str(ex))
 
+    def domain_sched_credit2_get(self, domid):
+        """Get credit2 scheduler parameters for a domain.
+
+        @param domid: Domain ID or Name
+        @type domid: int or string.
+        @rtype: dict with keys 'weight'
+        @return: credit2 scheduler parameters
+        """
+        dominfo = self.domain_lookup_nr(domid)
+        if not dominfo:
+            raise XendInvalidDomain(str(domid))
+        
+        if dominfo._stateGet() in (DOM_STATE_RUNNING, DOM_STATE_PAUSED):
+            try:
+                return xc.sched_credit2_domain_get(dominfo.getDomid())
+            except Exception, ex:
+                raise XendError(str(ex))
+        else:
+            return {'weight' : dominfo.getWeight()}
+    
+    def domain_sched_credit2_set(self, domid, weight = None):
+        """Set credit2 scheduler parameters for a domain.
+
+        @param domid: Domain ID or Name
+        @type domid: int or string.
+        @type weight: int
+        @rtype: 0
+        """
+        set_weight = False
+        dominfo = self.domain_lookup_nr(domid)
+        if not dominfo:
+            raise XendInvalidDomain(str(domid))
+        try:
+            if weight is None:
+                weight = int(0)
+            elif weight < 1 or weight > 65535:
+                raise XendError("weight is out of range")
+            else:
+                set_weight = True
+
+            assert type(weight) == int
+
+            rc = 0
+            if dominfo._stateGet() in (DOM_STATE_RUNNING, DOM_STATE_PAUSED):
+                rc = xc.sched_credit2_domain_set(dominfo.getDomid(), weight)
+            if rc == 0:
+                if set_weight:
+                    dominfo.setWeight(weight)
+                self.managed_config_save(dominfo)
+            return rc
+        except Exception, ex:
+            log.exception(ex)
+            raise XendError(str(ex))
+
     def domain_maxmem_set(self, domid, mem):
         """Set the memory limit for a domain.
 
diff -r d2f0843a38e4 tools/python/xen/xend/XendDomainInfo.py
--- a/tools/python/xen/xend/XendDomainInfo.py	Wed Oct 28 14:21:37 2009 +0000
+++ b/tools/python/xen/xend/XendDomainInfo.py	Wed Oct 28 14:42:17 2009 +0000
@@ -2640,6 +2640,10 @@
             XendDomain.instance().domain_sched_credit_set(self.getDomid(),
                                                           self.getWeight(),
                                                           self.getCap())
+        elif XendNode.instance().xenschedinfo() == 'credit2':
+            from xen.xend import XendDomain
+            XendDomain.instance().domain_sched_credit2_set(self.getDomid(),
+                                                           self.getWeight())
 
     def _initDomain(self):
         log.debug('XendDomainInfo.initDomain: %s %s',
diff -r d2f0843a38e4 tools/python/xen/xend/XendNode.py
--- a/tools/python/xen/xend/XendNode.py	Wed Oct 28 14:21:37 2009 +0000
+++ b/tools/python/xen/xend/XendNode.py	Wed Oct 28 14:42:17 2009 +0000
@@ -679,6 +679,8 @@
             return 'sedf'
         elif sched_id == xen.lowlevel.xc.XEN_SCHEDULER_CREDIT:
             return 'credit'
+        elif sched_id == xen.lowlevel.xc.XEN_SCHEDULER_CREDIT2:
+            return 'credit2'
         else:
             return 'unknown'
 
@@ -874,6 +876,8 @@
             return 'sedf'
         elif sched_id == xen.lowlevel.xc.XEN_SCHEDULER_CREDIT:
             return 'credit'
+        elif sched_id == xen.lowlevel.xc.XEN_SCHEDULER_CREDIT2:
+            return 'credit2'
         else:
             return 'unknown'
 
diff -r d2f0843a38e4 tools/python/xen/xend/XendVMMetrics.py
--- a/tools/python/xen/xend/XendVMMetrics.py	Wed Oct 28 14:21:37 2009 +0000
+++ b/tools/python/xen/xend/XendVMMetrics.py	Wed Oct 28 14:42:17 2009 +0000
@@ -129,6 +129,7 @@
                 params_live['cpumap%i' % i] = \
                     ",".join(map(str, info['cpumap']))
 
+                # FIXME: credit2??
             params_live.update(xc.sched_credit_domain_get(domid))
             
             return params_live
diff -r d2f0843a38e4 tools/python/xen/xend/server/SrvDomain.py
--- a/tools/python/xen/xend/server/SrvDomain.py	Wed Oct 28 14:21:37 2009 +0000
+++ b/tools/python/xen/xend/server/SrvDomain.py	Wed Oct 28 14:42:17 2009 +0000
@@ -163,6 +163,20 @@
         val = fn(req.args, {'dom': self.dom.getName()})
         return val
 
+    def op_domain_sched_credit2_get(self, _, req):
+        fn = FormFn(self.xd.domain_sched_credit2_get,
+                    [['dom', 'str']])
+        val = fn(req.args, {'dom': self.dom.getName()})
+        return val
+
+
+    def op_domain_sched_credit2_set(self, _, req):
+        fn = FormFn(self.xd.domain_sched_credit2_set,
+                    [['dom', 'str'],
+                     ['weight', 'int']])
+        val = fn(req.args, {'dom': self.dom.getName()})
+        return val
+
     def op_maxmem_set(self, _, req):
         return self.call(self.dom.setMemoryMaximum,
                          [['memory', 'int']],
diff -r d2f0843a38e4 tools/python/xen/xm/main.py
--- a/tools/python/xen/xm/main.py	Wed Oct 28 14:21:37 2009 +0000
+++ b/tools/python/xen/xm/main.py	Wed Oct 28 14:42:17 2009 +0000
@@ -150,6 +150,8 @@
     'sched-sedf'  : ('<Domain> [options]', 'Get/set EDF parameters.'),
     'sched-credit': ('[-d <Domain> [-w[=WEIGHT]|-c[=CAP]]]',
                      'Get/set credit scheduler parameters.'),
+    'sched-credit2': ('[-d <Domain> [-w[=WEIGHT]]',
+                     'Get/set credit2 scheduler parameters.'),
     'sysrq'       : ('<Domain> <letter>', 'Send a sysrq to a domain.'),
     'debug-keys'  : ('<Keys>', 'Send debug keys to Xen.'),
     'trigger'     : ('<Domain> <nmi|reset|init|s3resume|power> [<VCPU>]',
@@ -265,6 +267,10 @@
        ('-w WEIGHT', '--weight=WEIGHT', 'Weight (int)'),
        ('-c CAP',    '--cap=CAP',       'Cap (int)'),
     ),
+    'sched-credit2': (
+       ('-d DOMAIN', '--domain=DOMAIN', 'Domain to modify'),
+       ('-w WEIGHT', '--weight=WEIGHT', 'Weight (int)'),
+    ),
     'list': (
        ('-l', '--long',         'Output all VM details in SXP'),
        ('', '--label',          'Include security labels'),
@@ -406,6 +412,7 @@
     ]
 
 scheduler_commands = [
+    "sched-credit2",
     "sched-credit",
     "sched-sedf",
     ]
@@ -1720,6 +1727,80 @@
             if result != 0:
                 err(str(result))
 
+def xm_sched_credit2(args):
+    """Get/Set options for Credit2 Scheduler."""
+    
+    check_sched_type('credit2')
+
+    try:
+        opts, params = getopt.getopt(args, "d:w:",
+            ["domain=", "weight="])
+    except getopt.GetoptError, opterr:
+        err(opterr)
+        usage('sched-credit2')
+
+    domid = None
+    weight = None
+
+    for o, a in opts:
+        if o in ["-d", "--domain"]:
+            domid = a
+        elif o in ["-w", "--weight"]:
+            weight = int(a)
+
+    doms = filter(lambda x : domid_match(domid, x),
+                  [parse_doms_info(dom)
+                  for dom in getDomains(None, 'all')])
+
+    if weight is None:
+        if domid is not None and doms == []: 
+            err("Domain '%s' does not exist." % domid)
+            usage('sched-credit2')
+        # print header if we aren't setting any parameters
+        print '%-33s %4s %6s' % ('Name','ID','Weight')
+        
+        for d in doms:
+            try:
+                if serverType == SERVER_XEN_API:
+                    info = server.xenapi.VM_metrics.get_VCPUs_params(
+                        server.xenapi.VM.get_metrics(
+                            get_single_vm(d['name'])))
+                else:
+                    info = server.xend.domain.sched_credit2_get(d['name'])
+            except xmlrpclib.Fault:
+                pass
+
+            if 'weight' not in info:
+                # domain does not support sched-credit2?
+                info = {'weight': -1}
+
+            info['weight'] = int(info['weight'])
+            
+            info['name']  = d['name']
+            info['domid'] = str(d['domid'])
+            print( ("%(name)-32s %(domid)5s %(weight)6d") % info)
+    else:
+        if domid is None:
+            # place holder for system-wide scheduler parameters
+            err("No domain given.")
+            usage('sched-credit2')
+
+        if serverType == SERVER_XEN_API:
+            if doms[0]['domid']:
+                server.xenapi.VM.add_to_VCPUs_params_live(
+                    get_single_vm(domid),
+                    "weight",
+                    weight)
+            else:
+                server.xenapi.VM.add_to_VCPUs_params(
+                    get_single_vm(domid),
+                    "weight",
+                    weight)
+        else:
+            result = server.xend.domain.sched_credit2_set(domid, weight)
+            if result != 0:
+                err(str(result))
+
 def xm_info(args):
     arg_check(args, "info", 0, 1)
     
@@ -3298,6 +3379,7 @@
     # scheduler
     "sched-sedf": xm_sched_sedf,
     "sched-credit": xm_sched_credit,
+    "sched-credit2": xm_sched_credit2,
     # block
     "block-attach": xm_block_attach,
     "block-detach": xm_block_detach,

[-- Attachment #4: Type: text/plain, Size: 138 bytes --]

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xensource.com
http://lists.xensource.com/xen-devel

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH] [RFC] Credit2 scheduler prototype
  2009-12-07 17:02 [PATCH] [RFC] Credit2 scheduler prototype George Dunlap
@ 2009-12-07 17:45 ` Keir Fraser
  2009-12-08 14:48   ` George Dunlap
  0 siblings, 1 reply; 11+ messages in thread
From: Keir Fraser @ 2009-12-07 17:45 UTC (permalink / raw)
  To: George Dunlap, xen-devel@lists.xensource.com

On 07/12/2009 17:02, "George Dunlap" <George.Dunlap@eu.citrix.com> wrote:

> Keir (and everyone), I think at this point it would be a good idea to
> start a credit2 development branch in ext/ so we can keep a revision
> history.  Thoughts?

Sounds like a reasonable idea, if you don't think it suitable just to check
into mainline as the non-default scheduler.

 -- Keir

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: Re: [PATCH] [RFC] Credit2 scheduler prototype
  2009-12-07 17:45 ` Keir Fraser
@ 2009-12-08 14:48   ` George Dunlap
  2009-12-08 18:20     ` Keir Fraser
  0 siblings, 1 reply; 11+ messages in thread
From: George Dunlap @ 2009-12-08 14:48 UTC (permalink / raw)
  To: Keir Fraser; +Cc: xen-devel@lists.xensource.com

My main concern is that sharing the runqueue between cores requires
some changes to the core context switch code. The kinks aren't 100%
worked out yet, so there's a risk that there will be an impact on the
correctness of the credit1 scheduler.

If you want to go that route, we should probably talk about the
changes we want to the context switch path first, and check that in as
a separate patch, before checking in the core scheduler code.

Or we could just check it in and sort it out as things go, since this
is -unstable. :-)

Thoughts?

Either way I'll write up an e-mail describing some of the scheduler
path changes I'd like to see.

 -George

On Mon, Dec 7, 2009 at 5:45 PM, Keir Fraser <keir.fraser@eu.citrix.com> wrote:
> On 07/12/2009 17:02, "George Dunlap" <George.Dunlap@eu.citrix.com> wrote:
>
>> Keir (and everyone), I think at this point it would be a good idea to
>> start a credit2 development branch in ext/ so we can keep a revision
>> history.  Thoughts?
>
> Sounds like a reasonable idea, if you don't think it suitable just to check
> into mainline as the non-default scheduler.
>
>  -- Keir
>
>
>
> _______________________________________________
> Xen-devel mailing list
> Xen-devel@lists.xensource.com
> http://lists.xensource.com/xen-devel
>

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: Re: [PATCH] [RFC] Credit2 scheduler prototype
  2009-12-08 14:48   ` George Dunlap
@ 2009-12-08 18:20     ` Keir Fraser
  2010-01-13 14:48       ` George Dunlap
  0 siblings, 1 reply; 11+ messages in thread
From: Keir Fraser @ 2009-12-08 18:20 UTC (permalink / raw)
  To: George Dunlap; +Cc: xen-devel@lists.xensource.com

On 08/12/2009 14:48, "George Dunlap" <George.Dunlap@eu.citrix.com> wrote:

> My main concern is that sharing the runqueue between cores requires
> some changes to the core context switch code. The kinks aren't 100%
> worked out yet, so there's a risk that there will be an impact on the
> correctness of the credit1 scheduler.

Ah, if that's the problem with selecting a vcpu which happens to still be
'is_running' then I had some ideas how you could deal with that within the
credit2 scheduler. If you see such a vcpu when searching the runqueue,
ignore it, but set VPF_migrating. You'll then get a 'pick_cpu' callback when
descheduling of the vcpu is completed. That should play nice with the lazy
context switch logic while keeping things work conserving.

 -- Keir

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: Re: [PATCH] [RFC] Credit2 scheduler prototype
  2009-12-08 18:20     ` Keir Fraser
@ 2010-01-13 14:48       ` George Dunlap
  2010-01-13 15:16         ` Keir Fraser
  0 siblings, 1 reply; 11+ messages in thread
From: George Dunlap @ 2010-01-13 14:48 UTC (permalink / raw)
  To: Keir Fraser; +Cc: xen-devel@lists.xensource.com

[-- Attachment #1: Type: text/plain, Size: 1699 bytes --]

Keir,

What do you think of the attached patches?

The first implements something like what you suggest below, but
instead of using a sort of "hack" with VPF_migrate, it makes a proper
"context_saved" SCHED_OP callback.

The second addresses the fact that when sharing runqueues,
v->processor may change quickly without an explicit migrate.

The last two are the credit2 hypervisor and tool patches, which use
these two changes (for reference).

I think these patches should be basically NOOP for the existing
schedulers, so as far as I'm concerned they're ready to be merged as
soon as you're happy with them.

Peace,
 -George

On Tue, Dec 8, 2009 at 6:20 PM, Keir Fraser <keir.fraser@eu.citrix.com> wrote:
> On 08/12/2009 14:48, "George Dunlap" <George.Dunlap@eu.citrix.com> wrote:
>
>> My main concern is that sharing the runqueue between cores requires
>> some changes to the core context switch code. The kinks aren't 100%
>> worked out yet, so there's a risk that there will be an impact on the
>> correctness of the credit1 scheduler.
>
> Ah, if that's the problem with selecting a vcpu which happens to still be
> 'is_running' then I had some ideas how you could deal with that within the
> credit2 scheduler. If you see such a vcpu when searching the runqueue,
> ignore it, but set VPF_migrating. You'll then get a 'pick_cpu' callback when
> descheduling of the vcpu is completed. That should play nice with the lazy
> context switch logic while keeping things work conserving.
>
>  -- Keir
>
>
>
> _______________________________________________
> Xen-devel mailing list
> Xen-devel@lists.xensource.com
> http://lists.xensource.com/xen-devel
>

[-- Attachment #2: context_switch-scheduler-callback.diff --]
[-- Type: text/x-patch, Size: 1198 bytes --]

Add context_saved scheduler callback.

Because credit2 shares a runqueue between several cpus, it needs
to know when a scheduled-out process has finally been context-switched
away so that it can be added to the runqueue again.  (Otherwise it may
be grabbed by another processor before the context has been properly
saved.)

Signed-off-by: George Dunlap <george.dunlap@eu.citrix.com>

diff -r c44b7b9b6306 xen/common/schedule.c
--- a/xen/common/schedule.c	Wed Jan 13 13:33:57 2010 +0000
+++ b/xen/common/schedule.c	Wed Jan 13 13:36:37 2010 +0000
@@ -877,6 +877,8 @@
     /* Check for migration request /after/ clearing running flag. */
     smp_mb();
 
+    SCHED_OP(context_saved, prev);
+
     if ( unlikely(test_bit(_VPF_migrating, &prev->pause_flags)) )
         vcpu_migrate(prev);
 }
diff -r c44b7b9b6306 xen/include/xen/sched-if.h
--- a/xen/include/xen/sched-if.h	Wed Jan 13 13:33:57 2010 +0000
+++ b/xen/include/xen/sched-if.h	Wed Jan 13 13:36:37 2010 +0000
@@ -69,6 +69,7 @@
 
     void         (*sleep)          (struct vcpu *);
     void         (*wake)           (struct vcpu *);
+    void         (*context_saved)  (struct vcpu *);
 
     struct task_slice (*do_schedule) (s_time_t);
 

[-- Attachment #3: context_switch-vcpu-processor-sync.diff --]
[-- Type: text/x-patch, Size: 1615 bytes --]

Safely change next->processor if necessary.

Credit2's shared runqueue means that a vcpu may switch from one
pcpu to another without an explicit migration.  We need to
change v->processor to match.  However, this must be done with the
current v->processor schedule lock held.  To avoid deadlock,
do this after we've released the current processor's schedule lock.

Signed-off-by: George Dunlap <george.dunlap@eu.citrix.com>

diff -r 669448fb9d0c xen/common/schedule.c
--- a/xen/common/schedule.c	Wed Jan 13 13:47:16 2010 +0000
+++ b/xen/common/schedule.c	Wed Jan 13 14:15:51 2010 +0000
@@ -305,6 +305,23 @@
     vcpu_wake(v);
 }
 
+/* Safely change v->processor when running on a different cpu sharing the same runqueue */
+static void __vcpu_processor_sync(struct vcpu *next)
+{
+    unsigned long flags;
+    int old_cpu;
+    int this_cpu = smp_processor_id();
+
+    vcpu_schedule_lock_irqsave(next, flags);
+
+    /* Switch to new CPU, then unlock old CPU. */
+    old_cpu = next->processor;
+    next->processor = this_cpu;
+
+    spin_unlock_irqrestore(
+        &per_cpu(schedule_data, old_cpu).schedule_lock, flags);
+}
+
 /*
  * Force a VCPU through a deschedule/reschedule path.
  * For example, using this when setting the periodic timer period means that
@@ -852,6 +869,11 @@
 
     spin_unlock_irq(&sd->schedule_lock);
 
+    /* Safely change v->processor if necessary.  Do this after
+     * releasing this cpu's lock to avoid deadlock. */
+    if ( next->processor != smp_processor_id() )
+        __vcpu_processor_sync(next);
+
     perfc_incr(sched_ctx);
 
     stop_timer(&prev->periodic_timer);

[-- Attachment #4: credit2-hypervisor.diff --]
[-- Type: text/x-patch, Size: 29453 bytes --]

diff -r 7bd1dd9fb30f xen/common/Makefile
--- a/xen/common/Makefile	Wed Jan 13 14:15:51 2010 +0000
+++ b/xen/common/Makefile	Wed Jan 13 14:36:58 2010 +0000
@@ -13,6 +13,7 @@
 obj-y += page_alloc.o
 obj-y += rangeset.o
 obj-y += sched_credit.o
+obj-y += sched_credit2.o
 obj-y += sched_sedf.o
 obj-y += schedule.o
 obj-y += shutdown.o
diff -r 7bd1dd9fb30f xen/common/sched_credit2.c
--- /dev/null	Thu Jan 01 00:00:00 1970 +0000
+++ b/xen/common/sched_credit2.c	Wed Jan 13 14:36:58 2010 +0000
@@ -0,0 +1,1037 @@
+
+/****************************************************************************
+ * (C) 2009 - George Dunlap - Citrix Systems R&D UK, Ltd
+ ****************************************************************************
+ *
+ *        File: common/csched_credit2.c
+ *      Author: George Dunlap
+ *
+ * Description: Credit-based SMP CPU scheduler
+ * Based on an earlier verson by Emmanuel Ackaouy.
+ */
+
+#include <xen/config.h>
+#include <xen/init.h>
+#include <xen/lib.h>
+#include <xen/sched.h>
+#include <xen/domain.h>
+#include <xen/delay.h>
+#include <xen/event.h>
+#include <xen/time.h>
+#include <xen/perfc.h>
+#include <xen/sched-if.h>
+#include <xen/softirq.h>
+#include <asm/atomic.h>
+#include <xen/errno.h>
+#include <xen/trace.h>
+
+#if __i386__
+#define PRI_stime "lld"
+#else
+#define PRI_stime "ld"
+#endif
+
+#define d2printk(x...)
+//#define d2printk printk
+
+#define TRC_CSCHED2_TICK        TRC_SCHED_CLASS + 1
+#define TRC_CSCHED2_RUNQ_POS    TRC_SCHED_CLASS + 2
+#define TRC_CSCHED2_CREDIT_BURN TRC_SCHED_CLASS + 3
+#define TRC_CSCHED2_CREDIT_ADD  TRC_SCHED_CLASS + 4
+#define TRC_CSCHED2_TICKLE_CHECK TRC_SCHED_CLASS + 5
+
+/*
+ * WARNING: This is still in an experimental phase.  Status and work can be found at the
+ * credit2 wiki page:
+ *  http://wiki.xensource.com/xenwiki/Credit2_Scheduler_Development
+ */
+
+/* 
+ * Design:
+ *
+ * VMs "burn" credits based on their weight; higher weight means credits burn
+ * more slowly.
+ * 
+ * vcpus are inserted into the runqueue by credit order.
+ *
+ * Credits are "reset" when the next vcpu in the runqueue is less than or equal to zero.  At that
+ * point, everyone's credits are "clipped" to a small value, and a fixed credit is added to everyone.
+ *
+ * The plan is for all cores that share an L2 will share the same runqueue.  At the moment, there is
+ * one global runqueue for all cores.
+ */
+
+/*
+ * Basic constants
+ */
+#define CSCHED_DEFAULT_WEIGHT       256
+#define CSCHED_MIN_TIMER            MICROSECS(500)
+#define CSCHED_CARRYOVER_MAX        CSCHED_MIN_TIMER
+#define CSCHED_CREDIT_RESET         0
+#define CSCHED_CREDIT_INIT          MILLISECS(10)
+#define CSCHED_MAX_TIMER            MILLISECS(2)
+
+#define CSCHED_IDLE_CREDIT                 (-(1<<30))
+
+/*
+ * Flags
+ */
+/* CSFLAG_scheduled: Is this vcpu either running on, or context-switching off,
+ * a physical cpu?
+ * + Accessed only with runqueue lock held
+ * + Set when chosen as next in csched_schedule().
+ * + Cleared after context switch has been saved in csched_context_saved()
+ * + Checked in vcpu_wake to see if we can add to the runqueue, or if we should
+ *   set CSFLAG_delayed_runq_add
+ * + Checked to be false in runq_insert.
+ */
+#define __CSFLAG_scheduled 1
+#define CSFLAG_scheduled (1<<__CSFLAG_scheduled)
+/* CSFLAG_delayed_runq_add: Do we need to add this to the runqueue once it'd done
+ * being context switched out?
+ * + Set when scheduling out in csched_schedule() if prev is runnable
+ * + Set in csched_vcpu_wake if it finds CSFLAG_scheduled set
+ * + Read in csched_context_switched().  If set, it adds prev to the runqueue and
+ *   clears the bit.
+ */
+#define __CSFLAG_delayed_runq_add 2
+#define CSFLAG_delayed_runq_add (1<<__CSFLAG_delayed_runq_add)
+
+
+/*
+ * Useful macros
+ */
+#define CSCHED_PCPU(_c)                                                 \
+    ((struct csched_pcpu *)per_cpu(schedule_data, _c).sched_priv)
+#define CSCHED_VCPU(_vcpu)  ((struct csched_vcpu *) (_vcpu)->sched_priv)
+#define CSCHED_DOM(_dom)    ((struct csched_dom *) (_dom)->sched_priv)
+//#define RUNQ(_cpu)          (&(CSCHED_GROUP(_cpu)->runq))
+#define RUNQ(_cpu)          (&csched_priv.runq)
+
+/*
+ * System-wide private data
+ */
+struct csched_private {
+    spinlock_t lock;
+    uint32_t ncpus;
+    struct domain *idle_domain;
+
+    /* Per-runqueue info */
+    struct list_head runq; /* Global runqueue */
+    int max_weight;
+    struct list_head sdom;
+    struct list_head svc;  /* List of all vcpus */
+};
+
+struct csched_pcpu {
+    int _dummy;
+};
+
+/*
+ * Virtual CPU
+ */
+struct csched_vcpu {
+    struct list_head global_elem; /* On the global vcpu list */
+    struct list_head sdom_elem;   /* On the domain vcpu list */
+    struct list_head runq_elem;   /* On the runqueue         */
+
+    /* Up-pointers */
+    struct csched_dom *sdom;
+    struct vcpu *vcpu;
+
+    int weight;
+
+    int credit;
+    s_time_t start_time; /* When we were scheduled (used for credit) */
+    unsigned flags;      /* 16 bits doesn't seem to play well with clear_bit() */
+
+};
+
+/*
+ * Domain
+ */
+struct csched_dom {
+    struct list_head vcpu;
+    struct list_head sdom_elem;
+    struct domain *dom;
+    uint16_t weight;
+    uint16_t nr_vcpus;
+};
+
+
+/*
+ * Global variables
+ */
+static struct csched_private csched_priv;
+
+/*
+ * Time-to-credit, credit-to-time.
+ * FIXME: Do pre-calculated division?
+ */
+static s_time_t t2c(s_time_t time, struct csched_vcpu *svc)
+{
+    return time * csched_priv.max_weight / svc->weight;
+}
+
+static s_time_t c2t(s_time_t credit, struct csched_vcpu *svc)
+{
+    return credit * svc->weight / csched_priv.max_weight;
+}
+
+/*
+ * Runqueue related code
+ */
+
+static /*inline*/ int
+__vcpu_on_runq(struct csched_vcpu *svc)
+{
+    return !list_empty(&svc->runq_elem);
+}
+
+static /*inline*/ struct csched_vcpu *
+__runq_elem(struct list_head *elem)
+{
+    return list_entry(elem, struct csched_vcpu, runq_elem);
+}
+
+static int
+__runq_insert(struct list_head *runq, struct csched_vcpu *svc)
+{
+    struct list_head *iter;
+    int pos = 0;
+
+    d2printk("rqi d%dv%d\n",
+           svc->vcpu->domain->domain_id,
+           svc->vcpu->vcpu_id);
+
+    /* Idle vcpus not allowed on the runqueue anymore */
+    BUG_ON(is_idle_vcpu(svc->vcpu));
+    BUG_ON(svc->vcpu->is_running);
+    BUG_ON(test_bit(__CSFLAG_scheduled, &svc->flags));
+
+    list_for_each( iter, runq )
+    {
+        struct csched_vcpu * iter_svc = __runq_elem(iter);
+
+        if ( svc->credit > iter_svc->credit )
+        {
+            d2printk(" p%d d%dv%d\n",
+                   pos,
+                   iter_svc->vcpu->domain->domain_id,
+                   iter_svc->vcpu->vcpu_id);
+            break;
+        }
+        pos++;
+    }
+
+    list_add_tail(&svc->runq_elem, iter);
+
+    return pos;
+}
+
+static void
+runq_insert(unsigned int cpu, struct csched_vcpu *svc)
+{
+    struct list_head * runq = RUNQ(cpu);
+    int pos = 0;
+
+    /* FIXME: Runqueue per L2 */
+    ASSERT( spin_is_locked(&csched_priv.lock) ); 
+
+    BUG_ON( __vcpu_on_runq(svc) );
+    /* FIXME: Check runqueue handles this cpu*/
+    //BUG_ON( cpu != svc->vcpu->processor ); 
+
+    pos = __runq_insert(runq, svc);
+
+    {
+        struct {
+            unsigned dom:16,vcpu:16;
+            unsigned pos;
+        } d;
+        d.dom = svc->vcpu->domain->domain_id;
+        d.vcpu = svc->vcpu->vcpu_id;
+        d.pos = pos;
+        trace_var(TRC_CSCHED2_RUNQ_POS, 1,
+                  sizeof(d),
+                  (unsigned char *)&d);
+    }
+
+    return;
+}
+
+static inline void
+__runq_remove(struct csched_vcpu *svc)
+{
+    BUG_ON( !__vcpu_on_runq(svc) );
+    list_del_init(&svc->runq_elem);
+}
+
+void burn_credits(struct csched_vcpu *, s_time_t);
+
+/* Check to see if the item on the runqueue is higher priority than what's
+ * currently running; if so, wake up the processor */
+static /*inline*/ void
+runq_tickle(unsigned int cpu, struct csched_vcpu *new, s_time_t now)
+{
+    int i, ipid=-1;
+    s_time_t lowest=(1<<30);
+
+    d2printk("rqt d%dv%d cd%dv%d\n",
+             new->vcpu->domain->domain_id,
+             new->vcpu->vcpu_id,
+             current->domain->domain_id,
+             current->vcpu_id);
+
+    /* Find the cpu in this queue group that has the lowest credits */
+    /* FIXME: separate runqueues */
+    for_each_online_cpu ( i )
+    {
+        struct csched_vcpu * const cur =
+            CSCHED_VCPU(per_cpu(schedule_data, i).curr);
+
+        /* FIXME: keep track of idlers, chose from the mask */
+        if ( is_idle_vcpu(cur->vcpu) )
+        {
+            ipid = i;
+            lowest = CSCHED_IDLE_CREDIT;
+            break;
+        }
+        else
+        {
+            /* Update credits for current to see if we want to preempt */
+            burn_credits(cur, now);
+
+            if ( cur->credit < lowest )
+            {
+                ipid = i;
+                lowest = cur->credit;
+            }
+
+            /* TRACE */ {
+                struct {
+                    unsigned dom:16,vcpu:16;
+                    unsigned credit;
+                } d;
+                d.dom = cur->vcpu->domain->domain_id;
+                d.vcpu = cur->vcpu->vcpu_id;
+                d.credit = cur->credit;
+                trace_var(TRC_CSCHED2_TICKLE_CHECK, 1,
+                          sizeof(d),
+                          (unsigned char *)&d);
+            }
+        }
+    }
+
+    if ( ipid != -1 )
+    {
+        int cdiff = lowest - new->credit;
+
+        if ( lowest == CSCHED_IDLE_CREDIT || cdiff < 0 ) {
+            d2printk("si %d\n", ipid);
+            cpu_raise_softirq(ipid, SCHEDULE_SOFTIRQ);
+        }
+        else
+            /* FIXME: Wake up later? */;
+    }
+}
+
+/*
+ * Credit-related code
+ */
+static void reset_credit(int cpu, s_time_t now)
+{
+    struct list_head *iter;
+
+    list_for_each( iter, &csched_priv.svc )
+    {
+        struct csched_vcpu * svc = list_entry(iter, struct csched_vcpu, global_elem);
+        s_time_t cmax;
+
+        BUG_ON( is_idle_vcpu(svc->vcpu) );
+
+        /* Maximum amount of credit that can be carried over */
+        cmax = CSCHED_CARRYOVER_MAX;
+
+        if ( svc->credit > cmax )
+            svc->credit = cmax;
+        svc->credit += CSCHED_CREDIT_INIT;  /* Find a better name */
+        svc->start_time = now;
+
+        /* Trace credit */
+    }
+
+    /* No need to resort runqueue, as everyone's order should be the same. */
+}
+
+void burn_credits(struct csched_vcpu *svc, s_time_t now)
+{
+    s_time_t delta;
+
+    /* Assert svc is current */
+    ASSERT(svc==CSCHED_VCPU(per_cpu(schedule_data, svc->vcpu->processor).curr));
+
+    if ( is_idle_vcpu(svc->vcpu) )
+    {
+        BUG_ON(svc->credit != CSCHED_IDLE_CREDIT);
+        return;
+    }
+
+    delta = now - svc->start_time;
+
+    if ( delta > 0 ) {
+        /* This will round down; should we consider rounding up...? */
+        svc->credit -= t2c(delta, svc);
+        svc->start_time = now;
+
+        d2printk("b d%dv%d c%d\n",
+                 svc->vcpu->domain->domain_id,
+                 svc->vcpu->vcpu_id,
+                 svc->credit);
+    } else {
+        d2printk("%s: Time went backwards? now %"PRI_stime" start %"PRI_stime"\n",
+               __func__, now, svc->start_time);
+    }
+    
+    /* TRACE */
+    {
+        struct {
+            unsigned dom:16,vcpu:16;
+            unsigned credit;
+            int delta;
+        } d;
+        d.dom = svc->vcpu->domain->domain_id;
+        d.vcpu = svc->vcpu->vcpu_id;
+        d.credit = svc->credit;
+        d.delta = delta;
+        trace_var(TRC_CSCHED2_CREDIT_BURN, 1,
+                  sizeof(d),
+                  (unsigned char *)&d);
+    }
+}
+
+/* Find the domain with the highest weight. */
+void update_max_weight(int new_weight, int old_weight)
+{
+    if ( new_weight > csched_priv.max_weight )
+    {
+        csched_priv.max_weight = new_weight;
+        printk("%s: Max weight %d\n", __func__, csched_priv.max_weight);
+    }
+    else if ( old_weight == csched_priv.max_weight )
+    {
+        struct list_head *iter;
+        int max_weight = 1;
+        
+        list_for_each( iter, &csched_priv.sdom )
+        {
+            struct csched_dom * sdom = list_entry(iter, struct csched_dom, sdom_elem);
+            
+            if ( sdom->weight > max_weight )
+                max_weight = sdom->weight;
+        }
+        
+        csched_priv.max_weight = max_weight;
+        printk("%s: Max weight %d\n", __func__, csched_priv.max_weight);
+    }
+}
+
+/*
+ * Initialization code
+ */
+static int
+csched_pcpu_init(int cpu)
+{
+    unsigned long flags;
+    struct csched_pcpu *spc;
+
+    /* Allocate per-PCPU info */
+    spc = xmalloc(struct csched_pcpu);
+    if ( spc == NULL )
+        return -1;
+
+    spin_lock_irqsave(&csched_priv.lock, flags);
+
+    /* Initialize/update system-wide config */
+    per_cpu(schedule_data, cpu).sched_priv = spc;
+
+    csched_priv.ncpus++;
+
+    /* Start off idling... */
+    BUG_ON(!is_idle_vcpu(per_cpu(schedule_data, cpu).curr));
+
+    spin_unlock_irqrestore(&csched_priv.lock, flags);
+
+    return 0;
+}
+
+#ifndef NDEBUG
+static /*inline*/ void
+__csched_vcpu_check(struct vcpu *vc)
+{
+    struct csched_vcpu * const svc = CSCHED_VCPU(vc);
+    struct csched_dom * const sdom = svc->sdom;
+
+    BUG_ON( svc->vcpu != vc );
+    BUG_ON( sdom != CSCHED_DOM(vc->domain) );
+    if ( sdom )
+    {
+        BUG_ON( is_idle_vcpu(vc) );
+        BUG_ON( sdom->dom != vc->domain );
+    }
+    else
+    {
+        BUG_ON( !is_idle_vcpu(vc) );
+    }
+}
+#define CSCHED_VCPU_CHECK(_vc)  (__csched_vcpu_check(_vc))
+#else
+#define CSCHED_VCPU_CHECK(_vc)
+#endif
+
+static int
+csched_vcpu_init(struct vcpu *vc)
+{
+    struct domain * const dom = vc->domain;
+    struct csched_dom *sdom = CSCHED_DOM(dom);
+    struct csched_vcpu *svc;
+
+    printk("%s: Initializing d%dv%d\n",
+           __func__, dom->domain_id, vc->vcpu_id);
+
+    /* Allocate per-VCPU info */
+    svc = xmalloc(struct csched_vcpu);
+    if ( svc == NULL )
+        return -1;
+
+    INIT_LIST_HEAD(&svc->global_elem);
+    INIT_LIST_HEAD(&svc->sdom_elem);
+    INIT_LIST_HEAD(&svc->runq_elem);
+
+    svc->sdom = sdom;
+    svc->vcpu = vc;
+    svc->flags = 0U;
+    vc->sched_priv = svc;
+
+    if ( ! is_idle_vcpu(vc) )
+    {
+        BUG_ON( sdom == NULL );
+
+        svc->credit = CSCHED_CREDIT_INIT;
+        svc->weight = sdom->weight;
+
+        list_add_tail(&svc->sdom_elem, &sdom->vcpu);
+        list_add_tail(&svc->global_elem, &csched_priv.svc);
+        sdom->nr_vcpus++;
+    } 
+    else
+    {
+        BUG_ON( sdom != NULL );
+        svc->credit = CSCHED_IDLE_CREDIT;
+        svc->weight = 0;
+        if ( csched_priv.idle_domain == NULL )
+            csched_priv.idle_domain = dom;
+    }
+
+    /* Allocate per-PCPU info */
+    if ( unlikely(!CSCHED_PCPU(vc->processor)) )
+    {
+        if ( csched_pcpu_init(vc->processor) != 0 )
+            return -1;
+    }
+
+    CSCHED_VCPU_CHECK(vc);
+    return 0;
+}
+
+static void
+csched_vcpu_destroy(struct vcpu *vc)
+{
+    struct csched_vcpu * const svc = CSCHED_VCPU(vc);
+    struct csched_dom * const sdom = svc->sdom;
+    unsigned long flags;
+
+    BUG_ON( sdom == NULL );
+    BUG_ON( !list_empty(&svc->runq_elem) );
+
+    spin_lock_irqsave(&csched_priv.lock, flags);
+
+    /* Remove from sdom list */
+    list_del_init(&svc->global_elem);
+    list_del_init(&svc->sdom_elem);
+
+    sdom->nr_vcpus--;
+
+    spin_unlock_irqrestore(&csched_priv.lock, flags);
+
+    xfree(svc);
+}
+
+static void
+csched_vcpu_sleep(struct vcpu *vc)
+{
+    struct csched_vcpu * const svc = CSCHED_VCPU(vc);
+
+    BUG_ON( is_idle_vcpu(vc) );
+
+    if ( per_cpu(schedule_data, vc->processor).curr == vc )
+        cpu_raise_softirq(vc->processor, SCHEDULE_SOFTIRQ);
+    else if ( __vcpu_on_runq(svc) )
+        __runq_remove(svc);
+}
+
+static void
+csched_vcpu_wake(struct vcpu *vc)
+{
+    struct csched_vcpu * const svc = CSCHED_VCPU(vc);
+    const unsigned int cpu = vc->processor;
+    s_time_t now = 0;
+    int flags;
+
+    d2printk("w d%dv%d\n", vc->domain->domain_id, vc->vcpu_id);
+
+    BUG_ON( is_idle_vcpu(vc) );
+
+    /* FIXME: Runqueue per L2 */
+    spin_lock_irqsave(&csched_priv.lock, flags);
+
+
+    /* Make sure svc priority mod happens before runq check */
+    if ( unlikely(per_cpu(schedule_data, cpu).curr == vc) )
+    {
+        goto out;
+    }
+
+    if ( unlikely(__vcpu_on_runq(svc)) )
+    {
+        /* If we've boosted someone that's already on a runqueue, prioritize
+         * it and inform the cpu in question. */
+        goto out;
+    }
+
+    /* If the context hasn't been saved for this vcpu yet, we can't put it on
+     * another runqueue.  Instead, we set a flag so that it will be put on the runqueue
+     * after the context has been saved. */
+    if ( unlikely (test_bit(__CSFLAG_scheduled, &svc->flags) ) )
+    {
+        set_bit(__CSFLAG_delayed_runq_add, &svc->flags);
+        goto out;
+    }
+
+    now = NOW();
+
+    /* Put the VCPU on the runq */
+    runq_insert(cpu, svc);
+    runq_tickle(cpu, svc, now);
+
+out:
+    spin_unlock_irqrestore(&csched_priv.lock, flags); 
+    d2printk("w-\n");
+    return;
+}
+
+static void
+csched_context_saved(struct vcpu *vc)
+{
+    int flags;
+    struct csched_vcpu * const svc = CSCHED_VCPU(vc);
+
+    spin_lock_irqsave(&csched_priv.lock, flags);
+
+    /* This vcpu is now eligible to be put on the runqueue again */
+    clear_bit(__CSFLAG_scheduled, &svc->flags);
+    
+    /* If someone wants it there, put it there */
+    if ( test_bit(__CSFLAG_delayed_runq_add, &svc->flags) )
+    {
+        const unsigned int cpu = vc->processor;
+
+        clear_bit(__CSFLAG_delayed_runq_add, &svc->flags);
+
+        BUG_ON(__vcpu_on_runq(svc));
+        
+        runq_insert(cpu, svc);
+        runq_tickle(cpu, svc, NOW());
+    }
+
+    spin_unlock_irqrestore(&csched_priv.lock, flags);
+}
+
+static int
+csched_cpu_pick(struct vcpu *vc)
+{
+    /* FIXME: Chose a schedule group based on load */
+    return 0;
+}
+
+static int
+csched_dom_cntl(
+    struct domain *d,
+    struct xen_domctl_scheduler_op *op)
+{
+    struct csched_dom * const sdom = CSCHED_DOM(d);
+    unsigned long flags;
+
+    if ( op->cmd == XEN_DOMCTL_SCHEDOP_getinfo )
+    {
+        op->u.credit2.weight = sdom->weight;
+    }
+    else
+    {
+        ASSERT(op->cmd == XEN_DOMCTL_SCHEDOP_putinfo);
+
+        if ( op->u.credit2.weight != 0 )
+        {
+            struct list_head *iter;
+            int old_weight;
+
+            spin_lock_irqsave(&csched_priv.lock, flags);
+
+            old_weight = sdom->weight;
+
+            sdom->weight = op->u.credit2.weight;
+
+            /* Update max weight */
+            update_max_weight(sdom->weight, old_weight);
+
+            /* Update weights for vcpus */
+            list_for_each ( iter, &sdom->vcpu )
+            {
+                struct csched_vcpu *svc = list_entry(iter, struct csched_vcpu, sdom_elem);
+
+                svc->weight = sdom->weight;
+            }
+
+            spin_unlock_irqrestore(&csched_priv.lock, flags);
+        }
+    }
+
+    return 0;
+}
+
+static int
+csched_dom_init(struct domain *dom)
+{
+    struct csched_dom *sdom;
+    int flags;
+
+    printk("%s: Initializing domain %d\n", __func__, dom->domain_id);
+
+    if ( is_idle_domain(dom) )
+        return 0;
+
+    sdom = xmalloc(struct csched_dom);
+    if ( sdom == NULL )
+        return -ENOMEM;
+
+    /* Initialize credit and weight */
+    INIT_LIST_HEAD(&sdom->vcpu);
+    INIT_LIST_HEAD(&sdom->sdom_elem);
+    sdom->dom = dom;
+    sdom->weight = CSCHED_DEFAULT_WEIGHT;
+    sdom->nr_vcpus = 0;
+
+    dom->sched_priv = sdom;
+
+    spin_lock_irqsave(&csched_priv.lock, flags);
+
+    update_max_weight(sdom->weight, 0);
+    list_add_tail(&sdom->sdom_elem, &csched_priv.sdom);
+
+    spin_unlock_irqrestore(&csched_priv.lock, flags);
+
+    return 0;
+}
+
+static void
+csched_dom_destroy(struct domain *dom)
+{
+    struct csched_dom *sdom = CSCHED_DOM(dom);
+    int flags;
+
+    BUG_ON(!list_empty(&sdom->vcpu));
+
+    spin_lock_irqsave(&csched_priv.lock, flags);
+
+    list_del_init(&sdom->sdom_elem);
+
+    update_max_weight(0, sdom->weight);
+
+    spin_unlock_irqrestore(&csched_priv.lock, flags);
+    
+    xfree(CSCHED_DOM(dom));
+}
+
+#if 0
+static void csched_load_balance(int cpu)
+{
+    /* FIXME: Do something. */
+}
+#endif
+
+/* How long should we let this vcpu run for? */
+static s_time_t
+csched_runtime(int cpu, struct csched_vcpu *snext)
+{
+    s_time_t time = CSCHED_MAX_TIMER;
+    struct list_head *runq = RUNQ(cpu);
+
+    if ( is_idle_vcpu(snext->vcpu) )
+        return CSCHED_MAX_TIMER;
+
+    /* Basic time */
+    time = c2t(snext->credit, snext);
+
+    /* Next guy on runqueue */
+    if ( ! list_empty(runq) )
+    {
+        struct csched_vcpu *svc = __runq_elem(runq->next);
+        s_time_t ntime;
+
+        if ( ! is_idle_vcpu(svc->vcpu) )
+        {
+            ntime = c2t(snext->credit - svc->credit, snext);
+
+            if ( time > ntime )
+                time = ntime;
+        }
+    }
+
+    /* Check limits */
+    if ( time < CSCHED_MIN_TIMER )
+        time = CSCHED_MIN_TIMER;
+    else if ( time > CSCHED_MAX_TIMER )
+        time = CSCHED_MAX_TIMER;
+
+    return time;
+}
+
+void __dump_execstate(void *unused);
+
+/*
+ * This function is in the critical path. It is designed to be simple and
+ * fast for the common case.
+ */
+static struct task_slice
+csched_schedule(s_time_t now)
+{
+    const int cpu = smp_processor_id();
+    struct list_head * const runq = RUNQ(cpu);
+    //struct csched_pcpu *spc = CSCHED_PCPU(cpu);
+    struct csched_vcpu * const scurr = CSCHED_VCPU(current);
+    struct csched_vcpu *snext = NULL;
+    struct task_slice ret;
+    int flags;
+
+    CSCHED_VCPU_CHECK(current);
+
+    d2printk("sc p%d c d%dv%d now %"PRI_stime"\n",
+             cpu,
+             scurr->vcpu->domain->domain_id,
+             scurr->vcpu->vcpu_id,
+             now);
+
+
+    /* FIXME: Runqueue per L2 */
+    spin_lock_irqsave(&csched_priv.lock, flags);
+
+    /* Update credits */
+    burn_credits(scurr, now);
+
+    /*
+     * Select next runnable local VCPU (ie top of local runq).
+     *
+     * If the current vcpu is runnable, and has higher credit than
+     * the next guy on the queue (or there is noone else), we want to run him again.
+     *
+     * If the current vcpu is runnable, and the next guy on the queue
+     * has higher credit, we want to mark current for delayed runqueue
+     * add, and remove the next guy from the queue.
+     *
+     * If the current vcpu is not runnable, we want to chose the idle
+     * vcpu for this processor. 
+     */
+    if ( list_empty(runq) )
+        snext = CSCHED_VCPU(csched_priv.idle_domain->vcpu[cpu]);
+    else
+        snext = __runq_elem(runq->next);
+
+    if ( !is_idle_vcpu(current) && vcpu_runnable(current) )
+    {
+        /* If the current vcpu is runnable, and has higher credit
+         * than the next on the runqueue, run him again.
+         * Otherwise, set him for delayed runq add. */
+        if ( scurr->credit > snext->credit)
+            snext = scurr;
+        else
+            set_bit(__CSFLAG_delayed_runq_add, &scurr->flags);
+    }
+
+    if ( snext != scurr && !is_idle_vcpu(snext->vcpu) )
+    {
+        __runq_remove(snext);
+        if ( snext->vcpu->is_running )
+        {
+            printk("p%d: snext d%dv%d running on p%d! scurr d%dv%d\n",
+                   cpu,
+                   snext->vcpu->domain->domain_id, snext->vcpu->vcpu_id,
+                   snext->vcpu->processor,
+                   scurr->vcpu->domain->domain_id,
+                   scurr->vcpu->vcpu_id);
+            BUG();
+        }
+        set_bit(__CSFLAG_scheduled, &snext->flags);
+    }
+
+    if ( !is_idle_vcpu(snext->vcpu) && snext->credit <= CSCHED_CREDIT_RESET )
+        reset_credit(cpu, now);
+
+    spin_unlock_irqrestore(&csched_priv.lock, flags);
+
+#if 0
+    /*
+     * Update idlers mask if necessary. When we're idling, other CPUs
+     * will tickle us when they get extra work.
+     */
+    if ( is_idle_vcpu(snext->vcpu) )
+    {
+        if ( !cpu_isset(cpu, csched_priv.idlers) )
+            cpu_set(cpu, csched_priv.idlers);
+    }
+    else if ( cpu_isset(cpu, csched_priv.idlers) )
+    {
+        cpu_clear(cpu, csched_priv.idlers);
+    }
+#endif
+
+    if ( !is_idle_vcpu(snext->vcpu) )
+        snext->start_time = now;
+    /*
+     * Return task to run next...
+     */
+    ret.time = csched_runtime(cpu, snext);
+    ret.task = snext->vcpu;
+
+    CSCHED_VCPU_CHECK(ret.task);
+    return ret;
+}
+
+static void
+csched_dump_vcpu(struct csched_vcpu *svc)
+{
+    printk("[%i.%i] flags=%x cpu=%i",
+            svc->vcpu->domain->domain_id,
+            svc->vcpu->vcpu_id,
+            svc->flags,
+            svc->vcpu->processor);
+
+    printk(" credit=%" PRIi32" [w=%u]", svc->credit, svc->weight);
+
+    printk("\n");
+}
+
+static void
+csched_dump_pcpu(int cpu)
+{
+    struct list_head *runq, *iter;
+    //struct csched_pcpu *spc;
+    struct csched_vcpu *svc;
+    int loop;
+    char cpustr[100];
+
+    //spc = CSCHED_PCPU(cpu);
+    runq = RUNQ(cpu);
+
+    cpumask_scnprintf(cpustr, sizeof(cpustr), per_cpu(cpu_sibling_map,cpu));
+    printk(" sibling=%s, ", cpustr);
+    cpumask_scnprintf(cpustr, sizeof(cpustr), per_cpu(cpu_core_map,cpu));
+    printk("core=%s\n", cpustr);
+
+    /* current VCPU */
+    svc = CSCHED_VCPU(per_cpu(schedule_data, cpu).curr);
+    if ( svc )
+    {
+        printk("\trun: ");
+        csched_dump_vcpu(svc);
+    }
+
+    loop = 0;
+    list_for_each( iter, runq )
+    {
+        svc = __runq_elem(iter);
+        if ( svc )
+        {
+            printk("\t%3d: ", ++loop);
+            csched_dump_vcpu(svc);
+        }
+    }
+}
+
+static void
+csched_dump(void)
+{
+    struct list_head *iter_sdom, *iter_svc;
+    int loop;
+
+    printk("info:\n"
+           "\tncpus              = %u\n"
+           "\tdefault-weight     = %d\n",
+           csched_priv.ncpus,
+           CSCHED_DEFAULT_WEIGHT);
+
+    printk("active vcpus:\n");
+    loop = 0;
+    list_for_each( iter_sdom, &csched_priv.sdom )
+    {
+        struct csched_dom *sdom;
+        sdom = list_entry(iter_sdom, struct csched_dom, sdom_elem);
+
+        list_for_each( iter_svc, &sdom->vcpu )
+        {
+            struct csched_vcpu *svc;
+            svc = list_entry(iter_svc, struct csched_vcpu, sdom_elem);
+
+            printk("\t%3d: ", ++loop);
+            csched_dump_vcpu(svc);
+        }
+    }
+}
+
+static void
+csched_init(void)
+{
+    spin_lock_init(&csched_priv.lock);
+    INIT_LIST_HEAD(&csched_priv.sdom);
+    INIT_LIST_HEAD(&csched_priv.svc);
+
+    csched_priv.ncpus = 0;
+
+    /* FIXME: Runqueue per l2 */
+    csched_priv.max_weight = 1;
+    INIT_LIST_HEAD(&csched_priv.runq);
+}
+
+struct scheduler sched_credit2_def = {
+    .name           = "SMP Credit Scheduler rev2",
+    .opt_name       = "credit2",
+    .sched_id       = XEN_SCHEDULER_CREDIT2,
+
+    .init_domain    = csched_dom_init,
+    .destroy_domain = csched_dom_destroy,
+
+    .init_vcpu      = csched_vcpu_init,
+    .destroy_vcpu   = csched_vcpu_destroy,
+
+    .sleep          = csched_vcpu_sleep,
+    .wake           = csched_vcpu_wake,
+
+    .adjust         = csched_dom_cntl,
+
+    .pick_cpu       = csched_cpu_pick,
+    .do_schedule    = csched_schedule,
+    .context_saved  = csched_context_saved,
+
+    .dump_cpu_state = csched_dump_pcpu,
+    .dump_settings  = csched_dump,
+    .init           = csched_init,
+};
diff -r 7bd1dd9fb30f xen/common/schedule.c
--- a/xen/common/schedule.c	Wed Jan 13 14:15:51 2010 +0000
+++ b/xen/common/schedule.c	Wed Jan 13 14:36:58 2010 +0000
@@ -58,9 +58,11 @@
 
 extern const struct scheduler sched_sedf_def;
 extern const struct scheduler sched_credit_def;
+extern const struct scheduler sched_credit2_def;
 static const struct scheduler *__initdata schedulers[] = {
     &sched_sedf_def,
     &sched_credit_def,
+    &sched_credit2_def,
     NULL
 };
 
diff -r 7bd1dd9fb30f xen/include/public/domctl.h
--- a/xen/include/public/domctl.h	Wed Jan 13 14:15:51 2010 +0000
+++ b/xen/include/public/domctl.h	Wed Jan 13 14:36:58 2010 +0000
@@ -295,6 +295,7 @@
 /* Scheduler types. */
 #define XEN_SCHEDULER_SEDF     4
 #define XEN_SCHEDULER_CREDIT   5
+#define XEN_SCHEDULER_CREDIT2  6
 /* Set or get info? */
 #define XEN_DOMCTL_SCHEDOP_putinfo 0
 #define XEN_DOMCTL_SCHEDOP_getinfo 1
@@ -313,6 +314,9 @@
             uint16_t weight;
             uint16_t cap;
         } credit;
+        struct xen_domctl_sched_credit2 {
+            uint16_t weight;
+        } credit2;
     } u;
 };
 typedef struct xen_domctl_scheduler_op xen_domctl_scheduler_op_t;
diff -r 7bd1dd9fb30f xen/include/public/trace.h
--- a/xen/include/public/trace.h	Wed Jan 13 14:15:51 2010 +0000
+++ b/xen/include/public/trace.h	Wed Jan 13 14:36:58 2010 +0000
@@ -53,6 +53,7 @@
 #define TRC_HVM_HANDLER   0x00082000   /* various HVM handlers      */
 
 #define TRC_SCHED_MIN       0x00021000   /* Just runstate changes */
+#define TRC_SCHED_CLASS     0x00022000   /* Scheduler-specific    */
 #define TRC_SCHED_VERBOSE   0x00028000   /* More inclusive scheduling */
 
 /* Trace events per class */

[-- Attachment #5: credit2-tools.diff --]
[-- Type: text/x-patch, Size: 15599 bytes --]

diff -r 63531e640828 tools/libxc/Makefile
--- a/tools/libxc/Makefile	Mon Dec 07 17:01:11 2009 +0000
+++ b/tools/libxc/Makefile	Mon Dec 21 11:45:00 2009 +0000
@@ -17,6 +17,7 @@
 CTRL_SRCS-y       += xc_private.c
 CTRL_SRCS-y       += xc_sedf.c
 CTRL_SRCS-y       += xc_csched.c
+CTRL_SRCS-y       += xc_csched2.c
 CTRL_SRCS-y       += xc_tbuf.c
 CTRL_SRCS-y       += xc_pm.c
 CTRL_SRCS-y       += xc_cpu_hotplug.c
diff -r 63531e640828 tools/libxc/xc_csched2.c
--- /dev/null	Thu Jan 01 00:00:00 1970 +0000
+++ b/tools/libxc/xc_csched2.c	Mon Dec 21 11:45:00 2009 +0000
@@ -0,0 +1,50 @@
+/****************************************************************************
+ * (C) 2006 - Emmanuel Ackaouy - XenSource Inc.
+ ****************************************************************************
+ *
+ *        File: xc_csched.c
+ *      Author: Emmanuel Ackaouy
+ *
+ * Description: XC Interface to the credit scheduler
+ *
+ */
+#include "xc_private.h"
+
+
+int
+xc_sched_credit2_domain_set(
+    int xc_handle,
+    uint32_t domid,
+    struct xen_domctl_sched_credit2 *sdom)
+{
+    DECLARE_DOMCTL;
+
+    domctl.cmd = XEN_DOMCTL_scheduler_op;
+    domctl.domain = (domid_t) domid;
+    domctl.u.scheduler_op.sched_id = XEN_SCHEDULER_CREDIT2;
+    domctl.u.scheduler_op.cmd = XEN_DOMCTL_SCHEDOP_putinfo;
+    domctl.u.scheduler_op.u.credit2 = *sdom;
+
+    return do_domctl(xc_handle, &domctl);
+}
+
+int
+xc_sched_credit2_domain_get(
+    int xc_handle,
+    uint32_t domid,
+    struct xen_domctl_sched_credit2 *sdom)
+{
+    DECLARE_DOMCTL;
+    int err;
+
+    domctl.cmd = XEN_DOMCTL_scheduler_op;
+    domctl.domain = (domid_t) domid;
+    domctl.u.scheduler_op.sched_id = XEN_SCHEDULER_CREDIT2;
+    domctl.u.scheduler_op.cmd = XEN_DOMCTL_SCHEDOP_getinfo;
+
+    err = do_domctl(xc_handle, &domctl);
+    if ( err == 0 )
+        *sdom = domctl.u.scheduler_op.u.credit2;
+
+    return err;
+}
diff -r 63531e640828 tools/libxc/xenctrl.h
--- a/tools/libxc/xenctrl.h	Mon Dec 07 17:01:11 2009 +0000
+++ b/tools/libxc/xenctrl.h	Mon Dec 21 11:45:00 2009 +0000
@@ -469,6 +469,14 @@
                                uint32_t domid,
                                struct xen_domctl_sched_credit *sdom);
 
+int xc_sched_credit2_domain_set(int xc_handle,
+                               uint32_t domid,
+                               struct xen_domctl_sched_credit2 *sdom);
+
+int xc_sched_credit2_domain_get(int xc_handle,
+                               uint32_t domid,
+                               struct xen_domctl_sched_credit2 *sdom);
+
 /**
  * This function sends a trigger to a domain.
  *
diff -r 63531e640828 tools/python/xen/lowlevel/xc/xc.c
--- a/tools/python/xen/lowlevel/xc/xc.c	Mon Dec 07 17:01:11 2009 +0000
+++ b/tools/python/xen/lowlevel/xc/xc.c	Mon Dec 21 11:45:00 2009 +0000
@@ -1374,6 +1374,45 @@
                          "cap",     sdom.cap);
 }
 
+static PyObject *pyxc_sched_credit2_domain_set(XcObject *self,
+                                              PyObject *args,
+                                              PyObject *kwds)
+{
+    uint32_t domid;
+    uint16_t weight;
+    static char *kwd_list[] = { "domid", "weight", NULL };
+    static char kwd_type[] = "I|H";
+    struct xen_domctl_sched_credit2 sdom;
+    
+    weight = 0;
+    if( !PyArg_ParseTupleAndKeywords(args, kwds, kwd_type, kwd_list, 
+                                     &domid, &weight) )
+        return NULL;
+
+    sdom.weight = weight;
+
+    if ( xc_sched_credit2_domain_set(self->xc_handle, domid, &sdom) != 0 )
+        return pyxc_error_to_exception();
+
+    Py_INCREF(zero);
+    return zero;
+}
+
+static PyObject *pyxc_sched_credit2_domain_get(XcObject *self, PyObject *args)
+{
+    uint32_t domid;
+    struct xen_domctl_sched_credit2 sdom;
+    
+    if( !PyArg_ParseTuple(args, "I", &domid) )
+        return NULL;
+    
+    if ( xc_sched_credit2_domain_get(self->xc_handle, domid, &sdom) != 0 )
+        return pyxc_error_to_exception();
+
+    return Py_BuildValue("{s:H}",
+                         "weight",  sdom.weight);
+}
+
 static PyObject *pyxc_domain_setmaxmem(XcObject *self, PyObject *args)
 {
     uint32_t dom;
@@ -1912,6 +1951,24 @@
       "Returns:   [dict]\n"
       " weight    [short]: domain's scheduling weight\n"},
 
+    { "sched_credit2_domain_set",
+      (PyCFunction)pyxc_sched_credit2_domain_set,
+      METH_KEYWORDS, "\n"
+      "Set the scheduling parameters for a domain when running with the\n"
+      "SMP credit2 scheduler.\n"
+      " domid     [int]:   domain id to set\n"
+      " weight    [short]: domain's scheduling weight\n"
+      "Returns: [int] 0 on success; -1 on error.\n" },
+
+    { "sched_credit2_domain_get",
+      (PyCFunction)pyxc_sched_credit2_domain_get,
+      METH_VARARGS, "\n"
+      "Get the scheduling parameters for a domain when running with the\n"
+      "SMP credit2 scheduler.\n"
+      " domid     [int]:   domain id to get\n"
+      "Returns:   [dict]\n"
+      " weight    [short]: domain's scheduling weight\n"},
+
     { "evtchn_alloc_unbound", 
       (PyCFunction)pyxc_evtchn_alloc_unbound,
       METH_VARARGS | METH_KEYWORDS, "\n"
@@ -2272,6 +2329,7 @@
     /* Expose some libxc constants to Python */
     PyModule_AddIntConstant(m, "XEN_SCHEDULER_SEDF", XEN_SCHEDULER_SEDF);
     PyModule_AddIntConstant(m, "XEN_SCHEDULER_CREDIT", XEN_SCHEDULER_CREDIT);
+    PyModule_AddIntConstant(m, "XEN_SCHEDULER_CREDIT2", XEN_SCHEDULER_CREDIT2);
 
 }
 
diff -r 63531e640828 tools/python/xen/xend/XendAPI.py
--- a/tools/python/xen/xend/XendAPI.py	Mon Dec 07 17:01:11 2009 +0000
+++ b/tools/python/xen/xend/XendAPI.py	Mon Dec 21 11:45:00 2009 +0000
@@ -1626,8 +1626,7 @@
         if 'weight' in xeninfo.info['vcpus_params'] \
            and 'cap' in xeninfo.info['vcpus_params']:
             weight = xeninfo.info['vcpus_params']['weight']
-            cap = xeninfo.info['vcpus_params']['cap']
-            xendom.domain_sched_credit_set(xeninfo.getDomid(), weight, cap)
+            xendom.domain_sched_credit2_set(xeninfo.getDomid(), weight)
 
     def VM_set_VCPUs_number_live(self, _, vm_ref, num):
         dom = XendDomain.instance().get_vm_by_uuid(vm_ref)
diff -r 63531e640828 tools/python/xen/xend/XendDomain.py
--- a/tools/python/xen/xend/XendDomain.py	Mon Dec 07 17:01:11 2009 +0000
+++ b/tools/python/xen/xend/XendDomain.py	Mon Dec 21 11:45:00 2009 +0000
@@ -1757,6 +1757,60 @@
             log.exception(ex)
             raise XendError(str(ex))
 
+    def domain_sched_credit2_get(self, domid):
+        """Get credit2 scheduler parameters for a domain.
+
+        @param domid: Domain ID or Name
+        @type domid: int or string.
+        @rtype: dict with keys 'weight'
+        @return: credit2 scheduler parameters
+        """
+        dominfo = self.domain_lookup_nr(domid)
+        if not dominfo:
+            raise XendInvalidDomain(str(domid))
+        
+        if dominfo._stateGet() in (DOM_STATE_RUNNING, DOM_STATE_PAUSED):
+            try:
+                return xc.sched_credit2_domain_get(dominfo.getDomid())
+            except Exception, ex:
+                raise XendError(str(ex))
+        else:
+            return {'weight' : dominfo.getWeight()}
+    
+    def domain_sched_credit2_set(self, domid, weight = None):
+        """Set credit2 scheduler parameters for a domain.
+
+        @param domid: Domain ID or Name
+        @type domid: int or string.
+        @type weight: int
+        @rtype: 0
+        """
+        set_weight = False
+        dominfo = self.domain_lookup_nr(domid)
+        if not dominfo:
+            raise XendInvalidDomain(str(domid))
+        try:
+            if weight is None:
+                weight = int(0)
+            elif weight < 1 or weight > 65535:
+                raise XendError("weight is out of range")
+            else:
+                set_weight = True
+
+            assert type(weight) == int
+
+            rc = 0
+            if dominfo._stateGet() in (DOM_STATE_RUNNING, DOM_STATE_PAUSED):
+                rc = xc.sched_credit2_domain_set(dominfo.getDomid(), weight)
+            if rc == 0:
+                if set_weight:
+                    dominfo.setWeight(weight)
+                self.managed_config_save(dominfo)
+            return rc
+        except Exception, ex:
+            log.exception(ex)
+            raise XendError(str(ex))
+
     def domain_maxmem_set(self, domid, mem):
         """Set the memory limit for a domain.
 
diff -r 63531e640828 tools/python/xen/xend/XendDomainInfo.py
--- a/tools/python/xen/xend/XendDomainInfo.py	Mon Dec 07 17:01:11 2009 +0000
+++ b/tools/python/xen/xend/XendDomainInfo.py	Mon Dec 21 11:45:00 2009 +0000
@@ -2719,6 +2719,10 @@
             XendDomain.instance().domain_sched_credit_set(self.getDomid(),
                                                           self.getWeight(),
                                                           self.getCap())
+        elif XendNode.instance().xenschedinfo() == 'credit2':
+            from xen.xend import XendDomain
+            XendDomain.instance().domain_sched_credit2_set(self.getDomid(),
+                                                           self.getWeight())
 
     def _initDomain(self):
         log.debug('XendDomainInfo.initDomain: %s %s',
diff -r 63531e640828 tools/python/xen/xend/XendNode.py
--- a/tools/python/xen/xend/XendNode.py	Mon Dec 07 17:01:11 2009 +0000
+++ b/tools/python/xen/xend/XendNode.py	Mon Dec 21 11:45:00 2009 +0000
@@ -760,6 +760,8 @@
             return 'sedf'
         elif sched_id == xen.lowlevel.xc.XEN_SCHEDULER_CREDIT:
             return 'credit'
+        elif sched_id == xen.lowlevel.xc.XEN_SCHEDULER_CREDIT2:
+            return 'credit2'
         else:
             return 'unknown'
 
@@ -961,6 +963,8 @@
             return 'sedf'
         elif sched_id == xen.lowlevel.xc.XEN_SCHEDULER_CREDIT:
             return 'credit'
+        elif sched_id == xen.lowlevel.xc.XEN_SCHEDULER_CREDIT2:
+            return 'credit2'
         else:
             return 'unknown'
 
diff -r 63531e640828 tools/python/xen/xend/XendVMMetrics.py
--- a/tools/python/xen/xend/XendVMMetrics.py	Mon Dec 07 17:01:11 2009 +0000
+++ b/tools/python/xen/xend/XendVMMetrics.py	Mon Dec 21 11:45:00 2009 +0000
@@ -129,6 +129,7 @@
                 params_live['cpumap%i' % i] = \
                     ",".join(map(str, info['cpumap']))
 
+                # FIXME: credit2??
             params_live.update(xc.sched_credit_domain_get(domid))
             
             return params_live
diff -r 63531e640828 tools/python/xen/xend/server/SrvDomain.py
--- a/tools/python/xen/xend/server/SrvDomain.py	Mon Dec 07 17:01:11 2009 +0000
+++ b/tools/python/xen/xend/server/SrvDomain.py	Mon Dec 21 11:45:00 2009 +0000
@@ -163,6 +163,20 @@
         val = fn(req.args, {'dom': self.dom.getName()})
         return val
 
+    def op_domain_sched_credit2_get(self, _, req):
+        fn = FormFn(self.xd.domain_sched_credit2_get,
+                    [['dom', 'str']])
+        val = fn(req.args, {'dom': self.dom.getName()})
+        return val
+
+
+    def op_domain_sched_credit2_set(self, _, req):
+        fn = FormFn(self.xd.domain_sched_credit2_set,
+                    [['dom', 'str'],
+                     ['weight', 'int']])
+        val = fn(req.args, {'dom': self.dom.getName()})
+        return val
+
     def op_maxmem_set(self, _, req):
         return self.call(self.dom.setMemoryMaximum,
                          [['memory', 'int']],
diff -r 63531e640828 tools/python/xen/xm/main.py
--- a/tools/python/xen/xm/main.py	Mon Dec 07 17:01:11 2009 +0000
+++ b/tools/python/xen/xm/main.py	Mon Dec 21 11:45:00 2009 +0000
@@ -150,6 +150,8 @@
     'sched-sedf'  : ('<Domain> [options]', 'Get/set EDF parameters.'),
     'sched-credit': ('[-d <Domain> [-w[=WEIGHT]|-c[=CAP]]]',
                      'Get/set credit scheduler parameters.'),
+    'sched-credit2': ('[-d <Domain> [-w[=WEIGHT]]',
+                     'Get/set credit2 scheduler parameters.'),
     'sysrq'       : ('<Domain> <letter>', 'Send a sysrq to a domain.'),
     'debug-keys'  : ('<Keys>', 'Send debug keys to Xen.'),
     'trigger'     : ('<Domain> <nmi|reset|init|s3resume|power> [<VCPU>]',
@@ -265,6 +267,10 @@
        ('-w WEIGHT', '--weight=WEIGHT', 'Weight (int)'),
        ('-c CAP',    '--cap=CAP',       'Cap (int)'),
     ),
+    'sched-credit2': (
+       ('-d DOMAIN', '--domain=DOMAIN', 'Domain to modify'),
+       ('-w WEIGHT', '--weight=WEIGHT', 'Weight (int)'),
+    ),
     'list': (
        ('-l', '--long',         'Output all VM details in SXP'),
        ('', '--label',          'Include security labels'),
@@ -406,6 +412,7 @@
     ]
 
 scheduler_commands = [
+    "sched-credit2",
     "sched-credit",
     "sched-sedf",
     ]
@@ -1720,6 +1727,80 @@
             if result != 0:
                 err(str(result))
 
+def xm_sched_credit2(args):
+    """Get/Set options for Credit2 Scheduler."""
+    
+    check_sched_type('credit2')
+
+    try:
+        opts, params = getopt.getopt(args, "d:w:",
+            ["domain=", "weight="])
+    except getopt.GetoptError, opterr:
+        err(opterr)
+        usage('sched-credit2')
+
+    domid = None
+    weight = None
+
+    for o, a in opts:
+        if o in ["-d", "--domain"]:
+            domid = a
+        elif o in ["-w", "--weight"]:
+            weight = int(a)
+
+    doms = filter(lambda x : domid_match(domid, x),
+                  [parse_doms_info(dom)
+                  for dom in getDomains(None, 'all')])
+
+    if weight is None:
+        if domid is not None and doms == []: 
+            err("Domain '%s' does not exist." % domid)
+            usage('sched-credit2')
+        # print header if we aren't setting any parameters
+        print '%-33s %4s %6s' % ('Name','ID','Weight')
+        
+        for d in doms:
+            try:
+                if serverType == SERVER_XEN_API:
+                    info = server.xenapi.VM_metrics.get_VCPUs_params(
+                        server.xenapi.VM.get_metrics(
+                            get_single_vm(d['name'])))
+                else:
+                    info = server.xend.domain.sched_credit2_get(d['name'])
+            except xmlrpclib.Fault:
+                pass
+
+            if 'weight' not in info:
+                # domain does not support sched-credit2?
+                info = {'weight': -1}
+
+            info['weight'] = int(info['weight'])
+            
+            info['name']  = d['name']
+            info['domid'] = str(d['domid'])
+            print( ("%(name)-32s %(domid)5s %(weight)6d") % info)
+    else:
+        if domid is None:
+            # place holder for system-wide scheduler parameters
+            err("No domain given.")
+            usage('sched-credit2')
+
+        if serverType == SERVER_XEN_API:
+            if doms[0]['domid']:
+                server.xenapi.VM.add_to_VCPUs_params_live(
+                    get_single_vm(domid),
+                    "weight",
+                    weight)
+            else:
+                server.xenapi.VM.add_to_VCPUs_params(
+                    get_single_vm(domid),
+                    "weight",
+                    weight)
+        else:
+            result = server.xend.domain.sched_credit2_set(domid, weight)
+            if result != 0:
+                err(str(result))
+
 def xm_info(args):
     arg_check(args, "info", 0, 1)
     
@@ -3341,6 +3422,7 @@
     # scheduler
     "sched-sedf": xm_sched_sedf,
     "sched-credit": xm_sched_credit,
+    "sched-credit2": xm_sched_credit2,
     # block
     "block-attach": xm_block_attach,
     "block-detach": xm_block_detach,

[-- Attachment #6: Type: text/plain, Size: 138 bytes --]

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xensource.com
http://lists.xensource.com/xen-devel

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: Re: [PATCH] [RFC] Credit2 scheduler prototype
  2010-01-13 14:48       ` George Dunlap
@ 2010-01-13 15:16         ` Keir Fraser
  2010-01-13 16:05           ` George Dunlap
  0 siblings, 1 reply; 11+ messages in thread
From: Keir Fraser @ 2010-01-13 15:16 UTC (permalink / raw)
  To: George Dunlap; +Cc: xen-devel@lists.xensource.com

On 13/01/2010 14:48, "George Dunlap" <George.Dunlap@eu.citrix.com> wrote:

> The first implements something like what you suggest below, but
> instead of using a sort of "hack" with VPF_migrate, it makes a proper
> "context_saved" SCHED_OP callback.

I thought using the vcpu_migrate() path might work well since you presumably
have logic there to pick a new cpu which is relatively unloaded, making the
cpu which tried to schedule the vcpu but had to idle instead a prime
candidate. So rather than having to implement a new callback hook, you'd get
to leverage the pick_cpu hook for free?

> The second addresses the fact that when sharing runqueues,
> v->processor may change quickly without an explicit migrate.

I can't think of a better solution for this one.

 -- Keir

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: Re: [PATCH] [RFC] Credit2 scheduler prototype
  2010-01-13 15:16         ` Keir Fraser
@ 2010-01-13 16:05           ` George Dunlap
  2010-01-13 16:36             ` Keir Fraser
  0 siblings, 1 reply; 11+ messages in thread
From: George Dunlap @ 2010-01-13 16:05 UTC (permalink / raw)
  To: Keir Fraser; +Cc: xen-devel@lists.xensource.com

On Wed, Jan 13, 2010 at 3:16 PM, Keir Fraser <keir.fraser@eu.citrix.com> wrote:
> On 13/01/2010 14:48, "George Dunlap" <George.Dunlap@eu.citrix.com> wrote:
>
>> The first implements something like what you suggest below, but
>> instead of using a sort of "hack" with VPF_migrate, it makes a proper
>> "context_saved" SCHED_OP callback.
>
> I thought using the vcpu_migrate() path might work well since you presumably
> have logic there to pick a new cpu which is relatively unloaded, making the
> cpu which tried to schedule the vcpu but had to idle instead a prime
> candidate. So rather than having to implement a new callback hook, you'd get
> to leverage the pick_cpu hook for free?

Hmm, not sure that actually gives us the leverage we need to solve all
the races.  If you look at sched_credit2.c (in the credit2-hypervisor
patch), you'll see I added two flags to the private vcpu struct: one
to indicate that the vcpu has (or may have) context somewhere on a
cpu, and thus can't be added to the runqueue; another to indicate that
when the first flag is cleared, it should be added to the runqueue.
In the current implementation, the first flag is set and cleared every
time a vcpu is scheduled or descheduled, whether it needs to be added
to the runqueue after context_saved() or not.

[NB that the current global lock will eventually be replaced with
per-runqueue locks.]

In particular, one of the races without the first flag looks like this
(brackets indicate physical cpu):
[0] lock cpu0 schedule lock
[0] lock credit2 runqueue lock
[0] Take vX off runqueue; vX->processor == 1
[0] unlock credit2 runqueue lock
[1] vcpu_wake(vX) lock cpu1 schedule lock
[1] finds vX->running false, adds it to the runqueue
[1] unlock cpu1 schedule_lock
[0] vX->running=1
[0] unlock cpu0 schedule lock
[0] lock cpu1 schedule lock (vX->cpu == 1)
[0] vX->cpu = 0
[0] unlock cpu1 schedule lock
[1] takes vX from the runqueue, finds vX->running is true *ERROR*

I guess the real problem here is that vX->running is set even though
the vX->processor schedule lock isn't held, causing a race with
vcpu_wake().  In the other schedulers this can't happen, since it
takes an explicit migrate to change processors.  In the attached
patches, csched2 operations serialize on the runqueue lock, fixing
that particular race.

Can't think of a better solution off the top of my head; I'll give it
some thought.

 -George

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: Re: [PATCH] [RFC] Credit2 scheduler prototype
  2010-01-13 16:05           ` George Dunlap
@ 2010-01-13 16:36             ` Keir Fraser
  2010-01-13 16:43               ` George Dunlap
  0 siblings, 1 reply; 11+ messages in thread
From: Keir Fraser @ 2010-01-13 16:36 UTC (permalink / raw)
  To: George Dunlap; +Cc: xen-devel@lists.xensource.com

On 13/01/2010 16:05, "George Dunlap" <George.Dunlap@eu.citrix.com> wrote:

> [NB that the current global lock will eventually be replaced with
> per-runqueue locks.]
> 
> In particular, one of the races without the first flag looks like this
> (brackets indicate physical cpu):
> [0] lock cpu0 schedule lock
> [0] lock credit2 runqueue lock
> [0] Take vX off runqueue; vX->processor == 1
> [0] unlock credit2 runqueue lock
> [1] vcpu_wake(vX) lock cpu1 schedule lock
> [1] finds vX->running false, adds it to the runqueue
> [1] unlock cpu1 schedule_lock

Actually, hang on. Doesn't this issue, and the one that your second patch
addresses, go away if we change the schedule_lock granularity to match
runqueue granularity? That would seem pretty sensible, and could be
implemented with a schedule_lock(cpu) scheduler hook, returning a
spinlock_t*, and a some easy scheduler code changes.

If we do that, do you then even need separate private per-runqueue locks?
(Just an extra thought).

 -- Keir

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: Re: [PATCH] [RFC] Credit2 scheduler prototype
  2010-01-13 16:36             ` Keir Fraser
@ 2010-01-13 16:43               ` George Dunlap
  2010-01-28 23:27                 ` Dulloor
  0 siblings, 1 reply; 11+ messages in thread
From: George Dunlap @ 2010-01-13 16:43 UTC (permalink / raw)
  To: Keir Fraser; +Cc: xen-devel@lists.xensource.com

Keir Fraser wrote:
> On 13/01/2010 16:05, "George Dunlap" <George.Dunlap@eu.citrix.com> wrote:
>
>   
>> [NB that the current global lock will eventually be replaced with
>> per-runqueue locks.]
>>
>> In particular, one of the races without the first flag looks like this
>> (brackets indicate physical cpu):
>> [0] lock cpu0 schedule lock
>> [0] lock credit2 runqueue lock
>> [0] Take vX off runqueue; vX->processor == 1
>> [0] unlock credit2 runqueue lock
>> [1] vcpu_wake(vX) lock cpu1 schedule lock
>> [1] finds vX->running false, adds it to the runqueue
>> [1] unlock cpu1 schedule_lock
>>     
>
> Actually, hang on. Doesn't this issue, and the one that your second patch
> addresses, go away if we change the schedule_lock granularity to match
> runqueue granularity? That would seem pretty sensible, and could be
> implemented with a schedule_lock(cpu) scheduler hook, returning a
> spinlock_t*, and a some easy scheduler code changes.
>
> If we do that, do you then even need separate private per-runqueue locks?
> (Just an extra thought).
>   
Hmm.... can't see anything wrong with it.  It would make the whole 
locking discipline thing a lot simpler.  It would, AFAICT, remove the 
need for private per-runqueue locks, which make it a lot harder to avoid 
deadlock without these sorts of strange tricks. :-)

I'll think about it, and probably give it a spin to see how it works out.

 -George

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: Re: [PATCH] [RFC] Credit2 scheduler prototype
  2010-01-13 16:43               ` George Dunlap
@ 2010-01-28 23:27                 ` Dulloor
  2010-01-29  0:56                   ` George Dunlap
  0 siblings, 1 reply; 11+ messages in thread
From: Dulloor @ 2010-01-28 23:27 UTC (permalink / raw)
  To: George Dunlap; +Cc: xen-devel@lists.xensource.com, Keir Fraser

George,

With your patches and sched=credit2, xen crashes on a failed assertion :
(XEN) ****************************************
(XEN) Panic on CPU 1:
(XEN) Assertion '_spin_is_locked(&(*({ unsigned long __ptr; __asm__ ("" : "=r"(*
(XEN)

Is this version supposed to work (or is it just some reference code) ?

thanks
dulloor


On Wed, Jan 13, 2010 at 11:43 AM, George Dunlap
<george.dunlap@eu.citrix.com> wrote:
> Keir Fraser wrote:
>>
>> On 13/01/2010 16:05, "George Dunlap" <George.Dunlap@eu.citrix.com> wrote:
>>
>>
>>>
>>> [NB that the current global lock will eventually be replaced with
>>> per-runqueue locks.]
>>>
>>> In particular, one of the races without the first flag looks like this
>>> (brackets indicate physical cpu):
>>> [0] lock cpu0 schedule lock
>>> [0] lock credit2 runqueue lock
>>> [0] Take vX off runqueue; vX->processor == 1
>>> [0] unlock credit2 runqueue lock
>>> [1] vcpu_wake(vX) lock cpu1 schedule lock
>>> [1] finds vX->running false, adds it to the runqueue
>>> [1] unlock cpu1 schedule_lock
>>>
>>
>> Actually, hang on. Doesn't this issue, and the one that your second patch
>> addresses, go away if we change the schedule_lock granularity to match
>> runqueue granularity? That would seem pretty sensible, and could be
>> implemented with a schedule_lock(cpu) scheduler hook, returning a
>> spinlock_t*, and a some easy scheduler code changes.
>>
>> If we do that, do you then even need separate private per-runqueue locks?
>> (Just an extra thought).
>>
>
> Hmm.... can't see anything wrong with it.  It would make the whole locking
> discipline thing a lot simpler.  It would, AFAICT, remove the need for
> private per-runqueue locks, which make it a lot harder to avoid deadlock
> without these sorts of strange tricks. :-)
>
> I'll think about it, and probably give it a spin to see how it works out.
>
> -George
>
> _______________________________________________
> Xen-devel mailing list
> Xen-devel@lists.xensource.com
> http://lists.xensource.com/xen-devel
>

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: Re: [PATCH] [RFC] Credit2 scheduler prototype
  2010-01-28 23:27                 ` Dulloor
@ 2010-01-29  0:56                   ` George Dunlap
  0 siblings, 0 replies; 11+ messages in thread
From: George Dunlap @ 2010-01-29  0:56 UTC (permalink / raw)
  To: Dulloor; +Cc: xen-devel@lists.xensource.com, Keir Fraser

Since it's an assertion, I assume you ran it with debug=y?

I'm definitely changing some assumptions with this, so it's not a
surprise that some assertions trigger.

I'm working on a modified version based on the discussion we had here;
I'll post a patch (tested with debug=y) when I'm done.

-George

On Thu, Jan 28, 2010 at 11:27 PM, Dulloor <dulloor@gmail.com> wrote:
> George,
>
> With your patches and sched=credit2, xen crashes on a failed assertion :
> (XEN) ****************************************
> (XEN) Panic on CPU 1:
> (XEN) Assertion '_spin_is_locked(&(*({ unsigned long __ptr; __asm__ ("" : "=r"(*
> (XEN)
>
> Is this version supposed to work (or is it just some reference code) ?
>
> thanks
> dulloor
>
>
> On Wed, Jan 13, 2010 at 11:43 AM, George Dunlap
> <george.dunlap@eu.citrix.com> wrote:
>> Keir Fraser wrote:
>>>
>>> On 13/01/2010 16:05, "George Dunlap" <George.Dunlap@eu.citrix.com> wrote:
>>>
>>>
>>>>
>>>> [NB that the current global lock will eventually be replaced with
>>>> per-runqueue locks.]
>>>>
>>>> In particular, one of the races without the first flag looks like this
>>>> (brackets indicate physical cpu):
>>>> [0] lock cpu0 schedule lock
>>>> [0] lock credit2 runqueue lock
>>>> [0] Take vX off runqueue; vX->processor == 1
>>>> [0] unlock credit2 runqueue lock
>>>> [1] vcpu_wake(vX) lock cpu1 schedule lock
>>>> [1] finds vX->running false, adds it to the runqueue
>>>> [1] unlock cpu1 schedule_lock
>>>>
>>>
>>> Actually, hang on. Doesn't this issue, and the one that your second patch
>>> addresses, go away if we change the schedule_lock granularity to match
>>> runqueue granularity? That would seem pretty sensible, and could be
>>> implemented with a schedule_lock(cpu) scheduler hook, returning a
>>> spinlock_t*, and a some easy scheduler code changes.
>>>
>>> If we do that, do you then even need separate private per-runqueue locks?
>>> (Just an extra thought).
>>>
>>
>> Hmm.... can't see anything wrong with it.  It would make the whole locking
>> discipline thing a lot simpler.  It would, AFAICT, remove the need for
>> private per-runqueue locks, which make it a lot harder to avoid deadlock
>> without these sorts of strange tricks. :-)
>>
>> I'll think about it, and probably give it a spin to see how it works out.
>>
>> -George
>>
>> _______________________________________________
>> Xen-devel mailing list
>> Xen-devel@lists.xensource.com
>> http://lists.xensource.com/xen-devel
>>
>
> _______________________________________________
> Xen-devel mailing list
> Xen-devel@lists.xensource.com
> http://lists.xensource.com/xen-devel
>

^ permalink raw reply	[flat|nested] 11+ messages in thread

end of thread, other threads:[~2010-01-29  0:56 UTC | newest]

Thread overview: 11+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2009-12-07 17:02 [PATCH] [RFC] Credit2 scheduler prototype George Dunlap
2009-12-07 17:45 ` Keir Fraser
2009-12-08 14:48   ` George Dunlap
2009-12-08 18:20     ` Keir Fraser
2010-01-13 14:48       ` George Dunlap
2010-01-13 15:16         ` Keir Fraser
2010-01-13 16:05           ` George Dunlap
2010-01-13 16:36             ` Keir Fraser
2010-01-13 16:43               ` George Dunlap
2010-01-28 23:27                 ` Dulloor
2010-01-29  0:56                   ` George Dunlap

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).