xen-devel.lists.xenproject.org archive mirror
* [PATCH 0 of 5] Add credit2 scheduler (EXPERIMENTAL)
From: George Dunlap @ 2010-04-14 10:26 UTC
  To: xen-devel; +Cc: george.dunlap

This patch series introduces the credit2 scheduler.  The first two patches
introduce changes necessary to allow the credit2 shared runqueue
functionality to work properly; the third adds a scheduler-specific trace
class; and the last two add the scheduler itself and the toolstack options
to control it.

The scheduler is still in the experimental phase.  There is plenty of
opportunity to contribute along independent lines of development; email
George Dunlap <george.dunlap@eu.citrix.com> or check out the wiki page
http://wiki.xensource.com/xenwiki/Credit2_Scheduler_Development for ideas
and status updates.
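
With the series applied, credit2 is compiled in but is not the default;
it is selected at boot via the scheduler's opt_name, "credit2" (see patch
4), using Xen's sched= command-line option.  An assumed GRUB-style entry:

    kernel /boot/xen.gz sched=credit2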

19 files changed, 1453 insertions(+), 21 deletions(-)
tools/libxc/Makefile                      |    1 
tools/libxc/xc_csched2.c                  |   50 +
tools/libxc/xenctrl.h                     |    8 
tools/python/xen/lowlevel/xc/xc.c         |   58 +
tools/python/xen/xend/XendAPI.py          |    3 
tools/python/xen/xend/XendDomain.py       |   54 +
tools/python/xen/xend/XendDomainInfo.py   |    4 
tools/python/xen/xend/XendNode.py         |    4 
tools/python/xen/xend/XendVMMetrics.py    |    1 
tools/python/xen/xend/server/SrvDomain.py |   14 
tools/python/xen/xm/main.py               |   82 ++
xen/arch/ia64/vmx/vmmu.c                  |    6 
xen/common/Makefile                       |    1 
xen/common/sched_credit.c                 |    8 
xen/common/sched_credit2.c                | 1125 +++++++++++++++++++++++++++++
xen/common/schedule.c                     |   22 
xen/include/public/domctl.h               |    4 
xen/include/public/trace.h                |    1 
xen/include/xen/sched-if.h                |   28 


* [PATCH 1 of 5] credit2: Add context_saved scheduler callback
From: George Dunlap @ 2010-04-14 10:26 UTC
  To: xen-devel; +Cc: george.dunlap

2 files changed, 3 insertions(+)
xen/common/schedule.c      |    2 ++
xen/include/xen/sched-if.h |    1 +


Because credit2 shares a runqueue between several cpus, it needs
to know when a scheduled-out vcpu has finally been context-switched
away so that the vcpu can be added back to the runqueue.  (Otherwise it
might be grabbed by another processor before its context has been
properly saved.)
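
As a sketch of how a shared-runqueue scheduler consumes this hook, the
following is a simplified version of the protocol patch 4 implements in
csched_context_saved() (locking via vcpu_schedule_lock_irq() omitted;
see that patch for the real code):

    static void example_context_saved(struct vcpu *prev)
    {
        struct csched_vcpu *svc = prev->sched_priv;

        /* prev's register state is now saved; another pcpu sharing the
         * runqueue may safely pick it up again. */
        clear_bit(__CSFLAG_scheduled, &svc->flags);

        /* If a wakeup arrived mid-switch, the wake path deferred the
         * runqueue insertion to this point. */
        if ( test_and_clear_bit(__CSFLAG_delayed_runq_add, &svc->flags) )
        {
            runq_insert(prev->processor, svc);
            runq_tickle(prev->processor, svc, NOW());
        }
    }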

Signed-off-by: George Dunlap <george.dunlap@eu.citrix.com>

diff -r c02cc832cb2d -r 2631707c54b3 xen/common/schedule.c
--- a/xen/common/schedule.c	Tue Apr 13 18:19:33 2010 +0100
+++ b/xen/common/schedule.c	Wed Apr 14 11:16:58 2010 +0100
@@ -923,6 +923,8 @@
     /* Check for migration request /after/ clearing running flag. */
     smp_mb();
 
+    SCHED_OP(context_saved, prev);
+
     if ( unlikely(test_bit(_VPF_migrating, &prev->pause_flags)) )
         vcpu_migrate(prev);
 }
diff -r c02cc832cb2d -r 2631707c54b3 xen/include/xen/sched-if.h
--- a/xen/include/xen/sched-if.h	Tue Apr 13 18:19:33 2010 +0100
+++ b/xen/include/xen/sched-if.h	Wed Apr 14 11:16:58 2010 +0100
@@ -70,6 +70,7 @@
 
     void         (*sleep)          (struct vcpu *);
     void         (*wake)           (struct vcpu *);
+    void         (*context_saved)  (struct vcpu *);
 
     struct task_slice (*do_schedule) (s_time_t);


* [PATCH 2 of 5] credit2: Flexible cpu-to-schedule-spinlock mappings
From: George Dunlap @ 2010-04-14 10:26 UTC
  To: xen-devel; +Cc: george.dunlap

4 files changed, 40 insertions(+), 19 deletions(-)
xen/arch/ia64/vmx/vmmu.c   |    6 +++---
xen/common/sched_credit.c  |    8 ++++----
xen/common/schedule.c      |   18 ++++++++++--------
xen/include/xen/sched-if.h |   27 +++++++++++++++++++++++----


Credit2 shares a runqueue between several cpus.  Rather than introduce
double locking and deal with cpu-to-runqueue races, allow the scheduler
to redefine the cpu-to-schedule-lock mapping.
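
To illustrate the new interface: a scheduler that wants all cpus of a
runqueue to share one lock simply repoints the per-cpu lock pointers at
init time.  A minimal sketch of what patch 4's csched_init() does with
its runqueue map:

    /* Point every cpu at its runqueue's "master" lock; patch 4 derives
     * the master cpu from csched_priv.runq_map[]. */
    for_each_possible_cpu ( cpu )
    {
        int master = csched_priv.runq_map[cpu];

        per_cpu(schedule_data, cpu).schedule_lock =
            &per_cpu(schedule_data, master)._lock;
    }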

Signed-off-by: George Dunlap <george.dunlap@eu.citrix.com>

diff -r 2631707c54b3 -r 21d0f640b0c0 xen/arch/ia64/vmx/vmmu.c
--- a/xen/arch/ia64/vmx/vmmu.c	Wed Apr 14 11:16:58 2010 +0100
+++ b/xen/arch/ia64/vmx/vmmu.c	Wed Apr 14 11:16:58 2010 +0100
@@ -394,7 +394,7 @@
     if (cpu != current->processor)
         return;
     local_irq_save(flags);
-    if (!spin_trylock(&per_cpu(schedule_data, cpu).schedule_lock))
+    if (!spin_trylock(per_cpu(schedule_data, cpu).schedule_lock))
         goto bail2;
     if (v->processor != cpu)
         goto bail1;
@@ -416,7 +416,7 @@
     ia64_dv_serialize_data();
     args->vcpu = NULL;
 bail1:
-    spin_unlock(&per_cpu(schedule_data, cpu).schedule_lock);
+    spin_unlock(per_cpu(schedule_data, cpu).schedule_lock);
 bail2:
     local_irq_restore(flags);
 }
@@ -446,7 +446,7 @@
         do {
             cpu = v->processor;
             if (cpu != current->processor) {
-                spin_barrier(&per_cpu(schedule_data, cpu).schedule_lock);
+                spin_barrier(per_cpu(schedule_data, cpu).schedule_lock);
                 /* Flush VHPT on remote processors. */
                 smp_call_function_single(cpu, &ptc_ga_remote_func, &args, 1);
             } else {
diff -r 2631707c54b3 -r 21d0f640b0c0 xen/common/sched_credit.c
--- a/xen/common/sched_credit.c	Wed Apr 14 11:16:58 2010 +0100
+++ b/xen/common/sched_credit.c	Wed Apr 14 11:16:58 2010 +0100
@@ -789,7 +789,7 @@
 
     spc->runq_sort_last = sort_epoch;
 
-    spin_lock_irqsave(&per_cpu(schedule_data, cpu).schedule_lock, flags);
+    spin_lock_irqsave(per_cpu(schedule_data, cpu).schedule_lock, flags);
 
     runq = &spc->runq;
     elem = runq->next;
@@ -814,7 +814,7 @@
         elem = next;
     }
 
-    spin_unlock_irqrestore(&per_cpu(schedule_data, cpu).schedule_lock, flags);
+    spin_unlock_irqrestore(per_cpu(schedule_data, cpu).schedule_lock, flags);
 }
 
 static void
@@ -1130,7 +1130,7 @@
          * cause a deadlock if the peer CPU is also load balancing and trying
          * to lock this CPU.
          */
-        if ( !spin_trylock(&per_cpu(schedule_data, peer_cpu).schedule_lock) )
+        if ( !spin_trylock(per_cpu(schedule_data, peer_cpu).schedule_lock) )
         {
             CSCHED_STAT_CRANK(steal_trylock_failed);
             continue;
@@ -1140,7 +1140,7 @@
          * Any work over there to steal?
          */
         speer = csched_runq_steal(peer_cpu, cpu, snext->pri);
-        spin_unlock(&per_cpu(schedule_data, peer_cpu).schedule_lock);
+        spin_unlock(per_cpu(schedule_data, peer_cpu).schedule_lock);
         if ( speer != NULL )
             return speer;
     }
diff -r 2631707c54b3 -r 21d0f640b0c0 xen/common/schedule.c
--- a/xen/common/schedule.c	Wed Apr 14 11:16:58 2010 +0100
+++ b/xen/common/schedule.c	Wed Apr 14 11:16:58 2010 +0100
@@ -131,7 +131,7 @@
     s_time_t delta;
 
     ASSERT(v->runstate.state != new_state);
-    ASSERT(spin_is_locked(&per_cpu(schedule_data,v->processor).schedule_lock));
+    ASSERT(spin_is_locked(per_cpu(schedule_data,v->processor).schedule_lock));
 
     vcpu_urgent_count_update(v);
 
@@ -340,7 +340,7 @@
     /* Switch to new CPU, then unlock old CPU. */
     v->processor = new_cpu;
     spin_unlock_irqrestore(
-        &per_cpu(schedule_data, old_cpu).schedule_lock, flags);
+        per_cpu(schedule_data, old_cpu).schedule_lock, flags);
 
     /* Wake on new CPU. */
     vcpu_wake(v);
@@ -846,7 +846,7 @@
 
     sd = &this_cpu(schedule_data);
 
-    spin_lock_irq(&sd->schedule_lock);
+    spin_lock_irq(sd->schedule_lock);
 
     stop_timer(&sd->s_timer);
     
@@ -862,7 +862,7 @@
 
     if ( unlikely(prev == next) )
     {
-        spin_unlock_irq(&sd->schedule_lock);
+        spin_unlock_irq(sd->schedule_lock);
         trace_continue_running(next);
         return continue_running(prev);
     }
@@ -900,7 +900,7 @@
     ASSERT(!next->is_running);
     next->is_running = 1;
 
-    spin_unlock_irq(&sd->schedule_lock);
+    spin_unlock_irq(sd->schedule_lock);
 
     perfc_incr(sched_ctx);
 
@@ -968,7 +968,9 @@
 
     for_each_possible_cpu ( i )
     {
-        spin_lock_init(&per_cpu(schedule_data, i).schedule_lock);
+        spin_lock_init(&per_cpu(schedule_data, i)._lock);
+        per_cpu(schedule_data, i).schedule_lock
+            = &per_cpu(schedule_data, i)._lock;
         init_timer(&per_cpu(schedule_data, i).s_timer, s_timer_fn, NULL, i);
     }
 
@@ -1005,10 +1007,10 @@
 
     for_each_online_cpu ( i )
     {
-        spin_lock(&per_cpu(schedule_data, i).schedule_lock);
+        spin_lock(per_cpu(schedule_data, i).schedule_lock);
         printk("CPU[%02d] ", i);
         SCHED_OP(dump_cpu_state, i);
-        spin_unlock(&per_cpu(schedule_data, i).schedule_lock);
+        spin_unlock(per_cpu(schedule_data, i).schedule_lock);
     }
 
     local_irq_restore(flags);
diff -r 2631707c54b3 -r 21d0f640b0c0 xen/include/xen/sched-if.h
--- a/xen/include/xen/sched-if.h	Wed Apr 14 11:16:58 2010 +0100
+++ b/xen/include/xen/sched-if.h	Wed Apr 14 11:16:58 2010 +0100
@@ -10,8 +10,19 @@
 
 #include <xen/percpu.h>
 
+/*
+ * To allow a scheduler to remap the cpu-to-lock mapping, we have a
+ * per-cpu pointer along with a pre-allocated set of locks.  The
+ * generic scheduler init code points each per-cpu lock pointer at
+ * that cpu's pre-allocated lock; a scheduler that wants to remap
+ * them can simply repoint the per-cpu pointers.
+ *
+ * For better cache behavior, keep the actual lock in the same cache
+ * area as the rest of the struct; the scheduler just points each cpu
+ * at the lock it wants (which may be the one right in front of it). */
 struct schedule_data {
-    spinlock_t          schedule_lock;  /* spinlock protecting curr        */
+    spinlock_t         *schedule_lock,
+                       _lock;
     struct vcpu        *curr;           /* current task                    */
     struct vcpu        *idle;           /* idle task for this cpu          */
     void               *sched_priv;
@@ -27,11 +38,19 @@
 
     for ( ; ; )
     {
+        /* NB: For schedulers with multiple cores per runqueue,
+         * a vcpu may change processor w/o changing runqueues;
+         * so we may release a lock only to grab it again.
+         *
+         * If that is measured to be an issue, then the check
+         * should be changed to checking if the locks pointed to
+         * by cpu and v->processor are still the same.
+         */
         cpu = v->processor;
-        spin_lock(&per_cpu(schedule_data, cpu).schedule_lock);
+        spin_lock(per_cpu(schedule_data, cpu).schedule_lock);
         if ( likely(v->processor == cpu) )
             break;
-        spin_unlock(&per_cpu(schedule_data, cpu).schedule_lock);
+        spin_unlock(per_cpu(schedule_data, cpu).schedule_lock);
     }
 }
 
@@ -42,7 +61,7 @@
 
 static inline void vcpu_schedule_unlock(struct vcpu *v)
 {
-    spin_unlock(&per_cpu(schedule_data, v->processor).schedule_lock);
+    spin_unlock(per_cpu(schedule_data, v->processor).schedule_lock);
 }
 
 #define vcpu_schedule_unlock_irq(v) \


* [PATCH 3 of 5] credit2: Add a scheduler-specific schedule trace class
From: George Dunlap @ 2010-04-14 10:26 UTC
  To: xen-devel; +Cc: george.dunlap

1 file changed, 1 insertion(+)
xen/include/public/trace.h |    1 +
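
This carves out a sub-class of scheduling trace records for
scheduler-specific events.  As patch 4 shows, a scheduler numbers its
events within TRC_SCHED_CLASS and emits them with trace_var(); an
excerpt adapted from that patch's runq_insert():

    #define TRC_CSCHED2_RUNQ_POS (TRC_SCHED_CLASS + 2)

    struct {
        unsigned dom:16, vcpu:16;
        unsigned pos;
    } d;
    d.dom = svc->vcpu->domain->domain_id;
    d.vcpu = svc->vcpu->vcpu_id;
    d.pos = pos;
    trace_var(TRC_CSCHED2_RUNQ_POS, 1, sizeof(d), (unsigned char *)&d);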


Signed-off-by: George Dunlap <george.dunlap@eu.citrix.com>

diff -r 21d0f640b0c0 -r 68636d5fb3df xen/include/public/trace.h
--- a/xen/include/public/trace.h	Wed Apr 14 11:16:58 2010 +0100
+++ b/xen/include/public/trace.h	Wed Apr 14 11:16:58 2010 +0100
@@ -53,6 +53,7 @@
 #define TRC_HVM_HANDLER   0x00082000   /* various HVM handlers      */
 
 #define TRC_SCHED_MIN       0x00021000   /* Just runstate changes */
+#define TRC_SCHED_CLASS     0x00022000   /* Scheduler-specific    */
 #define TRC_SCHED_VERBOSE   0x00028000   /* More inclusive scheduling */
 
 /* Trace events per class */


* [PATCH 4 of 5] credit2: Add credit2 scheduler to hypervisor
From: George Dunlap @ 2010-04-14 10:26 UTC
  To: xen-devel; +Cc: george.dunlap

4 files changed, 1132 insertions(+)
xen/common/Makefile         |    1 
xen/common/sched_credit2.c  | 1125 +++++++++++++++++++++++++++++++++++++++++++
xen/common/schedule.c       |    2 
xen/include/public/domctl.h |    4 


This is the core credit2 patch.  It adds the new credit2 scheduler to the
hypervisor as a non-default scheduler.  It should be emphasized that this
is still in the development phase, and is probably still unstable.  It is
known to be suboptimal on multi-socket systems.
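
To make the credit arithmetic below concrete (see t2c(), c2t() and
burn_credits()): credits burned = time * rqd->max_weight / svc->weight,
so the heaviest vcpu on a runqueue burns exactly 1 credit per nanosecond
and everyone else burns proportionally faster.  A worked example, with
weights assumed purely for illustration:

    /* Assume rqd->max_weight = 512 and svc->weight = 256.  Running for
     * 1ms (1,000,000ns) burns 1,000,000 * 512 / 256 = 2,000,000 credits:
     * a half-weight vcpu burns credit twice as fast as the heaviest
     * vcpu, so it reaches the CSCHED_CREDIT_RESET point sooner. */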

Signed-off-by: George Dunlap <george.dunlap@eu.citrix.com>

diff -r 68636d5fb3df -r 1cdbec67f224 xen/common/Makefile
--- a/xen/common/Makefile	Wed Apr 14 11:16:58 2010 +0100
+++ b/xen/common/Makefile	Wed Apr 14 11:16:58 2010 +0100
@@ -13,6 +13,7 @@
 obj-y += page_alloc.o
 obj-y += rangeset.o
 obj-y += sched_credit.o
+obj-y += sched_credit2.o
 obj-y += sched_sedf.o
 obj-y += schedule.o
 obj-y += shutdown.o
diff -r 68636d5fb3df -r 1cdbec67f224 xen/common/sched_credit2.c
--- /dev/null	Thu Jan 01 00:00:00 1970 +0000
+++ b/xen/common/sched_credit2.c	Wed Apr 14 11:16:58 2010 +0100
@@ -0,0 +1,1125 @@
+
+/****************************************************************************
+ * (C) 2009 - George Dunlap - Citrix Systems R&D UK, Ltd
+ ****************************************************************************
+ *
+ *        File: common/sched_credit2.c
+ *      Author: George Dunlap
+ *
+ * Description: Credit-based SMP CPU scheduler
+ * Based on an earlier version by Emmanuel Ackaouy.
+ */
+
+#include <xen/config.h>
+#include <xen/init.h>
+#include <xen/lib.h>
+#include <xen/sched.h>
+#include <xen/domain.h>
+#include <xen/delay.h>
+#include <xen/event.h>
+#include <xen/time.h>
+#include <xen/perfc.h>
+#include <xen/sched-if.h>
+#include <xen/softirq.h>
+#include <asm/atomic.h>
+#include <xen/errno.h>
+#include <xen/trace.h>
+
+#if __i386__
+#define PRI_stime "lld"
+#else
+#define PRI_stime "ld"
+#endif
+
+#define d2printk(x...)
+//#define d2printk printk
+
+#define TRC_CSCHED2_TICK         (TRC_SCHED_CLASS + 1)
+#define TRC_CSCHED2_RUNQ_POS     (TRC_SCHED_CLASS + 2)
+#define TRC_CSCHED2_CREDIT_BURN  (TRC_SCHED_CLASS + 3)
+#define TRC_CSCHED2_CREDIT_ADD   (TRC_SCHED_CLASS + 4)
+#define TRC_CSCHED2_TICKLE_CHECK (TRC_SCHED_CLASS + 5)
+
+/*
+ * WARNING: This is still in an experimental phase.  Status and work can be found at the
+ * credit2 wiki page:
+ *  http://wiki.xensource.com/xenwiki/Credit2_Scheduler_Development
+ * TODO:
+ * + Immediate bug-fixes
+ *  - Do per-runqueue, grab proper lock for dump debugkey
+ * + Multiple sockets
+ *  - Detect cpu layout and make runqueue map, one per L2 (make_runq_map())
+ *  - Simple load balancer / runqueue assignment
+ *  - Runqueue load measurement
+ *  - Load-based load balancer
+ * + Hyperthreading
+ *  - Look for non-busy core if possible
+ *  - "Discount" time run on a thread with busy siblings
+ * + Algorithm:
+ *  - "Mixed work" problem: if a VM is playing audio (5%) but also burning cpu (e.g.,
+ *    a flash animation in the background) can we schedule it with low enough latency
+ *    so that audio doesn't skip?
+ *  - Cap and reservation: How to implement with the current system?
+ * + Optimizing
+ *  - Profiling, making new algorithms, making math more efficient (no long division)
+ */
+
+/* 
+ * Design:
+ *
+ * VMs "burn" credits based on their weight; higher weight means
+ * credits burn more slowly.  The highest weight vcpu burns credits at
+ * a rate of 1 credit per nanosecond.  Others burn proportionally
+ * more.
+ * 
+ * vcpus are inserted into the runqueue by credit order.
+ *
+ * Credits are "reset" when the next vcpu in the runqueue is less than
+ * or equal to zero.  At that point, everyone's credits are "clipped"
+ * to a small value, and a fixed credit is added to everyone.
+ *
+ * The plan is for all cores that share an L2 to share the same
+ * runqueue.  At the moment, there is one global runqueue for all
+ * cores.
+ */
+
+/*
+ * Locking:
+ * - Schedule-lock is per-runqueue
+ *  + Protects runqueue data, runqueue insertion, &c
+ *  + Also protects updates to private sched vcpu structure
+ *  + Must be grabbed using vcpu_schedule_lock_irq() to make sure vcpu->processor
+ *    doesn't change under our feet.
+ * - Private data lock
+ *  + Protects access to global domain list
+ *  + All other private data is written at init and only read afterwards.
+ * Ordering:
+ * - We grab private before schedule when updating domain weight; so we
+ *  must never grab the private lock while a schedule lock is held.
+ */
+
+/*
+ * Basic constants
+ */
+/* Default weight: How much a new domain starts with */
+#define CSCHED_DEFAULT_WEIGHT       256
+/* Min timer: Minimum length for which a timer will be set,
+ * for the sake of efficiency */
+#define CSCHED_MIN_TIMER            MICROSECS(500)
+/* Amount of credit VMs begin with, and are reset to.
+ * ATM, set so that highest-weight VMs can only run for 10ms
+ * before a reset event. */
+#define CSCHED_CREDIT_INIT          MILLISECS(10)
+/* Carryover: How much "extra" credit may be carried over after
+ * a reset. */
+#define CSCHED_CARRYOVER_MAX        CSCHED_MIN_TIMER
+/* Reset: Value below which credit will be reset. */
+#define CSCHED_CREDIT_RESET         0
+/* Max timer: Maximum time a guest can be run for. */
+#define CSCHED_MAX_TIMER            MILLISECS(2)
+
+
+#define CSCHED_IDLE_CREDIT                 (-(1<<30))
+
+/*
+ * Flags
+ */
+/* CSFLAG_scheduled: Is this vcpu either running on, or context-switching off,
+ * a physical cpu?
+ * + Accessed only with runqueue lock held
+ * + Set when chosen as next in csched_schedule().
+ * + Cleared after context switch has been saved in csched_context_saved()
+ * + Checked in vcpu_wake to see if we can add to the runqueue, or if we should
+ *   set CSFLAG_delayed_runq_add
+ * + Checked to be false in runq_insert.
+ */
+#define __CSFLAG_scheduled 1
+#define CSFLAG_scheduled (1<<__CSFLAG_scheduled)
+/* CSFLAG_delayed_runq_add: Do we need to add this to the runqueue once it's
+ * done being context-switched out?
+ * + Set when scheduling out in csched_schedule() if prev is runnable
+ * + Set in csched_vcpu_wake if it finds CSFLAG_scheduled set
+ * + Read in csched_context_saved().  If set, it adds prev to the runqueue and
+ *   clears the bit.
+ */
+#define __CSFLAG_delayed_runq_add 2
+#define CSFLAG_delayed_runq_add (1<<__CSFLAG_delayed_runq_add)
+
+
+/*
+ * Useful macros
+ */
+#define CSCHED_VCPU(_vcpu)  ((struct csched_vcpu *) (_vcpu)->sched_priv)
+#define CSCHED_DOM(_dom)    ((struct csched_dom *) (_dom)->sched_priv)
+/* CPU to runq_id macro */
+#define c2r(_cpu)           (csched_priv.runq_map[(_cpu)])
+/* CPU to runqueue struct macro */
+#define RQD(_cpu)          (&csched_priv.rqd[c2r(_cpu)])
+
+/*
+ * Per-runqueue data
+ */
+struct csched_runqueue_data {
+    int id;
+    struct list_head runq; /* Ordered list of runnable vms */
+    struct list_head svc;  /* List of all vcpus assigned to this runqueue */
+    int max_weight;
+    int cpu_min, cpu_max;  /* Range of physical cpus this runqueue runs */
+};
+
+/*
+ * System-wide private data
+ */
+struct csched_private {
+    spinlock_t lock;
+    uint32_t ncpus;
+    struct domain *idle_domain;
+
+    struct list_head sdom; /* Used mostly for dump keyhandler. */
+
+    int runq_map[NR_CPUS];
+    uint32_t runq_count;
+    struct csched_runqueue_data rqd[NR_CPUS];
+};
+
+/*
+ * Virtual CPU
+ */
+struct csched_vcpu {
+    struct list_head rqd_elem;  /* On the runqueue data list */
+    struct list_head sdom_elem; /* On the domain vcpu list */
+    struct list_head runq_elem; /* On the runqueue         */
+
+    /* Up-pointers */
+    struct csched_dom *sdom;
+    struct vcpu *vcpu;
+
+    int weight;
+
+    int credit;
+    s_time_t start_time; /* When we were scheduled (used for credit) */
+    unsigned flags;      /* 16 bits doesn't seem to play well with clear_bit() */
+
+};
+
+/*
+ * Domain
+ */
+struct csched_dom {
+    struct list_head vcpu;
+    struct list_head sdom_elem;
+    struct domain *dom;
+    uint16_t weight;
+    uint16_t nr_vcpus;
+};
+
+
+/*
+ * Global variables
+ */
+static struct csched_private csched_priv;
+
+/*
+ * Time-to-credit, credit-to-time.
+ * FIXME: Do pre-calculated division?
+ */
+static s_time_t t2c(struct csched_runqueue_data *rqd, s_time_t time, struct csched_vcpu *svc)
+{
+    return time * rqd->max_weight / svc->weight;
+}
+
+static s_time_t c2t(struct csched_runqueue_data *rqd, s_time_t credit, struct csched_vcpu *svc)
+{
+    return credit * svc->weight / rqd->max_weight;
+}
+
+/*
+ * Runqueue related code
+ */
+
+static /*inline*/ int
+__vcpu_on_runq(struct csched_vcpu *svc)
+{
+    return !list_empty(&svc->runq_elem);
+}
+
+static /*inline*/ struct csched_vcpu *
+__runq_elem(struct list_head *elem)
+{
+    return list_entry(elem, struct csched_vcpu, runq_elem);
+}
+
+static int
+__runq_insert(struct list_head *runq, struct csched_vcpu *svc)
+{
+    struct list_head *iter;
+    int pos = 0;
+
+    d2printk("rqi d%dv%d\n",
+           svc->vcpu->domain->domain_id,
+           svc->vcpu->vcpu_id);
+
+    /* Idle vcpus not allowed on the runqueue anymore */
+    BUG_ON(is_idle_vcpu(svc->vcpu));
+    BUG_ON(svc->vcpu->is_running);
+    BUG_ON(test_bit(__CSFLAG_scheduled, &svc->flags));
+
+    list_for_each( iter, runq )
+    {
+        struct csched_vcpu * iter_svc = __runq_elem(iter);
+
+        if ( svc->credit > iter_svc->credit )
+        {
+            d2printk(" p%d d%dv%d\n",
+                   pos,
+                   iter_svc->vcpu->domain->domain_id,
+                   iter_svc->vcpu->vcpu_id);
+            break;
+        }
+        pos++;
+    }
+
+    list_add_tail(&svc->runq_elem, iter);
+
+    return pos;
+}
+
+static void
+runq_insert(unsigned int cpu, struct csched_vcpu *svc)
+{
+    struct list_head * runq = &RQD(cpu)->runq;
+    int pos = 0;
+
+    ASSERT( spin_is_locked(per_cpu(schedule_data, cpu).schedule_lock) ); 
+
+    BUG_ON( __vcpu_on_runq(svc) );
+    BUG_ON( c2r(cpu) != c2r(svc->vcpu->processor) ); 
+
+    pos = __runq_insert(runq, svc);
+
+    {
+        struct {
+            unsigned dom:16,vcpu:16;
+            unsigned pos;
+        } d;
+        d.dom = svc->vcpu->domain->domain_id;
+        d.vcpu = svc->vcpu->vcpu_id;
+        d.pos = pos;
+        trace_var(TRC_CSCHED2_RUNQ_POS, 1,
+                  sizeof(d),
+                  (unsigned char *)&d);
+    }
+
+    return;
+}
+
+static inline void
+__runq_remove(struct csched_vcpu *svc)
+{
+    BUG_ON( !__vcpu_on_runq(svc) );
+    list_del_init(&svc->runq_elem);
+}
+
+void burn_credits(struct csched_runqueue_data *rqd, struct csched_vcpu *, s_time_t);
+
+/* Check to see if the item on the runqueue is higher priority than what's
+ * currently running; if so, wake up the processor */
+static /*inline*/ void
+runq_tickle(unsigned int cpu, struct csched_vcpu *new, s_time_t now)
+{
+    int i, ipid=-1;
+    s_time_t lowest=(1<<30);
+    struct csched_runqueue_data *rqd = RQD(cpu);
+
+    d2printk("rqt d%dv%d cd%dv%d\n",
+             new->vcpu->domain->domain_id,
+             new->vcpu->vcpu_id,
+             current->domain->domain_id,
+             current->vcpu_id);
+
+    /* Find the cpu in this queue group that has the lowest credits */
+    for ( i=rqd->cpu_min ; i < rqd->cpu_max ; i++ )
+    {
+        struct csched_vcpu * cur;
+
+        /* Skip cpus that aren't online */
+        if ( !cpu_online(i) )
+            continue;
+
+        cur = CSCHED_VCPU(per_cpu(schedule_data, i).curr);
+
+        /* FIXME: keep track of idlers, choose from the mask */
+        if ( is_idle_vcpu(cur->vcpu) )
+        {
+            ipid = i;
+            lowest = CSCHED_IDLE_CREDIT;
+            break;
+        }
+        else
+        {
+            /* Update credits for current to see if we want to preempt */
+            burn_credits(rqd, cur, now);
+
+            if ( cur->credit < lowest )
+            {
+                ipid = i;
+                lowest = cur->credit;
+            }
+
+            /* TRACE */ {
+                struct {
+                    unsigned dom:16,vcpu:16;
+                    unsigned credit;
+                } d;
+                d.dom = cur->vcpu->domain->domain_id;
+                d.vcpu = cur->vcpu->vcpu_id;
+                d.credit = cur->credit;
+                trace_var(TRC_CSCHED2_TICKLE_CHECK, 1,
+                          sizeof(d),
+                          (unsigned char *)&d);
+            }
+        }
+    }
+
+    if ( ipid != -1 )
+    {
+        int cdiff = lowest - new->credit;
+
+        if ( lowest == CSCHED_IDLE_CREDIT || cdiff < 0 ) {
+            d2printk("si %d\n", ipid);
+            cpu_raise_softirq(ipid, SCHEDULE_SOFTIRQ);
+        }
+        else
+            /* FIXME: Wake up later? */;
+    }
+}
+
+/*
+ * Credit-related code
+ */
+static void reset_credit(int cpu, s_time_t now)
+{
+    struct list_head *iter;
+
+    list_for_each( iter, &RQD(cpu)->svc )
+    {
+        struct csched_vcpu * svc = list_entry(iter, struct csched_vcpu, rqd_elem);
+
+        BUG_ON( is_idle_vcpu(svc->vcpu) );
+
+        /* "Clip" credits to max carryover */
+        if ( svc->credit > CSCHED_CARRYOVER_MAX )
+            svc->credit = CSCHED_CARRYOVER_MAX;
+        /* And add INIT */
+        svc->credit += CSCHED_CREDIT_INIT; 
+        svc->start_time = now;
+
+        /* FIXME: Trace credit */
+    }
+
+    /* No need to re-sort the runqueue, as everyone's relative order is unchanged. */
+}
+
+void burn_credits(struct csched_runqueue_data *rqd, struct csched_vcpu *svc, s_time_t now)
+{
+    s_time_t delta;
+
+    /* Assert svc is current */
+    ASSERT(svc==CSCHED_VCPU(per_cpu(schedule_data, svc->vcpu->processor).curr));
+
+    if ( is_idle_vcpu(svc->vcpu) )
+    {
+        BUG_ON(svc->credit != CSCHED_IDLE_CREDIT);
+        return;
+    }
+
+    delta = now - svc->start_time;
+
+    if ( delta > 0 ) {
+        /* This will round down; should we consider rounding up...? */
+        svc->credit -= t2c(rqd, delta, svc);
+        svc->start_time = now;
+
+        d2printk("b d%dv%d c%d\n",
+                 svc->vcpu->domain->domain_id,
+                 svc->vcpu->vcpu_id,
+                 svc->credit);
+    } else {
+        d2printk("%s: Time went backwards? now %"PRI_stime" start %"PRI_stime"\n",
+               __func__, now, svc->start_time);
+    }
+    
+    /* TRACE */
+    {
+        struct {
+            unsigned dom:16,vcpu:16;
+            unsigned credit;
+            int delta;
+        } d;
+        d.dom = svc->vcpu->domain->domain_id;
+        d.vcpu = svc->vcpu->vcpu_id;
+        d.credit = svc->credit;
+        d.delta = delta;
+        trace_var(TRC_CSCHED2_CREDIT_BURN, 1,
+                  sizeof(d),
+                  (unsigned char *)&d);
+    }
+}
+
+/* Keep rqd->max_weight up to date as vcpu weights change. */
+void update_max_weight(struct csched_runqueue_data *rqd, int new_weight, int old_weight)
+{
+    /* Try to avoid brute-force search:
+     * - If new_weight is larger, max_weight <- new_weight
+     * - If old_weight != max_weight, someone else is still max_weight
+     *   (No action required)
+     * - If old_weight == max_weight, brute-force search for max weight
+     */
+    if ( new_weight > rqd->max_weight )
+    {
+        rqd->max_weight = new_weight;
+        printk("%s: Runqueue id %d max weight %d\n", __func__, rqd->id, rqd->max_weight);
+    }
+    else if ( old_weight == rqd->max_weight )
+    {
+        struct list_head *iter;
+        int max_weight = 1;
+        
+        list_for_each( iter, &rqd->svc )
+        {
+            struct csched_vcpu * svc = list_entry(iter, struct csched_vcpu, rqd_elem);
+            
+            if ( svc->weight > max_weight )
+                max_weight = svc->weight;
+        }
+        
+        rqd->max_weight = max_weight;
+        printk("%s: Runqueue %d max weight %d\n", __func__, rqd->id, rqd->max_weight);
+    }
+}
+
+#ifndef NDEBUG
+static /*inline*/ void
+__csched_vcpu_check(struct vcpu *vc)
+{
+    struct csched_vcpu * const svc = CSCHED_VCPU(vc);
+    struct csched_dom * const sdom = svc->sdom;
+
+    BUG_ON( svc->vcpu != vc );
+    BUG_ON( sdom != CSCHED_DOM(vc->domain) );
+    if ( sdom )
+    {
+        BUG_ON( is_idle_vcpu(vc) );
+        BUG_ON( sdom->dom != vc->domain );
+    }
+    else
+    {
+        BUG_ON( !is_idle_vcpu(vc) );
+    }
+}
+#define CSCHED_VCPU_CHECK(_vc)  (__csched_vcpu_check(_vc))
+#else
+#define CSCHED_VCPU_CHECK(_vc)
+#endif
+
+static int
+csched_vcpu_init(struct vcpu *vc)
+{
+    struct domain * const dom = vc->domain;
+    struct csched_dom *sdom = CSCHED_DOM(dom);
+    struct csched_vcpu *svc;
+
+    printk("%s: Initializing d%dv%d\n",
+           __func__, dom->domain_id, vc->vcpu_id);
+
+    /* Allocate per-VCPU info */
+    svc = xmalloc(struct csched_vcpu);
+    if ( svc == NULL )
+        return -1;
+
+    INIT_LIST_HEAD(&svc->rqd_elem);
+    INIT_LIST_HEAD(&svc->sdom_elem);
+    INIT_LIST_HEAD(&svc->runq_elem);
+
+    svc->sdom = sdom;
+    svc->vcpu = vc;
+    svc->flags = 0U;
+    vc->sched_priv = svc;
+
+    if ( ! is_idle_vcpu(vc) )
+    {
+        BUG_ON( sdom == NULL );
+
+        svc->credit = CSCHED_CREDIT_INIT;
+        svc->weight = sdom->weight;
+
+        /* FIXME: Do we need the private lock here? */
+        list_add_tail(&svc->sdom_elem, &sdom->vcpu);
+
+        /* Add vcpu to runqueue of initial processor */
+        /* FIXME: Abstract for multiple runqueues */
+        vcpu_schedule_lock_irq(vc);
+
+        list_add_tail(&svc->rqd_elem, &RQD(vc->processor)->svc);
+        update_max_weight(RQD(vc->processor), svc->weight, 0);
+
+        vcpu_schedule_unlock_irq(vc);
+
+        sdom->nr_vcpus++;
+    } 
+    else
+    {
+        BUG_ON( sdom != NULL );
+        svc->credit = CSCHED_IDLE_CREDIT;
+        svc->weight = 0;
+        if ( csched_priv.idle_domain == NULL )
+            csched_priv.idle_domain = dom;
+    }
+
+    CSCHED_VCPU_CHECK(vc);
+    return 0;
+}
+
+static void
+csched_vcpu_destroy(struct vcpu *vc)
+{
+    struct csched_vcpu * const svc = CSCHED_VCPU(vc);
+    struct csched_dom * const sdom = svc->sdom;
+
+    BUG_ON( sdom == NULL );
+    BUG_ON( !list_empty(&svc->runq_elem) );
+
+    /* Remove from runqueue */
+    vcpu_schedule_lock_irq(vc);
+
+    list_del_init(&svc->rqd_elem);
+    update_max_weight(RQD(vc->processor), 0, svc->weight);
+
+    vcpu_schedule_unlock_irq(vc);
+
+    /* Remove from sdom list.  Don't need a lock for this, as it's called
+     * synchronously when nothing else can happen. */
+    list_del_init(&svc->sdom_elem);
+
+    sdom->nr_vcpus--;
+
+    xfree(svc);
+}
+
+static void
+csched_vcpu_sleep(struct vcpu *vc)
+{
+    struct csched_vcpu * const svc = CSCHED_VCPU(vc);
+
+    BUG_ON( is_idle_vcpu(vc) );
+
+    if ( per_cpu(schedule_data, vc->processor).curr == vc )
+        cpu_raise_softirq(vc->processor, SCHEDULE_SOFTIRQ);
+    else if ( __vcpu_on_runq(svc) )
+        __runq_remove(svc);
+}
+
+static void
+csched_vcpu_wake(struct vcpu *vc)
+{
+    struct csched_vcpu * const svc = CSCHED_VCPU(vc);
+    const unsigned int cpu = vc->processor;
+    s_time_t now = 0;
+
+    /* Schedule lock should be held at this point. */
+    
+    d2printk("w d%dv%d\n", vc->domain->domain_id, vc->vcpu_id);
+
+    BUG_ON( is_idle_vcpu(vc) );
+
+    /* Make sure svc priority mod happens before runq check */
+    if ( unlikely(per_cpu(schedule_data, cpu).curr == vc) )
+    {
+        goto out;
+    }
+
+    if ( unlikely(__vcpu_on_runq(svc)) )
+    {
+        /* If we've boosted someone that's already on a runqueue, prioritize
+         * it and inform the cpu in question. */
+        goto out;
+    }
+
+    /* If the context hasn't been saved for this vcpu yet, we can't put it on
+     * another runqueue.  Instead, we set a flag so that it will be put on the runqueue
+     * after the context has been saved. */
+    if ( unlikely (test_bit(__CSFLAG_scheduled, &svc->flags) ) )
+    {
+        set_bit(__CSFLAG_delayed_runq_add, &svc->flags);
+        goto out;
+    }
+
+    now = NOW();
+
+    /* Put the VCPU on the runq */
+    runq_insert(cpu, svc);
+    runq_tickle(cpu, svc, now);
+
+out:
+    d2printk("w-\n");
+    return;
+}
+
+static void
+csched_context_saved(struct vcpu *vc)
+{
+    struct csched_vcpu * const svc = CSCHED_VCPU(vc);
+
+    vcpu_schedule_lock_irq(vc);
+
+    /* This vcpu is now eligible to be put on the runqueue again */
+    clear_bit(__CSFLAG_scheduled, &svc->flags);
+    
+    /* If someone wants it on the runqueue, put it there. */
+    /* 
+     * NB: We can get rid of CSFLAG_scheduled by checking for
+     * vc->is_running and __vcpu_on_runq(svc) here.  However,
+     * since we're accessing the flags cacheline anyway,
+     * it seems a bit pointless; especially as we have plenty of
+     * bits free.
+     */
+    if ( test_bit(__CSFLAG_delayed_runq_add, &svc->flags) )
+    {
+        const unsigned int cpu = vc->processor;
+
+        clear_bit(__CSFLAG_delayed_runq_add, &svc->flags);
+
+        BUG_ON(__vcpu_on_runq(svc));
+        
+        runq_insert(cpu, svc);
+        runq_tickle(cpu, svc, NOW());
+    }
+
+    vcpu_schedule_unlock_irq(vc);
+}
+
+static int
+csched_cpu_pick(struct vcpu *vc)
+{
+    /* FIXME: Choose a schedule group based on load */
+    /* FIXME: Migrate the vcpu to the new runqueue list, updating 
+       max_weight for each runqueue */
+    return 0;
+}
+
+static int
+csched_dom_cntl(
+    struct domain *d,
+    struct xen_domctl_scheduler_op *op)
+{
+    struct csched_dom * const sdom = CSCHED_DOM(d);
+    unsigned long flags;
+
+    if ( op->cmd == XEN_DOMCTL_SCHEDOP_getinfo )
+    {
+        op->u.credit2.weight = sdom->weight;
+    }
+    else
+    {
+        ASSERT(op->cmd == XEN_DOMCTL_SCHEDOP_putinfo);
+
+        if ( op->u.credit2.weight != 0 )
+        {
+            struct list_head *iter;
+            int old_weight;
+
+            /* Must hold csched_priv lock to update sdom, runq lock to
+             * update csvcs. */
+            spin_lock_irqsave(&csched_priv.lock, flags);
+
+            old_weight = sdom->weight;
+
+            sdom->weight = op->u.credit2.weight;
+
+            /* Update weights for vcpus, and max_weight for runqueues on which they reside */
+            list_for_each ( iter, &sdom->vcpu )
+            {
+                struct csched_vcpu *svc = list_entry(iter, struct csched_vcpu, sdom_elem);
+
+                /* NB: Locking order is important here.  Because we grab this lock here, we
+                 * must never lock csched_priv.lock if we're holding a runqueue
+                 * lock. */
+                vcpu_schedule_lock_irq(svc->vcpu);
+
+                svc->weight = sdom->weight;
+                update_max_weight(RQD(svc->vcpu->processor), svc->weight, old_weight);
+
+                vcpu_schedule_unlock_irq(svc->vcpu);
+            }
+
+            spin_unlock_irqrestore(&csched_priv.lock, flags);
+        }
+    }
+
+    return 0;
+}
+
+static int
+csched_dom_init(struct domain *dom)
+{
+    struct csched_dom *sdom;
+    unsigned long flags;
+
+    printk("%s: Initializing domain %d\n", __func__, dom->domain_id);
+
+    if ( is_idle_domain(dom) )
+        return 0;
+
+    sdom = xmalloc(struct csched_dom);
+    if ( sdom == NULL )
+        return -ENOMEM;
+
+    /* Initialize credit and weight */
+    INIT_LIST_HEAD(&sdom->vcpu);
+    INIT_LIST_HEAD(&sdom->sdom_elem);
+    sdom->dom = dom;
+    sdom->weight = CSCHED_DEFAULT_WEIGHT;
+    sdom->nr_vcpus = 0;
+
+    dom->sched_priv = sdom;
+
+    spin_lock_irqsave(&csched_priv.lock, flags);
+
+    list_add_tail(&sdom->sdom_elem, &csched_priv.sdom);
+
+    spin_unlock_irqrestore(&csched_priv.lock, flags);
+
+    return 0;
+}
+
+static void
+csched_dom_destroy(struct domain *dom)
+{
+    struct csched_dom *sdom = CSCHED_DOM(dom);
+    unsigned long flags;
+
+    BUG_ON(!list_empty(&sdom->vcpu));
+
+    spin_lock_irqsave(&csched_priv.lock, flags);
+
+    list_del_init(&sdom->sdom_elem);
+
+    spin_unlock_irqrestore(&csched_priv.lock, flags);
+    
+    xfree(CSCHED_DOM(dom));
+}
+
+/* How long should we let this vcpu run for? */
+static s_time_t
+csched_runtime(int cpu, struct csched_vcpu *snext)
+{
+    s_time_t time = CSCHED_MAX_TIMER;
+    struct csched_runqueue_data *rqd = RQD(cpu);
+    struct list_head *runq = &rqd->runq;
+
+    if ( is_idle_vcpu(snext->vcpu) )
+        return CSCHED_MAX_TIMER;
+
+    /* Basic time */
+    time = c2t(rqd, snext->credit, snext);
+
+    /* Next guy on runqueue */
+    if ( ! list_empty(runq) )
+    {
+        struct csched_vcpu *svc = __runq_elem(runq->next);
+        s_time_t ntime;
+
+        if ( ! is_idle_vcpu(svc->vcpu) )
+        {
+            ntime = c2t(rqd, snext->credit - svc->credit, snext);
+
+            if ( time > ntime )
+                time = ntime;
+        }
+    }
+
+    /* Check limits */
+    if ( time < CSCHED_MIN_TIMER )
+        time = CSCHED_MIN_TIMER;
+    else if ( time > CSCHED_MAX_TIMER )
+        time = CSCHED_MAX_TIMER;
+
+    return time;
+}
+
+void __dump_execstate(void *unused);
+
+/*
+ * This function is in the critical path. It is designed to be simple and
+ * fast for the common case.
+ */
+static struct task_slice
+csched_schedule(s_time_t now)
+{
+    const int cpu = smp_processor_id();
+    struct csched_runqueue_data *rqd = RQD(cpu);
+    struct list_head * const runq = &rqd->runq;
+    struct csched_vcpu * const scurr = CSCHED_VCPU(current);
+    struct csched_vcpu *snext = NULL;
+    struct task_slice ret;
+
+    CSCHED_VCPU_CHECK(current);
+
+    d2printk("sc p%d c d%dv%d now %"PRI_stime"\n",
+             cpu,
+             scurr->vcpu->domain->domain_id,
+             scurr->vcpu->vcpu_id,
+             now);
+
+
+    /* Protected by runqueue lock */
+
+    /* Update credits */
+    burn_credits(rqd, scurr, now);
+
+    /*
+     * Select next runnable local VCPU (ie top of local runq).
+     *
+     * If the current vcpu is runnable, and has higher credit than
+     * the next guy on the queue (or there is no one else), we want to run him again.
+     *
+     * If the current vcpu is runnable, and the next guy on the queue
+     * has higher credit, we want to mark current for delayed runqueue
+     * add, and remove the next guy from the queue.
+     *
+     * If the current vcpu is not runnable, we want to choose the idle
+     * vcpu for this processor. 
+     */
+    if ( list_empty(runq) )
+        snext = CSCHED_VCPU(csched_priv.idle_domain->vcpu[cpu]);
+    else
+        snext = __runq_elem(runq->next);
+
+    if ( !is_idle_vcpu(current) && vcpu_runnable(current) )
+    {
+        /* If the current vcpu is runnable, and has higher credit
+         * than the next on the runqueue, run him again.
+         * Otherwise, set him for delayed runq add. */
+        if ( scurr->credit > snext->credit)
+            snext = scurr;
+        else
+            set_bit(__CSFLAG_delayed_runq_add, &scurr->flags);
+    }
+
+    if ( snext != scurr && !is_idle_vcpu(snext->vcpu) )
+    {
+        __runq_remove(snext);
+        if ( snext->vcpu->is_running )
+        {
+            printk("p%d: snext d%dv%d running on p%d! scurr d%dv%d\n",
+                   cpu,
+                   snext->vcpu->domain->domain_id, snext->vcpu->vcpu_id,
+                   snext->vcpu->processor,
+                   scurr->vcpu->domain->domain_id,
+                   scurr->vcpu->vcpu_id);
+            BUG();
+        }
+        set_bit(__CSFLAG_scheduled, &snext->flags);
+    }
+
+    if ( !is_idle_vcpu(snext->vcpu) && snext->credit <= CSCHED_CREDIT_RESET )
+        reset_credit(cpu, now);
+
+#if 0
+    /*
+     * Update idlers mask if necessary. When we're idling, other CPUs
+     * will tickle us when they get extra work.
+     */
+    if ( is_idle_vcpu(snext->vcpu) )
+    {
+        if ( !cpu_isset(cpu, csched_priv.idlers) )
+            cpu_set(cpu, csched_priv.idlers);
+    }
+    else if ( cpu_isset(cpu, csched_priv.idlers) )
+    {
+        cpu_clear(cpu, csched_priv.idlers);
+    }
+#endif
+
+    if ( !is_idle_vcpu(snext->vcpu) )
+    {
+        snext->start_time = now;
+        snext->vcpu->processor = cpu; /* Safe because lock for old processor is held */
+    }
+    /*
+     * Return task to run next...
+     */
+    ret.time = csched_runtime(cpu, snext);
+    ret.task = snext->vcpu;
+
+    CSCHED_VCPU_CHECK(ret.task);
+    return ret;
+}
+
+static void
+csched_dump_vcpu(struct csched_vcpu *svc)
+{
+    printk("[%i.%i] flags=%x cpu=%i",
+            svc->vcpu->domain->domain_id,
+            svc->vcpu->vcpu_id,
+            svc->flags,
+            svc->vcpu->processor);
+
+    printk(" credit=%" PRIi32" [w=%u]", svc->credit, svc->weight);
+
+    printk("\n");
+}
+
+static void
+csched_dump_pcpu(int cpu)
+{
+    struct list_head *runq, *iter;
+    struct csched_vcpu *svc;
+    int loop;
+    char cpustr[100];
+
+    /* FIXME: Do locking properly for access to runqueue structures */
+
+    runq = &RQD(cpu)->runq;
+
+    cpumask_scnprintf(cpustr, sizeof(cpustr), per_cpu(cpu_sibling_map,cpu));
+    printk(" sibling=%s, ", cpustr);
+    cpumask_scnprintf(cpustr, sizeof(cpustr), per_cpu(cpu_core_map,cpu));
+    printk("core=%s\n", cpustr);
+
+    /* current VCPU */
+    svc = CSCHED_VCPU(per_cpu(schedule_data, cpu).curr);
+    if ( svc )
+    {
+        printk("\trun: ");
+        csched_dump_vcpu(svc);
+    }
+
+    loop = 0;
+    list_for_each( iter, runq )
+    {
+        svc = __runq_elem(iter);
+        if ( svc )
+        {
+            printk("\t%3d: ", ++loop);
+            csched_dump_vcpu(svc);
+        }
+    }
+}
+
+static void
+csched_dump(void)
+{
+    struct list_head *iter_sdom, *iter_svc;
+    int loop;
+
+    printk("info:\n"
+           "\tncpus              = %u\n"
+           "\tdefault-weight     = %d\n",
+           csched_priv.ncpus,
+           CSCHED_DEFAULT_WEIGHT);
+
+    /* FIXME: Locking! */
+
+    printk("active vcpus:\n");
+    loop = 0;
+    list_for_each( iter_sdom, &csched_priv.sdom )
+    {
+        struct csched_dom *sdom;
+        sdom = list_entry(iter_sdom, struct csched_dom, sdom_elem);
+
+        list_for_each( iter_svc, &sdom->vcpu )
+        {
+            struct csched_vcpu *svc;
+            svc = list_entry(iter_svc, struct csched_vcpu, sdom_elem);
+
+            printk("\t%3d: ", ++loop);
+            csched_dump_vcpu(svc);
+        }
+    }
+}
+
+static void
+make_runq_map(void)
+{
+    int cpu, cpu_count=0;
+
+    /* FIXME: Read pcpu layout and do this properly */
+    for_each_possible_cpu( cpu )
+    {
+        csched_priv.runq_map[cpu] = 0;
+        cpu_count++;
+    }
+    csched_priv.runq_count = 1;
+    
+    /* Move to the init code...? */
+    csched_priv.rqd[0].cpu_min = 0;
+    csched_priv.rqd[0].cpu_max = cpu_count;
+}
+
+static void
+csched_init(void)
+{
+    int i;
+
+    printk("Initializing Credit2 scheduler\n" \
+           " WARNING: This is experimental software in development.\n" \
+           " Use at your own risk.\n");
+
+    spin_lock_init(&csched_priv.lock);
+    INIT_LIST_HEAD(&csched_priv.sdom);
+
+    csched_priv.ncpus = 0;
+
+    make_runq_map();
+
+    for ( i=0; i<csched_priv.runq_count ; i++ )
+    {
+        struct csched_runqueue_data *rqd = csched_priv.rqd + i;
+
+        rqd->max_weight = 1;
+        rqd->id = i;
+        INIT_LIST_HEAD(&rqd->svc);
+        INIT_LIST_HEAD(&rqd->runq);
+    }
+
+    /* Initialize pcpu structures */
+    for_each_possible_cpu(i)
+    {
+        int runq_id;
+        spinlock_t *lock;
+
+        /* Point the per-cpu schedule lock to the runq_id lock */
+        runq_id = csched_priv.runq_map[i];
+        lock = &per_cpu(schedule_data, runq_id)._lock;
+
+        per_cpu(schedule_data, i).schedule_lock = lock;
+
+        csched_priv.ncpus++;
+    }
+}
+
+struct scheduler sched_credit2_def = {
+    .name           = "SMP Credit Scheduler rev2",
+    .opt_name       = "credit2",
+    .sched_id       = XEN_SCHEDULER_CREDIT2,
+
+    .init_domain    = csched_dom_init,
+    .destroy_domain = csched_dom_destroy,
+
+    .init_vcpu      = csched_vcpu_init,
+    .destroy_vcpu   = csched_vcpu_destroy,
+
+    .sleep          = csched_vcpu_sleep,
+    .wake           = csched_vcpu_wake,
+
+    .adjust         = csched_dom_cntl,
+
+    .pick_cpu       = csched_cpu_pick,
+    .do_schedule    = csched_schedule,
+    .context_saved  = csched_context_saved,
+
+    .dump_cpu_state = csched_dump_pcpu,
+    .dump_settings  = csched_dump,
+    .init           = csched_init,
+};
diff -r 68636d5fb3df -r 1cdbec67f224 xen/common/schedule.c
--- a/xen/common/schedule.c	Wed Apr 14 11:16:58 2010 +0100
+++ b/xen/common/schedule.c	Wed Apr 14 11:16:58 2010 +0100
@@ -56,9 +56,11 @@
 
 extern const struct scheduler sched_sedf_def;
 extern const struct scheduler sched_credit_def;
+extern const struct scheduler sched_credit2_def;
 static const struct scheduler *__initdata schedulers[] = {
     &sched_sedf_def,
     &sched_credit_def,
+    &sched_credit2_def,
     NULL
 };
 
diff -r 68636d5fb3df -r 1cdbec67f224 xen/include/public/domctl.h
--- a/xen/include/public/domctl.h	Wed Apr 14 11:16:58 2010 +0100
+++ b/xen/include/public/domctl.h	Wed Apr 14 11:16:58 2010 +0100
@@ -303,6 +303,7 @@
 /* Scheduler types. */
 #define XEN_SCHEDULER_SEDF     4
 #define XEN_SCHEDULER_CREDIT   5
+#define XEN_SCHEDULER_CREDIT2  6
 /* Set or get info? */
 #define XEN_DOMCTL_SCHEDOP_putinfo 0
 #define XEN_DOMCTL_SCHEDOP_getinfo 1
@@ -321,6 +322,9 @@
             uint16_t weight;
             uint16_t cap;
         } credit;
+        struct xen_domctl_sched_credit2 {
+            uint16_t weight;
+        } credit2;
     } u;
 };
 typedef struct xen_domctl_scheduler_op xen_domctl_scheduler_op_t;


* [PATCH 5 of 5] credit2: Add toolstack options to control credit2 scheduler parameters
From: George Dunlap @ 2010-04-14 10:26 UTC
  To: xen-devel; +Cc: george.dunlap

11 files changed, 277 insertions(+), 2 deletions(-)
tools/libxc/Makefile                      |    1 
tools/libxc/xc_csched2.c                  |   50 +++++++++++++++++
tools/libxc/xenctrl.h                     |    8 ++
tools/python/xen/lowlevel/xc/xc.c         |   58 ++++++++++++++++++++
tools/python/xen/xend/XendAPI.py          |    3 -
tools/python/xen/xend/XendDomain.py       |   54 +++++++++++++++++++
tools/python/xen/xend/XendDomainInfo.py   |    4 +
tools/python/xen/xend/XendNode.py         |    4 +
tools/python/xen/xend/XendVMMetrics.py    |    1 
tools/python/xen/xend/server/SrvDomain.py |   14 ++++
tools/python/xen/xm/main.py               |   82 +++++++++++++++++++++++++++++
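
This plumbs the single credit2 parameter, weight, through libxc, xend
and xm.  A hypothetical caller of the new libxc functions (error
handling elided; the int-valued xc_handle matches the libxc style of
this era):

    struct xen_domctl_sched_credit2 sdom;
    uint32_t domid = 1;                      /* example domain id */
    int xc_handle = xc_interface_open();

    if ( xc_sched_credit2_domain_get(xc_handle, domid, &sdom) == 0 )
    {
        sdom.weight = 512;                   /* double the default 256 */
        xc_sched_credit2_domain_set(xc_handle, domid, &sdom);
    }
    xc_interface_close(xc_handle);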


Signed-off-by: George Dunlap <george.dunlap@eu.citrix.com>

diff -r 1cdbec67f224 -r 149e4fb24e95 tools/libxc/Makefile
--- a/tools/libxc/Makefile	Wed Apr 14 11:16:58 2010 +0100
+++ b/tools/libxc/Makefile	Wed Apr 14 11:25:17 2010 +0100
@@ -17,6 +17,7 @@
 CTRL_SRCS-y       += xc_private.c
 CTRL_SRCS-y       += xc_sedf.c
 CTRL_SRCS-y       += xc_csched.c
+CTRL_SRCS-y       += xc_csched2.c
 CTRL_SRCS-y       += xc_tbuf.c
 CTRL_SRCS-y       += xc_pm.c
 CTRL_SRCS-y       += xc_cpu_hotplug.c
diff -r 1cdbec67f224 -r 149e4fb24e95 tools/libxc/xc_csched2.c
--- /dev/null	Thu Jan 01 00:00:00 1970 +0000
+++ b/tools/libxc/xc_csched2.c	Wed Apr 14 11:25:17 2010 +0100
@@ -0,0 +1,50 @@
+/****************************************************************************
+ * (C) 2006 - Emmanuel Ackaouy - XenSource Inc.
+ ****************************************************************************
+ *
+ *        File: xc_csched2.c
+ *      Author: Emmanuel Ackaouy
+ *
+ * Description: XC Interface to the credit2 scheduler
+ *
+ */
+#include "xc_private.h"
+
+
+int
+xc_sched_credit2_domain_set(
+    int xc_handle,
+    uint32_t domid,
+    struct xen_domctl_sched_credit2 *sdom)
+{
+    DECLARE_DOMCTL;
+
+    domctl.cmd = XEN_DOMCTL_scheduler_op;
+    domctl.domain = (domid_t) domid;
+    domctl.u.scheduler_op.sched_id = XEN_SCHEDULER_CREDIT2;
+    domctl.u.scheduler_op.cmd = XEN_DOMCTL_SCHEDOP_putinfo;
+    domctl.u.scheduler_op.u.credit2 = *sdom;
+
+    return do_domctl(xc_handle, &domctl);
+}
+
+int
+xc_sched_credit2_domain_get(
+    int xc_handle,
+    uint32_t domid,
+    struct xen_domctl_sched_credit2 *sdom)
+{
+    DECLARE_DOMCTL;
+    int err;
+
+    domctl.cmd = XEN_DOMCTL_scheduler_op;
+    domctl.domain = (domid_t) domid;
+    domctl.u.scheduler_op.sched_id = XEN_SCHEDULER_CREDIT2;
+    domctl.u.scheduler_op.cmd = XEN_DOMCTL_SCHEDOP_getinfo;
+
+    err = do_domctl(xc_handle, &domctl);
+    if ( err == 0 )
+        *sdom = domctl.u.scheduler_op.u.credit2;
+
+    return err;
+}
diff -r 1cdbec67f224 -r 149e4fb24e95 tools/libxc/xenctrl.h
--- a/tools/libxc/xenctrl.h	Wed Apr 14 11:16:58 2010 +0100
+++ b/tools/libxc/xenctrl.h	Wed Apr 14 11:25:17 2010 +0100
@@ -475,6 +475,14 @@
                                uint32_t domid,
                                struct xen_domctl_sched_credit *sdom);
 
+int xc_sched_credit2_domain_set(int xc_handle,
+                               uint32_t domid,
+                               struct xen_domctl_sched_credit2 *sdom);
+
+int xc_sched_credit2_domain_get(int xc_handle,
+                               uint32_t domid,
+                               struct xen_domctl_sched_credit2 *sdom);
+
 /**
  * This function sends a trigger to a domain.
  *
diff -r 1cdbec67f224 -r 149e4fb24e95 tools/python/xen/lowlevel/xc/xc.c
--- a/tools/python/xen/lowlevel/xc/xc.c	Wed Apr 14 11:16:58 2010 +0100
+++ b/tools/python/xen/lowlevel/xc/xc.c	Wed Apr 14 11:25:17 2010 +0100
@@ -1558,6 +1558,45 @@
                          "cap",     sdom.cap);
 }
 
+static PyObject *pyxc_sched_credit2_domain_set(XcObject *self,
+                                              PyObject *args,
+                                              PyObject *kwds)
+{
+    uint32_t domid;
+    uint16_t weight;
+    static char *kwd_list[] = { "domid", "weight", NULL };
+    static char kwd_type[] = "I|H";
+    struct xen_domctl_sched_credit2 sdom;
+    
+    weight = 0;
+    if( !PyArg_ParseTupleAndKeywords(args, kwds, kwd_type, kwd_list, 
+                                     &domid, &weight) )
+        return NULL;
+
+    sdom.weight = weight;
+
+    if ( xc_sched_credit2_domain_set(self->xc_handle, domid, &sdom) != 0 )
+        return pyxc_error_to_exception();
+
+    Py_INCREF(zero);
+    return zero;
+}
+
+static PyObject *pyxc_sched_credit2_domain_get(XcObject *self, PyObject *args)
+{
+    uint32_t domid;
+    struct xen_domctl_sched_credit2 sdom;
+    
+    if( !PyArg_ParseTuple(args, "I", &domid) )
+        return NULL;
+    
+    if ( xc_sched_credit2_domain_get(self->xc_handle, domid, &sdom) != 0 )
+        return pyxc_error_to_exception();
+
+    return Py_BuildValue("{s:H}",
+                         "weight",  sdom.weight);
+}
+
 static PyObject *pyxc_domain_setmaxmem(XcObject *self, PyObject *args)
 {
     uint32_t dom;
@@ -2113,6 +2152,24 @@
       "Returns:   [dict]\n"
       " weight    [short]: domain's scheduling weight\n"},
 
+    { "sched_credit2_domain_set",
+      (PyCFunction)pyxc_sched_credit2_domain_set,
+      METH_KEYWORDS, "\n"
+      "Set the scheduling parameters for a domain when running with the\n"
+      "SMP credit2 scheduler.\n"
+      " domid     [int]:   domain id to set\n"
+      " weight    [short]: domain's scheduling weight\n"
+      "Returns: [int] 0 on success; -1 on error.\n" },
+
+    { "sched_credit2_domain_get",
+      (PyCFunction)pyxc_sched_credit2_domain_get,
+      METH_VARARGS, "\n"
+      "Get the scheduling parameters for a domain when running with the\n"
+      "SMP credit2 scheduler.\n"
+      " domid     [int]:   domain id to get\n"
+      "Returns:   [dict]\n"
+      " weight    [short]: domain's scheduling weight\n"},
+
     { "evtchn_alloc_unbound", 
       (PyCFunction)pyxc_evtchn_alloc_unbound,
       METH_VARARGS | METH_KEYWORDS, "\n"
@@ -2495,6 +2552,7 @@
     /* Expose some libxc constants to Python */
     PyModule_AddIntConstant(m, "XEN_SCHEDULER_SEDF", XEN_SCHEDULER_SEDF);
     PyModule_AddIntConstant(m, "XEN_SCHEDULER_CREDIT", XEN_SCHEDULER_CREDIT);
+    PyModule_AddIntConstant(m, "XEN_SCHEDULER_CREDIT2", XEN_SCHEDULER_CREDIT2);
 
 }
 
diff -r 1cdbec67f224 -r 149e4fb24e95 tools/python/xen/xend/XendAPI.py
--- a/tools/python/xen/xend/XendAPI.py	Wed Apr 14 11:16:58 2010 +0100
+++ b/tools/python/xen/xend/XendAPI.py	Wed Apr 14 11:25:17 2010 +0100
@@ -1626,8 +1626,7 @@
         if 'weight' in xeninfo.info['vcpus_params'] \
            and 'cap' in xeninfo.info['vcpus_params']:
             weight = xeninfo.info['vcpus_params']['weight']
-            cap = xeninfo.info['vcpus_params']['cap']
-            xendom.domain_sched_credit_set(xeninfo.getDomid(), weight, cap)
+            xendom.domain_sched_credit2_set(xeninfo.getDomid(), weight)
 
     def VM_set_VCPUs_number_live(self, _, vm_ref, num):
         dom = XendDomain.instance().get_vm_by_uuid(vm_ref)
diff -r 1cdbec67f224 -r 149e4fb24e95 tools/python/xen/xend/XendDomain.py
--- a/tools/python/xen/xend/XendDomain.py	Wed Apr 14 11:16:58 2010 +0100
+++ b/tools/python/xen/xend/XendDomain.py	Wed Apr 14 11:25:17 2010 +0100
@@ -1757,6 +1757,60 @@
             log.exception(ex)
             raise XendError(str(ex))
 
+    def domain_sched_credit2_get(self, domid):
+        """Get credit2 scheduler parameters for a domain.
+
+        @param domid: Domain ID or Name
+        @type domid: int or string.
+        @rtype: dict with keys 'weight'
+        @return: credit2 scheduler parameters
+        """
+        dominfo = self.domain_lookup_nr(domid)
+        if not dominfo:
+            raise XendInvalidDomain(str(domid))
+        
+        if dominfo._stateGet() in (DOM_STATE_RUNNING, DOM_STATE_PAUSED):
+            try:
+                return xc.sched_credit2_domain_get(dominfo.getDomid())
+            except Exception, ex:
+                raise XendError(str(ex))
+        else:
+            return {'weight' : dominfo.getWeight()}
+    
+    def domain_sched_credit2_set(self, domid, weight = None):
+        """Set credit2 scheduler parameters for a domain.
+
+        @param domid: Domain ID or Name
+        @type domid: int or string.
+        @type weight: int
+        @rtype: 0
+        """
+        set_weight = False
+        dominfo = self.domain_lookup_nr(domid)
+        if not dominfo:
+            raise XendInvalidDomain(str(domid))
+        try:
+            if weight is None:
+                weight = int(0)
+            elif weight < 1 or weight > 65535:
+                raise XendError("weight is out of range")
+            else:
+                set_weight = True
+
+            assert type(weight) == int
+
+            rc = 0
+            if dominfo._stateGet() in (DOM_STATE_RUNNING, DOM_STATE_PAUSED):
+                rc = xc.sched_credit2_domain_set(dominfo.getDomid(), weight)
+            if rc == 0:
+                if set_weight:
+                    dominfo.setWeight(weight)
+                self.managed_config_save(dominfo)
+            return rc
+        except Exception, ex:
+            log.exception(ex)
+            raise XendError(str(ex))
+
     def domain_maxmem_set(self, domid, mem):
         """Set the memory limit for a domain.
 
diff -r 1cdbec67f224 -r 149e4fb24e95 tools/python/xen/xend/XendDomainInfo.py
--- a/tools/python/xen/xend/XendDomainInfo.py	Wed Apr 14 11:16:58 2010 +0100
+++ b/tools/python/xen/xend/XendDomainInfo.py	Wed Apr 14 11:25:17 2010 +0100
@@ -2811,6 +2811,10 @@
             XendDomain.instance().domain_sched_credit_set(self.getDomid(),
                                                           self.getWeight(),
                                                           self.getCap())
+        elif XendNode.instance().xenschedinfo() == 'credit2':
+            from xen.xend import XendDomain
+            XendDomain.instance().domain_sched_credit2_set(self.getDomid(),
+                                                           self.getWeight())
 
     def _initDomain(self):
         log.debug('XendDomainInfo.initDomain: %s %s',
diff -r 1cdbec67f224 -r 149e4fb24e95 tools/python/xen/xend/XendNode.py
--- a/tools/python/xen/xend/XendNode.py	Wed Apr 14 11:16:58 2010 +0100
+++ b/tools/python/xen/xend/XendNode.py	Wed Apr 14 11:25:17 2010 +0100
@@ -779,6 +779,8 @@
             return 'sedf'
         elif sched_id == xen.lowlevel.xc.XEN_SCHEDULER_CREDIT:
             return 'credit'
+        elif sched_id == xen.lowlevel.xc.XEN_SCHEDULER_CREDIT2:
+            return 'credit2'
         else:
             return 'unknown'
 
@@ -988,6 +990,8 @@
             return 'sedf'
         elif sched_id == xen.lowlevel.xc.XEN_SCHEDULER_CREDIT:
             return 'credit'
+        elif sched_id == xen.lowlevel.xc.XEN_SCHEDULER_CREDIT2:
+            return 'credit2'
         else:
             return 'unknown'
 
diff -r 1cdbec67f224 -r 149e4fb24e95 tools/python/xen/xend/XendVMMetrics.py
--- a/tools/python/xen/xend/XendVMMetrics.py	Wed Apr 14 11:16:58 2010 +0100
+++ b/tools/python/xen/xend/XendVMMetrics.py	Wed Apr 14 11:25:17 2010 +0100
@@ -129,6 +129,7 @@
                 params_live['cpumap%i' % i] = \
                     ",".join(map(str, info['cpumap']))
 
+                # FIXME: credit2??
             params_live.update(xc.sched_credit_domain_get(domid))
             
             return params_live
diff -r 1cdbec67f224 -r 149e4fb24e95 tools/python/xen/xend/server/SrvDomain.py
--- a/tools/python/xen/xend/server/SrvDomain.py	Wed Apr 14 11:16:58 2010 +0100
+++ b/tools/python/xen/xend/server/SrvDomain.py	Wed Apr 14 11:25:17 2010 +0100
@@ -163,6 +163,20 @@
         val = fn(req.args, {'dom': self.dom.getName()})
         return val
 
+    def op_domain_sched_credit2_get(self, _, req):
+        fn = FormFn(self.xd.domain_sched_credit2_get,
+                    [['dom', 'str']])
+        val = fn(req.args, {'dom': self.dom.getName()})
+        return val
+
+
+    def op_domain_sched_credit2_set(self, _, req):
+        fn = FormFn(self.xd.domain_sched_credit2_set,
+                    [['dom', 'str'],
+                     ['weight', 'int']])
+        val = fn(req.args, {'dom': self.dom.getName()})
+        return val
+
     def op_maxmem_set(self, _, req):
         return self.call(self.dom.setMemoryMaximum,
                          [['memory', 'int']],
diff -r 1cdbec67f224 -r 149e4fb24e95 tools/python/xen/xm/main.py
--- a/tools/python/xen/xm/main.py	Wed Apr 14 11:16:58 2010 +0100
+++ b/tools/python/xen/xm/main.py	Wed Apr 14 11:25:17 2010 +0100
@@ -151,6 +151,8 @@
     'sched-sedf'  : ('<Domain> [options]', 'Get/set EDF parameters.'),
     'sched-credit': ('[-d <Domain> [-w[=WEIGHT]|-c[=CAP]]]',
                      'Get/set credit scheduler parameters.'),
+    'sched-credit2': ('[-d <Domain> [-w[=WEIGHT]]]',
+                     'Get/set credit2 scheduler parameters.'),
     'sysrq'       : ('<Domain> <letter>', 'Send a sysrq to a domain.'),
     'debug-keys'  : ('<Keys>', 'Send debug keys to Xen.'),
     'trigger'     : ('<Domain> <nmi|reset|init|s3resume|power> [<VCPU>]',
@@ -277,6 +279,10 @@
        ('-w WEIGHT', '--weight=WEIGHT', 'Weight (int)'),
        ('-c CAP',    '--cap=CAP',       'Cap (int)'),
     ),
+    'sched-credit2': (
+       ('-d DOMAIN', '--domain=DOMAIN', 'Domain to modify'),
+       ('-w WEIGHT', '--weight=WEIGHT', 'Weight (int)'),
+    ),
     'list': (
        ('-l', '--long',         'Output all VM details in SXP'),
        ('', '--label',          'Include security labels'),
@@ -418,6 +424,7 @@
     ]
 
 scheduler_commands = [
+    "sched-credit2",
     "sched-credit",
     "sched-sedf",
     ]
@@ -1740,6 +1747,80 @@
             if result != 0:
                 err(str(result))
 
+def xm_sched_credit2(args):
+    """Get/Set options for Credit2 Scheduler."""
+    
+    check_sched_type('credit2')
+
+    try:
+        opts, params = getopt.getopt(args, "d:w:",
+            ["domain=", "weight="])
+    except getopt.GetoptError, opterr:
+        err(opterr)
+        usage('sched-credit2')
+
+    domid = None
+    weight = None
+
+    for o, a in opts:
+        if o in ["-d", "--domain"]:
+            domid = a
+        elif o in ["-w", "--weight"]:
+            weight = int(a)
+
+    doms = filter(lambda x : domid_match(domid, x),
+                  [parse_doms_info(dom)
+                  for dom in getDomains(None, 'all')])
+
+    if weight is None:
+        if domid is not None and doms == []: 
+            err("Domain '%s' does not exist." % domid)
+            usage('sched-credit2')
+        # print header if we aren't setting any parameters
+        print '%-33s %4s %6s' % ('Name','ID','Weight')
+        
+        for d in doms:
+            try:
+                if serverType == SERVER_XEN_API:
+                    info = server.xenapi.VM_metrics.get_VCPUs_params(
+                        server.xenapi.VM.get_metrics(
+                            get_single_vm(d['name'])))
+                else:
+                    info = server.xend.domain.sched_credit2_get(d['name'])
+            except xmlrpclib.Fault:
+                info = {}  # don't leave 'info' unbound if the RPC faults
+
+            if 'weight' not in info:
+                # domain does not support sched-credit2?
+                info = {'weight': -1}
+
+            info['weight'] = int(info['weight'])
+            
+            info['name']  = d['name']
+            info['domid'] = str(d['domid'])
+            print( ("%(name)-32s %(domid)5s %(weight)6d") % info)
+    else:
+        if domid is None:
+            # placeholder for system-wide scheduler parameters
+            err("No domain given.")
+            usage('sched-credit2')
+
+        if serverType == SERVER_XEN_API:
+            if doms[0]['domid']:
+                server.xenapi.VM.add_to_VCPUs_params_live(
+                    get_single_vm(domid),
+                    "weight",
+                    weight)
+            else:
+                server.xenapi.VM.add_to_VCPUs_params(
+                    get_single_vm(domid),
+                    "weight",
+                    weight)
+        else:
+            result = server.xend.domain.sched_credit2_set(domid, weight)
+            if result != 0:
+                err(str(result))
+
 def xm_info(args):
     arg_check(args, "info", 0, 1)
     
@@ -3490,6 +3571,7 @@
     # scheduler
     "sched-sedf": xm_sched_sedf,
     "sched-credit": xm_sched_credit,
+    "sched-credit2": xm_sched_credit2,
     # block
     "block-attach": xm_block_attach,
     "block-detach": xm_block_detach,

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH 0 of 5] Add credit2 scheduler (EXPERIMENTAL)
       [not found] ` <7db7f696-1f0b-44d0-8f7b-eea1be5167dd@default>
@ 2010-04-14 14:29   ` George Dunlap
  2010-04-14 14:52     ` Keir Fraser
  2010-04-15 20:11     ` Dan Magenheimer
  0 siblings, 2 replies; 21+ messages in thread
From: George Dunlap @ 2010-04-14 14:29 UTC (permalink / raw)
  To: Dan Magenheimer; +Cc: xen-devel@lists.xensource.com

Keir has checked the patches in, so if you wait a bit, they should show 
up on the public repository.

The tool patch is only necessary for adjusting the weight; if you're OK 
using the default weight, just adding "sched=credit2" on the xen 
command-line should be fine.
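
For reference, a minimal sketch of driving the new interface from
Python, assuming the patched xen.lowlevel.xc module from the tool
patch is installed (domid 1 and weight 512 are illustrative values):

    import xen.lowlevel.xc

    xc = xen.lowlevel.xc.xc()
    # Read back the current credit2 parameters for domain 1.
    print xc.sched_credit2_domain_get(1)     # e.g. {'weight': 256}
    # xend checks weights against the 1..65535 range.
    rc = xc.sched_credit2_domain_set(1, 512)
    if rc != 0:
        print "sched_credit2_domain_set failed: %d" % rc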

Don't forget that this isn't meant to perform well on multiple sockets 
yet. :-)

 -George

Dan Magenheimer wrote:
> Hi George --
>
> I'm seeing some problems applying the patches (such as "malformed
> patch").  If you could send me a monolithic patch in an attachment
> and tell me what cset in http://xenbits.xensource.com/xen-unstable.hg 
> that it successfully applies against, I will try to give my
> workload a test against it to see if it has the same
> symptoms.
>
> Also, do I need to apply the tools patch if I don't intend
> to specify any parameters, or is the xen patch + "sched=credit2"
> in a boot param sufficient?
>
> Thanks,
> Dan
>
>   
>> -----Original Message-----
>> From: George Dunlap [mailto:george.dunlap@eu.citrix.com]
>> Sent: Wednesday, April 14, 2010 4:26 AM
>> To: xen-devel@lists.xensource.com
>> Cc: george.dunlap@eu.citrix.com
>> Subject: [Xen-devel] [PATCH 0 of 5] Add credit2 scheduler
>> (EXPERIMENTAL)
>>
>> This patch series introduces the credit2 scheduler.  The first two
>> patches
>> introduce changes necessary to allow the credit2 shared runqueue
>> functionality
>> to work properly; the last two implement the functionality itself.
>>
>> The scheduler is still in the experimental phase.  There's lots of
>> opportunity to contribute with independent lines of development; email
>> George Dunlap <george.dunlap@eu.citrix.com> or check out the wiki page
>> http://wiki.xensource.com/xenwiki/Credit2_Scheduler_Development for
>> ideas
>> and status updates.
>>
>> 19 files changed, 1453 insertions(+), 21 deletions(-)
>> tools/libxc/Makefile                      |    1
>> tools/libxc/xc_csched2.c                  |   50 +
>> tools/libxc/xenctrl.h                     |    8
>> tools/python/xen/lowlevel/xc/xc.c         |   58 +
>> tools/python/xen/xend/XendAPI.py          |    3
>> tools/python/xen/xend/XendDomain.py       |   54 +
>> tools/python/xen/xend/XendDomainInfo.py   |    4
>> tools/python/xen/xend/XendNode.py         |    4
>> tools/python/xen/xend/XendVMMetrics.py    |    1
>> tools/python/xen/xend/server/SrvDomain.py |   14
>> tools/python/xen/xm/main.py               |   82 ++
>> xen/arch/ia64/vmx/vmmu.c                  |    6
>> xen/common/Makefile                       |    1
>> xen/common/sched_credit.c                 |    8
>> xen/common/sched_credit2.c                | 1125
>> +++++++++++++++++++++++++++++
>> xen/common/schedule.c                     |   22
>> xen/include/public/domctl.h               |    4
>> xen/include/public/trace.h                |    1
>> xen/include/xen/sched-if.h                |   28
>>
>> _______________________________________________
>> Xen-devel mailing list
>> Xen-devel@lists.xensource.com
>> http://lists.xensource.com/xen-devel
>>     

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH 0 of 5] Add credit2 scheduler (EXPERIMENTAL)
  2010-04-14 14:29   ` [PATCH 0 of 5] Add credit2 scheduler (EXPERIMENTAL) George Dunlap
@ 2010-04-14 14:52     ` Keir Fraser
  2010-04-14 15:59       ` Dan Magenheimer
  2010-04-15 20:11     ` Dan Magenheimer
  1 sibling, 1 reply; 21+ messages in thread
From: Keir Fraser @ 2010-04-14 14:52 UTC (permalink / raw)
  To: George Dunlap, Dan Magenheimer; +Cc: xen-devel@lists.xensource.com

The patches are already available from the staging tree. They will get
automatically pushed to the main tree when they pass the regression tests.

 K.


On 14/04/2010 15:29, "George Dunlap" <george.dunlap@eu.citrix.com> wrote:

> Keir has checked the patches in, so if you wait a bit, they should show
> up on the public repository.
> 
> The tool patch is only necessary for adjusting the weight; if you're OK
> using the default weight, just adding "sched=credit2" on the xen
> command-line should be fine.
> 
> Don't forget that this isn't meant to perform well on multiple sockets
> yet. :-)
> 
>  -George
> 
> Dan Magenheimer wrote:
>> Hi George --
>> 
>> I'm seeing some problems applying the patches (such as "malformed
>> patch").  If you could send me a monolithic patch in an attachment
>> and tell me what cset in http://xenbits.xensource.com/xen-unstable.hg
>> that it successfully applies against, I will try to give my
>> workload a test against it to see if it has the same
>> symptoms.
>> 
>> Also, do I need to apply the tools patch if I don't intend
>> to specify any parameters, or is the xen patch + "sched=credit2"
>> in a boot param sufficient?
>> 
>> Thanks,
>> Dan
>> 
>>   
>>> -----Original Message-----
>>> From: George Dunlap [mailto:george.dunlap@eu.citrix.com]
>>> Sent: Wednesday, April 14, 2010 4:26 AM
>>> To: xen-devel@lists.xensource.com
>>> Cc: george.dunlap@eu.citrix.com
>>> Subject: [Xen-devel] [PATCH 0 of 5] Add credit2 scheduler
>>> (EXPERIMENTAL)
>>> 
>>> This patch series introduces the credit2 scheduler.  The first two
>>> patches
>>> introduce changes necessary to allow the credit2 shared runqueue
>>> functionality
>>> to work properly; the last two implement the functionality itself.
>>> 
>>> The scheduler is still in the experimental phase.  There's lots of
>>> opportunity to contribute with independent lines of development; email
>>> George Dunlap <george.dunlap@eu.citrix.com> or check out the wiki page
>>> http://wiki.xensource.com/xenwiki/Credit2_Scheduler_Development for
>>> ideas
>>> and status updates.
>>> 
>>> 19 files changed, 1453 insertions(+), 21 deletions(-)
>>> tools/libxc/Makefile                      |    1
>>> tools/libxc/xc_csched2.c                  |   50 +
>>> tools/libxc/xenctrl.h                     |    8
>>> tools/python/xen/lowlevel/xc/xc.c         |   58 +
>>> tools/python/xen/xend/XendAPI.py          |    3
>>> tools/python/xen/xend/XendDomain.py       |   54 +
>>> tools/python/xen/xend/XendDomainInfo.py   |    4
>>> tools/python/xen/xend/XendNode.py         |    4
>>> tools/python/xen/xend/XendVMMetrics.py    |    1
>>> tools/python/xen/xend/server/SrvDomain.py |   14
>>> tools/python/xen/xm/main.py               |   82 ++
>>> xen/arch/ia64/vmx/vmmu.c                  |    6
>>> xen/common/Makefile                       |    1
>>> xen/common/sched_credit.c                 |    8
>>> xen/common/sched_credit2.c                | 1125
>>> +++++++++++++++++++++++++++++
>>> xen/common/schedule.c                     |   22
>>> xen/include/public/domctl.h               |    4
>>> xen/include/public/trace.h                |    1
>>> xen/include/xen/sched-if.h                |   28
>>> 
>>> _______________________________________________
>>> Xen-devel mailing list
>>> Xen-devel@lists.xensource.com
>>> http://lists.xensource.com/xen-devel
>>>     
> 
> 
> _______________________________________________
> Xen-devel mailing list
> Xen-devel@lists.xensource.com
> http://lists.xensource.com/xen-devel

^ permalink raw reply	[flat|nested] 21+ messages in thread

* RE: [PATCH 0 of 5] Add credit2 scheduler (EXPERIMENTAL)
  2010-04-14 14:52     ` Keir Fraser
@ 2010-04-14 15:59       ` Dan Magenheimer
  2010-04-14 16:23         ` Keir Fraser
  0 siblings, 1 reply; 21+ messages in thread
From: Dan Magenheimer @ 2010-04-14 15:59 UTC (permalink / raw)
  To: Keir Fraser, George Dunlap; +Cc: xen-devel

Thanks.  Unfortunately, after updating both hypervisor
and tools to cs21173 (from staging), xend seems to start fine, but
attempting to launch a domain yields:  

Xend has probably crashed!  Invalid or missing HTTP status code.

An immediate "xm list" shows that xend has not crashed
(or perhaps silently and successfully restarted), but
re-attempting to launch a domain yields the same message.

(George, no need to point out that this is probably
unrelated to the credit2 scheduler... but that would
imply that xen-unstable-not-staging is also broken.)

Keir, if staging passes your regression tests successfully
without the problem I am seeing, please let me know.
(And I'll try rolling back to 4.0 for now.)

> -----Original Message-----
> From: Keir Fraser [mailto:keir.fraser@eu.citrix.com]
> Sent: Wednesday, April 14, 2010 8:53 AM
> To: George Dunlap; Dan Magenheimer
> Cc: xen-devel@lists.xensource.com
> Subject: Re: [Xen-devel] [PATCH 0 of 5] Add credit2 scheduler
> (EXPERIMENTAL)
> 
> The patches are already available from the staging tree. They will get
> automatically pushed to the main tree when they pass the regression
> tests.
> 
>  K.
> 
> 
> On 14/04/2010 15:29, "George Dunlap" <george.dunlap@eu.citrix.com>
> wrote:
> 
> > Keir has checked the patches in, so if you wait a bit, they should
> show
> > up on the public repository.
> >
> > The tool patch is only necessary for adjusting the weight; if you're
> OK
> > using the default weight, just adding "sched=credit2" on the xen
> > command-line should be fine.
> >
> > Don't forget that this isn't meant to perform well on multiple
> sockets
> > yet. :-)
> >
> >  -George
> >
> > Dan Magenheimer wrote:
> >> Hi George --
> >>
> >> I'm seeing some problems applying the patches (such as "malformed
> >> patch").  If you could send me a monolithic patch in an attachment
> >> and tell me what cset in http://xenbits.xensource.com/xen-
> unstable.hg
> >> that it successfully applies against, I will try to give my
> >> workload a test against it to see if it has the same
> >> symptoms.
> >>
> >> Also, do I need to apply the tools patch if I don't intend
> >> to specify any parameters, or is the xen patch + "sched=credit2"
> >> in a boot param sufficient?
> >>
> >> Thanks,
> >> Dan
> >>
> >>
> >>> -----Original Message-----
> >>> From: George Dunlap [mailto:george.dunlap@eu.citrix.com]
> >>> Sent: Wednesday, April 14, 2010 4:26 AM
> >>> To: xen-devel@lists.xensource.com
> >>> Cc: george.dunlap@eu.citrix.com
> >>> Subject: [Xen-devel] [PATCH 0 of 5] Add credit2 scheduler
> >>> (EXPERIMENTAL)
> >>>
> >>> This patch series introduces the credit2 scheduler.  The first two
> >>> patches
> >>> introduce changes necessary to allow the credit2 shared runqueue
> >>> functionality
> >>> to work properly; the last two implement the functionality itself.
> >>>
> >>> The scheduler is still in the experimental phase.  There's lots of
> >>> opportunity to contribute with independent lines of development;
> email
> >>> George Dunlap <george.dunlap@eu.citrix.com> or check out the wiki
> page
> >>> http://wiki.xensource.com/xenwiki/Credit2_Scheduler_Development for
> >>> ideas
> >>> and status updates.
> >>>
> >>> 19 files changed, 1453 insertions(+), 21 deletions(-)
> >>> tools/libxc/Makefile                      |    1
> >>> tools/libxc/xc_csched2.c                  |   50 +
> >>> tools/libxc/xenctrl.h                     |    8
> >>> tools/python/xen/lowlevel/xc/xc.c         |   58 +
> >>> tools/python/xen/xend/XendAPI.py          |    3
> >>> tools/python/xen/xend/XendDomain.py       |   54 +
> >>> tools/python/xen/xend/XendDomainInfo.py   |    4
> >>> tools/python/xen/xend/XendNode.py         |    4
> >>> tools/python/xen/xend/XendVMMetrics.py    |    1
> >>> tools/python/xen/xend/server/SrvDomain.py |   14
> >>> tools/python/xen/xm/main.py               |   82 ++
> >>> xen/arch/ia64/vmx/vmmu.c                  |    6
> >>> xen/common/Makefile                       |    1
> >>> xen/common/sched_credit.c                 |    8
> >>> xen/common/sched_credit2.c                | 1125
> >>> +++++++++++++++++++++++++++++
> >>> xen/common/schedule.c                     |   22
> >>> xen/include/public/domctl.h               |    4
> >>> xen/include/public/trace.h                |    1
> >>> xen/include/xen/sched-if.h                |   28
> >>>
> >>> _______________________________________________
> >>> Xen-devel mailing list
> >>> Xen-devel@lists.xensource.com
> >>> http://lists.xensource.com/xen-devel
> >>>
> >
> >
> > _______________________________________________
> > Xen-devel mailing list
> > Xen-devel@lists.xensource.com
> > http://lists.xensource.com/xen-devel
> 
> 

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH 0 of 5] Add credit2 scheduler (EXPERIMENTAL)
  2010-04-14 15:59       ` Dan Magenheimer
@ 2010-04-14 16:23         ` Keir Fraser
  2010-04-14 16:31           ` Dulloor
  2010-04-14 16:46           ` Dan Magenheimer
  0 siblings, 2 replies; 21+ messages in thread
From: Keir Fraser @ 2010-04-14 16:23 UTC (permalink / raw)
  To: Dan Magenheimer, George Dunlap; +Cc: xen-devel@lists.xensource.com

On 14/04/2010 16:59, "Dan Magenheimer" <dan.magenheimer@oracle.com> wrote:

> (George, no need to point out that this is probably
> unrelated to the credit2 scheduler... but that would
> imply that xen-unstable-not-staging is also broken.)
> 
> Keir, if staging passes your regression tests successfully
> without the problem I am seeing, please let me know.
> (And I'll try rolling back to 4.0 for now.)

No, it crashes for me too. I think it's related to the recent NUMA patches.

 -- Keir

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH 0 of 5] Add credit2 scheduler (EXPERIMENTAL)
  2010-04-14 16:23         ` Keir Fraser
@ 2010-04-14 16:31           ` Dulloor
  2010-04-14 16:36             ` Keir Fraser
  2010-04-14 16:46           ` Dan Magenheimer
  1 sibling, 1 reply; 21+ messages in thread
From: Dulloor @ 2010-04-14 16:31 UTC (permalink / raw)
  To: Keir Fraser; +Cc: George Dunlap, Dan Magenheimer, xen-devel@lists.xensource.com

> No, it crashes for me too. I think it's related to the recent NUMA patches.
Keir, which NUMA patches?

-dulloor

On Wed, Apr 14, 2010 at 12:23 PM, Keir Fraser <keir.fraser@eu.citrix.com> wrote:
> On 14/04/2010 16:59, "Dan Magenheimer" <dan.magenheimer@oracle.com> wrote:
>
>> (George, no need to point out that this is probably
>> unrelated to the credit2 scheduler... but that would
>> imply that xen-unstable-not-staging is also broken.)
>>
>> Keir, if staging passes your regression tests successfully
>> without the problem I am seeing, please let me know.
>> (And I'll try rolling back to 4.0 for now.)
>
> No, it crashes for me too. I think it's related to the recent NUMA patches.
>
>  -- Keir
>
>
>
> _______________________________________________
> Xen-devel mailing list
> Xen-devel@lists.xensource.com
> http://lists.xensource.com/xen-devel
>

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH 0 of 5] Add credit2 scheduler (EXPERIMENTAL)
  2010-04-14 16:31           ` Dulloor
@ 2010-04-14 16:36             ` Keir Fraser
  2010-04-14 17:04               ` Dan Magenheimer
  0 siblings, 1 reply; 21+ messages in thread
From: Keir Fraser @ 2010-04-14 16:36 UTC (permalink / raw)
  To: Dulloor; +Cc: George Dunlap, Dan Magenheimer, xen-devel@lists.xensource.com

On 14/04/2010 17:31, "Dulloor" <dulloor@gmail.com> wrote:

>> No, it crashes for me too. I think it's related to the recent NUMA patches.
> Keir, which numa patches

Nitin's interface-changing patch, and the ensuing patch to remove
sockets_per_node. Not that I'm certain, but it was around then that the
crashes began, and that's what touched the Xc Python extension package (and
C extensions to Python code are usually what make Python programs crash).

 -- Keir

^ permalink raw reply	[flat|nested] 21+ messages in thread

* RE: [PATCH 0 of 5] Add credit2 scheduler (EXPERIMENTAL)
  2010-04-14 16:23         ` Keir Fraser
  2010-04-14 16:31           ` Dulloor
@ 2010-04-14 16:46           ` Dan Magenheimer
  1 sibling, 0 replies; 21+ messages in thread
From: Dan Magenheimer @ 2010-04-14 16:46 UTC (permalink / raw)
  To: Keir Fraser, George Dunlap; +Cc: xen-devel

> > (George, no need to point out that this is probably
> > unrelated to the credit2 scheduler... but that would
> > imply that xen-unstable-not-staging is also broken.)
> >
> > Keir, if staging passes your regression tests successfully
> > without the problem I am seeing, please let me know.
> > (And I'll try rolling back to 4.0 for now.)
> 
> No, it crashes for me too. I think it's related to the recent NUMA
> patches.

OK.  George, I've applied your credit2 patch to 4.0-testing tip
and it seems to be starting up my test workload.  I'll let you
know what I see (but it takes a few hours).

^ permalink raw reply	[flat|nested] 21+ messages in thread

* RE: [PATCH 0 of 5] Add credit2 scheduler (EXPERIMENTAL)
  2010-04-14 16:36             ` Keir Fraser
@ 2010-04-14 17:04               ` Dan Magenheimer
  0 siblings, 0 replies; 21+ messages in thread
From: Dan Magenheimer @ 2010-04-14 17:04 UTC (permalink / raw)
  To: Keir Fraser, Dulloor; +Cc: George Dunlap, xen-devel

While someone is fixing up the new NUMA code, it would be
nice if the output of the NUMA info in "xm info" conformed
to the rest of the "xm info" output (e.g. all
on the same line, maybe "topology: cpu=X node=Y etc").
I often write scripts that parse various Xen output,
and I suspect many Xen-based products do also,
so format consistency is always good.
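
As a sketch of the kind of parsing such scripts tend to do -- assuming
"xm info" keeps to one "key : value" pair per line, which is exactly
what a multi-line NUMA section breaks:

    import subprocess

    def xm_info_dict():
        out = subprocess.Popen(["xm", "info"],
                               stdout=subprocess.PIPE).communicate()[0]
        info = {}
        for line in out.splitlines():
            if ':' not in line:
                continue          # multi-line fields fall through here
            key, val = line.split(':', 1)
            info[key.strip()] = val.strip()
        return info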

> -----Original Message-----
> From: Keir Fraser [mailto:keir.fraser@eu.citrix.com]
> Sent: Wednesday, April 14, 2010 10:36 AM
> To: Dulloor
> Cc: George Dunlap; Dan Magenheimer; xen-devel@lists.xensource.com
> Subject: Re: [Xen-devel] [PATCH 0 of 5] Add credit2 scheduler
> (EXPERIMENTAL)
> 
> On 14/04/2010 17:31, "Dulloor" <dulloor@gmail.com> wrote:
> 
> >> No, it crashes for me too. I think it's related to the recent NUMA
> patches.
> > Keir, which numa patches
> 
> Nitin's interface-changing patch, and the ensuing patch to remove
> sockets_per_node. Not that I'm certain, but it was around then that the
> crashes began, and that's what touched the Xc Python extension package
> (and
> C extensions to Python code are usually what make Python programs
> crash).
> 
>  -- Keir
> 
> 
> 
> _______________________________________________
> Xen-devel mailing list
> Xen-devel@lists.xensource.com
> http://lists.xensource.com/xen-devel

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH 0 of 5] Add credit2 scheduler (EXPERIMENTAL)
       [not found] ` <4BC664E1.7090304@purdue.edu>
@ 2010-04-15 13:53   ` George Dunlap
  2010-04-15 16:46     ` Naresh Rapolu
  0 siblings, 1 reply; 21+ messages in thread
From: George Dunlap @ 2010-04-15 13:53 UTC (permalink / raw)
  To: Naresh Rapolu, xen-devel@lists.xensource.com

I have not measured cache / TLB misses with this workload yet.  In the 
past I've instrumented the scheduler trace records in Xen to include 
performance counters such as instructions executed and cache / tlb 
misses, and then used xenalyze 
(http://xenbits.xensource.com/ext/xenalyze.hg) to analyze them.  But the 
functionality for both capture and analysis was never standardized or 
added to mainline.

I'd be happy to help point you in the right direction if you're 
interested in investing in that approach. :-)
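
For orientation, a hypothetical sketch (not the unstandardized code
referred to above) of walking raw xentrace records, assuming the t_rec
layout in xen/include/public/trace.h: a 32-bit header carrying a
28-bit event id, a 3-bit extra-word count and a cycles flag.  Any
instrumented perf-counter values would travel in the extra words:

    import struct

    def trace_records(f):
        # Yield (event, tsc, extra) tuples from a raw trace file.
        while True:
            hdr = f.read(4)
            if len(hdr) < 4:
                return
            (h,) = struct.unpack("<I", hdr)
            event = h & 0x0fffffff       # bits 0-27: event id
            n_extra = (h >> 28) & 0x7    # bits 28-30: no. of extra u32s
            tsc = None
            if h >> 31:                  # bit 31: TSC included?
                tsc = struct.unpack("<Q", f.read(8))[0]
            extra = struct.unpack("<%dI" % n_extra, f.read(4 * n_extra))
            yield event, tsc, extra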

 -George

Naresh Rapolu wrote:
> Hello George,
>
> How did you measure cache/TLB misses etc. while using/profiling this
> new scheduler?  Any tool that you've used which works with Xen?
>
> Thanks,
> Naresh Rapolu.
> PhD Student, Computer Science,
> Purdue University.
>
> George Dunlap wrote:
>   
>> This patch series introduces the credit2 scheduler.  The first two patches
>> introduce changes necessary to allow the credit2 shared runqueue functionality
>> to work properly; the last two implement the functionality itself.
>>
>> The scheduler is still in the experimental phase.  There's lots of 
>> opportunity to contribute with independent lines of development; email
>> George Dunlap <george.dunlap@eu.citrix.com> or check out the wiki page
>> http://wiki.xensource.com/xenwiki/Credit2_Scheduler_Development for ideas
>> and status updates.
>>
>> 19 files changed, 1453 insertions(+), 21 deletions(-)
>> tools/libxc/Makefile                      |    1 
>> tools/libxc/xc_csched2.c                  |   50 +
>> tools/libxc/xenctrl.h                     |    8 
>> tools/python/xen/lowlevel/xc/xc.c         |   58 +
>> tools/python/xen/xend/XendAPI.py          |    3 
>> tools/python/xen/xend/XendDomain.py       |   54 +
>> tools/python/xen/xend/XendDomainInfo.py   |    4 
>> tools/python/xen/xend/XendNode.py         |    4 
>> tools/python/xen/xend/XendVMMetrics.py    |    1 
>> tools/python/xen/xend/server/SrvDomain.py |   14 
>> tools/python/xen/xm/main.py               |   82 ++
>> xen/arch/ia64/vmx/vmmu.c                  |    6 
>> xen/common/Makefile                       |    1 
>> xen/common/sched_credit.c                 |    8 
>> xen/common/sched_credit2.c                | 1125 +++++++++++++++++++++++++++++
>> xen/common/schedule.c                     |   22 
>> xen/include/public/domctl.h               |    4 
>> xen/include/public/trace.h                |    1 
>> xen/include/xen/sched-if.h                |   28 
>>
>> _______________________________________________
>> Xen-devel mailing list
>> Xen-devel@lists.xensource.com
>> http://lists.xensource.com/xen-devel
>>   
>>     
>
>   

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH 0 of 5] Add credit2 scheduler (EXPERIMENTAL)
       [not found] ` <h2x940bcfd21004140841kcdffe330xff5d749d43392fe3@mail.gmail.com>
@ 2010-04-15 14:17   ` George Dunlap
  2010-04-17 20:29     ` Dulloor
  0 siblings, 1 reply; 21+ messages in thread
From: George Dunlap @ 2010-04-15 14:17 UTC (permalink / raw)
  To: Dulloor, xen-devel@lists.xensource.com

Dulloor wrote:
> As we talked before, I am interested in improving the multiple-socket
> scenario and adding the load balancing functionality, which could
> provide an acceptable alternative to pinning vcpus to sockets (for my
> NUMA work). I am going over your patch right now, but what are your
> thoughts?
>   
That would be great -- my focus for the next several months will be 
setting up a testing infrastructure to automatically test performance of 
different workload mixes so I can hone the algorithm and test regressions.

My idea with load balancing was to do this:
* One runqueue per L2 cache.
* Add code to calculate the load of a runqueue.  Load would be the 
average (~integral) of (vcpus running + vcpus on runqueue).  I was 
planning on doing accurate load calculation, rather than sample-based, 
and falling back to sample-based if accurate turned out to be too slow.
* Calculate the load contributed by various vcpus.
* At regular intervals, determine if some kind of balancing needs to be 
done by looking at the overall runqueue load and placing vcpus based on 
the "contributory" load of each VCPU.

Does that make sense?  Thoughts?
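
As an illustration of the averaging step above -- a sketch only, not
taken from the old patches mentioned below, with a made-up class name
and decay constant:

    import math

    class RunqueueLoad(object):
        PERIOD = 10.0   # decay time constant; purely illustrative

        def __init__(self, now):
            self.nr_active = 0    # vcpus running + vcpus on the runqueue
            self.load_avg = 0.0   # decayed time-average of nr_active
            self.last_update = now

        def account(self, now, delta_active):
            # Call on every enqueue/dequeue/schedule event.  nr_active
            # is piecewise-constant, so folding each interval in at its
            # old value gives the exact (integral) average rather than
            # a sampled approximation.
            dt = now - self.last_update
            if dt > 0:
                w = math.exp(-dt / self.PERIOD)
                self.load_avg = (w * self.load_avg
                                 + (1.0 - w) * self.nr_active)
            self.last_update = now
            self.nr_active += delta_active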

I have some old patches that calculated accurate load, I could dig them 
up if you wanted something to start with.  (I don't think they'll apply 
cleanly at the moment.)

Thanks,
 -George
> -dulloor
>
> On Wed, Apr 14, 2010 at 6:26 AM, George Dunlap
> <george.dunlap@eu.citrix.com> wrote:
>   
>> This patch series introduces the credit2 scheduler.  The first two patches
>> introduce changes necessary to allow the credit2 shared runqueue functionality
>> to work properly; the last two implement the functionality itself.
>>
>> The scheduler is still in the experimental phase.  There's lots of
>> opportunity to contribute with independent lines of development; email
>> George Dunlap <george.dunlap@eu.citrix.com> or check out the wiki page
>> http://wiki.xensource.com/xenwiki/Credit2_Scheduler_Development for ideas
>> and status updates.
>>
>> 19 files changed, 1453 insertions(+), 21 deletions(-)
>> tools/libxc/Makefile                      |    1
>> tools/libxc/xc_csched2.c                  |   50 +
>> tools/libxc/xenctrl.h                     |    8
>> tools/python/xen/lowlevel/xc/xc.c         |   58 +
>> tools/python/xen/xend/XendAPI.py          |    3
>> tools/python/xen/xend/XendDomain.py       |   54 +
>> tools/python/xen/xend/XendDomainInfo.py   |    4
>> tools/python/xen/xend/XendNode.py         |    4
>> tools/python/xen/xend/XendVMMetrics.py    |    1
>> tools/python/xen/xend/server/SrvDomain.py |   14
>> tools/python/xen/xm/main.py               |   82 ++
>> xen/arch/ia64/vmx/vmmu.c                  |    6
>> xen/common/Makefile                       |    1
>> xen/common/sched_credit.c                 |    8
>> xen/common/sched_credit2.c                | 1125 +++++++++++++++++++++++++++++
>> xen/common/schedule.c                     |   22
>> xen/include/public/domctl.h               |    4
>> xen/include/public/trace.h                |    1
>> xen/include/xen/sched-if.h                |   28
>>
>> _______________________________________________
>> Xen-devel mailing list
>> Xen-devel@lists.xensource.com
>> http://lists.xensource.com/xen-devel
>>
>>     

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH 0 of 5] Add credit2 scheduler (EXPERIMENTAL)
  2010-04-15 13:53   ` George Dunlap
@ 2010-04-15 16:46     ` Naresh Rapolu
  2010-04-15 17:33       ` Dulloor
  0 siblings, 1 reply; 21+ messages in thread
From: Naresh Rapolu @ 2010-04-15 16:46 UTC (permalink / raw)
  To: George Dunlap; +Cc: xen-devel@lists.xensource.com

Hello George,

I am trying to get the Linux "perf" tool to work with Xen (virtualizing 
the PMU to measure hardware events from inside guests).
I have the following options:

   1. allowing the guest kernel to see the PMU hardware features via
      cpuid, and then doing whatever is necessary to make them work as
      expected (by instruction emulation, etc), or
   2. keeping them hidden, but adding a new Xen interface and the
      appropriate Linux-side code to detect that interface and use it


Does Xenalyze have any code relevant to this?  Can you think of any 
directions in this regard?

Thanks,
Naresh Rapolu.


George Dunlap wrote:
> I have not measured cache / TLB misses with this workload yet.  In the 
> past I've instrumented the scheduler trace records in Xen to include 
> performance counters such as instructions executed and cache / tlb 
> misses, and then used xenalyze 
> (http://xenbits.xensource.com/ext/xenalyze.hg) to analyze them.  But 
> the functionality for both capture and analysis was never standardized 
> or added to mainline.
>
> I'd be happy to help point you in the right direction if you're 
> interested in investing in that approach. :-)
>
> -George
>
> Naresh Rapolu wrote:
>> Hello George,
>>
>> How did you measure cache/TLB misses etc. while using/profiling this
>> new scheduler?  Any tool that you've used which works with Xen?
>>
>> Thanks,
>> Naresh Rapolu.
>> PhD Student, Computer Science,
>> Purdue University.
>>
>> George Dunlap wrote:
>>  
>>> This patch series introduces the credit2 scheduler.  The first two 
>>> patches
>>> introduce changes necessary to allow the credit2 shared runqueue 
>>> functionality
>>> to work properly; the last two implement the functionality itself.
>>>
>>> The scheduler is still in the experimental phase.  There's lots of 
>>> opportunity to contribute with independent lines of development; email
>>> George Dunlap <george.dunlap@eu.citrix.com> or check out the wiki page
>>> http://wiki.xensource.com/xenwiki/Credit2_Scheduler_Development for 
>>> ideas
>>> and status updates.
>>>
>>> 19 files changed, 1453 insertions(+), 21 deletions(-)
>>> tools/libxc/Makefile                      |    1 
>>> tools/libxc/xc_csched2.c                  |   50 +
>>> tools/libxc/xenctrl.h                     |    8 
>>> tools/python/xen/lowlevel/xc/xc.c         |   58 +
>>> tools/python/xen/xend/XendAPI.py          |    3 
>>> tools/python/xen/xend/XendDomain.py       |   54 +
>>> tools/python/xen/xend/XendDomainInfo.py   |    4 
>>> tools/python/xen/xend/XendNode.py         |    4 
>>> tools/python/xen/xend/XendVMMetrics.py    |    1 
>>> tools/python/xen/xend/server/SrvDomain.py |   14 
>>> tools/python/xen/xm/main.py               |   82 ++
>>> xen/arch/ia64/vmx/vmmu.c                  |    6 
>>> xen/common/Makefile                       |    1 
>>> xen/common/sched_credit.c                 |    8 
>>> xen/common/sched_credit2.c                | 1125 
>>> +++++++++++++++++++++++++++++
>>> xen/common/schedule.c                     |   22 
>>> xen/include/public/domctl.h               |    4 
>>> xen/include/public/trace.h                |    1 
>>> xen/include/xen/sched-if.h                |   28
>>> _______________________________________________
>>> Xen-devel mailing list
>>> Xen-devel@lists.xensource.com
>>> http://lists.xensource.com/xen-devel
>>>       
>>
>>   
>

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH 0 of 5] Add credit2 scheduler (EXPERIMENTAL)
  2010-04-15 16:46     ` Naresh Rapolu
@ 2010-04-15 17:33       ` Dulloor
  2010-04-15 18:57         ` Naresh Rapolu
  0 siblings, 1 reply; 21+ messages in thread
From: Dulloor @ 2010-04-15 17:33 UTC (permalink / raw)
  To: Naresh Rapolu; +Cc: George Dunlap, xen-devel@lists.xensource.com

[-- Attachment #1: Type: text/plain, Size: 4211 bytes --]

Naresh,

If you are interested only in profiling, you could use xenoprof too.
I had ported xenoprof to pvops (attaching a patch that applies cleanly
to linux pvops). I have used this with passive profiling and for
profiling xen/dom0. This patch also includes an obvious fix (over the
oprofile branch in Jeremy's repo) for active profiling, although I
didn't get a chance to test it.

Please let me know if you try this and if you face any issues.

thanks
dulloor

On Thu, Apr 15, 2010 at 12:46 PM, Naresh Rapolu <nrapolu@purdue.edu> wrote:
> Hello George,
>
> I am trying to get the Linux "perf" tool to work with Xen (virtualizing the
> PMU to measure hardware events from inside guests).
> I have the following options:
>
>  1. allowing the guest kernel to see the PMU hardware features via
>     cpuid, and then doing whatever is necessary to make them work as
>     expected (by instruction emulation, etc), or
>  2. keeping them hidden, but adding a new Xen interface and the
>     appropriate Linux-side code to detect that interface and use it
>
>
> Does Xenalyze have any code relevant to this?  Can you think of any
> directions in this regard?
>
> Thanks,
> Naresh Rapolu.
>
>
> George Dunlap wrote:
>>
>> I have not measured cache / TLB misses with this workload yet.  In the
>> past I've instrumented the scheduler trace records in Xen to include
>> performance counters such as instructions executed and cache / tlb misses,
>> and then used xenalyze (http://xenbits.xensource.com/ext/xenalyze.hg) to
>> analyze them.  But the functionality for both capture and analysis was never
>> standardized or added to mainline.
>>
>> I'd be happy to help point you in the right direction if you're interested
>> in investing in that approach. :-)
>>
>> -George
>>
>> Naresh Rapolu wrote:
>>>
>>> Hello George,
>>>
>>> How did you measure cache/TLB misses etc. while using/profiling this new
>>> scheduler?  Any tool that you've used which works with Xen?
>>>
>>> Thanks,
>>> Naresh Rapolu.
>>> PhD Student, Computer Science,
>>> Purdue University.
>>>
>>> George Dunlap wrote:
>>>
>>>>
>>>> This patch series introduces the credit2 scheduler.  The first two
>>>> patches
>>>> introduce changes necessary to allow the credit2 shared runqueue
>>>> functionality
>>>> to work properly; the last two implement the functionality itself.
>>>>
>>>> The scheduler is still in the experimental phase.  There's lots of
>>>> opportunity to contribute with independent lines of development; email
>>>> George Dunlap <george.dunlap@eu.citrix.com> or check out the wiki page
>>>> http://wiki.xensource.com/xenwiki/Credit2_Scheduler_Development for
>>>> ideas
>>>> and status updates.
>>>>
>>>> 19 files changed, 1453 insertions(+), 21 deletions(-)
>>>> tools/libxc/Makefile                      |    1
>>>> tools/libxc/xc_csched2.c                  |   50 +
>>>> tools/libxc/xenctrl.h                     |    8
>>>> tools/python/xen/lowlevel/xc/xc.c         |   58 +
>>>> tools/python/xen/xend/XendAPI.py          |    3
>>>> tools/python/xen/xend/XendDomain.py       |   54 +
>>>> tools/python/xen/xend/XendDomainInfo.py   |    4
>>>> tools/python/xen/xend/XendNode.py         |    4
>>>> tools/python/xen/xend/XendVMMetrics.py    |    1
>>>> tools/python/xen/xend/server/SrvDomain.py |   14
>>>> tools/python/xen/xm/main.py               |   82 ++
>>>> xen/arch/ia64/vmx/vmmu.c                  |    6
>>>> xen/common/Makefile                       |    1
>>>> xen/common/sched_credit.c                 |    8
>>>> xen/common/sched_credit2.c                | 1125 +++++++++++++++++++++++++++++
>>>> xen/common/schedule.c                     |   22
>>>> xen/include/public/domctl.h               |    4
>>>> xen/include/public/trace.h                |    1
>>>> xen/include/xen/sched-if.h                |   28
>>>> _______________________________________________
>>>> Xen-devel mailing list
>>>> Xen-devel@lists.xensource.com
>>>> http://lists.xensource.com/xen-devel
>>>>
>>>
>>>
>>
>
>
> _______________________________________________
> Xen-devel mailing list
> Xen-devel@lists.xensource.com
> http://lists.xensource.com/xen-devel
>

[-- Attachment #2: xenoprof.patch --]
[-- Type: text/x-patch, Size: 49526 bytes --]

diff --git a/arch/x86/include/asm/xen/hypercall.h b/arch/x86/include/asm/xen/hypercall.h
index 7de93f3..04d8f38 100644
--- a/arch/x86/include/asm/xen/hypercall.h
+++ b/arch/x86/include/asm/xen/hypercall.h
@@ -450,6 +450,12 @@ HYPERVISOR_nmi_op(unsigned long op, unsigned long arg)
 	return _hypercall2(int, nmi_op, op, arg);
 }
 
+static inline int
+HYPERVISOR_xenoprof_op(unsigned int op, void *arg)
+{
+	return _hypercall2(int, xenoprof_op, op, arg);
+}
+
 static inline void
 MULTI_fpu_taskswitch(struct multicall_entry *mcl, int set)
 {
diff --git a/arch/x86/oprofile/Makefile b/arch/x86/oprofile/Makefile
index 446902b..6a976e6 100644
--- a/arch/x86/oprofile/Makefile
+++ b/arch/x86/oprofile/Makefile
@@ -6,6 +6,12 @@ DRIVER_OBJS = $(addprefix ../../../drivers/oprofile/, \
 		oprofilefs.o oprofile_stats.o  \
 		timer_int.o )
 
+ifdef CONFIG_XEN
+XENOPROF_COMMON_OBJS = $(addprefix ../../../drivers/xen/xenoprof/, \
+			 xenoprofile.o)
+DRIVER_OBJS				:= $(DRIVER_OBJS) \
+					   $(XENOPROF_COMMON_OBJS) xenoprof.o
+endif 
 oprofile-y				:= $(DRIVER_OBJS) init.o backtrace.o
 oprofile-$(CONFIG_X86_LOCAL_APIC) 	+= nmi_int.o op_model_amd.o \
 					   op_model_ppro.o op_model_p4.o
diff --git a/arch/x86/oprofile/xenoprof.c b/arch/x86/oprofile/xenoprof.c
new file mode 100644
index 0000000..e86f1d0
--- /dev/null
+++ b/arch/x86/oprofile/xenoprof.c
@@ -0,0 +1,172 @@
+/**
+ * @file xenoprof.c
+ *
+ * @remark Copyright 2002 OProfile authors
+ * @remark Read the file COPYING
+ *
+ * @author John Levon <levon@movementarian.org>
+ *
+ * Modified by Aravind Menon and Jose Renato Santos for Xen
+ * These modifications are:
+ * Copyright (C) 2005 Hewlett-Packard Co.
+ *
+ * x86-specific part
+ * Copyright (c) 2006 Isaku Yamahata <yamahata at valinux co jp>
+ *                    VA Linux Systems Japan K.K.
+ */
+
+#include <linux/init.h>
+#include <linux/oprofile.h>
+#include <linux/sched.h>
+#include <linux/vmalloc.h>
+#include <asm/pgtable.h>
+
+#include <xen/interface/xen.h>
+#include <asm/xen/hypercall.h>
+#include <xen/xen-ops.h>
+#include <xen/interface/xenoprof.h>
+#include <xen/xenoprof.h>
+#include "op_counter.h"
+
+static unsigned int num_events = 0;
+struct op_counter_config xen_counter_config[OP_MAX_COUNTER];
+
+void __init xenoprof_arch_init_counter(struct xenoprof_init *init)
+{
+	num_events = init->num_events;
+	/* just in case - make sure we do not overflow event list 
+	   (i.e. xen_counter_config list) */
+	if (num_events > OP_MAX_COUNTER) {
+		num_events = OP_MAX_COUNTER;
+		init->num_events = num_events;
+	}
+}
+
+void xenoprof_arch_counter(void)
+{
+	int i;
+	struct xenoprof_counter counter;
+
+	for (i=0; i<num_events; i++) {
+		counter.ind       = i;
+		counter.count     = (uint64_t)xen_counter_config[i].count;
+		counter.enabled   = (uint32_t)xen_counter_config[i].enabled;
+		counter.event     = (uint32_t)xen_counter_config[i].event;
+		counter.kernel    = (uint32_t)xen_counter_config[i].kernel;
+		counter.user      = (uint32_t)xen_counter_config[i].user;
+		counter.unit_mask = (uint64_t)xen_counter_config[i].unit_mask;
+		WARN_ON(HYPERVISOR_xenoprof_op(XENOPROF_counter,
+					       &counter));
+	}
+}
+
+void xenoprof_arch_start(void) 
+{
+	/* nothing */
+}
+
+void xenoprof_arch_stop(void)
+{
+	/* nothing */
+}
+
+void xenoprof_arch_unmap_shared_buffer(struct xenoprof_shared_buffer * sbuf)
+{
+	if (sbuf->buffer) {
+		vunmap(sbuf->buffer);
+		sbuf->buffer = NULL;
+	}
+}
+
+int xenoprof_arch_map_shared_buffer(struct xenoprof_get_buffer * get_buffer,
+				    struct xenoprof_shared_buffer * sbuf)
+{
+	int npages, ret;
+	struct vm_struct *area;
+
+	sbuf->buffer = NULL;
+	if ( (ret = HYPERVISOR_xenoprof_op(XENOPROF_get_buffer, get_buffer)) )
+		return ret;
+
+	npages = (get_buffer->bufsize * get_buffer->nbuf - 1) / PAGE_SIZE + 1;
+
+	area = alloc_vm_area(npages * PAGE_SIZE);
+	if (area == NULL)
+		return -ENOMEM;
+
+	if ( (ret = xen_remap_domain_kernel_mfn_range(
+		      (unsigned long)area->addr,
+		      get_buffer->buf_gmaddr >> PAGE_SHIFT,
+		      npages, __pgprot(_KERNPG_TABLE),
+		      DOMID_SELF)) ) {
+		vunmap(area->addr);
+		return ret;
+	}
+
+	sbuf->buffer = area->addr;
+	return ret;
+}
+
+int xenoprof_arch_set_passive(struct xenoprof_passive * pdomain,
+			      struct xenoprof_shared_buffer * sbuf)
+{
+	int ret;
+	int npages;
+	struct vm_struct *area;
+	pgprot_t prot = __pgprot(_KERNPG_TABLE);
+
+	sbuf->buffer = NULL;
+
+	ret = HYPERVISOR_xenoprof_op(XENOPROF_set_passive, pdomain);
+	if (ret)
+		goto out;
+
+	npages = (pdomain->bufsize * pdomain->nbuf - 1) / PAGE_SIZE + 1;
+
+	area = alloc_vm_area(npages * PAGE_SIZE);
+	if (area == NULL) {
+		ret = -ENOMEM;
+		goto out;
+	}
+
+	ret = xen_remap_domain_kernel_mfn_range(
+		(unsigned long)area->addr & PAGE_MASK,
+		pdomain->buf_gmaddr >> PAGE_SHIFT,
+		npages, prot, DOMID_SELF);
+	if (ret) {
+		vunmap(area->addr);
+		goto out;
+	}
+	sbuf->buffer = area->addr;
+
+out:
+	return ret;
+}
+
+
+int xenoprof_create_files(struct super_block * sb, struct dentry * root)
+{
+	unsigned int i;
+
+	for (i = 0; i < num_events; ++i) {
+		struct dentry * dir;
+		char buf[4];	/* room for two-digit counter indices */
+ 
+		snprintf(buf, sizeof(buf), "%d", i);
+		dir = oprofilefs_mkdir(sb, root, buf);
+		oprofilefs_create_ulong(sb, dir, "enabled",
+					&xen_counter_config[i].enabled);
+		oprofilefs_create_ulong(sb, dir, "event",
+					&xen_counter_config[i].event);
+		oprofilefs_create_ulong(sb, dir, "count",
+					&xen_counter_config[i].count);
+		oprofilefs_create_ulong(sb, dir, "unit_mask",
+					&xen_counter_config[i].unit_mask);
+		oprofilefs_create_ulong(sb, dir, "kernel",
+					&xen_counter_config[i].kernel);
+		oprofilefs_create_ulong(sb, dir, "user",
+					&xen_counter_config[i].user);
+	}
+
+	return 0;
+}
diff --git a/arch/x86/xen/mmu.c b/arch/x86/xen/mmu.c
index c5e31cb..7a222eb 100644
--- a/arch/x86/xen/mmu.c
+++ b/arch/x86/xen/mmu.c
@@ -2351,30 +2351,30 @@ static int remap_area_mfn_pte_fn(pte_t *ptep, pgtable_t token,
 				 unsigned long addr, void *data)
 {
 	struct remap_data *rmd = data;
-	pte_t pte = pte_mkspecial(pfn_pte(rmd->mfn++, rmd->prot));
+	pte_t pte = pte_mkspecial(pfn_pte(rmd->mfn, rmd->prot));
 
 	rmd->mmu_update->ptr = arbitrary_virt_to_machine(ptep).maddr;
 	rmd->mmu_update->val = pte_val_ma(pte);
+
+	rmd->mfn++;
 	rmd->mmu_update++;
 
 	return 0;
 }
 
-int xen_remap_domain_mfn_range(struct vm_area_struct *vma,
-			       unsigned long addr,
-			       unsigned long mfn, int nr,
-			       pgprot_t prot, unsigned domid)
+static int __xen_remap_domain_mfn_range(struct mm_struct *mm,
+				unsigned long addr,
+				unsigned long mfn, int nr,
+				pgprot_t prot, unsigned domid)
 {
 	struct remap_data rmd;
 	struct mmu_update mmu_update[REMAP_BATCH_SIZE];
 	int batch;
 	unsigned long range;
-	int err = 0;
+	int err;
 
 	prot = __pgprot(pgprot_val(prot) | _PAGE_IOMAP);
 
-	vma->vm_flags |= VM_IO | VM_RESERVED | VM_PFNMAP;
-
 	rmd.mfn = mfn;
 	rmd.prot = prot;
 
@@ -2383,14 +2383,16 @@ int xen_remap_domain_mfn_range(struct vm_area_struct *vma,
 		range = (unsigned long)batch << PAGE_SHIFT;
 
 		rmd.mmu_update = mmu_update;
-		err = apply_to_page_range(vma->vm_mm, addr, range,
+
+		err = apply_to_page_range(mm, addr, range,
 					  remap_area_mfn_pte_fn, &rmd);
 		if (err)
 			goto out;
 
-		err = -EFAULT;
-		if (HYPERVISOR_mmu_update(mmu_update, batch, NULL, domid) < 0)
+		if (HYPERVISOR_mmu_update(mmu_update, batch, NULL, domid) < 0) {
+			err = -EFAULT;
 			goto out;
+		}
 
 		nr -= batch;
 		addr += range;
@@ -2398,13 +2400,33 @@ int xen_remap_domain_mfn_range(struct vm_area_struct *vma,
 
 	err = 0;
 out:
-
 	flush_tlb_all();
-
 	return err;
 }
+
+int xen_remap_domain_mfn_range(struct vm_area_struct *vma,
+			       unsigned long addr,
+			       unsigned long mfn, int nr,
+			       pgprot_t prot, unsigned domid)
+{
+
+	vma->vm_flags |= VM_IO | VM_RESERVED | VM_PFNMAP;
+
+	return __xen_remap_domain_mfn_range(vma->vm_mm, addr, 
+					mfn, nr, prot, domid);
+}
 EXPORT_SYMBOL_GPL(xen_remap_domain_mfn_range);
 
+
+int xen_remap_domain_kernel_mfn_range(unsigned long addr,
+			       unsigned long mfn, int nr,
+			       pgprot_t prot, unsigned domid)
+{
+	return __xen_remap_domain_mfn_range(&init_mm, addr, 
+					mfn, nr, prot, domid);
+}
+EXPORT_SYMBOL_GPL(xen_remap_domain_kernel_mfn_range);
+
 #ifdef CONFIG_XEN_DEBUG_FS
 
 static struct dentry *d_mmu_debug;
diff --git a/drivers/oprofile/buffer_sync.c b/drivers/oprofile/buffer_sync.c
index 8574622..856139d 100644
--- a/drivers/oprofile/buffer_sync.c
+++ b/drivers/oprofile/buffer_sync.c
@@ -42,6 +42,10 @@ static cpumask_var_t marked_cpus;
 static DEFINE_SPINLOCK(task_mortuary);
 static void process_task_mortuary(void);
 
+#ifdef CONFIG_XEN
+static int cpu_current_xen_domain[NR_CPUS];
+#endif
+
 /* Take ownership of the task struct and place it on the
  * list for processing. Only after two full buffer syncs
  * does the task eventually get freed, because by then
@@ -154,10 +158,16 @@ int sync_start(void)
 {
 	int err;
 
+#ifdef CONFIG_XEN
+	int i;
+
+	for (i = 0; i < NR_CPUS; i++)
+		cpu_current_xen_domain[i] = XEN_COORDINATOR_DOMAIN;
+#endif
+
 	if (!alloc_cpumask_var(&marked_cpus, GFP_KERNEL))
 		return -ENOMEM;
 	cpumask_clear(marked_cpus);
-
 	start_cpu_work();
 
 	err = task_handoff_register(&task_free_nb);
@@ -285,14 +295,37 @@ static void add_cpu_switch(int i)
 	last_cookie = INVALID_COOKIE;
 }
 
-static void add_kernel_ctx_switch(unsigned int in_kernel)
+static void add_cpu_mode_switch(unsigned int cpu_mode)
+{
+	add_event_entry(ESCAPE_CODE);
+	switch(cpu_mode)
+	{
+		case CPU_MODE_USER:
+			add_event_entry(USER_ENTER_SWITCH_CODE);
+			break;
+		case CPU_MODE_KERNEL:
+			add_event_entry(KERNEL_ENTER_SWITCH_CODE);
+			break;
+#ifdef CONFIG_XEN
+		case CPU_MODE_XEN:
+			add_event_entry(XEN_ENTER_SWITCH_CODE);
+			break;
+#endif
+		default:
+			break;
+	}
+
+	return;
+}
+
+#ifdef CONFIG_XEN
+static void add_xen_domain_switch(unsigned long domain_id)
 {
 	add_event_entry(ESCAPE_CODE);
-	if (in_kernel)
-		add_event_entry(KERNEL_ENTER_SWITCH_CODE);
-	else
-		add_event_entry(KERNEL_EXIT_SWITCH_CODE);
+	add_event_entry(XEN_DOMAIN_SWITCH_CODE);
+	add_event_entry(domain_id);
 }
+#endif
 
 static void
 add_user_ctx_switch(struct task_struct const *task, unsigned long cookie)
@@ -372,12 +405,12 @@ static inline void add_sample_entry(unsigned long offset, unsigned long event)
  * for later lookup from userspace. Return 0 on failure.
  */
 static int
-add_sample(struct mm_struct *mm, struct op_sample *s, int in_kernel)
+add_sample(struct mm_struct *mm, struct op_sample *s, int cpu_mode)
 {
 	unsigned long cookie;
 	off_t offset;
 
-	if (in_kernel) {
+	if (cpu_mode >= CPU_MODE_KERNEL) {
 		add_sample_entry(s->eip, s->event);
 		return 1;
 	}
@@ -502,7 +535,7 @@ void sync_buffer(int cpu)
 	unsigned long val;
 	struct task_struct *new;
 	unsigned long cookie = 0;
-	int in_kernel = 1;
+	int cpu_mode = CPU_MODE_KERNEL;
 	sync_buffer_state state = sb_buffer_start;
 	unsigned int i;
 	unsigned long available;
@@ -514,6 +547,13 @@ void sync_buffer(int cpu)
 
 	add_cpu_switch(cpu);
 
+#ifdef CONFIG_XEN
+	/* We need to assign the first samples in this CPU buffer to the
+	 * same domain that we were processing at the last sync_buffer */
+	if(cpu_current_xen_domain[cpu] != XEN_COORDINATOR_DOMAIN)
+		add_xen_domain_switch(cpu_current_xen_domain[cpu]);
+#endif
+
 	op_cpu_buffer_reset(cpu);
 	available = op_cpu_buffer_entries(cpu);
 
@@ -530,10 +570,11 @@ void sync_buffer(int cpu)
 			}
 			if (flags & KERNEL_CTX_SWITCH) {
 				/* kernel/userspace switch */
-				in_kernel = flags & IS_KERNEL;
+				/* XXX: crap change this to use cpu_mode explicitly */
+				cpu_mode = flags & CPU_MODE_MASK;
 				if (state == sb_buffer_start)
 					state = sb_sample_start;
-				add_kernel_ctx_switch(flags & IS_KERNEL);
+				add_cpu_mode_switch(cpu_mode);
 			}
 			if (flags & USER_CTX_SWITCH
 			    && op_cpu_buffer_get_data(&entry, &val)) {
@@ -546,16 +587,32 @@ void sync_buffer(int cpu)
 					cookie = get_exec_dcookie(mm);
 				add_user_ctx_switch(new, cookie);
 			}
+#ifdef CONFIG_XEN
+			/* xen domain switch */
+			if (flags & XEN_DOMAIN_SWITCH
+				&& op_cpu_buffer_get_data(&entry, &val)) {
+				cpu_current_xen_domain[cpu] = val;
+				add_xen_domain_switch(val);
+			}
+#endif
 			if (op_cpu_buffer_get_size(&entry))
 				add_data(&entry, mm);
 			continue;
 		}
 
+#ifdef CONFIG_XEN
+		if(cpu_current_xen_domain[cpu] != XEN_COORDINATOR_DOMAIN)
+		{
+			add_sample_entry(sample->eip, sample->event);
+			continue;
+		}
+#endif
+			
 		if (state < sb_bt_start)
 			/* ignore sample */
 			continue;
 
-		if (add_sample(mm, sample, in_kernel))
+		if (add_sample(mm, sample, cpu_mode))
 			continue;
 
 		/* ignore backtraces if failed to add a sample */
diff --git a/drivers/oprofile/cpu_buffer.c b/drivers/oprofile/cpu_buffer.c
index 242257b..21959f1 100644
--- a/drivers/oprofile/cpu_buffer.c
+++ b/drivers/oprofile/cpu_buffer.c
@@ -55,6 +55,11 @@ static void wq_sync_buffer(struct work_struct *work);
 #define DEFAULT_TIMER_EXPIRE (HZ / 10)
 static int work_enabled;
 
+
+#ifdef CONFIG_XEN
+static int current_xen_domain = XEN_COORDINATOR_DOMAIN;
+#endif
+
 unsigned long oprofile_get_cpu_buffer_size(void)
 {
 	return oprofile_cpu_buffer_size;
@@ -99,7 +104,7 @@ int alloc_cpu_buffers(void)
 		struct oprofile_cpu_buffer *b = &per_cpu(cpu_buffer, i);
 
 		b->last_task = NULL;
-		b->last_is_kernel = -1;
+		b->last_cpu_mode = -1;
 		b->tracing = 0;
 		b->buffer_size = buffer_size;
 		b->sample_received = 0;
@@ -217,7 +222,7 @@ unsigned long op_cpu_buffer_entries(int cpu)
 
 static int
 op_add_code(struct oprofile_cpu_buffer *cpu_buf, unsigned long backtrace,
-	    int is_kernel, struct task_struct *task)
+	    int cpu_mode, struct task_struct *task)
 {
 	struct op_entry entry;
 	struct op_sample *sample;
@@ -229,17 +234,20 @@ op_add_code(struct oprofile_cpu_buffer *cpu_buf, unsigned long backtrace,
 	if (backtrace)
 		flags |= TRACE_BEGIN;
 
-	/* notice a switch from user->kernel or vice versa */
-	is_kernel = !!is_kernel;
-	if (cpu_buf->last_is_kernel != is_kernel) {
-		cpu_buf->last_is_kernel = is_kernel;
-		flags |= KERNEL_CTX_SWITCH;
-		if (is_kernel)
-			flags |= IS_KERNEL;
+	/* switch in cpu_mode */
+	if (cpu_buf->last_cpu_mode != cpu_mode) {
+		cpu_buf->last_cpu_mode = cpu_mode;
+		flags |= (KERNEL_CTX_SWITCH | cpu_mode);
 	}
 
 	/* notice a task switch */
+/* XXX: yuck ! do something about this too. */
+#ifndef CONFIG_XEN
 	if (cpu_buf->last_task != task) {
+#else
+	if ((cpu_buf->last_task != task)
+		&& (current_xen_domain == XEN_COORDINATOR_DOMAIN)) {
+#endif
 		cpu_buf->last_task = task;
 		flags |= USER_CTX_SWITCH;
 	}
@@ -288,14 +296,14 @@ op_add_sample(struct oprofile_cpu_buffer *cpu_buf,
 /*
  * This must be safe from any context.
  *
- * is_kernel is needed because on some architectures you cannot
- * tell if you are in kernel or user space simply by looking at
- * pc. We tag this in the buffer by generating kernel enter/exit
+ * cpu_mode is needed because on some architectures you cannot
+ * tell if you are in user/kernel(/xen) space simply by looking at
+ * pc. We tag this in the buffer by generating user/kernel(/xen) enter
  * events whenever is_kernel changes
  */
 static int
 log_sample(struct oprofile_cpu_buffer *cpu_buf, unsigned long pc,
-	   unsigned long backtrace, int is_kernel, unsigned long event)
+	   unsigned long backtrace, int cpu_mode, unsigned long event)
 {
 	cpu_buf->sample_received++;
 
@@ -304,7 +312,7 @@ log_sample(struct oprofile_cpu_buffer *cpu_buf, unsigned long pc,
 		return 0;
 	}
 
-	if (op_add_code(cpu_buf, backtrace, is_kernel, current))
+	if (op_add_code(cpu_buf, backtrace, cpu_mode, current))
 		goto fail;
 
 	if (op_add_sample(cpu_buf, pc, event))
@@ -414,12 +422,27 @@ int oprofile_write_commit(struct op_entry *entry)
 	return op_cpu_buffer_write_commit(entry);
 }
 
+/* XXX: is_kernel doubles as a cpu_mode value here (see CPU_MODE_*) */
 void oprofile_add_pc(unsigned long pc, int is_kernel, unsigned long event)
 {
 	struct oprofile_cpu_buffer *cpu_buf = &__get_cpu_var(cpu_buffer);
 	log_sample(cpu_buf, pc, 0, is_kernel, event);
 }
 
+/*
+ * Equivalent to log_sample(b, ESCAPE_CODE, 1, cpu_mode, CPU_TRACE_BEGIN);
+ * previously this was accessible through oprofile_add_pc().
+ */
+void oprofile_add_mode(int cpu_mode)
+{
+	struct oprofile_cpu_buffer *cpu_buf = &__get_cpu_var(cpu_buffer);
+
+	if (op_add_code(cpu_buf, 1, cpu_mode, current))
+		cpu_buf->sample_lost_overflow++;
+
+	return;
+}
+
 void oprofile_add_trace(unsigned long pc)
 {
 	struct oprofile_cpu_buffer *cpu_buf = &__get_cpu_var(cpu_buffer);
@@ -444,6 +467,28 @@ fail:
 	return;
 }
 
+#ifdef CONFIG_XEN
+int oprofile_add_domain_switch(int32_t domain_id)
+{
+	struct op_entry entry;
+	struct op_sample *sample;
+
+	sample = op_cpu_buffer_write_reserve(&entry, 1);
+	if (!sample)
+		return 0;
+
+	sample->eip = ESCAPE_CODE;
+	sample->event = XEN_DOMAIN_SWITCH;
+
+	op_cpu_buffer_add_data(&entry, domain_id);
+	op_cpu_buffer_write_commit(&entry);
+
+	current_xen_domain = domain_id;
+
+	return 1;
+}
+#endif
+
 /*
  * This serves to avoid cpu buffer overflow, and makes sure
  * the task mortuary progresses
diff --git a/drivers/oprofile/cpu_buffer.h b/drivers/oprofile/cpu_buffer.h
index 272995d..95be4c7 100644
--- a/drivers/oprofile/cpu_buffer.h
+++ b/drivers/oprofile/cpu_buffer.h
@@ -40,7 +40,7 @@ struct op_entry;
 struct oprofile_cpu_buffer {
 	unsigned long buffer_size;
 	struct task_struct *last_task;
-	int last_is_kernel;
+	int last_cpu_mode;
 	int tracing;
 	unsigned long sample_received;
 	unsigned long sample_lost_overflow;
@@ -62,7 +62,7 @@ static inline void op_cpu_buffer_reset(int cpu)
 {
 	struct oprofile_cpu_buffer *cpu_buf = &per_cpu(cpu_buffer, cpu);
 
-	cpu_buf->last_is_kernel = -1;
+	cpu_buf->last_cpu_mode = -1;
 	cpu_buf->last_task = NULL;
 }
 
@@ -111,10 +111,22 @@ int op_cpu_buffer_get_data(struct op_entry *entry, unsigned long *val)
 	return size;
 }
 
+/*
+ * cpu mode values: stored in the low two bits of the data flags word
+ */
+#define CPU_MODE_BEGIN		(0UL)
+#define CPU_MODE_USER		(CPU_MODE_BEGIN + 0x0)
+#define CPU_MODE_KERNEL		(CPU_MODE_BEGIN + 0x1)
+#ifdef CONFIG_XEN
+#define CPU_MODE_XEN		(CPU_MODE_BEGIN + 0x2)
+#endif
+#define CPU_MODE_END		(CPU_MODE_BEGIN + 0x3)
+#define CPU_MODE_MASK		0x3
+
 /* extra data flags */
-#define KERNEL_CTX_SWITCH	(1UL << 0)
-#define IS_KERNEL		(1UL << 1)
 #define TRACE_BEGIN		(1UL << 2)
 #define USER_CTX_SWITCH		(1UL << 3)
+#define KERNEL_CTX_SWITCH	(1UL << 4)
+#define XEN_DOMAIN_SWITCH	(1UL << 5)
 
 #endif /* OPROFILE_CPU_BUFFER_H */
diff --git a/drivers/oprofile/event_buffer.h b/drivers/oprofile/event_buffer.h
index 4e70749..9f19d42 100644
--- a/drivers/oprofile/event_buffer.h
+++ b/drivers/oprofile/event_buffer.h
@@ -30,6 +30,10 @@ void wake_up_buffer_waiter(void);
 #define INVALID_COOKIE ~0UL
 #define NO_COOKIE 0UL
 
+#ifdef CONFIG_XEN
+#define XEN_COORDINATOR_DOMAIN (-1)
+#endif
+
 extern const struct file_operations event_buffer_fops;
 
 /* mutex between sync_cpu_buffers() and the
diff --git a/drivers/oprofile/oprof.c b/drivers/oprofile/oprof.c
index 3cffce9..d1d8a27 100644
--- a/drivers/oprofile/oprof.c
+++ b/drivers/oprofile/oprof.c
@@ -20,6 +20,11 @@
 #include "buffer_sync.h"
 #include "oprofile_stats.h"
 
+#ifdef CONFIG_XEN
+#include <xen/xenoprof.h>
+#include <asm/xen/hypervisor.h>
+#endif
+
 struct oprofile_operations oprofile_ops;
 
 unsigned long oprofile_started;
@@ -33,6 +38,34 @@ static DEFINE_MUTEX(start_mutex);
  */
 static int timer = 0;
 
+#ifdef CONFIG_XEN
+int oprofile_xen_set_active(int active_domains[], unsigned int adomains)
+{
+	int err;
+
+	if (!oprofile_ops.xen_set_active)
+		return -EINVAL;
+
+	mutex_lock(&start_mutex);
+	err = oprofile_ops.xen_set_active(active_domains, adomains);
+	mutex_unlock(&start_mutex);
+	return err;
+}
+
+int oprofile_xen_set_passive(int passive_domains[], unsigned int pdomains)
+{
+	int err;
+
+	if (!oprofile_ops.xen_set_passive)
+		return -EINVAL;
+
+	mutex_lock(&start_mutex);
+	err = oprofile_ops.xen_set_passive(passive_domains, pdomains);
+	mutex_unlock(&start_mutex);
+	return err;
+}
+#endif
+
 int oprofile_setup(void)
 {
 	int err;
@@ -182,8 +215,21 @@ out:
 static int __init oprofile_init(void)
 {
 	int err;
-
-	err = oprofile_arch_init(&oprofile_ops);
+	int (*oprofile_arch_init_func)(struct oprofile_operations *ops);
+	void (*oprofile_arch_exit_func)(void);
+
+#ifdef CONFIG_XEN
+	if (xen_pv_domain()) {
+		oprofile_arch_init_func = xenoprofile_init;
+		oprofile_arch_exit_func = xenoprofile_exit;
+	} else
+#endif
+	{
+		oprofile_arch_init_func = oprofile_arch_init;
+		oprofile_arch_exit_func = oprofile_arch_exit;
+	}
+
+	err = oprofile_arch_init_func(&oprofile_ops);
 
 	if (err < 0 || timer) {
 		printk(KERN_INFO "oprofile: using timer interrupt.\n");
@@ -192,7 +238,7 @@ static int __init oprofile_init(void)
 
 	err = oprofilefs_register();
 	if (err)
-		oprofile_arch_exit();
+		oprofile_arch_exit_func();
 
 	return err;
 }
diff --git a/drivers/oprofile/oprof.h b/drivers/oprofile/oprof.h
index c288d3c..e659728 100644
--- a/drivers/oprofile/oprof.h
+++ b/drivers/oprofile/oprof.h
@@ -36,4 +36,9 @@ void oprofile_timer_init(struct oprofile_operations *ops);
 
 int oprofile_set_backtrace(unsigned long depth);
 
+#ifdef CONFIG_XEN
+int oprofile_xen_set_active(int active_domains[], unsigned int adomains);
+int oprofile_xen_set_passive(int passive_domains[], unsigned int pdomains);
+#endif
+
 #endif /* OPROF_H */
diff --git a/drivers/oprofile/oprofile_files.c b/drivers/oprofile/oprofile_files.c
index 5d36ffc..c1318c9 100644
--- a/drivers/oprofile/oprofile_files.c
+++ b/drivers/oprofile/oprofile_files.c
@@ -9,6 +9,11 @@
 
 #include <linux/fs.h>
 #include <linux/oprofile.h>
+#include <asm/uaccess.h>
+
+#include <linux/slab.h>
+#include <linux/ctype.h>
+#include <linux/gfp.h>
 
 #include "event_buffer.h"
 #include "oprofile_stats.h"
@@ -123,6 +128,199 @@ static const struct file_operations dump_fops = {
 	.write		= dump_write,
 };
 
+#ifdef CONFIG_XEN
+
+#define TMPBUFSIZE 512
+
+static unsigned int adomains = 0;
+static int active_domains[MAX_OPROF_DOMAINS + 1];
+static DEFINE_MUTEX(adom_mutex);
+
+static ssize_t adomain_write(struct file * file, char const __user * buf,
+			     size_t count, loff_t * offset)
+{
+	char *tmpbuf;
+	char *startp, *endp;
+	int i;
+	unsigned long val;
+	ssize_t retval = count;
+
+	if (*offset)
+		return -EINVAL;
+	if (count > TMPBUFSIZE - 1)
+		return -EINVAL;
+
+	if (!(tmpbuf = kmalloc(TMPBUFSIZE, GFP_KERNEL)))
+		return -ENOMEM;
+
+	if (copy_from_user(tmpbuf, buf, count)) {
+		kfree(tmpbuf);
+		return -EFAULT;
+	}
+	tmpbuf[count] = 0;
+
+	mutex_lock(&adom_mutex);
+
+	startp = tmpbuf;
+	/* Parse one more than MAX_OPROF_DOMAINS, for easy error checking */
+	for (i = 0; i <= MAX_OPROF_DOMAINS; i++) {
+		val = simple_strtoul(startp, &endp, 0);
+		if (endp == startp)
+			break;
+		while (ispunct(*endp) || isspace(*endp))
+			endp++;
+		active_domains[i] = val;
+		if (active_domains[i] != val)
+			/* Overflow, force error below */
+			i = MAX_OPROF_DOMAINS + 1;
+		startp = endp;
+	}
+	/* Force error on trailing junk */
+	adomains = *startp ? MAX_OPROF_DOMAINS + 1 : i;
+
+	kfree(tmpbuf);
+
+	if (adomains > MAX_OPROF_DOMAINS
+	    || oprofile_xen_set_active(active_domains, adomains)) {
+		adomains = 0;
+		retval = -EINVAL;
+	}
+
+	mutex_unlock(&adom_mutex);
+	return retval;
+}
+
+static ssize_t adomain_read(struct file * file, char __user * buf,
+			    size_t count, loff_t * offset)
+{
+	char * tmpbuf;
+	size_t len;
+	int i;
+	ssize_t retval;
+
+	if (!(tmpbuf = kmalloc(TMPBUFSIZE, GFP_KERNEL)))
+		return -ENOMEM;
+
+	mutex_lock(&adom_mutex);
+
+	len = 0;
+	for (i = 0; i < adomains; i++)
+		len += snprintf(tmpbuf + len,
+				len < TMPBUFSIZE ? TMPBUFSIZE - len : 0,
+				"%u ", active_domains[i]);
+	WARN_ON(len > TMPBUFSIZE);
+	if (len != 0 && len <= TMPBUFSIZE)
+		tmpbuf[len-1] = '\n';
+
+	mutex_unlock(&adom_mutex);
+
+	retval = simple_read_from_buffer(buf, count, offset, tmpbuf, len);
+
+	kfree(tmpbuf);
+	return retval;
+}
+
+
+static const struct file_operations active_domain_ops = {
+	.read		= adomain_read,
+	.write		= adomain_write,
+};
+
+static unsigned int pdomains = 0;
+static int passive_domains[MAX_OPROF_DOMAINS];
+static DEFINE_MUTEX(pdom_mutex);
+
+static ssize_t pdomain_write(struct file * file, char const __user * buf,
+			     size_t count, loff_t * offset)
+{
+	char *tmpbuf;
+	char *startp, *endp;
+	int i;
+	unsigned long val;
+	ssize_t retval = count;
+
+	if (*offset)
+		return -EINVAL;
+	if (count > TMPBUFSIZE - 1)
+		return -EINVAL;
+
+	if (!(tmpbuf = kmalloc(TMPBUFSIZE, GFP_KERNEL)))
+		return -ENOMEM;
+
+	if (copy_from_user(tmpbuf, buf, count)) {
+		kfree(tmpbuf);
+		return -EFAULT;
+	}
+	tmpbuf[count] = 0;
+
+	mutex_lock(&pdom_mutex);
+
+	startp = tmpbuf;
+	/* Parse one more than MAX_OPROF_DOMAINS, for easy error checking */
+	for (i = 0; i <= MAX_OPROF_DOMAINS; i++) {
+		val = simple_strtoul(startp, &endp, 0);
+		if (endp == startp)
+			break;
+		while (ispunct(*endp) || isspace(*endp))
+			endp++;
+		passive_domains[i] = val;
+		if (passive_domains[i] != val)
+			/* Overflow, force error below */
+			i = MAX_OPROF_DOMAINS + 1;
+		startp = endp;
+	}
+	/* Force error on trailing junk */
+	pdomains = *startp ? MAX_OPROF_DOMAINS + 1 : i;
+
+	kfree(tmpbuf);
+
+	if (pdomains > MAX_OPROF_DOMAINS
+	    || oprofile_xen_set_passive(passive_domains, pdomains)) {
+		pdomains = 0;
+		retval = -EINVAL;
+	}
+
+	mutex_unlock(&pdom_mutex);
+	return retval;
+}
+
+static ssize_t pdomain_read(struct file * file, char __user * buf,
+			    size_t count, loff_t * offset)
+{
+	char * tmpbuf;
+	size_t len;
+	int i;
+	ssize_t retval;
+
+	if (!(tmpbuf = kmalloc(TMPBUFSIZE, GFP_KERNEL)))
+		return -ENOMEM;
+
+	mutex_lock(&pdom_mutex);
+
+	len = 0;
+	for (i = 0; i < pdomains; i++)
+		len += snprintf(tmpbuf + len,
+				len < TMPBUFSIZE ? TMPBUFSIZE - len : 0,
+				"%u ", passive_domains[i]);
+	WARN_ON(len > TMPBUFSIZE);
+	if (len != 0 && len <= TMPBUFSIZE)
+		tmpbuf[len-1] = '\n';
+
+	mutex_unlock(&pdom_mutex);
+
+	retval = simple_read_from_buffer(buf, count, offset, tmpbuf, len);
+
+	kfree(tmpbuf);
+	return retval;
+}
+
+static const struct file_operations passive_domain_ops = {
+	.read		= pdomain_read,
+	.write		= pdomain_write,
+};
+
+#endif /* CONFIG_XEN */
+
 void oprofile_create_files(struct super_block *sb, struct dentry *root)
 {
 	/* reinitialize default values */
@@ -132,6 +330,10 @@ void oprofile_create_files(struct super_block *sb, struct dentry *root)
 
 	oprofilefs_create_file(sb, root, "enable", &enable_fops);
 	oprofilefs_create_file_perm(sb, root, "dump", &dump_fops, 0666);
+#ifdef CONFIG_XEN
+	oprofilefs_create_file(sb, root, "active_domains", &active_domain_ops);
+	oprofilefs_create_file(sb, root, "passive_domains", &passive_domain_ops);
+#endif
 	oprofilefs_create_file(sb, root, "buffer", &event_buffer_fops);
 	oprofilefs_create_ulong(sb, root, "buffer_size", &oprofile_buffer_size);
 	oprofilefs_create_ulong(sb, root, "buffer_watershed", &oprofile_buffer_watershed);
diff --git a/drivers/xen/xenoprof/xenoprofile.c b/drivers/xen/xenoprof/xenoprofile.c
new file mode 100644
index 0000000..116b617
--- /dev/null
+++ b/drivers/xen/xenoprof/xenoprofile.c
@@ -0,0 +1,543 @@
+/**
+ * @file xenoprofile.c
+ *
+ * @remark Copyright 2002 OProfile authors
+ * @remark Read the file COPYING
+ *
+ * @author John Levon <levon@movementarian.org>
+ *
+ * Modified by Aravind Menon and Jose Renato Santos for Xen
+ * These modifications are:
+ * Copyright (C) 2005 Hewlett-Packard Co.
+ *
+ * Separated out arch-generic part
+ * Copyright (c) 2006 Isaku Yamahata <yamahata at valinux co jp>
+ *                    VA Linux Systems Japan K.K.
+ */
+
+#include <linux/init.h>
+#include <linux/notifier.h>
+#include <linux/smp.h>
+#include <linux/oprofile.h>
+#include <linux/sysdev.h>
+#include <linux/slab.h>
+#include <linux/interrupt.h>
+#include <linux/vmalloc.h>
+#include <asm/pgtable.h>
+#include <xen/evtchn.h>
+#include <xen/events.h>
+#include <xen/xenoprof.h>
+#include <xen/interface/xen.h>
+#include <xen/interface/xenoprof.h>
+#include "../../../drivers/oprofile/cpu_buffer.h"
+#include "../../../drivers/oprofile/event_buffer.h"
+
+#define MAX_XENOPROF_SAMPLES 16
+
+/* sample buffers shared with Xen */
+static xenoprof_buf_t *xenoprof_buf[MAX_VIRT_CPUS];
+/* Shared buffer area */
+static struct xenoprof_shared_buffer shared_buffer;
+
+/* Passive sample buffers shared with Xen */
+static xenoprof_buf_t *p_xenoprof_buf[MAX_OPROF_DOMAINS][MAX_VIRT_CPUS];
+/* Passive shared buffer area */
+static struct xenoprof_shared_buffer p_shared_buffer[MAX_OPROF_DOMAINS];
+
+static int xenoprof_start(void);
+static void xenoprof_stop(void);
+
+static int xenoprof_enabled = 0;
+static int xenoprof_is_primary = 0;
+static int active_defined;
+
+extern unsigned long oprofile_backtrace_depth;
+
+/* Number of buffers in shared area (one per VCPU) */
+static int nbuf;
+/* Mappings of VIRQ_XENOPROF to irq number (per cpu) */
+static int ovf_irq[NR_CPUS];
+/* cpu model type string - copied from Xen on XENOPROF_init command */
+static char cpu_type[XENOPROF_CPU_TYPE_SIZE];
+
+#ifdef CONFIG_PM
+
+static int xenoprof_suspend(struct sys_device * dev, pm_message_t state)
+{
+	if (xenoprof_enabled == 1)
+		xenoprof_stop();
+	return 0;
+}
+
+
+static int xenoprof_resume(struct sys_device * dev)
+{
+	if (xenoprof_enabled == 1)
+		xenoprof_start();
+	return 0;
+}
+
+
+static struct sysdev_class oprofile_sysclass = {
+	.name 		= "oprofile",
+	.resume		= xenoprof_resume,
+	.suspend	= xenoprof_suspend
+};
+
+
+static struct sys_device device_oprofile = {
+	.id	= 0,
+	.cls	= &oprofile_sysclass,
+};
+
+
+static int __init init_driverfs(void)
+{
+	int error;
+	if (!(error = sysdev_class_register(&oprofile_sysclass)))
+		error = sysdev_register(&device_oprofile);
+	return error;
+}
+
+
+static void exit_driverfs(void)
+{
+	sysdev_unregister(&device_oprofile);
+	sysdev_class_unregister(&oprofile_sysclass);
+}
+
+#else
+#define init_driverfs() do { } while (0)
+#define exit_driverfs() do { } while (0)
+#endif /* CONFIG_PM */
+
+static unsigned long long oprofile_samples;
+static unsigned long long p_oprofile_samples;
+
+static unsigned int pdomains;
+static struct xenoprof_passive passive_domains[MAX_OPROF_DOMAINS];
+
+/* Check whether the given entry is an escape code */
+static int xenoprof_is_escape(xenoprof_buf_t * buf, int tail)
+{
+	return (buf->event_log[tail].eip == XENOPROF_ESCAPE_CODE);
+}
+
+/* Get the event at the given entry  */
+static uint8_t xenoprof_get_event(xenoprof_buf_t * buf, int tail)
+{
+	return (buf->event_log[tail].event);
+}
+
+static void xenoprof_add_pc(xenoprof_buf_t *buf, int is_passive)
+{
+	int head, tail, size;
+	int tracing = 0;
+
+	head = buf->event_head;
+	tail = buf->event_tail;
+	size = buf->event_size;
+
+	while (tail != head) {
+		if (xenoprof_is_escape(buf, tail) &&
+		    xenoprof_get_event(buf, tail) == XENOPROF_TRACE_BEGIN) {
+			tracing = 1;
+			oprofile_add_pc(ESCAPE_CODE, buf->event_log[tail].mode,
+					TRACE_BEGIN);
+			if (!is_passive)
+				oprofile_samples++;
+			else
+				p_oprofile_samples++;
+		} else {
+			oprofile_add_pc(buf->event_log[tail].eip,
+					buf->event_log[tail].mode,
+					buf->event_log[tail].event);
+			if (!tracing) {
+				if (!is_passive)
+					oprofile_samples++;
+				else
+					p_oprofile_samples++;
+			}
+		}
+		tail++;
+		if (tail == size)
+			tail = 0;
+	}
+	buf->event_tail = tail;
+}
+
+static void xenoprof_handle_passive(void)
+{
+	int i, j;
+	int flag_domain, flag_switch = 0;
+	
+	for (i = 0; i < pdomains; i++) {
+		flag_domain = 0;
+		for (j = 0; j < passive_domains[i].nbuf; j++) {
+			xenoprof_buf_t *buf = p_xenoprof_buf[i][j];
+			if (buf->event_head == buf->event_tail)
+				continue;
+			if (!flag_domain) {
+				if (!oprofile_add_domain_switch(
+					passive_domains[i].domain_id))
+					goto done;
+				flag_domain = 1;
+			}
+			xenoprof_add_pc(buf, 1);
+			flag_switch = 1;
+		}
+	}
+done:
+	if (flag_switch)
+		oprofile_add_domain_switch(XEN_COORDINATOR_DOMAIN);
+}
+
+static irqreturn_t 
+xenoprof_ovf_interrupt(int irq, void * dev_id)
+{
+	struct xenoprof_buf * buf;
+	static unsigned long flag;
+
+	buf = xenoprof_buf[smp_processor_id()];
+
+	xenoprof_add_pc(buf, 0);
+
+	if (xenoprof_is_primary && !test_and_set_bit(0, &flag)) {
+		xenoprof_handle_passive();
+		smp_mb__before_clear_bit();
+		clear_bit(0, &flag);
+	}
+
+	return IRQ_HANDLED;
+}
+
+
+static void unbind_virq(void)
+{
+	unsigned int i;
+
+	for_each_online_cpu(i) {
+		if (ovf_irq[i] >= 0) {
+			unbind_from_irqhandler(ovf_irq[i], NULL);
+			ovf_irq[i] = -1;
+		}
+	}
+}
+
+
+static int bind_virq(void)
+{
+	unsigned int i;
+	int result;
+
+	for_each_online_cpu(i) {
+		result = bind_virq_to_irqhandler(VIRQ_XENOPROF,
+						 i,
+						 xenoprof_ovf_interrupt,
+						 IRQF_DISABLED|IRQF_NOBALANCING,
+						 "xenoprof",
+						 NULL);
+
+		if (result < 0) {
+			unbind_virq();
+			return result;
+		}
+
+		ovf_irq[i] = result;
+	}
+
+	return 0;
+}
+
+
+static void unmap_passive_list(void)
+{
+	int i;
+	for (i = 0; i < pdomains; i++)
+		xenoprof_arch_unmap_shared_buffer(&p_shared_buffer[i]);
+	pdomains = 0;
+}
+
+
+static int map_xenoprof_buffer(int max_samples)
+{
+	struct xenoprof_get_buffer get_buffer;
+	struct xenoprof_buf *buf;
+	int ret, i;
+
+	if ( shared_buffer.buffer )
+		return 0;
+
+	get_buffer.max_samples = max_samples;
+	ret = xenoprof_arch_map_shared_buffer(&get_buffer, &shared_buffer);
+	if (ret)
+		return ret;
+	nbuf = get_buffer.nbuf;
+
+	for (i = 0; i < nbuf; i++) {
+		buf = (struct xenoprof_buf*) 
+			&shared_buffer.buffer[i * get_buffer.bufsize];
+		BUG_ON(buf->vcpu_id >= MAX_VIRT_CPUS);
+		xenoprof_buf[buf->vcpu_id] = buf;
+	}
+
+	return 0;
+}
+
+
+static int xenoprof_setup(void)
+{
+	int ret;
+
+	if ( (ret = map_xenoprof_buffer(MAX_XENOPROF_SAMPLES)) )
+		return ret;
+
+	if ( (ret = bind_virq()) )
+		return ret;
+
+	if (xenoprof_is_primary) {
+		/* Define dom0 as an active domain if not done yet */
+		if (!active_defined) {
+			domid_t domid;
+			ret = HYPERVISOR_xenoprof_op(
+				XENOPROF_reset_active_list, NULL);
+			if (ret)
+				goto err;
+			domid = 0;
+			ret = HYPERVISOR_xenoprof_op(
+				XENOPROF_set_active, &domid);
+			if (ret)
+				goto err;
+			active_defined = 1;
+		}
+
+		if (oprofile_backtrace_depth > 0) {
+			ret = HYPERVISOR_xenoprof_op(XENOPROF_set_backtrace, 
+						     &oprofile_backtrace_depth);
+			if (ret)
+				oprofile_backtrace_depth = 0;
+		}
+
+		ret = HYPERVISOR_xenoprof_op(XENOPROF_reserve_counters, NULL);
+		if (ret)
+			goto err;
+		
+		xenoprof_arch_counter();
+		ret = HYPERVISOR_xenoprof_op(XENOPROF_setup_events, NULL);
+		if (ret)
+			goto err;
+	}
+
+	ret = HYPERVISOR_xenoprof_op(XENOPROF_enable_virq, NULL);
+	if (ret)
+		goto err;
+
+	xenoprof_enabled = 1;
+	return 0;
+ err:
+	unbind_virq();
+	return ret;
+}
+
+
+static void xenoprof_shutdown(void)
+{
+	xenoprof_enabled = 0;
+
+	WARN_ON(HYPERVISOR_xenoprof_op(XENOPROF_disable_virq, NULL));
+
+	if (xenoprof_is_primary) {
+		WARN_ON(HYPERVISOR_xenoprof_op(XENOPROF_release_counters,
+					       NULL));
+		active_defined = 0;
+	}
+
+	unbind_virq();
+
+	xenoprof_arch_unmap_shared_buffer(&shared_buffer);
+	if (xenoprof_is_primary)
+		unmap_passive_list();
+}
+
+
+static int xenoprof_start(void)
+{
+	int ret = 0;
+
+	if (xenoprof_is_primary)
+		ret = HYPERVISOR_xenoprof_op(XENOPROF_start, NULL);
+	if (!ret)
+		xenoprof_arch_start();
+	return ret;
+}
+
+
+static void xenoprof_stop(void)
+{
+	if (xenoprof_is_primary)
+		WARN_ON(HYPERVISOR_xenoprof_op(XENOPROF_stop, NULL));
+	xenoprof_arch_stop();
+}
+
+
+static int xenoprof_set_active(int * active_domains,
+			       unsigned int adomains)
+{
+	int ret = 0;
+	int i;
+	int set_dom0 = 0;
+	domid_t domid;
+
+	if (!xenoprof_is_primary)
+		return 0;
+
+	if (adomains > MAX_OPROF_DOMAINS)
+		return -E2BIG;
+
+	ret = HYPERVISOR_xenoprof_op(XENOPROF_reset_active_list, NULL);
+	if (ret)
+		return ret;
+
+	for (i = 0; i < adomains; i++) {
+		domid = active_domains[i];
+		if (domid != active_domains[i]) {
+			ret = -EINVAL;
+			goto out;
+		}
+		ret = HYPERVISOR_xenoprof_op(XENOPROF_set_active, &domid);
+		if (ret)
+			goto out;
+		if (active_domains[i] == 0)
+			set_dom0 = 1;
+	}
+	/* dom0 must always be active but may not be in the list */ 
+	if (!set_dom0) {
+		domid = 0;
+		ret = HYPERVISOR_xenoprof_op(XENOPROF_set_active, &domid);
+	}
+
+out:
+	if (ret)
+		WARN_ON(HYPERVISOR_xenoprof_op(XENOPROF_reset_active_list,
+					       NULL));
+	active_defined = !ret;
+	return ret;
+}
+
+static int xenoprof_set_passive(int * p_domains,
+                                unsigned int pdoms)
+{
+	int ret;
+	unsigned int i, j;
+	struct xenoprof_buf *buf;
+
+	if (!xenoprof_is_primary)
+		return 0;
+
+	if (pdoms > MAX_OPROF_DOMAINS)
+		return -E2BIG;
+
+	ret = HYPERVISOR_xenoprof_op(XENOPROF_reset_passive_list, NULL);
+	if (ret)
+		return ret;
+	unmap_passive_list();
+
+	for (i = 0; i < pdoms; i++) {
+		passive_domains[i].domain_id = p_domains[i];
+		passive_domains[i].max_samples = 2048;
+		ret = xenoprof_arch_set_passive(&passive_domains[i],
+						&p_shared_buffer[i]);
+		if (ret)
+			goto out;
+		for (j = 0; j < passive_domains[i].nbuf; j++) {
+			buf = (struct xenoprof_buf *)
+				&p_shared_buffer[i].buffer[
+				j * passive_domains[i].bufsize];
+			BUG_ON(buf->vcpu_id >= MAX_VIRT_CPUS);
+			p_xenoprof_buf[i][buf->vcpu_id] = buf;
+		}
+	}
+
+	pdomains = pdoms;
+	return 0;
+
+out:
+	for (j = 0; j < i; j++)
+		xenoprof_arch_unmap_shared_buffer(&p_shared_buffer[j]);
+
+	return ret;
+}
+
+
+/* The dummy backtrace function to keep oprofile happy
+ * The real backtrace is done in xen
+ */
+static void xenoprof_dummy_backtrace(struct pt_regs * const regs, 
+				     unsigned int depth)
+{
+	/* this should never be called */
+	BUG();
+	return;
+}
+
+
+static struct oprofile_operations xenoprof_ops = {
+#ifdef HAVE_XENOPROF_CREATE_FILES
+	.create_files 	= xenoprof_create_files,
+#endif
+	.xen_set_active	= xenoprof_set_active,
+	.xen_set_passive    = xenoprof_set_passive,
+	.setup 		= xenoprof_setup,
+	.shutdown	= xenoprof_shutdown,
+	.start		= xenoprof_start,
+	.stop		= xenoprof_stop,
+	.backtrace	= xenoprof_dummy_backtrace
+};
+
+
+/* in order to get driverfs right */
+static int using_xenoprof;
+
+int __init xenoprofile_init(struct oprofile_operations * ops)
+{
+	struct xenoprof_init init;
+	unsigned int i;
+	int ret;
+
+	ret = HYPERVISOR_xenoprof_op(XENOPROF_init, &init);
+	if (!ret) {
+		xenoprof_arch_init_counter(&init);
+		xenoprof_is_primary = init.is_primary;
+
+		/*  cpu_type is detected by Xen */
+		cpu_type[XENOPROF_CPU_TYPE_SIZE-1] = 0;
+		strncpy(cpu_type, init.cpu_type, XENOPROF_CPU_TYPE_SIZE - 1);
+		xenoprof_ops.cpu_type = cpu_type;
+
+		init_driverfs();
+		using_xenoprof = 1;
+		*ops = xenoprof_ops;
+
+		for (i = 0; i < NR_CPUS; i++)
+			ovf_irq[i] = -1;
+
+		active_defined = 0;
+	}
+
+	printk(KERN_INFO "%s: ret %d, events %d, xenoprof_is_primary %d\n",
+	       __func__, ret, init.num_events, xenoprof_is_primary);
+	return ret;
+}
+
+
+void xenoprofile_exit(void)
+{
+	if (using_xenoprof)
+		exit_driverfs();
+
+	xenoprof_arch_unmap_shared_buffer(&shared_buffer);
+	if (xenoprof_is_primary) {
+		unmap_passive_list();
+		WARN_ON(HYPERVISOR_xenoprof_op(XENOPROF_shutdown, NULL));
+	}
+}
diff --git a/include/linux/oprofile.h b/include/linux/oprofile.h
index 1d9518b..bd55065 100644
--- a/include/linux/oprofile.h
+++ b/include/linux/oprofile.h
@@ -16,6 +16,9 @@
 #include <linux/types.h>
 #include <linux/spinlock.h>
 #include <asm/atomic.h>
+#ifdef CONFIG_XEN
+#include <xen/interface/xenoprof.h>
+#endif
  
 /* Each escaped entry is prefixed by ESCAPE_CODE
  * then one of the following codes, then the
@@ -28,14 +31,18 @@
 #define CPU_SWITCH_CODE			2
 #define COOKIE_SWITCH_CODE		3
 #define KERNEL_ENTER_SWITCH_CODE	4
-#define KERNEL_EXIT_SWITCH_CODE		5
+#define USER_ENTER_SWITCH_CODE		5
 #define MODULE_LOADED_CODE		6
 #define CTX_TGID_CODE			7
 #define TRACE_BEGIN_CODE		8
 #define TRACE_END_CODE			9
 #define XEN_ENTER_SWITCH_CODE		10
+#ifndef CONFIG_XEN
 #define SPU_PROFILING_CODE		11
 #define SPU_CTX_SWITCH_CODE		12
+#else
+#define XEN_DOMAIN_SWITCH_CODE		11
+#endif
 #define IBS_FETCH_CODE			13
 #define IBS_OP_CODE			14
 
@@ -49,6 +56,12 @@ struct oprofile_operations {
 	/* create any necessary configuration files in the oprofile fs.
 	 * Optional. */
 	int (*create_files)(struct super_block * sb, struct dentry * root);
+#ifdef CONFIG_XEN
+	/* setup active domains with Xen */
+	int (*xen_set_active)(int *active_domains, unsigned int adomains);
+	/* setup passive domains with Xen */
+	int (*xen_set_passive)(int *passive_domains, unsigned int pdomains);
+#endif
 	/* Do any necessary interrupt setup. Optional. */
 	int (*setup)(void);
 	/* Do any necessary interrupt shutdown. Optional. */
@@ -104,9 +117,16 @@ void oprofile_add_ext_sample(unsigned long pc, struct pt_regs * const regs,
  * backtrace. */
 void oprofile_add_pc(unsigned long pc, int is_kernel, unsigned long event);
 
+/* Record when the cpu mode switches between user/kernel/xen(hypervisor) */
+void oprofile_add_mode(int cpu_mode);
+
 /* add a backtrace entry, to be called from the ->backtrace callback */
 void oprofile_add_trace(unsigned long eip);
 
+#ifdef CONFIG_XEN
+/* add a xen domain switch entry */
+int oprofile_add_domain_switch(int32_t domain_id);
+#endif
 
 /**
  * Create a file of the given name as a child of the given root, with
diff --git a/include/xen/interface/xen.h b/include/xen/interface/xen.h
index 812ffd5..0054a3f 100644
--- a/include/xen/interface/xen.h
+++ b/include/xen/interface/xen.h
@@ -79,6 +79,7 @@
 #define VIRQ_CONSOLE    2  /* (DOM0) Bytes received on emergency console. */
 #define VIRQ_DOM_EXC    3  /* (DOM0) Exceptional event for some domain.   */
 #define VIRQ_DEBUGGER   6  /* (DOM0) A domain has paused for debugging.   */
+#define VIRQ_XENOPROF   7  /* V. XenOprofile interrupt: new sample available */
 
 /* Architecture-specific VIRQ definitions. */
 #define VIRQ_ARCH_0    16
diff --git a/include/xen/interface/xenoprof.h b/include/xen/interface/xenoprof.h
new file mode 100644
index 0000000..8ff3a56
--- /dev/null
+++ b/include/xen/interface/xenoprof.h
@@ -0,0 +1,140 @@
+/******************************************************************************
+ * xenoprof.h
+ * 
+ * Interface for enabling system wide profiling based on hardware performance
+ * counters
+ * 
+ * Permission is hereby granted, free of charge, to any person obtaining a copy
+ * of this software and associated documentation files (the "Software"), to
+ * deal in the Software without restriction, including without limitation the
+ * rights to use, copy, modify, merge, publish, distribute, sublicense, and/or
+ * sell copies of the Software, and to permit persons to whom the Software is
+ * furnished to do so, subject to the following conditions:
+ *
+ * The above copyright notice and this permission notice shall be included in
+ * all copies or substantial portions of the Software.
+ *
+ * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+ * IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+ * FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
+ * AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+ * LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING
+ * FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER
+ * DEALINGS IN THE SOFTWARE.
+ *
+ * Copyright (C) 2005 Hewlett-Packard Co.
+ * Written by Aravind Menon & Jose Renato Santos
+ */
+
+#ifndef __XEN_PUBLIC_XENOPROF_H__
+#define __XEN_PUBLIC_XENOPROF_H__
+
+#include "xen.h"
+
+/*
+ * Commands to HYPERVISOR_xenoprof_op().
+ */
+#define XENOPROF_init                0
+#define XENOPROF_reset_active_list   1
+#define XENOPROF_reset_passive_list  2
+#define XENOPROF_set_active          3
+#define XENOPROF_set_passive         4
+#define XENOPROF_reserve_counters    5
+#define XENOPROF_counter             6
+#define XENOPROF_setup_events        7
+#define XENOPROF_enable_virq         8
+#define XENOPROF_start               9
+#define XENOPROF_stop               10
+#define XENOPROF_disable_virq       11
+#define XENOPROF_release_counters   12
+#define XENOPROF_shutdown           13
+#define XENOPROF_get_buffer         14
+#define XENOPROF_set_backtrace      15
+#define XENOPROF_last_op            15
+
+#define MAX_OPROF_EVENTS    32
+#define MAX_OPROF_DOMAINS   25
+#define XENOPROF_CPU_TYPE_SIZE 64
+
+#define DEFINE_XEN_GUEST_HANDLE(x)
+
+/* Xenoprof performance events (not Xen events) */
+struct event_log {
+    uint64_t eip;
+    uint8_t mode;
+    uint8_t event;
+};
+
+/* PC value that indicates a special code */
+#define XENOPROF_ESCAPE_CODE ~0UL
+/* Transient events for the xenoprof->oprofile cpu buf */
+#define XENOPROF_TRACE_BEGIN 1
+
+/* Xenoprof buffer shared between Xen and domain - 1 per VCPU */
+struct xenoprof_buf {
+    uint32_t event_head;
+    uint32_t event_tail;
+    uint32_t event_size;
+    uint32_t vcpu_id;
+    uint64_t xen_samples;
+    uint64_t kernel_samples;
+    uint64_t user_samples;
+    uint64_t lost_samples;
+    struct event_log event_log[1];
+};
+#ifndef __XEN__
+typedef struct xenoprof_buf xenoprof_buf_t;
+DEFINE_XEN_GUEST_HANDLE(xenoprof_buf_t);
+#endif
+
+struct xenoprof_init {
+    int32_t  num_events;
+    int32_t  is_primary;
+    char cpu_type[XENOPROF_CPU_TYPE_SIZE];
+};
+typedef struct xenoprof_init xenoprof_init_t;
+DEFINE_XEN_GUEST_HANDLE(xenoprof_init_t);
+
+struct xenoprof_get_buffer {
+    int32_t  max_samples;
+    int32_t  nbuf;
+    int32_t  bufsize;
+    uint64_t buf_gmaddr;
+};
+typedef struct xenoprof_get_buffer xenoprof_get_buffer_t;
+DEFINE_XEN_GUEST_HANDLE(xenoprof_get_buffer_t);
+
+struct xenoprof_counter {
+    uint32_t ind;
+    uint64_t count;
+    uint32_t enabled;
+    uint32_t event;
+    uint32_t hypervisor;
+    uint32_t kernel;
+    uint32_t user;
+    uint64_t unit_mask;
+};
+typedef struct xenoprof_counter xenoprof_counter_t;
+DEFINE_XEN_GUEST_HANDLE(xenoprof_counter_t);
+
+typedef struct xenoprof_passive {
+    uint16_t domain_id;
+    int32_t  max_samples;
+    int32_t  nbuf;
+    int32_t  bufsize;
+    uint64_t buf_gmaddr;
+} xenoprof_passive_t;
+DEFINE_XEN_GUEST_HANDLE(xenoprof_passive_t);
+
+
+#endif /* __XEN_PUBLIC_XENOPROF_H__ */
+
+/*
+ * Local variables:
+ * mode: C
+ * c-set-style: "BSD"
+ * c-basic-offset: 4
+ * tab-width: 4
+ * indent-tabs-mode: nil
+ * End:
+ */
diff --git a/include/xen/xen-ops.h b/include/xen/xen-ops.h
index 9769738..8234b8e 100644
--- a/include/xen/xen-ops.h
+++ b/include/xen/xen-ops.h
@@ -25,4 +25,7 @@ int xen_remap_domain_mfn_range(struct vm_area_struct *vma,
 			       pgprot_t prot, unsigned domid);
 
 
+int xen_remap_domain_kernel_mfn_range(unsigned long addr,
+			       unsigned long mfn, int nr,
+			       pgprot_t prot, unsigned domid);
 #endif /* INCLUDE_XEN_OPS_H */
diff --git a/include/xen/xenoprof.h b/include/xen/xenoprof.h
new file mode 100644
index 0000000..2a9a119
--- /dev/null
+++ b/include/xen/xenoprof.h
@@ -0,0 +1,69 @@
+/******************************************************************************
+ * xen/xenoprof.h
+ *
+ * Copyright (c) 2006 Isaku Yamahata <yamahata at valinux co jp>
+ *                    VA Linux Systems Japan K.K.
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write to the Free Software
+ * Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA  02111-1307  USA
+ *
+ */
+
+#ifndef __XEN_XENOPROF_H__
+#define __XEN_XENOPROF_H__
+#ifdef CONFIG_XEN
+
+#if 0
+#include <asm/xenoprof.h>
+#endif
+
+#if defined(CONFIG_X86) || defined(CONFIG_X86_64)
+/* xenoprof x86 specific */
+struct super_block;
+struct dentry;
+int xenoprof_create_files(struct super_block * sb, struct dentry * root);
+#define HAVE_XENOPROF_CREATE_FILES
+
+struct xenoprof_init;
+void xenoprof_arch_init_counter(struct xenoprof_init *init);
+void xenoprof_arch_counter(void);
+void xenoprof_arch_start(void);
+void xenoprof_arch_stop(void);
+
+struct xenoprof_arch_shared_buffer {
+	/* nothing */
+};
+struct xenoprof_shared_buffer;
+void xenoprof_arch_unmap_shared_buffer(struct xenoprof_shared_buffer* sbuf);
+struct xenoprof_get_buffer;
+int xenoprof_arch_map_shared_buffer(struct xenoprof_get_buffer* get_buffer, struct xenoprof_shared_buffer* sbuf);
+struct xenoprof_passive;
+int xenoprof_arch_set_passive(struct xenoprof_passive* pdomain, struct xenoprof_shared_buffer* sbuf);
+#endif
+
+/* xenoprof common */
+struct oprofile_operations;
+int xenoprofile_init(struct oprofile_operations * ops);
+void xenoprofile_exit(void);
+
+struct xenoprof_shared_buffer {
+	char					*buffer;
+	struct xenoprof_arch_shared_buffer	arch;
+};
+#else
+#define xenoprofile_init(ops)	(-ENOSYS)
+#define xenoprofile_exit()	do { } while (0)
+
+#endif /* CONFIG_XEN */
+#endif /* __XEN_XENOPROF_H__ */
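
For reference, the active_domains and passive_domains nodes created above are
ordinary oprofilefs files, so a profiling tool in dom0 can drive them roughly
as in the following minimal user-space sketch.  The /dev/oprofile mount point
and the domain IDs are assumptions, and error handling is kept minimal; this
is an illustration, not part of the patch.

/* Minimal sketch: select Xen domains to profile by writing
 * whitespace-separated domain IDs to the new oprofilefs files. */
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

static int write_domains(const char *path, const char *ids)
{
	int fd = open(path, O_WRONLY);

	if (fd < 0) {
		perror(path);
		return -1;
	}
	/* parsed by adomain_write()/pdomain_write() via simple_strtoul() */
	if (write(fd, ids, strlen(ids)) < 0) {
		perror("write");
		close(fd);
		return -1;
	}
	return close(fd);
}

int main(void)
{
	/* dom0 (id 0) is forced onto the active list by the driver anyway */
	if (write_domains("/dev/oprofile/active_domains", "0 1\n"))
		return 1;
	/* domUs profiled passively from dom0's mapped buffers */
	if (write_domains("/dev/oprofile/passive_domains", "2 3\n"))
		return 1;
	return 0;
}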

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xensource.com
http://lists.xensource.com/xen-devel

^ permalink raw reply related	[flat|nested] 21+ messages in thread

* Re: [PATCH 0 of 5] Add credit2 scheduler (EXPERIMENTAL)
  2010-04-15 17:33       ` Dulloor
@ 2010-04-15 18:57         ` Naresh Rapolu
  0 siblings, 0 replies; 21+ messages in thread
From: Naresh Rapolu @ 2010-04-15 18:57 UTC (permalink / raw)
  To: Dulloor; +Cc: George Dunlap, xen-devel@lists.xensource.com

Hello Dulloor,

Thank you so much for sharing this patch!
Jeremy and I feel that "perf" support is needed in Xen for thorough
(per-process, per-vcpu) profiling, not just statistical profiling by
Oprofile.
Now that I have this latest Xenoprof patch, I will try to stick to its
design and see how the Linux "perf" subsystem can use the xenoprof
hypercall interfaces.

Will keep updating you on this regularly.

Thanks,
Naresh Rapolu,
PhD student, Computer Science,
Purdue University.

Dulloor wrote:
> Naresh,
>
> If you are interested only in profiling, you could use xenoprof too.
> I had ported xenoprof to pvops (attaching a patch that applies cleanly
> to linux pvops). I have used this with passive profiling and for
> profiling xen/dom0. This patch also includes an obvious fix (over
> oprofile branch in Jeremy's repo) for active profiling, although I
> didn't get a chance to test.
>
> Please let me know if you try this and if you face any issues.
>
> thanks
> dulloor
>
> On Thu, Apr 15, 2010 at 12:46 PM, Naresh Rapolu <nrapolu@purdue.edu> wrote:
>   
>> Hello George,
>>
>> I am trying to get the Linux "perf" tool to work with Xen (virtualizing the
>> PMU to measure hardware events from inside guests).
>> I have the following options:
>>
>>  1. allowing the guest kernel to see the PMU hardware features via
>>     cpuid, and then doing whatever is necessary to make them work as
>>     expected (by instruction emulation, etc), or
>>  2. keeping them hidden, but adding a new Xen interface and the
>>     appropriate Linux-side code to detect that interface and use it
>>
>>
>> Does Xenalyze have any code relevant to this?  Can you think of any
>> directions in this regard?
>>
>> Thanks,
>> Naresh Rapolu.
>>
>>
>> George Dunlap wrote:
>>     
>>> I have not measured cache / TLB misses with this workload yet.  In the
>>> past I've instrumented the scheduler trace records in Xen to include
>>> performance counters such as instructions executed and cache / tlb misses,
>>> and then used xenalyze (http://xenbits.xensource.com/ext/xenalyze.hg) to
>>> analyze them.  But the functionality for both capture and analysis was never
>>> standardized or added to mainline.
>>>
>>> I'd be happy to help point you in the right direction if you're interested
>>> in investing in that approach. :-)
>>>
>>> -George
>>>
>>> Naresh Rapolu wrote:
>>>       
>>>> Hello George,
>>>>
>>>> How did you measure cache/TLB misses etc. while using/profiling this new
>>>> scheduler?  Any tool that you've used which works with Xen?
>>>>
>>>> Thanks,
>>>> Naresh Rapolu.
>>>> PhD Student, Computer Science,
>>>> Purdue University.
>>>>
>>>> George Dunlap wrote:
>>>>
>>>>         
>>>>> This patch series introduces the credit2 scheduler.  The first two
>>>>> patches
>>>>> introduce changes necessary to allow the credit2 shared runqueue
>>>>> functionality
>>>>> to work properly; the last two implement the functionality itself.
>>>>>
>>>>> The scheduler is still in the experimental phase.  There's lots of
>>>>> opportunity to contribute with independent lines of development; email
>>>>> George Dunlap <george.dunlap@eu.citrix.com> or check out the wiki page
>>>>> http://wiki.xensource.com/xenwiki/Credit2_Scheduler_Development for
>>>>> ideas
>>>>> and status updates.
>>>>>
>>>>> 19 files changed, 1453 insertions(+), 21 deletions(-)
>>>>> tools/libxc/Makefile                      |    1
>>>>> tools/libxc/xc_csched2.c                  |   50 +
>>>>> tools/libxc/xenctrl.h                     |    8
>>>>> tools/python/xen/lowlevel/xc/xc.c         |   58 +
>>>>> tools/python/xen/xend/XendAPI.py          |    3
>>>>> tools/python/xen/xend/XendDomain.py       |   54 +
>>>>> tools/python/xen/xend/XendDomainInfo.py   |    4
>>>>> tools/python/xen/xend/XendNode.py         |    4
>>>>> tools/python/xen/xend/XendVMMetrics.py    |    1
>>>>> tools/python/xen/xend/server/SrvDomain.py |   14
>>>>> tools/python/xen/xm/main.py               |   82 ++
>>>>> xen/arch/ia64/vmx/vmmu.c                  |    6
>>>>> xen/common/Makefile                       |    1
>>>>> xen/common/sched_credit.c                 |    8
>>>>> xen/common/sched_credit2.c                | 1125 +++++++++++++++++++++++++++++
>>>>> xen/common/schedule.c                     |   22
>>>>> xen/include/public/domctl.h               |    4
>>>>> xen/include/public/trace.h                |    1
>>>>> xen/include/xen/sched-if.h                |   28
>>>>> _______________________________________________
>>>>> Xen-devel mailing list
>>>>> Xen-devel@lists.xensource.com
>>>>> http://lists.xensource.com/xen-devel
>>>>>
>>>>>           
>>>>         
>> _______________________________________________
>> Xen-devel mailing list
>> Xen-devel@lists.xensource.com
>> http://lists.xensource.com/xen-devel
>>
>>     
>> ------------------------------------------------------------------------
>>
>> _______________________________________________
>> Xen-devel mailing list
>> Xen-devel@lists.xensource.com
>> http://lists.xensource.com/xen-devel

^ permalink raw reply	[flat|nested] 21+ messages in thread

* RE: [PATCH 0 of 5] Add credit2 scheduler (EXPERIMENTAL)
  2010-04-14 14:29   ` [PATCH 0 of 5] Add credit2 scheduler (EXPERIMENTAL) George Dunlap
  2010-04-14 14:52     ` Keir Fraser
@ 2010-04-15 20:11     ` Dan Magenheimer
  1 sibling, 0 replies; 21+ messages in thread
From: Dan Magenheimer @ 2010-04-15 20:11 UTC (permalink / raw)
  To: George Dunlap; +Cc: xen-devel

Well, sadly, credit2 doesn't seem to solve the problem... and
even more sadly causes worse performance on my overcommitted
workload.

elapsed is wallclock seconds from the time the first VM
 launches the first "make clean" until the fourth VM finishes
 its second make.

sumvcpu is the sum of the vcpu sec (including dom0) reported
 by xm list after all VMs have finished the workload and
 force-crashed

dom0 is the vcpu sec reported by xm list for dom0

credit: 5 test runs
elapsed=(9447,9388,9578,9576,9412)
sumvcpu=(13665,13671,13693,13589,13598)
dom0=(559,556,555,467,483)

sedf: 6 test runs
elapsed=(10022,9418,9637,12129,13599,11875)
sumvcpu=(13539,13514,13510,14270,14447,14237)
dom0=(473,468,460,482,537,475)

credit2: 6 test runs
elapsed=(11007,9931,10051,10090,11647,10070)
sumvcpu=(14878,14615,14610,14641,14886,14594)
dom0=(510,470,471,482,536,463)
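
(Averaging the runs above: elapsed comes to roughly 9480 for credit,
11113 for sedf, and 10466 for credit2; sumvcpu to roughly 13643, 13920,
and 14704.  So on this box credit2 costs about 10% in elapsed time and
about 8% in vcpu time relative to credit.)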

P.S.  physical machine is a single-socket, dual-core

> -----Original Message-----
> From: George Dunlap [mailto:george.dunlap@eu.citrix.com]
> Sent: Wednesday, April 14, 2010 8:30 AM
> To: Dan Magenheimer
> Cc: xen-devel@lists.xensource.com
> Subject: Re: [Xen-devel] [PATCH 0 of 5] Add credit2 scheduler
> (EXPERIMENTAL)
> 
> Keir has checked the patches in, so if you wait a bit, they should show
> up on the public repository.
> 
> The tool patch is only necessary for adjusting the weight; if you're OK
> using the default weight, just adding "sched=credit2" on the xen
> command-line should be fine.
> 
> Don't forget that this isn't meant to perform well on multiple sockets
> yet. :-)
> 
>  -George
> 
> Dan Magenheimer wrote:
> > Hi George --
> >
> > I'm seeing some problems applying the patches (such as "malformed
> > patch").  If you could send me a monolithic patch in an attachment
> > and tell me what cset in http://xenbits.xensource.com/xen-unstable.hg
> > that it successfully applies against, I will try to give my
> > workload a test against it to see if it has the same
> > symptoms.
> >
> > Also, do I need to apply the tools patch if I don't intend
> > to specify any parameters, or is the xen patch + "sched=credit2"
> > in a boot param sufficient?
> >
> > Thanks,
> > Dan
> >
> >
> >> -----Original Message-----
> >> From: George Dunlap [mailto:george.dunlap@eu.citrix.com]
> >> Sent: Wednesday, April 14, 2010 4:26 AM
> >> To: xen-devel@lists.xensource.com
> >> Cc: george.dunlap@eu.citrix.com
> >> Subject: [Xen-devel] [PATCH 0 of 5] Add credit2 scheduler
> >> (EXPERIMENTAL)
> >>
> >> This patch series introduces the credit2 scheduler.  The first two
> >> patches
> >> introduce changes necessary to allow the credit2 shared runqueue
> >> functionality
> >> to work properly; the last two implement the functionality itself.
> >>
> >> The scheduler is still in the experimental phase.  There's lots of
> >> opportunity to contribute with independent lines of development;
> email
> >> George Dunlap <george.dunlap@eu.citrix.com> or check out the wiki
> page
> >> http://wiki.xensource.com/xenwiki/Credit2_Scheduler_Development for
> >> ideas
> >> and status updates.
> >>
> >> 19 files changed, 1453 insertions(+), 21 deletions(-)
> >> tools/libxc/Makefile                      |    1
> >> tools/libxc/xc_csched2.c                  |   50 +
> >> tools/libxc/xenctrl.h                     |    8
> >> tools/python/xen/lowlevel/xc/xc.c         |   58 +
> >> tools/python/xen/xend/XendAPI.py          |    3
> >> tools/python/xen/xend/XendDomain.py       |   54 +
> >> tools/python/xen/xend/XendDomainInfo.py   |    4
> >> tools/python/xen/xend/XendNode.py         |    4
> >> tools/python/xen/xend/XendVMMetrics.py    |    1
> >> tools/python/xen/xend/server/SrvDomain.py |   14
> >> tools/python/xen/xm/main.py               |   82 ++
> >> xen/arch/ia64/vmx/vmmu.c                  |    6
> >> xen/common/Makefile                       |    1
> >> xen/common/sched_credit.c                 |    8
> >> xen/common/sched_credit2.c                | 1125
> >> +++++++++++++++++++++++++++++
> >> xen/common/schedule.c                     |   22
> >> xen/include/public/domctl.h               |    4
> >> xen/include/public/trace.h                |    1
> >> xen/include/xen/sched-if.h                |   28
> >>
> >> _______________________________________________
> >> Xen-devel mailing list
> >> Xen-devel@lists.xensource.com
> >> http://lists.xensource.com/xen-devel
> >>
> 

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH 0 of 5] Add credit2 scheduler (EXPERIMENTAL)
  2010-04-15 14:17   ` George Dunlap
@ 2010-04-17 20:29     ` Dulloor
  0 siblings, 0 replies; 21+ messages in thread
From: Dulloor @ 2010-04-17 20:29 UTC (permalink / raw)
  To: George Dunlap; +Cc: xen-devel@lists.xensource.com

On Thu, Apr 15, 2010 at 10:17 AM, George Dunlap
<george.dunlap@eu.citrix.com> wrote:
> Dulloor wrote:
>>
>> As we talked before, I am interested in improving the multiple-socket
>> scenario and adding the load balancing functionality, which could
>> provide an acceptable alternative to pinning vcpus to sockets (for my
>> NUMA work). I am going over your patch right now, but what are your
>> thoughts ?
>>
>
> That would be great -- my focus for the next several months will be setting
> up a testing infrastructure to automatically test performance of different
> workload mixes so I can hone the algorithm and test for regressions.
>
> My idea with load balancing was to do this:
> * One runqueue per L2 cache.
> * Add code to calculate the load of a runqueue.  Load would be the average
> (~integral) of (vcpus running + vcpus on runqueue).  I was planning on doing
> accurate load calculation, rather than sample-based, and falling back to
> sample-based if accurate turned out to be too slow.
> * Calculate the load contributed by various vcpus.
> * At regular intervals, determine if some kind of balancing needs to be done
> by looking at the overall runqueue load and placing based on "contributory"
> load of each VCPU.
>
> Does that make sense?  Thoughts?

Sounds good. I can see that the runq_map for all cpus points to the
same run-queue (in make_runq_map). I will start there.
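
A rough sketch of the "accurate" (event-driven) load accounting described
above might look as follows.  The type and helper names are illustrative
assumptions rather than code from any posted patch, and Xen's s_time_t is
approximated here by int64_t (timestamps assumed monotonically increasing):

#include <stdint.h>

typedef int64_t s_time_t;		/* stand-in for Xen's s_time_t (ns) */

struct runq_load {
	unsigned int nr_active;		/* vcpus running + vcpus on runqueue */
	s_time_t last_update;		/* last time the integral was folded */
	uint64_t load_integral;		/* integral of nr_active over time */
};

/* Fold the elapsed interval into the integral.  Call with the runqueue
 * lock held, whenever nr_active changes or the load is sampled. */
static void runq_update_load(struct runq_load *rl, s_time_t now)
{
	rl->load_integral += (uint64_t)(now - rl->last_update) * rl->nr_active;
	rl->last_update = now;
}

/* Average load since 'start', as a 24.8 fixed-point value (now > start). */
static uint64_t runq_avg_load(struct runq_load *rl, s_time_t start, s_time_t now)
{
	runq_update_load(rl, now);
	return (rl->load_integral << 8) / (uint64_t)(now - start);
}

A balancer would then sample runq_avg_load() on each runqueue at the regular
intervals mentioned above and migrate the vcpus whose contributory load best
evens out the difference.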

>
> I have some old patches that calculated accurate load; I could dig them up
> if you wanted something to start with.  (I don't think they'll apply cleanly
> at the moment.)
>
> Thanks,
> -George
>>
>> -dulloor
>>
>> On Wed, Apr 14, 2010 at 6:26 AM, George Dunlap
>> <george.dunlap@eu.citrix.com> wrote:
>>
>>>
>>> This patch series introduces the credit2 scheduler.  The first two
>>> patches
>>> introduce changes necessary to allow the credit2 shared runqueue
>>> functionality
>>> to work properly; the last two implement the functionality itself.
>>>
>>> The scheduler is still in the experimental phase.  There's lots of
>>> opportunity to contribute with independent lines of development; email
>>> George Dunlap <george.dunlap@eu.citrix.com> or check out the wiki page
>>> http://wiki.xensource.com/xenwiki/Credit2_Scheduler_Development for ideas
>>> and status updates.
>>>
>>> 19 files changed, 1453 insertions(+), 21 deletions(-)
>>> tools/libxc/Makefile                      |    1
>>> tools/libxc/xc_csched2.c                  |   50 +
>>> tools/libxc/xenctrl.h                     |    8
>>> tools/python/xen/lowlevel/xc/xc.c         |   58 +
>>> tools/python/xen/xend/XendAPI.py          |    3
>>> tools/python/xen/xend/XendDomain.py       |   54 +
>>> tools/python/xen/xend/XendDomainInfo.py   |    4
>>> tools/python/xen/xend/XendNode.py         |    4
>>> tools/python/xen/xend/XendVMMetrics.py    |    1
>>> tools/python/xen/xend/server/SrvDomain.py |   14
>>> tools/python/xen/xm/main.py               |   82 ++
>>> xen/arch/ia64/vmx/vmmu.c                  |    6
>>> xen/common/Makefile                       |    1
>>> xen/common/sched_credit.c                 |    8
>>> xen/common/sched_credit2.c                | 1125
>>> +++++++++++++++++++++++++++++
>>> xen/common/schedule.c                     |   22
>>> xen/include/public/domctl.h               |    4
>>> xen/include/public/trace.h                |    1
>>> xen/include/xen/sched-if.h                |   28
>>>
>>> _______________________________________________
>>> Xen-devel mailing list
>>> Xen-devel@lists.xensource.com
>>> http://lists.xensource.com/xen-devel
>>>
>>>
>
>

^ permalink raw reply	[flat|nested] 21+ messages in thread

end of thread, other threads:[~2010-04-17 20:29 UTC | newest]

Thread overview: 21+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2010-04-14 10:26 [PATCH 0 of 5] Add credit2 scheduler (EXPERIMENTAL) George Dunlap
2010-04-14 10:26 ` [PATCH 1 of 5] credit2: Add context_saved scheduler callback George Dunlap
2010-04-14 10:26 ` [PATCH 2 of 5] credit2: Flexible cpu-to-schedule-spinlock mappings George Dunlap
2010-04-14 10:26 ` [PATCH 3 of 5] credit2: Add a scheduler-specific schedule trace class George Dunlap
2010-04-14 10:26 ` [PATCH 4 of 5] credit2: Add credit2 scheduler to hypervisor George Dunlap
2010-04-14 10:26 ` [PATCH 5 of 5] credit2: Add toolstack options to control credit2 scheduler parameters George Dunlap
     [not found] ` <7db7f696-1f0b-44d0-8f7b-eea1be5167dd@default>
2010-04-14 14:29   ` [PATCH 0 of 5] Add credit2 scheduler (EXPERIMENTAL) George Dunlap
2010-04-14 14:52     ` Keir Fraser
2010-04-14 15:59       ` Dan Magenheimer
2010-04-14 16:23         ` Keir Fraser
2010-04-14 16:31           ` Dulloor
2010-04-14 16:36             ` Keir Fraser
2010-04-14 17:04               ` Dan Magenheimer
2010-04-14 16:46           ` Dan Magenheimer
2010-04-15 20:11     ` Dan Magenheimer
     [not found] ` <4BC664E1.7090304@purdue.edu>
2010-04-15 13:53   ` George Dunlap
2010-04-15 16:46     ` Naresh Rapolu
2010-04-15 17:33       ` Dulloor
2010-04-15 18:57         ` Naresh Rapolu
     [not found] ` <h2x940bcfd21004140841kcdffe330xff5d749d43392fe3@mail.gmail.com>
2010-04-15 14:17   ` George Dunlap
2010-04-17 20:29     ` Dulloor

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).