[RFC] Add static priority into credit scheduler

* [RFC] Add static priority into credit scheduler
@ 2009-03-20  9:18 Su, Disheng
  2009-03-20 12:42 ` George Dunlap
  0 siblings, 1 reply; 10+ messages in thread
From: Su, Disheng @ 2009-03-20  9:18 UTC (permalink / raw)
  To: Xen-devel; +Cc: George Dunlap, Su, Disheng, NISHIGUCHI Naoki

[-- Attachment #1: Type: text/plain, Size: 3584 bytes --]

Hi all,
	The attached patches add static priority to the credit scheduler.
	Currently, the credit scheduler has 4 kinds of priority: BOOST, UNDER, OVER and IDLE. The priority of a VM changes dynamically according to its credit or to I/O events, and the highest-priority VM is chosen to be scheduled in for each scheduling period. Because the priority is not fixed, which VM will be scheduled in is not known in advance. The I/O latency caused by the scheduler is well analyzed in [1] and [2]. They provide ways to reduce I/O latency while retaining CPU and I/O fairness between VMs to some extent.
	There are some cases where reducing latency matters much more than CPU or I/O fairness, such as an RTOS guest or a VM with an assigned device (e.g. audio). The straightforward way is to give this VM a static (fixed) highest priority, to make sure it is scheduled each time. The attached patches implement this kind of mechanism, like SCHED_RR/SCHED_FIFO in Linux.
	
	How does it work?
	-- Users can set an RT priority (between 1 and 100) for domains. The larger the number, the higher the priority. Users can also change an RT domain into a non-RT domain by setting its priority to a value outside 1~100.
	-- For RT domains, the scheduler always chooses the highest-priority domain to run; nothing changes for non-RT domains. If RT domains have the same priority, the scheduler round-robins between these domains every 30ms. 30ms is the default scheduling period; it can be changed to 2ms or another value if needed.
	-- The currently running non-RT vcpu is still accounted every 10ms, and all non-RT domains every 30ms, as the credit scheduler did before.
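
	The selection rule above can be sketched in a few lines of C. This is only an illustration and not code from the attached patch: `struct vcpu_sketch`, `is_rt_pri()` and `pick_next()` are hypothetical names, a plain array scan stands in for the scheduler's run queues, and round robin among equal priorities is approximated by preferring the candidate that ran least recently (the assumed `last_run` field).

```c
#include <assert.h>
#include <stddef.h>

#define PRI_RT_MIN 1    /* lowest RT priority in the proposal */
#define PRI_RT_MAX 100  /* highest RT priority in the proposal */

/* Hypothetical stand-in for a runnable vcpu. */
struct vcpu_sketch {
    int pri;            /* 1..100 means RT, anything else means non-RT */
    unsigned last_run;  /* scheduling period in which it last ran */
};

int is_rt_pri(int pri)
{
    return pri >= PRI_RT_MIN && pri <= PRI_RT_MAX;
}

/*
 * Return the index of the RT vcpu to run next, or -1 if no vcpu is RT
 * (the caller would then fall back to the normal credit logic).
 * Higher pri wins; among equal pri, the least recently run one wins.
 */
int pick_next(const struct vcpu_sketch *v, size_t n)
{
    int best = -1;

    assert(v != NULL || n == 0);
    for (size_t i = 0; i < n; i++) {
        if (!is_rt_pri(v[i].pri))
            continue;
        if (best < 0 ||
            v[i].pri > v[best].pri ||
            (v[i].pri == v[best].pri && v[i].last_run < v[best].last_run))
            best = (int)i;
    }
    return best;
}
```

	Updating `last_run` each time a vcpu is picked makes two vcpus of equal priority alternate from one scheduling period to the next.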

	Implementation details:
	 -- In order to minimize the modifications to the credit scheduler, one additional rt runqueue per pcpu is added, and one rt active domain list is added in csched_private. RT vcpus are added to the rt runqueue of the pcpu they run on, and rt domains are added to the rt active domain list.
	 -- The scheduler first chooses the highest-priority vcpu in the rt runqueue if it is not empty; otherwise it chooses from the normal runqueue.
	 -- __runq_insert/__runq_remove are changed to work based on the priority of the vcpu.
	 -- Vcpu accounting only takes effect on non-RT vcpus, as before. Non-RT vcpus proportionally share the rest of the cpu based on their weight. The total weight changes when RT domains are added or removed, e.g. when a non-RT domain is promoted to an RT domain, its weight is subtracted from the total weight.
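
	As a rough sketch of the points above (two queues per pcpu, priority-ordered insertion, RT queue first, and weight bookkeeping on promotion), assuming a simplified singly linked list instead of the list_head queues used in sched_credit.c. All names here (`struct node`, `struct pcpu_sketch`, `runq_insert()`, `runq_pick()`, `total_weight_after_change()`) are hypothetical and do not appear in the patch.

```c
#include <assert.h>
#include <stddef.h>

struct node {              /* hypothetical run-queue element */
    int pri;
    struct node *next;
};

struct pcpu_sketch {
    struct node *rt_runq;  /* RT vcpus, highest priority first */
    struct node *runq;     /* all other vcpus */
};

static int is_rt_pri(int pri) { return pri >= 1 && pri <= 100; }

/* Insert n before the first element with strictly lower priority, so
 * elements of equal priority stay in arrival (FIFO) order. The queue
 * is selected from the priority of the element itself. */
void runq_insert(struct pcpu_sketch *pc, struct node *n)
{
    struct node **pp = is_rt_pri(n->pri) ? &pc->rt_runq : &pc->runq;

    while (*pp != NULL && (*pp)->pri >= n->pri)
        pp = &(*pp)->next;
    n->next = *pp;
    *pp = n;
}

/* The RT queue wins whenever it is non-empty. */
struct node *runq_pick(const struct pcpu_sketch *pc)
{
    return pc->rt_runq != NULL ? pc->rt_runq : pc->runq;
}

/* The total weight covers active non-RT domains only, so it shrinks on
 * promotion to RT and grows again on demotion. */
unsigned total_weight_after_change(unsigned total, unsigned dom_weight,
                                   int old_pri, int new_pri)
{
    if (!is_rt_pri(old_pri) && is_rt_pri(new_pri)) {
        assert(total >= dom_weight);
        return total - dom_weight;
    }
    if (is_rt_pri(old_pri) && !is_rt_pri(new_pri))
        return total + dom_weight;
    return total;
}
```

	For example, with three active non-RT domains of weight 256 each (total 768), promoting one of them to RT leaves a total of 512 for the remaining two to share.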
	
	How to use it:
		Set the priority (y) of a VM (x) with: "xm sched-credit -d x -p y"
	
	Test results:
	I did some tests with these patches using the following configuration:
		CPU: Intel Core 2 Duo E6850, Xen(1881), 7 VMs created on one physical machine A: six of them ping each other in pairs, and the remaining VM has RT priority. Another physical machine B is connected to it directly through a 1G network card. The tests are conducted from B to A, e.g. ping A from B.
	Some test results are uploaded to http://wiki.xensource.com/xenwiki/DishengSu, FYI.

	Summary:
	These patches minimize the scheduling latency at the cost of CPU or I/O fairness. They can be used as a scheduler for RT guests in some cases (such as when RT and non-RT guests co-exist). There are still many other areas to improve for real-time response, such as interrupt latency and the Xen I/O model [3].
	Any comments are appreciated. Thanks!

---------------------
[1] Scheduling I/O in Virtual Machine Monitors
[2] Evaluation and Consideration of the Credit Scheduler for Client Virtualization
[3] A step to support real-time in virtual machine

Best Regards,
Disheng, Su

[-- Attachment #2: static_priority_for_xen.patch --]
[-- Type: application/octet-stream, Size: 9674 bytes --]

diff -r 8ad9c2fabd8e xen/common/sched_credit.c
--- a/xen/common/sched_credit.c	Mon May 11 18:28:19 2009 +0800
+++ b/xen/common/sched_credit.c	Mon May 11 18:35:06 2009 +0800
@@ -52,6 +52,8 @@
 /*
  * Priorities
  */
+#define CSCHED_PRI_RT_MIN        1      /* min RT priority */
+#define CSCHED_PRI_RT_MAX        100    /* max RT priority */
 #define CSCHED_PRI_TS_BOOST      0      /* time-share waking up */
 #define CSCHED_PRI_TS_UNDER     -1      /* time-share w/ credits */
 #define CSCHED_PRI_TS_OVER      -2      /* time-share w/o credits */
@@ -72,6 +74,8 @@
 #define CSCHED_VCPU(_vcpu)  ((struct csched_vcpu *) (_vcpu)->sched_priv)
 #define CSCHED_DOM(_dom)    ((struct csched_dom *) (_dom)->sched_priv)
 #define RUNQ(_cpu)          (&(CSCHED_PCPU(_cpu)->runq))
+#define RT_RUNQ(_cpu)       (&(CSCHED_PCPU(_cpu)->rt_runq))
+#define IS_RT_PRI(pri)       ((pri >= CSCHED_PRI_RT_MIN) && (pri <= CSCHED_PRI_RT_MAX))
 
 
 /*
@@ -185,6 +189,7 @@
  */
 struct csched_pcpu {
     struct list_head runq;
+    struct list_head rt_runq;
     uint32_t runq_sort_last;
     struct timer ticker;
     unsigned int tick;
@@ -223,6 +228,7 @@
     uint16_t active_vcpu_count;
     uint16_t weight;
     uint16_t cap;
+    uint16_t pri;
 };
 
 /*
@@ -231,6 +237,7 @@
 struct csched_private {
     spinlock_t lock;
     struct list_head active_sdom;
+    struct list_head rt_active_sdom;
     uint32_t ncpus;
     unsigned int master;
     cpumask_t idlers;
@@ -273,7 +280,7 @@
 static inline void
 __runq_insert(unsigned int cpu, struct csched_vcpu *svc)
 {
-    const struct list_head * const runq = RUNQ(cpu);
+    const struct list_head * const runq = (IS_RT_PRI(svc->pri))?RT_RUNQ(cpu):RUNQ(cpu);
     struct list_head *iter;
 
     BUG_ON( __vcpu_on_runq(svc) );
@@ -366,6 +373,7 @@
 
     init_timer(&spc->ticker, csched_tick, (void *)(unsigned long)cpu, cpu);
     INIT_LIST_HEAD(&spc->runq);
+    INIT_LIST_HEAD(&spc->rt_runq);
     spc->runq_sort_last = csched_priv.runq_sort;
     per_cpu(schedule_data, cpu).sched_priv = spc;
 
@@ -500,8 +508,13 @@
         list_add(&svc->active_vcpu_elem, &sdom->active_vcpu);
         if ( list_empty(&sdom->active_sdom_elem) )
         {
-            list_add(&sdom->active_sdom_elem, &csched_priv.active_sdom);
-            csched_priv.weight += sdom->weight;
+            if ( IS_RT_PRI(sdom->pri) )
+                list_add(&sdom->active_sdom_elem, &csched_priv.rt_active_sdom);
+            else
+            {
+                list_add(&sdom->active_sdom_elem, &csched_priv.active_sdom);
+                csched_priv.weight += sdom->weight;
+            }
         }
     }
 
@@ -522,9 +535,14 @@
     list_del_init(&svc->active_vcpu_elem);
     if ( list_empty(&sdom->active_vcpu) )
     {
-        BUG_ON( csched_priv.weight < sdom->weight );
-        list_del_init(&sdom->active_sdom_elem);
-        csched_priv.weight -= sdom->weight;
+        if ( IS_RT_PRI(sdom->pri) )
+            list_del_init(&sdom->active_sdom_elem);
+        else
+        {
+            BUG_ON( csched_priv.weight < sdom->weight );
+            list_del_init(&sdom->active_sdom_elem);
+            csched_priv.weight -= sdom->weight;
+        }
     }
 }
 
@@ -547,7 +565,8 @@
     /*
      * Update credits
      */
-    atomic_sub(CSCHED_CREDITS_PER_TICK, &svc->credit);
+    if (!IS_RT_PRI(svc->pri))
+        atomic_sub(CSCHED_CREDITS_PER_TICK, &svc->credit);
 
     /*
      * Put this VCPU and domain back on the active list if it was
@@ -703,12 +722,19 @@
     struct xen_domctl_scheduler_op *op)
 {
     struct csched_dom * const sdom = CSCHED_DOM(d);
+    struct vcpu *v;
+    struct csched_vcpu * svc;
     unsigned long flags;
 
     if ( op->cmd == XEN_DOMCTL_SCHEDOP_getinfo )
     {
         op->u.credit.weight = sdom->weight;
         op->u.credit.cap = sdom->cap;
+        if ( IS_RT_PRI(sdom->pri) )
+            op->u.credit.pri = sdom->pri;
+        else
+            /* for non-rt dom, vcpus may have different pri, use 0 instead */
+            op->u.credit.pri = 0; 
     }
     else
     {
@@ -718,7 +744,7 @@
 
         if ( op->u.credit.weight != 0 )
         {
-            if ( !list_empty(&sdom->active_sdom_elem) )
+            if ( !list_empty(&sdom->active_sdom_elem) && !IS_RT_PRI(sdom->pri) )
             {
                 csched_priv.weight -= sdom->weight;
                 csched_priv.weight += op->u.credit.weight;
@@ -728,6 +754,68 @@
 
         if ( op->u.credit.cap != (uint16_t)~0U )
             sdom->cap = op->u.credit.cap;
+
+        if ( op->u.credit.pri != (int16_t)~0U )
+        {
+            if ( !IS_RT_PRI(op->u.credit.pri) )
+                /* To make sure user input is a right pri, no parameter check in
+                 * user space */
+                op->u.credit.pri = 0;
+
+            if ( !list_empty(&sdom->active_sdom_elem) )
+            {
+                if ( !IS_RT_PRI(sdom->pri) && IS_RT_PRI(op->u.credit.pri) )
+                {
+                    /* Change from non-RT to RT */
+                    list_del_init(&sdom->active_sdom_elem);
+                    list_add(&sdom->active_sdom_elem, &csched_priv.rt_active_sdom);
+                    csched_priv.weight -= sdom->weight;
+                }
+                else if ( IS_RT_PRI(sdom->pri) && !IS_RT_PRI(op->u.credit.pri) )
+                {
+                    /* Change from RT to non-RT*/
+                    list_del_init(&sdom->active_sdom_elem);
+                    list_add(&sdom->active_sdom_elem, &csched_priv.active_sdom);
+                    csched_priv.weight += sdom->weight;
+                }
+            }
+
+            if ( IS_RT_PRI(op->u.credit.pri) )
+            {
+                for_each_vcpu ( d, v )
+                {
+                    svc = CSCHED_VCPU(v); 
+                    svc->pri = op->u.credit.pri;
+                    /* If svc is still in the runq, which means svc is running on the
+                     * current cpu, due to vcpus are already paused if vcpu is not
+                     * running on the current cpu, pls refer to sched_adjust */
+                    if ( __vcpu_on_runq(svc) && !IS_RT_PRI(sdom->pri) )
+                    {
+                        /* switch from non-RT to RT */
+                        /* remove it from normal runqueue */
+                        __runq_remove(svc);
+                        /* then, insert it into rt queue */
+                        __runq_insert(v->processor, svc);
+                    }
+                }
+                sdom->pri = op->u.credit.pri;
+            }
+            else if ( IS_RT_PRI(sdom->pri) )
+            {
+                /* change from RT to non-RT */
+                for_each_vcpu ( d, v )
+                {
+                    svc = CSCHED_VCPU(v); 
+                    svc->pri = op->u.credit.pri;
+                    if ( __vcpu_on_runq(svc) )
+                    {
+                        __runq_remove(svc);
+                        __runq_insert(v->processor, svc);
+                    }
+                }
+                sdom->pri = op->u.credit.pri;
+            }
+        }
 
         spin_unlock_irqrestore(&csched_priv.lock, flags);
     }
@@ -756,6 +844,7 @@
     sdom->dom = dom;
     sdom->weight = CSCHED_DEFAULT_WEIGHT;
     sdom->cap = 0U;
+    sdom->pri = 0;
     dom->sched_priv = sdom;
 
     return 0;
@@ -1168,6 +1257,7 @@
 {
     const int cpu = smp_processor_id();
     struct list_head * const runq = RUNQ(cpu);
+    struct list_head * const rt_runq = RT_RUNQ(cpu);
     struct csched_vcpu * const scurr = CSCHED_VCPU(current);
     struct csched_vcpu *snext;
     struct task_slice ret;
@@ -1183,7 +1273,10 @@
     else
         BUG_ON( is_idle_vcpu(current) || list_empty(runq) );
 
-    snext = __runq_elem(runq->next);
+    if ( !list_empty(rt_runq) )
+        snext = __runq_elem(rt_runq->next);
+    else
+        snext = __runq_elem(runq->next);
 
     /*
      * SMP Load balance:
@@ -1254,7 +1347,7 @@
 static void
 csched_dump_pcpu(int cpu)
 {
-    struct list_head *runq, *iter;
+    struct list_head *runq, *iter, *rt_runq;
     struct csched_pcpu *spc;
     struct csched_vcpu *svc;
     int loop;
@@ -1262,6 +1355,7 @@
 
     spc = CSCHED_PCPU(cpu);
     runq = &spc->runq;
+    rt_runq = &spc->rt_runq;
 
     cpumask_scnprintf(cpustr, sizeof(cpustr), cpu_sibling_map[cpu]);
     printk(" sort=%d, sibling=%s, ", spc->runq_sort_last, cpustr);
@@ -1277,7 +1371,20 @@
     }
 
     loop = 0;
+    printk("\trunq:\n");
     list_for_each( iter, runq )
+    {
+        svc = __runq_elem(iter);
+        if ( svc )
+        {
+            printk("\t%3d: ", ++loop);
+            csched_dump_vcpu(svc);
+        }
+    }
+
+    loop = 0;
+    printk("\nrt_runq:\n");
+    list_for_each( iter, rt_runq )
     {
         svc = __runq_elem(iter);
         if ( svc )
@@ -1340,6 +1447,22 @@
             csched_dump_vcpu(svc);
         }
     }
+
+    printk("\nactive vcpus in rt dom:\n");
+    list_for_each( iter_sdom, &csched_priv.rt_active_sdom )
+    {
+        struct csched_dom *sdom;
+        sdom = list_entry(iter_sdom, struct csched_dom, active_sdom_elem);
+
+        list_for_each( iter_svc, &sdom->active_vcpu )
+        {
+            struct csched_vcpu *svc;
+            svc = list_entry(iter_svc, struct csched_vcpu, active_vcpu_elem);
+
+            printk("\t%3d: ", ++loop);
+            csched_dump_vcpu(svc);
+        }
+    }
 }
 
 static void
@@ -1347,6 +1470,7 @@
 {
     spin_lock_init(&csched_priv.lock);
     INIT_LIST_HEAD(&csched_priv.active_sdom);
+    INIT_LIST_HEAD(&csched_priv.rt_active_sdom);
     csched_priv.ncpus = 0;
     csched_priv.master = UINT_MAX;
     cpus_clear(csched_priv.idlers);

[-- Attachment #3: static_priority_for_xen_tools.patch --]
[-- Type: application/octet-stream, Size: 11804 bytes --]

diff -r cc82d54bedfd tools/python/xen/lowlevel/xc/xc.c
--- a/tools/python/xen/lowlevel/xc/xc.c	Fri Dec 05 15:54:22 2008 +0000
+++ b/tools/python/xen/lowlevel/xc/xc.c	Mon May 11 18:28:19 2009 +0800
@@ -1281,18 +1281,20 @@
     uint32_t domid;
     uint16_t weight;
     uint16_t cap;
-    static char *kwd_list[] = { "domid", "weight", "cap", NULL };
-    static char kwd_type[] = "I|HH";
+    uint16_t pri;
+    static char *kwd_list[] = { "domid", "weight", "cap", "pri", NULL };
+    static char kwd_type[] = "I|HHH";
     struct xen_domctl_sched_credit sdom;
     
     weight = 0;
     cap = (uint16_t)~0U;
     if( !PyArg_ParseTupleAndKeywords(args, kwds, kwd_type, kwd_list, 
-                                     &domid, &weight, &cap) )
+                                     &domid, &weight, &cap, &pri) )
         return NULL;
 
     sdom.weight = weight;
     sdom.cap = cap;
+    sdom.pri = pri;
 
     if ( xc_sched_credit_domain_set(self->xc_handle, domid, &sdom) != 0 )
         return pyxc_error_to_exception();
@@ -1312,9 +1314,10 @@
     if ( xc_sched_credit_domain_get(self->xc_handle, domid, &sdom) != 0 )
         return pyxc_error_to_exception();
 
-    return Py_BuildValue("{s:H,s:H}",
+    return Py_BuildValue("{s:H,s:H,s:H}",
                          "weight",  sdom.weight,
-                         "cap",     sdom.cap);
+                         "cap",     sdom.cap,
+                         "pri",     sdom.pri);
 }
 
 static PyObject *pyxc_domain_setmaxmem(XcObject *self, PyObject *args)
@@ -1722,6 +1725,7 @@
       "SMP credit scheduler.\n"
       " domid     [int]:   domain id to set\n"
       " weight    [short]: domain's scheduling weight\n"
+      " pri       [short]: domain's scheduling pri\n"
       "Returns: [int] 0 on success; -1 on error.\n" },
 
     { "sched_credit_domain_get",
@@ -1731,7 +1735,8 @@
       "SMP credit scheduler.\n"
       " domid     [int]:   domain id to get\n"
       "Returns:   [dict]\n"
-      " weight    [short]: domain's scheduling weight\n"},
+      " weight    [short]: domain's scheduling weight\n"
+      " pri       [short]: domain's scheduling pri\n"},
 
     { "evtchn_alloc_unbound", 
       (PyCFunction)pyxc_evtchn_alloc_unbound,
diff -r cc82d54bedfd tools/python/xen/xend/XendAPI.py
--- a/tools/python/xen/xend/XendAPI.py	Fri Dec 05 15:54:22 2008 +0000
+++ b/tools/python/xen/xend/XendAPI.py	Mon May 11 18:28:19 2009 +0800
@@ -1505,10 +1505,12 @@
 
         #need to update sched params aswell
         if 'weight' in xeninfo.info['vcpus_params'] \
-           and 'cap' in xeninfo.info['vcpus_params']:
+           and 'cap' in xeninfo.info['vcpus_params'] \
+           and 'pri' in xeninfo.info['vcpus_params']:
             weight = xeninfo.info['vcpus_params']['weight']
             cap = xeninfo.info['vcpus_params']['cap']
-            xendom.domain_sched_credit_set(xeninfo.getDomid(), weight, cap)
+            pri = xeninfo.info['vcpus_params']['pri']
+            xendom.domain_sched_credit_set(xeninfo.getDomid(), weight, cap, pri)
 
     def VM_set_VCPUs_number_live(self, _, vm_ref, num):
         dom = XendDomain.instance().get_vm_by_uuid(vm_ref)
diff -r cc82d54bedfd tools/python/xen/xend/XendConfig.py
--- a/tools/python/xen/xend/XendConfig.py	Fri Dec 05 15:54:22 2008 +0000
+++ b/tools/python/xen/xend/XendConfig.py	Mon May 11 18:28:19 2009 +0800
@@ -589,6 +589,8 @@
             int(sxp.child_value(sxp_cfg, "cpu_weight", 256))
         cfg["vcpus_params"]["cap"] = \
             int(sxp.child_value(sxp_cfg, "cpu_cap", 0))
+        cfg["vcpus_params"]["pri"] = \
+            int(sxp.child_value(sxp_cfg, "cpu_pri", 0))
 
         # Only extract options we know about.
         extract_keys = LEGACY_UNSUPPORTED_BY_XENAPI_CFG + \
diff -r cc82d54bedfd tools/python/xen/xend/XendDomain.py
--- a/tools/python/xen/xend/XendDomain.py	Fri Dec 05 15:54:22 2008 +0000
+++ b/tools/python/xen/xend/XendDomain.py	Mon May 11 18:28:19 2009 +0800
@@ -1536,7 +1536,7 @@
 
         @param domid: Domain ID or Name
         @type domid: int or string.
-        @rtype: dict with keys 'weight' and 'cap'
+        @rtype: dict with keys 'weight' and 'cap' and 'pri'
         @return: credit scheduler parameters
         """
         dominfo = self.domain_lookup_nr(domid)
@@ -1550,19 +1550,23 @@
                 raise XendError(str(ex))
         else:
             return {'weight' : dominfo.getWeight(),
-                    'cap'    : dominfo.getCap()} 
+                    'cap'    : dominfo.getCap(), 
+                    'pri'    : dominfo.getPri()} 
     
-    def domain_sched_credit_set(self, domid, weight = None, cap = None):
+    def domain_sched_credit_set(self, domid, weight = None, cap = None, pri =
+            None):
         """Set credit scheduler parameters for a domain.
 
         @param domid: Domain ID or Name
         @type domid: int or string.
         @type weight: int
         @type cap: int
+        @type pri: int
         @rtype: 0
         """
         set_weight = False
         set_cap = False
+        set_pri = False
         dominfo = self.domain_lookup_nr(domid)
         if not dominfo:
             raise XendInvalidDomain(str(domid))
@@ -1581,17 +1585,26 @@
             else:
                 set_cap = True
 
+            if pri is None:
+                pri = int(~0)
+            else:
+                set_pri = True
+
             assert type(weight) == int
             assert type(cap) == int
+            assert type(pri) == int
 
             rc = 0
             if dominfo._stateGet() in (DOM_STATE_RUNNING, DOM_STATE_PAUSED):
-                rc = xc.sched_credit_domain_set(dominfo.getDomid(), weight, cap)
+                rc = xc.sched_credit_domain_set(dominfo.getDomid(), weight, cap,
+                        pri)
             if rc == 0:
                 if set_weight:
                     dominfo.setWeight(weight)
                 if set_cap:
                     dominfo.setCap(cap)
+                if set_pri:
+                    dominfo.setPri(pri)
                 self.managed_config_save(dominfo)
             return rc
         except Exception, ex:
diff -r cc82d54bedfd tools/python/xen/xend/XendDomainInfo.py
--- a/tools/python/xen/xend/XendDomainInfo.py	Fri Dec 05 15:54:22 2008 +0000
+++ b/tools/python/xen/xend/XendDomainInfo.py	Mon May 11 18:28:19 2009 +0800
@@ -465,7 +465,8 @@
                 if xennode.xenschedinfo() == 'credit':
                     xendomains.domain_sched_credit_set(self.getDomid(),
                                                        self.getWeight(),
-                                                       self.getCap())
+                                                       self.getCap(),
+                                                       self.getPri())
             except:
                 log.exception('VM start failed')
                 self.destroy()
@@ -1617,6 +1618,12 @@
 
     def setWeight(self, cpu_weight):
         self.info['vcpus_params']['weight'] = cpu_weight
+
+    def getPri(self):
+        return self.info['vcpus_params']['pri']
+
+    def setPri(self, cpu_pri):
+        self.info['vcpus_params']['pri'] = cpu_pri
 
     def getRestartCount(self):
         return self._readVm('xend/restart_count')
diff -r cc82d54bedfd tools/python/xen/xm/main.py
--- a/tools/python/xen/xm/main.py	Fri Dec 05 15:54:22 2008 +0000
+++ b/tools/python/xen/xm/main.py	Mon May 11 18:28:19 2009 +0800
@@ -150,7 +150,7 @@
     'log'         : ('', 'Print Xend log'),
     'rename'      : ('<Domain> <NewDomainName>', 'Rename a domain.'),
     'sched-sedf'  : ('<Domain> [options]', 'Get/set EDF parameters.'),
-    'sched-credit': ('[-d <Domain> [-w[=WEIGHT]|-c[=CAP]]]',
+    'sched-credit': ('[-d <Domain> [-w[=WEIGHT]|-c[=CAP]|-p[=PRI]]]',
                      'Get/set credit scheduler parameters.'),
     'sysrq'       : ('<Domain> <letter>', 'Send a sysrq to a domain.'),
     'debug-keys'  : ('<Keys>', 'Send debug keys to Xen.'),
@@ -240,6 +240,7 @@
        ('-d DOMAIN', '--domain=DOMAIN', 'Domain to modify'),
        ('-w WEIGHT', '--weight=WEIGHT', 'Weight (int)'),
        ('-c CAP',    '--cap=CAP',       'Cap (int)'),
+       ('-p PRI',    '--pri=PRI',       'Pri (int)'),
     ),
     'list': (
        ('-l', '--long',         'Output all VM details in SXP'),
@@ -1578,8 +1579,8 @@
     check_sched_type('credit')
 
     try:
-        opts, params = getopt.getopt(args, "d:w:c:",
-            ["domain=", "weight=", "cap="])
+        opts, params = getopt.getopt(args, "d:w:c:p:",
+            ["domain=", "weight=", "cap=", "pri="])
     except getopt.GetoptError, opterr:
         err(opterr)
         usage('sched-credit')
@@ -1587,6 +1588,7 @@
     domid = None
     weight = None
     cap = None
+    pri = None
 
     for o, a in opts:
         if o in ["-d", "--domain"]:
@@ -1595,17 +1597,19 @@
             weight = int(a)
         elif o in ["-c", "--cap"]:
             cap = int(a);
+        elif o in ["-p", "--pri"]:
+            pri = int(a);
 
     doms = filter(lambda x : domid_match(domid, x),
                   [parse_doms_info(dom)
                   for dom in getDomains(None, 'all')])
 
-    if weight is None and cap is None:
+    if weight is None and cap is None and pri is None:
         if domid is not None and doms == []: 
             err("Domain '%s' does not exist." % domid)
             usage('sched-credit')
         # print header if we aren't setting any parameters
-        print '%-33s %4s %6s %4s' % ('Name','ID','Weight','Cap')
+        print '%-33s %4s %6s %4s %4s' % ('Name','ID','Weight','Cap','Pri')
         
         for d in doms:
             try:
@@ -1618,16 +1622,17 @@
             except xmlrpclib.Fault:
                 pass
 
-            if 'weight' not in info or 'cap' not in info:
+            if 'weight' not in info or 'cap' not in info or 'pri' not in info:
                 # domain does not support sched-credit?
-                info = {'weight': -1, 'cap': -1}
+                info = {'weight': -1, 'cap': -1, 'pri': -1}
 
             info['weight'] = int(info['weight'])
             info['cap']    = int(info['cap'])
+            info['pri']    = int(info['pri'])
             
             info['name']  = d['name']
             info['domid'] = str(d['domid'])
-            print( ("%(name)-32s %(domid)5s %(weight)6d %(cap)4d") % info)
+            print( ("%(name)-32s %(domid)5s %(weight)6d %(cap)4d %(pri)4d") % info)
     else:
         if domid is None:
             # place holder for system-wide scheduler parameters
@@ -1644,6 +1649,10 @@
                     get_single_vm(domid),
                     "cap",
                     cap)
+                server.xenapi.VM.add_to_VCPUs_params_live(
+                    get_single_vm(domid),
+                    "pri",
+                    pri)
             else:
                 server.xenapi.VM.add_to_VCPUs_params(
                     get_single_vm(domid),
@@ -1653,8 +1662,12 @@
                     get_single_vm(domid),
                     "cap",
                     cap)
+                server.xenapi.VM.add_to_VCPUs_params(
+                    get_single_vm(domid),
+                    "pri",
+                    pri)
         else:
-            result = server.xend.domain.sched_credit_set(domid, weight, cap)
+            result = server.xend.domain.sched_credit_set(domid, weight, cap, pri)
             if result != 0:
                 err(str(result))
 
diff -r cc82d54bedfd xen/include/public/domctl.h
--- a/xen/include/public/domctl.h	Fri Dec 05 15:54:22 2008 +0000
+++ b/xen/include/public/domctl.h	Mon May 11 18:28:19 2009 +0800
@@ -311,6 +311,7 @@
         struct xen_domctl_sched_credit {
             uint16_t weight;
             uint16_t cap;
+            int16_t  pri;
         } credit;
     } u;
 };

[-- Attachment #4: Type: text/plain, Size: 138 bytes --]

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xensource.com
http://lists.xensource.com/xen-devel

* Re: [RFC] Add static priority into credit scheduler
  2009-03-20  9:18 [RFC] Add static priority into credit scheduler Su, Disheng
@ 2009-03-20 12:42 ` George Dunlap
  2009-03-23  7:33   ` Su, Disheng
  0 siblings, 1 reply; 10+ messages in thread
From: George Dunlap @ 2009-03-20 12:42 UTC (permalink / raw)
  To: Su, Disheng; +Cc: Xen-devel, NISHIGUCHI Naoki

So, just to be clear:  you're proposing that this mechanism *might* be
useful for a VM with real-time scheduling requirements?  Or are you
actually working on / developing real-time operating systems, and are
suggesting this in order to support real-time VMs?

I'm not an expert in real-time scheduling, but it doesn't seem to me
like this will really be what a real-time system would want.  (Feel
free to contradict me if you know better.)  It might work OK if there
were only a single real-time PV guest, but in the face of competition,
you'd have trouble.  It seems like an actual real-time Xen scheduler
would want the PV guests to submit deadlines to Xen, and then Xen
could try to make a decision as to which deadlines to drop if it needs
to (based on some mechanism).

The only test you've measured is networking; but networking isn't a
"real-time" workload, it's a latency-sensitive workload.  And you
haven't measured:
* The effect on network traffic if you have several high-priority VMs competing
* The effect on network traffic of non-prioritized VMs if a
high-priority VM is receiving traffic, or is misbehaving

You also haven't compared how raising a VM's priority within the
current credit framework, such as giving it a very high weight,
affects the numbers.  Can you get similar results if you were to give
the "latency-sensitive" VMs a weight of, say, 10000, and leave the
other ones at 256?
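
To put rough numbers on that experiment: the credit scheduler shares the
CPU in proportion to weight when all domains are runnable.  A small
sketch, assuming one VM at weight 10000 and six others at 256 (the count
of six is an assumption borrowed from the 7-VM test setup), and ignoring
caps, BOOST and multiple cores; `share_permille()` is a hypothetical
helper, not a Xen function.

```c
#include <assert.h>
#include <stddef.h>

/* Share of the CPU for domain `idx`, in parts per thousand, when every
 * domain is runnable and shares are proportional to weight.
 * Returns 0 for an out-of-range index or an all-zero weight table. */
unsigned share_permille(const unsigned *weights, size_t n, size_t idx)
{
    unsigned long total = 0;

    assert(weights != NULL || n == 0);
    for (size_t i = 0; i < n; i++)
        total += weights[i];
    if (total == 0 || idx >= n)
        return 0;
    return (unsigned)((1000UL * weights[idx]) / total);
}
```

With weights {10000, 256, 256, 256, 256, 256, 256} this gives 866 for
the first domain and 22 for each of the others, i.e. roughly 86.6%
versus 2.2% of the CPU.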

Overall, I don't think fixed priorities like this are a good solution:
I think it will create more problems than it solves, and I think it's
actually harder to predict how a complex system will actually behave
(and thus harder to configure properly).

I think the proper solution (and I'm working on a "credit2" scheduler
that has these properties) is:
* Fix the credit assignment, so that VMs don't spend very much time in "over"
* Give VMs that wake up and are under their credits a fixed "boost"
period (e.g., 1ms)
* Allow users to specify a cpu "reservation"; so that no matter how
much work there is on the system, a VM can be guaranteed to get a
minimum fixed amount of the cpu if it wants it; e.g., dom0 always gets
50% of one core if it wants it, no matter how many other VMs are on
the system.

#1 and #2 have resulted in significant improvements in TCP throughput
in the face of competition.  I hope to publish a draft here on the
list sometime soon, but I'm still working out some of the details.

 -George Dunlap



* RE: [RFC] Add static priority into credit scheduler
  2009-03-20 12:42 ` George Dunlap
@ 2009-03-23  7:33   ` Su, Disheng
  2009-03-25 10:35     ` George Dunlap
  0 siblings, 1 reply; 10+ messages in thread
From: Su, Disheng @ 2009-03-23  7:33 UTC (permalink / raw)
  To: George Dunlap; +Cc: Xen-devel, NISHIGUCHI Naoki, Su, Disheng

George Dunlap wrote:
> So, just to be clear:  you're proposing that this mechanism *might* be
> useful for a VM with real-time scheduling requirements?  Or are
> actually working on / developing real-time operating systems, and are
> suggesting this in order to support real-time VMs?

The first one. I think it will be useful for consolidating VMs with real-time requirements:
1. Enterprise real-time: such as SUSE Linux Enterprise Real Time (http://www.novell.com/products/realtime/), Red Hat Enterprise MRG (http://www.redhat.com/mrg/) and some apps/middleware (http://www-03.ibm.com/linux/realtime.html). They are used for mission-critical applications such as trading systems, VoIP servers, etc.
2. Embedded real-time: the normal usage model is to consolidate one embedded RTOS (QNX, VxWorks, etc.) and a general purpose OS (Linux/Windows) on one cpu core.

> 
> I'm not an expert in real-time scheduling, but it doesn't seem to me
> like this will really be what a real-time system would want.  (Feel
> free to contradict me if you know better.)  It might work OK if there
> were only a single real-time PV guest, but in the face of competition,
> you'd have trouble.  It seems like an actual real-time Xen scheduler
> would want the PV guests to submit deadlines to Xen, and then Xen
> could try to make a decision as to which deadlines to drop if it needs
> to (based on some mechanism).

I agree this is one way to go. But it's not suitable for all real-time OSes, e.g. enterprise real-time Linux (no period/deadline at all), whose real-time scheduling mechanism is SCHED_RR/SCHED_FIFO (based on static priority). It's quite different from a traditional embedded real-time OS.
On the other hand, there are so many embedded real-time COTS OSes, and what about an unmodified RTOS:)? If you just consolidate one embedded real-time OS and one general purpose OS (I guess this is the normal usage model currently on cellphones/industrial control), how about just setting the RTOS to the highest priority?

It's true that static priority has problems if two or more RT VMs compete with each other on one physical cpu. This case can be addressed by a real-time PV guest as you said, or in other ways. Currently I make the assumption that there is only one RT VM plus one or more non-RT VMs on one physical cpu core/thread. That's reasonable, especially with quad/many-core.
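
For reference, the selection rule from the RFC (take the highest priority vcpu from the rt runqueue if it is not empty, otherwise fall back to the normal runqueue) can be sketched roughly as below. This is a standalone illustration and not the patch code: the struct and function names are hypothetical, and the real __runq_insert/__runq_remove work on Xen's list types.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, simplified types: not the structures used in the patch. */
struct vcpu_entry {
    int pri;                  /* RT priority 1..100, larger is higher */
    struct vcpu_entry *next;  /* next entry in the same runqueue */
};

struct pcpu_queues {
    struct vcpu_entry *rt_runq;  /* kept sorted, highest priority first */
    struct vcpu_entry *runq;     /* normal credit runqueue */
};

/* Insert into a queue sorted by descending priority. An entry is queued
 * behind existing entries of equal priority, which gives round robin
 * among vcpus of the same priority. */
void runq_insert(struct vcpu_entry **head, struct vcpu_entry *v)
{
    assert(v != NULL);
    while (*head != NULL && (*head)->pri >= v->pri)
        head = &(*head)->next;
    v->next = *head;
    *head = v;
}

/* Pick the next vcpu: rt runqueue first, then the normal runqueue.
 * Returns NULL if both queues are empty. */
struct vcpu_entry *pick_next(struct pcpu_queues *q)
{
    struct vcpu_entry **head = (q->rt_runq != NULL) ? &q->rt_runq : &q->runq;
    struct vcpu_entry *v = *head;

    if (v != NULL) {
        *head = v->next;
        v->next = NULL;
    }
    return v;
}
```

With a single RT VM per pcpu, the rt runqueue holds at most that VM's vcpus, so the non-RT vcpus are only considered when the RT VM has nothing to run.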

> 
> The only test you've measured is networking; but networking isn't a
> "real-time" workload, it's a latency-sensitive workload.  And you

Yes, the normal/simple "real-time" workload is "sleep_for_some_ns-and-wake_up", such as Cyclictest(http://rt.wiki.kernel.org/index.php/Cyclictest), but it depends on the hrtimer in dom0. In order to minimize the dependence on dom0, I use the assigned network card instead: send a packet from a remote machine, then measure the latency from the response of the RT VM.
The improvement from the scheduler is easy to see...

> haven't measured:
> * The effect on network traffic if you have several high-priority VMs
> competing 

Currently it's not in my scope. And I think it's very hard to schedule multiple RT VMs on one CPU in practice.

> * The effect on network traffic of non-prioritized VMs if a
> high-priority VM is receiving traffic, or is misbehaving
> 

RT VM is dealing with critical events, so we trust it...

> You also haven't compared how raising a VM's priority within the
> current credit framework, such as giving it a very high weight,
> affects the numbers.  Can you get similar results if you were to give
> the "latency-sensitive" VMs a weight of, say, 10000, and leave the
> other ones at 256?

"Weight" isn't helpful here.  I tested one VM with an assigned audio device, and the noise is obvious if another VM is busy.
The current credit framework has the following issues AFAIK:
	A VM in the OVER state can't be BOOSTed
	With multiple VMs in the BOOST state, there is no preemption
So there is no guarantee which VM gets scheduled in.

> 
> Overall, I don't think fixed priorities like this is a good solution:
> I think it will create more problems than it solves, and I think it's
> actually harder to predict how a complex system will actually behave
> (and thus harder to configure properly).

Static priority is useful in the simple case (one RT VM and multiple non-RT VMs on one cpu core/thread).
If we trust the RT VM, then it's easy to configure. I mean, an RT VM usually responds to external events in a timely manner and then sleeps.
I know you may be concerned that a misbehaving RT VM may monopolise the whole cpu.
Static priority is just a scheduling mechanism. It's up to the user whether to use it.
Linux supports this kind of static priority also...

> 
> I think the proper solution (and I'm working on a "credit2" scheduler
> that has these properties) is:
> * Fix the credit assignment, so that VMs don't spend very much time
> in "over" 
> * Give VMs that wake up and are under their credits a fixed "boost"
> period (e.g., 1ms)
> * Allow users to specify a cpu "reservation"; so that no matter how
> much work there is on the system, a VM can be guaranteed to get a
> minimum fixed amount of the cpu if it wants it; e.g., dom0 always gets
> 50% of one core if it wants it, no matter how many other VMs are on
> the system.
> 
> #1 and #2 have resulted in significant improvements in TCP throughput
> in the face of competition.  I hope to publish a draft here on the
> list sometime soon, but I'm still working out some of the details.

Glad to know that the credit scheduler is being improved.
If it can improve the latency/real-time capability with minimal enhancement, that will be much better:)

> 
Best Regards,
Disheng, Su


* Re: [RFC] Add static priority into credit scheduler
  2009-03-23  7:33   ` Su, Disheng
@ 2009-03-25 10:35     ` George Dunlap
  2009-03-27  2:39       ` NISHIGUCHI Naoki
  2009-03-27  4:29       ` Su, Disheng
  0 siblings, 2 replies; 10+ messages in thread
From: George Dunlap @ 2009-03-25 10:35 UTC (permalink / raw)
  To: Su, Disheng; +Cc: Xen-devel, NISHIGUCHI Naoki

[-- Attachment #1: Type: text/plain, Size: 1859 bytes --]

2009/3/23 Su, Disheng <disheng.su@intel.com>:
> Glad to know that the credit scheduler is being improved.
> If it can improve the latency/real-time capability with minimal enhancement, that will be much better:)

I'm still a bit skeptical, but I guess not as much as before.  I don't
think it's really the best solution, and I definitely think that
people shouldn't think about this as a *good* solution to
latency-sensitive workloads like video and audio.  But it might be
handy to have around as a "quick-fix".  If nothing else, client
virtualization (e.g., VMs with audio pass-through) is important, and
since credit2 isn't going to make it into 3.4, something like this
might be a necessary stand-in.

I'd be interested to hear others' opinions.

Regarding the "sleep-for-some-time-and-wake-up" test, I hacked minios
to simulate this kind of "periodic deadline work" as a part of my
development.  (Patch attached.) It will set a timer to go off every
period, and then spin for a given amount of cycles.  If it isn't
scheduled for a period, it "drops" work.  Every second it reports the
percentage of work completed.  Credit1 does absolutely terribly --
completely unfair and unpredictable.  A few changes in credit2 allowed
the number of missed deadlines to degrade gracefully, equally across
all VMs, correlated at least with the VM's weight, in a predictable
manner.

If we did include something like this, we would need to make sure that
we couldn't get into a state where misbehaving RT guests locked out
dom0 and any driver domains necessary for dom0 network access.  (We
may trust the VMs not to purposely misbehave, but between bugs and
operator error, there's still plenty of room for misbehavior.)  I was
looking through the Linux scheduler code, and they seem to have some
limits on RT processes as well, presumably for the same reason.

 -George

[-- Attachment #2: minios-periodic-work.diff --]
[-- Type: text/x-diff, Size: 3785 bytes --]

diff -r 4c7a54a43420 extras/mini-os/kernel.c
--- a/extras/mini-os/kernel.c	Tue Feb 17 12:30:29 2009 +0000
+++ b/extras/mini-os/kernel.c	Fri Feb 20 14:55:13 2009 +0000
@@ -75,15 +75,144 @@
     /* test_xenbus(); */
 }
 
+#if 0
+#define rdtscll(val) do { \
+     unsigned int a,d; \
+     asm volatile("rdtsc" : "=a" (a), "=d" (d)); \
+     (val) = ((unsigned long long)a) | (((unsigned long long)d)<<32); \
+} while(0)
+#endif
+
+
+void calibrate_cpms(long long *cpms)
+{
+    struct timeval tv;
+    long long ca, cb, ta, tb, t1, t2;
+
+retry:
+    /* Try to detect if we're scheduled out. */
+    do {
+        gettimeofday(&tv, NULL); t1 = tv.tv_sec * 1000000 + tv.tv_usec;
+        rdtscll(ca);
+        gettimeofday(&tv, NULL); t2 = tv.tv_sec * 1000000 + tv.tv_usec;
+    } while (t2 - t1 > 10);
+    ta = (t1+t2)/2;
+
+    msleep(100);
+
+    do {
+        gettimeofday(&tv, NULL); t1 = tv.tv_sec * 1000000 + tv.tv_usec;
+        rdtscll(cb);
+        gettimeofday(&tv, NULL); t2 = tv.tv_sec * 1000000 + tv.tv_usec;
+    } while (t2 - t1 > 10);
+    tb = (t1+t2)/2;
+
+    if ( tb-ta <= 0 || cb-ca <= 0)
+    {
+        printk("%s: ca %lld cb %lld ta %lld tb %lld, retry in 1s\n",
+               __func__, ca, cb, ta, tb);
+        msleep(1000);
+        goto retry;
+    }
+
+    *cpms = ((cb - ca) * 1000 )/ (tb - ta);
+
+    printk("cpms: %lld\n", *cpms);
+}
+
+#define CAL_COUNT 100000
+#define CHECK_MS 10
+void calibrate_ipms(long long cpms, long long *ipms)
+{
+    long long ca, cb, s[2], d;
+    int i, j, k;
+
+    k = CAL_COUNT;
+
+    do {
+        for ( j=0 ; j<2; j++)
+        {
+            rdtscll(ca);
+            for ( i=0 ; i < k; i++ )
+                ;
+            rdtscll(cb);
+            s[j] = CAL_COUNT*cpms/(cb-ca);
+        }
+        d = s[0] - s[1];
+        if ( d < 0 )
+            d = -d;
+    } while( d > 100 );
+
+    *ipms = (s[0]+s[1])/2;
+
+    printk("ipms: %lld\n", *ipms);
+
+    k = CHECK_MS * *ipms;
+
+    rdtscll(ca);
+    for ( i=0; i<k; i++);
+    rdtscll(cb);
+
+    printk(" check: %lld %lld\n", CHECK_MS * cpms, cb-ca);
+}
+
+#define QUANTUM_MS    1ULL  /* How long to run each "quantum" */
+#define PERIOD_MS     5ULL  /* How often "quanta" show up */
+#define UPDATE_MS  5000ULL  /* How often to report how well we're doing */
+
 static void periodic_thread(void *p)
 {
-    struct timeval tv;
+    long long now, next_quantum, next_update;
+    long long cpms, ipms;
+    unsigned received = 0, processed = 0;
+    int i, quantum_count;
+
     printk("Periodic thread started.\n");
+
+    /* Wait for timing code to come up */
+    msleep(50);
+
+    calibrate_cpms(&cpms);
+    calibrate_ipms(cpms, &ipms);
+
+    quantum_count = QUANTUM_MS * ipms;
+
+    rdtscll(now);
+
+    next_quantum = now;
+    next_update = now + UPDATE_MS * cpms;
+    
     for(;;)
     {
-        gettimeofday(&tv, NULL);
-        printk("T(s=%ld us=%ld)\n", tv.tv_sec, tv.tv_usec);
-        msleep(1000);
+        rdtscll(now);
+
+        do
+        {
+            received++;
+            next_quantum+=PERIOD_MS * cpms;
+        } while  ( next_quantum < now );
+
+        /* Now do one "quantum" of work */
+        for ( i=0; i<quantum_count; i++)
+            ;
+        processed++;
+
+        /* See if it's time to post an update */
+        rdtscll(now);
+        if ( now > next_update )
+        {
+            printk("%lu\n", processed*100/received);
+            processed = received = 0;
+            while ( now > next_update )
+                next_update += UPDATE_MS * cpms;
+        }
+            
+        /* If we haven't passed the next quantum, sleep. */
+        if ( now < next_quantum )
+        {
+            int msec = ((next_quantum - now)+(cpms-1)) / cpms;
+            msleep(msec);
+        }
     }
 }
 


* Re: [RFC] Add static priority into credit scheduler
  2009-03-25 10:35     ` George Dunlap
@ 2009-03-27  2:39       ` NISHIGUCHI Naoki
  2009-03-27  3:29         ` Su, Disheng
  2009-03-27  4:29       ` Su, Disheng
  1 sibling, 1 reply; 10+ messages in thread
From: NISHIGUCHI Naoki @ 2009-03-27  2:39 UTC (permalink / raw)
  To: George Dunlap, Su, Disheng; +Cc: Xen-devel

Hi Disheng and George,

Disheng, I'm glad to see your work.

George Dunlap wrote:
> I'd be interested to hear others' opinions.

I think that static priority is useful under some conditions.
But, as George said, I also think it is harder to configure properly.
And I'm worried that it makes the credits of non-RT vcpus meaningless.

I tested your patch in the following environment.

   CPU: Intel Core2 Quad Q9450
   Chipset: Intel 82Q35
   VM: dom0 (4 vcpus), HVM (4 vcpus)
   Xen: c/s 19426
   HVM:
     RT priority (1)
     pass-through devices
       PCI graphic board
       Integrated devices(audio, USB controller)
     playing video

With this configuration, HVM does not work well.
When HVM does not have RT priority, HVM works well.

I think we would need to consider the relationship between static 
priority and credit, handling of dom0 and driver domain, and so on.

Best regards,
Naoki Nishiguchi


* RE: [RFC] Add static priority into credit scheduler
  2009-03-27  2:39       ` NISHIGUCHI Naoki
@ 2009-03-27  3:29         ` Su, Disheng
  2009-03-27  7:03           ` NISHIGUCHI Naoki
  0 siblings, 1 reply; 10+ messages in thread
From: Su, Disheng @ 2009-03-27  3:29 UTC (permalink / raw)
  To: NISHIGUCHI Naoki, George Dunlap; +Cc: Xen-devel, Su, Disheng

NISHIGUCHI Naoki wrote:
> Hi Disheng and George,
> 
> Disheng, I'm glad to see your work.
> 
> George Dunlap wrote:
>> I'd be interested to hear others' opinions.
> 
> I think that static priority is useful under some conditions.
> But, as George said, I also think it is harder to configure properly.
> And I'm worried that it makes the credits of non-RT vcpus meaningless.
> 

Not exactly. Credits still make sense for non-RT guests, but these guests only proportionally share the rest of the cpu (the part not used by RT guests) based on their weight/credit. When a non-RT guest is scheduled in and out, its credit is subtracted as usual. There is the potential for RT guests to monopolise the whole cpu if we don't have other mechanisms to prevent that.
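
To make the sharing rule concrete, here is a rough sketch. It is not code from the patch: the function name and the per-mille units are only chosen for illustration. Non-RT domains split whatever the RT domains leave over in proportion to their weight, and total_nonrt_weight is the sum of weights after any RT domain's weight has been subtracted.

```c
#include <assert.h>

/* Hypothetical helper, not from the patch.
 * cpu_left_permille: cpu time not consumed by RT domains, in 1/1000 of a pcpu.
 * Returns the share of one non-RT domain, also in 1/1000 of a pcpu. */
unsigned int nonrt_share_permille(unsigned int cpu_left_permille,
                                  unsigned int weight,
                                  unsigned int total_nonrt_weight)
{
    assert(weight <= total_nonrt_weight);
    if (total_nonrt_weight == 0)
        return 0;   /* no non-RT domains: nothing to hand out */
    return cpu_left_permille * weight / total_nonrt_weight;
}
```

For example, with three domains of weight 256 where one is promoted to RT, the total non-RT weight drops to 512. If the RT domain uses 40% of the pcpu, nonrt_share_permille(600, 256, 512) gives each remaining domain 300, i.e. 30% of the pcpu. If the RT domain never sleeps, cpu_left_permille is 0 and so is every share.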

> I tested your patch in the following environment.
> 
>    CPU: Intel Core2 Quad Q9450
>    Chipset: Intel 82Q35
>    VM: dom0 (4 vcpus), HVM (4 vcpus)
>    Xen: c/s 19426
>    HVM:
>      RT priority (1)
>      pass-through devices
>        PCI graphic board
>        Integrated devices(audio, USB controller)
>      playing video
> 
> With this configuration, HVM does not work well.
> When HVM does not have RT priority, HVM works well.
> 
> I think we would need to consider the relationship between static
> priority and credit, handling of dom0 and driver domain, and so on.
> 

Thanks for testing the patch!
In client virtualization, with static priority, the simplest way is to set the primary guest and dom0 to the highest priority (they can have different priorities), and the other auxiliary guests as non-RT guests. I think it *should* solve the audio/video glitch, though I haven't tested it. One issue with this approach is that the other non-RT guests may not get enough CPU/throughput when the user is busy in the primary guest (e.g. playing video and copying files at the same time).
I remember hearing some audio glitches in some cases with your Bcredit before. Maybe static priority can help, though in a somewhat heavy-handed way...
Could you kindly run a test with this configuration?

> Best regards,
> Naoki Nishiguchi

Best Regards,
Disheng, Su


* RE: [RFC] Add static priority into credit scheduler
  2009-03-25 10:35     ` George Dunlap
  2009-03-27  2:39       ` NISHIGUCHI Naoki
@ 2009-03-27  4:29       ` Su, Disheng
  1 sibling, 0 replies; 10+ messages in thread
From: Su, Disheng @ 2009-03-27  4:29 UTC (permalink / raw)
  To: George Dunlap, Daniel.Rossier@heig-vd.ch
  Cc: Xen-devel, NISHIGUCHI Naoki, Su, Disheng

> I'd be interested to hear others' opinions.

Just found that Daniel is working on EmbeddedXen to support a hard real-time OS (Xenomai on top of Xen), from the thread http://markmail.org/message/o2vyzy7ngf7oluw4
Adding Daniel; hope he can give more opinions on it...
Hi Daniel, we are talking about adding static priority to Xen's credit scheduler to support real-time guests. You can see the thread at http://markmail.org/message/vn62u7qdbmswms5a, in case you missed it.
Could you give us your concerns/opinions about Xen supporting real-time OSes, such as the scheduler, interrupt latency, the overhead introduced by Xen, etc.? Thanks!

Best Regards,
Disheng, Su


* Re: [RFC] Add static priority into credit scheduler
  2009-03-27  3:29         ` Su, Disheng
@ 2009-03-27  7:03           ` NISHIGUCHI Naoki
  2009-03-27  8:05             ` Su, Disheng
  0 siblings, 1 reply; 10+ messages in thread
From: NISHIGUCHI Naoki @ 2009-03-27  7:03 UTC (permalink / raw)
  To: Su, Disheng; +Cc: George Dunlap, Xen-devel

[-- Attachment #1: Type: text/plain, Size: 1506 bytes --]

Su, Disheng wrote:
> Not exactly. Credits still make sense for non-RT guests, but these guests only proportionally share the rest of the cpu (the part not used by RT guests) based on their weight/credit. When a non-RT guest is scheduled in and out, its credit is subtracted as usual. There is the potential for RT guests to monopolise the whole cpu if we don't have other mechanisms to prevent that.

I understand what you mean.
I doubt whether the rest of the cpu not used by the RT guest is reflected in
the credit of non-RT guests. If the RT guest might monopolize the whole cpu, I
think nothing is left of the cpu, and therefore non-RT guests have no credit.

> Thanks for testing the patch!
> In client virtualization, with static priority, the simplest way is to set the primary guest and dom0 to the highest priority (they can have different priorities), and the other auxiliary guests as non-RT guests. I think it *should* solve the audio/video glitch, though I haven't tested it. One issue with this approach is that the other non-RT guests may not get enough CPU/throughput when the user is busy in the primary guest (e.g. playing video and copying files at the same time).
> I remember hearing some audio glitches in some cases with your Bcredit before. Maybe static priority can help, though in a somewhat heavy-handed way...
> Could you kindly run a test with this configuration?

As you said, I set dom0 and HVM to RT priority(1) and tested.
Unfortunately, HVM does not work well.

Attached file is output of "xm debug-keys r".

Best regards,
Naoki Nishiguchi

[-- Attachment #2: runqueue.log --]
[-- Type: text/plain, Size: 3673 bytes --]

(XEN) Scheduler: SMP Credit Scheduler (credit)
(XEN) info:
(XEN)   ncpus              = 4
(XEN)   master             = 0
(XEN)   credit             = 1200
(XEN)   credit balance     = 0
(XEN)   weight             = 0
(XEN)   runq_sort          = 25322
(XEN)   default-weight     = 256
(XEN)   msecs per tick     = 10ms
(XEN)   credits per tick   = 100
(XEN)   ticks per tslice   = 3
(XEN)   ticks per acct     = 3
(XEN)   migration delay    = 0us
(XEN) idlers: 0000000a
(XEN) active vcpus:
(XEN) 
(XEN) active vcpus in rt dom:
(XEN)     1: [3.3] pri=1 flags=0 cpu=2 credit=0 [w=256]
(XEN)     2: [3.0] pri=1 flags=0 cpu=3 credit=0 [w=256]
(XEN)     3: [3.1] pri=1 flags=0 cpu=2 credit=0 [w=256]
(XEN)     4: [3.2] pri=1 flags=0 cpu=1 credit=300 [w=256]
(XEN)     5: [0.2] pri=1 flags=0 cpu=3 credit=0 [w=256]
(XEN)     6: [0.3] pri=1 flags=0 cpu=0 credit=-100 [w=256]
(XEN)     7: [0.0] pri=1 flags=0 cpu=2 credit=0 [w=256]
(XEN)     8: [0.1] pri=1 flags=0 cpu=1 credit=0 [w=256]
(XEN) sched_smt_power_savings: disabled
(XEN) NOW=0x00000131829AAF25
(XEN) CPU[00]  sort=25322, sibling=00000001, core=0000000f
(XEN)   run: [0.3] pri=1 flags=0 cpu=0 credit=-100 [w=256]
(XEN)   runq:
(XEN)     1: [32767.0] pri=-64 flags=0 cpu=0
(XEN) 
(XEN) rt_runq:
(XEN) CPU[01]  sort=25322, sibling=00000002, core=0000000f
(XEN)   run: [32767.1] pri=-64 flags=0 cpu=1
(XEN)   runq:
(XEN) 
(XEN) rt_runq:
(XEN) CPU[02]  sort=25322, sibling=00000004, core=0000000f
(XEN)   run: [3.1] pri=1 flags=0 cpu=2 credit=0 [w=256]
(XEN)   runq:
(XEN)     1: [32767.2] pri=-64 flags=0 cpu=2
(XEN) 
(XEN) rt_runq:
(XEN)     1: [3.3] pri=1 flags=0 cpu=2 credit=0 [w=256]
(XEN)     2: [0.0] pri=1 flags=0 cpu=2 credit=0 [w=256]
(XEN) CPU[03]  sort=25322, sibling=00000008, core=0000000f
(XEN)   run: [32767.3] pri=-64 flags=0 cpu=3
(XEN)   runq:
(XEN) 
(XEN) rt_runq:
(XEN) Scheduler: SMP Credit Scheduler (credit)
(XEN) info:
(XEN)   ncpus              = 4
(XEN)   master             = 0
(XEN)   credit             = 1200
(XEN)   credit balance     = 0
(XEN)   weight             = 0
(XEN)   runq_sort          = 25322
(XEN)   default-weight     = 256
(XEN)   msecs per tick     = 10ms
(XEN)   credits per tick   = 100
(XEN)   ticks per tslice   = 3
(XEN)   ticks per acct     = 3
(XEN)   migration delay    = 0us
(XEN) idlers: 0000000a
(XEN) active vcpus:
(XEN) 
(XEN) active vcpus in rt dom:
(XEN)     1: [3.3] pri=1 flags=0 cpu=2 credit=0 [w=256]
(XEN)     2: [3.0] pri=1 flags=0 cpu=3 credit=0 [w=256]
(XEN)     3: [3.1] pri=1 flags=0 cpu=2 credit=0 [w=256]
(XEN)     4: [3.2] pri=1 flags=0 cpu=1 credit=300 [w=256]
(XEN)     5: [0.2] pri=1 flags=0 cpu=3 credit=0 [w=256]
(XEN)     6: [0.3] pri=1 flags=0 cpu=0 credit=-100 [w=256]
(XEN)     7: [0.0] pri=1 flags=0 cpu=2 credit=0 [w=256]
(XEN)     8: [0.1] pri=1 flags=0 cpu=1 credit=0 [w=256]
(XEN) sched_smt_power_savings: disabled
(XEN) NOW=0x000001346A19FB21
(XEN) CPU[00]  sort=25322, sibling=00000001, core=0000000f
(XEN)   run: [0.3] pri=1 flags=0 cpu=0 credit=-100 [w=256]
(XEN)   runq:
(XEN)     1: [32767.0] pri=-64 flags=0 cpu=0
(XEN) 
(XEN) rt_runq:
(XEN) CPU[01]  sort=25322, sibling=00000002, core=0000000f
(XEN)   run: [32767.1] pri=-64 flags=0 cpu=1
(XEN)   runq:
(XEN) 
(XEN) rt_runq:
(XEN) CPU[02]  sort=25322, sibling=00000004, core=0000000f
(XEN)   run: [3.3] pri=1 flags=0 cpu=2 credit=0 [w=256]
(XEN)   runq:
(XEN)     1: [32767.2] pri=-64 flags=0 cpu=2
(XEN) 
(XEN) rt_runq:
(XEN)     1: [3.1] pri=1 flags=0 cpu=2 credit=0 [w=256]
(XEN)     2: [0.0] pri=1 flags=0 cpu=2 credit=0 [w=256]
(XEN) CPU[03]  sort=25322, sibling=00000008, core=0000000f
(XEN)   run: [32767.3] pri=-64 flags=0 cpu=3
(XEN)   runq:
(XEN) 
(XEN) rt_runq:



* RE: [RFC] Add static priority into credit scheduler
  2009-03-27  7:03           ` NISHIGUCHI Naoki
@ 2009-03-27  8:05             ` Su, Disheng
  2009-03-27 10:13               ` NISHIGUCHI Naoki
  0 siblings, 1 reply; 10+ messages in thread
From: Su, Disheng @ 2009-03-27  8:05 UTC (permalink / raw)
  To: NISHIGUCHI Naoki; +Cc: George Dunlap, Xen-devel, Su, Disheng

[-- Attachment #1: Type: text/plain, Size: 1248 bytes --]

NISHIGUCHI Naoki wrote:
> I understand what you mean.
> I doubt whether the rest of the cpu not used by the RT guest is reflected in
> the credit of non-RT guests. If the RT guest might monopolize the whole cpu,
> I think nothing is left of the cpu, and therefore non-RT guests have no
> credit.
> 

Yes, it's an issue that needs to be addressed for the client virtualization case, because the primary guest (e.g. Windows) is not a trusted guest.
When one RT guest is detected monopolizing the cpu for a while (e.g. 1-2 minutes), one can:
1. kill the RT guest...
2. lower its priority for a while to give other guests the opportunity to run, then restore its previous priority
Any ideas?
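
Option 2 could look roughly like the following sketch. Nothing like this exists in the attached patches: the struct, the function and its parameters are hypothetical, and using priority 0 for "non-RT" only follows the rule that any value outside 1~100 makes a domain non-RT.

```c
#include <assert.h>

#define RT_PRI_MIN 1
#define RT_PRI_MAX 100

/* Hypothetical per-domain watchdog state, not from the patch. */
struct rt_watch {
    int pri;              /* current priority; 0 means non-RT here */
    int saved_pri;        /* priority to restore after a penalty */
    unsigned busy_ms;     /* how long the domain has run without idling */
    unsigned penalty_ms;  /* remaining demotion time, 0 if not demoted */
};

/* Call once per accounting period of period_ms.
 * was_idle says whether the domain blocked at least once in that period. */
void rt_watch_tick(struct rt_watch *w, int was_idle, unsigned period_ms,
                   unsigned limit_ms, unsigned penalty_len_ms)
{
    assert(period_ms > 0);

    if (w->penalty_ms > 0) {
        /* Demoted: count down, then restore the previous priority. */
        w->penalty_ms = (w->penalty_ms > period_ms)
                        ? w->penalty_ms - period_ms : 0;
        if (w->penalty_ms == 0) {
            w->pri = w->saved_pri;
            w->busy_ms = 0;
        }
        return;
    }

    if (w->pri < RT_PRI_MIN || w->pri > RT_PRI_MAX)
        return;  /* non-RT domains are not watched */

    w->busy_ms = was_idle ? 0 : w->busy_ms + period_ms;
    if (w->busy_ms >= limit_ms) {
        w->saved_pri = w->pri;
        w->pri = 0;  /* demote to non-RT for a while */
        w->penalty_ms = penalty_len_ms;
    }
}
```

The limit and the penalty length would have to be tunable; a limit in the range of minutes as suggested above is far too long for dom0 or a driver domain that is being locked out, so those probably need a much shorter one.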

> As you said, I set dom0 and HVM to RT priority(1) and tested.
> Unfortunately, HVM does not work well.
> 
> Attached file is output of "xm debug-keys r".
> 

Oh, I forgot to mention that you need to pin the RT guest, or try the attached patch.
If you set one guest to high priority, it increases the chance that its vcpus are migrated back and forth, because the priority is fixed and higher than OVER.
In practice, don't migrate the RT guest. It's the same with Bcredit from my previous experience, isn't it?


> Best regards,
> Naoki Nishiguchi



Best Regards,
Disheng, Su

[-- Attachment #2: dont_migrate_for_rt_dom.patch --]
[-- Type: application/octet-stream, Size: 573 bytes --]

diff -r 3582b2c3cd7f xen/common/sched_credit.c
--- a/xen/common/sched_credit.c	Wed May 20 07:17:02 2009 +0800
+++ b/xen/common/sched_credit.c	Wed May 20 07:21:38 2009 +0800
@@ -352,8 +352,10 @@
      * Don't pick up work that's in the peer's scheduling tail or hot on
      * peer PCPU. Only pick up work that's allowed to run on our CPU.
      */
+    struct csched_vcpu * const svc = CSCHED_VCPU(vc);
     return !vc->is_running &&
            !__csched_vcpu_is_cache_hot(vc) &&
+           !IS_RT_PRI(svc->pri) &&
            cpu_isset(dest_cpu, vc->cpu_affinity);
 }
 


* Re: [RFC] Add static priority into credit scheduler
  2009-03-27  8:05             ` Su, Disheng
@ 2009-03-27 10:13               ` NISHIGUCHI Naoki
  0 siblings, 0 replies; 10+ messages in thread
From: NISHIGUCHI Naoki @ 2009-03-27 10:13 UTC (permalink / raw)
  To: Su, Disheng; +Cc: George Dunlap, Xen-devel

Su, Disheng wrote:
> NISHIGUCHI Naoki wrote:
>> I understand what you mean.
>> I doubt whether the rest of the cpu not used by the RT guest is reflected in
>> the credit of non-RT guests. If the RT guest might monopolize the whole cpu,
>> I think nothing is left of the cpu, and therefore non-RT guests have no
>> credit.
>>
> 
> Yes, it's an issue that needs to be addressed for the client virtualization case, because the primary guest (e.g. Windows) is not a trusted guest.
> When one RT guest is detected monopolizing the cpu for a while (e.g. 1-2 minutes), one can:
> 1. kill the RT guest...
> 2. lower its priority for a while to give other guests the opportunity to run, then restore its previous priority
> Any ideas?

What I mean is the credit_total given to non-RT guests in the scheduler.
If a PC has a 4-core cpu and an RT guest has 4 vcpus, I think credit_total
would be 0, because we cannot predict the behavior of the RT guest.
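
A rough illustration of this arithmetic follows. It is only a sketch and not scheduler code: the function name is hypothetical, and treating every pcpu that an RT vcpu could occupy as lost is the pessimistic assumption described above. The input values come from the "xm debug-keys r" output sent earlier (ncpus = 4, credits per tick = 100, ticks per acct = 3, credit = 1200).

```c
#include <assert.h>

/* Hypothetical helper, not scheduler code. Credits handed out per accounting
 * period to non-RT guests if each RT vcpu is assumed to take a whole pcpu. */
unsigned int nonrt_credit_total(unsigned int ncpus, unsigned int rt_vcpus,
                                unsigned int credits_per_tick,
                                unsigned int ticks_per_acct)
{
    unsigned int free_cpus;

    assert(credits_per_tick > 0 && ticks_per_acct > 0);
    free_cpus = (rt_vcpus >= ncpus) ? 0 : ncpus - rt_vcpus;
    return free_cpus * credits_per_tick * ticks_per_acct;
}
```

Without RT vcpus this gives 4 * 100 * 3 = 1200, matching the log. With 4 RT vcpus on 4 pcpus it gives 0, so non-RT guests would receive no credit at all.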

> Oh, I forgot to mention that you need to pin the RT guest, or try the attached patch.
> If you set one guest to high priority, it increases the chance that its vcpus are migrated back and forth, because the priority is fixed and higher than OVER.

I tried your patch. The result was the same.
I also pinned the RT guest and dom0 as follows, but HVM still did not work well.
             vcpu cpu
      dom0     0   0
               1   1
               2   2
               3   3
      HVM      0   0
               1   1
               2   2
               3   3

It seems to me that idle cpus are not effectively used.

> Don't migrate the RT guest in practice. It's the same with Bcredit from my previous experience, isn't it?

I think that scheduler should not migrate the vcpu needlessly, but 
necessary migration should be done.

Best regards,
Naoki Nishiguchi
