* [RFC][PATCH] scheduler: credit scheduler for client virtualization
@ 2008-12-03  8:54 NISHIGUCHI Naoki
  2008-12-03  9:16 ` Keir Fraser
       [not found] ` <de76405a0901191232k19d910d5o77160fa5ee7bf06c@mail.gmail.com>
  0 siblings, 2 replies; 15+ messages in thread
From: NISHIGUCHI Naoki @ 2008-12-03  8:54 UTC (permalink / raw)
  To: xen-devel; +Cc: Ian.Pratt, disheng.su

[-- Attachment #1: Type: text/plain, Size: 2492 bytes --]

Hi all,

This patch implements the credit scheduler improvements that I presented
at XenSummit Tokyo.
My presentation is now available at
http://www.xen.org/xensummit/xensummit_fall_2008.html.

When the Xen hypervisor is used in a client virtualization environment,
especially with VT-d enabled and some devices passed through to a
domain, I think it is necessary to reduce the time that a vcpu in the
domain waits for its turn on a run queue.

My approach is to keep the vcpu's priority at BOOST and, when several
vcpus are at BOOST priority, to switch between them at short intervals.

The changes to the credit scheduler are as follows:

- Improve the precision of credit
  There are three changes. The first is to subtract credit accurately
for the cpu time actually consumed. The second is to preserve the value
of credit when a vcpu's credit exceeds the upper bound (currently 300).
The third is to shorten the cpu time per credit (experimentally, 30000
credits per 30ms).

- Shorten the time allocated to a vcpu at BOOST priority
  The allocated time is experimentally changed from 30ms to 2ms.

- Balance credits of each vcpu of a domain

- Introduce boost credit
  Boost credit is a new kind of credit that keeps a vcpu's priority at
BOOST. While the value of boost credit is 1 or more, the priority of the
vcpu is set to BOOST. Moreover, an upper bound on boost credit can be
set so that a burst of cpu consumption by the vcpu does not make its
priority drop.
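
As a rough illustration of the BOOST time slice and the boost credit
rule above, here is a small Python sketch. The constants are taken from
the attached patch (30ms normal slice, 2ms boost slice, 500us minimum);
the function names are hypothetical, and the UNDER/OVER fallback is my
assumption that the stock credit scheduler rule (negative credit means
OVER) still applies. It is a model for discussion, not the hypervisor
code.

```python
# Constants mirror the patch; helper names are hypothetical.
NS_PER_MS = 1000000
MSECS_PER_TSLICE = 30         # BCSCHED_MSECS_PER_TSLICE (10ms tick * 3 ticks)
BOOST_TSLICE_MS = 2           # default of the bcsched_tslice option
MIN_BOOST_TSLICE_NS = 500000  # BCSCHED_NSECS_MIN_BOOST_TSLICE


def boost_tslice_ns(nvcpus, ncpus, tslice_ms=BOOST_TSLICE_MS):
    """Time slice (ns) for a vcpu at BOOST priority.

    Assumes nvcpus >= 1 and ncpus >= 1. 'others' is the number of other
    vcpus sharing a physical CPU, rounded up, as in bcsched_vcpu_init().
    """
    others = (nvcpus + ncpus - 1) // ncpus - 1
    if others == 0:
        # Nothing else competes for the CPU: keep the normal 30ms slice.
        return MSECS_PER_TSLICE * NS_PER_MS
    # Divide the boost slice among competitors, but never go below 500us.
    return max(tslice_ms * NS_PER_MS // others, MIN_BOOST_TSLICE_NS)


def priority(credit, boost_credit):
    """Priority of a vcpu given its credit and boost credit."""
    if boost_credit >= 1:
        return "BOOST"
    # Assumption: same rule as the stock credit scheduler.
    return "UNDER" if credit >= 0 else "OVER"


print(boost_tslice_ns(4, 2))   # 2000000 (2ms, one competitor per CPU)
print(boost_tslice_ns(20, 2))  # 500000 (clamped to the minimum)
print(priority(-5, 1))         # BOOST
```

With the default settings, the boost slice only drops below 2ms once
there are more than two vcpus per physical CPU.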


How to use:

In this patch, I added the bcredit scheduler (boost credit scheduler)
as a third scheduler. To use the bcredit scheduler, add the
"sched=bcredit" option to the xen.gz line in grub.conf.

Then, to boost a domain, enable boost credit for the domain. There are
two methods.

1. Using the xm command, set the upper bound of the domain's boost
credit. It is specified in milliseconds rather than as a credit value,
and is called the max boost period.
  e.g. domain:0, max boost period:100ms
    xm sched-bcredit -d 0 -m 100

2. Using the xm command, set both the upper bound of the domain's boost
credit and the boost ratio. The boost ratio is the percentage of one CPU
used for distributing boost credit. Boost credit corresponding to the
boost ratio is distributed in place of credit. Because the ratio is
relative to one CPU, it is not affected by other domains.
  e.g. domain:0, max boost period:500ms, boost ratio:80 (80% of one CPU)
    xm sched-bcredit -d 0 -m 500 -r 80
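
To give a feeling for the numbers in the two examples, the following
Python sketch converts the xm parameters into credits. The ratio
formula follows the patch (boost_ratio * BCSCHED_CREDITS_PER_TSLICE /
100). The conversion of the max boost period to a credit bound is only
my assumption based on 30000 credits per 30ms (1000 credits per
millisecond), and the function names are hypothetical.

```python
# Constants mirror the patch; helper names are hypothetical.
CREDITS_PER_TICK = 10000   # BCSCHED_CREDITS_PER_TICK
TICKS_PER_TSLICE = 3       # CSCHED_TICKS_PER_TSLICE
MSECS_PER_TICK = 10        # CSCHED_MSECS_PER_TICK
CREDITS_PER_TSLICE = CREDITS_PER_TICK * TICKS_PER_TSLICE  # 30000 per 30ms
CREDITS_PER_MSEC = CREDITS_PER_TICK // MSECS_PER_TICK     # 1000 per 1ms


def boost_credit_per_tslice(boost_ratio):
    """Boost credit a domain adds to the system total per 30ms slice."""
    return boost_ratio * CREDITS_PER_TSLICE // 100


def max_boost_credit(max_boost_period_ms):
    """Assumed upper bound of boost credit for a max boost period."""
    return max_boost_period_ms * CREDITS_PER_MSEC


# xm sched-bcredit -d 0 -m 500 -r 80
print(boost_credit_per_tslice(80))  # 24000
print(max_boost_credit(500))        # 500000
```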


Please review this patch.
Any comments are appreciated.

Best regards,
Naoki Nishiguchi

[-- Attachment #2: sched_bcredit.patch --]
[-- Type: text/x-patch, Size: 54563 bytes --]

diff -r a00eb6595d3c tools/libxc/xc_csched.c
--- a/tools/libxc/xc_csched.c	Sat Nov 29 09:07:52 2008 +0000
+++ b/tools/libxc/xc_csched.c	Wed Dec 03 10:19:34 2008 +0900
@@ -48,3 +48,41 @@ xc_sched_credit_domain_get(
 
     return err;
 }
+
+int
+xc_sched_bcredit_domain_set(
+    int xc_handle,
+    uint32_t domid,
+    struct xen_domctl_sched_bcredit *sdom)
+{
+    DECLARE_DOMCTL;
+
+    domctl.cmd = XEN_DOMCTL_scheduler_op;
+    domctl.domain = (domid_t) domid;
+    domctl.u.scheduler_op.sched_id = XEN_SCHEDULER_BCREDIT;
+    domctl.u.scheduler_op.cmd = XEN_DOMCTL_SCHEDOP_putinfo;
+    domctl.u.scheduler_op.u.bcredit = *sdom;
+
+    return do_domctl(xc_handle, &domctl);
+}
+
+int
+xc_sched_bcredit_domain_get(
+    int xc_handle,
+    uint32_t domid,
+    struct xen_domctl_sched_bcredit *sdom)
+{
+    DECLARE_DOMCTL;
+    int err;
+
+    domctl.cmd = XEN_DOMCTL_scheduler_op;
+    domctl.domain = (domid_t) domid;
+    domctl.u.scheduler_op.sched_id = XEN_SCHEDULER_BCREDIT;
+    domctl.u.scheduler_op.cmd = XEN_DOMCTL_SCHEDOP_getinfo;
+
+    err = do_domctl(xc_handle, &domctl);
+    if ( err == 0 )
+        *sdom = domctl.u.scheduler_op.u.bcredit;
+
+    return err;
+}
diff -r a00eb6595d3c tools/libxc/xenctrl.h
--- a/tools/libxc/xenctrl.h	Sat Nov 29 09:07:52 2008 +0000
+++ b/tools/libxc/xenctrl.h	Wed Dec 03 10:19:34 2008 +0900
@@ -448,6 +448,14 @@ int xc_sched_credit_domain_get(int xc_ha
                                uint32_t domid,
                                struct xen_domctl_sched_credit *sdom);
 
+int xc_sched_bcredit_domain_set(int xc_handle,
+                                uint32_t domid,
+                                struct xen_domctl_sched_bcredit *sdom);
+
+int xc_sched_bcredit_domain_get(int xc_handle,
+                                uint32_t domid,
+                                struct xen_domctl_sched_bcredit *sdom);
+
 /**
  * This function sends a trigger to a domain.
  *
diff -r a00eb6595d3c tools/python/xen/lowlevel/xc/xc.c
--- a/tools/python/xen/lowlevel/xc/xc.c	Sat Nov 29 09:07:52 2008 +0000
+++ b/tools/python/xen/lowlevel/xc/xc.c	Wed Dec 03 10:19:34 2008 +0900
@@ -1317,6 +1317,59 @@ static PyObject *pyxc_sched_credit_domai
                          "cap",     sdom.cap);
 }
 
+static PyObject *pyxc_sched_bcredit_domain_set(XcObject *self,
+                                               PyObject *args,
+                                               PyObject *kwds)
+{
+    uint32_t domid;
+    uint16_t weight;
+    uint16_t cap;
+    uint16_t max_boost_period;
+    uint16_t boost_ratio;
+    static char *kwd_list[] = { "domid", "bc_weight", "bc_cap",
+                                "bc_max_boost_period", "bc_ratio", NULL };
+    static char kwd_type[] = "I|HHhh";
+    struct xen_domctl_sched_bcredit sdom;
+    
+    weight = 0;
+    cap = (uint16_t)~0U;
+    max_boost_period = (uint16_t)~0U;
+    boost_ratio = (uint16_t)~0U;
+    if( !PyArg_ParseTupleAndKeywords(args, kwds, kwd_type, kwd_list, 
+                                     &domid, &weight, &cap,
+                                     &max_boost_period, &boost_ratio) )
+        return NULL;
+
+    sdom.weight = weight;
+    sdom.cap = cap;
+    sdom.max_boost_period = max_boost_period;
+    sdom.boost_ratio = boost_ratio;
+
+    if ( xc_sched_bcredit_domain_set(self->xc_handle, domid, &sdom) != 0 )
+        return pyxc_error_to_exception();
+
+    Py_INCREF(zero);
+    return zero;
+}
+
+static PyObject *pyxc_sched_bcredit_domain_get(XcObject *self, PyObject *args)
+{
+    uint32_t domid;
+    struct xen_domctl_sched_bcredit sdom;
+    
+    if( !PyArg_ParseTuple(args, "I", &domid) )
+        return NULL;
+    
+    if ( xc_sched_bcredit_domain_get(self->xc_handle, domid, &sdom) != 0 )
+        return pyxc_error_to_exception();
+
+    return Py_BuildValue("{s:H,s:H,s:i,s:i}",
+                         "bc_weight",           sdom.weight,
+                         "bc_cap",              sdom.cap,
+                         "bc_max_boost_period", sdom.max_boost_period,
+                         "bc_ratio",            sdom.boost_ratio);
+}
+
 static PyObject *pyxc_domain_setmaxmem(XcObject *self, PyObject *args)
 {
     uint32_t dom;
@@ -1732,6 +1785,30 @@ static PyMethodDef pyxc_methods[] = {
       " domid     [int]:   domain id to get\n"
       "Returns:   [dict]\n"
       " weight    [short]: domain's scheduling weight\n"},
+
+    { "sched_bcredit_domain_set",
+      (PyCFunction)pyxc_sched_bcredit_domain_set,
+      METH_KEYWORDS, "\n"
+      "Set the scheduling parameters for a domain when running with the\n"
+      "SMP credit scheduler for client.\n"
+      " domid               [int]:   domain id to set\n"
+      " bc_weight           [short]: domain's scheduling weight\n"
+      " bc_cap              [short]: cap\n"
+      " bc_max_boost_period [short]: upper limit in BOOST priority\n"
+      " bc_ratio            [short]: domain's boost ratio per a CPU\n"
+      "Returns: [int] 0 on success; -1 on error.\n" },
+
+    { "sched_bcredit_domain_get",
+      (PyCFunction)pyxc_sched_bcredit_domain_get,
+      METH_VARARGS, "\n"
+      "Get the scheduling parameters for a domain when running with the\n"
+      "SMP credit scheduler for client.\n"
+      " domid     [int]:   domain id to get\n"
+      "Returns:   [dict]\n"
+      " bc_weight           [short]: domain's scheduling weight\n"
+      " bc_cap              [short]: cap\n"
+      " bc_max_boost_period [short]: upper limit in BOOST priority\n"
+      " bc_ratio            [short]: domain's boost ratio per a CPU\n"},
 
     { "evtchn_alloc_unbound", 
       (PyCFunction)pyxc_evtchn_alloc_unbound,
@@ -2048,6 +2125,7 @@ PyMODINIT_FUNC initxc(void)
     /* Expose some libxc constants to Python */
     PyModule_AddIntConstant(m, "XEN_SCHEDULER_SEDF", XEN_SCHEDULER_SEDF);
     PyModule_AddIntConstant(m, "XEN_SCHEDULER_CREDIT", XEN_SCHEDULER_CREDIT);
+    PyModule_AddIntConstant(m, "XEN_SCHEDULER_BCREDIT", XEN_SCHEDULER_BCREDIT);
 
 }
 
diff -r a00eb6595d3c tools/python/xen/xend/XendAPI.py
--- a/tools/python/xen/xend/XendAPI.py	Sat Nov 29 09:07:52 2008 +0000
+++ b/tools/python/xen/xend/XendAPI.py	Wed Dec 03 10:19:34 2008 +0900
@@ -1509,6 +1509,16 @@ class XendAPI(object):
             weight = xeninfo.info['vcpus_params']['weight']
             cap = xeninfo.info['vcpus_params']['cap']
             xendom.domain_sched_credit_set(xeninfo.getDomid(), weight, cap)
+
+        if 'bc_weight' in xeninfo.info['vcpus_params'] \
+           and 'bc_cap' in xeninfo.info['vcpus_params'] \
+           and 'bc_max_boost_period' in xeninfo.info['vcpus_params'] \
+           and 'bc_ratio' in xeninfo.info['vcpus_params']:
+            bc_weight = xeninfo.info['vcpus_params']['bc_weight']
+            bc_cap = xeninfo.info['vcpus_params']['bc_cap']
+            bc_max_boost_period = xeninfo.info['vcpus_params']['bc_max_boost_period']
+            bc_ratio = xeninfo.info['vcpus_params']['bc_ratio']
+            xendom.domain_sched_bcredit_set(xeninfo.getDomid(), bc_weight, bc_cap, bc_max_boost_period, bc_ratio)
 
     def VM_set_VCPUs_number_live(self, _, vm_ref, num):
         dom = XendDomain.instance().get_vm_by_uuid(vm_ref)
diff -r a00eb6595d3c tools/python/xen/xend/XendConfig.py
--- a/tools/python/xen/xend/XendConfig.py	Sat Nov 29 09:07:52 2008 +0000
+++ b/tools/python/xen/xend/XendConfig.py	Wed Dec 03 10:19:34 2008 +0900
@@ -585,6 +585,15 @@ class XendConfig(dict):
             int(sxp.child_value(sxp_cfg, "cpu_weight", 256))
         cfg["vcpus_params"]["cap"] = \
             int(sxp.child_value(sxp_cfg, "cpu_cap", 0))
+        # For boost credit scheduler
+        cfg["vcpus_params"]["bc_weight"] = \
+            int(sxp.child_value(sxp_cfg, "cpu_bc_weight", 256))
+        cfg["vcpus_params"]["bc_cap"] = \
+            int(sxp.child_value(sxp_cfg, "cpu_bc_cap", 0))
+        cfg["vcpus_params"]["bc_max_boost_period"] = \
+            int(sxp.child_value(sxp_cfg, "cpu_bc_max_boost_period", 0))
+        cfg["vcpus_params"]["bc_ratio"] = \
+            int(sxp.child_value(sxp_cfg, "cpu_bc_ratio", 0))
 
         # Only extract options we know about.
         extract_keys = LEGACY_UNSUPPORTED_BY_XENAPI_CFG + \
diff -r a00eb6595d3c tools/python/xen/xend/XendDomain.py
--- a/tools/python/xen/xend/XendDomain.py	Sat Nov 29 09:07:52 2008 +0000
+++ b/tools/python/xen/xend/XendDomain.py	Wed Dec 03 10:19:34 2008 +0900
@@ -1598,6 +1598,99 @@ class XendDomain:
             log.exception(ex)
             raise XendError(str(ex))
 
+    def domain_sched_bcredit_get(self, domid):
+        """Get boost credit scheduler parameters for a domain.
+
+        @param domid: Domain ID or Name
+        @type domid: int or string.
+        @rtype: dict with keys 'bc_weight' and 'bc_cap' and 'bc_max_boost_period' and 'bc_ratio'
+        @return: boost credit scheduler parameters
+        """
+        dominfo = self.domain_lookup_nr(domid)
+        if not dominfo:
+            raise XendInvalidDomain(str(domid))
+        
+        if dominfo._stateGet() in (DOM_STATE_RUNNING, DOM_STATE_PAUSED):
+            try:
+                return xc.sched_bcredit_domain_get(dominfo.getDomid())
+            except Exception, ex:
+                raise XendError(str(ex))
+        else:
+            return {'bc_weight'           : dominfo.getBCWeight(),
+                    'bc_cap'              : dominfo.getBCCap(),
+                    'bc_max_boost_period' : dominfo.getBCMaxBoostPeriod(),
+                    'bc_ratio'            : dominfo.getBCRatio()} 
+    
+    def domain_sched_bcredit_set(self, domid, bc_weight = None, bc_cap = None, bc_max_boost_period = None, bc_ratio = None):
+        """Set boost credit scheduler parameters for a domain.
+
+        @param domid: Domain ID or Name
+        @type domid: int or string.
+        @type bc_weight: int
+        @type bc_cap: int
+        @type bc_max_boost_period: int
+        @type bc_ratio: int
+        @rtype: 0
+        """
+        set_weight = False
+        set_cap = False
+        set_max_boost_period = False
+        set_ratio = False
+        dominfo = self.domain_lookup_nr(domid)
+        if not dominfo:
+            raise XendInvalidDomain(str(domid))
+        try:
+            if bc_weight is None:
+                bc_weight = int(0)
+            elif bc_weight < 1 or bc_weight > 65535:
+                raise XendError("bc_weight is out of range")
+            else:
+                set_weight = True
+
+            if bc_cap is None:
+                bc_cap = int(~0)
+            elif bc_cap < 0 or bc_cap > dominfo.getVCpuCount() * 100:
+                raise XendError("bc_cap is out of range")
+            else:
+                set_cap = True
+
+            if bc_max_boost_period is None:
+                bc_max_boost_period = int(~0)
+            elif bc_max_boost_period < 0:
+                raise XendError("bc_max_boost_period is out of range")
+            else:
+                set_max_boost_period = True
+
+            if bc_ratio is None:
+                bc_ratio = int(~0)
+            elif bc_ratio < 0:
+                raise XendError("bc_ratio is out of range")
+            else:
+                set_ratio = True
+
+            assert type(bc_weight) == int
+            assert type(bc_cap) == int
+            assert type(bc_max_boost_period) == int
+            assert type(bc_ratio) == int
+
+            rc = 0
+            if dominfo._stateGet() in (DOM_STATE_RUNNING, DOM_STATE_PAUSED):
+                rc = xc.sched_bcredit_domain_set(dominfo.getDomid(), bc_weight, bc_cap, bc_max_boost_period, bc_ratio)
+            if rc == 0:
+                if set_weight:
+                    dominfo.setBCWeight(bc_weight)
+                if set_cap:
+                    dominfo.setBCCap(bc_cap)
+                if set_max_boost_period:
+                    dominfo.setBCMaxBoostPeriod(bc_max_boost_period)
+                if set_ratio:
+                    dominfo.setBCRatio(bc_ratio)
+                self.managed_config_save(dominfo)
+            return rc
+        except Exception, ex:
+            log.exception(ex)
+            raise XendError(str(ex))
+
     def domain_maxmem_set(self, domid, mem):
         """Set the memory limit for a domain.
 
diff -r a00eb6595d3c tools/python/xen/xend/XendDomainInfo.py
--- a/tools/python/xen/xend/XendDomainInfo.py	Sat Nov 29 09:07:52 2008 +0000
+++ b/tools/python/xen/xend/XendDomainInfo.py	Wed Dec 03 10:19:34 2008 +0900
@@ -466,6 +466,14 @@ class XendDomainInfo:
                     xendomains.domain_sched_credit_set(self.getDomid(),
                                                        self.getWeight(),
                                                        self.getCap())
+
+                if xennode.xenschedinfo() == 'bcredit':
+                    xendomains.domain_sched_bcredit_set(self.getDomid(),
+                                                        self.getBCWeight(),
+                                                        self.getBCCap(),
+                                                        self.getBCMaxBoostPeriod(),
+                                                        self.getBCRatio())
+
             except:
                 log.exception('VM start failed')
                 self.destroy()
@@ -1606,6 +1614,30 @@ class XendDomainInfo:
     def setWeight(self, cpu_weight):
         self.info['vcpus_params']['weight'] = cpu_weight
 
+    def getBCCap(self):
+        return self.info['vcpus_params']['bc_cap']
+
+    def setBCCap(self, cpu_bc_cap):
+        self.info['vcpus_params']['bc_cap'] = cpu_bc_cap
+
+    def getBCWeight(self):
+        return self.info['vcpus_params']['bc_weight']
+
+    def setBCWeight(self, cpu_bc_weight):
+        self.info['vcpus_params']['bc_weight'] = cpu_bc_weight
+
+    def getBCMaxBoostPeriod(self):
+        return self.info['vcpus_params']['bc_max_boost_period']
+
+    def setBCMaxBoostPeriod(self, cpu_bc_max_boost_period):
+        self.info['vcpus_params']['bc_max_boost_period'] = cpu_bc_max_boost_period
+
+    def getBCRatio(self):
+        return self.info['vcpus_params']['bc_ratio']
+
+    def setBCRatio(self, cpu_bc_ratio):
+        self.info['vcpus_params']['bc_ratio'] = cpu_bc_ratio
+
     def getRestartCount(self):
         return self._readVm('xend/restart_count')
 
diff -r a00eb6595d3c tools/python/xen/xend/XendNode.py
--- a/tools/python/xen/xend/XendNode.py	Sat Nov 29 09:07:52 2008 +0000
+++ b/tools/python/xen/xend/XendNode.py	Wed Dec 03 10:19:34 2008 +0900
@@ -555,6 +555,8 @@ class XendNode:
             return 'sedf'
         elif sched_id == xen.lowlevel.xc.XEN_SCHEDULER_CREDIT:
             return 'credit'
+        elif sched_id == xen.lowlevel.xc.XEN_SCHEDULER_BCREDIT:
+            return 'bcredit'
         else:
             return 'unknown'
 
@@ -714,6 +716,8 @@ class XendNode:
             return 'sedf'
         elif sched_id == xen.lowlevel.xc.XEN_SCHEDULER_CREDIT:
             return 'credit'
+        elif sched_id == xen.lowlevel.xc.XEN_SCHEDULER_BCREDIT:
+            return 'bcredit'
         else:
             return 'unknown'
 
diff -r a00eb6595d3c tools/python/xen/xm/main.py
--- a/tools/python/xen/xm/main.py	Sat Nov 29 09:07:52 2008 +0000
+++ b/tools/python/xen/xm/main.py	Wed Dec 03 10:19:34 2008 +0900
@@ -152,6 +152,8 @@ SUBCOMMAND_HELP = {
     'sched-sedf'  : ('<Domain> [options]', 'Get/set EDF parameters.'),
     'sched-credit': ('[-d <Domain> [-w[=WEIGHT]|-c[=CAP]]]',
                      'Get/set credit scheduler parameters.'),
+    'sched-bcredit': ('[-d <Domain> [-w[=WEIGHT]|-c[=CAP]|-m[=MAXBOOSTPERIOD]|-r[=RATIO]]]',
+                      'Get/set boost credit scheduler parameters.'),
     'sysrq'       : ('<Domain> <letter>', 'Send a sysrq to a domain.'),
     'debug-keys'  : ('<Keys>', 'Send debug keys to Xen.'),
     'trigger'     : ('<Domain> <nmi|reset|init|s3resume> [<VCPU>]',
@@ -240,6 +242,13 @@ SUBCOMMAND_OPTIONS = {
        ('-d DOMAIN', '--domain=DOMAIN', 'Domain to modify'),
        ('-w WEIGHT', '--weight=WEIGHT', 'Weight (int)'),
        ('-c CAP',    '--cap=CAP',       'Cap (int)'),
+    ),
+    'sched-bcredit': (
+       ('-d DOMAIN', '--domain=DOMAIN', 'Domain to modify'),
+       ('-w WEIGHT', '--weight=WEIGHT', 'Weight (int)'),
+       ('-c CAP',    '--cap=CAP',       'Cap (int)'),
+       ('-m PERIOD', '--maxboostperiod=PERIOD', 'Upper limit of boost period (ms)'),
+       ('-r RATIO',  '--ratio=RATIO',   'Boost ratio per a CPU (int)'),
     ),
     'list': (
        ('-l', '--long',         'Output all VM details in SXP'),
@@ -1655,6 +1664,116 @@ def xm_sched_credit(args):
                     cap)
         else:
             result = server.xend.domain.sched_credit_set(domid, weight, cap)
+            if result != 0:
+                err(str(result))
+
+def xm_sched_bcredit(args):
+    """Get/Set options for Boost Credit Scheduler."""
+    
+    check_sched_type('bcredit')
+
+    try:
+        opts, params = getopt.getopt(args, "d:w:c:m:r:",
+            ["domain=", "weight=", "cap=", "maxboostperiod=", "ratio="])
+    except getopt.GetoptError, opterr:
+        err(opterr)
+        usage('sched-bcredit')
+
+    domid = None
+    weight = None
+    cap = None
+    max_boost_period = None
+    boost_ratio = None
+
+    for o, a in opts:
+        if o in ["-d", "--domain"]:
+            domid = a
+        elif o in ["-w", "--weight"]:
+            weight = int(a)
+        elif o in ["-c", "--cap"]:
+            cap = int(a)
+        elif o in ["-m", "--maxboostperiod"]:
+            max_boost_period = int(a)
+        elif o in ["-r", "--ratio"]:
+            boost_ratio = int(a)
+
+    doms = filter(lambda x : domid_match(domid, x),
+                  [parse_doms_info(dom)
+                  for dom in getDomains(None, 'all')])
+
+    if weight is None and cap is None and max_boost_period is None and boost_ratio is None:
+        if domid is not None and doms == []: 
+            err("Domain '%s' does not exist." % domid)
+            usage('sched-bcredit')
+        # print header if we aren't setting any parameters
+        print '%-33s %4s %6s %4s %8s %5s' % ('Name','ID','Weight','Cap','Max(ms)','Ratio')
+        
+        for d in doms:
+            try:
+                if serverType == SERVER_XEN_API:
+                    info = server.xenapi.VM_metrics.get_VCPUs_params(
+                        server.xenapi.VM.get_metrics(
+                            get_single_vm(d['name'])))
+                else:
+                    info = server.xend.domain.sched_bcredit_get(d['name'])
+            except xmlrpclib.Fault:
+                info = {}
+
+            if 'bc_weight' not in info or 'bc_cap' not in info or 'bc_max_boost_period' not in info or 'bc_ratio' not in info:
+                # domain does not support sched-bcredit?
+                info = {'bc_weight': -1, 'bc_cap': -1, 'bc_max_boost_period': -1, 'bc_ratio': -1}
+
+            info['bc_weight'] = int(info['bc_weight'])
+            info['bc_cap']    = int(info['bc_cap'])
+            info['bc_max_boost_period'] = int(info['bc_max_boost_period'])
+            info['bc_ratio']  = int(info['bc_ratio'])
+            
+            info['name']  = d['name']
+            info['domid'] = str(d['domid'])
+            print( ("%(name)-32s %(domid)5s %(bc_weight)6d %(bc_cap)4d %(bc_max_boost_period)8d %(bc_ratio)5d") % info)
+    else:
+        if domid is None:
+            # place holder for system-wide scheduler parameters
+            err("No domain given.")
+            usage('sched-bcredit')
+
+        if serverType == SERVER_XEN_API:
+            if doms[0]['domid']:
+                server.xenapi.VM.add_to_VCPUs_params_live(
+                    get_single_vm(domid),
+                    "bc_weight",
+                    weight)
+                server.xenapi.VM.add_to_VCPUs_params_live(
+                    get_single_vm(domid),
+                    "bc_cap",
+                    cap)
+                server.xenapi.VM.add_to_VCPUs_params_live(
+                    get_single_vm(domid),
+                    "bc_max_boost_period",
+                     max_boost_period)
+                server.xenapi.VM.add_to_VCPUs_params_live(
+                    get_single_vm(domid),
+                    "bc_ratio",
+                    boost_ratio)
+            else:
+                server.xenapi.VM.add_to_VCPUs_params(
+                    get_single_vm(domid),
+                    "bc_weight",
+                    weight)
+                server.xenapi.VM.add_to_VCPUs_params(
+                    get_single_vm(domid),
+                    "bc_cap",
+                    cap)
+                server.xenapi.VM.add_to_VCPUs_params(
+                    get_single_vm(domid),
+                    "bc_max_boost_period",
+                    max_boost_period)
+                server.xenapi.VM.add_to_VCPUs_params(
+                    get_single_vm(domid),
+                    "bc_ratio",
+                    boost_ratio)
+        else:
+            result = server.xend.domain.sched_bcredit_set(domid, weight, cap, max_boost_period, boost_ratio)
             if result != 0:
                 err(str(result))
 
@@ -2825,6 +2944,7 @@ commands = {
     # scheduler
     "sched-sedf": xm_sched_sedf,
     "sched-credit": xm_sched_credit,
+    "sched-bcredit": xm_sched_bcredit,
     # block
     "block-attach": xm_block_attach,
     "block-detach": xm_block_detach,
diff -r a00eb6595d3c xen/common/sched_credit.c
--- a/xen/common/sched_credit.c	Sat Nov 29 09:07:52 2008 +0000
+++ b/xen/common/sched_credit.c	Wed Dec 03 10:19:34 2008 +0900
@@ -1401,3 +1401,1003 @@ struct scheduler sched_credit_def = {
     .dump_settings  = csched_dump,
     .init           = csched_init,
 };
+
+
+/*
+ * Boost Credit Scheduler (bcredit)
+ *   Alternative Credit Scheduler optimized for client hypervisor
+ */
+
+/*
+ * Basic constants
+ */
+#define BCSCHED_DEFAULT_WEIGHT     CSCHED_DEFAULT_WEIGHT
+#define BCSCHED_TICKS_PER_TSLICE   CSCHED_TICKS_PER_TSLICE
+#define BCSCHED_TICKS_PER_ACCT     CSCHED_TICKS_PER_ACCT
+#define BCSCHED_MSECS_PER_TICK     CSCHED_MSECS_PER_TICK
+#define BCSCHED_MSECS_PER_TSLICE   \
+    (BCSCHED_MSECS_PER_TICK * BCSCHED_TICKS_PER_TSLICE)
+#define BCSCHED_CREDITS_PER_TICK   10000
+#define BCSCHED_CREDITS_PER_TSLICE \
+    (BCSCHED_CREDITS_PER_TICK * BCSCHED_TICKS_PER_TSLICE)
+#define BCSCHED_CREDITS_PER_ACCT   \
+    (BCSCHED_CREDITS_PER_TICK * BCSCHED_TICKS_PER_ACCT)
+#define BCSCHED_MSECS_BOOSTTSLICE_PER_CPU 2
+#define BCSCHED_NSECS_MIN_BOOST_TSLICE 500000
+
+/*
+ * Macros
+ */
+#define svc_sbvc(_v) (container_of((_v), struct bcsched_vcpu, svc))
+#define sdom_sbdom(_d) (container_of((_d), struct bcsched_dom, sdom))
+
+/*
+ * Virtual CPU
+ */
+struct bcsched_vcpu {
+    struct csched_vcpu svc;
+    struct list_head inactive_vcpu_elem;
+    s_time_t start_time;
+    atomic_t boost_credit;
+};
+
+/*
+ * Domain
+ */
+struct bcsched_dom {
+    struct csched_dom sdom;
+    uint16_t boost_ratio;
+    uint16_t max_boost_period;
+};
+
+/*
+ * System-wide private data
+ */
+struct bcsched_private {
+    struct list_head inactive_vcpu;
+    uint32_t nvcpus;
+    s_time_t boost_tslice;
+    uint32_t boost_credit;
+    uint16_t total_boost_ratio;
+};
+
+/*
+ * Global variables
+ */
+static struct bcsched_private bcsched_priv;
+
+/* opt_bcsched_tslice: time slice for BOOST priority */
+static unsigned int opt_bcsched_tslice = BCSCHED_MSECS_BOOSTTSLICE_PER_CPU;
+integer_param("bcsched_tslice", opt_bcsched_tslice);
+
+static void bcsched_tick(void *_cpu);
+
+static int
+bcsched_pcpu_init(int cpu)
+{
+    struct csched_pcpu *spc;
+    unsigned long flags;
+
+    /* Allocate per-PCPU info */
+    spc = xmalloc(struct csched_pcpu);
+    if ( spc == NULL )
+        return -1;
+
+    spin_lock_irqsave(&csched_priv.lock, flags);
+
+    /* Initialize/update system-wide config */
+    csched_priv.credit += BCSCHED_CREDITS_PER_ACCT;
+    if ( csched_priv.ncpus <= cpu )
+        csched_priv.ncpus = cpu + 1;
+    if ( csched_priv.master >= csched_priv.ncpus )
+        csched_priv.master = cpu;
+
+    init_timer(&spc->ticker, bcsched_tick, (void *)(unsigned long)cpu, cpu);
+    INIT_LIST_HEAD(&spc->runq);
+    spc->runq_sort_last = csched_priv.runq_sort;
+    per_cpu(schedule_data, cpu).sched_priv = spc;
+
+    /* Start off idling... */
+    BUG_ON(!is_idle_vcpu(per_cpu(schedule_data, cpu).curr));
+    cpu_set(cpu, csched_priv.idlers);
+
+    spin_unlock_irqrestore(&csched_priv.lock, flags);
+
+    return 0;
+}
+
+static inline void
+__bcsched_vcpu_acct_start_locked(struct csched_vcpu *svc)
+{
+    struct csched_dom * const sdom = svc->sdom;
+    struct bcsched_vcpu * const sbvc = svc_sbvc(svc);
+    struct bcsched_dom * const sbdom = sdom_sbdom(sdom);
+
+    CSCHED_VCPU_STAT_CRANK(svc, state_active);
+    CSCHED_STAT_CRANK(acct_vcpu_active);
+
+    sdom->active_vcpu_count++;
+    list_add(&svc->active_vcpu_elem, &sdom->active_vcpu);
+    list_del_init(&sbvc->inactive_vcpu_elem);
+    if ( list_empty(&sdom->active_sdom_elem) )
+    {
+        list_add(&sdom->active_sdom_elem, &csched_priv.active_sdom);
+        csched_priv.weight += sdom->weight;
+        bcsched_priv.boost_credit += (sbdom->boost_ratio *
+                                      BCSCHED_CREDITS_PER_TSLICE) / 100;
+    }
+}
+
+static inline void
+__bcsched_vcpu_acct_stop_locked(struct csched_vcpu *svc)
+{
+    struct csched_dom * const sdom = svc->sdom;
+    struct bcsched_vcpu * const sbvc = svc_sbvc(svc);
+    struct bcsched_dom * const sbdom = sdom_sbdom(sdom);
+
+    BUG_ON( list_empty(&svc->active_vcpu_elem) );
+
+    CSCHED_VCPU_STAT_CRANK(svc, state_idle);
+    CSCHED_STAT_CRANK(acct_vcpu_idle);
+
+    sdom->active_vcpu_count--;
+    list_del_init(&svc->active_vcpu_elem);
+    list_add(&sbvc->inactive_vcpu_elem, &bcsched_priv.inactive_vcpu);
+    if ( list_empty(&sdom->active_vcpu) )
+    {
+        BUG_ON( csched_priv.weight < sdom->weight );
+        list_del_init(&sdom->active_sdom_elem);
+        csched_priv.weight -= sdom->weight;
+        bcsched_priv.boost_credit -= (sbdom->boost_ratio *
+                                      BCSCHED_CREDITS_PER_TSLICE) / 100;
+    }
+}
+
+static void
+bcsched_vcpu_acct(unsigned int cpu)
+{
+    ASSERT( current->processor == cpu );
+    ASSERT( CSCHED_VCPU(current)->sdom != NULL );
+
+    /*
+     * If it's been active a while, check if we'd be better off
+     * migrating it to run elsewhere (see multi-core and multi-thread
+     * support in csched_cpu_pick()).
+     */
+    if ( csched_cpu_pick(current) != cpu )
+    {
+        CSCHED_VCPU_STAT_CRANK(CSCHED_VCPU(current), migrate_r);
+        CSCHED_STAT_CRANK(migrate_running);
+        set_bit(_VPF_migrating, &current->pause_flags);
+        cpu_raise_softirq(cpu, SCHEDULE_SOFTIRQ);
+    }
+}
+
+static int
+bcsched_vcpu_init(struct vcpu *vc)
+{
+    struct domain * const dom = vc->domain;
+    struct csched_dom *sdom = CSCHED_DOM(dom);
+    struct bcsched_vcpu *sbvc;
+    struct csched_vcpu *svc;
+    unsigned long flags;
+
+    CSCHED_STAT_CRANK(vcpu_init);
+
+    /* Allocate per-VCPU info */
+    sbvc = xmalloc(struct bcsched_vcpu);
+    if ( sbvc == NULL )
+        return -1;
+    svc = &(sbvc->svc);
+
+    INIT_LIST_HEAD(&svc->runq_elem);
+    INIT_LIST_HEAD(&svc->active_vcpu_elem);
+    INIT_LIST_HEAD(&sbvc->inactive_vcpu_elem);
+    svc->sdom = sdom;
+    svc->vcpu = vc;
+    atomic_set(&svc->credit, 0);
+    svc->flags = 0U;
+    svc->pri = is_idle_domain(dom) ? CSCHED_PRI_IDLE : CSCHED_PRI_TS_UNDER;
+    CSCHED_VCPU_STATS_RESET(svc);
+    vc->sched_priv = svc;
+    atomic_set(&sbvc->boost_credit, 0);
+
+    /* Allocate per-PCPU info */
+    if ( unlikely(!CSCHED_PCPU(vc->processor)) )
+    {
+        if ( bcsched_pcpu_init(vc->processor) != 0 )
+            return -1;
+    }
+
+    /* Add to the inactive queue in order to start accounting */
+    if ( !is_idle_vcpu(vc) )
+    {
+        uint32_t vcpus_per_cpu;
+
+        spin_lock_irqsave(&csched_priv.lock, flags);
+
+        list_add(&sbvc->inactive_vcpu_elem, &bcsched_priv.inactive_vcpu);
+
+        bcsched_priv.nvcpus++;
+        vcpus_per_cpu = ( (bcsched_priv.nvcpus + (csched_priv.ncpus-1)) /
+                          csched_priv.ncpus
+                        ) - 1;
+        if ( vcpus_per_cpu == 0 )
+            bcsched_priv.boost_tslice = MILLISECS(BCSCHED_MSECS_PER_TSLICE);
+        else
+        {
+            bcsched_priv.boost_tslice =  MILLISECS(opt_bcsched_tslice) /
+                                         vcpus_per_cpu;
+            if ( bcsched_priv.boost_tslice < BCSCHED_NSECS_MIN_BOOST_TSLICE )
+                bcsched_priv.boost_tslice = BCSCHED_NSECS_MIN_BOOST_TSLICE; 
+        }
+
+        spin_unlock_irqrestore(&csched_priv.lock, flags);
+    }
+
+    CSCHED_VCPU_CHECK(vc);
+    return 0;
+}
+
+static void
+bcsched_vcpu_destroy(struct vcpu *vc)
+{
+    struct csched_vcpu * const svc = CSCHED_VCPU(vc);
+    struct bcsched_vcpu * const sbvc = svc_sbvc(svc);
+    struct csched_dom * const sdom = svc->sdom;
+    unsigned long flags;
+
+    CSCHED_STAT_CRANK(vcpu_destroy);
+
+    BUG_ON( sdom == NULL );
+    BUG_ON( !list_empty(&svc->runq_elem) );
+
+    spin_lock_irqsave(&csched_priv.lock, flags);
+
+    if ( !list_empty(&svc->active_vcpu_elem) )
+        __bcsched_vcpu_acct_stop_locked(svc);
+
+    if ( !list_empty(&sbvc->inactive_vcpu_elem) )
+        list_del_init(&sbvc->inactive_vcpu_elem);
+
+    if ( !is_idle_vcpu(vc) )
+    {
+        uint32_t vcpus_per_cpu;
+
+        bcsched_priv.nvcpus--;
+        vcpus_per_cpu = ( (bcsched_priv.nvcpus + (csched_priv.ncpus-1)) /
+                          csched_priv.ncpus
+                        ) - 1;
+        if ( vcpus_per_cpu == 0 )
+            bcsched_priv.boost_tslice = MILLISECS(BCSCHED_MSECS_PER_TSLICE);
+        else
+        {
+            bcsched_priv.boost_tslice =  MILLISECS(opt_bcsched_tslice) /
+                                         vcpus_per_cpu;
+            if ( bcsched_priv.boost_tslice < BCSCHED_NSECS_MIN_BOOST_TSLICE )
+                bcsched_priv.boost_tslice = BCSCHED_NSECS_MIN_BOOST_TSLICE;
+        }
+    }
+
+    spin_unlock_irqrestore(&csched_priv.lock, flags);
+
+    xfree(sbvc);
+}
+
+static int
+bcsched_dom_cntl(
+    struct domain *d,
+    struct xen_domctl_scheduler_op *op)
+{
+    struct csched_dom * const sdom = CSCHED_DOM(d);
+    struct bcsched_dom * const sbdom = sdom_sbdom(sdom);
+    unsigned long flags;
+
+    if ( op->cmd == XEN_DOMCTL_SCHEDOP_getinfo )
+    {
+        op->u.bcredit.weight = sdom->weight;
+        op->u.bcredit.cap = sdom->cap;
+        op->u.bcredit.max_boost_period = sbdom->max_boost_period;
+        op->u.bcredit.boost_ratio = sbdom->boost_ratio;
+    }
+    else
+    {
+        uint16_t weight = (uint16_t)~0U;
+
+        ASSERT(op->cmd == XEN_DOMCTL_SCHEDOP_putinfo);
+
+        spin_lock_irqsave(&csched_priv.lock, flags);
+
+        if ( (op->u.bcredit.weight != 0) &&
+             (sbdom->boost_ratio == 0 || op->u.bcredit.boost_ratio == 0) )
+        {
+            weight = op->u.bcredit.weight;
+        }
+
+        if ( op->u.bcredit.cap != (uint16_t)~0U )
+            sdom->cap = op->u.bcredit.cap;
+
+        if ( (op->u.bcredit.max_boost_period != (uint16_t)~0U) &&
+             (op->u.bcredit.max_boost_period >= BCSCHED_MSECS_PER_TSLICE ||
+              op->u.bcredit.max_boost_period == 0) )
+        {
+            sbdom->max_boost_period = op->u.bcredit.max_boost_period;
+        }
+
+        if ( (op->u.bcredit.boost_ratio != (uint16_t)~0U) &&
+             ((bcsched_priv.total_boost_ratio - sbdom->boost_ratio +
+               op->u.bcredit.boost_ratio) <= 100 * csched_priv.ncpus) &&
+             (sbdom->max_boost_period || op->u.bcredit.boost_ratio == 0) )
+        {
+            uint16_t new_bc, old_bc;
+
+            new_bc = ( op->u.bcredit.boost_ratio *
+                       BCSCHED_CREDITS_PER_TSLICE ) / 100;
+            old_bc = ( sbdom->boost_ratio *
+                       BCSCHED_CREDITS_PER_TSLICE ) / 100;
+
+            bcsched_priv.total_boost_ratio -= sbdom->boost_ratio;
+            bcsched_priv.total_boost_ratio += op->u.bcredit.boost_ratio;
+
+            sbdom->boost_ratio = op->u.bcredit.boost_ratio;
+
+            if ( !list_empty(&sdom->active_sdom_elem) )
+            {
+                bcsched_priv.boost_credit -= old_bc;
+                bcsched_priv.boost_credit += new_bc;
+            }
+            if ( new_bc == 0 )
+            {
+                if ( sdom->weight == 0 )
+                    weight = BCSCHED_DEFAULT_WEIGHT;
+            }
+            else
+                weight = 0;
+        }
+
+        if ( weight != (uint16_t)~0U )
+        {
+            if ( !list_empty(&sdom->active_sdom_elem) )
+            {
+                csched_priv.weight -= sdom->weight;
+                csched_priv.weight += weight;
+            }
+            sdom->weight = weight;
+        }
+
+        spin_unlock_irqrestore(&csched_priv.lock, flags);
+    }
+
+    return 0;
+}
+
+static int
+bcsched_dom_init(struct domain *dom)
+{
+    struct csched_dom *sdom;
+    struct bcsched_dom *sbdom;
+
+    CSCHED_STAT_CRANK(dom_init);
+
+    if ( is_idle_domain(dom) )
+        return 0;
+
+    sbdom = xmalloc(struct bcsched_dom);
+    if ( sbdom == NULL )
+        return -ENOMEM;
+    sdom = &(sbdom->sdom);
+
+    /* Initialize credit and weight */
+    INIT_LIST_HEAD(&sdom->active_vcpu);
+    sdom->active_vcpu_count = 0;
+    INIT_LIST_HEAD(&sdom->active_sdom_elem);
+    sdom->dom = dom;
+    sdom->weight = BCSCHED_DEFAULT_WEIGHT;
+    sdom->cap = 0U;
+    sbdom->boost_ratio = 0U;
+    sbdom->max_boost_period = 0;
+    dom->sched_priv = sdom;
+
+    return 0;
+}
+
+static void
+bcsched_dom_destroy(struct domain *dom)
+{
+    CSCHED_STAT_CRANK(dom_destroy);
+    xfree(sdom_sbdom(CSCHED_DOM(dom)));
+}
+
+/*
+ * This is an O(n) optimized sort of the runq.
+ *
+ * Time-share VCPUs can only be one of three priorities, BOOST, UNDER or OVER.
+ * We walk through the runq and move up any BOOSTs that are preceded by UNDERs
+ * or OVERs, and any UNDERs that are preceded by OVERs. We remember the last
+ * BOOST and UNDER to make the move up operation O(1).
+ */
+static void
+bcsched_runq_sort(unsigned int cpu)
+{
+    struct csched_pcpu * const spc = CSCHED_PCPU(cpu);
+    struct list_head *runq, *elem, *next, *last_boost, *last_under;
+    struct csched_vcpu *svc_elem;
+    unsigned long flags;
+    int sort_epoch;
+
+    sort_epoch = csched_priv.runq_sort;
+    if ( sort_epoch == spc->runq_sort_last )
+        return;
+
+    spc->runq_sort_last = sort_epoch;
+
+    spin_lock_irqsave(&per_cpu(schedule_data, cpu).schedule_lock, flags);
+
+    runq = &spc->runq;
+    elem = runq->next;
+    last_boost = last_under = runq;
+    while ( elem != runq )
+    {
+        next = elem->next;
+        svc_elem = __runq_elem(elem);
+
+        if ( svc_elem->pri == CSCHED_PRI_TS_BOOST )
+        {
+            /* does elem need to move up the runq? */
+            if ( elem->prev != last_boost )
+            {
+                list_del(elem);
+                list_add(elem, last_boost);
+            }
+            if ( last_boost == last_under )
+                last_under = elem;
+            last_boost = elem;
+        }
+        else if ( svc_elem->pri == CSCHED_PRI_TS_UNDER )
+        {
+            /* does elem need to move up the runq? */
+            if ( elem->prev != last_under )
+            {
+                list_del(elem);
+                list_add(elem, last_under);
+            }
+            last_under = elem;
+        }
+
+        elem = next;
+    }
+
+    spin_unlock_irqrestore(&per_cpu(schedule_data, cpu).schedule_lock, flags);
+}
+
+static void
+bcsched_acct(void)
+{
+    unsigned long flags;
+    struct list_head *iter_vcpu, *next_vcpu;
+    struct list_head *iter_sdom, *next_sdom;
+    struct bcsched_vcpu *sbvc;
+    struct bcsched_dom *sbdom;
+    struct csched_vcpu *svc;
+    struct csched_dom *sdom;
+    uint32_t credit_total;
+    uint32_t weight_total;
+    uint32_t bc_total;
+    uint32_t weight_left;
+    uint32_t credit_fair;
+    uint32_t credit_peak;
+    uint32_t credit_cap;
+    uint32_t bc_fair;
+    int credit_balance;
+    int credit_xtra;
+    int credit;
+    int boost_credit;
+    int max_boost_credit;
+    int64_t c_sum, bc_sum;
+    int c_average, bc_average;
+
+
+    spin_lock_irqsave(&csched_priv.lock, flags);
+
+    /* Add vcpu to active list when its credits were consumed by one tick */
+    list_for_each_safe( iter_vcpu, next_vcpu, &bcsched_priv.inactive_vcpu )
+    {
+        sbvc = list_entry(iter_vcpu, struct bcsched_vcpu, inactive_vcpu_elem);
+        svc = &(sbvc->svc);
+        sbdom = sdom_sbdom(svc->sdom);
+
+        max_boost_credit = sbdom->max_boost_period *
+                           (BCSCHED_CREDITS_PER_TSLICE/BCSCHED_MSECS_PER_TSLICE);
+        if ( (atomic_read(&sbvc->boost_credit)
+              <= (max_boost_credit-BCSCHED_CREDITS_PER_TICK)) ||
+             (atomic_read(&svc->credit)
+              <= BCSCHED_CREDITS_PER_TICK*(BCSCHED_TICKS_PER_ACCT-1)) )
+        {
+            __bcsched_vcpu_acct_start_locked(svc);
+        }
+    }
+
+    weight_total = csched_priv.weight;
+    credit_total = csched_priv.credit;
+    bc_total = bcsched_priv.boost_credit;
+
+    /* Converge balance towards 0 when it drops negative */
+    if ( csched_priv.credit_balance < 0 )
+    {
+        credit_total -= csched_priv.credit_balance;
+        CSCHED_STAT_CRANK(acct_balance);
+    }
+
+    if ( unlikely(weight_total == 0 && bc_total == 0) )
+    {
+        csched_priv.credit_balance = 0;
+        spin_unlock_irqrestore(&csched_priv.lock, flags);
+        CSCHED_STAT_CRANK(acct_no_work);
+        return;
+    }
+
+    CSCHED_STAT_CRANK(acct_run);
+
+    weight_left = weight_total;
+    credit_balance = 0;
+    credit_xtra = 0;
+    credit_cap = 0U;
+
+    /* Firstly, subtract boost credits from credit_total. */
+    if ( bc_total != 0 )
+    {
+        credit_total -= bc_total;
+        credit_balance += bc_total;
+    }
+
+    /* Avoid 0 divide error */
+    if ( weight_total == 0 )
+        weight_total = 1;
+
+    list_for_each_safe( iter_sdom, next_sdom, &csched_priv.active_sdom )
+    {
+        sdom = list_entry(iter_sdom, struct csched_dom, active_sdom_elem);
+        sbdom = sdom_sbdom(sdom);
+
+        BUG_ON( is_idle_domain(sdom->dom) );
+        BUG_ON( sdom->active_vcpu_count == 0 );
+        BUG_ON( sdom->weight > weight_left );
+
+        max_boost_credit = sbdom->max_boost_period *
+                           (BCSCHED_CREDITS_PER_TSLICE/BCSCHED_MSECS_PER_TSLICE);
+        c_sum = bc_sum = 0;
+        list_for_each_safe( iter_vcpu, next_vcpu, &sdom->active_vcpu )
+        {
+            svc = list_entry(iter_vcpu, struct csched_vcpu, active_vcpu_elem);
+            sbvc = svc_sbvc(svc);
+
+            BUG_ON( sdom != svc->sdom );
+
+            c_sum += atomic_read(&svc->credit);
+            bc_sum += atomic_read(&sbvc->boost_credit);
+        }
+        c_average = ( c_sum + ( sdom->active_vcpu_count - 1 )
+                    ) / sdom->active_vcpu_count;
+        bc_average = ( bc_sum + ( sdom->active_vcpu_count - 1 )
+                     ) / sdom->active_vcpu_count;
+
+        weight_left -= sdom->weight;
+
+        /*
+         * A domain's fair share is computed using its weight in competition
+         * with that of all other active domains.
+         *
+         * At most, a domain can use credits to run all its active VCPUs
+         * for one full accounting period. We allow a domain to earn more
+         * only when the system-wide credit balance is negative.
+         */
+        credit_peak = sdom->active_vcpu_count * BCSCHED_CREDITS_PER_ACCT;
+        if ( csched_priv.credit_balance < 0 )
+        {
+            credit_peak += ( ( -csched_priv.credit_balance * sdom->weight) +
+                             (weight_total - 1)
+                           ) / weight_total;
+        }
+
+        if ( sdom->cap != 0U )
+        {
+            credit_cap = ((sdom->cap * BCSCHED_CREDITS_PER_ACCT) + 99) / 100;
+            if ( credit_cap < credit_peak )
+                credit_peak = credit_cap;
+
+            credit_cap = ( credit_cap + ( sdom->active_vcpu_count - 1 )
+                         ) / sdom->active_vcpu_count;
+        }
+
+        credit_fair = ( ( credit_total * sdom->weight) + (weight_total - 1)
+                      ) / weight_total;
+
+        if ( credit_fair < credit_peak )
+        {
+            /* credit_fair is 0 if weight is 0. */
+            if ( sdom->weight != 0 )
+                credit_xtra = 1;
+        }
+        else
+        {
+            if ( weight_left != 0U )
+            {
+                /* Give other domains a chance at unused credits */
+                credit_total += ( ( ( credit_fair - credit_peak
+                                    ) * weight_total
+                                  ) + ( weight_left - 1 )
+                                ) / weight_left;
+            }
+
+            if ( credit_xtra )
+            {
+                /*
+                 * Lazily keep domains with extra credits at the head of
+                 * the queue to give others a chance at them in future
+                 * accounting periods.
+                 */
+                CSCHED_STAT_CRANK(acct_reorder);
+                list_del(&sdom->active_sdom_elem);
+                list_add(&sdom->active_sdom_elem, &csched_priv.active_sdom);
+            }
+
+            credit_fair = credit_peak;
+        }
+
+        /* Compute fair share per VCPU */
+        credit_fair = ( credit_fair + ( sdom->active_vcpu_count - 1 )
+                      ) / sdom->active_vcpu_count;
+
+        /* Compute fair share of boost_credit per VCPU */
+        bc_fair = ( ((sbdom->boost_ratio * BCSCHED_CREDITS_PER_ACCT)/100) +
+                    (sdom->active_vcpu_count - 1)
+                  ) / sdom->active_vcpu_count;
+
+        list_for_each_safe( iter_vcpu, next_vcpu, &sdom->active_vcpu )
+        {
+            svc = list_entry(iter_vcpu, struct csched_vcpu, active_vcpu_elem);
+            sbvc = svc_sbvc(svc);
+
+            BUG_ON( sdom != svc->sdom );
+
+            /* Balance two credits */
+            credit = atomic_read(&svc->credit);
+            atomic_add(c_average - credit, &svc->credit);
+            boost_credit = atomic_read(&sbvc->boost_credit);
+            atomic_add(bc_average - boost_credit, &sbvc->boost_credit);
+            boost_credit = atomic_read(&sbvc->boost_credit);
+            if ( sbdom->boost_ratio != 0 )
+            {
+                /* Increment boost credit */
+                atomic_add(bc_fair, &sbvc->boost_credit);
+                boost_credit = atomic_read(&sbvc->boost_credit);
+
+                /*
+                 * Upper bound on boost credits.
+                 * Add excess to credit.
+                 */
+                if ( boost_credit > max_boost_credit )
+                {
+                    atomic_add(boost_credit - max_boost_credit, &svc->credit);
+                    atomic_set(&sbvc->boost_credit, max_boost_credit);
+                    boost_credit = atomic_read(&sbvc->boost_credit);
+                }
+                /*
+                 * If credit is negative,
+                 * boost credits compensate credit.
+                 */
+                credit = atomic_read(&svc->credit);
+                if ( credit < 0 && boost_credit > 0 )
+                {
+                    if ( boost_credit > -credit )
+                    {
+                        atomic_sub(-credit, &sbvc->boost_credit);
+                        atomic_add(-credit, &svc->credit);
+                    }
+                    else
+                    {
+                        atomic_sub(boost_credit, &sbvc->boost_credit);
+                        atomic_add(boost_credit, &svc->credit);
+                    }
+                    boost_credit = atomic_read(&sbvc->boost_credit);
+                }
+            }
+
+            /* Increment credit */
+            atomic_add(credit_fair, &svc->credit);
+            credit = atomic_read(&svc->credit);
+
+            /*
+             * Recompute priority or, if VCPU is idling, remove it from
+             * the active list.
+             */
+            if ( credit < 0 )
+            {
+                svc->pri = CSCHED_PRI_TS_OVER;
+
+                /* Park running VCPUs of capped-out domains */
+                if ( sdom->cap != 0U &&
+                     credit < -credit_cap &&
+                     !(svc->flags & CSCHED_FLAG_VCPU_PARKED) )
+                {
+                    CSCHED_STAT_CRANK(vcpu_park);
+                    vcpu_pause_nosync(svc->vcpu);
+                    svc->flags |= CSCHED_FLAG_VCPU_PARKED;
+                }
+
+                /* Lower bound on credits */
+                if ( credit < -BCSCHED_CREDITS_PER_TSLICE )
+                {
+                    CSCHED_STAT_CRANK(acct_min_credit);
+                    credit = -BCSCHED_CREDITS_PER_TSLICE;
+                    atomic_set(&svc->credit, credit);
+                }
+            }
+            else
+            {
+                if ( boost_credit <= 0 )
+                    svc->pri = CSCHED_PRI_TS_UNDER;
+                else
+                    svc->pri = CSCHED_PRI_TS_BOOST;
+
+                /* Unpark any capped domains whose credits go positive */
+                if ( svc->flags & CSCHED_FLAG_VCPU_PARKED)
+                {
+                    /*
+                     * It's important to unset the flag AFTER the unpause()
+                     * call to make sure the VCPU's priority is not boosted
+                     * if it is woken up here.
+                     */
+                    CSCHED_STAT_CRANK(vcpu_unpark);
+                    vcpu_unpause(svc->vcpu);
+                    svc->flags &= ~CSCHED_FLAG_VCPU_PARKED;
+                }
+
+                if ( credit > BCSCHED_CREDITS_PER_TSLICE )
+                {
+                    atomic_add(credit - BCSCHED_CREDITS_PER_TSLICE,
+                               &sbvc->boost_credit);
+                    boost_credit = atomic_read(&sbvc->boost_credit);
+                    credit = BCSCHED_CREDITS_PER_TSLICE;
+                    atomic_set(&svc->credit, credit);
+
+                    if ( boost_credit > max_boost_credit )
+                    {
+                        atomic_set(&sbvc->boost_credit, max_boost_credit);
+                        __bcsched_vcpu_acct_stop_locked(svc);
+                    }
+                }
+            }
+
+            if ( sbdom->boost_ratio == 0 )
+            {
+                CSCHED_VCPU_STAT_SET(svc, credit_last, credit);
+                CSCHED_VCPU_STAT_SET(svc, credit_incr, credit_fair);
+                credit_balance += credit;
+            }
+            else
+            {
+                CSCHED_VCPU_STAT_SET(svc, credit_last, boost_credit);
+                CSCHED_VCPU_STAT_SET(svc, credit_incr, bc_fair);
+            }
+        }
+    }
+
+    csched_priv.credit_balance = credit_balance;
+
+    spin_unlock_irqrestore(&csched_priv.lock, flags);
+
+    /* Inform each CPU that its runq needs to be sorted */
+    csched_priv.runq_sort++;
+}
+
+static void
+bcsched_tick(void *_cpu)
+{
+    unsigned int cpu = (unsigned long)_cpu;
+    struct csched_pcpu *spc = CSCHED_PCPU(cpu);
+
+    spc->tick++;
+
+    /*
+     * Accounting for running VCPU
+     */
+    if ( !is_idle_vcpu(current) )
+        bcsched_vcpu_acct(cpu);
+
+    /*
+     * Host-wide accounting duty
+     *
+     * Note: Currently, this is always done by the master boot CPU. Eventually,
+     * we could distribute or at the very least cycle the duty.
+     */
+    if ( (csched_priv.master == cpu) &&
+         (spc->tick % BCSCHED_TICKS_PER_ACCT) == 0 )
+    {
+        bcsched_acct();
+    }
+
+    /*
+     * Check if runq needs to be sorted
+     *
+     * Every physical CPU resorts the runq after the accounting master has
+     * modified priorities. This is a special O(n) sort and runs at most
+     * once per accounting period (currently 30 milliseconds).
+     */
+    bcsched_runq_sort(cpu);
+
+    set_timer(&spc->ticker, NOW() + MILLISECS(BCSCHED_MSECS_PER_TICK));
+}
+
+static struct task_slice
+bcsched_schedule(s_time_t now)
+{
+    struct csched_vcpu *svc = CSCHED_VCPU(current);
+    struct bcsched_vcpu *sbvc = svc_sbvc(svc);
+    s_time_t passed = now - sbvc->start_time;
+    int consumed;
+    int boost_credit;
+    struct task_slice ret;
+
+    /*
+     * Update credit
+     */
+    consumed = ( passed +
+                 (MILLISECS(BCSCHED_MSECS_PER_TSLICE) /
+                  BCSCHED_CREDITS_PER_TSLICE - 1)
+               ) / (MILLISECS(BCSCHED_MSECS_PER_TSLICE) /
+                    BCSCHED_CREDITS_PER_TSLICE);
+    if ( svc->pri == CSCHED_PRI_TS_BOOST )
+    {
+        boost_credit = atomic_read(&sbvc->boost_credit);
+        if ( boost_credit > consumed )
+        {
+            atomic_sub(consumed, &sbvc->boost_credit);
+            consumed = 0;
+        }
+        else
+        {
+            atomic_sub(boost_credit, &sbvc->boost_credit);
+            consumed -= boost_credit;
+            svc->pri = CSCHED_PRI_TS_UNDER;
+        }
+    }
+    if ( consumed > 0 && !is_idle_vcpu(current) )
+        atomic_sub(consumed, &svc->credit);
+
+    ret = csched_schedule(now);
+
+    svc = CSCHED_VCPU(ret.task);
+    if ( svc->pri == CSCHED_PRI_TS_BOOST )
+        ret.time = bcsched_priv.boost_tslice;
+
+    sbvc = svc_sbvc(svc);
+    sbvc->start_time = now;
+
+    return ret;
+}
+
+static void
+bcsched_dump_vcpu(struct csched_vcpu *svc)
+{
+    struct bcsched_vcpu * const sbvc = svc_sbvc(svc);
+
+    csched_dump_vcpu(svc);
+
+    if ( svc->sdom )
+    {
+        struct bcsched_dom * const sbdom = sdom_sbdom(svc->sdom);
+
+        printk("\t     bc=%i [bc=%i]\n",
+               atomic_read(&sbvc->boost_credit),
+               sbdom->boost_ratio * BCSCHED_CREDITS_PER_TSLICE / 100);
+    }
+}
+
+static void
+bcsched_dump(void)
+{
+    struct list_head *iter_sdom, *iter_svc;
+    int loop;
+    char idlers_buf[100];
+
+    printk("info:\n"
+           "\tncpus              = %u\n"
+           "\tmaster             = %u\n"
+           "\tcredit             = %u\n"
+           "\tcredit balance     = %d\n"
+           "\tweight             = %u\n"
+           "\trunq_sort          = %u\n"
+           "\tboost_tslice       = %"PRId64"\n"
+           "\tboost_credit       = %u\n"
+           "\ttotal_boost_ratio  = %u\n"
+           "\tdefault-weight     = %d\n"
+           "\tmsecs per tick     = %dms\n"
+           "\tcredits per tick   = %d\n"
+           "\tticks per tslice   = %d\n"
+           "\tticks per acct     = %d\n",
+           csched_priv.ncpus,
+           csched_priv.master,
+           csched_priv.credit,
+           csched_priv.credit_balance,
+           csched_priv.weight,
+           csched_priv.runq_sort,
+           bcsched_priv.boost_tslice,
+           bcsched_priv.boost_credit,
+           bcsched_priv.total_boost_ratio,
+           BCSCHED_DEFAULT_WEIGHT,
+           BCSCHED_MSECS_PER_TICK,
+           BCSCHED_CREDITS_PER_TICK,
+           BCSCHED_TICKS_PER_TSLICE,
+           BCSCHED_TICKS_PER_ACCT);
+
+    cpumask_scnprintf(idlers_buf, sizeof(idlers_buf), csched_priv.idlers);
+    printk("idlers: %s\n", idlers_buf);
+
+    CSCHED_STATS_PRINTK();
+
+    printk("active vcpus:\n");
+    loop = 0;
+    list_for_each( iter_sdom, &csched_priv.active_sdom )
+    {
+        struct csched_dom *sdom;
+        sdom = list_entry(iter_sdom, struct csched_dom, active_sdom_elem);
+
+        list_for_each( iter_svc, &sdom->active_vcpu )
+        {
+            struct csched_vcpu *svc;
+            svc = list_entry(iter_svc, struct csched_vcpu, active_vcpu_elem);
+
+            printk("\t%3d: ", ++loop);
+            bcsched_dump_vcpu(svc);
+        }
+    }
+
+    printk("inactive vcpus:\n");
+    loop = 0;
+    list_for_each( iter_svc, &bcsched_priv.inactive_vcpu )
+    {
+        struct bcsched_vcpu *sbvc;
+        sbvc = list_entry(iter_svc, struct bcsched_vcpu, inactive_vcpu_elem);
+
+        printk("\t%3d: ", ++loop);
+        bcsched_dump_vcpu(&sbvc->svc);
+    }
+}
+
+static void
+bcsched_init(void)
+{
+    csched_init();
+
+    INIT_LIST_HEAD(&bcsched_priv.inactive_vcpu);
+    bcsched_priv.boost_tslice = MILLISECS(BCSCHED_MSECS_PER_TSLICE);
+    bcsched_priv.boost_credit = 0;
+    bcsched_priv.total_boost_ratio = 0;
+}
+
+
+struct scheduler sched_bcredit_def = {
+    .name           = "SMP Credit Scheduler for client side",
+    .opt_name       = "bcredit",
+    .sched_id       = XEN_SCHEDULER_BCREDIT,
+
+    .init_domain    = bcsched_dom_init,
+    .destroy_domain = bcsched_dom_destroy,
+
+    .init_vcpu      = bcsched_vcpu_init,
+    .destroy_vcpu   = bcsched_vcpu_destroy,
+
+    .sleep          = csched_vcpu_sleep,
+    .wake           = csched_vcpu_wake,
+
+    .adjust         = bcsched_dom_cntl,
+
+    .pick_cpu       = csched_cpu_pick,
+    .do_schedule    = bcsched_schedule,
+
+    .dump_cpu_state = csched_dump_pcpu,
+    .dump_settings  = bcsched_dump,
+    .init           = bcsched_init,
+};
+
diff -r a00eb6595d3c xen/common/schedule.c
--- a/xen/common/schedule.c	Sat Nov 29 09:07:52 2008 +0000
+++ b/xen/common/schedule.c	Wed Dec 03 10:19:34 2008 +0900
@@ -51,9 +51,11 @@ DEFINE_PER_CPU(struct schedule_data, sch
 
 extern struct scheduler sched_sedf_def;
 extern struct scheduler sched_credit_def;
+extern struct scheduler sched_bcredit_def;
 static struct scheduler *schedulers[] = { 
     &sched_sedf_def,
     &sched_credit_def,
+    &sched_bcredit_def,
     NULL
 };
 
diff -r a00eb6595d3c xen/include/public/domctl.h
--- a/xen/include/public/domctl.h	Sat Nov 29 09:07:52 2008 +0000
+++ b/xen/include/public/domctl.h	Wed Dec 03 10:19:34 2008 +0900
@@ -294,6 +294,7 @@ DEFINE_XEN_GUEST_HANDLE(xen_domctl_max_v
 /* Scheduler types. */
 #define XEN_SCHEDULER_SEDF     4
 #define XEN_SCHEDULER_CREDIT   5
+#define XEN_SCHEDULER_BCREDIT  6
 /* Set or get info? */
 #define XEN_DOMCTL_SCHEDOP_putinfo 0
 #define XEN_DOMCTL_SCHEDOP_getinfo 1
@@ -312,6 +313,12 @@ struct xen_domctl_scheduler_op {
             uint16_t weight;
             uint16_t cap;
         } credit;
+        struct xen_domctl_sched_bcredit {
+            uint16_t weight;
+            uint16_t cap;
+            uint16_t max_boost_period;
+            uint16_t boost_ratio;
+        } bcredit;
     } u;
 };
 typedef struct xen_domctl_scheduler_op xen_domctl_scheduler_op_t;
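
For readers following the accounting in bcsched_schedule() above, the
charging rule can be sketched outside the hypervisor as follows. This is
only a minimal user-space sketch, not the patch's code: the struct, the
enum and the charge() helper are hypothetical names, plain ints stand in
for the atomic counters, and the constants assume the "30000 credits to
30ms" ratio from the cover letter.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed values, taken from the cover letter ("30000 credits to 30ms"). */
#define NSECS_PER_TSLICE   30000000LL  /* 30 ms in nanoseconds */
#define CREDITS_PER_TSLICE 30000

/* Hypothetical stand-ins for the scheduler's priorities and vcpu state. */
enum pri { PRI_BOOST, PRI_UNDER };

struct vcpu_acct {
    int credit;        /* ordinary credit */
    int boost_credit;  /* boost credit, spent first while in BOOST */
    enum pri pri;
};

/* Charge the elapsed run time: boost credit first, then ordinary credit. */
static void charge(struct vcpu_acct *v, int64_t passed_ns)
{
    const int64_t ns_per_credit = NSECS_PER_TSLICE / CREDITS_PER_TSLICE;
    int consumed;

    assert(passed_ns >= 0);

    /* Round the elapsed time up to whole credits. */
    consumed = (int)((passed_ns + ns_per_credit - 1) / ns_per_credit);

    if (v->pri == PRI_BOOST) {
        if (v->boost_credit > consumed) {
            /* Boost credit covers the whole charge. */
            v->boost_credit -= consumed;
            consumed = 0;
        } else {
            /* Boost credit is exhausted: drop to UNDER, charge the rest. */
            consumed -= v->boost_credit;
            v->boost_credit = 0;
            v->pri = PRI_UNDER;
        }
    }

    if (consumed > 0)
        v->credit -= consumed;
}
```

Under these assumptions, a BOOST vcpu holding 1500 boost credits that
runs for 2ms is charged 2000 credits in total: its boost credit drops to
zero, it falls back to UNDER, and the remaining 500 credits come out of
its ordinary credit.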

[-- Attachment #3: Type: text/plain, Size: 138 bytes --]

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xensource.com
http://lists.xensource.com/xen-devel


* Re: [RFC][PATCH] scheduler: credit scheduler for client virtualization
  2008-12-03  8:54 [RFC][PATCH] scheduler: credit scheduler for client virtualization NISHIGUCHI Naoki
@ 2008-12-03  9:16 ` Keir Fraser
  2008-12-03 12:46   ` George Dunlap
  2008-12-04  7:45   ` NISHIGUCHI Naoki
       [not found] ` <de76405a0901191232k19d910d5o77160fa5ee7bf06c@mail.gmail.com>
  1 sibling, 2 replies; 15+ messages in thread
From: Keir Fraser @ 2008-12-03  9:16 UTC (permalink / raw)
  To: NISHIGUCHI Naoki, xen-devel; +Cc: Ian.Pratt, disheng.su

On 03/12/2008 08:54, "NISHIGUCHI Naoki" <nisiguti@jp.fujitsu.com> wrote:

> Please review this patch.
> Any comments are appreciated.

Don't hack it into the existing sched_credit.c unless you are really sharing
significant amounts of stuff (which it looks like you aren't?).
sched_bcredit.c would be a cleaner name if there's no sharing. Is a new
scheduler necessary -- could the existing credit scheduler be generalised
with your boost mechanism to be suitable for both client and server?

The issue with multiple schedulers is that it's most likely the non-default
will not be tested, used or maintained. The default credit scheduler gets
little enough love as it is, and it's really the only sensible scheduler to
choose now (SEDF is not great -- good example of a rotten non-default
scheduler).

 -- Keir


* Re: [RFC][PATCH] scheduler: credit scheduler for client virtualization
  2008-12-03  9:16 ` Keir Fraser
@ 2008-12-03 12:46   ` George Dunlap
  2008-12-04  7:51     ` NISHIGUCHI Naoki
  2008-12-04  7:45   ` NISHIGUCHI Naoki
  1 sibling, 1 reply; 15+ messages in thread
From: George Dunlap @ 2008-12-03 12:46 UTC (permalink / raw)
  To: Keir Fraser; +Cc: Ian.Pratt, xen-devel, NISHIGUCHI Naoki, disheng.su

On Wed, Dec 3, 2008 at 9:16 AM, Keir Fraser <keir.fraser@eu.citrix.com> wrote:
> Don't hack it into the existing sched_credit.c unless you are really sharing
> significant amounts of stuff (which it looks like you aren't?).
> sched_bcredit.c would be a cleaner name if there's no sharing. Is a new
> scheduler necessary -- could the existing credit scheduler be generalised
> with your boost mechanism to be suitable for both client and server?

I think we ought to be able to work this out; the functionality
doesn't sound that different, and as you say, keeping two schedulers
around is only an invitation to bitrot.

The more accurate credit scheduling and vcpu credit "balancing" seem
like good ideas.  For the other changes, it's probably worth measuring
on a battery of tests to see what kinds of effects we get, especially
on network throughput.

Nishiguchi-san, (I hope that's right!) as I understood from your
presentation, you haven't tested this on a server workload, but you
predict that the "boost" scheduling of 2ms will cause unnecessary
overhead for server workloads.  Is that correct?

Couldn't we avoid the overhead this way:  If a vcpu has 5 or more
"boost" credits, we simply set the next-timer to 10ms.  If the vcpu
yields before then, we subtract the amount of "boost" credits actually
used.  If not, we subtract 5.  That way we're not interrupting any
more frequently than we were before.
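
As a rough illustration of the proposal above, here is a sketch in plain
C. It is not code from the patch: the function and macro names are
hypothetical, the 2ms fallback for vcpus with fewer than 5 boost credits
is an assumption borrowed from the patch's boost slice, and rounding the
partial charge up is only one possible reading of "the amount of boost
credits actually used".

```c
#include <assert.h>

#define BOOST_SLICE_MS      10  /* timer proposed in this message */
#define BOOST_SLICE_CREDITS 5   /* boost credits charged for a full slice */
#define SHORT_SLICE_MS      2   /* assumed fallback: the patch's boost slice */

/* Timer to program for a vcpu holding the given number of boost credits. */
static int next_timer_ms(int boost_credits)
{
    return boost_credits >= BOOST_SLICE_CREDITS ? BOOST_SLICE_MS
                                                : SHORT_SLICE_MS;
}

/* Boost credits to subtract after the vcpu ran for ran_ms of a 10ms slice. */
static int boost_charge(int ran_ms)
{
    assert(ran_ms >= 0);

    /* The slice ran to completion: charge the full 5 credits. */
    if (ran_ms >= BOOST_SLICE_MS)
        return BOOST_SLICE_CREDITS;

    /* The vcpu yielded early: charge proportionally, rounding up. */
    return (ran_ms * BOOST_SLICE_CREDITS + BOOST_SLICE_MS - 1)
           / BOOST_SLICE_MS;
}
```

With these numbers a vcpu that yields after 4ms is charged 2 boost
credits, and one that uses its whole slice is charged 5, so the timer
never fires more often than every 10ms for a well-funded vcpu.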

Come to think of it: won't the effect of setting the 'boost' time to
2ms be basically counteracted by giving domains boost credits?  I
thought the purpose reducing the boost time was to allow other domains
to run more quickly?  But if a domain has more than 5 'boost' credits,
it will run for a full 10 ms anyway.  Is that not so?

Could you test your video latency measurement with all the other
optimizations, but with the "boost" time set to 10ms instead of 2?  If
it works well, it's probably worth simply merging the bulk of your
changes in and testing with server workloads.

 -George


* Re: [RFC][PATCH] scheduler: credit scheduler for client virtualization
  2008-12-03  9:16 ` Keir Fraser
  2008-12-03 12:46   ` George Dunlap
@ 2008-12-04  7:45   ` NISHIGUCHI Naoki
  1 sibling, 0 replies; 15+ messages in thread
From: NISHIGUCHI Naoki @ 2008-12-04  7:45 UTC (permalink / raw)
  To: Keir Fraser, xen-devel; +Cc: Ian.Pratt, disheng.su

Thank you for your comment.

I'll try to make the scheduler suitable for both server and client
workloads.

Regards,
Naoki Nishiguchi

Keir Fraser wrote:
> On 03/12/2008 08:54, "NISHIGUCHI Naoki" <nisiguti@jp.fujitsu.com> wrote:
> 
>> Please review this patch.
>> Any comments are appreciated.
> 
> Don't hack it into the existing sched_credit.c unless you are really sharing
> significant amounts of stuff (which it looks like you aren't?).
> sched_bcredit.c would be a cleaner name if there's no sharing. Is a new
> scheduler necessary -- could the existing credit scheduler be generalised
> with your boost mechanism to be suitable for both client and server?
> 
> The issue with multiple schedulers is that it's most likely the non-default
> will not be tested, used or maintained. The default credit scheduler gets
> little enough love as it is, and it's really the only sensible scheduler to
> choose now (SEDF is not great -- good example of a rotten non-default
> scheduler).
> 
>  -- Keir


* Re: [RFC][PATCH] scheduler: credit scheduler for client virtualization
  2008-12-03 12:46   ` George Dunlap
@ 2008-12-04  7:51     ` NISHIGUCHI Naoki
  2008-12-04 12:21       ` George Dunlap
  0 siblings, 1 reply; 15+ messages in thread
From: NISHIGUCHI Naoki @ 2008-12-04  7:51 UTC (permalink / raw)
  To: George Dunlap, xen-devel; +Cc: Ian.Pratt, disheng.su, Keir Fraser

Thank you for your suggestions.

George Dunlap wrote:
> On Wed, Dec 3, 2008 at 9:16 AM, Keir Fraser <keir.fraser@eu.citrix.com> wrote:
>> Don't hack it into the existing sched_credit.c unless you are really sharing
>> significant amounts of stuff (which it looks like you aren't?).
>> sched_bcredit.c would be a cleaner name if there's no sharing. Is a new
>> scheduler necessary -- could the existing credit scheduler be generalised
>> with your boost mechanism to be suitable for both client and server?
> 
> I think we ought to be able to work this out; the functionality
> doesn't sound that different, and as you say, keeping two schedulers
> around is only an invitation to bitrot.

I had thought that a separate scheduler for client use would be needed
because this modification would affect server workloads. To minimize
modifications, I implemented the bcredit scheduler by wrapping the
current credit scheduler and adding only the differences between the
original and bcredit. As a result, however, almost all of the functions
ended up being newly written.

Now, I agree that one scheduler is best.

> The more accurate credit scheduling and vcpu credit "balancing" seem
> like good ideas.  For the other changes, it's probably worth measuring
> on a battery of tests to see what kinds of effects we get, especially
> on network throughput.

I didn’t think about the battery and the performance.

> Nishiguchi-san, (I hope that's right!) as I understood from your
> presentation, you haven't tested this on a server workload, but you
> predict that the "boost" scheduling of 2ms will cause unnecessary
> overhead for server workloads.  Is that correct?

Yes, that is correct. I gave that answer in the Q&A session.

> Couldn't we avoid the overhead this way:  If a vcpu has 5 or more
> "boost" credits, we simply set the next-timer to 10ms.  If the vcpu
> yields before then, we subtract the amount of "boost" credits actually
> used.  If not, we subtract 5.  That way we're not interrupting any
> more frequently than we were before.

I set the next-timer to 2ms for every vcpu that has “boost” credits,
because all vcpus with “boost” credits need to run equally at short
intervals. If several vcpus have “boost” credits and the next-timer
of one vcpu is set to 10ms, the other vcpus will have to wait for 10ms.

At present, I am thinking that if no other vcpu has “boost”
credits, we may set the next-timer to 30ms.
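
For reference, the charging rule George proposes in the quoted text above
(a 10ms timer for a vcpu with 5 or more "boost" credits, charging either
the credits actually used or 5) could be sketched roughly as below. This
is only an illustration, not code from the patch: the function and macro
names are hypothetical, and the sketch assumes that one boost credit
corresponds to 2ms of cpu time and that a vcpu with fewer than 5 boost
credits falls back to the 2ms timer.

```c
#include <stdint.h>

/* Assumed for illustration: one boost credit buys 2ms of cpu time. */
#define BOOST_SLICE_US      2000u   /* 2ms */
#define LONG_SLICE_US      10000u   /* 10ms timer from the proposal */
#define LONG_SLICE_CREDITS     5    /* credits charged for a full 10ms */

/* Timer (in microseconds) to arm when a boosted vcpu is picked to run. */
static uint32_t boost_timer_us(int boost_credits)
{
    return boost_credits >= LONG_SLICE_CREDITS ? LONG_SLICE_US
                                               : BOOST_SLICE_US;
}

/*
 * Boost credits to subtract after the vcpu ran for ran_us microseconds.
 * A vcpu that yields early pays only for what it used (rounded up);
 * a vcpu that runs until the timer fires pays the full 5 credits.
 */
static int boost_charge(uint32_t ran_us)
{
    uint32_t used = (ran_us + BOOST_SLICE_US - 1) / BOOST_SLICE_US;

    return used > LONG_SLICE_CREDITS ? LONG_SLICE_CREDITS : (int)used;
}
```

Under these assumptions a vcpu that yields after 3ms is charged 2 credits,
and one that runs the whole 10ms is charged 5.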


> Come to think of it: won't the effect of setting the 'boost' time to
> 2ms be basically counteracted by giving domains boost credits?  I
> thought the purpose reducing the boost time was to allow other domains
> to run more quickly?  But if a domain has more than 5 'boost' credits,
> it will run for a full 10 ms anyway.  Is that not so?

Suppose there are two domains with “boost” credits. One domain
runs for 2ms, then the other runs for 2ms, then the first runs
for 2ms again, and so on. I think the waiting time of both domains
needs to be the same.
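
As a rough illustration of why the slice length matters here, the
worst-case wait under strict round-robin can be computed as below. This
is a sketch under simplifying assumptions (every boosted vcpu uses its
full slice, all share one physical cpu, and no other vcpu wakes up in
between); the function name is hypothetical.

```c
/*
 * Worst-case time (ms) a runnable BOOST vcpu waits for its turn when
 * n_boost vcpus take turns on one physical cpu in strict round-robin,
 * each running for slice_ms.
 */
static unsigned int worst_case_wait_ms(unsigned int n_boost,
                                       unsigned int slice_ms)
{
    /* The vcpu waits while each of the other n_boost - 1 vcpus runs. */
    return n_boost > 0 ? (n_boost - 1) * slice_ms : 0;
}
```

With two boosted domains, a 2ms slice bounds the wait at 2ms, while a
10ms slice lets it grow to 10ms.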

> Could you test your video latency measurement with all the other
> optimizations, but with the "boost" time set to 10ms instead of 2?  If
> it works well, it's probably worth simply merging the bulk of your
> changes in and testing with server workloads.

I tested the video latency measurement with the “boost” time set to
10ms. Unfortunately, it did not work well. As I mentioned above, the
vcpu occasionally had to wait for 10ms.

In my patch, the “boost” time is tunable. How about making the default
“boost” time 30ms and setting a shorter “boost” time only when
necessary? Is that acceptable?

In order to lengthen the “boost” time as much as possible, I will think 
about computing the length of the next-timer of the vcpu having “boost” 
credits.
I’ll try to revise the patch.

And thanks again.

Best regards,
Naoki Nishiguchi

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [RFC][PATCH] scheduler: credit scheduler for client virtualization
  2008-12-04  7:51     ` NISHIGUCHI Naoki
@ 2008-12-04 12:21       ` George Dunlap
  2008-12-04 12:37         ` George Dunlap
  2008-12-05  2:47         ` NISHIGUCHI Naoki
  0 siblings, 2 replies; 15+ messages in thread
From: George Dunlap @ 2008-12-04 12:21 UTC (permalink / raw)
  To: NISHIGUCHI Naoki; +Cc: Ian.Pratt, xen-devel, disheng.su, Keir Fraser

On Thu, Dec 4, 2008 at 7:51 AM, NISHIGUCHI Naoki
<nisiguti@jp.fujitsu.com> wrote:
>> The more accurate credit scheduling and vcpu credit "balancing" seem
>> like good ideas.  For the other changes, it's probably worth measuring
>> on a battery of tests to see what kinds of effects we get, especially
>> on network throughput.
>
> I didn't think about the battery and the performance.

I'm sorry, I used an uncommon definition of the word "battery"; I
should have been more careful. :-)

In this context, "a battery of tests" means "a combination of several
different kinds of tests."  I meant some disk-intensive tests, some
network-intensive tests, some cpu-intensive tests, and some
combination of all three.  I can run some of these, and you can make
sure that the "client" tests still work well.  It would probably be
helpful to have other people volunteer to do some testing as well,
just to make sure we have our bases covered.

> I set the next-timer to 2ms in any vcpu having "boost" credits since every
> vcpu having "boost" credits need to be run equally at short intervals. If
> there are vcpus having "boost" credits and the next-timer of a vcpu is set
> to 10ms, the other vcpus will be waited during 10ms.

> At present, I am thinking that if the other vcpus don't have "boost" credits
> then we may set the next-timer to 30ms.

I see -- the current setup is good if there's only one "boosted" VM
(per cpu) at a time; but if there are two "boosted" VMs, they're back
to taking turns at 30 ms.  Your 2ms patch allows several
latency-sensitive VMs to share the "low latency" boost.  That makes
sense.  I agree with your suggestion: we can set the timer to 2ms only
if the next waiting vcpu on the queue is also BOOST.

> I tested the video latency measurement with the "boost" time set to 10ms.
> But it regretted not to work well. As I was mentioned above, the vcpu was
> occasionally waited during 10ms.

OK, good to know.

> On my patch, "boost" time is tuneable. How about the default "boost" time is
> 30ms and if necessary, "boost" time is set? Is it acceptable?

I suspect that latency-sensitive workloads such as network, especially
network servers that do very little computation, may also benefit from
short boost times.

> In order to lengthen the "boost" time as much as possible, I will think
> about computing the length of the next-timer of the vcpu having "boost"
> credits.

If it makes things simpler, we could just stick with 10ms timeslices
when there are no waiting vcpus with BOOST priority, and 2ms if there
is BOOST priority.  I don't think there's a particular need to give a
VM only (say) 8 ms instead of 10, if there are no latency-sensitive
VMs waiting.

> I'll try to revise the patch.

I suggest:
* Modify the credit scheduler directly, rather than having an extra scheduler
* Break down your changes into patches that make individual changes,
i.e (from your first post):
 + A patch to subtract credit consumed accurately
 + A patch to preserve the value of cpu credit when the vcpu is over upper bound
 + A patch to shorten cpu time per one credit
 + A patch to balance credits of each vcpu of a domain
 + A patch to introduce BOOST credit (both Xen and tool components)
 + A patch to shorten allocated time in BOOST priority if the next
vcpu on the runqueue is also at BOOST

Then we can evaluate each change individually.

Thanks for your work!

 -George

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [RFC][PATCH] scheduler: credit scheduler for client virtualization
  2008-12-04 12:21       ` George Dunlap
@ 2008-12-04 12:37         ` George Dunlap
  2008-12-05  3:17           ` NISHIGUCHI Naoki
  2008-12-05  2:47         ` NISHIGUCHI Naoki
  1 sibling, 1 reply; 15+ messages in thread
From: George Dunlap @ 2008-12-04 12:37 UTC (permalink / raw)
  To: NISHIGUCHI Naoki; +Cc: Ian.Pratt, xen-devel, disheng.su, Keir Fraser

On Thu, Dec 4, 2008 at 12:21 PM, George Dunlap
<George.Dunlap@eu.citrix.com> wrote:
> I see -- the current setup is good if there's only one "boosted" VM
> (per cpu) at a time; but if there are two "boosted" VMs, they're back
> to taking turns at 30 ms.  Your 2ms patch allows several
> latency-sensitive VMs to share the "low latency" boost.  That makes
> sense.  I agree with your suggestion: we can set the timer to 2ms only
> if the next waiting vcpu on the queue is also BOOST.

There was a paper earlier this year about scheduling and I/O performance:
 http://www.cs.rice.edu/CS/Architecture/docs/ongaro-vee08.pdf

One of the things he noted was that if a driver domain is accepting
network packets for multiple VMs, we sometimes get the following
pattern:
* driver domain wakes up, starts processing packets.  Because it's in
"over", it doesn't get boosted.
* Passes a packet to VM 1, waking it up.  It runs in "boost",
preempting the (now lower-priority) driver domain.
* Other packets (possibly even for VM 1) sit in the driver domain's
queue, waiting for it to get cpu time.

Their tests, for 3 networking guests and 3 cpu-intensive guests,
showed a 40% degradation in performance due to this problem.  While
we're thinking about the scheduler, it might be worth seeing if we can
solve this.
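
The pattern above can be modelled in a few lines of C. This is a
simplified sketch, not the scheduler's actual code: the enum and function
names are hypothetical, and it assumes only what is described above,
namely that a waking vcpu is promoted to "boost" unless it is in "over",
and that a higher-priority vcpu preempts a lower-priority one.

```c
#include <stdbool.h>

/* Priorities in ascending order (hypothetical names). */
enum prio { PRIO_OVER, PRIO_UNDER, PRIO_BOOST };

/* Priority a vcpu gets on wake-up: only "under" vcpus are boosted. */
static enum prio prio_on_wake(enum prio p)
{
    return p == PRIO_UNDER ? PRIO_BOOST : p;
}

/* A newly woken vcpu preempts the running one if it has higher priority. */
static bool preempts(enum prio woken, enum prio running)
{
    return woken > running;
}
```

A driver domain in "over" stays in "over" when it wakes, so a guest that
wakes from "under" is boosted and preempts it while packets are still
queued in the driver domain.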

 -George

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [RFC][PATCH] scheduler: credit scheduler for client virtualization
  2008-12-04 12:21       ` George Dunlap
  2008-12-04 12:37         ` George Dunlap
@ 2008-12-05  2:47         ` NISHIGUCHI Naoki
  2008-12-05 11:37           ` George Dunlap
  1 sibling, 1 reply; 15+ messages in thread
From: NISHIGUCHI Naoki @ 2008-12-05  2:47 UTC (permalink / raw)
  To: George Dunlap, xen-devel; +Cc: Ian.Pratt, disheng.su, Keir Fraser

Hi,

Thank you for your comments and suggestions.

George Dunlap wrote:
>> I didn't think about the battery and the performance.
> 
> I'm sorry, I used an uncommon definition of the word "battery"; I
> should have been more careful. :-)
> 
> In this context, "a battery of tests" means "a combination of several
> different kinds of tests."  I meant some disk-intensive tests, some
> network-intensive tests, some cpu-intensive tests, and some
> combination of all three.  I can run some of these, and you can make
> sure that the "client" tests still work well.  It would probably be
> helpful to have other people volunteer to do some testing as well,
> just to make sure we have our bases covered.

Oh, I misread the word “battery”. Now I understand what “a battery of
tests” means.
By the way, which tests do you actually run? I am not familiar with
these tests.

>> I set the next-timer to 2ms in any vcpu having "boost" credits since every
>> vcpu having "boost" credits need to be run equally at short intervals. If
>> there are vcpus having "boost" credits and the next-timer of a vcpu is set
>> to 10ms, the other vcpus will be waited during 10ms.
> 
>> At present, I am thinking that if the other vcpus don't have "boost" credits
>> then we may set the next-timer to 30ms.
> 
> I see -- the current setup is good if there's only one "boosted" VM
> (per cpu) at a time; but if there are two "boosted" VMs, they're back
> to taking turns at 30 ms.  Your 2ms patch allows several
> latency-sensitive VMs to share the "low latency" boost.  That makes
> sense.  I agree with your suggestion: we can set the timer to 2ms only
> if the next waiting vcpu on the queue is also BOOST.

OK.
We must also consider a sleeping vcpu, which will be added to the
queue when it wakes up. So, we can set the timer to 2ms only if the next
waiting vcpu on the queue, or a sleeping vcpu, is also BOOST.

My thinking about 2ms is this: the period until a vcpu is executed again
is 2ms. Therefore, the time slice of a vcpu changes according to the
number of existing vcpus. In short, we may set the timer to 2ms or
less. But I think the number of vcpus will not be very large. Is this
assumption wrong? And what do you think of a time slice of 2ms or less?

>> On my patch, "boost" time is tuneable. How about the default "boost" time is
>> 30ms and if necessary, "boost" time is set? Is it acceptable?
> 
> I suspect that latency-sensitive workloads such as network, especially
> network servers that do very little computation, may also benefit from
> short boost times.

I think so, too.

>> In order to lengthen the "boost" time as much as possible, I will think
>> about computing the length of the next-timer of the vcpu having "boost"
>> credits.
> 
> If it makes things simpler, we could just stick with 10ms timeslices
> when there are no waiting vcpus with BOOST priority, and 2ms if there
> is BOOST priority.  I don't think there's a particular need to give a
> VM only (say) 8 ms instead of 10, if there are no latency-sensitive
> VMs waiting.

I agree.

>> I'll try to revise the patch.
> 
> I suggest:
> * Modify the credit scheduler directly, rather than having an extra scheduler
> * Break down your changes into patches that make individual changes,
> i.e (from your first post):
>  + A patch to subtract credit consumed accurately
>  + A patch to preserve the value of cpu credit when the vcpu is over upper bound
>  + A patch to shorten cpu time per one credit
>  + A patch to balance credits of each vcpu of a domain
>  + A patch to introduce BOOST credit (both Xen and tool components)
>  + A patch to shorten allocated time in BOOST priority if the next
> vcpu on the runqueue is also at BOOST
> 
> Then we can evaluate each change individually.

OK.
I’ll split the current patch into individual changes and post each one
as a separate patch.

Best regards,
Naoki Nishiguchi

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [RFC][PATCH] scheduler: credit scheduler for client virtualization
  2008-12-04 12:37         ` George Dunlap
@ 2008-12-05  3:17           ` NISHIGUCHI Naoki
  2008-12-18  2:49             ` NISHIGUCHI Naoki
  0 siblings, 1 reply; 15+ messages in thread
From: NISHIGUCHI Naoki @ 2008-12-05  3:17 UTC (permalink / raw)
  To: George Dunlap, xen-devel; +Cc: Ian.Pratt, disheng.su, Keir Fraser

Thanks for your information.

George Dunlap wrote:
> There was a paper earlier this year about scheduling and I/O performance:
>  http://www.cs.rice.edu/CS/Architecture/docs/ongaro-vee08.pdf
> 
> One of the things he noted was that if a driver domain is accepting
> network packets for multiple VMs, we sometimes get the following
> pattern:
> * driver domain wakes up, starts processing packets.  Because it's in
> "over", it doesn't get boosted.
> * Passes a packet to VM 1, waking it up.  It runs in "boost",
> preempting the (now lower-priority) driver domain.
> * Other packets (possibly even for VM 1) sit in the driver domain's
> queue, waiting for it to get cpu time.

I haven't read the paper yet, but I think our approach is effective for
this problem.
However, if the driver domain consumes too much cpu time, we cannot
prevent it from falling to "over" priority. Otherwise, we can keep it
at "under" or "boost" priority.

> Their tests, for 3 networking guests and 3 cpu-intensive guests,
> showed a 40% degradation in performance due to this problem.  While
> we're thinking about the scheduler, it might be worth seeing if we can
> solve this.

First, I'd like to read the paper.

Regards,
Naoki Nishiguchi

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [RFC][PATCH] scheduler: credit scheduler for client virtualization
  2008-12-05  2:47         ` NISHIGUCHI Naoki
@ 2008-12-05 11:37           ` George Dunlap
  2008-12-08  8:37             ` NISHIGUCHI Naoki
  0 siblings, 1 reply; 15+ messages in thread
From: George Dunlap @ 2008-12-05 11:37 UTC (permalink / raw)
  To: NISHIGUCHI Naoki; +Cc: Ian.Pratt, xen-devel, disheng.su, Keir Fraser

On Fri, Dec 5, 2008 at 2:47 AM, NISHIGUCHI Naoki
<nisiguti@jp.fujitsu.com> wrote:
> Oh, I misread the word "battery". I understand what "a battery of tests"
> means.
> By the way, what tests do you concretely do? I have no idea on these tests.

For basic workload tests, a couple are pretty handy.  vConsolidate is
a good test, but pretty hard to set up; I should be able to manage it
with our infrastructure here, though.  Other tests include:
* kernel-build (i.e., time how long it takes to build the Linux
kernel) and/or ddk-build (Windows equivalent)
* specjbb (a cpu-intensive workload)
* netperf (for networks)

For testing its effect on network, the paper I mentioned has three
workloads that it combines in different ways:
* cpu (just busy spinning)
* sustained network (netbench): throughput
* network ping: latency.

> OK.
> We must consider also a sleeping vcpu. The vcpu will be added to the queue
> by wakeup. So, we can set the timer to 2ms only if the next waiting vcpu on
> the queue or the sleeping vcpu is also BOOST.
>
> My thought about 2ms is: the period that the vcpu will be executed next is
> 2ms. Therefore, time slice of the vcpu is changed according to the number of
> existing vcpus. In a word, we may set the timer to 2ms or less. But I think
> that the number of vcpus will not be so much. Is this supposition wrong? And
> how about time slice of 2ms or less?

I think I understand you to mean: If we set the timer for 10ms, and in
the meantime another vcpu wakes up and is set at BOOST, then it won't
get a chance to run for another 10 ms.  And you're suggesting that we
run the scheduler at 2ms if there are any vcpus that *may* wake up and
be at BOOST, just in case; and you don't think this situation will
happen very often.  Is that correct?

Unfortunately, in consolidated server workloads you're pretty likely
to have more vcpus than physical cpus, so I think this case would come
up pretty often.  Furthermore, 2ms is really too short a scheduling
quantum for normal use, especially for HVM domains, which have to take
a vmexit/vmenter cycle to handle every interrupt.  (I did some tests
back when we were using the SEDF scheduler, and the scheduling alone
was a 4-5% overhead for HVM domains.)

But I don't think we actually have a problem here: if a vcpu wakes up
and is promoted to BOOST, won't it "tickle" the runqueues to find
somewhere for it to run?  At the very least the current cpu should be
able to run it, or if it's already running one at BOOST, it can set its
own timer to 2ms.  In any case, I think handling this corner case with
some extra code is preferable to running a 2ms timer any time it
*might* happen.
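
A rough sketch of that wake-up handling, to make the idea concrete. The
names below are hypothetical and this is not the credit scheduler's real
tickle code; it only models the two cases described above for a vcpu that
wakes at BOOST, and leaves every other wake-up to the existing logic.

```c
/* Priorities in ascending order (hypothetical names). */
enum prio { PRIO_OVER, PRIO_UNDER, PRIO_BOOST };

/* What the cpu that receives the wake-up should do. */
enum wake_action {
    WAKE_NONE,           /* not a BOOST wake-up: existing logic applies */
    WAKE_PREEMPT,        /* reschedule now so the woken vcpu can run */
    WAKE_SHORTEN_TIMER   /* already running BOOST: cut its timer to 2ms */
};

static enum wake_action on_boost_wake(enum prio running, enum prio woken)
{
    if (woken != PRIO_BOOST)
        return WAKE_NONE;
    if (running == PRIO_BOOST)
        return WAKE_SHORTEN_TIMER;
    return WAKE_PREEMPT;
}
```

The point of handling it this way is that the 2ms timer is armed only
when a BOOST vcpu is actually waiting, not whenever one might wake up.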

> OK.
> I'll separate individual changes from current patch and post each patch.

Thanks!  I'll take them for a spin today.

 -George

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [RFC][PATCH] scheduler: credit scheduler for client virtualization
  2008-12-05 11:37           ` George Dunlap
@ 2008-12-08  8:37             ` NISHIGUCHI Naoki
  0 siblings, 0 replies; 15+ messages in thread
From: NISHIGUCHI Naoki @ 2008-12-08  8:37 UTC (permalink / raw)
  To: George Dunlap, xen-devel; +Cc: Ian.Pratt, disheng.su, Keir Fraser

George Dunlap wrote:
> For basic workload tests, a couple are pretty handy.  vConsolidate is
> a good test, but pretty hard to set up; I should be able to manage it
> with our infrastructure here, though.  Other tests include:
> * kernel-build (i.e., time how long it takes to build the Linux
> kernel) and/or ddk-build (Windows equivalent)
> * specjbb (a cpu-intensive workload)
> * netperf (for networks)
> 
> For testing its effect on network, the paper I mentioned has three
> workloads that it combines in different ways:
> * cpu (just busy spinning)
> * sustained network (netbench): throughput
> * network ping: latency.

Thanks! I'll try to prepare.

> I think I understand you to mean: If we set the timer for 10ms, and in
> the meantime another vcpu wakes up and is set at BOOST, then it won't
> get a chance to run for another 10 ms.  And you're suggesting that we
> run the scheduler at 2ms if there are any vcpus that *may* wake up and
> be at BOOST, just in case; and you don't think this situation will
> happen very often.  Is that correct?

That is almost correct.
I had thought that we should run the scheduler at 2ms only if there are
vcpus that have boost credit and are already at BOOST. But I no longer
think so.

> Unfortunately, in consolidated server workloads you're pretty likely
> to have more vcpus than physical cpus, so I think this case would come
> up pretty often.  Furthermore, 2ms is really too short a scheduling
> quantum for normal use, especially for HVM domains, which have to take
> a vmexit/vmenter cycle to handle every interrupt.  (I did some tests
> back when we were using the SEDF scheduler, and the scheduling alone
> was a 4-5% overhead for HVM domains.)

I see.

> But I don't think we actually have a problem here: if a vcpu wakes up
> and is promoted to BOOST, won't it "tickle" the runqueues to find
> somewhere for it to run?  At the very least the current cpu should be
> able to run it, or if it's already running one at BOOST, it can set its
> own timer to 2ms.  In any case, I think handling this corner case with
> some extra code is preferable to running a 2ms timer any time it
> *might* happen.

OK.
I implemented it as follows:
- If the next running vcpu is at BOOST and the first vcpu on the
run-queue is at BOOST, set the timer to 2ms.
- If the next running vcpu is at BOOST and the first vcpu on the
run-queue is not at BOOST, set the timer to 10ms.
- If the next running vcpu is not at BOOST, set the timer to 30ms.
- When a vcpu wakes up, if the vcpu has boost credit, send a scheduler
interrupt to at least one CPU.
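
The timer rules listed above can be sketched as a single function. This
is only an illustration of the rules as stated in this mail, not the code
in the patch; the function and parameter names are hypothetical.

```c
#include <stdbool.h>

/*
 * Timer length (ms) for the vcpu about to run.
 *   next_is_boost:       the vcpu chosen to run next is at BOOST
 *   queue_head_is_boost: the first vcpu left on the run-queue is at BOOST
 */
static unsigned int next_timer_ms(bool next_is_boost, bool queue_head_is_boost)
{
    if (!next_is_boost)
        return 30;                       /* normal time slice */

    return queue_head_is_boost ? 2 : 10; /* share the boost, or run longer */
}
```

The wake-up rule (sending a scheduler interrupt when a vcpu with boost
credit wakes) is not modelled here.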

In my test environment, it works well.
I'll post the last patch today.

Thanks,
Naoki

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [RFC][PATCH] scheduler: credit scheduler for client virtualization
  2008-12-05  3:17           ` NISHIGUCHI Naoki
@ 2008-12-18  2:49             ` NISHIGUCHI Naoki
  2008-12-18 10:21               ` George Dunlap
  0 siblings, 1 reply; 15+ messages in thread
From: NISHIGUCHI Naoki @ 2008-12-18  2:49 UTC (permalink / raw)
  To: George Dunlap, xen-devel; +Cc: Ian.Pratt, disheng.su, Keir Fraser

Hi all,

In almost the same environment as in the paper, I experimented with the
credit scheduler (original and modified versions).
I describe the results below.

Unfortunately, my previous patches did not produce good results.

I found that there were some problems in my previous patches.
So I revised the patches and ran the experiments again with the revised
version. With the revised patches, good results were obtained.
In particular, please look at the result of ex7. In the revised version,
I/O bandwidth per guest grows correctly according to dom0's weight.

I'll post the revised patches later.

Thanks,
Naoki Nishiguchi

---------- results ----------
experimental environment:
   HP dc7800 US/CT(Core2 Duo E6550 2.33GHz)
     Multi-processor: disable
   Xen: xen 3.3.0 release
   dom0: CentOS 5.2

I ran the following subset of the paper's experiments.
ex3: burn x7, ping x1
ex5: stream x7, ping x1
ex7: stream x3, burn x3, ping x1
ex8: stream x3, ping+burn x1, burn x3

original credit scheduler
ex3
   burn(%):      14 14 14 14 14 14 14
   ping(ms):     19.7(average)  0.1 - 359
ex5
   stream(Mbps): 144.05 141.19 137.81 137.01 137.30 138.76 142.21
   ping(ms):     8.2(average)  7.84 - 8.63
ex7
   stream(Mbps): 33.74 27.74 34.70
   burn(%):      28 28 28 (by guess)
   ping(ms):     238(average)  1.78 - 485
ex7(xm sched-credit -d 0 -w 512)
   There was no change in the result.
ex8
   stream(Mbps): 9.98 11.32 10.61
   ping+burn:    264.9ms(average)  20.3 - 547
                 24%
   burn(%):      24 24 24


modified version(previous patches)
ex3
   burn(%):      14 14 14 14 14 14 14
   ping(ms):     0.17(average)  0.136 - 0.202
ex5
   stream(Mbps): 143.90 141.79 137.15 138.43 138.37 130.33 143.36
   ping(ms):     7.2(average)  4.85 - 8.95
ex7
   stream(Mbps): 2.33 2.18 1.87
   burn(%):      32 32 32 (by guess)
   ping(ms):     373.7(average)  68.0 - 589
ex7(xm sched-credit -d 0 -w 512)
   There was no change in the result.
ex7(xm sched-credit -d 0 -m 100 -r 20)
   stream(Mbps): 114.49 117.59 115.76
   burn(%):      24 24 24
   ping(ms):     1.2(average)  0.158 - 65.1
ex8
   stream(Mbps): 1.31 1.09 1.92
   ping+burn:    387.7ms(average)  92.6 - 676
                 24% (by guess)
   burn(%):      24 24 24 (by guess)


revised version
ex3
   burn(%):      14 14 14 14 14 14 14
   ping(ms):     0.18(average)  0.140 - 0.238
ex5
   stream(Mbps): 142.57 139.03 137.50 136.77 137.61 138.95 142.63
   ping(ms):     8.2(average)  7.86 - 8.71
ex7
   stream(Mbps): 143.63 132.13 131.77
   burn(%):      24 24 24
   ping(ms):     32.2(average)  1.73 - 173
ex7(xm sched-credit -d 0 -w 512)
   stream(Mbps): 240.06 204.85 229.23
   burn(%):      18 18 18
   ping(ms):     7.0(average)  0.412 - 73.9
ex7(xm sched-credit -d 0 -m 100 -r 20)
   stream(Mbps): 139.74 134.95 135.18
   burn(%):      23 23 23
   ping(ms):     15.1(average)  1.87 - 95.4
ex8
   stream(Mbps): 118.15 106.71 116.37
   ping+burn:    68.8ms(average) 1.86 - 319
                 19%
   burn(%):      19 19 19
----------

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [RFC][PATCH] scheduler: credit scheduler for client virtualization
  2008-12-18  2:49             ` NISHIGUCHI Naoki
@ 2008-12-18 10:21               ` George Dunlap
  0 siblings, 0 replies; 15+ messages in thread
From: George Dunlap @ 2008-12-18 10:21 UTC (permalink / raw)
  To: NISHIGUCHI Naoki; +Cc: Ian.Pratt, xen-devel, disheng.su, Keir Fraser

Naoki,

Thank you for your work!  The results look really good.

Overall, I think the scheduler as a whole needs some design work
before these can go in.  No one at this point fully understands the
principles on which it's supposed to run.  I've been taking a close
look at the unmodified scheduler (notably trying to understand the
anomalies pointed out by Atsushi), and I think it's clear that there
are some flaws in the logic.

Before making a large change like this, I think we should do several things:
* Try to describe exactly what the scheduler is currently doing, and why
* If there are some inconsistencies, change them
* Modify the description to include your proposed changes to the boost scheduler

Your changes, although proven effective, make the scheduler much more
complicated.  If no one understands it now, it will be even harder to
understand with your changes, unless we set down some very clear
documentation of how the algorithm is supposed to work.  Namely, we
need to document:

* What factors different workloads need; i.e.:
 + Long enough time for cpu-bound workloads to warm up the cache effectively
 + Fast responsiveness for "latency-sensitive" workloads, esp. in the
face of multiple latency-sensitive workloads
 + Fairness wrt weight
* At a high level, what we'd like to see happen
* How individual mechanisms work:
 + Credits: when they are added / subtracted
 + Priorities: when they are changed and why
 + Preemption: when a cpu-bound process gets preempted
 + Active / passive status: when and why switched from one to the other

I've been intending to do this for a couple of weeks now, but I've got
some other patches I need to get cleaned up and submitted first.
Hopefully those will be finished by the end of the week.  This is my
very next priority.

Once I have the "design" document, I can describe your changes in
reference to them, and we can discuss them at a design level.

I have a couple of specific comments on your patches that I'll put
inline in other e-mails.

Thank you for your work, and your patience.

 -George


On Thu, Dec 18, 2008 at 2:49 AM, NISHIGUCHI Naoki
<nisiguti@jp.fujitsu.com> wrote:
> Hi all,
>
> In almost the same environment as the paper, I experimented with credit
> scheduler(original and modified version).
> I describe the results below.
>
> Unfortunately the good result was not obtained by my previous patches.
>
> I found that there were some problems on my previous patches.
> So, I had revised the patches and experimented with revised version again.
> Using revised patches, the good result was obtained.
> Especially, please look at the result of ex7. In revised version, I/O
> bandwidth per guest is growing correctly according to dom0's weight.
>
> I'll post the revised patches later.
>
> Thanks,
> Naoki Nishiguchi
>
> ---------- results ----------
> experimental environment:
>  HP dc7800 US/CT(Core2 Duo E6550 2.33GHz)
>    Multi-processor: disable
>  Xen: xen 3.3.0 release
>  dom0: CentOS 5.2
>
> I used the following experiments from among the paper's experiments.
> ex3: burn x7, ping x1
> ex5: stream x7, ping x1
> ex7: stream x3, burn x3, ping x1
> ex8: stream x3, ping+burn x1, burn x3
>
> original credit scheduler
> ex3
>  burn(%):      14 14 14 14 14 14 14
>  ping(ms):     19.7(average)  0.1 - 359
> ex5
>  stream(Mbps): 144.05 141.19 137.81 137.01 137.30 138.76 142.21
>  ping(ms)    : 8.2(average)  7.84 - 8.63
> ex7
>  stream(Mbps): 33.74 27.74 34.70
>  burn(%):      28 28 28 (by guess)
>  ping(ms):     238(average)  1.78 - 485
> ex7(xm sched-credit -d 0 -w 512)
>  There was no change in the result.
> ex8
>  stream(Mbps): 9.98 11.32 10.61
>  ping+burn:    264.9ms(average)  20.3 - 547
>                24%
>  burn(%):      24 24 24
>
>
> modified version(previous patches)
> ex3
>  burn(%):      14 14 14 14 14 14 14
>  ping(ms):     0.17(average)  0.136 - 0.202
> ex5
>  stream(Mbps): 143.90 141.79 137.15 138.43 138.37 130.33 143.36
>  ping(ms):     7.2(average)  4.85 - 8.95
> ex7
>  stream(Mbps): 2.33 2.18 1.87
>  burn(%):      32 32 32 (by guess)
>  ping(ms):     373.7(average)  68.0 - 589
> ex7(xm sched-credit -d 0 -w 512)
>  There was no change in the result.
> ex7(xm sched-credit -d 0 -m 100 -r 20)
>  stream(Mbps): 114.49 117.59 115.76
>  burn(%):      24 24 24
>  ping(ms):     1.2(average)  0.158 - 65.1
> ex8
>  stream(Mbps): 1.31 1.09 1.92
>  ping+burn:    387.7ms(average)  92.6 - 676
>                24% (by guess)
>  burn(%):      24 24 24 (by guess)
>
>
> revised version
> ex3
>  burn(%):      14 14 14 14 14 14 14
>  ping(ms):     0.18(average)  0.140 - 0.238
> ex5
>  stream(Mbps): 142.57 139.03 137.50 136.77 137.61 138.95 142.63
>  ping(ms):     8.2(average)  7.86 - 8.71
> ex7
>  stream(Mbps): 143.63 132.13 131.77
>  burn(%):      24 24 24
>  ping(ms):     32.2(average)  1.73 - 173
> ex7(xm sched-credit -d 0 -w 512)
>  stream(Mbps): 240.06 204.85 229.23
>  burn(%):      18 18 18
>  ping(ms):     7.0(average)  0.412 - 73.9
> ex7(xm sched-credit -d 0 -m 100 -r 20)
>  stream(Mbps): 139.74 134.95 135.18
>  burn(%):      23 23 23
>  ping(ms):     15.1(average)  1.87 - 95.4
> ex8
>  stream(Mbps): 118.15 106.71 116.37
>  ping+burn:    68.8ms(average) 1.86 - 319
>                19%
>  burn(%):      19 19 19
> ----------
>
> NISHIGUCHI Naoki wrote:
>>
>> Thanks for your information.
>>
>> George Dunlap wrote:
>>>
>>> There was a paper earlier this year about scheduling and I/O performance:
>>>  http://www.cs.rice.edu/CS/Architecture/docs/ongaro-vee08.pdf
>>>
>>> One of the things he noted was that if a driver domain is accepting
>>> network packets for multiple VMs, we sometimes get the following
>>> pattern:
>>> * driver domain wakes up, starts processing packets.  Because it's in
>>> "over", it doesn't get boosted.
>>> * Passes a packet to VM 1, waking it up.  It runs in "boost",
>>> preempting the (now lower-priority) driver domain.
>>> * Other packets (possibly even for VM 1) sit in the driver domain's
>>> queue, waiting for it to get cpu time.
>>
>> I haven't read the paper yet, but I think our approach is effective for this
>> problem.
>> However, if the driver domain consumes too much cpu time, we cannot prevent
>> it from dropping to "over" priority. Otherwise, we can keep it at "under"
>> or "boost" priority.
>>
>>> Their tests, for 3 networking guests and 3 cpu-intensive guests,
>>> showed a 40% degradation in performance due to this problem.  While
>>> we're thinking about the scheduler, it might be worth seeing if we can
>>> solve this.
>>
>> First, I'd like to read the paper.
>>
>> Regards,
>> Naoki Nishiguchi


* Re: [RFC][PATCH] scheduler: credit scheduler for client virtualization
       [not found]     ` <49768FDB.60609@jp.fujitsu.com>
@ 2009-01-21 10:35       ` George Dunlap
  2009-01-22  6:15         ` NISHIGUCHI Naoki
  0 siblings, 1 reply; 15+ messages in thread
From: George Dunlap @ 2009-01-21 10:35 UTC (permalink / raw)
  To: NISHIGUCHI Naoki, xen-devel@lists.xensource.com

Naoki,

I'm working on revising the scheduler right now, so it's probably best
if you hold off patches for a little while.

I'm also trying to understand the minimum that your client workloads
actually need to run well.  There were components of the "boost"
patch series that helped your workload:
 (a) minimum cpu time,
 (b) Shortened time slices (2ms)
 (c) "boosted" priority for multimedia domains

Is it possible that having (a) and (b), possibly with some other
combinations, could work well without adding (c)?

At any rate, I'm going to start with a revised system that has a
minimum cpu time, but no "high priority", and see if we can get things
to work OK without it.

Thanks for your work, BTW -- the scheduler has needed some attention
for a long time, but I don't think it would have gotten it if you
hadn't introduced these patches.

Peace,
 -George

On Wed, Jan 21, 2009 at 3:00 AM, NISHIGUCHI Naoki
<nisiguti@jp.fujitsu.com> wrote:
> Hi George,
>
> George Dunlap wrote:
>>
>> Sorry, didn't finish my thoughts before sending...
>>
>>> The original meaning of the "boost" priority was a priority given to
>>> domains when waking up, so that latency-sensitive workloads could
>>> achieve low latency when competing with cpu-intensive workloads, while
>>> maintaining weight.  I think this meaning of "boost" (and the
>>> mechanism) is still important, especially for server-style workloads.
>>
>> ...so, I think we need to maintain the old "boost" mechanism (or
>> something like it), and come up with a new name for this "priority cpu
>> time" feature.
>
> I believe that the old "boost" mechanism remains after applying my patches.
> But now I think that the "priority cpu time" feature needs a new name, as you
> said.
>
> To avoid changing the existing functionality of the credit scheduler while
> still giving a domain continuously high priority, I decided to use the "boost"
> mechanism, especially the boost priority. In my rev2 patches, the old "boost"
> mechanism and the "boost credit" I introduced were tightly integrated, and a
> good result was obtained. But, as you said and as I wrote above, I think that
> the "boost" mechanism and "boost credit" should be separated. I'll try to
> achieve this by introducing a new priority for the "priority cpu time" feature.
>
> Regards,
> Naoki
>
>


* Re: [RFC][PATCH] scheduler: credit scheduler for client virtualization
  2009-01-21 10:35       ` George Dunlap
@ 2009-01-22  6:15         ` NISHIGUCHI Naoki
  0 siblings, 0 replies; 15+ messages in thread
From: NISHIGUCHI Naoki @ 2009-01-22  6:15 UTC (permalink / raw)
  To: George Dunlap; +Cc: xen-devel@lists.xensource.com

[-- Attachment #1: Type: text/plain, Size: 1451 bytes --]

Hi George,

George Dunlap wrote:
> I'm working on revising the scheduler right now, so it's probably best
> if you hold off patches for a little while.

OK. I'll wait until you finish your work.

> I'm also trying to understand the minimum that your client workloads
> actually need to run well.  There were components of the "boost"
> patch series that helped your workload:
>  (a) minimum cpu time,
>  (b) Shortened time slices (2ms)
>  (c) "boosted" priority for multimedia domains
> 
> Is it possible that having (a) and (b), possibly with some other
> combinations, could work well without adding (c)?

Yes, it is possible.
I split the rev2 "boost" patch as follows, leaving out (c).
  (1) minimum cpu time (a): boost_1.patch + boost_1_tools.patch
  (2) Shortened time slices (b): boost_2.patch
  (3) alternative "boost" mechanism by boost_credit: boost_3.patch

These patches work in the following combinations.
   (1), (1)+(2), (1)+(2)+(3)
Please apply these patches in numerical order.

Without (3), the problem described in the paper you pointed to is not solved.

Are these what you want?
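
As a rough illustration of (1), the way boost_1.patch turns a domain's
percent into per-vcpu credits can be modelled in a few lines of Python.
This is only a sketch, not the hypervisor code: the function name is
hypothetical, and CREDITS_PER_ACCT = 300 assumes the stock values
CSCHED_CREDITS_PER_TICK = 100 and CSCHED_TICKS_PER_ACCT = 3.

```python
# Illustrative model of the percent -> credit step in csched_acct().
# Assumption: CSCHED_CREDITS_PER_ACCT is 300 (100 credits/tick * 3 ticks).
CREDITS_PER_ACCT = 300


def percent_credit_per_vcpu(percent, active_vcpus, weight_share=0,
                            credits_per_acct=CREDITS_PER_ACCT):
    """Credits handed to each active vcpu of a domain per accounting period.

    weight_share stands for the weight-based credit_fair value computed
    earlier in csched_acct(); it is 0 for a domain that only uses percent.
    """
    # credit_fair += (sdom->percent * CSCHED_CREDITS_PER_ACCT) / 100
    credit_fair = weight_share + (percent * credits_per_acct) // 100
    # Divide among the active vcpus, rounding up as the patch does.
    return (credit_fair + active_vcpus - 1) // active_vcpus
```

Under these assumed constants, a single-vcpu domain with percent=20 gets
60 credits per accounting period, and a two-vcpu domain with percent=50
gets 75 credits per vcpu.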

> At any rate, I'm going to start with a revised system that has a
> minimum cpu time, but no "high priority", and see if we can get things
> to work OK without it.
> 
> Thanks for your work, BTW -- the scheduler has needed some attention
> for a long time, but I don't think it would have gotten it if you
> hadn't introduced these patches.

Thanks.

Best regards,
Naoki

[-- Attachment #2: boost_1.patch --]
[-- Type: text/x-patch, Size: 9014 bytes --]

diff -r 56032cbaf1e8 xen/common/sched_credit.c
--- a/xen/common/sched_credit.c	Thu Jan 22 10:43:50 2009 +0900
+++ b/xen/common/sched_credit.c	Thu Jan 22 13:06:24 2009 +0900
@@ -201,6 +201,7 @@ struct csched_vcpu {
     struct csched_dom *sdom;
     struct vcpu *vcpu;
     atomic_t credit;
+    int prev_credit;
     uint16_t flags;
     int16_t pri;
 #ifdef CSCHED_STATS
@@ -225,6 +226,7 @@ struct csched_dom {
     uint16_t active_vcpu_count;
     uint16_t weight;
     uint16_t cap;
+    uint16_t percent;
 };
 
 /*
@@ -239,6 +241,8 @@ struct csched_private {
     cpumask_t idlers;
     uint32_t weight;
     uint32_t credit;
+    uint32_t percent;
+    uint16_t total_percent;
     int credit_balance;
     uint32_t runq_sort;
     CSCHED_STATS_DEFINE()
@@ -503,6 +507,7 @@ __csched_vcpu_acct_start_locked(struct c
         {
             list_add(&sdom->active_sdom_elem, &csched_priv.active_sdom);
             csched_priv.weight += sdom->weight;
+            csched_priv.percent += sdom->percent;
         }
     }
 }
@@ -525,6 +530,7 @@ __csched_vcpu_acct_stop_locked(struct cs
         BUG_ON( csched_priv.weight < sdom->weight );
         list_del_init(&sdom->active_sdom_elem);
         csched_priv.weight -= sdom->weight;
+        csched_priv.percent -= sdom->percent;
     }
 }
 
@@ -579,6 +585,7 @@ csched_vcpu_init(struct vcpu *vc)
     svc->sdom = sdom;
     svc->vcpu = vc;
     atomic_set(&svc->credit, 0);
+    svc->prev_credit = 0;
     svc->flags = 0U;
     svc->pri = is_idle_domain(dom) ? CSCHED_PRI_IDLE : CSCHED_PRI_TS_UNDER;
     CSCHED_VCPU_STATS_RESET(svc);
@@ -712,25 +719,56 @@ csched_dom_cntl(
     {
         op->u.credit.weight = sdom->weight;
         op->u.credit.cap = sdom->cap;
+        op->u.credit.percent = sdom->percent;
     }
     else
     {
+        uint16_t weight = (uint16_t)~0U;
+
         ASSERT(op->cmd == XEN_DOMCTL_SCHEDOP_putinfo);
 
         spin_lock_irqsave(&csched_priv.lock, flags);
 
-        if ( op->u.credit.weight != 0 )
+        if ( (op->u.credit.weight != 0) &&
+             (sdom->percent == 0 || op->u.credit.percent == 0) )
+        {
+            weight = op->u.credit.weight;
+        }
+
+        if ( op->u.credit.cap != (uint16_t)~0U )
+            sdom->cap = op->u.credit.cap;
+
+        if ( (op->u.credit.percent != (uint16_t)~0U) &&
+             ((csched_priv.total_percent - sdom->percent +
+               op->u.credit.percent) <= 100 * csched_priv.ncpus) )
+        {
+            csched_priv.total_percent -= sdom->percent;
+            csched_priv.total_percent += op->u.credit.percent;
+
+            if ( !list_empty(&sdom->active_sdom_elem) )
+            {
+                csched_priv.percent -= sdom->percent;
+                csched_priv.percent += op->u.credit.percent;
+            }
+            sdom->percent = op->u.credit.percent;
+            if ( sdom->percent == 0 )
+            {
+                if ( sdom->weight == 0 )
+                    weight = CSCHED_DEFAULT_WEIGHT;
+            }
+            else
+                weight = 0;
+        }
+
+        if ( weight != (uint16_t)~0U )
         {
             if ( !list_empty(&sdom->active_sdom_elem) )
             {
                 csched_priv.weight -= sdom->weight;
-                csched_priv.weight += op->u.credit.weight;
+                csched_priv.weight += weight;
             }
-            sdom->weight = op->u.credit.weight;
-        }
-
-        if ( op->u.credit.cap != (uint16_t)~0U )
-            sdom->cap = op->u.credit.cap;
+            sdom->weight = weight;
+        }
 
         spin_unlock_irqrestore(&csched_priv.lock, flags);
     }
@@ -759,6 +797,7 @@ csched_dom_init(struct domain *dom)
     sdom->dom = dom;
     sdom->weight = CSCHED_DEFAULT_WEIGHT;
     sdom->cap = 0U;
+    sdom->percent = 0U;
     dom->sched_priv = sdom;
 
     return 0;
@@ -831,6 +870,7 @@ csched_acct(void)
     struct csched_dom *sdom;
     uint32_t credit_total;
     uint32_t weight_total;
+    uint32_t percent_credit;
     uint32_t weight_left;
     uint32_t credit_fair;
     uint32_t credit_peak;
@@ -857,6 +897,7 @@ csched_acct(void)
 
     weight_total = csched_priv.weight;
     credit_total = csched_priv.credit;
+    percent_credit = csched_priv.percent * CSCHED_CREDITS_PER_TSLICE / 100;
 
     /* Converge balance towards 0 when it drops negative */
     if ( csched_priv.credit_balance < 0 )
@@ -865,7 +906,7 @@ csched_acct(void)
         CSCHED_STAT_CRANK(acct_balance);
     }
 
-    if ( unlikely(weight_total == 0) )
+    if ( unlikely(weight_total == 0 && percent_credit == 0) )
     {
         csched_priv.credit_balance = 0;
         spin_unlock_irqrestore(&csched_priv.lock, flags);
@@ -880,22 +921,44 @@ csched_acct(void)
     credit_xtra = 0;
     credit_cap = 0U;
 
+    /* Firstly, subtract percent_credit from credit_total. */
+    if ( percent_credit != 0 )
+    {
+        credit_total -= percent_credit;
+        credit_balance += percent_credit;
+    }
+
+    /* Avoid 0 divide error */
+    if ( weight_total == 0 )
+        weight_total = 1;
+
     list_for_each_safe( iter_sdom, next_sdom, &csched_priv.active_sdom )
     {
         sdom = list_entry(iter_sdom, struct csched_dom, active_sdom_elem);
 
         BUG_ON( is_idle_domain(sdom->dom) );
         BUG_ON( sdom->active_vcpu_count == 0 );
-        BUG_ON( sdom->weight == 0 );
+        BUG_ON( sdom->weight == 0 && sdom->percent == 0 );
         BUG_ON( sdom->weight > weight_left );
 
-        /* Compute the average of active VCPUs. */
+        /*
+         * Compute the average of active VCPUs
+         * and adjust credit for excessive consumption.
+         */
         credit_sum = 0;
         list_for_each_safe( iter_vcpu, next_vcpu, &sdom->active_vcpu )
         {
+            int adjust;
+
             svc = list_entry(iter_vcpu, struct csched_vcpu, active_vcpu_elem);
             BUG_ON( sdom != svc->sdom );
 
+            credit = atomic_read(&svc->credit);
+            adjust = svc->prev_credit - credit - CSCHED_CREDITS_PER_TSLICE;
+            if ( adjust > 0 )
+            {
+                atomic_add(adjust, &svc->credit);
+            }
             credit_sum += atomic_read(&svc->credit);
         }
         credit_average = ( credit_sum + (sdom->active_vcpu_count - 1)
@@ -934,7 +997,9 @@ csched_acct(void)
 
         if ( credit_fair < credit_peak )
         {
-            credit_xtra = 1;
+            /* credit_fair is 0 if weight is 0. */
+            if ( sdom->weight != 0 )
+                credit_xtra = 1;
         }
         else
         {
@@ -963,9 +1028,9 @@ csched_acct(void)
         }
 
         /* Compute fair share per VCPU */
+        credit_fair += (sdom->percent * CSCHED_CREDITS_PER_ACCT)/100;
         credit_fair = ( credit_fair + ( sdom->active_vcpu_count - 1 )
                       ) / sdom->active_vcpu_count;
-
 
         list_for_each_safe( iter_vcpu, next_vcpu, &sdom->active_vcpu )
         {
@@ -1029,6 +1094,9 @@ csched_acct(void)
                 }
             }
 
+            /* save credit for adjustment */
+            svc->prev_credit = credit;
+
             CSCHED_VCPU_STAT_SET(svc, credit_last, credit);
             CSCHED_VCPU_STAT_SET(svc, credit_incr, credit_fair);
             credit_balance += credit;
@@ -1282,7 +1350,10 @@ csched_dump_vcpu(struct csched_vcpu *svc
 
     if ( sdom )
     {
-        printk(" credit=%i [w=%u]", atomic_read(&svc->credit), sdom->weight);
+        printk(" credit=%i [w=%u,p=%u]",
+               atomic_read(&svc->credit),
+               sdom->weight,
+               sdom->percent);
 #ifdef CSCHED_STATS
         printk(" (%d+%u) {a/i=%u/%u m=%u+%u}",
                 svc->stats.credit_last,
@@ -1348,6 +1419,8 @@ csched_dump(void)
            "\tcredit balance     = %d\n"
            "\tweight             = %u\n"
            "\trunq_sort          = %u\n"
+           "\tpercent            = %u\n"
+           "\ttotal_percent      = %u\n"
            "\tdefault-weight     = %d\n"
            "\tmsecs per tick     = %dms\n"
            "\tcredits per tick   = %d\n"
@@ -1359,6 +1432,8 @@ csched_dump(void)
            csched_priv.credit_balance,
            csched_priv.weight,
            csched_priv.runq_sort,
+           csched_priv.percent,
+           csched_priv.total_percent,
            CSCHED_DEFAULT_WEIGHT,
            CSCHED_MSECS_PER_TICK,
            CSCHED_CREDITS_PER_TICK,
@@ -1412,6 +1487,8 @@ csched_init(void)
     csched_priv.credit = 0U;
     csched_priv.credit_balance = 0;
     csched_priv.runq_sort = 0U;
+    csched_priv.percent = 0;
+    csched_priv.total_percent = 0;
     CSCHED_STATS_RESET();
 }
 
diff -r 56032cbaf1e8 xen/include/public/domctl.h
--- a/xen/include/public/domctl.h	Thu Jan 22 10:43:50 2009 +0900
+++ b/xen/include/public/domctl.h	Thu Jan 22 13:06:24 2009 +0900
@@ -311,6 +311,7 @@ struct xen_domctl_scheduler_op {
         struct xen_domctl_sched_credit {
             uint16_t weight;
             uint16_t cap;
+            uint16_t percent;
         } credit;
     } u;
 };
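
The putinfo branch of csched_dom_cntl() in the patch above is dense, so
here is a small Python model of how weight and percent interact. It is a
sketch under stated assumptions rather than the real code path: the
function name and the tuple-based interface are hypothetical, cap
handling and the active-domain bookkeeping are left out, and
DEFAULT_WEIGHT = 256 assumes the stock CSCHED_DEFAULT_WEIGHT.

```python
# Illustrative model of the weight/percent update in csched_dom_cntl().
UNSET = 0xFFFF        # (uint16_t)~0U, meaning "leave unchanged"
DEFAULT_WEIGHT = 256  # assumed value of CSCHED_DEFAULT_WEIGHT


def apply_putinfo(weight, percent, total_percent, ncpus,
                  req_weight, req_percent):
    """Return (weight, percent, total_percent) after one putinfo request."""
    new_weight = UNSET

    # A requested weight is only honoured while percent is (or becomes) 0.
    if req_weight != 0 and (percent == 0 or req_percent == 0):
        new_weight = req_weight

    # Accept the new percent only if the sum over all domains still
    # fits into 100% per physical cpu.
    if (req_percent != UNSET and
            total_percent - percent + req_percent <= 100 * ncpus):
        total_percent = total_percent - percent + req_percent
        percent = req_percent
        if percent == 0:
            # Falling back to weight-based scheduling needs a weight.
            if weight == 0:
                new_weight = DEFAULT_WEIGHT
        else:
            # A domain with a percent reservation carries no weight.
            new_weight = 0

    if new_weight != UNSET:
        weight = new_weight
    return weight, percent, total_percent
```

For example, on one cpu with 90 percent already reserved, a request for
another 20 percent is rejected and the domain keeps its old settings.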

[-- Attachment #3: boost_1_tools.patch --]
[-- Type: text/x-patch, Size: 12414 bytes --]

diff -r 56032cbaf1e8 tools/python/xen/lowlevel/xc/xc.c
--- a/tools/python/xen/lowlevel/xc/xc.c	Thu Jan 22 10:43:50 2009 +0900
+++ b/tools/python/xen/lowlevel/xc/xc.c	Thu Jan 22 12:31:18 2009 +0900
@@ -1285,18 +1285,21 @@ static PyObject *pyxc_sched_credit_domai
     uint32_t domid;
     uint16_t weight;
     uint16_t cap;
-    static char *kwd_list[] = { "domid", "weight", "cap", NULL };
-    static char kwd_type[] = "I|HH";
+    uint16_t percent;
+    static char *kwd_list[] = { "domid", "weight", "cap", "percent", NULL };
+    static char kwd_type[] = "I|HHh";
     struct xen_domctl_sched_credit sdom;
     
     weight = 0;
     cap = (uint16_t)~0U;
+    percent = (uint16_t)~0U;
     if( !PyArg_ParseTupleAndKeywords(args, kwds, kwd_type, kwd_list, 
-                                     &domid, &weight, &cap) )
+                                     &domid, &weight, &cap, &percent) )
         return NULL;
 
     sdom.weight = weight;
     sdom.cap = cap;
+    sdom.percent = percent;
 
     if ( xc_sched_credit_domain_set(self->xc_handle, domid, &sdom) != 0 )
         return pyxc_error_to_exception();
@@ -1316,9 +1319,10 @@ static PyObject *pyxc_sched_credit_domai
     if ( xc_sched_credit_domain_get(self->xc_handle, domid, &sdom) != 0 )
         return pyxc_error_to_exception();
 
-    return Py_BuildValue("{s:H,s:H}",
+    return Py_BuildValue("{s:H,s:H,s:i}",
                          "weight",  sdom.weight,
-                         "cap",     sdom.cap);
+                         "cap",     sdom.cap,
+                         "percent", sdom.percent);
 }
 
 static PyObject *pyxc_domain_setmaxmem(XcObject *self, PyObject *args)
@@ -1744,6 +1748,8 @@ static PyMethodDef pyxc_methods[] = {
       "SMP credit scheduler.\n"
       " domid     [int]:   domain id to set\n"
       " weight    [short]: domain's scheduling weight\n"
+      " cap       [short]: cap\n"
+      " percent   [short]; domain's scheduling percentage per a cpu\n"
       "Returns: [int] 0 on success; -1 on error.\n" },
 
     { "sched_credit_domain_get",
@@ -1753,7 +1759,9 @@ static PyMethodDef pyxc_methods[] = {
       "SMP credit scheduler.\n"
       " domid     [int]:   domain id to get\n"
       "Returns:   [dict]\n"
-      " weight    [short]: domain's scheduling weight\n"},
+      " weight    [short]: domain's scheduling weight\n"
+      " cap       [short]: cap\n"
+      " percent   [short]: domain's scheduling percentage per a cpu\n"},
 
     { "evtchn_alloc_unbound", 
       (PyCFunction)pyxc_evtchn_alloc_unbound,
diff -r 56032cbaf1e8 tools/python/xen/xend/XendAPI.py
--- a/tools/python/xen/xend/XendAPI.py	Thu Jan 22 10:43:50 2009 +0900
+++ b/tools/python/xen/xend/XendAPI.py	Thu Jan 22 12:31:18 2009 +0900
@@ -1505,10 +1505,12 @@ class XendAPI(object):
 
         #need to update sched params aswell
         if 'weight' in xeninfo.info['vcpus_params'] \
-           and 'cap' in xeninfo.info['vcpus_params']:
+           and 'cap' in xeninfo.info['vcpus_params'] \
+           and 'percent' in xeninfo.info['vcpus_params']:
             weight = xeninfo.info['vcpus_params']['weight']
             cap = xeninfo.info['vcpus_params']['cap']
-            xendom.domain_sched_credit_set(xeninfo.getDomid(), weight, cap)
+            percent = xeninfo.info['vcpus_params']['percent']
+            xendom.domain_sched_credit_set(xeninfo.getDomid(), weight, cap, percent)
 
     def VM_set_VCPUs_number_live(self, _, vm_ref, num):
         dom = XendDomain.instance().get_vm_by_uuid(vm_ref)
diff -r 56032cbaf1e8 tools/python/xen/xend/XendConfig.py
--- a/tools/python/xen/xend/XendConfig.py	Thu Jan 22 10:43:50 2009 +0900
+++ b/tools/python/xen/xend/XendConfig.py	Thu Jan 22 12:31:18 2009 +0900
@@ -591,6 +591,8 @@ class XendConfig(dict):
             int(sxp.child_value(sxp_cfg, "cpu_weight", 256))
         cfg["vcpus_params"]["cap"] = \
             int(sxp.child_value(sxp_cfg, "cpu_cap", 0))
+        cfg["vcpus_params"]["percent"] = \
+            int(sxp.child_value(sxp_cfg, "cpu_percent", 0))
 
         # Only extract options we know about.
         extract_keys = LEGACY_UNSUPPORTED_BY_XENAPI_CFG + \
diff -r 56032cbaf1e8 tools/python/xen/xend/XendDomain.py
--- a/tools/python/xen/xend/XendDomain.py	Thu Jan 22 10:43:50 2009 +0900
+++ b/tools/python/xen/xend/XendDomain.py	Thu Jan 22 12:31:18 2009 +0900
@@ -1536,7 +1536,7 @@ class XendDomain:
 
         @param domid: Domain ID or Name
         @type domid: int or string.
-        @rtype: dict with keys 'weight' and 'cap'
+        @rtype: dict with keys 'weight' and 'cap' and 'percent'
         @return: credit scheduler parameters
         """
         dominfo = self.domain_lookup_nr(domid)
@@ -1550,19 +1550,22 @@ class XendDomain:
                 raise XendError(str(ex))
         else:
             return {'weight' : dominfo.getWeight(),
-                    'cap'    : dominfo.getCap()} 
+                    'cap'    : dominfo.getCap(),
+                    'percent': dominfo.getPercent()} 
     
-    def domain_sched_credit_set(self, domid, weight = None, cap = None):
+    def domain_sched_credit_set(self, domid, weight = None, cap = None, percent = None):
         """Set credit scheduler parameters for a domain.
 
         @param domid: Domain ID or Name
         @type domid: int or string.
         @type weight: int
         @type cap: int
+        @type percent: int
         @rtype: 0
         """
         set_weight = False
         set_cap = False
+        set_percent = False
         dominfo = self.domain_lookup_nr(domid)
         if not dominfo:
             raise XendInvalidDomain(str(domid))
@@ -1581,17 +1584,27 @@ class XendDomain:
             else:
                 set_cap = True
 
+            if percent is None:
+                percent = int(~0)
+            elif percent < 0:
+                raise XendError("percent is out of range")
+            else:
+                set_percent = True
+
             assert type(weight) == int
             assert type(cap) == int
+            assert type(percent) == int
 
             rc = 0
             if dominfo._stateGet() in (DOM_STATE_RUNNING, DOM_STATE_PAUSED):
-                rc = xc.sched_credit_domain_set(dominfo.getDomid(), weight, cap)
+                rc = xc.sched_credit_domain_set(dominfo.getDomid(), weight, cap, percent)
             if rc == 0:
                 if set_weight:
                     dominfo.setWeight(weight)
                 if set_cap:
                     dominfo.setCap(cap)
+                if set_percent:
+                    dominfo.setPercent(percent)
                 self.managed_config_save(dominfo)
             return rc
         except Exception, ex:
diff -r 56032cbaf1e8 tools/python/xen/xend/XendDomainInfo.py
--- a/tools/python/xen/xend/XendDomainInfo.py	Thu Jan 22 10:43:50 2009 +0900
+++ b/tools/python/xen/xend/XendDomainInfo.py	Thu Jan 22 12:31:18 2009 +0900
@@ -467,7 +467,8 @@ class XendDomainInfo:
                 if xennode.xenschedinfo() == 'credit':
                     xendomains.domain_sched_credit_set(self.getDomid(),
                                                        self.getWeight(),
-                                                       self.getCap())
+                                                       self.getCap(),
+                                                       self.getPercent())
             except:
                 log.exception('VM start failed')
                 self.destroy()
@@ -1705,6 +1706,12 @@ class XendDomainInfo:
     def setWeight(self, cpu_weight):
         self.info['vcpus_params']['weight'] = cpu_weight
 
+    def getPercent(self):
+        return self.info['vcpus_params']['percent']
+
+    def setPercent(self, cpu_percent):
+        self.info['vcpus_params']['percent'] = cpu_percent
+
     def getRestartCount(self):
         return self._readVm('xend/restart_count')
 
diff -r 56032cbaf1e8 tools/python/xen/xm/main.py
--- a/tools/python/xen/xm/main.py	Thu Jan 22 10:43:50 2009 +0900
+++ b/tools/python/xen/xm/main.py	Thu Jan 22 12:31:18 2009 +0900
@@ -150,7 +150,7 @@ SUBCOMMAND_HELP = {
     'log'         : ('', 'Print Xend log'),
     'rename'      : ('<Domain> <NewDomainName>', 'Rename a domain.'),
     'sched-sedf'  : ('<Domain> [options]', 'Get/set EDF parameters.'),
-    'sched-credit': ('[-d <Domain> [-w[=WEIGHT]|-c[=CAP]]]',
+    'sched-credit': ('[-d <Domain> [-w[=WEIGHT]|-c[=CAP]|-p[=PERCENT]]]',
                      'Get/set credit scheduler parameters.'),
     'sysrq'       : ('<Domain> <letter>', 'Send a sysrq to a domain.'),
     'debug-keys'  : ('<Keys>', 'Send debug keys to Xen.'),
@@ -240,6 +240,7 @@ SUBCOMMAND_OPTIONS = {
        ('-d DOMAIN', '--domain=DOMAIN', 'Domain to modify'),
        ('-w WEIGHT', '--weight=WEIGHT', 'Weight (int)'),
        ('-c CAP',    '--cap=CAP',       'Cap (int)'),
+       ('-p PERCENT', '--percent=PERCENT', 'Percent per a cpu (int)'),
     ),
     'list': (
        ('-l', '--long',         'Output all VM details in SXP'),
@@ -1578,8 +1579,8 @@ def xm_sched_credit(args):
     check_sched_type('credit')
 
     try:
-        opts, params = getopt.getopt(args, "d:w:c:",
-            ["domain=", "weight=", "cap="])
+        opts, params = getopt.getopt(args, "d:w:c:p:",
+            ["domain=", "weight=", "cap=", "percent="])
     except getopt.GetoptError, opterr:
         err(opterr)
         usage('sched-credit')
@@ -1587,6 +1588,7 @@ def xm_sched_credit(args):
     domid = None
     weight = None
     cap = None
+    percent = None
 
     for o, a in opts:
         if o in ["-d", "--domain"]:
@@ -1594,18 +1596,20 @@ def xm_sched_credit(args):
         elif o in ["-w", "--weight"]:
             weight = int(a)
         elif o in ["-c", "--cap"]:
-            cap = int(a);
+            cap = int(a)
+        elif o in ["-p", "--percent"]:
+            percent = int(a);
 
     doms = filter(lambda x : domid_match(domid, x),
                   [parse_doms_info(dom)
                   for dom in getDomains(None, 'all')])
 
-    if weight is None and cap is None:
+    if weight is None and cap is None and percent is None:
         if domid is not None and doms == []: 
             err("Domain '%s' does not exist." % domid)
             usage('sched-credit')
         # print header if we aren't setting any parameters
-        print '%-33s %4s %6s %4s' % ('Name','ID','Weight','Cap')
+        print '%-33s %4s %6s %4s %7s' % ('Name','ID','Weight','Cap','Percent')
         
         for d in doms:
             try:
@@ -1618,16 +1622,17 @@ def xm_sched_credit(args):
             except xmlrpclib.Fault:
                 pass
 
-            if 'weight' not in info or 'cap' not in info:
+            if 'weight' not in info or 'cap' not in info or 'percent' not in info:
                 # domain does not support sched-credit?
-                info = {'weight': -1, 'cap': -1}
+                info = {'weight': -1, 'cap': -1, 'percent':-1}
 
             info['weight'] = int(info['weight'])
             info['cap']    = int(info['cap'])
+            info['percent'] = int(info['percent'])
             
             info['name']  = d['name']
             info['domid'] = str(d['domid'])
-            print( ("%(name)-32s %(domid)5s %(weight)6d %(cap)4d") % info)
+            print( ("%(name)-32s %(domid)5s %(weight)6d %(cap)4d %(percent)6d") % info)
     else:
         if domid is None:
             # place holder for system-wide scheduler parameters
@@ -1644,6 +1649,10 @@ def xm_sched_credit(args):
                     get_single_vm(domid),
                     "cap",
                     cap)
+                server.xenapi.VM.add_to_VCPUs_params_live(
+                    get_single_vm(domid),
+                    "percent",
+                    percent)
             else:
                 server.xenapi.VM.add_to_VCPUs_params(
                     get_single_vm(domid),
@@ -1653,8 +1662,12 @@ def xm_sched_credit(args):
                     get_single_vm(domid),
                     "cap",
                     cap)
+                server.xenapi.VM.add_to_VCPUs_params(
+                    get_single_vm(domid),
+                    "percent",
+                    percent)
         else:
-            result = server.xend.domain.sched_credit_set(domid, weight, cap)
+            result = server.xend.domain.sched_credit_set(domid, weight, cap, percent)
             if result != 0:
                 err(str(result))
 

[-- Attachment #4: boost_2.patch --]
[-- Type: text/x-patch, Size: 1727 bytes --]

diff -r 116e2691c071 xen/common/sched_credit.c
--- a/xen/common/sched_credit.c	Thu Jan 22 13:16:49 2009 +0900
+++ b/xen/common/sched_credit.c	Thu Jan 22 13:21:28 2009 +0900
@@ -47,6 +47,7 @@
     (CSCHED_CREDITS_PER_TICK * CSCHED_TICKS_PER_TSLICE)
 #define CSCHED_CREDITS_PER_ACCT     \
     (CSCHED_CREDITS_PER_TICK * CSCHED_TICKS_PER_ACCT)
+#define CSCHED_MSECS_PER_BOOST_TSLICE 2
 
 
 /*
@@ -245,6 +246,7 @@ struct csched_private {
     uint16_t total_percent;
     int credit_balance;
     uint32_t runq_sort;
+    s_time_t boost_tslice;
     CSCHED_STATS_DEFINE()
 };
 
@@ -253,6 +255,10 @@ struct csched_private {
  * Global variables
  */
 static struct csched_private csched_priv;
+
+/* opt_credit_tslice: time slice for BOOST priority */
+static unsigned int opt_credit_tslice = CSCHED_MSECS_PER_BOOST_TSLICE;
+integer_param("credit_tslice", opt_credit_tslice);
 
 static void csched_tick(void *_cpu);
 
@@ -1327,7 +1333,17 @@ csched_schedule(s_time_t now)
     /*
      * Return task to run next...
      */
-    ret.time = MILLISECS(CSCHED_MSECS_PER_TSLICE);
+    if ( snext->pri == CSCHED_PRI_TS_BOOST )
+    {
+        struct csched_vcpu * const svc = __runq_elem(runq->next);
+
+        if ( svc->pri == CSCHED_PRI_TS_BOOST )
+            ret.time = csched_priv.boost_tslice;
+        else
+            ret.time = MILLISECS(CSCHED_MSECS_PER_TICK);
+    }
+    else
+        ret.time = MILLISECS(CSCHED_MSECS_PER_TSLICE);
     ret.task = snext->vcpu;
 
     spc->start_time = now;
@@ -1489,6 +1505,7 @@ csched_init(void)
     csched_priv.runq_sort = 0U;
     csched_priv.percent = 0;
     csched_priv.total_percent = 0;
+    csched_priv.boost_tslice = MILLISECS(opt_credit_tslice);
     CSCHED_STATS_RESET();
 }
 

[-- Attachment #5: boost_3.patch --]
[-- Type: text/x-patch, Size: 3349 bytes --]

diff -r 64618a20b9de xen/common/sched_credit.c
--- a/xen/common/sched_credit.c	Thu Jan 22 13:22:54 2009 +0900
+++ b/xen/common/sched_credit.c	Thu Jan 22 13:42:04 2009 +0900
@@ -202,6 +202,7 @@ struct csched_vcpu {
     struct csched_dom *sdom;
     struct vcpu *vcpu;
     atomic_t credit;
+    atomic_t boost_credit;
     int prev_credit;
     uint16_t flags;
     int16_t pri;
@@ -549,14 +550,6 @@ csched_vcpu_acct(unsigned int cpu)
     ASSERT( svc->sdom != NULL );
 
     /*
-     * If this VCPU's priority was boosted when it last awoke, reset it.
-     * If the VCPU is found here, then it's consuming a non-negligeable
-     * amount of CPU resources and should no longer be boosted.
-     */
-    if ( svc->pri == CSCHED_PRI_TS_BOOST )
-        svc->pri = CSCHED_PRI_TS_UNDER;
-
-    /*
      * If it's been active a while, check if we'd be better off
      * migrating it to run elsewhere (see multi-core and multi-thread
      * support in csched_cpu_pick()).
@@ -591,6 +584,7 @@ csched_vcpu_init(struct vcpu *vc)
     svc->sdom = sdom;
     svc->vcpu = vc;
     atomic_set(&svc->credit, 0);
+    atomic_set(&svc->boost_credit, 0);
     svc->prev_credit = 0;
     svc->flags = 0U;
     svc->pri = is_idle_domain(dom) ? CSCHED_PRI_IDLE : CSCHED_PRI_TS_UNDER;
@@ -706,6 +700,8 @@ csched_vcpu_wake(struct vcpu *vc)
          !(svc->flags & CSCHED_FLAG_VCPU_PARKED) )
     {
         svc->pri = CSCHED_PRI_TS_BOOST;
+        atomic_add(CSCHED_CREDITS_PER_TICK, &svc->boost_credit);
+        atomic_sub(CSCHED_CREDITS_PER_TICK, &svc->credit);
     }
 
     /* Put the VCPU on the runq and tickle CPUs */
@@ -954,11 +950,14 @@ csched_acct(void)
         credit_sum = 0;
         list_for_each_safe( iter_vcpu, next_vcpu, &sdom->active_vcpu )
         {
-            int adjust;
+            int adjust, boost_credit;
 
             svc = list_entry(iter_vcpu, struct csched_vcpu, active_vcpu_elem);
             BUG_ON( sdom != svc->sdom );
 
+            boost_credit = atomic_read(&svc->boost_credit);
+            atomic_set(&svc->boost_credit, 0);
+            atomic_add(boost_credit, &svc->credit);
             credit = atomic_read(&svc->credit);
             adjust = svc->prev_credit - credit - CSCHED_CREDITS_PER_TSLICE;
             if ( adjust > 0 )
@@ -1290,6 +1289,22 @@ csched_schedule(s_time_t now)
                ) /
                ( MILLISECS(CSCHED_MSECS_PER_TSLICE) /
                  CSCHED_CREDITS_PER_TSLICE );
+    if ( scurr->pri == CSCHED_PRI_TS_BOOST )
+    {
+        int boost_credit = atomic_read(&scurr->boost_credit);
+
+        if ( boost_credit > consumed )
+        {
+            atomic_sub(consumed, &scurr->boost_credit);
+            consumed = 0;
+        }
+        else
+        {
+            atomic_sub(boost_credit, &scurr->boost_credit);
+            consumed -= boost_credit;
+            scurr->pri = CSCHED_PRI_TS_UNDER;
+        }
+    }
     if ( consumed > 0 && !is_idle_vcpu(current) )
         atomic_sub(consumed, &scurr->credit);
 
@@ -1366,8 +1381,9 @@ csched_dump_vcpu(struct csched_vcpu *svc
 
     if ( sdom )
     {
-        printk(" credit=%i [w=%u,p=%u]",
+        printk(" credit=%i bc=%i [w=%u,p=%u]",
                atomic_read(&svc->credit),
+               atomic_read(&svc->boost_credit),
                sdom->weight,
                sdom->percent);
 #ifdef CSCHED_STATS
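
The hunks above move credit into boost_credit on wake-up, charge consumed
time against boost_credit first, and fold any leftover boost credit back
into credit at accounting time. A minimal standalone sketch of that
bookkeeping follows, in case it helps when reading the diff. It is not part
of the patch: struct vcpu_model, model_wake(), model_charge(),
model_acct_fold() and the enum values are hypothetical names chosen for
illustration, plain ints stand in for atomic_t, and the idle-vcpu check and
the priority recalculation done later in csched_acct() are left out.

```c
#include <assert.h>

/* Hypothetical stand-ins for CSCHED_PRI_TS_BOOST / CSCHED_PRI_TS_UNDER. */
enum model_pri { MODEL_PRI_BOOST, MODEL_PRI_UNDER };

/* Hypothetical, non-atomic model of the fields the patch touches. */
struct vcpu_model {
    int credit;
    int boost_credit;
    enum model_pri pri;
};

/* Wake-up path: boost the vcpu and move one tick of credit into boost_credit. */
void model_wake(struct vcpu_model *v, int credits_per_tick)
{
    v->pri = MODEL_PRI_BOOST;
    v->boost_credit += credits_per_tick;
    v->credit -= credits_per_tick;
}

/*
 * Schedule path: charge consumed credits. A boosted vcpu pays from
 * boost_credit first; once boost_credit is used up, the vcpu drops to
 * UNDER and the remainder is charged to the ordinary credit.
 */
void model_charge(struct vcpu_model *v, int consumed)
{
    assert(consumed >= 0);

    if (v->pri == MODEL_PRI_BOOST) {
        if (v->boost_credit > consumed) {
            v->boost_credit -= consumed;
            consumed = 0;
        } else {
            consumed -= v->boost_credit;
            v->boost_credit = 0;
            v->pri = MODEL_PRI_UNDER;
        }
    }

    if (consumed > 0)
        v->credit -= consumed;
}

/* Accounting path: return unused boost credit to the ordinary credit. */
void model_acct_fold(struct vcpu_model *v)
{
    v->credit += v->boost_credit;
    v->boost_credit = 0;
}
```

With these assumptions, a vcpu that wakes with 300 credits and a tick of 100
ends up with credit 200 and boost credit 100. Charging 40 leaves the boost
credit at 60 and the credit untouched. Charging a further 80 exhausts the
boost credit, drops the priority to UNDER, and takes the remaining 20 from
the ordinary credit.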



end of thread, other threads:[~2009-01-22  6:15 UTC | newest]

Thread overview: 15+ messages
2008-12-03  8:54 [RFC][PATCH] scheduler: credit scheduler for client virtualization NISHIGUCHI Naoki
2008-12-03  9:16 ` Keir Fraser
2008-12-03 12:46   ` George Dunlap
2008-12-04  7:51     ` NISHIGUCHI Naoki
2008-12-04 12:21       ` George Dunlap
2008-12-04 12:37         ` George Dunlap
2008-12-05  3:17           ` NISHIGUCHI Naoki
2008-12-18  2:49             ` NISHIGUCHI Naoki
2008-12-18 10:21               ` George Dunlap
2008-12-05  2:47         ` NISHIGUCHI Naoki
2008-12-05 11:37           ` George Dunlap
2008-12-08  8:37             ` NISHIGUCHI Naoki
2008-12-04  7:45   ` NISHIGUCHI Naoki
     [not found] ` <de76405a0901191232k19d910d5o77160fa5ee7bf06c@mail.gmail.com>
     [not found]   ` <de76405a0901191257p3b45304fi538d040b5634de23@mail.gmail.com>
     [not found]     ` <49768FDB.60609@jp.fujitsu.com>
2009-01-21 10:35       ` George Dunlap
2009-01-22  6:15         ` NISHIGUCHI Naoki
