* [PATCH v2 0/4] xen/tools: Credit2: implement caps.
@ 2017-08-18 15:50 Dario Faggioli
2017-08-18 15:50 ` [PATCH v2 1/4] xen: credit2: implement utilization cap Dario Faggioli
` (3 more replies)
0 siblings, 4 replies; 10+ messages in thread
From: Dario Faggioli @ 2017-08-18 15:50 UTC (permalink / raw)
To: xen-devel
Cc: Wei Liu, George Dunlap, Andrew Cooper, Ian Jackson, Anshul Makkar,
Jan Beulich
This is v2 of the 'caps for Credit2' series.
Posting of v1 is here:
https://lists.xen.org/archives/html/xen-devel/2017-06/msg00700.html
No changes with respect to that, apart from addressing the review comments. The
patch that required the most rework is patch 1, as I changed how a corner case
(budget overrun, due to potential timer or accounting issues) is dealt with, in
line with what George suggested and thought was best.
Note, however, that this series is *NOT* based on top of staging. In fact, it
is based on top of staging + "Soft affinity for Credit2, v2":
https://lists.xen.org/archives/html/xen-devel/2017-07/msg02802.html
The reason I did things this way is that the two series clash, and since the
soft affinity one is pretty much all acked and ready to go in (with the only
exception of patch 2, which George still needs to look at), I just assumed that
one will go in first, and based this series on top of it.
In fact, as I'm leaving for 2 weeks, having done things this way allows both
series to be committed, even with me away, in case both collect all the needed
acks, of course (hey, one can dream, can't one? :-D :-D).
As usual, I also prepared a git branch:
git://xenbits.xen.org/people/dariof/xen.git rel/sched/credit2-caps-v2
https://travis-ci.org/fdario/xen/builds/266018957
Thanks and Regards,
Dario
---
Dario Faggioli (4):
xen: credit2: implement utilization cap
xen: credit2: allow to set and get utilization cap
xen: credit2: improve distribution of budget (for domains with caps)
libxl/xl: allow to get and set cap on Credit2.
tools/libxl/libxl_sched.c | 21 +
tools/xentrace/formats | 2
tools/xentrace/xenalyze.c | 10 -
tools/xl/xl_cmdtable.c | 1
tools/xl/xl_sched.c | 25 +-
xen/common/sched_credit2.c | 676 ++++++++++++++++++++++++++++++++++++++++---
xen/include/public/domctl.h | 1
xen/include/xen/sched.h | 3
8 files changed, 682 insertions(+), 57 deletions(-)
--
<<This happens because I choose it to happen!>> (Raistlin Majere)
-----------------------------------------------------------------
Dario Faggioli, Ph.D, http://about.me/dario.faggioli
Senior Software Engineer, Citrix Systems R&D Ltd., Cambridge (UK)
_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
https://lists.xen.org/xen-devel
* [PATCH v2 1/4] xen: credit2: implement utilization cap
2017-08-18 15:50 [PATCH v2 0/4] xen/tools: Credit2: implement caps Dario Faggioli
@ 2017-08-18 15:50 ` Dario Faggioli
2017-08-24 19:42 ` Anshul Makkar
2017-09-14 16:20 ` George Dunlap
2017-08-18 15:51 ` [PATCH v2 2/4] xen: credit2: allow to set and get " Dario Faggioli
` (2 subsequent siblings)
3 siblings, 2 replies; 10+ messages in thread
From: Dario Faggioli @ 2017-08-18 15:50 UTC (permalink / raw)
To: xen-devel
Cc: Wei Liu, George Dunlap, Andrew Cooper, Ian Jackson, Anshul Makkar,
Jan Beulich
This commit implements the Xen part of the cap mechanism for
Credit2.
A cap is how much, in terms of % of physical CPU time, a domain
can execute at most.
For instance, a domain that must not use more than 1/4 of
one physical CPU must have a cap of 25%; one that must not
use more than 1+1/2 of physical CPU time must be given a cap
of 150%.
Caps are per domain, so it is all of a domain's vCPUs, cumulatively,
that will be forced to execute no more than the set amount.
This is implemented by giving each domain a 'budget', and
using a (per-domain again) periodic timer. Values of budget
and 'period' are chosen so that budget/period is equal to the
cap itself.
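As a sketch of that arithmetic, a hypothetical helper (not part of the patch)
mapping a cap and a period to the corresponding budget could look like:

```c
#include <stdint.h>

/* Illustrative only: given a cap, expressed in percent, and a
 * replenishment period (in microseconds), compute the per-period
 * budget, so that budget / period == cap. */
static int64_t cap_to_budget(unsigned int cap_pct, int64_t period_us)
{
    return ((int64_t)cap_pct * period_us) / 100;
}
```

E.g., with the default 10ms period, a 25% cap would yield 2.5ms of budget per
period, and a 150% cap would yield 15ms.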
Budget is burned by the domain's vCPUs, in a similar way to
how credits are.
When a domain runs out of budget, its vCPUs can't run any
longer. They can run again when the budget is replenished by
the timer, which happens once every period.
Blocking the vCPUs because of lack of budget happens by
means of a new (_VPF_parked) pause flag, so that, e.g.,
vcpu_runnable() still works. This is similar to what is
done in sched_rtds.c, as opposed to what happens in
sched_credit.c, where vcpu_pause() and vcpu_unpause() are
used (which means, among other things, more overhead).
Note that, while adding new fields to csched2_vcpu and
csched2_dom, currently existing members are being moved
around, to achieve best placement inside cache lines.
Note also that xenalyze and tools/xentrace/formats are being
updated too.
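The domain-wide budget pool mechanics can be modeled, very roughly, like this
(illustrative names and simplified types, not the patch's actual API; the real
code additionally takes the budget_lock and parks the vCPU when the pool is
empty):

```c
#include <stdbool.h>
#include <stdint.h>

/* Simplified model: a vCPU grabs whatever budget its domain has left;
 * when it stops running, it returns the unused part to the pool. */
struct dom_budget  { int64_t budget; };
struct vcpu_budget { int64_t budget; };

static bool grab_budget(struct dom_budget *d, struct vcpu_budget *v)
{
    if ( v->budget > 0 )
        return true;

    d->budget += v->budget;   /* account for any overrun (v->budget <= 0) */

    if ( d->budget > 0 )
    {
        v->budget = d->budget;
        d->budget = 0;
    }
    else
        v->budget = 0;        /* no budget: the vCPU would be parked */

    return v->budget > 0;
}

static void return_budget(struct dom_budget *d, struct vcpu_budget *v)
{
    d->budget += v->budget;   /* put back what was not consumed */
    v->budget = 0;
}
```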
Signed-off-by: Dario Faggioli <dario.faggioli@citrix.com>
---
Cc: George Dunlap <george.dunlap@eu.citrix.com>
Cc: Ian Jackson <ian.jackson@eu.citrix.com>
Cc: Wei Liu <wei.liu2@citrix.com>
Cc: Andrew Cooper <andrew.cooper3@citrix.com>
Cc: Jan Beulich <jbeulich@suse.com>
Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Cc: Anshul Makkar <anshulmakkar@gmail.com>
---
Changed from v1:
* used has_cap() instead of open coding it in burn_credits();
* removed some of the unlikely() around has_cap(), as, although cap is not on
by default, it's up to the user to decide how many domains will have caps,
and we can't assume much about what users will actually do;
* tried to clarify the comment about (the non-deterministic nature of the)
CPU capacity distribution between the vCPUs of a multi-vCPU guest;
* clarified the comment about budget being replenished to nothing more than
top capacity, i.e., about the fact that budget is *not* being accumulated
across different periods;
* fixed many style and typo issues in comments;
* added a comment about the budget distribution logic (to the vCPUs) being
subject to be refined in subsequent commits;
* renaming:
vcpu_try_to_get_budget() --> vcpu_grab_budget()
vcpu_give_back_budget() --> vcpu_return_budget()
repl_sdom_budget() --> replenish_domain_budget()
* changed how the replenishment logic deals with cases of overrun. In v1, we
were always doing multiple replenishments at once, until the domain's budget
was back into the black. Now, in cases of substantial overrun, we just do one
replenishment, and rely on future ones to bring the budget back to
a positive value. This was agreed upon with George during v1's
review.
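That v2 overrun handling can be sketched as follows (an illustrative,
simplified model; the real replenish_domain_budget() additionally unparks
waiting vCPUs and re-arms the timer):

```c
#include <stdint.h>

/* Illustrative model: do one replenishment; if we are late enough to
 * have skipped whole periods, catch up, moving next_repl forward each
 * time so the cap keeps being respected; finally, clamp the budget at
 * tot_budget, so no budget is accumulated across periods. If the domain
 * has overrun by more than tot_budget, the budget just stays negative,
 * and future replenishments will bring it back into the black. */
static void replenish(int64_t *budget, int64_t tot_budget,
                      int64_t *next_repl, int64_t period, int64_t now)
{
    *budget += tot_budget;
    *next_repl += period;

    while ( *next_repl <= now )      /* skipped full period(s): catch up */
    {
        *budget += tot_budget;
        *next_repl += period;
    }

    if ( *budget > tot_budget )      /* never accumulate leftover budget */
        *budget = tot_budget;
}
```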
---
tools/xentrace/formats | 2
tools/xentrace/xenalyze.c | 10 +
xen/common/sched_credit2.c | 521 +++++++++++++++++++++++++++++++++++++++++---
xen/include/xen/sched.h | 3
4 files changed, 492 insertions(+), 44 deletions(-)
diff --git a/tools/xentrace/formats b/tools/xentrace/formats
index f39182a..d6e7e3f 100644
--- a/tools/xentrace/formats
+++ b/tools/xentrace/formats
@@ -51,7 +51,7 @@
0x00022201 CPU%(cpu)d %(tsc)d (+%(reltsc)8d) csched2:tick
0x00022202 CPU%(cpu)d %(tsc)d (+%(reltsc)8d) csched2:runq_pos [ dom:vcpu = 0x%(1)08x, pos = %(2)d]
-0x00022203 CPU%(cpu)d %(tsc)d (+%(reltsc)8d) csched2:credit burn [ dom:vcpu = 0x%(1)08x, credit = %(2)d, delta = %(3)d ]
+0x00022203 CPU%(cpu)d %(tsc)d (+%(reltsc)8d) csched2:credit burn [ dom:vcpu = 0x%(1)08x, credit = %(2)d, budget = %(3)d, delta = %(4)d ]
0x00022204 CPU%(cpu)d %(tsc)d (+%(reltsc)8d) csched2:credit_add
0x00022205 CPU%(cpu)d %(tsc)d (+%(reltsc)8d) csched2:tickle_check [ dom:vcpu = 0x%(1)08x, credit = %(2)d, score = %(3)d ]
0x00022206 CPU%(cpu)d %(tsc)d (+%(reltsc)8d) csched2:tickle [ cpu = %(1)d ]
diff --git a/tools/xentrace/xenalyze.c b/tools/xentrace/xenalyze.c
index 39fc35f..79bdba7 100644
--- a/tools/xentrace/xenalyze.c
+++ b/tools/xentrace/xenalyze.c
@@ -7680,12 +7680,14 @@ void sched_process(struct pcpu_info *p)
if(opt.dump_all) {
struct {
unsigned int vcpuid:16, domid:16;
- int credit, delta;
+ int credit, budget, delta;
} *r = (typeof(r))ri->d;
- printf(" %s csched2:burn_credits d%uv%u, credit = %d, delta = %d\n",
- ri->dump_header, r->domid, r->vcpuid,
- r->credit, r->delta);
+ printf(" %s csched2:burn_credits d%uv%u, credit = %d, ",
+ ri->dump_header, r->domid, r->vcpuid, r->credit);
+ if ( r->budget != INT_MIN )
+ printf("budget = %d, ", r->budget);
+ printf("delta = %d\n", r->delta);
}
break;
case TRC_SCHED_CLASS_EVT(CSCHED2, 5): /* TICKLE_CHECK */
diff --git a/xen/common/sched_credit2.c b/xen/common/sched_credit2.c
index fab7f2e..69a7679 100644
--- a/xen/common/sched_credit2.c
+++ b/xen/common/sched_credit2.c
@@ -92,6 +92,86 @@
*/
/*
+ * Utilization cap:
+ *
+ * Setting a pCPU utilization cap for a domain means the following:
+ *
+ * - a domain can have a cap, expressed in terms of % of physical CPU time.
+ * A domain that must not use more than 1/4 of _one_ physical CPU, will
+ * be given a cap of 25%; a domain that must not use more than 1+1/2 of
+ * physical CPU time, will be given a cap of 150%;
+ *
+ * - caps are per-domain (not per-vCPU). If a domain has only 1 vCPU, and
+ * a 40% cap, that one vCPU will use 40% of one pCPU. If a domain has 4
+ * vCPUs, and a 200% cap, the equivalent of 100% time on 2 pCPUs will be
+ * split among the 4 vCPUs. How much each of the vCPUs will actually get,
+ * during any given interval of time, is unspecified (as it depends on
+ * various aspects: workload, system load, etc.). For instance, it is
+ * possible that, during a given time interval, 2 vCPUs use 100% each,
+ * and the other two use nothing; while during another time interval,
+ * two vCPUs use 80%, one uses 10% and the other 30%; or that each use
+ * 50% (and so on and so forth).
+ *
+ * For implementing this, we use the following approach:
+ *
+ * - each domain is given a 'budget', and each domain has a timer, which
+ * replenishes the domain's budget periodically. The budget is the amount
+ * of time the vCPUs of the domain can use every 'period';
+ *
+ * - the period is CSCHED2_BDGT_REPL_PERIOD, and is the same for all domains
+ * (but each domain has its own timer; so they are all periodic with the same
+ * period, but the replenishments of the budgets of the various domains, at
+ * period boundaries, are not synchronous);
+ *
+ * - when vCPUs run, they consume budget. When they don't run, they don't
+ * consume budget. If there is no budget left for the domain, no vCPU of
+ * that domain can run. If a vCPU tries to run and finds that there is no
+ * budget, it blocks.
+ * At whatever time a vCPU wants to run, it must check the domain's budget,
+ * and if there is some, it can use it.
+ *
+ * - budget is replenished to the top of the capacity for the domain once
+ * per period. Even if there was some leftover budget from the previous
+ * period, though, the budget after a replenishment will always be at most
+ * equal to the total capacity of the domain ('tot_budget');
+ *
+ * - when a budget replenishment occurs, if there are vCPUs that had been
+ * blocked because of lack of budget, they'll be unblocked, and they will
+ * (potentially) be able to run again.
+ *
+ * Finally, some even more implementation related detail:
+ *
+ * - budget is stored in a domain-wide pool. vCPUs of the domain that want
+ * to run go to such pool, and grab some. When they do so, the amount
+ * they grabbed is _immediately_ removed from the pool. This happens in
+ * vcpu_grab_budget();
+ *
+ * - when vCPUs stop running, if they've not consumed all the budget they
+ * took, the leftover is put back in the pool. This happens in
+ * vcpu_return_budget();
+ *
+ * - the above means that a vCPU can find out that there is no budget and
+ * block, not only if the cap has actually been reached (for this period),
+ * but also if some other vCPUs, in order to run, have grabbed a certain
+ * quota of budget, no matter whether they've already used it all or not.
+ * A vCPU blocking because of (any form of) lack of budget is said to be
+ * "parked", and such blocking happens in park_vcpu();
+ *
+ * - when a vCPU stops running, and puts back some budget in the domain pool,
+ * we need to check whether there is someone that has been parked and that
+ * can be unparked. This happens in unpark_parked_vcpus(), called from
+ * csched2_context_saved();
+ *
+ * - of course, unparking happens also as a consequence of the domain's budget
+ * being replenished by the periodic timer. This also occurs by means of
+ * calling csched2_context_saved() (but from replenish_domain_budget());
+ *
+ * - parked vCPUs of a domain are kept in a (per-domain) list, called
+ * 'parked_vcpus'. Manipulation of the list and of the domain-wide budget
+ * pool, must occur only when holding the 'budget_lock'.
+ */
+
+/*
* Locking:
*
* - runqueue lock
@@ -112,18 +192,29 @@
* runqueue each cpu is;
* + serializes the operation of changing the weights of domains;
*
+ * - Budget lock
+ * + it is per-domain;
+ * + protects, in domains that have a utilization cap:
+ * * manipulation of the total budget of the domain (as it is shared
+ * among all vCPUs of the domain),
+ * * manipulation of the list of vCPUs that are blocked waiting for
+ * some budget to be available.
+ *
* - Type:
* + runqueue locks are 'regular' spinlocks;
* + the private scheduler lock can be an rwlock. In fact, data
* it protects is modified only during initialization, cpupool
* manipulation and when changing weights, and read in all
- * other cases (e.g., during load balancing).
+ * other cases (e.g., during load balancing);
+ * + budget locks are 'regular' spinlocks.
*
* Ordering:
* + trylock must be used when wanting to take a runqueue lock,
* if we already hold another one;
* + if taking both a runqueue lock and the private scheduler
- * lock is needed, the latter must always be taken first.
+ * lock is needed, the latter must always be taken first;
+ * + if taking both a runqueue lock and a budget lock, the former
+ * must always be taken first.
*/
/*
@@ -166,6 +257,8 @@
#define CSCHED2_CREDIT_RESET 0
/* Max timer: Maximum time a guest can be run for. */
#define CSCHED2_MAX_TIMER CSCHED2_CREDIT_INIT
+/* Period of the cap replenishment timer. */
+#define CSCHED2_BDGT_REPL_PERIOD ((opt_cap_period)*MILLISECS(1))
/*
* Flags
@@ -295,6 +388,14 @@ static int __read_mostly opt_underload_balance_tolerance = 0;
integer_param("credit2_balance_under", opt_underload_balance_tolerance);
static int __read_mostly opt_overload_balance_tolerance = -3;
integer_param("credit2_balance_over", opt_overload_balance_tolerance);
+/*
+ * Domains subject to a cap receive a replenishment of their runtime budget
+ * once every opt_cap_period interval. Default is 10 ms. The amount of budget
+ * they receive depends on their cap. For instance, a domain with a 50% cap
+ * will receive 50% of 10 ms, so 5 ms.
+ */
+static unsigned int __read_mostly opt_cap_period = 10; /* ms */
+integer_param("credit2_cap_period_ms", opt_cap_period);
/*
* Runqueue organization.
@@ -411,13 +512,15 @@ static DEFINE_PER_CPU(int, runq_map);
* Virtual CPU
*/
struct csched2_vcpu {
- struct list_head rqd_elem; /* On csched2_runqueue_data's svc list */
+ struct csched2_dom *sdom; /* Up-pointer to domain */
+ struct vcpu *vcpu; /* Up-pointer, to vcpu */
struct csched2_runqueue_data *rqd; /* Up-pointer to the runqueue */
int credit; /* Current amount of credit */
unsigned int weight; /* Weight of this vcpu */
unsigned int residual; /* Reminder of div(max_weight/weight) */
unsigned flags; /* Status flags (16 bits would be ok, */
+ s_time_t budget; /* Current budget (if domain has a cap) */
/* but clear_bit() does not like that) */
s_time_t start_time; /* Time we were scheduled (for credit) */
@@ -426,9 +529,8 @@ struct csched2_vcpu {
s_time_t avgload; /* Decaying queue load */
struct list_head runq_elem; /* On the runqueue (rqd->runq) */
- struct csched2_dom *sdom; /* Up-pointer to domain */
- struct vcpu *vcpu; /* Up-pointer, to vcpu */
-
+ struct list_head parked_elem; /* On the parked_vcpus list */
+ struct list_head rqd_elem; /* On csched2_runqueue_data's svc list */
struct csched2_runqueue_data *migrate_rqd; /* Pre-determined migr. target */
int tickled_cpu; /* Cpu that will pick us (-1 if none) */
};
@@ -437,9 +539,19 @@ struct csched2_vcpu {
* Domain
*/
struct csched2_dom {
- struct list_head sdom_elem; /* On csched2_runqueue_data's sdom list */
struct domain *dom; /* Up-pointer to domain */
+
+ spinlock_t budget_lock; /* Serialized budget calculations */
+ s_time_t tot_budget; /* Total amount of budget */
+ s_time_t budget; /* Currently available budget */
+
+ struct timer *repl_timer; /* Timer for periodic replenishment of budget */
+ s_time_t next_repl; /* Time at which next replenishment occurs */
+ struct list_head parked_vcpus; /* List of vCPUs waiting for budget */
+
+ struct list_head sdom_elem; /* On csched2_runqueue_data's sdom list */
uint16_t weight; /* User specified weight */
+ uint16_t cap; /* User specified cap */
uint16_t nr_vcpus; /* Number of vcpus of this domain */
};
@@ -474,6 +586,12 @@ static inline struct csched2_runqueue_data *c2rqd(const struct scheduler *ops,
return &csched2_priv(ops)->rqd[c2r(cpu)];
}
+/* Does the domain of this vCPU have a cap? */
+static inline bool has_cap(const struct csched2_vcpu *svc)
+{
+ return svc->budget != STIME_MAX;
+}
+
/*
* Hyperthreading (SMT) support.
*
@@ -1515,7 +1633,16 @@ static void reset_credit(const struct scheduler *ops, int cpu, s_time_t now,
* that the credit it has spent so far get accounted.
*/
if ( svc->vcpu == curr_on_cpu(svc_cpu) )
+ {
burn_credits(rqd, svc, now);
+ /*
+ * And, similarly, in case it has run out of budget, as a
+ * consequence of this round of accounting, we also must inform
+ * its pCPU that it's time to park it, and pick up someone else.
+ */
+ if ( unlikely(svc->budget <= 0) )
+ tickle_cpu(svc_cpu, rqd);
+ }
start_credit = svc->credit;
@@ -1571,27 +1698,35 @@ void burn_credits(struct csched2_runqueue_data *rqd,
delta = now - svc->start_time;
- if ( likely(delta > 0) )
- {
- SCHED_STAT_CRANK(burn_credits_t2c);
- t2c_update(rqd, delta, svc);
- svc->start_time = now;
- }
- else if ( delta < 0 )
+ if ( unlikely(delta <= 0) )
{
- d2printk("WARNING: %s: Time went backwards? now %"PRI_stime" start_time %"PRI_stime"\n",
- __func__, now, svc->start_time);
+ if ( unlikely(delta < 0) )
+ d2printk("WARNING: %s: Time went backwards? now %"PRI_stime
+ " start_time %"PRI_stime"\n", __func__, now,
+ svc->start_time);
+ goto out;
}
+ SCHED_STAT_CRANK(burn_credits_t2c);
+ t2c_update(rqd, delta, svc);
+
+ if ( has_cap(svc) )
+ svc->budget -= delta;
+
+ svc->start_time = now;
+
+ out:
if ( unlikely(tb_init_done) )
{
struct {
unsigned vcpu:16, dom:16;
- int credit, delta;
+ int credit, budget;
+ int delta;
} d;
d.dom = svc->vcpu->domain->domain_id;
d.vcpu = svc->vcpu->vcpu_id;
d.credit = svc->credit;
+ d.budget = has_cap(svc) ? svc->budget : INT_MIN;
d.delta = delta;
__trace_var(TRC_CSCHED2_CREDIT_BURN, 1,
sizeof(d),
@@ -1599,6 +1734,248 @@ void burn_credits(struct csched2_runqueue_data *rqd,
}
}
+/*
+ * Budget-related code.
+ */
+
+static void park_vcpu(struct csched2_vcpu *svc)
+{
+ struct vcpu *v = svc->vcpu;
+
+ ASSERT(spin_is_locked(&svc->sdom->budget_lock));
+
+ /*
+ * It was impossible to find budget for this vCPU, so it has to be
+ * "parked". This implies it is not runnable, so we mark it as such in
+ * its pause_flags. If the vCPU is currently scheduled (which means we
+ * are here after being called from within csched_schedule()), flagging
+ * is enough, as we'll choose someone else, and then context_saved()
+ * will take care of updating the load properly.
+ *
+ * If, OTOH, the vCPU is sitting in the runqueue (which means we are here
+ * after being called from within runq_candidate()), we must go all the
+ * way down to taking it out of there, and updating the load accordingly.
+ *
+ * In both cases, we also add it to the list of parked vCPUs of the domain.
+ */
+ __set_bit(_VPF_parked, &v->pause_flags);
+ if ( vcpu_on_runq(svc) )
+ {
+ runq_remove(svc);
+ update_load(svc->sdom->dom->cpupool->sched, svc->rqd, svc, -1, NOW());
+ }
+ list_add(&svc->parked_elem, &svc->sdom->parked_vcpus);
+}
+
+static bool vcpu_grab_budget(struct csched2_vcpu *svc)
+{
+ struct csched2_dom *sdom = svc->sdom;
+ unsigned int cpu = svc->vcpu->processor;
+
+ ASSERT(spin_is_locked(per_cpu(schedule_data, cpu).schedule_lock));
+
+ if ( svc->budget > 0 )
+ return true;
+
+ /* budget_lock nests inside runqueue lock. */
+ spin_lock(&sdom->budget_lock);
+
+ /*
+ * Here, svc->budget is <= 0 (as, if it was > 0, we'd have taken the if
+ * above!). That basically means the vCPU has overrun a bit --because of
+ * various reasons-- and we want to take that into account. With the +=,
+ * we are actually subtracting the amount of budget the vCPU has
+ * overconsumed, from the total domain budget.
+ */
+ sdom->budget += svc->budget;
+
+ if ( sdom->budget > 0 )
+ {
+ /*
+ * NB: we give the whole remaining budget a domain has, to the first
+ * vCPU that comes here and asks for it. This means that, in a domain
+ * with a cap, only 1 vCPU is able to run, at any given time.
+ * /THIS IS GOING TO CHANGE/ in subsequent patches, toward something
+ * that allows much better fairness and parallelism. Proceeding in
+ * two steps makes things easier to understand, when looking
+ * at the single commits.
+ */
+ svc->budget = sdom->budget;
+ sdom->budget = 0;
+ }
+ else
+ {
+ svc->budget = 0;
+ park_vcpu(svc);
+ }
+
+ spin_unlock(&sdom->budget_lock);
+
+ return svc->budget > 0;
+}
+
+static void
+vcpu_return_budget(struct csched2_vcpu *svc, struct list_head *parked)
+{
+ struct csched2_dom *sdom = svc->sdom;
+ unsigned int cpu = svc->vcpu->processor;
+
+ ASSERT(spin_is_locked(per_cpu(schedule_data, cpu).schedule_lock));
+ ASSERT(list_empty(parked));
+
+ /* budget_lock nests inside runqueue lock. */
+ spin_lock(&sdom->budget_lock);
+
+ /*
+ * The vCPU is stopping running (e.g., because it's blocking, or it has
+ * been preempted). If it hasn't consumed all the budget it got when
+ * starting to run, put that remaining amount back in the domain's budget
+ * pool.
+ */
+ sdom->budget += svc->budget;
+ svc->budget = 0;
+
+ /*
+ * Making budget available again to the domain means that parked vCPUs
+ * may be unparked and run. They are, if any, in the domain's parked_vcpus
+ * list, so we want to go through that and unpark them (so they can try
+ * to get some budget).
+ *
+ * Touching the list requires the budget_lock, which we hold. Let's
+ * therefore put everyone in that list in another, temporary list, which
+ * then the caller will traverse, unparking the vCPUs it finds there.
+ *
+ * In fact, we can't do the actual unparking here, because that requires
+ * taking the runqueue lock of the vCPUs being unparked, and we can't
+ * take any runqueue locks while we hold a budget_lock.
+ */
+ if ( sdom->budget > 0 )
+ list_splice_init(&sdom->parked_vcpus, parked);
+
+ spin_unlock(&sdom->budget_lock);
+}
+
+static void
+unpark_parked_vcpus(const struct scheduler *ops, struct list_head *vcpus)
+{
+ struct csched2_vcpu *svc, *tmp;
+ spinlock_t *lock;
+
+ list_for_each_entry_safe(svc, tmp, vcpus, parked_elem)
+ {
+ unsigned long flags;
+ s_time_t now;
+
+ lock = vcpu_schedule_lock_irqsave(svc->vcpu, &flags);
+
+ __clear_bit(_VPF_parked, &svc->vcpu->pause_flags);
+ if ( unlikely(svc->flags & CSFLAG_scheduled) )
+ {
+ /*
+ * We end here if a budget replenishment arrived between
+ * csched2_schedule() (and, in particular, after a call to
+ * vcpu_grab_budget() that returned false), and
+ * context_saved(). By setting __CSFLAG_delayed_runq_add,
+ * we tell context_saved() to put the vCPU back in the
+ * runqueue, from where it will compete with the others
+ * for the newly replenished budget.
+ */
+ ASSERT( svc->rqd != NULL );
+ ASSERT( c2rqd(ops, svc->vcpu->processor) == svc->rqd );
+ __set_bit(__CSFLAG_delayed_runq_add, &svc->flags);
+ }
+ else if ( vcpu_runnable(svc->vcpu) )
+ {
+ /*
+ * The vCPU should go back to the runqueue, and compete for
+ * the newly replenished budget, but only if it is actually
+ * runnable (and was therefore offline only because of the
+ * lack of budget).
+ */
+ now = NOW();
+ update_load(ops, svc->rqd, svc, 1, now);
+ runq_insert(ops, svc);
+ runq_tickle(ops, svc, now);
+ }
+ list_del_init(&svc->parked_elem);
+
+ vcpu_schedule_unlock_irqrestore(lock, flags, svc->vcpu);
+ }
+}
+
+static inline void do_replenish(struct csched2_dom *sdom)
+{
+ sdom->next_repl += CSCHED2_BDGT_REPL_PERIOD;
+ sdom->budget += sdom->tot_budget;
+}
+
+static void replenish_domain_budget(void* data)
+{
+ struct csched2_dom *sdom = data;
+ unsigned long flags;
+ s_time_t now;
+ LIST_HEAD(parked);
+
+ spin_lock_irqsave(&sdom->budget_lock, flags);
+
+ now = NOW();
+
+ /*
+ * Let's do the replenishment. Note, though, that a domain may overrun,
+ * which means the budget would have gone below 0 (reasons may be system
+ * overbooking, accounting issues, etc.). It also may happen that we are
+ * handling the replenishment (much) later than we should (reasons may
+ * again be overbooking, or issues with timers).
+ *
+ * Even in cases of overrun or delay, however, we expect that in 99% of
+ * cases, doing just one replenishment will be good enough for being able
+ * to unpark the vCPUs that are waiting for some budget.
+ */
+ do_replenish(sdom);
+
+ /*
+ * And now, the special cases:
+ * 1) if we are late enough to have skipped (at least) one full period,
+ * what we must do is perform more replenishments. Note, however, that
+ * every time we add tot_budget to the budget, we also move next_repl
+ * away by CSCHED2_BDGT_REPL_PERIOD, to make sure the cap is always
+ * respected.
+ */
+ if ( unlikely(sdom->next_repl <= now) )
+ {
+ do
+ do_replenish(sdom);
+ while ( sdom->next_repl <= now );
+ }
+ /*
+ * 2) if we overrun by more than tot_budget, then budget+tot_budget is
+ * still < 0, which means that we can't unpark the vCPUs. Let's bail,
+ * and wait for future replenishments.
+ */
+ if ( unlikely(sdom->budget <= 0) )
+ {
+ spin_unlock_irqrestore(&sdom->budget_lock, flags);
+ goto out;
+ }
+
+ /* Since we may do more replenishments, make sure we didn't overshoot. */
+ sdom->budget = min(sdom->budget, sdom->tot_budget);
+
+ /*
+ * As above, let's prepare the temporary list, out of the domain's
+ * parked_vcpus list, now that we hold the budget_lock. Then, drop such
+ * lock, and pass the list to the unparking function.
+ */
+ list_splice_init(&sdom->parked_vcpus, &parked);
+
+ spin_unlock_irqrestore(&sdom->budget_lock, flags);
+
+ unpark_parked_vcpus(sdom->dom->cpupool->sched, &parked);
+
+ out:
+ set_timer(sdom->repl_timer, sdom->next_repl);
+}
+
#ifndef NDEBUG
static inline void
csched2_vcpu_check(struct vcpu *vc)
@@ -1658,6 +2035,9 @@ csched2_alloc_vdata(const struct scheduler *ops, struct vcpu *vc, void *dd)
}
svc->tickled_cpu = -1;
+ svc->budget = STIME_MAX;
+ INIT_LIST_HEAD(&svc->parked_elem);
+
SCHED_STAT_CRANK(vcpu_alloc);
return svc;
@@ -1754,6 +2134,7 @@ csched2_context_saved(const struct scheduler *ops, struct vcpu *vc)
struct csched2_vcpu * const svc = csched2_vcpu(vc);
spinlock_t *lock = vcpu_schedule_lock_irq(vc);
s_time_t now = NOW();
+ LIST_HEAD(were_parked);
BUG_ON( !is_idle_vcpu(vc) && svc->rqd != c2rqd(ops, vc->processor));
ASSERT(is_idle_vcpu(vc) || svc->rqd == c2rqd(ops, vc->processor));
@@ -1761,6 +2142,9 @@ csched2_context_saved(const struct scheduler *ops, struct vcpu *vc)
/* This vcpu is now eligible to be put on the runqueue again */
__clear_bit(__CSFLAG_scheduled, &svc->flags);
+ if ( unlikely(has_cap(svc) && svc->budget > 0) )
+ vcpu_return_budget(svc, &were_parked);
+
/* If someone wants it on the runqueue, put it there. */
/*
* NB: We can get rid of CSFLAG_scheduled by checking for
@@ -1781,6 +2165,8 @@ csched2_context_saved(const struct scheduler *ops, struct vcpu *vc)
update_load(ops, svc->rqd, svc, -1, now);
vcpu_schedule_unlock_irq(lock, vc);
+
+ unpark_parked_vcpus(ops, &were_parked);
}
#define MAX_LOAD (STIME_MAX)
@@ -2483,12 +2869,25 @@ csched2_alloc_domdata(const struct scheduler *ops, struct domain *dom)
if ( sdom == NULL )
return NULL;
- /* Initialize credit and weight */
+ sdom->repl_timer = xzalloc(struct timer);
+ if ( sdom->repl_timer == NULL )
+ {
+ xfree(sdom);
+ return NULL;
+ }
+
+ /* Initialize credit, cap and weight */
INIT_LIST_HEAD(&sdom->sdom_elem);
sdom->dom = dom;
sdom->weight = CSCHED2_DEFAULT_WEIGHT;
+ sdom->cap = 0U;
sdom->nr_vcpus = 0;
+ init_timer(sdom->repl_timer, replenish_domain_budget, sdom,
+ cpumask_any(cpupool_domain_cpumask(dom)));
+ spin_lock_init(&sdom->budget_lock);
+ INIT_LIST_HEAD(&sdom->parked_vcpus);
+
write_lock_irqsave(&prv->lock, flags);
list_add_tail(&sdom->sdom_elem, &csched2_priv(ops)->sdom);
@@ -2524,6 +2923,9 @@ csched2_free_domdata(const struct scheduler *ops, void *data)
write_lock_irqsave(&prv->lock, flags);
+ kill_timer(sdom->repl_timer);
+ xfree(sdom->repl_timer);
+
list_del_init(&sdom->sdom_elem);
write_unlock_irqrestore(&prv->lock, flags);
@@ -2618,11 +3020,12 @@ csched2_runtime(const struct scheduler *ops, int cpu,
return -1;
/* General algorithm:
- * 1) Run until snext's credit will be 0
+ * 1) Run until snext's credit will be 0.
* 2) But if someone is waiting, run until snext's credit is equal
- * to his
- * 3) But never run longer than MAX_TIMER or shorter than MIN_TIMER or
- * the ratelimit time.
+ * to his.
+ * 3) But, if we are capped, never run more than our budget.
+ * 4) And never run longer than MAX_TIMER or shorter than MIN_TIMER or
+ * the ratelimit time.
*/
/* Calculate mintime */
@@ -2637,11 +3040,13 @@ csched2_runtime(const struct scheduler *ops, int cpu,
min_time = ratelimit_min;
}
- /* 1) Basic time: Run until credit is 0. */
+ /* 1) Run until snext's credit will be 0. */
rt_credit = snext->credit;
- /* 2) If there's someone waiting whose credit is positive,
- * run until your credit ~= his */
+ /*
+ * 2) If there's someone waiting whose credit is positive,
+ * run until your credit ~= his.
+ */
if ( ! list_empty(runq) )
{
struct csched2_vcpu *swait = runq_elem(runq->next);
@@ -2663,14 +3068,22 @@ csched2_runtime(const struct scheduler *ops, int cpu,
* credit values of MIN,MAX per vcpu, since each vcpu burns credit
* at a different rate.
*/
- if (rt_credit > 0)
+ if ( rt_credit > 0 )
time = c2t(rqd, rt_credit, snext);
else
time = 0;
- /* 3) But never run longer than MAX_TIMER or less than MIN_TIMER or
- * the rate_limit time. */
- if ( time < min_time)
+ /*
+ * 3) But, if capped, never run more than our budget.
+ */
+ if ( has_cap(snext) )
+ time = snext->budget < time ? snext->budget : time;
+
+ /*
+ * 4) And never run longer than MAX_TIMER or less than MIN_TIMER or
+ * the rate_limit time.
+ */
+ if ( time < min_time )
{
time = min_time;
SCHED_STAT_CRANK(runtime_min_timer);
@@ -2693,7 +3106,7 @@ runq_candidate(struct csched2_runqueue_data *rqd,
int cpu, s_time_t now,
unsigned int *skipped)
{
- struct list_head *iter;
+ struct list_head *iter, *temp;
struct csched2_vcpu *snext = NULL;
struct csched2_private *prv = csched2_priv(per_cpu(scheduler, cpu));
bool yield = false, soft_aff_preempt = false;
@@ -2778,7 +3191,7 @@ runq_candidate(struct csched2_runqueue_data *rqd,
snext = csched2_vcpu(idle_vcpu[cpu]);
check_runq:
- list_for_each( iter, &rqd->runq )
+ list_for_each_safe( iter, temp, &rqd->runq )
{
struct csched2_vcpu * svc = list_entry(iter, struct csched2_vcpu, runq_elem);
@@ -2826,11 +3239,13 @@ runq_candidate(struct csched2_runqueue_data *rqd,
}
/*
- * If the next one on the list has more credit than current
- * (or idle, if current is not runnable), or if current is
- * yielding, choose it.
+ * If the one in the runqueue has more credit than current (or idle,
+ * if current is not runnable), or if current is yielding, and also
+ * if the one in runqueue either is not capped, or is capped but has
+ * some budget, then choose it.
*/
- if ( yield || svc->credit > snext->credit )
+ if ( (yield || svc->credit > snext->credit) &&
+ (!has_cap(svc) || vcpu_grab_budget(svc)) )
snext = svc;
/* In any case, if we got this far, break. */
@@ -2857,6 +3272,13 @@ runq_candidate(struct csched2_runqueue_data *rqd,
if ( unlikely(snext->tickled_cpu != -1 && snext->tickled_cpu != cpu) )
SCHED_STAT_CRANK(tickled_cpu_overridden);
+ /*
+ * If snext is from a capped domain, it must have budget (or it
+ * wouldn't have been in the runq). If it is not capped, its budget
+ * is STIME_MAX, which is still >= 0.
+ */
+ ASSERT(snext->budget >= 0);
+
return snext;
}
@@ -2914,10 +3336,18 @@ csched2_schedule(
(unsigned char *)&d);
}
- /* Update credits */
+ /* Update credits (and budget, if necessary). */
burn_credits(rqd, scurr, now);
/*
+ * A budget below 0 means that we are capped and have overrun our budget.
+ * Let's try to get some more but, if we fail (e.g., because of the
+ * other running vcpus), we will be parked.
+ */
+ if ( unlikely(scurr->budget <= 0) )
+ vcpu_grab_budget(scurr);
+
+ /*
* Select next runnable local VCPU (ie top of local runq).
*
* If the current vcpu is runnable, and has higher credit than
@@ -3051,6 +3481,9 @@ csched2_dump_vcpu(struct csched2_private *prv, struct csched2_vcpu *svc)
printk(" credit=%" PRIi32" [w=%u]", svc->credit, svc->weight);
+ if ( has_cap(svc) )
+ printk(" budget=%"PRI_stime, svc->budget);
+
printk(" load=%"PRI_stime" (~%"PRI_stime"%%)", svc->avgload,
(svc->avgload * 100) >> prv->load_precision_shift);
@@ -3138,9 +3571,10 @@ csched2_dump(const struct scheduler *ops)
sdom = list_entry(iter_sdom, struct csched2_dom, sdom_elem);
- printk("\tDomain: %d w %d v %d\n",
+ printk("\tDomain: %d w %d c %u v %d\n",
sdom->dom->domain_id,
sdom->weight,
+ sdom->cap,
sdom->nr_vcpus);
for_each_vcpu( sdom->dom, v )
@@ -3360,12 +3794,14 @@ csched2_init(struct scheduler *ops)
XENLOG_INFO " load_window_shift: %d\n"
XENLOG_INFO " underload_balance_tolerance: %d\n"
XENLOG_INFO " overload_balance_tolerance: %d\n"
- XENLOG_INFO " runqueues arrangement: %s\n",
+ XENLOG_INFO " runqueues arrangement: %s\n"
+ XENLOG_INFO " cap enforcement granularity: %dms\n",
opt_load_precision_shift,
opt_load_window_shift,
opt_underload_balance_tolerance,
opt_overload_balance_tolerance,
- opt_runqueue_str[opt_runqueue]);
+ opt_runqueue_str[opt_runqueue],
+ opt_cap_period);
if ( opt_load_precision_shift < LOADAVG_PRECISION_SHIFT_MIN )
{
@@ -3383,6 +3819,13 @@ csched2_init(struct scheduler *ops)
printk(XENLOG_INFO "load tracking window length %llu ns\n",
1ULL << opt_load_window_shift);
+ if ( CSCHED2_BDGT_REPL_PERIOD < CSCHED2_MIN_TIMER )
+ {
+ printk("WARNING: %s: opt_cap_period %d too small, resetting\n",
+ __func__, opt_cap_period);
+ opt_cap_period = 10; /* ms */
+ }
+
/*
* Basically no CPU information is available at this point; just
* set up basic structures, and a callback when the CPU info is
diff --git a/xen/include/xen/sched.h b/xen/include/xen/sched.h
index 5828a01..9ab8585 100644
--- a/xen/include/xen/sched.h
+++ b/xen/include/xen/sched.h
@@ -792,6 +792,9 @@ static inline struct domain *next_domain_in_cpupool(
/* VCPU is being reset. */
#define _VPF_in_reset 7
#define VPF_in_reset (1UL<<_VPF_in_reset)
+/* VCPU is parked. */
+#define _VPF_parked 8
+#define VPF_parked (1UL<<_VPF_parked)
static inline int vcpu_runnable(struct vcpu *v)
{
_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
https://lists.xen.org/xen-devel
^ permalink raw reply related [flat|nested] 10+ messages in thread
* [PATCH v2 2/4] xen: credit2: allow to set and get utilization cap
2017-08-18 15:50 [PATCH v2 0/4] xen/tools: Credit2: implement caps Dario Faggioli
2017-08-18 15:50 ` [PATCH v2 1/4] xen: credit2: implement utilization cap Dario Faggioli
@ 2017-08-18 15:51 ` Dario Faggioli
2017-09-14 16:21 ` George Dunlap
2017-08-18 15:51 ` [PATCH v2 3/4] xen: credit2: improve distribution of budget (for domains with caps) Dario Faggioli
2017-08-18 15:51 ` [PATCH v2 4/4] libxl/xl: allow to get and set cap on Credit2 Dario Faggioli
3 siblings, 1 reply; 10+ messages in thread
From: Dario Faggioli @ 2017-08-18 15:51 UTC (permalink / raw)
To: xen-devel
Cc: Wei Liu, George Dunlap, Andrew Cooper, Ian Jackson, Anshul Makkar,
Jan Beulich
As the cap parameter is already present in Credit1, all
the wiring is already there for it to percolate down
to csched2_dom_cntl() too.
In this commit, we actually deal with it, and implement
setting, changing and disabling the cap of a domain.
Signed-off-by: Dario Faggioli <dario.faggioli@citrix.com>
---
Cc: George Dunlap <george.dunlap@eu.citrix.com>
Cc: Ian Jackson <ian.jackson@eu.citrix.com>
Cc: Wei Liu <wei.liu2@citrix.com>
Cc: Andrew Cooper <andrew.cooper3@citrix.com>
Cc: Jan Beulich <jbeulich@suse.com>
Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Cc: Anshul Makkar <anshulmakkar@gmail.com>
---
Changes from v1:
- check that cap is below 100*nr_vcpus;
- do multiplication first when computing the domain's budget, given the cap;
- when disabling cap, take the budget lock for manipulating the list of
parked vCPUs. Things would have been safe without it, but it's just
more linear, more robust and more future-proof, to "do things properly".
---
xen/common/sched_credit2.c | 129 +++++++++++++++++++++++++++++++++++++++++--
xen/include/public/domctl.h | 1
2 files changed, 125 insertions(+), 5 deletions(-)
diff --git a/xen/common/sched_credit2.c b/xen/common/sched_credit2.c
index 69a7679..ce70224 100644
--- a/xen/common/sched_credit2.c
+++ b/xen/common/sched_credit2.c
@@ -2772,30 +2772,35 @@ csched2_dom_cntl(
struct csched2_dom * const sdom = csched2_dom(d);
struct csched2_private *prv = csched2_priv(ops);
unsigned long flags;
+ struct vcpu *v;
int rc = 0;
/*
* Locking:
* - we must take the private lock for accessing the weights of the
- * vcpus of d,
+ * vcpus of d, and/or the cap;
* - in the putinfo case, we also need the runqueue lock(s), for
* updating the max waight of the runqueue(s).
+ * If changing the cap, we also need the budget_lock, for updating
+ * the value of the domain budget pool (and the runqueue lock,
+ * for adjusting the parameters and rescheduling any vCPU that is
+ * running at the time of the change).
*/
switch ( op->cmd )
{
case XEN_DOMCTL_SCHEDOP_getinfo:
read_lock_irqsave(&prv->lock, flags);
op->u.credit2.weight = sdom->weight;
+ op->u.credit2.cap = sdom->cap;
read_unlock_irqrestore(&prv->lock, flags);
break;
case XEN_DOMCTL_SCHEDOP_putinfo:
+ write_lock_irqsave(&prv->lock, flags);
+ /* Weight */
if ( op->u.credit2.weight != 0 )
{
- struct vcpu *v;
int old_weight;
- write_lock_irqsave(&prv->lock, flags);
-
old_weight = sdom->weight;
sdom->weight = op->u.credit2.weight;
@@ -2813,9 +2818,123 @@ csched2_dom_cntl(
vcpu_schedule_unlock(lock, svc->vcpu);
}
+ }
+ /* Cap */
+ if ( op->u.credit2.cap != 0 )
+ {
+ /* Cap is only valid if it's below 100 * nr_of_vCPUS */
+ if ( op->u.credit2.cap > 100 * sdom->nr_vcpus )
+ {
+ rc = -EINVAL;
+ break;
+ }
+
+ spin_lock(&sdom->budget_lock);
+ sdom->tot_budget = (CSCHED2_BDGT_REPL_PERIOD * op->u.credit2.cap);
+ sdom->tot_budget /= 100;
+ spin_unlock(&sdom->budget_lock);
+
+ if ( sdom->cap == 0 )
+ {
+ /*
+ * We give to the domain the budget to which it is entitled,
+ * and queue its first replenishment event.
+ *
+ * Since cap is currently disabled for this domain, we
+ * know no vCPU is messing with the domain's budget, and
+ * the replenishment timer is still off.
+ * For these reasons, it is safe to do the following without
+ * taking the budget_lock.
+ */
+ sdom->budget = sdom->tot_budget;
+ sdom->next_repl = NOW() + CSCHED2_BDGT_REPL_PERIOD;
+ set_timer(sdom->repl_timer, sdom->next_repl);
+
+ /*
+ * Now, let's enable budget accounting for all the vCPUs.
+ * For making sure that they will start to honour the domain's
+ * cap, we set their budget to 0.
+ * This way, as soon as they try to run, they will have
+ * to get some budget.
+ *
+ * For the vCPUs that are already running, we trigger the
+ * scheduler on their pCPU. When, as a consequence of this,
+ * csched2_schedule() will run, it will figure out there is
+ * no budget, and the vCPU will try to get some (and be parked,
+ * if there's none, and we'll switch to someone else).
+ */
+ for_each_vcpu ( d, v )
+ {
+ struct csched2_vcpu *svc = csched2_vcpu(v);
+ spinlock_t *lock = vcpu_schedule_lock(svc->vcpu);
+
+ if ( v->is_running )
+ {
+ unsigned int cpu = v->processor;
+ struct csched2_runqueue_data *rqd = c2rqd(ops, cpu);
+
+ ASSERT(curr_on_cpu(cpu) == v);
+
+ /*
+ * We are triggering a reschedule on the vCPU's
+ * pCPU. That will run burn_credits() and, since
+ * the vCPU is capped now, it would charge all the
+ * execution time of this last round as budget as
+ * well. That will make the vCPU budget go negative,
+ * potentially by a large amount, and it's unfair.
+ *
+ * To avoid that, call burn_credit() here, to do the
+ * accounting of this current running instance now,
+ * with budgeting still disabled. This does not
+ * prevent some small amount of budget being charged
+ * to the vCPU (i.e., the amount of time it runs from
+ * now, to when scheduling happens). The budget will
+ * also go below 0, but by a lot less than it would
+ * if we didn't do this.
+ */
+ burn_credits(rqd, svc, NOW());
+ __cpumask_set_cpu(cpu, &rqd->tickled);
+ ASSERT(!cpumask_test_cpu(cpu, &rqd->smt_idle));
+ cpu_raise_softirq(cpu, SCHEDULE_SOFTIRQ);
+ }
+ svc->budget = 0;
+ vcpu_schedule_unlock(lock, svc->vcpu);
+ }
+ }
+
+ sdom->cap = op->u.credit2.cap;
+ }
+ else if ( sdom->cap != 0 )
+ {
+ LIST_HEAD(parked);
+
+ stop_timer(sdom->repl_timer);
+
+ /* Disable budget accounting for all the vCPUs. */
+ for_each_vcpu ( d, v )
+ {
+ struct csched2_vcpu *svc = csched2_vcpu(v);
+ spinlock_t *lock = vcpu_schedule_lock(svc->vcpu);
+
+ svc->budget = STIME_MAX;
+
+ vcpu_schedule_unlock(lock, svc->vcpu);
+ }
+ sdom->cap = 0;
+ /*
+ * We are disabling the cap for this domain, which may have
+ * vCPUs waiting for a replenishment, so we unpark them all.
+ * Note that, since we have already disabled budget accounting
+ * for all the vCPUs of the domain, no currently running vCPU
+ * will be added to the parked vCPUs list any longer.
+ */
+ spin_lock(&sdom->budget_lock);
+ list_splice_init(&sdom->parked_vcpus, &parked);
+ spin_unlock(&sdom->budget_lock);
- write_unlock_irqrestore(&prv->lock, flags);
+ unpark_parked_vcpus(ops, &parked);
}
+ write_unlock_irqrestore(&prv->lock, flags);
break;
default:
rc = -EINVAL;
diff --git a/xen/include/public/domctl.h b/xen/include/public/domctl.h
index 0669c31..10c0015 100644
--- a/xen/include/public/domctl.h
+++ b/xen/include/public/domctl.h
@@ -355,6 +355,7 @@ typedef struct xen_domctl_sched_credit {
typedef struct xen_domctl_sched_credit2 {
uint16_t weight;
+ uint16_t cap;
} xen_domctl_sched_credit2_t;
typedef struct xen_domctl_sched_rtds {
* [PATCH v2 3/4] xen: credit2: improve distribution of budget (for domains with caps)
2017-08-18 15:50 [PATCH v2 0/4] xen/tools: Credit2: implement caps Dario Faggioli
2017-08-18 15:50 ` [PATCH v2 1/4] xen: credit2: implement utilization cap Dario Faggioli
2017-08-18 15:51 ` [PATCH v2 2/4] xen: credit2: allow to set and get " Dario Faggioli
@ 2017-08-18 15:51 ` Dario Faggioli
2017-08-18 15:51 ` [PATCH v2 4/4] libxl/xl: allow to get and set cap on Credit2 Dario Faggioli
3 siblings, 0 replies; 10+ messages in thread
From: Dario Faggioli @ 2017-08-18 15:51 UTC (permalink / raw)
To: xen-devel; +Cc: George Dunlap, Anshul Makkar
Instead of letting the first vCPU that tries to get
some budget take it all (although temporarily), allow each
vCPU to get only a specific quota of the total budget.
This improves fairness, allows for more parallelism, and
prevents vCPUs from being unable to get any budget (e.g.,
because some other vCPU always comes first and takes it all)
for one or more periods, and hence from starving (which causes
trouble in guest kernels, such as livelocks, triggering of
watchdogs, etc.).
Signed-off-by: Dario Faggioli <dario.faggioli@citrix.com>
Reviewed-by: George Dunlap <george.dunlap@eu.citrix.com>
---
Cc: Anshul Makkar <anshulmakkar@gmail.com>
---
Changes from v1:
- typos;
- spurious hunk moved to previous patch.
---
xen/common/sched_credit2.c | 56 ++++++++++++++++++++++++++++++++------------
1 file changed, 41 insertions(+), 15 deletions(-)
diff --git a/xen/common/sched_credit2.c b/xen/common/sched_credit2.c
index ce70224..211e2d6 100644
--- a/xen/common/sched_credit2.c
+++ b/xen/common/sched_credit2.c
@@ -522,6 +522,8 @@ struct csched2_vcpu {
unsigned flags; /* Status flags (16 bits would be ok, */
s_time_t budget; /* Current budget (if domains has cap) */
/* but clear_bit() does not like that) */
+ s_time_t budget_quota; /* Budget to which vCPU is entitled */
+
s_time_t start_time; /* Time we were scheduled (for credit) */
/* Individual contribution to load */
@@ -1791,17 +1793,16 @@ static bool vcpu_grab_budget(struct csched2_vcpu *svc)
if ( sdom->budget > 0 )
{
- /*
- * NB: we give the whole remaining budget a domain has, to the first
- * vCPU that comes here and asks for it. This means that, in a domain
- * with a cap, only 1 vCPU is able to run, at any given time.
- * /THIS IS GOING TO CHANGE/ in subsequent patches, toward something
- * that allows much better fairness and parallelism. Proceeding in
- * two steps, is for making things easy to understand, when looking
- * at the signle commits.
- */
- svc->budget = sdom->budget;
- sdom->budget = 0;
+ s_time_t budget;
+
+ /* Get our quota, if there's at least as much budget */
+ if ( likely(sdom->budget >= svc->budget_quota) )
+ budget = svc->budget_quota;
+ else
+ budget = sdom->budget;
+
+ svc->budget = budget;
+ sdom->budget -= budget;
}
else
{
@@ -2036,6 +2037,7 @@ csched2_alloc_vdata(const struct scheduler *ops, struct vcpu *vc, void *dd)
svc->tickled_cpu = -1;
svc->budget = STIME_MAX;
+ svc->budget_quota = 0;
INIT_LIST_HEAD(&svc->parked_elem);
SCHED_STAT_CRANK(vcpu_alloc);
@@ -2822,6 +2824,9 @@ csched2_dom_cntl(
/* Cap */
if ( op->u.credit2.cap != 0 )
{
+ struct csched2_vcpu *svc;
+ spinlock_t *lock;
+
/* Cap is only valid if it's below 100 * nr_of_vCPUS */
if ( op->u.credit2.cap > 100 * sdom->nr_vcpus )
{
@@ -2834,6 +2839,26 @@ csched2_dom_cntl(
sdom->tot_budget /= 100;
spin_unlock(&sdom->budget_lock);
+ /*
+ * When trying to get some budget and run, each vCPU will grab
+ * from the pool 1/N (with N = nr of vCPUs of the domain) of
+ * the total budget. Roughly speaking, this means each vCPU will
+ * have at least one chance to run during every period.
+ */
+ for_each_vcpu ( d, v )
+ {
+ svc = csched2_vcpu(v);
+ lock = vcpu_schedule_lock(svc->vcpu);
+ /*
+ * Too small a quota would in theory cause a lot of overhead,
+ * but that won't actually happen because, in csched2_runtime(),
+ * CSCHED2_MIN_TIMER is what would be used anyway.
+ */
+ svc->budget_quota = max(sdom->tot_budget / sdom->nr_vcpus,
+ CSCHED2_MIN_TIMER);
+ vcpu_schedule_unlock(lock, svc->vcpu);
+ }
+
if ( sdom->cap == 0 )
{
/*
@@ -2865,9 +2890,8 @@ csched2_dom_cntl(
*/
for_each_vcpu ( d, v )
{
- struct csched2_vcpu *svc = csched2_vcpu(v);
- spinlock_t *lock = vcpu_schedule_lock(svc->vcpu);
-
+ svc = csched2_vcpu(v);
+ lock = vcpu_schedule_lock(svc->vcpu);
if ( v->is_running )
{
unsigned int cpu = v->processor;
@@ -2917,6 +2941,7 @@ csched2_dom_cntl(
spinlock_t *lock = vcpu_schedule_lock(svc->vcpu);
svc->budget = STIME_MAX;
+ svc->budget_quota = 0;
vcpu_schedule_unlock(lock, svc->vcpu);
}
@@ -3601,7 +3626,8 @@ csched2_dump_vcpu(struct csched2_private *prv, struct csched2_vcpu *svc)
printk(" credit=%" PRIi32" [w=%u]", svc->credit, svc->weight);
if ( has_cap(svc) )
- printk(" budget=%"PRI_stime, svc->budget);
+ printk(" budget=%"PRI_stime"(%"PRI_stime")",
+ svc->budget, svc->budget_quota);
printk(" load=%"PRI_stime" (~%"PRI_stime"%%)", svc->avgload,
(svc->avgload * 100) >> prv->load_precision_shift);
* [PATCH v2 4/4] libxl/xl: allow to get and set cap on Credit2.
2017-08-18 15:50 [PATCH v2 0/4] xen/tools: Credit2: implement caps Dario Faggioli
` (2 preceding siblings ...)
2017-08-18 15:51 ` [PATCH v2 3/4] xen: credit2: improve distribution of budget (for domains with caps) Dario Faggioli
@ 2017-08-18 15:51 ` Dario Faggioli
3 siblings, 0 replies; 10+ messages in thread
From: Dario Faggioli @ 2017-08-18 15:51 UTC (permalink / raw)
To: xen-devel; +Cc: George Dunlap, Ian Jackson, Wei Liu
Note that a cap is considered valid only if
it is within the [1, 100 * nr_vcpus] interval.
Signed-off-by: Dario Faggioli <dario.faggioli@citrix.com>
Acked-by: George Dunlap <george.dunlap@eu.citrix.com>
Acked-by: Wei Liu <wei.liu2@citrix.com>
---
Cc: Ian Jackson <ian.jackson@eu.citrix.com>
---
tools/libxl/libxl_sched.c | 21 +++++++++++++++++++++
tools/xl/xl_cmdtable.c | 1 +
tools/xl/xl_sched.c | 25 +++++++++++++++++--------
3 files changed, 39 insertions(+), 8 deletions(-)
diff --git a/tools/libxl/libxl_sched.c b/tools/libxl/libxl_sched.c
index faa604e..7d144d0 100644
--- a/tools/libxl/libxl_sched.c
+++ b/tools/libxl/libxl_sched.c
@@ -405,6 +405,7 @@ static int sched_credit2_domain_get(libxl__gc *gc, uint32_t domid,
libxl_domain_sched_params_init(scinfo);
scinfo->sched = LIBXL_SCHEDULER_CREDIT2;
scinfo->weight = sdom.weight;
+ scinfo->cap = sdom.cap;
return 0;
}
@@ -413,8 +414,17 @@ static int sched_credit2_domain_set(libxl__gc *gc, uint32_t domid,
const libxl_domain_sched_params *scinfo)
{
struct xen_domctl_sched_credit2 sdom;
+ xc_domaininfo_t info;
int rc;
+ rc = xc_domain_getinfolist(CTX->xch, domid, 1, &info);
+ if (rc < 0) {
+ LOGED(ERROR, domid, "Getting domain info");
+ return ERROR_FAIL;
+ }
+ if (rc != 1 || info.domain != domid)
+ return ERROR_INVAL;
+
rc = xc_sched_credit2_domain_get(CTX->xch, domid, &sdom);
if (rc != 0) {
LOGED(ERROR, domid, "Getting domain sched credit2");
@@ -430,6 +440,17 @@ static int sched_credit2_domain_set(libxl__gc *gc, uint32_t domid,
sdom.weight = scinfo->weight;
}
+ if (scinfo->cap != LIBXL_DOMAIN_SCHED_PARAM_CAP_DEFAULT) {
+ if (scinfo->cap < 0
+ || scinfo->cap > (info.max_vcpu_id + 1) * 100) {
+ LOGD(ERROR, domid, "Cpu cap out of range, "
+ "valid range is from 0 to %d for specified number of vcpus",
+ ((info.max_vcpu_id + 1) * 100));
+ return ERROR_INVAL;
+ }
+ sdom.cap = scinfo->cap;
+ }
+
rc = xc_sched_credit2_domain_set(CTX->xch, domid, &sdom);
if ( rc < 0 ) {
LOGED(ERROR, domid, "Setting domain sched credit2");
diff --git a/tools/xl/xl_cmdtable.c b/tools/xl/xl_cmdtable.c
index 2c71a9f..bfe6eb0 100644
--- a/tools/xl/xl_cmdtable.c
+++ b/tools/xl/xl_cmdtable.c
@@ -265,6 +265,7 @@ struct cmd_spec cmd_table[] = {
"[-d <Domain> [-w[=WEIGHT]]] [-p CPUPOOL]",
"-d DOMAIN, --domain=DOMAIN Domain to modify\n"
"-w WEIGHT, --weight=WEIGHT Weight (int)\n"
+ "-c CAP, --cap=CAP Cap (int)\n"
"-s --schedparam Query / modify scheduler parameters\n"
"-r RLIMIT, --ratelimit_us=RLIMIT Set the scheduling rate limit, in microseconds\n"
"-p CPUPOOL, --cpupool=CPUPOOL Restrict output to CPUPOOL"
diff --git a/tools/xl/xl_sched.c b/tools/xl/xl_sched.c
index 85722fe..7fabce3 100644
--- a/tools/xl/xl_sched.c
+++ b/tools/xl/xl_sched.c
@@ -209,7 +209,7 @@ static int sched_credit2_domain_output(int domid)
libxl_domain_sched_params scinfo;
if (domid < 0) {
- printf("%-33s %4s %6s\n", "Name", "ID", "Weight");
+ printf("%-33s %4s %6s %4s\n", "Name", "ID", "Weight", "Cap");
return 0;
}
@@ -219,10 +219,11 @@ static int sched_credit2_domain_output(int domid)
return 1;
}
domname = libxl_domid_to_name(ctx, domid);
- printf("%-33s %4d %6d\n",
+ printf("%-33s %4d %6d %4d\n",
domname,
domid,
- scinfo.weight);
+ scinfo.weight,
+ scinfo.cap);
free(domname);
libxl_domain_sched_params_dispose(&scinfo);
return 0;
@@ -589,21 +590,23 @@ int main_sched_credit2(int argc, char **argv)
const char *dom = NULL;
const char *cpupool = NULL;
int ratelimit = 0;
- int weight = 256;
+ int weight = 256, cap = 0;
bool opt_s = false;
bool opt_r = false;
bool opt_w = false;
+ bool opt_c = false;
int opt, rc;
static struct option opts[] = {
{"domain", 1, 0, 'd'},
{"weight", 1, 0, 'w'},
+ {"cap", 1, 0, 'c'},
{"schedparam", 0, 0, 's'},
{"ratelimit_us", 1, 0, 'r'},
{"cpupool", 1, 0, 'p'},
COMMON_LONG_OPTS
};
- SWITCH_FOREACH_OPT(opt, "d:w:p:r:s", opts, "sched-credit2", 0) {
+ SWITCH_FOREACH_OPT(opt, "d:w:c:p:r:s", opts, "sched-credit2", 0) {
case 'd':
dom = optarg;
break;
@@ -611,6 +614,10 @@ int main_sched_credit2(int argc, char **argv)
weight = strtol(optarg, NULL, 10);
opt_w = true;
break;
+ case 'c':
+ cap = strtol(optarg, NULL, 10);
+ opt_c = true;
+ break;
case 's':
opt_s = true;
break;
@@ -623,12 +630,12 @@ int main_sched_credit2(int argc, char **argv)
break;
}
- if (cpupool && (dom || opt_w)) {
+ if (cpupool && (dom || opt_w || opt_c)) {
fprintf(stderr, "Specifying a cpupool is not allowed with other "
"options.\n");
return EXIT_FAILURE;
}
- if (!dom && opt_w) {
+ if (!dom && (opt_w || opt_c)) {
fprintf(stderr, "Must specify a domain.\n");
return EXIT_FAILURE;
}
@@ -663,7 +670,7 @@ int main_sched_credit2(int argc, char **argv)
} else {
uint32_t domid = find_domain(dom);
- if (!opt_w) { /* output credit2 scheduler info */
+ if (!opt_w && !opt_c) { /* output credit2 scheduler info */
sched_credit2_domain_output(-1);
if (sched_credit2_domain_output(domid))
return EXIT_FAILURE;
@@ -673,6 +680,8 @@ int main_sched_credit2(int argc, char **argv)
scinfo.sched = LIBXL_SCHEDULER_CREDIT2;
if (opt_w)
scinfo.weight = weight;
+ if (opt_c)
+ scinfo.cap = cap;
rc = sched_domain_set(domid, &scinfo);
libxl_domain_sched_params_dispose(&scinfo);
if (rc)
* Re: [PATCH v2 1/4] xen: credit2: implement utilization cap
2017-08-18 15:50 ` [PATCH v2 1/4] xen: credit2: implement utilization cap Dario Faggioli
@ 2017-08-24 19:42 ` Anshul Makkar
2017-09-05 17:53 ` Dario Faggioli
2017-09-14 16:20 ` George Dunlap
1 sibling, 1 reply; 10+ messages in thread
From: Anshul Makkar @ 2017-08-24 19:42 UTC (permalink / raw)
To: Dario Faggioli, xen-devel
Cc: Wei Liu, George Dunlap, Andrew Cooper, Ian Jackson, Jan Beulich
On 8/18/17 4:50 PM, Dario Faggioli wrote:
>
> @@ -474,6 +586,12 @@ static inline struct csched2_runqueue_data *c2rqd(const struct scheduler *ops,
> return &csched2_priv(ops)->rqd[c2r(cpu)];
> }
>
> +/* Does the domain of this vCPU have a cap? */
> +static inline bool has_cap(const struct csched2_vcpu *svc)
> +{
> + return svc->budget != STIME_MAX;
> +}
> +
> /*
> * Hyperthreading (SMT) support.
> *
> @@ -1515,7 +1633,16 @@ static void reset_credit(const struct scheduler *ops, int cpu, s_time_t now,
> * that the credit it has spent so far get accounted.
> */
> if ( svc->vcpu == curr_on_cpu(svc_cpu) )
> + {
> burn_credits(rqd, svc, now);
> + /*
> + * And, similarly, in case it has run out of budget, as a
> + * consequence of this round of accounting, we also must inform
> + * its pCPU that it's time to park it, and pick up someone else.
> + */
> + if ( unlikely(svc->budget <= 0) )
> + tickle_cpu(svc_cpu, rqd);
This is for accounting of credit. Why will it impact the budget? Do
you mean that the budget of the current vcpu expired while doing the
calculation for credit?
> + }
>
> start_credit = svc->credit;
>
> @@ -1571,27 +1698,35 @@ void burn_credits(struct csched2_runqueue_data *rqd,
>
> delta = now - svc->start_time;
>
> - if ( likely(delta > 0) )
> - {
> - SCHED_STAT_CRANK(burn_credits_t2c);
> - t2c_update(rqd, delta, svc);
> - svc->start_time = now;
> - }
> - else if ( delta < 0 )
> + if ( unlikely(delta <= 0) )
> {
>
> +static void replenish_domain_budget(void* data)
> +{
> + struct csched2_dom *sdom = data;
> + unsigned long flags;
> + s_time_t now;
> + LIST_HEAD(parked);
> +
> + spin_lock_irqsave(&sdom->budget_lock, flags);
> +
> + now = NOW();
> +
> + /*
> + * Let's do the replenishment. Note, though, that a domain may overrun,
> + * which means the budget would have gone below 0 (reasons may be system
> + * overbooking, accounting issues, etc.). It also may happen that we are
> + * handling the replenishment (much) later than we should (reasons may
> + * again be overbooking, or issues with timers).
> + *
> + * Even in cases of overrun or delay, however, we expect that in 99% of
> + * cases, doing just one replenishment will be good enough for being able
> + * to unpark the vCPUs that are waiting for some budget.
> + */
> + do_replenish(sdom);
> +
> + /*
> + * And now, the special cases:
> + * 1) if we are late enough to have skipped (at least) one full period,
> + * what we must do is doing more replenishments. Note that, however,
> + * every time we add tot_budget to the budget, we also move next_repl
> + * away by CSCHED2_BDGT_REPL_PERIOD, to make sure the cap is always
> + * respected.
> + */
> + if ( unlikely(sdom->next_repl <= now) )
> + {
> + do
> + do_replenish(sdom);
> + while ( sdom->next_repl <= now );
> + }
Just a bit confused. Have you seen this kind of scenario? Please can
you explain it.
Is this condition necessary?
> + /*
> + * 2) if we overrun by more than tot_budget, then budget+tot_budget is
> + * still < 0, which means that we can't unpark the vCPUs. Let's bail,
> + * and wait for future replenishments.
> + */
> + if ( unlikely(sdom->budget <= 0) )
> + {
> + spin_unlock_irqrestore(&sdom->budget_lock, flags);
> + goto out;
> + }
"if we overran by more than tot_budget in the previous run" would make
it more clear.
> +
> + /* Since we do more replenishments, make sure we didn't overshot. */
> + sdom->budget = min(sdom->budget, sdom->tot_budget);
> +
> + /*
> + * As above, let's prepare the temporary list, out of the domain's
> + * parked_vcpus list, now that we hold the budget_lock. Then, drop such
> + * lock, and pass the list to the unparking function.
> + */
> + list_splice_init(&sdom->parked_vcpus, &parked);
> +
> + spin_unlock_irqrestore(&sdom->budget_lock, flags);
> +
> + unpark_parked_vcpus(sdom->dom->cpupool->sched, &parked);
> +
> + out:
> + set_timer(sdom->repl_timer, sdom->next_repl);
> +}
> +
> #ifndef NDEBUG
> static inline void
> csched2_vcpu_check(struct vcpu *vc)
> @@ -1658,6 +2035,9 @@ csched2_alloc_vdata(const struct scheduler *ops, struct vcpu *vc, void *dd)
> }
> svc->tickled_cpu = -1;
>
> +
Rest, looks good to me.
Thanks
Anshul
>
* Re: [PATCH v2 1/4] xen: credit2: implement utilization cap
2017-08-24 19:42 ` Anshul Makkar
@ 2017-09-05 17:53 ` Dario Faggioli
0 siblings, 0 replies; 10+ messages in thread
From: Dario Faggioli @ 2017-09-05 17:53 UTC (permalink / raw)
To: Anshul Makkar, xen-devel
Cc: George Dunlap, Andrew Cooper, Wei Liu, Ian Jackson, Jan Beulich
On Thu, 2017-08-24 at 20:42 +0100, Anshul Makkar wrote:
> On 8/18/17 4:50 PM, Dario Faggioli wrote:
> >
> > @@ -1515,7 +1633,16 @@ static void reset_credit(const struct
> > scheduler *ops, int cpu, s_time_t now,
> > * that the credit it has spent so far get accounted.
> > */
> > if ( svc->vcpu == curr_on_cpu(svc_cpu) )
> > + {
> > burn_credits(rqd, svc, now);
> > + /*
> > + * And, similarly, in case it has run out of budget,
> > as a
> > + * consequence of this round of accounting, we also
> > must inform
> > + * its pCPU that it's time to park it, and pick up
> > someone else.
> > + */
> > + if ( unlikely(svc->budget <= 0) )
> > + tickle_cpu(svc_cpu, rqd);
>
> This is for accounting of credit. Why it willl impact the budget. Do
> you
> intend to refer that
> budget of current vcpu expired while doing calculation for credit ??
>
burn_credits() does budget accounting too now. So, it's entirely
possible that the vCPU has actually run out of budget, and we figure that
out now (and we should take appropriate action!).
> > @@ -1571,27 +1698,35 @@ void burn_credits(struct
> > csched2_runqueue_data *rqd,
> >
> > delta = now - svc->start_time;
> >
> > - if ( likely(delta > 0) )
> > - {
> > - SCHED_STAT_CRANK(burn_credits_t2c);
> > - t2c_update(rqd, delta, svc);
> > - svc->start_time = now;
> > - }
> > - else if ( delta < 0 )
> > + if ( unlikely(delta <= 0) )
> > {
> >
> > +static void replenish_domain_budget(void* data)
> > +{
> > + struct csched2_dom *sdom = data;
> > + unsigned long flags;
> > + s_time_t now;
> > + LIST_HEAD(parked);
> > +
> > + spin_lock_irqsave(&sdom->budget_lock, flags);
> > +
> > + now = NOW();
> > +
> > + /*
> > + * Let's do the replenishment. Note, though, that a domain may
> > overrun,
> > + * which means the budget would have gone below 0 (reasons may
> > be system
> > + * overbooking, accounting issues, etc.). It also may happen
> > that we are
> > + * handling the replenishment (much) later than we should
> > (reasons may
> > + * again be overbooking, or issues with timers).
> > + *
> > + * Even in cases of overrun or delay, however, we expect that
> > in 99% of
> > + * cases, doing just one replenishment will be good enough for
> > being able
> > + * to unpark the vCPUs that are waiting for some budget.
> > + */
> > + do_replenish(sdom);
> > +
> > + /*
> > + * And now, the special cases:
> > + * 1) if we are late enough to have skipped (at least) one
> > full period,
> > + * what we must do is doing more replenishments. Note that,
> > however,
> > + * every time we add tot_budget to the budget, we also move
> > next_repl
> > + * away by CSCHED2_BDGT_REPL_PERIOD, to make sure the cap is
> > always
> > + * respected.
> > + */
> > + if ( unlikely(sdom->next_repl <= now) )
> > + {
> > + do
> > + do_replenish(sdom);
> > + while ( sdom->next_repl <= now );
> > + }
>
> Just a bit confused. Have you seen this kind of scenario. Please can
> you
> explain it.
> Is this condition necessary.
>
This was discussed (with George) during v1 review. It's a corner case,
which should never happen, and I in fact have never seen it happening
in my tests.
But we can't rule out that it won't occur, so it makes sense to deal
with it (instead of just ignoring it, causing the cap mechanism to
[temporarily] malfunction / become inaccurate).
> > + /*
> > + * 2) if we overrun by more than tot_budget, then
> > budget+tot_budget is
> > + * still < 0, which means that we can't unpark the vCPUs.
> > Let's bail,
> > + * and wait for future replenishments.
> > + */
> > + if ( unlikely(sdom->budget <= 0) )
> > + {
> > + spin_unlock_irqrestore(&sdom->budget_lock, flags);
> > + goto out;
> > + }
>
> "if we overran by more than tot_budget in previous run", make is
> more
> clear..
>
Mmm... perhaps, but not so much, IMO. It's already quite clear which
time window we are referring to, and I don't feel like re-sending
for this.
Let's see if there are other comments/requests.
> Rest, looks good to me.
>
Thanks and Regards,
Dario
--
<<This happens because I choose it to happen!>> (Raistlin Majere)
-----------------------------------------------------------------
Dario Faggioli, Ph.D, http://about.me/dario.faggioli
Senior Software Engineer, Citrix Systems R&D Ltd., Cambridge (UK)
* Re: [PATCH v2 1/4] xen: credit2: implement utilization cap
2017-08-18 15:50 ` [PATCH v2 1/4] xen: credit2: implement utilization cap Dario Faggioli
2017-08-24 19:42 ` Anshul Makkar
@ 2017-09-14 16:20 ` George Dunlap
2017-09-14 16:32 ` George Dunlap
1 sibling, 1 reply; 10+ messages in thread
From: George Dunlap @ 2017-09-14 16:20 UTC (permalink / raw)
To: Dario Faggioli, xen-devel
Cc: Wei Liu, George Dunlap, Andrew Cooper, Ian Jackson, Anshul Makkar,
Jan Beulich
On 08/18/2017 04:50 PM, Dario Faggioli wrote:
> This commit implements the Xen part of the cap mechanism for
> Credit2.
>
> A cap is how much, in terms of % of physical CPU time, a domain
> can execute at most.
>
> For instance, a domain that must not use more than 1/4 of
> one physical CPU, must have a cap of 25%; one that must not
> use more than 1+1/2 of physical CPU time, must be given a cap
> of 150%.
>
> Caps are per domain, so it is all a domain's vCPUs, cumulatively,
> that will be forced to execute no more than the decided amount.
>
> This is implemented by giving each domain a 'budget', and
> using a (per-domain again) periodic timer. Values of budget
> and 'period' are chosen so that budget/period is equal to the
> cap itself.
>
> Budget is burned by the domain's vCPUs, in a similar way to
> how credits are.
>
> When a domain runs out of budget, its vCPUs can't run any
> longer. They can run again when the budget is replenished by
> the timer, which happens once every period.
>
> Blocking the vCPUs because of lack of budget happens by
> means of a new (_VPF_parked) pause flag, so that, e.g.,
> vcpu_runnable() still works. This is similar to what is
> done in sched_rtds.c, as opposed to what happens in
> sched_credit.c, where vcpu_pause() and vcpu_unpause() are used
> (which means, among other things, more overhead).
>
> Note that, while adding new fields to csched2_vcpu and
> csched2_dom, currently existing members are being moved
> around, to achieve best placement inside cache lines.
>
> Note also that xenalyze and tools/xentrace/format are being
> updated too.
>
> Signed-off-by: Dario Faggioli <dario.faggioli@citrix.com>
Looks good, with one minor nit...
> + /*
> + * NB: we give the whole remaining budget a domain has, to the first
> + * vCPU that comes here and asks for it. This means that, in a domain
> + * with a cap, only 1 vCPU is able to run, at any given time.
> + * /THIS IS GOING TO CHANGE/ in subsequent patches, toward something
> + * that allows much better fairness and parallelism. Proceeding in
> + * two steps, is for making things easy to understand, when looking
> + * at the signle commits.
*single
But I can fix that up on check-in.
Reviewed-by: George Dunlap <george.dunlap@citrix.com>
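The budget/period relationship described in the quoted commit message can be shown with a small worked example. The helper below is illustrative only (it is not part of the patch); the values assume a hypothetical 10ms period expressed in microseconds.

```c
#include <assert.h>

/* Illustrative only: cap is a percentage of one physical CPU, and the
 * per-period budget is chosen so that budget/period == cap.
 * E.g. with a 10000us (10ms) period, a 25% cap (1/4 of one pCPU)
 * yields 2500us of budget per period, and a 150% cap (1+1/2 pCPUs)
 * yields 15000us. */
static long budget_from_cap(long cap_pct, long period_us)
{
    return period_us * cap_pct / 100;
}
```

All of the domain's vCPUs draw from this one per-domain budget, which is what makes the cap cumulative across vCPUs.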
* Re: [PATCH v2 2/4] xen: credit2: allow to set and get utilization cap
2017-08-18 15:51 ` [PATCH v2 2/4] xen: credit2: allow to set and get " Dario Faggioli
@ 2017-09-14 16:21 ` George Dunlap
0 siblings, 0 replies; 10+ messages in thread
From: George Dunlap @ 2017-09-14 16:21 UTC (permalink / raw)
To: Dario Faggioli, xen-devel
Cc: Wei Liu, George Dunlap, Andrew Cooper, Ian Jackson, Anshul Makkar,
Jan Beulich
On 08/18/2017 04:51 PM, Dario Faggioli wrote:
> As cap is already present in Credit1, as a parameter, all
> the wiring is there already for it to percolate down
> to csched2_dom_cntl() too.
>
> In this commit, we actually deal with it, and implement
> setting, changing or disabling the cap of a domain.
>
> Signed-off-by: Dario Faggioli <dario.faggioli@citrix.com>
Reviewed-by: George Dunlap <george.dunlap@citrix.com>
* Re: [PATCH v2 1/4] xen: credit2: implement utilization cap
2017-09-14 16:20 ` George Dunlap
@ 2017-09-14 16:32 ` George Dunlap
0 siblings, 0 replies; 10+ messages in thread
From: George Dunlap @ 2017-09-14 16:32 UTC (permalink / raw)
To: Dario Faggioli, xen-devel
Cc: Wei Liu, George Dunlap, Andrew Cooper, Ian Jackson, Anshul Makkar,
Jan Beulich
On 09/14/2017 05:20 PM, George Dunlap wrote:
> On 08/18/2017 04:50 PM, Dario Faggioli wrote:
>> This commit implements the Xen part of the cap mechanism for
>> Credit2.
>>
>> A cap is how much, in terms of % of physical CPU time, a domain
>> can execute at most.
>>
>> For instance, a domain that must not use more than 1/4 of
>> one physical CPU, must have a cap of 25%; one that must not
>> use more than 1+1/2 of physical CPU time, must be given a cap
>> of 150%.
>>
>> Caps are per domain, so it is all a domain's vCPUs, cumulatively,
>> that will be forced to execute no more than the decided amount.
>>
>> This is implemented by giving each domain a 'budget', and
>> using a (per-domain again) periodic timer. Values of budget
>> and 'period' are chosen so that budget/period is equal to the
>> cap itself.
>>
>> Budget is burned by the domain's vCPUs, in a similar way to
>> how credits are.
>>
>> When a domain runs out of budget, its vCPUs can't run any
>> longer. They can run again when the budget is replenished by
>> the timer, which happens once every period.
>>
>> Blocking the vCPUs because of lack of budget happens by
>> means of a new (_VPF_parked) pause flag, so that, e.g.,
>> vcpu_runnable() still works. This is similar to what is
>> done in sched_rtds.c, as opposed to what happens in
>> sched_credit.c, where vcpu_pause() and vcpu_unpause() are used
>> (which means, among other things, more overhead).
>>
>> Note that, while adding new fields to csched2_vcpu and
>> csched2_dom, currently existing members are being moved
>> around, to achieve best placement inside cache lines.
>>
>> Note also that xenalyze and tools/xentrace/format are being
>> updated too.
>>
>> Signed-off-by: Dario Faggioli <dario.faggioli@citrix.com>
>
> Looks good, with one minor nit...
>
>
>> + /*
>> + * NB: we give the whole remaining budget a domain has, to the first
>> + * vCPU that comes here and asks for it. This means that, in a domain
>> + * with a cap, only 1 vCPU is able to run, at any given time.
>> + * /THIS IS GOING TO CHANGE/ in subsequent patches, toward something
>> + * that allows much better fairness and parallelism. Proceeding in
>> + * two steps, is for making things easy to understand, when looking
>> + * at the signle commits.
>
> *single
>
> But I can fix that up on check-in.
Well, it turns out it gets clobbered in the 3rd patch anyway. But at
least we have a well-spelled commit history. :-)
-George
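The pause-flag parking mechanism described in the quoted commit message can be sketched as a bitflag scheme. This is an assumption-laden illustration, not the Xen implementation: the flag names and `struct vcpu` here are simplified stand-ins, but it shows why feeding a `_VPF_parked`-style flag into the runnable check is cheaper than a full vcpu_pause()/vcpu_unpause() cycle.

```c
#include <assert.h>
#include <stdbool.h>

/* Illustrative sketch (not the actual Xen code): out-of-budget vCPUs
 * are "parked" by setting a pause flag, so the generic runnable check
 * keeps working without the overhead of a full pause/unpause. */
#define VPF_BLOCKED (1u << 0)
#define VPF_PARKED  (1u << 1)  /* out of budget until next replenishment */

struct vcpu {
    unsigned int pause_flags;
};

static bool vcpu_runnable(const struct vcpu *v)
{
    /* Runnable only if no pause flag is set. */
    return v->pause_flags == 0;
}

static void park(struct vcpu *v)   { v->pause_flags |= VPF_PARKED; }
static void unpark(struct vcpu *v) { v->pause_flags &= ~VPF_PARKED; }
```

Parking and unparking are just bit operations on the flag word, which is the lighter-weight alternative to the sched_credit.c approach the commit message contrasts against.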
end of thread, other threads:[~2017-09-14 16:32 UTC | newest]
Thread overview: 10+ messages
2017-08-18 15:50 [PATCH v2 0/4] xen/tools: Credit2: implement caps Dario Faggioli
2017-08-18 15:50 ` [PATCH v2 1/4] xen: credit2: implement utilization cap Dario Faggioli
2017-08-24 19:42 ` Anshul Makkar
2017-09-05 17:53 ` Dario Faggioli
2017-09-14 16:20 ` George Dunlap
2017-09-14 16:32 ` George Dunlap
2017-08-18 15:51 ` [PATCH v2 2/4] xen: credit2: allow to set and get " Dario Faggioli
2017-09-14 16:21 ` George Dunlap
2017-08-18 15:51 ` [PATCH v2 3/4] xen: credit2: improve distribution of budget (for domains with caps) Dario Faggioli
2017-08-18 15:51 ` [PATCH v2 4/4] libxl/xl: allow to get and set cap on Credit2 Dario Faggioli