From mboxrd@z Thu Jan 1 00:00:00 1970 From: George Dunlap Subject: Re: [PATCH 2 of 6 v2] xen: sched_credit: improve tickling of idle CPUs Date: Fri, 14 Dec 2012 19:29:30 +0000 Message-ID: <50CB7E1A.8000304@eu.citrix.com> References: <69860abfe7aec775f781.1355280772@Solace> Mime-Version: 1.0 Content-Type: text/plain; charset="us-ascii"; Format="flowed" Content-Transfer-Encoding: 7bit Return-path: In-Reply-To: <69860abfe7aec775f781.1355280772@Solace> List-Unsubscribe: , List-Post: List-Help: List-Subscribe: , Sender: xen-devel-bounces@lists.xen.org Errors-To: xen-devel-bounces@lists.xen.org To: Dario Faggioli Cc: "xen-devel@lists.xensource.com" , "Keir (Xen.org)" , Ian Campbell List-Id: xen-devel@lists.xenproject.org On 12/12/12 02:52, Dario Faggioli wrote: > Right now, when a VCPU wakes-up, we check whether it should preempt > what is running on the PCPU, and whether or not the waking VCPU can > be migrated (by tickling some idlers). However, this can result in > suboptimal or even wrong behaviour, as explained here: > > http://lists.xen.org/archives/html/xen-devel/2012-10/msg01732.html > > This change, instead, when deciding which PCPU(s) to tickle, upon > VCPU wake-up, considers both what it is likely to happen on the PCPU > where the wakeup occurs,and whether or not there are idlers where > the woken-up VCPU can run. In fact, if there are, we can avoid > interrupting the running VCPU. Only in case there aren't any of > these PCPUs, preemption and migration are the way to go. > > This has been tested (on top of the previous change) by running > the following benchmarks inside 2, 6 and 10 VMs, concurrently, on > a shared host, each with 2 VCPUs and 960 MB of memory (host had 16 > ways and 12 GB RAM). > > 1) All VMs had 'cpus="all"' in their config file. > > $ sysbench --test=cpu ... (time, lower is better) > | VMs | w/o this change | w/ this change | > | 2 | 50.078467 +/- 1.6676162 | 49.673667 +/- 0.0094321 | > | 6 | 63.259472 +/- 0.1137586 | 61.680011 +/- 1.0208723 | > | 10 | 91.246797 +/- 0.1154008 | 90.396720 +/- 1.5900423 | > $ sysbench --test=memory ... (throughput, higher is better) > | VMs | w/o this change | w/ this change | > | 2 | 485.56333 +/- 6.0527356 | 487.83167 +/- 0.7602850 | > | 6 | 401.36278 +/- 1.9745916 | 409.96778 +/- 3.6761092 | > | 10 | 294.43933 +/- 0.8064945 | 302.49033 +/- 0.2343978 | > $ specjbb2005 ... (throughput, higher is better) > | VMs | w/o this change | w/ this change | > | 2 | 43150.63 +/- 1359.5616 | 43275.427 +/- 606.28185 | > | 6 | 29274.29 +/- 1024.4042 | 29716.189 +/- 1290.1878 | > | 10 | 19061.28 +/- 512.88561 | 19192.599 +/- 605.66058 | > > > 2) All VMs had their VCPUs statically pinned to the host's PCPUs. > > $ sysbench --test=cpu ... (time, lower is better) > | VMs | w/o this change | w/ this change | > | 2 | 47.8211 +/- 0.0215504 | 47.826900 +/- 0.0077872 | > | 6 | 62.689122 +/- 0.0877173 | 62.764539 +/- 0.3882493 | > | 10 | 90.321097 +/- 1.4803867 | 89.974570 +/- 1.1437566 | > $ sysbench --test=memory ... (throughput, higher is better) > | VMs | w/o this change | w/ this change | > | 2 | 550.97667 +/- 2.3512355 | 550.87000 +/- 0.8140792 | > | 6 | 443.15000 +/- 5.7471797 | 454.01056 +/- 8.4373466 | > | 10 | 313.89233 +/- 1.3237493 | 321.81167 +/- 0.3528418 | > $ specjbb2005 ... (throughput, higher is better) > | 2 | 49591.057 +/- 952.93384 | 49594.195 +/- 799.57976 | > | 6 | 33538.247 +/- 1089.2115 | 33671.758 +/- 1077.6806 | > | 10 | 21927.870 +/- 831.88742 | 21891.131 +/- 563.37929 | > > > Numbers show how the change has either no or very limited impact > (specjbb2005 case) or, when it does have some impact, that is a > real improvement in performances (sysbench-memory case). > > Signed-off-by: Dario Faggioli > --- > Changes from v1: > * Rewritten as per George's suggestion, in order to improve readability. > * Killed some of the stats, with the only exception of `tickle_idlers_none' > and `tickle_idlers_some'. They don't make things look that terrible and > I think they could be useful. > * The preemption+migration of the currently running VCPU has been turned > into a migration request, instead than just tickling. I traced this > thing some more, and it looks like that is the way to go. Tickling is > not effective here, because the woken PCPU would expect cur to be out > of the scheduler tail, which is likely false (cur->is_running is > still set to 1). Ah, right. :-) Acked-by: George Dunlap > > diff --git a/xen/common/sched_credit.c b/xen/common/sched_credit.c > --- a/xen/common/sched_credit.c > +++ b/xen/common/sched_credit.c > @@ -133,6 +133,7 @@ struct csched_vcpu { > uint32_t state_idle; > uint32_t migrate_q; > uint32_t migrate_r; > + uint32_t kicked_away; > } stats; > #endif > }; > @@ -251,54 +252,67 @@ static inline void > struct csched_vcpu * const cur = > CSCHED_VCPU(per_cpu(schedule_data, cpu).curr); > struct csched_private *prv = CSCHED_PRIV(per_cpu(scheduler, cpu)); > - cpumask_t mask; > + cpumask_t mask, idle_mask; > + int idlers_empty; > > ASSERT(cur); > cpumask_clear(&mask); > > - /* If strictly higher priority than current VCPU, signal the CPU */ > - if ( new->pri > cur->pri ) > + idlers_empty = cpumask_empty(prv->idlers); > + > + /* > + * If the pcpu is idle, or there are no idlers and the new > + * vcpu is a higher priority than the old vcpu, run it here. > + * > + * If there are idle cpus, first try to find one suitable to run > + * new, so we can avoid preempting cur. If we cannot find a > + * suitable idler on which to run new, run it here, but try to > + * find a suitable idler on which to run cur instead. > + */ > + if ( cur->pri == CSCHED_PRI_IDLE > + || (idlers_empty && new->pri > cur->pri) ) > { > - if ( cur->pri == CSCHED_PRI_IDLE ) > - SCHED_STAT_CRANK(tickle_local_idler); > - else if ( cur->pri == CSCHED_PRI_TS_OVER ) > - SCHED_STAT_CRANK(tickle_local_over); > - else if ( cur->pri == CSCHED_PRI_TS_UNDER ) > - SCHED_STAT_CRANK(tickle_local_under); > - else > - SCHED_STAT_CRANK(tickle_local_other); > - > + if ( cur->pri != CSCHED_PRI_IDLE ) > + SCHED_STAT_CRANK(tickle_idlers_none); > cpumask_set_cpu(cpu, &mask); > } > + else if ( !idlers_empty ) > + { > + /* Check whether or not there are idlers that can run new */ > + cpumask_and(&idle_mask, prv->idlers, new->vcpu->cpu_affinity); > > - /* > - * If this CPU has at least two runnable VCPUs, we tickle any idlers to > - * let them know there is runnable work in the system... > - */ > - if ( cur->pri > CSCHED_PRI_IDLE ) > - { > - if ( cpumask_empty(prv->idlers) ) > + /* > + * If there are no suitable idlers for new, and it's higher > + * priority than cur, ask the scheduler to migrate cur away. > + * We have to act like this (instead of just waking some of > + * the idlers suitable for cur) because cur is running. > + * > + * If there are suitable idlers for new, no matter priorities, > + * leave cur alone (as it is running and is, likely, cache-hot) > + * and wake some of them (which is waking up and so is, likely, > + * cache cold anyway). > + */ > + if ( cpumask_empty(&idle_mask) && new->pri > cur->pri ) > { > SCHED_STAT_CRANK(tickle_idlers_none); > + SCHED_VCPU_STAT_CRANK(cur, kicked_away); > + SCHED_VCPU_STAT_CRANK(cur, migrate_r); > + SCHED_STAT_CRANK(migrate_kicked_away); > + set_bit(_VPF_migrating, &cur->vcpu->pause_flags); > + cpumask_set_cpu(cpu, &mask); > } > - else > + else if ( !cpumask_empty(&idle_mask) ) > { > - cpumask_t idle_mask; > - > - cpumask_and(&idle_mask, prv->idlers, new->vcpu->cpu_affinity); > - if ( !cpumask_empty(&idle_mask) ) > + /* Which of the idlers suitable for new shall we wake up? */ > + SCHED_STAT_CRANK(tickle_idlers_some); > + if ( opt_tickle_one_idle ) > { > - SCHED_STAT_CRANK(tickle_idlers_some); > - if ( opt_tickle_one_idle ) > - { > - this_cpu(last_tickle_cpu) = > - cpumask_cycle(this_cpu(last_tickle_cpu), &idle_mask); > - cpumask_set_cpu(this_cpu(last_tickle_cpu), &mask); > - } > - else > - cpumask_or(&mask, &mask, &idle_mask); > + this_cpu(last_tickle_cpu) = > + cpumask_cycle(this_cpu(last_tickle_cpu), &idle_mask); > + cpumask_set_cpu(this_cpu(last_tickle_cpu), &mask); > } > - cpumask_and(&mask, &mask, new->vcpu->cpu_affinity); > + else > + cpumask_or(&mask, &mask, &idle_mask); > } > } > > @@ -1456,13 +1470,14 @@ csched_dump_vcpu(struct csched_vcpu *svc > { > printk(" credit=%i [w=%u]", atomic_read(&svc->credit), sdom->weight); > #ifdef CSCHED_STATS > - printk(" (%d+%u) {a/i=%u/%u m=%u+%u}", > + printk(" (%d+%u) {a/i=%u/%u m=%u+%u (k=%u)}", > svc->stats.credit_last, > svc->stats.credit_incr, > svc->stats.state_active, > svc->stats.state_idle, > svc->stats.migrate_q, > - svc->stats.migrate_r); > + svc->stats.migrate_r, > + svc->stats.kicked_away); > #endif > } > > diff --git a/xen/include/xen/perfc_defn.h b/xen/include/xen/perfc_defn.h > --- a/xen/include/xen/perfc_defn.h > +++ b/xen/include/xen/perfc_defn.h > @@ -39,10 +39,6 @@ PERFCOUNTER(vcpu_wake_runnable, "csc > PERFCOUNTER(vcpu_wake_not_runnable, "csched: vcpu_wake_not_runnable") > PERFCOUNTER(vcpu_park, "csched: vcpu_park") > PERFCOUNTER(vcpu_unpark, "csched: vcpu_unpark") > -PERFCOUNTER(tickle_local_idler, "csched: tickle_local_idler") > -PERFCOUNTER(tickle_local_over, "csched: tickle_local_over") > -PERFCOUNTER(tickle_local_under, "csched: tickle_local_under") > -PERFCOUNTER(tickle_local_other, "csched: tickle_local_other") > PERFCOUNTER(tickle_idlers_none, "csched: tickle_idlers_none") > PERFCOUNTER(tickle_idlers_some, "csched: tickle_idlers_some") > PERFCOUNTER(load_balance_idle, "csched: load_balance_idle") > @@ -52,6 +48,7 @@ PERFCOUNTER(steal_trylock_failed, "csc > PERFCOUNTER(steal_peer_idle, "csched: steal_peer_idle") > PERFCOUNTER(migrate_queued, "csched: migrate_queued") > PERFCOUNTER(migrate_running, "csched: migrate_running") > +PERFCOUNTER(migrate_kicked_away, "csched: migrate_kicked_away") > PERFCOUNTER(vcpu_hot, "csched: vcpu_hot") > > PERFCOUNTER(need_flush_tlb_flush, "PG_need_flush tlb flushes")