From mboxrd@z Thu Jan 1 00:00:00 1970
From: George Dunlap
Subject: Re: [PATCH 1 of 3] xen: sched_credit, improve tickling of idle CPUs
Date: Wed, 5 Dec 2012 12:16:52 +0000
Message-ID: <50BF3B34.5030501@eu.citrix.com>
References: 
Mime-Version: 1.0
Content-Type: multipart/mixed; boundary="------------000604080007020604090108"
Return-path: 
In-Reply-To: 
List-Unsubscribe: ,
List-Post: 
List-Help: 
List-Subscribe: ,
Sender: xen-devel-bounces@lists.xen.org
Errors-To: xen-devel-bounces@lists.xen.org
To: Dario Faggioli
Cc: xen-devel, "Keir (Xen.org)"
List-Id: xen-devel@lists.xenproject.org

--------------000604080007020604090108
Content-Type: text/plain; charset="ISO-8859-1"; format=flowed
Content-Transfer-Encoding: 7bit

On 03/12/12 16:34, Dario Faggioli wrote:
> Right now, when a VCPU wakes up, we check whether it should preempt
> what is running on the PCPU, and whether or not the waking VCPU can
> be migrated (by tickling some idlers). However, this can result in
> suboptimal or even wrong behaviour, as explained here:
>
>  http://lists.xen.org/archives/html/xen-devel/2012-10/msg01732.html
>
> This change, instead, when deciding which PCPUs to tickle upon VCPU
> wake-up, considers both what is likely to happen on the PCPU where
> the wakeup occurs, and whether or not there are idle PCPUs on which
> the waking VCPU could run.
> In fact, if there are idlers where the new VCPU can run, we can
> avoid interrupting the running VCPU. OTOH, if there are no such
> PCPUs, preemption and migration are the way to go.
>
> This has been tested by running the following benchmarks inside 2,
> 6 and 10 VMs concurrently, on a shared host, each with 2 VCPUs and
> 960 MB of memory (host has 16 ways and 12 GB RAM).
>
> 1) All VMs had 'cpus="all"' in their config file.
>
> $ sysbench --test=cpu ... (time, lower is better)
>  | VMs | w/o this change         | w/ this change          |
>  |  2  | 50.078467 +/- 1.6676162 | 49.704933 +/- 0.0277184 |
>  |  6  | 63.259472 +/- 0.1137586 | 62.227367 +/- 0.3880619 |
>  | 10  | 91.246797 +/- 0.1154008 | 91.174820 +/- 0.0928781 |
> $ sysbench --test=memory ... (throughput, higher is better)
>  | VMs | w/o this change         | w/ this change          |
>  |  2  | 485.56333 +/- 6.0527356 | 525.57833 +/- 25.085826 |
>  |  6  | 401.36278 +/- 1.9745916 | 421.96111 +/- 9.0364048 |
>  | 10  | 294.43933 +/- 0.8064945 | 302.49033 +/- 0.2343978 |
> $ specjbb2005 ... (throughput, higher is better)
>  | VMs | w/o this change         | w/ this change          |
>  |  2  | 43150.63  +/- 1359.5616 | 42720.632 +/- 1937.4488 |
>  |  6  | 29274.29  +/- 1024.4042 | 29518.171 +/- 1014.5239 |
>  | 10  | 19061.28  +/- 512.88561 | 19050.141 +/- 458.77327 |
>
> 2) All VMs had their VCPUs statically pinned to the host's PCPUs.
>
> $ sysbench --test=cpu ... (time, lower is better)
>  | VMs | w/o this change         | w/ this change          |
>  |  2  | 47.8211   +/- 0.0215504 | 47.826900 +/- 0.0077872 |
>  |  6  | 62.689122 +/- 0.0877173 | 62.764539 +/- 0.3882493 |
>  | 10  | 90.321097 +/- 1.4803867 | 89.974570 +/- 1.1437566 |
> $ sysbench --test=memory ... (throughput, higher is better)
>  | VMs | w/o this change         | w/ this change          |
>  |  2  | 550.97667 +/- 2.3512355 | 550.87000 +/- 0.8140792 |
>  |  6  | 443.15000 +/- 5.7471797 | 454.01056 +/- 8.4373466 |
>  | 10  | 313.89233 +/- 1.3237493 | 321.81167 +/- 0.3528418 |
> $ specjbb2005 ... (throughput, higher is better)
>  | VMs | w/o this change         | w/ this change          |
>  |  2  | 49591.057 +/- 952.93384 | 49610.98  +/- 1242.1675 |
>  |  6  | 33538.247 +/- 1089.2115 | 33682.222 +/- 1216.1078 |
>  | 10  | 21927.870 +/- 831.88742 | 21801.138 +/- 561.97068 |
>
> Numbers show how the change has either no or very limited impact
> (the specjbb2005 case) or, when it does have some impact, that it
> is an actual improvement in performance, especially in the
> sysbench-memory case.
>
> Signed-off-by: Dario Faggioli

So I think the principle is good, but the resulting set of "if" 
statements makes it hard to figure out what's going on. What do you 
think about re-arranging things, something like the attached?

In this particular version I got rid of the stats, because they require 
if() statements that break up the flow. If we really think they're 
useful, maybe we could have a separate block somewhere for them?

We could actually do without idlers_empty entirely: if we just remove 
the condition from the "else" block, the "right thing" will still 
happen; however, it would mean several unnecessary cpumask operations 
on a busy system.

Thoughts?

 -George

--------------000604080007020604090108
Content-Type: text/plain; charset="UTF-8";
 name="xen_sched_credit_improve_tickling_of_idle_cpus"
Content-Transfer-Encoding: 7bit
Content-Disposition: attachment;
 filename="xen_sched_credit_improve_tickling_of_idle_cpus"

xen: sched_credit, improve tickling of idle CPUs

RFC: Re-organized ifs

Signed-off-by: Dario Faggioli
Signed-off-by: George Dunlap

diff --git a/xen/common/sched_credit.c b/xen/common/sched_credit.c
--- a/xen/common/sched_credit.c
+++ b/xen/common/sched_credit.c
@@ -249,54 +249,53 @@ static inline void
     struct csched_vcpu * const cur = CSCHED_VCPU(per_cpu(schedule_data, cpu).curr);
     struct csched_private *prv = CSCHED_PRIV(per_cpu(scheduler, cpu));
-    cpumask_t mask;
+    cpumask_t mask, idle_mask;
+    int idlers_empty;
 
     ASSERT(cur);
     cpumask_clear(&mask);
 
-    /* If strictly higher priority than current VCPU, signal the CPU */
-    if ( new->pri > cur->pri )
+    idlers_empty = cpumask_empty(prv->idlers);
+
+    /*
+     * If the pcpu is idle, or there are no idlers and the new
+     * vcpu is a higher priority than the old vcpu, run it here.
+     *
+     * If there are idle cpus, first try to find one suitable to run
+     * "new", so we can avoid preempting cur.  If we cannot find a
+     * suitable idler on which to run "new", run it here, but try to
+     * find a suitable idler on which to run "cur" instead.
+     */
+    if ( cur->pri == CSCHED_PRI_IDLE
+         || (idlers_empty && new->pri > cur->pri) )
     {
-        if ( cur->pri == CSCHED_PRI_IDLE )
-            SCHED_STAT_CRANK(tickle_local_idler);
-        else if ( cur->pri == CSCHED_PRI_TS_OVER )
-            SCHED_STAT_CRANK(tickle_local_over);
-        else if ( cur->pri == CSCHED_PRI_TS_UNDER )
-            SCHED_STAT_CRANK(tickle_local_under);
-        else
-            SCHED_STAT_CRANK(tickle_local_other);
-
         cpumask_set_cpu(cpu, &mask);
     }
+    else if ( !idlers_empty )
+    {
+        /* Check whether or not there are idlers that can run new */
+        cpumask_and(&idle_mask, prv->idlers, new->vcpu->cpu_affinity);
 
-    /*
-     * If this CPU has at least two runnable VCPUs, we tickle any idlers to
-     * let them know there is runnable work in the system...
-     */
-    if ( cur->pri > CSCHED_PRI_IDLE )
-    {
-        if ( cpumask_empty(prv->idlers) )
+        /* If there are no suitable idlers for new, and it's higher
+         * priority than cur, wake up the current cpu, but also
+         * look for idlers suitable for cur. */
+        if ( cpumask_empty(&idle_mask) && new->pri > cur->pri )
         {
-            SCHED_STAT_CRANK(tickle_idlers_none);
+            cpumask_set_cpu(cpu, &mask);
+            cpumask_and(&idle_mask, prv->idlers, cur->vcpu->cpu_affinity);
         }
-        else
+
+        /* Which of the idlers shall we wake up? */
+        if ( !cpumask_empty(&idle_mask) )
         {
-            cpumask_t idle_mask;
-
-            cpumask_and(&idle_mask, prv->idlers, new->vcpu->cpu_affinity);
-            if ( !cpumask_empty(&idle_mask) )
+            SCHED_STAT_CRANK(tickle_idlers_some);
+            if ( opt_tickle_one_idle )
             {
-                SCHED_STAT_CRANK(tickle_idlers_some);
-                if ( opt_tickle_one_idle )
-                {
-                    this_cpu(last_tickle_cpu) =
-                        cpumask_cycle(this_cpu(last_tickle_cpu), &idle_mask);
-                    cpumask_set_cpu(this_cpu(last_tickle_cpu), &mask);
-                }
-                else
-                    cpumask_or(&mask, &mask, &idle_mask);
+                this_cpu(last_tickle_cpu) =
+                    cpumask_cycle(this_cpu(last_tickle_cpu), &idle_mask);
+                cpumask_set_cpu(this_cpu(last_tickle_cpu), &mask);
             }
-            cpumask_and(&mask, &mask, new->vcpu->cpu_affinity);
+            else
+                cpumask_or(&mask, &mask, &idle_mask);
         }
     }

--------------000604080007020604090108
Content-Type: text/plain; charset="us-ascii"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
Content-Disposition: inline

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
http://lists.xen.org/xen-devel

--------------000604080007020604090108--