From mboxrd@z Thu Jan  1 00:00:00 1970
From: George Dunlap <george.dunlap@eu.citrix.com>
Subject: Re: [PATCH 2 of 6 v2] xen: sched_credit: improve
 tickling of idle CPUs
Date: Fri, 14 Dec 2012 19:29:30 +0000
Message-ID: <50CB7E1A.8000304@eu.citrix.com>
References: <patchbomb.1355280770@Solace>
	<69860abfe7aec775f781.1355280772@Solace>
Mime-Version: 1.0
Content-Type: text/plain; charset="us-ascii"; Format="flowed"
Content-Transfer-Encoding: 7bit
Return-path: <xen-devel-bounces@lists.xen.org>
In-Reply-To: <69860abfe7aec775f781.1355280772@Solace>
List-Unsubscribe: <http://lists.xen.org/cgi-bin/mailman/options/xen-devel>,
	<mailto:xen-devel-request@lists.xen.org?subject=unsubscribe>
List-Post: <mailto:xen-devel@lists.xen.org>
List-Help: <mailto:xen-devel-request@lists.xen.org?subject=help>
List-Subscribe: <http://lists.xen.org/cgi-bin/mailman/listinfo/xen-devel>,
	<mailto:xen-devel-request@lists.xen.org?subject=subscribe>
Sender: xen-devel-bounces@lists.xen.org
Errors-To: xen-devel-bounces@lists.xen.org
To: Dario Faggioli <dario.faggioli@citrix.com>
Cc: "xen-devel@lists.xensource.com" <xen-devel@lists.xensource.com>, "Keir
	(Xen.org)" <keir@xen.org>, Ian Campbell <Ian.Campbell@citrix.com>
List-Id: xen-devel@lists.xenproject.org

On 12/12/12 02:52, Dario Faggioli wrote:
> Right now, when a VCPU wakes-up, we check whether it should preempt
> what is running on the PCPU, and whether or not the waking VCPU can
> be migrated (by tickling some idlers). However, this can result in
> suboptimal or even wrong behaviour, as explained here:
>
>   http://lists.xen.org/archives/html/xen-devel/2012-10/msg01732.html
>
> This change, instead, when deciding which PCPU(s) to tickle, upon
> VCPU wake-up, considers both what it is likely to happen on the PCPU
> where the wakeup occurs,and whether or not there are idlers where
> the woken-up VCPU can run. In fact, if there are, we can avoid
> interrupting the running VCPU. Only in case there aren't any of
> these PCPUs, preemption and migration are the way to go.
>
> This has been tested (on top of the previous change) by running
> the following benchmarks inside 2, 6 and 10 VMs, concurrently, on
> a shared host, each with 2 VCPUs and 960 MB of memory (host had 16
> ways and 12 GB RAM).
>
> 1) All VMs had 'cpus="all"' in their config file.
>
> $ sysbench --test=cpu ... (time, lower is better)
>   | VMs | w/o this change         | w/ this change          |
>   | 2   | 50.078467 +/- 1.6676162 | 49.673667 +/- 0.0094321 |
>   | 6   | 63.259472 +/- 0.1137586 | 61.680011 +/- 1.0208723 |
>   | 10  | 91.246797 +/- 0.1154008 | 90.396720 +/- 1.5900423 |
> $ sysbench --test=memory ... (throughput, higher is better)
>   | VMs | w/o this change         | w/ this change          |
>   | 2   | 485.56333 +/- 6.0527356 | 487.83167 +/- 0.7602850 |
>   | 6   | 401.36278 +/- 1.9745916 | 409.96778 +/- 3.6761092 |
>   | 10  | 294.43933 +/- 0.8064945 | 302.49033 +/- 0.2343978 |
> $ specjbb2005 ... (throughput, higher is better)
>   | VMs | w/o this change         | w/ this change          |
>   | 2   | 43150.63 +/- 1359.5616  | 43275.427 +/- 606.28185 |
>   | 6   | 29274.29 +/- 1024.4042  | 29716.189 +/- 1290.1878 |
>   | 10  | 19061.28 +/- 512.88561  | 19192.599 +/- 605.66058 |
>
>
> 2) All VMs had their VCPUs statically pinned to the host's PCPUs.
>
> $ sysbench --test=cpu ... (time, lower is better)
>   | VMs | w/o this change         | w/ this change          |
>   | 2   | 47.8211   +/- 0.0215504 | 47.826900 +/- 0.0077872 |
>   | 6   | 62.689122 +/- 0.0877173 | 62.764539 +/- 0.3882493 |
>   | 10  | 90.321097 +/- 1.4803867 | 89.974570 +/- 1.1437566 |
> $ sysbench --test=memory ... (throughput, higher is better)
>   | VMs | w/o this change         | w/ this change          |
>   | 2   | 550.97667 +/- 2.3512355 | 550.87000 +/- 0.8140792 |
>   | 6   | 443.15000 +/- 5.7471797 | 454.01056 +/- 8.4373466 |
>   | 10  | 313.89233 +/- 1.3237493 | 321.81167 +/- 0.3528418 |
> $ specjbb2005 ... (throughput, higher is better)
>   | 2   | 49591.057 +/- 952.93384 | 49594.195 +/- 799.57976 |
>   | 6   | 33538.247 +/- 1089.2115 | 33671.758 +/- 1077.6806 |
>   | 10  | 21927.870 +/- 831.88742 | 21891.131 +/- 563.37929 |
>
>
> Numbers show how the change has either no or very limited impact
> (specjbb2005 case) or, when it does have some impact, that is a
> real improvement in performances (sysbench-memory case).
>
> Signed-off-by: Dario Faggioli <dario.faggioli@citrix.com>
> ---
> Changes from v1:
>   * Rewritten as per George's suggestion, in order to improve readability.
>   * Killed some of the stats, with the only exception of `tickle_idlers_none'
>     and `tickle_idlers_some'. They don't make things look that terrible and
>     I think they could be useful.
>   * The preemption+migration of the currently running VCPU has been turned
>     into a migration request, instead than just tickling. I traced this
>     thing some more, and it looks like that is the way to go. Tickling is
>     not effective here, because the woken PCPU would expect cur to be out
>     of the scheduler tail, which is likely false (cur->is_running is
>     still set to 1).

Ah, right. :-)

Acked-by: George Dunlap <george.dunlap@eu.citrix.com>

>
> diff --git a/xen/common/sched_credit.c b/xen/common/sched_credit.c
> --- a/xen/common/sched_credit.c
> +++ b/xen/common/sched_credit.c
> @@ -133,6 +133,7 @@ struct csched_vcpu {
>           uint32_t state_idle;
>           uint32_t migrate_q;
>           uint32_t migrate_r;
> +        uint32_t kicked_away;
>       } stats;
>   #endif
>   };
> @@ -251,54 +252,67 @@ static inline void
>       struct csched_vcpu * const cur =
>           CSCHED_VCPU(per_cpu(schedule_data, cpu).curr);
>       struct csched_private *prv = CSCHED_PRIV(per_cpu(scheduler, cpu));
> -    cpumask_t mask;
> +    cpumask_t mask, idle_mask;
> +    int idlers_empty;
>   
>       ASSERT(cur);
>       cpumask_clear(&mask);
>   
> -    /* If strictly higher priority than current VCPU, signal the CPU */
> -    if ( new->pri > cur->pri )
> +    idlers_empty = cpumask_empty(prv->idlers);
> +
> +    /*
> +     * If the pcpu is idle, or there are no idlers and the new
> +     * vcpu is a higher priority than the old vcpu, run it here.
> +     *
> +     * If there are idle cpus, first try to find one suitable to run
> +     * new, so we can avoid preempting cur.  If we cannot find a
> +     * suitable idler on which to run new, run it here, but try to
> +     * find a suitable idler on which to run cur instead.
> +     */
> +    if ( cur->pri == CSCHED_PRI_IDLE
> +         || (idlers_empty && new->pri > cur->pri) )
>       {
> -        if ( cur->pri == CSCHED_PRI_IDLE )
> -            SCHED_STAT_CRANK(tickle_local_idler);
> -        else if ( cur->pri == CSCHED_PRI_TS_OVER )
> -            SCHED_STAT_CRANK(tickle_local_over);
> -        else if ( cur->pri == CSCHED_PRI_TS_UNDER )
> -            SCHED_STAT_CRANK(tickle_local_under);
> -        else
> -            SCHED_STAT_CRANK(tickle_local_other);
> -
> +        if ( cur->pri != CSCHED_PRI_IDLE )
> +            SCHED_STAT_CRANK(tickle_idlers_none);
>           cpumask_set_cpu(cpu, &mask);
>       }
> +    else if ( !idlers_empty )
> +    {
> +        /* Check whether or not there are idlers that can run new */
> +        cpumask_and(&idle_mask, prv->idlers, new->vcpu->cpu_affinity);
>   
> -    /*
> -     * If this CPU has at least two runnable VCPUs, we tickle any idlers to
> -     * let them know there is runnable work in the system...
> -     */
> -    if ( cur->pri > CSCHED_PRI_IDLE )
> -    {
> -        if ( cpumask_empty(prv->idlers) )
> +        /*
> +         * If there are no suitable idlers for new, and it's higher
> +         * priority than cur, ask the scheduler to migrate cur away.
> +         * We have to act like this (instead of just waking some of
> +         * the idlers suitable for cur) because cur is running.
> +         *
> +         * If there are suitable idlers for new, no matter priorities,
> +         * leave cur alone (as it is running and is, likely, cache-hot)
> +         * and wake some of them (which is waking up and so is, likely,
> +         * cache cold anyway).
> +         */
> +        if ( cpumask_empty(&idle_mask) && new->pri > cur->pri )
>           {
>               SCHED_STAT_CRANK(tickle_idlers_none);
> +            SCHED_VCPU_STAT_CRANK(cur, kicked_away);
> +            SCHED_VCPU_STAT_CRANK(cur, migrate_r);
> +            SCHED_STAT_CRANK(migrate_kicked_away);
> +            set_bit(_VPF_migrating, &cur->vcpu->pause_flags);
> +            cpumask_set_cpu(cpu, &mask);
>           }
> -        else
> +        else if ( !cpumask_empty(&idle_mask) )
>           {
> -            cpumask_t idle_mask;
> -
> -            cpumask_and(&idle_mask, prv->idlers, new->vcpu->cpu_affinity);
> -            if ( !cpumask_empty(&idle_mask) )
> +            /* Which of the idlers suitable for new shall we wake up? */
> +            SCHED_STAT_CRANK(tickle_idlers_some);
> +            if ( opt_tickle_one_idle )
>               {
> -                SCHED_STAT_CRANK(tickle_idlers_some);
> -                if ( opt_tickle_one_idle )
> -                {
> -                    this_cpu(last_tickle_cpu) =
> -                        cpumask_cycle(this_cpu(last_tickle_cpu), &idle_mask);
> -                    cpumask_set_cpu(this_cpu(last_tickle_cpu), &mask);
> -                }
> -                else
> -                    cpumask_or(&mask, &mask, &idle_mask);
> +                this_cpu(last_tickle_cpu) =
> +                    cpumask_cycle(this_cpu(last_tickle_cpu), &idle_mask);
> +                cpumask_set_cpu(this_cpu(last_tickle_cpu), &mask);
>               }
> -            cpumask_and(&mask, &mask, new->vcpu->cpu_affinity);
> +            else
> +                cpumask_or(&mask, &mask, &idle_mask);
>           }
>       }
>   
> @@ -1456,13 +1470,14 @@ csched_dump_vcpu(struct csched_vcpu *svc
>       {
>           printk(" credit=%i [w=%u]", atomic_read(&svc->credit), sdom->weight);
>   #ifdef CSCHED_STATS
> -        printk(" (%d+%u) {a/i=%u/%u m=%u+%u}",
> +        printk(" (%d+%u) {a/i=%u/%u m=%u+%u (k=%u)}",
>                   svc->stats.credit_last,
>                   svc->stats.credit_incr,
>                   svc->stats.state_active,
>                   svc->stats.state_idle,
>                   svc->stats.migrate_q,
> -                svc->stats.migrate_r);
> +                svc->stats.migrate_r,
> +                svc->stats.kicked_away);
>   #endif
>       }
>   
> diff --git a/xen/include/xen/perfc_defn.h b/xen/include/xen/perfc_defn.h
> --- a/xen/include/xen/perfc_defn.h
> +++ b/xen/include/xen/perfc_defn.h
> @@ -39,10 +39,6 @@ PERFCOUNTER(vcpu_wake_runnable,     "csc
>   PERFCOUNTER(vcpu_wake_not_runnable, "csched: vcpu_wake_not_runnable")
>   PERFCOUNTER(vcpu_park,              "csched: vcpu_park")
>   PERFCOUNTER(vcpu_unpark,            "csched: vcpu_unpark")
> -PERFCOUNTER(tickle_local_idler,     "csched: tickle_local_idler")
> -PERFCOUNTER(tickle_local_over,      "csched: tickle_local_over")
> -PERFCOUNTER(tickle_local_under,     "csched: tickle_local_under")
> -PERFCOUNTER(tickle_local_other,     "csched: tickle_local_other")
>   PERFCOUNTER(tickle_idlers_none,     "csched: tickle_idlers_none")
>   PERFCOUNTER(tickle_idlers_some,     "csched: tickle_idlers_some")
>   PERFCOUNTER(load_balance_idle,      "csched: load_balance_idle")
> @@ -52,6 +48,7 @@ PERFCOUNTER(steal_trylock_failed,   "csc
>   PERFCOUNTER(steal_peer_idle,        "csched: steal_peer_idle")
>   PERFCOUNTER(migrate_queued,         "csched: migrate_queued")
>   PERFCOUNTER(migrate_running,        "csched: migrate_running")
> +PERFCOUNTER(migrate_kicked_away,    "csched: migrate_kicked_away")
>   PERFCOUNTER(vcpu_hot,               "csched: vcpu_hot")
>   
>   PERFCOUNTER(need_flush_tlb_flush,   "PG_need_flush tlb flushes")