From mboxrd@z Thu Jan 1 00:00:00 1970
From: George Dunlap
Subject: Re: [PATCH 1 of 3] xen: sched_credit, improve tickling of idle CPUs
Date: Wed, 5 Dec 2012 12:16:52 +0000
Message-ID: <50BF3B34.5030501@eu.citrix.com>
References: 
Mime-Version: 1.0
Content-Type: multipart/mixed; boundary="------------000604080007020604090108"
Return-path: 
In-Reply-To: 
List-Unsubscribe: ,
List-Post: 
List-Help: 
List-Subscribe: ,
Sender: xen-devel-bounces@lists.xen.org
Errors-To: xen-devel-bounces@lists.xen.org
To: Dario Faggioli
Cc: xen-devel, "Keir (Xen.org)"
List-Id: xen-devel@lists.xenproject.org

--------------000604080007020604090108
Content-Type: text/plain; charset="ISO-8859-1"; format=flowed
Content-Transfer-Encoding: 7bit

On 03/12/12 16:34, Dario Faggioli wrote:
> Right now, when a VCPU wakes up, we check whether it should preempt
> what is running on the PCPU, and whether or not the waking VCPU can
> be migrated (by tickling some idlers). However, this can result in
> suboptimal or even wrong behaviour, as explained here:
>
>  http://lists.xen.org/archives/html/xen-devel/2012-10/msg01732.html
>
> This change, instead, when deciding which PCPUs to tickle upon VCPU
> wake-up, considers both what is likely to happen on the PCPU where
> the wakeup occurs, and whether or not there are idle PCPUs on which
> the waking VCPU could run.
> In fact, if there are idlers where the new VCPU can run, we can
> avoid interrupting the running VCPU. OTOH, if there are no such
> PCPUs, preemption and migration are the way to go.
>
> This has been tested by running the following benchmarks inside 2,
> 6 and 10 VMs concurrently, on a shared host, each with 2 VCPUs and
> 960 MB of memory (host has 16 ways and 12 GB RAM).
>
> 1) All VMs had 'cpus="all"' in their config file.
>
> $ sysbench --test=cpu ... (time, lower is better)
>  | VMs | w/o this change         | w/ this change          |
>  |  2  | 50.078467 +/- 1.6676162 | 49.704933 +/- 0.0277184 |
>  |  6  | 63.259472 +/- 0.1137586 | 62.227367 +/- 0.3880619 |
>  | 10  | 91.246797 +/- 0.1154008 | 91.174820 +/- 0.0928781 |
> $ sysbench --test=memory ... (throughput, higher is better)
>  | VMs | w/o this change         | w/ this change          |
>  |  2  | 485.56333 +/- 6.0527356 | 525.57833 +/- 25.085826 |
>  |  6  | 401.36278 +/- 1.9745916 | 421.96111 +/- 9.0364048 |
>  | 10  | 294.43933 +/- 0.8064945 | 302.49033 +/- 0.2343978 |
> $ specjbb2005 ... (throughput, higher is better)
>  | VMs | w/o this change         | w/ this change          |
>  |  2  | 43150.63  +/- 1359.5616 | 42720.632 +/- 1937.4488 |
>  |  6  | 29274.29  +/- 1024.4042 | 29518.171 +/- 1014.5239 |
>  | 10  | 19061.28  +/- 512.88561 | 19050.141 +/- 458.77327 |
>
> 2) All VMs had their VCPUs statically pinned to the host's PCPUs.
>
> $ sysbench --test=cpu ... (time, lower is better)
>  | VMs | w/o this change         | w/ this change          |
>  |  2  | 47.8211   +/- 0.0215504 | 47.826900 +/- 0.0077872 |
>  |  6  | 62.689122 +/- 0.0877173 | 62.764539 +/- 0.3882493 |
>  | 10  | 90.321097 +/- 1.4803867 | 89.974570 +/- 1.1437566 |
> $ sysbench --test=memory ... (throughput, higher is better)
>  | VMs | w/o this change         | w/ this change          |
>  |  2  | 550.97667 +/- 2.3512355 | 550.87000 +/- 0.8140792 |
>  |  6  | 443.15000 +/- 5.7471797 | 454.01056 +/- 8.4373466 |
>  | 10  | 313.89233 +/- 1.3237493 | 321.81167 +/- 0.3528418 |
> $ specjbb2005 ... (throughput, higher is better)
>  | VMs | w/o this change         | w/ this change          |
>  |  2  | 49591.057 +/- 952.93384 | 49610.98  +/- 1242.1675 |
>  |  6  | 33538.247 +/- 1089.2115 | 33682.222 +/- 1216.1078 |
>  | 10  | 21927.870 +/- 831.88742 | 21801.138 +/- 561.97068 |
>
> Numbers show how the change has either no or very limited impact
> (the specjbb2005 case) or, when it does have some impact, that it
> is an actual improvement in performance, especially in the
> sysbench-memory case.
>
> Signed-off-by: Dario Faggioli

So I think the principle is good, but the resulting set of "if" 
statements makes it hard to figure out what's going on. What do you 
think about re-arranging things, something like the attached?

In this particular version I got rid of the stats, because they require 
if() statements that break up the flow. If we really think they're 
useful, maybe we could have a separate block somewhere for them?

We could actually do without idlers_empty entirely: if we just remove 
the condition from the "else" block, the "right thing" will still 
happen; however, it would mean several unnecessary cpumask operations 
on a busy system.

Thoughts?

 -George

--------------000604080007020604090108
Content-Type: text/plain; charset="UTF-8";
 name="xen_sched_credit_improve_tickling_of_idle_cpus"
Content-Transfer-Encoding: 7bit
Content-Disposition: attachment;
 filename="xen_sched_credit_improve_tickling_of_idle_cpus"

xen: sched_credit, improve tickling of idle CPUs

RFC: Re-organized ifs

Signed-off-by: Dario Faggioli
Signed-off-by: George Dunlap

diff --git a/xen/common/sched_credit.c b/xen/common/sched_credit.c
--- a/xen/common/sched_credit.c
+++ b/xen/common/sched_credit.c
@@ -249,54 +249,53 @@ static inline void
     struct csched_vcpu * const cur = CSCHED_VCPU(per_cpu(schedule_data, cpu).curr);
     struct csched_private *prv = CSCHED_PRIV(per_cpu(scheduler, cpu));
-    cpumask_t mask;
+    cpumask_t mask, idle_mask;
+    int idlers_empty;
 
     ASSERT(cur);
     cpumask_clear(&mask);
 
-    /* If strictly higher priority than current VCPU, signal the CPU */
-    if ( new->pri > cur->pri )
+    idlers_empty = cpumask_empty(prv->idlers);
+
+    /*
+     * If the pcpu is idle, or there are no idlers and the new
+     * vcpu is a higher priority than the old vcpu, run it here.
+     *
+     * If there are idle cpus, first try to find one suitable to run
+     * "new", so we can avoid preempting cur.  If we cannot find a
+     * suitable idler on which to run "new", run it here, but try to
+     * find a suitable idler on which to run "cur" instead.
+     */
+    if ( cur->pri == CSCHED_PRI_IDLE
+         || (idlers_empty && new->pri > cur->pri) )
     {
-        if ( cur->pri == CSCHED_PRI_IDLE )
-            SCHED_STAT_CRANK(tickle_local_idler);
-        else if ( cur->pri == CSCHED_PRI_TS_OVER )
-            SCHED_STAT_CRANK(tickle_local_over);
-        else if ( cur->pri == CSCHED_PRI_TS_UNDER )
-            SCHED_STAT_CRANK(tickle_local_under);
-        else
-            SCHED_STAT_CRANK(tickle_local_other);
-
         cpumask_set_cpu(cpu, &mask);
     }
+    else if ( !idlers_empty )
+    {
+        /* Check whether or not there are idlers that can run new */
+        cpumask_and(&idle_mask, prv->idlers, new->vcpu->cpu_affinity);
 
-    /*
-     * If this CPU has at least two runnable VCPUs, we tickle any idlers to
-     * let them know there is runnable work in the system...
-     */
-    if ( cur->pri > CSCHED_PRI_IDLE )
-    {
-        if ( cpumask_empty(prv->idlers) )
+        /* If there are no suitable idlers for new, and it's higher
+         * priority than cur, wake up the current cpu, but also
+         * look for idlers suitable for cur. */
+        if ( cpumask_empty(&idle_mask) && new->pri > cur->pri )
         {
-            SCHED_STAT_CRANK(tickle_idlers_none);
+            cpumask_set_cpu(cpu, &mask);
+            cpumask_and(&idle_mask, prv->idlers, cur->vcpu->cpu_affinity);
         }
-        else
+
+        /* Which of the idlers shall we wake up? */
+        if ( !cpumask_empty(&idle_mask) )
         {
-            cpumask_t idle_mask;
-
-            cpumask_and(&idle_mask, prv->idlers, new->vcpu->cpu_affinity);
-            if ( !cpumask_empty(&idle_mask) )
+            SCHED_STAT_CRANK(tickle_idlers_some);
+            if ( opt_tickle_one_idle )
             {
-                SCHED_STAT_CRANK(tickle_idlers_some);
-                if ( opt_tickle_one_idle )
-                {
-                    this_cpu(last_tickle_cpu) =
-                        cpumask_cycle(this_cpu(last_tickle_cpu), &idle_mask);
-                    cpumask_set_cpu(this_cpu(last_tickle_cpu), &mask);
-                }
-                else
-                    cpumask_or(&mask, &mask, &idle_mask);
+                this_cpu(last_tickle_cpu) =
+                    cpumask_cycle(this_cpu(last_tickle_cpu), &idle_mask);
+                cpumask_set_cpu(this_cpu(last_tickle_cpu), &mask);
             }
-            cpumask_and(&mask, &mask, new->vcpu->cpu_affinity);
+            else
+                cpumask_or(&mask, &mask, &idle_mask);
         }
     }

--------------000604080007020604090108
Content-Type: text/plain; charset="us-ascii"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
Content-Disposition: inline

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
http://lists.xen.org/xen-devel

--------------000604080007020604090108--