From: Dario Faggioli <dfaggioli@suse.com>
To: xen-devel@lists.xenproject.org
Cc: George Dunlap <george.dunlap@eu.citrix.com>,
	Wei Liu <wei.liu2@citrix.com>,
	Ian Jackson <ian.jackson@eu.citrix.com>
Subject: [RFC PATCH v1 02/16] xen: Credit1: always steal from pcpus with runnable but not running vcpus
Date: Sat, 25 Aug 2018 01:35:26 +0200
Message-ID: <153515372661.8598.14583085291840756167.stgit@Istar.fritz.box>
In-Reply-To: <153515305655.8598.6054293649487840735.stgit@Istar.fritz.box>

Currently, if a pcpu is idle but has some runnable vcpus in its
runqueue, the other pcpus (e.g., when they go idle themselves) never
try to steal any of those runnable vcpus.

This is because Credit1 assumes that, when a vcpu wakes up, if the pcpu
on which it is queued is idle, that very pcpu is _always_ tickled, and
will _always_ have the chance to pick up someone from its own runqueue.

While true for now, this is going to change. In fact:
- we want to be able to tickle a pcpu different from the one where the
  vcpu has been queued, if that is better for load distribution (or for
  honoring soft-affinity);
- when introducing SMT-aware domain co-scheduling, it can happen that a
  pcpu, even if idle, can't pick up vcpus from its own runqueue, and
  hence we do want other pcpus to be able to try to steal and run them.

Let's, therefore, remove the logic implementing such an assumption,
making sure that runnable vcpus can be picked up from the runqueue of
whatever pcpu they happen to be queued on.

We also make it clearer, in the tracing output, whether a cpu is being
considered or skipped (and, if the latter, why) during load balancing.

Signed-off-by: Dario Faggioli <dfaggioli@suse.com>
---
Cc: George Dunlap <george.dunlap@eu.citrix.com>
Cc: Ian Jackson <ian.jackson@eu.citrix.com>
Cc: Wei Liu <wei.liu2@citrix.com>
---
TODO:
- for tracing, deal with xentrace's formats too;
- we now check a lot of pcpus. Although we do that without taking their
  runqueue locks, the overhead may be significant on big systems. In
  order to avoid that, we could introduce a new cpumask, tracking the
  'overloaded' pcpus (i.e., the pcpus which have at least one queued,
  runnable vcpu), and only look at them (see the sketch below).
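
  A purely illustrative sketch of that idea follows (not part of this
  patch: the mask name, its placement, and the helper names are all
  made up). The mask would be kept up to date from the same places that
  already maintain nr_runnable, and then used when building the
  'workers' mask in csched_load_balance():

    /*
     * Sketch only: in reality the mask would live in struct
     * csched_private, next to prv->idlers; 'overloaded_cpus' and the
     * helpers below are hypothetical names.
     */
    static cpumask_t overloaded_cpus;

    /* Callers hold the pcpu's runqueue lock, like inc/dec_nr_runnable(). */
    static inline void mark_overloaded(struct csched_pcpu *spc, unsigned int cpu)
    {
        if ( spc->nr_runnable++ == 0 )
            cpumask_set_cpu(cpu, &overloaded_cpus);   /* 0 -> 1 */
    }

    static inline void clear_overloaded(struct csched_pcpu *spc, unsigned int cpu)
    {
        ASSERT(spc->nr_runnable >= 1);
        if ( --spc->nr_runnable == 0 )
            cpumask_clear_cpu(cpu, &overloaded_cpus); /* 1 -> 0 */
    }

    /* In csched_load_balance(), only scan overloaded pcpus of the node: */
    cpumask_and(&workers, &overloaded_cpus, &node_to_cpumask(peer_node));
    __cpumask_clear_cpu(cpu, &workers);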
---
 tools/xentrace/xenalyze.c |   11 +++++++----
 xen/common/sched_credit.c |   29 +++++++++++++----------------
 2 files changed, 20 insertions(+), 20 deletions(-)

diff --git a/tools/xentrace/xenalyze.c b/tools/xentrace/xenalyze.c
index 5ed0a12327..a1e2531945 100644
--- a/tools/xentrace/xenalyze.c
+++ b/tools/xentrace/xenalyze.c
@@ -7655,12 +7655,15 @@ void sched_process(struct pcpu_info *p)
         case TRC_SCHED_CLASS_EVT(CSCHED, 11): /* STEAL_CHECK   */
             if(opt.dump_all) {
                 struct {
-                    unsigned int peer_cpu, check;
+                    unsigned int peer_cpu;
+                    int check;
                 } *r = (typeof(r))ri->d;
 
-                printf(" %s csched:load_balance %s %u\n",
-                       ri->dump_header, r->check ? "checking" : "skipping",
-                       r->peer_cpu);
+                printf(" %s csched:load_balance cpu %u: %s (nr_runnable=%d)\n",
+                       ri->dump_header, r->peer_cpu,
+                       r->check == 0 ? "skipping" :
+                           (r->check < 0 ? "locked" : "checking" ),
+                       r->check < 0 ? -r->check : r->check);
             }
             break;
         /* CREDIT 2 (TRC_CSCHED2_xxx) */
diff --git a/xen/common/sched_credit.c b/xen/common/sched_credit.c
index b8765ac6df..48d80993b1 100644
--- a/xen/common/sched_credit.c
+++ b/xen/common/sched_credit.c
@@ -1606,11 +1606,8 @@ csched_runq_steal(int peer_cpu, int cpu, int pri, int balance_step)
 
     ASSERT(peer_pcpu != NULL);
 
-    /*
-     * Don't steal from an idle CPU's runq because it's about to
-     * pick up work from it itself.
-     */
-    if ( unlikely(is_idle_vcpu(curr_on_cpu(peer_cpu))) )
+    /* Don't steal from pcpus which have no runnable vcpu queued. */
+    if ( unlikely(peer_pcpu->nr_runnable == 0) )
         goto out;
 
     list_for_each( iter, &peer_pcpu->runq )
@@ -1695,15 +1692,15 @@ csched_load_balance(struct csched_private *prv, int cpu,
 
     /*
      * Let's look around for work to steal, taking both hard affinity
-     * and soft affinity into account. More specifically, we check all
-     * the non-idle CPUs' runq, looking for:
+     * and soft affinity into account. More specifically, we check the
+     * runqs of those CPUs that have runnable vCPUs queued, looking for:
      *  1. any "soft-affine work" to steal first,
      *  2. if not finding anything, any "hard-affine work" to steal.
      */
     for_each_affinity_balance_step( bstep )
     {
         /*
-         * We peek at the non-idling CPUs in a node-wise fashion. In fact,
+         * We peek at the CPUs in a node-wise fashion. In fact,
          * it is more likely that we find some affine work on our same
          * node, not to mention that migrating vcpus within the same node
          * could well expected to be cheaper than across-nodes (memory
@@ -1713,8 +1710,7 @@ csched_load_balance(struct csched_private *prv, int cpu,
         do
         {
             /* Select the pCPUs in this node that have work we can steal. */
-            cpumask_andnot(&workers, online, prv->idlers);
-            cpumask_and(&workers, &workers, &node_to_cpumask(peer_node));
+            cpumask_and(&workers, online, &node_to_cpumask(peer_node));
             __cpumask_clear_cpu(cpu, &workers);
 
             first_cpu = cpumask_cycle(prv->balance_bias[peer_node], &workers);
@@ -1723,6 +1719,7 @@ csched_load_balance(struct csched_private *prv, int cpu,
             peer_cpu = first_cpu;
             do
             {
+                const struct csched_pcpu * const pspc = CSCHED_PCPU(peer_cpu);
                 spinlock_t *lock;
 
                 /*
@@ -1735,9 +1732,9 @@ csched_load_balance(struct csched_private *prv, int cpu,
                  * In more details:
                  * - if we race with dec_nr_runnable(), we may try to take the
                  *   lock and call csched_runq_steal() for no reason. This is
-                 *   not a functional issue, and should be infrequent enough.
-                 *   And we can avoid that by re-checking nr_runnable after
-                 *   having grabbed the lock, if we want;
+                 *   not a functional issue and, in any case, we re-check
+                 *   nr_runnable inside csched_runq_steal, after grabbing the
+                 *   lock;
                  * - if we race with inc_nr_runnable(), we skip a pCPU that may
                  *   have runnable vCPUs in its runqueue, but that's not a
                  *   problem because:
@@ -1751,7 +1748,7 @@ csched_load_balance(struct csched_private *prv, int cpu,
                  *     optimization), it the pCPU would schedule right after we
                  *     have taken the lock, and hence block on it.
                  */
-                if ( CSCHED_PCPU(peer_cpu)->nr_runnable == 0 )
+                if ( pspc->nr_runnable == 0 )
                 {
                     TRACE_2D(TRC_CSCHED_STEAL_CHECK, peer_cpu, /* skipp'n */ 0);
                     goto next_cpu;
@@ -1769,11 +1766,11 @@ csched_load_balance(struct csched_private *prv, int cpu,
                 if ( !lock )
                 {
                     SCHED_STAT_CRANK(steal_trylock_failed);
-                    TRACE_2D(TRC_CSCHED_STEAL_CHECK, peer_cpu, /* skip */ 0);
+                    TRACE_2D(TRC_CSCHED_STEAL_CHECK, peer_cpu, -pspc->nr_runnable);
                     goto next_cpu;
                 }
 
-                TRACE_2D(TRC_CSCHED_STEAL_CHECK, peer_cpu, /* checked */ 1);
+                TRACE_2D(TRC_CSCHED_STEAL_CHECK, peer_cpu, pspc->nr_runnable);
 
                 /* Any work over there to steal? */
                 speer = cpumask_test_cpu(peer_cpu, online) ?

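Purely for illustration (not part of the patch), here is a tiny
stand-alone program showing how the new STEAL_CHECK payload is meant to
be read, mirroring the xenalyze.c hunk above: 0 means the pcpu was
skipped, a positive value means it was checked (and carries
nr_runnable), and a negative value means its runqueue lock was busy
(with -nr_runnable recorded, so the counter is still visible):

    #include <stdio.h>

    /* Decode the second word of a TRC_CSCHED_STEAL_CHECK record. */
    static void dump_steal_check(unsigned int peer_cpu, int check)
    {
        printf("csched:load_balance cpu %u: %s (nr_runnable=%d)\n",
               peer_cpu,
               check == 0 ? "skipping" : (check < 0 ? "locked" : "checking"),
               check < 0 ? -check : check);
    }

    int main(void)
    {
        dump_steal_check(3, 0);   /* nothing runnable: skipped          */
        dump_steal_check(5, -2);  /* lock busy, 2 runnable vcpus queued */
        dump_steal_check(7, 4);   /* checked, 4 runnable vcpus queued   */
        return 0;
    }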
