[PATCH 0/2] Credit2: introduce per-vcpu hard and soft affinity

All of lore.kernel.org
 help / color / mirror / Atom feed

* [PATCH 0/2] Credit2: introduce per-vcpu hard and soft affinity
@ 2014-11-30  0:33 Justin T. Weaver
  2014-11-30  0:33 ` [PATCH 1/2] sched: credit2: respect per-vcpu hard affinity Justin T. Weaver
                   ` (2 more replies)
  0 siblings, 3 replies; 15+ messages in thread
From: Justin T. Weaver @ 2014-11-30  0:33 UTC (permalink / raw)
  To: xen-devel; +Cc: george.dunlap, dario.faggioli, henric

Hello,

The credit2 scheduler currently ignores per-vcpu hard and soft affinity
masks.

The first patch updates the scheduler to ensure that vcpus only run
on pcpus on which they are allowed to run (hard affinity).

The second patch updates the scheduler to make more informed run queue
assignment decisions by considering which pcpus that vcpus prefer to
run on (soft affinity).

I tested this series on a NUMA machine with Dario Faggioli's "fix
per-socket runqueue setup" patch series applied. Without it, the credit2
scheduler only creates one run queue, regardless of the type of machine.

---
[PATCH 1/2] sched: credit2: respect per-vcpu hard affinity
[PATCH 2/2] sched: credit2: consider per-vcpu soft affinity

 xen/common/sched_credit2.c |  419 ++++++++++++++++++++++++++++++++++++++++----
 1 file changed, 388 insertions(+), 31 deletions(-)

---

Justin Weaver, Masters candidate
University of Hawaiʻi at Mānoa

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
http://lists.xen.org/xen-devel

^ permalink raw reply	[flat|nested] 15+ messages in thread

* [PATCH 1/2] sched: credit2: respect per-vcpu hard affinity
  2014-11-30  0:33 [PATCH 0/2] Credit2: introduce per-vcpu hard and soft affinity Justin T. Weaver
@ 2014-11-30  0:33 ` Justin T. Weaver
  2015-01-12 18:05   ` Dario Faggioli
  2014-11-30  0:33 ` [PATCH 2/2] sched: credit2: consider per-vcpu soft affinity Justin T. Weaver
  2015-01-02 16:33 ` [PATCH 0/2] Credit2: introduce per-vcpu hard and " Dario Faggioli
  2 siblings, 1 reply; 15+ messages in thread
From: Justin T. Weaver @ 2014-11-30  0:33 UTC (permalink / raw)
  To: xen-devel; +Cc: george.dunlap, dario.faggioli, Justin T. Weaver, henric

by making sure that vcpus only run on the pcpu(s) they are allowed to
run on based on their hard affinity cpu masks.

Signed-off-by: Justin T. Weaver <jtweaver@hawaii.edu>
---
 xen/common/sched_credit2.c |  199 +++++++++++++++++++++++++++++++++++++-------
 1 file changed, 171 insertions(+), 28 deletions(-)

diff --git a/xen/common/sched_credit2.c b/xen/common/sched_credit2.c
index 1bcd6c0..90e9cdf 100644
--- a/xen/common/sched_credit2.c
+++ b/xen/common/sched_credit2.c
@@ -501,8 +501,9 @@ runq_tickle(const struct scheduler *ops, unsigned int cpu, struct csched2_vcpu *
         goto tickle;
     }
     
-    /* Get a mask of idle, but not tickled */
+    /* Get a mask of idle, but not tickled, that new is allowed to run on. */
     cpumask_andnot(&mask, &rqd->idle, &rqd->tickled);
+    cpumask_and(&mask, &mask, new->vcpu->cpu_hard_affinity);
     
     /* If it's not empty, choose one */
     i = cpumask_cycle(cpu, &mask);
@@ -513,9 +514,11 @@ runq_tickle(const struct scheduler *ops, unsigned int cpu, struct csched2_vcpu *
     }
 
     /* Otherwise, look for the non-idle cpu with the lowest credit,
-     * skipping cpus which have been tickled but not scheduled yet */
+     * skipping cpus which have been tickled but not scheduled yet,
+     * that new is allowed to run on. */
     cpumask_andnot(&mask, &rqd->active, &rqd->idle);
     cpumask_andnot(&mask, &mask, &rqd->tickled);
+    cpumask_and(&mask, &mask, new->vcpu->cpu_hard_affinity);
 
     for_each_cpu(i, &mask)
     {
@@ -1038,6 +1041,7 @@ choose_cpu(const struct scheduler *ops, struct vcpu *vc)
     int i, min_rqi = -1, new_cpu;
     struct csched2_vcpu *svc = CSCHED2_VCPU(vc);
     s_time_t min_avgload;
+    cpumask_t temp_mask;
 
     BUG_ON(cpumask_empty(&prv->active_queues));
 
@@ -1053,7 +1057,7 @@ choose_cpu(const struct scheduler *ops, struct vcpu *vc)
      *
      * Since one of the runqueue locks is already held, we can't
      * just grab the prv lock.  Instead, we'll have to trylock, and
-     * do something else reasonable if we fail.
+     * return a safe cpu.
      */
 
     if ( !spin_trylock(&prv->lock) )
@@ -1063,9 +1067,23 @@ choose_cpu(const struct scheduler *ops, struct vcpu *vc)
             d2printk("%pv -\n", svc->vcpu);
             clear_bit(__CSFLAG_runq_migrate_request, &svc->flags);
         }
-        /* Leave it where it is for now.  When we actually pay attention
-         * to affinity we'll have to figure something out... */
-        return vc->processor;
+
+        /* Check vc's hard affinity mask with the run queue's active mask. */
+        cpumask_and(&temp_mask, vc->cpu_hard_affinity, &svc->rqd->active);
+        if ( cpumask_empty(&temp_mask) )
+        {
+            /* Can't be assigned to current runqueue; return a safe pcpu. */
+            cpumask_and(&temp_mask, vc->cpu_hard_affinity,
+                cpupool_online_cpumask(vc->domain->cpupool));
+            return cpumask_any(&temp_mask);
+        }
+        else
+            if ( cpumask_test_cpu(vc->processor, &temp_mask) )
+                /* Leave it where it is. */
+                return vc->processor;
+            else
+                /* Same runq, different cpu; affinity must have changed. */
+                return cpumask_any(&temp_mask);
     }
 
     /* First check to see if we're here because someone else suggested a place
@@ -1081,13 +1099,17 @@ choose_cpu(const struct scheduler *ops, struct vcpu *vc)
         else
         {
             d2printk("%pv +\n", svc->vcpu);
-            new_cpu = cpumask_cycle(vc->processor, &svc->migrate_rqd->active);
-            goto out_up;
+            cpumask_and(&temp_mask, vc->cpu_hard_affinity,
+                &svc->migrate_rqd->active);
+            if ( !cpumask_empty(&temp_mask) )
+            {
+                new_cpu = cpumask_any(&temp_mask);
+                goto out_up;
+            }
+            /* Fall-through to normal cpu pick */
         }
     }
 
-    /* FIXME: Pay attention to cpu affinity */                                                                                      
-
     min_avgload = MAX_LOAD;
 
     /* Find the runqueue with the lowest instantaneous load */
@@ -1099,17 +1121,26 @@ choose_cpu(const struct scheduler *ops, struct vcpu *vc)
         rqd = prv->rqd + i;
 
         /* If checking a different runqueue, grab the lock,
-         * read the avg, and then release the lock.
+         * check hard affinity, read the avg, and then release the lock.
          *
          * If on our own runqueue, don't grab or release the lock;
          * but subtract our own load from the runqueue load to simulate
          * impartiality */
         if ( rqd == svc->rqd )
         {
+            cpumask_and(&temp_mask, vc->cpu_hard_affinity, &rqd->active);
+            if ( cpumask_empty(&temp_mask) )
+                continue;
             rqd_avgload = rqd->b_avgload - svc->avgload;
         }
         else if ( spin_trylock(&rqd->lock) )
         {
+            cpumask_and(&temp_mask, vc->cpu_hard_affinity, &rqd->active);
+            if ( cpumask_empty(&temp_mask) )
+            {
+                spin_unlock(&rqd->lock);
+                continue;
+            }
             rqd_avgload = rqd->b_avgload;
             spin_unlock(&rqd->lock);
         }
@@ -1123,12 +1154,30 @@ choose_cpu(const struct scheduler *ops, struct vcpu *vc)
         }
     }
 
-    /* We didn't find anyone (most likely because of spinlock contention); leave it where it is */
     if ( min_rqi == -1 )
-        new_cpu = vc->processor;
+    {
+        /* No runqs found (most likely because of spinlock contention). */
+        cpumask_and(&temp_mask, vc->cpu_hard_affinity, &svc->rqd->active);
+        if ( cpumask_empty(&temp_mask) )
+        {
+            /* Can't be assigned to current runqueue; return a safe pcpu. */
+            cpumask_and(&temp_mask, vc->cpu_hard_affinity,
+                cpupool_online_cpumask(vc->domain->cpupool));
+            new_cpu = cpumask_any(&temp_mask);
+        }
+        else
+            if ( cpumask_test_cpu(vc->processor, &temp_mask) )
+                /* Leave it where it is. */
+                new_cpu = vc->processor;
+            else
+                /* Same runq, different cpu; affinity must have changed. */
+                new_cpu = cpumask_any(&temp_mask);
+    }
     else
     {
-        new_cpu = cpumask_cycle(vc->processor, &prv->rqd[min_rqi].active);
+        cpumask_and(&temp_mask, vc->cpu_hard_affinity,
+            &prv->rqd[min_rqi].active);
+        new_cpu = cpumask_any(&temp_mask);
         BUG_ON(new_cpu >= nr_cpu_ids);
     }
 
@@ -1197,22 +1246,40 @@ static void migrate(const struct scheduler *ops,
     }
     else
     {
-        int on_runq=0;
-        /* It's not running; just move it */
+        /* It's not running; move it if it's on a different runq than trqd. */
+        bool_t on_runq = 0;
+        cpumask_t temp_mask;
+
         d2printk("%pv %d-%d i\n", svc->vcpu, svc->rqd->id, trqd->id);
+
+        /* Re-assign vcpu's processor, if necessary. */
+        cpumask_and(&temp_mask, svc->vcpu->cpu_hard_affinity, &trqd->active);
+        svc->vcpu->processor = cpumask_any(&temp_mask);
+        if ( !cpumask_test_cpu(svc->vcpu->processor, &temp_mask) )
+            svc->vcpu->processor = cpumask_any(&temp_mask);
+
         if ( __vcpu_on_runq(svc) )
+            on_runq = 1;
+
+        /* If the runqs are different, move svc to trqd. */
+        if ( svc->rqd != trqd )
         {
-            __runq_remove(svc);
-            update_load(ops, svc->rqd, svc, -1, now);
-            on_runq=1;
+            if ( on_runq )
+            {
+                __runq_remove(svc);
+                update_load(ops, svc->rqd, svc, -1, now);
+            }
+            __runq_deassign(svc);
+            __runq_assign(svc, trqd);
+            if ( on_runq )
+            {
+                update_load(ops, svc->rqd, svc, 1, now);
+                runq_insert(ops, svc->vcpu->processor, svc);
+            }
         }
-        __runq_deassign(svc);
-        svc->vcpu->processor = cpumask_any(&trqd->active);
-        __runq_assign(svc, trqd);
+
         if ( on_runq )
         {
-            update_load(ops, svc->rqd, svc, 1, now);
-            runq_insert(ops, svc->vcpu->processor, svc);
             runq_tickle(ops, svc->vcpu->processor, svc, now);
         }
     }
@@ -1224,6 +1291,7 @@ static void balance_load(const struct scheduler *ops, int cpu, s_time_t now)
     struct csched2_private *prv = CSCHED2_PRIV(ops);
     int i, max_delta_rqi = -1;
     struct list_head *push_iter, *pull_iter;
+    cpumask_t temp_mask;
 
     balance_state_t st = { .best_push_svc = NULL, .best_pull_svc = NULL };
     
@@ -1250,6 +1318,11 @@ retry:
     for_each_cpu(i, &prv->active_queues)
     {
         s_time_t delta;
+        /* true if there are no vcpus to push due to hard affinity */
+        bool_t ha_no_push = 1;
+        /* true if there are no vcpus to pull due to hard affinity */
+        bool_t ha_no_pull = 1;
+        struct list_head *iter;
         
         st.orqd = prv->rqd + i;
 
@@ -1257,6 +1330,47 @@ retry:
              || !spin_trylock(&st.orqd->lock) )
             continue;
 
+        /*
+         * If due to hard affinity there are no vcpus that can be
+         * pulled or pushed, move to the next runq in the loop.
+         */
+
+        /* See if there are any vcpus that can be pushed from lrqd to orqd. */
+        list_for_each( iter, &st.lrqd->svc )
+        {
+            struct csched2_vcpu * svc =
+                list_entry(iter, struct csched2_vcpu, rqd_elem);
+            cpumask_and(&temp_mask, svc->vcpu->cpu_hard_affinity,
+                &st.orqd->active);
+            if (!cpumask_empty(&temp_mask))
+            {
+                /* vcpu can be pushed from lrqd to ordq. */
+                ha_no_push = 0;
+                break;
+            }
+        }
+
+        /* See if there are any vcpus that can be pulled from orqd to lrqd. */
+        list_for_each( iter, &st.orqd->svc )
+        {
+            struct csched2_vcpu * svc =
+                list_entry(iter, struct csched2_vcpu, rqd_elem);
+            cpumask_and(&temp_mask, svc->vcpu->cpu_hard_affinity,
+                &st.lrqd->active);
+            if (!cpumask_empty(&temp_mask))
+            {
+                /* vcpu can be pulled from orqd to lrdq. */
+                ha_no_pull = 0;
+                break;
+            }
+        }
+
+        if ( ha_no_push && ha_no_pull )
+        {
+            spin_unlock(&st.orqd->lock);
+            continue;
+        }
+
         __update_runq_load(ops, st.orqd, 0, now);
     
         delta = st.lrqd->b_avgload - st.orqd->b_avgload;
@@ -1330,6 +1444,12 @@ retry:
         if ( test_bit(__CSFLAG_runq_migrate_request, &push_svc->flags) )
             continue;
 
+        /* Skip if it can't run on the destination runq. */
+        cpumask_and(&temp_mask, push_svc->vcpu->cpu_hard_affinity,
+            &st.orqd->active);
+        if ( cpumask_empty(&temp_mask) )
+            continue;
+
         list_for_each( pull_iter, &st.orqd->svc )
         {
             struct csched2_vcpu * pull_svc = list_entry(pull_iter, struct csched2_vcpu, rqd_elem);
@@ -1338,11 +1458,17 @@ retry:
             {
                 __update_svc_load(ops, pull_svc, 0, now);
             }
-        
+
             /* Skip this one if it's already been flagged to migrate */
             if ( test_bit(__CSFLAG_runq_migrate_request, &pull_svc->flags) )
                 continue;
 
+            /* Skip if it can't run on the destination runq. */
+            cpumask_and(&temp_mask, pull_svc->vcpu->cpu_hard_affinity,
+                &st.lrqd->active);
+            if ( cpumask_empty(&temp_mask) )
+                continue;
+
             consider(&st, push_svc, pull_svc);
         }
 
@@ -1355,11 +1481,17 @@ retry:
     list_for_each( pull_iter, &st.orqd->svc )
     {
         struct csched2_vcpu * pull_svc = list_entry(pull_iter, struct csched2_vcpu, rqd_elem);
-        
+
         /* Skip this one if it's already been flagged to migrate */
         if ( test_bit(__CSFLAG_runq_migrate_request, &pull_svc->flags) )
             continue;
 
+        /* Skip if it can't run on the destination runq. */
+        cpumask_and(&temp_mask, pull_svc->vcpu->cpu_hard_affinity,
+            &st.lrqd->active);
+        if ( cpumask_empty(&temp_mask) )
+            continue;
+
         /* Consider pull only */
         consider(&st, NULL, pull_svc);
     }
@@ -1399,8 +1531,12 @@ csched2_vcpu_migrate(
 
     trqd = RQD(ops, new_cpu);
 
-    if ( trqd != svc->rqd )
-        migrate(ops, svc, trqd, NOW());
+    /*
+     * Call migrate even if svc->rqd == trqd; there may have been an
+     * affinity change that requires a call to runq_tickle for a new
+     * processor within the same run queue.
+     */
+    migrate(ops, svc, trqd, NOW());
 }
 
 static int
@@ -1610,6 +1746,13 @@ runq_candidate(struct csched2_runqueue_data *rqd,
     {
         struct csched2_vcpu * svc = list_entry(iter, struct csched2_vcpu, runq_elem);
 
+        /*
+         * If vcpu is not allowed to run on this processor due to
+         * hard affinity, continue to the next vcpu on the queue.
+         */
+        if ( !cpumask_test_cpu(cpu, svc->vcpu->cpu_hard_affinity) )
+            continue;
+
         /* If this is on a different processor, don't pull it unless
          * its credit is at least CSCHED2_MIGRATE_RESIST higher. */
         if ( svc->vcpu->processor != cpu
-- 
1.7.10.4

^ permalink raw reply related	[flat|nested] 15+ messages in thread

* [PATCH 2/2] sched: credit2: consider per-vcpu soft affinity
  2014-11-30  0:33 [PATCH 0/2] Credit2: introduce per-vcpu hard and soft affinity Justin T. Weaver
  2014-11-30  0:33 ` [PATCH 1/2] sched: credit2: respect per-vcpu hard affinity Justin T. Weaver
@ 2014-11-30  0:33 ` Justin T. Weaver
  2015-01-12 18:07   ` Dario Faggioli
  2015-01-13 13:06   ` Dario Faggioli
  2015-01-02 16:33 ` [PATCH 0/2] Credit2: introduce per-vcpu hard and " Dario Faggioli
  2 siblings, 2 replies; 15+ messages in thread
From: Justin T. Weaver @ 2014-11-30  0:33 UTC (permalink / raw)
  To: xen-devel; +Cc: george.dunlap, dario.faggioli, Justin T. Weaver, henric

when deciding which run queue to assign each vcpu to.

There are two main changes in functionality that this patch introduces.

First, in function runq_tickle, it tries to find idle pcpus in other run
queues that the vcpu would prefer to run on (soft affinity) or can run on
(hard affinity), in that order.

Second, in function balance_load, if moving a vcpu with soft affinity from
one run queue to another means moving it away from its soft affinity to hard
affinity only, then that move is not considered for load balancing.

Signed-off-by: Justin T. Weaver <jtweaver@hawaii.edu>
---
 xen/common/sched_credit2.c |  222 +++++++++++++++++++++++++++++++++++++++++++-
 1 file changed, 218 insertions(+), 4 deletions(-)

diff --git a/xen/common/sched_credit2.c b/xen/common/sched_credit2.c
index 90e9cdf..ad867f2 100644
--- a/xen/common/sched_credit2.c
+++ b/xen/common/sched_credit2.c
@@ -127,6 +127,15 @@
 #define CSCHED2_CREDIT_RESET         0
 /* Max timer: Maximum time a guest can be run for. */
 #define CSCHED2_MAX_TIMER            MILLISECS(2)
+/* Two step balancing logic; consider soft affinity first. */
+#define CSCHED2_BALANCE_SOFT_AFFINITY 0
+#define CSCHED2_BALANCE_HARD_AFFINITY 1
+/* vcpu runq migration away from vcpu's soft affinity. */
+#define CSCHED2_MIGRATE_SOFT_TO_HARD 0
+/* vcpu runq migration to vcpu's soft affinity. */
+#define CSCHED2_MIGRATE_HARD_TO_SOFT 1
+/* vcpu runq migration soft to soft or vcpu has no soft affinity. */
+#define CSCHED2_MIGRATE_ANY_TO_ANY   2
 
 
 #define CSCHED2_IDLE_CREDIT                 (-(1<<30))
@@ -176,6 +185,8 @@ integer_param("sched_credit2_migrate_resist", opt_migrate_resist);
 #define c2r(_ops, _cpu)     (CSCHED2_PRIV(_ops)->runq_map[(_cpu)])
 /* CPU to runqueue struct macro */
 #define RQD(_ops, _cpu)     (&CSCHED2_PRIV(_ops)->rqd[c2r(_ops, _cpu)])
+#define for_each_csched2_balance_step(step) \
+    for ( (step) = 0; (step) <= CSCHED2_BALANCE_HARD_AFFINITY; (step)++ )
 
 /*
  * Shifts for load average.
@@ -268,6 +279,35 @@ struct csched2_dom {
     uint16_t nr_vcpus;
 };
 
+/*
+ * A vcpu has meaningful soft affinity if...
+ * - its soft affinity mask is not full, and
+ * - the passed in mask (usually its hard affinity mask) intersects
+ *   with its soft affinity mask
+ */
+static inline int __vcpu_has_soft_affinity(const struct vcpu *vc,
+                                           const cpumask_t *mask)
+{
+    if ( cpumask_full(vc->cpu_soft_affinity) ||
+        !cpumask_intersects(vc->cpu_soft_affinity, mask) )
+        return 0;
+
+    return 1;
+}
+
+static void
+csched2_balance_cpumask(const struct vcpu *vc, int step, cpumask_t *mask)
+{
+    if ( step == CSCHED2_BALANCE_SOFT_AFFINITY )
+    {
+        cpumask_and(mask, vc->cpu_soft_affinity, vc->cpu_hard_affinity);
+
+        if ( unlikely(cpumask_empty(mask)) )
+            cpumask_copy(mask, vc->cpu_hard_affinity);
+    }
+    else /* step == CSCHED2_BALANCE_HARD_AFFINITY */
+        cpumask_copy(mask, vc->cpu_hard_affinity);
+}
 
 /*
  * Time-to-credit, credit-to-time.
@@ -474,6 +514,9 @@ __runq_remove(struct csched2_vcpu *svc)
 }
 
 void burn_credits(struct csched2_runqueue_data *rqd, struct csched2_vcpu *, s_time_t);
+static void __runq_deassign(struct csched2_vcpu *svc);
+static void __runq_assign(struct csched2_vcpu *svc,
+                          struct csched2_runqueue_data *rqd);
 
 /* Check to see if the item on the runqueue is higher priority than what's
  * currently running; if so, wake up the processor */
@@ -485,6 +528,9 @@ runq_tickle(const struct scheduler *ops, unsigned int cpu, struct csched2_vcpu *
     struct csched2_runqueue_data *rqd = RQD(ops, cpu);
     cpumask_t mask;
     struct csched2_vcpu * cur;
+    struct csched2_private *prv = CSCHED2_PRIV(ops);
+    int balance_step = CSCHED2_BALANCE_SOFT_AFFINITY;
+    bool_t prv_lock_held = 0;
 
     d2printk("rqt %pv curr %pv\n", new->vcpu, current);
 
@@ -513,6 +559,68 @@ runq_tickle(const struct scheduler *ops, unsigned int cpu, struct csched2_vcpu *
         goto tickle;
     }
 
+    /* Look for idle cpus in other runqs; consider soft affinity first. */
+    for_each_csched2_balance_step( balance_step )
+    {
+        cpumask_t balance_mask;
+
+        /* Skip the soft affinity balance step if new doesn't have any. */
+        if ( balance_step == CSCHED2_BALANCE_SOFT_AFFINITY &&
+            !__vcpu_has_soft_affinity(
+                new->vcpu, new->vcpu->cpu_hard_affinity) )
+            continue;
+
+        /* Skip this step if can't get a lock on the credit2 private data. */
+        if ( !prv_lock_held || !spin_trylock(&prv->lock) )
+            continue;
+        prv_lock_held = 1;
+
+        csched2_balance_cpumask(new->vcpu, balance_step, &balance_mask);
+
+        for_each_cpu(i, &prv->active_queues)
+        {
+            struct csched2_runqueue_data *temp_rqd;
+
+            temp_rqd = prv->rqd + i;
+
+            if ( temp_rqd == rqd || !spin_trylock(&temp_rqd->lock) )
+                continue;
+
+            /* Find idle cpus in the balance mask that are not tickled. */
+            cpumask_andnot(&mask, &temp_rqd->idle, &temp_rqd->tickled);
+            cpumask_and(&mask, &mask, &balance_mask);
+
+            if ( !cpumask_empty(&mask) )
+            {
+                /* Found an idle cpu on another run queue; move new. */
+                s_time_t now = 0;
+
+                ipid = cpumask_any(&mask);
+                new->vcpu->processor = ipid;
+                __runq_remove(new);
+                now = NOW();
+                update_load(ops, rqd, new, -1, now);
+                __runq_deassign(new);
+                __runq_assign(new, temp_rqd);
+                update_load(ops, temp_rqd, new, 1, now);
+                runq_insert(ops, ipid, new);
+                cpumask_set_cpu(ipid, &temp_rqd->tickled);
+                cpu_raise_softirq(ipid, SCHEDULE_SOFTIRQ);
+
+                spin_unlock(&temp_rqd->lock);
+                spin_unlock(&prv->lock);
+                return;
+            }
+            else
+                /* No suitable idlers found in runq temp_rqd. */
+                spin_unlock(&temp_rqd->lock);
+        }
+
+        if ( prv_lock_held && balance_step == CSCHED2_BALANCE_HARD_AFFINITY )
+            /* No suitable other-runq idlers found; unlock private data. */
+            spin_unlock(&prv->lock);
+    }
+
     /* Otherwise, look for the non-idle cpu with the lowest credit,
      * skipping cpus which have been tickled but not scheduled yet,
      * that new is allowed to run on. */
@@ -1039,12 +1147,22 @@ choose_cpu(const struct scheduler *ops, struct vcpu *vc)
 {
     struct csched2_private *prv = CSCHED2_PRIV(ops);
     int i, min_rqi = -1, new_cpu;
+    int max_soft_cpus = 0, max_soft_cpus_rqi = -1;
+    bool_t consider_soft_affinity = 0;
     struct csched2_vcpu *svc = CSCHED2_VCPU(vc);
     s_time_t min_avgload;
-    cpumask_t temp_mask;
+    cpumask_t temp_mask, vc_soft_affinity;
 
     BUG_ON(cpumask_empty(&prv->active_queues));
 
+    /* Consider soft affinity in the cpu descision? */
+    if ( __vcpu_has_soft_affinity(vc, vc->cpu_hard_affinity) )
+    {
+        consider_soft_affinity = 1;
+        cpumask_and(&vc_soft_affinity, vc->cpu_soft_affinity,
+            vc->cpu_hard_affinity);
+    }
+
     /* Locking:
      * - vc->processor is already locked
      * - Need to grab prv lock to make sure active runqueues don't
@@ -1075,6 +1193,13 @@ choose_cpu(const struct scheduler *ops, struct vcpu *vc)
             /* Can't be assigned to current runqueue; return a safe pcpu. */
             cpumask_and(&temp_mask, vc->cpu_hard_affinity,
                 cpupool_online_cpumask(vc->domain->cpupool));
+            if ( consider_soft_affinity )
+            {
+                cpumask_t safe_soft_mask;
+                cpumask_and(&safe_soft_mask, &vc_soft_affinity, &temp_mask);
+                if ( !cpumask_empty(&safe_soft_mask) )
+                    cpumask_copy(&temp_mask, &safe_soft_mask);
+            }
             return cpumask_any(&temp_mask);
         }
         else
@@ -1112,11 +1237,15 @@ choose_cpu(const struct scheduler *ops, struct vcpu *vc)
 
     min_avgload = MAX_LOAD;
 
-    /* Find the runqueue with the lowest instantaneous load */
+    /*
+     * Find the run queue with the most cpus in vc's soft affniity, or the
+     * the lowest instantaneous load if not considering soft affinity.
+     */
     for_each_cpu(i, &prv->active_queues)
     {
         struct csched2_runqueue_data *rqd;
         s_time_t rqd_avgload;
+        int rqd_soft_cpus = 0;
 
         rqd = prv->rqd + i;
 
@@ -1131,6 +1260,11 @@ choose_cpu(const struct scheduler *ops, struct vcpu *vc)
             cpumask_and(&temp_mask, vc->cpu_hard_affinity, &rqd->active);
             if ( cpumask_empty(&temp_mask) )
                 continue;
+            if ( consider_soft_affinity )
+            {
+                cpumask_and(&temp_mask, &vc_soft_affinity, &rqd->active);
+                rqd_soft_cpus = cpumask_weight(&temp_mask);
+            }
             rqd_avgload = rqd->b_avgload - svc->avgload;
         }
         else if ( spin_trylock(&rqd->lock) )
@@ -1141,6 +1275,11 @@ choose_cpu(const struct scheduler *ops, struct vcpu *vc)
                 spin_unlock(&rqd->lock);
                 continue;
             }
+            if ( consider_soft_affinity )
+            {
+                cpumask_and(&temp_mask, &vc_soft_affinity, &rqd->active);
+                rqd_soft_cpus = cpumask_weight(&temp_mask);
+            }
             rqd_avgload = rqd->b_avgload;
             spin_unlock(&rqd->lock);
         }
@@ -1152,9 +1291,14 @@ choose_cpu(const struct scheduler *ops, struct vcpu *vc)
             min_avgload = rqd_avgload;
             min_rqi=i;
         }
+        if ( consider_soft_affinity && rqd_soft_cpus > max_soft_cpus )
+        {
+            max_soft_cpus = rqd_soft_cpus;
+            max_soft_cpus_rqi = i;
+        }
     }
 
-    if ( min_rqi == -1 )
+    if ( min_rqi == -1 && max_soft_cpus_rqi == -1 )
     {
         /* No runqs found (most likely because of spinlock contention). */
         cpumask_and(&temp_mask, vc->cpu_hard_affinity, &svc->rqd->active);
@@ -1163,6 +1307,13 @@ choose_cpu(const struct scheduler *ops, struct vcpu *vc)
             /* Can't be assigned to current runqueue; return a safe pcpu. */
             cpumask_and(&temp_mask, vc->cpu_hard_affinity,
                 cpupool_online_cpumask(vc->domain->cpupool));
+            if ( consider_soft_affinity )
+            {
+                cpumask_t safe_soft_mask;
+                cpumask_and(&safe_soft_mask, &vc_soft_affinity, &temp_mask);
+                if ( !cpumask_empty(&safe_soft_mask) )
+                    cpumask_copy(&temp_mask, &safe_soft_mask);
+            }
             new_cpu = cpumask_any(&temp_mask);
         }
         else
@@ -1173,8 +1324,16 @@ choose_cpu(const struct scheduler *ops, struct vcpu *vc)
                 /* Same runq, different cpu; affinity must have changed. */
                 new_cpu = cpumask_any(&temp_mask);
     }
+    else if ( max_soft_cpus_rqi != -1 )
+    {
+        /* Prefer soft affinity here over minimizing run queue load. */
+        cpumask_and(&temp_mask, &vc_soft_affinity,
+            &prv->rqd[max_soft_cpus_rqi].active);
+        new_cpu = cpumask_any(&temp_mask);
+    }
     else
     {
+        /* vc does not have soft affinity; use the rq that minimizes load. */
         cpumask_and(&temp_mask, vc->cpu_hard_affinity,
             &prv->rqd[min_rqi].active);
         new_cpu = cpumask_any(&temp_mask);
@@ -1197,6 +1356,46 @@ typedef struct {
     struct csched2_runqueue_data *orqd;                  
 } balance_state_t;
 
+static int classify_vcpu_migration(const struct vcpu *vc,
+    const struct csched2_runqueue_data *src_rqd,
+    const struct csched2_runqueue_data *dst_rqd)
+{
+    cpumask_t soft_mask, temp_mask;
+
+    /*
+     * Must already hold the locks on src_rqd and dst_rqd.
+     * Function assumes vc has at least hard affinity with one or more
+     * pcpus in both the source and destination run queues.
+     */
+
+    /* Does vcpu not have soft affinity? */
+    if ( !__vcpu_has_soft_affinity(vc, vc->cpu_hard_affinity) )
+        return CSCHED2_MIGRATE_ANY_TO_ANY;
+    else
+        cpumask_and(&soft_mask, vc->cpu_soft_affinity, vc->cpu_hard_affinity);
+
+    /* Does vcpu have soft affinity with pcpu(s) in the source runq? */
+    cpumask_and(&temp_mask, &soft_mask, &src_rqd->active);
+    if ( !cpumask_empty(&temp_mask) )
+    {
+        /* ... and soft affinity with the destination runq? */
+        cpumask_and(&temp_mask, &soft_mask, &dst_rqd->active);
+        if ( !cpumask_empty(&temp_mask) )
+            return CSCHED2_MIGRATE_ANY_TO_ANY;
+        else
+            return CSCHED2_MIGRATE_SOFT_TO_HARD;
+    }
+    else
+    {
+        /* Does vcpu only have soft affinity with the destination runq? */
+        cpumask_and(&temp_mask, &soft_mask, &dst_rqd->active);
+        if ( !cpumask_empty(&temp_mask) )
+            return CSCHED2_MIGRATE_HARD_TO_SOFT;
+        else
+            return CSCHED2_MIGRATE_ANY_TO_ANY;
+    }
+}
+
 static void consider(balance_state_t *st, 
                      struct csched2_vcpu *push_svc,
                      struct csched2_vcpu *pull_svc)
@@ -1294,7 +1493,7 @@ static void balance_load(const struct scheduler *ops, int cpu, s_time_t now)
     cpumask_t temp_mask;
 
     balance_state_t st = { .best_push_svc = NULL, .best_pull_svc = NULL };
-    
+
     /*
      * Basic algorithm: Push, pull, or swap.
      * - Find the runqueue with the furthest load distance
@@ -1450,6 +1649,11 @@ retry:
         if ( cpumask_empty(&temp_mask) )
             continue;
 
+        /* Skip if result is moving vcpu away from its soft affinity. */
+        if ( classify_vcpu_migration(push_svc->vcpu, st.lrqd, st.orqd) ==
+            CSCHED2_MIGRATE_SOFT_TO_HARD )
+            continue;
+
         list_for_each( pull_iter, &st.orqd->svc )
         {
             struct csched2_vcpu * pull_svc = list_entry(pull_iter, struct csched2_vcpu, rqd_elem);
@@ -1469,6 +1673,11 @@ retry:
             if ( cpumask_empty(&temp_mask) )
                 continue;
 
+            /* Skip if result is moving vcpu away from its soft affinity. */
+            if ( classify_vcpu_migration(pull_svc->vcpu, st.orqd, st.lrqd) ==
+                CSCHED2_MIGRATE_SOFT_TO_HARD )
+                continue;
+
             consider(&st, push_svc, pull_svc);
         }
 
@@ -1492,6 +1701,11 @@ retry:
         if ( cpumask_empty(&temp_mask) )
             continue;
 
+        /* Skip if result is moving vcpu away from its soft affinity. */
+        if ( classify_vcpu_migration(pull_svc->vcpu, st.orqd, st.lrqd) ==
+            CSCHED2_MIGRATE_SOFT_TO_HARD )
+            continue;
+
         /* Consider pull only */
         consider(&st, NULL, pull_svc);
     }
-- 
1.7.10.4

^ permalink raw reply related	[flat|nested] 15+ messages in thread

* Re: [PATCH 0/2] Credit2: introduce per-vcpu hard and soft affinity
  2014-11-30  0:33 [PATCH 0/2] Credit2: introduce per-vcpu hard and soft affinity Justin T. Weaver
  2014-11-30  0:33 ` [PATCH 1/2] sched: credit2: respect per-vcpu hard affinity Justin T. Weaver
  2014-11-30  0:33 ` [PATCH 2/2] sched: credit2: consider per-vcpu soft affinity Justin T. Weaver
@ 2015-01-02 16:33 ` Dario Faggioli
  2015-01-04 21:32   ` Justin Weaver
  2 siblings, 1 reply; 15+ messages in thread
From: Dario Faggioli @ 2015-01-02 16:33 UTC (permalink / raw)
  To: Justin T. Weaver; +Cc: george.dunlap, henric, xen-devel


[-- Attachment #1.1: Type: text/plain, Size: 1541 bytes --]

On Sat, 2014-11-29 at 14:33 -1000, Justin T. Weaver wrote:
> Hello,
> 
Hi Justin,

First of all, this is, overall, a really good submission. The only thing
you need to add is a few references.

For example, you already sent something in the past (an RFC, IIRC).
Mention it in the cover letter (I personally also add a link to the
thread).

> I tested this series on a NUMA machine with Dario Faggioli's "fix
> per-socket runqueue setup" patch series applied. Without it, the credit2
> scheduler only creates one run queue, regardless of the type of machine.
> 
Yes, I'll resubmit an updated version of that series soon.

So, how did the testing went? Since you're not saying anything, I
imagine that, at least functionally, everything works, at least as far
as you can tell, isn't it so? :-)

At least as far as soft-affinity is concerned, it should not be too
difficult to collect some numbers, to show that it is bringing benefits,
at least for some workloads.

Do you think you'll be able to do something like that?

When doing this same thing for credit1, here's what I did:
http://lists.xen.org/archives/html/xen-devel/2013-02/msg00009.html
https://blog.xenproject.org/2013/03/14/numa-aware-scheduling-development-report/

Regards,
Dario

-- 
<<This happens because I choose it to happen!>> (Raistlin Majere)
-----------------------------------------------------------------
Dario Faggioli, Ph.D, http://about.me/dario.faggioli
Senior Software Engineer, Citrix Systems R&D Ltd., Cambridge (UK)


[-- Attachment #1.2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 181 bytes --]

[-- Attachment #2: Type: text/plain, Size: 126 bytes --]

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
http://lists.xen.org/xen-devel

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH 0/2] Credit2: introduce per-vcpu hard and soft affinity
  2015-01-02 16:33 ` [PATCH 0/2] Credit2: introduce per-vcpu hard and " Dario Faggioli
@ 2015-01-04 21:32   ` Justin Weaver
  0 siblings, 0 replies; 15+ messages in thread
From: Justin Weaver @ 2015-01-04 21:32 UTC (permalink / raw)
  To: Dario Faggioli; +Cc: George Dunlap, Henri Casanova, xen-devel

Dario,

Thank you for the initial feedback!

> First of all, this is, overall, a really good submission. The only thing
> you need to add is a few references.
>
> For example, you already sent something in the past (an RFC, IIRC).
> Mention it in the cover letter (I personally also add a link to the
> thread).

Yes I will add references in the cover letter as you suggest.

> So, how did the testing went? Since you're not saying anything, I
> imagine that, at least functionally, everything works, at least as far
> as you can tell, isn't it so? :-)

I tested the hard affinity patch and I observed that the pinned VCPUs
always ran on their assigned PCPUs. I used vcpu-pin to change the
assignments which also worked. I tested the soft affinity patch with
workloads that I believed would cause the VPCUs to stay on their soft
PCPUs and only move away from them to hard affinity only in certain
circumstances (all soft busy, hard idle). I will add more specifics to
the cover letter.

> At least as far as soft-affinity is concerned, it should not be too
> difficult to collect some numbers, to show that it is bringing benefits,
> at least for some workloads.
>
> Do you think you'll be able to do something like that?

Yes!

> When doing this same thing for credit1, here's what I did:
> http://lists.xen.org/archives/html/xen-devel/2013-02/msg00009.html
> https://blog.xenproject.org/2013/03/14/numa-aware-scheduling-development-report/

I will take a look at these and try to run additional workloads. If
you have any other ideas about workloads I should try, please let me
know.

Thank you!
Justin Weaver

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH 1/2] sched: credit2: respect per-vcpu hard affinity
  2014-11-30  0:33 ` [PATCH 1/2] sched: credit2: respect per-vcpu hard affinity Justin T. Weaver
@ 2015-01-12 18:05   ` Dario Faggioli
  2015-01-20  7:21     ` Justin Weaver
  0 siblings, 1 reply; 15+ messages in thread
From: Dario Faggioli @ 2015-01-12 18:05 UTC (permalink / raw)
  To: Justin T. Weaver; +Cc: george.dunlap, henric, xen-devel


[-- Attachment #1.1: Type: text/plain, Size: 18582 bytes --]

On Sat, 2014-11-29 at 14:33 -1000, Justin T. Weaver wrote:
> by making sure that vcpus only run on the pcpu(s) they are allowed to
> run on based on their hard affinity cpu masks.
> 
> Signed-off-by: Justin T. Weaver <jtweaver@hawaii.edu>
>
Still about referencing "history", one usually puts here a quick summary
of what changed in each patch, wrt the previously submitted version.

This is actually not really important in this case, since _a_lot_ of
things changed, as it usually happens when moving from RFC to actual
submissions... But consider doing something like this for next round.

Style apart, this overall is a very good submission. Since last time,
you really made huge progresses, and are now tackling the issue with a
good approach. I have a few questions on the approach itself, and some
comments on the code, but I'd say you are on the right track.

> diff --git a/xen/common/sched_credit2.c b/xen/common/sched_credit2.c
> index 1bcd6c0..90e9cdf 100644
> --- a/xen/common/sched_credit2.c
> +++ b/xen/common/sched_credit2.c
> @@ -501,8 +501,9 @@ runq_tickle(const struct scheduler *ops, unsigned int cpu, struct csched2_vcpu *
>          goto tickle;
>      }
>      
> -    /* Get a mask of idle, but not tickled */
> +    /* Get a mask of idle, but not tickled, that new is allowed to run on. */
>      cpumask_andnot(&mask, &rqd->idle, &rqd->tickled);
> +    cpumask_and(&mask, &mask, new->vcpu->cpu_hard_affinity);
>
This looks ok.

>      /* If it's not empty, choose one */
>      i = cpumask_cycle(cpu, &mask);
> @@ -513,9 +514,11 @@ runq_tickle(const struct scheduler *ops, unsigned int cpu, struct csched2_vcpu *
>      }
>  
>      /* Otherwise, look for the non-idle cpu with the lowest credit,
> -     * skipping cpus which have been tickled but not scheduled yet */
> +     * skipping cpus which have been tickled but not scheduled yet,
> +     * that new is allowed to run on. */
>      cpumask_andnot(&mask, &rqd->active, &rqd->idle);
>      cpumask_andnot(&mask, &mask, &rqd->tickled);
> +    cpumask_and(&mask, &mask, new->vcpu->cpu_hard_affinity);
>
And this too.

I wonder whether there is a way to group the cpumask operations
differently to save at least one one of them (yes, they may be quite
costly on large boxes), but let's leave this for later...

> @@ -1038,6 +1041,7 @@ choose_cpu(const struct scheduler *ops, struct vcpu *vc)
>      int i, min_rqi = -1, new_cpu;
>      struct csched2_vcpu *svc = CSCHED2_VCPU(vc);
>      s_time_t min_avgload;
> +    cpumask_t temp_mask;
>  
Still about cpumasks, we really should try to avoid adding new ones on
the stack like this.

> @@ -1053,7 +1057,7 @@ choose_cpu(const struct scheduler *ops, struct vcpu *vc)
>       *
>       * Since one of the runqueue locks is already held, we can't
>       * just grab the prv lock.  Instead, we'll have to trylock, and
> -     * do something else reasonable if we fail.
> +     * return a safe cpu.
>  
I don't think this change is particularly useful.

> @@ -1063,9 +1067,23 @@ choose_cpu(const struct scheduler *ops, struct vcpu *vc)
>              d2printk("%pv -\n", svc->vcpu);
>              clear_bit(__CSFLAG_runq_migrate_request, &svc->flags);
>          }
> -        /* Leave it where it is for now.  When we actually pay attention
> -         * to affinity we'll have to figure something out... */
> -        return vc->processor;
> +
> +        /* Check vc's hard affinity mask with the run queue's active mask. */
> +        cpumask_and(&temp_mask, vc->cpu_hard_affinity, &svc->rqd->active);
> +        if ( cpumask_empty(&temp_mask) )
> +        {
> +            /* Can't be assigned to current runqueue; return a safe pcpu. */
> +            cpumask_and(&temp_mask, vc->cpu_hard_affinity,
> +                cpupool_online_cpumask(vc->domain->cpupool));
> +            return cpumask_any(&temp_mask);
> +        }
> +        else
> +            if ( cpumask_test_cpu(vc->processor, &temp_mask) )
> +                /* Leave it where it is. */
> +                return vc->processor;
> +            else
> +                /* Same runq, different cpu; affinity must have changed. */
> +                return cpumask_any(&temp_mask);
>      }
>  
I see what you are after, and it should be fine. However, how about
something like this?

  /* We can't perform the operation properly because of locking issues.
   * Check whether we can at least leave the vcpu where it is, or if
   * we need to send it somewhere else.
   */
  cpumask_and(&temp_mask, vc->cpu_hard_affinity, &svc->rqd->active);
  if ( unlikely(cpumask_empty(&temp_mask)) )
    cpumask_and(&temp_mask, vc->cpu_hard_affinity,
                cpupool_online_cpumask(vc->domain->cpupool));

  if ( cpumask_test_cpu(vc->processor, &temp_mask) )
    return vc->processor;
  else
    return cpumask_any(&temp_mask);

>      /* First check to see if we're here because someone else suggested a place
> @@ -1081,13 +1099,17 @@ choose_cpu(const struct scheduler *ops, struct vcpu *vc)
>          else
>          {
>              d2printk("%pv +\n", svc->vcpu);
> -            new_cpu = cpumask_cycle(vc->processor, &svc->migrate_rqd->active);
> -            goto out_up;
> +            cpumask_and(&temp_mask, vc->cpu_hard_affinity,
> +                &svc->migrate_rqd->active);
> +            if ( !cpumask_empty(&temp_mask) )
> +            {
> +                new_cpu = cpumask_any(&temp_mask);
> +                goto out_up;
> +            }
> +            /* Fall-through to normal cpu pick */
>
So, if we've been asked to move to a cpu where we are not allowed to run
(i.e., in case temp_mask ends up empty), we just, silently, ignore the
request. This has the potential of screwing the work of the load
balancer. If we need to keep this as a sort of "safety catch", then
fine, but I'd really like to see a lot of efforts made in the load
balancing code to make this not trigger!

Actually, since we've been asked to do something, and we're doing
something completely different, I even wonder if, if we can't go to
migrate_rqd, that "something completely different" is better than  doing
nothing, and if yes, under what conditions and circumstances...

>      /* Find the runqueue with the lowest instantaneous load */
> @@ -1099,17 +1121,26 @@ choose_cpu(const struct scheduler *ops, struct vcpu *vc)
>          rqd = prv->rqd + i;
>  
>          /* If checking a different runqueue, grab the lock,
> -         * read the avg, and then release the lock.
> +         * check hard affinity, read the avg, and then release the lock.
>           *
>           * If on our own runqueue, don't grab or release the lock;
>           * but subtract our own load from the runqueue load to simulate
>           * impartiality */
>          if ( rqd == svc->rqd )
>          {
> +            cpumask_and(&temp_mask, vc->cpu_hard_affinity, &rqd->active);
> +            if ( cpumask_empty(&temp_mask) )
> +                continue;
>
You can use cpumask_intersects(vc->cpu_hard_affinity, &rqd->active) for
the test. However, can we really be on a runq, and not have hard
affinity with _any_ of its pcpus? I think no. And if yes, then we should
try to change that and make it no. :-)

>              rqd_avgload = rqd->b_avgload - svc->avgload;
>          }
>          else if ( spin_trylock(&rqd->lock) )
>          {
> +            cpumask_and(&temp_mask, vc->cpu_hard_affinity, &rqd->active);
> +            if ( cpumask_empty(&temp_mask) )
>
Again, cpumask_intersects()

> +            {
> +                spin_unlock(&rqd->lock);
> +                continue;
> +            }
>              rqd_avgload = rqd->b_avgload;
>              spin_unlock(&rqd->lock);
>          }

> @@ -1123,12 +1154,30 @@ choose_cpu(const struct scheduler *ops, struct vcpu *vc)
>          }
>      }
>  
> -    /* We didn't find anyone (most likely because of spinlock contention); leave it where it is */
>      if ( min_rqi == -1 )
> -        new_cpu = vc->processor;
> +    {
> +        /* No runqs found (most likely because of spinlock contention). */
> +        cpumask_and(&temp_mask, vc->cpu_hard_affinity, &svc->rqd->active);
> +        if ( cpumask_empty(&temp_mask) )
> +        {
> +            /* Can't be assigned to current runqueue; return a safe pcpu. */
> +            cpumask_and(&temp_mask, vc->cpu_hard_affinity,
> +                cpupool_online_cpumask(vc->domain->cpupool));
> +            new_cpu = cpumask_any(&temp_mask);
> +        }
> +        else
> +            if ( cpumask_test_cpu(vc->processor, &temp_mask) )
> +                /* Leave it where it is. */
> +                new_cpu = vc->processor;
> +            else
> +                /* Same runq, different cpu; affinity must have changed. */
> +                new_cpu = cpumask_any(&temp_mask);
> +    }
>
This is the exact same code as above, isn't it? Put it in an helper and
call it from both sites.

Well, actually, put my rework of the code above in the helper and call
it. :-)

>      else
>      {
> -        new_cpu = cpumask_cycle(vc->processor, &prv->rqd[min_rqi].active);
> +        cpumask_and(&temp_mask, vc->cpu_hard_affinity,
> +            &prv->rqd[min_rqi].active);
> +        new_cpu = cpumask_any(&temp_mask);
>          BUG_ON(new_cpu >= nr_cpu_ids);
>      }
>
Looks ok.
 
> @@ -1197,22 +1246,40 @@ static void migrate(const struct scheduler *ops,
>      }
>      else
>      {
> -        int on_runq=0;
> -        /* It's not running; just move it */
> +        /* It's not running; move it if it's on a different runq than trqd. */
> +        bool_t on_runq = 0;
> +        cpumask_t temp_mask;
> +
>
Ditto.

>          d2printk("%pv %d-%d i\n", svc->vcpu, svc->rqd->id, trqd->id);
> +
> +        /* Re-assign vcpu's processor, if necessary. */
> +        cpumask_and(&temp_mask, svc->vcpu->cpu_hard_affinity, &trqd->active);
> +        svc->vcpu->processor = cpumask_any(&temp_mask);
> +        if ( !cpumask_test_cpu(svc->vcpu->processor, &temp_mask) )
> +            svc->vcpu->processor = cpumask_any(&temp_mask);
> +
Not following: you assign processor one of the bits set in temp_mask.
Then, if such new value of processor is not in temp_mask, you assign to
it cpumask_any(&temp_mask), i.e., the exact same thing as before...

Am I missing something obvious? What is it that you are up for here?
If the goal is to do something sane and safe if trqd->active and
svc->vcpu->cpu_hard_affinity does not intersect, then, we certainly need
it, but I really can't tell how this is supposed
to handle such situation.

BTW, after an operation like cpumask_first(), cpumask_next(),
cpumask_any(), etc., a quick way to check whether the operation
succeeded is to check for the returned value to be < nr_cpu_ids (see the
implementation of those operations). If that is the case, then you can
be sure that the bit returned is one of the bits set in the mask.

>          if ( __vcpu_on_runq(svc) )
> +            on_runq = 1;
> +
> +        /* If the runqs are different, move svc to trqd. */
> +        if ( svc->rqd != trqd )
>          {
> -            __runq_remove(svc);
> -            update_load(ops, svc->rqd, svc, -1, now);
> -            on_runq=1;
> +            if ( on_runq )
> +            {
> +                __runq_remove(svc);
> +                update_load(ops, svc->rqd, svc, -1, now);
> +            }
> +            __runq_deassign(svc);
> +            __runq_assign(svc, trqd);
> +            if ( on_runq )
> +            {
> +                update_load(ops, svc->rqd, svc, 1, now);
> +                runq_insert(ops, svc->vcpu->processor, svc);
> +            }
>          }
> -        __runq_deassign(svc);
> -        svc->vcpu->processor = cpumask_any(&trqd->active);
> -        __runq_assign(svc, trqd);
> +
>
Mmm.. I do not like the way the code looks after this is applied. Before
the patch, it was really straightforward and easy to understand. Now
it's way more involved. Can you explain why this rework is necessary?
For now do it here, then we'll see whether and how to put that into a
doc comment.

I know there is a comment on the call site of this function, but that
does not help much either (as I say there).

>          if ( on_runq )
>          {
> -            update_load(ops, svc->rqd, svc, 1, now);
> -            runq_insert(ops, svc->vcpu->processor, svc);
>              runq_tickle(ops, svc->vcpu->processor, svc, now);
>          }
>      }

> @@ -1224,6 +1291,7 @@ static void balance_load(const struct scheduler *ops, int cpu, s_time_t now)
>      struct csched2_private *prv = CSCHED2_PRIV(ops);
>      int i, max_delta_rqi = -1;
>      struct list_head *push_iter, *pull_iter;
> +    cpumask_t temp_mask;
>  
...

>      balance_state_t st = { .best_push_svc = NULL, .best_pull_svc = NULL };
>      
> @@ -1250,6 +1318,11 @@ retry:
>      for_each_cpu(i, &prv->active_queues)
>      {
>          s_time_t delta;
> +        /* true if there are no vcpus to push due to hard affinity */
> +        bool_t ha_no_push = 1;
> +        /* true if there are no vcpus to pull due to hard affinity */
> +        bool_t ha_no_pull = 1;
> +        struct list_head *iter;
>          
>          st.orqd = prv->rqd + i;
>  
> @@ -1257,6 +1330,47 @@ retry:
>               || !spin_trylock(&st.orqd->lock) )
>              continue;
>  
> +        /*
> +         * If due to hard affinity there are no vcpus that can be
> +         * pulled or pushed, move to the next runq in the loop.
> +         */
> +
> +        /* See if there are any vcpus that can be pushed from lrqd to orqd. */
> +        list_for_each( iter, &st.lrqd->svc )
> +        {
> +            struct csched2_vcpu * svc =
> +                list_entry(iter, struct csched2_vcpu, rqd_elem);
> +            cpumask_and(&temp_mask, svc->vcpu->cpu_hard_affinity,
> +                &st.orqd->active);
> +            if (!cpumask_empty(&temp_mask))
> +            {
> +                /* vcpu can be pushed from lrqd to ordq. */
> +                ha_no_push = 0;
> +                break;
> +            }
> +        }
> +
>
So, here, for each runqueue j, we scan all the vcpus in lrqd, to see if
at least one can migrate to j.

This means that, if we have 4 runqueues, we potentially loop through all
the vcpus in lrqd 4 times, right?

> +        /* See if there are any vcpus that can be pulled from orqd to lrqd. */
> +        list_for_each( iter, &st.orqd->svc )
> +        {
> +            struct csched2_vcpu * svc =
> +                list_entry(iter, struct csched2_vcpu, rqd_elem);
> +            cpumask_and(&temp_mask, svc->vcpu->cpu_hard_affinity,
> +                &st.lrqd->active);
> +            if (!cpumask_empty(&temp_mask))
> +            {
> +                /* vcpu can be pulled from orqd to lrdq. */
> +                ha_no_pull = 0;
> +                break;
> +            }
> +        }
> +
>
And here, for each runq j, we scan all its vcpus to see if at least one
can run on lrqd.

In summary, this means (at least potentially) looping through all the
active vcpus, some more than one time, which sounds pretty expensive.
Add to this the fact that, since we are releasing and re-acquiring the
locks, things may even change under our feet, making all that sort of
useless.

I like the idea of filtering out runqueues that does not have any
suitable vcpus, due to hard affinity issues, but done like this, it
looks really expensive. Can we perhaps think to a data structure to make
it cheaper?

In the meanwhile, I'd go for a more 'greedy approach'. Out of the top of
my head, I think I would leave this loop alone, and just have it select
a candidate runqueue. Below, where the vcpus are considered, I'd filter
out the inappropriate one (as this patch does also). If it turns out
that there are no suitable vcpus in the runqueue the loop selected, I'd
"goto retry", perhaps adding a maximum number of attempts, or something
like that.

What do you think?

> @@ -1338,11 +1458,17 @@ retry:
>              {
>                  __update_svc_load(ops, pull_svc, 0, now);
>              }
> -        
> +
>
Avoid things like these. I appreciate that this is an actual
improvement, but let's not mix coding style fixes with new features. :-)

> @@ -1399,8 +1531,12 @@ csched2_vcpu_migrate(
>  
>      trqd = RQD(ops, new_cpu);
>  
> -    if ( trqd != svc->rqd )
> -        migrate(ops, svc, trqd, NOW());
> +    /*
> +     * Call migrate even if svc->rqd == trqd; there may have been an
> +     * affinity change that requires a call to runq_tickle for a new
> +     * processor within the same run queue.
> +     */
> +    migrate(ops, svc, trqd, NOW());
>  }
>  
As said above, I don't think I see the reason for this. Affinity
changes, e.g., due to calls to vcpu_set_affinity() in schedule.c, forces
the vcpu through a sleep wakeup cycle (it calls vcpu_sleep_nosync()
direcly, while vcpu_wake() is called inside vcpu_migrate()).

So, looks like what you are after (i.e., runq_tickle being called)
should happen already, isn't it? Are there other reasons you need it
for?

>  static int
> @@ -1610,6 +1746,13 @@ runq_candidate(struct csched2_runqueue_data *rqd,
>      {
>          struct csched2_vcpu * svc = list_entry(iter, struct csched2_vcpu, runq_elem);
>  
> +        /*
> +         * If vcpu is not allowed to run on this processor due to
> +         * hard affinity, continue to the next vcpu on the queue.
> +         */
> +        if ( !cpumask_test_cpu(cpu, svc->vcpu->cpu_hard_affinity) )
> +            continue;
> +
>
Looks ok. The comment explains a bit too much the "what", which one can
easily see from the code. Comments (most of the time :-) ) should be
about the "why" things are done in a certain way.

Here, the relevant piece of information is that "Only consider vcpus
allowed to run on this processor", so I'd just say so. The fact that you
are continuing scanning the runque is pretty evident. :-D

Regards,
Dario

-- 
<<This happens because I choose it to happen!>> (Raistlin Majere)
-----------------------------------------------------------------
Dario Faggioli, Ph.D, http://about.me/dario.faggioli
Senior Software Engineer, Citrix Systems R&D Ltd., Cambridge (UK)


[-- Attachment #1.2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 181 bytes --]

[-- Attachment #2: Type: text/plain, Size: 126 bytes --]

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
http://lists.xen.org/xen-devel

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH 2/2] sched: credit2: consider per-vcpu soft affinity
  2014-11-30  0:33 ` [PATCH 2/2] sched: credit2: consider per-vcpu soft affinity Justin T. Weaver
@ 2015-01-12 18:07   ` Dario Faggioli
  2015-01-13 13:06   ` Dario Faggioli
  1 sibling, 0 replies; 15+ messages in thread
From: Dario Faggioli @ 2015-01-12 18:07 UTC (permalink / raw)
  To: Justin T. Weaver; +Cc: george.dunlap, henric, xen-devel


[-- Attachment #1.1: Type: text/plain, Size: 673 bytes --]

On Sat, 2014-11-29 at 14:33 -1000, Justin T. Weaver wrote:
> when deciding which run queue to assign each vcpu to.
> 
> There are two main changes in functionality that this patch introduces.
> 
Just very quickly: I will have a look at this patch too.

However, I think it would be better to put it on hold a bit, and get the
hard affinity part right first. Any objection to that?

Regards,
Dario

-- 
<<This happens because I choose it to happen!>> (Raistlin Majere)
-----------------------------------------------------------------
Dario Faggioli, Ph.D, http://about.me/dario.faggioli
Senior Software Engineer, Citrix Systems R&D Ltd., Cambridge (UK)


[-- Attachment #1.2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 181 bytes --]

[-- Attachment #2: Type: text/plain, Size: 126 bytes --]

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
http://lists.xen.org/xen-devel

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH 2/2] sched: credit2: consider per-vcpu soft affinity
  2014-11-30  0:33 ` [PATCH 2/2] sched: credit2: consider per-vcpu soft affinity Justin T. Weaver
  2015-01-12 18:07   ` Dario Faggioli
@ 2015-01-13 13:06   ` Dario Faggioli
  2015-02-11  8:38     ` Justin Weaver
  1 sibling, 1 reply; 15+ messages in thread
From: Dario Faggioli @ 2015-01-13 13:06 UTC (permalink / raw)
  To: Justin T. Weaver; +Cc: george.dunlap, henric, xen-devel


[-- Attachment #1.1: Type: text/plain, Size: 12747 bytes --]

On Sat, 2014-11-29 at 14:33 -1000, Justin T. Weaver wrote:
> when deciding which run queue to assign each vcpu to.
> 
> There are two main changes in functionality that this patch introduces.
> 
> First, in function runq_tickle, it tries to find idle pcpus in other run
> queues that the vcpu would prefer to run on (soft affinity) or can run on
> (hard affinity), in that order.
> 
I see. I don't think a function like runq_tickle() should be able to
move a vcpu to a different runqueue. That is done in other places, and I
would leave this as it is. Tickling usually happens right after a vcpu
has been put on a runqueue, which is something controlled by other parts
of the Credit2 algorithm and code, and that is by design. More
specifically, moving vcpus among runqueues is the job of the load
balancing logic.

Turning runq_tickle() into a potential "runqueue-changer" would mix the
two concepts of inter-runqueue and intra-runqueue balancing/migration up
to a point that I do not like. It also makes the code look rather
weird... E.g., look at when runq_tickle() is called in migrate(). In
there, right after going through:
 - __runq_remove();
 - __runq_deassign();
 - __runq_assign();
 - runq_insert();
we call runq_tickle() and... Do it all aver again!?!? :-O
Even just because of this, I'd say that, if such a change to
runq_tickle() would be necessary (which I don't think), that should
happen differently, and should involve changing its various callers, to
avoid the repetition.

OTOH, it looks like you are not touching the existing
for_each_cpu(i,&mask) loop, still inside runq_tickle() itself, while I'd
say that looking for a pcpu that the vcpu has soft affinity with, among
all the non idle and non-tickled pcpus in the runqueue, really is
worthwhile. Actually, the same applies to idle pcpus in the runqueue,
i.e., to the code before the for_each_cpu() loop, still in
runq_tickle().

So, instead of adding to runq_tickle() the capability of moving vcpus
among runqueues, which is not his responsibility, I would add the usual
affinity balancing logic (== the for_each_csched2_balance_step2) to it. 

When we'll have this in place, and we will be able to bench it, we'll
see if we think that there is a need for making such function capable of
more advanced things (but even at that point, I doubt that changing the
runqueue of a vcpu would be one of these).

> Second, in function balance_load, if moving a vcpu with soft affinity from
> one run queue to another means moving it away from its soft affinity to hard
> affinity only, then that move is not considered for load balancing.
> 
This is a key part of the whole thing so, please, spend a few more words
explaining what is your  idea and how it actually works, both here and
in the sources, with comments.

For what I've understood, I like the idea. However, why don't use the
usual two step approach? In the soft-affinity step, we won't consider
vcpus that would see their soft-affinity 'impaired' by being moved, and,
perhaps, give a preference boost to vcpus that would see their
soft-affinity 'improved' by being moved. If we don't find any suitable
candidate, in the next (hard-affinity) step, we fall back to ignoring
soft-affinity, and only care about hard constraints.

About 'impaired' and 'improoved' soft-affinity. This is just an idea
popping out my head right now, so bear with me. :-)
Affinity is to pcpus, while, in principle, here we are talking about
migration between runqueues, with each runqueue spanning a certain set
of pcpus. There seems to me to be room for defining something like the
"soft affinity of a vcpu to a runqueue", as, e.g., the number of pcpus
of that runqueue with which the vcpu has soft-affinity. This would
measure how likely the vcpus will run on a pcpu it has soft-affinity
with, if migrated to the runqueue in question. How do you like this?

BTW, understand why yesterday I said I'd rather get hard affinity in
place before tackling dealing with soft affinity? :-P

> @@ -127,6 +127,15 @@
>  #define CSCHED2_CREDIT_RESET         0
>  /* Max timer: Maximum time a guest can be run for. */
>  #define CSCHED2_MAX_TIMER            MILLISECS(2)
> +/* Two step balancing logic; consider soft affinity first. */
> +#define CSCHED2_BALANCE_SOFT_AFFINITY 0
> +#define CSCHED2_BALANCE_HARD_AFFINITY 1
> +/* vcpu runq migration away from vcpu's soft affinity. */
> +#define CSCHED2_MIGRATE_SOFT_TO_HARD 0
> +/* vcpu runq migration to vcpu's soft affinity. */
> +#define CSCHED2_MIGRATE_HARD_TO_SOFT 1
> +/* vcpu runq migration soft to soft or vcpu has no soft affinity. */
> +#define CSCHED2_MIGRATE_ANY_TO_ANY   2
>  
> 
>  #define CSCHED2_IDLE_CREDIT                 (-(1<<30))
> @@ -176,6 +185,8 @@ integer_param("sched_credit2_migrate_resist", opt_migrate_resist);
>  #define c2r(_ops, _cpu)     (CSCHED2_PRIV(_ops)->runq_map[(_cpu)])
>  /* CPU to runqueue struct macro */
>  #define RQD(_ops, _cpu)     (&CSCHED2_PRIV(_ops)->rqd[c2r(_ops, _cpu)])
> +#define for_each_csched2_balance_step(step) \
> +    for ( (step) = 0; (step) <= CSCHED2_BALANCE_HARD_AFFINITY; (step)++ )
>
This is identical (apart from the CSCHED2 prefix) to what we have in
Credit1. I wonder whether it wouldn't make sense to factor it out in a
common header.

I'm saying I wonder because I am actually not sure. As things stands, it
makes perfect sense. Even if looking forward, I don't see why the two
schedulers (as well as any other scheduler wanting to support
affinities) should do things differently between each others.

The only doubt I have is not about the structure (i.e., the loop, etc.)
but about the fact that a scheduler may perhaps want to add more steps,
or things like this.

If I'd have to decide now, I'd say let's factor this out, and when --if
ever-- someone with different needs will come, we'll see what to do.
Thoughts?

> +/*
> + * A vcpu has meaningful soft affinity if...
> + * - its soft affinity mask is not full, and
> + * - the passed in mask (usually its hard affinity mask) intersects
> + *   with its soft affinity mask
> + */
> +static inline int __vcpu_has_soft_affinity(const struct vcpu *vc,
> +                                           const cpumask_t *mask)
> +{
> +    if ( cpumask_full(vc->cpu_soft_affinity) ||
> +        !cpumask_intersects(vc->cpu_soft_affinity, mask) )
> +        return 0;
> +
> +    return 1;
> +}
>
About this, I have far less doubts when asking you to factor this out
from sched_credit.c and use it here. :-)

> +static void
> +csched2_balance_cpumask(const struct vcpu *vc, int step, cpumask_t *mask)
> +{
> +    if ( step == CSCHED2_BALANCE_SOFT_AFFINITY )
> +    {
> +        cpumask_and(mask, vc->cpu_soft_affinity, vc->cpu_hard_affinity);
> +
> +        if ( unlikely(cpumask_empty(mask)) )
> +            cpumask_copy(mask, vc->cpu_hard_affinity);
> +    }
> +    else /* step == CSCHED2_BALANCE_HARD_AFFINITY */
> +        cpumask_copy(mask, vc->cpu_hard_affinity);
> +}
>
Same here.

Oh, and have also a look at how this is used, in sched_credit.c, in
conjunction with the per-csched_pcpu variable balance_mask.

As I said yesterday, too many cpumask_t variables on stack should be
avoided, and that was how I managed to comply to this when working on
Credit1.

There for sure are more than a few places where you can do the same, or
something very similar.

> @@ -513,6 +559,68 @@ runq_tickle(const struct scheduler *ops, unsigned int cpu, struct csched2_vcpu *
>          goto tickle;
>      }
>  
> +    /* Look for idle cpus in other runqs; consider soft affinity first. */
> +    for_each_csched2_balance_step( balance_step )
> +    {
> +        cpumask_t balance_mask;
> +
> +        /* Skip the soft affinity balance step if new doesn't have any. */
> +        if ( balance_step == CSCHED2_BALANCE_SOFT_AFFINITY &&
> +            !__vcpu_has_soft_affinity(
> +                new->vcpu, new->vcpu->cpu_hard_affinity) )
> +            continue;
> +
> +        /* Skip this step if can't get a lock on the credit2 private data. */
> +        if ( !prv_lock_held || !spin_trylock(&prv->lock) )
> +            continue;
> +        prv_lock_held = 1;
> +
> +        csched2_balance_cpumask(new->vcpu, balance_step, &balance_mask);
> +
> +        for_each_cpu(i, &prv->active_queues)
> +        {
> +            struct csched2_runqueue_data *temp_rqd;
> +
> +            temp_rqd = prv->rqd + i;
> +
> +            if ( temp_rqd == rqd || !spin_trylock(&temp_rqd->lock) )
> +                continue;
> +
> +            /* Find idle cpus in the balance mask that are not tickled. */
> +            cpumask_andnot(&mask, &temp_rqd->idle, &temp_rqd->tickled);
> +            cpumask_and(&mask, &mask, &balance_mask);
> +
> +            if ( !cpumask_empty(&mask) )
> +            {
> +                /* Found an idle cpu on another run queue; move new. */
> +                s_time_t now = 0;
> +
> +                ipid = cpumask_any(&mask);
> +                new->vcpu->processor = ipid;
> +                __runq_remove(new);
> +                now = NOW();
> +                update_load(ops, rqd, new, -1, now);
> +                __runq_deassign(new);
> +                __runq_assign(new, temp_rqd);
> +                update_load(ops, temp_rqd, new, 1, now);
> +                runq_insert(ops, ipid, new);
> +                cpumask_set_cpu(ipid, &temp_rqd->tickled);
> +                cpu_raise_softirq(ipid, SCHEDULE_SOFTIRQ);
> +
> +                spin_unlock(&temp_rqd->lock);
> +                spin_unlock(&prv->lock);
> +                return;
> +            }
> +            else
> +                /* No suitable idlers found in runq temp_rqd. */
> +                spin_unlock(&temp_rqd->lock);
> +        }
> +
> +        if ( prv_lock_held && balance_step == CSCHED2_BALANCE_HARD_AFFINITY )
> +            /* No suitable other-runq idlers found; unlock private data. */
> +            spin_unlock(&prv->lock);
> +    }
> +
>
I commented about this above.

BTW, look at the code you'd have to add for making this function behave
as you wanted, especially locking, and I'm sure you'll concur that all
that does not belong here.

Where I would put something similar to the above is in choose_cpu()
below!

> @@ -1039,12 +1147,22 @@ choose_cpu(const struct scheduler *ops, struct vcpu *vc)
>  {
>      struct csched2_private *prv = CSCHED2_PRIV(ops);
>      int i, min_rqi = -1, new_cpu;
> +    int max_soft_cpus = 0, max_soft_cpus_rqi = -1;
> +    bool_t consider_soft_affinity = 0;
>      struct csched2_vcpu *svc = CSCHED2_VCPU(vc);
>      s_time_t min_avgload;
> -    cpumask_t temp_mask;
> +    cpumask_t temp_mask, vc_soft_affinity;
>  
>      BUG_ON(cpumask_empty(&prv->active_queues));
>  
> +    /* Consider soft affinity in the cpu descision? */
> +    if ( __vcpu_has_soft_affinity(vc, vc->cpu_hard_affinity) )
> +    {
> +        consider_soft_affinity = 1;
> +        cpumask_and(&vc_soft_affinity, vc->cpu_soft_affinity,
> +            vc->cpu_hard_affinity);
> +    }
> +
>
Why "black and white"? Why not using for_each_csched2_balance_step() and
differentiate which one affinity to consider depending on the step,
rather than here and for all?

> @@ -1112,11 +1237,15 @@ choose_cpu(const struct scheduler *ops, struct vcpu *vc)
>  
>      min_avgload = MAX_LOAD;
>  
> -    /* Find the runqueue with the lowest instantaneous load */
> +    /*
> +     * Find the run queue with the most cpus in vc's soft affniity, or the
> +     * the lowest instantaneous load if not considering soft affinity.
> +     */
>
Exactly my point: if the vcpu does not have any soft-affinity, then, as
usual, the soft-affinity balancing step must be skipped, and the
behaviour should fall back to finding the runq with the lowest
instantaneous load, taking only hard affinity constraint into account.

And if the vcpu does have a soft-affinity, why forget completely about
instantaneous load? I like the idea of "the most cpus in vc's soft
affinity" (it's actually really similar to what I was talking about at
the beginning of this message), but why can't we combine the two things?
More important, why does it have to be just an option, instead than the
first step of the soft/hard balancing loop?

Regards,
Dario

-- 
<<This happens because I choose it to happen!>> (Raistlin Majere)
-----------------------------------------------------------------
Dario Faggioli, Ph.D, http://about.me/dario.faggioli
Senior Software Engineer, Citrix Systems R&D Ltd., Cambridge (UK)


[-- Attachment #1.2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 181 bytes --]

[-- Attachment #2: Type: text/plain, Size: 126 bytes --]

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
http://lists.xen.org/xen-devel

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH 1/2] sched: credit2: respect per-vcpu hard affinity
  2015-01-12 18:05   ` Dario Faggioli
@ 2015-01-20  7:21     ` Justin Weaver
  2015-02-01  6:51       ` Justin Weaver
  2015-03-03  3:13       ` Dario Faggioli
  0 siblings, 2 replies; 15+ messages in thread
From: Justin Weaver @ 2015-01-20  7:21 UTC (permalink / raw)
  To: Dario Faggioli; +Cc: George Dunlap, Henri Casanova, xen-devel

On Mon, Jan 12, 2015 at 8:05 AM, Dario Faggioli
<dario.faggioli@citrix.com> wrote:
> Still about referencing "history", one usually puts here a quick summary
> of what changed in each patch, wrt the previously submitted version.
>
> This is actually not really important in this case, since _a_lot_ of
> things changed, as it usually happens when moving from RFC to actual
> submissions... But consider doing something like this for next round.

I'll put in a change summary when I send out the next version.

> Style apart, this overall is a very good submission. Since last time,
> you really made huge progresses, and are now tackling the issue with a
> good approach. I have a few questions on the approach itself, and some
> comments on the code, but I'd say you are on the right track.

Thanks, very glad to hear it!

>> @@ -1038,6 +1041,7 @@ choose_cpu(const struct scheduler *ops, struct vcpu *vc)
>>      int i, min_rqi = -1, new_cpu;
>>      struct csched2_vcpu *svc = CSCHED2_VCPU(vc);
>>      s_time_t min_avgload;
>> +    cpumask_t temp_mask;
>>
> Still about cpumasks, we really should try to avoid adding new ones on
> the stack like this.

I'll update it making a cpumask_var_t balance_mask struct member like
you've got in credit 1. Or something similar.

>> @@ -1053,7 +1057,7 @@ choose_cpu(const struct scheduler *ops, struct vcpu *vc)
>>       *
>>       * Since one of the runqueue locks is already held, we can't
>>       * just grab the prv lock.  Instead, we'll have to trylock, and
>> -     * do something else reasonable if we fail.
>> +     * return a safe cpu.
>>
> I don't think this change is particularly useful.

You mean the comment change isn't useful? I don't think it should be
left as is because we're not "doing something else reasonable" we're
now doing something specific. I'll change it to make it more clear.

>> @@ -1063,9 +1067,23 @@ choose_cpu(const struct scheduler *ops, struct vcpu *vc)
>>              d2printk("%pv -\n", svc->vcpu);
>>              clear_bit(__CSFLAG_runq_migrate_request, &svc->flags);
>>          }
>> -        /* Leave it where it is for now.  When we actually pay attention
>> -         * to affinity we'll have to figure something out... */
>> -        return vc->processor;
>> +
>> +        /* Check vc's hard affinity mask with the run queue's active mask. */
>> +        cpumask_and(&temp_mask, vc->cpu_hard_affinity, &svc->rqd->active);
>> +        if ( cpumask_empty(&temp_mask) )
>> +        {
>> +            /* Can't be assigned to current runqueue; return a safe pcpu. */
>> +            cpumask_and(&temp_mask, vc->cpu_hard_affinity,
>> +                cpupool_online_cpumask(vc->domain->cpupool));
>> +            return cpumask_any(&temp_mask);
>> +        }
>> +        else
>> +            if ( cpumask_test_cpu(vc->processor, &temp_mask) )
>> +                /* Leave it where it is. */
>> +                return vc->processor;
>> +            else
>> +                /* Same runq, different cpu; affinity must have changed. */
>> +                return cpumask_any(&temp_mask);
>>      }
>>
> I see what you are after, and it should be fine. However, how about
> something like this?
>
>   /* We can't perform the operation properly because of locking issues.
>    * Check whether we can at least leave the vcpu where it is, or if
>    * we need to send it somewhere else.
>    */
>   cpumask_and(&temp_mask, vc->cpu_hard_affinity, &svc->rqd->active);
>   if ( unlikely(cpumask_empty(&temp_mask)) )
>     cpumask_and(&temp_mask, vc->cpu_hard_affinity,
>                 cpupool_online_cpumask(vc->domain->cpupool));
>
>   if ( cpumask_test_cpu(vc->processor, &temp_mask) )
>     return vc->processor;
>   else
>     return cpumask_any(&temp_mask);

Looks good to me. I'll change it.

>>      /* First check to see if we're here because someone else suggested a place
>> @@ -1081,13 +1099,17 @@ choose_cpu(const struct scheduler *ops, struct vcpu *vc)
>>          else
>>          {
>>              d2printk("%pv +\n", svc->vcpu);
>> -            new_cpu = cpumask_cycle(vc->processor, &svc->migrate_rqd->active);
>> -            goto out_up;
>> +            cpumask_and(&temp_mask, vc->cpu_hard_affinity,
>> +                &svc->migrate_rqd->active);
>> +            if ( !cpumask_empty(&temp_mask) )
>> +            {
>> +                new_cpu = cpumask_any(&temp_mask);
>> +                goto out_up;
>> +            }
>> +            /* Fall-through to normal cpu pick */
>>
> So, if we've been asked to move to a cpu where we are not allowed to run
> (i.e., in case temp_mask ends up empty), we just, silently, ignore the
> request. This has the potential of screwing the work of the load
> balancer. If we need to keep this as a sort of "safety catch", then
> fine, but I'd really like to see a lot of efforts made in the load
> balancing code to make this not trigger!
>
> Actually, since we've been asked to do something, and we're doing
> something completely different, I even wonder if, if we can't go to
> migrate_rqd, that "something completely different" is better than  doing
> nothing, and if yes, under what conditions and circumstances...

I'm glad you brought this up because it's not clear to me what the
relationship between migrate() and choose_cpu() is, if any. This block
of code in choose_cpu only triggers if __CSFLAG_runq_migrate_request
is set, and that flag is only set in migrate(), and only if
__CSFLAG_scheduled is set. choose_cpu() is only called in response to
the pick_cpu call in schedule.c in vcpu_migrate() which doesn't have
anything to do with the load balancer balance_load() in credit 2. It
seems to me that the hard affinity check is needed at the end of this
block in case affinity has changed after __CSFLAG_runq_migrate_request
is set in migrate() and before a call to choose_cpu(). Please let me
know what you think.

>>      /* Find the runqueue with the lowest instantaneous load */
>> @@ -1099,17 +1121,26 @@ choose_cpu(const struct scheduler *ops, struct vcpu *vc)
>>          rqd = prv->rqd + i;
>>
>>          /* If checking a different runqueue, grab the lock,
>> -         * read the avg, and then release the lock.
>> +         * check hard affinity, read the avg, and then release the lock.
>>           *
>>           * If on our own runqueue, don't grab or release the lock;
>>           * but subtract our own load from the runqueue load to simulate
>>           * impartiality */
>>          if ( rqd == svc->rqd )
>>          {
>> +            cpumask_and(&temp_mask, vc->cpu_hard_affinity, &rqd->active);
>> +            if ( cpumask_empty(&temp_mask) )
>> +                continue;
>>
> You can use cpumask_intersects(vc->cpu_hard_affinity, &rqd->active) for
> the test. However, can we really be on a runq, and not have hard
> affinity with _any_ of its pcpus? I think no. And if yes, then we should
> try to change that and make it no. :-)

I'll update the code to use cpumask_intersects in all cases like this.

If I use xl vcpu-pin and turn off hard affinity for all pcpus in a
vpcu's currently assigned run queue, then xen will go through
vcpu_set_affinity, vcpu_migrate in schedule.c then choose_cpu in
credit 2. This is credit 2's first chance to react to a change in hard
affinity so there is a chance here that a vcpu can be assigned to a
run queue in which it no longer has any hard affinity with the pcpus.
Also, svc->rqd is the run queue that svc is assigned to, but svc isn't
necessarily on the queue waiting to be run. Please let me know what
your thoughts are on this.

>>              rqd_avgload = rqd->b_avgload - svc->avgload;
>>          }
>>          else if ( spin_trylock(&rqd->lock) )
>>          {
>> +            cpumask_and(&temp_mask, vc->cpu_hard_affinity, &rqd->active);
>> +            if ( cpumask_empty(&temp_mask) )
>>
> Again, cpumask_intersects()

Will do.

>> +            {
>> +                spin_unlock(&rqd->lock);
>> +                continue;
>> +            }
>>              rqd_avgload = rqd->b_avgload;
>>              spin_unlock(&rqd->lock);
>>          }
>
>> @@ -1123,12 +1154,30 @@ choose_cpu(const struct scheduler *ops, struct vcpu *vc)
>>          }
>>      }
>>
>> -    /* We didn't find anyone (most likely because of spinlock contention); leave it where it is */
>>      if ( min_rqi == -1 )
>> -        new_cpu = vc->processor;
>> +    {
>> +        /* No runqs found (most likely because of spinlock contention). */
>> +        cpumask_and(&temp_mask, vc->cpu_hard_affinity, &svc->rqd->active);
>> +        if ( cpumask_empty(&temp_mask) )
>> +        {
>> +            /* Can't be assigned to current runqueue; return a safe pcpu. */
>> +            cpumask_and(&temp_mask, vc->cpu_hard_affinity,
>> +                cpupool_online_cpumask(vc->domain->cpupool));
>> +            new_cpu = cpumask_any(&temp_mask);
>> +        }
>> +        else
>> +            if ( cpumask_test_cpu(vc->processor, &temp_mask) )
>> +                /* Leave it where it is. */
>> +                new_cpu = vc->processor;
>> +            else
>> +                /* Same runq, different cpu; affinity must have changed. */
>> +                new_cpu = cpumask_any(&temp_mask);
>> +    }
>>
> This is the exact same code as above, isn't it? Put it in an helper and
> call it from both sites.
>
> Well, actually, put my rework of the code above in the helper and call
> it. :-)

Will do.

>>      else
>>      {
>> -        new_cpu = cpumask_cycle(vc->processor, &prv->rqd[min_rqi].active);
>> +        cpumask_and(&temp_mask, vc->cpu_hard_affinity,
>> +            &prv->rqd[min_rqi].active);
>> +        new_cpu = cpumask_any(&temp_mask);
>>          BUG_ON(new_cpu >= nr_cpu_ids);
>>      }
>>
> Looks ok.
>
>> @@ -1197,22 +1246,40 @@ static void migrate(const struct scheduler *ops,
>>      }
>>      else
>>      {
>> -        int on_runq=0;
>> -        /* It's not running; just move it */
>> +        /* It's not running; move it if it's on a different runq than trqd. */
>> +        bool_t on_runq = 0;
>> +        cpumask_t temp_mask;
>> +
>>
> Ditto.
>
>>          d2printk("%pv %d-%d i\n", svc->vcpu, svc->rqd->id, trqd->id);
>> +
>> +        /* Re-assign vcpu's processor, if necessary. */
>> +        cpumask_and(&temp_mask, svc->vcpu->cpu_hard_affinity, &trqd->active);
>> +        svc->vcpu->processor = cpumask_any(&temp_mask);
>> +        if ( !cpumask_test_cpu(svc->vcpu->processor, &temp_mask) )
>> +            svc->vcpu->processor = cpumask_any(&temp_mask);
>> +
> Not following: you assign processor one of the bits set in temp_mask.
> Then, if such new value of processor is not in temp_mask, you assign to
> it cpumask_any(&temp_mask), i.e., the exact same thing as before...
>
> Am I missing something obvious? What is it that you are up for here?
> If the goal is to do something sane and safe if trqd->active and
> svc->vcpu->cpu_hard_affinity does not intersect, then, we certainly need
> it, but I really can't tell how this is supposed
> to handle such situation.

Not sure what I was thinking here, woops! The point is to assign
svc->vcpu->processor to a pcpu in run queue trqd that is in svc's hard
affinity. If for some reason there are no pcpus that satisfy hard
affinity it should probably be a BUG_ON because migrate() shouldn't be
able to decide not to do the migration. What do you think, bug_on or
return without migrating?

> BTW, after an operation like cpumask_first(), cpumask_next(),
> cpumask_any(), etc., a quick way to check whether the operation
> succeeded is to check for the returned value to be < nr_cpu_ids (see the
> implementation of those operations). If that is the case, then you can
> be sure that the bit returned is one of the bits set in the mask.

I'll keep that in mind while fixing this block.

>>          if ( __vcpu_on_runq(svc) )
>> +            on_runq = 1;
>> +
>> +        /* If the runqs are different, move svc to trqd. */
>> +        if ( svc->rqd != trqd )
>>          {
>> -            __runq_remove(svc);
>> -            update_load(ops, svc->rqd, svc, -1, now);
>> -            on_runq=1;
>> +            if ( on_runq )
>> +            {
>> +                __runq_remove(svc);
>> +                update_load(ops, svc->rqd, svc, -1, now);
>> +            }
>> +            __runq_deassign(svc);
>> +            __runq_assign(svc, trqd);
>> +            if ( on_runq )
>> +            {
>> +                update_load(ops, svc->rqd, svc, 1, now);
>> +                runq_insert(ops, svc->vcpu->processor, svc);
>> +            }
>>          }
>> -        __runq_deassign(svc);
>> -        svc->vcpu->processor = cpumask_any(&trqd->active);
>> -        __runq_assign(svc, trqd);
>> +
>>
> Mmm.. I do not like the way the code looks after this is applied. Before
> the patch, it was really straightforward and easy to understand. Now
> it's way more involved. Can you explain why this rework is necessary?
> For now do it here, then we'll see whether and how to put that into a
> doc comment.

When I was testing, if I removed hard affinity from a vcpu's current
pcpu to another pcpu in the same run queue, the VM would stop
executing. I'll go back and look at this because I see what you wrote
below about wake being called by vcpu_migrate in schedule.c; it
shouldn't freeze on the old cpu, it should wake on the new cpu no
matter if the run queue changed or not. I'll address this again after
some testing.

> I know there is a comment on the call site of this function, but that
> does not help much either (as I say there).
>
>>          if ( on_runq )
>>          {
>> -            update_load(ops, svc->rqd, svc, 1, now);
>> -            runq_insert(ops, svc->vcpu->processor, svc);
>>              runq_tickle(ops, svc->vcpu->processor, svc, now);
>>          }
>>      }
>
>> @@ -1224,6 +1291,7 @@ static void balance_load(const struct scheduler *ops, int cpu, s_time_t now)
>>      struct csched2_private *prv = CSCHED2_PRIV(ops);
>>      int i, max_delta_rqi = -1;
>>      struct list_head *push_iter, *pull_iter;
>> +    cpumask_t temp_mask;
>>
> ...
>
>>      balance_state_t st = { .best_push_svc = NULL, .best_pull_svc = NULL };
>>
>> @@ -1250,6 +1318,11 @@ retry:
>>      for_each_cpu(i, &prv->active_queues)
>>      {
>>          s_time_t delta;
>> +        /* true if there are no vcpus to push due to hard affinity */
>> +        bool_t ha_no_push = 1;
>> +        /* true if there are no vcpus to pull due to hard affinity */
>> +        bool_t ha_no_pull = 1;
>> +        struct list_head *iter;
>>
>>          st.orqd = prv->rqd + i;
>>
>> @@ -1257,6 +1330,47 @@ retry:
>>               || !spin_trylock(&st.orqd->lock) )
>>              continue;
>>
>> +        /*
>> +         * If due to hard affinity there are no vcpus that can be
>> +         * pulled or pushed, move to the next runq in the loop.
>> +         */
>> +
>> +        /* See if there are any vcpus that can be pushed from lrqd to orqd. */
>> +        list_for_each( iter, &st.lrqd->svc )
>> +        {
>> +            struct csched2_vcpu * svc =
>> +                list_entry(iter, struct csched2_vcpu, rqd_elem);
>> +            cpumask_and(&temp_mask, svc->vcpu->cpu_hard_affinity,
>> +                &st.orqd->active);
>> +            if (!cpumask_empty(&temp_mask))
>> +            {
>> +                /* vcpu can be pushed from lrqd to ordq. */
>> +                ha_no_push = 0;
>> +                break;
>> +            }
>> +        }
>> +
>>
> So, here, for each runqueue j, we scan all the vcpus in lrqd, to see if
> at least one can migrate to j.
>
> This means that, if we have 4 runqueues, we potentially loop through all
> the vcpus in lrqd 4 times, right?

Three times I think, but yes, lots of looping. It will loop through
lrqd once for each additional run queue orqd.

>> +        /* See if there are any vcpus that can be pulled from orqd to lrqd. */
>> +        list_for_each( iter, &st.orqd->svc )
>> +        {
>> +            struct csched2_vcpu * svc =
>> +                list_entry(iter, struct csched2_vcpu, rqd_elem);
>> +            cpumask_and(&temp_mask, svc->vcpu->cpu_hard_affinity,
>> +                &st.lrqd->active);
>> +            if (!cpumask_empty(&temp_mask))
>> +            {
>> +                /* vcpu can be pulled from orqd to lrdq. */
>> +                ha_no_pull = 0;
>> +                break;
>> +            }
>> +        }
>> +
>>
> And here, for each runq j, we scan all its vcpus to see if at least one
> can run on lrqd.
>
> In summary, this means (at least potentially) looping through all the
> active vcpus, some more than one time, which sounds pretty expensive.
> Add to this the fact that, since we are releasing and re-acquiring the
> locks, things may even change under our feet, making all that sort of
> useless.

Yes, I considered things changing between holding the locks making
this looping potentially pointless.

> I like the idea of filtering out runqueues that does not have any
> suitable vcpus, due to hard affinity issues, but done like this, it
> looks really expensive. Can we perhaps think to a data structure to make
> it cheaper?

I will think about a solution with a data structure.

> In the meanwhile, I'd go for a more 'greedy approach'. Out of the top of
> my head, I think I would leave this loop alone, and just have it select
> a candidate runqueue. Below, where the vcpus are considered, I'd filter
> out the inappropriate one (as this patch does also). If it turns out
> that there are no suitable vcpus in the runqueue the loop selected, I'd
> "goto retry", perhaps adding a maximum number of attempts, or something
> like that.

I think balance_load needs some work beyond these hard affinity
changes. I noticed some odd behavior where it would migrate a vcpu
away from a runqueue that only had the one vcpu running on it. Maybe
it hasn't been tested much because of the single run queue issue which
your patches fix? Anyway, I will examine this section further and see
what I can come up with.

>> @@ -1338,11 +1458,17 @@ retry:
>>              {
>>                  __update_svc_load(ops, pull_svc, 0, now);
>>              }
>> -
>> +
>>
> Avoid things like these. I appreciate that this is an actual
> improvement, but let's not mix coding style fixes with new features. :-)

Got it, I will take this out.

>> @@ -1399,8 +1531,12 @@ csched2_vcpu_migrate(
>>
>>      trqd = RQD(ops, new_cpu);
>>
>> -    if ( trqd != svc->rqd )
>> -        migrate(ops, svc, trqd, NOW());
>> +    /*
>> +     * Call migrate even if svc->rqd == trqd; there may have been an
>> +     * affinity change that requires a call to runq_tickle for a new
>> +     * processor within the same run queue.
>> +     */
>> +    migrate(ops, svc, trqd, NOW());
>>  }
>>
> As said above, I don't think I see the reason for this. Affinity
> changes, e.g., due to calls to vcpu_set_affinity() in schedule.c, forces
> the vcpu through a sleep wakeup cycle (it calls vcpu_sleep_nosync()
> direcly, while vcpu_wake() is called inside vcpu_migrate()).
>
> So, looks like what you are after (i.e., runq_tickle being called)
> should happen already, isn't it? Are there other reasons you need it
> for?

Like I said above, I will look at this again. My VMs were getting
stuck after certain hard affinity changes. I'll roll back some of
these changes and test it out again.

>>  static int
>> @@ -1610,6 +1746,13 @@ runq_candidate(struct csched2_runqueue_data *rqd,
>>      {
>>          struct csched2_vcpu * svc = list_entry(iter, struct csched2_vcpu, runq_elem);
>>
>> +        /*
>> +         * If vcpu is not allowed to run on this processor due to
>> +         * hard affinity, continue to the next vcpu on the queue.
>> +         */
>> +        if ( !cpumask_test_cpu(cpu, svc->vcpu->cpu_hard_affinity) )
>> +            continue;
>> +
>>
> Looks ok. The comment explains a bit too much the "what", which one can
> easily see from the code. Comments (most of the time :-) ) should be
> about the "why" things are done in a certain way.
>
> Here, the relevant piece of information is that "Only consider vcpus
> allowed to run on this processor", so I'd just say so. The fact that you
> are continuing scanning the runque is pretty evident. :-D

Understood, I'll fix the comment here.

Thank you for the feedback! I agree that the hard affinity piece needs
to be solid before the soft affinity. I'm going to hold off on
replying to your feedback on my second patch until I get this one
fixed. I hope that's OK.

Justin Weaver
University of Hawaii at Manoa

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH 1/2] sched: credit2: respect per-vcpu hard affinity
  2015-01-20  7:21     ` Justin Weaver
@ 2015-02-01  6:51       ` Justin Weaver
  2015-02-02 11:12         ` Dario Faggioli
  2015-02-02 14:24         ` Dario Faggioli
  2015-03-03  3:13       ` Dario Faggioli
  1 sibling, 2 replies; 15+ messages in thread
From: Justin Weaver @ 2015-02-01  6:51 UTC (permalink / raw)
  To: Dario Faggioli; +Cc: George Dunlap, Henri Casanova, xen-devel

On Mon, Jan 19, 2015 at 9:21 PM, Justin Weaver <jtweaver@hawaii.edu> wrote:
> On Mon, Jan 12, 2015 at 8:05 AM, Dario Faggioli
> <dario.faggioli@citrix.com> wrote:

>>>          if ( __vcpu_on_runq(svc) )
>>> +            on_runq = 1;
>>> +
>>> +        /* If the runqs are different, move svc to trqd. */
>>> +        if ( svc->rqd != trqd )
>>>          {
>>> -            __runq_remove(svc);
>>> -            update_load(ops, svc->rqd, svc, -1, now);
>>> -            on_runq=1;
>>> +            if ( on_runq )
>>> +            {
>>> +                __runq_remove(svc);
>>> +                update_load(ops, svc->rqd, svc, -1, now);
>>> +            }
>>> +            __runq_deassign(svc);
>>> +            __runq_assign(svc, trqd);
>>> +            if ( on_runq )
>>> +            {
>>> +                update_load(ops, svc->rqd, svc, 1, now);
>>> +                runq_insert(ops, svc->vcpu->processor, svc);
>>> +            }
>>>          }
>>> -        __runq_deassign(svc);
>>> -        svc->vcpu->processor = cpumask_any(&trqd->active);
>>> -        __runq_assign(svc, trqd);
>>> +
>>>
>> Mmm.. I do not like the way the code looks after this is applied. Before
>> the patch, it was really straightforward and easy to understand. Now
>> it's way more involved. Can you explain why this rework is necessary?
>> For now do it here, then we'll see whether and how to put that into a
>> doc comment.
>
> When I was testing, if I removed hard affinity from a vcpu's current
> pcpu to another pcpu in the same run queue, the VM would stop
> executing. I'll go back and look at this because I see what you wrote
> below about wake being called by vcpu_migrate in schedule.c; it
> shouldn't freeze on the old cpu, it should wake on the new cpu no
> matter if the run queue changed or not. I'll address this again after
> some testing.

>>> @@ -1399,8 +1531,12 @@ csched2_vcpu_migrate(
>>>
>>>      trqd = RQD(ops, new_cpu);
>>>
>>> -    if ( trqd != svc->rqd )
>>> -        migrate(ops, svc, trqd, NOW());
>>> +    /*
>>> +     * Call migrate even if svc->rqd == trqd; there may have been an
>>> +     * affinity change that requires a call to runq_tickle for a new
>>> +     * processor within the same run queue.
>>> +     */
>>> +    migrate(ops, svc, trqd, NOW());
>>>  }
>>>
>> As said above, I don't think I see the reason for this. Affinity
>> changes, e.g., due to calls to vcpu_set_affinity() in schedule.c, forces
>> the vcpu through a sleep wakeup cycle (it calls vcpu_sleep_nosync()
>> direcly, while vcpu_wake() is called inside vcpu_migrate()).
>>
>> So, looks like what you are after (i.e., runq_tickle being called)
>> should happen already, isn't it? Are there other reasons you need it
>> for?
>
> Like I said above, I will look at this again. My VMs were getting
> stuck after certain hard affinity changes. I'll roll back some of
> these changes and test it out again.

I discovered that SCHED_OP(VCPU2OP(v), wake, v); in function vcpu_wake
in schedule.c is not being called because v's pause flags has
_VPF_blocked set.

For example...
I start a guest with one vcpu with hard affinity 8 - 15 and xl
vcpu-list says it's running on pcpu 15
I run xl vcpu-pin 1 0 8 to change it to hard affinity only with pcpu 8
When it gets to vcpu_wake, it tests vcpu_runnable(v) which is false
because _VPF_blocked is set, so it skips the call to
SCHED_OP(VCPU2OP(v), wake, v); and so does not get a runq_tickle
xl vcpu-list now shows --- for the state and I cannot console into it
What I don't understand though is if I then enter xl vcpu-pin 1 0 15
it reports that _VPF_blocked is NOT set, vcpu_wake calls credit2's
wake, it gets a runq_tickle and everything is fine again
Why did the value of the _VPF_blocked flag change after I entered xl
vcpu-pin the second time?? I dove deep in the code and could not
figure it out.

So that is why v1 of my patch worked because I let it run migrate
during an affinity change even if the current and destination run
queues were the same, so it would do the processor assignment and
runq_tickle regardless. I think you'll have to tell me if that's a
hack or a good solution!

I greatly appreciate any feedback.

Thank you,
Justin

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH 1/2] sched: credit2: respect per-vcpu hard affinity
  2015-02-01  6:51       ` Justin Weaver
@ 2015-02-02 11:12         ` Dario Faggioli
  2015-02-02 14:24         ` Dario Faggioli
  1 sibling, 0 replies; 15+ messages in thread
From: Dario Faggioli @ 2015-02-02 11:12 UTC (permalink / raw)
  To: Justin Weaver; +Cc: George Dunlap, Henri Casanova, xen-devel


[-- Attachment #1.1: Type: text/plain, Size: 1972 bytes --]

On Sat, 2015-01-31 at 20:51 -1000, Justin Weaver wrote:
> On Mon, Jan 19, 2015 at 9:21 PM, Justin Weaver <jtweaver@hawaii.edu> wrote:

> > Like I said above, I will look at this again. My VMs were getting
> > stuck after certain hard affinity changes. I'll roll back some of
> > these changes and test it out again.
> 
> I discovered that SCHED_OP(VCPU2OP(v), wake, v); in function vcpu_wake
> in schedule.c is not being called because v's pause flags has
> _VPF_blocked set.
> 
> For example...
> I start a guest with one vcpu with hard affinity 8 - 15 and xl
> vcpu-list says it's running on pcpu 15
> I run xl vcpu-pin 1 0 8 to change it to hard affinity only with pcpu 8
> When it gets to vcpu_wake, it tests vcpu_runnable(v) which is false
> because _VPF_blocked is set, so it skips the call to
> SCHED_OP(VCPU2OP(v), wake, v); and so does not get a runq_tickle
> xl vcpu-list now shows --- for the state and I cannot console into it
> What I don't understand though is if I then enter xl vcpu-pin 1 0 15
> it reports that _VPF_blocked is NOT set, vcpu_wake calls credit2's
> wake, it gets a runq_tickle and everything is fine again
> Why did the value of the _VPF_blocked flag change after I entered xl
> vcpu-pin the second time?? I dove deep in the code and could not
> figure it out.
> 
> So that is why v1 of my patch worked because I let it run migrate
> during an affinity change even if the current and destination run
> queues were the same, so it would do the processor assignment and
> runq_tickle regardless. I think you'll have to tell me if that's a
> hack or a good solution!
> 
Ok, thanks for the analysis. I shall have a look and let you know.

Dario

-- 
<<This happens because I choose it to happen!>> (Raistlin Majere)
-----------------------------------------------------------------
Dario Faggioli, Ph.D, http://about.me/dario.faggioli
Senior Software Engineer, Citrix Systems R&D Ltd., Cambridge (UK)


[-- Attachment #1.2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 181 bytes --]

[-- Attachment #2: Type: text/plain, Size: 126 bytes --]

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
http://lists.xen.org/xen-devel

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH 1/2] sched: credit2: respect per-vcpu hard affinity
  2015-02-01  6:51       ` Justin Weaver
  2015-02-02 11:12         ` Dario Faggioli
@ 2015-02-02 14:24         ` Dario Faggioli
  2015-02-03  9:09           ` Justin Weaver
  1 sibling, 1 reply; 15+ messages in thread
From: Dario Faggioli @ 2015-02-02 14:24 UTC (permalink / raw)
  To: Justin Weaver; +Cc: George Dunlap, Henri Casanova, xen-devel


[-- Attachment #1.1: Type: text/plain, Size: 2072 bytes --]

On Sat, 2015-01-31 at 20:51 -1000, Justin Weaver wrote:
> On Mon, Jan 19, 2015 at 9:21 PM, Justin Weaver <jtweaver@hawaii.edu> wrote:
> > On Mon, Jan 12, 2015 at 8:05 AM, Dario Faggioli

> For example...
> I start a guest with one vcpu with hard affinity 8 - 15 and xl
> vcpu-list says it's running on pcpu 15
> I run xl vcpu-pin 1 0 8 to change it to hard affinity only with pcpu 8
> When it gets to vcpu_wake, it tests vcpu_runnable(v) which is false
> because _VPF_blocked is set, so it skips the call to
> SCHED_OP(VCPU2OP(v), wake, v); and so does not get a runq_tickle
> xl vcpu-list now shows --- for the state and I cannot console into it
> What I don't understand though is if I then enter xl vcpu-pin 1 0 15
> it reports that _VPF_blocked is NOT set, vcpu_wake calls credit2's
> wake, it gets a runq_tickle and everything is fine again
> Why did the value of the _VPF_blocked flag change after I entered xl
> vcpu-pin the second time?? I dove deep in the code and could not
> figure it out.
> 
As promised, I am having a look. As I recalled, affinity
setting/changing should not involve fiddling with _VPF_blocked, so there
should be something else going on.

This seems to me to be confirmed by the fact that you can make the
system working again via another call to 'vcpu-pin'.

I'm playing a little bit with your code, with and without the call to
migrate(), to try and see better what's happening. If you happen to have
an updated version of it, even if still work-in progress, and not ready
for being the actual v2, and you want to share it, so that I can try
that one, please go ahead (just attach it to your reply, it's for
debugging, so no necessary for it to be a proper submission).

I'll let you know if I find something interesting.

Regards,
Dario

-- 
<<This happens because I choose it to happen!>> (Raistlin Majere)
-----------------------------------------------------------------
Dario Faggioli, Ph.D, http://about.me/dario.faggioli
Senior Software Engineer, Citrix Systems R&D Ltd., Cambridge (UK)


[-- Attachment #1.2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 181 bytes --]

[-- Attachment #2: Type: text/plain, Size: 126 bytes --]

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
http://lists.xen.org/xen-devel

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH 1/2] sched: credit2: respect per-vcpu hard affinity
  2015-02-02 14:24         ` Dario Faggioli
@ 2015-02-03  9:09           ` Justin Weaver
  0 siblings, 0 replies; 15+ messages in thread
From: Justin Weaver @ 2015-02-03  9:09 UTC (permalink / raw)
  To: Dario Faggioli; +Cc: George Dunlap, Henri Casanova, xen-devel

[-- Attachment #1: Type: text/plain, Size: 2534 bytes --]

On Mon, Feb 2, 2015 at 4:24 AM, Dario Faggioli
<dario.faggioli@citrix.com> wrote:
> On Sat, 2015-01-31 at 20:51 -1000, Justin Weaver wrote:
>> On Mon, Jan 19, 2015 at 9:21 PM, Justin Weaver <jtweaver@hawaii.edu> wrote:
>> > On Mon, Jan 12, 2015 at 8:05 AM, Dario Faggioli
>
>> For example...
>> I start a guest with one vcpu with hard affinity 8 - 15 and xl
>> vcpu-list says it's running on pcpu 15
>> I run xl vcpu-pin 1 0 8 to change it to hard affinity only with pcpu 8
>> When it gets to vcpu_wake, it tests vcpu_runnable(v) which is false
>> because _VPF_blocked is set, so it skips the call to
>> SCHED_OP(VCPU2OP(v), wake, v); and so does not get a runq_tickle
>> xl vcpu-list now shows --- for the state and I cannot console into it
>> What I don't understand though is if I then enter xl vcpu-pin 1 0 15
>> it reports that _VPF_blocked is NOT set, vcpu_wake calls credit2's
>> wake, it gets a runq_tickle and everything is fine again
>> Why did the value of the _VPF_blocked flag change after I entered xl
>> vcpu-pin the second time?? I dove deep in the code and could not
>> figure it out.
>>
> As promised, I am having a look. As I recalled, affinity
> setting/changing should not involve fiddling with _VPF_blocked, so there
> should be something else going on.
>
> This seems to me to be confirmed by the fact that you can make the
> system working again via another call to 'vcpu-pin'.
>
> I'm playing a little bit with your code, with and without the call to
> migrate(), to try and see better what's happening. If you happen to have
> an updated version of it, even if still work-in progress, and not ready
> for being the actual v2, and you want to share it, so that I can try
> that one, please go ahead (just attach it to your reply, it's for
> debugging, so no necessary for it to be a proper submission).

Dario,

Please see attached draft v2 1/2 patch.

Changes since v1:
 * Added dynamically allocated cpumasks to avoid putting them on the stack
 * Added helper function for code suggested in v1 review and called in two
   locations in function choose_cpu
 * Replaced two instances of cpumask_and/cpumask_empty with cpumask_intersects
 * Removed coding style fix in function balance_load
 * Improved comment in function runq_candidate
Debugging changes:
* During an affinity change, migrate only gets called when the current
and destination run queues are different
* The changes to schedule.c and domctl.c are only for debugging
(printk calls that only trigger during an affinity change)

Thank you,
Justin

[-- Attachment #2: sched-credit2-v2_1of2_draft.patch --]
[-- Type: application/octet-stream, Size: 23717 bytes --]

From 78288b7fcc9ae64618ff50bdabe14afa706ff386 Mon Sep 17 00:00:00 2001
From: "Justin T. Weaver" <jtweaver@hawaii.edu>
Date: Mon, 2 Feb 2015 22:04:51 -1000
Subject: [PATCH v2 1/2 draft] sched: credit2: respect per-vcpu hard affinity

by making sure that vcpus only run on the pcpu(s) they are allowed to
run on based on their hard affinity cpu masks.

Signed-off-by: Justin T. Weaver <jtweaver@hawaii.edu>
---
 xen/common/domctl.c        |    6 ++
 xen/common/sched_credit2.c |  218 +++++++++++++++++++++++++++++++++++++++-----
 xen/common/schedule.c      |   76 ++++++++++++++-
 3 files changed, 272 insertions(+), 28 deletions(-)

diff --git a/xen/common/domctl.c b/xen/common/domctl.c
index ee578c0..de76298 100644
--- a/xen/common/domctl.c
+++ b/xen/common/domctl.c
@@ -32,6 +32,8 @@
 #include <public/domctl.h>
 #include <xsm/xsm.h>
 
+extern bool_t setting_affinity;
+
 static DEFINE_SPINLOCK(domctl_lock);
 DEFINE_SPINLOCK(vcpu_alloc_lock);
 
@@ -729,6 +731,8 @@ long do_domctl(XEN_GUEST_HANDLE_PARAM(xen_domctl_t) u_domctl)
         if ( (v = d->vcpu[vcpuaff->vcpu]) == NULL )
             break;
 
+        printk("domctl:do_domctl - VPF_blocked is %d\n", test_bit(_VPF_blocked, &v->pause_flags));
+
         ret = -EINVAL;
         if ( vcpuaffinity_params_invalid(vcpuaff) )
             break;
@@ -738,6 +742,8 @@ long do_domctl(XEN_GUEST_HANDLE_PARAM(xen_domctl_t) u_domctl)
             cpumask_var_t new_affinity, old_affinity;
             cpumask_t *online = cpupool_online_cpumask(v->domain->cpupool);;
 
+            setting_affinity = 1;
+
             /*
              * We want to be able to restore hard affinity if we are trying
              * setting both and changing soft affinity (which happens later,
diff --git a/xen/common/sched_credit2.c b/xen/common/sched_credit2.c
index cf53770..f452e36 100644
--- a/xen/common/sched_credit2.c
+++ b/xen/common/sched_credit2.c
@@ -26,6 +26,8 @@
 #include <xen/trace.h>
 #include <xen/cpu.h>
 
+extern bool_t setting_affinity;
+
 #define d2printk(x...)
 //#define d2printk printk
 
@@ -194,6 +196,12 @@ int opt_overload_balance_tolerance=-3;
 integer_param("credit2_balance_over", opt_overload_balance_tolerance);
 
 /*
+ * Use this to avoid having too many cpumask_t structs on the stack
+ */
+static cpumask_t **cpumask = NULL;
+#define csched2_cpumask cpumask[smp_processor_id()]
+
+/*
  * Per-runqueue data
  */
 struct csched2_runqueue_data {
@@ -268,6 +276,23 @@ struct csched2_dom {
     uint16_t nr_vcpus;
 };
 
+/*
+ * When a hard affinity change occurs, we may not be able to check some or
+ * all of the other run queues for a valid new processor for the given vcpu.
+ * Return svc's current pcpu if valid, otherwise return a safe pcpu.
+ */
+static int get_safe_pcpu(struct csched2_vcpu *svc)
+{
+    cpumask_and(csched2_cpumask, svc->vcpu->cpu_hard_affinity, &svc->rqd->active);
+    if ( unlikely(cpumask_empty(csched2_cpumask)) )
+        cpumask_and(csched2_cpumask, svc->vcpu->cpu_hard_affinity,
+            cpupool_online_cpumask(svc->vcpu->domain->cpupool));
+
+    if ( cpumask_test_cpu(svc->vcpu->processor, csched2_cpumask) )
+        return svc->vcpu->processor;
+    else
+        return cpumask_any(csched2_cpumask);
+}
 
 /*
  * Time-to-credit, credit-to-time.
@@ -501,8 +526,9 @@ runq_tickle(const struct scheduler *ops, unsigned int cpu, struct csched2_vcpu *
         goto tickle;
     }
     
-    /* Get a mask of idle, but not tickled */
+    /* Get a mask of idle, but not tickled, that new is allowed to run on. */
     cpumask_andnot(&mask, &rqd->idle, &rqd->tickled);
+    cpumask_and(&mask, &mask, new->vcpu->cpu_hard_affinity);
     
     /* If it's not empty, choose one */
     i = cpumask_cycle(cpu, &mask);
@@ -513,9 +539,11 @@ runq_tickle(const struct scheduler *ops, unsigned int cpu, struct csched2_vcpu *
     }
 
     /* Otherwise, look for the non-idle cpu with the lowest credit,
-     * skipping cpus which have been tickled but not scheduled yet */
+     * skipping cpus which have been tickled but not scheduled yet,
+     * that new is allowed to run on. */
     cpumask_andnot(&mask, &rqd->active, &rqd->idle);
     cpumask_andnot(&mask, &mask, &rqd->tickled);
+    cpumask_and(&mask, &mask, new->vcpu->cpu_hard_affinity);
 
     for_each_cpu(i, &mask)
     {
@@ -965,6 +993,8 @@ csched2_vcpu_wake(const struct scheduler *ops, struct vcpu *vc)
     {
         /* If we've boosted someone that's already on a runqueue, prioritize
          * it and inform the cpu in question. */
+        if (setting_affinity && vc->domain->domain_id > 0)
+            printk("credit2:csched2_vcpu_wake - NOT calling runq_insert and runq_tickle because vcpu is on a run queue\n");
         goto out;
     }
 
@@ -988,7 +1018,12 @@ csched2_vcpu_wake(const struct scheduler *ops, struct vcpu *vc)
     update_load(ops, svc->rqd, svc, 1, now);
         
     /* Put the VCPU on the runq */
+    if (setting_affinity && vc->domain->domain_id > 0)
+        printk("credit2:csched2_vcpu_wake - calling runq_insert\n");
     runq_insert(ops, vc->processor, svc);
+    if (setting_affinity && vc->domain->domain_id > 0)
+        printk("credit2:csched2_vcpu_wake - calling runq_tickle for pcpu # %d\n",
+            vc->processor);
     runq_tickle(ops, vc->processor, svc, now);
 
 out:
@@ -1053,7 +1088,7 @@ choose_cpu(const struct scheduler *ops, struct vcpu *vc)
      *
      * Since one of the runqueue locks is already held, we can't
      * just grab the prv lock.  Instead, we'll have to trylock, and
-     * do something else reasonable if we fail.
+     * return a safe cpu.
      */
 
     if ( !spin_trylock(&prv->lock) )
@@ -1063,9 +1098,8 @@ choose_cpu(const struct scheduler *ops, struct vcpu *vc)
             d2printk("%pv -\n", svc->vcpu);
             clear_bit(__CSFLAG_runq_migrate_request, &svc->flags);
         }
-        /* Leave it where it is for now.  When we actually pay attention
-         * to affinity we'll have to figure something out... */
-        return vc->processor;
+
+    return get_safe_pcpu(svc);
     }
 
     /* First check to see if we're here because someone else suggested a place
@@ -1081,13 +1115,17 @@ choose_cpu(const struct scheduler *ops, struct vcpu *vc)
         else
         {
             d2printk("%pv +\n", svc->vcpu);
-            new_cpu = cpumask_cycle(vc->processor, &svc->migrate_rqd->active);
-            goto out_up;
+            cpumask_and(csched2_cpumask, vc->cpu_hard_affinity,
+                &svc->migrate_rqd->active);
+            if ( !cpumask_empty(csched2_cpumask) )
+            {
+                new_cpu = cpumask_any(csched2_cpumask);
+                goto out_up;
+            }
+            /* Fall-through to normal cpu pick */
         }
     }
 
-    /* FIXME: Pay attention to cpu affinity */                                                                                      
-
     min_avgload = MAX_LOAD;
 
     /* Find the runqueue with the lowest instantaneous load */
@@ -1099,17 +1137,24 @@ choose_cpu(const struct scheduler *ops, struct vcpu *vc)
         rqd = prv->rqd + i;
 
         /* If checking a different runqueue, grab the lock,
-         * read the avg, and then release the lock.
+         * check hard affinity, read the avg, and then release the lock.
          *
          * If on our own runqueue, don't grab or release the lock;
          * but subtract our own load from the runqueue load to simulate
          * impartiality */
         if ( rqd == svc->rqd )
         {
+            if ( !cpumask_intersects(vc->cpu_hard_affinity, &rqd->active) )
+                continue;
             rqd_avgload = rqd->b_avgload - svc->avgload;
         }
         else if ( spin_trylock(&rqd->lock) )
         {
+            if ( !cpumask_intersects(vc->cpu_hard_affinity, &rqd->active) )
+            {
+                spin_unlock(&rqd->lock);
+                continue;
+            }
             rqd_avgload = rqd->b_avgload;
             spin_unlock(&rqd->lock);
         }
@@ -1123,18 +1168,25 @@ choose_cpu(const struct scheduler *ops, struct vcpu *vc)
         }
     }
 
-    /* We didn't find anyone (most likely because of spinlock contention); leave it where it is */
     if ( min_rqi == -1 )
-        new_cpu = vc->processor;
+    {
+        /* No runqs found (most likely because of spinlock contention). */
+    new_cpu = get_safe_pcpu(svc);
+    }
     else
     {
-        new_cpu = cpumask_cycle(vc->processor, &prv->rqd[min_rqi].active);
+        cpumask_and(csched2_cpumask, vc->cpu_hard_affinity,
+            &prv->rqd[min_rqi].active);
+        new_cpu = cpumask_any(csched2_cpumask);
         BUG_ON(new_cpu >= nr_cpu_ids);
     }
 
 out_up:
     spin_unlock(&prv->lock);
 
+    if (setting_affinity && vc->domain->domain_id > 0)
+        printk("credit2:choose_cpu - returning cpu # %d\n", new_cpu);
+
     return new_cpu;
 }
 
@@ -1197,24 +1249,47 @@ static void migrate(const struct scheduler *ops,
     }
     else
     {
-        int on_runq=0;
-        /* It's not running; just move it */
+        /* It's not running; move it if it's on a different runq than trqd. */
+        bool_t on_runq = 0;
+
         d2printk("%pv %d-%d i\n", svc->vcpu, svc->rqd->id, trqd->id);
+
+        /* Re-assign vcpu's processor, if necessary. */
+        cpumask_and(csched2_cpumask, svc->vcpu->cpu_hard_affinity, &trqd->active);
+        svc->vcpu->processor = cpumask_any(csched2_cpumask);
+        if ( !cpumask_test_cpu(svc->vcpu->processor, csched2_cpumask) )
+            svc->vcpu->processor = cpumask_any(csched2_cpumask);
+
         if ( __vcpu_on_runq(svc) )
+            on_runq = 1;
+
+        /* If the runqs are different, move svc to trqd. */
+        if ( svc->rqd != trqd )
         {
-            __runq_remove(svc);
-            update_load(ops, svc->rqd, svc, -1, now);
-            on_runq=1;
+            if ( on_runq )
+            {
+                __runq_remove(svc);
+                update_load(ops, svc->rqd, svc, -1, now);
+            }
+            __runq_deassign(svc);
+            __runq_assign(svc, trqd);
+            if ( on_runq )
+            {
+                update_load(ops, svc->rqd, svc, 1, now);
+                runq_insert(ops, svc->vcpu->processor, svc);
+            }
         }
-        __runq_deassign(svc);
-        svc->vcpu->processor = cpumask_any(&trqd->active);
-        __runq_assign(svc, trqd);
+
         if ( on_runq )
         {
-            update_load(ops, svc->rqd, svc, 1, now);
-            runq_insert(ops, svc->vcpu->processor, svc);
+            if (setting_affinity && svc->vcpu->domain->domain_id > 0)
+                printk("credit2:migrate - calling runq_tickle\n");
             runq_tickle(ops, svc->vcpu->processor, svc, now);
         }
+        else
+            if (setting_affinity && svc->vcpu->domain->domain_id > 0)
+                printk("credit2:migrate - NOT calling runq_tickle because vcpu was not on a run queue before migrating\n");
+            runq_tickle(ops, svc->vcpu->processor, svc, now);
     }
 }
 
@@ -1250,6 +1325,11 @@ retry:
     for_each_cpu(i, &prv->active_queues)
     {
         s_time_t delta;
+        /* true if there are no vcpus to push due to hard affinity */
+        bool_t ha_no_push = 1;
+        /* true if there are no vcpus to pull due to hard affinity */
+        bool_t ha_no_pull = 1;
+        struct list_head *iter;
         
         st.orqd = prv->rqd + i;
 
@@ -1257,6 +1337,47 @@ retry:
              || !spin_trylock(&st.orqd->lock) )
             continue;
 
+        /*
+         * If due to hard affinity there are no vcpus that can be
+         * pulled or pushed, move to the next runq in the loop.
+         */
+
+        /* See if there are any vcpus that can be pushed from lrqd to orqd. */
+        list_for_each( iter, &st.lrqd->svc )
+        {
+            struct csched2_vcpu * svc =
+                list_entry(iter, struct csched2_vcpu, rqd_elem);
+            cpumask_and(csched2_cpumask, svc->vcpu->cpu_hard_affinity,
+                &st.orqd->active);
+            if (!cpumask_empty(csched2_cpumask))
+            {
+                /* vcpu can be pushed from lrqd to ordq. */
+                ha_no_push = 0;
+                break;
+            }
+        }
+
+        /* See if there are any vcpus that can be pulled from orqd to lrqd. */
+        list_for_each( iter, &st.orqd->svc )
+        {
+            struct csched2_vcpu * svc =
+                list_entry(iter, struct csched2_vcpu, rqd_elem);
+            cpumask_and(csched2_cpumask, svc->vcpu->cpu_hard_affinity,
+                &st.lrqd->active);
+            if (!cpumask_empty(csched2_cpumask))
+            {
+                /* vcpu can be pulled from orqd to lrdq. */
+                ha_no_pull = 0;
+                break;
+            }
+        }
+
+        if ( ha_no_push && ha_no_pull )
+        {
+            spin_unlock(&st.orqd->lock);
+            continue;
+        }
+
         __update_runq_load(ops, st.orqd, 0, now);
     
         delta = st.lrqd->b_avgload - st.orqd->b_avgload;
@@ -1330,6 +1451,12 @@ retry:
         if ( test_bit(__CSFLAG_runq_migrate_request, &push_svc->flags) )
             continue;
 
+        /* Skip if it can't run on the destination runq. */
+        cpumask_and(csched2_cpumask, push_svc->vcpu->cpu_hard_affinity,
+            &st.orqd->active);
+        if ( cpumask_empty(csched2_cpumask) )
+            continue;
+
         list_for_each( pull_iter, &st.orqd->svc )
         {
             struct csched2_vcpu * pull_svc = list_entry(pull_iter, struct csched2_vcpu, rqd_elem);
@@ -1343,6 +1470,12 @@ retry:
             if ( test_bit(__CSFLAG_runq_migrate_request, &pull_svc->flags) )
                 continue;
 
+            /* Skip if it can't run on the destination runq. */
+            cpumask_and(csched2_cpumask, pull_svc->vcpu->cpu_hard_affinity,
+                &st.lrqd->active);
+            if ( cpumask_empty(csched2_cpumask) )
+                continue;
+
             consider(&st, push_svc, pull_svc);
         }
 
@@ -1355,11 +1488,17 @@ retry:
     list_for_each( pull_iter, &st.orqd->svc )
     {
         struct csched2_vcpu * pull_svc = list_entry(pull_iter, struct csched2_vcpu, rqd_elem);
-        
+
         /* Skip this one if it's already been flagged to migrate */
         if ( test_bit(__CSFLAG_runq_migrate_request, &pull_svc->flags) )
             continue;
 
+        /* Skip if it can't run on the destination runq. */
+        cpumask_and(csched2_cpumask, pull_svc->vcpu->cpu_hard_affinity,
+            &st.lrqd->active);
+        if ( cpumask_empty(csched2_cpumask) )
+            continue;
+
         /* Consider pull only */
         consider(&st, NULL, pull_svc);
     }
@@ -1401,6 +1540,9 @@ csched2_vcpu_migrate(
 
     if ( trqd != svc->rqd )
         migrate(ops, svc, trqd, NOW());
+    else
+        if (setting_affinity && vc->domain->domain_id > 0)
+            printk("credit2:csched2_vcpu_migrate - NOT calling migrate because destination run queue is same as current\n");
 }
 
 static int
@@ -1610,6 +1752,10 @@ runq_candidate(struct csched2_runqueue_data *rqd,
     {
         struct csched2_vcpu * svc = list_entry(iter, struct csched2_vcpu, runq_elem);
 
+        /* Only consider vcpus that are allowed to run on this processor. */
+        if ( !cpumask_test_cpu(cpu, svc->vcpu->cpu_hard_affinity) )
+            continue;
+
         /* If this is on a different processor, don't pull it unless
          * its credit is at least CSCHED2_MIGRATE_RESIST higher. */
         if ( svc->vcpu->processor != cpu
@@ -1992,6 +2138,13 @@ csched2_alloc_pdata(const struct scheduler *ops, int cpu)
         printk("%s: cpu %d not online yet, deferring initializatgion\n",
                __func__, cpu);
 
+    /*
+     * For each new pcpu, allocate a cpumask_t for use throughout the
+     * scheduler to avoid putting any cpumask_t structs on the stack.
+     */
+    if ( !zalloc_cpumask_var(&cpumask[cpu]) )
+        return NULL;
+
     return (void *)1;
 }
 
@@ -2040,6 +2193,8 @@ csched2_free_pdata(const struct scheduler *ops, void *pcpu, int cpu)
 
     spin_unlock_irqrestore(&prv->lock, flags);
 
+    free_cpumask_var(cpumask[cpu]);
+
     return;
 }
 
@@ -2127,16 +2282,29 @@ csched2_init(struct scheduler *ops)
 
     prv->load_window_shift = opt_load_window_shift;
 
+    cpumask = xzalloc_bytes(nr_cpu_ids * sizeof(cpumask_t *));
+    if ( cpumask == NULL )
+        return -ENOMEM;
+
     return 0;
 }
 
 static void
 csched2_deinit(const struct scheduler *ops)
 {
+    int i;
     struct csched2_private *prv;
 
     prv = CSCHED2_PRIV(ops);
     xfree(prv);
+
+    if ( cpumask != NULL )
+    {
+        for ( i = 0; i < nr_cpu_ids; i++ )
+            if ( cpumask[i] != NULL )
+                free_cpumask_var(cpumask[i]);
+        xfree(cpumask);
+    }
 }
 
 
diff --git a/xen/common/schedule.c b/xen/common/schedule.c
index b73177f..1c5961c 100644
--- a/xen/common/schedule.c
+++ b/xen/common/schedule.c
@@ -38,6 +38,8 @@
 #include <public/sched.h>
 #include <xsm/xsm.h>
 
+bool_t setting_affinity = 0;
+
 /* opt_sched: scheduler - default to credit */
 static char __initdata opt_sched[10] = "credit";
 string_param("sched", opt_sched);
@@ -358,8 +360,13 @@ void vcpu_sleep_nosync(struct vcpu *v)
         if ( v->runstate.state == RUNSTATE_runnable )
             vcpu_runstate_change(v, RUNSTATE_offline, NOW());
 
+        if (setting_affinity && v->domain->domain_id > 0)
+            printk("schedule:vcpu_sleep_nosync - calling credit2:sleep\n");
         SCHED_OP(VCPU2OP(v), sleep, v);
     }
+    else
+        if (setting_affinity && v->domain->domain_id > 0)
+            printk("schedule:vcpu_sleep_nosync - NOT calling credit2:sleep because vcpu_runnable(v) was true\n");
 
     vcpu_schedule_unlock_irqrestore(lock, flags, v);
 
@@ -381,16 +388,28 @@ void vcpu_wake(struct vcpu *v)
     unsigned long flags;
     spinlock_t *lock = vcpu_schedule_lock_irqsave(v, &flags);
 
+    if (setting_affinity && v->domain->domain_id > 0)
+        printk("schedule:vcpu_wake - pause flags for d%dv%d is %lu\n",
+            v->domain->domain_id, v->vcpu_id, v->pause_flags);
+
     if ( likely(vcpu_runnable(v)) )
     {
         if ( v->runstate.state >= RUNSTATE_blocked )
             vcpu_runstate_change(v, RUNSTATE_runnable, NOW());
+        if (setting_affinity && v->domain->domain_id > 0)
+            printk("schedule:vcpu_wake - calling credit2:wake\n");
         SCHED_OP(VCPU2OP(v), wake, v);
     }
-    else if ( !test_bit(_VPF_blocked, &v->pause_flags) )
+    else
     {
-        if ( v->runstate.state == RUNSTATE_blocked )
-            vcpu_runstate_change(v, RUNSTATE_offline, NOW());
+        if (setting_affinity && v->domain->domain_id > 0)
+            printk("schedule:vcpu_wake - NOT calling credit2:wake because vcpu_runnable() is false\n");
+
+        if ( !test_bit(_VPF_blocked, &v->pause_flags) )
+        {
+            if ( v->runstate.state == RUNSTATE_blocked )
+                vcpu_runstate_change(v, RUNSTATE_offline, NOW());
+        }
     }
 
     vcpu_schedule_unlock_irqrestore(lock, flags, v);
@@ -400,6 +419,9 @@ void vcpu_wake(struct vcpu *v)
 
 void vcpu_unblock(struct vcpu *v)
 {
+    if (setting_affinity && v->domain->domain_id > 0)
+        printk("schedule:vcpu_unblock - clearing VPF_block\n");
+
     if ( !test_and_clear_bit(_VPF_blocked, &v->pause_flags) )
         return;
 
@@ -470,6 +492,8 @@ static void vcpu_migrate(struct vcpu *v)
                 break;
 
             /* Select a new CPU. */
+            if (setting_affinity && v->domain->domain_id > 0)
+                printk("schedule:vcpu_migrate - calling credit2:pick_cpu\n");
             new_cpu = SCHED_OP(VCPU2OP(v), pick_cpu, v);
             if ( (new_lock == per_cpu(schedule_data, new_cpu).schedule_lock) &&
                  cpumask_test_cpu(new_cpu, v->domain->cpupool->cpu_valid) )
@@ -520,7 +544,11 @@ static void vcpu_migrate(struct vcpu *v)
      * the lock pointer cant' change while the current lock is held.
      */
     if ( VCPU2OP(v)->migrate )
+    {
+        if (setting_affinity && v->domain->domain_id > 0)
+            printk("schedule:vcpu_migrate - calling credit2:migrate\n");
         SCHED_OP(VCPU2OP(v), migrate, v, new_cpu);
+    }
     else
         v->processor = new_cpu;
 
@@ -533,6 +561,8 @@ static void vcpu_migrate(struct vcpu *v)
         sched_move_irqs(v);
 
     /* Wake on new CPU. */
+    if (setting_affinity && v->domain->domain_id > 0)
+        printk("schedule:vcpu_migrate - calling vcpu_wake\n");
     vcpu_wake(v);
 }
 
@@ -671,8 +701,42 @@ static int vcpu_set_affinity(
 
     if ( test_bit(_VPF_migrating, &v->pause_flags) )
     {
+        if (setting_affinity && v->domain->domain_id > 0)
+        {
+            printk("****************************************************\n");
+            printk("schedule:vcpu_set_affinity - pause flags is %lu\n", v->pause_flags);
+            printk("1 VPF_blocked is %d\n", test_bit(_VPF_blocked, &v->pause_flags));
+            printk("2 VPF_down is %d\n", test_bit(_VPF_down, &v->pause_flags));
+            printk("4 VPF_blocked_in_xen is %d\n", test_bit(_VPF_blocked_in_xen, &v->pause_flags));
+            printk("8 VPF_migrating is %d\n", test_bit(_VPF_migrating, &v->pause_flags));
+            printk("16 VPF_mem_paging is %d\n", test_bit(_VPF_mem_paging, &v->pause_flags));
+            printk("32 VPF_mem_access is %d\n", test_bit(_VPF_mem_access, &v->pause_flags));
+            printk("64 VPF_mem_sharing is %d\n", test_bit(_VPF_mem_sharing, &v->pause_flags));
+            printk("128 VPF_in_reset is %d\n", test_bit(_VPF_in_reset, &v->pause_flags));
+        }
+
+        printk("schedule:vcpu_set_affinity - calling vcpu_sleep_nosync\n");
         vcpu_sleep_nosync(v);
+        printk("schedule:vcpu_set_affinity - calling vcpu_migrate\n");
         vcpu_migrate(v);
+
+        if (setting_affinity && v->domain->domain_id > 0)
+        {
+            printk("schedule:vcpu_set_affinity - pause flags is %lu\n", v->pause_flags);
+            printk("1 VPF_blocked is %d\n", test_bit(_VPF_blocked, &v->pause_flags));
+            printk("2 VPF_down is %d\n", test_bit(_VPF_down, &v->pause_flags));
+            printk("4 VPF_blocked_in_xen is %d\n", test_bit(_VPF_blocked_in_xen, &v->pause_flags));
+            printk("8 VPF_migrating is %d\n", test_bit(_VPF_migrating, &v->pause_flags));
+            printk("16 VPF_mem_paging is %d\n", test_bit(_VPF_mem_paging, &v->pause_flags));
+            printk("32 VPF_mem_access is %d\n", test_bit(_VPF_mem_access, &v->pause_flags));
+            printk("64 VPF_mem_sharing is %d\n", test_bit(_VPF_mem_sharing, &v->pause_flags));
+            printk("128 VPF_in_reset is %d\n", test_bit(_VPF_in_reset, &v->pause_flags));
+        }
+
+        if (setting_affinity && v->domain->domain_id > 0)
+            printk("schedule:vcpu_set_affinity - Done changing hard affinity\n");
+
+        setting_affinity = 0;
     }
 
     return 0;
@@ -704,6 +768,9 @@ void vcpu_block(void)
 {
     struct vcpu *v = current;
 
+    if (setting_affinity && v->domain->domain_id > 0)
+        printk("schedule:vcpu_block - setting VPF_block\n");
+
     set_bit(_VPF_blocked, &v->pause_flags);
 
     /* Check for events /after/ blocking: avoids wakeup waiting race. */
@@ -739,6 +806,9 @@ static long do_poll(struct sched_poll *sched_poll)
     if ( !guest_handle_okay(sched_poll->ports, sched_poll->nr_ports) )
         return -EFAULT;
 
+    if (setting_affinity && v->domain->domain_id > 0)
+        printk("schedule:do_poll - setting VPF_block\n");
+
     set_bit(_VPF_blocked, &v->pause_flags);
     v->poll_evtchn = -1;
     set_bit(v->vcpu_id, d->poll_mask);
-- 
1.7.10.4


[-- Attachment #3: Type: text/plain, Size: 126 bytes --]

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
http://lists.xen.org/xen-devel

^ permalink raw reply related	[flat|nested] 15+ messages in thread

* Re: [PATCH 2/2] sched: credit2: consider per-vcpu soft affinity
  2015-01-13 13:06   ` Dario Faggioli
@ 2015-02-11  8:38     ` Justin Weaver
  0 siblings, 0 replies; 15+ messages in thread
From: Justin Weaver @ 2015-02-11  8:38 UTC (permalink / raw)
  To: Dario Faggioli; +Cc: George Dunlap, Henri Casanova, xen-devel

Dario,

I'm working on soft affinity while you review v2 of the credit 2 hard
affinity patch (no rush, of course).

On Tue, Jan 13, 2015 at 3:06 AM, Dario Faggioli
<dario.faggioli@citrix.com> wrote:
> On Sat, 2014-11-29 at 14:33 -1000, Justin T. Weaver wrote:
>> when deciding which run queue to assign each vcpu to.
>>
>> There are two main changes in functionality that this patch introduces.
>>
>> First, in function runq_tickle, it tries to find idle pcpus in other run
>> queues that the vcpu would prefer to run on (soft affinity) or can run on
>> (hard affinity), in that order.
>>
> I see. I don't think a function like runq_tickle() should be able to
> move a vcpu to a different runqueue. That is done in other places, and I
> would leave this as it is. Tickling usually happens right after a vcpu
> has been put on a runqueue, which is something controlled by other parts
> of the Credit2 algorithm and code, and that is by design. More
> specifically, moving vcpus among runqueues is the job of the load
> balancing logic.
>
> Turning runq_tickle() into a potential "runqueue-changer" would mix the
> two concepts of inter-runqueue and intra-runqueue balancing/migration up
> to a point that I do not like. It also makes the code look rather
> weird... E.g., look at when runq_tickle() is called in migrate(). In
> there, right after going through:
>  - __runq_remove();
>  - __runq_deassign();
>  - __runq_assign();
>  - runq_insert();
> we call runq_tickle() and... Do it all aver again!?!? :-O
> Even just because of this, I'd say that, if such a change to
> runq_tickle() would be necessary (which I don't think), that should
> happen differently, and should involve changing its various callers, to
> avoid the repetition.
>
> OTOH, it looks like you are not touching the existing
> for_each_cpu(i,&mask) loop, still inside runq_tickle() itself, while I'd
> say that looking for a pcpu that the vcpu has soft affinity with, among
> all the non idle and non-tickled pcpus in the runqueue, really is
> worthwhile. Actually, the same applies to idle pcpus in the runqueue,
> i.e., to the code before the for_each_cpu() loop, still in
> runq_tickle().
>
> So, instead of adding to runq_tickle() the capability of moving vcpus
> among runqueues, which is not his responsibility, I would add the usual
> affinity balancing logic (== the for_each_csched2_balance_step2) to it.

What I was going for is using runq_tickle as a last resort for finding
an idle processor to run a vcpu. Also I was thinking about how the
mechanism would work if say balance_load does move a vcpu away from
its hard affinity, where/when would it get moved back to soft? But I
see that this isn't the place to make those kind of decisions. I'll
implement the balancing logic here like you say.

> When we'll have this in place, and we will be able to bench it, we'll
> see if we think that there is a need for making such function capable of
> more advanced things (but even at that point, I doubt that changing the
> runqueue of a vcpu would be one of these).
>
>> Second, in function balance_load, if moving a vcpu with soft affinity from
>> one run queue to another means moving it away from its soft affinity to hard
>> affinity only, then that move is not considered for load balancing.
>>
> This is a key part of the whole thing so, please, spend a few more words
> explaining what is your  idea and how it actually works, both here and
> in the sources, with comments.
>
> For what I've understood, I like the idea. However, why don't use the
> usual two step approach? In the soft-affinity step, we won't consider
> vcpus that would see their soft-affinity 'impaired' by being moved, and,
> perhaps, give a preference boost to vcpus that would see their
> soft-affinity 'improved' by being moved. If we don't find any suitable
> candidate, in the next (hard-affinity) step, we fall back to ignoring
> soft-affinity, and only care about hard constraints.

I did try a few different things here like you suggest, but I kept
seeing odd behavior due to how the system is currently calculating
load. I observed a vcpu moving back and forth between two run queues
when it was the only vcpu on one of them. It would move away from soft
to hard (even though it was the only vcpu associated with the run
queue) because of a load calculation, and then get moved back to soft
because I told it to prefer that. I need to get a better understanding
of how run queue load is calculated and then see what makes the most
sense to do here. Also with a two step approach won't we encounter
even more looping through the vcpus like from the hard affinity v1
patch?

> About 'impaired' and 'improoved' soft-affinity. This is just an idea
> popping out my head right now, so bear with me. :-)
> Affinity is to pcpus, while, in principle, here we are talking about
> migration between runqueues, with each runqueue spanning a certain set
> of pcpus. There seems to me to be room for defining something like the
> "soft affinity of a vcpu to a runqueue", as, e.g., the number of pcpus
> of that runqueue with which the vcpu has soft-affinity. This would
> measure how likely the vcpus will run on a pcpu it has soft-affinity
> with, if migrated to the runqueue in question. How do you like this?

I do like this and I've been trying to come up with a way to maintain
this information. Run queues and per-vcpu affinity can change so
having the proper locks to read and update this information is
something I've been thinking about. For example in choose_cpu during
an affinity change there's a chance that it doesn't get the private
lock, and so it wouldn't be able to compare the new affinity with each
of the run queues. But then I was thinking that maybe it can have a
state where the "soft affinity of a vcpu to a runqueue" is unknown
until there is a situation with the proper locks where it can be
calculated and readily available the next time it's needed.

> BTW, understand why yesterday I said I'd rather get hard affinity in
> place before tackling dealing with soft affinity? :-P

Yes, definitely.

>> @@ -127,6 +127,15 @@
>>  #define CSCHED2_CREDIT_RESET         0
>>  /* Max timer: Maximum time a guest can be run for. */
>>  #define CSCHED2_MAX_TIMER            MILLISECS(2)
>> +/* Two step balancing logic; consider soft affinity first. */
>> +#define CSCHED2_BALANCE_SOFT_AFFINITY 0
>> +#define CSCHED2_BALANCE_HARD_AFFINITY 1
>> +/* vcpu runq migration away from vcpu's soft affinity. */
>> +#define CSCHED2_MIGRATE_SOFT_TO_HARD 0
>> +/* vcpu runq migration to vcpu's soft affinity. */
>> +#define CSCHED2_MIGRATE_HARD_TO_SOFT 1
>> +/* vcpu runq migration soft to soft or vcpu has no soft affinity. */
>> +#define CSCHED2_MIGRATE_ANY_TO_ANY   2
>>
>>
>>  #define CSCHED2_IDLE_CREDIT                 (-(1<<30))
>> @@ -176,6 +185,8 @@ integer_param("sched_credit2_migrate_resist", opt_migrate_resist);
>>  #define c2r(_ops, _cpu)     (CSCHED2_PRIV(_ops)->runq_map[(_cpu)])
>>  /* CPU to runqueue struct macro */
>>  #define RQD(_ops, _cpu)     (&CSCHED2_PRIV(_ops)->rqd[c2r(_ops, _cpu)])
>> +#define for_each_csched2_balance_step(step) \
>> +    for ( (step) = 0; (step) <= CSCHED2_BALANCE_HARD_AFFINITY; (step)++ )
>>
> This is identical (apart from the CSCHED2 prefix) to what we have in
> Credit1. I wonder whether it wouldn't make sense to factor it out in a
> common header.
>
> I'm saying I wonder because I am actually not sure. As things stands, it
> makes perfect sense. Even if looking forward, I don't see why the two
> schedulers (as well as any other scheduler wanting to support
> affinities) should do things differently between each others.
>
> The only doubt I have is not about the structure (i.e., the loop, etc.)
> but about the fact that a scheduler may perhaps want to add more steps,
> or things like this.
>
> If I'd have to decide now, I'd say let's factor this out, and when --if
> ever-- someone with different needs will come, we'll see what to do.
> Thoughts?

More steps could be added if the new scheduler #defines them as
negative numbers I suppose. I will factor this and the sections you
mention below out. I was thinking about putting them in sched-if.h.
Does that sound OK, or do you think they should be in a new or
different file?

>> +/*
>> + * A vcpu has meaningful soft affinity if...
>> + * - its soft affinity mask is not full, and
>> + * - the passed in mask (usually its hard affinity mask) intersects
>> + *   with its soft affinity mask
>> + */
>> +static inline int __vcpu_has_soft_affinity(const struct vcpu *vc,
>> +                                           const cpumask_t *mask)
>> +{
>> +    if ( cpumask_full(vc->cpu_soft_affinity) ||
>> +        !cpumask_intersects(vc->cpu_soft_affinity, mask) )
>> +        return 0;
>> +
>> +    return 1;
>> +}
>>
> About this, I have far less doubts when asking you to factor this out
> from sched_credit.c and use it here. :-)
>
>> +static void
>> +csched2_balance_cpumask(const struct vcpu *vc, int step, cpumask_t *mask)
>> +{
>> +    if ( step == CSCHED2_BALANCE_SOFT_AFFINITY )
>> +    {
>> +        cpumask_and(mask, vc->cpu_soft_affinity, vc->cpu_hard_affinity);
>> +
>> +        if ( unlikely(cpumask_empty(mask)) )
>> +            cpumask_copy(mask, vc->cpu_hard_affinity);
>> +    }
>> +    else /* step == CSCHED2_BALANCE_HARD_AFFINITY */
>> +        cpumask_copy(mask, vc->cpu_hard_affinity);
>> +}
>>
> Same here.
>
> Oh, and have also a look at how this is used, in sched_credit.c, in
> conjunction with the per-csched_pcpu variable balance_mask.
>
> As I said yesterday, too many cpumask_t variables on stack should be
> avoided, and that was how I managed to comply to this when working on
> Credit1.
>
> There for sure are more than a few places where you can do the same, or
> something very similar.
>
>> @@ -513,6 +559,68 @@ runq_tickle(const struct scheduler *ops, unsigned int cpu, struct csched2_vcpu *
>>          goto tickle;
>>      }
>>
>> +    /* Look for idle cpus in other runqs; consider soft affinity first. */
>> +    for_each_csched2_balance_step( balance_step )
>> +    {
>> +        cpumask_t balance_mask;
>> +
>> +        /* Skip the soft affinity balance step if new doesn't have any. */
>> +        if ( balance_step == CSCHED2_BALANCE_SOFT_AFFINITY &&
>> +            !__vcpu_has_soft_affinity(
>> +                new->vcpu, new->vcpu->cpu_hard_affinity) )
>> +            continue;
>> +
>> +        /* Skip this step if can't get a lock on the credit2 private data. */
>> +        if ( !prv_lock_held || !spin_trylock(&prv->lock) )
>> +            continue;
>> +        prv_lock_held = 1;
>> +
>> +        csched2_balance_cpumask(new->vcpu, balance_step, &balance_mask);
>> +
>> +        for_each_cpu(i, &prv->active_queues)
>> +        {
>> +            struct csched2_runqueue_data *temp_rqd;
>> +
>> +            temp_rqd = prv->rqd + i;
>> +
>> +            if ( temp_rqd == rqd || !spin_trylock(&temp_rqd->lock) )
>> +                continue;
>> +
>> +            /* Find idle cpus in the balance mask that are not tickled. */
>> +            cpumask_andnot(&mask, &temp_rqd->idle, &temp_rqd->tickled);
>> +            cpumask_and(&mask, &mask, &balance_mask);
>> +
>> +            if ( !cpumask_empty(&mask) )
>> +            {
>> +                /* Found an idle cpu on another run queue; move new. */
>> +                s_time_t now = 0;
>> +
>> +                ipid = cpumask_any(&mask);
>> +                new->vcpu->processor = ipid;
>> +                __runq_remove(new);
>> +                now = NOW();
>> +                update_load(ops, rqd, new, -1, now);
>> +                __runq_deassign(new);
>> +                __runq_assign(new, temp_rqd);
>> +                update_load(ops, temp_rqd, new, 1, now);
>> +                runq_insert(ops, ipid, new);
>> +                cpumask_set_cpu(ipid, &temp_rqd->tickled);
>> +                cpu_raise_softirq(ipid, SCHEDULE_SOFTIRQ);
>> +
>> +                spin_unlock(&temp_rqd->lock);
>> +                spin_unlock(&prv->lock);
>> +                return;
>> +            }
>> +            else
>> +                /* No suitable idlers found in runq temp_rqd. */
>> +                spin_unlock(&temp_rqd->lock);
>> +        }
>> +
>> +        if ( prv_lock_held && balance_step == CSCHED2_BALANCE_HARD_AFFINITY )
>> +            /* No suitable other-runq idlers found; unlock private data. */
>> +            spin_unlock(&prv->lock);
>> +    }
>> +
>>
> I commented about this above.
>
> BTW, look at the code you'd have to add for making this function behave
> as you wanted, especially locking, and I'm sure you'll concur that all
> that does not belong here.
>
> Where I would put something similar to the above is in choose_cpu()
> below!

I'm not sure I understand what you mean. choose_cpu is only called in
response to a SCHED_OP call to pick_cpu in vcpu_migrate in schedule.c.
Its only job is to return a pcpu to the generic scheduler that it
recommends for the given vcpu. vcpu_migrate eventually calls credit 2
migrate to do the actual run queue moving. Please let me know if I'm
missing something here.

>> @@ -1039,12 +1147,22 @@ choose_cpu(const struct scheduler *ops, struct vcpu *vc)
>>  {
>>      struct csched2_private *prv = CSCHED2_PRIV(ops);
>>      int i, min_rqi = -1, new_cpu;
>> +    int max_soft_cpus = 0, max_soft_cpus_rqi = -1;
>> +    bool_t consider_soft_affinity = 0;
>>      struct csched2_vcpu *svc = CSCHED2_VCPU(vc);
>>      s_time_t min_avgload;
>> -    cpumask_t temp_mask;
>> +    cpumask_t temp_mask, vc_soft_affinity;
>>
>>      BUG_ON(cpumask_empty(&prv->active_queues));
>>
>> +    /* Consider soft affinity in the cpu descision? */
>> +    if ( __vcpu_has_soft_affinity(vc, vc->cpu_hard_affinity) )
>> +    {
>> +        consider_soft_affinity = 1;
>> +        cpumask_and(&vc_soft_affinity, vc->cpu_soft_affinity,
>> +            vc->cpu_hard_affinity);
>> +    }
>> +
>>
> Why "black and white"? Why not using for_each_csched2_balance_step() and
> differentiate which one affinity to consider depending on the step,
> rather than here and for all?
>
>> @@ -1112,11 +1237,15 @@ choose_cpu(const struct scheduler *ops, struct vcpu *vc)
>>
>>      min_avgload = MAX_LOAD;
>>
>> -    /* Find the runqueue with the lowest instantaneous load */
>> +    /*
>> +     * Find the run queue with the most cpus in vc's soft affniity, or the
>> +     * the lowest instantaneous load if not considering soft affinity.
>> +     */
>>
> Exactly my point: if the vcpu does not have any soft-affinity, then, as
> usual, the soft-affinity balancing step must be skipped, and the
> behaviour should fall back to finding the runq with the lowest
> instantaneous load, taking only hard affinity constraint into account.
>
> And if the vcpu does have a soft-affinity, why forget completely about
> instantaneous load? I like the idea of "the most cpus in vc's soft
> affinity" (it's actually really similar to what I was talking about at
> the beginning of this message), but why can't we combine the two things?
> More important, why does it have to be just an option, instead than the
> first step of the soft/hard balancing loop?

Yes, I will re-work choose_cpu to use the affinity balancing steps.
I'll combine soft cpu count and instantaneous load in the soft
balancing step.

Thanks for the feedback!

Justin

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH 1/2] sched: credit2: respect per-vcpu hard affinity
  2015-01-20  7:21     ` Justin Weaver
  2015-02-01  6:51       ` Justin Weaver
@ 2015-03-03  3:13       ` Dario Faggioli
  1 sibling, 0 replies; 15+ messages in thread
From: Dario Faggioli @ 2015-03-03  3:13 UTC (permalink / raw)
  To: jtweaver@hawaii.edu
  Cc: xen-devel@lists.xen.org, George Dunlap, henric@hawaii.edu


[-- Attachment #1.1: Type: text/plain, Size: 4093 bytes --]

So, being finally able to look at this again, I'm going directly to
reviewing v2, since it's available.

However, there was an open question on this email, which I figured I
should answer, so here it is. :-)

On Mon, 2015-01-19 at 21:21 -1000, Justin Weaver wrote:
> On Mon, Jan 12, 2015 at 8:05 AM, Dario Faggioli
> <dario.faggioli@citrix.com> wrote:

> >>      /* First check to see if we're here because someone else suggested a place
> >> @@ -1081,13 +1099,17 @@ choose_cpu(const struct scheduler *ops, struct vcpu *vc)
> >>          else
> >>          {
> >>              d2printk("%pv +\n", svc->vcpu);
> >> -            new_cpu = cpumask_cycle(vc->processor, &svc->migrate_rqd->active);
> >> -            goto out_up;
> >> +            cpumask_and(&temp_mask, vc->cpu_hard_affinity,
> >> +                &svc->migrate_rqd->active);
> >> +            if ( !cpumask_empty(&temp_mask) )
> >> +            {
> >> +                new_cpu = cpumask_any(&temp_mask);
> >> +                goto out_up;
> >> +            }
> >> +            /* Fall-through to normal cpu pick */
> >>
> > So, if we've been asked to move to a cpu where we are not allowed to run
> > (i.e., in case temp_mask ends up empty), we just, silently, ignore the
> > request. This has the potential of screwing the work of the load
> > balancer. If we need to keep this as a sort of "safety catch", then
> > fine, but I'd really like to see a lot of efforts made in the load
> > balancing code to make this not trigger!
>
> I'm glad you brought this up because it's not clear to me what the
> relationship between migrate() and choose_cpu() is, if any. This block
> of code in choose_cpu only triggers if __CSFLAG_runq_migrate_request
> is set, and that flag is only set in migrate(), and only if
> __CSFLAG_scheduled is set. 
>
Yes, because, if the vCPU in question is running, we can't just move it,
and all we can do is "file a migration request". To do so, in Credit2, 2
flags are set:
 - _VPF_migrating, in svc->vcpu->pause_flags
 - __CSFLAG_runq_migrate_request, in svc->flags

> choose_cpu() is only called in response to
> the pick_cpu call in schedule.c in vcpu_migrate() which doesn't have
> anything to do with the load balancer balance_load() in credit 2. 
>
Well, the Credit2 load balancer (sched_credit2.c:balance_load()) calls
sched_credit2.c:migrate(), which, if the vCPU it wants to act on is
running, does as said above: it sets both _VPF_migrating and
__CSFLAGS_runq_migrate_request.

Now, as soon as schedule.c:schedule() is invoked on the pCPU where that
vCPU is running, _VPF_migrating is checked (in context_saved(), called
by context_switch()) and, if it's set, schedule():vcpu_migrate() is
called.

That gets us back to the pick_cpu hook, and hence, in Credit2, to
choose_cpu(), and all because of something initiated by the Credit2 load
balancer... So choose_cpu() and balance_load() seem to have some
relationship, after all. :-D

In fact, if you look at the comment that documents the meaning of the
flag, it says:

/* CSFLAG_runq_migrate_request: This vcpu is being migrated as a result of a
 * credit2-initiated runq migrate request; migrate it to the runqueue indicated
 * in the svc struct.
 */

> It
> seems to me that the hard affinity check is needed at the end of this
> block in case affinity has changed after __CSFLAG_runq_migrate_request
> is set in migrate() and before a call to choose_cpu(). Please let me
> know what you think.
> 
migrate() is only called when all the proper locks are held, so there is
a few scope for things changing under its feet (look at
schedule.c:vcpu_set_affinity()), so I don't think I see the potential
race you're talking about.

That being said, I agree that checking hard affinity is important here.
In fact, no matter whether it is Credit2 load balancer or something else
that brought us here, but we can't return to pick_cpu() a pCPU which is
not in our hard affinity mask.

For more/other comments, see the reply to v2 posting... :-)

Regards,
Dario

[-- Attachment #1.2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 181 bytes --]

[-- Attachment #2: Type: text/plain, Size: 126 bytes --]

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
http://lists.xen.org/xen-devel

^ permalink raw reply	[flat|nested] 15+ messages in thread

end of thread, other threads:[~2015-03-03  3:13 UTC | newest]

Thread overview: 15+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2014-11-30  0:33 [PATCH 0/2] Credit2: introduce per-vcpu hard and soft affinity Justin T. Weaver
2014-11-30  0:33 ` [PATCH 1/2] sched: credit2: respect per-vcpu hard affinity Justin T. Weaver
2015-01-12 18:05   ` Dario Faggioli
2015-01-20  7:21     ` Justin Weaver
2015-02-01  6:51       ` Justin Weaver
2015-02-02 11:12         ` Dario Faggioli
2015-02-02 14:24         ` Dario Faggioli
2015-02-03  9:09           ` Justin Weaver
2015-03-03  3:13       ` Dario Faggioli
2014-11-30  0:33 ` [PATCH 2/2] sched: credit2: consider per-vcpu soft affinity Justin T. Weaver
2015-01-12 18:07   ` Dario Faggioli
2015-01-13 13:06   ` Dario Faggioli
2015-02-11  8:38     ` Justin Weaver
2015-01-02 16:33 ` [PATCH 0/2] Credit2: introduce per-vcpu hard and " Dario Faggioli
2015-01-04 21:32   ` Justin Weaver

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.