[PATCH v2 0/7] xen: sched: improve scalability of Credit1, and optimize a bit both Credit1 and Credit2

All of lore.kernel.org
 help / color / mirror / Atom feed

* [PATCH v2 0/7] xen: sched: improve scalability of Credit1, and optimize a bit both Credit1 and Credit2
@ 2017-04-06  8:16 Dario Faggioli
  2017-04-06  8:16 ` [PATCH v2 1/7] xen: credit1: simplify csched_runq_steal() a little bit Dario Faggioli
                   ` (6 more replies)
  0 siblings, 7 replies; 24+ messages in thread
From: Dario Faggioli @ 2017-04-06  8:16 UTC (permalink / raw)
  To: xen-devel
  Cc: Andrew Cooper, Anshul Makkar, Ian Jackson, George Dunlap, Wei Liu

Hello,

Here it comes, v2 of this series. v1 was here:
 https://lists.xen.org/archives/html/xen-devel/2017-03/msg00265.html

It took a bit, because I found out issues, as well as chances of further
improvements in what was patch 3.

This resulted in:
 - one patch being added to the series;
 - patch "xen: credit1: increase efficiency and scalability of load balancing."
   being reworked.

I also changed the patch ordering in the series a little bit. More details in
the single changelogs.

Given all the above, I did rerun all the benchmarks. The results are not too
different, though (oh, BTW, it's a different test machine, wrt the one I used
for v1, so absolute numbers are not comparable).

I've again benchmarked a Xen build (seconds, lower = better) and Iperf (from VM
to host, GBps, higher = better), in either 1 or 2 VMs with 16 vCPUs, on a 16
pCPUs host. This is all on Credit1, as the Credit2 patch in the series has not
been touched in v2.

     MAKEXEN, 1VM, 16 VCPUs    MAKEXEN, 2VMs, 16 VCPUs
     -------------------------------------------------
      baseline  patched          baseline  patched
 avg   18.246   18.078            53.178   50.783
 stdd   0.585    0.455             0.746    1.975

So, we indeed see an improvement, especially in the overcommitted case.

     IPERF, 1VM, 16 VCPUs      IPERF, 2VMs, 16 VCPUs
     -----------------------------------------------
       baseline  patched          baseline patched
 avg    25.480   25.580            12.678  13.220
 stdd    0.963    0.986             2.694   1.108

So, improvement again, although less of a clear call.

I've also run Iperf in a VM-to-VM configuration (using two VMs with 6 vCPUs).

       IPERF, VM-to-VM
       -----------------
       baseline  patched
 avg    9.788    10.324
 stdd   0.486     0.447

Which indeed seems a better benchmark for showing the effectiveness of this
series.

Git branch available here:
 git://xenbits.xen.org/people/dariof/xen.git  rel/sched/credit1-credit2-optim-and-scalability-v2
 https://travis-ci.org/fdario/xen/builds/219179935

Thanks and Regards,
Dario
---
Dario Faggioli (7):
      xen: credit1: simplify csched_runq_steal() a little bit.
      xen: credit: (micro) optimize csched_runq_steal().
      xen: credit: consider tickled pCPUs as busy.
      xen/tools: tracing: add record for credit1 runqueue stealing.
      xen: credit1: increase efficiency and scalability of load balancing.
      xen: credit1: treat pCPUs more evenly during balancing.
      xen: credit2: avoid cpumask_any() in pick_cpu().

 tools/xentrace/formats       |    1 
 tools/xentrace/xenalyze.c    |   11 ++
 xen/common/sched_credit.c    |  241 +++++++++++++++++++++++++++++++-----------
 xen/common/sched_credit2.c   |   22 +++-
 xen/include/xen/perfc_defn.h |    1 
 5 files changed, 210 insertions(+), 66 deletions(-)
--
<<This happens because I choose it to happen!>> (Raistlin Majere)
-----------------------------------------------------------------
Dario Faggioli, Ph.D, http://about.me/dario.faggioli
Senior Software Engineer, Citrix Systems R&D Ltd., Cambridge (UK)

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
https://lists.xen.org/xen-devel

^ permalink raw reply	[flat|nested] 24+ messages in thread

* [PATCH v2 1/7] xen: credit1: simplify csched_runq_steal() a little bit.
  2017-04-06  8:16 [PATCH v2 0/7] xen: sched: improve scalability of Credit1, and optimize a bit both Credit1 and Credit2 Dario Faggioli
@ 2017-04-06  8:16 ` Dario Faggioli
  2017-04-07 10:45   ` George Dunlap
  2017-04-06  8:16 ` [PATCH v2 2/7] xen: credit: (micro) optimize csched_runq_steal() Dario Faggioli
                   ` (5 subsequent siblings)
  6 siblings, 1 reply; 24+ messages in thread
From: Dario Faggioli @ 2017-04-06  8:16 UTC (permalink / raw)
  To: xen-devel; +Cc: George Dunlap, Anshul Makkar

Since we're holding the lock on the pCPU from which we
are trying to steal, it can't have disappeared, so we
can drop the check for that (and convert it in an
ASSERT()).

And since we try to steal only from busy pCPUs, it's
unlikely for such pCPU to be idle, so we mark it as
such (and bail early if it unfortunately is).

Signed-off-by: Dario Faggioli <dario.faggioli@citrix.com>
---
Cc: George Dunlap <george.dunlap@eu.citrix.com>
Cc: Anshul Makkar <anshul.makkar@citrix.com>
---
 xen/common/sched_credit.c |   87 +++++++++++++++++++++++----------------------
 1 file changed, 44 insertions(+), 43 deletions(-)

diff --git a/xen/common/sched_credit.c b/xen/common/sched_credit.c
index 4649e64..63a8675 100644
--- a/xen/common/sched_credit.c
+++ b/xen/common/sched_credit.c
@@ -1593,64 +1593,65 @@ static struct csched_vcpu *
 csched_runq_steal(int peer_cpu, int cpu, int pri, int balance_step)
 {
     const struct csched_pcpu * const peer_pcpu = CSCHED_PCPU(peer_cpu);
-    const struct vcpu * const peer_vcpu = curr_on_cpu(peer_cpu);
     struct csched_vcpu *speer;
     struct list_head *iter;
     struct vcpu *vc;
 
+    ASSERT(peer_pcpu != NULL);
+
     /*
      * Don't steal from an idle CPU's runq because it's about to
      * pick up work from it itself.
      */
-    if ( peer_pcpu != NULL && !is_idle_vcpu(peer_vcpu) )
+    if ( unlikely(is_idle_vcpu(curr_on_cpu(peer_cpu))) )
+        goto out;
+
+    list_for_each( iter, &peer_pcpu->runq )
     {
-        list_for_each( iter, &peer_pcpu->runq )
-        {
-            speer = __runq_elem(iter);
+        speer = __runq_elem(iter);
 
-            /*
-             * If next available VCPU here is not of strictly higher
-             * priority than ours, this PCPU is useless to us.
-             */
-            if ( speer->pri <= pri )
-                break;
+        /*
+         * If next available VCPU here is not of strictly higher
+         * priority than ours, this PCPU is useless to us.
+         */
+        if ( speer->pri <= pri )
+            break;
 
-            /* Is this VCPU runnable on our PCPU? */
-            vc = speer->vcpu;
-            BUG_ON( is_idle_vcpu(vc) );
+        /* Is this VCPU runnable on our PCPU? */
+        vc = speer->vcpu;
+        BUG_ON( is_idle_vcpu(vc) );
 
-            /*
-             * If the vcpu has no useful soft affinity, skip this vcpu.
-             * In fact, what we want is to check if we have any "soft-affine
-             * work" to steal, before starting to look at "hard-affine work".
-             *
-             * Notice that, if not even one vCPU on this runq has a useful
-             * soft affinity, we could have avoid considering this runq for
-             * a soft balancing step in the first place. This, for instance,
-             * can be implemented by taking note of on what runq there are
-             * vCPUs with useful soft affinities in some sort of bitmap
-             * or counter.
-             */
-            if ( balance_step == CSCHED_BALANCE_SOFT_AFFINITY
-                 && !__vcpu_has_soft_affinity(vc, vc->cpu_hard_affinity) )
-                continue;
+        /*
+         * If the vcpu has no useful soft affinity, skip this vcpu.
+         * In fact, what we want is to check if we have any "soft-affine
+         * work" to steal, before starting to look at "hard-affine work".
+         *
+         * Notice that, if not even one vCPU on this runq has a useful
+         * soft affinity, we could have avoid considering this runq for
+         * a soft balancing step in the first place. This, for instance,
+         * can be implemented by taking note of on what runq there are
+         * vCPUs with useful soft affinities in some sort of bitmap
+         * or counter.
+         */
+        if ( balance_step == CSCHED_BALANCE_SOFT_AFFINITY
+             && !__vcpu_has_soft_affinity(vc, vc->cpu_hard_affinity) )
+            continue;
 
-            csched_balance_cpumask(vc, balance_step, cpumask_scratch);
-            if ( __csched_vcpu_is_migrateable(vc, cpu, cpumask_scratch) )
-            {
-                /* We got a candidate. Grab it! */
-                TRACE_3D(TRC_CSCHED_STOLEN_VCPU, peer_cpu,
-                         vc->domain->domain_id, vc->vcpu_id);
-                SCHED_VCPU_STAT_CRANK(speer, migrate_q);
-                SCHED_STAT_CRANK(migrate_queued);
-                WARN_ON(vc->is_urgent);
-                __runq_remove(speer);
-                vc->processor = cpu;
-                return speer;
-            }
+        csched_balance_cpumask(vc, balance_step, cpumask_scratch);
+        if ( __csched_vcpu_is_migrateable(vc, cpu, cpumask_scratch) )
+        {
+            /* We got a candidate. Grab it! */
+            TRACE_3D(TRC_CSCHED_STOLEN_VCPU, peer_cpu,
+                     vc->domain->domain_id, vc->vcpu_id);
+            SCHED_VCPU_STAT_CRANK(speer, migrate_q);
+            SCHED_STAT_CRANK(migrate_queued);
+            WARN_ON(vc->is_urgent);
+            __runq_remove(speer);
+            vc->processor = cpu;
+            return speer;
         }
     }
-
+ out:
     SCHED_STAT_CRANK(steal_peer_idle);
     return NULL;
 }


_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
https://lists.xen.org/xen-devel

^ permalink raw reply related	[flat|nested] 24+ messages in thread

* Re: [PATCH v2 1/7] xen: credit1: simplify csched_runq_steal() a little bit.
  2017-04-06  8:16 ` [PATCH v2 1/7] xen: credit1: simplify csched_runq_steal() a little bit Dario Faggioli
@ 2017-04-07 10:45   ` George Dunlap
  2017-04-07 11:00     ` Dario Faggioli
  0 siblings, 1 reply; 24+ messages in thread
From: George Dunlap @ 2017-04-07 10:45 UTC (permalink / raw)
  To: Dario Faggioli, xen-devel; +Cc: George Dunlap, Anshul Makkar

On 06/04/17 09:16, Dario Faggioli wrote:
> Since we're holding the lock on the pCPU from which we
> are trying to steal, it can't have disappeared, so we
> can drop the check for that (and convert it in an
> ASSERT()).
> 
> And since we try to steal only from busy pCPUs, it's
> unlikely for such pCPU to be idle, so we mark it as
> such (and bail early if it unfortunately is).

Oh, you mean "mark the test as unlikely".  I initially parsed this as,
"Mark the pcpu as idle", which confused me.

Can I change this to, "tell the compiler that it's unlikely"?

Either way:

Reviewed-by; George Dunlap <george.dunlap@citrix.com>

> 
> Signed-off-by: Dario Faggioli <dario.faggioli@citrix.com>
> ---
> Cc: George Dunlap <george.dunlap@eu.citrix.com>
> Cc: Anshul Makkar <anshul.makkar@citrix.com>
> ---
>  xen/common/sched_credit.c |   87 +++++++++++++++++++++++----------------------
>  1 file changed, 44 insertions(+), 43 deletions(-)
> 
> diff --git a/xen/common/sched_credit.c b/xen/common/sched_credit.c
> index 4649e64..63a8675 100644
> --- a/xen/common/sched_credit.c
> +++ b/xen/common/sched_credit.c
> @@ -1593,64 +1593,65 @@ static struct csched_vcpu *
>  csched_runq_steal(int peer_cpu, int cpu, int pri, int balance_step)
>  {
>      const struct csched_pcpu * const peer_pcpu = CSCHED_PCPU(peer_cpu);
> -    const struct vcpu * const peer_vcpu = curr_on_cpu(peer_cpu);
>      struct csched_vcpu *speer;
>      struct list_head *iter;
>      struct vcpu *vc;
>  
> +    ASSERT(peer_pcpu != NULL);
> +
>      /*
>       * Don't steal from an idle CPU's runq because it's about to
>       * pick up work from it itself.
>       */
> -    if ( peer_pcpu != NULL && !is_idle_vcpu(peer_vcpu) )
> +    if ( unlikely(is_idle_vcpu(curr_on_cpu(peer_cpu))) )
> +        goto out;
> +
> +    list_for_each( iter, &peer_pcpu->runq )
>      {
> -        list_for_each( iter, &peer_pcpu->runq )
> -        {
> -            speer = __runq_elem(iter);
> +        speer = __runq_elem(iter);
>  
> -            /*
> -             * If next available VCPU here is not of strictly higher
> -             * priority than ours, this PCPU is useless to us.
> -             */
> -            if ( speer->pri <= pri )
> -                break;
> +        /*
> +         * If next available VCPU here is not of strictly higher
> +         * priority than ours, this PCPU is useless to us.
> +         */
> +        if ( speer->pri <= pri )
> +            break;
>  
> -            /* Is this VCPU runnable on our PCPU? */
> -            vc = speer->vcpu;
> -            BUG_ON( is_idle_vcpu(vc) );
> +        /* Is this VCPU runnable on our PCPU? */
> +        vc = speer->vcpu;
> +        BUG_ON( is_idle_vcpu(vc) );
>  
> -            /*
> -             * If the vcpu has no useful soft affinity, skip this vcpu.
> -             * In fact, what we want is to check if we have any "soft-affine
> -             * work" to steal, before starting to look at "hard-affine work".
> -             *
> -             * Notice that, if not even one vCPU on this runq has a useful
> -             * soft affinity, we could have avoid considering this runq for
> -             * a soft balancing step in the first place. This, for instance,
> -             * can be implemented by taking note of on what runq there are
> -             * vCPUs with useful soft affinities in some sort of bitmap
> -             * or counter.
> -             */
> -            if ( balance_step == CSCHED_BALANCE_SOFT_AFFINITY
> -                 && !__vcpu_has_soft_affinity(vc, vc->cpu_hard_affinity) )
> -                continue;
> +        /*
> +         * If the vcpu has no useful soft affinity, skip this vcpu.
> +         * In fact, what we want is to check if we have any "soft-affine
> +         * work" to steal, before starting to look at "hard-affine work".
> +         *
> +         * Notice that, if not even one vCPU on this runq has a useful
> +         * soft affinity, we could have avoid considering this runq for
> +         * a soft balancing step in the first place. This, for instance,
> +         * can be implemented by taking note of on what runq there are
> +         * vCPUs with useful soft affinities in some sort of bitmap
> +         * or counter.
> +         */
> +        if ( balance_step == CSCHED_BALANCE_SOFT_AFFINITY
> +             && !__vcpu_has_soft_affinity(vc, vc->cpu_hard_affinity) )
> +            continue;
>  
> -            csched_balance_cpumask(vc, balance_step, cpumask_scratch);
> -            if ( __csched_vcpu_is_migrateable(vc, cpu, cpumask_scratch) )
> -            {
> -                /* We got a candidate. Grab it! */
> -                TRACE_3D(TRC_CSCHED_STOLEN_VCPU, peer_cpu,
> -                         vc->domain->domain_id, vc->vcpu_id);
> -                SCHED_VCPU_STAT_CRANK(speer, migrate_q);
> -                SCHED_STAT_CRANK(migrate_queued);
> -                WARN_ON(vc->is_urgent);
> -                __runq_remove(speer);
> -                vc->processor = cpu;
> -                return speer;
> -            }
> +        csched_balance_cpumask(vc, balance_step, cpumask_scratch);
> +        if ( __csched_vcpu_is_migrateable(vc, cpu, cpumask_scratch) )
> +        {
> +            /* We got a candidate. Grab it! */
> +            TRACE_3D(TRC_CSCHED_STOLEN_VCPU, peer_cpu,
> +                     vc->domain->domain_id, vc->vcpu_id);
> +            SCHED_VCPU_STAT_CRANK(speer, migrate_q);
> +            SCHED_STAT_CRANK(migrate_queued);
> +            WARN_ON(vc->is_urgent);
> +            __runq_remove(speer);
> +            vc->processor = cpu;
> +            return speer;
>          }
>      }
> -
> + out:
>      SCHED_STAT_CRANK(steal_peer_idle);
>      return NULL;
>  }
> 


_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
https://lists.xen.org/xen-devel

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH v2 1/7] xen: credit1: simplify csched_runq_steal() a little bit.
  2017-04-07 10:45   ` George Dunlap
@ 2017-04-07 11:00     ` Dario Faggioli
  0 siblings, 0 replies; 24+ messages in thread
From: Dario Faggioli @ 2017-04-07 11:00 UTC (permalink / raw)
  To: George Dunlap, xen-devel; +Cc: George Dunlap, Anshul Makkar


[-- Attachment #1.1: Type: text/plain, Size: 1086 bytes --]

On Fri, 2017-04-07 at 11:45 +0100, George Dunlap wrote:
> On 06/04/17 09:16, Dario Faggioli wrote:
> > Since we're holding the lock on the pCPU from which we
> > are trying to steal, it can't have disappeared, so we
> > can drop the check for that (and convert it in an
> > ASSERT()).
> > 
> > And since we try to steal only from busy pCPUs, it's
> > unlikely for such pCPU to be idle, so we mark it as
> > such (and bail early if it unfortunately is).
> 
> Oh, you mean "mark the test as unlikely".  I initially parsed this
> as,
> "Mark the pcpu as idle", which confused me.
> 
> Can I change this to, "tell the compiler that it's unlikely"?
> 
Indeed. AFAICS, I'll have to resend, so I will do this myself.

> Either way:
> 
> Reviewed-by; George Dunlap <george.dunlap@citrix.com>
> 
Thanks,
Dario
-- 
<<This happens because I choose it to happen!>> (Raistlin Majere)
-----------------------------------------------------------------
Dario Faggioli, Ph.D, http://about.me/dario.faggioli
Senior Software Engineer, Citrix Systems R&D Ltd., Cambridge (UK)

[-- Attachment #1.2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 819 bytes --]

[-- Attachment #2: Type: text/plain, Size: 127 bytes --]

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
https://lists.xen.org/xen-devel

^ permalink raw reply	[flat|nested] 24+ messages in thread

* [PATCH v2 2/7] xen: credit: (micro) optimize csched_runq_steal().
  2017-04-06  8:16 [PATCH v2 0/7] xen: sched: improve scalability of Credit1, and optimize a bit both Credit1 and Credit2 Dario Faggioli
  2017-04-06  8:16 ` [PATCH v2 1/7] xen: credit1: simplify csched_runq_steal() a little bit Dario Faggioli
@ 2017-04-06  8:16 ` Dario Faggioli
  2017-04-07 10:49   ` George Dunlap
  2017-04-06  8:16 ` [PATCH v2 3/7] xen: credit: consider tickled pCPUs as busy Dario Faggioli
                   ` (4 subsequent siblings)
  6 siblings, 1 reply; 24+ messages in thread
From: Dario Faggioli @ 2017-04-06  8:16 UTC (permalink / raw)
  To: xen-devel; +Cc: George Dunlap, Anshul Makkar

Chacking whether or not a vCPU can be 'stolen'
from a peer pCPU's runqueue is relatively cheap.

Therefore, let's do that  as early as possible,
avoiding potentially useless complex checks, and
cpumask manipulations.

Signed-off-by: Dario Faggioli <dario.faggioli@citrix.com>
---
Cc: George Dunlap <george.dunlap@eu.citrix.com>
Cc: Anshul Makkar <anshul.makkar@citrix.com>
---
Changes from v1:
* fixed a typo in a comment.
---
 xen/common/sched_credit.c |   17 +++++++++--------
 1 file changed, 9 insertions(+), 8 deletions(-)

diff --git a/xen/common/sched_credit.c b/xen/common/sched_credit.c
index 63a8675..5a3f13f 100644
--- a/xen/common/sched_credit.c
+++ b/xen/common/sched_credit.c
@@ -708,12 +708,10 @@ static inline int
 __csched_vcpu_is_migrateable(struct vcpu *vc, int dest_cpu, cpumask_t *mask)
 {
     /*
-     * Don't pick up work that's in the peer's scheduling tail or hot on
-     * peer PCPU. Only pick up work that prefers and/or is allowed to run
-     * on our CPU.
+     * Don't pick up work that's hot on peer PCPU, or that can't (or
+     * would prefer not to) run on cpu.
      */
-    return !vc->is_running &&
-           !__csched_vcpu_is_cache_hot(vc) &&
+    return !__csched_vcpu_is_cache_hot(vc) &&
            cpumask_test_cpu(dest_cpu, mask);
 }
 
@@ -1622,7 +1620,9 @@ csched_runq_steal(int peer_cpu, int cpu, int pri, int balance_step)
         BUG_ON( is_idle_vcpu(vc) );
 
         /*
-         * If the vcpu has no useful soft affinity, skip this vcpu.
+         * If the vcpu is still in peer_cpu's scheduling tail, or if it
+         * has no useful soft affinity, skip it.
+         *
          * In fact, what we want is to check if we have any "soft-affine
          * work" to steal, before starting to look at "hard-affine work".
          *
@@ -1633,8 +1633,9 @@ csched_runq_steal(int peer_cpu, int cpu, int pri, int balance_step)
          * vCPUs with useful soft affinities in some sort of bitmap
          * or counter.
          */
-        if ( balance_step == CSCHED_BALANCE_SOFT_AFFINITY
-             && !__vcpu_has_soft_affinity(vc, vc->cpu_hard_affinity) )
+        if ( vc->is_running ||
+             (balance_step == CSCHED_BALANCE_SOFT_AFFINITY
+              && !__vcpu_has_soft_affinity(vc, vc->cpu_hard_affinity)) )
             continue;
 
         csched_balance_cpumask(vc, balance_step, cpumask_scratch);


_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
https://lists.xen.org/xen-devel

^ permalink raw reply related	[flat|nested] 24+ messages in thread

* Re: [PATCH v2 2/7] xen: credit: (micro) optimize csched_runq_steal().
  2017-04-06  8:16 ` [PATCH v2 2/7] xen: credit: (micro) optimize csched_runq_steal() Dario Faggioli
@ 2017-04-07 10:49   ` George Dunlap
  2017-04-07 11:01     ` Dario Faggioli
  0 siblings, 1 reply; 24+ messages in thread
From: George Dunlap @ 2017-04-07 10:49 UTC (permalink / raw)
  To: Dario Faggioli, xen-devel; +Cc: George Dunlap, Anshul Makkar

On 06/04/17 09:16, Dario Faggioli wrote:
> Chacking whether or not a vCPU can be 'stolen'
> from a peer pCPU's runqueue is relatively cheap.
> 
> Therefore, let's do that  as early as possible,
> avoiding potentially useless complex checks, and
> cpumask manipulations.
> 
> Signed-off-by: Dario Faggioli <dario.faggioli@citrix.com>
> ---
> Cc: George Dunlap <george.dunlap@eu.citrix.com>
> Cc: Anshul Makkar <anshul.makkar@citrix.com>
> ---
> Changes from v1:
> * fixed a typo in a comment.
> ---
>  xen/common/sched_credit.c |   17 +++++++++--------
>  1 file changed, 9 insertions(+), 8 deletions(-)
> 
> diff --git a/xen/common/sched_credit.c b/xen/common/sched_credit.c
> index 63a8675..5a3f13f 100644
> --- a/xen/common/sched_credit.c
> +++ b/xen/common/sched_credit.c
> @@ -708,12 +708,10 @@ static inline int
>  __csched_vcpu_is_migrateable(struct vcpu *vc, int dest_cpu, cpumask_t *mask)
>  {
>      /*
> -     * Don't pick up work that's in the peer's scheduling tail or hot on
> -     * peer PCPU. Only pick up work that prefers and/or is allowed to run
> -     * on our CPU.
> +     * Don't pick up work that's hot on peer PCPU, or that can't (or
> +     * would prefer not to) run on cpu.
>       */
> -    return !vc->is_running &&
> -           !__csched_vcpu_is_cache_hot(vc) &&
> +    return !__csched_vcpu_is_cache_hot(vc) &&
>             cpumask_test_cpu(dest_cpu, mask);

+1 to moving the check earlier; but there's a risk that this function
will be used somewhere else and that the is_running check will be missed.

Should we maybe add an ASSERT() here (perhaps with a comment that the
check can be added back in if necessary)?

 -George


_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
https://lists.xen.org/xen-devel

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH v2 2/7] xen: credit: (micro) optimize csched_runq_steal().
  2017-04-07 10:49   ` George Dunlap
@ 2017-04-07 11:01     ` Dario Faggioli
  0 siblings, 0 replies; 24+ messages in thread
From: Dario Faggioli @ 2017-04-07 11:01 UTC (permalink / raw)
  To: George Dunlap, xen-devel; +Cc: George Dunlap, Anshul Makkar


[-- Attachment #1.1: Type: text/plain, Size: 1500 bytes --]

On Fri, 2017-04-07 at 11:49 +0100, George Dunlap wrote:
> On 06/04/17 09:16, Dario Faggioli wrote:
> > --- a/xen/common/sched_credit.c
> > +++ b/xen/common/sched_credit.c
> > @@ -708,12 +708,10 @@ static inline int
> >  __csched_vcpu_is_migrateable(struct vcpu *vc, int dest_cpu,
> > cpumask_t *mask)
> >  {
> >      /*
> > -     * Don't pick up work that's in the peer's scheduling tail or
> > hot on
> > -     * peer PCPU. Only pick up work that prefers and/or is allowed
> > to run
> > -     * on our CPU.
> > +     * Don't pick up work that's hot on peer PCPU, or that can't
> > (or
> > +     * would prefer not to) run on cpu.
> >       */
> > -    return !vc->is_running &&
> > -           !__csched_vcpu_is_cache_hot(vc) &&
> > +    return !__csched_vcpu_is_cache_hot(vc) &&
> >             cpumask_test_cpu(dest_cpu, mask);
> 
> +1 to moving the check earlier; but there's a risk that this function
> will be used somewhere else and that the is_running check will be
> missed.
> 
> Should we maybe add an ASSERT() here (perhaps with a comment that the
> check can be added back in if necessary)?
> 
Ok, makes sense. Will do.

Dario
-- 
<<This happens because I choose it to happen!>> (Raistlin Majere)
-----------------------------------------------------------------
Dario Faggioli, Ph.D, http://about.me/dario.faggioli
Senior Software Engineer, Citrix Systems R&D Ltd., Cambridge (UK)

[-- Attachment #1.2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 819 bytes --]

[-- Attachment #2: Type: text/plain, Size: 127 bytes --]

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
https://lists.xen.org/xen-devel

^ permalink raw reply	[flat|nested] 24+ messages in thread

* [PATCH v2 3/7] xen: credit: consider tickled pCPUs as busy.
  2017-04-06  8:16 [PATCH v2 0/7] xen: sched: improve scalability of Credit1, and optimize a bit both Credit1 and Credit2 Dario Faggioli
  2017-04-06  8:16 ` [PATCH v2 1/7] xen: credit1: simplify csched_runq_steal() a little bit Dario Faggioli
  2017-04-06  8:16 ` [PATCH v2 2/7] xen: credit: (micro) optimize csched_runq_steal() Dario Faggioli
@ 2017-04-06  8:16 ` Dario Faggioli
  2017-04-07 10:56   ` George Dunlap
  2017-04-06  8:16 ` [PATCH v2 4/7] xen/tools: tracing: add record for credit1 runqueue stealing Dario Faggioli
                   ` (3 subsequent siblings)
  6 siblings, 1 reply; 24+ messages in thread
From: Dario Faggioli @ 2017-04-06  8:16 UTC (permalink / raw)
  To: xen-devel; +Cc: George Dunlap, Anshul Makkar

Currently, it can happen that __runq_tickle(),
running on pCPU 2 because vCPU x woke up, decides
to tickle pCPU 3, because it's idle. Just after
that, but before pCPU 3 manages to schedule and
pick up x, either __runq_tickel() or
__csched_cpu_pick(), running on pCPU 6, sees that
idle pCPUs are 0, 1 and also 3, and for whatever
reason it also chooses 3 for waking up (or
migrating) vCPU y.

When pCPU 3 goes through the scheduler, it will
pick up, say, vCPU x, and y will sit in its
runqueue, even if there are idle pCPUs.

Alleviate this by marking a pCPU to be idle
right away when tickling it (like, e.g., it happens
in Credit2).

Note that this does not eliminate the race. That
is not possible without introducing proper locking
for the cpumasks the scheduler uses. It significantly
reduces the window during which it can happen, though.

Introduce proper locking for the cpumask can, in
theory, be done, and may be investigated in future.
It is a significant amount of work to do it properly
(e.g., avoiding deadlock), and it is likely to adversely
affect scalability, and so it may be a path it is just
not worth following.

Signed-off-by: Dario Faggioli <dario.faggioli@citrix.com>
---
Cc: George Dunlap <george.dunlap@eu.citrix.com>
Cc: Anshul Makkar <anshul.makkar@citrix.com>
---
 xen/common/sched_credit.c |   22 +++++++++++++++++++---
 1 file changed, 19 insertions(+), 3 deletions(-)

diff --git a/xen/common/sched_credit.c b/xen/common/sched_credit.c
index 5a3f13f..c753089 100644
--- a/xen/common/sched_credit.c
+++ b/xen/common/sched_credit.c
@@ -489,7 +489,17 @@ static inline void __runq_tickle(struct csched_vcpu *new)
                 __trace_var(TRC_CSCHED_TICKLE, 1, sizeof(cpu), &cpu);
         }
 
-        /* Send scheduler interrupts to designated CPUs */
+        /*
+         * Mark the designated CPUs as busy and send them all the scheduler
+         * interrupt. We need the for_each_cpu for dealing with the
+         * !opt_tickle_one_idle case. We must use cpumask_clear_cpu() and
+         * can't use cpumask_andnot(), because prv->idlers needs atomic access.
+         *
+         * In the default (and most common) case, when opt_rickle_one_idle is
+         * true, the loop does only one step, and only one bit is cleared.
+         */
+        for_each_cpu(cpu, &mask)
+            cpumask_clear_cpu(cpu, prv->idlers);
         cpumask_raise_softirq(&mask, SCHEDULE_SOFTIRQ);
     }
     else
@@ -985,6 +995,8 @@ csched_vcpu_acct(struct csched_private *prv, unsigned int cpu)
             SCHED_VCPU_STAT_CRANK(svc, migrate_r);
             SCHED_STAT_CRANK(migrate_running);
             set_bit(_VPF_migrating, &current->pause_flags);
+            ASSERT(!cpumask_test_cpu(cpu,
+                                     CSCHED_PRIV(per_cpu(scheduler, cpu))->idlers));
             cpu_raise_softirq(cpu, SCHEDULE_SOFTIRQ);
         }
     }
@@ -1077,13 +1089,17 @@ static void
 csched_vcpu_sleep(const struct scheduler *ops, struct vcpu *vc)
 {
     struct csched_vcpu * const svc = CSCHED_VCPU(vc);
+    unsigned int cpu = vc->processor;
 
     SCHED_STAT_CRANK(vcpu_sleep);
 
     BUG_ON( is_idle_vcpu(vc) );
 
-    if ( curr_on_cpu(vc->processor) == vc )
-        cpu_raise_softirq(vc->processor, SCHEDULE_SOFTIRQ);
+    if ( curr_on_cpu(cpu) == vc )
+    {
+        ASSERT(!cpumask_test_cpu(cpu, CSCHED_PRIV(per_cpu(scheduler, cpu))->idlers));
+        cpu_raise_softirq(cpu, SCHEDULE_SOFTIRQ);
+    }
     else if ( __vcpu_on_runq(svc) )
         __runq_remove(svc);
 }


_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
https://lists.xen.org/xen-devel

^ permalink raw reply related	[flat|nested] 24+ messages in thread

* Re: [PATCH v2 3/7] xen: credit: consider tickled pCPUs as busy.
  2017-04-06  8:16 ` [PATCH v2 3/7] xen: credit: consider tickled pCPUs as busy Dario Faggioli
@ 2017-04-07 10:56   ` George Dunlap
  2017-04-07 12:53     ` Dario Faggioli
  0 siblings, 1 reply; 24+ messages in thread
From: George Dunlap @ 2017-04-07 10:56 UTC (permalink / raw)
  To: Dario Faggioli, xen-devel; +Cc: George Dunlap, Anshul Makkar

On 06/04/17 09:16, Dario Faggioli wrote:
> Currently, it can happen that __runq_tickle(),
> running on pCPU 2 because vCPU x woke up, decides
> to tickle pCPU 3, because it's idle. Just after
> that, but before pCPU 3 manages to schedule and
> pick up x, either __runq_tickel() or
> __csched_cpu_pick(), running on pCPU 6, sees that
> idle pCPUs are 0, 1 and also 3, and for whatever
> reason it also chooses 3 for waking up (or
> migrating) vCPU y.
> 
> When pCPU 3 goes through the scheduler, it will
> pick up, say, vCPU x, and y will sit in its
> runqueue, even if there are idle pCPUs.
> 
> Alleviate this by marking a pCPU to be idle
> right away when tickling it (like, e.g., it happens
> in Credit2).
> 
> Note that this does not eliminate the race. That
> is not possible without introducing proper locking
> for the cpumasks the scheduler uses. It significantly
> reduces the window during which it can happen, though.
> 
> Introduce proper locking for the cpumask can, in
> theory, be done, and may be investigated in future.
> It is a significant amount of work to do it properly
> (e.g., avoiding deadlock), and it is likely to adversely
> affect scalability, and so it may be a path it is just
> not worth following.
> 
> Signed-off-by: Dario Faggioli <dario.faggioli@citrix.com>
> ---
> Cc: George Dunlap <george.dunlap@eu.citrix.com>
> Cc: Anshul Makkar <anshul.makkar@citrix.com>
> ---
>  xen/common/sched_credit.c |   22 +++++++++++++++++++---
>  1 file changed, 19 insertions(+), 3 deletions(-)
> 
> diff --git a/xen/common/sched_credit.c b/xen/common/sched_credit.c
> index 5a3f13f..c753089 100644
> --- a/xen/common/sched_credit.c
> +++ b/xen/common/sched_credit.c
> @@ -489,7 +489,17 @@ static inline void __runq_tickle(struct csched_vcpu *new)
>                  __trace_var(TRC_CSCHED_TICKLE, 1, sizeof(cpu), &cpu);
>          }
>  
> -        /* Send scheduler interrupts to designated CPUs */
> +        /*
> +         * Mark the designated CPUs as busy and send them all the scheduler
> +         * interrupt. We need the for_each_cpu for dealing with the
> +         * !opt_tickle_one_idle case. We must use cpumask_clear_cpu() and
> +         * can't use cpumask_andnot(), because prv->idlers needs atomic access.
> +         *
> +         * In the default (and most common) case, when opt_rickle_one_idle is
> +         * true, the loop does only one step, and only one bit is cleared.
> +         */
> +        for_each_cpu(cpu, &mask)
> +            cpumask_clear_cpu(cpu, prv->idlers);
>          cpumask_raise_softirq(&mask, SCHEDULE_SOFTIRQ);
>      }
>      else
> @@ -985,6 +995,8 @@ csched_vcpu_acct(struct csched_private *prv, unsigned int cpu)
>              SCHED_VCPU_STAT_CRANK(svc, migrate_r);
>              SCHED_STAT_CRANK(migrate_running);
>              set_bit(_VPF_migrating, &current->pause_flags);
> +            ASSERT(!cpumask_test_cpu(cpu,
> +                                     CSCHED_PRIV(per_cpu(scheduler, cpu))->idlers));

What are these about?  Is this just to double-check that the "idler
accounting" logic is correct?

 -George


_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
https://lists.xen.org/xen-devel

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH v2 3/7] xen: credit: consider tickled pCPUs as busy.
  2017-04-07 10:56   ` George Dunlap
@ 2017-04-07 12:53     ` Dario Faggioli
  0 siblings, 0 replies; 24+ messages in thread
From: Dario Faggioli @ 2017-04-07 12:53 UTC (permalink / raw)
  To: George Dunlap, xen-devel; +Cc: George Dunlap, Anshul Makkar


[-- Attachment #1.1: Type: text/plain, Size: 4519 bytes --]

On Fri, 2017-04-07 at 11:56 +0100, George Dunlap wrote:
> On 06/04/17 09:16, Dario Faggioli wrote:
> > Currently, it can happen that __runq_tickle(),
> > running on pCPU 2 because vCPU x woke up, decides
> > to tickle pCPU 3, because it's idle. Just after
> > that, but before pCPU 3 manages to schedule and
> > pick up x, either __runq_tickel() or
> > __csched_cpu_pick(), running on pCPU 6, sees that
> > idle pCPUs are 0, 1 and also 3, and for whatever
> > reason it also chooses 3 for waking up (or
> > migrating) vCPU y.
> > 
> > When pCPU 3 goes through the scheduler, it will
> > pick up, say, vCPU x, and y will sit in its
> > runqueue, even if there are idle pCPUs.
> > 
> > Alleviate this by marking a pCPU to be idle
> > right away when tickling it (like, e.g., it happens
> > in Credit2).
> > 
> > Note that this does not eliminate the race. That
> > is not possible without introducing proper locking
> > for the cpumasks the scheduler uses. It significantly
> > reduces the window during which it can happen, though.
> > 
> > Introduce proper locking for the cpumask can, in
> > theory, be done, and may be investigated in future.
> > It is a significant amount of work to do it properly
> > (e.g., avoiding deadlock), and it is likely to adversely
> > affect scalability, and so it may be a path it is just
> > not worth following.
> > 
> > Signed-off-by: Dario Faggioli <dario.faggioli@citrix.com>
> > ---
> > Cc: George Dunlap <george.dunlap@eu.citrix.com>
> > Cc: Anshul Makkar <anshul.makkar@citrix.com>
> > ---
> >  xen/common/sched_credit.c |   22 +++++++++++++++++++---
> >  1 file changed, 19 insertions(+), 3 deletions(-)
> > 
> > diff --git a/xen/common/sched_credit.c b/xen/common/sched_credit.c
> > index 5a3f13f..c753089 100644
> > --- a/xen/common/sched_credit.c
> > +++ b/xen/common/sched_credit.c
> > @@ -489,7 +489,17 @@ static inline void __runq_tickle(struct
> > csched_vcpu *new)
> >                  __trace_var(TRC_CSCHED_TICKLE, 1, sizeof(cpu),
> > &cpu);
> >          }
> >  
> > -        /* Send scheduler interrupts to designated CPUs */
> > +        /*
> > +         * Mark the designated CPUs as busy and send them all the
> > scheduler
> > +         * interrupt. We need the for_each_cpu for dealing with
> > the
> > +         * !opt_tickle_one_idle case. We must use
> > cpumask_clear_cpu() and
> > +         * can't use cpumask_andnot(), because prv->idlers needs
> > atomic access.
> > +         *
> > +         * In the default (and most common) case, when
> > opt_rickle_one_idle is
> > +         * true, the loop does only one step, and only one bit is
> > cleared.
> > +         */
> > +        for_each_cpu(cpu, &mask)
> > +            cpumask_clear_cpu(cpu, prv->idlers);
> >          cpumask_raise_softirq(&mask, SCHEDULE_SOFTIRQ);
> >      }
> >      else
> > @@ -985,6 +995,8 @@ csched_vcpu_acct(struct csched_private *prv,
> > unsigned int cpu)
> >              SCHED_VCPU_STAT_CRANK(svc, migrate_r);
> >              SCHED_STAT_CRANK(migrate_running);
> >              set_bit(_VPF_migrating, &current->pause_flags);
> > +            ASSERT(!cpumask_test_cpu(cpu,
> > +                                     CSCHED_PRIV(per_cpu(scheduler
> > , cpu))->idlers));
> 
> What are these about?  Is this just to double-check that the "idler
> accounting" logic is correct?
>
No, raising the SCHEDULE_SOFTIRQ (which is what we do here, right below
the added ASSERT(), and also in the other place where I added another
one) basically means tickling the cpu we raise it for.

Therefore, according to the very purpose of this patch, we should clear
the bit corresponding to the cpu itself in the idler mask. In practise,
that is not necessary, because it (in both cases) that happens to be
cleared already.

The ASSERT()-s are for both making this explicit, and check/enforce it.

I appreciate it's not that clear, though. I'll add comments.

Thanks and Regards,
Dario
-- 
<<This happens because I choose it to happen!>> (Raistlin Majere)
-----------------------------------------------------------------
Dario Faggioli, Ph.D, http://about.me/dario.faggioli
Senior Software Engineer, Citrix Systems R&D Ltd., Cambridge (UK)

[-- Attachment #1.2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 819 bytes --]

[-- Attachment #2: Type: text/plain, Size: 127 bytes --]

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
https://lists.xen.org/xen-devel

^ permalink raw reply	[flat|nested] 24+ messages in thread

* [PATCH v2 4/7] xen/tools: tracing: add record for credit1 runqueue stealing.
  2017-04-06  8:16 [PATCH v2 0/7] xen: sched: improve scalability of Credit1, and optimize a bit both Credit1 and Credit2 Dario Faggioli
                   ` (2 preceding siblings ...)
  2017-04-06  8:16 ` [PATCH v2 3/7] xen: credit: consider tickled pCPUs as busy Dario Faggioli
@ 2017-04-06  8:16 ` Dario Faggioli
  2017-04-07 11:01   ` George Dunlap
  2017-04-06  8:16 ` [PATCH v2 5/7] xen: credit1: increase efficiency and scalability of load balancing Dario Faggioli
                   ` (2 subsequent siblings)
  6 siblings, 1 reply; 24+ messages in thread
From: Dario Faggioli @ 2017-04-06  8:16 UTC (permalink / raw)
  To: xen-devel; +Cc: Wei Liu, Ian Jackson, George Dunlap

Including whether we actually tried stealing a vCPU from
a given pCPU, or we skipped that one, because of lock
contention.

Signed-off-by: Dario Faggioli <dario.faggioli@citrix.com>
---
Cc: George Dunlap <george.dunlap@citrix.com>
Cc: Ian Jackson <ian.jackson@eu.citrix.com>
Cc: Wei Liu <wei.liu2@citrix.com>
---
 tools/xentrace/formats       |    1 +
 tools/xentrace/xenalyze.c    |   11 +++++++++++
 xen/common/sched_credit.c    |    6 +++++-
 xen/include/xen/perfc_defn.h |    1 +
 4 files changed, 18 insertions(+), 1 deletion(-)

diff --git a/tools/xentrace/formats b/tools/xentrace/formats
index a055231..8b31780 100644
--- a/tools/xentrace/formats
+++ b/tools/xentrace/formats
@@ -47,6 +47,7 @@
 0x00022008  CPU%(cpu)d  %(tsc)d (+%(reltsc)8d)  csched:unboost       [ dom:vcpu = 0x%(1)04x%(2)04x ]
 0x00022009  CPU%(cpu)d  %(tsc)d (+%(reltsc)8d)  csched:schedule      [ cpu[16]:tasklet[8]:idle[8] = %(1)08x ]
 0x0002200A  CPU%(cpu)d  %(tsc)d (+%(reltsc)8d)  csched:ratelimit     [ dom:vcpu = 0x%(1)08x, runtime = %(2)d ]
+0x0002200B  CPU%(cpu)d  %(tsc)d (+%(reltsc)8d)  csched:steal_check   [ peer_cpu = %(1)d, checked = %(2)d ]
 
 0x00022201  CPU%(cpu)d  %(tsc)d (+%(reltsc)8d)  csched2:tick
 0x00022202  CPU%(cpu)d  %(tsc)d (+%(reltsc)8d)  csched2:runq_pos       [ dom:vcpu = 0x%(1)08x, pos = %(2)d]
diff --git a/tools/xentrace/xenalyze.c b/tools/xentrace/xenalyze.c
index 029c89d..fa608ad 100644
--- a/tools/xentrace/xenalyze.c
+++ b/tools/xentrace/xenalyze.c
@@ -7651,6 +7651,17 @@ void sched_process(struct pcpu_info *p)
                        r->runtime / 1000, r->runtime % 1000);
             }
             break;
+        case TRC_SCHED_CLASS_EVT(CSCHED, 11): /* STEAL_CHECK   */
+            if(opt.dump_all) {
+                struct {
+                    unsigned int peer_cpu, check;
+                } *r = (typeof(r))ri->d;
+
+                printf(" %s csched:load_balance %s %u\n",
+                       ri->dump_header, r->check ? "checking" : "skipping",
+                       r->peer_cpu);
+            }
+            break;
         /* CREDIT 2 (TRC_CSCHED2_xxx) */
         case TRC_SCHED_CLASS_EVT(CSCHED2, 1): /* TICK              */
         case TRC_SCHED_CLASS_EVT(CSCHED2, 4): /* CREDIT_ADD        */
diff --git a/xen/common/sched_credit.c b/xen/common/sched_credit.c
index c753089..49caa0a 100644
--- a/xen/common/sched_credit.c
+++ b/xen/common/sched_credit.c
@@ -134,6 +134,7 @@
 #define TRC_CSCHED_BOOST_END     TRC_SCHED_CLASS_EVT(CSCHED, 8)
 #define TRC_CSCHED_SCHEDULE      TRC_SCHED_CLASS_EVT(CSCHED, 9)
 #define TRC_CSCHED_RATELIMIT     TRC_SCHED_CLASS_EVT(CSCHED, 10)
+#define TRC_CSCHED_STEAL_CHECK   TRC_SCHED_CLASS_EVT(CSCHED, 11)
 
 
 /*
@@ -1738,14 +1739,17 @@ csched_load_balance(struct csched_private *prv, int cpu,
                  * balancing and trying to lock this CPU.
                  */
                 spinlock_t *lock = pcpu_schedule_trylock(peer_cpu);
-
+                SCHED_STAT_CRANK(steal_trylock);
                 if ( !lock )
                 {
                     SCHED_STAT_CRANK(steal_trylock_failed);
+                    TRACE_2D(TRC_CSCHED_STEAL_CHECK, peer_cpu, /* skipp'n */ 0);
                     peer_cpu = cpumask_cycle(peer_cpu, &workers);
                     continue;
                 }
 
+                TRACE_2D(TRC_CSCHED_STEAL_CHECK, peer_cpu, /* checked */ 1);
+
                 /* Any work over there to steal? */
                 speer = cpumask_test_cpu(peer_cpu, online) ?
                     csched_runq_steal(peer_cpu, cpu, snext->pri, bstep) : NULL;
diff --git a/xen/include/xen/perfc_defn.h b/xen/include/xen/perfc_defn.h
index 0d702f0..53849af 100644
--- a/xen/include/xen/perfc_defn.h
+++ b/xen/include/xen/perfc_defn.h
@@ -48,6 +48,7 @@ PERFCOUNTER(vcpu_unpark,            "csched: vcpu_unpark")
 PERFCOUNTER(load_balance_idle,      "csched: load_balance_idle")
 PERFCOUNTER(load_balance_over,      "csched: load_balance_over")
 PERFCOUNTER(load_balance_other,     "csched: load_balance_other")
+PERFCOUNTER(steal_trylock,          "csched: steal_trylock")
 PERFCOUNTER(steal_trylock_failed,   "csched: steal_trylock_failed")
 PERFCOUNTER(steal_peer_idle,        "csched: steal_peer_idle")
 PERFCOUNTER(migrate_queued,         "csched: migrate_queued")


_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
https://lists.xen.org/xen-devel

^ permalink raw reply related	[flat|nested] 24+ messages in thread

* Re: [PATCH v2 4/7] xen/tools: tracing: add record for credit1 runqueue stealing.
  2017-04-06  8:16 ` [PATCH v2 4/7] xen/tools: tracing: add record for credit1 runqueue stealing Dario Faggioli
@ 2017-04-07 11:01   ` George Dunlap
  2017-04-07 13:06     ` Dario Faggioli
  0 siblings, 1 reply; 24+ messages in thread
From: George Dunlap @ 2017-04-07 11:01 UTC (permalink / raw)
  To: Dario Faggioli, xen-devel; +Cc: Wei Liu, Ian Jackson

On 06/04/17 09:16, Dario Faggioli wrote:
> Including whether we actually tried stealing a vCPU from
> a given pCPU, or we skipped that one, because of lock
> contention.
> 
> Signed-off-by: Dario Faggioli <dario.faggioli@citrix.com>
> ---
> Cc: George Dunlap <george.dunlap@citrix.com>
> Cc: Ian Jackson <ian.jackson@eu.citrix.com>
> Cc: Wei Liu <wei.liu2@citrix.com>
> ---
>  tools/xentrace/formats       |    1 +
>  tools/xentrace/xenalyze.c    |   11 +++++++++++
>  xen/common/sched_credit.c    |    6 +++++-
>  xen/include/xen/perfc_defn.h |    1 +
>  4 files changed, 18 insertions(+), 1 deletion(-)
> 
> diff --git a/tools/xentrace/formats b/tools/xentrace/formats
> index a055231..8b31780 100644
> --- a/tools/xentrace/formats
> +++ b/tools/xentrace/formats
> @@ -47,6 +47,7 @@
>  0x00022008  CPU%(cpu)d  %(tsc)d (+%(reltsc)8d)  csched:unboost       [ dom:vcpu = 0x%(1)04x%(2)04x ]
>  0x00022009  CPU%(cpu)d  %(tsc)d (+%(reltsc)8d)  csched:schedule      [ cpu[16]:tasklet[8]:idle[8] = %(1)08x ]
>  0x0002200A  CPU%(cpu)d  %(tsc)d (+%(reltsc)8d)  csched:ratelimit     [ dom:vcpu = 0x%(1)08x, runtime = %(2)d ]
> +0x0002200B  CPU%(cpu)d  %(tsc)d (+%(reltsc)8d)  csched:steal_check   [ peer_cpu = %(1)d, checked = %(2)d ]
>  
>  0x00022201  CPU%(cpu)d  %(tsc)d (+%(reltsc)8d)  csched2:tick
>  0x00022202  CPU%(cpu)d  %(tsc)d (+%(reltsc)8d)  csched2:runq_pos       [ dom:vcpu = 0x%(1)08x, pos = %(2)d]
> diff --git a/tools/xentrace/xenalyze.c b/tools/xentrace/xenalyze.c
> index 029c89d..fa608ad 100644
> --- a/tools/xentrace/xenalyze.c
> +++ b/tools/xentrace/xenalyze.c
> @@ -7651,6 +7651,17 @@ void sched_process(struct pcpu_info *p)
>                         r->runtime / 1000, r->runtime % 1000);
>              }
>              break;
> +        case TRC_SCHED_CLASS_EVT(CSCHED, 11): /* STEAL_CHECK   */
> +            if(opt.dump_all) {
> +                struct {
> +                    unsigned int peer_cpu, check;
> +                } *r = (typeof(r))ri->d;
> +
> +                printf(" %s csched:load_balance %s %u\n",
> +                       ri->dump_header, r->check ? "checking" : "skipping",
> +                       r->peer_cpu);
> +            }
> +            break;
>          /* CREDIT 2 (TRC_CSCHED2_xxx) */
>          case TRC_SCHED_CLASS_EVT(CSCHED2, 1): /* TICK              */
>          case TRC_SCHED_CLASS_EVT(CSCHED2, 4): /* CREDIT_ADD        */
> diff --git a/xen/common/sched_credit.c b/xen/common/sched_credit.c
> index c753089..49caa0a 100644
> --- a/xen/common/sched_credit.c
> +++ b/xen/common/sched_credit.c
> @@ -134,6 +134,7 @@
>  #define TRC_CSCHED_BOOST_END     TRC_SCHED_CLASS_EVT(CSCHED, 8)
>  #define TRC_CSCHED_SCHEDULE      TRC_SCHED_CLASS_EVT(CSCHED, 9)
>  #define TRC_CSCHED_RATELIMIT     TRC_SCHED_CLASS_EVT(CSCHED, 10)
> +#define TRC_CSCHED_STEAL_CHECK   TRC_SCHED_CLASS_EVT(CSCHED, 11)
>  
>  
>  /*
> @@ -1738,14 +1739,17 @@ csched_load_balance(struct csched_private *prv, int cpu,
>                   * balancing and trying to lock this CPU.
>                   */
>                  spinlock_t *lock = pcpu_schedule_trylock(peer_cpu);
> -
> +                SCHED_STAT_CRANK(steal_trylock);
>                  if ( !lock )
>                  {
>                      SCHED_STAT_CRANK(steal_trylock_failed);
> +                    TRACE_2D(TRC_CSCHED_STEAL_CHECK, peer_cpu, /* skipp'n */ 0);

Was this to try to keep the line under 80 characters? :-)

I think you could write "skipped" here and it would be the same number
of characters; or just "skip" would be fine too.  Let me know what you
think and I can change it on check-in.

Acked-by: George Dunlap <george.dunlap@citrix.com>

>                      peer_cpu = cpumask_cycle(peer_cpu, &workers);
>                      continue;
>                  }
>  
> +                TRACE_2D(TRC_CSCHED_STEAL_CHECK, peer_cpu, /* checked */ 1);
> +
>                  /* Any work over there to steal? */
>                  speer = cpumask_test_cpu(peer_cpu, online) ?
>                      csched_runq_steal(peer_cpu, cpu, snext->pri, bstep) : NULL;
> diff --git a/xen/include/xen/perfc_defn.h b/xen/include/xen/perfc_defn.h
> index 0d702f0..53849af 100644
> --- a/xen/include/xen/perfc_defn.h
> +++ b/xen/include/xen/perfc_defn.h
> @@ -48,6 +48,7 @@ PERFCOUNTER(vcpu_unpark,            "csched: vcpu_unpark")
>  PERFCOUNTER(load_balance_idle,      "csched: load_balance_idle")
>  PERFCOUNTER(load_balance_over,      "csched: load_balance_over")
>  PERFCOUNTER(load_balance_other,     "csched: load_balance_other")
> +PERFCOUNTER(steal_trylock,          "csched: steal_trylock")
>  PERFCOUNTER(steal_trylock_failed,   "csched: steal_trylock_failed")
>  PERFCOUNTER(steal_peer_idle,        "csched: steal_peer_idle")
>  PERFCOUNTER(migrate_queued,         "csched: migrate_queued")
> 


_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
https://lists.xen.org/xen-devel

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH v2 4/7] xen/tools: tracing: add record for credit1 runqueue stealing.
  2017-04-07 11:01   ` George Dunlap
@ 2017-04-07 13:06     ` Dario Faggioli
  0 siblings, 0 replies; 24+ messages in thread
From: Dario Faggioli @ 2017-04-07 13:06 UTC (permalink / raw)
  To: George Dunlap, xen-devel; +Cc: Wei Liu, Ian Jackson


[-- Attachment #1.1: Type: text/plain, Size: 1637 bytes --]

On Fri, 2017-04-07 at 12:01 +0100, George Dunlap wrote:
> On 06/04/17 09:16, Dario Faggioli wrote:
> > 
> > --- a/xen/common/sched_credit.c
> > +++ b/xen/common/sched_credit.c
> > @@ -1738,14 +1739,17 @@ csched_load_balance(struct csched_private
> > *prv, int cpu,
> >                   * balancing and trying to lock this CPU.
> >                   */
> >                  spinlock_t *lock =
> > pcpu_schedule_trylock(peer_cpu);
> > -
> > +                SCHED_STAT_CRANK(steal_trylock);
> >                  if ( !lock )
> >                  {
> >                      SCHED_STAT_CRANK(steal_trylock_failed);
> > +                    TRACE_2D(TRC_CSCHED_STEAL_CHECK, peer_cpu, /*
> > skipp'n */ 0);
> 
> Was this to try to keep the line under 80 characters? :-)
> 
Well, that _and_ being cool! :-P :-P

> I think you could write "skipped" here and it would be the same
> number
> of characters; or just "skip" would be fine too.
>
It's a lot less cool, but ok, go for it. :-D

>   Let me know what you
> think and I can change it on check-in.
> 
Well, I'm preparing a v3, where I'm addressing the comments you're
making, so I'm happy to change this in there too.

> Acked-by: George Dunlap <george.dunlap@citrix.com>
> 
Thanks,
Dario
-- 
<<This happens because I choose it to happen!>> (Raistlin Majere)
-----------------------------------------------------------------
Dario Faggioli, Ph.D, http://about.me/dario.faggioli
Senior Software Engineer, Citrix Systems R&D Ltd., Cambridge (UK)

[-- Attachment #1.2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 819 bytes --]

[-- Attachment #2: Type: text/plain, Size: 127 bytes --]

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
https://lists.xen.org/xen-devel

^ permalink raw reply	[flat|nested] 24+ messages in thread

* [PATCH v2 5/7] xen: credit1: increase efficiency and scalability of load balancing.
  2017-04-06  8:16 [PATCH v2 0/7] xen: sched: improve scalability of Credit1, and optimize a bit both Credit1 and Credit2 Dario Faggioli
                   ` (3 preceding siblings ...)
  2017-04-06  8:16 ` [PATCH v2 4/7] xen/tools: tracing: add record for credit1 runqueue stealing Dario Faggioli
@ 2017-04-06  8:16 ` Dario Faggioli
  2017-04-07 14:38   ` George Dunlap
  2017-04-06  8:16 ` [PATCH v2 6/7] xen: credit1: treat pCPUs more evenly during balancing Dario Faggioli
  2017-04-06  8:17 ` [PATCH v2 7/7] xen: credit2: avoid cpumask_any() in pick_cpu() Dario Faggioli
  6 siblings, 1 reply; 24+ messages in thread
From: Dario Faggioli @ 2017-04-06  8:16 UTC (permalink / raw)
  To: xen-devel; +Cc: George Dunlap, Andrew Cooper, Anshul Makkar

During load balancing, we check the non idle pCPUs to
see if they have runnable but not running vCPUs that
can be stolen by and set to run on currently idle pCPUs.

If a pCPU has only one running (or runnable) vCPU,
though, we don't want to steal it from there, and
it's therefore pointless bothering with it
(especially considering that bothering means trying
to take its runqueue lock!).

On large systems, when load is only slightly higher
than the number of pCPUs (i.e., there are just a few
more active vCPUs than the number of the pCPUs), this
may mean that:
 - we go through all the pCPUs,
 - for each one, we (try to) take its runqueue locks,
 - we figure out there's actually nothing to be stolen!

To mitigate this, we introduce a counter for the number
of runnable vCPUs on each pCPU. In fact, unless there
re least 2 runnable vCPUs --typically, one running,
and the others in the runqueue-- it does not make sense
to try stealing anything.

signed-off-by: Dario Faggioli <dario.faggioli@citrix.com>
---
Cc: George Dunlap <george.dunlap@eu.citrix.com>
Cc: Andrew Cooper <andrew.cooper3@citrix.com>
Cc: Anshul Makkar <anshul.makkar@citrix.com>
---
Changes from v2:
* don't count the idle vCPU as runnable. This is just cosmetic and not at all a
  logic or functionl change wrt v1;
* don't count inside of __runq_remove() or __runq_insert(), but provide
  specific counting functions, and call them when appropriate. This is necessary
  to avoid spurious overloaded state to be reported, basically *all* the time
  that a pCPU goes through the scheduler (due to the fact that the scheduler
  calls __runq_insert() on the current vCPU);
* get rid of the overloaded cpumask and only use the counter. I actually did
  like the cpumask solution better, but for this purpose, it was overkill.
---
 xen/common/sched_credit.c |   78 ++++++++++++++++++++++++++++++++++++++++++---
 1 file changed, 72 insertions(+), 6 deletions(-)

diff --git a/xen/common/sched_credit.c b/xen/common/sched_credit.c
index 49caa0a..09e3192 100644
--- a/xen/common/sched_credit.c
+++ b/xen/common/sched_credit.c
@@ -172,6 +172,7 @@ struct csched_pcpu {
     struct timer ticker;
     unsigned int tick;
     unsigned int idle_bias;
+    unsigned int nr_runnable;
 };
 
 /*
@@ -262,9 +263,26 @@ static inline bool_t is_runq_idle(unsigned int cpu)
 }
 
 static inline void
+inc_nr_runnable(unsigned int cpu)
+{
+    ASSERT(spin_is_locked(per_cpu(schedule_data, cpu).schedule_lock));
+    CSCHED_PCPU(cpu)->nr_runnable++;
+
+}
+
+static inline void
+dec_nr_runnable(unsigned int cpu)
+{
+    ASSERT(spin_is_locked(per_cpu(schedule_data, cpu).schedule_lock));
+    CSCHED_PCPU(cpu)->nr_runnable--;
+    ASSERT(CSCHED_PCPU(cpu)->nr_runnable >= 0);
+}
+
+static inline void
 __runq_insert(struct csched_vcpu *svc)
 {
-    const struct list_head * const runq = RUNQ(svc->vcpu->processor);
+    unsigned int cpu = svc->vcpu->processor;
+    const struct list_head * const runq = RUNQ(cpu);
     struct list_head *iter;
 
     BUG_ON( __vcpu_on_runq(svc) );
@@ -601,6 +619,7 @@ init_pdata(struct csched_private *prv, struct csched_pcpu *spc, int cpu)
     /* Start off idling... */
     BUG_ON(!is_idle_vcpu(curr_on_cpu(cpu)));
     cpumask_set_cpu(cpu, prv->idlers);
+    spc->nr_runnable = 0;
 }
 
 static void
@@ -1042,7 +1061,10 @@ csched_vcpu_insert(const struct scheduler *ops, struct vcpu *vc)
     lock = vcpu_schedule_lock_irq(vc);
 
     if ( !__vcpu_on_runq(svc) && vcpu_runnable(vc) && !vc->is_running )
+    {
         __runq_insert(svc);
+        inc_nr_runnable(vc->processor);
+    }
 
     vcpu_schedule_unlock_irq(lock, vc);
 
@@ -1102,7 +1124,10 @@ csched_vcpu_sleep(const struct scheduler *ops, struct vcpu *vc)
         cpu_raise_softirq(cpu, SCHEDULE_SOFTIRQ);
     }
     else if ( __vcpu_on_runq(svc) )
+    {
+        dec_nr_runnable(cpu);
         __runq_remove(svc);
+    }
 }
 
 static void
@@ -1163,6 +1188,7 @@ csched_vcpu_wake(const struct scheduler *ops, struct vcpu *vc)
 
     /* Put the VCPU on the runq and tickle CPUs */
     __runq_insert(svc);
+    inc_nr_runnable(vc->processor);
     __runq_tickle(svc);
 }
 
@@ -1664,8 +1690,10 @@ csched_runq_steal(int peer_cpu, int cpu, int pri, int balance_step)
             SCHED_VCPU_STAT_CRANK(speer, migrate_q);
             SCHED_STAT_CRANK(migrate_queued);
             WARN_ON(vc->is_urgent);
+            dec_nr_runnable(peer_cpu);
             __runq_remove(speer);
             vc->processor = cpu;
+            inc_nr_runnable(cpu);
             return speer;
         }
     }
@@ -1721,7 +1749,7 @@ csched_load_balance(struct csched_private *prv, int cpu,
         peer_node = node;
         do
         {
-            /* Find out what the !idle are in this node */
+            /* Select the pCPUs in this node that have work we can steal. */
             cpumask_andnot(&workers, online, prv->idlers);
             cpumask_and(&workers, &workers, &node_to_cpumask(peer_node));
             __cpumask_clear_cpu(cpu, &workers);
@@ -1731,6 +1759,40 @@ csched_load_balance(struct csched_private *prv, int cpu,
                 goto next_node;
             do
             {
+                spinlock_t *lock;
+
+                /*
+                 * If there is only one runnable vCPU on peer_cpu, it means
+                 * there's no one to be stolen in its runqueue, so skip it.
+                 *
+                 * Checking this without holding the lock is racy... But that's
+                 * the whole point of this optimization!
+                 *
+                 * In more details:
+                 * - if we race with dec_nr_runnable(), we may try to take the
+                 *   lock and call csched_runq_steal() for no reason. This is
+                 *   not a functional issue, and should be infrequent enough.
+                 *   And we can avoid that by re-checking nr_runnable after
+                 *   having grabbed the lock, if we want;
+                 * - if we race with inc_nr_runnable(), we skip a pCPU that may
+                 *   have runnable vCPUs in its runqueue, but that's not a
+                 *   problem because:
+                 *   + if racing with csched_vcpu_insert() or csched_vcpu_wake(),
+                 *     __runq_tickle() will be called afterwords, so the vCPU
+                 *     won't get stuck in the runqueue for too long;
+                 *   + if racing with csched_runq_steal(), it may be that a
+                 *     vCPU that we could have picked up, stays in a runqueue
+                 *     until someone else tries to steal it again. But this is
+                 *     no worse than what can happen already (without this
+                 *     optimization), it the pCPU would schedule right after we
+                 *     have taken the lock, and hence block on it.
+                 */
+                if ( CSCHED_PCPU(peer_cpu)->nr_runnable <= 1 )
+                {
+                    TRACE_2D(TRC_CSCHED_STEAL_CHECK, peer_cpu, /* skipp'n */ 0);
+                    goto next_cpu;
+                }
+
                 /*
                  * Get ahold of the scheduler lock for this peer CPU.
                  *
@@ -1738,14 +1800,13 @@ csched_load_balance(struct csched_private *prv, int cpu,
                  * could cause a deadlock if the peer CPU is also load
                  * balancing and trying to lock this CPU.
                  */
-                spinlock_t *lock = pcpu_schedule_trylock(peer_cpu);
+                lock = pcpu_schedule_trylock(peer_cpu);
                 SCHED_STAT_CRANK(steal_trylock);
                 if ( !lock )
                 {
                     SCHED_STAT_CRANK(steal_trylock_failed);
                     TRACE_2D(TRC_CSCHED_STEAL_CHECK, peer_cpu, /* skipp'n */ 0);
-                    peer_cpu = cpumask_cycle(peer_cpu, &workers);
-                    continue;
+                    goto next_cpu;
                 }
 
                 TRACE_2D(TRC_CSCHED_STEAL_CHECK, peer_cpu, /* checked */ 1);
@@ -1762,6 +1823,7 @@ csched_load_balance(struct csched_private *prv, int cpu,
                     return speer;
                 }
 
+ next_cpu:
                 peer_cpu = cpumask_cycle(peer_cpu, &workers);
 
             } while( peer_cpu != cpumask_first(&workers) );
@@ -1892,7 +1954,10 @@ csched_schedule(
     if ( vcpu_runnable(current) )
         __runq_insert(scurr);
     else
+    {
         BUG_ON( is_idle_vcpu(current) || list_empty(runq) );
+        dec_nr_runnable(cpu);
+    }
 
     snext = __runq_elem(runq->next);
     ret.migrated = 0;
@@ -2009,7 +2074,8 @@ csched_dump_pcpu(const struct scheduler *ops, int cpu)
     runq = &spc->runq;
 
     cpumask_scnprintf(cpustr, sizeof(cpustr), per_cpu(cpu_sibling_mask, cpu));
-    printk("CPU[%02d] sort=%d, sibling=%s, ", cpu, spc->runq_sort_last, cpustr);
+    printk("CPU[%02d] nr_run=%d, sort=%d, sibling=%s, ",
+           cpu, spc->nr_runnable, spc->runq_sort_last, cpustr);
     cpumask_scnprintf(cpustr, sizeof(cpustr), per_cpu(cpu_core_mask, cpu));
     printk("core=%s\n", cpustr);
 


_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
https://lists.xen.org/xen-devel

^ permalink raw reply related	[flat|nested] 24+ messages in thread

* Re: [PATCH v2 5/7] xen: credit1: increase efficiency and scalability of load balancing.
  2017-04-06  8:16 ` [PATCH v2 5/7] xen: credit1: increase efficiency and scalability of load balancing Dario Faggioli
@ 2017-04-07 14:38   ` George Dunlap
  2017-04-07 14:49     ` Dario Faggioli
  0 siblings, 1 reply; 24+ messages in thread
From: George Dunlap @ 2017-04-07 14:38 UTC (permalink / raw)
  To: Dario Faggioli, xen-devel; +Cc: George Dunlap, Andrew Cooper, Anshul Makkar

On 06/04/17 09:16, Dario Faggioli wrote:
> During load balancing, we check the non idle pCPUs to
> see if they have runnable but not running vCPUs that
> can be stolen by and set to run on currently idle pCPUs.
> 
> If a pCPU has only one running (or runnable) vCPU,
> though, we don't want to steal it from there, and
> it's therefore pointless bothering with it
> (especially considering that bothering means trying
> to take its runqueue lock!).
> 
> On large systems, when load is only slightly higher
> than the number of pCPUs (i.e., there are just a few
> more active vCPUs than the number of the pCPUs), this
> may mean that:
>  - we go through all the pCPUs,
>  - for each one, we (try to) take its runqueue locks,
>  - we figure out there's actually nothing to be stolen!
> 
> To mitigate this, we introduce a counter for the number
> of runnable vCPUs on each pCPU. In fact, unless there
> re least 2 runnable vCPUs --typically, one running,
> and the others in the runqueue-- it does not make sense
> to try stealing anything.
> 
> signed-off-by: Dario Faggioli <dario.faggioli@citrix.com>
> ---
> Cc: George Dunlap <george.dunlap@eu.citrix.com>
> Cc: Andrew Cooper <andrew.cooper3@citrix.com>
> Cc: Anshul Makkar <anshul.makkar@citrix.com>
> ---
> Changes from v2:
> * don't count the idle vCPU as runnable. This is just cosmetic and not at all a
>   logic or functionl change wrt v1;
> * don't count inside of __runq_remove() or __runq_insert(), but provide
>   specific counting functions, and call them when appropriate. This is necessary
>   to avoid spurious overloaded state to be reported, basically *all* the time
>   that a pCPU goes through the scheduler (due to the fact that the scheduler
>   calls __runq_insert() on the current vCPU);
> * get rid of the overloaded cpumask and only use the counter. I actually did
>   like the cpumask solution better, but for this purpose, it was overkill.
> ---
>  xen/common/sched_credit.c |   78 ++++++++++++++++++++++++++++++++++++++++++---
>  1 file changed, 72 insertions(+), 6 deletions(-)
> 
> diff --git a/xen/common/sched_credit.c b/xen/common/sched_credit.c
> index 49caa0a..09e3192 100644
> --- a/xen/common/sched_credit.c
> +++ b/xen/common/sched_credit.c
> @@ -172,6 +172,7 @@ struct csched_pcpu {
>      struct timer ticker;
>      unsigned int tick;
>      unsigned int idle_bias;
> +    unsigned int nr_runnable;
>  };
>  
>  /*
> @@ -262,9 +263,26 @@ static inline bool_t is_runq_idle(unsigned int cpu)
>  }
>  
>  static inline void
> +inc_nr_runnable(unsigned int cpu)
> +{
> +    ASSERT(spin_is_locked(per_cpu(schedule_data, cpu).schedule_lock));
> +    CSCHED_PCPU(cpu)->nr_runnable++;
> +
> +}
> +
> +static inline void
> +dec_nr_runnable(unsigned int cpu)
> +{
> +    ASSERT(spin_is_locked(per_cpu(schedule_data, cpu).schedule_lock));
> +    CSCHED_PCPU(cpu)->nr_runnable--;
> +    ASSERT(CSCHED_PCPU(cpu)->nr_runnable >= 0);
> +}
> +
> +static inline void
>  __runq_insert(struct csched_vcpu *svc)
>  {
> -    const struct list_head * const runq = RUNQ(svc->vcpu->processor);
> +    unsigned int cpu = svc->vcpu->processor;
> +    const struct list_head * const runq = RUNQ(cpu);
>      struct list_head *iter;
>  
>      BUG_ON( __vcpu_on_runq(svc) );
> @@ -601,6 +619,7 @@ init_pdata(struct csched_private *prv, struct csched_pcpu *spc, int cpu)
>      /* Start off idling... */
>      BUG_ON(!is_idle_vcpu(curr_on_cpu(cpu)));
>      cpumask_set_cpu(cpu, prv->idlers);
> +    spc->nr_runnable = 0;
>  }
>  
>  static void
> @@ -1042,7 +1061,10 @@ csched_vcpu_insert(const struct scheduler *ops, struct vcpu *vc)
>      lock = vcpu_schedule_lock_irq(vc);
>  
>      if ( !__vcpu_on_runq(svc) && vcpu_runnable(vc) && !vc->is_running )
> +    {
>          __runq_insert(svc);
> +        inc_nr_runnable(vc->processor);
> +    }
>  
>      vcpu_schedule_unlock_irq(lock, vc);
>  
> @@ -1102,7 +1124,10 @@ csched_vcpu_sleep(const struct scheduler *ops, struct vcpu *vc)
>          cpu_raise_softirq(cpu, SCHEDULE_SOFTIRQ);
>      }
>      else if ( __vcpu_on_runq(svc) )
> +    {
> +        dec_nr_runnable(cpu);
>          __runq_remove(svc);
> +    }
>  }
>  
>  static void
> @@ -1163,6 +1188,7 @@ csched_vcpu_wake(const struct scheduler *ops, struct vcpu *vc)
>  
>      /* Put the VCPU on the runq and tickle CPUs */
>      __runq_insert(svc);
> +    inc_nr_runnable(vc->processor);
>      __runq_tickle(svc);
>  }
>  
> @@ -1664,8 +1690,10 @@ csched_runq_steal(int peer_cpu, int cpu, int pri, int balance_step)
>              SCHED_VCPU_STAT_CRANK(speer, migrate_q);
>              SCHED_STAT_CRANK(migrate_queued);
>              WARN_ON(vc->is_urgent);
> +            dec_nr_runnable(peer_cpu);
>              __runq_remove(speer);
>              vc->processor = cpu;
> +            inc_nr_runnable(cpu);
>              return speer;
>          }
>      }
> @@ -1721,7 +1749,7 @@ csched_load_balance(struct csched_private *prv, int cpu,
>          peer_node = node;
>          do
>          {
> -            /* Find out what the !idle are in this node */
> +            /* Select the pCPUs in this node that have work we can steal. */
>              cpumask_andnot(&workers, online, prv->idlers);
>              cpumask_and(&workers, &workers, &node_to_cpumask(peer_node));
>              __cpumask_clear_cpu(cpu, &workers);
> @@ -1731,6 +1759,40 @@ csched_load_balance(struct csched_private *prv, int cpu,
>                  goto next_node;
>              do
>              {
> +                spinlock_t *lock;
> +
> +                /*
> +                 * If there is only one runnable vCPU on peer_cpu, it means
> +                 * there's no one to be stolen in its runqueue, so skip it.
> +                 *
> +                 * Checking this without holding the lock is racy... But that's
> +                 * the whole point of this optimization!
> +                 *
> +                 * In more details:
> +                 * - if we race with dec_nr_runnable(), we may try to take the
> +                 *   lock and call csched_runq_steal() for no reason. This is
> +                 *   not a functional issue, and should be infrequent enough.
> +                 *   And we can avoid that by re-checking nr_runnable after
> +                 *   having grabbed the lock, if we want;
> +                 * - if we race with inc_nr_runnable(), we skip a pCPU that may
> +                 *   have runnable vCPUs in its runqueue, but that's not a
> +                 *   problem because:
> +                 *   + if racing with csched_vcpu_insert() or csched_vcpu_wake(),
> +                 *     __runq_tickle() will be called afterwords, so the vCPU
> +                 *     won't get stuck in the runqueue for too long;
> +                 *   + if racing with csched_runq_steal(), it may be that a
> +                 *     vCPU that we could have picked up, stays in a runqueue
> +                 *     until someone else tries to steal it again. But this is
> +                 *     no worse than what can happen already (without this
> +                 *     optimization), it the pCPU would schedule right after we
> +                 *     have taken the lock, and hence block on it.
> +                 */
> +                if ( CSCHED_PCPU(peer_cpu)->nr_runnable <= 1 )
> +                {
> +                    TRACE_2D(TRC_CSCHED_STEAL_CHECK, peer_cpu, /* skipp'n */ 0);
> +                    goto next_cpu;
> +                }
> +
>                  /*
>                   * Get ahold of the scheduler lock for this peer CPU.
>                   *
> @@ -1738,14 +1800,13 @@ csched_load_balance(struct csched_private *prv, int cpu,
>                   * could cause a deadlock if the peer CPU is also load
>                   * balancing and trying to lock this CPU.
>                   */
> -                spinlock_t *lock = pcpu_schedule_trylock(peer_cpu);
> +                lock = pcpu_schedule_trylock(peer_cpu);
>                  SCHED_STAT_CRANK(steal_trylock);
>                  if ( !lock )
>                  {
>                      SCHED_STAT_CRANK(steal_trylock_failed);
>                      TRACE_2D(TRC_CSCHED_STEAL_CHECK, peer_cpu, /* skipp'n */ 0);
> -                    peer_cpu = cpumask_cycle(peer_cpu, &workers);
> -                    continue;
> +                    goto next_cpu;

Wait -- does this mean that before this patch, this effectively
busy-waited until peer_cpu was actually free (since peer_cpu was never
incremented)?

I like the idea in general, but I'm not a fan of the current way of
doing the accounting -- it seems like too many special cases.  Let me
have a think about this and come back to it.

 -George


_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
https://lists.xen.org/xen-devel

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH v2 5/7] xen: credit1: increase efficiency and scalability of load balancing.
  2017-04-07 14:38   ` George Dunlap
@ 2017-04-07 14:49     ` Dario Faggioli
  2017-04-07 15:17       ` George Dunlap
  0 siblings, 1 reply; 24+ messages in thread
From: Dario Faggioli @ 2017-04-07 14:49 UTC (permalink / raw)
  To: George Dunlap, xen-devel; +Cc: George Dunlap, Andrew Cooper, Anshul Makkar


[-- Attachment #1.1: Type: text/plain, Size: 2313 bytes --]

On Fri, 2017-04-07 at 15:38 +0100, George Dunlap wrote:
> On 06/04/17 09:16, Dario Faggioli wrote:
> > --- a/xen/common/sched_credit.c
> > +++ b/xen/common/sched_credit.c
> > @@ -1738,14 +1800,13 @@ csched_load_balance(struct csched_private
> > *prv, int cpu,
> >                   * could cause a deadlock if the peer CPU is also
> > load
> >                   * balancing and trying to lock this CPU.
> >                   */
> > -                spinlock_t *lock =
> > pcpu_schedule_trylock(peer_cpu);
> > +                lock = pcpu_schedule_trylock(peer_cpu);
> >                  SCHED_STAT_CRANK(steal_trylock);
> >                  if ( !lock )
> >                  {
> >                      SCHED_STAT_CRANK(steal_trylock_failed);
> >                      TRACE_2D(TRC_CSCHED_STEAL_CHECK, peer_cpu, /*
> > skipp'n */ 0);
> > -                    peer_cpu = cpumask_cycle(peer_cpu, &workers);
> > -                    continue;
> > +                    goto next_cpu;
> 
> Wait -- does this mean that before this patch, this effectively
> busy-waited until peer_cpu was actually free (since peer_cpu was
> never
> incremented)?
> 
Err, what do you mean? peer_cpu is updated here, with the result of
cpumask_cycle(), run on workers, starting from the the current value of
peer_cpu itself.

So, no, we don't busy wait try-locking on the same pcpu, we try all of
them, one after the other (in a node-wise fashion, thanks to the outer
loop).

If this is what you were asking....

> I like the idea in general, but I'm not a fan of the current way of
> doing the accounting -- it seems like too many special cases.  Let me
> have a think about this and come back to it.
> 
Ok. I'm not sure I see what you mean with 'accounting' in this context,
but, yeah, go ahead and let me know. :-)

Thanks and Regards,
Dario
-- 
<<This happens because I choose it to happen!>> (Raistlin Majere)
-----------------------------------------------------------------
Dario Faggioli, Ph.D, http://about.me/dario.faggioli
Senior Software Engineer, Citrix Systems R&D Ltd., Cambridge (UK)

[-- Attachment #1.2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 819 bytes --]

[-- Attachment #2: Type: text/plain, Size: 127 bytes --]

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
https://lists.xen.org/xen-devel

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH v2 5/7] xen: credit1: increase efficiency and scalability of load balancing.
  2017-04-07 14:49     ` Dario Faggioli
@ 2017-04-07 15:17       ` George Dunlap
  2017-04-07 15:26         ` Dario Faggioli
  0 siblings, 1 reply; 24+ messages in thread
From: George Dunlap @ 2017-04-07 15:17 UTC (permalink / raw)
  To: Dario Faggioli, xen-devel; +Cc: George Dunlap, Andrew Cooper, Anshul Makkar

On 07/04/17 15:49, Dario Faggioli wrote:
> On Fri, 2017-04-07 at 15:38 +0100, George Dunlap wrote:
>> On 06/04/17 09:16, Dario Faggioli wrote:
>>> --- a/xen/common/sched_credit.c
>>> +++ b/xen/common/sched_credit.c
>>> @@ -1738,14 +1800,13 @@ csched_load_balance(struct csched_private
>>> *prv, int cpu,
>>>                   * could cause a deadlock if the peer CPU is also
>>> load
>>>                   * balancing and trying to lock this CPU.
>>>                   */
>>> -                spinlock_t *lock =
>>> pcpu_schedule_trylock(peer_cpu);
>>> +                lock = pcpu_schedule_trylock(peer_cpu);
>>>                  SCHED_STAT_CRANK(steal_trylock);
>>>                  if ( !lock )
>>>                  {
>>>                      SCHED_STAT_CRANK(steal_trylock_failed);
>>>                      TRACE_2D(TRC_CSCHED_STEAL_CHECK, peer_cpu, /*
>>> skipp'n */ 0);
>>> -                    peer_cpu = cpumask_cycle(peer_cpu, &workers);
>>> -                    continue;
>>> +                    goto next_cpu;
>>
>> Wait -- does this mean that before this patch, this effectively
>> busy-waited until peer_cpu was actually free (since peer_cpu was
>> never
>> incremented)?
>>
> Err, what do you mean? peer_cpu is updated here, with the result of
> cpumask_cycle(), run on workers, starting from the the current value of
> peer_cpu itself.
> 
> So, no, we don't busy wait try-locking on the same pcpu, we try all of
> them, one after the other (in a node-wise fashion, thanks to the outer
> loop).

Oh, right -- sorry, missed that. :-)

>> I like the idea in general, but I'm not a fan of the current way of
>> doing the accounting -- it seems like too many special cases.  Let me
>> have a think about this and come back to it.
>>
> Ok. I'm not sure I see what you mean with 'accounting' in this context,
> but, yeah, go ahead and let me know. :-)

Well I don't like having the increase_* and decrease_* all over the
place, *almost* corresponding with __runq_insert() and __runq_remove(),
but not quite.  It seems like trying to sort out most of the refcounting
inside fo those two functions (perhaps with runq_insert() which did
reference counting, and __runq_insert() that didn't, or something  like
that) would be a better approach.

If you're going to re-send the series, maybe it would be best if you
sent this patch last, so we can commit the rest?

 -George

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
https://lists.xen.org/xen-devel

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH v2 5/7] xen: credit1: increase efficiency and scalability of load balancing.
  2017-04-07 15:17       ` George Dunlap
@ 2017-04-07 15:26         ` Dario Faggioli
  2017-04-07 15:49           ` Dario Faggioli
  0 siblings, 1 reply; 24+ messages in thread
From: Dario Faggioli @ 2017-04-07 15:26 UTC (permalink / raw)
  To: George Dunlap, xen-devel; +Cc: George Dunlap, Andrew Cooper, Anshul Makkar


[-- Attachment #1.1: Type: text/plain, Size: 1500 bytes --]

On Fri, 2017-04-07 at 16:17 +0100, George Dunlap wrote:
> On 07/04/17 15:49, Dario Faggioli wrote:
> > Ok. I'm not sure I see what you mean with 'accounting' in this
> > context,
> > but, yeah, go ahead and let me know. :-)
> 
> Well I don't like having the increase_* and decrease_* all over the
> place, *almost* corresponding with __runq_insert() and
> __runq_remove(),
> but not quite.  
>
Ah, I see what you mean now.

Personally, I don't dislike it, but most important, I'm not sure I can
do much better. :-/

> It seems like trying to sort out most of the refcounting
> inside fo those two functions (perhaps with runq_insert() which did
> reference counting, and __runq_insert() that didn't, or
> something  like
> that) would be a better approach.
>
Right. I've tried already, but without success, and I had to stop an
resort to what's in this patch.

As I said, I am ok with this approach, so I just went for it. I can try
to think harder at whether it is really possible to do something like
you suggest above... lemme try.

> If you're going to re-send the series, maybe it would be best if you
> sent this patch last, so we can commit the rest?
> 
Yep, will do.

Thanks and Regards,
Dario
-- 
<<This happens because I choose it to happen!>> (Raistlin Majere)
-----------------------------------------------------------------
Dario Faggioli, Ph.D, http://about.me/dario.faggioli
Senior Software Engineer, Citrix Systems R&D Ltd., Cambridge (UK)

[-- Attachment #1.2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 819 bytes --]

[-- Attachment #2: Type: text/plain, Size: 127 bytes --]

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
https://lists.xen.org/xen-devel

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH v2 5/7] xen: credit1: increase efficiency and scalability of load balancing.
  2017-04-07 15:26         ` Dario Faggioli
@ 2017-04-07 15:49           ` Dario Faggioli
  2017-04-07 16:25             ` Dario Faggioli
  0 siblings, 1 reply; 24+ messages in thread
From: Dario Faggioli @ 2017-04-07 15:49 UTC (permalink / raw)
  To: George Dunlap, xen-devel; +Cc: George Dunlap, Andrew Cooper, Anshul Makkar


[-- Attachment #1.1: Type: text/plain, Size: 9505 bytes --]

On Fri, 2017-04-07 at 17:26 +0200, Dario Faggioli wrote:
> On Fri, 2017-04-07 at 16:17 +0100, George Dunlap wrote:
> > It seems like trying to sort out most of the refcounting
> > inside fo those two functions (perhaps with runq_insert() which did
> > reference counting, and __runq_insert() that didn't, or
> > something  like
> > that) would be a better approach.
> > 
> Right. I've tried already, but without success, and I had to stop an
> resort to what's in this patch.
> 
> As I said, I am ok with this approach, so I just went for it. I can
> try
> to think harder at whether it is really possible to do something like
> you suggest above... lemme try.
> 
So, how about something like the below?

For almost all cases, all it's done inside runqueue remove and insert
helpers.

There still are two special cases, though, one where I must explicitly
call inc_nr_runnable(), the other where I need dec_nr_runnable().

I've added comments trying to explain why that is the case.

Let me know.

Thanks and Regards,
Dario
---
diff --git a/xen/common/sched_credit.c b/xen/common/sched_credit.c
index 59b87f7..a0ad167 100644
--- a/xen/common/sched_credit.c
+++ b/xen/common/sched_credit.c
@@ -172,6 +172,7 @@ struct csched_pcpu {
     struct timer ticker;
     unsigned int tick;
     unsigned int idle_bias;
+    unsigned int nr_runnable;
 };
 
 /*
@@ -262,9 +263,26 @@ static inline bool_t is_runq_idle(unsigned int cpu)
 }
 
 static inline void
+inc_nr_runnable(unsigned int cpu)
+{
+    ASSERT(spin_is_locked(per_cpu(schedule_data, cpu).schedule_lock));
+    CSCHED_PCPU(cpu)->nr_runnable++;
+
+}
+
+static inline void
+dec_nr_runnable(unsigned int cpu)
+{
+    ASSERT(spin_is_locked(per_cpu(schedule_data, cpu).schedule_lock));
+    CSCHED_PCPU(cpu)->nr_runnable--;
+    ASSERT(CSCHED_PCPU(cpu)->nr_runnable >= 0);
+}
+
+static inline void
 __runq_insert(struct csched_vcpu *svc)
 {
-    const struct list_head * const runq = RUNQ(svc->vcpu->processor);
+    unsigned int cpu = svc->vcpu->processor;
+    const struct list_head * const runq = RUNQ(cpu);
     struct list_head *iter;
 
     BUG_ON( __vcpu_on_runq(svc) );
@@ -292,12 +310,25 @@ __runq_insert(struct csched_vcpu *svc)
 }
 
 static inline void
+runq_insert(struct csched_vcpu *svc)
+{
+    __runq_insert(svc);
+    inc_nr_runnable(svc->vcpu->processor);
+}
+
+static inline void
 __runq_remove(struct csched_vcpu *svc)
 {
     BUG_ON( !__vcpu_on_runq(svc) );
     list_del_init(&svc->runq_elem);
 }
 
+static inline void
+runq_remove(struct csched_vcpu *svc)
+{
+    dec_nr_runnable(svc->vcpu->processor);
+    __runq_remove(svc);
+}
 
 #define for_each_csched_balance_step(step) \
     for ( (step) = 0; (step) <= CSCHED_BALANCE_HARD_AFFINITY; (step)++ )
@@ -601,6 +632,7 @@ init_pdata(struct csched_private *prv, struct csched_pcpu *spc, int cpu)
     /* Start off idling... */
     BUG_ON(!is_idle_vcpu(curr_on_cpu(cpu)));
     cpumask_set_cpu(cpu, prv->idlers);
+    spc->nr_runnable = 0;
 }
 
 static void
@@ -1052,7 +1084,7 @@ csched_vcpu_insert(const struct scheduler *ops, struct vcpu *vc)
     lock = vcpu_schedule_lock_irq(vc);
 
     if ( !__vcpu_on_runq(svc) && vcpu_runnable(vc) && !vc->is_running )
-        __runq_insert(svc);
+        runq_insert(svc);
 
     vcpu_schedule_unlock_irq(lock, vc);
 
@@ -1117,7 +1149,7 @@ csched_vcpu_sleep(const struct scheduler *ops, struct vcpu *vc)
         cpu_raise_softirq(cpu, SCHEDULE_SOFTIRQ);
     }
     else if ( __vcpu_on_runq(svc) )
-        __runq_remove(svc);
+        runq_remove(svc);
 }
 
 static void
@@ -1177,7 +1209,7 @@ csched_vcpu_wake(const struct scheduler *ops, struct vcpu *vc)
     }
 
     /* Put the VCPU on the runq and tickle CPUs */
-    __runq_insert(svc);
+    runq_insert(svc);
     __runq_tickle(svc);
 }
 
@@ -1679,8 +1711,14 @@ csched_runq_steal(int peer_cpu, int cpu, int pri, int balance_step)
             SCHED_VCPU_STAT_CRANK(speer, migrate_q);
             SCHED_STAT_CRANK(migrate_queued);
             WARN_ON(vc->is_urgent);
-            __runq_remove(speer);
+            runq_remove(speer);
             vc->processor = cpu;
+            /*
+             * speer will start executing directly on cpu, without having to
+             * go through runq_insert(). So we must update the runnable count
+             * for cpu here.
+             */
+            inc_nr_runnable(cpu);
             return speer;
         }
     }
@@ -1736,7 +1774,7 @@ csched_load_balance(struct csched_private *prv, int cpu,
         peer_node = node;
         do
         {
-            /* Find out what the !idle are in this node */
+            /* Select the pCPUs in this node that have work we can steal. */
             cpumask_andnot(&workers, online, prv->idlers);
             cpumask_and(&workers, &workers, &node_to_cpumask(peer_node));
             __cpumask_clear_cpu(cpu, &workers);
@@ -1746,6 +1784,40 @@ csched_load_balance(struct csched_private *prv, int cpu,
                 goto next_node;
             do
             {
+                spinlock_t *lock;
+
+                /*
+                 * If there is only one runnable vCPU on peer_cpu, it means
+                 * there's no one to be stolen in its runqueue, so skip it.
+                 *
+                 * Checking this without holding the lock is racy... But that's
+                 * the whole point of this optimization!
+                 *
+                 * In more details:
+                 * - if we race with dec_nr_runnable(), we may try to take the
+                 *   lock and call csched_runq_steal() for no reason. This is
+                 *   not a functional issue, and should be infrequent enough.
+                 *   And we can avoid that by re-checking nr_runnable after
+                 *   having grabbed the lock, if we want;
+                 * - if we race with inc_nr_runnable(), we skip a pCPU that may
+                 *   have runnable vCPUs in its runqueue, but that's not a
+                 *   problem because:
+                 *   + if racing with csched_vcpu_insert() or csched_vcpu_wake(),
+                 *     __runq_tickle() will be called afterwords, so the vCPU
+                 *     won't get stuck in the runqueue for too long;
+                 *   + if racing with csched_runq_steal(), it may be that a
+                 *     vCPU that we could have picked up, stays in a runqueue
+                 *     until someone else tries to steal it again. But this is
+                 *     no worse than what can happen already (without this
+                 *     optimization), it the pCPU would schedule right after we
+                 *     have taken the lock, and hence block on it.
+                 */
+                if ( CSCHED_PCPU(peer_cpu)->nr_runnable <= 1 )
+                {
+                    TRACE_2D(TRC_CSCHED_STEAL_CHECK, peer_cpu, /* skipp'n */ 0);
+                    goto next_cpu;
+                }
+
                 /*
                  * Get ahold of the scheduler lock for this peer CPU.
                  *
@@ -1753,14 +1825,13 @@ csched_load_balance(struct csched_private *prv, int cpu,
                  * could cause a deadlock if the peer CPU is also load
                  * balancing and trying to lock this CPU.
                  */
-                spinlock_t *lock = pcpu_schedule_trylock(peer_cpu);
+                lock = pcpu_schedule_trylock(peer_cpu);
                 SCHED_STAT_CRANK(steal_trylock);
                 if ( !lock )
                 {
                     SCHED_STAT_CRANK(steal_trylock_failed);
                     TRACE_2D(TRC_CSCHED_STEAL_CHECK, peer_cpu, /* skip */ 0);
-                    peer_cpu = cpumask_cycle(peer_cpu, &workers);
-                    continue;
+                    goto next_cpu;
                 }
 
                 TRACE_2D(TRC_CSCHED_STEAL_CHECK, peer_cpu, /* checked */ 1);
@@ -1777,6 +1848,7 @@ csched_load_balance(struct csched_private *prv, int cpu,
                     return speer;
                 }
 
+ next_cpu:
                 peer_cpu = cpumask_cycle(peer_cpu, &workers);
 
             } while( peer_cpu != cpumask_first(&workers) );
@@ -1907,7 +1979,11 @@ csched_schedule(
     if ( vcpu_runnable(current) )
         __runq_insert(scurr);
     else
+    {
         BUG_ON( is_idle_vcpu(current) || list_empty(runq) );
+        /* Current has blocked. Update the runnable counter for this cpu. */
+        dec_nr_runnable(cpu);
+    }
 
     snext = __runq_elem(runq->next);
     ret.migrated = 0;
@@ -2024,7 +2100,8 @@ csched_dump_pcpu(const struct scheduler *ops, int cpu)
     runq = &spc->runq;
 
     cpumask_scnprintf(cpustr, sizeof(cpustr), per_cpu(cpu_sibling_mask, cpu));
-    printk("CPU[%02d] sort=%d, sibling=%s, ", cpu, spc->runq_sort_last, cpustr);
+    printk("CPU[%02d] nr_run=%d, sort=%d, sibling=%s, ",
+           cpu, spc->nr_runnable, spc->runq_sort_last, cpustr);
     cpumask_scnprintf(cpustr, sizeof(cpustr), per_cpu(cpu_core_mask, cpu));
     printk("core=%s\n", cpustr);
 
-- 
<<This happens because I choose it to happen!>> (Raistlin Majere)
-----------------------------------------------------------------
Dario Faggioli, Ph.D, http://about.me/dario.faggioli
Senior Software Engineer, Citrix Systems R&D Ltd., Cambridge (UK)

[-- Attachment #1.2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 819 bytes --]

[-- Attachment #2: Type: text/plain, Size: 127 bytes --]

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
https://lists.xen.org/xen-devel

^ permalink raw reply related	[flat|nested] 24+ messages in thread

* Re: [PATCH v2 5/7] xen: credit1: increase efficiency and scalability of load balancing.
  2017-04-07 15:49           ` Dario Faggioli
@ 2017-04-07 16:25             ` Dario Faggioli
  0 siblings, 0 replies; 24+ messages in thread
From: Dario Faggioli @ 2017-04-07 16:25 UTC (permalink / raw)
  To: George Dunlap, xen-devel; +Cc: George Dunlap, Andrew Cooper, Anshul Makkar


[-- Attachment #1.1: Type: text/plain, Size: 11986 bytes --]

On Fri, 2017-04-07 at 17:49 +0200, Dario Faggioli wrote:
> On Fri, 2017-04-07 at 17:26 +0200, Dario Faggioli wrote:
> > On Fri, 2017-04-07 at 16:17 +0100, George Dunlap wrote:
> > > It seems like trying to sort out most of the refcounting
> > > inside fo those two functions (perhaps with runq_insert() which
> > > did
> > > reference counting, and __runq_insert() that didn't, or
> > > something  like
> > > that) would be a better approach.
> > > 
> > 
> > Right. I've tried already, but without success, and I had to stop
> > an
> > resort to what's in this patch.
> > 
> > As I said, I am ok with this approach, so I just went for it. I can
> > try
> > to think harder at whether it is really possible to do something
> > like
> > you suggest above... lemme try.
> > 
> 
> So, how about something like the below?
> 
And, given the tight timing, I'm sending out v3, with the patch
proposed here inside it.

As said on IRC, about reordering the series, putting "treat pCPUs more
evenly during balancing" before "increase efficiency and scalability of
load balancing" does not make much sense, and is not worth the effort.

I think the modifications I've done go in the direction you were
pointing at... if you like the result, then jolly good. If not, and we
don't find an alternative in time, I'm fine leaving out both the last
two patches of the series (I mean, the last two patches in v3 that I'm
about to send out).

Thanks and Regards,
Dario
> ---
> diff --git a/xen/common/sched_credit.c b/xen/common/sched_credit.c
> index 59b87f7..a0ad167 100644
> --- a/xen/common/sched_credit.c
> +++ b/xen/common/sched_credit.c
> @@ -172,6 +172,7 @@ struct csched_pcpu {
>      struct timer ticker;
>      unsigned int tick;
>      unsigned int idle_bias;
> +    unsigned int nr_runnable;
>  };
>  
>  /*
> @@ -262,9 +263,26 @@ static inline bool_t is_runq_idle(unsigned int
> cpu)
>  }
>  
>  static inline void
> +inc_nr_runnable(unsigned int cpu)
> +{
> +    ASSERT(spin_is_locked(per_cpu(schedule_data,
> cpu).schedule_lock));
> +    CSCHED_PCPU(cpu)->nr_runnable++;
> +
> +}
> +
> +static inline void
> +dec_nr_runnable(unsigned int cpu)
> +{
> +    ASSERT(spin_is_locked(per_cpu(schedule_data,
> cpu).schedule_lock));
> +    CSCHED_PCPU(cpu)->nr_runnable--;
> +    ASSERT(CSCHED_PCPU(cpu)->nr_runnable >= 0);
> +}
> +
> +static inline void
>  __runq_insert(struct csched_vcpu *svc)
>  {
> -    const struct list_head * const runq = RUNQ(svc->vcpu-
> >processor);
> +    unsigned int cpu = svc->vcpu->processor;
> +    const struct list_head * const runq = RUNQ(cpu);
>      struct list_head *iter;
>  
>      BUG_ON( __vcpu_on_runq(svc) );
> @@ -292,12 +310,25 @@ __runq_insert(struct csched_vcpu *svc)
>  }
>  
>  static inline void
> +runq_insert(struct csched_vcpu *svc)
> +{
> +    __runq_insert(svc);
> +    inc_nr_runnable(svc->vcpu->processor);
> +}
> +
> +static inline void
>  __runq_remove(struct csched_vcpu *svc)
>  {
>      BUG_ON( !__vcpu_on_runq(svc) );
>      list_del_init(&svc->runq_elem);
>  }
>  
> +static inline void
> +runq_remove(struct csched_vcpu *svc)
> +{
> +    dec_nr_runnable(svc->vcpu->processor);
> +    __runq_remove(svc);
> +}
>  
>  #define for_each_csched_balance_step(step) \
>      for ( (step) = 0; (step) <= CSCHED_BALANCE_HARD_AFFINITY;
> (step)++ )
> @@ -601,6 +632,7 @@ init_pdata(struct csched_private *prv, struct
> csched_pcpu *spc, int cpu)
>      /* Start off idling... */
>      BUG_ON(!is_idle_vcpu(curr_on_cpu(cpu)));
>      cpumask_set_cpu(cpu, prv->idlers);
> +    spc->nr_runnable = 0;
>  }
>  
>  static void
> @@ -1052,7 +1084,7 @@ csched_vcpu_insert(const struct scheduler *ops,
> struct vcpu *vc)
>      lock = vcpu_schedule_lock_irq(vc);
>  
>      if ( !__vcpu_on_runq(svc) && vcpu_runnable(vc) && !vc-
> >is_running )
> -        __runq_insert(svc);
> +        runq_insert(svc);
>  
>      vcpu_schedule_unlock_irq(lock, vc);
>  
> @@ -1117,7 +1149,7 @@ csched_vcpu_sleep(const struct scheduler *ops,
> struct vcpu *vc)
>          cpu_raise_softirq(cpu, SCHEDULE_SOFTIRQ);
>      }
>      else if ( __vcpu_on_runq(svc) )
> -        __runq_remove(svc);
> +        runq_remove(svc);
>  }
>  
>  static void
> @@ -1177,7 +1209,7 @@ csched_vcpu_wake(const struct scheduler *ops,
> struct vcpu *vc)
>      }
>  
>      /* Put the VCPU on the runq and tickle CPUs */
> -    __runq_insert(svc);
> +    runq_insert(svc);
>      __runq_tickle(svc);
>  }
>  
> @@ -1679,8 +1711,14 @@ csched_runq_steal(int peer_cpu, int cpu, int
> pri, int balance_step)
>              SCHED_VCPU_STAT_CRANK(speer, migrate_q);
>              SCHED_STAT_CRANK(migrate_queued);
>              WARN_ON(vc->is_urgent);
> -            __runq_remove(speer);
> +            runq_remove(speer);
>              vc->processor = cpu;
> +            /*
> +             * speer will start executing directly on cpu, without
> having to
> +             * go through runq_insert(). So we must update the
> runnable count
> +             * for cpu here.
> +             */
> +            inc_nr_runnable(cpu);
>              return speer;
>          }
>      }
> @@ -1736,7 +1774,7 @@ csched_load_balance(struct csched_private *prv,
> int cpu,
>          peer_node = node;
>          do
>          {
> -            /* Find out what the !idle are in this node */
> +            /* Select the pCPUs in this node that have work we can
> steal. */
>              cpumask_andnot(&workers, online, prv->idlers);
>              cpumask_and(&workers, &workers,
> &node_to_cpumask(peer_node));
>              __cpumask_clear_cpu(cpu, &workers);
> @@ -1746,6 +1784,40 @@ csched_load_balance(struct csched_private
> *prv, int cpu,
>                  goto next_node;
>              do
>              {
> +                spinlock_t *lock;
> +
> +                /*
> +                 * If there is only one runnable vCPU on peer_cpu,
> it means
> +                 * there's no one to be stolen in its runqueue, so
> skip it.
> +                 *
> +                 * Checking this without holding the lock is racy...
> But that's
> +                 * the whole point of this optimization!
> +                 *
> +                 * In more details:
> +                 * - if we race with dec_nr_runnable(), we may try
> to take the
> +                 *   lock and call csched_runq_steal() for no
> reason. This is
> +                 *   not a functional issue, and should be
> infrequent enough.
> +                 *   And we can avoid that by re-checking
> nr_runnable after
> +                 *   having grabbed the lock, if we want;
> +                 * - if we race with inc_nr_runnable(), we skip a
> pCPU that may
> +                 *   have runnable vCPUs in its runqueue, but that's
> not a
> +                 *   problem because:
> +                 *   + if racing with csched_vcpu_insert() or
> csched_vcpu_wake(),
> +                 *     __runq_tickle() will be called afterwords, so
> the vCPU
> +                 *     won't get stuck in the runqueue for too long;
> +                 *   + if racing with csched_runq_steal(), it may be
> that a
> +                 *     vCPU that we could have picked up, stays in a
> runqueue
> +                 *     until someone else tries to steal it again.
> But this is
> +                 *     no worse than what can happen already
> (without this
> +                 *     optimization), it the pCPU would schedule
> right after we
> +                 *     have taken the lock, and hence block on it.
> +                 */
> +                if ( CSCHED_PCPU(peer_cpu)->nr_runnable <= 1 )
> +                {
> +                    TRACE_2D(TRC_CSCHED_STEAL_CHECK, peer_cpu, /*
> skipp'n */ 0);
> +                    goto next_cpu;
> +                }
> +
>                  /*
>                   * Get ahold of the scheduler lock for this peer
> CPU.
>                   *
> @@ -1753,14 +1825,13 @@ csched_load_balance(struct csched_private
> *prv, int cpu,
>                   * could cause a deadlock if the peer CPU is also
> load
>                   * balancing and trying to lock this CPU.
>                   */
> -                spinlock_t *lock = pcpu_schedule_trylock(peer_cpu);
> +                lock = pcpu_schedule_trylock(peer_cpu);
>                  SCHED_STAT_CRANK(steal_trylock);
>                  if ( !lock )
>                  {
>                      SCHED_STAT_CRANK(steal_trylock_failed);
>                      TRACE_2D(TRC_CSCHED_STEAL_CHECK, peer_cpu, /*
> skip */ 0);
> -                    peer_cpu = cpumask_cycle(peer_cpu, &workers);
> -                    continue;
> +                    goto next_cpu;
>                  }
>  
>                  TRACE_2D(TRC_CSCHED_STEAL_CHECK, peer_cpu, /*
> checked */ 1);
> @@ -1777,6 +1848,7 @@ csched_load_balance(struct csched_private *prv,
> int cpu,
>                      return speer;
>                  }
>  
> + next_cpu:
>                  peer_cpu = cpumask_cycle(peer_cpu, &workers);
>  
>              } while( peer_cpu != cpumask_first(&workers) );
> @@ -1907,7 +1979,11 @@ csched_schedule(
>      if ( vcpu_runnable(current) )
>          __runq_insert(scurr);
>      else
> +    {
>          BUG_ON( is_idle_vcpu(current) || list_empty(runq) );
> +        /* Current has blocked. Update the runnable counter for this
> cpu. */
> +        dec_nr_runnable(cpu);
> +    }
>  
>      snext = __runq_elem(runq->next);
>      ret.migrated = 0;
> @@ -2024,7 +2100,8 @@ csched_dump_pcpu(const struct scheduler *ops,
> int cpu)
>      runq = &spc->runq;
>  
>      cpumask_scnprintf(cpustr, sizeof(cpustr),
> per_cpu(cpu_sibling_mask, cpu));
> -    printk("CPU[%02d] sort=%d, sibling=%s, ", cpu, spc-
> >runq_sort_last, cpustr);
> +    printk("CPU[%02d] nr_run=%d, sort=%d, sibling=%s, ",
> +           cpu, spc->nr_runnable, spc->runq_sort_last, cpustr);
>      cpumask_scnprintf(cpustr, sizeof(cpustr), per_cpu(cpu_core_mask,
> cpu));
>      printk("core=%s\n", cpustr);
>  
> _______________________________________________
> Xen-devel mailing list
> Xen-devel@lists.xen.org
> https://lists.xen.org/xen-devel
-- 
<<This happens because I choose it to happen!>> (Raistlin Majere)
-----------------------------------------------------------------
Dario Faggioli, Ph.D, http://about.me/dario.faggioli
Senior Software Engineer, Citrix Systems R&D Ltd., Cambridge (UK)

[-- Attachment #1.2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 819 bytes --]

[-- Attachment #2: Type: text/plain, Size: 127 bytes --]

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
https://lists.xen.org/xen-devel

^ permalink raw reply	[flat|nested] 24+ messages in thread

* [PATCH v2 6/7] xen: credit1: treat pCPUs more evenly during balancing.
  2017-04-06  8:16 [PATCH v2 0/7] xen: sched: improve scalability of Credit1, and optimize a bit both Credit1 and Credit2 Dario Faggioli
                   ` (4 preceding siblings ...)
  2017-04-06  8:16 ` [PATCH v2 5/7] xen: credit1: increase efficiency and scalability of load balancing Dario Faggioli
@ 2017-04-06  8:16 ` Dario Faggioli
  2017-04-07 14:44   ` George Dunlap
  2017-04-06  8:17 ` [PATCH v2 7/7] xen: credit2: avoid cpumask_any() in pick_cpu() Dario Faggioli
  6 siblings, 1 reply; 24+ messages in thread
From: Dario Faggioli @ 2017-04-06  8:16 UTC (permalink / raw)
  To: xen-devel; +Cc: George Dunlap, Anshul Makkar

Right now, we use cpumask_first() for going through
the bus pCPUs in csched_load_balance(). This means
not all pCPUs have equal chances of seeing their
pending work stolen. It also means there is more
runqueue lock pressure on lower ID pCPUs.

To avoid all this, let's record and remember, for
each NUMA node, from what pCPU we have stolen for
last, and start from that the following time.

Signed-off-by: Dario Faggioli <dario.faggioli@citrix.com>
---
Cc: George Dunlap <george.dunlap@eu.citrix.com>
Cc: Anshul Makkar <anshul.makkar@citrix.com>
---
 xen/common/sched_credit.c |   37 +++++++++++++++++++++++++++++++++----
 1 file changed, 33 insertions(+), 4 deletions(-)

diff --git a/xen/common/sched_credit.c b/xen/common/sched_credit.c
index 09e3192..c37a65c 100644
--- a/xen/common/sched_credit.c
+++ b/xen/common/sched_credit.c
@@ -229,6 +229,7 @@ struct csched_private {
     uint32_t credit;
     int credit_balance;
     uint32_t runq_sort;
+    uint32_t *balance_bias;
     unsigned ratelimit_us;
     /* Period of master and tick in milliseconds */
     unsigned tslice_ms, tick_period_us, ticks_per_tslice;
@@ -547,6 +548,7 @@ csched_deinit_pdata(const struct scheduler *ops, void *pcpu, int cpu)
 {
     struct csched_private *prv = CSCHED_PRIV(ops);
     struct csched_pcpu *spc = pcpu;
+    unsigned int node = cpu_to_node(cpu);
     unsigned long flags;
 
     /*
@@ -570,6 +572,12 @@ csched_deinit_pdata(const struct scheduler *ops, void *pcpu, int cpu)
         prv->master = cpumask_first(prv->cpus);
         migrate_timer(&prv->master_ticker, prv->master);
     }
+    if ( prv->balance_bias[node] == cpu )
+    {
+        cpumask_and(cpumask_scratch, prv->cpus, &node_to_cpumask(node));
+        if ( !cpumask_empty(cpumask_scratch) )
+            prv->balance_bias[node] =  cpumask_first(cpumask_scratch);
+    }
     kill_timer(&spc->ticker);
     if ( prv->ncpus == 0 )
         kill_timer(&prv->master_ticker);
@@ -609,6 +617,10 @@ init_pdata(struct csched_private *prv, struct csched_pcpu *spc, int cpu)
                   NOW() + MILLISECS(prv->tslice_ms));
     }
 
+    cpumask_and(cpumask_scratch, prv->cpus, &node_to_cpumask(cpu_to_node(cpu)));
+    if ( cpumask_weight(cpumask_scratch) == 1 )
+        prv->balance_bias[cpu_to_node(cpu)] = cpu;
+
     init_timer(&spc->ticker, csched_tick, (void *)(unsigned long)cpu, cpu);
     set_timer(&spc->ticker, NOW() + MICROSECS(prv->tick_period_us) );
 
@@ -1710,7 +1722,7 @@ csched_load_balance(struct csched_private *prv, int cpu,
     struct csched_vcpu *speer;
     cpumask_t workers;
     cpumask_t *online;
-    int peer_cpu, peer_node, bstep;
+    int peer_cpu, first_cpu, peer_node, bstep;
     int node = cpu_to_node(cpu);
 
     BUG_ON( cpu != snext->vcpu->processor );
@@ -1754,9 +1766,10 @@ csched_load_balance(struct csched_private *prv, int cpu,
             cpumask_and(&workers, &workers, &node_to_cpumask(peer_node));
             __cpumask_clear_cpu(cpu, &workers);
 
-            peer_cpu = cpumask_first(&workers);
-            if ( peer_cpu >= nr_cpu_ids )
+            first_cpu = cpumask_cycle(prv->balance_bias[peer_node], &workers);
+            if ( first_cpu >= nr_cpu_ids )
                 goto next_node;
+            peer_cpu = first_cpu;
             do
             {
                 spinlock_t *lock;
@@ -1820,13 +1833,19 @@ csched_load_balance(struct csched_private *prv, int cpu,
                 if ( speer != NULL )
                 {
                     *stolen = 1;
+                    /*
+                     * Next time we'll look for work to steal on this node, we
+                     * will start from the next pCPU, with respect to this one,
+                     * so we don't risk stealing always from the same ones.
+                     */
+                    prv->balance_bias[peer_node] = peer_cpu;
                     return speer;
                 }
 
  next_cpu:
                 peer_cpu = cpumask_cycle(peer_cpu, &workers);
 
-            } while( peer_cpu != cpumask_first(&workers) );
+            } while( peer_cpu != first_cpu );
 
  next_node:
             peer_node = cycle_node(peer_node, node_online_map);
@@ -2178,10 +2197,19 @@ csched_init(struct scheduler *ops)
     prv = xzalloc(struct csched_private);
     if ( prv == NULL )
         return -ENOMEM;
+
+    prv->balance_bias = xzalloc_array(uint32_t, MAX_NUMNODES);
+    if ( prv->balance_bias == NULL )
+    {
+        xfree(prv);
+        return -ENOMEM;
+    }
+
     if ( !zalloc_cpumask_var(&prv->cpus) ||
          !zalloc_cpumask_var(&prv->idlers) )
     {
         free_cpumask_var(prv->cpus);
+        xfree(prv->balance_bias);
         xfree(prv);
         return -ENOMEM;
     }
@@ -2227,6 +2255,7 @@ csched_deinit(struct scheduler *ops)
         ops->sched_data = NULL;
         free_cpumask_var(prv->cpus);
         free_cpumask_var(prv->idlers);
+        xfree(prv->balance_bias);
         xfree(prv);
     }
 }


_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
https://lists.xen.org/xen-devel

^ permalink raw reply related	[flat|nested] 24+ messages in thread

* Re: [PATCH v2 6/7] xen: credit1: treat pCPUs more evenly during balancing.
  2017-04-06  8:16 ` [PATCH v2 6/7] xen: credit1: treat pCPUs more evenly during balancing Dario Faggioli
@ 2017-04-07 14:44   ` George Dunlap
  0 siblings, 0 replies; 24+ messages in thread
From: George Dunlap @ 2017-04-07 14:44 UTC (permalink / raw)
  To: Dario Faggioli, xen-devel; +Cc: George Dunlap, Anshul Makkar

On 06/04/17 09:16, Dario Faggioli wrote:
> Right now, we use cpumask_first() for going through
> the bus pCPUs in csched_load_balance(). This means
> not all pCPUs have equal chances of seeing their
> pending work stolen. It also means there is more
> runqueue lock pressure on lower ID pCPUs.
> 
> To avoid all this, let's record and remember, for
> each NUMA node, from what pCPU we have stolen for
> last, and start from that the following time.
> 
> Signed-off-by: Dario Faggioli <dario.faggioli@citrix.com>

Acked-by: George Dunlap <george.dunlap@citrix.com>


_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
https://lists.xen.org/xen-devel

^ permalink raw reply	[flat|nested] 24+ messages in thread

* [PATCH v2 7/7] xen: credit2: avoid cpumask_any() in pick_cpu().
  2017-04-06  8:16 [PATCH v2 0/7] xen: sched: improve scalability of Credit1, and optimize a bit both Credit1 and Credit2 Dario Faggioli
                   ` (5 preceding siblings ...)
  2017-04-06  8:16 ` [PATCH v2 6/7] xen: credit1: treat pCPUs more evenly during balancing Dario Faggioli
@ 2017-04-06  8:17 ` Dario Faggioli
  2017-04-07 14:48   ` George Dunlap
  6 siblings, 1 reply; 24+ messages in thread
From: Dario Faggioli @ 2017-04-06  8:17 UTC (permalink / raw)
  To: xen-devel; +Cc: Anshul Makkar, George Dunlap

cpumask_any() is costly (because of the randomization).
And since it does not really matter which exact CPU is
selected within a runqueue, as that will be overridden
shortly after, in runq_tickle(), spending too much time
and achieving true randomization is pretty pointless.

As the picked CPU, however, would be used as an hint,
within runq_tickle(), don't give up on it entirely,
and let's make sure we don't always return the same
CPU, or favour lower or higher ID CPUs.

To achieve that, let's record and remember, for each
runqueue, what CPU we picked for last, and start from
that the following time.

Signed-off-by: Dario Faggioli <dario.faggioli@citrix.com>
---
Cc: George Dunlap <george.dunlap@citrix.com>
Cc: Anshul Makkar <anshul.makkar@citrix.com>
---
 xen/common/sched_credit2.c |   22 ++++++++++++++++++----
 1 file changed, 18 insertions(+), 4 deletions(-)

diff --git a/xen/common/sched_credit2.c b/xen/common/sched_credit2.c
index bb1c657..a76bedb 100644
--- a/xen/common/sched_credit2.c
+++ b/xen/common/sched_credit2.c
@@ -363,6 +363,7 @@ struct csched2_runqueue_data {
     struct list_head runq; /* Ordered list of runnable vms */
     struct list_head svc;  /* List of all vcpus assigned to this runqueue */
     unsigned int max_weight;
+    unsigned int pick_bias;/* Last CPU we picked. Start from it next time */
 
     cpumask_t idle,        /* Currently idle pcpus */
         smt_idle,          /* Fully idle-and-untickled cores (see below) */
@@ -1679,7 +1680,9 @@ csched2_cpu_pick(const struct scheduler *ops, struct vcpu *vc)
         {
             cpumask_and(cpumask_scratch_cpu(cpu), cpumask_scratch_cpu(cpu),
                         &svc->migrate_rqd->active);
-            new_cpu = cpumask_any(cpumask_scratch_cpu(cpu));
+            new_cpu = cpumask_cycle(svc->migrate_rqd->pick_bias,
+                                    cpumask_scratch_cpu(cpu));
+            svc->migrate_rqd->pick_bias = new_cpu;
             goto out_up;
         }
         /* Fall-through to normal cpu pick */
@@ -1737,7 +1740,9 @@ csched2_cpu_pick(const struct scheduler *ops, struct vcpu *vc)
 
     cpumask_and(cpumask_scratch_cpu(cpu), cpumask_scratch_cpu(cpu),
                 &prv->rqd[min_rqi].active);
-    new_cpu = cpumask_any(cpumask_scratch_cpu(cpu));
+    new_cpu = cpumask_cycle(prv->rqd[min_rqi].pick_bias,
+                            cpumask_scratch_cpu(cpu));
+    prv->rqd[min_rqi].pick_bias = new_cpu;
     BUG_ON(new_cpu >= nr_cpu_ids);
 
  out_up:
@@ -1854,7 +1859,9 @@ static void migrate(const struct scheduler *ops,
                     cpupool_domain_cpumask(svc->vcpu->domain));
         cpumask_and(cpumask_scratch_cpu(cpu), cpumask_scratch_cpu(cpu),
                     &trqd->active);
-        svc->vcpu->processor = cpumask_any(cpumask_scratch_cpu(cpu));
+        svc->vcpu->processor = cpumask_cycle(trqd->pick_bias,
+                                             cpumask_scratch_cpu(cpu));
+        trqd->pick_bias = svc->vcpu->processor;
         ASSERT(svc->vcpu->processor < nr_cpu_ids);
 
         _runq_assign(svc, trqd);
@@ -2819,13 +2826,15 @@ csched2_dump(const struct scheduler *ops)
         printk("Runqueue %d:\n"
                "\tncpus              = %u\n"
                "\tcpus               = %s\n"
-               "\tmax_weight         = %d\n"
+               "\tmax_weight         = %u\n"
+               "\tpick_bias          = %u\n"
                "\tinstload           = %d\n"
                "\taveload            = %"PRI_stime" (~%"PRI_stime"%%)\n",
                i,
                cpumask_weight(&prv->rqd[i].active),
                cpustr,
                prv->rqd[i].max_weight,
+               prv->rqd[i].pick_bias,
                prv->rqd[i].load,
                prv->rqd[i].avgload,
                fraction);
@@ -2928,6 +2937,9 @@ init_pdata(struct csched2_private *prv, unsigned int cpu)
     __cpumask_set_cpu(cpu, &prv->initialized);
     __cpumask_set_cpu(cpu, &rqd->smt_idle);
 
+    if ( cpumask_weight(&rqd->active) == 1 )
+        rqd->pick_bias = cpu;
+
     return rqi;
 }
 
@@ -3040,6 +3052,8 @@ csched2_deinit_pdata(const struct scheduler *ops, void *pcpu, int cpu)
         printk(XENLOG_INFO " No cpus left on runqueue, disabling\n");
         deactivate_runqueue(prv, rqi);
     }
+    else if ( rqd->pick_bias == cpu )
+        rqd->pick_bias = cpumask_first(&rqd->active);
 
     spin_unlock(&rqd->lock);
 


_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
https://lists.xen.org/xen-devel

^ permalink raw reply related	[flat|nested] 24+ messages in thread

* Re: [PATCH v2 7/7] xen: credit2: avoid cpumask_any() in pick_cpu().
  2017-04-06  8:17 ` [PATCH v2 7/7] xen: credit2: avoid cpumask_any() in pick_cpu() Dario Faggioli
@ 2017-04-07 14:48   ` George Dunlap
  0 siblings, 0 replies; 24+ messages in thread
From: George Dunlap @ 2017-04-07 14:48 UTC (permalink / raw)
  To: Dario Faggioli, xen-devel; +Cc: Anshul Makkar

On 06/04/17 09:17, Dario Faggioli wrote:
> cpumask_any() is costly (because of the randomization).
> And since it does not really matter which exact CPU is
> selected within a runqueue, as that will be overridden
> shortly after, in runq_tickle(), spending too much time
> and achieving true randomization is pretty pointless.
> 
> As the picked CPU, however, would be used as an hint,
> within runq_tickle(), don't give up on it entirely,
> and let's make sure we don't always return the same
> CPU, or favour lower or higher ID CPUs.
> 
> To achieve that, let's record and remember, for each
> runqueue, what CPU we picked for last, and start from
> that the following time.
> 
> Signed-off-by: Dario Faggioli <dario.faggioli@citrix.com>

Acked-by: George Dunlap <george.dunlap@citrix.com>


_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
https://lists.xen.org/xen-devel

^ permalink raw reply	[flat|nested] 24+ messages in thread

end of thread, other threads:[~2017-04-07 16:26 UTC | newest]

Thread overview: 24+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2017-04-06  8:16 [PATCH v2 0/7] xen: sched: improve scalability of Credit1, and optimize a bit both Credit1 and Credit2 Dario Faggioli
2017-04-06  8:16 ` [PATCH v2 1/7] xen: credit1: simplify csched_runq_steal() a little bit Dario Faggioli
2017-04-07 10:45   ` George Dunlap
2017-04-07 11:00     ` Dario Faggioli
2017-04-06  8:16 ` [PATCH v2 2/7] xen: credit: (micro) optimize csched_runq_steal() Dario Faggioli
2017-04-07 10:49   ` George Dunlap
2017-04-07 11:01     ` Dario Faggioli
2017-04-06  8:16 ` [PATCH v2 3/7] xen: credit: consider tickled pCPUs as busy Dario Faggioli
2017-04-07 10:56   ` George Dunlap
2017-04-07 12:53     ` Dario Faggioli
2017-04-06  8:16 ` [PATCH v2 4/7] xen/tools: tracing: add record for credit1 runqueue stealing Dario Faggioli
2017-04-07 11:01   ` George Dunlap
2017-04-07 13:06     ` Dario Faggioli
2017-04-06  8:16 ` [PATCH v2 5/7] xen: credit1: increase efficiency and scalability of load balancing Dario Faggioli
2017-04-07 14:38   ` George Dunlap
2017-04-07 14:49     ` Dario Faggioli
2017-04-07 15:17       ` George Dunlap
2017-04-07 15:26         ` Dario Faggioli
2017-04-07 15:49           ` Dario Faggioli
2017-04-07 16:25             ` Dario Faggioli
2017-04-06  8:16 ` [PATCH v2 6/7] xen: credit1: treat pCPUs more evenly during balancing Dario Faggioli
2017-04-07 14:44   ` George Dunlap
2017-04-06  8:17 ` [PATCH v2 7/7] xen: credit2: avoid cpumask_any() in pick_cpu() Dario Faggioli
2017-04-07 14:48   ` George Dunlap

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.