xen-devel.lists.xenproject.org archive mirror
From: George Dunlap <george.dunlap@eu.citrix.com>
To: Dario Faggioli <dario.faggioli@citrix.com>
Cc: Marcus Granado <Marcus.Granado@eu.citrix.com>,
	Dan Magenheimer <dan.magenheimer@oracle.com>,
	Ian Campbell <Ian.Campbell@citrix.com>,
	Anil Madhavapeddy <anil@recoil.org>,
	Andrew Cooper <Andrew.Cooper3@citrix.com>,
	Juergen Gross <juergen.gross@ts.fujitsu.com>,
	Ian Jackson <Ian.Jackson@eu.citrix.com>,
	"xen-devel@lists.xen.org" <xen-devel@lists.xen.org>,
	Jan Beulich <JBeulich@suse.com>,
	Daniel De Graaf <dgdegra@tycho.nsa.gov>,
	Matt Wilson <msw@amazon.com>
Subject: Re: [PATCH 03 of 10 v2] xen: sched_credit: let the scheduler know about node-affinity
Date: Thu, 20 Dec 2012 15:56:39 +0000	[thread overview]
Message-ID: <50D33537.9050802@eu.citrix.com> (raw)
In-Reply-To: <06d2f322a6319d8ba212.1355944039@Solace>

On 19/12/12 19:07, Dario Faggioli wrote:
> As vcpu-affinity tells where VCPUs must run, node-affinity tells
> where they should, or rather prefer to, run. While respecting
> vcpu-affinity remains mandatory, node-affinity is not as strict: it
> only expresses a preference, although honouring it will almost always
> bring a significant performance benefit (especially as compared to
> not having any affinity at all).
>
> This change modifies the VCPU load balancing algorithm (for the
> credit scheduler only), introducing a two-step logic.
> During the first step, we use the node-affinity mask. The aim is
> to give precedence to the CPUs where it is known to be preferable
> for the domain to run. If that fails to find a valid PCPU, the
> node-affinity is just ignored and, in the second step, we fall
> back to using cpu-affinity only.
>
> Signed-off-by: Dario Faggioli <dario.faggioli@citrix.com>

This one has a lot of structural comments; so I'm going to send a couple 
of different mails as I'm going through it, so we can parallelize the 
discussion better. :-)

> ---
> Changes from v1:
>   * CPU masks variables moved off from the stack, as requested during
>     review. As per the comments in the code, having them in the private
>     (per-scheduler instance) struct could have been enough, but it would be
>     racy (again, see comments). For that reason, use a global bunch of
>     them (via per_cpu());
>   * George suggested a different load balancing logic during v1's review. I
>     think he was right, so I changed the old implementation in a way
>     that resembles exactly that. I rewrote most of this patch to introduce
>     a more sensible and effective node-affinity handling logic.
>
> diff --git a/xen/common/sched_credit.c b/xen/common/sched_credit.c
> --- a/xen/common/sched_credit.c
> +++ b/xen/common/sched_credit.c
> @@ -111,6 +111,33 @@
>
>
>   /*
> + * Node Balancing
> + */
> +#define CSCHED_BALANCE_CPU_AFFINITY     0
> +#define CSCHED_BALANCE_NODE_AFFINITY    1
> +#define CSCHED_BALANCE_LAST CSCHED_BALANCE_NODE_AFFINITY
[snip]
> +#define for_each_csched_balance_step(__step) \
> +    for ( (__step) = CSCHED_BALANCE_LAST; (__step) >= 0; (__step)-- )

Why are we starting at the top and going down?  Is there any good reason 
for it?

Every time you do anything unexpected, you add to the cognitive load of 
the person reading your code, leaving less spare processing power or 
memory for other bits of the code, and increasing (slightly) the chance 
of making a mistake.  The most natural thing would be for someone to 
expect that the steps start at 0 and go up; just reversing it means it's 
that little bit harder to understand.  When you name it "LAST", it's 
even worse, because that would definitely imply that this step is going 
to be executed last.

So why not just have this be as follows?

for(step=0; step<CSCHED_BALANCE_MAX; step++)

> +
> +/*
> + * Each csched-balance step has to use its own cpumask. This function
> + * determines which one, given the step, and copies it in mask. Notice
> + * that, in case of node-affinity balancing step, it also filters out from
> + * the node-affinity mask the cpus that are not part of vc's cpu-affinity,
> + * as we do not want to end up running a vcpu where it would like to
> + * run, but is not allowed to!
> + *
> + * As an optimization, if a domain does not have any node-affinity at all
> + * (namely, its node-affinity is automatically computed), not only will the
> + * computed mask reflect its vcpu-affinity, but we also return -1 to
> + * let the caller know that he can skip the step or quit the loop (if he
> + * wants).
> + */
> +static int
> +csched_balance_cpumask(const struct vcpu *vc, int step, cpumask_t *mask)
> +{
> +    if ( step == CSCHED_BALANCE_NODE_AFFINITY )
> +    {
> +        struct domain *d = vc->domain;
> +        struct csched_dom *sdom = CSCHED_DOM(d);
> +
> +        cpumask_and(mask, sdom->node_affinity_cpumask, vc->cpu_affinity);
> +
> +        if ( cpumask_full(sdom->node_affinity_cpumask) )
> +            return -1;

There's no optimization in having this comparison done here.  You're not 
reading something from a local variable that you've just calculated.  
But hiding this comparison inside this function, and disguising it as 
"returns -1", does increase the cognitive load on anybody trying to read 
and understand the code -- particularly, as how the return value is used 
is not really clear.

Also, when you use this value, effectively what you're doing is saying, 
"Actually, we just said we were doing the NODE_BALANCE step, but it 
turns out that the results of NODE_BALANCE and CPU_BALANCE will be the 
same, so we're just going to pretend that we've been doing the 
CPU_BALANCE step instead."  (See for example, "balance_step == 
CSCHED_BALANCE_NODE_AFFINITY && !ret" -- why the !ret in this clause?  
Because if !ret then we're not actually doing NODE_AFFINITY now, but 
CPU_AFFINITY.)  Another non-negligible chunk of cognitive load for 
someone reading the code to 1) figure out, and 2) keep in mind as she 
tries to analyze it.

I took a look at all the places which use this return value, and it 
seems like the best thing in each case would just be to have the 
*caller*, before getting into the loop, call 
cpumask_full(sdom->node_affinity_cpumask) and just skip the 
CSCHED_NODE_BALANCE step altogether if it's true.  (Example below.)

> @@ -266,67 +332,94 @@ static inline void
>       struct csched_vcpu * const cur = CSCHED_VCPU(curr_on_cpu(cpu));
>       struct csched_private *prv = CSCHED_PRIV(per_cpu(scheduler, cpu));
>       cpumask_t mask, idle_mask;
> -    int idlers_empty;
> +    int balance_step, idlers_empty;
>
>       ASSERT(cur);
> -    cpumask_clear(&mask);
> -
>       idlers_empty = cpumask_empty(prv->idlers);
>
>       /*
> -     * If the pcpu is idle, or there are no idlers and the new
> -     * vcpu is a higher priority than the old vcpu, run it here.
> -     *
> -     * If there are idle cpus, first try to find one suitable to run
> -     * new, so we can avoid preempting cur.  If we cannot find a
> -     * suitable idler on which to run new, run it here, but try to
> -     * find a suitable idler on which to run cur instead.
> +     * Node and vcpu-affinity balancing loop. To speed things up, in case
> +     * no node-affinity at all is present, scratch_balance_mask reflects
> +     * the vcpu-affinity, and ret is -1, so that we can then quit the
> +     * loop after only one step.
>        */
> -    if ( cur->pri == CSCHED_PRI_IDLE
> -         || (idlers_empty && new->pri > cur->pri) )
> +    for_each_csched_balance_step( balance_step )
>       {
> -        if ( cur->pri != CSCHED_PRI_IDLE )
> -            SCHED_STAT_CRANK(tickle_idlers_none);
> -        cpumask_set_cpu(cpu, &mask);
> -    }
> -    else if ( !idlers_empty )
> -    {
> -        /* Check whether or not there are idlers that can run new */
> -        cpumask_and(&idle_mask, prv->idlers, new->vcpu->cpu_affinity);
> +        int ret, new_idlers_empty;
> +
> +        cpumask_clear(&mask);
>
>           /*
> -         * If there are no suitable idlers for new, and it's higher
> -         * priority than cur, ask the scheduler to migrate cur away.
> -         * We have to act like this (instead of just waking some of
> -         * the idlers suitable for cur) because cur is running.
> +         * If the pcpu is idle, or there are no idlers and the new
> +         * vcpu is a higher priority than the old vcpu, run it here.
>            *
> -         * If there are suitable idlers for new, no matter priorities,
> -         * leave cur alone (as it is running and is, likely, cache-hot)
> -         * and wake some of them (which is waking up and so is, likely,
> -         * cache cold anyway).
> +         * If there are idle cpus, first try to find one suitable to run
> +         * new, so we can avoid preempting cur.  If we cannot find a
> +         * suitable idler on which to run new, run it here, but try to
> +         * find a suitable idler on which to run cur instead.
>            */
> -        if ( cpumask_empty(&idle_mask) && new->pri > cur->pri )
> +        if ( cur->pri == CSCHED_PRI_IDLE
> +             || (idlers_empty && new->pri > cur->pri) )
>           {
> -            SCHED_STAT_CRANK(tickle_idlers_none);
> -            SCHED_VCPU_STAT_CRANK(cur, kicked_away);
> -            SCHED_VCPU_STAT_CRANK(cur, migrate_r);
> -            SCHED_STAT_CRANK(migrate_kicked_away);
> -            set_bit(_VPF_migrating, &cur->vcpu->pause_flags);
> +            if ( cur->pri != CSCHED_PRI_IDLE )
> +                SCHED_STAT_CRANK(tickle_idlers_none);
>               cpumask_set_cpu(cpu, &mask);
>           }
> -        else if ( !cpumask_empty(&idle_mask) )
> +        else if ( !idlers_empty )
>           {
> -            /* Which of the idlers suitable for new shall we wake up? */
> -            SCHED_STAT_CRANK(tickle_idlers_some);
> -            if ( opt_tickle_one_idle )
> +            /* Are there idlers suitable for new (for this balance step)? */
> +            ret = csched_balance_cpumask(new->vcpu, balance_step,
> +                                         &scratch_balance_mask);
> +            cpumask_and(&idle_mask, prv->idlers, &scratch_balance_mask);
> +            new_idlers_empty = cpumask_empty(&idle_mask);
> +
> +            /*
> +             * Let's not be too harsh! If there aren't idlers suitable
> +             * for new in its node-affinity mask, make sure we check its
> +             * vcpu-affinity as well, before taking final decisions.
> +             */
> +            if ( new_idlers_empty
> +                 && (balance_step == CSCHED_BALANCE_NODE_AFFINITY && !ret) )
> +                continue;
> +
> +            /*
> +             * If there are no suitable idlers for new, and it's higher
> +             * priority than cur, ask the scheduler to migrate cur away.
> +             * We have to act like this (instead of just waking some of
> +             * the idlers suitable for cur) because cur is running.
> +             *
> +             * If there are suitable idlers for new, no matter priorities,
> +             * leave cur alone (as it is running and is, likely, cache-hot)
> +             * and wake some of them (which is waking up and so is, likely,
> +             * cache cold anyway).
> +             */
> +            if ( new_idlers_empty && new->pri > cur->pri )
>               {
> -                this_cpu(last_tickle_cpu) =
> -                    cpumask_cycle(this_cpu(last_tickle_cpu), &idle_mask);
> -                cpumask_set_cpu(this_cpu(last_tickle_cpu), &mask);
> +                SCHED_STAT_CRANK(tickle_idlers_none);
> +                SCHED_VCPU_STAT_CRANK(cur, kicked_away);
> +                SCHED_VCPU_STAT_CRANK(cur, migrate_r);
> +                SCHED_STAT_CRANK(migrate_kicked_away);
> +                set_bit(_VPF_migrating, &cur->vcpu->pause_flags);
> +                cpumask_set_cpu(cpu, &mask);
>               }
> -            else
> -                cpumask_or(&mask, &mask, &idle_mask);
> +            else if ( !new_idlers_empty )
> +            {
> +                /* Which of the idlers suitable for new shall we wake up? */
> +                SCHED_STAT_CRANK(tickle_idlers_some);
> +                if ( opt_tickle_one_idle )
> +                {
> +                    this_cpu(last_tickle_cpu) =
> +                        cpumask_cycle(this_cpu(last_tickle_cpu), &idle_mask);
> +                    cpumask_set_cpu(this_cpu(last_tickle_cpu), &mask);
> +                }
> +                else
> +                    cpumask_or(&mask, &mask, &idle_mask);
> +            }
>           }
> +
> +        /* Did we find anyone (or csched_balance_cpumask() says we're done)? */
> +        if ( !cpumask_empty(&mask) || ret )
> +            break;
>       }

The whole logic here is really convoluted and hard to read.  For 
example, if cur->pri==IDLE, then you will always just break out of the loop 
after the first iteration.  In that case, why have the if() inside the 
loop to begin with?  And if idlers_empty is true but cur->pri >= 
new->pri, then you'll go through the loop two times, even though both 
times it will come up empty.  And, of course, the whole thing about the 
node affinity mask being checked inside csched_balance_cpumask(), but 
not used until the very end.

A much more straightforward way to arrange it would be:

if(cur->pri==IDLE &c &c)
{
   foo;
}
else if(!idlers_empty)
{
   if(cpumask_full(sdom->node_affinity_cpumask))
     balance_step=CSCHED_BALANCE_CPU_AFFINITY;
   else
     balance_step=CSCHED_BALANCE_NODE_AFFINITY;

   for(; balance_step < CSCHED_BALANCE_MAX; balance_step++)
  {
  ...
  }
}

That seems a lot clearer to me -- does that make sense?

[To be continued...]


Thread overview: 57+ messages
2012-12-19 19:07 [PATCH 00 of 10 v2] NUMA aware credit scheduling Dario Faggioli
2012-12-19 19:07 ` [PATCH 01 of 10 v2] xen, libxc: rename xenctl_cpumap to xenctl_bitmap Dario Faggioli
2012-12-20  9:17   ` Jan Beulich
2012-12-20  9:35     ` Dario Faggioli
2012-12-19 19:07 ` [PATCH 02 of 10 v2] xen, libxc: introduce node maps and masks Dario Faggioli
2012-12-20  9:18   ` Jan Beulich
2012-12-20  9:55     ` Dario Faggioli
2012-12-20 14:33     ` George Dunlap
2012-12-20 14:52       ` Jan Beulich
2012-12-20 15:13         ` Dario Faggioli
2012-12-19 19:07 ` [PATCH 03 of 10 v2] xen: sched_credit: let the scheduler know about node-affinity Dario Faggioli
2012-12-20  6:44   ` Juergen Gross
2012-12-20  8:16     ` Dario Faggioli
2012-12-20  8:25       ` Juergen Gross
2012-12-20  8:33         ` Dario Faggioli
2012-12-20  8:39           ` Juergen Gross
2012-12-20  8:58             ` Dario Faggioli
2012-12-20 15:28             ` George Dunlap
2012-12-20 16:00               ` Dario Faggioli
2012-12-20  9:22           ` Jan Beulich
2012-12-20 15:56   ` George Dunlap [this message]
2012-12-20 17:12     ` Dario Faggioli
2012-12-20 16:48   ` George Dunlap
2012-12-20 18:18     ` Dario Faggioli
2012-12-21 14:29       ` George Dunlap
2012-12-21 16:07         ` Dario Faggioli
2012-12-20 20:21   ` George Dunlap
2012-12-21  0:18     ` Dario Faggioli
2012-12-21 14:56       ` George Dunlap
2012-12-21 16:13         ` Dario Faggioli
2012-12-19 19:07 ` [PATCH 04 of 10 v2] xen: allow for explicitly specifying node-affinity Dario Faggioli
2012-12-21 15:17   ` George Dunlap
2012-12-21 16:17     ` Dario Faggioli
2013-01-03 16:05     ` Daniel De Graaf
2012-12-19 19:07 ` [PATCH 05 of 10 v2] libxc: " Dario Faggioli
2012-12-21 15:19   ` George Dunlap
2012-12-21 16:27     ` Dario Faggioli
2012-12-19 19:07 ` [PATCH 06 of 10 v2] libxl: " Dario Faggioli
2012-12-21 15:30   ` George Dunlap
2012-12-21 16:18     ` Dario Faggioli
2012-12-21 17:02       ` Ian Jackson
2012-12-21 17:09         ` Dario Faggioli
2012-12-19 19:07 ` [PATCH 07 of 10 v2] libxl: optimize the calculation of how many VCPUs can run on a candidate Dario Faggioli
2012-12-20  8:41   ` Ian Campbell
2012-12-20  9:24     ` Dario Faggioli
2012-12-21 16:00   ` George Dunlap
2012-12-21 16:23     ` Dario Faggioli
2012-12-19 19:07 ` [PATCH 08 of 10 v2] libxl: automatic placement deals with node-affinity Dario Faggioli
2012-12-21 16:22   ` George Dunlap
2012-12-19 19:07 ` [PATCH 09 of 10 v2] xl: add node-affinity to the output of `xl list` Dario Faggioli
2012-12-21 16:34   ` George Dunlap
2012-12-21 16:54     ` Dario Faggioli
2012-12-19 19:07 ` [PATCH 10 of 10 v2] docs: rearrange and update NUMA placement documentation Dario Faggioli
2012-12-19 23:16 ` [PATCH 00 of 10 v2] NUMA aware credit scheduling Dario Faggioli
2013-01-11 12:19 ` Ian Campbell
2013-01-11 13:57   ` Dario Faggioli
2013-01-11 14:09     ` Ian Campbell
