Re: [PATCH 03 of 10 v2] xen: sched_credit: let the scheduler know about node-affinity

xen-devel.lists.xenproject.org archive mirror
 help / color / mirror / Atom feed

From: George Dunlap <george.dunlap@eu.citrix.com>
To: Dario Faggioli <dario.faggioli@citrix.com>
Cc: Marcus Granado <Marcus.Granado@eu.citrix.com>,
	Dan Magenheimer <dan.magenheimer@oracle.com>,
	Ian Campbell <Ian.Campbell@citrix.com>,
	Anil Madhavapeddy <anil@recoil.org>,
	Andrew Cooper <Andrew.Cooper3@citrix.com>,
	Juergen Gross <juergen.gross@ts.fujitsu.com>,
	Ian Jackson <Ian.Jackson@eu.citrix.com>,
	"xen-devel@lists.xen.org" <xen-devel@lists.xen.org>,
	Jan Beulich <JBeulich@suse.com>,
	Daniel De Graaf <dgdegra@tycho.nsa.gov>,
	Matt Wilson <msw@amazon.com>
Subject: Re: [PATCH 03 of 10 v2] xen: sched_credit: let the scheduler know about node-affinity
Date: Thu, 20 Dec 2012 16:48:13 +0000	[thread overview]
Message-ID: <50D3414D.8080901@eu.citrix.com> (raw)
In-Reply-To: <06d2f322a6319d8ba212.1355944039@Solace>

On 19/12/12 19:07, Dario Faggioli wrote:
>   static inline int
> -__csched_vcpu_is_migrateable(struct vcpu *vc, int dest_cpu)
> +__csched_vcpu_is_migrateable(struct vcpu *vc, int dest_cpu, cpumask_t *mask)
>   {
>       /*
>        * Don't pick up work that's in the peer's scheduling tail or hot on
> -     * peer PCPU. Only pick up work that's allowed to run on our CPU.
> +     * peer PCPU. Only pick up work that prefers and/or is allowed to run
> +     * on our CPU.
>        */
>       return !vc->is_running &&
>              !__csched_vcpu_is_cache_hot(vc) &&
> -           cpumask_test_cpu(dest_cpu, vc->cpu_affinity);
> +           cpumask_test_cpu(dest_cpu, mask);
> +}
> +
> +static inline int
> +__csched_vcpu_should_migrate(int cpu, cpumask_t *mask, cpumask_t *idlers)
> +{
> +    /*
> +     * Consent to migration if cpu is one of the idlers in the VCPU's
> +     * affinity mask. In fact, if that is not the case, it just means it
> +     * was some other CPU that was tickled and should hence come and pick
> +     * VCPU up. Migrating it to cpu would only make things worse.
> +     */
> +    return cpumask_test_cpu(cpu, idlers) && cpumask_test_cpu(cpu, mask);
>   }

I don't get what this function is for.  The only time you call it is in 
csched_runq_steal(), immediately after calling 
__csched_vcpu_is_migrateable().  But is_migrateable() has already 
checked cpu_mask_test_cpu(cpu, mask).  So why do we need to check it again?

We could just replace this with cpumask_test_cpu(cpu, prv->idlers).  But 
that clause is going to be either true or false for every single 
iteration of all the loops, including the loops in 
csched_load_balance().  Wouldn't it make more sense to check it once in 
csched_load_balance(), rather than doing all those nested loops?

And in any case, looking at the caller of csched_load_balance(), it 
explicitly says to steal work if the next thing on the runqueue of cpu 
has a priority of TS_OVER.  That was chosen for a reason -- if you want 
to change that, you should change it there at the top (and make a 
justification for doing so), not deeply nested in a function like this.

Or am I completely missing something?

>   static struct csched_vcpu *
> -csched_runq_steal(int peer_cpu, int cpu, int pri)
> +csched_runq_steal(int peer_cpu, int cpu, int pri, int balance_step)
>   {
>       const struct csched_pcpu * const peer_pcpu = CSCHED_PCPU(peer_cpu);
> +    struct csched_private *prv = CSCHED_PRIV(per_cpu(scheduler, peer_cpu));
>       const struct vcpu * const peer_vcpu = curr_on_cpu(peer_cpu);
>       struct csched_vcpu *speer;
>       struct list_head *iter;
> @@ -1265,11 +1395,24 @@ csched_runq_steal(int peer_cpu, int cpu,
>               if ( speer->pri <= pri )
>                   break;
>
> -            /* Is this VCPU is runnable on our PCPU? */
> +            /* Is this VCPU runnable on our PCPU? */
>               vc = speer->vcpu;
>               BUG_ON( is_idle_vcpu(vc) );
>
> -            if (__csched_vcpu_is_migrateable(vc, cpu))
> +            /*
> +             * Retrieve the correct mask for this balance_step or, if we're
> +             * dealing with node-affinity and the vcpu has no node affinity
> +             * at all, just skip this vcpu. That is needed if we want to
> +             * check if we have any node-affine work to steal first (wrt
> +             * any vcpu-affine work).
> +             */
> +            if ( csched_balance_cpumask(vc, balance_step,
> +                                        &scratch_balance_mask) )
> +                continue;

Again, I think for clarity the best thing to do here is:

if ( balance_step == NODE

      && cpumask_full(speer->sdom->node_affinity_cpumask) )

     continue;

csched_balance_cpumask();

/* Etc. */




> @@ -1295,7 +1438,8 @@ csched_load_balance(struct csched_privat
>       struct csched_vcpu *speer;
>       cpumask_t workers;
>       cpumask_t *online;
> -    int peer_cpu;
> +    int peer_cpu, peer_node, bstep;
> +    int node = cpu_to_node(cpu);
>
>       BUG_ON( cpu != snext->vcpu->processor );
>       online = cpupool_scheduler_cpumask(per_cpu(cpupool, cpu));
> @@ -1312,42 +1456,68 @@ csched_load_balance(struct csched_privat
>           SCHED_STAT_CRANK(load_balance_other);
>
>       /*
> -     * Peek at non-idling CPUs in the system, starting with our
> -     * immediate neighbour.
> +     * Let's look around for work to steal, taking both vcpu-affinity
> +     * and node-affinity into account. More specifically, we check all
> +     * the non-idle CPUs' runq, looking for:
> +     *  1. any node-affine work to steal first,
> +     *  2. if not finding anything, any vcpu-affine work to steal.
>        */
> -    cpumask_andnot(&workers, online, prv->idlers);
> -    cpumask_clear_cpu(cpu, &workers);
> -    peer_cpu = cpu;
> +    for_each_csched_balance_step( bstep )
> +    {
> +        /*
> +         * We peek at the non-idling CPUs in a node-wise fashion. In fact,
> +         * it is more likely that we find some node-affine work on our same
> +         * node, not to mention that migrating vcpus within the same node
> +         * could well expected to be cheaper than across-nodes (memory
> +         * stays local, there might be some node-wide cache[s], etc.).
> +         */
> +        peer_node = node;
> +        do
> +        {
> +            /* Find out what the !idle are in this node */
> +            cpumask_andnot(&workers, online, prv->idlers);
> +            cpumask_and(&workers, &workers, &node_to_cpumask(peer_node));
> +            cpumask_clear_cpu(cpu, &workers);
>
> -    while ( !cpumask_empty(&workers) )
> -    {
> -        peer_cpu = cpumask_cycle(peer_cpu, &workers);
> -        cpumask_clear_cpu(peer_cpu, &workers);
> +            if ( cpumask_empty(&workers) )
> +                goto next_node;
>
> -        /*
> -         * Get ahold of the scheduler lock for this peer CPU.
> -         *
> -         * Note: We don't spin on this lock but simply try it. Spinning could
> -         * cause a deadlock if the peer CPU is also load balancing and trying
> -         * to lock this CPU.
> -         */
> -        if ( !pcpu_schedule_trylock(peer_cpu) )
> -        {
> -            SCHED_STAT_CRANK(steal_trylock_failed);
> -            continue;
> -        }
> +            peer_cpu = cpumask_first(&workers);
> +            do
> +            {
> +                /*
> +                 * Get ahold of the scheduler lock for this peer CPU.
> +                 *
> +                 * Note: We don't spin on this lock but simply try it. Spinning
> +                 * could cause a deadlock if the peer CPU is also load
> +                 * balancing and trying to lock this CPU.
> +                 */
> +                if ( !pcpu_schedule_trylock(peer_cpu) )
> +                {
> +                    SCHED_STAT_CRANK(steal_trylock_failed);
> +                    peer_cpu = cpumask_cycle(peer_cpu, &workers);
> +                    continue;
> +                }
>
> -        /*
> -         * Any work over there to steal?
> -         */
> -        speer = cpumask_test_cpu(peer_cpu, online) ?
> -            csched_runq_steal(peer_cpu, cpu, snext->pri) : NULL;
> -        pcpu_schedule_unlock(peer_cpu);
> -        if ( speer != NULL )
> -        {
> -            *stolen = 1;
> -            return speer;
> -        }
> +                /* Any work over there to steal? */
> +                speer = cpumask_test_cpu(peer_cpu, online) ?
> +                    csched_runq_steal(peer_cpu, cpu, snext->pri, bstep) : NULL;
> +                pcpu_schedule_unlock(peer_cpu);
> +
> +                /* As soon as one vcpu is found, balancing ends */
> +                if ( speer != NULL )
> +                {
> +                    *stolen = 1;
> +                    return speer;
> +                }
> +
> +                peer_cpu = cpumask_cycle(peer_cpu, &workers);
> +
> +            } while( peer_cpu != cpumask_first(&workers) );
> +
> + next_node:
> +            peer_node = cycle_node(peer_node, node_online_map);
> +        } while( peer_node != node );
>       }

These changes all look right.  But then, I'm a bit tired, so I'll give 
it another once-over tomorrow. :-)

[To be continued]

next prev parent reply	other threads:[~2012-12-20 16:48 UTC|newest]

Thread overview: 57+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2012-12-19 19:07 [PATCH 00 of 10 v2] NUMA aware credit scheduling Dario Faggioli
2012-12-19 19:07 ` [PATCH 01 of 10 v2] xen, libxc: rename xenctl_cpumap to xenctl_bitmap Dario Faggioli
2012-12-20  9:17   ` Jan Beulich
2012-12-20  9:35     ` Dario Faggioli
2012-12-19 19:07 ` [PATCH 02 of 10 v2] xen, libxc: introduce node maps and masks Dario Faggioli
2012-12-20  9:18   ` Jan Beulich
2012-12-20  9:55     ` Dario Faggioli
2012-12-20 14:33     ` George Dunlap
2012-12-20 14:52       ` Jan Beulich
2012-12-20 15:13         ` Dario Faggioli
2012-12-19 19:07 ` [PATCH 03 of 10 v2] xen: sched_credit: let the scheduler know about node-affinity Dario Faggioli
2012-12-20  6:44   ` Juergen Gross
2012-12-20  8:16     ` Dario Faggioli
2012-12-20  8:25       ` Juergen Gross
2012-12-20  8:33         ` Dario Faggioli
2012-12-20  8:39           ` Juergen Gross
2012-12-20  8:58             ` Dario Faggioli
2012-12-20 15:28             ` George Dunlap
2012-12-20 16:00               ` Dario Faggioli
2012-12-20  9:22           ` Jan Beulich
2012-12-20 15:56   ` George Dunlap
2012-12-20 17:12     ` Dario Faggioli
2012-12-20 16:48   ` George Dunlap [this message]
2012-12-20 18:18     ` Dario Faggioli
2012-12-21 14:29       ` George Dunlap
2012-12-21 16:07         ` Dario Faggioli
2012-12-20 20:21   ` George Dunlap
2012-12-21  0:18     ` Dario Faggioli
2012-12-21 14:56       ` George Dunlap
2012-12-21 16:13         ` Dario Faggioli
2012-12-19 19:07 ` [PATCH 04 of 10 v2] xen: allow for explicitly specifying node-affinity Dario Faggioli
2012-12-21 15:17   ` George Dunlap
2012-12-21 16:17     ` Dario Faggioli
2013-01-03 16:05     ` Daniel De Graaf
2012-12-19 19:07 ` [PATCH 05 of 10 v2] libxc: " Dario Faggioli
2012-12-21 15:19   ` George Dunlap
2012-12-21 16:27     ` Dario Faggioli
2012-12-19 19:07 ` [PATCH 06 of 10 v2] libxl: " Dario Faggioli
2012-12-21 15:30   ` George Dunlap
2012-12-21 16:18     ` Dario Faggioli
2012-12-21 17:02       ` Ian Jackson
2012-12-21 17:09         ` Dario Faggioli
2012-12-19 19:07 ` [PATCH 07 of 10 v2] libxl: optimize the calculation of how many VCPUs can run on a candidate Dario Faggioli
2012-12-20  8:41   ` Ian Campbell
2012-12-20  9:24     ` Dario Faggioli
2012-12-21 16:00   ` George Dunlap
2012-12-21 16:23     ` Dario Faggioli
2012-12-19 19:07 ` [PATCH 08 of 10 v2] libxl: automatic placement deals with node-affinity Dario Faggioli
2012-12-21 16:22   ` George Dunlap
2012-12-19 19:07 ` [PATCH 09 of 10 v2] xl: add node-affinity to the output of `xl list` Dario Faggioli
2012-12-21 16:34   ` George Dunlap
2012-12-21 16:54     ` Dario Faggioli
2012-12-19 19:07 ` [PATCH 10 of 10 v2] docs: rearrange and update NUMA placement documentation Dario Faggioli
2012-12-19 23:16 ` [PATCH 00 of 10 v2] NUMA aware credit scheduling Dario Faggioli
2013-01-11 12:19 ` Ian Campbell
2013-01-11 13:57   ` Dario Faggioli
2013-01-11 14:09     ` Ian Campbell

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=50D3414D.8080901@eu.citrix.com \
    --to=george.dunlap@eu.citrix.com \
    --cc=Andrew.Cooper3@citrix.com \
    --cc=Ian.Campbell@citrix.com \
    --cc=Ian.Jackson@eu.citrix.com \
    --cc=JBeulich@suse.com \
    --cc=Marcus.Granado@eu.citrix.com \
    --cc=anil@recoil.org \
    --cc=dan.magenheimer@oracle.com \
    --cc=dario.faggioli@citrix.com \
    --cc=dgdegra@tycho.nsa.gov \
    --cc=juergen.gross@ts.fujitsu.com \
    --cc=msw@amazon.com \
    --cc=xen-devel@lists.xen.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).