From: George Dunlap <george.dunlap@eu.citrix.com>
To: Dario Faggioli <dario.faggioli@citrix.com>
Cc: Marcus Granado <Marcus.Granado@eu.citrix.com>,
Dan Magenheimer <dan.magenheimer@oracle.com>,
Ian Campbell <Ian.Campbell@citrix.com>,
Anil Madhavapeddy <anil@recoil.org>,
Andrew Cooper <Andrew.Cooper3@citrix.com>,
Juergen Gross <juergen.gross@ts.fujitsu.com>,
Ian Jackson <Ian.Jackson@eu.citrix.com>,
"xen-devel@lists.xen.org" <xen-devel@lists.xen.org>,
Jan Beulich <JBeulich@suse.com>,
Daniel De Graaf <dgdegra@tycho.nsa.gov>,
Matt Wilson <msw@amazon.com>
Subject: Re: [PATCH 03 of 10 v2] xen: sched_credit: let the scheduler know about node-affinity
Date: Thu, 20 Dec 2012 16:48:13 +0000
Message-ID: <50D3414D.8080901@eu.citrix.com>
In-Reply-To: <06d2f322a6319d8ba212.1355944039@Solace>
On 19/12/12 19:07, Dario Faggioli wrote:
> static inline int
> -__csched_vcpu_is_migrateable(struct vcpu *vc, int dest_cpu)
> +__csched_vcpu_is_migrateable(struct vcpu *vc, int dest_cpu, cpumask_t *mask)
> {
> /*
> * Don't pick up work that's in the peer's scheduling tail or hot on
> - * peer PCPU. Only pick up work that's allowed to run on our CPU.
> + * peer PCPU. Only pick up work that prefers and/or is allowed to run
> + * on our CPU.
> */
> return !vc->is_running &&
> !__csched_vcpu_is_cache_hot(vc) &&
> - cpumask_test_cpu(dest_cpu, vc->cpu_affinity);
> + cpumask_test_cpu(dest_cpu, mask);
> +}
> +
> +static inline int
> +__csched_vcpu_should_migrate(int cpu, cpumask_t *mask, cpumask_t *idlers)
> +{
> + /*
> + * Consent to migration if cpu is one of the idlers in the VCPU's
> + * affinity mask. In fact, if that is not the case, it just means it
> + * was some other CPU that was tickled and should hence come and pick
> + * VCPU up. Migrating it to cpu would only make things worse.
> + */
> + return cpumask_test_cpu(cpu, idlers) && cpumask_test_cpu(cpu, mask);
> }
I don't get what this function is for. The only time you call it is in
csched_runq_steal(), immediately after calling
__csched_vcpu_is_migrateable(). But is_migrateable() has already
checked cpumask_test_cpu(cpu, mask). So why do we need to check it again?
We could just replace this with cpumask_test_cpu(cpu, prv->idlers). But
that clause is going to be either true or false for every single
iteration of all the loops, including the loops in
csched_load_balance(). Wouldn't it make more sense to check it once in
csched_load_balance(), before diving into all those nested loops, rather
than re-testing it on every iteration?
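Just to illustrate what I mean -- I'd have expected something along
these lines near the top of csched_load_balance() instead (untested
sketch; whether simply returning snext is the right way to bail out
would need double-checking):

    /*
     * Whether this cpu is among the idlers is invariant across every
     * iteration of the stealing loops below, so test it once up front
     * instead of re-testing it for each candidate vcpu down in
     * csched_runq_steal().
     */
    if ( !cpumask_test_cpu(cpu, prv->idlers) )
        return snext;

That would also let you drop __csched_vcpu_should_migrate() entirely.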
And in any case, looking at the caller of csched_load_balance(), it
explicitly says to steal work if the next thing on the runqueue of cpu
has a priority of TS_OVER. That was chosen for a reason -- if you want
to change that, you should change it there at the top (and make a
justification for doing so), not deeply nested in a function like this.
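For reference, the call site in csched_schedule() is (roughly, from
memory) the following, and that is where the decision to steal at all
is actually taken:

    /*
     * SMP load balance: if the next runnable local vcpu has already
     * burned through its credits (pri <= TS_OVER), go and see whether
     * some other pcpu has more urgent work for us to run.
     */
    if ( snext->pri > CSCHED_PRI_TS_OVER )
        __runq_remove(snext);
    else
        snext = csched_load_balance(prv, cpu, snext, &ret.migrated);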
Or am I completely missing something?
> static struct csched_vcpu *
> -csched_runq_steal(int peer_cpu, int cpu, int pri)
> +csched_runq_steal(int peer_cpu, int cpu, int pri, int balance_step)
> {
> const struct csched_pcpu * const peer_pcpu = CSCHED_PCPU(peer_cpu);
> + struct csched_private *prv = CSCHED_PRIV(per_cpu(scheduler, peer_cpu));
> const struct vcpu * const peer_vcpu = curr_on_cpu(peer_cpu);
> struct csched_vcpu *speer;
> struct list_head *iter;
> @@ -1265,11 +1395,24 @@ csched_runq_steal(int peer_cpu, int cpu,
> if ( speer->pri <= pri )
> break;
>
> - /* Is this VCPU is runnable on our PCPU? */
> + /* Is this VCPU runnable on our PCPU? */
> vc = speer->vcpu;
> BUG_ON( is_idle_vcpu(vc) );
>
> - if (__csched_vcpu_is_migrateable(vc, cpu))
> + /*
> + * Retrieve the correct mask for this balance_step or, if we're
> + * dealing with node-affinity and the vcpu has no node affinity
> + * at all, just skip this vcpu. That is needed if we want to
> + * check if we have any node-affine work to steal first (wrt
> + * any vcpu-affine work).
> + */
> + if ( csched_balance_cpumask(vc, balance_step,
> + &scratch_balance_mask) )
> + continue;
Again, I think for clarity the best thing to do here is:
    if ( balance_step == NODE
         && cpumask_full(speer->sdom->node_affinity_cpumask) )
        continue;

    csched_balance_cpumask();
    /* Etc. */
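Spelled out in full at the call site, that would be something like the
following (sketch only, re-using the names from this patch; NODE stands
for whatever the node-affinity balance step constant is called):

            /*
             * A vcpu whose node-affinity spans all cpus has no
             * node-affine work for us to steal, so skip it during this
             * step; it is still considered in the vcpu-affinity step.
             */
            if ( balance_step == NODE
                 && cpumask_full(speer->sdom->node_affinity_cpumask) )
                continue;

            csched_balance_cpumask(vc, balance_step, &scratch_balance_mask);
            /* ...and then the __csched_vcpu_is_migrateable() check as before. */

and at that point csched_balance_cpumask() wouldn't need to return
anything at all.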
> @@ -1295,7 +1438,8 @@ csched_load_balance(struct csched_privat
> struct csched_vcpu *speer;
> cpumask_t workers;
> cpumask_t *online;
> - int peer_cpu;
> + int peer_cpu, peer_node, bstep;
> + int node = cpu_to_node(cpu);
>
> BUG_ON( cpu != snext->vcpu->processor );
> online = cpupool_scheduler_cpumask(per_cpu(cpupool, cpu));
> @@ -1312,42 +1456,68 @@ csched_load_balance(struct csched_privat
> SCHED_STAT_CRANK(load_balance_other);
>
> /*
> - * Peek at non-idling CPUs in the system, starting with our
> - * immediate neighbour.
> + * Let's look around for work to steal, taking both vcpu-affinity
> + * and node-affinity into account. More specifically, we check all
> + * the non-idle CPUs' runq, looking for:
> + * 1. any node-affine work to steal first,
> + * 2. if not finding anything, any vcpu-affine work to steal.
> */
> - cpumask_andnot(&workers, online, prv->idlers);
> - cpumask_clear_cpu(cpu, &workers);
> - peer_cpu = cpu;
> + for_each_csched_balance_step( bstep )
> + {
> + /*
> + * We peek at the non-idling CPUs in a node-wise fashion. In fact,
> + * it is more likely that we find some node-affine work on our same
> + * node, not to mention that migrating vcpus within the same node
> + * could well expected to be cheaper than across-nodes (memory
> + * stays local, there might be some node-wide cache[s], etc.).
> + */
> + peer_node = node;
> + do
> + {
> + /* Find out what the !idle are in this node */
> + cpumask_andnot(&workers, online, prv->idlers);
> + cpumask_and(&workers, &workers, &node_to_cpumask(peer_node));
> + cpumask_clear_cpu(cpu, &workers);
>
> - while ( !cpumask_empty(&workers) )
> - {
> - peer_cpu = cpumask_cycle(peer_cpu, &workers);
> - cpumask_clear_cpu(peer_cpu, &workers);
> + if ( cpumask_empty(&workers) )
> + goto next_node;
>
> - /*
> - * Get ahold of the scheduler lock for this peer CPU.
> - *
> - * Note: We don't spin on this lock but simply try it. Spinning could
> - * cause a deadlock if the peer CPU is also load balancing and trying
> - * to lock this CPU.
> - */
> - if ( !pcpu_schedule_trylock(peer_cpu) )
> - {
> - SCHED_STAT_CRANK(steal_trylock_failed);
> - continue;
> - }
> + peer_cpu = cpumask_first(&workers);
> + do
> + {
> + /*
> + * Get ahold of the scheduler lock for this peer CPU.
> + *
> + * Note: We don't spin on this lock but simply try it. Spinning
> + * could cause a deadlock if the peer CPU is also load
> + * balancing and trying to lock this CPU.
> + */
> + if ( !pcpu_schedule_trylock(peer_cpu) )
> + {
> + SCHED_STAT_CRANK(steal_trylock_failed);
> + peer_cpu = cpumask_cycle(peer_cpu, &workers);
> + continue;
> + }
>
> - /*
> - * Any work over there to steal?
> - */
> - speer = cpumask_test_cpu(peer_cpu, online) ?
> - csched_runq_steal(peer_cpu, cpu, snext->pri) : NULL;
> - pcpu_schedule_unlock(peer_cpu);
> - if ( speer != NULL )
> - {
> - *stolen = 1;
> - return speer;
> - }
> + /* Any work over there to steal? */
> + speer = cpumask_test_cpu(peer_cpu, online) ?
> + csched_runq_steal(peer_cpu, cpu, snext->pri, bstep) : NULL;
> + pcpu_schedule_unlock(peer_cpu);
> +
> + /* As soon as one vcpu is found, balancing ends */
> + if ( speer != NULL )
> + {
> + *stolen = 1;
> + return speer;
> + }
> +
> + peer_cpu = cpumask_cycle(peer_cpu, &workers);
> +
> + } while( peer_cpu != cpumask_first(&workers) );
> +
> + next_node:
> + peer_node = cycle_node(peer_node, node_online_map);
> + } while( peer_node != node );
> }
These changes all look right. But then, I'm a bit tired, so I'll give
it another once-over tomorrow. :-)
[To be continued]