From mboxrd@z Thu Jan 1 00:00:00 1970
From: Juergen Gross
Subject: Re: [PATCH 03 of 10 v2] xen: sched_credit: let the scheduler know about node-affinity
Date: Thu, 20 Dec 2012 07:44:46 +0100
Message-ID: <50D2B3DE.70206@ts.fujitsu.com>
References: <06d2f322a6319d8ba212.1355944039@Solace>
Mime-Version: 1.0
Content-Type: text/plain; charset="us-ascii"; Format="flowed"
Content-Transfer-Encoding: 7bit
In-Reply-To: <06d2f322a6319d8ba212.1355944039@Solace>
Sender: xen-devel-bounces@lists.xen.org
Errors-To: xen-devel-bounces@lists.xen.org
To: Dario Faggioli
Cc: Marcus Granado, Dan Magenheimer, Ian Campbell, Anil Madhavapeddy, George Dunlap, Andrew Cooper, Ian Jackson, xen-devel@lists.xen.org, Jan Beulich, Daniel De Graaf, Matt Wilson
List-Id: xen-devel@lists.xenproject.org

On 19.12.2012 20:07, Dario Faggioli wrote:
> As vcpu-affinity tells where VCPUs must run, node-affinity tells
> where they should or, better, prefer. While respecting vcpu-affinity
> remains mandatory, node-affinity is not that strict, it only expresses
> a preference, although it is almost always true that honouring it will
> bring a significant performance benefit (especially as compared to
> not having any affinity at all).
>
> This change modifies the VCPU load balancing algorithm (for the
> credit scheduler only), introducing a two-step logic.
> During the first step, we use the node-affinity mask. The aim is
> giving precedence to the CPUs where it is known to be preferable
> for the domain to run. If that fails in finding a valid PCPU, the
> node-affinity is just ignored and, in the second step, we fall
> back to using cpu-affinity only.
>
> Signed-off-by: Dario Faggioli
> ---
> Changes from v1:
> * CPU mask variables moved off from the stack, as requested during
>   review. As per the comments in the code, having them in the private
>   (per-scheduler instance) struct could have been enough, but it would be
>   racy (again, see comments). For that reason, use a global bunch of
>   them (via per_cpu());

Wouldn't it be better to put the mask in the scheduler's private per-pcpu
area? The same could be applied to several other instances of cpu masks
on the stack, too.

> * George suggested a different load balancing logic during v1's review. I
>   think he was right and then I changed the old implementation in a way
>   that resembles exactly that. I rewrote most of this patch to introduce
>   a more sensible and effective node-affinity handling logic.
>
> diff --git a/xen/common/sched_credit.c b/xen/common/sched_credit.c
> --- a/xen/common/sched_credit.c
> +++ b/xen/common/sched_credit.c
> @@ -111,6 +111,33 @@
>
>
>  /*
> + * Node Balancing
> + */
> +#define CSCHED_BALANCE_CPU_AFFINITY 0
> +#define CSCHED_BALANCE_NODE_AFFINITY 1
> +#define CSCHED_BALANCE_LAST CSCHED_BALANCE_NODE_AFFINITY
> +
> +/*
> + * When building for high number of CPUs, cpumask_var_t
> + * variables on stack are better avoided. However, we need them,
> + * in order to be able to consider both vcpu and node affinity.
> + * We also don't want to xmalloc()/xfree() them, as that would
> + * happen in critical code paths. Therefore, let's (pre)allocate
> + * some scratch space for them.
> + *
> + * Having one mask for each instance of the scheduler seems
> + * enough, and that would suggest putting it within `struct
> + * csched_private' below. However, we don't always hold the
> + * private scheduler lock when the mask itself would need to
> + * be used, leaving room for races. For that reason, we define
> + * and use a cpumask_t for each CPU. As preemption is not an
> + * issue here (we're holding the runqueue spin-lock!), that is
> + * both enough and safe.
> + */ > +DEFINE_PER_CPU(cpumask_t, csched_balance_mask); > +#define scratch_balance_mask (this_cpu(csched_balance_mask)) > + > +/* > * Boot parameters > */ > static int __read_mostly sched_credit_tslice_ms = CSCHED_DEFAULT_TSLICE_MS; > @@ -159,6 +186,9 @@ struct csched_dom { > struct list_head active_vcpu; > struct list_head active_sdom_elem; > struct domain *dom; > + /* cpumask translated from the domain's node-affinity. > + * Basically, the CPUs we prefer to be scheduled on. */ > + cpumask_var_t node_affinity_cpumask; > uint16_t active_vcpu_count; > uint16_t weight; > uint16_t cap; > @@ -239,6 +269,42 @@ static inline void > list_del_init(&svc->runq_elem); > } > > +#define for_each_csched_balance_step(__step) \ > + for ( (__step) = CSCHED_BALANCE_LAST; (__step)>= 0; (__step)-- ) > + > +/* > + * Each csched-balance step has to use its own cpumask. This function > + * determines which one, given the step, and copies it in mask. Notice > + * that, in case of node-affinity balancing step, it also filters out from > + * the node-affinity mask the cpus that are not part of vc's cpu-affinity, > + * as we do not want to end up running a vcpu where it would like, but > + * is not allowed to! > + * > + * As an optimization, if a domain does not have any node-affinity at all > + * (namely, its node affinity is automatically computed), not only the > + * computed mask will reflect its vcpu-affinity, but we also return -1 to > + * let the caller know that he can skip the step or quit the loop (if he > + * wants). 
> + */ > +static int > +csched_balance_cpumask(const struct vcpu *vc, int step, cpumask_t *mask) > +{ > + if ( step == CSCHED_BALANCE_NODE_AFFINITY ) > + { > + struct domain *d = vc->domain; > + struct csched_dom *sdom = CSCHED_DOM(d); > + > + cpumask_and(mask, sdom->node_affinity_cpumask, vc->cpu_affinity); > + > + if ( cpumask_full(sdom->node_affinity_cpumask) ) > + return -1; > + } > + else /* step == CSCHED_BALANCE_CPU_AFFINITY */ > + cpumask_copy(mask, vc->cpu_affinity); > + > + return 0; > +} > + > static void burn_credits(struct csched_vcpu *svc, s_time_t now) > { > s_time_t delta; > @@ -266,67 +332,94 @@ static inline void > struct csched_vcpu * const cur = CSCHED_VCPU(curr_on_cpu(cpu)); > struct csched_private *prv = CSCHED_PRIV(per_cpu(scheduler, cpu)); > cpumask_t mask, idle_mask; > - int idlers_empty; > + int balance_step, idlers_empty; > > ASSERT(cur); > - cpumask_clear(&mask); > - > idlers_empty = cpumask_empty(prv->idlers); > > /* > - * If the pcpu is idle, or there are no idlers and the new > - * vcpu is a higher priority than the old vcpu, run it here. > - * > - * If there are idle cpus, first try to find one suitable to run > - * new, so we can avoid preempting cur. If we cannot find a > - * suitable idler on which to run new, run it here, but try to > - * find a suitable idler on which to run cur instead. > + * Node and vcpu-affinity balancing loop. To speed things up, in case > + * no node-affinity at all is present, scratch_balance_mask reflects > + * the vcpu-affinity, and ret is -1, so that we then can quit the > + * loop after only one step. 
> */ > - if ( cur->pri == CSCHED_PRI_IDLE > - || (idlers_empty&& new->pri> cur->pri) ) > + for_each_csched_balance_step( balance_step ) > { > - if ( cur->pri != CSCHED_PRI_IDLE ) > - SCHED_STAT_CRANK(tickle_idlers_none); > - cpumask_set_cpu(cpu,&mask); > - } > - else if ( !idlers_empty ) > - { > - /* Check whether or not there are idlers that can run new */ > - cpumask_and(&idle_mask, prv->idlers, new->vcpu->cpu_affinity); > + int ret, new_idlers_empty; > + > + cpumask_clear(&mask); > > /* > - * If there are no suitable idlers for new, and it's higher > - * priority than cur, ask the scheduler to migrate cur away. > - * We have to act like this (instead of just waking some of > - * the idlers suitable for cur) because cur is running. > + * If the pcpu is idle, or there are no idlers and the new > + * vcpu is a higher priority than the old vcpu, run it here. > * > - * If there are suitable idlers for new, no matter priorities, > - * leave cur alone (as it is running and is, likely, cache-hot) > - * and wake some of them (which is waking up and so is, likely, > - * cache cold anyway). > + * If there are idle cpus, first try to find one suitable to run > + * new, so we can avoid preempting cur. If we cannot find a > + * suitable idler on which to run new, run it here, but try to > + * find a suitable idler on which to run cur instead. > */ > - if ( cpumask_empty(&idle_mask)&& new->pri> cur->pri ) > + if ( cur->pri == CSCHED_PRI_IDLE > + || (idlers_empty&& new->pri> cur->pri) ) > { > - SCHED_STAT_CRANK(tickle_idlers_none); > - SCHED_VCPU_STAT_CRANK(cur, kicked_away); > - SCHED_VCPU_STAT_CRANK(cur, migrate_r); > - SCHED_STAT_CRANK(migrate_kicked_away); > - set_bit(_VPF_migrating,&cur->vcpu->pause_flags); > + if ( cur->pri != CSCHED_PRI_IDLE ) > + SCHED_STAT_CRANK(tickle_idlers_none); > cpumask_set_cpu(cpu,&mask); > } > - else if ( !cpumask_empty(&idle_mask) ) > + else if ( !idlers_empty ) > { > - /* Which of the idlers suitable for new shall we wake up? 
*/ > - SCHED_STAT_CRANK(tickle_idlers_some); > - if ( opt_tickle_one_idle ) > + /* Are there idlers suitable for new (for this balance step)? */ > + ret = csched_balance_cpumask(new->vcpu, balance_step, > +&scratch_balance_mask); > + cpumask_and(&idle_mask, prv->idlers,&scratch_balance_mask); > + new_idlers_empty = cpumask_empty(&idle_mask); > + > + /* > + * Let's not be too harsh! If there aren't idlers suitable > + * for new in its node-affinity mask, make sure we check its > + * vcpu-affinity as well, before tacking final decisions. > + */ > + if ( new_idlers_empty > +&& (balance_step == CSCHED_BALANCE_NODE_AFFINITY&& !ret) ) > + continue; > + > + /* > + * If there are no suitable idlers for new, and it's higher > + * priority than cur, ask the scheduler to migrate cur away. > + * We have to act like this (instead of just waking some of > + * the idlers suitable for cur) because cur is running. > + * > + * If there are suitable idlers for new, no matter priorities, > + * leave cur alone (as it is running and is, likely, cache-hot) > + * and wake some of them (which is waking up and so is, likely, > + * cache cold anyway). > + */ > + if ( new_idlers_empty&& new->pri> cur->pri ) > { > - this_cpu(last_tickle_cpu) = > - cpumask_cycle(this_cpu(last_tickle_cpu),&idle_mask); > - cpumask_set_cpu(this_cpu(last_tickle_cpu),&mask); > + SCHED_STAT_CRANK(tickle_idlers_none); > + SCHED_VCPU_STAT_CRANK(cur, kicked_away); > + SCHED_VCPU_STAT_CRANK(cur, migrate_r); > + SCHED_STAT_CRANK(migrate_kicked_away); > + set_bit(_VPF_migrating,&cur->vcpu->pause_flags); > + cpumask_set_cpu(cpu,&mask); > } > - else > - cpumask_or(&mask,&mask,&idle_mask); > + else if ( !new_idlers_empty ) > + { > + /* Which of the idlers suitable for new shall we wake up? 
*/ > + SCHED_STAT_CRANK(tickle_idlers_some); > + if ( opt_tickle_one_idle ) > + { > + this_cpu(last_tickle_cpu) = > + cpumask_cycle(this_cpu(last_tickle_cpu),&idle_mask); > + cpumask_set_cpu(this_cpu(last_tickle_cpu),&mask); > + } > + else > + cpumask_or(&mask,&mask,&idle_mask); > + } > } > + > + /* Did we find anyone (or csched_balance_cpumask() says we're done)? */ > + if ( !cpumask_empty(&mask) || ret ) > + break; > } > > if ( !cpumask_empty(&mask) ) > @@ -475,15 +568,28 @@ static inline int > } > > static inline int > -__csched_vcpu_is_migrateable(struct vcpu *vc, int dest_cpu) > +__csched_vcpu_is_migrateable(struct vcpu *vc, int dest_cpu, cpumask_t *mask) > { > /* > * Don't pick up work that's in the peer's scheduling tail or hot on > - * peer PCPU. Only pick up work that's allowed to run on our CPU. > + * peer PCPU. Only pick up work that prefers and/or is allowed to run > + * on our CPU. > */ > return !vc->is_running&& > !__csched_vcpu_is_cache_hot(vc)&& > - cpumask_test_cpu(dest_cpu, vc->cpu_affinity); > + cpumask_test_cpu(dest_cpu, mask); > +} > + > +static inline int > +__csched_vcpu_should_migrate(int cpu, cpumask_t *mask, cpumask_t *idlers) > +{ > + /* > + * Consent to migration if cpu is one of the idlers in the VCPU's > + * affinity mask. In fact, if that is not the case, it just means it > + * was some other CPU that was tickled and should hence come and pick > + * VCPU up. Migrating it to cpu would only make things worse. > + */ > + return cpumask_test_cpu(cpu, idlers)&& cpumask_test_cpu(cpu, mask); > } > > static int > @@ -493,85 +599,98 @@ static int > cpumask_t idlers; > cpumask_t *online; > struct csched_pcpu *spc = NULL; > + int ret, balance_step; > int cpu; > > - /* > - * Pick from online CPUs in VCPU's affinity mask, giving a > - * preference to its current processor if it's in there. 
> - */ > online = cpupool_scheduler_cpumask(vc->domain->cpupool); > - cpumask_and(&cpus, online, vc->cpu_affinity); > - cpu = cpumask_test_cpu(vc->processor,&cpus) > - ? vc->processor > - : cpumask_cycle(vc->processor,&cpus); > - ASSERT( !cpumask_empty(&cpus)&& cpumask_test_cpu(cpu,&cpus) ); > + for_each_csched_balance_step( balance_step ) > + { > + /* Pick an online CPU from the proper affinity mask */ > + ret = csched_balance_cpumask(vc, balance_step,&cpus); > + cpumask_and(&cpus,&cpus, online); > > - /* > - * Try to find an idle processor within the above constraints. > - * > - * In multi-core and multi-threaded CPUs, not all idle execution > - * vehicles are equal! > - * > - * We give preference to the idle execution vehicle with the most > - * idling neighbours in its grouping. This distributes work across > - * distinct cores first and guarantees we don't do something stupid > - * like run two VCPUs on co-hyperthreads while there are idle cores > - * or sockets. > - * > - * Notice that, when computing the "idleness" of cpu, we may want to > - * discount vc. That is, iff vc is the currently running and the only > - * runnable vcpu on cpu, we add cpu to the idlers. > - */ > - cpumask_and(&idlers,&cpu_online_map, CSCHED_PRIV(ops)->idlers); > - if ( vc->processor == cpu&& IS_RUNQ_IDLE(cpu) ) > - cpumask_set_cpu(cpu,&idlers); > - cpumask_and(&cpus,&cpus,&idlers); > - cpumask_clear_cpu(cpu,&cpus); > + /* If present, prefer vc's current processor */ > + cpu = cpumask_test_cpu(vc->processor,&cpus) > + ? vc->processor > + : cpumask_cycle(vc->processor,&cpus); > + ASSERT( !cpumask_empty(&cpus)&& cpumask_test_cpu(cpu,&cpus) ); > > - while ( !cpumask_empty(&cpus) ) > - { > - cpumask_t cpu_idlers; > - cpumask_t nxt_idlers; > - int nxt, weight_cpu, weight_nxt; > - int migrate_factor; > + /* > + * Try to find an idle processor within the above constraints. > + * > + * In multi-core and multi-threaded CPUs, not all idle execution > + * vehicles are equal! 
> + * > + * We give preference to the idle execution vehicle with the most > + * idling neighbours in its grouping. This distributes work across > + * distinct cores first and guarantees we don't do something stupid > + * like run two VCPUs on co-hyperthreads while there are idle cores > + * or sockets. > + * > + * Notice that, when computing the "idleness" of cpu, we may want to > + * discount vc. That is, iff vc is the currently running and the only > + * runnable vcpu on cpu, we add cpu to the idlers. > + */ > + cpumask_and(&idlers,&cpu_online_map, CSCHED_PRIV(ops)->idlers); > + if ( vc->processor == cpu&& IS_RUNQ_IDLE(cpu) ) > + cpumask_set_cpu(cpu,&idlers); > + cpumask_and(&cpus,&cpus,&idlers); > + /* If there are idlers and cpu is still not among them, pick one */ > + if ( !cpumask_empty(&cpus)&& !cpumask_test_cpu(cpu,&cpus) ) > + cpu = cpumask_cycle(cpu,&cpus); > + cpumask_clear_cpu(cpu,&cpus); > > - nxt = cpumask_cycle(cpu,&cpus); > + while ( !cpumask_empty(&cpus) ) > + { > + cpumask_t cpu_idlers; > + cpumask_t nxt_idlers; > + int nxt, weight_cpu, weight_nxt; > + int migrate_factor; > > - if ( cpumask_test_cpu(cpu, per_cpu(cpu_core_mask, nxt)) ) > - { > - /* We're on the same socket, so check the busy-ness of threads. > - * Migrate if # of idlers is less at all */ > - ASSERT( cpumask_test_cpu(nxt, per_cpu(cpu_core_mask, cpu)) ); > - migrate_factor = 1; > - cpumask_and(&cpu_idlers,&idlers, per_cpu(cpu_sibling_mask, cpu)); > - cpumask_and(&nxt_idlers,&idlers, per_cpu(cpu_sibling_mask, nxt)); > - } > - else > - { > - /* We're on different sockets, so check the busy-ness of cores. 
> - * Migrate only if the other core is twice as idle */ > - ASSERT( !cpumask_test_cpu(nxt, per_cpu(cpu_core_mask, cpu)) ); > - migrate_factor = 2; > - cpumask_and(&cpu_idlers,&idlers, per_cpu(cpu_core_mask, cpu)); > - cpumask_and(&nxt_idlers,&idlers, per_cpu(cpu_core_mask, nxt)); > + nxt = cpumask_cycle(cpu,&cpus); > + > + if ( cpumask_test_cpu(cpu, per_cpu(cpu_core_mask, nxt)) ) > + { > + /* We're on the same socket, so check the busy-ness of threads. > + * Migrate if # of idlers is less at all */ > + ASSERT( cpumask_test_cpu(nxt, per_cpu(cpu_core_mask, cpu)) ); > + migrate_factor = 1; > + cpumask_and(&cpu_idlers,&idlers, per_cpu(cpu_sibling_mask, > + cpu)); > + cpumask_and(&nxt_idlers,&idlers, per_cpu(cpu_sibling_mask, > + nxt)); > + } > + else > + { > + /* We're on different sockets, so check the busy-ness of cores. > + * Migrate only if the other core is twice as idle */ > + ASSERT( !cpumask_test_cpu(nxt, per_cpu(cpu_core_mask, cpu)) ); > + migrate_factor = 2; > + cpumask_and(&cpu_idlers,&idlers, per_cpu(cpu_core_mask, cpu)); > + cpumask_and(&nxt_idlers,&idlers, per_cpu(cpu_core_mask, nxt)); > + } > + > + weight_cpu = cpumask_weight(&cpu_idlers); > + weight_nxt = cpumask_weight(&nxt_idlers); > + /* smt_power_savings: consolidate work rather than spreading it */ > + if ( sched_smt_power_savings ? > + weight_cpu> weight_nxt : > + weight_cpu * migrate_factor< weight_nxt ) > + { > + cpumask_and(&nxt_idlers,&cpus,&nxt_idlers); > + spc = CSCHED_PCPU(nxt); > + cpu = cpumask_cycle(spc->idle_bias,&nxt_idlers); > + cpumask_andnot(&cpus,&cpus, per_cpu(cpu_sibling_mask, cpu)); > + } > + else > + { > + cpumask_andnot(&cpus,&cpus,&nxt_idlers); > + } > } > > - weight_cpu = cpumask_weight(&cpu_idlers); > - weight_nxt = cpumask_weight(&nxt_idlers); > - /* smt_power_savings: consolidate work rather than spreading it */ > - if ( sched_smt_power_savings ? 
> - weight_cpu> weight_nxt : > - weight_cpu * migrate_factor< weight_nxt ) > - { > - cpumask_and(&nxt_idlers,&cpus,&nxt_idlers); > - spc = CSCHED_PCPU(nxt); > - cpu = cpumask_cycle(spc->idle_bias,&nxt_idlers); > - cpumask_andnot(&cpus,&cpus, per_cpu(cpu_sibling_mask, cpu)); > - } > - else > - { > - cpumask_andnot(&cpus,&cpus,&nxt_idlers); > - } > + /* Stop if cpu is idle (or if csched_balance_cpumask() says we can) */ > + if ( cpumask_test_cpu(cpu,&idlers) || ret ) > + break; > } > > if ( commit&& spc ) > @@ -913,6 +1032,13 @@ csched_alloc_domdata(const struct schedu > if ( sdom == NULL ) > return NULL; > > + if ( !alloc_cpumask_var(&sdom->node_affinity_cpumask) ) > + { > + xfree(sdom); > + return NULL; > + } > + cpumask_setall(sdom->node_affinity_cpumask); > + > /* Initialize credit and weight */ > INIT_LIST_HEAD(&sdom->active_vcpu); > sdom->active_vcpu_count = 0; > @@ -944,6 +1070,9 @@ csched_dom_init(const struct scheduler * > static void > csched_free_domdata(const struct scheduler *ops, void *data) > { > + struct csched_dom *sdom = data; > + > + free_cpumask_var(sdom->node_affinity_cpumask); > xfree(data); > } > > @@ -1240,9 +1369,10 @@ csched_tick(void *_cpu) > } > > static struct csched_vcpu * > -csched_runq_steal(int peer_cpu, int cpu, int pri) > +csched_runq_steal(int peer_cpu, int cpu, int pri, int balance_step) > { > const struct csched_pcpu * const peer_pcpu = CSCHED_PCPU(peer_cpu); > + struct csched_private *prv = CSCHED_PRIV(per_cpu(scheduler, peer_cpu)); > const struct vcpu * const peer_vcpu = curr_on_cpu(peer_cpu); > struct csched_vcpu *speer; > struct list_head *iter; > @@ -1265,11 +1395,24 @@ csched_runq_steal(int peer_cpu, int cpu, > if ( speer->pri<= pri ) > break; > > - /* Is this VCPU is runnable on our PCPU? */ > + /* Is this VCPU runnable on our PCPU? 
*/ > vc = speer->vcpu; > BUG_ON( is_idle_vcpu(vc) ); > > - if (__csched_vcpu_is_migrateable(vc, cpu)) > + /* > + * Retrieve the correct mask for this balance_step or, if we're > + * dealing with node-affinity and the vcpu has no node affinity > + * at all, just skip this vcpu. That is needed if we want to > + * check if we have any node-affine work to steal first (wrt > + * any vcpu-affine work). > + */ > + if ( csched_balance_cpumask(vc, balance_step, > +&scratch_balance_mask) ) > + continue; > + > + if ( __csched_vcpu_is_migrateable(vc, cpu,&scratch_balance_mask) > +&& __csched_vcpu_should_migrate(cpu,&scratch_balance_mask, > + prv->idlers) ) > { > /* We got a candidate. Grab it! */ > TRACE_3D(TRC_CSCHED_STOLEN_VCPU, peer_cpu, > @@ -1295,7 +1438,8 @@ csched_load_balance(struct csched_privat > struct csched_vcpu *speer; > cpumask_t workers; > cpumask_t *online; > - int peer_cpu; > + int peer_cpu, peer_node, bstep; > + int node = cpu_to_node(cpu); > > BUG_ON( cpu != snext->vcpu->processor ); > online = cpupool_scheduler_cpumask(per_cpu(cpupool, cpu)); > @@ -1312,42 +1456,68 @@ csched_load_balance(struct csched_privat > SCHED_STAT_CRANK(load_balance_other); > > /* > - * Peek at non-idling CPUs in the system, starting with our > - * immediate neighbour. > + * Let's look around for work to steal, taking both vcpu-affinity > + * and node-affinity into account. More specifically, we check all > + * the non-idle CPUs' runq, looking for: > + * 1. any node-affine work to steal first, > + * 2. if not finding anything, any vcpu-affine work to steal. > */ > - cpumask_andnot(&workers, online, prv->idlers); > - cpumask_clear_cpu(cpu,&workers); > - peer_cpu = cpu; > + for_each_csched_balance_step( bstep ) > + { > + /* > + * We peek at the non-idling CPUs in a node-wise fashion. 
In fact, > + * it is more likely that we find some node-affine work on our same > + * node, not to mention that migrating vcpus within the same node > + * could well expected to be cheaper than across-nodes (memory > + * stays local, there might be some node-wide cache[s], etc.). > + */ > + peer_node = node; > + do > + { > + /* Find out what the !idle are in this node */ > + cpumask_andnot(&workers, online, prv->idlers); > + cpumask_and(&workers,&workers,&node_to_cpumask(peer_node)); > + cpumask_clear_cpu(cpu,&workers); > > - while ( !cpumask_empty(&workers) ) > - { > - peer_cpu = cpumask_cycle(peer_cpu,&workers); > - cpumask_clear_cpu(peer_cpu,&workers); > + if ( cpumask_empty(&workers) ) > + goto next_node; > > - /* > - * Get ahold of the scheduler lock for this peer CPU. > - * > - * Note: We don't spin on this lock but simply try it. Spinning could > - * cause a deadlock if the peer CPU is also load balancing and trying > - * to lock this CPU. > - */ > - if ( !pcpu_schedule_trylock(peer_cpu) ) > - { > - SCHED_STAT_CRANK(steal_trylock_failed); > - continue; > - } > + peer_cpu = cpumask_first(&workers); > + do > + { > + /* > + * Get ahold of the scheduler lock for this peer CPU. > + * > + * Note: We don't spin on this lock but simply try it. Spinning > + * could cause a deadlock if the peer CPU is also load > + * balancing and trying to lock this CPU. > + */ > + if ( !pcpu_schedule_trylock(peer_cpu) ) > + { > + SCHED_STAT_CRANK(steal_trylock_failed); > + peer_cpu = cpumask_cycle(peer_cpu,&workers); > + continue; > + } > > - /* > - * Any work over there to steal? > - */ > - speer = cpumask_test_cpu(peer_cpu, online) ? > - csched_runq_steal(peer_cpu, cpu, snext->pri) : NULL; > - pcpu_schedule_unlock(peer_cpu); > - if ( speer != NULL ) > - { > - *stolen = 1; > - return speer; > - } > + /* Any work over there to steal? */ > + speer = cpumask_test_cpu(peer_cpu, online) ? 
> +                    csched_runq_steal(peer_cpu, cpu, snext->pri, bstep) : NULL;
> +                pcpu_schedule_unlock(peer_cpu);
> +
> +                /* As soon as one vcpu is found, balancing ends */
> +                if ( speer != NULL )
> +                {
> +                    *stolen = 1;
> +                    return speer;
> +                }
> +
> +                peer_cpu = cpumask_cycle(peer_cpu, &workers);
> +
> +            } while( peer_cpu != cpumask_first(&workers) );
> +
> + next_node:
> +            peer_node = cycle_node(peer_node, node_online_map);
> +        } while( peer_node != node );
>      }
>
>   out:
> diff --git a/xen/include/xen/nodemask.h b/xen/include/xen/nodemask.h
> --- a/xen/include/xen/nodemask.h
> +++ b/xen/include/xen/nodemask.h
> @@ -41,6 +41,8 @@
>   * int last_node(mask)              Number highest set bit, or MAX_NUMNODES
>   * int first_unset_node(mask)       First node not set in mask, or
>   *                                  MAX_NUMNODES.
> + * int cycle_node(node, mask)       Next node cycling from 'node', or
> + *                                  MAX_NUMNODES
>   *
>   * nodemask_t nodemask_of_node(node) Return nodemask with bit 'node' set
>   * NODE_MASK_ALL                    Initializer - all bits set
> @@ -254,6 +256,16 @@ static inline int __first_unset_node(con
>               find_first_zero_bit(maskp->bits, MAX_NUMNODES));
>  }
>
> +#define cycle_node(n, src) __cycle_node((n), &(src), MAX_NUMNODES)
> +static inline int __cycle_node(int n, const nodemask_t *maskp, int nbits)
> +{
> +    int nxt = __next_node(n, maskp, nbits);
> +
> +    if (nxt == nbits)
> +        nxt = __first_node(maskp, nbits);
> +    return nxt;
> +}
> +
>  #define NODE_MASK_LAST_WORD BITMAP_LAST_WORD_MASK(MAX_NUMNODES)
>
>  #if MAX_NUMNODES <= BITS_PER_LONG
>
> _______________________________________________
> Xen-devel mailing list
> Xen-devel@lists.xen.org
> http://lists.xen.org/xen-devel
>
>

-- 
Juergen Gross                 Principal Developer Operating Systems
PBG PDG ES&S SWE OS6                   Telephone: +49 (0) 89 3222 2967
Fujitsu Technology Solutions              e-mail: juergen.gross@ts.fujitsu.com
Domagkstr. 28                           Internet: ts.fujitsu.com
D-80807 Muenchen                 Company details: ts.fujitsu.com/imprint.html
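
A rough, standalone sketch of the two ideas discussed in this mail, for illustration only. It is not Xen code: the cpumask is modelled as a plain 64-bit word, and every name prefixed with toy_ or TOY_ is hypothetical. It assumes the scratch mask lives in a per-pcpu scheduler structure (as suggested in the reply above) rather than in a global per_cpu() variable, and it mimics the two-step balancing from the quoted patch: try the node-affinity mask first, then fall back to plain vcpu-affinity.

```c
#include <assert.h>
#include <stdint.h>

/* Toy stand-in for cpumask_t: one bit per pcpu, at most 64 pcpus (assumption). */
typedef uint64_t toy_cpumask_t;

#define TOY_BALANCE_CPU_AFFINITY   0
#define TOY_BALANCE_NODE_AFFINITY  1
#define TOY_BALANCE_LAST           TOY_BALANCE_NODE_AFFINITY

/* Hypothetical per-pcpu scheduler data carrying its own scratch mask. */
struct toy_pcpu {
    toy_cpumask_t balance_mask;
};

/* Hypothetical, reduced view of a vcpu plus its domain's node-affinity. */
struct toy_vcpu {
    toy_cpumask_t cpu_affinity;   /* where the vcpu is allowed to run */
    toy_cpumask_t node_affinity;  /* where its domain prefers to run  */
};

/*
 * Fill *mask for the given step. Returns -1 if the node-affinity covers
 * all cpus, i.e. the node step already equals the vcpu-affinity step and
 * the caller may stop after it; returns 0 otherwise.
 */
static int toy_balance_cpumask(const struct toy_vcpu *vc, int step,
                               toy_cpumask_t all_cpus, toy_cpumask_t *mask)
{
    if ( step == TOY_BALANCE_NODE_AFFINITY )
    {
        /* Never prefer a cpu the vcpu is not allowed to run on. */
        *mask = vc->node_affinity & vc->cpu_affinity;
        if ( (vc->node_affinity & all_cpus) == all_cpus )
            return -1;
    }
    else /* step == TOY_BALANCE_CPU_AFFINITY */
        *mask = vc->cpu_affinity;

    return 0;
}

/* Lowest-numbered cpu set in mask, or -1 if the mask is empty. */
static int toy_first_cpu(toy_cpumask_t mask)
{
    for ( int cpu = 0; cpu < 64; cpu++ )
        if ( mask & ((toy_cpumask_t)1 << cpu) )
            return cpu;
    return -1;
}

/*
 * Two-step pick: prefer an idle cpu in the node-affinity mask, then fall
 * back to any idle cpu in the vcpu-affinity mask. Uses the scratch mask
 * of the calling pcpu (spc), so nothing cpumask-sized sits on the stack.
 */
static int toy_pick_idle_cpu(struct toy_pcpu *spc, const struct toy_vcpu *vc,
                             toy_cpumask_t all_cpus, toy_cpumask_t idlers)
{
    assert(spc && vc);

    for ( int step = TOY_BALANCE_LAST; step >= 0; step-- )
    {
        int ret = toy_balance_cpumask(vc, step, all_cpus, &spc->balance_mask);
        toy_cpumask_t candidates = spc->balance_mask & idlers;

        if ( candidates )
            return toy_first_cpu(candidates);
        if ( ret )
            break; /* the second step would use the very same mask */
    }
    return -1;
}
```

With four cpus, a vcpu allowed on all of them and a node-affinity covering cpus 2 and 3, an idle cpu 2 would be chosen over an idle cpu 0; cpu 0 is only picked when no preferred cpu is idle. Whether a per-pcpu scratch mask is safe in the real scheduler depends on the locking rules discussed in the patch comments, which this sketch does not model.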