From: George Dunlap
Subject: Re: [PATCH 03 of 10 v2] xen: sched_credit: let the scheduler know about node-affinity
Date: Thu, 20 Dec 2012 16:48:13 +0000
Message-ID: <50D3414D.8080901@eu.citrix.com>
In-Reply-To: <06d2f322a6319d8ba212.1355944039@Solace>
References: <06d2f322a6319d8ba212.1355944039@Solace>
To: Dario Faggioli
Cc: Marcus Granado, Dan Magenheimer, Ian Campbell, Anil Madhavapeddy,
    Andrew Cooper, Juergen Gross, Ian Jackson, "xen-devel@lists.xen.org",
    Jan Beulich, Daniel De Graaf, Matt Wilson
List-Id: xen-devel@lists.xenproject.org

On 19/12/12 19:07, Dario Faggioli wrote:
>   static inline int
> -__csched_vcpu_is_migrateable(struct vcpu *vc, int dest_cpu)
> +__csched_vcpu_is_migrateable(struct vcpu *vc, int dest_cpu, cpumask_t *mask)
>   {
>       /*
>        * Don't pick up work that's in the peer's scheduling tail or hot on
> -     * peer PCPU. Only pick up work that's allowed to run on our CPU.
> +     * peer PCPU. Only pick up work that prefers and/or is allowed to run
> +     * on our CPU.
>        */
>       return !vc->is_running &&
>              !__csched_vcpu_is_cache_hot(vc) &&
> -           cpumask_test_cpu(dest_cpu, vc->cpu_affinity);
> +           cpumask_test_cpu(dest_cpu, mask);
> +}
> +
> +static inline int
> +__csched_vcpu_should_migrate(int cpu, cpumask_t *mask, cpumask_t *idlers)
> +{
> +    /*
> +     * Consent to migration if cpu is one of the idlers in the VCPU's
> +     * affinity mask. In fact, if that is not the case, it just means it
> +     * was some other CPU that was tickled and should hence come and pick
> +     * VCPU up. Migrating it to cpu would only make things worse.
> +     */
> +    return cpumask_test_cpu(cpu, idlers) && cpumask_test_cpu(cpu, mask);
>   }

I don't get what this function is for.  The only time you call it is in
csched_runq_steal(), immediately after calling
__csched_vcpu_is_migrateable().  But is_migrateable() has already checked
cpumask_test_cpu(cpu, mask).  So why do we need to check it again?

We could just replace this with cpumask_test_cpu(cpu, prv->idlers).

But that clause is going to be either true or false for every single
iteration of all the loops, including the loops in csched_load_balance().
Wouldn't it make more sense to check it once in csched_load_balance(),
rather than doing all those nested loops?

And in any case, looking at the caller of csched_load_balance(), it
explicitly says to steal work if the next thing on the runqueue of cpu
has a priority of TS_OVER.  That was chosen for a reason -- if you want
to change that, you should change it there at the top (and make a
justification for doing so), not deeply nested in a function like this.

Or am I completely missing something?
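Just so we're talking about the same thing, the "check it once" variant I
have in mind would be something like the below.  It's only a sketch, not
even compile-tested, and I'm assuming the "nothing to steal" exit path of
csched_load_balance() can be reached via an "out"-style label -- adjust to
whatever the function actually has there:

     /* In csched_load_balance(), once, before scanning the peer CPUs: */
     if ( !cpumask_test_cpu(cpu, prv->idlers) )
     {
         /*
          * Some other (idle) CPU was tickled and is the one that should
          * come and pick the work up; pulling vcpus onto this CPU would
          * only make matters worse, so don't bother scanning the peers.
          */
         goto out;
     }

With that in place, __csched_vcpu_should_migrate() could go away entirely,
and csched_runq_steal() would only need its usual
__csched_vcpu_is_migrateable() check (now on the per-step mask).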
>   static struct csched_vcpu *
> -csched_runq_steal(int peer_cpu, int cpu, int pri)
> +csched_runq_steal(int peer_cpu, int cpu, int pri, int balance_step)
>   {
>       const struct csched_pcpu * const peer_pcpu = CSCHED_PCPU(peer_cpu);
> +    struct csched_private *prv = CSCHED_PRIV(per_cpu(scheduler, peer_cpu));
>       const struct vcpu * const peer_vcpu = curr_on_cpu(peer_cpu);
>       struct csched_vcpu *speer;
>       struct list_head *iter;
> @@ -1265,11 +1395,24 @@ csched_runq_steal(int peer_cpu, int cpu,
>           if ( speer->pri <= pri )
>               break;
>
> -        /* Is this VCPU is runnable on our PCPU? */
>           vc = speer->vcpu;
>           BUG_ON( is_idle_vcpu(vc) );
>
> -        if (__csched_vcpu_is_migrateable(vc, cpu))
> +        /*
> +         * Retrieve the correct mask for this balance_step or, if we're
> +         * dealing with node-affinity and the vcpu has no node affinity
> +         * at all, just skip this vcpu. That is needed if we want to
> +         * check if we have any node-affine work to steal first (wrt
> +         * any vcpu-affine work).
> +         */
> +        if ( csched_balance_cpumask(vc, balance_step,
> +                                    &scratch_balance_mask) )
> +            continue;

Again, I think for clarity the best thing to do here is:

     if ( balance_step == NODE
          && cpumask_full(speer->sdom->node_affinity_cpumask) )
         continue;

     csched_balance_cpumask();

     /* Etc. */
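Spelling that out a bit more (again, only a sketch; "NODE" is just
shorthand for whatever the node-affinity balance-step constant ends up
being called), the body of the loop would read something like:

         vc = speer->vcpu;
         BUG_ON( is_idle_vcpu(vc) );

         /*
          * In the node-affinity step, a vcpu whose node-affinity spans
          * all nodes has nothing node-specific to offer; it will be
          * considered again in the vcpu-affinity step anyway.
          */
         if ( balance_step == NODE
              && cpumask_full(speer->sdom->node_affinity_cpumask) )
             continue;

         csched_balance_cpumask(vc, balance_step, &scratch_balance_mask);

         if ( !__csched_vcpu_is_migrateable(vc, cpu, &scratch_balance_mask) )
             continue;

         /* We got a candidate: grab and return it, as the code already does. */

That way the "no node-affinity at all" special case is visible right where
it matters, rather than being buried inside csched_balance_cpumask().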
> @@ -1295,7 +1438,8 @@ csched_load_balance(struct csched_privat
>       struct csched_vcpu *speer;
>       cpumask_t workers;
>       cpumask_t *online;
> -    int peer_cpu;
> +    int peer_cpu, peer_node, bstep;
> +    int node = cpu_to_node(cpu);
>
>       BUG_ON( cpu != snext->vcpu->processor );
>       online = cpupool_scheduler_cpumask(per_cpu(cpupool, cpu));
> @@ -1312,42 +1456,68 @@ csched_load_balance(struct csched_privat
>           SCHED_STAT_CRANK(load_balance_other);
>
>       /*
> -     * Peek at non-idling CPUs in the system, starting with our
> -     * immediate neighbour.
> +     * Let's look around for work to steal, taking both vcpu-affinity
> +     * and node-affinity into account. More specifically, we check all
> +     * the non-idle CPUs' runq, looking for:
> +     *  1. any node-affine work to steal first,
> +     *  2. if not finding anything, any vcpu-affine work to steal.
>        */
> -    cpumask_andnot(&workers, online, prv->idlers);
> -    cpumask_clear_cpu(cpu, &workers);
> -    peer_cpu = cpu;
> +    for_each_csched_balance_step( bstep )
> +    {
> +        /*
> +         * We peek at the non-idling CPUs in a node-wise fashion. In fact,
> +         * it is more likely that we find some node-affine work on our same
> +         * node, not to mention that migrating vcpus within the same node
> +         * could well expected to be cheaper than across-nodes (memory
> +         * stays local, there might be some node-wide cache[s], etc.).
> +         */
> +        peer_node = node;
> +        do
> +        {
> +            /* Find out what the !idle are in this node */
> +            cpumask_andnot(&workers, online, prv->idlers);
> +            cpumask_and(&workers, &workers, &node_to_cpumask(peer_node));
> +            cpumask_clear_cpu(cpu, &workers);
>
> -    while ( !cpumask_empty(&workers) )
> -    {
> -        peer_cpu = cpumask_cycle(peer_cpu, &workers);
> -        cpumask_clear_cpu(peer_cpu, &workers);
> +            if ( cpumask_empty(&workers) )
> +                goto next_node;
>
> -        /*
> -         * Get ahold of the scheduler lock for this peer CPU.
> -         *
> -         * Note: We don't spin on this lock but simply try it. Spinning could
> -         * cause a deadlock if the peer CPU is also load balancing and trying
> -         * to lock this CPU.
> -         */
> -        if ( !pcpu_schedule_trylock(peer_cpu) )
> -        {
> -            SCHED_STAT_CRANK(steal_trylock_failed);
> -            continue;
> -        }
> +            peer_cpu = cpumask_first(&workers);
> +            do
> +            {
> +                /*
> +                 * Get ahold of the scheduler lock for this peer CPU.
> +                 *
> +                 * Note: We don't spin on this lock but simply try it. Spinning
> +                 *       could cause a deadlock if the peer CPU is also load
> +                 *       balancing and trying to lock this CPU.
> +                 */
> +                if ( !pcpu_schedule_trylock(peer_cpu) )
> +                {
> +                    SCHED_STAT_CRANK(steal_trylock_failed);
> +                    peer_cpu = cpumask_cycle(peer_cpu, &workers);
> +                    continue;
> +                }
>
> -        /*
> -         * Any work over there to steal?
> -         */
> -        speer = cpumask_test_cpu(peer_cpu, online) ?
> -            csched_runq_steal(peer_cpu, cpu, snext->pri) : NULL;
> -        pcpu_schedule_unlock(peer_cpu);
> -        if ( speer != NULL )
> -        {
> -            *stolen = 1;
> -            return speer;
> -        }
> +                /* Any work over there to steal? */
> +                speer = cpumask_test_cpu(peer_cpu, online) ?
> +                    csched_runq_steal(peer_cpu, cpu, snext->pri, bstep) : NULL;
> +                pcpu_schedule_unlock(peer_cpu);
> +
> +                /* As soon as one vcpu is found, balancing ends */
> +                if ( speer != NULL )
> +                {
> +                    *stolen = 1;
> +                    return speer;
> +                }
> +
> +                peer_cpu = cpumask_cycle(peer_cpu, &workers);
> +
> +            } while( peer_cpu != cpumask_first(&workers) );
> +
> + next_node:
> +            peer_node = cycle_node(peer_node, node_online_map);
> +        } while( peer_node != node );
>   }

These changes all look right.  But then, I'm a bit tired, so I'll give it
another once-over tomorrow. :-)

[To be continued]