From: George Dunlap
Subject: Re: [PATCH 07 of 10 [RFC]] sched_credit: Let the scheduler know about `node affinity`
Date: Fri, 27 Apr 2012 15:45:12 +0100
Message-ID: <4F9AB0F8.10102@eu.citrix.com>
In-Reply-To: <1f4b55806de9e7109ff6.1334150274@Solace>
To: Dario Faggioli
Cc: Andre Przywara, Ian Campbell, Stefano Stabellini, Juergen Gross,
 Ian Jackson, "xen-devel@lists.xen.org", Jan Beulich

On 11/04/12 14:17, Dario Faggioli wrote:
> The basic idea is: cpu-affinity tells where a vcpu can run, while
> node-affinity tells where (in terms of on what NUMA nodes, but that of
> course implies a set of CPUs) all the vcpus of the domain
> should/prefer to run... Problems start when you have both of them
> speaking at the same time!
>
> Respecting vcpu-affinity should remain the primary concern, thus what
> we do here is add some two-step logic here and there in the scheduler
> (i.e., where vcpu-affinity related decisions are being undertaken).
> During the first step, we check the `&&` of the vcpu and node affinity
> masks, aiming at giving some precedence to the CPUs that are both
> suitable and preferable for the domain. However, if that fails to find
> a valid CPU, the node-affinity is just ignored and, in the second
> step, we fall back to vcpu-affinity, as per the original behaviour.
>
> Both the introduced overhead and the benefits this provides have to be
> investigated and compared against each other, possibly in varying
> conditions and running different workloads.
>
> Finally, although still at the RFC level, efforts have been made to
> write the code in a flexible enough fashion so that, if we ever want
> to introduce further balancing steps, it shouldn't be too much of a
> pain.
>
> TODO: * Better investigate and test interactions with cpupools.
>       * Test, verify and benchmark. Then test, verify and benchmark
>         again, and again and again. What I know as of now is that it
>         does not explode that easily, but whether it works properly
>         and, more importantly, brings any benefit, has to be proven
>         (and yes, I'm already out running these tests and benchmarks,
>         but do not hesitate to volunteer to help with that :-D).
>       * Add some counters/stats, e.g., to serve the purpose outlined
>         right above.
>
> XXX I'm benchmarking this right now, and peeking at the results they
>     don't seem too bad. Numbers are mostly close to the case where the
>     VM is already pinned to a node when created. I'll post the results
>     as soon as I can, and I'll be happy to investigate any issue, if
>     you feel like the approach could be the right one.

Hey Dario,

Sorry for the long delay in reviewing this.  Overall I think the
approach is good.  A few points:

>  /*
> + * Node Balancing
> + */
> +#define CSCHED_BALANCE_NODE_AFFINITY    1
> +#define CSCHED_BALANCE_CPU_AFFINITY     0
> +#define CSCHED_BALANCE_START_STEP       CSCHED_BALANCE_NODE_AFFINITY
> +#define CSCHED_BALANCE_END_STEP         CSCHED_BALANCE_CPU_AFFINITY
> +
> +

This thing of defining "START_STEP" and "END_STEP" seems a bit
fragile.  I think it would be better to always start at 0 and go up to
CSCHED_BALANCE_MAX.
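Something along these lines is what I have in mind (just a sketch; the
exact values and the CSCHED_BALANCE_MAX name are only a suggestion):

    /* Balancing steps, counted up from 0; node affinity is tried first. */
    #define CSCHED_BALANCE_NODE_AFFINITY  0
    #define CSCHED_BALANCE_CPU_AFFINITY   1
    /* One past the last valid step, so loops can just test against it. */
    #define CSCHED_BALANCE_MAX            2

    int balance_step;

    for ( balance_step = 0; balance_step < CSCHED_BALANCE_MAX; balance_step++ )
    {
        /* Build the cpumask for this step and try to place the vcpu. */
    }

That way, adding another balancing step later is just a matter of
inserting another #define and bumping CSCHED_BALANCE_MAX.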
> +    /*
> +     * Let's cache the domain's dom->node_affinity here as an
> +     * optimization for a couple of hot paths. In fact, knowing
> +     * whether or not dom->node_affinity has changed would allow
> +     * us to avoid rebuilding node_affinity_cpumask (below) during
> +     * node balancing and/or scheduling.
> +     */
> +    nodemask_t node_affinity_cache;
> +    /* Based on what dom->node_affinity says,
> +     * on what CPUs would we like to run most? */
> +    cpumask_t node_affinity_cpumask;

I think the comments here need to be clearer.  The main points are:
 * node_affinity_cpumask is dom->node_affinity translated from a
   nodemask into a cpumask
 * Because doing the nodemask -> cpumask translation may be expensive,
   node_affinity_cache stores the last translated value, so we can
   avoid redoing the translation if nothing has changed.

> +#define csched_balance(__step) \
> +    for ( (__step) = CSCHED_BALANCE_START_STEP; \
> +          (__step) >= CSCHED_BALANCE_END_STEP;  \
> +          (__step)-- )

I don't like this.  For one, it's fragile; if for some reason you
switched NODE_AFFINITY and CPU_AFFINITY above, suddenly this loop
would go the wrong direction.  I don't think there's any reason not to
just use "for ( step = 0; step < CSCHED_BALANCE_MAX; step++ )" -- it's
not ugly, and it means you know exactly what's going on without having
to go search for the macro.

> +
> +/*
> + * Sort-of conversion between node-affinity and vcpu-affinity for the
> + * domain, i.e., a cpumask containing all the cpus from all the set
> + * nodes in the node-affinity mask of the domain.

This needs to be clearer -- vcpu-affinity doesn't have anything to do
with this function, and there's nothing "sort-of" about the
conversion. :-)  I think you mean to say, "Create a cpumask from the
node affinity mask."

> + *
> + * Notice that we completely forget about serializing this (both here
> + * and in the various sites where node_affinity_cpumask is used)
> + * against d->node_affinity_lock. That would be hard to do properly,
> + * as that lock is a non IRQ-safe one, and it would be nearly
> + * impossible to access it from the scheduling code. However, although
> + * this leaves some room for races with code paths that update
> + * d->node_affinity, it all should still be fine, considering:
> + *  (1) d->node_affinity updates are going to be quite rare;
> + *  (2) this balancing logic is kind-of "best effort" anyway;
> + *  (3) given (1), any inconsistencies are likely to be fixed by the
> + *      next call to this same function without risking going into
> + *      instability.
> + *
> + * XXX Validate (via testing/benchmarking) whether this is true, as it
> + *     likely sounds to be, or whether it causes unexpected issues.

Ack.

> +/* Put together the cpumask we are going to use for this balancing step */
> +static int
> +csched_balance_cpumask(const struct vcpu *vc, int step, cpumask_t *mask)
> +{
> +    struct domain *d = vc->domain;
> +    struct csched_dom *sdom = CSCHED_DOM(d);
> +
> +    /*
> +     * Only two possible steps exist for now: node and vcpu affinity.
> +     * If this domain does not have a node-affinity or is affine to
> +     * all the nodes, just return the vcpu-affinity mask (for *this*
> +     * vcpu) and stop the balancing process.
> +     */
> +    if ( !d->has_node_affinity || nodes_full(d->node_affinity) ||
> +         step == CSCHED_BALANCE_CPU_AFFINITY )
> +    {
> +        cpumask_copy(mask, vc->cpu_affinity);
> +        return CSCHED_BALANCE_CPU_AFFINITY;
> +    }

Hmm -- this mechanism seems kind of fragile.  It seems like returning
-1 or something, and having the caller call "continue", would be
easier for future coders (potentially you or me) to reason about.
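Roughly what I'm picturing is something like this (just a sketch, not
even compile-tested; it assumes the 0-based step numbering from above,
and names like scratch_mask are only placeholders):

    /*
     * Fill 'mask' for this balancing step.  Returns 0 on success, or
     * -1 if the step does not apply to this domain (e.g., the
     * node-affinity step for a domain with no node-affinity).
     */
    static int
    csched_balance_cpumask(const struct vcpu *vc, int step, cpumask_t *mask)
    {
        struct domain *d = vc->domain;
        struct csched_dom *sdom = CSCHED_DOM(d);

        if ( step == CSCHED_BALANCE_NODE_AFFINITY )
        {
            if ( !d->has_node_affinity || nodes_full(d->node_affinity) )
                return -1;
            /* CPUs that are both suitable and preferred for the domain. */
            cpumask_and(mask, sdom->node_affinity_cpumask, vc->cpu_affinity);
        }
        else
            cpumask_copy(mask, vc->cpu_affinity);

        return 0;
    }

and then at the call sites:

    for ( balance_step = 0; balance_step < CSCHED_BALANCE_MAX; balance_step++ )
    {
        if ( csched_balance_cpumask(vc, balance_step, &scratch_mask) )
            continue;
        /* ...do the placement / tickling with scratch_mask... */
    }

That keeps the "does this step apply?" decision inside
csched_balance_cpumask(), and the callers just skip the steps that
don't apply.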
> +    /*
> +     * XXX As of now (with only two possible steps) this is easy and
> +     *     readable enough. If more steps become necessary at some
> +     *     point in time, we can keep an array of cpumask_t in
> +     *     dom_data and return the proper element via step-indexing
> +     *     such an array. Of course, we could do it that way even
> +     *     right now... Thoughts?
> +     */

Beware of early optimization. :-)  Just make sure to mark this down as
something to look at when profiling.

>  static inline void
> +__cpumask_tickle(cpumask_t *mask, const cpumask_t *idle_mask)
> +{
> +    CSCHED_STAT_CRANK(tickle_idlers_some);
> +    if ( opt_tickle_one_idle )
> +    {
> +        this_cpu(last_tickle_cpu) =
> +            cpumask_cycle(this_cpu(last_tickle_cpu), idle_mask);
> +        cpumask_set_cpu(this_cpu(last_tickle_cpu), mask);
> +    }
> +    else
> +        cpumask_or(mask, mask, idle_mask);
> +}

I don't see any reason to make this into a function -- it's only
called once, and it's not that long.  Unless you're concerned about
too many indentation levels making the lines too short?

> +    csched_balance(balance_step)
>      {

Again, I'd make this a for() loop.

>      sdom->dom = dom;
> +    /*
> +     * XXX This would be 'The Right Thing', but as it is still too
> +     *     early and d->node_affinity has not settled yet, maybe we
> +     *     can just init the two masks with something like all-nodes
> +     *     and all-cpus and rely on the first balancing call for
> +     *     having them updated?
> +     */
> +    csched_build_balance_cpumask(sdom);

We might as well do what you've got here, unless it's likely to
produce garbage.  This isn't exactly a hot path. :-)

Other than that, looks good -- Thanks!
 -George