From: George Dunlap
Subject: Re: [PATCH 03 of 10 v2] xen: sched_credit: let the scheduler know about node-affinity
Date: Thu, 20 Dec 2012 20:21:45 +0000
Message-ID: <50D37359.9080001@eu.citrix.com>
In-Reply-To: <06d2f322a6319d8ba212.1355944039@Solace>
To: Dario Faggioli
Cc: Marcus Granado, Dan Magenheimer, Ian Campbell, Anil Madhavapeddy, Andrew Cooper, Juergen Gross, Ian Jackson, xen-devel@lists.xen.org, Jan Beulich, Daniel De Graaf, Matt Wilson
List-Id: xen-devel@lists.xenproject.org

On 19/12/12 19:07, Dario Faggioli wrote:
> +static inline int
> +__csched_vcpu_should_migrate(int cpu, cpumask_t *mask, cpumask_t *idlers)
> +{
> +    /*
> +     * Consent to migration if cpu is one of the idlers in the VCPU's
> +     * affinity mask. In fact, if that is not the case, it just means it
> +     * was some other CPU that was tickled and should hence come and pick
> +     * VCPU up. Migrating it to cpu would only make things worse.
> +     */
> +    return cpumask_test_cpu(cpu, idlers) && cpumask_test_cpu(cpu, mask);
> +}
>
>  static int
> @@ -493,85 +599,98 @@ static int
>      cpumask_t idlers;
>      cpumask_t *online;
>      struct csched_pcpu *spc = NULL;
> +    int ret, balance_step;
>      int cpu;
>
> -    /*
> -     * Pick from online CPUs in VCPU's affinity mask, giving a
> -     * preference to its current processor if it's in there.
> -     */
>      online = cpupool_scheduler_cpumask(vc->domain->cpupool);
> -    cpumask_and(&cpus, online, vc->cpu_affinity);
> -    cpu = cpumask_test_cpu(vc->processor, &cpus)
> -             ? vc->processor
> -             : cpumask_cycle(vc->processor, &cpus);
> -    ASSERT( !cpumask_empty(&cpus) && cpumask_test_cpu(cpu, &cpus) );
> +    for_each_csched_balance_step( balance_step )
> +    {
> +        /* Pick an online CPU from the proper affinity mask */
> +        ret = csched_balance_cpumask(vc, balance_step, &cpus);
> +        cpumask_and(&cpus, &cpus, online);
>
> -    /*
> -     * Try to find an idle processor within the above constraints.
> -     *
> -     * In multi-core and multi-threaded CPUs, not all idle execution
> -     * vehicles are equal!
> -     *
> -     * We give preference to the idle execution vehicle with the most
> -     * idling neighbours in its grouping. This distributes work across
> -     * distinct cores first and guarantees we don't do something stupid
> -     * like run two VCPUs on co-hyperthreads while there are idle cores
> -     * or sockets.
> -     *
> -     * Notice that, when computing the "idleness" of cpu, we may want to
> -     * discount vc. That is, iff vc is the currently running and the only
> -     * runnable vcpu on cpu, we add cpu to the idlers.
> -     */
> -    cpumask_and(&idlers, &cpu_online_map, CSCHED_PRIV(ops)->idlers);
> -    if ( vc->processor == cpu && IS_RUNQ_IDLE(cpu) )
> -        cpumask_set_cpu(cpu, &idlers);
> -    cpumask_and(&cpus, &cpus, &idlers);
> -    cpumask_clear_cpu(cpu, &cpus);
> +        /* If present, prefer vc's current processor */
> +        cpu = cpumask_test_cpu(vc->processor, &cpus)
> +                 ? vc->processor
> +                 : cpumask_cycle(vc->processor, &cpus);
> +        ASSERT( !cpumask_empty(&cpus) && cpumask_test_cpu(cpu, &cpus) );
>
> -    while ( !cpumask_empty(&cpus) )
> -    {
> -        cpumask_t cpu_idlers;
> -        cpumask_t nxt_idlers;
> -        int nxt, weight_cpu, weight_nxt;
> -        int migrate_factor;
> +        /*
> +         * Try to find an idle processor within the above constraints.
> +         *
> +         * In multi-core and multi-threaded CPUs, not all idle execution
> +         * vehicles are equal!
> +         *
> +         * We give preference to the idle execution vehicle with the most
> +         * idling neighbours in its grouping. This distributes work across
> +         * distinct cores first and guarantees we don't do something stupid
> +         * like run two VCPUs on co-hyperthreads while there are idle cores
> +         * or sockets.
> +         *
> +         * Notice that, when computing the "idleness" of cpu, we may want to
> +         * discount vc. That is, iff vc is the currently running and the only
> +         * runnable vcpu on cpu, we add cpu to the idlers.
> +         */
> +        cpumask_and(&idlers, &cpu_online_map, CSCHED_PRIV(ops)->idlers);
> +        if ( vc->processor == cpu && IS_RUNQ_IDLE(cpu) )
> +            cpumask_set_cpu(cpu, &idlers);
> +        cpumask_and(&cpus, &cpus, &idlers);
> +        /* If there are idlers and cpu is still not among them, pick one */
> +        if ( !cpumask_empty(&cpus) && !cpumask_test_cpu(cpu, &cpus) )
> +            cpu = cpumask_cycle(cpu, &cpus);

This seems to be an addition to the algorithm, and it's particularly well hidden in this kind of "indent a big section that's almost exactly the same" change. I think it at least needs to be called out in the changelog message, and perhaps put in a separate patch. Can you comment on why you think it's necessary? Was there a particular problem you were seeing?
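Just to make sure I'm reading those two lines correctly, here is my paraphrase of what I think they do at that point (my reading, not anything taken from the patch or its description):

    /*
     * (My reading, not the patch's wording.)  By this point &cpus has
     * been narrowed down to the *idle* CPUs compatible with this
     * balance step's affinity mask.  So if the current candidate
     * (vc->processor, or the first suitable CPU after it) is busy
     * while a compatible idler exists, re-bias the search onto one of
     * the idlers before the weight-comparison loop below runs.
     */
    if ( !cpumask_empty(&cpus) && !cpumask_test_cpu(cpu, &cpus) )
        cpu = cpumask_cycle(cpu, &cpus);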
> +        cpumask_clear_cpu(cpu, &cpus);
>
> -        nxt = cpumask_cycle(cpu, &cpus);
> +        while ( !cpumask_empty(&cpus) )
> +        {
> +            cpumask_t cpu_idlers;
> +            cpumask_t nxt_idlers;
> +            int nxt, weight_cpu, weight_nxt;
> +            int migrate_factor;
>
> -        if ( cpumask_test_cpu(cpu, per_cpu(cpu_core_mask, nxt)) )
> -        {
> -            /* We're on the same socket, so check the busy-ness of threads.
> -             * Migrate if # of idlers is less at all */
> -            ASSERT( cpumask_test_cpu(nxt, per_cpu(cpu_core_mask, cpu)) );
> -            migrate_factor = 1;
> -            cpumask_and(&cpu_idlers, &idlers, per_cpu(cpu_sibling_mask, cpu));
> -            cpumask_and(&nxt_idlers, &idlers, per_cpu(cpu_sibling_mask, nxt));
> -        }
> -        else
> -        {
> -            /* We're on different sockets, so check the busy-ness of cores.
> -             * Migrate only if the other core is twice as idle */
> -            ASSERT( !cpumask_test_cpu(nxt, per_cpu(cpu_core_mask, cpu)) );
> -            migrate_factor = 2;
> -            cpumask_and(&cpu_idlers, &idlers, per_cpu(cpu_core_mask, cpu));
> -            cpumask_and(&nxt_idlers, &idlers, per_cpu(cpu_core_mask, nxt));
> +            nxt = cpumask_cycle(cpu, &cpus);
> +
> +            if ( cpumask_test_cpu(cpu, per_cpu(cpu_core_mask, nxt)) )
> +            {
> +                /* We're on the same socket, so check the busy-ness of threads.
> +                 * Migrate if # of idlers is less at all */
> +                ASSERT( cpumask_test_cpu(nxt, per_cpu(cpu_core_mask, cpu)) );
> +                migrate_factor = 1;
> +                cpumask_and(&cpu_idlers, &idlers, per_cpu(cpu_sibling_mask,
> +                                                          cpu));
> +                cpumask_and(&nxt_idlers, &idlers, per_cpu(cpu_sibling_mask,
> +                                                          nxt));
> +            }
> +            else
> +            {
> +                /* We're on different sockets, so check the busy-ness of cores.
> +                 * Migrate only if the other core is twice as idle */
> +                ASSERT( !cpumask_test_cpu(nxt, per_cpu(cpu_core_mask, cpu)) );
> +                migrate_factor = 2;
> +                cpumask_and(&cpu_idlers, &idlers, per_cpu(cpu_core_mask, cpu));
> +                cpumask_and(&nxt_idlers, &idlers, per_cpu(cpu_core_mask, nxt));
> +            }
> +
> +            weight_cpu = cpumask_weight(&cpu_idlers);
> +            weight_nxt = cpumask_weight(&nxt_idlers);
> +            /* smt_power_savings: consolidate work rather than spreading it */
> +            if ( sched_smt_power_savings ?
> +                 weight_cpu > weight_nxt :
> +                 weight_cpu * migrate_factor < weight_nxt )
> +            {
> +                cpumask_and(&nxt_idlers, &cpus, &nxt_idlers);
> +                spc = CSCHED_PCPU(nxt);
> +                cpu = cpumask_cycle(spc->idle_bias, &nxt_idlers);
> +                cpumask_andnot(&cpus, &cpus, per_cpu(cpu_sibling_mask, cpu));
> +            }
> +            else
> +            {
> +                cpumask_andnot(&cpus, &cpus, &nxt_idlers);
> +            }
> +        }
>
> -        weight_cpu = cpumask_weight(&cpu_idlers);
> -        weight_nxt = cpumask_weight(&nxt_idlers);
> -        /* smt_power_savings: consolidate work rather than spreading it */
> -        if ( sched_smt_power_savings ?
> -             weight_cpu > weight_nxt :
> -             weight_cpu * migrate_factor < weight_nxt )
> -        {
> -            cpumask_and(&nxt_idlers, &cpus, &nxt_idlers);
> -            spc = CSCHED_PCPU(nxt);
> -            cpu = cpumask_cycle(spc->idle_bias, &nxt_idlers);
> -            cpumask_andnot(&cpus, &cpus, per_cpu(cpu_sibling_mask, cpu));
> -        }
> -        else
> -        {
> -            cpumask_andnot(&cpus, &cpus, &nxt_idlers);
> -        }
> +        /* Stop if cpu is idle (or if csched_balance_cpumask() says we can) */
> +        if ( cpumask_test_cpu(cpu, &idlers) || ret )
> +            break;

Right -- OK, I think everything looks good here, except the "return -1 from csched_balance_cpumask" thing. I think it would be better if we explicitly checked cpumask_full(...->node_affinity_cpumask) and skipped the NODE step if that's the case.
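Something like this is what I have in mind -- just a sketch against what I can see of this patch, and I'm guessing at the name of the balance-step constant and at the mask hanging off the csched_dom, so adjust as needed:

    for_each_csched_balance_step( balance_step )
    {
        /*
         * (Names here are guesses from the patch, not gospel.)  A full
         * node-affinity mask can't rule any CPU out, so the NODE step
         * would do exactly what the CPU step is going to do anyway.
         * Skip it explicitly, rather than having
         * csched_balance_cpumask() flag it with a special return value.
         */
        if ( balance_step == CSCHED_BALANCE_NODE_AFFINITY
             && cpumask_full(CSCHED_DOM(vc->domain)->node_affinity_cpumask) )
            continue;

        csched_balance_cpumask(vc, balance_step, &cpus);
        cpumask_and(&cpus, &cpus, online);

        /* ... the rest of the loop as in this patch, minus the ret plumbing ... */
    }

That way the "can we stop now" check at the bottom of the loop only has to ask whether cpu ended up idle, instead of relying on a magic return value.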
Also -- and sorry to have to ask this kind of thing, but after sorting through the placement algorithm my head hurts -- under what circumstances would "cpumask_test_cpu(cpu, &idlers)" be false at this point? It seems like the only possibility would be if:

  ( (vc->processor was not in the original &cpus [1])
    || !IS_RUNQ_IDLE(vc->processor) )
  && (there are no idlers in the original &cpus)

Which I suppose probably matches the point at which we want to move on from looking at NODE affinity and start looking at CPU affinity.

[1] This could happen either if the vcpu/node affinity has changed, or if we're currently running outside our node affinity and we're doing the NODE step.

OK -- I think I've convinced myself that this is OK as well (apart from the hidden check). I'll come back to look at your response to the load balancing thing tomorrow.

 -George