From: George Dunlap
Subject: Re: [PATCH 03 of 10 v2] xen: sched_credit: let the scheduler know about node-affinity
Date: Fri, 21 Dec 2012 14:56:24 +0000
Message-ID: <50D47898.2060909@eu.citrix.com>
References: <06d2f322a6319d8ba212.1355944039@Solace> <50D37359.9080001@eu.citrix.com> <1356049122.15403.34.camel@Abyss>
In-Reply-To: <1356049122.15403.34.camel@Abyss>
To: Dario Faggioli
Cc: Marcus Granado, Dan Magenheimer, Ian Campbell, Anil Madhavapeddy, Andrew Cooper, Juergen Gross, Ian Jackson, "xen-devel@lists.xen.org", Jan Beulich, Daniel De Graaf, Matt Wilson
List-Id: xen-devel@lists.xenproject.org

On 21/12/12 00:18, Dario Faggioli wrote:
> On Thu, 2012-12-20 at 20:21 +0000, George Dunlap wrote:
>>> -    /*
>>> -     * Pick from online CPUs in VCPU's affinity mask, giving a
>>> -     * preference to its current processor if it's in there.
>>> -     */
>>>      online = cpupool_scheduler_cpumask(vc->domain->cpupool);
>>> -    cpumask_and(&cpus, online, vc->cpu_affinity);
>>> -    cpu = cpumask_test_cpu(vc->processor, &cpus)
>>> -            ? vc->processor
>>> -            : cpumask_cycle(vc->processor, &cpus);
>>> -    ASSERT( !cpumask_empty(&cpus) && cpumask_test_cpu(cpu, &cpus) );
>>> +    for_each_csched_balance_step( balance_step )
>>> +    {
>>> +        /* Pick an online CPU from the proper affinity mask */
>>> +        ret = csched_balance_cpumask(vc, balance_step, &cpus);
>>> +        cpumask_and(&cpus, &cpus, online);
>>>
>>> -    /*
>>> -     * Try to find an idle processor within the above constraints.
>>> -     *
>>> -     * In multi-core and multi-threaded CPUs, not all idle execution
>>> -     * vehicles are equal!
>>> -     *
>>> -     * We give preference to the idle execution vehicle with the most
>>> -     * idling neighbours in its grouping. This distributes work across
>>> -     * distinct cores first and guarantees we don't do something stupid
>>> -     * like run two VCPUs on co-hyperthreads while there are idle cores
>>> -     * or sockets.
>>> -     *
>>> -     * Notice that, when computing the "idleness" of cpu, we may want to
>>> -     * discount vc. That is, iff vc is the currently running and the only
>>> -     * runnable vcpu on cpu, we add cpu to the idlers.
>>> -     */
>>> -    cpumask_and(&idlers, &cpu_online_map, CSCHED_PRIV(ops)->idlers);
>>> -    if ( vc->processor == cpu && IS_RUNQ_IDLE(cpu) )
>>> -        cpumask_set_cpu(cpu, &idlers);
>>> -    cpumask_and(&cpus, &cpus, &idlers);
>>> -    cpumask_clear_cpu(cpu, &cpus);
>>> +        /* If present, prefer vc's current processor */
>>> +        cpu = cpumask_test_cpu(vc->processor, &cpus)
>>> +                ? vc->processor
>>> +                : cpumask_cycle(vc->processor, &cpus);
>>> +        ASSERT( !cpumask_empty(&cpus) && cpumask_test_cpu(cpu, &cpus) );
>>>
>>> -    while ( !cpumask_empty(&cpus) )
>>> -    {
>>> -        cpumask_t cpu_idlers;
>>> -        cpumask_t nxt_idlers;
>>> -        int nxt, weight_cpu, weight_nxt;
>>> -        int migrate_factor;
>>> +        /*
>>> +         * Try to find an idle processor within the above constraints.
>>> +         *
>>> +         * In multi-core and multi-threaded CPUs, not all idle execution
>>> +         * vehicles are equal!
>>> +         *
>>> +         * We give preference to the idle execution vehicle with the most
>>> +         * idling neighbours in its grouping. This distributes work across
>>> +         * distinct cores first and guarantees we don't do something stupid
>>> +         * like run two VCPUs on co-hyperthreads while there are idle cores
>>> +         * or sockets.
>>> +         *
>>> +         * Notice that, when computing the "idleness" of cpu, we may want to
>>> +         * discount vc. That is, iff vc is the currently running and the only
>>> +         * runnable vcpu on cpu, we add cpu to the idlers.
>>> +         */
>>> +        cpumask_and(&idlers, &cpu_online_map, CSCHED_PRIV(ops)->idlers);
>>> +        if ( vc->processor == cpu && IS_RUNQ_IDLE(cpu) )
>>> +            cpumask_set_cpu(cpu, &idlers);
>>> +        cpumask_and(&cpus, &cpus, &idlers);
>>> +        /* If there are idlers and cpu is still not among them, pick one */
>>> +        if ( !cpumask_empty(&cpus) && !cpumask_test_cpu(cpu, &cpus) )
>>> +            cpu = cpumask_cycle(cpu, &cpus);
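
The hunk above relies on two helpers that are not visible in this
context, for_each_csched_balance_step() and csched_balance_cpumask().
As a rough, standalone illustration of the two-step idea only -- the
helper names come from the patch, but the bodies, the toy types and the
example masks below are guesses, with plain 32-bit integers standing in
for Xen's cpumask_t -- the balancing loop works along these lines:

/*
 * Toy model of the two-step affinity balancing discussed here.
 * Bit n of a mask stands for pcpu n; nothing below is Xen code.
 */
#include <stdio.h>
#include <stdint.h>

#define BALANCE_NODE_AFFINITY 0
#define BALANCE_CPU_AFFINITY  1

struct toy_vcpu {
    uint32_t cpu_affinity;   /* pcpus the vcpu may run on */
    uint32_t node_affinity;  /* pcpus of the node(s) it prefers */
};

/* Guess at what csched_balance_cpumask() does: node-affinity first,
 * falling back to plain cpu affinity if the intersection is empty. */
static uint32_t balance_mask(const struct toy_vcpu *v, int step)
{
    if ( step == BALANCE_NODE_AFFINITY )
    {
        uint32_t m = v->cpu_affinity & v->node_affinity;
        return m ? m : v->cpu_affinity;
    }
    return v->cpu_affinity;
}

int main(void)
{
    struct toy_vcpu v = {
        .cpu_affinity  = 0xff,  /* may run on pcpus 0-7 */
        .node_affinity = 0x0f,  /* but prefers pcpus 0-3 */
    };
    uint32_t online = 0xf0;     /* only pcpus 4-7 are in the pool */
    int step;

    /* Mirrors the role of for_each_csched_balance_step(): try the
     * node-affinity mask first, then the wider cpu-affinity mask. */
    for ( step = BALANCE_NODE_AFFINITY; step <= BALANCE_CPU_AFFINITY; step++ )
    {
        uint32_t cpus = balance_mask(&v, step) & online;
        printf("step %d: candidate pcpus = 0x%02x\n", step, (unsigned)cpus);
        if ( cpus )  /* the real code would now weigh idlers and pick */
            break;
    }
    return 0;
}

With these made-up masks the node step finds no online candidate, so
the loop falls through to the plain-affinity step -- the two-pass
behaviour the hunk introduces into _csched_cpu_pick().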
>> This seems to be an addition to the algorithm -- particularly as it's
>> hidden in this kind of "indent a big section that's almost exactly the
>> same" change -- so I think it at least needs to be called out in the
>> changelog message, and perhaps put in a separate patch.
>>
> You're right, it is an addition, although a minor enough one (at least
> from the amount-of-code point of view; the effect of not having it was
> pretty bad! :-P) that I thought it could "hide" here. :-)
>
> But I guess I can put it in a separate patch.
>
>> Can you comment on why you think it's necessary? Was there a
>> particular problem you were seeing?
>>
> Yep. Suppose vc is, for some reason, running on a pcpu which is outside
> its node-affinity, but that some pcpus within vc's node-affinity have
> now become idle. What we would like is for vc to start running there as
> soon as possible, so we expect this call to _csched_cpu_pick() to
> determine that.
>
> What happens is that we do not use vc->processor (as it is outside of
> vc's node-affinity), and 'cpu' gets set to cpumask_cycle(vc->processor,
> &cpus), where '&cpus' is the result of cpumask_and(&cpus, balance_mask,
> online). This means 'cpu' is within vc's node-affinity. Let's also
> suppose that 'cpu' now points to a busy thread, but one with an idle
> sibling, and that there aren't any other idle pcpus (either cores or
> threads). Now, the algorithm evaluates the idleness of 'cpu' and
> compares it with the idleness of all the other pcpus, and it won't find
> anything better than 'cpu' itself: all the other pcpus except its
> sibling thread are busy, while its sibling thread has the very same
> idleness 'cpu' has (2 threads, 1 idle, 1 busy).
>
> The net effect is vc being moved to 'cpu', which is busy, while it
> could have been moved to 'cpu''s sibling thread, which is indeed idle.
>
> The if() I added fixes this by making sure that the reference cpu is an
> idle one (if that is possible).
>
> I hope I've explained it correctly, and sorry if it is a little bit
> tricky, especially to explain like this (although, believe me, it was
> tricky to hunt it down too! :-P). I've seen this happening, and I'm
> almost sure I kept a trace somewhere, so let me know if you want to see
> the "smoking gun". :-)

No, the change looks quite plausible. I guess it's not obvious that the
balancing code will never migrate from one thread to another thread.
(That whole algorithm could do with some commenting -- I may submit a
patch once this series is in.) I'm really glad you've had the
opportunity to take a close look at these kinds of things.
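
To make the scenario described above concrete, here is a toy,
self-contained model of just the reference-cpu selection -- the pcpu
numbers, the masks and the mask_cycle() helper are made up for
illustration; only the shape of the final if() mirrors the hunk:

/*
 * pcpus 0 and 1 are sibling threads inside the vcpu's node-affinity;
 * pcpu 0 is busy, pcpu 1 is idle, and the vcpu currently runs on
 * pcpu 2, outside its node-affinity.  Bit n of a mask = pcpu n.
 */
#include <stdio.h>
#include <stdint.h>

#define NR_PCPUS 4

/* Rough stand-in for cpumask_cycle(): next set bit after n, wrapping. */
static int mask_cycle(int n, uint32_t mask)
{
    int i;
    for ( i = 1; i <= NR_PCPUS; i++ )
    {
        int c = (n + i) % NR_PCPUS;
        if ( mask & (1u << c) )
            return c;
    }
    return -1; /* empty mask */
}

int main(void)
{
    uint32_t cpus   = 0x3; /* node-affinity & online: pcpus 0 and 1 */
    uint32_t idlers = 0x2; /* only pcpu 1 (pcpu 0's sibling) is idle */
    int processor   = 2;   /* vc currently runs outside its node */
    int cpu;

    /* Reference cpu, as in the hunk: the current processor if it is a
     * candidate, otherwise cycle through the candidates. */
    cpu = (cpus & (1u << processor)) ? processor
                                     : mask_cycle(processor, cpus);
    printf("reference cpu before idler filter: %d (busy)\n", cpu);

    cpus &= idlers;        /* restrict candidates to idle pcpus */

    /* Without this check the reference stays on busy pcpu 0, the later
     * idleness comparison against its idle sibling ties, and vc ends up
     * on the busy thread; with it, the reference lands on the idler. */
    if ( cpus && !(cpus & (1u << cpu)) )
        cpu = mask_cycle(cpu, cpus);
    printf("reference cpu after the added check: %d (idle)\n", cpu);

    return 0;
}

Dropping the final if() reproduces the problem: the pick stays on the
busy thread even though its sibling is idle.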
>> Also -- and sorry to have to ask this kind of thing, but after sorting
>> through the placement algorithm my head hurts -- under what
>> circumstances would "cpumask_test_cpu(cpu, &idlers)" be false at this
>> point? It seems like the only possibility would be if:
>>   ( (vc->processor was not in the original &cpus [1])
>>     || !IS_RUNQ_IDLE(vc->processor) )
>>   && (there are no idlers in the original &cpus)
>>
>> Which I suppose probably matches the time when we want to move on from
>> looking at NODE affinity and look for CPU affinity.
>>
>> [1] This could happen either if the vcpu/node affinity has changed, or
>> if we're currently running outside our node affinity and we're doing
>> the NODE step.
>>
>> OK -- I think I've convinced myself that this is OK as well (apart from
>> the hidden check). I'll come back to look at your response to the load
>> balancing thing tomorrow.
>>
> Mmm... Sorry, not sure I follow: does this mean that you figured out
> and understood why I need that 'if(){break;}'? It sounds like it, but I
> can't be sure (my head hurts a bit too, after having written that
> thing! :-D).

Well, I always understood why we needed the break -- the purpose is to
avoid the second run through when it's not necessary. What I was doing,
in a sort of "thinking out loud" fashion, was seeing under what
conditions that break might actually happen. Like the analysis with
vcpu_should_migrate(), it might have turned out to be redundant, or to
have missed some cases. But I think it doesn't, so it's fine. :-)

 -George