xen-devel.lists.xenproject.org archive mirror
From: George Dunlap <george.dunlap@eu.citrix.com>
To: Dario Faggioli <dario.faggioli@citrix.com>
Cc: Marcus Granado <Marcus.Granado@eu.citrix.com>,
	Dan Magenheimer <dan.magenheimer@oracle.com>,
	Ian Campbell <Ian.Campbell@citrix.com>,
	Anil Madhavapeddy <anil@recoil.org>,
	Andrew Cooper <Andrew.Cooper3@citrix.com>,
	Juergen Gross <juergen.gross@ts.fujitsu.com>,
	Ian Jackson <Ian.Jackson@eu.citrix.com>,
	"xen-devel@lists.xen.org" <xen-devel@lists.xen.org>,
	Jan Beulich <JBeulich@suse.com>,
	Daniel De Graaf <dgdegra@tycho.nsa.gov>,
	Matt Wilson <msw@amazon.com>
Subject: Re: [PATCH 03 of 10 v2] xen: sched_credit: let the scheduler know about node-affinity
Date: Fri, 21 Dec 2012 14:56:24 +0000	[thread overview]
Message-ID: <50D47898.2060909@eu.citrix.com> (raw)
In-Reply-To: <1356049122.15403.34.camel@Abyss>

On 21/12/12 00:18, Dario Faggioli wrote:
> On Thu, 2012-12-20 at 20:21 +0000, George Dunlap wrote:
>>> -    /*
>>> -     * Pick from online CPUs in VCPU's affinity mask, giving a
>>> -     * preference to its current processor if it's in there.
>>> -     */
>>>        online = cpupool_scheduler_cpumask(vc->domain->cpupool);
>>> -    cpumask_and(&cpus, online, vc->cpu_affinity);
>>> -    cpu = cpumask_test_cpu(vc->processor, &cpus)
>>> -            ? vc->processor
>>> -            : cpumask_cycle(vc->processor, &cpus);
>>> -    ASSERT( !cpumask_empty(&cpus) && cpumask_test_cpu(cpu, &cpus) );
>>> +    for_each_csched_balance_step( balance_step )
>>> +    {
>>> +        /* Pick an online CPU from the proper affinity mask */
>>> +        ret = csched_balance_cpumask(vc, balance_step, &cpus);
>>> +        cpumask_and(&cpus, &cpus, online);
>>>
>>> -    /*
>>> -     * Try to find an idle processor within the above constraints.
>>> -     *
>>> -     * In multi-core and multi-threaded CPUs, not all idle execution
>>> -     * vehicles are equal!
>>> -     *
>>> -     * We give preference to the idle execution vehicle with the most
>>> -     * idling neighbours in its grouping. This distributes work across
>>> -     * distinct cores first and guarantees we don't do something stupid
>>> -     * like run two VCPUs on co-hyperthreads while there are idle cores
>>> -     * or sockets.
>>> -     *
>>> -     * Notice that, when computing the "idleness" of cpu, we may want to
>>> -     * discount vc. That is, iff vc is the currently running and the only
>>> -     * runnable vcpu on cpu, we add cpu to the idlers.
>>> -     */
>>> -    cpumask_and(&idlers, &cpu_online_map, CSCHED_PRIV(ops)->idlers);
>>> -    if ( vc->processor == cpu && IS_RUNQ_IDLE(cpu) )
>>> -        cpumask_set_cpu(cpu, &idlers);
>>> -    cpumask_and(&cpus, &cpus, &idlers);
>>> -    cpumask_clear_cpu(cpu, &cpus);
>>> +        /* If present, prefer vc's current processor */
>>> +        cpu = cpumask_test_cpu(vc->processor, &cpus)
>>> +                ? vc->processor
>>> +                : cpumask_cycle(vc->processor, &cpus);
>>> +        ASSERT( !cpumask_empty(&cpus) && cpumask_test_cpu(cpu, &cpus) );
>>>
>>> -    while ( !cpumask_empty(&cpus) )
>>> -    {
>>> -        cpumask_t cpu_idlers;
>>> -        cpumask_t nxt_idlers;
>>> -        int nxt, weight_cpu, weight_nxt;
>>> -        int migrate_factor;
>>> +        /*
>>> +         * Try to find an idle processor within the above constraints.
>>> +         *
>>> +         * In multi-core and multi-threaded CPUs, not all idle execution
>>> +         * vehicles are equal!
>>> +         *
>>> +         * We give preference to the idle execution vehicle with the most
>>> +         * idling neighbours in its grouping. This distributes work across
>>> +         * distinct cores first and guarantees we don't do something stupid
>>> +         * like run two VCPUs on co-hyperthreads while there are idle cores
>>> +         * or sockets.
>>> +         *
>>> +         * Notice that, when computing the "idleness" of cpu, we may want to
>>> +         * discount vc. That is, iff vc is the currently running and the only
>>> +         * runnable vcpu on cpu, we add cpu to the idlers.
>>> +         */
>>> +        cpumask_and(&idlers, &cpu_online_map, CSCHED_PRIV(ops)->idlers);
>>> +        if ( vc->processor == cpu && IS_RUNQ_IDLE(cpu) )
>>> +            cpumask_set_cpu(cpu, &idlers);
>>> +        cpumask_and(&cpus, &cpus, &idlers);
>>> +        /* If there are idlers and cpu is still not among them, pick one */
>>> +        if ( !cpumask_empty(&cpus) && !cpumask_test_cpu(cpu, &cpus) )
>>> +            cpu = cpumask_cycle(cpu, &cpus);
>> This seems to be an addition to the algorithm -- particularly hidden in
>> this kind of "indent a big section that's almost exactly the same", I
>> think this at least needs to be called out in the changelog message,
>> perhaps put in a separate patch.
>>
> You're right, it is an addition, although a minor enough one (at least
> in terms of the amount of code; the effect of not having it was pretty
> bad! :-P) that I thought it could "hide" here. :-)
>
> But I guess I can put it in a separate patch.
>
>> Can you comment on why you think it's necessary?  Was there a
>> particular problem you were seeing?
>>
> Yep. Suppose vc is for some reason running on a pcpu which is outside
> its node-affinity, but that now some pcpus within vc's node-affinity
> have become idle. What we would like is vc start running there as soon
> as possible, so we expect this call to _csched_pick_cpu() to determine
> that.
>
> What happens is that we do not use vc->processor (as it is outside of
> vc's node-affinity) and 'cpu' gets set to cpumask_cycle(vc->processor,
> &cpus), where '&cpus' is the result of cpumask_and(&cpus, balance_mask,
> online). Now let's also suppose that 'cpu' ends up pointing to a busy
> thread with an idle sibling, and that there aren't any other idle
> pcpus (either cores or threads). The algorithm then evaluates the
> idleness of 'cpu' and compares it with the idleness of all the other
> pcpus, and it won't find anything better than 'cpu' itself: all the
> other pcpus except its sibling thread are busy, while the sibling
> thread has the very same idleness 'cpu' has (2 threads, 1 idle, 1 busy).
>
> The neat effect is vc being moved to 'cpu', which is busy, while it
> could have been moved to 'cpu''s sibling thread, which is indeed idle.
>
> The if() I added fixes this by making sure that the reference cpu is an
> idle one (if that is possible).
>
> I hope I've explained it correctly, and sorry if it is a little bit
> tricky, especially to explain like this (although, believe me, it was
> tricky to hunt it out too! :-P). I've seen that happening and I'm almost
> sure I kept a trace somewhere, so let me know if you want to see the
> "smoking gun". :-)

No, the change looks quite plausible.  I guess it's not obvious that the 
balancing code will never migrate from one thread to another thread.  
(That whole algorithm could do with some commenting -- I may submit a 
patch once this series is in.)

I'm really glad you've had the opportunity to take a close look at these 
kinds of things.
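
To make the failure mode concrete, here is a hypothetical, much-simplified
model of the pick: plain bitmasks stand in for cpumask_t, a naive
next-set-bit cycle stands in for cpumask_cycle() and the real heuristics,
and pick_cpu() is an invented name. None of this is the actual Xen code;
it only sketches why moving the reference cpu onto an idler helps.

```c
#include <assert.h>
#include <stdint.h>

/* Toy box with up to 8 pcpus; a set bit in 'mask' means that pcpu
 * is in the mask.  Stand-in for cpumask_cycle(). */
static int next_set_bit_after(uint8_t mask, int start)
{
    for (int i = 1; i <= 8; i++) {
        int c = (start + i) % 8;
        if (mask & (1 << c))
            return c;
    }
    return -1;
}

/* Hypothetical simplification of the pick: 'affinity' is the
 * (already online-filtered) balance mask, 'idlers' the idle pcpus. */
int pick_cpu(uint8_t affinity, uint8_t idlers, int processor)
{
    /* Prefer the current processor if it is in the mask. */
    int cpu = (affinity & (1 << processor))
        ? processor
        : next_set_bit_after(affinity, processor);

    uint8_t idle_and_affine = affinity & idlers;

    /* The added if() under discussion: when there are idle pcpus in
     * the mask and 'cpu' is not among them, cycle the reference cpu
     * onto one of them instead of comparing idleness against a busy
     * reference. */
    if (idle_and_affine && !(idle_and_affine & (1 << cpu)))
        cpu = next_set_bit_after(idle_and_affine, cpu);

    return cpu;
}
```

With node-affinity {0,1}, vc->processor == 3 (outside the mask), pcpu 0
busy and its sibling pcpu 1 idle, the cycle first lands on the busy
pcpu 0; the extra check then moves the pick to the idle sibling 1,
which is exactly the behaviour Dario describes.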
>> Also -- and sorry to have to ask this kind of thing, but after sorting
>> through the placement algorithm my head hurts -- under what
>> circumstances would "cpumask_test_cpu(cpu, &idlers)" be false at this
>> point?  It seems like the only possibility would be if:
>> ( (vc->processor was not in the original &cpus [1])
>>     || !IS_RUNQ_IDLE(vc->processor) )
>> && (there are no idlers in the original &cpus)
>>
>> Which I suppose probably matches the time when we want to move on from
>> looking at NODE affinity and look for CPU affinity.
>>
>> [1] This could happen either if the vcpu/node affinity has changed, or
>> if we're currently running outside our node affinity and we're doing the
>> NODE step.
>>
>> OK -- I think I've convinced myself that this is OK as well (apart from
>> the hidden check).  I'll come back to look at your response to the load
>> balancing thing tomorrow.
>>
> Mmm... Sorry, not sure I follow, does this mean that you figured out
> and understood why I need that 'if(){break;}' ? Sounds like so, but I
> can't be sure (my head hurts a bit too, after having written that
> thing! :-D).

Well I always understood why we needed the break -- the purpose is to 
avoid the second run through when it's not necessary.  What I was 
doing, in a sort of "thinking out loud" fashion, was seeing under what 
conditions that break might actually happen.  Like the analysis with 
vcpu_should_migrate(), it might have turned out to be redundant, or to 
have missed some cases.  But I think it doesn't, so it's fine. :-)
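
The two-step loop and the break being discussed can be sketched roughly
like this (a hypothetical toy, with bitmasks for cpumask_t, an invented
balance_pick() helper, and lowest-set-bit in place of the real pick
heuristics; not the actual for_each_csched_balance_step code):

```c
#include <assert.h>
#include <stdint.h>

enum { BALANCE_NODE_AFFINITY = 0, BALANCE_CPU_AFFINITY = 1 };

/* Try the (narrower) node-affinity mask first, then fall back to the
 * plain vcpu-affinity mask.  'steps_taken' records how many steps ran. */
int balance_pick(uint8_t node_aff, uint8_t cpu_aff, uint8_t online,
                 int *steps_taken)
{
    int picked = -1;
    *steps_taken = 0;

    for (int step = BALANCE_NODE_AFFINITY; step <= BALANCE_CPU_AFFINITY; step++) {
        uint8_t cpus = (step == BALANCE_NODE_AFFINITY ? node_aff : cpu_aff)
                       & online;
        (*steps_taken)++;

        if (!cpus)
            continue;  /* nothing usable at this step: widen the mask */

        /* lowest set bit as a stand-in for the real pick heuristics */
        for (int c = 0; c < 8; c++) {
            if (cpus & (1 << c)) {
                picked = c;
                break;
            }
        }

        /* the break under discussion: once a suitable pcpu is found
         * at the narrow step, don't redo the work with the wider mask */
        break;
    }
    return picked;
}
```

When the node-affinity step already yields a pcpu, the loop stops after
one iteration; the second step only runs when the node-affinity mask
intersected with the online pcpus comes up empty.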

  -George


Thread overview: 57+ messages
2012-12-19 19:07 [PATCH 00 of 10 v2] NUMA aware credit scheduling Dario Faggioli
2012-12-19 19:07 ` [PATCH 01 of 10 v2] xen, libxc: rename xenctl_cpumap to xenctl_bitmap Dario Faggioli
2012-12-20  9:17   ` Jan Beulich
2012-12-20  9:35     ` Dario Faggioli
2012-12-19 19:07 ` [PATCH 02 of 10 v2] xen, libxc: introduce node maps and masks Dario Faggioli
2012-12-20  9:18   ` Jan Beulich
2012-12-20  9:55     ` Dario Faggioli
2012-12-20 14:33     ` George Dunlap
2012-12-20 14:52       ` Jan Beulich
2012-12-20 15:13         ` Dario Faggioli
2012-12-19 19:07 ` [PATCH 03 of 10 v2] xen: sched_credit: let the scheduler know about node-affinity Dario Faggioli
2012-12-20  6:44   ` Juergen Gross
2012-12-20  8:16     ` Dario Faggioli
2012-12-20  8:25       ` Juergen Gross
2012-12-20  8:33         ` Dario Faggioli
2012-12-20  8:39           ` Juergen Gross
2012-12-20  8:58             ` Dario Faggioli
2012-12-20 15:28             ` George Dunlap
2012-12-20 16:00               ` Dario Faggioli
2012-12-20  9:22           ` Jan Beulich
2012-12-20 15:56   ` George Dunlap
2012-12-20 17:12     ` Dario Faggioli
2012-12-20 16:48   ` George Dunlap
2012-12-20 18:18     ` Dario Faggioli
2012-12-21 14:29       ` George Dunlap
2012-12-21 16:07         ` Dario Faggioli
2012-12-20 20:21   ` George Dunlap
2012-12-21  0:18     ` Dario Faggioli
2012-12-21 14:56       ` George Dunlap [this message]
2012-12-21 16:13         ` Dario Faggioli
2012-12-19 19:07 ` [PATCH 04 of 10 v2] xen: allow for explicitly specifying node-affinity Dario Faggioli
2012-12-21 15:17   ` George Dunlap
2012-12-21 16:17     ` Dario Faggioli
2013-01-03 16:05     ` Daniel De Graaf
2012-12-19 19:07 ` [PATCH 05 of 10 v2] libxc: " Dario Faggioli
2012-12-21 15:19   ` George Dunlap
2012-12-21 16:27     ` Dario Faggioli
2012-12-19 19:07 ` [PATCH 06 of 10 v2] libxl: " Dario Faggioli
2012-12-21 15:30   ` George Dunlap
2012-12-21 16:18     ` Dario Faggioli
2012-12-21 17:02       ` Ian Jackson
2012-12-21 17:09         ` Dario Faggioli
2012-12-19 19:07 ` [PATCH 07 of 10 v2] libxl: optimize the calculation of how many VCPUs can run on a candidate Dario Faggioli
2012-12-20  8:41   ` Ian Campbell
2012-12-20  9:24     ` Dario Faggioli
2012-12-21 16:00   ` George Dunlap
2012-12-21 16:23     ` Dario Faggioli
2012-12-19 19:07 ` [PATCH 08 of 10 v2] libxl: automatic placement deals with node-affinity Dario Faggioli
2012-12-21 16:22   ` George Dunlap
2012-12-19 19:07 ` [PATCH 09 of 10 v2] xl: add node-affinity to the output of `xl list` Dario Faggioli
2012-12-21 16:34   ` George Dunlap
2012-12-21 16:54     ` Dario Faggioli
2012-12-19 19:07 ` [PATCH 10 of 10 v2] docs: rearrange and update NUMA placement documentation Dario Faggioli
2012-12-19 23:16 ` [PATCH 00 of 10 v2] NUMA aware credit scheduling Dario Faggioli
2013-01-11 12:19 ` Ian Campbell
2013-01-11 13:57   ` Dario Faggioli
2013-01-11 14:09     ` Ian Campbell
