From: George Dunlap <george.dunlap@eu.citrix.com>
To: Dario Faggioli <dario.faggioli@citrix.com>
Cc: Marcus Granado <Marcus.Granado@eu.citrix.com>,
Dan Magenheimer <dan.magenheimer@oracle.com>,
Ian Campbell <Ian.Campbell@citrix.com>,
Anil Madhavapeddy <anil@recoil.org>,
Andrew Cooper <Andrew.Cooper3@citrix.com>,
Juergen Gross <juergen.gross@ts.fujitsu.com>,
Ian Jackson <Ian.Jackson@eu.citrix.com>,
"xen-devel@lists.xen.org" <xen-devel@lists.xen.org>,
Jan Beulich <JBeulich@suse.com>,
Daniel De Graaf <dgdegra@tycho.nsa.gov>,
Matt Wilson <msw@amazon.com>
Subject: Re: [PATCH 03 of 10 v2] xen: sched_credit: let the scheduler know about node-affinity
Date: Thu, 20 Dec 2012 20:21:45 +0000 [thread overview]
Message-ID: <50D37359.9080001@eu.citrix.com> (raw)
In-Reply-To: <06d2f322a6319d8ba212.1355944039@Solace>
On 19/12/12 19:07, Dario Faggioli wrote:
> +static inline int
> +__csched_vcpu_should_migrate(int cpu, cpumask_t *mask, cpumask_t *idlers)
> +{
> + /*
> + * Consent to migration if cpu is one of the idlers in the VCPU's
> + * affinity mask. In fact, if that is not the case, it just means it
> + * was some other CPU that was tickled and should hence come and pick
> + * VCPU up. Migrating it to cpu would only make things worse.
> + */
> + return cpumask_test_cpu(cpu, idlers) && cpumask_test_cpu(cpu, mask);
> }
>
> static int
> @@ -493,85 +599,98 @@ static int
> cpumask_t idlers;
> cpumask_t *online;
> struct csched_pcpu *spc = NULL;
> + int ret, balance_step;
> int cpu;
>
> - /*
> - * Pick from online CPUs in VCPU's affinity mask, giving a
> - * preference to its current processor if it's in there.
> - */
> online = cpupool_scheduler_cpumask(vc->domain->cpupool);
> - cpumask_and(&cpus, online, vc->cpu_affinity);
> - cpu = cpumask_test_cpu(vc->processor, &cpus)
> - ? vc->processor
> - : cpumask_cycle(vc->processor, &cpus);
> - ASSERT( !cpumask_empty(&cpus) && cpumask_test_cpu(cpu, &cpus) );
> + for_each_csched_balance_step( balance_step )
> + {
> + /* Pick an online CPU from the proper affinity mask */
> + ret = csched_balance_cpumask(vc, balance_step, &cpus);
> + cpumask_and(&cpus, &cpus, online);
>
> - /*
> - * Try to find an idle processor within the above constraints.
> - *
> - * In multi-core and multi-threaded CPUs, not all idle execution
> - * vehicles are equal!
> - *
> - * We give preference to the idle execution vehicle with the most
> - * idling neighbours in its grouping. This distributes work across
> - * distinct cores first and guarantees we don't do something stupid
> - * like run two VCPUs on co-hyperthreads while there are idle cores
> - * or sockets.
> - *
> - * Notice that, when computing the "idleness" of cpu, we may want to
> - * discount vc. That is, iff vc is the currently running and the only
> - * runnable vcpu on cpu, we add cpu to the idlers.
> - */
> - cpumask_and(&idlers, &cpu_online_map, CSCHED_PRIV(ops)->idlers);
> - if ( vc->processor == cpu && IS_RUNQ_IDLE(cpu) )
> - cpumask_set_cpu(cpu, &idlers);
> - cpumask_and(&cpus, &cpus, &idlers);
> - cpumask_clear_cpu(cpu, &cpus);
> + /* If present, prefer vc's current processor */
> + cpu = cpumask_test_cpu(vc->processor, &cpus)
> + ? vc->processor
> + : cpumask_cycle(vc->processor, &cpus);
> + ASSERT( !cpumask_empty(&cpus) && cpumask_test_cpu(cpu, &cpus) );
>
> - while ( !cpumask_empty(&cpus) )
> - {
> - cpumask_t cpu_idlers;
> - cpumask_t nxt_idlers;
> - int nxt, weight_cpu, weight_nxt;
> - int migrate_factor;
> + /*
> + * Try to find an idle processor within the above constraints.
> + *
> + * In multi-core and multi-threaded CPUs, not all idle execution
> + * vehicles are equal!
> + *
> + * We give preference to the idle execution vehicle with the most
> + * idling neighbours in its grouping. This distributes work across
> + * distinct cores first and guarantees we don't do something stupid
> + * like run two VCPUs on co-hyperthreads while there are idle cores
> + * or sockets.
> + *
> + * Notice that, when computing the "idleness" of cpu, we may want to
> + * discount vc. That is, iff vc is the currently running and the only
> + * runnable vcpu on cpu, we add cpu to the idlers.
> + */
> + cpumask_and(&idlers, &cpu_online_map, CSCHED_PRIV(ops)->idlers);
> + if ( vc->processor == cpu && IS_RUNQ_IDLE(cpu) )
> + cpumask_set_cpu(cpu, &idlers);
> + cpumask_and(&cpus, &cpus, &idlers);
> + /* If there are idlers and cpu is still not among them, pick one */
> + if ( !cpumask_empty(&cpus) && !cpumask_test_cpu(cpu, &cpus) )
> + cpu = cpumask_cycle(cpu, &cpus);
This seems to be an addition to the algorithm -- and it's particularly
well hidden in this kind of "indent a big section that's almost exactly
the same" diff. I think it at least needs to be called out in the
changelog message, and perhaps be put in a separate patch.
Can you comment on why you think it's necessary? Was there a
particular problem you were seeing?
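For reference, here is my reading of the behavioural difference, shown
side by side (a sketch pieced together from the old and new hunks
quoted above, not compile-tested):

    /* Old flow: cpu can stay on a busy processor even when &cpus
     * still contains idlers after the AND below. */
    cpumask_and(&cpus, &cpus, &idlers);
    cpumask_clear_cpu(cpu, &cpus);

    /* New flow: cpu is first nudged onto one of the idlers, if any. */
    cpumask_and(&cpus, &cpus, &idlers);
    if ( !cpumask_empty(&cpus) && !cpumask_test_cpu(cpu, &cpus) )
        cpu = cpumask_cycle(cpu, &cpus);
    cpumask_clear_cpu(cpu, &cpus);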
> + cpumask_clear_cpu(cpu, &cpus);
>
> - nxt = cpumask_cycle(cpu, &cpus);
> + while ( !cpumask_empty(&cpus) )
> + {
> + cpumask_t cpu_idlers;
> + cpumask_t nxt_idlers;
> + int nxt, weight_cpu, weight_nxt;
> + int migrate_factor;
>
> - if ( cpumask_test_cpu(cpu, per_cpu(cpu_core_mask, nxt)) )
> - {
> - /* We're on the same socket, so check the busy-ness of threads.
> - * Migrate if # of idlers is less at all */
> - ASSERT( cpumask_test_cpu(nxt, per_cpu(cpu_core_mask, cpu)) );
> - migrate_factor = 1;
> - cpumask_and(&cpu_idlers, &idlers, per_cpu(cpu_sibling_mask, cpu));
> - cpumask_and(&nxt_idlers, &idlers, per_cpu(cpu_sibling_mask, nxt));
> - }
> - else
> - {
> - /* We're on different sockets, so check the busy-ness of cores.
> - * Migrate only if the other core is twice as idle */
> - ASSERT( !cpumask_test_cpu(nxt, per_cpu(cpu_core_mask, cpu)) );
> - migrate_factor = 2;
> - cpumask_and(&cpu_idlers, &idlers, per_cpu(cpu_core_mask, cpu));
> - cpumask_and(&nxt_idlers, &idlers, per_cpu(cpu_core_mask, nxt));
> + nxt = cpumask_cycle(cpu, &cpus);
> +
> + if ( cpumask_test_cpu(cpu, per_cpu(cpu_core_mask, nxt)) )
> + {
> + /* We're on the same socket, so check the busy-ness of threads.
> + * Migrate if # of idlers is less at all */
> + ASSERT( cpumask_test_cpu(nxt, per_cpu(cpu_core_mask, cpu)) );
> + migrate_factor = 1;
> + cpumask_and(&cpu_idlers, &idlers, per_cpu(cpu_sibling_mask,
> + cpu));
> + cpumask_and(&nxt_idlers, &idlers, per_cpu(cpu_sibling_mask,
> + nxt));
> + }
> + else
> + {
> + /* We're on different sockets, so check the busy-ness of cores.
> + * Migrate only if the other core is twice as idle */
> + ASSERT( !cpumask_test_cpu(nxt, per_cpu(cpu_core_mask, cpu)) );
> + migrate_factor = 2;
> + cpumask_and(&cpu_idlers, &idlers, per_cpu(cpu_core_mask, cpu));
> + cpumask_and(&nxt_idlers, &idlers, per_cpu(cpu_core_mask, nxt));
> + }
> +
> + weight_cpu = cpumask_weight(&cpu_idlers);
> + weight_nxt = cpumask_weight(&nxt_idlers);
> + /* smt_power_savings: consolidate work rather than spreading it */
> + if ( sched_smt_power_savings ?
> + weight_cpu > weight_nxt :
> + weight_cpu * migrate_factor < weight_nxt )
> + {
> + cpumask_and(&nxt_idlers, &cpus, &nxt_idlers);
> + spc = CSCHED_PCPU(nxt);
> + cpu = cpumask_cycle(spc->idle_bias, &nxt_idlers);
> + cpumask_andnot(&cpus, &cpus, per_cpu(cpu_sibling_mask, cpu));
> + }
> + else
> + {
> + cpumask_andnot(&cpus, &cpus, &nxt_idlers);
> + }
> }
>
> - weight_cpu = cpumask_weight(&cpu_idlers);
> - weight_nxt = cpumask_weight(&nxt_idlers);
> - /* smt_power_savings: consolidate work rather than spreading it */
> - if ( sched_smt_power_savings ?
> - weight_cpu > weight_nxt :
> - weight_cpu * migrate_factor < weight_nxt )
> - {
> - cpumask_and(&nxt_idlers, &cpus, &nxt_idlers);
> - spc = CSCHED_PCPU(nxt);
> - cpu = cpumask_cycle(spc->idle_bias, &nxt_idlers);
> - cpumask_andnot(&cpus, &cpus, per_cpu(cpu_sibling_mask, cpu));
> - }
> - else
> - {
> - cpumask_andnot(&cpus, &cpus, &nxt_idlers);
> - }
> + /* Stop if cpu is idle (or if csched_balance_cpumask() says we can) */
> + if ( cpumask_test_cpu(cpu, &idlers) || ret )
> + break;
Right -- OK, I think everything looks good here, except for the "return
-1 from csched_balance_cpumask" thing. I think it would be better to
explicitly check cpumask_full(...->node_affinity_cpumask) and skip the
NODE step when that's the case.
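Something like this is what I have in mind (just a sketch --
CSCHED_BALANCE_NODE_AFFINITY and the CSCHED_DOM() lookup are my guesses
at the series' naming, so adjust to whatever it actually uses):

    for_each_csched_balance_step( balance_step )
    {
        /* If the domain's node affinity spans all cpus, the NODE step
         * cannot constrain anything: skip straight to the CPU step. */
        if ( balance_step == CSCHED_BALANCE_NODE_AFFINITY
             && cpumask_full(CSCHED_DOM(vc->domain)->node_affinity_cpumask) )
            continue;
        ...
    }

That way the "we can stop early" signal is visible at the call site,
rather than being smuggled out via the return value of
csched_balance_cpumask().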
Also -- and sorry to have to ask this kind of thing, but after sorting
through the placement algorithm my head hurts -- under what
circumstances would "cpumask_test_cpu(cpu, &idlers)" be false at this
point? It seems like the only possibility would be if:
( (vc->processor was not in the original &cpus [1])
|| !IS_RUNQ_IDLE(vc->processor) )
&& (there are no idlers in the original &cpus)
Which I suppose probably matches exactly the case in which we want to
move on from looking at NODE affinity and fall back to CPU affinity.
[1] This could happen either if the vcpu/node affinity has changed, or
if we're currently running outside our node affinity and we're doing the
NODE step.
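Or, transcribing the above into code (my restatement only; cpus_orig
stands for &cpus as it was before being AND-ed with &idlers):

    ( ( !cpumask_test_cpu(vc->processor, &cpus_orig)   /* [1] */
        || !IS_RUNQ_IDLE(vc->processor) )
      && !cpumask_intersects(&cpus_orig, &idlers) )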
OK -- I think I've convinced myself that this is OK as well (apart from
the hidden check). I'll come back to look at your response to the load
balancing thing tomorrow.
-George