From: George Dunlap <george.dunlap@eu.citrix.com>
To: Dario Faggioli <raistlin@linux.it>
Cc: Andre Przywara <andre.przywara@amd.com>,
Ian Campbell <Ian.Campbell@citrix.com>,
Stefano Stabellini <Stefano.Stabellini@eu.citrix.com>,
Juergen Gross <juergen.gross@ts.fujitsu.com>,
Ian Jackson <Ian.Jackson@eu.citrix.com>,
"xen-devel@lists.xen.org" <xen-devel@lists.xen.org>,
Jan Beulich <JBeulich@suse.com>
Subject: Re: [PATCH 07 of 10 [RFC]] sched_credit: Let the scheduler know about `node affinity`
Date: Fri, 27 Apr 2012 15:45:12 +0100
Message-ID: <4F9AB0F8.10102@eu.citrix.com>
In-Reply-To: <1f4b55806de9e7109ff6.1334150274@Solace>
On 11/04/12 14:17, Dario Faggioli wrote:
> Basic idea is: cpu-affinity tells where a vcpu can run,
> while node-affinity tells where (in terms of what NUMA nodes,
> which of course implies a set of CPUs) all the vcpus of the
> domain should/prefer to run... Problems start when you have
> both of them speaking at the same time!
>
> Respecting vcpu-affinity should remain the primary concern, thus
> what we do here is add some two-step logic here and there in the
> scheduler (i.e., where vcpu-affinity related decisions are being
> undertaken). During the first step, we check the `&&` of vcpu and
> node affinity masks, aiming at giving some precedence to the CPUs
> that are both suitable and preferable for the domain. However,
> if that fails in finding a valid CPU, the node-affinity is just
> ignored and, in the second step, we just fall back to vcpu-affinity,
> as the original behaviour was.
>
> Both the introduced overhead and the benefits this provides have
> to be investigated and compared against each other, possibly in
> varying conditions and running different workloads.
>
> Finally, although still at the RFC level, efforts have been made
> to write the code in a flexible enough fashion so that, if we
> ever want to introduce further balancing steps, it shouldn't be
> too much of a pain.
>
> TODO: * Better investigate and test interactions with cpupools.
> * Test, verify and benchmark. Then test, verify and benchmark
> again, and again and again. What I know as of now is that
> it does not explode that easily, but whether it works properly
> and, more importantly, brings any benefit, still has to be
> proven (and yes, I'm out running these tests and benchmarks already,
> but do not hesitate to speak up if you want to help with
> that :-D).
> * Add some counter/stats, e.g., to serve the purpose outlined
> right above.
>
> XXX I'm benchmarking this just right now, and peeking at the results
> they don't seem too bad. Numbers are mostly close to the case where
> the VM is already pinned to a node when created. I'll post the
> results as soon as I can, and I'll be happy to investigate any issue,
> if you feel like the approach could be the right one.
Hey Dario,
Sorry for the long delay in reviewing this.
Overall I think the approach is good. A few points:
> /*
> + * Node Balancing
> + */
> +#define CSCHED_BALANCE_NODE_AFFINITY 1
> +#define CSCHED_BALANCE_CPU_AFFINITY 0
> +#define CSCHED_BALANCE_START_STEP CSCHED_BALANCE_NODE_AFFINITY
> +#define CSCHED_BALANCE_END_STEP CSCHED_BALANCE_CPU_AFFINITY
> +
> +
This thing of defining "START_STEP" and "END_STEP" seems a bit fragile.
I think it would be better to always start at 0, and go until
CSCHED_BALANCE_MAX.
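Something along these lines, for instance (just a sketch; I'm assuming you
still want the node-affinity step to come first, hence the numbering):

#define CSCHED_BALANCE_NODE_AFFINITY 0
#define CSCHED_BALANCE_CPU_AFFINITY  1
/* Not a real step: only the bound for iterating over all of them. */
#define CSCHED_BALANCE_MAX           2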
> + /*
> + * Let's cache domain's dom->node_affinity here as an
> + * optimization for a couple of hot paths. In fact,
> + * knowing whether or not dom->node_affinity has changed
> + * would allow us to avoid rebuilding node_affinity_cpumask
> + * (below) during node balancing and/or scheduling.
> + */
> + nodemask_t node_affinity_cache;
> + /* Based on what dom->node_affinity says,
> + * on what CPUs would we like to run most? */
> + cpumask_t node_affinity_cpumask;
I think the comments here need to be more clear. The main points are:
* node_affinity_cpumask is the dom->node_affinity translated from a
nodemask into a cpumask
* Because doing the nodemask -> cpumask translation may be expensive,
node_affinity_cache stores the last translated value, so we can avoid
doing the translation if nothing has changed.
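For instance, something along these lines (the wording is just a suggestion,
reusing the field names from your patch):

    /*
     * Copy of dom->node_affinity as it was when we last built
     * node_affinity_cpumask (below).  Translating a nodemask into a
     * cpumask may be expensive, so we only redo the translation when
     * dom->node_affinity no longer matches this cached value.
     */
    nodemask_t node_affinity_cache;
    /* dom->node_affinity translated into the corresponding set of CPUs. */
    cpumask_t node_affinity_cpumask;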
> +#define csched_balance(__step) \
> + for ( (__step) = CSCHED_BALANCE_START_STEP; \
> + (__step)>= CSCHED_BALANCE_END_STEP; \
> + (__step)-- )
I don't like this. For one, it's fragile; if for some reason you
switched NODE_AFFINITY and CPU_AFFINITY above, suddenly this loop would
go the wrong direction.
I don't think there's any reason not to just use "for(step=0; step <
CSCHED_BALANCE_MAX; step++)" -- it's not ugly and it means you know
exactly what's going on without having to go search for the macro.
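I.e., spelled out (with the step constants renumbered as in the sketch
above):

    int balance_step;

    for ( balance_step = 0; balance_step < CSCHED_BALANCE_MAX; balance_step++ )
    {
        /* ... one balancing pass per step ... */
    }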
> +
> +/*
> + * Sort-of conversion between node-affinity and vcpu-affinity for the domain,
> + * i.e., a cpumask containing all the cpus from all the set nodes in the
> + * node-affinity mask of the domain.
This needs to be clearer -- vcpu-affinity doesn't have anything to do
with this function, and there's nothing "sort-of" about the conversion. :-)
I think you mean to say, "Create a cpumask from the node affinity mask."
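FWIW, the core of it really is just something like this (untested sketch; I
haven't double-checked the exact name or signature of the node-to-cpumask
helper):

static void
csched_build_balance_cpumask(struct csched_dom *sdom)
{
    const struct domain *d = sdom->dom;
    int node;

    /* Accumulate the CPUs of every node set in d->node_affinity. */
    cpumask_clear(&sdom->node_affinity_cpumask);
    for_each_node_mask ( node, d->node_affinity )
        cpumask_or(&sdom->node_affinity_cpumask,
                   &sdom->node_affinity_cpumask,
                   &node_to_cpumask(node));
}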
> + *
> + * Notice that we completely forget about serializing this (both here and
> + * in the various sites where node_affinity_cpumask is used) against
> + * d->node_affinity_lock. That would be hard to do properly, as that lock
> + * is a non IRQ-safe one, and it would be nearly impossible to access it
> + * from the scheduling code. However, although this leaves some room for
> + * races with code paths that update d->node_affinity, it all should still
> + * be fine, considering:
> + * (1) d->node_affinity updates are going to be quite rare;
> + * (2) this balancing logic is kind-of "best effort" anyway;
> + * (3) given (1), any inconsistencies are likely to be fixed by the next
> + * call to this same function without risking going into instability.
> + *
> + * XXX Validate (via testing/benchmarking) whether this is true, as it
> + * likely sounds to be, or it causes unexpected issues.
Ack.
> +/* Put together the cpumask we are going to use for this balancing step */
> +static int
> +csched_balance_cpumask(const struct vcpu *vc, int step, cpumask_t *mask)
> +{
> + struct domain *d = vc->domain;
> + struct csched_dom *sdom = CSCHED_DOM(d);
> +
> + /*
> + * Only two possible steps exist for now: node and vcpu affinity.
> + * If this domain does not have a node-affinity or is affine to all the
> + * nodes, just return the vcpu-affinity mask (for *this* vcpu) and
> + * stop the balancing process.
> + */
> + if ( !d->has_node_affinity || nodes_full(d->node_affinity) ||
> + step == CSCHED_BALANCE_CPU_AFFINITY )
> + {
> + cpumask_copy(mask, vc->cpu_affinity);
> + return CSCHED_BALANCE_CPU_AFFINITY;
> + }
Hmm -- this mechanism seems kind of fragile. It seems like returning -1
or something, and having the caller call "continue", would be easier for
future coders (potentially you or me) to reason about.
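Something like this, maybe (again just a sketch, reusing the names from your
patch and not even compile-tested):

static int
csched_balance_cpumask(const struct vcpu *vc, int step, cpumask_t *mask)
{
    struct domain *d = vc->domain;
    struct csched_dom *sdom = CSCHED_DOM(d);

    if ( step == CSCHED_BALANCE_NODE_AFFINITY )
    {
        /* No (useful) node affinity: tell the caller to skip this step. */
        if ( !d->has_node_affinity || nodes_full(d->node_affinity) )
            return -1;
        cpumask_and(mask, &sdom->node_affinity_cpumask, vc->cpu_affinity);
    }
    else
        cpumask_copy(mask, vc->cpu_affinity);

    return 0;
}

with the callers then doing something like:

    if ( csched_balance_cpumask(vc, balance_step, &balance_mask) )
        continue;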
> + /*
> + * XXX As of now (with only two possible steps) this is easy and readable
> + * enough. If more steps become necessary at some point in time, we
> + * can keep an array of cpumask_t in dom_data and return the proper
> + * element via step-indexing such an array. Of course, we could switch to
> + * that approach even right now... Thoughts?
> + */
Beware of early optimization. :-) Just make sure to mark this down as
something to look at when profiling.
> static inline void
> +__cpumask_tickle(cpumask_t *mask, const cpumask_t *idle_mask)
> +{
> + CSCHED_STAT_CRANK(tickle_idlers_some);
> + if ( opt_tickle_one_idle )
> + {
> + this_cpu(last_tickle_cpu) =
> + cpumask_cycle(this_cpu(last_tickle_cpu), idle_mask);
> + cpumask_set_cpu(this_cpu(last_tickle_cpu), mask);
> + }
> + else
> + cpumask_or(mask, mask, idle_mask);
> +}
I don't see any reason to make this into a function -- it's only called
once, and it's not that long. Unless you're concerned about too many
indentations making the lines too short?
> + csched_balance(balance_step)
> {
Again, I'd make this a for() loop.
> sdom->dom = dom;
> + /*
> + *XXX This would be 'The Right Thing', but as it is still too
> + * early and d->node_affinity has not settled yet, maybe we
> + * can just init the two masks with something like all-nodes
> + * and all-cpus and rely on the first balancing call for
> + * having them updated?
> + */
> + csched_build_balance_cpumask(sdom);
We might as well do what you've got here, unless it's likely to produce
garbage. This isn't exactly a hot path. :-)
Other than that, looks good -- Thanks!
-George