From mboxrd@z Thu Jan 1 00:00:00 1970
From: George Dunlap
Subject: Re: [PATCH] xen: sched_credit: filter node-affinity mask against online cpus
Date: Mon, 16 Sep 2013 15:00:09 +0100
Message-ID: <52370EE9.3040005@eu.citrix.com>
References: <20130913160856.25674.4965.stgit@hit-nxdomain.opendns.com>
Mime-Version: 1.0
Content-Type: text/plain; charset="us-ascii"; Format="flowed"
Content-Transfer-Encoding: 7bit
In-Reply-To: <20130913160856.25674.4965.stgit@hit-nxdomain.opendns.com>
Sender: xen-devel-bounces@lists.xen.org
Errors-To: xen-devel-bounces@lists.xen.org
To: Dario Faggioli
Cc: Keir Fraser, Jan Beulich, xen-devel@lists.xen.org
List-Id: xen-devel@lists.xenproject.org

On 13/09/13 17:09, Dario Faggioli wrote:
> in _csched_cpu_pick(), as not doing so may result in the domain's
> node-affinity mask (as retrieved by csched_balance_cpumask()) and
> online mask (as retrieved by cpupool_scheduler_cpumask()) having an
> empty intersection.
>
> Therefore, when attempting a node-affinity load balancing step and
> running this:
>
>     ...
>     /* Pick an online CPU from the proper affinity mask */
>     csched_balance_cpumask(vc, balance_step, &cpus);
>     cpumask_and(&cpus, &cpus, online);
>     ...
>
> we end up with an empty cpumask (in cpus). At this point, in the
> following code:
>
>     ....
>     /* If present, prefer vc's current processor */
>     cpu = cpumask_test_cpu(vc->processor, &cpus)
>             ? vc->processor
>             : cpumask_cycle(vc->processor, &cpus);
>     ....
>
> an ASSERT (from inside cpumask_cycle()) triggers like this:
>
> (XEN) Xen call trace:
> (XEN)    [] _csched_cpu_pick+0x1d2/0x652
> (XEN)    [] csched_cpu_pick+0xe/0x10
> (XEN)    [] vcpu_migrate+0x167/0x31e
> (XEN)    [] cpu_disable_scheduler+0x1c8/0x287
> (XEN)    [] cpupool_unassign_cpu_helper+0x20/0xb4
> (XEN)    [] continue_hypercall_tasklet_handler+0x4a/0xb1
> (XEN)    [] do_tasklet_work+0x78/0xab
> (XEN)    [] do_tasklet+0x5f/0x8b
> (XEN)    [] idle_loop+0x57/0x5e
> (XEN)
> (XEN)
> (XEN) ****************************************
> (XEN) Panic on CPU 1:
> (XEN) Assertion 'cpu < nr_cpu_ids' failed at /home/dario/Sources/xen/xen/xen.git/xen/include/xe:16481
>
> It is, for example, sufficient to have a domain with node-affinity to
> NUMA node 1 running: issuing an `xl cpupool-numa-split' then makes the
> above happen. That is because, by default, all the existing domains
> remain assigned to the first cpupool, which now (after the
> cpupool-numa-split) only includes NUMA node 0.
>
> This change prevents that by generalizing the function used for
> figuring out whether a node-affinity load balancing step is legit or
> not. This way we can, in _csched_cpu_pick(), figure out early enough
> that the mask would end up empty, skip the step altogether and avoid
> the splat.
>
> Signed-off-by: Dario Faggioli
> Cc: George Dunlap
> Cc: Jan Beulich
> Cc: Keir Fraser
> ---
> This was originally submitted (yesterday) as "xen: make sure the
> node-affinity is always updated". After discussing it with George, I
> got convinced that a fix like this is more appropriate, since it deals
> with the issue in the scheduler, without making a mess out of the
> user interface.
>
> This is therefore not being posted as a real v2 of that patch because
> the approach radically changed.
> ---
>  xen/common/sched_credit.c |   41 ++++++++++++++++++++++++++++++-----------
>  1 file changed, 30 insertions(+), 11 deletions(-)
>
> diff --git a/xen/common/sched_credit.c b/xen/common/sched_credit.c
> index dbe6de6..d76cd88 100644
> --- a/xen/common/sched_credit.c
> +++ b/xen/common/sched_credit.c
> @@ -296,15 +296,25 @@ static void csched_set_node_affinity(
>   * vcpu-affinity balancing is always necessary and must never be skipped.
>   * OTOH, if a domain's node-affinity is said to be automatically computed
>   * (or if it just spans all the nodes), we can safely avoid dealing with
> - * node-affinity entirely. Ah, node-affinity is also deemed meaningless
> - * in case it has empty intersection with the vcpu's vcpu-affinity, as it
> - * would mean trying to schedule it on _no_ pcpu!
> + * node-affinity entirely.
> + *
> + * Node-affinity is also deemed meaningless in case it has an empty
> + * intersection with mask, to cover the cases where using the node-affinity
> + * mask seems legit, but would instead lead to trying to schedule the vcpu
> + * on _no_ pcpu! Typical use cases are for mask to be equal to the vcpu's
> + * vcpu-affinity, or to the && of vcpu-affinity and the set of online cpus
> + * in the domain's cpupool.
>   */
> -#define __vcpu_has_node_affinity(vc) \
> -    ( !(cpumask_full(CSCHED_DOM(vc->domain)->node_affinity_cpumask) \
> -        || !cpumask_intersects(vc->cpu_affinity, \
> -                               CSCHED_DOM(vc->domain)->node_affinity_cpumask) \
> -        || vc->domain->auto_node_affinity == 1) )
> +static inline int __vcpu_has_node_affinity(struct vcpu *vc, cpumask_t *mask)
> +{
> +    if ( vc->domain->auto_node_affinity == 1
> +         || cpumask_full(CSCHED_DOM(vc->domain)->node_affinity_cpumask)
> +         || !cpumask_intersects(CSCHED_DOM(vc->domain)->node_affinity_cpumask,
> +                                mask) )
> +        return 0;
> +
> +    return 1;
> +}
>
>  /*
>   * Each csched-balance step uses its own cpumask. This function determines
> @@ -393,7 +403,8 @@ __runq_tickle(unsigned int cpu, struct csched_vcpu *new)
>              int new_idlers_empty;
>
>              if ( balance_step == CSCHED_BALANCE_NODE_AFFINITY
> -                 && !__vcpu_has_node_affinity(new->vcpu) )
> +                 && !__vcpu_has_node_affinity(new->vcpu,
> +                                              new->vcpu->cpu_affinity) )
>                  continue;
>
>              /* Are there idlers suitable for new (for this balance step)? */
> @@ -627,10 +638,18 @@ _csched_cpu_pick(const struct scheduler *ops, struct vcpu *vc, bool_t commit)
>      int balance_step;
>
>      online = cpupool_scheduler_cpumask(vc->domain->cpupool);
> +    cpumask_and(&cpus, vc->cpu_affinity, online);
> +
>      for_each_csched_balance_step( balance_step )
>      {
> +        /*
> +         * We filter the node-affinity mask against
> +         * [vc->cpu_affinity && online] here to avoid problems if all
> +         * the cpus in the node-affinity mask are offline (e.g.,
> +         * because the domain moved to a different cpupool).
> +         */

It's probably worth mentioning that the reason you do the cpumask_and
here, and not at the other place this is called, is that cpumask_cycle
(which will choke on an empty mask) is called after this one, but not
after the other call site.

Other than that, looks good -- thanks.

 -George