From: George Dunlap
Subject: Re: [PATCH v2 10/16] xen: sched: use soft-affinity instead of domain's node-affinity
Date: Fri, 15 Nov 2013 11:23:52 +0000
Message-ID: <52860448.8060807@eu.citrix.com>
References: <20131113190852.18086.5437.stgit@Solace>
 <20131113191233.18086.60472.stgit@Solace>
 <5284EC99.3070607@eu.citrix.com>
 <1384475989.16918.93.camel@Solace>
In-Reply-To: <1384475989.16918.93.camel@Solace>
To: Dario Faggioli
Cc: Marcus Granado, Keir Fraser, Ian Campbell, Li Yechen, Andrew Cooper,
 Juergen Gross, Ian Jackson, xen-devel@lists.xen.org, Jan Beulich,
 Justin Weaver, Matt Wilson, Elena Ufimtseva
List-Id: xen-devel@lists.xenproject.org

On 15/11/13 00:39, Dario Faggioli wrote:
> On gio, 2013-11-14 at 15:30 +0000, George Dunlap wrote:
>> On 13/11/13 19:12, Dario Faggioli wrote:
>>> [..]
>>> The high level description of NUMA placement and scheduling in
>>> docs/misc/xl-numa-placement.markdown is being updated too, to match
>>> the new architecture.
>>>
>>> Signed-off-by: Dario Faggioli
>> Reviewed-by: George Dunlap
>>
> Cool, thanks.
>
>> Just a few things to note below...
>>
> Ok.
>
>>> diff --git a/xen/common/domain.c b/xen/common/domain.c
>>> @@ -411,8 +411,6 @@ void domain_update_node_affinity(struct domain *d)
>>>                  node_set(node, d->node_affinity);
>>>      }
>>>
>>> -    sched_set_node_affinity(d, &d->node_affinity);
>>> -
>>>      spin_unlock(&d->node_affinity_lock);
>> At this point, the only thing inside the spinlock is contingent on
>> d->auto_node_affinity.
>>
> Mmm... Sorry, but I'm not getting what you mean here. :-(

I mean just what I said -- if d->auto_node_affinity is false, nothing
inside the critical region here needs to be done.  I'm just pointing it
out. :-)

(This is sort of related to my comment on the other patch, about not
needing to do the work of calculating intersections.)
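To make the point concrete, here is a rough sketch of the kind of
short-circuit meant above.  It is not the actual function and not part
of the series: it is simplified (for instance it ignores the cpupool's
online mask), it uses the cpu_hard_affinity field name this series
introduces, and a real patch would have to think about racing with
domain_set_node_affinity() flipping the flag.

void domain_update_node_affinity(struct domain *d)
{
    cpumask_var_t cpumask;
    struct vcpu *v;
    unsigned int node;

    /*
     * If the toolstack set an explicit node-affinity, d->node_affinity
     * is not derived from the vcpus' affinities, so there is nothing
     * to recompute: skip the allocation, the per-vcpu loop and the
     * lock altogether.
     */
    if ( !d->auto_node_affinity )
        return;

    if ( !zalloc_cpumask_var(&cpumask) )
        return;

    spin_lock(&d->node_affinity_lock);

    /* Accumulate the pcpus the domain's vcpus can run on... */
    for_each_vcpu ( d, v )
        cpumask_or(cpumask, cpumask, v->cpu_hard_affinity);

    /* ...and turn that into a node mask. */
    nodes_clear(d->node_affinity);
    for_each_online_node ( node )
        if ( cpumask_intersects(&node_to_cpumask(node), cpumask) )
            node_set(node, d->node_affinity);

    spin_unlock(&d->node_affinity_lock);

    free_cpumask_var(cpumask);
}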
>
>>> diff --git a/xen/common/sched_credit.c b/xen/common/sched_credit.c
>>> -static inline int __vcpu_has_node_affinity(const struct vcpu *vc,
>>> +static inline int __vcpu_has_soft_affinity(const struct vcpu *vc,
>>>                                             const cpumask_t *mask)
>>>  {
>>> -    const struct domain *d = vc->domain;
>>> -    const struct csched_dom *sdom = CSCHED_DOM(d);
>>> -
>>> -    if ( d->auto_node_affinity
>>> -         || cpumask_full(sdom->node_affinity_cpumask)
>>> -         || !cpumask_intersects(sdom->node_affinity_cpumask, mask) )
>>> +    if ( cpumask_full(vc->cpu_soft_affinity)
>>> +         || !cpumask_intersects(vc->cpu_soft_affinity, mask) )
>>>          return 0;
>> At this point we've lost a way to make this check potentially much
>> faster (being able to check auto_node_affinity).
>>
> Right.
>
>> This isn't a super-hot path but it does happen fairly frequently --
>>
> Quite frequently indeed.
>
>> will the "cpumask_full()" check take a significant amount of time on,
>> say, a 4096-core system?  If so, we might think about "caching" the
>> results of cpumask_full() at some point.
>>
> Yes, I think cpumask_* operations could be heavy when the number of
> pcpus is high.  However, this is not really a problem introduced by
> this series.  Consider that the default behavior (for libxl and xl) is
> to go through initial domain placement, which would set a
> node-affinity for the domain explicitly, which means
> d->auto_node_affinity is false.
>
> In fact, every domain that does not manually pin its vcpus at creation
> time -- which is what we want, because that way NUMA placement can do
> its magic -- will have to go through the (cpumask_full ||
> !cpumask_intersects) check anyway.  Basically, I'm saying that having
> d->auto_node_affinity there may look like a speedup, but it really is
> one only for a minority of cases.
>
> So, yes, I think we should aim at optimizing this, but that is
> something completely orthogonal to this series.  That is to say:
> (a) we should do it anyway, whether or not this series goes in;
> (b) for that same reason, it shouldn't prevent this series from
> going in.
>
> If you think this can be an issue for 4.4, I'm fine creating a bug for
> it and putting it among the blockers.  At that point, I'll start
> looking for a solution, and will commit to posting a fix ASAP, but
> again, that's pretty independent from this very series, at least
> AFAICT.
>
> Then, the fact that you provided your Reviewed-by above probably means
> that you are aware of and OK with all this, but I felt like it was
> worth pointing it out anyway. :-)

Yes, the "at some point" was intended to imply that I didn't think this
had to be done right away, as was "things to note", which means, "I
just want to point these out; they're not something that needs to be
acted on right away."

 -George
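As a footnote to the "caching" idea discussed above, one possible shape
for it is sketched below.  This is purely illustrative, not something
from the series: the soft_aff_effective field is a made-up name, and a
real patch would also have to initialise it and update it everywhere
cpu_soft_affinity can change.

/* Hypothetical field in struct vcpu, kept in sync with the soft
 * affinity mask: false means "soft affinity covers all cpus", i.e.
 * there is no point scanning the mask at all. */
    bool_t soft_aff_effective;

/* Recomputed wherever the hypervisor writes cpu_soft_affinity: */
    v->soft_aff_effective = !cpumask_full(v->cpu_soft_affinity);

/* The sched_credit.c check then degenerates to a flag test plus, at
 * most, one cpumask_intersects() on the hot path: */
static inline int __vcpu_has_soft_affinity(const struct vcpu *vc,
                                           const cpumask_t *mask)
{
    if ( !vc->soft_aff_effective
         || !cpumask_intersects(vc->cpu_soft_affinity, mask) )
        return 0;

    return 1;
}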