From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <linux-kernel-owner@vger.kernel.org>
Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand
	id S1754056AbbIBOJ7 (ORCPT <rfc822;w@1wt.eu>);
	Wed, 2 Sep 2015 10:09:59 -0400
Received: from userp1040.oracle.com ([156.151.31.81]:44215 "EHLO
	userp1040.oracle.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org
	with ESMTP id S1750856AbbIBOJ5 (ORCPT
	<rfc822;linux-kernel@vger.kernel.org>);
	Wed, 2 Sep 2015 10:09:57 -0400
Message-ID: <55E702E7.6070709@oracle.com>
Date: Wed, 02 Sep 2015 10:08:39 -0400
From: Boris Ostrovsky <boris.ostrovsky@oracle.com>
User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:31.0) Gecko/20100101 Thunderbird/31.7.0
MIME-Version: 1.0
To: Juergen Gross <jgross@suse.com>,
        Dario Faggioli <dario.faggioli@citrix.com>,
        "xen-devel@lists.xenproject.org" <xen-devel@lists.xenproject.org>
CC: Andrew Cooper <Andrew.Cooper3@citrix.com>,
        "Luis R. Rodriguez" <mcgrof@do-not-panic.com>,
        David Vrabel <david.vrabel@citrix.com>,
        Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>,
        linux-kernel <linux-kernel@vger.kernel.org>,
        Stefano Stabellini <stefano.stabellini@eu.citrix.com>,
        George Dunlap <George.Dunlap@citrix.com>
Subject: Re: [PATCH RFC] xen: if on Xen, "flatten" the scheduling domain hierarchy
References: <1439913332.4239.134.camel@citrix.com> <55D61964.90608@suse.com> <55E47CFE.8020809@oracle.com> <55E6E454.7090503@suse.com>
In-Reply-To: <55E6E454.7090503@suse.com>
Content-Type: text/plain; charset=utf-8; format=flowed
Content-Transfer-Encoding: 8bit
X-Source-IP: userv0021.oracle.com [156.151.31.71]
Sender: linux-kernel-owner@vger.kernel.org
List-ID: <linux-kernel.vger.kernel.org>
X-Mailing-List: linux-kernel@vger.kernel.org

On 09/02/2015 07:58 AM, Juergen Gross wrote:
> On 08/31/2015 06:12 PM, Boris Ostrovsky wrote:
>>
>>
>> On 08/20/2015 02:16 PM, Juergen Groß wrote:
>>> On 08/18/2015 05:55 PM, Dario Faggioli wrote:
>>>> Hey everyone,
>>>>
>>>> So, as a followup of what we were discussing in this thread:
>>>>
>>>>   [Xen-devel] PV-vNUMA issue: topology is misinterpreted by the guest
>>>> http://lists.xenproject.org/archives/html/xen-devel/2015-07/msg03241.html 
>>>>
>>>>
>>>>
>>>> I started looking in more detail at scheduling domains in the Linux
>>>> kernel. Now, that thread was about CPUID and vNUMA, and their weird
>>>> way of interacting, while what I'm proposing here is completely
>>>> independent of them both.
>>>>
>>>> In fact, no matter whether vNUMA is supported and enabled, and no
>>>> matter whether CPUID is reporting accurate, random, meaningful or
>>>> completely misleading information, I think that we should do
>>>> something about how scheduling domains are built.
>>>>
>>>> Fact is, unless we use 1:1, and immutable (across the whole guest
>>>> lifetime) pinning, scheduling domains should not be constructed, in
>>>> Linux, by looking at *any* topology information, because that just
>>>> does not make any sense when vcpus move around.
>>>>
>>>> Let me state this again (hoping to make myself as clear as
>>>> possible): no matter how good a shape we put CPUID support in, no
>>>> matter how beautifully and consistently it interacts with vNUMA,
>>>> licensing requirements and whatever else, it will always be
>>>> possible for vCPU #0 and vCPU #3 to be scheduled on two SMT threads
>>>> at time t1, and on two different NUMA nodes at time t2. Hence, the
>>>> Linux scheduler should really not skew its load balancing logic
>>>> toward either of those two situations, as neither of them could be
>>>> considered correct (since nothing is!).
>>>>
>>>> For now, this only covers the PV case. The HVM case shouldn't be any
>>>> different, but I haven't looked at how to make the same thing
>>>> happen there as well.
>>>>
>>>> OVERALL DESCRIPTION
>>>> ===================
>>>> What this RFC patch does is, in the Xen PV case, configure scheduling
>>>> domains in such a way that there is only one of them, spanning all the
>>>> vCPUs of the guest.
>>>>
>>>> Note that the patch deals directly with scheduling domains, and
>>>> there is no need to alter the masks that will then be used for
>>>> building and reporting the topology (via CPUID, /proc/cpuinfo,
>>>> /sysfs, etc.). That is the main difference between it and the patch
>>>> proposed by Juergen here:
>>>> http://lists.xenproject.org/archives/html/xen-devel/2015-07/msg05088.html 
>>>>
>>>>
>>>>
>>>> This means that when, in future, we fix CPUID handling and make it
>>>> comply with whatever logic or requirements we want, that won't have
>>>> any unexpected side effects on scheduling domains.
>>>>
>>>> Information about how the scheduling domains are being constructed
>>>> during boot is available in `dmesg', if the kernel is booted with the
>>>> 'sched_debug' parameter. It is also possible to look
>>>> at /proc/sys/kernel/sched_domain/cpu*, and at /proc/schedstat.
>>>>
>>>> With the patch applied, only one scheduling domain is created, called
>>>> the 'VCPU' domain, spanning all the guest's (or Dom0's) vCPUs. You can
>>>> tell that from the fact that every cpu* folder
>>>> in /proc/sys/kernel/sched_domain/ has only one subdirectory
>>>> ('domain0'), with all the tweaks and the tunables for our scheduling
>>>> domain.
>>>>
>>>> EVALUATION
>>>> ==========
>>>> I've tested this with UnixBench, and by looking at Xen build time,
>>>> on 16, 24 and 48 pCPU hosts. I've run the benchmarks in Dom0 only, for
>>>> now, but I plan to re-run them in DomUs soon (Juergen may be doing
>>>> something similar to this in DomU already, AFAIU).
>>>>
>>>> I've run the benchmarks with and without the patch applied ('patched'
>>>> and 'vanilla', respectively, in the tables below), and with different
>>>> numbers of build jobs (in the case of the Xen build) or of parallel
>>>> copies of the benchmarks (in the case of UnixBench).
>>>>
>>>> What I get from the numbers is that the patch almost always brings
>>>> benefits, in some cases even huge ones. There are a couple of cases
>>>> where we regress, but always only slightly so, especially when compared
>>>> to the magnitude of some of the improvements that we get.
>>>>
>>>> Bear also in mind that these results are gathered from Dom0, and
>>>> without any overcommitment at the vCPU level (i.e., nr. vCPUs == nr.
>>>> pCPUs). If we move things to DomU and do overcommit at the Xen
>>>> scheduler level, I am expecting even better results.
>>>>
>>> ...
>>>> REQUEST FOR COMMENTS
>>>> ====================
>>>> Basically, the kind of feedback I'd be really glad to hear is:
>>>>   - what you guys think of the approach,
>>>
>>> Yesterday at the end of the developer meeting we (Andrew, Elena and
>>> myself) discussed this topic again.
>>>
>>> Regarding a possible future scenario with credit2 eventually supporting
>>> gang scheduling on hyperthreads (which is desirable for security
>>> reasons [side channel attacks] and fairness), my patch seems to be more
>>> suited to that direction than yours. Correct me if I'm wrong, but I
>>> think scheduling domains won't enable the guest kernel's scheduler to
>>> migrate threads more easily between hyperthreads as opposed to other
>>> vcpus, while my approach can easily be extended to do so.
>>>
>>>>   - whether you think, looking at this preliminary set of numbers,
>>>>     that this is something worth continuing to investigate,
>>>
>>> I believe that, as both approaches lead to the same topology information
>>> being used by the scheduler (all vcpus are regarded as being equal), your
>>> numbers should apply to my patch as well. Would you mind verifying this?
>>
>> If set_cpu_sibling_map()'s has_mp is false, wouldn't we effectively have
>> both of your patches?
>
> Hmm, sort of.
>
> OTOH this would make it hard to make use of some of the topology
> information in case of e.g. pinned vcpus (as George pointed out).


I didn't mean to just set has_mp to zero unconditionally (for Xen, or 
any other, guest). We'd need to have some logic as to when to set it to 
false.
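
A rough sketch of what such logic could look like, modelled in Python
rather than kernel C so it can be run standalone. Everything here is
hypothetical: the GuestInfo fields, the keep_topology override and the
has_mp() helper are illustrative names, not existing kernel interfaces,
and the "native" rule (SMT or more than one core) is only an assumption
about how has_mp is derived today.

```python
from dataclasses import dataclass


@dataclass
class GuestInfo:
    """Hypothetical inputs; none of these fields exist in the kernel."""
    is_guest: bool        # running under a hypervisor (Xen, KVM, ...)
    vcpus_pinned: bool    # vCPUs pinned 1:1 to pCPUs for the guest's lifetime
    keep_topology: bool   # hypothetical command-line override


def has_mp(has_smt: bool, max_cores: int, info: GuestInfo) -> bool:
    """Decide whether topology information should be trusted.

    Returning False corresponds to the "flat" case where every vCPU is
    treated as equal; returning True keeps the reported topology.
    """
    # Assumed native rule: topology matters if there is SMT or >1 core.
    native = has_smt or max_cores > 1
    if not info.is_guest:
        return native
    # Pinned vCPUs (or an explicit override) keep whatever was reported.
    if info.vcpus_pinned or info.keep_topology:
        return native
    # Unpinned vCPUs can land anywhere, so the topology is meaningless.
    return False
```

The point of the sketch is only that the flat case is chosen when vCPUs
can move around, and that pinned guests keep their topology.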

-boris


>
>> Also, it seems to me that Xen guests would not be the only ones having
>> to deal with topology inconsistencies due to migrating VCPUs. Don't KVM
>> guests, for example, have the same problem? And if yes, perhaps we
>> should try solving it in a non-Xen-specific way (especially given that
>> both of those patches look pretty simple and thus are presumably easy to
>> integrate into common code).
>
> Indeed. I'll have a try.
>
>> And, as George already pointed out, this should be an optional feature
>> --- if a guest spans physical nodes and VCPUs are pinned then we don't
>> always want flat topology/domains.
>
> Yes, it might be a good idea to be able to keep some of the topology
> levels. I'll modify my patch to make this command line selectable.
>
>
> Juergen