From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand
        id S1754056AbbIBOJ7 (ORCPT ); Wed, 2 Sep 2015 10:09:59 -0400
Received: from userp1040.oracle.com ([156.151.31.81]:44215 "EHLO
        userp1040.oracle.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org
        with ESMTP id S1750856AbbIBOJ5 (ORCPT );
        Wed, 2 Sep 2015 10:09:57 -0400
Message-ID: <55E702E7.6070709@oracle.com>
Date: Wed, 02 Sep 2015 10:08:39 -0400
From: Boris Ostrovsky
User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:31.0) Gecko/20100101 Thunderbird/31.7.0
MIME-Version: 1.0
To: Juergen Gross, Dario Faggioli,
        "xen-devel@lists.xenproject.org"
CC: Andrew Cooper, "Luis R. Rodriguez", David Vrabel,
        Konrad Rzeszutek Wilk, linux-kernel, Stefano Stabellini,
        George Dunlap
Subject: Re: [PATCH RFC] xen: if on Xen, "flatten" the scheduling domain hierarchy
References: <1439913332.4239.134.camel@citrix.com> <55D61964.90608@suse.com>
        <55E47CFE.8020809@oracle.com> <55E6E454.7090503@suse.com>
In-Reply-To: <55E6E454.7090503@suse.com>
Content-Type: text/plain; charset=utf-8; format=flowed
Content-Transfer-Encoding: 8bit
X-Source-IP: userv0021.oracle.com [156.151.31.71]
Sender: linux-kernel-owner@vger.kernel.org
List-ID:
X-Mailing-List: linux-kernel@vger.kernel.org

On 09/02/2015 07:58 AM, Juergen Gross wrote:
> On 08/31/2015 06:12 PM, Boris Ostrovsky wrote:
>>
>>
>> On 08/20/2015 02:16 PM, Juergen Groß wrote:
>>> On 08/18/2015 05:55 PM, Dario Faggioli wrote:
>>>> Hey everyone,
>>>>
>>>> So, as a followup of what we were discussing in this thread:
>>>>
>>>> [Xen-devel] PV-vNUMA issue: topology is misinterpreted by the guest
>>>> http://lists.xenproject.org/archives/html/xen-devel/2015-07/msg03241.html
>>>>
>>>> I started looking in more detail at scheduling domains in the Linux
>>>> kernel.
>>>> Now, that thread was about CPUID and vNUMA, and their weird way
>>>> of interacting, while this thing I'm proposing here is completely
>>>> independent from them both.
>>>>
>>>> In fact, no matter whether vNUMA is supported and enabled, and no matter
>>>> whether CPUID is reporting accurate, random, meaningful or completely
>>>> misleading information, I think that we should do something about how
>>>> scheduling domains are built.
>>>>
>>>> Fact is, unless we use 1:1, and immutable (across all the guest
>>>> lifetime) pinning, scheduling domains should not be constructed, in
>>>> Linux, by looking at *any* topology information, because that just does
>>>> not make any sense, when vcpus move around.
>>>>
>>>> Let me state this again (hoping to make myself as clear as possible): no
>>>> matter in how much good shape we put CPUID support, no matter how
>>>> beautifully and consistently that will interact with both vNUMA,
>>>> licensing requirements and whatever else, it will always be possible for
>>>> vCPU #0 and vCPU #3 to be scheduled on two SMT threads at time t1, and
>>>> on two different NUMA nodes at time t2. Hence, the Linux scheduler
>>>> should really not skew its load balancing logic toward any of those two
>>>> situations, as neither of them could be considered correct (since
>>>> nothing is!).
>>>>
>>>> For now, this only covers the PV case. HVM case shouldn't be any
>>>> different, but I haven't looked at how to make the same thing happen in
>>>> there as well.
>>>>
>>>> OVERALL DESCRIPTION
>>>> ===================
>>>> What this RFC patch does is, in the Xen PV case, configure scheduling
>>>> domains in such a way that there is only one of them, spanning all the
>>>> pCPUs of the guest.
>>>>
>>>> Note that the patch deals directly with scheduling domains, and there is
>>>> no need to alter the masks that will then be used for building and
>>>> reporting the topology (via CPUID, /proc/cpuinfo, /sysfs, etc.).
>>>> That is the main difference between it and the patch proposed by
>>>> Juergen here:
>>>> http://lists.xenproject.org/archives/html/xen-devel/2015-07/msg05088.html
>>>>
>>>> This means that when, in future, we will fix CPUID handling and make it
>>>> comply with whatever logic or requirements we want, that won't have any
>>>> unexpected side effects on scheduling domains.
>>>>
>>>> Information about how the scheduling domains are being constructed
>>>> during boot is available in `dmesg', if the kernel is booted with the
>>>> 'sched_debug' parameter. It is also possible to look
>>>> at /proc/sys/kernel/sched_domain/cpu*, and at /proc/schedstat.
>>>>
>>>> With the patch applied, only one scheduling domain is created, called
>>>> the 'VCPU' domain, spanning all the guest's (or Dom0's) vCPUs. You can
>>>> tell that from the fact that every cpu* folder
>>>> in /proc/sys/kernel/sched_domain/ only has one subdirectory
>>>> ('domain0'), with all the tweaks and the tunables for our scheduling
>>>> domain.
>>>>
>>>> EVALUATION
>>>> ==========
>>>> I've tested this with UnixBench, and by looking at Xen build time, on
>>>> 16, 24 and 48 pCPU hosts. I've run the benchmarks in Dom0 only, for
>>>> now, but I plan to re-run them in DomUs soon (Juergen may be doing
>>>> something similar to this in DomU already, AFAUI).
>>>>
>>>> I've run the benchmarks with and without the patch applied ('patched'
>>>> and 'vanilla', respectively, in the tables below), and with different
>>>> numbers of build jobs (in case of the Xen build) or of parallel copies
>>>> of the benchmarks (in the case of UnixBench).
>>>>
>>>> What I get from the numbers is that the patch almost always brings
>>>> benefits, in some cases even huge ones. There are a couple of cases
>>>> where we regress, but always only slightly so, especially if comparing
>>>> that to the magnitude of some of the improvements that we get.
>>>>
>>>> Bear also in mind that these results are gathered from Dom0, and without
>>>> any overcommitment at the vCPU level (i.e., nr. vCPUs == nr pCPUs). If
>>>> we move things in DomU and do overcommit at the Xen scheduler level, I
>>>> am expecting even better results.
>>>>
>>> ...
>>>> REQUEST FOR COMMENTS
>>>> ====================
>>>> Basically, the kind of feedback I'd be really glad to hear is:
>>>>  - what you guys think of the approach,
>>>
>>> Yesterday at the end of the developer meeting we (Andrew, Elena and
>>> myself) discussed this topic again.
>>>
>>> Regarding a possible future scenario with credit2 eventually supporting
>>> gang scheduling on hyperthreads (which is desirable due to security
>>> reasons [side channel attack] and fairness) my patch seems to be more
>>> suited for that direction than yours. Correct me if I'm wrong, but I
>>> think scheduling domains won't enable the guest kernel's scheduler to
>>> migrate threads more easily between hyperthreads as opposed to other
>>> vcpus, while my approach can easily be extended to do so.
>>>
>>>>  - whether you think, looking at this preliminary set of numbers, that
>>>>    this is something worth continuing investigating,
>>>
>>> I believe as both approaches lead to the same topology information used
>>> by the scheduler (all vcpus are regarded as being equal) your numbers
>>> should apply to my patch as well. Would you mind verifying this?
>>
>> If set_cpu_sibling_map()'s has_mp is false, wouldn't we effectively have
>> both of your patches?
>
> Hmm, sort of.
>
> OTOH this would make it hard to make use of some of the topology
> information in case of e.g. pinned vcpus (as George pointed out).

I didn't mean to just set has_mp to zero unconditionally (for Xen, or
any other, guest). We'd need to have some logic as to when to set it to
false.

-boris

>
>> Also, it seems to me that Xen guests would not be the only ones having
>> to deal with topology inconsistencies due to migrating VCPUs.
>> Don't KVM guests, for example, have the same problem? And if yes,
>> perhaps we should try solving it in non-Xen-specific way (especially
>> given that both of those patches look pretty simple and thus are
>> presumably easy to integrate into common code).
>
> Indeed. I'll have a try.
>
>> And, as George already pointed out, this should be an optional feature
>> --- if a guest spans physical nodes and VCPUs are pinned then we don't
>> always want flat topology/domains.
>
> Yes, it might be a good idea to be able to keep some of the topology
> levels. I'll modify my patch to make this command line selectable.
>
>
> Juergen
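
For illustration only, the "logic as to when to set it to false" could
look roughly like the user-space sketch below. It is not taken from any
posted patch: struct guest_topo_hints, its fields, and
force_flat_topology() are hypothetical names invented here, and only
has_mp and set_cpu_sibling_map() come from the discussion above. The
assumption is that topology is flattened only for a guest whose vCPUs
are not pinned and whose administrator has not asked, via a command
line option, to keep the topology levels.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical inputs; none of these names exist in the kernel. */
struct guest_topo_hints {
	bool on_hypervisor;   /* running as a guest (Xen, KVM, ...) */
	bool vcpus_pinned;    /* 1:1, immutable vCPU-to-pCPU pinning */
	bool keep_topology;   /* command line asked to keep topology levels */
};

/*
 * Returns true when the caller should override has_mp to false,
 * i.e. treat all vCPUs as equal and build flat scheduling domains.
 */
bool force_flat_topology(const struct guest_topo_hints *h)
{
	assert(h != NULL);

	if (!h->on_hypervisor)
		return false;   /* bare metal: topology is trustworthy */
	if (h->keep_topology)
		return false;   /* explicit opt-out by the administrator */
	if (h->vcpus_pinned)
		return false;   /* pinned vCPUs: topology stays meaningful */

	return true;            /* vCPUs move around: flatten */
}
```

Keeping the decision in one predicate would make it hypervisor-agnostic,
which fits the suggestion of solving this in a non-Xen-specific way; how
pinning would actually be detected from inside a guest is left open here.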