From: Juergen Gross
Subject: Re: PV-vNUMA issue: topology is misinterpreted by the guest
Date: Tue, 28 Jul 2015 17:11:24 +0200
Message-ID: <55B79B9C.6030505@suse.com>
In-Reply-To: <55B7052F.8090804@suse.com>
To: Dario Faggioli
Cc: Elena Ufimtseva, Wei Liu, Andrew Cooper, David Vrabel, Jan Beulich,
 "xen-devel@lists.xenproject.org", Boris Ostrovsky
List-Id: xen-devel@lists.xenproject.org

On 07/28/2015 06:29 AM, Juergen Gross wrote:
> On 07/27/2015 04:09 PM, Dario Faggioli wrote:
>> On Fri, 2015-07-24 at 18:10 +0200, Juergen Gross wrote:
>>> On 07/24/2015 05:58 PM, Dario Faggioli wrote:
>>
>>>> So, just to check that my understanding is correct: you'd like to add
>>>> an abstraction layer in Linux, in generic (or, perhaps, scheduling)
>>>> code, to hide the direct interaction with CPUID.
>>>> Such a layer, on bare metal, would just read CPUID while, on PV-ops,
>>>> it'd check with Xen/match vNUMA/whatever... Is that what you are
>>>> saying?
>>>
>>> Sort of, yes.
>>>
>>> I just wouldn't add it, as it already exists (more or less). It can
>>> deal right now with AMD and Intel, we would "just" have to add Xen.
>>>
>> So, having gone through the rest of the thread (so far), and having
>> given a fair amount of thought to this, I really think that something
>> like this would be a good thing to have in Linux.
>>
>> Of course, it's not that my opinion on what should be in Linux counts
>> that much! :-D Nevertheless, I wanted to make it clear that, while
>> skeptical at the beginning, I now think this is (part of) the way to
>> go, as I said and explained in my reply to George.
>
> I think it's time to obtain some real numbers.
>
> I'll make some performance tests on a big machine (4 sockets, 60 cores,
> 120 threads) regarding topology information:
>
> - bare metal
> - "random" topology (like today)
> - "simple" topology (all vcpus regarded as equal)
> - "real" topology with all vcpus pinned
>
> This should show:
>
> - how intrusive would the topology patch(es) be?
> - what is the performance impact of a "wrong" scheduling database?

On the above box I used a pvops kernel 4.2-rc4 plus a rather small patch
(see attachment). I did 5 kernel builds in each environment:

  make clean
  time make -j 120

The first result of the 5 runs was always omitted, as that run still has
to build up buffer caches etc. The Xen cases were all done in dom0;
pinning of vcpus in the last scenario was done via the dom0_vcpus_pin
boot parameter of the hypervisor (a rough command sketch follows below,
after the patch).

Here are the results (all times in seconds):

                      elapsed    user    system
bare metal:               100    5770       805
"random" topology:        283    6740     20700
"simple" topology:        290    6740     22200
"real" topology:          185    7800      8040

As expected, bare metal is the best. Next is "real" topology with pinned
vcpus (expected again - but system time is already up by a factor of 10!).
What I didn't expect: "random" topology is better than "simple" topology.

I could test some other topologies (e.g. everything on one socket, or
even on one core), but I'm not sure this makes sense. I didn't check the
exact topology result of the "random" case; maybe I'll do that tomorrow
with another measurement.

BTW: the topology hack is working, as each cpu is shown to have a sibling
count of 1 in /proc/cpuinfo.


Juergen

[Attachment: topo.patch]

commit 2d9cee2b37319714e07e6f8b4044c0db44cc8e7d
Author: Juergen Gross
Date:   Tue Jul 28 09:28:35 2015 +0200

    xen: use simple topology on demand

    As the cpuid information is currently taken from the physical cpu the
    current virtual cpu is running on, the topology information derived
    from the cpuid is potentially useless.

    In order to avoid insane scheduling decisions based on random data,
    provide a possibility to set the topology data to a very simple
    scheme making all cpus appear to be independent from each other.

    Signed-off-by: Juergen Gross

diff --git a/arch/x86/xen/smp.c b/arch/x86/xen/smp.c
index 8648438..a57e816 100644
--- a/arch/x86/xen/smp.c
+++ b/arch/x86/xen/smp.c
@@ -55,6 +55,25 @@ static irqreturn_t xen_call_function_interrupt(int irq, void *dev_id);
 static irqreturn_t xen_call_function_single_interrupt(int irq, void *dev_id);
 static irqreturn_t xen_irq_work_interrupt(int irq, void *dev_id);
 
+static bool xen_simple_topology;
+static __init int xen_parse_topology(char *arg)
+{
+	xen_simple_topology = true;
+	return 0;
+}
+early_param("xen_simple_topology", xen_parse_topology);
+
+static void xen_set_cpu_sibling_map(int cpu)
+{
+	if (xen_simple_topology) {
+		cpumask_set_cpu(cpu, topology_sibling_cpumask(cpu));
+		cpumask_set_cpu(cpu, cpu_llc_shared_mask(cpu));
+		cpumask_set_cpu(cpu, topology_core_cpumask(cpu));
+	} else {
+		set_cpu_sibling_map(cpu);
+	}
+}
+
 /*
  * Reschedule call back.
 */
@@ -82,7 +101,7 @@ static void cpu_bringup(void)
 	cpu = smp_processor_id();
 	smp_store_cpu_info(cpu);
 	cpu_data(cpu).x86_max_cores = 1;
-	set_cpu_sibling_map(cpu);
+	xen_set_cpu_sibling_map(cpu);
 
 	xen_setup_cpu_clockevents();
 
@@ -333,7 +352,7 @@ static void __init xen_smp_prepare_cpus(unsigned int max_cpus)
 		zalloc_cpumask_var(&per_cpu(cpu_core_map, i), GFP_KERNEL);
 		zalloc_cpumask_var(&per_cpu(cpu_llc_shared_map, i), GFP_KERNEL);
 	}
-	set_cpu_sibling_map(0);
+	xen_set_cpu_sibling_map(0);
 
 	if (xen_smp_intr_init(0))
 		BUG();
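
For reference, a rough sketch of how the runs above could be reproduced.
The GRUB-style boot lines and the kernel image name are placeholders only;
xen_simple_topology is the early_param added by topo.patch, and
dom0_vcpus_pin is the hypervisor boot parameter mentioned above:

  # "simple" topology run: boot dom0 with the parameter from topo.patch
  #   multiboot /boot/xen.gz ...
  #   module    /boot/vmlinuz-4.2.0-rc4 ... xen_simple_topology
  #
  # "real" topology run: instead pin dom0's vcpus on the Xen command line
  #   multiboot /boot/xen.gz ... dom0_vcpus_pin

  # verify the simple topology took effect (each cpu should report a
  # sibling count of 1):
  grep siblings /proc/cpuinfo | sort | uniq -c

  # benchmark as above; run five times and discard the first
  # (cold-cache) result:
  make clean
  time make -j 120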