From mboxrd@z Thu Jan 1 00:00:00 1970
From: Boris Ostrovsky
Subject: Re: PV-vNUMA issue: topology is misinterpreted by the guest
Date: Tue, 21 Jul 2015 16:00:29 -0400
Message-ID: <55AEA4DD.7080406@oracle.com>
References: <1437042762.28251.18.camel@citrix.com>
 <55A7A7F40200007800091D60@mail.emea.novell.com>
 <55A78DF2.1060709@citrix.com>
 <20150716152513.GU12455@zion.uk.xensource.com>
 <55A7D17C.5060602@citrix.com>
 <55A7D2CC.1050708@oracle.com>
 <55A7F7F40200007800092152@mail.emea.novell.com>
 <55A7DE45.4040804@citrix.com>
 <55A7E2D8.3040203@oracle.com>
 <55A8B83802000078000924AE@mail.emea.novell.com>
 <1437118075.23656.25.camel@citrix.com>
 <55A946C6.8000002@oracle.com>
 <1437401354.5036.19.camel@citrix.com>
 <55AD08F7.7020105@oracle.com>
Mime-Version: 1.0
Content-Type: text/plain; charset="us-ascii"; Format="flowed"
Content-Transfer-Encoding: 7bit
Return-path:
Received: from mail6.bemta5.messagelabs.com ([195.245.231.135])
 by lists.xen.org with esmtp (Exim 4.72) (envelope-from )
 id 1ZHdir-0002wn-6y for xen-devel@lists.xenproject.org;
 Tue, 21 Jul 2015 20:01:05 +0000
In-Reply-To: <55AD08F7.7020105@oracle.com>
List-Unsubscribe: ,
List-Post:
List-Help:
List-Subscribe: ,
Sender: xen-devel-bounces@lists.xen.org
Errors-To: xen-devel-bounces@lists.xen.org
To: Dario Faggioli
Cc: Elena Ufimtseva, Wei Liu, Andrew Cooper, David Vrabel, Jan Beulich,
 "xen-devel@lists.xenproject.org"
List-Id: xen-devel@lists.xenproject.org

On 07/20/2015 10:43 AM, Boris Ostrovsky wrote:
> On 07/20/2015 10:09 AM, Dario Faggioli wrote:
>> On Fri, 2015-07-17 at 14:17 -0400, Boris Ostrovsky wrote:
>>> On 07/17/2015 03:27 AM, Dario Faggioli wrote:
>>>> In the meanwhile, what should we do? Document this? How? "don't use
>>>> vNUMA with PV guest in SMT enabled systems" seems a bit harsh... Is
>>>> there a workaround we can put in place/suggest?
>>> I haven't been able to reproduce this on my Intel box because I think I
>>> have different core enumeration.
>>>
>> Yes, most likely, that's highly topology dependant. :-(
>>
>>> Can you try adding
>>> cpuid=['0x1:ebx=xxxxxxxx00000001xxxxxxxxxxxxxxxx']
>>> to your config file?
>>>
>> Done (sorry for the delay, the testbox was busy doing other stuff).
>>
>> Still no joy (.101 is the IP address of the guest, domain id 3):
>>
>> root@Zhaman:~# ssh root@192.168.1.101 "yes > /dev/null 2>&1 &"
>> root@Zhaman:~# ssh root@192.168.1.101 "yes > /dev/null 2>&1 &"
>> root@Zhaman:~# ssh root@192.168.1.101 "yes > /dev/null 2>&1 &"
>> root@Zhaman:~# ssh root@192.168.1.101 "yes > /dev/null 2>&1 &"
>> root@Zhaman:~# xl vcpu-list 3
>> Name  ID  VCPU  CPU  State  Time(s)  Affinity (Hard / Soft)
>> test   3     0    4  r--       23.6  all / 0-7
>> test   3     1    9  r--       19.8  all / 0-7
>> test   3     2    8  -b-        0.4  all / 8-15
>> test   3     3    4  -b-        0.2  all / 8-15
>>
>> *HOWEVER* it seems to have an effect. In fact, now, topology as it is
>> shown in /sys/... is different:
>>
>> root@test:~# cat /sys/devices/system/cpu/cpu0/topology/thread_siblings_list
>> 0
>> (it was 0-1)
>>
>> This, OTOH, is still the same:
>> root@test:~# cat /sys/devices/system/cpu/cpu0/topology/core_siblings_list
>> 0-3
>>
>> Also, I now see this:
>>
>> [ 0.150560] ------------[ cut here ]------------
>> [ 0.150560] WARNING: CPU: 2 PID: 0 at ../arch/x86/kernel/smpboot.c:317 topology_sane.isra.2+0x74/0x88()
>> [ 0.150560] sched: CPU #2's llc-sibling CPU #0 is not on the same node! [node: 1 != 0]. Ignoring dependency.
>> [ 0.150560] Modules linked in:
>> [ 0.150560] CPU: 2 PID: 0 Comm: swapper/2 Not tainted 3.19.0+ #1
>> [ 0.150560] 0000000000000009 ffff88001ee2fdd0 ffffffff81657c7b ffffffff810bbd2c
>> [ 0.150560] ffff88001ee2fe20 ffff88001ee2fe10 ffffffff81081510 ffff88001ee2fea0
>> [ 0.150560] ffffffff8103aa02 ffff88003ea0a001 0000000000000000 ffff88001f20a040
>> [ 0.150560] Call Trace:
>> [ 0.150560] [] dump_stack+0x4f/0x7b
>> [ 0.150560] [] ? up+0x39/0x3e
>> [ 0.150560] [] warn_slowpath_common+0xa1/0xbb
>> [ 0.150560] [] ?
topology_sane.isra.2+0x74/0x88
>> [ 0.150560] [] warn_slowpath_fmt+0x46/0x48
>> [ 0.150560] [] ? __cpuid.constprop.0+0x15/0x19
>> [ 0.150560] [] topology_sane.isra.2+0x74/0x88
>> [ 0.150560] [] set_cpu_sibling_map+0x27a/0x444
>> [ 0.150560] [] ? numa_add_cpu+0x98/0x9f
>> [ 0.150560] [] cpu_bringup+0x63/0xa8
>> [ 0.150560] [] cpu_bringup_and_idle+0xe/0x1a
>> [ 0.150560] ---[ end trace 63d204896cce9f68 ]---
>>
>> Notice that it now says 'llc-sibling', while, before, it was saying
>> 'smt-sibling'.
>
> Exactly. You are now passing the first topology test which was to see
> that threads are on the same node. And since each processor has only
> one thread (as evidenced by thread_siblings_list) we are good.
>
> The second test checks that cores (i.e. things that share last level
> cache) are on the same node. And they are not.
>
>
>>
>>> On AMD, BTW, we fail a different test so some other bits probably need
>>> to be tweaked. You may fail it too (the LLC sanity check).
>>>
>> Yep, that's the one I guess. Should I try something more/else?
>
>
> I'll need to see how LLC IDs are calculated, probably also from some
> CPUID bits.

No, can't do this: the LLC ID is calculated from CPUID leaf 4 (on Intel),
which uses an index in the ECX register, and xl syntax doesn't allow you
to override CPUID values for such leaves.

-boris

> The question though will be --- what do we do with how cache sizes
> (and TLB sizes for that matter) are presented to the guests. Do we
> scale them down per thread?
>
> -boris
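
For illustration only, here is a rough Python sketch of why the LLC check discussed in this message trips. It assumes the guest derives an LLC ID by masking off the low APIC ID bits covered by the "logical processors sharing this cache" field in CPUID.4:EAX[25:14], which is a simplification of what the kernel does. The APIC IDs, the leaf 4 value, the node assignment and all function names are hypothetical, chosen only to mimic a 4-vCPU guest with two vNUMA nodes.

```python
# Illustrative sketch only; not kernel code. All input values are made up.

def get_count_order(n):
    """Return the smallest k with 2**k >= n, for n >= 1."""
    return (n - 1).bit_length()


def llc_id(apicid, leaf4_eax):
    """Derive an LLC id from an APIC id and CPUID.4:EAX of the LLC subleaf.

    EAX[25:14] encodes (maximum logical processors sharing the cache) - 1.
    The low APIC id bits that distinguish those processors are masked off.
    """
    sharing = ((leaf4_eax >> 14) & 0xFFF) + 1
    return apicid & ~((1 << get_count_order(sharing)) - 1)


def llc_node_mismatches(cpus):
    """Return (cpu, sibling, cpu_node, sibling_node) for LLC siblings
    that sit on different nodes.

    cpus maps a CPU number to a tuple (apicid, leaf4_eax, node).
    """
    found = []
    for cpu in sorted(cpus):
        apicid, eax, node = cpus[cpu]
        for other in sorted(cpus):
            if other >= cpu:
                break  # only compare against CPUs brought up earlier
            o_apicid, o_eax, o_node = cpus[other]
            same_llc = llc_id(apicid, eax) == llc_id(o_apicid, o_eax)
            if same_llc and node != o_node:
                found.append((cpu, other, node, o_node))
    return found


# Hypothetical 4-vCPU guest: the host leaf 4 value says 16 threads share the
# LLC, while vNUMA places vCPUs 0-1 on node 0 and vCPUs 2-3 on node 1.
HOST_LEAF4_EAX = 15 << 14
guest = {
    0: (0, HOST_LEAF4_EAX, 0),
    1: (2, HOST_LEAF4_EAX, 0),
    2: (4, HOST_LEAF4_EAX, 1),
    3: (6, HOST_LEAF4_EAX, 1),
}

for cpu, sib, node, sib_node in llc_node_mismatches(guest):
    print(f"CPU #{cpu}'s llc-sibling CPU #{sib} is not on the same node! "
          f"[node: {node} != {sib_node}]")
```

With these made-up inputs every vCPU ends up with the same LLC ID, so the vCPUs on node 1 are reported as LLC siblings of the vCPUs on node 0. Under the same assumptions, a smaller sharing count in leaf 4 would split the LLC IDs along the node boundary, but that leaf is the one the xl cpuid syntax cannot override.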