From mboxrd@z Thu Jan 1 00:00:00 1970
From: Juergen Gross
Subject: Re: PV-vNUMA issue: topology is misinterpreted by the guest
Date: Mon, 27 Jul 2015 17:03:50 +0200
Message-ID: <55B64856.1090101@suse.com>
References: <1437042762.28251.18.camel@citrix.com>
 <55A7D17C.5060602@citrix.com> <55A7D2CC.1050708@oracle.com>
 <55A7F7F40200007800092152@mail.emea.novell.com>
 <55A7DE45.4040804@citrix.com> <55A7E2D8.3040203@oracle.com>
 <55A8B83802000078000924AE@mail.emea.novell.com>
 <1437118075.23656.25.camel@citrix.com> <55A946C6.8000002@oracle.com>
 <1437401354.5036.19.camel@citrix.com> <55AD08F7.7020105@oracle.com>
 <55AEA4DD.7080406@oracle.com> <1437572160.5036.39.camel@citrix.com>
 <55AF9F8F.7030200@suse.com> <55AFA16B.3070103@oracle.com>
 <55AFA41E.1080101@suse.com> <55AFAC34.1060606@oracle.com>
 <55B070ED.2040200@suse.com> <1437660433.5036.96.camel@citrix.com>
 <55B21364.5040906@suse.com> <1437749076.4682.47.camel@citrix.com>
 <55B25650.4030402@suse.com> <55B258C9.4040400@suse.com>
 <1437753509.4682.78.camel@citrix.com> <55B26377.4060807@suse.com>
 <1438006166.5036.156.camel@citrix.com> <55B64193.9030400@oracle.com>
 <55B64383.1000902@suse.com> <55B64561.6020402@oracle.com>
Mime-Version: 1.0
Content-Type: text/plain; charset="us-ascii"; Format="flowed"
Content-Transfer-Encoding: 7bit
Received: from mail6.bemta14.messagelabs.com ([193.109.254.103])
 by lists.xen.org with esmtp (Exim 4.72) id 1ZJjwX-000645-EH
 for xen-devel@lists.xenproject.org; Mon, 27 Jul 2015 15:03:53 +0000
In-Reply-To: <55B64561.6020402@oracle.com>
Sender: xen-devel-bounces@lists.xen.org
Errors-To: xen-devel-bounces@lists.xen.org
To: Boris Ostrovsky, Dario Faggioli
Cc: Elena Ufimtseva, Wei Liu, Andrew Cooper, David Vrabel, Jan Beulich,
 "xen-devel@lists.xenproject.org"
List-Id: xen-devel@lists.xenproject.org

On 07/27/2015 04:51 PM, Boris Ostrovsky wrote:
> On 07/27/2015 10:43 AM, Juergen Gross wrote:
>> On 07/27/2015 04:34 PM, Boris Ostrovsky wrote:
>>> On 07/27/2015 10:09 AM, Dario Faggioli wrote:
>>>> On Fri, 2015-07-24 at 18:10 +0200, Juergen Gross wrote:
>>>>> On 07/24/2015 05:58 PM, Dario Faggioli wrote:
>>>>>> So, just to check if my understanding is correct: you'd like to
>>>>>> add an abstraction layer, in Linux, like in generic (or, perhaps,
>>>>>> scheduling) code, to hide the direct interaction with CPUID.
>>>>>> Such layer, on baremetal, would just read CPUID while, on PV-ops,
>>>>>> it'd check with Xen/match vNUMA/whatever... Is this what you are
>>>>>> saying?
>>>>> Sort of, yes.
>>>>>
>>>>> I just wouldn't add it, as it is already existing (more or less). It
>>>>> can deal right now with AMD and Intel, we would "just" have to add
>>>>> Xen.
>>>>>
>>>> So, having gone through the rest of the thread (so far), and having
>>>> given a fair amount of thinking to this, I really think that something
>>>> like this would be a good thing to have in Linux.
>>>>
>>>> Of course, it's not that my opinion on what should be in Linux counts
>>>> that much! :-D Nevertheless, I wanted to make it clear that, while
>>>> skeptical at the beginning, I now think this is (part of) the way to
>>>> go, as I said and explained in my reply to George.
>>>
>>> And I continue to believe that kernel solution does not address the
>>> userland problem which is no less important than making kernel do proper
>>> scheduling decisions (and I suspect when this patch goes for review
>>> that's what the scheduling people are going to say).
>>>
>>> Remember the original problem that started this thread was that kernel
>>> complained that topology didn't make sense and it turned off all
>>> topology-related decisions. Which means that kernel already has a
>>> solution for weird topology. Some enumeration doesn't trigger this
>>> warning, but we can come up with one that does. Or we can indeed have a
>>> patch in kernel that will, possibly silently, fail topology_sane() when
>>> virtualized and not pinned.
>>
>> How would you come up with a topology the kernel is complaining about
>> and user mode scheduling will use for sane decisions?
>
> We need to understand first why Dario's box is apparently the only one
> resulting in a warning and probably then emulate that enumeration.

This will lead to other problems in user land e.g. with hwloc.

> And again, if that is not possible then just make topology_sane() fail.

And again: at one point you claim that kernel mode isn't everything, and
here you fail to respect possible user land requirements.

>>> (This is what I assume kernel does when topology_sane() fails. And if it
>>> doesn't, that's a bug IMO)
>>>
>>> The licensing problem that Juergen described can be solved by pinning
>>> vcpus and exposing HT bit. Besides, creating a guest with 24 VCPUs and
>>
>> Hmm, yes. This way you sacrifice most of the virtualization advantages.
>>
>>> hoping that 16-core licensing will work I think is pushing it a bit when
>>> you know that VCPUs will jump around cores (i.e. "on average" you are
>>> running on more than 16 cores -- multi-threaded or not -- which arguably
>>> is what licensing is trying to prevent)
>>
>> On a machine with only 16 cores running on more than 16 cores? I have
>> some trouble believing this. The point was: if the license is happy on
>> bare metal it should be so when running on the same hardware as a guest.
>
> Ok, that's not how I should have described it. I meant that IMO asking
> for 24 VCPUs is somewhat akin to oversubscribing since you kind of know
> that you don't have 24 PCPUs, you are just trying to fool the kernel
> into thinking that threads are cores.

/proc/cpuinfo on bare metal will list 32 cpus. xl info in dom0 will list
32 cpus. You have 32 entities where you can do scheduling. So what's the
problem having a domU with 24 vcpus? There are still 8 pcpus free for
e.g. dom0 then.

Juergen
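
To make the counting in the last paragraph concrete, here is a minimal
sketch of how the 32 logical cpus and the 16 cores can both be read off
/proc/cpuinfo-style text. The helper names (count_cpus, synthetic_cpuinfo)
are made up for illustration, and the input is synthetic: one socket with
16 cores and 2 threads per core is an assumption, since the thread only
states 16 cores and 32 cpus. Inside a PV guest these fields show whatever
topology the guest is presented with, which is exactly what this thread is
arguing about.

```python
def count_cpus(cpuinfo_text):
    """Return (logical_cpus, physical_cores) from /proc/cpuinfo-style text."""
    logical = 0
    cores = set()
    physical_id = core_id = None
    # The extra "" flushes the last stanza even without a trailing blank line.
    for line in cpuinfo_text.splitlines() + [""]:
        if not line.strip():
            # A blank line ends one "processor" stanza.
            if physical_id is not None and core_id is not None:
                cores.add((physical_id, core_id))
            physical_id = core_id = None
            continue
        key, _, value = line.partition(":")
        key, value = key.strip(), value.strip()
        if key == "processor":
            logical += 1
        elif key == "physical id":
            physical_id = value
        elif key == "core id":
            core_id = value
    return logical, len(cores)


def synthetic_cpuinfo(sockets, cores_per_socket, threads_per_core):
    """Build hypothetical cpuinfo text; real files carry many more fields."""
    stanzas = []
    cpu = 0
    for s in range(sockets):
        for c in range(cores_per_socket):
            for _ in range(threads_per_core):
                stanzas.append(
                    f"processor\t: {cpu}\nphysical id\t: {s}\ncore id\t\t: {c}\n"
                )
                cpu += 1
    return "\n".join(stanzas)


# Assumed host layout: 1 socket x 16 cores x 2 threads.
logical, cores = count_cpus(synthetic_cpuinfo(1, 16, 2))
vcpus = 24  # size of the domU discussed above
print(logical, cores, logical - vcpus)  # 32 16 8
```

On a real bare-metal box the same function could be fed
open("/proc/cpuinfo").read(); the difference between the two returned
numbers is the threads-versus-cores distinction the licensing argument
turns on.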