Date: Tue, 5 May 2009 16:40:27 +0200
From: Andreas Herrmann
To: Andi Kleen
CC: Ingo Molnar, "H. Peter Anvin", Thomas Gleixner, linux-kernel@vger.kernel.org
Subject: Re: [PATCH 0/3] x86: adapt CPU topology detection for AMD Magny-Cours
Message-ID: <20090505144027.GD29045@alberich.amd.com>
References: <20090504173330.GF28728@alberich.amd.com> <87vdogbp4g.fsf@basil.nowhere.org> <20090505092238.GB29045@alberich.amd.com> <20090505093520.GL23223@one.firstfloor.org> <20090505104848.GC29045@alberich.amd.com> <20090505120206.GM23223@one.firstfloor.org>
In-Reply-To: <20090505120206.GM23223@one.firstfloor.org>
User-Agent: Mutt/1.5.16 (2007-06-09)

On Tue, May 05, 2009 at 02:02:06PM +0200, Andi Kleen wrote:
> On Tue, May 05, 2009 at 12:48:48PM +0200, Andreas Herrmann wrote:
> > On Tue, May 05, 2009 at 11:35:20AM +0200, Andi Kleen wrote:
> > > > Best example is node interleaving. Usually you won't get a SRAT table
> > > > on such a system.
> > >
> > > That sounds like a BIOS bug. It should supply a suitable SLIT/SRAT
> > > even for this case.
> > > Or perhaps if the BIOSes are really that broken
> > > add a suitable quirk that provides distances, but better fix the BIOSes.
> >
> > How do you define SRAT when node interleaving is enabled?
> > (Defining same distances between all nodes, describing only one node,
> > or omitting SRAT entirely? I've observed that the latter is common
> > behavior.)
>
> Either a memory-less node with 10 distance (which seems to be en vogue
> recently for some reason) or multiple nodes with 10 distance.

See below -- (a) and (b).

> > > > Thus you see just one NUMA node in
> > > > /sys/devices/system/node. But on such a configuration you still see
> > > > (and you want to see) the correct CPU topology information in
> > > > /sys/devices/system/cpu/cpuX/topology. Based on that you always can
> > > > figure out which cores are on the same physical package independent of
> > > > availability and contents of SRAT and even with kernels that are
> > > > compiled w/o NUMA support.
> > >
> > > So you're adding a x86 specific mini NUMA for kernels without NUMA
> > > (which btw becomes more and more an exotic case -- modern distros
> > > are normally unconditionally NUMA). Doesn't seem very useful.
> >
> > No, I just tried to give an example why you can't derive CPU topology
>
> First I must say it's unclear to me if CPU topology is really generally
> useful to export to the user.

I think it is useful. (Linux already provides this kind of information,
but it should pass only useful/correct information to user space. For
Magny-Cours the sibling information provided by the current Linux kernel
is not very useful.)

> If they want to know how far cores
> are away they should look at cache sharing and NUMA distances (especially
> cache topology gives a very good approximation anyways). For other
> purposes like power management just having arbitrary sets (x is shared
> with y in a set without hierarchy) seems to work just fine.
(a) Are you saying that users have to check NUMA distances when they
want to pin tasks on certain CPUs?

(b) Just an example: if SRAT describes "either a memory-less node with
10 distance (which seems to be en vogue recently for some reason) or
multiple nodes with 10 distance", how would you do (a) to pin tasks,
say, on the same internal node or on the same physical package?

That is not straightforward. But representing the entire CPU topology
in sysfs makes this obvious.

> Then traditionally there were special cases for SMT and for packages
> (for error containment etc.) and some hacks for licensing, but these
> don't really apply in your case or can be already expressed in other
> ways.

Yes, it's a new "special case" which can't be expressed in other ways.
SLIT and SRAT are not sufficient. The kernel needs to know which cores
are on the same internal node.

The way my patches do it is to fix core_siblings to refer to siblings
on the same internal node instead of the physical processor.

Another approach would have been:
- to keep core_siblings as is (to specify siblings on the same physical
  processor)
- and to introduce a new CPU mask to specify siblings on the same
  internal node

> If there is really a good use case for exporting CPU topology I would
> argue for not adding another adhoc level, but just export a SLIT
> style arbitrary distance table somewhere in sys. That would support
> expressing any possible future hierarchies too. But again I have doubts
> that's really needed at all.

I don't agree on this.

> > > and you're making it even worse, adding another strange special case.
> >
> > It's an abstraction -- I think of it just as another level in the CPU
>
> It's not a general abstraction, just another ad-hoc hack.

Fine. But do you have any constructive and usable suggestion how Linux
should handle topology information for multi-node processors?
> > hierarchy -- where existing CPUs and multi-node CPUs fit in:
> >
> > physical package --> processor node --> processor core --> thread
> >
> > I guess the problem is that you are associating node always with NUMA.
> > Would it help to rename cpu_node_id to something else?
>
> Nope. It's a general problem, renaming it won't make it better.

I don't see "a general problem". There is just a new CPU introducing a
topology that slightly differs from what we had so far. Adapting the
kernel shouldn't be that problematic.

> > or something entirely different?
> >
> > > On the other hand NUMA topology is comparatively straightforward and
> > > well understood, and it's flexible enough to express your case too.
> > >
> > > > physical package == two northbridges (two nodes)
> > > >
> > > > and this needs to be represented somehow in the kernel.
> > >
> > > It's just two nodes with a very fast interconnect.
> >
> > In fact, I also thought about representing each internal node as one
> > physical package. But that is even worse as you can't figure out which
> > node is on the same socket.
>
> That's what the physical id is for.

Seconded. The physical id is for identifying the socket. That implies
that phys_proc_id != id_of_internal_node, right?

> > > You're saying there are MSRs shared between the two in-package nodes?
> >
> > No. I referred to NB MSRs that are shared between the cores on the
> > same (internal) node.
>
> Just check siblings then.

I can't "just check siblings" if core_siblings represents all cores on
the same physical package.


Regards,
Andreas

-- 
Operating | Advanced Micro Devices GmbH
 System   | Karl-Hammerschmidt-Str. 34, 85609 Dornach b. München, Germany
Research  | Geschäftsführer: Thomas M. McCoy, Giuliano Meroni
 Center   | Sitz: Dornach, Gemeinde Aschheim, Landkreis München
 (OSRC)   | Registergericht München, HRB Nr. 43632