Date: Tue, 7 Apr 2015 10:14:10 -0700
From: Nishanth Aravamudan
To: Peter Zijlstra
Subject: Re: Topology updates and NUMA-level sched domains
Message-ID: <20150407171410.GA62529@linux.vnet.ibm.com>
References: <20150406214558.GA38501@linux.vnet.ibm.com> <20150407102147.GJ23123@twins.programming.kicks-ass.net>
In-Reply-To: <20150407102147.GJ23123@twins.programming.kicks-ass.net>
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Cc: Boqun Feng, Srikar Dronamraju, linux-kernel@vger.kernel.org, Ingo Molnar, Anton Blanchard, linuxppc-dev@lists.ozlabs.org, Anshuman Khandual
List-Id: Linux on PowerPC Developers Mail List

On 07.04.2015 [12:21:47 +0200], Peter Zijlstra wrote:
> On Mon, Apr 06, 2015 at 02:45:58PM -0700, Nishanth Aravamudan wrote:
> > Hi Peter,
> >
> > As you are very aware, I think, power has some odd NUMA topologies (and
> > changes to those topologies) at run-time. In particular, we can see a
> > topology at boot:
> >
> > Node 0: all CPUs
> > Node 7: no CPUs
> >
> > Then we get a notification from the hypervisor that a core (or two)
> > have moved from node 0 to node 7. This results in the:
> >
> > or a re-init API (which won't try to reallocate various bits), because
> > the topology could be completely different now (e.g.,
> > sched_domains_numa_distance will also be inaccurate now). Really, a
> > topology update on power (not sure on s390x, but those are the only
> > two archs that return a positive value from arch_update_cpu_topology()
> > right now, afaics) is a lot like a hotplug event and we need to
> > re-initialize any dependent structures.
> >
> > I'm just sending out feelers, as we can limp by with the above
> > warning, it seems, but it is less than ideal. Any help or insight you
> > could provide would be greatly appreciated!
>
> So I think (and ISTR having stated this before) that dynamic cpu<->node
> maps are absolutely insane.

Sorry if I wasn't involved at the time. I agree that it's a bit of a
mess!

> There is a ton of stuff that assumes the cpu<->node relation is a boot
> time fixed one. Userspace being one of them. Per-cpu memory another.

Well, userspace already deals with CPU hotplug, right? And the topology
updates are, in a lot of ways, just like you've hotplugged a CPU from
one node and re-hotplugged it into another node.

I'll look into the per-cpu memory case.
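Just to make the userspace comparison concrete, here is a rough,
untested user-space sketch (nothing that exists today; node_of_cpu() is
only an illustrative helper) of re-reading a CPU's current node
assignment from the sysfs links the kernel already maintains, which is
what a hotplug-aware tool has to do anyway:

/*
 * Illustrative only: re-read a CPU's current node assignment from sysfs,
 * the way a NUMA-aware program could refresh a mapping it cached at boot.
 */
#include <dirent.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Illustrative helper: return the node sysfs currently reports for @cpu. */
static int node_of_cpu(int cpu)
{
	char path[64];
	struct dirent *d;
	DIR *dir;
	int node = -1;

	snprintf(path, sizeof(path), "/sys/devices/system/cpu/cpu%d", cpu);
	dir = opendir(path);
	if (!dir)
		return -1;

	/* With CONFIG_NUMA, each cpuN directory carries a "nodeM" link. */
	while ((d = readdir(dir)) != NULL) {
		if (strncmp(d->d_name, "node", 4) == 0) {
			node = atoi(d->d_name + 4);
			break;
		}
	}
	closedir(dir);
	return node;
}

int main(int argc, char **argv)
{
	int cpu = argc > 1 ? atoi(argv[1]) : 0;

	printf("cpu%d is currently on node %d\n", cpu, node_of_cpu(cpu));
	return 0;
}

If the topology update re-creates those sysfs links the way CPU hotplug
does, a program like this sees the new mapping immediately; anything
that cached cpu-to-node at startup obviously does not.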
For what it's worth, our test teams are stressing the kernel with these
topology updates and hopefully we'll be able to resolve any issues that
result.

> You simply cannot do this without causing massive borkage.
>
> So please come up with a coherent plan to deal with the entire problem
> of dynamic cpu to memory relation and I might consider the scheduler
> impact. But we're not going to hack around and maybe make it not crash
> in a few corner cases while the entire thing is shite.

Well, it doesn't crash now. In fact, it stays up reasonably well and
seems to do the right thing (from the kernel's perspective) other than
the sched domain messages.

I will look into per-cpu memory, and also another case I have been
thinking about: if a process is bound to a CPU/node combination via
numactl and then the topology changes, what exactly will happen? In
theory, via these topology updates, a node could go from memoryless to
non-memoryless (and vice versa), which seems like it might not be well
supported (but, again, that should not be much different from
hotplugging all the memory out of a node).

In fact, topologically speaking, I think I should be able to reproduce
the same sched domain warnings if I start off with a 2-node system with
all CPUs on one node and then hotplug a CPU onto the second node, right?
That has nothing to do with power, as far as I can tell. I'll see if I
can demonstrate it via a KVM guest.

Thanks for your quick response!

-Nish
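P.S. For the numactl question above, the check I have in mind is roughly
the following (a quick libnuma/move_pages sketch, untested and nothing
power-specific): bind to node 0, allocate and fault a page there, then
re-query where the thread and the page actually are once the topology
has been updated.

/*
 * Rough test sketch for the "bound via numactl, then the topology
 * changes" case: pin to node 0, allocate there, then report where the
 * thread and the page are actually placed when re-checked later.
 */
#define _GNU_SOURCE
#include <numa.h>	/* libnuma; link with -lnuma */
#include <numaif.h>	/* move_pages() */
#include <sched.h>	/* sched_getcpu() */
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
	void *pages[1];
	int status[1] = { -1 };
	char *buf;

	if (numa_available() < 0 || numa_max_node() < 1) {
		fprintf(stderr, "need NUMA with at least two nodes\n");
		return 1;
	}

	/* Bind the current thread and one page of memory to node 0. */
	numa_run_on_node(0);
	buf = numa_alloc_onnode(getpagesize(), 0);
	if (!buf) {
		fprintf(stderr, "numa_alloc_onnode failed\n");
		return 1;
	}
	memset(buf, 0, getpagesize());	/* fault the page in */

	/*
	 * Later (e.g. after a topology update), re-check actual placement:
	 * a NULL nodes argument makes move_pages() just report status.
	 */
	pages[0] = buf;
	move_pages(0, 1, pages, NULL, status, 0);
	printf("thread on cpu %d (node %d), page on node %d\n",
	       sched_getcpu(), numa_node_of_cpu(sched_getcpu()), status[0]);

	numa_free(buf, getpagesize());
	return 0;
}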