From mboxrd@z Thu Jan 1 00:00:00 1970
Date: Wed, 17 Aug 2016 00:19:53 +0200
From: Heiko Carstens <heiko.carstens@de.ibm.com>
To: Tejun Heo
Cc: Peter Zijlstra, Ming Lei, Thomas Gleixner, LKML, Yasuaki Ishimatsu, Andrew Morton, Lai Jiangshan, Michael Holzheu, Martin Schwidefsky
Subject: Re: [bisected] "sched: Allow per-cpu kernel threads to run on online && !active" causes warning
References: <20160727125412.GB3912@osiris> <20160730112552.GA3744@osiris> <20160815111908.GA3903@osiris> <20160815224801.GA3672@mtj.duckdns.org> <20160816075505.GB3896@osiris> <20160816152027.GD9516@htj.duckdns.org> <20160816152949.GL30192@twins.programming.kicks-ass.net> <20160816154205.GE9516@htj.duckdns.org>
In-Reply-To: <20160816154205.GE9516@htj.duckdns.org>
Message-Id: <20160816221953.GA3373@osiris>
On Tue, Aug 16, 2016 at 11:42:05AM -0400, Tejun Heo wrote:
> Hello, Peter.
>
> On Tue, Aug 16, 2016 at 05:29:49PM +0200, Peter Zijlstra wrote:
> > On Tue, Aug 16, 2016 at 11:20:27AM -0400, Tejun Heo wrote:
> > > As long as the mapping doesn't change after the first onlining of the
> > > CPU, the workqueue side shouldn't be too difficult to fix up. I'll
> > > look into it. For memory allocations, as long as the cpu <-> node
> > > mapping is established before any memory allocation for the cpu takes
> > > place, it should be fine too, I think.
> >
> > Don't we allocate per-cpu memory for 'cpu_possible_map' on boot? There's
> > a whole bunch of per-cpu memory users that do things like:
> >
> > 	for_each_possible_cpu(cpu) {
> > 		struct foo *foo = per_cpu_ptr(&per_cpu_var, cpu);
> >
> > 		/* muck with foo */
> > 	}
> >
> > Which requires a cpu->node map for all possible cpus at boot time.
>
> Ah, right. If the cpu -> node mapping is dynamic, there isn't much
> we can do about allocating per-cpu memory on the wrong node. And it
> is problematic that percpu allocations can race against an onlining
> CPU switching its node association.
>
> One way to keep the mapping stable would be reserving per-node
> possible CPU slots, so that the CPU number assigned to a new CPU is on
> the right node. It'd be a simple solution but would get really
> expensive with an increasing number of nodes.
>
> Heiko, do you have any ideas?

I think the easiest solution would be to simply assign all cpus for
which we do not have any topology information to an arbitrary node,
e.g. round robin.

After all, the case that cpus are added later is rare, and the s390
fake NUMA implementation does not know about the memory topology
anyway. All it does is distribute the memory across several nodes in
order to avoid a single huge node. So that should be sort of ok.

Unless somebody has a better idea? Michael, Martin?