Date: Sat, 5 Sep 2015 12:53:57 -0700
From: "Paul E. McKenney"
Reply-To: paulmck@linux.vnet.ibm.com
To: Frederic Weisbecker
Cc: Peter Zijlstra, Tejun Heo, linux-kernel@vger.kernel.org
Subject: Re: Warning in irq_work_queue_on()
Message-ID: <20150905195357.GP4029@linux.vnet.ibm.com>
In-Reply-To: <20150904151153.GB13708@lerouge>

On Fri, Sep 04, 2015 at 05:11:54PM +0200, Frederic Weisbecker wrote:
> On Thu, Sep 03, 2015 at 09:58:40AM +0200, Peter Zijlstra wrote:
> > On Thu, Sep 03, 2015 at 02:03:51AM +0200, Frederic Weisbecker wrote:
> > > On Thu, Sep 03, 2015 at 12:24:27AM +0200, Peter Zijlstra wrote:
> > > > On Wed, Sep 02, 2015 at 11:50:22PM +0200, Frederic Weisbecker wrote:
> > > > > > > [ 875.703227] [] tick_nohz_full_kick_cpu+0x44/0x50
> > > > >
> > > > > It happens in nohz full, but I'm not sure the guilty is nohz full.
> > > > >
> > > > > The problem here is that wake_up_nohz_cpu() selects a CPU that is offline.
> > > >
> > > > wake_up_nohz_cpu() doesn't do any such thing. Where does the selection
> > > > logic live?
> > >
> > > Err, got confused with get_nohz_timer_target(). But yeah wake_up_nohz_cpu() is
> > > called with a CPU that is chosen by mod_timer() -> get_nohz_timer_target().
> > >
> > > >
> > > > > But this shouldn't happen. Either it selects a CPU that is in the domain tree,
> > > > > and I suspect offline CPUs aren't supposed to be there, or it selects the current
> > > > > CPU. And if the CPU is offlined, it shouldn't be running some kthread...
> > > >
> > > > Do not assume things like that.. always check with the active mask.
> > >
> > > Hmm, so perhaps we need something like this (makes me realize that
> > > the is_housekeeping_cpu() passes the wrong argument, no issue in practice
> > > since nohz full aren't in the domain tree but I still need to fix that along).
> > >
> > > diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> > > index 0902e4d..2c10a69 100644
> > > --- a/kernel/sched/core.c
> > > +++ b/kernel/sched/core.c
> > > @@ -628,7 +628,7 @@ int get_nohz_timer_target(void)
> > >
> > >  	rcu_read_lock();
> > >  	for_each_domain(cpu, sd) {
> > > -		for_each_cpu(i, sched_domain_span(sd)) {
> > > +		for_each_cpu_and(i, sched_domain_span(sd), cpu_online_mask) {
> >
> > cpu_active_mask, we clear that when we start killing the cpu. online
> > only gets cleared once the cpu is actually dead.
>
> So, after our discussion in IRC, I checked how domains are rebuilt on hotplug
> ops and it appears that partition_sched_domains() is called on CPU_DOWN_PREPARE
> only. The CPU shouldn't be on the domain tree after that.
>
> (Correct me if I'm wrong, I really am not an expert in the domain handling code.
> As you said that we can't guarantee that a CPU in the domain tree is in the cpu_online_mask,
> I'm likely wrong somewhere).
>
> This is then followed by synchronize_sched().
> Which means that after that, the
> new version of the CPU domains (with the offlining CPU excluded) is visible
> everywhere while the CPU is still in cpu_online_mask.
>
> And finally stop machine runs and the CPU is cleared out of cpu_online_mask.
> So I'm probably missing something, otherwise we could find a CPU in the domain
> tree that is not in cpu_online_mask.

OK, I have to ask...  Should I be trying Frederic's patch?

At the current failure rate, I will need to be running it for about
a year to give any reasonable conclusion.  :-/

							Thanx, Paul
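
For anyone following the active-versus-online point above, here is a rough
userspace sketch of the difference between the two filters. It is not the
kernel code: the masks are plain 32-bit integers, every toy_* name is made up
for illustration, and the idle and housekeeping checks that the real
get_nohz_timer_target() performs are left out. It also assumes the caller can
still see a span containing the outgoing CPU, which is exactly the case the
thread has not settled.

```c
#include <assert.h>
#include <stdint.h>

/* Toy stand-in for a cpumask: bit n set means CPU n is in the mask. */
typedef uint32_t toy_mask_t;
#define TOY_CPU(n) ((toy_mask_t)1u << (n))

/* Hypothetical snapshot of the three masks discussed in the thread. */
struct toy_cpu_state {
	toy_mask_t span;	/* models sched_domain_span(sd) as seen by the caller */
	toy_mask_t active;	/* models cpu_active_mask */
	toy_mask_t online;	/* models cpu_online_mask */
};

/*
 * Models the for_each_cpu_and() walk from the quoted patch: return the
 * first CPU in (span & allowed) other than the caller, or the caller
 * itself when no other candidate exists.
 */
static int toy_pick_timer_target(int self, toy_mask_t span, toy_mask_t allowed)
{
	toy_mask_t candidates = span & allowed;

	assert(self >= 0 && self < 32);
	for (int cpu = 0; cpu < 32; cpu++) {
		if (cpu != self && (candidates & TOY_CPU(cpu)))
			return cpu;
	}
	return self;
}

/* First teardown step per the thread: the active bit is cleared early. */
static void toy_start_killing(struct toy_cpu_state *s, int cpu)
{
	s->active &= ~TOY_CPU(cpu);
}

/* Last teardown step per the thread: online is cleared once the CPU is dead. */
static void toy_mark_dead(struct toy_cpu_state *s, int cpu)
{
	s->online &= ~TOY_CPU(cpu);
}
```

With a four-CPU span (0xF) and CPU 1 on its way down, calling
toy_start_killing() and then picking from CPU 0 returns CPU 1 when filtered
by the online mask and CPU 2 when filtered by the active mask. Only after
toy_mark_dead() does the online filter skip CPU 1 as well. That window is
why the active mask is the suggested filter, if a stale span can be seen
at all.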