From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <linux-kernel-owner@vger.kernel.org>
Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand
	id S1752523AbcGASk7 (ORCPT <rfc822;w@1wt.eu>);
	Fri, 1 Jul 2016 14:40:59 -0400
Received: from mx0a-001b2d01.pphosted.com ([148.163.156.1]:34064 "EHLO
	mx0a-001b2d01.pphosted.com" rhost-flags-OK-OK-OK-OK)
	by vger.kernel.org with ESMTP id S1752463AbcGASk5 (ORCPT
	<rfc822;linux-kernel@vger.kernel.org>);
	Fri, 1 Jul 2016 14:40:57 -0400
X-IBM-Helo: d01dlp03.pok.ibm.com
X-IBM-MailFrom: paulmck@linux.vnet.ibm.com
Date: Fri, 1 Jul 2016 11:40:54 -0700
From: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
To: Frederic Weisbecker <fweisbec@gmail.com>
Cc: peterz@infradead.org, tglx@linutronix.de, linux-kernel@vger.kernel.org,
        rgkernel@gmail.com
Subject: Re: [PATCH RFC] sched: Make wake_up_nohz_cpu() handle CPUs going
 offline
Reply-To: paulmck@linux.vnet.ibm.com
References: <20160630175845.GA10269@linux.vnet.ibm.com>
 <20160630232957.GB32568@lerouge>
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
In-Reply-To: <20160630232957.GB32568@lerouge>
User-Agent: Mutt/1.5.21 (2010-09-15)
X-TM-AS-MML: disable
X-Content-Scanned: Fidelis XPS MAILER
x-cbid: 16070118-0044-0000-0000-0000008D516F
X-IBM-AV-DETECTION: SAVI=unused REMOTE=unused XFE=unused
x-cbparentid: 16070118-0045-0000-0000-000004A361B7
Message-Id: <20160701184054.GK4650@linux.vnet.ibm.com>
X-Proofpoint-Virus-Version: vendor=fsecure engine=2.50.10432:,, definitions=2016-07-01_04:,,
 signatures=0
X-Proofpoint-Spam-Details: rule=outbound_notspam policy=outbound score=0 spamscore=0 suspectscore=0
 malwarescore=0 phishscore=0 adultscore=0 bulkscore=0 classifier=spam
 adjust=0 reason=mlx scancount=1 engine=8.0.1-1604210000
 definitions=main-1607010182
Sender: linux-kernel-owner@vger.kernel.org
List-ID: <linux-kernel.vger.kernel.org>
X-Mailing-List: linux-kernel@vger.kernel.org

On Fri, Jul 01, 2016 at 01:29:59AM +0200, Frederic Weisbecker wrote:
> On Thu, Jun 30, 2016 at 10:58:45AM -0700, Paul E. McKenney wrote:
> > Both timers and hrtimers are maintained on the outgoing CPU until
> > CPU_DEAD time, at which point they are migrated to a surviving CPU.  If a
> > mod_timer() executes between CPU_DYING and CPU_DEAD time, x86 systems
> > will splat in native_smp_send_reschedule() when attempting to wake up
> > the just-now-offlined CPU, as shown below from a NO_HZ_FULL kernel:
> > 
> > [ 7976.741556] WARNING: CPU: 0 PID: 661 at /home/paulmck/public_git/linux-rcu/arch/x86/kernel/smp.c:125 native_smp_send_reschedule+0x39/0x40
> > [ 7976.741595] Modules linked in:
> > [ 7976.741595] CPU: 0 PID: 661 Comm: rcu_torture_rea Not tainted 4.7.0-rc2+ #1
> > [ 7976.741595] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs 01/01/2011
> > [ 7976.741595]  0000000000000000 ffff88000002fcc8 ffffffff8138ab2e 0000000000000000
> > [ 7976.741595]  0000000000000000 ffff88000002fd08 ffffffff8105cabc 0000007d1fd0ee18
> > [ 7976.741595]  0000000000000001 ffff88001fd16d40 ffff88001fd0ee00 ffff88001fd0ee00
> > [ 7976.741595] Call Trace:
> > [ 7976.741595]  [<ffffffff8138ab2e>] dump_stack+0x67/0x99
> > [ 7976.741595]  [<ffffffff8105cabc>] __warn+0xcc/0xf0
> > [ 7976.741595]  [<ffffffff8105cb98>] warn_slowpath_null+0x18/0x20
> > [ 7976.741595]  [<ffffffff8103cba9>] native_smp_send_reschedule+0x39/0x40
> > [ 7976.741595]  [<ffffffff81089bc2>] wake_up_nohz_cpu+0x82/0x190
> > [ 7976.741595]  [<ffffffff810d275a>] internal_add_timer+0x7a/0x80
> > [ 7976.741595]  [<ffffffff810d3ee7>] mod_timer+0x187/0x2b0
> > [ 7976.741595]  [<ffffffff810c89dd>] rcu_torture_reader+0x33d/0x380
> > [ 7976.741595]  [<ffffffff810c66f0>] ? sched_torture_read_unlock+0x30/0x30
> > [ 7976.741595]  [<ffffffff810c86a0>] ? rcu_bh_torture_read_lock+0x80/0x80
> > [ 7976.741595]  [<ffffffff8108068f>] kthread+0xdf/0x100
> > [ 7976.741595]  [<ffffffff819dd83f>] ret_from_fork+0x1f/0x40
> > [ 7976.741595]  [<ffffffff810805b0>] ? kthread_create_on_node+0x200/0x200
> > 
> > However, in this case, the wakeup is redundant, because the timer
> > migration will reprogram timer hardware as needed.  Note that the fact
> > that preemption is disabled does not avoid the splat, as the offline
> > operation has already passed both the synchronize_sched() and the
> > stop_machine() that would be blocked by disabled preemption.
> > 
> > This commit therefore modifies wake_up_nohz_cpu() to avoid attempting
> > to wake up offline CPUs.  It also adds a comment stating that the
> > caller must tolerate lost wakeups when the target CPU is going offline,
> > and suggesting the CPU_DEAD notifier as a recovery mechanism.
> > 
> > Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
> > Cc: Peter Zijlstra <peterz@infradead.org>
> > Cc: Frederic Weisbecker <fweisbec@gmail.com>
> > Cc: Thomas Gleixner <tglx@linutronix.de>
> > ---
> >  core.c |    7 ++++++-
> >  1 file changed, 6 insertions(+), 1 deletion(-)
> > 
> > diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> > index 7f2cae4620c7..08502966e7df 100644
> > --- a/kernel/sched/core.c
> > +++ b/kernel/sched/core.c
> > @@ -590,9 +590,14 @@ static bool wake_up_full_nohz_cpu(int cpu)
> >  	return false;
> >  }
> >  
> > +/*
> > + * Wake up the specified CPU.  If the CPU is going offline, it is the
> > + * caller's responsibility to deal with the lost wakeup, for example,
> > + * by hooking into the CPU_DEAD notifier like timers and hrtimers do.
> > + */
> >  void wake_up_nohz_cpu(int cpu)
> >  {
> > -	if (!wake_up_full_nohz_cpu(cpu))
> > +	if (cpu_online(cpu) && !wake_up_full_nohz_cpu(cpu))
> 
> So at this point, as we passed CPU_DYING, I believe the CPU isn't visible in the domains
> anymore (correct me if I'm wrong), therefore get_nohz_timer_target() can't return it,
> unless smp_processor_id() is the only alternative.

Right, but the timers have been posted long before even CPU_UP_PREPARE.
>>From what I can see, they are left alone until CPU_DEAD.  Which means
that if you try to mod_timer() them between CPU_DYING and CPU_DEAD,
you can get the above splat.

Or am I missing somthing subtle here?

> Hence, that call to wake_up_nohz_cpu() can only happen to online CPUs or the current
> one (pinned). And wake_up_idle_cpu() on the current CPU is a no-op. So only
> wake_up_full_nohz_cpu() is concerned. Then perhaps it would be better to move that
> cpu_online() check to wake_up_full_nohz_cpu() ?

As in the patch shown below?  Either way works for me.

> BTW, it seems that rcutorture stops its kthreads after CPU_DYING, is it expected that
> it queues timers at this stage?

Hmmm...  From what I can see, rcutorture cleans up its priority-boost
kthreads at CPU_DOWN_PREPARE time.  The other threads are allowed to
migrate wherever the scheduler wants, give or take the task shuffling.
The task shuffling only excludes one CPU at a time, and I have seen
this occur when multiple CPUs were running, e.g., 0, 2, and 3 while
offlining 1.

Besides which, doesn't the scheduler prevent anything but the idle
thread from running after CPU_DYING time?

							Thanx, Paul

------------------------------------------------------------------------
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 7f2cae4620c7..08502966e7df 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -590,9 +590,14 @@ static bool wake_up_full_nohz_cpu(int cpu)
 	return false;
 }
 
+/*
+ * Wake up the specified CPU.  If the CPU is going offline, it is the
+ * caller's responsibility to deal with the lost wakeup, for example,
+ * by hooking into the CPU_DEAD notifier like timers and hrtimers do.
+ */
 void wake_up_nohz_cpu(int cpu)
 {
-	if (!wake_up_full_nohz_cpu(cpu))
+	if (cpu_online(cpu) && !wake_up_full_nohz_cpu(cpu))
 		wake_up_idle_cpu(cpu);
 }