From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1752427AbdLFWE1 (ORCPT ); Wed, 6 Dec 2017 17:04:27 -0500 Received: from mx0b-001b2d01.pphosted.com ([148.163.158.5]:39378 "EHLO mx0a-001b2d01.pphosted.com" rhost-flags-OK-OK-OK-FAIL) by vger.kernel.org with ESMTP id S1752214AbdLFWEY (ORCPT ); Wed, 6 Dec 2017 17:04:24 -0500 Date: Wed, 6 Dec 2017 14:04:21 -0800 From: "Paul E. McKenney" To: anna-maria@linutronix.de Cc: tglx@linutronix.de, linux-kernel@vger.kernel.org Subject: Re: Timer refuses to expire Reply-To: paulmck@linux.vnet.ibm.com References: <20171201182529.GA6073@linux.vnet.ibm.com> <20171204174208.GA17376@linux.vnet.ibm.com> <20171205233744.GA2453@linux.vnet.ibm.com> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20171205233744.GA2453@linux.vnet.ibm.com> User-Agent: Mutt/1.5.21 (2010-09-15) X-TM-AS-GCONF: 00 x-cbid: 17120622-2213-0000-0000-00000246D14B X-IBM-SpamModules-Scores: X-IBM-SpamModules-Versions: BY=3.00008162; HX=3.00000241; KW=3.00000007; PH=3.00000004; SC=3.00000244; SDB=6.00956435; UDB=6.00483475; IPR=6.00736472; BA=6.00005729; NDR=6.00000001; ZLA=6.00000005; ZF=6.00000009; ZB=6.00000000; ZP=6.00000000; ZH=6.00000000; ZU=6.00000002; MB=3.00018390; XFM=3.00000015; UTC=2017-12-06 22:04:21 X-IBM-AV-DETECTION: SAVI=unused REMOTE=unused XFE=unused x-cbparentid: 17120622-2214-0000-0000-0000585E825D Message-Id: <20171206220421.GA12886@linux.vnet.ibm.com> X-Proofpoint-Virus-Version: vendor=fsecure engine=2.50.10432:,, definitions=2017-12-06_07:,, signatures=0 X-Proofpoint-Spam-Details: rule=outbound_notspam policy=outbound score=0 priorityscore=1501 malwarescore=0 suspectscore=1 phishscore=0 bulkscore=0 spamscore=0 clxscore=1015 lowpriorityscore=0 impostorscore=0 adultscore=0 classifier=spam adjust=0 reason=mlx scancount=1 engine=8.0.1-1709140000 definitions=main-1712060309 Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Tue, Dec 05, 2017 at 03:37:44PM -0800, Paul E. McKenney wrote: > On Mon, Dec 04, 2017 at 09:42:08AM -0800, Paul E. McKenney wrote: > > On Fri, Dec 01, 2017 at 10:25:29AM -0800, Paul E. McKenney wrote: > > > Hello, Anna-Maria, > > > > > > It turned out that one set of problems was due to NO_HZ_FULL issues, > > > which are addressed with a pair of patches. However, there is still one > > > case of a timer refusing to expire in rcutorture's TREE01 test scenario. > > > The takes from two to four hours to trigger, and the most recent one > > > gives the following diagnostic (see patch at the very end of this email): > > > > > > [13166.127458] schedule_timeout: Waylayed timer base->clk: 0x100c40004 jiffies: 0x100c4524e base->next_expiry: 0x100c40004 timer->flags: 0x1000003 timer->expires 0x100c40003 idx: 4 idx_now: ea base->pending_map 000000000000001000000000001080000000000000040000000002000000000000000000000000000000000000040000000000000000000000000000000000000000000000000000 > > > > > > The message appears any time a timer for less than five jiffies takes > > > more than eight seconds to expire. In all cases, this expiry did not > > > happen naturally, but rather via an unsolicited wakeup from the RCU CPU > > > stall code. If I am interpreting this message correctly, base->clk > > > has advanced past this timer, and the timer is correctly flagged in > > > base->pending_map. This seems like part of the problem because the > > > timer's timeout was only three jiffies. However, base->clk has not > > > advanced for more than 20 seconds, which seems like another problem. > > > > > > What additional diagnostics would be helpful? I can capture data > > > at the beginning of the schedule_timeout in the timer_list structure, > > > and of course can print more information at the time of the wakeup. > > > > And on the off-chance that the following messages from this weekend's > > runs are at all helpful. One interesting point is that starting at > > time 4731.872311, there are repeated enqueues to CPU 5's timer wheel, > > but CPU 5's ->clk does not advance. Things seem to advance at > > time 4891.721190. > > Another layer on the onion... For at least some of the failures, there > is a stalled CPU-hotplug operation. This of course can mean that the > timers are stuck on the partially offlined CPU. So I dumped the stack > of the task that is taking the CPU offline. Please see below for console > output and patch. > > I am considering doing an unsolicited wakeup of the task doing the > hotplug operation, but I am not convinced that the entirely of the CPU > hotplug code is willing to put up with that sort of thing. What I did instead was to dump out the state of the task that __cpuhp_kick_ap() waits on, please see the patch at the very end of this email. This triggered as shown below, and you guessed it, that task is waiting on a grace period. Which I am guessing won't happen until the outgoing CPU reaches CPUHP_TIMERS_DEAD state and calls timers_dead_cpu(). Which will prevent RCU's grace-period kthread from ever awakening, which will prevent the task that __cpuhp_kick_ap() waits on from ever awakening, which will prevent the outgoing CPU from reaching CPUHP_TIMERS_DEAD state. Deadlock. Does that make sense, or am I missing a trick here? I tried invoking timers_dead_cpu() from sched_cpu_deactivate(), but that triggered the BUG_ON(cpu_online(cpu)). I removed this BUG_ON(), and appear to have deadlocked on the timer ->lock. Any thoughts on what else to try? Thanx, Paul ------------------------------------------------------------------------ [ 2939.381210] schedule_timeout: Waylayed timer base->clk: 0x1002709c3 jiffies: 0x100284614 base->next_expiry: 0x1002709c3 timer->flags: 0x40000007 timer->expires 0x10027f3a2 idx: 100 idx_now: 105 base->pending_map 000000000000000800000000000000000002000000000000020080400002000868000000000000010000000001000000000000000000000000000000000000000000000000000000 [ 2939.381219] Torture onoff task state: [ 2939.381221] torture_onoff D14584 842 2 0x80000000 [ 2939.381239] Call Trace: [ 2939.381245] ? __schedule+0x33c/0x6f0 [ 2939.381248] ? preempt_count_add+0x51/0x90 [ 2939.381250] schedule+0x37/0x90 [ 2939.381270] schedule_timeout+0x20d/0x4c0 [ 2939.381276] wait_for_completion+0x95/0x100 [ 2939.381282] ? wake_up_q+0x60/0x60 [ 2939.381299] __cpuhp_kick_ap+0x49/0x60 [ 2939.381313] cpuhp_kick_ap+0x29/0x70 [ 2939.381316] cpuhp_kick_ap_work+0x34/0x130 [ 2939.381322] _cpu_down+0x17e/0x1f0 [ 2939.381326] do_cpu_down+0x30/0x50 [ 2939.381354] torture_offline+0x76/0x140 [ 2939.381368] torture_onoff+0xef/0x1c0 [ 2939.381371] kthread+0xf0/0x130 [ 2939.381374] ? torture_kthread_stopping+0x70/0x70 [ 2939.381376] ? kthread_destroy_worker+0x40/0x40 [ 2939.381378] ret_from_fork+0x1f/0x30 [ 2939.381381] __cpuhp_kick_ap task state: [ 2939.381383] cpuhp/7 D14504 46 2 0x80000000 [ 2939.381399] Call Trace: [ 2939.381403] ? __schedule+0x33c/0x6f0 [ 2939.381406] schedule+0x37/0x90 [ 2939.381408] schedule_timeout+0x20d/0x4c0 [ 2939.381414] ? enqueue_task_fair+0x556/0x12b0 [ 2939.381417] wait_for_completion+0x95/0x100 [ 2939.381419] ? wake_up_q+0x60/0x60 [ 2939.381431] __wait_rcu_gp+0x10d/0x140 [ 2939.381441] ? sched_cpu_activate+0xc0/0xc0 [ 2939.381443] sched_cpu_deactivate+0x36/0xa0 [ 2939.381453] ? kfree_call_rcu+0x20/0x20 [ 2939.381454] ? __call_rcu+0x530/0x530 [ 2939.381457] ? __rcu_read_unlock+0x50/0x50 [ 2939.381459] ? __rcu_read_unlock+0x50/0x50 [ 2939.381462] cpuhp_invoke_callback+0xa8/0x610 [ 2939.381466] cpuhp_thread_fun+0xc5/0x150 [ 2939.381469] smpboot_thread_fn+0x15c/0x220 [ 2939.381476] kthread+0xf0/0x130 [ 2939.381478] ? sort_range+0x20/0x20 [ 2939.381480] ? kthread_destroy_worker+0x40/0x40 [ 2939.381482] ret_from_fork+0x1f/0x30 ------------------------------------------------------------------------ commit aa2b802cb432a725589884d2ba47833de21e0871 Author: Paul E. McKenney Date: Wed Dec 6 09:46:30 2017 -0800 EXP: timer: Dump __cpuhp_kick_ap task state Signed-off-by: Paul E. McKenney diff --git a/kernel/cpu.c b/kernel/cpu.c index 04892a82f6ac..dc9098946b3a 100644 --- a/kernel/cpu.c +++ b/kernel/cpu.c @@ -381,6 +381,8 @@ cpuhp_reset_state(struct cpuhp_cpu_state *st, enum cpuhp_state prev_state) st->bringup = !st->bringup; } +void schedule_timeout_set_task3dump(struct task_struct *t); + /* Regular hotplug invocation of the AP hotplug thread */ static void __cpuhp_kick_ap(struct cpuhp_cpu_state *st) { @@ -395,7 +397,9 @@ static void __cpuhp_kick_ap(struct cpuhp_cpu_state *st) smp_mb(); st->should_run = true; wake_up_process(st->thread); + schedule_timeout_set_task3dump(st->thread); wait_for_ap_thread(st, st->bringup); + schedule_timeout_set_task3dump(NULL); } static int cpuhp_kick_ap(struct cpuhp_cpu_state *st, enum cpuhp_state target) diff --git a/kernel/time/timer.c b/kernel/time/timer.c index 0a67df4cdeb6..532b43d05219 100644 --- a/kernel/time/timer.c +++ b/kernel/time/timer.c @@ -1725,6 +1725,7 @@ static void process_timeout(struct timer_list *t) } static struct task_struct *schedule_timeout_task2dump; +static struct task_struct *schedule_timeout_task3dump; static DEFINE_MUTEX(schedule_timeout_task2dump_mutex); void schedule_timeout_set_task2dump(struct task_struct *t) { @@ -1732,7 +1733,12 @@ void schedule_timeout_set_task2dump(struct task_struct *t) WRITE_ONCE(schedule_timeout_task2dump, t); mutex_unlock(&schedule_timeout_task2dump_mutex); } -EXPORT_SYMBOL_GPL(schedule_timeout_set_task2dump); +void schedule_timeout_set_task3dump(struct task_struct *t) +{ + mutex_lock(&schedule_timeout_task2dump_mutex); + WRITE_ONCE(schedule_timeout_task3dump, t); + mutex_unlock(&schedule_timeout_task2dump_mutex); +} /** * schedule_timeout - sleep until timeout @@ -1824,6 +1830,10 @@ signed long __sched schedule_timeout(signed long timeout) pr_info("Torture onoff task state:\n"); sched_show_task(schedule_timeout_task2dump); } + if (schedule_timeout_task3dump) { + pr_info("__cpuhp_kick_ap task state:\n"); + sched_show_task(schedule_timeout_task3dump); + } mutex_unlock(&schedule_timeout_task2dump_mutex); } }