From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <linux-kernel-owner@vger.kernel.org>
Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand
	id S1751398AbcACGRf (ORCPT <rfc822;w@1wt.eu>);
	Sun, 3 Jan 2016 01:17:35 -0500
Received: from e33.co.us.ibm.com ([32.97.110.151]:56437 "EHLO
	e33.co.us.ibm.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org
	with ESMTP id S1750788AbcACGRa (ORCPT
	<rfc822;linux-kernel@vger.kernel.org>);
	Sun, 3 Jan 2016 01:17:30 -0500
X-IBM-Helo: d03dlp01.boulder.ibm.com
X-IBM-MailFrom: paulmck@linux.vnet.ibm.com
X-IBM-RcptTo: linux-kernel@vger.kernel.org
Date: Sat, 2 Jan 2016 22:17:20 -0800
From: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
To: Ross Green <rgkernel@gmail.com>
Cc: linux-kernel@vger.kernel.org, mingo@kernel.org, jiangshanlai@gmail.com,
        dipankar@in.ibm.com, akpm@linux-foundation.org,
        Mathieu Desnoyers <mathieu.desnoyers@efficios.com>,
        josh@joshtriplett.org, tglx@linutronix.de, peterz@infradead.org,
        rostedt@goodmis.org, dhowells@redhat.com,
        Eric Dumazet <edumazet@google.com>, dvhart@linux.intel.com,
        =?iso-8859-1?Q?Fr=E9d=E9ric?= Weisbecker <fweisbec@gmail.com>,
        oleg@redhat.com, pranith kumar <bobby.prani@gmail.com>
Subject: Re: rcu_preempt self-detected stall on CPU from 4.4-rc4, since 3.17
Message-ID: <20160103061720.GT4054@linux.vnet.ibm.com>
Reply-To: paulmck@linux.vnet.ibm.com
References: <CANfgCY2bSyw9n9tYZW46yRHXM-fF+XqobVOjjE_uOWcPaUWHCA@mail.gmail.com>
 <CANfgCY2=np7CX1vHxY7gyYb1-6GZJB7o4yFStkOsh5Yta_-v7A@mail.gmail.com>
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
In-Reply-To: <CANfgCY2=np7CX1vHxY7gyYb1-6GZJB7o4yFStkOsh5Yta_-v7A@mail.gmail.com>
User-Agent: Mutt/1.5.21 (2010-09-15)
X-TM-AS-MML: disable
X-Content-Scanned: Fidelis XPS MAILER
x-cbid: 16010306-0009-0000-0000-00001109EC17
Sender: linux-kernel-owner@vger.kernel.org
List-ID: <linux-kernel.vger.kernel.org>
X-Mailing-List: linux-kernel@vger.kernel.org

On Sun, Jan 03, 2016 at 04:29:11PM +1100, Ross Green wrote:
> Still seeing these rcu_preempt stalls on kernels through to 4.4-rc7
> 
> Still have not found a sure fire method to evoke this stall, but have
> found that it will normally occur within a week of running a kernel -
> usually when it is quiet with light load.
> 
> Have seen similar self detected stalls all the way back to 3.17.
> Most recent kernels have included 4.4-rc5 4.4-rc6 and 4.4-rc7
> 
> Regards,
> 
> Ross
> 
> On Fri, Dec 11, 2015 at 10:17 PM, Ross Green <rgkernel@gmail.com> wrote:
> > I have been getting these stalls in kernels going back to 3.17.
> >
> > This stall occurs usually under light load but often requires several
> > days to show itself. I have not found any simple way to trigger the
> > stall. Indeed heavy workloads seems not to show the fault.
> >
> > Anyone have any thoughts here?
> >
> > The recent patch by peterz with kernel/sched/wait.c I thought might
> > help the situation, but alas after a few days of running 4.4-rc4 the
> > following turned up.
> >
> > [179922.003570] INFO: rcu_preempt self-detected stall on CPU
> > [179922.008178] INFO: rcu_preempt detected stalls on CPUs/tasks:
> > [179922.008178]         0-...: (1 ticks this GP) idle=a91/1/0

CPU 0 is non-idle from an RCU perspective.

> > softirq=1296733/1296733 fqs=0
> > [179922.008178]
> > [179922.008209] (detected by 1, t=8775 jiffies, g=576439, c=576438, q=102)
> > [179922.008209] Task dump for CPU 0:
> > [179922.008209] swapper/0       R [179922.008209]  running [179922.008209]     0     0      0 0x00000000
> > [179922.008209] Backtrace:
> >
> > [179922.008239] Backtrace aborted due to bad frame pointer <c0907f54>

Can't have everything, I guess...

> > [179922.008239] rcu_preempt kthread starved for 8775 jiffies! g576439 c576438 f0x0 s3 ->state=0x1

Something is keeping the rcu_preempt grace-period kthread from
running.  This far into the grace period, it should have a
timer event waking it every few jiffies.  It is currently
in TASK_INTERRUPTIBLE state.

> > [179922.060302]         0-...: (1 ticks this GP) idle=a91/1/0 softirq=1296733/1296733 fqs=0
> > [179922.068023]          (t=8775 jiffies g=576439 c=576438 q=102)
> > [179922.073913] rcu_preempt kthread starved for 8775 jiffies! g576439 c576438 f0x2 s3 ->state=0x100

Same story, same grace period, pretty much same time.  Now there is an FQS
request (f0x2) and the state is now TASK_WAKING (->state=0x100 == 256).

> > [179922.083587] Task dump for CPU 0:
> > [179922.087097] swapper/0       R running      0     0      0 0x00000000
> > [179922.093292] Backtrace:
> > [179922.096313] [<c0013ea8>] (dump_backtrace) from [<c00140a4>] (show_stack+0x18/0x1c)
> > [179922.104675]  r7:c0908514 r6:80080193 r5:00000000 r4:c090aca8
> > [179922.110809] [<c001408c>] (show_stack) from [<c005a858>] (sched_show_task+0xbc/0x110)
> > [179922.119049] [<c005a79c>] (sched_show_task) from [<c005ccd4>] (dump_cpu_task+0x40/0x44)
> > [179922.127624]  r5:c0917680 r4:00000000
> > [179922.131042] [<c005cc94>] (dump_cpu_task) from [<c0082268>] (rcu_dump_cpu_stacks+0x9c/0xdc)
> > [179922.140350]  r5:c0917680 r4:00000001
> > [179922.143157] [<c00821cc>] (rcu_dump_cpu_stacks) from [<c008637c>] (rcu_check_callbacks+0x504/0x8e4)
> > [179922.153808]  r9:c0908514 r8:c0917680 r7:00000066 r6:2eeab000
> > r5:c0904300 r4:ef7af300
> > [179922.161499] [<c0085e78>] (rcu_check_callbacks) from [<c00895d0>] (update_process_times+0x40/0x6c)
> > [179922.170898]  r10:c009a584 r9:00000001 r8:ef7abc4c r7:0000a3a3
> > r6:4ec3391c r5:00000000
> > [179922.179901]  r4:c090aca8
> > [179922.182708] [<c0089590>] (update_process_times) from [<c009a580>]
> > (tick_sched_handle+0x50/0x54)
> > [179922.192108]  r5:c0907f10 r4:ef7abe40
> > [179922.195983] [<c009a530>] (tick_sched_handle) from [<c009a5d4>]
> > (tick_sched_timer+0x50/0x94)
> > [179922.204895] [<c009a584>] (tick_sched_timer) from [<c0089fe4>]
> > (__hrtimer_run_queues+0x110/0x1a0)
> > [179922.214324]  r7:00000000 r6:ef7abc40 r5:ef7abe40 r4:ef7abc00
> > [179922.220428] [<c0089ed4>] (__hrtimer_run_queues) from [<c008a674>]
> > (hrtimer_interrupt+0xac/0x1f8)
> > [179922.227111]  r10:ef7abc78 r9:ef7abc98 r8:ef7abc14 r7:ef7abcb8
> > r6:ffffffff r5:00000003
> > [179922.238220]  r4:ef7abc00
> > [179922.238220] [<c008a5c8>] (hrtimer_interrupt) from [<c00170ec>]
> > (twd_handler+0x38/0x48)
> > [179922.238220]  r10:c09084e8 r9:fa241100 r8:00000011 r7:ef028780
> > r6:c092574c r5:ef005cc0

All interrupt stack up to this point.

It is quite possible that the stuff below here is at fault as well.
That said, the grace-period should actually get to execute at some
point.  Do you have a heavy real-time load that might be starving
things?

							Thanx, Paul

> > [179922.257110]  r4:00000001
> > [179922.257110] [<c00170b4>] (twd_handler) from [<c007c8f8>] (handle_percpu_devid_irq+0x74/0x8c)
> > [179922.269683]  r5:ef005cc0 r4:ef7b1740
> > [179922.269683] [<c007c884>] (handle_percpu_devid_irq) from [<c0078454>] (generic_handle_irq+0x2c/0x3c)
> > [179922.283233]  r9:fa241100 r8:ef008000 r7:00000001 r6:00000000
> > r5:00000000 r4:c09013e8
> > [179922.290985] [<c0078428>] (generic_handle_irq) from [<c007872c>] (__handle_domain_irq+0x64/0xbc)
> > [179922.300842] [<c00786c8>] (__handle_domain_irq) from [<c00094c0>]
> > (gic_handle_irq+0x50/0x90)
> > [179922.303222]  r9:fa241100 r8:fa240100 r7:c0907f10 r6:fa24010c
> > r5:c09087a8 r4:c0925748
> > [179922.315216] [<c0009470>] (gic_handle_irq) from [<c0014bd4>]
> > (__irq_svc+0x54/0x90)
> > [179922.319000] Exception stack(0xc0907f10 to 0xc0907f58)
> > [179922.331542] 7f00:                                     00000000
> > ef7ab390 fe600000 00000000
> > [179922.331542] 7f20: c0906000 c090849c c0900364 c06a8124 c0907f80
> > c0944563 c09084e8 c0907f6c
> > [179922.349029] 7f40: c0907f4c c0907f60 c00287ac c0010ba8 60080113 ffffffff
> > [179922.349029]  r9:c0944563 r8:c0907f80 r7:c0907f44 r6:ffffffff
> > r5:60080113 r4:c0010ba8
> > [179922.357116] [<c0010b80>] (arch_cpu_idle) from [<c006f034>]
> > (default_idle_call+0x28/0x34)
> > [179922.368926] [<c006f00c>] (default_idle_call) from [<c006f154>]
> > (cpu_startup_entry+0x114/0x18c)
> > [179922.368926] [<c006f040>] (cpu_startup_entry) from [<c069fc6c>]
> > (rest_init+0x90/0x94)
> > [179922.385284]  r7:ffffffff r4:00000002
> > [179922.393463] [<c069fbdc>] (rest_init) from [<c08bbcec>]
> > (start_kernel+0x370/0x37c)
> > [179922.400421]  r5:c0947000 r4:00000000
> > [179922.400421] [<c08bb97c>] (start_kernel) from [<8000807c>] (0x8000807c)
> > $