From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <linux-kernel-owner@vger.kernel.org>
Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand
        id S1751243AbdKLRR6 (ORCPT <rfc822;w@1wt.eu>);
        Sun, 12 Nov 2017 12:17:58 -0500
Received: from mx0a-001b2d01.pphosted.com ([148.163.156.1]:37598 "EHLO
        mx0a-001b2d01.pphosted.com" rhost-flags-OK-OK-OK-OK)
        by vger.kernel.org with ESMTP id S1750846AbdKLRR5 (ORCPT
        <rfc822;linux-kernel@vger.kernel.org>);
        Sun, 12 Nov 2017 12:17:57 -0500
Date: Sun, 12 Nov 2017 09:17:51 -0800
From: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
To: Bruno =?iso-8859-1?Q?Pr=E9mont?= <bonbons@sysophe.eu>
Cc: Josh Triplett <josh@joshtriplett.org>,
        Steven Rostedt <rostedt@goodmis.org>,
        Mathieu Desnoyers <mathieu.desnoyers@efficios.com>,
        Lai Jiangshan <jiangshanlai@gmail.com>, linux-kernel@vger.kernel.org
Subject: Re: RCU stall/SOFT-Lockup on 4.11.3/4.13.11 after multiple days
 uptime
Reply-To: paulmck@linux.vnet.ibm.com
References: <20171111203832.58bc73ce@uranus>
 <20171112012128.GW3624@linux.vnet.ibm.com>
 <20171112120915.3072b927@neptune.home>
MIME-Version: 1.0
Content-Type: text/plain; charset=iso-8859-1
Content-Disposition: inline
Content-Transfer-Encoding: 8bit
In-Reply-To: <20171112120915.3072b927@neptune.home>
User-Agent: Mutt/1.5.21 (2010-09-15)
X-TM-AS-GCONF: 00
x-cbid: 17111217-0052-0000-0000-00000280D3E4
X-IBM-SpamModules-Scores: 
X-IBM-SpamModules-Versions: BY=3.00008058; HX=3.00000241; KW=3.00000007;
 PH=3.00000004; SC=3.00000239; SDB=6.00944864; UDB=6.00476803; IPR=6.00725160;
 BA=6.00005688; NDR=6.00000001; ZLA=6.00000005; ZF=6.00000009; ZB=6.00000000;
 ZP=6.00000000; ZH=6.00000000; ZU=6.00000002; MB=3.00017981; XFM=3.00000015;
 UTC=2017-11-12 17:17:54
X-IBM-AV-DETECTION: SAVI=unused REMOTE=unused XFE=unused
x-cbparentid: 17111217-0053-0000-0000-0000529A41C0
Message-Id: <20171112171751.GA3624@linux.vnet.ibm.com>
X-Proofpoint-Virus-Version: vendor=fsecure engine=2.50.10432:,, definitions=2017-11-12_06:,,
 signatures=0
X-Proofpoint-Spam-Details: rule=outbound_notspam policy=outbound score=0 priorityscore=1501
 malwarescore=0 suspectscore=0 phishscore=0 bulkscore=0 spamscore=0
 clxscore=1015 lowpriorityscore=0 impostorscore=0 adultscore=0
 classifier=spam adjust=0 reason=mlx scancount=1 engine=8.0.1-1707230000
 definitions=main-1711120246
Sender: linux-kernel-owner@vger.kernel.org
List-ID: <linux-kernel.vger.kernel.org>
X-Mailing-List: linux-kernel@vger.kernel.org

On Sun, Nov 12, 2017 at 12:09:28PM +0100, Bruno Prémont wrote:
> On Sat, 11 November 2017 "Paul E. McKenney" <paulmck@linux.vnet.ibm.com> wrote:
> > On Sat, Nov 11, 2017 at 08:38:32PM +0100, Bruno Prémont wrote:
> > > Hi,
> > > 
> > > On a single-CPU KVM-based virtual machine I'm suffering from RCU stall
> > > and soft-lockup. 4.10.x kernels run fine (4.10.12) but starting with
> > > 4.11.x (4.11.3, 4.13.11) I'm getting system freezes for no apparent
> > > reason.
> > > 
> > > All info I have is following console dump (from 4.13.11):
> > > [526415.290012] INFO: rcu_sched self-detected stall on CPU
> > > [526415.290012] o0-...: (745847 ticks this GP) idle=ba2/2/0 softirq=37393463/37393463 fqs=0
> > > [526415.290012] o (t=745854 jiffies g=23779976 c=23779975 q=32)
> > > [526415.290012] rcu_sched kthread starved for 745854 jiffies! g23779976 c23779975 f0x0 RCU_GP_WAIT_FQS(3) ->state=0x0  
> > 
> > The above line says that the rcu_sched kthread asked to sleep for three
> > jiffies, but ended up sleeping for more than 745,854 jiffies.
> > 
> > If your system does not let the RCU's kernel threads run, RCU cannot
> > help you much.
> > 
> > The ->state of 0x0 indicates that the kthread is in fact runnable, but
> > did not get a chance to run.  Was the system heavily loaded to the
> > point where you would expect a kthread to remain preempted for many
> > minutes?
> > 
> > I am guessing that the answer is no, given that CPU 0 is actually idle
> > (idle=ba2/2/0).  Seems unlikely, but I have to ask:  Did you bind the
> > kthread to a specific CPU?
> 
> The system should be lightly loaded (about 5-10% CPU usage on average), so
> plenty of time for RCU to do its work.
> 
> I didn't bind processes (be it userspace process or kthread) to a specific
> CPU, thus it's all auto-configured.
> 
> I guess the question then is what is the system busy with or waiting for
> that prevents RCU to get its work done...
> Shouldn't the watchdog print a trace of where CPU#0 is stuck? If so I might need
> to check at which log level and make sure that loglevel reaches console.
> Nothing did hit the disk though.

Do you have a high-speed interface to capture and store console output?
(As in something that can handle, say, 50MB in a reasonable period of
time.)

							Thanx, Paul

> Thanks,
> Bruno
> 
> > 							Thanx, Paul
> > 
> > > [526440.020015] watchdog: BUG: soft lockup - CPU#0 stuck for 22s! [swapper/0:0]
> > > [526468.020005] watchdog: BUG: soft lockup - CPU#0 stuck for 22s! [swapper/0:0]
> > > [526478.320009] INFO: rcu_sched self-detected stall on CPU
> > > [526478.320009] o0-...: (752143 ticks this GP) idle=ba2/2/0 softirq=37393463/37393463 fqs=0
> > > [526478.320009] o (t=752157 jiffies g=23779976 c=23779975 q=32)
> > > [526478.320009] rcu_sched kthread starved for 752157 jiffies! g23779976 c23779975 f0x0 RCU_GP_WAIT_FQS(3) ->state=0x0
> > > [526504.020016] watchdog: BUG: soft lockup - CPU#0 stuck for 22s! [swapper/0:0]
> > > [526532.020007] watchdog: BUG: soft lockup - CPU#0 stuck for 22s! [swapper/0:0]
> > > ...
> > > 
> > > Attached is kernel config (4.13.11).
> > > 
> > > 
> > > The output obtained with 4.11.3 was:
> > > [  280.680010] INFO: rcu_sched self-detected stall on CPU
> > > [  280.680021] o0-...: (27312 ticks this GP) dile=b11/2/0 softirq=6119/6119 fqs=0
> > > [  280.680021] o (t=27312 jiffies g=441 c=440 q=0)
> > > [  280.680021] rcu_sched_kthread starved for 27312 jiffies! g441 c440 f0x0 RCU_GP_WAIT_FQS(3) ->state=0x0
> > > ...
> > > 
> > > 
> > > As it's a remote VM for which I don't have access to the host I have little
> > > options for further digging (can't trigger sysrq's).
> > > 
> > > 
> > > Same kernel (4.13.11) seems to be running just fine on another KVM-base VM that
> > > has two CPUs.
> > > 
> > > 
> > > Does it ring a bell or is there some info that might be of any use,
> > > assuming I can obtain it?
> > > 
> > > Bruno  
> > 
> > 
>