From mboxrd@z Thu Jan 1 00:00:00 1970
Date: Mon, 13 Nov 2017 05:14:42 -0800
From: "Paul E. McKenney"
To: Bruno Prémont
Cc: Josh Triplett, Steven Rostedt, Mathieu Desnoyers, Lai Jiangshan, linux-kernel@vger.kernel.org
Subject: Re: RCU stall/SOFT-Lockup on 4.11.3/4.13.11 after multiple days uptime
Reply-To: paulmck@linux.vnet.ibm.com
References: <20171111203832.58bc73ce@uranus> <20171112012128.GW3624@linux.vnet.ibm.com> <20171112120915.3072b927@neptune.home> <20171112171751.GA3624@linux.vnet.ibm.com> <20171112182824.4c62d18c@neptune.home> <20171112193008.24e39fee@uranus>
In-Reply-To: <20171112193008.24e39fee@uranus>
Message-Id: <20171113131442.GC3624@linux.vnet.ibm.com>
Sender: linux-kernel-owner@vger.kernel.org
X-Mailing-List: linux-kernel@vger.kernel.org

On Sun, Nov 12, 2017 at 07:30:08PM +0100, Bruno Prémont wrote:
> On Sun, 12 Nov 2017 18:29:06 Bruno Prémont wrote:
> > On Sun, 12 November 2017 "Paul E. McKenney" wrote:
> > > On Sun, Nov 12, 2017 at 12:09:28PM +0100, Bruno Prémont wrote:
> > > > On Sat, 11 November 2017 "Paul E. McKenney" wrote:
> > > > > On Sat, Nov 11, 2017 at 08:38:32PM +0100, Bruno Prémont wrote:
> > > > > > Hi,
> > > > > >
> > > > > > On a single-CPU KVM-based virtual machine I'm suffering from RCU stalls
> > > > > > and soft-lockups. 4.10.x kernels run fine (4.10.12), but starting with
> > > > > > 4.11.x (4.11.3, 4.13.11) I'm getting system freezes for no apparent
> > > > > > reason.
> > > > > >
> > > > > > All the info I have is the following console dump (from 4.13.11):
> > > > > > [526415.290012] INFO: rcu_sched self-detected stall on CPU
> > > > > > [526415.290012] o0-...: (745847 ticks this GP) idle=ba2/2/0 softirq=37393463/37393463 fqs=0
> > > > > > [526415.290012] o (t=745854 jiffies g=23779976 c=23779975 q=32)
> > > > > > [526415.290012] rcu_sched kthread starved for 745854 jiffies! g23779976 c23779975 f0x0 RCU_GP_WAIT_FQS(3) ->state=0x0
> > > > >
> > > > > The above line says that the rcu_sched kthread asked to sleep for three
> > > > > jiffies, but ended up sleeping for more than 745,854 jiffies.
> > > > >
> > > > > If your system does not let RCU's kernel threads run, RCU cannot
> > > > > help you much.
> > > > >
> > > > > The ->state of 0x0 indicates that the kthread is in fact runnable, but
> > > > > did not get a chance to run. Was the system heavily loaded to the
> > > > > point where you would expect a kthread to remain preempted for many
> > > > > minutes?
> > > > >
> > > > > I am guessing that the answer is no, given that CPU 0 is actually idle
> > > > > (idle=ba2/2/0). Seems unlikely, but I have to ask: Did you bind the
> > > > > kthread to a specific CPU?
> > > >
> > > > The system should be lightly loaded (about 5-10% CPU usage on average), so
> > > > there is plenty of time for RCU to do its work.
> > > >
> > > > I didn't bind processes (be they userspace processes or kthreads) to a specific
> > > > CPU, so it's all auto-configured.
> > > >
> > > > I guess the question then is what the system is busy with or waiting for
> > > > that prevents RCU from getting its work done...
> > > > Shouldn't the watchdog print a trace of where CPU#0 is stuck? If so, I might
> > > > need to check at which log level, and make sure that loglevel reaches the
> > > > console. Nothing hit the disk though.
> > >
> > > Do you have a high-speed interface to capture and store console output?
> > > (As in something that can handle, say, 50MB in a reasonable period of
> > > time.)
> >
> > The only option I could have is netconsole, in the hope it does not cause
> > issues by itself!
> >
> > Other than that I have a poor man's web-VNC kind of view on a VGA-sized
> > text console, which is slow and small (it's a virtual server at a hosting
> > company).
>
> Got a new stall "caught early" with some partial trace output. The system
> had been running for nearly 24 hours, though it did have some IO load (rsync
> over SSH of a data mountpoint) at the time of the stall.
>
> Attaching the screen-shotted output, which might help point in some
> direction.
> This time there is a stack trace with some possibly useful content
> (I did tune the console log level to include everything: `dmesg -n 8`).
>
> It seems to point to native_safe_halt+0x6/0x10 with some interrupt
> interaction. Unfortunately one of the trace halves was never rendered on
> the web-VNC console...
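Since netconsole is the only capture option mentioned, a minimal boot-parameter sketch may be worth noting (all addresses, ports, and the interface name below are placeholders, not values from this thread):

```shell
# Kernel command line (or: modprobe netconsole netconsole=...):
#   netconsole=[src-port]@[src-ip]/[dev],[dst-port]@[dst-ip]/[dst-mac]
netconsole=6665@192.168.0.2/eth0,6666@192.168.0.1/00:11:22:33:44:55

# On the receiving host, capture the UDP stream:
#   nc -u -l 6666 > console.log
```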
>
> Partially transcribed trace:
> __do_softirq+0xe5/0x1f0
> irq_exit+0xbf/0xd0
> do_IRQ+0x4a/0xc0
> common_interrupt+0x89/0x89
> RIP: 0010:native_safe_halt+0x6/0x10
> ... (register dump)
>
> default_idle+0x9/0x10
> arch_cpu_idle+0xa/0x10
> default_idle_call+0x1e/0x30
> do_idle+0x16d/0x190
> cpu_startup_entry+0x5f/0x70
> rest_init+0xab/0xb0
> start_kernel+0x3b7/0x3c4
> ? early_idt_handler_array+0x120/0x120
> x86_64_start_reservations+0x2/0x2c
> x86_64_start_kernel+0xe6/0xf3
> secondary_startup_64+0x9f/0x9f

Which does confirm that the CPU was idle. Was there a "rcu_sched kthread
starved for" line?

If so, given the likely difficulties collecting large traces, I have to
suggest bisecting between 4.10.x and 4.11.x. This will obviously take
some time, but it might be quite a bit faster than randomly guessing.

							Thanx, Paul

> Thanks,
> Bruno
>
> > > Thanx, Paul
> > >
> > > > Thanks,
> > > > Bruno
> > > >
> > > > > Thanx, Paul
> > > > >
> > > > > > [526440.020015] watchdog: BUG: soft lockup - CPU#0 stuck for 22s! [swapper/0:0]
> > > > > > [526468.020005] watchdog: BUG: soft lockup - CPU#0 stuck for 22s! [swapper/0:0]
> > > > > > [526478.320009] INFO: rcu_sched self-detected stall on CPU
> > > > > > [526478.320009] o0-...: (752143 ticks this GP) idle=ba2/2/0 softirq=37393463/37393463 fqs=0
> > > > > > [526478.320009] o (t=752157 jiffies g=23779976 c=23779975 q=32)
> > > > > > [526478.320009] rcu_sched kthread starved for 752157 jiffies! g23779976 c23779975 f0x0 RCU_GP_WAIT_FQS(3) ->state=0x0
> > > > > > [526504.020016] watchdog: BUG: soft lockup - CPU#0 stuck for 22s! [swapper/0:0]
> > > > > > [526532.020007] watchdog: BUG: soft lockup - CPU#0 stuck for 22s! [swapper/0:0]
> > > > > > ...
> > > > > >
> > > > > > Attached is the kernel config (4.13.11).
> > > > > >
> > > > > >
> > > > > > The output obtained with 4.11.3 was:
> > > > > > [  280.680010] INFO: rcu_sched self-detected stall on CPU
> > > > > > [  280.680021] o0-...: (27312 ticks this GP) idle=b11/2/0 softirq=6119/6119 fqs=0
> > > > > > [  280.680021] o (t=27312 jiffies g=441 c=440 q=0)
> > > > > > [  280.680021] rcu_sched kthread starved for 27312 jiffies! g441 c440 f0x0 RCU_GP_WAIT_FQS(3) ->state=0x0
> > > > > > ...
> > > > > >
> > > > > >
> > > > > > As it's a remote VM for which I don't have access to the host, I have few
> > > > > > options for further digging (I can't trigger sysrqs).
> > > > > >
> > > > > >
> > > > > > The same kernel (4.13.11) seems to be running just fine on another KVM-based
> > > > > > VM that has two CPUs.
> > > > > >
> > > > > >
> > > > > > Does it ring a bell, or is there some info that might be of any use,
> > > > > > assuming I can obtain it?
> > > > > >
> > > > > > Bruno