From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <linux-kernel-owner@vger.kernel.org>
Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand
	id S1753596AbcCWB7h (ORCPT <rfc822;w@1wt.eu>);
	Tue, 22 Mar 2016 21:59:37 -0400
Received: from e17.ny.us.ibm.com ([129.33.205.207]:39772 "EHLO
	e17.ny.us.ibm.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org
	with ESMTP id S1753131AbcCWB73 (ORCPT
	<rfc822;linux-kernel@vger.kernel.org>);
	Tue, 22 Mar 2016 21:59:29 -0400
X-IBM-Helo: d01dlp01.pok.ibm.com
X-IBM-MailFrom: paulmck@linux.vnet.ibm.com
X-IBM-RcptTo: linux-kernel@vger.kernel.org
Date: Tue, 22 Mar 2016 18:59:32 -0700
From: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
To: Bart Van Assche <bart.vanassche@sandisk.com>
Cc: "linux-kernel@vger.kernel.org" <linux-kernel@vger.kernel.org>
Subject: Re: RCU stall
Message-ID: <20160323015932.GX4287@linux.vnet.ibm.com>
Reply-To: paulmck@linux.vnet.ibm.com
References: <56F1A8F2.9000905@sandisk.com>
 <20160322204510.GS4287@linux.vnet.ibm.com>
 <56F1DAF6.3030804@sandisk.com>
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
In-Reply-To: <56F1DAF6.3030804@sandisk.com>
User-Agent: Mutt/1.5.21 (2010-09-15)
X-TM-AS-MML: disable
X-Content-Scanned: Fidelis XPS MAILER
x-cbid: 16032301-0041-0000-0000-000003AB5367
Sender: linux-kernel-owner@vger.kernel.org
List-ID: <linux-kernel.vger.kernel.org>
X-Mailing-List: linux-kernel@vger.kernel.org

On Tue, Mar 22, 2016 at 04:53:26PM -0700, Bart Van Assche wrote:
> On 03/22/2016 01:45 PM, Paul E. McKenney wrote:
> >You are getting a soft lockup as well as an RCU CPU stall warning, so
> >it looks like something is taking a very long time in blk_done_softirq().
> >
> >You have multiple occurrences at different times, so it looks to be
> >a long time as opposed to an infinite time.  Are you perhaps doing
> >something that would make a huge amount of work for blk_done_softirq()?
> >
> >See Documentation/RCU/stallwarn.txt in the kernel source tree for more
> >info on how to debug this sort of thing.
> 
> Hello Paul,
> 
> None of the drivers involved in the test I ran contain RCU code that
> has been changed recently. The block and SCSI subsystems process
> I/O completions in softirq context but until last week I hadn't seen
> any RCU lockup complaints when I ran an SRP test against a kernel
> with lockdep and several other kernel debugging options enabled.
> This is why I sent an e-mail to you. I have read
> Documentation/RCU/stallwarn.txt after I received your reply but this
> didn't provide me any clue about where to look for the root cause.
> Any further help would be appreciated.

My suggestion would be to check whether the block/SCSI softirq handler
already has trace events.  If it does, enable them and see what the loop
is doing.  Documentation/trace/ftrace.txt describes how to enable
existing event tracing.
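
As a rough illustration of the enabling step, here is a minimal
user-space sketch.  It assumes tracefs is mounted at
/sys/kernel/debug/tracing (or /sys/kernel/tracing on newer kernels)
and that the events of interest are the "block" group and
"irq/softirq_entry".  The helper names build_enable_path() and
enable_trace_event() are made up for this example; all they do is
write "1" to the event's "enable" file, which is what ftrace.txt
describes doing by hand with echo.

```c
#include <stdio.h>
#include <stddef.h>

/*
 * Hypothetical helper: build "<tracefs_root>/events/<event>/enable"
 * into buf.  Returns 0 on success, -1 if the path does not fit.
 */
int build_enable_path(char *buf, size_t size,
		      const char *tracefs_root, const char *event)
{
	int n = snprintf(buf, size, "%s/events/%s/enable",
			 tracefs_root, event);

	if (n < 0 || (size_t)n >= size)
		return -1;
	return 0;
}

/*
 * Hypothetical helper: enable one trace event (or a whole event
 * group) by writing "1" to its enable file.  Needs root on a real
 * system.  Returns 0 on success, -1 on any failure.
 */
int enable_trace_event(const char *tracefs_root, const char *event)
{
	char path[512];
	FILE *f;

	if (build_enable_path(path, sizeof(path), tracefs_root, event))
		return -1;

	f = fopen(path, "w");
	if (!f)
		return -1;

	if (fputs("1", f) == EOF) {
		fclose(f);
		return -1;
	}
	/* fclose() flushes, so a rejected write shows up here. */
	return fclose(f) == 0 ? 0 : -1;
}
```

One would call, for example,
enable_trace_event("/sys/kernel/debug/tracing", "block") and
enable_trace_event("/sys/kernel/debug/tracing", "irq/softirq_entry"),
reproduce the stall, and then read the "trace" file in the same
directory.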

If there is no event tracing, consider adding some in your local
view.  Failing that, there is always printk().  ;-)
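
To give an idea of the kind of instrumentation meant here, below is a
user-space model, not kernel code.  It assumes the interesting loop
handles one completion per iteration, counts the iterations, times the
whole pass, and complains only when the pass exceeds a budget.  All of
the names (loop_stats, run_instrumented(), report_if_slow(),
bump_counter()) are invented for the example.  In the kernel the same
shape would use a kernel clock and trace_printk() or printk() instead
of clock_gettime() and fprintf().

```c
#define _POSIX_C_SOURCE 200809L
#include <stdio.h>
#include <stdint.h>
#include <stddef.h>
#include <time.h>

/* What one pass through the loop cost. */
struct loop_stats {
	uint64_t iterations;
	uint64_t elapsed_ns;
};

/* Monotonic timestamp in nanoseconds. */
static uint64_t now_ns(void)
{
	struct timespec ts;

	clock_gettime(CLOCK_MONOTONIC, &ts);
	return (uint64_t)ts.tv_sec * 1000000000ull + (uint64_t)ts.tv_nsec;
}

/*
 * Stand-in for the completion loop: call handle_one() on each item,
 * recording how many items were handled and how long the pass took.
 */
struct loop_stats run_instrumented(void (*handle_one)(void *),
				   void **items, size_t nr)
{
	struct loop_stats st = { 0, 0 };
	uint64_t start = now_ns();
	size_t i;

	for (i = 0; i < nr; i++) {
		handle_one(items[i]);
		st.iterations++;
	}
	st.elapsed_ns = now_ns() - start;
	return st;
}

/*
 * Report only when the pass took longer than budget_ns, so the
 * output is not flooded.  Returns 1 if a report was made, else 0.
 */
int report_if_slow(const struct loop_stats *st, uint64_t budget_ns)
{
	if (st->elapsed_ns <= budget_ns)
		return 0;
	fprintf(stderr, "slow pass: %llu iterations in %llu ns\n",
		(unsigned long long)st->iterations,
		(unsigned long long)st->elapsed_ns);
	return 1;
}

/* Trivial example handler: treats the item as an int counter. */
void bump_counter(void *item)
{
	(*(int *)item)++;
}
```

A very large iteration count would point at too much work being
queued for one pass, while a small count with a large elapsed time
would point at an individual completion being slow.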

Or perhaps you have some sort of debug setup.

Either way, the next step is to work out why that CPU is spending
so much time in that loop.

							Thanx, Paul