From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <linux-kernel-owner@vger.kernel.org>
Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand
        id S1751798AbdHPQcf (ORCPT <rfc822;w@1wt.eu>);
        Wed, 16 Aug 2017 12:32:35 -0400
Received: from mx0a-001b2d01.pphosted.com ([148.163.156.1]:46380 "EHLO
        mx0a-001b2d01.pphosted.com" rhost-flags-OK-OK-OK-OK)
        by vger.kernel.org with ESMTP id S1751638AbdHPQce (ORCPT
        <rfc822;linux-kernel@vger.kernel.org>);
        Wed, 16 Aug 2017 12:32:34 -0400
Date: Wed, 16 Aug 2017 09:32:28 -0700
From: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
To: Steven Rostedt <rostedt@goodmis.org>
Cc: Daniel Lezcano <daniel.lezcano@linaro.org>,
        Pratyush Anand <panand@redhat.com>,
        =?utf-8?B?6rmA64+Z7ZiE?= <austinkernel.kim@gmail.com>,
        john.stultz@linaro.org, linux-kernel@vger.kernel.org
Subject: Re: RCU stall when using function_graph
Reply-To: paulmck@linux.vnet.ibm.com
References: <CAOoBcBU00VRXmrNNEOjJHgXf9BimxKYOorJC0d3766mNdda=Bg@mail.gmail.com>
 <20170806170220.GQ3730@linux.vnet.ibm.com>
 <db4dc3c5-8a3d-9752-802e-ab509201e251@redhat.com>
 <20170809125804.GT3730@linux.vnet.ibm.com>
 <bf4f38d6-57b7-2281-db24-368d047956aa@linaro.org>
 <20170809144033.GU3730@linux.vnet.ibm.com>
 <208e981d-40ec-54fa-6293-5b8e6fe10a84@linaro.org>
 <20170815092902.252f5e83@gandalf.local.home>
 <43e0a0bc-bdd4-6bd0-c970-336f2fb01c6d@linaro.org>
 <20170816100421.318deae2@gandalf.local.home>
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
In-Reply-To: <20170816100421.318deae2@gandalf.local.home>
User-Agent: Mutt/1.5.21 (2010-09-15)
Message-Id: <20170816163228.GZ7017@linux.vnet.ibm.com>
Sender: linux-kernel-owner@vger.kernel.org
List-ID: <linux-kernel.vger.kernel.org>
X-Mailing-List: linux-kernel@vger.kernel.org

On Wed, Aug 16, 2017 at 10:04:21AM -0400, Steven Rostedt wrote:
> On Wed, 16 Aug 2017 10:42:15 +0200
> Daniel Lezcano <daniel.lezcano@linaro.org> wrote:
> 
> > Hi Steven,
> > 
> > 
> > On 15/08/2017 15:29, Steven Rostedt wrote:
> > > 
> > > [ I'm back from vacation! ]  
> > 
> > Did you get the tapes? :)
> 
> Yes, but nothing in them would cause the reputation of the POTUS to
> become any worse than it already is.
> 
> > 
> > > On Wed, 9 Aug 2017 17:51:33 +0200
> > > Daniel Lezcano <daniel.lezcano@linaro.org> wrote:
> > >   
> > >> Well, maybe the instruction pointer thing is not a good idea.
> > >>
> > >> I learnt from this experience that an overloaded kernel with a lot of
> > >> interrupts can hang the console and trigger an RCU stall.
> > >>
> > >> However, someone else can face the same situation. Even if he reads the
> > >> RCU/stallwarn.txt documentation, it will be hard to figure out the issue.
> > >>
> > >> A message telling the grace period can't be reached because we are too
> > >> busy processing interrupts would have helped but I understand it is not
> > >> easy to implement.  
> > > 
> > > What if the stall code triggered an irqwork first? The irqwork would
> > > trigger as soon as interrupts were enabled again (or at the next tick,
> > > depending on the arch), and then it would know that RCU stalled due to
> > > an irq storm if the irqwork is being hit.  
> > 
> > Is that condition enough to tell that the CPU is overutilized by
> > interrupt handling?
> > 
> > And I'm wondering if it wouldn't make sense to have this detection in
> > the irq code. With or without the RCU stall warning kernel option set,
> > the irq framework will be warning about this situation. If the RCU stall
> > option is set, that will issue a second message. It will be easy to
> > connect the first message to the second one, no?
> 
> The thing is, the RCU code keeps track of the state of progress, I
> don't believe the interrupt code does. It just worries about handling
> interrupts. I'm not excited about adding infrastructure to the
> interrupt code to do accounting of IRQ storms.
> 
> On the other hand, the RCU code already does this. If it notices a
> stall, it can trigger an irq_work and wait a little more. If the
> irq_work doesn't fire, then it can do the normal RCU stall message. But
> if the irq_work does fire, and the RCU progress still hasn't moved
> forward, then it would be able to say this is due to an IRQ storm and
> produce a better error message.

Let me see if I understand you...  About halfway to the stall limit,
RCU triggers an irq_work (on each CPU that has not yet passed through
a quiescent state, IPIing them in turn?), and if the irq_work has
not completed by the end of the stall limit, RCU adds that to its
stall-warning message.

Or am I missing something here?
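
To make that reading concrete, here is a rough user-space model of the
sequence as I understand it. It is only a sketch: the struct, the enum,
and all three function names are invented for illustration and are not
kernel code. In the kernel the probe would presumably be queued with
something like irq_work_queue_on(), which this model replaces with a
plain flag.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical per-CPU bookkeeping; these names do not exist in the kernel. */
struct cpu_model {
	bool passed_qs;        /* CPU has reported a quiescent state */
	bool irq_work_queued;  /* probe was queued at the halfway point */
	bool irq_work_done;    /* probe handler has run on this CPU */
};

enum stall_verdict {
	NO_STALL,               /* CPU is not holding up the grace period */
	STALL_PLAIN,            /* stalled, but the probe did run */
	STALL_IRQ_WORK_PENDING  /* stalled, and the probe never completed */
};

/*
 * About halfway to the stall limit: queue a probe on each CPU that has
 * not yet passed through a quiescent state.
 */
static void halfway_check(struct cpu_model cpus[], int ncpus)
{
	assert(cpus != NULL && ncpus > 0);
	for (int i = 0; i < ncpus; i++)
		if (!cpus[i].passed_qs)
			cpus[i].irq_work_queued = true;
}

/* Stands in for the irq_work handler actually running on the CPU. */
static void run_irq_work(struct cpu_model *c)
{
	if (c->irq_work_queued)
		c->irq_work_done = true;
}

/* At the stall limit: decide what the stall-warning message should say. */
static enum stall_verdict stall_limit_check(const struct cpu_model *c)
{
	if (c->passed_qs)
		return NO_STALL;
	if (c->irq_work_queued && !c->irq_work_done)
		return STALL_IRQ_WORK_PENDING;
	return STALL_PLAIN;
}
```

In this model, a CPU that never gets to run the probe between the
halfway point and the stall limit ends up as STALL_IRQ_WORK_PENDING,
and that is the extra fact the stall-warning message would carry.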

							Thanx, Paul