From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand
	id S1751798AbdHPQcf (ORCPT ); Wed, 16 Aug 2017 12:32:35 -0400
Received: from mx0a-001b2d01.pphosted.com ([148.163.156.1]:46380 "EHLO
	mx0a-001b2d01.pphosted.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org
	with ESMTP id S1751638AbdHPQce (ORCPT );
	Wed, 16 Aug 2017 12:32:34 -0400
Date: Wed, 16 Aug 2017 09:32:28 -0700
From: "Paul E. McKenney"
To: Steven Rostedt
Cc: Daniel Lezcano, Pratyush Anand, =?utf-8?B?6rmA64+Z7ZiE?=,
	john.stultz@linaro.org, linux-kernel@vger.kernel.org
Subject: Re: RCU stall when using function_graph
Reply-To: paulmck@linux.vnet.ibm.com
References: <20170806170220.GQ3730@linux.vnet.ibm.com>
	<20170809125804.GT3730@linux.vnet.ibm.com>
	<20170809144033.GU3730@linux.vnet.ibm.com>
	<208e981d-40ec-54fa-6293-5b8e6fe10a84@linaro.org>
	<20170815092902.252f5e83@gandalf.local.home>
	<43e0a0bc-bdd4-6bd0-c970-336f2fb01c6d@linaro.org>
	<20170816100421.318deae2@gandalf.local.home>
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
In-Reply-To: <20170816100421.318deae2@gandalf.local.home>
User-Agent: Mutt/1.5.21 (2010-09-15)
Message-Id: <20170816163228.GZ7017@linux.vnet.ibm.com>
Sender: linux-kernel-owner@vger.kernel.org
List-ID:
X-Mailing-List: linux-kernel@vger.kernel.org

On Wed, Aug 16, 2017 at 10:04:21AM -0400, Steven Rostedt wrote:
> On Wed, 16 Aug 2017 10:42:15 +0200
> Daniel Lezcano wrote:
>
> > Hi Steven,
> >
> >
> > On 15/08/2017 15:29, Steven Rostedt wrote:
> > >
> > > [ I'm back from vacation! ]
> >
> > Did you get the tapes? :)
>
> Yes, but nothing in them would cause the reputation of the POTUS to
> become any worse than it already is.
>
> >
> > > On Wed, 9 Aug 2017 17:51:33 +0200
> > > Daniel Lezcano wrote:
> > >
> > >> Well, may be the instruction pointer thing is not a good idea.
> > >>
> > >> I learnt from this experience, an overloaded kernel with a lot of
> > >> interrupts can hang the console and issue RCU stall.
> > >>
> > >> However, someone else can face the same situation. Even if he reads the
> > >> RCU/stallwarn.txt documentation, it will be hard to figure out the issue.
> > >>
> > >> A message telling the grace period can't be reached because we are too
> > >> busy processing interrupts would have helped but I understand it is not
> > >> easy to implement.
> > >
> > > What if the stall code triggered an irqwork first? The irqwork would
> > > trigger as soon as interrupts were enabled again (or at the next tick,
> > > depending on the arch), and then it would know that RCU stalled due to
> > > an irq storm if the irqwork is being hit.
> >
> > Is that condition enough to tell the CPU is over utilized by the
> > interrupts handling?
> >
> > And I'm wondering if it wouldn't make sense to have this detection in
> > the irq code. With or without the RCU stall warning kernel option set,
> > the irq framework will be warning about this situation. If the RCU stall
> > option is set, that will issue a second message.
> > It will be easy to do the connection between the first message and
> > the second one, no ?
>
> The thing is, the RCU code keeps track of the state of progress, I
> don't believe the interrupt code does. It just worries about handling
> interrupts. I'm not excited about adding infrastructure to the
> interrupt code to do accounting of IRQ storms.
>
> On the other hand, the RCU code already does this. If it notices a
> stall, it can trigger a irq_work and wait a little more. If the
> irq_work doesn't fire, then it can do the normal RCU stall message. But
> if the irq_work does fire, and the RCU progress still hasn't moved
> forward, then it would be able to say this is due to an IRQ storm and
> produce a better error message.

Let me see if I understand you... About halfway to the stall limit,
RCU triggers an irq_work (on each CPU that has not yet passed through
a quiescent state, IPIing them in turn?), and if the irq_work has not
completed by the end of the stall limit, RCU adds that to its
stall-warning message.

Or am I missing something here?

							Thanx, Paul
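
For anyone who wants to poke at the idea without a kernel tree, here is a
toy userspace model of the scheme as restated above. It is only a sketch
under stated assumptions, not kernel code and not a patch from this thread:
the names (CpuTimeline, stall_report), the tick-based timeline, and the two
message strings are all made up for illustration. It probes each CPU that
still owes a quiescent state at about half the stall limit, then at the
limit classifies the stragglers the way the quoted text suggests: probe ran
but no progress hints at an IRQ storm, probe never ran gets the normal
stall message.

```python
from dataclasses import dataclass
from typing import Dict, Optional

# Hypothetical message strings; the real stall-warning text is not modeled.
IRQ_STORM = "irq_work ran but no quiescent state: possible IRQ storm"
PLAIN_STALL = "irq_work never ran: normal stall message"


@dataclass
class CpuTimeline:
    """Assumed per-CPU inputs for the toy model, in abstract ticks."""
    quiescent_at: Optional[int]     # tick of first quiescent state, None if never
    irq_work_ran_at: Optional[int]  # tick at which a queued probe ran, None if never


def stall_report(cpus: Dict[int, CpuTimeline], stall_limit: int) -> Dict[int, str]:
    """Return a message for each CPU still stalled at stall_limit."""
    probe_time = stall_limit // 2

    # About halfway to the limit: probe only CPUs that have not yet
    # passed through a quiescent state.
    probed = [cpu for cpu, t in cpus.items()
              if t.quiescent_at is None or t.quiescent_at > probe_time]

    # At the limit: skip CPUs that made progress in the meantime,
    # classify the rest by whether their probe managed to run.
    report = {}
    for cpu in probed:
        t = cpus[cpu]
        if t.quiescent_at is not None and t.quiescent_at <= stall_limit:
            continue
        ran = t.irq_work_ran_at is not None and t.irq_work_ran_at <= stall_limit
        report[cpu] = IRQ_STORM if ran else PLAIN_STALL
    return report


if __name__ == "__main__":
    example = {
        0: CpuTimeline(quiescent_at=3, irq_work_ran_at=None),   # done early, never probed
        1: CpuTimeline(quiescent_at=None, irq_work_ran_at=12),  # probe ran, no progress
        2: CpuTimeline(quiescent_at=None, irq_work_ran_at=None),  # probe never ran
        3: CpuTimeline(quiescent_at=18, irq_work_ran_at=11),    # late, but before the limit
    }
    for cpu, msg in sorted(stall_report(example, stall_limit=20).items()):
        print(f"CPU {cpu}: {msg}")
```

The model says nothing about whether a probe that ran is sufficient evidence
of an interrupt storm, which is the question raised earlier in the thread;
it only makes the two outcomes and their timing explicit.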