From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org X-Spam-Level: X-Spam-Status: No, score=-2.5 required=3.0 tests=HEADER_FROM_DIFFERENT_DOMAINS, MAILING_LIST_MULTI,SPF_HELO_NONE,SPF_PASS,USER_AGENT_SANE_1 autolearn=no autolearn_force=no version=3.4.0 Received: from mail.kernel.org (mail.kernel.org [198.145.29.99]) by smtp.lore.kernel.org (Postfix) with ESMTP id BE484C48BDB for ; Sun, 7 Jul 2019 11:19:37 +0000 (UTC) Received: from vger.kernel.org (vger.kernel.org [209.132.180.67]) by mail.kernel.org (Postfix) with ESMTP id 82E8120659 for ; Sun, 7 Jul 2019 11:19:37 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1725867AbfGGLTg (ORCPT ); Sun, 7 Jul 2019 07:19:36 -0400 Received: from mx0b-001b2d01.pphosted.com ([148.163.158.5]:18430 "EHLO mx0a-001b2d01.pphosted.com" rhost-flags-OK-OK-OK-FAIL) by vger.kernel.org with ESMTP id S1725846AbfGGLTg (ORCPT ); Sun, 7 Jul 2019 07:19:36 -0400 Received: from pps.filterd (m0098417.ppops.net [127.0.0.1]) by mx0a-001b2d01.pphosted.com (8.16.0.27/8.16.0.27) with SMTP id x67BGMih098265; Sun, 7 Jul 2019 07:19:06 -0400 Received: from ppma03dal.us.ibm.com (b.bd.3ea9.ip4.static.sl-reverse.com [169.62.189.11]) by mx0a-001b2d01.pphosted.com with ESMTP id 2tk919adpj-1 (version=TLSv1.2 cipher=ECDHE-RSA-AES256-GCM-SHA384 bits=256 verify=NOT); Sun, 07 Jul 2019 07:19:06 -0400 Received: from pps.filterd (ppma03dal.us.ibm.com [127.0.0.1]) by ppma03dal.us.ibm.com (8.16.0.27/8.16.0.27) with SMTP id x67BEE9t030590; Sun, 7 Jul 2019 11:19:05 GMT Received: from b01cxnp23032.gho.pok.ibm.com (b01cxnp23032.gho.pok.ibm.com [9.57.198.27]) by ppma03dal.us.ibm.com with ESMTP id 2tjk967vxb-1 (version=TLSv1.2 cipher=ECDHE-RSA-AES256-GCM-SHA384 bits=256 verify=NOT); Sun, 07 Jul 2019 11:19:05 +0000 Received: from b01ledav003.gho.pok.ibm.com (b01ledav003.gho.pok.ibm.com [9.57.199.108]) by b01cxnp23032.gho.pok.ibm.com (8.14.9/8.14.9/NCO v10.0) with ESMTP id x67BJ4kn43909602 (version=TLSv1/SSLv3 cipher=DHE-RSA-AES256-GCM-SHA384 bits=256 verify=OK); Sun, 7 Jul 2019 11:19:04 GMT Received: from b01ledav003.gho.pok.ibm.com (unknown [127.0.0.1]) by IMSVA (Postfix) with ESMTP id A7A82B2065; Sun, 7 Jul 2019 11:19:04 +0000 (GMT) Received: from b01ledav003.gho.pok.ibm.com (unknown [127.0.0.1]) by IMSVA (Postfix) with ESMTP id 749E4B205F; Sun, 7 Jul 2019 11:19:04 +0000 (GMT) Received: from paulmck-ThinkPad-W541 (unknown [9.80.215.71]) by b01ledav003.gho.pok.ibm.com (Postfix) with ESMTP; Sun, 7 Jul 2019 11:19:04 +0000 (GMT) Received: by paulmck-ThinkPad-W541 (Postfix, from userid 1000) id 22E7716C29D7; Sun, 7 Jul 2019 04:19:06 -0700 (PDT) Date: Sun, 7 Jul 2019 04:19:06 -0700 From: "Paul E. McKenney" To: Joel Fernandes Cc: Steven Rostedt , Mathieu Desnoyers , rcu Subject: Re: Normal RCU grace period can be stalled for long because need-resched flags not set? Message-ID: <20190707111906.GB26519@linux.ibm.com> Reply-To: paulmck@linux.ibm.com References: <20190704002130.GA68801@google.com> <20190704003213.GA218086@google.com> <20190704005009.GZ26519@linux.ibm.com> <20190704032454.GA259593@google.com> <20190704171315.GG26519@linux.ibm.com> <20190704185055.GA12919@google.com> <20190704221702.GL26519@linux.ibm.com> <20190706120230.GA200542@google.com> <20190706182117.GY26519@linux.ibm.com> <20190706230331.GA158271@google.com> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20190706230331.GA158271@google.com> User-Agent: Mutt/1.5.21 (2010-09-15) X-TM-AS-GCONF: 00 X-Proofpoint-Virus-Version: vendor=fsecure engine=2.50.10434:,, definitions=2019-07-07_04:,, signatures=0 X-Proofpoint-Spam-Details: rule=outbound_notspam policy=outbound score=0 priorityscore=1501 malwarescore=0 suspectscore=0 phishscore=0 bulkscore=0 spamscore=0 clxscore=1015 lowpriorityscore=0 mlxscore=0 impostorscore=0 mlxlogscore=999 adultscore=0 classifier=spam adjust=0 reason=mlx scancount=1 engine=8.0.1-1810050000 definitions=main-1907070158 Sender: rcu-owner@vger.kernel.org Precedence: bulk List-ID: X-Mailing-List: rcu@vger.kernel.org On Sat, Jul 06, 2019 at 07:03:31PM -0400, Joel Fernandes wrote: > On Sat, Jul 06, 2019 at 11:21:17AM -0700, Paul E. McKenney wrote: > > On Sat, Jul 06, 2019 at 08:02:30AM -0400, Joel Fernandes wrote: > > > On Thu, Jul 04, 2019 at 03:17:02PM -0700, Paul E. McKenney wrote: > > > > On Thu, Jul 04, 2019 at 02:50:55PM -0400, Joel Fernandes wrote: > > > [snip] > > > > > I did have a request, could you help me understand why is the grace period > > > > > duration double that of my busy wait time? You mentioned this has something > > > > > to do with the thread not waking up before another GP is started. But I did > > > > > not follow this. Thanks a lot!! > > > > > > > > Let's look at a normal grace period, and assume the ->gp_seq grace-period > > > > sequence counter is initially 0x1, so that a grace period is already > > > > in progress (keep in mind that the low-order two bits of ->gp_seq are > > > > the phase within the grace period and the rest of the bits are the > > > > grace-period number, so we are in phase 1 of the grace period following > > > > grace period #0). This grace period was started via synchronize_rcu() > > > > by Task A. > > > > > > > > Then we have the following sequence of events: > > > > > > > > o Task B does call_rcu(), but is too late to use the already-started > > > > grace period, so it needs another one. The ->gp_seq_needed > > > > counter thus becomes 0x8. > > > > > > > > o The current grace period completes, so that the ->gp_seq > > > > counter becomes 0x4. Callback invocation commences. > > > > > > > > o The grace-period kthread notices that ->gp_seq_needed is greater > > > > than ->gp_seq, so it starts a new grace period. The ->gp_seq > > > > counter therefore becomes 0x5. > > > > > > > > o The callback for Task A's synchronize_rcu() is invoked, awakening > > > > Task A. This happens almost immediately after the new grace > > > > period starts, but does definitely happen -after- that grace > > > > period starts. > > > > > > Yes, but at this point, we are still at the 1GP mark. But the distribution's > > > median below shows a strong correlation with 2 preempt-disable durations and > > > grace-period completion. For example, if the duration of the preempt > > > disable/enable loop is 50ms, it strongly shows the writer's median time > > > difference before and after the synchronize_rcu as 100ms, not really showing > > > it is 60 ms or 75 ms or anything. > > > > > > > > > > > o Task A, being part of rcuperf, almost immediately does another > > > > synchronize_rcu(). So ->gp_seq_needed becomes 0xc. > > > > > > Yes, but before that another synchronize_rcu, it also gets the timestamp and > > > does a diff between old/new timestamp and has captured the data, so at that > > > point the data captured should only be for around 1GPs worth give or take. > > > > But in this case, the "give or take" is almost a full grace period, for > > a total of almost two grace periods. > > > > > > If you play out the rest of this sequence, you should see how Task A > > > > waits for almost two full grace periods. > > > > > > I tried to play this out, still didn't get it :-|. Not that it is an issue > > > per-se, but still would like to understand it better. > > > > Let's try it a different way. Let's skip the state portion of ->gp_seq > > and just focus on the number of elapsed grace periods. > > > > o Let's start with RCU idle, having completed grace-period number > > zero and not yet having started grace-period number 1. > > Let's also abbreviate: GP0 and GP1. > > > > o Task A wants a grace period. Because RCU is idle, Task A's > > GP will complete at completion of GP1. > > > > o RCU starts GP1. > > > > o Almost immediately thereafter, Task B wants a grace period. > > But it cannot use GP1, because GP1 has already started, even > > if just barely, because some CPU or tasks might already have > > reported a quiescent state for GP1. > > > > So Task B's GP will not complete until the end of GP2. And that > > is almost two GPs worth of time from now. > > > > And on a system where RCU is busy, Task B's experience is the common > > case if it is in a tight loop doing synchronize_rcu(). Because RCU is > > busy, by the end of each grace period, there will always be a request > > for another grace period. And therefore RCU's grace-period kthread > > will immediately start a new grace period upon completion of each > > grace period. So the sequence of events from Task B's viewpoint will > > be as follows: > > > > o Task B executes synchronize_rcu(), which requests a new grace > > period. > > > > o This requested grace period ends, and another immediately starts. > > > > o Task B's RCU callback is invoked, which does a wakeup. > > > > o Task B executes another synchronize_rcu(), but just after a new > > grace period has started. Task B thus has to wait almost two > > full grace periods. > > > > This could be perceived as a problem, but my question back to you would > > be "Why on earth are you invoking synchronize_rcu() in a tight loop???" > > "Oh, because you are running rcuperf? Well, that is why rcuperf uses > > ftrace data!" > > > > Which might be the point on which we are talking past each other. You > > might have been thinking of the latency of RCU's grace-period computation. > > I was thinking of the length of time between the call to synchronize_rcu() > > and the corresponding return. > > This registered perfectly with me now, I was having a hard time visualizing > but it makes sense to me now based on your description. Thanks a lot! > > Ascii art: > (Basically the synch rcu is done just after a slight offset) > > synchronize_rcu Wait synchronize_rcu Wait > |<-------------------->| |--------------------| > v v v v > <----------> <----------> <----------> <---------> <---------> > GP GP GP GP GP > > GP = long preempt disable duration You got it! Thanx, Paul