From mboxrd@z Thu Jan  1 00:00:00 1970
Date: Sat, 23 Dec 2017 17:29:24 -0800
From: "Paul E. McKenney"
To: Thomas Gleixner
Cc: LKML, Anna-Maria Gleixner, Sebastian Siewior, Peter Zijlstra,
	Frederic Weisbecker, Ingo Molnar
Subject: Re: [patch 0/4] timer/nohz: Fix timer/nohz woes
Reply-To: paulmck@linux.vnet.ibm.com
References: <20171222145111.919609918@linutronix.de>
 <20171222170907.GJ7829@linux.vnet.ibm.com>
 <20171224012120.GA4113@linux.vnet.ibm.com>
In-Reply-To: <20171224012120.GA4113@linux.vnet.ibm.com>
Message-Id: <20171224012924.GA6916@linux.vnet.ibm.com>
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
User-Agent: Mutt/1.5.21 (2010-09-15)
List-ID: X-Mailing-List: linux-kernel@vger.kernel.org

On Sat, Dec 23, 2017 at 05:21:20PM -0800, Paul E. McKenney wrote:
> On Fri, Dec 22, 2017 at 09:09:07AM -0800, Paul E. McKenney wrote:
> > On Fri, Dec 22, 2017 at 03:51:11PM +0100, Thomas Gleixner wrote:
> > > Paul was observing weird stalls which are hard to reproduce and
> > > decode.  We were finally able to reproduce and decode the wreckage
> > > on RT.
> > >
> > > The following series addresses the issues and hopefully nails the
> > > root cause completely.
> > >
> > > Please review carefully and expose it to the dreaded rcu torture
> > > tests which seem to be the only way to trigger it.
> >
> > Best Christmas present ever, thank you!!!
> >
> > Just started up three concurrent 10-hour runs of the infamous
> > rcutorture TREE01 scenario, and will let you know how it goes!
>
> Well, I messed up the first test and then reran it, which had the benefit
> of giving me a baseline.  The rerun (with all four patches) produced
> failures, so I ran it again with an additional patch of mine.  I score
> these tests by recording the time at first failure or, if there is no
> failure, the duration of the test.  Summing the values gives the score.
> And here are the scores, where 30 is a perfect score:

Sigh.  They were five-hour tests, not ten-hour tests:

1.	Baseline: 3.0+2.5+5=10.5

2.	Four patches from Anna-Maria and Thomas: 5+2.7+1.7=9.4

3.	Ditto plus the patch below: 5+4.3+5=14.3

Oh, and the reason for my suspecting that #2 is actually an improvement
over #1 is that my patch by itself produced a very small improvement in
reliability.  This leads to the hypothesis that #2 really is helping out
in some way or another.

							Thanx, Paul

> 1.	Baseline: 3.0+2.5+10=15.5
>
> 2.	Four patches from Anna-Maria and Thomas: 10+2.7+1.7=14.4
>
> 3.	Ditto plus the patch below: 10+4.3+10=24.3
>
> Please note that these are nowhere near anything even resembling
> statistical significance.  However, they are encouraging.  I will do
> more runs, but also do shorter five-hour runs to increase the amount
> of data per unit time.  Please note also that my patch by itself never
> did provide that great of an improvement, so there might be some sort
> of combination effect going on here.  Or maybe it is just luck, who knows?
>
> Please note that I have not yet ported my diagnostic patches on top of
> these; however, the stacks have the usual schedule_timeout() entries.
> This is not too surprising from a software-engineering viewpoint:
> locating several bugs at a given point in time usually indicates that
> there are more to be found.  So in a sense we are lucky that the
> same test triggers at least one of those additional bugs.
>
> 							Thanx, Paul
>
> ------------------------------------------------------------------------
>
> commit accb0edb85526a05b934eac49658d05ea0216fc4
> Author: Paul E. McKenney
> Date:   Thu Dec 7 13:18:44 2017 -0800
>
>     timers: Ensure that timer_base ->clk accounts for time offline
>
>     The timer_base ->must_forward_clk is set to indicate that the next
>     timer operation on that timer_base must check for passage of time.
>     One instance of time passage is when the timer wheel goes idle, and
>     another is when the corresponding CPU is offline.  Note that it is
>     not appropriate to set ->is_idle because that could result in IPIing
>     an offline CPU.  Therefore, this commit instead sets
>     ->must_forward_clk at CPU-offline time.
>
>     Signed-off-by: Paul E. McKenney
>
> diff --git a/kernel/time/timer.c b/kernel/time/timer.c
> index ffebcf878fba..94cce780c574 100644
> --- a/kernel/time/timer.c
> +++ b/kernel/time/timer.c
> @@ -1875,6 +1875,7 @@ int timers_dead_cpu(unsigned int cpu)
>  
>  	BUG_ON(old_base->running_timer);
>  
> +	old_base->must_forward_clk = true;
>  	for (i = 0; i < WHEEL_SIZE; i++)
>  		migrate_timer_list(new_base, old_base->vectors + i);
> 