Re: [patch 0/4] timer/nohz: Fix timer/nohz woes

linux-kernel.vger.kernel.org archive mirror

From: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
To: Thomas Gleixner <tglx@linutronix.de>
Cc: LKML <linux-kernel@vger.kernel.org>,
	Anna-Maria Gleixner <anna-maria@linutronix.de>,
	Sebastian Siewior <bigeasy@linutronix.de>,
	Peter Zijlstra <peterz@infradead.org>,
	Frederic Weisbecker <fweisbec@gmail.com>,
	Ingo Molnar <mingo@kernel.org>
Subject: Re: [patch 0/4] timer/nohz: Fix timer/nohz woes
Date: Fri, 5 Jan 2018 11:41:50 -0800
Message-ID: <20180105194150.GA24831@linux.vnet.ibm.com>
In-Reply-To: <20171224012924.GA6916@linux.vnet.ibm.com>

On Sat, Dec 23, 2017 at 05:29:24PM -0800, Paul E. McKenney wrote:
> On Sat, Dec 23, 2017 at 05:21:20PM -0800, Paul E. McKenney wrote:
> > On Fri, Dec 22, 2017 at 09:09:07AM -0800, Paul E. McKenney wrote:
> > > On Fri, Dec 22, 2017 at 03:51:11PM +0100, Thomas Gleixner wrote:
> > > > Paul was observing weird stalls which are hard to reproduce and decode. We
> > > > were finally able to reproduce and decode the wreckage on RT.
> > > > 
> > > > The following series addresses the issues and hopefully nails the root
> > > > cause completely.
> > > > 
> > > > Please review carefully and expose it to the dreaded rcu torture tests
> > > > which seem to be the only way to trigger it.
> > > 
> > > Best Christmas present ever, thank you!!!
> > > 
> > > Just started up three concurrent 10-hour runs of the infamous rcutorture
> > > TREE01 scenario, and will let you know how it goes!
> > 
> > Well, I messed up the first test and then reran it.  Which had the benefit
> > of giving me a baseline.  The rerun (with all four patches) produced
> > failures, so I ran it again with an additional patch of mine.  I score
> > these tests by recording the time at first failure, or, if there is no
> > failure, the duration of the test.  Summing the values gives the score.
> > And here are the scores, where 30 is a perfect score:
> 
> Sigh.  They were five-hour tests, not ten-hour tests.  
> 
> 1.	Baseline: 3.0+2.5+5=10.5
> 
> 2.	Four patches from Anna-Maria and Thomas: 5+2.7+1.7=9.4
> 
> 3.	Ditto plus the patch below: 5+4.3+5=14.3
> 
> Oh, and the reason for my suspecting that #2 is actually an improvement
> over #1 is that my patch by itself produced a very small improvement in
> reliability.  This leads to the hypothesis that #2 really is helping out
> in some way or another.

But after more than 1,000 hours of test runs, split roughly evenly
among the above three scenarios, there is no statistically significant
difference in error rate among them.  This means that there is some
other bug lurking somewhere that has the same symptom (a lost timer).
Were you guys ever able to reproduce this via rcutorture?

More details below.

							Thanx, Paul

------------------------------------------------------------------------

I ran sets of three-hour runs.  I took the time of first error (if
any), and excluded the rest of that particular three-hour run from
consideration.  This means that if a given run failed at two hours,
we add one to the "errors" column and two to the "duration" column.
Runs without errors contributed three hours to the "duration" column, but of
course nothing to the "errors" column.  An overall errors/hour rate
is then computed for each scenario:

1.	Baseline: (378 hours total runtime)
	74 errors in 218.8 hours error-free runtime, 0.338 errors/hour.

2.	Four patches from Anna-Maria and Thomas: (315 hours total runtime)
	65 errors in 195.2 hours error-free runtime, 0.333 errors/hour.

3.	Ditto plus the patch below: (315 hours total runtime)
	66 errors in 179.4 hours error-free runtime, 0.368 errors/hour.
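
For readers who want to reproduce the bookkeeping, here is a minimal
Python sketch of the scoring rule described above.  The function name
error_rate() and the example_runs list are made up for illustration;
only the (errors, hours) pairs in "reported" come from the numbers in
this message.

```python
def error_rate(first_failures, run_hours=3.0):
    """Score a set of runs as described above.

    first_failures holds one entry per run: the time in hours of the
    first error, or None if the run completed without error.
    """
    errors = 0
    duration = 0.0
    for t in first_failures:
        if t is None:
            # Clean run: counts for its full length, adds no error.
            duration += run_hours
        else:
            # Failed run: counts one error, and only the time up to it.
            errors += 1
            duration += t
    return errors, duration, errors / duration


# Made-up example: two failures (at 2.0h and 0.5h) and two clean runs.
example_runs = [2.0, None, 0.5, None]
errors, duration, rate = error_rate(example_runs)

# The (errors, error-free hours) pairs reported for the three scenarios.
reported = {
    "baseline": (74, 218.8),
    "four patches": (65, 195.2),
    "four patches plus extra patch": (66, 179.4),
}
rates = {name: round(e / h, 3) for name, (e, h) in reported.items()}
print(rates)
```

Note that a set of runs with zero total duration would divide by zero;
the sketch does not guard against that.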

Applying Poisson statistics shows that we need to drop below 0.270
errors/hour to assert that a fix had a 95% chance of having reduced the
error rate, and none of the runs achieve this level of improvement.
In fact, even the least probable scenario had more than a 25% probability
of happening by chance.

These calculations were carried out using maxima:

	load(distrib);
	bfloat(cdf_poisson(59,218.8*0.338));
	(%o11)                       4.267467688401431b-2

This is a 4.3% probability of seeing 59 or fewer errors due to random
chance, just a bit better than 95% confidence.

	bfloat(cdf_poisson(60,218.8*0.338));
	(%o8)                        5.525461180734715b-2

This is a 5.5% probability of the result having happened due to random
chance, just a bit worse than 95% confidence.  So, dividing 59 by the
218.8 hours of error-free runs on baseline gives the aforementioned
0.270 errors/hour.
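
The same threshold can be cross-checked without maxima.  The following
is a rough Python sketch, not the method actually used for the numbers
above: poisson_cdf() is a hand-rolled helper (an assumption, standing
in for maxima's cdf_poisson) that sums the Poisson terms directly,
and the loop then looks for the largest error count whose cumulative
probability stays below 5%.

```python
import math


def poisson_cdf(k, mean):
    """P(X <= k) for X ~ Poisson(mean), summed term by term."""
    term = math.exp(-mean)      # P(X = 0)
    total = term
    for i in range(1, k + 1):
        term *= mean / i        # P(X = i) from P(X = i - 1)
        total += term
    return total


baseline_hours = 218.8          # error-free hours on baseline
baseline_rate = 0.338           # errors/hour on baseline
mean = baseline_hours * baseline_rate

p59 = poisson_cdf(59, mean)
p60 = poisson_cdf(60, mean)

# Largest error count that is still below the 5% level.
k = 0
while poisson_cdf(k + 1, mean) < 0.05:
    k += 1

threshold = k / baseline_hours  # errors/hour needed for 95% confidence
print(p59, p60, k, round(threshold, 3))
```

The direct sum is fine for means of this size; for much larger means
exp(-mean) underflows and a library routine would be the safer choice.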

Thread overview: 23+ messages
2017-12-22 14:51 [patch 0/4] timer/nohz: Fix timer/nohz woes Thomas Gleixner
2017-12-22 14:51 ` [patch 1/4] timer: Use deferrable base independent of base::nohz_active Thomas Gleixner
2017-12-25 16:24   ` Frederic Weisbecker
2017-12-29 22:45   ` [tip:timers/urgent] timers: " tip-bot for Anna-Maria Gleixner
2017-12-22 14:51 ` [patch 2/4] nohz: Prevent erroneous tick stop invocations Thomas Gleixner
2017-12-26 15:17   ` Frederic Weisbecker
2017-12-27 18:22     ` Thomas Gleixner
2017-12-27 18:24       ` Thomas Gleixner
2017-12-27 20:58         ` Thomas Gleixner
2017-12-29 16:12           ` Frederic Weisbecker
2017-12-29 22:46           ` [tip:timers/urgent] nohz: Prevent a timer interrupt storm in tick_nohz_stop_sched_tick() tip-bot for Thomas Gleixner
2017-12-22 14:51 ` [patch 3/4] timer: Invoke timer_start_debug() where it makes sense Thomas Gleixner
2017-12-29 22:47   ` [tip:timers/urgent] timers: " tip-bot for Thomas Gleixner
2017-12-22 14:51 ` [patch 4/4] timerqueue: Document return values of timerqueue_add/del() Thomas Gleixner
2017-12-29 22:47   ` [tip:timers/urgent] " tip-bot for Thomas Gleixner
2017-12-22 17:09 ` [patch 0/4] timer/nohz: Fix timer/nohz woes Paul E. McKenney
2017-12-24  1:21   ` Paul E. McKenney
2017-12-24  1:29     ` Paul E. McKenney
2018-01-05 19:41       ` Paul E. McKenney [this message]
2018-01-06 21:18         ` Thomas Gleixner
2018-01-06 23:21           ` Paul E. McKenney
2017-12-27 20:55     ` Thomas Gleixner
2017-12-29 22:46       ` [tip:timers/urgent] timers: Reinitialize per cpu bases on hotplug tip-bot for Thomas Gleixner
