Re: [PATCH] sched/loadavg: Avoid loadavg spikes caused by delayed NO_HZ accounting

From: Peter Zijlstra <peterz@infradead.org>
To: Matt Fleming <matt@codeblueprint.co.uk>
Cc: Ingo Molnar <mingo@kernel.org>,
	linux-kernel@vger.kernel.org,
	Mike Galbraith <umgwanakikbuti@gmail.com>,
	Morten Rasmussen <morten.rasmussen@arm.com>,
	stable@vger.kernel.org,
	Vincent Guittot <vincent.guittot@linaro.org>
Subject: Re: [PATCH] sched/loadavg: Avoid loadavg spikes caused by delayed NO_HZ accounting
Date: Thu, 9 Feb 2017 16:07:53 +0100
Message-ID: <20170209150753.GF6500@twins.programming.kicks-ass.net>
In-Reply-To: <20170208132924.3038-1-matt@codeblueprint.co.uk>

On Wed, Feb 08, 2017 at 01:29:24PM +0000, Matt Fleming wrote:
> The calculation for the next sample window when exiting NO_HZ idle
> does not handle the fact that we may not have reached the next sample
> window yet, i.e. that we came out of idle between sample windows.
> 
> If we wake from NO_HZ idle after the pending this_rq->calc_load_update
> window time when we went idle but before the next sample window, we
> will add an unnecessary LOAD_FREQ delay to the load average
> accounting, delaying any update for potentially ~9 seconds.
> 
> This can result in huge spikes in the load average values due to
> per-cpu uninterruptible task counts being out of sync when accumulated
> across all CPUs.
> 
> It's safe to update the per-cpu active count if we wake between sample
> windows because any load that we left in 'calc_load_idle' will have
> been zero'd when the idle load was folded in calc_global_load().

Right, so differently put; the problem is that we check against the
'stale' rq->calc_load_update, while the current and effective period
boundary is 'calc_load_update'.

So, when rq->calc_load_update < jiffies < calc_load_update, we end up
setting the next-update to calc_load_update+LOAD_FREQ, where it should
have been calc_load_update.
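
To make that case concrete, here is a small userspace model of the two
behaviours. It is only a sketch, not kernel code: HZ=1000, the jiffies
values and the function names next_update_old()/next_update_new() are
assumptions picked for illustration, and time_before() is a simplified
stand-in for the kernel macro.

```c
#include <assert.h>

#define HZ 1000UL                       /* assumed tick rate */
#define LOAD_FREQ (5 * HZ + 1)          /* 5 second sample period, in jiffies */
#define time_before(a, b) ((long)((a) - (b)) < 0)   /* wrap-safe a < b */

/* Existing logic: only the stale per-rq value is tested before the resync. */
static unsigned long next_update_old(unsigned long now, unsigned long rq_update,
				     unsigned long global_update)
{
	if (time_before(now, rq_update))
		return rq_update;

	rq_update = global_update;
	if (time_before(now, rq_update + 10))
		rq_update += LOAD_FREQ;
	return rq_update;
}

/* With a second check against the refreshed value before skipping a period. */
static unsigned long next_update_new(unsigned long now, unsigned long rq_update,
				     unsigned long global_update)
{
	if (time_before(now, rq_update))
		return rq_update;

	rq_update = global_update;
	if (time_before(now, rq_update))
		return rq_update;

	if (time_before(now, rq_update + 10))
		rq_update += LOAD_FREQ;
	return rq_update;
}

/* The case rq->calc_load_update < jiffies < calc_load_update. */
static void check_between_windows(void)
{
	unsigned long global = 15001, stale = global - LOAD_FREQ, now = 12000;

	/* Existing logic skips a whole period. */
	assert(next_update_old(now, stale, global) == global + LOAD_FREQ);
	/* The extra check lands on the effective boundary. */
	assert(next_update_new(now, stale, global) == global);
}
```

When jiffies is already inside the 10-tick window after calc_load_update,
both variants return calc_load_update + LOAD_FREQ, so only the
between-windows case changes.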

> diff --git a/kernel/sched/loadavg.c b/kernel/sched/loadavg.c
> index a2d6eb71f06b..a7a6f3646970 100644
> --- a/kernel/sched/loadavg.c
> +++ b/kernel/sched/loadavg.c

> @@ -210,10 +211,16 @@ void calc_load_exit_idle(void)
>  	 * We woke inside or after the sample window, this means we're already
>  	 * accounted through the nohz accounting, so skip the entire deal and
>  	 * sync up for the next window.
> +	 *
> +	 * The next window is 'calc_load_update' if we haven't reached it yet,
> +	 * and 'calc_load_update + LOAD_FREQ' if we're inside the current window.
>  	 */
> +	next_window = calc_load_update;
> +
> +	if (time_in_range_open(jiffies, next_window, next_window + 10))
> +		next_window += LOAD_FREQ;
> +
> +	this_rq->calc_load_update = next_window;
>  }

So I don't much like the time_in_range_open() thing. The simpler patch
which you tested to also work was:

diff --git a/kernel/sched/loadavg.c b/kernel/sched/loadavg.c
index 7296b7308eca..cfb47bd0ee50 100644
--- a/kernel/sched/loadavg.c
+++ b/kernel/sched/loadavg.c
@@ -201,6 +201,8 @@ void calc_load_exit_idle(void)
 {
 	struct rq *this_rq = this_rq();
 
+	this_rq->calc_load_update = calc_load_update;
+
 	/*
 	 * If we're still before the sample window, we're done.
 	 */
@@ -212,7 +214,6 @@ void calc_load_exit_idle(void)
 	 * accounted through the nohz accounting, so skip the entire deal and
 	 * sync up for the next window.
 	 */
-	this_rq->calc_load_update = calc_load_update;
 	if (time_before(jiffies, this_rq->calc_load_update + 10))
 		this_rq->calc_load_update += LOAD_FREQ;
 }

But the problem there is that we unconditionally issue that store. Now
I've no idea how much of a problem that is, and it certainly is the
simplest form (+- comments that need updating), so maybe that makes
sense.

Alternatively, something like:

diff --git a/kernel/sched/loadavg.c b/kernel/sched/loadavg.c
index 7296b7308eca..3dd4ce6fe151 100644
--- a/kernel/sched/loadavg.c
+++ b/kernel/sched/loadavg.c
@@ -207,12 +207,15 @@ void calc_load_exit_idle(void)
 	if (time_before(jiffies, this_rq->calc_load_update))
 		return;
 
+	this_rq->calc_load_update = calc_load_update;
+	if (time_before(jiffies, this_rq->calc_load_update))
+		return;
+
 	/*
 	 * We woke inside or after the sample window, this means we're already
 	 * accounted through the nohz accounting, so skip the entire deal and
 	 * sync up for the next window.
 	 */
-	this_rq->calc_load_update = calc_load_update;
 	if (time_before(jiffies, this_rq->calc_load_update + 10))
 		this_rq->calc_load_update += LOAD_FREQ;
 }

might be another solution.

Irrespective of the above though; should we not make this:

+	this_rq->calc_load_update = READ_ONCE(calc_load_update);

because if for some reason we do a double load of calc_load_update and
see two different values, weird stuff could happen.

And because, on general principle, a READ_ONCE() should be paired with a
WRITE_ONCE(), that should be done too I suppose.
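
A userspace sketch of the single-snapshot idea follows. The READ_ONCE()
and WRITE_ONCE() macros here are simplified volatile-cast stand-ins and
not the kernel's definitions (they rely on the GCC/Clang __typeof__
extension); resync_next_update(), HZ=1000 and the jiffies values are
likewise assumptions made for the example.

```c
#include <assert.h>

#define HZ 1000UL                       /* assumed tick rate */
#define LOAD_FREQ (5 * HZ + 1)          /* 5 second sample period, in jiffies */
#define time_before(a, b) ((long)((a) - (b)) < 0)   /* wrap-safe a < b */

/* Simplified stand-ins: force exactly one access through a volatile lvalue. */
#define READ_ONCE(x)       (*(const volatile __typeof__(x) *)&(x))
#define WRITE_ONCE(x, val) (*(volatile __typeof__(x) *)&(x) = (val))

/* Models the global period boundary that another CPU may advance. */
static unsigned long calc_load_update;

/*
 * Take one snapshot of the shared value and use that same snapshot for
 * both the comparison and the result, so the two can never disagree.
 */
static unsigned long resync_next_update(unsigned long now)
{
	unsigned long snap = READ_ONCE(calc_load_update);

	if (time_before(now, snap + 10))
		snap += LOAD_FREQ;
	return snap;
}

static void check_single_snapshot(void)
{
	WRITE_ONCE(calc_load_update, 15001UL);

	/* Inside the 10-tick window: skip to the following period. */
	assert(resync_next_update(15005UL) == 15001UL + LOAD_FREQ);
	/* Past the window: the snapshot itself is returned. */
	assert(resync_next_update(15020UL) == 15001UL);
}
```

Without the local snapshot the compiler would be free to load
calc_load_update once for the test and again for the assignment.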

Thread overview: 8+ messages
2017-02-08 13:29 [PATCH] sched/loadavg: Avoid loadavg spikes caused by delayed NO_HZ accounting Matt Fleming
2017-02-08 16:04 ` kbuild test robot
2017-02-08 16:07 ` kbuild test robot
2017-02-09 15:07 ` Peter Zijlstra [this message]
2017-02-15 15:30   ` Frederic Weisbecker
2017-02-15 15:12 ` Frederic Weisbecker
2017-02-15 16:16   ` Matt Fleming
2017-02-15 16:44     ` Frederic Weisbecker
