public inbox for linux-kernel@vger.kernel.org
 help / color / mirror / Atom feed
From: Mel Gorman <mgorman@techsingularity.net>
To: Dave Jones <davej@codemonkey.org.uk>,
	Linux Kernel <linux-kernel@vger.kernel.org>,
	peterz@infradead.org, mingo@kernel.org,
	Linus Torvalds <torvalds@linux-foundation.org>
Subject: Re: weird loadavg on idle machine post 5.7
Date: Thu, 2 Jul 2020 22:36:27 +0100	[thread overview]
Message-ID: <20200702213627.GF3183@techsingularity.net> (raw)
In-Reply-To: <20200702171548.GA11813@codemonkey.org.uk>

On Thu, Jul 02, 2020 at 01:15:48PM -0400, Dave Jones wrote:
> When I upgraded my firewall to 5.7-rc2 I noticed that on a mostly
> idle machine (that usually sees loadavg hover in the 0.xx range)
> that it was consistently above 1.00 even when there was nothing running.
> All that perf showed was the kernel was spending time in the idle loop
> (and running perf).
> 
> For the first hour or so after boot, everything seems fine, but over
> time loadavg creeps up, and once it's established a new baseline, it
> never seems to ever drop below that again.
> 
> One morning I woke up to find loadavg at '7.xx', after almost as many
> hours of uptime, which makes me wonder if perhaps this is triggered
> by something in cron.  I have a bunch of scripts that fire off
> every hour that involve thousands of shortlived runs of iptables/ipset,
> but running them manually didn't seem to automatically trigger the bug.
> 
> Given it took a few hours of runtime to confirm good/bad, bisecting this
> took the last two weeks. I did it four different times, the first
> producing bogus results from over-eager 'good', but the last two runs
> both implicated this commit:
> 
> commit c6e7bd7afaeb3af55ffac122828035f1c01d1d7b (refs/bisect/bad)
> Author: Peter Zijlstra <peterz@infradead.org>
> Date:   Sun May 24 21:29:55 2020 +0100
> 
>     sched/core: Optimize ttwu() spinning on p->on_cpu
>     
>     Both Rik and Mel reported seeing ttwu() spend significant time on:
>     
>       smp_cond_load_acquire(&p->on_cpu, !VAL);
>     
>     Attempt to avoid this by queueing the wakeup on the CPU that owns the
>     p->on_cpu value. This will then allow the ttwu() to complete without
>     further waiting.
>     
>     Since we run schedule() with interrupts disabled, the IPI is
>     guaranteed to happen after p->on_cpu is cleared, this is what makes it
>     safe to queue early.
>     
>     Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
>     Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
>     Signed-off-by: Ingo Molnar <mingo@kernel.org>

Peter, I'm not supremely confident about this but could it be because
"p->sched_contributes_to_load = !!task_contributes_to_load(p)" potentially
happens while a task is still being dequeued? In the final stages of a
task switch we have

        prev_state = prev->state;
        vtime_task_switch(prev);
        perf_event_task_sched_in(prev, current);
        finish_task(prev);

finish_task is when p->on_cpu is cleared after the state is updated.
With the patch, we potentially update sched_contributes_to_load while
p->state is transient so if the check below is true and ttwu_queue_wakelist
is used then sched_contributes_to_load was based on a transient value
and potentially wrong.

        if (smp_load_acquire(&p->on_cpu) &&
            ttwu_queue_wakelist(p, task_cpu(p), wake_flags | WF_ON_CPU))
                goto unlock;

sched_contributes_to_load determines the value of rq->uninterruptible
which is used in the load value so it's a partial fit. The race would not
happen very often on a normal desktop so it would explain why it takes
a long time for the value to get screwed up and appears to fit.

I'm thinking that the !!task_contributes_to_load(p) should still happen
after smp_cond_load_acquire() when on_cpu is stable and the pi_lock is
held to stabilised p->state against a parallel wakeup or updating the
task rq. I do not see any hazards with respect to smp_rmb and the value
of p->state in this particular path but I've confused myself enough in
the various scheduler and wakeup paths that I don't want to bet money on
it late in the evening

It builds, not booted, it's for discussion but maybe Dave is feeling brave!

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index ca5db40392d4..52c73598b18a 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -2592,9 +2592,6 @@ try_to_wake_up(struct task_struct *p, unsigned int state, int wake_flags)
 	}
 
 #ifdef CONFIG_SMP
-	p->sched_contributes_to_load = !!task_contributes_to_load(p);
-	p->state = TASK_WAKING;
-
 	/*
 	 * Ensure we load p->on_cpu _after_ p->on_rq, otherwise it would be
 	 * possible to, falsely, observe p->on_cpu == 0.
@@ -2650,6 +2647,13 @@ try_to_wake_up(struct task_struct *p, unsigned int state, int wake_flags)
 	 */
 	smp_cond_load_acquire(&p->on_cpu, !VAL);
 
+	/*
+	 * p is off the cpu and pi_lock is held to p->state is stable
+	 * for calculating whether it contributes to load.
+	 */
+	p->sched_contributes_to_load = !!task_contributes_to_load(p);
+	p->state = TASK_WAKING;
+
 	cpu = select_task_rq(p, p->wake_cpu, SD_BALANCE_WAKE, wake_flags);
 	if (task_cpu(p) != cpu) {
 		wake_flags |= WF_MIGRATED;

-- 
Mel Gorman
SUSE Labs

  parent reply	other threads:[~2020-07-02 21:36 UTC|newest]

Thread overview: 20+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2020-07-02 17:15 weird loadavg on idle machine post 5.7 Dave Jones
2020-07-02 19:46 ` Dave Jones
2020-07-02 21:15 ` Paul Gortmaker
2020-07-03 13:23   ` Paul Gortmaker
2020-07-02 21:36 ` Mel Gorman [this message]
2020-07-02 23:11   ` Michal Kubecek
2020-07-02 23:24   ` Dave Jones
2020-07-03  9:02   ` Peter Zijlstra
2020-07-03 10:40     ` Peter Zijlstra
2020-07-03 20:51       ` Dave Jones
2020-07-06 14:59         ` Peter Zijlstra
2020-07-06 21:20           ` Dave Jones
2020-07-07  7:48             ` Peter Zijlstra
2020-07-06 23:56           ` Valentin Schneider
2020-07-07  8:17             ` Peter Zijlstra
2020-07-07 10:20               ` Valentin Schneider
2020-07-07 10:29               ` Peter Zijlstra
2020-07-08  9:46                 ` [tip: sched/urgent] sched: Fix loadavg accounting race tip-bot2 for Peter Zijlstra
2020-07-07  9:20           ` weird loadavg on idle machine post 5.7 Qais Yousef
2020-07-07  9:47             ` Peter Zijlstra

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20200702213627.GF3183@techsingularity.net \
    --to=mgorman@techsingularity.net \
    --cc=davej@codemonkey.org.uk \
    --cc=linux-kernel@vger.kernel.org \
    --cc=mingo@kernel.org \
    --cc=peterz@infradead.org \
    --cc=torvalds@linux-foundation.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox