All of lore.kernel.org
 help / color / mirror / Atom feed
From: Mel Gorman <mgorman@techsingularity.net>
To: Dave Jones <davej@codemonkey.org.uk>,
	Linux Kernel <linux-kernel@vger.kernel.org>,
	peterz@infradead.org, mingo@kernel.org,
	Linus Torvalds <torvalds@linux-foundation.org>
Subject: Re: weird loadavg on idle machine post 5.7
Date: Thu, 2 Jul 2020 22:36:27 +0100	[thread overview]
Message-ID: <20200702213627.GF3183@techsingularity.net> (raw)
In-Reply-To: <20200702171548.GA11813@codemonkey.org.uk>

On Thu, Jul 02, 2020 at 01:15:48PM -0400, Dave Jones wrote:
> When I upgraded my firewall to 5.7-rc2 I noticed that on a mostly
> idle machine (that usually sees loadavg hover in the 0.xx range)
> that it was consistently above 1.00 even when there was nothing running.
> All that perf showed was the kernel was spending time in the idle loop
> (and running perf).
> 
> For the first hour or so after boot, everything seems fine, but over
> time loadavg creeps up, and once it's established a new baseline, it
> never seems to ever drop below that again.
> 
> One morning I woke up to find loadavg at '7.xx', after almost as many
> hours of uptime, which makes me wonder if perhaps this is triggered
> by something in cron.  I have a bunch of scripts that fire off
> every hour that involve thousands of shortlived runs of iptables/ipset,
> but running them manually didn't seem to automatically trigger the bug.
> 
> Given it took a few hours of runtime to confirm good/bad, bisecting this
> took the last two weeks. I did it four different times, the first
> producing bogus results from over-eager 'good', but the last two runs
> both implicated this commit:
> 
> commit c6e7bd7afaeb3af55ffac122828035f1c01d1d7b (refs/bisect/bad)
> Author: Peter Zijlstra <peterz@infradead.org>
> Date:   Sun May 24 21:29:55 2020 +0100
> 
>     sched/core: Optimize ttwu() spinning on p->on_cpu
>     
>     Both Rik and Mel reported seeing ttwu() spend significant time on:
>     
>       smp_cond_load_acquire(&p->on_cpu, !VAL);
>     
>     Attempt to avoid this by queueing the wakeup on the CPU that owns the
>     p->on_cpu value. This will then allow the ttwu() to complete without
>     further waiting.
>     
>     Since we run schedule() with interrupts disabled, the IPI is
>     guaranteed to happen after p->on_cpu is cleared, this is what makes it
>     safe to queue early.
>     
>     Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
>     Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
>     Signed-off-by: Ingo Molnar <mingo@kernel.org>

Peter, I'm not supremely confident about this but could it be because
"p->sched_contributes_to_load = !!task_contributes_to_load(p)" potentially
happens while a task is still being dequeued? In the final stages of a
task switch we have

        prev_state = prev->state;
        vtime_task_switch(prev);
        perf_event_task_sched_in(prev, current);
        finish_task(prev);

finish_task is when p->on_cpu is cleared after the state is updated.
With the patch, we potentially update sched_contributes_to_load while
p->state is transient so if the check below is true and ttwu_queue_wakelist
is used then sched_contributes_to_load was based on a transient value
and potentially wrong.

        if (smp_load_acquire(&p->on_cpu) &&
            ttwu_queue_wakelist(p, task_cpu(p), wake_flags | WF_ON_CPU))
                goto unlock;

sched_contributes_to_load determines the value of rq->uninterruptible
which is used in the load value so it's a partial fit. The race would not
happen very often on a normal desktop so it would explain why it takes
a long time for the value to get screwed up and appears to fit.

I'm thinking that the !!task_contributes_to_load(p) should still happen
after smp_cond_load_acquire() when on_cpu is stable and the pi_lock is
held to stabilised p->state against a parallel wakeup or updating the
task rq. I do not see any hazards with respect to smp_rmb and the value
of p->state in this particular path but I've confused myself enough in
the various scheduler and wakeup paths that I don't want to bet money on
it late in the evening

It builds, not booted, it's for discussion but maybe Dave is feeling brave!

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index ca5db40392d4..52c73598b18a 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -2592,9 +2592,6 @@ try_to_wake_up(struct task_struct *p, unsigned int state, int wake_flags)
 	}
 
 #ifdef CONFIG_SMP
-	p->sched_contributes_to_load = !!task_contributes_to_load(p);
-	p->state = TASK_WAKING;
-
 	/*
 	 * Ensure we load p->on_cpu _after_ p->on_rq, otherwise it would be
 	 * possible to, falsely, observe p->on_cpu == 0.
@@ -2650,6 +2647,13 @@ try_to_wake_up(struct task_struct *p, unsigned int state, int wake_flags)
 	 */
 	smp_cond_load_acquire(&p->on_cpu, !VAL);
 
+	/*
+	 * p is off the cpu and pi_lock is held to p->state is stable
+	 * for calculating whether it contributes to load.
+	 */
+	p->sched_contributes_to_load = !!task_contributes_to_load(p);
+	p->state = TASK_WAKING;
+
 	cpu = select_task_rq(p, p->wake_cpu, SD_BALANCE_WAKE, wake_flags);
 	if (task_cpu(p) != cpu) {
 		wake_flags |= WF_MIGRATED;

-- 
Mel Gorman
SUSE Labs

  parent reply	other threads:[~2020-07-02 21:36 UTC|newest]

Thread overview: 20+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2020-07-02 17:15 weird loadavg on idle machine post 5.7 Dave Jones
2020-07-02 19:46 ` Dave Jones
2020-07-02 21:15 ` Paul Gortmaker
2020-07-03 13:23   ` Paul Gortmaker
2020-07-02 21:36 ` Mel Gorman [this message]
2020-07-02 23:11   ` Michal Kubecek
2020-07-02 23:24   ` Dave Jones
2020-07-03  9:02   ` Peter Zijlstra
2020-07-03 10:40     ` Peter Zijlstra
2020-07-03 20:51       ` Dave Jones
2020-07-06 14:59         ` Peter Zijlstra
2020-07-06 21:20           ` Dave Jones
2020-07-07  7:48             ` Peter Zijlstra
2020-07-06 23:56           ` Valentin Schneider
2020-07-07  8:17             ` Peter Zijlstra
2020-07-07 10:20               ` Valentin Schneider
2020-07-07 10:29               ` Peter Zijlstra
2020-07-08  9:46                 ` [tip: sched/urgent] sched: Fix loadavg accounting race tip-bot2 for Peter Zijlstra
2020-07-07  9:20           ` weird loadavg on idle machine post 5.7 Qais Yousef
2020-07-07  9:47             ` Peter Zijlstra

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20200702213627.GF3183@techsingularity.net \
    --to=mgorman@techsingularity.net \
    --cc=davej@codemonkey.org.uk \
    --cc=linux-kernel@vger.kernel.org \
    --cc=mingo@kernel.org \
    --cc=peterz@infradead.org \
    --cc=torvalds@linux-foundation.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.