All of lore.kernel.org
 help / color / mirror / Atom feed
From: Mel Gorman <mgorman@techsingularity.net>
To: Peter Zijlstra <peterz@infradead.org>, Will Deacon <will@kernel.org>
Cc: Davidlohr Bueso <dave@stgolabs.net>,
	linux-kernel@vger.kernel.org,
	linux-arm-kernel@lists.infradead.org
Subject: Loadavg accounting error on arm64
Date: Mon, 16 Nov 2020 09:10:54 +0000	[thread overview]
Message-ID: <20201116091054.GL3371@techsingularity.net> (raw)

Hi,

I got cc'd internal bug report filed against a 5.8 and 5.9 kernel
that loadavg was "exploding" on arch64 on a machines acting as a build
servers. It happened on at least two different arm64 variants. That setup
is complex to replicate but fortunately can be reproduced by running
hackbench-process-pipes while heavily overcomitting a machine with 96
logical CPUs and then checking if loadavg drops afterwards. With an
MMTests clone, I reproduced it as follows

./run-mmtests.sh --config configs/config-workload-hackbench-process-pipes --no-monitor testrun; \
    for i in `seq 1 60`; do cat /proc/loadavg; sleep 60; done

Load should drop to 10 after about 10 minutes and it does on x86-64 but
remained at around 200+ on arm64.

The reproduction case simply hammers the case where a task can be
descheduling while also being woken by another task at the same time. It
takes a long time to run but it makes the problem very obvious. The
expectation is that after hackbench has been running and saturating the
machine for a long time.

Commit dbfb089d360b ("sched: Fix loadavg accounting race") fixed a loadavg
accounting race in the generic case. Later it was documented why the
ordering of when p->sched_contributes_to_load is read/updated relative
to p->on_cpu.  This is critical when a task is descheduling at the same
time it is being activated on another CPU. While the load/stores happen
under the RQ lock, the RQ lock on its own does not give any guarantees
on the task state.

Over the weekend I convinced myself that it must be because the
implementation of smp_load_acquire and smp_store_release do not appear
to implement acquire/release semantics because I didn't find something
arm64 that was playing with p->state behind the schedulers back (I could
have missed it if it was in an assembly portion as I can't reliablyh read
arm assembler). Similarly, it's not clear why the arm64 implementation
does not call smp_acquire__after_ctrl_dep in the smp_load_acquire
implementation. Even when it was introduced, the arm64 implementation
differed significantly from the arm implementation in terms of what
barriers it used for non-obvious reasons.

Unfortunately, making that work similar to the arch-independent version
did not help but it's not helped that I know nothing about the arm64
memory model.

I'll be looking again today to see can I find a mistake in the ordering for
how sched_contributes_to_load is handled but again, the lack of knowledge
on the arm64 memory model means I'm a bit stuck and a second set of eyes
would be nice :(

-- 
Mel Gorman
SUSE Labs

_______________________________________________
linux-arm-kernel mailing list
linux-arm-kernel@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-arm-kernel

WARNING: multiple messages have this Message-ID (diff)
From: Mel Gorman <mgorman@techsingularity.net>
To: Peter Zijlstra <peterz@infradead.org>, Will Deacon <will@kernel.org>
Cc: Davidlohr Bueso <dave@stgolabs.net>,
	linux-arm-kernel@lists.infradead.org,
	linux-kernel@vger.kernel.org
Subject: Loadavg accounting error on arm64
Date: Mon, 16 Nov 2020 09:10:54 +0000	[thread overview]
Message-ID: <20201116091054.GL3371@techsingularity.net> (raw)

Hi,

I got cc'd internal bug report filed against a 5.8 and 5.9 kernel
that loadavg was "exploding" on arch64 on a machines acting as a build
servers. It happened on at least two different arm64 variants. That setup
is complex to replicate but fortunately can be reproduced by running
hackbench-process-pipes while heavily overcomitting a machine with 96
logical CPUs and then checking if loadavg drops afterwards. With an
MMTests clone, I reproduced it as follows

./run-mmtests.sh --config configs/config-workload-hackbench-process-pipes --no-monitor testrun; \
    for i in `seq 1 60`; do cat /proc/loadavg; sleep 60; done

Load should drop to 10 after about 10 minutes and it does on x86-64 but
remained at around 200+ on arm64.

The reproduction case simply hammers the case where a task can be
descheduling while also being woken by another task at the same time. It
takes a long time to run but it makes the problem very obvious. The
expectation is that after hackbench has been running and saturating the
machine for a long time.

Commit dbfb089d360b ("sched: Fix loadavg accounting race") fixed a loadavg
accounting race in the generic case. Later it was documented why the
ordering of when p->sched_contributes_to_load is read/updated relative
to p->on_cpu.  This is critical when a task is descheduling at the same
time it is being activated on another CPU. While the load/stores happen
under the RQ lock, the RQ lock on its own does not give any guarantees
on the task state.

Over the weekend I convinced myself that it must be because the
implementation of smp_load_acquire and smp_store_release do not appear
to implement acquire/release semantics because I didn't find something
arm64 that was playing with p->state behind the schedulers back (I could
have missed it if it was in an assembly portion as I can't reliablyh read
arm assembler). Similarly, it's not clear why the arm64 implementation
does not call smp_acquire__after_ctrl_dep in the smp_load_acquire
implementation. Even when it was introduced, the arm64 implementation
differed significantly from the arm implementation in terms of what
barriers it used for non-obvious reasons.

Unfortunately, making that work similar to the arch-independent version
did not help but it's not helped that I know nothing about the arm64
memory model.

I'll be looking again today to see can I find a mistake in the ordering for
how sched_contributes_to_load is handled but again, the lack of knowledge
on the arm64 memory model means I'm a bit stuck and a second set of eyes
would be nice :(

-- 
Mel Gorman
SUSE Labs

             reply	other threads:[~2020-11-16  9:11 UTC|newest]

Thread overview: 70+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2020-11-16  9:10 Mel Gorman [this message]
2020-11-16  9:10 ` Loadavg accounting error on arm64 Mel Gorman
2020-11-16 11:49 ` Mel Gorman
2020-11-16 11:49   ` Mel Gorman
2020-11-16 12:00   ` Mel Gorman
2020-11-16 12:00     ` Mel Gorman
2020-11-16 12:53   ` Peter Zijlstra
2020-11-16 12:53     ` Peter Zijlstra
2020-11-16 12:58     ` Peter Zijlstra
2020-11-16 12:58       ` Peter Zijlstra
2020-11-16 15:29       ` Mel Gorman
2020-11-16 15:29         ` Mel Gorman
2020-11-16 16:42         ` Mel Gorman
2020-11-16 16:42           ` Mel Gorman
2020-11-16 16:49         ` Peter Zijlstra
2020-11-16 16:49           ` Peter Zijlstra
2020-11-16 17:24           ` Mel Gorman
2020-11-16 17:24             ` Mel Gorman
2020-11-16 17:41             ` Will Deacon
2020-11-16 17:41               ` Will Deacon
2020-11-16 12:46 ` Peter Zijlstra
2020-11-16 12:46   ` Peter Zijlstra
2020-11-16 12:58   ` Mel Gorman
2020-11-16 12:58     ` Mel Gorman
2020-11-16 13:11 ` Will Deacon
2020-11-16 13:11   ` Will Deacon
2020-11-16 13:37   ` Mel Gorman
2020-11-16 13:37     ` Mel Gorman
2020-11-16 14:20     ` Peter Zijlstra
2020-11-16 14:20       ` Peter Zijlstra
2020-11-16 15:52       ` Mel Gorman
2020-11-16 15:52         ` Mel Gorman
2020-11-16 16:54         ` Peter Zijlstra
2020-11-16 16:54           ` Peter Zijlstra
2020-11-16 17:16           ` Mel Gorman
2020-11-16 17:16             ` Mel Gorman
2020-11-16 19:31       ` Mel Gorman
2020-11-16 19:31         ` Mel Gorman
2020-11-17  8:30         ` [PATCH] sched: Fix data-race in wakeup Peter Zijlstra
2020-11-17  8:30           ` Peter Zijlstra
2020-11-17  9:15           ` Will Deacon
2020-11-17  9:15             ` Will Deacon
2020-11-17  9:29             ` Peter Zijlstra
2020-11-17  9:29               ` Peter Zijlstra
2020-11-17  9:46               ` Peter Zijlstra
2020-11-17  9:46                 ` Peter Zijlstra
2020-11-17 10:36                 ` Will Deacon
2020-11-17 10:36                   ` Will Deacon
2020-11-17 12:52                 ` Valentin Schneider
2020-11-17 12:52                   ` Valentin Schneider
2020-11-17 15:37                   ` Valentin Schneider
2020-11-17 15:37                     ` Valentin Schneider
2020-11-17 16:13                     ` Peter Zijlstra
2020-11-17 16:13                       ` Peter Zijlstra
2020-11-17 19:32                       ` Valentin Schneider
2020-11-17 19:32                         ` Valentin Schneider
2020-11-18  8:05                         ` Peter Zijlstra
2020-11-18  8:05                           ` Peter Zijlstra
2020-11-18  9:51                           ` Valentin Schneider
2020-11-18  9:51                             ` Valentin Schneider
2020-11-18 13:33               ` Marco Elver
2020-11-18 13:33                 ` Marco Elver
2020-11-17  9:38           ` [PATCH] sched: Fix rq->nr_iowait ordering Peter Zijlstra
2020-11-17  9:38             ` Peter Zijlstra
2020-11-17 11:43             ` Mel Gorman
2020-11-17 11:43               ` Mel Gorman
2020-11-19  9:55             ` [tip: sched/urgent] " tip-bot2 for Peter Zijlstra
2020-11-17 12:40           ` [PATCH] sched: Fix data-race in wakeup Mel Gorman
2020-11-17 12:40             ` Mel Gorman
2020-11-19  9:55           ` [tip: sched/urgent] " tip-bot2 for Peter Zijlstra

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20201116091054.GL3371@techsingularity.net \
    --to=mgorman@techsingularity.net \
    --cc=dave@stgolabs.net \
    --cc=linux-arm-kernel@lists.infradead.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=peterz@infradead.org \
    --cc=will@kernel.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.