Re: [sched/fair] 38ac256d1c: stress-ng.vm-segv.ops_per_sec -13.8% regression

From: Valentin Schneider <valentin.schneider@arm.com>
To: lkp@lists.01.org
Subject: Re: [sched/fair] 38ac256d1c: stress-ng.vm-segv.ops_per_sec -13.8% regression
Date: Thu, 06 May 2021 17:11:20 +0100
Message-ID: <871rajkfkn.mognet@arm.com>
In-Reply-To: <87k0omxe6w.mognet@arm.com>

On 28/04/21 23:00, Valentin Schneider wrote:
> As far as I can tell, the culprit is the loss of LBF_SOME_PINNED. By some
> happy accident, the load balancer repeatedly iterates over PCPU kthreads,
> sets LBF_SOME_PINNED and causes a group to be classified as group_imbalanced
> in a later load-balance. This, in turn, forces a 1-task pull, and repeating
> this pattern ~25 times a sec ends up increasing CPU utilization by ~5% over the
> span of the benchmark.

So this is where I got to:

Because pcpu kthreads run periodically, they sometimes get iterated over by the
periodic load-balance and can cause LBF_SOME_PINNED. This can lead to setting

  env->sd->parent->sg->sgc->imbalance

which may cause later load balance attempts to classify the designated group
span as group_imbalanced. Note that this will affect periodic load balance
*and* fork/exec balance.
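
The chain described above can be sketched as a small userspace model. This is only an illustration under stated assumptions, not the kernel code: the struct layouts, the helper names (scan_task, finish_balance, classified_imbalanced) and the numeric value of LBF_SOME_PINNED are hypothetical, and the condition used to propagate the flag to the parent group is a simplification.

```c
#include <assert.h>
#include <stdbool.h>

/* Flag name taken from the text; the value is an assumption of this sketch. */
#define LBF_SOME_PINNED 0x08

/* Hypothetical, stripped-down stand-ins for the scheduler structures. */
struct sched_group_capacity {
	int imbalance;		/* models env->sd->parent->sg->sgc->imbalance */
};

struct lb_env {
	unsigned int flags;
	long imbalance;		/* load the balancer still wants to move */
};

/* Step 1: a task that may not run on the destination CPU marks the pass. */
static void scan_task(struct lb_env *env, bool allowed_on_dst)
{
	if (!allowed_on_dst)
		env->flags |= LBF_SOME_PINNED;
}

/* Step 2: at the end of the pass, the flag is propagated to the parent group. */
static void finish_balance(const struct lb_env *env,
			   struct sched_group_capacity *parent_sgc)
{
	assert(parent_sgc);
	if ((env->flags & LBF_SOME_PINNED) && env->imbalance > 0)
		parent_sgc->imbalance = 1;
}

/* Step 3: a later balance attempt sees the group as imbalanced. */
static bool classified_imbalanced(const struct sched_group_capacity *sgc)
{
	return sgc->imbalance != 0;
}
```

In this model a single pinned task seen while there is still load to move is enough to mark the parent group, which is the "window" that later fork/exec balance can observe.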

On a 2-node system with SMT, MC and NUMA topology levels, this means that
load-balance at MC level will periodically set LBF_SOME_PINNED, opening a
window where any subsequent fork() issued on that node will see
find_idlest_cpu() being biased towards the remote node (find_idlest_cpu()
tries to minimize group_type, and group_imbalanced is the second highest).
In the benchmark's case, the NUMA groups are only ever classified as
group_has_spare, making this a hard bias.
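
To illustrate why this is a hard bias, here is a simplified sketch of "minimize group_type". The enum ordering is meant to mirror the scheduler's group_type ordering (lowest is best); struct group_stats, pick_idlest_group() and the idle-CPU tie-break are hypothetical simplifications for this example, and the real code treats the local group separately.

```c
#include <assert.h>
#include <stddef.h>

/* Ordered from most to least attractive as a destination. */
enum group_type {
	group_has_spare = 0,
	group_fully_busy,
	group_misfit_task,
	group_asym_packing,
	group_imbalanced,	/* second highest */
	group_overloaded,
};

/* Hypothetical per-group summary used only by this sketch. */
struct group_stats {
	int node;
	enum group_type type;
	unsigned int idle_cpus;
};

/*
 * Return the index of the preferred group: the lowest group_type wins,
 * and only groups of equal type are compared on idle CPUs.
 */
static size_t pick_idlest_group(const struct group_stats *groups, size_t nr)
{
	size_t best = 0;

	assert(groups && nr > 0);
	for (size_t i = 1; i < nr; i++) {
		if (groups[i].type < groups[best].type ||
		    (groups[i].type == groups[best].type &&
		     groups[i].idle_cpus > groups[best].idle_cpus))
			best = i;
	}
	return best;
}
```

With the local node flagged group_imbalanced and the remote node at group_has_spare, the remote node wins regardless of how many idle CPUs the local node has; the idle-CPU comparison never gets a say.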

Digging down into find_idlest_cpu(), this periodic bias seems to act as an
override to allow_numa_imbalance(): the benchmark spawns 6 stressors which
AFAICT each spawn a child, so that's at most 12 total runnable tasks. In
this particular case, the 25% domain size threshold of
allow_numa_imbalance() maps to 16, so the condition is pretty much always
true (confirmed via tracing).
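
The arithmetic can be checked with a minimal sketch, assuming the check is a plain comparison of the destination's running tasks against a quarter of the domain weight, and assuming a domain weight of 64 CPUs (which is what makes the threshold "map to 16"). The function below is a userspace stand-in, not the kernel's definition.

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Sketch of a 25% threshold: tolerate a NUMA imbalance while the number of
 * running tasks stays below a quarter of the domain's CPU count.
 */
static bool allow_numa_imbalance(int dst_running, int dst_weight)
{
	assert(dst_weight > 0);
	return dst_running < (dst_weight >> 2);
}
```

With dst_weight = 64 the threshold is 64 >> 2 = 16, so the benchmark's at most 12 runnable tasks always fall below it.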

On this particular machine (dual-socket Xeon Gold 5120 @ 2.20GHz, 64 CPUs)
with this particular benchmark this seems to happen for ~1% of forks, but
causes a performance improvement of between 5% and 13%. I'm not exactly sure
why, but I suspect that the tasks having a very short runtime (avg
6µs) means fork-time balance is the only real opportunity for them to move
to a different NUMA node.

One could argue the benchmark itself gets what it deserves since forking ad
nauseam isn't such a great idea [1], and perhaps it should pin the
stressors to a single NUMA node. I did try to make allow_numa_imbalance()
"smarter", but couldn't find any winning formula. Add to this the fact
that this regression isn't reproducible on a lot of systems (I got either
noise or improvements on all the arm64 systems I tried), and I'm somewhat
stumped TBH.

[1]: Unless you're trying to summon Slaanesh

WARNING: multiple messages have this Message-ID (diff)

From: Valentin Schneider <valentin.schneider@arm.com>
To: Oliver Sang <oliver.sang@intel.com>
Cc: 0day robot <lkp@intel.com>,
	Vincent Guittot <vincent.guittot@linaro.org>,
	Dietmar Eggemann <dietmar.eggemann@arm.com>,
	LKML <linux-kernel@vger.kernel.org>,
	lkp@lists.01.org, ying.huang@intel.com, feng.tang@intel.com,
	zhengjun.xing@intel.com,
	Lingutla Chandrasekhar <clingutla@codeaurora.org>,
	Peter Zijlstra <peterz@infradead.org>,
	Ingo Molnar <mingo@kernel.org>,
	Morten Rasmussen <morten.rasmussen@arm.com>,
	Qais Yousef <qais.yousef@arm.com>,
	Quentin Perret <qperret@google.com>,
	Pavan Kondeti <pkondeti@codeaurora.org>,
	Rik van Riel <riel@surriel.com>,
	aubrey.li@linux.intel.com, yu.c.chen@intel.com,
	Mel Gorman <mgorman@suse.de>
Subject: Re: [sched/fair] 38ac256d1c: stress-ng.vm-segv.ops_per_sec -13.8% regression
Date: Thu, 06 May 2021 17:11:20 +0100
Message-ID: <871rajkfkn.mognet@arm.com>
In-Reply-To: <87k0omxe6w.mognet@arm.com>

Thread overview: 33+ messages
2021-04-07 22:06 [PATCH v5 0/3] sched/fair: load-balance vs capacity margins Valentin Schneider
2021-04-07 22:06 ` [PATCH v5 1/3] sched/fair: Ignore percpu threads for imbalance pulls Valentin Schneider
2021-04-09 11:24   ` [tip: sched/core] " tip-bot2 for Lingutla Chandrasekhar
2021-04-09 12:05   ` tip-bot2 for Lingutla Chandrasekhar
2021-04-09 16:14   ` tip-bot2 for Lingutla Chandrasekhar
2021-04-14  5:21   ` [sched/fair] 38ac256d1c: stress-ng.vm-segv.ops_per_sec -13.8% regression kernel test robot
2021-04-14  5:21     ` kernel test robot
2021-04-14 17:17     ` Valentin Schneider
2021-04-14 17:17       ` Valentin Schneider
2021-04-21  3:20       ` Oliver Sang
2021-04-21  3:20         ` Oliver Sang
2021-04-21 10:27         ` Valentin Schneider
2021-04-21 10:27           ` Valentin Schneider
2021-04-21 12:03           ` Peter Zijlstra
2021-04-21 12:03             ` Peter Zijlstra
2021-04-22  7:47           ` Oliver Sang
2021-04-22  7:47             ` Oliver Sang
2021-04-22  9:55             ` Valentin Schneider
2021-04-22  9:55               ` Valentin Schneider
2021-04-22 20:42               ` Valentin Schneider
2021-04-22 20:42                 ` Valentin Schneider
2021-04-28 22:00                 ` Valentin Schneider
2021-04-28 22:00                   ` Valentin Schneider
2021-05-06 16:11                   ` Valentin Schneider [this message]
2021-05-06 16:11                     ` Valentin Schneider
2021-04-07 22:06 ` [PATCH v5 2/3] sched/fair: Clean up active balance nr_balance_failed trickery Valentin Schneider
2021-04-09 11:24   ` [tip: sched/core] " tip-bot2 for Valentin Schneider
2021-04-09 12:05   ` tip-bot2 for Valentin Schneider
2021-04-09 16:14   ` tip-bot2 for Valentin Schneider
2021-04-07 22:06 ` [PATCH v5 3/3] sched/fair: Introduce a CPU capacity comparison helper Valentin Schneider
2021-04-09 11:24   ` [tip: sched/core] " tip-bot2 for Valentin Schneider
2021-04-09 12:05   ` tip-bot2 for Valentin Schneider
2021-04-09 16:14   ` tip-bot2 for Valentin Schneider
