Re: [RFC PATCH 3/3] sched: ttwu_queue_cond: skip queued wakeups across different l2 caches

public inbox for linux-kernel@vger.kernel.org
 help / color / mirror / Atom feed

From: Julien Desfossez <jdesfossez@digitalocean.com>
To: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Peter Zijlstra <peterz@infradead.org>,
	linux-kernel@vger.kernel.org, Ingo Molnar <mingo@redhat.com>,
	Valentin Schneider <vschneid@redhat.com>,
	Steven Rostedt <rostedt@goodmis.org>,
	Ben Segall <bsegall@google.com>, Mel Gorman <mgorman@suse.de>,
	Daniel Bristot de Oliveira <bristot@redhat.com>,
	Vincent Guittot <vincent.guittot@linaro.org>,
	Juri Lelli <juri.lelli@redhat.com>,
	Swapnil Sapkal <Swapnil.Sapkal@amd.com>,
	Aaron Lu <aaron.lu@intel.com>,
	x86@kernel.org
Subject: Re: [RFC PATCH 3/3] sched: ttwu_queue_cond: skip queued wakeups across different l2 caches
Date: Fri, 18 Aug 2023 15:27:42 -0400	[thread overview]
Message-ID: <20230818192742.GA461596@sinkpad> (raw)
In-Reply-To: <20230818153027.202017-3-mathieu.desnoyers@efficios.com>

On 18-Aug-2023 11:30:27 AM, Mathieu Desnoyers wrote:
> Considering a system like the AMD EPYC 9654 96-Core Processor, the L1
> cache has a latency of 4-5 cycles, the L2 cache has a latency of at
> least 14ns, whereas the L3 cache has a latency of 50ns [1]. Compared to
> this, I measured the RAM accesses to a latency around 120ns on my
> system [2]. So L3 really is only 2.4x faster than RAM accesses.
> Therefore, with this relatively slow access speed compared to L2, the
> scheduler will benefit from only considering CPUs sharing an L2 cache
> for the purpose of using remote runqueue locking rather than queued
> wakeups.
> 
> Skipping queued wakeups for all logical CPUs sharing an LLC means that
> on a 192 cores AMD EPYC 9654 96-Core Processor (over 2 sockets), groups
> of 8 cores (16 hardware threads) end up grabbing runqueue locks of other
> runqueues within the same group for each wakeup, causing contention on
> the runqueue locks.
> 
> Improve this by only considering logical cpus sharing an L2 cache as
> candidates for skipping use of the queued wakeups.
> 
> This results in the following benchmark improvements:
> 
>     hackbench -g 32 -f 20 --threads --pipe -l 480000 -s 100
> 
> from 49s to 34s. (30% speedup)
> 
> And similarly with perf bench:
> 
>     perf bench sched messaging -g 32 -p -t -l 100000
> 
> from 10.9s to 7.4s (32% speedup)
> 
> I have noticed that in order to observe the speedup, the workload needs
> to keep the CPUs sufficiently busy to cause runqueue lock contention,
> but not so busy that they don't go idle. This can be explained by the
> fact that idle CPUs are a preferred target for task wakeup runqueue
> selection, and therefore having idle cpus causes more migrations, which
> triggers more remote wakeups. For both the hackbench and the perf bench
> sched messaging benchmarks, the scale of the workload can be tweaked by
> changing the number groups.
> 
> This was developed as part of the investigation into a weird regression
> reported by AMD where adding a raw spinlock in the scheduler context
> switch accelerated hackbench. It turned out that changing this raw
> spinlock for a loop of 10000x cpu_relax within do_idle() had similar
> benefits.
> 
> This patch achieves a similar effect without busy waiting nor changing
> anything about runqueue selection on wakeup. It considers that only
> hardware threads sharing an L2 cache should skip the queued
> try-to-wakeup and directly grab the target runqueue lock, rather than
> allowing all hardware threads sharing an LLC to do so.
> 
> I would be interested to hear feedback about performance impact of this
> patch (improvement or regression) on other workloads and hardware,
> especially for Intel CPUs. One thing that we might want to empirically
> figure out from the topology is whether there is a maximum number of
> hardware threads within an LLC below which it would make sense to use
> the LLC rather than L2 as group within which queued wakeups can be
> skipped.

I just tested this patchset with the above perf bench sched on Intel Xeon
Platinum 8358 CPU with 2 sockets, 32 cores (64 hardware threads) by socket.

Here are the results by group size (the -g parameter):
group  A_idle   B_idle   A_avg    A_stdev  B_avg    B_stdev  diff_percent
2      51.9429  50.8     3.45583  0.333    3.5144   0.379    -1.69
4      29.0722  29.475   4.0825   0.067    3.9498   0.049    3.250
6      17.9737  18.7158  4.991    0.026    4.5654   0.051    8.527
8      12.8167  13.1667  5.9366   0.046    5.1026   0.049    14.04
10     9.608    7.08095  6.6506   0.038    5.6814   0.064    14.57
12     6.91034  3.1      7.2526   0.048    6.7408   0.170    7.056
14     5.48333  2.76429  7.9436   0.085    8.1394   0.124    -2.46
16     4.475    3.21562  9.184    0.277    9.544    0.302    -3.91
18     4.60811  2.65588  10.45    0.279    10.46    0.243    -.095
20     4.35349  3.40857  11.7954  0.304    10.9482  0.268    7.182
22     4.63778  3.25714  13.0338  0.294    11.7888  0.302    9.552
24     4.49216  3.5      14.2888  0.251    13.1898  0.344    7.691
26     4.67288  3.42308  15.7286  0.231    14.3678  0.495    8.651
28     4.9375   3.68929  17.4738  0.714    15.3134  0.541    12.36
30     4.65538  4.47286  18.678   0.486    18.2136  1.413    2.486
32     4.28904  4.25479  19.6096  0.168    18.4308  0.778    6.011
34     4.49342  4.09146  20.9478  0.184    21.0388  1.349    -.434
36     4.47439  4.04945  22.3406  0.322    23.606   1.060    -5.66

The avg and stdev are based on 5 runs of perf bench sched.

"A" is the baseline captured on 6.1.42, "B" is 6.1.42 + this patchset.
the _idle columns are a rough sampling of the %idle CPU time while perf
is running.

This is really encouraging and it seems to match an issue I have
observed in production for a while where the scheduling latencies seem
too high proportionally to the %idle on large servers. Those servers are
hypervisors basically running a lot of random workloads in VMs. I am
working on a synthetic test case that could help confirm this is the
same issue or not.

Thanks,

Julien

     prev parent reply	other threads:[~2023-08-18 19:28 UTC|newest]

Thread overview: 8+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2023-08-18 15:30 [RFC PATCH 1/3] sched: Rename cpus_share_cache to cpus_share_llc Mathieu Desnoyers
2023-08-18 15:30 ` [RFC PATCH 2/3] sched: Introduce cpus_share_l2c Mathieu Desnoyers
2023-08-18 15:40   ` Mathieu Desnoyers
2023-08-19 12:31   ` [RFC PATCH v2 " Mathieu Desnoyers
2023-08-22  7:58     ` Aaron Lu
2023-08-22 11:12       ` Mathieu Desnoyers
2023-08-18 15:30 ` [RFC PATCH 3/3] sched: ttwu_queue_cond: skip queued wakeups across different l2 caches Mathieu Desnoyers
2023-08-18 19:27   ` Julien Desfossez [this message]

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20230818192742.GA461596@sinkpad \
    --to=jdesfossez@digitalocean.com \
    --cc=Swapnil.Sapkal@amd.com \
    --cc=aaron.lu@intel.com \
    --cc=bristot@redhat.com \
    --cc=bsegall@google.com \
    --cc=juri.lelli@redhat.com \
    --cc=linux-kernel@vger.kernel.org \
    --cc=mathieu.desnoyers@efficios.com \
    --cc=mgorman@suse.de \
    --cc=mingo@redhat.com \
    --cc=peterz@infradead.org \
    --cc=rostedt@goodmis.org \
    --cc=vincent.guittot@linaro.org \
    --cc=vschneid@redhat.com \
    --cc=x86@kernel.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox