Re: SD_LOAD_BALANCE - Valentin Schneider

All of lore.kernel.org
 help / color / mirror / Atom feed

From: Valentin Schneider <valentin.schneider@arm.com>
To: Julia Lawall <julia.lawall@inria.fr>
Cc: Ingo Molnar <mingo@redhat.com>,
	Peter Zijlstra <peterz@infradead.org>,
	Juri Lelli <juri.lelli@redhat.com>,
	Vincent Guittot <vincent.guittot@linaro.org>,
	linux-kernel@vger.kernel.org,
	Gilles Muller <Gilles.Muller@inria.fr>
Subject: Re: SD_LOAD_BALANCE
Date: Thu, 03 Sep 2020 17:59:57 +0100	[thread overview]
Message-ID: <jhj7dtaokxe.mognet@arm.com> (raw)
In-Reply-To: <alpine.DEB.2.22.394.2009031605190.2496@hadrien>


Hi Julia,

On 03/09/20 15:09, Julia Lawall wrote:
> Uses of SD_LOAD_BALANCE were removed in commit e669ac8ab952 (first
> released in v5.8), with the comment:
>
> The SD_LOAD_BALANCE flag is set unconditionally for all domains in
> sd_init().
>
> I have the impression that this was not quite true.  The NUMA domain was
> not initialized with sd_init, and didn't have the SD_LOAD_BALANCE flag
> set.

Did you check the contents of

  /proc/sys/kernel/sched_domain/cpu*/domain*/flags

(requires CONFIG_SCHED_DEBUG)? If the LSB is set, it would mean
SD_LOAD_BALANCE is set.

The sched_domain construction flow isn't the easiest thing to follow, but
NUMA domains *have* to go through sd_init().

What happens is we first go through sched_init_numa(), and there we add
some more topology levels on top of the default ones (or the arch-defined
ones if using an arch-defined topology hierarchy) by using the NUMA
distance table.

We then build the actual domains in sched_init_domains(), and that goes
through a loop that looks like

  for_each_cpu() {
      for_each_sd_topology() {
          build_sched_domain() -> sd_init()
      }
  }

where the SD topology loop is going to iterate over the newly-added
NUMA-specific topology levels. Since that used to unconditionally set
SD_LOAD_BALANCE, NUMA domains really ought to have it.

If that wasn't the case, we would have fired the (now removed) warning in
sched_domain_debug_one() that would do:

       if (!(sd->flags & SD_LOAD_BALANCE)) {
               printk("does not load-balance\n");
               if (sd->parent)
                       printk(KERN_ERR "ERROR: !SD_LOAD_BALANCE domain has parent");
               return -1;
       }

> The effect is that in v5.8, the for_each_domain loop in
> select_task_rq_fair can always end up at the global NUMA domain, and thus
> consider any pair of waking cpu (cpu) and previous cpus of the wakee
> (prev_cpu) as arguments to wake_affine.  Up to v5.7, this was only
> possible if cpu and prev_cpu were together in some lower level domain, ie
> sharing the LLC.  The effect is that in v5.8 wake_affine can often end up
> choosing as a target a core that does not share the LLC with the core
> where the thread ran previously.  Threads then move around a lot between
> the different sockets.
>
> Was this intentional?
>

AFAICT it isn't forbidden for the logic here to peek outside of the
previous LLC. The NUMA reclaim distance thing says we allow affine wakeups
and fork / exec balancing to move a task to a CPU at most RECLAIM_DISTANCE
away (in NUMA distance values). However, I don't remember any patch
changing this between v5.7 and v5.8.

Briefly glancing over the kernel/sched log between v5.7 and v5.8, I don't
see any obvious culprits. Did you try to bisect this? If it indeed ends on
the SD_LOAD_BALANCE thing, well, I'll be off eating my keyboard.

> The effect can be seen in the traces of the parsec vips benchmark at the
> following URL:
>
> https://pages.lip6.fr/Julia.Lawall/vips.pdf
>
> The first two graphs (complete run and small fragment) are Linux v5.7 and
> the next two are Linux v5.8.  The machine has 160 hardware threads
> organized in 4 sockets and the colors are by socket.  In the small
> fragment for v5.7 (second graph), one can see that a given pid pretty much
> stays on the same socket, while in the corresponding fragment for v5.8
> (fourth graph), the pids move around between the sockets.  The x's
> describe the unblocks that result in a migration.  A pink x means that the
> migration is in the same socket, while a blue x means that the migration
> is to another socket. It's not apparent from the graphs, but by adding
> some tracing, it seems that the new socket is always the one of the core
> that handles the wakeup.
>

Interesting graphs, thanks for sharing!

> I haven't yet studied the early part of the execution of vips in detail,
> but I suspect that the same issue causes all of the threads to be
> initially on the same socket in v5.7, while in v5.8 they are more quickly
> dispersed to other sockets.
>
> My impression from the parsec and the NAS benchmarks is that the v5.8
> performance is a bit better than v5.7, probably because of getting more
> threads to more different sockets earlier, but other benchmarks might
> rely more on locality and might react less well to threads moving around
> so much in this way.
>
> julia

next prev parent reply	other threads:[~2020-09-03 17:00 UTC|newest]

Thread overview: 11+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2020-09-03 14:09 SD_LOAD_BALANCE Julia Lawall
2020-09-03 16:59 ` Valentin Schneider [this message]
2020-09-03 17:10   ` SD_LOAD_BALANCE Julia Lawall
2020-09-03 18:17   ` SD_LOAD_BALANCE Julia Lawall
2020-10-10 16:14   ` SD_LOAD_BALANCE Julia Lawall
2020-10-12 10:16     ` SD_LOAD_BALANCE Vincent Guittot
2020-10-12 10:34       ` SD_LOAD_BALANCE Julia Lawall
2020-10-12 11:09         ` SD_LOAD_BALANCE Vincent Guittot
2020-10-12 11:18           ` SD_LOAD_BALANCE Julia Lawall
2020-10-12 11:21     ` SD_LOAD_BALANCE Peter Zijlstra
2020-10-12 11:31       ` SD_LOAD_BALANCE Julia Lawall

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=jhj7dtaokxe.mognet@arm.com \
    --to=valentin.schneider@arm.com \
    --cc=Gilles.Muller@inria.fr \
    --cc=julia.lawall@inria.fr \
    --cc=juri.lelli@redhat.com \
    --cc=linux-kernel@vger.kernel.org \
    --cc=mingo@redhat.com \
    --cc=peterz@infradead.org \
    --cc=vincent.guittot@linaro.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.