All of lore.kernel.org
 help / color / mirror / Atom feed
From: Mel Gorman <mgorman@techsingularity.net>
To: Vincent Guittot <vincent.guittot@linaro.org>
Cc: Mel Gorman <mgorman@suse.de>, Phil Auld <pauld@redhat.com>,
	Peter Puhov <peter.puhov@linaro.org>,
	linux-kernel <linux-kernel@vger.kernel.org>,
	Robert Foley <robert.foley@linaro.org>,
	Ingo Molnar <mingo@redhat.com>,
	Peter Zijlstra <peterz@infradead.org>,
	Juri Lelli <juri.lelli@redhat.com>,
	Dietmar Eggemann <dietmar.eggemann@arm.com>,
	Steven Rostedt <rostedt@goodmis.org>,
	Ben Segall <bsegall@google.com>,
	Jirka Hladky <jhladky@redhat.com>
Subject: Re: [PATCH v1] sched/fair: update_pick_idlest() Select group with lowest group_util when idle_cpus are equal
Date: Wed, 4 Nov 2020 10:47:56 +0000	[thread overview]
Message-ID: <20201104104755.GC3371@techsingularity.net> (raw)
In-Reply-To: <CAKfTPtAjhv8JafvZFR8_UUfDM2MUzVGMPXVBx1zynhPXJ_oh3w@mail.gmail.com>

On Wed, Nov 04, 2020 at 11:06:06AM +0100, Vincent Guittot wrote:
> >
> > Hackbench failed to run because I typo'd the configuration. Kernel build
> > benchmark and git test suite both were inconclusive for 5.10-rc2
> > (neutral results) although the showed 10-20% gain for kernbench and 24%
> > gain in git test suite by reverting in 5.9.
> >
> > The gitsource test was interesting for a few reasons. First, the big
> > difference between 5.9 and 5.10 is that the workload is mostly concentrated
> > on one NUMA node. mpstat shows that 5.10-rc2 uses all of the CPUs on one
> > node lightly. Reverting the patch shows that far fewer CPUs are used at
> > a higher utilisation -- not particularly high utilisation because of the
> > nature of the workload but noticable. i.e.  gitsource with the revert
> > packs the workload onto fewer CPUs. The same holds for fork_test --
> > reverting packs the workload onto fewer CPUs with higher utilisation on
> > each of them. Generally this plays well with cpufreq without schedutil
> > using fewer CPUs means the CPU is likely to reach higher frequencies.
> 
> Which cpufreq governor are you using ?
> 

Uhh, intel_pstate with ondemand .... which is surprising, I would have
expected powersave. I'd have to look closer at what happened there. It
might be a variation of the Kconfig mess selecting the wrong governors when
"yes '' | make oldconfig" is used.

> >
> > While it's possible that some other factor masked the impact of the patch,
> > the fact it's neutral for two workloads in 5.10-rc2 is suspicious as it
> > indicates that if the patch was implemented against 5.10-rc2, it would
> > likely not have been merged. I've queued the tests on the remaining
> > machines to see if something more conclusive falls out.
> 
> I don't think that the goal of the patch is stressed by those benchmarks.
> I typically try to optimize the sequence:
> 1-fork a lot of threads that immediately wait
> 2-wake up all threads simultaneously to run in parallel
> 3-wait the end of all threads
> 

Out of curiousity, have you a stock benchmark that does this with some
associated metric?  sysbench-threads wouldn't do it. While I know of at
least one benchmark that *does* exhibit this pattern, it's a Real Workload
that cannot be shared (so I can't discuss it) and it's *complex* with a
minimal kernel footprint so analysing it is non-trivial.

I could develop one on my own but if you had one already, I'd wire it into
mmtests and add it to the stock collection of scheduler loads. schbench
*might* match what you're talking about but I'd rather not guess.
schbench is also more of a latency wakeup benchmark than it is a throughput
one. Latency ones tend to be more important but optimising purely for
wakeup-latency also tends to kick other workloads into a hole.

> Without the patch all newly forked threads were packed on few CPUs
> which were already idle when the next fork happened. Then the spreads
> were spread on CPUs at wakeup in the LLC but they have to wait for a
> LB to fill other sched domain
> 

Which is fair enough but it's a tradeoff because there are plenty of
workloads that fork/exec and do something immediately and this is not
the first time we've had to tradeoff between workloads.

The other aspect I find interesting is that we get slightly burned by
the initial fork path because of this thing;

                        /*
                         * Otherwise, keep the task on this node to stay close
                         * its wakeup source and improve locality. If there is
                         * a real need of migration, periodic load balance will
                         * take care of it.
                         */
                        if (local_sgs.idle_cpus)
                                return NULL;

For a workload that creates a lot of new threads that go idle and then
wakeup (think worker pool threads that receive requests at unpredictable
times), it packs one node too tightly when the threads wakeup -- it's
also visible from page fault microbenchmarks that scale the number of
threads. It's a vaguely similar class of problem but the patches are
taking very different approaches.

It'd been in my mind to consider reconciling that chunk with the
adjust_numa_imbalance but had not gotten around to seeing how it should
be reconciled without introducing another regression.

The longer I work on the scheduler, the more I feel it's like juggling
while someone is firing arrows at you :D .

-- 
Mel Gorman
SUSE Labs

  reply	other threads:[~2020-11-04 10:48 UTC|newest]

Thread overview: 20+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2020-07-14 12:59 [PATCH v1] sched/fair: update_pick_idlest() Select group with lowest group_util when idle_cpus are equal peter.puhov
2020-07-22  9:12 ` [tip: sched/core] " tip-bot2 for Peter Puhov
2020-11-02 10:50 ` [PATCH v1] " Mel Gorman
2020-11-02 11:06   ` Vincent Guittot
2020-11-02 14:44     ` Phil Auld
2020-11-02 16:52       ` Mel Gorman
2020-11-04  9:42       ` Mel Gorman
2020-11-04 10:06         ` Vincent Guittot
2020-11-04 10:47           ` Mel Gorman [this message]
2020-11-04 11:34             ` Vincent Guittot
2020-11-06 12:03         ` Mel Gorman
2020-11-06 13:33           ` Vincent Guittot
2020-11-06 16:00             ` Mel Gorman
2020-11-06 16:06               ` Vincent Guittot
2020-11-06 17:02                 ` Mel Gorman
2020-11-09 15:24               ` Phil Auld
2020-11-09 15:38                 ` Mel Gorman
2020-11-09 15:47                   ` Phil Auld
2020-11-09 15:49                   ` Vincent Guittot
2020-11-10 14:05                     ` Mel Gorman

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20201104104755.GC3371@techsingularity.net \
    --to=mgorman@techsingularity.net \
    --cc=bsegall@google.com \
    --cc=dietmar.eggemann@arm.com \
    --cc=jhladky@redhat.com \
    --cc=juri.lelli@redhat.com \
    --cc=linux-kernel@vger.kernel.org \
    --cc=mgorman@suse.de \
    --cc=mingo@redhat.com \
    --cc=pauld@redhat.com \
    --cc=peter.puhov@linaro.org \
    --cc=peterz@infradead.org \
    --cc=robert.foley@linaro.org \
    --cc=rostedt@goodmis.org \
    --cc=vincent.guittot@linaro.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.