public inbox for linux-kernel@vger.kernel.org
 help / color / mirror / Atom feed
From: Mel Gorman <mgorman@techsingularity.net>
To: Vincent Guittot <vincent.guittot@linaro.org>
Cc: Peter Zijlstra <peterz@infradead.org>,
	Ingo Molnar <mingo@redhat.com>,
	Juri Lelli <juri.lelli@redhat.com>,
	Valentin Schneider <valentin.schneider@arm.com>,
	LKML <linux-kernel@vger.kernel.org>
Subject: Re: [PATCH v2] sched/fair: Age the average idle time
Date: Thu, 17 Jun 2021 16:47:59 +0100	[thread overview]
Message-ID: <20210617154759.GR30378@techsingularity.net> (raw)
In-Reply-To: <CAKfTPtBJkpSMFFGwgdFLyO5aSnGuzQSPrtpwOFckMQa4xaex=Q@mail.gmail.com>

On Thu, Jun 17, 2021 at 05:03:54PM +0200, Vincent Guittot wrote:
> On Thu, 17 Jun 2021 at 13:05, Mel Gorman <mgorman@techsingularity.net> wrote:
> >
> > On Thu, Jun 17, 2021 at 12:02:56PM +0200, Vincent Guittot wrote:
> > > > > >
> > > > > > Fundamentally though, as the changelog notes "due to the nature of the
> > > > > > patch, this is a regression magnet". There are going to be examples
> > > > > > where a deep search is better even if a machine is fully busy or
> > > > > > overloaded and examples where cutting off the search is better. I think
> > > > > > it's better to have an idle estimate that gets updated if CPUs are fully
> > > > > > busy even if it's not a universal win.
> > > > >
> > > > > Although I agree that using a stall average idle time value of local
> > > > > is not good, I'm not sure this proposal is better. The main problem is
> > > > > that we use the avg_idle of the local CPU to estimate how many times
> > > > > we should loop and try to find another idle CPU. But there is no
> > > > > direct relation between both.
> > > >
> > > > This is true. The idle time of the local CPU is used to estimate the
> > > > idle time of the domain which is inevitably going to be inaccurate but
> > >
> > > I'm more and more convinced that using average idle time  (of the
> > > local cpu or the full domain) is not the right metric. In
> > > select_idle_cpu(), we looks for an idle CPU but we don't care about
> > > how long it will be idle.
> >
> > Can we predict that accurately? cpufreq for intel_pstate used to try
> > something like that but it was a bit fuzzy and I don't know if the
> > scheduler could do much better. There is some idle prediction stuff but
> > it's related to nohz which does not really help us if a machine is nearly
> > fully busy or overloaded.
> >
> > I guess for tracking idle that revisiting
> > https://lore.kernel.org/lkml/1615872606-56087-1-git-send-email-aubrey.li@intel.com/
> > is an option now that the scan is somewhat unified. A two-pass scan
> > could be used to check potentially idle CPUs first and if there is
> > sufficient search depth left, scan other CPUs. There were some questions
> 
> I think it's the other way around:
> a CPU is busy for sure if it is not set in the cpuidle_mask and we
> don't need to check it. But a cpu might not be idle even if it is set
> in the idle mask might because it's cleared during the tick
> 

Tick is a long time so scan depth may still be a problem.

> > Selecting based on avg idle time could be interesting but hazardous. If
> > for example, we prioritised selecting a CPU that is mostly idle, it'll
> > also pick CPUs that are potentially in a deep idle state incurring a
> > larger wakeup cost. Right now we are not much better because we just
> > select an idle CPU and hope for the best but always targetting the most
> > idle CPU could have problems. There would also be the cost of tracking
> > idle CPUs in priority order. It would eliminate the scan depth cost
> > calculations but the overall cost would be much worse.
> >
> > Hence, I still think we can improve the scan depth costs in the short
> > term until a replacement is identified that works reasonably well.
> >
> > > Even more, we can scan all CPUs whatever the
> > > avg idle time if there is a chance that there is an idle core.
> > >
> >
> > That is an important, but separate topic. It's known that the idle core
> > detection can yield false positives. Putting core scanning under SIS_PROP
> > had mixed results when we last tried but things change. Again, it doesn't
> > help with scan depth calculations.
> 
> my point was mainly to highlight that the path can take opposite
> decision for the same avg_idle value:
> - scan all cpus if has_idle_core is true whatever avg_idle
> - limit the depth if has_idle_core is false and avg_idle is short
> 

I do understand the point but the idle core scan anomaly was not
intended to be addressed in the patch because putting the idle scan
under SIS_PROP potentially means using cpus with active idle siblings
prematurely.

> >
> > > > tracking idle time for the domain will be cache write intensive and
> > > > potentially very expensive. I think this was discussed before but maybe
> > > > it is my imaginaction.
> > > >
> > > > > Typically, a short average idle time on
> > > > > the local CPU doesn't mean that there are less idle CPUs and that's
> > > > > why we have a mix a gain and loss
> > > > >
> > > >
> > > > Can you evaluate if scanning proportional to cores helps if applied on
> > > > top? The patch below is a bit of pick&mix and has only seen a basic build
> > >
> > > I will queue it for some test later today
> > >
> >
> > Thanks. The proposed patch since passed a build and boot test,
> > performance evaluation is under way but as it's x86 and SMT2, I'm mostly
> > just checking that it's neutral.
> 
> Results stay similar:
> group  tip/sched/core      + this patch             + latest addon
> 1      13.358(+/- 1.82%)   12.850(+/- 2.21%) +4%    13.411(+/- 2.47%) -0%
> 4      4.286(+/- 2.77%)    4.114(+/- 2.25%)  +4%    4.163(+/- 1.88%)  +3%
> 16     3.175(+/- 0.55%)    3.559(+/- 0.43%)  -12%   3.535(+/- 0.52%)  -11%
> 32     2.912(+/- 0.79%)    3.165(+/- 0.95%)  -8%    3.153(+/- 0.76%)  -10%
> 64     2.859(+/- 1.12%)    2.937(+/- 0.91%)  -3%    2.919(+/- 0.73%)  -2%
> 128    3.092(+/- 4.75%)    3.003(+/-5.18%)   +3%    2.973(+/- 0.90%)  +4%
> 256    3.233(+/- 3.03%)    2.973(+/- 0.80%)  +8%    3.036(+/- 1.05%)  +6%
> 

Ok, accounting for SMT4 didn't help.

-- 
Mel Gorman
SUSE Labs

  reply	other threads:[~2021-06-17 15:48 UTC|newest]

Thread overview: 18+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2021-06-15 11:16 [PATCH v2] sched/fair: Age the average idle time Mel Gorman
2021-06-15 20:42 ` Peter Zijlstra
2021-06-15 21:13   ` Phil Auld
2021-06-16  9:03   ` Mel Gorman
2021-06-17 15:01     ` Phil Auld
2021-06-17 15:40       ` Mel Gorman
2021-06-18 19:22         ` Phil Auld
2021-06-16 15:52   ` Vincent Guittot
2021-06-17  7:44     ` Mel Gorman
2021-06-17  8:30       ` Vincent Guittot
2021-06-17  9:40         ` Mel Gorman
2021-06-17 10:02           ` Vincent Guittot
2021-06-17 11:05             ` Mel Gorman
2021-06-17 15:03               ` Vincent Guittot
2021-06-17 15:47                 ` Mel Gorman [this message]
2021-06-17 16:05                   ` Vincent Guittot
2021-06-18  8:46 ` [tip: sched/core] " tip-bot2 for Peter Zijlstra
2021-06-23 14:46 ` [sched/fair] 5359f5ca0f: phoronix-test-suite.stress-ng.SystemVMessagePassing.bogo_ops_s 77.9% improvement kernel test robot

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20210617154759.GR30378@techsingularity.net \
    --to=mgorman@techsingularity.net \
    --cc=juri.lelli@redhat.com \
    --cc=linux-kernel@vger.kernel.org \
    --cc=mingo@redhat.com \
    --cc=peterz@infradead.org \
    --cc=valentin.schneider@arm.com \
    --cc=vincent.guittot@linaro.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox