RE: [PATCH 1/2] x86,sched: Add support for frequency invariance

All of lore.kernel.org
 help / color / mirror / Atom feed

From: "Doug Smythies" <dsmythies@telus.net>
To: "'Giovanni Gherdovich'" <ggherdovich@suse.cz>
Cc: <x86@kernel.org>, <linux-pm@vger.kernel.org>,
	<linux-kernel@vger.kernel.org>, <mgorman@techsingularity.net>,
	<matt@codeblueprint.co.uk>, <viresh.kumar@linaro.org>,
	<juri.lelli@redhat.com>, <pjt@google.com>,
	<vincent.guittot@linaro.org>, <qperret@qperret.net>,
	<dietmar.eggemann@arm.com>, <srinivas.pandruvada@linux.intel.com>,
	<tglx@linutronix.de>, <mingo@redhat.com>, <peterz@infradead.org>,
	<bp@suse.de>, <lenb@kernel.org>, <rjw@rjwysocki.net>
Subject: RE: [PATCH 1/2] x86,sched: Add support for frequency invariance
Date: Thu, 19 Sep 2019 07:42:29 -0700	[thread overview]
Message-ID: <001a01d56ef8$7abb07c0$70311740$@net> (raw)
In-Reply-To: <1568730313.3329.1.camel@suse.cz>

Hi Giovanni,

Thank you for your detailed reply.

On 2019.09.17 07:25 Giovanni Gherdovich wrote:
>On Wed, 2019-09-11 at 08:28 -0700, Doug Smythies wrote:
> [...]

>> The problem with the test is its run to run variability, which was from
>> all the disk I/O, as far as I could determine. At the time,
>> I studied this to death [2], and made a more repeatable test, without
>> any disk I/O.
>> 
>> While the challenges with this work flow have tended to be focused
>> on the CPU frequency scaling driver, I have always considered
>> the root issue here to be a scheduling issue. Excerpt from my notes
>> [2]:
>> 
>>> The issue is that performance is much much better if the system is
>>> forced to use only 1 CPU rather than relying on the defaults where
>>> the CPU scheduler decides what to do.
>>> The scheduler seems to not realize that the current CPU has just
>>> become free, and assigns the new task to a new CPU. Thus the load
>>> on any one CPU is so low that it doesn't ramp up the CPU frequency.
>>> It would be better if somehow the scheduler knew that the current
>>> active CPU was now able to take on the new task, overall resulting
>>> on one fully loaded CPU at the highest CPU frequency.
>> 
>> I do not know if such is practical, and I didn't re-visit the issue.
>>
>
> You're absolutely right, pinning a serialized, fork-intensive workload such as
> gitsource gives you as good of a performance as you can get, because it removes
> the scheduler out of the picture.
>
> So one might be tempted to flag this test as non-representative of a
> real-world scenario;

Disagree. I consider this test to be very representative of real-world
scenarios. However, and I do not know for certain, the relatively high
average fork rate of the gitsource "make test" is less common.

> the reasons we keep looking at it are:
> 1. pinning may not always practical, as you mention
> 2. it's an adversary, worst-case sort of test for some scheduler code paths

Agree.

>> For reference against which all other results are compared
>> is the forced CPU affinity test run. i.e.:
>> 
>> taskset -c 3 test_script.
>> 
>> Mode          Governor                degradation     Power           Bzy_MHz
>> Reference     perf 1 CPU              1.00            reference       3798
>> -             performance             1.2             6% worse        3618
>> passive       ondemand                2.3
>> active        powersave               2.6
>> passive       schedutil               2.7                             1600
>> passive       schedutil-4C            1.68                            2515
>> 
>> Where degradation ratio is the time to execute / the reference time for
>> the same conditions. The test runs over a wide range of processes per
>> second, and the worst ratio has been selected for the above table.
>> I have yet to write up this experiment, but the graphs that will
>> eventually be used are at [4] and [5] (same data presented two
>> different ways).
>
> Your table is interesting; I'd say that the one to beat there (from the
> schedutil point of view) is intel_pstate(active)/performance. I'm slightly
> surprised that intel_pstate(passive)/ondemand is worse than
> intel_pstate(active)/powersave, I'd have guessed the other way around but it's
> also true that the latter lost some grip on iowait_boost in of the recent
> dev cycles.

??
intel_pstate(passive)/ondemand is better than intel_pstate(active)/powersave,
not worse, over the entire range of PIDs (forks) per second and by quite a lot.

>> I did the "make test" method and, presenting the numbers your way,
>> got that 4C took 0.69 times as long as the unpatched schedutil.
>> Your numbers were same or better (copied below, lower is better):
>> 80x-BROADWELL-NUMA:   0.49
>> 8x-SKYLAKE-UMA:               0.55
>> 48x-HASWELL-NUMA:             0.69

> I think your 0.69 and my three values tell the same story: schedutil really
> needs to use the frequency invariant formula otherwise it's out of the
> race. Enabling scale-invariance gives multple tens of percent point in
> advantage.

Agreed. This frequency invariant addition is great. However, if
schedutil is "out of the race" without it, as you say, then isn't
intel_pstate(passive)/ondemand out of the race also? It performs
just as poorly for this test, until very low PIDs per second.

> Now, is it 0.69 or 0.49? There are many factors to it; that's why I'm happy I
> can test on multiple machines and get a somehow more varied picture.
>
> Also, didn't you mention you made several runs and selected the worst one for
> the final score? I was less adventurous and took the average of 5 runs for my
> gitsource executions :) that might contribute to a slightly higher final mark.

No, I did the exact same as you for the gitsource "make test" method, except
that I do 6 runs and throw out the first one and average the next 5.

Yes, I said I picked the worse ratio, but that was for my version of this test,
with the disk I/O and its related non-repeatability eliminated, only to provide
something for readers that did not want to go to my web site to look at the
related graph [1]. I'll send you the graph in a separate e-mail, in case you didn't
go to the web site.

>>>> 
>>>> Compare it to the update formula of intel_pstate/powersave:
>>> 
>>>    freq_next = 1.25 * freq_max * Busy%
>>> 
>>> where again freq_max is 1C turbo and Busy% is the percentage of time not spent
>>> idling (calculated with delta_MPERF / delta_TSC);
>> 
>> Note that the delta_MPERF / delta_TSC method includes idle state 0 and the old
>> method of utilization does not (at least not last time I investigated, which was
>> awhile ago (and I can not find my notes)).
>
> I think that depends on whether or not TSC stops at idle. As understand from
> the Intel Software Developer manual (SDM) a TSC that stops at idle is called
> "invariant TSC", and makes delta_MPERF / delta_TSC interesting. Otherwise the
> two counters behaves exactly the same and the ratio is always 1, modulo the
> delays in actually reading the two values. But all I know comes from
> turbostat's man page and the SDM, so don't quote me on that :)

I was only talking about idle state 0 (polling), where TSC does not stop.

By the way, I have now done some tests with this patch set and multi-threaded
stuff. Nothing to report, it all looks great.

[1] http://www.smythies.com/~doug/linux/single-threaded/gg-pidps2.png

... Doug

next prev parent reply	other threads:[~2019-09-19 14:42 UTC|newest]

Thread overview: 27+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2019-09-09  2:42 [PATCH 0/2] Add support for frequency invariance for (some) x86 Giovanni Gherdovich
2019-09-09  2:42 ` [PATCH 1/2] x86,sched: Add support for frequency invariance Giovanni Gherdovich
2019-09-11 15:28   ` Doug Smythies
2019-09-13 20:58     ` Doug Smythies
2019-09-17 14:25       ` Giovanni Gherdovich
2019-09-19 14:42         ` Doug Smythies [this message]
2019-09-24  8:06           ` Mel Gorman
2019-09-24 17:52             ` Doug Smythies
2019-09-13 22:52   ` Srinivas Pandruvada
2019-09-17 14:27     ` Giovanni Gherdovich
2019-09-17 15:55       ` Vincent Guittot
2019-09-19 23:55       ` Srinivas Pandruvada
2019-09-14 10:57   ` Quentin Perret
2019-09-17 14:27     ` Giovanni Gherdovich
2019-09-17 14:39       ` Quentin Perret
2019-09-24 14:03       ` Peter Zijlstra
2019-09-24 16:00         ` Peter Zijlstra
2019-10-02 12:27           ` Giovanni Gherdovich
2019-10-02 18:45             ` Peter Zijlstra
2019-09-24 16:04   ` Peter Zijlstra
2019-10-02 12:26     ` Giovanni Gherdovich
2019-10-02 18:35       ` Peter Zijlstra
2019-09-24 16:30   ` Peter Zijlstra
2019-10-02 12:25     ` Giovanni Gherdovich
2019-10-02 18:47       ` Peter Zijlstra
2019-09-09  2:42 ` [PATCH 2/2] cpufreq: intel_pstate: Conditional frequency invariant accounting Giovanni Gherdovich
2019-09-24 16:01 ` [PATCH 0/2] Add support for frequency invariance for (some) x86 Peter Zijlstra

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to='001a01d56ef8$7abb07c0$70311740$@net' \
    --to=dsmythies@telus.net \
    --cc=bp@suse.de \
    --cc=dietmar.eggemann@arm.com \
    --cc=ggherdovich@suse.cz \
    --cc=juri.lelli@redhat.com \
    --cc=lenb@kernel.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-pm@vger.kernel.org \
    --cc=matt@codeblueprint.co.uk \
    --cc=mgorman@techsingularity.net \
    --cc=mingo@redhat.com \
    --cc=peterz@infradead.org \
    --cc=pjt@google.com \
    --cc=qperret@qperret.net \
    --cc=rjw@rjwysocki.net \
    --cc=srinivas.pandruvada@linux.intel.com \
    --cc=tglx@linutronix.de \
    --cc=vincent.guittot@linaro.org \
    --cc=viresh.kumar@linaro.org \
    --cc=x86@kernel.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.