From: Morten Rasmussen <morten.rasmussen@arm.com>
To: "Rafael J. Wysocki" <rjw@rjwysocki.net>
Cc: Peter Zijlstra <peterz@infradead.org>,
"linux-kernel@vger.kernel.org" <linux-kernel@vger.kernel.org>,
"linux-pm@vger.kernel.org" <linux-pm@vger.kernel.org>,
"mingo@kernel.org" <mingo@kernel.org>,
"vincent.guittot@linaro.org" <vincent.guittot@linaro.org>,
"daniel.lezcano@linaro.org" <daniel.lezcano@linaro.org>,
"preeti@linux.vnet.ibm.com" <preeti@linux.vnet.ibm.com>,
Dietmar Eggemann <Dietmar.Eggemann@arm.com>,
"pjt@google.com" <pjt@google.com>
Subject: Re: [RFCv2 PATCH 01/23] sched: Documentation for scheduler energy cost model
Date: Thu, 24 Jul 2014 18:57:10 +0100
Message-ID: <20140724175709.GA11501@e103034-lin>
In-Reply-To: <14224056.9QdYi5f1q1@vostro.rjw.lan>
On Thu, Jul 24, 2014 at 03:28:27PM +0100, Rafael J. Wysocki wrote:
> On Thursday, July 24, 2014 09:26:09 AM Peter Zijlstra wrote:
> > On Thu, Jul 24, 2014 at 02:53:20AM +0200, Rafael J. Wysocki wrote:
> > > I am used to slightly different terminology here. Namely, there are voltage
> > > domains (parts sharing a voltage rail or a voltage regulator, such that you
> > > can only apply/remove/change voltage to all of them at the same time) and clock
> > > domains (analogously, but for clocks). A power domain (which in your description
> > > above seems to correspond to a voltage domain) may be a voltage domain, a clock
> > > domain or a combination thereof.
Your terminology is closer to how the hardware actually operates, agreed. I
was hoping to keep things a bit simpler if we can get away with it. In
the simplified view a frequency domain is the combination of voltage and
clock domain (using your terminology). Since clock and voltage usually
scale together (DVFS) the assumption is that those domains are
equivalent. Thus, the frequency domain defines the subset of cpus that
scale P-state together.
A power domain (in my terminology) defines a subset of cpus that share
C-states (do-nothing-states at reduced power consumption). The actual
technique applied for the C-state implementation is not considered. It
may be anything from just clock gating up to and including completely
powering the domain off. So it isn't necessarily equivalent to the clock
or voltage domain. For example, on ARM it is quite typical to have clock
gating per cpu and sometimes also per-core power gating, while the
clock/voltage domain covers multiple cpus. It is worth noting that power
gating is often hierarchical, meaning that you can power gate larger
subsets of cpus in one go to save more power, since you can then power
down (some) shared resources as well. I think that is equivalent to
package C-states in Intel terminology.
> > > In addition to that, in a voltage domain it may be possible to apply many
> > > different levels of voltage, which case doesn't seem to be covered at all by
> > > the above (or I'm missing something).
I don't include it explicitly, but it is factored into the capacity
state data (which is really frequency states on SMP, but that is another
story). Each capacity state is represented by a compute capacity
(proportional to frequency on SMP) and the associated power consumption.
The energy-efficiency (work/energy) for the capacity state is basically
the ratio of the two. Hence the voltage is included in the power figure
associated with the P-state. It is assumed that you don't scale voltage
without scaling frequency. I hope that is a valid assumption for Intel
systems as well?
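
To make the capacity state description above concrete, here is a minimal
sketch. The names (CapacityState, efficiency) and all numbers are made up
for illustration; they are not taken from the patch set or from any real
platform.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CapacityState:
    """One capacity state: compute capacity plus the power drawn in it."""
    capacity: int  # compute capacity, proportional to frequency on SMP
    power: int     # busy power in this state; voltage is folded in here


def efficiency(state: CapacityState) -> float:
    """Energy-efficiency (work/energy) as the ratio of capacity to power."""
    return state.capacity / state.power


# Hypothetical table for one frequency domain (illustrative values only).
states = [
    CapacityState(capacity=250, power=100),
    CapacityState(capacity=500, power=250),
    CapacityState(capacity=1000, power=800),
]

for s in states:
    print(s.capacity, s.power, efficiency(s))
```

With numbers like these the efficiency drops as the capacity goes up,
which is the usual shape when voltage has to rise with frequency.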
> > > Also a P-state is not just a frequency level, but a combination of frequency
> > > and voltage that has to be applied for that frequency to be stable. You may
> > > regard them as Operation Performance Points of the CPU, but that very well may
> > > go beyond frequencies and voltages. Thus it actually is better not to talk
> > > about P-states as "frequencies".
Agreed. In my world voltage and frequency are always linked, so I might
have been a bit sloppy in my definitions. I will fix that to use P-state
instead.
Capacity states are equal to P-states on SMP but not for big.LITTLE as
we also have to factor in performance differences between different
micro-architectures. Any objections to that? It is in line with the
recent renaming of cpu_power to cpu_capacity in fair.c.
> > > Now, P-states may or may not have to be coordinated between all CPUs in a
> > > package (cluster), by hardware or software, such that all CPUs in a cluster
> > > need to be kept in the same P-state. That you can regard as a "P-state
> > > domain", but it usually means a specific combination of voltage and frequency.
> >
> > I think Morton is aware of this, but for the sake of sanity dropped the
> > whole lot into something simpler (while hoping reality would not ruin
> > his life).
Spot on :-) (except for the spelling of my name ;-))
> >
> > > C-states in turn are states in which CPUs don't execute instructions.
> > > That need not mean the removal of voltage or even frequency from them.
> > > Of course, they do mean some sort of power draw reduction, but that may
> > > be achieved in many different ways. Some C-states require coordination
> > > too (for example, a single C-state may apply to a whole package or cluster
> > > at the same time) and you can think about "domains" here too, but there
> > > need not be a direct mapping to physical parameters such as the frequency
> > > or the voltage.
That is "power domains" in my simplified terminology as described above.
> > One thing that wasn't clear to me is if you allow for C-domain and
> > P-domain to overlap or if they're always inclusive (where one is wholly
> > contained in the other).
>
> On the CPUs I worked with so far they were always inclusive. Previously, the
> whole package was a P-state domain. Today some CPUs (Haswell server chips
> for example) have per-core P-states.
I don't know of any design where they overlap. My assumption is that it
won't happen ;-)
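
The "inclusive, never partially overlapping" assumption can be written
down as a simple check on cpu sets. This is only a sketch: the function
names and the 4-cpu topology are hypothetical and do not describe any
particular platform.

```python
def is_nested_or_disjoint(a: frozenset, b: frozenset) -> bool:
    """True if the two cpu sets are disjoint or one wholly contains the other."""
    return a.isdisjoint(b) or a <= b or b <= a


def domains_are_inclusive(p_domains, c_domains) -> bool:
    """Check every P-state domain against every C-state (power) domain."""
    return all(is_nested_or_disjoint(p, c)
               for p in p_domains for c in c_domains)


# Hypothetical system: two frequency domains of two cpus each, with
# per-cpu and per-cluster power domains.
p_domains = [frozenset({0, 1}), frozenset({2, 3})]
c_domains = [frozenset({0}), frozenset({1}), frozenset({2}), frozenset({3}),
             frozenset({0, 1}), frozenset({2, 3})]

# A power domain that straddles both frequency domains would break the
# assumption.
straddling = [frozenset({1, 2})]

print(domains_are_inclusive(p_domains, c_domains))
print(domains_are_inclusive(p_domains, straddling))
```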
> > > Moreover, P-states and C-states may overlap. That is, a CPU may be in Px
> > > and Cy at the same time, which means that after leaving Cy it will execute
> > > instructions in Px. Things like leakage may depend on x in that case and
> > > the total power draw may depend on the combination of x and y.
Right, I have ignored that aspect so far (along with a lot of other
things) hoping that it wouldn't make too much difference. I haven't
investigated it in detail yet. I guess the main difference would be in
the shallowest C-states as you would be power gating in the deeper ones?
It could be factored in but it would mean providing platform data.
> > Right, and I suppose the domain thing makes it impossible to drop to the
> > lowest P state on going idle. Tricky that.
>
> That's the case for older chips. I'm not sure about the newest lot entirely
> to be honest, need to ask.
I think some ARM platforms do the lowest P-state trick before entering idle. But
yeah, it only makes sense if you are the last cpu in the P-state domain
to go down.
> > > The concern is that if a scaling governor is running in parallel with the above
> > > algorithm and it has its own utilization goal (it usually does), it may change
> > > the P-state under you to match that utilization goal and you'll end up with
> > > something different from what you expected.
> > >
> > > That may be addressed either by trying to predict what the scaling governor will
> > > do (and good luck with that) or by taking care of P-states by yourself. The
> > > latter would require changes to the algorithm I think, though.
> >
> > The idea was that we'll do P states ourselves based on these utilization
> > figures. If we find we cannot fit the 'new' task into the current set
> > without either raising P or waking an idle cpu (if at all available), we
> > compute the cost of either option and pick the cheapest.
>
> Yeah. One subtle thing is that ramping up P may affect the other guys
> (if the whole chip is a P-domain, for example), but I guess that can be
> taken into account.
For now I have assumed that the P-state governor will select a
P-state which is sufficient for handling the utilization. But, as Peter
already said, the plan is to try to at least guide the P-state selection
based on the decisions made by the scheduler.
Affected cpus are actually already taken into account when trying to
figure out whether to raise the P-state or wake an idle cpu.
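
As a rough sketch of that trade-off (not the code in the patch set), the
comparison for a single P-state domain could look like the following. The
cost model is an assumption: each cpu is busy for util/capacity of the
time, idle power is ignored, and wakeup_cost is a hypothetical penalty in
the same arbitrary units. All names and numbers are illustrative.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CapacityState:
    capacity: int  # compute capacity of the state
    power: int     # busy power of one cpu in the state


def lowest_state_for(utils, states):
    """Lowest state whose capacity covers the busiest cpu, or None."""
    for state in sorted(states, key=lambda s: s.capacity):
        if state.capacity >= max(utils):
            return state
    return None


def domain_cost(utils, state):
    """Estimated cost of the domain with every cpu running at `state`.

    All cpus share the P-state, so raising it for one cpu changes the
    cost of the others as well.
    """
    return sum(u / state.capacity * state.power for u in utils)


def cheapest_placement(utils, task_util, states, wakeup_cost=0.0):
    """Return (cpu, cost) for the cheapest cpu to place the task on."""
    best = None
    for cpu, util in enumerate(utils):
        new_utils = list(utils)
        new_utils[cpu] += task_util
        state = lowest_state_for(new_utils, states)
        if state is None:
            continue  # task does not fit on this cpu at any state
        cost = domain_cost(new_utils, state)
        if util == 0:
            cost += wakeup_cost  # assumed penalty for waking an idle cpu
        if best is None or cost < best[1]:
            best = (cpu, cost)
    return best


states = [CapacityState(250, 100), CapacityState(500, 250),
          CapacityState(1000, 800)]

# cpu0 is busy (util 200), cpu1 is idle; the new task has util 100.
print(cheapest_placement([200, 0], 100, states))
print(cheapest_placement([200, 0], 100, states, wakeup_cost=50))
```

In this toy example waking the idle cpu wins when wakeups are free,
because the domain can stay in the lowest state. Once the assumed wakeup
penalty is large enough, packing the task onto the busy cpu and raising
the P-state becomes the cheaper option.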
Morten