* [PATCH 1/8] sched/cpufreq: remove cpufreq_trigger_update()
2016-03-14 5:22 [PATCH 0/8] schedutil enhancements Michael Turquette
@ 2016-03-14 5:22 ` Michael Turquette
2016-03-15 21:14 ` Peter Zijlstra
2016-03-16 8:00 ` Peter Zijlstra
2016-03-14 5:22 ` [PATCH 2/8] sched/fair: add margin to utilization update Michael Turquette
` (7 subsequent siblings)
8 siblings, 2 replies; 61+ messages in thread
From: Michael Turquette @ 2016-03-14 5:22 UTC (permalink / raw)
To: peterz, rjw
Cc: linux-kernel, linux-pm, Juri.Lelli, steve.muckle,
morten.rasmussen, dietmar.eggemann, vincent.guittot,
Michael Turquette
cpufreq_trigger_update() was introduced in "cpufreq: Rework the
scheduler hooks for triggering updates"[0]. Consensus is that this
helper is not needed and removing it will aid in experimenting with
deadline and rt capacity requests.
Instead of reverting the above patch, which includes useful renaming of
data structures and related functions, simply remove the function,
update affected kerneldoc and change rt.c and deadline.c to use
cpufreq_update_util().
[0] lkml.kernel.org/r/7541372.ciUW4go8Ux@vostro.rjw.lan
Signed-off-by: Michael Turquette <mturquette+renesas@baylibre.com>
---
kernel/sched/cpufreq.c | 28 ++--------------------------
kernel/sched/deadline.c | 2 +-
kernel/sched/rt.c | 2 +-
kernel/sched/sched.h | 2 --
4 files changed, 4 insertions(+), 30 deletions(-)
diff --git a/kernel/sched/cpufreq.c b/kernel/sched/cpufreq.c
index eecaba4..bd012c2 100644
--- a/kernel/sched/cpufreq.c
+++ b/kernel/sched/cpufreq.c
@@ -20,8 +20,8 @@ static DEFINE_PER_CPU(struct freq_update_hook *, cpufreq_freq_update_hook);
*
* Set and publish the freq_update_hook pointer for the given CPU. That pointer
* points to a struct freq_update_hook object containing a callback function
- * to call from cpufreq_trigger_update(). That function will be called from
- * an RCU read-side critical section, so it must not sleep.
+ * to call from cpufreq_update_util(). That function will be called from an
+ * RCU read-side critical section, so it must not sleep.
*
* Callers must use RCU-sched callbacks to free any memory that might be
* accessed via the old update_util_data pointer or invoke synchronize_sched()
@@ -87,27 +87,3 @@ void cpufreq_update_util(u64 time, unsigned long util, unsigned long max)
if (hook)
hook->func(hook, time, util, max);
}
-
-/**
- * cpufreq_trigger_update - Trigger CPU performance state evaluation if needed.
- * @time: Current time.
- *
- * The way cpufreq is currently arranged requires it to evaluate the CPU
- * performance state (frequency/voltage) on a regular basis. To facilitate
- * that, cpufreq_update_util() is called by update_load_avg() in CFS when
- * executed for the current CPU's runqueue.
- *
- * However, this isn't sufficient to prevent the CPU from being stuck in a
- * completely inadequate performance level for too long, because the calls
- * from CFS will not be made if RT or deadline tasks are active all the time
- * (or there are RT and DL tasks only).
- *
- * As a workaround for that issue, this function is called by the RT and DL
- * sched classes to trigger extra cpufreq updates to prevent it from stalling,
- * but that really is a band-aid. Going forward it should be replaced with
- * solutions targeted more specifically at RT and DL tasks.
- */
-void cpufreq_trigger_update(u64 time)
-{
- cpufreq_update_util(time, ULONG_MAX, 0);
-}
diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
index 1a035fa..3fd5bc4 100644
--- a/kernel/sched/deadline.c
+++ b/kernel/sched/deadline.c
@@ -728,7 +728,7 @@ static void update_curr_dl(struct rq *rq)
/* Kick cpufreq (see the comment in drivers/cpufreq/cpufreq.c). */
if (cpu_of(rq) == smp_processor_id())
- cpufreq_trigger_update(rq_clock(rq));
+ cpufreq_update_util(rq_clock(rq), ULONG_MAX, 0);
/*
* Consumed budget is computed considering the time as
diff --git a/kernel/sched/rt.c b/kernel/sched/rt.c
index 9dd1c09..53ad077 100644
--- a/kernel/sched/rt.c
+++ b/kernel/sched/rt.c
@@ -947,7 +947,7 @@ static void update_curr_rt(struct rq *rq)
/* Kick cpufreq (see the comment in drivers/cpufreq/cpufreq.c). */
if (cpu_of(rq) == smp_processor_id())
- cpufreq_trigger_update(rq_clock(rq));
+ cpufreq_update_util(rq_clock(rq), ULONG_MAX, 0);
delta_exec = rq_clock_task(rq) - curr->se.exec_start;
if (unlikely((s64)delta_exec <= 0))
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 7ae012e..f06dfca 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1742,9 +1742,7 @@ static inline u64 irq_time_read(int cpu)
#ifdef CONFIG_CPU_FREQ
void cpufreq_update_util(u64 time, unsigned long util, unsigned long max);
-void cpufreq_trigger_update(u64 time);
#else
static inline void cpufreq_update_util(u64 time, unsigned long util,
unsigned long max) {}
-static inline void cpufreq_trigger_update(u64 time) {}
#endif /* CONFIG_CPU_FREQ */
--
2.1.4
^ permalink raw reply related [flat|nested] 61+ messages in thread
* Re: [PATCH 1/8] sched/cpufreq: remove cpufreq_trigger_update()
2016-03-14 5:22 ` [PATCH 1/8] sched/cpufreq: remove cpufreq_trigger_update() Michael Turquette
@ 2016-03-15 21:14 ` Peter Zijlstra
[not found] ` <20160315214545.30639.98727@quark.deferred.io>
2016-03-16 8:00 ` Peter Zijlstra
1 sibling, 1 reply; 61+ messages in thread
From: Peter Zijlstra @ 2016-03-15 21:14 UTC (permalink / raw)
To: Michael Turquette
Cc: rjw, linux-kernel, linux-pm, Juri.Lelli, steve.muckle,
morten.rasmussen, dietmar.eggemann, vincent.guittot,
Michael Turquette
On Sun, Mar 13, 2016 at 10:22:05PM -0700, Michael Turquette wrote:
> cpufreq_trigger_update() was introduced in "cpufreq: Rework the
> scheduler hooks for triggering updates"[0]. Consensus is that this
> helper is not needed and removing it will aid in experimenting with
> deadline and rt capacity requests.
>
> Instead of reverting the above patch, which includes useful renaming of
> data structures and related functions, simply remove the function,
> update affected kerneldoc and change rt.c and deadline.c to use
> cpufreq_update_util().
This fails to explain how the need for these hooks is dealt with.
* Re: [PATCH 1/8] sched/cpufreq: remove cpufreq_trigger_update()
2016-03-14 5:22 ` [PATCH 1/8] sched/cpufreq: remove cpufreq_trigger_update() Michael Turquette
2016-03-15 21:14 ` Peter Zijlstra
@ 2016-03-16 8:00 ` Peter Zijlstra
1 sibling, 0 replies; 61+ messages in thread
From: Peter Zijlstra @ 2016-03-16 8:00 UTC (permalink / raw)
To: Michael Turquette
Cc: rjw, linux-kernel, linux-pm, Juri.Lelli, steve.muckle,
morten.rasmussen, dietmar.eggemann, vincent.guittot,
Michael Turquette
On Sun, Mar 13, 2016 at 10:22:05PM -0700, Michael Turquette wrote:
> cpufreq_trigger_update() was introduced in "cpufreq: Rework the
> scheduler hooks for triggering updates"[0]. Consensus is that this
> helper is not needed and removing it will aid in experimenting with
> deadline and rt capacity requests.
>
> Instead of reverting the above patch, which includes useful renaming of
> data structures and related functions, simply remove the function,
> update affected kerneldoc and change rt.c and deadline.c to use
> cpufreq_update_util().
> -/**
> - * cpufreq_trigger_update - Trigger CPU performance state evaluation if needed.
> - * @time: Current time.
> - *
> - * The way cpufreq is currently arranged requires it to evaluate the CPU
> - * performance state (frequency/voltage) on a regular basis. To facilitate
> - * that, cpufreq_update_util() is called by update_load_avg() in CFS when
> - * executed for the current CPU's runqueue.
> - *
> - * However, this isn't sufficient to prevent the CPU from being stuck in a
> - * completely inadequate performance level for too long, because the calls
> - * from CFS will not be made if RT or deadline tasks are active all the time
> - * (or there are RT and DL tasks only).
> - *
> - * As a workaround for that issue, this function is called by the RT and DL
> - * sched classes to trigger extra cpufreq updates to prevent it from stalling,
> - * but that really is a band-aid. Going forward it should be replaced with
> - * solutions targeted more specifically at RT and DL tasks.
> - */
> -void cpufreq_trigger_update(u64 time)
> -{
> - cpufreq_update_util(time, ULONG_MAX, 0);
> -}
> diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
> index 1a035fa..3fd5bc4 100644
> --- a/kernel/sched/deadline.c
> +++ b/kernel/sched/deadline.c
> @@ -728,7 +728,7 @@ static void update_curr_dl(struct rq *rq)
>
> /* Kick cpufreq (see the comment in drivers/cpufreq/cpufreq.c). */
> if (cpu_of(rq) == smp_processor_id())
> - cpufreq_trigger_update(rq_clock(rq));
> + cpufreq_update_util(rq_clock(rq), ULONG_MAX, 0);
>
> /*
> * Consumed budget is computed considering the time as
OK, so take two on this, now hopefully more coherent (yay for sleep!).
So my problem is that this (update_curr_dl) is not the right location to
set DL utilization (although it might be for avg dl, see the other
email).
The only reason it lives here, is that some cpufreq governors require
'timely' calls into this hook. The comment you destroyed tries to convey
this.
We should still remove this requirement from the governors. And for
simple DL guarantees this hook is placed wrong.
* [PATCH 2/8] sched/fair: add margin to utilization update
2016-03-14 5:22 [PATCH 0/8] schedutil enhancements Michael Turquette
2016-03-14 5:22 ` [PATCH 1/8] sched/cpufreq: remove cpufreq_trigger_update() Michael Turquette
@ 2016-03-14 5:22 ` Michael Turquette
2016-03-15 21:16 ` Peter Zijlstra
2016-03-16 2:52 ` Steve Muckle
2016-03-14 5:22 ` [PATCH 3/8] sched/cpufreq: new cfs capacity margin helpers Michael Turquette
` (6 subsequent siblings)
8 siblings, 2 replies; 61+ messages in thread
From: Michael Turquette @ 2016-03-14 5:22 UTC (permalink / raw)
To: peterz, rjw
Cc: linux-kernel, linux-pm, Juri.Lelli, steve.muckle,
morten.rasmussen, dietmar.eggemann, vincent.guittot,
Michael Turquette
Utilization contributions to cfs_rq->avg.util_avg are scaled for both
microarchitecture invariance and frequency invariance. This means that
any given utilization contribution is scaled against the current cpu
capacity (cpu frequency). Contributions from long-running tasks, whose
utilization grows over time, asymptotically approach the current
capacity.
This causes a problem when using the utilization signal to select a
target cpu capacity (cpu frequency): the signal can never exceed the
current capacity, which would otherwise be the cue to increase the
frequency.
Solve this by introducing a default capacity margin that is added to the
utilization signal when requesting a change to capacity (cpu frequency).
The margin is 1280, i.e. 1.25 x SCHED_CAPACITY_SCALE (1024). This is in
line with similar margins elsewhere, such as the default value of 125
assigned to struct sched_domain.imbalance_pct for load balancing, and
the 80% up_threshold used by the legacy cpufreq ondemand governor.
Signed-off-by: Michael Turquette <mturquette+renesas@baylibre.com>
---
kernel/sched/fair.c | 18 ++++++++++++++++--
kernel/sched/sched.h | 3 +++
2 files changed, 19 insertions(+), 2 deletions(-)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index a32f281..29e8bae 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -100,6 +100,19 @@ const_debug unsigned int sysctl_sched_migration_cost = 500000UL;
*/
unsigned int __read_mostly sysctl_sched_shares_window = 10000000UL;
+/*
+ * Add a 25% margin globally to all capacity requests from cfs. This is
+ * equivalent to an 80% up_threshold in legacy governors like ondemand.
+ *
+ * This is required as task utilization increases. The frequency-invariant
+ * utilization will asymptotically approach the current capacity of the cpu and
+ * the additional margin will cross the threshold into the next capacity state.
+ *
+ * XXX someday expand to separate, per-call site margins? e.g. enqueue, fork,
+ * task_tick, load_balance, etc
+ */
+unsigned long cfs_capacity_margin = CAPACITY_MARGIN_DEFAULT;
+
#ifdef CONFIG_CFS_BANDWIDTH
/*
* Amount of runtime to allocate from global (tg) to local (per-cfs_rq) pool
@@ -2840,6 +2853,8 @@ static inline void update_load_avg(struct sched_entity *se, int update_tg)
if (cpu == smp_processor_id() && &rq->cfs == cfs_rq) {
unsigned long max = rq->cpu_capacity_orig;
+ unsigned long cap = cfs_rq->avg.util_avg *
+ cfs_capacity_margin / max;
/*
* There are a few boundary cases this might miss but it should
@@ -2852,8 +2867,7 @@ static inline void update_load_avg(struct sched_entity *se, int update_tg)
* thread is a different class (!fair), nor will the utilization
* number include things like RT tasks.
*/
- cpufreq_update_util(rq_clock(rq),
- min(cfs_rq->avg.util_avg, max), max);
+ cpufreq_update_util(rq_clock(rq), min(cap, max), max);
}
}
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index f06dfca..8c93ed2 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -27,6 +27,9 @@ extern __read_mostly int scheduler_running;
extern unsigned long calc_load_update;
extern atomic_long_t calc_load_tasks;
+#define CAPACITY_MARGIN_DEFAULT 1280
+extern unsigned long cfs_capacity_margin;
+
extern void calc_global_load_tick(struct rq *this_rq);
extern long calc_load_fold_active(struct rq *this_rq);
--
2.1.4
* Re: [PATCH 2/8] sched/fair: add margin to utilization update
2016-03-14 5:22 ` [PATCH 2/8] sched/fair: add margin to utilization update Michael Turquette
@ 2016-03-15 21:16 ` Peter Zijlstra
[not found] ` <20160315212848.30639.38747@quark.deferred.io>
2016-03-16 2:52 ` Steve Muckle
1 sibling, 1 reply; 61+ messages in thread
From: Peter Zijlstra @ 2016-03-15 21:16 UTC (permalink / raw)
To: Michael Turquette
Cc: rjw, linux-kernel, linux-pm, Juri.Lelli, steve.muckle,
morten.rasmussen, dietmar.eggemann, vincent.guittot,
Michael Turquette
On Sun, Mar 13, 2016 at 10:22:06PM -0700, Michael Turquette wrote:
> @@ -2840,6 +2853,8 @@ static inline void update_load_avg(struct sched_entity *se, int update_tg)
>
> if (cpu == smp_processor_id() && &rq->cfs == cfs_rq) {
> unsigned long max = rq->cpu_capacity_orig;
> + unsigned long cap = cfs_rq->avg.util_avg *
> + cfs_capacity_margin / max;
>
> /*
> * There are a few boundary cases this might miss but it should
> @@ -2852,8 +2867,7 @@ static inline void update_load_avg(struct sched_entity *se, int update_tg)
> * thread is a different class (!fair), nor will the utilization
> * number include things like RT tasks.
> */
> - cpufreq_update_util(rq_clock(rq),
> - min(cfs_rq->avg.util_avg, max), max);
> + cpufreq_update_util(rq_clock(rq), min(cap, max), max);
> }
> }
I really don't see why that is here, and not inside whatever uses
cpufreq_update_util().
* Re: [PATCH 2/8] sched/fair: add margin to utilization update
2016-03-14 5:22 ` [PATCH 2/8] sched/fair: add margin to utilization update Michael Turquette
2016-03-15 21:16 ` Peter Zijlstra
@ 2016-03-16 2:52 ` Steve Muckle
2016-03-16 22:12 ` Michael Turquette
1 sibling, 1 reply; 61+ messages in thread
From: Steve Muckle @ 2016-03-16 2:52 UTC (permalink / raw)
To: Michael Turquette, peterz, rjw
Cc: linux-kernel, linux-pm, Juri.Lelli, morten.rasmussen,
dietmar.eggemann, vincent.guittot, Michael Turquette
On 03/13/2016 10:22 PM, Michael Turquette wrote:
> +unsigned long cfs_capacity_margin = CAPACITY_MARGIN_DEFAULT;
> +
> #ifdef CONFIG_CFS_BANDWIDTH
> /*
> * Amount of runtime to allocate from global (tg) to local (per-cfs_rq) pool
> @@ -2840,6 +2853,8 @@ static inline void update_load_avg(struct sched_entity *se, int update_tg)
>
> if (cpu == smp_processor_id() && &rq->cfs == cfs_rq) {
> unsigned long max = rq->cpu_capacity_orig;
> + unsigned long cap = cfs_rq->avg.util_avg *
> + cfs_capacity_margin / max;
Doesn't rq->cpu_capacity_orig get scaled per the microarch invariance?
This would mean that the margin we're applying here would differ based
on that.
I'd expect that the margin would be * (cfs_capacity_margin /
SCHED_CAPACITY_SCALE) which would then reduce the division into a shift.
* Re: [PATCH 2/8] sched/fair: add margin to utilization update
2016-03-16 2:52 ` Steve Muckle
@ 2016-03-16 22:12 ` Michael Turquette
0 siblings, 0 replies; 61+ messages in thread
From: Michael Turquette @ 2016-03-16 22:12 UTC (permalink / raw)
To: Steve Muckle, peterz, rjw
Cc: linux-kernel, linux-pm, Juri.Lelli, morten.rasmussen,
dietmar.eggemann, vincent.guittot, Michael Turquette
Quoting Steve Muckle (2016-03-15 19:52:59)
> On 03/13/2016 10:22 PM, Michael Turquette wrote:
> > +unsigned long cfs_capacity_margin = CAPACITY_MARGIN_DEFAULT;
> > +
> > #ifdef CONFIG_CFS_BANDWIDTH
> > /*
> > * Amount of runtime to allocate from global (tg) to local (per-cfs_rq) pool
> > @@ -2840,6 +2853,8 @@ static inline void update_load_avg(struct sched_entity *se, int update_tg)
> >
> > if (cpu == smp_processor_id() && &rq->cfs == cfs_rq) {
> > unsigned long max = rq->cpu_capacity_orig;
> > + unsigned long cap = cfs_rq->avg.util_avg *
> > + cfs_capacity_margin / max;
>
> Doesn't rq->cpu_capacity_orig get scaled per the microarch invariance?
> This would mean that the margin we're applying here would differ based
> on that.
>
> I'd expect that the margin would be * (cfs_capacity_margin /
> SCHED_CAPACITY_SCALE) which would then reduce the division into a shift.
Will fix.
Thanks,
Mike
* [PATCH 3/8] sched/cpufreq: new cfs capacity margin helpers
2016-03-14 5:22 [PATCH 0/8] schedutil enhancements Michael Turquette
2016-03-14 5:22 ` [PATCH 1/8] sched/cpufreq: remove cpufreq_trigger_update() Michael Turquette
2016-03-14 5:22 ` [PATCH 2/8] sched/fair: add margin to utilization update Michael Turquette
@ 2016-03-14 5:22 ` Michael Turquette
2016-03-15 21:17 ` Peter Zijlstra
2016-03-14 5:22 ` [PATCH 4/8] cpufreq/schedutil: sysfs capacity margin tunable Michael Turquette
` (5 subsequent siblings)
8 siblings, 1 reply; 61+ messages in thread
From: Michael Turquette @ 2016-03-14 5:22 UTC (permalink / raw)
To: peterz, rjw
Cc: linux-kernel, linux-pm, Juri.Lelli, steve.muckle,
morten.rasmussen, dietmar.eggemann, vincent.guittot,
Michael Turquette
Introduce helper functions that allow cpufreq governors to change the
value of the capacity margin applied to the cfs_rq->avg.util_avg signal.
This allows for run-time tuning of the margin.
A follow-up patch will update the schedutil governor to use these
helpers.
Signed-off-by: Michael Turquette <mturquette+renesas@baylibre.com>
---
include/linux/sched.h | 3 +++
kernel/sched/cpufreq.c | 53 ++++++++++++++++++++++++++++++++++++++++++++++++++
2 files changed, 56 insertions(+)
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 1fa9b52..f18a99b 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -2372,6 +2372,9 @@ void cpufreq_set_freq_update_hook(int cpu, struct freq_update_hook *hook,
void (*func)(struct freq_update_hook *hook, u64 time,
unsigned long util, unsigned long max));
void cpufreq_clear_freq_update_hook(int cpu);
+unsigned long cpufreq_get_cfs_capacity_margin(void);
+void cpufreq_set_cfs_capacity_margin(unsigned long margin);
+void cpufreq_reset_cfs_capacity_margin(void);
#endif
#ifdef CONFIG_SCHED_AUTOGROUP
diff --git a/kernel/sched/cpufreq.c b/kernel/sched/cpufreq.c
index bd012c2..a126b58 100644
--- a/kernel/sched/cpufreq.c
+++ b/kernel/sched/cpufreq.c
@@ -61,6 +61,59 @@ void cpufreq_clear_freq_update_hook(int cpu)
EXPORT_SYMBOL_GPL(cpufreq_clear_freq_update_hook);
/**
+ * cpufreq_get_cfs_capacity_margin - Get global cfs enqueue capacity margin
+ *
+ * The margin is a multiplier applied to the current utilization when
+ * selecting a new capacity state or cpu frequency. The value is scaled
+ * against SCHED_CAPACITY_SCALE, where SCHED_CAPACITY_SCALE represents 100%
+ * of the normalized capacity, i.e. multiplying the utilization by one.
+ *
+ * This function returns the current global cfs enqueue capacity margin
+ */
+unsigned long cpufreq_get_cfs_capacity_margin(void)
+{
+ return cfs_capacity_margin;
+}
+EXPORT_SYMBOL_GPL(cpufreq_get_cfs_capacity_margin);
+
+/**
+ * cpufreq_set_cfs_capacity_margin - Set global cfs enqueue capacity margin
+ * @margin: new capacity margin
+ *
+ * The margin is a multiplier applied to the current utilization when
+ * selecting a new capacity state or cpu frequency. The value is scaled
+ * against SCHED_CAPACITY_SCALE, where SCHED_CAPACITY_SCALE represents 100%
+ * of the normalized capacity, i.e. multiplying the utilization by one.
+ *
+ * For instance, to add a 25% margin to a utilization, margin should be 1280,
+ * which is 1.25x 1024, the default for SCHED_CAPACITY_SCALE.
+ */
+void cpufreq_set_cfs_capacity_margin(unsigned long margin)
+{
+ cfs_capacity_margin = margin;
+}
+EXPORT_SYMBOL_GPL(cpufreq_set_cfs_capacity_margin);
+
+/**
+ * cpufreq_reset_cfs_capacity_margin - Reset global cfs enqueue cap margin
+ *
+ * The margin is a multiplier applied to the current utilization when
+ * selecting a new capacity state or cpu frequency. The value is scaled
+ * against SCHED_CAPACITY_SCALE, where SCHED_CAPACITY_SCALE represents 100%
+ * of the normalized capacity, i.e. multiplying the utilization by one.
+ *
+ * This function resets the global margin to its default value.
+ */
+void cpufreq_reset_cfs_capacity_margin(void)
+{
+ cfs_capacity_margin = CAPACITY_MARGIN_DEFAULT;
+}
+EXPORT_SYMBOL_GPL(cpufreq_reset_cfs_capacity_margin);
+
+/**
* cpufreq_update_util - Take a note about CPU utilization changes.
* @time: Current time.
* @util: CPU utilization.
--
2.1.4
* Re: [PATCH 3/8] sched/cpufreq: new cfs capacity margin helpers
2016-03-14 5:22 ` [PATCH 3/8] sched/cpufreq: new cfs capacity margin helpers Michael Turquette
@ 2016-03-15 21:17 ` Peter Zijlstra
0 siblings, 0 replies; 61+ messages in thread
From: Peter Zijlstra @ 2016-03-15 21:17 UTC (permalink / raw)
To: Michael Turquette
Cc: rjw, linux-kernel, linux-pm, Juri.Lelli, steve.muckle,
morten.rasmussen, dietmar.eggemann, vincent.guittot,
Michael Turquette
On Sun, Mar 13, 2016 at 10:22:07PM -0700, Michael Turquette wrote:
> +/**
> + * cpufreq_set_cfs_capacity_margin - Set global cfs enqueue capacity margin
> + * @margin: new capacity margin
> + *
> + * The margin is a multiplier applied to the current utilization when
> + * selecting a new capacity state or cpu frequency. The value is scaled
> + * against SCHED_CAPACITY_SCALE, where SCHED_CAPACITY_SCALE represents 100%
> + * of the normalized capacity, i.e. multiplying the utilization by one.
> + *
> + * For instance, to add a 25% margin to a utilization, margin should be 1280,
> + * which is 1.25x 1024, the default for SCHED_CAPACITY_SCALE.
> + */
> +void cpufreq_set_cfs_capacity_margin(unsigned long margin)
> +{
> + cfs_capacity_margin = margin;
> +}
> +EXPORT_SYMBOL_GPL(cpufreq_set_cfs_capacity_margin);
I don't like this as an interface; what's wrong with using percentiles
as per the discussion I had with Rafael last week?
Also, why is this exported?
* [PATCH 4/8] cpufreq/schedutil: sysfs capacity margin tunable
2016-03-14 5:22 [PATCH 0/8] schedutil enhancements Michael Turquette
` (2 preceding siblings ...)
2016-03-14 5:22 ` [PATCH 3/8] sched/cpufreq: new cfs capacity margin helpers Michael Turquette
@ 2016-03-14 5:22 ` Michael Turquette
2016-03-15 21:20 ` Peter Zijlstra
2016-03-14 5:22 ` [PATCH 5/8] sched/cpufreq: pass sched class into cpufreq_update_util Michael Turquette
` (4 subsequent siblings)
8 siblings, 1 reply; 61+ messages in thread
From: Michael Turquette @ 2016-03-14 5:22 UTC (permalink / raw)
To: peterz, rjw
Cc: linux-kernel, linux-pm, Juri.Lelli, steve.muckle,
morten.rasmussen, dietmar.eggemann, vincent.guittot,
Michael Turquette
With the addition of the global cfs capacity margin helpers in the patch
"sched/cpufreq: new cfs capacity margin helpers", we can now export a
sysfs tunable from the schedutil governor. This allows privileged users
to tune the value more easily.
The margin value is global to cfs, not per-policy. As such schedutil
does not store any state about the margin. Schedutil restores the margin
value to its default value when exiting.
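As a usage sketch (the sysfs path below is a guess based on the attribute
name, not a documented ABI), a privileged user could derive the raw value
for a desired percentage and write it to the tunable:

```shell
# Convert a desired margin percentage into the SCHED_CAPACITY_SCALE-relative
# value that the capacity_margin attribute expects.
SCALE=1024
PCT=125                         # 25% headroom
MARGIN=$(( SCALE * PCT / 100 ))
echo "$MARGIN"                  # 1280, the default

# Hypothetical attribute path; the actual location depends on where the
# governor's attr_set is exposed:
# echo "$MARGIN" > /sys/devices/system/cpu/cpufreq/schedutil/capacity_margin
```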
Signed-off-by: Michael Turquette <mturquette+renesas@baylibre.com>
---
drivers/cpufreq/cpufreq_schedutil.c | 25 +++++++++++++++++++++++++
1 file changed, 25 insertions(+)
diff --git a/drivers/cpufreq/cpufreq_schedutil.c b/drivers/cpufreq/cpufreq_schedutil.c
index 5aa26bf..12e49b9 100644
--- a/drivers/cpufreq/cpufreq_schedutil.c
+++ b/drivers/cpufreq/cpufreq_schedutil.c
@@ -246,8 +246,32 @@ static ssize_t rate_limit_us_store(struct gov_attr_set *attr_set, const char *bu
static struct governor_attr rate_limit_us = __ATTR_RW(rate_limit_us);
+static ssize_t capacity_margin_show(struct gov_attr_set *not_used,
+ char *buf)
+{
+ return sprintf(buf, "%lu\n", cpufreq_get_cfs_capacity_margin());
+}
+
+static ssize_t capacity_margin_store(struct gov_attr_set *attr_set,
+ const char *buf, size_t count)
+{
+ unsigned long margin;
+ int ret;
+
+ ret = sscanf(buf, "%lu", &margin);
+ if (ret != 1)
+ return -EINVAL;
+
+ cpufreq_set_cfs_capacity_margin(margin);
+
+ return count;
+}
+
+static struct governor_attr capacity_margin = __ATTR_RW(capacity_margin);
+
static struct attribute *sugov_attributes[] = {
&rate_limit_us.attr,
+ &capacity_margin.attr,
NULL
};
@@ -381,6 +405,7 @@ static int sugov_exit(struct cpufreq_policy *policy)
mutex_lock(&global_tunables_lock);
+ cpufreq_reset_cfs_capacity_margin();
count = gov_attr_set_put(&tunables->attr_set, &sg_policy->tunables_hook);
policy->governor_data = NULL;
if (!count)
--
2.1.4
* Re: [PATCH 4/8] cpufreq/schedutil: sysfs capacity margin tunable
2016-03-14 5:22 ` [PATCH 4/8] cpufreq/schedutil: sysfs capacity margin tunable Michael Turquette
@ 2016-03-15 21:20 ` Peter Zijlstra
[not found] ` <20160315214043.30639.75507@quark.deferred.io>
0 siblings, 1 reply; 61+ messages in thread
From: Peter Zijlstra @ 2016-03-15 21:20 UTC (permalink / raw)
To: Michael Turquette
Cc: rjw, linux-kernel, linux-pm, Juri.Lelli, steve.muckle,
morten.rasmussen, dietmar.eggemann, vincent.guittot,
Michael Turquette
On Sun, Mar 13, 2016 at 10:22:08PM -0700, Michael Turquette wrote:
> With the addition of the global cfs capacity margin helpers in the patch
> "sched/cpufreq: new cfs capacity margin helpers", we can now export a
> sysfs tunable from the schedutil governor. This allows privileged users
> to tune the value more easily.
>
> The margin value is global to cfs, not per-policy. As such schedutil
> does not store any state about the margin. Schedutil restores the margin
> value to its default value when exiting.
Yuck sysfs.. I would really rather we did not expose this per default.
And certainly not in this weird form.
* [PATCH 5/8] sched/cpufreq: pass sched class into cpufreq_update_util
2016-03-14 5:22 [PATCH 0/8] schedutil enhancements Michael Turquette
` (3 preceding siblings ...)
2016-03-14 5:22 ` [PATCH 4/8] cpufreq/schedutil: sysfs capacity margin tunable Michael Turquette
@ 2016-03-14 5:22 ` Michael Turquette
2016-03-15 21:25 ` Peter Zijlstra
2016-03-14 5:22 ` [PATCH 6/8] cpufreq/schedutil: sum per-sched class utilization Michael Turquette
` (3 subsequent siblings)
8 siblings, 1 reply; 61+ messages in thread
From: Michael Turquette @ 2016-03-14 5:22 UTC (permalink / raw)
To: peterz, rjw
Cc: linux-kernel, linux-pm, Juri.Lelli, steve.muckle,
morten.rasmussen, dietmar.eggemann, vincent.guittot,
Michael Turquette
cpufreq_update_util() accepts a single utilization value which does not
account for multiple utilization contributions from the cfs, rt & dl
scheduler classes. Begin fixing this by adding a sched_class argument to
cpufreq_update_util(), all of its call sites and the governor-specific
hooks in intel_pstate.c, cpufreq_schedutil.c and cpufreq_governor.c.
A follow-on patch will add summation of the sched_class contributions to
the schedutil governor.
Signed-off-by: Michael Turquette <mturquette+renesas@baylibre.com>
---
drivers/cpufreq/cpufreq_governor.c | 5 +++--
drivers/cpufreq/cpufreq_schedutil.c | 6 ++++--
drivers/cpufreq/intel_pstate.c | 5 +++--
include/linux/sched.h | 16 +++++++++++++---
kernel/sched/cpufreq.c | 11 +++++++----
kernel/sched/deadline.c | 2 +-
kernel/sched/fair.c | 2 +-
kernel/sched/rt.c | 2 +-
kernel/sched/sched.h | 8 +++++---
9 files changed, 38 insertions(+), 19 deletions(-)
diff --git a/drivers/cpufreq/cpufreq_governor.c b/drivers/cpufreq/cpufreq_governor.c
index 148576c..4694751 100644
--- a/drivers/cpufreq/cpufreq_governor.c
+++ b/drivers/cpufreq/cpufreq_governor.c
@@ -248,8 +248,9 @@ static void dbs_irq_work(struct irq_work *irq_work)
schedule_work(&policy_dbs->work);
}
-static void dbs_freq_update_handler(struct freq_update_hook *hook, u64 time,
- unsigned long util_not_used,
+static void dbs_freq_update_handler(struct freq_update_hook *hook,
+ enum sched_class_util sc_not_used,
+ u64 time, unsigned long util_not_used,
unsigned long max_not_used)
{
struct cpu_dbs_info *cdbs = container_of(hook, struct cpu_dbs_info, update_hook);
diff --git a/drivers/cpufreq/cpufreq_schedutil.c b/drivers/cpufreq/cpufreq_schedutil.c
index 12e49b9..18d9ca3 100644
--- a/drivers/cpufreq/cpufreq_schedutil.c
+++ b/drivers/cpufreq/cpufreq_schedutil.c
@@ -106,7 +106,8 @@ static void sugov_update_commit(struct sugov_policy *sg_policy, u64 time,
trace_cpu_frequency(freq, smp_processor_id());
}
-static void sugov_update_single(struct freq_update_hook *hook, u64 time,
+static void sugov_update_single(struct freq_update_hook *hook,
+ enum sched_class_util sc, u64 time,
unsigned long util, unsigned long max)
{
struct sugov_cpu *sg_cpu = container_of(hook, struct sugov_cpu, update_hook);
@@ -166,7 +167,8 @@ static unsigned int sugov_next_freq(struct sugov_policy *sg_policy,
return util * max_f / max;
}
-static void sugov_update_shared(struct freq_update_hook *hook, u64 time,
+static void sugov_update_shared(struct freq_update_hook *hook,
+ enum sched_class_util sc, u64 time,
unsigned long util, unsigned long max)
{
struct sugov_cpu *sg_cpu = container_of(hook, struct sugov_cpu, update_hook);
diff --git a/drivers/cpufreq/intel_pstate.c b/drivers/cpufreq/intel_pstate.c
index 20e2bb2..86aa368 100644
--- a/drivers/cpufreq/intel_pstate.c
+++ b/drivers/cpufreq/intel_pstate.c
@@ -1020,8 +1020,9 @@ static inline void intel_pstate_adjust_busy_pstate(struct cpudata *cpu)
sample->freq);
}
-static void intel_pstate_freq_update(struct freq_update_hook *hook, u64 time,
- unsigned long util_not_used,
+static void intel_pstate_freq_update(struct freq_update_hook *hook,
+ enum sched_class_util sc_not_used,
+ u64 time, unsigned long util_not_used,
unsigned long max_not_used)
{
struct cpudata *cpu = container_of(hook, struct cpudata, update_hook);
diff --git a/include/linux/sched.h b/include/linux/sched.h
index f18a99b..1c7d7bd 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -2362,15 +2362,25 @@ extern u64 scheduler_tick_max_deferment(void);
static inline bool sched_can_stop_tick(void) { return false; }
#endif
+enum sched_class_util {
+ cfs_util,
+ rt_util,
+ dl_util,
+ nr_util_types,
+};
+
#ifdef CONFIG_CPU_FREQ
struct freq_update_hook {
- void (*func)(struct freq_update_hook *hook, u64 time,
+ void (*func)(struct freq_update_hook *hook,
+ enum sched_class_util sched_class, u64 time,
unsigned long util, unsigned long max);
};
void cpufreq_set_freq_update_hook(int cpu, struct freq_update_hook *hook,
- void (*func)(struct freq_update_hook *hook, u64 time,
- unsigned long util, unsigned long max));
+ void (*func)(struct freq_update_hook *hook,
+ enum sched_class_util sched_class,
+ u64 time, unsigned long util,
+ unsigned long max));
void cpufreq_clear_freq_update_hook(int cpu);
unsigned long cpufreq_get_cfs_capacity_margin(void);
void cpufreq_set_cfs_capacity_margin(unsigned long margin);
diff --git a/kernel/sched/cpufreq.c b/kernel/sched/cpufreq.c
index a126b58..87f99a6 100644
--- a/kernel/sched/cpufreq.c
+++ b/kernel/sched/cpufreq.c
@@ -39,8 +39,10 @@ static void set_freq_update_hook(int cpu, struct freq_update_hook *hook)
* @func: Callback function to use with the new hook.
*/
void cpufreq_set_freq_update_hook(int cpu, struct freq_update_hook *hook,
- void (*func)(struct freq_update_hook *hook, u64 time,
- unsigned long util, unsigned long max))
+ void (*func)(struct freq_update_hook *hook,
+ enum sched_class_util sched_class,
+ u64 time, unsigned long util,
+ unsigned long max))
{
if (WARN_ON(!hook || !func))
return;
@@ -124,7 +126,8 @@ EXPORT_SYMBOL_GPL(cpufreq_reset_cfs_capacity_margin);
*
* It can only be called from RCU-sched read-side critical sections.
*/
-void cpufreq_update_util(u64 time, unsigned long util, unsigned long max)
+void cpufreq_update_util(enum sched_class_util sc, u64 time,
+ unsigned long util, unsigned long max)
{
struct freq_update_hook *hook;
@@ -138,5 +141,5 @@ void cpufreq_update_util(u64 time, unsigned long util, unsigned long max)
* may become NULL after the check below.
*/
if (hook)
- hook->func(hook, time, util, max);
+ hook->func(hook, sc, time, util, max);
}
diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
index 3fd5bc4..d88ed3f 100644
--- a/kernel/sched/deadline.c
+++ b/kernel/sched/deadline.c
@@ -728,7 +728,7 @@ static void update_curr_dl(struct rq *rq)
/* Kick cpufreq (see the comment in drivers/cpufreq/cpufreq.c). */
if (cpu_of(rq) == smp_processor_id())
- cpufreq_update_util(rq_clock(rq), ULONG_MAX, 0);
+ cpufreq_update_util(dl_util, rq_clock(rq), ULONG_MAX, 0);
/*
* Consumed budget is computed considering the time as
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 29e8bae..6b454bc 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -2867,7 +2867,7 @@ static inline void update_load_avg(struct sched_entity *se, int update_tg)
* thread is a different class (!fair), nor will the utilization
* number include things like RT tasks.
*/
- cpufreq_update_util(rq_clock(rq), min(cap, max), max);
+ cpufreq_update_util(cfs_util, rq_clock(rq), min(cap, max), max);
}
}
diff --git a/kernel/sched/rt.c b/kernel/sched/rt.c
index 53ad077..9d9dab4 100644
--- a/kernel/sched/rt.c
+++ b/kernel/sched/rt.c
@@ -947,7 +947,7 @@ static void update_curr_rt(struct rq *rq)
/* Kick cpufreq (see the comment in drivers/cpufreq/cpufreq.c). */
if (cpu_of(rq) == smp_processor_id())
- cpufreq_update_util(rq_clock(rq), ULONG_MAX, 0);
+ cpufreq_update_util(rt_util, rq_clock(rq), ULONG_MAX, 0);
delta_exec = rq_clock_task(rq) - curr->se.exec_start;
if (unlikely((s64)delta_exec <= 0))
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 8c93ed2..469d11d 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1744,8 +1744,10 @@ static inline u64 irq_time_read(int cpu)
#endif /* CONFIG_IRQ_TIME_ACCOUNTING */
#ifdef CONFIG_CPU_FREQ
-void cpufreq_update_util(u64 time, unsigned long util, unsigned long max);
+void cpufreq_update_util(enum sched_class_util sc, u64 time,
+ unsigned long util, unsigned long max);
#else
-static inline void cpufreq_update_util(u64 time, unsigned long util,
- unsigned long max) {}
+static inline void cpufreq_update_util(enum sched_class_util sc, u64 time,
+ unsigned long util, unsigned long max)
+{}
#endif /* CONFIG_CPU_FREQ */
--
2.1.4
^ permalink raw reply related [flat|nested] 61+ messages in thread
* Re: [PATCH 5/8] sched/cpufreq: pass sched class into cpufreq_update_util
2016-03-14 5:22 ` [PATCH 5/8] sched/cpufreq: pass sched class into cpufreq_update_util Michael Turquette
@ 2016-03-15 21:25 ` Peter Zijlstra
[not found] ` <20160315220609.30639.67271@quark.deferred.io>
0 siblings, 1 reply; 61+ messages in thread
From: Peter Zijlstra @ 2016-03-15 21:25 UTC (permalink / raw)
To: Michael Turquette
Cc: rjw, linux-kernel, linux-pm, Juri.Lelli, steve.muckle,
morten.rasmussen, dietmar.eggemann, vincent.guittot,
Michael Turquette
On Sun, Mar 13, 2016 at 10:22:09PM -0700, Michael Turquette wrote:
> +++ b/include/linux/sched.h
> @@ -2362,15 +2362,25 @@ extern u64 scheduler_tick_max_deferment(void);
> static inline bool sched_can_stop_tick(void) { return false; }
> #endif
>
> +enum sched_class_util {
> + cfs_util,
> + rt_util,
> + dl_util,
> + nr_util_types,
> +};
> +
> #ifdef CONFIG_CPU_FREQ
> struct freq_update_hook {
> + void (*func)(struct freq_update_hook *hook,
> + enum sched_class_util sched_class, u64 time,
> unsigned long util, unsigned long max);
> };
>
> void cpufreq_set_freq_update_hook(int cpu, struct freq_update_hook *hook,
> + void (*func)(struct freq_update_hook *hook,
> + enum sched_class_util sched_class,
> + u64 time, unsigned long util,
> + unsigned long max));
Have you looked at the asm that generated? At some point you'll start
spilling on the stack and it'll be a god awful mess.
* [PATCH 6/8] cpufreq/schedutil: sum per-sched class utilization
2016-03-14 5:22 [PATCH 0/8] schedutil enhancements Michael Turquette
` (4 preceding siblings ...)
2016-03-14 5:22 ` [PATCH 5/8] sched/cpufreq: pass sched class into cpufreq_update_util Michael Turquette
@ 2016-03-14 5:22 ` Michael Turquette
2016-03-15 21:29 ` Peter Zijlstra
2016-03-14 5:22 ` [PATCH 7/8] cpufreq: Frequency invariant scheduler load-tracking support Michael Turquette
` (2 subsequent siblings)
8 siblings, 1 reply; 61+ messages in thread
From: Michael Turquette @ 2016-03-14 5:22 UTC (permalink / raw)
To: peterz, rjw
Cc: linux-kernel, linux-pm, Juri.Lelli, steve.muckle,
morten.rasmussen, dietmar.eggemann, vincent.guittot,
Michael Turquette
The patch "sched/cpufreq: pass sched class into cpufreq_update_util" made
it possible for callers of cpufreq_update_util() to specify their
scheduler class, namely cfs, rt or dl.
Update the schedutil governor to store these individual utilizations per
CPU and sum them to create a total utilization contribution.
Signed-off-by: Michael Turquette <mturquette+renesas@baylibre.com>
---
drivers/cpufreq/cpufreq_schedutil.c | 39 +++++++++++++++++++++++++++++++------
1 file changed, 33 insertions(+), 6 deletions(-)
diff --git a/drivers/cpufreq/cpufreq_schedutil.c b/drivers/cpufreq/cpufreq_schedutil.c
index 18d9ca3..b9234e1 100644
--- a/drivers/cpufreq/cpufreq_schedutil.c
+++ b/drivers/cpufreq/cpufreq_schedutil.c
@@ -46,8 +46,10 @@ struct sugov_cpu {
struct freq_update_hook update_hook;
struct sugov_policy *sg_policy;
+ unsigned long util[nr_util_types];
+ unsigned long total_util;
+
/* The fields below are only needed when sharing a policy. */
- unsigned long util;
unsigned long max;
u64 last_update;
};
@@ -106,6 +108,18 @@ static void sugov_update_commit(struct sugov_policy *sg_policy, u64 time,
trace_cpu_frequency(freq, smp_processor_id());
}
+static unsigned long sugov_sum_total_util(struct sugov_cpu *sg_cpu)
+{
+ enum sched_class_util sc;
+
+ /* sum the utilization of all sched classes */
+ sg_cpu->total_util = 0;
+ for (sc = 0; sc < nr_util_types; sc++)
+ sg_cpu->total_util += sg_cpu->util[sc];
+
+ return sg_cpu->total_util;
+}
+
static void sugov_update_single(struct freq_update_hook *hook,
enum sched_class_util sc, u64 time,
unsigned long util, unsigned long max)
@@ -113,12 +127,17 @@ static void sugov_update_single(struct freq_update_hook *hook,
struct sugov_cpu *sg_cpu = container_of(hook, struct sugov_cpu, update_hook);
struct sugov_policy *sg_policy = sg_cpu->sg_policy;
unsigned int max_f, next_f;
+ unsigned long total_util;
if (!sugov_should_update_freq(sg_policy, time))
return;
+ /* update per-sched_class utilization for this cpu */
+ sg_cpu->util[sc] = util;
+ total_util = sugov_sum_total_util(sg_cpu);
+
max_f = sg_policy->max_freq;
- next_f = util > max ? max_f : util * max_f / max;
+ next_f = total_util > max ? max_f : total_util * max_f / max;
sugov_update_commit(sg_policy, time, next_f);
}
@@ -153,7 +172,7 @@ static unsigned int sugov_next_freq(struct sugov_policy *sg_policy,
if ((s64)delta_ns > NSEC_PER_SEC / HZ)
continue;
- j_util = j_sg_cpu->util;
+ j_util = j_sg_cpu->total_util;
j_max = j_sg_cpu->max;
if (j_util > j_max)
return max_f;
@@ -174,15 +193,19 @@ static void sugov_update_shared(struct freq_update_hook *hook,
struct sugov_cpu *sg_cpu = container_of(hook, struct sugov_cpu, update_hook);
struct sugov_policy *sg_policy = sg_cpu->sg_policy;
unsigned int next_f;
+ unsigned long total_util;
raw_spin_lock(&sg_policy->update_lock);
- sg_cpu->util = util;
+ sg_cpu->util[sc] = util;
sg_cpu->max = max;
sg_cpu->last_update = time;
+ /* update per-sched_class utilization for this cpu */
+ total_util = sugov_sum_total_util(sg_cpu);
+
if (sugov_should_update_freq(sg_policy, time)) {
- next_f = sugov_next_freq(sg_policy, util, max);
+ next_f = sugov_next_freq(sg_policy, total_util, max);
sugov_update_commit(sg_policy, time, next_f);
}
@@ -423,6 +446,7 @@ static int sugov_start(struct cpufreq_policy *policy)
{
struct sugov_policy *sg_policy = policy->governor_data;
unsigned int cpu;
+ enum sched_class_util sc;
sg_policy->freq_update_delay_ns = sg_policy->tunables->rate_limit_us * NSEC_PER_USEC;
sg_policy->last_freq_update_time = 0;
@@ -434,8 +458,11 @@ static int sugov_start(struct cpufreq_policy *policy)
struct sugov_cpu *sg_cpu = &per_cpu(sugov_cpu, cpu);
sg_cpu->sg_policy = sg_policy;
+ for (sc = 0; sc < nr_util_types; sc++) {
+ sg_cpu->util[sc] = ULONG_MAX;
+ sg_cpu->total_util = ULONG_MAX;
+ }
if (policy_is_shared(policy)) {
- sg_cpu->util = ULONG_MAX;
sg_cpu->max = 0;
sg_cpu->last_update = 0;
cpufreq_set_freq_update_hook(cpu, &sg_cpu->update_hook,
--
2.1.4
* Re: [PATCH 6/8] cpufreq/schedutil: sum per-sched class utilization
2016-03-14 5:22 ` [PATCH 6/8] cpufreq/schedutil: sum per-sched class utilization Michael Turquette
@ 2016-03-15 21:29 ` Peter Zijlstra
[not found] ` <20160315220951.30639.12872@quark.deferred.io>
0 siblings, 1 reply; 61+ messages in thread
From: Peter Zijlstra @ 2016-03-15 21:29 UTC (permalink / raw)
To: Michael Turquette
Cc: rjw, linux-kernel, linux-pm, Juri.Lelli, steve.muckle,
morten.rasmussen, dietmar.eggemann, vincent.guittot,
Michael Turquette
On Sun, Mar 13, 2016 at 10:22:10PM -0700, Michael Turquette wrote:
> +static unsigned long sugov_sum_total_util(struct sugov_cpu *sg_cpu)
> +{
> + enum sched_class_util sc;
> +
> + /* sum the utilization of all sched classes */
> + sg_cpu->total_util = 0;
> + for (sc = 0; sc < nr_util_types; sc++)
> + sg_cpu->total_util += sg_cpu->util[sc];
> +
> + return sg_cpu->total_util;
> +}
> @@ -153,7 +172,7 @@ static unsigned int sugov_next_freq(struct sugov_policy *sg_policy,
> if ((s64)delta_ns > NSEC_PER_SEC / HZ)
> continue;
>
> - j_util = j_sg_cpu->util;
> + j_util = j_sg_cpu->total_util;
> j_max = j_sg_cpu->max;
> if (j_util > j_max)
> return max_f;
So while not strictly wrong, I think we can do so much better.
Changelog doesn't mention anything useful, like that this is indeed very
rough and what we really should be doing etc..
* [PATCH 7/8] cpufreq: Frequency invariant scheduler load-tracking support
2016-03-14 5:22 [PATCH 0/8] schedutil enhancements Michael Turquette
` (5 preceding siblings ...)
2016-03-14 5:22 ` [PATCH 6/8] cpufreq/schedutil: sum per-sched class utilization Michael Turquette
@ 2016-03-14 5:22 ` Michael Turquette
2016-03-15 19:13 ` Dietmar Eggemann
2016-03-15 21:34 ` Peter Zijlstra
2016-03-14 5:22 ` [PATCH 8/8] sched: prefer cpufreq_scale_freq_capacity Michael Turquette
2016-03-16 0:08 ` [PATCH 0/8] schedutil enhancements Rafael J. Wysocki
8 siblings, 2 replies; 61+ messages in thread
From: Michael Turquette @ 2016-03-14 5:22 UTC (permalink / raw)
To: peterz, rjw
Cc: linux-kernel, linux-pm, Juri.Lelli, steve.muckle,
morten.rasmussen, dietmar.eggemann, vincent.guittot,
Michael Turquette
From: Dietmar Eggemann <dietmar.eggemann@arm.com>
Implements cpufreq_scale_freq_capacity() to provide the scheduler with a
frequency scaling correction factor for more accurate load-tracking.
The factor is:
current_freq(cpu) << SCHED_CAPACITY_SHIFT / max_freq(cpu)
In fact, freq_scale should be a struct cpufreq_policy data member. But
this would require that the scheduler hot path (__update_load_avg()) would
have to grab the cpufreq lock. This can be avoided by using per-cpu data
initialized to SCHED_CAPACITY_SCALE for freq_scale.
Signed-off-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Signed-off-by: Michael Turquette <mturquette+renesas@baylibre.com>
---
I'm not as sure about patches 7 & 8, but I included them since I needed
frequency invariance while testing.
As mentioned by myself in 2014 and Rafael last month, the
arch_scale_freq_capacity hook is awkward, because this behavior may vary
within an architecture.
I re-introduce Dietmar's generic cpufreq implementation of the frequency
invariance hook in this patch, and change the preprocessor magic in
sched.h to favor the cpufreq implementation over arch- or
platform-specific ones in the next patch.
If run-time selection of ops is needed then someone will need to write
that code.
I think that this negates the need for the arm arch hooks[0-2], and
hopefully Morten and Dietmar can weigh in on this.
[0] lkml.kernel.org/r/1436293469-25707-2-git-send-email-morten.rasmussen@arm.com
[1] lkml.kernel.org/r/1436293469-25707-6-git-send-email-morten.rasmussen@arm.com
[2] lkml.kernel.org/r/1436293469-25707-8-git-send-email-morten.rasmussen@arm.com
drivers/cpufreq/cpufreq.c | 29 +++++++++++++++++++++++++++++
include/linux/cpufreq.h | 3 +++
2 files changed, 32 insertions(+)
diff --git a/drivers/cpufreq/cpufreq.c b/drivers/cpufreq/cpufreq.c
index b1ca9c4..e67584f 100644
--- a/drivers/cpufreq/cpufreq.c
+++ b/drivers/cpufreq/cpufreq.c
@@ -306,6 +306,31 @@ static void adjust_jiffies(unsigned long val, struct cpufreq_freqs *ci)
#endif
}
+/*********************************************************************
+ * FREQUENCY INVARIANT CPU CAPACITY *
+ *********************************************************************/
+
+static DEFINE_PER_CPU(unsigned long, freq_scale) = SCHED_CAPACITY_SCALE;
+
+static void
+scale_freq_capacity(struct cpufreq_policy *policy, struct cpufreq_freqs *freqs)
+{
+ unsigned long cur = freqs ? freqs->new : policy->cur;
+ unsigned long scale = (cur << SCHED_CAPACITY_SHIFT) / policy->max;
+ int cpu;
+
+ pr_debug("cpus %*pbl cur/cur max freq %lu/%u kHz freq scale %lu\n",
+ cpumask_pr_args(policy->cpus), cur, policy->max, scale);
+
+ for_each_cpu(cpu, policy->cpus)
+ per_cpu(freq_scale, cpu) = scale;
+}
+
+unsigned long cpufreq_scale_freq_capacity(struct sched_domain *sd, int cpu)
+{
+ return per_cpu(freq_scale, cpu);
+}
+
static void __cpufreq_notify_transition(struct cpufreq_policy *policy,
struct cpufreq_freqs *freqs, unsigned int state)
{
@@ -409,6 +434,8 @@ wait:
spin_unlock(&policy->transition_lock);
+ scale_freq_capacity(policy, freqs);
+
cpufreq_notify_transition(policy, freqs, CPUFREQ_PRECHANGE);
}
EXPORT_SYMBOL_GPL(cpufreq_freq_transition_begin);
@@ -2125,6 +2152,8 @@ static int cpufreq_set_policy(struct cpufreq_policy *policy,
blocking_notifier_call_chain(&cpufreq_policy_notifier_list,
CPUFREQ_NOTIFY, new_policy);
+ scale_freq_capacity(new_policy, NULL);
+
policy->min = new_policy->min;
policy->max = new_policy->max;
diff --git a/include/linux/cpufreq.h b/include/linux/cpufreq.h
index 0e39499..72833be 100644
--- a/include/linux/cpufreq.h
+++ b/include/linux/cpufreq.h
@@ -583,4 +583,7 @@ unsigned int cpufreq_generic_get(unsigned int cpu);
int cpufreq_generic_init(struct cpufreq_policy *policy,
struct cpufreq_frequency_table *table,
unsigned int transition_latency);
+
+struct sched_domain;
+unsigned long cpufreq_scale_freq_capacity(struct sched_domain *sd, int cpu);
#endif /* _LINUX_CPUFREQ_H */
--
2.1.4
* Re: [PATCH 7/8] cpufreq: Frequency invariant scheduler load-tracking support
2016-03-14 5:22 ` [PATCH 7/8] cpufreq: Frequency invariant scheduler load-tracking support Michael Turquette
@ 2016-03-15 19:13 ` Dietmar Eggemann
2016-03-15 20:19 ` Michael Turquette
2016-03-15 21:34 ` Peter Zijlstra
1 sibling, 1 reply; 61+ messages in thread
From: Dietmar Eggemann @ 2016-03-15 19:13 UTC (permalink / raw)
To: Michael Turquette, peterz, rjw
Cc: linux-kernel, linux-pm, Juri.Lelli, steve.muckle,
morten.rasmussen, vincent.guittot, Michael Turquette
Hi Mike,
On 14/03/16 05:22, Michael Turquette wrote:
> From: Dietmar Eggemann <dietmar.eggemann@arm.com>
>
> Implements cpufreq_scale_freq_capacity() to provide the scheduler with a
> frequency scaling correction factor for more accurate load-tracking.
>
> The factor is:
>
> current_freq(cpu) << SCHED_CAPACITY_SHIFT / max_freq(cpu)
>
> In fact, freq_scale should be a struct cpufreq_policy data member. But
> this would require that the scheduler hot path (__update_load_avg()) would
> have to grab the cpufreq lock. This can be avoided by using per-cpu data
> initialized to SCHED_CAPACITY_SCALE for freq_scale.
>
> Signed-off-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
> Signed-off-by: Michael Turquette <mturquette+renesas@baylibre.com>
> ---
> I'm not as sure about patches 7 & 8, but I included them since I needed
> frequency invariance while testing.
>
> As mentioned by myself in 2014 and Rafael last month, the
> arch_scale_freq_capacity hook is awkward, because this behavior may vary
> within an architecture.
>
> I re-introduce Dietmar's generic cpufreq implementation of the frequency
> invariance hook in this patch, and change the preprocessor magic in
> sched.h to favor the cpufreq implementation over arch- or
> platform-specific ones in the next patch.
Maybe it is worth mentioning that this patch is from EAS RFC5.2
(linux-arm.org/linux-power.git energy_model_rfc_v5.2) which hasn't been
posted to LKML. The last EAS RFCv5 has the Frequency Invariant Engine
(FEI) based on the cpufreq notifier calls (cpufreq_callback,
cpufreq_policy_callback) in the ARM arch code.
> > If run-time selection of ops is needed then someone will need to write
> that code.
Right now I see 3 different implementations of the FEI. 1) The X86
aperf/mperf based one (https://lkml.org/lkml/2016/3/3/589), 2) This one
in cpufreq.c and 3) the one based on cpufreq notifiers in ARCH (ARM,
ARM64) code.
I guess with schedutil we do need a solution for all platforms
(different archs, x86 w/ and w/o X86_FEATURE_APERFMPERF, ...).
> I think that this negates the need for the arm arch hooks[0-2], and
> hopefully Morten and Dietmar can weigh in on this.
It's true that we tried to get rid of the usage of the cpufreq callbacks
(cpufreq_callback, cpufreq_policy_callback) with this patch. Plus we
didn't want to implement it twice (for ARM and ARM64).
But 2) would have to work for other ARCHs as well. Maybe as a fall-back
for X86 w/o X86_FEATURE_APERFMPERF feature?
[...]
* Re: [PATCH 7/8] cpufreq: Frequency invariant scheduler load-tracking support
2016-03-15 19:13 ` Dietmar Eggemann
@ 2016-03-15 20:19 ` Michael Turquette
2016-03-15 21:32 ` Peter Zijlstra
2016-03-16 18:33 ` Dietmar Eggemann
0 siblings, 2 replies; 61+ messages in thread
From: Michael Turquette @ 2016-03-15 20:19 UTC (permalink / raw)
To: Dietmar Eggemann, peterz, rjw
Cc: linux-kernel, linux-pm, Juri.Lelli, steve.muckle,
morten.rasmussen, vincent.guittot, Michael Turquette
Quoting Dietmar Eggemann (2016-03-15 12:13:46)
> Hi Mike,
>
> On 14/03/16 05:22, Michael Turquette wrote:
> > From: Dietmar Eggemann <dietmar.eggemann@arm.com>
> >
> > Implements cpufreq_scale_freq_capacity() to provide the scheduler with a
> > frequency scaling correction factor for more accurate load-tracking.
> >
> > The factor is:
> >
> > current_freq(cpu) << SCHED_CAPACITY_SHIFT / max_freq(cpu)
> >
> > In fact, freq_scale should be a struct cpufreq_policy data member. But
> > this would require that the scheduler hot path (__update_load_avg()) would
> > have to grab the cpufreq lock. This can be avoided by using per-cpu data
> > initialized to SCHED_CAPACITY_SCALE for freq_scale.
> >
> > Signed-off-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
> > Signed-off-by: Michael Turquette <mturquette+renesas@baylibre.com>
> > ---
> > I'm not as sure about patches 7 & 8, but I included them since I needed
> > frequency invariance while testing.
> >
> > As mentioned by myself in 2014 and Rafael last month, the
> > arch_scale_freq_capacity hook is awkward, because this behavior may vary
> > within an architecture.
> >
> > I re-introduce Dietmar's generic cpufreq implementation of the frequency
> > invariance hook in this patch, and change the preprocessor magic in
> > sched.h to favor the cpufreq implementation over arch- or
> > platform-specific ones in the next patch.
>
> Maybe it is worth mentioning that this patch is from EAS RFC5.2
> (linux-arm.org/linux-power.git energy_model_rfc_v5.2) which hasn't been
> posted to LKML. The last EAS RFCv5 has the Frequency Invariant Engine
> (FEI) based on the cpufreq notifier calls (cpufreq_callback,
> cpufreq_policy_callback) in the ARM arch code.
Oops, my apologies. I got a little mixed up while developing these
patches and I should have at least asked you about this one before
posting.
I'm really quite happy to drop #7 and #8 if they are too contentious or
if patch #7 is deemed as not-ready by you.
>
> > If run-time selection of ops is needed then someone will need to write
> > that code.
>
> Right now I see 3 different implementations of the FEI. 1) The X86
> aperf/mperf based one (https://lkml.org/lkml/2016/3/3/589), 2) This one
> in cpufreq.c and 3) the one based on cpufreq notifiers in ARCH (ARM,
> ARM64) code.
>
> I guess with sched_util we do need a solution for all platforms
> (different archs, x86 w/ and w/o X86_FEATURE_APERFMPERF, ...).
>
> > I think that this negates the need for the arm arch hooks[0-2], and
> > hopefully Morten and Dietmar can weigh in on this.
>
> It's true that we tried to get rid of the usage of the cpufreq callbacks
> (cpufreq_callback, cpufreq_policy_callback) with this patch. Plus we
> didn't want to implement it twice (for ARM and ARM64).
>
> But 2) would have to work for other ARCHs as well. Maybe as a fall-back
> for X86 w/o X86_FEATURE_APERFMPERF feature?
That's what I had in mind. I guess that some day there will be a need to
select implementations at run-time for both cpufreq (e.g. different
cpufreq drivers might implement arch_scale_freq_capacity) and for the
!CONFIG_CPU_FREQ case (e.g. different platforms might implement
arch_scale_freq_capacity within the same arch).
The cpufreq approach seems the most generic, hence patch #8 to make it
the default.
Regards,
Mike
>
> [...]
* Re: [PATCH 7/8] cpufreq: Frequency invariant scheduler load-tracking support
2016-03-15 20:19 ` Michael Turquette
@ 2016-03-15 21:32 ` Peter Zijlstra
2016-03-16 18:33 ` Dietmar Eggemann
1 sibling, 0 replies; 61+ messages in thread
From: Peter Zijlstra @ 2016-03-15 21:32 UTC (permalink / raw)
To: Michael Turquette
Cc: Dietmar Eggemann, rjw, linux-kernel, linux-pm, Juri.Lelli,
steve.muckle, morten.rasmussen, vincent.guittot,
Michael Turquette
On Tue, Mar 15, 2016 at 01:19:17PM -0700, Michael Turquette wrote:
> That's what I had in mind. I guess that some day there will be a need to
> select implementations at run-time for both cpufreq (e.g. different
> cpufreq drivers might implement arch_scale_freq_capacity) and for the
> !CONFIG_CPU_FREQ case (e.g. different platforms might implement
> arch_scale_freq_capacity within the same arch).
No, no runtime selection. That gets us function pointers and other
indirect mess.
We should be trying very hard to get rid of that cpufreq_update_util()
pointer, not add more of that gunk.
Use self modifying code if you have to do something.
* Re: [PATCH 7/8] cpufreq: Frequency invariant scheduler load-tracking support
2016-03-15 20:19 ` Michael Turquette
2016-03-15 21:32 ` Peter Zijlstra
@ 2016-03-16 18:33 ` Dietmar Eggemann
1 sibling, 0 replies; 61+ messages in thread
From: Dietmar Eggemann @ 2016-03-16 18:33 UTC (permalink / raw)
To: Michael Turquette, peterz, rjw
Cc: linux-kernel, linux-pm, Juri.Lelli, steve.muckle,
morten.rasmussen, vincent.guittot, Michael Turquette
On 15/03/16 20:19, Michael Turquette wrote:
> Quoting Dietmar Eggemann (2016-03-15 12:13:46)
>> Hi Mike,
>>
>> On 14/03/16 05:22, Michael Turquette wrote:
>>> From: Dietmar Eggemann <dietmar.eggemann@arm.com>
>>>
[...]
>> Maybe it is worth mentioning that this patch is from EAS RFC5.2
>> (linux-arm.org/linux-power.git energy_model_rfc_v5.2) which hasn't been
>> posted to LKML. The last EAS RFCv5 has the Frequency Invariant Engine
>> (FEI) based on the cpufreq notifier calls (cpufreq_callback,
>> cpufreq_policy_callback) in the ARM arch code.
>
> Oops, my apologies. I got a little mixed up while developing these
> patches and I should have at least asked you about this one before
> posting.
No need to apologize, just wanted to set the context right for people
following the EAS stuff.
> I'm really quite happy to drop #7 and #8 if they are too contentious or
> if patch #7 is deemed as not-ready by you.
If people think that driving frequency invariance from within cpufreq.c
is the right thing to do then this patch can stay.
That said, I'm not sure if we can do a better job of integrating
freq_scale into existing cpufreq data structures.
>>> If run-time selection of ops is needed then someone will need to write
>>> that code.
>>
>> Right now I see 3 different implementations of the FEI. 1) The X86
>> aperf/mperf based one (https://lkml.org/lkml/2016/3/3/589), 2) This one
>> in cpufreq.c and 3) the one based on cpufreq notifiers in ARCH (ARM,
>> ARM64) code.
I rechecked the functionality of this implementation (2) versus (1) and
(3) by running them all together on an X86 system w/
X86_FEATURE_APERFMPERF and acpi-cpufreq (i5 CPU M520) w/o actually doing
the wiring towards arch_scale_freq_capacity and they all behave similar
or equal. The update of (1) happens much more often obviously since its
driven from the scheduler tick.
...
arch_scale_freq_tick: APERF/MPERF: aperf=5864236223 mperf=6009442434
acnt=200073 mcnt=362360 arch_cpu_freq=565
arch_scale_freq_tick: APERF/MPERF: aperf=5864261886 mperf=6009474724
acnt=25663 mcnt=32290 arch_cpu_freq=813
scale_freq_capacity: CPUFREQ.C: cpu=1 freq_scale=910
cpufreq_callback: NOTIFIER: cpu=1 curr_freq=2133000 max_freq=2400000
freq_scale=910
arch_scale_freq_tick: APERF/MPERF: aperf=5864524610 mperf=6009802492
acnt=262724 mcnt=327768 arch_cpu_freq=820
arch_scale_freq_tick: APERF/MPERF: aperf=5864581063 mperf=6009862268
acnt=56453 mcnt=59776 arch_cpu_freq=967
...
[...]
* Re: [PATCH 7/8] cpufreq: Frequency invariant scheduler load-tracking support
2016-03-14 5:22 ` [PATCH 7/8] cpufreq: Frequency invariant scheduler load-tracking support Michael Turquette
2016-03-15 19:13 ` Dietmar Eggemann
@ 2016-03-15 21:34 ` Peter Zijlstra
1 sibling, 0 replies; 61+ messages in thread
From: Peter Zijlstra @ 2016-03-15 21:34 UTC (permalink / raw)
To: Michael Turquette
Cc: rjw, linux-kernel, linux-pm, Juri.Lelli, steve.muckle,
morten.rasmussen, dietmar.eggemann, vincent.guittot,
Michael Turquette
On Sun, Mar 13, 2016 at 10:22:11PM -0700, Michael Turquette wrote:
> +static DEFINE_PER_CPU(unsigned long, freq_scale) = SCHED_CAPACITY_SCALE;
> +
> +unsigned long cpufreq_scale_freq_capacity(struct sched_domain *sd, int cpu)
> +{
> + return per_cpu(freq_scale, cpu);
> +}
These should be in a header so that we can avoid having to do a function
call.
* [PATCH 8/8] sched: prefer cpufreq_scale_freq_capacity
2016-03-14 5:22 [PATCH 0/8] schedutil enhancements Michael Turquette
` (6 preceding siblings ...)
2016-03-14 5:22 ` [PATCH 7/8] cpufreq: Frequency invariant scheduler load-tracking support Michael Turquette
@ 2016-03-14 5:22 ` Michael Turquette
2016-03-15 19:13 ` Dietmar Eggemann
2016-03-15 21:37 ` Peter Zijlstra
2016-03-16 0:08 ` [PATCH 0/8] schedutil enhancements Rafael J. Wysocki
8 siblings, 2 replies; 61+ messages in thread
From: Michael Turquette @ 2016-03-14 5:22 UTC (permalink / raw)
To: peterz, rjw
Cc: linux-kernel, linux-pm, Juri.Lelli, steve.muckle,
morten.rasmussen, dietmar.eggemann, vincent.guittot,
Michael Turquette
arch_scale_freq_capacity is weird. It specifies an arch hook for an
implementation that could easily vary within an architecture or even a
chip family.
This patch helps to mitigate this weirdness by defaulting to the
cpufreq-provided implementation, which should work for all cases where
CONFIG_CPU_FREQ is set.
If CONFIG_CPU_FREQ is not set, then try to use an implementation
provided by the architecture. Failing that, fall back to
SCHED_CAPACITY_SCALE.
It may be desirable for cpufreq drivers to specify their own
implementation of arch_scale_freq_capacity in the future. The same is
true for platform code within an architecture. In both cases an
efficient implementation selector will need to be created and this patch
adds a comment to that effect.
Signed-off-by: Michael Turquette <mturquette+renesas@baylibre.com>
---
kernel/sched/sched.h | 16 +++++++++++++++-
1 file changed, 15 insertions(+), 1 deletion(-)
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 469d11d..37502ea 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1368,7 +1368,21 @@ static inline int hrtick_enabled(struct rq *rq)
#ifdef CONFIG_SMP
extern void sched_avg_update(struct rq *rq);
-#ifndef arch_scale_freq_capacity
+/*
+ * arch_scale_freq_capacity can be implemented by cpufreq, platform code or
+ * arch code. We select the cpufreq-provided implementation first. If it
+ * doesn't exist then we default to any other implementation provided from
+ * platform/arch code. If those do not exist then we use the default
+ * SCHED_CAPACITY_SCALE value below.
+ *
+ * Note that if cpufreq drivers or platform/arch code have competing
+ * implementations it is up to those subsystems to select one at runtime with
+ * an efficient solution, as we cannot tolerate the overhead of indirect
+ * functions (e.g. function pointers) in the scheduler fast path
+ */
+#ifdef CONFIG_CPU_FREQ
+#define arch_scale_freq_capacity cpufreq_scale_freq_capacity
+#elif !defined(arch_scale_freq_capacity)
static __always_inline
unsigned long arch_scale_freq_capacity(struct sched_domain *sd, int cpu)
{
--
2.1.4
* Re: [PATCH 8/8] sched: prefer cpufreq_scale_freq_capacity
2016-03-14 5:22 ` [PATCH 8/8] sched: prefer cpufreq_scale_freq_capacity Michael Turquette
@ 2016-03-15 19:13 ` Dietmar Eggemann
2016-03-15 20:46 ` Michael Turquette
2016-03-15 21:37 ` Peter Zijlstra
1 sibling, 1 reply; 61+ messages in thread
From: Dietmar Eggemann @ 2016-03-15 19:13 UTC (permalink / raw)
To: Michael Turquette, peterz, rjw
Cc: linux-kernel, linux-pm, Juri.Lelli, steve.muckle,
morten.rasmussen, vincent.guittot, Michael Turquette
On 14/03/16 05:22, Michael Turquette wrote:
> arch_scale_freq_capacity is weird. It specifies an arch hook for an
> implementation that could easily vary within an architecture or even a
> chip family.
>
> This patch helps to mitigate this weirdness by defaulting to the
> cpufreq-provided implementation, which should work for all cases where
> CONFIG_CPU_FREQ is set.
>
> If CONFIG_CPU_FREQ is not set, then try to use an implementation
> provided by the architecture. Failing that, fall back to
> SCHED_CAPACITY_SCALE.
>
> It may be desirable for cpufreq drivers to specify their own
> implementation of arch_scale_freq_capacity in the future. The same is
> true for platform code within an architecture. In both cases an
> efficient implementation selector will need to be created and this patch
> adds a comment to that effect.
For me, this independence of the scheduler code from the actual
implementation of the Frequency Invariance Engine (FIE) was actually a
feature.
In EAS RFC5.2 (linux-arm.org/linux-power.git energy_model_rfc_v5.2 ,
which hasn't been posted to LKML) we establish the link in the ARCH code
(arch/arm64/include/asm/topology.h).
#ifdef CONFIG_CPU_FREQ
#define arch_scale_freq_capacity cpufreq_scale_freq_capacity
...
+#endif
>
> Signed-off-by: Michael Turquette <mturquette+renesas@baylibre.com>
> ---
> kernel/sched/sched.h | 16 +++++++++++++++-
> 1 file changed, 15 insertions(+), 1 deletion(-)
>
> diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
> index 469d11d..37502ea 100644
> --- a/kernel/sched/sched.h
> +++ b/kernel/sched/sched.h
> @@ -1368,7 +1368,21 @@ static inline int hrtick_enabled(struct rq *rq)
> #ifdef CONFIG_SMP
> extern void sched_avg_update(struct rq *rq);
>
> -#ifndef arch_scale_freq_capacity
> +/*
> + * arch_scale_freq_capacity can be implemented by cpufreq, platform code or
> + * arch code. We select the cpufreq-provided implementation first. If it
> + * doesn't exist then we default to any other implementation provided from
> + * platform/arch code. If those do not exist then we use the default
> + * SCHED_CAPACITY_SCALE value below.
> + *
> + * Note that if cpufreq drivers or platform/arch code have competing
> + * implementations it is up to those subsystems to select one at runtime with
> + * an efficient solution, as we cannot tolerate the overhead of indirect
> + * functions (e.g. function pointers) in the scheduler fast path
> + */
> +#ifdef CONFIG_CPU_FREQ
> +#define arch_scale_freq_capacity cpufreq_scale_freq_capacity
> +#elif !defined(arch_scale_freq_capacity)
> static __always_inline
> unsigned long arch_scale_freq_capacity(struct sched_domain *sd, int cpu)
> {
>
* Re: [PATCH 8/8] sched: prefer cpufreq_scale_freq_capacity
2016-03-15 19:13 ` Dietmar Eggemann
@ 2016-03-15 20:46 ` Michael Turquette
2016-03-16 19:44 ` Dietmar Eggemann
0 siblings, 1 reply; 61+ messages in thread
From: Michael Turquette @ 2016-03-15 20:46 UTC (permalink / raw)
To: Dietmar Eggemann, peterz, rjw
Cc: linux-kernel, linux-pm, Juri.Lelli, steve.muckle,
morten.rasmussen, vincent.guittot, Michael Turquette
Quoting Dietmar Eggemann (2016-03-15 12:13:58)
> On 14/03/16 05:22, Michael Turquette wrote:
> > arch_scale_freq_capacity is weird. It specifies an arch hook for an
> > implementation that could easily vary within an architecture or even a
> > chip family.
> >
> > This patch helps to mitigate this weirdness by defaulting to the
> > cpufreq-provided implementation, which should work for all cases where
> > CONFIG_CPU_FREQ is set.
> >
> > If CONFIG_CPU_FREQ is not set, then try to use an implementation
> > provided by the architecture. Failing that, fall back to
> > SCHED_CAPACITY_SCALE.
> >
> > It may be desirable for cpufreq drivers to specify their own
> > implementation of arch_scale_freq_capacity in the future. The same is
> > true for platform code within an architecture. In both cases an
> > efficient implementation selector will need to be created and this patch
> > adds a comment to that effect.
>
> For me, this independence of the scheduler code from the actual
> implementation of the Frequency Invariance Engine (FIE) was actually a
> feature.
I do not agree that it is a strength; I think it is confusing. My
opinion is that cpufreq drivers should implement
arch_scale_freq_capacity. Having a sane fallback
(cpufreq_scale_freq_capacity) simply means that you can remove the
boilerplate from the arm32 and arm64 code, which is a win.
Furthermore, if we have multiple competing implementations of
arch_scale_freq_capacity, wouldn't it be better for all of them to
live in cpufreq drivers? This means we would only need to implement a
single run-time "selector".
On the other hand, if the implementation lives in arch code and we have
various implementations of arch_scale_freq_capacity within an
architecture, then each arch would need to implement this selector
function. Even worse, if we have a split where some implementations
live in drivers/cpufreq (e.g. intel_pstate) and others in arch/arm and
others in arch/arm64 ... now we have three selectors.
Note that this has nothing to do with cpu microarch invariance. I'm
happy for that to stay in arch code because we can have heterogeneous
cpus that do not scale frequency, and thus would not enable cpufreq.
But if your platform scales cpu frequency, then really cpufreq should be
in the loop.
>
> In EAS RFC5.2 (linux-arm.org/linux-power.git energy_model_rfc_v5.2 ,
> which hasn't been posted to LKML) we establish the link in the ARCH code
> (arch/arm64/include/asm/topology.h).
Right, sorry again about preemptively posting the patch. Total brainfart
on my part.
>
> #ifdef CONFIG_CPU_FREQ
> #define arch_scale_freq_capacity cpufreq_scale_freq_capacity
> ...
> +#endif
The above is no longer necessary with this patch. Same question as
above: why insist on the arch boilerplate?
Regards,
Mike
>
> >
> > Signed-off-by: Michael Turquette <mturquette+renesas@baylibre.com>
> > ---
> > kernel/sched/sched.h | 16 +++++++++++++++-
> > 1 file changed, 15 insertions(+), 1 deletion(-)
> >
> > diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
> > index 469d11d..37502ea 100644
> > --- a/kernel/sched/sched.h
> > +++ b/kernel/sched/sched.h
> > @@ -1368,7 +1368,21 @@ static inline int hrtick_enabled(struct rq *rq)
> > #ifdef CONFIG_SMP
> > extern void sched_avg_update(struct rq *rq);
> >
> > -#ifndef arch_scale_freq_capacity
> > +/*
> > + * arch_scale_freq_capacity can be implemented by cpufreq, platform code or
> > + * arch code. We select the cpufreq-provided implementation first. If it
> > + * doesn't exist then we default to any other implementation provided from
> > + * platform/arch code. If those do not exist then we use the default
> > + * SCHED_CAPACITY_SCALE value below.
> > + *
> > + * Note that if cpufreq drivers or platform/arch code have competing
> > + * implementations it is up to those subsystems to select one at runtime with
> > + * an efficient solution, as we cannot tolerate the overhead of indirect
> > + * functions (e.g. function pointers) in the scheduler fast path
> > + */
> > +#ifdef CONFIG_CPU_FREQ
> > +#define arch_scale_freq_capacity cpufreq_scale_freq_capacity
> > +#elif !defined(arch_scale_freq_capacity)
> > static __always_inline
> > unsigned long arch_scale_freq_capacity(struct sched_domain *sd, int cpu)
> > {
> >
* Re: [PATCH 8/8] sched: prefer cpufreq_scale_freq_capacity
2016-03-15 20:46 ` Michael Turquette
@ 2016-03-16 19:44 ` Dietmar Eggemann
2016-03-16 20:07 ` Peter Zijlstra
0 siblings, 1 reply; 61+ messages in thread
From: Dietmar Eggemann @ 2016-03-16 19:44 UTC (permalink / raw)
To: Michael Turquette, peterz, rjw
Cc: linux-kernel, linux-pm, Juri.Lelli, steve.muckle,
morten.rasmussen, vincent.guittot, Michael Turquette
On 15/03/16 20:46, Michael Turquette wrote:
> Quoting Dietmar Eggemann (2016-03-15 12:13:58)
>> On 14/03/16 05:22, Michael Turquette wrote:
[...]
>> For me, this independence of the scheduler code from the actual
>> implementation of the Frequency Invariance Engine (FIE) was actually a
>> feature.
>
> I do not agree that it is a strength; I think it is confusing. My
> opinion is that cpufreq drivers should implement
> arch_scale_freq_capacity. Having a sane fallback
> (cpufreq_scale_freq_capacity) simply means that you can remove the
> boilerplate from the arm32 and arm64 code, which is a win.
>
> Furthermore, if we have multiple competing implementations of
> arch_scale_freq_capacity, wouldn't it be better for all of them to
> live in cpufreq drivers? This means we would only need to implement a
> single run-time "selector".
>
> On the other hand, if the implementation lives in arch code and we have
> various implementations of arch_scale_freq_capacity within an
> architecture, then each arch would need to implement this selector
> function. Even worse, if we have a split where some implementations
> live in drivers/cpufreq (e.g. intel_pstate) and others in arch/arm and
> others in arch/arm64 ... now we have three selectors.
OK, now I see your point. What I don't understand is why you want
different foo_scale_freq_capacity() implementations per cpufreq
driver. IMHO we want the cpufreq.c based implementation to abstract
from that (at least for target_index() cpufreq drivers).
intel_pstate (setpolicy()) is an exception but my humble guess is that
systems with intel_pstate driver have X86_FEATURE_APERFMPERF support.
> Note that this has nothing to do with cpu microarch invariance. I'm
> happy for that to stay in arch code because we can have heterogeneous
> cpus that do not scale frequency, and thus would not enable cpufreq.
> But if your platform scales cpu frequency, then really cpufreq should be
> in the loop.
Agreed.
>
>>
>> In EAS RFC5.2 (linux-arm.org/linux-power.git energy_model_rfc_v5.2 ,
>> which hasn't been posted to LKML) we establish the link in the ARCH code
>> (arch/arm64/include/asm/topology.h).
>
> Right, sorry again about preemptively posting the patch. Total brainfart
> on my part.
>
>>
>> #ifdef CONFIG_CPU_FREQ
>> #define arch_scale_freq_capacity cpufreq_scale_freq_capacity
>> ...
>> +#endif
>
> The above is no longer necessary with this patch. Same question as
> above: why insist on the arch boilerplate?
OK.
[...]
* Re: [PATCH 8/8] sched: prefer cpufreq_scale_freq_capacity
2016-03-16 19:44 ` Dietmar Eggemann
@ 2016-03-16 20:07 ` Peter Zijlstra
2016-03-16 21:32 ` Rafael J. Wysocki
0 siblings, 1 reply; 61+ messages in thread
From: Peter Zijlstra @ 2016-03-16 20:07 UTC (permalink / raw)
To: Dietmar Eggemann
Cc: Michael Turquette, rjw, linux-kernel, linux-pm, Juri.Lelli,
steve.muckle, morten.rasmussen, vincent.guittot,
Michael Turquette
On Wed, Mar 16, 2016 at 07:44:33PM +0000, Dietmar Eggemann wrote:
> intel_pstate (setpolicy()) is an exception but my humble guess is that
> systems with intel_pstate driver have X86_FEATURE_APERFMPERF support.
A quick browse of the Intel SDM says you're right. It looks like
everything after Pentium-M; so Core-Solo/Core-Duo and onwards have
APERF/MPERF.
And it looks like P6 class systems didn't have DVFS support at all,
which basically leaves P4 and Pentium-M as the only chips to have DVFS
support lacking APERF/MPERF.
And while I haven't got any P4 based space heaters left, I might still
have a Pentium-M class laptop somewhere (if it still boots).
* Re: [PATCH 8/8] sched: prefer cpufreq_scale_freq_capacity
2016-03-16 20:07 ` Peter Zijlstra
@ 2016-03-16 21:32 ` Rafael J. Wysocki
0 siblings, 0 replies; 61+ messages in thread
From: Rafael J. Wysocki @ 2016-03-16 21:32 UTC (permalink / raw)
To: Peter Zijlstra
Cc: Dietmar Eggemann, Michael Turquette, linux-kernel, linux-pm,
Juri.Lelli, steve.muckle, morten.rasmussen, vincent.guittot,
Michael Turquette
On Wednesday, March 16, 2016 09:07:52 PM Peter Zijlstra wrote:
> On Wed, Mar 16, 2016 at 07:44:33PM +0000, Dietmar Eggemann wrote:
> > intel_pstate (setpolicy()) is an exception but my humble guess is that
> > systems with intel_pstate driver have X86_FEATURE_APERFMPERF support.
>
> A quick browse of the Intel SDM says you're right. It looks like
> everything after Pentium-M; so Core-Solo/Core-Duo and onwards have
> APERF/MPERF.
>
> And it looks like P6 class systems didn't have DVFS support at all,
> which basically leaves P4 and Pentium-M as the only chips to have DVFS
> support lacking APERF/MPERF.
>
> And while I haven't got any P4 based space heaters left, I might still
> have a Pentium-M class laptop somewhere (if it still boots).
intel_pstate depends on APERF/MPERF, so it won't work with anything
without them.
* Re: [PATCH 8/8] sched: prefer cpufreq_scale_freq_capacity
2016-03-14 5:22 ` [PATCH 8/8] sched: prefer cpufreq_scale_freq_capacity Michael Turquette
2016-03-15 19:13 ` Dietmar Eggemann
@ 2016-03-15 21:37 ` Peter Zijlstra
[not found] ` <20160315222721.30639.28332@quark.deferred.io>
1 sibling, 1 reply; 61+ messages in thread
From: Peter Zijlstra @ 2016-03-15 21:37 UTC (permalink / raw)
To: Michael Turquette
Cc: rjw, linux-kernel, linux-pm, Juri.Lelli, steve.muckle,
morten.rasmussen, dietmar.eggemann, vincent.guittot,
Michael Turquette
On Sun, Mar 13, 2016 at 10:22:12PM -0700, Michael Turquette wrote:
> +++ b/kernel/sched/sched.h
> @@ -1368,7 +1368,21 @@ static inline int hrtick_enabled(struct rq *rq)
> #ifdef CONFIG_SMP
> extern void sched_avg_update(struct rq *rq);
>
> -#ifndef arch_scale_freq_capacity
> +#ifdef CONFIG_CPU_FREQ
> +#define arch_scale_freq_capacity cpufreq_scale_freq_capacity
> +#elif !defined(arch_scale_freq_capacity)
> static __always_inline
> unsigned long arch_scale_freq_capacity(struct sched_domain *sd, int cpu)
> {
This would not allow x86 to use the APERF/MPERF thing, so no, it can't
be right.
Maybe something like
#ifndef arch_scale_freq_capacity
#ifdef CONFIG_CPU_FREQ
#define arch_scale_freq_capacity cpufreq_scale_freq_capacity
#else
static __always_inline
unsigned long arch_scale_freq_capacity(struct sched_domain *sd, int cpu)
{
return SCHED_CAPACITY_SCALE;
}
#endif
#endif
Which will let the arch override and only falls back to cpufreq if
the arch doesn't do anything.
* Re: [PATCH 0/8] schedutil enhancements
2016-03-14 5:22 [PATCH 0/8] schedutil enhancements Michael Turquette
` (7 preceding siblings ...)
2016-03-14 5:22 ` [PATCH 8/8] sched: prefer cpufreq_scale_freq_capacity Michael Turquette
@ 2016-03-16 0:08 ` Rafael J. Wysocki
8 siblings, 0 replies; 61+ messages in thread
From: Rafael J. Wysocki @ 2016-03-16 0:08 UTC (permalink / raw)
To: Michael Turquette
Cc: peterz, linux-kernel, linux-pm, Juri.Lelli, steve.muckle,
morten.rasmussen, dietmar.eggemann, vincent.guittot,
Michael Turquette
Hi Mike,
On Sunday, March 13, 2016 10:22:04 PM Michael Turquette wrote:
> I'm happy that scheduler-driven cpu frequency selection is getting some
> attention. Rafael's recent schedutil governor is a step in the right direction.
Thanks!
> This series builds on top of Rafael's schedutil governor, bringing it to parity
> with some of the features in the schedfreq series posted by Steve[0], as well
> as adding a couple of new things.
>
> Patch 1 removes cpufreq_trigger_update()
>
> Patches 2-4 move the cfs capacity margin out of the governor and into
> cfs. This value is made tunable by a sysfs control in schedutil.
>
> Patches 5-6 make cpufreq_update_util() aware of multiple scheduler
> classes (cfs, rt & dl), and add storage & summation of these per-class
> utilization values into schedutil.
>
> Patches 7-8 introduce Dietmar's generic cpufreq implementation[1] of the
> frequency invariance hook and changes the preprocessor magic in sched.h to
> favor the cpufreq implementation over arch- or platform-specific ones.
After the discussion mentioned by Peter in one of his responses (the relevant
e-mail thread is here: http://marc.info/?t=145688568600003&r=1&w=4) I have
changed the schedutil series to address some points made during it.
In particular, I've modified it to use the formula from
http://marc.info/?l=linux-acpi&m=145756618321500&w=4 with the twist that
if the utilization is frequency-invariant, max_freq will be used instead
of current_freq.
The way it recognizes whether or not that is the case is based on
Peter's suggestion from http://marc.info/?l=linux-kernel&m=145760739700716&w=4.
For this reason, the governor goes into kernel/sched/ as I don't want
arch_scale_freq_invariant() to be exposed to cpufreq at large, because
schedutil will be the only governor using it in the foreseeable future.
The fast switching support patch is slightly different too, but that's
not relevant here.
The new series is in the pm-cpufreq-experimental branch of my tree:
git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm.git \
pm-cpufreq-experimental
I haven't had the time to post it over the last few days (sorry about
that), but I'll do that tomorrow.
Thanks,
Rafael