Linux Power Management development
* [RFC] sched: fair: Don't update CPU frequency too frequently
@ 2017-06-01 11:34 Viresh Kumar
  2017-06-01 12:22 ` Peter Zijlstra
  0 siblings, 1 reply; 5+ messages in thread
From: Viresh Kumar @ 2017-06-01 11:34 UTC (permalink / raw)
  To: Ingo Molnar, Peter Zijlstra, Rafael Wysocki
  Cc: Viresh Kumar, linaro-kernel, linux-kernel, Vincent Guittot,
	linux-pm, Juri Lelli, Dietmar.Eggemann, Morten.Rasmussen

update_load_avg() is the heaviest caller of the cpufreq utilization
callbacks right now and it runs too frequently from various parts of the
fair scheduler.

It may not be optimal to update the CPU frequency that frequently
though.

We are more concerned about the events when the CPU utilization changes
sharply and that mostly happens when a task gets attached/detached
to/from a runqueue.

Perhaps we aren't that concerned about the case when the utilization
changes gradually (small delta) due to the time a task gets to run, and
updating the CPU frequency at the tick rate should be good enough in
that case.

Sometimes update_load_avg() gets called right before migrating a task to
a new CPU and that can steal the chance of updating the frequency when
we actually needed it (i.e. after util_avg is updated with the
contributions of the migrated task). This can happen because the cpufreq
governors change the frequency only once in a fixed period of time (i.e.
rate_limit_us for the schedutil governor).
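
For reference, schedutil's rate limiting boils down to a simple
elapsed-time check. Here is a minimal user-space model of it (the
structure and field names below are illustrative, simplified from the
kernel, not the actual API):

```c
#include <stdbool.h>
#include <stdint.h>

/* Simplified model of schedutil's rate limit: a frequency update is
 * allowed only if at least rate_limit_us microseconds have passed
 * since the last one. Field names are illustrative, not the kernel's. */
struct sg_policy {
	uint64_t last_freq_update_time; /* ns */
	uint64_t freq_update_delay_ns;  /* rate_limit_us * 1000 */
};

static bool should_update_freq(struct sg_policy *sg, uint64_t now_ns)
{
	int64_t delta_ns = (int64_t)(now_ns - sg->last_freq_update_time);

	return delta_ns >= (int64_t)sg->freq_update_delay_ns;
}
```

Any utilization change arriving before the delay expires is simply
dropped, which is exactly how a significant change right after a small
one can be missed.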

The fair scheduler already calls the utilization hooks from
attach/detach paths and that takes care of the most important case we
are concerned about.

This patch relocates the call to utilization hook from
update_cfs_rq_load_avg() to task_tick_fair().

Testing:

Platform: Skylake, Intel(R) Core(TM) i7-6700 CPU @ 3.40GHz. Quad core (8
threads).

Distribution: Ubuntu 16.04, with intel_pstate=passive on the kernel
command line.
Based over 4.12-rc3.

All the below tests were run 3 times with and without this change over
4.12-rc3 and the average of all the runs is shown here (RESULT). The
number of times the schedutil hook is called during the tests is also
present in the table (UPDATE-COUNT).

Bench:

- pipe-test

  Command: perf bench sched pipe

  		4.12-rc3	4.12-rc3 + this patch	Change
  		--------	---------------------	----------

  RESULT	4.21		3.68			- 12.53 %

  UPDATE-COUNT	5405		886			- 83 %


- Hackbench

  Command: sync; sleep 1; for i in {1..20}; do sleep 2; \
  	   hackbench -P -g 1 | grep Time; done

  		4.12-rc3	4.12-rc3 + this patch	Change
  		--------	---------------------	----------

  RESULT	0.00925		0.00853			- 7.75 %

  UPDATE-COUNT	10342		1925			- 81 %


- ebizzy

  Command: ebizzy -t N -S 20

  		4.12-rc3	4.12-rc3 + this patch	Change
  		--------	---------------------	----------

  RESULT (N=8)	270221		247841			- 8 %
  RESULT (N=16)	252692		249839			- 1.1 %
  RESULT (N=24)	251151		254114			+ 1.1 %
  RESULT (N=32)	249274		250115			+ 0.3 %

  UPDATE-COUNT	41964		40279			- 4 %


- cyclictest (with latency)

  Command: cyclictest -q -t N -i 10000 --latency=1000000 -D 20

  		4.12-rc3	4.12-rc3 + this patch	Change
  		--------	---------------------	----------

  RESULT (N=8)	326		332			+ 1.75 %
  RESULT (N=16)	297		328			+ 10 %
  RESULT (N=24)	288		317			+ 9.9 %
  RESULT (N=32)	268		313			+ 17 %

  UPDATE-COUNT	28949		313			- 98.6 %

  Some more investigation (with ftrace) was done to check the cause of
  the regression in this case (the test below, run without latency,
  doesn't show the same regression).

  o The number of times we went into a cpuidle state and came out of it
    is almost the same with both images.
  o The average time between the sched_waking and sched_switch traces
    for the cyclictest threads is 19.8 us with rc3 and 13.5 us with this
    patch, so the response to the threads is actually better with this
    patch.
  o More suggestions are needed on how to verify why the regression is
    there.


- cyclictest (without latency)

  Command: cyclictest -q -t N -i 10000 --latency=0 -D 20

  		4.12-rc3	4.12-rc3 + this patch	Change
  		--------	---------------------	----------

  RESULT (N=8)	16.34		8.87			- 45.78 %
  RESULT (N=16)	13.91		6.41			- 54 %
  RESULT (N=24)	12.42		9.1			- 26.77 %
  RESULT (N=32)	13		3.34			- 74 %

  UPDATE-COUNT	26000		350			- 98.6 %

  Note: Most of the MAX values with rc3 were between 10-30 and those
  with this patch were between 2-20. There were a few random high values
  (100-400) with both images, and these were discarded when computing
  the averages.

Summary:

o The number of times the cpufreq utilization callbacks were called
  (UPDATE-COUNT) during the tests has come down sharply (81-98.6 %),
  except for ebizzy. And that's a lot of code removed from the
  scheduler's hot path.

o pipe-test and hackbench have shown a clear improvement with the patch.

o ebizzy has shown a regression (especially at N=8). Though during some
  individual runs, rc3 also reported lower numbers than the kernel with
  this patch.

o cyclictest with 0 latency (i.e. no deeper idle states) shows clear
  improvement, while with latency=1000000 there is considerable
  regression.

Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org>

---
I have also created a call graph showing where exactly the cpufreq hooks
get called from within the fair scheduler: https://yuml.me/dcc6d554.
---
 kernel/sched/fair.c | 13 ++++---------
 1 file changed, 4 insertions(+), 9 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 47a0c552c77b..62fe58b68e43 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -3276,7 +3276,7 @@ static inline int
 update_cfs_rq_load_avg(u64 now, struct cfs_rq *cfs_rq, bool update_freq)
 {
 	struct sched_avg *sa = &cfs_rq->avg;
-	int decayed, removed_load = 0, removed_util = 0;
+	int decayed, removed_load = 0;
 
 	if (atomic_long_read(&cfs_rq->removed_load_avg)) {
 		s64 r = atomic_long_xchg(&cfs_rq->removed_load_avg, 0);
@@ -3290,7 +3290,6 @@ update_cfs_rq_load_avg(u64 now, struct cfs_rq *cfs_rq, bool update_freq)
 		long r = atomic_long_xchg(&cfs_rq->removed_util_avg, 0);
 		sub_positive(&sa->util_avg, r);
 		sub_positive(&sa->util_sum, r * LOAD_AVG_MAX);
-		removed_util = 1;
 		set_tg_cfs_propagate(cfs_rq);
 	}
 
@@ -3301,9 +3300,6 @@ update_cfs_rq_load_avg(u64 now, struct cfs_rq *cfs_rq, bool update_freq)
 	cfs_rq->load_last_update_time_copy = sa->last_update_time;
 #endif
 
-	if (update_freq && (decayed || removed_util))
-		cfs_rq_util_change(cfs_rq);
-
 	return decayed || removed_load;
 }
 
@@ -3481,10 +3477,7 @@ update_cfs_rq_load_avg(u64 now, struct cfs_rq *cfs_rq, bool update_freq)
 #define UPDATE_TG	0x0
 #define SKIP_AGE_LOAD	0x0
 
-static inline void update_load_avg(struct sched_entity *se, int not_used1)
-{
-	cpufreq_update_util(rq_of(cfs_rq_of(se)), 0);
-}
+static inline void update_load_avg(struct sched_entity *se, int not_used1) {}
 
 static inline void
 enqueue_entity_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *se) {}
@@ -9021,6 +9014,8 @@ static void task_tick_fair(struct rq *rq, struct task_struct *curr, int queued)
 		entity_tick(cfs_rq, se, queued);
 	}
 
+	cpufreq_update_util(rq, 0);
+
 	if (static_branch_unlikely(&sched_numa_balancing))
 		task_tick_numa(rq, curr);
 }
-- 
2.13.0.70.g6367777092d9


* Re: [RFC] sched: fair: Don't update CPU frequency too frequently
  2017-06-01 11:34 [RFC] sched: fair: Don't update CPU frequency too frequently Viresh Kumar
@ 2017-06-01 12:22 ` Peter Zijlstra
  2017-06-07 12:06   ` Viresh Kumar
  0 siblings, 1 reply; 5+ messages in thread
From: Peter Zijlstra @ 2017-06-01 12:22 UTC (permalink / raw)
  To: Viresh Kumar
  Cc: Ingo Molnar, Rafael Wysocki, linaro-kernel, linux-kernel,
	Vincent Guittot, linux-pm, Juri Lelli, Dietmar.Eggemann,
	Morten.Rasmussen

On Thu, Jun 01, 2017 at 05:04:27PM +0530, Viresh Kumar wrote:
> This patch relocates the call to utilization hook from
> update_cfs_rq_load_avg() to task_tick_fair().

That's not right. Consider hardware where 'setting' the DVFS is a
'cheap' MSR write, doing that once every 10ms (HZ=100) is absurd.

We spoke about this problem in Pisa, the proposed solution was having
each driver provide a cost metric and the generic code doing a max
filter over the window constructed from that cost metric.


* Re: [RFC] sched: fair: Don't update CPU frequency too frequently
  2017-06-01 12:22 ` Peter Zijlstra
@ 2017-06-07 12:06   ` Viresh Kumar
  2017-06-07 15:43     ` Morten Rasmussen
  0 siblings, 1 reply; 5+ messages in thread
From: Viresh Kumar @ 2017-06-07 12:06 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, Rafael Wysocki, linaro-kernel, linux-kernel,
	Vincent Guittot, linux-pm, Juri Lelli, Dietmar.Eggemann,
	Morten.Rasmussen, patrick.bellasi

+ Patrick,

On 01-06-17, 14:22, Peter Zijlstra wrote:
> On Thu, Jun 01, 2017 at 05:04:27PM +0530, Viresh Kumar wrote:
> > This patch relocates the call to utilization hook from
> > update_cfs_rq_load_avg() to task_tick_fair().
> 
> That's not right. Consider hardware where 'setting' the DVFS is a
> 'cheap' MSR write, doing that once every 10ms (HZ=100) is absurd.

Yeah, that may be too much for such platforms. Actually we (/me & Vincent)
were worried about the current location of the utilization update hooks and
believed that they are getting called way too often. But yeah, this patch
optimized it way too much.

One of the goals of this patch was to avoid doing small OPP updates from
update_load_avg() which can potentially block significant utilization changes
(and hence big OPP changes) while a task is attached or detached, etc.

> We spoke about this problem in Pisa, the proposed solution was having
> each driver provide a cost metric and the generic code doing a max
> filter over the window constructed from that cost metric.

So we want to compensate for the lost opportunities (due to the
rate_limit_us window) by changing the OPP based on what happened in the
previous rate_limit_us window. I am not sure how that will help.

Case 1: A periodic RT task runs for a small time in the rate_limit_us window and
        the timing is such that we (almost) never go to the max OPP because of
        rate_limit_us window.

        Wouldn't a better solution for such a case be what Patrick [1]
        proposed earlier (i.e. ignore rate_limit_us for RT/DL tasks), as we
        would run at a high OPP when we really need it the most.


Case 2: A high utilization periodic CFS task runs for short duration and keeps
        on migrating to other CPUs. We miss the opportunity to update the OPP
        based on this task's utilization because of the rate_limit_us window and by
        the time we update the OPP again, this task is already migrated and so
        the utilization is low again.

        If the task has already migrated, why should we increase the OPP on
        the assumption that this task will come back to this CPU? There is a
        good chance that the selected (higher) OPP will not be utilized by
        the current load on the CPU.

        Also if this CFS task runs once every 2 (or more) ticks on the same
        CPU, then we are back to the same problem again.

        1         2         3         4
        |---------|---------|---------|---------|

           T                   T

        1,2,3,4 are representing the events on which we try to update the OPP
        and are placed rate_limit_us distance apart. And the task T happens to
        run between 1-2 and 3-4. We will not change the frequency until
        event 2 in this case as the rate_limit_us window isn't over yet. We
        go to a higher OPP at 2 (which is wasted on the current load)
        because T ran in the last window. At 3 we come back to the OPP
        proportional to the current load. And the next time T runs, we are
        still stuck at the low OPP. So instead of fixing the problem, we
        made it worse by wasting power unnecessarily.

Is there any case I am missing that you are concerned about?

-- 
viresh

[1] https://marc.info/?l=linux-kernel&m=148846976032099&w=2


* Re: [RFC] sched: fair: Don't update CPU frequency too frequently
  2017-06-07 12:06   ` Viresh Kumar
@ 2017-06-07 15:43     ` Morten Rasmussen
  2017-06-07 21:55       ` Rafael J. Wysocki
  0 siblings, 1 reply; 5+ messages in thread
From: Morten Rasmussen @ 2017-06-07 15:43 UTC (permalink / raw)
  To: Viresh Kumar
  Cc: Peter Zijlstra, Ingo Molnar, Rafael Wysocki, linaro-kernel,
	linux-kernel, Vincent Guittot, linux-pm, Juri Lelli,
	Dietmar.Eggemann, patrick.bellasi

On Wed, Jun 07, 2017 at 05:36:55PM +0530, Viresh Kumar wrote:
> + Patrick,
> 
> On 01-06-17, 14:22, Peter Zijlstra wrote:
> > On Thu, Jun 01, 2017 at 05:04:27PM +0530, Viresh Kumar wrote:
> > > This patch relocates the call to utilization hook from
> > > update_cfs_rq_load_avg() to task_tick_fair().
> > 
> > That's not right. Consider hardware where 'setting' the DVFS is a
> > 'cheap' MSR write, doing that once every 10ms (HZ=100) is absurd.
> 
> Yeah, that may be too much for such platforms. Actually we (/me & Vincent)
> were worried about the current location of the utilization update hooks and
> believed that they are getting called way too often. But yeah, this patch
> optimized it way too much.
> 
> One of the goals of this patch was to avoid doing small OPP updates from
> update_load_avg() which can potentially block significant utilization changes
> (and hence big OPP changes) while a task is attached or detached, etc.

To me that sounds like you want to apply a more clever filter to the
utilization updates than a simple rate limiter as Peter suggests below.
IMHO, it would be better to feed schedutil with all the available
information and improve the filtering policy there instead of trying to
hack the policy by tweaking the input data.

> > We spoke about this problem in Pisa, the proposed solution was having
> > each driver provide a cost metric and the generic code doing a max
> > filter over the window constructed from that cost metric.

Maybe it is possible to somehow let the rate at which we allow OPP
changes depend on the size of the 'error' delta between the current OPP
and what we need. So radical changes cause OPP changes immediately, and
small corrections have to wait longer?
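
One way to read that suggestion (purely illustrative; no such knob
exists in schedutil today) is to scale the required wait time inversely
with the relative OPP error, so a large delta bypasses the rate limit
and a small correction has to age:

```c
#include <stdbool.h>
#include <stdint.h>

/* Illustrative only: gate an OPP change on a delay that shrinks as the
 * relative error between the current and requested frequency grows. A
 * >= 100% error waits not at all; a tiny correction waits close to the
 * full rate limit. */
static bool opp_change_allowed(uint64_t elapsed_ns, uint64_t rate_limit_ns,
			       unsigned int cur_freq, unsigned int next_freq)
{
	unsigned int delta = next_freq > cur_freq ? next_freq - cur_freq
						  : cur_freq - next_freq;
	/* error in percent, relative to the current frequency */
	uint64_t err_pct = (uint64_t)delta * 100 / cur_freq;
	uint64_t required;

	if (err_pct >= 100)
		return true;	/* radical change: act immediately */

	/* required wait shrinks linearly with the error */
	required = rate_limit_ns * (100 - err_pct) / 100;
	return elapsed_ns >= required;
}
```

Linear scaling is just a placeholder here; any monotonic mapping from
error to delay would express the same policy.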


* Re: [RFC] sched: fair: Don't update CPU frequency too frequently
  2017-06-07 15:43     ` Morten Rasmussen
@ 2017-06-07 21:55       ` Rafael J. Wysocki
  0 siblings, 0 replies; 5+ messages in thread
From: Rafael J. Wysocki @ 2017-06-07 21:55 UTC (permalink / raw)
  To: Morten Rasmussen
  Cc: Viresh Kumar, Peter Zijlstra, Ingo Molnar, Rafael Wysocki,
	Lists linaro-kernel, Linux Kernel Mailing List, Vincent Guittot,
	Linux PM, Juri Lelli, Dietmar Eggemann, Patrick Bellasi

On Wed, Jun 7, 2017 at 5:43 PM, Morten Rasmussen
<morten.rasmussen@arm.com> wrote:
> On Wed, Jun 07, 2017 at 05:36:55PM +0530, Viresh Kumar wrote:
>> + Patrick,
>>
>> On 01-06-17, 14:22, Peter Zijlstra wrote:
>> > On Thu, Jun 01, 2017 at 05:04:27PM +0530, Viresh Kumar wrote:
>> > > This patch relocates the call to utilization hook from
>> > > update_cfs_rq_load_avg() to task_tick_fair().
>> >
>> > That's not right. Consider hardware where 'setting' the DVFS is a
>> > 'cheap' MSR write, doing that once every 10ms (HZ=100) is absurd.
>>
>> Yeah, that may be too much for such platforms. Actually we (/me & Vincent)
>> were worried about the current location of the utilization update hooks and
>> believed that they are getting called way too often. But yeah, this patch
>> optimized it way too much.
>>
>> One of the goals of this patch was to avoid doing small OPP updates from
>> update_load_avg() which can potentially block significant utilization changes
>> (and hence big OPP changes) while a task is attached or detached, etc.
>
> To me that sounds like you want to apply a more clever filter to the
> utilization updates than a simple rate limiter as Peter suggests below.
> IMHO, it would be better to feed schedutil with all the available
> information and improve the filtering policy there instead of trying to
> hack the policy by tweaking the input data.

Agreed.

Unless the tweaked input data would be used somewhere else too, that is.

>> > We spoke about this problem in Pisa, the proposed solution was having
>> > each driver provide a cost metric and the generic code doing a max
>> > filter over the window constructed from that cost metric.
>
> Maybe it is possible to somehow let the rate at which we allow OPP
> changes depend on the size of the 'error' delta between the current OPP
> and what we need. So radical changes cause OPP changes immediately, and
> small corrections have to wait longer?

That sounds reasonable to me.

Thanks,
Rafael

