* [PATCH] sched/fair: update scale invariance of pelt
@ 2015-11-24 13:49 Vincent Guittot
  2015-11-25  9:24 ` Peter Zijlstra
                   ` (2 more replies)
  0 siblings, 3 replies; 9+ messages in thread
From: Vincent Guittot @ 2015-11-24 13:49 UTC (permalink / raw)
  To: peterz, mingo, linux-kernel, yuyang.du, Morten.Rasmussen
  Cc: linaro-kernel, dietmar.eggemann, pjt, bsegall, Vincent Guittot

The current implementation of load tracking invariance scales the load
tracking value with the current frequency of the CPU, and with its uarch
performance (only for utilization).

One main result of the current formula is that the figures are capped by
the current capacity of the CPU. This limitation is the main reason for not
including the uarch invariance (arch_scale_cpu_capacity) in the calculation
of load_avg: capping the load can generate erroneous system load
statistics, as described in this example [1].

Instead of scaling the complete value of the PELT algorithm, we should only
scale the running time by the current capacity of the CPU. It seems more
correct to only scale the running time because the non-running time of a
task (sleeping or waiting for a runqueue) is the same whatever the current
freq and compute capacity of the CPU.
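
As a minimal standalone sketch of this idea (the 10ms duration and the
half-capacity value are illustrative; cap_scale() is as defined in
kernel/sched/fair.c):

#include <stdio.h>

typedef unsigned long long u64;

#define SCHED_CAPACITY_SHIFT	10
#define SCHED_CAPACITY_SCALE	(1 << SCHED_CAPACITY_SHIFT)	/* 1024 */

/* same definition as cap_scale() in kernel/sched/fair.c */
#define cap_scale(v, s)		((v)*(s) >> SCHED_CAPACITY_SHIFT)

int main(void)
{
	u64 delta = 10000;	/* 10ms of wall-clock running time, in us */

	/* running at 50% of max frequency: only half the work got done */
	delta = cap_scale(delta, SCHED_CAPACITY_SCALE / 2);
	printf("%llu us of work\n", delta);	/* 5000: 5ms */

	/* the non-running time of the task would be accrued unscaled */
	return 0;
}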

One main advantage of this change is that the load of a task can reach the
max value whatever the current freq and uarch of the CPU on which it runs;
it will just take more time at a lower freq than at max freq, or on a
"little" CPU compared to a "big" one. The load and the utilization stay
invariant across the system, so we can still compare them between CPUs, but
with a wider range of values.

With this change, we don't have to test whether a CPU is overloaded in
order to use one metric (util) or another (load), as all metrics are
always valid.

I have put below some examples of the duration needed to reach typical
load values according to the capacity of the CPU, with the current
implementation and with this patch.

Util (%)     max capacity   half capacity (mainline)   half capacity (w/ patch)
972 (95%)    138ms          not reachable              276ms
486 (47.5%)  30ms           138ms                      60ms
256 (25%)    13ms           32ms                       26ms

We can see that at half capacity, with this patch, the duration is simply
twice the duration at max capacity, whereas the current implementation
shows a non-linear increase of the duration.
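
As a sanity check on these figures (standalone user-space code, assuming
the standard PELT half-life of 32ms, i.e. y^32 = 0.5): an always-running
task builds utilization as util(t) = 1024 * (1 - y^t), so the time to
reach a given value is t = 32 * log2(1024 / (1024 - util)).

#include <math.h>
#include <stdio.h>

/* time (ms) for an always-running task to reach a given utilization */
static double time_to_reach(double util)
{
	return 32.0 * log2(1024.0 / (1024.0 - util));
}

int main(void)
{
	printf("972: %.0fms\n", time_to_reach(972));	/* ~138ms */
	printf("486: %.0fms\n", time_to_reach(486));	/* ~30ms */
	printf("256: %.0fms\n", time_to_reach(256));	/* ~13ms */
	return 0;
}

With the patch, running at half capacity simply halves the rate at which
running time accrues, hence the doubled durations; with the mainline
scaling, the contributions are halved instead, so the signal converges to
512 and 972 becomes unreachable.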

[1] https://lkml.org/lkml/2014/12/18/128

Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
---
 kernel/sched/fair.c | 28 +++++++++++++---------------
 1 file changed, 13 insertions(+), 15 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 824aa9f..f2a18e1 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -2560,10 +2560,9 @@ static __always_inline int
 __update_load_avg(u64 now, int cpu, struct sched_avg *sa,
 		  unsigned long weight, int running, struct cfs_rq *cfs_rq)
 {
-	u64 delta, scaled_delta, periods;
+	u64 delta, periods;
 	u32 contrib;
-	unsigned int delta_w, scaled_delta_w, decayed = 0;
-	unsigned long scale_freq, scale_cpu;
+	unsigned int delta_w, decayed = 0;
 
 	delta = now - sa->last_update_time;
 	/*
@@ -2584,8 +2583,10 @@ __update_load_avg(u64 now, int cpu, struct sched_avg *sa,
 		return 0;
 	sa->last_update_time = now;
 
-	scale_freq = arch_scale_freq_capacity(NULL, cpu);
-	scale_cpu = arch_scale_cpu_capacity(NULL, cpu);
+	if (running) {
+		delta = cap_scale(delta, arch_scale_freq_capacity(NULL, cpu));
+		delta = cap_scale(delta, arch_scale_cpu_capacity(NULL, cpu));
+	}
 
 	/* delta_w is the amount already accumulated against our next period */
 	delta_w = sa->period_contrib;
@@ -2601,16 +2602,15 @@ __update_load_avg(u64 now, int cpu, struct sched_avg *sa,
 		 * period and accrue it.
 		 */
 		delta_w = 1024 - delta_w;
-		scaled_delta_w = cap_scale(delta_w, scale_freq);
 		if (weight) {
-			sa->load_sum += weight * scaled_delta_w;
+			sa->load_sum += weight * delta_w;
 			if (cfs_rq) {
 				cfs_rq->runnable_load_sum +=
-						weight * scaled_delta_w;
+						weight * delta_w;
 			}
 		}
 		if (running)
-			sa->util_sum += scaled_delta_w * scale_cpu;
+			sa->util_sum += delta_w << SCHED_CAPACITY_SHIFT;
 
 		delta -= delta_w;
 
@@ -2627,25 +2627,23 @@ __update_load_avg(u64 now, int cpu, struct sched_avg *sa,
 
 		/* Efficiently calculate \sum (1..n_period) 1024*y^i */
 		contrib = __compute_runnable_contrib(periods);
-		contrib = cap_scale(contrib, scale_freq);
 		if (weight) {
 			sa->load_sum += weight * contrib;
 			if (cfs_rq)
 				cfs_rq->runnable_load_sum += weight * contrib;
 		}
 		if (running)
-			sa->util_sum += contrib * scale_cpu;
+			sa->util_sum += contrib << SCHED_CAPACITY_SHIFT;
 	}
 
 	/* Remainder of delta accrued against u_0` */
-	scaled_delta = cap_scale(delta, scale_freq);
 	if (weight) {
-		sa->load_sum += weight * scaled_delta;
+		sa->load_sum += weight * delta;
 		if (cfs_rq)
-			cfs_rq->runnable_load_sum += weight * scaled_delta;
+			cfs_rq->runnable_load_sum += weight * delta;
 	}
 	if (running)
-		sa->util_sum += scaled_delta * scale_cpu;
+		sa->util_sum += delta << SCHED_CAPACITY_SHIFT;
 
 	sa->period_contrib += delta;
 
-- 
1.9.1


* [PATCH] sched/fair: update scale invariance of PELT
@ 2017-03-28 15:35 Vincent Guittot
  2017-03-29  8:05 ` Vincent Guittot
  0 siblings, 1 reply; 9+ messages in thread
From: Vincent Guittot @ 2017-03-28 15:35 UTC (permalink / raw)
  To: peterz, mingo, linux-kernel
  Cc: dietmar.eggemann, Morten.Rasmussen, yuyang.du, pjt, bsegall,
	Vincent Guittot

The current implementation of load tracking invariance scales the
contribution with the current frequency of the CPU, and with its uarch
performance (only for utilization). One main result of this formula is that
the figures are capped by the current capacity of the CPU. Another one is
that load_avg is not invariant, because it is not scaled with uarch.

The util_avg of a periodic task that runs r time slots every p time slots
varies in the range:

    U * (1-y^r)/(1-y^p) * y^i < Utilization < U * (1-y^r)/(1-y^p)

where U is the max util_avg value (SCHED_CAPACITY_SCALE) and i is the
number of idle time slots (i = p - r).

At a lower capacity, the range becomes:

    U * C * (1-y^r')/(1-y^p) * y^i' < Utilization < U * C * (1-y^r')/(1-y^p)

with C reflecting the compute capacity ratio between the current capacity
and the max capacity, r' the (longer) running time at the lower capacity
and i' = p - r'.

So C tries to compensate for changes in (1-y^r'), but it can't be accurate.
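
As a numeric illustration of that inaccuracy (values chosen for the
example, standard 32ms half-life, standalone user-space code):

#include <math.h>
#include <stdio.h>

int main(void)
{
	double y = pow(0.5, 1.0 / 32.0);	/* PELT: y^32 = 0.5 */
	double U = 1024.0;			/* SCHED_CAPACITY_SCALE */

	/* task running r = 8ms every p = 32ms at max capacity */
	double max = U * (1 - pow(y, 8)) / (1 - pow(y, 32));
	printf("max capacity:  %.0f .. %.0f\n", max * pow(y, 32 - 8), max);

	/* same task at half capacity: C = 0.5, r' = 16ms, i' = 16ms */
	double maxc = 0.5 * U * (1 - pow(y, 16)) / (1 - pow(y, 32));
	printf("half capacity: %.0f .. %.0f\n", maxc * pow(y, 32 - 16), maxc);
	return 0;
}

This prints roughly 194 .. 326 at max capacity but 212 .. 300 at half
capacity: C restores the average but compresses the range, so the two
signals are not truly invariant.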

Instead of scaling the contribution value of the PELT algorithm, we should
scale the running time. The PELT signal aims to track the amount of
computation of tasks and/or runqueues, so it seems more correct to scale
the running time to reflect the effective amount of computation done since
the last update.

In order to be fully invariant, we need to apply the same amount of
running time and idle time whatever the current capacity. Because running
at a lower capacity implies that the task will run longer, we have to track
the amount of "stolen" idle time and apply it when the task becomes idle.

But once we have reached the maximum utilization value
(SCHED_CAPACITY_SCALE), the task is seen as an always-running task whatever
the capacity of the CPU (even at max compute capacity). In this case, we
can discard the "stolen" idle time, which becomes meaningless. In order to
cope with the rounding effects of the PELT algorithm, we take a margin and
consider a task with a utilization greater than 1000 (vs. 1024 max) as an
always-running task. Since util_sum saturates at about LOAD_AVG_MAX * 1024
for an always-running task, this is what the
"sa->util_sum < (LOAD_AVG_MAX * 1000)" check below implements.

Then, we can use the same algorithm for both utilization and load, and
simplify __update_load_avg now that the load of a task doesn't have to be
capped by the CPU uarch.

With this new algorithm, the responsiveness of PELT is improved when the
CPU is not running at max capacity. I have put below some examples of the
duration needed to reach typical load values according to the capacity of
the CPU, with the current implementation and with this patch.

Util (%)     max capacity   half capacity (mainline)   half capacity (w/ patch)
972 (95%)    138ms          not reachable              276ms
486 (47.5%)  30ms           138ms                      60ms
256 (25%)    13ms           32ms                       26ms

On my hikey (octa-core ARM platform) with the schedutil governor, the time
to reach the max OPP when starting from a null utilization decreases from
223ms with the current scale invariance down to 121ms with the new
algorithm. For this test, I have enabled arch_scale_freq for arm64.

Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
---
 include/linux/sched.h |  1 +
 kernel/sched/fair.c   | 49 ++++++++++++++++++++++++++++++++++---------------
 2 files changed, 35 insertions(+), 15 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index d67eee8..ca9d00f 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -313,6 +313,7 @@ struct load_weight {
  */
 struct sched_avg {
 	u64				last_update_time;
+	u64				stolen_idle_time;
 	u64				load_sum;
 	u32				util_sum;
 	u32				period_contrib;
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 31453d5..d1514cb 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -735,6 +735,7 @@ void init_entity_runnable_average(struct sched_entity *se)
 	struct sched_avg *sa = &se->avg;
 
 	sa->last_update_time = 0;
+	sa->stolen_idle_time = 0;
 	/*
 	 * sched_avg's period_contrib should be strictly less then 1024, so
 	 * we give it 1023 to make sure it is almost a period (1024us), and
@@ -2852,10 +2853,9 @@ static __always_inline int
 __update_load_avg(u64 now, int cpu, struct sched_avg *sa,
 		  unsigned long weight, int running, struct cfs_rq *cfs_rq)
 {
-	u64 delta, scaled_delta, periods;
+	u64 delta, periods;
 	u32 contrib;
-	unsigned int delta_w, scaled_delta_w, decayed = 0;
-	unsigned long scale_freq, scale_cpu;
+	unsigned int delta_w, decayed = 0;
 
 	delta = now - sa->last_update_time;
 	/*
@@ -2876,8 +2876,30 @@ __update_load_avg(u64 now, int cpu, struct sched_avg *sa,
 		return 0;
 	sa->last_update_time = now;
 
-	scale_freq = arch_scale_freq_capacity(NULL, cpu);
-	scale_cpu = arch_scale_cpu_capacity(NULL, cpu);
+	if (running) {
+		sa->stolen_idle_time += delta;
+		/*
+		 * scale the elapsed time to reflect the real amount of
+		 * computation
+		 */
+		delta = cap_scale(delta, arch_scale_freq_capacity(NULL, cpu));
+		delta = cap_scale(delta, arch_scale_cpu_capacity(NULL, cpu));
+
+		/*
+		 * Track the amount of stolen idle time due to running at
+		 * lower capacity
+		 */
+		sa->stolen_idle_time -= delta;
+	} else if (!weight) {
+		if (sa->util_sum < (LOAD_AVG_MAX * 1000)) {
+			/*
+			 * Add the idle time stolen by running at lower compute
+			 * capacity
+			 */
+			delta += sa->stolen_idle_time;
+		}
+		sa->stolen_idle_time = 0;
+	}
 
 	/* delta_w is the amount already accumulated against our next period */
 	delta_w = sa->period_contrib;
@@ -2893,16 +2915,15 @@ __update_load_avg(u64 now, int cpu, struct sched_avg *sa,
 		 * period and accrue it.
 		 */
 		delta_w = 1024 - delta_w;
-		scaled_delta_w = cap_scale(delta_w, scale_freq);
 		if (weight) {
-			sa->load_sum += weight * scaled_delta_w;
+			sa->load_sum += weight * delta_w;
 			if (cfs_rq) {
 				cfs_rq->runnable_load_sum +=
-						weight * scaled_delta_w;
+						weight * delta_w;
 			}
 		}
 		if (running)
-			sa->util_sum += scaled_delta_w * scale_cpu;
+			sa->util_sum += delta_w << SCHED_CAPACITY_SHIFT;
 
 		delta -= delta_w;
 
@@ -2919,25 +2940,23 @@ __update_load_avg(u64 now, int cpu, struct sched_avg *sa,
 
 		/* Efficiently calculate \sum (1..n_period) 1024*y^i */
 		contrib = __compute_runnable_contrib(periods);
-		contrib = cap_scale(contrib, scale_freq);
 		if (weight) {
 			sa->load_sum += weight * contrib;
 			if (cfs_rq)
 				cfs_rq->runnable_load_sum += weight * contrib;
 		}
 		if (running)
-			sa->util_sum += contrib * scale_cpu;
+			sa->util_sum += contrib << SCHED_CAPACITY_SHIFT;
 	}
 
 	/* Remainder of delta accrued against u_0` */
-	scaled_delta = cap_scale(delta, scale_freq);
 	if (weight) {
-		sa->load_sum += weight * scaled_delta;
+		sa->load_sum += weight * delta;
 		if (cfs_rq)
-			cfs_rq->runnable_load_sum += weight * scaled_delta;
+			cfs_rq->runnable_load_sum += weight * delta;
 	}
 	if (running)
-		sa->util_sum += scaled_delta * scale_cpu;
+		sa->util_sum += delta << SCHED_CAPACITY_SHIFT;
 
 	sa->period_contrib += delta;
 
-- 
2.7.4


Thread overview: 9+ messages
2015-11-24 13:49 [PATCH] sched/fair: update scale invariance of pelt Vincent Guittot
2015-11-25  9:24 ` Peter Zijlstra
2015-11-25 11:25   ` Vincent Guittot
2015-12-08 17:04 ` Morten Rasmussen
2015-12-15 10:18   ` Vincent Guittot
2015-12-14  0:26 ` Yuyang Du
2015-12-15 10:21   ` Vincent Guittot
2017-03-28 15:35 [PATCH] sched/fair: update scale invariance of PELT Vincent Guittot
2017-03-29  8:05 ` Vincent Guittot
