[PATCH 1/3] Added runqueue clock normalized with cpufreq

public inbox for linux-kernel@vger.kernel.org
 help / color / mirror / Atom feed

* [PATCH 1/3] Added runqueue clock normalized with cpufreq
@ 2010-12-17 13:02 Harald Gustafsson
  2010-12-17 13:02 ` [PATCH 2/3] cpufreq normalized runtime to enforce runtime cycles also at lower frequencies Harald Gustafsson
                   ` (2 more replies)
  0 siblings, 3 replies; 23+ messages in thread
From: Harald Gustafsson @ 2010-12-17 13:02 UTC (permalink / raw)
  To: Dario Faggioli, Peter Zijlstra, Harald Gustafsson
  Cc: linux-kernel, Ingo Molnar, Thomas Gleixner, Claudio Scordino,
	Michael Trimarchi, Fabio Checconi, Tommaso Cucinotta, Juri Lelli,
	Dario Faggioli, Harald Gustafsson

This is a request for comments on additions to sched deadline v3 patches.
Deadline scheduler is the first scheduler (I think) we introduce in Linux that
specifies the runtime in time and not only as a weight or a relation.
I have introduced a normalized runtime clock dependent on the CPU frequency.
This is used, in [PATCH 2/3], to calculate the deadline thread's runtime
so that approximately the same number of cycles are giving to the thread
independent of the CPU frequency. 

I suggest that this is important for users of hard reservation based schedulers
that the intended amount of work can be accomplished independent of the CPU frequency.
The usage of CPU frequency scaling is important on mobile devices and hence 
the combination of deadline scheduler and cpufreq should be solved.

This patch series applies on a backported sched deadline v3 to a 2.6.34 kernel.
That backport can be made available if anyone is interested. It also runs on
my dual core ARM system.

So before I do this for the linux tip I would welcome a discussion about if this
is a good idea and also suggestions on how to improve this.

This first patch introduce the normalized runtime clock, this could be made
lockless instead if requested.

/Harald

Change-Id: Ie0d9b8533cf4e5720eefd3af860d3a8577101907

Signed-off-by: Harald Gustafsson <harald.gustafsson@ericsson.com>
---
 kernel/sched.c |  103 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++
 1 files changed, 103 insertions(+), 0 deletions(-)

diff --git a/kernel/sched.c b/kernel/sched.c
index c075664..2816371 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -72,6 +72,7 @@
 #include <linux/ctype.h>
 #include <linux/ftrace.h>
 #include <linux/slab.h>
+#include <linux/cpufreq.h>
 #include <linux/cgroup_cpufreq.h>
 
 #include <asm/tlb.h>
@@ -596,6 +597,16 @@ struct rq {
 
 	u64 clock;
 
+        /* Need to keep track of clock cycles since
+	 * dl need to work with cpufreq, is derived based
+	 * on rq clock and cpufreq.
+	 */
+        u64 clock_norm;
+        u64 delta_clock_norm;
+        u64 delta_clock;
+        /* norm factor is in the Q31 format */
+        u64 norm_factor; 
+
 	atomic_t nr_iowait;
 
 #ifdef CONFIG_SMP
@@ -697,7 +708,17 @@ static inline int cpu_of(struct rq *rq)
 
 inline void update_rq_clock(struct rq *rq)
 {
+        u64 delta_clock = rq->delta_clock;
 	rq->clock = sched_clock_cpu(cpu_of(rq));
+#ifndef CONFIG_CPU_FREQ
+	rq->clock_norm = rq->clock;
+#else
+	rq->delta_clock = rq->clock;
+	rq->clock_norm += rq->delta_clock_norm;
+	rq->delta_clock_norm = 0;
+	if(delta_clock !=0)
+	    rq->clock_norm += ((rq->delta_clock - delta_clock) * rq->norm_factor) >> 32;
+#endif /*CONFIG_CPU_FREQ*/
 }
 
 /*
@@ -8115,6 +8136,79 @@ static void init_tg_rt_entry(struct task_group *tg, struct rt_rq *rt_rq,
 }
 #endif
 
+#ifdef CONFIG_CPU_FREQ
+static int rq_clock_cpufreq_notify(struct notifier_block *nb, unsigned long val,
+				   void *data)
+{
+	struct cpufreq_policy *policy;
+	struct cpufreq_freqs *freq = data;
+	struct rq *rq;
+	u64 delta_clock, temp;
+	int cpu=freq->cpu;
+	unsigned long flags;
+
+	printk(KERN_INFO "rq_clock_cpufreq_notify called for cpu %i\n", cpu);
+
+	if (val != CPUFREQ_POSTCHANGE)
+		return 0;
+
+	if (freq->old == freq->new)
+	        return 0;
+
+	/* Update cpufreq_index with current speed */
+	policy = cpufreq_cpu_get(cpu);
+
+	/* calculate the norm factor in Q31 base */
+	temp = (((u64) freq->new) << 32);
+	temp = div_u64(temp, policy->cpuinfo.max_freq);
+
+	if(policy->shared_type == CPUFREQ_SHARED_TYPE_ALL) {
+	    for_each_cpu(cpu, policy->cpus) {
+		rq = cpu_rq(cpu);
+		raw_spin_lock_irqsave(&rq->lock, flags);
+		delta_clock = rq->delta_clock;
+		rq->delta_clock = sched_clock_cpu(freq->cpu);
+		if(delta_clock != 0)
+		    rq->delta_clock_norm += ((rq->delta_clock - delta_clock) * rq->norm_factor) >> 32; 
+		rq->norm_factor = temp;
+		raw_spin_unlock_irqrestore(&rq->lock, flags);
+		printk(KERN_INFO "cpufreq transition cpu:%i, norm:%llu, cycles:%llu\n", 
+		       freq->cpu, rq->norm_factor, rq->delta_clock_norm);
+	    }
+	}
+	else {
+	    raw_spin_lock_irqsave(&rq->lock, flags);
+	    rq = cpu_rq(cpu);
+	    delta_clock = rq->delta_clock;
+	    rq->delta_clock = sched_clock_cpu(freq->cpu);
+	    if(delta_clock != 0)
+		rq->delta_clock_norm += ((rq->delta_clock - delta_clock) * rq->norm_factor) >> 32; 
+	    rq->norm_factor = temp;
+	    raw_spin_unlock_irqrestore(&rq->lock, flags);
+	    printk(KERN_INFO "cpufreq transition cpu:%i, norm:%llu, cycles:%llu\n", 
+		   freq->cpu, rq->norm_factor, rq->delta_clock_norm);
+	}
+
+	cpufreq_cpu_put(policy);
+	return 0;
+}
+
+static struct notifier_block cpufreq_notifier = {
+	.notifier_call = rq_clock_cpufreq_notify,
+};
+
+static int __init init_rq_clock_cpufreq(void) 
+{
+        int ret=cpufreq_register_notifier(&cpufreq_notifier,
+					  CPUFREQ_TRANSITION_NOTIFIER);
+
+	//FIXME should set norm_factor etc here as well if not max speed
+	printk(KERN_INFO "init_rq_clock_cpufreq called ret:%i\n", ret);
+	return ret;
+}
+late_initcall(init_rq_clock_cpufreq);
+#endif /*CONFIG_CPU_FREQ*/
+
 void __init sched_init(void)
 {
 	int i, j;
@@ -8243,6 +8337,11 @@ void __init sched_init(void)
 #endif
 		init_rq_hrtick(rq);
 		atomic_set(&rq->nr_iowait, 0);
+
+		rq->norm_factor = 1ULL <<32;
+		rq->clock_norm = 0;
+		rq->delta_clock_norm = 0;
+		rq->delta_clock = 0;
 	}
 
 	set_load_weight(&init_task);
@@ -8255,6 +8354,10 @@ void __init sched_init(void)
 	open_softirq(SCHED_SOFTIRQ, run_rebalance_domains);
 #endif
 
+#ifdef CONFIG_CPU_FREQ
+	init_rq_clock_cpufreq();
+#endif /*CONFIG_CPU_FREQ*/
+
 	/*
 	 * The boot idle thread does lazy MMU switching as well:
 	 */
-- 
1.7.0.4


^ permalink raw reply related	[flat|nested] 23+ messages in thread

* [PATCH 2/3] cpufreq normalized runtime to enforce runtime cycles also at lower frequencies.
  2010-12-17 13:02 [PATCH 1/3] Added runqueue clock normalized with cpufreq Harald Gustafsson
@ 2010-12-17 13:02 ` Harald Gustafsson
  2010-12-17 13:02 ` [PATCH 3/3] sched trace updated with normalized clock info Harald Gustafsson
  2010-12-17 14:29 ` [PATCH 1/3] Added runqueue clock normalized with cpufreq Peter Zijlstra
  2 siblings, 0 replies; 23+ messages in thread
From: Harald Gustafsson @ 2010-12-17 13:02 UTC (permalink / raw)
  To: Dario Faggioli, Peter Zijlstra, Harald Gustafsson
  Cc: linux-kernel, Ingo Molnar, Thomas Gleixner, Claudio Scordino,
	Michael Trimarchi, Fabio Checconi, Tommaso Cucinotta, Juri Lelli,
	Dario Faggioli, Harald Gustafsson

This patch do the actual changes to sched deadline v3 to
utilize the normalized runtime clock. Note that the 
deadline/periods still use the regular runtime clock. 

Change-Id: I75c88676e9e18a71d94d6c4e779b376a7ac0615f

Signed-off-by: Harald Gustafsson <harald.gustafsson@ericsson.com>
---
 include/linux/sched.h |    6 +++
 kernel/sched.c        |    2 +
 kernel/sched_dl.c     |   82 +++++++++++++++++++++++++++++++++++++++++++++---
 3 files changed, 84 insertions(+), 6 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 89a158e..167771c 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1301,6 +1301,12 @@ struct sched_dl_entity {
 	u64 deadline;		/* absolute deadline for this instance	*/
 	unsigned int flags;	/* specifying the scheduler behaviour   */
 
+        /*
+	 * CPU frequency normalized start time.
+	 * Put it inside DL since only one using it.
+	 */
+        u64 exec_start_norm;
+
 	/*
 	 * Some bool flags:
 	 *
diff --git a/kernel/sched.c b/kernel/sched.c
index 2816371..ddb18d2 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -2671,6 +2671,7 @@ static void __sched_fork(struct task_struct *p)
 	p->dl.dl_deadline = p->dl.deadline = 0;
 	p->dl.dl_period = 0;
 	p->dl.flags = 0;
+	p->dl.exec_start_norm = 0;
 
 	INIT_LIST_HEAD(&p->rt.run_list);
 	p->se.on_rq = 0;
@@ -8475,6 +8476,7 @@ void normalize_rt_tasks(void)
 			continue;
 
 		p->se.exec_start		= 0;
+		p->dl.exec_start_norm		= 0;
 #ifdef CONFIG_SCHEDSTATS
 		p->se.wait_start		= 0;
 		p->se.sleep_start		= 0;
diff --git a/kernel/sched_dl.c b/kernel/sched_dl.c
index 5aa5a52..049c001 100644
--- a/kernel/sched_dl.c
+++ b/kernel/sched_dl.c
@@ -333,6 +333,40 @@ static bool dl_entity_overflow(struct sched_dl_entity *dl_se,
 }
 
 /*
+ * A cpu freq normalized overflow check, see dl_entity_overflow
+ * function for details. Check against current cpu frequency.
+ * For this to hold, we must check if:
+ *   runtime / (norm_factor * (deadline - t)) < dl_runtime / dl_deadline .
+ */
+static bool dl_entity_overflow_norm(struct sched_dl_entity *dl_se,
+				    struct sched_dl_entity *pi_se, u64 t,
+				    struct rq *rq)
+{
+	u64 left, right;
+
+	/*
+	 * left and right are the two sides of the equation above,
+	 * after a bit of shuffling to use multiplications instead
+	 * of divisions.
+	 *
+	 * Note that none of the time values involved in the two
+	 * multiplications are absolute: dl_deadline and dl_runtime
+	 * are the relative deadline and the maximum runtime of each
+	 * instance, runtime is the runtime left for the last instance
+	 * and (deadline - t), since t is rq->clock, is the time left
+	 * to the (absolute) deadline. Therefore, overflowing the u64
+	 * type is very unlikely to occur in both cases.
+	 * Likewise the runtime multiplied with the norm factor is
+	 * for the same reasons unlikely to overflow u64 and since
+	 * norm factor is max 1<<32.
+	 */
+	left = pi_se->dl_deadline * dl_se->runtime;
+	right = (dl_se->deadline - t) * ((pi_se->dl_runtime * rq->norm_factor) >> 32);
+
+	return dl_time_before(right, left);
+}
+
+/*
  * When a -deadline entity is queued back on the runqueue, its runtime and
  * deadline might need updating.
  *
@@ -358,12 +392,16 @@ static void update_dl_entity(struct sched_dl_entity *dl_se,
 	}
 
 	if (dl_time_before(dl_se->deadline, rq->clock) ||
-	    dl_entity_overflow(dl_se, pi_se, rq->clock)) {
+	    dl_entity_overflow_norm(dl_se, pi_se, rq->clock, rq)) {
 		dl_se->deadline = rq->clock + pi_se->dl_deadline;
 		dl_se->runtime = pi_se->dl_runtime;
 		overflow = 1;
 	}
 #ifdef CONFIG_SCHEDSTATS
+	if(dl_entity_overflow(dl_se, pi_se, rq->clock))
+	    overflow |= 2;
+	if(dl_entity_overflow_norm(dl_se, pi_se, rq->clock, rq))
+	    overflow |= 4;
 	trace_sched_stat_updt_dl(dl_task_of(dl_se), rq->clock, overflow);
 #endif
 }
@@ -549,10 +587,15 @@ int dl_runtime_exceeded(struct rq *rq, struct sched_dl_entity *dl_se)
 	 * executing, then we have already used some of the runtime of
 	 * the next instance. Thus, if we do not account that, we are
 	 * stealing bandwidth from the system at each deadline miss!
+	 *
+	 * Use normalization of deadline and clock to compensate the
+	 * runtime. Here assuming that the whole exceeded runtime is
+	 * done with current cpu frequency.
 	 */
 	if (dmiss) {
 		dl_se->runtime = rorun ? dl_se->runtime : 0;
-		dl_se->runtime -= rq->clock - dl_se->deadline;
+		dl_se->runtime -= ((rq->clock - dl_se->deadline)
+				   * rq->norm_factor) >> 32;
 	}
 
 	return 1;
@@ -576,31 +619,46 @@ static void update_curr_dl(struct rq *rq)
 {
 	struct task_struct *curr = rq->curr;
 	struct sched_dl_entity *dl_se = &curr->dl;
-	u64 delta_exec;
+	u64 delta_exec, delta_exec_norm;
 
 	if (!dl_task(curr) || !on_dl_rq(dl_se))
 		return;
 
+	/*
+	 * Maintaine the unnormalized execution statistics
+	 * to keep user space happy.
+	 *
+	 * Do cpu frequency normalized runtime handling for
+	 * the actual DL scheduling to enforce the CPU
+	 * max frequency runtime cycles even at lower freq.
+	 */
+
 	delta_exec = rq->clock - curr->se.exec_start;
 	if (unlikely((s64)delta_exec < 0))
 		delta_exec = 0;
 
+	delta_exec_norm = rq->clock_norm - curr->dl.exec_start_norm;
+	if (unlikely((s64)delta_exec_norm < 0))
+		delta_exec_norm = 0;
+
 	schedstat_set(curr->se.exec_max,
 		      max(curr->se.exec_max, delta_exec));
 
 	curr->se.sum_exec_runtime += delta_exec;
 	schedstat_add(&rq->dl, exec_clock, delta_exec);
 	account_group_exec_runtime(curr, delta_exec);
-	trace_sched_stat_runtime_dl(curr, rq->clock, delta_exec);
+	trace_sched_stat_runtime_dl(curr, rq->clock, delta_exec_norm);
 
 	curr->se.exec_start = rq->clock;
+	curr->dl.exec_start_norm = rq->clock_norm;
 	cpuacct_charge(curr, delta_exec);
 	cg_cpufreq_charge(curr, delta_exec, curr->se.exec_start);
 
 	sched_dl_avg_update(rq, delta_exec);
 
 	dl_se->stats.tot_rtime += delta_exec;
-	dl_se->runtime -= delta_exec;
+
+	dl_se->runtime -= delta_exec_norm;
 	if (dl_runtime_exceeded(rq, dl_se)) {
 		__dequeue_task_dl(rq, curr, 0);
 		if (likely(start_dl_timer(dl_se, !!curr->pi_top_task)))
@@ -865,10 +923,12 @@ static long wait_interval_dl(struct task_struct *p, struct timespec *rqtp,
 	 * instant. This involves a division (to calculate the reverse of the
 	 * task's bandwidth), but it is worth to notice that it is quite
 	 * unlikely that we get into here very often.
+	 * Use normalized overflow check since used for setting the timer.
 	 */
+
 	wakeup = timespec_to_ns(rqtp);
 	if (dl_time_before(wakeup, dl_se->deadline) &&
-	    !dl_entity_overflow(dl_se, dl_se, wakeup)) {
+	    !dl_entity_overflow_norm(dl_se, dl_se, wakeup, rq)) {
 		u64 ibw = (u64)dl_se->runtime * dl_se->dl_period;
 
 		ibw = div_u64(ibw, dl_se->dl_runtime);
@@ -989,6 +1049,13 @@ static void check_preempt_curr_dl(struct rq *rq, struct task_struct *p,
 #ifdef CONFIG_SCHED_HRTICK
 static void start_hrtick_dl(struct rq *rq, struct task_struct *p)
 {
+        /*
+	 * Don't use normalized runtime to calculate the
+	 * delta, since the clock frequency might increase
+	 * and we then misses our needed tick time.
+	 * Worst case we will be ticked an extra time.
+	 * We also don't need to do a u64 division.
+	 */
 	s64 delta = p->dl.dl_runtime - p->dl.runtime;
 
 	if (delta > 10000)
@@ -1037,6 +1104,7 @@ struct task_struct *pick_next_task_dl(struct rq *rq)
 
 	p = dl_task_of(dl_se);
 	p->se.exec_start = rq->clock;
+	p->dl.exec_start_norm = rq->clock_norm;
 
 	/* Running task will never be pushed. */
 	if (p)
@@ -1061,6 +1129,7 @@ static void put_prev_task_dl(struct rq *rq, struct task_struct *p)
 
 	update_curr_dl(rq);
 	p->se.exec_start = 0;
+	p->dl.exec_start_norm = 0;
 
 	if (on_dl_rq(&p->dl) && p->dl.nr_cpus_allowed > 1)
 		enqueue_pushable_dl_task(rq, p);
@@ -1102,6 +1171,7 @@ static void set_curr_task_dl(struct rq *rq)
 	struct task_struct *p = rq->curr;
 
 	p->se.exec_start = rq->clock;
+	p->dl.exec_start_norm = rq->clock_norm;
 
 	/* You can't push away the running task */
 	dequeue_pushable_dl_task(rq, p);
-- 
1.7.0.4


^ permalink raw reply related	[flat|nested] 23+ messages in thread

* [PATCH 3/3] sched trace updated with normalized clock info.
  2010-12-17 13:02 [PATCH 1/3] Added runqueue clock normalized with cpufreq Harald Gustafsson
  2010-12-17 13:02 ` [PATCH 2/3] cpufreq normalized runtime to enforce runtime cycles also at lower frequencies Harald Gustafsson
@ 2010-12-17 13:02 ` Harald Gustafsson
  2010-12-17 14:29 ` [PATCH 1/3] Added runqueue clock normalized with cpufreq Peter Zijlstra
  2 siblings, 0 replies; 23+ messages in thread
From: Harald Gustafsson @ 2010-12-17 13:02 UTC (permalink / raw)
  To: Dario Faggioli, Peter Zijlstra, Harald Gustafsson
  Cc: linux-kernel, Ingo Molnar, Thomas Gleixner, Claudio Scordino,
	Michael Trimarchi, Fabio Checconi, Tommaso Cucinotta, Juri Lelli,
	Dario Faggioli, Harald Gustafsson

Updated the sched deadline v3 traces with the normalized runtime clock.
The delta execution runtime and the last start of execution is also
using the normalized clock.

Change-Id: I6f05a76ad876e8895f3f24940f3ee07f1cb0e8b8

Signed-off-by: Harald Gustafsson <harald.gustafsson@ericsson.com>
---
 include/trace/events/sched.h |   25 +++++++++++++++++--------
 kernel/sched.c               |    2 +-
 kernel/sched_dl.c            |    2 +-
 3 files changed, 19 insertions(+), 10 deletions(-)

diff --git a/include/trace/events/sched.h b/include/trace/events/sched.h
index 3307353..3c766eb 100644
--- a/include/trace/events/sched.h
+++ b/include/trace/events/sched.h
@@ -379,16 +379,17 @@ TRACE_EVENT(sched_stat_runtime,
  */
 TRACE_EVENT(sched_switch_dl,
 
-	TP_PROTO(u64 clock,
+	TP_PROTO(u64 clock, u64 clock_norm,
 		 struct task_struct *prev,
 		 struct task_struct *next),
 
-	TP_ARGS(clock, prev, next),
+	TP_ARGS(clock, clock_norm, prev, next),
 
 	TP_STRUCT__entry(
 		__array(	char,	prev_comm,	TASK_COMM_LEN	)
 		__field(	pid_t,	prev_pid			)
 		__field(	u64,	clock				)
+		__field(	u64,	clock_norm			)
 		__field(	s64,	prev_rt				)
 		__field(	u64,	prev_dl				)
 		__field(	long,	prev_state			)
@@ -402,6 +403,7 @@ TRACE_EVENT(sched_switch_dl,
 		memcpy(__entry->next_comm, next->comm, TASK_COMM_LEN);
 		__entry->prev_pid	= prev->pid;
 		__entry->clock		= clock;
+		__entry->clock_norm	= clock_norm;
 		__entry->prev_rt	= prev->dl.runtime;
 		__entry->prev_dl	= prev->dl.deadline;
 		__entry->prev_state	= prev->state;
@@ -412,7 +414,7 @@ TRACE_EVENT(sched_switch_dl,
 	),
 
 	TP_printk("prev_comm=%s prev_pid=%d prev_rt=%Ld [ns] prev_dl=%Lu [ns] prev_state=%s ==> "
-		  "next_comm=%s next_pid=%d next_rt=%Ld [ns] next_dl=%Lu [ns] clock=%Lu [ns]",
+		  "next_comm=%s next_pid=%d next_rt=%Ld [ns] next_dl=%Lu [ns] clock=%Lu (%Lu) [ns]",
 		  __entry->prev_comm, __entry->prev_pid, (long long)__entry->prev_rt,
 		  (unsigned long long)__entry->prev_dl, __entry->prev_state ?
 		    __print_flags(__entry->prev_state, "|",
@@ -420,7 +422,8 @@ TRACE_EVENT(sched_switch_dl,
 				{ 16, "Z" }, { 32, "X" }, { 64, "x" },
 				{ 128, "W" }) : "R",
 		  __entry->next_comm, __entry->next_pid, (long long)__entry->next_rt,
-		  (unsigned long long)__entry->next_dl, (unsigned long long)__entry->clock)
+		  (unsigned long long)__entry->next_dl, (unsigned long long)__entry->clock,
+		  (unsigned long long)__entry->clock_norm)
 );
 
 /*
@@ -655,9 +658,9 @@ DEFINE_EVENT(sched_stat_template_dl, sched_stat_updt_dl,
  */
 TRACE_EVENT(sched_stat_runtime_dl,
 
-	TP_PROTO(struct task_struct *p, u64 clock, u64 last),
+	TP_PROTO(struct task_struct *p, u64 clock, u64 last, u64 clock_norm),
 
-	TP_ARGS(p, clock, last),
+	TP_ARGS(p, clock, last, clock_norm),
 
 	TP_STRUCT__entry(
 		__array(	char,	comm,	TASK_COMM_LEN	)
@@ -667,6 +670,8 @@ TRACE_EVENT(sched_stat_runtime_dl,
 		__field(	s64,	rt			)
 		__field(	u64,	dl			)
 		__field(	u64,	start			)
+		__field(	u64,	start_norm		)
+		__field(	u64,	clock_norm		)
         ),
 
 	TP_fast_assign(
@@ -677,12 +682,16 @@ TRACE_EVENT(sched_stat_runtime_dl,
 		__entry->rt		= p->dl.runtime - last;
 		__entry->dl		= p->dl.deadline;
 		__entry->start		= p->se.exec_start;
+		__entry->start_norm	= p->dl.exec_start_norm;
+		__entry->clock_norm	= clock_norm;
 	),
 
-	TP_printk("comm=%s pid=%d clock=%Lu [ns] delta_exec=%Lu [ns] rt=%Ld [ns] dl=%Lu [ns] exec_start=%Lu [ns]",
+	TP_printk("comm=%s pid=%d clock=%Lu (%Lu) [ns] delta_exec=%Lu [ns] rt=%Ld [ns] dl=%Lu [ns] exec_start=%Lu (%Lu) [ns]",
 		  __entry->comm, __entry->pid, (unsigned long long)__entry->clock,
+		  (unsigned long long)__entry->clock_norm,
 		  (unsigned long long)__entry->last, (long long)__entry->rt,
-		  (unsigned long long)__entry->dl, (unsigned long long)__entry->start)
+		  (unsigned long long)__entry->dl, (unsigned long long)__entry->start,
+		  (unsigned long long)__entry->start_norm)
 );
 
 /*
diff --git a/kernel/sched.c b/kernel/sched.c
index ddb18d2..b6b8ccc 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -3027,7 +3027,7 @@ context_switch(struct rq *rq, struct task_struct *prev,
 	prepare_task_switch(rq, prev, next);
 	trace_sched_switch(rq, prev, next);
 	if (unlikely(__dl_task(prev) || __dl_task(next)))
-		trace_sched_switch_dl(rq->clock, prev, next);
+		trace_sched_switch_dl(rq->clock, rq->clock_norm, prev, next);
 	mm = next->mm;
 	oldmm = prev->active_mm;
 	/*
diff --git a/kernel/sched_dl.c b/kernel/sched_dl.c
index 049c001..b37a905 100644
--- a/kernel/sched_dl.c
+++ b/kernel/sched_dl.c
@@ -647,7 +647,7 @@ static void update_curr_dl(struct rq *rq)
 	curr->se.sum_exec_runtime += delta_exec;
 	schedstat_add(&rq->dl, exec_clock, delta_exec);
 	account_group_exec_runtime(curr, delta_exec);
-	trace_sched_stat_runtime_dl(curr, rq->clock, delta_exec_norm);
+	trace_sched_stat_runtime_dl(curr, rq->clock, delta_exec_norm, rq->clock_norm);
 
 	curr->se.exec_start = rq->clock;
 	curr->dl.exec_start_norm = rq->clock_norm;
-- 
1.7.0.4


^ permalink raw reply related	[flat|nested] 23+ messages in thread

* Re: [PATCH 1/3] Added runqueue clock normalized with cpufreq
  2010-12-17 13:02 [PATCH 1/3] Added runqueue clock normalized with cpufreq Harald Gustafsson
  2010-12-17 13:02 ` [PATCH 2/3] cpufreq normalized runtime to enforce runtime cycles also at lower frequencies Harald Gustafsson
  2010-12-17 13:02 ` [PATCH 3/3] sched trace updated with normalized clock info Harald Gustafsson
@ 2010-12-17 14:29 ` Peter Zijlstra
  2010-12-17 14:32   ` Peter Zijlstra
                     ` (2 more replies)
  2 siblings, 3 replies; 23+ messages in thread
From: Peter Zijlstra @ 2010-12-17 14:29 UTC (permalink / raw)
  To: Harald Gustafsson
  Cc: Dario Faggioli, Harald Gustafsson, linux-kernel, Ingo Molnar,
	Thomas Gleixner, Claudio Scordino, Michael Trimarchi,
	Fabio Checconi, Tommaso Cucinotta, Juri Lelli, Dario Faggioli

On Fri, 2010-12-17 at 14:02 +0100, Harald Gustafsson wrote:

> This is a request for comments on additions to sched deadline v3 patches.
> Deadline scheduler is the first scheduler (I think) we introduce in Linux that
> specifies the runtime in time and not only as a weight or a relation.
> I have introduced a normalized runtime clock dependent on the CPU frequency.
> This is used, in [PATCH 2/3], to calculate the deadline thread's runtime
> so that approximately the same number of cycles are giving to the thread
> independent of the CPU frequency. 
> 
> I suggest that this is important for users of hard reservation based schedulers
> that the intended amount of work can be accomplished independent of the CPU frequency.
> The usage of CPU frequency scaling is important on mobile devices and hence 
> the combination of deadline scheduler and cpufreq should be solved.

> So before I do this for the linux tip I would welcome a discussion about if this
> is a good idea and also suggestions on how to improve this. 

I'm thinking this is going about it totally wrong..

Solving the CPUfreq problem involves writing a SCHED_DEADLINE aware
CPUfreq governor. The governor must know about the constraints placed on
the system by the task-set. You simply cannot lower the frequency when
your system is at u=1.

Once you have a governor that keeps the freq such that: freq/max_freq >=
utilization (which is only sufficient for deadline == period systems),
then you need to frob the SCHED_DEADLINE runtime accounting.

Adding a complete normalized clock to the system like you've done is a
total no-go, it adds overhead even for the !SCHED_DEADLINE case.

The simple solution would be to slow down the runtime accounting of
SCHED_DEADLINE tasks by freq/max_freq. So instead of having:

  dl_se->runtime -= delta;

you do something like:

  dl_se->runtime -= (freq * delta) / max_freq;

Which auto-magically grows the actual bandwidth, and since the deadlines
are wall-time already it all works out nicely. It also keeps the
overhead inside SCHED_DEADLINE.

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH 1/3] Added runqueue clock normalized with cpufreq
  2010-12-17 14:29 ` [PATCH 1/3] Added runqueue clock normalized with cpufreq Peter Zijlstra
@ 2010-12-17 14:32   ` Peter Zijlstra
  2010-12-17 15:06     ` Harald Gustafsson
  2010-12-17 15:02   ` Harald Gustafsson
  2010-12-17 18:56   ` Dario Faggioli
  2 siblings, 1 reply; 23+ messages in thread
From: Peter Zijlstra @ 2010-12-17 14:32 UTC (permalink / raw)
  To: Harald Gustafsson
  Cc: Dario Faggioli, Harald Gustafsson, linux-kernel, Ingo Molnar,
	Thomas Gleixner, Claudio Scordino, Michael Trimarchi,
	Fabio Checconi, Tommaso Cucinotta, Juri Lelli, Dario Faggioli

On Fri, 2010-12-17 at 15:29 +0100, Peter Zijlstra wrote:
> On Fri, 2010-12-17 at 14:02 +0100, Harald Gustafsson wrote:
> 
> > This is a request for comments on additions to sched deadline v3 patches.
> > Deadline scheduler is the first scheduler (I think) we introduce in Linux that
> > specifies the runtime in time and not only as a weight or a relation.
> > I have introduced a normalized runtime clock dependent on the CPU frequency.
> > This is used, in [PATCH 2/3], to calculate the deadline thread's runtime
> > so that approximately the same number of cycles are giving to the thread
> > independent of the CPU frequency. 
> > 
> > I suggest that this is important for users of hard reservation based schedulers
> > that the intended amount of work can be accomplished independent of the CPU frequency.
> > The usage of CPU frequency scaling is important on mobile devices and hence 
> > the combination of deadline scheduler and cpufreq should be solved.
> 
> > So before I do this for the linux tip I would welcome a discussion about if this
> > is a good idea and also suggestions on how to improve this. 
> 
> I'm thinking this is going about it totally wrong..
> 
> Solving the CPUfreq problem involves writing a SCHED_DEADLINE aware
> CPUfreq governor. The governor must know about the constraints placed on
> the system by the task-set. You simply cannot lower the frequency when
> your system is at u=1.
> 
> Once you have a governor that keeps the freq such that: freq/max_freq >=
> utilization (which is only sufficient for deadline == period systems),
> then you need to frob the SCHED_DEADLINE runtime accounting.
> 
> Adding a complete normalized clock to the system like you've done is a
> total no-go, it adds overhead even for the !SCHED_DEADLINE case.
> 
> The simple solution would be to slow down the runtime accounting of
> SCHED_DEADLINE tasks by freq/max_freq. So instead of having:
> 
>   dl_se->runtime -= delta;
> 
> you do something like:
> 
>   dl_se->runtime -= (freq * delta) / max_freq;
> 
> Which auto-magically grows the actual bandwidth, and since the deadlines
> are wall-time already it all works out nicely. It also keeps the
> overhead inside SCHED_DEADLINE.

This is all assuming lowering the frequency is sensible to begin with in
the first place... but that's all part of the CPUfreq governor, it needs
to find a way to lower energy usage while conforming to the system
constraints.



^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH 1/3] Added runqueue clock normalized with cpufreq
  2010-12-17 14:32   ` Peter Zijlstra
@ 2010-12-17 15:06     ` Harald Gustafsson
  2010-12-17 15:16       ` Peter Zijlstra
  0 siblings, 1 reply; 23+ messages in thread
From: Harald Gustafsson @ 2010-12-17 15:06 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Harald Gustafsson, Dario Faggioli, linux-kernel, Ingo Molnar,
	Thomas Gleixner, Claudio Scordino, Michael Trimarchi,
	Fabio Checconi, Tommaso Cucinotta, Juri Lelli, Dario Faggioli

2010/12/17 Peter Zijlstra <peterz@infradead.org>:
> This is all assuming lowering the frequency is sensible to begin with in
> the first place... but that's all part of the CPUfreq governor, it needs
> to find a way to lower energy usage while conforming to the system
> constraints.

Yes, I and you have already suggested the safe way to not lower it below
the total dl bandwidth. But for softer use cases it might be possible to
e.g. exclude threads with longer periods than cpufreq change periods in the
minimum frequency.

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH 1/3] Added runqueue clock normalized with cpufreq
  2010-12-17 15:06     ` Harald Gustafsson
@ 2010-12-17 15:16       ` Peter Zijlstra
  2010-12-17 15:36         ` Harald Gustafsson
  2010-12-17 15:43         ` Thomas Gleixner
  0 siblings, 2 replies; 23+ messages in thread
From: Peter Zijlstra @ 2010-12-17 15:16 UTC (permalink / raw)
  To: Harald Gustafsson
  Cc: Harald Gustafsson, Dario Faggioli, linux-kernel, Ingo Molnar,
	Thomas Gleixner, Claudio Scordino, Michael Trimarchi,
	Fabio Checconi, Tommaso Cucinotta, Juri Lelli, Dario Faggioli

On Fri, 2010-12-17 at 16:06 +0100, Harald Gustafsson wrote:
> 2010/12/17 Peter Zijlstra <peterz@infradead.org>:
> > This is all assuming lowering the frequency is sensible to begin with in
> > the first place... but that's all part of the CPUfreq governor, it needs
> > to find a way to lower energy usage while conforming to the system
> > constraints.
> 
> Yes, I and you have already suggested the safe way to not lower it below
> the total dl bandwidth. But for softer use cases it might be possible to
> e.g. exclude threads with longer periods than cpufreq change periods in the
> minimum frequency.

I was more hinting at the fact that CPUfreq is at best a controversial
approach to power savings. I much prefer the whole race-to-idle
approach, its much simpler.

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH 1/3] Added runqueue clock normalized with cpufreq
  2010-12-17 15:16       ` Peter Zijlstra
@ 2010-12-17 15:36         ` Harald Gustafsson
  2010-12-17 15:43         ` Thomas Gleixner
  1 sibling, 0 replies; 23+ messages in thread
From: Harald Gustafsson @ 2010-12-17 15:36 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Harald Gustafsson, Dario Faggioli, linux-kernel, Ingo Molnar,
	Thomas Gleixner, Claudio Scordino, Michael Trimarchi,
	Fabio Checconi, Tommaso Cucinotta, Juri Lelli, Dario Faggioli

2010/12/17 Peter Zijlstra <peterz@infradead.org>:
> I was more hinting at the fact that CPUfreq is at best a controversial
> approach to power savings. I much prefer the whole race-to-idle
> approach, its much simpler.

That depends to a large degree on architecture, chip technology node
and deployed user space
applications. I don't agree that race-to-idle is a good idea for
some/many combinations at least
for embedded systems. But of course race-to-idle is simpler, but not
necessarily giving the
lowest energy.

/Harald

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH 1/3] Added runqueue clock normalized with cpufreq
  2010-12-17 15:16       ` Peter Zijlstra
  2010-12-17 15:36         ` Harald Gustafsson
@ 2010-12-17 15:43         ` Thomas Gleixner
  2010-12-17 15:54           ` Harald Gustafsson
                             ` (2 more replies)
  1 sibling, 3 replies; 23+ messages in thread
From: Thomas Gleixner @ 2010-12-17 15:43 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Harald Gustafsson, Harald Gustafsson, Dario Faggioli,
	linux-kernel, Ingo Molnar, Claudio Scordino, Michael Trimarchi,
	Fabio Checconi, Tommaso Cucinotta, Juri Lelli, Dario Faggioli

On Fri, 17 Dec 2010, Peter Zijlstra wrote:
> On Fri, 2010-12-17 at 16:06 +0100, Harald Gustafsson wrote:
> > 2010/12/17 Peter Zijlstra <peterz@infradead.org>:
> > > This is all assuming lowering the frequency is sensible to begin with in
> > > the first place... but that's all part of the CPUfreq governor, it needs
> > > to find a way to lower energy usage while conforming to the system
> > > constraints.
> > 
> > Yes, I and you have already suggested the safe way to not lower it below
> > the total dl bandwidth. But for softer use cases it might be possible to
> > e.g. exclude threads with longer periods than cpufreq change periods in the
> > minimum frequency.
> 
> I was more hinting at the fact that CPUfreq is at best a controversial
> approach to power savings. I much prefer the whole race-to-idle
> approach, its much simpler.

There's that and I have yet to see a proof that running code with
lower frequency and not going idle saves more power than running full
speed and going into low power states for longer time.

Also if you want to have your deadline scheduler aware of cpu
frequency changes, then simply limit the total bandwith based on the
lowest possible frequency and it works always. This whole dynamic
bandwith expansion is more an academic exercise than a practical
necessity.

Thanks,

	tglx

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH 1/3] Added runqueue clock normalized with cpufreq
  2010-12-17 15:43         ` Thomas Gleixner
@ 2010-12-17 15:54           ` Harald Gustafsson
  2010-12-17 18:44           ` Dario Faggioli
  2011-01-03 14:17           ` Pavel Machek
  2 siblings, 0 replies; 23+ messages in thread
From: Harald Gustafsson @ 2010-12-17 15:54 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Peter Zijlstra, Harald Gustafsson, Dario Faggioli, linux-kernel,
	Ingo Molnar, Claudio Scordino, Michael Trimarchi, Fabio Checconi,
	Tommaso Cucinotta, Juri Lelli, Dario Faggioli

2010/12/17 Thomas Gleixner <tglx@linutronix.de>:
> Also if you want to have your deadline scheduler aware of cpu
> frequency changes, then simply limit the total bandwith based on the
> lowest possible frequency and it works always. This whole dynamic
> bandwith expansion is more an academic exercise than a practical
> necessity.

This would severely limit the bandwidth available to deadline tasks. Which
then also reduces the use cases that could benefit from using sched deadline.
Also it would imply a over-reservation of the system, e.g. if you need 10%
BW of the total system and the lowest speed is at 20%, you basically need to
set a BW of 50%, to always be guaranteed that you get your 10% when cpufreq
clocks down. If you use cpufreq and want to use sched deadline this has
strong practical implications and is definitely not academic only.

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH 1/3] Added runqueue clock normalized with cpufreq
  2010-12-17 15:43         ` Thomas Gleixner
  2010-12-17 15:54           ` Harald Gustafsson
@ 2010-12-17 18:44           ` Dario Faggioli
  2011-01-03 14:17           ` Pavel Machek
  2 siblings, 0 replies; 23+ messages in thread
From: Dario Faggioli @ 2010-12-17 18:44 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Peter Zijlstra, Harald Gustafsson, Harald Gustafsson,
	linux-kernel, Ingo Molnar, Claudio Scordino, Michael Trimarchi,
	Fabio Checconi, Tommaso Cucinotta, Juri Lelli

[-- Attachment #1: Type: text/plain, Size: 1637 bytes --]

On Fri, 2010-12-17 at 16:43 +0100, Thomas Gleixner wrote: 
> There's that and I have yet to see a proof that running code with
> lower frequency and not going idle saves more power than running full
> speed and going into low power states for longer time.
> 
I was expecting a reply like this from right from you! :-P

BTW, I mostly agree that race to idle is better. The point here is that
you might end in a situation where frequency scaling is enabled and/or
a particular frequency is statically selected for whatever reason. In
that case, making the scheduler aware of such could be needed to get the
expected behaviour out of it, independently from the fact it is probably
going to be worse than race-to-idle for power saving purposes... How
much am I wrong?

> Also if you want to have your deadline scheduler aware of cpu
> frequency changes, then simply limit the total bandwith based on the
> lowest possible frequency and it works always. 
>
That could be a solution as well, although you're limiting a lot the
bandwidth available for deadline task. But something similar could be
considered...

> This whole dynamic
> bandwith expansion is more an academic exercise than a practical
> necessity.
> 
Well, despite the fact that Harald is with Ericsson and as not much to
do with academia. :-D

Regards,
Dario

-- 
<<This happens because I choose it to happen!>> (Raistlin Majere)
----------------------------------------------------------------------
Dario Faggioli, ReTiS Lab, Scuola Superiore Sant'Anna, Pisa  (Italy)

http://retis.sssup.it/people/faggioli -- dario.faggioli@jabber.org

[-- Attachment #2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 198 bytes --]

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH 1/3] Added runqueue clock normalized with cpufreq
  2010-12-17 15:43         ` Thomas Gleixner
  2010-12-17 15:54           ` Harald Gustafsson
  2010-12-17 18:44           ` Dario Faggioli
@ 2011-01-03 14:17           ` Pavel Machek
  2 siblings, 0 replies; 23+ messages in thread
From: Pavel Machek @ 2011-01-03 14:17 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Peter Zijlstra, Harald Gustafsson, Harald Gustafsson,
	Dario Faggioli, linux-kernel, Ingo Molnar, Claudio Scordino,
	Michael Trimarchi, Fabio Checconi, Tommaso Cucinotta, Juri Lelli,
	Dario Faggioli

Hi!

> > > Yes, I and you have already suggested the safe way to not lower it below
> > > the total dl bandwidth. But for softer use cases it might be possible to
> > > e.g. exclude threads with longer periods than cpufreq change periods in the
> > > minimum frequency.
> > 
> > I was more hinting at the fact that CPUfreq is at best a controversial
> > approach to power savings. I much prefer the whole race-to-idle
> > approach, its much simpler.
> 
> There's that and I have yet to see a proof that running code with
> lower frequency and not going idle saves more power than running full
> speed and going into low power states for longer time.

That depends on cpu. Look at early athlon64s that could not even run
at full speed at battery power, and where cpu sleep states were not
saving much power. Race-to-idle does not work there. It works on
recent x86 cpus.
								Pavel

-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH 1/3] Added runqueue clock normalized with cpufreq
  2010-12-17 14:29 ` [PATCH 1/3] Added runqueue clock normalized with cpufreq Peter Zijlstra
  2010-12-17 14:32   ` Peter Zijlstra
@ 2010-12-17 15:02   ` Harald Gustafsson
  2010-12-17 18:48     ` Dario Faggioli
  2010-12-17 18:56   ` Dario Faggioli
  2 siblings, 1 reply; 23+ messages in thread
From: Harald Gustafsson @ 2010-12-17 15:02 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Harald Gustafsson, Dario Faggioli, linux-kernel, Ingo Molnar,
	Thomas Gleixner, Claudio Scordino, Michael Trimarchi,
	Fabio Checconi, Tommaso Cucinotta, Juri Lelli, Dario Faggioli

2010/12/17 Peter Zijlstra <peterz@infradead.org>:
>
> I'm thinking this is going about it totally wrong..
>
> Solving the CPUfreq problem involves writing a SCHED_DEADLINE aware
> CPUfreq governor. The governor must know about the constraints placed on
> the system by the task-set. You simply cannot lower the frequency when
> your system is at u=1.
>
> Once you have a governor that keeps the freq such that: freq/max_freq >=
> utilization (which is only sufficient for deadline == period systems),
> then you need to frob the SCHED_DEADLINE runtime accounting.

I agree that this is the other part of the solution, which I have in a separate
ondemand governor, but that code is not ready for public review yet. Since that
code also incorporate other ondemand changes I'm playing with. Such
changes to the ondemand is quite simple it just picks a frequency that at
least supports the total dl bandwidth. It might get tricky for systems which
support individual/clusters frequency for the cores on the system together with
the G-EDF.

> Adding a complete normalized clock to the system like you've done is a
> total no-go, it adds overhead even for the !SCHED_DEADLINE case.

I suspected this, it works as a proof of concept, but not good for mainline.
I will rework this part, if we in general thinks having the dl runtime
accounting
be cpufreq "aware" is a good idea.

> The simple solution would be to slow down the runtime accounting of
> SCHED_DEADLINE tasks by freq/max_freq. So instead of having:
>
>  dl_se->runtime -= delta;
>
> you do something like:
>
>  dl_se->runtime -= (freq * delta) / max_freq;
>
> Which auto-magically grows the actual bandwidth, and since the deadlines
> are wall-time already it all works out nicely. It also keeps the
> overhead inside SCHED_DEADLINE.

OK, I can do that. My thought from the beginning was considering that
the reading of the clock was done more often then updating it, but I agree that
it has a negative impact on none dl threads.

/Harald

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH 1/3] Added runqueue clock normalized with cpufreq
  2010-12-17 15:02   ` Harald Gustafsson
@ 2010-12-17 18:48     ` Dario Faggioli
  0 siblings, 0 replies; 23+ messages in thread
From: Dario Faggioli @ 2010-12-17 18:48 UTC (permalink / raw)
  To: Harald Gustafsson
  Cc: Peter Zijlstra, Harald Gustafsson, linux-kernel, Ingo Molnar,
	Thomas Gleixner, Claudio Scordino, Michael Trimarchi,
	Fabio Checconi, Tommaso Cucinotta, Juri Lelli

[-- Attachment #1: Type: text/plain, Size: 1247 bytes --]

On Fri, 2010-12-17 at 16:02 +0100, Harald Gustafsson wrote: 
> > Once you have a governor that keeps the freq such that: freq/max_freq >=
> > utilization (which is only sufficient for deadline == period systems),
> > then you need to frob the SCHED_DEADLINE runtime accounting.
> 
> I agree that this is the other part of the solution, which I have in a separate
> ondemand governor, but that code is not ready for public review yet. Since that
> code also incorporate other ondemand changes I'm playing with. 
>
So, while we're waiting for this to be cooked...

> OK, I can do that. My thought from the beginning was considering that
> the reading of the clock was done more often then updating it, but I agree that
> it has a negative impact on none dl threads.
> 
... We can at least integrate this (done in the proper, way as Peter
suggests, i.e., _inside_ SCHED_DEADLINE) in the next release of the
patchset, can't we?

Regards,
Dario

-- 
<<This happens because I choose it to happen!>> (Raistlin Majere)
----------------------------------------------------------------------
Dario Faggioli, ReTiS Lab, Scuola Superiore Sant'Anna, Pisa  (Italy)

http://retis.sssup.it/people/faggioli -- dario.faggioli@jabber.org

[-- Attachment #2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 198 bytes --]

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH 1/3] Added runqueue clock normalized with cpufreq
  2010-12-17 14:29 ` [PATCH 1/3] Added runqueue clock normalized with cpufreq Peter Zijlstra
  2010-12-17 14:32   ` Peter Zijlstra
  2010-12-17 15:02   ` Harald Gustafsson
@ 2010-12-17 18:56   ` Dario Faggioli
  2010-12-17 18:59     ` Peter Zijlstra
  2010-12-17 19:27     ` Harald Gustafsson
  2 siblings, 2 replies; 23+ messages in thread
From: Dario Faggioli @ 2010-12-17 18:56 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Harald Gustafsson, Harald Gustafsson, linux-kernel, Ingo Molnar,
	Thomas Gleixner, Claudio Scordino, Michael Trimarchi,
	Fabio Checconi, Tommaso Cucinotta, Juri Lelli

[-- Attachment #1: Type: text/plain, Size: 1750 bytes --]

On Fri, 2010-12-17 at 15:29 +0100, Peter Zijlstra wrote: 
> Solving the CPUfreq problem involves writing a SCHED_DEADLINE aware
> CPUfreq governor. The governor must know about the constraints placed on
> the system by the task-set. You simply cannot lower the frequency when
> your system is at u=1.
> 
We already did the very same thing (for another EU Project called
FRESCOR), although it was done in an userspace sort of daemon. It was
also able to consider other "high level" parameters like some estimation
of the QoS of each application and of the global QoS of the system.

However, converting the basic mechanism into a CPUfreq governor should
be easily doable... The only problem is finding the time for that! ;-P 

> The simple solution would be to slow down the runtime accounting of
> SCHED_DEADLINE tasks by freq/max_freq. So instead of having:
> 
>   dl_se->runtime -= delta;
> 
> you do something like:
> 
>   dl_se->runtime -= (freq * delta) / max_freq;
> 
> Which auto-magically grows the actual bandwidth, and since the deadlines
> are wall-time already it all works out nicely. It also keeps the
> overhead inside SCHED_DEADLINE.
> 
And, at least for the meantime, this seems a very very nice solution.
The only thing I don't like is that division which would end up in being
performed at each tick/update_curr_dl(), but we can try to find out a
way to mitigate this, what do you think Harald?

Regards,
Dario

-- 
<<This happens because I choose it to happen!>> (Raistlin Majere)
----------------------------------------------------------------------
Dario Faggioli, ReTiS Lab, Scuola Superiore Sant'Anna, Pisa  (Italy)

http://retis.sssup.it/people/faggioli -- dario.faggioli@jabber.org

[-- Attachment #2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 198 bytes --]

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH 1/3] Added runqueue clock normalized with cpufreq
  2010-12-17 18:56   ` Dario Faggioli
@ 2010-12-17 18:59     ` Peter Zijlstra
  2010-12-17 19:16       ` Dario Faggioli
  2010-12-17 19:31       ` Harald Gustafsson
  2010-12-17 19:27     ` Harald Gustafsson
  1 sibling, 2 replies; 23+ messages in thread
From: Peter Zijlstra @ 2010-12-17 18:59 UTC (permalink / raw)
  To: Dario Faggioli
  Cc: Harald Gustafsson, Harald Gustafsson, linux-kernel, Ingo Molnar,
	Thomas Gleixner, Claudio Scordino, Michael Trimarchi,
	Fabio Checconi, Tommaso Cucinotta, Juri Lelli

On Fri, 2010-12-17 at 19:56 +0100, Dario Faggioli wrote:
> On Fri, 2010-12-17 at 15:29 +0100, Peter Zijlstra wrote: 
> > Solving the CPUfreq problem involves writing a SCHED_DEADLINE aware
> > CPUfreq governor. The governor must know about the constraints placed on
> > the system by the task-set. You simply cannot lower the frequency when
> > your system is at u=1.
> > 
> We already did the very same thing (for another EU Project called
> FRESCOR), although it was done in an userspace sort of daemon. It was
> also able to consider other "high level" parameters like some estimation
> of the QoS of each application and of the global QoS of the system.
> 
> However, converting the basic mechanism into a CPUfreq governor should
> be easily doable... The only problem is finding the time for that! ;-P 

Ah, I think Harald will solve that for you,.. :)

> > The simple solution would be to slow down the runtime accounting of
> > SCHED_DEADLINE tasks by freq/max_freq. So instead of having:
> > 
> >   dl_se->runtime -= delta;
> > 
> > you do something like:
> > 
> >   dl_se->runtime -= (freq * delta) / max_freq;
> > 
> > Which auto-magically grows the actual bandwidth, and since the deadlines
> > are wall-time already it all works out nicely. It also keeps the
> > overhead inside SCHED_DEADLINE.
> > 
> And, at least for the meantime, this seems a very very nice solution.
> The only thing I don't like is that division which would end up in being
> performed at each tick/update_curr_dl(), but we can try to find out a
> way to mitigate this, what do you think Harald?

A simple mult and shift-right should do. You can either pre-compute for
a platform, or compute the inv multiplier in the cpufreq notifier thing.

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH 1/3] Added runqueue clock normalized with cpufreq
  2010-12-17 18:59     ` Peter Zijlstra
@ 2010-12-17 19:16       ` Dario Faggioli
  2010-12-17 19:31       ` Harald Gustafsson
  1 sibling, 0 replies; 23+ messages in thread
From: Dario Faggioli @ 2010-12-17 19:16 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Harald Gustafsson, Harald Gustafsson, linux-kernel, Ingo Molnar,
	Thomas Gleixner, Claudio Scordino, Michael Trimarchi,
	Fabio Checconi, Tommaso Cucinotta, Juri Lelli

[-- Attachment #1: Type: text/plain, Size: 1203 bytes --]

On Fri, 2010-12-17 at 19:59 +0100, Peter Zijlstra wrote: 
> > However, converting the basic mechanism into a CPUfreq governor should
> > be easily doable... The only problem is finding the time for that! ;-P 
> 
> Ah, I think Harald will solve that for you,.. :)
> 
Yeah, I saw that... Help is a wonderful thing, you know? :-P

> > And, at least for the meantime, this seems a very very nice solution.
> > The only thing I don't like is that division which would end up in being
> > performed at each tick/update_curr_dl(), but we can try to find out a
> > way to mitigate this, what do you think Harald?
> 
> A simple mult and shift-right should do. You can either pre-compute for
> a platform, or compute the inv multiplier in the cpufreq notifier thing.
>
Yeah, I was thinking about something like the last solution you just
propose, but we'll consider all of them.

Thanks and Regards,
Dario

-- 
<<This happens because I choose it to happen!>> (Raistlin Majere)
----------------------------------------------------------------------
Dario Faggioli, ReTiS Lab, Scuola Superiore Sant'Anna, Pisa  (Italy)

http://retis.sssup.it/people/faggioli -- dario.faggioli@jabber.org

[-- Attachment #2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 198 bytes --]

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH 1/3] Added runqueue clock normalized with cpufreq
  2010-12-17 18:59     ` Peter Zijlstra
  2010-12-17 19:16       ` Dario Faggioli
@ 2010-12-17 19:31       ` Harald Gustafsson
  2010-12-20  0:11         ` Tommaso Cucinotta
  1 sibling, 1 reply; 23+ messages in thread
From: Harald Gustafsson @ 2010-12-17 19:31 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Dario Faggioli, Harald Gustafsson, linux-kernel, Ingo Molnar,
	Thomas Gleixner, Claudio Scordino, Michael Trimarchi,
	Fabio Checconi, Tommaso Cucinotta, Juri Lelli

>> We already did the very same thing (for another EU Project called
>> FRESCOR), although it was done in an userspace sort of daemon. It was
>> also able to consider other "high level" parameters like some estimation
>> of the QoS of each application and of the global QoS of the system.
>>
>> However, converting the basic mechanism into a CPUfreq governor should
>> be easily doable... The only problem is finding the time for that! ;-P
>
> Ah, I think Harald will solve that for you,.. :)

Yes, I don't mind doing that. Could you point me to the right part of
the FRESCOR code, Dario?
I will then compare that with what I already have.

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH 1/3] Added runqueue clock normalized with cpufreq
  2010-12-17 19:31       ` Harald Gustafsson
@ 2010-12-20  0:11         ` Tommaso Cucinotta
  2010-12-20  9:44           ` Harald Gustafsson
  0 siblings, 1 reply; 23+ messages in thread
From: Tommaso Cucinotta @ 2010-12-20  0:11 UTC (permalink / raw)
  To: Harald Gustafsson
  Cc: Peter Zijlstra, Dario Faggioli, Harald Gustafsson, linux-kernel,
	Ingo Molnar, Thomas Gleixner, Claudio Scordino, Michael Trimarchi,
	Fabio Checconi, Juri Lelli

[-- Attachment #1: Type: text/plain, Size: 7316 bytes --]

Il 17/12/2010 20:31, Harald Gustafsson ha scritto:
>>> We already did the very same thing (for another EU Project called
>>> FRESCOR), although it was done in an userspace sort of daemon. It was
>>> also able to consider other "high level" parameters like some estimation
>>> of the QoS of each application and of the global QoS of the system.
>>>
>>> However, converting the basic mechanism into a CPUfreq governor should
>>> be easily doable... The only problem is finding the time for that! ;-P
>> Ah, I think Harald will solve that for you,.. :)
> Yes, I don't mind doing that. Could you point me to the right part of
> the FRESCOR code, Dario?

Hi there,

I'm sorry to join so late this discussion, but the unprecedented 20cm of 
snow in Pisa had some non-negligible drawbacks on my return flight from 
Perth :-).

Let me try to briefly recap what the outcomes of FRESCOR were, w.r.t. 
power management (but usually I'm not that brief :-) ):

1. from a requirements analysis phase, it comes out that it should be 
possible to specify the individual runtimes for each possible frequency, 
as it is well-known that the way computation times scale to CPU 
frequency is application-dependent (and platform-dependent); this 
assumes that as a developer I can specify the possible configurations of 
my real-time app, then the OS will be free to pick the CPU frequency 
that best suites its power management logic (i.e., keeping the minimum 
frequency by which I can meet all the deadlines).

   Requirements Analysis:

http://www.frescor.org/index.php?mact=Uploads,cntnt01,getfile,0&cntnt01showtemplate=false&cntnt01upload_id=62&cntnt01returnid=54

   Proposed API:

http://www.frescor.org/index.php?mact=Uploads,cntnt01,getfile,0&cntnt01showtemplate=false&cntnt01upload_id=105&cntnt01returnid=54

   I also attach the API we implemented, however consider it is a mix of 
calls for doing both what I wrote above, and building an OS-independent 
abstraction layer for dealing with CPU frequency scaling (and not only) 
on the heterogeneous OSes we had in FRESCOR;

2. this was also assuming, at an API level, a quite static settings 
(typical of hard RT), in which I configure the system and don't change 
its frequency too often; for example, implications of power switches on 
hard real-time requirements (i.e., time windows in which the CPU is not 
operating during the switch, and limits on the max sustainable switching 
frequencies by apps and the like) have not been stated through the API;

3. for soft real-time contexts and Linux (consider FRESCOR targeted both 
hard RT on RT OSes and soft RT on Linux), we played with a much simpler 
trivial linear scaling, which is exactly what has been proposed and 
implemented by someone in this thread on top of SCHED_DEADLINE (AFAIU); 
however, there's a trick which cannot be neglected, i.e., *change 
protocol* (see 5); benchmarks on MPEG-2 decoding times showed that the 
linear approximation is not that bad, but the best interpolating ratio 
between the computing times in different CPU frequencies do not 
perfectly conform to the frequencies ratios; we didn't make any attempt 
of extensive evaluation over different workloads so far. See Figure 4.1 
in D-AQ2v2:

http://www.frescor.org/index.php?mact=Uploads,cntnt01,getfile,0&cntnt01showtemplate=false&cntnt01upload_id=82&cntnt01returnid=54

4. I would say that, given the tendency to over-provision the runtime 
(WCET) for hard real-time contexts, it would not bee too much of a 
burden for a hard RT developer to properly over-provision the required 
budget in presence of a trivial runtime rescaling policy like in 2.; 
however, in order to make everybody happy, it doesn't seem a bad idea to 
have something like:
   4a) use the fine runtimes specified by the user if they are available;
   4b) use the trivially rescaled runtimes if the user only specified a 
single runtime, of course it should be clear through the API what is the 
frequency the user is referring its runtime to, in such case (e.g., 
maximum one ?)

5. Mode Change Protocol: whenever a frequency switch occurs (e.g., 
dictated by the non-RT workload fluctuations), runtimes cannot simply be 
rescaled instantaneously: keeping it short, the simplest thing we can do 
is relying on the various CBS servers implemented in the scheduler to 
apply the change from the next "runtime recharge", i.e., the next 
period. This creates the potential problem that the RT tasks have a 
non-negligible transitory for the instances crossing the CPU frequency 
switch, in which they do not have enough runtime for their work. Now, 
the general "rule of thumb" is straightforward: make room first, then 
"pack", i.e., we need to consider 2 distinct cases:

   5a) we want to *increase the CPU frequency*; we can immediately 
increase the frequency, then the RT applications will have a temporary 
over-provisioning of runtime (still tuned for the slower frequency 
case), however as soon as we're sure the CPU frequency switch completed, 
we can lower the runtimes to the new values;

   5b) we want to *decrease the CPU frequency*; unfortunately, here we 
need to proceed in the other way round: first, we need to increase the 
runtimes of the RT applications to the new values, then, as soon as 
we're sure all the scheduling servers made the change (waiting at most 
for a time equal to the maximum configured RT period), then we can 
actually perform the frequency switch. Of course, before switching the 
frequency, there's an assumption: that the new runtimes after the freq 
decrease are still schedulable, so the CPU freq switching logic needs to 
be aware of the allocated RT reservations.

The protocol in 5. has been implemented completely in user-space as a 
modification to the powernowd daemon, in the context of an extended 
version of a paper in which we were automagically guessing the whole set 
of scheduling parameters for periodic RT applications (EuroSys 2010). 
The modified powernowd was considering both the whole RT utilization as 
imposed by the RT reservations, and the non-RT utilization as measured 
on the CPU. The paper will appear on ACM TECS, but who knows when, so 
here u can find it (see Section 7.5 "Power Management"):

   http://retis.sssup.it/~tommaso/publications/ACM-TECS-2010.pdf

(last remark: no attempt to deal with multi-cores and their various 
power switching capabilities, on this paper . . .)

Last, but not least, the whole point in the above discussion is the 
assumption that it is meaningful to have a CPU frequency switching 
policy, as opposed to merely CPU idle-ing. Perhaps on old embedded CPUs 
this is still the case. Unfortunately, from preliminary measurements 
made on a few systems I use every day through a cheap power measurement 
device attached on the power cable, I could actually see that for RT 
workloads only it is worth to leave the system at the maximum frequency 
and exploit the much higher time spent in idle mode(s), except when the 
system is completely idle.

If you're interested, I can share the collected data sets.

Bye (and apologies for the length).

     T.

-- 
Tommaso Cucinotta, Computer Engineering PhD, Researcher
ReTiS Lab, Scuola Superiore Sant'Anna, Pisa, Italy
Tel +39 050 882 024, Fax +39 050 882 003
http://retis.sssup.it/people/tommaso

[-- Warning: decoded text below may be mangled, UTF-8 assumed --]
[-- Attachment #2: frsh_energy_management.h --]
[-- Type: text/x-chdr; name="frsh_energy_management.h", Size: 11689 bytes --]

// -----------------------------------------------------------------------
//  Copyright (C) 2006 - 2009 FRESCOR consortium partners:
//
//    Universidad de Cantabria,              SPAIN
//    University of York,                    UK
//    Scuola Superiore Sant'Anna,            ITALY
//    Kaiserslautern University,             GERMANY
//    Univ. Politécnica  Valencia,           SPAIN
//    Czech Technical University in Prague,  CZECH REPUBLIC
//    ENEA                                   SWEDEN
//    Thales Communication S.A.              FRANCE
//    Visual Tools S.A.                      SPAIN
//    Rapita Systems Ltd                     UK
//    Evidence                               ITALY
//
//    See http://www.frescor.org for a link to partners' websites
//
//           FRESCOR project (FP6/2005/IST/5-034026) is funded
//        in part by the European Union Sixth Framework Programme
//        The European Union is not liable of any use that may be
//        made of this code.
//
//
//  based on previous work (FSF) done in the FIRST project
//
//   Copyright (C) 2005  Mälardalen University, SWEDEN
//                       Scuola Superiore S.Anna, ITALY
//                       Universidad de Cantabria, SPAIN
//                       University of York, UK
//
//   FSF API web pages: http://marte.unican.es/fsf/docs
//                      http://shark.sssup.it/contrib/first/docs/
//
//   This file is part of FRSH (FRescor ScHeduler)
//
//  FRSH is free software; you can redistribute it and/or modify it
//  under terms of the GNU General Public License as published by the
//  Free Software Foundation; either version 2, or (at your option) any
//  later version.  FRSH is distributed in the hope that it will be
//  useful, but WITHOUT ANY WARRANTY; without even the implied warranty
//  of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
//  General Public License for more details. You should have received a
//  copy of the GNU General Public License along with FRSH; see file
//  COPYING. If not, write to the Free Software Foundation, 675 Mass Ave,
//  Cambridge, MA 02139, USA.
//
//  As a special exception, including FRSH header files in a file,
//  instantiating FRSH generics or templates, or linking other files
//  with FRSH objects to produce an executable application, does not
//  by itself cause the resulting executable application to be covered
//  by the GNU General Public License. This exception does not
//  however invalidate any other reasons why the executable file might be
//  covered by the GNU Public License.
// -----------------------------------------------------------------------
//frsh_energy_management.h

//==============================================
//  ******** *******    ********  **      **
//  **///// /**////**  **//////  /**     /**
//  **      /**   /** /**        /**     /**
//  ******* /*******  /********* /**********
//  **////  /**///**  ////////** /**//////**
//  **      /**  //**        /** /**     /**
//  **      /**   //** ********  /**     /**
//  //       //     // ////////   //      // 
//
// FRSH(FRescor ScHeduler), pronounced "fresh"
//==============================================

#ifndef  _FRSH_ENERGY_MANAGEMENT_H_
#define  _FRSH_ENERGY_MANAGEMENT_H_

#include <time.h>

#include "frsh_energy_management_types.h"
#include "frsh_core_types.h"

FRSH_CPP_BEGIN_DECLS

#define FRSH_ENERGY_MANAGEMENT_MODULE_SUPPORTED       1

/**
 * @file frsh_energy_management.h
 **/

/**
 * @defgroup energymgmnt Energy Management Module
 *
 * This module provides the ability to specify different budgets for
 * different power levels.
 *
 * We model the situation by specifying budget values per power
 * level.  Thus switching in the power-level would be done by changing
 * the budget of the vres.  In all cases the period remains the same.
 *
 * All global FRSH contract operations (those done with the core
 * module without specifying the power level) are considered to be
 * applied to the higest power level, corresponding to a power_level_t
 * value of 0.
 *
 * @note
 * For all functions that operate on a contract, the resource is
 * implicitly identified by the contract core parameters resource_type
 * and resource_id that are either set through the
 * frsh_contract_set_resource_and_label() function, or implicitly
 * defined if no such call is made.
 *
 * @note
 * For the power level management operations, only
 * implementation for resource_type = FRSH_RT_PROCESSOR is mandatory,
 * if the energy management module is present.
 *
 * @{
 *
 **/

//////////////////////////////////////////////////////////////////////
//           CONTRACT SERVICES
//////////////////////////////////////////////////////////////////////

/**
 * frsh_contract_set_min_expiration()
 * 
 * This function sets the minimum battery expiration time that the
 * system must be able to sustain without finishing battery power. A
 * value of (0,0) would mean that the application does not have such
 * requirement (this is the default if this parameter is not explicitly
 * set).
 **/
int frsh_contract_set_min_expiration(frsh_contract_t *contract,
				     frsh_rel_time_t min_expiration);

/**
 * frsh_contract_get_min_expiration()
 * 
 * Get version of the previous function.
 **/
int frsh_contract_get_min_expiration(const frsh_contract_t *contract,
				     frsh_rel_time_t *min_expiration);

/**
 * frsh_contract_set_min_budget_pow()
 *
 * Here we specify the minimum budget value corresponding to a single
 * power level.
 *
 * @param contract		The affected contract.
 * @param power_level		The power level for which we are specifying the minimum budget.
 * @param pow_min_budget	The minimum budget requested for the power level.
 *
 * @return 0 if no error \n
 *	FRSH_ERR_BAD_ARGUMENT if power_level is greater than or equal to the value
 *	returned by frsh_get_power_levels  budget value is not correct.
 *
 * @note
 * If the minimum budget relative to one or more power levels has not been specified, then
 * the framework may attempt to perform interpolation of the supplied values in
 * order to infer them, if an accurate model for such operation is available.
 * Otherwise, the contract is rejected at frsh_negotiate() time.
 **/
int frsh_contract_set_min_budget_pow(frsh_contract_t *contract,
				     frsh_power_level_t power_level,
				     const frsh_rel_time_t *pow_min_budget);

/**
 * frsh_contract_get_min_budget_pow()
 *
 * Get version of the previous function.
 **/
int frsh_contract_get_min_budget_pow(const frsh_contract_t *contract,
				     frsh_power_level_t power_level,
				     frsh_rel_time_t *pow_min_budget);

/**
 * frsh_contract_set_max_budget_pow()
 *
 * Here we specify the maximum budget for a single power level.
 *
 * @param contract		The affected contract object.
 * @param power_level		The power level for which we are specifying the maximum budget.
 * @param pow_max_budget	The maximum budget requested for the power level.
 *
 * @return 0 if no error \n
 *        FRSH_ERR_BAD_ARGUMENT if any of the pointers is NULL or the
 *             budget values don't go in ascending order.
 *
 **/
int frsh_contract_set_max_budget_pow(frsh_contract_t *contract,
				     frsh_power_level_t power_level,
				     const frsh_rel_time_t *pow_max_budget);

/**
 * frsh_contract_get_max_budget_pow()
 *
 * Get version of the previous function.
 **/
int frsh_contract_get_max_budget_pow(const frsh_contract_t *contract,
				     frsh_power_level_t power_level,
				     frsh_rel_time_t *pow_max_budget);

/**
 * frsh_contract_set_utilization_pow()
 *
 * This function should be used for contracts with a period of
 * discrete granularity.  Here we specify, for each allowed period,
 * the budget to be used for each power level.
 *
 * @param contract	The affected contract object.
 * @param power_level	The power level for which we specify budget and period.
 * @param budget	The budget to be used for the supplied power level and period.
 * @param period	One of the allowed periods (from the discrete set).
 * @param period	The deadline used with the associated period (from the discrete set).
 **/
int frsh_contract_set_utilization_pow(frsh_contract_t *contract,
				      frsh_power_level_t power_level,
				      const frsh_rel_time_t *budget,
				      const frsh_rel_time_t *period,
				      const frsh_rel_time_t *deadline);

/**
 * frsh_contract_get_utilization_pow()
 *
 * Get version of the previous function.
 **/
int frsh_contract_get_utilization_pow(const frsh_contract_t *contract,
				      frsh_power_level_t power_level,
				      frsh_rel_time_t *budget,
				      frsh_rel_time_t *period,
				      frsh_rel_time_t *deadline);

//////////////////////////////////////////////////////////////////////
//           MANAGING THE POWER LEVEL
//////////////////////////////////////////////////////////////////////

/**
 * frsh_resource_set_power_level()
 *
 * Set the power level of the resource identified by the supplied type and id.
 *
 * @note
 * Only implementation for resource_type = FRSH_RT_PROCESSOR is mandatory,
 * if the energy management module is present.
 **/
int frsh_resource_set_power_level(frsh_resource_type_t resource_type,
				  frsh_resource_id_t resource_id,
                                  frsh_power_level_t power_level);

/**
 * frsh_resource_get_power_level()
 *
 * Get version of the previous function.
 **/
int frsh_resource_get_power_level(frsh_resource_type_t resource_type,
				  frsh_resource_id_t resource_id,
                                  frsh_power_level_t *power_level);

/**
 * frsh_resource_get_speed()
 *
 * Get in speed_ratio representative value for the speed of the specified
 * resource, with respect to the maximum possible speed for such resource.
 *
 * @note
 * Only implementation for resource_type = FRSH_RT_PROCESSOR is mandatory,
 * if the energy management module is present.
 **/
int frsh_resource_get_speed(frsh_resource_type_t resource_type,
			    frsh_resource_id_t resource_id,
			    frsh_power_level_t power_level,
			    double *speed_ratio);

/**
 * frsh_resource_get_num_power_levels()
 *
 * Get the number of power levels available for the resource identified
 * by the supplied type and id.
 *
 * @note
 * The power levels that may be used, for the identified resource,
 * in other functions through a power_level_t type, range from 0
 * to the value returned by this function minus 1.
 *
 * @note
 * The power level 0 identifies the configuration with the maximum
 * performance (and energy consumption) for the resource.
 *
 * @note
 * Only implementation for resource_type = FRSH_RT_PROCESSOR is mandatory,
 * if the energy management module is present.
 */
int frsh_resource_get_num_power_levels(frsh_resource_type_t resource_type,
				       frsh_resource_id_t resource_id,
				       int *num_power_levels);

//////////////////////////////////////////////////////////////////////
//           BATTERY EXPIRATION AND MANAGING POWER LEVELS
//////////////////////////////////////////////////////////////////////

/* /\** IS THIS NEEDED AT ALL ? I GUESS NOT - COMMENTED */
/*  * frsh_resource_get_battery_expiration() */
/*  * */
/*  * Get the foreseen expiration time of the battery for the resource */
/*  * identified by the supplied type and id. */
/*  * */
/* int frsh_battery_get_expiration(frsh_resource_type_t resource_type, */
/* 				 frsh_resource_id_t resource_id, */
/* 				 frsh_rel_time_t *expiration); */

/**
 * frsh_battery_get_expiration()
 *
 * Get the foreseen expiration time of the system battery(ies).
 **/
int frsh_battery_get_expiration(frsh_abs_time_t *expiration);

/*@}*/

FRSH_CPP_END_DECLS

#endif 	    /* _FRSH_ENERGY_MANAGEMENT_H_ */

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH 1/3] Added runqueue clock normalized with cpufreq
  2010-12-20  0:11         ` Tommaso Cucinotta
@ 2010-12-20  9:44           ` Harald Gustafsson
  2011-01-03 20:25             ` Tommaso Cucinotta
  0 siblings, 1 reply; 23+ messages in thread
From: Harald Gustafsson @ 2010-12-20  9:44 UTC (permalink / raw)
  To: Tommaso Cucinotta
  Cc: Peter Zijlstra, Dario Faggioli, Harald Gustafsson, linux-kernel,
	Ingo Molnar, Thomas Gleixner, Claudio Scordino, Michael Trimarchi,
	Fabio Checconi, Juri Lelli

2010/12/20 Tommaso Cucinotta <tommaso.cucinotta@sssup.it>:
> 1. from a requirements analysis phase, it comes out that it should be
> possible to specify the individual runtimes for each possible frequency, as
> it is well-known that the way computation times scale to CPU frequency is
> application-dependent (and platform-dependent); this assumes that as a
> developer I can specify the possible configurations of my real-time app,
> then the OS will be free to pick the CPU frequency that best suites its
> power management logic (i.e., keeping the minimum frequency by which I can
> meet all the deadlines).

I think this make perfect sense, and I have explored related ideas,
but for the Linux kernel and
softer realtime use cases I think it is likely too much at least if
this info needs to be passed to the kernel.

> 2. this was also assuming, at an API level, a quite static settings (typical
> of hard RT), in which I configure the system and don't change its frequency
> too often; for example, implications of power switches on hard real-time
> requirements (i.e., time windows in which the CPU is not operating during
> the switch, and limits on the max sustainable switching frequencies by apps
> and the like) have not been stated through the API;

I would not worry too much about switch transition effects. They are
in the same order of magnitude
as other disturbances from timers and interrupts and can easily be set
to a certain smallest periodicity.
But if I was designing a system that needed real hard RT tasks I would
probably not enable cpufreq
when those tasks were active.

> 3. for soft real-time contexts and Linux (consider FRESCOR targeted both
> hard RT on RT OSes and soft RT on Linux), we played with a much simpler
> trivial linear scaling, which is exactly what has been proposed and
> implemented by someone in this thread on top of SCHED_DEADLINE (AFAIU);
> however, there's a trick which cannot be neglected, i.e., *change protocol*
> (see 5); benchmarks on MPEG-2 decoding times showed that the linear
> approximation is not that bad, but the best interpolating ratio between the
> computing times in different CPU frequencies do not perfectly conform to the
> frequencies ratios; we didn't make any attempt of extensive evaluation over
> different workloads so far. See Figure 4.1 in D-AQ2v2:

Totally agree on this as well, and it would not be that difficult to
implement in Linux.
For example not just use the frequency as the normalization but have a
different
architecture dependent normalization. This would capture the general
normalization but
not on an application level. But, others might think this is
complicating matter too much.
The other solution is that the deadline task do some over-reservation,
which is going to be
less over-reservation compared to if no normalization existed.

> 4. I would say that, given the tendency to over-provision the runtime (WCET)
> for hard real-time contexts, it would not bee too much of a burden for a
> hard RT developer to properly over-provision the required budget in presence
> of a trivial runtime rescaling policy like in 2.; however, in order to make
> everybody happy, it doesn't seem a bad idea to have something like:
>  4a) use the fine runtimes specified by the user if they are available;
>  4b) use the trivially rescaled runtimes if the user only specified a single
> runtime, of course it should be clear through the API what is the frequency
> the user is referring its runtime to, in such case (e.g., maximum one ?)

You mean this on an application level? I think we should test the
trivial rescaling first
and if any users steps forward that need this lets reconsider.

> 5. Mode Change Protocol: whenever a frequency switch occurs (e.g., dictated
> by the non-RT workload fluctuations), runtimes cannot simply be rescaled
> instantaneously: keeping it short, the simplest thing we can do is relying
> on the various CBS servers implemented in the scheduler to apply the change
> from the next "runtime recharge", i.e., the next period. This creates the
> potential problem that the RT tasks have a non-negligible transitory for the
> instances crossing the CPU frequency switch, in which they do not have
> enough runtime for their work. Now, the general "rule of thumb" is
> straightforward: make room first, then "pack", i.e., we need to consider 2
> distinct cases:

If we use the trivial rescaling is this a problem? In my
implementation the runtime
accounting is correct even when the frequency switch happens during a period.
Also with Peter's suggested implementation the runtime will be correct
as I understand it.

>  5a) we want to *increase the CPU frequency*; we can immediately increase
> the frequency, then the RT applications will have a temporary
> over-provisioning of runtime (still tuned for the slower frequency case),
> however as soon as we're sure the CPU frequency switch completed, we can
> lower the runtimes to the new values;

Don't you think that this was due to that you did it from user space,
I actually change the
scheduler's accounting for the rest of the runtime, i.e. can deal with
partial runtimes.

> The protocol in 5. has been implemented completely in user-space as a
> modification to the powernowd daemon, in the context of an extended version
> of a paper in which we were automagically guessing the whole set of
> scheduling parameters for periodic RT applications (EuroSys 2010). The
> modified powernowd was considering both the whole RT utilization as imposed
> by the RT reservations, and the non-RT utilization as measured on the CPU.
> The paper will appear on ACM TECS, but who knows when, so here u can find it
> (see Section 7.5 "Power Management"):
>
>  http://retis.sssup.it/~tommaso/publications/ACM-TECS-2010.pdf

Thanks I will take a look as soon as I find the time.

> Last, but not least, the whole point in the above discussion is the
> assumption that it is meaningful to have a CPU frequency switching policy,
> as opposed to merely CPU idle-ing. Perhaps on old embedded CPUs this is
> still the case. Unfortunately, from preliminary measurements made on a few
> systems I use every day through a cheap power measurement device attached on
> the power cable, I could actually see that for RT workloads only it is worth
> to leave the system at the maximum frequency and exploit the much higher
> time spent in idle mode(s), except when the system is completely idle.

I was also of this impression for a while that cpufreq scaling would
be of less importance.
But when I looked at complex use cases, which are common on embedded devices and
also new chip technology nodes I had to reconsider. Unfortunately I
don't have any information
that I can share publicly. What is true is that the whole system
energy needs to be considered,
including peripherals, and this is very application dependent.

> If you're interested, I can share the collected data sets.
Sure, more data is always of interest.

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH 1/3] Added runqueue clock normalized with cpufreq
  2010-12-20  9:44           ` Harald Gustafsson
@ 2011-01-03 20:25             ` Tommaso Cucinotta
  2011-01-04 12:16               ` Harald Gustafsson
  0 siblings, 1 reply; 23+ messages in thread
From: Tommaso Cucinotta @ 2011-01-03 20:25 UTC (permalink / raw)
  To: Harald Gustafsson
  Cc: Peter Zijlstra, Dario Faggioli, Harald Gustafsson, linux-kernel,
	Ingo Molnar, Thomas Gleixner, Claudio Scordino, Michael Trimarchi,
	Fabio Checconi, Juri Lelli

Il 20/12/2010 10:44, Harald Gustafsson ha scritto:
> 2010/12/20 Tommaso Cucinotta<tommaso.cucinotta@sssup.it>:
>> 1. from a requirements analysis phase, it comes out that it should be
>> possible to specify the individual runtimes for each possible frequency, as
>> it is well-known that the way computation times scale to CPU frequency is
>> application-dependent (and platform-dependent); this assumes that as a
>> developer I can specify the possible configurations of my real-time app,
>> then the OS will be free to pick the CPU frequency that best suites its
>> power management logic (i.e., keeping the minimum frequency by which I can
>> meet all the deadlines).
> I think this make perfect sense, and I have explored related ideas,
> but for the Linux kernel and
> softer realtime use cases I think it is likely too much at least if
> this info needs to be passed to the kernel.

That's why we proposed a user-space daemon taking care of this (see
our paper at the last RTLWS in Kenya). This way, the kernel only sees
the minimal information it needs to have, and all the rest is handled
from the user-space (i.e., awareness of different budgets for the various
CPU speeds, extra complexity due the mode-change protocol, power
management logic). However, this is compatible with a user-space
power-management logic. Instead, if we wanted a kernel-space one
(e.g., the current governors), then we would have to pass all the
additional info to the kernel as well.
> But if I was designing a system that needed real hard RT tasks I would
> probably not enable cpufreq
> when those tasks were active.
This is what has always been done. However, there's an interesting thread
on the Jack mailing list in these weeks about the support for power
management (Jack may be considered to a certain extent hard RT due to
its professional usage [ audio glitches cannot be tolerated at all ], 
even if
it is definitely not safety critical). Interestingly, there they 
proposed jackfreqd:

   http://comments.gmane.org/gmane.comp.audio.jackit/22884

>
>> 4. I would say that, given the tendency to over-provision the runtime (WCET)
>> for hard real-time contexts, it would not bee too much of a burden for a
>> hard RT developer to properly over-provision the required budget in presence
>> of a trivial runtime rescaling policy like in 2.; however, in order to make
>> everybody happy, it doesn't seem a bad idea to have something like:
>>   4a) use the fine runtimes specified by the user if they are available;
>>   4b) use the trivially rescaled runtimes if the user only specified a single
>> runtime, of course it should be clear through the API what is the frequency
>> the user is referring its runtime to, in such case (e.g., maximum one ?)
> You mean this on an application level?
I was referring to the possibility to both specify (from within the app) the
additional budgets for the additional power modes, or not. In the former
case, the kernel would use the app-supplied values, in the latter case the
kernel would be free to use its dumb linear rescaling policy.
>> 5. Mode Change Protocol: whenever a frequency switch occurs (e.g., dictated
>> by the non-RT workload fluctuations), runtimes cannot simply be rescaled
>> instantaneously: keeping it short, the simplest thing we can do is relying
>> on the various CBS servers implemented in the scheduler to apply the change
>> from the next "runtime recharge", i.e., the next period. This creates the
>> potential problem that the RT tasks have a non-negligible transitory for the
>> instances crossing the CPU frequency switch, in which they do not have
>> enough runtime for their work. Now, the general "rule of thumb" is
>> straightforward: make room first, then "pack", i.e., we need to consider 2
>> distinct cases:
> If we use the trivial rescaling is this a problem?
This is independent on how the budgets for the various CPU speeds are
computed. It is simply a matter of how to dynamically change the runtime
assigned to a reservation. The change cannot be instantaneous, and the
easiest thing to implement is that, at the next recharge, the new value is
applied. If you try to simply "reset" the current reservation without
precautions, you put at risk schedulability of other reservations.
CPU frequency changes make things slightly more complex: if you reduce
the runtimes and increase the speed, you need to be sure the frequency
increase already occurred before recharging with a halved runtime.
Similarly, if you increase the runtimes and decrease the speed, you need
to ensure runtimes are already incremented when the frequency switch
actually occurs, and this takes time because the increase in runtimes
cannot be instantaneous (and the request comes asynchronously with
the various deadline tasks, where they consumed different parts of their
runtime at that moment).
> In my
> implementation the runtime
> accounting is correct even when the frequency switch happens during a period.
> Also with Peter's suggested implementation the runtime will be correct
> as I understand it.
Is it too much of a burden for you to detail how these "accounting" are
made, in your implementations ? (please, avoid me to go through the
whole code if possible).
>>   5a) we want to *increase the CPU frequency*; we can immediately increase
>> the frequency, then the RT applications will have a temporary
>> over-provisioning of runtime (still tuned for the slower frequency case),
>> however as soon as we're sure the CPU frequency switch completed, we can
>> lower the runtimes to the new values;
> Don't you think that this was due to that you did it from user space,
nope. The problem is the one I tried to detail above, and is there both
if you change things from the user-space, and if you do that from the
kernel-space.
> I actually change the
> scheduler's accounting for the rest of the runtime, i.e. can deal with
> partial runtimes.
... same request as above, if possible (detail, please) ...

... and, happy new year to everybody ...

     T.

-- 
Tommaso Cucinotta, Computer Engineering PhD, Researcher
ReTiS Lab, Scuola Superiore Sant'Anna, Pisa, Italy
Tel +39 050 882 024, Fax +39 050 882 003
http://retis.sssup.it/people/tommaso


^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH 1/3] Added runqueue clock normalized with cpufreq
  2011-01-03 20:25             ` Tommaso Cucinotta
@ 2011-01-04 12:16               ` Harald Gustafsson
  0 siblings, 0 replies; 23+ messages in thread
From: Harald Gustafsson @ 2011-01-04 12:16 UTC (permalink / raw)
  To: Tommaso Cucinotta
  Cc: Peter Zijlstra, Dario Faggioli, Harald Gustafsson, linux-kernel,
	Ingo Molnar, Thomas Gleixner, Claudio Scordino, Michael Trimarchi,
	Fabio Checconi, Juri Lelli

> This is what has always been done. However, there's an interesting thread
> on the Jack mailing list in these weeks about the support for power
> management (Jack may be considered to a certain extent hard RT due to
> its professional usage [ audio glitches cannot be tolerated at all ], even
> if
> it is definitely not safety critical). Interestingly, there they proposed
> jackfreqd:
>
>  http://comments.gmane.org/gmane.comp.audio.jackit/22884
Being an embedded audio engineer for many years I know that we audio people take
audio quality and realtime performance seriously. If I understand what
the jackfreqd
does is that it make's sure that the CPU frequency is controlled by
the JACK DSP-load,
which sort of is a CPU time percentage devoted to JACK over an audio
frame period.
With sched deadline and a resource manager knowing about JACK's needs
this should be
possible to handle in an ondemand governor aware of sched deadline
bandwidths. The
RM would set the periods and runtime budgets based on JACK's DSP load,
e.g. period = audio frame duration and runtime = "max" DSP-load + margin.

> I was referring to the possibility to both specify (from within the app) the
> additional budgets for the additional power modes, or not. In the former
> case, the kernel would use the app-supplied values, in the latter case the
> kernel would be free to use its dumb linear rescaling policy.

OK, basically specifying the normalization values per power state for each
thread with the default being a linear scaling. I'll make sure that the default
normalization can be changed then but default initialized to linear based on
frequency in each freq state. Maybe a separate patch with a new prctl
call that can alter this, so we can evaluate it separately.

> This is independent on how the budgets for the various CPU speeds are
> computed. It is simply a matter of how to dynamically change the runtime
> assigned to a reservation. The change cannot be instantaneous, and the
But we don't change the runtime assigned to a reservation, think of it more
as the runtime is specified in "cycles". This is done either as in my patch that
the scheduler's runtime clock is running slower at lower clock speeds
or as Peter suggest that during runtime accounting the delta execution is
normalized with the cpu frequency.

> easiest thing to implement is that, at the next recharge, the new value is
> applied. If you try to simply "reset" the current reservation without
> precautions, you put at risk schedulability of other reservations.
> CPU frequency changes make things slightly more complex: if you reduce
> the runtimes and increase the speed, you need to be sure the frequency
> increase already occurred before recharging with a halved runtime.
Right now I only act on the post cpu frequency change notification. I think
that on most systems the error due to that it takes some time to change the
actual frequency of the core is on par with other errors like context switches,
migration (due to G-EDF) or cache misses. But I'm open for other views on that.

> Similarly, if you increase the runtimes and decrease the speed, you need
> to ensure runtimes are already incremented when the frequency switch
> actually occurs, and this takes time because the increase in runtimes
> cannot be instantaneous (and the request comes asynchronously with
> the various deadline tasks, where they consumed different parts of their
> runtime at that moment).
See previous comment about the change of the runtime vs accounting
a normalized runtime.

> Is it too much of a burden for you to detail how these "accounting" are
> made, in your implementations ? (please, avoid me to go through the
> whole code if possible).
It is simple, basically to things are introduced.
1) At every post cpufreq notification the factor between the current
frequency and the maximum frequency is calculated, i.e. the linear scaling.
I also keep track of the time this happens so that the runtime clock progress
is done with right factor also between sched clock updates. Hence I introduce
a clock that progress approximately proportional to the CPU clock frequency.
(On some systems this could actually be obtained directly, so that is
a potential
optimization by introducing a sched_cycle_in_ktime() next to the
sched_clock() call.)

2) For runtime accounting the CPU frequency normalized runtime clock is used.
Deadline accounting still use the real time. So for example if running
at 50% freq
and having a runtime budget of 20 ms and a period of 100 ms. The
deadline will still
happen at each 100 ms period, but the runtime progress is only half
compared with
real time. Hence it would correspond to setting the runtime to 40 ms,
but the nice part
of it is that when the CPU frequency is altered the accounting
progress as before
but with a new factor. For example if at 50 ms into the period the
runtime is at 10 ms
(i.e. have run 20 ms@ real time) and the CPU freq is now set to 100%
the remaining
10 ms of the runtime will finish in 10ms@real time.

Hope this helps explains how the runtime accounting is done in my
patches. With the
comments from Peter this would change slightly so that instead of
keeping an actual
normalized runtime clock we would normalize each threads progress
during the accounting
of the runtime. This would actually help also to incorporate your
comments about having
non-linear normalization per thread.

/Harald

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH 1/3] Added runqueue clock normalized with cpufreq
  2010-12-17 18:56   ` Dario Faggioli
  2010-12-17 18:59     ` Peter Zijlstra
@ 2010-12-17 19:27     ` Harald Gustafsson
  1 sibling, 0 replies; 23+ messages in thread
From: Harald Gustafsson @ 2010-12-17 19:27 UTC (permalink / raw)
  To: Dario Faggioli
  Cc: Peter Zijlstra, Harald Gustafsson, linux-kernel, Ingo Molnar,
	Thomas Gleixner, Claudio Scordino, Michael Trimarchi,
	Fabio Checconi, Tommaso Cucinotta, Juri Lelli

2010/12/17 Dario Faggioli <raistlin@linux.it>:
> On Fri, 2010-12-17 at 15:29 +0100, Peter Zijlstra wrote:
>> Solving the CPUfreq problem involves writing a SCHED_DEADLINE aware
>> CPUfreq governor. The governor must know about the constraints placed on
>> the system by the task-set. You simply cannot lower the frequency when
>> your system is at u=1.
>>
> We already did the very same thing (for another EU Project called
> FRESCOR), although it was done in an userspace sort of daemon. It was
> also able to consider other "high level" parameters like some estimation
> of the QoS of each application and of the global QoS of the system.
>
> However, converting the basic mechanism into a CPUfreq governor should
> be easily doable... The only problem is finding the time for that! ;-P

I'm a bit choked before the holidays, but I can fix this in the
beginning of next year.
At the same time as I do a new version of the current patches that takes
in Peter's comments.

>> The simple solution would be to slow down the runtime accounting of
>> SCHED_DEADLINE tasks by freq/max_freq. So instead of having:
>>
>>   dl_se->runtime -= delta;
>>
>> you do something like:
>>
>>   dl_se->runtime -= (freq * delta) / max_freq;
>>
>> Which auto-magically grows the actual bandwidth, and since the deadlines
>> are wall-time already it all works out nicely. It also keeps the
>> overhead inside SCHED_DEADLINE.
>>
> And, at least for the meantime, this seems a very very nice solution.
> The only thing I don't like is that division which would end up in being
> performed at each tick/update_curr_dl(), but we can try to find out a
> way to mitigate this, what do you think Harald?

Yes, I will do something like this instead, need to make sure that
everything is consider first though.

^ permalink raw reply	[flat|nested] 23+ messages in thread

end of thread, other threads:[~2011-01-04 12:16 UTC | newest]

Thread overview: 23+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2010-12-17 13:02 [PATCH 1/3] Added runqueue clock normalized with cpufreq Harald Gustafsson
2010-12-17 13:02 ` [PATCH 2/3] cpufreq normalized runtime to enforce runtime cycles also at lower frequencies Harald Gustafsson
2010-12-17 13:02 ` [PATCH 3/3] sched trace updated with normalized clock info Harald Gustafsson
2010-12-17 14:29 ` [PATCH 1/3] Added runqueue clock normalized with cpufreq Peter Zijlstra
2010-12-17 14:32   ` Peter Zijlstra
2010-12-17 15:06     ` Harald Gustafsson
2010-12-17 15:16       ` Peter Zijlstra
2010-12-17 15:36         ` Harald Gustafsson
2010-12-17 15:43         ` Thomas Gleixner
2010-12-17 15:54           ` Harald Gustafsson
2010-12-17 18:44           ` Dario Faggioli
2011-01-03 14:17           ` Pavel Machek
2010-12-17 15:02   ` Harald Gustafsson
2010-12-17 18:48     ` Dario Faggioli
2010-12-17 18:56   ` Dario Faggioli
2010-12-17 18:59     ` Peter Zijlstra
2010-12-17 19:16       ` Dario Faggioli
2010-12-17 19:31       ` Harald Gustafsson
2010-12-20  0:11         ` Tommaso Cucinotta
2010-12-20  9:44           ` Harald Gustafsson
2011-01-03 20:25             ` Tommaso Cucinotta
2011-01-04 12:16               ` Harald Gustafsson
2010-12-17 19:27     ` Harald Gustafsson

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox