* [RFC][PATCH 1/9] CPU controller - Adds class load estimation
2006-04-21 2:27 [RFC][PATCH 0/9] CKRM CPU resource controller maeda.naoaki
@ 2006-04-21 2:27 ` maeda.naoaki
2006-04-21 2:27 ` [RFC][PATCH 2/9] CPU controller - Adds class hungry detection maeda.naoaki
` (7 subsequent siblings)
8 siblings, 0 replies; 18+ messages in thread
From: maeda.naoaki @ 2006-04-21 2:27 UTC (permalink / raw)
To: linux-kernel, ckrm-tech; +Cc: maeda.naoaki
1/9: cpurc_load_estimation
This patch corresponds to section 1 in Documentation/ckrm/cpurc-internals,
adding load estimation of task groups (classes in CKRM terminology) that is
grouped by the cpurc structure. Load estimation is necessary for controlling
CPU resource because the CPU resource controller needs to know whether
the resource assigned to a task group is enough or not.
Signed-off-by: Kurosawa Takahiro <kurosawa@valinux.co.jp>
Signed-off-by: MAEDA Naoaki <maeda.naoaki@jp.fujitsu.com>
include/linux/cpu_rc.h | 65 ++++++++++++++++++++++++++++++++++++++++
include/linux/sched.h | 4 ++
init/Kconfig | 9 +++++
kernel/Makefile | 1
kernel/cpu_rc.c | 79 +++++++++++++++++++++++++++++++++++++++++++++++++
kernel/exit.c | 2 +
kernel/sched.c | 14 ++++++++
7 files changed, 174 insertions(+)
Index: linux-2.6.17-rc2/include/linux/cpu_rc.h
===================================================================
--- /dev/null
+++ linux-2.6.17-rc2/include/linux/cpu_rc.h
@@ -0,0 +1,65 @@
+#ifndef _LINUX_CPU_RC_H_
+#define _LINUX_CPU_RC_H_
+/*
+ * CPU resource controller interface
+ *
+ * Copyright 2005-2006 FUJITSU LIMITED
+ *
+ * This file is subject to the terms and conditions of the GNU General Public
+ * License. See the file COPYING in the main directory of the Linux
+ * distribution for more details.
+ */
+
+#include <linux/config.h>
+#include <linux/sched.h>
+
+#ifdef CONFIG_CPU_RC
+
+#define CPU_RC_SPREAD_PERIOD (10 * HZ)
+#define CPU_RC_LOAD_SCALE (2 * CPU_RC_SPREAD_PERIOD)
+#define CPU_RC_SHARE_SCALE 100
+
+struct cpu_rc_domain {
+ spinlock_t lock;
+ unsigned long timestamp;
+ cpumask_t cpus;
+ int numcpus;
+ int numcrs;
+};
+
+struct cpu_rc {
+ struct cpu_rc_domain *rcd;
+ struct {
+ unsigned long timestamp;
+ unsigned int load;
+ } stat[NR_CPUS]; /* XXX need alignment */
+};
+
+extern struct cpu_rc *cpu_rc_get(task_t *);
+extern unsigned int cpu_rc_load(struct cpu_rc *);
+extern void cpu_rc_account(task_t *, unsigned long);
+
+static inline void cpu_rc_record_allocation(task_t *tsk,
+ unsigned int slice,
+ unsigned long now)
+{
+ if (slice == 0) {
+ /* minimal allocated time_slice is 1 (see sched_fork()). */
+ slice = 1;
+ }
+
+ tsk->last_slice = slice;
+ tsk->ts_alloced = now;
+}
+
+#else /* CONFIG_CPU_RC */
+
+static inline void cpu_rc_account(task_t *tsk, unsigned long now) {}
+static inline void cpu_rc_record_allocation(task_t *tsk,
+ unsigned int slice,
+ unsigned long now) {}
+
+#endif /* CONFIG_CPU_RC */
+
+#endif /* _LINUX_CPU_RC_H_ */
+
Index: linux-2.6.17-rc2/include/linux/sched.h
===================================================================
--- linux-2.6.17-rc2.orig/include/linux/sched.h
+++ linux-2.6.17-rc2/include/linux/sched.h
@@ -892,6 +892,10 @@ struct task_struct {
struct ckrm_class *class;
struct list_head member_list; /* list of tasks in class */
#endif /* CONFIG_CKRM */
+#ifdef CONFIG_CPU_RC
+ unsigned int last_slice;
+ unsigned long ts_alloced;
+#endif
};
static inline pid_t process_group(struct task_struct *tsk)
Index: linux-2.6.17-rc2/init/Kconfig
===================================================================
--- linux-2.6.17-rc2.orig/init/Kconfig
+++ linux-2.6.17-rc2/init/Kconfig
@@ -261,6 +261,15 @@ config RELAY
If unsure, say N.
+config CPU_RC
+ bool "CPU resource controller"
+ depends on CKRM_RES_CPU
+ help
+ This option will let you control the CPU resource by scaling
+ the timeslice allocated for each task.
+
+ Say N if unsure.
+
source "usr/Kconfig"
config UID16
Index: linux-2.6.17-rc2/kernel/Makefile
===================================================================
--- linux-2.6.17-rc2.orig/kernel/Makefile
+++ linux-2.6.17-rc2/kernel/Makefile
@@ -27,6 +27,7 @@ obj-$(CONFIG_BSD_PROCESS_ACCT) += acct.o
obj-$(CONFIG_KEXEC) += kexec.o
obj-$(CONFIG_COMPAT) += compat.o
obj-$(CONFIG_CPUSETS) += cpuset.o
+obj-$(CONFIG_CPU_RC) += cpu_rc.o
obj-$(CONFIG_IKCONFIG) += configs.o
obj-$(CONFIG_STOP_MACHINE) += stop_machine.o
obj-$(CONFIG_AUDIT) += audit.o auditfilter.o
Index: linux-2.6.17-rc2/kernel/cpu_rc.c
===================================================================
--- /dev/null
+++ linux-2.6.17-rc2/kernel/cpu_rc.c
@@ -0,0 +1,79 @@
+/*
+ * kernel/cpu_rc.c
+ *
+ * CPU resource controller by scaling time_slice of the task.
+ *
+ * Copyright 2005-2006 FUJITSU LIMITED
+ *
+ * This file is subject to the terms and conditions of the GNU General Public
+ * License. See the file COPYING in the main directory of the Linux
+ * distribution for more details.
+ */
+
+#include <linux/config.h>
+#include <linux/sched.h>
+#include <linux/cpu_rc.h>
+
+/*
+ * cpu_rc_load() calculates the class load
+ */
+unsigned int cpu_rc_load(struct cpu_rc *cr)
+{
+ unsigned int load;
+ int i, n;
+
+ BUG_ON(!cr);
+
+ load = 0;
+ n = 0;
+
+ /* Just reading the value, so no locking... */
+ for_each_cpu_mask(i, cr->rcd->cpus) {
+ if (jiffies - cr->stat[i].timestamp <= CPU_RC_SPREAD_PERIOD)
+ load += cr->stat[i].load;
+ n++;
+ }
+
+ BUG_ON(!n);
+
+ return load / n * CPU_RC_GUAR_SCALE / CPU_RC_LOAD_SCALE;
+}
+
+/*
+ * cpu_rc_account() calculates the task load when the timeslice is expired
+ */
+void cpu_rc_account(task_t *tsk, unsigned long now)
+{
+ struct cpu_rc *cr;
+ int cpu = smp_processor_id();
+ unsigned long last;
+ unsigned int cls_load, tsk_load;
+ unsigned long base, update;
+
+ if (tsk == idle_task(task_cpu(tsk)))
+ return;
+
+ cr = cpu_rc_get(tsk);
+ if (!cr)
+ return;
+
+ base = now - tsk->ts_alloced;
+ if (base == 0)
+ return; /* duration too small. can not collect statistics. */
+
+ tsk_load = CPU_RC_LOAD_SCALE * (tsk->last_slice - tsk->time_slice)
+ + (CPU_RC_LOAD_SCALE / 2);
+ if (base > CPU_RC_SPREAD_PERIOD)
+ tsk_load = CPU_RC_SPREAD_PERIOD * tsk_load / base;
+
+ last = cr->stat[cpu].timestamp;
+ update = now - last;
+ if (update > CPU_RC_SPREAD_PERIOD)
+ cls_load = 0; /* statistics data obsolete. */
+ else
+ cls_load = cr->stat[cpu].load
+ * (CPU_RC_SPREAD_PERIOD - update);
+
+ cr->stat[cpu].timestamp = now;
+ cr->stat[cpu].load = (cls_load + tsk_load) / CPU_RC_SPREAD_PERIOD;
+}
Index: linux-2.6.17-rc2/kernel/sched.c
===================================================================
--- linux-2.6.17-rc2.orig/kernel/sched.c
+++ linux-2.6.17-rc2/kernel/sched.c
@@ -43,6 +43,7 @@
#include <linux/rcupdate.h>
#include <linux/cpu.h>
#include <linux/cpuset.h>
+#include <linux/cpu_rc.h>
#include <linux/percpu.h>
#include <linux/kthread.h>
#include <linux/seq_file.h>
@@ -1377,6 +1378,7 @@ int fastcall wake_up_state(task_t *p, un
void fastcall sched_fork(task_t *p, int clone_flags)
{
int cpu = get_cpu();
+ unsigned long now;
#ifdef CONFIG_SMP
cpu = sched_balance_self(cpu, SD_BALANCE_FORK);
@@ -1416,6 +1418,9 @@ void fastcall sched_fork(task_t *p, int
p->first_time_slice = 1;
current->time_slice >>= 1;
p->timestamp = sched_clock();
+ now = jiffies;
+ cpu_rc_record_allocation(current, current->time_slice, now);
+ cpu_rc_record_allocation(p, p->time_slice, now);
if (unlikely(!current->time_slice)) {
/*
* This case is rare, it happens when the parent has only
@@ -1533,6 +1538,8 @@ void fastcall sched_exit(task_t *p)
p->parent->time_slice += p->time_slice;
if (unlikely(p->parent->time_slice > task_timeslice(p)))
p->parent->time_slice = task_timeslice(p);
+ cpu_rc_record_allocation(p->parent,
+ p->parent->time_slice, jiffies);
}
if (p->sleep_avg < p->parent->sleep_avg)
p->parent->sleep_avg = p->parent->sleep_avg /
@@ -2617,6 +2624,7 @@ void scheduler_tick(void)
runqueue_t *rq = this_rq();
task_t *p = current;
unsigned long long now = sched_clock();
+ unsigned long jnow;
update_cpu_clock(p, rq, now);
@@ -2651,6 +2659,9 @@ void scheduler_tick(void)
p->time_slice = task_timeslice(p);
p->first_time_slice = 0;
set_tsk_need_resched(p);
+#ifdef CONFIG_CPU_RC
+ /* XXX need accounting even for rt_task? */
+#endif
/* put it at the end of the queue: */
requeue_task(p, rq->active);
@@ -2660,9 +2671,12 @@ void scheduler_tick(void)
if (!--p->time_slice) {
dequeue_task(p, rq->active);
set_tsk_need_resched(p);
+ jnow = jiffies;
+ cpu_rc_account(p, jnow);
p->prio = effective_prio(p);
p->time_slice = task_timeslice(p);
p->first_time_slice = 0;
+ cpu_rc_record_allocation(p, p->time_slice, jnow);
if (!rq->expired_timestamp)
rq->expired_timestamp = jiffies;
Index: linux-2.6.17-rc2/kernel/exit.c
===================================================================
--- linux-2.6.17-rc2.orig/kernel/exit.c
+++ linux-2.6.17-rc2/kernel/exit.c
@@ -36,6 +36,7 @@
#include <linux/compat.h>
#include <linux/pipe_fs_i.h>
#include <linux/ckrm.h>
+#include <linux/cpu_rc.h>
#include <asm/uaccess.h>
#include <asm/unistd.h>
@@ -852,6 +853,7 @@ fastcall NORET_TYPE void do_exit(long co
int group_dead;
profile_task_exit(tsk);
+ cpu_rc_account(tsk, jiffies);
WARN_ON(atomic_read(&tsk->fs_excl));
^ permalink raw reply [flat|nested] 18+ messages in thread* [RFC][PATCH 2/9] CPU controller - Adds class hungry detection
2006-04-21 2:27 [RFC][PATCH 0/9] CKRM CPU resource controller maeda.naoaki
2006-04-21 2:27 ` [RFC][PATCH 1/9] CPU controller - Adds class load estimation maeda.naoaki
@ 2006-04-21 2:27 ` maeda.naoaki
2006-04-21 2:27 ` [RFC][PATCH 3/9] CPU controller - Adds timeslice scaling maeda.naoaki
` (6 subsequent siblings)
8 siblings, 0 replies; 18+ messages in thread
From: maeda.naoaki @ 2006-04-21 2:27 UTC (permalink / raw)
To: linux-kernel, ckrm-tech; +Cc: maeda.naoaki
2/9: cpurc_hungry_detection
This patch corresponds to section 2 in Documentation/ckrm/cpurc-internals,
adding the detection code that checks whether a task group needs more CPU
resource or not. The CPU resource controller has to distinguish whether
tasks in the group actually need more resource or they are just sleepy.
If they need more resource, the resource controller must give more resource,
otherwise it must not.
Signed-off-by: Kurosawa Takahiro <kurosawa@valinux.co.jp>
Signed-off-by: MAEDA Naoaki <maeda.naoaki@jp.fujitsu.com>
include/linux/cpu_rc.h | 15 +++++++
include/linux/sched.h | 1
kernel/cpu_rc.c | 96 +++++++++++++++++++++++++++++++++++++++++++++++++
kernel/sched.c | 5 ++
4 files changed, 117 insertions(+)
Index: linux-2.6.17-rc2/include/linux/cpu_rc.h
===================================================================
--- linux-2.6.17-rc2.orig/include/linux/cpu_rc.h
+++ linux-2.6.17-rc2/include/linux/cpu_rc.h
@@ -18,9 +18,13 @@
#define CPU_RC_SPREAD_PERIOD (10 * HZ)
#define CPU_RC_LOAD_SCALE (2 * CPU_RC_SPREAD_PERIOD)
#define CPU_RC_SHARE_SCALE 100
+#define CPU_RC_TSFACTOR_MAX CPU_RC_SHARE_SCALE
+#define CPU_RC_HCOUNT_INC 2
+#define CPU_RC_RECALC_INTERVAL HZ
struct cpu_rc_domain {
spinlock_t lock;
+ unsigned int hungry_count;
unsigned long timestamp;
cpumask_t cpus;
int numcpus;
@@ -28,16 +32,25 @@ struct cpu_rc_domain {
};
struct cpu_rc {
+ int share;
+ int is_hungry;
struct cpu_rc_domain *rcd;
struct {
unsigned long timestamp;
unsigned int load;
+ int maybe_hungry;
} stat[NR_CPUS]; /* XXX need alignment */
};
extern struct cpu_rc *cpu_rc_get(task_t *);
extern unsigned int cpu_rc_load(struct cpu_rc *);
extern void cpu_rc_account(task_t *, unsigned long);
+extern void cpu_rc_detect_hunger(task_t *);
+
+static inline void cpu_rc_record_activated(task_t *tsk, unsigned long now)
+{
+ tsk->last_activated = now;
+}
static inline void cpu_rc_record_allocation(task_t *tsk,
unsigned int slice,
@@ -55,6 +68,8 @@ static inline void cpu_rc_record_allocat
#else /* CONFIG_CPU_RC */
static inline void cpu_rc_account(task_t *tsk, unsigned long now) {}
+static inline void cpu_rc_detect_hunger(task_t *tsk) {}
+static inline void cpu_rc_record_activated(task_t *tsk, unsigned long now) {}
static inline void cpu_rc_record_allocation(task_t *tsk,
unsigned int slice,
unsigned long now) {}
Index: linux-2.6.17-rc2/include/linux/sched.h
===================================================================
--- linux-2.6.17-rc2.orig/include/linux/sched.h
+++ linux-2.6.17-rc2/include/linux/sched.h
@@ -895,6 +895,7 @@ struct task_struct {
#ifdef CONFIG_CPU_RC
unsigned int last_slice;
unsigned long ts_alloced;
+ unsigned long last_activated;
#endif
};
Index: linux-2.6.17-rc2/kernel/cpu_rc.c
===================================================================
--- linux-2.6.17-rc2.orig/kernel/cpu_rc.c
+++ linux-2.6.17-rc2/kernel/cpu_rc.c
@@ -14,6 +14,72 @@
#include <linux/sched.h>
#include <linux/cpu_rc.h>
+static inline int cpu_rc_is_hungry(struct cpu_rc *cr)
+{
+ return cr->is_hungry;
+}
+
+static inline void cpu_rc_set_hungry(struct cpu_rc *cr)
+{
+ cr->is_hungry++;
+ cr->rcd->hungry_count += CPU_RC_HCOUNT_INC;
+}
+
+static inline void cpu_rc_set_satisfied(struct cpu_rc *cr)
+{
+ cr->is_hungry = 0;
+}
+
+static inline int cpu_rc_is_anyone_hungry(struct cpu_rc *cr)
+{
+ return cr->rcd->hungry_count > 0;
+}
+
+/*
+ * cpu_rc_recalc_tsfactor() updates the class timeslice scale factor
+ */
+static inline void cpu_rc_recalc_tsfactor(struct cpu_rc *cr)
+{
+ unsigned long now = jiffies;
+ unsigned long interval = now - cr->rcd->timestamp;
+ unsigned int load;
+ int maybe_hungry;
+ int i, n;
+
+ n = 0;
+ load = 0;
+ maybe_hungry = 0;
+
+ cpu_rcd_lock(cr);
+ if (cr->rcd->timestamp == 0) {
+ cr->rcd->timestamp = now;
+ } else if (interval > CPU_RC_SPREAD_PERIOD) {
+ cr->rcd->hungry_count = 0;
+ cr->rcd->timestamp = now;
+ } else if (interval > CPU_RC_RECALC_INTERVAL) {
+ cr->rcd->hungry_count >>= 1;
+ cr->rcd->timestamp = now;
+ }
+
+ for_each_cpu_mask(i, cr->rcd->cpus) {
+ load += cr->stat[i].load;
+ maybe_hungry += cr->stat[i].maybe_hungry;
+ cr->stat[i].maybe_hungry = 0;
+ n++;
+ }
+
+ BUG_ON(n == 0);
+ load = load / n;
+
+ if ((load * CPU_RC_SHARE_SCALE >= cr->share * CPU_RC_LOAD_SCALE) ||
+ !maybe_hungry)
+ cpu_rc_set_satisfied(cr);
+ else
+ cpu_rc_set_hungry(cr);
+
+ cpu_rcd_unlock(cr);
+}
+
/*
* cpu_rc_load() calculates the class load
*/
@@ -77,3 +143,33 @@ void cpu_rc_account(task_t *tsk, unsigne
cr->stat[cpu].timestamp = now;
cr->stat[cpu].load = (cls_load + tsk_load) / CPU_RC_SPREAD_PERIOD;
}
+
+/*
+ * cpu_rc_detect_hunger() judges if the class is maybe hungry
+ */
+void cpu_rc_detect_hunger(task_t *tsk)
+{
+ struct cpu_rc *cr;
+ unsigned long wait;
+ int cpu = smp_processor_id();
+
+ if (tsk == idle_task(task_cpu(tsk)))
+ return;
+
+ if (tsk->last_activated == 0)
+ return;
+
+ cr = cpu_rc_get(tsk);
+ if (!cr) {
+ tsk->last_activated = 0;
+ return;
+ }
+
+ BUG_ON(tsk->last_slice == 0);
+ wait = jiffies - tsk->last_activated;
+ if (CPU_RC_GUAR_SCALE * tsk->last_slice / (wait + tsk->last_slice)
+ < cr->share)
+ cr->stat[cpu].maybe_hungry++;
+
+ tsk->last_activated = 0;
+}
Index: linux-2.6.17-rc2/kernel/sched.c
===================================================================
--- linux-2.6.17-rc2.orig/kernel/sched.c
+++ linux-2.6.17-rc2/kernel/sched.c
@@ -716,6 +716,7 @@ static void __activate_task(task_t *p, r
if (unlikely(batch_task(p) || (expired_starving(rq) && !rt_task(p))))
target = rq->expired;
+ cpu_rc_record_activated(p, jiffies);
enqueue_task(p, target);
rq->nr_running++;
}
@@ -1478,6 +1479,7 @@ void fastcall wake_up_new_task(task_t *p
p->array = current->array;
p->array->nr_active++;
rq->nr_running++;
+ cpu_rc_record_activated(p, jiffies);
}
set_need_resched();
} else
@@ -2686,6 +2688,8 @@ void scheduler_tick(void)
rq->best_expired_prio = p->static_prio;
} else
enqueue_task(p, rq->active);
+
+ cpu_rc_record_activated(p, jnow);
} else {
/*
* Prevent a too long timeslice allowing a task to monopolize
@@ -3079,6 +3083,7 @@ switch_tasks:
rcu_qsctr_inc(task_cpu(prev));
update_cpu_clock(prev, rq, now);
+ cpu_rc_detect_hunger(next);
prev->sleep_avg -= run_time;
if ((long)prev->sleep_avg <= 0)
^ permalink raw reply [flat|nested] 18+ messages in thread* [RFC][PATCH 3/9] CPU controller - Adds timeslice scaling
2006-04-21 2:27 [RFC][PATCH 0/9] CKRM CPU resource controller maeda.naoaki
2006-04-21 2:27 ` [RFC][PATCH 1/9] CPU controller - Adds class load estimation maeda.naoaki
2006-04-21 2:27 ` [RFC][PATCH 2/9] CPU controller - Adds class hungry detection maeda.naoaki
@ 2006-04-21 2:27 ` maeda.naoaki
2006-04-21 8:17 ` Mike Galbraith
2006-04-21 2:27 ` [RFC][PATCH 4/9] CPU controller - Adds interface functions maeda.naoaki
` (5 subsequent siblings)
8 siblings, 1 reply; 18+ messages in thread
From: maeda.naoaki @ 2006-04-21 2:27 UTC (permalink / raw)
To: linux-kernel, ckrm-tech; +Cc: maeda.naoaki
3/9: cpurc_timeslice_scaling
This patch corresponds to section 3 in Documentation/ckrm/cpurc-internals,
adding the CPU resource control by scaling timeslices given to each task.
The scaling factors of timeslices are changed based on the difference between
the share of the resource and the actual load.
Signed-off-by: Kurosawa Takahiro <kurosawa@valinux.co.jp>
Signed-off-by: MAEDA Naoaki <maeda.naoaki@jp.fujitsu.com>
include/linux/cpu_rc.h | 12 +++++++++
kernel/cpu_rc.c | 63 +++++++++++++++++++++++++++++++++++++++++++++++--
kernel/sched.c | 11 +++++++-
3 files changed, 82 insertions(+), 4 deletions(-)
Index: linux-2.6.17-rc2/include/linux/cpu_rc.h
===================================================================
--- linux-2.6.17-rc2.orig/include/linux/cpu_rc.h
+++ linux-2.6.17-rc2/include/linux/cpu_rc.h
@@ -17,8 +17,11 @@
#define CPU_RC_SPREAD_PERIOD (10 * HZ)
#define CPU_RC_LOAD_SCALE (2 * CPU_RC_SPREAD_PERIOD)
+#define CPU_RC_LOAD_MARGIN 1
#define CPU_RC_SHARE_SCALE 100
#define CPU_RC_TSFACTOR_MAX CPU_RC_SHARE_SCALE
+#define CPU_RC_TSFACTOR_INC_HI 5
+#define CPU_RC_TSFACTOR_INC_LO 2
#define CPU_RC_HCOUNT_INC 2
#define CPU_RC_RECALC_INTERVAL HZ
@@ -34,6 +37,8 @@ struct cpu_rc_domain {
struct cpu_rc {
int share;
int is_hungry;
+ unsigned int ts_factor;
+ unsigned long last_recalc;
struct cpu_rc_domain *rcd;
struct {
unsigned long timestamp;
@@ -44,6 +49,7 @@ struct cpu_rc {
extern struct cpu_rc *cpu_rc_get(task_t *);
extern unsigned int cpu_rc_load(struct cpu_rc *);
+extern unsigned int cpu_rc_scale_timeslice(task_t *, unsigned int);
extern void cpu_rc_account(task_t *, unsigned long);
extern void cpu_rc_detect_hunger(task_t *);
@@ -74,6 +80,12 @@ static inline void cpu_rc_record_allocat
unsigned int slice,
unsigned long now) {}
+static inline unsigned int cpu_rc_scale_timeslice(task_t *tsk,
+ unsigned int slice)
+{
+ return slice;
+}
+
#endif /* CONFIG_CPU_RC */
#endif /* _LINUX_CPU_RC_H_ */
Index: linux-2.6.17-rc2/kernel/cpu_rc.c
===================================================================
--- linux-2.6.17-rc2.orig/kernel/cpu_rc.c
+++ linux-2.6.17-rc2/kernel/cpu_rc.c
@@ -14,6 +14,16 @@
#include <linux/sched.h>
#include <linux/cpu_rc.h>
+static inline void cpu_rcd_lock(struct cpu_rc *cr)
+{
+ spin_lock(&cr->rcd->lock);
+}
+
+static inline void cpu_rcd_unlock(struct cpu_rc *cr)
+{
+ spin_unlock(&cr->rcd->lock);
+}
+
static inline int cpu_rc_is_hungry(struct cpu_rc *cr)
{
return cr->is_hungry;
@@ -77,6 +87,33 @@ static inline void cpu_rc_recalc_tsfacto
else
cpu_rc_set_hungry(cr);
+ if (!cpu_rc_is_anyone_hungry(cr)) {
+ /* Everyone satisfied. Extend time_slice. */
+ cr->ts_factor += CPU_RC_TSFACTOR_INC_HI;
+ } else {
+ if (cpu_rc_is_hungry(cr)) {
+ /* Extend time_slice a little. */
+ cr->ts_factor += CPU_RC_TSFACTOR_INC_LO;
+ } else if (load * CPU_RC_SHARE_SCALE >
+ (cr->share + CPU_RC_LOAD_MARGIN)
+ * CPU_RC_LOAD_SCALE) {
+ /*
+ * scale time_slice only when load is higher than
+ * the share.
+ */
+ cr->ts_factor = cr->ts_factor * cr->share
+ * CPU_RC_LOAD_SCALE
+ / (load * CPU_RC_SHARE_SCALE);
+ }
+ }
+
+ if (cr->ts_factor == 0)
+ cr->ts_factor = 1;
+ else if (cr->ts_factor > CPU_RC_TSFACTOR_MAX)
+ cr->ts_factor = CPU_RC_TSFACTOR_MAX;
+
+ cr->last_recalc = now;
+
cpu_rcd_unlock(cr);
}
@@ -102,7 +139,29 @@ unsigned int cpu_rc_load(struct cpu_rc *
BUG_ON(!n);
- return load / n * CPU_RC_GUAR_SCALE / CPU_RC_LOAD_SCALE;
+ return load / n * CPU_RC_SHARE_SCALE / CPU_RC_LOAD_SCALE;
+}
+
+/*
+ * cpu_rc_scale_timeslice scales the task timeslice based on the scale factor
+ */
+unsigned int cpu_rc_scale_timeslice(task_t *tsk, unsigned int slice)
+{
+ struct cpu_rc *cr;
+ unsigned int scaled;
+
+ cr = cpu_rc_get(tsk);
+ if (!cr)
+ return slice;
+
+ if (jiffies - cr->last_recalc > CPU_RC_RECALC_INTERVAL)
+ cpu_rc_recalc_tsfactor(cr);
+
+ scaled = slice * cr->ts_factor / CPU_RC_TSFACTOR_MAX;
+ if (scaled == 0)
+ scaled = 1;
+
+ return scaled;
}
/*
@@ -167,7 +226,7 @@ void cpu_rc_detect_hunger(task_t *tsk)
BUG_ON(tsk->last_slice == 0);
wait = jiffies - tsk->last_activated;
- if (CPU_RC_GUAR_SCALE * tsk->last_slice / (wait + tsk->last_slice)
+ if (CPU_RC_SHARE_SCALE * tsk->last_slice / (wait + tsk->last_slice)
< cr->share)
cr->stat[cpu].maybe_hungry++;
Index: linux-2.6.17-rc2/kernel/sched.c
===================================================================
--- linux-2.6.17-rc2.orig/kernel/sched.c
+++ linux-2.6.17-rc2/kernel/sched.c
@@ -173,10 +173,17 @@
static unsigned int task_timeslice(task_t *p)
{
+ unsigned int timeslice;
+
if (p->static_prio < NICE_TO_PRIO(0))
- return SCALE_PRIO(DEF_TIMESLICE*4, p->static_prio);
+ timeslice = SCALE_PRIO(DEF_TIMESLICE*4, p->static_prio);
else
- return SCALE_PRIO(DEF_TIMESLICE, p->static_prio);
+ timeslice = SCALE_PRIO(DEF_TIMESLICE, p->static_prio);
+
+ if (!TASK_INTERACTIVE(p))
+ timeslice = cpu_rc_scale_timeslice(p, timeslice);
+
+ return timeslice;
}
#define task_hot(p, now, sd) ((long long) ((now) - (p)->last_ran) \
< (long long) (sd)->cache_hot_time)
^ permalink raw reply [flat|nested] 18+ messages in thread* Re: [RFC][PATCH 3/9] CPU controller - Adds timeslice scaling
2006-04-21 2:27 ` [RFC][PATCH 3/9] CPU controller - Adds timeslice scaling maeda.naoaki
@ 2006-04-21 8:17 ` Mike Galbraith
2006-04-21 8:56 ` MAEDA Naoaki
0 siblings, 1 reply; 18+ messages in thread
From: Mike Galbraith @ 2006-04-21 8:17 UTC (permalink / raw)
To: maeda.naoaki; +Cc: linux-kernel, ckrm-tech
On Fri, 2006-04-21 at 11:27 +0900, maeda.naoaki@jp.fujitsu.com wrote:
> Index: linux-2.6.17-rc2/kernel/sched.c
> ===================================================================
> --- linux-2.6.17-rc2.orig/kernel/sched.c
> +++ linux-2.6.17-rc2/kernel/sched.c
> @@ -173,10 +173,17 @@
>
> static unsigned int task_timeslice(task_t *p)
> {
> + unsigned int timeslice;
> +
> if (p->static_prio < NICE_TO_PRIO(0))
> - return SCALE_PRIO(DEF_TIMESLICE*4, p->static_prio);
> + timeslice = SCALE_PRIO(DEF_TIMESLICE*4, p->static_prio);
> else
> - return SCALE_PRIO(DEF_TIMESLICE, p->static_prio);
> + timeslice = SCALE_PRIO(DEF_TIMESLICE, p->static_prio);
> +
> + if (!TASK_INTERACTIVE(p))
> + timeslice = cpu_rc_scale_timeslice(p, timeslice);
> +
> + return timeslice;
> }
Why does timeslice scaling become undesirable if TASK_INTERACTIVE(p)?
With this barrier, you will completely disable scaling for many loads.
Is it possible you meant !rt_task(p)?
(The only place I can see scaling as having a large effect is on gobs of
non-sleeping tasks. Slice width doesn't mean much otherwise.)
-Mike
^ permalink raw reply [flat|nested] 18+ messages in thread* Re: [RFC][PATCH 3/9] CPU controller - Adds timeslice scaling
2006-04-21 8:17 ` Mike Galbraith
@ 2006-04-21 8:56 ` MAEDA Naoaki
2006-04-21 11:50 ` [ckrm-tech] " Naoaki MAEDA
2006-04-21 11:53 ` Mike Galbraith
0 siblings, 2 replies; 18+ messages in thread
From: MAEDA Naoaki @ 2006-04-21 8:56 UTC (permalink / raw)
To: Mike Galbraith; +Cc: linux-kernel, ckrm-tech
Mike Galbraith wrote:
> On Fri, 2006-04-21 at 11:27 +0900, maeda.naoaki@jp.fujitsu.com wrote:
>> Index: linux-2.6.17-rc2/kernel/sched.c
>> ===================================================================
>> --- linux-2.6.17-rc2.orig/kernel/sched.c
>> +++ linux-2.6.17-rc2/kernel/sched.c
>> @@ -173,10 +173,17 @@
>>
>> static unsigned int task_timeslice(task_t *p)
>> {
>> + unsigned int timeslice;
>> +
>> if (p->static_prio < NICE_TO_PRIO(0))
>> - return SCALE_PRIO(DEF_TIMESLICE*4, p->static_prio);
>> + timeslice = SCALE_PRIO(DEF_TIMESLICE*4, p->static_prio);
>> else
>> - return SCALE_PRIO(DEF_TIMESLICE, p->static_prio);
>> + timeslice = SCALE_PRIO(DEF_TIMESLICE, p->static_prio);
>> +
>> + if (!TASK_INTERACTIVE(p))
>> + timeslice = cpu_rc_scale_timeslice(p, timeslice);
>> +
>> + return timeslice;
>> }
>
> Why does timeslice scaling become undesirable if TASK_INTERACTIVE(p)?
> With this barrier, you will completely disable scaling for many loads.
Because interactive tasks tend to spend very small timeslice at one
time, scaling timeslice for these tasks is not effective to control
CPU spent.
Or, do you say that lots of non-interactive tasks are misjudged as
TASK_INTERACTIVE(p)?
> Is it possible you meant !rt_task(p)?
>
> (The only place I can see scaling as having a large effect is on gobs of
> non-sleeping tasks. Slice width doesn't mean much otherwise.)
Yes. But these non-sleeping CPU-hog tasks tend to dominate the CPU, so
it is worth controlling them.
Thanks,
MAEDA Naoaki
^ permalink raw reply [flat|nested] 18+ messages in thread* Re: [ckrm-tech] Re: [RFC][PATCH 3/9] CPU controller - Adds timeslice scaling
2006-04-21 8:56 ` MAEDA Naoaki
@ 2006-04-21 11:50 ` Naoaki MAEDA
2006-04-21 12:03 ` Mike Galbraith
2006-04-21 11:53 ` Mike Galbraith
1 sibling, 1 reply; 18+ messages in thread
From: Naoaki MAEDA @ 2006-04-21 11:50 UTC (permalink / raw)
To: MAEDA Naoaki; +Cc: Mike Galbraith, linux-kernel, ckrm-tech
> Mike Galbraith wrote:
> > On Fri, 2006-04-21 at 11:27 +0900, maeda.naoaki@jp.fujitsu.com wrote:
> >> Index: linux-2.6.17-rc2/kernel/sched.c
> >> ===================================================================
> >> --- linux-2.6.17-rc2.orig/kernel/sched.c
> >> +++ linux-2.6.17-rc2/kernel/sched.c
> >> @@ -173,10 +173,17 @@
> >>
> >> static unsigned int task_timeslice(task_t *p)
> >> {
> >> + unsigned int timeslice;
> >> +
> >> if (p->static_prio < NICE_TO_PRIO(0))
> >> - return SCALE_PRIO(DEF_TIMESLICE*4, p->static_prio);
> >> + timeslice = SCALE_PRIO(DEF_TIMESLICE*4, p->static_prio);
> >> else
> >> - return SCALE_PRIO(DEF_TIMESLICE, p->static_prio);
> >> + timeslice = SCALE_PRIO(DEF_TIMESLICE, p->static_prio);
> >> +
> >> + if (!TASK_INTERACTIVE(p))
> >> + timeslice = cpu_rc_scale_timeslice(p, timeslice);
> >> +
> >> + return timeslice;
> >> }
> >
> > Why does timeslice scaling become undesirable if TASK_INTERACTIVE(p)?
> > With this barrier, you will completely disable scaling for many loads.
>
> Because interactive tasks tend to spend very small timeslice at one
> time, scaling timeslice for these tasks is not effective to control
> CPU spent.
That was not a good explanation. Let me restate it.
The effect of shortening the timeslice is to let the task be expired sooner
by shortening its remaining timeslice, so it still works even if the task
consumes a very small timeslice at one time. However, expired
TASK_INTERACTIVE tasks will be requeued to the active array for a while
by the scheduler, so shortening the timeslice doesn't work well for
TASK_INTERACTIVE tasks.
Thanks,
MAEDA Naoaki
^ permalink raw reply [flat|nested] 18+ messages in thread* Re: [ckrm-tech] Re: [RFC][PATCH 3/9] CPU controller - Adds timeslice scaling
2006-04-21 11:50 ` [ckrm-tech] " Naoaki MAEDA
@ 2006-04-21 12:03 ` Mike Galbraith
0 siblings, 0 replies; 18+ messages in thread
From: Mike Galbraith @ 2006-04-21 12:03 UTC (permalink / raw)
To: Naoaki MAEDA; +Cc: MAEDA Naoaki, linux-kernel, ckrm-tech
On Fri, 2006-04-21 at 20:50 +0900, Naoaki MAEDA wrote:
> That was not a good explanation. Let me restate it.
> The effect of shortening the timeslice is to let the task be expired sooner
> by shortening its remaining timeslice, so it still works even if the task
> consumes a very small timeslice at one time. However, expired
> TASK_INTERACTIVE tasks will be requeued to the active array for a while
> by the scheduler, so shortening the timeslice doesn't work well for
> TASK_INTERACTIVE tasks.
Yeah, understood. This shortening of timeslice I think is generally
bad, (hmm...) though in an environment where preemption is rampant, this
shortening of slice should lead to a throughput gain for low ranking
_interdependent_ tasks. Is that what you're trying to accomplish? To
reduce latency at the bottom in the face of preemption from above?
-Mike
^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: [RFC][PATCH 3/9] CPU controller - Adds timeslice scaling
2006-04-21 8:56 ` MAEDA Naoaki
2006-04-21 11:50 ` [ckrm-tech] " Naoaki MAEDA
@ 2006-04-21 11:53 ` Mike Galbraith
1 sibling, 0 replies; 18+ messages in thread
From: Mike Galbraith @ 2006-04-21 11:53 UTC (permalink / raw)
To: MAEDA Naoaki; +Cc: linux-kernel, ckrm-tech
On Fri, 2006-04-21 at 17:56 +0900, MAEDA Naoaki wrote:
> Mike Galbraith wrote:
> > Why does timeslice scaling become undesirable if TASK_INTERACTIVE(p)?
> > With this barrier, you will completely disable scaling for many loads.
>
> Because interactive tasks tend to spend very small timeslice at one
> time, scaling timeslice for these tasks is not effective to control
> CPU spent.
True.
> Or, do you say that lots of non-interactive tasks are misjudged as
> TASK_INTERACTIVE(p)?
Almost. TASK_INTERACTIVE(p) doesn't mean the task is an interactive
task, only that it sleeps enough that it may be. Interactive tasks can
generally be categorized as doing quite a bit of sleeping, but so do
other things. HTTP/FTP daemons etc etc.
In the presence of a mixed load with several "interactive" components,
timeslice scaling can only do harm to throughput by further fragmenting
the already shattered time a task spends on cpu. You don't want to
increase the context switch rate if you want throughput.
> > Is it possible you meant !rt_task(p)?
> >
> > (The only place I can see scaling as having a large effect is on gobs of
> > non-sleeping tasks. Slice width doesn't mean much otherwise.)
>
> Yes. But these non-sleeping CPU-hog tasks tend to dominate the CPU, so
> it is worth controlling them.
Time spent in the expired array limits the !TASK_INTERACTIVE(p). In a
mixed load, the sleeping task component is the one which needs
controlling, because it will always preempt and get it's share of cpu.
In a pure hog load, the scheduler is pure round-robin, so no scaling is
needed. It's the sleep deprived who need protection from the swarms of
preempt enabled.
-Mike
^ permalink raw reply [flat|nested] 18+ messages in thread
* [RFC][PATCH 4/9] CPU controller - Adds interface functions
2006-04-21 2:27 [RFC][PATCH 0/9] CKRM CPU resource controller maeda.naoaki
` (2 preceding siblings ...)
2006-04-21 2:27 ` [RFC][PATCH 3/9] CPU controller - Adds timeslice scaling maeda.naoaki
@ 2006-04-21 2:27 ` maeda.naoaki
2006-04-21 2:27 ` [RFC][PATCH 5/9] CPU controller - Documents how the controller works maeda.naoaki
` (4 subsequent siblings)
8 siblings, 0 replies; 18+ messages in thread
From: maeda.naoaki @ 2006-04-21 2:27 UTC (permalink / raw)
To: linux-kernel, ckrm-tech; +Cc: maeda.naoaki
4/9: cpurc_interface
Adds interface functions to CKRM CPU controller.
Signed-off-by: Kurosawa Takahiro <kurosawa@valinux.co.jp>
Signed-off-by: MAEDA Naoaki <maeda.naoaki@jp.fujitsu.com>
include/linux/cpu_rc.h | 6 ++++++
kernel/cpu_rc.c | 45 +++++++++++++++++++++++++++++++++++++++++++++
2 files changed, 51 insertions(+)
Index: linux-2.6.17-rc2/kernel/cpu_rc.c
===================================================================
--- linux-2.6.17-rc2.orig/kernel/cpu_rc.c
+++ linux-2.6.17-rc2/kernel/cpu_rc.c
@@ -232,3 +232,48 @@ void cpu_rc_detect_hunger(task_t *tsk)
tsk->last_activated = 0;
}
+
+void cpu_rc_clear_stat(struct cpu_rc *cr, int cpu)
+{
+ cr->stat[cpu].timestamp = 0;
+ cr->stat[cpu].load = 0;
+ cr->stat[cpu].maybe_hungry = 0;
+}
+
+void cpu_rc_init_cr(struct cpu_rc *cr, struct cpu_rc_domain *rcd)
+{
+ cr->rcd = rcd;
+ cr->share = 0;
+ cr->ts_factor = CPU_RC_TSFACTOR_MAX;
+}
+
+void cpu_rc_get_cr(struct cpu_rc *cr)
+{
+ cpu_rcd_lock(cr);
+ cr->rcd->numcrs++;
+ cpu_rcd_unlock(cr);
+}
+
+void cpu_rc_put_cr(struct cpu_rc *cr)
+{
+ cpu_rcd_lock(cr);
+ cr->is_hungry = 0;
+ cr->rcd->numcrs--;
+ cpu_rcd_unlock(cr);
+}
+
+void cpu_rc_init_rcd(struct cpu_rc_domain *rcd)
+{
+ rcd->cpus = cpu_online_map;
+ spin_lock_init(&rcd->lock);
+ rcd->hungry_count = 0;
+ rcd->numcpus = cpus_weight(cpu_online_map);
+ rcd->numcrs = 0;
+}
+
+void cpu_rc_set_share(struct cpu_rc *cr, int val)
+{
+ cpu_rcd_lock(cr);
+ cr->share = val;
+ cpu_rcd_unlock(cr);
+}
Index: linux-2.6.17-rc2/include/linux/cpu_rc.h
===================================================================
--- linux-2.6.17-rc2.orig/include/linux/cpu_rc.h
+++ linux-2.6.17-rc2/include/linux/cpu_rc.h
@@ -52,6 +52,12 @@ extern unsigned int cpu_rc_load(struct c
extern unsigned int cpu_rc_scale_timeslice(task_t *, unsigned int);
extern void cpu_rc_account(task_t *, unsigned long);
extern void cpu_rc_detect_hunger(task_t *);
+extern void cpu_rc_clear_stat(struct cpu_rc *, int);
+extern void cpu_rc_init_cr(struct cpu_rc *, struct cpu_rc_domain *);
+extern void cpu_rc_get_cr(struct cpu_rc *);
+extern void cpu_rc_put_cr(struct cpu_rc *);
+extern void cpu_rc_init_rcd(struct cpu_rc_domain *);
+extern void cpu_rc_set_share(struct cpu_rc *, int);
static inline void cpu_rc_record_activated(task_t *tsk, unsigned long now)
{
^ permalink raw reply [flat|nested] 18+ messages in thread* [RFC][PATCH 5/9] CPU controller - Documents how the controller works
2006-04-21 2:27 [RFC][PATCH 0/9] CKRM CPU resource controller maeda.naoaki
` (3 preceding siblings ...)
2006-04-21 2:27 ` [RFC][PATCH 4/9] CPU controller - Adds interface functions maeda.naoaki
@ 2006-04-21 2:27 ` maeda.naoaki
2006-04-23 7:13 ` Mike Galbraith
2006-04-21 2:27 ` [RFC][PATCH 6/9] CPU controller - Adds basic functions and registering the controller maeda.naoaki
` (3 subsequent siblings)
8 siblings, 1 reply; 18+ messages in thread
From: maeda.naoaki @ 2006-04-21 2:27 UTC (permalink / raw)
To: linux-kernel, ckrm-tech; +Cc: maeda.naoaki
5/9: cpurc_docs
Documentation that describes how the CPU resource controller works.
Signed-off-by: Kurosawa Takahiro <kurosawa@valinux.co.jp>
Signed-off-by: MAEDA Naoaki <maeda.naoaki@jp.fujitsu.com>
Documentation/ckrm/cpurc-internals | 166 +++++++++++++++++++++++++++++++++++++
1 files changed, 166 insertions(+)
Index: linux-2.6.17-rc2/Documentation/ckrm/cpurc-internals
===================================================================
--- /dev/null
+++ linux-2.6.17-rc2/Documentation/ckrm/cpurc-internals
@@ -0,0 +1,166 @@
+CPU resource controller internals
+
+ There are 3 components in the CPU resource controller:
+
+ (1) load estimation
+ (2) hungry detection
+ (3) timeslice scaling
+
+ We need to estimate the class load in order to check whether the
+ share is satisfied or not. Class load also gets lower than the
+ share when all the tasks in the class tend to sleep. We need to
+ check whether the class needs to schedule more or not by hungry
+ detection. If a class needs to schedule more, timeslices of tasks
+ are scaled by timeslice scaling.
+
+1. Load estimation
+
+ We calculate the class load as the accumulation of task loads in the
+ class. We need to calculate the task load first, then calculate the
+ class load from the task loads.
+
+ Task load estimation
+
+ Task load is estimated as the ratio of:
+ * the timeslice value allocated to the task (Ts)
+ to:
+ * the time that is taken for the task to run out the allocated timeslice
+ (Tr).
+ If a task can use all the CPU time, Ts / Tr becomes 1 for example.
+
+ The detailed procedure of the calculation is as follows:
+ (1) Record the timeslice (Ts) and the time when the timeslice is
+ allocated to the task (by calling cpu_rc_record_allocation()).
+ * The timeslice value is recorded to task->last_slice ( = Ts).
+ * The time is recorded to task->ts_alloced.
+ (2) Calculate the task load when the timeslice is expired
+ (by calling cpu_rc_account()).
+ Tr is calculated as:
+ Tr = jiffies - task->ts_alloced
+ Then task load (Ts / Tr) becomes:
+ Ts / Tr = task->last_slice / (jiffies - task->ts_alloced)
+
+ The load value is scaled by CPU_RC_LOAD_SCALE.
+ If the load value equals CPU_RC_LOAD_SCALE, it indicates 100%
+ CPU usage.
+
+ task->ts_alloced task scheduled now
+ v v v
+ |---------------===========================|
+
+ |<------------------------>|
+ Ts ( = task->last_slice)
+
+ |<---------------------------------------->|
+ Tr ( = now - task->ts_alloced)
+
+ |<------------->|
+ the time that the task isn't scheduled
+
+
+ Note that task load calculation is also needed for strict
+ accuracy when a task forks or exits, because timeslice is
+ changed on fork and exit. But we don't do that in order to
+ simplify the code and in order not to introduce overhead on fork
+ and exit. Probably we can get enough accurate number without
+ calculating the task load on fork/exit.
+
+ Class load estimation:
+
+ Class load is the accumulation of load values of tasks in the class in
+ the duration of CPU_RC_SPREAD_PERIOD.
+ Per-CPU class load is recalculated each time the task load is calculated
+ in the cpu_rc_account() function.
+ Then on CPU_RC_RECALC_INTERVAL intervals, the class load value per-CPU
+ value is calculated as the average of the per-CPU class load.
+
+ Task load is accumulated to the per-CPU class load as if the class uses
+ Ts/Tr of the CPU time from task->ts_alloced to now (the time the timeslice
+ expired).
+
+ So the time that the task has used the CPU from (now - CPU_RC_SPREAD_PERIOD)
+ to now (Ttsk) should be:
+
+ if task->ts_alloced < now - CPU_RC_SPREAD_PERIOD:
+ Ts/Tr * CPU_RC_SPREAD_PERIOD
+ (We assume that the task has used the CPU at the constant rate of Ts/Tr.)
+
+ now-CPU_RC_SPREAD_PERIOD now
+ v v
+ |---------------------------------------|
+ |==================================================| load: Ts/Tr
+ ^
+ task->ts_alloced
+
+ else:
+ Ts
+
+ now-CPU_RC_SPREAD_PERIOD now
+ v v
+ |---------------------------------------|
+ |============================| load: Ts/Tr
+ ^
+ task->ts_alloced
+
+ Also, we assume that the class uses the CPU at the rate of the class load
+ from (now - CPU_RC_SPREAD_PERIOD) to the last time the per-CPU class load
+ was calculated (stored in struct cpu_rc::stat[cpu].timestamp). If
+ cpu_rc::stat[cpu].timestamp < now - CPU_RC_SPREAD_PERIOD, we assume that
+ the class doesn't use the CPU from (now - CPU_RC_SPREAD_PERIOD) to
+ task->ts_alloced.
+
+ So the time that the class uses the CPU from (now - CPU_RC_SPREAD_PERIOD)
+ to now (Tcls) should be:
+ if cpu_rc::stat[cpu].timestamp < now - CPU_RC_SPREAD_PERIOD:
+ 0
+ else:
+ cpu_rc::stat[cpu].load * (cpu_rc::stat[cpu].timestamp - (now - CPU_RC_SPREAD_PERIOD))
+
+ The new per-CPU class load that will be assigned to cpu_rc::stat[cpu].load
+ is calculated as:
+ (Ttsk + Tcls) / CPU_RC_SPREAD_PERIOD
+
+2. Hungry detection
+
+ When the class load is less than the share, there are 2 cases:
+ (a) the share is enough and tasks in the class have time for sleep
+ (b) tasks in other classes overuse the CPU
+
+ We should not scale the timeslice in case (a) even if the class load
+ is lower than the share. In order to distinguish case (b) from
+ case (a), we measure the time (Tsch) from when a task is activated
+ (stored in task->last_activated) till when the task is actually
+ scheduled. If the class load is lower than the share but tasks
+ in the class are quickly scheduled, it can be classified to case (a).
+ If Tsch / timeslice of a task is lower than the share, the class
+ that has the task is marked as "maybe hungry." If the class load of
+ the class that is marked as "maybe hungry" is lower than the
+ share, it is treated as hungry and the timeslices of tasks in
+ other classes will be scaled down.
+
+
+3. Timeslice scaling
+
+ If there are hungry classes, we need to adjust timeslices to satisfy
+ the share. To scale timeslices, we introduce a scaling factor
+ used for scaling timeslices. The scaling factor is associated with
+ the class (stored in the cpu_rc structure) and adaptively adjusted
+ according to the class load and the share.
+
+ If some classes are hungry, the scaling factor of the class that is
+ not hungry is calculated as follows (note: F is the scaling factor):
+ F_new = F * share / class_load
+
+ And the scaling factor of the hungry class is calculated as:
+ F_new = F + CPU_RC_TSFACTOR_INC_LO (CPU_RC_TSFACTOR_INC_LO is defined as 2)
+
+ When all the classes are not hungry, the scaling factor is calculated
+ as follows in order to recover the timeslices:
+ F_new = F + CPU_RC_TSFACTOR_INC_HI (CPU_RC_TSFACTOR_INC_HI is defined as 5)
+
+ Note that the maximum value of F is limited to CPU_RC_TSFACTOR_MAX.
+ The timeslice assigned to each task is:
+ timeslice_scaled = timeslice_orig * F / CPU_RC_TSFACTOR_MAX
+
+ where timeslice_orig is the value that is calculated by the conventional
+ O(1) scheduler.
^ permalink raw reply [flat|nested] 18+ messages in thread* Re: [RFC][PATCH 5/9] CPU controller - Documents how the controller works
2006-04-21 2:27 ` [RFC][PATCH 5/9] CPU controller - Documents how the controller works maeda.naoaki
@ 2006-04-23 7:13 ` Mike Galbraith
2006-04-24 6:25 ` [ckrm-tech] " MAEDA Naoaki
0 siblings, 1 reply; 18+ messages in thread
From: Mike Galbraith @ 2006-04-23 7:13 UTC (permalink / raw)
To: maeda.naoaki; +Cc: linux-kernel, ckrm-tech
On Fri, 2006-04-21 at 11:27 +0900, maeda.naoaki@jp.fujitsu.com wrote:
> +3. Timeslice scaling
> +
> + If there are hungry classes, we need to adjust timeslices to satisfy
> + the share. To scale timeslices, we introduce a scaling factor
> + used for scaling timeslices. The scaling factor is associated with
> + the class (stored in the cpu_rc structure) and adaptively adjusted
> + according to the class load and the share.
This all works fine until interactive task requeueing is considered, and
it must be considered.
One simple way to address the requeue problem is to introduce a scaled
class sleep_avg consumption factor. Remove the scaling exemption for
TASK_INTERACTIVE(p), and if a class's cpu usage doesn't drop to what is
expected by group timeslice scaling, make members consume sleep_avg at a
higher rate such that scaling can take effect.
A better way to achieve the desired group cpu usage IMHO would be to
adjust nice level of members at slice refresh time. This way, you get
the timeslice scaling and priority adjustment all in one.
(I think I would do both actually, with nice level being preferred such
that dynamic priority spread within the group isn't flattened, which can
cause terminal starvation within the group, unless really required.)
-Mike
^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: [ckrm-tech] Re: [RFC][PATCH 5/9] CPU controller - Documents how the controller works
2006-04-23 7:13 ` Mike Galbraith
@ 2006-04-24 6:25 ` MAEDA Naoaki
2006-04-24 9:49 ` Mike Galbraith
0 siblings, 1 reply; 18+ messages in thread
From: MAEDA Naoaki @ 2006-04-24 6:25 UTC (permalink / raw)
To: Mike Galbraith; +Cc: linux-kernel, ckrm-tech, Maeda Naoaki
Mike Galbraith wrote:
> On Fri, 2006-04-21 at 11:27 +0900, maeda.naoaki@jp.fujitsu.com wrote:
>> +3. Timeslice scaling
>> +
>> + If there are hungry classes, we need to adjust timeslices to satisfy
>> + the share. To scale timeslices, we introduce a scaling factor
>> + used for scaling timeslices. The scaling factor is associated with
>> + the class (stored in the cpu_rc structure) and adaptively adjusted
>> + according to the class load and the share.
>
> This all works fine until interactive task requeueing is considered, and
> it must be considered.
>
> One simple way to address the requeue problem is to introduce a scaled
> class sleep_avg consumption factor. Remove the scaling exemption for
> TASK_INTERACTIVE(p), and if a class's cpu usage doesn't drop to what is
> expected by group timeslice scaling, make members consume sleep_avg at a
> higher rate such that scaling can take effect.
Interesting approach. However, I'm worrying about hurting interactive
response by this change.
> A better way to achieve the desired group cpu usage IMHO would be to
> adjust nice level of members at slice refresh time. This way, you get
> the timeslice scaling and priority adjustment all in one.
>
> (I think I would do both actually, with nice level being preferred such
> that dynamic priority spread within the group isn't flattened, which can
> cause terminal starvation within the group, unless really required.)
If nice is changed, the task priority is also changed. I don't think
changing the task priority for this purpose is a good choice, but
only lengthen the timeslice would work and that is what I'm considering.
Another obvious bad case is an imbalanced number of runnable tasks
in the different groups. Since minimum timeslice is 1 tick,
minimum share is the factor of number of runnable tasks in the group.
If 1% share group contains 99 runnable tasks and the other 99% share
group has just one runnable task, the load of the two groups would be
the same. (It becomes worse in small HZ configuration.)
I've tried different approach to compensate for this badness.
Which is to requeue the starving tasks to the active as if they are
TASK_INTERACTIVE, but it sometimes hurt system response and other
undesirable side effect was observed.
Now, I'm thinking to enlarge the timeslice of starving groups.
Thanks,
MAEDA Naoaki
^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: [ckrm-tech] Re: [RFC][PATCH 5/9] CPU controller - Documents how the controller works
2006-04-24 6:25 ` [ckrm-tech] " MAEDA Naoaki
@ 2006-04-24 9:49 ` Mike Galbraith
0 siblings, 0 replies; 18+ messages in thread
From: Mike Galbraith @ 2006-04-24 9:49 UTC (permalink / raw)
To: MAEDA Naoaki; +Cc: linux-kernel, ckrm-tech
On Mon, 2006-04-24 at 15:25 +0900, MAEDA Naoaki wrote:
> Mike Galbraith wrote:
> > On Fri, 2006-04-21 at 11:27 +0900, maeda.naoaki@jp.fujitsu.com wrote:
> >> +3. Timeslice scaling
> >> +
> >> + If there are hungry classes, we need to adjust timeslices to satisfy
> >> + the share. To scale timeslices, we introduce a scaling factor
> >> + used for scaling timeslices. The scaling factor is associated with
> >> + the class (stored in the cpu_rc structure) and adaptively adjusted
> >> + according to the class load and the share.
> >
> > This all works fine until interactive task requeueing is considered, and
> > it must be considered.
> >
> > One simple way to address the requeue problem is to introduce a scaled
> > class sleep_avg consumption factor. Remove the scaling exemption for
> > TASK_INTERACTIVE(p), and if a class's cpu usage doesn't drop to what is
> > expected by group timeslice scaling, make members consume sleep_avg at a
> > higher rate such that scaling can take effect.
>
> Interesting approach. However, I'm worrying about hurting interactive
> response by this change.
It will certainly do so if _truely_ interactive tasks are using
significant cpu.
> > A better way to achieve the desired group cpu usage IMHO would be to
> > adjust nice level of members at slice refresh time. This way, you get
> > the timeslice scaling and priority adjustment all in one.
> >
> > (I think I would do both actually, with nice level being preferred such
> > that dynamic priority spread within the group isn't flattened, which can
> > cause terminal starvation within the group, unless really required.)
>
> If nice is changed, the task priority is also changed. I don't think
> changing the task priority for this purpose is a good choice, but
> only lengthen the timeslice would work and that is what I'm considering.
How would that help if the cause of starvation is "interactive" tasks
being requeued? At present, EXPIRED_STARVING() will initiate an array
switch, but that only gives you one slice every nr_running seconds. For
one non-interactive task (group A) to achieve parity with N interactive
tasks (group B) which are doing round-robin requeue, you'd need an N
second slice.
> Another obvious bad case is an imbalanced number of runnable tasks
> in the different groups. Since minimum timeslice is 1 tick,
> minimum share is the factor of number of runnable tasks in the group.
> If 1% share group contains 99 runnable tasks and the other 99% share
> group has just one runnable task, the load of the two groups would be
> the same. (It becomes worse in small HZ configuration.)
A group scheduler (array of arrays) would solve that as well as the
above. You can kind of implement such a scheduler within the scheduler
via priority twiddling, but I don't think you can get there from here
via timeslice modulation alone.
> I've tried different approach to compensate for this badness.
> Which is to requeue the starving tasks to the active as if they are
> TASK_INTERACTIVE, but it sometimes hurt system response and other
> undesirable side effect was observed.
Yeah, I'm currently requeueing based upon latency, and it does have both
promise and some interesting rough spots. I've eliminated the massively
painful to interactive tasks array switch when busy successfully, but
requeueing starving tasks within the active array itself still needs
work.
> Now, I'm thinking to enlarge the timeslice of starving groups.
Good luck.
-Mike
^ permalink raw reply [flat|nested] 18+ messages in thread
* [RFC][PATCH 6/9] CPU controller - Adds basic functions and registering the controller
2006-04-21 2:27 [RFC][PATCH 0/9] CKRM CPU resource controller maeda.naoaki
` (4 preceding siblings ...)
2006-04-21 2:27 ` [RFC][PATCH 5/9] CPU controller - Documents how the controller works maeda.naoaki
@ 2006-04-21 2:27 ` maeda.naoaki
2006-04-21 2:28 ` [RFC][PATCH 7/9] CPU controller - Adds routines to change share values and show stat maeda.naoaki
` (2 subsequent siblings)
8 siblings, 0 replies; 18+ messages in thread
From: maeda.naoaki @ 2006-04-21 2:27 UTC (permalink / raw)
To: linux-kernel, ckrm-tech; +Cc: maeda.naoaki
6/9: ckrm_cpu_init
Adds the basic functions and registering the CPU controller to CKRM.
Signed-off-by: MAEDA Naoaki <maeda.naoaki@jp.fujitsu.com>
Signed-off-by: Kurosawa Takahiro <kurosawa@valinux.co.jp>
init/Kconfig | 10 +++
kernel/ckrm/Makefile | 1
kernel/ckrm/ckrm_cpu.c | 142 +++++++++++++++++++++++++++++++++++++++++++++++++
3 files changed, 153 insertions(+)
Index: linux-2.6.17-rc2/init/Kconfig
===================================================================
--- linux-2.6.17-rc2.orig/init/Kconfig
+++ linux-2.6.17-rc2/init/Kconfig
@@ -185,6 +185,16 @@ config CKRM_RES_NUMTASKS
Say N if unsure, Y to use the feature.
+config CKRM_RES_CPU
+ bool "CPU Resource Controller"
+ select CPU_RC
+ depends on CKRM
+ default y
+ help
+ Provides a CPU Resource Controller for CKRM.
+
+ Say N if unsure, Y to use the feature.
+
endmenu
config SYSCTL
bool "Sysctl support"
Index: linux-2.6.17-rc2/kernel/ckrm/Makefile
===================================================================
--- linux-2.6.17-rc2.orig/kernel/ckrm/Makefile
+++ linux-2.6.17-rc2/kernel/ckrm/Makefile
@@ -1,3 +1,4 @@
obj-y = ckrm.o ckrm_shares.o ckrm_task.o
obj-$(CONFIG_CKRM_RES_NUMTASKS) += ckrm_numtasks.o
+obj-$(CONFIG_CKRM_RES_CPU) += ckrm_cpu.o
obj-$(CONFIG_CKRM_RCFS) += ckrm_rcfs.o
Index: linux-2.6.17-rc2/kernel/ckrm/ckrm_cpu.c
===================================================================
--- /dev/null
+++ linux-2.6.17-rc2/kernel/ckrm/ckrm_cpu.c
@@ -0,0 +1,142 @@
+/*
+ * kernel/ckrm/ckrm_cpu.c
+ *
+ * CPU resource controller for CKRM
+ *
+ * Copyright 2005-2006 FUJITSU LIMITED
+ *
+ * This file is subject to the terms and conditions of the GNU General Public
+ * License. See the file COPYING in the main directory of the Linux
+ * distribution for more details.
+ */
+
+#include <linux/module.h>
+#include <linux/config.h>
+#include <linux/notifier.h>
+#include <linux/cpu.h>
+#include <linux/cpu_rc.h>
+#include <linux/ckrm_rc.h>
+
+static const char res_ctlr_name[] = "cpu";
+
+struct ckrm_cpu {
+ struct ckrm_class *class; /* the class I belong to */
+ struct ckrm_shares shares;
+ struct cpu_rc cpu_rc; /* cpu resource controller */
+ int cnt_total_min_shares; /* total min_shares behind the class */
+};
+
+static struct cpu_rc_domain grcd; /* system wide resource controller domain */
+struct ckrm_controller cpu_ctlr;
+
+static struct ckrm_cpu *get_shares_cpu(struct ckrm_shares *shares)
+{
+ if (shares)
+ return container_of(shares, struct ckrm_cpu, shares);
+ return NULL;
+}
+
+static struct ckrm_cpu *get_class_cpu(struct ckrm_class *class)
+{
+ return get_shares_cpu(ckrm_get_controller_shares(class, &cpu_ctlr));
+}
+
+struct cpu_rc *cpu_rc_get(task_t *tsk)
+{
+ struct ckrm_class *class = tsk->class;
+ struct ckrm_cpu *res;
+
+ /* controller is not registered; no class is given */
+ if ((cpu_ctlr.ctlr_id == CKRM_NO_RES_ID) || (class == NULL))
+ return NULL;
+
+ res = get_class_cpu(class);
+ /* cpu controller is not available for this class */
+ if (!res)
+ return NULL;
+
+ return &res->cpu_rc;
+}
+
+static void cpu_res_initcls_one(struct ckrm_cpu * res)
+{
+ res->shares.min_shares = 0;
+ res->shares.max_shares = CKRM_SHARE_UNSUPPORTED;
+ res->shares.child_shares_divisor = CKRM_SHARE_DEFAULT_DIVISOR;
+ res->shares.unused_min_shares = CKRM_SHARE_DEFAULT_DIVISOR;
+
+ res->cnt_total_min_shares = 0;
+ cpu_rc_init_cr(&res->cpu_rc, &grcd);
+ cpu_rc_get_cr(&res->cpu_rc);
+}
+
+static struct ckrm_shares *cpu_alloc_shares_struct(struct ckrm_class *class)
+{
+ struct ckrm_cpu *res;
+
+ res = kzalloc(sizeof(struct ckrm_cpu), GFP_KERNEL);
+ if (!res)
+ return NULL;
+ res->class = class;
+ cpu_res_initcls_one(res);
+ if (ckrm_is_class_root(class)) {
+ res->cpu_rc.share = CKRM_SHARE_DEFAULT_DIVISOR;
+ res->cnt_total_min_shares = CKRM_SHARE_DEFAULT_DIVISOR;
+ res->shares.min_shares = CKRM_SHARE_DONT_CARE;
+ res->shares.max_shares = CKRM_SHARE_DONT_CARE;
+ }
+ return &res->shares;
+}
+
+static void cpu_free_shares_struct(struct ckrm_shares *my_res)
+{
+ struct ckrm_cpu *res, *parres;
+ u64 temp = 0;
+
+ res = get_shares_cpu(my_res);
+ if (!res)
+ return;
+
+ parres = get_class_cpu(res->class->parent);
+ /* return child's min_shares to parent class */
+ spin_lock(&parres->class->class_lock);
+ if (parres->shares.child_shares_divisor) {
+ temp = (u64) parres->shares.unused_min_shares
+ * parres->cnt_total_min_shares;
+ do_div(temp, parres->shares.child_shares_divisor);
+ }
+ cpu_rc_set_share(&parres->cpu_rc, (int)temp);
+ spin_unlock(&parres->class->class_lock);
+
+ cpu_rc_put_cr(&res->cpu_rc);
+ kfree(res);
+}
+
+struct ckrm_controller cpu_ctlr = {
+ .name = res_ctlr_name,
+ .depth_supported = 3,
+ .ctlr_id = CKRM_NO_RES_ID,
+ .alloc_shares_struct = cpu_alloc_shares_struct,
+ .free_shares_struct = cpu_free_shares_struct,
+};
+
+int __init init_ckrm_cpu_res(void)
+{
+ if (cpu_ctlr.ctlr_id != CKRM_NO_RES_ID)
+ return -EBUSY; /* already registered */
+ cpu_rc_init_rcd(&grcd);
+ printk(KERN_INFO "init_ckrm_cpu_res %d cpus available\n", grcd.numcpus);
+ return ckrm_register_controller(&cpu_ctlr);
+}
+
+void __exit exit_ckrm_cpu_res(void)
+{
+ int rc;
+ do {
+ rc = ckrm_unregister_controller(&cpu_ctlr);
+ } while (rc == -EBUSY);
+ BUG_ON(rc != 0);
+}
+
+module_init(init_ckrm_cpu_res)
+module_exit(exit_ckrm_cpu_res)
^ permalink raw reply [flat|nested] 18+ messages in thread* [RFC][PATCH 7/9] CPU controller - Adds routines to change share values and show stat
2006-04-21 2:27 [RFC][PATCH 0/9] CKRM CPU resource controller maeda.naoaki
` (5 preceding siblings ...)
2006-04-21 2:27 ` [RFC][PATCH 6/9] CPU controller - Adds basic functions and registering the controller maeda.naoaki
@ 2006-04-21 2:28 ` maeda.naoaki
2006-04-21 2:28 ` [RFC][PATCH 8/9] CPU controller - Adds cpu hotplug notifier maeda.naoaki
2006-04-21 2:28 ` [RFC][PATCH 9/9] CPU controller - Documents how to use the controller maeda.naoaki
8 siblings, 0 replies; 18+ messages in thread
From: maeda.naoaki @ 2006-04-21 2:28 UTC (permalink / raw)
To: linux-kernel, ckrm-tech; +Cc: maeda.naoaki
7/9: ckrm_cpu_shares_n_stats
Adds routine to change share values and show statistics.
Signed-off-by: MAEDA Naoaki <maeda.naoaki@jp.fujitsu.com>
Signed-off-by: Kurosawa Takahiro <kurosawa@valinux.co.jp>
kernel/ckrm/ckrm_cpu.c | 123 +++++++++++++++++++++++++++++++++++++++++++++++++
1 files changed, 123 insertions(+)
Index: linux-2.6.17-rc2/kernel/ckrm/ckrm_cpu.c
===================================================================
--- linux-2.6.17-rc2.orig/kernel/ckrm/ckrm_cpu.c
+++ linux-2.6.17-rc2/kernel/ckrm/ckrm_cpu.c
@@ -112,12 +112,135 @@ static void cpu_free_shares_struct(struc
kfree(res);
}
+static int recalc_shares(int self_shares, int parent_shares, int parent_divisor)
+{
+ u64 numerator;
+
+ if (parent_divisor == 0)
+ return 0;
+ numerator = (u64) self_shares * parent_shares;
+ do_div(numerator, parent_divisor);
+ return numerator;
+}
+
+static int recalc_unused_shares(int self_cnt_min_shares,
+ int self_unused_min_shares, int self_divisor)
+{
+ u64 numerator;
+
+ if (self_divisor == 0)
+ return 0;
+ numerator = (u64) self_unused_min_shares * self_cnt_min_shares;
+ do_div(numerator, self_divisor);
+ return numerator;
+}
+
+static void recalc_self(struct ckrm_cpu *res, struct ckrm_cpu *parres)
+{
+ struct ckrm_shares *par = &parres->shares;
+ struct ckrm_shares *self = &res->shares;
+ u64 cnt_total, cnt_min_shares;
+
+ /* calculate total and current min_shares */
+ cnt_total = recalc_shares(self->min_shares,
+ parres->cnt_total_min_shares,
+ par->child_shares_divisor);
+ cnt_min_shares = recalc_unused_shares(self->unused_min_shares,
+ cnt_total,
+ par->child_shares_divisor);
+ cpu_rc_set_share(&res->cpu_rc, (int) cnt_min_shares);
+ res->cnt_total_min_shares = (int) cnt_total;
+}
+
+static void
+recalc_and_propagate(struct ckrm_cpu * res)
+{
+ struct ckrm_class *child = NULL;
+ struct ckrm_cpu *parres, *childres;
+
+ parres = get_class_cpu(res->class->parent);
+
+ if (parres)
+ recalc_self(res, parres);
+
+ /* propagate to children */
+ spin_lock(&res->class->class_lock);
+ for_each_child(child, res->class) {
+ childres = get_class_cpu(child);
+ if (childres) {
+ spin_lock(&child->class_lock);
+ recalc_and_propagate(childres);
+ spin_unlock(&child->class_lock);
+ }
+ }
+ spin_unlock(&res->class->class_lock);
+ return;
+}
+
+static void cpu_shares_changed(struct ckrm_shares *my_res)
+{
+ struct ckrm_cpu *parres, *res;
+ struct ckrm_shares *cur, *par;
+ u64 temp = 0;
+
+ res = get_shares_cpu(my_res);
+ if (!res)
+ return;
+ cur = &res->shares;
+
+ if (!ckrm_is_class_root(res->class)) {
+ spin_lock(&res->class->parent->class_lock);
+ parres = get_class_cpu(res->class->parent);
+ par = &parres->shares;
+ } else {
+ par = NULL;
+ parres = NULL;
+ }
+
+ if (parres) {
+ /* adjust parent's unused min_shares */
+ temp = recalc_unused_shares(parres->cnt_total_min_shares,
+ par->unused_min_shares,
+ par->child_shares_divisor);
+ cpu_rc_set_share(&parres->cpu_rc, temp);
+ } else {
+ /* adjust root class's unused min_shares */
+ temp = recalc_unused_shares(CKRM_SHARE_DEFAULT_DIVISOR,
+ cur->unused_min_shares,
+ cur->child_shares_divisor);
+ cpu_rc_set_share(&res->cpu_rc, temp);
+ }
+ recalc_and_propagate(res);
+
+ if (!ckrm_is_class_root(res->class))
+ spin_unlock(&res->class->parent->class_lock);
+}
+
+static ssize_t cpu_show_stats(struct ckrm_shares *my_res, char *buf,
+ size_t buf_size)
+{
+ struct ckrm_cpu *res;
+ unsigned int load = 0;
+ ssize_t i;
+
+ res = get_shares_cpu(my_res);
+ if (!res)
+ return -EINVAL;
+
+ load = cpu_rc_load(&res->cpu_rc);
+ i = snprintf(buf, buf_size, "%s:effective_min_shares=%d, load=%d\n",
+ res_ctlr_name, res->cpu_rc.share, load);
+ return i;
+}
+
struct ckrm_controller cpu_ctlr = {
.name = res_ctlr_name,
.depth_supported = 3,
.ctlr_id = CKRM_NO_RES_ID,
.alloc_shares_struct = cpu_alloc_shares_struct,
.free_shares_struct = cpu_free_shares_struct,
+ .shares_changed = cpu_shares_changed,
+ .show_stats = cpu_show_stats,
};
int __init init_ckrm_cpu_res(void)
^ permalink raw reply [flat|nested] 18+ messages in thread* [RFC][PATCH 8/9] CPU controller - Adds cpu hotplug notifier
2006-04-21 2:27 [RFC][PATCH 0/9] CKRM CPU resource controller maeda.naoaki
` (6 preceding siblings ...)
2006-04-21 2:28 ` [RFC][PATCH 7/9] CPU controller - Adds routines to change share values and show stat maeda.naoaki
@ 2006-04-21 2:28 ` maeda.naoaki
2006-04-21 2:28 ` [RFC][PATCH 9/9] CPU controller - Documents how to use the controller maeda.naoaki
8 siblings, 0 replies; 18+ messages in thread
From: maeda.naoaki @ 2006-04-21 2:28 UTC (permalink / raw)
To: linux-kernel, ckrm-tech; +Cc: maeda.naoaki
8/9: ckrm_cpu_hotplug
Adds cpu hotplug notifier for the CKRM CPU controller.
Signed-off-by: MAEDA Naoaki <maeda.naoaki@jp.fujitsu.com>
Signed-off-by: Kurosawa Takahiro <kurosawa@valinux.co.jp>
kernel/ckrm/ckrm_cpu.c | 50 +++++++++++++++++++++++++++++++++++++++++++++++++
1 files changed, 50 insertions(+)
Index: linux-2.6.17-rc2/kernel/ckrm/ckrm_cpu.c
===================================================================
--- linux-2.6.17-rc2.orig/kernel/ckrm/ckrm_cpu.c
+++ linux-2.6.17-rc2/kernel/ckrm/ckrm_cpu.c
@@ -243,12 +243,61 @@ struct ckrm_controller cpu_ctlr = {
.show_stats = cpu_show_stats,
};
+static void clear_stat_and_propagate(struct ckrm_cpu * res, int cpu)
+{
+ struct ckrm_class *child = NULL;
+ struct ckrm_cpu *childres;
+
+ cpu_rc_clear_stat(&res->cpu_rc, cpu);
+
+ /* propagate to children */
+ spin_lock(&res->class->class_lock);
+ for_each_child(child, res->class) {
+ childres = get_class_cpu(child);
+ if (childres) {
+ spin_lock(&child->class_lock);
+ clear_stat_and_propagate(childres, cpu);
+ spin_unlock(&child->class_lock);
+ }
+ }
+ spin_unlock(&res->class->class_lock);
+}
+
+static int __devinit ckrm_cpu_notify(struct notifier_block *self,
+ unsigned long action, void *hcpu)
+{
+ struct ckrm_class *cls = &ckrm_default_class;
+ struct ckrm_cpu *res;
+ int cpu = (long) hcpu;
+
+ switch (action) {
+
+ case CPU_DEAD:
+ res = get_class_cpu(cls);
+ clear_stat_and_propagate(res, cpu);
+ /* FALL THROUGH */
+ case CPU_ONLINE:
+ grcd.cpus = cpu_online_map;
+ grcd.numcpus = cpus_weight(cpu_online_map);
+ break;
+ default:
+ break;
+ }
+ return NOTIFY_OK;
+}
+
+static struct notifier_block ckrm_cpu_nb = {
+ .notifier_call = ckrm_cpu_notify,
+};
+
int __init init_ckrm_cpu_res(void)
{
if (cpu_ctlr.ctlr_id != CKRM_NO_RES_ID)
return -EBUSY; /* already registered */
cpu_rc_init_rcd(&grcd);
printk(KERN_INFO "init_ckrm_cpu_res %d cpus available\n", grcd.numcpus);
+ /* Register notifier for non-boot CPUs */
+ register_cpu_notifier(&ckrm_cpu_nb);
return ckrm_register_controller(&cpu_ctlr);
}
@@ -259,6 +308,7 @@ void __exit exit_ckrm_cpu_res(void)
rc = ckrm_unregister_controller(&cpu_ctlr);
} while (rc == -EBUSY);
BUG_ON(rc != 0);
+ unregister_cpu_notifier(&ckrm_cpu_nb);
}
module_init(init_ckrm_cpu_res)
^ permalink raw reply [flat|nested] 18+ messages in thread* [RFC][PATCH 9/9] CPU controller - Documents how to use the controller
2006-04-21 2:27 [RFC][PATCH 0/9] CKRM CPU resource controller maeda.naoaki
` (7 preceding siblings ...)
2006-04-21 2:28 ` [RFC][PATCH 8/9] CPU controller - Adds cpu hotplug notifier maeda.naoaki
@ 2006-04-21 2:28 ` maeda.naoaki
8 siblings, 0 replies; 18+ messages in thread
From: maeda.naoaki @ 2006-04-21 2:28 UTC (permalink / raw)
To: linux-kernel, ckrm-tech; +Cc: maeda.naoaki
9/9: ckrm_cpu_docs
Documents how to use the CPU controller
Signed-off-by: MAEDA Naoaki <maeda.naoaki@jp.fujitsu.com>
Signed-off-by: Kurosawa Takahiro <kurosawa@valinux.co.jp>
Documentation/ckrm/cpurc | 70 +++++++++++++++++++++++++++++++++++++++++++++++
1 files changed, 70 insertions(+)
Index: linux-2.6.17-rc2/Documentation/ckrm/cpurc
===================================================================
--- /dev/null
+++ linux-2.6.17-rc2/Documentation/ckrm/cpurc
@@ -0,0 +1,71 @@
+Introduction
+------------
+
+The CPU resource controller enables users/sysadmins to control the CPU time
+percentage of tasks in a class. It adjusts the time_slice of tasks based on
+feedback about the difference between the target value and the current usage,
+in order to drive the percentage of CPU usage toward the target value.
+
+Installation
+------------
+
+1. Configure "CPU Resource Controller" under CKRM. Currently, this cannot be
+configured as a module.
+
+2. Reboot the system with the new kernel.
+
+3. Verify that the CPU resource controller is present by reading
+the file /config/ckrm/shares (should show a line with res=cpu).
+
+Assigning shares
+----------------
+
+Follows the general approach of setting shares for a class in CKRM.
+
+# echo "res=cpu,min_shares=val" > shares
+
+sets the min_shares of a class.
+
+The CPU resource controller calculates an effective min_shares in percent
+for each class. Following is an example of class/min_shares settings
+and each effective min_shares.
+
+ /
+ effective_min_shares
+ = 100% - 50% - 30%
+ = 20%
+ +---------------+---------------+
+ /A min_shares=50% /B min_shares=30%
+ effective_min_shares effective_min_shares
+ = 50% - 10% - 25% = 30% - 0%
+ = 15% = 30%
++---------------+---------------+
+/C min_shares=20% /D min_shares=50%
+effective_min_shares effective_min_shares
+= 20% of 50% - 0% = 10% = 50% of 50% - 0 %
+= 10% = 25%
+
+If the min_shares in the class /A is changed from 50% to 40% in the above
+example, the effective_min_shares of the classes /A, /C and /D are automatically
+changed to 12%, 8% and 20% respectively.
+
+Although the child_shares_divisor can be changed, the effective_min_shares is
+always calculated in percent.
+
+Note that the CPU resource controller doesn't support the limit, so assigning
+the limit for "res=cpu" will have no effect.
+
+Monitoring
+----------
+
+The stats file shows the effective min_shares and the current CPU usage of a
+class in percentage.
+
+# cat stats
+cpu:effective_min_shares=50, load=40
+
+That means the effective min_shares of the class is 50% and the current load
+average of the class is 40%.
+
+Since the tasks in the class do not always try to consume the CPU, the load may
+be less than or greater than the effective_min_shares. Both cases are normal.
^ permalink raw reply [flat|nested] 18+ messages in thread