linux-kernel.vger.kernel.org archive mirror
* [git] CFS-devel, group scheduler, fixes
@ 2007-09-15 13:06 Ingo Molnar
  0 siblings, 0 replies; 37+ messages in thread
From: Ingo Molnar @ 2007-09-15 13:06 UTC (permalink / raw)
  To: linux-kernel
  Cc: Peter Zijlstra, Mike Galbraith, Srivatsa Vaddagiri, Dhaval Giani,
	Dmitry Adamushko


The latest sched-devel.git tree can be pulled from:

   git://git.kernel.org/pub/scm/linux/kernel/git/mingo/linux-2.6-sched-devel.git

people were busy sending patches, so there are lots of updates since the 
first announcement of the cfs-devel.git tree four days ago:

  include/linux/sched.h   |    4 
  init/Kconfig            |    9 +
  kernel/sched.c          |  374 +++++++++++++++++++++++++++++++++++++++++++-----
  kernel/sched_debug.c    |   22 +-
  kernel/sched_fair.c     |  191 ++++++++++++++++++++----
  kernel/sched_idletask.c |    5 
  kernel/sched_rt.c       |    5 
  kernel/sysctl.c         |   19 ++
  8 files changed, 552 insertions(+), 77 deletions(-)

the most user-noticeable improvement should be the latency/interactivity 
fixes from Mike Galbraith and Peter Zijlstra. The biggest (and most 
complex) item is the new group scheduler code from Srivatsa Vaddagiri. 
There are also speedups from Dmitry Adamushko and various cleanups and 
fixes. Also included is the yield workaround that should address the 
regression reported by Antoine Martin.

Changes:

- the biggest item is the addition of the group scheduler from Srivatsa 
  Vaddagiri - this is not configurable yet, it depends on
  CONFIG_CONTAINERS. It causes no overhead on !CONFIG_CONTAINERS. This
  code clearly seems mature now, hopefully the container bits go
  upstream in 2.6.24 too. Srivatsa did lots of heavy lifting in the past 
  few months, and this final bit of code that moves in all the 
  infrastructure changes almost nothing in the core scheduler.

- a triplet of nice simplifications from Dmitry Adamushko. Most notably
  Dmitry got rid of se->fair_key, which shaves 8 bytes off task_struct
  and gives a further speedup. Thanks Dmitry!

- continued refinements to the SMP ->vruntime code and timeslicing by 
  Peter Zijlstra and Mike Galbraith.

As usual, bug reports, fixes and suggestions are welcome, and please 
holler if some patch went missing in action.

	Ingo

------------------>
Dmitry Adamushko (3):
      sched: clean up struct load_stat
      sched: clean up schedstat block in dequeue_entity()
      sched: optimize away ->fair_key

Matthias Kaehlcke (1):
      sched: use list_for_each_entry_safe() in __wake_up_common()

Mike Galbraith (1):
      sched: fix SMP migration latencies

Peter Zijlstra (7):
      sched: simplify SCHED_FEAT_* code
      sched: new task placement for vruntime
      sched: simplify adaptive latency
      sched: clean up new task placement
      sched: add tree based averages
      sched: handle vruntime overflow
      sched: better min_vruntime tracking

Srivatsa Vaddagiri (1):
      sched: group-scheduler core

Ingo Molnar (27):
      sched: fix new-task method
      sched: resched task in task_new_fair()
      sched: small sched_debug cleanup
      sched: debug: track maximum 'slice'
      sched: uniform tunings
      sched: use constants if !CONFIG_SCHED_DEBUG
      sched: remove stat_gran
      sched: remove precise CPU load
      sched: remove precise CPU load calculations #2
      sched: track cfs_rq->curr on !group-scheduling too
      sched: cleanup: simplify cfs_rq_curr() methods
      sched: uninline __enqueue_entity()/__dequeue_entity()
      sched: speed up update_load_add/_sub()
      sched: clean up calc_weighted()
      sched: introduce se->vruntime
      sched: move sched_feat() definitions
      sched: optimize vruntime based scheduling
      sched: simplify check_preempt() methods
      sched: wakeup granularity fix
      sched: add se->vruntime debugging
      sched: sync up ->min_vruntime when going idle
      sched: add more vruntime statistics
      sched: debug: update exec_clock only when SCHED_DEBUG
      sched: remove wait_runtime limit
      sched: remove wait_runtime fields and features
      sched: x86: allow single-depth wchan output
      sched: yield workaround

 arch/i386/Kconfig       |   11 
 include/linux/sched.h   |   21 -
 init/Kconfig            |    9 
 kernel/sched.c          |  558 +++++++++++++++++++++++++------------
 kernel/sched_debug.c    |  100 +++---
 kernel/sched_fair.c     |  716 +++++++++++++++++++-----------------------------
 kernel/sched_idletask.c |    5 
 kernel/sched_rt.c       |    5 
 kernel/sysctl.c         |   33 +-
 9 files changed, 765 insertions(+), 693 deletions(-)

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [git] CFS-devel, group scheduler, fixes
@ 2007-09-18 19:36 dimm
  2007-09-18 20:16 ` Ingo Molnar
  2007-09-18 20:22 ` Ingo Molnar
  0 siblings, 2 replies; 37+ messages in thread
From: dimm @ 2007-09-18 19:36 UTC (permalink / raw)
  To: linux-kernel; +Cc: Ingo Molnar, Srivatsa Vaddagiri


[ well, don't expect to find here anything like RDCFS
(no, 'D' does not stand for 'dumb'!). I was focused
on more prosaic things in the meantime, so I just
didn't have time to write it.. ]

here are a few cleanups/simplifications/optimizations
based on the recent modifications in the sched-dev tree.

(1) optimize task_new_fair()
(2) simplify yield_task()
(3) rework enqueue/dequeue_entity() to get rid of
sched_class::set_curr_task()

additionally, the changes somewhat decrease code size:

   text    data     bss     dec     hex filename
  43538    5398      48   48984    bf58 build/kernel/sched.o.before
  43250    5390      48   48688    be30 build/kernel/sched.o

(built SMP with lots of debugging options but, I guess, the size
difference should remain visible for any config combination).

---

(1)

due to the fact that we no longer keep the 'current' within the tree,
dequeue/enqueue_entity() is useless for the 'current' in task_new_fair().
We are about to reschedule and sched_class->put_prev_task() will put
the 'current' back into the tree, based on its new key.

Signed-off-by: Dmitry Adamushko <dmitry.adamushko@gmail.com>

---
diff --git a/kernel/sched_fair.c b/kernel/sched_fair.c
index 6e52d5a..5a244e2 100644
--- a/kernel/sched_fair.c
+++ b/kernel/sched_fair.c
@@ -969,10 +969,11 @@ static void task_new_fair(struct rq *rq, struct task_struct *p)
 
 	if (sysctl_sched_child_runs_first &&
 			curr->vruntime < se->vruntime) {
-
-		dequeue_entity(cfs_rq, curr, 0);
+		/*
+ 		 * Upon rescheduling, sched_class::put_prev_task() will place
+ 		 * 'current' within the tree based on its new key value.
+ 		 */
 		swap(curr->vruntime, se->vruntime);
-		enqueue_entity(cfs_rq, curr, 0);
 	}
 
 	update_stats_enqueue(cfs_rq, se);

---

Dmitry





* [git] CFS-devel, group scheduler, fixes
@ 2007-09-18 19:46 Dmitry Adamushko
  2007-09-18 20:17 ` Ingo Molnar
  0 siblings, 1 reply; 37+ messages in thread
From: Dmitry Adamushko @ 2007-09-18 19:46 UTC (permalink / raw)
  To: linux-kernel; +Cc: Ingo Molnar, Srivatsa Vaddagiri


(2)

the 'p' (task_struct) parameter in sched_class::yield_task()
is redundant as the caller is always the 'current'. Get rid of it.

Signed-off-by: Dmitry Adamushko <dmitry.adamushko@gmail.com>

---
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 9fd936f..3728cd6 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -857,7 +857,7 @@ struct sched_class {
 
 	void (*enqueue_task) (struct rq *rq, struct task_struct *p, int wakeup);
 	void (*dequeue_task) (struct rq *rq, struct task_struct *p, int sleep);
-	void (*yield_task) (struct rq *rq, struct task_struct *p);
+	void (*yield_task) (struct rq *rq);
 
 	void (*check_preempt_curr) (struct rq *rq, struct task_struct *p);
 
diff --git a/kernel/sched.c b/kernel/sched.c
index 046dae1..361fad8 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -4535,7 +4535,7 @@ asmlinkage long sys_sched_yield(void)
 	if (unlikely(rq->nr_running == 1))
 		schedstat_inc(rq, yld_act_empty);
 	else
-		current->sched_class->yield_task(rq, current);
+		current->sched_class->yield_task(rq);
 
 	/*
 	 * Since we are going to call schedule() anyway, there's
diff --git a/kernel/sched_fair.c b/kernel/sched_fair.c
index 5a244e2..9b982ef 100644
--- a/kernel/sched_fair.c
+++ b/kernel/sched_fair.c
@@ -723,10 +723,12 @@ static void dequeue_task_fair(struct rq *rq, struct task_struct *p, int sleep)
 /*
  * sched_yield() support is very simple - we dequeue and enqueue
  */
-static void yield_task_fair(struct rq *rq, struct task_struct *p)
+static void yield_task_fair(struct rq *rq)
 {
+	struct task_struct *curr = rq->curr;
+	struct cfs_rq *cfs_rq = task_cfs_rq(curr);
+
 	if (!sysctl_sched_yield_bug_workaround) {
-		struct cfs_rq *cfs_rq = task_cfs_rq(p);
 		__update_rq_clock(rq);
 
 		/*
@@ -737,7 +739,6 @@ static void yield_task_fair(struct rq *rq, struct task_struct *p)
 	}
 
 	if (sysctl_sched_yield_bug_workaround == 1) {
-		struct cfs_rq *cfs_rq = task_cfs_rq(p);
 		struct sched_entity *next;
 
 		/*
@@ -757,7 +758,7 @@ static void yield_task_fair(struct rq *rq, struct task_struct *p)
 		/*
 		 * Minimally necessary key value to be the second in the tree:
 		 */
-		p->se.vruntime = next->vruntime +
+		curr->se.vruntime = next->vruntime +
 					sysctl_sched_yield_granularity;
 
 		/*
@@ -770,7 +771,7 @@ static void yield_task_fair(struct rq *rq, struct task_struct *p)
 	/*
 	 * Just reschedule, do nothing else:
 	 */
-	resched_task(p);
+	resched_task(curr);
 }
 
 /*
diff --git a/kernel/sched_rt.c b/kernel/sched_rt.c
index 45b339f..b86944c 100644
--- a/kernel/sched_rt.c
+++ b/kernel/sched_rt.c
@@ -59,9 +59,9 @@ static void requeue_task_rt(struct rq *rq, struct task_struct *p)
 }
 
 static void
-yield_task_rt(struct rq *rq, struct task_struct *p)
+yield_task_rt(struct rq *rq)
 {
-	requeue_task_rt(rq, p);
+	requeue_task_rt(rq, rq->curr);
 }
 
 /*

---

Dmitry






* Re: [git] CFS-devel, group scheduler, fixes
@ 2007-09-18 19:56 Dmitry Adamushko
  2007-09-18 20:18 ` Ingo Molnar
  0 siblings, 1 reply; 37+ messages in thread
From: Dmitry Adamushko @ 2007-09-18 19:56 UTC (permalink / raw)
  To: linux-kernel; +Cc: Ingo Molnar, Srivatsa Vaddagiri


(3)

rework enqueue/dequeue_entity() to get rid of sched_class::set_curr_task().
This simplifies sched_setscheduler(), rt_mutex_setprio() and sched_move_task().

Signed-off-by: Dmitry Adamushko <dmitry.adamushko@gmail.com>
Signed-off-by: Srivatsa Vaddagiri <vatsa@linux.vnet.ibm.com>

---
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 3728cd6..1094804 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -870,7 +870,6 @@ struct sched_class {
 			struct sched_domain *sd, enum cpu_idle_type idle,
 			int *all_pinned, int *this_best_prio);
 
-	void (*set_curr_task) (struct rq *rq);
 	void (*task_tick) (struct rq *rq, struct task_struct *p);
 	void (*task_new) (struct rq *rq, struct task_struct *p);
 };
diff --git a/kernel/sched.c b/kernel/sched.c
index 361fad8..bc1a625 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -3910,8 +3910,8 @@ EXPORT_SYMBOL(sleep_on_timeout);
  */
 void rt_mutex_setprio(struct task_struct *p, int prio)
 {
-	int oldprio, on_rq, running;
 	unsigned long flags;
+	int oldprio, on_rq;
 	struct rq *rq;
 
 	BUG_ON(prio < 0 || prio > MAX_PRIO);
@@ -3921,12 +3921,8 @@ void rt_mutex_setprio(struct task_struct *p, int prio)
 
 	oldprio = p->prio;
 	on_rq = p->se.on_rq;
-	running = task_running(rq, p);
-	if (on_rq) {
+	if (on_rq)
 		dequeue_task(rq, p, 0);
-		if (running)
-			p->sched_class->put_prev_task(rq, p);
-	}
 
 	if (rt_prio(prio))
 		p->sched_class = &rt_sched_class;
@@ -3936,15 +3932,13 @@ void rt_mutex_setprio(struct task_struct *p, int prio)
 	p->prio = prio;
 
 	if (on_rq) {
-		if (running)
-			p->sched_class->set_curr_task(rq);
 		enqueue_task(rq, p, 0);
 		/*
 		 * Reschedule if we are currently running on this runqueue and
 		 * our priority decreased, or if we are not currently running on
 		 * this runqueue and our priority is higher than the current's
 		 */
-		if (running) {
+		if (task_running(rq, p)) {
 			if (p->prio > oldprio)
 				resched_task(rq->curr);
 		} else {
@@ -4150,7 +4144,7 @@ __setscheduler(struct rq *rq, struct task_struct *p, int policy, int prio)
 int sched_setscheduler(struct task_struct *p, int policy,
 		       struct sched_param *param)
 {
-	int retval, oldprio, oldpolicy = -1, on_rq, running;
+	int retval, oldprio, oldpolicy = -1, on_rq;
 	unsigned long flags;
 	struct rq *rq;
 
@@ -4232,24 +4226,20 @@ recheck:
 	}
 	update_rq_clock(rq);
 	on_rq = p->se.on_rq;
-	running = task_running(rq, p);
-	if (on_rq) {
+	if (on_rq)
 		deactivate_task(rq, p, 0);
-		if (running)
-			p->sched_class->put_prev_task(rq, p);
-	}
+
 	oldprio = p->prio;
 	__setscheduler(rq, p, policy, param->sched_priority);
+
 	if (on_rq) {
-		if (running)
-			p->sched_class->set_curr_task(rq);
 		activate_task(rq, p, 0);
 		/*
 		 * Reschedule if we are currently running on this runqueue and
 		 * our priority decreased, or if we are not currently running on
 		 * this runqueue and our priority is higher than the current's
 		 */
-		if (running) {
+		if (task_running(rq, p)) {
 			if (p->prio > oldprio)
 				resched_task(rq->curr);
 		} else {
@@ -6853,19 +6843,13 @@ static void sched_move_task(struct container_subsys *ss, struct container *cont,
 	running = task_running(rq, tsk);
 	on_rq = tsk->se.on_rq;
 
-	if (on_rq) {
+	if (on_rq)
 		dequeue_task(rq, tsk, 0);
-		if (unlikely(running))
-			tsk->sched_class->put_prev_task(rq, tsk);
-	}
 
 	set_task_cfs_rq(tsk);
 
-	if (on_rq) {
-		if (unlikely(running))
-			tsk->sched_class->set_curr_task(rq);
+	if (on_rq)
 		enqueue_task(rq, tsk, 0);
-	}
 
 done:
 	task_rq_unlock(rq, &flags);
diff --git a/kernel/sched_fair.c b/kernel/sched_fair.c
index e65af8c..6539377 100644
--- a/kernel/sched_fair.c
+++ b/kernel/sched_fair.c
@@ -475,9 +475,20 @@ place_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int initial)
 }
 
 static void
-enqueue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int wakeup)
+enqueue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se,
+		int wakeup, int set_curr)
 {
 	/*
+ 	 * In case of the 'current'.
+ 	 */
+	if (unlikely(set_curr)) {
+		update_stats_curr_start(cfs_rq, se);
+		cfs_rq->curr = se;
+		account_entity_enqueue(cfs_rq, se);
+		return;
+	}
+
+	/*
 	 * Update the fair clock.
 	 */
 	update_curr(cfs_rq);
@@ -488,8 +499,7 @@ enqueue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int wakeup)
 	}
 
 	update_stats_enqueue(cfs_rq, se);
-	if (se != cfs_rq->curr)
-		__enqueue_entity(cfs_rq, se);
+	__enqueue_entity(cfs_rq, se);
 	account_entity_enqueue(cfs_rq, se);
 }
 
@@ -509,8 +519,12 @@ dequeue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int sleep)
 		}
 	}
 #endif
-	if (se != cfs_rq->curr)
+	if (likely(se != cfs_rq->curr))
 		__dequeue_entity(cfs_rq, se);
+	else {
+		update_stats_curr_end(cfs_rq, se);
+		cfs_rq->curr = NULL;
+	}
 	account_entity_dequeue(cfs_rq, se);
 }
 
@@ -692,12 +706,17 @@ static void enqueue_task_fair(struct rq *rq, struct task_struct *p, int wakeup)
 {
 	struct cfs_rq *cfs_rq;
 	struct sched_entity *se = &p->se;
+	int set_curr = 0;
+
+	/* Are we enqueuing the current task? */
+	if (unlikely(task_running(rq, p)))
+		set_curr = 1;
 
 	for_each_sched_entity(se) {
 		if (se->on_rq)
 			break;
 		cfs_rq = cfs_rq_of(se);
-		enqueue_entity(cfs_rq, se, wakeup);
+		enqueue_entity(cfs_rq, se, wakeup, set_curr);
 	}
 }
 
@@ -983,29 +1002,6 @@ static void task_new_fair(struct rq *rq, struct task_struct *p)
 	resched_task(rq->curr);
 }
 
-#ifdef CONFIG_FAIR_GROUP_SCHED
-/* Account for a task changing its policy or group.
- *
- * This routine is mostly called to set cfs_rq->curr field when a task
- * migrates between groups/classes.
- */
-static void set_curr_task_fair(struct rq *rq)
-{
-	struct sched_entity *se = &rq->curr->se;
-
-	for_each_sched_entity(se)
-		set_next_entity(cfs_rq_of(se), se);
-}
-#else
-static void set_curr_task_fair(struct rq *rq)
-{
-	struct sched_entity *se = &rq->curr->se;
-	struct cfs_rq *cfs_rq = cfs_rq_of(se);
-
-	cfs_rq->curr = se;
-}
-#endif
-
 /*
  * All the scheduling class methods:
  */
@@ -1021,7 +1017,6 @@ struct sched_class fair_sched_class __read_mostly = {
 
 	.load_balance		= load_balance_fair,
 
-	.set_curr_task          = set_curr_task_fair,
 	.task_tick		= task_tick_fair,
 	.task_new		= task_new_fair,
 };
diff --git a/kernel/sched_idletask.c b/kernel/sched_idletask.c
index 5ebf829..3503fb2 100644
--- a/kernel/sched_idletask.c
+++ b/kernel/sched_idletask.c
@@ -50,10 +50,6 @@ static void task_tick_idle(struct rq *rq, struct task_struct *curr)
 {
 }
 
-static void set_curr_task_idle(struct rq *rq)
-{
-}
-
 /*
  * Simple, special scheduling class for the per-CPU idle tasks:
  */
@@ -70,7 +66,6 @@ static struct sched_class idle_sched_class __read_mostly = {
 
 	.load_balance		= load_balance_idle,
 
-	.set_curr_task          = set_curr_task_idle,
 	.task_tick		= task_tick_idle,
 	/* no .task_new for idle tasks */
 };
diff --git a/kernel/sched_rt.c b/kernel/sched_rt.c
index b86944c..3c77c03 100644
--- a/kernel/sched_rt.c
+++ b/kernel/sched_rt.c
@@ -218,10 +218,6 @@ static void task_tick_rt(struct rq *rq, struct task_struct *p)
 	}
 }
 
-static void set_curr_task_rt(struct rq *rq)
-{
-}
-
 static struct sched_class rt_sched_class __read_mostly = {
 	.enqueue_task		= enqueue_task_rt,
 	.dequeue_task		= dequeue_task_rt,
@@ -234,6 +230,5 @@ static struct sched_class rt_sched_class __read_mostly = {
 
 	.load_balance		= load_balance_rt,
 
-	.set_curr_task          = set_curr_task_rt,
 	.task_tick		= task_tick_rt,
 };

---

Dmitry




* Re: [git] CFS-devel, group scheduler, fixes
  2007-09-18 19:36 [git] CFS-devel, group scheduler, fixes dimm
@ 2007-09-18 20:16 ` Ingo Molnar
  2007-09-19  6:03   ` Tong Li
  2007-09-18 20:22 ` Ingo Molnar
  1 sibling, 1 reply; 37+ messages in thread
From: Ingo Molnar @ 2007-09-18 20:16 UTC (permalink / raw)
  To: dimm; +Cc: linux-kernel, Srivatsa Vaddagiri, Peter Zijlstra


* dimm <dmitry.adamushko@gmail.com> wrote:

> (1)
> 
> due to the fact that we no longer keep the 'current' within the tree, 
> dequeue/enqueue_entity() is useless for the 'current' in 
> task_new_fair(). We are about to reschedule and 
> sched_class->put_prev_task() will put the 'current' back into the 
> tree, based on its new key.
> 
> Signed-off-by: Dmitry Adamushko <dmitry.adamushko@gmail.com>

thanks, applied.

	Ingo


* Re: [git] CFS-devel, group scheduler, fixes
  2007-09-18 19:46 Dmitry Adamushko
@ 2007-09-18 20:17 ` Ingo Molnar
  0 siblings, 0 replies; 37+ messages in thread
From: Ingo Molnar @ 2007-09-18 20:17 UTC (permalink / raw)
  To: Dmitry Adamushko; +Cc: linux-kernel, Srivatsa Vaddagiri, Peter Zijlstra


* Dmitry Adamushko <dmitry.adamushko@gmail.com> wrote:

> (2)
> 
> the 'p' (task_struct) parameter in sched_class::yield_task() is 
> redundant as the caller is always the 'current'. Get rid of it.
> 
> Signed-off-by: Dmitry Adamushko <dmitry.adamushko@gmail.com>

ah - good one! I completely forgot about that parameter and it's indeed 
unneeded. Applied.

	Ingo


* Re: [git] CFS-devel, group scheduler, fixes
  2007-09-18 19:56 Dmitry Adamushko
@ 2007-09-18 20:18 ` Ingo Molnar
  0 siblings, 0 replies; 37+ messages in thread
From: Ingo Molnar @ 2007-09-18 20:18 UTC (permalink / raw)
  To: Dmitry Adamushko; +Cc: linux-kernel, Srivatsa Vaddagiri


* Dmitry Adamushko <dmitry.adamushko@gmail.com> wrote:

> (3)
> 
> rework enqueue/dequeue_entity() to get rid of 
> sched_class::set_curr_task(). This simplifies sched_setscheduler(), 
> rt_mutex_setprio() and sched_move_task().

ah. This makes your ready-queue patch a code size win. Applied.

	Ingo


* Re: [git] CFS-devel, group scheduler, fixes
  2007-09-18 19:36 [git] CFS-devel, group scheduler, fixes dimm
  2007-09-18 20:16 ` Ingo Molnar
@ 2007-09-18 20:22 ` Ingo Molnar
  2007-09-19  3:55   ` Srivatsa Vaddagiri
  1 sibling, 1 reply; 37+ messages in thread
From: Ingo Molnar @ 2007-09-18 20:22 UTC (permalink / raw)
  To: dimm; +Cc: linux-kernel, Srivatsa Vaddagiri


* dimm <dmitry.adamushko@gmail.com> wrote:

> [ well, don't expect to find here anything like RDCFS (no, 'D' does 
> not stand for 'dumb'!). I was focused on more prosaic things in the 
> meantime, so I just didn't have time to write it.. ]
> 
> here are a few cleanups/simplifications/optimizations based on the 
> recent modifications in the sched-dev tree.
> 
> (1) optimize task_new_fair()
> (2) simplify yield_task()
> (3) rework enqueue/dequeue_entity() to get rid of
>     sched_class::set_curr_task()

the queue with your enhancements and simplifications applied looks good 
here, and it booted fine on two testboxes. I've updated the 
sched-devel.git tree:

   git://git.kernel.org/pub/scm/linux/kernel/git/mingo/linux-2.6-sched-devel.git

(below is the full shortlog over current upstream.)

(I have not tested the group scheduling bits but perhaps Srivatsa would 
like to do that?)

	Ingo

------------------>
Dmitry Adamushko (9):
      sched: clean up struct load_stat
      sched: clean up schedstat block in dequeue_entity()
      sched: sched_setscheduler() fix
      sched: add set_curr_task() calls
      sched: do not keep current in the tree and get rid of sched_entity::fair_key
      sched: yield-workaround update
      sched: optimize task_new_fair()
      sched: simplify sched_class::yield_task()
      sched: rework enqueue/dequeue_entity() to get rid of set_curr_task()

Ingo Molnar (29):
      sched: fix new-task method
      sched: resched task in task_new_fair()
      sched: small sched_debug cleanup
      sched: debug: track maximum 'slice'
      sched: uniform tunings
      sched: use constants if !CONFIG_SCHED_DEBUG
      sched: remove stat_gran
      sched: remove precise CPU load
      sched: remove precise CPU load calculations #2
      sched: track cfs_rq->curr on !group-scheduling too
      sched: cleanup: simplify cfs_rq_curr() methods
      sched: uninline __enqueue_entity()/__dequeue_entity()
      sched: speed up update_load_add/_sub()
      sched: clean up calc_weighted()
      sched: introduce se->vruntime
      sched: move sched_feat() definitions
      sched: optimize vruntime based scheduling
      sched: simplify check_preempt() methods
      sched: wakeup granularity fix
      sched: add se->vruntime debugging
      sched: sync up ->min_vruntime when going idle
      sched: add more vruntime statistics
      sched: debug: update exec_clock only when SCHED_DEBUG
      sched: remove wait_runtime limit
      sched: remove wait_runtime fields and features
      sched: x86: allow single-depth wchan output
      sched: yield workaround
      sched: fix delay accounting performance regression
      sched: disable START_DEBIT

Matthias Kaehlcke (1):
      sched: use list_for_each_entry_safe() in __wake_up_common()

Mike Galbraith (1):
      sched: fix SMP migration latencies

Peter Zijlstra (7):
      sched: simplify SCHED_FEAT_* code
      sched: new task placement for vruntime
      sched: simplify adaptive latency
      sched: clean up new task placement
      sched: add tree based averages
      sched: handle vruntime overflow
      sched: better min_vruntime tracking

Srivatsa Vaddagiri (1):
      sched: group-scheduler core

 arch/i386/Kconfig     |   11 
 include/linux/sched.h |   24 -
 init/Kconfig          |    9 
 kernel/sched.c        |  544 ++++++++++++++++++++++++------------
 kernel/sched_debug.c  |  100 ++----
 kernel/sched_fair.c   |  752 ++++++++++++++++++++------------------------------
 kernel/sched_rt.c     |    4 
 kernel/sched_stats.h  |    4 
 kernel/sysctl.c       |   35 +-
 9 files changed, 763 insertions(+), 720 deletions(-)


* Re: [git] CFS-devel, group scheduler, fixes
  2007-09-18 20:22 ` Ingo Molnar
@ 2007-09-19  3:55   ` Srivatsa Vaddagiri
  0 siblings, 0 replies; 37+ messages in thread
From: Srivatsa Vaddagiri @ 2007-09-19  3:55 UTC (permalink / raw)
  To: Ingo Molnar; +Cc: dimm, linux-kernel, dhaval

On Tue, Sep 18, 2007 at 10:22:43PM +0200, Ingo Molnar wrote:
> (I have not tested the group scheduling bits but perhaps Srivatsa would 
> like to do that?)

Ingo,
	I plan to test it today and send you any updates that may be
required.

-- 
Regards,
vatsa


* Re: [git] CFS-devel, group scheduler, fixes
  2007-09-18 20:16 ` Ingo Molnar
@ 2007-09-19  6:03   ` Tong Li
  2007-09-19  6:28     ` Mike Galbraith
  2007-09-19 19:35     ` Siddha, Suresh B
  0 siblings, 2 replies; 37+ messages in thread
From: Tong Li @ 2007-09-19  6:03 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: dimm, linux-kernel, Srivatsa Vaddagiri, Peter Zijlstra,
	Mike Galbraith

This patch attempts to improve CFS's SMP global fairness based on the new 
virtual time design.

Removed vruntime adjustment in set_task_cpu() as it skews global fairness.

Modified small_imbalance logic in find_busiest_group(). If there's small 
imbalance, move tasks from busiest to local sched_group only if the local 
group contains a CPU whose min_vruntime is the maximum among all CPUs in 
the same sched_domain. This prevents any CPU from advancing too far ahead 
in virtual time and avoids tasks thrashing between two CPUs without 
utilizing other CPUs in the system. For example, for 10 tasks on 8 CPUs, 
since the load is not evenly divisible by the number of CPUs, we want the 
extra load to have a fair use of every CPU in the system.

Tested with a microbenchmark running 10 nice-0 tasks on 8 CPUs. Each task 
runs a trivial while (1) loop. The benchmark runs for 300 seconds and, at 
every T seconds, it samples for each task the following:

1. Actual CPU time the task received during the past 60 seconds.

2. Ideal CPU time it would receive under a perfect fair scheduler.

3. Lag = ideal time - actual time, where a positive lag means the task 
received less CPU time than its fair share and negative means it received 
more.

4. Error = lag / ideal time

The following shows the max and min errors among all samples for all tasks 
before and after applying the patch:

Before:

Sampling interval: 30 s
Max error: 100.00%
Min error: -25.00%

Sampling interval: 10 s
Max error: 27.62%
Min error: -25.00%

After:

Sampling interval: 30 s
Max error: 1.33%
Min error: -1.29%

Sampling interval: 10 s
Max error: 7.38%
Min error: -6.25%

The errors for the 10s sampling interval are still not as small as I had 
hoped for, but it looks like the patch does bring some improvement.

    tong

Signed-off-by: Tong Li <tong.n.li@intel.com>
---
--- linux-2.6-sched-devel-orig/kernel/sched.c	2007-09-15 22:00:48.000000000 -0700
+++ linux-2.6-sched-devel/kernel/sched.c	2007-09-18 22:10:52.000000000 -0700
@@ -1033,9 +1033,6 @@ void set_task_cpu(struct task_struct *p,
  	if (p->se.block_start)
  		p->se.block_start -= clock_offset;
  #endif
-	if (likely(new_rq->cfs.min_vruntime))
-		p->se.vruntime -= old_rq->cfs.min_vruntime -
-						new_rq->cfs.min_vruntime;

  	__set_task_cpu(p, new_cpu);
  }
@@ -1599,6 +1596,7 @@ static void __sched_fork(struct task_str
  	p->se.exec_start		= 0;
  	p->se.sum_exec_runtime		= 0;
  	p->se.prev_sum_exec_runtime	= 0;
+	p->se.vruntime			= 0;

  #ifdef CONFIG_SCHEDSTATS
  	p->se.wait_start		= 0;
@@ -2277,6 +2275,8 @@ find_busiest_group(struct sched_domain *
  		   int *sd_idle, cpumask_t *cpus, int *balance)
  {
  	struct sched_group *busiest = NULL, *this = NULL, *group = sd->groups;
+	struct sched_group *max_vruntime_group = NULL;
+	u64 max_vruntime = 0;
  	unsigned long max_load, avg_load, total_load, this_load, total_pwr;
  	unsigned long max_pull;
  	unsigned long busiest_load_per_task, busiest_nr_running;
@@ -2322,6 +2322,11 @@ find_busiest_group(struct sched_domain *

  			rq = cpu_rq(i);

+			if (rq->cfs.min_vruntime > max_vruntime) {
+				max_vruntime = rq->cfs.min_vruntime;
+				max_vruntime_group = group;
+			}
+
  			if (*sd_idle && rq->nr_running)
  				*sd_idle = 0;

@@ -2483,59 +2488,16 @@ group_next:
  	 * moved
  	 */
  	if (*imbalance < busiest_load_per_task) {
-		unsigned long tmp, pwr_now, pwr_move;
-		unsigned int imbn;
-
  small_imbalance:
-		pwr_move = pwr_now = 0;
-		imbn = 2;
-		if (this_nr_running) {
-			this_load_per_task /= this_nr_running;
-			if (busiest_load_per_task > this_load_per_task)
-				imbn = 1;
-		} else
-			this_load_per_task = SCHED_LOAD_SCALE;
-
-		if (max_load - this_load + SCHED_LOAD_SCALE_FUZZ >=
-					busiest_load_per_task * imbn) {
-			*imbalance = busiest_load_per_task;
-			return busiest;
-		}
-
-		/*
-		 * OK, we don't have enough imbalance to justify moving tasks,
-		 * however we may be able to increase total CPU power used by
-		 * moving them.
+		/* 
+		 * When there's small imbalance, move tasks only if this
+		 * sched_group contains a CPU whose min_vruntime is the 
+		 * maximum among all CPUs in the same domain.
  		 */
-
-		pwr_now += busiest->__cpu_power *
-				min(busiest_load_per_task, max_load);
-		pwr_now += this->__cpu_power *
-				min(this_load_per_task, this_load);
-		pwr_now /= SCHED_LOAD_SCALE;
-
-		/* Amount of load we'd subtract */
-		tmp = sg_div_cpu_power(busiest,
-				busiest_load_per_task * SCHED_LOAD_SCALE);
-		if (max_load > tmp)
-			pwr_move += busiest->__cpu_power *
-				min(busiest_load_per_task, max_load - tmp);
-
-		/* Amount of load we'd add */
-		if (max_load * busiest->__cpu_power <
-				busiest_load_per_task * SCHED_LOAD_SCALE)
-			tmp = sg_div_cpu_power(this,
-					max_load * busiest->__cpu_power);
-		else
-			tmp = sg_div_cpu_power(this,
-				busiest_load_per_task * SCHED_LOAD_SCALE);
-		pwr_move += this->__cpu_power *
-				min(this_load_per_task, this_load + tmp);
-		pwr_move /= SCHED_LOAD_SCALE;
-
-		/* Move if we gain throughput */
-		if (pwr_move > pwr_now)
+		if (max_vruntime_group == this)
  			*imbalance = busiest_load_per_task;
+		else
+			*imbalance = 0;
  	}

  	return busiest;


* Re: [git] CFS-devel, group scheduler, fixes
  2007-09-19  6:03   ` Tong Li
@ 2007-09-19  6:28     ` Mike Galbraith
  2007-09-19  7:51       ` Mike Galbraith
  2007-09-19 19:35     ` Siddha, Suresh B
  1 sibling, 1 reply; 37+ messages in thread
From: Mike Galbraith @ 2007-09-19  6:28 UTC (permalink / raw)
  To: Tong Li; +Cc: Ingo Molnar, dimm, linux-kernel, Srivatsa Vaddagiri,
	Peter Zijlstra

Greetings,

On Tue, 2007-09-18 at 23:03 -0700, Tong Li wrote:
> This patch attempts to improve CFS's SMP global fairness based on the new 
> virtual time design.
> 
> Removed vruntime adjustment in set_task_cpu() as it skews global fairness.

Since I'm (still) encountering Xorg latency issues (which go away if
load is hefty instead of light) even with that migration adjustment and
synchronization, and am having difficulty nailing it down to a specific
event, I'll test this immediately.

	-Mike



* Re: [git] CFS-devel, group scheduler, fixes
  2007-09-19  6:28     ` Mike Galbraith
@ 2007-09-19  7:51       ` Mike Galbraith
  2007-09-19  8:42         ` Mike Galbraith
  0 siblings, 1 reply; 37+ messages in thread
From: Mike Galbraith @ 2007-09-19  7:51 UTC (permalink / raw)
  To: Tong Li; +Cc: Ingo Molnar, dimm, linux-kernel, Srivatsa Vaddagiri,
	Peter Zijlstra

On Wed, 2007-09-19 at 08:28 +0200, Mike Galbraith wrote:
> Greetings,
> 
> On Tue, 2007-09-18 at 23:03 -0700, Tong Li wrote:
> > This patch attempts to improve CFS's SMP global fairness based on the new 
> > virtual time design.
> > 
> > Removed vruntime adjustment in set_task_cpu() as it skews global fairness.
> 
> Since I'm (still) encountering Xorg latency issues (which go away if
> load is hefty instead of light) even with that migration adjustment and
> synchronization, and am having difficulty nailing it down to a specific
> event, I'll test this immediately.

(had to apply manually to freshly pulled tree)

Drat.  This didn't cure the latency hits with a Xorg at nice -5 running
with a make -j2 at nice 0, but seems to have reinstated a latency issue
which was previously cured.

Xorg 1 sec. max latency samples:
(trimmed to only show >20ms latencies)
se.wait_max              :            23343582
se.wait_max              :            20119460
se.wait_max              :            20771573
se.wait_max              :            21084567
se.wait_max              :            31338500
se.wait_max              :            35368148
se.wait_max              :            39199642
se.wait_max              :            22889062
se.wait_max              :            40285501
se.wait_max              :            21002720
se.wait_max              :            21002266
se.wait_max              :            21680578
se.wait_max              :            22012913
se.wait_max              :            94646331
se.wait_max              :            29003693
se.wait_max              :            20812613

(boot with maxcpus=1 or nail X+make to one cpu and these latencies are
gone, so it does seem to be the migration logic - which is why I was so
interested in testing your patch)

The scenario which was previously cured was this:
taskset -c 1 nice -n 0 ./massive_intr 2 9999
taskset -c 1 nice -n 5 ./massive_intr 2 9999
click link
(http://pages.cs.wisc.edu/~shubu/talks/cachescrub-prdc2004.ppt) to bring
up browser and OpenOffice Impress.

Xorg (at nice -5 + above scenario) latency samples:
se.wait_max              :            57985337
se.wait_max              :            25163510
se.wait_max              :            37005538
se.wait_max              :            66986511
se.wait_max              :            53990868
se.wait_max              :            80976761
se.wait_max              :            96967501
se.wait_max              :            80989254
se.wait_max              :            53990897
se.wait_max              :           181963905
se.wait_max              :            85985181

	-Mike



* Re: [git] CFS-devel, group scheduler, fixes
  2007-09-19  7:51       ` Mike Galbraith
@ 2007-09-19  8:42         ` Mike Galbraith
  2007-09-19 17:06           ` Tong Li
  0 siblings, 1 reply; 37+ messages in thread
From: Mike Galbraith @ 2007-09-19  8:42 UTC (permalink / raw)
  To: Tong Li; +Cc: Ingo Molnar, dimm, linux-kernel, Srivatsa Vaddagiri,
	Peter Zijlstra

On Wed, 2007-09-19 at 09:51 +0200, Mike Galbraith wrote:

> The scenario which was previously cured was this:
> taskset -c 1 nice -n 0 ./massive_intr 2 9999
> taskset -c 1 nice -n 5 ./massive_intr 2 9999
> click link
> (http://pages.cs.wisc.edu/~shubu/talks/cachescrub-prdc2004.ppt) to bring
> up browser and OpenOffice Impress.
> 
> Xorg (at nice -5 + above scenario) latency samples:
> se.wait_max              :            57985337
> se.wait_max              :            25163510
> se.wait_max              :            37005538
> se.wait_max              :            66986511
> se.wait_max              :            53990868
> se.wait_max              :            80976761
> se.wait_max              :            96967501
> se.wait_max              :            80989254
> se.wait_max              :            53990897
> se.wait_max              :           181963905
> se.wait_max              :            85985181

To be doubly sure of the effect on the pinned tasks + migrating Xorg
scenario, I just ran the above test 10 times with virgin devel source.
Maximum Xorg latency was 20ms.

	-Mike



* Re: [git] CFS-devel, group scheduler, fixes
  2007-09-19  8:42         ` Mike Galbraith
@ 2007-09-19 17:06           ` Tong Li
  2007-09-20  4:55             ` Mike Galbraith
  0 siblings, 1 reply; 37+ messages in thread
From: Tong Li @ 2007-09-19 17:06 UTC (permalink / raw)
  To: Mike Galbraith
  Cc: Tong Li, Ingo Molnar, dimm, linux-kernel, Srivatsa Vaddagiri,
	Peter Zijlstra

On Wed, 19 Sep 2007, Mike Galbraith wrote:

> On Wed, 2007-09-19 at 09:51 +0200, Mike Galbraith wrote:
>
>> The scenario which was previously cured was this:
>> taskset -c 1 nice -n 0 ./massive_intr 2 9999
>> taskset -c 1 nice -n 5 ./massive_intr 2 9999
>> click link
>> (http://pages.cs.wisc.edu/~shubu/talks/cachescrub-prdc2004.ppt) to bring
>> up browser and OpenOffice Impress.
>>
>> Xorg (at nice -5 + above scenario) latency samples:
>> se.wait_max              :            57985337
>> se.wait_max              :            25163510
>> se.wait_max              :            37005538
>> se.wait_max              :            66986511
>> se.wait_max              :            53990868
>> se.wait_max              :            80976761
>> se.wait_max              :            96967501
>> se.wait_max              :            80989254
>> se.wait_max              :            53990897
>> se.wait_max              :           181963905
>> se.wait_max              :            85985181
>
> To be doubly sure of the effect on the pinned tasks + migrating Xorg
> scenario, I just ran the above test 10 times with virgin devel source.
> Maximum Xorg latency was 20ms.
>
> 	-Mike
>

Yeah, the patch was a first attempt at getting better global fairness for 
unpinned tasks. I expected there'd be latency problems when unpinned and 
pinned tasks co-exist. The original code for vruntime adjustment in 
set_task_cpu() helped alleviate this problem, so removing it in my patch 
would definitely bring the problem back. On the other hand, leaving it 
there broke global fairness in my fairness benchmark. So we'd need a 
better compromise.

Were the experiments run on a 2-CPU system? When Xorg experiences large 
wait time, is it on the same CPU that has the two pinned tasks? If this is 
the case, then the problem could be that X somehow advanced faster and got a 
larger vruntime than the two pinned tasks, so it had to wait for the 
pinned ones to catch up before it got a chance to be scheduled.

   tong


* Re: [git] CFS-devel, group scheduler, fixes
  2007-09-19  6:03   ` Tong Li
  2007-09-19  6:28     ` Mike Galbraith
@ 2007-09-19 19:35     ` Siddha, Suresh B
  2007-09-19 20:58       ` Tong Li
  1 sibling, 1 reply; 37+ messages in thread
From: Siddha, Suresh B @ 2007-09-19 19:35 UTC (permalink / raw)
  To: Tong Li
  Cc: Ingo Molnar, dimm, linux-kernel, Srivatsa Vaddagiri,
	Peter Zijlstra, Mike Galbraith

On Tue, Sep 18, 2007 at 11:03:59PM -0700, Tong Li wrote:
> This patch attempts to improve CFS's SMP global fairness based on the new 
> virtual time design.
> 
> Removed vruntime adjustment in set_task_cpu() as it skews global fairness.
> 
> Modified small_imbalance logic in find_busiest_group(). If there's a small 
> imbalance, move tasks from busiest to local sched_group only if the local 
> group contains a CPU whose min_vruntime is the maximum among all CPUs in 
> the same sched_domain. This prevents any CPU from advancing too far ahead 
> in virtual time and avoids tasks thrashing between two CPUs without 
> utilizing other CPUs in the system. For example, for 10 tasks on 8 CPUs, 
> since the load is not evenly divisible by the number of CPUs, we want the 
> extra load to have a fair use of every CPU in the system.
> 
> Tested with a microbenchmark running 10 nice-0 tasks on 8 CPUs. Each task 

Just as an experiment, can you run 82 tasks on 8 CPUs. Current
imbalance_pct logic will not detect and fix the global fairness issue
even with this patch.

>  	if (*imbalance < busiest_load_per_task) {
> -		unsigned long tmp, pwr_now, pwr_move;
> -		unsigned int imbn;
> -
>  small_imbalance:
> -		pwr_move = pwr_now = 0;
> -		imbn = 2;
> -		if (this_nr_running) {
> -			this_load_per_task /= this_nr_running;
> -			if (busiest_load_per_task > this_load_per_task)
> -				imbn = 1;
> -		} else
> -			this_load_per_task = SCHED_LOAD_SCALE;
> -
> -		if (max_load - this_load + SCHED_LOAD_SCALE_FUZZ >=
> -					busiest_load_per_task * imbn) {
> -			*imbalance = busiest_load_per_task;
> -			return busiest;
> -		}

This patch removes quite a few lines, and all this logic is not for
fairness :( Especially the above portion handles some of the HT/MC
optimizations.

> -
> -		/*
> -		 * OK, we don't have enough imbalance to justify moving 
> tasks,
> -		 * however we may be able to increase total CPU power used by
> -		 * moving them.
> +		/* 
> +		 * When there's small imbalance, move tasks only if this
> +		 * sched_group contains a CPU whose min_vruntime is the 
> +		 * maximum among all CPUs in the same domain.
>  		 */
> -
> -		pwr_now += busiest->__cpu_power *
> -				min(busiest_load_per_task, max_load);
> -		pwr_now += this->__cpu_power *
> -				min(this_load_per_task, this_load);
> -		pwr_now /= SCHED_LOAD_SCALE;
> -
> -		/* Amount of load we'd subtract */
> -		tmp = sg_div_cpu_power(busiest,
> -				busiest_load_per_task * SCHED_LOAD_SCALE);
> -		if (max_load > tmp)
> -			pwr_move += busiest->__cpu_power *
> -				min(busiest_load_per_task, max_load - tmp);
> -
> -		/* Amount of load we'd add */
> -		if (max_load * busiest->__cpu_power <
> -				busiest_load_per_task * SCHED_LOAD_SCALE)
> -			tmp = sg_div_cpu_power(this,
> -					max_load * busiest->__cpu_power);
> -		else
> -			tmp = sg_div_cpu_power(this,
> -				busiest_load_per_task * SCHED_LOAD_SCALE);
> -		pwr_move += this->__cpu_power *
> -				min(this_load_per_task, this_load + tmp);
> -		pwr_move /= SCHED_LOAD_SCALE;
> -
> -		/* Move if we gain throughput */
> -		if (pwr_move > pwr_now)
> +		if (max_vruntime_group == this)
>  			*imbalance = busiest_load_per_task;
> +		else
> +			*imbalance = 0;

Not sure how this all interacts when some of the CPUs are idle. I have to
look more closely.

thanks,
suresh


* Re: [git] CFS-devel, group scheduler, fixes
  2007-09-19 19:35     ` Siddha, Suresh B
@ 2007-09-19 20:58       ` Tong Li
  0 siblings, 0 replies; 37+ messages in thread
From: Tong Li @ 2007-09-19 20:58 UTC (permalink / raw)
  To: Siddha, Suresh B
  Cc: Tong Li, Ingo Molnar, dimm, linux-kernel, Srivatsa Vaddagiri,
	Peter Zijlstra, Mike Galbraith

On Wed, 19 Sep 2007, Siddha, Suresh B wrote:

> On Tue, Sep 18, 2007 at 11:03:59PM -0700, Tong Li wrote:
>> This patch attempts to improve CFS's SMP global fairness based on the new
>> virtual time design.
>>
>> Removed vruntime adjustment in set_task_cpu() as it skews global fairness.
>>
>> Modified small_imbalance logic in find_busiest_group(). If there's a small
>> imbalance, move tasks from busiest to local sched_group only if the local
>> group contains a CPU whose min_vruntime is the maximum among all CPUs in
>> the same sched_domain. This prevents any CPU from advancing too far ahead
>> in virtual time and avoids tasks thrashing between two CPUs without
>> utilizing other CPUs in the system. For example, for 10 tasks on 8 CPUs,
>> since the load is not evenly divisible by the number of CPUs, we want the
>> extra load to have a fair use of every CPU in the system.
>>
>> Tested with a microbenchmark running 10 nice-0 tasks on 8 CPUs. Each task
>
> Just as an experiment, can you run 82 tasks on 8 CPUs. Current
> imbalance_pct logic will not detect and fix the global fairness issue
> even with this patch.

Right. For 82 tasks on 8 CPUs, the max error is 7.07% and min is -14.12%. 
It could be fixed by removing the imbalance_pct check (i.e., making 
imbalance_pct effectively 1). The cost would be more migrations, so I 
wasn't sure if that was the approach people would agree on.

>
>>  	if (*imbalance < busiest_load_per_task) {
>> -		unsigned long tmp, pwr_now, pwr_move;
>> -		unsigned int imbn;
>> -
>>  small_imbalance:
>> -		pwr_move = pwr_now = 0;
>> -		imbn = 2;
>> -		if (this_nr_running) {
>> -			this_load_per_task /= this_nr_running;
>> -			if (busiest_load_per_task > this_load_per_task)
>> -				imbn = 1;
>> -		} else
>> -			this_load_per_task = SCHED_LOAD_SCALE;
>> -
>> -		if (max_load - this_load + SCHED_LOAD_SCALE_FUZZ >=
>> -					busiest_load_per_task * imbn) {
>> -			*imbalance = busiest_load_per_task;
>> -			return busiest;
>> -		}
>
> This patch removes quite a few lines, and all this logic is not for
> fairness :( Especially the above portion handles some of the HT/MC
> optimizations.

What are the optimizations? I was trying to simplify the code to use only 
vruntime to control balancing when there's a small imbalance, but if it 
breaks non-fairness related optimizations, we can certainly add the code 
back.

   tong


* Re: [git] CFS-devel, group scheduler, fixes
  2007-09-19 17:06           ` Tong Li
@ 2007-09-20  4:55             ` Mike Galbraith
  2007-09-20  7:15               ` Mike Galbraith
  0 siblings, 1 reply; 37+ messages in thread
From: Mike Galbraith @ 2007-09-20  4:55 UTC (permalink / raw)
  To: Tong Li; +Cc: Ingo Molnar, dimm, linux-kernel, Srivatsa Vaddagiri,
	Peter Zijlstra

On Wed, 2007-09-19 at 10:06 -0700, Tong Li wrote:

> Were the experiments run on a 2-CPU system?

Yes.

>  When Xorg experiences large 
> wait time, is it on the same CPU that has the two pinned tasks? If this is 
> the case, then the problem could be that X somehow advanced faster and got a 
> larger vruntime than the two pinned tasks, so it had to wait for the 
> pinned ones to catch up before it got a chance to be scheduled.

Good question.  I've not yet been able to capture any event where
vruntimes are that far apart in sched_debug.

	-Mike



* Re: [git] CFS-devel, group scheduler, fixes
  2007-09-20  4:55             ` Mike Galbraith
@ 2007-09-20  7:15               ` Mike Galbraith
  2007-09-20  7:51                 ` Ingo Molnar
  2007-09-20 19:48                 ` Willy Tarreau
  0 siblings, 2 replies; 37+ messages in thread
From: Mike Galbraith @ 2007-09-20  7:15 UTC (permalink / raw)
  To: Tong Li; +Cc: Ingo Molnar, dimm, linux-kernel, Srivatsa Vaddagiri,
	Peter Zijlstra

On Thu, 2007-09-20 at 06:55 +0200, Mike Galbraith wrote:
> On Wed, 2007-09-19 at 10:06 -0700, Tong Li wrote:
> 
> > Were the experiments run on a 2-CPU system?
> 
> Yes.
> 
> >  When Xorg experiences large 
> > wait time, is it on the same CPU that has the two pinned tasks? If this is 
> > the case, then the problem could be that X somehow advanced faster and got a 
> > larger vruntime than the two pinned tasks, so it had to wait for the 
> > pinned ones to catch up before it got a chance to be scheduled.
> 
> Good question.  I've not yet been able to capture any event where
> vruntimes are that far apart in sched_debug.

But, I did just manage to trigger some horrid behavior, and log it.  I
modified the kernel to print task's actual tree key instead of their
current vruntime, and was watching that while make -j2 was running (and
not seeing anything very interesting), when on a lark, I restarted
SuSE's system updater thingy.  That beast chews 100% CPU for so long at
startup that I long ago got annoyed, and changed it to run at nice 19.
Anyway, when it started, interactivity went to hell in the proverbial
hand-basket, and the sched_debug log shows some interesting results..
like spread0 hitting -13659412644, and cc1 being keyed at -3867063305.

cpu#0, 2992.608 MHz
  .nr_running                    : 4
  .load                          : 4096
  .nr_switches                   : 1105882
  .nr_load_updates               : 735146
  .nr_uninterruptible            : 4294966399
  .jiffies                       : 1936201
  .next_balance                  : 1936276
  .curr->pid                     : 27004
  .clock                         : 735034802877
  .idle_clock                    : 0
  .prev_clock_raw                : 2259213083886
  .clock_warps                   : 0
  .clock_overflows               : 677631
  .clock_deep_idle_events        : 0
  .clock_max_delta               : 999848
  .cpu_load[0]                   : 4096
  .cpu_load[1]                   : 3960
  .cpu_load[2]                   : 3814
  .cpu_load[3]                   : 3539
  .cpu_load[4]                   : 3153

cfs_rq
  .exec_clock                    : 623175870707
  .MIN_vruntime                  : 3791173262297
  .min_vruntime                  : 3791173259723
  .max_vruntime                  : 3791173264579
  .spread                        : 2282
  .spread0                       : 0
  .nr_sync_min_vruntime          : 296756

runnable tasks:
            task   PID        tree-key  switches  prio    exec-runtime        sum-exec       sum-sleep
----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
            bash  6338            3212           554   120   3791173262935       104271360    645224098563
R            cc1 27004        -7509103             6   120   3791165750620        15396029        13300466
            make 27006            4856             6   120   3791173264579         9540879         6458208
            make 27010            2574             3   120   3791173262297         8897680               0

cpu#1, 2992.608 MHz
  .nr_running                    : 5
  .load                          : 6208
  .nr_switches                   : 1012995
  .nr_load_updates               : 747540
  .nr_uninterruptible            : 897
  .jiffies                       : 1936201
  .next_balance                  : 1936303
  .curr->pid                     : 27012
  .clock                         : 747426624818
  .idle_clock                    : 0
  .prev_clock_raw                : 2259213140042
  .clock_warps                   : 0
  .clock_overflows               : 582091
  .clock_deep_idle_events        : 0
  .clock_max_delta               : 999848
  .cpu_load[0]                   : 6208
  .cpu_load[1]                   : 4545
  .cpu_load[2]                   : 3810
  .cpu_load[3]                   : 3065
  .cpu_load[4]                   : 2397

cfs_rq
  .exec_clock                    : 581940255731
  .MIN_vruntime                  : 3791835890838
  .min_vruntime                  : 3791854778322
  .max_vruntime                  : 3791902652145
  .spread                        : 66761307
  .spread0                       : 681518599
  .nr_sync_min_vruntime          : 256743

runnable tasks:
            task   PID        tree-key  switches  prio    exec-runtime        sum-exec       sum-sleep
----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
            Xorg  5740       -18887484        157025   115   3791835890838        46597617    638567698716
   update-status 26093        48184396          1537   139   3791902652145      5709334592      2622161813
              sh 27011        -6808461             2   120   3791847969861         4162855         1457101
R            cat 27012       -19965525             3   120   3791836175385         1756456         1974618
            bash 27013            1498             2   120   3791854779820          394907               0

Sched Debug Version: v0.05-v20, 2.6.23-rc6-smp-d #43
now at 2237384024289 nsecs

cpu#0, 2992.608 MHz
  .nr_running                    : 2
  .load                          : 1039
  .nr_switches                   : 1106019
  .nr_load_updates               : 736329
  .nr_uninterruptible            : 4294966399
  .jiffies                       : 1937383
  .next_balance                  : 1937428
  .curr->pid                     : 27087
  .clock                         : 736217427466
  .idle_clock                    : 0
  .prev_clock_raw                : 2260395464994
  .clock_warps                   : 0
  .clock_overflows               : 678147
  .clock_deep_idle_events        : 0
  .clock_max_delta               : 999848
  .cpu_load[0]                   : 1039
  .cpu_load[1]                   : 1039
  .cpu_load[2]                   : 1039
  .cpu_load[3]                   : 1039
  .cpu_load[4]                   : 1039

cfs_rq
  .exec_clock                    : 624358414291
  .MIN_vruntime                  : 3806531640151
  .min_vruntime                  : 3806501788119
  .max_vruntime                  : 3806531640151
  .spread                        : 0
  .spread0                       : 0
  .nr_sync_min_vruntime          : 296782

runnable tasks:
            task   PID        tree-key  switches  prio    exec-runtime        sum-exec       sum-sleep
----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
   update-status 26093        56810274          1572   139   3806531640151      5929003154      2622161813
R            cc1 27087               0            13   120   3806501788119       403581890        15024992

cpu#1, 2992.608 MHz
  .nr_running                    : 2
  .load                          : 2048
  .nr_switches                   : 1013933
  .nr_load_updates               : 748722
  .nr_uninterruptible            : 898
  .jiffies                       : 1937383
  .next_balance                  : 1937455
  .curr->pid                     : 27089
  .clock                         : 748608445154
  .idle_clock                    : 0
  .prev_clock_raw                : 2260394709751
  .clock_warps                   : 0
  .clock_overflows               : 582583
  .clock_deep_idle_events        : 0
  .clock_max_delta               : 999848
  .cpu_load[0]                   : 2048
  .cpu_load[1]                   : 5016
  .cpu_load[2]                   : 6366
  .cpu_load[3]                   : 7081
  .cpu_load[4]                   : 6895

cfs_rq
  .exec_clock                    : 583122024084
  .MIN_vruntime                  : 3806331397540
  .min_vruntime                  : 3806350696772
  .max_vruntime                  : 3806331397540
  .spread                        : 0
  .spread0                       : -151091347
  .nr_sync_min_vruntime          : 256756

runnable tasks:
            task   PID        tree-key  switches  prio    exec-runtime        sum-exec       sum-sleep
----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
         konsole  6330       -19299232        124476   120   3806331397540     17086991675    615801753460
R            cat 27089       -19513636             2   120   3806331713718         1163204         7839421

Sched Debug Version: v0.05-v20, 2.6.23-rc6-smp-d #43
now at 2238396055118 nsecs

cpu#0, 2992.608 MHz
  .nr_running                    : 4
  .load                          : 3087
  .nr_switches                   : 1106671
  .nr_load_updates               : 737341
  .nr_uninterruptible            : 4294966395
  .jiffies                       : 1938395
  .next_balance                  : 1938412
  .curr->pid                     : 27115
  .clock                         : 737229273643
  .idle_clock                    : 0
  .prev_clock_raw                : 2261407091960
  .clock_warps                   : 0
  .clock_overflows               : 678565
  .clock_deep_idle_events        : 0
  .clock_max_delta               : 999848
  .cpu_load[0]                   : 3087
  .cpu_load[1]                   : 3055
  .cpu_load[2]                   : 2723
  .cpu_load[3]                   : 2169
  .cpu_load[4]                   : 1698

cfs_rq
  .exec_clock                    : 625370185925
  .MIN_vruntime                  : 3833041200497
  .min_vruntime                  : 3833041199517
  .max_vruntime                  : 3833103226422
  .spread                        : 62025925
  .spread0                       : 0
  .nr_sync_min_vruntime          : 296798

runnable tasks:
            task   PID        tree-key  switches  prio    exec-runtime        sum-exec       sum-sleep
----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
            bash  6334             980          4556   120   3833041200497       565634278    646894253582
   update-status 26093        66364651          1667   139   3833103226422      6318237803      2622161813
             cc1 27112         1387807            22   120   3833042327482       242204855        12705113
R            cat 27115       -19905217             2   120   3833023008660         1844735         3670718

cpu#1, 2992.608 MHz
  .nr_running                    : 2
  .load                          : 4145
  .nr_switches                   : 1014099
  .nr_load_updates               : 749734
  .nr_uninterruptible            : 901
  .jiffies                       : 1938395
  .next_balance                  : 1938479
  .curr->pid                     : 27084
  .clock                         : 749620291330
  .idle_clock                    : 0
  .prev_clock_raw                : 2261406342542
  .clock_warps                   : 0
  .clock_overflows               : 582991
  .clock_deep_idle_events        : 0
  .clock_max_delta               : 999848
  .cpu_load[0]                   : 4145
  .cpu_load[1]                   : 4145
  .cpu_load[2]                   : 4136
  .cpu_load[3]                   : 3934
  .cpu_load[4]                   : 3342

cfs_rq
  .exec_clock                    : 584133842523
  .MIN_vruntime                  : 3819361786873
  .min_vruntime                  : 3819381786873
  .max_vruntime                  : 3819361786873
  .spread                        : 0
  .spread0                       : -13659412644
  .nr_sync_min_vruntime          : 256768

runnable tasks:
            task   PID        tree-key  switches  prio    exec-runtime        sum-exec       sum-sleep
----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
          kacpid    73       -20000000          1073   115   3819361786873       489384092    743825068255
R            cc1 27084     -3867063305           138   120   3815518715812      1205052524        24535558




* Re: [git] CFS-devel, group scheduler, fixes
  2007-09-20  7:15               ` Mike Galbraith
@ 2007-09-20  7:51                 ` Ingo Molnar
  2007-09-20  8:11                   ` Mike Galbraith
  2007-09-20 19:48                 ` Willy Tarreau
  1 sibling, 1 reply; 37+ messages in thread
From: Ingo Molnar @ 2007-09-20  7:51 UTC (permalink / raw)
  To: Mike Galbraith
  Cc: Tong Li, dimm, linux-kernel, Srivatsa Vaddagiri, Peter Zijlstra


* Mike Galbraith <efault@gmx.de> wrote:

> [...]  I modified the kernel to print task's actual tree key instead 
> of their current vruntime, [...]

btw., that looks like a debug printout bug in sched-devel.git - could 
you send me your fix? I've pushed out the latest sched-devel (ontop of 
-rc7) to the usual place:

   git://git.kernel.org/pub/scm/linux/kernel/git/mingo/linux-2.6-sched.git

(note that there are some debug printout updates which might clash with 
your fix)

	Ingo


* Re: [git] CFS-devel, group scheduler, fixes
  2007-09-20  7:51                 ` Ingo Molnar
@ 2007-09-20  8:11                   ` Mike Galbraith
  2007-09-22  3:27                     ` Tong Li
  0 siblings, 1 reply; 37+ messages in thread
From: Mike Galbraith @ 2007-09-20  8:11 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Tong Li, dimm, linux-kernel, Srivatsa Vaddagiri, Peter Zijlstra

On Thu, 2007-09-20 at 09:51 +0200, Ingo Molnar wrote:
> * Mike Galbraith <efault@gmx.de> wrote:
> 
> > [...]  I modified the kernel to print task's actual tree key instead 
> > of their current vruntime, [...]
> 
> btw., that looks like a debug printout bug in sched-devel.git - could 
> you send me your fix? I've pushed out the latest sched-devel (ontop of 
> -rc7) to the usual place:

Well, I temporarily added a key field, which is only used to store the
key for debug...

>    git://git.kernel.org/pub/scm/linux/kernel/git/mingo/linux-2.6-sched.git
> 
> (note that there are some debug printout updates which might clash with 
> your fix)

I'll pull/build this and test my migration problem there.  All I have to
do is to add a nice 19 chew-max to my make -j2, and all hell breaks
loose.  Always suspecting myself first in the search for dirty rotten
SOB who broke local scheduler :) I nuked the migration fix.  Looks like
I'm not the SOB this time.. it got much worse, with max Xorg latency of
>500ms.

	-Mike



* Re: [git] CFS-devel, group scheduler, fixes
  2007-09-20  7:15               ` Mike Galbraith
  2007-09-20  7:51                 ` Ingo Molnar
@ 2007-09-20 19:48                 ` Willy Tarreau
  2007-09-21  2:40                   ` Mike Galbraith
  1 sibling, 1 reply; 37+ messages in thread
From: Willy Tarreau @ 2007-09-20 19:48 UTC (permalink / raw)
  To: Mike Galbraith
  Cc: Tong Li, Ingo Molnar, dimm, linux-kernel, Srivatsa Vaddagiri,
	Peter Zijlstra

On Thu, Sep 20, 2007 at 09:15:07AM +0200, Mike Galbraith wrote:
> But, I did just manage to trigger some horrid behavior, and log it.  I
> modified the kernel to print task's actual tree key instead of their
> current vruntime, and was watching that while make -j2 was running (and
> not seeing anything very interesting), when on a lark, I restarted
> SuSE's system updater thingy.  That beast chews 100% CPU for so long at
> startup that I long ago got annoyed, and changed it to run at nice 19.
> Anyway, when it started, interactivity went to hell in the proverbial
> hand-basket, and the sched_debug log shows some interesting results..
> like spread0 hitting -13659412644, and cc1 being keyed at -3867063305.
> 
> cpu#0, 2992.608 MHz
>   .nr_running                    : 4
>   .load                          : 4096
>   .nr_switches                   : 1105882
>   .nr_load_updates               : 735146
>   .nr_uninterruptible            : 4294966399

(...)

> cpu#1, 2992.608 MHz
>   .nr_running                    : 5
>   .load                          : 6208
>   .nr_switches                   : 1012995
>   .nr_load_updates               : 747540
>   .nr_uninterruptible            : 897

I don't know if this is relevant, but 4294966399 in nr_uninterruptible
for cpu#0 equals -897, exactly the negation of cpu1.nr_uninterruptible.
I don't know if this rings a bell for someone or if it's a completely
useless comment, but just in case...

HTH,
Willy



* Re: [git] CFS-devel, group scheduler, fixes
  2007-09-20 19:48                 ` Willy Tarreau
@ 2007-09-21  2:40                   ` Mike Galbraith
  2007-09-21  3:11                     ` Willy Tarreau
  0 siblings, 1 reply; 37+ messages in thread
From: Mike Galbraith @ 2007-09-21  2:40 UTC (permalink / raw)
  To: Willy Tarreau
  Cc: Tong Li, Ingo Molnar, dimm, linux-kernel, Srivatsa Vaddagiri,
	Peter Zijlstra

On Thu, 2007-09-20 at 21:48 +0200, Willy Tarreau wrote:

> I don't know if this is relevant, but 4294966399 in nr_uninterruptible
> for cpu#0 equals -897, exactly the negation of cpu1.nr_uninterruptible.
> I don't know if this rings a bell for someone or if it's a completely
> useless comment, but just in case...

A task can block on one cpu, and wake up on another, which isn't
tracked, hence the fishy looking numbers.  The true nr_uninterruptible
is the sum of all.

	-Mike



* Re: [git] CFS-devel, group scheduler, fixes
  2007-09-21  2:40                   ` Mike Galbraith
@ 2007-09-21  3:11                     ` Willy Tarreau
  0 siblings, 0 replies; 37+ messages in thread
From: Willy Tarreau @ 2007-09-21  3:11 UTC (permalink / raw)
  To: Mike Galbraith
  Cc: Tong Li, Ingo Molnar, dimm, linux-kernel, Srivatsa Vaddagiri,
	Peter Zijlstra

On Fri, Sep 21, 2007 at 04:40:55AM +0200, Mike Galbraith wrote:
> On Thu, 2007-09-20 at 21:48 +0200, Willy Tarreau wrote:
> 
> > I don't know if this is relevant, but 4294966399 in nr_uninterruptible
> > for cpu#0 equals -897, exactly the negation of cpu1.nr_uninterruptible.
> > I don't know if this rings a bell for someone or if it's a completely
> > useless comment, but just in case...
> 
> A task can block on one cpu, and wake up on another, which isn't
> tracked, hence the fishy looking numbers.  The true nr_uninterruptible
> is the sum of all.

OK, thanks Mike for this clarification.

Willy



* Re: [git] CFS-devel, group scheduler, fixes
  2007-09-20  8:11                   ` Mike Galbraith
@ 2007-09-22  3:27                     ` Tong Li
  2007-09-22 10:01                       ` Mike Galbraith
  0 siblings, 1 reply; 37+ messages in thread
From: Tong Li @ 2007-09-22  3:27 UTC (permalink / raw)
  To: Mike Galbraith
  Cc: Ingo Molnar, Tong Li, dimm, linux-kernel, Srivatsa Vaddagiri,
	Peter Zijlstra

Mike,

Could you try this patch to see if it solves the latency problem?

   tong

Changes:

1. Modified vruntime adjustment logic in set_task_cpu(). See comments in 
code. This fixed the negative vruntime problem.

2. This code in update_curr() seems to be wrong:

if (unlikely(!curr))
 	return sync_vruntime(cfs_rq);

cfs_rq->curr can be NULL even if cfs_rq->nr_running is non-zero (e.g., 
when an RT task is running). We only want to call sync_vruntime when 
cfs_rq->nr_running is 0. This fixed the large latency problem (at least in 
my tests).

Signed-off-by: Tong Li <tong.n.li@intel.com>
---
diff -uprN linux-2.6-sched-devel-orig/kernel/sched.c linux-2.6-sched-devel/kernel/sched.c
--- linux-2.6-sched-devel-orig/kernel/sched.c	2007-09-20 12:15:41.000000000 -0700
+++ linux-2.6-sched-devel/kernel/sched.c	2007-09-21 19:40:08.000000000 -0700
@@ -1033,9 +1033,20 @@ void set_task_cpu(struct task_struct *p,
  	if (p->se.block_start)
  		p->se.block_start -= clock_offset;
  #endif
-	if (likely(new_rq->cfs.min_vruntime))
-		p->se.vruntime -= old_rq->cfs.min_vruntime -
-						new_rq->cfs.min_vruntime;
+	/* 
+	 * Reset p's vruntime if it moves to new_cpu whose min_vruntime is 
+	 * 100,000,000 (equivalent to 100ms for nice-0 tasks) larger or
+	 * smaller than p's vruntime. This improves interactivity when
+	 * pinned and unpinned tasks co-exist. For example, pinning a few
+	 * tasks to a CPU can cause its min_vruntime much smaller than the
+	 * other CPUs. If a task moves to this CPU, its vruntime can be so
+	 * large it won't be scheduled until the locally pinned tasks'
+	 * vruntimes catch up, causing large delays.
+	 */
+	if (unlikely(old_cpu != new_cpu && p->se.vruntime && 
+		(p->se.vruntime > new_rq->cfs.min_vruntime + 100000000 || 
+		 p->se.vruntime + 100000000 < new_rq->cfs.min_vruntime)))
+		p->se.vruntime = new_rq->cfs.min_vruntime;

  	__set_task_cpu(p, new_cpu);
  }
@@ -1599,6 +1610,7 @@ static void __sched_fork(struct task_str
  	p->se.exec_start		= 0;
  	p->se.sum_exec_runtime		= 0;
  	p->se.prev_sum_exec_runtime	= 0;
+	p->se.vruntime			= 0;

  #ifdef CONFIG_SCHEDSTATS
  	p->se.wait_start		= 0;
diff -uprN linux-2.6-sched-devel-orig/kernel/sched_fair.c linux-2.6-sched-devel/kernel/sched_fair.c
--- linux-2.6-sched-devel-orig/kernel/sched_fair.c	2007-09-20 12:15:41.000000000 -0700
+++ linux-2.6-sched-devel/kernel/sched_fair.c	2007-09-21 17:23:09.000000000 -0700
@@ -306,8 +306,10 @@ static void update_curr(struct cfs_rq *c
  	u64 now = rq_of(cfs_rq)->clock;
  	unsigned long delta_exec;

-	if (unlikely(!curr))
+	if (unlikely(!cfs_rq->nr_running))
  		return sync_vruntime(cfs_rq);
+	if (unlikely(!curr))
+		return;

  	/*
  	 * Get the amount of time the current task was running


* Re: [git] CFS-devel, group scheduler, fixes
  2007-09-22  3:27                     ` Tong Li
@ 2007-09-22 10:01                       ` Mike Galbraith
  2007-09-23  7:14                         ` Mike Galbraith
  0 siblings, 1 reply; 37+ messages in thread
From: Mike Galbraith @ 2007-09-22 10:01 UTC (permalink / raw)
  To: Tong Li; +Cc: Ingo Molnar, dimm, linux-kernel, Srivatsa Vaddagiri,
	Peter Zijlstra

On Fri, 2007-09-21 at 20:27 -0700, Tong Li wrote:
> Mike,
> 
> Could you try this patch to see if it solves the latency problem?

No, but it helps some when running two un-pinned busy loops, one at nice
0, and the other at nice 19.  Yesterday I hit latencies of up to 1.2
_seconds_ doing this, and logging sched_debug and /proc/`pidof
Xorg`/sched from SCHED_RR shells.

se.wait_max              :           164.242748
se.wait_max              :           121.996128
se.wait_max              :           194.464773
se.wait_max              :           517.425411
se.wait_max              :           131.453214
se.wait_max              :           122.984190
se.wait_max              :           111.729274
se.wait_max              :           119.567580
se.wait_max              :           126.980696
se.wait_max              :           177.241452
se.wait_max              :           129.604329
se.wait_max              :           119.631657

It doesn't help at all with this oddity while running same:

root@Homer: now is
the time
foooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooor aall good men to come to the aid of their country

That was a nice 0 shell window.  I'm not a great typist, but I ain't
_that_ bad :)

	-Mike



* Re: [git] CFS-devel, group scheduler, fixes
  2007-09-22 10:01                       ` Mike Galbraith
@ 2007-09-23  7:14                         ` Mike Galbraith
  2007-09-23 11:37                           ` Mike Galbraith
                                             ` (2 more replies)
  0 siblings, 3 replies; 37+ messages in thread
From: Mike Galbraith @ 2007-09-23  7:14 UTC (permalink / raw)
  To: Tong Li; +Cc: Ingo Molnar, dimm, linux-kernel, Srivatsa Vaddagiri,
	Peter Zijlstra

[-- Attachment #1: Type: text/plain, Size: 2053 bytes --]

On Sat, 2007-09-22 at 12:01 +0200, Mike Galbraith wrote:
> On Fri, 2007-09-21 at 20:27 -0700, Tong Li wrote:
> > Mike,
> > 
> > Could you try this patch to see if it solves the latency problem?
> 
> No, but it helps some when running two un-pinned busy loops, one at nice
> 0, and the other at nice 19.  Yesterday I hit latencies of up to 1.2
> _seconds_ doing this, and logging sched_debug and /proc/`pidof
> Xorg`/sched from SCHED_RR shells.

Looking at a log (snippet attached) from this morning with the last hunk
of your patch still intact, it looks like min_vruntime is being modified
after a task is queued.  If you look at the snippet, you'll see the nice
19 bash busy loop on CPU1 with a vruntime of 3010385.345325, and one
second later on CPU1 with its vruntime at 2814952.425082, but
min_vruntime is 3061874.838356.

I took a hammer to it, and my latencies running this test went away.

diff -uprNX /root/dontdiff git/linux-2.6.sched-devel/kernel/sched_fair.c linux-2.6.23-rc7.d/kernel/sched_fair.c
--- git/linux-2.6.sched-devel/kernel/sched_fair.c	2007-09-22 13:37:32.000000000 +0200
+++ linux-2.6.23-rc7.d/kernel/sched_fair.c	2007-09-23 08:29:38.000000000 +0200
@@ -290,14 +290,19 @@ __update_curr(struct cfs_rq *cfs_rq, str
 static void sync_vruntime(struct cfs_rq *cfs_rq)
 {
 	struct rq *rq;
-	int cpu;
+	int cpu, wrote = 0;
 
 	for_each_online_cpu(cpu) {
 		rq = &per_cpu(runqueues, cpu);
+		if (spin_is_locked(&rq->lock))
+			continue;
+		smp_wmb();
 		cfs_rq->min_vruntime = max_vruntime(cfs_rq->min_vruntime,
 				rq->cfs.min_vruntime);
+		wrote++;
 	}
-	schedstat_inc(cfs_rq, nr_sync_min_vruntime);
+	if (wrote)
+		schedstat_inc(cfs_rq, nr_sync_min_vruntime);
 }
 
 static void update_curr(struct cfs_rq *cfs_rq)
@@ -306,8 +311,10 @@ static void update_curr(struct cfs_rq *c
 	u64 now = rq_of(cfs_rq)->clock;
 	unsigned long delta_exec;
 
-	if (unlikely(!curr))
+	if (unlikely(!cfs_rq->nr_running))
 		return sync_vruntime(cfs_rq);
+	if (unlikely(!curr))
+		return;
 
 	/*
 	 * Get the amount of time the current task was running


[-- Attachment #2: sched_debug.save2 --]
[-- Type: text/plain, Size: 6719 bytes --]

Sched Debug Version: v0.05-v20, 2.6.23-rc7-smp-d #57
now at 682203.468012 msecs

cpu#0, 2992.611 MHz
  .nr_running                    : 1
  .load                          : 1024
  .nr_switches                   : 196616
  .nr_load_updates               : 118674
  .nr_uninterruptible            : 4294966769
  .jiffies                       : 382203
  .next_balance                  : 0.382295
  .curr->pid                     : 6322
  .clock                         : 118656.313628
  .idle_clock                    : 0.000000
  .prev_clock_raw                : 705696.529916
  .clock_warps                   : 0
  .clock_overflows               : 171665
  .clock_deep_idle_events        : 0
  .clock_max_delta               : 0.999848
  .cpu_load[0]                   : 178561
  .cpu_load[1]                   : 134173
  .cpu_load[2]                   : 78694
  .cpu_load[3]                   : 42634
  .cpu_load[4]                   : 22523

cfs_rq
  .exec_clock                    : 71557.235360
  .MIN_vruntime                  : 0.000001
  .min_vruntime                  : 2798832.982741
  .max_vruntime                  : 0.000001
  .spread                        : 0.000000
  .spread0                       : 0.000000
  .nr_sync_min_vruntime          : 76120

runnable tasks:
            task   PID         tree-key  switches  prio     exec-runtime         sum-exec        sum-sleep
----------------------------------------------------------------------------------------------------------
R           bash  6322   2798832.982741       603   120   2798832.982741      4423.956624     18419.152603

cpu#1, 2992.611 MHz
  .nr_running                    : 2
  .load                          : 177537
  .nr_switches                   : 154505
  .nr_load_updates               : 121679
  .nr_uninterruptible            : 527
  .jiffies                       : 382203
  .next_balance                  : 0.382234
  .curr->pid                     : 6633
  .clock                         : 121660.572179
  .idle_clock                    : 0.000000
  .prev_clock_raw                : 705696.537576
  .clock_warps                   : 0
  .clock_overflows               : 127876
  .clock_deep_idle_events        : 0
  .clock_max_delta               : 0.999848
  .cpu_load[0]                   : 177537
  .cpu_load[1]                   : 155347
  .cpu_load[2]                   : 102646
  .cpu_load[3]                   : 58613
  .cpu_load[4]                   : 31265

cfs_rq
  .exec_clock                    : 32989.995528
  .MIN_vruntime                  : 3010385.345325
  .min_vruntime                  : 3010385.345325
  .max_vruntime                  : 3010385.345325
  .spread                        : 0.000000
  .spread0                       : 211552.362584
  .nr_sync_min_vruntime          : 62486

runnable tasks:
            task   PID         tree-key  switches  prio     exec-runtime         sum-exec        sum-sleep
----------------------------------------------------------------------------------------------------------
            bash  6324   3010385.345325       448   139   3010385.345325      4089.662816     18698.650954
R            cat  6633   2771173.427733         1    98   2771173.427733         0.099100         0.000000

Sched Debug Version: v0.05-v20, 2.6.23-rc7-smp-d #57
now at 683208.774593 msecs

cpu#0, 2992.611 MHz
  .nr_running                    : 3
  .load                          : 4160
  .nr_switches                   : 197117
  .nr_load_updates               : 119679
  .nr_uninterruptible            : 4294966769
  .jiffies                       : 383208
  .next_balance                  : 0.383306
  .curr->pid                     : 6324
  .clock                         : 119661.418716
  .idle_clock                    : 0.000000
  .prev_clock_raw                : 706701.424587
  .clock_warps                   : 0
  .clock_overflows               : 171860
  .clock_deep_idle_events        : 0
  .clock_max_delta               : 0.999848
  .cpu_load[0]                   : 181682
  .cpu_load[1]                   : 92921
  .cpu_load[2]                   : 48541
  .cpu_load[3]                   : 26351
  .cpu_load[4]                   : 15382

cfs_rq
  .exec_clock                    : 72539.463238
  .MIN_vruntime                  : 3061874.838356
  .min_vruntime                  : 3061894.838356
  .max_vruntime                  : 3061874.838356
  .spread                        : 0.000000
  .spread0                       : 0.000000
  .nr_sync_min_vruntime          : 76121

runnable tasks:
            task   PID         tree-key  switches  prio     exec-runtime         sum-exec        sum-sleep
----------------------------------------------------------------------------------------------------------
     ksoftirqd/0     4   3061874.838356     10021   115   3061874.838356        20.667212    119265.488607
 hald-addon-stor  4815   3061874.838356      1842   120   3061874.838356        63.581807    108065.895877
R           bash  6324   2814952.425082       516   139   2814952.425082      5069.832468     18698.650954

cpu#1, 2992.611 MHz
  .nr_running                    : 2
  .load                          : 178546
  .nr_switches                   : 154769
  .nr_load_updates               : 122684
  .nr_uninterruptible            : 527
  .jiffies                       : 383208
  .next_balance                  : 0.383218
  .curr->pid                     : 6642
  .clock                         : 122665.419419
  .idle_clock                    : 0.000000
  .prev_clock_raw                : 706701.171511
  .clock_warps                   : 0
  .clock_overflows               : 128064
  .clock_deep_idle_events        : 0
  .clock_max_delta               : 0.999848
  .cpu_load[0]                   : 178546
  .cpu_load[1]                   : 178546
  .cpu_load[2]                   : 128623
  .cpu_load[3]                   : 76639
  .cpu_load[4]                   : 42094

cfs_rq
  .exec_clock                    : 33972.085103
  .MIN_vruntime                  : 3062114.882734
  .min_vruntime                  : 3062114.882734
  .max_vruntime                  : 3062114.882734
  .spread                        : 0.000000
  .spread0                       : 220.044378
  .nr_sync_min_vruntime          : 62486

runnable tasks:
            task   PID         tree-key  switches  prio     exec-runtime         sum-exec        sum-sleep
----------------------------------------------------------------------------------------------------------
            bash  6322   3062114.882734       747   120   3062114.882734      5360.029921     18419.152603
R            cat  6642   2771173.427733         1    98   2771173.427733         0.099097         0.000000


* Re: [git] CFS-devel, group scheduler, fixes
  2007-09-23  7:14                         ` Mike Galbraith
@ 2007-09-23 11:37                           ` Mike Galbraith
       [not found]                           ` <20070923115847.GA13061@elte.hu>
  2007-09-24  6:21                           ` [git] CFS-devel, group scheduler, fixes Tong Li
  2 siblings, 0 replies; 37+ messages in thread
From: Mike Galbraith @ 2007-09-23 11:37 UTC (permalink / raw)
  To: Tong Li; +Cc: Ingo Molnar, dimm, linux-kernel, Srivatsa Vaddagiri,
	Peter Zijlstra

Hi,

I haven't chased down the exact scenario, but using a min_vruntime which
is about to change definitely seems to be what's causing my latency
woes.  Does the below cure your fairness woes as well?

(first bit is just some debug format changes for your convenience if you
try it)

diff -uprNX /root/dontdiff git/linux-2.6.sched-devel/kernel/sched_debug.c linux-2.6.23-rc7.d/kernel/sched_debug.c
--- git/linux-2.6.sched-devel/kernel/sched_debug.c	2007-09-22 13:37:32.000000000 +0200
+++ linux-2.6.23-rc7.d/kernel/sched_debug.c	2007-09-21 12:12:27.000000000 +0200
@@ -67,7 +67,7 @@ print_task(struct seq_file *m, struct rq
 		(long long)(p->nvcsw + p->nivcsw),
 		p->prio);
 #ifdef CONFIG_SCHEDSTATS
-	SEQ_printf(m, "%15Ld.%06ld %15Ld.%06ld %15Ld.%06ld\n",
+	SEQ_printf(m, "%9Ld.%06ld %9Ld.%06ld %9Ld.%06ld\n",
 		SPLIT_NS(p->se.vruntime),
 		SPLIT_NS(p->se.sum_exec_runtime),
 		SPLIT_NS(p->se.sum_sleep_runtime));
@@ -83,10 +83,10 @@ static void print_rq(struct seq_file *m,
 
 	SEQ_printf(m,
 	"\nrunnable tasks:\n"
-	"            task   PID        tree-key  switches  prio"
-	"    exec-runtime        sum-exec       sum-sleep\n"
+	"            task   PID         tree-key  switches  prio"
+	"     exec-runtime         sum-exec        sum-sleep\n"
 	"------------------------------------------------------"
-	"------------------------------------------------");
+	"----------------------------------------------------\n");
 
 	read_lock_irq(&tasklist_lock);
 
@@ -134,7 +134,7 @@ void print_cfs_rq(struct seq_file *m, in
 	spread0 = min_vruntime - rq0_min_vruntime;
 	SEQ_printf(m, "  .%-30s: %Ld.%06ld\n", "spread0",
 			SPLIT_NS(spread0));
-	SEQ_printf(m, "  .%-30s: %ld\n", "spread0",
+	SEQ_printf(m, "  .%-30s: %ld\n", "nr_sync_min_vruntime",
 			cfs_rq->nr_sync_min_vruntime);
 }
 
diff -uprNX /root/dontdiff git/linux-2.6.sched-devel/kernel/sched_fair.c linux-2.6.23-rc7.d/kernel/sched_fair.c
--- git/linux-2.6.sched-devel/kernel/sched_fair.c	2007-09-22 13:37:32.000000000 +0200
+++ linux-2.6.23-rc7.d/kernel/sched_fair.c	2007-09-23 12:53:11.000000000 +0200
@@ -290,14 +290,19 @@ __update_curr(struct cfs_rq *cfs_rq, str
 static void sync_vruntime(struct cfs_rq *cfs_rq)
 {
 	struct rq *rq;
-	int cpu;
+	int cpu, wrote = 0;
 
 	for_each_online_cpu(cpu) {
 		rq = &per_cpu(runqueues, cpu);
+		if (!spin_trylock(&rq->lock))
+			continue;
 		cfs_rq->min_vruntime = max_vruntime(cfs_rq->min_vruntime,
 				rq->cfs.min_vruntime);
+		spin_unlock(&rq->lock);
+		wrote++;
 	}
-	schedstat_inc(cfs_rq, nr_sync_min_vruntime);
+	if (wrote)
+		schedstat_inc(cfs_rq, nr_sync_min_vruntime);
 }
 
 static void update_curr(struct cfs_rq *cfs_rq)
@@ -306,8 +311,10 @@ static void update_curr(struct cfs_rq *c
 	u64 now = rq_of(cfs_rq)->clock;
 	unsigned long delta_exec;
 
-	if (unlikely(!curr))
+	if (unlikely(!cfs_rq->nr_running))
 		return sync_vruntime(cfs_rq);
+	if (unlikely(!curr))
+		return;
 
 	/*
 	 * Get the amount of time the current task was running




* Re: [git] CFS-devel, updates
       [not found]                           ` <20070923115847.GA13061@elte.hu>
@ 2007-09-23 15:53                             ` Mike Galbraith
  0 siblings, 0 replies; 37+ messages in thread
From: Mike Galbraith @ 2007-09-23 15:53 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Tong Li, dimm, linux-kernel, Srivatsa Vaddagiri, Peter Zijlstra

On Sun, 2007-09-23 at 13:58 +0200, Ingo Molnar wrote:

> i'm also wondering - do we even need sync_vruntime() at all? Now that we 
> fix up vruntime while we migrate tasks across rqs, the global notion of 
> vruntime makes less and less of a sense.

Yeah, I was wondering that too, but it does seem to be needed.  If I
make it a noop, and run the pinned tasks with an unpinned make -j2 test,
Xorg at nice -5 still takes hits...

se.wait_max              :            11.513462
se.wait_max              :             6.250911
se.wait_max              :            15.181511
se.wait_max              :           188.963625
se.wait_max              :             4.951569
se.wait_max              :            11.391749

...which begs the question "gee, what happens if I pin a nice 0 hog to
CPU0, and a nice 19 hog to CPU1 with the queues never ever syncing?".

The answer to that one seems to be "Don't even think about it!".  This
needs more investigation... there's gotta be a 32 bit thingie hiding in
there.  I lost control for a few seconds, got it back, then lost it
again forever (sysrq-b worked though, hmm).

	-Mike



* Re: [git] CFS-devel, group scheduler, fixes
  2007-09-23  7:14                         ` Mike Galbraith
  2007-09-23 11:37                           ` Mike Galbraith
       [not found]                           ` <20070923115847.GA13061@elte.hu>
@ 2007-09-24  6:21                           ` Tong Li
  2007-09-24 10:10                             ` Mike Galbraith
  2 siblings, 1 reply; 37+ messages in thread
From: Tong Li @ 2007-09-24  6:21 UTC (permalink / raw)
  To: Mike Galbraith
  Cc: Ingo Molnar, dimm, linux-kernel, Srivatsa Vaddagiri,
	Peter Zijlstra

On Sun, 23 Sep 2007, Mike Galbraith wrote:

> On Sat, 2007-09-22 at 12:01 +0200, Mike Galbraith wrote:
>> On Fri, 2007-09-21 at 20:27 -0700, Tong Li wrote:
>>> Mike,
>>>
>>> Could you try this patch to see if it solves the latency problem?
>>
>> No, but it helps some when running two un-pinned busy loops, one at nice
>> 0, and the other at nice 19.  Yesterday I hit latencies of up to 1.2
>> _seconds_ doing this, and logging sched_debug and /proc/`pidof
>> Xorg`/sched from SCHED_RR shells.
>
> Looking at a log (snippet attached) from this morning with the last hunk
> of your patch still intact, it looks like min_vruntime is being modified
> after a task is queued.  If you look at the snippet, you'll see the nice
> 19 bash busy loop on CPU1 with a vruntime of 3010385.345325, and one
> second later on CPU1 with its vruntime at 2814952.425082, but
> min_vruntime is 3061874.838356.

I think this could be what was happening: between the two seconds, CPU 0 
becomes idle and it pulls the nice 19 task over via pull_task(), which 
calls set_task_cpu(), which changes the task's vruntime to the current 
min_vruntime of CPU 0 (in my patch). Then, after set_task_cpu(), CPU 0 
calls activate_task(), which calls enqueue_task() and in turn 
update_curr(). Now, nr_running on CPU 0 is 0, so sync_vruntime() gets 
called and CPU 0's min_vruntime gets synced to the system max. Thus, the 
nice 19 task now has a vruntime less than CPU 0's min_vruntime. I think 
this can be fixed by adding the following code in set_task_cpu() before we 
adjust p->vruntime:

if (!new_rq->cfs.nr_running)
 	sync_vruntime(new_rq);

> static void sync_vruntime(struct cfs_rq *cfs_rq)
> {
> 	struct rq *rq;
> -	int cpu;
> +	int cpu, wrote = 0;
>
> 	for_each_online_cpu(cpu) {
> 		rq = &per_cpu(runqueues, cpu);
> +		if (spin_is_locked(&rq->lock))
> +			continue;
> +		smp_wmb();
> 		cfs_rq->min_vruntime = max_vruntime(cfs_rq->min_vruntime,
> 				rq->cfs.min_vruntime);
> +		wrote++;
> 	}
> -	schedstat_inc(cfs_rq, nr_sync_min_vruntime);
> +	if (wrote)
> +		schedstat_inc(cfs_rq, nr_sync_min_vruntime);
> }

I think this rq->lock check works because it prevents the above scenario 
(CPU 0 is in pull_task so it must hold the rq lock). But my concern is 
that it may be too conservative, since sync_vruntime is called by 
update_curr, which mostly gets called in enqueue_task() and 
dequeue_task(), both of which are often invoked with the rq lock being 
held. Thus, if we don't allow sync_vruntime when rq lock is held, the sync 
will occur much less frequently and thus weaken cross-CPU fairness.

   tong


* Re: [git] CFS-devel, group scheduler, fixes
  2007-09-24  6:21                           ` [git] CFS-devel, group scheduler, fixes Tong Li
@ 2007-09-24 10:10                             ` Mike Galbraith
  2007-09-24 10:24                               ` Peter Zijlstra
  0 siblings, 1 reply; 37+ messages in thread
From: Mike Galbraith @ 2007-09-24 10:10 UTC (permalink / raw)
  To: Tong Li; +Cc: Ingo Molnar, dimm, linux-kernel, Srivatsa Vaddagiri,
	Peter Zijlstra

On Sun, 2007-09-23 at 23:21 -0700, Tong Li wrote: 
> On Sun, 23 Sep 2007, Mike Galbraith wrote:
> 
> > On Sat, 2007-09-22 at 12:01 +0200, Mike Galbraith wrote:
> >> On Fri, 2007-09-21 at 20:27 -0700, Tong Li wrote:
> >>> Mike,
> >>>
> >>> Could you try this patch to see if it solves the latency problem?
> >>
> >> No, but it helps some when running two un-pinned busy loops, one at nice
> >> 0, and the other at nice 19.  Yesterday I hit latencies of up to 1.2
> >> _seconds_ doing this, and logging sched_debug and /proc/`pidof
> >> Xorg`/sched from SCHED_RR shells.
> >
> > Looking at a log (snippet attached) from this morning with the last hunk
> > of your patch still intact, it looks like min_vruntime is being modified
> > after a task is queued.  If you look at the snippet, you'll see the nice
> > 19 bash busy loop on CPU1 with a vruntime of 3010385.345325, and one
> > second later on CPU1 with its vruntime at 2814952.425082, but
> > min_vruntime is 3061874.838356.
> 
> I think this could be what was happening: between the two seconds, CPU 0 
> becomes idle and it pulls the nice 19 task over via pull_task(), which 
> calls set_task_cpu(), which changes the task's vruntime to the current 
> min_vruntime of CPU 0 (in my patch). Then, after set_task_cpu(), CPU 0 
> calls activate_task(), which calls enqueue_task() and in turn 
> update_curr(). Now, nr_running on CPU 0 is 0, so sync_vruntime() gets 
> called and CPU 0's min_vruntime gets synced to the system max. Thus, the 
> nice 19 task now has a vruntime less than CPU 0's min_vruntime. I think 
> this can be fixed by adding the following code in set_task_cpu() before we 
> adjust p->vruntime:
> 
> if (!new_rq->cfs.nr_running)
>  	sync_vruntime(new_rq);

Hmm.  I can imagine Mondo-Boxen-R-Us folks getting upset with that.
Better would be, like Ingo said, to see if we can toss sync_vruntime(), and
I've been playing with that...

I found something this morning, and as usual, the darn thing turned out
to be dirt simple.  With sync_vruntime() disabled, I found queues with
negative min_vruntime right from boot, and went hunting.  Adding some
instrumentation to set_task_cpu() (annoying consequences), I found the
below:

Legend:
vrun: tasks's vruntime
old: old queue's min_vruntime
new: new queue's min_vruntime
result: what's gonna happen

[   60.214508] kseriod vrun: 1427596999 old: 15070988657 new: 4065818654 res: -9577573004
[  218.274661] konqueror vrun: 342076210254 old: 658982966338 new: 219203403598 res: -97703352486
[  218.284657] init vrun: 411638725179 old: 659187735208 new: 219203492472 res: -28345517557
[...]

A task which hasn't run in long enough for queues to have digressed
further than its vruntime is going to end up with a negative vruntime.
Looking at place_entity(), it looks like it's supposed to fix that up,
but due to the order of arguments passed to max_vruntime(), and the
unsigned comparison therein, it won't.

Running with the patchlet below, my box _so far_ has not become
terminally unhappy despite spread0.  I'm writing this with the pinned
hogs test running right now, and all is well, so I _think_ it might be
ok to just remove sync_vruntime() after all.

diff -uprNX /root/dontdiff git/linux-2.6.sched-devel/kernel/sched_fair.c linux-2.6.23-rc7.d/kernel/sched_fair.c
--- git/linux-2.6.sched-devel/kernel/sched_fair.c	2007-09-23 14:48:18.000000000 +0200
+++ linux-2.6.23-rc7.d/kernel/sched_fair.c	2007-09-24 11:02:05.000000000 +0200
@@ -117,7 +117,7 @@ static inline struct task_struct *task_o
 static inline u64
 max_vruntime(u64 min_vruntime, u64 vruntime)
 {
-	if ((vruntime > min_vruntime) ||
+	if (((s64)vruntime > (s64)min_vruntime) ||
 	    (min_vruntime > (1ULL << 61) && vruntime < (1ULL << 50)))
 		min_vruntime = vruntime;
 
@@ -310,7 +310,7 @@ static void update_curr(struct cfs_rq *c
 	unsigned long delta_exec;
 
 	if (unlikely(!cfs_rq->nr_running))
-		return sync_vruntime(cfs_rq);
+		return ;//sync_vruntime(cfs_rq);
 	if (unlikely(!curr))
 		return;
 




* Re: [git] CFS-devel, group scheduler, fixes
  2007-09-24 10:10                             ` Mike Galbraith
@ 2007-09-24 10:24                               ` Peter Zijlstra
  2007-09-24 10:42                                 ` Mike Galbraith
  0 siblings, 1 reply; 37+ messages in thread
From: Peter Zijlstra @ 2007-09-24 10:24 UTC (permalink / raw)
  To: Mike Galbraith
  Cc: Tong Li, Ingo Molnar, dimm, linux-kernel, Srivatsa Vaddagiri

On Mon, 24 Sep 2007 12:10:09 +0200 Mike Galbraith <efault@gmx.de> wrote:


> @@ -117,7 +117,7 @@ static inline struct task_struct *task_o
>  static inline u64
>  max_vruntime(u64 min_vruntime, u64 vruntime)
>  {
> -	if ((vruntime > min_vruntime) ||
> +	if (((s64)vruntime > (s64)min_vruntime) ||
>  	    (min_vruntime > (1ULL << 61) && vruntime < (1ULL << 50)))
>  		min_vruntime = vruntime;
>  

how about something like:

 s64 delta = (s64)(vruntime - min_vruntime);
 if (delta > 0)
  min_vruntime += delta;

That would rid us of most of the funny conditionals there.


* Re: [git] CFS-devel, group scheduler, fixes
  2007-09-24 10:24                               ` Peter Zijlstra
@ 2007-09-24 10:42                                 ` Mike Galbraith
  2007-09-24 11:08                                   ` Peter Zijlstra
  2007-09-24 11:22                                   ` Mike Galbraith
  0 siblings, 2 replies; 37+ messages in thread
From: Mike Galbraith @ 2007-09-24 10:42 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Tong Li, Ingo Molnar, dimm, linux-kernel, Srivatsa Vaddagiri

On Mon, 2007-09-24 at 12:24 +0200, Peter Zijlstra wrote:

> how about something like:
> 
>  s64 delta = (s64)(vruntime - min_vruntime);
>  if (delta > 0)
>   min_vruntime += delta;
> 
> That would rid us of most of the funny conditionals there.

That still left me with negative min_vruntimes.  The pinned hogs didn't
lock my box up, but I quickly got the below, so hastily killed it.

se.wait_max              :             7.846949
se.wait_max              :           301.951601
se.wait_max              :             7.071359

	-Mike



* Re: [git] CFS-devel, group scheduler, fixes
  2007-09-24 10:42                                 ` Mike Galbraith
@ 2007-09-24 11:08                                   ` Peter Zijlstra
  2007-09-24 11:43                                     ` Mike Galbraith
  2007-09-24 11:22                                   ` Mike Galbraith
  1 sibling, 1 reply; 37+ messages in thread
From: Peter Zijlstra @ 2007-09-24 11:08 UTC (permalink / raw)
  To: Mike Galbraith
  Cc: Tong Li, Ingo Molnar, dimm, linux-kernel, Srivatsa Vaddagiri

On Mon, 24 Sep 2007 12:42:15 +0200 Mike Galbraith <efault@gmx.de> wrote:

> On Mon, 2007-09-24 at 12:24 +0200, Peter Zijlstra wrote:
> 
> > how about something like:
> > 
> >  s64 delta = (s64)(vruntime - min_vruntime);
> >  if (delta > 0)
> >   min_vruntime += delta;
> > 
> > That would rid us of most of the funny conditionals there.
> 
> That still left me with negative min_vruntimes.  The pinned hogs didn't
> lock my box up, but I quickly got the below, so hastily killed it.
> 
> se.wait_max              :             7.846949
> se.wait_max              :           301.951601
> se.wait_max              :             7.071359
>

Odd, the idea (which I think is clear) is that min_vruntime can wrap
around the u64 spectrum. And by using min_vruntime as offset to base
the key around, we get a signed but limited range key-space. (because
we update min_vruntime to be the leftmost task (in a monotonic fashion))

So I'm having trouble with these patches, that is, with both your
wrap-around condition of:

  if (likely(new_rq->cfs.min_vruntime))

as well as the last patchlet:

  if (((s64)vruntime > (s64)min_vruntime) ||

in that neither of these changes makes sense given what it's trying to do.

It's perfectly valid for min_vruntime to exist in the 1ULL << 63 range.



* Re: [git] CFS-devel, group scheduler, fixes
  2007-09-24 10:42                                 ` Mike Galbraith
  2007-09-24 11:08                                   ` Peter Zijlstra
@ 2007-09-24 11:22                                   ` Mike Galbraith
  2007-09-24 11:51                                     ` Peter Zijlstra
  1 sibling, 1 reply; 37+ messages in thread
From: Mike Galbraith @ 2007-09-24 11:22 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Tong Li, Ingo Molnar, dimm, linux-kernel, Srivatsa Vaddagiri

On Mon, 2007-09-24 at 12:42 +0200, Mike Galbraith wrote:
> On Mon, 2007-09-24 at 12:24 +0200, Peter Zijlstra wrote:
> 
> > how about something like:
> > 
> >  s64 delta = (s64)(vruntime - min_vruntime);
> >  if (delta > 0)
> >   min_vruntime += delta;
> > 
> > That would rid us of most of the funny conditionals there.
> 
> That still left me with negative min_vruntimes.  The pinned hogs didn't
> lock my box up, but I quickly got the below, so hastily killed it.

Shouldn't the max() in place_entity() be the max_vruntime() that my
lysdexia told me it was when I looked at it earlier? ;-)

	-Mike



* Re: [git] CFS-devel, group scheduler, fixes
  2007-09-24 11:08                                   ` Peter Zijlstra
@ 2007-09-24 11:43                                     ` Mike Galbraith
  0 siblings, 0 replies; 37+ messages in thread
From: Mike Galbraith @ 2007-09-24 11:43 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Tong Li, Ingo Molnar, dimm, linux-kernel, Srivatsa Vaddagiri

On Mon, 2007-09-24 at 13:08 +0200, Peter Zijlstra wrote:

> It's perfectly valid for min_vruntime to exist in the 1ULL << 63 range.

But the backward timewarp at the wrap is what's killing my box.

	-Mike



* Re: [git] CFS-devel, group scheduler, fixes
  2007-09-24 11:22                                   ` Mike Galbraith
@ 2007-09-24 11:51                                     ` Peter Zijlstra
  2007-09-24 16:43                                       ` Tong Li
  0 siblings, 1 reply; 37+ messages in thread
From: Peter Zijlstra @ 2007-09-24 11:51 UTC (permalink / raw)
  To: Mike Galbraith
  Cc: Tong Li, Ingo Molnar, dimm, linux-kernel, Srivatsa Vaddagiri

On Mon, 24 Sep 2007 13:22:14 +0200 Mike Galbraith <efault@gmx.de> wrote:

> On Mon, 2007-09-24 at 12:42 +0200, Mike Galbraith wrote:
> > On Mon, 2007-09-24 at 12:24 +0200, Peter Zijlstra wrote:
> > 
> > > how about something like:
> > > 
> > >  s64 delta = (s64)(vruntime - min_vruntime);
> > >  if (delta > 0)
> > >   min_vruntime += delta;
> > > 
> > > That would rid us of most of the funny conditionals there.
> > 
> > That still left me with negative min_vruntimes.  The pinned hogs didn't
> > lock my box up, but I quickly got the below, so hastily killed it.
> 
> Shouldn't the max() in place_entity() be the max_vruntime() that my
> lysdexia told me it was when I looked at it earlier? ;-)
> 

probably, my tree doesn't have that max anymore so I'm not sure.


* Re: [git] CFS-devel, group scheduler, fixes
  2007-09-24 11:51                                     ` Peter Zijlstra
@ 2007-09-24 16:43                                       ` Tong Li
  0 siblings, 0 replies; 37+ messages in thread
From: Tong Li @ 2007-09-24 16:43 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Mike Galbraith, Ingo Molnar, dimm, linux-kernel,
	Srivatsa Vaddagiri

On Mon, 24 Sep 2007, Peter Zijlstra wrote:

> On Mon, 24 Sep 2007 13:22:14 +0200 Mike Galbraith <efault@gmx.de> wrote:
>
>> On Mon, 2007-09-24 at 12:42 +0200, Mike Galbraith wrote:
>>> On Mon, 2007-09-24 at 12:24 +0200, Peter Zijlstra wrote:
>>>
>>>> how about something like:
>>>>
>>>>  s64 delta = (s64)(vruntime - min_vruntime);
>>>>  if (delta > 0)
>>>>   min_vruntime += delta;
>>>>
>>>> That would rid us of most of the funny conditionals there.
>>>
>>> That still left me with negative min_vruntimes.  The pinned hogs didn't
>>> lock my box up, but I quickly got the below, so hastily killed it.
>>
>> Shouldn't the max() in place_entity() be the max_vruntime() that my
>> lysdexia told me it was when I looked at it earlier? ;-)
>>
>
> probably, my tree doesn't have that max anymore so I'm not sure.
>

Last time I looked, I thought the max() in place_entity() was fine, and
the problem causing the negative vruntimes seemed to be in
set_task_cpu():

         if (likely(new_rq->cfs.min_vruntime))
                 p->se.vruntime -= old_rq->cfs.min_vruntime -
                                                 new_rq->cfs.min_vruntime;

I think it's fine that we get rid of sync_vruntime(). I need to think
more about how to make global fairness work on this basis--it seems more
complicated than it would be if we had sync_vruntime().

   tong


Thread overview: 37+ messages
2007-09-18 19:36 [git] CFS-devel, group scheduler, fixes dimm
2007-09-18 20:16 ` Ingo Molnar
2007-09-19  6:03   ` Tong Li
2007-09-19  6:28     ` Mike Galbraith
2007-09-19  7:51       ` Mike Galbraith
2007-09-19  8:42         ` Mike Galbraith
2007-09-19 17:06           ` Tong Li
2007-09-20  4:55             ` Mike Galbraith
2007-09-20  7:15               ` Mike Galbraith
2007-09-20  7:51                 ` Ingo Molnar
2007-09-20  8:11                   ` Mike Galbraith
2007-09-22  3:27                     ` Tong Li
2007-09-22 10:01                       ` Mike Galbraith
2007-09-23  7:14                         ` Mike Galbraith
2007-09-23 11:37                           ` Mike Galbraith
     [not found]                           ` <20070923115847.GA13061@elte.hu>
2007-09-23 15:53                             ` [git] CFS-devel, updates Mike Galbraith
2007-09-24  6:21                           ` [git] CFS-devel, group scheduler, fixes Tong Li
2007-09-24 10:10                             ` Mike Galbraith
2007-09-24 10:24                               ` Peter Zijlstra
2007-09-24 10:42                                 ` Mike Galbraith
2007-09-24 11:08                                   ` Peter Zijlstra
2007-09-24 11:43                                     ` Mike Galbraith
2007-09-24 11:22                                   ` Mike Galbraith
2007-09-24 11:51                                     ` Peter Zijlstra
2007-09-24 16:43                                       ` Tong Li
2007-09-20 19:48                 ` Willy Tarreau
2007-09-21  2:40                   ` Mike Galbraith
2007-09-21  3:11                     ` Willy Tarreau
2007-09-19 19:35     ` Siddha, Suresh B
2007-09-19 20:58       ` Tong Li
2007-09-18 20:22 ` Ingo Molnar
2007-09-19  3:55   ` Srivatsa Vaddagiri
  -- strict thread matches above, loose matches on Subject: below --
2007-09-18 19:56 Dmitry Adamushko
2007-09-18 20:18 ` Ingo Molnar
2007-09-18 19:46 Dmitry Adamushko
2007-09-18 20:17 ` Ingo Molnar
2007-09-15 13:06 Ingo Molnar
