public inbox for linux-kernel@vger.kernel.org
* [PATCH 1/5] 2.6.0 fix preempt ctx switch accounting
@ 2003-12-20 15:19 Nick Piggin
  2003-12-20 15:20 ` [PATCH 2/5] 2.6.0 sched fork cleanup Nick Piggin
  2003-12-20 19:22 ` [PATCH 1/5] 2.6.0 fix preempt ctx switch accounting Ingo Molnar
  0 siblings, 2 replies; 15+ messages in thread
From: Nick Piggin @ 2003-12-20 15:19 UTC (permalink / raw)
  To: Andrew Morton, linux-kernel

[-- Attachment #1: Type: text/plain, Size: 96 bytes --]

Make sure to count kernel preemption as a context switch. A shortcut
has been preventing it.
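The accounting rule the patch implements can be sketched in plain userspace C. The names and the flag value below are illustrative stand-ins, not the kernel's: a preempted task, or one still TASK_RUNNING, counts as an involuntary switch; a task that actually went to sleep counts as a voluntary one.

```c
#include <assert.h>

#define TASK_RUNNING        0
#define TASK_INTERRUPTIBLE  1
#define PREEMPT_ACTIVE_FLAG 0x4000000	/* illustrative stand-in */

struct counters { long nvcsw, nivcsw; };

/* Mirror of the patch's rule: kernel preemption and still-runnable
 * tasks bump nivcsw; tasks that voluntarily slept bump nvcsw. */
static void count_switch(int prev_state, int preempt_count,
			 struct counters *c)
{
	if (prev_state == TASK_RUNNING ||
	    (preempt_count & PREEMPT_ACTIVE_FLAG))
		c->nivcsw++;
	else
		c->nvcsw++;
}
```

The old shortcut jumped straight to pick_next_task on preemption, so neither counter was ever touched in that case.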



[-- Attachment #2: sched-ctx-count-preempt.patch --]
[-- Type: text/plain, Size: 1794 bytes --]


Make sure to count kernel preemption as a context switch. A shortcut
has been preventing it.


 linux-2.6-npiggin/kernel/sched.c |   37 +++++++++++++++----------------------
 1 files changed, 15 insertions(+), 22 deletions(-)

diff -puN kernel/sched.c~sched-ctx-count-preempt kernel/sched.c
--- linux-2.6/kernel/sched.c~sched-ctx-count-preempt	2003-12-04 13:01:03.000000000 +1100
+++ linux-2.6-npiggin/kernel/sched.c	2003-12-04 13:01:26.000000000 +1100
@@ -1512,33 +1512,20 @@ need_resched:
 
 	spin_lock_irq(&rq->lock);
 
-	/*
-	 * if entering off of a kernel preemption go straight
-	 * to picking the next task.
-	 */
-	if (unlikely(preempt_count() & PREEMPT_ACTIVE))
-		goto pick_next_task;
-
-	switch (prev->state) {
-	case TASK_INTERRUPTIBLE:
-		if (unlikely(signal_pending(prev))) {
+	if (prev->state != TASK_RUNNING &&
+			likely(!(preempt_count() & PREEMPT_ACTIVE)) ) {
+		if (unlikely(signal_pending(prev)) &&
+				prev->state == TASK_INTERRUPTIBLE)
 			prev->state = TASK_RUNNING;
-			break;
-		}
-	default:
-		deactivate_task(prev, rq);
-		prev->nvcsw++;
-		break;
-	case TASK_RUNNING:
-		prev->nivcsw++;
+		else
+			deactivate_task(prev, rq);
 	}
-pick_next_task:
-	if (unlikely(!rq->nr_running)) {
+
 #ifdef CONFIG_SMP
+	if (unlikely(!rq->nr_running))
 		load_balance(rq, 1, cpu_to_node_mask(smp_processor_id()));
-		if (rq->nr_running)
-			goto pick_next_task;
 #endif
+	if (unlikely(!rq->nr_running)) {
 		next = rq->idle;
 		rq->expired_timestamp = 0;
 		goto switch_tasks;
@@ -1585,6 +1572,12 @@ switch_tasks:
 	prev->timestamp = now;
 
 	if (likely(prev != next)) {
+		if (prev->state == TASK_RUNNING ||
+				unlikely(preempt_count() & PREEMPT_ACTIVE))
+			prev->nivcsw++;
+		else
+			prev->nvcsw++;
+
 		next->timestamp = now;
 		rq->nr_switches++;
 		rq->curr = next;

_

^ permalink raw reply	[flat|nested] 15+ messages in thread

* [PATCH 2/5] 2.6.0 sched fork cleanup
  2003-12-20 15:19 [PATCH 1/5] 2.6.0 fix preempt ctx switch accounting Nick Piggin
@ 2003-12-20 15:20 ` Nick Piggin
  2003-12-20 15:21   ` [PATCH 3/5] 2.6.0 sched migrate comment Nick Piggin
  2003-12-20 19:55   ` [PATCH 2/5] 2.6.0 sched fork cleanup Ingo Molnar
  2003-12-20 19:22 ` [PATCH 1/5] 2.6.0 fix preempt ctx switch accounting Ingo Molnar
  1 sibling, 2 replies; 15+ messages in thread
From: Nick Piggin @ 2003-12-20 15:20 UTC (permalink / raw)
  To: Andrew Morton, linux-kernel

[-- Attachment #1: Type: text/plain, Size: 83 bytes --]

Move some fork-related scheduler policy from fork.c to sched.c, where it
belongs.
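The policy being moved is the fork-time timeslice split, which can be sketched in userspace C (function name here is illustrative): parent and child split the parent's remaining timeslice, so the total pending timeslice in the system is unchanged, modulo the +1 rounding that favours the child.

```c
#include <assert.h>

/* Sketch of the fork-time split: child gets the rounded-up half,
 * parent keeps the rounded-down half. When the parent ends up with
 * 0, the kernel handles the rare one-jiffy-left case separately. */
static void split_timeslice(unsigned int *parent, unsigned int *child)
{
	*child = (*parent + 1) >> 1;
	*parent >>= 1;
}
```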


[-- Attachment #2: sched-fork-cleanup.patch --]
[-- Type: text/plain, Size: 3752 bytes --]


Move some fork-related scheduler policy from fork.c to sched.c, where it belongs.


 linux-2.6-npiggin/include/linux/sched.h |    1 
 linux-2.6-npiggin/kernel/fork.c         |   30 ++-------------------------
 linux-2.6-npiggin/kernel/sched.c        |   35 ++++++++++++++++++++++++++++++++
 3 files changed, 39 insertions(+), 27 deletions(-)

diff -puN include/linux/sched.h~sched-fork-cleanup include/linux/sched.h
--- linux-2.6/include/linux/sched.h~sched-fork-cleanup	2003-12-19 14:22:29.000000000 +1100
+++ linux-2.6-npiggin/include/linux/sched.h	2003-12-19 14:22:29.000000000 +1100
@@ -580,6 +580,7 @@ extern int FASTCALL(wake_up_process(stru
  static inline void kick_process(struct task_struct *tsk) { }
 #endif
 extern void FASTCALL(wake_up_forked_process(struct task_struct * tsk));
+extern void FASTCALL(sched_fork(task_t * p));
 extern void FASTCALL(sched_exit(task_t * p));
 
 asmlinkage long sys_wait4(pid_t pid,unsigned int * stat_addr, int options, struct rusage * ru);
diff -puN kernel/fork.c~sched-fork-cleanup kernel/fork.c
--- linux-2.6/kernel/fork.c~sched-fork-cleanup	2003-12-19 14:22:29.000000000 +1100
+++ linux-2.6-npiggin/kernel/fork.c	2003-12-19 14:22:29.000000000 +1100
@@ -973,33 +973,9 @@ struct task_struct *copy_process(unsigne
 	p->exit_signal = (clone_flags & CLONE_THREAD) ? -1 : (clone_flags & CSIGNAL);
 	p->pdeath_signal = 0;
 
-	/*
-	 * Share the timeslice between parent and child, thus the
-	 * total amount of pending timeslices in the system doesn't change,
-	 * resulting in more scheduling fairness.
-	 */
-	local_irq_disable();
-        p->time_slice = (current->time_slice + 1) >> 1;
-	/*
-	 * The remainder of the first timeslice might be recovered by
-	 * the parent if the child exits early enough.
-	 */
-	p->first_time_slice = 1;
-	current->time_slice >>= 1;
-	p->timestamp = sched_clock();
-	if (!current->time_slice) {
-		/*
-	 	 * This case is rare, it happens when the parent has only
-	 	 * a single jiffy left from its timeslice. Taking the
-		 * runqueue lock is not a problem.
-		 */
-		current->time_slice = 1;
-		preempt_disable();
-		scheduler_tick(0, 0);
-		local_irq_enable();
-		preempt_enable();
-	} else
-		local_irq_enable();
+	/* Perform scheduler related accounting */
+	sched_fork(p);
+
 	/*
 	 * Ok, add it to the run-queues and make it
 	 * visible to the rest of the system.
diff -puN kernel/sched.c~sched-fork-cleanup kernel/sched.c
--- linux-2.6/kernel/sched.c~sched-fork-cleanup	2003-12-19 14:22:29.000000000 +1100
+++ linux-2.6-npiggin/kernel/sched.c	2003-12-19 14:22:29.000000000 +1100
@@ -674,6 +674,41 @@ int wake_up_state(task_t *p, unsigned in
 }
 
 /*
+ * Perform scheduler related accounting for a newly forked process p.
+ * p is forked by current.
+ */
+void sched_fork(task_t *p)
+{
+	/*
+	 * Share the timeslice between parent and child, thus the
+	 * total amount of pending timeslices in the system doesn't change,
+	 * resulting in more scheduling fairness.
+	 */
+	local_irq_disable();
+        p->time_slice = (current->time_slice + 1) >> 1;
+	/*
+	 * The remainder of the first timeslice might be recovered by
+	 * the parent if the child exits early enough.
+	 */
+	p->first_time_slice = 1;
+	current->time_slice >>= 1;
+	p->timestamp = sched_clock();
+	if (!current->time_slice) {
+		/*
+	 	 * This case is rare, it happens when the parent has only
+	 	 * a single jiffy left from its timeslice. Taking the
+		 * runqueue lock is not a problem.
+		 */
+		current->time_slice = 1;
+		preempt_disable();
+		scheduler_tick(0, 0);
+		local_irq_enable();
+		preempt_enable();
+	} else
+		local_irq_enable();
+}
+
+/*
  * wake_up_forked_process - wake up a freshly forked process.
  *
  * This function will do some initial scheduler statistics housekeeping

_


* [PATCH 3/5] 2.6.0 sched migrate comment
  2003-12-20 15:20 ` [PATCH 2/5] 2.6.0 sched fork cleanup Nick Piggin
@ 2003-12-20 15:21   ` Nick Piggin
  2003-12-20 15:22     ` [PATCH 4/5] 2.6.0 sched style fixes Nick Piggin
  2003-12-20 20:26     ` [PATCH 3/5] 2.6.0 sched migrate comment Ingo Molnar
  2003-12-20 19:55   ` [PATCH 2/5] 2.6.0 sched fork cleanup Ingo Molnar
  1 sibling, 2 replies; 15+ messages in thread
From: Nick Piggin @ 2003-12-20 15:21 UTC (permalink / raw)
  To: Andrew Morton, linux-kernel

[-- Attachment #1: Type: text/plain, Size: 54 bytes --]

Some comments were lost during minor surgery here...
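The three rules the restored comment documents can be sketched in userspace C (all parameters and values illustrative): a task is not pulled if it is running, not allowed on the pulling CPU, or still cache-hot on its current CPU, with the cache-hot test waived when the pulling CPU is idle.

```c
#include <assert.h>

/* We do not migrate tasks that are:
 * 1) running (obviously), or
 * 2) not allowed on this CPU (cpus_allowed), or
 * 3) cache-hot on their current CPU (unless we're idle and
 *    migrate aggressively). */
static int can_migrate(int running, int allowed_here, long delta_ns,
		       long cache_hot_ns, int idle)
{
	if (running)
		return 0;
	if (!allowed_here)
		return 0;
	if (!idle && delta_ns <= cache_hot_ns)
		return 0;
	return 1;
}
```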


[-- Attachment #2: sched-migrate-comment.patch --]
[-- Type: text/plain, Size: 2299 bytes --]


Some comments were lost during minor surgery here...


 linux-2.6-npiggin/kernel/sched.c |   42 +++++++++++++++++++++------------------
 1 files changed, 23 insertions(+), 19 deletions(-)

diff -puN kernel/sched.c~sched-migrate-comment kernel/sched.c
--- linux-2.6/kernel/sched.c~sched-migrate-comment	2003-12-04 13:01:06.000000000 +1100
+++ linux-2.6-npiggin/kernel/sched.c	2003-12-04 13:01:25.000000000 +1100
@@ -1179,25 +1179,34 @@ static inline void pull_task(runqueue_t 
 }
 
 /*
- * Previously:
- *
- * #define CAN_MIGRATE_TASK(p,rq,this_cpu)	\
- *	((!idle || (NS_TO_JIFFIES(now - (p)->timestamp) > \
- *		cache_decay_ticks)) && !task_running(rq, p) && \
- *			cpu_isset(this_cpu, (p)->cpus_allowed))
+ * can_migrate_task
+ * May task p from runqueue rq be migrated to this_cpu?
+ * idle: Is this_cpu idle
+ * Returns: 1 if p may be migrated, 0 otherwise.
  */
-
 static inline int
-can_migrate_task(task_t *tsk, runqueue_t *rq, int this_cpu, int idle)
+can_migrate_task(task_t *p, runqueue_t *rq, int this_cpu, int idle)
 {
-	unsigned long delta = sched_clock() - tsk->timestamp;
+	unsigned long delta;
+
+	/*
+	 * We do not migrate tasks that are:
+	 * 1) running (obviously), or
+	 * 2) cannot be migrated to this CPU due to cpus_allowed, or
+	 * 3) are cache-hot on their current CPU.
+	 */
 
-	if (!idle && (delta <= JIFFIES_TO_NS(cache_decay_ticks)))
+	if (task_running(rq, p))
 		return 0;
-	if (task_running(rq, tsk))
+
+	if (!cpu_isset(this_cpu, p->cpus_allowed))
 		return 0;
-	if (!cpu_isset(this_cpu, tsk->cpus_allowed))
+
+	/* Aggressive migration if we're idle */
+	delta = sched_clock() - p->timestamp;
+	if (!idle && (delta <= cache_decay_ticks))
 		return 0;
+
 	return 1;
 }
 
@@ -1259,13 +1268,6 @@ skip_bitmap:
 skip_queue:
 	tmp = list_entry(curr, task_t, run_list);
 
-	/*
-	 * We do not migrate tasks that are:
-	 * 1) running (obviously), or
-	 * 2) cannot be migrated to this CPU due to cpus_allowed, or
-	 * 3) are cache-hot on their current CPU.
-	 */
-
 	curr = curr->prev;
 
 	if (!can_migrate_task(tmp, busiest, this_cpu, idle)) {
@@ -1275,6 +1277,8 @@ skip_queue:
 		goto skip_bitmap;
 	}
 	pull_task(busiest, array, tmp, this_rq, this_cpu);
+
+	/* Only migrate 1 task if we're idle */
 	if (!idle && --imbalance) {
 		if (curr != head)
 			goto skip_queue;

_


* [PATCH 4/5] 2.6.0 sched style fixes
  2003-12-20 15:21   ` [PATCH 3/5] 2.6.0 sched migrate comment Nick Piggin
@ 2003-12-20 15:22     ` Nick Piggin
  2003-12-20 15:24       ` [PATCH 5/5] 2.6.0 sched affinity race Nick Piggin
  2003-12-20 21:07       ` [PATCH 4/5] 2.6.0 sched style fixes Ingo Molnar
  2003-12-20 20:26     ` [PATCH 3/5] 2.6.0 sched migrate comment Ingo Molnar
  1 sibling, 2 replies; 15+ messages in thread
From: Nick Piggin @ 2003-12-20 15:22 UTC (permalink / raw)
  To: Andrew Morton, linux-kernel

[-- Attachment #1: Type: text/plain, Size: 49 bytes --]

A couple of small spacing and indentation changes.


[-- Attachment #2: sched-style.patch --]
[-- Type: text/plain, Size: 2996 bytes --]


A couple of small spacing and indentation changes.


 linux-2.6-npiggin/kernel/sched.c |   33 +++++++++++++++++----------------
 1 files changed, 17 insertions(+), 16 deletions(-)

diff -puN kernel/sched.c~sched-style kernel/sched.c
--- linux-2.6/kernel/sched.c~sched-style	2003-12-04 13:01:07.000000000 +1100
+++ linux-2.6-npiggin/kernel/sched.c	2003-12-04 13:01:25.000000000 +1100
@@ -396,7 +396,7 @@ static void recalc_task_prio(task_t *p, 
 		 * other processes.
 		 */
 		if (p->mm && p->activated != -1 &&
-			sleep_time > JUST_INTERACTIVE_SLEEP(p)){
+			sleep_time > JUST_INTERACTIVE_SLEEP(p)) {
 				p->sleep_avg = JIFFIES_TO_NS(MAX_SLEEP_AVG -
 						AVG_TIMESLICE);
 				if (!HIGH_CREDIT(p))
@@ -422,15 +422,15 @@ static void recalc_task_prio(task_t *p, 
 			 * sleep are limited in their sleep_avg rise as they
 			 * are likely to be cpu hogs waiting on I/O
 			 */
-			if (p->activated == -1 && !HIGH_CREDIT(p) && p->mm){
+			if (p->activated == -1 && !HIGH_CREDIT(p) && p->mm) {
 				if (p->sleep_avg >= JUST_INTERACTIVE_SLEEP(p))
 					sleep_time = 0;
 				else if (p->sleep_avg + sleep_time >=
-					JUST_INTERACTIVE_SLEEP(p)){
-						p->sleep_avg =
-							JUST_INTERACTIVE_SLEEP(p);
-						sleep_time = 0;
-					}
+						JUST_INTERACTIVE_SLEEP(p)) {
+					p->sleep_avg =
+						JUST_INTERACTIVE_SLEEP(p);
+					sleep_time = 0;
+				}
 			}
 
 			/*
@@ -443,7 +443,7 @@ static void recalc_task_prio(task_t *p, 
 			 */
 			p->sleep_avg += sleep_time;
 
-			if (p->sleep_avg > NS_MAX_SLEEP_AVG){
+			if (p->sleep_avg > NS_MAX_SLEEP_AVG) {
 				p->sleep_avg = NS_MAX_SLEEP_AVG;
 				if (!HIGH_CREDIT(p))
 					p->interactive_credit++;
@@ -470,7 +470,7 @@ static inline void activate_task(task_t 
 	 * This checks to make sure it's not an uninterruptible task
 	 * that is now waking up.
 	 */
-	if (!p->activated){
+	if (!p->activated) {
 		/*
 		 * Tasks which were woken up by interrupts (ie. hw events)
 		 * are most likely of interactive nature. So we give them
@@ -480,13 +480,14 @@ static inline void activate_task(task_t 
 		 */
 		if (in_interrupt())
 			p->activated = 2;
-		else
-		/*
-		 * Normal first-time wakeups get a credit too for on-runqueue
-		 * time, but it will be weighted down:
-		 */
+		else {
+			/*
+			 * Normal first-time wakeups get a credit too for on-runqueue
+			 * time, but it will be weighted down:
+			 */
 			p->activated = 1;
 		}
+	}
 	p->timestamp = now;
 
 	__activate_task(p, rq);
@@ -638,7 +639,7 @@ repeat_lock_task:
 				task_rq_unlock(rq, &flags);
 				goto repeat_lock_task;
 			}
-			if (old_state == TASK_UNINTERRUPTIBLE){
+			if (old_state == TASK_UNINTERRUPTIBLE) {
 				rq->nr_uninterruptible--;
 				/*
 				 * Tasks on involuntary sleep don't earn
@@ -1603,7 +1604,7 @@ switch_tasks:
 	RCU_qsctr(task_cpu(prev))++;
 
 	prev->sleep_avg -= run_time;
-	if ((long)prev->sleep_avg <= 0){
+	if ((long)prev->sleep_avg <= 0) {
 		prev->sleep_avg = 0;
 		if (!(HIGH_CREDIT(prev) || LOW_CREDIT(prev)))
 			prev->interactive_credit--;

_


* [PATCH 5/5] 2.6.0 sched affinity race
  2003-12-20 15:22     ` [PATCH 4/5] 2.6.0 sched style fixes Nick Piggin
@ 2003-12-20 15:24       ` Nick Piggin
  2003-12-20 21:07       ` [PATCH 4/5] 2.6.0 sched style fixes Ingo Molnar
  1 sibling, 0 replies; 15+ messages in thread
From: Nick Piggin @ 2003-12-20 15:24 UTC (permalink / raw)
  To: Andrew Morton, linux-kernel

[-- Attachment #1: Type: text/plain, Size: 171 bytes --]

Prevent a race where sys_sched_setaffinity can run concurrently with
sched_migrate_task and cause sched_migrate_task to restore an invalid cpu mask.

(The race can only happen on NUMA.)
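The core of the fix can be sketched in userspace C (the mask type and names are illustrative stand-ins for cpumask_t): after forcing the task onto the destination CPU, only restore the saved mask if nobody else changed cpus_allowed in the meantime, because a concurrent setaffinity makes the saved copy stale.

```c
#include <assert.h>

typedef unsigned long cpumask_sketch_t;	/* stand-in for cpumask_t */

/* Restore old_mask only if cpus_allowed still holds the mask we
 * forced earlier. Returns 1 if restored, 0 if a concurrent
 * setaffinity won the race and old_mask is stale. */
static int restore_mask(cpumask_sketch_t *cpus_allowed,
			cpumask_sketch_t forced_mask,
			cpumask_sketch_t old_mask)
{
	if (*cpus_allowed != forced_mask)
		return 0;	/* raced: leave the newer mask alone */
	*cpus_allowed = old_mask;
	return 1;
}
```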


[-- Attachment #2: sched-migrate-affinity-race.patch --]
[-- Type: text/plain, Size: 4366 bytes --]


Prevent a race where sys_sched_setaffinity can run concurrently with
sched_migrate_task and cause sched_migrate_task to restore an invalid cpu mask.


 linux-2.6-npiggin/kernel/sched.c |   83 +++++++++++++++++++++++++++++----------
 1 files changed, 62 insertions(+), 21 deletions(-)

diff -puN kernel/sched.c~sched-migrate-affinity-race kernel/sched.c
--- linux-2.6/kernel/sched.c~sched-migrate-affinity-race	2003-12-19 19:56:27.000000000 +1100
+++ linux-2.6-npiggin/kernel/sched.c	2003-12-19 19:57:58.000000000 +1100
@@ -947,6 +947,9 @@ static inline void double_rq_unlock(runq
 }
 
 #ifdef CONFIG_NUMA
+
+static inline int __set_cpus_allowed(task_t *p, cpumask_t new_mask, unsigned long *flags);
+
 /*
  * If dest_cpu is allowed for this process, migrate the task to it.
  * This is accomplished by forcing the cpu_allowed mask to only
@@ -955,16 +958,37 @@ static inline void double_rq_unlock(runq
  */
 static void sched_migrate_task(task_t *p, int dest_cpu)
 {
-	cpumask_t old_mask;
+	runqueue_t *rq;
+	unsigned long flags;
+	cpumask_t old_mask, new_mask = cpumask_of_cpu(dest_cpu);
 
+	rq = task_rq_lock(p, &flags);
 	old_mask = p->cpus_allowed;
-	if (!cpu_isset(dest_cpu, old_mask))
+	if (!cpu_isset(dest_cpu, old_mask)) {
+		task_rq_unlock(rq, &flags);
 		return;
+	}
+
 	/* force the process onto the specified CPU */
-	set_cpus_allowed(p, cpumask_of_cpu(dest_cpu));
+	if (__set_cpus_allowed(p, new_mask, &flags) < 0)
+		return;
 
-	/* restore the cpus allowed mask */
-	set_cpus_allowed(p, old_mask);
+	rq = task_rq_lock(p, &flags);	/* __set_cpus_allowed unlocks rq */
+	if (unlikely(p->cpus_allowed != new_mask)) {
+		/*
+		 * We have raced with another set_cpus_allowed.
+		 * old_mask is invalid and we needn't move the
+		 * task back.
+		 */
+		task_rq_unlock(rq, &flags);
+		return;
+	}
+
+	/*
+	 * restore the cpus allowed mask. old_mask must be valid because
+	 * p->cpus_allowed is a subset of old_mask.
+	 */
+	WARN_ON(__set_cpus_allowed(p, old_mask, &flags) < 0);
 }
 
 /*
@@ -2603,31 +2627,27 @@ typedef struct {
 } migration_req_t;
 
 /*
- * Change a given task's CPU affinity. Migrate the thread to a
- * proper CPU and schedule it away if the CPU it's executing on
- * is removed from the allowed bitmask.
- *
- * NOTE: the caller must have a valid reference to the task, the
- * task must not exit() & deallocate itself prematurely.  The
- * call is not atomic; no spinlocks may be held.
+ * See comment for set_cpus_allowed. calling rules are different:
+ * the task's runqueue lock must be held, and __set_cpus_allowed
+ * will return with the runqueue unlocked.
  */
-int set_cpus_allowed(task_t *p, cpumask_t new_mask)
+static inline int __set_cpus_allowed(task_t *p, cpumask_t new_mask, unsigned long *flags)
 {
-	unsigned long flags;
 	migration_req_t req;
-	runqueue_t *rq;
+	runqueue_t *rq = task_rq(p);
 
-	if (any_online_cpu(new_mask) == NR_CPUS)
+	if (any_online_cpu(new_mask) == NR_CPUS) {
+		task_rq_unlock(rq, flags);
 		return -EINVAL;
+	}
 
-	rq = task_rq_lock(p, &flags);
 	p->cpus_allowed = new_mask;
 	/*
 	 * Can the task run on the task's current CPU? If not then
 	 * migrate the thread off to a proper CPU.
 	 */
 	if (cpu_isset(task_cpu(p), new_mask)) {
-		task_rq_unlock(rq, &flags);
+		task_rq_unlock(rq, flags);
 		return 0;
 	}
 	/*
@@ -2636,18 +2656,39 @@ int set_cpus_allowed(task_t *p, cpumask_
 	 */
 	if (!p->array && !task_running(rq, p)) {
 		set_task_cpu(p, any_online_cpu(p->cpus_allowed));
-		task_rq_unlock(rq, &flags);
+		task_rq_unlock(rq, flags);
 		return 0;
 	}
+
 	init_completion(&req.done);
 	req.task = p;
 	list_add(&req.list, &rq->migration_queue);
-	task_rq_unlock(rq, &flags);
+	task_rq_unlock(rq, flags);
 
 	wake_up_process(rq->migration_thread);
-
 	wait_for_completion(&req.done);
+
 	return 0;
+
+}
+
+/*
+ * Change a given task's CPU affinity. Migrate the thread to a
+ * proper CPU and schedule it away if the CPU it's executing on
+ * is removed from the allowed bitmask.
+ *
+ * NOTE: the caller must have a valid reference to the task, the
+ * task must not exit() & deallocate itself prematurely.  The
+ * call is not atomic; no spinlocks may be held.
+ */
+int set_cpus_allowed(task_t *p, cpumask_t new_mask)
+{
+	unsigned long flags;
+	runqueue_t *rq;
+
+	rq = task_rq_lock(p, &flags);
+
+	return __set_cpus_allowed(p, new_mask, &flags);
 }
 
 EXPORT_SYMBOL_GPL(set_cpus_allowed);

_


* Re: [PATCH 1/5] 2.6.0 fix preempt ctx switch accounting
  2003-12-20 15:19 [PATCH 1/5] 2.6.0 fix preempt ctx switch accounting Nick Piggin
  2003-12-20 15:20 ` [PATCH 2/5] 2.6.0 sched fork cleanup Nick Piggin
@ 2003-12-20 19:22 ` Ingo Molnar
  2003-12-20 19:52   ` Rob Love
  2003-12-20 20:07   ` Linus Torvalds
  1 sibling, 2 replies; 15+ messages in thread
From: Ingo Molnar @ 2003-12-20 19:22 UTC (permalink / raw)
  To: Nick Piggin; +Cc: Andrew Morton, linux-kernel

* Nick Piggin <piggin@cyberone.com.au> wrote:

> Make sure to count kernel preemption as a context switch. A short cut
> has been preventing it.

i'd prefer the much simpler patch below. This also keeps the kernel
preemption logic isolated instead of mixing it into the normal path.

	Ingo

--- linux/kernel/sched.c.orig	
+++ linux/kernel/sched.c	
@@ -1520,8 +1520,10 @@ need_resched:
 	 * if entering off of a kernel preemption go straight
 	 * to picking the next task.
 	 */
-	if (unlikely(preempt_count() & PREEMPT_ACTIVE))
+	if (unlikely(preempt_count() & PREEMPT_ACTIVE)) {
+		prev->nivcsw++;
 		goto pick_next_task;
+	}
 
 	switch (prev->state) {
 	case TASK_INTERRUPTIBLE:


* Re: [PATCH 1/5] 2.6.0 fix preempt ctx switch accounting
  2003-12-20 19:22 ` [PATCH 1/5] 2.6.0 fix preempt ctx switch accounting Ingo Molnar
@ 2003-12-20 19:52   ` Rob Love
  2003-12-20 20:07   ` Linus Torvalds
  1 sibling, 0 replies; 15+ messages in thread
From: Rob Love @ 2003-12-20 19:52 UTC (permalink / raw)
  To: Ingo Molnar; +Cc: Nick Piggin, Andrew Morton, linux-kernel

On Sat, 2003-12-20 at 14:22, Ingo Molnar wrote:

> i'd prefer the much simpler patch below. This also keeps the kernel
> preemption logic isolated instead of mixing it into the normal path.

Much agreed, on both counts.

	Rob Love




* Re: [PATCH 2/5] 2.6.0 sched fork cleanup
  2003-12-20 15:20 ` [PATCH 2/5] 2.6.0 sched fork cleanup Nick Piggin
  2003-12-20 15:21   ` [PATCH 3/5] 2.6.0 sched migrate comment Nick Piggin
@ 2003-12-20 19:55   ` Ingo Molnar
  2003-12-20 23:17     ` Nick Piggin
  1 sibling, 1 reply; 15+ messages in thread
From: Ingo Molnar @ 2003-12-20 19:55 UTC (permalink / raw)
  To: Nick Piggin; +Cc: Andrew Morton, linux-kernel

[-- Attachment #1: Type: text/plain, Size: 530 bytes --]


* Nick Piggin <piggin@cyberone.com.au> wrote:

> Move some fork related scheduler policy from fork.c to sched.c where
> it belongs.

this only makes sense if all the other fork-time initializations are
done in sched.c too - these are scattered all around copy_process(). 
The attached patch (it compiles & boots) does this. All the lowlevel
scheduler logic (that was done in fork.c) is now in sched.c - fork.c
only sees the higher level primitives. I've also updated a couple of
comments that relate to the affected code.

	Ingo

[-- Attachment #2: sched-cleanup-2.6.0-A1 --]
[-- Type: text/plain, Size: 4725 bytes --]

--- linux/include/linux/sched.h.orig	
+++ linux/include/linux/sched.h	
@@ -580,6 +580,7 @@ extern int FASTCALL(wake_up_process(stru
  static inline void kick_process(struct task_struct *tsk) { }
 #endif
 extern void FASTCALL(wake_up_forked_process(struct task_struct * tsk));
+extern void FASTCALL(sched_fork(task_t * p));
 extern void FASTCALL(sched_exit(task_t * p));
 
 asmlinkage long sys_wait4(pid_t pid,unsigned int * stat_addr, int options, struct rusage * ru);
--- linux/kernel/fork.c.orig	
+++ linux/kernel/fork.c	
@@ -876,15 +876,7 @@ struct task_struct *copy_process(unsigne
 	if (p->binfmt && !try_module_get(p->binfmt->module))
 		goto bad_fork_cleanup_put_domain;
 
-#ifdef CONFIG_PREEMPT
-	/*
-	 * schedule_tail drops this_rq()->lock so we compensate with a count
-	 * of 1.  Also, we want to start with kernel preemption disabled.
-	 */
-	p->thread_info->preempt_count = 1;
-#endif
 	p->did_exec = 0;
-	p->state = TASK_UNINTERRUPTIBLE;
 
 	copy_flags(clone_flags, p);
 	if (clone_flags & CLONE_IDLETASK)
@@ -901,15 +893,12 @@ struct task_struct *copy_process(unsigne
 
 	p->proc_dentry = NULL;
 
-	INIT_LIST_HEAD(&p->run_list);
-
 	INIT_LIST_HEAD(&p->children);
 	INIT_LIST_HEAD(&p->sibling);
 	INIT_LIST_HEAD(&p->posix_timers);
 	init_waitqueue_head(&p->wait_chldexit);
 	p->vfork_done = NULL;
 	spin_lock_init(&p->alloc_lock);
-	spin_lock_init(&p->switch_lock);
 	spin_lock_init(&p->proc_lock);
 
 	clear_tsk_thread_flag(p, TIF_SIGPENDING);
@@ -924,7 +913,6 @@ struct task_struct *copy_process(unsigne
 	p->tty_old_pgrp = 0;
 	p->utime = p->stime = 0;
 	p->cutime = p->cstime = 0;
-	p->array = NULL;
 	p->lock_depth = -1;		/* -1 = no lock */
 	p->start_time = get_jiffies_64();
 	p->security = NULL;
@@ -973,38 +961,12 @@ struct task_struct *copy_process(unsigne
 	p->exit_signal = (clone_flags & CLONE_THREAD) ? -1 : (clone_flags & CSIGNAL);
 	p->pdeath_signal = 0;
 
+	/* Perform scheduler related setup */
+	sched_fork(p);
+
 	/*
-	 * Share the timeslice between parent and child, thus the
-	 * total amount of pending timeslices in the system doesn't change,
-	 * resulting in more scheduling fairness.
-	 */
-	local_irq_disable();
-        p->time_slice = (current->time_slice + 1) >> 1;
-	/*
-	 * The remainder of the first timeslice might be recovered by
-	 * the parent if the child exits early enough.
-	 */
-	p->first_time_slice = 1;
-	current->time_slice >>= 1;
-	p->timestamp = sched_clock();
-	if (!current->time_slice) {
-		/*
-	 	 * This case is rare, it happens when the parent has only
-	 	 * a single jiffy left from its timeslice. Taking the
-		 * runqueue lock is not a problem.
-		 */
-		current->time_slice = 1;
-		preempt_disable();
-		scheduler_tick(0, 0);
-		local_irq_enable();
-		preempt_enable();
-	} else
-		local_irq_enable();
-	/*
-	 * Ok, add it to the run-queues and make it
-	 * visible to the rest of the system.
-	 *
-	 * Let it rip!
+	 * Ok, make it visible to the rest of the system.
+	 * We dont wake it up yet.
 	 */
 	p->tgid = p->pid;
 	p->group_leader = p;
--- linux/kernel/sched.c.orig	
+++ linux/kernel/sched.c	
@@ -674,6 +674,54 @@ int wake_up_state(task_t *p, unsigned in
 }
 
 /*
+ * Perform scheduler related setup for a newly forked process p.
+ * p is forked by current.
+ */
+void sched_fork(task_t *p)
+{
+	p->state = TASK_UNINTERRUPTIBLE;
+	INIT_LIST_HEAD(&p->run_list);
+	p->array = NULL;
+	spin_lock_init(&p->switch_lock);
+#ifdef CONFIG_PREEMPT
+	/*
+	 * During context-switch we hold precisely one spinlock, which
+	 * schedule_tail drops. (in the common case it's this_rq()->lock,
+	 * but it also can be p->switch_lock.) So we compensate with a count
+	 * of 1. Also, we want to start with kernel preemption disabled.
+	 */
+	p->thread_info->preempt_count = 1;
+#endif
+	/*
+	 * Share the timeslice between parent and child, thus the
+	 * total amount of pending timeslices in the system doesn't change,
+	 * resulting in more scheduling fairness.
+	 */
+	local_irq_disable();
+        p->time_slice = (current->time_slice + 1) >> 1;
+	/*
+	 * The remainder of the first timeslice might be recovered by
+	 * the parent if the child exits early enough.
+	 */
+	p->first_time_slice = 1;
+	current->time_slice >>= 1;
+	p->timestamp = sched_clock();
+	if (!current->time_slice) {
+		/*
+	 	 * This case is rare, it happens when the parent has only
+	 	 * a single jiffy left from its timeslice. Taking the
+		 * runqueue lock is not a problem.
+		 */
+		current->time_slice = 1;
+		preempt_disable();
+		scheduler_tick(0, 0);
+		local_irq_enable();
+		preempt_enable();
+	} else
+		local_irq_enable();
+}
+
+/*
  * wake_up_forked_process - wake up a freshly forked process.
  *
  * This function will do some initial scheduler statistics housekeeping


* Re: [PATCH 1/5] 2.6.0 fix preempt ctx switch accounting
  2003-12-20 19:22 ` [PATCH 1/5] 2.6.0 fix preempt ctx switch accounting Ingo Molnar
  2003-12-20 19:52   ` Rob Love
@ 2003-12-20 20:07   ` Linus Torvalds
  2003-12-20 21:32     ` Ingo Molnar
  2003-12-20 23:15     ` Nick Piggin
  1 sibling, 2 replies; 15+ messages in thread
From: Linus Torvalds @ 2003-12-20 20:07 UTC (permalink / raw)
  To: Ingo Molnar; +Cc: Nick Piggin, Andrew Morton, linux-kernel



On Sat, 20 Dec 2003, Ingo Molnar wrote:
> 
> i'd prefer the much simpler patch below. This also keeps the kernel
> preemption logic isolated instead of mixing it into the normal path.

That patch still gets several cases wrong: we don't update any counters at
all for the case where we were TASK_INTERRUPTIBLE and we got made
TASK_RUNNING because of having a signal pending.

Also, we shouldn't update the context switch counter just because we 
entered the scheduler. If we don't actually end up switching to anything 
else, it shouldn't count as a context switch.

So how about something like this?

Totally untested. Comments?

		Linus

---
===== kernel/sched.c 1.225 vs edited =====
--- 1.225/kernel/sched.c	Mon Dec  1 16:00:00 2003
+++ edited/kernel/sched.c	Sat Dec 20 12:05:56 2003
@@ -1470,6 +1470,7 @@
  */
 asmlinkage void schedule(void)
 {
+	long *switch_count;
 	task_t *prev, *next;
 	runqueue_t *rq;
 	prio_array_t *array;
@@ -1516,22 +1517,16 @@
 	 * if entering off of a kernel preemption go straight
 	 * to picking the next task.
 	 */
-	if (unlikely(preempt_count() & PREEMPT_ACTIVE))
-		goto pick_next_task;
-
-	switch (prev->state) {
-	case TASK_INTERRUPTIBLE:
-		if (unlikely(signal_pending(prev))) {
+	switch_count = &prev->nivcsw;
+	if (prev->state && !(preempt_count() & PREEMPT_ACTIVE)) {
+		switch_count = &prev->nvcsw;
+		if ((prev->state & TASK_INTERRUPTIBLE) && unlikely(signal_pending(prev))) {
 			prev->state = TASK_RUNNING;
-			break;
+		} else {
+			deactivate_task(prev, rq);
 		}
-	default:
-		deactivate_task(prev, rq);
-		prev->nvcsw++;
-		break;
-	case TASK_RUNNING:
-		prev->nivcsw++;
 	}
+
 pick_next_task:
 	if (unlikely(!rq->nr_running)) {
 #ifdef CONFIG_SMP
@@ -1588,6 +1583,7 @@
 		next->timestamp = now;
 		rq->nr_switches++;
 		rq->curr = next;
+		++*switch_count;
 
 		prepare_arch_switch(rq, next);
 		prev = context_switch(rq, prev, next);


* Re: [PATCH 3/5] 2.6.0 sched migrate comment
  2003-12-20 15:21   ` [PATCH 3/5] 2.6.0 sched migrate comment Nick Piggin
  2003-12-20 15:22     ` [PATCH 4/5] 2.6.0 sched style fixes Nick Piggin
@ 2003-12-20 20:26     ` Ingo Molnar
  2003-12-20 23:19       ` Nick Piggin
  1 sibling, 1 reply; 15+ messages in thread
From: Ingo Molnar @ 2003-12-20 20:26 UTC (permalink / raw)
  To: Nick Piggin; +Cc: Andrew Morton, linux-kernel

[-- Attachment #1: Type: text/plain, Size: 578 bytes --]


* Nick Piggin <piggin@cyberone.com.au> wrote:

> Some comments were lost during minor surgery here...

this patch collides with John Hawkes' sched_clock() fix-patch which goes
in first. Also, the patch does more than just comment fixups. It changes
the order of tests, where a bug slipped in:

+       if (!idle && (delta <= cache_decay_ticks))

This will cause the cache-hot test to almost never trigger, which is
clearly incorrect. The correct test is:

        if (!idle && (delta <= JIFFIES_TO_NS(cache_decay_ticks)))
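The unit mismatch is easy to demonstrate in userspace C (HZ value and macro names illustrative): task timestamps are in nanoseconds, so comparing the delta against a raw tick count is almost never true, while converting the ticks to nanoseconds first gives the intended cache-hot window.

```c
#include <assert.h>

#define HZ_SKETCH 1000	/* illustrative HZ */
#define JIFFIES_TO_NS_SKETCH(j) ((j) * 1000000000UL / HZ_SKETCH)

/* The buggy form compares nanoseconds against ticks. */
static int cache_hot_wrong(unsigned long delta_ns, unsigned long ticks)
{
	return delta_ns <= ticks;
}

/* The correct form converts ticks to nanoseconds first. */
static int cache_hot_right(unsigned long delta_ns, unsigned long ticks)
{
	return delta_ns <= JIFFIES_TO_NS_SKETCH(ticks);
}
```

With cache_decay_ticks == 1 and a delta of half a jiffy, the buggy test says "not cache-hot" even though the task clearly is.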

anyway, i've fixed it all up (patch attached).

	Ingo

[-- Attachment #2: sched-can-migrate-2.6.0-A1 --]
[-- Type: text/plain, Size: 1648 bytes --]

--- linux/kernel/sched.c.orig	
+++ linux/kernel/sched.c	
@@ -1193,25 +1193,25 @@ static inline void pull_task(runqueue_t 
 }
 
 /*
- * Previously:
- *
- * #define CAN_MIGRATE_TASK(p,rq,this_cpu)	\
- *	((!idle || (NS_TO_JIFFIES(now - (p)->timestamp) > \
- *		cache_decay_ticks)) && !task_running(rq, p) && \
- *			cpu_isset(this_cpu, (p)->cpus_allowed))
+ * can_migrate_task - may task p from runqueue rq be migrated to this_cpu?
  */
-
 static inline int
 can_migrate_task(task_t *tsk, runqueue_t *rq, int this_cpu, int idle)
 {
 	unsigned long delta = rq->timestamp_last_tick - tsk->timestamp;
 
-	if (!idle && (delta <= JIFFIES_TO_NS(cache_decay_ticks)))
-		return 0;
+	/*
+	 * We do not migrate tasks that are:
+	 * 1) running (obviously), or
+	 * 2) cannot be migrated to this CPU due to cpus_allowed, or
+	 * 3) are cache-hot on their current CPU.
+	 */
 	if (task_running(rq, tsk))
 		return 0;
 	if (!cpu_isset(this_cpu, tsk->cpus_allowed))
 		return 0;
+	if (!idle && (delta <= JIFFIES_TO_NS(cache_decay_ticks)))
+		return 0;
 	return 1;
 }
 
@@ -1273,13 +1273,6 @@ skip_bitmap:
 skip_queue:
 	tmp = list_entry(curr, task_t, run_list);
 
-	/*
-	 * We do not migrate tasks that are:
-	 * 1) running (obviously), or
-	 * 2) cannot be migrated to this CPU due to cpus_allowed, or
-	 * 3) are cache-hot on their current CPU.
-	 */
-
 	curr = curr->prev;
 
 	if (!can_migrate_task(tmp, busiest, this_cpu, idle)) {
@@ -1289,6 +1282,8 @@ skip_queue:
 		goto skip_bitmap;
 	}
 	pull_task(busiest, array, tmp, this_rq, this_cpu);
+
+	/* Only migrate one task if we are idle */
 	if (!idle && --imbalance) {
 		if (curr != head)
 			goto skip_queue;

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH 4/5] 2.6.0 sched style fixes
  2003-12-20 15:22     ` [PATCH 4/5] 2.6.0 sched style fixes Nick Piggin
  2003-12-20 15:24       ` [PATCH 5/5] 2.6.0 sched affinity race Nick Piggin
@ 2003-12-20 21:07       ` Ingo Molnar
  1 sibling, 0 replies; 15+ messages in thread
From: Ingo Molnar @ 2003-12-20 21:07 UTC (permalink / raw)
  To: Nick Piggin; +Cc: Andrew Morton, linux-kernel

[-- Attachment #1: Type: text/plain, Size: 316 bytes --]


* Nick Piggin <piggin@cyberone.com.au> wrote:

> Couple of small spacing and indentation changes

agreed. While we are at it, there are many more coding style errors and
inconsistencies that have accumulated in sched.c during the past couple of
months. The attached patch includes your cleanups and mine as well.

	Ingo

[-- Attachment #2: sched-style-2.6.0-A3 --]
[-- Type: text/plain, Size: 18046 bytes --]

--- linux/kernel/sched.c.orig	
+++ linux/kernel/sched.c	
@@ -79,13 +79,13 @@
  */
 #define MIN_TIMESLICE		( 10 * HZ / 1000)
 #define MAX_TIMESLICE		(200 * HZ / 1000)
-#define ON_RUNQUEUE_WEIGHT	30
-#define CHILD_PENALTY		95
+#define ON_RUNQUEUE_WEIGHT	 30
+#define CHILD_PENALTY		 95
 #define PARENT_PENALTY		100
-#define EXIT_WEIGHT		3
-#define PRIO_BONUS_RATIO	25
+#define EXIT_WEIGHT		  3
+#define PRIO_BONUS_RATIO	 25
 #define MAX_BONUS		(MAX_USER_PRIO * PRIO_BONUS_RATIO / 100)
-#define INTERACTIVE_DELTA	2
+#define INTERACTIVE_DELTA	  2
 #define MAX_SLEEP_AVG		(AVG_TIMESLICE * MAX_BONUS)
 #define STARVATION_LIMIT	(MAX_SLEEP_AVG)
 #define NS_MAX_SLEEP_AVG	(JIFFIES_TO_NS(MAX_SLEEP_AVG))
@@ -143,7 +143,7 @@
 #define TASK_INTERACTIVE(p) \
 	((p)->prio <= (p)->static_prio - DELTA(p))
 
-#define JUST_INTERACTIVE_SLEEP(p) \
+#define INTERACTIVE_SLEEP(p) \
 	(JIFFIES_TO_NS(MAX_SLEEP_AVG * \
 		(MAX_BONUS / 2 + DELTA((p)) + 1) / MAX_BONUS - 1))
 
@@ -168,7 +168,8 @@
  */
 
 #define BASE_TIMESLICE(p) (MIN_TIMESLICE + \
-	((MAX_TIMESLICE - MIN_TIMESLICE) * (MAX_PRIO-1-(p)->static_prio)/(MAX_USER_PRIO - 1)))
+		((MAX_TIMESLICE - MIN_TIMESLICE) * \
+			(MAX_PRIO-1 - (p)->static_prio) / (MAX_USER_PRIO-1)))
 
 static inline unsigned int task_timeslice(task_t *p)
 {
@@ -199,7 +200,7 @@ struct prio_array {
 struct runqueue {
 	spinlock_t lock;
 	unsigned long nr_running, nr_switches, expired_timestamp,
-			nr_uninterruptible, timestamp_last_tick;
+		      nr_uninterruptible, timestamp_last_tick;
 	task_t *curr, *idle;
 	struct mm_struct *prev_mm;
 	prio_array_t *active, *expired, arrays[2];
@@ -225,7 +226,7 @@ static DEFINE_PER_CPU(struct runqueue, r
  * Default context-switch locking:
  */
 #ifndef prepare_arch_switch
-# define prepare_arch_switch(rq, next)	do { } while(0)
+# define prepare_arch_switch(rq, next)	do { } while (0)
 # define finish_arch_switch(rq, next)	spin_unlock_irq(&(rq)->lock)
 # define task_running(rq, p)		((rq)->curr == (p))
 #endif
@@ -269,9 +270,9 @@ __init void node_nr_running_init(void)
 
 #else /* !CONFIG_NUMA */
 
-# define nr_running_init(rq)   do { } while (0)
-# define nr_running_inc(rq)    do { (rq)->nr_running++; } while (0)
-# define nr_running_dec(rq)    do { (rq)->nr_running--; } while (0)
+# define nr_running_init(rq)	do { } while (0)
+# define nr_running_inc(rq)	do { (rq)->nr_running++; } while (0)
+# define nr_running_dec(rq)	do { (rq)->nr_running--; } while (0)
 
 #endif /* CONFIG_NUMA */
 
@@ -396,7 +397,7 @@ static void recalc_task_prio(task_t *p, 
 		 * other processes.
 		 */
 		if (p->mm && p->activated != -1 &&
-			sleep_time > JUST_INTERACTIVE_SLEEP(p)){
+			sleep_time > INTERACTIVE_SLEEP(p)) {
 				p->sleep_avg = JIFFIES_TO_NS(MAX_SLEEP_AVG -
 						AVG_TIMESLICE);
 				if (!HIGH_CREDIT(p))
@@ -413,37 +414,35 @@ static void recalc_task_prio(task_t *p, 
 			 * one timeslice worth of sleep avg bonus.
 			 */
 			if (LOW_CREDIT(p) &&
-				sleep_time > JIFFIES_TO_NS(task_timeslice(p)))
-					sleep_time =
-						JIFFIES_TO_NS(task_timeslice(p));
+			    sleep_time > JIFFIES_TO_NS(task_timeslice(p)))
+				sleep_time = JIFFIES_TO_NS(task_timeslice(p));
 
 			/*
 			 * Non high_credit tasks waking from uninterruptible
 			 * sleep are limited in their sleep_avg rise as they
 			 * are likely to be cpu hogs waiting on I/O
 			 */
-			if (p->activated == -1 && !HIGH_CREDIT(p) && p->mm){
-				if (p->sleep_avg >= JUST_INTERACTIVE_SLEEP(p))
+			if (p->activated == -1 && !HIGH_CREDIT(p) && p->mm) {
+				if (p->sleep_avg >= INTERACTIVE_SLEEP(p))
 					sleep_time = 0;
 				else if (p->sleep_avg + sleep_time >=
-					JUST_INTERACTIVE_SLEEP(p)){
-						p->sleep_avg =
-							JUST_INTERACTIVE_SLEEP(p);
-						sleep_time = 0;
-					}
+						INTERACTIVE_SLEEP(p)) {
+					p->sleep_avg = INTERACTIVE_SLEEP(p);
+					sleep_time = 0;
+				}
 			}
 
 			/*
 			 * This code gives a bonus to interactive tasks.
 			 *
 			 * The boost works by updating the 'average sleep time'
-			 * value here, based on ->timestamp. The more time a task
-			 * spends sleeping, the higher the average gets - and the
-			 * higher the priority boost gets as well.
+			 * value here, based on ->timestamp. The more time a
+			 * task spends sleeping, the higher the average gets -
+			 * and the higher the priority boost gets as well.
 			 */
 			p->sleep_avg += sleep_time;
 
-			if (p->sleep_avg > NS_MAX_SLEEP_AVG){
+			if (p->sleep_avg > NS_MAX_SLEEP_AVG) {
 				p->sleep_avg = NS_MAX_SLEEP_AVG;
 				if (!HIGH_CREDIT(p))
 					p->interactive_credit++;
@@ -470,7 +469,7 @@ static inline void activate_task(task_t 
 	 * This checks to make sure it's not an uninterruptible task
 	 * that is now waking up.
 	 */
-	if (!p->activated){
+	if (!p->activated) {
 		/*
 		 * Tasks which were woken up by interrupts (ie. hw events)
 		 * are most likely of interactive nature. So we give them
@@ -480,13 +479,14 @@ static inline void activate_task(task_t 
 		 */
 		if (in_interrupt())
 			p->activated = 2;
-		else
-		/*
-		 * Normal first-time wakeups get a credit too for on-runqueue
-		 * time, but it will be weighted down:
-		 */
+		else {
+			/*
+			 * Normal first-time wakeups get a credit too for
+			 * on-runqueue time, but it will be weighted down:
+			 */
 			p->activated = 1;
 		}
+	}
 	p->timestamp = now;
 
 	__activate_task(p, rq);
@@ -632,13 +632,14 @@ repeat_lock_task:
 			 */
 			if (unlikely(sync && !task_running(rq, p) &&
 				(task_cpu(p) != smp_processor_id()) &&
-				cpu_isset(smp_processor_id(), p->cpus_allowed))) {
+					cpu_isset(smp_processor_id(),
+							p->cpus_allowed))) {
 
 				set_task_cpu(p, smp_processor_id());
 				task_rq_unlock(rq, &flags);
 				goto repeat_lock_task;
 			}
-			if (old_state == TASK_UNINTERRUPTIBLE){
+			if (old_state == TASK_UNINTERRUPTIBLE) {
 				rq->nr_uninterruptible--;
 				/*
 				 * Tasks on involuntary sleep don't earn
@@ -663,7 +664,8 @@ repeat_lock_task:
 }
 int wake_up_process(task_t * p)
 {
-	return try_to_wake_up(p, TASK_STOPPED | TASK_INTERRUPTIBLE | TASK_UNINTERRUPTIBLE, 0);
+	return try_to_wake_up(p, TASK_STOPPED |
+		       		 TASK_INTERRUPTIBLE | TASK_UNINTERRUPTIBLE, 0);
 }
 
 EXPORT_SYMBOL(wake_up_process);
@@ -698,7 +700,7 @@ void sched_fork(task_t *p)
 	 * resulting in more scheduling fairness.
 	 */
 	local_irq_disable();
-        p->time_slice = (current->time_slice + 1) >> 1;
+	p->time_slice = (current->time_slice + 1) >> 1;
 	/*
 	 * The remainder of the first timeslice might be recovered by
 	 * the parent if the child exits early enough.
@@ -847,7 +849,8 @@ asmlinkage void schedule_tail(task_t *pr
  * context_switch - switch to the new MM and the new
  * thread's register state.
  */
-static inline task_t * context_switch(runqueue_t *rq, task_t *prev, task_t *next)
+static inline
+task_t * context_switch(runqueue_t *rq, task_t *prev, task_t *next)
 {
 	struct mm_struct *mm = next->mm;
 	struct mm_struct *oldmm = prev->active_mm;
@@ -995,10 +998,10 @@ static int sched_best_cpu(struct task_st
 	minload = 10000000;
 	for_each_node_with_cpus(i) {
 		/*
-		 * Node load is always divided by nr_cpus_node to normalise 
+		 * Node load is always divided by nr_cpus_node to normalise
 		 * load values in case cpu count differs from node to node.
 		 * We first multiply node_nr_running by 10 to get a little
-		 * better resolution.   
+		 * better resolution.
 		 */
 		load = 10 * atomic_read(&node_nr_running[i]) / nr_cpus_node(i);
 		if (load < minload) {
@@ -1037,7 +1040,7 @@ void sched_balance_exec(void)
  *      load_{t} = load_{t-1}/2 + nr_node_running_{t}
  * This way sudden load peaks are flattened out a bit.
  * Node load is divided by nr_cpus_node() in order to compare nodes
- * of different cpu count but also [first] multiplied by 10 to 
+ * of different cpu count but also [first] multiplied by 10 to
  * provide better resolution.
  */
 static int find_busiest_node(int this_node)
@@ -1075,8 +1078,10 @@ static int find_busiest_node(int this_no
  * this_rq is locked already. Recalculate nr_running if we have to
  * drop the runqueue lock.
  */
-static inline unsigned int double_lock_balance(runqueue_t *this_rq,
-	runqueue_t *busiest, int this_cpu, int idle, unsigned int nr_running)
+static inline
+unsigned int double_lock_balance(runqueue_t *this_rq, runqueue_t *busiest,
+				 int this_cpu, int idle,
+				 unsigned int nr_running)
 {
 	if (unlikely(!spin_trylock(&busiest->lock))) {
 		if (busiest < this_rq) {
@@ -1084,7 +1089,8 @@ static inline unsigned int double_lock_b
 			spin_lock(&busiest->lock);
 			spin_lock(&this_rq->lock);
 			/* Need to recalculate nr_running */
-			if (idle || (this_rq->nr_running > this_rq->prev_cpu_load[this_cpu]))
+			if (idle || (this_rq->nr_running >
+					this_rq->prev_cpu_load[this_cpu]))
 				nr_running = this_rq->nr_running;
 			else
 				nr_running = this_rq->prev_cpu_load[this_cpu];
@@ -1097,7 +1103,9 @@ static inline unsigned int double_lock_b
 /*
  * find_busiest_queue - find the busiest runqueue among the cpus in cpumask.
  */
-static inline runqueue_t *find_busiest_queue(runqueue_t *this_rq, int this_cpu, int idle, int *imbalance, cpumask_t cpumask)
+static inline
+runqueue_t *find_busiest_queue(runqueue_t *this_rq, int this_cpu, int idle,
+			       int *imbalance, cpumask_t cpumask)
 {
 	int nr_running, load, max_load, i;
 	runqueue_t *busiest, *rq_src;
@@ -1159,7 +1167,8 @@ static inline runqueue_t *find_busiest_q
 		goto out;
 	}
 
-	nr_running = double_lock_balance(this_rq, busiest, this_cpu, idle, nr_running);
+	nr_running = double_lock_balance(this_rq, busiest, this_cpu,
+					 idle, nr_running);
 	/*
 	 * Make sure nothing changed since we checked the
 	 * runqueue length.
@@ -1176,14 +1185,17 @@ out:
  * pull_task - move a task from a remote runqueue to the local runqueue.
  * Both runqueues must be locked.
  */
-static inline void pull_task(runqueue_t *src_rq, prio_array_t *src_array, task_t *p, runqueue_t *this_rq, int this_cpu)
+static inline
+void pull_task(runqueue_t *src_rq, prio_array_t *src_array, task_t *p,
+	       runqueue_t *this_rq, int this_cpu)
 {
 	dequeue_task(p, src_array);
 	nr_running_dec(src_rq);
 	set_task_cpu(p, this_cpu);
 	nr_running_inc(this_rq);
 	enqueue_task(p, this_rq->active);
-	p->timestamp = sched_clock() - (src_rq->timestamp_last_tick - p->timestamp);
+	p->timestamp = sched_clock() -
+				(src_rq->timestamp_last_tick - p->timestamp);
 	/*
 	 * Note that idle threads have a prio of MAX_PRIO, for this test
 	 * to be always true for them.
@@ -1195,8 +1207,8 @@ static inline void pull_task(runqueue_t 
 /*
  * can_migrate_task - may task p from runqueue rq be migrated to this_cpu?
  */
-static inline int
-can_migrate_task(task_t *tsk, runqueue_t *rq, int this_cpu, int idle)
+static inline
+int can_migrate_task(task_t *tsk, runqueue_t *rq, int this_cpu, int idle)
 {
 	unsigned long delta = rq->timestamp_last_tick - tsk->timestamp;
 
@@ -1231,7 +1243,8 @@ static void load_balance(runqueue_t *thi
 	struct list_head *head, *curr;
 	task_t *tmp;
 
-	busiest = find_busiest_queue(this_rq, this_cpu, idle, &imbalance, cpumask);
+	busiest = find_busiest_queue(this_rq, this_cpu, idle,
+				     &imbalance, cpumask);
 	if (!busiest)
 		goto out;
 
@@ -1373,7 +1386,7 @@ static inline void rebalance_tick(runque
 }
 #endif
 
-DEFINE_PER_CPU(struct kernel_stat, kstat) = { { 0 } };
+DEFINE_PER_CPU(struct kernel_stat, kstat);
 
 EXPORT_PER_CPU_SYMBOL(kstat);
 
@@ -1391,7 +1404,7 @@ EXPORT_PER_CPU_SYMBOL(kstat);
 	((STARVATION_LIMIT && ((rq)->expired_timestamp && \
 		(jiffies - (rq)->expired_timestamp >= \
 			STARVATION_LIMIT * ((rq)->nr_running) + 1))) || \
-				((rq)->curr->static_prio > (rq)->best_expired_prio))
+			((rq)->curr->static_prio > (rq)->best_expired_prio))
 
 /*
  * This function gets called by the timer code, with HZ frequency.
@@ -1563,7 +1576,7 @@ need_resched:
 	spin_lock_irq(&rq->lock);
 
 	if (prev->state != TASK_RUNNING &&
-			likely(!(preempt_count() & PREEMPT_ACTIVE)) ) {
+			likely(!(preempt_count() & PREEMPT_ACTIVE))) {
 		if (unlikely(signal_pending(prev)) &&
 				prev->state == TASK_INTERRUPTIBLE)
 			prev->state = TASK_RUNNING;
@@ -1615,7 +1628,7 @@ switch_tasks:
 	RCU_qsctr(task_cpu(prev))++;
 
 	prev->sleep_avg -= run_time;
-	if ((long)prev->sleep_avg <= 0){
+	if ((long)prev->sleep_avg <= 0) {
 		prev->sleep_avg = 0;
 		if (!(HIGH_CREDIT(prev) || LOW_CREDIT(prev)))
 			prev->interactive_credit--;
@@ -1697,7 +1710,8 @@ EXPORT_SYMBOL(default_wake_function);
  * started to run but is not in state TASK_RUNNING.  try_to_wake_up() returns
  * zero in this (rare) case, and we handle it by continuing to scan the queue.
  */
-static void __wake_up_common(wait_queue_head_t *q, unsigned int mode, int nr_exclusive, int sync)
+static void __wake_up_common(wait_queue_head_t *q, unsigned int mode,
+			     int nr_exclusive, int sync)
 {
 	struct list_head *tmp, *next;
 
@@ -1774,7 +1788,8 @@ void complete(struct completion *x)
 
 	spin_lock_irqsave(&x->wait.lock, flags);
 	x->done++;
-	__wake_up_common(&x->wait, TASK_UNINTERRUPTIBLE | TASK_INTERRUPTIBLE, 1, 0);
+	__wake_up_common(&x->wait, TASK_UNINTERRUPTIBLE | TASK_INTERRUPTIBLE,
+			 1, 0);
 	spin_unlock_irqrestore(&x->wait.lock, flags);
 }
 
@@ -1786,7 +1801,8 @@ void complete_all(struct completion *x)
 
 	spin_lock_irqsave(&x->wait.lock, flags);
 	x->done += UINT_MAX/2;
-	__wake_up_common(&x->wait, TASK_UNINTERRUPTIBLE | TASK_INTERRUPTIBLE, 0, 0);
+	__wake_up_common(&x->wait, TASK_UNINTERRUPTIBLE | TASK_INTERRUPTIBLE,
+			 0, 0);
 	spin_unlock_irqrestore(&x->wait.lock, flags);
 }
 
@@ -1813,9 +1829,9 @@ void wait_for_completion(struct completi
 
 EXPORT_SYMBOL(wait_for_completion);
 
-#define	SLEEP_ON_VAR				\
-	unsigned long flags;			\
-	wait_queue_t wait;			\
+#define	SLEEP_ON_VAR					\
+	unsigned long flags;				\
+	wait_queue_t wait;				\
 	init_waitqueue_entry(&wait, current);
 
 #define SLEEP_ON_HEAD					\
@@ -1823,9 +1839,9 @@ EXPORT_SYMBOL(wait_for_completion);
 	__add_wait_queue(q, &wait);			\
 	spin_unlock(&q->lock);
 
-#define	SLEEP_ON_TAIL						\
-	spin_lock_irq(&q->lock);				\
-	__remove_wait_queue(q, &wait);				\
+#define	SLEEP_ON_TAIL					\
+	spin_lock_irq(&q->lock);			\
+	__remove_wait_queue(q, &wait);			\
 	spin_unlock_irqrestore(&q->lock, flags);
 
 void interruptible_sleep_on(wait_queue_head_t *q)
@@ -1950,9 +1966,9 @@ asmlinkage long sys_nice(int increment)
 	long nice;
 
 	/*
-	 *	Setpriority might change our priority at the same moment.
-	 *	We don't have to worry. Conceptually one call occurs first
-	 *	and we have a single winner.
+	 * Setpriority might change our priority at the same moment.
+	 * We don't have to worry. Conceptually one call occurs first
+	 * and we have a single winner.
 	 */
 	if (increment < 0) {
 		if (!capable(CAP_SYS_NICE))
@@ -2132,7 +2148,7 @@ out_nounlock:
  * @param: structure containing the new RT priority.
  */
 asmlinkage long sys_sched_setscheduler(pid_t pid, int policy,
-				      struct sched_param __user *param)
+				       struct sched_param __user *param)
 {
 	return setscheduler(pid, policy, param);
 }
@@ -2439,7 +2455,8 @@ asmlinkage long sys_sched_get_priority_m
  * this syscall writes the default timeslice value of a given process
  * into the user-space timespec buffer. A value of '0' means infinity.
  */
-asmlinkage long sys_sched_rr_get_interval(pid_t pid, struct timespec __user *interval)
+asmlinkage
+long sys_sched_rr_get_interval(pid_t pid, struct timespec __user *interval)
 {
 	int retval = -EINVAL;
 	struct timespec t;
@@ -2512,10 +2529,10 @@ static void show_task(task_t * p)
 		printk(" %016lx ", thread_saved_pc(p));
 #endif
 	{
-		unsigned long * n = (unsigned long *) (p->thread_info+1);
+		unsigned long * n = (unsigned long *)(p->thread_info+1);
 		while (!*n)
 			n++;
-		free = (unsigned long) n - (unsigned long)(p->thread_info+1);
+		free = (unsigned long)n - (unsigned long)(p->thread_info+1);
 	}
 	printk("%5lu %5d %6d ", free, p->pid, p->parent->pid);
 	if ((relative = eldest_child(p)))
@@ -2685,7 +2702,7 @@ static void move_task_away(struct task_s
 	}
 	p->timestamp = rq_dest->timestamp_last_tick;
 
- out:
+out:
 	double_rq_unlock(this_rq(), rq_dest);
 	local_irq_restore(flags);
 }
@@ -2754,11 +2771,10 @@ static int migration_thread(void * data)
  * migration_call - callback that gets triggered when a CPU is added.
  * Here we can start up the necessary migration thread for the new CPU.
  */
-static int migration_call(struct notifier_block *nfb,
-			  unsigned long action,
+static int migration_call(struct notifier_block *nfb, unsigned long action,
 			  void *hcpu)
 {
-	long cpu = (long) hcpu;
+	long cpu = (long)hcpu;
 	migration_startup_t startup;
 
 	switch (action) {
@@ -2787,7 +2803,8 @@ static int migration_call(struct notifie
 	return NOTIFY_OK;
 }
 
-static struct notifier_block migration_notifier = { &migration_call, NULL, 0 };
+static struct notifier_block migration_notifier
+			= { .notifier_call = &migration_call };
 
 __init int migration_init(void)
 {
@@ -2823,7 +2840,7 @@ static void kstat_init_cpu(int cpu)
 }
 
 static int __devinit kstat_cpu_notify(struct notifier_block *self,
-					unsigned long action, void *hcpu)
+				      unsigned long action, void *hcpu)
 {
 	int cpu = (unsigned long)hcpu;
 	switch(action) {
@@ -2837,13 +2854,14 @@ static int __devinit kstat_cpu_notify(st
 }
 
 static struct notifier_block __devinitdata kstat_nb = {
-	.notifier_call  = kstat_cpu_notify,
-	.next           = NULL,
+	.notifier_call	= kstat_cpu_notify,
+	.next		= NULL,
 };
 
-__init static void init_kstat(void) {
+__init static void init_kstat(void)
+{
 	kstat_cpu_notify(&kstat_nb, (unsigned long)CPU_UP_PREPARE,
-			(void *)(long)smp_processor_id());
+			 (void *)(long)smp_processor_id());
 	register_cpu_notifier(&kstat_nb);
 }
 
@@ -2909,7 +2927,7 @@ void __might_sleep(char *file, int line)
 		printk(KERN_ERR "Debug: sleeping function called from invalid"
 				" context at %s:%d\n", file, line);
 		printk("in_atomic():%d, irqs_disabled():%d\n",
-				in_atomic(), irqs_disabled());
+			in_atomic(), irqs_disabled());
 		dump_stack();
 	}
 #endif


* Re: [PATCH 1/5] 2.6.0 fix preempt ctx switch accounting
  2003-12-20 20:07   ` Linus Torvalds
@ 2003-12-20 21:32     ` Ingo Molnar
  2003-12-20 23:15     ` Nick Piggin
  1 sibling, 0 replies; 15+ messages in thread
From: Ingo Molnar @ 2003-12-20 21:32 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Nick Piggin, Andrew Morton, linux-kernel


* Linus Torvalds <torvalds@osdl.org> wrote:

> That patch still gets several cases wrong: we don't update any
> counters at all for the case where we were TASK_INTERRUPTIBLE and we
> got made TASK_RUNNING because of having a signal pending.
> 
> Also, we shouldn't update the context switch counter just because we
> entered the scheduler. If we don't actually end up switching to
> anything else, it shouldn't count as a context switch.
> 
> So how about something like this?
> 
> Totally untested. Comments?

looks good. The indirect pointer nicely gets rid of complexity. I've
done some additional cleanups: fixed a compilation warning on UP and
cleaned up the goto pick_next_task code. Moved the 'unlikely' to the
test as a whole. I've test-booted this patch and the context-switch
stats look OK. Updated patch attached.

	Ingo

--- linux/kernel/sched.c.orig	
+++ linux/kernel/sched.c	
@@ -1477,6 +1477,7 @@ void scheduling_functions_start_here(voi
  */
 asmlinkage void schedule(void)
 {
+	long *switch_count;
 	task_t *prev, *next;
 	runqueue_t *rq;
 	prio_array_t *array;
@@ -1523,32 +1524,25 @@ need_resched:
 	 * if entering off of a kernel preemption go straight
 	 * to picking the next task.
 	 */
-	if (unlikely(preempt_count() & PREEMPT_ACTIVE))
-		goto pick_next_task;
-
-	switch (prev->state) {
-	case TASK_INTERRUPTIBLE:
-		if (unlikely(signal_pending(prev))) {
+	switch_count = &prev->nivcsw;
+	if (prev->state && !(preempt_count() & PREEMPT_ACTIVE)) {
+		switch_count = &prev->nvcsw;
+		if (unlikely((prev->state & TASK_INTERRUPTIBLE) &&
+				unlikely(signal_pending(prev))))
 			prev->state = TASK_RUNNING;
-			break;
-		}
-	default:
-		deactivate_task(prev, rq);
-		prev->nvcsw++;
-		break;
-	case TASK_RUNNING:
-		prev->nivcsw++;
+		else
+			deactivate_task(prev, rq);
 	}
-pick_next_task:
+
 	if (unlikely(!rq->nr_running)) {
 #ifdef CONFIG_SMP
 		load_balance(rq, 1, cpu_to_node_mask(smp_processor_id()));
-		if (rq->nr_running)
-			goto pick_next_task;
 #endif
-		next = rq->idle;
-		rq->expired_timestamp = 0;
-		goto switch_tasks;
+		if (!rq->nr_running) {
+			next = rq->idle;
+			rq->expired_timestamp = 0;
+			goto switch_tasks;
+		}
 	}
 
 	array = rq->active;
@@ -1596,6 +1590,7 @@ switch_tasks:
 		next->timestamp = now;
 		rq->nr_switches++;
 		rq->curr = next;
+		++*switch_count;
 
 		prepare_arch_switch(rq, next);
 		prev = context_switch(rq, prev, next);


* Re: [PATCH 1/5] 2.6.0 fix preempt ctx switch accounting
  2003-12-20 20:07   ` Linus Torvalds
  2003-12-20 21:32     ` Ingo Molnar
@ 2003-12-20 23:15     ` Nick Piggin
  1 sibling, 0 replies; 15+ messages in thread
From: Nick Piggin @ 2003-12-20 23:15 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Ingo Molnar, Andrew Morton, linux-kernel



Linus Torvalds wrote:

>On Sat, 20 Dec 2003, Ingo Molnar wrote:
>  
>
>>i'd prefer the much simpler patch below. This also keeps the kernel
>>preemption logic isolated instead of mixing it into the normal path.
>>    
>>
>
>That patch still gets several cases wrong: we don't update any counters at
>all for the case where we were TASK_INTERRUPTIBLE and we got made
>TASK_RUNNING because of having a signal pending.
>
>Also, we shouldn't update the context switch counter just because we 
>entered the scheduler. If we don't actually end up switching to anything 
>else, it shouldn't count as a context switch.
>
>So how about something like this?
>
>Totally untested. Comments?
>  
>

That's about as good as you'll get it, I think.




* Re: [PATCH 2/5] 2.6.0 sched fork cleanup
  2003-12-20 19:55   ` [PATCH 2/5] 2.6.0 sched fork cleanup Ingo Molnar
@ 2003-12-20 23:17     ` Nick Piggin
  0 siblings, 0 replies; 15+ messages in thread
From: Nick Piggin @ 2003-12-20 23:17 UTC (permalink / raw)
  To: Ingo Molnar; +Cc: Andrew Morton, linux-kernel



Ingo Molnar wrote:

>* Nick Piggin <piggin@cyberone.com.au> wrote:
>
>
>>Move some fork related scheduler policy from fork.c to sched.c where
>>it belongs.
>>
>
>this only makes sense if all the other fork-time initializations are
>done in sched.c too - these are scattered all around copy_process(). 
>The attached patch (it compiles & boots) does this. All the lowlevel
>scheduler logic (that was done in fork.c) is now in sched.c - fork.c
>only sees the higher level primitives. I've also updated a couple of
>comments that relate to the affected code.
>

OK, I missed those bits. Looks good, of course. Thanks Ingo.




* Re: [PATCH 3/5] 2.6.0 sched migrate comment
  2003-12-20 20:26     ` [PATCH 3/5] 2.6.0 sched migrate comment Ingo Molnar
@ 2003-12-20 23:19       ` Nick Piggin
  0 siblings, 0 replies; 15+ messages in thread
From: Nick Piggin @ 2003-12-20 23:19 UTC (permalink / raw)
  To: Ingo Molnar; +Cc: Andrew Morton, linux-kernel



Ingo Molnar wrote:

>* Nick Piggin <piggin@cyberone.com.au> wrote:
>
>  
>
>>Some comments were lost during minor surgery here...
>>    
>>
>
>this patch collides with John Hawkes' sched_clock() fix-patch which goes
>in first. Also, the patch does more than just comment fixups. It changes
>the order of tests, where a bug slipped in:
>
>+       if (!idle && (delta <= cache_decay_ticks))
>
>This will cause the cache-hot test to almost never trigger, which is
>clearly incorrect. The correct test is:
>
>        if (!idle && (delta <= JIFFIES_TO_NS(cache_decay_ticks)))
>
>anyway, i've fixed it all up (patch attached).
>  
>

Yep, thanks again.




end of thread, other threads:[~2003-12-20 23:19 UTC | newest]

Thread overview: 15+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2003-12-20 15:19 [PATCH 1/5] 2.6.0 fix preempt ctx switch accounting Nick Piggin
2003-12-20 15:20 ` [PATCH 2/5] 2.6.0 sched fork cleanup Nick Piggin
2003-12-20 15:21   ` [PATCH 3/5] 2.6.0 sched migrate comment Nick Piggin
2003-12-20 15:22     ` [PATCH 4/5] 2.6.0 sched style fixes Nick Piggin
2003-12-20 15:24       ` [PATCH 5/5] 2.6.0 sched affinity race Nick Piggin
2003-12-20 21:07       ` [PATCH 4/5] 2.6.0 sched style fixes Ingo Molnar
2003-12-20 20:26     ` [PATCH 3/5] 2.6.0 sched migrate comment Ingo Molnar
2003-12-20 23:19       ` Nick Piggin
2003-12-20 19:55   ` [PATCH 2/5] 2.6.0 sched fork cleanup Ingo Molnar
2003-12-20 23:17     ` Nick Piggin
2003-12-20 19:22 ` [PATCH 1/5] 2.6.0 fix preempt ctx switch accounting Ingo Molnar
2003-12-20 19:52   ` Rob Love
2003-12-20 20:07   ` Linus Torvalds
2003-12-20 21:32     ` Ingo Molnar
2003-12-20 23:15     ` Nick Piggin
