* [sched] [patch] migration-fixes-2.5.3-pre2-A1
@ 2002-01-20 13:36 Ingo Molnar
From: Ingo Molnar @ 2002-01-20 13:36 UTC (permalink / raw)
To: Linus Torvalds; +Cc: linux-kernel
[-- Attachment #1: Type: TEXT/PLAIN, Size: 1399 bytes --]
the attached patch solves an SMP task migration bug in the O(1) scheduler.
the bug is triggered all the time on an 8-way box i tested, and there are
some bug reports from dual boxes as well that lock up during bootup.
task migration is a subtle issue. The basic operation is to 'send' the
currently executing task over to another CPU, where it continues
execution. The current code in 2.5.3-pre2 is broken: it has a window in
which it's possible for a ksoftirqd thread to run on two CPUs at once -
causing a lockup.
my solution is to send the task to the other CPU using the
smp_migrate_task() low-level SMP method defined by the SMP architecture,
which then calls back into the scheduler via sched_task_migrated(new_task).
it's also possible to move a task from one runqueue to another one without
using cross-CPU messaging, but this increases the overhead of the
scheduler hot-path: schedule_tail() would need to check for some sort of
prev->state == TASK_MIGRATED flag, at least. The patch solves this without
adding overhead to the hot-path.
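the handoff the patch implements below can be sketched in user space as a single shared slot protected by a lock that the sender takes and the *receiver* releases. This is a simplified, single-threaded sketch; the names mirror the patch, the flag stands in for the spinlock, and a direct call stands in for the IPI:

```c
#include <assert.h>
#include <stddef.h>

typedef struct task { int pid; } task_t;

/* one shared slot: at most one migration can be in flight at a time */
static int migration_locked;
static task_t *new_task_slot;

/* sender side, as in smp_migrate_task(): lock, publish, then "send the IPI" */
static void migrate_task_send(task_t *p)
{
	migration_locked = 1;	/* spin_lock(&migration_lock) */
	new_task_slot = p;
	/* send_IPI_mask(1 << cpu, TASK_MIGRATION_VECTOR) would go here;
	 * crucially, the lock is still held when the IPI goes out */
}

/* receiver side, as in smp_task_migration_interrupt(): consume, then unlock */
static task_t *migrate_task_receive(void)
{
	task_t *p = new_task_slot;	/* read the slot before releasing the lock */
	migration_locked = 0;		/* spin_unlock(&migration_lock) */
	return p;
}
```

the asymmetric locking is the point: the slot stays owned by the in-flight migration until the target CPU has actually picked the task up, so a second sender cannot overwrite it.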
the patch is also an optimization: set_cpus_allowed() used to switch to
the idle thread manually, to initiate a reschedule (due to locking
issues). This 'manual context switching' code is gone, and
set_cpus_allowed() now calls schedule() directly. This simplifies
set_cpus_allowed() greatly, reducing sched.c's line count by 10.
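the target-CPU pick in the rewritten set_cpus_allowed() is the ffz(~new_mask) idiom: the first zero bit of the complemented mask is the lowest allowed CPU. A portable sketch of that idiom (ffz_demo is an illustrative stand-in for the kernel's ffz(), not the kernel implementation):

```c
#include <assert.h>

/* illustrative stand-in for the kernel's ffz(): index of the first 0 bit */
static int ffz_demo(unsigned long word)
{
	int bit = 0;

	while (word & 1UL) {
		word >>= 1;
		bit++;
	}
	return bit;
}

/* as in set_cpus_allowed(): lowest CPU whose bit is set in new_mask */
static int pick_target_cpu(unsigned long new_mask)
{
	return ffz_demo(~new_mask);
}
```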
Ingo
[-- Attachment #2: Type: TEXT/PLAIN, Size: 5475 bytes --]
--- linux/kernel/sched.c.orig Sun Jan 20 11:38:59 2002
+++ linux/kernel/sched.c Sun Jan 20 11:39:16 2002
@@ -191,6 +191,20 @@
}
/*
+ * The SMP message passing code calls this function whenever
+ * the new task has arrived at the target CPU. We move the
+ * new task into the local runqueue.
+ *
+ * This function must be called with interrupts disabled.
+ */
+void sched_task_migrated(task_t *new_task)
+{
+ wait_task_inactive(new_task);
+ new_task->cpu = smp_processor_id();
+ wake_up_process(new_task);
+}
+
+/*
* Kick the remote CPU if the task is running currently,
* this code is used by the signal code to signal tasks
* which are in user-mode as quickly as possible.
@@ -611,7 +625,6 @@
if (likely(prev != next)) {
rq->nr_switches++;
rq->curr = next;
- next->cpu = prev->cpu;
context_switch(prev, next);
/*
* The runqueue pointer might be from another CPU
@@ -778,47 +791,23 @@
*/
void set_cpus_allowed(task_t *p, unsigned long new_mask)
{
- runqueue_t *this_rq = this_rq(), *target_rq;
- unsigned long this_mask = 1UL << smp_processor_id();
- int target_cpu;
-
new_mask &= cpu_online_map;
if (!new_mask)
BUG();
+
p->cpus_allowed = new_mask;
/*
* Can the task run on the current CPU? If not then
* migrate the process off to a proper CPU.
*/
- if (new_mask & this_mask)
+ if (new_mask & (1UL << smp_processor_id()))
return;
- target_cpu = ffz(~new_mask);
- target_rq = cpu_rq(target_cpu);
- if (target_cpu < smp_processor_id()) {
- spin_lock_irq(&target_rq->lock);
- spin_lock(&this_rq->lock);
- } else {
- spin_lock_irq(&this_rq->lock);
- spin_lock(&target_rq->lock);
- }
- dequeue_task(p, p->array);
- this_rq->nr_running--;
- target_rq->nr_running++;
- enqueue_task(p, target_rq->active);
- target_rq->curr->need_resched = 1;
- spin_unlock(&target_rq->lock);
+#if CONFIG_SMP
+ smp_migrate_task(ffz(~new_mask), current);
- /*
- * The easiest solution is to context switch into
- * the idle thread - which will pick the best task
- * afterwards:
- */
- this_rq->nr_switches++;
- this_rq->curr = this_rq->idle;
- this_rq->idle->need_resched = 1;
- context_switch(current, this_rq->idle);
- barrier();
- spin_unlock_irq(&this_rq()->lock);
+ current->state = TASK_UNINTERRUPTIBLE;
+ schedule();
+#endif
}
void scheduling_functions_end_here(void) { }
--- linux/include/linux/sched.h.orig Sun Jan 20 11:38:59 2002
+++ linux/include/linux/sched.h Sun Jan 20 11:39:16 2002
@@ -149,6 +149,8 @@
extern void update_one_process(struct task_struct *p, unsigned long user,
unsigned long system, int cpu);
extern void scheduler_tick(struct task_struct *p);
+extern void sched_task_migrated(struct task_struct *p);
+extern void smp_migrate_task(int cpu, task_t *task);
#define MAX_SCHEDULE_TIMEOUT LONG_MAX
extern signed long FASTCALL(schedule_timeout(signed long timeout));
--- linux/include/asm-i386/hw_irq.h.orig Thu Nov 22 20:46:18 2001
+++ linux/include/asm-i386/hw_irq.h Sun Jan 20 11:39:16 2002
@@ -35,13 +35,14 @@
* into a single vector (CALL_FUNCTION_VECTOR) to save vector space.
* TLB, reschedule and local APIC vectors are performance-critical.
*
- * Vectors 0xf0-0xfa are free (reserved for future Linux use).
+ * Vectors 0xf0-0xf9 are free (reserved for future Linux use).
*/
#define SPURIOUS_APIC_VECTOR 0xff
#define ERROR_APIC_VECTOR 0xfe
#define INVALIDATE_TLB_VECTOR 0xfd
#define RESCHEDULE_VECTOR 0xfc
-#define CALL_FUNCTION_VECTOR 0xfb
+#define TASK_MIGRATION_VECTOR 0xfb
+#define CALL_FUNCTION_VECTOR 0xfa
/*
* Local APIC timer IRQ vector is on a different priority level,
--- linux/arch/i386/kernel/i8259.c.orig Tue Sep 18 08:03:09 2001
+++ linux/arch/i386/kernel/i8259.c Sun Jan 20 11:39:16 2002
@@ -79,6 +79,7 @@
* through the ICC by us (IPIs)
*/
#ifdef CONFIG_SMP
+BUILD_SMP_INTERRUPT(task_migration_interrupt,TASK_MIGRATION_VECTOR)
BUILD_SMP_INTERRUPT(reschedule_interrupt,RESCHEDULE_VECTOR)
BUILD_SMP_INTERRUPT(invalidate_interrupt,INVALIDATE_TLB_VECTOR)
BUILD_SMP_INTERRUPT(call_function_interrupt,CALL_FUNCTION_VECTOR)
@@ -472,6 +473,9 @@
* IPI, driven by wakeup.
*/
set_intr_gate(RESCHEDULE_VECTOR, reschedule_interrupt);
+
+ /* IPI for task migration */
+ set_intr_gate(TASK_MIGRATION_VECTOR, task_migration_interrupt);
/* IPI for invalidation */
set_intr_gate(INVALIDATE_TLB_VECTOR, invalidate_interrupt);
--- linux/arch/i386/kernel/smp.c.orig Sun Jan 20 11:36:02 2002
+++ linux/arch/i386/kernel/smp.c Sun Jan 20 11:39:16 2002
@@ -485,6 +485,35 @@
do_flush_tlb_all_local();
}
+static spinlock_t migration_lock = SPIN_LOCK_UNLOCKED;
+static task_t *new_task;
+
+/*
+ * This function sends a 'task migration' IPI to another CPU.
+ * Must be called from syscall contexts, with interrupts *enabled*.
+ */
+void smp_migrate_task(int cpu, task_t *p)
+{
+ /*
+ * The target CPU will unlock the migration spinlock:
+ */
+ spin_lock(&migration_lock);
+ new_task = p;
+ send_IPI_mask(1 << cpu, TASK_MIGRATION_VECTOR);
+}
+
+/*
+ * Task migration callback.
+ */
+asmlinkage void smp_task_migration_interrupt(void)
+{
+ task_t *p;
+
+ ack_APIC_irq();
+ p = new_task;
+ spin_unlock(&migration_lock);
+ sched_task_migrated(p);
+}
/*
* this function sends a 'reschedule' IPI to another CPU.
* it goes straight through and wastes no time serializing
* Re: [sched] [patch] migration-fixes-2.5.3-pre2-A1
@ 2002-01-20 14:26 Manfred Spraul
From: Manfred Spraul @ 2002-01-20 14:26 UTC (permalink / raw)
To: Ingo Molnar; +Cc: linux-kernel
> */
> #define SPURIOUS_APIC_VECTOR 0xff
> #define ERROR_APIC_VECTOR 0xfe
> #define INVALIDATE_TLB_VECTOR 0xfd
> #define RESCHEDULE_VECTOR 0xfc
> -#define CALL_FUNCTION_VECTOR 0xfb
> +#define TASK_MIGRATION_VECTOR 0xfb
> +#define CALL_FUNCTION_VECTOR 0xfa
>
Are you sure it's a good idea to have 6 interrupts at priority 15?
The local APIC of the P6 has only one in-service entry and one holding
entry for each priority.
I'm not sure what happens if the entries overflow - either an interrupt
is lost or the message is resent.
<<<< Newest ia32 docu, vol3, chap. 7.6.4.3:
The P6 family and Pentium processor's local APIC includes an in-service
entry and a holding entry for each priority level. Optimally, software
should allocate no more than 2 interrupts per priority.
<<<<<
<<< Original PPro, Vol 3, chap 7.4.2:
The processor's local APIC includes an in-service entry and a holding
entry for each priority level. To avoid losing interrupts, software
should allocate no more than 2 interrupts per priority.
<<<<<
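the count of six follows from how the local APIC derives priority: a vector's priority class is its upper nibble (vector >> 4), so all of 0xfa-0xff land in class 15. A one-function sketch:

```c
#include <assert.h>

/* local APIC interrupt priority class: the upper 4 bits of the vector */
static int apic_priority_class(unsigned int vector)
{
	return vector >> 4;
}
```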
--
Manfred
* Re: [sched] [patch] migration-fixes-2.5.3-pre2-A1
@ 2002-01-20 21:59 ` Ingo Molnar
From: Ingo Molnar @ 2002-01-20 21:59 UTC (permalink / raw)
To: Manfred Spraul; +Cc: linux-kernel, Linus Torvalds
On Sun, 20 Jan 2002, Manfred Spraul wrote:
> > #define SPURIOUS_APIC_VECTOR 0xff
> > #define ERROR_APIC_VECTOR 0xfe
> > #define INVALIDATE_TLB_VECTOR 0xfd
> > #define RESCHEDULE_VECTOR 0xfc
> > -#define CALL_FUNCTION_VECTOR 0xfb
> > +#define TASK_MIGRATION_VECTOR 0xfb
> > +#define CALL_FUNCTION_VECTOR 0xfa
> >
> Are you sure it's a good idea to have 6 interrupts at priority 15? The
> local apic of the P6 has only one in-service entry and one holding
> entry for each priority.
we haven't been following this strict rule for some time. We are
distributing device interrupts according to this rule, but only as an
optimization, and only as long as the number of IRQ sources is smaller
than ~30.
i think the worst case is that we might lose a local APIC timer interrupt
- and this only on older APIC versions. Recent CPUs should handle this
correctly. AND, we have the local APIC timer interrupt on its own priority
level anyway.
So i think the P6 documentation is a pessimisation of the true situation,
and that we can very well have multiple interrupts on the same priority
level even on older APICs - as long as the local timer interrupt is not
amongst them.
Ingo
* Re: [sched] [patch] migration-fixes-2.5.3-pre2-A1
@ 2002-01-21 15:40 ` Manfred Spraul
From: Manfred Spraul @ 2002-01-21 15:40 UTC (permalink / raw)
To: mingo; +Cc: linux-kernel, Linus Torvalds
Ingo Molnar wrote:
>
> So i think the P6 documentation is a pessimisation of the true situation,
> and that we can very well have multiple interrupts on the same priority
> level even on older APICs - as long as the local timer interrupt is not
> amongst them.
>
You are right.
I tried to reproduce it (force cpu1 to priority 15, send 4 IPIs to
priority 14 from cpu0, check that all arrive), and it seems that the
PIII doesn't drop IPI messages.
--
Manfred