* [PATCH RFC] reduce runqueue lock contention
@ 2010-05-20 20:48 Chris Mason
2010-05-20 21:09 ` Peter Zijlstra
` (2 more replies)
0 siblings, 3 replies; 21+ messages in thread
From: Chris Mason @ 2010-05-20 20:48 UTC (permalink / raw)
To: Ingo Molnar, peterz, axboe; +Cc: linux-kernel
This is more of a starting point than a patch, but it is something I've
been meaning to look at for a long time. Many different workloads end
up hammering very hard on try_to_wake_up, to the point where the
runqueue locks dominate CPU profiles.
This includes benchmarking really fast IO subsystems, fast networking,
fast pipes...well, anywhere that we have lots of procs on lots of cpus waiting
for short periods on lots of different things.
I think we probably want to add a way to wait just for a little while as
a more lightweight operation (less balancing etc) but this patch doesn't
do that. It only tries to make the act of waking someone up less
expensive by avoiding the runqueue lock when we're on a different CPU
than the process we want to wake.
I do this with a per-runqueue list and some cmpxchg tricks. Basically
all the other CPUs will toss a given process they want to wakeup onto
the destination per-runqueue list. Later on, when that runqueue is
finding a new task to run, it processes the list in bulk.
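As a rough sketch of the idea in isolation (a user-space C11 model, not
the kernel patch itself; struct task and struct rq here are stand-ins,
and the model pushes to the head of the list where the real patch
appends with an xchg/cmpxchg dance):

/*
 * User-space model of the scheme: each "runqueue" carries a lock-free
 * multi-producer list.  Remote CPUs push pending wakeups with a
 * compare-and-swap; the owning CPU detaches the whole list with a
 * single exchange when it goes through schedule().
 */
#include <stdatomic.h>
#include <stdio.h>

struct task {
	int pid;
	struct task *next;			/* models p->fast_wakeup_next */
};

struct rq {
	_Atomic(struct task *) fast_wakeup;	/* models rq->fast_wakeup */
};

/* remote CPU: queue a wakeup without touching rq->lock */
static void queue_fast_wakeup(struct rq *rq, struct task *p)
{
	struct task *old = atomic_load(&rq->fast_wakeup);

	do {
		p->next = old;
	} while (!atomic_compare_exchange_weak(&rq->fast_wakeup, &old, p));
}

/* owning CPU: detach the whole list at once, then wake everything on it */
static void run_fast_queue(struct rq *rq)
{
	struct task *p = atomic_exchange(&rq->fast_wakeup, NULL);

	while (p) {
		struct task *next = p->next;

		printf("wake pid %d\n", p->pid);	/* stands in for try_to_wake_up() */
		p = next;
	}
}

int main(void)
{
	struct rq rq = { .fast_wakeup = NULL };
	struct task a = { .pid = 1 }, b = { .pid = 2 };

	queue_fast_wakeup(&rq, &a);
	queue_fast_wakeup(&rq, &b);
	run_fast_queue(&rq);			/* wakes pid 2, then pid 1 */
	return 0;
}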
From a performance point of view, this adds about 10% to a TPC
simulation on a large numa machine. It doubles the speed of my IPC semaphore
benchmark (which really just hammers on wakeup in a loop).
This is probably broken in less than subtle ways, so it includes a new
scheduler feature (FAST_WAKEUP) to control the new mode.
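For reference, a wakeup-hammering ping-pong can be as small as the
sketch below. This is not the actual benchmark behind the numbers
above, just the general shape, and it uses POSIX semaphores rather
than SysV ones for brevity; pin the two processes to different CPUs
(taskset) so the cross-CPU wakeup path is what gets exercised.

/* build: gcc -O2 -pthread sem_pingpong.c */
#include <semaphore.h>
#include <stdio.h>
#include <sys/mman.h>
#include <sys/wait.h>
#include <unistd.h>

int main(void)
{
	/* two process-shared semaphores in anonymous shared memory */
	sem_t *s = mmap(NULL, 2 * sizeof(sem_t), PROT_READ | PROT_WRITE,
			MAP_SHARED | MAP_ANONYMOUS, -1, 0);
	int i, iters = 100000;

	if (s == MAP_FAILED) { perror("mmap"); return 1; }
	sem_init(&s[0], 1, 0);
	sem_init(&s[1], 1, 0);

	if (fork() == 0) {		/* child: wait, then wake the parent */
		for (i = 0; i < iters; i++) {
			sem_wait(&s[0]);
			sem_post(&s[1]);
		}
		return 0;
	}

	for (i = 0; i < iters; i++) {	/* parent: wake the child, then wait */
		sem_post(&s[0]);
		sem_wait(&s[1]);
	}
	wait(NULL);
	return 0;
}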
diff --git a/include/linux/sched.h b/include/linux/sched.h
index dad7f66..9c2f188 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1187,6 +1187,8 @@ struct task_struct {
const struct sched_class *sched_class;
struct sched_entity se;
struct sched_rt_entity rt;
+ struct task_struct *fast_wakeup_next;
+ unsigned long fast_wakeup_mode;
#ifdef CONFIG_PREEMPT_NOTIFIERS
/* list of struct preempt_notifier: */
diff --git a/kernel/sched.c b/kernel/sched.c
index 3c2a54f..3667e89 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -530,6 +530,7 @@ struct rq {
unsigned long nr_uninterruptible;
struct task_struct *curr, *idle;
+ struct task_struct *fast_wakeup;
unsigned long next_balance;
struct mm_struct *prev_mm;
@@ -968,6 +969,24 @@ static struct rq *task_rq_lock(struct task_struct *p, unsigned long *flags)
}
}
+/*
+ * return the rq for a task without actually taking the lock
+ * this is racy, the task may get moved elsewhere
+ */
+static struct rq *task_rq_nolock(struct task_struct *p, unsigned long *flags)
+{
+ struct rq *rq;
+
+ for (;;) {
+ while (task_is_waking(p))
+ cpu_relax();
+ smp_mb();
+ rq = task_rq(p);
+ if (likely(rq == task_rq(p) && !task_is_waking(p)))
+ return rq;
+ }
+}
+
void task_rq_unlock_wait(struct task_struct *p)
{
struct rq *rq = task_rq(p);
@@ -1241,6 +1260,29 @@ void wake_up_idle_cpu(int cpu)
}
#endif /* CONFIG_NO_HZ */
+void force_idle_wake(int cpu)
+{
+ struct rq *rq = cpu_rq(cpu);
+
+ smp_mb();
+ if (rq->curr != rq->idle)
+ return;
+
+ if (cpu_is_offline(cpu))
+ return;
+ /*
+ * We can set TIF_RESCHED on the idle task of the other CPU
+ * lockless. The worst case is that the other CPU runs the
+ * idle task through an additional NOOP schedule()
+ */
+ set_tsk_need_resched(rq->idle);
+
+ /* NEED_RESCHED must be visible before we test polling */
+ smp_mb();
+ if (!tsk_is_polling(rq->idle))
+ smp_send_reschedule(cpu);
+}
+
static u64 sched_avg_period(void)
{
return (u64)sysctl_sched_time_avg * NSEC_PER_MSEC / 2;
@@ -2351,6 +2393,138 @@ int select_task_rq(struct task_struct *p, int sd_flags, int wake_flags)
}
#endif
+/*
+ * we stuff a 0x1 into the task fast_wakeup_next pointer to claim
+ * ownership of the pointer. Basically this allows us to make sure
+ * that only a single process is doing a fast queue on a given process at
+ * once.
+ *
+ * fastwake_next() masks off the 0x1 so we can check to see if the next
+ * pointer is null. We can't set it to null while the task is on the
+ * list because that would allow someone else to jump in and
+ * try to queue up the process again.
+ */
+static inline struct task_struct *fastwake_next(struct task_struct *p)
+{
+ return (struct task_struct *)((unsigned long)p->fast_wakeup_next &
+ ~(1UL));
+}
+
+/*
+ * accessing the runqueue lock from a remote numa node can
+ * be very expensive. To avoid contention on the runqueue lock
+ * we do a lockless cmpxchg operation to just stuff this task
+ * into a per-runq list of things that must be woken up.
+ *
+ * The list is processed each time we schedule.
+ *
+ * This returns 0 if all went well or -1 if you must
+ * proceed with the try_to_wake_up yourself.
+ */
+static int queue_fast_wakeup(struct task_struct *p, int mode)
+{
+ struct rq *rq;
+ unsigned long flags;
+ struct task_struct *old_tail;
+ struct task_struct *old_ptr;
+ struct task_struct *next;
+ int ret = 0;
+
+ if (!sched_feat(FAST_WAKEUP))
+ return -1;
+
+ old_ptr = cmpxchg(&p->fast_wakeup_next, NULL,
+ (struct task_struct *)0x1);
+
+ rq = task_rq_nolock(p, &flags);
+
+ /* if fast_wakeup_next wasn't NULL, we're already on some CPU's list
+ * of things to be woken. We don't know which list, and
+ * things can go badly if we assume this task isn't
+ * already running somewhere.
+ *
+ * If we return zero here, things go badly with exclusive wakeups
+ * because our caller thinks we've woken up a task when really this
+ * one might be running. To avoid those kinds of races and other
+ * issues, we return -1 and force the caller to go all the way
+ * through with the full try_to_wake_up.
+ */
+ if (old_ptr) {
+ if (mode == TASK_ALL)
+ p->fast_wakeup_mode = TASK_ALL;
+ ret = -1;
+ goto checks;
+ }
+ /*
+ * fast_wakeup_next is now set to 0x1, which means this task is our
+ * responsibility to wake up. We take a reference on the task struct
+ * for the list
+ */
+ get_task_struct(p);
+ old_tail = NULL;
+
+ /* offline cpus mean just use the real ttwu */
+ if (cpu_is_offline(rq->cpu)) {
+ p->fast_wakeup_next = NULL;
+ put_task_struct(p);
+ return -1;
+ }
+ p->fast_wakeup_mode = mode;
+
+ /* our xchg loop has two parts. We replace the
+ * rq fast_wakeup list with NULL and we append
+ * any procs that were in the list onto our list.
+ * Then we try to cmpxchg our updated list back into
+ * the rq.
+ *
+ * If the rq still has NULL, we're done. If not
+ * someone else has raced in and we loop around again
+ * pulling their updates off the list and appending them
+ * onto our own list
+ */
+ while (1) {
+ old_ptr = xchg(&rq->fast_wakeup, NULL);
+ if (old_ptr) {
+ if (old_tail) {
+ while (1) {
+ next = fastwake_next(old_tail);
+ if (!next)
+ break;
+ old_tail = next;
+ }
+ } else {
+ old_tail = p;
+ }
+ old_tail->fast_wakeup_next = old_ptr;
+ }
+ old_ptr = cmpxchg(&rq->fast_wakeup, NULL, p);
+ if (!old_ptr)
+ break;
+ }
+checks:
+ smp_mb();
+ /*
+ * if the process state changed after they went into the
+ * dedicated list, go through the full try_to_wake_up code
+ * We don't want to return that we've woken a task if it
+ * was actually already running on its own.
+ *
+ * But make sure to do the reschedule kick so that our
+ * task is eventually taken off the list for that CPU.
+ */
+ if (!(p->state & mode) || p->se.on_rq || task_running(rq, p))
+ ret = -1;
+
+ if (unlikely(rt_prio(p->prio)))
+ resched_cpu(rq->cpu);
+ else
+ force_idle_wake(rq->cpu);
+
+ if (cpu_is_offline(rq->cpu))
+ ret = -1;
+ return ret;
+}
+
/***
* try_to_wake_up - wake up a thread
* @p: the to-be-woken-up thread
@@ -2377,6 +2551,20 @@ static int try_to_wake_up(struct task_struct *p, unsigned int state,
this_cpu = get_cpu();
+ if (sched_feat(FAST_WAKEUP)) {
+ smp_mb();
+ /* we only want to do the fast wakeup if
+ * the task isn't already running.
+ */
+ if (this_cpu != task_cpu(p) && wake_flags == 0 &&
+ !p->se.on_rq && p->pid != 0 && (p->state & state)) {
+ if (queue_fast_wakeup(p, state) == 0) {
+ put_cpu();
+ return 1;
+ }
+ }
+ }
+
smp_wmb();
rq = task_rq_lock(p, &flags);
update_rq_clock(rq);
@@ -2499,10 +2687,42 @@ out_running:
out:
task_rq_unlock(rq, &flags);
put_cpu();
-
return success;
}
+/*
+ * run all the wakeups we've queued on the rq.
+ */
+static void run_fast_queue(struct rq *rq)
+{
+ struct task_struct *head;
+ struct task_struct *next;
+ unsigned long mode;
+
+ if (!sched_feat(FAST_WAKEUP) && !rq->fast_wakeup)
+ return;
+
+again:
+ /*
+ * we swap the list pointer on the rq with NULL
+ * and then we are able to go through the
+ * entire list without any locks
+ */
+ head = xchg(&rq->fast_wakeup, NULL);
+
+ while (head) {
+ next = fastwake_next(head);
+ mode = head->fast_wakeup_mode;
+ head->fast_wakeup_next = NULL;
+ try_to_wake_up(head, mode, 0);
+ put_task_struct(head);
+ head = next;
+ }
+ smp_mb();
+ if (rq->fast_wakeup)
+ goto again;
+}
+
/**
* wake_up_process - Wake up a specific process
* @p: The process to be woken up.
@@ -2533,6 +2753,7 @@ int wake_up_state(struct task_struct *p, unsigned int state)
*/
static void __sched_fork(struct task_struct *p)
{
+ p->fast_wakeup_next = NULL;
p->se.exec_start = 0;
p->se.sum_exec_runtime = 0;
p->se.prev_sum_exec_runtime = 0;
@@ -2858,6 +3079,7 @@ static inline void post_schedule(struct rq *rq)
rq->post_schedule = 0;
}
+ run_fast_queue(rq);
}
#else
@@ -3711,6 +3933,9 @@ need_resched:
switch_count = &prev->nivcsw;
release_kernel_lock(prev);
+
+ run_fast_queue(rq);
+
need_resched_nonpreemptible:
schedule_debug(prev);
@@ -5674,7 +5899,6 @@ static void migrate_dead_tasks(unsigned int dead_cpu)
break;
next->sched_class->put_prev_task(rq, next);
migrate_dead(dead_cpu, next);
-
}
}
@@ -5944,9 +6168,11 @@ migration_call(struct notifier_block *nfb, unsigned long action, void *hcpu)
case CPU_DEAD:
case CPU_DEAD_FROZEN:
+ rq = cpu_rq(cpu);
+ run_fast_queue(rq);
+
cpuset_lock(); /* around calls to cpuset_cpus_allowed_lock() */
migrate_live_tasks(cpu);
- rq = cpu_rq(cpu);
kthread_stop(rq->migration_thread);
put_task_struct(rq->migration_thread);
rq->migration_thread = NULL;
@@ -7747,6 +7973,7 @@ void __init sched_init(void)
rq = cpu_rq(i);
raw_spin_lock_init(&rq->lock);
+ rq->fast_wakeup = NULL;
rq->nr_running = 0;
rq->calc_load_active = 0;
rq->calc_load_update = jiffies + LOAD_FREQ;
diff --git a/kernel/sched_features.h b/kernel/sched_features.h
index d5059fd..044e93c 100644
--- a/kernel/sched_features.h
+++ b/kernel/sched_features.h
@@ -116,3 +116,5 @@ SCHED_FEAT(ASYM_EFF_LOAD, 1)
* release the lock. Decreases scheduling overhead.
*/
SCHED_FEAT(OWNER_SPIN, 1)
+
+SCHED_FEAT(FAST_WAKEUP, 0)
* Re: [PATCH RFC] reduce runqueue lock contention
2010-05-20 20:48 [PATCH RFC] reduce runqueue lock contention Chris Mason
@ 2010-05-20 21:09 ` Peter Zijlstra
2010-05-20 21:23 ` Peter Zijlstra
2010-05-20 22:21 ` Chris Mason
2010-06-04 10:56 ` Stijn Devriendt
2010-06-21 10:02 ` Peter Zijlstra
2 siblings, 2 replies; 21+ messages in thread
From: Peter Zijlstra @ 2010-05-20 21:09 UTC (permalink / raw)
To: Chris Mason; +Cc: Ingo Molnar, axboe, linux-kernel
On Thu, 2010-05-20 at 16:48 -0400, Chris Mason wrote:
>
> This is more of a starting point than a patch, but it is something I've
> been meaning to look at for a long time. Many different workloads end
> up hammering very hard on try_to_wake_up, to the point where the
> runqueue locks dominate CPU profiles.
Right, so one of the things that I considered was to make p->state an
atomic_t and replace the initial stage of try_to_wake_up() with
something like:
int try_to_wake_up(struct task *p, unsigned int mask, wake_flags)
{
	int state = atomic_read(&p->state);

	do {
		if (!(state & mask))
			return 0;

		state = atomic_cmpxchg(&p->state, state, TASK_WAKING);
	} while (state != TASK_WAKING);

	/* do this pending queue + ipi thing */

	return 1;
}
Also, I think we might want to put that atomic single linked list thing
into some header (using atomic_long_t or so), because I have a similar
thing living in kernel/perf_event.c, that needs to queue things from NMI
context.
The advantage of doing basically the whole enqueue on the remote cpu is
less cacheline bouncing of the runqueue structures.
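For the curious, such a helper could look roughly like the user-space
sketch below (illustrative names only, with C11 atomics standing in
for cmpxchg/xchg on an atomic_long_t): the push never takes a lock,
so it is usable from about any context, and the consumer steals the
whole list with one exchange.

/*
 * Sketch of a generic lock-free single-linked list helper; this is
 * not an existing kernel header, just the shape of one.
 */
#include <stdatomic.h>
#include <stddef.h>

struct aslist_node {
	struct aslist_node *next;
};

struct aslist_head {
	_Atomic(struct aslist_node *) first;
};

/* push one node; safe against concurrent pushers and the consumer */
static inline void aslist_add(struct aslist_head *h, struct aslist_node *n)
{
	struct aslist_node *first = atomic_load(&h->first);

	do {
		n->next = first;
	} while (!atomic_compare_exchange_weak(&h->first, &first, n));
}

/* detach the entire list; the caller now owns every node on it */
static inline struct aslist_node *aslist_del_all(struct aslist_head *h)
{
	return atomic_exchange(&h->first, NULL);
}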
* Re: [PATCH RFC] reduce runqueue lock contention
2010-05-20 21:09 ` Peter Zijlstra
@ 2010-05-20 21:23 ` Peter Zijlstra
2010-05-20 22:17 ` Chris Mason
2010-05-20 22:21 ` Chris Mason
1 sibling, 1 reply; 21+ messages in thread
From: Peter Zijlstra @ 2010-05-20 21:23 UTC (permalink / raw)
To: Chris Mason; +Cc: Ingo Molnar, axboe, linux-kernel
On Thu, 2010-05-20 at 23:09 +0200, Peter Zijlstra wrote:
> int try_to_wake_up(struct task *p, unsigned int mask, wake_flags)
> {
> int state = atomic_read(&p->state);
>
> do {
> if (!(state & mask))
> return 0;
>
> state = atomic_cmpxchg(&p->state, state, TASK_WAKING);
> } while (state != TASK_WAKING);
cpu = select_task_rq()
and then somehow see we get set_task_cpu() done without races :-)
> /* do this pending queue + ipi thing */
>
> return 1;
> }
* Re: [PATCH RFC] reduce runqueue lock contention
2010-05-20 21:23 ` Peter Zijlstra
@ 2010-05-20 22:17 ` Chris Mason
0 siblings, 0 replies; 21+ messages in thread
From: Chris Mason @ 2010-05-20 22:17 UTC (permalink / raw)
To: Peter Zijlstra; +Cc: Ingo Molnar, axboe, linux-kernel
On Thu, May 20, 2010 at 11:23:12PM +0200, Peter Zijlstra wrote:
> On Thu, 2010-05-20 at 23:09 +0200, Peter Zijlstra wrote:
>
> > int try_to_wake_up(struct task *p, unsigned int mask, wake_flags)
> > {
> > int state = atomic_read(&p->state);
> >
> > do {
> > if (!(state & mask))
> > return 0;
> >
> > state = atomic_cmpxchg(&p->state, state, TASK_WAKING);
> > } while (state != TASK_WAKING);
>
> cpu = select_task_rq()
>
> and then somehow see we get set_task_cpu() done without races :-)
>
> > /* do this pending queue + ipi thing */
> >
> > return 1;
> > }
I tried not to set the task waking, since I was worried about races with
us getting queued somewhere else. But, I don't have a good handle on
all of that so I kind of chickened out. That's why my code falls back
to the full ttwu in a few cases.
Do you think the above could be an addition to my patch or that it's
required for my patch to work well?
-chris
* Re: [PATCH RFC] reduce runqueue lock contention
2010-05-20 21:09 ` Peter Zijlstra
2010-05-20 21:23 ` Peter Zijlstra
@ 2010-05-20 22:21 ` Chris Mason
1 sibling, 0 replies; 21+ messages in thread
From: Chris Mason @ 2010-05-20 22:21 UTC (permalink / raw)
To: Peter Zijlstra; +Cc: Ingo Molnar, axboe, linux-kernel
On Thu, May 20, 2010 at 11:09:46PM +0200, Peter Zijlstra wrote:
> On Thu, 2010-05-20 at 16:48 -0400, Chris Mason wrote:
> >
> > This is more of a starting point than a patch, but it is something I've
> > been meaning to look at for a long time. Many different workloads end
> > up hammering very hard on try_to_wake_up, to the point where the
> > runqueue locks dominate CPU profiles.
>
> Right, so one of the things that I considered was to make p->state an
> atomic_t and replace the initial stage of try_to_wake_up() with
> something like:
>
> int try_to_wake_up(struct task *p, unsigned int mask, wake_flags)
> {
> int state = atomic_read(&p->state);
>
> do {
> if (!(state & mask))
> return 0;
>
> state = atomic_cmpxchg(&p->state, state, TASK_WAKING);
> } while (state != TASK_WAKING);
>
> /* do this pending queue + ipi thing */
>
> return 1;
> }
>
> Also, I think we might want to put that atomic single linked list thing
> into some header (using atomic_long_t or so), because I have a similar
> thing living in kernel/perf_event.c, that needs to queue things from NMI
> context.
So I've done three of these cmpxchg lists recently...but they have all
been a little different. I went back and forth a bunch of times about
using a list_head based thing instead to avoid the walk for list append.
I really don't like the walk.
But, what makes this one unique is that I'm using a cmpxchg on the list
pointer in the task struct to take ownership of this task struct.
It is how I avoid concurrent lockless enqueues.
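In other words, roughly the following (a user-space sketch with
illustrative names; 0x1 is the same sentinel value the patch uses):

#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

#define CLAIMED ((void *)0x1)	/* sentinel: "on some CPU's list" */

struct task {
	_Atomic(void *) fast_wakeup_next;
};

/* returns true if we won the race and may enqueue the task */
static bool claim_task(struct task *p)
{
	void *expected = NULL;

	return atomic_compare_exchange_strong(&p->fast_wakeup_next,
					      &expected, CLAIMED);
}

/* the draining CPU clears the pointer once the task has been woken */
static void release_task(struct task *p)
{
	atomic_store(&p->fast_wakeup_next, NULL);
}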
Your fiddling with the p->state above would let me avoid that.
-chris
* Re: [PATCH RFC] reduce runqueue lock contention
2010-05-20 20:48 [PATCH RFC] reduce runqueue lock contention Chris Mason
2010-05-20 21:09 ` Peter Zijlstra
@ 2010-06-04 10:56 ` Stijn Devriendt
2010-06-04 12:00 ` Peter Zijlstra
2010-06-21 10:02 ` Peter Zijlstra
2 siblings, 1 reply; 21+ messages in thread
From: Stijn Devriendt @ 2010-06-04 10:56 UTC (permalink / raw)
To: Chris Mason, Ingo Molnar, peterz, axboe, linux-kernel
On Thu, May 20, 2010 at 10:48 PM, Chris Mason <chris.mason@oracle.com> wrote:
> I think we probably want to add a way to wait just for a little while as
> a more lightweight operation (less balancing etc) but this patch doesn't
> do that. It only tries to make the act of waking someone up less
> expensive by avoiding the runqueue lock when we're on a different CPU
> than the process we want to wake.
>
> I do this with a per-runqueue list and some cmpxchg tricks. Basically
> all the other CPUs will toss a given process they want to wakeup onto
> the destination per-runqueue list. Later on, when that runqueue is
> finding a new task to run, it processes the list in bulk.
I actually have the reverse lying around somewhere (even more broken, probably)
to allow nested wakeups from the scheduler.
The issue I want to tackle is waking up processes when others go to sleep.
This means try_to_wake_up() from inside the runqueue lock.
I used a simple per_cpu taskqueue that tasks are put on when woken
during schedule(). At the end of schedule() I empty the list and reschedule
as one of the newly woken tasks may be higher prio.
I'm wondering if both approaches can be merged by checking this list before
and after every schedule().
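Roughly, the shape of such a per-CPU deferred list looks like the
sketch below (wake_deferred()/flush_deferred() are hypothetical names,
__thread stands in for a real per-CPU variable, and do_wakeup() is a
stub for try_to_wake_up()).

#include <stdio.h>

struct task { int pid; };

#define MAX_DEFERRED 32

/* per-CPU in the real thing */
static __thread struct task *deferred[MAX_DEFERRED];
static __thread int nr_deferred;

/* stub for the real wakeup, which must not run under rq->lock */
static void do_wakeup(struct task *p)
{
	printf("wake pid %d\n", p->pid);
}

/* called instead of try_to_wake_up() while rq->lock is held */
static void wake_deferred(struct task *p)
{
	if (nr_deferred < MAX_DEFERRED)
		deferred[nr_deferred++] = p;
}

/* called at the end of schedule(), after rq->lock has been dropped */
static void flush_deferred(void)
{
	while (nr_deferred)
		do_wakeup(deferred[--nr_deferred]);
	/* a newly woken task may be higher prio, so re-check need_resched here */
}

int main(void)
{
	struct task a = { .pid = 1 };

	wake_deferred(&a);	/* recorded while "holding" the lock */
	flush_deferred();	/* actual wakeup after the lock is released */
	return 0;
}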
Stijn
* Re: [PATCH RFC] reduce runqueue lock contention
2010-06-04 10:56 ` Stijn Devriendt
@ 2010-06-04 12:00 ` Peter Zijlstra
2010-06-05 9:37 ` Stijn Devriendt
0 siblings, 1 reply; 21+ messages in thread
From: Peter Zijlstra @ 2010-06-04 12:00 UTC (permalink / raw)
To: Stijn Devriendt; +Cc: Chris Mason, Ingo Molnar, axboe, linux-kernel, Tejun Heo
On Fri, 2010-06-04 at 12:56 +0200, Stijn Devriendt wrote:
> The issue I want to tackle is waking up processes when others go to sleep.
> This means try_to_wake_up() from inside the runqueue lock.
Tejun did something for that:
http://lkml.org/lkml/2010/5/13/136
* Re: [PATCH RFC] reduce runqueue lock contention
2010-06-04 12:00 ` Peter Zijlstra
@ 2010-06-05 9:37 ` Stijn Devriendt
0 siblings, 0 replies; 21+ messages in thread
From: Stijn Devriendt @ 2010-06-05 9:37 UTC (permalink / raw)
To: Peter Zijlstra; +Cc: Chris Mason, Ingo Molnar, axboe, linux-kernel, Tejun Heo
On Fri, Jun 4, 2010 at 2:00 PM, Peter Zijlstra <peterz@infradead.org> wrote:
> On Fri, 2010-06-04 at 12:56 +0200, Stijn Devriendt wrote:
>
>> The issue I want to tackle is waking up processes when others go to sleep.
>> This means try_to_wake_up() from inside the runqueue lock.
>
>
> Tejun did something for that:
>
> http://lkml.org/lkml/2010/5/13/136
>
>
>
The difference with what I have is that Tejun's threads are guaranteed
to be local and on the same CPU/runqueue, avoiding runqueue deadlocks.
My code can be used to wake up any thread, as I basically defer waking
up until after the runqueue lock is released.
* Re: [PATCH RFC] reduce runqueue lock contention
2010-05-20 20:48 [PATCH RFC] reduce runqueue lock contention Chris Mason
2010-05-20 21:09 ` Peter Zijlstra
2010-06-04 10:56 ` Stijn Devriendt
@ 2010-06-21 10:02 ` Peter Zijlstra
2010-06-21 10:54 ` Peter Zijlstra
2 siblings, 1 reply; 21+ messages in thread
From: Peter Zijlstra @ 2010-06-21 10:02 UTC (permalink / raw)
To: Chris Mason; +Cc: Ingo Molnar, axboe, linux-kernel, Mike Galbraith
To return the favour, here's a scary patch that renders your machine a
doorstop :-) Sadly it just hangs, no splat, no sysrq, no nmi, no triple
fault.
It loses the ttwu task_running() check, as I must admit I'm not quite
sure what it does.. Ingo?
It makes the whole TASK_WAKING thing very interesting again :-)
It also re-introduces a bunch of races because we now need to run
->select_task_rq() without holding the rq->lock. We cannot defer running
that because it really wants to know the cpu the wakeup originated from
(affine wakeups and the like).
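The queue-and-kick flow itself, modelled in user space (illustrative
only; smp_send_reschedule() is a stub for the real IPI): only the
waker that installs the first entry sends the IPI, so a burst of
remote wakeups costs a single interrupt.

#include <stdatomic.h>
#include <stdio.h>
#include <stddef.h>

struct task {
	struct task *wake_entry;
	int pid;
};

static _Atomic(struct task *) wake_list;	/* models rq->wake_list */

static void smp_send_reschedule(int cpu)	/* stub for the real IPI */
{
	printf("IPI -> cpu %d\n", cpu);
}

/* waker side: queue p on the remote CPU's pending list */
static void ttwu_queue_remote(struct task *p, int cpu)
{
	struct task *next = atomic_load(&wake_list);

	do {
		p->wake_entry = next;
	} while (!atomic_compare_exchange_weak(&wake_list, &next, p));

	if (!next)			/* list was empty: kick the remote CPU */
		smp_send_reschedule(cpu);
}

/* target side: drain the list from the reschedule interrupt */
static void sched_ttwu_pending(void)
{
	struct task *p = atomic_exchange(&wake_list, NULL);

	for (; p; p = p->wake_entry)
		printf("activate pid %d locally\n", p->pid);
}

int main(void)
{
	struct task a = { .pid = 1 }, b = { .pid = 2 };

	ttwu_queue_remote(&a, 3);	/* first entry: sends the IPI */
	ttwu_queue_remote(&b, 3);	/* list non-empty: no second IPI */
	sched_ttwu_pending();
	return 0;
}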
---
arch/x86/kernel/smp.c | 1 +
include/linux/sched.h | 7 +-
kernel/sched.c | 225 ++++++++++++++++++++++++-----------------------
kernel/sched_fair.c | 10 +--
kernel/sched_idletask.c | 2 +-
kernel/sched_rt.c | 4 +-
6 files changed, 127 insertions(+), 122 deletions(-)
diff --git a/arch/x86/kernel/smp.c b/arch/x86/kernel/smp.c
index d801210..1e487d6 100644
--- a/arch/x86/kernel/smp.c
+++ b/arch/x86/kernel/smp.c
@@ -202,6 +202,7 @@ void smp_reschedule_interrupt(struct pt_regs *regs)
/*
* KVM uses this interrupt to force a cpu out of guest mode
*/
+ sched_ttwu_pending();
}
void smp_call_function_interrupt(struct pt_regs *regs)
diff --git a/include/linux/sched.h b/include/linux/sched.h
index a61c08c..9ae9fdb 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1003,6 +1003,7 @@ partition_sched_domains(int ndoms_new, cpumask_var_t doms_new[],
}
#endif /* !CONFIG_SMP */
+void sched_ttwu_pending(void);
struct io_context; /* See blkdev.h */
@@ -1046,12 +1047,11 @@ struct sched_class {
void (*put_prev_task) (struct rq *rq, struct task_struct *p);
#ifdef CONFIG_SMP
- int (*select_task_rq)(struct rq *rq, struct task_struct *p,
- int sd_flag, int flags);
+ int (*select_task_rq)(struct task_struct *p, int sd_flag, int flags);
void (*pre_schedule) (struct rq *this_rq, struct task_struct *task);
void (*post_schedule) (struct rq *this_rq);
- void (*task_waking) (struct rq *this_rq, struct task_struct *task);
+ void (*task_waking) (struct task_struct *task);
void (*task_woken) (struct rq *this_rq, struct task_struct *task);
void (*set_cpus_allowed)(struct task_struct *p,
@@ -1174,6 +1174,7 @@ struct task_struct {
int lock_depth; /* BKL lock depth */
#ifdef CONFIG_SMP
+ struct task_struct *wake_entry;
#ifdef __ARCH_WANT_UNLOCKED_CTXSW
int oncpu;
#endif
diff --git a/kernel/sched.c b/kernel/sched.c
index b697606..ef90585 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -518,6 +518,8 @@ struct rq {
u64 age_stamp;
u64 idle_stamp;
u64 avg_idle;
+
+ struct task_struct *wake_list;
#endif
/* calc_load related fields */
@@ -944,7 +946,7 @@ static inline struct rq *__task_rq_lock(struct task_struct *p)
for (;;) {
rq = task_rq(p);
raw_spin_lock(&rq->lock);
- if (likely(rq == task_rq(p)))
+ if (likely(rq == task_rq(p)) && !task_is_waking(p))
return rq;
raw_spin_unlock(&rq->lock);
}
@@ -964,7 +966,7 @@ static struct rq *task_rq_lock(struct task_struct *p, unsigned long *flags)
local_irq_save(*flags);
rq = task_rq(p);
raw_spin_lock(&rq->lock);
- if (likely(rq == task_rq(p)))
+ if (likely(rq == task_rq(p)) && !task_is_waking(p))
return rq;
raw_spin_unlock_irqrestore(&rq->lock, *flags);
}
@@ -2257,9 +2259,9 @@ static int select_fallback_rq(int cpu, struct task_struct *p)
* The caller (fork, wakeup) owns TASK_WAKING, ->cpus_allowed is stable.
*/
static inline
-int select_task_rq(struct rq *rq, struct task_struct *p, int sd_flags, int wake_flags)
+int select_task_rq(struct task_struct *p, int sd_flags, int wake_flags)
{
- int cpu = p->sched_class->select_task_rq(rq, p, sd_flags, wake_flags);
+ int cpu = p->sched_class->select_task_rq(p, sd_flags, wake_flags);
/*
* In order not to call set_task_cpu() on a blocking task we need
@@ -2330,111 +2332,6 @@ static inline void ttwu_post_activation(struct task_struct *p, struct rq *rq,
}
/**
- * try_to_wake_up - wake up a thread
- * @p: the thread to be awakened
- * @state: the mask of task states that can be woken
- * @wake_flags: wake modifier flags (WF_*)
- *
- * Put it on the run-queue if it's not already there. The "current"
- * thread is always on the run-queue (except when the actual
- * re-schedule is in progress), and as such you're allowed to do
- * the simpler "current->state = TASK_RUNNING" to mark yourself
- * runnable without the overhead of this.
- *
- * Returns %true if @p was woken up, %false if it was already running
- * or @state didn't match @p's state.
- */
-static int try_to_wake_up(struct task_struct *p, unsigned int state,
- int wake_flags)
-{
- int cpu, orig_cpu, this_cpu, success = 0;
- unsigned long flags;
- unsigned long en_flags = ENQUEUE_WAKEUP;
- struct rq *rq;
-
- this_cpu = get_cpu();
-
- smp_wmb();
- rq = task_rq_lock(p, &flags);
- if (!(p->state & state))
- goto out;
-
- if (p->se.on_rq)
- goto out_running;
-
- cpu = task_cpu(p);
- orig_cpu = cpu;
-
-#ifdef CONFIG_SMP
- if (unlikely(task_running(rq, p)))
- goto out_activate;
-
- /*
- * In order to handle concurrent wakeups and release the rq->lock
- * we put the task in TASK_WAKING state.
- *
- * First fix up the nr_uninterruptible count:
- */
- if (task_contributes_to_load(p)) {
- if (likely(cpu_online(orig_cpu)))
- rq->nr_uninterruptible--;
- else
- this_rq()->nr_uninterruptible--;
- }
- p->state = TASK_WAKING;
-
- if (p->sched_class->task_waking) {
- p->sched_class->task_waking(rq, p);
- en_flags |= ENQUEUE_WAKING;
- }
-
- cpu = select_task_rq(rq, p, SD_BALANCE_WAKE, wake_flags);
- if (cpu != orig_cpu)
- set_task_cpu(p, cpu);
- __task_rq_unlock(rq);
-
- rq = cpu_rq(cpu);
- raw_spin_lock(&rq->lock);
-
- /*
- * We migrated the task without holding either rq->lock, however
- * since the task is not on the task list itself, nobody else
- * will try and migrate the task, hence the rq should match the
- * cpu we just moved it to.
- */
- WARN_ON(task_cpu(p) != cpu);
- WARN_ON(p->state != TASK_WAKING);
-
-#ifdef CONFIG_SCHEDSTATS
- schedstat_inc(rq, ttwu_count);
- if (cpu == this_cpu)
- schedstat_inc(rq, ttwu_local);
- else {
- struct sched_domain *sd;
- for_each_domain(this_cpu, sd) {
- if (cpumask_test_cpu(cpu, sched_domain_span(sd))) {
- schedstat_inc(sd, ttwu_wake_remote);
- break;
- }
- }
- }
-#endif /* CONFIG_SCHEDSTATS */
-
-out_activate:
-#endif /* CONFIG_SMP */
- ttwu_activate(p, rq, wake_flags & WF_SYNC, orig_cpu != cpu,
- cpu == this_cpu, en_flags);
- success = 1;
-out_running:
- ttwu_post_activation(p, rq, wake_flags, success);
-out:
- task_rq_unlock(rq, &flags);
- put_cpu();
-
- return success;
-}
-
-/**
* try_to_wake_up_local - try to wake up a local task with rq lock held
* @p: the thread to be awakened
*
@@ -2465,6 +2362,112 @@ static void try_to_wake_up_local(struct task_struct *p)
ttwu_post_activation(p, rq, 0, success);
}
+static int ttwu_running(struct task_struct *p, int wake_flags)
+{
+ struct rq *rq;
+
+ for (;;) {
+ rq = task_rq(p);
+ raw_spin_lock(&rq->lock);
+ if (likely(rq == task_rq(p)))
+ break;
+ raw_spin_unlock(&rq->lock);
+ }
+
+ if (!p->se.on_rq) {
+ raw_spin_unlock(&rq->lock);
+ return 0;
+ }
+
+ ttwu_post_activation(p, rq, 0, true);
+ return 1;
+}
+
+static void ttwu_do_activate(struct rq *rq, struct task_struct *p, bool local)
+{
+ set_task_cpu(p, cpu_of(rq));
+ ttwu_activate(p, rq, false, !local, local,
+ ENQUEUE_WAKEUP|ENQUEUE_WAKING);
+ ttwu_post_activation(p, rq, 0, true);
+}
+
+void sched_ttwu_pending(void)
+{
+ struct task_struct *list = xchg(&this_rq()->wake_list, NULL);
+ struct rq *rq = this_rq();
+
+ if (!list)
+ return;
+
+ raw_spin_lock(&rq->lock);
+
+ while (list) {
+ struct task_struct *p = list;
+ list = list->wake_entry;
+ ttwu_do_activate(rq, p, false);
+ }
+
+ raw_spin_unlock(&rq->lock);
+}
+
+static void ttwu_queue(struct task_struct *p, int cpu)
+{
+ struct task_struct *next = NULL;
+ struct rq *rq = cpu_rq(cpu);
+
+ if (cpu == smp_processor_id()) {
+ raw_spin_lock(&rq->lock);
+ ttwu_do_activate(rq, p, true);
+ raw_spin_unlock(&rq->lock);
+ return;
+ }
+
+ for (;;) {
+ struct task_struct *old = next;
+
+ p->wake_entry = next;
+ next = cmpxchg(&rq->wake_list, old, p);
+ if (next == old)
+ break;
+ }
+
+ if (!next)
+ resched_cpu(cpu);
+}
+
+static int
+try_to_wake_up(struct task_struct *p, unsigned int state, int wake_flags)
+{
+ unsigned long flags;
+ int success = 0;
+ int cpu;
+
+ local_irq_save(flags);
+ for (;;) {
+ unsigned int task_state = p->state;
+
+ if (!(task_state & state))
+ goto out;
+
+ if (cmpxchg(&p->state, task_state, TASK_WAKING) == task_state)
+ break;
+ }
+
+ success = 1;
+
+ if (unlikely(p->se.on_rq) && ttwu_running(p, wake_flags))
+ goto out;
+
+ if (p->sched_class->task_waking)
+ p->sched_class->task_waking(p);
+
+ cpu = select_task_rq(p, SD_BALANCE_WAKE, wake_flags);
+ ttwu_queue(p, cpu);
+out:
+ local_irq_restore(flags);
+ return success;
+}
+
/**
* wake_up_process - Wake up a specific process
* @p: The process to be woken up.
@@ -2604,7 +2607,7 @@ void wake_up_new_task(struct task_struct *p, unsigned long clone_flags)
* We set TASK_WAKING so that select_task_rq() can drop rq->lock
* without people poking at ->cpus_allowed.
*/
- cpu = select_task_rq(rq, p, SD_BALANCE_FORK, 0);
+ cpu = select_task_rq(p, SD_BALANCE_FORK, 0);
set_task_cpu(p, cpu);
p->state = TASK_RUNNING;
@@ -3200,7 +3203,7 @@ void sched_exec(void)
int dest_cpu;
rq = task_rq_lock(p, &flags);
- dest_cpu = p->sched_class->select_task_rq(rq, p, SD_BALANCE_EXEC, 0);
+ dest_cpu = p->sched_class->select_task_rq(p, SD_BALANCE_EXEC, 0);
if (dest_cpu == smp_processor_id())
goto unlock;
diff --git a/kernel/sched_fair.c b/kernel/sched_fair.c
index 5e8f98c..2bd145f 100644
--- a/kernel/sched_fair.c
+++ b/kernel/sched_fair.c
@@ -1128,11 +1128,12 @@ static void yield_task_fair(struct rq *rq)
#ifdef CONFIG_SMP
-static void task_waking_fair(struct rq *rq, struct task_struct *p)
+static void task_waking_fair(struct task_struct *p)
{
struct sched_entity *se = &p->se;
struct cfs_rq *cfs_rq = cfs_rq_of(se);
+ // XXX racy again,.. we need to read cfs_rq->min_vruntime atomically
se->vruntime -= cfs_rq->min_vruntime;
}
@@ -1443,7 +1444,7 @@ static int select_idle_sibling(struct task_struct *p, int target)
* preempt must be disabled.
*/
static int
-select_task_rq_fair(struct rq *rq, struct task_struct *p, int sd_flag, int wake_flags)
+select_task_rq_fair(struct task_struct *p, int sd_flag, int wake_flags)
{
struct sched_domain *tmp, *affine_sd = NULL, *sd = NULL;
int cpu = smp_processor_id();
@@ -1516,11 +1517,8 @@ select_task_rq_fair(struct rq *rq, struct task_struct *p, int sd_flag, int wake_
if (affine_sd && (!tmp || affine_sd->span_weight > sd->span_weight))
tmp = affine_sd;
- if (tmp) {
- raw_spin_unlock(&rq->lock);
+ if (tmp)
update_shares(tmp);
- raw_spin_lock(&rq->lock);
- }
}
#endif
diff --git a/kernel/sched_idletask.c b/kernel/sched_idletask.c
index 9fa0f40..efaed8c 100644
--- a/kernel/sched_idletask.c
+++ b/kernel/sched_idletask.c
@@ -7,7 +7,7 @@
#ifdef CONFIG_SMP
static int
-select_task_rq_idle(struct rq *rq, struct task_struct *p, int sd_flag, int flags)
+select_task_rq_idle(struct task_struct *p, int sd_flag, int flags)
{
return task_cpu(p); /* IDLE tasks as never migrated */
}
diff --git a/kernel/sched_rt.c b/kernel/sched_rt.c
index d10c80e..1011cdc 100644
--- a/kernel/sched_rt.c
+++ b/kernel/sched_rt.c
@@ -949,8 +949,10 @@ static void yield_task_rt(struct rq *rq)
static int find_lowest_rq(struct task_struct *task);
static int
-select_task_rq_rt(struct rq *rq, struct task_struct *p, int sd_flag, int flags)
+select_task_rq_rt(struct task_struct *p, int sd_flag, int flags)
{
+ struct rq *rq = task_rq(p); // XXX racy
+
if (sd_flag != SD_BALANCE_WAKE)
return smp_processor_id();
* Re: [PATCH RFC] reduce runqueue lock contention
2010-06-21 10:02 ` Peter Zijlstra
@ 2010-06-21 10:54 ` Peter Zijlstra
2010-06-21 13:04 ` Peter Zijlstra
0 siblings, 1 reply; 21+ messages in thread
From: Peter Zijlstra @ 2010-06-21 10:54 UTC (permalink / raw)
To: Chris Mason
Cc: Ingo Molnar, axboe, linux-kernel, Mike Galbraith, Oleg Nesterov,
tglx
On Mon, 2010-06-21 at 12:02 +0200, Peter Zijlstra wrote:
> To return the favour, here's a scary patch that renders your machine a
> doorstop :-) Sadly it just hangs, no splat, no sysrq, no nmi, no triple
> fault.
>
> It loses the ttwu task_running() check, as I must admit I'm not quite
> sure what it does.. Ingo?
>
> It makes the whole TASK_WAKING thing very interesting again :-)
>
> It also re-introduces a bunch of races because we now need to run
> ->select_task_rq() without holding the rq->lock. We cannot defer running
> that because it really wants to know the cpu the wakeup originated from
> (affine wakeups and the like).
Progress, this one gave spectacular fireworks ;-)
---
arch/x86/kernel/smp.c | 1 +
include/linux/sched.h | 7 +-
kernel/sched.c | 247 +++++++++++++++++++++++++----------------------
kernel/sched_fair.c | 10 +-
kernel/sched_idletask.c | 2 +-
kernel/sched_rt.c | 4 +-
6 files changed, 143 insertions(+), 128 deletions(-)
diff --git a/arch/x86/kernel/smp.c b/arch/x86/kernel/smp.c
index d801210..1e487d6 100644
--- a/arch/x86/kernel/smp.c
+++ b/arch/x86/kernel/smp.c
@@ -202,6 +202,7 @@ void smp_reschedule_interrupt(struct pt_regs *regs)
/*
* KVM uses this interrupt to force a cpu out of guest mode
*/
+ sched_ttwu_pending();
}
void smp_call_function_interrupt(struct pt_regs *regs)
diff --git a/include/linux/sched.h b/include/linux/sched.h
index a61c08c..9ae9fdb 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1003,6 +1003,7 @@ partition_sched_domains(int ndoms_new, cpumask_var_t doms_new[],
}
#endif /* !CONFIG_SMP */
+void sched_ttwu_pending(void);
struct io_context; /* See blkdev.h */
@@ -1046,12 +1047,11 @@ struct sched_class {
void (*put_prev_task) (struct rq *rq, struct task_struct *p);
#ifdef CONFIG_SMP
- int (*select_task_rq)(struct rq *rq, struct task_struct *p,
- int sd_flag, int flags);
+ int (*select_task_rq)(struct task_struct *p, int sd_flag, int flags);
void (*pre_schedule) (struct rq *this_rq, struct task_struct *task);
void (*post_schedule) (struct rq *this_rq);
- void (*task_waking) (struct rq *this_rq, struct task_struct *task);
+ void (*task_waking) (struct task_struct *task);
void (*task_woken) (struct rq *this_rq, struct task_struct *task);
void (*set_cpus_allowed)(struct task_struct *p,
@@ -1174,6 +1174,7 @@ struct task_struct {
int lock_depth; /* BKL lock depth */
#ifdef CONFIG_SMP
+ struct task_struct *wake_entry;
#ifdef __ARCH_WANT_UNLOCKED_CTXSW
int oncpu;
#endif
diff --git a/kernel/sched.c b/kernel/sched.c
index b697606..5fe442c 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -518,6 +518,8 @@ struct rq {
u64 age_stamp;
u64 idle_stamp;
u64 avg_idle;
+
+ struct task_struct *wake_list;
#endif
/* calc_load related fields */
@@ -944,7 +946,7 @@ static inline struct rq *__task_rq_lock(struct task_struct *p)
for (;;) {
rq = task_rq(p);
raw_spin_lock(&rq->lock);
- if (likely(rq == task_rq(p)))
+ if (likely(rq == task_rq(p)) && !task_is_waking(p))
return rq;
raw_spin_unlock(&rq->lock);
}
@@ -964,7 +966,7 @@ static struct rq *task_rq_lock(struct task_struct *p, unsigned long *flags)
local_irq_save(*flags);
rq = task_rq(p);
raw_spin_lock(&rq->lock);
- if (likely(rq == task_rq(p)))
+ if (likely(rq == task_rq(p)) && !task_is_waking(p))
return rq;
raw_spin_unlock_irqrestore(&rq->lock, *flags);
}
@@ -2257,9 +2259,9 @@ static int select_fallback_rq(int cpu, struct task_struct *p)
* The caller (fork, wakeup) owns TASK_WAKING, ->cpus_allowed is stable.
*/
static inline
-int select_task_rq(struct rq *rq, struct task_struct *p, int sd_flags, int wake_flags)
+int select_task_rq(struct task_struct *p, int sd_flags, int wake_flags)
{
- int cpu = p->sched_class->select_task_rq(rq, p, sd_flags, wake_flags);
+ int cpu = p->sched_class->select_task_rq(p, sd_flags, wake_flags);
/*
* In order not to call set_task_cpu() on a blocking task we need
@@ -2285,9 +2287,8 @@ static void update_avg(u64 *avg, u64 sample)
}
#endif
-static inline void ttwu_activate(struct task_struct *p, struct rq *rq,
- bool is_sync, bool is_migrate, bool is_local,
- unsigned long en_flags)
+static inline void
+ttwu_stat(struct task_struct *p, bool is_sync, bool is_migrate, bool is_local)
{
schedstat_inc(p, se.statistics.nr_wakeups);
if (is_sync)
@@ -2298,8 +2299,6 @@ static inline void ttwu_activate(struct task_struct *p, struct rq *rq,
schedstat_inc(p, se.statistics.nr_wakeups_local);
else
schedstat_inc(p, se.statistics.nr_wakeups_remote);
-
- activate_task(rq, p, en_flags);
}
static inline void ttwu_post_activation(struct task_struct *p, struct rq *rq,
@@ -2330,111 +2329,6 @@ static inline void ttwu_post_activation(struct task_struct *p, struct rq *rq,
}
/**
- * try_to_wake_up - wake up a thread
- * @p: the thread to be awakened
- * @state: the mask of task states that can be woken
- * @wake_flags: wake modifier flags (WF_*)
- *
- * Put it on the run-queue if it's not already there. The "current"
- * thread is always on the run-queue (except when the actual
- * re-schedule is in progress), and as such you're allowed to do
- * the simpler "current->state = TASK_RUNNING" to mark yourself
- * runnable without the overhead of this.
- *
- * Returns %true if @p was woken up, %false if it was already running
- * or @state didn't match @p's state.
- */
-static int try_to_wake_up(struct task_struct *p, unsigned int state,
- int wake_flags)
-{
- int cpu, orig_cpu, this_cpu, success = 0;
- unsigned long flags;
- unsigned long en_flags = ENQUEUE_WAKEUP;
- struct rq *rq;
-
- this_cpu = get_cpu();
-
- smp_wmb();
- rq = task_rq_lock(p, &flags);
- if (!(p->state & state))
- goto out;
-
- if (p->se.on_rq)
- goto out_running;
-
- cpu = task_cpu(p);
- orig_cpu = cpu;
-
-#ifdef CONFIG_SMP
- if (unlikely(task_running(rq, p)))
- goto out_activate;
-
- /*
- * In order to handle concurrent wakeups and release the rq->lock
- * we put the task in TASK_WAKING state.
- *
- * First fix up the nr_uninterruptible count:
- */
- if (task_contributes_to_load(p)) {
- if (likely(cpu_online(orig_cpu)))
- rq->nr_uninterruptible--;
- else
- this_rq()->nr_uninterruptible--;
- }
- p->state = TASK_WAKING;
-
- if (p->sched_class->task_waking) {
- p->sched_class->task_waking(rq, p);
- en_flags |= ENQUEUE_WAKING;
- }
-
- cpu = select_task_rq(rq, p, SD_BALANCE_WAKE, wake_flags);
- if (cpu != orig_cpu)
- set_task_cpu(p, cpu);
- __task_rq_unlock(rq);
-
- rq = cpu_rq(cpu);
- raw_spin_lock(&rq->lock);
-
- /*
- * We migrated the task without holding either rq->lock, however
- * since the task is not on the task list itself, nobody else
- * will try and migrate the task, hence the rq should match the
- * cpu we just moved it to.
- */
- WARN_ON(task_cpu(p) != cpu);
- WARN_ON(p->state != TASK_WAKING);
-
-#ifdef CONFIG_SCHEDSTATS
- schedstat_inc(rq, ttwu_count);
- if (cpu == this_cpu)
- schedstat_inc(rq, ttwu_local);
- else {
- struct sched_domain *sd;
- for_each_domain(this_cpu, sd) {
- if (cpumask_test_cpu(cpu, sched_domain_span(sd))) {
- schedstat_inc(sd, ttwu_wake_remote);
- break;
- }
- }
- }
-#endif /* CONFIG_SCHEDSTATS */
-
-out_activate:
-#endif /* CONFIG_SMP */
- ttwu_activate(p, rq, wake_flags & WF_SYNC, orig_cpu != cpu,
- cpu == this_cpu, en_flags);
- success = 1;
-out_running:
- ttwu_post_activation(p, rq, wake_flags, success);
-out:
- task_rq_unlock(rq, &flags);
- put_cpu();
-
- return success;
-}
-
-/**
* try_to_wake_up_local - try to wake up a local task with rq lock held
* @p: the thread to be awakened
*
@@ -2459,12 +2353,131 @@ static void try_to_wake_up_local(struct task_struct *p)
schedstat_inc(rq, ttwu_count);
schedstat_inc(rq, ttwu_local);
}
- ttwu_activate(p, rq, false, false, true, ENQUEUE_WAKEUP);
+ ttwu_stat(p, false, false, true);
+ activate_task(rq, p, ENQUEUE_WAKEUP);
success = true;
}
ttwu_post_activation(p, rq, 0, success);
}
+static int ttwu_running(struct task_struct *p, int wake_flags)
+{
+ struct rq *rq;
+
+ for (;;) {
+ rq = task_rq(p);
+ raw_spin_lock(&rq->lock);
+ if (likely(rq == task_rq(p)))
+ break;
+ raw_spin_unlock(&rq->lock);
+ }
+
+ if (!p->se.on_rq) {
+ raw_spin_unlock(&rq->lock);
+ return 0;
+ }
+
+ ttwu_post_activation(p, rq, 0, true);
+ return 1;
+}
+
+static void ttwu_do_activate(struct rq *rq, struct task_struct *p, bool local)
+{
+ set_task_cpu(p, cpu_of(rq));
+ activate_task(rq, p, ENQUEUE_WAKEUP
+#ifdef CONFIG_SMP
+ | ENQUEUE_WAKING
+#endif
+ );
+ ttwu_post_activation(p, rq, 0, true);
+}
+
+void sched_ttwu_pending(void)
+{
+#ifdef CONFIG_SMP
+ struct rq *rq = this_rq();
+ struct task_struct *list = xchg(&rq->wake_list, NULL);
+
+ if (!list)
+ return;
+
+ raw_spin_lock(&rq->lock);
+
+ while (list) {
+ struct task_struct *p = list;
+ list = list->wake_entry;
+ ttwu_do_activate(rq, p, false);
+ }
+
+ raw_spin_unlock(&rq->lock);
+#endif
+}
+
+static void ttwu_queue(struct task_struct *p, int cpu)
+{
+ struct task_struct *next = NULL;
+ struct rq *rq = cpu_rq(cpu);
+
+ if (cpu == smp_processor_id()) {
+ raw_spin_lock(&rq->lock);
+ ttwu_do_activate(rq, p, true);
+ raw_spin_unlock(&rq->lock);
+ return;
+ }
+
+#ifdef CONFIG_SMP
+ for (;;) {
+ struct task_struct *old = next;
+
+ p->wake_entry = next;
+ next = cmpxchg(&rq->wake_list, old, p);
+ if (next == old)
+ break;
+ }
+
+ if (!next)
+ smp_send_reschedule(cpu);
+#endif
+}
+
+static int
+try_to_wake_up(struct task_struct *p, unsigned int state, int wake_flags)
+{
+ unsigned long flags;
+ int success = 0;
+ int cpu = task_cpu(p);
+
+ local_irq_save(flags);
+ for (;;) {
+ unsigned int task_state = p->state;
+
+ if (!(task_state & state))
+ goto out;
+
+ if (cmpxchg(&p->state, task_state, TASK_WAKING) == task_state)
+ break;
+ }
+
+ success = 1;
+
+ if (unlikely(p->se.on_rq) && ttwu_running(p, wake_flags))
+ goto out;
+
+#ifdef CONFIG_SMP
+ if (p->sched_class->task_waking)
+ p->sched_class->task_waking(p);
+
+ cpu = select_task_rq(p, SD_BALANCE_WAKE, wake_flags);
+#endif
+ ttwu_stat(p, wake_flags & WF_SYNC, /* sync */
+ cpu != task_cpu(p), /* migrate */
+ cpu == smp_processor_id()); /* local */
+ ttwu_queue(p, cpu);
+out:
+ local_irq_restore(flags);
+ return success;
+}
+
/**
* wake_up_process - Wake up a specific process
* @p: The process to be woken up.
@@ -2604,7 +2617,7 @@ void wake_up_new_task(struct task_struct *p, unsigned long clone_flags)
* We set TASK_WAKING so that select_task_rq() can drop rq->lock
* without people poking at ->cpus_allowed.
*/
- cpu = select_task_rq(rq, p, SD_BALANCE_FORK, 0);
+ cpu = select_task_rq(p, SD_BALANCE_FORK, 0);
set_task_cpu(p, cpu);
p->state = TASK_RUNNING;
@@ -3200,7 +3213,7 @@ void sched_exec(void)
int dest_cpu;
rq = task_rq_lock(p, &flags);
- dest_cpu = p->sched_class->select_task_rq(rq, p, SD_BALANCE_EXEC, 0);
+ dest_cpu = p->sched_class->select_task_rq(p, SD_BALANCE_EXEC, 0);
if (dest_cpu == smp_processor_id())
goto unlock;
diff --git a/kernel/sched_fair.c b/kernel/sched_fair.c
index 5e8f98c..2bd145f 100644
--- a/kernel/sched_fair.c
+++ b/kernel/sched_fair.c
@@ -1128,11 +1128,12 @@ static void yield_task_fair(struct rq *rq)
#ifdef CONFIG_SMP
-static void task_waking_fair(struct rq *rq, struct task_struct *p)
+static void task_waking_fair(struct task_struct *p)
{
struct sched_entity *se = &p->se;
struct cfs_rq *cfs_rq = cfs_rq_of(se);
+ // XXX racy again,.. we need to read cfs_rq->min_vruntime atomically
se->vruntime -= cfs_rq->min_vruntime;
}
@@ -1443,7 +1444,7 @@ static int select_idle_sibling(struct task_struct *p, int target)
* preempt must be disabled.
*/
static int
-select_task_rq_fair(struct rq *rq, struct task_struct *p, int sd_flag, int wake_flags)
+select_task_rq_fair(struct task_struct *p, int sd_flag, int wake_flags)
{
struct sched_domain *tmp, *affine_sd = NULL, *sd = NULL;
int cpu = smp_processor_id();
@@ -1516,11 +1517,8 @@ select_task_rq_fair(struct rq *rq, struct task_struct *p, int sd_flag, int wake_
if (affine_sd && (!tmp || affine_sd->span_weight > sd->span_weight))
tmp = affine_sd;
- if (tmp) {
- raw_spin_unlock(&rq->lock);
+ if (tmp)
update_shares(tmp);
- raw_spin_lock(&rq->lock);
- }
}
#endif
diff --git a/kernel/sched_idletask.c b/kernel/sched_idletask.c
index 9fa0f40..efaed8c 100644
--- a/kernel/sched_idletask.c
+++ b/kernel/sched_idletask.c
@@ -7,7 +7,7 @@
#ifdef CONFIG_SMP
static int
-select_task_rq_idle(struct rq *rq, struct task_struct *p, int sd_flag, int flags)
+select_task_rq_idle(struct task_struct *p, int sd_flag, int flags)
{
return task_cpu(p); /* IDLE tasks as never migrated */
}
diff --git a/kernel/sched_rt.c b/kernel/sched_rt.c
index d10c80e..1011cdc 100644
--- a/kernel/sched_rt.c
+++ b/kernel/sched_rt.c
@@ -949,8 +949,10 @@ static void yield_task_rt(struct rq *rq)
static int find_lowest_rq(struct task_struct *task);
static int
-select_task_rq_rt(struct rq *rq, struct task_struct *p, int sd_flag, int flags)
+select_task_rq_rt(struct task_struct *p, int sd_flag, int flags)
{
+ struct rq *rq = task_rq(p); // XXX racy
+
if (sd_flag != SD_BALANCE_WAKE)
return smp_processor_id();
* Re: [PATCH RFC] reduce runqueue lock contention
2010-06-21 10:54 ` Peter Zijlstra
@ 2010-06-21 13:04 ` Peter Zijlstra
2010-06-22 13:33 ` Peter Zijlstra
2010-12-14 2:41 ` Frank Rowand
0 siblings, 2 replies; 21+ messages in thread
From: Peter Zijlstra @ 2010-06-21 13:04 UTC (permalink / raw)
To: Chris Mason
Cc: Ingo Molnar, axboe, linux-kernel, Mike Galbraith, Oleg Nesterov,
tglx
On Mon, 2010-06-21 at 12:54 +0200, Peter Zijlstra wrote:
> > It loses the ttwu task_running() check, as I must admit I'm not quite
> > sure what it does.. Ingo?
I think I figured out what it's for: it's for when p is prev in schedule()
after deactivate_task(). We have to call activate_task() again, but
we cannot migrate the task because the CPU it's on is still referencing
it.
I added it back, but still fireworks :-)
* Re: [PATCH RFC] reduce runqueue lock contention
2010-06-21 13:04 ` Peter Zijlstra
@ 2010-06-22 13:33 ` Peter Zijlstra
2010-06-22 21:11 ` Ingo Molnar
2010-12-14 2:41 ` Frank Rowand
1 sibling, 1 reply; 21+ messages in thread
From: Peter Zijlstra @ 2010-06-22 13:33 UTC (permalink / raw)
To: Chris Mason
Cc: Ingo Molnar, axboe, linux-kernel, Mike Galbraith, Oleg Nesterov,
tglx
So this one boots and builds a kernel on a dual-socket nehalem.
There's still quite a number of XXXs to fix, but I don't think any of
the races have crashing potential, mostly wrong accounting and scheduling
iffies.
But give it a go.. see what it does for you (x86 only for now).
Ingo, any comments other than, eew, scary? :-)
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
---
arch/x86/kernel/smp.c | 1 +
include/linux/sched.h | 7 +-
kernel/sched.c | 271 ++++++++++++++++++++++++++++++++---------------
kernel/sched_fair.c | 10 +-
kernel/sched_idletask.c | 2 +-
kernel/sched_rt.c | 4 +-
6 files changed, 198 insertions(+), 97 deletions(-)
diff --git a/arch/x86/kernel/smp.c b/arch/x86/kernel/smp.c
index d801210..1e487d6 100644
--- a/arch/x86/kernel/smp.c
+++ b/arch/x86/kernel/smp.c
@@ -202,6 +202,7 @@ void smp_reschedule_interrupt(struct pt_regs *regs)
/*
* KVM uses this interrupt to force a cpu out of guest mode
*/
+ sched_ttwu_pending();
}
void smp_call_function_interrupt(struct pt_regs *regs)
diff --git a/include/linux/sched.h b/include/linux/sched.h
index a61c08c..9ae9fdb 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1003,6 +1003,7 @@ partition_sched_domains(int ndoms_new, cpumask_var_t doms_new[],
}
#endif /* !CONFIG_SMP */
+void sched_ttwu_pending(void);
struct io_context; /* See blkdev.h */
@@ -1046,12 +1047,11 @@ struct sched_class {
void (*put_prev_task) (struct rq *rq, struct task_struct *p);
#ifdef CONFIG_SMP
- int (*select_task_rq)(struct rq *rq, struct task_struct *p,
- int sd_flag, int flags);
+ int (*select_task_rq)(struct task_struct *p, int sd_flag, int flags);
void (*pre_schedule) (struct rq *this_rq, struct task_struct *task);
void (*post_schedule) (struct rq *this_rq);
- void (*task_waking) (struct rq *this_rq, struct task_struct *task);
+ void (*task_waking) (struct task_struct *task);
void (*task_woken) (struct rq *this_rq, struct task_struct *task);
void (*set_cpus_allowed)(struct task_struct *p,
@@ -1174,6 +1174,7 @@ struct task_struct {
int lock_depth; /* BKL lock depth */
#ifdef CONFIG_SMP
+ struct task_struct *wake_entry;
#ifdef __ARCH_WANT_UNLOCKED_CTXSW
int oncpu;
#endif
diff --git a/kernel/sched.c b/kernel/sched.c
index b697606..9a5f9ea 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -518,6 +518,8 @@ struct rq {
u64 age_stamp;
u64 idle_stamp;
u64 avg_idle;
+
+ struct task_struct *wake_list;
#endif
/* calc_load related fields */
@@ -944,7 +946,7 @@ static inline struct rq *__task_rq_lock(struct task_struct *p)
for (;;) {
rq = task_rq(p);
raw_spin_lock(&rq->lock);
- if (likely(rq == task_rq(p)))
+ if (likely(rq == task_rq(p)) && !task_is_waking(p))
return rq;
raw_spin_unlock(&rq->lock);
}
@@ -964,7 +966,7 @@ static struct rq *task_rq_lock(struct task_struct *p, unsigned long *flags)
local_irq_save(*flags);
rq = task_rq(p);
raw_spin_lock(&rq->lock);
- if (likely(rq == task_rq(p)))
+ if (likely(rq == task_rq(p)) && !task_is_waking(p))
return rq;
raw_spin_unlock_irqrestore(&rq->lock, *flags);
}
@@ -2257,9 +2259,9 @@ static int select_fallback_rq(int cpu, struct task_struct *p)
* The caller (fork, wakeup) owns TASK_WAKING, ->cpus_allowed is stable.
*/
static inline
-int select_task_rq(struct rq *rq, struct task_struct *p, int sd_flags, int wake_flags)
+int select_task_rq(struct task_struct *p, int sd_flags, int wake_flags)
{
- int cpu = p->sched_class->select_task_rq(rq, p, sd_flags, wake_flags);
+ int cpu = p->sched_class->select_task_rq(p, sd_flags, wake_flags);
/*
* In order not to call set_task_cpu() on a blocking task we need
@@ -2285,24 +2287,38 @@ static void update_avg(u64 *avg, u64 sample)
}
#endif
-static inline void ttwu_activate(struct task_struct *p, struct rq *rq,
- bool is_sync, bool is_migrate, bool is_local,
- unsigned long en_flags)
+static inline void ttwu_stat(struct task_struct *p, bool sync, int cpu)
{
+#ifdef CONFIG_SCHEDSTATS
schedstat_inc(p, se.statistics.nr_wakeups);
- if (is_sync)
+
+ if (sync)
schedstat_inc(p, se.statistics.nr_wakeups_sync);
- if (is_migrate)
+
+ if (cpu != task_cpu(p))
schedstat_inc(p, se.statistics.nr_wakeups_migrate);
- if (is_local)
+
+ if (cpu == smp_processor_id()) {
schedstat_inc(p, se.statistics.nr_wakeups_local);
- else
- schedstat_inc(p, se.statistics.nr_wakeups_remote);
+ schedstat_inc(this_rq(), ttwu_local);
+ } else {
+ struct sched_domain *sd;
- activate_task(rq, p, en_flags);
+ schedstat_inc(p, se.statistics.nr_wakeups_remote);
+ for_each_domain(smp_processor_id(), sd) {
+ if (cpumask_test_cpu(cpu, sched_domain_span(sd))) {
+ schedstat_inc(sd, ttwu_wake_remote);
+ break;
+ }
+ }
+ }
+#endif
}
-static inline void ttwu_post_activation(struct task_struct *p, struct rq *rq,
+/*
+ * Mark the task runnable and perform wakeup-preemption.
+ */
+static inline void ttwu_do_wakeup(struct task_struct *p, struct rq *rq,
int wake_flags, bool success)
{
trace_sched_wakeup(p, success);
@@ -2329,6 +2345,104 @@ static inline void ttwu_post_activation(struct task_struct *p, struct rq *rq,
wq_worker_waking_up(p, cpu_of(rq));
}
+/*
+ * 'Wake' a still running task.
+ */
+static int ttwu_running(struct task_struct *p, int wake_flags)
+{
+ struct rq *rq;
+
+ /* open-coded __task_rq_lock() to avoid the TASK_WAKING check. */
+ for (;;) {
+ rq = task_rq(p);
+ raw_spin_lock(&rq->lock);
+ if (likely(rq == task_rq(p)))
+ break;
+ raw_spin_unlock(&rq->lock);
+ }
+
+ if (!p->se.on_rq) {
+ raw_spin_unlock(&rq->lock);
+ return 0;
+ }
+
+ /*
+ * XXX like below, I think this should be an actual wakeup
+ */
+ ttwu_do_wakeup(p, rq, 0, false);
+ raw_spin_unlock(&rq->lock);
+
+ return 1;
+}
+
+static void ttwu_do_activate(struct rq *rq, struct task_struct *p, bool local)
+{
+ set_task_cpu(p, cpu_of(rq)); /* XXX conditional */
+ activate_task(rq, p, ENQUEUE_WAKEUP
+#ifdef CONFIG_SMP
+ | ENQUEUE_WAKING
+#endif
+ );
+ ttwu_do_wakeup(p, rq, 0, true);
+}
+
+void sched_ttwu_pending(void)
+{
+#ifdef CONFIG_SMP
+ struct rq *rq = this_rq();
+ struct task_struct *list = xchg(&rq->wake_list, NULL);
+
+ if (!list)
+ return;
+
+ raw_spin_lock(&rq->lock);
+
+ while (list) {
+ struct task_struct *p = list;
+ list = list->wake_entry;
+ ttwu_do_activate(rq, p, false);
+ }
+
+ raw_spin_unlock(&rq->lock);
+#endif
+}
+
+#ifdef CONFIG_SMP
+static void ttwu_queue_remote(struct task_struct *p, int cpu)
+{
+ struct task_struct *next = NULL;
+ struct rq *rq = cpu_rq(cpu);
+
+ for (;;) {
+ struct task_struct *old = next;
+
+ p->wake_entry = next;
+ next = cmpxchg(&rq->wake_list, old, p);
+ if (next == old)
+ break;
+ }
+
+ if (!next)
+ smp_send_reschedule(cpu);
+}
+#endif
+
+static void ttwu_queue(struct task_struct *p, int cpu)
+{
+ struct rq *rq = cpu_rq(cpu);
+
+#ifdef CONFIG_SMP
+ if (cpu != smp_processor_id()) {
+ ttwu_queue_remote(p, cpu);
+ return;
+ }
+#endif
+
+ raw_spin_lock(&rq->lock);
+ ttwu_do_activate(rq, p, true);
+ raw_spin_unlock(&rq->lock);
+}
+
/**
* try_to_wake_up - wake up a thread
* @p: the thread to be awakened
@@ -2344,92 +2458,76 @@ static inline void ttwu_post_activation(struct task_struct *p, struct rq *rq,
* Returns %true if @p was woken up, %false if it was already running
* or @state didn't match @p's state.
*/
-static int try_to_wake_up(struct task_struct *p, unsigned int state,
- int wake_flags)
+static int
+try_to_wake_up(struct task_struct *p, unsigned int state, int wake_flags)
{
- int cpu, orig_cpu, this_cpu, success = 0;
+ int cpu = task_cpu(p);
unsigned long flags;
- unsigned long en_flags = ENQUEUE_WAKEUP;
- struct rq *rq;
+ int success = 0;
+ int load;
- this_cpu = get_cpu();
-
- smp_wmb();
- rq = task_rq_lock(p, &flags);
- if (!(p->state & state))
- goto out;
+ local_irq_save(flags);
+ for (;;) {
+ unsigned int task_state = p->state;
- if (p->se.on_rq)
- goto out_running;
+ if (!(task_state & state))
+ goto out;
- cpu = task_cpu(p);
- orig_cpu = cpu;
+ /*
+ * We've got to store the contributes_to_load state before
+ * modifying the task state.
+ */
+ load = task_contributes_to_load(p);
-#ifdef CONFIG_SMP
- if (unlikely(task_running(rq, p)))
- goto out_activate;
+ if (cmpxchg(&p->state, task_state, TASK_WAKING) == task_state)
+ break;
+ }
/*
- * In order to handle concurrent wakeups and release the rq->lock
- * we put the task in TASK_WAKING state.
+ * If p is prev in schedule() before deactivate_task() we
+ * simply need to set p->state = TASK_RUNNING while
+ * holding the rq->lock.
*
- * First fix up the nr_uninterruptible count:
+ * Also, before deactivate_task() we will not yet have accounted
+ * the contributes_to_load state, so ignore that.
*/
- if (task_contributes_to_load(p)) {
- if (likely(cpu_online(orig_cpu)))
- rq->nr_uninterruptible--;
- else
- this_rq()->nr_uninterruptible--;
- }
- p->state = TASK_WAKING;
+ if (p->se.on_rq && ttwu_running(p, wake_flags))
+ goto out;
- if (p->sched_class->task_waking) {
- p->sched_class->task_waking(rq, p);
- en_flags |= ENQUEUE_WAKING;
- }
+ /*
+ * XXX: I would consider the above to be a proper wakeup too, so
+ * success should be true for anything that changes ->state to RUNNING
+ * but preserve this funny from the previous implementation.
+ */
+ success = 1;
- cpu = select_task_rq(rq, p, SD_BALANCE_WAKE, wake_flags);
- if (cpu != orig_cpu)
- set_task_cpu(p, cpu);
- __task_rq_unlock(rq);
+ /*
+ * Now that we'll do an actual wakeup, account the
+ * contribute_to_load state.
+ */
+ if (load) // XXX racy
+ this_rq()->nr_uninterruptible--;
- rq = cpu_rq(cpu);
- raw_spin_lock(&rq->lock);
+#ifdef CONFIG_SMP
+ if (p->sched_class->task_waking)
+ p->sched_class->task_waking(p);
/*
- * We migrated the task without holding either rq->lock, however
- * since the task is not on the task list itself, nobody else
- * will try and migrate the task, hence the rq should match the
- * cpu we just moved it to.
+ * If p is prev in schedule() after deactivate_task() we must
+ * call activate_task(), but we cannot migrate the task.
*/
- WARN_ON(task_cpu(p) != cpu);
- WARN_ON(p->state != TASK_WAKING);
-
-#ifdef CONFIG_SCHEDSTATS
- schedstat_inc(rq, ttwu_count);
- if (cpu == this_cpu)
- schedstat_inc(rq, ttwu_local);
- else {
- struct sched_domain *sd;
- for_each_domain(this_cpu, sd) {
- if (cpumask_test_cpu(cpu, sched_domain_span(sd))) {
- schedstat_inc(sd, ttwu_wake_remote);
- break;
- }
- }
+ if (likely(!task_running(task_rq(p), p))) {
+ /*
+ * XXX: by having set TASK_WAKING outside of rq->lock, there
+ * could be an in-flight change to p->cpus_allowed..
+ */
+ cpu = select_task_rq(p, SD_BALANCE_WAKE, wake_flags);
}
-#endif /* CONFIG_SCHEDSTATS */
-
-out_activate:
-#endif /* CONFIG_SMP */
- ttwu_activate(p, rq, wake_flags & WF_SYNC, orig_cpu != cpu,
- cpu == this_cpu, en_flags);
- success = 1;
-out_running:
- ttwu_post_activation(p, rq, wake_flags, success);
+#endif
+ ttwu_stat(p, wake_flags & WF_SYNC, cpu);
+ ttwu_queue(p, cpu);
out:
- task_rq_unlock(rq, &flags);
- put_cpu();
+ local_irq_restore(flags);
return success;
}
@@ -2459,10 +2557,11 @@ static void try_to_wake_up_local(struct task_struct *p)
schedstat_inc(rq, ttwu_count);
schedstat_inc(rq, ttwu_local);
}
- ttwu_activate(p, rq, false, false, true, ENQUEUE_WAKEUP);
+ ttwu_stat(p, false, smp_processor_id());
+ activate_task(rq, p, ENQUEUE_WAKEUP);
success = true;
}
- ttwu_post_activation(p, rq, 0, success);
+ ttwu_do_wakeup(p, rq, 0, success);
}
/**
@@ -2604,7 +2703,7 @@ void wake_up_new_task(struct task_struct *p, unsigned long clone_flags)
* We set TASK_WAKING so that select_task_rq() can drop rq->lock
* without people poking at ->cpus_allowed.
*/
- cpu = select_task_rq(rq, p, SD_BALANCE_FORK, 0);
+ cpu = select_task_rq(p, SD_BALANCE_FORK, 0);
set_task_cpu(p, cpu);
p->state = TASK_RUNNING;
@@ -3200,7 +3299,7 @@ void sched_exec(void)
int dest_cpu;
rq = task_rq_lock(p, &flags);
- dest_cpu = p->sched_class->select_task_rq(rq, p, SD_BALANCE_EXEC, 0);
+ dest_cpu = p->sched_class->select_task_rq(p, SD_BALANCE_EXEC, 0);
if (dest_cpu == smp_processor_id())
goto unlock;
diff --git a/kernel/sched_fair.c b/kernel/sched_fair.c
index 5e8f98c..2bd145f 100644
--- a/kernel/sched_fair.c
+++ b/kernel/sched_fair.c
@@ -1128,11 +1128,12 @@ static void yield_task_fair(struct rq *rq)
#ifdef CONFIG_SMP
-static void task_waking_fair(struct rq *rq, struct task_struct *p)
+static void task_waking_fair(struct task_struct *p)
{
struct sched_entity *se = &p->se;
struct cfs_rq *cfs_rq = cfs_rq_of(se);
+ // XXX racy again,.. we need to read cfs_rq->min_vruntime atomically
se->vruntime -= cfs_rq->min_vruntime;
}
@@ -1443,7 +1444,7 @@ static int select_idle_sibling(struct task_struct *p, int target)
* preempt must be disabled.
*/
static int
-select_task_rq_fair(struct rq *rq, struct task_struct *p, int sd_flag, int wake_flags)
+select_task_rq_fair(struct task_struct *p, int sd_flag, int wake_flags)
{
struct sched_domain *tmp, *affine_sd = NULL, *sd = NULL;
int cpu = smp_processor_id();
@@ -1516,11 +1517,8 @@ select_task_rq_fair(struct rq *rq, struct task_struct *p, int sd_flag, int wake_
if (affine_sd && (!tmp || affine_sd->span_weight > sd->span_weight))
tmp = affine_sd;
- if (tmp) {
- raw_spin_unlock(&rq->lock);
+ if (tmp)
update_shares(tmp);
- raw_spin_lock(&rq->lock);
- }
}
#endif
diff --git a/kernel/sched_idletask.c b/kernel/sched_idletask.c
index 9fa0f40..efaed8c 100644
--- a/kernel/sched_idletask.c
+++ b/kernel/sched_idletask.c
@@ -7,7 +7,7 @@
#ifdef CONFIG_SMP
static int
-select_task_rq_idle(struct rq *rq, struct task_struct *p, int sd_flag, int flags)
+select_task_rq_idle(struct task_struct *p, int sd_flag, int flags)
{
return task_cpu(p); /* IDLE tasks as never migrated */
}
diff --git a/kernel/sched_rt.c b/kernel/sched_rt.c
index d10c80e..1011cdc 100644
--- a/kernel/sched_rt.c
+++ b/kernel/sched_rt.c
@@ -949,8 +949,10 @@ static void yield_task_rt(struct rq *rq)
static int find_lowest_rq(struct task_struct *task);
static int
-select_task_rq_rt(struct rq *rq, struct task_struct *p, int sd_flag, int flags)
+select_task_rq_rt(struct task_struct *p, int sd_flag, int flags)
{
+ struct rq *rq = task_rq(p); // XXX racy
+
if (sd_flag != SD_BALANCE_WAKE)
return smp_processor_id();
* Re: [PATCH RFC] reduce runqueue lock contention
2010-06-22 13:33 ` Peter Zijlstra
@ 2010-06-22 21:11 ` Ingo Molnar
2010-06-23 9:10 ` Peter Zijlstra
0 siblings, 1 reply; 21+ messages in thread
From: Ingo Molnar @ 2010-06-22 21:11 UTC (permalink / raw)
To: Peter Zijlstra
Cc: Chris Mason, axboe, linux-kernel, Mike Galbraith, Oleg Nesterov,
tglx
* Peter Zijlstra <peterz@infradead.org> wrote:
> So this one boots and builds a kernel on a dual-socket nehalem.
>
> there's still quite a number of XXXs to fix, but I don't think any of the
> races are crashing potential, mostly wrong accounting and scheduling iffies
> like.
>
> But give it a go.. see what it does for you (x86 only for now).
>
> Ingo, any comments other than, eew, scary? :-)
None, other than a question: which future kernel do you aim it for? I'd prefer
v2.6.50 or later ;-)
This is a truly scary patch.
Ingo
* Re: [PATCH RFC] reduce runqueue lock contention
2010-06-22 21:11 ` Ingo Molnar
@ 2010-06-23 9:10 ` Peter Zijlstra
2010-12-01 23:13 ` Frank Rowand
0 siblings, 1 reply; 21+ messages in thread
From: Peter Zijlstra @ 2010-06-23 9:10 UTC (permalink / raw)
To: Ingo Molnar
Cc: Chris Mason, axboe, linux-kernel, Mike Galbraith, Oleg Nesterov,
tglx
On Tue, 2010-06-22 at 23:11 +0200, Ingo Molnar wrote:
> * Peter Zijlstra <peterz@infradead.org> wrote:
>
> > So this one boots and builds a kernel on a dual-socket nehalem.
> >
> > there's still quite a number of XXXs to fix, but I don't think any of the
> > races are crashing potential, mostly wrong accounting and scheduling iffies
> > like.
> >
> > But give it a go.. see what it does for you (x86 only for now).
> >
> > Ingo, any comments other than, eew, scary? :-)
>
> None, other than a question: which future kernel do you aim it for? I'd prefer
> v2.6.50 or later ;-)
Well, assuming it all works out and actually reduces runqueue lock
contention, we still need to sort out all those XXXs in there; I'd say at
the soonest somewhere near .38/.39 or so.
It's definitely not something we should rush in.
* Re: [PATCH RFC] reduce runqueue lock contention
2010-06-23 9:10 ` Peter Zijlstra
@ 2010-12-01 23:13 ` Frank Rowand
2010-12-02 1:17 ` Frank Rowand
2010-12-02 7:36 ` Peter Zijlstra
0 siblings, 2 replies; 21+ messages in thread
From: Frank Rowand @ 2010-12-01 23:13 UTC (permalink / raw)
To: Chris Mason
Cc: Peter Zijlstra, Ingo Molnar, axboe@kernel.dk,
linux-kernel@vger.kernel.org, Mike Galbraith, Oleg Nesterov, tglx
On 06/23/10 02:10, Peter Zijlstra wrote:
> On Tue, 2010-06-22 at 23:11 +0200, Ingo Molnar wrote:
>> * Peter Zijlstra <peterz@infradead.org> wrote:
>>
>>> So this one boots and builds a kernel on a dual-socket nehalem.
>>>
>>> there's still quite a number of XXXs to fix, but I don't think any of the
>>> races are crashing potential, mostly wrong accounting and scheduling iffies
>>> like.
>>>
>>> But give it a go.. see what it does for you (x86 only for now).
>>>
>>> Ingo, any comments other than, eew, scary? :-)
>>
>> None, other than a question: which future kernel do you aim it for? I'd prefer
>> v2.6.50 or later ;-)
>
> Well, assuming it all works out and actually reduces runqueue lock
> contention we still need to sort out all those XXXs in there, I'd say at
> the soonest somewhere near .38/.39 or so.
>
> Its definitely not something we should rush in.
This thread was started by Chris Mason back in May:
http://lkml.indiana.edu/hypermail/linux/kernel/1005.2/02329.html
The problem he was trying to fix is:
> Many different workloads end
> up hammering very hard on try_to_wake_up, to the point where the
> runqueue locks dominate CPU profiles.
>
> This includes benchmarking really fast IO subsystems, fast networking,
> fast pipes...well anywhere that we lots of procs on lots of cpus waiting
> for short periods on lots of different things.
Chris provided some code as a starting point for a solution.
Peter Zijlstra had some good ideas, and came up with some alternate code,
culminating with:
http://lkml.indiana.edu/hypermail/linux/kernel/1006.2/02381.html
Building on this previous work, I have another patch to try to address
the problem. I have taken some of Peter's code (the cmpxchg() based
queueing and unqueueing, plus the cross cpu interrupt), but otherwise
taken a simpler (and hopefully less scary) approach:
If the task to be woken is on a run queue on a different cpu then use
cmpxchg() to put it onto a pending try_to_wake_up list on the different
cpu. Then send an interrupt to the different cpu to cause that cpu to
call try_to_wake_up() for each process on the try_to_wake_up list.
The result is that the initial run queue lock acquired by try_to_wake_up()
will be on the cpu we are currently executing on, not a different cpu.
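For illustration, the core multi-producer/single-consumer push-and-drain
pattern, stripped of all the scheduler details, looks like this in plain
user-space C (GCC __atomic builtins; the names here are illustrative, not
the kernel API):

	#include <stdio.h>

	struct node {
		struct node *next;	/* analogous to ttwu_queue_wake_entry */
		int val;
	};

	static struct node *wake_list;	/* analogous to rq->wake_list */

	/* producers: any cpu may push, lockless */
	static void push(struct node *n)
	{
		struct node *old = __atomic_load_n(&wake_list, __ATOMIC_RELAXED);

		do {
			n->next = old;
		} while (!__atomic_compare_exchange_n(&wake_list, &old, n, 0,
						      __ATOMIC_RELEASE,
						      __ATOMIC_RELAXED));
		/* the patch sends the reschedule IPI only when the list was
		 * previously empty (old == NULL), so just the first waker
		 * interrupts the target cpu */
	}

	/* consumer: only the owning cpu drains, also lockless */
	static void drain(void)
	{
		struct node *n = __atomic_exchange_n(&wake_list, NULL,
						     __ATOMIC_ACQUIRE);

		while (n) {
			struct node *next = n->next;

			printf("wake %d\n", n->val);	/* try_to_wake_up() here */
			n = next;
		}
	}

	int main(void)
	{
		struct node a = { .val = 1 }, b = { .val = 2 };

		push(&a);
		push(&b);
		drain();	/* wakes in LIFO order: 2, then 1 */
		return 0;
	}

In the patch below, the push is the cmpxchg() loop in ttwu_queue_wake_up()
and the drain is the xchg() in sched_ttwu_pending(), called from the
reschedule interrupt.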
This patch does not address the run queue lock contention that may occur
if try_to_wake_up() chooses to move the waking process to another cpu,
based on the result returned by select_task_rq().
The patch was created on the -top tree.
Signed-off-by: Frank Rowand <frank.rowand@am.sony.com>
Chris, can you check the performance of this patch on your large system?
Limitations:
  - x86 only

Tests:
  - tested on 2 cpu x86_64
  - very simplistic workload
  - results:
      rq->lock contention count reduced by ~ 95%
      rq->lock contention wait time reduced by ~ 90%
      test duration reduced by ~ 0.5% - 4% (in the noise)

Review goals:
  (1) performance results
  (2) architectural comments

Review non-goal:
  - code style, etc (but will be a goal in a future review round)

Todo:
  - add support for additional architectures
  - polish code style
  - add a schedule feature to control whether to use the new algorithm
  - verify that smp_wmb() is implied by cmpxchg() on x86, so that the
    explicit smp_wmb() in ttwu_queue_wake_up() can be removed
---
arch/x86/kernel/smp.c | 1 1 + 0 - 0 !
include/linux/sched.h | 5 5 + 0 - 0 !
kernel/sched.c | 105 99 + 6 - 0 !
3 files changed, 105 insertions(+), 6 deletions(-)
Index: linux-2.6/arch/x86/kernel/smp.c
===================================================================
--- linux-2.6.orig/arch/x86/kernel/smp.c
+++ linux-2.6/arch/x86/kernel/smp.c
@@ -205,6 +205,7 @@ void smp_reschedule_interrupt(struct pt_
/*
* KVM uses this interrupt to force a cpu out of guest mode
*/
+ sched_ttwu_pending();
}
void smp_call_function_interrupt(struct pt_regs *regs)
Index: linux-2.6/include/linux/sched.h
===================================================================
--- linux-2.6.orig/include/linux/sched.h
+++ linux-2.6/include/linux/sched.h
@@ -1038,6 +1038,7 @@ struct sched_domain;
*/
#define WF_SYNC 0x01 /* waker goes to sleep after wakup */
#define WF_FORK 0x02 /* child wakeup after fork */
+#define WF_LOAD 0x04 /* for queued try_to_wake_up() */
#define ENQUEUE_WAKEUP 1
#define ENQUEUE_WAKING 2
@@ -1193,6 +1194,9 @@ struct task_struct {
int lock_depth; /* BKL lock depth */
#ifdef CONFIG_SMP
+ struct task_struct *ttwu_queue_wake_entry;
+ int ttwu_queue_load;
+ int ttwu_queue_wake_flags;
#ifdef __ARCH_WANT_UNLOCKED_CTXSW
int oncpu;
#endif
@@ -2017,6 +2021,7 @@ extern void release_uids(struct user_nam
extern void do_timer(unsigned long ticks);
+extern void sched_ttwu_pending(void);
extern int wake_up_state(struct task_struct *tsk, unsigned int state);
extern int wake_up_process(struct task_struct *tsk);
extern void wake_up_new_task(struct task_struct *tsk,
Index: linux-2.6/kernel/sched.c
===================================================================
--- linux-2.6.orig/kernel/sched.c
+++ linux-2.6/kernel/sched.c
@@ -515,6 +515,8 @@ struct rq {
u64 age_stamp;
u64 idle_stamp;
u64 avg_idle;
+
+ struct task_struct *wake_list;
#endif
#ifdef CONFIG_IRQ_TIME_ACCOUNTING
@@ -2332,6 +2334,39 @@ static inline void ttwu_post_activation(
wq_worker_waking_up(p, cpu_of(rq));
}
+#ifdef CONFIG_SMP
+static void ttwu_queue_wake_up(struct task_struct *p, int cpu, int wake_flags,
+ int load)
+{
+ struct task_struct *next = NULL;
+ struct rq *rq = cpu_rq(cpu);
+
+ if (load)
+ wake_flags |= WF_LOAD;
+ p->ttwu_queue_load = load;
+ p->ttwu_queue_wake_flags = wake_flags;
+ /* xxx
+ * smp_wmb() is implied by cmpxchg()
+ * (see Documentation/memory-barriers.txt).
+ * It is the case for arm.
+ * I don't know about x86, so do it explicitly until I know for sure.
+ */
+ smp_wmb();
+
+ for (;;) {
+ struct task_struct *old = next;
+
+ p->ttwu_queue_wake_entry = next;
+ next = cmpxchg(&rq->wake_list, old, p);
+ if (next == old)
+ break;
+ }
+
+ if (!next)
+ smp_send_reschedule(cpu);
+}
+#endif
+
/**
* try_to_wake_up - wake up a thread
* @p: the thread to be awakened
@@ -2354,13 +2389,51 @@ static int try_to_wake_up(struct task_st
unsigned long flags;
unsigned long en_flags = ENQUEUE_WAKEUP;
struct rq *rq;
+#ifdef CONFIG_SMP
+ int load;
+#endif
this_cpu = get_cpu();
+ local_irq_save(flags);
+
+#ifdef CONFIG_SMP
+ for (;;) {
+ unsigned int task_state = p->state;
+
+ if (!(task_state & state))
+ goto out_nolock;
+
+ /*
+ * We've got to store the contributes_to_load state before
+ * modifying the task state.
+ */
+ load = task_contributes_to_load(p);
+
+ if (cmpxchg(&p->state, task_state, TASK_WAKING) == task_state) {
+ if (state == TASK_WAKING)
+ load = (wake_flags & WF_LOAD) ? 1 : 0;
+ break;
+ }
+ }
+
+ /*
+ * There is a race where task_cpu could be set to
+ * this_cpu while task_state is TASK_WAKING?
+ *
+ * That's ok, the destination cpu will just send it back here when
+ * it calls try_to_wake_up() of this process.
+ */
+
+ cpu = task_cpu(p);
+ if (cpu != this_cpu) {
+ ttwu_queue_wake_up(p, cpu, wake_flags, load);
+ goto out_nolock;
+ }
+#endif
+
smp_wmb();
- rq = task_rq_lock(p, &flags);
- if (!(p->state & state))
- goto out;
+ rq = __task_rq_lock(p);
if (p->se.on_rq)
goto out_running;
@@ -2373,18 +2446,16 @@ static int try_to_wake_up(struct task_st
goto out_activate;
/*
- * In order to handle concurrent wakeups and release the rq->lock
- * we put the task in TASK_WAKING state.
- *
- * First fix up the nr_uninterruptible count:
+ * Can handle concurrent wakeups and release the rq->lock
+ * since we put the task in TASK_WAKING state.
*/
- if (task_contributes_to_load(p)) {
+
+ if (load) {
if (likely(cpu_online(orig_cpu)))
rq->nr_uninterruptible--;
else
this_rq()->nr_uninterruptible--;
}
- p->state = TASK_WAKING;
if (p->sched_class->task_waking) {
p->sched_class->task_waking(rq, p);
@@ -2430,13 +2501,32 @@ out_activate:
success = 1;
out_running:
ttwu_post_activation(p, rq, wake_flags, success);
-out:
- task_rq_unlock(rq, &flags);
+ __task_rq_unlock(rq);
+#ifdef CONFIG_SMP
+out_nolock:
+#endif
+ local_irq_restore(flags);
put_cpu();
return success;
}
+#ifdef CONFIG_SMP
+void sched_ttwu_pending(void)
+{
+ struct rq *rq = this_rq();
+ struct task_struct *p = xchg(&rq->wake_list, NULL);
+
+ if (!p)
+ return;
+
+ while (p) {
+ try_to_wake_up(p, TASK_WAKING, p->ttwu_queue_wake_flags);
+ p = p->ttwu_queue_wake_entry;
+ }
+}
+#endif
+
/**
* try_to_wake_up_local - try to wake up a local task with rq lock held
* @p: the thread to be awakened
* Re: [PATCH RFC] reduce runqueue lock contention
2010-12-01 23:13 ` Frank Rowand
@ 2010-12-02 1:17 ` Frank Rowand
2010-12-02 7:36 ` Peter Zijlstra
1 sibling, 0 replies; 21+ messages in thread
From: Frank Rowand @ 2010-12-02 1:17 UTC (permalink / raw)
To: Chris Mason
Cc: Peter Zijlstra, Ingo Molnar, axboe@kernel.dk,
linux-kernel@vger.kernel.org, Mike Galbraith, Oleg Nesterov, tglx
On 12/01/10 15:13, Frank Rowand wrote:
< snip >
> The patch was created on the -top tree.
Fat-fingered typing. That should be the -tip tree....
-Frank
* Re: [PATCH RFC] reduce runqueue lock contention
2010-12-01 23:13 ` Frank Rowand
2010-12-02 1:17 ` Frank Rowand
@ 2010-12-02 7:36 ` Peter Zijlstra
1 sibling, 0 replies; 21+ messages in thread
From: Peter Zijlstra @ 2010-12-02 7:36 UTC (permalink / raw)
To: frank.rowand
Cc: Chris Mason, Ingo Molnar, axboe@kernel.dk,
linux-kernel@vger.kernel.org, Mike Galbraith, Oleg Nesterov, tglx
On Wed, 2010-12-01 at 15:13 -0800, Frank Rowand wrote:
>
> If the task to be woken is on a run queue on a different cpu then use
> cmpxchg() to put it onto a pending try_to_wake_up list on the different
> cpu. Then send an interrupt to the different cpu to cause that cpu to
> call try_to_wake_up() for each process on the try_to_wake_up list.
Without having looked at the actual code, the described thing cannot
work: try_to_wake_up() has a return value that needs to be passed back.
Also, try_to_wake_up() does load-balancing; you really want to do that
before queueing it on a remote cpu.
* Re: [PATCH RFC] reduce runqueue lock contention
2010-06-21 13:04 ` Peter Zijlstra
2010-06-22 13:33 ` Peter Zijlstra
@ 2010-12-14 2:41 ` Frank Rowand
2010-12-14 3:42 ` Mike Galbraith
2010-12-15 18:59 ` Oleg Nesterov
1 sibling, 2 replies; 21+ messages in thread
From: Frank Rowand @ 2010-12-14 2:41 UTC (permalink / raw)
To: Peter Zijlstra, Ingo Molnar
Cc: Chris Mason, axboe@kernel.dk, linux-kernel@vger.kernel.org,
Mike Galbraith, Oleg Nesterov, tglx
On 06/21/10 06:04, Peter Zijlstra wrote:
> On Mon, 2010-06-21 at 12:54 +0200, Peter Zijlstra wrote:
>>> It looses the ttwu task_running() check, as I must admit I'm not quite
>>> sure what it does.. Ingo?
>
> I think I figured out what its for, its for when p is prev in schedule()
> after deactivate_task(), we have to call activate_task() it again, but
> we cannot migrate the task because the CPU its on is still referencing
> it.
I have not been able to make sense of the task_running() check in
try_to_wake_up(), even with that clue. The try_to_wake_up() code in
question is:
rq = task_rq_lock(p, &flags);
if (!(p->state & state))
goto out;
if (p->se.on_rq)
goto out_running;
cpu = task_cpu(p);
orig_cpu = cpu;
#ifdef CONFIG_SMP
if (unlikely(task_running(rq, p)))
goto out_activate;
The relevant code in schedule() executes with the rq lock held (many
lines left out to emphasize the key lines):
raw_spin_lock_irq(&rq->lock);
if (prev->state && !(preempt_count() & PREEMPT_ACTIVE)) {
deactivate_task(rq, prev, DEQUEUE_SLEEP);
if (likely(prev != next)) {
rq->curr = next;
context_switch(rq, prev, next); /* unlocks the rq */
} else
raw_spin_unlock_irq(&rq->lock);
If (p->se.on_rq) can become false due to deactivate_task()
then task_running() will also become false while the rq lock is still
held (either via "rq->curr = next" or via context_switch() updating
p->oncpu -- which one matters depends on #ifdef __ARCH_WANT_UNLOCKED_CTXSW).
I haven't been able to find any case where task_running() can be true
when (p->se.on_rq) is false, while the rq lock is not being held. Thus
I don't think the branch to out_activate will ever be taken.
What am I missing, or is the task_running() test not needed?
Thanks!
Frank
* Re: [PATCH RFC] reduce runqueue lock contention
2010-12-14 2:41 ` Frank Rowand
@ 2010-12-14 3:42 ` Mike Galbraith
2010-12-14 21:42 ` Frank Rowand
2010-12-15 18:59 ` Oleg Nesterov
1 sibling, 1 reply; 21+ messages in thread
From: Mike Galbraith @ 2010-12-14 3:42 UTC (permalink / raw)
To: frank.rowand
Cc: Peter Zijlstra, Ingo Molnar, Chris Mason, axboe@kernel.dk,
linux-kernel@vger.kernel.org, Oleg Nesterov, tglx
On Mon, 2010-12-13 at 18:41 -0800, Frank Rowand wrote:
> On 06/21/10 06:04, Peter Zijlstra wrote:
> > On Mon, 2010-06-21 at 12:54 +0200, Peter Zijlstra wrote:
> >>> It looses the ttwu task_running() check, as I must admit I'm not quite
> >>> sure what it does.. Ingo?
> >
> > I think I figured out what its for, its for when p is prev in schedule()
> > after deactivate_task(), we have to call activate_task() it again, but
> > we cannot migrate the task because the CPU its on is still referencing
> > it.
>
> I have not been able to make sense of the task_running() check in
> try_to_wake_up(), even with that clue. The try_to_wake_up() code in
> question is:
>
> rq = task_rq_lock(p, &flags);
> if (!(p->state & state))
> goto out;
>
> if (p->se.on_rq)
> goto out_running;
>
> cpu = task_cpu(p);
> orig_cpu = cpu;
>
> #ifdef CONFIG_SMP
> if (unlikely(task_running(rq, p)))
> goto out_activate;
>
>
> The relevant code in schedule() executes with the rq lock held (many
> lines left out to emphasize the key lines):
>
> raw_spin_lock_irq(&rq->lock);
>
> if (prev->state && !(preempt_count() & PREEMPT_ACTIVE)) {
>
> deactivate_task(rq, prev, DEQUEUE_SLEEP);
>
> if (likely(prev != next)) {
> rq->curr = next;
> context_switch(rq, prev, next); /* unlocks the rq */
> } else
> raw_spin_unlock_irq(&rq->lock);
>
>
>> If (p->se.on_rq) can become false due to deactivate_task()
> then task_running() will also become false while the rq lock is still
> held (either via "rq->curr = next" or via context_switch() updating
> p->oncpu -- which one matters depends on #ifdef __ARCH_WANT_UNLOCKED_CTXSW).
>
> I haven't been able to find any case where task_running() can be true
> when (p->se.on_rq) is false, while the rq lock is not being held. Thus
> I don't think the branch to out_activate will ever be taken.
>
> What am I missing, or is the task_running() test not needed?
Say the last runnable task passes through schedule(), is deactivated.
We hit idle_balance(), which drops/retakes rq->lock _before_ the task
schedules off. ttwu() can acquire rq->lock in that window, p->se.on_rq
is false, p->state is true, as is task_current(rq, p).
We have to check whether the task is still current, but not enqueued,
lest the wakeup be a noop, and the wakee possibly then sleep forever.
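Spelled out as an interleaving (illustrative and simplified, not exact
source):

	CPU0: schedule(), prev == p
		raw_spin_lock_irq(&rq->lock);
		deactivate_task(rq, p, DEQUEUE_SLEEP);	/* p->se.on_rq = 0 */
		idle_balance(cpu, rq);
			/* drops rq->lock ... */

	CPU1: try_to_wake_up(p)
		rq = task_rq_lock(p, &flags);
		/* observes: p->state & state     -> true
		 *           p->se.on_rq          -> false
		 *           task_running(rq, p)  -> true (p is still rq->curr)
		 * so it must take the out_activate path and re-activate p
		 * (without migrating it); otherwise the wakeup is a noop and
		 * p may sleep forever.
		 */
		task_rq_unlock(rq, &flags);

	CPU0:
			/* ... retakes rq->lock */
		context_switch(rq, p, next);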
-Mike
* Re: [PATCH RFC] reduce runqueue lock contention
2010-12-14 3:42 ` Mike Galbraith
@ 2010-12-14 21:42 ` Frank Rowand
0 siblings, 0 replies; 21+ messages in thread
From: Frank Rowand @ 2010-12-14 21:42 UTC (permalink / raw)
To: Mike Galbraith
Cc: Rowand, Frank, Peter Zijlstra, Ingo Molnar, Chris Mason,
axboe@kernel.dk, linux-kernel@vger.kernel.org, Oleg Nesterov,
tglx
On 12/13/10 19:42, Mike Galbraith wrote:
> On Mon, 2010-12-13 at 18:41 -0800, Frank Rowand wrote:
>> On 06/21/10 06:04, Peter Zijlstra wrote:
>>> On Mon, 2010-06-21 at 12:54 +0200, Peter Zijlstra wrote:
>>>>> It looses the ttwu task_running() check, as I must admit I'm not quite
>>>>> sure what it does.. Ingo?
>>>
>>> I think I figured out what its for, its for when p is prev in schedule()
>>> after deactivate_task(), we have to call activate_task() it again, but
>>> we cannot migrate the task because the CPU its on is still referencing
>>> it.
>>
>> I have not been able to make sense of the task_running() check in
>> try_to_wake_up(), even with that clue. The try_to_wake_up() code in
>> question is:
>>
>> rq = task_rq_lock(p, &flags);
>> if (!(p->state & state))
>> goto out;
>>
>> if (p->se.on_rq)
>> goto out_running;
>>
>> cpu = task_cpu(p);
>> orig_cpu = cpu;
>>
>> #ifdef CONFIG_SMP
>> if (unlikely(task_running(rq, p)))
>> goto out_activate;
>>
>>
>> The relevant code in schedule() executes with the rq lock held (many
>> lines left out to emphasize the key lines):
Additional lines added here to show the function call that Mike pointed out:
>>
>> raw_spin_lock_irq(&rq->lock);
>>
>> if (prev->state && !(preempt_count() & PREEMPT_ACTIVE)) {
>>
>> deactivate_task(rq, prev, DEQUEUE_SLEEP);
	}
	pre_schedule(rq, prev);
	if (unlikely(!rq->nr_running))
		idle_balance(cpu, rq);
>>
>> if (likely(prev != next)) {
>> rq->curr = next;
>> context_switch(rq, prev, next); /* unlocks the rq */
>> } else
>> raw_spin_unlock_irq(&rq->lock);
>>
>>
>> If (p->se.on_rq) can become false due to deactivate_task()
>> then task_running() will also become false while the rq lock is still
>> held (either via "rq->curr = next" or via context_switch() updating
>> p->oncpu -- which one matters depends on #ifdef __ARCH_WANT_UNLOCKED_CTXSW).
>>
>> I haven't been able to find any case where task_running() can be true
>> when (p->se.on_rq) is false, while the rq lock is not being held. Thus
>> I don't think the branch to out_activate will ever be taken.
>>
>> What am I missing, or is the task_running() test not needed?
>
> Say the last runnable task passes through schedule(), is deactivated.
> We hit idle_balance(), which drops/retakes rq->lock _before_ the task
> schedules off. ttwu() can acquire rq->lock in that window, p->se.on_rq
> is false, p->state is true, as is task_current(rq, p).
>
> We have to check whether the task is still current, but not enqueued,
> lest the wakeup be a noop, and the wakee possibly then sleep forever.
Thanks Mike, that's just the cluebat I needed!
And for the lkml archives, in case anyone has this question again in the
future, with Mike's clue in hand I found another case in this window where
the rq->lock can be dropped then reacquired. Just before idle_balance()
is called, pre_schedule() is called:
	pre_schedule()
	    prev->sched_class->pre_schedule(rq, prev)
	        [pre_schedule_rt()]
	        pull_rt_task(rq)
	            pull_rt_task[this_rq]
	            for_each_cpu(cpu, this_rq->rd->rto_mask)
	                double_lock_balance(this_rq, src_rq)
	                    raw_spin_unlock(&this_rq->lock)    <-----
	                    double_rq_lock(this_rq, busiest)
-Frank
* Re: [PATCH RFC] reduce runqueue lock contention
2010-12-14 2:41 ` Frank Rowand
2010-12-14 3:42 ` Mike Galbraith
@ 2010-12-15 18:59 ` Oleg Nesterov
1 sibling, 0 replies; 21+ messages in thread
From: Oleg Nesterov @ 2010-12-15 18:59 UTC (permalink / raw)
To: Frank Rowand
Cc: Peter Zijlstra, Ingo Molnar, Chris Mason, axboe@kernel.dk,
linux-kernel@vger.kernel.org, Mike Galbraith, tglx
On 12/13, Frank Rowand wrote:
>
> I have not been able to make sense of the task_running() check in
> try_to_wake_up(), even with that clue. The try_to_wake_up() code in
> question is:
> ...
>
> What am I missing, or is the task_running() test not needed?
I am afraid I may have misunderstood all this, including the question ;)
But, with __ARCH_WANT_UNLOCKED_CTXSW, task_running() checks ->oncpu.
However, schedule() drops rq->lock after prev was deactivated but
before it clears prev->oncpu.
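Roughly (a simplified sketch of the __ARCH_WANT_UNLOCKED_CTXSW path, not
exact source):

	schedule()
	  deactivate_task(rq, prev, DEQUEUE_SLEEP);	/* prev->se.on_rq = 0 */
	  context_switch(rq, prev, next)
	    prepare_lock_switch(rq, next)
	      raw_spin_unlock(&rq->lock);	/* rq->lock dropped here ... */
	    switch_to(prev, next, prev);
	    finish_task_switch(rq, prev)
	      finish_lock_switch(rq, prev)
	        prev->oncpu = 0;		/* ... but ->oncpu, and with it
						   task_running(), only goes
						   false here */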
Oleg.
Thread overview: 21+ messages:
2010-05-20 20:48 [PATCH RFC] reduce runqueue lock contention Chris Mason
2010-05-20 21:09 ` Peter Zijlstra
2010-05-20 21:23 ` Peter Zijlstra
2010-05-20 22:17 ` Chris Mason
2010-05-20 22:21 ` Chris Mason
2010-06-04 10:56 ` Stijn Devriendt
2010-06-04 12:00 ` Peter Zijlstra
2010-06-05 9:37 ` Stijn Devriendt
2010-06-21 10:02 ` Peter Zijlstra
2010-06-21 10:54 ` Peter Zijlstra
2010-06-21 13:04 ` Peter Zijlstra
2010-06-22 13:33 ` Peter Zijlstra
2010-06-22 21:11 ` Ingo Molnar
2010-06-23 9:10 ` Peter Zijlstra
2010-12-01 23:13 ` Frank Rowand
2010-12-02 1:17 ` Frank Rowand
2010-12-02 7:36 ` Peter Zijlstra
2010-12-14 2:41 ` Frank Rowand
2010-12-14 3:42 ` Mike Galbraith
2010-12-14 21:42 ` Frank Rowand
2010-12-15 18:59 ` Oleg Nesterov