* [RFC][PATCH RT 1/4] sched/rt: Fix push_rt_task() to have the same checks as the caller did
2012-12-07 23:56 [RFC][PATCH RT 0/4] sched/rt: Lower rq lock contention latencies on many CPU boxes Steven Rostedt
@ 2012-12-07 23:56 ` Steven Rostedt
2012-12-07 23:56 ` [RFC][PATCH RT 2/4] sched/rt: Try to migrate task if preempting pinned rt task Steven Rostedt
` (3 subsequent siblings)
4 siblings, 0 replies; 14+ messages in thread
From: Steven Rostedt @ 2012-12-07 23:56 UTC (permalink / raw)
To: linux-kernel, linux-rt-users
Cc: Thomas Gleixner, Carsten Emde, John Kacur, Peter Zijlstra,
Clark Williams, Ingo Molnar
[-- Attachment #1: fix-push-rt-task-check.patch --]
[-- Type: text/plain, Size: 1975 bytes --]
Currently, push_rt_task() only pushes the task if it is lower in
priority than the currently running task.
But that is not the only check the caller makes. If the currently
running task is pinned, we may want to push as well. We make that
broader check when we wake up a task, but then the push is bound to
fail whenever the narrower internal check in push_rt_task() disagrees
with it.
Make the internal check the same as the wakeup check. We could remove
the check in the wakeup path and just let push_rt_task() do the work,
but keeping it lets the wakeup path bail out early in the likely case
that ok_to_push_task() fails, so that we don't need to do the
iterative loop of checks on the pushable task list.
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Index: linux-rt.git/kernel/sched/rt.c
===================================================================
--- linux-rt.git.orig/kernel/sched/rt.c
+++ linux-rt.git/kernel/sched/rt.c
@@ -1615,6 +1615,15 @@ static struct task_struct *pick_next_pus
return p;
}
+static int ok_to_push_task(struct task_struct *p, struct task_struct *curr)
+{
+ return p->nr_cpus_allowed > 1 &&
+ rt_task(curr) &&
+ (curr->migrate_disable ||
+ curr->nr_cpus_allowed < 2 ||
+ curr->prio <= p->prio);
+}
+
/*
* If the current CPU has more than one RT task, see if the non
* running task can migrate over to a CPU that is running a task
@@ -1649,7 +1658,7 @@ retry:
* higher priority than current. If that's the case
* just reschedule current.
*/
- if (unlikely(next_task->prio < rq->curr->prio)) {
+ if (!ok_to_push_task(next_task, rq->curr)) {
resched_task(rq->curr);
return 0;
}
@@ -1814,10 +1823,7 @@ static void task_woken_rt(struct rq *rq,
if (!task_running(rq, p) &&
!test_tsk_need_resched(rq->curr) &&
has_pushable_tasks(rq) &&
- p->nr_cpus_allowed > 1 &&
- rt_task(rq->curr) &&
- (rq->curr->nr_cpus_allowed < 2 ||
- rq->curr->prio <= p->prio))
+ ok_to_push_task(p, rq->curr))
push_rt_tasks(rq);
}
^ permalink raw reply	[flat|nested] 14+ messages in thread
* [RFC][PATCH RT 2/4] sched/rt: Try to migrate task if preempting pinned rt task
2012-12-07 23:56 [RFC][PATCH RT 0/4] sched/rt: Lower rq lock contention latencies on many CPU boxes Steven Rostedt
2012-12-07 23:56 ` [RFC][PATCH RT 1/4] sched/rt: Fix push_rt_task() to have the same checks as the caller did Steven Rostedt
@ 2012-12-07 23:56 ` Steven Rostedt
2012-12-07 23:56 ` [RFC][PATCH RT 3/4] sched/rt: Use IPI to trigger RT task push migration instead of pulling Steven Rostedt
` (2 subsequent siblings)
4 siblings, 0 replies; 14+ messages in thread
From: Steven Rostedt @ 2012-12-07 23:56 UTC (permalink / raw)
To: linux-kernel, linux-rt-users
Cc: Thomas Gleixner, Carsten Emde, John Kacur, Peter Zijlstra,
Clark Williams, Ingo Molnar
[-- Attachment #1: fix-post-sched-push.patch --]
[-- Type: text/plain, Size: 1547 bytes --]
If a higher priority task is about to preempt a task that has been
pinned to a CPU, first try to see if the higher priority task can be
pushed to another CPU instead.
That is, if a high priority task wakes up on a CPU whose currently
running task can still migrate, the wakeup path will not push the high
priority task to another CPU. But by the time the woken task actually
schedules, the task it is about to preempt could have changed its
affinity and become pinned. At this point, it may be better to move the
woken task to another CPU, if one exists that is currently running a
lower priority task than the one about to be preempted.
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Index: linux-rt.git/kernel/sched/rt.c
===================================================================
--- linux-rt.git.orig/kernel/sched/rt.c
+++ linux-rt.git/kernel/sched/rt.c
@@ -1804,8 +1804,22 @@ skip:
static void pre_schedule_rt(struct rq *rq, struct task_struct *prev)
{
+ struct task_struct *p = prev;
+
+ /*
+ * If we are preempting a migrate disabled task
+ * see if we can push the higher tasks first.
+ */
+ if (prev->on_rq && (prev->nr_cpus_allowed <= 1 || prev->migrate_disable) &&
+ has_pushable_tasks(rq) && rq->rt.highest_prio.next < prev->prio) {
+ p = _pick_next_task_rt(rq);
+
+ if (p != prev && p->nr_cpus_allowed > 1 && push_rt_task(rq))
+ p = _pick_next_task_rt(rq);
+ }
+
/* Try to pull RT tasks here if we lower this rq's prio */
- if (rq->rt.highest_prio.curr > prev->prio)
+ if (rq->rt.highest_prio.curr > p->prio)
pull_rt_task(rq);
}
^ permalink raw reply	[flat|nested] 14+ messages in thread
* [RFC][PATCH RT 3/4] sched/rt: Use IPI to trigger RT task push migration instead of pulling
2012-12-07 23:56 [RFC][PATCH RT 0/4] sched/rt: Lower rq lock contention latencies on many CPU boxes Steven Rostedt
2012-12-07 23:56 ` [RFC][PATCH RT 1/4] sched/rt: Fix push_rt_task() to have the same checks as the caller did Steven Rostedt
2012-12-07 23:56 ` [RFC][PATCH RT 2/4] sched/rt: Try to migrate task if preempting pinned rt task Steven Rostedt
@ 2012-12-07 23:56 ` Steven Rostedt
2012-12-11 0:48 ` Frank Rowand
2012-12-07 23:56 ` [RFC][PATCH RT 4/4] sched/rt: Initiate a pull when the priority of a task is lowered Steven Rostedt
2012-12-10 22:59 ` [RFC][PATCH RT 0/4] sched/rt: Lower rq lock contention latencies on many CPU boxes Clark Williams
4 siblings, 1 reply; 14+ messages in thread
From: Steven Rostedt @ 2012-12-07 23:56 UTC (permalink / raw)
To: linux-kernel, linux-rt-users
Cc: Thomas Gleixner, Carsten Emde, John Kacur, Peter Zijlstra,
Clark Williams, Ingo Molnar
[-- Attachment #1: push-rt-task-ipi.patch --]
[-- Type: text/plain, Size: 6450 bytes --]
When debugging the latencies on a 40 core box, where we hit 300 to
500 microsecond latencies, I found there was a huge contention on the
runqueue locks.
Investigating it further, running ftrace, I found that it was due to
the pulling of RT tasks.
The test that was run was the following:
cyclictest --numa -p95 -m -d0 -i100
This created a thread on each CPU that would set its wakeup in intervals
of 100 microseconds. The -d0 means that all the threads had the same
interval (100us). Each thread sleeps for 100us, wakes up, and measures
its latencies.
What happened was that another RT task would be scheduled on one of the
CPUs that was running our test; when the other CPUs' test threads went to
sleep and the CPUs scheduled idle, this caused the "pull" operation to
execute on all these CPUs. Each one of these saw the RT task that was
overloaded on the CPU of the test that was still running, and each one
tried to grab that task in a thundering-herd way.
To grab the task, each CPU would do a double rq lock grab, grabbing
its own lock as well as the rq lock of the overloaded CPU. As the sched
domains on this box were rather flat for its size, I saw up to 12 CPUs
block on this lock at once. This caused a ripple effect with the
rq locks. As these locks were blocked, any wakeups on these CPUs
would also block on these locks, and the wait time escalated.
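(For reference, the double lock is taken via double_lock_balance(). A
simplified sketch of the non-PREEMPT variant, which keeps lock ordering
by address and may have to drop this_rq's lock to do so:)

static int _double_lock_balance(struct rq *this_rq, struct rq *busiest)
{
	int ret = 0;

	if (unlikely(!raw_spin_trylock(&busiest->lock))) {
		if (busiest < this_rq) {
			/* keep lock ordering: drop ours, take both in order */
			raw_spin_unlock(&this_rq->lock);
			raw_spin_lock(&busiest->lock);
			raw_spin_lock_nested(&this_rq->lock,
					     SINGLE_DEPTH_NESTING);
			/* this_rq was unlocked; caller must recheck its state */
			ret = 1;
		} else
			raw_spin_lock_nested(&busiest->lock,
					     SINGLE_DEPTH_NESTING);
	}
	return ret;
}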
I've tried various methods to lessen the load, but things like an
atomic counter to only let one CPU grab the task won't work, because
the task may have a limited affinity, and we may pick the wrong
CPU to take that lock and do the pull, only to find out that the
CPU we picked isn't in the task's affinity.
Instead of doing the PULL, I now have the CPUs that want the pull to
send over an IPI to the overloaded CPU, and let that CPU pick what
CPU to push the task to. No more need to grab the rq lock, and the
push/pull algorithm still works fine.
With this patch, the latency dropped to just 150us over a 20 hour run.
Without the patch, the huge latencies would trigger in seconds.
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Index: linux-rt.git/kernel/sched/core.c
===================================================================
--- linux-rt.git.orig/kernel/sched/core.c
+++ linux-rt.git/kernel/sched/core.c
@@ -1538,6 +1538,8 @@ static void sched_ttwu_pending(void)
void scheduler_ipi(void)
{
+ sched_rt_push_check();
+
if (llist_empty(&this_rq()->wake_list) && !got_nohz_idle_kick())
return;
Index: linux-rt.git/kernel/sched/rt.c
===================================================================
--- linux-rt.git.orig/kernel/sched/rt.c
+++ linux-rt.git/kernel/sched/rt.c
@@ -1425,53 +1425,6 @@ static void put_prev_task_rt(struct rq *
/* Only try algorithms three times */
#define RT_MAX_TRIES 3
-static int pick_rt_task(struct rq *rq, struct task_struct *p, int cpu)
-{
- if (!task_running(rq, p) &&
- (cpu < 0 || cpumask_test_cpu(cpu, tsk_cpus_allowed(p))) &&
- (p->nr_cpus_allowed > 1))
- return 1;
- return 0;
-}
-
-/* Return the second highest RT task, NULL otherwise */
-static struct task_struct *pick_next_highest_task_rt(struct rq *rq, int cpu)
-{
- struct task_struct *next = NULL;
- struct sched_rt_entity *rt_se;
- struct rt_prio_array *array;
- struct rt_rq *rt_rq;
- int idx;
-
- for_each_leaf_rt_rq(rt_rq, rq) {
- array = &rt_rq->active;
- idx = sched_find_first_bit(array->bitmap);
-next_idx:
- if (idx >= MAX_RT_PRIO)
- continue;
- if (next && next->prio <= idx)
- continue;
- list_for_each_entry(rt_se, array->queue + idx, run_list) {
- struct task_struct *p;
-
- if (!rt_entity_is_task(rt_se))
- continue;
-
- p = rt_task_of(rt_se);
- if (pick_rt_task(rq, p, cpu)) {
- next = p;
- break;
- }
- }
- if (!next) {
- idx = find_next_bit(array->bitmap, MAX_RT_PRIO, idx+1);
- goto next_idx;
- }
- }
-
- return next;
-}
-
static DEFINE_PER_CPU(cpumask_var_t, local_cpu_mask);
static int find_lowest_rq(struct task_struct *task)
@@ -1723,10 +1676,24 @@ static void push_rt_tasks(struct rq *rq)
;
}
+void sched_rt_push_check(void)
+{
+ struct rq *rq = cpu_rq(smp_processor_id());
+
+ if (WARN_ON_ONCE(!irqs_disabled()))
+ return;
+
+ if (!has_pushable_tasks(rq))
+ return;
+
+ raw_spin_lock(&rq->lock);
+ push_rt_tasks(rq);
+ raw_spin_unlock(&rq->lock);
+}
+
static int pull_rt_task(struct rq *this_rq)
{
int this_cpu = this_rq->cpu, ret = 0, cpu;
- struct task_struct *p;
struct rq *src_rq;
if (likely(!rt_overloaded(this_rq)))
@@ -1749,54 +1716,7 @@ static int pull_rt_task(struct rq *this_
this_rq->rt.highest_prio.curr)
continue;
- /*
- * We can potentially drop this_rq's lock in
- * double_lock_balance, and another CPU could
- * alter this_rq
- */
- double_lock_balance(this_rq, src_rq);
-
- /*
- * Are there still pullable RT tasks?
- */
- if (src_rq->rt.rt_nr_running <= 1)
- goto skip;
-
- p = pick_next_highest_task_rt(src_rq, this_cpu);
-
- /*
- * Do we have an RT task that preempts
- * the to-be-scheduled task?
- */
- if (p && (p->prio < this_rq->rt.highest_prio.curr)) {
- WARN_ON(p == src_rq->curr);
- WARN_ON(!p->on_rq);
-
- /*
- * There's a chance that p is higher in priority
- * than what's currently running on its cpu.
- * This is just that p is wakeing up and hasn't
- * had a chance to schedule. We only pull
- * p if it is lower in priority than the
- * current task on the run queue
- */
- if (p->prio < src_rq->curr->prio)
- goto skip;
-
- ret = 1;
-
- deactivate_task(src_rq, p, 0);
- set_task_cpu(p, this_cpu);
- activate_task(this_rq, p, 0);
- /*
- * We continue with the search, just in
- * case there's an even higher prio task
- * in another runqueue. (low likelihood
- * but possible)
- */
- }
-skip:
- double_unlock_balance(this_rq, src_rq);
+ smp_send_reschedule(cpu);
}
return ret;
Index: linux-rt.git/kernel/sched/sched.h
===================================================================
--- linux-rt.git.orig/kernel/sched/sched.h
+++ linux-rt.git/kernel/sched/sched.h
@@ -1111,6 +1111,8 @@ static inline void double_rq_unlock(stru
__release(rq2->lock);
}
+void sched_rt_push_check(void);
+
#else /* CONFIG_SMP */
/*
@@ -1144,6 +1146,9 @@ static inline void double_rq_unlock(stru
__release(rq2->lock);
}
+static inline void sched_rt_push_check(void)
+{
+}
#endif
extern struct sched_entity *__pick_first_entity(struct cfs_rq *cfs_rq);
^ permalink raw reply	[flat|nested] 14+ messages in thread
* Re: [RFC][PATCH RT 3/4] sched/rt: Use IPI to trigger RT task push migration instead of pulling
2012-12-07 23:56 ` [RFC][PATCH RT 3/4] sched/rt: Use IPI to trigger RT task push migration instead of pulling Steven Rostedt
@ 2012-12-11 0:48 ` Frank Rowand
2012-12-11 1:15 ` Frank Rowand
2012-12-11 1:41 ` Steven Rostedt
0 siblings, 2 replies; 14+ messages in thread
From: Frank Rowand @ 2012-12-11 0:48 UTC (permalink / raw)
To: Steven Rostedt
Cc: linux-kernel@vger.kernel.org, linux-rt-users, Thomas Gleixner,
Carsten Emde, John Kacur, Peter Zijlstra, Clark Williams,
Ingo Molnar
On 12/07/12 15:56, Steven Rostedt wrote:
> When debugging the latencies on a 40 core box, where we hit 300 to
> 500 microsecond latencies, I found there was a huge contention on the
> runqueue locks.
>
> Investigating it further, running ftrace, I found that it was due to
> the pulling of RT tasks.
>
> The test that was run was the following:
>
> cyclictest --numa -p95 -m -d0 -i100
>
> This created a thread on each CPU that would set its wakeup in intervals
> of 100 microseconds. The -d0 means that all the threads had the same
> interval (100us). Each thread sleeps for 100us, wakes up, and measures
> its latencies.
>
> What happened was that another RT task would be scheduled on one of the
> CPUs that was running our test; when the other CPUs' test threads went to
> sleep and the CPUs scheduled idle, this caused the "pull" operation to
> execute on all these CPUs. Each one of these saw the RT task that was
> overloaded on the CPU of the test that was still running, and each one
> tried to grab that task in a thundering-herd way.
>
> To grab the task, each CPU would do a double rq lock grab, grabbing
> its own lock as well as the rq lock of the overloaded CPU. As the sched
> domains on this box were rather flat for its size, I saw up to 12 CPUs
> block on this lock at once. This caused a ripple effect with the
> rq locks. As these locks were blocked, any wakeups on these CPUs
> would also block on these locks, and the wait time escalated.
>
> I've tried various methods to lessen the load, but things like an
> atomic counter to only let one CPU grab the task won't work, because
> the task may have a limited affinity, and we may pick the wrong
> CPU to take that lock and do the pull, only to find out that the
> CPU we picked isn't in the task's affinity.
You are saying that the pulling CPU might not be in the pulled task's
affinity? But isn't that checked:
pull_rt_task()
pick_next_highest_task_rt()
pick_rt_task()
if ( ... || cpumask_test_cpu(cpu, tsk_cpus_allowed(p) ...
>
> Instead of doing the PULL, I now have the CPUs that want the pull to
> send over an IPI to the overloaded CPU, and let that CPU pick what
> CPU to push the task to. No more need to grab the rq lock, and the
> push/pull algorithm still works fine.
That gives me the opposite of a warm fuzzy feeling. Processing an IPI
on the overloaded CPU is not free (I'm being ARM-centric), and this is
putting more load on the already overloaded CPU.
I do recognize that you have actual measurements below that show goodness
for the pathological case you debugged. I'm still mulling this all over...
>
> With this patch, the latency dropped to just 150us over a 20 hour run.
> Without the patch, the huge latencies would trigger in seconds.
-Frank
^ permalink raw reply	[flat|nested] 14+ messages in thread
* Re: [RFC][PATCH RT 3/4] sched/rt: Use IPI to trigger RT task push migration instead of pulling
2012-12-11 0:48 ` Frank Rowand
@ 2012-12-11 1:15 ` Frank Rowand
2012-12-11 1:53 ` Steven Rostedt
2012-12-11 1:41 ` Steven Rostedt
1 sibling, 1 reply; 14+ messages in thread
From: Frank Rowand @ 2012-12-11 1:15 UTC (permalink / raw)
To: Steven Rostedt
Cc: linux-kernel@vger.kernel.org, linux-rt-users, Thomas Gleixner,
Carsten Emde, John Kacur, Peter Zijlstra, Clark Williams,
Ingo Molnar
On 12/10/12 16:48, Frank Rowand wrote:
> On 12/07/12 15:56, Steven Rostedt wrote:
>> When debugging the latencies on a 40 core box, where we hit 300 to
>> 500 microsecond latencies, I found there was a huge contention on the
>> runqueue locks.
>>
>> Investigating it further, running ftrace, I found that it was due to
>> the pulling of RT tasks.
>>
>> The test that was run was the following:
>>
>> cyclictest --numa -p95 -m -d0 -i100
>>
>> This created a thread on each CPU that would set its wakeup in intervals
>> of 100 microseconds. The -d0 means that all the threads had the same
>> interval (100us). Each thread sleeps for 100us, wakes up, and measures
>> its latencies.
>>
>> What happened was that another RT task would be scheduled on one of the
>> CPUs that was running our test; when the other CPUs' test threads went to
>> sleep and the CPUs scheduled idle, this caused the "pull" operation to
>> execute on all these CPUs. Each one of these saw the RT task that was
>> overloaded on the CPU of the test that was still running, and each one
>> tried to grab that task in a thundering-herd way.
>>
>> To grab the task, each CPU would do a double rq lock grab, grabbing
>> its own lock as well as the rq lock of the overloaded CPU. As the sched
>> domains on this box were rather flat for its size, I saw up to 12 CPUs
>> block on this lock at once. This caused a ripple effect with the
>> rq locks. As these locks were blocked, any wakeups on these CPUs
>> would also block on these locks, and the wait time escalated.
>>
>> I've tried various methods to lessen the load, but things like an
>> atomic counter to only let one CPU grab the task won't work, because
>> the task may have a limited affinity, and we may pick the wrong
>> CPU to take that lock and do the pull, only to find out that the
>> CPU we picked isn't in the task's affinity.
>
> You are saying that the pulling CPU might not be in the pulled task's
> affinity? But isn't that checked:
>
> pull_rt_task()
> pick_next_highest_task_rt()
> pick_rt_task()
> if ( ... || cpumask_test_cpu(cpu, tsk_cpus_allowed(p) ...
>
>>
>> Instead of doing the PULL, I now have the CPUs that want the pull to
>> send over an IPI to the overloaded CPU, and let that CPU pick what
>> CPU to push the task to. No more need to grab the rq lock, and the
>> push/pull algorithm still works fine.
>
> That gives me the opposite of a warm fuzzy feeling. Processing an IPI
> on the overloaded CPU is not free (I'm being ARM-centric), and this is
> putting more load on the already overloaded CPU.
I should have also mentioned some previous experience using IPIs to
avoid runq lock contention on wake up. Someone encountered IPI
storms when using the TTWU_QUEUE feature, thus it defaults to off
for CONFIG_PREEMPT_RT_FULL:
#ifndef CONFIG_PREEMPT_RT_FULL
/*
* Queue remote wakeups on the target CPU and process them
* using the scheduler IPI. Reduces rq->lock contention/bounces.
*/
SCHED_FEAT(TTWU_QUEUE, true)
#else
SCHED_FEAT(TTWU_QUEUE, false)
#endif
-Frank
^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: [RFC][PATCH RT 3/4] sched/rt: Use IPI to trigger RT task push migration instead of pulling
2012-12-11 1:15 ` Frank Rowand
@ 2012-12-11 1:53 ` Steven Rostedt
2012-12-11 7:07 ` Mike Galbraith
2012-12-11 12:43 ` Thomas Gleixner
0 siblings, 2 replies; 14+ messages in thread
From: Steven Rostedt @ 2012-12-11 1:53 UTC (permalink / raw)
To: frank.rowand
Cc: linux-kernel@vger.kernel.org, linux-rt-users, Thomas Gleixner,
Carsten Emde, John Kacur, Peter Zijlstra, Clark Williams,
Ingo Molnar
On Mon, 2012-12-10 at 17:15 -0800, Frank Rowand wrote:
> I should have also mentioned some previous experience using IPIs to
> avoid runq lock contention on wake up. Someone encountered IPI
> storms when using the TTWU_QUEUE feature, thus it defaults to off
> for CONFIG_PREEMPT_RT_FULL:
>
> #ifndef CONFIG_PREEMPT_RT_FULL
> /*
> * Queue remote wakeups on the target CPU and process them
> * using the scheduler IPI. Reduces rq->lock contention/bounces.
> */
> SCHED_FEAT(TTWU_QUEUE, true)
> #else
> SCHED_FEAT(TTWU_QUEUE, false)
>
Interesting, but I'm wondering if this also does it for every wakeup? If
you have 1000 tasks waking up on another CPU, this could potentially
send out 1000 IPIs. The number of IPIs here looks to be # of tasks
waking up, and perhaps more than that, as there could be multiple
instances that try to wake up the same task.
Now this patch set, the # of IPIs is limited to the # of CPUs. If you
have 4 CPUs, you'll get a storm of 3 IPIs. That's a big difference.
Now we could even add a flag, do a test_and_set on it, and send out
an IPI only if the flag wasn't already set. Have the target CPU clear
the flag and then do the pushing. This would further limit the number
of IPIs that need to be sent. I didn't add this yet, because I wanted
to test the current code first, and only add it if there is an issue
with too many IPIs.
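Roughly what I have in mind (untested sketch; rt_push_pending would be
a new unsigned long field in struct rq):

	/* in pull_rt_task(), instead of an unconditional IPI: */
	if (!test_and_set_bit(0, &src_rq->rt_push_pending))
		smp_send_reschedule(cpu);

	/* in sched_rt_push_check() on the target CPU: */
	if (!test_and_clear_bit(0, &rq->rt_push_pending))
		return;

	raw_spin_lock(&rq->lock);
	push_rt_tasks(rq);
	raw_spin_unlock(&rq->lock);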
-- Steve
^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: [RFC][PATCH RT 3/4] sched/rt: Use IPI to trigger RT task push migration instead of pulling
2012-12-11 1:53 ` Steven Rostedt
@ 2012-12-11 7:07 ` Mike Galbraith
2012-12-11 12:43 ` Thomas Gleixner
1 sibling, 0 replies; 14+ messages in thread
From: Mike Galbraith @ 2012-12-11 7:07 UTC (permalink / raw)
To: Steven Rostedt
Cc: frank.rowand, linux-kernel@vger.kernel.org, linux-rt-users,
Thomas Gleixner, Carsten Emde, John Kacur, Peter Zijlstra,
Clark Williams, Ingo Molnar
On Mon, 2012-12-10 at 20:53 -0500, Steven Rostedt wrote:
> On Mon, 2012-12-10 at 17:15 -0800, Frank Rowand wrote:
>
> > I should have also mentioned some previous experience using IPIs to
> > avoid runq lock contention on wake up. Someone encountered IPI
> > storms when using the TTWU_QUEUE feature, thus it defaults to off
> > for CONFIG_PREEMPT_RT_FULL:
> >
> > #ifndef CONFIG_PREEMPT_RT_FULL
> > /*
> > * Queue remote wakeups on the target CPU and process them
> > * using the scheduler IPI. Reduces rq->lock contention/bounces.
> > */
> > SCHED_FEAT(TTWU_QUEUE, true)
> > #else
> > SCHED_FEAT(TTWU_QUEUE, false)
> >
>
> Interesting, but I'm wondering if this also does it for every wakeup? If
> you have 1000 tasks waking up on another CPU, this could potentially
> send out 1000 IPIs. The number of IPIs here looks to be # of tasks
> waking up, and perhaps more than that, as there could be multiple
> instances that try to wake up the same task.
Yeah. In mainline, wakeup via IPI is disabled within a socket, because
it's too much of a performance hit for high frequency switchers. (It
seems we're limited by the max rate at which we can IPI)
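(The gate in mainline's ttwu_queue() looks roughly like this; the
cpus_share_cache() test is what keeps the IPI path off within a socket:)

static void ttwu_queue(struct task_struct *p, int cpu)
{
	struct rq *rq = cpu_rq(cpu);

#if defined(CONFIG_SMP)
	/* only queue the wakeup via IPI across cache domains */
	if (sched_feat(TTWU_QUEUE) &&
	    !cpus_share_cache(smp_processor_id(), cpu)) {
		ttwu_queue_remote(p, cpu);
		return;
	}
#endif

	raw_spin_lock(&rq->lock);
	ttwu_do_activate(rq, p, 0);
	raw_spin_unlock(&rq->lock);
}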
-Mike
^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: [RFC][PATCH RT 3/4] sched/rt: Use IPI to trigger RT task push migration instead of pulling
2012-12-11 1:53 ` Steven Rostedt
2012-12-11 7:07 ` Mike Galbraith
@ 2012-12-11 12:43 ` Thomas Gleixner
2012-12-11 14:02 ` Steven Rostedt
1 sibling, 1 reply; 14+ messages in thread
From: Thomas Gleixner @ 2012-12-11 12:43 UTC (permalink / raw)
To: Steven Rostedt
Cc: frank.rowand, linux-kernel@vger.kernel.org, linux-rt-users,
Carsten Emde, John Kacur, Peter Zijlstra, Clark Williams,
Ingo Molnar
On Mon, 10 Dec 2012, Steven Rostedt wrote:
> On Mon, 2012-12-10 at 17:15 -0800, Frank Rowand wrote:
>
> > I should have also mentioned some previous experience using IPIs to
> > avoid runq lock contention on wake up. Someone encountered IPI
> > storms when using the TTWU_QUEUE feature, thus it defaults to off
> > for CONFIG_PREEMPT_RT_FULL:
> >
> > #ifndef CONFIG_PREEMPT_RT_FULL
> > /*
> > * Queue remote wakeups on the target CPU and process them
> > * using the scheduler IPI. Reduces rq->lock contention/bounces.
> > */
> > SCHED_FEAT(TTWU_QUEUE, true)
> > #else
> > SCHED_FEAT(TTWU_QUEUE, false)
> >
>
> Interesting, but I'm wondering if this also does it for every wakeup? If
> you have 1000 tasks waking up on another CPU, this could potentially
> send out 1000 IPIs. The number of IPIs here looks to be # of tasks
> waking up, and perhaps more than that, as there could be multiple
> instances that try to wake up the same task.
Not using the TTWU_QUEUE feature limits the IPIs to a single one,
which is only sent if the newly woken task preempts the current task
on the remote cpu and the NEED_RESCHED flag was not yet set.
With TTWU_QUEUE you can induce massive latencies just by starting
hackbench. You get a herd wakeup on CPU0 which then enqueues hundreds
of tasks to the remote pull list and sends IPIs. The remote CPU pulls
the tasks and activates them on its runqueue in hard interrupt
context. That can easily accumulate to hundreds of microseconds when
you do a mass push of newly woken tasks.
Of course it avoids fiddling with the remote rq lock, but it becomes
massively non deterministic.
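(The activation happens in sched_ttwu_pending(), which drains the whole
wake_list under rq->lock from the scheduler IPI, so the hardirq-time
cost scales with the number of queued wakeups. A rough sketch:)

static void sched_ttwu_pending(void)
{
	struct rq *rq = this_rq();
	struct llist_node *llist = llist_del_all(&rq->wake_list);
	struct task_struct *p;

	raw_spin_lock(&rq->lock);

	while (llist) {
		p = llist_entry(llist, struct task_struct, wake_entry);
		llist = llist_next(llist);
		/* each queued task is activated in hard interrupt context */
		ttwu_do_activate(rq, p, 0);
	}

	raw_spin_unlock(&rq->lock);
}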
> Now this patch set, the # of IPIs is limited to the # of CPUs. If you
> have 4 CPUs, you'll get a storm of 3 IPIs. That's a big difference.
Yeah, the big difference is that you offload the double lock to the
IPI. So in the worst case you interrupt the most latency sensitive
task running on the remote CPU. Not sure if I really like that
"feature".
Thanks,
tglx
^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: [RFC][PATCH RT 3/4] sched/rt: Use IPI to trigger RT task push migration instead of pulling
2012-12-11 12:43 ` Thomas Gleixner
@ 2012-12-11 14:02 ` Steven Rostedt
2012-12-11 14:16 ` Steven Rostedt
0 siblings, 1 reply; 14+ messages in thread
From: Steven Rostedt @ 2012-12-11 14:02 UTC (permalink / raw)
To: Thomas Gleixner
Cc: frank.rowand, linux-kernel@vger.kernel.org, linux-rt-users,
Carsten Emde, John Kacur, Peter Zijlstra, Clark Williams,
Ingo Molnar
On Tue, 2012-12-11 at 13:43 +0100, Thomas Gleixner wrote:
> On Mon, 10 Dec 2012, Steven Rostedt wrote:
> > On Mon, 2012-12-10 at 17:15 -0800, Frank Rowand wrote:
> >
> > > I should have also mentioned some previous experience using IPIs to
> > > avoid runq lock contention on wake up. Someone encountered IPI
> > > storms when using the TTWU_QUEUE feature, thus it defaults to off
> > > for CONFIG_PREEMPT_RT_FULL:
> > >
> > > #ifndef CONFIG_PREEMPT_RT_FULL
> > > /*
> > > * Queue remote wakeups on the target CPU and process them
> > > * using the scheduler IPI. Reduces rq->lock contention/bounces.
> > > */
> > > SCHED_FEAT(TTWU_QUEUE, true)
> > > #else
> > > SCHED_FEAT(TTWU_QUEUE, false)
> > >
> >
> > Interesting, but I'm wondering if this also does it for every wakeup? If
> > you have 1000 tasks waking up on another CPU, this could potentially
> > send out 1000 IPIs. The number of IPIs here looks to be # of tasks
> > waking up, and perhaps more than that, as there could be multiple
> > instances that try to wake up the same task.
>
> Not using the TTWU_QUEUE feature limits the IPIs to a single one,
> which is only sent if the newly woken task preempts the current task
> on the remote cpu and the NEED_RESCHED flag was not yet set.
>
> With TTWU_QUEUE you can induce massive latencies just by starting
> hackbench. You get a herd wakeup on CPU0 which then enqueues hundreds
> of tasks to the remote pull list and sends IPIs. The remote CPU pulls
> the tasks and activates them on its runqueue in hard interrupt
> context. That can easily accumulate to hundreds of microseconds when
> you do a mass push of newly woken tasks.
>
> Of course it avoids fiddling with the remote rq lock, but it becomes
> massively non deterministic.
Agreed. I never suggested to use TTWU_QUEUE. I was just stating the
difference between that and my patches.
>
> > Now this patch set, the # of IPIs is limited to the # of CPUs. If you
> > have 4 CPUs, you'll get a storm of 3 IPIs. That's a big difference.
>
> Yeah, the big difference is that you offload the double lock to the
> IPI. So in the worst case you interrupt the most latency sensitive
> task running on the remote CPU. Not sure if I really like that
> "feature".
>
First, the pulled CPU isn't necessarily running the most latency
sensitive task. It just happens to be running more than one RT task, and
the waiting RT task can migrate. The running task may be of the same
priority as the waiting task. And they both may be the lowest priority
RT tasks in the system, and a CPU just went idle.
Currently, what we have is huge contention on the pulled CPU's rq
lock. We've measured over 500us latencies due to it. This hurts even the
CPU that has the overloaded task, as the contention is on its lock.
-- Steve
^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: [RFC][PATCH RT 3/4] sched/rt: Use IPI to trigger RT task push migration instead of pulling
2012-12-11 14:02 ` Steven Rostedt
@ 2012-12-11 14:16 ` Steven Rostedt
0 siblings, 0 replies; 14+ messages in thread
From: Steven Rostedt @ 2012-12-11 14:16 UTC (permalink / raw)
To: Thomas Gleixner
Cc: frank.rowand, linux-kernel@vger.kernel.org, linux-rt-users,
Carsten Emde, John Kacur, Peter Zijlstra, Clark Williams,
Ingo Molnar
On Tue, 2012-12-11 at 09:02 -0500, Steven Rostedt wrote:
> Currently, what we have is huge contention on the pulled CPU's rq
> lock. We've measured over 500us latencies due to it. This hurts even the
> CPU that has the overloaded task, as the contention is on its lock.
The 500us latency was one of the max ones I saw, and I believe it was
combined with the load balancer going off at the same time as the pulls
were happening. I've seen larger latencies, with limited function tracing
enabled, where the load balancer was very involved.
I believe we have users that would very much want this in, so if you
are still not sure you like this "feature", I can make it into a
SCHED_FEAT() like TTWU_QUEUE. Then you can keep it off and the big
boxes can enable it.
Maybe enable the feature by default if the box has more than 16 CPUs?
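Something along these lines (untested sketch; RT_PUSH_IPI is a
hypothetical feature name):

	/* kernel/sched/features.h */
	SCHED_FEAT(RT_PUSH_IPI, false)

	/* in pull_rt_task(), per overloaded CPU: */
	if (sched_feat(RT_PUSH_IPI)) {
		smp_send_reschedule(cpu);
		continue;
	}
	/* otherwise fall back to the double-lock pull */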
-- Steve
^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: [RFC][PATCH RT 3/4] sched/rt: Use IPI to trigger RT task push migration instead of pulling
2012-12-11 0:48 ` Frank Rowand
2012-12-11 1:15 ` Frank Rowand
@ 2012-12-11 1:41 ` Steven Rostedt
1 sibling, 0 replies; 14+ messages in thread
From: Steven Rostedt @ 2012-12-11 1:41 UTC (permalink / raw)
To: frank.rowand
Cc: linux-kernel@vger.kernel.org, linux-rt-users, Thomas Gleixner,
Carsten Emde, John Kacur, Peter Zijlstra, Clark Williams,
Ingo Molnar
On Mon, 2012-12-10 at 16:48 -0800, Frank Rowand wrote:
>
> > I've tried various methods to lessen the load, but things like an
> > atomic counter to only let one CPU grab the task won't work, because
> > the task may have a limited affinity, and we may pick the wrong
> > CPU to take that lock and do the pull, only to find out that the
> > CPU we picked isn't in the task's affinity.
>
> You are saying that the pulling CPU might not be in the pulled task's
> affinity? But isn't that checked:
>
> pull_rt_task()
But here you forgot:
double_lock_balance();
which is what we are trying to avoid.
> pick_next_highest_task_rt()
> pick_rt_task()
> if ( ... || cpumask_test_cpu(cpu, tsk_cpus_allowed(p) ...
>
> >
> > Instead of doing the PULL, I now have the CPUs that want the pull to
> > send over an IPI to the overloaded CPU, and let that CPU pick what
> > CPU to push the task to. No more need to grab the rq lock, and the
> > push/pull algorithm still works fine.
>
> That gives me the opposite of a warm fuzzy feeling. Processing an IPI
> on the overloaded CPU is not free (I'm being ARM-centric), and this is
> putting more load on the already overloaded CPU.
Note that when I say overloaded, it's not quite the normal term for
overload. It means that there's more than one RT task scheduled to run
on that CPU *and* that one of the waiting RT tasks can migrate. If all
RT tasks are pinned to a CPU, none of this happens.
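(Concretely, the IPI handler bails out unless has_pushable_tasks() is
true, and a task only gets on that list when it can run elsewhere. A
rough sketch of the relevant checks from rt.c:)

static inline int has_pushable_tasks(struct rq *rq)
{
	return !plist_head_empty(&rq->rt.pushable_tasks);
}

	/* on enqueue: only a task that may run elsewhere is pushable */
	if (!task_current(rq, p) && p->nr_cpus_allowed > 1)
		enqueue_pushable_task(rq, p);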
>
> I do recognize that you have actual measurements below that show goodness
> for the pathological case you debugged. I'm still mulling this all over...
I guess it comes down to whether we want to send out IPI storms or lock up
other CPUs grabbing the rq lock and serializing the rest of the system.
It's really easy to reproduce. When my local systems are done testing,
I'll see what the effect is with just 4 CPUs (still x86).
-- Steve
^ permalink raw reply [flat|nested] 14+ messages in thread
* [RFC][PATCH RT 4/4] sched/rt: Initiate a pull when the priority of a task is lowered
2012-12-07 23:56 [RFC][PATCH RT 0/4] sched/rt: Lower rq lock contention latencies on many CPU boxes Steven Rostedt
` (2 preceding siblings ...)
2012-12-07 23:56 ` [RFC][PATCH RT 3/4] sched/rt: Use IPI to trigger RT task push migration instead of pulling Steven Rostedt
@ 2012-12-07 23:56 ` Steven Rostedt
2012-12-10 22:59 ` [RFC][PATCH RT 0/4] sched/rt: Lower rq lock contention latencies on many CPU boxes Clark Williams
4 siblings, 0 replies; 14+ messages in thread
From: Steven Rostedt @ 2012-12-07 23:56 UTC (permalink / raw)
To: linux-kernel, linux-rt-users
Cc: Thomas Gleixner, Carsten Emde, John Kacur, Peter Zijlstra,
Clark Williams, Ingo Molnar
[-- Attachment #1: rt-migrate-redo.patch --]
[-- Type: text/plain, Size: 835 bytes --]
If a task lowers its priority (say, by losing priority inheritance)
and a higher priority task is waiting on another CPU, initiate a pull.
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Index: linux-rt.git/kernel/sched/rt.c
===================================================================
--- linux-rt.git.orig/kernel/sched/rt.c
+++ linux-rt.git/kernel/sched/rt.c
@@ -997,6 +997,8 @@ inc_rt_prio(struct rt_rq *rt_rq, int pri
inc_rt_prio_smp(rt_rq, prio, prev_prio);
}
+static int pull_rt_task(struct rq *this_rq);
+
static void
dec_rt_prio(struct rt_rq *rt_rq, int prio)
{
@@ -1021,6 +1023,9 @@ dec_rt_prio(struct rt_rq *rt_rq, int pri
rt_rq->highest_prio.curr = MAX_RT_PRIO;
dec_rt_prio_smp(rt_rq, prio, prev_prio);
+
+ if (prev_prio < rt_rq->highest_prio.curr)
+ pull_rt_task(rq_of_rt_rq(rt_rq));
}
#else
^ permalink raw reply	[flat|nested] 14+ messages in thread
* Re: [RFC][PATCH RT 0/4] sched/rt: Lower rq lock contention latencies on many CPU boxes
2012-12-07 23:56 [RFC][PATCH RT 0/4] sched/rt: Lower rq lock contention latencies on many CPU boxes Steven Rostedt
` (3 preceding siblings ...)
2012-12-07 23:56 ` [RFC][PATCH RT 4/4] sched/rt: Initiate a pull when the priority of a task is lowered Steven Rostedt
@ 2012-12-10 22:59 ` Clark Williams
4 siblings, 0 replies; 14+ messages in thread
From: Clark Williams @ 2012-12-10 22:59 UTC (permalink / raw)
To: Steven Rostedt
Cc: linux-kernel, linux-rt-users, Thomas Gleixner, Carsten Emde,
John Kacur, Peter Zijlstra, Ingo Molnar
[-- Attachment #1: Type: text/plain, Size: 2246 bytes --]
On Fri, 07 Dec 2012 18:56:15 -0500
Steven Rostedt <rostedt@goodmis.org> wrote:
> I've been debugging large latencies on a 40 core box and found a major
> cause due to the thundering herd like grab of the rq lock due to the
> pull_rt_task() logic.
>
> Basically, if a large number of CPUs were to lower their priority at
> roughly the same time, they would all trigger a pull. If there happens to be
> only one CPU available to get a task, all CPUs doing the pull will try
> to grab it. In doing so, they will all contend on the rq lock of
> the overloaded CPU. Only one CPU will succeed in pulling the task
> and unfortunately, there's no quick way to know which, as it's dependent
> on the affinity of the task that needs to be pulled, and to look at that,
> we need to grab its rq lock!
>
> Instead of having the pull logic grab the rq locks and do the work to
> switch the task over to the pulling CPU, this patch series (well patch
> #3) has the pulling CPU send an IPI to the overloaded CPU and that
> CPU will do the push instead. The push logic uses the cpupri.c code
> to quickly find the best CPU to offload the overloaded RT task to, so
> it makes it quite efficient to do this.
>
> Handling multiple IPIs has much lower overhead than all the CPUs
> grabbing the rq lock.
>
> The other three patches are fixes/enhancements to the push/pull code
> that I found while doing the debugging of the latencies.
>
> Note, although this patch series is made for the -rt patch, the issues
> apply to mainline as well. But because -rt has the migrate_disable() code,
> this patch series is tailored to that. But if we can vet this out in
> -rt, all this code should make its way quickly to mainline.
>
> I tested this code out, but it probably needs some clean up and definitely
> more comments. I'm only posting this as an RFC for now to get feedback
> on the idea.
>
> Thanks!
>
Steve,
I've been running this set of patches on my laptop+RT kernel since
Friday with no ill-effects. I just applied it to v3.6.10+rt21 and it
seems to be fine.
I've got rteval runs going on a 40-core and a 24-core box which will be
done early Tuesday morning so I'll let you know results then.
Clark
^ permalink raw reply [flat|nested] 14+ messages in thread