linux-rt-users.vger.kernel.org archive mirror
* [PATCH 0/3] RT: scheduler newidle enhancements
@ 2008-06-23 23:04 Gregory Haskins
  2008-06-23 23:04 ` [PATCH 1/3] sched: enable interrupts and drop rq-lock during newidle balancing Gregory Haskins
                   ` (4 more replies)
  0 siblings, 5 replies; 24+ messages in thread
From: Gregory Haskins @ 2008-06-23 23:04 UTC (permalink / raw)
  To: mingo, rostedt, tglx
  Cc: linux-kernel, peterz, linux-rt-users, ghaskins, dbahi

Hi Ingo, Steven, Thomas,
  The following series applies to 25.4-rt6 and is "ready for inclusion" from
  my perspective.  However, due to their nature I am sure we will want to get
  some review feedback before they are considered for acceptance.

  These patches are probably suitable for consideration for
  mainline/sched-devel as well, but PREEMPT_RT gains the biggest boost from
  them since it tends to context switch much more frequently than mainline.

  This series makes some adjustments to the way we do newidle balancing.
  These tweaks were discovered while doing some related scheduler R&D that is
  not quite ready for the light of day.  But these patches do offer a
  substantial boost (6-12%) in network performance.  They may help in other
  areas as well, but networking is where our focus remains currently.

  I think patch 1/3 is a good one, but it requires careful review.  Patch
  2/3 is based upon some work I had seen PeterZ submit, so it's possible to
  just pull his work (if it hasn't been already) as opposed to this patch, if
  desired.  Lastly, patch 3/3 is a bit controversial, so I put it last for
  easier cherry-picking of the first two.  It does help in our testing, but it
  will have to be reviewed carefully.

Comments/feedback/bugfixes more than welcome.

Thanks!
-Greg

  


^ permalink raw reply	[flat|nested] 24+ messages in thread

* [PATCH 1/3] sched: enable interrupts and drop rq-lock during newidle balancing
  2008-06-23 23:04 [PATCH 0/3] RT: scheduler newidle enhancements Gregory Haskins
@ 2008-06-23 23:04 ` Gregory Haskins
  2008-06-24  0:11   ` Steven Rostedt
  2008-06-24 10:13   ` Peter Zijlstra
  2008-06-23 23:04 ` [PATCH 2/3] sched: only run newidle if previous task was CFS Gregory Haskins
                   ` (3 subsequent siblings)
  4 siblings, 2 replies; 24+ messages in thread
From: Gregory Haskins @ 2008-06-23 23:04 UTC (permalink / raw)
  To: mingo, rostedt, tglx
  Cc: linux-kernel, peterz, linux-rt-users, ghaskins, dbahi

We do find_busiest_group() et al. without locks held for normal balancing,
so let's do it for newidle as well.  It will allow other cpus to make
forward progress (against our RQ) while we try to balance and allow 
some interrupts to occur.

Signed-off-by: Gregory Haskins <ghaskins@novell.com>
---

 kernel/sched.c |   44 ++++++++++++++++++++++++++++++++++++++------
 1 files changed, 38 insertions(+), 6 deletions(-)

diff --git a/kernel/sched.c b/kernel/sched.c
index 31f91d9..490e6bc 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -3333,6 +3333,16 @@ load_balance_newidle(int this_cpu, struct rq *this_rq, struct sched_domain *sd)
 	int sd_idle = 0;
 	int all_pinned = 0;
 	cpumask_t cpus = CPU_MASK_ALL;
+	int nr_running;
+
+	schedstat_inc(sd, lb_count[CPU_NEWLY_IDLE]);
+
+	/*
+	 * We are in a preempt-disabled section, so dropping the lock/irq
+	 * here simply means that other cores may acquire the lock,
+	 * and interrupts may occur.
+	 */
+	spin_unlock_irq(&this_rq->lock);
 
 	/*
 	 * When power savings policy is enabled for the parent domain, idle
@@ -3344,7 +3354,6 @@ load_balance_newidle(int this_cpu, struct rq *this_rq, struct sched_domain *sd)
 	    !test_sd_parent(sd, SD_POWERSAVINGS_BALANCE))
 		sd_idle = 1;
 
-	schedstat_inc(sd, lb_count[CPU_NEWLY_IDLE]);
 redo:
 	group = find_busiest_group(sd, this_cpu, &imbalance, CPU_NEWLY_IDLE,
 				   &sd_idle, &cpus, NULL);
@@ -3366,14 +3375,33 @@ redo:
 
 	ld_moved = 0;
 	if (busiest->nr_running > 1) {
-		/* Attempt to move tasks */
-		double_lock_balance(this_rq, busiest);
-		/* this_rq->clock is already updated */
-		update_rq_clock(busiest);
+		local_irq_disable();
+		double_rq_lock(this_rq, busiest);
+
+		BUG_ON(this_cpu != smp_processor_id());
+
+		/*
+		 * Checking rq->nr_running covers both the case where
+		 * newidle-balancing pulls a task, as well as if something
+		 * else issued a NEEDS_RESCHED (since we would only need
+		 * a reschedule if something was moved to us)
+		 */
+		if (this_rq->nr_running) {
+			double_rq_unlock(this_rq, busiest);
+			local_irq_enable();
+			goto out_balanced;
+		}
+
 		ld_moved = move_tasks(this_rq, this_cpu, busiest,
 					imbalance, sd, CPU_NEWLY_IDLE,
 					&all_pinned);
-		spin_unlock(&busiest->lock);
+
+		nr_running = this_rq->nr_running;
+		double_rq_unlock(this_rq, busiest);
+		local_irq_enable();
+
+		if (nr_running)
+			goto out_balanced;
 
 		if (unlikely(all_pinned)) {
 			cpu_clear(cpu_of(busiest), cpus);
@@ -3382,6 +3410,8 @@ redo:
 		}
 	}
 
+	spin_lock_irq(&this_rq->lock);
+
 	if (!ld_moved) {
 		schedstat_inc(sd, lb_failed[CPU_NEWLY_IDLE]);
 		if (!sd_idle && sd->flags & SD_SHARE_CPUPOWER &&
@@ -3393,6 +3423,8 @@ redo:
 	return ld_moved;
 
 out_balanced:
+	spin_lock_irq(&this_rq->lock);
+
 	schedstat_inc(sd, lb_balanced[CPU_NEWLY_IDLE]);
 	if (!sd_idle && sd->flags & SD_SHARE_CPUPOWER &&
 	    !test_sd_parent(sd, SD_POWERSAVINGS_BALANCE))


^ permalink raw reply related	[flat|nested] 24+ messages in thread

* [PATCH 2/3] sched: only run newidle if previous task was CFS
  2008-06-23 23:04 [PATCH 0/3] RT: scheduler newidle enhancements Gregory Haskins
  2008-06-23 23:04 ` [PATCH 1/3] sched: enable interrupts and drop rq-lock during newidle balancing Gregory Haskins
@ 2008-06-23 23:04 ` Gregory Haskins
  2008-06-24  9:58   ` Peter Zijlstra
  2008-06-23 23:04 ` [PATCH 3/3] sched: terminate newidle balancing once at least one task has moved over Gregory Haskins
                   ` (2 subsequent siblings)
  4 siblings, 1 reply; 24+ messages in thread
From: Gregory Haskins @ 2008-06-23 23:04 UTC (permalink / raw)
  To: mingo, rostedt, tglx
  Cc: linux-kernel, peterz, linux-rt-users, ghaskins, dbahi

A system that tends to overschedule (such as PREEMPT_RT) will naturally
tend to newidle balance often as well.  This may have quite a negative
impact on performance.  This patch attempts to address the overzealous
newidle balancing by only allowing it to occur if the previous task
was SCHED_OTHER.

Some may argue that if the system is going idle, it should try to
newidle balance to keep it doing useful work.  But the fact is that
spending too much time in the load-balancing code demonstrably hurts
performance as well.  Running oprofile on the system with various
workloads has shown that we can sometimes spend a majority of our
cpu-time running load_balance_newidle.  Additionally, disabling
newidle balancing can make said workloads increase in performance by
up to 200%.  Obviously disabling the feature outright is not sustainable,
but hopefully we can make it smarter. 

This code assumes that if there aren't any CFS tasks present on the queue,
it was probably already balanced.

Signed-off-by: Gregory Haskins <ghaskins@novell.com>
---

 kernel/sched.c      |    4 +---
 kernel/sched_fair.c |    9 +++++++++
 2 files changed, 10 insertions(+), 3 deletions(-)

diff --git a/kernel/sched.c b/kernel/sched.c
index 490e6bc..3efbbc5 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -1310,6 +1310,7 @@ static unsigned long source_load(int cpu, int type);
 static unsigned long target_load(int cpu, int type);
 static unsigned long cpu_avg_load_per_task(int cpu);
 static int task_hot(struct task_struct *p, u64 now, struct sched_domain *sd);
+static void idle_balance(int this_cpu, struct rq *this_rq);
 #endif /* CONFIG_SMP */
 
 #include "sched_stats.h"
@@ -4170,9 +4171,6 @@ asmlinkage void __sched __schedule(void)
 		prev->sched_class->pre_schedule(rq, prev);
 #endif
 
-	if (unlikely(!rq->nr_running))
-		idle_balance(cpu, rq);
-
 	prev->sched_class->put_prev_task(rq, prev);
 	next = pick_next_task(rq, prev);
 
diff --git a/kernel/sched_fair.c b/kernel/sched_fair.c
index 0ade6f8..2e22529 100644
--- a/kernel/sched_fair.c
+++ b/kernel/sched_fair.c
@@ -1426,6 +1426,14 @@ static void moved_group_fair(struct task_struct *p)
 }
 #endif
 
+#ifdef CONFIG_SMP
+static void pre_schedule_fair(struct rq *rq, struct task_struct *prev)
+{
+	if (unlikely(!rq->nr_running))
+		idle_balance(rq->cpu, rq);
+}
+#endif
+
 /*
  * All the scheduling class methods:
  */
@@ -1446,6 +1454,7 @@ static const struct sched_class fair_sched_class = {
 #ifdef CONFIG_SMP
 	.load_balance		= load_balance_fair,
 	.move_one_task		= move_one_task_fair,
+	.pre_schedule		= pre_schedule_fair,
 #endif
 
 	.set_curr_task          = set_curr_task_fair,


^ permalink raw reply related	[flat|nested] 24+ messages in thread

* [PATCH 3/3] sched: terminate newidle balancing once at least one task has moved over
  2008-06-23 23:04 [PATCH 0/3] RT: scheduler newidle enhancements Gregory Haskins
  2008-06-23 23:04 ` [PATCH 1/3] sched: enable interrupts and drop rq-lock during newidle balancing Gregory Haskins
  2008-06-23 23:04 ` [PATCH 2/3] sched: only run newidle if previous task was CFS Gregory Haskins
@ 2008-06-23 23:04 ` Gregory Haskins
  2008-06-24  0:50   ` Nick Piggin
  2008-06-24 10:13   ` Peter Zijlstra
  2008-06-24  0:15 ` [PATCH 0/3] RT: scheduler newidle enhancements Steven Rostedt
  2008-06-24  1:51 ` Gregory Haskins
  4 siblings, 2 replies; 24+ messages in thread
From: Gregory Haskins @ 2008-06-23 23:04 UTC (permalink / raw)
  To: mingo, rostedt, tglx
  Cc: linux-kernel, peterz, linux-rt-users, ghaskins, dbahi

Inspired by Peter Zijlstra.

Signed-off-by: Gregory Haskins <ghaskins@novell.com>
---

 kernel/sched.c |    4 ++++
 1 files changed, 4 insertions(+), 0 deletions(-)

diff --git a/kernel/sched.c b/kernel/sched.c
index 3efbbc5..c8e8520 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -2775,6 +2775,10 @@ static int move_tasks(struct rq *this_rq, int this_cpu, struct rq *busiest,
 				max_load_move - total_load_moved,
 				sd, idle, all_pinned, &this_best_prio);
 		class = class->next;
+
+		if (idle == CPU_NEWLY_IDLE && this_rq->nr_running)
+			break;
+
 	} while (class && max_load_move > total_load_moved);
 
 	return total_load_moved > 0;


^ permalink raw reply related	[flat|nested] 24+ messages in thread

* Re: [PATCH 1/3] sched: enable interrupts and drop rq-lock during newidle balancing
  2008-06-23 23:04 ` [PATCH 1/3] sched: enable interrupts and drop rq-lock during newidle balancing Gregory Haskins
@ 2008-06-24  0:11   ` Steven Rostedt
  2008-06-24 10:13   ` Peter Zijlstra
  1 sibling, 0 replies; 24+ messages in thread
From: Steven Rostedt @ 2008-06-24  0:11 UTC (permalink / raw)
  To: Gregory Haskins; +Cc: mingo, tglx, linux-kernel, peterz, linux-rt-users, dbahi



On Mon, 23 Jun 2008, Gregory Haskins wrote:

>
> We do find_busiest_groups() et. al. without locks held for normal balancing,
> so lets do it for newidle as well.  It will allow other cpus to make
> forward progress (against our RQ) while we try to balance and allow
> some interrupts to occur.
>

I don't see anything wrong with this patch. I'll probably queue it up for
-rt8 since it is too late for -rt7.

-- Steve


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH 0/3] RT: scheduler newidle enhancements
  2008-06-23 23:04 [PATCH 0/3] RT: scheduler newidle enhancements Gregory Haskins
                   ` (2 preceding siblings ...)
  2008-06-23 23:04 ` [PATCH 3/3] sched: terminate newidle balancing once at least one task has moved over Gregory Haskins
@ 2008-06-24  0:15 ` Steven Rostedt
  2008-06-24  1:51 ` Gregory Haskins
  4 siblings, 0 replies; 24+ messages in thread
From: Steven Rostedt @ 2008-06-24  0:15 UTC (permalink / raw)
  To: Gregory Haskins; +Cc: mingo, tglx, linux-kernel, peterz, linux-rt-users, dbahi


On Mon, 23 Jun 2008, Gregory Haskins wrote:

>
> Hi Ingo, Steven, Thomas,
>   The following series applies to 25.4-rt6 and is "ready for inclusion" from
>   my perspective.  However, due to their nature I am sure we will want to get
>   some review feedback before being considered for acceptance.
>
>   These patches are probably suitable for consideration for
>   mainline/sched-devel as well, but PREEMPT_RT gains the biggest boost from
>   them since it tends to context switch much more frequently than mainline.
>
>   This series makes some adjustments to the way we do newidle balancing.
>   These tweaks were discovered while doing some related scheduler R&D that is
>   not quite ready for the light of day.  But these patches do offer a
>   substantial boost (6-12%) in network performance.  They may help in other
>   areas as well, but networking is where our focus remains currently.
>
>   I think the patch 1/3 is a good one, but it requires careful review. Patch
>   2/3 is based upon some work I had seen PeterZ submit, so its possible to
>   just pull his work (if it hasnt been already) as opposed to this patch, if
>   desired.  Lastly, Patch 3/3 is a bit controversial so I put it last for
>   easier cherry picking of the first two.  It does help in our testing, but it
>   will have to be reviewed carefully.
>
> Comments/feedback/bugfixes more than welcome.

Greg,

I don't see anything wrong with these patches. If either Ingo or Peter
have any issues, I hope they speak up soon. I'll queue them all for 25-rt8
(after I get rt7 out the door).

-- Steve


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH 3/3] sched: terminate newidle balancing once at least one task has moved over
  2008-06-23 23:04 ` [PATCH 3/3] sched: terminate newidle balancing once at least one task has moved over Gregory Haskins
@ 2008-06-24  0:50   ` Nick Piggin
  2008-06-24  1:07     ` Steven Rostedt
  2008-06-24  2:39     ` Gregory Haskins
  2008-06-24 10:13   ` Peter Zijlstra
  1 sibling, 2 replies; 24+ messages in thread
From: Nick Piggin @ 2008-06-24  0:50 UTC (permalink / raw)
  To: Gregory Haskins
  Cc: mingo, rostedt, tglx, linux-kernel, peterz, linux-rt-users, dbahi

On Tuesday 24 June 2008 09:04, Gregory Haskins wrote:
> Inspired by Peter Zijlstra.

Is this really getting tested well? Because at least for SCHED_OTHER
tasks, the newidle balancer is still supposed to be relatively
conservative and not over balance too much. By the time you have
done all this calculation and reached here, it will be a loss to only
move one task if you could have moved two and halved your newidle
balance rate...

> Signed-off-by: Gregory Haskins <ghaskins@novell.com>
> ---
>
>  kernel/sched.c |    4 ++++
>  1 files changed, 4 insertions(+), 0 deletions(-)
>
> diff --git a/kernel/sched.c b/kernel/sched.c
> index 3efbbc5..c8e8520 100644
> --- a/kernel/sched.c
> +++ b/kernel/sched.c
> @@ -2775,6 +2775,10 @@ static int move_tasks(struct rq *this_rq, int
> this_cpu, struct rq *busiest, max_load_move - total_load_moved,
>  				sd, idle, all_pinned, &this_best_prio);
>  		class = class->next;
> +
> +		if (idle == CPU_NEWLY_IDLE && this_rq->nr_running)
> +			break;
> +
>  	} while (class && max_load_move > total_load_moved);
>
>  	return total_load_moved > 0;
>
> --

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH 3/3] sched: terminate newidle balancing once at least one task has moved over
  2008-06-24  0:50   ` Nick Piggin
@ 2008-06-24  1:07     ` Steven Rostedt
  2008-06-24  1:26       ` Nick Piggin
  2008-06-24  2:39     ` Gregory Haskins
  1 sibling, 1 reply; 24+ messages in thread
From: Steven Rostedt @ 2008-06-24  1:07 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Gregory Haskins, mingo, tglx, linux-kernel, peterz,
	linux-rt-users, dbahi


On Tue, 24 Jun 2008, Nick Piggin wrote:

>
> On Tuesday 24 June 2008 09:04, Gregory Haskins wrote:
> > Inspired by Peter Zijlstra.
>
> Is this really getting tested well? Because at least for SCHED_OTHER
> tasks, the newidle balancer is still supposed to be relatively
> conservative and not over balance too much. By the time you have
> done all this calculation and reached here, it will be a loss to only
> move one task if you could have moved two and halved your newidle
> balance rate...

We've been finding a lot of our high latencies have been coming from the
balancing code. And the newidle balance is a large offender. I don't think
it's much wasted work for what you want. Even if we wasted the work done,
it was during "idle" time. But now that we have a task to run, why not run it
now?  Especially if that task is an RT task and doesn't like to wait.

The newidle balance should really just get a task to run, the balancing
code should be done at a later time. Ideally when no RT tasks need to run.

-- Steve


>
> > Signed-off-by: Gregory Haskins <ghaskins@novell.com>
> > ---
> >
> >  kernel/sched.c |    4 ++++
> >  1 files changed, 4 insertions(+), 0 deletions(-)
> >
> > diff --git a/kernel/sched.c b/kernel/sched.c
> > index 3efbbc5..c8e8520 100644
> > --- a/kernel/sched.c
> > +++ b/kernel/sched.c
> > @@ -2775,6 +2775,10 @@ static int move_tasks(struct rq *this_rq, int
> > this_cpu, struct rq *busiest, max_load_move - total_load_moved,
> >  				sd, idle, all_pinned, &this_best_prio);
> >  		class = class->next;
> > +
> > +		if (idle == CPU_NEWLY_IDLE && this_rq->nr_running)
> > +			break;
> > +
> >  	} while (class && max_load_move > total_load_moved);
> >
> >  	return total_load_moved > 0;
> >
> > --
>

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH 3/3] sched: terminate newidle balancing once at least one task has moved over
  2008-06-24  1:07     ` Steven Rostedt
@ 2008-06-24  1:26       ` Nick Piggin
  0 siblings, 0 replies; 24+ messages in thread
From: Nick Piggin @ 2008-06-24  1:26 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Gregory Haskins, mingo, tglx, linux-kernel, peterz,
	linux-rt-users, dbahi

On Tuesday 24 June 2008 11:07, Steven Rostedt wrote:
> On Tue, 24 Jun 2008, Nick Piggin wrote:
> > On Tuesday 24 June 2008 09:04, Gregory Haskins wrote:
> > > Inspired by Peter Zijlstra.
> >
> > Is this really getting tested well? Because at least for SCHED_OTHER
> > tasks, the newidle balancer is still supposed to be relatively
> > conservative and not over balance too much. By the time you have
> > done all this calculation and reached here, it will be a loss to only
> > move one task if you could have moved two and halved your newidle
> > balance rate...
>
> We've been finding a lot of our high latencies have been coming from the
> balancing code. And the newidle balance is a large offender. I don't think

Measurements and results weren't in the changelog.

> it's much wasted work for what you want. Even if we wasted the work done,
> it was during "idle" time.

It is idle because it has gone idle. If we pulled more tasks in the last
newidle balance, it wouldn't have gone idle so early.

> But now we have a task to run, why not run it 
> now. Especially if that task is an RT task and doesn't like to wait.

Maybe I would agree if you check for an -rt task rather than generally.
At this level, throughput is the major concern for many more Linux users
than latency.
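
A minimal sketch of what such an -rt-only cut-off might look like in the
move_tasks() loop (purely illustrative, not part of the posted series; it
assumes this tree exposes the rq->rt.rt_nr_running counter):

	/*
	 * Sketch: abort the newidle pass early only when an RT task has
	 * become runnable on this runqueue, instead of for any new task.
	 * rq->rt.rt_nr_running is assumed to be available here.
	 */
	if (idle == CPU_NEWLY_IDLE && this_rq->rt.rt_nr_running)
		break;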


> The newidle balance should really just get a task to run, the balancing
> code should be done at a later time. Ideally when no RT tasks need to run.

I disagree. All balancing should be minimised, but if it has to run
then it should do as much balancing as it can. We already have checked
several other runqueues and calculated the busiest one, etc.

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH 3/3] sched: terminate newidle balancing once at least one task has moved over
  2008-06-24  2:39     ` Gregory Haskins
@ 2008-06-24  1:46       ` Nick Piggin
  2008-06-24  2:59         ` Gregory Haskins
  0 siblings, 1 reply; 24+ messages in thread
From: Nick Piggin @ 2008-06-24  1:46 UTC (permalink / raw)
  To: Gregory Haskins
  Cc: mingo, rostedt, peterz, tglx, David Bahi, linux-kernel,
	linux-rt-users

On Tuesday 24 June 2008 12:39, Gregory Haskins wrote:
> Hi Nick,
>
> >>> On Mon, Jun 23, 2008 at  8:50 PM, in message
>
> <200806241050.12028.nickpiggin@yahoo.com.au>, Nick Piggin
>
> <nickpiggin@yahoo.com.au> wrote:
> > On Tuesday 24 June 2008 09:04, Gregory Haskins wrote:
> >> Inspired by Peter Zijlstra.
> >
> > Is this really getting tested well?   Because at least for SCHED_OTHER
> > tasks,
>
> Note that this only affects SCHED_OTHER.  RT tasks are handled with a
> different algorithm.
>
> > the newidle balancer is still supposed to be relatively
> > conservative and not over balance too much.
>
> In our testing, newidle is degrading the system (at least for certain
> workloads).  Oprofile was showing that newidle can account for 60-80% of
> the CPU during our benchmark runs. Turning off newidle *completely* by
> commenting out idle_balance() boosts netperf performance by 200% for our
> 8-core to 8-core UDP transaction test. Obviously neutering it is not
> sustainable as a general solution, so we are trying to reduce its negative
> impact.

Hmm. I'd like to see an attempt to be made to tuning the algorithm
so that newidle actually won't cause any tasks to be balanced in
this case. That seems like the right thing to do, doesn't it?

Of course... tuning the whole balancer on the basis of a crazy
netperf benchmark is... dangerous :)


> It is not clear whether the problem is that newidle is over-balancing the
> system, or that newidle is simply running too frequently as a symptom of a
> system that has a high frequency of context switching (such as -rt).  I 
> suspected the latter, so I was attracted to Peter's idea based on the
> concept of shortening the time we execute this function.  But I have to
> admit, unlike 1/3 and 2/3 which I have carefully benchmarked individually
> and know make a positive performance impact, I pulled this in more on
> theory.  I will try to benchmark this individually as well.
>
> > By the time you have
> > done all this calculation and reached here, it will be a loss to only
> > move one task if you could have moved two and halved your newidle
> > balance rate...
>
> Thats an interesting point that I did not consider, but note that a very
> significant chunk of the overhead I believe comes from the
> double_lock/move_tasks code after the algorithmic complexity is completed.

And that double lock will be amortized if you can move 2 tasks at once,
rather than 1 task each 2 times.


> I believe the primary motivation of this patch is related to reducing the
> overall latency in the schedule() critical section.  Currently this
> operation can perform an unbounded move_task operation in a
> preempt-disabled region (which, as an aside, is always SCHED_OTHER
> related).

Maybe putting some upper cap on it, I could live with. Cutting off at
one task I think needs a lot more thought and testing.

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH 0/3] RT: scheduler newidle enhancements
  2008-06-23 23:04 [PATCH 0/3] RT: scheduler newidle enhancements Gregory Haskins
                   ` (3 preceding siblings ...)
  2008-06-24  0:15 ` [PATCH 0/3] RT: scheduler newidle enhancements Steven Rostedt
@ 2008-06-24  1:51 ` Gregory Haskins
  4 siblings, 0 replies; 24+ messages in thread
From: Gregory Haskins @ 2008-06-24  1:51 UTC (permalink / raw)
  To: mingo, rostedt, tglx, Gregory Haskins
  Cc: peterz, David Bahi, linux-kernel, linux-rt-users

>>> On Mon, Jun 23, 2008 at  7:04 PM, in message
<20080623225645.31515.36393.stgit@lsg.lsg.lab.novell.com>, Gregory Haskins
<ghaskins@novell.com> wrote: 
>
> 
>   I think the patch 1/3 is a good one, but it requires careful review. Patch
>   2/3 is based upon some work I had seen PeterZ submit, so its possible to
>   just pull his work (if it hasnt been already) as opposed to this patch, if
>   desired.  Lastly, Patch 3/3 is a bit controversial so I put it last for
>   easier cherry picking of the first two.  It does help in our testing, but 
> it
>   will have to be reviewed carefully.

Err... in case it wasn't obvious, I had 2/3 and 3/3 inverted in my description.
2/3 is the controversial one; 3/3 is based on Peter's work.  Sorry for the (my)
confusion. :)

-Greg


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH 3/3] sched: terminate newidle balancing once at least one task has moved over
  2008-06-24  0:50   ` Nick Piggin
  2008-06-24  1:07     ` Steven Rostedt
@ 2008-06-24  2:39     ` Gregory Haskins
  2008-06-24  1:46       ` Nick Piggin
  1 sibling, 1 reply; 24+ messages in thread
From: Gregory Haskins @ 2008-06-24  2:39 UTC (permalink / raw)
  To: Nick Piggin
  Cc: mingo, rostedt, peterz, tglx, David Bahi, linux-kernel,
	linux-rt-users

Hi Nick,

>>> On Mon, Jun 23, 2008 at  8:50 PM, in message
<200806241050.12028.nickpiggin@yahoo.com.au>, Nick Piggin
<nickpiggin@yahoo.com.au> wrote: 
> On Tuesday 24 June 2008 09:04, Gregory Haskins wrote:
>> Inspired by Peter Zijlstra.
> 
> Is this really getting tested well?   Because at least for SCHED_OTHER
> tasks,

Note that this only affects SCHED_OTHER.  RT tasks are handled with a different algorithm.

> the newidle balancer is still supposed to be relatively
> conservative and not over balance too much.

In our testing, newidle is degrading the system (at least for certain workloads).  Oprofile was
showing that newidle can account for 60-80% of the CPU during our benchmark runs. Turning
off newidle *completely* by commenting out idle_balance() boosts netperf performance by
200% for our 8-core to 8-core UDP transaction test. Obviously neutering it is not sustainable
as a general solution, so we are trying to reduce its negative impact.
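
For reference, the idle_balance() call that was commented out for that
measurement is the newidle hook in __schedule() (the same two lines that
patch 2/3 later moves into pre_schedule_fair()):

	if (unlikely(!rq->nr_running))
		idle_balance(cpu, rq);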

It is not clear whether the problem is that newidle is over-balancing the system, or that newidle
is simply running too frequently as a symptom of a system that has a high frequency of context
switching (such as -rt).  I  suspected the latter, so I was attracted to Peter's idea based
on the concept of shortening the time we execute this function.  But I have to admit, unlike 1/3
and 2/3 which I have carefully benchmarked individually and know make a positive performance
impact, I pulled this in more on theory.  I will try to benchmark this individually as well.

> By the time you have
> done all this calculation and reached here, it will be a loss to only
> move one task if you could have moved two and halved your newidle
> balance rate...

That's an interesting point that I did not consider, but note that a very significant chunk of the overhead
I believe comes from the double_lock/move_tasks code after the algorithmic complexity is completed.

I believe the primary motivation of this patch is related to reducing the overall latency in the schedule()
critical section.  Currently this operation can perform an unbounded move_task operation in a
preempt-disabled region (which, as an aside, is always SCHED_OTHER related).

Since the bare minimum requirement is to move at least one task, I think this is a tradeoff: newidle
balance-rate vs critical-section depth.  For RT obviously we put more weight on the latter, but perhaps
this is not a mainline-worthy concept after all.  I will defer to Peter to comment further.

Thanks for the review, Nick.

Regards,
-Greg

> 
>> Signed-off-by: Gregory Haskins <ghaskins@novell.com>
>> ---
>>
>>  kernel/sched.c |    4 ++++
>>  1 files changed, 4 insertions(+), 0 deletions(-)
>>
>> diff --git a/kernel/sched.c b/kernel/sched.c
>> index 3efbbc5..c8e8520 100644
>> --- a/kernel/sched.c
>> +++ b/kernel/sched.c
>> @@ -2775,6 +2775,10 @@ static int move_tasks(struct rq *this_rq, int
>> this_cpu, struct rq *busiest, max_load_move - total_load_moved,
>>  				sd, idle, all_pinned, &this_best_prio);
>>  		class = class->next;
>> +
>> +		if (idle == CPU_NEWLY_IDLE && this_rq->nr_running)
>> +			break;
>> +
>>  	} while (class && max_load_move > total_load_moved);
>>
>>  	return total_load_moved > 0;
>>
>> --



^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH 3/3] sched: terminate newidle balancing once at least one task has moved over
  2008-06-24  1:46       ` Nick Piggin
@ 2008-06-24  2:59         ` Gregory Haskins
  0 siblings, 0 replies; 24+ messages in thread
From: Gregory Haskins @ 2008-06-24  2:59 UTC (permalink / raw)
  To: Nick Piggin
  Cc: mingo, rostedt, peterz, tglx, David Bahi, linux-kernel,
	linux-rt-users

>>> On Mon, Jun 23, 2008 at  9:46 PM, in message
<200806241146.35112.nickpiggin@yahoo.com.au>, Nick Piggin
<nickpiggin@yahoo.com.au> wrote: 
> On Tuesday 24 June 2008 12:39, Gregory Haskins wrote:
>> Hi Nick,
>>
>> >>> On Mon, Jun 23, 2008 at  8:50 PM, in message
>>
>> <200806241050.12028.nickpiggin@yahoo.com.au>, Nick Piggin
>>
>> <nickpiggin@yahoo.com.au> wrote:
>> > On Tuesday 24 June 2008 09:04, Gregory Haskins wrote:
>> >> Inspired by Peter Zijlstra.
>> >
>> > Is this really getting tested well?   Because at least for SCHED_OTHER
>> > tasks,
>>
>> Note that this only affects SCHED_OTHER.  RT tasks are handled with a
>> different algorithm.
>>
>> > the newidle balancer is still supposed to be relatively
>> > conservative and not over balance too much.
>>
>> In our testing, newidle is degrading the system (at least for certain
>> workloads).  Oprofile was showing that newidle can account for 60-80% of
>> the CPU during our benchmark runs. Turning off newidle *completely* by
>> commenting out idle_balance() boosts netperf performance by 200% for our
>> 8-core to 8-core UDP transaction test. Obviously neutering it is not
>> sustainable as a general solution, so we are trying to reduce its negative
>> impact.
> 
> Hmm. I'd like to see an attempt to be made to tuning the algorithm
> so that newidle actually won't cause any tasks to be balanced in
> this case. That seems like the right thing to do, doesn't it?

Agreed.  I'm working on it, but it's not quite ready yet :)

> 
> Of course... tuning the whole balancer on the basis of a crazy
> netperf benchmark is... dangerous :)

Agreed.  I am working on a general algorithm to make the
RT and CFS tasks "play nice" with each other.  This netperf test
was chosen because it is particularly hard-hit by the current
problems in this space.  But I agree we can't tune it just for
that one benchmark.  I am hoping when completed this work will
help the entire system :)

I will add you to the CC list when I send these patches out.

> 
> 
>> It is not clear whether the problem is that newidle is over-balancing the
>> system, or that newidle is simply running too frequently as a symptom of a
>> system that has a high frequency of context switching (such as -rt).  I 
>> suspected the latter, so I was attracted to Peter's idea based on the
>> concept of shortening the time we execute this function.  But I have to
>> admit, unlike 1/3 and 2/3 which I have carefully benchmarked individually
>> and know make a positive performance impact, I pulled this in more on
>> theory.  I will try to benchmark this individually as well.
>>
>> > By the time you have
>> > done all this calculation and reached here, it will be a loss to only
>> > move one task if you could have moved two and halved your newidle
>> > balance rate...
>>
>> Thats an interesting point that I did not consider, but note that a very
>> significant chunk of the overhead I believe comes from the
>> double_lock/move_tasks code after the algorithmic complexity is completed.
> 
> And that double lock will be amortized if you can move 2 tasks at once,
> rather than 1 task each 2 times.

Thats a good point.

> 
> 
>> I believe the primary motivation of this patch is related to reducing the
>> overall latency in the schedule() critical section.  Currently this
>> operation can perform an unbounded move_task operation in a
>> preempt-disabled region (which, as an aside, is always SCHED_OTHER
>> related).
> 
> Maybe putting some upper cap on it, I could live with. Cutting off at
> one task I think needs a lot more thought and testing.

Perhaps we could reuse the sched_nr_migrations as the threshold?
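
A rough sketch of that idea against the move_tasks() loop from patch 3/3
(illustrative only; it assumes the existing sysctl_sched_nr_migrate knob is
the threshold being referred to, and the exact cut-off shown is arbitrary):

	/*
	 * Sketch: rather than bailing out of newidle balancing as soon as
	 * one task has been pulled, keep going until a modest cap is hit.
	 * sysctl_sched_nr_migrate is assumed to be the threshold meant
	 * above; any small bound would illustrate the same trade-off.
	 */
	if (idle == CPU_NEWLY_IDLE &&
	    this_rq->nr_running >= sysctl_sched_nr_migrate)
		break;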

-Greg



^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH 2/3] sched: only run newidle if previous task was CFS
  2008-06-23 23:04 ` [PATCH 2/3] sched: only run newidle if previous task was CFS Gregory Haskins
@ 2008-06-24  9:58   ` Peter Zijlstra
  2008-06-24 10:38     ` Peter Zijlstra
  0 siblings, 1 reply; 24+ messages in thread
From: Peter Zijlstra @ 2008-06-24  9:58 UTC (permalink / raw)
  To: Gregory Haskins; +Cc: mingo, rostedt, tglx, linux-kernel, linux-rt-users, dbahi

On Mon, 2008-06-23 at 17:04 -0600, Gregory Haskins wrote:
> A system that tends to overschedule (such as PREEMPT_RT) will naturally
> tend to newidle balance often as well.  This may have quite a negative
> impact on performance.  This patch attempts to address the overzealous
> newidle balancing by only allowing it to occur if the previous task
> was SCHED_OTHER.
> 
> Some may argue that if the system is going idle, it should try to
> newidle balance to keep it doing useful work.  But the fact is that
> spending too much time in the load-balancing code demonstrably hurts
> performance as well.  Running oprofile on the system with various
> workloads has shown that we can sometimes spend a majority of our
> cpu-time running load_balance_newidle.  Additionally, disabling
> newidle balancing can make said workloads increase in performance by
> up to 200%.  Obviously disabling the feature outright is not sustainable,
> but hopefully we can make it smarter. 
> 
> This code assumes that if there arent any CFS tasks present on the queue,
> it was probably already balanced.
> 
> Signed-off-by: Gregory Haskins <ghaskins@novell.com>

NAK, this wrecks idle balance for any potential other classes.

idle_balance() is the generic hook - as can be seen from the class
iteration in move_tasks().

I can imagine partitioned EDF wanting to make use of these hooks to
balance the reservations.

> ---
> 
>  kernel/sched.c      |    4 +---
>  kernel/sched_fair.c |    9 +++++++++
>  2 files changed, 10 insertions(+), 3 deletions(-)
> 
> diff --git a/kernel/sched.c b/kernel/sched.c
> index 490e6bc..3efbbc5 100644
> --- a/kernel/sched.c
> +++ b/kernel/sched.c
> @@ -1310,6 +1310,7 @@ static unsigned long source_load(int cpu, int type);
>  static unsigned long target_load(int cpu, int type);
>  static unsigned long cpu_avg_load_per_task(int cpu);
>  static int task_hot(struct task_struct *p, u64 now, struct sched_domain *sd);
> +static void idle_balance(int this_cpu, struct rq *this_rq);
>  #endif /* CONFIG_SMP */
>  
>  #include "sched_stats.h"
> @@ -4170,9 +4171,6 @@ asmlinkage void __sched __schedule(void)
>  		prev->sched_class->pre_schedule(rq, prev);
>  #endif
>  
> -	if (unlikely(!rq->nr_running))
> -		idle_balance(cpu, rq);
> -
>  	prev->sched_class->put_prev_task(rq, prev);
>  	next = pick_next_task(rq, prev);
>  
> diff --git a/kernel/sched_fair.c b/kernel/sched_fair.c
> index 0ade6f8..2e22529 100644
> --- a/kernel/sched_fair.c
> +++ b/kernel/sched_fair.c
> @@ -1426,6 +1426,14 @@ static void moved_group_fair(struct task_struct *p)
>  }
>  #endif
>  
> +#ifdef CONFIG_SMP
> +static void pre_schedule_fair(struct rq *rq, struct task_struct *prev)
> +{
> +	if (unlikely(!rq->nr_running))
> +		idle_balance(rq->cpu, rq);
> +}
> +#endif
> +
>  /*
>   * All the scheduling class methods:
>   */
> @@ -1446,6 +1454,7 @@ static const struct sched_class fair_sched_class = {
>  #ifdef CONFIG_SMP
>  	.load_balance		= load_balance_fair,
>  	.move_one_task		= move_one_task_fair,
> +	.pre_schedule		= pre_schedule_fair,
>  #endif
>  
>  	.set_curr_task          = set_curr_task_fair,
> 


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH 1/3] sched: enable interrupts and drop rq-lock during newidle balancing
  2008-06-23 23:04 ` [PATCH 1/3] sched: enable interrupts and drop rq-lock during newidle balancing Gregory Haskins
  2008-06-24  0:11   ` Steven Rostedt
@ 2008-06-24 10:13   ` Peter Zijlstra
  2008-06-24 13:15     ` [PATCH 1/3] sched: enable interrupts and drop rq-lock duringnewidle balancing Gregory Haskins
  1 sibling, 1 reply; 24+ messages in thread
From: Peter Zijlstra @ 2008-06-24 10:13 UTC (permalink / raw)
  To: Gregory Haskins; +Cc: mingo, rostedt, tglx, linux-kernel, linux-rt-users, dbahi

On Mon, 2008-06-23 at 17:04 -0600, Gregory Haskins wrote:
> We do find_busiest_groups() et. al. without locks held for normal balancing,
> so lets do it for newidle as well.  It will allow other cpus to make
> forward progress (against our RQ) while we try to balance and allow 
> some interrupts to occur.

Is running f_b_g really that expensive? I was under the impression that
move_tasks() is the expensive one...

> Signed-off-by: Gregory Haskins <ghaskins@novell.com>
> ---
> 
>  kernel/sched.c |   44 ++++++++++++++++++++++++++++++++++++++------
>  1 files changed, 38 insertions(+), 6 deletions(-)
> 
> diff --git a/kernel/sched.c b/kernel/sched.c
> index 31f91d9..490e6bc 100644
> --- a/kernel/sched.c
> +++ b/kernel/sched.c
> @@ -3333,6 +3333,16 @@ load_balance_newidle(int this_cpu, struct rq *this_rq, struct sched_domain *sd)
>  	int sd_idle = 0;
>  	int all_pinned = 0;
>  	cpumask_t cpus = CPU_MASK_ALL;
> +	int nr_running;
> +
> +	schedstat_inc(sd, lb_count[CPU_NEWLY_IDLE]);
> +
> +	/*
> +	 * We are in a preempt-disabled section, so dropping the lock/irq
> +	 * here simply means that other cores may acquire the lock,
> +	 * and interrupts may occur.
> +	 */
> +	spin_unlock_irq(&this_rq->lock);
>  
>  	/*
>  	 * When power savings policy is enabled for the parent domain, idle
> @@ -3344,7 +3354,6 @@ load_balance_newidle(int this_cpu, struct rq *this_rq, struct sched_domain *sd)
>  	    !test_sd_parent(sd, SD_POWERSAVINGS_BALANCE))
>  		sd_idle = 1;
>  
> -	schedstat_inc(sd, lb_count[CPU_NEWLY_IDLE]);
>  redo:
>  	group = find_busiest_group(sd, this_cpu, &imbalance, CPU_NEWLY_IDLE,
>  				   &sd_idle, &cpus, NULL);
> @@ -3366,14 +3375,33 @@ redo:
>  
>  	ld_moved = 0;
>  	if (busiest->nr_running > 1) {
> -		/* Attempt to move tasks */
> -		double_lock_balance(this_rq, busiest);
> -		/* this_rq->clock is already updated */
> -		update_rq_clock(busiest);
> +		local_irq_disable();
> +		double_rq_lock(this_rq, busiest);
> +
> +		BUG_ON(this_cpu != smp_processor_id());
> +
> +		/*
> +		 * Checking rq->nr_running covers both the case where
> +		 * newidle-balancing pulls a task, as well as if something
> +		 * else issued a NEEDS_RESCHED (since we would only need
> +		 * a reschedule if something was moved to us)
> +		 */
> +		if (this_rq->nr_running) {
> +			double_rq_unlock(this_rq, busiest);
> +			local_irq_enable();
> +			goto out_balanced;
> +		}
> +
>  		ld_moved = move_tasks(this_rq, this_cpu, busiest,
>  					imbalance, sd, CPU_NEWLY_IDLE,
>  					&all_pinned);
> -		spin_unlock(&busiest->lock);
> +
> +		nr_running = this_rq->nr_running;
> +		double_rq_unlock(this_rq, busiest);
> +		local_irq_enable();
> +
> +		if (nr_running)
> +			goto out_balanced;
>  
>  		if (unlikely(all_pinned)) {
>  			cpu_clear(cpu_of(busiest), cpus);
> @@ -3382,6 +3410,8 @@ redo:
>  		}
>  	}
>  
> +	spin_lock_irq(&this_rq->lock);
> +
>  	if (!ld_moved) {
>  		schedstat_inc(sd, lb_failed[CPU_NEWLY_IDLE]);
>  		if (!sd_idle && sd->flags & SD_SHARE_CPUPOWER &&
> @@ -3393,6 +3423,8 @@ redo:
>  	return ld_moved;
>  
>  out_balanced:
> +	spin_lock_irq(&this_rq->lock);
> +
>  	schedstat_inc(sd, lb_balanced[CPU_NEWLY_IDLE]);
>  	if (!sd_idle && sd->flags & SD_SHARE_CPUPOWER &&
>  	    !test_sd_parent(sd, SD_POWERSAVINGS_BALANCE))
> 


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH 3/3] sched: terminate newidle balancing once at least one task has moved over
  2008-06-23 23:04 ` [PATCH 3/3] sched: terminate newidle balancing once at least one task has moved over Gregory Haskins
  2008-06-24  0:50   ` Nick Piggin
@ 2008-06-24 10:13   ` Peter Zijlstra
  2008-06-24 13:18     ` [PATCH 3/3] sched: terminate newidle balancing once at leastone " Gregory Haskins
  1 sibling, 1 reply; 24+ messages in thread
From: Peter Zijlstra @ 2008-06-24 10:13 UTC (permalink / raw)
  To: Gregory Haskins; +Cc: mingo, rostedt, tglx, linux-kernel, linux-rt-users, dbahi

On Mon, 2008-06-23 at 17:04 -0600, Gregory Haskins wrote:
> Inspired by Peter Zijlstra.
> 
> Signed-off-by: Gregory Haskins <ghaskins@novell.com>
> ---
> 
>  kernel/sched.c |    4 ++++
>  1 files changed, 4 insertions(+), 0 deletions(-)
> 
> diff --git a/kernel/sched.c b/kernel/sched.c
> index 3efbbc5..c8e8520 100644
> --- a/kernel/sched.c
> +++ b/kernel/sched.c
> @@ -2775,6 +2775,10 @@ static int move_tasks(struct rq *this_rq, int this_cpu, struct rq *busiest,
>  				max_load_move - total_load_moved,
>  				sd, idle, all_pinned, &this_best_prio);
>  		class = class->next;
> +
> +		if (idle == CPU_NEWLY_IDLE && this_rq->nr_running)
> +			break;
> +
>  	} while (class && max_load_move > total_load_moved);
>  
>  	return total_load_moved > 0;


right,.. uhm, except that you forgot all the other fixes and
generalizations I had,..

The LB_START/LB_COMPLETE stuff is needed to fix CFS load balancing. It
now always iterates the first sysctl_sched_nr_migrate tasks, and if it
doesn't find any there, just gives up - which isn't too big of a problem
with it set to 32, but if you drop it to 2/4 stuff starts falling apart.

And the break I had here, only checks classes above and equal to the
current class.

This again is needed when you have more classes.
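
A purely hypothetical sketch of a class-bounded check, to make the point
concrete; the per-class nr_runnable() hook below does not exist in this tree
and is invented only for illustration:

	/*
	 * Hypothetical: report whether anything became runnable in the
	 * class currently being balanced or in a higher one, ignoring
	 * lower classes.  nr_runnable() is an invented per-class hook.
	 */
	static int runnable_in_higher_classes(struct rq *rq,
					      const struct sched_class *curr)
	{
		const struct sched_class *class;

		for (class = sched_class_highest; class; class = class->next) {
			if (class->nr_runnable && class->nr_runnable(rq))
				return 1;
			if (class == curr)
				break;
		}

		return 0;
	}

move_tasks() would then break out only when this returns true for the class
it is currently scanning, so a late SCHED_OTHER wakeup could not cut short
the balancing of a higher class.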


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH 2/3] sched: only run newidle if previous task was CFS
  2008-06-24  9:58   ` Peter Zijlstra
@ 2008-06-24 10:38     ` Peter Zijlstra
  0 siblings, 0 replies; 24+ messages in thread
From: Peter Zijlstra @ 2008-06-24 10:38 UTC (permalink / raw)
  To: Gregory Haskins; +Cc: mingo, rostedt, tglx, linux-kernel, linux-rt-users, dbahi

On Tue, 2008-06-24 at 11:58 +0200, Peter Zijlstra wrote:
> On Mon, 2008-06-23 at 17:04 -0600, Gregory Haskins wrote:
> > A system that tends to overschedule (such as PREEMPT_RT) will naturally
> > tend to newidle balance often as well.  This may have quite a negative
> > impact on performance.  This patch attempts to address the overzealous
> > newidle balancing by only allowing it to occur if the previous task
> > was SCHED_OTHER.
> > 
> > Some may argue that if the system is going idle, it should try to
> > newidle balance to keep it doing useful work.  But the fact is that
> > spending too much time in the load-balancing code demonstrably hurts
> > performance as well.  Running oprofile on the system with various
> > workloads has shown that we can sometimes spend a majority of our
> > cpu-time running load_balance_newidle.  Additionally, disabling
> > newidle balancing can make said workloads increase in performance by
> > up to 200%.  Obviously disabling the feature outright is not sustainable,
> > but hopefully we can make it smarter. 
> > 
> > This code assumes that if there arent any CFS tasks present on the queue,
> > it was probably already balanced.
> > 
> > Signed-off-by: Gregory Haskins <ghaskins@novell.com>
> 
> NAK, this wrecks idle balance for any potential other classes.
> 
> idle_balance() is the generical hook - as can be seen from the class
> iteration in move_tasks().
> 
> I can imagine paritioned EDF wanting to make use of these hooks to
> balance the reservations.

Hmm, it wouldn't,.. since it's too tied in with f_b_g, which is sched_other
based,..

would need more generalization work,..




^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH 1/3] sched: enable interrupts and drop rq-lock duringnewidle balancing
  2008-06-24 13:15     ` [PATCH 1/3] sched: enable interrupts and drop rq-lock duringnewidle balancing Gregory Haskins
@ 2008-06-24 12:24       ` Peter Zijlstra
  2008-06-24 12:39         ` [PATCH 1/3] sched: enable interrupts and drop rq-lockduringnewidle balancing Gregory Haskins
  0 siblings, 1 reply; 24+ messages in thread
From: Peter Zijlstra @ 2008-06-24 12:24 UTC (permalink / raw)
  To: Gregory Haskins
  Cc: mingo, rostedt, tglx, David Bahi, linux-kernel, linux-rt-users

On Tue, 2008-06-24 at 07:15 -0600, Gregory Haskins wrote:
> >>> On Tue, Jun 24, 2008 at  6:13 AM, in message <1214302405.4351.21.camel@twins>,
> Peter Zijlstra <peterz@infradead.org> wrote: 
> > On Mon, 2008-06-23 at 17:04 -0600, Gregory Haskins wrote:
> >> We do find_busiest_groups() et. al. without locks held for normal balancing,
> >> so lets do it for newidle as well.  It will allow other cpus to make
> >> forward progress (against our RQ) while we try to balance and allow 
> >> some interrupts to occur.
> > 
> > Is running f_b_g really that expensive? 
> 
> According to our oprofile data, yes.  I speculate that it works out that way because most newidle
> attempts result in "no imbalance".  But we were spending ~60%+ time in find_busiest_groups()
> because of all the heavy-context switching that goes on in PREEMPT_RT.  So while f_b_g() is
> probably cheaper than double-lock/move_tasks(), the ratio of occurrence is off the charts in
> comparison. Prior to this patch, those occurrences were preempt-disabled/irq-disabled/rq->lock critical
> sections.
> 
> So while it is not clear if f_b_g() is the actual cost, it is a convenient (and legal, afaict) place to
> deterministically reduce the rq->lock scope.  Additionally, doing so measurably helps
> performance, so I think its a win.  Without this patch you have to hope the double_lock releases
> this_rq, and even so were not checking for the NEEDS_RESCHED. 

See, having had this information in the changelog to begin with would
have helped ;-)




^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH 1/3] sched: enable interrupts and drop rq-lockduringnewidle balancing
  2008-06-24 12:24       ` Peter Zijlstra
@ 2008-06-24 12:39         ` Gregory Haskins
  0 siblings, 0 replies; 24+ messages in thread
From: Gregory Haskins @ 2008-06-24 12:39 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: mingo, rostedt, tglx, David Bahi, linux-kernel, linux-rt-users

>>> On Tue, Jun 24, 2008 at  8:24 AM, in message <1214310299.4351.27.camel@twins>,
Peter Zijlstra <peterz@infradead.org> wrote: 
> On Tue, 2008-06-24 at 07:15 -0600, Gregory Haskins wrote:
>> >>> On Tue, Jun 24, 2008 at  6:13 AM, in message 
> <1214302405.4351.21.camel@twins>,
>> Peter Zijlstra <peterz@infradead.org> wrote: 
>> > On Mon, 2008-06-23 at 17:04 -0600, Gregory Haskins wrote:
>> >> We do find_busiest_groups() et. al. without locks held for normal 
> balancing,
>> >> so lets do it for newidle as well.  It will allow other cpus to make
>> >> forward progress (against our RQ) while we try to balance and allow 
>> >> some interrupts to occur.
>> > 
>> > Is running f_b_g really that expensive? 
>> 
>> According to our oprofile data, yes.  I speculate that it works out that way 
> because most newidle
>> attempts result in "no imbalance".  But we were spending ~60%+ time in 
> find_busiest_groups()
>> because of all the heavy-context switching that goes on in PREEMPT_RT.  So 
> while f_b_g() is
>> probably cheaper than double-lock/move_tasks(), the ratio of occurrence is 
> off the charts in
>> comparison. Prior to this patch, those occurrences were 
> preempt-disabled/irq-disabled/rq->lock critical
>> sections.
>> 
>> So while it is not clear if f_b_g() is the actual cost, it is a convenient 
> (and legal, afaict) place to
>> deterministically reduce the rq->lock scope.  Additionally, doing so 
> measurably helps
>> performance, so I think its a win.  Without this patch you have to hope the 
> double_lock releases
>> this_rq, and even so were not checking for the NEEDS_RESCHED. 
> 
> See, having had this information in the changelog to begin with would
> have helped ;-)

What?  You can't read my mind?  :)

Good point, Peter.  Will fix on next drop.

-Greg

> 
> 
> 
> --
> To unsubscribe from this list: send the line "unsubscribe linux-rt-users" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html



^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH 1/3] sched: enable interrupts and drop rq-lock duringnewidle balancing
  2008-06-24 10:13   ` Peter Zijlstra
@ 2008-06-24 13:15     ` Gregory Haskins
  2008-06-24 12:24       ` Peter Zijlstra
  0 siblings, 1 reply; 24+ messages in thread
From: Gregory Haskins @ 2008-06-24 13:15 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: mingo, rostedt, tglx, David Bahi, linux-kernel, linux-rt-users

>>> On Tue, Jun 24, 2008 at  6:13 AM, in message <1214302405.4351.21.camel@twins>,
Peter Zijlstra <peterz@infradead.org> wrote: 
> On Mon, 2008-06-23 at 17:04 -0600, Gregory Haskins wrote:
>> We do find_busiest_groups() et. al. without locks held for normal balancing,
>> so lets do it for newidle as well.  It will allow other cpus to make
>> forward progress (against our RQ) while we try to balance and allow 
>> some interrupts to occur.
> 
> Is running f_b_g really that expensive? 

According to our oprofile data, yes.  I speculate that it works out that way because most newidle
attempts result in "no imbalance".  But we were spending ~60%+ of our time in find_busiest_group()
because of all the heavy context switching that goes on in PREEMPT_RT.  So while f_b_g() is
probably cheaper than double-lock/move_tasks(), the ratio of occurrence is off the charts in
comparison. Prior to this patch, those occurrences were preempt-disabled/irq-disabled/rq->lock critical
sections.

So while it is not clear if f_b_g() is the actual cost, it is a convenient (and legal, afaict) place to
deterministically reduce the rq->lock scope.  Additionally, doing so measurably helps
performance, so I think it's a win.  Without this patch you have to hope the double_lock releases
this_rq, and even so we're not checking for the NEEDS_RESCHED.

Note: I have a refresh of this patch coming shortly, and I will drop the one you NAKed

Thanks Peter!

Regards,
-Greg

> 
>> Signed-off-by: Gregory Haskins <ghaskins@novell.com>
>> ---
>> 
>>  kernel/sched.c |   44 ++++++++++++++++++++++++++++++++++++++------
>>  1 files changed, 38 insertions(+), 6 deletions(-)
>> 
>> diff --git a/kernel/sched.c b/kernel/sched.c
>> index 31f91d9..490e6bc 100644
>> --- a/kernel/sched.c
>> +++ b/kernel/sched.c
>> @@ -3333,6 +3333,16 @@ load_balance_newidle(int this_cpu, struct rq *this_rq, 
> struct sched_domain *sd)
>>  	int sd_idle = 0;
>>  	int all_pinned = 0;
>>  	cpumask_t cpus = CPU_MASK_ALL;
>> +	int nr_running;
>> +
>> +	schedstat_inc(sd, lb_count[CPU_NEWLY_IDLE]);
>> +
>> +	/*
>> +	 * We are in a preempt-disabled section, so dropping the lock/irq
>> +	 * here simply means that other cores may acquire the lock,
>> +	 * and interrupts may occur.
>> +	 */
>> +	spin_unlock_irq(&this_rq->lock);
>>  
>>  	/*
>>  	 * When power savings policy is enabled for the parent domain, idle
>> @@ -3344,7 +3354,6 @@ load_balance_newidle(int this_cpu, struct rq *this_rq, 
> struct sched_domain *sd)
>>  	    !test_sd_parent(sd, SD_POWERSAVINGS_BALANCE))
>>  		sd_idle = 1;
>>  
>> -	schedstat_inc(sd, lb_count[CPU_NEWLY_IDLE]);
>>  redo:
>>  	group = find_busiest_group(sd, this_cpu, &imbalance, CPU_NEWLY_IDLE,
>>  				   &sd_idle, &cpus, NULL);
>> @@ -3366,14 +3375,33 @@ redo:
>>  
>>  	ld_moved = 0;
>>  	if (busiest->nr_running > 1) {
>> -		/* Attempt to move tasks */
>> -		double_lock_balance(this_rq, busiest);
>> -		/* this_rq->clock is already updated */
>> -		update_rq_clock(busiest);
>> +		local_irq_disable();
>> +		double_rq_lock(this_rq, busiest);
>> +
>> +		BUG_ON(this_cpu != smp_processor_id());
>> +
>> +		/*
>> +		 * Checking rq->nr_running covers both the case where
>> +		 * newidle-balancing pulls a task, as well as if something
>> +		 * else issued a NEEDS_RESCHED (since we would only need
>> +		 * a reschedule if something was moved to us)
>> +		 */
>> +		if (this_rq->nr_running) {
>> +			double_rq_unlock(this_rq, busiest);
>> +			local_irq_enable();
>> +			goto out_balanced;
>> +		}
>> +
>>  		ld_moved = move_tasks(this_rq, this_cpu, busiest,
>>  					imbalance, sd, CPU_NEWLY_IDLE,
>>  					&all_pinned);
>> -		spin_unlock(&busiest->lock);
>> +
>> +		nr_running = this_rq->nr_running;
>> +		double_rq_unlock(this_rq, busiest);
>> +		local_irq_enable();
>> +
>> +		if (nr_running)
>> +			goto out_balanced;
>>  
>>  		if (unlikely(all_pinned)) {
>>  			cpu_clear(cpu_of(busiest), cpus);
>> @@ -3382,6 +3410,8 @@ redo:
>>  		}
>>  	}
>>  
>> +	spin_lock_irq(&this_rq->lock);
>> +
>>  	if (!ld_moved) {
>>  		schedstat_inc(sd, lb_failed[CPU_NEWLY_IDLE]);
>>  		if (!sd_idle && sd->flags & SD_SHARE_CPUPOWER &&
>> @@ -3393,6 +3423,8 @@ redo:
>>  	return ld_moved;
>>  
>>  out_balanced:
>> +	spin_lock_irq(&this_rq->lock);
>> +
>>  	schedstat_inc(sd, lb_balanced[CPU_NEWLY_IDLE]);
>>  	if (!sd_idle && sd->flags & SD_SHARE_CPUPOWER &&
>>  	    !test_sd_parent(sd, SD_POWERSAVINGS_BALANCE))
>> 



^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH 3/3] sched: terminate newidle balancing once at leastone task has moved over
  2008-06-24 10:13   ` Peter Zijlstra
@ 2008-06-24 13:18     ` Gregory Haskins
  2008-06-24 13:31       ` Peter Zijlstra
  0 siblings, 1 reply; 24+ messages in thread
From: Gregory Haskins @ 2008-06-24 13:18 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: mingo, rostedt, tglx, David Bahi, linux-kernel, linux-rt-users

>>> On Tue, Jun 24, 2008 at  6:13 AM, in message <1214302406.4351.23.camel@twins>,
Peter Zijlstra <peterz@infradead.org> wrote: 
> On Mon, 2008-06-23 at 17:04 -0600, Gregory Haskins wrote:
>> Inspired by Peter Zijlstra.
>> 
>> Signed-off-by: Gregory Haskins <ghaskins@novell.com>
>> ---
>> 
>>  kernel/sched.c |    4 ++++
>>  1 files changed, 4 insertions(+), 0 deletions(-)
>> 
>> diff --git a/kernel/sched.c b/kernel/sched.c
>> index 3efbbc5..c8e8520 100644
>> --- a/kernel/sched.c
>> +++ b/kernel/sched.c
>> @@ -2775,6 +2775,10 @@ static int move_tasks(struct rq *this_rq, int 
> this_cpu, struct rq *busiest,
>>  				max_load_move - total_load_moved,
>>  				sd, idle, all_pinned, &this_best_prio);
>>  		class = class->next;
>> +
>> +		if (idle == CPU_NEWLY_IDLE && this_rq->nr_running)
>> +			break;
>> +
>>  	} while (class && max_load_move > total_load_moved);
>>  
>>  	return total_load_moved > 0;
> 
> 
> right,.. uhm, except that you forgot all the other fixes and
> generalizations I had,..

Heh...well I intentionally simplified it, but perhaps that is out of ignorance.  I did say "inspired by" ;)

> 
> The LB_START/LB_COMPLETE stuff is needed to fix CFS load balancing. It
> now always iterates the first sysctl_sched_nr_migrate tasks, and if it
> doesn't find any there, just gives up - which isn't too big of a problem
> with it set to 32, but if you drop it to 2/4 stuff starts valing apart.
> 
> And the break I had here, only checks classes above and equal to the
> current class.
> 
> This again is needed when you have more classes.

I'm not sure I understand/agree here (unless you plan on having a class below sched_idle()?)

The fact that we are going NEWLYIDLE to me implies that all the other classes are
"above or equal".  And rq->nr_running approximates all the per-class vtable work
that you had done to probe the higher classes.  We currently only hit this code when
rq->nr_running == 0, so rq->nr_running != 0 seems like a logical termination
condition.

I guess what I am not clear on is: "when would we be NEWLYIDLE in a higher class,
yet have tasks populated in lower classes such that nr_running is non-zero".
Additionally, even if we have that condition (e.g. with something like the EDF work you
are doing, perhaps?), shouldn't we patch the advanced form of this logic when the rest
of the code goes in?  For now, this seems like the most straight forward way to
accomplish the goal.  But I could be missing something ;)

Regards,
-Greg


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH 3/3] sched: terminate newidle balancing once at leastone task has moved over
  2008-06-24 13:18     ` [PATCH 3/3] sched: terminate newidle balancing once at leastone " Gregory Haskins
@ 2008-06-24 13:31       ` Peter Zijlstra
  2008-06-24 16:55         ` [PATCH 3/3] sched: terminate newidle balancing once atleastone " Gregory Haskins
  0 siblings, 1 reply; 24+ messages in thread
From: Peter Zijlstra @ 2008-06-24 13:31 UTC (permalink / raw)
  To: Gregory Haskins
  Cc: mingo, rostedt, tglx, David Bahi, linux-kernel, linux-rt-users

On Tue, 2008-06-24 at 07:18 -0600, Gregory Haskins wrote:
> >>> On Tue, Jun 24, 2008 at  6:13 AM, in message <1214302406.4351.23.camel@twins>,
> Peter Zijlstra <peterz@infradead.org> wrote: 
> > On Mon, 2008-06-23 at 17:04 -0600, Gregory Haskins wrote:
> >> Inspired by Peter Zijlstra.
> >> 
> >> Signed-off-by: Gregory Haskins <ghaskins@novell.com>
> >> ---
> >> 
> >>  kernel/sched.c |    4 ++++
> >>  1 files changed, 4 insertions(+), 0 deletions(-)
> >> 
> >> diff --git a/kernel/sched.c b/kernel/sched.c
> >> index 3efbbc5..c8e8520 100644
> >> --- a/kernel/sched.c
> >> +++ b/kernel/sched.c
> >> @@ -2775,6 +2775,10 @@ static int move_tasks(struct rq *this_rq, int 
> > this_cpu, struct rq *busiest,
> >>  				max_load_move - total_load_moved,
> >>  				sd, idle, all_pinned, &this_best_prio);
> >>  		class = class->next;
> >> +
> >> +		if (idle == CPU_NEWLY_IDLE && this_rq->nr_running)
> >> +			break;
> >> +
> >>  	} while (class && max_load_move > total_load_moved);
> >>  
> >>  	return total_load_moved > 0;
> > 
> > 
> > right,.. uhm, except that you forgot all the other fixes and
> > generalizations I had,..
> 
> Heh...well I intentionally simplified it, but perhaps that is out of ignorance.  I did say "inspired by" ;)
> 
> > 
> > The LB_START/LB_COMPLETE stuff is needed to fix CFS load balancing. It
> > now always iterates the first sysctl_sched_nr_migrate tasks, and if it
> > doesn't find any there, just gives up - which isn't too big of a problem
> > with it set to 32, but if you drop it to 2/4 stuff starts valing apart.
> > 
> > And the break I had here, only checks classes above and equal to the
> > current class.
> > 
> > This again is needed when you have more classes.
> 
> Im not sure I understand/agree here (unless you plan on having a class below sched_idle()??)
> 
> The fact that we are going NEWLYIDLE to me implies that all the other classes are
> "above or equal".  And rq->nr_running approximates all the per-class vtable work
> that you had done to probe the higher classes.  We currently only hit this code when
> rq->nr_running == 0, so rq->nr_running !=0 seems like a logical termination
> condition.
> 
> I guess what I am not clear on is: "when would we be NEWLYIDLE in a higher class,
> yet have tasks populated in lower classes such at nr_running is non-zero".
> Additionally, even if we have that condition (e.g. with something like the EDF work you
> are doing, perhaps?), shouldn't we patch the advanced form of this logic when the rest
> of the code goes in?  For now, this seems like the most straight forward way to
> accomplish the goal.  But I could be missing something ;)

The thing I'm worried about - though it might be unfounded, and certainly is
for now - is this: suppose we have:

  EDF
  FIFO/RR
  SOFTRT
  OTHER
  IDLE

and we've just done FIFO/RR (which is a nop) and some interrupt woke
an OTHER task while we dropped for lockbreak.

At this point your logic would bail out and start running the OTHER
task, even though we might have found a SOFTRT task to run had we
bothered to look.
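
What I had in mind was roughly the below - highest_runnable_class() and
class_prio() are made-up names, they just stand in for the per-class
probing my patches did:

	/*
	 * Hypothetical sketch: only bail out when whatever appeared on
	 * this_rq is at least as high in priority as the class we are
	 * currently scanning on busiest (class_prio() returning a larger
	 * value for a higher-priority class); otherwise keep looking.
	 */
	if (idle == CPU_NEWLY_IDLE && this_rq->nr_running &&
	    class_prio(highest_runnable_class(this_rq)) >= class_prio(class))
		break;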




^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH 3/3] sched: terminate newidle balancing once atleastone task has moved over
  2008-06-24 13:31       ` Peter Zijlstra
@ 2008-06-24 16:55         ` Gregory Haskins
  2008-06-24 19:44           ` Peter Zijlstra
  0 siblings, 1 reply; 24+ messages in thread
From: Gregory Haskins @ 2008-06-24 16:55 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: mingo, rostedt, tglx, David Bahi, linux-kernel, linux-rt-users

>>> On Tue, Jun 24, 2008 at  9:31 AM, in message <1214314273.4351.34.camel@twins>,
Peter Zijlstra <peterz@infradead.org> wrote: 
> On Tue, 2008-06-24 at 07:18 -0600, Gregory Haskins wrote:
>> >>> On Tue, Jun 24, 2008 at  6:13 AM, in message 
> <1214302406.4351.23.camel@twins>,
>> Peter Zijlstra <peterz@infradead.org> wrote: 
>> > On Mon, 2008-06-23 at 17:04 -0600, Gregory Haskins wrote:
>> >> Inspired by Peter Zijlstra.
>> >> 
>> >> Signed-off-by: Gregory Haskins <ghaskins@novell.com>
>> >> ---
>> >> 
>> >>  kernel/sched.c |    4 ++++
>> >>  1 files changed, 4 insertions(+), 0 deletions(-)
>> >> 
>> >> diff --git a/kernel/sched.c b/kernel/sched.c
>> >> index 3efbbc5..c8e8520 100644
>> >> --- a/kernel/sched.c
>> >> +++ b/kernel/sched.c
>> >> @@ -2775,6 +2775,10 @@ static int move_tasks(struct rq *this_rq, int 
>> > this_cpu, struct rq *busiest,
>> >>  				max_load_move - total_load_moved,
>> >>  				sd, idle, all_pinned, &this_best_prio);
>> >>  		class = class->next;
>> >> +
>> >> +		if (idle == CPU_NEWLY_IDLE && this_rq->nr_running)
>> >> +			break;
>> >> +
>> >>  	} while (class && max_load_move > total_load_moved);
>> >>  
>> >>  	return total_load_moved > 0;
>> > 
>> > 
>> > right,.. uhm, except that you forgot all the other fixes and
>> > generalizations I had,..
>> 
>> Heh...well I intentionally simplified it, but perhaps that is out of 
> ignorance.  I did say "inspired by" ;)
>> 
>> > 
>> > The LB_START/LB_COMPLETE stuff is needed to fix CFS load balancing. It
>> > now always iterates the first sysctl_sched_nr_migrate tasks, and if it
>> > doesn't find any there, just gives up - which isn't too big of a problem
>> > with it set to 32, but if you drop it to 2/4 stuff starts valing apart.
>> > 
>> > And the break I had here, only checks classes above and equal to the
>> > current class.
>> > 
>> > This again is needed when you have more classes.
>> 
>> Im not sure I understand/agree here (unless you plan on having a class below 
> sched_idle()??)
>> 
>> The fact that we are going NEWLYIDLE to me implies that all the other 
> classes are
>> "above or equal".  And rq->nr_running approximates all the per-class vtable 
> work
>> that you had done to probe the higher classes.  We currently only hit this 
> code when
>> rq->nr_running == 0, so rq->nr_running !=0 seems like a logical termination
>> condition.
>> 
>> I guess what I am not clear on is: "when would we be NEWLYIDLE in a higher 
> class,
>> yet have tasks populated in lower classes such at nr_running is non-zero".
>> Additionally, even if we have that condition (e.g. with something like the 
> EDF work you
>> are doing, perhaps?), shouldn't we patch the advanced form of this logic 
> when the rest
>> of the code goes in?  For now, this seems like the most straight forward way 
> to
>> accomplish the goal.  But I could be missing something ;)
> 
> The thing I'm worried about - but it might be unfounded and is certainly
> so now - is that suppose we have:
> 
>   EDF
>   FIFO/RR
>   SOFTRT
>   OTHER
>   IDLE
> 
> and we've just done FIFO/RR (which is a nop) and and some interrupt woke
> an OTHER task while we dropped for lockbreak.
> 
> At this point your logic would bail out and start running the OTHER
> task, even though we might have found a SOFTRQ task to run had we
> bothered to look.
> 

Ok, now I think I understand your concern.  But I think you may be worrying about
this at the wrong level.  I would think we should be doing something similar to the
post-balance patch I submitted a while back.  It basically iterated through each class,
giving each an opportunity to pull tasks over in its own way.  The difference there
was that I was doing it post-schedule to deal with that locking issue.  We could
take the same idea and do it at the point where we pre_schedule() today.

I think f_b_g() et al. is really SCHED_OTHER specific, and probably always will be.
Let's just formalize that.  Perhaps we should move all the LB code to sched_fair and
set up something like what I proposed (rough sketch below).  Thoughts?
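
Roughly the shape I am picturing - the ->newidle_balance() hook is
hypothetical, nothing like it exists in the tree today:

	/*
	 * Give each class, highest priority first, a chance to pull work
	 * over in whatever way makes sense for it, and stop as soon as
	 * one of them found something runnable on this_rq.
	 */
	const struct sched_class *class;

	for (class = sched_class_highest; class; class = class->next) {
		if (class->newidle_balance && class->newidle_balance(this_rq))
			break;
	}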

-Greg



> 
> 
> --
> To unsubscribe from this list: send the line "unsubscribe linux-rt-users" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html



^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH 3/3] sched: terminate newidle balancing once atleastone task has moved over
  2008-06-24 16:55         ` [PATCH 3/3] sched: terminate newidle balancing once atleastone " Gregory Haskins
@ 2008-06-24 19:44           ` Peter Zijlstra
  0 siblings, 0 replies; 24+ messages in thread
From: Peter Zijlstra @ 2008-06-24 19:44 UTC (permalink / raw)
  To: Gregory Haskins
  Cc: mingo, rostedt, tglx, David Bahi, linux-kernel, linux-rt-users

On Tue, 2008-06-24 at 10:55 -0600, Gregory Haskins wrote:
> >>> On Tue, Jun 24, 2008 at  9:31 AM, in message <1214314273.4351.34.camel@twins>,
> Peter Zijlstra <peterz@infradead.org> wrote: 
> > On Tue, 2008-06-24 at 07:18 -0600, Gregory Haskins wrote:
> >> >>> On Tue, Jun 24, 2008 at  6:13 AM, in message 
> > <1214302406.4351.23.camel@twins>,
> >> Peter Zijlstra <peterz@infradead.org> wrote: 
> >> > On Mon, 2008-06-23 at 17:04 -0600, Gregory Haskins wrote:
> >> >> Inspired by Peter Zijlstra.
> >> >> 
> >> >> Signed-off-by: Gregory Haskins <ghaskins@novell.com>
> >> >> ---
> >> >> 
> >> >>  kernel/sched.c |    4 ++++
> >> >>  1 files changed, 4 insertions(+), 0 deletions(-)
> >> >> 
> >> >> diff --git a/kernel/sched.c b/kernel/sched.c
> >> >> index 3efbbc5..c8e8520 100644
> >> >> --- a/kernel/sched.c
> >> >> +++ b/kernel/sched.c
> >> >> @@ -2775,6 +2775,10 @@ static int move_tasks(struct rq *this_rq, int 
> >> > this_cpu, struct rq *busiest,
> >> >>  				max_load_move - total_load_moved,
> >> >>  				sd, idle, all_pinned, &this_best_prio);
> >> >>  		class = class->next;
> >> >> +
> >> >> +		if (idle == CPU_NEWLY_IDLE && this_rq->nr_running)
> >> >> +			break;
> >> >> +
> >> >>  	} while (class && max_load_move > total_load_moved);
> >> >>  
> >> >>  	return total_load_moved > 0;
> >> > 
> >> > 
> >> > right,.. uhm, except that you forgot all the other fixes and
> >> > generalizations I had,..
> >> 
> >> Heh...well I intentionally simplified it, but perhaps that is out of 
> > ignorance.  I did say "inspired by" ;)
> >> 
> >> > 
> >> > The LB_START/LB_COMPLETE stuff is needed to fix CFS load balancing. It
> >> > now always iterates the first sysctl_sched_nr_migrate tasks, and if it
> >> > doesn't find any there, just gives up - which isn't too big of a problem
> >> > with it set to 32, but if you drop it to 2/4 stuff starts valing apart.
> >> > 
> >> > And the break I had here, only checks classes above and equal to the
> >> > current class.
> >> > 
> >> > This again is needed when you have more classes.
> >> 
> >> Im not sure I understand/agree here (unless you plan on having a class below 
> > sched_idle()??)
> >> 
> >> The fact that we are going NEWLYIDLE to me implies that all the other 
> > classes are
> >> "above or equal".  And rq->nr_running approximates all the per-class vtable 
> > work
> >> that you had done to probe the higher classes.  We currently only hit this 
> > code when
> >> rq->nr_running == 0, so rq->nr_running !=0 seems like a logical termination
> >> condition.
> >> 
> >> I guess what I am not clear on is: "when would we be NEWLYIDLE in a higher 
> > class,
> >> yet have tasks populated in lower classes such at nr_running is non-zero".
> >> Additionally, even if we have that condition (e.g. with something like the 
> > EDF work you
> >> are doing, perhaps?), shouldn't we patch the advanced form of this logic 
> > when the rest
> >> of the code goes in?  For now, this seems like the most straight forward way 
> > to
> >> accomplish the goal.  But I could be missing something ;)
> > 
> > The thing I'm worried about - but it might be unfounded and is certainly
> > so now - is that suppose we have:
> > 
> >   EDF
> >   FIFO/RR
> >   SOFTRT
> >   OTHER
> >   IDLE
> > 
> > and we've just done FIFO/RR (which is a nop) and and some interrupt woke
> > an OTHER task while we dropped for lockbreak.
> > 
> > At this point your logic would bail out and start running the OTHER
> > task, even though we might have found a SOFTRQ task to run had we
> > bothered to look.
> > 
> 
> Ok, now I think I understand your concern.  But I think you may be worrying about
> this at the wrong level.  I would think we should be doing something similar to the
> post-balance patch I submitted a while back.  It basically iterated through each class,
> giving each an opportunity to pull tasks over in its own way.  The difference there
> was that I was doing it post-schedule to deal with that locking issue.  We could
> take the same idea and do it where we pre_schedule() today.
> 
> I think the f_b_g() et. al. is really SCHED_OTHER specific, and probably always will be.
> Lets just formalize that.  Perhaps we should move all the LB code to sched_fair and set
> something like what I proposed up.  Thoughts?

Right, generalizing f_b_g() isn't something we should consider; it's
plenty impossible to understand already.

OK, moving everything into _fair sounds like the right approach.


^ permalink raw reply	[flat|nested] 24+ messages in thread

end of thread, other threads:[~2008-06-24 19:44 UTC | newest]

Thread overview: 24+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2008-06-23 23:04 [PATCH 0/3] RT: scheduler newidle enhancements Gregory Haskins
2008-06-23 23:04 ` [PATCH 1/3] sched: enable interrupts and drop rq-lock during newidle balancing Gregory Haskins
2008-06-24  0:11   ` Steven Rostedt
2008-06-24 10:13   ` Peter Zijlstra
2008-06-24 13:15     ` [PATCH 1/3] sched: enable interrupts and drop rq-lock duringnewidle balancing Gregory Haskins
2008-06-24 12:24       ` Peter Zijlstra
2008-06-24 12:39         ` [PATCH 1/3] sched: enable interrupts and drop rq-lockduringnewidle balancing Gregory Haskins
2008-06-23 23:04 ` [PATCH 2/3] sched: only run newidle if previous task was CFS Gregory Haskins
2008-06-24  9:58   ` Peter Zijlstra
2008-06-24 10:38     ` Peter Zijlstra
2008-06-23 23:04 ` [PATCH 3/3] sched: terminate newidle balancing once at least one task has moved over Gregory Haskins
2008-06-24  0:50   ` Nick Piggin
2008-06-24  1:07     ` Steven Rostedt
2008-06-24  1:26       ` Nick Piggin
2008-06-24  2:39     ` Gregory Haskins
2008-06-24  1:46       ` Nick Piggin
2008-06-24  2:59         ` Gregory Haskins
2008-06-24 10:13   ` Peter Zijlstra
2008-06-24 13:18     ` [PATCH 3/3] sched: terminate newidle balancing once at leastone " Gregory Haskins
2008-06-24 13:31       ` Peter Zijlstra
2008-06-24 16:55         ` [PATCH 3/3] sched: terminate newidle balancing once atleastone " Gregory Haskins
2008-06-24 19:44           ` Peter Zijlstra
2008-06-24  0:15 ` [PATCH 0/3] RT: scheduler newidle enhancements Steven Rostedt
2008-06-24  1:51 ` Gregory Haskins
