* sched: fix/optimise some issues
@ 2011-07-20 13:42 Stephan Bärwolf
2011-07-20 19:11 ` Peter Zijlstra
` (3 more replies)
0 siblings, 4 replies; 10+ messages in thread
From: Stephan Bärwolf @ 2011-07-20 13:42 UTC (permalink / raw)
To: linux-kernel; +Cc: Ingo Molnar, ncrao
[-- Attachment #1: Type: text/plain, Size: 1451 bytes --]
After reviewing the kernel's process scheduler I found three issues.
I am not 100% sure, but I think they are worth patching (see attached
patches 0001 to 0003). These patches are against 3.0-rc7.
I also implemented 128-bit vruntime support:
Mainly on systems with many tasks and (for example) deep cgroups
(or an increased NICE0_LOAD/SCHED_LOAD_SCALE as in commit
c8b281161dfa4bb5d5be63fb036ce19347b88c63), a weighted timeslice
(unsigned long) can become very large (on x86_64) and consumes a
large part of the u64 vruntime (per tick) when added.
This might lead to mis-scheduling because of overflows.
The patches (as single files or as one block patch) in the bz2 files
mainly introduce code (and a Kconfig option) to switch to a 128-bit
vruntime on x86_64 (of course with a little overhead) or to limit a
virtual timeslice.
These patches also tidy up the code around vruntimes by abstracting it
into separate files and types, which makes further coding easier and
simplifies debugging.
For example, after patching, vruntimes are stored in a sched_vruntime_t
type instead of u64.
Please see for yourself (and excuse the direct export from my local git)...
The bz2 patches apply against 2.6.39.3.
Best regards, Stephan
--
Dipl.-Inf. Stephan Bärwolf
Ilmenau University of Technology, Integrated Communication Systems Group
Phone: +49 (0)3677 69 4130
Email: stephan.baerwolf@tu-ilmenau.de,
Web: http://www.tu-ilmenau.de/iks
[-- Attachment #2: vruntime128bitpatches_single.tar.bz2 --]
[-- Type: application/x-bzip, Size: 23828 bytes --]
[-- Attachment #3: vruntime128bitpatches_oneblock.bz2 --]
[-- Type: application/x-bzip, Size: 14321 bytes --]
[-- Attachment #4: 0003-sched-fix-incorrect-use-of-ideal_runtime-s-timeunit.patch --]
[-- Type: text/plain, Size: 2101 bytes --]
From c9b7e910de032ce0851ec297575d42bc8f07a4a3 Mon Sep 17 00:00:00 2001
From: Stephan Baerwolf <stephan.baerwolf@tu-ilmenau.de>
Date: Wed, 20 Jul 2011 15:06:17 +0200
Subject: [PATCH 3/3] sched: fix incorrect use of "ideal_runtime"s timeunit
In "check_preempt_tick()" (kernel/sched_fair.c:1093) a ulong
called "ideal_runtime" stores the timeslice of the current task
(scheduling entity). This time is in real CPU time.
At the end of the same function (nr_running > 1) this (real) time
is compared against a virtual-runtime delta. Obviously the time units
(real vs. virtual) do not match.
Using "wakeup_preempt_entity()" instead should fix this in an even
more general way.
Signed-off-by: Stephan Baerwolf <stephan.baerwolf@tu-ilmenau.de>
---
kernel/sched_fair.c | 16 +++++++---------
1 files changed, 7 insertions(+), 9 deletions(-)
diff --git a/kernel/sched_fair.c b/kernel/sched_fair.c
index 66dc9f7..61d002d 100644
--- a/kernel/sched_fair.c
+++ b/kernel/sched_fair.c
@@ -1083,6 +1083,9 @@ dequeue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
update_cfs_shares(cfs_rq);
}
+static int
+wakeup_preempt_entity(struct sched_entity *curr, struct sched_entity *se);
+
/*
* Preempt the current task with a newly woken task if needed:
*/
@@ -1115,14 +1118,11 @@ check_preempt_tick(struct cfs_rq *cfs_rq, struct sched_entity *curr)
return;
if (cfs_rq->nr_running > 1) {
+ /* check whether curr has finally overtaken the remaining leftmost */
struct sched_entity *se = __pick_first_entity(cfs_rq);
- s64 delta = curr->vruntime - se->vruntime;
-
- if (delta < 0)
- return;
-
- if (delta > ideal_runtime)
- resched_task(rq_of(cfs_rq)->curr);
+
+ if (wakeup_preempt_entity(curr, se) > 0)
+ resched_task(rq_of(cfs_rq)->curr);
}
}
@@ -1156,8 +1156,6 @@ set_next_entity(struct cfs_rq *cfs_rq, struct sched_entity *se)
se->prev_sum_exec_runtime = se->sum_exec_runtime;
}
-static int
-wakeup_preempt_entity(struct sched_entity *curr, struct sched_entity *se);
/*
* Pick the next process, keeping these things in mind, in this order:
--
1.7.3.4
[-- Attachment #5: 0002-sched-replace-use-of-entity_key.patch --]
[-- Type: text/plain, Size: 2181 bytes --]
From df6bc28340e3cc2b4ebe132492971e8a4164fe11 Mon Sep 17 00:00:00 2001
From: Stephan Baerwolf <stephan.baerwolf@tu-ilmenau.de>
Date: Wed, 20 Jul 2011 14:46:59 +0200
Subject: [PATCH 2/3] sched: replace use of "entity_key()"
"entity_key()" is only used in "__enqueue_entity()", and
its only purpose is to subtract the group's min_vruntime from
a task's vruntime.
Before this patch the rbtree enqueue decision is made by
comparing two tasks in the style:
"if (entity_key(cfs_rq, se) < entity_key(cfs_rq, entry))"
which is
"if (se->vruntime - cfs_rq->min_vruntime < entry->vruntime - cfs_rq->min_vruntime)"
or (cancelling cfs_rq->min_vruntime on both sides)
"if (se->vruntime < entry->vruntime)"
which is
"if (entity_before(se, entry))"
So we do not need "entity_key()".
If "entity_before()" is inlined, we also save one subtraction (only one,
because "entity_key(cfs_rq, se)" was cached in "key").
Signed-off-by: Stephan Baerwolf <stephan.baerwolf@tu-ilmenau.de>
---
kernel/sched_fair.c | 8 +-------
1 files changed, 1 insertions(+), 7 deletions(-)
diff --git a/kernel/sched_fair.c b/kernel/sched_fair.c
index e092e72..66dc9f7 100644
--- a/kernel/sched_fair.c
+++ b/kernel/sched_fair.c
@@ -334,11 +334,6 @@ static inline int entity_before(struct sched_entity *a,
return (s64)(a->vruntime - b->vruntime) < 0;
}
-static inline s64 entity_key(struct cfs_rq *cfs_rq, struct sched_entity *se)
-{
- return se->vruntime - cfs_rq->min_vruntime;
-}
-
static void update_min_vruntime(struct cfs_rq *cfs_rq)
{
u64 vruntime = cfs_rq->min_vruntime;
@@ -372,7 +367,6 @@ static void __enqueue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se)
struct rb_node **link = &cfs_rq->tasks_timeline.rb_node;
struct rb_node *parent = NULL;
struct sched_entity *entry;
- s64 key = entity_key(cfs_rq, se);
int leftmost = 1;
/*
@@ -385,7 +379,7 @@ static void __enqueue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se)
* We dont care about collisions. Nodes with
* the same key stay together.
*/
- if (key < entity_key(cfs_rq, entry)) {
+ if (entity_before(se, entry)) {
link = &parent->rb_left;
} else {
link = &parent->rb_right;
--
1.7.3.4
[-- Attachment #6: 0001-sched-check-WAKEUP_PREEMPT-feature-before-preemting-.patch --]
[-- Type: text/plain, Size: 1411 bytes --]
From ccd1e7d300c1f939da745e1c0d50d13fc3ccec7b Mon Sep 17 00:00:00 2001
From: Stephan Baerwolf <stephan.baerwolf@tu-ilmenau.de>
Date: Wed, 20 Jul 2011 14:37:56 +0200
Subject: [PATCH 1/3] sched: check WAKEUP_PREEMPT feature before preempting anything
The function "check_preempt_wakeup" (kernel/sched_fair.c:1885)
will preempt the idle task (for a non-idle task) even if WAKEUP_PREEMPT
is not enabled, because the feature is checked too late.
This patch moves the WAKEUP_PREEMPT check in front of the
idle preemption.
Signed-off-by: Stephan Baerwolf <stephan.baerwolf@tu-ilmenau.de>
---
kernel/sched_fair.c | 6 +++---
1 files changed, 3 insertions(+), 3 deletions(-)
diff --git a/kernel/sched_fair.c b/kernel/sched_fair.c
index 433491c..e092e72 100644
--- a/kernel/sched_fair.c
+++ b/kernel/sched_fair.c
@@ -1905,6 +1905,9 @@ static void check_preempt_wakeup(struct rq *rq, struct task_struct *p, int wake_
if (test_tsk_need_resched(curr))
return;
+ if (!sched_feat(WAKEUP_PREEMPT))
+ return;
+
/* Idle tasks are by definition preempted by non-idle tasks. */
if (unlikely(curr->policy == SCHED_IDLE) &&
likely(p->policy != SCHED_IDLE))
@@ -1918,9 +1921,6 @@ static void check_preempt_wakeup(struct rq *rq, struct task_struct *p, int wake_
return;
- if (!sched_feat(WAKEUP_PREEMPT))
- return;
-
update_curr(cfs_rq);
find_matching_se(&se, &pse);
BUG_ON(!pse);
--
1.7.3.4
^ permalink raw reply related [flat|nested] 10+ messages in thread
* Re: sched: fix/optimise some issues
2011-07-20 13:42 sched: fix/optimise some issues Stephan Bärwolf
@ 2011-07-20 19:11 ` Peter Zijlstra
2011-07-21 1:00 ` Mike Galbraith
2011-07-20 19:11 ` Peter Zijlstra
` (2 subsequent siblings)
3 siblings, 1 reply; 10+ messages in thread
From: Peter Zijlstra @ 2011-07-20 19:11 UTC (permalink / raw)
To: stephan.baerwolf; +Cc: linux-kernel, Ingo Molnar, ncrao, Mike Galbraith
On Wed, 2011-07-20 at 15:42 +0200, Stephan Bärwolf wrote:
> In "check_preempt_tick()" (kernel/sched_fair.c:1093) a ulong
> called "ideal_runtime" stores the timeslice of the current task
> (scheduling entity). This time is in real CPU time.
>
> At the end of the same function (nr_running > 1) this (real) time
> is compared against a virtual-runtime delta. Obviously the time units
> (real vs. virtual) do not match.
>
> Using "wakeup_preempt_entity()" instead should fix this in an even
> more general way.
Hrm, I'm fairly sure we did that on purpose and the thing that is
missing is a big fat comment. People keep trying to fix that (me
included).
I'll try and dig up the why and such.
* Re: sched: fix/optimise some issues
2011-07-20 13:42 sched: fix/optimise some issues Stephan Bärwolf
2011-07-20 19:11 ` Peter Zijlstra
@ 2011-07-20 19:11 ` Peter Zijlstra
2011-07-20 19:11 ` Peter Zijlstra
2011-07-21 15:08 ` Peter Zijlstra
3 siblings, 0 replies; 10+ messages in thread
From: Peter Zijlstra @ 2011-07-20 19:11 UTC (permalink / raw)
To: stephan.baerwolf; +Cc: linux-kernel, Ingo Molnar, ncrao
On Wed, 2011-07-20 at 15:42 +0200, Stephan Bärwolf wrote:
> "entity_key()" is only used in "__enqueue_entity()", and
> its only purpose is to subtract the group's min_vruntime from
> a task's vruntime.
> Before this patch the rbtree enqueue decision is made by
> comparing two tasks in the style:
>
> "if (entity_key(cfs_rq, se) < entity_key(cfs_rq, entry))"
>
> which is
>
> "if (se->vruntime - cfs_rq->min_vruntime < entry->vruntime - cfs_rq->min_vruntime)"
>
> or (cancelling cfs_rq->min_vruntime on both sides)
>
> "if (se->vruntime < entry->vruntime)"
>
> which is
>
> "if (entity_before(se, entry))"
>
> So we do not need "entity_key()".
> If "entity_before()" is inlined, we also save one subtraction (only one,
> because "entity_key(cfs_rq, se)" was cached in "key").
Indeed, thanks!
* Re: sched: fix/optimise some issues
2011-07-20 13:42 sched: fix/optimise some issues Stephan Bärwolf
2011-07-20 19:11 ` Peter Zijlstra
2011-07-20 19:11 ` Peter Zijlstra
@ 2011-07-20 19:11 ` Peter Zijlstra
2011-07-21 15:08 ` Peter Zijlstra
3 siblings, 0 replies; 10+ messages in thread
From: Peter Zijlstra @ 2011-07-20 19:11 UTC (permalink / raw)
To: stephan.baerwolf; +Cc: linux-kernel, Ingo Molnar, ncrao
On Wed, 2011-07-20 at 15:42 +0200, Stephan Bärwolf wrote:
>
> The function "check_preempt_wakeup" (kernel/sched_fair.c:1885)
> will preempt the idle task (for a non-idle task) even if WAKEUP_PREEMPT
> is not enabled, because the feature is checked too late.
> This patch moves the WAKEUP_PREEMPT check in front of the
> idle preemption.
That's actually on purpose; the WAKEUP_PREEMPT feature is supposed to only
affect preemption between SCHED_OTHER tasks.
* Re: sched: fix/optimise some issues
2011-07-20 19:11 ` Peter Zijlstra
@ 2011-07-21 1:00 ` Mike Galbraith
0 siblings, 0 replies; 10+ messages in thread
From: Mike Galbraith @ 2011-07-21 1:00 UTC (permalink / raw)
To: Peter Zijlstra; +Cc: stephan.baerwolf, linux-kernel, Ingo Molnar, ncrao
On Wed, 2011-07-20 at 21:11 +0200, Peter Zijlstra wrote:
> On Wed, 2011-07-20 at 15:42 +0200, Stephan Bärwolf wrote:
> > In "check_preempt_tick()" (kernel/sched_fair.c:1093) a ulong
> > called "ideal_runtime" stores the timeslice of the current task
> > (scheduling entity). This time is in real CPU time.
> >
> > At the end of the same function (nr_running > 1) this (real) time
> > is compared against a virtual-runtime delta. Obviously the time units
> > (real vs. virtual) do not match.
> >
> > Using "wakeup_preempt_entity()" instead should fix this in an even
> > more general way.
(That's what I thought at first too)
> Hrm,. I'm fairly sure we did that on purpose and the thing that is
> missing is a big fat comment. People keep trying to fix that (me
> included).
>
> I'll try and dig up the why and such.
Better to just kill it outright. It's not doing that much anyway.
-Mike
* Re: sched: fix/optimise some issues
2011-07-20 13:42 sched: fix/optimise some issues Stephan Bärwolf
` (2 preceding siblings ...)
2011-07-20 19:11 ` Peter Zijlstra
@ 2011-07-21 15:08 ` Peter Zijlstra
2011-07-21 16:36 ` Stephan Bärwolf
3 siblings, 1 reply; 10+ messages in thread
From: Peter Zijlstra @ 2011-07-21 15:08 UTC (permalink / raw)
To: stephan.baerwolf; +Cc: linux-kernel, Ingo Molnar, ncrao
On Wed, 2011-07-20 at 15:42 +0200, Stephan Bärwolf wrote:
>
> I also implemented 128-bit vruntime support:
> Mainly on systems with many tasks and (for example) deep cgroups
> (or an increased NICE0_LOAD/SCHED_LOAD_SCALE as in commit
> c8b281161dfa4bb5d5be63fb036ce19347b88c63), a weighted timeslice
> (unsigned long) can become very large (on x86_64) and consumes a
> large part of the u64 vruntime (per tick) when added.
> This might lead to mis-scheduling because of overflows.
Right, so I've often wanted a [us]128 type, and gcc has some (broken?)
support for that, but overhead has always kept me from it.
There's also the non-atomicity thing to consider, see min_vruntime_copy
etc.
How horrid is the current vruntime situation?
As to your true idle, there's a very good reason the current SCHED_IDLE
isn't a true idle scheduler; it would create horrid priority-inversion
problems: imagine the true idle task holding a mutex or being required to
complete something.
* Re: sched: fix/optimise some issues
2011-07-21 16:36 ` Stephan Bärwolf
@ 2011-07-21 16:32 ` Peter Zijlstra
2011-07-21 16:43 ` Peter Zijlstra
2011-07-21 16:51 ` Peter Zijlstra
2 siblings, 0 replies; 10+ messages in thread
From: Peter Zijlstra @ 2011-07-21 16:32 UTC (permalink / raw)
To: Stephan Bärwolf; +Cc: linux-kernel
On Thu, 2011-07-21 at 18:36 +0200, Stephan Bärwolf wrote:
> > Right, so I've often wanted a [us]128 type, and gcc has some (broken?)
> > support for that, but overhead has always kept me from it.
> 128-bit sched_vruntime_t support seems to run fine when compiled with
> gcc (Gentoo 4.4.5 p1.2, pie-0.4.5) 4.4.5.
> Of course overhead is a problem (but there is also overhead using u64 on
> x86),
Yeah, I know, but luckily all 32bit computing shall die sooner rather
than later. But there really wasn't much choice there anyway, 32bit
simply won't do.
> that is why it should be Kconfig-selectable (for servers with many
> processes, deep cgroups and many different priorities?).
Sadly that's not how things work in practice, distro's will have to
enable the option and that means that pretty much everybody runs it. The
whole cgroup crap is already _way_ too expensive.
> But I also think abstracting the whole vruntime stuff into a separate
> collection simplifies further evaluation and adaptation. (Think of
> central statistics collection, for example the maximum timeslice seen or
> overflows that happened, without changing all the lines of code with the
> risk of missing something.)
It made rather a mess of things.
> > There's also the non-atomicity thing to consider, see min_vruntime_copy
> > etc.
> I think atomicity is not a (great) issue, for two reasons:
> a) on x86 the u64 wouldn't be atomic either (vruntime is u64, not
> atomic64_t)
atomic64_t isn't needed in order to guarantee consistent loads, Linux
depends on the fact that all naturally aligned loads are complete loads
(no partials etc.).
> b) every operation on cfs_rq->min_vruntime should happen while
> holding the runqueue lock?
---
commit 3fe1698b7fe05aeb063564e71e40d09f28d8e80c
Author: Peter Zijlstra <a.p.zijlstra@chello.nl>
Date: Tue Apr 5 17:23:48 2011 +0200
sched: Deal with non-atomic min_vruntime reads on 32bits
In order to avoid reading partial updated min_vruntime values on 32bit
implement a seqcount like solution.
Reviewed-by: Frank Rowand <frank.rowand@am.sony.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Nick Piggin <npiggin@kernel.dk>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Link: http://lkml.kernel.org/r/20110405152729.111378493@chello.nl
Signed-off-by: Ingo Molnar <mingo@elte.hu>
diff --git a/kernel/sched.c b/kernel/sched.c
index 46f42ca..7a5eb26 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -312,6 +312,9 @@ struct cfs_rq {
u64 exec_clock;
u64 min_vruntime;
+#ifndef CONFIG_64BIT
+ u64 min_vruntime_copy;
+#endif
struct rb_root tasks_timeline;
struct rb_node *rb_leftmost;
diff --git a/kernel/sched_fair.c b/kernel/sched_fair.c
index ad4c414f..054cebb 100644
--- a/kernel/sched_fair.c
+++ b/kernel/sched_fair.c
@@ -358,6 +358,10 @@ static void update_min_vruntime(struct cfs_rq *cfs_rq)
}
cfs_rq->min_vruntime = max_vruntime(cfs_rq->min_vruntime, vruntime);
+#ifndef CONFIG_64BIT
+ smp_wmb();
+ cfs_rq->min_vruntime_copy = cfs_rq->min_vruntime;
+#endif
}
/*
@@ -1376,10 +1380,21 @@ static void task_waking_fair(struct task_struct *p)
{
struct sched_entity *se = &p->se;
struct cfs_rq *cfs_rq = cfs_rq_of(se);
+ u64 min_vruntime;
- lockdep_assert_held(&task_rq(p)->lock);
+#ifndef CONFIG_64BIT
+ u64 min_vruntime_copy;
- se->vruntime -= cfs_rq->min_vruntime;
+ do {
+ min_vruntime_copy = cfs_rq->min_vruntime_copy;
+ smp_rmb();
+ min_vruntime = cfs_rq->min_vruntime;
+ } while (min_vruntime != min_vruntime_copy);
+#else
+ min_vruntime = cfs_rq->min_vruntime;
+#endif
+
+ se->vruntime -= min_vruntime;
}
#ifdef CONFIG_FAIR_GROUP_SCHED
* Re: sched: fix/optimise some issues
2011-07-21 15:08 ` Peter Zijlstra
@ 2011-07-21 16:36 ` Stephan Bärwolf
2011-07-21 16:32 ` Peter Zijlstra
` (2 more replies)
0 siblings, 3 replies; 10+ messages in thread
From: Stephan Bärwolf @ 2011-07-21 16:36 UTC (permalink / raw)
To: Peter Zijlstra; +Cc: linux-kernel
Thank you for your fast response and your detailed comments.
On 07/21/11 17:08, Peter Zijlstra wrote:
> On Wed, 2011-07-20 at 15:42 +0200, Stephan Bärwolf wrote:
>> I also implemented 128-bit vruntime support:
>> Mainly on systems with many tasks and (for example) deep cgroups
>> (or an increased NICE0_LOAD/SCHED_LOAD_SCALE as in commit
>> c8b281161dfa4bb5d5be63fb036ce19347b88c63), a weighted timeslice
>> (unsigned long) can become very large (on x86_64) and consumes a
>> large part of the u64 vruntime (per tick) when added.
>> This might lead to mis-scheduling because of overflows.
> Right, so I've often wanted a [us]128 type, and gcc has some (broken?)
> support for that, but overhead has always kept me from it.
128-bit sched_vruntime_t support seems to run fine when compiled with
gcc (Gentoo 4.4.5 p1.2, pie-0.4.5) 4.4.5.
Of course overhead is a problem (but there is also overhead using u64 on
x86); that is why it should be Kconfig-selectable (for servers with many
processes, deep cgroups and many different priorities?).
But I also think abstracting the whole vruntime stuff into a separate
collection simplifies further evaluation and adaptation. (Think of
central statistics collection, for example the maximum timeslice seen or
overflows that happened, without changing all the lines of code with the
risk of missing something.)
> There's also the non-atomicity thing to consider, see min_vruntime_copy
> etc.
I think atomicity is not a (great) issue, for two reasons:
a) on x86 the u64 wouldn't be atomic either (vruntime is u64, not
atomic64_t)
b) every operation on cfs_rq->min_vruntime should happen while
holding the runqueue lock?
> How horrid is the current vruntime situation?
This is a point that needs further discussion/observation.
When for example NICE0_LOAD is increased by 6 bits (and I think
"c8b281161dfa4bb5d5be63fb036ce19347b88c63" did it by 10 bits
on x86_64), the maximum timeslice (I am not quite sure if it was at
HZ=1000) with a PRE kernel will be around 2**38.
Adding this every ms (let's say 1024 times per second) to min_vruntime
might cause overflows too quickly (after 2**(63-38-10) sec = 2**15 sec ~ 9h).
Great heterogeneity of priorities may intensify this situation...
Long story short: on x86_64 an unsigned long (timeslice) can be
as large as the whole u64 min_vruntime, and this is dangerous.
Of course limiting the maximum timeslice in "calc_delta_mine()" would
help too, but without the comfort of using the whole x86_64 capabilities
(and therefore mostly finer priority resolutions).
> As to your true idle, there's a very good reason the current SCHED_IDLE
> isn't a true idle scheduler; it would create horrid priority-inversion
> problems: imagine the true idle task holding a mutex or being required to
> complete something.
Of course, I fully agree! This is one reason why it was marked
"experimental". When having a few background jobs (for example
a boinc or a bitcoin cruncher ;-) ) it works OK because there seems
to be not much cross-process locking.
But in general it is a bad idea...
I also vaguely remember that Linus had something against "priority
inheritance" (don't ask me what or why - I don't know),
but it would be an honour for me to work with you guys to implement
this feature in future kernels. (On the basis of rb-trees saving the
priorities of each "se" holding the lock, to solve priority inversion?
Or, in non-schedulable contexts, maybe setting a "super-priority" while
locking.)
I think real idle scheduling (maybe based on more than one idle level)
would be a very great feature for future kernels.
(For example, utilizing expensive systems without noticeable effects on
interactivity.)
Especially since SMP gains more and more importance (plus increasing
CPUs/cores), and the load balancing often leads to short but significant
idle phases on sparsely loaded (because of interactivity) systems.
Thanks,
regards Stephan
--
Dipl.-Inf. Stephan Bärwolf
Ilmenau University of Technology, Integrated Communication Systems Group
Phone: +49 (0)3677 69 4130
Email: stephan.baerwolf@tu-ilmenau.de,
Web: http://www.tu-ilmenau.de/iks
* Re: sched: fix/optimise some issues
2011-07-21 16:36 ` Stephan Bärwolf
2011-07-21 16:32 ` Peter Zijlstra
@ 2011-07-21 16:43 ` Peter Zijlstra
2011-07-21 16:51 ` Peter Zijlstra
2 siblings, 0 replies; 10+ messages in thread
From: Peter Zijlstra @ 2011-07-21 16:43 UTC (permalink / raw)
To: Stephan Bärwolf; +Cc: linux-kernel, Nikhil Rao, Ingo Molnar
(you seem to have lost CC's, restored them)
On Thu, 2011-07-21 at 18:36 +0200, Stephan Bärwolf wrote:
> > How horrid is the current vruntime situation?
> This is a point, which needs further discussion/observation.
>
> When for example NICE0_LOAD is increased by 6 bits (and I think
> "c8b281161dfa4bb5d5be63fb036ce19347b88c63" did it by 10 bits
> on x86_64), the maximum timeslice (I am not quite sure if it was at
> HZ=1000) with a PRE kernel will be around 2**38.
> Adding this every ms (let's say 1024 times per second) to min_vruntime
> might cause overflows too quickly (after 2**(63-38-10) sec = 2**15 sec ~ 9h).
9h should be just fine, and I think we can get down to minutes without
really breaking much. As long as the wrap period is at least twice the
duration of the period needed to service all tasks.
So what we want is the wrap to be > 2*nr_running*sched_min_granularity.
> Long story short: on x86_64 an unsigned long (timeslice) can be
> as large as the whole u64 min_vruntime, and this is dangerous.
> Of course limiting the maximum timeslice in "calc_delta_mine()" would
> help too, but without the comfort of using the whole x86_64 capabilities
> (and therefore mostly finer priority resolutions).
It should be limited by HZ; that is, I don't think there is a situation
where we can get a bigger real-time delta than 1s/HZ.
* Re: sched: fix/optimise some issues
2011-07-21 16:36 ` Stephan Bärwolf
2011-07-21 16:32 ` Peter Zijlstra
2011-07-21 16:43 ` Peter Zijlstra
@ 2011-07-21 16:51 ` Peter Zijlstra
2 siblings, 0 replies; 10+ messages in thread
From: Peter Zijlstra @ 2011-07-21 16:51 UTC (permalink / raw)
To: Stephan Bärwolf; +Cc: linux-kernel, Nikhil Rao, Ingo Molnar
On Thu, 2011-07-21 at 18:36 +0200, Stephan Bärwolf wrote:
> I also vaguely remember that Linus had something against "priority
> inheritance" (don't ask me what or why - I don't know),
> but it would be an honour for me to work with you guys to implement
> this feature in future kernels.
Look at kernel/rt_mutex.c, it has a complete traditional Priority
Inheritance implementation :-)
The trouble is that it only works for SCHED_FIFO/RR.
Now you can extend PI to cover weighted fair queueing, or implement the
much simpler proxy execution policy which generalizes to pretty much any
scheduling algorithm, see for example the paper: "Timeslice donation in
component-based systems" in:
http://www.artist-embedded.org/docs/Events/2010/OSPERT/OSPERT2010-Proceedings.pdf
Extending that to SMP is the 'interesting' bit..
However that will not solve the true-idle thing, since some
synchronization primitives we have are fundamentally incompatible with
any form of PI (including the various ceiling protocols) :-/, see for
example the traditional semaphore and completions.