public inbox for linux-kernel@vger.kernel.org
* [patch] CFS scheduler, -v14
@ 2007-05-23 12:06 Ingo Molnar
  2007-05-23 19:39 ` Nicolas Mailhot
                   ` (4 more replies)
  0 siblings, 5 replies; 36+ messages in thread
From: Ingo Molnar @ 2007-05-23 12:06 UTC (permalink / raw)
  To: linux-kernel
  Cc: Linus Torvalds, Andrew Morton, Mike Galbraith, Arjan van de Ven,
	Thomas Gleixner, pranith-kumar_d


i'm pleased to announce release -v14 of the CFS scheduler patchset.

The CFS patch against v2.6.22-rc2, v2.6.21.1 or v2.6.20.10 can be 
downloaded from the usual place:
   
      http://people.redhat.com/mingo/cfs-scheduler/

In -v14 the biggest user-visible change is increased sleeper fairness 
(done by Mike Galbraith and myself), which results in better 
interactivity under load. In particular 3D apps such as compiz/Beryl or 
games benefit from it and should be less sensitive to other apps running 
in parallel to them - but plain X benefits from it too.

CFS is converging nicely, with no regressions reported against -v13. 
Changes since -v13:

 - increase sleeper-fairness (Mike Galbraith, me)

 - kernel/sched_debug.c printk argument fixes for ia64 (Andrew Morton)

 - CFS documentation fixes (Pranith Kumar D)

 - increased the default rescheduling granularity to 3msecs on UP,
   6 msecs on 2-way systems

 - small update_curr() precision fix

 - added an overview section to Documentation/sched-design-CFS.txt

 - misc cleanups

As usual, any sort of feedback, bugreport, fix and suggestion is more 
than welcome!

	Ingo


* Re: [patch] CFS scheduler, -v14
  2007-05-23 12:06 [patch] CFS scheduler, -v14 Ingo Molnar
@ 2007-05-23 19:39 ` Nicolas Mailhot
  2007-05-23 19:57   ` Ingo Molnar
  2007-05-24  6:42 ` Balbir Singh
                   ` (3 subsequent siblings)
  4 siblings, 1 reply; 36+ messages in thread
From: Nicolas Mailhot @ 2007-05-23 19:39 UTC (permalink / raw)
  To: linux-kernel

Ingo Molnar <mingo <at> elte.hu> writes:

Hi Ingo

> i'm pleased to announce release -v14 of the CFS scheduler patchset.
> 
> The CFS patch against v2.6.22-rc2, v2.6.21.1 or v2.6.20.10 can be 
> downloaded from the usual place:
> 
>       http://people.redhat.com/mingo/cfs-scheduler/


I get a forbidden access on 
http://people.redhat.com/mingo/cfs-scheduler/sched-cfs-v2.6.22-rc2-mm1-v14.patch

Regards,

-- 
Nicolas Mailhot



* Re: [patch] CFS scheduler, -v14
  2007-05-23 19:39 ` Nicolas Mailhot
@ 2007-05-23 19:57   ` Ingo Molnar
  2007-05-23 20:02     ` Nicolas Mailhot
  0 siblings, 1 reply; 36+ messages in thread
From: Ingo Molnar @ 2007-05-23 19:57 UTC (permalink / raw)
  To: Nicolas Mailhot; +Cc: linux-kernel


* Nicolas Mailhot <nicolas.mailhot@laposte.net> wrote:

> Ingo Molnar <mingo <at> elte.hu> writes:
> 
> Hi Ingo
> 
> > i'm pleased to announce release -v14 of the CFS scheduler patchset.
> > 
> > The CFS patch against v2.6.22-rc2, v2.6.21.1 or v2.6.20.10 can be 
> > downloaded from the usual place:
> > 
> >       http://people.redhat.com/mingo/cfs-scheduler/
> 
> I get a forbidden access on 
> http://people.redhat.com/mingo/cfs-scheduler/sched-cfs-v2.6.22-rc2-mm1-v14.patch

oops - fixed it.

	Ingo


* Re: [patch] CFS scheduler, -v14
  2007-05-23 19:57   ` Ingo Molnar
@ 2007-05-23 20:02     ` Nicolas Mailhot
  0 siblings, 0 replies; 36+ messages in thread
From: Nicolas Mailhot @ 2007-05-23 20:02 UTC (permalink / raw)
  To: Ingo Molnar; +Cc: linux-kernel

On Wednesday 23 May 2007 at 21:57 +0200, Ingo Molnar wrote:
> * Nicolas Mailhot <nicolas.mailhot@laposte.net> wrote:
> 
> > Ingo Molnar <mingo <at> elte.hu> writes:
> > 
> > Hi Ingo
> > 
> > > i'm pleased to announce release -v14 of the CFS scheduler patchset.
> > > 
> > > The CFS patch against v2.6.22-rc2, v2.6.21.1 or v2.6.20.10 can be 
> > > downloaded from the usual place:
> > > 
> > >       http://people.redhat.com/mingo/cfs-scheduler/
> > 
> > I get a forbidden access on 
> > http://people.redhat.com/mingo/cfs-scheduler/sched-cfs-v2.6.22-rc2-mm1-v14.patch
> 
> oops - fixed it.

Works now, thanks!

-- 
Nicolas Mailhot



* Re: [patch] CFS scheduler, -v14
  2007-05-23 12:06 [patch] CFS scheduler, -v14 Ingo Molnar
  2007-05-23 19:39 ` Nicolas Mailhot
@ 2007-05-24  6:42 ` Balbir Singh
  2007-05-24  8:09   ` Ingo Molnar
  2007-05-26 14:58 ` S.Çağlar Onur
                   ` (2 subsequent siblings)
  4 siblings, 1 reply; 36+ messages in thread
From: Balbir Singh @ 2007-05-24  6:42 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: linux-kernel, Linus Torvalds, Andrew Morton, Mike Galbraith,
	Arjan van de Ven, Thomas Gleixner, pranith-kumar_d

On Wed, May 23, 2007 at 02:06:16PM +0200, Ingo Molnar wrote:
> 
> i'm pleased to announce release -v14 of the CFS scheduler patchset.
> 
> The CFS patch against v2.6.22-rc2, v2.6.21.1 or v2.6.20.10 can be 
> downloaded from the usual place:
>    
>       http://people.redhat.com/mingo/cfs-scheduler/
> 
> In -v14 the biggest user-visible change is increased sleeper fairness 
> (done by Mike Galbraith and myself), which results in better 
> interactivity under load. In particular 3D apps such as compiz/Beryl or 
> games benefit from it and should be less sensitive to other apps running 
> in parallel to them - but plain X benefits from it too.
> 
> CFS is converging nicely, with no regressions reported against -v13. 
> Changes since -v13:
> 
>  - increase sleeper-fairness (Mike Galbraith, me)
> 
>  - kernel/sched_debug.c printk argument fixes for ia64 (Andrew Morton)
> 
>  - CFS documentation fixes (Pranith Kumar D)
> 
>  - increased the default rescheduling granularity to 3msecs on UP,
>    6 msecs on 2-way systems
> 
>  - small update_curr() precision fix
> 
>  - added an overview section to Documentation/sched-design-CFS.txt
> 
>  - misc cleanups
> 
> As usual, any sort of feedback, bugreport, fix and suggestion is more 
> than welcome!
> 
> 	Ingo

Hi, Ingo,

I've implemented a patch on top of v14 for better accounting of
sched_info statistics. Earlier, sched_info relied on jiffies for
accounting and I've seen applications that show "0" cpu usage
statistics (in delay accounting and from /proc) even though they've
been running on the CPU for a long time. The basic problem is that
accounting in jiffies is too coarse to be accurate.

The patch below uses sched_clock() for sched_info accounting.

Comments, suggestions, feedback is more than welcome!

Signed-off-by: Balbir Singh <balbir@linux.vnet.ibm.com>
---

 include/linux/sched.h |   10 +++++-----
 kernel/delayacct.c    |   10 +++++-----
 kernel/sched_stats.h  |   28 ++++++++++++++--------------
 3 files changed, 24 insertions(+), 24 deletions(-)

diff -puN kernel/sched_stats.h~move-sched-accounting-to-sched_clock kernel/sched_stats.h
--- linux-2.6.21/kernel/sched_stats.h~move-sched-accounting-to-sched_clock	2007-05-24 11:23:38.000000000 +0530
+++ linux-2.6.21-balbir/kernel/sched_stats.h	2007-05-24 11:23:38.000000000 +0530
@@ -97,10 +97,10 @@ const struct file_operations proc_scheds
  * Expects runqueue lock to be held for atomicity of update
  */
 static inline void
-rq_sched_info_arrive(struct rq *rq, unsigned long delta_jiffies)
+rq_sched_info_arrive(struct rq *rq, unsigned long long delta)
 {
 	if (rq) {
-		rq->rq_sched_info.run_delay += delta_jiffies;
+		rq->rq_sched_info.run_delay += delta;
 		rq->rq_sched_info.pcnt++;
 	}
 }
@@ -109,19 +109,19 @@ rq_sched_info_arrive(struct rq *rq, unsi
  * Expects runqueue lock to be held for atomicity of update
  */
 static inline void
-rq_sched_info_depart(struct rq *rq, unsigned long delta_jiffies)
+rq_sched_info_depart(struct rq *rq, unsigned long long delta)
 {
 	if (rq)
-		rq->rq_sched_info.cpu_time += delta_jiffies;
+		rq->rq_sched_info.cpu_time += delta;
 }
 # define schedstat_inc(rq, field)	do { (rq)->field++; } while (0)
 # define schedstat_add(rq, field, amt)	do { (rq)->field += (amt); } while (0)
 #else /* !CONFIG_SCHEDSTATS */
 static inline void
-rq_sched_info_arrive(struct rq *rq, unsigned long delta_jiffies)
+rq_sched_info_arrive(struct rq *rq, unsigned long long delta)
 {}
 static inline void
-rq_sched_info_depart(struct rq *rq, unsigned long delta_jiffies)
+rq_sched_info_depart(struct rq *rq, unsigned long long delta)
 {}
 # define schedstat_inc(rq, field)	do { } while (0)
 # define schedstat_add(rq, field, amt)	do { } while (0)
@@ -155,16 +155,16 @@ static inline void sched_info_dequeued(s
  */
 static void sched_info_arrive(struct task_struct *t)
 {
-	unsigned long now = jiffies, delta_jiffies = 0;
+	unsigned long long now = sched_clock(), delta = 0;
 
 	if (t->sched_info.last_queued)
-		delta_jiffies = now - t->sched_info.last_queued;
+		delta = now - t->sched_info.last_queued;
 	sched_info_dequeued(t);
-	t->sched_info.run_delay += delta_jiffies;
+	t->sched_info.run_delay += delta;
 	t->sched_info.last_arrival = now;
 	t->sched_info.pcnt++;
 
-	rq_sched_info_arrive(task_rq(t), delta_jiffies);
+	rq_sched_info_arrive(task_rq(t), delta);
 }
 
 /*
@@ -186,7 +186,7 @@ static inline void sched_info_queued(str
 {
 	if (unlikely(sched_info_on()))
 		if (!t->sched_info.last_queued)
-			t->sched_info.last_queued = jiffies;
+			t->sched_info.last_queued = sched_clock();
 }
 
 /*
@@ -195,10 +195,10 @@ static inline void sched_info_queued(str
  */
 static inline void sched_info_depart(struct task_struct *t)
 {
-	unsigned long delta_jiffies = jiffies - t->sched_info.last_arrival;
+	unsigned long long delta = sched_clock() - t->sched_info.last_arrival;
 
-	t->sched_info.cpu_time += delta_jiffies;
-	rq_sched_info_depart(task_rq(t), delta_jiffies);
+	t->sched_info.cpu_time += delta;
+	rq_sched_info_depart(task_rq(t), delta);
 }
 
 /*
diff -puN include/linux/sched.h~move-sched-accounting-to-sched_clock include/linux/sched.h
--- linux-2.6.21/include/linux/sched.h~move-sched-accounting-to-sched_clock	2007-05-24 11:23:38.000000000 +0530
+++ linux-2.6.21-balbir/include/linux/sched.h	2007-05-24 11:23:38.000000000 +0530
@@ -588,13 +588,13 @@ struct reclaim_state;
 #if defined(CONFIG_SCHEDSTATS) || defined(CONFIG_TASK_DELAY_ACCT)
 struct sched_info {
 	/* cumulative counters */
-	unsigned long	cpu_time,	/* time spent on the cpu */
-			run_delay,	/* time spent waiting on a runqueue */
-			pcnt;		/* # of timeslices run on this cpu */
+	unsigned long pcnt;	      /* # of times run on this cpu */
+	unsigned long long cpu_time,  /* time spent on the cpu */
+			   run_delay; /* time spent waiting on a runqueue */
 
 	/* timestamps */
-	unsigned long	last_arrival,	/* when we last ran on a cpu */
-			last_queued;	/* when we were last queued to run */
+	unsigned long long last_arrival,/* when we last ran on a cpu */
+			   last_queued;	/* when we were last queued to run */
 };
 #endif /* defined(CONFIG_SCHEDSTATS) || defined(CONFIG_TASK_DELAY_ACCT) */
 
diff -puN kernel/delayacct.c~move-sched-accounting-to-sched_clock kernel/delayacct.c
--- linux-2.6.21/kernel/delayacct.c~move-sched-accounting-to-sched_clock	2007-05-24 11:31:11.000000000 +0530
+++ linux-2.6.21-balbir/kernel/delayacct.c	2007-05-24 11:52:33.000000000 +0530
@@ -99,9 +99,10 @@ void __delayacct_blkio_end(void)
 int __delayacct_add_tsk(struct taskstats *d, struct task_struct *tsk)
 {
 	s64 tmp;
-	struct timespec ts;
-	unsigned long t1,t2,t3;
+	unsigned long t1;
+	unsigned long long t2,t3;
 	unsigned long flags;
+	struct timespec ts;
 
 	/* Though tsk->delays accessed later, early exit avoids
 	 * unnecessary returning of other data
@@ -124,11 +125,10 @@ int __delayacct_add_tsk(struct taskstats
 
 	d->cpu_count += t1;
 
-	jiffies_to_timespec(t2, &ts);
-	tmp = (s64)d->cpu_delay_total + timespec_to_ns(&ts);
+	tmp = (s64)d->cpu_delay_total + t2;
 	d->cpu_delay_total = (tmp < (s64)d->cpu_delay_total) ? 0 : tmp;
 
-	tmp = (s64)d->cpu_run_virtual_total + (s64)jiffies_to_usecs(t3) * 1000;
+	tmp = (s64)d->cpu_run_virtual_total + t3;
 	d->cpu_run_virtual_total =
 		(tmp < (s64)d->cpu_run_virtual_total) ?	0 : tmp;
 
_

-- 
	Warm Regards,
	Balbir Singh
	Linux Technology Center
	IBM, ISTL


* Re: [patch] CFS scheduler, -v14
  2007-05-24  6:42 ` Balbir Singh
@ 2007-05-24  8:09   ` Ingo Molnar
  2007-05-24  9:19     ` Balbir Singh
                       ` (2 more replies)
  0 siblings, 3 replies; 36+ messages in thread
From: Ingo Molnar @ 2007-05-24  8:09 UTC (permalink / raw)
  To: Balbir Singh
  Cc: linux-kernel, Linus Torvalds, Andrew Morton, Mike Galbraith,
	Arjan van de Ven, Thomas Gleixner, pranith-kumar_d, Andi Kleen


* Balbir Singh <balbir@linux.vnet.ibm.com> wrote:

> Hi, Ingo,
> 
> I've implemented a patch on top of v14 for better accounting of 
> sched_info statistics. Earlier, sched_info relied on jiffies for 
> accounting and I've seen applications that show "0" cpu usage 
> statistics (in delay accounting and from /proc) even though they've 
> been running on the CPU for a long time. The basic problem is that 
> accounting in jiffies is too coarse to be accurate.
> 
> The patch below uses sched_clock() for sched_info accounting.

nice! I've merged your patch and it built/booted fine so it should show 
up in -v15. This should also play well with Andi's sched_clock() 
enhancements in -mm, slated for .23.

btw., i think some more consolidation could be done in this area. We've 
now got the traditional /proc/PID/stat metrics, schedstats, taskstats 
and delay accounting and with CFS we've got /proc/sched_debug and 
/proc/PID/sched. There's a fair amount of overlap.

btw., CFS does this change to fs/proc/array.c:

@@ -410,6 +408,14 @@ static int do_task_stat(struct task_stru
 	/* convert nsec -> ticks */
 	start_time = nsec_to_clock_t(start_time);
 
+	/*
+	 * Use CFS's precise accounting, if available:
+	 */
+	if (!has_rt_policy(task)) {
+		utime = nsec_to_clock_t(task->sum_exec_runtime);
+		stime = 0;
+	}
+
 	res = sprintf(buffer,"%d (%s) %c %d %d %d %d %d %lu %lu \
 %lu %lu %lu %lu %lu %ld %ld %ld %ld %d 0 %llu %lu %ld %lu %lu %lu %lu %lu \
 %lu %lu %lu %lu %lu %lu %lu %lu %d %d %lu %lu %llu\n",

if you have some spare capacity to improve this code, it could be 
further enhanced by not setting 'stime' to zero, but using the existing 
jiffies based utime/stime statistics as a _ratio_ to split up the 
precise p->sum_exec_runtime. That way we dont have to add precise 
accounting to syscall entry/exit points (that would be quite expensive), 
but still the sum of utime+stime would be very precise. (and that's what 
matters most anyway)

	Ingo


* Re: [patch] CFS scheduler, -v14
  2007-05-24  8:09   ` Ingo Molnar
@ 2007-05-24  9:19     ` Balbir Singh
  2007-05-24 17:25     ` Jeremy Fitzhardinge
  2007-05-25 12:46     ` Ingo Molnar
  2 siblings, 0 replies; 36+ messages in thread
From: Balbir Singh @ 2007-05-24  9:19 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: linux-kernel, Linus Torvalds, Andrew Morton, Mike Galbraith,
	Arjan van de Ven, Thomas Gleixner, pranith-kumar_d, Andi Kleen

Ingo Molnar wrote:
> btw., i think some more consolidation could be done in this area. We've 
> now got the traditional /proc/PID/stat metrics, schedstats, taskstats 
> and delay accounting and with CFS we've got /proc/sched_debug and 
> /proc/PID/sched. There's a fair amount of overlap.
> 

Yes, true. schedstats and delay accounting share code, and taskstats is
a transport mechanism. I'll try to look at /proc/PID/stat, /proc/PID/sched
and /proc/sched_debug.

> btw., CFS does this change to fs/proc/array.c:
> 
> @@ -410,6 +408,14 @@ static int do_task_stat(struct task_stru
>  	/* convert nsec -> ticks */
>  	start_time = nsec_to_clock_t(start_time);
> 
> +	/*
> +	 * Use CFS's precise accounting, if available:
> +	 */
> +	if (!has_rt_policy(task)) {
> +		utime = nsec_to_clock_t(task->sum_exec_runtime);
> +		stime = 0;
> +	}
> +
>  	res = sprintf(buffer,"%d (%s) %c %d %d %d %d %d %lu %lu \
>  %lu %lu %lu %lu %lu %ld %ld %ld %ld %d 0 %llu %lu %ld %lu %lu %lu %lu %lu \
>  %lu %lu %lu %lu %lu %lu %lu %lu %d %d %lu %lu %llu\n",
> 
> if you have some spare capacity to improve this code, it could be 
> further enhanced by not setting 'stime' to zero, but using the existing 
> jiffies based utime/stime statistics as a _ratio_ to split up the 
> precise p->sum_exec_runtime. That way we dont have to add precise 
> accounting to syscall entry/exit points (that would be quite expensive), 
> but still the sum of utime+stime would be very precise. (and that's what 
> matters most anyway)
> 
> 	Ingo

I'll start looking into splitting sum_exec_runtime into utime and stime
based on the ratio already present in the task structure.

-- 
	Warm Regards,
	Balbir Singh
	Linux Technology Center
	IBM, ISTL


* Re: [patch] CFS scheduler, -v14
  2007-05-24  8:09   ` Ingo Molnar
  2007-05-24  9:19     ` Balbir Singh
@ 2007-05-24 17:25     ` Jeremy Fitzhardinge
  2007-05-24 20:59       ` Ingo Molnar
  2007-05-25 12:46     ` Ingo Molnar
  2 siblings, 1 reply; 36+ messages in thread
From: Jeremy Fitzhardinge @ 2007-05-24 17:25 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Balbir Singh, linux-kernel, Linus Torvalds, Andrew Morton,
	Mike Galbraith, Arjan van de Ven, Thomas Gleixner,
	pranith-kumar_d, Andi Kleen

Ingo Molnar wrote:
> nice! I've merged your patch and it built/booted fine so it should show 
> up in -v15. This should also play well with Andi's sched_clock() 
> enhancements in -mm, slated for .23.
>   

BTW, does CFS treat sched_clock as a per-cpu clock, or will it compare
time values of sched_clock()s called on different CPUs?

    J


* Re: [patch] CFS scheduler, -v14
  2007-05-24 17:25     ` Jeremy Fitzhardinge
@ 2007-05-24 20:59       ` Ingo Molnar
  2007-05-24 22:43         ` Jeremy Fitzhardinge
  0 siblings, 1 reply; 36+ messages in thread
From: Ingo Molnar @ 2007-05-24 20:59 UTC (permalink / raw)
  To: Jeremy Fitzhardinge
  Cc: Balbir Singh, linux-kernel, Linus Torvalds, Andrew Morton,
	Mike Galbraith, Arjan van de Ven, Thomas Gleixner,
	pranith-kumar_d, Andi Kleen


* Jeremy Fitzhardinge <jeremy@goop.org> wrote:

> Ingo Molnar wrote:
> > nice! I've merged your patch and it built/booted fine so it should show 
> > up in -v15. This should also play well with Andi's sched_clock() 
> > enhancements in -mm, slated for .23.
> >   
> 
> BTW, does CFS treat sched_clock as a per-cpu clock, or will it compare 
> time values of sched_clock()s called on different CPUs?

it treats it as a per-cpu clock.

	Ingo


* Re: [patch] CFS scheduler, -v14
  2007-05-24 20:59       ` Ingo Molnar
@ 2007-05-24 22:43         ` Jeremy Fitzhardinge
  0 siblings, 0 replies; 36+ messages in thread
From: Jeremy Fitzhardinge @ 2007-05-24 22:43 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Balbir Singh, linux-kernel, Linus Torvalds, Andrew Morton,
	Mike Galbraith, Arjan van de Ven, Thomas Gleixner,
	pranith-kumar_d, Andi Kleen

Ingo Molnar wrote:
> it treats it as a per-cpu clock.
>   

Excellent.  I'd noticed it seems to work pretty well in a Xen guest with
lots of stolen time, but I haven't really evaluated it in detail.

    J


* Re: [patch] CFS scheduler, -v14
  2007-05-24  8:09   ` Ingo Molnar
  2007-05-24  9:19     ` Balbir Singh
  2007-05-24 17:25     ` Jeremy Fitzhardinge
@ 2007-05-25 12:46     ` Ingo Molnar
  2007-05-25 16:45       ` Balbir Singh
  2007-05-29 10:19       ` Balbir Singh
  2 siblings, 2 replies; 36+ messages in thread
From: Ingo Molnar @ 2007-05-25 12:46 UTC (permalink / raw)
  To: Balbir Singh
  Cc: linux-kernel, Linus Torvalds, Andrew Morton, Mike Galbraith,
	Arjan van de Ven, Thomas Gleixner, pranith-kumar_d, Andi Kleen


* Ingo Molnar <mingo@elte.hu> wrote:

> btw., CFS does this change to fs/proc/array.c:
> 
> @@ -410,6 +408,14 @@ static int do_task_stat(struct task_stru
>  	/* convert nsec -> ticks */
>  	start_time = nsec_to_clock_t(start_time);
>  
> +	/*
> +	 * Use CFS's precise accounting, if available:
> +	 */
> +	if (!has_rt_policy(task)) {
> +		utime = nsec_to_clock_t(task->sum_exec_runtime);
> +		stime = 0;
> +	}
> +
>  	res = sprintf(buffer,"%d (%s) %c %d %d %d %d %d %lu %lu \
>  %lu %lu %lu %lu %lu %ld %ld %ld %ld %d 0 %llu %lu %ld %lu %lu %lu %lu %lu \
>  %lu %lu %lu %lu %lu %lu %lu %lu %d %d %lu %lu %llu\n",
> 
> if you have some spare capacity to improve this code, it could be 
> further enhanced by not setting 'stime' to zero, but using the 
> existing jiffies based utime/stime statistics as a _ratio_ to split up 
> the precise p->sum_exec_runtime. That way we dont have to add precise 
> accounting to syscall entry/exit points (that would be quite 
> expensive), but still the sum of utime+stime would be very precise. 
> (and that's what matters most anyway)

i found an accounting bug in this: it didnt sum up threads correctly. 
The patch below fixes this. The stime == 0 problem is still there 
though.

	Ingo

Index: linux/fs/proc/array.c
===================================================================
--- linux.orig/fs/proc/array.c
+++ linux/fs/proc/array.c
@@ -310,6 +310,29 @@ int proc_pid_status(struct task_struct *
 	return buffer - orig;
 }
 
+static clock_t task_utime(struct task_struct *p)
+{
+	/*
+	 * Use CFS's precise accounting, if available:
+	 */
+	if (!has_rt_policy(p) && !(sysctl_sched_load_smoothing & 128))
+		return nsec_to_clock_t(p->sum_exec_runtime);
+
+	return cputime_to_clock_t(p->utime);
+}
+
+static clock_t task_stime(struct task_struct *p)
+{
+	/*
+	 * Use CFS's precise accounting, if available:
+	 */
+	if (!has_rt_policy(p) && !(sysctl_sched_load_smoothing & 128))
+		return 0;
+
+	return cputime_to_clock_t(p->stime);
+}
+
+
 static int do_task_stat(struct task_struct *task, char * buffer, int whole)
 {
 	unsigned long vsize, eip, esp, wchan = ~0UL;
@@ -324,7 +347,8 @@ static int do_task_stat(struct task_stru
 	unsigned long long start_time;
 	unsigned long cmin_flt = 0, cmaj_flt = 0;
 	unsigned long  min_flt = 0,  maj_flt = 0;
-	cputime_t cutime, cstime, utime, stime;
+	cputime_t cutime, cstime;
+	clock_t utime, stime;
 	unsigned long rsslim = 0;
 	char tcomm[sizeof(task->comm)];
 	unsigned long flags;
@@ -342,7 +366,8 @@ static int do_task_stat(struct task_stru
 
 	sigemptyset(&sigign);
 	sigemptyset(&sigcatch);
-	cutime = cstime = utime = stime = cputime_zero;
+	cutime = cstime = cputime_zero;
+	utime = stime = 0;
 
 	rcu_read_lock();
 	if (lock_task_sighand(task, &flags)) {
@@ -368,15 +393,15 @@ static int do_task_stat(struct task_stru
 			do {
 				min_flt += t->min_flt;
 				maj_flt += t->maj_flt;
-				utime = cputime_add(utime, t->utime);
-				stime = cputime_add(stime, t->stime);
+				utime += task_utime(t);
+				stime += task_stime(t);
 				t = next_thread(t);
 			} while (t != task);
 
 			min_flt += sig->min_flt;
 			maj_flt += sig->maj_flt;
-			utime = cputime_add(utime, sig->utime);
-			stime = cputime_add(stime, sig->stime);
+			utime += cputime_to_clock_t(sig->utime);
+			stime += cputime_to_clock_t(sig->stime);
 		}
 
 		sid = signal_session(sig);
@@ -392,8 +417,8 @@ static int do_task_stat(struct task_stru
 	if (!whole) {
 		min_flt = task->min_flt;
 		maj_flt = task->maj_flt;
-		utime = task->utime;
-		stime = task->stime;
+		utime = task_utime(task);
+		stime = task_stime(task);
 	}
 
 	/* scale priority and nice values from timeslices to -20..20 */
@@ -408,14 +433,6 @@ static int do_task_stat(struct task_stru
 	/* convert nsec -> ticks */
 	start_time = nsec_to_clock_t(start_time);
 
-	/*
-	 * Use CFS's precise accounting, if available:
-	 */
-	if (!has_rt_policy(task)) {
-		utime = nsec_to_clock_t(task->sum_exec_runtime);
-		stime = 0;
-	}
-
 	res = sprintf(buffer,"%d (%s) %c %d %d %d %d %d %lu %lu \
 %lu %lu %lu %lu %lu %ld %ld %ld %ld %d 0 %llu %lu %ld %lu %lu %lu %lu %lu \
 %lu %lu %lu %lu %lu %lu %lu %lu %d %d %lu %lu %llu\n",


* Re: [patch] CFS scheduler, -v14
  2007-05-25 12:46     ` Ingo Molnar
@ 2007-05-25 16:45       ` Balbir Singh
  2007-05-28 11:07         ` Ingo Molnar
  2007-05-29 10:19       ` Balbir Singh
  1 sibling, 1 reply; 36+ messages in thread
From: Balbir Singh @ 2007-05-25 16:45 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: linux-kernel, Linus Torvalds, Andrew Morton, Mike Galbraith,
	Arjan van de Ven, Thomas Gleixner, pranith-kumar_d, Andi Kleen

Ingo Molnar wrote:
> i found an accounting bug in this: it didnt sum up threads correctly. 
> The patch below fixes this. The stime == 0 problem is still there 
> though.
> 
> 	Ingo
> 

Thanks! I'll test the code on Monday. I do not understand the
sysctl_sched_smoothing functionality, so I do not understand
its impact on accounting. I'll take a look more closely.

-- 
	Warm Regards,
	Balbir Singh
	Linux Technology Center
	IBM, ISTL


* Re: [patch] CFS scheduler, -v14
  2007-05-23 12:06 [patch] CFS scheduler, -v14 Ingo Molnar
  2007-05-23 19:39 ` Nicolas Mailhot
  2007-05-24  6:42 ` Balbir Singh
@ 2007-05-26 14:58 ` S.Çağlar Onur
  2007-05-26 15:08   ` S.Çağlar Onur
  2007-06-01 13:35   ` S.Çağlar Onur
  2007-05-27  2:49 ` Li Yu
  2007-05-28  1:17 ` Li Yu
  4 siblings, 2 replies; 36+ messages in thread
From: S.Çağlar Onur @ 2007-05-26 14:58 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: linux-kernel, Linus Torvalds, Andrew Morton, Mike Galbraith,
	Arjan van de Ven, Thomas Gleixner, pranith-kumar_d


Hi Ingo;

On Wednesday 23 May 2007, Ingo Molnar wrote:
> As usual, any sort of feedback, bugreport, fix and suggestion is more
> than welcome!

I have another kaffeine [0.8.4]/xine-lib [1.1.6] problem with CFS for you :)

Under load (compiling any Qt app or the kernel with -j1 or -j2), audio 
always stays in sync with time (and I'm sure it never skips), but video 
starts to slow down and loses its sync with audio (e.g. at the 10th 
second of a movie the audio is also at the 10th second, but the video 
shown is from the 7th second).

After some time the video suddenly tries to catch up with the audio and 
plays really fast (like fast-forward) until it is back in sync. But it 
loses audio/video sync again after a while, and the loop continues like 
that.

I also reproduced this behaviour with CFS-13. I'm not sure it is 
reproducible with mainline, because for a long time I have only used CFS 
(but I'm pretty sure the problem did not exist, or at least did not hit 
me, with CFS-1 to CFS-11). It is only reproducible under some load, and 
mplayer plays the same video without losing audio/video sync under the 
same load.

Cheers
-- 
S.Çağlar Onur <caglar@pardus.org.tr>
http://cekirdek.pardus.org.tr/~caglar/

Linux is like living in a teepee. No Windows, no Gates and an Apache in house!



* Re: [patch] CFS scheduler, -v14
  2007-05-26 14:58 ` S.Çağlar Onur
@ 2007-05-26 15:08   ` S.Çağlar Onur
  2007-06-01 13:35   ` S.Çağlar Onur
  1 sibling, 0 replies; 36+ messages in thread
From: S.Çağlar Onur @ 2007-05-26 15:08 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: linux-kernel, Linus Torvalds, Andrew Morton, Mike Galbraith,
	Arjan van de Ven, Thomas Gleixner, pranith-kumar_d

On Saturday 26 May 2007, S.Çağlar Onur wrote:
> On Wednesday 23 May 2007, Ingo Molnar wrote:
> > As usual, any sort of feedback, bugreport, fix and suggestion is more
> > than welcome!
>
> I have another kaffeine [0.8.4]/xine-lib [1.1.6] problem with CFS for
> you :)
>
> Under load (compiling any Qt app or the kernel with -j1 or -j2), audio
> always stays in sync with time (and I'm sure it never skips), but video
> starts to slow down and loses its sync with audio (e.g. at the 10th
> second of a movie the audio is also at the 10th second, but the video
> shown is from the 7th second).
>
> After some time the video suddenly tries to catch up with the audio and
> plays really fast (like fast-forward) until it is back in sync. But it
> loses audio/video sync again after a while, and the loop continues like
> that.
>
> I also reproduced this behaviour with CFS-13. I'm not sure it is
> reproducible with mainline, because for a long time I have only used
> CFS (but I'm pretty sure the problem did not exist, or at least did not
> hit me, with CFS-1 to CFS-11). It is only reproducible under some load,
> and mplayer plays the same video without losing audio/video sync under
> the same load.

Ah, I forgot to add that you can find the "strace -o kaffine.log -f -tttTTT 
kaffeine" output and the ps output from while the problem was occurring at [1].

[1] http://cekirdek.pardus.org.tr/~caglar/kaffeine/
-- 
S.Çağlar Onur <caglar@pardus.org.tr>
http://cekirdek.pardus.org.tr/~caglar/

Linux is like living in a teepee. No Windows, no Gates and an Apache in house!



* Re: [patch] CFS scheduler, -v14
  2007-05-23 12:06 [patch] CFS scheduler, -v14 Ingo Molnar
                   ` (2 preceding siblings ...)
  2007-05-26 14:58 ` S.Çağlar Onur
@ 2007-05-27  2:49 ` Li Yu
  2007-05-29  6:15   ` Ingo Molnar
  2007-05-28  1:17 ` Li Yu
  4 siblings, 1 reply; 36+ messages in thread
From: Li Yu @ 2007-05-27  2:49 UTC (permalink / raw)
  To: Ingo Molnar; +Cc: linux-kernel

Ingo Molnar wrote:
> i'm pleased to announce release -v14 of the CFS scheduler patchset.
>
> The CFS patch against v2.6.22-rc2, v2.6.21.1 or v2.6.20.10 can be 
> downloaded from the usual place:
>    
>       http://people.redhat.com/mingo/cfs-scheduler/
>   
I tried this on 2.6.21.1. Good work!

I have a question from reading this patch: in update_stats_enqueue(), it 
seems that the statements in the two branches of "if (p->load_weight > 
NICE_0_LOAD)" are the same. Is that on purpose?

Good luck

- Li Yu



* Re: [patch] CFS scheduler, -v14
  2007-05-23 12:06 [patch] CFS scheduler, -v14 Ingo Molnar
                   ` (3 preceding siblings ...)
  2007-05-27  2:49 ` Li Yu
@ 2007-05-28  1:17 ` Li Yu
  2007-05-29  0:49   ` Li Yu
  4 siblings, 1 reply; 36+ messages in thread
From: Li Yu @ 2007-05-28  1:17 UTC (permalink / raw)
  To: Ingo Molnar; +Cc: linux-kernel

Ingo Molnar wrote:
> i'm pleased to announce release -v14 of the CFS scheduler patchset.
>
> The CFS patch against v2.6.22-rc2, v2.6.21.1 or v2.6.20.10 can be 
> downloaded from the usual place:
>    
>       http://people.redhat.com/mingo/cfs-scheduler/
>   

In the comment before distribute_fair_add(), we have this text:

/*
  * A task gets added back to the runnable tasks and gets
  * a small credit for the CPU time it missed out on while
  * it slept, so fix up all other runnable task's wait_runtime
  * so that the sum stays constant (around 0).
  *
[snip]
  */

But observing cat /proc/sched_debug (2.6.21.1, UP, RHEL4), I found that 
the wait fields are often all greater than zero, or all less than zero.

IMHO, the sum of task_struct->wait_runtime is, in some ways, just the 
denominator of all runnable time. Is that right? If so, increasing the 
sum of wait_runtime would just make the scheduling decision more 
precise, so what is the point of keeping wait_runtime zero-sum?

Good luck

- Li Yu


* Re: [patch] CFS scheduler, -v14
  2007-05-25 16:45       ` Balbir Singh
@ 2007-05-28 11:07         ` Ingo Molnar
  2007-05-29 10:23           ` Balbir Singh
  0 siblings, 1 reply; 36+ messages in thread
From: Ingo Molnar @ 2007-05-28 11:07 UTC (permalink / raw)
  To: Balbir Singh
  Cc: linux-kernel, Linus Torvalds, Andrew Morton, Mike Galbraith,
	Arjan van de Ven, Thomas Gleixner, pranith-kumar_d, Andi Kleen


* Balbir Singh <balbir@linux.vnet.ibm.com> wrote:

> Ingo Molnar wrote:
> > i found an accounting bug in this: it didnt sum up threads correctly. 
> > The patch below fixes this. The stime == 0 problem is still there 
> > though.
> > 
> > 	Ingo
> > 
> 
> Thanks! I'll test the code on Monday. I do not understand the 
> sysctl_sched_smoothing functionality, so I do not understand its 
> impact on accounting. I'll take a closer look.

basically sysctl_sched_smoothing is more of an 'experimental features 
flag' kind of thing. I'll remove it soon; you should only need to 
concentrate on the functionality that it enables by default.

	Ingo

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [patch] CFS scheduler, -v14
  2007-05-28  1:17 ` Li Yu
@ 2007-05-29  0:49   ` Li Yu
  0 siblings, 0 replies; 36+ messages in thread
From: Li Yu @ 2007-05-29  0:49 UTC (permalink / raw)
  To: Ingo Molnar; +Cc: linux-kernel

Li Yu wrote:
>
> But as I observed via cat /proc/sched_debug (2.6.21.1, UP, RHEL4), I 
> found that the wait fields are often all greater than zero, or all 
> less than zero.
>
> IMHO, the sum of task_struct->wait_runtime is, in a sense, just the 
> denominator of all runnable time. Is that right? If so, increasing the 
> sum of wait_runtime would only make the scheduling decision more 
> precise, so what is the point of keeping wait_runtime zero-sum?
>
Please forget it; I was wrong here. Sorry for the noise.

Good luck

- Li Yu


^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [patch] CFS scheduler, -v14
  2007-05-27  2:49 ` Li Yu
@ 2007-05-29  6:15   ` Ingo Molnar
  2007-05-29  8:07     ` Ingo Molnar
  0 siblings, 1 reply; 36+ messages in thread
From: Ingo Molnar @ 2007-05-29  6:15 UTC (permalink / raw)
  To: Li Yu; +Cc: linux-kernel


* Li Yu <raise.sail@gmail.com> wrote:

> Ingo Molnar wrote:
> >i'm pleased to announce release -v14 of the CFS scheduler patchset.
> >
> >The CFS patch against v2.6.22-rc2, v2.6.21.1 or v2.6.20.10 can be 
> >downloaded from the usual place:
> >   
> >      http://people.redhat.com/mingo/cfs-scheduler/
> >  
> I tried this on 2.6.21.1. Good work!

thanks :)

> [...] in update_stats_enqueue(), it seems that the statements in the 
> two branches of "if (p->load_weight > NICE_0_LOAD)" are the same. Is 
> this on purpose?

what do you mean?

	Ingo

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [patch] CFS scheduler, -v14
  2007-05-29  6:15   ` Ingo Molnar
@ 2007-05-29  8:07     ` Ingo Molnar
  2007-05-31  9:45       ` Li Yu
  0 siblings, 1 reply; 36+ messages in thread
From: Ingo Molnar @ 2007-05-29  8:07 UTC (permalink / raw)
  To: Li Yu; +Cc: linux-kernel


* Ingo Molnar <mingo@elte.hu> wrote:

> > [...] in update_stats_enqueue(), it seems that the statements in 
> > the two branches of "if (p->load_weight > NICE_0_LOAD)" are the 
> > same. Is this on purpose?
> 
> what do you mean?

you are right indeed. Mike Galbraith has sent a cleanup patch that 
removes that duplication (and uses div64_s()).

	Ingo

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [patch] CFS scheduler, -v14
  2007-05-25 12:46     ` Ingo Molnar
  2007-05-25 16:45       ` Balbir Singh
@ 2007-05-29 10:19       ` Balbir Singh
  1 sibling, 0 replies; 36+ messages in thread
From: Balbir Singh @ 2007-05-29 10:19 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: linux-kernel, Linus Torvalds, Andrew Morton, Mike Galbraith,
	Arjan van de Ven, Thomas Gleixner, pranith-kumar_d, Andi Kleen

Hi, Ingo,

> +static clock_t task_utime(struct task_struct *p)
> +{
> +	/*
> +	 * Use CFS's precise accounting, if available:
> +	 */
> +	if (!has_rt_policy(p) && !(sysctl_sched_load_smoothing & 128))
> +		return nsec_to_clock_t(p->sum_exec_runtime);


I wonder if this leads to data truncation: p->sum_exec_runtime is
unsigned long long and clock_t is long (on all architectures, from what
my cscope shows me). I have my other patch ready on top of this; I'll
post it out soon.

> +
> +	return cputime_to_clock_t(p->utime);
> +}
> +

-- 
	Warm Regards,
	Balbir Singh
	Linux Technology Center
	IBM, ISTL

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [patch] CFS scheduler, -v14
  2007-05-28 11:07         ` Ingo Molnar
@ 2007-05-29 10:23           ` Balbir Singh
  2007-06-05  7:57             ` Ingo Molnar
  0 siblings, 1 reply; 36+ messages in thread
From: Balbir Singh @ 2007-05-29 10:23 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: linux-kernel, Linus Torvalds, Andrew Morton, Mike Galbraith,
	Arjan van de Ven, Thomas Gleixner, pranith-kumar_d, Andi Kleen

On Mon, May 28, 2007 at 01:07:48PM +0200, Ingo Molnar wrote:
> 
> * Balbir Singh <balbir@linux.vnet.ibm.com> wrote:
> 
> > Ingo Molnar wrote:
> > > i found an accounting bug in this: it didnt sum up threads correctly. 
> > > The patch below fixes this. The stime == 0 problem is still there 
> > > though.
> > > 
> > > 	Ingo
> > > 
> > 
> > Thanks! I'll test the code on Monday. I do not understand the 
> > sysctl_sched_smoothing functionality, so I do not understand its 
> > impact on accounting. I'll take a closer look.
> 
> basically sysctl_sched_smoothing is more of an 'experimental features 
> flag' kind of thing. I'll remove it soon; you should only need to 
> concentrate on the functionality that it enables by default.
> 
> 	Ingo

Hi, Ingo,

I hope this patch addresses the stime == 0 problem.

This patch improves accounting in the CFS scheduler. We have the 
executed run time in the sum_exec_runtime field of the task. This patch 
splits sum_exec_runtime in the ratio of task->utime and task->stime to 
obtain the user and system time of the task.
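The splitting described above can be sketched with made-up numbers (an 
illustration of the ratio idea only, not the patch itself):

```python
def split_exec_time(total_time, tu_time, ts_time):
    # Split the precise CFS runtime (total_time) in the ratio of the
    # tick-sampled utime/stime counters, using integer clock_t math.
    sum_us_time = tu_time + ts_time
    if not sum_us_time:
        return total_time, 0
    utime = (tu_time * total_time) // sum_us_time
    stime = (ts_time * total_time) // sum_us_time
    return utime, stime

print(split_exec_time(1000, 1, 3))   # (250, 750)
```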

TODO's:

1. Migrate getrusage() to use sum_exec_runtime so that the output in /proc
   is consistent with the data reported by time(1).


Signed-off-by: Balbir Singh <balbir@linux.vnet.ibm.com>
---

 fs/proc/array.c |   27 ++++++++++++++++++++++++---
 linux/sched.h   |    0 
 2 files changed, 24 insertions(+), 3 deletions(-)

diff -puN fs/proc/array.c~cfs-distribute-accounting fs/proc/array.c
--- linux-2.6.22-rc2/fs/proc/array.c~cfs-distribute-accounting	2007-05-29 13:47:47.000000000 +0530
+++ linux-2.6.22-rc2-balbir/fs/proc/array.c	2007-05-29 15:35:22.000000000 +0530
@@ -332,7 +332,6 @@ static clock_t task_stime(struct task_st
 	return cputime_to_clock_t(p->stime);
 }
 
-
 static int do_task_stat(struct task_struct *task, char * buffer, int whole)
 {
 	unsigned long vsize, eip, esp, wchan = ~0UL;
@@ -400,8 +399,13 @@ static int do_task_stat(struct task_stru
 
 			min_flt += sig->min_flt;
 			maj_flt += sig->maj_flt;
-			utime += cputime_to_clock_t(sig->utime);
-			stime += cputime_to_clock_t(sig->stime);
+			if (!has_rt_policy(t))
+				utime += nsec_to_clock_t(
+						sig->sum_sched_runtime);
+			else {
+				utime += cputime_to_clock_t(sig->utime);
+				stime += cputime_to_clock_t(sig->stime);
+			}
 		}
 
 		sid = signal_session(sig);
@@ -421,6 +425,23 @@ static int do_task_stat(struct task_stru
 		stime = task_stime(task);
 	}
 
+	if (!has_rt_policy(task)) {
+		clock_t sum_us_time = utime + stime;
+		clock_t tu_time = cputime_to_clock_t(task->utime);
+		clock_t ts_time = cputime_to_clock_t(task->stime);
+		clock_t total_time = utime;
+
+		/*
+		 * Split up sched_exec_time according to the utime and
+		 * stime ratio. At this point utime contains the summed
+		 * sched_exec_runtime and stime is zero
+		 */
+		if (sum_us_time) {
+			utime = ((tu_time * total_time) / sum_us_time);
+			stime = ((ts_time * total_time) / sum_us_time);
+		}
+	}
+
 	/* scale priority and nice values from timeslices to -20..20 */
 	/* to make it look like a "normal" Unix priority/nice value  */
 	priority = task_prio(task);
diff -puN kernel/sys.c~cfs-distribute-accounting kernel/sys.c
diff -puN include/linux/sched.h~cfs-distribute-accounting include/linux/sched.h
_

-- 
	Warm Regards,
	Balbir Singh
	Linux Technology Center
	IBM, ISTL

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [patch] CFS scheduler, -v14
  2007-05-29  8:07     ` Ingo Molnar
@ 2007-05-31  9:45       ` Li Yu
  2007-05-31  9:53         ` Ingo Molnar
  0 siblings, 1 reply; 36+ messages in thread
From: Li Yu @ 2007-05-31  9:45 UTC (permalink / raw)
  To: Ingo Molnar; +Cc: linux-kernel


static void distribute_fair_add(struct rq *rq, s64 delta)
{
    struct task_struct *curr = rq->curr;
    s64 delta_fair = 0;

    if (!(sysctl_sched_load_smoothing & 32))
        return;

    if (rq->nr_running) {
        delta_fair = div64_s(delta, rq->nr_running);
        /*
         * The currently running task's next wait_runtime value does
         * not depend on the fair_clock, so fix it up explicitly:
         */
        add_wait_runtime(rq, curr, -delta_fair);
        rq->fair_clock -= delta_fair;
    }
}

See this line:

        delta_fair = div64_s(delta, rq->nr_running);

Ingo, should we replace "rq->nr_running" with "rq->raw_load_weight" here?

Good luck
- Li Yu





^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [patch] CFS scheduler, -v14
  2007-05-31  9:45       ` Li Yu
@ 2007-05-31  9:53         ` Ingo Molnar
  2007-06-01  7:16           ` Li Yu
  0 siblings, 1 reply; 36+ messages in thread
From: Ingo Molnar @ 2007-05-31  9:53 UTC (permalink / raw)
  To: Li Yu; +Cc: linux-kernel


* Li Yu <raise.sail@gmail.com> wrote:

> static void distribute_fair_add(struct rq *rq, s64 delta)
> {
>    struct task_struct *curr = rq->curr;
>    s64 delta_fair = 0;
> 
>    if (!(sysctl_sched_load_smoothing & 32))
>        return;
> 
>    if (rq->nr_running) {
>        delta_fair = div64_s(delta, rq->nr_running);
>        /*
>         * The currently running task's next wait_runtime value does
>         * not depend on the fair_clock, so fix it up explicitly:
>         */
>        add_wait_runtime(rq, curr, -delta_fair);
>        rq->fair_clock -= delta_fair;
>    }
> }
> 
> See this line:
> 
>        delta_fair = div64_s(delta, rq->nr_running);
> 
> Ingo, should we replace "rq->nr_running" with "rq->raw_load_weight" 
> here?

that would break the code. The handling of sleep periods is basically 
a heuristic, and using nr_running here appears to be 'good enough' in 
practice.
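A tiny numeric sketch (made-up values) of why the divisor matters here: 
delta is in time units, so dividing by the load-unit sum would shrink 
the per-task share to almost nothing.

```python
delta = 6000                    # sleep credit to spread, in time units
nr_running = 3                  # three runnable nice-0 tasks
raw_load_weight = 3 * 1024      # the same tasks, in load units

per_task_by_count = delta // nr_running        # 2000: a sane share
per_task_by_weight = delta // raw_load_weight  # 1: wrong units, ~nothing
print(per_task_by_count, per_task_by_weight)
```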

	Ingo

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [patch] CFS scheduler, -v14
  2007-05-31  9:53         ` Ingo Molnar
@ 2007-06-01  7:16           ` Li Yu
  2007-06-01 19:21             ` Ingo Molnar
  0 siblings, 1 reply; 36+ messages in thread
From: Li Yu @ 2007-06-01  7:16 UTC (permalink / raw)
  To: Ingo Molnar; +Cc: linux-kernel

Ingo Molnar wrote:
> * Li Yu <raise.sail@gmail.com> wrote:
>
>   
>> static void distribute_fair_add(struct rq *rq, s64 delta)
>> {
>>    struct task_struct *curr = rq->curr;
>>    s64 delta_fair = 0;
>>
>>    if (!(sysctl_sched_load_smoothing & 32))
>>        return;
>>
>>    if (rq->nr_running) {
>>        delta_fair = div64_s(delta, rq->nr_running);
>>        /*
>>         * The currently running task's next wait_runtime value does
>>         * not depend on the fair_clock, so fix it up explicitly:
>>         */
>>        add_wait_runtime(rq, curr, -delta_fair);
>>        rq->fair_clock -= delta_fair;
>>    }
>> }
>>
>> See this line:
>>
>>        delta_fair = div64_s(delta, rq->nr_running);
>>
>> Ingo, should we replace "rq->nr_running" with "rq->raw_load_weight" 
>> here?
>>     
>
> that would break the code. The handling of sleep periods is basically 
> a heuristic, and using nr_running here appears to be 'good enough' in 
> practice.
>
>   
Thanks. I was wrong; I did not see that the delta variable is in 
virtual time units. If the code did what I suggested, delta_fair could 
be too small to be meaningful.

Also, I want to know the real meaning of

    add_wait_runtime(rq, curr, delta_mine - delta_exec);

in update_curr(). IMHO, it should be

    add_wait_runtime(rq, curr, delta_mine - delta_fair);

Is this just another heuristic, or is my opinion wrong again? :-)

Good luck.

- Li Yu






^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [patch] CFS scheduler, -v14
  2007-05-26 14:58 ` S.Çağlar Onur
  2007-05-26 15:08   ` S.Çağlar Onur
@ 2007-06-01 13:35   ` S.Çağlar Onur
  2007-06-01 15:31     ` Linus Torvalds
  2007-06-01 15:37     ` [OT] " Andreas Mohr
  1 sibling, 2 replies; 36+ messages in thread
From: S.Çağlar Onur @ 2007-06-01 13:35 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: linux-kernel, Linus Torvalds, Andrew Morton, Mike Galbraith,
	Arjan van de Ven, Thomas Gleixner, pranith-kumar_d

[-- Attachment #1: Type: text/plain, Size: 2615 bytes --]

Hi;

On Saturday, 26 May 2007, S.Çağlar Onur wrote:
> Under load (compiling any Qt app or the kernel with -j1 or -j2) the 
> audio always stays in sync with time (and I'm sure it never skips), 
> but the video starts to slow down and loses its sync with the audio 
> (e.g. at the 10th second of a movie the audio is also at the 10th 
> second, but the video shown is from the 7th second).
>
> After some time the video suddenly tries to re-sync with the audio and 
> plays really fast (like fast-forward) until it catches up, but it 
> loses audio/video sync again after a while, and the loop continues.

After a lot of private mail traffic and debugging effort with Ingo, 
yesterday I simply asked him to ignore this problem (at least until I 
can reproduce the same thing on different machines).

Yesterday I went back to vanilla 2.6.18.8 to see how it behaves, and I 
reproduced the problem there even under lower loads.

It seems this piece of hardware is dying [for a while now my laptop has 
been powering off suddenly, without any log/error etc.], and I think 
all these problems are caused by that. At the very least, for me this 
laptop (Sony Vaio FS-215B) is not a stable test bed for this kind of 
human-involved testing.

Ingo cannot reproduce the same audio/video out-of-sync problems with 
his setups, and currently I am the only person seeing this problem.

Also, for some boots the kernel reports the wrong frequency for my CPU 
(notice the timing difference below); this may be an overheating 
problem, but I will also try disabling CONFIG_NO_HZ as Ingo suggested:

...
[    0.000000] Initializing CPU#0
[    0.000000] PID hash table entries: 4096 (order: 12, 16384 bytes)
[    0.000000] Detected 897.591 MHz processor.
[   13.142654] Console: colour dummy device 80x25
[   13.143609] Dentry cache hash table entries: 131072 (order: 7, 524288 
 bytes)
[   13.144530] Inode-cache hash table entries: 65536 (order: 6, 262144 bytes)
...

...
[    0.000000] Initializing CPU#0
[    0.000000] PID hash table entries: 4096 (order: 12, 16384 bytes)
[    0.000000] Detected 1729.292 MHz processor.
[    8.286228] Console: colour dummy device 80x25
[    8.286650] Dentry cache hash table entries: 131072 (order: 7, 524288 
bytes)
[    8.287058] Inode-cache hash table entries: 65536 (order: 6, 262144 bytes)
...

As a result, please ignore this problem until I can reproduce it on 
different machines or someone else reports the same problem :)

Cheers
-- 
S.Çağlar Onur <caglar@pardus.org.tr>
http://cekirdek.pardus.org.tr/~caglar/

Linux is like living in a teepee. No Windows, no Gates and an Apache in house!

[-- Attachment #2: This is a digitally signed message part. --]
[-- Type: application/pgp-signature, Size: 189 bytes --]

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [patch] CFS scheduler, -v14
  2007-06-01 13:35   ` S.Çağlar Onur
@ 2007-06-01 15:31     ` Linus Torvalds
  2007-06-07 22:29       ` S.Çağlar Onur
  2007-06-01 15:37     ` [OT] " Andreas Mohr
  1 sibling, 1 reply; 36+ messages in thread
From: Linus Torvalds @ 2007-06-01 15:31 UTC (permalink / raw)
  To: S.Çağlar Onur
  Cc: Ingo Molnar, linux-kernel, Andrew Morton, Mike Galbraith,
	Arjan van de Ven, Thomas Gleixner, pranith-kumar_d



On Fri, 1 Jun 2007, S.Çağlar Onur wrote:
> 
> It seems this piece of hardware is dying [for a while now my laptop has 
> been powering off suddenly, without any log/error etc.], and I think all 
> these problems are caused by that. At the very least, for me this laptop 
> (Sony Vaio FS-215B) is not a stable test bed for this kind of 
> human-involved testing.

Has it been hot where you are lately? Is your fan working? 

Hardware that acts up under load is quite often thermal-related, 
especially if it starts happening during summer and didn't happen before 
that... ESPECIALLY the kinds of behaviours you see: the "sudden power-off" 
is the normal behaviour for a CPU that trips a critical overheating point, 
and the slowdown is also one normal response to overheating (CPU 
throttling).

		Linus

^ permalink raw reply	[flat|nested] 36+ messages in thread

* [OT] Re: [patch] CFS scheduler, -v14
  2007-06-01 13:35   ` S.Çağlar Onur
  2007-06-01 15:31     ` Linus Torvalds
@ 2007-06-01 15:37     ` Andreas Mohr
  1 sibling, 0 replies; 36+ messages in thread
From: Andreas Mohr @ 2007-06-01 15:37 UTC (permalink / raw)
  To: S.Çağlar Onur; +Cc: linux-kernel

[OT, thus removed private addresses]

Hi,

On Fri, Jun 01, 2007 at 04:35:02PM +0300, S.Çağlar Onur wrote:
> It seems this piece of hardware is dying [for a while now my laptop has 
> been powering off suddenly, without any log/error etc.], and I think all 
> these problems are caused by that. At the very least, for me this laptop 
> (Sony Vaio FS-215B) is not a stable test bed for this kind of 
> human-involved testing.

Socketed CPU?

It *might* be an idea to reseat it; maybe it's simply insufficient 
seating of the CPU due to rougher travel handling than a desktop gets. 
(CPU socket issues can easily be the cause on some notebooks AFAIK, and 
they were on mine: an Inspiron 8000 I bought as completely dead, until I 
simply fiddled with the CPU socket...)

Andreas Mohr

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [patch] CFS scheduler, -v14
  2007-06-01  7:16           ` Li Yu
@ 2007-06-01 19:21             ` Ingo Molnar
  2007-06-05  2:33               ` Li Yu
  2007-06-05  3:35               ` Li Yu
  0 siblings, 2 replies; 36+ messages in thread
From: Ingo Molnar @ 2007-06-01 19:21 UTC (permalink / raw)
  To: Li Yu; +Cc: linux-kernel


* Li Yu <raise.sail@gmail.com> wrote:

> Also, I want to know the real meaning of
> 
>    add_wait_runtime(rq, curr, delta_mine - delta_exec);
> 
> in update_curr(), IMHO, it should be
> 
>    add_wait_runtime(rq, curr, delta_mine - delta_fair);
> 
> Is this just another heuristic, or is my opinion wrong again? :-)

well, ->wait_runtime is in real time units. If a task executes 
delta_exec time on the CPU, we deduct "-delta_exec" 1:1. But during that 
time the task also got entitled to a bit more CPU time, that is 
+delta_mine. The calculation above expresses this. I'm not sure what 
sense '-delta_fair' would make - "delta_fair" is the amount of time a 
nice-0 task would be entitled to - but this task might not be a nice-0 
task. Furthermore, even for a nice-0 task why deduct -delta_fair - it 
spent delta_exec on the CPU.
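Numerically (an illustrative sketch with made-up weights, not the 
kernel's exact code):

```python
def wait_runtime_delta(load_weight, rq_weight, delta_exec):
    # The task consumed delta_exec of real time, but during that time
    # it also became entitled to its weighted share, delta_mine; the
    # net change to ->wait_runtime is the difference.
    delta_mine = delta_exec * load_weight // rq_weight
    return delta_mine - delta_exec

# Two equal-weight tasks: the runner earned only half of what it
# consumed, so its wait_runtime drops by delta_exec/2.
print(wait_runtime_delta(1024, 2048, 1000))   # -500
```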

	Ingo

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [patch] CFS scheduler, -v14
  2007-06-01 19:21             ` Ingo Molnar
@ 2007-06-05  2:33               ` Li Yu
  2007-06-05  8:01                 ` Ingo Molnar
  2007-06-05  3:35               ` Li Yu
  1 sibling, 1 reply; 36+ messages in thread
From: Li Yu @ 2007-06-05  2:33 UTC (permalink / raw)
  To: Ingo Molnar, LKML


Ingo Molnar wrote:
> * Li Yu <raise.sail@gmail.com> wrote:
>
>   
>> Also, I want to know the real meaning of
>>
>>    add_wait_runtime(rq, curr, delta_mine - delta_exec);
>>
>> in update_curr(), IMHO, it should be
>>
>>    add_wait_runtime(rq, curr, delta_mine - delta_fair);
>>
>> Is this just another heuristic, or is my opinion wrong again? :-)
>>     
>
> well, ->wait_runtime is in real time units. If a task executes 
> delta_exec time on the CPU, we deduct "-delta_exec" 1:1. But during that 
> time the task also got entitled to a bit more CPU time, that is 
> +delta_mine. The calculation above expresses this. I'm not sure what 
> sense '-delta_fair' would make - "delta_fair" is the amount of time a 
> nice-0 task would be entitled to - but this task might not be a nice-0 
> task. Furthermore, even for a nice-0 task why deduct -delta_fair - it 
> spent delta_exec on the CPU.
> 

Eh, I was wrong again; I even ran an experiment last weekend, and this 
idea is really bad! ;(

I think the innermost source of my being wrong again and again is a 
misunderstanding of virtual time. To understand this better, I tried to 
write a Python script to simulate CFS behavior. However, it does not 
achieve the fairness I want. I am really confused here.

Would you help me point out what is wrong with it? Any suggestion is 
welcome. Thanks in advance.


#! /usr/bin/python

# htucfs.py - Hard-To-Understand-CFS.py ;)
# Written by Li Yu / 20070604

#
# only support static load on UP.
#


# Usage:
#    ./htucfs.py nr_clock_ticks_to_run
#

import sys

class task_struct:
    def __init__(self, name, load_weight):
        self.name = name
        self.wait_runtime = 0
        self.fair_clock = 0
        self.fair_key = 0
        self.load_weight = float(load_weight)
    def __repr__(self):
        return "%s/C%.2f" % (self.name, self.fair_clock)

idle_task = task_struct("idle", 0)

class run_queue:
    def __init__(self):
        self.raw_weighted_load = 0
        self.wall_clock = 0
        self.fair_clock = 0
        self.ready_queue = {}
        self.run_history = []
        self.task_list = []
        self.curr = None
        self.debug = 0

    def snapshot(self):
        if self.debug:
            print "%.2f" % self.fair_clock, self.ready_queue, self.curr

    def enqueue(self, task):
        task.fair_key = self.fair_clock-task.wait_runtime
        task.fair_key = int(100 * task.fair_key)
        if not self.ready_queue.get(task.fair_key):
            self.ready_queue[task.fair_key] = [task]
        else:
            # keep FIFO for same fair_key tasks.
            self.ready_queue[task.fair_key].append(task)
        self.raw_weighted_load += task.load_weight
        self.task_list.append(task)

    def dequeue(self, task):
        self.raw_weighted_load -= task.load_weight
        self.ready_queue[task.fair_key].remove(task)
        if not self.ready_queue[task.fair_key]:
            del self.ready_queue[task.fair_key]
        self.task_list.remove(task)

    def other_wait_runtime(self):
        for task in self.task_list:
            self.dequeue(task)
            task.wait_runtime += 1
            self.enqueue(task)

    def clock_tick(self):
        # clock_tick = 1.0
        self.fair_clock += 1.0/self.raw_weighted_load
        # delta_exec = 1.0
        delta_mine = self.curr.load_weight / self.raw_weighted_load
        self.curr.wait_runtime += (delta_mine-1.0)
        self.curr.fair_clock += 1.0/self.curr.load_weight
        self.dequeue(self.curr)
        self.other_wait_runtime()
        self.enqueue(self.curr)
        self.pick_next_task()

    def pick_next_task(self):
        key_seq = self.ready_queue.keys()
        if key_seq:
            key_seq.sort()
            self.curr = self.ready_queue[key_seq[0]][0]
        else:
            self.curr = idle_task
        self.snapshot()
        self.record_run_history()

    def record_run_history(self):
        task = self.curr
        if not self.run_history:
            self.run_history.append([task, 1])
            return
        curr = self.run_history[-1]
        if curr[0] != task:
            self.run_history.append([task, 1])
        else:
            curr[1] += 1

    def show_history(self):
        stat = {}
        for entry in self.run_history:
            task = entry[0]
            nsec = entry[1]
            print "%s run %d sec" % (task, nsec)
            if task not in stat.keys():
                stat[task] = nsec
            else:
                stat[task] += nsec
        print "=============================="
        tasks = stat.keys()
        tasks.sort()
        for task in tasks:
            print task, "/", task.load_weight, ":", stat[task], "sec"
        print "=============================="

    def run(self, delta=0, debug=0):
        self.debug = debug
        until = self.wall_clock + delta
        print "-----------------------------"
        self.pick_next_task()
        while self.wall_clock < until:
            self.wall_clock += 1
            self.clock_tick()
        print "-----------------------------"

#
# Turn this on to display verbose runtime information.
#
debug = True

if __name__ == "__main__":
    rq = run_queue()
    task1 = task_struct("TASK_1", 1)
    task2 = task_struct("TASK_2", 1)
    task3 = task_struct("TASK_3", 2)
    rq.enqueue(task1)
    rq.enqueue(task2)
    rq.enqueue(task3)
    rq.run(int(sys.argv[1]), debug)
    rq.show_history()

#EOF

Good luck

- Li Yu

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [patch] CFS scheduler, -v14
  2007-06-01 19:21             ` Ingo Molnar
  2007-06-05  2:33               ` Li Yu
@ 2007-06-05  3:35               ` Li Yu
  1 sibling, 0 replies; 36+ messages in thread
From: Li Yu @ 2007-06-05  3:35 UTC (permalink / raw)
  To: Ingo Molnar, LKML

> 
> Ingo Molnar wrote:
>> * Li Yu <raise.sail@gmail.com> wrote:
>>
>>   
>>> Also, I want to know the real meaning of
>>>
>>>    add_wait_runtime(rq, curr, delta_mine - delta_exec);
>>>
>>> in update_curr(), IMHO, it should be
>>>
>>>    add_wait_runtime(rq, curr, delta_mine - delta_fair);
>>>
>>> Is this just another heuristic, or is my opinion wrong again? :-)
>>>     
>>
>> well, ->wait_runtime is in real time units. If a task executes 
>> delta_exec time on the CPU, we deduct "-delta_exec" 1:1. But during that 
>> time the task also got entitled to a bit more CPU time, that is 
>> +delta_mine. The calculation above expresses this. I'm not sure what 
>> sense '-delta_fair' would make - "delta_fair" is the amount of time a 
>> nice-0 task would be entitled to - but this task might not be a nice-0 
>> task. Furthermore, even for a nice-0 task why deduct -delta_fair - it 
>> spent delta_exec on the CPU.
>> 
> 
> Eh, I was wrong again; I even ran an experiment last weekend, and this 
> idea is really bad! ;(
> 
> I think the innermost source of my being wrong again and again is a 
> misunderstanding of virtual time. To understand this better, I tried 
> to write a Python script to simulate CFS behavior. However, it does 
> not achieve the fairness I want. I am really confused here.
> 
> Would you help me point out what is wrong with it? Any suggestion is 
> welcome. Thanks in advance.
> 
> 
> 

I think using wait_runtime is clearer, so I modified this script.


#! /usr/bin/python

# htucfs.py - Hard-To-Understand-CFS.py ;)
# Written by Li Yu / 20070604

#
# only support static load / UP.
#


# Usage:
#	./htucfs.py nr_clock_ticks_to_run
#

import sys

class task_struct:
	def __init__(self, name, load_weight):
		self.name = name
		self.wait_runtime = 0
		self.fair_clock = 0
		self.load_weight = float(load_weight)
	def __repr__(self):
		return "%s/C%.2f" % (self.name, self.fair_clock)

idle_task = task_struct("idle", 0)

class run_queue:
	def __init__(self):
		self.raw_weighted_load = 0
		self.wall_clock = 0
		self.fair_clock = 0
		self.ready_queue = {}
		self.run_history = []
		self.task_list = []
		self.curr = None
		self.debug = 0
		
	def snapshot(self):
		if self.debug:
			print "%.2f" % self.fair_clock, self.ready_queue, self.curr

	def enqueue(self, task):
		if not self.ready_queue.get(task.wait_runtime):
			self.ready_queue[task.wait_runtime] = [task]
		else:
			# keep FIFO for same wait_runtime tasks.
			self.ready_queue[task.wait_runtime].append(task)
		self.raw_weighted_load += task.load_weight
		self.task_list.append(task)

	def dequeue(self, task):
		self.raw_weighted_load -= task.load_weight
		self.ready_queue[task.wait_runtime].remove(task)
		if not self.ready_queue[task.wait_runtime]:
			del self.ready_queue[task.wait_runtime]
		self.task_list.remove(task)
	
	def other_wait_runtime(self):
		task_list = self.task_list[:]
		for task in task_list:
			if task == self.curr:
				continue
			self.dequeue(task)
			task.wait_runtime += 1
			print task, "wait 1 sec"
			self.enqueue(task)
		
	def clock_tick(self):
		# clock_tick = 1.0
		self.fair_clock += 1.0/self.raw_weighted_load
		# delta_exec = 1.0
		delta_mine = self.curr.load_weight / self.raw_weighted_load
		self.dequeue(self.curr)
		self.other_wait_runtime()
		print self.curr, "run %.2f" % (delta_mine-1.0)
		self.curr.wait_runtime += (delta_mine-1.0)
		self.curr.fair_clock += 1.0/self.curr.load_weight
		self.enqueue(self.curr)
		self.pick_next_task()
	
	def pick_next_task(self):
		key_seq	= self.ready_queue.keys()
		if key_seq:
			key_seq.sort()
			self.curr = self.ready_queue[key_seq[-1]][0]
		else:
			self.curr = idle_task
		self.snapshot()
		self.record_run_history()

	def record_run_history(self):
		task = self.curr
		if not self.run_history:
			self.run_history.append([task, 1])
			return
		curr = self.run_history[-1]
		if curr[0] != task:
			self.run_history.append([task, 1])
		else:
			curr[1] += 1
	
	def show_history(self):
		stat = {}
		for entry in self.run_history:
			task = entry[0]
			nsec = entry[1]
			print "%s run %d sec" % (task, nsec)
			if task not in stat.keys():
				stat[task] = nsec
			else:
				stat[task] += nsec
		print "=============================="
		tasks = stat.keys()
		tasks.sort()
		for task in tasks:
			print task, "/", task.load_weight, ":", stat[task], "sec"
		print "=============================="

	def run(self, delta=0, debug=0):
		self.debug = debug
		until = self.wall_clock + delta
		print "-----------------------------"
		self.pick_next_task()
		while self.wall_clock < until:
			self.wall_clock += 1
			self.clock_tick()
		print "-----------------------------"

#
# Turn this on to display verbose runtime information.
#
debug = True

if __name__ == "__main__":
	rq = run_queue()
	task1 = task_struct("TASK_1", 1)
	task2 = task_struct("TASK_2", 2)
	task3 = task_struct("TASK_3", 1)
	rq.enqueue(task1)
	rq.enqueue(task2)
	rq.enqueue(task3)
	rq.run(int(sys.argv[1]), debug)
	rq.show_history()

#EOF

Good luck

- Li Yu



^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [patch] CFS scheduler, -v14
  2007-05-29 10:23           ` Balbir Singh
@ 2007-06-05  7:57             ` Ingo Molnar
  0 siblings, 0 replies; 36+ messages in thread
From: Ingo Molnar @ 2007-06-05  7:57 UTC (permalink / raw)
  To: Balbir Singh
  Cc: linux-kernel, Linus Torvalds, Andrew Morton, Mike Galbraith,
	Arjan van de Ven, Thomas Gleixner, pranith-kumar_d, Andi Kleen


* Balbir Singh <balbir@linux.vnet.ibm.com> wrote:

> +		/*
> +		 * Split up sched_exec_time according to the utime and
> +		 * stime ratio. At this point utime contains the summed
> +		 * sched_exec_runtime and stime is zero
> +		 */
> +		if (sum_us_time) {
> +			utime = ((tu_time * total_time) / sum_us_time);
> +			stime = ((ts_time * total_time) / sum_us_time);
> +		}
> +	}

hm, Dmitry Adamushko found out that this will cause rounding problems 
and might confuse 'top' - because total_time is a 10 msecs granular 
value, so under the above calculation the total of 'utime+stime' can 
shrink a bit as time goes forward. The symptom is that top will display 
a '99.9%' entry for tasks, sporadically.
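With small made-up numbers the shrinkage is easy to see: the integer 
ratio split truncates both halves, so utime + stime can come out below 
the real total.

```python
tu_time, ts_time = 1, 2      # tick-sampled utime/stime
total_time = 10              # precise summed runtime, in clock ticks
sum_us_time = tu_time + ts_time

utime = (tu_time * total_time) // sum_us_time   # 3 (3.33 truncated)
stime = (ts_time * total_time) // sum_us_time   # 6 (6.66 truncated)
print(utime + stime)         # 9 < 10: one tick lost to rounding
```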

I've attached below my current delta (ontop of -v15) which does the 
stime/utime splitup correctly and which includes some more enhancements 
from Dmitry - could you please take a look at this and add any deltas 
you might have ontop of it?

	Ingo

---
 Makefile                  |    2 +-
 fs/proc/array.c           |   33 ++++++++++++++++++++++++---------
 include/linux/sched.h     |    3 +--
 kernel/posix-cpu-timers.c |    2 +-
 kernel/sched.c            |   17 ++++++++++-------
 kernel/sched_debug.c      |   16 +++++++++++++++-
 kernel/sched_fair.c       |    2 +-
 kernel/sched_rt.c         |   12 ++++++++----
 8 files changed, 61 insertions(+), 26 deletions(-)

Index: linux/Makefile
===================================================================
--- linux.orig/Makefile
+++ linux/Makefile
@@ -1,7 +1,7 @@
 VERSION = 2
 PATCHLEVEL = 6
 SUBLEVEL = 21
-EXTRAVERSION = .3-cfs-v15
+EXTRAVERSION = .3-cfs-v16
 NAME = Nocturnal Monster Puppy
 
 # *DOCUMENTATION*
Index: linux/fs/proc/array.c
===================================================================
--- linux.orig/fs/proc/array.c
+++ linux/fs/proc/array.c
@@ -172,8 +172,8 @@ static inline char * task_state(struct t
 		"Uid:\t%d\t%d\t%d\t%d\n"
 		"Gid:\t%d\t%d\t%d\t%d\n",
 		get_task_state(p),
-		p->tgid, p->pid,
-		pid_alive(p) ? rcu_dereference(p->real_parent)->tgid : 0,
+	       	p->tgid, p->pid,
+	       	pid_alive(p) ? rcu_dereference(p->real_parent)->tgid : 0,
 		pid_alive(p) && p->ptrace ? rcu_dereference(p->parent)->pid : 0,
 		p->uid, p->euid, p->suid, p->fsuid,
 		p->gid, p->egid, p->sgid, p->fsgid);
@@ -312,24 +312,39 @@ int proc_pid_status(struct task_struct *
 
 static clock_t task_utime(struct task_struct *p)
 {
+	clock_t utime = cputime_to_clock_t(p->utime),
+		total = utime + cputime_to_clock_t(p->stime);
+
 	/*
 	 * Use CFS's precise accounting, if available:
 	 */
-	if (!has_rt_policy(p) && !(sysctl_sched_load_smoothing & 128))
-		return nsec_to_clock_t(p->sum_exec_runtime);
+	if (!(sysctl_sched_load_smoothing & 128)) {
+		u64 temp = (u64)nsec_to_clock_t(p->sum_exec_runtime);
+
+		if (total) {
+			temp *= utime;
+			do_div(temp, total);
+		}
+		utime = (clock_t)temp;
+	}
 
-	return cputime_to_clock_t(p->utime);
+	return utime;
 }
 
 static clock_t task_stime(struct task_struct *p)
 {
+	clock_t stime = cputime_to_clock_t(p->stime),
+		total = stime + cputime_to_clock_t(p->utime);
+
 	/*
-	 * Use CFS's precise accounting, if available:
+	 * Use CFS's precise accounting, if available (we subtract
+	 * utime from the total, to make sure the total observed
+	 * by userspace grows monotonically - apps rely on that):
 	 */
-	if (!has_rt_policy(p) && !(sysctl_sched_load_smoothing & 128))
-		return 0;
+	if (!(sysctl_sched_load_smoothing & 128))
+		stime = nsec_to_clock_t(p->sum_exec_runtime) - task_utime(p);
 
-	return cputime_to_clock_t(p->stime);
+	return stime;
 }
 
 
Index: linux/include/linux/sched.h
===================================================================
--- linux.orig/include/linux/sched.h
+++ linux/include/linux/sched.h
@@ -852,7 +852,6 @@ struct task_struct {
 	u64 block_max;
 	u64 exec_max;
 	u64 wait_max;
-	u64 last_ran;
 
 	s64 wait_runtime;
 	u64 sum_exec_runtime;
@@ -1235,7 +1234,7 @@ static inline int set_cpus_allowed(struc
 extern unsigned long long sched_clock(void);
 extern void sched_clock_unstable_event(void);
 extern unsigned long long
-current_sched_runtime(const struct task_struct *current_task);
+task_sched_runtime(struct task_struct *task);
 
 /* sched_exec is called by processes performing an exec */
 #ifdef CONFIG_SMP
Index: linux/kernel/posix-cpu-timers.c
===================================================================
--- linux.orig/kernel/posix-cpu-timers.c
+++ linux/kernel/posix-cpu-timers.c
@@ -161,7 +161,7 @@ static inline cputime_t virt_ticks(struc
 }
 static inline unsigned long long sched_ns(struct task_struct *p)
 {
-	return (p == current) ? current_sched_runtime(p) : p->sum_exec_runtime;
+	return task_sched_runtime(p);
 }
 
 int posix_cpu_clock_getres(const clockid_t which_clock, struct timespec *tp)
Index: linux/kernel/sched.c
===================================================================
--- linux.orig/kernel/sched.c
+++ linux/kernel/sched.c
@@ -1227,7 +1227,7 @@ static void task_running_tick(struct rq 
  */
 static void __sched_fork(struct task_struct *p)
 {
-	p->wait_start_fair = p->wait_start = p->exec_start = p->last_ran = 0;
+	p->wait_start_fair = p->wait_start = p->exec_start = 0;
 	p->sum_exec_runtime = 0;
 
 	p->wait_runtime = 0;
@@ -2592,17 +2592,20 @@ DEFINE_PER_CPU(struct kernel_stat, kstat
 EXPORT_PER_CPU_SYMBOL(kstat);
 
 /*
- * Return current->sum_exec_runtime plus any more ns on the sched_clock
- * that have not yet been banked.
+ * Return p->sum_exec_runtime plus any more ns on the sched_clock
+ * that have not yet been banked in case the task is currently running.
  */
-unsigned long long current_sched_runtime(const struct task_struct *p)
+unsigned long long task_sched_runtime(struct task_struct *p)
 {
 	unsigned long long ns;
 	unsigned long flags;
+	struct rq *rq;
 
-	local_irq_save(flags);
-	ns = p->sum_exec_runtime + sched_clock() - p->last_ran;
-	local_irq_restore(flags);
+	rq = task_rq_lock(p, &flags);
+	ns = p->sum_exec_runtime;
+	if (rq->curr == p)
+		ns += rq_clock(rq) - p->exec_start;
+	task_rq_unlock(rq, &flags);
 
 	return ns;
 }
Index: linux/kernel/sched_debug.c
===================================================================
--- linux.orig/kernel/sched_debug.c
+++ linux/kernel/sched_debug.c
@@ -188,6 +188,18 @@ __initcall(init_sched_debug_procfs);
 
 void proc_sched_show_task(struct task_struct *p, struct seq_file *m)
 {
+	unsigned long flags;
+	int num_threads = 1;
+
+	rcu_read_lock();
+	if (lock_task_sighand(p, &flags)) {
+		num_threads = atomic_read(&p->signal->count);
+		unlock_task_sighand(p, &flags);
+	}
+	rcu_read_unlock();
+
+	SEQ_printf(m, "%s (%d, #threads: %d)\n", p->comm, p->pid, num_threads);
+	SEQ_printf(m, "----------------------------------------------\n");
 #define P(F) \
 	SEQ_printf(m, "%-25s:%20Ld\n", #F, (long long)p->F)
 
@@ -201,11 +213,13 @@ void proc_sched_show_task(struct task_st
 	P(block_max);
 	P(exec_max);
 	P(wait_max);
-	P(last_ran);
 	P(wait_runtime);
 	P(wait_runtime_overruns);
 	P(wait_runtime_underruns);
 	P(sum_exec_runtime);
+	P(load_weight);
+	P(policy);
+	P(prio);
 #undef P
 
 	{
Index: linux/kernel/sched_fair.c
===================================================================
--- linux.orig/kernel/sched_fair.c
+++ linux/kernel/sched_fair.c
@@ -200,7 +200,7 @@ static inline void update_curr(struct rq
 	 * since the last time we changed raw_weighted_load:
 	 */
 	delta_exec = now - curr->exec_start;
-	if (unlikely(delta_exec < 0))
+	if (unlikely((s64)delta_exec < 0))
 		delta_exec = 0;
 	if (unlikely(delta_exec > curr->exec_max))
 		curr->exec_max = delta_exec;
Index: linux/kernel/sched_rt.c
===================================================================
--- linux.orig/kernel/sched_rt.c
+++ linux/kernel/sched_rt.c
@@ -54,6 +54,7 @@ static void check_preempt_curr_rt(struct
 static struct task_struct * pick_next_task_rt(struct rq *rq, u64 now)
 {
 	struct prio_array *array = &rq->active;
+	struct task_struct *next;
 	struct list_head *queue;
 	int idx;
 
@@ -62,14 +63,17 @@ static struct task_struct * pick_next_ta
 		return NULL;
 
 	queue = array->queue + idx;
-	return list_entry(queue->next, struct task_struct, run_list);
+	next = list_entry(queue->next, struct task_struct, run_list);
+
+	next->exec_start = now;
+
+	return next;
 }
 
-/*
- * No accounting done when RT tasks are descheduled:
- */
 static void put_prev_task_rt(struct rq *rq, struct task_struct *p, u64 now)
 {
+	p->sum_exec_runtime += now - p->exec_start;
+	p->exec_start = 0;
 }
 
 /*


* Re: [patch] CFS scheduler, -v14
  2007-06-05  2:33               ` Li Yu
@ 2007-06-05  8:01                 ` Ingo Molnar
  2007-06-05  8:54                   ` Li Yu
  2007-06-06  7:41                   ` Li Yu
  0 siblings, 2 replies; 36+ messages in thread
From: Ingo Molnar @ 2007-06-05  8:01 UTC (permalink / raw)
  To: Li Yu; +Cc: LKML


* Li Yu <raise.sail@gmail.com> wrote:

> Eh, I was wrong again ~ I even ran an experiment last weekend; this 
> idea is really bad! ;(
> 
> I think the root cause of my being wrong again and again is a 
> misunderstanding of virtual time. To understand it better, I tried to 
> write a Python script that simulates CFS behavior. However, it does 
> not achieve the fairness I expect, and I am really confused here.
> 
> Would you help me point out what's wrong in it? Any suggestion is 
> welcome. Thanks in advance.

sorry, my python-fu is really, really weak. All i can give you at the 
moment is the in-kernel implementation of CFS :-)

	Ingo


* Re: [patch] CFS scheduler, -v14
  2007-06-05  8:01                 ` Ingo Molnar
@ 2007-06-05  8:54                   ` Li Yu
  2007-06-06  7:41                   ` Li Yu
  1 sibling, 0 replies; 36+ messages in thread
From: Li Yu @ 2007-06-05  8:54 UTC (permalink / raw)
  To: Ingo Molnar; +Cc: LKML

Ingo Molnar wrote:
> * Li Yu <raise.sail@gmail.com> wrote:
>
>   
>> Eh, I was wrong again ~ I even ran an experiment last weekend; this 
>> idea is really bad! ;(
>>
>> I think the root cause of my being wrong again and again is a 
>> misunderstanding of virtual time. To understand it better, I tried to 
>> write a Python script that simulates CFS behavior. However, it does 
>> not achieve the fairness I expect, and I am really confused here.
>>
>> Would you help me point out what's wrong in it? Any suggestion is 
>> welcome. Thanks in advance.
>>     
>
> sorry, my python-fu is really, really weak. All i can give you at the 
> moment is the in-kernel implementation of CFS :-)
>
>   
:~)

I changed the script to check my understanding of the virtual clock. I 
found that we really do get fairness if we allocate CPU time by 
selecting the task with the earliest virtual clock! This largely 
eliminates my doubts about the virtual clock. For example:

./htucfs.py 60

==============================
TASK_1/C10.00 / 1.0 : 11.0 sec
TASK_2/C10.00 / 2.0 : 20.0 sec
TASK_3/C10.00 / 3.0 : 30.0 sec
==============================

It seems my halting English served me well when I read the 
introduction to the virtual clock ;-)

The next step is to find out why wait_runtime does not work correctly 
in my script.

Thanks for your quick reply.

Good luck.

- Li Yu


* Re: [patch] CFS scheduler, -v14
  2007-06-05  8:01                 ` Ingo Molnar
  2007-06-05  8:54                   ` Li Yu
@ 2007-06-06  7:41                   ` Li Yu
  1 sibling, 0 replies; 36+ messages in thread
From: Li Yu @ 2007-06-06  7:41 UTC (permalink / raw)
  To: Ingo Molnar; +Cc: LKML

Hi, Ingo:

    I am sorry for disturbing you again. I am interested in CFS, but I 
am really confused about its fairness implementation.

    After reviewing past LKML mails, I learned that the virtual clock 
is used as the measuring scale for fairness, which is an excellent 
idea, and that CFS uses wait_runtime (the total time owed to run) to 
simulate the virtual clock of the task. However, from my experiment it 
seems we cannot get a good fairness effect if we only consider 
task->wait_runtime. Yet CFS is well known to work fine ;-)

    Here are the details of my experiment:

    Suppose a UP system with three 100% CPU-hog tasks on the 
processor, with weights 1, 2 and 3 respectively, so in every 6-second 
interval they should get 1, 2 and 3 seconds of CPU time respectively.

    The clock tick interval is 1 sec, so the step of virtual time (VT) 
is 0.17 (1/6) of wall time (WT). I use the following conventions to 
describe runtime information:

    VTR0.17: the virtual time 0.17 of the runqueue
    VTT11.0: the virtual time 11.0 of a task
    WT2: the wall time 2.
    TASK_1/123.00: the task named TASK_1, which has a wait_runtime of 
123.00 at that time.

for example:

    WT1/VTR0.17     [ VTT0.00:[TASK_2/1.00, TASK_3/1.00], 
VTT1.00:[TASK_1/-0.83] ] current: TASK_2/1.00  

This means we pick TASK_2 as the next task at wall time 1 / runqueue 
virtual time 0.17; the ready task list has three tasks:

TASK_2:  its virtual time is 0.00, its wait_runtime is 1.00
TASK_3:  its virtual time is 0.00, its wait_runtime is 1.00
TASK_1:  its virtual time is 1.00, its wait_runtime is -0.83

It seems that picking the next task by least VTT or by largest 
wait_runtime gives the same result at this point; lucky.
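As a side note, the -0.83 / -0.67 / -0.50 deltas in the trace follow 
directly from the weights; here is a tiny sketch of the arithmetic as I 
understand it from the trace (illustrative only, not CFS source):

```python
# With a 1-second tick and total weight 6, a running task of weight w
# is charged delta_exec = 1 s of wall time while its fair share is
# delta_mine = w/6 s, so its wait_runtime changes by
# delta_mine - delta_exec on each tick it runs.
total_weight = 1 + 2 + 3

for name, w in [("TASK_1", 1), ("TASK_2", 2), ("TASK_3", 3)]:
    delta_exec = 1.0                            # wall time this tick
    delta_mine = delta_exec * w / total_weight  # fair share earned
    print(f"{name}: {delta_mine - delta_exec:+.2f} sec")
# prints -0.83, -0.67 and -0.50 respectively, matching the trace
```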

What follows is the complete result of running my script to simulate 6 clock ticks:

-----------------------------
WT0/VTR0.00     [ VTT0.00:[TASK_1/0.00, TASK_2/0.00, TASK_3/0.00] ] 
current: TASK_1/0.00
Before WT1 :
TASK_2/1.00 wait 1.00 sec
TASK_3/1.00 wait 1.00 sec
TASK_1/0.00 spent - 0.83 sec (delta_mine-delta_exec, delta_exec always 
is 1.0)
WT1/VTR0.17     [ VTT0.00:[TASK_2/1.00, TASK_3/1.00], 
VTT1.00:[TASK_1/-0.83] ] current: TASK_2/1.00
Before WT2 :
TASK_3/2.00 wait 1.00 sec
TASK_1/0.17 wait 1.00 sec
TASK_2/1.00 spent - 0.67 sec (delta_mine-delta_exec, delta_exec always 
is 1.0)
WT2/VTR0.33     [ VTT0.00:[TASK_3/2.00], VTT0.50:[TASK_2/0.33], 
VTT1.00:[TASK_1/0.17] ] current: TASK_3/2.00
Before WT3 :
TASK_1/1.17 wait 1.00 sec
TASK_2/1.33 wait 1.00 sec
TASK_3/2.00 spent - 0.50 sec (delta_mine-delta_exec, delta_exec always 
is 1.0)
WT3/VTR0.50     [ VTT0.33:[TASK_3/1.50], VTT0.50:[TASK_2/1.33], 
VTT1.00:[TASK_1/1.17] ] current: TASK_3/1.50
Before WT4 :
TASK_1/2.17 wait 1.00 sec
TASK_2/2.33 wait 1.00 sec
TASK_3/1.50 spent - 0.50 sec (delta_mine-delta_exec, delta_exec always 
is 1.0)
WT4/VTR0.67     [ VTT0.50:[TASK_2/2.33], VTT0.67:[TASK_3/1.00], 
VTT1.00:[TASK_1/2.17] ] current: TASK_2/2.33
Before WT5 :
TASK_1/3.17 wait 1.00 sec
TASK_3/2.00 wait 1.00 sec
TASK_2/2.33 spent - 0.67 sec (delta_mine-delta_exec, delta_exec always 
is 1.0)
WT5/VTR0.83     [ VTT0.67:[TASK_3/2.00], VTT1.00:[TASK_1/3.17, 
TASK_2/1.67] ] current: TASK_3/2.00
-----------------------------
TASK_1/3.17 run 1.00 sec
TASK_2/1.67 run 1.00 sec
TASK_3/2.00 run 2.00 sec
TASK_2/1.67 run 1.00 sec
TASK_3/2.00 run 1.00 sec
==============================
TASK_1 / 1.0 total run: 1.0 sec
TASK_2 / 2.0 total run: 2.0 sec
TASK_3 / 3.0 total run: 3.0 sec
==============================

If we pick the next task by the least VTT (as shown above), we get the 
properly fair result; the scheduling sequence is:

TASK_1 -> TASK_2 -> TASK_3 -> TASK_2 -> TASK_3

However, if we pick the next task by the largest wait_runtime, we get 
a different scheduling sequence:

TASK_1 -> TASK_2 -> TASK_3 -> TASK_2 -> TASK_1

In this case it is not fair anymore: every task gets the same amount 
of processor time!

If we run the latter for a longer time, for example simulating 6000 
clock ticks, the result is:

==============================
TASK_1 / 1.0  total run : 1806.0 sec
TASK_2 / 2.0  total run : 1987.0 sec
TASK_3 / 3.0  total run : 2207.0 sec
==============================

No vindication is needed: I really do trust that CFS works fine (it 
works fine on my desktop ;), that is a fact.

So I think there must be something wrong in my experiment above, yet 
it apparently holds. What is the real gap between VTT and 
wait_runtime, and how does CFS bridge it? It seems I should give 
TASK_3 some extra credit in some way.
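For what it's worth, the pick-by-earliest-virtual-clock rule described 
above can be sketched in a few lines of Python (a toy model with names 
I made up, not CFS's actual implementation; exact fractions are used 
to avoid floating-point drift):

```python
# Toy simulation of weighted fair scheduling: each task has a weight
# and a per-task virtual clock that advances by 1/weight per unit of
# CPU time it consumes; each tick we run the task whose virtual clock
# is furthest behind (ties broken by name).
from fractions import Fraction

def simulate(weights, ticks):
    vclock = {name: Fraction(0) for name in weights}
    runtime = {name: 0 for name in weights}
    for _ in range(ticks):
        # pick the task with the earliest (smallest) virtual clock
        cur = min(vclock, key=lambda n: (vclock[n], n))
        runtime[cur] += 1
        vclock[cur] += Fraction(1, weights[cur])
    return runtime

# With weights 1:2:3 over 60 ticks, CPU time splits exactly 10/20/30:
print(simulate({"TASK_1": 1, "TASK_2": 2, "TASK_3": 3}, 60))
# → {'TASK_1': 10, 'TASK_2': 20, 'TASK_3': 30}
```

This reproduces the fair 1:2:3 split of the least-VTT experiment 
above, which a largest-wait_runtime rule does not achieve.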

Sorry for such a long mail and such bad English.

Good luck.

- Li Yu


* Re: [patch] CFS scheduler, -v14
  2007-06-01 15:31     ` Linus Torvalds
@ 2007-06-07 22:29       ` S.Çağlar Onur
  0 siblings, 0 replies; 36+ messages in thread
From: S.Çağlar Onur @ 2007-06-07 22:29 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Ingo Molnar, linux-kernel, Andrew Morton, Mike Galbraith,
	Arjan van de Ven, Thomas Gleixner, pranith-kumar_d

[-- Attachment #1: Type: text/plain, Size: 2215 bytes --]

Hi;

01 Haz 2007 Cum tarihinde, Linus Torvalds şunları yazmıştı: 
> Has it been hot where you are lately? Is your fan working?

First of all, sorry for the late reply.

For a while İstanbul has not been really hot [~26 C] :) and yes, the 
fans seem to be working without a problem.

> Hardware that acts up under load is quite often thermal-related,
> especially if it starts happening during summer and didn't happen before
> that... ESPECIALLY the kinds of behaviours you see: the "sudden power-off"
> is the normal behaviour for a CPU that trips a critial overheating point,
> and the slowdown is also one normal response to overheating (CPU
> throttling).

According to ACPI output;

[caglar@zangetsu][~]> cat /proc/acpi/thermal_zone/THRM/*
<setting not supported>
cooling mode:   passive
<polling disabled>
state:                   ok
temperature:             56 C
critical (S5):           105 C
passive:                 95 C: tc1=1 tc2=5 tsp=10 devices=0xc20deec8

105 C is critical for that CPU; for a while (this is why I reply late) 
I've been constantly monitoring the temperature under low and high load.

It's in the 50-70 C range during normal usage/idle and the 80-100 C 
range under high load (compiling some applications, using cpuburn to 
test, etc.), so it seems the machine can handle overheating.

But digging through kern.log also shows some strange values:

May 24 10:39:23 localhost kernel: [    0.000000] Detected 897.748 MHz 
processor. <--- 2.6.21.2-CFS-v14
...
May 30 00:59:11 localhost kernel: [    0.000000] Detected 898.726 MHz 
processor. <--- 2.6.21.2-CFS-v15
...
Jun  1 02:09:44 localhost kernel: [    0.000000] Detected 897.591 MHz 
processor. <--- 2.6.21.3-CFS-v15
...

And according to the same log, these slowdowns occurred after I 
compiled/installed these kernel versions into the system (since these 
are the first appearances of those versions in kern.log). So as you 
said, it definitely seems to be an overheating issue. I'll continue to 
test/monitor and report back if I find anything. Thanks!

Cheers
-- 
S.Çağlar Onur <caglar@pardus.org.tr>
http://cekirdek.pardus.org.tr/~caglar/

Linux is like living in a teepee. No Windows, no Gates and an Apache in house!

[-- Attachment #2: This is a digitally signed message part. --]
[-- Type: application/pgp-signature, Size: 189 bytes --]


end of thread, other threads:[~2007-06-07 22:31 UTC | newest]

Thread overview: 36+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2007-05-23 12:06 [patch] CFS scheduler, -v14 Ingo Molnar
2007-05-23 19:39 ` Nicolas Mailhot
2007-05-23 19:57   ` Ingo Molnar
2007-05-23 20:02     ` Nicolas Mailhot
2007-05-24  6:42 ` Balbir Singh
2007-05-24  8:09   ` Ingo Molnar
2007-05-24  9:19     ` Balbir Singh
2007-05-24 17:25     ` Jeremy Fitzhardinge
2007-05-24 20:59       ` Ingo Molnar
2007-05-24 22:43         ` Jeremy Fitzhardinge
2007-05-25 12:46     ` Ingo Molnar
2007-05-25 16:45       ` Balbir Singh
2007-05-28 11:07         ` Ingo Molnar
2007-05-29 10:23           ` Balbir Singh
2007-06-05  7:57             ` Ingo Molnar
2007-05-29 10:19       ` Balbir Singh
2007-05-26 14:58 ` S.Çağlar Onur
2007-05-26 15:08   ` S.Çağlar Onur
2007-06-01 13:35   ` S.Çağlar Onur
2007-06-01 15:31     ` Linus Torvalds
2007-06-07 22:29       ` S.Çağlar Onur
2007-06-01 15:37     ` [OT] " Andreas Mohr
2007-05-27  2:49 ` Li Yu
2007-05-29  6:15   ` Ingo Molnar
2007-05-29  8:07     ` Ingo Molnar
2007-05-31  9:45       ` Li Yu
2007-05-31  9:53         ` Ingo Molnar
2007-06-01  7:16           ` Li Yu
2007-06-01 19:21             ` Ingo Molnar
2007-06-05  2:33               ` Li Yu
2007-06-05  8:01                 ` Ingo Molnar
2007-06-05  8:54                   ` Li Yu
2007-06-06  7:41                   ` Li Yu
2007-06-05  3:35               ` Li Yu
2007-05-28  1:17 ` Li Yu
2007-05-29  0:49   ` Li Yu

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox