* Re: [patch] O(1) scheduler, -D1, 2.5.2-pre9, 2.4.17
  [not found] <200201071922.g07JMN106760@penguin.transmeta.com>
@ 2002-01-07 21:36 ` Ingo Molnar
  2002-01-08  8:49   ` FD Cami
  2002-01-08 11:32   ` Anton Blanchard
  0 siblings, 2 replies; 41+ messages in thread

From: Ingo Molnar @ 2002-01-07 21:36 UTC (permalink / raw)
To: Linus Torvalds; +Cc: linux-kernel, Brian Gerst

On Mon, 7 Jan 2002, Linus Torvalds wrote:

> Ingo, looks true. A quick -D2?

yep, Brian is right. I've uploaded -D2:

    http://redhat.com/~mingo/O(1)-scheduler/sched-O1-2.5.2-D2.patch

other changes:

 - make rt_priority 99 map to p->prio 0, rt_priority 0 map to p->prio 99.

 - display 'top' priorities correctly: 0-39 for normal processes,
   negative values for RT tasks. (it appears to work just fine.) We
   didn't previously display the real priority of RT tasks, but now
   it's natural.

> Oh, and please move console_init() back, other consoles (sparc?) may
> depend on having PCI layers initialized.

(doh, done too, fix is in -D2.)

> Oh, and _I_ don't like "cpu()". What's wrong with the already
> existing "smp_processor_id()"?

nothing serious; my main problem with it is that it's often too long for
my 80-character-wide consoles, it's also too long to type, and I use it
quite often in SMP code.

IIRC we had a 'hard_smp_processor_id()' initially, partly to make it
harder to use (it was very slow because it did an APIC read). But these
days smp_processor_id() is just as fast as (or even faster than)
'current'. So I wanted to use cpu() in new code to make it easier to
read and more compact. But if this is a problem I can remove it. I've
verified that there are no obvious namespace collisions.

(I've done a quick UP sanity compile + boot of 2.5.2-pre9 + D2; it all
works as expected.)

	Ingo

^ permalink raw reply	[flat|nested] 41+ messages in thread
* Re: [patch] O(1) scheduler, -D1, 2.5.2-pre9, 2.4.17
  2002-01-07 21:36 ` [patch] O(1) scheduler, -D1, 2.5.2-pre9, 2.4.17 Ingo Molnar
@ 2002-01-08  8:49   ` FD Cami
  2002-01-08 18:44     ` J Sloan
  2002-01-08 11:32   ` Anton Blanchard
  1 sibling, 1 reply; 41+ messages in thread

From: FD Cami @ 2002-01-08 8:49 UTC (permalink / raw)
To: mingo; +Cc: linux-kernel

Hi all

I'm joining the host of beta testers involved in that patch...

It's currently running on a production machine:
  dual PII 350 on ASUS P2B-DS
  3 SCSI hard drives
  512MB of RAM
  3C905C

This is a network server running the squid-cache www proxy with a
medium load (700 clients on a T3), mysqld, apache, proftpd. The kernel
is stock 2.4.17, and so far, so good.

Cheers,

François Cami

Ingo Molnar wrote:
> On Mon, 7 Jan 2002, Linus Torvalds wrote:
>
>> Ingo, looks true. A quick -D2?
>
> yep, Brian is right. I've uploaded -D2:
>
>    http://redhat.com/~mingo/O(1)-scheduler/sched-O1-2.5.2-D2.patch
>
> [...]
>
> -
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at  http://www.tux.org/lkml/

^ permalink raw reply	[flat|nested] 41+ messages in thread
* Re: [patch] O(1) scheduler, -D1, 2.5.2-pre9, 2.4.17
  2002-01-08  8:49 ` FD Cami
@ 2002-01-08 18:44   ` J Sloan
  0 siblings, 0 replies; 41+ messages in thread

From: J Sloan @ 2002-01-08 18:44 UTC (permalink / raw)
To: FD Cami; +Cc: mingo, linux-kernel

Excellent -

I'm going to try this one on whatever machines I have available for
testing, and if I am emboldened by success, I'll try it on some light
duty production servers as well -

keep us in the loop, please!

Regards,

jjs

FD Cami wrote:
>
> Hi all
>
> I'm joining the host of beta testers involved in that patch...
>
> It's currently running on a production machine:
>   dual PII 350 on ASUS P2B-DS
>   3 SCSI hard drives
>   512MB of RAM
>   3C905C
> This is a network server running squid-cache www proxy with
> a medium load (700 clients on a T3), mysqld, apache, proftpd.
> kernel is stock 2.4.17 - and so far, so good.
>
> [...]

^ permalink raw reply	[flat|nested] 41+ messages in thread
* Re: [patch] O(1) scheduler, -D1, 2.5.2-pre9, 2.4.17
  2002-01-07 21:36 ` [patch] O(1) scheduler, -D1, 2.5.2-pre9, 2.4.17 Ingo Molnar
  2002-01-08  8:49   ` FD Cami
@ 2002-01-08 11:32   ` Anton Blanchard
  2002-01-08 11:43     ` Anton Blanchard
  2002-01-08 14:32     ` [patch] O(1) scheduler, -E1, 2.5.2-pre10, 2.4.17 Ingo Molnar
  1 sibling, 2 replies; 41+ messages in thread

From: Anton Blanchard @ 2002-01-08 11:32 UTC (permalink / raw)
To: Ingo Molnar; +Cc: Linus Torvalds, linux-kernel

Hi Ingo,

I tested 2.5.2-pre10 today. There is some bitop abuse that needs fixing
for big endian machines to work :)

At the moment we have:

	#define BITMAP_SIZE ((MAX_PRIO+7)/8)
	char bitmap[BITMAP_SIZE];

Which is initialised using:

	memset(array->bitmap, 0xff, BITMAP_SIZE);
	clear_bit(MAX_PRIO, array->bitmap);

This results in the following in memory (in ascending memory order):

	ffffffffffffffff ffffffffffffffff fffffeffff000000

The problem here is that when we search the high word, we do so from
the right, therefore we get 128 all the time :)

The following patch fixes this. We need to define the bitmap in terms
of unsigned long; in this case it's only luck that we have the correct
alignment. We also replace the memset of the bitmap with set_bit.

With the patch things look much better (and the kernel boots on my
ppc64 machine :)

	ffffffffffffffff ffffffffffffffff 000000ffffffffff

Anton

diff -urN linuxppc_2_5/include/asm-i386/mmu_context.h linuxppc_2_5_work/include/asm-i386/mmu_context.h
--- linuxppc_2_5/include/asm-i386/mmu_context.h	Tue Jan  8 17:09:47 2002
+++ linuxppc_2_5_work/include/asm-i386/mmu_context.h	Tue Jan  8 22:06:35 2002
@@ -16,7 +16,7 @@
 # error update this function.
 #endif
 
-static inline int sched_find_first_zero_bit(char *bitmap)
+static inline int sched_find_first_zero_bit(unsigned long *bitmap)
 {
 	unsigned int *b = (unsigned int *)bitmap;
 	unsigned int rt;
diff -urN linuxppc_2_5/kernel/sched.c linuxppc_2_5_work/kernel/sched.c
--- linuxppc_2_5/kernel/sched.c	Tue Jan  8 17:09:47 2002
+++ linuxppc_2_5_work/kernel/sched.c	Tue Jan  8 22:13:45 2002
@@ -20,15 +20,13 @@
 #include <linux/interrupt.h>
 #include <asm/mmu_context.h>
 
-#define BITMAP_SIZE ((MAX_PRIO+7)/8)
-
 typedef struct runqueue runqueue_t;
 
 struct prio_array {
 	int nr_active;
 	spinlock_t *lock;
 	runqueue_t *rq;
-	char bitmap[BITMAP_SIZE];
+	unsigned long bitmap[3];
 	list_t queue[MAX_PRIO];
 };
 
@@ -1306,11 +1304,12 @@
 		array = rq->arrays + j;
 		array->rq = rq;
 		array->lock = &rq->lock;
-		for (k = 0; k < MAX_PRIO; k++)
+		for (k = 0; k < MAX_PRIO; k++) {
 			INIT_LIST_HEAD(array->queue + k);
-		memset(array->bitmap, 0xff, BITMAP_SIZE);
+			__set_bit(k, array->bitmap);
+		}
 		// zero delimiter for bitsearch
-		clear_bit(MAX_PRIO, array->bitmap);
+		__clear_bit(MAX_PRIO, array->bitmap);
 	}
 }
 
 /*

^ permalink raw reply	[flat|nested] 41+ messages in thread
* Re: [patch] O(1) scheduler, -D1, 2.5.2-pre9, 2.4.17
  2002-01-08 11:32 ` Anton Blanchard
@ 2002-01-08 11:43   ` Anton Blanchard
  2002-01-08 14:34     ` Ingo Molnar
  2002-01-08 14:32   ` [patch] O(1) scheduler, -E1, 2.5.2-pre10, 2.4.17 Ingo Molnar
  1 sibling, 1 reply; 41+ messages in thread

From: Anton Blanchard @ 2002-01-08 11:43 UTC (permalink / raw)
To: Ingo Molnar; +Cc: Linus Torvalds, linux-kernel

>  struct prio_array {
>  	int nr_active;
>  	spinlock_t *lock;
>  	runqueue_t *rq;
> -	char bitmap[BITMAP_SIZE];
> +	unsigned long bitmap[3];
>  	list_t queue[MAX_PRIO];
>  };

Sorry, of course the hardcoded [3] is wrong where unsigned long is
smaller than 64 bits. But you get the idea :)

Anton

^ permalink raw reply	[flat|nested] 41+ messages in thread
* Re: [patch] O(1) scheduler, -D1, 2.5.2-pre9, 2.4.17
  2002-01-08 11:43 ` Anton Blanchard
@ 2002-01-08 14:34   ` Ingo Molnar
  2002-01-09 23:15     ` Anton Blanchard
  0 siblings, 1 reply; 41+ messages in thread

From: Ingo Molnar @ 2002-01-08 14:34 UTC (permalink / raw)
To: Anton Blanchard; +Cc: Linus Torvalds, linux-kernel

On Tue, 8 Jan 2002, Anton Blanchard wrote:

> > -	char bitmap[BITMAP_SIZE];
> > +	unsigned long bitmap[3];
> >  	list_t queue[MAX_PRIO];
>
> Sorry, of course this is wrong if sizeof(unsigned long) < 64. But you
> get the idea :)

thanks, I've put the generic fix into the -E1 patch.

> With the patch things look much better (and the kernel boots on my
> ppc64 machine :)

hey, it should not even compile, you forgot to send us the PPC
definition of sched_find_first_zero_bit() ;-)

	Ingo

^ permalink raw reply	[flat|nested] 41+ messages in thread
* Re: [patch] O(1) scheduler, -D1, 2.5.2-pre9, 2.4.17
  2002-01-08 14:34 ` Ingo Molnar
@ 2002-01-09 23:15   ` Anton Blanchard
  2002-01-10  1:09     ` Richard Henderson
  0 siblings, 1 reply; 41+ messages in thread

From: Anton Blanchard @ 2002-01-09 23:15 UTC (permalink / raw)
To: Ingo Molnar; +Cc: Linus Torvalds, linux-kernel

> > With the patch things look much better (and the kernel boots on my
> > ppc64 machine :)
>
> hey it should not even compile, you forgot to send us the PPC definition
> of sched_find_first_zero_bit() ;-)

Good point, but it's ppc64, so the patch would include all of
include/asm-ppc64 and arch/ppc64 :)

I expect most architectures have a reasonably fast find_first_zero_bit,
so they can simply do:

static inline int sched_find_first_zero_bit(unsigned long *bitmap)
{
	return find_first_zero_bit(bitmap, MAX_PRIO);
}

^ permalink raw reply	[flat|nested] 41+ messages in thread
* Re: [patch] O(1) scheduler, -D1, 2.5.2-pre9, 2.4.17
  2002-01-09 23:15 ` Anton Blanchard
@ 2002-01-10  1:09   ` Richard Henderson
  2002-01-10 17:04     ` Ivan Kokshaysky
  0 siblings, 1 reply; 41+ messages in thread

From: Richard Henderson @ 2002-01-10 1:09 UTC (permalink / raw)
To: Anton Blanchard; +Cc: Ingo Molnar, Linus Torvalds, linux-kernel

On Thu, Jan 10, 2002 at 10:15:14AM +1100, Anton Blanchard wrote:
> I expect most architectures have a reasonably fast find_first_zero_bit
> so they can simply do:
>
> static inline int sched_find_first_zero_bit(unsigned long *bitmap)
> {
> 	return find_first_zero_bit(bitmap, MAX_PRIO);
> }

Careful. The following is really quite a bit better on Alpha:

static inline int
sched_find_first_zero_bit(unsigned long *bitmap)
{
	unsigned long b0 = bitmap[0];
	unsigned long b1 = bitmap[1];
	unsigned long b2 = bitmap[2];
	unsigned long ofs = MAX_RT_PRIO;

	if (unlikely(~(b0 & b1) != 0)) {
		b2 = (~b0 == 0 ? b0 : b1);
		ofs = (~b0 == 0 ? 0 : 64);
	}
	return ffz(b2) + ofs;
}

It compiles down to

	ldq $2,0($16)
	ldq $3,8($16)
	lda $5,128($31)
	ldq $0,16($16)
	and $2,$3,$1
	ornot $31,$2,$4
	ornot $31,$1,$1
	bne $1,$L8
$L2:
	ornot $31,$0,$0
	cttz $0,$0
	addl $0,$5,$0
	ret $31,($26),1
$L8:
	mov $2,$0
	cmpult $31,$4,$5
	cmovne $4,$3,$0
	sll $5,6,$5
	br $31,$L2

which is a fair bit better than find_first_zero_bit, if for no other
reason than that we collect all the memory accesses right up at the
beginning.

While we're on the subject of sched_find_first_zero_bit, I'd like to
complain about Ingo's choice of header file. Why in the world did you
choose mmu_context.h? Invent a new asm/sched.h if you must, but please
don't choose headers at random.


r~

^ permalink raw reply	[flat|nested] 41+ messages in thread
* Re: [patch] O(1) scheduler, -D1, 2.5.2-pre9, 2.4.17
  2002-01-10  1:09 ` Richard Henderson
@ 2002-01-10 17:04   ` Ivan Kokshaysky
  2002-01-10 20:42     ` george anzinger
  2002-01-10 23:56     ` Ingo Molnar
  0 siblings, 2 replies; 41+ messages in thread

From: Ivan Kokshaysky @ 2002-01-10 17:04 UTC (permalink / raw)
To: Anton Blanchard, Ingo Molnar, Linus Torvalds, linux-kernel

On Wed, Jan 09, 2002 at 05:09:28PM -0800, Richard Henderson wrote:
> Careful. The following is really quite a bit better on Alpha:
>
> static inline int
> sched_find_first_zero_bit(unsigned long *bitmap)
> {
> 	unsigned long b0 = bitmap[0];
> 	unsigned long b1 = bitmap[1];
> 	unsigned long b2 = bitmap[2];
> 	unsigned long ofs = MAX_RT_PRIO;
>
> 	if (unlikely(~(b0 & b1) != 0)) {
> 		b2 = (~b0 == 0 ? b0 : b1);
> 		ofs = (~b0 == 0 ? 0 : 64);
> 	}
> 	return ffz(b2) + ofs;
> }

True. Minor correction:

-		b2 = (~b0 == 0 ? b0 : b1);
-		ofs = (~b0 == 0 ? 0 : 64);
+		b2 = (~b0 ? b0 : b1);
+		ofs = (~b0 ? 0 : 64);

Note that the comment for this function is a bit confusing:

 * ... It's the fastest
 * way of searching a 168-bit bitmap where the first 128 bits are
 * unlikely to be set.

s/set/cleared/

> While we're on the subject of sched_find_first_zero_bit, I'd
> like to complain about Ingo's choice of header file. Why in
> the world did you choose mmu_context.h? Invent a new asm/sched.h
> if you must, but please don't choose headers at random.

Agreed. Apparently asm/bitops.h?

Ivan.

^ permalink raw reply	[flat|nested] 41+ messages in thread
* Re: [patch] O(1) scheduler, -D1, 2.5.2-pre9, 2.4.17
  2002-01-10 17:04 ` Ivan Kokshaysky
@ 2002-01-10 20:42   ` george anzinger
  0 siblings, 0 replies; 41+ messages in thread

From: george anzinger @ 2002-01-10 20:42 UTC (permalink / raw)
To: Ivan Kokshaysky
Cc: Anton Blanchard, Ingo Molnar, Linus Torvalds, linux-kernel

Ivan Kokshaysky wrote:
>
> On Wed, Jan 09, 2002 at 05:09:28PM -0800, Richard Henderson wrote:
> > Careful. The following is really quite a bit better on Alpha:
> >
> > static inline int
> > sched_find_first_zero_bit(unsigned long *bitmap)
> > {
> > 	unsigned long b0 = bitmap[0];
> > 	unsigned long b1 = bitmap[1];
> > 	unsigned long b2 = bitmap[2];
> > 	unsigned long ofs = MAX_RT_PRIO;
> >
> > 	if (unlikely(~(b0 & b1) != 0)) {
> > 		b2 = (~b0 == 0 ? b0 : b1);
> > 		ofs = (~b0 == 0 ? 0 : 64);
> > 	}
> > 	return ffz(b2) + ofs;
> > }
>
> True. Minor correction:
> -		b2 = (~b0 == 0 ? b0 : b1);
> -		ofs = (~b0 == 0 ? 0 : 64);
> +		b2 = (~b0 ? b0 : b1);
> +		ofs = (~b0 ? 0 : 64);
>
> Note that comment for this function is a bit confusing:
> * ... It's the fastest
> * way of searching a 168-bit bitmap where the first 128 bits are
> * unlikely to be set.

What if we want a 2048-bit bitmap???

> s/set/cleared/
>
> [...]

--
George           george@mvista.com
High-res-timers: http://sourceforge.net/projects/high-res-timers/
Real time sched: http://sourceforge.net/projects/rtsched/

^ permalink raw reply	[flat|nested] 41+ messages in thread
* Re: [patch] O(1) scheduler, -D1, 2.5.2-pre9, 2.4.17
  2002-01-10 17:04 ` Ivan Kokshaysky
  2002-01-10 20:42   ` george anzinger
@ 2002-01-10 23:56   ` Ingo Molnar
  1 sibling, 0 replies; 41+ messages in thread

From: Ingo Molnar @ 2002-01-10 23:56 UTC (permalink / raw)
To: Ivan Kokshaysky; +Cc: Anton Blanchard, Linus Torvalds, linux-kernel

On Thu, 10 Jan 2002, Ivan Kokshaysky wrote:

> Note that comment for this function is a bit confusing:
> * ... It's the fastest
> * way of searching a 168-bit bitmap where the first 128 bits are
> * unlikely to be set.
>
> s/set/cleared/

no, it's really 'cleared'. The bits are inverted right now.

	Ingo

^ permalink raw reply	[flat|nested] 41+ messages in thread
* [patch] O(1) scheduler, -E1, 2.5.2-pre10, 2.4.17
  2002-01-08 11:32 ` Anton Blanchard
  2002-01-08 11:43   ` Anton Blanchard
@ 2002-01-08 14:32   ` Ingo Molnar
  1 sibling, 0 replies; 41+ messages in thread

From: Ingo Molnar @ 2002-01-08 14:32 UTC (permalink / raw)
To: linux-kernel; +Cc: Linus Torvalds, Anton Blanchard, Davide Libenzi

this is the latest update of the O(1) scheduler:

   http://redhat.com/~mingo/O(1)-scheduler/sched-O1-2.5.2-pre10-E1.patch
   http://redhat.com/~mingo/O(1)-scheduler/sched-O1-2.4.17-E1.patch

now that Linus has put the -D2 patch into the 2.5.2-pre10 kernel, the
2.5.2-pre10-E1 patch has become quite small :-)

The patch compiles, boots & works just fine on my UP/SMP boxes.

Changes since -D2:

 - make rq->bitmap big-endian safe. (Anton Blanchard)

 - documented and cleaned up the load estimator bits; no functional
   changes apart from small speedups.

 - do init_idle() before starting up the init thread. This removes a
   race where we'd run the init thread on CPU#0 before init_idle() has
   been called.

	Ingo

^ permalink raw reply	[flat|nested] 41+ messages in thread
* [patch] O(1) scheduler, -D1, 2.5.2-pre9, 2.4.17
@ 2002-01-07 20:24 Ingo Molnar
  2002-01-07 19:03 ` Brian Gerst
  2002-01-09  3:39 ` Mike Kravetz
  0 siblings, 2 replies; 41+ messages in thread

From: Ingo Molnar @ 2002-01-07 20:24 UTC (permalink / raw)
To: Linus Torvalds; +Cc: linux-kernel, george anzinger, Davide Libenzi

-D1 is a quick update over -D0:

   http://redhat.com/~mingo/O(1)-scheduler/sched-O1-2.5.2-D1.patch
   http://redhat.com/~mingo/O(1)-scheduler/sched-O1-2.4.17-D1.patch

this should fix the child-inherits-parent-priority-boost issue that
causes interactivity problems during compilation.

	Ingo

^ permalink raw reply	[flat|nested] 41+ messages in thread
* Re: [patch] O(1) scheduler, -D1, 2.5.2-pre9, 2.4.17
  2002-01-07 20:24 [patch] O(1) scheduler, -D1, 2.5.2-pre9, 2.4.17 Ingo Molnar
@ 2002-01-07 19:03 ` Brian Gerst
  2002-01-07 21:19   ` Ingo Molnar
  2002-01-09  3:39 ` Mike Kravetz
  1 sibling, 1 reply; 41+ messages in thread

From: Brian Gerst @ 2002-01-07 19:03 UTC (permalink / raw)
To: mingo; +Cc: linux-kernel

Ingo Molnar wrote:
>
> -D1 is a quick update over -D0:
>
> http://redhat.com/~mingo/O(1)-scheduler/sched-O1-2.5.2-D1.patch
> http://redhat.com/~mingo/O(1)-scheduler/sched-O1-2.4.17-D1.patch
>
> this should fix the child-inherits-parent-priority-boost issue that causes
> interactivity problems during compilation.
>
> Ingo

I noticed in this patch that you removed the rest_init() function. The
reason it was split from start_kernel() is that there was a race where
init memory could be freed before the call to cpu_idle(). Note that
start_kernel() is marked __init and rest_init() is not.

--
Brian Gerst

^ permalink raw reply	[flat|nested] 41+ messages in thread
* Re: [patch] O(1) scheduler, -D1, 2.5.2-pre9, 2.4.17
  2002-01-07 19:03 ` Brian Gerst
@ 2002-01-07 21:19   ` Ingo Molnar
  0 siblings, 0 replies; 41+ messages in thread

From: Ingo Molnar @ 2002-01-07 21:19 UTC (permalink / raw)
To: Brian Gerst; +Cc: linux-kernel, Linus Torvalds

On Mon, 7 Jan 2002, Brian Gerst wrote:

> I noticed in this patch that you removed the rest_init() function.
> The reason it was split from start_kernel() is that there was a race
> where init memory could be freed before the call to cpu_idle(). Note
> that start_kernel() is marked __init and rest_init() is not.

you are right, I've missed that detail. I've fixed this in my tree
(reverted that part to the previous behavior); the fix will show up in
the next patch.

	Thanks,

		Ingo

^ permalink raw reply	[flat|nested] 41+ messages in thread
* Re: [patch] O(1) scheduler, -D1, 2.5.2-pre9, 2.4.17
  2002-01-07 20:24 [patch] O(1) scheduler, -D1, 2.5.2-pre9, 2.4.17 Ingo Molnar
  2002-01-07 19:03 ` Brian Gerst
@ 2002-01-09  3:39 ` Mike Kravetz
  2002-01-09  5:05   ` Davide Libenzi
                      ` (2 more replies)
  1 sibling, 3 replies; 41+ messages in thread

From: Mike Kravetz @ 2002-01-09 3:39 UTC (permalink / raw)
To: Ingo Molnar; +Cc: Linus Torvalds, linux-kernel, george anzinger, Davide Libenzi

Below are some benchmark results when running the D1 version of the
O(1) scheduler on 2.5.2-pre9. To add another data point, I hacked
together a half-a** multi-queue scheduler based on the 2.5.2-pre9
scheduler. haMQ doesn't do load balancing or anything fancy. However,
it aggressively tries to not let CPUs go idle (not always a good thing,
as has been previously discussed). For reference, the patch is at:
lse.sourceforge.net/scheduling/2.5.2-pre9-hamq
I can't recommend this code for anything useful.

All benchmarks were run on an 8-way Pentium III 700 MHz with 1MB
caches. The number of CPUs was altered via the maxcpus boot flag.

--------------------------------------------------------------------
mkbench - Time how long it takes to compile the kernel.
          We use 'make -j 8' and increase the number of makes run
          in parallel. Result is average build time in seconds.
          Lower is better.
--------------------------------------------------------------------
# CPUs    # Makes    Vanilla    O(1)    haMQ
--------------------------------------------------------------------
2         1          188        192     184
2         2          366        372     362
2         4          730        742     600
2         6          1096       1112    853
4         1          102        101     95
4         2          196        198     186
4         4          384        386     374
4         6          576        579     487
8         1          58         57      58
8         2          109        108     105
8         4          209        213     186
8         6          309        312     280

Surprisingly, O(1) seems to do worse than the vanilla scheduler in
almost all cases.

--------------------------------------------------------------------
Chat - VolanoMark simulator. Result is a measure of throughput.
       Higher is better.
--------------------------------------------------------------------
Configuration Parms       # CPUs    Vanilla    O(1)      haMQ
--------------------------------------------------------------------
10 rooms, 200 messages    2         162644     145915    137097
20 rooms, 200 messages    2         145872     136134    138646
30 rooms, 200 messages    2         124314     183366    144403
10 rooms, 200 messages    4         201745     258444    255415
20 rooms, 200 messages    4         177854     246032    263723
30 rooms, 200 messages    4         153506     302615    257170
10 rooms, 200 messages    8         121792     262804    310603
20 rooms, 200 messages    8         68697      248406    420157
30 rooms, 200 messages    8         42133      302513    283817

O(1) scheduler does better than Vanilla as load and number of CPUs
increase. Still need to look into why it does worse on the less loaded
2 CPU runs.

--------------------------------------------------------------------
Reflex - lat_ctx (of LMbench) on steroids. Does token passing to
         over-emphasize scheduler paths. Allows loading of the
         runqueue, unlike lat_ctx. Result is microseconds per round.
         All runs with 0 delay.
         lse.sourceforge.net/scheduling/other/reflex/
         Lower is better.
--------------------------------------------------------------------
# tasks    # CPUs    Vanilla    O(1)      haMQ
--------------------------------------------------------------------
2          2         6.594      14.388    15.996
4          2         6.988      3.787     4.686
8          2         7.322      3.757     5.148
16         2         7.234      3.737     7.244
32         2         7.651      5.135     7.182
64         2         9.462      3.948     7.553
128        2         13.889     4.584     7.918
2          4         6.019      14.646    15.403
4          4         10.997     6.213     6.755
8          4         9.838      2.160     2.838
16         4         10.595     2.154     3.080
32         4         11.870     2.917     3.400
64         4         15.280     2.890     3.131
128        4         19.832     2.685     3.307
2          8         6.338      9.064     15.474
4          8         11.454     7.020     8.281
8          8         13.354     4.390     5.816
16         8         14.976     1.502     2.018
32         8         16.757     1.920     2.240
64         8         19.961     2.264     2.358
128        8         25.010     2.280     2.260

I believe the poor showings for O(1) at the low end are the result of
having the 2 tasks run on 2 different CPUs. This is the right thing to
do in spite of the numbers. You can see lock contention become a factor
in the Vanilla scheduler as load and number of CPUs increase. Having
multiple runqueues eliminates this problem.

--
Mike

^ permalink raw reply	[flat|nested] 41+ messages in thread
* Re: [patch] O(1) scheduler, -D1, 2.5.2-pre9, 2.4.17
  2002-01-09  3:39 ` Mike Kravetz
@ 2002-01-09  5:05   ` Davide Libenzi
  2002-01-09  3:32     ` Rusty Russell
  2002-01-09 11:37     ` Ingo Molnar
  2002-01-09  6:29   ` Brian
  2002-01-09 10:25   ` Ingo Molnar
  2 siblings, 2 replies; 41+ messages in thread

From: Davide Libenzi @ 2002-01-09 5:05 UTC (permalink / raw)
To: Mike Kravetz; +Cc: Ingo Molnar, Linus Torvalds, lkml, george anzinger

On Tue, 8 Jan 2002, Mike Kravetz wrote:

> Below are some benchmark results when running the D1 version
> of the O(1) scheduler on 2.5.2-pre9. To add another data point,
> I hacked together a half-a** multi-queue scheduler based on
> the 2.5.2-pre9 scheduler. haMQ doesn't do load balancing or
> anything fancy. However, it aggressively tries to not let CPUs
> go idle (not always a good thing as has been previously discussed).
> For reference, patch is at: lse.sourceforge.net/scheduling/2.5.2-pre9-hamq
> I can't recommend this code for anything useful.
>
> All benchmarks were run on an 8-way Pentium III 700 MHz 1MB caches.
> Number of CPUs was altered via the maxcpus boot flag.
>
> [...]
>
> I believe the poor showings for O(1) at the low end are the
> result of having the 2 tasks run on 2 different CPUs. This
> is the right thing to do in spite of the numbers. You
> can see lock contention become a factor in the Vanilla scheduler
> as load and number of CPUs increase. Having multiple runqueues
> eliminates this problem.

Awesome job Mike. Ingo's O(1) scheduler is still 'young' and can be
improved, especially from a balancing point of view. I think it's here
that the real challenge will take place (even if Linus keeps saying
that it's easy :-)). Mike, can you try the patch listed below on custom
pre-10? I've got 30-70% better performance with the chat_s/c test.

PS: next time we have lunch I'll tell you about a wonderful tool called
gnuplot :)

- Davide

diff -Nru linux-2.5.2-pre10.vanilla/include/linux/sched.h linux-2.5.2-pre10.mqo1/include/linux/sched.h
--- linux-2.5.2-pre10.vanilla/include/linux/sched.h	Mon Jan  7 17:12:45 2002
+++ linux-2.5.2-pre10.mqo1/include/linux/sched.h	Mon Jan  7 21:45:19 2002
@@ -305,11 +305,7 @@
 	prio_array_t *array;
 
 	unsigned int time_slice;
-	unsigned long sleep_timestamp, run_timestamp;
-
-#define SLEEP_HIST_SIZE 4
-	int sleep_hist[SLEEP_HIST_SIZE];
-	int sleep_idx;
+	unsigned long swap_cnt_last;
 
 	unsigned long policy;
 	unsigned long cpus_allowed;
diff -Nru linux-2.5.2-pre10.vanilla/kernel/fork.c linux-2.5.2-pre10.mqo1/kernel/fork.c
--- linux-2.5.2-pre10.vanilla/kernel/fork.c	Mon Jan  7 17:12:45 2002
+++ linux-2.5.2-pre10.mqo1/kernel/fork.c	Mon Jan  7 18:49:34 2002
@@ -705,9 +705,6 @@
 		current->time_slice = 1;
 		expire_task(current);
 	}
-	p->sleep_timestamp = p->run_timestamp = jiffies;
-	memset(p->sleep_hist, 0, sizeof(p->sleep_hist[0])*SLEEP_HIST_SIZE);
-	p->sleep_idx = 0;
 	__restore_flags(flags);
 
 	/*
diff -Nru linux-2.5.2-pre10.vanilla/kernel/sched.c linux-2.5.2-pre10.mqo1/kernel/sched.c
--- linux-2.5.2-pre10.vanilla/kernel/sched.c	Mon Jan  7 17:12:45 2002
+++ linux-2.5.2-pre10.mqo1/kernel/sched.c	Tue Jan  8 18:28:02 2002
@@ -48,6 +48,7 @@
 	spinlock_t lock;
 	unsigned long nr_running, nr_switches, last_rt_event;
 	task_t *curr, *idle;
+	unsigned long swap_cnt;
 	prio_array_t *active, *expired, arrays[2];
 	char __pad [SMP_CACHE_BYTES];
 } runqueues [NR_CPUS] __cacheline_aligned;
@@ -91,115 +92,20 @@
 	p->array = array;
 }
 
-/*
- * This is the per-process load estimator. Processes that generate
- * more load than the system can handle get a priority penalty.
- *
- * The estimator uses a 4-entry load-history ringbuffer which is
- * updated whenever a task is moved to/from the runqueue. The load
- * estimate is also updated from the timer tick to get an accurate
- * estimation of currently executing tasks as well.
- */
-#define NEXT_IDX(idx) (((idx) + 1) % SLEEP_HIST_SIZE)
-
-static inline void update_sleep_avg_deactivate(task_t *p)
-{
-	unsigned int idx;
-	unsigned long j = jiffies, last_sample = p->run_timestamp / HZ,
-		curr_sample = j / HZ, delta = curr_sample - last_sample;
-
-	if (unlikely(delta)) {
-		if (delta < SLEEP_HIST_SIZE) {
-			for (idx = 0; idx < delta; idx++) {
-				p->sleep_idx++;
-				p->sleep_idx %= SLEEP_HIST_SIZE;
-				p->sleep_hist[p->sleep_idx] = 0;
-			}
-		} else {
-			for (idx = 0; idx < SLEEP_HIST_SIZE; idx++)
-				p->sleep_hist[idx] = 0;
-			p->sleep_idx = 0;
-		}
-	}
-	p->sleep_timestamp = j;
-}
-
-#if SLEEP_HIST_SIZE != 4
-# error update this code.
-#endif
-
-static inline unsigned int get_sleep_avg(task_t *p, unsigned long j)
-{
-	unsigned int sum;
-
-	sum = p->sleep_hist[0];
-	sum += p->sleep_hist[1];
-	sum += p->sleep_hist[2];
-	sum += p->sleep_hist[3];
-
-	return sum * HZ / ((SLEEP_HIST_SIZE-1)*HZ + (j % HZ));
-}
-
-static inline void update_sleep_avg_activate(task_t *p, unsigned long j)
-{
-	unsigned int idx;
-	unsigned long delta_ticks, last_sample = p->sleep_timestamp / HZ,
-		curr_sample = j / HZ, delta = curr_sample - last_sample;
-
-	if (unlikely(delta)) {
-		if (delta < SLEEP_HIST_SIZE) {
-			p->sleep_hist[p->sleep_idx] += HZ - (p->sleep_timestamp % HZ);
-			p->sleep_idx++;
-			p->sleep_idx %= SLEEP_HIST_SIZE;
-
-			for (idx = 1; idx < delta; idx++) {
-				p->sleep_idx++;
-				p->sleep_idx %= SLEEP_HIST_SIZE;
-				p->sleep_hist[p->sleep_idx] = HZ;
-			}
-		} else {
-			for (idx = 0; idx < SLEEP_HIST_SIZE; idx++)
-				p->sleep_hist[idx] = HZ;
-			p->sleep_idx = 0;
-		}
-		p->sleep_hist[p->sleep_idx] = 0;
-		delta_ticks = j % HZ;
-	} else
-		delta_ticks = j - p->sleep_timestamp;
-	p->sleep_hist[p->sleep_idx] += delta_ticks;
-	p->run_timestamp = j;
-}
-
 static inline void activate_task(task_t *p, runqueue_t *rq)
 {
 	prio_array_t *array = rq->active;
-	unsigned long j = jiffies;
-	unsigned int sleep, load;
-	int penalty;
 
-	if (likely(p->run_timestamp == j))
-		goto enqueue;
-	/*
-	 * Give the process a priority penalty if it has not slept often
-	 * enough in the past. We scale the priority penalty according
-	 * to the current load of the runqueue, and the 'load history'
-	 * this process has. Eg. if the CPU has 3 processes running
-	 * right now then a process that has slept more than two-thirds
-	 * of the time is considered to be 'interactive'. The higher
-	 * the load of the CPUs is, the easier it is for a process to
-	 * get an non-interactivity penalty.
- */ -#define MAX_PENALTY (MAX_USER_PRIO/3) - update_sleep_avg_activate(p, j); - sleep = get_sleep_avg(p, j); - load = HZ - sleep; - penalty = (MAX_PENALTY * load)/HZ; if (!rt_task(p)) { - p->prio = NICE_TO_PRIO(p->__nice) + penalty; - if (p->prio > MAX_PRIO-1) - p->prio = MAX_PRIO-1; + unsigned long prio_bonus = rq->swap_cnt - p->swap_cnt_last; + + p->swap_cnt_last = rq->swap_cnt; + if (prio_bonus > MAX_PRIO) + prio_bonus = MAX_PRIO; + p->prio -= prio_bonus; + if (p->prio < MAX_RT_PRIO) + p->prio = MAX_RT_PRIO; } -enqueue: enqueue_task(p, array); rq->nr_running++; } @@ -209,7 +115,6 @@ rq->nr_running--; dequeue_task(p, p->array); p->array = NULL; - update_sleep_avg_deactivate(p); } static inline void resched_task(task_t *p) @@ -535,33 +440,16 @@ p->need_resched = 1; if (rt_task(p)) p->time_slice = RT_PRIO_TO_TIMESLICE(p->prio); - else + else { p->time_slice = PRIO_TO_TIMESLICE(p->prio); - - /* - * Timeslice used up - discard any possible - * priority penalty: - */ - dequeue_task(p, rq->active); - /* - * Tasks that have nice values of -20 ... -15 are put - * back into the active array. If they use up too much - * CPU time then they'll get a priority penalty anyway - * so this can not starve other processes accidentally. - * Otherwise this is pretty handy for sysadmins ... 
- */ - if (p->prio <= MAX_RT_PRIO + MAX_PENALTY/2) - enqueue_task(p, rq->active); - else + /* + * Timeslice used up - discard any possible + * priority penalty: + */ + dequeue_task(p, rq->active); + if (++p->prio >= MAX_PRIO) + p->prio = MAX_PRIO - 1; enqueue_task(p, rq->expired); - } else { - /* - * Deactivate + activate the task so that the - * load estimator gets updated properly: - */ - if (!rt_task(p)) { - deactivate_task(p, rq); - activate_task(p, rq); } } load_balance(rq); @@ -616,6 +504,7 @@ rq->active = rq->expired; rq->expired = array; array = rq->active; + rq->swap_cnt++; } idx = sched_find_first_zero_bit(array->bitmap); @@ -1301,6 +1190,7 @@ rq->expired = rq->arrays + 1; spin_lock_init(&rq->lock); rq->cpu = i; + rq->swap_cnt = 0; for (j = 0; j < 2; j++) { array = rq->arrays + j; ^ permalink raw reply [flat|nested] 41+ messages in thread
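For readers skimming the patch above, its heart is the new activate-time bonus: instead of the per-task sleep-history ring buffer, each runqueue keeps a `swap_cnt` that is incremented on every active/expired array switch, and a waking task's priority is boosted by the number of switches it slept through. A minimal user-space sketch of that calculation follows; the struct fields are simplified stand-ins, and the `MAX_PRIO`/`MAX_RT_PRIO` values are illustrative, not necessarily the ones in the patched kernel:

```c
#include <assert.h>

#define MAX_RT_PRIO 100   /* illustrative: priorities below this are RT */
#define MAX_PRIO    140   /* illustrative: MAX_RT_PRIO..MAX_PRIO-1 is the nice range */

struct task { int prio; unsigned long swap_cnt_last; };
struct runqueue { unsigned long swap_cnt; };  /* bumped on every array switch */

/* Boost a waking task by the number of array switches it slept through:
 * the longer it stayed off the runqueue, the bigger the priority bonus
 * (lower numeric prio = higher priority). */
static void activate_bonus(struct task *p, struct runqueue *rq)
{
	unsigned long prio_bonus = rq->swap_cnt - p->swap_cnt_last;

	p->swap_cnt_last = rq->swap_cnt;
	if (prio_bonus > MAX_PRIO)
		prio_bonus = MAX_PRIO;
	p->prio -= (int)prio_bonus;
	if (p->prio < MAX_RT_PRIO)
		p->prio = MAX_RT_PRIO;  /* never promote into the RT range */
}
```

The final clamp mirrors the patch's `if (p->prio < MAX_RT_PRIO) p->prio = MAX_RT_PRIO;`: however long a normal task sleeps, the bonus can never push it into real-time priorities.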
* Re: [patch] O(1) scheduler, -D1, 2.5.2-pre9, 2.4.17 2002-01-09 5:05 ` Davide Libenzi @ 2002-01-09 3:32 ` Rusty Russell 2002-01-09 18:02 ` Davide Libenzi 2002-01-09 11:37 ` Ingo Molnar 1 sibling, 1 reply; 41+ messages in thread From: Rusty Russell @ 2002-01-09 3:32 UTC (permalink / raw) To: Davide Libenzi; +Cc: kravetz, mingo, torvalds, linux-kernel, george On Tue, 8 Jan 2002 21:05:23 -0800 (PST) Davide Libenzi <davidel@xmailserver.org> wrote: > Mike can you try the patch listed below on custom pre-10 ? > I've got 30-70% better performances with the chat_s/c test. I'd encourage you to use hackbench, which is basically "the part of chat_c/s that is interesting". And I'd encourage you to come up with a better name, too 8) Cheers, Rusty. /* Simple scheduler test. */ #include <unistd.h> #include <fcntl.h> #include <stdio.h> #include <stdlib.h> #include <string.h> #include <errno.h> #include <sys/types.h> #include <sys/socket.h> #include <sys/wait.h> #include <sys/time.h> #include <sys/poll.h> static int use_pipes = 0; static void barf(const char *msg) { fprintf(stderr, "%s (error: %s)\n", msg, strerror(errno)); exit(1); } static void fdpair(int fds[2]) { if (use_pipes) { if (pipe(fds) == 0) return; } else { if (socketpair(AF_UNIX, SOCK_STREAM, 0, fds) == 0) return; } barf("Creating fdpair"); } /* Block until we're ready to go */ static void ready(int ready_out, int wakefd) { char dummy; struct pollfd pollfd = { .fd = wakefd, .events = POLLIN }; /* Tell them we're ready. 
*/ if (write(ready_out, &dummy, 1) != 1) barf("CLIENT: ready write"); /* Wait for "GO" signal */ if (poll(&pollfd, 1, -1) != 1) barf("poll"); } static void reader(int ready_out, int wakefd, unsigned int loops, int fd) { char dummy; unsigned int i; ready(ready_out, wakefd); for (i = 0; i < loops; i++) { if (read(fd, &dummy, 1) != 1) barf("READER: read"); } } /* Start the server */ static void server(int ready_out, int wakefd, unsigned int loops, unsigned int num_fds) { unsigned int i; int write_fds[num_fds]; unsigned int counters[num_fds]; for (i = 0; i < num_fds; i++) { int fds[2]; fdpair(fds); switch (fork()) { case -1: barf("fork()"); case 0: close(fds[1]); reader(ready_out, wakefd, loops, fds[0]); exit(0); } close(fds[0]); write_fds[i] = fds[1]; if (fcntl(write_fds[i], F_SETFL, O_NONBLOCK) != 0) barf("fcntl NONBLOCK"); counters[i] = 0; } ready(ready_out, wakefd); for (i = 0; i < loops * num_fds;) { unsigned int j; char dummy; for (j = 0; j < num_fds; j++) { if (counters[j] < loops) { if (write(write_fds[j], &dummy, 1) == 1) { counters[j]++; i++; } else if (errno != EAGAIN) barf("write"); } } } /* Reap them all */ for (i = 0; i < num_fds; i++) { int status; wait(&status); if (!WIFEXITED(status)) exit(1); } exit(0); } int main(int argc, char *argv[]) { unsigned int i; struct timeval start, stop, diff; unsigned int num_fds; int readyfds[2], wakefds[2]; char dummy; int status; if (argv[1] && strcmp(argv[1], "-pipe") == 0) { use_pipes = 1; argc--; argv++; } if (argc != 2 || (num_fds = atoi(argv[1])) == 0) barf("Usage: hackbench2 [-pipe] <num pipes>\n"); fdpair(readyfds); fdpair(wakefds); switch (fork()) { case -1: barf("fork()"); case 0: server(readyfds[1], wakefds[0], 10000, num_fds); exit(0); } /* Wait for everyone to be ready */ for (i = 0; i < num_fds+1; i++) if (read(readyfds[0], &dummy, 1) != 1) barf("Reading for readyfds"); gettimeofday(&start, NULL); /* Kick them off */ if (write(wakefds[1], &dummy, 1) != 1) barf("Writing to start them"); /* Reap server */ 
wait(&status); if (!WIFEXITED(status)) exit(1); gettimeofday(&stop, NULL); /* Print time... */ timersub(&stop, &start, &diff); printf("Time: %lu.%03lu\n", diff.tv_sec, diff.tv_usec/1000); exit(0); } -- Anyone who quotes me in their sig is an idiot. -- Rusty Russell. ^ permalink raw reply [flat|nested] 41+ messages in thread
* Re: [patch] O(1) scheduler, -D1, 2.5.2-pre9, 2.4.17 2002-01-09 3:32 ` Rusty Russell @ 2002-01-09 18:02 ` Davide Libenzi 0 siblings, 0 replies; 41+ messages in thread From: Davide Libenzi @ 2002-01-09 18:02 UTC (permalink / raw) To: Rusty Russell Cc: Mike Kravetz, Ingo Molnar, Linus Torvalds, lkml, georgr anzinger On Wed, 9 Jan 2002, Rusty Russell wrote: > On Tue, 8 Jan 2002 21:05:23 -0800 (PST) > Davide Libenzi <davidel@xmailserver.org> wrote: > > Mike can you try the patch listed below on custom pre-10 ? > > I've got 30-70% better performances with the chat_s/c test. > > I'd encourage you to use hackbench, which is basically "the part of chat_c/s > that is interesting". > > And I'd encourage you to come up with a better name, too 8) Got it. I'll try. - Davide ^ permalink raw reply [flat|nested] 41+ messages in thread
* Re: [patch] O(1) scheduler, -D1, 2.5.2-pre9, 2.4.17 2002-01-09 5:05 ` Davide Libenzi 2002-01-09 3:32 ` Rusty Russell @ 2002-01-09 11:37 ` Ingo Molnar 2002-01-09 11:19 ` Rene Rebe ` (2 more replies) 1 sibling, 3 replies; 41+ messages in thread From: Ingo Molnar @ 2002-01-09 11:37 UTC (permalink / raw) To: Davide Libenzi; +Cc: Mike Kravetz, Linus Torvalds, lkml, george anzinger On Tue, 8 Jan 2002, Davide Libenzi wrote: > Mike can you try the patch listed below on custom pre-10 ? > I've got 30-70% better performances with the chat_s/c test. i've compared this patch of yours (which changes the way interactivity is detected and timeslices are distributed), to 2.5.2-pre10-vanilla on a 2-way 466 MHz Celeron box: davide-patch-2.5.2-pre10 running at default priority: # ./chat_s 127.0.0.1 # ./chat_c 127.0.0.1 10 1000 Average throughput : 123103 messages per second Average throughput : 105122 messages per second Average throughput : 112901 messages per second [ system is *unusable* interactively, during the whole test. ] davide-patch-2.5.2-pre10 running at nice level 19: # nice -n 19 ./chat_s 127.0.0.1 # nice -n 19 ./chat_c 127.0.0.1 10 1000 Average throughput : 109337 messages per second Average throughput : 122077 messages per second Average throughput : 105296 messages per second [ system is *unusable* interactively, despite renicing. ] 2.5.2-pre10-vanilla running the test at the default priority level: # ./chat_s 127.0.0.1 # ./chat_c 127.0.0.1 10 1000 Average throughput : 124676 messages per second Average throughput : 102244 messages per second Average throughput : 115841 messages per second [ system is unresponsive at the start of the test, but once the 2.5.2-pre10 load-estimator establishes which task is interactive and which one is not, the system becomes usable. Load can be felt and there are frequent delays in commands. 
] 2.5.2-pre10-vanilla running at nice level 19: # nice -n 19 ./chat_s 127.0.0.1 # nice -n 19 ./chat_c 127.0.0.1 10 1000 Average throughput : 214626 messages per second Average throughput : 220876 messages per second Average throughput : 225529 messages per second [ system is usable from the beginning - nice levels are working as expected. Load can be felt while executing shell commands, but the system is usable. Load cannot be felt in truly interactive applications like editors. ] Summary of throughput results: 2.5.2-pre10-vanilla is equivalent throughput-wise in the test with your patched kernel, but the vanilla kernel is about 100% faster than your patched kernel when running reniced. but the interactivity observations are the real showstoppers in my opinion. With your patch applied the system became *unbearably* slow during the test. i have three observations about why your patch causes these effects (we had email discussions about this topic in private already, so you probably know my position): - your patch adds the 'recalculation based priority distribution method' that is in 2.5.2-pre9 to the O(1) scheduler. (2.5.2-pre9's priority distribution scheme is an improved but conceptually equivalent version of the priority distribution scheme of 2.4.17 - which scheme was basically unchanged since 1991. ) originally the O(1) patch was using the priority distribution scheme of 2.5.2-pre9 (it's very easy to switch between the two methods), but i have changed it because: there is a flaw in the recalculation-based (array-switch based in O(1) scheduler terms) priority distribution scheme: interactive tasks will get new timeslices depending on the frequency of recalculations. But *exactly under load*, the frequency of recalculations gets very, very low - it can be more than 10 seconds. In the above test this property causes shell interactivity to degrade so dramatically. 
Interactive tasks might accumulate up to 64 timeslices, but it's easy for them to use up this reserve in such high load situations and they'll never get back any new timeslices. Mike, do you agree with this analysis? [if anyone wants to look at the new estimator code then please apply the -E1 patch to -pre10, which cleans up the estimator code and comments it, without changing functionality.] - your patch in essence makes the scheduler ignore things like nice level +19. We *used to* ignore nice levels, but with the new load estimator this has changed, and personally i dont think i want to go back to the old behavior. - the system i tested has a CPU more than twice as slow as yours. So i'd suggest for you to repeat those exact tests but increase the number of 'rooms' to something like 40 (i know you tried 20 rooms, i dont think it's enough), and increase the number of messages sent, from 1000 to 5000 or something like that. your patch indeed decreases the load estimation and interactivity detection overhead and code complexity - but as the above tests have shown, at the price of interactivity, and in some cases even at the price of throughput. Ingo ^ permalink raw reply [flat|nested] 41+ messages in thread
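Ingo's starvation argument can be made concrete with a back-of-the-envelope model (the numbers below are illustrative, not measurements). Under the array-switch scheme, timeslices are only recalculated when the active array empties, and with many always-runnable tasks that takes roughly nr_running full timeslices:

```c
#include <assert.h>

#define HZ 100  /* timer ticks per second; typical x86 value of the era */

/* Toy model of the array-switch scheme: with nr_running always-runnable
 * tasks, each holding a timeslice of 'slice' ticks, the active array only
 * empties - and timeslices are only recalculated - once every task has
 * drained its slice.  An interactive task that has used up its reserve
 * must wait about this long for a refill. */
static unsigned int refill_wait_secs(unsigned int nr_running, unsigned int slice)
{
	return nr_running * slice / HZ;
}
```

With the load Davide reported (around 200 runnable tasks) and hypothetical 50 ms slices, the refill interval is already ~10 seconds, which matches the "more than 10 seconds" figure above; at light load the same formula gives sub-second refills.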
* Re: [patch] O(1) scheduler, -D1, 2.5.2-pre9, 2.4.17 2002-01-09 11:37 ` Ingo Molnar @ 2002-01-09 11:19 ` Rene Rebe 2002-01-09 15:34 ` Ryan Cumming 2002-01-09 18:24 ` Davide Libenzi 2002-01-09 20:15 ` Linus Torvalds 2 siblings, 1 reply; 41+ messages in thread From: Rene Rebe @ 2002-01-09 11:19 UTC (permalink / raw) To: mingo; +Cc: davidel, kravetz, torvalds, linux-kernel, george Hi. From: Ingo Molnar <mingo@elte.hu> Subject: Re: [patch] O(1) scheduler, -D1, 2.5.2-pre9, 2.4.17 Date: Wed, 9 Jan 2002 12:37:46 +0100 (CET) [...] > 2.5.2-pre10-vanilla running the test at the default priority level: > > # ./chat_s 127.0.0.1 > # ./chat_c 127.0.0.1 10 1000 > > Average throughput : 124676 messages per second > Average throughput : 102244 messages per second > Average throughput : 115841 messages per second > > [ system is unresponsive at the start of the test, but > once the 2.5.2-pre10 load-estimator establishes which task is > interactive and which one is not, the system becomes usable. > Load can be felt and there are frequent delays in commands. ] > > 2.5.2-pre10-vanilla running at nice level 19: > > # nice -n 19 ./chat_s 127.0.0.1 > # nice -n 19 ./chat_c 127.0.0.1 10 1000 > > Average throughput : 214626 messages per second > Average throughput : 220876 messages per second > Average throughput : 225529 messages per second > > [ system is usable from the beginning - nice levels are working as > expected. Load can be felt while executing shell commands, but the > system is usable. Load cannot be felt in truly interactive > applications like editors. > > Summary of throughput results: 2.5.2-pre10-vanilla is equivalent > throughput-wise in the test with your patched kernel, but the vanilla > kernel is about 100% faster than your patched kernel when running reniced. Could someone tell a non-kernel-hacker why this benchmark is nearly twice as fast when running reniced??? Shouldn't it be slower when it runs with lower priority (And you execute / type some commands during it)? 
[...] > Ingo k33p h4ck1n6 René -- René Rebe (Registered Linux user: #248718 <http://counter.li.org>) eMail: rene.rebe@gmx.net rene@rocklinux.org Homepage: http://www.tfh-berlin.de/~s712059/index.html Anyone sending unwanted advertising e-mail to this address will be charged $25 for network traffic and computing time. By extracting my address from this message or its header, you agree to these terms. ^ permalink raw reply [flat|nested] 41+ messages in thread
* Re: [patch] O(1) scheduler, -D1, 2.5.2-pre9, 2.4.17 2002-01-09 11:19 ` Rene Rebe @ 2002-01-09 15:34 ` Ryan Cumming 0 siblings, 0 replies; 41+ messages in thread From: Ryan Cumming @ 2002-01-09 15:34 UTC (permalink / raw) To: Rene Rebe; +Cc: linux-kernel@vger.kernel.org On January 9, 2002 03:19, Rene Rebe wrote: > Could someone tell a non-kernel-hacker why this benchmark is nearly > twice as fast when running reniced??? Shouldn't it be slower when it > runs with lower priority (And you execute / type some commands during > it)? In addition to using the nice level as a priority hint, the new scheduler also uses it as a hint of how "CPU-bound" a process is. Negative (higher priority) nice levels give the process short, frequent timeslices. Positive priorities give the process long, infrequent time slices. On an otherwise (mostly) idle system, both processes will get the same amount of CPU time, but distributed in a different way. In applications that really don't care about interactivity, the long time slice will increase their efficiency greatly. In addition to having fewer context switches (and therefore less context switch overhead), the longer time slices give them more time to warm up the cache. This has been referred to as "batching", as the process is executing at once what would normally take many shorter timeslices to complete. So, what you're actually seeing is the reniced task not taking up more CPU time (it's probably actually using slightly less), just using the CPU time more efficiently. <worships Ingo> -Ryan ^ permalink raw reply [flat|nested] 41+ messages in thread
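The mapping Ryan describes can be sketched as a simple monotonic function from nice level to timeslice length. The linear form and the constants below are invented for illustration; they are not the kernel's actual PRIO_TO_TIMESLICE values:

```c
#include <assert.h>

/* Illustrative mapping in the spirit of Ryan's description: high-priority
 * (negative nice) tasks get short, frequent slices; low-priority (positive
 * nice) tasks get long, infrequent slices.  Bounds are hypothetical. */
static int nice_to_timeslice_ms(int nice)
{
	const int min_ms = 10, max_ms = 300;

	/* map nice in [-20, 19] linearly onto [min_ms, max_ms] */
	return min_ms + (nice + 20) * (max_ms - min_ms) / 39;
}
```

Under this model a nice +19 task runs for up to 300 ms at a stretch, so a throughput benchmark reniced to +19 pays far less context-switch and cache-refill overhead per unit of work, which is consistent with the ~2x numbers Ingo posted.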
* Re: [patch] O(1) scheduler, -D1, 2.5.2-pre9, 2.4.17 2002-01-09 11:37 ` Ingo Molnar 2002-01-09 11:19 ` Rene Rebe @ 2002-01-09 18:24 ` Davide Libenzi 2002-01-09 21:24 ` Ingo Molnar 2002-01-09 20:15 ` Linus Torvalds 2 siblings, 1 reply; 41+ messages in thread From: Davide Libenzi @ 2002-01-09 18:24 UTC (permalink / raw) To: Ingo Molnar; +Cc: Mike Kravetz, Linus Torvalds, lkml, george anzinger On Wed, 9 Jan 2002, Ingo Molnar wrote: > > On Tue, 8 Jan 2002, Davide Libenzi wrote: > > > Mike can you try the patch listed below on custom pre-10 ? > > I've got 30-70% better performances with the chat_s/c test. > > i've compared this patch of yours (which changes the way interactivity is > detected and timeslices are distributed), to 2.5.2-pre10-vanilla on a > 2-way 466 MHz Celeron box: > > davide-patch-2.5.2-pre10 running at default priority: > > # ./chat_s 127.0.0.1 > # ./chat_c 127.0.0.1 10 1000 > > Average throughput : 123103 messages per second > Average throughput : 105122 messages per second > Average throughput : 112901 messages per second > > [ system is *unusable* interactively, during the whole test. ] > > davide-patch-2.5.2-pre10 running at nice level 19: > > # nice -n 19 ./chat_s 127.0.0.1 > # nice -n 19 ./chat_c 127.0.0.1 10 1000 > > Average throughput : 109337 messages per second > Average throughput : 122077 messages per second > Average throughput : 105296 messages per second > > [ system is *unusable* interactively, despite renicing. ] > > 2.5.2-pre10-vanilla running the test at the default priority level: > > # ./chat_s 127.0.0.1 > # ./chat_c 127.0.0.1 10 1000 > > Average throughput : 124676 messages per second > Average throughput : 102244 messages per second > Average throughput : 115841 messages per second > > [ system is unresponsive at the start of the test, but > once the 2.5.2-pre10 load-estimator establishes which task is > interactive and which one is not, the system becomes usable. > Load can be felt and there are frequent delays in commands. 
] > > 2.5.2-pre10-vanilla running at nice level 19: > > # nice -n 19 ./chat_s 127.0.0.1 > # nice -n 19 ./chat_c 127.0.0.1 10 1000 > > Average throughput : 214626 messages per second > Average throughput : 220876 messages per second > Average throughput : 225529 messages per second > > [ system is usable from the beginning - nice levels are working as > expected. Load can be felt while executing shell commands, but the > system is usable. Load cannot be felt in truly interactive > applications like editors. > > Summary of throughput results: 2.5.2-pre10-vanilla is equivalent > throughput-wise in the test with your patched kernel, but the vanilla > kernel is about 100% faster than your patched kernel when running reniced. > > but the interactivity observations are the real showstoppers in my > opinion. With your patch applied the system became *unbearably* slow > during the test. Ingo, this is not the picture that i've got from my machine. ------------------------------------------------------------------- AMD Athlon 1GHz 256 Mb RAM, swap_cnt patch : # nice -n 19 chat_s 127.0.0.1 & # nice -n 19 chat_c 127.0.0.1 20 1000 125236 123988 128048 with : r b w swpd free buff cache si so bi bo in cs us sy id 198 0 0 1476 28996 8024 89408 0 0 0 108 812 19424 12 87 1 216 0 1 1476 32388 8024 89412 0 0 0 0 523 56344 9 91 0 134 0 1 1476 32812 8024 89412 0 0 0 0 578 32374 9 91 0 96 1 1 1476 33540 8024 89412 0 0 0 0 114 7910 13 87 0 81 0 0 1476 35412 8024 89420 0 0 0 12 657 54034 12 88 0 pre-10 : 135684 127456 132420 the niced -20 vmstat has not been run for the whole test time and the system seemed quite bad ( personal feeling, not for the whole test time but for 1-2 sec spots ) compared with the previous test. The whole point Ingo is that during the test we've had 200 tasks on the run queue with a cs 8000..50000 !!? 
AMD Athlon 1GHz, swap_cnt patch : # chat_s 127.0.0.1 & # chat_c 127.0.0.1 20 1000 118386 114464 117972 pre-10 : 90066 88234 92612 I was not able to identify any interactive feel difference here. ---------------------------------------------------------------------- Today i'll try the same on both my dual cpu systems ( PIII 733 and PIII 1GHz ) I really fail to understand why you're asking everyone to run your test reniced ?!? > - your patch in essence makes the scheduler ignore things like nice > level +19. We *used to* ignore nice levels, but with the new load > estimator this has changed, and personally i dont think i want to go > back to the old behavior. Ingo for the duration of the test the `nice -n 20 vmstat -n 1` never ran during the roughly 20 seconds of the test. With the swap_cnt correction it ran 5-6 times. > - the system i tested has a CPU more than twice as slow as yours. So i'd > suggest for you to repeat those exact tests but increase the number of > 'rooms' to something like 40 (i know you tried 20 rooms, i dont think > it's enough), and increase the number of messages sent, from 1000 to > 5000 or something like that. Ingo, with 20 rooms my system was loaded with more than 200 tasks on the run queue and was switching at 50000 times/sec. Don't you think that it's enough for a single cpu system ??!! > your patch indeed decreases the load estimation and interactivity > detection overhead and code complexity - but as the above tests have > shown, at the price of interactivity, and in some cases even at the price > of throughput. Ingo i tried to be as impartial as possible and during the test i was not able to identify any difference in system usability. As i wrote you in private, the only spot i've had of system unusability was running with stock pre10 ( but this could have happened occasionally ). - Davide ^ permalink raw reply [flat|nested] 41+ messages in thread
* Re: [patch] O(1) scheduler, -D1, 2.5.2-pre9, 2.4.17 2002-01-09 18:24 ` Davide Libenzi @ 2002-01-09 21:24 ` Ingo Molnar 2002-01-09 19:38 ` Mike Kravetz 2002-01-09 22:34 ` Mark Hahn 0 siblings, 2 replies; 41+ messages in thread From: Ingo Molnar @ 2002-01-09 21:24 UTC (permalink / raw) To: Davide Libenzi; +Cc: Mike Kravetz, Linus Torvalds, lkml, george anzinger On Wed, 9 Jan 2002, Davide Libenzi wrote: > the niced -20 vmstat has not been run for the whole test time and the [...] > Ingo for the duration of the test the `nice -n 20 vmstat -n 1` never > run for about the 20 seconds. With the swap_cnt correction it ran for > 5-6 times. no wonder, it should be 'nice -n -20 vmstat -n 1'. And you should also do a 'renice -20 $$ $PPID' before running vmstat. (if you are about to run comparisons, i'd suggest the -G1 patch so you'll have all the recent fixes.) Ingo ^ permalink raw reply [flat|nested] 41+ messages in thread
* Re: [patch] O(1) scheduler, -D1, 2.5.2-pre9, 2.4.17 2002-01-09 21:24 ` Ingo Molnar @ 2002-01-09 19:38 ` Mike Kravetz 2002-01-10 18:21 ` Mike Kravetz 2002-01-09 22:34 ` Mark Hahn 1 sibling, 1 reply; 41+ messages in thread From: Mike Kravetz @ 2002-01-09 19:38 UTC (permalink / raw) To: Ingo Molnar; +Cc: Davide Libenzi, Linus Torvalds, lkml, george anzinger On Wed, Jan 09, 2002 at 10:24:00PM +0100, Ingo Molnar wrote: > (if you are about to run > comparisons, i'd suggest the -G1 patch so you'll have all the recent > fixes.) I just kicked off another benchmark run to compare pre10, pre10 & G1 patch, pre10 & Davide's patch. chat and make will be run as before with the addition of chat reniced. I won't attempt to make any claims about interactive responsiveness. Simple throughput numbers. Results should be available in about 24 hours. -- Mike ^ permalink raw reply [flat|nested] 41+ messages in thread
* Re: [patch] O(1) scheduler, -D1, 2.5.2-pre9, 2.4.17 2002-01-09 19:38 ` Mike Kravetz @ 2002-01-10 18:21 ` Mike Kravetz 2002-01-10 19:08 ` Davide Libenzi 0 siblings, 1 reply; 41+ messages in thread From: Mike Kravetz @ 2002-01-10 18:21 UTC (permalink / raw) To: Ingo Molnar; +Cc: Davide Libenzi, Linus Torvalds, lkml, george anzinger On Wed, Jan 09, 2002 at 11:38:33AM -0800, Mike Kravetz wrote: > > I just kicked off another benchmark run to compare pre10, pre10 & G1 > patch, pre10 & Davide's patch. It wasn't a good night for benchmarking. I had a typo in the script to run chat reniced and as a result didn't collect any numbers for this. In addition, the kernel with Davide's patch failed to boot with 8 CPUs enabled. Can't see any '# CPU specific' mods in the patch. In any case, here is what I do have. -------------------------------------------------------------------- mkbench - Time how long it takes to compile the kernel. On this 8 CPU system we use 'make -j 8' and increase the number of makes run in parallel. Result is average build time in seconds. Lower is better. -------------------------------------------------------------------- # CPUs # Makes pre10 pre10-G1 pre10-Davide -------------------------------------------------------------------- 2 1 189 190 185 2 2 370 376 362 2 4 733 726* 717 2 6 1102 1082* 1077 4 1 101 99 101 4 2 199 192 195 4 4 387 382 381 4 6 581 551 568 8 1 58 56 - 8 2 110 104 - 8 4 214 204 - 8 6 314 305 - * Most likely statistically invalid results. I run these things 3 times to make sure results are at least consistent. With pre10-G1 results varied more than the others. Items marked with * had extremely high variations. -------------------------------------------------------------------- Chat - VolanoMark simulator. Result is a measure of throughput. Higher is better. 
-------------------------------------------------------------------- Configuration Parms # CPUs pre10 pre10-G1 pre10-Davide -------------------------------------------------------------------- 10 rooms, 200 messages 2 143041 107718 181556 20 rooms, 200 messages 2 147335 147151 166048 30 rooms, 200 messages 2 179370 190413 173135 10 rooms, 200 messages 4 264033 287076 272597 20 rooms, 200 messages 4 243873 241855 273219 30 rooms, 200 messages 4 303228 301175 278513 10 rooms, 200 messages 8 304754 306891 - 20 rooms, 200 messages 8 241077 301414 - 30 rooms, 200 messages 8 309485 333660 - -- Mike ^ permalink raw reply [flat|nested] 41+ messages in thread
* Re: [patch] O(1) scheduler, -D1, 2.5.2-pre9, 2.4.17 2002-01-10 18:21 ` Mike Kravetz @ 2002-01-10 19:08 ` Davide Libenzi 2002-01-10 19:09 ` Linus Torvalds 2002-01-10 19:15 ` Mike Kravetz 0 siblings, 2 replies; 41+ messages in thread From: Davide Libenzi @ 2002-01-10 19:08 UTC (permalink / raw) To: Mike Kravetz; +Cc: Ingo Molnar, Linus Torvalds, lkml, george anzinger On Thu, 10 Jan 2002, Mike Kravetz wrote: > On Wed, Jan 09, 2002 at 11:38:33AM -0800, Mike Kravetz wrote: > > > > I just kicked off another benchmark run to compare pre10, pre10 & G1 > > patch, pre10 & Davide's patch. > > It wasn't a good night for benchmarking. I had a typo in the > script to run chat reniced and as a result didn't collect any > numbers for this. In addition, the kernel with Davide's patch > failed to boot with 8 CPUs enabled. Can't see any '# CPU specific' > mods in the patch. In any case, here is what I do have. Doh !! Do you have a panic dump Mike ? - Davide ^ permalink raw reply [flat|nested] 41+ messages in thread
* Re: [patch] O(1) scheduler, -D1, 2.5.2-pre9, 2.4.17 2002-01-10 19:08 ` Davide Libenzi @ 2002-01-10 19:09 ` Linus Torvalds 2002-01-10 21:08 ` Davide Libenzi 2002-01-10 19:15 ` Mike Kravetz 1 sibling, 1 reply; 41+ messages in thread From: Linus Torvalds @ 2002-01-10 19:09 UTC (permalink / raw) To: Davide Libenzi; +Cc: Mike Kravetz, Ingo Molnar, lkml, george anzinger On Thu, 10 Jan 2002, Davide Libenzi wrote: > > > > It wasn't a good night for benchmarking. I had a typo in the > > script to run chat reniced and as a result didn't collect any > > numbers for this. In addition, the kernel with Davide's patch > > failed to boot with 8 CPUs enabled. Can't see any '# CPU specific' > > mods in the patch. In any case, here is what I do have. > > Doh !! Do you have a panic dump Mike ? I bet it's just the placement of "init_idle()" in init/main.c, which is unrelated to the scheduling proper, but if the kernel thread is started before the boot CPU has done its "init_idle()", then the scheduler state isn't really set up fully yet. (Old bug, I think its been there for a long time, I just think that the old scheduler didn't much care, and the "child runs first" logic in particular of the new scheduler probably just showed it more clearly) Linus ^ permalink raw reply [flat|nested] 41+ messages in thread
* Re: [patch] O(1) scheduler, -D1, 2.5.2-pre9, 2.4.17 2002-01-10 19:09 ` Linus Torvalds @ 2002-01-10 21:08 ` Davide Libenzi 0 siblings, 0 replies; 41+ messages in thread From: Davide Libenzi @ 2002-01-10 21:08 UTC (permalink / raw) To: Linus Torvalds; +Cc: Mike Kravetz, Ingo Molnar, lkml, george anzinger On Thu, 10 Jan 2002, Linus Torvalds wrote: > > On Thu, 10 Jan 2002, Davide Libenzi wrote: > > > > > > It wasn't a good night for benchmarking. I had a typo in the > > > script to run chat reniced and as a result didn't collect any > > > numbers for this. In addition, the kernel with Davide's patch > > > failed to boot with 8 CPUs enabled. Can't see any '# CPU specific' > > > mods in the patch. In any case, here is what I do have. > > > > Doh !! Do you have a panic dump Mike ? > > I bet it's just the placement of "init_idle()" in init/main.c, which is > unrelated to the scheduling proper, but if the kernel thread is started > before the boot CPU has done its "init_idle()", then the scheduler state > isn't really set up fully yet. > > (Old bug, I think its been there for a long time, I just think that the > old scheduler didn't much care, and the "child runs first" logic in > particular of the new scheduler probably just showed it more clearly) Uhm, seems fixed in pre11. Did you fix it in pre10->pre11 stage ? - Davide ^ permalink raw reply [flat|nested] 41+ messages in thread
* Re: [patch] O(1) scheduler, -D1, 2.5.2-pre9, 2.4.17
  2002-01-10 19:08     ` Davide Libenzi
  2002-01-10 19:09       ` Linus Torvalds
@ 2002-01-10 19:15       ` Mike Kravetz
  2002-01-10 20:05         ` Davide Libenzi
  1 sibling, 1 reply; 41+ messages in thread
From: Mike Kravetz @ 2002-01-10 19:15 UTC (permalink / raw)
To: Davide Libenzi; +Cc: Ingo Molnar, Linus Torvalds, lkml, george anzinger

On Thu, Jan 10, 2002 at 11:08:21AM -0800, Davide Libenzi wrote:
> On Thu, 10 Jan 2002, Mike Kravetz wrote:
> > >
> > > I just kicked off another benchmark run to compare pre10, pre10 & G1
> > > patch, pre10 & Davide's patch.
> >
> > It wasn't a good night for benchmarking.  I had a typo in the
> > script to run chat reniced and as a result didn't collect any
> > numbers for this.  In addition, the kernel with Davide's patch
> > failed to boot with 8 CPUs enabled.  Can't see any '# CPU specific'
> > mods in the patch.  In any case, here is what I do have.
>
> Doh !! Do you have a panic dump Mike ?

It didn't panic, but hung during the boot process.  After reading other
mail, this may be caused by the out of order locking bug/deadlock that
existed in this version of the O(1) scheduler.  I may be able to try and
verify later today.  Right now the machine is being used for something
else.

-- 
Mike

^ permalink raw reply	[flat|nested] 41+ messages in thread
* Re: [patch] O(1) scheduler, -D1, 2.5.2-pre9, 2.4.17
  2002-01-10 19:15       ` Mike Kravetz
@ 2002-01-10 20:05         ` Davide Libenzi
  0 siblings, 0 replies; 41+ messages in thread
From: Davide Libenzi @ 2002-01-10 20:05 UTC (permalink / raw)
To: Mike Kravetz; +Cc: Ingo Molnar, Linus Torvalds, lkml, george anzinger

On Thu, 10 Jan 2002, Mike Kravetz wrote:

> Right now the machine is being used for something else.

Do they know at IBM that you're using 8 way SMP systems to run
counter-strike servers ? :-)

- Davide

^ permalink raw reply	[flat|nested] 41+ messages in thread
* Re: [patch] O(1) scheduler, -D1, 2.5.2-pre9, 2.4.17
  2002-01-09 21:24   ` Ingo Molnar
  2002-01-09 19:38     ` Mike Kravetz
@ 2002-01-09 22:34     ` Mark Hahn
  2002-01-10 14:04       ` Ingo Molnar
  1 sibling, 1 reply; 41+ messages in thread
From: Mark Hahn @ 2002-01-09 22:34 UTC (permalink / raw)
To: Ingo Molnar; +Cc: lkml

> no wonder, it should be 'nice -n -20 vmstat -n 1'.  And you should also do

I keep a suid setrealtime wrapper around (UNSAFE!) for this kind of use:

#include <unistd.h>
#include <stdlib.h>
#include <stdio.h>
#include <sched.h>

int main(int argc, char *argv[]) {
	static struct sched_param sched_parms;
	int pid, wrapper = 0;

	if (argc <= 1)
		return 1;
	pid = atoi(argv[1]);
	if (!pid || argc != 2) {
		wrapper = 1;
		pid = getpid();
	}
	sched_parms.sched_priority = sched_get_priority_min(SCHED_FIFO);
	if (sched_setscheduler(pid, SCHED_FIFO, &sched_parms) == -1) {
		perror("cannot set realtime scheduling policy");
		return 1;
	}
	if (wrapper) {
		setuid(getuid());
		execvp(argv[1], &argv[1]);
		perror("exec failed");
		return 1;
	}
	return 0;
}

regards, mark hahn.

^ permalink raw reply	[flat|nested] 41+ messages in thread
* Re: [patch] O(1) scheduler, -D1, 2.5.2-pre9, 2.4.17
  2002-01-09 22:34     ` Mark Hahn
@ 2002-01-10 14:04       ` Ingo Molnar
  0 siblings, 0 replies; 41+ messages in thread
From: Ingo Molnar @ 2002-01-10 14:04 UTC (permalink / raw)
To: Mark Hahn; +Cc: lkml

On Wed, 9 Jan 2002, Mark Hahn wrote:

> > no wonder, it should be 'nice -n -20 vmstat -n 1'. And you should also do
>
> I keep a suid setrealtime wrapper around (UNSAFE!) for this kind of use:

nice -20 is an equivalent but safe version of the same (if you use my
patches).  I made priority levels -20 ... -16 to be 'super-high priority',
ie. such tasks never expire.  (they can still drop above prio -16 if they
use up too much CPU time, so they cannot lock up systems accidentally like
RT tasks.)  So it's in essence an 'admin priority', for super-important
shells.  I'm using it with great success.

	Ingo

^ permalink raw reply	[flat|nested] 41+ messages in thread
* Re: [patch] O(1) scheduler, -D1, 2.5.2-pre9, 2.4.17
  2002-01-09 11:37 ` Ingo Molnar
  2002-01-09 11:19   ` Rene Rebe
  2002-01-09 18:24   ` Davide Libenzi
@ 2002-01-09 20:15   ` Linus Torvalds
  2002-01-09 23:02     ` Ingo Molnar
  2 siblings, 1 reply; 41+ messages in thread
From: Linus Torvalds @ 2002-01-09 20:15 UTC (permalink / raw)
To: Ingo Molnar; +Cc: Davide Libenzi, Mike Kravetz, lkml, george anzinger

On Wed, 9 Jan 2002, Ingo Molnar wrote:
>
> 2.5.2-pre10-vanilla running the test at the default priority level:
>
>   # ./chat_s 127.0.0.1
>   # ./chat_c 127.0.0.1 10 1000
>
>   Average throughput : 124676 messages per second
>   Average throughput : 102244 messages per second
>   Average throughput : 115841 messages per second
>
>   [ system is unresponsive at the start of the test, but
>     once the 2.5.2-pre10 load-estimator establishes which task is
>     interactive and which one is not, the system becomes usable.
>     Load can be felt and there are frequent delays in commands. ]
>
> 2.5.2-pre10-vanilla running at nice level 19:
>
>   # nice -n 19 ./chat_s 127.0.0.1
>   # nice -n 19 ./chat_c 127.0.0.1 10 1000
>
>   Average throughput : 214626 messages per second
>   Average throughput : 220876 messages per second
>   Average throughput : 225529 messages per second
>
>   [ system is usable from the beginning - nice levels are working as
>     expected.  Load can be felt while executing shell commands, but the
>     system is usable.  Load cannot be felt in truly interactive
>     applications like editors. ]

Ingo, there's something wrong there.

Not a way in hell should "nice 19" cause the throughput to improve like
that.  It looks like this is a result of "nice 19" simply doing
_different_ scheduling, possibly more batch-like, and as such those
numbers cannot sanely be compared to anything else.

(And if they _are_ comparable, then you should be able to get the good
numbers even without "nice 19".  Quite frankly it sounds to me like the
whole chat benchmark is another "dbench", ie doing unbalanced scheduling
_helps_ it performance-wise, which implies that it's probably a bad
benchmark to look at numbers for).

		Linus

^ permalink raw reply	[flat|nested] 41+ messages in thread
* Re: [patch] O(1) scheduler, -D1, 2.5.2-pre9, 2.4.17
  2002-01-09 20:15   ` Linus Torvalds
@ 2002-01-09 23:02     ` Ingo Molnar
  0 siblings, 0 replies; 41+ messages in thread
From: Ingo Molnar @ 2002-01-09 23:02 UTC (permalink / raw)
To: Linus Torvalds; +Cc: Davide Libenzi, Mike Kravetz, lkml, george anzinger

On Wed, 9 Jan 2002, Linus Torvalds wrote:

> Not a way in hell should "nice 19" cause the throughput to improve
> like that.  It looks like this is a result of "nice 19" simply doing
> _different_ scheduling, possibly more batch-like, and as such those
> numbers cannot sanely be compared to anything else.

yes, this is what happens.  The difference is that the load estimator
'punishes' tasks to have lower priority, while the recalc-based method
gives a 'bonus'.  If run with nice +19 then the process cannot be
punished anymore, all the tasks will run on the same priority level - and
none can cause a preemption of the other one.  The priority limit is set
right at the nice +19 level.

is this an intended thing with nice +19 tasks?  I think so, at least for
some usages.  It could be fixed by adding some more priority space (+13
levels) they could explore into (but which couldn't be set as the default
priority).

So by having a ceiling it really behaves differently, very batch-like -
but that's what such benchmarks are asking for anyway ...  I think it's
an intended effect for CPU hogs as well - we do not want them to preempt
each other, they should each use up their timeslices fully and
round-robin nicely.

> (And if they _are_ comparable, then you should be able to get the good
> numbers even without "nice 19".  Quite frankly it sounds to me like the
> whole chat benchmark is another "dbench", ie doing unbalanced
> scheduling _helps_ it performance-wise, which implies that it's
> probably a bad benchmark to look at numbers for).

yes, agreed.  It's not really unbalanced scheduling, the scheduler is
still fair.  What doesn't happen is priority based preemption.

i think it could be a bonus to have such a scheduler mode - people don't
run shells at +19 niceness level, it's the known CPU hogs that get
started up with nice +19.  It's a kind of SCHED_IDLE - everything can
preempt it and it will preempt nothing, without the priority inheritance
problems of SCHED_IDLE.

	Ingo

^ permalink raw reply	[flat|nested] 41+ messages in thread
* Re: [patch] O(1) scheduler, -D1, 2.5.2-pre9, 2.4.17
  2002-01-09  3:39 ` Mike Kravetz
  2002-01-09  5:05   ` Davide Libenzi
@ 2002-01-09  6:29   ` Brian
  2002-01-09  6:40     ` Jeffrey W. Baker
                       ` (2 more replies)
  2002-01-09 10:25   ` Ingo Molnar
  2 siblings, 3 replies; 41+ messages in thread
From: Brian @ 2002-01-09 6:29 UTC (permalink / raw)
To: Mike Kravetz; +Cc: linux-kernel

Can this be correct?

Intuitively, I would expect several CPUs hammering away at the compile to
finish faster than one.  Given these numbers, I would have to conclude
that is not just wrong, but absolutely wrong.  Compile time increases
linearly with the number of jobs, regardless of the number of CPUs.

What would cause this?  Severe memory bottlenecks?

-- Brian

On Tuesday 08 January 2002 10:39 pm, Mike Kravetz wrote:
> --------------------------------------------------------------------
> mkbench - Time how long it takes to compile the kernel.
>           We use 'make -j 8' and increase the number of makes run
>           in parallel.  Result is average build time in seconds.
>           Lower is better.
> --------------------------------------------------------------------
>  # CPUs   # Makes   Vanilla   O(1)   haMQ
> --------------------------------------------------------------------
>     2        1        188      192    184
>     2        2        366      372    362
>     2        4        730      742    600
>     2        6       1096     1112    853
>     4        1        102      101     95
>     4        2        196      198    186
>     4        4        384      386    374
>     4        6        576      579    487
>     8        1         58       57     58
>     8        2        109      108    105
>     8        4        209      213    186
>     8        6        309      312    280

^ permalink raw reply	[flat|nested] 41+ messages in thread
* Re: [patch] O(1) scheduler, -D1, 2.5.2-pre9, 2.4.17
  2002-01-09  6:29   ` Brian
@ 2002-01-09  6:40     ` Jeffrey W. Baker
  2002-01-09  6:45     ` Ryan Cumming
  2002-01-09  6:48     ` Ryan Cumming
  2 siblings, 0 replies; 41+ messages in thread
From: Jeffrey W. Baker @ 2002-01-09 6:40 UTC (permalink / raw)
To: Brian; +Cc: linux-kernel

On Wed, 9 Jan 2002, Brian wrote:

> Can this be correct?
>
> Intuitively, I would expect several CPUs hammering away at the compile to
> finish faster than one.  Given these numbers, I would have to conclude
> that is not just wrong, but absolutely wrong.  Compile time increases
> linearly with the number of jobs, regardless of the number of CPUs.
>
> What would cause this?  Severe memory bottlenecks?

Mike ran make -j 8 which means 8 compiler processes for each "# Makes" in
the table.  Thus, the first row has 8 parallel processes on a 2-way and
the last row has 48 processes on an 8-way.  The best ratio is 8 processes
on an 8-way which not incidentally also has the lowest time: 57 seconds.

-jwb

> On Tuesday 08 January 2002 10:39 pm, Mike Kravetz wrote:
> > --------------------------------------------------------------------
> > mkbench - Time how long it takes to compile the kernel.
> >           We use 'make -j 8' and increase the number of makes run
> >           in parallel.  Result is average build time in seconds.
> >           Lower is better.
> > --------------------------------------------------------------------
> >  # CPUs   # Makes   Vanilla   O(1)   haMQ
> > --------------------------------------------------------------------
> >     2        1        188      192    184
> >     2        2        366      372    362
> >     2        4        730      742    600
> >     2        6       1096     1112    853
> >     4        1        102      101     95
> >     4        2        196      198    186
> >     4        4        384      386    374
> >     4        6        576      579    487
> >     8        1         58       57     58
> >     8        2        109      108    105
> >     8        4        209      213    186
> >     8        6        309      312    280

^ permalink raw reply	[flat|nested] 41+ messages in thread
* Re: [patch] O(1) scheduler, -D1, 2.5.2-pre9, 2.4.17
  2002-01-09  6:29   ` Brian
  2002-01-09  6:40     ` Jeffrey W. Baker
@ 2002-01-09  6:45     ` Ryan Cumming
  2002-01-09  6:48     ` Ryan Cumming
  2 siblings, 0 replies; 41+ messages in thread
From: Ryan Cumming @ 2002-01-09 6:45 UTC (permalink / raw)
To: Brian; +Cc: linux-kernel

On January 8, 2002 22:29, Brian wrote:
> Can this be correct?
>
> Intuitively, I would expect several CPUs hammering away at the compile to
> finish faster than one.  Given these numbers, I would have to conclude
> that is not just wrong, but absolutely wrong.  Compile time increases
> linearly with the number of jobs, regardless of the number of CPUs.

In the charts in the original message, he's not increasing the number of
jobs, but the number of concurrent 'make -j8's.  Two makes should really
finish in half the time one make does.  I don't see any problem with the
results.

-Ryan

^ permalink raw reply	[flat|nested] 41+ messages in thread
* Re: [patch] O(1) scheduler, -D1, 2.5.2-pre9, 2.4.17
  2002-01-09  6:29   ` Brian
  2002-01-09  6:40     ` Jeffrey W. Baker
  2002-01-09  6:45     ` Ryan Cumming
@ 2002-01-09  6:48     ` Ryan Cumming
  2 siblings, 0 replies; 41+ messages in thread
From: Ryan Cumming @ 2002-01-09 6:48 UTC (permalink / raw)
To: Brian; +Cc: linux-kernel

On January 8, 2002 22:45, Ryan Cumming wrote:
> In the charts in the original message, he's not increasing the number of
> jobs, but the number of concurrent 'make -j8's.  Two makes should really
> finish in half the time one make does.  I don't see any problem with the
> results.

Er, I meant finish in twice the time one make does... really... ;)

-Ryan

^ permalink raw reply	[flat|nested] 41+ messages in thread
* Re: [patch] O(1) scheduler, -D1, 2.5.2-pre9, 2.4.17
  2002-01-09  3:39 ` Mike Kravetz
  2002-01-09  5:05   ` Davide Libenzi
  2002-01-09  6:29   ` Brian
@ 2002-01-09 10:25   ` Ingo Molnar
  2002-01-09 17:40     ` Mike Kravetz
  2 siblings, 1 reply; 41+ messages in thread
From: Ingo Molnar @ 2002-01-09 10:25 UTC (permalink / raw)
To: Mike Kravetz
Cc: Linus Torvalds, linux-kernel, george anzinger, Davide Libenzi

On Tue, 8 Jan 2002, Mike Kravetz wrote:

> --------------------------------------------------------------------
> Chat - VolanoMark simulator.  Result is a measure of throughput.
>        Higher is better.

very interesting numbers, nice work Mike!  I'd suggest the following
additional test: please also run tests like VolanoMark with 'nice -n 19'.
The O(1) scheduler's task-penalty method works in our favor in this case,
since we know the test is CPU-bound we can move all processes to nice
level 19.

	Ingo

^ permalink raw reply	[flat|nested] 41+ messages in thread
* Re: [patch] O(1) scheduler, -D1, 2.5.2-pre9, 2.4.17
  2002-01-09 10:25   ` Ingo Molnar
@ 2002-01-09 17:40     ` Mike Kravetz
  0 siblings, 0 replies; 41+ messages in thread
From: Mike Kravetz @ 2002-01-09 17:40 UTC (permalink / raw)
To: Ingo Molnar; +Cc: Linus Torvalds, linux-kernel, george anzinger, Davide Libenzi

On Wed, Jan 09, 2002 at 11:25:43AM +0100, Ingo Molnar wrote:
>
> On Tue, 8 Jan 2002, Mike Kravetz wrote:
>
> > --------------------------------------------------------------------
> > Chat - VolanoMark simulator.  Result is a measure of throughput.
> >        Higher is better.
>
> very interesting numbers, nice work Mike!  I'd suggest the following
> additional test: please also run tests like VolanoMark with 'nice -n 19'.
> The O(1) scheduler's task-penalty method works in our favor in this case,
> since we know the test is CPU-bound we can move all processes to nice
> level 19.
>
> 	Ingo

I'll do that in the next go around.  Right now, I'm trying to get some
TPC-H results.

-- Mike

^ permalink raw reply	[flat|nested] 41+ messages in thread
end of thread, other threads: [~2002-01-11 12:04 UTC | newest]

Thread overview: 41+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
[not found] <200201071922.g07JMN106760@penguin.transmeta.com>
2002-01-07 21:36 ` [patch] O(1) scheduler, -D1, 2.5.2-pre9, 2.4.17 Ingo Molnar
2002-01-08 8:49 ` FD Cami
2002-01-08 18:44 ` J Sloan
2002-01-08 11:32 ` Anton Blanchard
2002-01-08 11:43 ` Anton Blanchard
2002-01-08 14:34 ` Ingo Molnar
2002-01-09 23:15 ` Anton Blanchard
2002-01-10 1:09 ` Richard Henderson
2002-01-10 17:04 ` Ivan Kokshaysky
2002-01-10 20:42 ` george anzinger
2002-01-10 23:56 ` Ingo Molnar
2002-01-08 14:32 ` [patch] O(1) scheduler, -E1, 2.5.2-pre10, 2.4.17 Ingo Molnar
2002-01-07 20:24 [patch] O(1) scheduler, -D1, 2.5.2-pre9, 2.4.17 Ingo Molnar
2002-01-07 19:03 ` Brian Gerst
2002-01-07 21:19 ` Ingo Molnar
2002-01-09 3:39 ` Mike Kravetz
2002-01-09 5:05 ` Davide Libenzi
2002-01-09 3:32 ` Rusty Russell
2002-01-09 18:02 ` Davide Libenzi
2002-01-09 11:37 ` Ingo Molnar
2002-01-09 11:19 ` Rene Rebe
2002-01-09 15:34 ` Ryan Cumming
2002-01-09 18:24 ` Davide Libenzi
2002-01-09 21:24 ` Ingo Molnar
2002-01-09 19:38 ` Mike Kravetz
2002-01-10 18:21 ` Mike Kravetz
2002-01-10 19:08 ` Davide Libenzi
2002-01-10 19:09 ` Linus Torvalds
2002-01-10 21:08 ` Davide Libenzi
2002-01-10 19:15 ` Mike Kravetz
2002-01-10 20:05 ` Davide Libenzi
2002-01-09 22:34 ` Mark Hahn
2002-01-10 14:04 ` Ingo Molnar
2002-01-09 20:15 ` Linus Torvalds
2002-01-09 23:02 ` Ingo Molnar
2002-01-09 6:29 ` Brian
2002-01-09 6:40 ` Jeffrey W. Baker
2002-01-09 6:45 ` Ryan Cumming
2002-01-09 6:48 ` Ryan Cumming
2002-01-09 10:25 ` Ingo Molnar
2002-01-09 17:40 ` Mike Kravetz