* Re: [patch] O(1) scheduler, -D1, 2.5.2-pre9, 2.4.17
  [not found] <200201071922.g07JMN106760@penguin.transmeta.com>
@ 2002-01-07 21:36 ` Ingo Molnar
  2002-01-08  8:49   ` FD Cami
  2002-01-08 11:32   ` Anton Blanchard
  0 siblings, 2 replies; 41+ messages in thread

From: Ingo Molnar @ 2002-01-07 21:36 UTC (permalink / raw)
To: Linus Torvalds; +Cc: linux-kernel, Brian Gerst

On Mon, 7 Jan 2002, Linus Torvalds wrote:

> Ingo, looks true. A quick -D2?

yep, Brian is right. I've uploaded -D2:

    http://redhat.com/~mingo/O(1)-scheduler/sched-O1-2.5.2-D2.patch

other changes:

 - make rt_priority 99 map to p->prio 0, rt_priority 0 map to p->prio 99.

 - display 'top' priorities correctly: 0-39 for normal processes,
   negative values for RT tasks. (it appears to work just fine.) We
   didn't previously display the real priority of RT tasks, but now
   it's natural.

> Oh, and please move console_init() back, other consoles (sparc?) may
> depend on having PCI layers initialized.

(doh, done too, fix is in -D2.)

> Oh, and _I_ don't like "cpu()". What's wrong with the already
> existing "smp_processor_id()"?

nothing serious; my main problem with it is that it's often too long for
my 80-character-wide consoles, it's also too long to type, and I use it
quite often in SMP code.

IIRC we had a 'hard_smp_processor_id()' initially, partly to make it
harder to use (it was very slow because it did an APIC read). But these
days smp_processor_id() is just as fast as (or even faster than)
'current'. So I wanted to use cpu() in new code to make it easier to
read and more compact. But if this is a problem I can remove it. I've
verified that there are no obvious namespace collisions.

(I've done a quick UP sanity compile + boot of 2.5.2-pre9 + D2; it all
works as expected.)

	Ingo

^ permalink raw reply	[flat|nested] 41+ messages in thread
* Re: [patch] O(1) scheduler, -D1, 2.5.2-pre9, 2.4.17
  2002-01-07 21:36 ` [patch] O(1) scheduler, -D1, 2.5.2-pre9, 2.4.17 Ingo Molnar
@ 2002-01-08  8:49   ` FD Cami
  2002-01-08 18:44     ` J Sloan
  2002-01-08 11:32   ` Anton Blanchard
  1 sibling, 1 reply; 41+ messages in thread

From: FD Cami @ 2002-01-08 8:49 UTC (permalink / raw)
To: mingo; +Cc: linux-kernel

Hi all

I'm joining the host of beta testers involved in that patch...

It's currently running on a production machine:
  dual PII 350 on ASUS P2B-DS
  3 SCSI hard drives
  512MB of RAM
  3C905C

This is a network server running the squid-cache www proxy with a
medium load (700 clients on a T3), mysqld, apache, proftpd. The kernel
is stock 2.4.17, and so far, so good.

Cheers,

François Cami

Ingo Molnar wrote:
> On Mon, 7 Jan 2002, Linus Torvalds wrote:
>
>> Ingo, looks true. A quick -D2?
>
> yep, Brian is right. I've uploaded -D2:
>
>    http://redhat.com/~mingo/O(1)-scheduler/sched-O1-2.5.2-D2.patch
>
> [...]
>
> -
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at  http://www.tux.org/lkml/

^ permalink raw reply	[flat|nested] 41+ messages in thread
* Re: [patch] O(1) scheduler, -D1, 2.5.2-pre9, 2.4.17
  2002-01-08  8:49 ` FD Cami
@ 2002-01-08 18:44   ` J Sloan
  0 siblings, 0 replies; 41+ messages in thread

From: J Sloan @ 2002-01-08 18:44 UTC (permalink / raw)
To: FD Cami; +Cc: mingo, linux-kernel

Excellent -

I'm going to try this one on whatever machines I have available for
testing, and if I am emboldened by success, I'll try it on some light
duty production servers as well -

keep us in the loop, please!

Regards,

jjs

FD Cami wrote:
>
> Hi all
>
> I'm joining the host of beta testers involved in that patch...
>
> It's currently running on a production machine:
>   dual PII 350 on ASUS P2B-DS
>   3 SCSI hard drives
>   512MB of RAM
>   3C905C
> This is a network server running squid-cache www proxy with
> a medium load (700 clients on a T3), mysqld, apache, proftpd.
> kernel is stock 2.4.17 - and so far, so good.
>
> [...]

^ permalink raw reply	[flat|nested] 41+ messages in thread
* Re: [patch] O(1) scheduler, -D1, 2.5.2-pre9, 2.4.17
  2002-01-07 21:36 ` [patch] O(1) scheduler, -D1, 2.5.2-pre9, 2.4.17 Ingo Molnar
  2002-01-08  8:49   ` FD Cami
@ 2002-01-08 11:32   ` Anton Blanchard
  2002-01-08 11:43     ` Anton Blanchard
  2002-01-08 14:32     ` [patch] O(1) scheduler, -E1, 2.5.2-pre10, 2.4.17 Ingo Molnar
  1 sibling, 2 replies; 41+ messages in thread

From: Anton Blanchard @ 2002-01-08 11:32 UTC (permalink / raw)
To: Ingo Molnar; +Cc: Linus Torvalds, linux-kernel

Hi Ingo,

I tested 2.5.2-pre10 today. There is some bitop abuse that needs fixing
for big endian machines to work :)

At the moment we have:

	#define BITMAP_SIZE ((MAX_PRIO+7)/8)
	char bitmap[BITMAP_SIZE];

Which is initialised using:

	memset(array->bitmap, 0xff, BITMAP_SIZE);
	clear_bit(MAX_PRIO, array->bitmap);

This results in the following in memory (in ascending memory order):

	ffffffffffffffff ffffffffffffffff fffffeffff000000

The problem here is that when we search the high word, we do so from
the right, therefore we get 128 all the time :)

The following patch fixes this. We need to define the bitmap in terms
of unsigned long; in this case it's only luck that we have the correct
alignment. We also replace the memset of the bitmap with set_bit.

With the patch things look much better (and the kernel boots on my
ppc64 machine :)

	ffffffffffffffff ffffffffffffffff 000000ffffffffff

Anton

diff -urN linuxppc_2_5/include/asm-i386/mmu_context.h linuxppc_2_5_work/include/asm-i386/mmu_context.h
--- linuxppc_2_5/include/asm-i386/mmu_context.h	Tue Jan  8 17:09:47 2002
+++ linuxppc_2_5_work/include/asm-i386/mmu_context.h	Tue Jan  8 22:06:35 2002
@@ -16,7 +16,7 @@
 # error update this function.
 #endif
 
-static inline int sched_find_first_zero_bit(char *bitmap)
+static inline int sched_find_first_zero_bit(unsigned long *bitmap)
 {
 	unsigned int *b = (unsigned int *)bitmap;
 	unsigned int rt;
diff -urN linuxppc_2_5/kernel/sched.c linuxppc_2_5_work/kernel/sched.c
--- linuxppc_2_5/kernel/sched.c	Tue Jan  8 17:09:47 2002
+++ linuxppc_2_5_work/kernel/sched.c	Tue Jan  8 22:13:45 2002
@@ -20,15 +20,13 @@
 #include <linux/interrupt.h>
 #include <asm/mmu_context.h>
 
-#define BITMAP_SIZE ((MAX_PRIO+7)/8)
-
 typedef struct runqueue runqueue_t;
 
 struct prio_array {
 	int nr_active;
 	spinlock_t *lock;
 	runqueue_t *rq;
-	char bitmap[BITMAP_SIZE];
+	unsigned long bitmap[3];
 	list_t queue[MAX_PRIO];
 };
 
@@ -1306,11 +1304,12 @@
 		array = rq->arrays + j;
 		array->rq = rq;
 		array->lock = &rq->lock;
-		for (k = 0; k < MAX_PRIO; k++)
+		for (k = 0; k < MAX_PRIO; k++) {
 			INIT_LIST_HEAD(array->queue + k);
-		memset(array->bitmap, 0xff, BITMAP_SIZE);
+			__set_bit(k, array->bitmap);
+		}
 		// zero delimiter for bitsearch
-		clear_bit(MAX_PRIO, array->bitmap);
+		__clear_bit(MAX_PRIO, array->bitmap);
 	}
 }
 
 /*

^ permalink raw reply	[flat|nested] 41+ messages in thread
* Re: [patch] O(1) scheduler, -D1, 2.5.2-pre9, 2.4.17
  2002-01-08 11:32 ` Anton Blanchard
@ 2002-01-08 11:43   ` Anton Blanchard
  2002-01-08 14:34     ` Ingo Molnar
  2002-01-08 14:32   ` [patch] O(1) scheduler, -E1, 2.5.2-pre10, 2.4.17 Ingo Molnar
  1 sibling, 1 reply; 41+ messages in thread

From: Anton Blanchard @ 2002-01-08 11:43 UTC (permalink / raw)
To: Ingo Molnar; +Cc: Linus Torvalds, linux-kernel

>  struct prio_array {
>  	int nr_active;
>  	spinlock_t *lock;
>  	runqueue_t *rq;
> -	char bitmap[BITMAP_SIZE];
> +	unsigned long bitmap[3];
>  	list_t queue[MAX_PRIO];
>  };

Sorry, of course the hardcoded [3] is wrong where unsigned long is
smaller than 64 bits. But you get the idea :)

Anton

^ permalink raw reply	[flat|nested] 41+ messages in thread
* Re: [patch] O(1) scheduler, -D1, 2.5.2-pre9, 2.4.17
  2002-01-08 11:43 ` Anton Blanchard
@ 2002-01-08 14:34   ` Ingo Molnar
  2002-01-09 23:15     ` Anton Blanchard
  0 siblings, 1 reply; 41+ messages in thread

From: Ingo Molnar @ 2002-01-08 14:34 UTC (permalink / raw)
To: Anton Blanchard; +Cc: Linus Torvalds, linux-kernel

On Tue, 8 Jan 2002, Anton Blanchard wrote:

> > -	char bitmap[BITMAP_SIZE];
> > +	unsigned long bitmap[3];
> >  	list_t queue[MAX_PRIO];
>
> Sorry, of course this is wrong if sizeof(unsigned long) < 64. But you
> get the idea :)

thanks, I've put the generic fix into the -E1 patch.

> With the patch things look much better (and the kernel boots on my
> ppc64 machine :)

hey, it should not even compile, you forgot to send us the PPC
definition of sched_find_first_zero_bit() ;-)

	Ingo

^ permalink raw reply	[flat|nested] 41+ messages in thread
* Re: [patch] O(1) scheduler, -D1, 2.5.2-pre9, 2.4.17
  2002-01-08 14:34 ` Ingo Molnar
@ 2002-01-09 23:15   ` Anton Blanchard
  2002-01-10  1:09     ` Richard Henderson
  0 siblings, 1 reply; 41+ messages in thread

From: Anton Blanchard @ 2002-01-09 23:15 UTC (permalink / raw)
To: Ingo Molnar; +Cc: Linus Torvalds, linux-kernel

> > With the patch things look much better (and the kernel boots on my
> > ppc64 machine :)
>
> hey it should not even compile, you forgot to send us the PPC definition
> of sched_find_first_zero_bit() ;-)

Good point, but it's ppc64, so the patch would include all of
include/asm-ppc64 and arch/ppc64 :)

I expect most architectures have a reasonably fast find_first_zero_bit,
so they can simply do:

static inline int sched_find_first_zero_bit(unsigned long *bitmap)
{
	return find_first_zero_bit(bitmap, MAX_PRIO);
}

^ permalink raw reply	[flat|nested] 41+ messages in thread
* Re: [patch] O(1) scheduler, -D1, 2.5.2-pre9, 2.4.17
  2002-01-09 23:15 ` Anton Blanchard
@ 2002-01-10  1:09   ` Richard Henderson
  2002-01-10 17:04     ` Ivan Kokshaysky
  0 siblings, 1 reply; 41+ messages in thread

From: Richard Henderson @ 2002-01-10 1:09 UTC (permalink / raw)
To: Anton Blanchard; +Cc: Ingo Molnar, Linus Torvalds, linux-kernel

On Thu, Jan 10, 2002 at 10:15:14AM +1100, Anton Blanchard wrote:
> I expect most architectures have a reasonably fast find_first_zero_bit
> so they can simply do:
>
> static inline int sched_find_first_zero_bit(unsigned long *bitmap)
> {
> 	return find_first_zero_bit(bitmap, MAX_PRIO);
> }

Careful. The following is really quite a bit better on Alpha:

static inline int
sched_find_first_zero_bit(unsigned long *bitmap)
{
	unsigned long b0 = bitmap[0];
	unsigned long b1 = bitmap[1];
	unsigned long b2 = bitmap[2];
	unsigned long ofs = MAX_RT_PRIO;

	if (unlikely(~(b0 & b1) != 0)) {
		b2 = (~b0 == 0 ? b0 : b1);
		ofs = (~b0 == 0 ? 0 : 64);
	}
	return ffz(b2) + ofs;
}

It compiles down to

	ldq $2,0($16)
	ldq $3,8($16)
	lda $5,128($31)
	ldq $0,16($16)
	and $2,$3,$1
	ornot $31,$2,$4
	ornot $31,$1,$1
	bne $1,$L8
$L2:
	ornot $31,$0,$0
	cttz $0,$0
	addl $0,$5,$0
	ret $31,($26),1
$L8:
	mov $2,$0
	cmpult $31,$4,$5
	cmovne $4,$3,$0
	sll $5,6,$5
	br $31,$L2

which is a fair bit better than find_first_zero_bit, if for no other
reason than that we collect all the memory accesses right up at the
beginning.

While we're on the subject of sched_find_first_zero_bit, I'd like to
complain about Ingo's choice of header file. Why in the world did you
choose mmu_context.h? Invent a new asm/sched.h if you must, but please
don't choose headers at random.


r~

^ permalink raw reply	[flat|nested] 41+ messages in thread
* Re: [patch] O(1) scheduler, -D1, 2.5.2-pre9, 2.4.17
  2002-01-10  1:09 ` Richard Henderson
@ 2002-01-10 17:04   ` Ivan Kokshaysky
  2002-01-10 20:42     ` george anzinger
  2002-01-10 23:56     ` Ingo Molnar
  0 siblings, 2 replies; 41+ messages in thread

From: Ivan Kokshaysky @ 2002-01-10 17:04 UTC (permalink / raw)
To: Anton Blanchard, Ingo Molnar, Linus Torvalds, linux-kernel

On Wed, Jan 09, 2002 at 05:09:28PM -0800, Richard Henderson wrote:
> Careful. The following is really quite a bit better on Alpha:
>
> static inline int
> sched_find_first_zero_bit(unsigned long *bitmap)
> {
> 	unsigned long b0 = bitmap[0];
> 	unsigned long b1 = bitmap[1];
> 	unsigned long b2 = bitmap[2];
> 	unsigned long ofs = MAX_RT_PRIO;
>
> 	if (unlikely(~(b0 & b1) != 0)) {
> 		b2 = (~b0 == 0 ? b0 : b1);
> 		ofs = (~b0 == 0 ? 0 : 64);
> 	}
> 	return ffz(b2) + ofs;
> }

True. Minor correction:

-		b2 = (~b0 == 0 ? b0 : b1);
-		ofs = (~b0 == 0 ? 0 : 64);
+		b2 = (~b0 ? b0 : b1);
+		ofs = (~b0 ? 0 : 64);

Note that the comment for this function is a bit confusing:

 * ... It's the fastest
 * way of searching a 168-bit bitmap where the first 128 bits are
 * unlikely to be set.

s/set/cleared/

> While we're on the subject of sched_find_first_zero_bit, I'd
> like to complain about Ingo's choice of header file. Why in
> the world did you choose mmu_context.h? Invent a new asm/sched.h
> if you must, but please don't choose headers at random.

Agreed. Apparently asm/bitops.h?

Ivan.

^ permalink raw reply	[flat|nested] 41+ messages in thread
* Re: [patch] O(1) scheduler, -D1, 2.5.2-pre9, 2.4.17
  2002-01-10 17:04 ` Ivan Kokshaysky
@ 2002-01-10 20:42   ` george anzinger
  0 siblings, 0 replies; 41+ messages in thread

From: george anzinger @ 2002-01-10 20:42 UTC (permalink / raw)
To: Ivan Kokshaysky
Cc: Anton Blanchard, Ingo Molnar, Linus Torvalds, linux-kernel

Ivan Kokshaysky wrote:
>
> On Wed, Jan 09, 2002 at 05:09:28PM -0800, Richard Henderson wrote:
> > Careful. The following is really quite a bit better on Alpha:
> >
> > static inline int
> > sched_find_first_zero_bit(unsigned long *bitmap)
> > {
> > 	unsigned long b0 = bitmap[0];
> > 	unsigned long b1 = bitmap[1];
> > 	unsigned long b2 = bitmap[2];
> > 	unsigned long ofs = MAX_RT_PRIO;
> >
> > 	if (unlikely(~(b0 & b1) != 0)) {
> > 		b2 = (~b0 == 0 ? b0 : b1);
> > 		ofs = (~b0 == 0 ? 0 : 64);
> > 	}
> > 	return ffz(b2) + ofs;
> > }
>
> True. Minor correction:
> -		b2 = (~b0 == 0 ? b0 : b1);
> -		ofs = (~b0 == 0 ? 0 : 64);
> +		b2 = (~b0 ? b0 : b1);
> +		ofs = (~b0 ? 0 : 64);
>
> Note that comment for this function is a bit confusing:
> * ... It's the fastest
> * way of searching a 168-bit bitmap where the first 128 bits are
> * unlikely to be set.

What if we want a 2048-bit bitmap???

> s/set/cleared/
>
> [...]

--
George           george@mvista.com
High-res-timers: http://sourceforge.net/projects/high-res-timers/
Real time sched: http://sourceforge.net/projects/rtsched/

^ permalink raw reply	[flat|nested] 41+ messages in thread
* Re: [patch] O(1) scheduler, -D1, 2.5.2-pre9, 2.4.17
  2002-01-10 17:04 ` Ivan Kokshaysky
  2002-01-10 20:42   ` george anzinger
@ 2002-01-10 23:56   ` Ingo Molnar
  1 sibling, 0 replies; 41+ messages in thread

From: Ingo Molnar @ 2002-01-10 23:56 UTC (permalink / raw)
To: Ivan Kokshaysky; +Cc: Anton Blanchard, Linus Torvalds, linux-kernel

On Thu, 10 Jan 2002, Ivan Kokshaysky wrote:

> Note that comment for this function is a bit confusing:
> * ... It's the fastest
> * way of searching a 168-bit bitmap where the first 128 bits are
> * unlikely to be set.
>
> s/set/cleared/

no, it's really 'cleared'. The bits are inverted right now.

	Ingo

^ permalink raw reply	[flat|nested] 41+ messages in thread
* [patch] O(1) scheduler, -E1, 2.5.2-pre10, 2.4.17
  2002-01-08 11:32 ` Anton Blanchard
  2002-01-08 11:43   ` Anton Blanchard
@ 2002-01-08 14:32   ` Ingo Molnar
  1 sibling, 0 replies; 41+ messages in thread

From: Ingo Molnar @ 2002-01-08 14:32 UTC (permalink / raw)
To: linux-kernel; +Cc: Linus Torvalds, Anton Blanchard, Davide Libenzi

this is the latest update of the O(1) scheduler:

   http://redhat.com/~mingo/O(1)-scheduler/sched-O1-2.5.2-pre10-E1.patch
   http://redhat.com/~mingo/O(1)-scheduler/sched-O1-2.4.17-E1.patch

now that Linus has put the -D2 patch into the 2.5.2-pre10 kernel, the
2.5.2-pre10-E1 patch has become quite small :-)

The patch compiles, boots & works just fine on my UP/SMP boxes.

Changes since -D2:

 - make rq->bitmap big-endian safe. (Anton Blanchard)

 - documented and cleaned up the load estimator bits; no functional
   changes apart from small speedups.

 - do init_idle() before starting up the init thread. This removes a
   race where we'd run the init thread on CPU#0 before init_idle() has
   been called.

	Ingo

^ permalink raw reply	[flat|nested] 41+ messages in thread
* [patch] O(1) scheduler, -D1, 2.5.2-pre9, 2.4.17
@ 2002-01-07 20:24 Ingo Molnar
  2002-01-07 19:03 ` Brian Gerst
  2002-01-09  3:39 ` Mike Kravetz
  0 siblings, 2 replies; 41+ messages in thread

From: Ingo Molnar @ 2002-01-07 20:24 UTC (permalink / raw)
To: Linus Torvalds; +Cc: linux-kernel, george anzinger, Davide Libenzi

-D1 is a quick update over -D0:

   http://redhat.com/~mingo/O(1)-scheduler/sched-O1-2.5.2-D1.patch
   http://redhat.com/~mingo/O(1)-scheduler/sched-O1-2.4.17-D1.patch

this should fix the child-inherits-parent-priority-boost issue that
causes interactivity problems during compilation.

	Ingo

^ permalink raw reply	[flat|nested] 41+ messages in thread
* Re: [patch] O(1) scheduler, -D1, 2.5.2-pre9, 2.4.17
  2002-01-07 20:24 [patch] O(1) scheduler, -D1, 2.5.2-pre9, 2.4.17 Ingo Molnar
@ 2002-01-07 19:03 ` Brian Gerst
  2002-01-07 21:19   ` Ingo Molnar
  2002-01-09  3:39 ` Mike Kravetz
  1 sibling, 1 reply; 41+ messages in thread

From: Brian Gerst @ 2002-01-07 19:03 UTC (permalink / raw)
To: mingo; +Cc: linux-kernel

Ingo Molnar wrote:
>
> -D1 is a quick update over -D0:
>
> http://redhat.com/~mingo/O(1)-scheduler/sched-O1-2.5.2-D1.patch
> http://redhat.com/~mingo/O(1)-scheduler/sched-O1-2.4.17-D1.patch
>
> this should fix the child-inherits-parent-priority-boost issue that causes
> interactivity problems during compilation.
>
> Ingo

I noticed in this patch that you removed the rest_init() function. The
reason it was split from start_kernel() is that there was a race where
init memory could be freed before the call to cpu_idle(). Note that
start_kernel() is marked __init and rest_init() is not.

--
Brian Gerst

^ permalink raw reply	[flat|nested] 41+ messages in thread
* Re: [patch] O(1) scheduler, -D1, 2.5.2-pre9, 2.4.17
  2002-01-07 19:03 ` Brian Gerst
@ 2002-01-07 21:19   ` Ingo Molnar
  0 siblings, 0 replies; 41+ messages in thread

From: Ingo Molnar @ 2002-01-07 21:19 UTC (permalink / raw)
To: Brian Gerst; +Cc: linux-kernel, Linus Torvalds

On Mon, 7 Jan 2002, Brian Gerst wrote:

> I noticed in this patch that you removed the rest_init() function.
> The reason it was split from start_kernel() is that there was a race
> where init memory could be freed before the call to cpu_idle(). Note
> that start_kernel() is marked __init and rest_init() is not.

you are right, I've missed that detail. I've fixed this in my tree
(reverted that part to the previous behavior); the fix will show up in
the next patch.

	Thanks,

		Ingo

^ permalink raw reply	[flat|nested] 41+ messages in thread
* Re: [patch] O(1) scheduler, -D1, 2.5.2-pre9, 2.4.17
  2002-01-07 20:24 [patch] O(1) scheduler, -D1, 2.5.2-pre9, 2.4.17 Ingo Molnar
  2002-01-07 19:03 ` Brian Gerst
@ 2002-01-09  3:39 ` Mike Kravetz
  2002-01-09  5:05   ` Davide Libenzi
                      ` (2 more replies)
  1 sibling, 3 replies; 41+ messages in thread

From: Mike Kravetz @ 2002-01-09 3:39 UTC (permalink / raw)
To: Ingo Molnar; +Cc: Linus Torvalds, linux-kernel, george anzinger, Davide Libenzi

Below are some benchmark results when running the D1 version of the
O(1) scheduler on 2.5.2-pre9. To add another data point, I hacked
together a half-a** multi-queue scheduler based on the 2.5.2-pre9
scheduler. haMQ doesn't do load balancing or anything fancy. However,
it aggressively tries to not let CPUs go idle (not always a good thing,
as has been previously discussed). For reference, the patch is at:
lse.sourceforge.net/scheduling/2.5.2-pre9-hamq
I can't recommend this code for anything useful.

All benchmarks were run on an 8-way Pentium III 700 MHz with 1MB
caches. The number of CPUs was altered via the maxcpus boot flag.

--------------------------------------------------------------------
mkbench - Time how long it takes to compile the kernel.
          We use 'make -j 8' and increase the number of makes run
          in parallel. Result is average build time in seconds.
          Lower is better.
--------------------------------------------------------------------
# CPUs    # Makes    Vanilla    O(1)    haMQ
--------------------------------------------------------------------
2         1          188        192     184
2         2          366        372     362
2         4          730        742     600
2         6          1096       1112    853
4         1          102        101     95
4         2          196        198     186
4         4          384        386     374
4         6          576        579     487
8         1          58         57      58
8         2          109        108     105
8         4          209        213     186
8         6          309        312     280

Surprisingly, O(1) seems to do worse than the vanilla scheduler in
almost all cases.

--------------------------------------------------------------------
Chat - VolanoMark simulator. Result is a measure of throughput.
       Higher is better.
--------------------------------------------------------------------
Configuration Parms       # CPUs    Vanilla    O(1)      haMQ
--------------------------------------------------------------------
10 rooms, 200 messages    2         162644     145915    137097
20 rooms, 200 messages    2         145872     136134    138646
30 rooms, 200 messages    2         124314     183366    144403
10 rooms, 200 messages    4         201745     258444    255415
20 rooms, 200 messages    4         177854     246032    263723
30 rooms, 200 messages    4         153506     302615    257170
10 rooms, 200 messages    8         121792     262804    310603
20 rooms, 200 messages    8         68697      248406    420157
30 rooms, 200 messages    8         42133      302513    283817

O(1) scheduler does better than Vanilla as load and number of CPUs
increase. Still need to look into why it does worse on the less loaded
2 CPU runs.

--------------------------------------------------------------------
Reflex - lat_ctx (of LMbench) on steroids. Does token passing to
         over-emphasize scheduler paths. Allows loading of the
         runqueue, unlike lat_ctx. Result is microseconds per round.
         All runs with 0 delay.
         lse.sourceforge.net/scheduling/other/reflex/
         Lower is better.
--------------------------------------------------------------------
# tasks    # CPUs    Vanilla    O(1)      haMQ
--------------------------------------------------------------------
2          2         6.594      14.388    15.996
4          2         6.988      3.787     4.686
8          2         7.322      3.757     5.148
16         2         7.234      3.737     7.244
32         2         7.651      5.135     7.182
64         2         9.462      3.948     7.553
128        2         13.889     4.584     7.918
2          4         6.019      14.646    15.403
4          4         10.997     6.213     6.755
8          4         9.838      2.160     2.838
16         4         10.595     2.154     3.080
32         4         11.870     2.917     3.400
64         4         15.280     2.890     3.131
128        4         19.832     2.685     3.307
2          8         6.338      9.064     15.474
4          8         11.454     7.020     8.281
8          8         13.354     4.390     5.816
16         8         14.976     1.502     2.018
32         8         16.757     1.920     2.240
64         8         19.961     2.264     2.358
128        8         25.010     2.280     2.260

I believe the poor showings for O(1) at the low end are the result of
having the 2 tasks run on 2 different CPUs. This is the right thing to
do in spite of the numbers. You can see lock contention become a factor
in the Vanilla scheduler as load and number of CPUs increase. Having
multiple runqueues eliminates this problem.

--
Mike

^ permalink raw reply	[flat|nested] 41+ messages in thread
* Re: [patch] O(1) scheduler, -D1, 2.5.2-pre9, 2.4.17
  2002-01-09  3:39 ` Mike Kravetz
@ 2002-01-09  5:05   ` Davide Libenzi
  2002-01-09  3:32     ` Rusty Russell
  2002-01-09 11:37     ` Ingo Molnar
  2002-01-09  6:29   ` Brian
  2002-01-09 10:25   ` Ingo Molnar
  2 siblings, 2 replies; 41+ messages in thread

From: Davide Libenzi @ 2002-01-09 5:05 UTC (permalink / raw)
To: Mike Kravetz; +Cc: Ingo Molnar, Linus Torvalds, lkml, george anzinger

On Tue, 8 Jan 2002, Mike Kravetz wrote:

> Below are some benchmark results when running the D1 version
> of the O(1) scheduler on 2.5.2-pre9. To add another data point,
> I hacked together a half-a** multi-queue scheduler based on
> the 2.5.2-pre9 scheduler. haMQ doesn't do load balancing or
> anything fancy. However, it aggressively tries to not let CPUs
> go idle (not always a good thing as has been previously discussed).
> For reference, patch is at: lse.sourceforge.net/scheduling/2.5.2-pre9-hamq
> I can't recommend this code for anything useful.
>
> All benchmarks were run on an 8-way Pentium III 700 MHz 1MB caches.
> Number of CPUs was altered via the maxcpus boot flag.
>
> [...]
>
> I believe the poor showings for O(1) at the low end are the
> result of having the 2 tasks run on 2 different CPUs. This
> is the right thing to do in spite of the numbers. You
> can see lock contention become a factor in the Vanilla scheduler
> as load and number of CPUs increase. Having multiple runqueues
> eliminates this problem.

Awesome job Mike. Ingo's O(1) scheduler is still 'young' and can be
improved, especially from a balancing point of view. I think it's here
that the real challenge will take place (even if Linus keeps saying
that it's easy :-)). Mike, can you try the patch listed below on custom
pre-10? I've got 30-70% better performance with the chat_s/c test.

PS: next time we have lunch I'll tell you about a wonderful tool called
gnuplot :)

- Davide

diff -Nru linux-2.5.2-pre10.vanilla/include/linux/sched.h linux-2.5.2-pre10.mqo1/include/linux/sched.h
--- linux-2.5.2-pre10.vanilla/include/linux/sched.h	Mon Jan  7 17:12:45 2002
+++ linux-2.5.2-pre10.mqo1/include/linux/sched.h	Mon Jan  7 21:45:19 2002
@@ -305,11 +305,7 @@
 	prio_array_t *array;
 
 	unsigned int time_slice;
-	unsigned long sleep_timestamp, run_timestamp;
-
-#define SLEEP_HIST_SIZE 4
-	int sleep_hist[SLEEP_HIST_SIZE];
-	int sleep_idx;
+	unsigned long swap_cnt_last;
 
 	unsigned long policy;
 	unsigned long cpus_allowed;
diff -Nru linux-2.5.2-pre10.vanilla/kernel/fork.c linux-2.5.2-pre10.mqo1/kernel/fork.c
--- linux-2.5.2-pre10.vanilla/kernel/fork.c	Mon Jan  7 17:12:45 2002
+++ linux-2.5.2-pre10.mqo1/kernel/fork.c	Mon Jan  7 18:49:34 2002
@@ -705,9 +705,6 @@
 		current->time_slice = 1;
 		expire_task(current);
 	}
-	p->sleep_timestamp = p->run_timestamp = jiffies;
-	memset(p->sleep_hist, 0, sizeof(p->sleep_hist[0])*SLEEP_HIST_SIZE);
-	p->sleep_idx = 0;
 	__restore_flags(flags);
 
 	/*
diff -Nru linux-2.5.2-pre10.vanilla/kernel/sched.c linux-2.5.2-pre10.mqo1/kernel/sched.c
--- linux-2.5.2-pre10.vanilla/kernel/sched.c	Mon Jan  7 17:12:45 2002
+++ linux-2.5.2-pre10.mqo1/kernel/sched.c	Tue Jan  8 18:28:02 2002
@@ -48,6 +48,7 @@
 	spinlock_t lock;
 	unsigned long nr_running, nr_switches, last_rt_event;
 	task_t *curr, *idle;
+	unsigned long swap_cnt;
 	prio_array_t *active, *expired, arrays[2];
 	char __pad [SMP_CACHE_BYTES];
 } runqueues [NR_CPUS] __cacheline_aligned;
@@ -91,115 +92,20 @@
 	p->array = array;
 }
 
-/*
- * This is the per-process load estimator. Processes that generate
- * more load than the system can handle get a priority penalty.
- *
- * The estimator uses a 4-entry load-history ringbuffer which is
- * updated whenever a task is moved to/from the runqueue. The load
- * estimate is also updated from the timer tick to get an accurate
- * estimation of currently executing tasks as well.
- */
-#define NEXT_IDX(idx) (((idx) + 1) % SLEEP_HIST_SIZE)
-
-static inline void update_sleep_avg_deactivate(task_t *p)
-{
-	unsigned int idx;
-	unsigned long j = jiffies, last_sample = p->run_timestamp / HZ,
-		curr_sample = j / HZ, delta = curr_sample - last_sample;
-
-	if (unlikely(delta)) {
-		if (delta < SLEEP_HIST_SIZE) {
-			for (idx = 0; idx < delta; idx++) {
-				p->sleep_idx++;
-				p->sleep_idx %= SLEEP_HIST_SIZE;
-				p->sleep_hist[p->sleep_idx] = 0;
-			}
-		} else {
-			for (idx = 0; idx < SLEEP_HIST_SIZE; idx++)
-				p->sleep_hist[idx] = 0;
-			p->sleep_idx = 0;
-		}
-	}
-	p->sleep_timestamp = j;
-}
-
-#if SLEEP_HIST_SIZE != 4
-# error update this code.
-#endif
-
-static inline unsigned int get_sleep_avg(task_t *p, unsigned long j)
-{
-	unsigned int sum;
-
-	sum = p->sleep_hist[0];
-	sum += p->sleep_hist[1];
-	sum += p->sleep_hist[2];
-	sum += p->sleep_hist[3];
-
-	return sum * HZ / ((SLEEP_HIST_SIZE-1)*HZ + (j % HZ));
-}
-
-static inline void update_sleep_avg_activate(task_t *p, unsigned long j)
-{
-	unsigned int idx;
-	unsigned long delta_ticks, last_sample = p->sleep_timestamp / HZ,
-		curr_sample = j / HZ, delta = curr_sample - last_sample;
-
-	if (unlikely(delta)) {
-		if (delta < SLEEP_HIST_SIZE) {
-			p->sleep_hist[p->sleep_idx] += HZ - (p->sleep_timestamp % HZ);
-			p->sleep_idx++;
-			p->sleep_idx %= SLEEP_HIST_SIZE;
-
-			for (idx = 1; idx < delta; idx++) {
-				p->sleep_idx++;
-				p->sleep_idx %= SLEEP_HIST_SIZE;
-				p->sleep_hist[p->sleep_idx] = HZ;
-			}
-		} else {
-			for (idx = 0; idx < SLEEP_HIST_SIZE; idx++)
-				p->sleep_hist[idx] = HZ;
-			p->sleep_idx = 0;
-		}
-		p->sleep_hist[p->sleep_idx] = 0;
-		delta_ticks = j % HZ;
-	} else
-		delta_ticks = j - p->sleep_timestamp;
-	p->sleep_hist[p->sleep_idx] += delta_ticks;
-	p->run_timestamp = j;
-}
-
 static inline void activate_task(task_t *p, runqueue_t *rq)
 {
 	prio_array_t *array = rq->active;
-	unsigned long j = jiffies;
-	unsigned int sleep, load;
-	int penalty;
 
-	if (likely(p->run_timestamp == j))
-		goto enqueue;
-	/*
-	 * Give the process a priority penalty if it has not slept often
-	 * enough in the past. We scale the priority penalty according
-	 * to the current load of the runqueue, and the 'load history'
-	 * this process has. Eg. if the CPU has 3 processes running
-	 * right now then a process that has slept more than two-thirds
-	 * of the time is considered to be 'interactive'. The higher
-	 * the load of the CPUs is, the easier it is for a process to
-	 * get an non-interactivity penalty.
- */ -#define MAX_PENALTY (MAX_USER_PRIO/3) - update_sleep_avg_activate(p, j); - sleep = get_sleep_avg(p, j); - load = HZ - sleep; - penalty = (MAX_PENALTY * load)/HZ; if (!rt_task(p)) { - p->prio = NICE_TO_PRIO(p->__nice) + penalty; - if (p->prio > MAX_PRIO-1) - p->prio = MAX_PRIO-1; + unsigned long prio_bonus = rq->swap_cnt - p->swap_cnt_last; + + p->swap_cnt_last = rq->swap_cnt; + if (prio_bonus > MAX_PRIO) + prio_bonus = MAX_PRIO; + p->prio -= prio_bonus; + if (p->prio < MAX_RT_PRIO) + p->prio = MAX_RT_PRIO; } -enqueue: enqueue_task(p, array); rq->nr_running++; } @@ -209,7 +115,6 @@ rq->nr_running--; dequeue_task(p, p->array); p->array = NULL; - update_sleep_avg_deactivate(p); } static inline void resched_task(task_t *p) @@ -535,33 +440,16 @@ p->need_resched = 1; if (rt_task(p)) p->time_slice = RT_PRIO_TO_TIMESLICE(p->prio); - else + else { p->time_slice = PRIO_TO_TIMESLICE(p->prio); - - /* - * Timeslice used up - discard any possible - * priority penalty: - */ - dequeue_task(p, rq->active); - /* - * Tasks that have nice values of -20 ... -15 are put - * back into the active array. If they use up too much - * CPU time then they'll get a priority penalty anyway - * so this can not starve other processes accidentally. - * Otherwise this is pretty handy for sysadmins ... 
- */ - if (p->prio <= MAX_RT_PRIO + MAX_PENALTY/2) - enqueue_task(p, rq->active); - else + /* + * Timeslice used up - discard any possible + * priority penalty: + */ + dequeue_task(p, rq->active); + if (++p->prio >= MAX_PRIO) + p->prio = MAX_PRIO - 1; enqueue_task(p, rq->expired); - } else { - /* - * Deactivate + activate the task so that the - * load estimator gets updated properly: - */ - if (!rt_task(p)) { - deactivate_task(p, rq); - activate_task(p, rq); } } load_balance(rq); @@ -616,6 +504,7 @@ rq->active = rq->expired; rq->expired = array; array = rq->active; + rq->swap_cnt++; } idx = sched_find_first_zero_bit(array->bitmap); @@ -1301,6 +1190,7 @@ rq->expired = rq->arrays + 1; spin_lock_init(&rq->lock); rq->cpu = i; + rq->swap_cnt = 0; for (j = 0; j < 2; j++) { array = rq->arrays + j; ^ permalink raw reply [flat|nested] 41+ messages in thread
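For readers skimming the patch above, its heart is the new activate-time bonus: instead of the per-task sleep-history ring buffer, each runqueue keeps a `swap_cnt` that is incremented on every active/expired array switch, and a waking task's priority is boosted by the number of switches it slept through. A minimal user-space sketch of that calculation follows; the struct fields are simplified stand-ins, and the `MAX_PRIO`/`MAX_RT_PRIO` values are illustrative, not necessarily the ones in the patched kernel:

```c
#include <assert.h>

#define MAX_RT_PRIO 100   /* illustrative: priorities below this are RT */
#define MAX_PRIO    140   /* illustrative: MAX_RT_PRIO..MAX_PRIO-1 is the nice range */

struct task { int prio; unsigned long swap_cnt_last; };
struct runqueue { unsigned long swap_cnt; };  /* bumped on every array switch */

/* Boost a waking task by the number of array switches it slept through:
 * the longer it stayed off the runqueue, the bigger the priority bonus
 * (lower numeric prio = higher priority). */
static void activate_bonus(struct task *p, struct runqueue *rq)
{
	unsigned long prio_bonus = rq->swap_cnt - p->swap_cnt_last;

	p->swap_cnt_last = rq->swap_cnt;
	if (prio_bonus > MAX_PRIO)
		prio_bonus = MAX_PRIO;
	p->prio -= (int)prio_bonus;
	if (p->prio < MAX_RT_PRIO)
		p->prio = MAX_RT_PRIO;  /* never promote into the RT range */
}
```

The final clamp mirrors the patch's `if (p->prio < MAX_RT_PRIO) p->prio = MAX_RT_PRIO;`: however long a normal task sleeps, the bonus can never push it into real-time priorities.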
* Re: [patch] O(1) scheduler, -D1, 2.5.2-pre9, 2.4.17 2002-01-09 5:05 ` Davide Libenzi @ 2002-01-09 3:32 ` Rusty Russell 2002-01-09 18:02 ` Davide Libenzi 2002-01-09 11:37 ` Ingo Molnar 1 sibling, 1 reply; 41+ messages in thread From: Rusty Russell @ 2002-01-09 3:32 UTC (permalink / raw) To: Davide Libenzi; +Cc: kravetz, mingo, torvalds, linux-kernel, george On Tue, 8 Jan 2002 21:05:23 -0800 (PST) Davide Libenzi <davidel@xmailserver.org> wrote: > Mike can you try the patch listed below on custom pre-10 ? > I've got 30-70% better performances with the chat_s/c test. I'd encourage you to use hackbench, which is basically "the part of chat_c/s that is interesting". And I'd encourage you to come up with a better name, too 8) Cheers, Rusty. /* Simple scheduler test. */ #include <unistd.h> #include <fcntl.h> #include <stdio.h> #include <stdlib.h> #include <string.h> #include <errno.h> #include <sys/types.h> #include <sys/socket.h> #include <sys/wait.h> #include <sys/time.h> #include <sys/poll.h> static int use_pipes = 0; static void barf(const char *msg) { fprintf(stderr, "%s (error: %s)\n", msg, strerror(errno)); exit(1); } static void fdpair(int fds[2]) { if (use_pipes) { if (pipe(fds) == 0) return; } else { if (socketpair(AF_UNIX, SOCK_STREAM, 0, fds) == 0) return; } barf("Creating fdpair"); } /* Block until we're ready to go */ static void ready(int ready_out, int wakefd) { char dummy; struct pollfd pollfd = { .fd = wakefd, .events = POLLIN }; /* Tell them we're ready. 
*/ if (write(ready_out, &dummy, 1) != 1) barf("CLIENT: ready write"); /* Wait for "GO" signal */ if (poll(&pollfd, 1, -1) != 1) barf("poll"); } static void reader(int ready_out, int wakefd, unsigned int loops, int fd) { char dummy; unsigned int i; ready(ready_out, wakefd); for (i = 0; i < loops; i++) { if (read(fd, &dummy, 1) != 1) barf("READER: read"); } } /* Start the server */ static void server(int ready_out, int wakefd, unsigned int loops, unsigned int num_fds) { unsigned int i; int write_fds[num_fds]; unsigned int counters[num_fds]; for (i = 0; i < num_fds; i++) { int fds[2]; fdpair(fds); switch (fork()) { case -1: barf("fork()"); case 0: close(fds[1]); reader(ready_out, wakefd, loops, fds[0]); exit(0); } close(fds[0]); write_fds[i] = fds[1]; if (fcntl(write_fds[i], F_SETFL, O_NONBLOCK) != 0) barf("fcntl NONBLOCK"); counters[i] = 0; } ready(ready_out, wakefd); for (i = 0; i < loops * num_fds;) { unsigned int j; char dummy; for (j = 0; j < num_fds; j++) { if (counters[j] < loops) { if (write(write_fds[j], &dummy, 1) == 1) { counters[j]++; i++; } else if (errno != EAGAIN) barf("write"); } } } /* Reap them all */ for (i = 0; i < num_fds; i++) { int status; wait(&status); if (!WIFEXITED(status)) exit(1); } exit(0); } int main(int argc, char *argv[]) { unsigned int i; struct timeval start, stop, diff; unsigned int num_fds; int readyfds[2], wakefds[2]; char dummy; int status; if (argv[1] && strcmp(argv[1], "-pipe") == 0) { use_pipes = 1; argc--; argv++; } if (argc != 2 || (num_fds = atoi(argv[1])) == 0) barf("Usage: hackbench2 [-pipe] <num pipes>\n"); fdpair(readyfds); fdpair(wakefds); switch (fork()) { case -1: barf("fork()"); case 0: server(readyfds[1], wakefds[0], 10000, num_fds); exit(0); } /* Wait for everyone to be ready */ for (i = 0; i < num_fds+1; i++) if (read(readyfds[0], &dummy, 1) != 1) barf("Reading for readyfds"); gettimeofday(&start, NULL); /* Kick them off */ if (write(wakefds[1], &dummy, 1) != 1) barf("Writing to start them"); /* Reap server */ 
wait(&status); if (!WIFEXITED(status)) exit(1); gettimeofday(&stop, NULL); /* Print time... */ timersub(&stop, &start, &diff); printf("Time: %lu.%03lu\n", diff.tv_sec, diff.tv_usec/1000); exit(0); } -- Anyone who quotes me in their sig is an idiot. -- Rusty Russell. ^ permalink raw reply [flat|nested] 41+ messages in thread
* Re: [patch] O(1) scheduler, -D1, 2.5.2-pre9, 2.4.17 2002-01-09 3:32 ` Rusty Russell @ 2002-01-09 18:02 ` Davide Libenzi 0 siblings, 0 replies; 41+ messages in thread From: Davide Libenzi @ 2002-01-09 18:02 UTC (permalink / raw) To: Rusty Russell Cc: Mike Kravetz, Ingo Molnar, Linus Torvalds, lkml, georgr anzinger On Wed, 9 Jan 2002, Rusty Russell wrote: > On Tue, 8 Jan 2002 21:05:23 -0800 (PST) > Davide Libenzi <davidel@xmailserver.org> wrote: > > Mike can you try the patch listed below on custom pre-10 ? > > I've got 30-70% better performances with the chat_s/c test. > > I'd encourage you to use hackbench, which is basically "the part of chat_c/s > that is interesting". > > And I'd encourage you to come up with a better name, too 8) Got it. I'll try. - Davide ^ permalink raw reply [flat|nested] 41+ messages in thread
* Re: [patch] O(1) scheduler, -D1, 2.5.2-pre9, 2.4.17 2002-01-09 5:05 ` Davide Libenzi 2002-01-09 3:32 ` Rusty Russell @ 2002-01-09 11:37 ` Ingo Molnar 2002-01-09 11:19 ` Rene Rebe ` (2 more replies) 1 sibling, 3 replies; 41+ messages in thread From: Ingo Molnar @ 2002-01-09 11:37 UTC (permalink / raw) To: Davide Libenzi; +Cc: Mike Kravetz, Linus Torvalds, lkml, george anzinger On Tue, 8 Jan 2002, Davide Libenzi wrote: > Mike can you try the patch listed below on custom pre-10 ? > I've got 30-70% better performances with the chat_s/c test. i've compared this patch of yours (which changes the way interactivity is detected and timeslices are distributed), to 2.5.2-pre10-vanilla on a 2-way 466 MHz Celeron box: davide-patch-2.5.2-pre10 running at default priority: # ./chat_s 127.0.0.1 # ./chat_c 127.0.0.1 10 1000 Average throughput : 123103 messages per second Average throughput : 105122 messages per second Average throughput : 112901 messages per second [ system is *unusable* interactively, during the whole test. ] davide-patch-2.5.2-pre10 running at nice level 19: # nice -n 19 ./chat_s 127.0.0.1 # nice -n 19 ./chat_c 127.0.0.1 10 1000 Average throughput : 109337 messages per second Average throughput : 122077 messages per second Average throughput : 105296 messages per second [ system is *unusable* interactively, despite renicing. ] 2.5.2-pre10-vanilla running the test at the default priority level: # ./chat_s 127.0.0.1 # ./chat_c 127.0.0.1 10 1000 Average throughput : 124676 messages per second Average throughput : 102244 messages per second Average throughput : 115841 messages per second [ system is unresponsive at the start of the test, but once the 2.5.2-pre10 load-estimator establishes which task is interactive and which one is not, the system becomes usable. Load can be felt and there are frequent delays in commands. 
] 2.5.2-pre10-vanilla running at nice level 19: # nice -n 19 ./chat_s 127.0.0.1 # nice -n 19 ./chat_c 127.0.0.1 10 1000 Average throughput : 214626 messages per second Average throughput : 220876 messages per second Average throughput : 225529 messages per second [ system is usable from the beginning - nice levels are working as expected. Load can be felt while executing shell commands, but the system is usable. Load cannot be felt in truly interactive applications like editors. ] Summary of throughput results: 2.5.2-pre10-vanilla is equivalent throughput-wise in the test with your patched kernel, but the vanilla kernel is about 100% faster than your patched kernel when running reniced. but the interactivity observations are the real showstoppers in my opinion. With your patch applied the system became *unbearably* slow during the test. i have three observations about why your patch causes these effects (we had email discussions about this topic in private already, so you probably know my position): - your patch adds the 'recalculation based priority distribution method' that is in 2.5.2-pre9 to the O(1) scheduler. (2.5.2-pre9's priority distribution scheme is an improved but conceptually equivalent version of the priority distribution scheme of 2.4.17 - which scheme was basically unchanged since 1991. ) originally the O(1) patch was using the priority distribution scheme of 2.5.2-pre9 (it's very easy to switch between the two methods), but i have changed it because: there is a flaw in the recalculation-based (array-switch based in O(1) scheduler terms) priority distribution scheme: interactive tasks will get new timeslices depending on the frequency of recalculations. But *exactly under load*, the frequency of recalculations gets very, very low - it can be more than 10 seconds. In the above test this property causes shell interactivity to degrade so dramatically. 
Interactive tasks might accumulate up to 64 timeslices, but it's easy for them to use up this reserve in such high load situations and they'll never get back any new timeslices. Mike, do you agree with this analysis? [if anyone wants to look at the new estimator code then please apply the -E1 patch to -pre10, which cleans up the estimator code and comments it, without changing functionality.] - your patch in essence makes the scheduler ignore things like nice level +19. We *used to* ignore nice levels, but with the new load estimator this has changed, and personally i dont think i want to go back to the old behavior. - the system i tested has a CPU more than twice as slow as yours. So i'd suggest for you to repeat those exact tests but increase the number of 'rooms' to something like 40 (i know you tried 20 rooms, i dont think it's enough), and increase the number of messages sent, from 1000 to 5000 or something like that. your patch indeed decreases the load estimation and interactivity detection overhead and code complexity - but as the above tests have shown, at the price of interactivity, and in some cases even at the price of throughput. Ingo ^ permalink raw reply [flat|nested] 41+ messages in thread
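Ingo's starvation argument can be made concrete with a back-of-the-envelope model (the numbers below are illustrative, not measurements). Under the array-switch scheme, timeslices are only recalculated when the active array empties, and with many always-runnable tasks that takes roughly nr_running full timeslices:

```c
#include <assert.h>

#define HZ 100  /* timer ticks per second; typical x86 value of the era */

/* Toy model of the array-switch scheme: with nr_running always-runnable
 * tasks, each holding a timeslice of 'slice' ticks, the active array only
 * empties - and timeslices are only recalculated - once every task has
 * drained its slice.  An interactive task that has used up its reserve
 * must wait about this long for a refill. */
static unsigned int refill_wait_secs(unsigned int nr_running, unsigned int slice)
{
	return nr_running * slice / HZ;
}
```

With the load Davide reported (around 200 runnable tasks) and hypothetical 50 ms slices, the refill interval is already ~10 seconds, which matches the "more than 10 seconds" figure above; at light load the same formula gives sub-second refills.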
* Re: [patch] O(1) scheduler, -D1, 2.5.2-pre9, 2.4.17 2002-01-09 11:37 ` Ingo Molnar @ 2002-01-09 11:19 ` Rene Rebe 2002-01-09 15:34 ` Ryan Cumming 2002-01-09 18:24 ` Davide Libenzi 2002-01-09 20:15 ` Linus Torvalds 2 siblings, 1 reply; 41+ messages in thread From: Rene Rebe @ 2002-01-09 11:19 UTC (permalink / raw) To: mingo; +Cc: davidel, kravetz, torvalds, linux-kernel, george Hi. From: Ingo Molnar <mingo@elte.hu> Subject: Re: [patch] O(1) scheduler, -D1, 2.5.2-pre9, 2.4.17 Date: Wed, 9 Jan 2002 12:37:46 +0100 (CET) [...] > 2.5.2-pre10-vanilla running the test at the default priority level: > > # ./chat_s 127.0.0.1 > # ./chat_c 127.0.0.1 10 1000 > > Average throughput : 124676 messages per second > Average throughput : 102244 messages per second > Average throughput : 115841 messages per second > > [ system is unresponsive at the start of the test, but > once the 2.5.2-pre10 load-estimator establishes which task is > interactive and which one is not, the system becomes usable. > Load can be felt and there are frequent delays in commands. ] > > 2.5.2-pre10-vanilla running at nice level 19: > > # nice -n 19 ./chat_s 127.0.0.1 > # nice -n 19 ./chat_c 127.0.0.1 10 1000 > > Average throughput : 214626 messages per second > Average throughput : 220876 messages per second > Average throughput : 225529 messages per second > > [ system is usable from the beginning - nice levels are working as > expected. Load can be felt while executing shell commands, but the > system is usable. Load cannot be felt in truly interactive > applications like editors. > > Summary of throughput results: 2.5.2-pre10-vanilla is equivalent > throughput-wise in the test with your patched kernel, but the vanilla > kernel is about 100% faster than your patched kernel when running reniced. Could someone tell a non-kernel-hacker why this benchmark is nearly twice as fast when running reniced??? Shouldn't it be slower when it runs with lower priority (And you execute / type some commands during it)? 
[...] > Ingo k33p h4ck1n6 René -- René Rebe (Registered Linux user: #248718 <http://counter.li.org>) eMail: rene.rebe@gmx.net rene@rocklinux.org Homepage: http://www.tfh-berlin.de/~s712059/index.html Anyone sending unwanted advertising e-mail to this address will be charged $25 for network traffic and computing time. By extracting my address from this message or its header, you agree to these terms. ^ permalink raw reply [flat|nested] 41+ messages in thread
* Re: [patch] O(1) scheduler, -D1, 2.5.2-pre9, 2.4.17 2002-01-09 11:19 ` Rene Rebe @ 2002-01-09 15:34 ` Ryan Cumming 0 siblings, 0 replies; 41+ messages in thread From: Ryan Cumming @ 2002-01-09 15:34 UTC (permalink / raw) To: Rene Rebe; +Cc: linux-kernel@vger.kernel.org On January 9, 2002 03:19, Rene Rebe wrote: > Could someone tell a non-kernel-hacker why this benchmark is nearly > twice as fast when running reniced??? Shouldn't it be slower when it > runs with lower priority (And you execute / type some commands during > it)? In addition to using the nice level as a priority hint, the new scheduler also uses it as a hint of how "CPU-bound" a process is. Negative (higher priority) nice levels give the process short, frequent timeslices. Positive priorities give the process long, infrequent time slices. On an otherwise (mostly) idle system, both processes will get the same amount of CPU time, but distributed in a different way. In applications that really don't care about interactivity, the long time slice will increase their efficiency greatly. In addition to having fewer context switches (and therefore less context switch overhead), the longer time slices give them more time to warm up the cache. This has been referred to as "batching", as the process is executing at once what would normally take many shorter timeslices to complete. So, what you're actually seeing is the reniced task not taking up more CPU time (it's probably actually using slightly less), just using the CPU time more efficiently. <worships Ingo> -Ryan ^ permalink raw reply [flat|nested] 41+ messages in thread
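The mapping Ryan describes can be sketched as a simple monotonic function from nice level to timeslice length. The linear form and the constants below are invented for illustration; they are not the kernel's actual PRIO_TO_TIMESLICE values:

```c
#include <assert.h>

/* Illustrative mapping in the spirit of Ryan's description: high-priority
 * (negative nice) tasks get short, frequent slices; low-priority (positive
 * nice) tasks get long, infrequent slices.  Bounds are hypothetical. */
static int nice_to_timeslice_ms(int nice)
{
	const int min_ms = 10, max_ms = 300;

	/* map nice in [-20, 19] linearly onto [min_ms, max_ms] */
	return min_ms + (nice + 20) * (max_ms - min_ms) / 39;
}
```

Under this model a nice +19 task runs for up to 300 ms at a stretch, so a throughput benchmark reniced to +19 pays far less context-switch and cache-refill overhead per unit of work, which is consistent with the ~2x numbers Ingo posted.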
* Re: [patch] O(1) scheduler, -D1, 2.5.2-pre9, 2.4.17 2002-01-09 11:37 ` Ingo Molnar 2002-01-09 11:19 ` Rene Rebe @ 2002-01-09 18:24 ` Davide Libenzi 2002-01-09 21:24 ` Ingo Molnar 2002-01-09 20:15 ` Linus Torvalds 2 siblings, 1 reply; 41+ messages in thread From: Davide Libenzi @ 2002-01-09 18:24 UTC (permalink / raw) To: Ingo Molnar; +Cc: Mike Kravetz, Linus Torvalds, lkml, george anzinger On Wed, 9 Jan 2002, Ingo Molnar wrote: > > On Tue, 8 Jan 2002, Davide Libenzi wrote: > > > Mike can you try the patch listed below on custom pre-10 ? > > I've got 30-70% better performances with the chat_s/c test. > > i've compared this patch of yours (which changes the way interactivity is > detected and timeslices are distributed), to 2.5.2-pre10-vanilla on a > 2-way 466 MHz Celeron box: > > davide-patch-2.5.2-pre10 running at default priority: > > # ./chat_s 127.0.0.1 > # ./chat_c 127.0.0.1 10 1000 > > Average throughput : 123103 messages per second > Average throughput : 105122 messages per second > Average throughput : 112901 messages per second > > [ system is *unusable* interactively, during the whole test. ] > > davide-patch-2.5.2-pre10 running at nice level 19: > > # nice -n 19 ./chat_s 127.0.0.1 > # nice -n 19 ./chat_c 127.0.0.1 10 1000 > > Average throughput : 109337 messages per second > Average throughput : 122077 messages per second > Average throughput : 105296 messages per second > > [ system is *unusable* interactively, despite renicing. ] > > 2.5.2-pre10-vanilla running the test at the default priority level: > > # ./chat_s 127.0.0.1 > # ./chat_c 127.0.0.1 10 1000 > > Average throughput : 124676 messages per second > Average throughput : 102244 messages per second > Average throughput : 115841 messages per second > > [ system is unresponsive at the start of the test, but > once the 2.5.2-pre10 load-estimator establishes which task is > interactive and which one is not, the system becomes usable. > Load can be felt and there are frequent delays in commands. 
] > > 2.5.2-pre10-vanilla running at nice level 19: > > # nice -n 19 ./chat_s 127.0.0.1 > # nice -n 19 ./chat_c 127.0.0.1 10 1000 > > Average throughput : 214626 messages per second > Average throughput : 220876 messages per second > Average throughput : 225529 messages per second > > [ system is usable from the beginning - nice levels are working as > expected. Load can be felt while executing shell commands, but the > system is usable. Load cannot be felt in truly interactive > applications like editors. > > Summary of throughput results: 2.5.2-pre10-vanilla is equivalent > throughput-wise in the test with your patched kernel, but the vanilla > kernel is about 100% faster than your patched kernel when running reniced. > > but the interactivity observations are the real showstoppers in my > opinion. With your patch applied the system became *unbearably* slow > during the test. Ingo, this is not the picture that i've got from my machine. ------------------------------------------------------------------- AMD Athlon 1GHz 256 Mb RAM, swap_cnt patch : # nice -n 19 chat_s 127.0.0.1 & # nice -n 19 chat_c 127.0.0.1 20 1000 125236 123988 128048 with : r b w swpd free buff cache si so bi bo in cs us sy id 198 0 0 1476 28996 8024 89408 0 0 0 108 812 19424 12 87 1 216 0 1 1476 32388 8024 89412 0 0 0 0 523 56344 9 91 0 134 0 1 1476 32812 8024 89412 0 0 0 0 578 32374 9 91 0 96 1 1 1476 33540 8024 89412 0 0 0 0 114 7910 13 87 0 81 0 0 1476 35412 8024 89420 0 0 0 12 657 54034 12 88 0 pre-10 : 135684 127456 132420 the niced -20 vmstat has not been run for the whole test time and the system seemed quite bad ( personal feeling, not for the whole test time but for 1-2 sec spots ) compared with the previous test. The whole point Ingo is that during the test we've had 200 tasks on the run queue with a cs 8000..50000 !!? 
AMD Athlon 1GHz, swap_cnt patch : # chat_s 127.0.0.1 & # chat_c 127.0.0.1 20 1000 118386 114464 117972 pre-10 : 90066 88234 92612 I was not able to identify any interactive feel difference here. ---------------------------------------------------------------------- Today i'll try the same on both my dual cpu systems ( PIII 733 and PIII 1GHz ) I really fail to understand why you're asking everyone to run your test reniced ?!? > - your patch in essence makes the scheduler ignore things like nice > level +19. We *used to* ignore nice levels, but with the new load > estimator this has changed, and personally i dont think i want to go > back to the old behavior. Ingo for the duration of the test the `nice -n 20 vmstat -n 1` never ran during the roughly 20 seconds of the test. With the swap_cnt correction it ran 5-6 times. > - the system i tested has a CPU more than twice as slow as yours. So i'd > suggest for you to repeat those exact tests but increase the number of > 'rooms' to something like 40 (i know you tried 20 rooms, i dont think > it's enough), and increase the number of messages sent, from 1000 to > 5000 or something like that. Ingo, with 20 rooms my system was loaded with more than 200 tasks on the run queue and was switching at 50000 times/sec. Don't you think that it's enough for a single cpu system ??!! > your patch indeed decreases the load estimation and interactivity > detection overhead and code complexity - but as the above tests have > shown, at the price of interactivity, and in some cases even at the price > of throughput. Ingo i tried to be as impartial as possible and during the test i was not able to identify any difference in system usability. As i wrote you in private, the only spot i've had of system unusability was running with stock pre10 ( but this could have happened occasionally ). - Davide ^ permalink raw reply [flat|nested] 41+ messages in thread
* Re: [patch] O(1) scheduler, -D1, 2.5.2-pre9, 2.4.17 2002-01-09 18:24 ` Davide Libenzi @ 2002-01-09 21:24 ` Ingo Molnar 2002-01-09 19:38 ` Mike Kravetz 2002-01-09 22:34 ` Mark Hahn 0 siblings, 2 replies; 41+ messages in thread From: Ingo Molnar @ 2002-01-09 21:24 UTC (permalink / raw) To: Davide Libenzi; +Cc: Mike Kravetz, Linus Torvalds, lkml, george anzinger On Wed, 9 Jan 2002, Davide Libenzi wrote: > the niced -20 vmstat has not been run for the whole test time and the [...] > Ingo for the duration of the test the `nice -n 20 vmstat -n 1` never > run for about the 20 seconds. With the swap_cnt correction it ran for > 5-6 times. no wonder, it should be 'nice -n -20 vmstat -n 1'. And you should also do a 'renice -20 $$ $PPID' before running vmstat. (if you are about to run comparisons, i'd suggest the -G1 patch so you'll have all the recent fixes.) Ingo ^ permalink raw reply [flat|nested] 41+ messages in thread
* Re: [patch] O(1) scheduler, -D1, 2.5.2-pre9, 2.4.17 2002-01-09 21:24 ` Ingo Molnar @ 2002-01-09 19:38 ` Mike Kravetz 2002-01-10 18:21 ` Mike Kravetz 2002-01-09 22:34 ` Mark Hahn 1 sibling, 1 reply; 41+ messages in thread From: Mike Kravetz @ 2002-01-09 19:38 UTC (permalink / raw) To: Ingo Molnar; +Cc: Davide Libenzi, Linus Torvalds, lkml, george anzinger On Wed, Jan 09, 2002 at 10:24:00PM +0100, Ingo Molnar wrote: > (if you are about to run > comparisons, i'd suggest the -G1 patch so you'll have all the recent > fixes.) I just kicked off another benchmark run to compare pre10, pre10 & G1 patch, pre10 & Davide's patch. chat and make will be run as before with the addition of chat reniced. I won't attempt to make any claims about interactive responsiveness. Simple throughput numbers. Results should be available in about 24 hours. -- Mike ^ permalink raw reply [flat|nested] 41+ messages in thread
* Re: [patch] O(1) scheduler, -D1, 2.5.2-pre9, 2.4.17 2002-01-09 19:38 ` Mike Kravetz @ 2002-01-10 18:21 ` Mike Kravetz 2002-01-10 19:08 ` Davide Libenzi 0 siblings, 1 reply; 41+ messages in thread From: Mike Kravetz @ 2002-01-10 18:21 UTC (permalink / raw) To: Ingo Molnar; +Cc: Davide Libenzi, Linus Torvalds, lkml, george anzinger On Wed, Jan 09, 2002 at 11:38:33AM -0800, Mike Kravetz wrote: > > I just kicked off another benchmark run to compare pre10, pre10 & G1 > patch, pre10 & Davide's patch. It wasn't a good night for benchmarking. I had a typo in the script to run chat reniced and as a result didn't collect any numbers for this. In addition, the kernel with Davide's patch failed to boot with 8 CPUs enabled. Can't see any '# CPU specific' mods in the patch. In any case, here is what I do have. -------------------------------------------------------------------- mkbench - Time how long it takes to compile the kernel. On this 8 CPU system we use 'make -j 8' and increase the number of makes run in parallel. Result is average build time in seconds. Lower is better. -------------------------------------------------------------------- # CPUs # Makes pre10 pre10-G1 pre10-Davide -------------------------------------------------------------------- 2 1 189 190 185 2 2 370 376 362 2 4 733 726* 717 2 6 1102 1082* 1077 4 1 101 99 101 4 2 199 192 195 4 4 387 382 381 4 6 581 551 568 8 1 58 56 - 8 2 110 104 - 8 4 214 204 - 8 6 314 305 - * Most likely statistically invalid results. I run these things 3 times to make sure results are at least consistent. With pre10-G1 results varied more than the others. Items marked with * had extremely high variations. -------------------------------------------------------------------- Chat - VolanoMark simulator. Result is a measure of throughput. Higher is better. 
-------------------------------------------------------------------- Configuration Parms # CPUs pre10 pre10-G1 pre10-Davide -------------------------------------------------------------------- 10 rooms, 200 messages 2 143041 107718 181556 20 rooms, 200 messages 2 147335 147151 166048 30 rooms, 200 messages 2 179370 190413 173135 10 rooms, 200 messages 4 264033 287076 272597 20 rooms, 200 messages 4 243873 241855 273219 30 rooms, 200 messages 4 303228 301175 278513 10 rooms, 200 messages 8 304754 306891 - 20 rooms, 200 messages 8 241077 301414 - 30 rooms, 200 messages 8 309485 333660 - -- Mike ^ permalink raw reply [flat|nested] 41+ messages in thread
* Re: [patch] O(1) scheduler, -D1, 2.5.2-pre9, 2.4.17 2002-01-10 18:21 ` Mike Kravetz @ 2002-01-10 19:08 ` Davide Libenzi 2002-01-10 19:09 ` Linus Torvalds 2002-01-10 19:15 ` Mike Kravetz 0 siblings, 2 replies; 41+ messages in thread From: Davide Libenzi @ 2002-01-10 19:08 UTC (permalink / raw) To: Mike Kravetz; +Cc: Ingo Molnar, Linus Torvalds, lkml, george anzinger On Thu, 10 Jan 2002, Mike Kravetz wrote: > On Wed, Jan 09, 2002 at 11:38:33AM -0800, Mike Kravetz wrote: > > > > I just kicked off another benchmark run to compare pre10, pre10 & G1 > > patch, pre10 & Davide's patch. > > It wasn't a good night for benchmarking. I had a typo in the > script to run chat reniced and as a result didn't collect any > numbers for this. In addition, the kernel with Davide's patch > failed to boot with 8 CPUs enabled. Can't see any '# CPU specific' > mods in the patch. In any case, here is what I do have. Doh !! Do you have a panic dump Mike ? - Davide ^ permalink raw reply [flat|nested] 41+ messages in thread
* Re: [patch] O(1) scheduler, -D1, 2.5.2-pre9, 2.4.17 2002-01-10 19:08 ` Davide Libenzi @ 2002-01-10 19:09 ` Linus Torvalds 2002-01-10 21:08 ` Davide Libenzi 2002-01-10 19:15 ` Mike Kravetz 1 sibling, 1 reply; 41+ messages in thread From: Linus Torvalds @ 2002-01-10 19:09 UTC (permalink / raw) To: Davide Libenzi; +Cc: Mike Kravetz, Ingo Molnar, lkml, george anzinger On Thu, 10 Jan 2002, Davide Libenzi wrote: > > > > It wasn't a good night for benchmarking. I had a typo in the > > script to run chat reniced and as a result didn't collect any > > numbers for this. In addition, the kernel with Davide's patch > > failed to boot with 8 CPUs enabled. Can't see any '# CPU specific' > > mods in the patch. In any case, here is what I do have. > > Doh !! Do you have a panic dump Mike ? I bet it's just the placement of "init_idle()" in init/main.c, which is unrelated to the scheduling proper, but if the kernel thread is started before the boot CPU has done its "init_idle()", then the scheduler state isn't really set up fully yet. (Old bug, I think its been there for a long time, I just think that the old scheduler didn't much care, and the "child runs first" logic in particular of the new scheduler probably just showed it more clearly) Linus ^ permalink raw reply [flat|nested] 41+ messages in thread
* Re: [patch] O(1) scheduler, -D1, 2.5.2-pre9, 2.4.17 2002-01-10 19:09 ` Linus Torvalds @ 2002-01-10 21:08 ` Davide Libenzi 0 siblings, 0 replies; 41+ messages in thread From: Davide Libenzi @ 2002-01-10 21:08 UTC (permalink / raw) To: Linus Torvalds; +Cc: Mike Kravetz, Ingo Molnar, lkml, george anzinger On Thu, 10 Jan 2002, Linus Torvalds wrote: > > On Thu, 10 Jan 2002, Davide Libenzi wrote: > > > > > > It wasn't a good night for benchmarking. I had a typo in the > > > script to run chat reniced and as a result didn't collect any > > > numbers for this. In addition, the kernel with Davide's patch > > > failed to boot with 8 CPUs enabled. Can't see any '# CPU specific' > > > mods in the patch. In any case, here is what I do have. > > > > Doh !! Do you have a panic dump Mike ? > > I bet it's just the placement of "init_idle()" in init/main.c, which is > unrelated to the scheduling proper, but if the kernel thread is started > before the boot CPU has done its "init_idle()", then the scheduler state > isn't really set up fully yet. > > (Old bug, I think its been there for a long time, I just think that the > old scheduler didn't much care, and the "child runs first" logic in > particular of the new scheduler probably just showed it more clearly) Uhm, seems fixed in pre11. Did you fix it in pre10->pre11 stage ? - Davide ^ permalink raw reply [flat|nested] 41+ messages in thread
* Re: [patch] O(1) scheduler, -D1, 2.5.2-pre9, 2.4.17
  2002-01-10 19:08     ` Davide Libenzi
  2002-01-10 19:09       ` Linus Torvalds
@ 2002-01-10 19:15       ` Mike Kravetz
  2002-01-10 20:05         ` Davide Libenzi
  1 sibling, 1 reply; 41+ messages in thread
From: Mike Kravetz @ 2002-01-10 19:15 UTC (permalink / raw)
To: Davide Libenzi; +Cc: Ingo Molnar, Linus Torvalds, lkml, george anzinger

On Thu, Jan 10, 2002 at 11:08:21AM -0800, Davide Libenzi wrote:
> On Thu, 10 Jan 2002, Mike Kravetz wrote:
> > >
> > > I just kicked off another benchmark run to compare pre10, pre10 & G1
> > > patch, pre10 & Davide's patch.
> >
> > It wasn't a good night for benchmarking.  I had a typo in the
> > script to run chat reniced and as a result didn't collect any
> > numbers for this.  In addition, the kernel with Davide's patch
> > failed to boot with 8 CPUs enabled.  Can't see any '# CPU specific'
> > mods in the patch.  In any case, here is what I do have.
>
> Doh !! Do you have a panic dump Mike ?

It didn't panic, but hung during the boot process.  After reading other
mail, this may be caused by the out of order locking bug/deadlock that
existed in this version of the O(1) scheduler.  I may be able to try and
verify later today.  Right now the machine is being used for something
else.

-- 
Mike

^ permalink raw reply	[flat|nested] 41+ messages in thread
* Re: [patch] O(1) scheduler, -D1, 2.5.2-pre9, 2.4.17
  2002-01-10 19:15       ` Mike Kravetz
@ 2002-01-10 20:05         ` Davide Libenzi
  0 siblings, 0 replies; 41+ messages in thread
From: Davide Libenzi @ 2002-01-10 20:05 UTC (permalink / raw)
To: Mike Kravetz; +Cc: Ingo Molnar, Linus Torvalds, lkml, george anzinger

On Thu, 10 Jan 2002, Mike Kravetz wrote:

> Right now the machine is being used for something else.

Do they know at IBM that you're using 8 way SMP systems to run
counter-strike servers ? :-)

- Davide

^ permalink raw reply	[flat|nested] 41+ messages in thread
* Re: [patch] O(1) scheduler, -D1, 2.5.2-pre9, 2.4.17
  2002-01-09 21:24   ` Ingo Molnar
  2002-01-09 19:38     ` Mike Kravetz
@ 2002-01-09 22:34     ` Mark Hahn
  2002-01-10 14:04       ` Ingo Molnar
  1 sibling, 1 reply; 41+ messages in thread
From: Mark Hahn @ 2002-01-09 22:34 UTC (permalink / raw)
To: Ingo Molnar; +Cc: lkml

> no wonder, it should be 'nice -n -20 vmstat -n 1'.  And you should also do

I keep a suid setrealtime wrapper around (UNSAFE!) for this kind of use:

#include <unistd.h>
#include <stdlib.h>
#include <stdio.h>
#include <sched.h>

int main(int argc, char *argv[]) {
	static struct sched_param sched_parms;
	int pid, wrapper = 0;

	if (argc <= 1)
		return 1;
	pid = atoi(argv[1]);
	if (!pid || argc != 2) {
		wrapper = 1;
		pid = getpid();
	}
	sched_parms.sched_priority = sched_get_priority_min(SCHED_FIFO);
	if (sched_setscheduler(pid, SCHED_FIFO, &sched_parms) == -1) {
		perror("cannot set realtime scheduling policy");
		return 1;
	}
	if (wrapper) {
		setuid(getuid());
		execvp(argv[1], &argv[1]);
		perror("exec failed");
		return 1;
	}
	return 0;
}

regards, mark hahn.

^ permalink raw reply	[flat|nested] 41+ messages in thread
* Re: [patch] O(1) scheduler, -D1, 2.5.2-pre9, 2.4.17
  2002-01-09 22:34     ` Mark Hahn
@ 2002-01-10 14:04       ` Ingo Molnar
  0 siblings, 0 replies; 41+ messages in thread
From: Ingo Molnar @ 2002-01-10 14:04 UTC (permalink / raw)
To: Mark Hahn; +Cc: lkml

On Wed, 9 Jan 2002, Mark Hahn wrote:

> > no wonder, it should be 'nice -n -20 vmstat -n 1'. And you should also do
>
> I keep a suid setrealtime wrapper around (UNSAFE!) for this kind of use:

nice -20 is an equivalent but safe version of the same (if you use my
patches).  I made priority levels -20 ... -16 to be 'super-high priority',
ie. such tasks never expire.  (they can still drop above prio -16 if they
use up too much CPU time, so they cannot lock up systems accidentally like
RT tasks.)  So it's in essence an 'admin priority', for super-important
shells.  I'm using it with great success.

	Ingo

^ permalink raw reply	[flat|nested] 41+ messages in thread
* Re: [patch] O(1) scheduler, -D1, 2.5.2-pre9, 2.4.17
  2002-01-09 11:37 ` Ingo Molnar
  2002-01-09 11:19   ` Rene Rebe
  2002-01-09 18:24   ` Davide Libenzi
@ 2002-01-09 20:15   ` Linus Torvalds
  2002-01-09 23:02     ` Ingo Molnar
  2 siblings, 1 reply; 41+ messages in thread
From: Linus Torvalds @ 2002-01-09 20:15 UTC (permalink / raw)
To: Ingo Molnar; +Cc: Davide Libenzi, Mike Kravetz, lkml, george anzinger

On Wed, 9 Jan 2002, Ingo Molnar wrote:
>
> 2.5.2-pre10-vanilla running the test at the default priority level:
>
>   # ./chat_s 127.0.0.1
>   # ./chat_c 127.0.0.1 10 1000
>
>   Average throughput : 124676 messages per second
>   Average throughput : 102244 messages per second
>   Average throughput : 115841 messages per second
>
>   [ system is unresponsive at the start of the test, but
>     once the 2.5.2-pre10 load-estimator establishes which task is
>     interactive and which one is not, the system becomes usable.
>     Load can be felt and there are frequent delays in commands. ]
>
> 2.5.2-pre10-vanilla running at nice level 19:
>
>   # nice -n 19 ./chat_s 127.0.0.1
>   # nice -n 19 ./chat_c 127.0.0.1 10 1000
>
>   Average throughput : 214626 messages per second
>   Average throughput : 220876 messages per second
>   Average throughput : 225529 messages per second
>
>   [ system is usable from the beginning - nice levels are working as
>     expected.  Load can be felt while executing shell commands, but the
>     system is usable.  Load cannot be felt in truly interactive
>     applications like editors. ]

Ingo, there's something wrong there.

Not a way in hell should "nice 19" cause the throughput to improve like
that.  It looks like this is a result of "nice 19" simply doing
_different_ scheduling, possibly more batch-like, and as such those
numbers cannot sanely be compared to anything else.

(And if they _are_ comparable, then you should be able to get the good
numbers even without "nice 19".  Quite frankly it sounds to me like the
whole chat benchmark is another "dbench", ie doing unbalanced scheduling
_helps_ it performance-wise, which implies that it's probably a bad
benchmark to look at numbers for).

		Linus

^ permalink raw reply	[flat|nested] 41+ messages in thread
* Re: [patch] O(1) scheduler, -D1, 2.5.2-pre9, 2.4.17
  2002-01-09 20:15   ` Linus Torvalds
@ 2002-01-09 23:02     ` Ingo Molnar
  0 siblings, 0 replies; 41+ messages in thread
From: Ingo Molnar @ 2002-01-09 23:02 UTC (permalink / raw)
To: Linus Torvalds; +Cc: Davide Libenzi, Mike Kravetz, lkml, george anzinger

On Wed, 9 Jan 2002, Linus Torvalds wrote:

> Not a way in hell should "nice 19" cause the throughput to improve
> like that.  It looks like this is a result of "nice 19" simply doing
> _different_ scheduling, possibly more batch-like, and as such those
> numbers cannot sanely be compared to anything else.

yes, this is what happens.  The difference is that the load estimator
'punishes' tasks to have lower priority, while the recalc-based method
gives a 'bonus'.  If run with nice +19 then the process cannot be
punished anymore, all the tasks will run on the same priority level - and
none can cause a preemption of the other one.  The priority limit is set
right at the nice +19 level.

is this an intended thing with nice +19 tasks?  I think so, at least for
some usages.  It could be fixed by adding some more priority space (+13
levels) they could explore into (but which couldn't be set as the default
priority).

So by having a ceiling it really behaves differently, very batch-like -
but that's what such benchmarks are asking for anyway ...  I think it's
an intended effect for CPU hogs as well - we do not want them to preempt
each other, they should each use up their timeslices fully and
round-robin nicely.

> (And if they _are_ comparable, then you should be able to get the good
> numbers even without "nice 19".  Quite frankly it sounds to me like the
> whole chat benchmark is another "dbench", ie doing unbalanced
> scheduling _helps_ it performance-wise, which implies that it's
> probably a bad benchmark to look at numbers for).

yes, agreed.  It's not really unbalanced scheduling, the scheduler is
still fair.  What doesn't happen is priority based preemption.

i think it could be a bonus to have such a scheduler mode - people don't
run shells at +19 niceness level, it's the known CPU hogs that get
started up with nice +19.  It's a kind of SCHED_IDLE - everything can
preempt it and it will preempt nothing, without the priority inheritance
problems of SCHED_IDLE.

	Ingo

^ permalink raw reply	[flat|nested] 41+ messages in thread
* Re: [patch] O(1) scheduler, -D1, 2.5.2-pre9, 2.4.17
  2002-01-09  3:39 ` Mike Kravetz
  2002-01-09  5:05   ` Davide Libenzi
@ 2002-01-09  6:29   ` Brian
  2002-01-09  6:40     ` Jeffrey W. Baker
                       ` (2 more replies)
  2002-01-09 10:25   ` Ingo Molnar
  2 siblings, 3 replies; 41+ messages in thread
From: Brian @ 2002-01-09 6:29 UTC (permalink / raw)
To: Mike Kravetz; +Cc: linux-kernel

Can this be correct?

Intuitively, I would expect several CPUs hammering away at the compile to
finish faster than one.  Given these numbers, I would have to conclude
that is not just wrong, but absolutely wrong.  Compile time increases
linearly with the number of jobs, regardless of the number of CPUs.

What would cause this?  Severe memory bottlenecks?

-- Brian

On Tuesday 08 January 2002 10:39 pm, Mike Kravetz wrote:
> --------------------------------------------------------------------
> mkbench - Time how long it takes to compile the kernel.
>           We use 'make -j 8' and increase the number of makes run
>           in parallel.  Result is average build time in seconds.
>           Lower is better.
> --------------------------------------------------------------------
>  # CPUs   # Makes   Vanilla   O(1)   haMQ
> --------------------------------------------------------------------
>     2        1        188      192    184
>     2        2        366      372    362
>     2        4        730      742    600
>     2        6       1096     1112    853
>     4        1        102      101     95
>     4        2        196      198    186
>     4        4        384      386    374
>     4        6        576      579    487
>     8        1         58       57     58
>     8        2        109      108    105
>     8        4        209      213    186
>     8        6        309      312    280

^ permalink raw reply	[flat|nested] 41+ messages in thread
* Re: [patch] O(1) scheduler, -D1, 2.5.2-pre9, 2.4.17
  2002-01-09  6:29   ` Brian
@ 2002-01-09  6:40     ` Jeffrey W. Baker
  2002-01-09  6:45     ` Ryan Cumming
  2002-01-09  6:48     ` Ryan Cumming
  2 siblings, 0 replies; 41+ messages in thread
From: Jeffrey W. Baker @ 2002-01-09 6:40 UTC (permalink / raw)
To: Brian; +Cc: linux-kernel

On Wed, 9 Jan 2002, Brian wrote:

> Can this be correct?
>
> Intuitively, I would expect several CPUs hammering away at the compile to
> finish faster than one.  Given these numbers, I would have to conclude
> that is not just wrong, but absolutely wrong.  Compile time increases
> linearly with the number of jobs, regardless of the number of CPUs.
>
> What would cause this?  Severe memory bottlenecks?

Mike ran make -j 8 which means 8 compiler processes for each "# Makes" in
the table.  Thus, the first row has 8 parallel processes on a 2-way and
the last row has 48 processes on an 8-way.  The best ratio is 8 processes
on an 8-way which not incidentally also has the lowest time: 57 seconds.

-jwb

> On Tuesday 08 January 2002 10:39 pm, Mike Kravetz wrote:
> > --------------------------------------------------------------------
> > mkbench - Time how long it takes to compile the kernel.
> >           We use 'make -j 8' and increase the number of makes run
> >           in parallel.  Result is average build time in seconds.
> >           Lower is better.
> > --------------------------------------------------------------------
> >  # CPUs   # Makes   Vanilla   O(1)   haMQ
> > --------------------------------------------------------------------
> >     2        1        188      192    184
> >     2        2        366      372    362
> >     2        4        730      742    600
> >     2        6       1096     1112    853
> >     4        1        102      101     95
> >     4        2        196      198    186
> >     4        4        384      386    374
> >     4        6        576      579    487
> >     8        1         58       57     58
> >     8        2        109      108    105
> >     8        4        209      213    186
> >     8        6        309      312    280

^ permalink raw reply	[flat|nested] 41+ messages in thread
* Re: [patch] O(1) scheduler, -D1, 2.5.2-pre9, 2.4.17
  2002-01-09  6:29   ` Brian
  2002-01-09  6:40     ` Jeffrey W. Baker
@ 2002-01-09  6:45     ` Ryan Cumming
  2002-01-09  6:48     ` Ryan Cumming
  2 siblings, 0 replies; 41+ messages in thread
From: Ryan Cumming @ 2002-01-09 6:45 UTC (permalink / raw)
To: Brian; +Cc: linux-kernel

On January 8, 2002 22:29, Brian wrote:
> Can this be correct?
>
> Intuitively, I would expect several CPUs hammering away at the compile to
> finish faster than one.  Given these numbers, I would have to conclude
> that is not just wrong, but absolutely wrong.  Compile time increases
> linearly with the number of jobs, regardless of the number of CPUs.

In the charts in the original message, he's not increasing the number of
jobs, but the number of concurrent 'make -j8's.  Two makes should really
finish in half the time one make does.  I don't see any problem with the
results.

-Ryan

^ permalink raw reply	[flat|nested] 41+ messages in thread
* Re: [patch] O(1) scheduler, -D1, 2.5.2-pre9, 2.4.17
  2002-01-09  6:29   ` Brian
  2002-01-09  6:40     ` Jeffrey W. Baker
  2002-01-09  6:45     ` Ryan Cumming
@ 2002-01-09  6:48     ` Ryan Cumming
  2 siblings, 0 replies; 41+ messages in thread
From: Ryan Cumming @ 2002-01-09 6:48 UTC (permalink / raw)
To: Brian; +Cc: linux-kernel

On January 8, 2002 22:45, Ryan Cumming wrote:
> In the charts in the original message, he's not increasing the number of
> jobs, but the number of concurrent 'make -j8's.  Two makes should really
> finish in half the time one make does.  I don't see any problem with the
> results.

Er, I meant finish in twice the time one make does... really... ;)

-Ryan

^ permalink raw reply	[flat|nested] 41+ messages in thread
* Re: [patch] O(1) scheduler, -D1, 2.5.2-pre9, 2.4.17
  2002-01-09  3:39 ` Mike Kravetz
  2002-01-09  5:05   ` Davide Libenzi
  2002-01-09  6:29   ` Brian
@ 2002-01-09 10:25   ` Ingo Molnar
  2002-01-09 17:40     ` Mike Kravetz
  2 siblings, 1 reply; 41+ messages in thread
From: Ingo Molnar @ 2002-01-09 10:25 UTC (permalink / raw)
To: Mike Kravetz
Cc: Linus Torvalds, linux-kernel, george anzinger, Davide Libenzi

On Tue, 8 Jan 2002, Mike Kravetz wrote:

> --------------------------------------------------------------------
> Chat - VolanoMark simulator.  Result is a measure of throughput.
>        Higher is better.

very interesting numbers, nice work Mike!  I'd suggest the following
additional test: please also run tests like VolanoMark with 'nice -n 19'.
The O(1) scheduler's task-penalty method works in our favor in this case,
since we know the test is CPU-bound we can move all processes to nice
level 19.

	Ingo

^ permalink raw reply	[flat|nested] 41+ messages in thread
* Re: [patch] O(1) scheduler, -D1, 2.5.2-pre9, 2.4.17
  2002-01-09 10:25   ` Ingo Molnar
@ 2002-01-09 17:40     ` Mike Kravetz
  0 siblings, 0 replies; 41+ messages in thread
From: Mike Kravetz @ 2002-01-09 17:40 UTC (permalink / raw)
To: Ingo Molnar; +Cc: Linus Torvalds, linux-kernel, george anzinger, Davide Libenzi

On Wed, Jan 09, 2002 at 11:25:43AM +0100, Ingo Molnar wrote:
>
> On Tue, 8 Jan 2002, Mike Kravetz wrote:
>
> > --------------------------------------------------------------------
> > Chat - VolanoMark simulator.  Result is a measure of throughput.
> >        Higher is better.
>
> very interesting numbers, nice work Mike!  I'd suggest the following
> additional test: please also run tests like VolanoMark with 'nice -n 19'.
> The O(1) scheduler's task-penalty method works in our favor in this case,
> since we know the test is CPU-bound we can move all processes to nice
> level 19.
>
> 	Ingo

I'll do that in the next go around.  Right now, I'm trying to get some
TPC-H results.

-- Mike

^ permalink raw reply	[flat|nested] 41+ messages in thread
end of thread, other threads: [~2002-01-11 12:04 UTC | newest]

Thread overview: 41+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
[not found] <200201071922.g07JMN106760@penguin.transmeta.com>
2002-01-07 21:36 ` [patch] O(1) scheduler, -D1, 2.5.2-pre9, 2.4.17 Ingo Molnar
2002-01-08 8:49 ` FD Cami
2002-01-08 18:44 ` J Sloan
2002-01-08 11:32 ` Anton Blanchard
2002-01-08 11:43 ` Anton Blanchard
2002-01-08 14:34 ` Ingo Molnar
2002-01-09 23:15 ` Anton Blanchard
2002-01-10 1:09 ` Richard Henderson
2002-01-10 17:04 ` Ivan Kokshaysky
2002-01-10 20:42 ` george anzinger
2002-01-10 23:56 ` Ingo Molnar
2002-01-08 14:32 ` [patch] O(1) scheduler, -E1, 2.5.2-pre10, 2.4.17 Ingo Molnar
2002-01-07 20:24 [patch] O(1) scheduler, -D1, 2.5.2-pre9, 2.4.17 Ingo Molnar
2002-01-07 19:03 ` Brian Gerst
2002-01-07 21:19 ` Ingo Molnar
2002-01-09 3:39 ` Mike Kravetz
2002-01-09 5:05 ` Davide Libenzi
2002-01-09 3:32 ` Rusty Russell
2002-01-09 18:02 ` Davide Libenzi
2002-01-09 11:37 ` Ingo Molnar
2002-01-09 11:19 ` Rene Rebe
2002-01-09 15:34 ` Ryan Cumming
2002-01-09 18:24 ` Davide Libenzi
2002-01-09 21:24 ` Ingo Molnar
2002-01-09 19:38 ` Mike Kravetz
2002-01-10 18:21 ` Mike Kravetz
2002-01-10 19:08 ` Davide Libenzi
2002-01-10 19:09 ` Linus Torvalds
2002-01-10 21:08 ` Davide Libenzi
2002-01-10 19:15 ` Mike Kravetz
2002-01-10 20:05 ` Davide Libenzi
2002-01-09 22:34 ` Mark Hahn
2002-01-10 14:04 ` Ingo Molnar
2002-01-09 20:15 ` Linus Torvalds
2002-01-09 23:02 ` Ingo Molnar
2002-01-09 6:29 ` Brian
2002-01-09 6:40 ` Jeffrey W. Baker
2002-01-09 6:45 ` Ryan Cumming
2002-01-09 6:48 ` Ryan Cumming
2002-01-09 10:25 ` Ingo Molnar
2002-01-09 17:40 ` Mike Kravetz