* [BENCHMARK] 2.4.20-rc2-aa1 with contest
@ 2002-11-22 22:29 Con Kolivas
  2002-11-24 16:28 ` Andrea Arcangeli
  0 siblings, 1 reply; 7+ messages in thread

From: Con Kolivas @ 2002-11-22 22:29 UTC (permalink / raw)
To: linux kernel mailing list; +Cc: Andrea Arcangeli

Here is a partial run of contest (http://contest.kolivas.net) benchmarks for
rc2aa1 with the disk latency hack:

noload:
Kernel [runs]           Time    CPU%    Loads   LCPU%   Ratio
2.4.18 [5]              71.7    93      0       0       0.98
2.4.19 [5]              69.0    97      0       0       0.94
2.4.20-rc1 [3]          72.2    93      0       0       0.99
2.4.20-rc1aa1 [1]       71.9    94      0       0       0.98
2420rc2aa1 [1]          71.1    94      0       0       0.97

cacherun:
Kernel [runs]           Time    CPU%    Loads   LCPU%   Ratio
2.4.18 [2]              66.6    99      0       0       0.91
2.4.19 [2]              68.0    99      0       0       0.93
2.4.20-rc1 [3]          67.2    99      0       0       0.92
2.4.20-rc1aa1 [1]       67.4    99      0       0       0.92
2420rc2aa1 [1]          66.6    99      0       0       0.91

process_load:
Kernel [runs]           Time    CPU%    Loads   LCPU%   Ratio
2.4.18 [3]              109.5   57      119     44      1.50
2.4.19 [3]              106.5   59      112     43      1.45
2.4.20-rc1 [3]          110.7   58      119     43      1.51
2.4.20-rc1aa1 [3]       110.5   58      117     43      1.51*
2420rc2aa1 [1]          212.5   31      412     69      2.90*

This load just copies data between 4 processes repeatedly. It seems to take
longer.

ctar_load:
Kernel [runs]           Time    CPU%    Loads   LCPU%   Ratio
2.4.18 [3]              117.4   63      1       7       1.60
2.4.19 [2]              106.5   70      1       8       1.45
2.4.20-rc1 [3]          102.1   72      1       7       1.39
2.4.20-rc1aa1 [3]       107.1   69      1       7       1.46
2420rc2aa1 [1]          103.3   73      1       8       1.41

xtar_load:
Kernel [runs]           Time    CPU%    Loads   LCPU%   Ratio
2.4.18 [3]              150.8   49      2       8       2.06
2.4.19 [1]              132.4   55      2       9       1.81
2.4.20-rc1 [3]          180.7   40      3       8       2.47
2.4.20-rc1aa1 [3]       166.6   44      2       7       2.28*
2420rc2aa1 [1]          217.7   34      4       9       2.97*

Takes longer. This is only one run, though, so it may not be an accurate
average.

io_load:
Kernel [runs]           Time    CPU%    Loads   LCPU%   Ratio
2.4.18 [3]              474.1   15      36      10      6.48
2.4.19 [3]              492.6   14      38      10      6.73
2.4.20-rc1 [2]          1142.2  6       90      10      15.60
2.4.20-rc1aa1 [1]       1132.5  6       90      10      15.47
2420rc2aa1 [1]          164.3   44      10      9       2.24

This was where the effect of the disk latency hack was expected to show, and
it sure did.
read_load:
Kernel [runs]           Time    CPU%    Loads   LCPU%   Ratio
2.4.18 [3]              102.3   70      6       3       1.40
2.4.19 [2]              134.1   54      14      5       1.83
2.4.20-rc1 [3]          173.2   43      20      5       2.37
2.4.20-rc1aa1 [3]       150.6   51      16      5       2.06
2420rc2aa1 [1]          140.5   51      13      4       1.92

list_load:
Kernel [runs]           Time    CPU%    Loads   LCPU%   Ratio
2.4.18 [3]              90.2    76      1       17      1.23
2.4.19 [1]              89.8    77      1       20      1.23
2.4.20-rc1 [3]          88.8    77      0       12      1.21
2.4.20-rc1aa1 [1]       88.1    78      1       16      1.20
2420rc2aa1 [1]          99.7    69      1       19      1.36

mem_load:
Kernel [runs]           Time    CPU%    Loads   LCPU%   Ratio
2.4.18 [3]              103.3   70      32      3       1.41
2.4.19 [3]              100.0   72      33      3       1.37
2.4.20-rc1 [3]          105.9   69      32      2       1.45

mem_load hung the machine. I could not get rc2aa1 through this part of the
benchmark no matter how many times I tried to run it. I have no idea what
was going on, but it is easy to reproduce: simply run the mem_load out of
contest (which runs until it is killed) and the machine will hang.

Con

P.S. I'm having mailserver trouble, so respond to lkml where I may see
responses.
* Re: [BENCHMARK] 2.4.20-rc2-aa1 with contest
  2002-11-22 22:29 [BENCHMARK] 2.4.20-rc2-aa1 with contest Con Kolivas
@ 2002-11-24 16:28 ` Andrea Arcangeli
  2002-11-25  6:44   ` Con Kolivas
  0 siblings, 1 reply; 7+ messages in thread

From: Andrea Arcangeli @ 2002-11-24 16:28 UTC (permalink / raw)
To: Con Kolivas; +Cc: linux kernel mailing list

On Sat, Nov 23, 2002 at 09:29:22AM +1100, Con Kolivas wrote:
> Here is a partial run of contest (http://contest.kolivas.net) benchmarks
> for rc2aa1 with the disk latency hack:
>
> noload:
> Kernel [runs]           Time    CPU%    Loads   LCPU%   Ratio
> 2.4.18 [5]              71.7    93      0       0       0.98
> 2.4.19 [5]              69.0    97      0       0       0.94
> 2.4.20-rc1 [3]          72.2    93      0       0       0.99
> 2.4.20-rc1aa1 [1]       71.9    94      0       0       0.98
> 2420rc2aa1 [1]          71.1    94      0       0       0.97
>
> cacherun:
> Kernel [runs]           Time    CPU%    Loads   LCPU%   Ratio
> 2.4.18 [2]              66.6    99      0       0       0.91
> 2.4.19 [2]              68.0    99      0       0       0.93
> 2.4.20-rc1 [3]          67.2    99      0       0       0.92
> 2.4.20-rc1aa1 [1]       67.4    99      0       0       0.92
> 2420rc2aa1 [1]          66.6    99      0       0       0.91
>
> process_load:
> Kernel [runs]           Time    CPU%    Loads   LCPU%   Ratio
> 2.4.18 [3]              109.5   57      119     44      1.50
> 2.4.19 [3]              106.5   59      112     43      1.45
> 2.4.20-rc1 [3]          110.7   58      119     43      1.51
> 2.4.20-rc1aa1 [3]       110.5   58      117     43      1.51*
> 2420rc2aa1 [1]          212.5   31      412     69      2.90*
>
> This load just copies data between 4 processes repeatedly. It seems to
> take longer.

Can you go into linux/include/blkdev.h and increase MAX_QUEUE_SECTORS to
(2 << (20 - 9)) and see if it makes any difference here? If it doesn't make
a difference, it could be the slightly increased readahead, but I doubt it's
the latter.
> ctar_load:
> Kernel [runs]           Time    CPU%    Loads   LCPU%   Ratio
> 2.4.18 [3]              117.4   63      1       7       1.60
> 2.4.19 [2]              106.5   70      1       8       1.45
> 2.4.20-rc1 [3]          102.1   72      1       7       1.39
> 2.4.20-rc1aa1 [3]       107.1   69      1       7       1.46
> 2420rc2aa1 [1]          103.3   73      1       8       1.41
>
> xtar_load:
> Kernel [runs]           Time    CPU%    Loads   LCPU%   Ratio
> 2.4.18 [3]              150.8   49      2       8       2.06
> 2.4.19 [1]              132.4   55      2       9       1.81
> 2.4.20-rc1 [3]          180.7   40      3       8       2.47
> 2.4.20-rc1aa1 [3]       166.6   44      2       7       2.28*
> 2420rc2aa1 [1]          217.7   34      4       9       2.97*
>
> Takes longer. This is only one run, though, so it may not be an accurate
> average.

This is most probably a too-small waitqueue. Of course, increasing the
waitqueue will also increase latency a bit for the other workloads; it's a
tradeoff and there's no way around it. Even read-latency has the tradeoff
when it chooses the "nth" place to be the seventh slot, where it puts the
read request if insertion fails.

> io_load:
> Kernel [runs]           Time    CPU%    Loads   LCPU%   Ratio
> 2.4.18 [3]              474.1   15      36      10      6.48
> 2.4.19 [3]              492.6   14      38      10      6.73
> 2.4.20-rc1 [2]          1142.2  6       90      10      15.60
> 2.4.20-rc1aa1 [1]       1132.5  6       90      10      15.47
> 2420rc2aa1 [1]          164.3   44      10      9       2.24
>
> This was where the effect of the disk latency hack was expected to show,
> and it sure did.

Yes, I can certainly feel the machine being much more responsive during the
write load too. Too bad some benchmarks like dbench decreased significantly,
but I don't see too many ways around it. At least now, with those changes,
the contiguous write case is unaffected; my storage test box still reads and
writes at over 100 Mbyte/sec, for example. This clearly means that what
matters is having 512k DMA commands, not a huge queue size. On a loaded
machine with potential scheduling delays, though, a larger queue could
matter more; that may be why performance decreased for some workloads here
too, and not only because of a less effective elevator.
So probably 2 Mbyte of queue is a much better idea; that way we at least
have a ring with 4 elements to refill after a completion wakeup. I wanted to
be strict in the first place, to see the "lowlatency" effect at its maximum.
We could also consider using /4 instead of my current /2 for the
batch_sectors initialization.

BTW, at first glance it looks like 2.5 has the same problem in the queue
sizing too.

> read_load:
> Kernel [runs]           Time    CPU%    Loads   LCPU%   Ratio
> 2.4.18 [3]              102.3   70      6       3       1.40
> 2.4.19 [2]              134.1   54      14      5       1.83
> 2.4.20-rc1 [3]          173.2   43      20      5       2.37
> 2.4.20-rc1aa1 [3]       150.6   51      16      5       2.06
> 2420rc2aa1 [1]          140.5   51      13      4       1.92
>
> list_load:
> Kernel [runs]           Time    CPU%    Loads   LCPU%   Ratio
> 2.4.18 [3]              90.2    76      1       17      1.23
> 2.4.19 [1]              89.8    77      1       20      1.23
> 2.4.20-rc1 [3]          88.8    77      0       12      1.21
> 2.4.20-rc1aa1 [1]       88.1    78      1       16      1.20
> 2420rc2aa1 [1]          99.7    69      1       19      1.36
>
> mem_load:
> Kernel [runs]           Time    CPU%    Loads   LCPU%   Ratio
> 2.4.18 [3]              103.3   70      32      3       1.41
> 2.4.19 [3]              100.0   72      33      3       1.37
> 2.4.20-rc1 [3]          105.9   69      32      2       1.45
>
> mem_load hung the machine. I could not get rc2aa1 through this part of the
> benchmark no matter how many times I tried to run it. I have no idea what
> was going on, but it is easy to reproduce: simply run the mem_load out of
> contest (which runs until it is killed) and the machine will hang.

Sorry, but what is mem_load supposed to do other than loop forever? It has
been running for two days on my test box (512M of RAM, 2G of swap, 4-way
SMP) and nothing has happened yet. It's an infinite loop. It sounds like
you're trapping a signal; wouldn't it be simpler to just finish after a
number of passes? The machine is perfectly usable and responsive during the
mem_load; xmms doesn't skip a beat, for instance. This is probably thanks to
the elevator-lowlatency too: I recall xmms didn't use to be completely
smooth during heavy swapping in previous kernels (because the read() of the
sound file didn't return in reasonable time, since I'm swapping on the same
hd where I store the data).
jupiter:~ # uptime
  4:20pm  up 1 day, 14:43,  3 users,  load average: 1.38, 1.28, 1.21
jupiter:~ # vmstat 1
   procs                      memory    swap          io     system         cpu
 r  b  w   swpd   free   buff  cache  si    so    bi    bo    in    cs  us  sy  id
 0  1  0 197408   4504    112   1436    21    34    23    34   36    19   0   2  97
 0  1  0 199984   4768    116   1116 11712  5796 11720  5804  514   851   1   2  97
 0  1  0 234684   4280    108   1116 14344 12356 14344 12360  617  1034   0   3  96
 0  1  0 267880   4312    108   1116 10464 11916 10464 11916  539   790   0   3  97
 1  0  0 268704   5192    108   1116  6220  9336  6220  9336  363   474   0   1  99
 0  1  0 270764   5312    108   1116 13036 18952 13036 18952  584   958   0   1  99
 0  1  0 271368   5088    108   1116  8288  5160  8288  5160  386   576   0   1  99
 0  1  1 269184   4296    108   1116  4352  6420  4352  6416  254   314   0   0 100
 0  1  0 266528   4604    108   1116  9644  4652  9644  4656  428   658   0   1  99

There is no way I can reproduce any stability problem with mem_load here
(tested both on a SCSI quad Xeon and an IDE dual Athlon). Can you provide
more details of your problem and/or a sysrq+t during the hang? Thanks.

Andrea
* Re: [BENCHMARK] 2.4.20-rc2-aa1 with contest
  2002-11-24 16:28 ` Andrea Arcangeli
@ 2002-11-25  6:44   ` Con Kolivas
  2002-11-25  7:06     ` Andrew Morton
                       ` (2 more replies)
  0 siblings, 3 replies; 7+ messages in thread

From: Con Kolivas @ 2002-11-25 6:44 UTC (permalink / raw)
To: Andrea Arcangeli; +Cc: linux kernel mailing list

>On Sat, Nov 23, 2002 at 09:29:22AM +1100, Con Kolivas wrote:
>> process_load:
>> Kernel [runs]           Time    CPU%    Loads   LCPU%   Ratio
>> 2.4.18 [3]              109.5   57      119     44      1.50
>> 2.4.19 [3]              106.5   59      112     43      1.45
>> 2.4.20-rc1 [3]          110.7   58      119     43      1.51
>> 2.4.20-rc1aa1 [3]       110.5   58      117     43      1.51*
>> 2420rc2aa1 [1]          212.5   31      412     69      2.90*
>>
>> This load just copies data between 4 processes repeatedly. It seems to
>> take longer.
>
>Can you go into linux/include/blkdev.h and increase MAX_QUEUE_SECTORS to
>(2 << (20 - 9)) and see if it makes any difference here? If it doesn't make
>a difference, it could be the slightly increased readahead, but I doubt
>it's the latter.

No significant difference:
2420rc2aa1      212.53  31%     412     69%
2420rc2aa1mqs2  227.72  29%     455     71%

>> xtar_load:
>> Kernel [runs]           Time    CPU%    Loads   LCPU%   Ratio
>> 2.4.18 [3]              150.8   49      2       8       2.06
>> 2.4.19 [1]              132.4   55      2       9       1.81
>> 2.4.20-rc1 [3]          180.7   40      3       8       2.47
>> 2.4.20-rc1aa1 [3]       166.6   44      2       7       2.28*
>> 2420rc2aa1 [1]          217.7   34      4       9       2.97*
>>
>> Takes longer. This is only one run, though, so it may not be an accurate
>> average.
>
>This is most probably a too-small waitqueue. Of course, increasing the
>waitqueue will also increase latency a bit for the other workloads; it's a
>tradeoff and there's no way around it. Even read-latency has the tradeoff
>when it chooses the "nth" place to be the seventh slot, where it puts the
>read request if insertion fails.
>> io_load:
>> Kernel [runs]           Time    CPU%    Loads   LCPU%   Ratio
>> 2.4.18 [3]              474.1   15      36      10      6.48
>> 2.4.19 [3]              492.6   14      38      10      6.73
>> 2.4.20-rc1 [2]          1142.2  6       90      10      15.60
>> 2.4.20-rc1aa1 [1]       1132.5  6       90      10      15.47
>> 2420rc2aa1 [1]          164.3   44      10      9       2.24
>>
>> This was where the effect of the disk latency hack was expected to show,
>> and it sure did.
>
>Yes, I can certainly feel the machine being much more responsive during the
>write load too. Too bad some benchmarks like dbench decreased
>significantly, but I don't see too many ways around it. At least now, with
>those changes, the contiguous write case is unaffected; my storage test box
>still reads and writes at over 100 Mbyte/sec, for example. This clearly
>means that what matters is having 512k DMA commands, not a huge queue size.
>On a loaded machine with potential scheduling delays, though, a larger
>queue could matter more; that may be why performance decreased for some
>workloads here too, and not only because of a less effective elevator. So
>probably 2 Mbyte of queue is a much better idea; that way we at least have
>a ring with 4 elements to refill after a completion wakeup. I wanted to be
>strict in the first place, to see the "lowlatency" effect at its maximum.
>We could also consider using /4 instead of my current /2 for the
>batch_sectors initialization.
>
>BTW, at first glance it looks like 2.5 has the same problem in the queue
>sizing too.
>> read_load:
>> Kernel [runs]           Time    CPU%    Loads   LCPU%   Ratio
>> 2.4.18 [3]              102.3   70      6       3       1.40
>> 2.4.19 [2]              134.1   54      14      5       1.83
>> 2.4.20-rc1 [3]          173.2   43      20      5       2.37
>> 2.4.20-rc1aa1 [3]       150.6   51      16      5       2.06
>> 2420rc2aa1 [1]          140.5   51      13      4       1.92
>>
>> list_load:
>> Kernel [runs]           Time    CPU%    Loads   LCPU%   Ratio
>> 2.4.18 [3]              90.2    76      1       17      1.23
>> 2.4.19 [1]              89.8    77      1       20      1.23
>> 2.4.20-rc1 [3]          88.8    77      0       12      1.21
>> 2.4.20-rc1aa1 [1]       88.1    78      1       16      1.20
>> 2420rc2aa1 [1]          99.7    69      1       19      1.36
>>
>> mem_load:
>> Kernel [runs]           Time    CPU%    Loads   LCPU%   Ratio
>> 2.4.18 [3]              103.3   70      32      3       1.41
>> 2.4.19 [3]              100.0   72      33      3       1.37
>> 2.4.20-rc1 [3]          105.9   69      32      2       1.45
>>
>> mem_load hung the machine. I could not get rc2aa1 through this part of
>> the benchmark no matter how many times I tried to run it. I have no idea
>> what was going on, but it is easy to reproduce: simply run the mem_load
>> out of contest (which runs until it is killed) and the machine will hang.
>
>Sorry, but what is mem_load supposed to do other than loop forever? It has
>been running for two days on my test box (512M of RAM, 2G of swap, 4-way
>SMP) and nothing has happened yet. It's an infinite loop. It sounds like
>you're trapping a signal; wouldn't it be simpler to just finish after a
>number of passes? The machine is perfectly usable and responsive during the
>mem_load; xmms doesn't skip a beat, for instance. This is probably thanks
>to the elevator-lowlatency too: I recall xmms didn't use to be completely
>smooth during heavy swapping in previous kernels (because the read() of the
>sound file didn't return in reasonable time, since I'm swapping on the same
>hd where I store the data).
>jupiter:~ # uptime
>  4:20pm  up 1 day, 14:43,  3 users,  load average: 1.38, 1.28, 1.21
>jupiter:~ # vmstat 1
>   procs                      memory    swap          io     system         cpu
> r  b  w   swpd   free   buff  cache  si    so    bi    bo    in    cs  us  sy  id
> 0  1  0 197408   4504    112   1436    21    34    23    34   36    19   0   2  97
> 0  1  0 199984   4768    116   1116 11712  5796 11720  5804  514   851   1   2  97
> 0  1  0 234684   4280    108   1116 14344 12356 14344 12360  617  1034   0   3  96
> 0  1  0 267880   4312    108   1116 10464 11916 10464 11916  539   790   0   3  97
> 1  0  0 268704   5192    108   1116  6220  9336  6220  9336  363   474   0   1  99
> 0  1  0 270764   5312    108   1116 13036 18952 13036 18952  584   958   0   1  99
> 0  1  0 271368   5088    108   1116  8288  5160  8288  5160  386   576   0   1  99
> 0  1  1 269184   4296    108   1116  4352  6420  4352  6416  254   314   0   0 100
> 0  1  0 266528   4604    108   1116  9644  4652  9644  4656  428   658   0   1  99
>
>There is no way I can reproduce any stability problem with mem_load here
>(tested both on a SCSI quad Xeon and an IDE dual Athlon). Can you provide
>more details of your problem and/or a sysrq+t during the hang? Thanks.

The machine stops responding, but sysrq works. It won't write anything to
the logs. To get the error I have to run the mem_load portion of contest,
not just mem_load by itself. The purpose of mem_load is to be just that - a
memory load during the contest benchmark, and contest will kill it when it
finishes testing in that load. To reproduce it yourself, run mem_load and
then do a kernel compile with make -j(4 x num_cpus). If that doesn't do it,
I'm not sure how else you can see it. sysrq+t shows too much stuff on screen
for me to make any sense of it, and it scrolls away without me being able to
scroll up.

Con
* Re: [BENCHMARK] 2.4.20-rc2-aa1 with contest
  2002-11-25  6:44 ` Con Kolivas
@ 2002-11-25  7:06   ` Andrew Morton
  2002-11-25 18:57     ` Andrea Arcangeli
  2002-11-25 18:23   ` Andrea Arcangeli
  2002-11-30 16:17   ` Andrea Arcangeli
  2 siblings, 1 reply; 7+ messages in thread

From: Andrew Morton @ 2002-11-25 7:06 UTC (permalink / raw)
To: Con Kolivas; +Cc: Andrea Arcangeli, linux kernel mailing list

Con Kolivas wrote:
> >> process_load:
> >> Kernel [runs]           Time    CPU%    Loads   LCPU%   Ratio
> >> 2.4.18 [3]              109.5   57      119     44      1.50
> >> 2.4.19 [3]              106.5   59      112     43      1.45
> >> 2.4.20-rc1 [3]          110.7   58      119     43      1.51
> >> 2.4.20-rc1aa1 [3]       110.5   58      117     43      1.51*
> >> 2420rc2aa1 [1]          212.5   31      412     69      2.90*
> >>
> >> This load just copies data between 4 processes repeatedly. It seems to
> >> take longer.
> >
> >Can you go into linux/include/blkdev.h and increase MAX_QUEUE_SECTORS to
> >(2 << (20 - 9)) and see if it makes any difference here? If it doesn't
> >make a difference, it could be the slightly increased readahead, but I
> >doubt it's the latter.
>
> No significant difference:
> 2420rc2aa1      212.53  31%     412     69%
> 2420rc2aa1mqs2  227.72  29%     455     71%

process_load is a CPU scheduler thing, not a disk scheduler thing. Something
must have changed in kernel/sched.c.

It's debatable whether 210 seconds is worse than 110 seconds in this test,
really. You have four processes madly piping stuff around and four to eight
processes compiling stuff. I don't see why it's "worse" that the compile
happens to get 31% of the CPU time in this kernel. One would need to decide
how much CPU it _should_ get before making that decision.

> ...
> The machine stops responding, but sysrq works. It won't write anything to
> the logs. To get the error I have to run the mem_load portion of contest,
> not just mem_load by itself.
> The purpose of mem_load is to be just that - a memory load during the
> contest benchmark, and contest will kill it when it finishes testing in
> that load. To reproduce it yourself, run mem_load and then do a kernel
> compile with make -j(4 x num_cpus). If that doesn't do it, I'm not sure
> how else you can see it. sysrq+t shows too much stuff on screen for me to
> make any sense of it, and it scrolls away without me being able to scroll
> up.

Try sysrq-p.
* Re: [BENCHMARK] 2.4.20-rc2-aa1 with contest
  2002-11-25  7:06 ` Andrew Morton
@ 2002-11-25 18:57   ` Andrea Arcangeli
  0 siblings, 0 replies; 7+ messages in thread

From: Andrea Arcangeli @ 2002-11-25 18:57 UTC (permalink / raw)
To: Andrew Morton, rwhron; +Cc: Con Kolivas, linux kernel mailing list

On Sun, Nov 24, 2002 at 11:06:13PM -0800, Andrew Morton wrote:
> Con Kolivas wrote:
> > >> process_load:
> > >> Kernel [runs]           Time    CPU%    Loads   LCPU%   Ratio
> > >> 2.4.18 [3]              109.5   57      119     44      1.50
> > >> 2.4.19 [3]              106.5   59      112     43      1.45
> > >> 2.4.20-rc1 [3]          110.7   58      119     43      1.51
> > >> 2.4.20-rc1aa1 [3]       110.5   58      117     43      1.51*
> > >> 2420rc2aa1 [1]          212.5   31      412     69      2.90*
> > >>
> > >> This load just copies data between 4 processes repeatedly. It seems
> > >> to take longer.
> > >
> > >Can you go into linux/include/blkdev.h and increase MAX_QUEUE_SECTORS
> > >to (2 << (20 - 9)) and see if it makes any difference here? If it
> > >doesn't make a difference, it could be the slightly increased
> > >readahead, but I doubt it's the latter.
> >
> > No significant difference:
> > 2420rc2aa1      212.53  31%     412     69%
> > 2420rc2aa1mqs2  227.72  29%     455     71%
>
> process_load is a CPU scheduler thing, not a disk scheduler thing.
> Something must have changed in kernel/sched.c.
>
> It's debatable whether 210 seconds is worse than 110 seconds in this test,
> really. You have four processes madly piping stuff around and four to
> eight processes compiling stuff. I don't see why it's "worse" that the
> compile happens to get 31% of the CPU time in this kernel. One would need
> to decide how much CPU it _should_ get before making that decision.
I see. So it's probably one of the core O(1) scheduler design fixes I did in
my tree to avoid losing around 60% of the available CPU power on SMP in
critical workloads due to design bugs in the O(1) scheduler. (That loss is
partly reduced, by a factor of 10, in 2.5 because of HZ=1000, but that also
adds overhead that shows up in all the userspace CPU-intensive benchmarks
posted to l-k. The right fix is needed in 2.5 too, since HZ=1000 only hides
the problem partially, and the s390 idle patch won't let the local SMP timer
interrupts keep running on idle CPUs anyway.) So this result should be a
good thing, or at any rate it's not interesting for what we're trying to
benchmark here.

> > ...
> > The machine stops responding, but sysrq works. It won't write anything
> > to the logs. To get the error I have to run the mem_load portion of
> > contest, not just mem_load by itself. The purpose of mem_load is to be
> > just that - a memory load during the contest benchmark, and contest
> > will kill it when it finishes testing in that load. To reproduce it
> > yourself, run mem_load and then do a kernel compile with make
> > -j(4 x num_cpus). If that doesn't do it, I'm not sure how else you can
> > see it. sysrq+t shows too much stuff on screen for me to make any sense
> > of it, and it scrolls away without me being able to scroll up.
>
> Try sysrq-p.

Indeed, sysrq+p might be the interesting one; I would have found that out
from the sysrq+t. The problem with sysrq+p is that, with the improved
irq-balance patch in my tree, it will likely dump only one CPU. I should
send an IPI to get a reliable sysrq+p from all CPUs at the same time, like I
did in the alpha port some time ago. Of course this is not a problem at all
if his test box is UP.

The main problem of the elevator-lowlatency patch is that it increases
fairness by an order of magnitude, so it can hardly be the fastest kernel on
dbench anymore. Again, many thanks to Randy for these very useful, accurate
benchmarks.
2.4.20-rc1aa1               73.92   75.22   71.79
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
2.4.20-rc2-ac1-rmap15-O1    53.09   54.85   51.09
2.4.20-rc2aa1               64.60   65.33   63.98
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
2.5.31-mm1-dl-ew            59.55   61.51   57.00
2.5.32-mm1-dl-ew            55.43   57.15   53.13
2.5.32-mm2-dl-ew            54.01   57.38   47.48
2.5.33-mm1-dl-ew            52.02   54.86   46.74
2.5.33-mm5                  49.61   53.42   41.31
2.5.40-mm1                  70.39   73.85   65.24
2.5.42                      67.72   70.50   66.05
2.5.43-mm2                  67.32   69.92   65.11
2.5.44-mm5                  69.47   71.86   66.14
2.5.44-mm6                  69.03   71.66   64.11

You see rc2aa1 is slower than rc1aa1, though not by as much as I would have
expected. I was expecting something horrible, on the order of 30 Mbyte/sec,
so it's quite a great result IMHO considering the queue was only 1 Mbyte;
but it's still noticeable. (Note that the queue is now 1M even for seeks,
not only for contiguous I/O. Previously it was 32M for contiguous I/O, where
it's useless to apply the elevator because the I/O is contiguous in the
first place, and something like 256k for seeks.)

It would be interesting to see how dbench 192 on reiserfs reacts to this
patch applied on top of 2.4.20rc2aa1. 4M is a saner value for the queue
size; 1M was too small, but I wanted to show the lowest latency ever in
contest. With this one, contest should still show a very low read latency
(and a low write latency too, unlike read-latency, if you ever test fsync or
O_SYNC/O_DIRECT and not only read latency), but dbench should run faster. I
doubt it's as fast as rc1aa1, but it could be a good tradeoff.
--- 2.4.20rc2aa1/drivers/block/ll_rw_blk.c.~1~	2002-11-21 06:06:02.000000000 +0100
+++ 2.4.20rc2aa1/drivers/block/ll_rw_blk.c	2002-11-25 19:45:03.000000000 +0100
@@ -421,7 +421,7 @@ int blk_grow_request_list(request_queue_
 	}
 	q->batch_requests = q->nr_requests;
 	q->max_queue_sectors = max_queue_sectors;
-	q->batch_sectors = max_queue_sectors / 2;
+	q->batch_sectors = max_queue_sectors / 4;
 	BUG_ON(!q->batch_sectors);
 	atomic_set(&q->nr_sectors, 0);
 	spin_unlock_irqrestore(q->queue_lock, flags);
--- 2.4.20rc2aa1/include/linux/blkdev.h.~1~	2002-11-21 06:24:18.000000000 +0100
+++ 2.4.20rc2aa1/include/linux/blkdev.h	2002-11-25 19:44:09.000000000 +0100
@@ -244,7 +244,7 @@ extern char * blkdev_varyio[MAX_BLKDEV];

 #define MAX_SEGMENTS 128
 #define MAX_SECTORS 255
-#define MAX_QUEUE_SECTORS (1 << (20 - 9)) /* 1 mbytes when full sized */
+#define MAX_QUEUE_SECTORS (4 << (20 - 9)) /* 4 mbytes when full sized */
 #define MAX_NR_REQUESTS (MAX_QUEUE_SECTORS >> (10 - 9)) /* 1mbyte queue when all requests are 1k */

 #define PageAlignSize(size) (((size) + PAGE_SIZE -1) & PAGE_MASK)

Andrea
* Re: [BENCHMARK] 2.4.20-rc2-aa1 with contest
  2002-11-25  6:44 ` Con Kolivas
  2002-11-25  7:06   ` Andrew Morton
@ 2002-11-25 18:23 ` Andrea Arcangeli
  2002-11-30 16:17 ` Andrea Arcangeli
  2 siblings, 0 replies; 7+ messages in thread

From: Andrea Arcangeli @ 2002-11-25 18:23 UTC (permalink / raw)
To: Con Kolivas; +Cc: linux kernel mailing list

On Mon, Nov 25, 2002 at 05:44:30PM +1100, Con Kolivas wrote:
> will kill it when it finishes testing in that load. To reproduce it
> yourself, run mem_load then do a kernel compile make -j(4 x num_cpus).

I will try.

> If that doesn't do it, I'm not sure how else you can see it. sysrq+t
> shows too much stuff on screen for me to make any sense of it and
> scrolls away without me being able to scroll up.

You can, as usual, use a serial console or netconsole to log the sysrq+t
output.

Andrea
* Re: [BENCHMARK] 2.4.20-rc2-aa1 with contest
  2002-11-25  6:44 ` Con Kolivas
  2002-11-25  7:06   ` Andrew Morton
  2002-11-25 18:23 ` Andrea Arcangeli
@ 2002-11-30 16:17 ` Andrea Arcangeli
  2 siblings, 0 replies; 7+ messages in thread

From: Andrea Arcangeli @ 2002-11-30 16:17 UTC (permalink / raw)
To: Con Kolivas; +Cc: linux kernel mailing list

On Mon, Nov 25, 2002 at 05:44:30PM +1100, Con Kolivas wrote:
> finishes testing in that load. To reproduce it yourself, run mem_load then
> do a kernel compile make -j(4 x num_cpus). If that doesn't do it, I'm not
> sure how

JFYI: I can't reproduce it here with a kernel compile and mem_load in
parallel. Did you compile in AGP? There's apparently some known issue with
AGP/DRI.

Andrea