public inbox for linux-kernel@vger.kernel.org
* [BENCHMARK] 2.4.20-rc2-aa1 with contest
@ 2002-11-22 22:29 Con Kolivas
  2002-11-24 16:28 ` Andrea Arcangeli
  0 siblings, 1 reply; 7+ messages in thread
From: Con Kolivas @ 2002-11-22 22:29 UTC (permalink / raw)
  To: linux kernel mailing list; +Cc: Andrea Arcangeli

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

Here is a partial run of contest (http://contest.kolivas.net) benchmarks for 
rc2aa1 with the disk latency hack.

noload:
Kernel [runs]           Time    CPU%    Loads   LCPU%   Ratio
2.4.18 [5]              71.7    93      0       0       0.98
2.4.19 [5]              69.0    97      0       0       0.94
2.4.20-rc1 [3]          72.2    93      0       0       0.99
2.4.20-rc1aa1 [1]       71.9    94      0       0       0.98
2420rc2aa1 [1]          71.1    94      0       0       0.97

cacherun:
Kernel [runs]           Time    CPU%    Loads   LCPU%   Ratio
2.4.18 [2]              66.6    99      0       0       0.91
2.4.19 [2]              68.0    99      0       0       0.93
2.4.20-rc1 [3]          67.2    99      0       0       0.92
2.4.20-rc1aa1 [1]       67.4    99      0       0       0.92
2420rc2aa1 [1]          66.6    99      0       0       0.91

process_load:
Kernel [runs]           Time    CPU%    Loads   LCPU%   Ratio
2.4.18 [3]              109.5   57      119     44      1.50
2.4.19 [3]              106.5   59      112     43      1.45
2.4.20-rc1 [3]          110.7   58      119     43      1.51
2.4.20-rc1aa1 [3]       110.5   58      117     43      1.51*
2420rc2aa1 [1]          212.5   31      412     69      2.90*

This load just copies data repeatedly between four processes. It seems to 
take longer.


ctar_load:
Kernel [runs]           Time    CPU%    Loads   LCPU%   Ratio
2.4.18 [3]              117.4   63      1       7       1.60
2.4.19 [2]              106.5   70      1       8       1.45
2.4.20-rc1 [3]          102.1   72      1       7       1.39
2.4.20-rc1aa1 [3]       107.1   69      1       7       1.46
2420rc2aa1 [1]          103.3   73      1       8       1.41

xtar_load:
Kernel [runs]           Time    CPU%    Loads   LCPU%   Ratio
2.4.18 [3]              150.8   49      2       8       2.06
2.4.19 [1]              132.4   55      2       9       1.81
2.4.20-rc1 [3]          180.7   40      3       8       2.47
2.4.20-rc1aa1 [3]       166.6   44      2       7       2.28*
2420rc2aa1 [1]          217.7   34      4       9       2.97*

Takes longer. This is only one run, though, so it may not be an accurate 
average.


io_load:
Kernel [runs]           Time    CPU%    Loads   LCPU%   Ratio
2.4.18 [3]              474.1   15      36      10      6.48
2.4.19 [3]              492.6   14      38      10      6.73
2.4.20-rc1 [2]          1142.2  6       90      10      15.60
2.4.20-rc1aa1 [1]       1132.5  6       90      10      15.47
2420rc2aa1 [1]          164.3   44      10      9       2.24

This was where the disk latency hack was expected to have an effect. It 
sure did.


read_load:
Kernel [runs]           Time    CPU%    Loads   LCPU%   Ratio
2.4.18 [3]              102.3   70      6       3       1.40
2.4.19 [2]              134.1   54      14      5       1.83
2.4.20-rc1 [3]          173.2   43      20      5       2.37
2.4.20-rc1aa1 [3]       150.6   51      16      5       2.06
2420rc2aa1 [1]          140.5   51      13      4       1.92

list_load:
Kernel [runs]           Time    CPU%    Loads   LCPU%   Ratio
2.4.18 [3]              90.2    76      1       17      1.23
2.4.19 [1]              89.8    77      1       20      1.23
2.4.20-rc1 [3]          88.8    77      0       12      1.21
2.4.20-rc1aa1 [1]       88.1    78      1       16      1.20
2420rc2aa1 [1]          99.7    69      1       19      1.36

mem_load:
Kernel [runs]           Time    CPU%    Loads   LCPU%   Ratio
2.4.18 [3]              103.3   70      32      3       1.41
2.4.19 [3]              100.0   72      33      3       1.37
2.4.20-rc1 [3]          105.9   69      32      2       1.45

mem_load hung the machine. I could not get rc2aa1 through this part of the 
benchmark no matter how many times I tried to run it. I have no idea what 
was going on. It is easy to reproduce: simply run the mem_load from contest 
(which runs until it is killed) and the machine will hang.

Con

P.S. I'm having mailserver trouble, so respond to lkml, where I may see 
responses.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.2.0 (GNU/Linux)

iD8DBQE93q/IF6dfvkL3i1gRAqWCAKCp6eZ2MFe4Ag7LqoGwy4+0MbUqxQCgkkxl
AOUDUScNazCAJ2oZrdgDMuE=
=vHmI
-----END PGP SIGNATURE-----


* Re: [BENCHMARK] 2.4.20-rc2-aa1 with contest
  2002-11-22 22:29 [BENCHMARK] 2.4.20-rc2-aa1 with contest Con Kolivas
@ 2002-11-24 16:28 ` Andrea Arcangeli
  2002-11-25  6:44   ` Con Kolivas
  0 siblings, 1 reply; 7+ messages in thread
From: Andrea Arcangeli @ 2002-11-24 16:28 UTC (permalink / raw)
  To: Con Kolivas; +Cc: linux kernel mailing list

On Sat, Nov 23, 2002 at 09:29:22AM +1100, Con Kolivas wrote:
> -----BEGIN PGP SIGNED MESSAGE-----
> Hash: SHA1
> 
> Here is a partial run of contest (http://contest.kolivas.net) benchmarks for 
> rc2aa1 with the disk latency hack
> 
> noload:
> Kernel [runs]           Time    CPU%    Loads   LCPU%   Ratio
> 2.4.18 [5]              71.7    93      0       0       0.98
> 2.4.19 [5]              69.0    97      0       0       0.94
> 2.4.20-rc1 [3]          72.2    93      0       0       0.99
> 2.4.20-rc1aa1 [1]       71.9    94      0       0       0.98
> 2420rc2aa1 [1]          71.1    94      0       0       0.97
> 
> cacherun:
> Kernel [runs]           Time    CPU%    Loads   LCPU%   Ratio
> 2.4.18 [2]              66.6    99      0       0       0.91
> 2.4.19 [2]              68.0    99      0       0       0.93
> 2.4.20-rc1 [3]          67.2    99      0       0       0.92
> 2.4.20-rc1aa1 [1]       67.4    99      0       0       0.92
> 2420rc2aa1 [1]          66.6    99      0       0       0.91
> 
> process_load:
> Kernel [runs]           Time    CPU%    Loads   LCPU%   Ratio
> 2.4.18 [3]              109.5   57      119     44      1.50
> 2.4.19 [3]              106.5   59      112     43      1.45
> 2.4.20-rc1 [3]          110.7   58      119     43      1.51
> 2.4.20-rc1aa1 [3]       110.5   58      117     43      1.51*
> 2420rc2aa1 [1]          212.5   31      412     69      2.90*
> 
> This load just copies data between 4 processes repeatedly. Seems to take 
> longer.

Can you go into linux/include/blkdev.h and increase MAX_QUEUE_SECTORS to (2
<< (20 - 9)) and see if it makes any difference here? If it doesn't make a
difference, it could be the slightly increased readahead, but I doubt it's
the latter.

> ctar_load:
> Kernel [runs]           Time    CPU%    Loads   LCPU%   Ratio
> 2.4.18 [3]              117.4   63      1       7       1.60
> 2.4.19 [2]              106.5   70      1       8       1.45
> 2.4.20-rc1 [3]          102.1   72      1       7       1.39
> 2.4.20-rc1aa1 [3]       107.1   69      1       7       1.46
> 2420rc2aa1 [1]          103.3   73      1       8       1.41
> 
> xtar_load:
> Kernel [runs]           Time    CPU%    Loads   LCPU%   Ratio
> 2.4.18 [3]              150.8   49      2       8       2.06
> 2.4.19 [1]              132.4   55      2       9       1.81
> 2.4.20-rc1 [3]          180.7   40      3       8       2.47
> 2.4.20-rc1aa1 [3]       166.6   44      2       7       2.28*
> 2420rc2aa1 [1]          217.7   34      4       9       2.97*
> 
> Takes longer. Is only one run though so may not be an accurate average.

This is most probably a too-small waitqueue. Of course, increasing the
waitqueue will also increase the latency a bit for the other workloads;
it's a tradeoff and there's no way around it. Even read-latency has this
tradeoff when it chooses the "nth" place to be the seventh slot, where it
puts the read request if insertion fails.

> 
> 
> io_load:
> Kernel [runs]           Time    CPU%    Loads   LCPU%   Ratio
> 2.4.18 [3]              474.1   15      36      10      6.48
> 2.4.19 [3]              492.6   14      38      10      6.73
> 2.4.20-rc1 [2]          1142.2  6       90      10      15.60
> 2.4.20-rc1aa1 [1]       1132.5  6       90      10      15.47
> 2420rc2aa1 [1]          164.3   44      10      9       2.24
> 
> This was where the effect of the disk latency hack was expected to have an 
> effect. It sure did.

yes, I certainly can feel the machine being much more responsive during the
write load too. Too bad some benchmarks like dbench decreased significantly,
but I don't see many ways around it. At least with these changes the
contiguous write case is unaffected; my storage test box still reads and
writes at over 100 MB/sec, for example. This clearly means that what matters
is having 512k DMA commands, not a huge queue size. Still, with a loaded
machine and potential scheduling delays a larger queue could matter more;
that may be why performance also decreased for some workloads here, not only
because of a less effective elevator. So 2 MB of queue is probably a much
better idea, so that we at least have a ring with 4 elements to refill after
a completion wakeup; I wanted to be strict, to show the "lowlatency" effect
at its maximum in the first place. We could also consider using a /4 instead
of my current /2 for the batch_sectors initialization.

BTW, at first glance it looks like 2.5 has the same problem in the queue
sizing too.

> read_load:
> Kernel [runs]           Time    CPU%    Loads   LCPU%   Ratio
> 2.4.18 [3]              102.3   70      6       3       1.40
> 2.4.19 [2]              134.1   54      14      5       1.83
> 2.4.20-rc1 [3]          173.2   43      20      5       2.37
> 2.4.20-rc1aa1 [3]       150.6   51      16      5       2.06
> 2420rc2aa1 [1]          140.5   51      13      4       1.92
> 
> list_load:
> Kernel [runs]           Time    CPU%    Loads   LCPU%   Ratio
> 2.4.18 [3]              90.2    76      1       17      1.23
> 2.4.19 [1]              89.8    77      1       20      1.23
> 2.4.20-rc1 [3]          88.8    77      0       12      1.21
> 2.4.20-rc1aa1 [1]       88.1    78      1       16      1.20
> 2420rc2aa1 [1]          99.7    69      1       19      1.36
> 
> mem_load:
> Kernel [runs]           Time    CPU%    Loads   LCPU%   Ratio
> 2.4.18 [3]              103.3   70      32      3       1.41
> 2.4.19 [3]              100.0   72      33      3       1.37
> 2.4.20-rc1 [3]          105.9   69      32      2       1.45
> 
> Mem load hung the machine. I could not get rc2aa1 through this part of the 
> benchmark no matter how many times I tried to run it. No idea what was going 
> on. Easy to reproduce. Simply run the mem_load out of contest (which runs 
> until it is killed) and the machine will hang. 

sorry, but what is mem_load supposed to do other than loop forever? It has
been running for two days on my test box (512M of RAM, 2G of swap, 4-way
SMP) and nothing has happened yet. It's an infinite loop. It sounds like
you're trapping a signal. Wouldn't it be simpler to just finish after a
number of passes? The machine is perfectly usable and responsive during
mem_load; xmms doesn't skip a beat, for instance. This is probably thanks to
the elevator-lowlatency too; I recall xmms didn't use to be completely
smooth during heavy swapping in previous kernels (because the read() of the
sound file didn't return in a reasonable time, since I'm swapping on the
same hd where I store the data).

jupiter:~ # uptime
  4:20pm  up 1 day, 14:43,  3 users,  load average: 1.38, 1.28, 1.21
jupiter:~ # vmstat 1
   procs                      memory    swap          io     system         cpu
 r  b  w   swpd   free   buff  cache  si  so    bi    bo   in    cs  us  sy  id
 0  1  0 197408   4504    112   1436  21  34    23    34   36    19   0   2  97
 0  1  0 199984   4768    116   1116 11712 5796 11720  5804  514   851   1   2  97
 0  1  0 234684   4280    108   1116 14344 12356 14344 12360  617  1034   0   3  96
 0  1  0 267880   4312    108   1116 10464 11916 10464 11916  539   790   0   3  97
 1  0  0 268704   5192    108   1116 6220 9336  6220  9336  363   474   0   1  99
 0  1  0 270764   5312    108   1116 13036 18952 13036 18952  584   958   0   1  99
 0  1  0 271368   5088    108   1116 8288 5160  8288  5160  386   576   0   1  99
 0  1  1 269184   4296    108   1116 4352 6420  4352  6416  254   314   0   0 100
 0  1  0 266528   4604    108   1116 9644 4652  9644  4656  428   658   0   1  99

there is no way I can reproduce any stability problem with mem_load here
(tested both on a SCSI quad Xeon and an IDE dual Athlon). Can you provide
more details of your problem and/or a SYSRQ+T during the hang? Thanks.

Andrea


* Re: [BENCHMARK] 2.4.20-rc2-aa1 with contest
  2002-11-24 16:28 ` Andrea Arcangeli
@ 2002-11-25  6:44   ` Con Kolivas
  2002-11-25  7:06     ` Andrew Morton
                       ` (2 more replies)
  0 siblings, 3 replies; 7+ messages in thread
From: Con Kolivas @ 2002-11-25  6:44 UTC (permalink / raw)
  To: Andrea Arcangeli; +Cc: linux kernel mailing list

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1


>On Sat, Nov 23, 2002 at 09:29:22AM +1100, Con Kolivas wrote:
>> -----BEGIN PGP SIGNED MESSAGE-----
>> Hash: SHA1
>> process_load:
>> Kernel [runs]           Time    CPU%    Loads   LCPU%   Ratio
>> 2.4.18 [3]              109.5   57      119     44      1.50
>> 2.4.19 [3]              106.5   59      112     43      1.45
>> 2.4.20-rc1 [3]          110.7   58      119     43      1.51
>> 2.4.20-rc1aa1 [3]       110.5   58      117     43      1.51*
>> 2420rc2aa1 [1]          212.5   31      412     69      2.90*
>>
>> This load just copies data between 4 processes repeatedly. Seems to take
>> longer.
>
>you go into linux/include/blkdev.h and increase MAX_QUEUE_SECTORS to (2
><< (20 - 9)) and see if it makes any differences here? if it doesn't
>make differences it could be the a bit increased readhaead but I doubt
>it's the latter.

No significant difference:
2420rc2aa1              212.53  31%     412     69%
2420rc2aa1mqs2          227.72  29%     455     71%

>> xtar_load:
>> Kernel [runs]           Time    CPU%    Loads   LCPU%   Ratio
>> 2.4.18 [3]              150.8   49      2       8       2.06
>> 2.4.19 [1]              132.4   55      2       9       1.81
>> 2.4.20-rc1 [3]          180.7   40      3       8       2.47
>> 2.4.20-rc1aa1 [3]       166.6   44      2       7       2.28*
>> 2420rc2aa1 [1]          217.7   34      4       9       2.97*
>>
>> Takes longer. Is only one run though so may not be an accurate average.
>
>This most probably is a too small waitqueue. Of course increasing the
>waitqueue will increase a bit the latency too for the other workloads,
>it's a tradeoff and there's no way around it. Even read-latency has the
>tradeoff when it chooses the "nth" place to be the seventh slot, where
>to put the read request if it fails inserction.
>
>> io_load:
>> Kernel [runs]           Time    CPU%    Loads   LCPU%   Ratio
>> 2.4.18 [3]              474.1   15      36      10      6.48
>> 2.4.19 [3]              492.6   14      38      10      6.73
>> 2.4.20-rc1 [2]          1142.2  6       90      10      15.60
>> 2.4.20-rc1aa1 [1]       1132.5  6       90      10      15.47
>> 2420rc2aa1 [1]          164.3   44      10      9       2.24
>>
>> This was where the effect of the disk latency hack was expected to have an
>> effect. It sure did.
>
>yes, I certainly can feel the machine much more responsive during the
>write load too. Too bad some benchmark like dbench decreased
>significantly but I don't see too many ways around it. At least now with
>those changes the contigous write case is unaffected, my storage  test
>box still reads and writes at over 100mbyte/sec for example, this
>clearly means what matters is that we have 512k dma commands, not an
>huge size of the queue. Really with a loaded machine and potential
>scheduling delays it could matter more to have a larger queue, that
>maybe why the performance is decreased for some workload here too, not
>only because of a less effective elevator. So probably 2Mbyte of queue
>is a much better idea, so at least we can have a ring with 4 elements to
> refill after a completion wakeup, I wanted to be strict to see the
> "lowlatency" effect at most in the first place. We could also consider to
> use a /4 instead of my current /2 for the batch_sectors initialization.
>
>BTW, at first glance it looks 2.5 has the same problem in the queue
>sizing too.
>
>> read_load:
>> Kernel [runs]           Time    CPU%    Loads   LCPU%   Ratio
>> 2.4.18 [3]              102.3   70      6       3       1.40
>> 2.4.19 [2]              134.1   54      14      5       1.83
>> 2.4.20-rc1 [3]          173.2   43      20      5       2.37
>> 2.4.20-rc1aa1 [3]       150.6   51      16      5       2.06
>> 2420rc2aa1 [1]          140.5   51      13      4       1.92
>>
>> list_load:
>> Kernel [runs]           Time    CPU%    Loads   LCPU%   Ratio
>> 2.4.18 [3]              90.2    76      1       17      1.23
>> 2.4.19 [1]              89.8    77      1       20      1.23
>> 2.4.20-rc1 [3]          88.8    77      0       12      1.21
>> 2.4.20-rc1aa1 [1]       88.1    78      1       16      1.20
>> 2420rc2aa1 [1]          99.7    69      1       19      1.36
>>
>> mem_load:
>> Kernel [runs]           Time    CPU%    Loads   LCPU%   Ratio
>> 2.4.18 [3]              103.3   70      32      3       1.41
>> 2.4.19 [3]              100.0   72      33      3       1.37
>> 2.4.20-rc1 [3]          105.9   69      32      2       1.45
>>
>> Mem load hung the machine. I could not get rc2aa1 through this part of the
>> benchmark no matter how many times I tried to run it. No idea what was
>> going on. Easy to reproduce. Simply run the mem_load out of contest (which
>> runs until it is killed) and the machine will hang.
>
>sorry but what is mem_load supposed to do other than to loop forever? It
>is running for two days on my test box (512m of ram, 2G of swap, 4-way
>smp) and nothing happened yet. It's an infinite loop. Sounds like you're
>trapping a signal. Wouldn't it be simpler to just finish after a number
>of passes? The machine is perfectly usable and responsive during the
>mem_load, xmms doesn't skip a beat for istance, this is probably thanks
>to the elevator-lowlatency too, I recall xmms wasn't used to be
>completely smooth during heavy swapping in previous kernels (because the
> read() of the sound file didn't return in rasonable time since I'm swapping
> in the same hd where I store the data).
>
>jupiter:~ # uptime
>  4:20pm  up 1 day, 14:43,  3 users,  load average: 1.38, 1.28, 1.21
>jupiter:~ # vmstat 1
>   procs                      memory    swap          io     system         cpu
> r  b  w   swpd   free   buff  cache  si  so    bi    bo   in    cs  us  sy  id
> 0  1  0 197408   4504    112   1436  21  34    23    34   36    19   0   2  97
> 0  1  0 199984   4768    116   1116 11712 5796 11720  5804  514   851   1   2  97
> 0  1  0 234684   4280    108   1116 14344 12356 14344 12360  617  1034   0   3  96
> 0  1  0 267880   4312    108   1116 10464 11916 10464 11916  539   790   0   3  97
> 1  0  0 268704   5192    108   1116 6220 9336  6220  9336  363   474   0   1  99
> 0  1  0 270764   5312    108   1116 13036 18952 13036 18952  584   958   0   1  99
> 0  1  0 271368   5088    108   1116 8288 5160  8288  5160  386   576   0   1  99
> 0  1  1 269184   4296    108   1116 4352 6420  4352  6416  254   314   0   0 100
> 0  1  0 266528   4604    108   1116 9644 4652  9644  4656  428   658   0   1  99
>
>there is no way I can reproduce any stability problem with mem_load here
>(tested both on scsi quad xeon and ide dualathlon). Can you provide more
>details of your problem and/or a SYSRQ+T during the hang? thanks.

The machine stops responding but sysrq works. It won't write anything to 
the logs. To get the error I have to run the mem_load portion of contest, 
not just mem_load by itself. The purpose of mem_load is to be just that: a 
memory load during the contest benchmark, and contest will kill it when it 
finishes testing that load. To reproduce it yourself, run mem_load and then 
do a kernel compile with make -j(4xnum_cpus). If that doesn't do it, I'm 
not sure how else you can see it. sysrq-T shows too much stuff on screen 
for me to make any sense of it, and it scrolls away without me being able 
to scroll up.

Con
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.2.0 (GNU/Linux)

iD8DBQE94cbRF6dfvkL3i1gRAvkgAKCOJwQ4hP2E5n1tu1r31MeCz9tULQCdE/lm
hEbMrTEK/u2Sb8INZbVJWpg=
=8YxG
-----END PGP SIGNATURE-----


* Re: [BENCHMARK] 2.4.20-rc2-aa1 with contest
  2002-11-25  6:44   ` Con Kolivas
@ 2002-11-25  7:06     ` Andrew Morton
  2002-11-25 18:57       ` Andrea Arcangeli
  2002-11-25 18:23     ` Andrea Arcangeli
  2002-11-30 16:17     ` Andrea Arcangeli
  2 siblings, 1 reply; 7+ messages in thread
From: Andrew Morton @ 2002-11-25  7:06 UTC (permalink / raw)
  To: Con Kolivas; +Cc: Andrea Arcangeli, linux kernel mailing list

Con Kolivas wrote:
> 
> -----BEGIN PGP SIGNED MESSAGE-----
> Hash: SHA1
> 
> >On Sat, Nov 23, 2002 at 09:29:22AM +1100, Con Kolivas wrote:
> >> -----BEGIN PGP SIGNED MESSAGE-----
> >> Hash: SHA1
> >> process_load:
> >> Kernel [runs]           Time    CPU%    Loads   LCPU%   Ratio
> >> 2.4.18 [3]              109.5   57      119     44      1.50
> >> 2.4.19 [3]              106.5   59      112     43      1.45
> >> 2.4.20-rc1 [3]          110.7   58      119     43      1.51
> >> 2.4.20-rc1aa1 [3]       110.5   58      117     43      1.51*
> >> 2420rc2aa1 [1]          212.5   31      412     69      2.90*
> >>
> >> This load just copies data between 4 processes repeatedly. Seems to take
> >> longer.
> >
> >you go into linux/include/blkdev.h and increase MAX_QUEUE_SECTORS to (2
> ><< (20 - 9)) and see if it makes any differences here? if it doesn't
> >make differences it could be the a bit increased readhaead but I doubt
> >it's the latter.
> 
> No significant difference:
> 2420rc2aa1              212.53  31%     412     69%
> 2420rc2aa1mqs2          227.72  29%     455     71%

process_load is a CPU scheduler thing, not a disk scheduler thing.  Something
must have changed in kernel/sched.c.

It's debatable whether 210 seconds is worse than 110 seconds in
this test, really.  You have four processes madly piping stuff around and
four to eight processes compiling stuff.  I don't see why it's "worse"
that the compile happens to get 31% of the CPU time in this kernel.  One
would need to decide how much CPU it _should_ get before making that decision.

> ...
> 
> The machine stops responding but sysrq works. It wont write anything to the
> logs. To get the error I have to run the mem_load portion of contest, not
> just mem_load by itself. The purpose of mem_load is to be just that - a
> memory load during the contest benchmark and contest will kill it when it
> finishes testing in that load. To reproduce it yourself, run mem_load then do
> a kernel compile make -j(4xnum_cpus).  If that doesnt do it I'm not sure how
> else you can see it. sys-rq-T shows too much stuff on screen for me to make
> any sense of it and scrolls away without me being able to scroll up.

Try sysrq-p.


* Re: [BENCHMARK] 2.4.20-rc2-aa1 with contest
  2002-11-25  6:44   ` Con Kolivas
  2002-11-25  7:06     ` Andrew Morton
@ 2002-11-25 18:23     ` Andrea Arcangeli
  2002-11-30 16:17     ` Andrea Arcangeli
  2 siblings, 0 replies; 7+ messages in thread
From: Andrea Arcangeli @ 2002-11-25 18:23 UTC (permalink / raw)
  To: Con Kolivas; +Cc: linux kernel mailing list

On Mon, Nov 25, 2002 at 05:44:30PM +1100, Con Kolivas wrote:
> will kill it when it finishes testing in that load. To reproduce it
> yourself, run mem_load then do a kernel compile make -j(4xnum_cpus).

I will try.

> If that doesnt do it I'm not sure how else you can see it. sys-rq-T
> shows too much stuff on screen for me to make any sense of it and
> scrolls away without me being able to scroll up.

You can, as usual, use a serial console or netconsole to log the sysrq+t
output.

Andrea


* Re: [BENCHMARK] 2.4.20-rc2-aa1 with contest
  2002-11-25  7:06     ` Andrew Morton
@ 2002-11-25 18:57       ` Andrea Arcangeli
  0 siblings, 0 replies; 7+ messages in thread
From: Andrea Arcangeli @ 2002-11-25 18:57 UTC (permalink / raw)
  To: Andrew Morton, rwhron; +Cc: Con Kolivas, linux kernel mailing list

On Sun, Nov 24, 2002 at 11:06:13PM -0800, Andrew Morton wrote:
> Con Kolivas wrote:
> > 
> > -----BEGIN PGP SIGNED MESSAGE-----
> > Hash: SHA1
> > 
> > >On Sat, Nov 23, 2002 at 09:29:22AM +1100, Con Kolivas wrote:
> > >> -----BEGIN PGP SIGNED MESSAGE-----
> > >> Hash: SHA1
> > >> process_load:
> > >> Kernel [runs]           Time    CPU%    Loads   LCPU%   Ratio
> > >> 2.4.18 [3]              109.5   57      119     44      1.50
> > >> 2.4.19 [3]              106.5   59      112     43      1.45
> > >> 2.4.20-rc1 [3]          110.7   58      119     43      1.51
> > >> 2.4.20-rc1aa1 [3]       110.5   58      117     43      1.51*
> > >> 2420rc2aa1 [1]          212.5   31      412     69      2.90*
> > >>
> > >> This load just copies data between 4 processes repeatedly. Seems to take
> > >> longer.
> > >
> > >you go into linux/include/blkdev.h and increase MAX_QUEUE_SECTORS to (2
> > ><< (20 - 9)) and see if it makes any differences here? if it doesn't
> > >make differences it could be the a bit increased readhaead but I doubt
> > >it's the latter.
> > 
> > No significant difference:
> > 2420rc2aa1              212.53  31%     412     69%
> > 2420rc2aa1mqs2          227.72  29%     455     71%
> 
> process_load is a CPU scheduler thing, not a disk scheduler thing.  Something
> must have changed in kernel/sched.c.
> 
> It's debatable whether 210 seconds is worse than 110 seconds in
> this test, really.  You have four processes madly piping stuff around and
> four to eight processes compiling stuff.  I don't see why it's "worse"
> that the compile happens to get 31% of the CPU time in this kernel.  One
> would need to decide how much CPU it _should_ get before making that decision.

I see, so it's probably one of the core O(1) scheduler design fixes I did
in my tree to avoid losing around 60% of the available CPU power on SMP in
critical workloads due to design bugs in the O(1) scheduler (partly reduced,
by a factor of 10, in 2.5 because of HZ=1000, but that's also additional
overhead that shows up in all the userspace CPU-intensive benchmarks posted
to l-k, compared to the right fix, which is needed in 2.5 too anyway, since
HZ=1000 only hides the problem partially and the s390 idle patch won't let
the local SMP interrupts run on idle CPUs anyway). So this result should be
a good thing; at any rate it's not interesting for what we're trying to
benchmark here.

> 
> > ...
> > 
> > The machine stops responding but sysrq works. It wont write anything to the
> > logs. To get the error I have to run the mem_load portion of contest, not
> > just mem_load by itself. The purpose of mem_load is to be just that - a
> > memory load during the contest benchmark and contest will kill it when it
> > finishes testing in that load. To reproduce it yourself, run mem_load then do
> > a kernel compile make -j(4xnum_cpus).  If that doesnt do it I'm not sure how
> > else you can see it. sys-rq-T shows too much stuff on screen for me to make
> > any sense of it and scrolls away without me being able to scroll up.
> 
> Try sysrq-p.

indeed, sysrq+p might be the interesting one; I would have found that out
from the sysrq+t. The problem with sysrq+p is that, with the improved
irq-balance patch in my tree, it will likely dump only one CPU; I should
send an IPI to get a reliable sysrq+p from all CPUs at the same time, like
I did in the alpha port some time ago. Of course this is not a problem at
all if his test box is UP.

The main problem of the elevator-lowlatency patch is that it increases
fairness by an order of magnitude, so it can hardly be the fastest kernel
on dbench anymore.

Again, many thanks to Randy for these very useful and accurate benchmarks.

2.4.20-rc1aa1                            73.92           75.22           71.79
					 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
2.4.20-rc2-ac1-rmap15-O1                 53.09           54.85           51.09
2.4.20-rc2aa1                            64.60           65.33           63.98
					 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
2.5.31-mm1-dl-ew                         59.55           61.51           57.00
2.5.32-mm1-dl-ew                         55.43           57.15           53.13
2.5.32-mm2-dl-ew                         54.01           57.38           47.48
2.5.33-mm1-dl-ew                         52.02           54.86           46.74
2.5.33-mm5                               49.61           53.42           41.31
2.5.40-mm1                               70.39           73.85           65.24
2.5.42                                   67.72           70.50           66.05
2.5.43-mm2                               67.32           69.92           65.11
2.5.44-mm5                               69.47           71.86           66.14
2.5.44-mm6                               69.03           71.66           64.11

you see rc2aa1 is slower than rc1aa1. Not as much as I had expected; I was
expecting something horrible, on the order of 30 MB/sec, so it's quite a
good result IMHO considering the queue was only 1 MB, but it's still
noticeable (note that the queue is now 1 MB even for seeks, not only for
contiguous I/O; previously it was 32 MB for contiguous I/O, where applying
the elevator is useless because the I/O is contiguous in the first place,
and something like 256k for seeks). It would be interesting to see how
dbench 192 on reiserfs reacts to this patch applied on top of 2.4.20rc2aa1.
4 MB is a saner value for the queue size; 1 MB was too small, but I wanted
to show the lowest latency ever in contest. With this one, contest should
still show a very low read latency (and a very low write latency too,
unlike read-latency, if you ever test fsync or O_SYNC/O_DIRECT and not only
read latency), but dbench should run faster. I doubt it's as fast as
rc1aa1, but it could be a good tradeoff.

--- 2.4.20rc2aa1/drivers/block/ll_rw_blk.c.~1~	2002-11-21 06:06:02.000000000 +0100
+++ 2.4.20rc2aa1/drivers/block/ll_rw_blk.c	2002-11-25 19:45:03.000000000 +0100
@@ -421,7 +421,7 @@ int blk_grow_request_list(request_queue_
 	}
 	q->batch_requests = q->nr_requests;
 	q->max_queue_sectors = max_queue_sectors;
-	q->batch_sectors = max_queue_sectors / 2;
+	q->batch_sectors = max_queue_sectors / 4;
 	BUG_ON(!q->batch_sectors);
 	atomic_set(&q->nr_sectors, 0);
 	spin_unlock_irqrestore(q->queue_lock, flags);
--- 2.4.20rc2aa1/include/linux/blkdev.h.~1~	2002-11-21 06:24:18.000000000 +0100
+++ 2.4.20rc2aa1/include/linux/blkdev.h	2002-11-25 19:44:09.000000000 +0100
@@ -244,7 +244,7 @@ extern char * blkdev_varyio[MAX_BLKDEV];
 
 #define MAX_SEGMENTS 128
 #define MAX_SECTORS 255
-#define MAX_QUEUE_SECTORS (1 << (20 - 9)) /* 1 mbytes when full sized */
+#define MAX_QUEUE_SECTORS (4 << (20 - 9)) /* 4 mbytes when full sized */
 #define MAX_NR_REQUESTS (MAX_QUEUE_SECTORS >> (10 - 9)) /* 1mbyte queue when all requests are 1k */
 
 #define PageAlignSize(size) (((size) + PAGE_SIZE -1) & PAGE_MASK)

Andrea


* Re: [BENCHMARK] 2.4.20-rc2-aa1 with contest
  2002-11-25  6:44   ` Con Kolivas
  2002-11-25  7:06     ` Andrew Morton
  2002-11-25 18:23     ` Andrea Arcangeli
@ 2002-11-30 16:17     ` Andrea Arcangeli
  2 siblings, 0 replies; 7+ messages in thread
From: Andrea Arcangeli @ 2002-11-30 16:17 UTC (permalink / raw)
  To: Con Kolivas; +Cc: linux kernel mailing list

On Mon, Nov 25, 2002 at 05:44:30PM +1100, Con Kolivas wrote:
> finishes testing in that load. To reproduce it yourself, run mem_load then do 
> a kernel compile make -j(4xnum_cpus).  If that doesnt do it I'm not sure how 

JFYI: I can't reproduce it here with a kernel compile and mem_load in
parallel. Did you compile in AGP? There's apparently some known issue
with AGP/DRI.

Andrea

