* 82xx performance
@ 2008-07-14 16:34 Rune Torgersen
2008-07-14 20:44 ` Arnd Bergmann
2008-07-15 15:57 ` Milton Miller
0 siblings, 2 replies; 20+ messages in thread
From: Rune Torgersen @ 2008-07-14 16:34 UTC (permalink / raw)
To: linuxppc-dev
Hi
We are looking into switching kernels from 2.6.18 (ppc) to 2.6.25
(powerpc).
I have been trying to run some benchmarks to see how the new kernel
compares to the old one.
So far it is performing worse.
One test I ran was just compiling a 2.6.18 kernel on the system.
The .25 performed 5 to 7 % slower:
2.6.18, make vmlinux
real 74m1.328s
user 68m48.196s
sys 4m35.961s
2.6.25, make vmlinux
real 79m13.361s
user 72m41.318s
sys 5m46.744s
I also ran lmbench3. (slightly outdated, but still works)
Most (if not all) results are worse on .25, especially context
switching.
Is this expected behaviour or is there anything I need to look at in my
config?
(I'll send config if anybody is interested)
L M B E N C H 3 . 0 S U M M A R Y
------------------------------------
(Alpha software, do not distribute)
Basic system parameters
------------------------------------------------------------------------------
Host                 OS  Description              Mhz  tlb  cache   mem  scal
                                                      pages  line   par  load
                                                             bytes
--------- ------------- ----------------------- ---- ----- ----- ------ ----
9919_unit Linux 2.6.25  powerpc-linux-gnu        434    32    32 1.0000    1
9919_unit Linux 2.6.18  powerpc-linux-gnu        445    32    32 1.0100    1
Processor, Processes - times in microseconds - smaller is better
------------------------------------------------------------------------------
Host                 OS  Mhz null null open slct sig  sig  fork exec sh
                             call  I/O stat clos TCP  inst hndl proc proc proc
--------- ------------- ---- ---- ---- ---- ---- ---- ---- ---- ---- ---- ----
9919_unit Linux 2.6.25   434 0.47 1.26 10.7 35.6 34.1 1.76 14.3 2646 9964 33.K
9919_unit Linux 2.6.18   445 0.35 1.24 9.27 22.9 32.7 1.87 13.8 2157 7825 26.K
Basic integer operations - times in nanoseconds - smaller is better
-------------------------------------------------------------------
Host OS intgr intgr intgr intgr intgr
bit add mul div mod
--------- ------------- ------ ------ ------ ------ ------
9919_unit Linux 2.6.25 2.3300 0.0100 10.7 46.2 56.0
9919_unit Linux 2.6.18 2.2300 0.0100 10.3 45.4 54.1
Basic float operations - times in nanoseconds - smaller is better
-----------------------------------------------------------------
Host OS float float float float
add mul div bogo
--------- ------------- ------ ------ ------ ------
9919_unit Linux 2.6.25 9.9500 10.1 46.2 66.2
9919_unit Linux 2.6.18 9.1100 9.0800 45.8 67.1
Basic double operations - times in nanoseconds - smaller is better
------------------------------------------------------------------
Host OS double double double double
add mul div bogo
--------- ------------- ------ ------ ------ ------
9919_unit Linux 2.6.25 9.3400 11.6 78.6 100.2
9919_unit Linux 2.6.18 9.1600 11.1 77.2 97.8
Context switching - times in microseconds - smaller is better
-------------------------------------------------------------------------
Host                 OS  2p/0K 2p/16K 2p/64K 8p/16K 8p/64K 16p/16K 16p/64K
                         ctxsw  ctxsw  ctxsw  ctxsw  ctxsw   ctxsw   ctxsw
--------- ------------- ------ ------ ------ ------ ------ ------- -------
9919_unit Linux 2.6.25    20.6   86.2   28.5  103.8   38.7   111.8    57.4
9919_unit Linux 2.6.18  5.3300   63.2   17.9   73.4   23.1    74.9    26.2
*Local* Communication latencies in microseconds - smaller is better
---------------------------------------------------------------------
Host OS 2p/0K Pipe AF UDP RPC/ TCP RPC/ TCP
ctxsw UNIX UDP TCP conn
--------- ------------- ----- ----- ---- ----- ----- ----- ----- ----
9919_unit Linux 2.6.25 20.6 68.8 131. 353.1 533.4 461.7 1269
9919_unit Linux 2.6.18 5.330 36.1 87.8 225.3 402.7 331.8 520.1 970.
File & VM system latencies in microseconds - smaller is better
-------------------------------------------------------------------------------
Host                 OS  0K File       10K File      Mmap    Prot    Page 100fd
                        Create Delete Create Delete Latency Fault   Fault selct
--------- ------------- ------ ------ ------ ------ ------- ----- ------- -----
9919_unit Linux 2.6.25   222.3  172.4 1003.0  350.5   41.5K 1.734    10.5  18.0
9919_unit Linux 2.6.18   181.5  144.3  789.3  293.9   23.9K       7.09560  19.3
*Local* Communication bandwidths in MB/s - bigger is better
-----------------------------------------------------------------------------
Host                 OS  Pipe   AF  TCP  File   Mmap  Bcopy  Bcopy  Mem   Mem
                             UNIX      reread reread (libc) (hand) read write
--------- ------------- ---- ---- ---- ------ ------ ------ ------ ---- -----
9919_unit Linux 2.6.25  34.2 34.7 21.5   55.5  161.8   79.9   79.2 160. 116.1
9919_unit Linux 2.6.18  40.1 37.4 29.7   60.0  165.8   80.6   81.1 165. 117.8
Memory latencies in nanoseconds - smaller is better
    (WARNING - may not be correct, check graphs)
------------------------------------------------------------------------------
Host                 OS  Mhz  L1 $   L2 $   Main mem  Rand mem  Guesses
--------- ------------- ---- ------ ------ --------- --------- ------------
9919_unit Linux 2.6.25   434 4.8150  174.6     183.3     511.8 No L2 cache?
9919_unit Linux 2.6.18   445 4.6880  174.1     175.4     497.5 No L2 cache?
^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: 82xx performance
2008-07-14 16:34 82xx performance Rune Torgersen
@ 2008-07-14 20:44 ` Arnd Bergmann
2008-07-15 14:16 ` Rune Torgersen
2008-07-15 18:25 ` Rune Torgersen
2008-07-15 15:57 ` Milton Miller
1 sibling, 2 replies; 20+ messages in thread
From: Arnd Bergmann @ 2008-07-14 20:44 UTC (permalink / raw)
To: linuxppc-dev
On Monday 14 July 2008, Rune Torgersen wrote:
> Context switching - times in microseconds - smaller is better
> ------------------------------------------------------------------------
> Host OS 2p/0K 2p/16K 2p/64K 8p/16K 8p/64K 16p/16K 16p/64K
> ctxsw ctxsw ctxsw ctxsw ctxsw ctxsw ctxsw
> --------- ------------- ------ ------ ------ ------ ------ ------- -------
> 9919_unit Linux 2.6.25 20.6 86.2 28.5 103.8 38.7 111.8 57.4
> 9919_unit Linux 2.6.18 5.3300 63.2 17.9 73.4 23.1 74.9 26.2
This is certainly significant, but a lot has happened between the two
versions. A few ideas:
* compare some of the key configuration options:
# CONFIG_DEBUG_*
# CONFIG_PREEMPT*
# CONFIG_NO_HZ
# CONFIG_HZ
* Try looking at where the time is spent, using oprofile or readprofile
* Try setting /proc/sys/kernel/compat/sched_yield to 1, to get the legacy
behaviour of the scheduler.
* Maybe there is a kernel version that supports your hardware in both
arch/ppc/ and arch/powerpc. In that case, you could see if the platform
change had an impact.
Arnd <><
^ permalink raw reply [flat|nested] 20+ messages in thread
* RE: 82xx performance
2008-07-14 20:44 ` Arnd Bergmann
@ 2008-07-15 14:16 ` Rune Torgersen
2008-07-15 18:25 ` Rune Torgersen
1 sibling, 0 replies; 20+ messages in thread
From: Rune Torgersen @ 2008-07-15 14:16 UTC (permalink / raw)
To: Arnd Bergmann, linuxppc-dev
> This is certainly significant, but a lot has happened between the two
> versions. A few ideas:
>
> * compare some of the key configuration options:
> # CONFIG_DEBUG_*
> # CONFIG_PREEMPT*
> # CONFIG_NO_HZ
> # CONFIG_HZ
2.6.25:
# CONFIG_DEBUG_DRIVER is not set
# CONFIG_DEBUG_DEVRES is not set
# CONFIG_DEBUG_FS is not set
CONFIG_DEBUG_KERNEL=y
# CONFIG_DEBUG_SHIRQ is not set
# CONFIG_DEBUG_RT_MUTEXES is not set
# CONFIG_DEBUG_SPINLOCK is not set
# CONFIG_DEBUG_MUTEXES is not set
# CONFIG_DEBUG_SPINLOCK_SLEEP is not set
# CONFIG_DEBUG_LOCKING_API_SELFTESTS is not set
# CONFIG_DEBUG_KOBJECT is not set
# CONFIG_DEBUG_HIGHMEM is not set
CONFIG_DEBUG_BUGVERBOSE=y
CONFIG_DEBUG_INFO=y
# CONFIG_DEBUG_VM is not set
# CONFIG_DEBUG_LIST is not set
# CONFIG_DEBUG_SG is not set
# CONFIG_DEBUG_STACKOVERFLOW is not set
# CONFIG_DEBUG_STACK_USAGE is not set
# CONFIG_DEBUG_PAGEALLOC is not set
# CONFIG_DEBUGGER is not set
# CONFIG_PREEMPT_NONE is not set
# CONFIG_PREEMPT_VOLUNTARY is not set
CONFIG_PREEMPT=y
# CONFIG_PREEMPT_RCU is not set
# CONFIG_NO_HZ is not set
# CONFIG_HZ_100 is not set
# CONFIG_HZ_250 is not set
# CONFIG_HZ_300 is not set
CONFIG_HZ_1000=y
CONFIG_HZ=1000
2.6.18:
# CONFIG_DEBUG_DRIVER is not set
CONFIG_DEBUG_KERNEL=y
# CONFIG_DEBUG_SLAB is not set
# CONFIG_DEBUG_RT_MUTEXES is not set
# CONFIG_DEBUG_SPINLOCK is not set
# CONFIG_DEBUG_MUTEXES is not set
# CONFIG_DEBUG_RWSEMS is not set
# CONFIG_DEBUG_SPINLOCK_SLEEP is not set
# CONFIG_DEBUG_LOCKING_API_SELFTESTS is not set
# CONFIG_DEBUG_KOBJECT is not set
CONFIG_DEBUG_INFO=y
# CONFIG_DEBUG_FS is not set
# CONFIG_DEBUG_VM is not set
# CONFIG_PREEMPT_NONE is not set
# CONFIG_PREEMPT_VOLUNTARY is not set
CONFIG_PREEMPT=y
CONFIG_PREEMPT_BKL=y
# CONFIG_HZ_100 is not set
# CONFIG_HZ_250 is not set
CONFIG_HZ_1000=y
CONFIG_HZ=1000
>
> * Try looking at where the time is spent, using oprofile or readprofile
Doing this now.
>
> * Try setting /proc/sys/kernel/compat/sched_yield to 1, to get the
> legacy behaviour of the scheduler.
Made performance on .25 a few percent worse. (84 minutes on kernel
compile instead of 81 minutes)
> * Maybe there is a kernel version that supports your hardware in both
> arch/ppc/ and arch/powerpc. In that case, you could see if the
> platform change had an impact.
Will try. Last port in ppc branch was 2.6.18 and first in powerpc was
2.6.24.
I should be able to port ppc to .25 also, I think.
Thanks!
^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: 82xx performance
2008-07-14 16:34 82xx performance Rune Torgersen
2008-07-14 20:44 ` Arnd Bergmann
@ 2008-07-15 15:57 ` Milton Miller
2008-07-15 18:12 ` Rune Torgersen
1 sibling, 1 reply; 20+ messages in thread
From: Milton Miller @ 2008-07-15 15:57 UTC (permalink / raw)
To: Rune Torgersen; +Cc: ppcdev, Arnd Bergmann
On Tue Jul 15 02:34:03 EST 2008, Rune Torgersen wrote:
> We are looking into switching kernels from 2.6.18 (ppc) to 2.6.25
> (powerpc).
> I have been trying to run some benchmarks to see how the new kernel
> compares to the old one.
>
> So far it is performing worse.
>
> One test I ran was just compiling a 2.6.18 kernel on the system.
> The .25 performed 5 to 7 % slower:
>
> 2.6.18, make vmlinux
> real 74m1.328s
> user 68m48.196s
> sys 4m35.961s
>
> 2.6.25, make vmlinux
> real 79m13.361s
> user 72m41.318s
> sys 5m46.744s
>
> I also ran lmbench3. (slightly outdated, but still works)
> Most (if not all) results are worse on .25, especially context
> switching.
>
> Is this expected behaviour or is there anything I need to look at in my
> config?
> (I'll send config if anybody is interested)
>
>
> L M B E N C H 3 . 0 S U M M A R Y
> ------------------------------------
> (Alpha software, do not distribute)
>
> Basic system parameters
> ------------------------------------------------------------------------------
> Host                 OS  Description              Mhz  tlb  cache   mem  scal
>                                                       pages  line   par  load
>                                                              bytes
> --------- ------------- ----------------------- ---- ----- ----- ------ ----
> 9919_unit Linux 2.6.25  powerpc-linux-gnu        434    32    32 1.0000    1
> 9919_unit Linux 2.6.18  powerpc-linux-gnu        445    32    32 1.0100    1
Hmm, processor MHz is off by 11/445
>
> Processor, Processes - times in microseconds - smaller is better
> ------------------------------------------------------------------------------
> Host                 OS  Mhz null null open slct sig  sig  fork exec sh
>                              call  I/O stat clos TCP  inst hndl proc proc proc
> --------- ------------- ---- ---- ---- ---- ---- ---- ---- ---- ---- ---- ----
> 9919_unit Linux 2.6.25   434 0.47 1.26 10.7 35.6 34.1 1.76 14.3 2646 9964 33.K
> 9919_unit Linux 2.6.18   445 0.35 1.24 9.27 22.9 32.7 1.87 13.8 2157 7825 26.K
>
> Basic integer operations - times in nanoseconds - smaller is better
> -------------------------------------------------------------------
> Host OS intgr intgr intgr intgr intgr
> bit add mul div mod
> --------- ------------- ------ ------ ------ ------ ------
> 9919_unit Linux 2.6.25 2.3300 0.0100 10.7 46.2 56.0
> 9919_unit Linux 2.6.18 2.2300 0.0100 10.3 45.4 54.1
>
> Basic float operations - times in nanoseconds - smaller is better
> -----------------------------------------------------------------
> Host OS float float float float
> add mul div bogo
> --------- ------------- ------ ------ ------ ------
> 9919_unit Linux 2.6.25 9.9500 10.1 46.2 66.2
> 9919_unit Linux 2.6.18 9.1100 9.0800 45.8 67.1
>
> Basic double operations - times in nanoseconds - smaller is better
> ------------------------------------------------------------------
> Host OS double double double double
> add mul div bogo
> --------- ------------- ------ ------ ------ ------
> 9919_unit Linux 2.6.25 9.3400 11.6 78.6 100.2
> 9919_unit Linux 2.6.18 9.1600 11.1 77.2 97.8
>
Integer and float operations are also off ...
> [...]
> Memory latencies in nanoseconds - smaller is better
>     (WARNING - may not be correct, check graphs)
> ------------------------------------------------------------------------------
> Host                 OS  Mhz  L1 $   L2 $   Main mem  Rand mem  Guesses
> --------- ------------- ---- ------ ------ --------- --------- ------------
> 9919_unit Linux 2.6.25   434 4.8150  174.6     183.3     511.8 No L2 cache?
> 9919_unit Linux 2.6.18   445 4.6880  174.1     175.4     497.5 No L2 cache?
And memory latency is off 13/500.
That sounds like it will be 16/666.
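Milton's two ratios can be checked against the lmbench summary; a quick sketch in Python using the numbers quoted above (the "13/500" figure is read here from the random-access latency row):

```python
# Sanity-check of the observation above: the reported CPU clock and the
# random-access memory latency differ between the two kernels by roughly
# the same ratio, which points at a timebase/clock calibration
# difference rather than a genuine hardware slowdown.

mhz_25, mhz_18 = 434.0, 445.0        # lmbench-reported clock, MHz
rand_25, rand_18 = 511.8, 497.5      # random memory latency, ns

clock_skew = (mhz_18 - mhz_25) / mhz_18      # "off by 11/445"
latency_skew = (rand_25 - rand_18) / 500.0   # roughly "13/500"

# The two skews agree to well under a percentage point.
print(f"clock skew {clock_skew:.3f}, latency skew {latency_skew:.3f}")
```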
Are you using the same board and the same firmware?
If so, look at /proc/cpuinfo and/or the boot log to see what
frequency linux thinks the processor is running at. It sounds
like someone introduced or fixed a rounding error calculating
the timebase frequency for your board.
Please try the sleep test: sleep for 100 seconds, and time with
either a stopwatch or another system. I think you will find they
take different amounts of time, and all the results need to be scaled.
You might be able to see it reading the hardware clock.
Actually, once you track that down, rerun and see what you find.
milton
^ permalink raw reply [flat|nested] 20+ messages in thread
* RE: 82xx performance
2008-07-15 15:57 ` Milton Miller
@ 2008-07-15 18:12 ` Rune Torgersen
0 siblings, 0 replies; 20+ messages in thread
From: Rune Torgersen @ 2008-07-15 18:12 UTC (permalink / raw)
To: Milton Miller; +Cc: linuxppc-dev
> > 9919_unit Linux 2.6.25  powerpc-linux-gnu  434  32  32  1.0000  1
> > 9919_unit Linux 2.6.18  powerpc-linux-gnu  445  32  32  1.0100  1
>
> Hmm, processor MHz is off by 11/445
I noticed that.
> And memory latency is off 13/500.
>
> That sounds like it will be 16/666.
>
> Are you using the same board and the same firmware?
Yes. Same board, same firmware, same filesystem, just booted with
different kernels.
>
> If so, look at /proc/cpuinfo and/or the boot log to see what
> frequency linux thinks the processor is running at. It sounds
> like someone introduced or fixed a rounding error calculating
> the timebase frequency for your board.
2.6.18 /proc/cpuinfo
processor : 0
cpu : G2_LE
revision : 1.4 (pvr 8082 2014)
bogomips : 296.96
chipset : 8250
vendor : Innovative Systems LLC
machine : AP Gold
mem size : 0x40000000
console baud : 115200
core clock : 447 MHz
CPM clock : 298 MHz
bus clock : 99 MHz
2.6.25 /proc/cpuinfo
processor : 0
cpu : G2_LE
clock : 447.897600MHz
revision : 1.4 (pvr 8082 2014)
bogomips : 49.53
timebase : 24883200
platform : Innovative Systems ApMax
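The two outputs are at least self-consistent if one assumes the timebase ticks at one quarter of the bus clock (a common MPC82xx arrangement, but an assumption here; check the reference manual for this part):

```python
# Cross-check the two /proc/cpuinfo outputs above. Assuming the timebase
# runs at bus clock / 4, the timebase reported by 2.6.25 reproduces both
# the bus clock and the core clock reported by 2.6.18.

timebase_hz = 24883200                    # "timebase : 24883200" (2.6.25)

derived_bus_mhz = timebase_hz * 4 / 1e6   # ~99.5, vs "bus clock : 99 MHz"
derived_core_mhz = timebase_hz * 18 / 1e6 # 18x timebase = 4.5x bus clock

print(f"bus clock from timebase:  {derived_bus_mhz:.4f} MHz")
print(f"core clock from timebase: {derived_core_mhz:.6f} MHz")
# The derived core clock is exactly the 447.897600 MHz that 2.6.25 prints,
# so the two kernels agree on the underlying clocks.
```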
> Please try the sleep test: sleep for 100 seconds, and time with
> either a stopwatch or another system. I think you will find they
> take different amounts of time, and all the results need to be scaled.
> You might be able to see it reading the hardware clock.
Sleep 100 takes exactly 100 seconds on both kernels (verified with
stopwatch and external ntp server)
^ permalink raw reply [flat|nested] 20+ messages in thread
* RE: 82xx performance
2008-07-14 20:44 ` Arnd Bergmann
2008-07-15 14:16 ` Rune Torgersen
@ 2008-07-15 18:25 ` Rune Torgersen
2008-07-15 19:50 ` Arnd Bergmann
1 sibling, 1 reply; 20+ messages in thread
From: Rune Torgersen @ 2008-07-15 18:25 UTC (permalink / raw)
To: Arnd Bergmann, linuxppc-dev
> * Maybe there is a kernel version that supports your hardware in both
> arch/ppc/ and arch/powerpc. In that case, you could see if the
> platform change had an impact.
Using arch/ppc I got a 2.6.25 kernel to boot, and the kernel compile
test I did is almost identical (within 1%) to what the arch/powerpc
2.6.25 did, so it seems to be a difference between 2.6.18 and 2.6.25.
(I'll see if I can find an exact version, as I think my ppc port can be
compiled for all versions from 2.6.18 to 25.)
And running oprofile didn't help much, as it seems not to have been able
to figure out what in the kernel got called. (The Freescale 82xx does
not have hw performance registers.)
^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: 82xx performance
2008-07-15 18:25 ` Rune Torgersen
@ 2008-07-15 19:50 ` Arnd Bergmann
2008-07-16 21:08 ` Rune Torgersen
0 siblings, 1 reply; 20+ messages in thread
From: Arnd Bergmann @ 2008-07-15 19:50 UTC (permalink / raw)
To: linuxppc-dev
On Tuesday 15 July 2008, Rune Torgersen wrote:
> Using arch/ppc I got a 2.6.25 kernel to boot, and the kernel compile
> test I did is almost identical (within 1%) of what the arch/powerpc
> 2.6.25 did, so it seems to be a difference between 2.6.18 and 2.6.25
> (I'll see if I can find an exact version, as I think my ppc port can be
> compiled for all versions from 2.6.18 to 25)
You probably already know git-bisect, but if you don't, you should
definitely give it a try. It's the best tool to find which patch
exactly broke your performance.
Arnd <><
^ permalink raw reply [flat|nested] 20+ messages in thread
* RE: 82xx performance
2008-07-15 19:50 ` Arnd Bergmann
@ 2008-07-16 21:08 ` Rune Torgersen
2008-07-16 21:45 ` Arnd Bergmann
0 siblings, 1 reply; 20+ messages in thread
From: Rune Torgersen @ 2008-07-16 21:08 UTC (permalink / raw)
To: Arnd Bergmann, Milton Miller, linuxppc-dev
Arnd Bergmann wrote:
> On Tuesday 15 July 2008, Rune Torgersen wrote:
>> Using arch/ppc I got a 2.6.25 kernel to boot, and the kernel compile
>> test I did is almost identical (within 1%) of what the arch/powerpc
>> 2.6.25 did, so it seems to be a difference between 2.6.18 and 2.6.25
>> (I'll see if I can find an exact version, as I think my ppc port can
>> be compiled for all versions from 2.6.18 to 25)
>
> You probably already know git-bisect, but if you don't, you should
> definitely give it a try. It's the best tool to find which patch
> exactly broke your performance.
Turns out the story is not so simple.
I redid the test with all versions of arch/ppc from 2.6.18 to 2.6.26, and
also arch/powerpc (2.6.24 and 25; 26 doesn't compile because of binutils
issues).
This time I made very sure that the tests were performed the same way,
and I made a table showing relative performance:
kernel           compile time   rel   context switch   rel
v2.6.18           01:13:33.70  1.00             7.2   1.00
v2.6.19           01:13:29.21  1.00             7.1   0.99
v2.6.20           01:13:29.58  1.00             2.8   0.39
v2.6.21           01:13:24.91  1.00             8.1   1.13
v2.6.22           01:13:42.72  1.00             4.5   0.63
v2.6.23           01:15:16.43  1.02              17   2.36
v2.6.24           01:15:30.90  1.03              20   2.78
v2.6.25           01:14:51.21  1.02              21   2.92
v2.6.26           01:14:34.76  1.01            23.8   3.31
v2.6.24-powerpc   01:17:41.99  1.06            25.8   3.58
v2.6.25-powerpc   01:18:10.10  1.06            35.7   4.96
This shows that arch/ppc, no matter the version, is fairly consistent in
speed. arch/powerpc is roughly 6% worse.
The context switch column is from running lat_ctx 2 from lmbench3, and
should be in microseconds.
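The rel columns above can be reproduced mechanically from the raw numbers; a small Python sketch (only a few rows copied in):

```python
# Recompute the relative-performance columns from the raw numbers in the
# table above, normalizing each kernel against the v2.6.18 baseline.

def to_seconds(hms: str) -> float:
    """Convert 'hh:mm:ss.ff' to seconds."""
    h, m, s = hms.split(":")
    return int(h) * 3600 + int(m) * 60 + float(s)

results = {                 # kernel: (compile time, lat_ctx 2 in usec)
    "v2.6.18":         ("01:13:33.70",  7.2),
    "v2.6.26":         ("01:14:34.76", 23.8),
    "v2.6.25-powerpc": ("01:18:10.10", 35.7),
}

base_t = to_seconds(results["v2.6.18"][0])
base_c = results["v2.6.18"][1]

for kernel, (t, ctx) in results.items():
    print(f"{kernel:16} {to_seconds(t) / base_t:.2f} {ctx / base_c:.2f}")
```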
^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: 82xx performance
2008-07-16 21:08 ` Rune Torgersen
@ 2008-07-16 21:45 ` Arnd Bergmann
2008-07-16 21:53 ` Rune Torgersen
0 siblings, 1 reply; 20+ messages in thread
From: Arnd Bergmann @ 2008-07-16 21:45 UTC (permalink / raw)
To: linuxppc-dev; +Cc: Milton Miller
On Wednesday 16 July 2008, Rune Torgersen wrote:
> Turns out the story is not so simple.
> I redid the test with all versions of arch/ppc from 2.6.18 to 2.6.26, and
> also arch/powerpc (2.6.24 and 25; 26 doesn't compile because of binutils
> issues).
>
> This time I made very sure that the tests were performed the same way,
> and I made a table showing relative performance:
>
> kernel           compile time   rel   context switch   rel
> v2.6.18           01:13:33.70  1.00             7.2   1.00
> v2.6.19           01:13:29.21  1.00             7.1   0.99
> v2.6.20           01:13:29.58  1.00             2.8   0.39
> v2.6.21           01:13:24.91  1.00             8.1   1.13
> v2.6.22           01:13:42.72  1.00             4.5   0.63
> v2.6.23           01:15:16.43  1.02              17   2.36
> v2.6.24           01:15:30.90  1.03              20   2.78
> v2.6.25           01:14:51.21  1.02              21   2.92
> v2.6.26           01:14:34.76  1.01            23.8   3.31
> v2.6.24-powerpc   01:17:41.99  1.06            25.8   3.58
> v2.6.25-powerpc   01:18:10.10  1.06            35.7   4.96
>
> This shows that arch/ppc, no matter the version, is fairly consistent
> in speed. arch/powerpc is roughly 6% worse.
>
> The context switch column is from running lat_ctx 2 from lmbench3, and
> should be in microseconds.
Ok, I think this could be related mostly to two changes:
* In 2.6.23, the process scheduler was replaced, the new one is the CFS,
the 'completely fair scheduler'. This has changed a lot of data.
To verify this, you could check out a git version just before and just
after CFS went in.
* Obviously, the 6 percent change between ppc and powerpc should not
be there. You can still try to use 'readprofile', or oprofile in
timer based mode (i.e. without HW performance counters) to get some
more data about where the time is spent on an identical kernel version.
Arnd <><
^ permalink raw reply [flat|nested] 20+ messages in thread
* RE: 82xx performance
2008-07-16 21:45 ` Arnd Bergmann
@ 2008-07-16 21:53 ` Rune Torgersen
2008-07-16 22:32 ` Arnd Bergmann
0 siblings, 1 reply; 20+ messages in thread
From: Rune Torgersen @ 2008-07-16 21:53 UTC (permalink / raw)
To: Arnd Bergmann, linuxppc-dev; +Cc: Milton Miller
Arnd Bergmann wrote:
> Ok, I think this could be related mostly to two changes:
>=20
> * In 2.6.23, the process scheduler was replaced, the new one
> is the CFS,
> the 'completely fair scheduler'. This has changed a lot of data.
> To verify this, you could check out a git version just before and
> just after CFS went in.=20
I'll check. Checking the context switch is fairly fast, so this is a
goot time to learn bisect...
> * Obviously, the 6 percent change between ppc and powerpc should not
> be there. You can still try to use 'readprofile', or oprofile in
> timer based mode (i.e. without HW performance counters) to get some
> more data about where the time is spent on an identical
> kernel version.
I did run oprofile, but I could not figure out how to get it to show me
where in the kernel it was spending time. It showed that a lot of time
was spent in vmlinux, but not anything specific. I probably just don't
know how to set up or run oprofile correctly.
^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: 82xx performance
2008-07-16 21:53 ` Rune Torgersen
@ 2008-07-16 22:32 ` Arnd Bergmann
2008-07-17 15:12 ` Rune Torgersen
0 siblings, 1 reply; 20+ messages in thread
From: Arnd Bergmann @ 2008-07-16 22:32 UTC (permalink / raw)
To: Rune Torgersen; +Cc: linuxppc-dev, Milton Miller
On Wednesday 16 July 2008, Rune Torgersen wrote:
> I did run oprofile, but I could not figure out how to get it to show me
> where in the kernel it was spending time. It showed that a lot of time
> was spent in vmlinux, but not anything specific. I probably just don't
> know how to set up or run oprofile correctly.
Maybe you passed no vmlinux, or only a stripped one?
Oprofile needs to have a vmlinux file to extract the symbol
information from.
If you can't get it to work, readprofile(1) is a much simpler
tool, both in what it can do and what it requires you to do.
Arnd <><
^ permalink raw reply [flat|nested] 20+ messages in thread
* RE: 82xx performance
2008-07-16 22:32 ` Arnd Bergmann
@ 2008-07-17 15:12 ` Rune Torgersen
2008-07-17 15:47 ` Arnd Bergmann
0 siblings, 1 reply; 20+ messages in thread
From: Rune Torgersen @ 2008-07-17 15:12 UTC (permalink / raw)
To: Arnd Bergmann, Milton Miller, linuxppc-dev
Arnd Bergmann wrote:
> If you can't get it to work, readprofile(1) is a much simpler
> tool, both in what it can do and what it requires you to do.
One thing that pops out is that handle_mm_fault uses twice as many ticks
in arch/powerpc.
Top 20 calls from readprofile
2.6.25 arch/ppc
305993 total 0.1295
53301 __flush_dcache_icache 832.8281
25746 clear_pages 919.5000
19086 __copy_tofrom_user 33.3671
17198 get_page_from_freelist 13.3525
12741 _tlbia 353.9167
12317 handle_mm_fault 8.9774
9669 handle_page_fault 75.5391
8037 do_page_fault 9.5225
6450 cpu_idle 25.1953
5430 update_mmu_cache 21.8952
4663 copy_page 32.3819
3712 __link_path_walk 0.8452
3302 find_vma 19.6548
3241 __do_fault 2.6741
3235 unmap_vmas 2.1741
3184 lru_cache_add_active 16.5833
3076 __alloc_pages 3.8450
3062 find_lock_page 9.8141
2826 zone_watermark_ok 16.0568
2593 put_page 6.8237
2.6.25 arch/powerpc
60982 cpu_idle 262.8534
54601 __flush_dcache_icache_phys 650.0119
25676 clear_pages 917.0000
24892 handle_mm_fault 8.7772
19478 __copy_tofrom_user 34.0524
18112 get_page_from_freelist 12.3716
13245 _tlbia 367.9167
11976 do_page_fault 10.3241
10028 handle_page_fault 78.3438
6025 update_mmu_cache 23.5352
4650 page_address 15.7095
4097 copy_page 28.4514
4031 __do_fault 1.9838
3952 find_vma 27.4444
3861 __link_path_walk 0.9237
3533 unmap_vmas 2.2590
3400 lru_cache_add_active 19.3182
3317 find_lock_page 11.0567
3238 __alloc_pages 4.5223
2825 zone_watermark_ok 16.8155
2740 __d_lookup 5.8547
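A rough way to compare the two profiles programmatically; the tick counts below are copied from the listings above (only a few symbols, for illustration):

```python
# Compare per-symbol tick counts from the two readprofile runs above to
# highlight where the arch/powerpc kernel spends disproportionately more
# time. Only a handful of rows are copied in here.

ppc = {      # symbol: ticks under 2.6.25 arch/ppc
    "handle_mm_fault": 12317,
    "do_page_fault": 8037,
    "cpu_idle": 6450,
}
powerpc = {  # symbol: ticks under 2.6.25 arch/powerpc
    "handle_mm_fault": 24892,
    "do_page_fault": 11976,
    "cpu_idle": 60982,
}

# handle_mm_fault comes out at roughly 2x, matching the observation above;
# cpu_idle is the other outlier.
for sym in sorted(set(ppc) & set(powerpc)):
    print(f"{sym:18} {powerpc[sym] / ppc[sym]:5.2f}x")
```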
^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: 82xx performance
2008-07-17 15:12 ` Rune Torgersen
@ 2008-07-17 15:47 ` Arnd Bergmann
2008-07-17 15:52 ` Rune Torgersen
2008-07-17 18:24 ` Rune Torgersen
0 siblings, 2 replies; 20+ messages in thread
From: Arnd Bergmann @ 2008-07-17 15:47 UTC (permalink / raw)
To: linuxppc-dev; +Cc: Milton Miller
On Thursday 17 July 2008, Rune Torgersen wrote:
> Arnd Bergmann wrote:
> > If you can't get it to work, readprofile(1) is a much simpler
> > tool, both in what it can do and what it requires you to do.
>
> One thing that pops out is that handle_mm_fault uses twice as many
> ticks in arch/powerpc.
The other thing I found interesting is that cpu_idle is on the
top of the list in arch/powerpc but does not show up anywhere
in your top arch/ppc samples. This indicates that the system is
waiting for something, e.g. disk I/O for a significant amount
of time.
Seeing more hits in handle_mm_fault suggests that you have
a higher page fault rate. A trivial reason for this might
be that the amount of memory was misdetected in the new
code (maybe broken device tree). What is the content of
/proc/meminfo after a fresh boot?
If it's the same, try running a kernel build with 'time --verbose',
using GNU time instead of the bash builtin time to see how the
page fault rate changed.
Arnd <><
^ permalink raw reply [flat|nested] 20+ messages in thread
* RE: 82xx performance
2008-07-17 15:47 ` Arnd Bergmann
@ 2008-07-17 15:52 ` Rune Torgersen
2008-07-17 18:24 ` Rune Torgersen
1 sibling, 0 replies; 20+ messages in thread
From: Rune Torgersen @ 2008-07-17 15:52 UTC (permalink / raw)
To: Arnd Bergmann, linuxppc-dev; +Cc: Milton Miller
Arnd Bergmann wrote:
> Seeing more hits in handle_mm_fault suggests that you have
> a higher page fault rate. A trivial reason for this might
> be that the amount of memory was misdetected in the new
> code (maybe broken device tree). What is the content of
> /proc/meminfo after a fresh boot?
I also just realized that the arch/ppc kernel was set up without highmem
support (using all 1G as lowmem), while the arch/powerpc one was set up
with highmem. I'm retrying a compile without highmem support and 1G
lowmem on arch/powerpc now to see if it makes a difference.
^ permalink raw reply [flat|nested] 20+ messages in thread
* RE: 82xx performance
2008-07-17 15:47 ` Arnd Bergmann
2008-07-17 15:52 ` Rune Torgersen
@ 2008-07-17 18:24 ` Rune Torgersen
2008-07-17 19:43 ` Arnd Bergmann
1 sibling, 1 reply; 20+ messages in thread
From: Rune Torgersen @ 2008-07-17 18:24 UTC (permalink / raw)
To: Arnd Bergmann, linuxppc-dev; +Cc: Milton Miller
Arnd Bergmann wrote:
> Seeing more hits in handle_mm_fault suggests that you have
> a higher page fault rate. A trivial reason for this might
> be that the amount of memory was misdetected in the new
> code (maybe broken device tree). What is the content of
> /proc/meminfo after a fresh boot?
Powerpc:
MemTotal: 1011296 kB
MemFree: 953884 kB
Buffers: 20316 kB
Cached: 25896 kB
SwapCached: 0 kB
Active: 26228 kB
Inactive: 25652 kB
SwapTotal: 0 kB
SwapFree: 0 kB
Dirty: 132 kB
Writeback: 0 kB
AnonPages: 5684 kB
Mapped: 2436 kB
Slab: 2460 kB
SReclaimable: 732 kB
SUnreclaim: 1728 kB
PageTables: 208 kB
NFS_Unstable: 0 kB
Bounce: 0 kB
CommitLimit: 505648 kB
Committed_AS: 11472 kB
VmallocTotal: 474756 kB
VmallocUsed: 8776 kB
VmallocChunk: 465664 kB
Ppc:
MemTotal: 1011500 kB
MemFree: 954868 kB
Buffers: 20696 kB
Cached: 25044 kB
SwapCached: 0 kB
Active: 26816 kB
Inactive: 24588 kB
SwapTotal: 0 kB
SwapFree: 0 kB
Dirty: 124 kB
Writeback: 0 kB
AnonPages: 5680 kB
Mapped: 2456 kB
Slab: 2056 kB
SReclaimable: 736 kB
SUnreclaim: 1320 kB
PageTables: 180 kB
NFS_Unstable: 0 kB
Bounce: 0 kB
CommitLimit: 505748 kB
Committed_AS: 11468 kB
VmallocTotal: 245696 kB
VmallocUsed: 360 kB
VmallocChunk: 245276 kB
> If it's the same, try running a kernel build with 'time --verbose',
> using GNU time instead of the bash builtin time to see how the page
> fault rate changed.
Arch/powerpc
Command being timed: "make vmlinux"
User time (seconds): 4339.11
System time (seconds): 319.41
Percent of CPU this job got: 99%
Elapsed (wall clock) time (h:mm:ss or m:ss): 1:17:42
Average shared text size (kbytes): 0
Average unshared data size (kbytes): 0
Average stack size (kbytes): 0
Average total size (kbytes): 0
Maximum resident set size (kbytes): 0
Average resident set size (kbytes): 0
Major (requiring I/O) page faults: 0
Minor (reclaiming a frame) page faults: 4213347
Voluntary context switches: 53543
Involuntary context switches: 90165
Swaps: 0
File system inputs: 0
File system outputs: 0
Socket messages sent: 0
Socket messages received: 0
Signals delivered: 0
Page size (bytes): 4096
Exit status: 0
arch/ppc:
Command being timed: "make vmlinux"
User time (seconds): 4177.11
System time (seconds): 295.00
Percent of CPU this job got: 99%
Elapsed (wall clock) time (h:mm:ss or m:ss): 1:14:35
Average shared text size (kbytes): 0
Average unshared data size (kbytes): 0
Average stack size (kbytes): 0
Average total size (kbytes): 0
Maximum resident set size (kbytes): 0
Average resident set size (kbytes): 0
Major (requiring I/O) page faults: 0
Minor (reclaiming a frame) page faults: 4203103
Voluntary context switches: 53812
Involuntary context switches: 85856
Swaps: 0
File system inputs: 0
File system outputs: 0
Socket messages sent: 0
Socket messages received: 0
Signals delivered: 0
Page size (bytes): 4096
Exit status: 0
^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: 82xx performance
2008-07-17 18:24 ` Rune Torgersen
@ 2008-07-17 19:43 ` Arnd Bergmann
2008-07-17 19:54 ` Rune Torgersen
0 siblings, 1 reply; 20+ messages in thread
From: Arnd Bergmann @ 2008-07-17 19:43 UTC (permalink / raw)
To: linuxppc-dev; +Cc: Milton Miller
On Thursday 17 July 2008, Rune Torgersen wrote:
> Arnd Bergmann wrote:
> > Seeing more hits in handle_mm_fault suggests that you have
> > a higher page fault rate. A trivial reason for this might
> > be that the amount of memory was misdetected in the new
> > code (maybe broken device tree). What is the content of
> > /proc/meminfo after a fresh boot?
> Powerpc
> VmallocTotal: 474756 kB
> Ppc
> VmallocTotal: 245696 kB
This seems to be the only significant difference here, but
I don't see how it can make an impact on performance.
> User time (seconds): 4339.11
> User time (seconds): 4177.11
3.9% slowdown
> System time (seconds): 319.41
> System time (seconds): 295.00
8.3% slowdown
> Elapsed (wall clock) time (h:mm:ss or m:ss): 1:17:42
> Elapsed (wall clock) time (h:mm:ss or m:ss): 1:14:35
4.2% slowdown
> Minor (reclaiming a frame) page faults: 4213347
> Minor (reclaiming a frame) page faults: 4203103
slightly more faults
> Voluntary context switches: 53543
> Voluntary context switches: 53812
This actually went down by 0.5%; both of these are well within
the expected noise.
> Involuntary context switches: 90165
> Involuntary context switches: 85856
5% more context switches, probably a side-effect of the longer
run-time.
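For reference, the percentages above can be recomputed directly from the two GNU time reports; a quick sketch (figures copied from the quoted runs):

```python
# Relative slowdown of the build under 2.6.25 (arch/powerpc) versus
# 2.6.18 (arch/ppc), from the GNU time reports quoted above.
def slowdown(new, old):
    """Percentage increase of new over old."""
    return (new / old - 1.0) * 100.0

print(f"user:   {slowdown(4339.11, 4177.11):.1f}%")        # ~3.9%
print(f"system: {slowdown(319.41, 295.00):.1f}%")          # ~8.3%
print(f"wall:   {slowdown(77*60 + 42, 74*60 + 35):.1f}%")  # ~4.2%
```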
So again, nothing conclusive. I'm running out of ideas.
Arnd <><
* RE: 82xx performance
2008-07-17 19:43 ` Arnd Bergmann
@ 2008-07-17 19:54 ` Rune Torgersen
2008-07-17 21:32 ` Arnd Bergmann
0 siblings, 1 reply; 20+ messages in thread
From: Rune Torgersen @ 2008-07-17 19:54 UTC (permalink / raw)
To: Arnd Bergmann, linuxppc-dev; +Cc: Milton Miller
Arnd Bergmann wrote:
> So again, nothing conclusive. I'm running out of ideas.
Is the syscall path different or the same on ppc and powerpc?
Any differences in the task switching, irq handling or page fault
handling?
* Re: 82xx performance
2008-07-17 19:54 ` Rune Torgersen
@ 2008-07-17 21:32 ` Arnd Bergmann
2008-07-25 20:41 ` Rune Torgersen
0 siblings, 1 reply; 20+ messages in thread
From: Arnd Bergmann @ 2008-07-17 21:32 UTC (permalink / raw)
To: linuxppc-dev; +Cc: Milton Miller
On Thursday 17 July 2008, Rune Torgersen wrote:
> Arnd Bergmann wrote:
> > So again, nothing conclusive. I'm running out of ideas.
>
> Is the syscall path different or the same on ppc and powerpc?
> Any differences in the task switching, irq handling or page fault
> handling?
>
It's all different in subtle ways, but those changes should only
show up in the system time accounting, not user time accounting.
One change that I would expect to show up in user space performance
is different TLB management, in a way that causes increased flushing
and reloading of page table entries. I don't have the slightest
idea if or in what ways that has changed. Maybe Ben has a clue
about this.
Arnd <><
* RE: 82xx performance
2008-07-17 21:32 ` Arnd Bergmann
@ 2008-07-25 20:41 ` Rune Torgersen
2008-07-26 3:47 ` Milton Miller
0 siblings, 1 reply; 20+ messages in thread
From: Rune Torgersen @ 2008-07-25 20:41 UTC (permalink / raw)
To: Arnd Bergmann, linuxppc-dev; +Cc: Milton Miller
> From: Arnd Bergmann [mailto:arnd@arndb.de]
> On Thursday 17 July 2008, Rune Torgersen wrote:
> > Arnd Bergmann wrote:
> > > So again, nothing conclusive. I'm running out of ideas.
> >
> > Is the syscall path different or the same on ppc and powerpc?
> > Any differences in the task switching, irq handling or page fault
> > handling?
> >
>
> It's all different in subtle ways, but those changes should only
> show up in the system time accounting, not user time accounting.
I've been running the workload this board will see. On a 2.6.18 kernel
%idle is ~50% and %wa (waiting for IO) is less than 1% most of the time.
On 2.6.25, the idle% is lower (by about 10-15%) and the %wa is
consistently hovering around 20-30%, sometimes spiking to 100%.
The workload involves quite a bit of socket IO (TCP, UDP, Unix Sockets
and TIPC) and disk IO.
Any easy way of finding what is causing the wait for IO?
(I've been trying to get lttng to work, but with no good results so far.)
* Re: 82xx performance
2008-07-25 20:41 ` Rune Torgersen
@ 2008-07-26 3:47 ` Milton Miller
0 siblings, 0 replies; 20+ messages in thread
From: Milton Miller @ 2008-07-26 3:47 UTC (permalink / raw)
To: Rune Torgersen; +Cc: linuxppc-dev, Arnd Bergmann
On Jul 25, 2008, at 3:41 PM, Rune Torgersen wrote:
>> From: Arnd Bergmann [mailto:arnd@arndb.de]
>> On Thursday 17 July 2008, Rune Torgersen wrote:
>>> Arnd Bergmann wrote:
>>>> So again, nothing conclusive. I'm running out of ideas.
>>>
>>> Is the syscall path different or the same on ppc and powerpc?
>>> Any differences in the task switching, irq handling or page fault
>>> handling?
>>>
>>
>> It's all different in subtle ways, but those changes should only
>> show up in the system time accounting, not user time accounting.
>
> I've been running the workload this board will see. On a 2.6.18 kernel
> %idle is ~50% and %wa (waiting for IO) is less than 1% most of the
> time.
> On 2.6.25, the idle% is lower (by about 10-15%) and the %wa is
> consistently hovering around 20-30% sometimes spiking to 100%.
>
> The workload involves quite a bit of socket IO (TCP, UDP, Unix Sockets
> and TIPC) and disk IO.
> Any easy way of finding what is causing the wait for IO?
>
> (I've been trying to get lttng to work, but with no good results so
> far.)
In both idle and wait, the cpu is in the idle loop waiting for something
to do. The difference is that a cpu is counted as in disk wait if there
is a task in uninterruptible sleep that is not already being counted as
waiting on another cpu.
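The uninterruptible sleepers Milton describes can be inspected directly with standard tools; a minimal sketch (the sysrq step is optional and needs root):

```shell
# List tasks currently in uninterruptible sleep (state "D"); these
# are the tasks that make an idle CPU count as "waiting for IO".
ps -eo state,pid,comm | awk '$1 == "D"'

# Optional, as root: dump kernel backtraces of all blocked tasks
# into the kernel log (read them back with dmesg):
#   echo w > /proc/sysrq-trigger
```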
With a significant change in wait time, I would suggest restating the
current observations to the linux-mm community, probably with a cc
to linux-kernel.
Be sure to state what you have done to equalize the comparison (e.g.
highmem, RAM size left, etc.).
There may be some tunables that could adjust the behavior.
milton
end of thread, other threads:[~2008-07-26 3:48 UTC | newest]
Thread overview: 20+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2008-07-14 16:34 82xx performance Rune Torgersen
2008-07-14 20:44 ` Arnd Bergmann
2008-07-15 14:16 ` Rune Torgersen
2008-07-15 18:25 ` Rune Torgersen
2008-07-15 19:50 ` Arnd Bergmann
2008-07-16 21:08 ` Rune Torgersen
2008-07-16 21:45 ` Arnd Bergmann
2008-07-16 21:53 ` Rune Torgersen
2008-07-16 22:32 ` Arnd Bergmann
2008-07-17 15:12 ` Rune Torgersen
2008-07-17 15:47 ` Arnd Bergmann
2008-07-17 15:52 ` Rune Torgersen
2008-07-17 18:24 ` Rune Torgersen
2008-07-17 19:43 ` Arnd Bergmann
2008-07-17 19:54 ` Rune Torgersen
2008-07-17 21:32 ` Arnd Bergmann
2008-07-25 20:41 ` Rune Torgersen
2008-07-26 3:47 ` Milton Miller
2008-07-15 15:57 ` Milton Miller
2008-07-15 18:12 ` Rune Torgersen