[PATCH V2] [x86]: Compiler Option Os is better on latest x86

* [PATCH V2] [x86]: Compiler Option Os is better on latest x86
From: ling.ma.program @ 2013-01-26 12:24 UTC
  To: mingo; +Cc: tglx, hpa, linux-kernel, Ma Ling

From: Ma Ling <ling.ml@alipay.com>

  Currently we use -O2 as the compiler option for better performance,
although it enlarges code size. On modern CPUs, larger instruction and
unified caches and sophisticated instruction prefetching reduce the cost
of instruction cache misses. Meanwhile, flags such as -falign-functions,
-falign-jumps, -falign-loops and -falign-labels help improve CPU front-end
throughput, because the CPU fetches instructions in 16-byte aligned code
blocks, one block per cycle.

  In order to save power and get higher performance, Sandy Bridge
introduced a decoded cache: instructions are kept in it after the decode
stage. When the CPU refetches an instruction, the decoded cache can provide
a 32-byte aligned instruction block instead of 16 bytes from the I-cache,
and the shorter pipeline results in a smaller branch-miss penalty. This
requires that as much hot code as possible be kept in the decoded cache.
Sandy Bridge, Ivy Bridge and Haswell all implement this feature, so -Os
(optimize for size) should be better than -O2 on them.
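
The block-size argument above can be illustrated with a little arithmetic. The
sketch below only counts how many aligned blocks a run of code bytes overlaps
for a 16-byte and a 32-byte block size; it is a toy calculation, not a model of
any real front end. The function name `blocks_touched` and the example offsets
and lengths are hypothetical and do not come from the patch.

```python
def blocks_touched(start: int, length: int, block_size: int) -> int:
    """Count the aligned blocks of block_size bytes that overlap
    the byte range [start, start + length)."""
    if length <= 0:
        return 0
    first = start // block_size                 # block holding the first byte
    last = (start + length - 1) // block_size   # block holding the last byte
    return last - first + 1


if __name__ == "__main__":
    # Hypothetical code regions as (start offset, length in bytes).
    regions = [(0x1C, 40), (0x20, 40), (0x20, 24)]
    for start, length in regions:
        for block_size in (16, 32):
            n = blocks_touched(start, length, block_size)
            print(f"start={start:#x} len={length} block={block_size}: {n} block(s)")
```

Under these assumptions, both aligning the start of a region and shrinking the
region reduce the number of blocks that have to be delivered, which is the
trade-off between the alignment flags and -Os that the text describes.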

Based on the above reasoning, we compiled Linux kernel 3.6.9 with -O2 and
-Os respectively. The results show that -Os improves performance by 4.8%
for netperf and 2.7% for volano, as shown below.

O2 + netperf
Performance counter stats for 'netperf' (3 runs):

       5416.157986 task-clock                #    0.541 CPUs utilized            ( +-  0.19% )
           348,249 context-switches          #    0.064 M/sec                    ( +-  0.17% )
                 0 CPU-migrations            #    0.000 M/sec                    ( +-  0.00% )
               353 page-faults               #    0.000 M/sec                    ( +-  0.16% )
    13,166,254,384 cycles                    #    2.431 GHz                      ( +-  0.18% )
     8,827,499,807 stalled-cycles-frontend   #   67.05% frontend cycles idle     ( +-  0.29% )
     5,951,234,060 stalled-cycles-backend    #   45.20% backend  cycles idle     ( +-  0.44% )
     8,122,481,914 instructions              #    0.62  insns per cycle
                                             #    1.09  stalled cycles per insn  ( +-  0.17% )
     1,415,864,138 branches                  #  261.415 M/sec                    ( +-  0.17% )
        16,975,308 branch-misses             #    1.20% of all branches          ( +-  0.61% )

      10.007215371 seconds time elapsed                                          ( +-  0.03% )

Os + netperf

Performance counter stats for 'netperf' (3 runs):

       5395.386704 task-clock                #    0.539 CPUs utilized            ( +-  0.14% )
           345,880 context-switches          #    0.064 M/sec                    ( +-  0.25% )
                 0 CPU-migrations            #    0.000 M/sec                    ( +-  0.00% )
               354 page-faults               #    0.000 M/sec                    ( +-  0.00% )
    13,142,706,297 cycles                    #    2.436 GHz                      ( +-  0.23% )
     8,379,382,641 stalled-cycles-frontend   #   63.76% frontend cycles idle     ( +-  0.50% )
     5,513,722,219 stalled-cycles-backend    #   41.95% backend  cycles idle     ( +-  0.71% )
     8,554,202,795 instructions              #    0.65  insns per cycle
                                             #    0.98  stalled cycles per insn  ( +-  0.25% )
     1,530,020,505 branches                  #  283.579 M/sec                    ( +-  0.25% )
        17,710,406 branch-misses             #    1.16% of all branches          ( +-  1.00% )

      10.004859867 seconds time elapsed               

Over the same elapsed time (10.004859867 seconds), IPC is 0.65 with Os versus 0.62 with O2, so Os improved performance by 4.8%.

O2 + volano
Performance counter stats for './loopclient.sh openjdk' (3 runs):

     210627.115313 task-clock                #    0.781 CPUs utilized            ( +-  0.92% )
        13,812,610 context-switches          #    0.066 M/sec                    ( +-  0.17% )
         2,352,755 CPU-migrations            #    0.011 M/sec                    ( +-  0.84% )
           208,333 page-faults               #    0.001 M/sec                    ( +-  1.58% )
   525,627,073,405 cycles                    #    2.496 GHz                      ( +-  0.96% )
   428,177,571,365 stalled-cycles-frontend   #   81.46% frontend cycles idle     ( +-  1.09% )
   370,885,224,739 stalled-cycles-backend    #   70.56% backend  cycles idle     ( +-  1.18% )
   187,662,577,544 instructions              #    0.36  insns per cycle
                                             #    2.28  stalled cycles per insn  ( +-  0.31% )
    35,684,976,425 branches                  #  169.423 M/sec                    ( +-  0.45% )
     1,062,086,942 branch-misses             #    2.98% of all branches          ( +-  0.08% )

     269.764578435 seconds time elapsed    
         
Os + volano
Performance counter stats for './loopclient.sh openjdk' (3 runs):

     209545.786941 task-clock                #    0.778 CPUs utilized            ( +-  0.66% )
        13,864,142 context-switches          #    0.066 M/sec                    ( +-  0.29% )
         2,326,826 CPU-migrations            #    0.011 M/sec                    ( +-  0.83% )
           205,575 page-faults               #    0.001 M/sec                    ( +-  2.63% )
   523,366,588,452 cycles                    #    2.498 GHz                      ( +-  0.75% )
   419,200,472,430 stalled-cycles-frontend   #   80.10% frontend cycles idle     ( +-  0.86% )
   362,044,374,737 stalled-cycles-backend    #   69.18% backend  cycles idle     ( +-  0.96% )
   193,274,857,837 instructions              #    0.37  insns per cycle
                                             #    2.17  stalled cycles per insn  ( +-  0.51% )
    37,657,832,686 branches                  #  179.712 M/sec                    ( +-  0.42% )
     1,061,005,300 branch-misses             #    2.82% of all branches          ( +-  0.86% )

     269.410275674 seconds time elapsed                                          ( +-  0.06% )

Over the same elapsed time (269.410275674 seconds), IPC is 0.37 with Os versus 0.36 with O2, so Os improved performance by 2.7%.
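
For readers who want to re-derive the IPC figures, here is a minimal sketch
using the instruction and cycle counts from the perf output above. The helper
names `ipc` and `relative_gain` are illustrative, not part of perf or the
patch. The quoted percentages appear to be computed from the two-decimal IPC
values that perf prints; the raw counters give a slightly different ratio.

```python
def ipc(instructions: int, cycles: int) -> float:
    """Instructions retired per cycle."""
    return instructions / cycles


def relative_gain(new: float, old: float) -> float:
    """Relative improvement of new over old, in percent."""
    return (new / old - 1.0) * 100.0


# (instructions, cycles) copied from the perf stat output in this message.
RUNS = {
    "netperf": {
        "O2": (8_122_481_914, 13_166_254_384),
        "Os": (8_554_202_795, 13_142_706_297),
    },
    "volano": {
        "O2": (187_662_577_544, 525_627_073_405),
        "Os": (193_274_857_837, 523_366_588_452),
    },
}

if __name__ == "__main__":
    for name, run in RUNS.items():
        ipc_o2 = ipc(*run["O2"])
        ipc_os = ipc(*run["Os"])
        # Gain from IPC rounded to two decimals, as perf prints it.
        rounded = relative_gain(round(ipc_os, 2), round(ipc_o2, 2))
        # Gain from the unrounded counters.
        raw = relative_gain(ipc_os, ipc_o2)
        print(f"{name}: O2 IPC {ipc_o2:.2f}, Os IPC {ipc_os:.2f}, "
              f"gain {rounded:.2f}% (rounded IPC), {raw:.2f}% (raw counters)")
```

Because both builds ran for the same wall-clock time, a higher IPC at a
near-identical cycle count means more instructions retired in that time, which
is how the message uses IPC as a proxy for throughput.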

Signed-off-by: Ma Ling <ling.ml@alipay.com>
---
In this version we also append CONFIG_CC_OPTIMIZE_FOR_SIZE for i386.

 arch/x86/configs/i386_defconfig   |    1 +
 arch/x86/configs/x86_64_defconfig |    1 +
 2 files changed, 2 insertions(+), 0 deletions(-)

diff --git a/arch/x86/configs/i386_defconfig b/arch/x86/configs/i386_defconfig
index 119db67..e59f1d1 100644
--- a/arch/x86/configs/i386_defconfig
+++ b/arch/x86/configs/i386_defconfig
@@ -299,6 +299,7 @@ CONFIG_DEBUG_KERNEL=y
 CONFIG_SCHEDSTATS=y
 CONFIG_TIMER_STATS=y
 CONFIG_DEBUG_STACK_USAGE=y
+CONFIG_CC_OPTIMIZE_FOR_SIZE=y
 CONFIG_SYSCTL_SYSCALL_CHECK=y
 CONFIG_BLK_DEV_IO_TRACE=y
 CONFIG_PROVIDE_OHCI1394_DMA_INIT=y
diff --git a/arch/x86/configs/x86_64_defconfig b/arch/x86/configs/x86_64_defconfig
index 76eb290..cee2348 100644
--- a/arch/x86/configs/x86_64_defconfig
+++ b/arch/x86/configs/x86_64_defconfig
@@ -298,6 +298,7 @@ CONFIG_DEBUG_KERNEL=y
 CONFIG_SCHEDSTATS=y
 CONFIG_TIMER_STATS=y
 CONFIG_DEBUG_STACK_USAGE=y
+CONFIG_CC_OPTIMIZE_FOR_SIZE=y
 CONFIG_SYSCTL_SYSCALL_CHECK=y
 CONFIG_BLK_DEV_IO_TRACE=y
 CONFIG_PROVIDE_OHCI1394_DMA_INIT=y
-- 
1.7.1
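
The patch only adds one line to each defconfig; according to its Kconfig help
text, CONFIG_CC_OPTIMIZE_FOR_SIZE makes the build pass -Os instead of -O2 to
gcc. As a small sketch, assuming a config file in the usual `NAME=value`
format, the following checks whether an option is set to `y`. The function
name `option_enabled` and the sample text (including the made-up
`CONFIG_EXAMPLE_DISABLED` line) are hypothetical.

```python
def option_enabled(config_text: str, option: str) -> bool:
    """Return True if `option` is set to 'y' in Kconfig-style text."""
    for line in config_text.splitlines():
        line = line.strip()
        # Skip comments such as "# CONFIG_FOO is not set" and blank lines.
        if line.startswith("#") or "=" not in line:
            continue
        name, value = line.split("=", 1)
        if name == option:
            return value == "y"
    return False


# Hypothetical sample; the first two lines mirror the defconfig hunk above.
SAMPLE = """\
CONFIG_DEBUG_STACK_USAGE=y
CONFIG_CC_OPTIMIZE_FOR_SIZE=y
# CONFIG_EXAMPLE_DISABLED is not set
"""

if __name__ == "__main__":
    print(option_enabled(SAMPLE, "CONFIG_CC_OPTIMIZE_FOR_SIZE"))
```

To check a real tree, the same function could be fed the contents of a
generated `.config` read from disk.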


* Re: [PATCH V2] [x86]: Compiler Option Os is better on latest x86
From: Ingo Molnar @ 2013-01-27 13:09 UTC
  To: ling.ma.program; +Cc: mingo, tglx, hpa, linux-kernel, Ma Ling

* ling.ma.program@gmail.com <ling.ma.program@gmail.com> wrote:

> In this version we also append CONFIG_CC_OPTIMIZE_FOR_SIZE for i386.

Were those measurements done on i386 as well? I'd be content 
with keeping the current minimal - and mostly documentary - 
patch that only does this on 64-bit. I just wanted to record the 
current benchmark numbers mostly.

Also see the discussion with Linus - this is by no means an 
unconditionally good option and we'll probably have to wait for 
more GCC action (-Ok or so) before being able to enable size 
optimizations everywhere without unacceptable trade-offs.

Thanks,

	Ingo

* Re: [PATCH V2] [x86]: Compiler Option Os is better on latest x86
From: Ling Ma @ 2013-01-27 13:38 UTC
  To: Ingo Molnar; +Cc: mingo, tglx, hpa, linux-kernel, Ma Ling

Hi Ingo

Thanks for the correction.
Since most 32-bit CPUs are low-end parts with smaller caches, where
I-cache misses matter more, I chose Os for them.
I will test it and send out the results ASAP.

Regards
Ling

