[Suggestion] [x86]: Compiler Option Os is better on latest x86

* [Suggestion] [x86]: Compiler Option Os is better on latest x86
@ 2012-12-26  6:32 ling.ma.program
  2013-01-24 14:17 ` Ingo Molnar
  0 siblings, 1 reply; 7+ messages in thread
From: ling.ma.program @ 2012-12-26  6:32 UTC (permalink / raw)
  To: mingo; +Cc: tglx, hpa, linux-kernel, Ma Ling

From: Ma Ling <ling.ml@alipay.com>

  Currently we use -O2 as the compiler option for better performance,
although it enlarges code size. On modern CPUs, larger instruction
and unified caches and sophisticated instruction prefetch reduce the
cost of instruction cache misses. Meanwhile, flags such as
-falign-functions, -falign-jumps, -falign-loops and -falign-labels
help CPU front-end throughput, because the CPU fetches instructions
in 16-byte aligned code blocks, one block per cycle.
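
To make the 16-byte fetch argument concrete, here is a minimal Python
sketch (not part of the original measurements) that counts how many
aligned fetch blocks a piece of code touches and how much padding an
alignment flag would add. The function names and the 16-byte loop at
offset 0x1c are hypothetical, chosen only for illustration; a real
front end has more constraints than this simple model.

```python
def fetch_blocks(start: int, length: int, block: int = 16) -> int:
    """Count the aligned `block`-byte windows touched by code that
    occupies the byte range [start, start + length)."""
    if length <= 0:
        return 0
    first = start // block
    last = (start + length - 1) // block
    return last - first + 1


def padding_to_align(start: int, align: int = 16) -> int:
    """Bytes of padding needed to move `start` up to a multiple of `align`."""
    return (-start) % align


# Hypothetical example: a 16-byte loop body that starts at offset 0x1c.
loop_start, loop_len = 0x1C, 16

blocks_unaligned = fetch_blocks(loop_start, loop_len)
pad = padding_to_align(loop_start)
blocks_aligned = fetch_blocks(loop_start + pad, loop_len)

print(f"unaligned: {blocks_unaligned} fetch blocks")
print(f"aligned:   {blocks_aligned} fetch block(s), at a cost of {pad} padding bytes")
```

The sketch shows the trade-off in the text: alignment can reduce the
number of fetch blocks per loop iteration, but the padding makes the
code larger.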

  In order to save power and get higher performance, Sandy Bridge
introduced a decoded cache: instructions are kept in it after the
decode stage. When the CPU refetches an instruction, the decoded cache
can provide a 32-byte aligned instruction block instead of 16 bytes
from the I-cache, and the shorter pipeline reduces the branch miss
penalty. This requires that as much hot code as possible fits into
the decoded cache. Sandy Bridge, Ivy Bridge and Haswell all implement
this feature, so -Os (optimize for size) should be better than -O2
on them.

For the above reasons, we compiled Linux kernel 3.6.9 with -O2 and
-Os respectively. The results show that -Os improves netperf
performance by 4.8% and volano by 2.7%, as shown below.

O2 + netperf
Performance counter stats for 'netperf' (3 runs):

       5416.157986 task-clock                #    0.541 CPUs utilized            ( +-  0.19% )
           348,249 context-switches          #    0.064 M/sec                    ( +-  0.17% )
                 0 CPU-migrations            #    0.000 M/sec                    ( +-  0.00% )
               353 page-faults               #    0.000 M/sec                    ( +-  0.16% )
    13,166,254,384 cycles                    #    2.431 GHz                      ( +-  0.18% )
     8,827,499,807 stalled-cycles-frontend   #   67.05% frontend cycles idle     ( +-  0.29% )
     5,951,234,060 stalled-cycles-backend    #   45.20% backend  cycles idle     ( +-  0.44% )
     8,122,481,914 instructions              #    0.62  insns per cycle
                                             #    1.09  stalled cycles per insn  ( +-  0.17% )
     1,415,864,138 branches                  #  261.415 M/sec                    ( +-  0.17% )
        16,975,308 branch-misses             #    1.20% of all branches          ( +-  0.61% )

      10.007215371 seconds time elapsed                                          ( +-  0.03% )

Os + netperf

Performance counter stats for 'netperf' (3 runs):

       5395.386704 task-clock                #    0.539 CPUs utilized            ( +-  0.14% )
           345,880 context-switches          #    0.064 M/sec                    ( +-  0.25% )
                 0 CPU-migrations            #    0.000 M/sec                    ( +-  0.00% )
               354 page-faults               #    0.000 M/sec                    ( +-  0.00% )
    13,142,706,297 cycles                    #    2.436 GHz                      ( +-  0.23% )
     8,379,382,641 stalled-cycles-frontend   #   63.76% frontend cycles idle     ( +-  0.50% )
     5,513,722,219 stalled-cycles-backend    #   41.95% backend  cycles idle     ( +-  0.71% )
     8,554,202,795 instructions              #    0.65  insns per cycle
                                             #    0.98  stalled cycles per insn  ( +-  0.25% )
     1,530,020,505 branches                  #  283.579 M/sec                    ( +-  0.25% )
        17,710,406 branch-misses             #    1.16% of all branches          ( +-  1.00% )

      10.004859867 seconds time elapsed               

Over the same elapsed time (10.004859867 seconds), IPC is 0.65 with -Os and 0.62 with -O2, so -Os improved performance by 4.8%.
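
The IPC figures can be recomputed from the raw counters in the two
netperf runs. The following is a small Python sketch, not part of the
original measurement scripts; the helper name ipc is made up for
illustration. Note that the 4.8% figure comes from the rounded IPC
values (0.65 / 0.62); using the unrounded counter values gives a
somewhat larger ratio, a little above 5%.

```python
def ipc(instructions: int, cycles: int) -> float:
    """Instructions per cycle from two raw perf counter values."""
    return instructions / cycles


# Counter values copied from the netperf runs on kernel 3.6.9.
o2 = ipc(8_122_481_914, 13_166_254_384)   # -O2 kernel
os_ = ipc(8_554_202_795, 13_142_706_297)  # -Os kernel

gain_rounded = (round(os_, 2) / round(o2, 2) - 1) * 100
gain_raw = (os_ / o2 - 1) * 100

print(f"IPC with -O2: {o2:.2f}, IPC with -Os: {os_:.2f}")
print(f"gain from rounded IPC values: {gain_rounded:.1f}%")
print(f"gain from raw counters:       {gain_raw:.1f}%")
```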

O2 + volano
Performance counter stats for './loopclient.sh openjdk' (3 runs):

     210627.115313 task-clock                #    0.781 CPUs utilized            ( +-  0.92% )
        13,812,610 context-switches          #    0.066 M/sec                    ( +-  0.17% )
         2,352,755 CPU-migrations            #    0.011 M/sec                    ( +-  0.84% )
           208,333 page-faults               #    0.001 M/sec                    ( +-  1.58% )
   525,627,073,405 cycles                    #    2.496 GHz                      ( +-  0.96% )
   428,177,571,365 stalled-cycles-frontend   #   81.46% frontend cycles idle     ( +-  1.09% )
   370,885,224,739 stalled-cycles-backend    #   70.56% backend  cycles idle     ( +-  1.18% )
   187,662,577,544 instructions              #    0.36  insns per cycle
                                             #    2.28  stalled cycles per insn  ( +-  0.31% )
    35,684,976,425 branches                  #  169.423 M/sec                    ( +-  0.45% )
     1,062,086,942 branch-misses             #    2.98% of all branches          ( +-  0.08% )

     269.764578435 seconds time elapsed    
         
Os + volano
Performance counter stats for './loopclient.sh openjdk' (3 runs):

     209545.786941 task-clock                #    0.778 CPUs utilized            ( +-  0.66% )
        13,864,142 context-switches          #    0.066 M/sec                    ( +-  0.29% )
         2,326,826 CPU-migrations            #    0.011 M/sec                    ( +-  0.83% )
           205,575 page-faults               #    0.001 M/sec                    ( +-  2.63% )
   523,366,588,452 cycles                    #    2.498 GHz                      ( +-  0.75% )
   419,200,472,430 stalled-cycles-frontend   #   80.10% frontend cycles idle     ( +-  0.86% )
   362,044,374,737 stalled-cycles-backend    #   69.18% backend  cycles idle     ( +-  0.96% )
   193,274,857,837 instructions              #    0.37  insns per cycle
                                             #    2.17  stalled cycles per insn  ( +-  0.51% )
    37,657,832,686 branches                  #  179.712 M/sec                    ( +-  0.42% )
     1,061,005,300 branch-misses             #    2.82% of all branches          ( +-  0.86% )

     269.410275674 seconds time elapsed                                          ( +-  0.06% )

Over the same elapsed time (269.410275674 seconds), IPC is 0.37 with -Os and 0.36 with -O2, so -Os improved performance by 2.7%.
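
For anyone who wants to recheck the derived columns, here is a hedged
Python sketch that parses counter lines in the text format shown above
and recomputes IPC and the front-end stall percentage. The regular
expression and the function name parse_counters are assumptions made
for this example and only handle the integer counter lines; the
task-clock and elapsed-time lines are skipped.

```python
import re

# Matches lines such as "   523,366,588,452 cycles   # ..." (integer counters only).
COUNTER_RE = re.compile(r"^\s*([\d,]+)\s+([A-Za-z][\w-]*)")


def parse_counters(text: str) -> dict:
    """Return {event name: count} for the integer counter lines in `text`."""
    counters = {}
    for line in text.splitlines():
        m = COUNTER_RE.match(line)
        if m:
            counters[m.group(2)] = int(m.group(1).replace(",", ""))
    return counters


# A few lines copied from the "Os + volano" run.
sample = """
     209545.786941 task-clock                #    0.778 CPUs utilized
   523,366,588,452 cycles                    #    2.498 GHz
   419,200,472,430 stalled-cycles-frontend   #   80.10% frontend cycles idle
   193,274,857,837 instructions              #    0.37  insns per cycle
"""

c = parse_counters(sample)
volano_ipc = c["instructions"] / c["cycles"]
frontend_idle = 100 * c["stalled-cycles-frontend"] / c["cycles"]

print(f"IPC: {volano_ipc:.2f}")
print(f"frontend cycles idle: {frontend_idle:.2f}%")
```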

So our initial conclusion is that -Os is better than -O2 for current and
upcoming x86 CPUs. If I am wrong, please correct me.

Thanks
Ling


* Re: [Suggestion] [x86]: Compiler Option Os is better on latest x86
  2012-12-26  6:32 [Suggestion] [x86]: Compiler Option Os is better on latest x86 ling.ma.program
@ 2013-01-24 14:17 ` Ingo Molnar
  2013-01-24 14:46   ` Borislav Petkov
  0 siblings, 1 reply; 7+ messages in thread
From: Ingo Molnar @ 2013-01-24 14:17 UTC (permalink / raw)
  To: ling.ma.program; +Cc: mingo, tglx, hpa, linux-kernel, Ma Ling

* ling.ma.program@gmail.com <ling.ma.program@gmail.com> wrote:

> So our initial conclusion is that -Os is better than -O2 for current
> and upcoming x86 CPUs. If I am wrong, please correct me.

Did you patch the kernel, or use CONFIG_CC_OPTIMIZE_FOR_SIZE?

(there was no patch in your mail.)

Thanks,

	Ingo


* Re: [Suggestion] [x86]: Compiler Option Os is better on latest x86
  2013-01-24 14:17 ` Ingo Molnar
@ 2013-01-24 14:46   ` Borislav Petkov
  2013-01-24 14:56     ` H. Peter Anvin
  0 siblings, 1 reply; 7+ messages in thread
From: Borislav Petkov @ 2013-01-24 14:46 UTC (permalink / raw)
  To: Ingo Molnar; +Cc: ling.ma.program, mingo, tglx, hpa, linux-kernel, Ma Ling

On Thu, Jan 24, 2013 at 03:17:33PM +0100, Ingo Molnar wrote:
> > So our initial conclusion is that -Os is better than -O2 for current
> > and upcoming x86 CPUs. If I am wrong, please correct me.
> 
> Did you patch the kernel, or use CONFIG_CC_OPTIMIZE_FOR_SIZE?
> 
> (there was no patch in your mail.)

Hmm, I'm not sure about -Os: 3a55fb0d9fe8e2f4594329edd58c5fd6f35a99dd

And 0.01/0.03 IPC improvement doesn't really look too persuasive IMO.
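
One rough way to judge whether such IPC deltas exceed run-to-run noise
is to combine the "+-" percentages that perf stat prints for the
instruction and cycle counts. The Python sketch below is an
illustration only: it assumes the four errors are independent and
combines them in quadrature, which is a simplification (cycles and
instructions within one run are likely correlated), and the function
name ratio_rel_err is made up.

```python
import math


def ratio_rel_err(*rel_errs_percent: float) -> float:
    """Combine relative errors (in percent) in quadrature, assuming independence."""
    return math.sqrt(sum(e * e for e in rel_errs_percent))


# "+-" values for (instructions, cycles) of the O2 run, then of the Os run.
netperf_noise = ratio_rel_err(0.17, 0.18, 0.25, 0.23)
volano_noise = ratio_rel_err(0.31, 0.96, 0.51, 0.75)

print(f"netperf: claimed gain 4.8%, estimated noise {netperf_noise:.2f}%")
print(f"volano:  claimed gain 2.7%, estimated noise {volano_noise:.2f}%")
```

Under these assumptions the netperf delta is well outside the
estimated noise, while the volano delta is only about twice the
estimate, which with three runs per configuration is weak evidence.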

-- 
Regards/Gruss,
    Boris.

Sent from a fat crate under my desk. Formatting is fine.
--


* Re: [Suggestion] [x86]: Compiler Option Os is better on latest x86
  2013-01-24 14:46   ` Borislav Petkov
@ 2013-01-24 14:56     ` H. Peter Anvin
  2013-01-24 15:25       ` Borislav Petkov
  0 siblings, 1 reply; 7+ messages in thread
From: H. Peter Anvin @ 2013-01-24 14:56 UTC (permalink / raw)
  To: Borislav Petkov, Ingo Molnar, ling.ma.program, mingo, tglx,
	linux-kernel, Ma Ling

On 01/24/2013 08:46 AM, Borislav Petkov wrote:
> 
> Hmm, I'm not sure about -Os: 3a55fb0d9fe8e2f4594329edd58c5fd6f35a99dd
> 
> And 0.01/0.03 IPC improvement doesn't really look too persuasive IMO.
> 

There has been occasional talk about a -Ok(ernel) option to gcc, but
that would require someone to go through gcc and figure out which bits
make sense and which don't...

	-hpa



* Re: [Suggestion] [x86]: Compiler Option Os is better on latest x86
  2013-01-24 14:56     ` H. Peter Anvin
@ 2013-01-24 15:25       ` Borislav Petkov
  2013-01-24 15:35         ` H. Peter Anvin
  0 siblings, 1 reply; 7+ messages in thread
From: Borislav Petkov @ 2013-01-24 15:25 UTC (permalink / raw)
  To: H. Peter Anvin
  Cc: Ingo Molnar, ling.ma.program, mingo, tglx, linux-kernel, Ma Ling

On Thu, Jan 24, 2013 at 08:56:26AM -0600, H. Peter Anvin wrote:
> There has been occasional talk about a -Ok(ernel) option to gcc, but
> that would require someone to go through gcc and figure out which bits
> make sense and which don't...

Yep, such an option has great potential for us and, if done right,
would be very cool. It could probably be taught to insert HWEIGHT,
RDRAND, CLFLUSH and whatever other insns we have the inline asm versions
for... and query CPUID before that... uuh, nice stuff.

The alternative would be to fork gcc and call it kcc...

/me runs away pretty quickly without even turning back :-)

-- 
Regards/Gruss,
    Boris.

Sent from a fat crate under my desk. Formatting is fine.
--


* Re: [Suggestion] [x86]: Compiler Option Os is better on latest x86
  2013-01-24 15:25       ` Borislav Petkov
@ 2013-01-24 15:35         ` H. Peter Anvin
  0 siblings, 0 replies; 7+ messages in thread
From: H. Peter Anvin @ 2013-01-24 15:35 UTC (permalink / raw)
  To: Borislav Petkov, Ingo Molnar, ling.ma.program, mingo, tglx,
	linux-kernel, Ma Ling

On 01/24/2013 09:25 AM, Borislav Petkov wrote:
> On Thu, Jan 24, 2013 at 08:56:26AM -0600, H. Peter Anvin wrote:
>> There has been occasional talk about a -Ok(ernel) option to gcc, but
>> that would require someone to go through gcc and figure out which bits
>> make sense and which don't...
> 
> Yep, such an option has great potential for us and, if done right,
> would be very cool. It could probably be taught to insert HWEIGHT,
> RDRAND, CLFLUSH and whatever other insns we have the inline asm versions
> for... and query CPUID before that... uuh, nice stuff.
> 
> The alternative would be to fork gcc and call it kcc...
> 
> /me runs away pretty quickly without even turning back :-)
> 

I don't expect we'll be teaching gcc about the alternatives mechanism.
-Ok was about optimization, basically a "sane -Os".

	-hpa



[parent not found: <1356939140-4113-1-git-send-email-ling.ma@alipay.com>]

* Re: [Suggestion] [x86]: Compiler Option Os is better on latest x86
       [not found] <1356939140-4113-1-git-send-email-ling.ma@alipay.com>
@ 2012-12-31  7:52 ` Ling Ma
  0 siblings, 0 replies; 7+ messages in thread
From: Ling Ma @ 2012-12-31  7:52 UTC (permalink / raw)
  To: mingo; +Cc: hpa, tglx, linux-kernel

[-- Attachment #1: Type: text/plain, Size: 9041 bytes --]

Hi Ingo,

We also double-checked with netperf on an older Nehalem platform, as below:

O2 NHM
 Performance counter stats for 'netperf' (3 runs):

       3779.262214 task-clock                #    0.378 CPUs utilized            ( +-  0.37% )
            47,580 context-switches          #    0.013 M/sec                    ( +-  0.59% )
                 0 cpu-migrations            #    0.000 K/sec
               321 page-faults               #    0.085 K/sec                    ( +-  0.18% )
     8,885,976,365 cycles                    #    2.351 GHz                      ( +-  0.37% )
     4,572,094,199 stalled-cycles-frontend   #   51.45% frontend cycles idle     ( +-  1.27% )
     1,347,935,497 stalled-cycles-backend    #   15.17% backend  cycles idle     ( +-  2.02% )
     6,564,928,770 instructions              #    0.74  insns per cycle
                                             #    0.70  stalled cycles per insn  ( +-  0.33% )
     1,196,254,990 branches                  #  316.531 M/sec                    ( +-  0.33% )
         6,434,145 branch-misses             #    0.54% of all branches          ( +-  0.42% )

      10.009993130 seconds time elapsed                                          ( +-  0.04% )

 87380  16384  16384    10.00    16727.94


Os NHM

 Performance counter stats for 'netperf' (3 runs):

       3793.965782 task-clock                #    0.379 CPUs utilized            ( +-  0.24% )
            59,124 context-switches          #    0.016 M/sec                    ( +-  0.02% )
                 0 cpu-migrations            #    0.000 K/sec
               321 page-faults               #    0.085 K/sec                    ( +-  0.21% )
     8,878,307,926 cycles                    #    2.340 GHz                      ( +-  0.25% )
     4,717,512,228 stalled-cycles-frontend   #   53.14% frontend cycles idle     ( +-  0.56% )
     1,612,028,376 stalled-cycles-backend    #   18.16% backend  cycles idle     ( +-  0.58% )
     6,273,760,790 instructions              #    0.71  insns per cycle
                                             #    0.75  stalled cycles per insn  ( +-  0.02% )
     1,144,007,254 branches                  #  301.533 M/sec                    ( +-  0.02% )
        11,348,742 branch-misses             #    0.99% of all branches          ( +-  0.66% )

      10.006341837 seconds time elapsed                                          ( +-  0.00% )

Over the same elapsed time, IPC is 0.74 with -O2 and 0.71 with -Os, so
on Nehalem -O2 performs about 4% better.

This result confirms our reasoning: -O2 is better than -Os on Nehalem
because Nehalem uses the legacy instruction fetch, where
-falign-functions, -falign-jumps, -falign-loops and -falign-labels
help front-end throughput, while -Os is good for Sandy Bridge because
of its decoded cache.

Any comments are appreciated.

Thanks & best wishes for the coming year!
Ling

[-- Attachment #2: nhm-cpu-info --]
[-- Type: text/plain, Size: 7050 bytes --]

processor	: 0
vendor_id	: GenuineIntel
cpu family	: 6
model		: 44
model name	: Intel(R) Xeon(R) CPU           E5620  @ 2.40GHz
stepping	: 2
microcode	: 0x10
cpu MHz		: 2400.173
cache size	: 12288 KB
physical id	: 0
siblings	: 8
core id		: 0
cpu cores	: 4
apicid		: 0
initial apicid	: 0
fpu		: yes
fpu_exception	: yes
cpuid level	: 11
wp		: yes
flags		: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 cx16 xtpr pdcm pcid dca sse4_1 sse4_2 popcnt aes lahf_lm arat epb dtherm tpr_shadow vnmi flexpriority ept vpid
bogomips	: 4800.34
clflush size	: 64
cache_alignment	: 64
address sizes	: 40 bits physical, 48 bits virtual
power management:

processor	: 1
vendor_id	: GenuineIntel
cpu family	: 6
model		: 44
model name	: Intel(R) Xeon(R) CPU           E5620  @ 2.40GHz
stepping	: 2
microcode	: 0x10
cpu MHz		: 2400.173
cache size	: 12288 KB
physical id	: 0
siblings	: 8
core id		: 1
cpu cores	: 4
apicid		: 2
initial apicid	: 2
fpu		: yes
fpu_exception	: yes
cpuid level	: 11
wp		: yes
flags		: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 cx16 xtpr pdcm pcid dca sse4_1 sse4_2 popcnt aes lahf_lm arat epb dtherm tpr_shadow vnmi flexpriority ept vpid
bogomips	: 4800.34
clflush size	: 64
cache_alignment	: 64
address sizes	: 40 bits physical, 48 bits virtual
power management:

processor	: 2
vendor_id	: GenuineIntel
cpu family	: 6
model		: 44
model name	: Intel(R) Xeon(R) CPU           E5620  @ 2.40GHz
stepping	: 2
microcode	: 0x10
cpu MHz		: 2400.173
cache size	: 12288 KB
physical id	: 0
siblings	: 8
core id		: 9
cpu cores	: 4
apicid		: 18
initial apicid	: 18
fpu		: yes
fpu_exception	: yes
cpuid level	: 11
wp		: yes
flags		: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 cx16 xtpr pdcm pcid dca sse4_1 sse4_2 popcnt aes lahf_lm arat epb dtherm tpr_shadow vnmi flexpriority ept vpid
bogomips	: 4800.34
clflush size	: 64
cache_alignment	: 64
address sizes	: 40 bits physical, 48 bits virtual
power management:

processor	: 3
vendor_id	: GenuineIntel
cpu family	: 6
model		: 44
model name	: Intel(R) Xeon(R) CPU           E5620  @ 2.40GHz
stepping	: 2
microcode	: 0x10
cpu MHz		: 2400.173
cache size	: 12288 KB
physical id	: 0
siblings	: 8
core id		: 10
cpu cores	: 4
apicid		: 20
initial apicid	: 20
fpu		: yes
fpu_exception	: yes
cpuid level	: 11
wp		: yes
flags		: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 cx16 xtpr pdcm pcid dca sse4_1 sse4_2 popcnt aes lahf_lm arat epb dtherm tpr_shadow vnmi flexpriority ept vpid
bogomips	: 4800.34
clflush size	: 64
cache_alignment	: 64
address sizes	: 40 bits physical, 48 bits virtual
power management:

processor	: 4
vendor_id	: GenuineIntel
cpu family	: 6
model		: 44
model name	: Intel(R) Xeon(R) CPU           E5620  @ 2.40GHz
stepping	: 2
microcode	: 0x10
cpu MHz		: 2400.173
cache size	: 12288 KB
physical id	: 0
siblings	: 8
core id		: 0
cpu cores	: 4
apicid		: 1
initial apicid	: 1
fpu		: yes
fpu_exception	: yes
cpuid level	: 11
wp		: yes
flags		: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 cx16 xtpr pdcm pcid dca sse4_1 sse4_2 popcnt aes lahf_lm arat epb dtherm tpr_shadow vnmi flexpriority ept vpid
bogomips	: 4800.34
clflush size	: 64
cache_alignment	: 64
address sizes	: 40 bits physical, 48 bits virtual
power management:

processor	: 5
vendor_id	: GenuineIntel
cpu family	: 6
model		: 44
model name	: Intel(R) Xeon(R) CPU           E5620  @ 2.40GHz
stepping	: 2
microcode	: 0x10
cpu MHz		: 2400.173
cache size	: 12288 KB
physical id	: 0
siblings	: 8
core id		: 1
cpu cores	: 4
apicid		: 3
initial apicid	: 3
fpu		: yes
fpu_exception	: yes
cpuid level	: 11
wp		: yes
flags		: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 cx16 xtpr pdcm pcid dca sse4_1 sse4_2 popcnt aes lahf_lm arat epb dtherm tpr_shadow vnmi flexpriority ept vpid
bogomips	: 4800.34
clflush size	: 64
cache_alignment	: 64
address sizes	: 40 bits physical, 48 bits virtual
power management:

processor	: 6
vendor_id	: GenuineIntel
cpu family	: 6
model		: 44
model name	: Intel(R) Xeon(R) CPU           E5620  @ 2.40GHz
stepping	: 2
microcode	: 0x10
cpu MHz		: 2400.173
cache size	: 12288 KB
physical id	: 0
siblings	: 8
core id		: 9
cpu cores	: 4
apicid		: 19
initial apicid	: 19
fpu		: yes
fpu_exception	: yes
cpuid level	: 11
wp		: yes
flags		: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 cx16 xtpr pdcm pcid dca sse4_1 sse4_2 popcnt aes lahf_lm arat epb dtherm tpr_shadow vnmi flexpriority ept vpid
bogomips	: 4800.34
clflush size	: 64
cache_alignment	: 64
address sizes	: 40 bits physical, 48 bits virtual
power management:

processor	: 7
vendor_id	: GenuineIntel
cpu family	: 6
model		: 44
model name	: Intel(R) Xeon(R) CPU           E5620  @ 2.40GHz
stepping	: 2
microcode	: 0x10
cpu MHz		: 2400.173
cache size	: 12288 KB
physical id	: 0
siblings	: 8
core id		: 10
cpu cores	: 4
apicid		: 21
initial apicid	: 21
fpu		: yes
fpu_exception	: yes
cpuid level	: 11
wp		: yes
flags		: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 cx16 xtpr pdcm pcid dca sse4_1 sse4_2 popcnt aes lahf_lm arat epb dtherm tpr_shadow vnmi flexpriority ept vpid
bogomips	: 4800.34
clflush size	: 64
cache_alignment	: 64
address sizes	: 40 bits physical, 48 bits virtual
power management:

