public inbox for linux-kernel@vger.kernel.org
 help / color / mirror / Atom feed
* Shift by one instruction in the perf annotate output
@ 2012-01-26 23:01 Lénaïc Huard
  2012-01-27  6:50 ` Mike Galbraith
  2012-01-27  8:15 ` Peter Zijlstra
  0 siblings, 2 replies; 7+ messages in thread
From: Lénaïc Huard @ 2012-01-26 23:01 UTC (permalink / raw)
  To: Peter Zijlstra, Paul Mackerras, Ingo Molnar,
	Arnaldo Carvalho de Melo
  Cc: linux-kernel

Hello,

Using perf and zooming up to the annotated assembler, I noticed a “shift by 
one” issue between the assembler instructions and the reported consumed cycles 
in the left margin.
Each time a given instruction is expensive, the high event count appears on 
the line right after.
In order to validate that feeling, I tried to put an expensive instruction 
(division) inside a block of cheaper instructions (no operation). And here is 
the output of perf annotate:

    0,00 :          40041d:       nop
    0,00 :          40041e:       nop
    5,31 :          40041f:       nop                                                                                                                                               
    0,00 :          400420:       divl   (%rsp)
   84,18 :          400423:       nop                                                                                                                                               
    0,00 :          400424:       nop
    0,07 :          400425:       nop

I would have expected the 84,18 count to be in front of the “divl” instruction 
and not in front of the “nop” one.

I am running Linux and perf 3.2 but I remember that previous versions suffered 
from the same issue.

I don’t know if it could be specific to my cpu:
processor       : 0
vendor_id       : GenuineIntel
cpu family      : 6
model           : 15
model name      : Intel(R) Core(TM)2 CPU          6600  @ 2.40GHz
stepping        : 6
microcode       : 0xd0
cpu MHz         : 2394.000
cache size      : 4096 KB
physical id     : 0
siblings        : 2
core id         : 0
cpu cores       : 2
apicid          : 0
initial apicid  : 0
fpu             : yes
fpu_exception   : yes
cpuid level     : 10
wp              : yes
flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca 
cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx lm 
constant_tsc arch_perfmon pebs bts rep_good nopl aperfmperf pni dtes64 monitor 
ds_cpl vmx est tm2 ssse3 cx16 xtpr pdcm lahf_lm dts tpr_shadow
bogomips        : 4807.89
clflush size    : 64
cache_alignment : 64
address sizes   : 36 bits physical, 48 bits virtual
power management:

This issue is not specific to the perf tool itself since I observe exactly the 
same thing with OProfile.
Here is the output of opannotate for the same piece of code:

               :  40041d:       nop
               :  40041e:       nop
   971  0.5636 :  40041f:       nop
     1 5.8e-04 :  400420:       divl   (%rsp)
165045 95.7904 :  400423:       nop
     4  0.0023 :  400424:       nop
    18  0.0104 :  400425:       nop

Once again, the high number of samples should have been collected in “divl” 
and not in “nop”.


Here is the small program I used to observe that phenomenon (x86(-64) 
specific):

#include <stdlib.h>
#include <stdio.h>

void divide( int  dividend,
             int  divisor,
             int *quotient,
             int *remainder )
{
  for( unsigned i = 0 ; i < 1000000000 ; i++ )
  __asm__ __volatile__
    ( "xor %%edx,%%edx;"
      "nop;"
      "nop;"
      "nop;"
      "divl %[divisor];"
      "nop;"
      "nop;"
      "nop;"
      : "=a" (*quotient), "=&d" (*remainder)
      : "a" (dividend), [divisor] "m" (divisor)
      );
}

int main()
{
  int quotient, remainder;
  divide( 1234, 100, &quotient, &remainder );
  printf( "%d = %d x %d + %d\n", 1234, 100, quotient, remainder );
  return EXIT_SUCCESS;
}

$ gcc -pipe -o div div.c -std=c99 -O3
$ perf record -g ./div
$ perf report -g
$ perf annotate main # -O3 made divide() inlined

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: Shift by one instruction in the perf annotate output
  2012-01-26 23:01 Shift by one instruction in the perf annotate output Lénaïc Huard
@ 2012-01-27  6:50 ` Mike Galbraith
  2012-01-27  8:15 ` Peter Zijlstra
  1 sibling, 0 replies; 7+ messages in thread
From: Mike Galbraith @ 2012-01-27  6:50 UTC (permalink / raw)
  To: Lénaïc Huard
  Cc: Peter Zijlstra, Paul Mackerras, Ingo Molnar,
	Arnaldo Carvalho de Melo, linux-kernel

On Fri, 2012-01-27 at 00:01 +0100, Lénaïc Huard wrote:
> Hello,
> 
> Using perf and zooming up to the annotated assembler, I noticed a “shift by 
> one” issue between the assembler instructions and the reported consumed cycles 
> in the left margin.
> Each time a given instruction is expensive, the high event count appears on 
> the line right after.
> In order to validate that feeling, I tried to put an expensive instruction 
> (division) inside a block of cheaper instructions (no operation). And here is 
> the output of perf annotate:
> 
>     0,00 :          40041d:       nop
>     0,00 :          40041e:       nop
>     5,31 :          40041f:       nop                                                                                                                                               
>     0,00 :          400420:       divl   (%rsp)
>    84,18 :          400423:       nop                                                                                                                                               
>     0,00 :          400424:       nop
>     0,07 :          400425:       nop
> 
> I would have expected the 84,18 count to be in front of the “divl” instruction 
> and not in front of the “nop” one.
> 
> I am running Linux and perf 3.2 but I remember that previous versions suffered 
> from the same issue.

Yeah, it's been that way since day 1 here, you have to squint a little.
It's not always off, it gets lock btr right, but lock xadd wrong.  Same
CPU as yours.  I see the same with cache-misses too, but since the right
instruction is right next door, it's always been easy to spot the real
culprit, so I just squint and ignore it, figuring it's a hw detail.

Showing cycles for __schedule
  Events  Pcnt (>=5%)
 Percent |      Source code & Disassembly of vmlinux
------------------------------------------------
         :              next = pick_next_task(rq);
         :              clear_tsk_need_resched(prev);
         :
         :              if (likely(prev != next)) {
    7.68 :      ffffffff8139efca:       mov    -0xf0(%rbp),%rcx
         :
         :      static inline void load_cr3(pgd_t *pgdir)
         :      {
         :              write_cr3(__pa(pgdir));
    6.87 :      ffffffff8139f061:       mov    0x48(%r15),%rdi
         :                              : CONST_MASK_ADDR(nr, addr)
         :                              : "iq" ((u8)~CONST_MASK(nr)));
         :              } else {
         :                      asm volatile(LOCK_PREFIX "btr %1,%0"
   38.27 :      ffffffff8139f06d:       lock btr %r14d,0x300(%r12)
         :                      /*
         :                       * load the LDT, if the LDT is different:
         :                       */
         :                      if (unlikely(prev->context.ldt != next->context.ldt))
   12.53 :      ffffffff8139f077:       mov    0x308(%r15),%r13
    0.40 :      ffffffff8139f0df:       jne    ffffffff813a7c60 <ret_from_fork>
    0.00 :      ffffffff8139f0e5:       mov    %rbp,%rsi
    0.00 :      ffffffff8139f0e8:       pop    %rbp
    0.00 :      ffffffff8139f0e9:       popfq  
    5.12 :      ffffffff8139f0ea:       mov    %rax,-0xe8(%rbp)

Showing cycles for _raw_spin_lock_irqsave
  Events  Pcnt (>=5%)
 Percent |      Source code & Disassembly of vmlinux
------------------------------------------------
         :               * documented behavior of the "pop" instruction.
         :               */
         :              asm volatile("# __raw_save_flags\n\t"
    0.00 :      ffffffff813a0b1f:       pushfq 
   15.38 :      ffffffff813a0b20:       pop    %r12
         :              unsigned long flags;
         :
         :              local_irq_save(flags);
         :              preempt_disable();
   16.24 :      ffffffff813a0b23:       mov    $0x1,%edi
         :              register struct __raw_tickets inc = { .tail = 1 };
         :
         :              inc = xadd(&lock->tickets, inc);
    0.00 :      ffffffff813a0b32:       lock xadd %ax,(%rbx)
   46.15 :      ffffffff813a0b37:       movzbl %ah,%edx
         :              return __raw_spin_lock_irqsave(lock);
         :      }
    0.85 :      ffffffff813a0b49:       mov    %r12,%rax
    0.00 :      ffffffff813a0b4c:       pop    %rbx
   17.09 :      ffffffff813a0b4d:       pop    %r12



^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: Shift by one instruction in the perf annotate output
  2012-01-26 23:01 Shift by one instruction in the perf annotate output Lénaïc Huard
  2012-01-27  6:50 ` Mike Galbraith
@ 2012-01-27  8:15 ` Peter Zijlstra
  2012-01-27 10:27   ` Ingo Molnar
  1 sibling, 1 reply; 7+ messages in thread
From: Peter Zijlstra @ 2012-01-27  8:15 UTC (permalink / raw)
  To: Lénaïc Huard
  Cc: Paul Mackerras, Ingo Molnar, Arnaldo Carvalho de Melo,
	linux-kernel

On Fri, 2012-01-27 at 00:01 +0100, Lénaïc Huard wrote:
> Using perf and zooming up to the annotated assembler, I noticed a “shift by 
> one” issue between the assembler instructions and the reported consumed cycles 
> in the left margin.
> Each time a given instruction is expensive, the high event count appears on 
> the line right after.
> In order to validate that feeling, I tried to put an expensive instruction 
> (division) inside a block of cheaper instructions (no operation). And here is 
> the output of perf annotate:
> 
>     0,00 :          40041d:       nop
>     0,00 :          40041e:       nop
>     5,31 :          40041f:       nop                                                                                                                                               
>     0,00 :          400420:       divl   (%rsp)
>    84,18 :          400423:       nop                                                                                                                                               
>     0,00 :          400424:       nop
>     0,07 :          400425:       nop
> 
> I would have expected the 84,18 count to be in front of the “divl” instruction 
> and not in front of the “nop” one.

That's how it works on x86 (and a lot of other architectures), the divl
raises the interrupt line but the CPU can only assert the interrupt once
its completed the instruction its working on (sometimes more than one)
and you get a drift between when the interrupt was raised and when it
gets handled, its called skid and is a well known issue.

> I am running Linux and perf 3.2 but I remember that previous versions suffered 
> from the same issue.
> 
> I don’t know if it could be specific to my cpu:
> processor       : 0
> vendor_id       : GenuineIntel
> cpu family      : 6
> model           : 15
> model name      : Intel(R) Core(TM)2 CPU          6600  @ 2.40GHz 

And sadly its the best you'll get on your machine, most Intel chips
after that (including the core2 shrink, but excluding the latest core i7
SNB) can do better using a feature called PEBS.



^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: Shift by one instruction in the perf annotate output
  2012-01-27  8:15 ` Peter Zijlstra
@ 2012-01-27 10:27   ` Ingo Molnar
  2012-01-27 10:40     ` Peter Zijlstra
  0 siblings, 1 reply; 7+ messages in thread
From: Ingo Molnar @ 2012-01-27 10:27 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Lénaïc Huard, Paul Mackerras, Arnaldo Carvalho de Melo,
	linux-kernel


* Peter Zijlstra <a.p.zijlstra@chello.nl> wrote:

> > I am running Linux and perf 3.2 but I remember that previous 
> > versions suffered from the same issue.
> > 
> > I don’t know if it could be specific to my cpu:
> > processor       : 0
> > vendor_id       : GenuineIntel
> > cpu family      : 6
> > model           : 15
> > model name      : Intel(R) Core(TM)2 CPU          6600  @ 2.40GHz 
> 
> And sadly its the best you'll get on your machine, most Intel 
> chips after that (including the core2 shrink, but excluding 
> the latest core i7 SNB) can do better using a feature called 
> PEBS.

Which can be activated on those CPUs using the '-e cycles:pp' 
option (the first 'p' stands for 'precise', the second 'p' for 
'very precise' ;-).

In that case some rather non-obvious perf magic is activated (we 
use PEBS for precise samples and use the LBR hardware to rewind 
the IP), due to which annotation output looks like this:

         :        ffffffff810a6f51 <do_raw_spin_lock>:                                            ▒
    1.77 :        ffffffff810a6f51:       mov    $0x10000,%eax                                    ▒
   44.95 :        ffffffff810a6f56:       lock xadd %eax,(%rdi)                                   ▒
    1.25 :        ffffffff810a6f5a:       mov    %eax,%edx                                        ▒
    0.29 :        ffffffff810a6f5c:       shr    $0x10,%edx                                       ▒
    1.21 :        ffffffff810a6f5f:       cmp    %dx,%ax                                          ▒
    0.01 :        ffffffff810a6f62:       je     ffffffff810a6f6b <do_raw_spin_lock+0x1a>         ▒
   29.81 :        ffffffff810a6f64:       pause                                                   ▒
   16.45 :        ffffffff810a6f66:       mov    (%rdi),%ax                                       ▒
    4.27 :        ffffffff810a6f69:       jmp    ffffffff810a6f5f <do_raw_spin_lock+0xe>          ▒
    0.00 :        ffffffff810a6f6b:       retq                                                    ▒

the entries are both precise and show up in the right place.

On Core2 CPUs there's PEBS so 'p' will work, but there's no LBR 
so the IP-rewinding does not work.

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: Shift by one instruction in the perf annotate output
  2012-01-27 10:27   ` Ingo Molnar
@ 2012-01-27 10:40     ` Peter Zijlstra
  2012-01-27 10:44       ` Ingo Molnar
  0 siblings, 1 reply; 7+ messages in thread
From: Peter Zijlstra @ 2012-01-27 10:40 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Lénaïc Huard, Paul Mackerras, Arnaldo Carvalho de Melo,
	linux-kernel

On Fri, 2012-01-27 at 11:27 +0100, Ingo Molnar wrote:
> On Core2 CPUs there's PEBS so 'p' will work, but there's no LBR 
> so the IP-rewinding does not work. 

Not on his thing.. PEBS is a trainwreck on model 15.

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: Shift by one instruction in the perf annotate output
  2012-01-27 10:40     ` Peter Zijlstra
@ 2012-01-27 10:44       ` Ingo Molnar
  2012-01-27 10:49         ` Peter Zijlstra
  0 siblings, 1 reply; 7+ messages in thread
From: Ingo Molnar @ 2012-01-27 10:44 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Lénaïc Huard, Paul Mackerras, Arnaldo Carvalho de Melo,
	linux-kernel


* Peter Zijlstra <a.p.zijlstra@chello.nl> wrote:

> On Fri, 2012-01-27 at 11:27 +0100, Ingo Molnar wrote:
>
> > On Core2 CPUs there's PEBS so 'p' will work, but there's no 
> > LBR so the IP-rewinding does not work.
> 
> Not on his thing.. PEBS is a trainwreck on model 15.

ok - i have a model 23, there it works:

[    0.044260] CPU0: Intel(R) Core(TM)2 Duo CPU     T9600  @ 2.80GHz stepping 06
[    0.044996] Performance Events: PEBS fmt0+, Core2 events, Intel PMU driver.

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: Shift by one instruction in the perf annotate output
  2012-01-27 10:44       ` Ingo Molnar
@ 2012-01-27 10:49         ` Peter Zijlstra
  0 siblings, 0 replies; 7+ messages in thread
From: Peter Zijlstra @ 2012-01-27 10:49 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Lénaïc Huard, Paul Mackerras, Arnaldo Carvalho de Melo,
	linux-kernel

On Fri, 2012-01-27 at 11:44 +0100, Ingo Molnar wrote:
> * Peter Zijlstra <a.p.zijlstra@chello.nl> wrote:
> 
> > On Fri, 2012-01-27 at 11:27 +0100, Ingo Molnar wrote:
> >
> > > On Core2 CPUs there's PEBS so 'p' will work, but there's no 
> > > LBR so the IP-rewinding does not work.
> > 
> > Not on his thing.. PEBS is a trainwreck on model 15.
> 
> ok - i have a model 23, there it works:
> 
> [    0.044260] CPU0: Intel(R) Core(TM)2 Duo CPU     T9600  @ 2.80GHz stepping 06
> [    0.044996] Performance Events: PEBS fmt0+, Core2 events, Intel PMU driver.

Right, only model 15 the very first core2 chips were badly broken, they
fixed it in the shrink (model 22,23,29).

And as you know, something got screwy in SNB (model 42) so that's got
PEBS fully disabled until further notice as well.

^ permalink raw reply	[flat|nested] 7+ messages in thread

end of thread, other threads:[~2012-01-27 10:49 UTC | newest]

Thread overview: 7+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2012-01-26 23:01 Shift by one instruction in the perf annotate output Lénaïc Huard
2012-01-27  6:50 ` Mike Galbraith
2012-01-27  8:15 ` Peter Zijlstra
2012-01-27 10:27   ` Ingo Molnar
2012-01-27 10:40     ` Peter Zijlstra
2012-01-27 10:44       ` Ingo Molnar
2012-01-27 10:49         ` Peter Zijlstra

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox