perf_event: rdpmc self-monitoring overhead issue

linux-perf-users.vger.kernel.org archive mirror
 help / color / mirror / Atom feed

* perf_event: rdpmc self-monitoring overhead issue
@ 2013-08-30 17:55 Vince Weaver
  2013-08-30 18:05 ` Stephane Eranian
  0 siblings, 1 reply; 9+ messages in thread
From: Vince Weaver @ 2013-08-30 17:55 UTC (permalink / raw)
  To: linux-kernel, linux-perf-users, Peter Zijlstra, mingo, acme,
	Stephane Eranian

Hello,

I've finally found time to track down why perf_event/rdpmc self-monitoring 
overhead was so bad.

To summarize, a test which does:

   perf_event_open()
   ioctl(PERF_EVENT_IOC_ENABLE)
   read() /* either via syscall or  the rdpmc code listed in 
             include/uapi/linux/perf_event.h */
   ioctl(PERF_EVENT_IOC_DISABLE)

is done, and the number of cycles for each routine is taken using
rdtsc().

On a Core2 Processor the results look something like this for read:

                              | read time for 1 event
                              | median of 1024 runs
                              |     (cycles)
 -----------------------------|-------------------------
   2.6.32-perfctr (rdpmc)     |       133
   2.6.30-perfmon2            |      1264
   3.10                       |      1482
   3.10 (rdpmc)               |      3062

As you can see, using the userspace-only rdpmc code is twice as slow as 
just using the read() syscall.

I've tracked down the cause of this, and apparently it's due to
the first access to the event's struct perf_event_mmap_page.  If 
outside of the read timing code I do an unrelated read of the mmap() page
to fault it in, then the result is much more believable:

  3.10 (rdpmc)                 |      123

So the question is, why do I have to explicitly in advance fault the
page in?  Is there a way to force this to happen automatically?

The perfctr code as far as I can tell doesn't touch its mmap page in 
advance.
It uses vm_insert_page() to insert the page rather than the
rb tree stuff that perf_event uses.

I know part of this overhead is due to the construction of my benchmark
and in theory would be mitigated if you were doing a large number
of measurements in a program, but at the same time this is also a common
pattern when self-monitoring: putting calipers around one chunk of code
and taking one measurement (often in a timing-critical area where
overhead matters).

Vince

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: perf_event: rdpmc self-monitoring overhead issue
  2013-08-30 17:55 perf_event: rdpmc self-monitoring overhead issue Vince Weaver
@ 2013-08-30 18:05 ` Stephane Eranian
  2013-08-30 18:35   ` Vince Weaver
  2013-09-02  2:50   ` Andi Kleen
  0 siblings, 2 replies; 9+ messages in thread
From: Stephane Eranian @ 2013-08-30 18:05 UTC (permalink / raw)
  To: Vince Weaver
  Cc: LKML, linux-perf-users, Peter Zijlstra, Ingo Molnar,
	Arnaldo Carvalho de Melo

On Fri, Aug 30, 2013 at 7:55 PM, Vince Weaver <vincent.weaver@maine.edu> wrote:
> Hello,
>
> I've finally found time to track down why perf_event/rdpmc self-monitoring
> overhead was so bad.
>
> To summarize, a test which does:
>
>    perf_event_open()
>    ioctl(PERF_EVENT_IOC_ENABLE)
>    read() /* either via syscall or  the rdpmc code listed in
>              include/uapi/linux/perf_event.h */
>    ioctl(PERF_EVENT_IOC_DISABLE)
>
> is done, and the number of cycles for each routine is taken using
> rdtsc().
>
> On a Core2 Processor the results look something like this for read:
>
>                               | read time for 1 event
>                               | median of 1024 runs
>                               |     (cycles)
>  -----------------------------|-------------------------
>    2.6.32-perfctr (rdpmc)     |       133
>    2.6.30-perfmon2            |      1264
>    3.10                       |      1482
>    3.10 (rdpmc)               |      3062
>
> As you can see, using the userspace-only rdpmc code is twice as slow as
> just using the read() syscall.
>
>
> I've tracked down the cause of this, and apparently it's due to
> the first access to the event's struct perf_event_mmap_page.  If
> outside of the read timing code I do an unrelated read of the mmap() page
> to fault it in, then the result is much more believable:
>
>   3.10 (rdpmc)                 |      123
>
You mean that the high cost in your first example comes from the fact
that you are averaging over all the iterations and not n-1 (where 1 is
the first). I don't see a flag in mmap() to fault it in immediately. But
why not document, that programs should touch the page once before
starting any timing measurements.

> So the question is, why do I have to explicitly in advance fault the
> page in?  Is there a way to force this to happen automatically?
>
> The perfctr code as far as I can tell doesn't touch its mmap page in
> advance.
> It uses vm_insert_page() to insert the page rather than the
> rb tree stuff that perf_event uses.
>
> I know part of this overhead is due to the construction of my benchmark
> and in theory would be mitigated if you were doing a large number
> of measurements in a program, but at the same time this is also a common
> pattern when self-monitoring: putting calipers around one chunk of code
> and taking one measurement (often in a timing-critical area where
> overhead matters).
>
> Vince

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: perf_event: rdpmc self-monitoring overhead issue
  2013-08-30 18:05 ` Stephane Eranian
@ 2013-08-30 18:35   ` Vince Weaver
  2013-08-31 10:06     ` Mikael Pettersson
  2013-09-02  2:50   ` Andi Kleen
  1 sibling, 1 reply; 9+ messages in thread
From: Vince Weaver @ 2013-08-30 18:35 UTC (permalink / raw)
  To: eranian
  Cc: LKML, linux-perf-users, Peter Zijlstra, Ingo Molnar,
	Arnaldo Carvalho de Melo

On Fri, 30 Aug 2013, Stephane Eranian wrote:
> >
> You mean that the high cost in your first example comes from the fact
> that you are averaging over all the iterations and not n-1 (where 1 is
> the first). I don't see a flag in mmap() to fault it in immediately. But
> why not document, that programs should touch the page once before
> starting any timing measurements.

I was just curious why perfctr didn't need to do this, but possibly
I'm just missing the mmap page being touched.  Though the code is pretty
small and I'm not seeing any such access.

With perf_event you have one mmap page per event so touching all the pages
starts racking up some overhead if you have multiple events, but yes 
probably lost in the noise if you do it at open() rather than at read().

It's also hard to have a dummy read of a page w/o the compiler
optimizing it away...

Vince Weaver
vincent.weaver@maine.edu
http://www.eece.maine.edu/~vweaver/

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: perf_event: rdpmc self-monitoring overhead issue
  2013-08-30 18:35   ` Vince Weaver
@ 2013-08-31 10:06     ` Mikael Pettersson
  0 siblings, 0 replies; 9+ messages in thread
From: Mikael Pettersson @ 2013-08-31 10:06 UTC (permalink / raw)
  To: Vince Weaver
  Cc: eranian, LKML, linux-perf-users, Peter Zijlstra, Ingo Molnar,
	Arnaldo Carvalho de Melo

Vince Weaver writes:
 > On Fri, 30 Aug 2013, Stephane Eranian wrote:
 > > >
 > > You mean that the high cost in your first example comes from the fact
 > > that you are averaging over all the iterations and not n-1 (where 1 is
 > > the first). I don't see a flag in mmap() to fault it in immediately. But
 > > why not document, that programs should touch the page once before
 > > starting any timing measurements.
 > 
 > I was just curious why perfctr didn't need to do this, but possibly
 > I'm just missing the mmap page being touched.  Though the code is pretty
 > small and I'm not seeing any such access.

perfctr's ->mmap() actively inserts the counters page into the calling
process' mm, so it's available immediately.  The other model is to wait
for the first page fault on that page before inserting it; apparently
the other performance counter frameworks do that instead.

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: perf_event: rdpmc self-monitoring overhead issue
  2013-08-30 18:05 ` Stephane Eranian
  2013-08-30 18:35   ` Vince Weaver
@ 2013-09-02  2:50   ` Andi Kleen
  2013-09-02  8:24     ` Stephane Eranian
  1 sibling, 1 reply; 9+ messages in thread
From: Andi Kleen @ 2013-09-02  2:50 UTC (permalink / raw)
  To: eranian
  Cc: Vince Weaver, LKML, linux-perf-users, Peter Zijlstra, Ingo Molnar,
	Arnaldo Carvalho de Melo

Stephane Eranian <eranian@googlemail.com> writes:

> I don't see a flag in mmap() to fault it in immediately.

MAP_PRESENT

-Andi

-- 
ak@linux.intel.com -- Speaking for myself only

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: perf_event: rdpmc self-monitoring overhead issue
  2013-09-02  2:50   ` Andi Kleen
@ 2013-09-02  8:24     ` Stephane Eranian
  2013-09-02  9:23       ` Andi Kleen
  2013-09-02 13:15       ` Vince Weaver
  0 siblings, 2 replies; 9+ messages in thread
From: Stephane Eranian @ 2013-09-02  8:24 UTC (permalink / raw)
  To: Andi Kleen
  Cc: Vince Weaver, LKML, linux-perf-users, Peter Zijlstra, Ingo Molnar,
	Arnaldo Carvalho de Melo

On Mon, Sep 2, 2013 at 4:50 AM, Andi Kleen <andi@firstfloor.org> wrote:
> Stephane Eranian <eranian@googlemail.com> writes:
>
>> I don't see a flag in mmap() to fault it in immediately.
>
> MAP_PRESENT
>
I could not find this constant defined anywhere in the kernel source tree
nor in /usr/include. Are you sure of the name?

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: perf_event: rdpmc self-monitoring overhead issue
  2013-09-02  8:24     ` Stephane Eranian
@ 2013-09-02  9:23       ` Andi Kleen
  2013-09-02 13:15       ` Vince Weaver
  1 sibling, 0 replies; 9+ messages in thread
From: Andi Kleen @ 2013-09-02  9:23 UTC (permalink / raw)
  To: eranian
  Cc: Andi Kleen, Vince Weaver, LKML, linux-perf-users, Peter Zijlstra,
	Ingo Molnar, Arnaldo Carvalho de Melo

On Mon, Sep 02, 2013 at 10:24:13AM +0200, Stephane Eranian wrote:
> On Mon, Sep 2, 2013 at 4:50 AM, Andi Kleen <andi@firstfloor.org> wrote:
> > Stephane Eranian <eranian@googlemail.com> writes:
> >
> >> I don't see a flag in mmap() to fault it in immediately.
> >
> > MAP_PRESENT
> >
> I could not find this constant defined anywhere in the kernel source tree
> nor in /usr/include. Are you sure of the name?

MAP_POPULATE sorry

-Andi

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: perf_event: rdpmc self-monitoring overhead issue
  2013-09-02  8:24     ` Stephane Eranian
  2013-09-02  9:23       ` Andi Kleen
@ 2013-09-02 13:15       ` Vince Weaver
  2013-09-02 17:26         ` Andi Kleen
  1 sibling, 1 reply; 9+ messages in thread
From: Vince Weaver @ 2013-09-02 13:15 UTC (permalink / raw)
  To: eranian
  Cc: Andi Kleen, Vince Weaver, LKML, linux-perf-users, Peter Zijlstra,
	Ingo Molnar, Arnaldo Carvalho de Melo

On Mon, 2 Sep 2013, Stephane Eranian wrote:

> On Mon, Sep 2, 2013 at 4:50 AM, Andi Kleen <andi@firstfloor.org> wrote:
> > Stephane Eranian <eranian@googlemail.com> writes:
> >
> >> I don't see a flag in mmap() to fault it in immediately.
> >
> > MAP_PRESENT
> >
> I could not find this constant defined anywhere in the kernel source tree
> nor in /usr/include. Are you sure of the name?

I assume he means MAP_POPULATE

which does improve things, from ~3000 cycles to ~219 cycles but that's 
still more overhead than the ~130 or so you get by manually touching the 
page first.

Vince

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: perf_event: rdpmc self-monitoring overhead issue
  2013-09-02 13:15       ` Vince Weaver
@ 2013-09-02 17:26         ` Andi Kleen
  0 siblings, 0 replies; 9+ messages in thread
From: Andi Kleen @ 2013-09-02 17:26 UTC (permalink / raw)
  To: Vince Weaver
  Cc: eranian, Andi Kleen, LKML, linux-perf-users, Peter Zijlstra,
	Ingo Molnar, Arnaldo Carvalho de Melo

> I assume he means MAP_POPULATE

Yes.

> 
> which does improve things, from ~3000 cycles to ~219 cycles but that's 
> still more overhead than the ~130 or so you get by manually touching the 
> page first.

That seems odd. It should be the same.

Can you do a trace-cmd function trace and compare the two cases?

trace-cmd record -p function_graph ...
trace-cmd report

(as usual for tracing perf remove the useless -pg removal for perf in
kernel/events/Makefile and arch/x86/kernel/cpu/Makefile first)

-Andi

^ permalink raw reply	[flat|nested] 9+ messages in thread

end of thread, other threads:[~2013-09-02 17:27 UTC | newest]

Thread overview: 9+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2013-08-30 17:55 perf_event: rdpmc self-monitoring overhead issue Vince Weaver
2013-08-30 18:05 ` Stephane Eranian
2013-08-30 18:35   ` Vince Weaver
2013-08-31 10:06     ` Mikael Pettersson
2013-09-02  2:50   ` Andi Kleen
2013-09-02  8:24     ` Stephane Eranian
2013-09-02  9:23       ` Andi Kleen
2013-09-02 13:15       ` Vince Weaver
2013-09-02 17:26         ` Andi Kleen

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).