Re: [PATCH 1/4] tracing: move __DO_TRACE out of line


From: Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca>
To: Jeremy Fitzhardinge <jeremy@goop.org>
Cc: Steven Rostedt <rostedt@goodmis.org>, Ingo Molnar <mingo@elte.hu>,
	Linux Kernel Mailing List <linux-kernel@vger.kernel.org>,
	Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>,
	Christoph Hellwig <hch@lst.de>,
	Andrew Morton <akpm@linux-foundation.org>
Subject: Re: [PATCH 1/4] tracing: move __DO_TRACE out of line
Date: Mon, 20 Apr 2009 17:39:36 -0400
Message-ID: <20090420213936.GA12986@Krystal>
In-Reply-To: <49EBB609.9030407@goop.org>

* Jeremy Fitzhardinge (jeremy@goop.org) wrote:
> Mathieu Desnoyers wrote:
>> Here are the conclusions I gathered from the following tbench tests on the
>> LTTng tree:
>>
>> - Dormant tracepoints, when sprinkled all over the place, have a very small, but
>>   measurable, footprint on kernel stress-test workloads (3 % for the
>>   whole 2.6.30-rc1 LTTng tree).
>>
>> - "Immediate values" help lessen this impact significantly (3 % -> 2.5 %).
>>
>> - Static jump patching would diminish the impact even more, but would require
>>   gcc modifications to be acceptable. I did some prototypes using instruction
>>   pattern matching in the past, which were judged too complex.
>>
>> - I strongly recommend adding a per-subsystem config-out option for heavy
>>   users like kmemtrace or pvops. Compiling-out kmemtrace instrumentation
>>   brings the performance impact from 2.5 % down to 1.9 % slowdown.
>>
>> - Putting the tracepoint out-of-line is a no-go, as it slows down *both* the
>>   dormant (3 % -> 4.7 %) and the active (+20% to tracer overhead) tracepoints
>>   compared to inline tracepoints.
>>   
>
> That's an interestingly counter-intuitive result.  Do you have any  
> theories how this might happen?  The only mechanism I can think of is  
> that, because the inline code sections are smaller, gcc is less inclined  
> to put the if(unlikely) code out of line, so the amount of hot-path  
> code is higher.  But still, 1.7% is a massive increase in overhead,  
> especially compared to the relative differences of the other changes.
>

Hrm, there is a shortcut I took in my test code to save development
time, and it might explain the result. I simply replaced

static inline
with
static noinline

in DECLARE_TRACE(), and did not modify DEFINE_TRACE(). Because
DECLARE_TRACE() is expanded in headers, each compilation unit that uses
a tracepoint ends up with its own out-of-line copy of the function. We
should clearly re-do those tests with your approach: an extern prototype
in DECLARE_TRACE(), plus proto and args arguments added to
DEFINE_TRACE(), where the callback would be defined. I'd be very
interested to see the result. For a more limited instrumentation change,
one could concentrate on the kmemtrace instrumentation, given that I've
shown it covers enough sites for its performance impact under tbench to
be consistently perceivable.
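
To make the difference concrete, here is a minimal user-space sketch of
the extern-prototype layout. It is only an illustration under stated
assumptions, not the kernel's tracepoint.h: the struct layout, the
single probe slot, the PARAMS() helper and the demo_alloc, demo_probe
and demo_run names are all hypothetical, and __builtin_expect assumes
gcc or clang.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, simplified tracepoint descriptor (not the kernel's). */
struct tracepoint {
	const char *name;
	int state;			/* 0 = dormant, 1 = active */
	void (*probe)(void);		/* single probe slot, cast at call time */
};

/* Lets a comma-separated list travel as one macro argument. */
#define PARAMS(...) __VA_ARGS__

/*
 * Header side: includers see only an extern prototype and a thin inline
 * state test, so no per-compilation-unit copy of the body is emitted.
 */
#define DECLARE_TRACE(name, proto, args)				\
	extern struct tracepoint __tracepoint_##name;			\
	extern void __do_trace_##name(proto);				\
	static inline void trace_##name(proto)				\
	{								\
		if (__builtin_expect(__tracepoint_##name.state, 0))	\
			__do_trace_##name(args);			\
	}

/* Exactly one .c file: the single out-of-line body that calls the probe. */
#define DEFINE_TRACE(name, proto, args)					\
	struct tracepoint __tracepoint_##name = { #name, 0, NULL };	\
	void __do_trace_##name(proto)					\
	{								\
		void (*fn)(proto) =					\
			(void (*)(proto))__tracepoint_##name.probe;	\
		if (fn)							\
			fn(args);					\
	}

/* Hypothetical tracepoint, declared and defined in the same file here. */
DECLARE_TRACE(demo_alloc, PARAMS(unsigned long size, int flags),
	      PARAMS(size, flags))
DEFINE_TRACE(demo_alloc, PARAMS(unsigned long size, int flags),
	     PARAMS(size, flags))

static unsigned long seen_bytes;

/* Hypothetical probe: accumulates the sizes it is handed. */
static void demo_probe(unsigned long size, int flags)
{
	(void)flags;
	seen_bytes += size;
}

/* Fires the tracepoint once dormant and once active. */
unsigned long demo_run(void)
{
	seen_bytes = 0;
	__tracepoint_demo_alloc.state = 0;
	trace_demo_alloc(64, 0);	/* dormant: only the state test runs */
	assert(seen_bytes == 0);

	__tracepoint_demo_alloc.probe = (void (*)(void))demo_probe;
	__tracepoint_demo_alloc.state = 1;
	trace_demo_alloc(128, 0);	/* active: out-of-line body runs */
	return seen_bytes;
}
```

With this split, the dormant path at each site is just the inline state
test, and the body exists once. The test hack, by contrast, emits a
static noinline copy per compilation unit.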

However, I have very limited time on my hands, and I won't be able to
make the modifications required to test this in the LTTng setup across
all the instrumentation. I also don't have the hardware or CPU time to
perform the 10 runs of each configuration you are talking about, given
that the 3 runs already monopolized my development machine for way too
long.

Mathieu, who really has to focus back on his Ph.D. thesis :/

>> Tracepoints all compiled-out :
>>
>> run 1 :                2091.50
>> run 2 (after reboot) : 2089.50 (baseline)
>> run 3 (after reboot) : 2083.61
>>
>> Dormant tracepoints :
>>
>> inline, no immediate value optimization
>>
>> run 1 :                1990.63
>> run 2 (after reboot) : 2025.38 (3 %)
>> run 3 (after reboot) : 2028.81
>>
>> out-of-line, no immediate value optimization
>>
>> run 1 :                1990.66
>> run 2 (after reboot) : 1990.19 (4.7 %)
>> run 3 (after reboot) : 1977.79
>>
>> inline, immediate value optimization
>>
>> run 1 :                2035.99 (2.5 %)
>> run 2 (after reboot) : 2036.11
>> run 3 (after reboot) : 2035.75
>>
>> inline, immediate value optimization, configuring out kmemtrace tracepoints
>>
>> run 1 :                2048.08 (1.9 %)
>> run 2 (after reboot) : 2055.53
>> run 3 (after reboot) : 2046.49
>>   
>
> So what are you doing here?  Are you doing 3 runs, then comparing the
> median measurement in each case?
>
> The trouble is that your run to run variations are at least as large as  
> the difference you're trying to detect.  For example in run 1 of  
> "inline, no immediate value optimization" you got 1990.6MB/s throughput,  
> and then runs 2 & 3 both went up to ~2025.  Why?  That's a huge jump.
>
> The "out-of-line, no immediate value optimization" runs 1&2 have the same
> throughput as run 1 of the previous test, 1990MB/s, while run 3 is a bit
> worse.  OK, so perhaps it's slower.  But why are runs 1&2 more or less
> identical to inline/run1?
>
> What would happen if you happened to do 10 iterations of these tests?
> There just seems to be too much run-to-run variation to make 3 runs
> statistically meaningful.
>
> I'm not picking on you personally, because I had exactly the same
> problems when trying to benchmark the overhead of pvops.  The
> reboot/rerun variations were at least as large as the effects I was trying
> to measure, and I'm just feeling suspicious of all the results.
>
> I think there's something fundamentally off about this kind of
> kernel benchmark methodology.  The results are not stable and are not -  
> I think - reliable.  Unfortunately I don't have enough of a background  
> in statistics to really analyze what's going on here, or how we should  
> change the test/measurement methodology to get results that we can  
> really stand by.
>
> I don't even have a good explanation for why there are such large  
> boot-to-boot variations anyway.  The normal explanation is "cache  
> effects", but what is actually changing here?  The kernel image is  
> identical, loaded into the same physical pages each time, and mapped  
> into the same virtual address.  So the I&D caches and tlb should get  
> exactly the same access patterns for the kernel code itself.  The  
> dynamically allocated memory is going to vary, and have different cache  
> interactions, but is that enough to explain these kinds of variations?   
> If so, we're going to need to do a lot more iterations to see any signal  
> from our actual changes over the noise that "cache effects" are throwing  
> our way...
>
>    J
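
The noise concern can be checked against the numbers quoted above. The
sketch below is only an illustration: it hard-codes the posted tbench
throughputs and computes, per configuration, the median of the 3 runs,
the slowdown of that median against the compiled-out median, and the
run-to-run spread (max minus min). The struct and function names are
hypothetical. For "inline, no immediate value optimization" the spread
is 38.18 MB/s, which is larger than the 35.19 MB/s gap between the
inline and out-of-line medians.

```c
#include <stdio.h>

#define RUNS 3

struct config {
	const char *label;
	double mbps[RUNS];	/* tbench throughput in MB/s, as posted */
};

const struct config configs[] = {
	{ "compiled out (baseline)",  { 2091.50, 2089.50, 2083.61 } },
	{ "inline, no imv",           { 1990.63, 2025.38, 2028.81 } },
	{ "out-of-line, no imv",      { 1990.66, 1990.19, 1977.79 } },
	{ "inline, imv",              { 2035.99, 2036.11, 2035.75 } },
	{ "inline, imv, no kmemtrace", { 2048.08, 2055.53, 2046.49 } },
};

#define NR_CONFIGS (sizeof(configs) / sizeof(configs[0]))

/* Median of exactly three values. */
double median3(const double v[RUNS])
{
	double a = v[0], b = v[1], c = v[2];

	if ((a <= b && b <= c) || (c <= b && b <= a))
		return b;
	if ((b <= a && a <= c) || (c <= a && a <= b))
		return a;
	return c;
}

/* Run-to-run spread: highest minus lowest throughput. */
double spread(const double v[RUNS])
{
	double lo = v[0], hi = v[0];
	int i;

	for (i = 1; i < RUNS; i++) {
		if (v[i] < lo)
			lo = v[i];
		if (v[i] > hi)
			hi = v[i];
	}
	return hi - lo;
}

/* Throughput lost relative to the baseline, in percent. */
double slowdown_pct(double base, double x)
{
	return 100.0 * (base - x) / base;
}

/* Prints one line per configuration. */
void print_report(void)
{
	double base = median3(configs[0].mbps);
	size_t i;

	for (i = 0; i < NR_CONFIGS; i++) {
		double med = median3(configs[i].mbps);

		printf("%-28s median %8.2f  slowdown %5.2f %%  spread %6.2f\n",
		       configs[i].label, med,
		       slowdown_pct(base, med), spread(configs[i].mbps));
	}
}
```

Calling print_report() from a main() prints the table. Spread is a
crude measure, but with only 3 runs per configuration it says about as
much as a standard deviation would.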

-- 
Mathieu Desnoyers
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F  BA06 3F25 A8FE 3BAE 9A68

