* ftrace performance impact with different configuration
@ 2011-12-29  8:38 Lei Wen
  2011-12-29 15:42 ` Rabin Vincent
  0 siblings, 1 reply; 7+ messages in thread

From: Lei Wen @ 2011-12-29  8:38 UTC (permalink / raw)
To: linux-arm-kernel

Hi,

We all know tracing is a useful tool to have, but enabling it in the
defconfig adds extra overhead to a product system. Here I test several
different configurations; perhaps someone can build on this data to
further reduce the overhead when tracing is built into the defconfig but
not enabled. This is most useful for product kernels, since tracing can
then be enabled later on the fly.

I am using an iperf test on an ARMv5 board with kernel 3.0. The test
environment is set up as follows:

For TCP testing:
  PC side:    iperf -s -f m -w 128k
  Board side: iperf -c 192.168.1.100 -t 300 -i 1 -f m -r -w 128k

For UDP testing:
  PC side:    iperf -s -f m -u -w 128k
  Board side: iperf -c 192.168.1.100 -t 300 -i 1 -f m -r -u -b 500m -l 1470 -w 128k

The data is listed as (sender speed/receiver speed), in Mbit/s. "ftrace
enabled" here means only the function tracer and dynamic ftrace options
are enabled.

With no ftrace and no debugfs built in:      tcp: 155/188  udp: 276/197
With no ftrace but with debugfs built in:    tcp: 149/184  udp: 270/187
With ftrace built in:                        tcp: 136/164  udp: 254/141
With ftrace built in but with modification:  tcp: 148/167  udp: 263/174

The modification mentioned above is:

diff --git a/include/linux/tracepoint.h b/include/linux/tracepoint.h
index d530a44..5d51e35 100644
--- a/include/linux/tracepoint.h
+++ b/include/linux/tracepoint.h
@@ -105,7 +105,7 @@ void tracepoint_update_probe_range(struct tracepoint * const *begin,
 #define TP_ARGS(args...) args
 #define TP_CONDITION(args...) args
 
-#ifdef CONFIG_TRACEPOINTS
+#ifndef CONFIG_TRACEPOINTS

With this modification, I expect the tracepoints to add no extra
performance penalty.
I made this change because ARM does not support jump labels yet, and I
think even the best jump-label implementation could not beat the result
of this modification. So it seems to me that:

1. Tracepoints without jump-label support can noticeably degrade system
   performance even when they are only built in, not enabled.
2. Dynamic ftrace also seems to add some penalty to a running system,
   even though it patches the running kernel with nop stubs...

For the second item, has anyone done research before on reducing the
cost to zero for a running system while tracing is not yet enabled?

Thanks,
Lei

^ permalink raw reply related [flat|nested] 7+ messages in thread
* ftrace performance impact with different configuration
  2011-12-29  8:38 ftrace performance impact with different configuration Lei Wen
@ 2011-12-29 15:42 ` Rabin Vincent
  2011-12-29 16:21   ` Steven Rostedt
  0 siblings, 1 reply; 7+ messages in thread

From: Rabin Vincent @ 2011-12-29 15:42 UTC (permalink / raw)
To: linux-arm-kernel

On Thu, Dec 29, 2011 at 14:08, Lei Wen <adrian.wenl@gmail.com> wrote:
> 2. Seem dynamic ftrace also could involve some penalty for the running
> system, although it patching the running kernel with nop stub...
>
> For the second item, is there anyone done some research before that
> could zero the cost for the running system when the tracing is not
> enabled yet?

One thing that needs to be fixed (for ARM) is that for the new-style
mcounts, the nop that's currently being done is not really a nop -- it
removes the function call, but there is still an unnecessary push/pop
sequence. This should be modified to have the push {lr} removed too.
(Two instructions replaced instead of one.)
* ftrace performance impact with different configuration
  2011-12-29 15:42 ` Rabin Vincent
@ 2011-12-29 16:21   ` Steven Rostedt
  2011-12-30 13:07     ` Philippe Rétornaz
  2012-01-04 10:06     ` Lei Wen
  0 siblings, 2 replies; 7+ messages in thread

From: Steven Rostedt @ 2011-12-29 16:21 UTC (permalink / raw)
To: linux-arm-kernel

On Thu, 2011-12-29 at 21:12 +0530, Rabin Vincent wrote:
> On Thu, Dec 29, 2011 at 14:08, Lei Wen <adrian.wenl@gmail.com> wrote:
> > 2. Seem dynamic ftrace also could involve some penalty for the running
> > system, although it patching the running kernel with nop stub...
> >
> > For the second item, is there anyone done some research before that
> > could zero the cost for the running system when the tracing is not
> > enabled yet?
>
> One thing that needs to be fixed (for ARM) is that for the new-style
> mcounts, the nop that's currently being done is not really a nop -- it
> removes the function call, but there is still an unnecessary push/pop
> sequence. This should be modified to have the push {lr} removed too.
> (Two instructions replaced instead of one.)

Unfortunately you can't do this, at least not when the kernel is
preemptible.

Say we have:

	push lr
	call mcount

then we convert it to:

	nop
	nop

The conversion to nop should not be an issue, and this is what would be
done when the system boots up. But then we enable tracing: some low
priority task could have been preempted after executing the first nop,
and we call stop_machine to do the conversions (if there is no
stop_machine, then let's just say a higher prio task is running while we
do the conversions). Then we add both the push lr and the call back.
But when that lower priority task gets scheduled in again, it will have
effectively run:

	nop
	call mcount

Since the call to mcount requires that the lr was pushed, this process
will crash when the return is done, because we never saved the lr.

If you don't like the push, the best thing you can do is convert to:

	jmp 1f
	call mcount
1:

This may not be as cheap as two nops, but it may be better than a push.

-- Steve
* ftrace performance impact with different configuration
  2011-12-29 16:21 ` Steven Rostedt
@ 2011-12-30 13:07   ` Philippe Rétornaz
  2011-12-30 22:25     ` Steven Rostedt
  0 siblings, 1 reply; 7+ messages in thread

From: Philippe Rétornaz @ 2011-12-30 13:07 UTC (permalink / raw)
To: linux-arm-kernel

On Thursday, December 29, 2011 at 11:21:25, Steven Rostedt wrote:
> On Thu, 2011-12-29 at 21:12 +0530, Rabin Vincent wrote:
> > On Thu, Dec 29, 2011 at 14:08, Lei Wen <adrian.wenl@gmail.com> wrote:
> > > 2. Seem dynamic ftrace also could involve some penalty for the
> > > running
> > > system, although it patching the running kernel with nop stub...
> > >
> > > For the second item, is there anyone done some research before that
> > > could zero the cost for the running system when the tracing is not
> > > enabled yet?
> >
> > One thing that needs to be fixed (for ARM) is that for the new-style
> > mcounts, the nop that's currently being done is not really a nop -- it
> > removes the function call, but there is still an unnecessary push/pop
> > sequence. This should be modified to have the push {lr} removed too.
> > (Two instructions replaced instead of one.)
>
> Unfortunately you can't do this, at least not when the kernel is
> preemptible.
>
> Say we have:
>
>	push lr
>	call mcount
>
> then we convert it to:
>
>	nop
>	nop
>
> The conversion to nop should not be an issue, and this is what would be
> done when the system boots up. But then we enable tracing, some low
> priority task could have been preempted after executing the first nop,
> and we call stop machine to do the conversions (if no stop machine, then
> lets just say a higher prio task is running while we do the
> conversions). Then we add both the push lr and call back. But when that
> lower priority task gets scheduled in again, it would have looked like
> it ran:
>
>	nop
>	call mcount
>
> Since the call to mcount requires that the lr was pushed, this process
> will crash when the return is done and we never saved the lr.
>
> If you don't like the push, the best thing you can do is convert to:
>
>	jmp 1f
>	call mcount
> 1:
>
> This may not be as cheap as two nops, but it may be better than a push.

Sorry about being a bit naive, but why is it not possible to do it in
two steps?

Call stop_machine to put in the jmp that skips the call to mcount.
Then wait until all tasks hit schedule() (synchronize_sched()?).
Then modify both instructions to put the two nops in place, since we
know that nobody is calling mcount.

Thanks,
Philippe
* ftrace performance impact with different configuration
  2011-12-30 13:07 ` Philippe Rétornaz
@ 2011-12-30 22:25   ` Steven Rostedt
  2011-12-30 23:31     ` Philippe Rétornaz
  0 siblings, 1 reply; 7+ messages in thread

From: Steven Rostedt @ 2011-12-30 22:25 UTC (permalink / raw)
To: linux-arm-kernel

On Fri, 2011-12-30 at 14:07 +0100, Philippe Rétornaz wrote:
> Sorry about being a bit naive, but why it is not possible to do it in two
> steps ?
> call stop_machine to put the jmp which skip the call to mcount
> Then wait until all tasks hits schedule() (synchronize_sched() ?)

Here's the problem. With a preemptible kernel, hitting schedule() does
not mean that you can guarantee that all tasks have not been
interrupted.

	nop (used to be push lr)
		------------>
			interrupt
			set NEED_RESCHED
		end of interrupt
	preempt_schedule()
	schedule()

	[ another process is now running ]

	call stop_machine()
	put in push lr

	call stop_machine()
	put in call to mcount

	everything seems to be running fine.

	the low priority task gets scheduled again...

	schedule() finishes
	preempt_schedule() finishes
	back at tail of return_from_intr in entry_64.S
	iret
	call mcount
	pop lr (which was never pushed)
	jmp lr

Now we jump back to some random place and the stack is corrupted.

There's no way to safely modify two instructions that depend on each
other in a preemptible kernel, with the exception of waiting for all
CPUs to hit idle (which may never happen on a busy system). And even
that may be racy.

-- Steve

> Then modify both instructions to put in place the two nops since we know
> that nobody is calling mcount.
* ftrace performance impact with different configuration
  2011-12-30 22:25 ` Steven Rostedt
@ 2011-12-30 23:31   ` Philippe Rétornaz
  0 siblings, 0 replies; 7+ messages in thread

From: Philippe Rétornaz @ 2011-12-30 23:31 UTC (permalink / raw)
To: linux-arm-kernel

On Friday, December 30, 2011 at 17:25:41, Steven Rostedt wrote:
> On Fri, 2011-12-30 at 14:07 +0100, Philippe Rétornaz wrote:
> > Sorry about being a bit naive, but why it is not possible to do it in
> > two
> > steps ?
> > call stop_machine to put the jmp which skip the call to mcount
> > Then wait until all tasks hits schedule() (synchronize_sched() ?)
> (...)
> There's no way to safely modify two instructions that depend on each
> other in a preemptible kernel, with the exception of waiting for all
> CPUs to hit idle (which may never happen on a busy system). And even
> that may be racy.

Thanks for the detailed explanation, I get it now.

I guess the only way would be to use the freezer API, as
freeze_processes() does. It looks like it's not worth the pain.

Thanks!
Philippe
* ftrace performance impact with different configuration
  2011-12-29 16:21 ` Steven Rostedt
  2011-12-30 13:07 ` Philippe Rétornaz
@ 2012-01-04 10:06 ` Lei Wen
  1 sibling, 0 replies; 7+ messages in thread

From: Lei Wen @ 2012-01-04 10:06 UTC (permalink / raw)
To: linux-arm-kernel

Hi Steven,

On Fri, Dec 30, 2011 at 12:21 AM, Steven Rostedt <rostedt@goodmis.org> wrote:
> On Thu, 2011-12-29 at 21:12 +0530, Rabin Vincent wrote:
>> On Thu, Dec 29, 2011 at 14:08, Lei Wen <adrian.wenl@gmail.com> wrote:
>> > 2. Seem dynamic ftrace also could involve some penalty for the running
>> > system, although it patching the running kernel with nop stub...
>> >
>> > For the second item, is there anyone done some research before that
>> > could zero the cost for the running system when the tracing is not
>> > enabled yet?
>>
>> One thing that needs to be fixed (for ARM) is that for the new-style
>> mcounts, the nop that's currently being done is not really a nop -- it
>> removes the function call, but there is still an unnecessary push/pop
>> sequence. This should be modified to have the push {lr} removed too.
>> (Two instructions replaced instead of one.)
>
> Unfortunately you can't do this, at least not when the kernel is
> preemptible.
>
> Say we have:
>
>	push lr
>	call mcount
>
> then we convert it to:
>
>	nop
>	nop
>
> The conversion to nop should not be an issue, and this is what would be
> done when the system boots up. But then we enable tracing, some low
> priority task could have been preempted after executing the first nop,
> and we call stop machine to do the conversions (if no stop machine, then
> lets just say a higher prio task is running while we do the
> conversions). Then we add both the push lr and call back. But when that
> lower priority task gets scheduled in again, it would have looked like
> it ran:
>
>	nop
>	call mcount
>
> Since the call to mcount requires that the lr was pushed, this process
> will crash when the return is done and we never saved the lr.
>
> If you don't like the push, the best thing you can do is convert to:
>
>	jmp 1f
>	call mcount
> 1:
>
> This may not be as cheap as two nops, but it may be better than a push.

I did this conversion as you suggested, but it seems it still cannot
fully solve the performance downgrade...

Here is the updated data with the arm-eabi-4.4.3 toolchain on the ARMv5
platform:

With no ftrace and no debugfs built in:                      tcp: 161/185  udp: 277/180
With no ftrace but with debugfs built in:                    tcp: 154/185  udp: 278/183
With ftrace built in (no other changes):                     tcp: 130/163  udp: 253/140
With ftrace built in but with mcount fix:                    tcp: 135/167  udp: 258/150
With ftrace built in but with mcount fix and no tracepoint:  tcp: 148/170  udp: 267/161
With ftrace built in but with no tracepoint:                 tcp: 140/165  udp: 267/157

The mcount fix refers to patching push {lr} to jmp 1f, while "no
tracepoint" means NULLing __DECLARE_TRACE, so that the tracepoints
themselves incur no penalty, as we don't have official jump label
support yet.

From the data, the jmp fix improves things by around 5~10 Mbit/s, but
even with the most optimized combination there is still a gap between
"With ftrace built in but with mcount fix and no tracepoint" and "With
no ftrace but with debugfs built in".

Do you have any further suggestions on this?

Thanks,
Lei
end of thread, other threads: [~2012-01-04 10:06 UTC | newest]

Thread overview: 7+ messages (download: mbox.gz / follow: Atom feed
-- links below jump to the message on this page --)
2011-12-29  8:38 ftrace performance impact with different configuration Lei Wen
2011-12-29 15:42 ` Rabin Vincent
2011-12-29 16:21   ` Steven Rostedt
2011-12-30 13:07     ` Philippe Rétornaz
2011-12-30 22:25       ` Steven Rostedt
2011-12-30 23:31         ` Philippe Rétornaz
2012-01-04 10:06     ` Lei Wen