* ftrace performance impact with different configuration
@ 2011-12-29 8:38 Lei Wen
From: Lei Wen @ 2011-12-29 8:38 UTC (permalink / raw)
To: Rabin Vincent, Steven Rostedt, linux-arm-kernel, linux-kernel; +Cc: leiwen
Hi,
Tracing is a useful tool, but enabling it in the defconfig adds extra
overhead to a product system. Here I tested several configurations,
and perhaps someone can build on my data to further reduce the system
overhead of building tracing into the defconfig without enabling it.
This matters most for product kernels, where we want to be able to
enable tracing later, on the fly.
I ran iperf tests on an ARMv5 board with kernel 3.0. The test
environment was set up as follows:
For TCP testing
PC side: iperf -s -f m -w 128k
Board side: iperf -c 192.168.1.100 -t 300 -i 1 -f m -r -w 128k
For UDP testing
PC side: iperf -s -f m -u -w 128k
Board side: iperf -c 192.168.1.100 -t 300 -i 1 -f m -r -u -b 500m -l 1470 -w 128k
Each result is reported as sender speed/receiver speed, in Mbit/s.
"ftrace enabled" here means only the function tracer and dynamic
ftrace options are enabled.
With no ftrace and no debugfs built in:      tcp: 155/188  udp: 276/197
With no ftrace but with debugfs built in:    tcp: 149/184  udp: 270/187
With ftrace built in:                        tcp: 136/164  udp: 254/141
With ftrace built in but with modification:  tcp: 148/167  udp: 263/174
The modification mentioned above is:
diff --git a/include/linux/tracepoint.h b/include/linux/tracepoint.h
index d530a44..5d51e35 100644
--- a/include/linux/tracepoint.h
+++ b/include/linux/tracepoint.h
@@ -105,7 +105,7 @@ void tracepoint_update_probe_range(struct tracepoint * const *begin,
#define TP_ARGS(args...) args
#define TP_CONDITION(args...) args
-#ifdef CONFIG_TRACEPOINTS
+#ifndef CONFIG_TRACEPOINTS
With this modification, I expect tracepoints to add no extra
performance penalty. I made the change because ARM doesn't support
jump labels yet, and I believe even the best result jump labels could
achieve would not beat this modification.
So it seems to me that:
1. Without jump label support, tracepoints can greatly bring down
system performance even when they are only built in, not enabled.
2. Dynamic ftrace also seems to impose some penalty on the running
system, even though it patches the running kernel with nop stubs...

Regarding the second item, has anyone done research on reducing the
cost to zero for a running system when tracing is not yet enabled?
Thanks,
Lei
* Re: ftrace performance impact with different configuration
From: Rabin Vincent @ 2011-12-29 15:42 UTC (permalink / raw)
To: Lei Wen; +Cc: Steven Rostedt, linux-arm-kernel, linux-kernel, leiwen
On Thu, Dec 29, 2011 at 14:08, Lei Wen <adrian.wenl@gmail.com> wrote:
> 2. Dynamic ftrace also seems to impose some penalty on the running
> system, even though it patches the running kernel with nop stubs...
>
> Regarding the second item, has anyone done research on reducing the
> cost to zero for a running system when tracing is not yet enabled?
One thing that needs to be fixed (for ARM) is that for the new-style
mcounts, the nop that's currently being done is not really a nop -- it
removes the function call, but there is still an unnecessary push/pop
sequence. This should be modified to have the push {lr} removed too.
(Two instructions replaced instead of one.)
* Re: ftrace performance impact with different configuration
From: Steven Rostedt @ 2011-12-29 16:21 UTC (permalink / raw)
To: Rabin Vincent; +Cc: Lei Wen, linux-arm-kernel, linux-kernel, leiwen
On Thu, 2011-12-29 at 21:12 +0530, Rabin Vincent wrote:
> On Thu, Dec 29, 2011 at 14:08, Lei Wen <adrian.wenl@gmail.com> wrote:
> > 2. Dynamic ftrace also seems to impose some penalty on the running
> > system, even though it patches the running kernel with nop stubs...
> >
> > Regarding the second item, has anyone done research on reducing the
> > cost to zero for a running system when tracing is not yet enabled?
>
> One thing that needs to be fixed (for ARM) is that for the new-style
> mcounts, the nop that's currently being done is not really a nop -- it
> removes the function call, but there is still an unnecessary push/pop
> sequence. This should be modified to have the push {lr} removed too.
> (Two instructions replaced instead of one.)
Unfortunately you can't do this, at least not when the kernel is
preemptible.
Say we have:
push lr
call mcount
then we convert it to:
nop
nop
The conversion to nop should not be an issue, and this is what would be
done when the system boots up. But then we enable tracing, some low
priority task could have been preempted after executing the first nop,
and we call stop machine to do the conversions (if no stop machine, then
let's just say a higher prio task is running while we do the
conversions). Then we add both the push lr and call back. But when that
lower priority task gets scheduled in again, it would have looked like
it ran:
nop
call mcount
Since the call to mcount requires that the lr was pushed, this process
will crash when the return is done and we never saved the lr.
If you don't like the push, the best thing you can do is convert to:
jmp 1f
call mcount
1:
This may not be as cheap as two nops, but it may be better than a push.
-- Steve
* Re: ftrace performance impact with different configuration
From: Philippe Rétornaz @ 2011-12-30 13:07 UTC (permalink / raw)
To: linux-arm-kernel
Cc: Steven Rostedt, Rabin Vincent, leiwen, Lei Wen, linux-kernel
On Thursday, 29 December 2011 at 11:21:25, Steven Rostedt wrote:
> On Thu, 2011-12-29 at 21:12 +0530, Rabin Vincent wrote:
> > On Thu, Dec 29, 2011 at 14:08, Lei Wen <adrian.wenl@gmail.com> wrote:
> > > 2. Dynamic ftrace also seems to impose some penalty on the
> > > running system, even though it patches the running kernel with
> > > nop stubs...
> > >
> > > Regarding the second item, has anyone done research on reducing
> > > the cost to zero for a running system when tracing is not yet
> > > enabled?
> >
> > One thing that needs to be fixed (for ARM) is that for the new-style
> > mcounts, the nop that's currently being done is not really a nop -- it
> > removes the function call, but there is still an unnecessary push/pop
> > sequence. This should be modified to have the push {lr} removed too.
> > (Two instructions replaced instead of one.)
>
> Unfortunately you can't do this, at least not when the kernel is
> preemptible.
>
> Say we have:
>
> push lr
> call mcount
>
> then we convert it to:
>
> nop
> nop
>
> The conversion to nop should not be an issue, and this is what would be
> done when the system boots up. But then we enable tracing, some low
> priority task could have been preempted after executing the first nop,
> and we call stop machine to do the conversions (if no stop machine, then
> let's just say a higher prio task is running while we do the
> conversions). Then we add both the push lr and call back. But when that
> lower priority task gets scheduled in again, it would have looked like
> it ran:
>
> nop
> call mcount
>
> Since the call to mcount requires that the lr was pushed, this process
> will crash when the return is done and we never saved the lr.
>
> If you don't like the push, the best thing you can do is convert to:
>
> jmp 1f
> call mcount
> 1:
>
> This may not be as cheap as two nops, but it may be better than a push.
Sorry about being a bit naive, but why is it not possible to do it in
two steps?
call stop_machine to put in the jmp which skips the call to mcount
Then wait until all tasks hit schedule() (synchronize_sched()?)
Then modify both instructions to put the two nops in place, since we
know that nobody is calling mcount.
Thanks,
Philippe
* Re: ftrace performance impact with different configuration
From: Steven Rostedt @ 2011-12-30 22:25 UTC (permalink / raw)
To: Philippe Rétornaz
Cc: linux-arm-kernel, Rabin Vincent, leiwen, Lei Wen, linux-kernel
On Fri, 2011-12-30 at 14:07 +0100, Philippe Rétornaz wrote:
> Sorry about being a bit naive, but why is it not possible to do it in
> two steps?
> call stop_machine to put in the jmp which skips the call to mcount
> Then wait until all tasks hit schedule() (synchronize_sched()?)
Here's the problem. With a preemptible kernel, hitting schedule() does
not mean that you can guarantee that all tasks have not been
interrupted.
nop (used to be push lr)
------------> interrupt
set NEED_RESCHED
end of interrupt
preempt_schedule()
schedule()
[ another process is now running ]
call stop_machine()
put in push lr
call stop_machine()
put in call to mcount
everything seems to be running fine.
the low priority task gets scheduled again...
schedule() finishes
preempt_schedule() finishes
back at tail of return_from_intr in entry_64.S
iret
call mcount
pop lr (which was never pushed)
jmp lr
Now we jump back to some random stuff and the stack is corrupted.
There's no way to safely modify two instructions that depend on each
other in a preemptible kernel, with the exception of waiting for all
CPUs to hit idle (which may never happen on a busy system). And even
that may be racy.
-- Steve
> Then modify both instructions to put the two nops in place, since we
> know that nobody is calling mcount.
* Re: ftrace performance impact with different configuration
From: Philippe Rétornaz @ 2011-12-30 23:31 UTC (permalink / raw)
To: Steven Rostedt
Cc: linux-arm-kernel, Rabin Vincent, leiwen, Lei Wen, linux-kernel
On Friday, 30 December 2011 at 17:25:41, Steven Rostedt wrote:
> On Fri, 2011-12-30 at 14:07 +0100, Philippe Rétornaz wrote:
> > Sorry about being a bit naive, but why is it not possible to do it
> > in two steps?
> > call stop_machine to put in the jmp which skips the call to mcount
> > Then wait until all tasks hit schedule() (synchronize_sched()?)
> (...)
> There's no way to safely modify two instructions that depend on each
> other in a preemptible kernel, with the exception of waiting for all
> CPUs to hit idle (which may never happen on a busy system). And even
> that may be racy.
>
Thanks for the detailed explanation, I get it now. I guess the only
way would be to use the freezer API, as freeze_processes() does. Looks
like it's not worth the pain.
Thanks !
Philippe
* Re: ftrace performance impact with different configuration
From: Lei Wen @ 2012-01-04 10:06 UTC (permalink / raw)
To: Steven Rostedt; +Cc: Rabin Vincent, linux-arm-kernel, linux-kernel, leiwen
Hi Steven,
On Fri, Dec 30, 2011 at 12:21 AM, Steven Rostedt <rostedt@goodmis.org> wrote:
> On Thu, 2011-12-29 at 21:12 +0530, Rabin Vincent wrote:
>> On Thu, Dec 29, 2011 at 14:08, Lei Wen <adrian.wenl@gmail.com> wrote:
>> > 2. Dynamic ftrace also seems to impose some penalty on the running
>> > system, even though it patches the running kernel with nop stubs...
>> >
>> > Regarding the second item, has anyone done research on reducing the
>> > cost to zero for a running system when tracing is not yet enabled?
>>
>> One thing that needs to be fixed (for ARM) is that for the new-style
>> mcounts, the nop that's currently being done is not really a nop -- it
>> removes the function call, but there is still an unnecessary push/pop
>> sequence. This should be modified to have the push {lr} removed too.
>> (Two instructions replaced instead of one.)
>
>
> Unfortunately you can't do this, at least not when the kernel is
> preemptible.
>
> Say we have:
>
> push lr
> call mcount
>
> then we convert it to:
>
> nop
> nop
>
> The conversion to nop should not be an issue, and this is what would be
> done when the system boots up. But then we enable tracing, some low
> priority task could have been preempted after executing the first nop,
> and we call stop machine to do the conversions (if no stop machine, then
> let's just say a higher prio task is running while we do the
> conversions). Then we add both the push lr and call back. But when that
> lower priority task gets scheduled in again, it would have looked like
> it ran:
>
> nop
> call mcount
>
> Since the call to mcount requires that the lr was pushed, this process
> will crash when the return is done and we never saved the lr.
>
> If you don't like the push, the best thing you can do is convert to:
>
> jmp 1f
> call mcount
> 1:
>
> This may not be as cheap as two nops, but it may be better than a push.
>
I made this conversion as you suggested, but it seems it still cannot
fully resolve the performance degradation...
Here is the updated data with the arm-eabi-4.4.3 toolchain on the
ARMv5 platform:
With no ftrace and no debugfs built in:                       tcp: 161/185  udp: 277/180
With no ftrace but with debugfs built in:                     tcp: 154/185  udp: 278/183
With ftrace built in (no other changes):                      tcp: 130/163  udp: 253/140
With ftrace built in but with mcount fix:                     tcp: 135/167  udp: 258/150
With ftrace built in but with mcount fix and no tracepoints:  tcp: 148/170  udp: 267/161
With ftrace built in but with no tracepoints:                 tcp: 140/165  udp: 267/157
The mcount fix refers to patching push {lr} to jmp 1f. "No
tracepoints" means stubbing out __DECLARE_TRACE so that the
tracepoints themselves incur no penalty, since we don't yet have
official jump label support.
From the data, the jmp fix improves throughput by around 5~10 Mbit/s,
but even with the most optimized combination there is still a gap
between "with ftrace built in but with mcount fix and no tracepoints"
and "with no ftrace but with debugfs built in".
Do you have any further suggestions?
Thanks,
Lei