* ftrace performance impact with different configuration
@ 2011-12-29  8:38 Lei Wen
  2011-12-29 15:42 ` Rabin Vincent
  0 siblings, 1 reply; 7+ messages in thread

From: Lei Wen @ 2011-12-29  8:38 UTC (permalink / raw)
To: linux-arm-kernel

Hi,

We all know tracing is a useful tool to have, but enabling it in the
defconfig adds extra overhead to a product system. Here I test several
different configurations; perhaps someone can build on this data to
further reduce the overhead when tracing is built into the defconfig but
not enabled. This is most useful for product kernels, since tracing can
then be enabled later on the fly.

I am using an iperf test on an ARMv5 board with kernel 3.0. The test
environment is set up as follows:

For TCP testing:
  PC side:    iperf -s -f m -w 128k
  Board side: iperf -c 192.168.1.100 -t 300 -i 1 -f m -r -w 128k

For UDP testing:
  PC side:    iperf -s -f m -u -w 128k
  Board side: iperf -c 192.168.1.100 -t 300 -i 1 -f m -r -u -b 500m -l 1470 -w 128k

The data is listed as (sender speed/receiver speed), in Mbit/s. "ftrace
enabled" here means only the function tracer and dynamic ftrace options
are enabled.

With no ftrace and no debugfs built in:      tcp: 155/188  udp: 276/197
With no ftrace but with debugfs built in:    tcp: 149/184  udp: 270/187
With ftrace built in:                        tcp: 136/164  udp: 254/141
With ftrace built in but with modification:  tcp: 148/167  udp: 263/174

The modification mentioned above is:

diff --git a/include/linux/tracepoint.h b/include/linux/tracepoint.h
index d530a44..5d51e35 100644
--- a/include/linux/tracepoint.h
+++ b/include/linux/tracepoint.h
@@ -105,7 +105,7 @@ void tracepoint_update_probe_range(struct tracepoint * const *begin,
 #define TP_ARGS(args...) args
 #define TP_CONDITION(args...) args
 
-#ifdef CONFIG_TRACEPOINTS
+#ifndef CONFIG_TRACEPOINTS

With this modification, I expect the tracepoints to add no extra
performance penalty.
I made this change because ARM does not support jump labels yet, and I
think even the best jump-label implementation could not beat the result
of this modification. So it seems to me that:

1. Tracepoints without jump-label support can noticeably degrade system
   performance even when they are only built in, not enabled.
2. Dynamic ftrace also seems to add some penalty to a running system,
   even though it patches the running kernel with nop stubs...

For the second item, has anyone done research before on reducing the
cost to zero for a running system while tracing is not yet enabled?

Thanks,
Lei

^ permalink raw reply related [flat|nested] 7+ messages in thread
* ftrace performance impact with different configuration
  2011-12-29  8:38 ftrace performance impact with different configuration Lei Wen
@ 2011-12-29 15:42 ` Rabin Vincent
  2011-12-29 16:21   ` Steven Rostedt
  0 siblings, 1 reply; 7+ messages in thread

From: Rabin Vincent @ 2011-12-29 15:42 UTC (permalink / raw)
To: linux-arm-kernel

On Thu, Dec 29, 2011 at 14:08, Lei Wen <adrian.wenl@gmail.com> wrote:
> 2. Seem dynamic ftrace also could involve some penalty for the running
> system, although it patching the running kernel with nop stub...
>
> For the second item, is there anyone done some research before that
> could zero the cost for the running system when the tracing is not
> enabled yet?

One thing that needs to be fixed (for ARM) is that for the new-style
mcounts, the nop that's currently being done is not really a nop -- it
removes the function call, but there is still an unnecessary push/pop
sequence. This should be modified to have the push {lr} removed too.
(Two instructions replaced instead of one.)
* ftrace performance impact with different configuration
  2011-12-29 15:42 ` Rabin Vincent
@ 2011-12-29 16:21   ` Steven Rostedt
  2011-12-30 13:07     ` Philippe Rétornaz
  2012-01-04 10:06     ` Lei Wen
  0 siblings, 2 replies; 7+ messages in thread

From: Steven Rostedt @ 2011-12-29 16:21 UTC (permalink / raw)
To: linux-arm-kernel

On Thu, 2011-12-29 at 21:12 +0530, Rabin Vincent wrote:
> On Thu, Dec 29, 2011 at 14:08, Lei Wen <adrian.wenl@gmail.com> wrote:
> > 2. Seem dynamic ftrace also could involve some penalty for the running
> > system, although it patching the running kernel with nop stub...
> >
> > For the second item, is there anyone done some research before that
> > could zero the cost for the running system when the tracing is not
> > enabled yet?
>
> One thing that needs to be fixed (for ARM) is that for the new-style
> mcounts, the nop that's currently being done is not really a nop -- it
> removes the function call, but there is still an unnecessary push/pop
> sequence. This should be modified to have the push {lr} removed too.
> (Two instructions replaced instead of one.)

Unfortunately you can't do this, at least not when the kernel is
preemptible.

Say we have:

	push lr
	call mcount

then we convert it to:

	nop
	nop

The conversion to nop should not be an issue, and this is what would be
done when the system boots up. But then we enable tracing: some low
priority task could have been preempted after executing the first nop,
and we call stop_machine to do the conversions (if there is no
stop_machine, then let's just say a higher prio task is running while we
do the conversions). Then we add both the push lr and the call back.
But when that lower priority task gets scheduled in again, it will have
effectively run:

	nop
	call mcount

Since the call to mcount requires that the lr was pushed, this process
will crash when the return is done, because we never saved the lr.

If you don't like the push, the best thing you can do is convert to:

	jmp 1f
	call mcount
1:

This may not be as cheap as two nops, but it may be better than a push.

-- Steve
* ftrace performance impact with different configuration
  2011-12-29 16:21 ` Steven Rostedt
@ 2011-12-30 13:07   ` Philippe Rétornaz
  2011-12-30 22:25     ` Steven Rostedt
  0 siblings, 1 reply; 7+ messages in thread

From: Philippe Rétornaz @ 2011-12-30 13:07 UTC (permalink / raw)
To: linux-arm-kernel

On Thursday, December 29, 2011 at 11:21:25, Steven Rostedt wrote:
> On Thu, 2011-12-29 at 21:12 +0530, Rabin Vincent wrote:
> > On Thu, Dec 29, 2011 at 14:08, Lei Wen <adrian.wenl@gmail.com> wrote:
> > > 2. Seem dynamic ftrace also could involve some penalty for the
> > > running
> > > system, although it patching the running kernel with nop stub...
> > >
> > > For the second item, is there anyone done some research before that
> > > could zero the cost for the running system when the tracing is not
> > > enabled yet?
> >
> > One thing that needs to be fixed (for ARM) is that for the new-style
> > mcounts, the nop that's currently being done is not really a nop -- it
> > removes the function call, but there is still an unnecessary push/pop
> > sequence. This should be modified to have the push {lr} removed too.
> > (Two instructions replaced instead of one.)
>
> Unfortunately you can't do this, at least not when the kernel is
> preemptible.
>
> Say we have:
>
>	push lr
>	call mcount
>
> then we convert it to:
>
>	nop
>	nop
>
> The conversion to nop should not be an issue, and this is what would be
> done when the system boots up. But then we enable tracing, some low
> priority task could have been preempted after executing the first nop,
> and we call stop machine to do the conversions (if no stop machine, then
> lets just say a higher prio task is running while we do the
> conversions). Then we add both the push lr and call back. But when that
> lower priority task gets scheduled in again, it would have looked like
> it ran:
>
>	nop
>	call mcount
>
> Since the call to mcount requires that the lr was pushed, this process
> will crash when the return is done and we never saved the lr.
>
> If you don't like the push, the best thing you can do is convert to:
>
>	jmp 1f
>	call mcount
> 1:
>
> This may not be as cheap as two nops, but it may be better than a push.

Sorry about being a bit naive, but why is it not possible to do it in
two steps?

Call stop_machine to put in the jmp that skips the call to mcount.
Then wait until all tasks hit schedule() (synchronize_sched()?).
Then modify both instructions to put the two nops in place, since we
know that nobody is calling mcount.

Thanks,
Philippe
* ftrace performance impact with different configuration
  2011-12-30 13:07 ` Philippe Rétornaz
@ 2011-12-30 22:25   ` Steven Rostedt
  2011-12-30 23:31     ` Philippe Rétornaz
  0 siblings, 1 reply; 7+ messages in thread

From: Steven Rostedt @ 2011-12-30 22:25 UTC (permalink / raw)
To: linux-arm-kernel

On Fri, 2011-12-30 at 14:07 +0100, Philippe Rétornaz wrote:
> Sorry about being a bit naive, but why it is not possible to do it in two
> steps ?
> call stop_machine to put the jmp which skip the call to mcount
> Then wait until all tasks hits schedule() (synchronize_sched() ?)

Here's the problem. With a preemptible kernel, hitting schedule() does
not mean that you can guarantee that all tasks have not been
interrupted.

	nop (used to be push lr)
		------------>
			interrupt
			set NEED_RESCHED
		end of interrupt
	preempt_schedule()
	schedule()

	[ another process is now running ]

	call stop_machine()
	put in push lr

	call stop_machine()
	put in call to mcount

	everything seems to be running fine.

	the low priority task gets scheduled again...

	schedule() finishes
	preempt_schedule() finishes
	back at tail of return_from_intr in entry_64.S
	iret
	call mcount
	pop lr (which was never pushed)
	jmp lr

Now we jump back to some random place and the stack is corrupted.

There's no way to safely modify two instructions that depend on each
other in a preemptible kernel, with the exception of waiting for all
CPUs to hit idle (which may never happen on a busy system). And even
that may be racy.

-- Steve

> Then modify both instructions to put in place the two nops since we know
> that nobody is calling mcount.
* ftrace performance impact with different configuration
  2011-12-30 22:25 ` Steven Rostedt
@ 2011-12-30 23:31   ` Philippe Rétornaz
  0 siblings, 0 replies; 7+ messages in thread

From: Philippe Rétornaz @ 2011-12-30 23:31 UTC (permalink / raw)
To: linux-arm-kernel

On Friday, December 30, 2011 at 17:25:41, Steven Rostedt wrote:
> On Fri, 2011-12-30 at 14:07 +0100, Philippe Rétornaz wrote:
> > Sorry about being a bit naive, but why it is not possible to do it in
> > two
> > steps ?
> > call stop_machine to put the jmp which skip the call to mcount
> > Then wait until all tasks hits schedule() (synchronize_sched() ?)
> (...)
> There's no way to safely modify two instructions that depend on each
> other in a preemptible kernel, with the exception of waiting for all
> CPUs to hit idle (which may never happen on a busy system). And even
> that may be racy.

Thanks for the detailed explanation, I get it now.

I guess the only way would be to use the freezer API, as
freeze_processes() does. It looks like it's not worth the pain.

Thanks!
Philippe
* ftrace performance impact with different configuration
  2011-12-29 16:21 ` Steven Rostedt
  2011-12-30 13:07 ` Philippe Rétornaz
@ 2012-01-04 10:06 ` Lei Wen
  1 sibling, 0 replies; 7+ messages in thread

From: Lei Wen @ 2012-01-04 10:06 UTC (permalink / raw)
To: linux-arm-kernel

Hi Steven,

On Fri, Dec 30, 2011 at 12:21 AM, Steven Rostedt <rostedt@goodmis.org> wrote:
> On Thu, 2011-12-29 at 21:12 +0530, Rabin Vincent wrote:
>> On Thu, Dec 29, 2011 at 14:08, Lei Wen <adrian.wenl@gmail.com> wrote:
>> > 2. Seem dynamic ftrace also could involve some penalty for the running
>> > system, although it patching the running kernel with nop stub...
>> >
>> > For the second item, is there anyone done some research before that
>> > could zero the cost for the running system when the tracing is not
>> > enabled yet?
>>
>> One thing that needs to be fixed (for ARM) is that for the new-style
>> mcounts, the nop that's currently being done is not really a nop -- it
>> removes the function call, but there is still an unnecessary push/pop
>> sequence. This should be modified to have the push {lr} removed too.
>> (Two instructions replaced instead of one.)
>
> Unfortunately you can't do this, at least not when the kernel is
> preemptible.
>
> Say we have:
>
>	push lr
>	call mcount
>
> then we convert it to:
>
>	nop
>	nop
>
> The conversion to nop should not be an issue, and this is what would be
> done when the system boots up. But then we enable tracing, some low
> priority task could have been preempted after executing the first nop,
> and we call stop machine to do the conversions (if no stop machine, then
> lets just say a higher prio task is running while we do the
> conversions). Then we add both the push lr and call back. But when that
> lower priority task gets scheduled in again, it would have looked like
> it ran:
>
>	nop
>	call mcount
>
> Since the call to mcount requires that the lr was pushed, this process
> will crash when the return is done and we never saved the lr.
>
> If you don't like the push, the best thing you can do is convert to:
>
>	jmp 1f
>	call mcount
> 1:
>
> This may not be as cheap as two nops, but it may be better than a push.

I did this conversion as you suggested, but it seems it still cannot
fully solve the performance downgrade...

Here is the updated data with the arm-eabi-4.4.3 toolchain on the ARMv5
platform:

With no ftrace and no debugfs built in:                      tcp: 161/185  udp: 277/180
With no ftrace but with debugfs built in:                    tcp: 154/185  udp: 278/183
With ftrace built in (no other changes):                     tcp: 130/163  udp: 253/140
With ftrace built in but with mcount fix:                    tcp: 135/167  udp: 258/150
With ftrace built in but with mcount fix and no tracepoint:  tcp: 148/170  udp: 267/161
With ftrace built in but with no tracepoint:                 tcp: 140/165  udp: 267/157

The mcount fix refers to patching push {lr} to jmp 1f, while "no
tracepoint" means NULLing __DECLARE_TRACE, so that the tracepoints
themselves incur no penalty, as we don't have official jump label
support yet.

From the data, the jmp fix improves things by around 5~10 Mbit/s, but
even with the most optimized combination there is still a gap between
"With ftrace built in but with mcount fix and no tracepoint" and "With
no ftrace but with debugfs built in".

Do you have any further suggestions on this?

Thanks,
Lei
end of thread, other threads: [~2012-01-04 10:06 UTC | newest]

Thread overview: 7+ messages (download: mbox.gz / follow: Atom feed
-- links below jump to the message on this page --)
2011-12-29  8:38 ftrace performance impact with different configuration Lei Wen
2011-12-29 15:42 ` Rabin Vincent
2011-12-29 16:21   ` Steven Rostedt
2011-12-30 13:07     ` Philippe Rétornaz
2011-12-30 22:25       ` Steven Rostedt
2011-12-30 23:31         ` Philippe Rétornaz
2012-01-04 10:06     ` Lei Wen