From: masami.hiramatsu.pt@hitachi.com (Masami Hiramatsu)
To: linux-arm-kernel@lists.infradead.org
Subject: [PATCH v10 2/2] ARM: kprobes: enable OPTPROBES for ARM 32
Date: Fri, 28 Nov 2014 12:12:42 +0900
Message-ID: <5477E82A.3020208@hitachi.com>
In-Reply-To: <1417099007.2041.6.camel@linaro.org>

(2014/11/27 23:36), Jon Medhurst (Tixy) wrote:
> On Fri, 2014-11-21 at 14:35 +0800, Wang Nan wrote:
>> This patch introduces kprobeopt for ARM 32.
> 
> If I've understood things correctly, this is a feature which inserts
> probes by using a branch instruction to some trampoline code rather than
> using an undefined instruction as a breakpoint. That way we avoid the
> overhead of processing the exception and it is this performance
> improvement which is the main/only reason for implementing it?
> 
> If so, I thought it good to see what kind of improvement we get by
> running the micro benchmarks in the kprobes test code. On an A7/A15
> big.LITTLE vexpress board the approximate figures I get are 0.3us for
> optimised probe, 1us for un-optimised, so a three times performance
> improvement. This is with an empty probe pre-handler and no post
> handler, so with a more realistic usecase, the relative improvement we
> get from optimisation would be less.

Indeed, I think we'd better use ftrace to measure performance, since
it is the most realistic usecase. On x86 we see similar numbers,
and ftrace itself takes 0.3-0.4us to record an event. So I guess
the overall gain would be about 2 times. (Of course it depends on the
SoC, because memory bandwidth is the key to event-recording performance.)
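
As a rough illustration of the arithmetic behind that "2 times" guess, the
sketch below adds a fixed handler cost to both probe overheads and takes the
ratio. The function name is hypothetical, and the inputs are only the
approximate figures quoted in this thread (1us un-optimised, 0.3us optimised,
0.3-0.4us for ftrace to record an event), not new measurements.

```c
/*
 * Back-of-the-envelope estimate of the end-to-end speedup from optprobes.
 * All values are in microseconds. optprobe_speedup() is a hypothetical
 * helper for illustration, not part of the kernel.
 */
static double optprobe_speedup(double unopt_us, double opt_us,
			       double handler_us)
{
	/* Per-hit cost = probe entry overhead + time spent in the handler. */
	return (unopt_us + handler_us) / (opt_us + handler_us);
}
```

With an empty handler, optprobe_speedup(1.0, 0.3, 0.0) gives a bit over 3,
matching the micro benchmark. With a handler cost of 0.35us (the middle of
the ftrace range) the ratio drops to roughly 2.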


> I thought it good to see what sort of benefits this code achieves,
> especially as it could grow quite complex over time, and the cost of
> that versus the benefit should be considered.

I don't think it's so complex. It's actually cleanly separated.
However, the ARM tree should have an arch/arm/kernel/kprobe/ directory,
since there are too many kprobe-related files under arch/arm/kernel/ ...


>>
>> Limitations:
>>  - Currently only kernel compiled with ARM ISA is supported.
> 
> Supporting Thumb will be very difficult because I don't believe that
> putting a branch into an IT block could be made to work, and you can't
> feasibly know if an instruction is in an IT block other than by first
> using something like the breakpoint probe method and then when that is
> hit examine the IT flags to see if they're set. If they aren't you could
> then change the probe to an optimised probe. Is transforming the probe
> type like that currently supported by the generic kprobes code?

The optprobe framework optimizes probes transparently. If a probe cannot
be optimized, the framework simply leaves it as it is.


> Also, the Thumb branch instruction can only jump half as far as the ARM
> mode one. And being 32-bits when a lot of instructions people will want
> to probe are 16-bits will be an additional problem, similar to the one
> identified below for ARM instructions...
> 
> 
>>
>>  - Offset between probe point and optinsn slot must not be larger than
>>    32MiB.
> 
> 
> I see that elsewhere [1] people are working on supporting loading kernel
> modules at locations that are out of the range of a branch instruction,
> I guess because with multi-platform kernels and general code bloat
> kernels are getting too big. The same reasons would impact the usability
> of optimized kprobes as well if they're restricted to the range of a
> single branch instruction.
> 
> [1] http://lists.infradead.org/pipermail/linux-arm-kernel/2014-November/305539.html
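
For reference, the 32MiB limit follows from the ARM-mode 'b' encoding: a
signed 24-bit word offset, applied relative to the branch address plus 8.
The sketch below is a user-space illustration of that range check and
encoding, not code taken from the patch; the helper names and macros are
hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>

/* ARM-mode B with condition AL; imm24 occupies the low 24 bits. */
#define ARM_B_AL	0xea000000u
/* Reachable byte offsets relative to (branch address + 8). */
#define ARM_B_MIN_OFF	(-0x02000000LL)	/* -32MiB */
#define ARM_B_MAX_OFF	(0x01fffffcLL)	/* +32MiB - 4 */

/* Hypothetical helper: can a single 'b' at 'from' reach 'to'? */
static bool arm_branch_in_range(uint32_t from, uint32_t to)
{
	int64_t off = (int64_t)to - ((int64_t)from + 8);

	return (off % 4) == 0 && off >= ARM_B_MIN_OFF && off <= ARM_B_MAX_OFF;
}

/* Hypothetical helper: encode 'b to' placed at 'from'; false if out of range. */
static bool arm_encode_branch(uint32_t from, uint32_t to, uint32_t *insn)
{
	int64_t off = (int64_t)to - ((int64_t)from + 8);

	if (!arm_branch_in_range(from, to))
		return false;
	/* off is a multiple of 4, so the division is exact even when negative. */
	*insn = ARM_B_AL | ((uint32_t)(off / 4) & 0x00ffffffu);
	return true;
}
```

If the optinsn slot lies outside this window, a single-word replacement
cannot reach it, which is why the restriction matters as kernels grow.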
> 
> 
>>  Masami Hiramatsu suggests replacing 2 words, but that will make
>>    things complex. A further patch can make such an optimization.
> 
> I'm wondering how can we replace 2 words if we can't determine if the
> second word is the target of a branch instruction?

On x86, we already have an instruction decoder for finding
branch targets :). But yes, it can be impossible on other architectures
if the code makes heavy use of indirect branches.

> E.g. if we had
> 
> 		b	after_probe
> 		...
> probe_me:	mov	r2, #0
> after_probe:	ldr	r0, [r1]
> 
> and we inserted a two word probe at probe_me, then the branch to
> after_probe would be to the second half of that 2 word probe. Guess that
> could be worked around by ensuring the 2nd word is an invalid
> instruction and trapping that case then emulating after_probe like we do
> unoptimised probes. This assumes that we can come up with an
> encoding for a 2 word 'long branch' that was suitable. (For Thumb, I
> suspect that we would need at least 3 16-bit instructions to achieve
> that).
> 
> As the commit message says "will make things complex" and I begin to
> wonder if the extra complexity would be worth the benefits. (Considering
> that the resulting optimised probe would only be around twice as fast.)
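
To make the problem concrete, here is a minimal sketch of scanning a block
of ARM-mode code for direct branches that land on a given address, such as
the second word of a two-word probe. It only decodes B/BL with an immediate
offset, so indirect branches (and anything outside the scanned block) would
be missed. All function names are hypothetical, and this is an illustration
rather than the decoder used by the kprobes code.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/*
 * True for ARM-mode B/BL (bits 27:25 == 0b101) with a normal condition
 * code. cond == 0xf is the BLX (immediate) encoding and is skipped here.
 */
static bool arm_is_direct_branch(uint32_t insn)
{
	return (insn & 0x0e000000u) == 0x0a000000u && (insn >> 28) != 0xfu;
}

/* Target of a B/BL located at 'addr': PC (addr + 8) + sign-extended imm24 * 4. */
static uint32_t arm_branch_target(uint32_t insn, uint32_t addr)
{
	int32_t imm24 = (int32_t)(insn & 0x00ffffffu);

	if (imm24 & 0x00800000)
		imm24 -= 0x01000000;	/* sign-extend from 24 bits */
	return addr + 8u + (uint32_t)imm24 * 4u;
}

/* Does any direct branch in code[0..n) (loaded at 'base') jump to 'target'? */
static bool direct_branch_targets(const uint32_t *code, size_t n,
				  uint32_t base, uint32_t target)
{
	for (size_t i = 0; i < n; i++) {
		uint32_t addr = base + (uint32_t)(i * 4);

		if (arm_is_direct_branch(code[i]) &&
		    arm_branch_target(code[i], addr) == target)
			return true;
	}
	return false;
}
```

Applied to the example above, a scan would report that after_probe is a
branch target, so a two-word probe at probe_me could not simply overwrite
it. A negative result from such a scan proves nothing on its own, because
of the indirect-branch case.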
> 
> 
>>
>> Kprobe opt on ARM is relatively simpler than kprobe opt on x86 because
>> ARM instructions are always 4-byte aligned and 4 bytes long. This patch
>> replaces the probed instruction with a 'b', which branches to trampoline
>> code that then calls optimized_callback(). optimized_callback() calls
>> opt_pre_handler() to execute the kprobe handler. It also
>> emulates/simulates the replaced instruction.
>>
>> When unregistering a kprobe, the deferred manner of the unoptimizer may
>> leave the branch instruction in place before the optimizer is called.
>> Unlike x86_64, which only copies the probed insns after
>> optprobe_template_end and re-executes them, this patch calls singlestep
>> to emulate/simulate the insn directly. A further patch can optimize this
>> behavior.
>>
>> Signed-off-by: Wang Nan <wangnan0@huawei.com>
>> Acked-by: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
>> Cc: Jon Medhurst (Tixy) <tixy@linaro.org>
>> Cc: Russell King - ARM Linux <linux@arm.linux.org.uk>
>> Cc: Will Deacon <will.deacon@arm.com>
>>
>> ---
> 
> I initially had some trouble testing this. I tried running the kprobes
> test code with some printf's added to the code and it seems that only
> very rarely are optimised probes actually executed. This turned out to
> be due to the optimization being run as a background task after a delay.
> So I ended up hacking kernel/kprobes.c to force some calls to
> wait_for_kprobe_optimizer(). It would be nice to have the test code to
> robustly cover both optimised and unoptimised cases but that would need
> some new exported functions from the generic kprobes code, not sure what
> people think of that idea?

Hm, did you use ftrace's kprobe events?
You can actually add kprobes via /sys/kernel/debug/tracing/kprobe_events and
see which kprobes are optimized via /sys/kernel/debug/kprobes/list.

For more information, please refer to
 Documentation/trace/kprobetrace.txt
 Documentation/kprobes.txt
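
As a small sketch of how a test could check this from user space: optimized
probes carry an [OPTIMIZED] tag in /sys/kernel/debug/kprobes/list, as
described in Documentation/kprobes.txt. The helpers below only scan text for
that tag. Their names are hypothetical, and opening the file is assumed to
need root with debugfs mounted.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: does one line of kprobes/list carry the tag? */
static int kprobe_line_is_optimized(const char *line)
{
	return strstr(line, "[OPTIMIZED]") != NULL;
}

/*
 * Hypothetical helper: count optimized probes in an already opened
 * stream. Returns -1 if the stream is NULL (e.g. fopen() failed).
 * Lines are assumed to fit in the buffer; the list format is short.
 */
static int count_optimized_kprobes(FILE *fp)
{
	char line[256];
	int count = 0;

	if (!fp)
		return -1;
	while (fgets(line, sizeof(line), fp))
		if (kprobe_line_is_optimized(line))
			count++;
	return count;
}
```

A caller would pass the result of fopen("/sys/kernel/debug/kprobes/list", "r")
and compare the count before and after the optimizer has had time to run.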

Thank you,



-- 
Masami HIRAMATSU
Software Platform Research Dept. Linux Technology Research Center
Hitachi, Ltd., Yokohama Research Laboratory
E-mail: masami.hiramatsu.pt at hitachi.com
