Hi,

Here I'd like to show you another user of the x86 insn decoder. This is the prototype patchset of the kprobes jump optimization (a.k.a. Djprobe, which I developed two years ago). I have finally rewritten it as the jump optimized probe. These patches are still under development; they support neither temporary disabling nor a debugfs interface. However, the basic functions (register/unregister/optimizing/safety check) are implemented.

These patches can be applied on the -tip tree + the following patches:
 - kprobes patches on the -mm tree (attached to this mail)
and the patches which I sent last week:
 - x86: instruction decoder API
 - x86: kprobes checks safeness of insertion address.

So, this is another example of the x86 instruction decoder.

(Andrew, I ported some of the -mm patches to the -tip tree just to prevent the source code from forking. This should be done on -tip, because the x86 instruction decoder has been discussed on -tip.)

Jump Optimized Kprobes
======================
o What is jump optimization?
 Kprobes uses the int3 breakpoint instruction on x86 to instrument probes into the running kernel. Jump optimization allows kprobes to replace the breakpoint with a jump instruction, reducing the probing overhead drastically.

o Advantage and Disadvantage
 The advantage is process-time performance. Usually, a kprobe hit takes 0.5 to 1.0 microseconds to process. On the other hand, a jump optimized probe hit takes less than 0.1 microseconds (the actual number depends on the processor). Here are sample overheads:

 Intel(R) Xeon(R) CPU E5410 @ 2.33GHz (running at 2GHz)

                      x86-32  x86-64
 kprobe:              1.00us  1.05us
 kprobe+booster:      0.45us  0.50us
 kprobe+optimized:    0.05us  0.07us

 kretprobe:           1.77us  1.45us
 kretprobe+booster:   1.30us  0.90us
 kretprobe+optimized: 1.02us  0.40us

 However, there is a disadvantage too (the law of equivalent exchange :)), which is memory consumption.
 Jump optimization requires an optimized_kprobe data structure and a larger instruction buffer than a plain kprobe, which contains exception emulating code (push/pop registers), the copied instructions, and a jump. This data consumes 145 bytes (x86-32) of memory per probe. Briefly speaking, an optimized kprobe is 5 times faster and 3 times bigger than a kprobe. Anyway, you can choose whether to optimize your kprobes by setting KPROBE_FLAG_OPTIMIZE in the kp->flags field.

o How to use it?
 All you need to do to optimize your probe is add KPROBE_FLAG_OPTIMIZE to kp.flags before registering. E.g.

 (setup handler/addr/symbol...)
 kp->flags |= KPROBE_FLAG_OPTIMIZE;
 (register kp)

 That's all. :-)

 kprobes decodes the probed function and checks whether the target instructions can be optimized (replaced with a jump) safely. If they can't, kprobes clears KPROBE_FLAG_OPTIMIZE from kp->flags, so you can check it after registering.

o How does it work?
 A jump optimized kprobe looks like an aggregated kprobe. Before preparing the optimization, kprobe inserts the original (user-defined) kprobe at the specified address. So, even if the kprobe cannot be optimized, it just falls back to a normal kprobe.

 - Safety check
 First, kprobe decodes the whole body of the probed function and checks that there are NO indirect jumps, and no near jumps which jump into the region that will be replaced by a jump instruction (except the 1st byte of the jump), because a jump instruction landing in the middle of another instruction causes unexpected results. Kprobe also measures the length of the instructions which will be replaced by the jump instruction: because a jump instruction is longer than 1 byte, it may replace multiple instructions, and kprobe checks whether those instructions can be executed out-of-line.
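To make the safety-check conditions concrete, here is a toy user-space model of them (my own simplification, not the patchset's code; can_optimize() and JUMP_SIZE are assumed names). It checks the two conditions described above against pre-decoded instruction lengths and jump-target offsets:

```c
#include <stddef.h>

#define JUMP_SIZE 5	/* size of the relative jump replacing int3 */

/* Toy model (assumed names, not patchset code) of the safety check:
 * given the decoded instruction lengths of a function and the offsets
 * that jumps inside it target, decide whether the first JUMP_SIZE bytes
 * may be replaced by a jump.  Two conditions from the text:
 *   1. No jump may land inside the replaced region, except its 1st byte.
 *   2. The function must be long enough that whole instructions covering
 *      the region can be copied and executed out-of-line. */
static int can_optimize(const int *insn_len, size_t ninsn,
			const int *jump_targets, size_t ntargets)
{
	size_t i, off = 0;

	/* Walk instruction boundaries until the region is covered. */
	for (i = 0; i < ninsn && off < JUMP_SIZE; i++)
		off += insn_len[i];
	if (off < JUMP_SIZE)	/* function shorter than the jump */
		return 0;

	/* Reject jumps into the middle of the replaced region. */
	for (i = 0; i < ntargets; i++)
		if (jump_targets[i] > 0 && jump_targets[i] < JUMP_SIZE)
			return 0;
	return 1;
}
```

The real check additionally rejects indirect jumps outright, since their targets cannot be known at decode time.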
 - Preparing detour code
 Next, kprobe prepares a "detour" buffer, which contains exception emulating code (push/pop registers, call handler), the copied instructions (kprobes copies the instructions which will be replaced by the jump into the detour buffer), and a jump which jumps back to the original execution path.

 - Pre-optimization
 After preparing the detour code, kprobe kicks the kprobe-optimizer workqueue to optimize the kprobe. To wait for other optimized_kprobes, the kprobe optimizer delays its work. When the optimized_kprobe is hit before optimization, its handler changes the IP (instruction pointer) to the detour code and exits. So, the instructions which were copied to the detour buffer are not executed.

 - Optimization
 The kprobe-optimizer doesn't start replacing instructions right away; it waits for synchronize_sched() for safety, because some processors may have been interrupted on the instructions which will be replaced by the jump instruction. As you know, synchronize_sched() can ensure that all interruptions which were executing when it was called are done, but only if CONFIG_PREEMPT=n. So, this version supports only kernels with CONFIG_PREEMPT=n.(*)
 After that, the kprobe-optimizer replaces the 4 bytes right after the int3 breakpoint with the relative-jump destination and synchronizes the caches on all processors. Next, it replaces the int3 with the relative-jump opcode, and synchronizes the caches again.

 (*) This optimization-safety check may be replaced with the stop-machine method, as ksplice does, to support CONFIG_PREEMPT=y kernels.

 arch/Kconfig                   |   11 +
 arch/x86/Kconfig               |    1 +
 arch/x86/include/asm/kprobes.h |   25 ++-
 arch/x86/kernel/kprobes.c      |  483 +++++++++++++++++++++++++++++++++-------
 include/linux/kprobes.h        |   25 ++
 kernel/kprobes.c               |  294 ++++++++++++++++++++-----
 6 files changed, 707 insertions(+), 132 deletions(-)

NOTE: As I said, the attached patches are just ported from the -mm tree, so they are NOT included in the above statistics.
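The two-phase replacement in the Optimization step can be sketched in user space like this (a simulation under my own names; "text" stands in for the probed kernel code and sync_caches() for the cross-CPU cache synchronization). The point it illustrates: between the two phases, byte 0 is still int3, so any CPU reaching the probe address still traps safely while the destination bytes are being written.

```c
#include <stdint.h>
#include <string.h>

/* User-space simulation (assumed names, illustration only) of the
 * two-phase text patching described above.  Byte 0 already holds the
 * int3 breakpoint (0xCC) planted by the kprobe; bytes 1..4 are the
 * tails of the original instructions. */
static uint8_t text[5] = { 0xCC, 0x55, 0x48, 0x89, 0xE5 };

static void sync_caches(void)
{
	/* stands in for synchronizing caches on all processors */
}

static void optimize(uint32_t probe_addr, uint32_t detour_addr)
{
	int32_t rel = (int32_t)(detour_addr - (probe_addr + 5));

	/* Phase 1: write the rel32 destination behind the live int3.
	 * Any CPU reaching probe_addr still hits the breakpoint. */
	memcpy(&text[1], &rel, 4);
	sync_caches();

	/* Phase 2: only now turn the int3 into the jump opcode,
	 * atomically switching execution to the complete jump. */
	text[0] = 0xE9;
	sync_caches();
}
```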
Thank you,

-- 
Masami Hiramatsu

Software Engineer
Hitachi Computer Products (America) Inc.
Software Solutions Division

e-mail: mhiramat@redhat.com