* [PATCH v2 0/8] x86: use new text_poke_bp in ftrace
@ 2013-11-08 9:12 Petr Mladek
2013-11-08 9:12 ` Petr Mladek
` (8 more replies)
0 siblings, 9 replies; 16+ messages in thread
From: Petr Mladek @ 2013-11-08 9:12 UTC (permalink / raw)
To: Steven Rostedt, Frederic Weisbecker, Masami Hiramatsu,
Paul E. McKenney, Jiri Kosina
Cc: linux-kernel, x86, Petr Mladek
The commit fd4363fff3d9 (x86: Introduce int3 (breakpoint)-based instruction
patching) uses the same technique that has been used in ftrace since 08d636b
("ftrace/x86: Have arch x86_64 use breakpoints instead of stop machine")
This patch set merge the two implementations and remove duplicities from
the ftrace side. It adds a generic function to efficiently patch more
instructions at once. It also makes the generic int3-based framework
and the ftrace code a bit faster.
The first patch speed up the existing text_poke_bp.
The next 3 patches improve the generic int3-based framework to be usable
with ftrace. All the changes are based on ideas already used in ftrace.
They are needed to keep the functionality and efficiency.
The 5th patch speedups the original ftrace code but it is useful also
with the generic functions.
The last three patches modifies different parts of the current
x86-specific ftrace implementation and use the generic function there.
Changes in v2:
+ check for number of CPUs instead of enabling IRQs when syncing CPUs;
suggested by Steven Rostedt, Paul E. McKenney, and Masami Hiramatsu
+ return error codes in text_poke_part and text_poke_bp; needed by ftrace
+ reverted changes in text_poke_bp; it patches only single address again
+ introduce text_poke_bp_iter for patching multiple addresses:
+ uses iterator and callbacks instead of copying data
+ checks old code before patching
+ returns error code and more info about error; needed by ftrace
+ provides recovery mechanism in case of errors
+ update ftrace.c to use the new text_poke_bp_iter
+ split notrace __probe_kernel_read into separate patch because it
is useful for original ftrace code as well
+ rebased on current kernel tip and updated performance statistics;
it started to be slower on idle machine after the commit commit
c229828ca6bc62d6c654 (rcu: Throttle rcu_try_advance_all_cbs() execution)
Unfortunately, the size of the code is almost the same. But most of it is
generic and can be reused.
Also I tried to switch between 7 tracers: blk, branch, function_graph, wakeup_rt,
irqsoff, function, and nop. Every tracer has also been enabled and disabled.
With 500 cycles, I got these times before the change:
real 18m2.477s 18m8.654s 18m9.196s
user 0m0.008s 0m0.008s 0m0.012s
sys 0m17.316s 0m17.104s 0m17.300s
and after
real 17m9.753s 17m12.272s 17m11.424s
user 0m0.004s 0m0.004s 0m0.004s
sys 0m18.244s 0m18.252s 0m18.308s
The patches are against kernel/git/tip/tip.git on top of the commit
af7949e870d4632b Merge branch 'tools/kvm perf fixes'
Petr Mladek (8):
x86: speed up int3-based patching using less paranoid write
x86: return error code in text_poke_bp
x86: allow to call text_poke_bp during boot
x86: add generic function to modify more calls using int3 framework
x86: do not trace __probe_kernel_read
x86: modify ftrace function using the new int3-based framework
x86: patch all traced function calls using the int3-based framework
x86: enable/disable ftrace graph call using new int3-based framework
arch/x86/include/asm/alternative.h | 39 ++-
arch/x86/kernel/alternative.c | 321 +++++++++++++++++++--
arch/x86/kernel/ftrace.c | 571 ++++++++-----------------------------
arch/x86/kernel/traps.c | 10 -
include/linux/ftrace.h | 6 -
mm/maccess.c | 2 +-
6 files changed, 462 insertions(+), 487 deletions(-)
--
1.8.4
^ permalink raw reply [flat|nested] 16+ messages in thread
* [PATCH v2 0/8] x86: use new text_poke_bp in ftrace
2013-11-08 9:12 [PATCH v2 0/8] x86: use new text_poke_bp in ftrace Petr Mladek
@ 2013-11-08 9:12 ` Petr Mladek
2013-11-08 9:12 ` [PATCH v2 1/8] x86: speed up int3-based patching using less paranoid write Petr Mladek
` (7 subsequent siblings)
8 siblings, 0 replies; 16+ messages in thread
From: Petr Mladek @ 2013-11-08 9:12 UTC (permalink / raw)
To: Steven Rostedt, Frederic Weisbecker, Masami Hiramatsu,
Paul E. McKenney, Jiri Kosina
Cc: linux-kernel, x86, Petr Mladek
The commit fd4363fff3d9 (x86: Introduce int3 (breakpoint)-based instruction
patching) uses the same technique that has been used in ftrace since 08d636b
("ftrace/x86: Have arch x86_64 use breakpoints instead of stop machine")
This patch set merge the two implementations and remove duplicities from
the ftrace side. It adds a generic function to efficiently patch more
instructions at once. It also makes the generic int3-based framework
and the ftrace code a bit faster.
The first patch speed up the existing text_poke_bp.
The next 3 patches improve the generic int3-based framework to be usable
with ftrace. All the changes are based on ideas already used in ftrace.
They are needed to keep the functionality and efficiency.
The 5th patch speedups the original ftrace code but it is useful also
with the generic functions.
The last three patches modifies different parts of the current
x86-specific ftrace implementation and use the generic function there.
Changes in v2:
+ check for number of CPUs instead of enabling IRQs when syncing CPUs;
suggested by Steven Rostedt, Paul E. McKenney, and Masami Hiramatsu
+ return error codes in text_poke_part and text_poke_bp; needed by ftrace
+ reverted changes in text_poke_bp; it patches only single address again
+ introduce text_poke_bp_iter for patching multiple addresses:
+ uses iterator and callbacks instead of copying data
+ checks old code before patching
+ returns error code and more info about error; needed by ftrace
+ provides recovery mechanism in case of errors
+ update ftrace.c to use the new text_poke_bp_iter
+ split notrace __probe_kernel_read into separate patch because it
is useful for original ftrace code as well
+ rebased on current kernel tip and updated performance statistics;
it started to be slower on idle machine after the commit commit
c229828ca6bc62d6c654 (rcu: Throttle rcu_try_advance_all_cbs() execution)
I tried to switch between 7 tracers: blk, branch, function_graph, wakeup_rt,
irqsoff, function, and nop. Every tracer has also been enabled and disabled.
With 500 cycles, I got these times before the change:
real 18m2.477s 18m8.654s 18m9.196s
user 0m0.008s 0m0.008s 0m0.012s
sys 0m17.316s 0m17.104s 0m17.300s
and after
real 17m9.753s 17m12.272s 17m11.424s
user 0m0.004s 0m0.004s 0m0.004s
sys 0m18.244s 0m18.252s 0m18.308s
Unfortunately, the size of the code is almost the same. But most of it is
generic and can be reused.
The patches are against kernel/git/tip/tip.git on top of the commit
af7949e870d4632b Merge branch 'tools/kvm perf fixes'
Petr Mladek (8):
x86: speed up int3-based patching using less paranoid write
x86: return error code in text_poke_bp
x86: allow to call text_poke_bp during boot
x86: add generic function to modify more calls using int3 framework
x86: do not trace __probe_kernel_read
x86: modify ftrace function using the new int3-based framework
x86: patch all traced function calls using the int3-based framework
x86: enable/disable ftrace graph call using new int3-based framework
arch/x86/include/asm/alternative.h | 39 ++-
arch/x86/kernel/alternative.c | 321 +++++++++++++++++++--
arch/x86/kernel/ftrace.c | 571 ++++++++-----------------------------
arch/x86/kernel/traps.c | 10 -
include/linux/ftrace.h | 6 -
mm/maccess.c | 2 +-
6 files changed, 462 insertions(+), 487 deletions(-)
--
1.8.4
^ permalink raw reply [flat|nested] 16+ messages in thread
* [PATCH v2 1/8] x86: speed up int3-based patching using less paranoid write
2013-11-08 9:12 [PATCH v2 0/8] x86: use new text_poke_bp in ftrace Petr Mladek
2013-11-08 9:12 ` Petr Mladek
@ 2013-11-08 9:12 ` Petr Mladek
2013-11-08 12:04 ` Masami Hiramatsu
2013-11-08 9:12 ` [PATCH v2 2/8] x86: return error code in text_poke_bp Petr Mladek
` (6 subsequent siblings)
8 siblings, 1 reply; 16+ messages in thread
From: Petr Mladek @ 2013-11-08 9:12 UTC (permalink / raw)
To: Steven Rostedt, Frederic Weisbecker, Masami Hiramatsu,
Paul E. McKenney, Jiri Kosina
Cc: linux-kernel, x86, Petr Mladek
This change is inspired by the int3-based patching code used in
ftrace. See the commit fd4363fff3d9 (x86: Introduce int3
(breakpoint)-based instruction patching).
When trying to use text_poke_bp in ftrace, the result was slower than
the original implementation.
It turned out that one difference was in text_poke vs. ftrace_write.
text_poke did many extra operations to make sure that the change
was atomic.
In fact, we do not need this luxury in text_poke_bp because
the modified code is guarded by the int3 handler. The int3
interrupt itself is added and removed by an atomic operation
because we modify only one byte.
This patch adds text_poke_part which is inspired by ftrace_write.
Then it is used in text_poke_bp instead of the paranoid text_poke.
For example, I tried to switch between 7 tracers: blk, branch,
function_graph, wakeup_rt, irqsoff, function, and nop. Every tracer
has also been enabled and disabled. With 100 cycles, I got these
times with text_poke:
real 25m7.380s 25m13.44s 25m9.072s
user 0m0.004s 0m0.004s 0m0.004s
sys 0m3.480s 0m3.508s 0m3.420s
and with text_poke_part:
real 3m23.335s 3m24.422s 3m26.686s
user 0m0.004s 0m0.004s 0m0.004s
sys 0m3.724s 0m3.772s 0m3.588s
Signed-off-by: Petr Mladek <pmladek@suse.cz>
---
arch/x86/kernel/alternative.c | 49 ++++++++++++++++++++++++++++++++++++++-----
1 file changed, 44 insertions(+), 5 deletions(-)
diff --git a/arch/x86/kernel/alternative.c b/arch/x86/kernel/alternative.c
index df94598..0586dc1 100644
--- a/arch/x86/kernel/alternative.c
+++ b/arch/x86/kernel/alternative.c
@@ -8,6 +8,7 @@
#include <linux/kprobes.h>
#include <linux/mm.h>
#include <linux/vmalloc.h>
+#include <linux/uaccess.h>
#include <linux/memory.h>
#include <linux/stop_machine.h>
#include <linux/slab.h>
@@ -586,6 +587,44 @@ void *__kprobes text_poke(void *addr, const void *opcode, size_t len)
return addr;
}
+/**
+ * text_poke_part - Update part of the instruction on a live kernel when using
+ * int3 breakpoint
+ * @addr: address to modify
+ * @opcode: source of the copy
+ * @len: length to copy
+ *
+ * If we patch instructions using int3 interrupt, we do not need to be so
+ * careful about an atomic write. Adding and removing the interrupt is atomic
+ * because we modify only one byte. The rest of the instruction could be
+ * modified in several steps because it is guarded by the interrupt handler.
+ * Hence we could use faster code here. See also text_poke_bp below for more
+ * details.
+ *
+ * Note: Must be called under text_mutex.
+ */
+static int text_poke_part(void *addr, const void *opcode, size_t len)
+{
+ /*
+ * On x86_64, kernel text mappings are mapped read-only with
+ * CONFIG_DEBUG_RODATA. So we use the kernel identity mapping instead
+ * of the kernel text mapping to modify the kernel text.
+ *
+ * For 32bit kernels, these mappings are same and we can use
+ * kernel identity mapping to modify code.
+ */
+ if (((unsigned long)addr >= (unsigned long)_text) &&
+ ((unsigned long)addr < (unsigned long)_etext))
+ addr = __va(__pa_symbol(addr));
+
+ if (probe_kernel_write(addr, opcode, len))
+ return -EPERM;
+
+ sync_core();
+
+ return 0;
+}
+
static void do_sync_core(void *info)
{
sync_core();
@@ -648,15 +687,15 @@ void *text_poke_bp(void *addr, const void *opcode, size_t len, void *handler)
*/
smp_wmb();
- text_poke(addr, &int3, sizeof(int3));
+ text_poke_part(addr, &int3, sizeof(int3));
on_each_cpu(do_sync_core, NULL, 1);
if (len - sizeof(int3) > 0) {
/* patch all but the first byte */
- text_poke((char *)addr + sizeof(int3),
- (const char *) opcode + sizeof(int3),
- len - sizeof(int3));
+ text_poke_part((char *)addr + sizeof(int3),
+ (const char *) opcode + sizeof(int3),
+ len - sizeof(int3));
/*
* According to Intel, this core syncing is very likely
* not necessary and we'd be safe even without it. But
@@ -666,7 +705,7 @@ void *text_poke_bp(void *addr, const void *opcode, size_t len, void *handler)
}
/* patch the first byte */
- text_poke(addr, opcode, sizeof(int3));
+ text_poke_part(addr, opcode, sizeof(int3));
on_each_cpu(do_sync_core, NULL, 1);
--
1.8.4
^ permalink raw reply related [flat|nested] 16+ messages in thread
* [PATCH v2 2/8] x86: return error code in text_poke_bp
2013-11-08 9:12 [PATCH v2 0/8] x86: use new text_poke_bp in ftrace Petr Mladek
2013-11-08 9:12 ` Petr Mladek
2013-11-08 9:12 ` [PATCH v2 1/8] x86: speed up int3-based patching using less paranoid write Petr Mladek
@ 2013-11-08 9:12 ` Petr Mladek
2013-11-08 12:36 ` Masami Hiramatsu
2013-11-08 14:53 ` Steven Rostedt
2013-11-08 9:12 ` [PATCH v2 3/8] x86: allow to call text_poke_bp during boot Petr Mladek
` (5 subsequent siblings)
8 siblings, 2 replies; 16+ messages in thread
From: Petr Mladek @ 2013-11-08 9:12 UTC (permalink / raw)
To: Steven Rostedt, Frederic Weisbecker, Masami Hiramatsu,
Paul E. McKenney, Jiri Kosina
Cc: linux-kernel, x86, Petr Mladek
We would like to use text_poke_bp in the dynamic ftrace which want to
know about errors. For example, it informs about them in the ftrace log.
Let's return the error code instead of the address. The address was just copied
from the first parameter, so it was no extra information. The return value
has not been used anywhere yet.
There is a question whether we should recover the original opcode when
the second or third text_poke_part fails in text_poke_bp. Well, the errors
were ignored until now. It did not cause any real life problems. There is
really small chance that the first byte (int3) can be written and the other
parts of the code can not be modified. It is probably not worth the extra
complexity.
Signed-off-by: Petr Mladek <pmladek@suse.cz>
---
arch/x86/include/asm/alternative.h | 3 ++-
arch/x86/kernel/alternative.c | 18 +++++++++++++-----
2 files changed, 15 insertions(+), 6 deletions(-)
diff --git a/arch/x86/include/asm/alternative.h b/arch/x86/include/asm/alternative.h
index 0a3f9c9..f2343d8 100644
--- a/arch/x86/include/asm/alternative.h
+++ b/arch/x86/include/asm/alternative.h
@@ -226,6 +226,7 @@ extern void *text_poke_early(void *addr, const void *opcode, size_t len);
*/
extern void *text_poke(void *addr, const void *opcode, size_t len);
extern int poke_int3_handler(struct pt_regs *regs);
-extern void *text_poke_bp(void *addr, const void *opcode, size_t len, void *handler);
+extern int text_poke_bp(void *addr, const void *opcode, size_t len,
+ void *handler);
#endif /* _ASM_X86_ALTERNATIVE_H */
diff --git a/arch/x86/kernel/alternative.c b/arch/x86/kernel/alternative.c
index 0586dc1..c459e62 100644
--- a/arch/x86/kernel/alternative.c
+++ b/arch/x86/kernel/alternative.c
@@ -673,9 +673,10 @@ int poke_int3_handler(struct pt_regs *regs)
*
* Note: must be called under text_mutex.
*/
-void *text_poke_bp(void *addr, const void *opcode, size_t len, void *handler)
+int text_poke_bp(void *addr, const void *opcode, size_t len, void *handler)
{
unsigned char int3 = 0xcc;
+ int ret;
bp_int3_handler = handler;
bp_int3_addr = (u8 *)addr + sizeof(int3);
@@ -687,15 +688,19 @@ void *text_poke_bp(void *addr, const void *opcode, size_t len, void *handler)
*/
smp_wmb();
- text_poke_part(addr, &int3, sizeof(int3));
+ ret = text_poke_part(addr, &int3, sizeof(int3));
+ if (unlikely(ret))
+ goto fail;
on_each_cpu(do_sync_core, NULL, 1);
if (len - sizeof(int3) > 0) {
/* patch all but the first byte */
- text_poke_part((char *)addr + sizeof(int3),
+ ret = text_poke_part((char *)addr + sizeof(int3),
(const char *) opcode + sizeof(int3),
len - sizeof(int3));
+ if (unlikely(ret))
+ goto fail;
/*
* According to Intel, this core syncing is very likely
* not necessary and we'd be safe even without it. But
@@ -705,13 +710,16 @@ void *text_poke_bp(void *addr, const void *opcode, size_t len, void *handler)
}
/* patch the first byte */
- text_poke_part(addr, opcode, sizeof(int3));
+ ret = text_poke_part(addr, opcode, sizeof(int3));
+ if (unlikely(ret))
+ goto fail;
on_each_cpu(do_sync_core, NULL, 1);
+fail:
bp_patching_in_progress = false;
smp_wmb();
- return addr;
+ return ret;
}
--
1.8.4
^ permalink raw reply related [flat|nested] 16+ messages in thread
* [PATCH v2 3/8] x86: allow to call text_poke_bp during boot
2013-11-08 9:12 [PATCH v2 0/8] x86: use new text_poke_bp in ftrace Petr Mladek
` (2 preceding siblings ...)
2013-11-08 9:12 ` [PATCH v2 2/8] x86: return error code in text_poke_bp Petr Mladek
@ 2013-11-08 9:12 ` Petr Mladek
2013-11-08 9:12 ` [PATCH v2 4/8] x86: add generic function to modify more calls using int3 framework Petr Mladek
` (4 subsequent siblings)
8 siblings, 0 replies; 16+ messages in thread
From: Petr Mladek @ 2013-11-08 9:12 UTC (permalink / raw)
To: Steven Rostedt, Frederic Weisbecker, Masami Hiramatsu,
Paul E. McKenney, Jiri Kosina
Cc: linux-kernel, x86, Petr Mladek
We would like to use text_poke_bp in ftrace. It might be called also during
boot when there is only one CPU and we do not need to sync the others.
The check is must to have because there are also disabled interrupts during
the boot. Then the call would cause a deadlock, see the warning in
"smp_call_function_many", kernel/smp.c:371.
Note that the check might need an update when we add support for zero CPUS
or complex numbers, e.g 5i-3 CPUs.
The change is inspired by the code in arch/x86/kernel/ftrace.c.
Signed-off-by: Petr Mladek <pmladek@suse.cz>
---
arch/x86/kernel/alternative.c | 17 ++++++++++++++---
1 file changed, 14 insertions(+), 3 deletions(-)
diff --git a/arch/x86/kernel/alternative.c b/arch/x86/kernel/alternative.c
index c459e62..c28c108 100644
--- a/arch/x86/kernel/alternative.c
+++ b/arch/x86/kernel/alternative.c
@@ -630,6 +630,17 @@ static void do_sync_core(void *info)
sync_core();
}
+static void run_sync(void)
+{
+ /*
+ * We do not need to sync other cores during boot when there is only one
+ * CPU enabled. In fact, we must not because there are also disabled
+ * interrupts. The call would fail because of a potential deadlock.
+ */
+ if (num_online_cpus() != 1)
+ on_each_cpu(do_sync_core, NULL, 1);
+}
+
static bool bp_patching_in_progress;
static void *bp_int3_handler, *bp_int3_addr;
@@ -692,7 +703,7 @@ int text_poke_bp(void *addr, const void *opcode, size_t len, void *handler)
if (unlikely(ret))
goto fail;
- on_each_cpu(do_sync_core, NULL, 1);
+ run_sync();
if (len - sizeof(int3) > 0) {
/* patch all but the first byte */
@@ -706,7 +717,7 @@ int text_poke_bp(void *addr, const void *opcode, size_t len, void *handler)
* not necessary and we'd be safe even without it. But
* better safe than sorry (plus there's not only Intel).
*/
- on_each_cpu(do_sync_core, NULL, 1);
+ run_sync();
}
/* patch the first byte */
@@ -714,7 +725,7 @@ int text_poke_bp(void *addr, const void *opcode, size_t len, void *handler)
if (unlikely(ret))
goto fail;
- on_each_cpu(do_sync_core, NULL, 1);
+ run_sync();
fail:
bp_patching_in_progress = false;
--
1.8.4
^ permalink raw reply related [flat|nested] 16+ messages in thread
* [PATCH v2 4/8] x86: add generic function to modify more calls using int3 framework
2013-11-08 9:12 [PATCH v2 0/8] x86: use new text_poke_bp in ftrace Petr Mladek
` (3 preceding siblings ...)
2013-11-08 9:12 ` [PATCH v2 3/8] x86: allow to call text_poke_bp during boot Petr Mladek
@ 2013-11-08 9:12 ` Petr Mladek
2013-11-08 9:12 ` [PATCH v2 5/8] x86: do not trace __probe_kernel_read Petr Mladek
` (3 subsequent siblings)
8 siblings, 0 replies; 16+ messages in thread
From: Petr Mladek @ 2013-11-08 9:12 UTC (permalink / raw)
To: Steven Rostedt, Frederic Weisbecker, Masami Hiramatsu,
Paul E. McKenney, Jiri Kosina
Cc: linux-kernel, x86, Petr Mladek
The commit fd4363fff3d9 (x86: Introduce int3 (breakpoint)-based instruction
patching) uses the same technique that has been used in ftrace since 08d636b
("ftrace/x86: Have arch x86_64 use breakpoints instead of stop machine")
When trying to use text_poke_bp in dynamic ftrace, the process was significantly
slower than the original code. For example, I tried to switch between 7 tracers:
blk, branch, function_graph, wakeup_rt, irqsoff, function, and nop. Every tracer
has also been enabled and disabled. With 500 cycles, I got these times with the
original code:
real 16m14.390s 16m15.200s 16m19.632s
user 0m0.028s 0m0.024s 0m0.028s
sys 0m23.788s 0m23.812s 0m23.804s
and with text_poke_bp:
real 29m45.785s 29m47.504s 29m44.016
user 0m0.004s 0m0.004s 0m0.004s
sys 0m15.776s 0m16.088s 0m16.192s
It turned out that most of the extra time was spent in the call:
on_each_cpu(do_sync_core, NULL, 1)
It is needed to reset the speculative queue of decoded instructions on each CPU.
It is relatively expensive operation. The original code reduced the usage by
patching all instructions in parallel.
This commits adds text_poke_bp_iter that provides generic interface for patching
more instructions in parallel. It is basically a mix of text_poke_bp and
ftrace_replace_code. The list of addresses and opcodes is passed via
an iterator and callbacks.
We need to iterate several times but we do not need both opcodes and the address
in all stages. They are passed via callback to avoid an unnecessary computation.
The function checks the current code before patching. It might be possible to
leave this check in the ftrace code but such paranoia is useful in general.
It helps to keep the system consistent. And the old opcode is needed anyway.
When something goes wrong, it is used to get back to a valid state. Note
that an error might happen even when adding interrupt to a particular address.
The optional "finish" callback can be used to update state of the patched entry.
For example, it is needed to update the related ftrace record.
Some opcodes might already be in the final state and do not need patching.
It would be possible to detect this by comparing the old and new opcode but
there might be more a effective way. For example, ftrace could tell this by
ftrace_test_record. We pass this information via the get_opcode callbacks.
They return a valid opcode when a change is needed and NULL otherwise. Note
that we should call the "finish" callback even then the code is not really
modified.
Finally, ftrace wants to know more information about a potential failure.
The error code is passed via the return value. The failing address and
the number of the failed record is passed via the iterator structure.
Signed-off-by: Petr Mladek <pmladek@suse.cz>
---
arch/x86/include/asm/alternative.h | 36 ++++++
arch/x86/kernel/alternative.c | 251 +++++++++++++++++++++++++++++++++++--
2 files changed, 275 insertions(+), 12 deletions(-)
diff --git a/arch/x86/include/asm/alternative.h b/arch/x86/include/asm/alternative.h
index f2343d8..60c0520 100644
--- a/arch/x86/include/asm/alternative.h
+++ b/arch/x86/include/asm/alternative.h
@@ -229,4 +229,40 @@ extern int poke_int3_handler(struct pt_regs *regs);
extern int text_poke_bp(void *addr, const void *opcode, size_t len,
void *handler);
+struct text_poke_bp_iter {
+ /* information used to start iteration from the beginning */
+ void *init;
+ /* length of the patched instruction */
+ size_t len;
+ /* details about failure if any */
+ int fail_count;
+ void *fail_addr;
+ /* iteration over entries */
+ void *(*start)(void *init);
+ void *(*next)(void *prev);
+ /* callback to get patched address */
+ void *(*get_addr)(void *cur);
+ /*
+ * Callbacks to get the patched code. They could return NULL if no
+ * patching is needed; This is useful for example in ftrace.
+ */
+ const void *(*get_opcode)(void *cur);
+ const void *(*get_old_opcode)(void *cur);
+ /*
+ * Optional function that is called when the patching of the given
+ * has finished. It might be NULL if no postprocess is needed.
+ */
+ int (*finish)(void *cur);
+ /*
+ * Helper function for int3 handler. It decides whether the given IP
+ * is being patched or not.
+ *
+ * Try to implement it as fast as possible. It affects performance
+ * of the system when the patching is in progress.
+ */
+ void *(*is_handled)(const unsigned long ip);
+};
+
+extern int text_poke_bp_list(struct text_poke_bp_iter *iter);
+
#endif /* _ASM_X86_ALTERNATIVE_H */
diff --git a/arch/x86/kernel/alternative.c b/arch/x86/kernel/alternative.c
index c28c108..95e8a7f 100644
--- a/arch/x86/kernel/alternative.c
+++ b/arch/x86/kernel/alternative.c
@@ -641,8 +641,11 @@ static void run_sync(void)
on_each_cpu(do_sync_core, NULL, 1);
}
+static char bp_int3;
static bool bp_patching_in_progress;
static void *bp_int3_handler, *bp_int3_addr;
+static size_t bp_int3_len;
+static void *(*bp_int3_is_handled)(const unsigned long ip);
int poke_int3_handler(struct pt_regs *regs)
{
@@ -652,14 +655,23 @@ int poke_int3_handler(struct pt_regs *regs)
if (likely(!bp_patching_in_progress))
return 0;
- if (user_mode_vm(regs) || regs->ip != (unsigned long)bp_int3_addr)
+ if (user_mode_vm(regs))
return 0;
- /* set up the specified breakpoint handler */
- regs->ip = (unsigned long) bp_int3_handler;
+ /* Check if address is handled by text_poke_bp */
+ if (bp_int3_handler && regs->ip == (unsigned long)bp_int3_addr) {
+ regs->ip = (unsigned long)bp_int3_handler;
+ return 1;
+ }
- return 1;
+ /* Check if address is handled by text_poke_bp_list */
+ if (bp_int3_is_handled && bp_int3_is_handled(regs->ip)) {
+ /* just skip the instruction */
+ regs->ip += bp_int3_len - 1;
+ return 1;
+ }
+ return 0;
}
/**
@@ -686,11 +698,13 @@ int poke_int3_handler(struct pt_regs *regs)
*/
int text_poke_bp(void *addr, const void *opcode, size_t len, void *handler)
{
- unsigned char int3 = 0xcc;
int ret;
+ bp_int3 = 0xcc;
bp_int3_handler = handler;
- bp_int3_addr = (u8 *)addr + sizeof(int3);
+ bp_int3_addr = (u8 *)addr + sizeof(bp_int3);
+ bp_int3_len = len;
+ bp_int3_is_handled = NULL;
bp_patching_in_progress = true;
/*
* Corresponding read barrier in int3 notifier for
@@ -699,17 +713,17 @@ int text_poke_bp(void *addr, const void *opcode, size_t len, void *handler)
*/
smp_wmb();
- ret = text_poke_part(addr, &int3, sizeof(int3));
+ ret = text_poke_part(addr, &bp_int3, sizeof(bp_int3));
if (unlikely(ret))
goto fail;
run_sync();
- if (len - sizeof(int3) > 0) {
+ if (len - sizeof(bp_int3) > 0) {
/* patch all but the first byte */
- ret = text_poke_part((char *)addr + sizeof(int3),
- (const char *) opcode + sizeof(int3),
- len - sizeof(int3));
+ ret = text_poke_part((char *)addr + sizeof(bp_int3),
+ (const char *) opcode + sizeof(bp_int3),
+ len - sizeof(bp_int3));
if (unlikely(ret))
goto fail;
/*
@@ -721,7 +735,7 @@ int text_poke_bp(void *addr, const void *opcode, size_t len, void *handler)
}
/* patch the first byte */
- ret = text_poke_part(addr, opcode, sizeof(int3));
+ ret = text_poke_part(addr, opcode, sizeof(bp_int3));
if (unlikely(ret))
goto fail;
@@ -734,3 +748,216 @@ fail:
return ret;
}
+static int notrace text_poke_check(void *addr, const void *opcode, size_t len)
+{
+ unsigned char actual[MAX_PATCH_LEN];
+
+ if (probe_kernel_read(actual, addr, len)) {
+ return -EFAULT;
+ }
+
+ if (memcmp(actual, opcode, len) != 0) {
+ return -EINVAL;
+ }
+
+ return 0;
+}
+
+static int notrace add_iter_breakpoint(struct text_poke_bp_iter *iterator,
+ void *iter)
+{
+ void *addr;
+ const void *old_opcode;
+ int ret = 0;
+
+ /* nope if the code is not defined */
+ old_opcode = iterator->get_old_opcode(iter);
+ if (!old_opcode)
+ return 0;
+
+ addr = iterator->get_addr(iter);
+ ret = text_poke_check(addr, old_opcode, bp_int3_len);
+
+ if (likely(!ret))
+ /* write the breakpoint */
+ ret = text_poke_part(addr, &bp_int3, sizeof(bp_int3));
+
+ return ret;
+}
+
+static int notrace update_iter_code(struct text_poke_bp_iter *iterator,
+ void *iter)
+{
+ void *addr;
+ const void *opcode;
+
+ /* nope if the code is not defined */
+ opcode = iterator->get_opcode(iter);
+ if (!opcode)
+ return 0;
+
+ /* write all but the first byte */
+ addr = iterator->get_addr(iter);
+ return text_poke_part(addr + sizeof(bp_int3),
+ opcode + sizeof(bp_int3),
+ bp_int3_len - sizeof(bp_int3));
+}
+
+static int notrace finish_iter_update(struct text_poke_bp_iter *iterator,
+ void *iter)
+{
+ void *addr;
+ const void *opcode;
+ int ret = 0;
+
+ opcode = iterator->get_opcode(iter);
+ if (opcode) {
+ /* write the first byte if defined */
+ addr = iterator->get_addr(iter);
+ ret = text_poke_part(addr, opcode, sizeof(bp_int3));
+ }
+
+ /* Patching has finished. Let's update the status flag if any. */
+ if (!ret && iterator->finish)
+ ret = iterator->finish(iter);
+
+ return ret;
+}
+
+static int notrace recover_iter(struct text_poke_bp_iter *iterator,
+ void *iter)
+{
+ void *addr;
+ const void *old_opcode;
+ unsigned char actual[MAX_PATCH_LEN];
+ int ret = 0;
+
+ /* If opcode is not defined, we did not change this address at all */
+ old_opcode = iterator->get_old_opcode(iter);
+ if(!old_opcode)
+ return 0;
+
+ addr = iterator->get_addr(iter);
+ if (probe_kernel_read(actual, addr, bp_int3_len))
+ return -EFAULT;
+
+ /* We are done if there is no interrupt */
+ if (actual[0] != bp_int3)
+ return 0;
+
+ /* Put the old code back behind the interrupt if it is not there */
+ if (memcmp(actual + sizeof(bp_int3),
+ old_opcode + sizeof(bp_int3),
+ bp_int3_len - sizeof(bp_int3)) != 0) {
+ ret = text_poke_part(addr + sizeof(bp_int3),
+ old_opcode + sizeof(bp_int3),
+ bp_int3_len - sizeof(bp_int3));
+ /*
+ * It is not optimal to sync for each repaired address.
+ * But this is a corner case and we are in bad state anyway.
+ * Note that this is not needed on Intel, see text_poke_bp
+ */
+ run_sync();
+ }
+
+ /* Finally, put back the first byte from the old code */
+ if (!ret) {
+ ret = text_poke_part(addr, old_opcode, sizeof(bp_int3));
+ }
+
+ return ret;
+}
+
+#define for_text_poke_bp_iter(iterator,iter) \
+ for (iter = iterator->start(iterator->init); \
+ iter; \
+ iter = iterator->next(iter))
+
+/**
+ * text_poke_bp_list() -- update instructions on live kernel on SMP
+ * @iterator: structure that allows to iterate over the patched addressed
+ * and get all the needed information
+ *
+ * Modify several instructions by using int3 breakpoint on SMP in parallel.
+ * The principle is the same as in text_poke_bp. But synchronization of all CPUs
+ * is relatively slow operation, so we need to reduce it. Hence we add interrupt
+ * for all instructions before we modify the other bytes, etc.
+ *
+ * The operation wants to keep a consistent state. If anything goes wrong,
+ * it does the best effort to restore the original state. The only exception
+ * are addresses where the patching has finished. But the caller can be informed
+ * about this via the finish callback.
+ *
+ * It is a bit more paranoid than text_poke_bp because it checks the actual
+ * code before patching. This is a good practice proposed by the ftrace code.
+ *
+ * Note: must be called under text_mutex.
+ */
+int text_poke_bp_list(struct text_poke_bp_iter *iterator)
+{
+ void *iter;
+ const char *report = "adding breakpoints";
+ int count = 0;
+ int ret = 0;
+
+ bp_int3 = 0xcc;
+ bp_int3_addr = NULL;
+ bp_int3_len = iterator->len;
+ bp_int3_handler = NULL;
+ bp_int3_is_handled = iterator->is_handled;
+ bp_patching_in_progress = true;
+
+ smp_wmb();
+
+ /* add interrupts */
+ for_text_poke_bp_iter(iterator, iter) {
+ ret = add_iter_breakpoint(iterator, iter);
+ if (ret)
+ goto fail;
+ count++;
+ }
+ run_sync();
+
+ report = "updating code";
+ if (bp_int3_len - sizeof(bp_int3) > 0) {
+ count = 0;
+ /* patch all but the first byte */
+ for_text_poke_bp_iter(iterator, iter) {
+ ret = update_iter_code(iterator, iter);
+ if (ret)
+ goto fail;
+ count++;
+ }
+ run_sync();
+ }
+
+ /* patch the first byte */
+ report = "finishing update";
+ count = 0;
+ for_text_poke_bp_iter(iterator, iter) {
+ ret = finish_iter_update(iterator, iter);
+ if (ret)
+ goto fail;
+ count++;
+ }
+ run_sync();
+
+fail:
+ if (unlikely(ret)) {
+ printk(KERN_WARNING "text_poke_bp_iter failed on %s (%d):\n",
+ report, count);
+ /* save information about the failure */
+ iterator->fail_count = count;
+ iterator->fail_addr = iterator->get_addr(iter);
+ /* try to recover the old state */
+ for_text_poke_bp_iter(iterator, iter) {
+ recover_iter(iterator, iter);
+ }
+ run_sync();
+ }
+
+ bp_patching_in_progress = false;
+ smp_wmb();
+
+ return ret;
+}
--
1.8.4
^ permalink raw reply related [flat|nested] 16+ messages in thread
* [PATCH v2 5/8] x86: do not trace __probe_kernel_read
2013-11-08 9:12 [PATCH v2 0/8] x86: use new text_poke_bp in ftrace Petr Mladek
` (4 preceding siblings ...)
2013-11-08 9:12 ` [PATCH v2 4/8] x86: add generic function to modify more calls using int3 framework Petr Mladek
@ 2013-11-08 9:12 ` Petr Mladek
2013-11-08 9:12 ` [PATCH v2 6/8] x86: modify ftrace function using the new int3-based framework Petr Mladek
` (2 subsequent siblings)
8 siblings, 0 replies; 16+ messages in thread
From: Petr Mladek @ 2013-11-08 9:12 UTC (permalink / raw)
To: Steven Rostedt, Frederic Weisbecker, Masami Hiramatsu,
Paul E. McKenney, Jiri Kosina
Cc: linux-kernel, x86, Petr Mladek
probe_kernel_read is used when modifying function calls in ftrace_replace_code,
see arch/x86/kernel/ftrace.c. On x86, the code is replaced using int3 guard.
All functions are patched in parallel to reduce an expensive synchronization
of all CPUs. The result is that all affected functions are called via
ftrace_int3_handler during the process.
ftrace_int3_handler is relatively slow because it has to check whether
it is responsible for the handled IP. Therefore we should not modify
functions that used during patching. It would slowdown the patching.
I tried to switch between 7 tracers: blk, branch, function_graph, wakeup_rt,
irqsoff, function, and nop. Every tracer has also been enabled and disabled.
With 500 cycles, I got these times before this commit:
real 18m2.477s 18m8.654s 18m9.196s
user 0m0.008s 0m0.008s 0m0.012s
sys 0m17.316s 0m17.104s 0m17.300s
and after this commit:
real 16m14.390s 16m15.200s 16m19.632s
user 0m0.028s 0m0.024s 0m0.028s
sys 0m23.788s 0m23.812s 0m23.804s
Signed-off-by: Petr Mladek <pmladek@suse.cz>
---
mm/maccess.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/mm/maccess.c b/mm/maccess.c
index d53adf9..bed9ee8 100644
--- a/mm/maccess.c
+++ b/mm/maccess.c
@@ -18,7 +18,7 @@
long __weak probe_kernel_read(void *dst, const void *src, size_t size)
__attribute__((alias("__probe_kernel_read")));
-long __probe_kernel_read(void *dst, const void *src, size_t size)
+long notrace __probe_kernel_read(void *dst, const void *src, size_t size)
{
long ret;
mm_segment_t old_fs = get_fs();
--
1.8.4
^ permalink raw reply related [flat|nested] 16+ messages in thread
* [PATCH v2 6/8] x86: modify ftrace function using the new int3-based framework
2013-11-08 9:12 [PATCH v2 0/8] x86: use new text_poke_bp in ftrace Petr Mladek
` (5 preceding siblings ...)
2013-11-08 9:12 ` [PATCH v2 5/8] x86: do not trace __probe_kernel_read Petr Mladek
@ 2013-11-08 9:12 ` Petr Mladek
2013-11-08 9:12 ` [PATCH v2 7/8] x86: patch all traced function calls using the " Petr Mladek
2013-11-08 9:12 ` [PATCH v2 8/8] x86: enable/disable ftrace graph call using new " Petr Mladek
8 siblings, 0 replies; 16+ messages in thread
From: Petr Mladek @ 2013-11-08 9:12 UTC (permalink / raw)
To: Steven Rostedt, Frederic Weisbecker, Masami Hiramatsu,
Paul E. McKenney, Jiri Kosina
Cc: linux-kernel, x86, Petr Mladek
The commit fd4363fff3d9 (x86: Introduce int3 (breakpoint)-based instruction
patching) uses the same technique that has been used in ftrace since 08d636b
("ftrace/x86: Have arch x86_64 use breakpoints instead of stop machine")
It would be great to use the new generic implementation and clean up the x86
ftrace code a bit.
Let's start with ftrace_modify_code. It does basically the same as text_poke_bp.
In addition, it checks whether the replaced code is the expected one to avoid
problems with modules. The checking code can be split from
ftrace_modify_code_direct.
Note that we do not longer need to set modifying_ftrace_code in
ftrace_update_ftrace_func. The int3 interrupt will be handled by
poke_int3_handler.
Signed-off-by: Petr Mladek <pmladek@suse.cz>
---
arch/x86/kernel/ftrace.c | 106 ++++++++++++++++++++---------------------------
1 file changed, 46 insertions(+), 60 deletions(-)
diff --git a/arch/x86/kernel/ftrace.c b/arch/x86/kernel/ftrace.c
index 42a392a..5ade40e 100644
--- a/arch/x86/kernel/ftrace.c
+++ b/arch/x86/kernel/ftrace.c
@@ -52,6 +52,33 @@ union ftrace_code_union {
} __attribute__((packed));
};
+/*
+ * Note: Due to modules and __init, code can
+ * disappear and change, we need to protect against faulting
+ * as well as code changing. We do this by using the
+ * probe_kernel_* functions.
+ */
+static int
+ftrace_check_code(unsigned long ip, unsigned const char *expected)
+{
+ unsigned char actual[MCOUNT_INSN_SIZE];
+
+ if (probe_kernel_read(actual, (void *)ip, MCOUNT_INSN_SIZE))
+ return -EFAULT;
+
+ if (memcmp(actual, expected, MCOUNT_INSN_SIZE) != 0) {
+ WARN_ONCE(1,
+ "ftrace check failed: %x %x%x%x%x vs. %x %x%x%x%x\n",
+ actual[0],
+ actual[1], actual[2], actual[3], actual[4],
+ expected[0],
+ expected[1], expected[2], expected[3], expected[4]);
+ return -EINVAL;
+ }
+
+ return 0;
+}
+
static int ftrace_calc_offset(long ip, long addr)
{
return (int)(addr - ip);
@@ -103,28 +130,12 @@ static int
ftrace_modify_code_direct(unsigned long ip, unsigned const char *old_code,
unsigned const char *new_code)
{
- unsigned char replaced[MCOUNT_INSN_SIZE];
-
- /*
- * Note: Due to modules and __init, code can
- * disappear and change, we need to protect against faulting
- * as well as code changing. We do this by using the
- * probe_kernel_* functions.
- *
- * No real locking needed, this code is run through
- * kstop_machine, or before SMP starts.
- */
-
- /* read the text we want to modify */
- if (probe_kernel_read(replaced, (void *)ip, MCOUNT_INSN_SIZE))
- return -EFAULT;
+ int ret;
- /* Make sure it is what we expect it to be */
- if (memcmp(replaced, old_code, MCOUNT_INSN_SIZE) != 0)
- return -EINVAL;
+ ret = ftrace_check_code(ip, old_code);
/* replace the text with the new text */
- if (do_ftrace_mod_code(ip, new_code))
+ if (!ret && do_ftrace_mod_code(ip, new_code))
return -EPERM;
sync_core();
@@ -132,6 +143,22 @@ ftrace_modify_code_direct(unsigned long ip, unsigned const char *old_code,
return 0;
}
+static int
+ftrace_modify_code(unsigned long ip, unsigned const char *old_code,
+ unsigned const char *new_code)
+{
+ int ret;
+
+ ret = ftrace_check_code(ip, old_code);
+
+ /* replace the text with the new text */
+ if (!ret)
+ ret = text_poke_bp((void *)ip, new_code, MCOUNT_INSN_SIZE,
+ (void *)ip + MCOUNT_INSN_SIZE);
+
+ return ret;
+}
+
int ftrace_make_nop(struct module *mod,
struct dyn_ftrace *rec, unsigned long addr)
{
@@ -202,10 +229,6 @@ int ftrace_make_call(struct dyn_ftrace *rec, unsigned long addr)
*/
atomic_t modifying_ftrace_code __read_mostly;
-static int
-ftrace_modify_code(unsigned long ip, unsigned const char *old_code,
- unsigned const char *new_code);
-
/*
* Should never be called:
* As it is only called by __ftrace_replace_code() which is called by
@@ -230,9 +253,6 @@ int ftrace_update_ftrace_func(ftrace_func_t func)
memcpy(old, &ftrace_call, MCOUNT_INSN_SIZE);
new = ftrace_call_replace(ip, (unsigned long)func);
- /* See comment above by declaration of modifying_ftrace_code */
- atomic_inc(&modifying_ftrace_code);
-
ret = ftrace_modify_code(ip, old, new);
/* Also update the regs callback function */
@@ -243,8 +263,6 @@ int ftrace_update_ftrace_func(ftrace_func_t func)
ret = ftrace_modify_code(ip, old, new);
}
- atomic_dec(&modifying_ftrace_code);
-
return ret;
}
@@ -609,38 +627,6 @@ void ftrace_replace_code(int enable)
}
}
-static int
-ftrace_modify_code(unsigned long ip, unsigned const char *old_code,
- unsigned const char *new_code)
-{
- int ret;
-
- ret = add_break(ip, old_code);
- if (ret)
- goto out;
-
- run_sync();
-
- ret = add_update_code(ip, new_code);
- if (ret)
- goto fail_update;
-
- run_sync();
-
- ret = ftrace_write(ip, new_code, 1);
- if (ret) {
- ret = -EPERM;
- goto out;
- }
- run_sync();
- out:
- return ret;
-
- fail_update:
- probe_kernel_write((void *)ip, &old_code[0], 1);
- goto out;
-}
-
void arch_ftrace_update_code(int command)
{
/* See comment above by declaration of modifying_ftrace_code */
--
1.8.4
^ permalink raw reply related [flat|nested] 16+ messages in thread
* [PATCH v2 7/8] x86: patch all traced function calls using the int3-based framework
2013-11-08 9:12 [PATCH v2 0/8] x86: use new text_poke_bp in ftrace Petr Mladek
` (6 preceding siblings ...)
2013-11-08 9:12 ` [PATCH v2 6/8] x86: modify ftrace function using the new int3-based framework Petr Mladek
@ 2013-11-08 9:12 ` Petr Mladek
2013-11-08 9:12 ` [PATCH v2 8/8] x86: enable/disable ftrace graph call using new " Petr Mladek
8 siblings, 0 replies; 16+ messages in thread
From: Petr Mladek @ 2013-11-08 9:12 UTC (permalink / raw)
To: Steven Rostedt, Frederic Weisbecker, Masami Hiramatsu,
Paul E. McKenney, Jiri Kosina
Cc: linux-kernel, x86, Petr Mladek
Let's continue with replacing the existing Int3-based framework in ftrace with
the new generic function introduced by the commit fd4363fff3d9 (x86: Introduce
int3 (breakpoint)-based instruction patching)
This time use it in ftrace_replace_code that modifies all the watched function
calls.
If we use text_poke_bp in ftrace_make_nop, ftrace_make_call, ftrace_modify_call,
it would be possible to get rid of the x86-specific ftrace_replace_code
implementation and use the generic code.
This would be really lovely change. Unfortunately, the code would be slow
because syncing on each CPU is relatively expensive operation. For example,
I tried to switch between 7 tracers: blk, branch, function_graph, wakeup_rt,
irqsoff, function, and nop. Every tracer has also been enabled and disabled.
With 500 cycles, I got these times with the original code:
real 16m14.390s 16m15.200s 16m19.632s
user 0m0.028s 0m0.024s 0m0.028s
sys 0m23.788s 0m23.812s 0m23.804s
It slowed down after using text_poke_bp for each record separately:
real 29m45.785s 29m47.504s 29m44.016
user 0m0.004s 0m0.004s 0m0.004s
sys 0m15.776s 0m16.088s 0m16.192s
Hence, we implemented the more complex text_poke_bp_iter. When we use it,
we get the times:
real 17m9.753s 17m12.272s 17m11.424s
user 0m0.004s 0m0.004s 0m0.004s
sys 0m18.244s 0m18.252s 0m18.308s
It is slightly slower than the ftrace-specific code. Well, this is the
cost for using a generic API that can be used also in other locations.
Signed-off-by: Petr Mladek <pmladek@suse.cz>
---
---
arch/x86/kernel/alternative.c | 4 +-
arch/x86/kernel/ftrace.c | 428 +++++++-----------------------------------
arch/x86/kernel/traps.c | 10 -
include/linux/ftrace.h | 6 -
4 files changed, 75 insertions(+), 373 deletions(-)
diff --git a/arch/x86/kernel/alternative.c b/arch/x86/kernel/alternative.c
index 95e8a7f..c7c7ad8 100644
--- a/arch/x86/kernel/alternative.c
+++ b/arch/x86/kernel/alternative.c
@@ -603,7 +603,7 @@ void *__kprobes text_poke(void *addr, const void *opcode, size_t len)
*
* Note: Must be called under text_mutex.
*/
-static int text_poke_part(void *addr, const void *opcode, size_t len)
+static int notrace text_poke_part(void *addr, const void *opcode, size_t len)
{
/*
* On x86_64, kernel text mappings are mapped read-only with
@@ -647,7 +647,7 @@ static void *bp_int3_handler, *bp_int3_addr;
static size_t bp_int3_len;
static void *(*bp_int3_is_handled)(const unsigned long ip);
-int poke_int3_handler(struct pt_regs *regs)
+int notrace poke_int3_handler(struct pt_regs *regs)
{
/* bp_patching_in_progress */
smp_rmb();
diff --git a/arch/x86/kernel/ftrace.c b/arch/x86/kernel/ftrace.c
index 5ade40e..92fe8ca 100644
--- a/arch/x86/kernel/ftrace.c
+++ b/arch/x86/kernel/ftrace.c
@@ -127,23 +127,6 @@ static const unsigned char *ftrace_nop_replace(void)
}
static int
-ftrace_modify_code_direct(unsigned long ip, unsigned const char *old_code,
- unsigned const char *new_code)
-{
- int ret;
-
- ret = ftrace_check_code(ip, old_code);
-
- /* replace the text with the new text */
- if (!ret && do_ftrace_mod_code(ip, new_code))
- return -EPERM;
-
- sync_core();
-
- return 0;
-}
-
-static int
ftrace_modify_code(unsigned long ip, unsigned const char *old_code,
unsigned const char *new_code)
{
@@ -168,20 +151,7 @@ int ftrace_make_nop(struct module *mod,
old = ftrace_call_replace(ip, addr);
new = ftrace_nop_replace();
- /*
- * On boot up, and when modules are loaded, the MCOUNT_ADDR
- * is converted to a nop, and will never become MCOUNT_ADDR
- * again. This code is either running before SMP (on boot up)
- * or before the code will ever be executed (module load).
- * We do not want to use the breakpoint version in this case,
- * just modify the code directly.
- */
- if (addr == MCOUNT_ADDR)
- return ftrace_modify_code_direct(rec->ip, old, new);
-
- /* Normal cases use add_brk_on_nop */
- WARN_ONCE(1, "invalid use of ftrace_make_nop");
- return -EINVAL;
+ return ftrace_modify_code(ip, old, new);
}
int ftrace_make_call(struct dyn_ftrace *rec, unsigned long addr)
@@ -192,44 +162,10 @@ int ftrace_make_call(struct dyn_ftrace *rec, unsigned long addr)
old = ftrace_nop_replace();
new = ftrace_call_replace(ip, addr);
- /* Should only be called when module is loaded */
- return ftrace_modify_code_direct(rec->ip, old, new);
+ return ftrace_modify_code(ip, old, new);
}
/*
- * The modifying_ftrace_code is used to tell the breakpoint
- * handler to call ftrace_int3_handler(). If it fails to
- * call this handler for a breakpoint added by ftrace, then
- * the kernel may crash.
- *
- * As atomic_writes on x86 do not need a barrier, we do not
- * need to add smp_mb()s for this to work. It is also considered
- * that we can not read the modifying_ftrace_code before
- * executing the breakpoint. That would be quite remarkable if
- * it could do that. Here's the flow that is required:
- *
- * CPU-0 CPU-1
- *
- * atomic_inc(mfc);
- * write int3s
- * <trap-int3> // implicit (r)mb
- * if (atomic_read(mfc))
- * call ftrace_int3_handler()
- *
- * Then when we are finished:
- *
- * atomic_dec(mfc);
- *
- * If we hit a breakpoint that was not set by ftrace, it does not
- * matter if ftrace_int3_handler() is called or not. It will
- * simply be ignored. But it is crucial that a ftrace nop/caller
- * breakpoint is handled. No other user should ever place a
- * breakpoint on an ftrace nop/caller location. It must only
- * be done by this code.
- */
-atomic_t modifying_ftrace_code __read_mostly;
-
-/*
* Should never be called:
* As it is only called by __ftrace_replace_code() which is called by
* ftrace_replace_code() that x86 overrides, and by ftrace_update_code()
@@ -267,80 +203,6 @@ int ftrace_update_ftrace_func(ftrace_func_t func)
}
/*
- * A breakpoint was added to the code address we are about to
- * modify, and this is the handle that will just skip over it.
- * We are either changing a nop into a trace call, or a trace
- * call to a nop. While the change is taking place, we treat
- * it just like it was a nop.
- */
-int ftrace_int3_handler(struct pt_regs *regs)
-{
- if (WARN_ON_ONCE(!regs))
- return 0;
-
- if (!ftrace_location(regs->ip - 1))
- return 0;
-
- regs->ip += MCOUNT_INSN_SIZE - 1;
-
- return 1;
-}
-
-static int ftrace_write(unsigned long ip, const char *val, int size)
-{
- /*
- * On x86_64, kernel text mappings are mapped read-only with
- * CONFIG_DEBUG_RODATA. So we use the kernel identity mapping instead
- * of the kernel text mapping to modify the kernel text.
- *
- * For 32bit kernels, these mappings are same and we can use
- * kernel identity mapping to modify code.
- */
- if (within(ip, (unsigned long)_text, (unsigned long)_etext))
- ip = (unsigned long)__va(__pa_symbol(ip));
-
- return probe_kernel_write((void *)ip, val, size);
-}
-
-static int add_break(unsigned long ip, const char *old)
-{
- unsigned char replaced[MCOUNT_INSN_SIZE];
- unsigned char brk = BREAKPOINT_INSTRUCTION;
-
- if (probe_kernel_read(replaced, (void *)ip, MCOUNT_INSN_SIZE))
- return -EFAULT;
-
- /* Make sure it is what we expect it to be */
- if (memcmp(replaced, old, MCOUNT_INSN_SIZE) != 0)
- return -EINVAL;
-
- if (ftrace_write(ip, &brk, 1))
- return -EPERM;
-
- return 0;
-}
-
-static int add_brk_on_call(struct dyn_ftrace *rec, unsigned long addr)
-{
- unsigned const char *old;
- unsigned long ip = rec->ip;
-
- old = ftrace_call_replace(ip, addr);
-
- return add_break(rec->ip, old);
-}
-
-
-static int add_brk_on_nop(struct dyn_ftrace *rec)
-{
- unsigned const char *old;
-
- old = ftrace_nop_replace();
-
- return add_break(rec->ip, old);
-}
-
-/*
* If the record has the FTRACE_FL_REGS set, that means that it
* wants to convert to a callback that saves all regs. If FTRACE_FL_REGS
* is not not set, then it wants to convert to the normal callback.
@@ -366,275 +228,131 @@ static unsigned long get_ftrace_old_addr(struct dyn_ftrace *rec)
return (unsigned long)FTRACE_ADDR;
}
-static int add_breakpoints(struct dyn_ftrace *rec, int enable)
-{
- unsigned long ftrace_addr;
- int ret;
-
- ret = ftrace_test_record(rec, enable);
-
- ftrace_addr = get_ftrace_addr(rec);
-
- switch (ret) {
- case FTRACE_UPDATE_IGNORE:
- return 0;
-
- case FTRACE_UPDATE_MAKE_CALL:
- /* converting nop to call */
- return add_brk_on_nop(rec);
-
- case FTRACE_UPDATE_MODIFY_CALL_REGS:
- case FTRACE_UPDATE_MODIFY_CALL:
- ftrace_addr = get_ftrace_old_addr(rec);
- /* fall through */
- case FTRACE_UPDATE_MAKE_NOP:
- /* converting a call to a nop */
- return add_brk_on_call(rec, ftrace_addr);
- }
- return 0;
-}
+struct ftrace_tp_iter {
+ struct ftrace_rec_iter *rec_iter;
+ struct dyn_ftrace *rec;
+ int enable;
+};
-/*
- * On error, we need to remove breakpoints. This needs to
- * be done caefully. If the address does not currently have a
- * breakpoint, we know we are done. Otherwise, we look at the
- * remaining 4 bytes of the instruction. If it matches a nop
- * we replace the breakpoint with the nop. Otherwise we replace
- * it with the call instruction.
- */
-static int remove_breakpoint(struct dyn_ftrace *rec)
+static struct ftrace_tp_iter *ftrace_tp_set_record(struct ftrace_tp_iter *tp_iter)
{
- unsigned char ins[MCOUNT_INSN_SIZE];
- unsigned char brk = BREAKPOINT_INSTRUCTION;
- const unsigned char *nop;
- unsigned long ftrace_addr;
- unsigned long ip = rec->ip;
-
- /* If we fail the read, just give up */
- if (probe_kernel_read(ins, (void *)ip, MCOUNT_INSN_SIZE))
- return -EFAULT;
+ if (!tp_iter->rec_iter)
+ return NULL;
- /* If this does not have a breakpoint, we are done */
- if (ins[0] != brk)
- return -1;
-
- nop = ftrace_nop_replace();
-
- /*
- * If the last 4 bytes of the instruction do not match
- * a nop, then we assume that this is a call to ftrace_addr.
- */
- if (memcmp(&ins[1], &nop[1], MCOUNT_INSN_SIZE - 1) != 0) {
- /*
- * For extra paranoidism, we check if the breakpoint is on
- * a call that would actually jump to the ftrace_addr.
- * If not, don't touch the breakpoint, we make just create
- * a disaster.
- */
- ftrace_addr = get_ftrace_addr(rec);
- nop = ftrace_call_replace(ip, ftrace_addr);
-
- if (memcmp(&ins[1], &nop[1], MCOUNT_INSN_SIZE - 1) == 0)
- goto update;
-
- /* Check both ftrace_addr and ftrace_old_addr */
- ftrace_addr = get_ftrace_old_addr(rec);
- nop = ftrace_call_replace(ip, ftrace_addr);
-
- if (memcmp(&ins[1], &nop[1], MCOUNT_INSN_SIZE - 1) != 0)
- return -EINVAL;
- }
-
- update:
- return probe_kernel_write((void *)ip, &nop[0], 1);
+ tp_iter->rec = ftrace_rec_iter_record(tp_iter->rec_iter);
+ return tp_iter;
}
-static int add_update_code(unsigned long ip, unsigned const char *new)
+void *ftrace_tp_iter_start(void *init)
{
- /* skip breakpoint */
- ip++;
- new++;
- if (ftrace_write(ip, new, MCOUNT_INSN_SIZE - 1))
- return -EPERM;
- return 0;
+ struct ftrace_tp_iter *tp_iter = init;
+
+ tp_iter->rec_iter = ftrace_rec_iter_start();
+ return ftrace_tp_set_record(tp_iter);
}
-static int add_update_call(struct dyn_ftrace *rec, unsigned long addr)
+void *ftrace_tp_iter_next(void *cur)
{
- unsigned long ip = rec->ip;
- unsigned const char *new;
+ struct ftrace_tp_iter *tp_iter = cur;
- new = ftrace_call_replace(ip, addr);
- return add_update_code(ip, new);
+ tp_iter->rec_iter = ftrace_rec_iter_next(tp_iter->rec_iter);
+ return ftrace_tp_set_record(tp_iter);
}
-static int add_update_nop(struct dyn_ftrace *rec)
+void *ftrace_tp_iter_get_addr(void *cur)
{
- unsigned long ip = rec->ip;
- unsigned const char *new;
+ struct ftrace_tp_iter *tp_iter = cur;
- new = ftrace_nop_replace();
- return add_update_code(ip, new);
+ return (void *)(tp_iter->rec->ip);
}
-static int add_update(struct dyn_ftrace *rec, int enable)
+const void *ftrace_tp_iter_get_opcode(void *cur)
{
- unsigned long ftrace_addr;
+ struct ftrace_tp_iter *tp_iter = cur;
+ unsigned long addr;
int ret;
- ret = ftrace_test_record(rec, enable);
-
- ftrace_addr = get_ftrace_addr(rec);
+ ret = ftrace_test_record(tp_iter->rec, tp_iter->enable);
switch (ret) {
- case FTRACE_UPDATE_IGNORE:
- return 0;
+ case FTRACE_UPDATE_MAKE_NOP:
+ return ftrace_nop_replace();
- case FTRACE_UPDATE_MODIFY_CALL_REGS:
- case FTRACE_UPDATE_MODIFY_CALL:
case FTRACE_UPDATE_MAKE_CALL:
- /* converting nop to call */
- return add_update_call(rec, ftrace_addr);
+ case FTRACE_UPDATE_MODIFY_CALL:
+ case FTRACE_UPDATE_MODIFY_CALL_REGS:
+ addr = get_ftrace_addr(tp_iter->rec);
+ return ftrace_call_replace(tp_iter->rec->ip, addr);
- case FTRACE_UPDATE_MAKE_NOP:
- /* converting a call to a nop */
- return add_update_nop(rec);
+ case FTRACE_UPDATE_IGNORE:
+ default: /* unknown ftrace bug */
+ return NULL;
}
-
- return 0;
-}
-
-static int finish_update_call(struct dyn_ftrace *rec, unsigned long addr)
-{
- unsigned long ip = rec->ip;
- unsigned const char *new;
-
- new = ftrace_call_replace(ip, addr);
-
- if (ftrace_write(ip, new, 1))
- return -EPERM;
-
- return 0;
-}
-
-static int finish_update_nop(struct dyn_ftrace *rec)
-{
- unsigned long ip = rec->ip;
- unsigned const char *new;
-
- new = ftrace_nop_replace();
-
- if (ftrace_write(ip, new, 1))
- return -EPERM;
- return 0;
}
-static int finish_update(struct dyn_ftrace *rec, int enable)
+const void *ftrace_tp_iter_get_old_opcode(void *cur)
{
- unsigned long ftrace_addr;
+ struct ftrace_tp_iter *tp_iter = cur;
+ unsigned long old_addr;
int ret;
- ret = ftrace_update_record(rec, enable);
-
- ftrace_addr = get_ftrace_addr(rec);
+ ret = ftrace_test_record(tp_iter->rec, tp_iter->enable);
switch (ret) {
- case FTRACE_UPDATE_IGNORE:
- return 0;
-
- case FTRACE_UPDATE_MODIFY_CALL_REGS:
- case FTRACE_UPDATE_MODIFY_CALL:
case FTRACE_UPDATE_MAKE_CALL:
- /* converting nop to call */
- return finish_update_call(rec, ftrace_addr);
+ return ftrace_nop_replace();
case FTRACE_UPDATE_MAKE_NOP:
- /* converting a call to a nop */
- return finish_update_nop(rec);
- }
+ case FTRACE_UPDATE_MODIFY_CALL:
+ case FTRACE_UPDATE_MODIFY_CALL_REGS:
+ old_addr = get_ftrace_old_addr(tp_iter->rec);
+ return ftrace_call_replace(tp_iter->rec->ip, old_addr);
- return 0;
+ case FTRACE_UPDATE_IGNORE:
+ default: /* unknown ftrace bug */
+ return NULL;
+ }
}
-static void do_sync_core(void *data)
+int ftrace_tp_iter_finish(void *cur)
{
- sync_core();
-}
+ struct ftrace_tp_iter *tp_iter = cur;
-static void run_sync(void)
-{
- int enable_irqs = irqs_disabled();
-
- /* We may be called with interrupts disbled (on bootup). */
- if (enable_irqs)
- local_irq_enable();
- on_each_cpu(do_sync_core, NULL, 1);
- if (enable_irqs)
- local_irq_disable();
+ ftrace_update_record(tp_iter->rec, tp_iter->enable);
+ return 0;
}
void ftrace_replace_code(int enable)
{
- struct ftrace_rec_iter *iter;
- struct dyn_ftrace *rec;
- const char *report = "adding breakpoints";
- int count = 0;
+ struct text_poke_bp_iter tp_bp_iter;
+ struct ftrace_tp_iter tp_iter;
int ret;
- for_ftrace_rec_iter(iter) {
- rec = ftrace_rec_iter_record(iter);
-
- ret = add_breakpoints(rec, enable);
- if (ret)
- goto remove_breakpoints;
- count++;
- }
-
- run_sync();
-
- report = "updating code";
-
- for_ftrace_rec_iter(iter) {
- rec = ftrace_rec_iter_record(iter);
-
- ret = add_update(rec, enable);
- if (ret)
- goto remove_breakpoints;
- }
+ tp_iter.enable = enable;
- run_sync();
+ tp_bp_iter.init = (void *)&tp_iter;
+ tp_bp_iter.len = MCOUNT_INSN_SIZE;
+ tp_bp_iter.start = ftrace_tp_iter_start;
+ tp_bp_iter.next = ftrace_tp_iter_next;
+ tp_bp_iter.get_addr = ftrace_tp_iter_get_addr;
+ tp_bp_iter.get_opcode = ftrace_tp_iter_get_opcode;
+ tp_bp_iter.get_old_opcode = ftrace_tp_iter_get_old_opcode;
+ tp_bp_iter.finish = ftrace_tp_iter_finish;
+ tp_bp_iter.is_handled = (void *(*)(const unsigned long))ftrace_location;
- report = "removing breakpoints";
+ ret = text_poke_bp_list(&tp_bp_iter);
- for_ftrace_rec_iter(iter) {
- rec = ftrace_rec_iter_record(iter);
-
- ret = finish_update(rec, enable);
- if (ret)
- goto remove_breakpoints;
- }
-
- run_sync();
-
- return;
-
- remove_breakpoints:
- ftrace_bug(ret, rec ? rec->ip : 0);
- printk(KERN_WARNING "Failed on %s (%d):\n", report, count);
- for_ftrace_rec_iter(iter) {
- rec = ftrace_rec_iter_record(iter);
- remove_breakpoint(rec);
- }
+ if (ret)
+ ftrace_bug(ret, (unsigned long)(tp_bp_iter.fail_addr));
}
+/*
+ * The code is modified using Int3 guard,
+ * so we do not need to call the stop machine
+ */
void arch_ftrace_update_code(int command)
{
- /* See comment above by declaration of modifying_ftrace_code */
- atomic_inc(&modifying_ftrace_code);
-
ftrace_modify_all_code(command);
-
- atomic_dec(&modifying_ftrace_code);
}
int __init ftrace_dyn_arch_init(void *data)
diff --git a/arch/x86/kernel/traps.c b/arch/x86/kernel/traps.c
index 729aa77..16bf450 100644
--- a/arch/x86/kernel/traps.c
+++ b/arch/x86/kernel/traps.c
@@ -50,7 +50,6 @@
#include <asm/processor.h>
#include <asm/debugreg.h>
#include <linux/atomic.h>
-#include <asm/ftrace.h>
#include <asm/traps.h>
#include <asm/desc.h>
#include <asm/i387.h>
@@ -319,15 +318,6 @@ dotraplinkage void __kprobes notrace do_int3(struct pt_regs *regs, long error_co
{
enum ctx_state prev_state;
-#ifdef CONFIG_DYNAMIC_FTRACE
- /*
- * ftrace must be first, everything else may cause a recursive crash.
- * See note by declaration of modifying_ftrace_code in ftrace.c
- */
- if (unlikely(atomic_read(&modifying_ftrace_code)) &&
- ftrace_int3_handler(regs))
- return;
-#endif
if (poke_int3_handler(regs))
return;
diff --git a/include/linux/ftrace.h b/include/linux/ftrace.h
index 9f15c00..40ec039 100644
--- a/include/linux/ftrace.h
+++ b/include/linux/ftrace.h
@@ -383,12 +383,6 @@ struct ftrace_rec_iter *ftrace_rec_iter_start(void);
struct ftrace_rec_iter *ftrace_rec_iter_next(struct ftrace_rec_iter *iter);
struct dyn_ftrace *ftrace_rec_iter_record(struct ftrace_rec_iter *iter);
-#define for_ftrace_rec_iter(iter) \
- for (iter = ftrace_rec_iter_start(); \
- iter; \
- iter = ftrace_rec_iter_next(iter))
-
-
int ftrace_update_record(struct dyn_ftrace *rec, int enable);
int ftrace_test_record(struct dyn_ftrace *rec, int enable);
void ftrace_run_stop_machine(int command);
--
1.8.4
^ permalink raw reply related [flat|nested] 16+ messages in thread
* [PATCH v2 8/8] x86: enable/disable ftrace graph call using new int3-based framework
2013-11-08 9:12 [PATCH v2 0/8] x86: use new text_poke_bp in ftrace Petr Mladek
` (7 preceding siblings ...)
2013-11-08 9:12 ` [PATCH v2 7/8] x86: patch all traced function calls using the " Petr Mladek
@ 2013-11-08 9:12 ` Petr Mladek
8 siblings, 0 replies; 16+ messages in thread
From: Petr Mladek @ 2013-11-08 9:12 UTC (permalink / raw)
To: Steven Rostedt, Frederic Weisbecker, Masami Hiramatsu,
Paul E. McKenney, Jiri Kosina
Cc: linux-kernel, x86, Petr Mladek
One more change related to replacing the existing Int3-based framework with
the new generic function introduced by the commit fd4363fff3d9
(x86: Introduce int3 (breakpoint)-based instruction patching)
ftrace_enable_ftrace_graph_caller and ftrace_disable_ftrace_graph_caller
modified the jump target directly without using the int3 guard. It worked
because writing the address was an atomic operation.
We do not really need to use the safe ftrace_modify_code here but it helps
to remove another arch-specific code. The result is more consistent
and better readable code which is might be worth the change.
Signed-off-by: Petr Mladek <pmladek@suse.cz>
---
arch/x86/kernel/ftrace.c | 63 +++++++++++-------------------------------------
1 file changed, 14 insertions(+), 49 deletions(-)
diff --git a/arch/x86/kernel/ftrace.c b/arch/x86/kernel/ftrace.c
index 92fe8ca..20d2028 100644
--- a/arch/x86/kernel/ftrace.c
+++ b/arch/x86/kernel/ftrace.c
@@ -47,7 +47,7 @@ int ftrace_arch_code_modify_post_process(void)
union ftrace_code_union {
char code[MCOUNT_INSN_SIZE];
struct {
- char e8;
+ char inst;
int offset;
} __attribute__((packed));
};
@@ -88,7 +88,7 @@ static unsigned char *ftrace_call_replace(unsigned long ip, unsigned long addr)
{
static union ftrace_code_union calc;
- calc.e8 = 0xe8;
+ calc.inst = 0xe8;
calc.offset = ftrace_calc_offset(ip + MCOUNT_INSN_SIZE, addr);
/*
@@ -98,27 +98,11 @@ static unsigned char *ftrace_call_replace(unsigned long ip, unsigned long addr)
return calc.code;
}
-static inline int
-within(unsigned long addr, unsigned long start, unsigned long end)
+static void ftrace_jump_replace(union ftrace_code_union *calc,
+ unsigned long ip, unsigned long addr)
{
- return addr >= start && addr < end;
-}
-
-static int
-do_ftrace_mod_code(unsigned long ip, const void *new_code)
-{
- /*
- * On x86_64, kernel text mappings are mapped read-only with
- * CONFIG_DEBUG_RODATA. So we use the kernel identity mapping instead
- * of the kernel text mapping to modify the kernel text.
- *
- * For 32bit kernels, these mappings are same and we can use
- * kernel identity mapping to modify code.
- */
- if (within(ip, (unsigned long)_text, (unsigned long)_etext))
- ip = (unsigned long)__va(__pa_symbol(ip));
-
- return probe_kernel_write((void *)ip, new_code, MCOUNT_INSN_SIZE);
+ calc->inst = 0xe9;
+ calc->offset = ftrace_calc_offset(ip + MCOUNT_INSN_SIZE, addr);
}
static const unsigned char *ftrace_nop_replace(void)
@@ -369,45 +353,26 @@ int __init ftrace_dyn_arch_init(void *data)
#ifdef CONFIG_DYNAMIC_FTRACE
extern void ftrace_graph_call(void);
-static int ftrace_mod_jmp(unsigned long ip,
- int old_offset, int new_offset)
-{
- unsigned char code[MCOUNT_INSN_SIZE];
-
- if (probe_kernel_read(code, (void *)ip, MCOUNT_INSN_SIZE))
- return -EFAULT;
-
- if (code[0] != 0xe9 || old_offset != *(int *)(&code[1]))
- return -EINVAL;
-
- *(int *)(&code[1]) = new_offset;
-
- if (do_ftrace_mod_code(ip, &code))
- return -EPERM;
-
- return 0;
-}
-
int ftrace_enable_ftrace_graph_caller(void)
{
unsigned long ip = (unsigned long)(&ftrace_graph_call);
- int old_offset, new_offset;
+ union ftrace_code_union old, new;
- old_offset = (unsigned long)(&ftrace_stub) - (ip + MCOUNT_INSN_SIZE);
- new_offset = (unsigned long)(&ftrace_graph_caller) - (ip + MCOUNT_INSN_SIZE);
+ ftrace_jump_replace(&old, ip, (unsigned long)(&ftrace_stub));
+ ftrace_jump_replace(&new, ip, (unsigned long)(&ftrace_graph_caller));
- return ftrace_mod_jmp(ip, old_offset, new_offset);
+ return ftrace_modify_code(ip, old.code, new.code);
}
int ftrace_disable_ftrace_graph_caller(void)
{
unsigned long ip = (unsigned long)(&ftrace_graph_call);
- int old_offset, new_offset;
+ union ftrace_code_union old, new;
- old_offset = (unsigned long)(&ftrace_graph_caller) - (ip + MCOUNT_INSN_SIZE);
- new_offset = (unsigned long)(&ftrace_stub) - (ip + MCOUNT_INSN_SIZE);
+ ftrace_jump_replace(&old, ip, (unsigned long)(&ftrace_graph_caller));
+ ftrace_jump_replace(&new, ip, (unsigned long)(&ftrace_stub));
- return ftrace_mod_jmp(ip, old_offset, new_offset);
+ return ftrace_modify_code(ip, old.code, new.code);
}
#endif /* !CONFIG_DYNAMIC_FTRACE */
--
1.8.4
^ permalink raw reply related [flat|nested] 16+ messages in thread
* Re: [PATCH v2 1/8] x86: speed up int3-based patching using less paranoid write
2013-11-08 9:12 ` [PATCH v2 1/8] x86: speed up int3-based patching using less paranoid write Petr Mladek
@ 2013-11-08 12:04 ` Masami Hiramatsu
2013-11-08 12:43 ` Steven Rostedt
2013-11-08 14:49 ` Steven Rostedt
0 siblings, 2 replies; 16+ messages in thread
From: Masami Hiramatsu @ 2013-11-08 12:04 UTC (permalink / raw)
To: Petr Mladek
Cc: Steven Rostedt, Frederic Weisbecker, Paul E. McKenney,
Jiri Kosina, linux-kernel, x86
(2013/11/08 18:12), Petr Mladek wrote:
> This change is inspired by the int3-based patching code used in
> ftrace. See the commit fd4363fff3d9 (x86: Introduce int3
> (breakpoint)-based instruction patching).
>
> When trying to use text_poke_bp in ftrace, the result was slower than
> the original implementation.
>
> It turned out that one difference was in text_poke vs. ftrace_write.
> text_poke did many extra operations to make sure that the change
> was atomic.
AFAIK, the main reason why text_poke is used is avoiding
RODATA protection (by alias mapping).
> In fact, we do not need this luxury in text_poke_bp because
> the modified code is guarded by the int3 handler. The int3
> interrupt itself is added and removed by an atomic operation
> because we modify only one byte.
>
> This patch adds text_poke_part which is inspired by ftrace_write.
> Then it is used in text_poke_bp instead of the paranoid text_poke.
>
> For example, I tried to switch between 7 tracers: blk, branch,
> function_graph, wakeup_rt, irqsoff, function, and nop. Every tracer
> has also been enabled and disabled. With 100 cycles, I got these
> times with text_poke:
>
> real 25m7.380s 25m13.44s 25m9.072s
> user 0m0.004s 0m0.004s 0m0.004s
> sys 0m3.480s 0m3.508s 0m3.420s
>
> and with text_poke_part:
>
> real 3m23.335s 3m24.422s 3m26.686s
> user 0m0.004s 0m0.004s 0m0.004s
> sys 0m3.724s 0m3.772s 0m3.588s
>
> Signed-off-by: Petr Mladek <pmladek@suse.cz>
> ---
> arch/x86/kernel/alternative.c | 49 ++++++++++++++++++++++++++++++++++++++-----
> 1 file changed, 44 insertions(+), 5 deletions(-)
>
> diff --git a/arch/x86/kernel/alternative.c b/arch/x86/kernel/alternative.c
> index df94598..0586dc1 100644
> --- a/arch/x86/kernel/alternative.c
> +++ b/arch/x86/kernel/alternative.c
> @@ -8,6 +8,7 @@
> #include <linux/kprobes.h>
> #include <linux/mm.h>
> #include <linux/vmalloc.h>
> +#include <linux/uaccess.h>
> #include <linux/memory.h>
> #include <linux/stop_machine.h>
> #include <linux/slab.h>
> @@ -586,6 +587,44 @@ void *__kprobes text_poke(void *addr, const void *opcode, size_t len)
> return addr;
> }
>
> +/**
> + * text_poke_part - Update part of the instruction on a live kernel when using
> + * int3 breakpoint
> + * @addr: address to modify
> + * @opcode: source of the copy
> + * @len: length to copy
> + *
> + * If we patch instructions using int3 interrupt, we do not need to be so
> + * careful about an atomic write. Adding and removing the interrupt is atomic
> + * because we modify only one byte. The rest of the instruction could be
> + * modified in several steps because it is guarded by the interrupt handler.
> + * Hence we could use faster code here. See also text_poke_bp below for more
> + * details.
> + *
> + * Note: Must be called under text_mutex.
> + */
> +static int text_poke_part(void *addr, const void *opcode, size_t len)
> +{
> + /*
> + * On x86_64, kernel text mappings are mapped read-only with
> + * CONFIG_DEBUG_RODATA. So we use the kernel identity mapping instead
> + * of the kernel text mapping to modify the kernel text.
Hmm, I couldn't find the guarantee that __va(__pa_symbol(x)) gives us a
raw(writable) text pages always.
Even it is true, it seems wired that we can still rewrite such protected
page without any spacial page remapping operation.
> + *
> + * For 32bit kernels, these mappings are same and we can use
> + * kernel identity mapping to modify code.
> + */
> + if (((unsigned long)addr >= (unsigned long)_text) &&
> + ((unsigned long)addr < (unsigned long)_etext))
> + addr = __va(__pa_symbol(addr));
You should also warn or return error if the given addr is not in the kernel text.
> +
> + if (probe_kernel_write(addr, opcode, len))
> + return -EPERM;
> +
> + sync_core();
> +
> + return 0;
> +}
> +
> static void do_sync_core(void *info)
> {
> sync_core();
> @@ -648,15 +687,15 @@ void *text_poke_bp(void *addr, const void *opcode, size_t len, void *handler)
> */
> smp_wmb();
>
> - text_poke(addr, &int3, sizeof(int3));
> + text_poke_part(addr, &int3, sizeof(int3));
Anyway, text_poke itself will cause a BUG if it hits an error, but text_poke_part() seems
silently failing. It should handle the error correctly (or call BUG()).
>
> on_each_cpu(do_sync_core, NULL, 1);
>
> if (len - sizeof(int3) > 0) {
> /* patch all but the first byte */
> - text_poke((char *)addr + sizeof(int3),
> - (const char *) opcode + sizeof(int3),
> - len - sizeof(int3));
> + text_poke_part((char *)addr + sizeof(int3),
> + (const char *) opcode + sizeof(int3),
> + len - sizeof(int3));
Here we should check an error too.
> /*
> * According to Intel, this core syncing is very likely
> * not necessary and we'd be safe even without it. But
> @@ -666,7 +705,7 @@ void *text_poke_bp(void *addr, const void *opcode, size_t len, void *handler)
> }
>
> /* patch the first byte */
> - text_poke(addr, opcode, sizeof(int3));
> + text_poke_part(addr, opcode, sizeof(int3));
Ditto.
>
> on_each_cpu(do_sync_core, NULL, 1);
Thank you,
--
Masami HIRAMATSU
IT Management Research Dept. Linux Technology Center
Hitachi, Ltd., Yokohama Research Laboratory
E-mail: masami.hiramatsu.pt@hitachi.com
^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [PATCH v2 2/8] x86: return error code in text_poke_bp
2013-11-08 9:12 ` [PATCH v2 2/8] x86: return error code in text_poke_bp Petr Mladek
@ 2013-11-08 12:36 ` Masami Hiramatsu
2013-11-08 14:53 ` Steven Rostedt
1 sibling, 0 replies; 16+ messages in thread
From: Masami Hiramatsu @ 2013-11-08 12:36 UTC (permalink / raw)
To: Petr Mladek
Cc: Steven Rostedt, Frederic Weisbecker, Paul E. McKenney,
Jiri Kosina, linux-kernel, x86
(2013/11/08 18:12), Petr Mladek wrote:
> We would like to use text_poke_bp in the dynamic ftrace which want to
> know about errors. For example, it informs about them in the ftrace log.
>
> Let's return the error code instead of the address. The address was just copied
> from the first parameter, so it was no extra information. The return value
> has not been used anywhere yet.
Ah, OK. This change is what I'd like to see. :)
> There is a question whether we should recover the original opcode when
> the second or third text_poke_part fails in text_poke_bp. Well, the errors
> were ignored until now. It did not cause any real life problems. There is
> really small chance that the first byte (int3) can be written and the other
> parts of the code can not be modified. It is probably not worth the extra
> complexity.
Since all the text_poke user must hold text_mutex, that kind of racing
must not happen. I guess, if you hit that case, you'd better call BUG_ON() or
you may get a GPF...
Thank you,
>
> Signed-off-by: Petr Mladek <pmladek@suse.cz>
> ---
> arch/x86/include/asm/alternative.h | 3 ++-
> arch/x86/kernel/alternative.c | 18 +++++++++++++-----
> 2 files changed, 15 insertions(+), 6 deletions(-)
>
> diff --git a/arch/x86/include/asm/alternative.h b/arch/x86/include/asm/alternative.h
> index 0a3f9c9..f2343d8 100644
> --- a/arch/x86/include/asm/alternative.h
> +++ b/arch/x86/include/asm/alternative.h
> @@ -226,6 +226,7 @@ extern void *text_poke_early(void *addr, const void *opcode, size_t len);
> */
> extern void *text_poke(void *addr, const void *opcode, size_t len);
> extern int poke_int3_handler(struct pt_regs *regs);
> -extern void *text_poke_bp(void *addr, const void *opcode, size_t len, void *handler);
> +extern int text_poke_bp(void *addr, const void *opcode, size_t len,
> + void *handler);
>
> #endif /* _ASM_X86_ALTERNATIVE_H */
> diff --git a/arch/x86/kernel/alternative.c b/arch/x86/kernel/alternative.c
> index 0586dc1..c459e62 100644
> --- a/arch/x86/kernel/alternative.c
> +++ b/arch/x86/kernel/alternative.c
> @@ -673,9 +673,10 @@ int poke_int3_handler(struct pt_regs *regs)
> *
> * Note: must be called under text_mutex.
> */
> -void *text_poke_bp(void *addr, const void *opcode, size_t len, void *handler)
> +int text_poke_bp(void *addr, const void *opcode, size_t len, void *handler)
> {
> unsigned char int3 = 0xcc;
> + int ret;
>
> bp_int3_handler = handler;
> bp_int3_addr = (u8 *)addr + sizeof(int3);
> @@ -687,15 +688,19 @@ void *text_poke_bp(void *addr, const void *opcode, size_t len, void *handler)
> */
> smp_wmb();
>
> - text_poke_part(addr, &int3, sizeof(int3));
> + ret = text_poke_part(addr, &int3, sizeof(int3));
> + if (unlikely(ret))
> + goto fail;
>
> on_each_cpu(do_sync_core, NULL, 1);
>
> if (len - sizeof(int3) > 0) {
> /* patch all but the first byte */
> - text_poke_part((char *)addr + sizeof(int3),
> + ret = text_poke_part((char *)addr + sizeof(int3),
> (const char *) opcode + sizeof(int3),
> len - sizeof(int3));
> + if (unlikely(ret))
> + goto fail;
> /*
> * According to Intel, this core syncing is very likely
> * not necessary and we'd be safe even without it. But
> @@ -705,13 +710,16 @@ void *text_poke_bp(void *addr, const void *opcode, size_t len, void *handler)
> }
>
> /* patch the first byte */
> - text_poke_part(addr, opcode, sizeof(int3));
> + ret = text_poke_part(addr, opcode, sizeof(int3));
> + if (unlikely(ret))
> + goto fail;
>
> on_each_cpu(do_sync_core, NULL, 1);
>
> +fail:
> bp_patching_in_progress = false;
> smp_wmb();
>
> - return addr;
> + return ret;
> }
>
>
--
Masami HIRAMATSU
IT Management Research Dept. Linux Technology Center
Hitachi, Ltd., Yokohama Research Laboratory
E-mail: masami.hiramatsu.pt@hitachi.com
^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [PATCH v2 1/8] x86: speed up int3-based patching using less paranoid write
2013-11-08 12:04 ` Masami Hiramatsu
@ 2013-11-08 12:43 ` Steven Rostedt
2013-11-12 10:44 ` Petr Mladek
2013-11-08 14:49 ` Steven Rostedt
1 sibling, 1 reply; 16+ messages in thread
From: Steven Rostedt @ 2013-11-08 12:43 UTC (permalink / raw)
To: Masami Hiramatsu
Cc: Petr Mladek, Frederic Weisbecker, Paul E. McKenney, Jiri Kosina,
linux-kernel, x86
On Fri, 08 Nov 2013 21:04:26 +0900
Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com> wrote:
> (2013/11/08 18:12), Petr Mladek wrote:
> > This change is inspired by the int3-based patching code used in
> > ftrace. See the commit fd4363fff3d9 (x86: Introduce int3
> > (breakpoint)-based instruction patching).
> >
> > When trying to use text_poke_bp in ftrace, the result was slower than
> > the original implementation.
> >
> > It turned out that one difference was in text_poke vs. ftrace_write.
> > text_poke did many extra operations to make sure that the change
> > was atomic.
>
> AFAIK, the main reason why text_poke is used is avoiding
> RODATA protection (by alias mapping).
That is correct, and the reason ftrace didn't do that is because it
would be quite expensive to map 22,000 addresses for each change.
-- Steve
^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [PATCH v2 1/8] x86: speed up int3-based patching using less paranoid write
2013-11-08 12:04 ` Masami Hiramatsu
2013-11-08 12:43 ` Steven Rostedt
@ 2013-11-08 14:49 ` Steven Rostedt
1 sibling, 0 replies; 16+ messages in thread
From: Steven Rostedt @ 2013-11-08 14:49 UTC (permalink / raw)
To: Masami Hiramatsu
Cc: Petr Mladek, Frederic Weisbecker, Paul E. McKenney, Jiri Kosina,
linux-kernel, x86
On Fri, 08 Nov 2013 21:04:26 +0900
Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com> wrote:
> > static void do_sync_core(void *info)
> > {
> > sync_core();
> > @@ -648,15 +687,15 @@ void *text_poke_bp(void *addr, const void *opcode, size_t len, void *handler)
> > */
> > smp_wmb();
> >
> > - text_poke(addr, &int3, sizeof(int3));
> > + text_poke_part(addr, &int3, sizeof(int3));
>
> Anyway, text_poke itself will cause a BUG if it hits an error, but text_poke_part() seems
> silently failing. It should handle the error correctly (or call BUG()).
>
text_poke_part() shouldn't do a bug on error, but this code probably
should.
if (text_poke_part(addr, &int3, sizeof(int3)))
BUG();
Note, I would avoid doing:
BUG_ON(text_poke_part(...));
as we don't want "side effects" in the BUG() macros. You may do:
ret = text_poke_part(...);
BUG_ON(ret);
Which has the added benefit of the implicit "unlikely" in the BUG_ON()
macro.
Or we can have do_sync_core() be able to handle errors and have their
callers handle them too.
-- Steve
^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [PATCH v2 2/8] x86: return error code in text_poke_bp
2013-11-08 9:12 ` [PATCH v2 2/8] x86: return error code in text_poke_bp Petr Mladek
2013-11-08 12:36 ` Masami Hiramatsu
@ 2013-11-08 14:53 ` Steven Rostedt
1 sibling, 0 replies; 16+ messages in thread
From: Steven Rostedt @ 2013-11-08 14:53 UTC (permalink / raw)
To: Petr Mladek
Cc: Frederic Weisbecker, Masami Hiramatsu, Paul E. McKenney,
Jiri Kosina, linux-kernel, x86
On Fri, 8 Nov 2013 10:12:06 +0100
Petr Mladek <pmladek@suse.cz> wrote:
> We would like to use text_poke_bp in the dynamic ftrace which want to
> know about errors. For example, it informs about them in the ftrace log.
>
> Let's return the error code instead of the address. The address was just copied
> from the first parameter, so it was no extra information. The return value
> has not been used anywhere yet.
>
> There is a question whether we should recover the original opcode when
> the second or third text_poke_part fails in text_poke_bp. Well, the errors
> were ignored until now. It did not cause any real life problems. There is
> really small chance that the first byte (int3) can be written and the other
> parts of the code can not be modified. It is probably not worth the extra
> complexity.
I agree with Masami. If the first part succeeds, still check the other
parts, and if they don't succeed, then just do a BUG().
>
> Signed-off-by: Petr Mladek <pmladek@suse.cz>
> ---
> arch/x86/include/asm/alternative.h | 3 ++-
> arch/x86/kernel/alternative.c | 18 +++++++++++++-----
> 2 files changed, 15 insertions(+), 6 deletions(-)
>
> diff --git a/arch/x86/include/asm/alternative.h b/arch/x86/include/asm/alternative.h
> index 0a3f9c9..f2343d8 100644
> --- a/arch/x86/include/asm/alternative.h
> +++ b/arch/x86/include/asm/alternative.h
> @@ -226,6 +226,7 @@ extern void *text_poke_early(void *addr, const void *opcode, size_t len);
> */
> extern void *text_poke(void *addr, const void *opcode, size_t len);
> extern int poke_int3_handler(struct pt_regs *regs);
> -extern void *text_poke_bp(void *addr, const void *opcode, size_t len, void *handler);
> +extern int text_poke_bp(void *addr, const void *opcode, size_t len,
> + void *handler);
>
> #endif /* _ASM_X86_ALTERNATIVE_H */
> diff --git a/arch/x86/kernel/alternative.c b/arch/x86/kernel/alternative.c
> index 0586dc1..c459e62 100644
> --- a/arch/x86/kernel/alternative.c
> +++ b/arch/x86/kernel/alternative.c
> @@ -673,9 +673,10 @@ int poke_int3_handler(struct pt_regs *regs)
> *
> * Note: must be called under text_mutex.
> */
> -void *text_poke_bp(void *addr, const void *opcode, size_t len, void *handler)
> +int text_poke_bp(void *addr, const void *opcode, size_t len, void *handler)
> {
> unsigned char int3 = 0xcc;
> + int ret;
>
> bp_int3_handler = handler;
> bp_int3_addr = (u8 *)addr + sizeof(int3);
> @@ -687,15 +688,19 @@ void *text_poke_bp(void *addr, const void *opcode, size_t len, void *handler)
> */
> smp_wmb();
>
> - text_poke_part(addr, &int3, sizeof(int3));
> + ret = text_poke_part(addr, &int3, sizeof(int3));
> + if (unlikely(ret))
> + goto fail;
Ah! you have it return an error code. May need to do that for all users
of text_poke_part().
-- Steve
>
> on_each_cpu(do_sync_core, NULL, 1);
>
> if (len - sizeof(int3) > 0) {
> /* patch all but the first byte */
> - text_poke_part((char *)addr + sizeof(int3),
> + ret = text_poke_part((char *)addr + sizeof(int3),
> (const char *) opcode + sizeof(int3),
> len - sizeof(int3));
> + if (unlikely(ret))
> + goto fail;
> /*
> * According to Intel, this core syncing is very likely
> * not necessary and we'd be safe even without it. But
> @@ -705,13 +710,16 @@ void *text_poke_bp(void *addr, const void *opcode, size_t len, void *handler)
> }
>
> /* patch the first byte */
> - text_poke_part(addr, opcode, sizeof(int3));
> + ret = text_poke_part(addr, opcode, sizeof(int3));
> + if (unlikely(ret))
> + goto fail;
>
> on_each_cpu(do_sync_core, NULL, 1);
>
> +fail:
> bp_patching_in_progress = false;
> smp_wmb();
>
> - return addr;
> + return ret;
> }
>
^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [PATCH v2 1/8] x86: speed up int3-based patching using less paranoid write
2013-11-08 12:43 ` Steven Rostedt
@ 2013-11-12 10:44 ` Petr Mladek
0 siblings, 0 replies; 16+ messages in thread
From: Petr Mladek @ 2013-11-12 10:44 UTC (permalink / raw)
To: Steven Rostedt
Cc: Masami Hiramatsu, Frederic Weisbecker, Paul E. McKenney,
Jiri Kosina, linux-kernel, x86
Steven Rostedt píše v Pá 08. 11. 2013 v 07:43 -0500:
> On Fri, 08 Nov 2013 21:04:26 +0900
> Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com> wrote:
>
> > (2013/11/08 18:12), Petr Mladek wrote:
> > > This change is inspired by the int3-based patching code used in
> > > ftrace. See the commit fd4363fff3d9 (x86: Introduce int3
> > > (breakpoint)-based instruction patching).
> > >
> > > When trying to use text_poke_bp in ftrace, the result was slower than
> > > the original implementation.
> > >
> > > It turned out that one difference was in text_poke vs. ftrace_write.
> > > text_poke did many extra operations to make sure that the change
> > > was atomic.
> >
> > AFAIK, the main reason why text_poke is used is avoiding
> > RODATA protection (by alias mapping).
>
> That is correct, and the reason ftrace didn't do that is because it
> would be quite expensive to map 22,000 addresses for each change.
I see. I am going to prepare v3 with the following changes:
+ use text_poke in text_poke_bp back again because it is not
effective to make the pages rw only for a single address;
the alias mapping is faster here
+ use the faster text_poke_part only in text_poke_bp_iter;
the caller of this function will be responsible for
making the code rw
+ I will rework the error handling as suggested in the other
mails; I will try to add some also for text_poke.
Thanks a lot for hints.
Best Regards,
Petr
^ permalink raw reply [flat|nested] 16+ messages in thread
end of thread, other threads:[~2013-11-12 10:44 UTC | newest]
Thread overview: 16+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2013-11-08 9:12 [PATCH v2 0/8] x86: use new text_poke_bp in ftrace Petr Mladek
2013-11-08 9:12 ` Petr Mladek
2013-11-08 9:12 ` [PATCH v2 1/8] x86: speed up int3-based patching using less paranoid write Petr Mladek
2013-11-08 12:04 ` Masami Hiramatsu
2013-11-08 12:43 ` Steven Rostedt
2013-11-12 10:44 ` Petr Mladek
2013-11-08 14:49 ` Steven Rostedt
2013-11-08 9:12 ` [PATCH v2 2/8] x86: return error code in text_poke_bp Petr Mladek
2013-11-08 12:36 ` Masami Hiramatsu
2013-11-08 14:53 ` Steven Rostedt
2013-11-08 9:12 ` [PATCH v2 3/8] x86: allow to call text_poke_bp during boot Petr Mladek
2013-11-08 9:12 ` [PATCH v2 4/8] x86: add generic function to modify more calls using int3 framework Petr Mladek
2013-11-08 9:12 ` [PATCH v2 5/8] x86: do not trace __probe_kernel_read Petr Mladek
2013-11-08 9:12 ` [PATCH v2 6/8] x86: modify ftrace function using the new int3-based framework Petr Mladek
2013-11-08 9:12 ` [PATCH v2 7/8] x86: patch all traced function calls using the " Petr Mladek
2013-11-08 9:12 ` [PATCH v2 8/8] x86: enable/disable ftrace graph call using new " Petr Mladek
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.