* [PATCH v2 0/6] riscv: ftrace: atmoic patching and preempt improvements
@ 2024-06-28 11:47 Andy Chiu
2024-06-28 11:47 ` [PATCH v2 1/6] riscv: ftrace: support fastcc in Clang for WITH_ARGS Andy Chiu
` (6 more replies)
0 siblings, 7 replies; 17+ messages in thread
From: Andy Chiu @ 2024-06-28 11:47 UTC (permalink / raw)
To: Paul Walmsley, Palmer Dabbelt, Albert Ou, Alexandre Ghiti,
Zong Li, Steven Rostedt, Masami Hiramatsu, Mark Rutland,
Nathan Chancellor, Nick Desaulniers, Bill Wendling, Justin Stitt,
Puranjay Mohan
Cc: Palmer Dabbelt, linux-riscv, linux-kernel, linux-trace-kernel,
llvm, Evgenii Shatokhin, Andy Chiu
This series makes atmoic code patching possible in riscv ftrace. A
direct benefit of this is that we can get rid of stop_machine() when
patching function entries. This also makes it possible to run ftrace
with full kernel preemption. Before this series, the kernel initializes
patchable function entries to NOP4 + NOP4. To start tracing, it updates
entries to AUIPC + JALR while holding other cores in stop_machine.
stop_machine() is required because it is impossible to update 2
instructions, and be seen atomically. And preemption must have to be
prevented, as kernel preemption allows process to be scheduled out while
executing on one of these instruction pairs.
This series addresses the problem by initializing the first NOP4 to
AUIPC. So, atmoic patching is possible because the kernel only has to
update one instruction. As long as the instruction is naturally aligned,
then it is expected to be updated atomically.
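For illustration, a patchable entry then looks roughly like this (t0 is
the scratch register the existing make_call_t0() helper already uses;
HI20/LO12 stand for the split offset to the trampoline):
  tracing disabled:                 tracing enabled:
  0: auipc t0, HI20                 0: auipc t0, HI20
  4: nop                            4: jalr  t0, LO12(t0)
Only the second, naturally aligned 4-byte word is ever rewritten at run
time.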
However, the address range of the ftrace trampoline is limited to +-2K
from ftrace_caller after applying this series. This issue is expected
to be solved by Puranjay's CALL_OPS, where it adds 8B of naturally aligned
data in front of pacthable functions and can use it to direct execution
out to any custom trampolines.
The series is composed of three parts. The first part cleans up existing
issues when the kernel is compiled with Clang. The second part (patches
2-4) modifies the ftrace code patching mechanism as mentioned above. The
third part (patches 5-6) prepares ftrace to run with kernel preemption.
This series is tested after applying the following ftrace/patching
commits in the fixes branch:
- commit 57a369b6f2ee ("riscv: patch: Flush the icache right after
patching to avoid illegal insns")
- commit a2bd3a5b4b63 ("riscv: stacktrace: convert arch_stack_walk() to
noinstr")
Changes in v2:
- Drop patch 1 as it is merged through fixes.
- Drop patch 2, which converts kernel_text_address into notrace, as
users can prevent tracing it by configuring tracefs.
- Use a more generic way in kconfig to align functions.
- Link to v1: https://lore.kernel.org/r/20240613-dev-andyc-dyn-ftrace-v4-v1-0-1a538e12c01e@sifive.com
---
Andy Chiu (6):
riscv: ftrace: support fastcc in Clang for WITH_ARGS
riscv: ftrace: align patchable functions to 4 Byte boundary
riscv: ftrace: prepare ftrace for atomic code patching
riscv: ftrace: do not use stop_machine to update code
riscv: vector: Support calling schedule() for preemptible Vector
riscv: ftrace: support PREEMPT
arch/riscv/Kconfig | 4 +-
arch/riscv/include/asm/ftrace.h | 11 +++
arch/riscv/include/asm/processor.h | 5 ++
arch/riscv/include/asm/vector.h | 22 +++++-
arch/riscv/kernel/asm-offsets.c | 7 ++
arch/riscv/kernel/ftrace.c | 133 ++++++++++++++++---------------------
arch/riscv/kernel/mcount-dyn.S | 25 +++++--
7 files changed, 121 insertions(+), 86 deletions(-)
---
base-commit: a2bd3a5b4b63b95aea7dbf61d9395cd6696a2bc0
change-id: 20240613-dev-andyc-dyn-ftrace-v4-941d4a00ea19
Best regards,
--
Andy Chiu <andy.chiu@sifive.com>
* [PATCH v2 1/6] riscv: ftrace: support fastcc in Clang for WITH_ARGS
2024-06-28 11:47 [PATCH v2 0/6] riscv: ftrace: atmoic patching and preempt improvements Andy Chiu
@ 2024-06-28 11:47 ` Andy Chiu
2024-08-13 11:09 ` Björn Töpel
2024-06-28 11:47 ` [PATCH v2 2/6] riscv: ftrace: align patchable functions to 4 Byte boundary Andy Chiu
` (5 subsequent siblings)
6 siblings, 1 reply; 17+ messages in thread
From: Andy Chiu @ 2024-06-28 11:47 UTC (permalink / raw)
To: Paul Walmsley, Palmer Dabbelt, Albert Ou, Alexandre Ghiti,
Zong Li, Steven Rostedt, Masami Hiramatsu, Mark Rutland,
Nathan Chancellor, Nick Desaulniers, Bill Wendling, Justin Stitt,
Puranjay Mohan
Cc: Palmer Dabbelt, linux-riscv, linux-kernel, linux-trace-kernel,
llvm, Evgenii Shatokhin, Andy Chiu
Some caller-saved registers which are not defined as function arguments
in the ABI can still be passed as arguments when the kernel is compiled
with Clang. As a result, we must save and restore those registers to
prevent ftrace from clobbering them.
- [1]: https://reviews.llvm.org/D68559
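To make the failure mode concrete, here is a hypothetical example (the
exact register assignment is up to the compiler, so treat it as an
assumption): with fastcc [1], Clang may pass arguments of a local
function beyond a0-a7 in t2-t6 instead of on the stack, e.g.:
/*
 * Hypothetical illustration: under fastcc, the ninth argument may land
 * in t2 rather than on the stack. If ftrace_caller clobbers t2 without
 * saving it, that argument is silently corrupted.
 */
static noinline int sum9(int a, int b, int c, int d, int e,
			 int f, int g, int h, int i)
{
	return a + b + c + d + e + f + g + h + i;
}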
Reported-by: Evgenii Shatokhin <e.shatokhin@yadro.com>
Closes: https://lore.kernel.org/linux-riscv/7e7c7914-445d-426d-89a0-59a9199c45b1@yadro.com/
Acked-by: Nathan Chancellor <nathan@kernel.org>
Signed-off-by: Andy Chiu <andy.chiu@sifive.com>
---
arch/riscv/include/asm/ftrace.h | 7 +++++++
arch/riscv/kernel/asm-offsets.c | 7 +++++++
arch/riscv/kernel/mcount-dyn.S | 16 ++++++++++++++--
3 files changed, 28 insertions(+), 2 deletions(-)
diff --git a/arch/riscv/include/asm/ftrace.h b/arch/riscv/include/asm/ftrace.h
index 9eb31a7ea0aa..5f81c53dbfd9 100644
--- a/arch/riscv/include/asm/ftrace.h
+++ b/arch/riscv/include/asm/ftrace.h
@@ -144,6 +144,13 @@ struct ftrace_regs {
unsigned long a5;
unsigned long a6;
unsigned long a7;
+#ifdef CONFIG_CC_IS_CLANG
+ unsigned long t2;
+ unsigned long t3;
+ unsigned long t4;
+ unsigned long t5;
+ unsigned long t6;
+#endif
};
};
};
diff --git a/arch/riscv/kernel/asm-offsets.c b/arch/riscv/kernel/asm-offsets.c
index b09ca5f944f7..db5a26fcc9ae 100644
--- a/arch/riscv/kernel/asm-offsets.c
+++ b/arch/riscv/kernel/asm-offsets.c
@@ -497,6 +497,13 @@ void asm_offsets(void)
DEFINE(FREGS_SP, offsetof(struct ftrace_regs, sp));
DEFINE(FREGS_S0, offsetof(struct ftrace_regs, s0));
DEFINE(FREGS_T1, offsetof(struct ftrace_regs, t1));
+#ifdef CONFIG_CC_IS_CLANG
+ DEFINE(FREGS_T2, offsetof(struct ftrace_regs, t2));
+ DEFINE(FREGS_T3, offsetof(struct ftrace_regs, t3));
+ DEFINE(FREGS_T4, offsetof(struct ftrace_regs, t4));
+ DEFINE(FREGS_T5, offsetof(struct ftrace_regs, t5));
+ DEFINE(FREGS_T6, offsetof(struct ftrace_regs, t6));
+#endif
DEFINE(FREGS_A0, offsetof(struct ftrace_regs, a0));
DEFINE(FREGS_A1, offsetof(struct ftrace_regs, a1));
DEFINE(FREGS_A2, offsetof(struct ftrace_regs, a2));
diff --git a/arch/riscv/kernel/mcount-dyn.S b/arch/riscv/kernel/mcount-dyn.S
index 745dd4c4a69c..e988bd26b28b 100644
--- a/arch/riscv/kernel/mcount-dyn.S
+++ b/arch/riscv/kernel/mcount-dyn.S
@@ -96,7 +96,13 @@
REG_S x8, FREGS_S0(sp)
#endif
REG_S x6, FREGS_T1(sp)
-
+#ifdef CONFIG_CC_IS_CLANG
+ REG_S x7, FREGS_T2(sp)
+ REG_S x28, FREGS_T3(sp)
+ REG_S x29, FREGS_T4(sp)
+ REG_S x30, FREGS_T5(sp)
+ REG_S x31, FREGS_T6(sp)
+#endif
// save the arguments
REG_S x10, FREGS_A0(sp)
REG_S x11, FREGS_A1(sp)
@@ -115,7 +121,13 @@
REG_L x8, FREGS_S0(sp)
#endif
REG_L x6, FREGS_T1(sp)
-
+#ifdef CONFIG_CC_IS_CLANG
+ REG_L x7, FREGS_T2(sp)
+ REG_L x28, FREGS_T3(sp)
+ REG_L x29, FREGS_T4(sp)
+ REG_L x30, FREGS_T5(sp)
+ REG_L x31, FREGS_T6(sp)
+#endif
// restore the arguments
REG_L x10, FREGS_A0(sp)
REG_L x11, FREGS_A1(sp)
--
2.43.0
* [PATCH v2 2/6] riscv: ftrace: align patchable functions to 4 Byte boundary
2024-06-28 11:47 [PATCH v2 0/6] riscv: ftrace: atmoic patching and preempt improvements Andy Chiu
2024-06-28 11:47 ` [PATCH v2 1/6] riscv: ftrace: support fastcc in Clang for WITH_ARGS Andy Chiu
@ 2024-06-28 11:47 ` Andy Chiu
2024-08-13 11:11 ` Björn Töpel
2024-06-28 11:47 ` [PATCH v2 3/6] riscv: ftrace: prepare ftrace for atomic code patching Andy Chiu
` (4 subsequent siblings)
6 siblings, 1 reply; 17+ messages in thread
From: Andy Chiu @ 2024-06-28 11:47 UTC (permalink / raw)
To: Paul Walmsley, Palmer Dabbelt, Albert Ou, Alexandre Ghiti,
Zong Li, Steven Rostedt, Masami Hiramatsu, Mark Rutland,
Nathan Chancellor, Nick Desaulniers, Bill Wendling, Justin Stitt,
Puranjay Mohan
Cc: Palmer Dabbelt, linux-riscv, linux-kernel, linux-trace-kernel,
llvm, Evgenii Shatokhin, Andy Chiu
We are changing ftrace code patching in order to remove the dependency on
stop_machine() and enable kernel preemption. This requires us to align
function entries to a 4-byte boundary.
However, -falign-functions alone on older versions of GCC was not strong
enough to align all functions. In fact, cold functions are not aligned
after turning on optimizations. We consider this a bug in GCC and turn
off guess-branch-probability as a workaround to align all functions.
GCC bug id: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88345
The option -fmin-function-alignment is able to align all functions
properly on newer versions of gcc. So, we add a cc-option to test if
the toolchain supports it.
Suggested-by: Evgenii Shatokhin <e.shatokhin@yadro.com>
Signed-off-by: Andy Chiu <andy.chiu@sifive.com>
---
Changelog v2:
- Use CC_HAS_MIN_FUNCTION_ALIGNMENT and its friends to prevent reinventing
the wheel (Nathan)
---
arch/riscv/Kconfig | 2 ++
1 file changed, 2 insertions(+)
diff --git a/arch/riscv/Kconfig b/arch/riscv/Kconfig
index 704d4683bcfa..55c70efbad0a 100644
--- a/arch/riscv/Kconfig
+++ b/arch/riscv/Kconfig
@@ -133,6 +133,7 @@ config RISCV
select HAVE_DEBUG_KMEMLEAK
select HAVE_DMA_CONTIGUOUS if MMU
select HAVE_DYNAMIC_FTRACE if !XIP_KERNEL && MMU && (CLANG_SUPPORTS_DYNAMIC_FTRACE || GCC_SUPPORTS_DYNAMIC_FTRACE)
+ select FUNCTION_ALIGNMENT_4B if HAVE_DYNAMIC_FTRACE && RISCV_ISA_C
select HAVE_DYNAMIC_FTRACE_WITH_DIRECT_CALLS
select HAVE_DYNAMIC_FTRACE_WITH_ARGS if HAVE_DYNAMIC_FTRACE
select HAVE_FTRACE_MCOUNT_RECORD if !XIP_KERNEL
@@ -208,6 +209,7 @@ config CLANG_SUPPORTS_DYNAMIC_FTRACE
config GCC_SUPPORTS_DYNAMIC_FTRACE
def_bool CC_IS_GCC
depends on $(cc-option,-fpatchable-function-entry=8)
+ depends on CC_HAS_MIN_FUNCTION_ALIGNMENT || !RISCV_ISA_C
config HAVE_SHADOW_CALL_STACK
def_bool $(cc-option,-fsanitize=shadow-call-stack)
--
2.43.0
* [PATCH v2 3/6] riscv: ftrace: prepare ftrace for atomic code patching
2024-06-28 11:47 [PATCH v2 0/6] riscv: ftrace: atmoic patching and preempt improvements Andy Chiu
2024-06-28 11:47 ` [PATCH v2 1/6] riscv: ftrace: support fastcc in Clang for WITH_ARGS Andy Chiu
2024-06-28 11:47 ` [PATCH v2 2/6] riscv: ftrace: align patchable functions to 4 Byte boundary Andy Chiu
@ 2024-06-28 11:47 ` Andy Chiu
2024-08-13 12:59 ` Björn Töpel
2024-06-28 11:47 ` [PATCH v2 4/6] riscv: ftrace: do not use stop_machine to update code Andy Chiu
` (3 subsequent siblings)
6 siblings, 1 reply; 17+ messages in thread
From: Andy Chiu @ 2024-06-28 11:47 UTC (permalink / raw)
To: Paul Walmsley, Palmer Dabbelt, Albert Ou, Alexandre Ghiti,
Zong Li, Steven Rostedt, Masami Hiramatsu, Mark Rutland,
Nathan Chancellor, Nick Desaulniers, Bill Wendling, Justin Stitt,
Puranjay Mohan
Cc: Palmer Dabbelt, linux-riscv, linux-kernel, linux-trace-kernel,
llvm, Andy Chiu
We use an AUIPC+JALR pair to jump into an ftrace trampoline. Since
instruction fetches can be broken down into 4-byte units, it is impossible
to update two instructions without a race. In order to mitigate this, we
initialize the patchable entry to AUIPC + NOP4. Then, the run-time code
patching can change NOP4 to JALR to eable/disable ftrcae from a
function. This limits the reach of each ftrace entry to a +-2KB
displacement from ftrace_caller.
Starting from the trampoline, we add a level of indirection for it to
reach the ftrace caller target. Now, it loads the target address from a
memory location, then performs the jump. This enables the kernel to
update the target atomically.
The ordering of reading/updating the target address should be guarded by
the generic ftrace code, which issues the necessary smp_rmb and
synchronization IPIs.
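For reference, a minimal sketch of what enabling or disabling a call site
boils down to after this patch (error handling and validation of the old
text are omitted; see __ftrace_modify_call below for the real thing):
/*
 * Sketch only: "offset" is the displacement from rec->ip to the target,
 * as computed by the existing make_call_t0() helper.
 */
static int sketch_toggle_call(struct dyn_ftrace *rec, unsigned long offset,
			      bool enable)
{
	unsigned int insn = enable ? to_jalr_t0(offset) : NOP4;

	/* one naturally aligned 4-byte store: no torn AUIPC+JALR pair */
	return patch_insn_write((void *)(rec->ip + MCOUNT_AUIPC_SIZE),
				&insn, MCOUNT_JALR_SIZE);
}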
Signed-off-by: Andy Chiu <andy.chiu@sifive.com>
---
arch/riscv/include/asm/ftrace.h | 4 +++
arch/riscv/kernel/ftrace.c | 80 ++++++++++++++++++++++++++---------------
arch/riscv/kernel/mcount-dyn.S | 9 +++--
3 files changed, 62 insertions(+), 31 deletions(-)
diff --git a/arch/riscv/include/asm/ftrace.h b/arch/riscv/include/asm/ftrace.h
index 5f81c53dbfd9..7199383f8c02 100644
--- a/arch/riscv/include/asm/ftrace.h
+++ b/arch/riscv/include/asm/ftrace.h
@@ -81,6 +81,7 @@ struct dyn_arch_ftrace {
#define JALR_T0 (0x000282e7)
#define AUIPC_T0 (0x00000297)
#define NOP4 (0x00000013)
+#define JALR_RANGE (JALR_SIGN_MASK - 1)
#define to_jalr_t0(offset) \
(((offset & JALR_OFFSET_MASK) << JALR_SHIFT) | JALR_T0)
@@ -118,6 +119,9 @@ do { \
* Let auipc+jalr be the basic *mcount unit*, so we make it 8 bytes here.
*/
#define MCOUNT_INSN_SIZE 8
+#define MCOUNT_AUIPC_SIZE 4
+#define MCOUNT_JALR_SIZE 4
+#define MCOUNT_NOP4_SIZE 4
#ifndef __ASSEMBLY__
struct dyn_ftrace;
diff --git a/arch/riscv/kernel/ftrace.c b/arch/riscv/kernel/ftrace.c
index 4b95c574fd04..5ebe412280ef 100644
--- a/arch/riscv/kernel/ftrace.c
+++ b/arch/riscv/kernel/ftrace.c
@@ -64,42 +64,64 @@ static int ftrace_check_current_call(unsigned long hook_pos,
return 0;
}
-static int __ftrace_modify_call(unsigned long hook_pos, unsigned long target,
- bool enable, bool ra)
+static int __ftrace_modify_call(unsigned long hook_pos, unsigned long target, bool validate)
{
unsigned int call[2];
- unsigned int nops[2] = {NOP4, NOP4};
+ unsigned int replaced[2];
+
+ make_call_t0(hook_pos, target, call);
- if (ra)
- make_call_ra(hook_pos, target, call);
- else
- make_call_t0(hook_pos, target, call);
+ if (validate) {
+ /*
+ * Read the text we want to modify;
+ * return must be -EFAULT on read error
+ */
+ if (copy_from_kernel_nofault(replaced, (void *)hook_pos,
+ MCOUNT_INSN_SIZE))
+ return -EFAULT;
+
+ if (replaced[0] != call[0]) {
+ pr_err("%p: expected (%08x) but got (%08x)\n",
+ (void *)hook_pos, call[0], replaced[0]);
+ return -EINVAL;
+ }
+ }
- /* Replace the auipc-jalr pair at once. Return -EPERM on write error. */
- if (patch_insn_write((void *)hook_pos, enable ? call : nops, MCOUNT_INSN_SIZE))
+ /* Replace the jalr at once. Return -EPERM on write error. */
+ if (patch_insn_write((void *)(hook_pos + MCOUNT_AUIPC_SIZE), call + 1, MCOUNT_JALR_SIZE))
return -EPERM;
return 0;
}
-int ftrace_make_call(struct dyn_ftrace *rec, unsigned long addr)
+static int __ftrace_modify_call_site(ftrace_func_t *hook_pos, ftrace_func_t target, bool enable)
{
- unsigned int call[2];
+ ftrace_func_t call = target;
+ ftrace_func_t nops = &ftrace_stub;
- make_call_t0(rec->ip, addr, call);
-
- if (patch_insn_write((void *)rec->ip, call, MCOUNT_INSN_SIZE))
- return -EPERM;
+ WRITE_ONCE(*hook_pos, enable ? call : nops);
return 0;
}
+int ftrace_make_call(struct dyn_ftrace *rec, unsigned long addr)
+{
+ unsigned long distance, orig_addr;
+
+ orig_addr = (unsigned long)&ftrace_caller;
+ distance = addr > orig_addr ? addr - orig_addr : orig_addr - addr;
+ if (distance > JALR_RANGE)
+ return -EINVAL;
+
+ return __ftrace_modify_call(rec->ip, addr, false);
+}
+
int ftrace_make_nop(struct module *mod, struct dyn_ftrace *rec,
unsigned long addr)
{
- unsigned int nops[2] = {NOP4, NOP4};
+ unsigned int nops[1] = {NOP4};
- if (patch_insn_write((void *)rec->ip, nops, MCOUNT_INSN_SIZE))
+ if (patch_insn_write((void *)(rec->ip + MCOUNT_AUIPC_SIZE), nops, MCOUNT_NOP4_SIZE))
return -EPERM;
return 0;
@@ -114,21 +136,23 @@ int ftrace_make_nop(struct module *mod, struct dyn_ftrace *rec,
*/
int ftrace_init_nop(struct module *mod, struct dyn_ftrace *rec)
{
+ unsigned int nops[2];
int out;
+ make_call_t0(rec->ip, &ftrace_caller, nops);
+ nops[1] = NOP4;
+
mutex_lock(&text_mutex);
- out = ftrace_make_nop(mod, rec, MCOUNT_ADDR);
+ out = patch_insn_write((void *)rec->ip, nops, MCOUNT_INSN_SIZE);
mutex_unlock(&text_mutex);
return out;
}
+ftrace_func_t ftrace_call_dest = ftrace_stub;
int ftrace_update_ftrace_func(ftrace_func_t func)
{
- int ret = __ftrace_modify_call((unsigned long)&ftrace_call,
- (unsigned long)func, true, true);
-
- return ret;
+ return __ftrace_modify_call_site(&ftrace_call_dest, func, true);
}
struct ftrace_modify_param {
@@ -182,7 +206,7 @@ int ftrace_modify_call(struct dyn_ftrace *rec, unsigned long old_addr,
if (ret)
return ret;
- return __ftrace_modify_call(caller, addr, true, false);
+ return __ftrace_modify_call(caller, addr, true);
}
#endif
@@ -217,17 +241,17 @@ void ftrace_graph_func(unsigned long ip, unsigned long parent_ip,
prepare_ftrace_return(&fregs->ra, ip, fregs->s0);
}
#else /* CONFIG_DYNAMIC_FTRACE_WITH_ARGS */
-extern void ftrace_graph_call(void);
+ftrace_func_t ftrace_graph_call_dest = ftrace_stub;
int ftrace_enable_ftrace_graph_caller(void)
{
- return __ftrace_modify_call((unsigned long)&ftrace_graph_call,
- (unsigned long)&prepare_ftrace_return, true, true);
+ return __ftrace_modify_call_site(&ftrace_graph_call_dest,
+ &prepare_ftrace_return, true);
}
int ftrace_disable_ftrace_graph_caller(void)
{
- return __ftrace_modify_call((unsigned long)&ftrace_graph_call,
- (unsigned long)&prepare_ftrace_return, false, true);
+ return __ftrace_modify_call_site(&ftrace_graph_call_dest,
+ &prepare_ftrace_return, false);
}
#endif /* CONFIG_DYNAMIC_FTRACE_WITH_ARGS */
#endif /* CONFIG_DYNAMIC_FTRACE */
diff --git a/arch/riscv/kernel/mcount-dyn.S b/arch/riscv/kernel/mcount-dyn.S
index e988bd26b28b..bc06e8ab81cf 100644
--- a/arch/riscv/kernel/mcount-dyn.S
+++ b/arch/riscv/kernel/mcount-dyn.S
@@ -162,7 +162,8 @@ SYM_FUNC_START(ftrace_caller)
mv a3, sp
SYM_INNER_LABEL(ftrace_call, SYM_L_GLOBAL)
- call ftrace_stub
+ REG_L ra, ftrace_call_dest
+ jalr 0(ra)
#ifdef CONFIG_FUNCTION_GRAPH_TRACER
addi a0, sp, ABI_RA
@@ -172,7 +173,8 @@ SYM_INNER_LABEL(ftrace_call, SYM_L_GLOBAL)
mv a2, s0
#endif
SYM_INNER_LABEL(ftrace_graph_call, SYM_L_GLOBAL)
- call ftrace_stub
+ REG_L ra, ftrace_graph_call_dest
+ jalr 0(ra)
#endif
RESTORE_ABI
jr t0
@@ -185,7 +187,8 @@ SYM_FUNC_START(ftrace_caller)
PREPARE_ARGS
SYM_INNER_LABEL(ftrace_call, SYM_L_GLOBAL)
- call ftrace_stub
+ REG_L ra, ftrace_call_dest
+ jalr 0(ra)
RESTORE_ABI_REGS
bnez t1, .Ldirect
--
2.43.0
* [PATCH v2 4/6] riscv: ftrace: do not use stop_machine to update code
2024-06-28 11:47 [PATCH v2 0/6] riscv: ftrace: atmoic patching and preempt improvements Andy Chiu
` (2 preceding siblings ...)
2024-06-28 11:47 ` [PATCH v2 3/6] riscv: ftrace: prepare ftrace for atomic code patching Andy Chiu
@ 2024-06-28 11:47 ` Andy Chiu
2024-06-28 11:47 ` [PATCH v2 5/6] riscv: vector: Support calling schedule() for preemptible Vector Andy Chiu
` (2 subsequent siblings)
6 siblings, 0 replies; 17+ messages in thread
From: Andy Chiu @ 2024-06-28 11:47 UTC (permalink / raw)
To: Paul Walmsley, Palmer Dabbelt, Albert Ou, Alexandre Ghiti,
Zong Li, Steven Rostedt, Masami Hiramatsu, Mark Rutland,
Nathan Chancellor, Nick Desaulniers, Bill Wendling, Justin Stitt,
Puranjay Mohan
Cc: Palmer Dabbelt, linux-riscv, linux-kernel, linux-trace-kernel,
llvm, Andy Chiu
Now it is safe to remove the dependency on stop_machine() for patching
code in ftrace.
Signed-off-by: Andy Chiu <andy.chiu@sifive.com>
---
arch/riscv/kernel/ftrace.c | 53 ++++------------------------------------------
1 file changed, 4 insertions(+), 49 deletions(-)
diff --git a/arch/riscv/kernel/ftrace.c b/arch/riscv/kernel/ftrace.c
index 5ebe412280ef..57a6558e212e 100644
--- a/arch/riscv/kernel/ftrace.c
+++ b/arch/riscv/kernel/ftrace.c
@@ -13,23 +13,13 @@
#include <asm/patch.h>
#ifdef CONFIG_DYNAMIC_FTRACE
-void ftrace_arch_code_modify_prepare(void) __acquires(&text_mutex)
+void arch_ftrace_update_code(int command)
{
mutex_lock(&text_mutex);
-
- /*
- * The code sequences we use for ftrace can't be patched while the
- * kernel is running, so we need to use stop_machine() to modify them
- * for now. This doesn't play nice with text_mutex, we use this flag
- * to elide the check.
- */
- riscv_patch_in_stop_machine = true;
-}
-
-void ftrace_arch_code_modify_post_process(void) __releases(&text_mutex)
-{
- riscv_patch_in_stop_machine = false;
+ command |= FTRACE_MAY_SLEEP;
+ ftrace_modify_all_code(command);
mutex_unlock(&text_mutex);
+ flush_icache_all();
}
static int ftrace_check_current_call(unsigned long hook_pos,
@@ -155,41 +145,6 @@ int ftrace_update_ftrace_func(ftrace_func_t func)
return __ftrace_modify_call_site(&ftrace_call_dest, func, true);
}
-struct ftrace_modify_param {
- int command;
- atomic_t cpu_count;
-};
-
-static int __ftrace_modify_code(void *data)
-{
- struct ftrace_modify_param *param = data;
-
- if (atomic_inc_return(¶m->cpu_count) == num_online_cpus()) {
- ftrace_modify_all_code(param->command);
- /*
- * Make sure the patching store is effective *before* we
- * increment the counter which releases all waiting CPUs
- * by using the release variant of atomic increment. The
- * release pairs with the call to local_flush_icache_all()
- * on the waiting CPU.
- */
- atomic_inc_return_release(¶m->cpu_count);
- } else {
- while (atomic_read(¶m->cpu_count) <= num_online_cpus())
- cpu_relax();
-
- local_flush_icache_all();
- }
-
- return 0;
-}
-
-void arch_ftrace_update_code(int command)
-{
- struct ftrace_modify_param param = { command, ATOMIC_INIT(0) };
-
- stop_machine(__ftrace_modify_code, ¶m, cpu_online_mask);
-}
#endif
#ifdef CONFIG_DYNAMIC_FTRACE_WITH_DIRECT_CALLS
--
2.43.0
* [PATCH v2 5/6] riscv: vector: Support calling schedule() for preemptible Vector
2024-06-28 11:47 [PATCH v2 0/6] riscv: ftrace: atmoic patching and preempt improvements Andy Chiu
` (3 preceding siblings ...)
2024-06-28 11:47 ` [PATCH v2 4/6] riscv: ftrace: do not use stop_machine to update code Andy Chiu
@ 2024-06-28 11:47 ` Andy Chiu
2024-06-28 11:47 ` [PATCH v2 6/6] riscv: ftrace: support PREEMPT Andy Chiu
2024-08-13 11:00 ` [PATCH v2 0/6] riscv: ftrace: atmoic patching and preempt improvements Björn Töpel
6 siblings, 0 replies; 17+ messages in thread
From: Andy Chiu @ 2024-06-28 11:47 UTC (permalink / raw)
To: Paul Walmsley, Palmer Dabbelt, Albert Ou, Alexandre Ghiti,
Zong Li, Steven Rostedt, Masami Hiramatsu, Mark Rutland,
Nathan Chancellor, Nick Desaulniers, Bill Wendling, Justin Stitt,
Puranjay Mohan
Cc: Palmer Dabbelt, linux-riscv, linux-kernel, linux-trace-kernel,
llvm, Andy Chiu
Each function entry implies a call into the ftrace infrastructure, which
may call into schedule() in some cases. So, it is possible for
preemptible kernel-mode Vector code to implicitly call into schedule().
Since all V-regs are caller-saved, it is possible to drop the whole V
context when a thread voluntarily calls schedule(). Besides, we currently
don't pass arguments through vector registers, so we don't have to
save/restore V-regs in the ftrace trampoline.
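For illustration only (do_something_traced() is a placeholder for any
traced callee, not a real function), the path this patch has to cope with
looks roughly like:
void do_something_traced(void);	/* placeholder */
static void example_preempt_v_user(void)
{
	kernel_vector_begin();
	do_something_traced();	/* -> ftrace_caller -> ... -> schedule() */
	kernel_vector_end();
}
Because V-regs are caller-saved, the live V context can simply be dropped
at that voluntary switch; only sstatus.VS has to survive it.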
Signed-off-by: Andy Chiu <andy.chiu@sifive.com>
---
arch/riscv/include/asm/processor.h | 5 +++++
arch/riscv/include/asm/vector.h | 22 +++++++++++++++++++---
2 files changed, 24 insertions(+), 3 deletions(-)
diff --git a/arch/riscv/include/asm/processor.h b/arch/riscv/include/asm/processor.h
index 68c3432dc6ea..02598e168659 100644
--- a/arch/riscv/include/asm/processor.h
+++ b/arch/riscv/include/asm/processor.h
@@ -95,6 +95,10 @@ struct pt_regs;
* Thus, the task does not own preempt_v. Any use of Vector will have to
* save preempt_v, if dirty, and fallback to non-preemptible kernel-mode
* Vector.
+ * - bit 29: The thread voluntarily calls schedule() while holding an active
+ * preempt_v. All preempt_v context should be dropped in such case because
+ * V-regs are caller-saved. Only sstatus.VS=ON is persisted across a
+ * schedule() call.
* - bit 30: The in-kernel preempt_v context is saved, and requries to be
* restored when returning to the context that owns the preempt_v.
* - bit 31: The in-kernel preempt_v context is dirty, as signaled by the
@@ -109,6 +113,7 @@ struct pt_regs;
#define RISCV_PREEMPT_V 0x00000100
#define RISCV_PREEMPT_V_DIRTY 0x80000000
#define RISCV_PREEMPT_V_NEED_RESTORE 0x40000000
+#define RISCV_PREEMPT_V_IN_SCHEDULE 0x20000000
/* CPU-specific state of a task */
struct thread_struct {
diff --git a/arch/riscv/include/asm/vector.h b/arch/riscv/include/asm/vector.h
index be7d309cca8a..fbf17aba92c1 100644
--- a/arch/riscv/include/asm/vector.h
+++ b/arch/riscv/include/asm/vector.h
@@ -75,6 +75,11 @@ static __always_inline void riscv_v_disable(void)
csr_clear(CSR_SSTATUS, SR_VS);
}
+static __always_inline bool riscv_v_is_on(void)
+{
+ return !!(csr_read(CSR_SSTATUS) & SR_VS);
+}
+
static __always_inline void __vstate_csr_save(struct __riscv_v_ext_state *dest)
{
asm volatile (
@@ -243,6 +248,11 @@ static inline void __switch_to_vector(struct task_struct *prev,
struct pt_regs *regs;
if (riscv_preempt_v_started(prev)) {
+ if (riscv_v_is_on()) {
+ WARN_ON(prev->thread.riscv_v_flags & RISCV_V_CTX_DEPTH_MASK);
+ riscv_v_disable();
+ prev->thread.riscv_v_flags |= RISCV_PREEMPT_V_IN_SCHEDULE;
+ }
if (riscv_preempt_v_dirty(prev)) {
__riscv_v_vstate_save(&prev->thread.kernel_vstate,
prev->thread.kernel_vstate.datap);
@@ -253,10 +263,16 @@ static inline void __switch_to_vector(struct task_struct *prev,
riscv_v_vstate_save(&prev->thread.vstate, regs);
}
- if (riscv_preempt_v_started(next))
- riscv_preempt_v_set_restore(next);
- else
+ if (riscv_preempt_v_started(next)) {
+ if (next->thread.riscv_v_flags & RISCV_PREEMPT_V_IN_SCHEDULE) {
+ next->thread.riscv_v_flags &= ~RISCV_PREEMPT_V_IN_SCHEDULE;
+ riscv_v_enable();
+ } else {
+ riscv_preempt_v_set_restore(next);
+ }
+ } else {
riscv_v_vstate_set_restore(next, task_pt_regs(next));
+ }
}
void riscv_v_vstate_ctrl_init(struct task_struct *tsk);
--
2.43.0
* [PATCH v2 6/6] riscv: ftrace: support PREEMPT
2024-06-28 11:47 [PATCH v2 0/6] riscv: ftrace: atmoic patching and preempt improvements Andy Chiu
` (4 preceding siblings ...)
2024-06-28 11:47 ` [PATCH v2 5/6] riscv: vector: Support calling schedule() for preemptible Vector Andy Chiu
@ 2024-06-28 11:47 ` Andy Chiu
2024-08-13 11:00 ` [PATCH v2 0/6] riscv: ftrace: atmoic patching and preempt improvements Björn Töpel
6 siblings, 0 replies; 17+ messages in thread
From: Andy Chiu @ 2024-06-28 11:47 UTC (permalink / raw)
To: Paul Walmsley, Palmer Dabbelt, Albert Ou, Alexandre Ghiti,
Zong Li, Steven Rostedt, Masami Hiramatsu, Mark Rutland,
Nathan Chancellor, Nick Desaulniers, Bill Wendling, Justin Stitt,
Puranjay Mohan
Cc: Palmer Dabbelt, linux-riscv, linux-kernel, linux-trace-kernel,
llvm, Andy Chiu
Now, we can safely enable dynamic ftrace with kernel preemption.
Signed-off-by: Andy Chiu <andy.chiu@sifive.com>
---
arch/riscv/Kconfig | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/arch/riscv/Kconfig b/arch/riscv/Kconfig
index 55c70efbad0a..881ea466ff52 100644
--- a/arch/riscv/Kconfig
+++ b/arch/riscv/Kconfig
@@ -139,7 +139,7 @@ config RISCV
select HAVE_FTRACE_MCOUNT_RECORD if !XIP_KERNEL
select HAVE_FUNCTION_GRAPH_TRACER
select HAVE_FUNCTION_GRAPH_RETVAL if HAVE_FUNCTION_GRAPH_TRACER
- select HAVE_FUNCTION_TRACER if !XIP_KERNEL && !PREEMPTION
+ select HAVE_FUNCTION_TRACER if !XIP_KERNEL
select HAVE_EBPF_JIT if MMU
select HAVE_GUP_FAST if MMU
select HAVE_FUNCTION_ARG_ACCESS_API
--
2.43.0
* Re: [PATCH v2 0/6] riscv: ftrace: atmoic patching and preempt improvements
2024-06-28 11:47 [PATCH v2 0/6] riscv: ftrace: atmoic patching and preempt improvements Andy Chiu
` (5 preceding siblings ...)
2024-06-28 11:47 ` [PATCH v2 6/6] riscv: ftrace: support PREEMPT Andy Chiu
@ 2024-08-13 11:00 ` Björn Töpel
6 siblings, 0 replies; 17+ messages in thread
From: Björn Töpel @ 2024-08-13 11:00 UTC (permalink / raw)
To: Andy Chiu, Paul Walmsley, Palmer Dabbelt, Albert Ou,
Alexandre Ghiti, Zong Li, Steven Rostedt, Masami Hiramatsu,
Mark Rutland, Nathan Chancellor, Nick Desaulniers, Bill Wendling,
Justin Stitt, Puranjay Mohan
Cc: Palmer Dabbelt, linux-riscv, linux-kernel, linux-trace-kernel,
llvm, Evgenii Shatokhin, Andy Chiu
Andy,
Way overdue; I'm back from my vacation and have finally started to look
at the series. Thanks for working on it.
Andy Chiu <andy.chiu@sifive.com> writes:
> This series makes atmoic code patching possible in riscv ftrace. A
atomic
> direct benefit of this is that we can get rid of stop_machine() when
> patching function entries. This also makes it possible to run ftrace
> with full kernel preemption. Before this series, the kernel initializes
> patchable function entries to NOP4 + NOP4. To start tracing, it updates
> entries to AUIPC + JALR while holding other cores in stop_machine.
> stop_machine() is required because it is impossible to update 2
> instructions, and be seen atomically. And preemption must have to be
> prevented, as kernel preemption allows process to be scheduled out while
> executing on one of these instruction pairs.
>
> This series addresses the problem by initializing the first NOP4 to
> AUIPC. So, atmoic patching is possible because the kernel only has to
atomic
> update one instruction. As long as the instruction is naturally aligned,
> then it is expected to be updated atomically.
This came up on the last weekly patchwork call; given that RISC-V does
not yet (WIP!) have a formal specification expressing cmodx behaviour, the
assumptions made in this series (pretty much what you're describing here)
should be properly documented in Documentation/riscv in the next
revision.
From the earlier public discussions [1], this is "option A".
> However, the address range of the ftrace trampoline is limited to +-2K
> from ftrace_caller after appplying this series. This issue is expected
> to be solved by Puranjay's CALL_OPS, where it adds 8B naturally align
> data in front of pacthable functions and can use it to direct execution
patchable
Is it really usable to enable with the limited 2K range? Makes me wonder
if we should bake Puranjay's CALL_OPS work directly into this series.
Thoughts?
> out to any custom trampolines.
>
> The series is composed by three parts. The first part cleans up the
> existing issues when the kernel is compiled with clang.The second part
> modifies the ftrace code patching mechanism (2-4) as mentioned above.
> Then prepare ftrace to be able to run with kernel preemption (5,6)
>
> This series is tested after applying the following ftrace/patching in
> the fixes branch:
>
> - commit 57a369b6f2ee ("riscv: patch: Flush the icache right after
> patching to avoid illegal insns")
> - commit a2bd3a5b4b63 ("riscv: stacktrace: convert arch_stack_walk() to
> noinstr")
>
> Changes in v2:
> - Drop patch 1 as it is merged through fixes.
> - Drop patch 2, which converts kernel_text_address into notrace. As
> users can prevent tracing it by configuring the tracefs.
> - Use a more generic way in kconfig to align functions.
> - Link to v1: https://lore.kernel.org/r/20240613-dev-andyc-dyn-ftrace-v4-v1-0-1a538e12c01e@sifive.com
More input in subsequent patches.
Cheers,
Björn
[1] https://lore.kernel.org/linux-riscv/87zfv0onre.fsf@all.your.base.are.belong.to.us/
* Re: [PATCH v2 1/6] riscv: ftrace: support fastcc in Clang for WITH_ARGS
2024-06-28 11:47 ` [PATCH v2 1/6] riscv: ftrace: support fastcc in Clang for WITH_ARGS Andy Chiu
@ 2024-08-13 11:09 ` Björn Töpel
2024-08-22 8:39 ` Andy Chiu
0 siblings, 1 reply; 17+ messages in thread
From: Björn Töpel @ 2024-08-13 11:09 UTC (permalink / raw)
To: Andy Chiu, Paul Walmsley, Palmer Dabbelt, Albert Ou,
Alexandre Ghiti, Zong Li, Steven Rostedt, Masami Hiramatsu,
Mark Rutland, Nathan Chancellor, Nick Desaulniers, Bill Wendling,
Justin Stitt, Puranjay Mohan
Cc: Palmer Dabbelt, linux-riscv, linux-kernel, linux-trace-kernel,
llvm, Evgenii Shatokhin, Andy Chiu
Andy Chiu <andy.chiu@sifive.com> writes:
> Some caller-saved registers which are not defined as function arguments
> in the ABI can still be passed as arguments when the kernel is compiled
> with Clang. As a result, we must save and restore those registers to
> prevent ftrace from clobbering them.
>
> - [1]: https://reviews.llvm.org/D68559
>
> Reported-by: Evgenii Shatokhin <e.shatokhin@yadro.com>
> Closes: https://lore.kernel.org/linux-riscv/7e7c7914-445d-426d-89a0-59a9199c45b1@yadro.com/
> Acked-by: Nathan Chancellor <nathan@kernel.org>
> Signed-off-by: Andy Chiu <andy.chiu@sifive.com>
Shouldn't this be a separate fix? Still reading the details, but it
smells like something that should be disabled when building w/ LLVM, no?
Björn
* Re: [PATCH v2 2/6] riscv: ftrace: align patchable functions to 4 Byte boundary
2024-06-28 11:47 ` [PATCH v2 2/6] riscv: ftrace: align patchable functions to 4 Byte boundary Andy Chiu
@ 2024-08-13 11:11 ` Björn Töpel
0 siblings, 0 replies; 17+ messages in thread
From: Björn Töpel @ 2024-08-13 11:11 UTC (permalink / raw)
To: Andy Chiu, Paul Walmsley, Palmer Dabbelt, Albert Ou,
Alexandre Ghiti, Zong Li, Steven Rostedt, Masami Hiramatsu,
Mark Rutland, Nathan Chancellor, Nick Desaulniers, Bill Wendling,
Justin Stitt, Puranjay Mohan
Cc: Palmer Dabbelt, linux-riscv, linux-kernel, linux-trace-kernel,
llvm, Evgenii Shatokhin, Andy Chiu
Andy Chiu <andy.chiu@sifive.com> writes:
> We are changing ftrace code patching in order to remove dependency from
> stop_machine() and enable kernel preemption. This requires us to align
> functions entry at a 4-B align address.
>
> However, -falign-functions on older versions of GCC alone was not strong
> enoungh to align all functions. In fact, cold functions are not aligned
> after turning on optimizations. We consider this is a bug in GCC and
> turn off guess-branch-probility as a workaround to align all functions.
>
> GCC bug id: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88345
>
> The option -fmin-function-alignment is able to align all functions
> properly on newer versions of gcc. So, we add a cc-option to test if
> the toolchain supports it.
>
> Suggested-by: Evgenii Shatokhin <e.shatokhin@yadro.com>
> Signed-off-by: Andy Chiu <andy.chiu@sifive.com>
Reviewed-by: Björn Töpel <bjorn@rivosinc.com>
* Re: [PATCH v2 3/6] riscv: ftrace: prepare ftrace for atomic code patching
2024-06-28 11:47 ` [PATCH v2 3/6] riscv: ftrace: prepare ftrace for atomic code patching Andy Chiu
@ 2024-08-13 12:59 ` Björn Töpel
2024-08-14 12:57 ` Björn Töpel
0 siblings, 1 reply; 17+ messages in thread
From: Björn Töpel @ 2024-08-13 12:59 UTC (permalink / raw)
To: Andy Chiu, Paul Walmsley, Palmer Dabbelt, Albert Ou,
Alexandre Ghiti, Zong Li, Steven Rostedt, Masami Hiramatsu,
Mark Rutland, Nathan Chancellor, Nick Desaulniers, Bill Wendling,
Justin Stitt, Puranjay Mohan
Cc: Palmer Dabbelt, linux-riscv, linux-kernel, linux-trace-kernel,
llvm, Andy Chiu
Andy Chiu <andy.chiu@sifive.com> writes:
> We use an AUIPC+JALR pair to jump into a ftrace trampoline. Since
> instruction fetch can break down to 4 byte at a time, it is impossible
> to update two instructions without a race. In order to mitigate it, we
> initialize the patchable entry to AUIPC + NOP4. Then, the run-time code
> patching can change NOP4 to JALR to eable/disable ftrcae from a
enable ftrace
> function. This limits the reach of each ftrace entry to +-2KB displacing
> from ftrace_caller.
>
> Starting from the trampoline, we add a level of indirection for it to
> reach ftrace caller target. Now, it loads the target address from a
> memory location, then perform the jump. This enable the kernel to update
> the target atomically.
The +-2K limit is for direct calls, right?
...and this I would say breaks DIRECT_CALLS (which should be implemented
using call_ops later)?
Björn
* Re: [PATCH v2 3/6] riscv: ftrace: prepare ftrace for atomic code patching
2024-08-13 12:59 ` Björn Töpel
@ 2024-08-14 12:57 ` Björn Töpel
2024-09-11 10:57 ` Andy Chiu
0 siblings, 1 reply; 17+ messages in thread
From: Björn Töpel @ 2024-08-14 12:57 UTC (permalink / raw)
To: Andy Chiu, Paul Walmsley, Palmer Dabbelt, Albert Ou,
Alexandre Ghiti, Zong Li, Steven Rostedt, Masami Hiramatsu,
Mark Rutland, Nathan Chancellor, Nick Desaulniers, Bill Wendling,
Justin Stitt, Puranjay Mohan
Cc: Palmer Dabbelt, linux-riscv, linux-kernel, linux-trace-kernel,
llvm, Andy Chiu
Björn Töpel <bjorn@kernel.org> writes:
> Andy Chiu <andy.chiu@sifive.com> writes:
>
>> We use an AUIPC+JALR pair to jump into a ftrace trampoline. Since
>> instruction fetch can break down to 4 byte at a time, it is impossible
>> to update two instructions without a race. In order to mitigate it, we
>> initialize the patchable entry to AUIPC + NOP4. Then, the run-time code
>> patching can change NOP4 to JALR to eable/disable ftrcae from a
> enable ftrace
>
>> function. This limits the reach of each ftrace entry to +-2KB displacing
>> from ftrace_caller.
>>
>> Starting from the trampoline, we add a level of indirection for it to
>> reach ftrace caller target. Now, it loads the target address from a
>> memory location, then perform the jump. This enable the kernel to update
>> the target atomically.
>
> The +-2K limit is for direct calls, right?
>
> ...and this I would say breaks DIRECT_CALLS (which should be implemented
> using call_ops later)?
Thinking a bit more, and re-reading the series.
This series is good work, and it's a big improvement for DYNAMIC_FTRACE,
but
+int ftrace_make_call(struct dyn_ftrace *rec, unsigned long addr)
+{
+ unsigned long distance, orig_addr;
+
+ orig_addr = (unsigned long)&ftrace_caller;
+ distance = addr > orig_addr ? addr - orig_addr : orig_addr - addr;
+ if (distance > JALR_RANGE)
+ return -EINVAL;
+
+ return __ftrace_modify_call(rec->ip, addr, false);
+}
+
breaks WITH_DIRECT_CALLS. The direct trampoline will *never* be within
the JALR_RANGE.
Unless we're happy with a break (I'm not) -- I really think Puranjay's
CALL_OPS patch needs to be baked into the series!
Björn
* Re: [PATCH v2 1/6] riscv: ftrace: support fastcc in Clang for WITH_ARGS
2024-08-13 11:09 ` Björn Töpel
@ 2024-08-22 8:39 ` Andy Chiu
0 siblings, 0 replies; 17+ messages in thread
From: Andy Chiu @ 2024-08-22 8:39 UTC (permalink / raw)
To: Björn Töpel
Cc: Paul Walmsley, Palmer Dabbelt, Albert Ou, Alexandre Ghiti,
Zong Li, Steven Rostedt, Masami Hiramatsu, Mark Rutland,
Nathan Chancellor, Nick Desaulniers, Bill Wendling, Justin Stitt,
Puranjay Mohan, Palmer Dabbelt, linux-riscv, linux-kernel,
linux-trace-kernel, llvm, Evgenii Shatokhin
On Tue, Aug 13, 2024 at 7:09 PM Björn Töpel <bjorn@kernel.org> wrote:
>
> Andy Chiu <andy.chiu@sifive.com> writes:
>
> > Some caller-saved registers which are not defined as function arguments
> > in the ABI can still be passed as arguments when the kernel is compiled
> > with Clang. As a result, we must save and restore those registers to
> > prevent ftrace from clobbering them.
> >
> > - [1]: https://reviews.llvm.org/D68559
> >
> > Reported-by: Evgenii Shatokhin <e.shatokhin@yadro.com>
> > Closes: https://lore.kernel.org/linux-riscv/7e7c7914-445d-426d-89a0-59a9199c45b1@yadro.com/
> > Acked-by: Nathan Chancellor <nathan@kernel.org>
> > Signed-off-by: Andy Chiu <andy.chiu@sifive.com>
>
> Shouldn't this be a separate fix? Still reading the details, but it
> smells like something what should be disabled when building w/ LLVM, no?
Yes, this is a fix. My intention was to indicate the dependency, so
people won't run into trouble testing it. I can send it in a separate
series.
Thanks,
Andy
* Re: [PATCH v2 3/6] riscv: ftrace: prepare ftrace for atomic code patching
2024-08-14 12:57 ` Björn Töpel
@ 2024-09-11 10:57 ` Andy Chiu
2024-09-11 14:37 ` Björn Töpel
0 siblings, 1 reply; 17+ messages in thread
From: Andy Chiu @ 2024-09-11 10:57 UTC (permalink / raw)
To: bjorn
Cc: alexghiti, andy.chiu, aou, justinstitt, linux-kernel, linux-riscv,
linux-trace-kernel, llvm, mark.rutland, mhiramat, morbo, nathan,
ndesaulniers, palmer, palmer, paul.walmsley, puranjay, rostedt,
zong.li, Andy Chiu, yongxuan.wang
On Wed, Aug 14, 2024 at 02:57:52PM +0200, Björn Töpel wrote:
> Björn Töpel <bjorn@kernel.org> writes:
>
> > Andy Chiu <andy.chiu@sifive.com> writes:
> >
> >> We use an AUIPC+JALR pair to jump into a ftrace trampoline. Since
> >> instruction fetch can break down to 4 byte at a time, it is impossible
> >> to update two instructions without a race. In order to mitigate it, we
> >> initialize the patchable entry to AUIPC + NOP4. Then, the run-time code
> >> patching can change NOP4 to JALR to eable/disable ftrcae from a
> > enable ftrace
> >
> >> function. This limits the reach of each ftrace entry to +-2KB displacing
> >> from ftrace_caller.
> >>
> >> Starting from the trampoline, we add a level of indirection for it to
> >> reach ftrace caller target. Now, it loads the target address from a
> >> memory location, then perform the jump. This enable the kernel to update
> >> the target atomically.
> >
> > The +-2K limit is for direct calls, right?
> >
> > ...and this I would say breaks DIRECT_CALLS (which should be implemented
> > using call_ops later)?
>
> Thinking a bit more, and re-reading the series.
>
> This series is good work, and it's a big improvement for DYNAMIC_FTRACE,
> but
>
> +int ftrace_make_call(struct dyn_ftrace *rec, unsigned long addr)
> +{
> + unsigned long distance, orig_addr;
> +
> + orig_addr = (unsigned long)&ftrace_caller;
> + distance = addr > orig_addr ? addr - orig_addr : orig_addr - addr;
> + if (distance > JALR_RANGE)
> + return -EINVAL;
> +
> + return __ftrace_modify_call(rec->ip, addr, false);
> +}
> +
>
> breaks WITH_DIRECT_CALLS. The direct trampoline will *never* be within
> the JALR_RANGE.
Yes, it is hardly possible that a direct trampoline will be within the
range.
Recently I have been thinking about some solutions to address the issue.
One solution is replaying the AUIPC at function entries. The idea has two
sides. First, if we are returning to the second instruction at trap
return, then do sepc -= 4 so that the up-to-date AUIPC gets executed. The
other side is to fire synchronous IPIs that do remote fence.i at the
right times to prevent concurrent execution of a mix of old and new
instructions.
Consider replacing instructions at a function's patchable entry with the
following sequence:
Initial state:
--------------
0: AUIPC
4: JALR
Step1:
write(0, "J +8")
fence w,w
send sync local+remote fence.i
------------------------
0: J +8
4: JALR
Step2:
write(4, "JALR'")
fence w,w
send sync local+remote fence.i
------------------------
0: J +8
4: JALR'
Step3:
write(0, "AUIPC'")
fence w,w
send sync local+remote fence.i (to activate the call)
-----------------------
0: AUIPC'
4: JALR'
The following execution sequences are acceptable:
- AUIPC, JALR
- J +8, (skipping {JALR | JALR'})
- AUIPC', JALR'
And here are sequences that we want to prevent:
- AUIPC', JALR
- AUIPC, JALR'
The local core should never execute the forbidden sequence.
By listing all possible combinations of execution sequences on a remote
core, we can see that the dangerous sequences cannot happen:
let f be the fence.i at steps 1, 2 and 3, and let the numbers be the
locations of the code being executed. Here are all the combinations that
can be observed at a call site on a remote core:
fff04 -- updated seq
ff0f4 -- impossible, would be ff0f04, updated seq
ff04f -- impossible, would be ff08f, safe seq
f0ff4 -- impossible, would be f0ff04, updated seq
f0f4f -- impossible, would be f0f08f (safe), or f0f0f04 (updated)
f04ff -- impossible, would be f08ff, safe seq
0fff4 -- impossible, would be 0fff04, updated seq
0ff4f -- impossible, would be 0ff08f (safe), or 0ff0f04 (updated)
0f4ff -- impossible, would be 0f08ff (safe), 0f0f08f (safe), 0f0f0f04 (updated)
04fff -- old seq
After the 1st 'fence.i', remote cores should observe (J +8, JALR) or (J +8, JALR')
After the 2nd 'fence.i', remote cores should observe (J +8, JALR') or (AUIPC', JALR')
After the 3rd 'fence.i', remote cores should observe (AUIPC', JALR')
Remote cores should never execute (AUIPC',JALR) or (AUIPC,JALR')
To correctly implement the solution, the trap return code must match the
JALR and adjust sepc only for patchable function entries. This works,
although it relies on an undocumented invariant: we use t0 as both the
source and the destination register for the JALR at function entries, and
the compiler never generates a JALR with that register pattern.
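A rough sketch of the trap-return side (purely illustrative, not part of
the posted series; the function name is made up and error handling is
minimal):
/*
 * If we trapped on the JALR half of a patchable entry, step sepc back so
 * the (possibly just-rewritten) AUIPC is re-executed. Per the invariant
 * above, the entry's JALR is the only "jalr t0, <imm>(t0)" around, so the
 * instruction pattern identifies the site.
 */
static void replay_entry_auipc(struct pt_regs *regs)
{
	unsigned int insn;

	if (get_kernel_nofault(insn, (unsigned int *)regs->epc))
		return;
	/* opcode JALR, rd = rs1 = t0, any immediate */
	if ((insn & 0x000fffff) == JALR_T0)
		regs->epc -= 4;
}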
Another solution is inspired by Zcmt, and perhaps we can optimize it if
the hardware does support Zcmt. First, we allocate a page and divide it
into two halves. The first half of the page holds 255 x 8B destination
addresses. Then, starting from offset 2056, the second half of the page
is composed of a series of two-instruction (2 x 4-byte) stubs:
0: ftrace_tramp_1
8: ftrace_tramp_2
...
2040: ftrace_tramp_255
2048: ftrace_tramp_256 (not used when configured with 255 tramps)
2056:
ld t1, -2048(t1)
jr t1
ld t1, -2048(t1)
jr t1
...
4088:
ld t1, -2048(t1)
jr t1
4096:
It is possible to expand to 511 trampolines by adding a page
below, and making a load+jr sequence from +2040 offset.
When the kernel boots, we direct the AUIPCs at patchable entries to the
page, and disable the call by setting the second instruction to NOP4.
Then, we can effectively enable/disable/modify a call by rewriting only
the JALR instruction. It is possible to reuse most of the current patch
set to achieve atomic patching. A missing part is allocating and managing
the trampolines for ftrace users.
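A very rough sketch of that management part (all names are hypothetical
and the exact slot/stub pairing is glossed over):
/*
 * Reserve one destination slot in the first half of the page, publish
 * the target, and hand back the matching ld/jr stub in the second half,
 * which the per-function JALR can then reach.
 */
#define FTRACE_TRAMP_SLOTS	255
static void *ftrace_tramp_page;		/* the page described above */
static DECLARE_BITMAP(tramp_used, FTRACE_TRAMP_SLOTS);
static void *ftrace_alloc_tramp(unsigned long target)
{
	int slot = find_first_zero_bit(tramp_used, FTRACE_TRAMP_SLOTS);

	if (slot >= FTRACE_TRAMP_SLOTS)
		return NULL;
	set_bit(slot, tramp_used);
	WRITE_ONCE(((unsigned long *)ftrace_tramp_page)[slot], target);
	return ftrace_tramp_page + 2056 + slot * 8;	/* its ld/jr stub */
}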
Thanks,
Andy
* Re: [PATCH v2 3/6] riscv: ftrace: prepare ftrace for atomic code patching
2024-09-11 10:57 ` Andy Chiu
@ 2024-09-11 14:37 ` Björn Töpel
2024-09-11 15:03 ` Tao Chiu
0 siblings, 1 reply; 17+ messages in thread
From: Björn Töpel @ 2024-09-11 14:37 UTC (permalink / raw)
To: Andy Chiu
Cc: alexghiti, andy.chiu, aou, justinstitt, linux-kernel, linux-riscv,
linux-trace-kernel, llvm, mark.rutland, mhiramat, morbo, nathan,
ndesaulniers, palmer, palmer, paul.walmsley, puranjay, rostedt,
zong.li, Andy Chiu, yongxuan.wang
Andy Chiu <andybnac@gmail.com> writes:
> On Wed, Aug 14, 2024 at 02:57:52PM +0200, Björn Töpel wrote:
>> Björn Töpel <bjorn@kernel.org> writes:
>>
>> > Andy Chiu <andy.chiu@sifive.com> writes:
>> >
>> >> We use an AUIPC+JALR pair to jump into a ftrace trampoline. Since
>> >> instruction fetch can break down to 4 byte at a time, it is impossible
>> >> to update two instructions without a race. In order to mitigate it, we
>> >> initialize the patchable entry to AUIPC + NOP4. Then, the run-time code
>> >> patching can change NOP4 to JALR to eable/disable ftrcae from a
>> > enable ftrace
>> >
>> >> function. This limits the reach of each ftrace entry to +-2KB displacing
>> >> from ftrace_caller.
>> >>
>> >> Starting from the trampoline, we add a level of indirection for it to
>> >> reach ftrace caller target. Now, it loads the target address from a
>> >> memory location, then perform the jump. This enable the kernel to update
>> >> the target atomically.
>> >
>> > The +-2K limit is for direct calls, right?
>> >
>> > ...and this I would say breaks DIRECT_CALLS (which should be implemented
>> > using call_ops later)?
>>
>> Thinking a bit more, and re-reading the series.
>>
>> This series is good work, and it's a big improvement for DYNAMIC_FTRACE,
>> but
>>
>> +int ftrace_make_call(struct dyn_ftrace *rec, unsigned long addr)
>> +{
>> + unsigned long distance, orig_addr;
>> +
>> + orig_addr = (unsigned long)&ftrace_caller;
>> + distance = addr > orig_addr ? addr - orig_addr : orig_addr - addr;
>> + if (distance > JALR_RANGE)
>> + return -EINVAL;
>> +
>> + return __ftrace_modify_call(rec->ip, addr, false);
>> +}
>> +
>>
>> breaks WITH_DIRECT_CALLS. The direct trampoline will *never* be within
>> the JALR_RANGE.
>
>
> Yes, it is hardly possible that a direct trampoline will be within the
> range.
>
> Recently I have been thinking some solutions to address the issue. One
> solution is replaying AUIPC at function entries. The idea has two sides.
> First, if we are returning back to the second instruction at trap return,
> then do sepc -= 4 so it executes the up-to-date AUIPC. The other side is
> to fire synchronous IPI that does remote fence.i at right timings to
> prevent concurrent executing on a mix of old and new instructions.
>
> Consider replacing instructions at a function's patchable entry with the
> following sequence:
>
> Initial state:
> --------------
> 0: AUIPC
> 4: JALR
>
> Step1:
> write(0, "J +8")
> fence w,w
> send sync local+remote fence.i
> ------------------------
> 0: J +8
> 4: JALR
>
> Step2:
> write(4, "JALR'")
> fence w,w
> send sync local+remote fence.i
> ------------------------
> 0: J +8
> 4: JALR'
>
> Step3:
> write(0, "AUIPC'")
> fence w,w
> send sync local+remote fence.i (to activate the call)
> -----------------------
> 0: AUIPC'
> 4: JALR'
>
> The following execution sequences are acceptable:
> - AUIPC, JALR
> - J +8, (skipping {JALR | JALR'})
> - AUIPC', JALR'
>
> And here are sequences that we want to prevent:
> - AUIPC', JALR
> - AUIPC, JALR'
>
> The local core should never execute the forbidden sequence.
>
> By listing all possible combinations of executing sequence on a remote
> core, we can find that the dangerous seqence is impossible to happen:
>
> let f be the fence.i at step 1, 2, 3. And let numbers be the location of
> code being executed. Mathematically, here are all combinations at a site
> happening on a remote core:
>
> fff04 -- updated seq
> ff0f4 -- impossible, would be ff0f04, updated seq
> ff04f -- impossible, would be ff08f, safe seq
> f0ff4 -- impossible, would be f0ff04, updated seq
> f0f4f -- impossible, would be f0f08f (safe), or f0f0f04 (updated)
> f04ff -- impossible, would be f08ff, safe seq
> 0fff4 -- impossible, would be 0fff04, updated seq
> 0ff4f -- impossible, would be 0ff08f (safe), or 0ff0f04 (updated)
> 0f4ff -- impossible, would be 0f08ff (safe), 0f0f08f (safe), 0f0f0f04 (updated)
> 04fff -- old seq
>
> After the 1st 'fence.i', remote cores should observe (J +8, JALR) or (J +8, JALR')
> After the 2nd 'fence.i', remote cores should observe (J +8, JALR') or (AUIPC', JALR')
> After the 3rd 'fence.i', remote cores should observe (AUIPC', JALR')
>
> Remote cores should never execute (AUIPC',JALR) or (AUIPC,JALR')
>
> To correctly implement the solution, the trap return code must match JALR
> and adjust sepc only for patchable function entries. This is undocumently
> possible because we use t0 as source and destination registers for JALR
> at function entries. Compiler never generates JALR that uses the same
> register pattern.
>
> Another solution is inspired by zcmt, and perhaps we can optimize it if
> the hardware does support zcmt. First, we allocate a page and divide it
> into two halves. The first half of the page are 255 x 8B destination
> addresses. Then, starting from offset 2056, the second half of the page
> is composed by a series of 2 x 4 Byte instructions:
>
> 0: ftrace_tramp_1
> 8: ftrace_tramp_2
> ...
> 2040: ftrace_tramp_255
> 2048: ftrace_tramp_256 (not used when configured with 255 tramps)
> 2056:
> ld t1, -2048(t1)
> jr t1
> ld t1, -2048(t1)
> jr t1
> ...
> 4088:
> ld t1, -2048(t1)
> jr t1
> 4096:
>
> It is possible to expand to 511 trampolines by adding a page
> below, and making a load+jr sequence from +2040 offset.
>
> When the kernel boots, we direct AUIPCs at patchable entries to the page,
> and disable the call by setting the second instruction to NOP4. Then, we
> can effectively enable/disable/modify a call by setting only the
> instruction at JALR. It is possible to utilize most of the current patch
> set to achieve atomic patching. A missing part is to allocate and manage
> trampolines for ftrace users.
(I will need to digest above in detail!)
I don't think it's a good idea to try to handle direct calls w/o
call_ops. What I was trying to say is "add the call_ops patch to your
series, so that direct calls aren't broken". If direct calls depend on
call_ops -- sure, no worries. But don't try to get direct calls W/O
call_ops. That's a whole new bag of worms.
Some more high-level thoughts: ...is all this to work around cases where
we don't want the call_ops overhead? Is there really a use-case with a
platform that cannot handle the text overhead of call_ops?
Maybe I'm missing context here... but I'd say, let's follow what arm64
did (but obviously w/o the BL direct call optimization, and always jump
to a trampoline -- since that's not possible with RISC-V branch length),
and just do the call_ops way.
Then, as a second step, and if there are platforms that care, think
about a variant w/o call_ops.
Or what I wrote in the first section:
1. Keep this patch set
2. ...but add call_ops to it, and require call_ops for direct calls.
Just my $.02.
Björn
* Re: [PATCH v2 3/6] riscv: ftrace: prepare ftrace for atomic code patching
2024-09-11 14:37 ` Björn Töpel
@ 2024-09-11 15:03 ` Tao Chiu
2024-09-11 17:16 ` Björn Töpel
0 siblings, 1 reply; 17+ messages in thread
From: Tao Chiu @ 2024-09-11 15:03 UTC (permalink / raw)
To: Björn Töpel
Cc: alexghiti, andy.chiu, aou, justinstitt, linux-kernel, linux-riscv,
linux-trace-kernel, llvm, mark.rutland, mhiramat, morbo, nathan,
ndesaulniers, palmer, palmer, paul.walmsley, puranjay, rostedt,
zong.li, yongxuan.wang
Björn Töpel <bjorn@kernel.org> 於 2024年9月11日 週三 下午10:37寫道:
>
> Andy Chiu <andybnac@gmail.com> writes:
>
> > On Wed, Aug 14, 2024 at 02:57:52PM +0200, Björn Töpel wrote:
> >> Björn Töpel <bjorn@kernel.org> writes:
> >>
> >> > Andy Chiu <andy.chiu@sifive.com> writes:
> >> >
> >> >> We use an AUIPC+JALR pair to jump into a ftrace trampoline. Since
> >> >> instruction fetch can break down to 4 byte at a time, it is impossible
> >> >> to update two instructions without a race. In order to mitigate it, we
> >> >> initialize the patchable entry to AUIPC + NOP4. Then, the run-time code
> >> >> patching can change NOP4 to JALR to eable/disable ftrcae from a
> >> > enable ftrace
> >> >
> >> >> function. This limits the reach of each ftrace entry to +-2KB displacing
> >> >> from ftrace_caller.
> >> >>
> >> >> Starting from the trampoline, we add a level of indirection for it to
> >> >> reach ftrace caller target. Now, it loads the target address from a
> >> >> memory location, then perform the jump. This enable the kernel to update
> >> >> the target atomically.
> >> >
> >> > The +-2K limit is for direct calls, right?
> >> >
> >> > ...and this I would say breaks DIRECT_CALLS (which should be implemented
> >> > using call_ops later)?
> >>
> >> Thinking a bit more, and re-reading the series.
> >>
> >> This series is good work, and it's a big improvement for DYNAMIC_FTRACE,
> >> but
> >>
> >> +int ftrace_make_call(struct dyn_ftrace *rec, unsigned long addr)
> >> +{
> >> + unsigned long distance, orig_addr;
> >> +
> >> + orig_addr = (unsigned long)&ftrace_caller;
> >> + distance = addr > orig_addr ? addr - orig_addr : orig_addr - addr;
> >> + if (distance > JALR_RANGE)
> >> + return -EINVAL;
> >> +
> >> + return __ftrace_modify_call(rec->ip, addr, false);
> >> +}
> >> +
> >>
> >> breaks WITH_DIRECT_CALLS. The direct trampoline will *never* be within
> >> the JALR_RANGE.
> >
> >
> > Yes, it is hardly possible that a direct trampoline will be within the
> > range.
> >
> > Recently I have been thinking some solutions to address the issue. One
> > solution is replaying AUIPC at function entries. The idea has two sides.
> > First, if we are returning back to the second instruction at trap return,
> > then do sepc -= 4 so it executes the up-to-date AUIPC. The other side is
> > to fire synchronous IPI that does remote fence.i at right timings to
> > prevent concurrent executing on a mix of old and new instructions.
> >
> > Consider replacing instructions at a function's patchable entry with the
> > following sequence:
> >
> > Initial state:
> > --------------
> > 0: AUIPC
> > 4: JALR
> >
> > Step1:
> > write(0, "J +8")
> > fence w,w
> > send sync local+remote fence.i
> > ------------------------
> > 0: J +8
> > 4: JALR
> >
> > Step2:
> > write(4, "JALR'")
> > fence w,w
> > send sync local+remote fence.i
> > ------------------------
> > 0: J +8
> > 4: JALR'
> >
> > Step3:
> > write(0, "AUIPC'")
> > fence w,w
> > send sync local+remote fence.i (to activate the call)
> > -----------------------
> > 0: AUIPC'
> > 4: JALR'
> >
> > The following execution sequences are acceptable:
> > - AUIPC, JALR
> > - J +8, (skipping {JALR | JALR'})
> > - AUIPC', JALR'
> >
> > And here are sequences that we want to prevent:
> > - AUIPC', JALR
> > - AUIPC, JALR'
> >
> > The local core should never execute the forbidden sequence.
> >
> > By listing all possible combinations of execution sequences on a remote
> > core, we can see that the dangerous sequences cannot happen:
> >
> > Let f be the fence.i at steps 1, 2 and 3, and let the numbers be the
> > locations of the code being executed. Here are all the combinations that
> > can be observed at a patch site from a remote core:
> >
> > fff04 -- updated seq
> > ff0f4 -- impossible, would be ff0f04, updated seq
> > ff04f -- impossible, would be ff08f, safe seq
> > f0ff4 -- impossible, would be f0ff04, updated seq
> > f0f4f -- impossible, would be f0f08f (safe), or f0f0f04 (updated)
> > f04ff -- impossible, would be f08ff, safe seq
> > 0fff4 -- impossible, would be 0fff04, updated seq
> > 0ff4f -- impossible, would be 0ff08f (safe), or 0ff0f04 (updated)
> > 0f4ff -- impossible, would be 0f08ff (safe), 0f0f08f (safe), 0f0f0f04 (updated)
> > 04fff -- old seq
> >
> > After the 1st 'fence.i', remote cores should observe (J +8, JALR) or (J +8, JALR')
> > After the 2nd 'fence.i', remote cores should observe (J +8, JALR') or (AUIPC', JALR')
> > After the 3rd 'fence.i', remote cores should observe (AUIPC', JALR')
> >
> > Remote cores should never execute (AUIPC',JALR) or (AUIPC,JALR')
> >
> > To implement the solution correctly, the trap return code must match the
> > JALR and adjust sepc only for patchable function entries. This is possible,
> > although it relies on an undocumented invariant: we use t0 as both the
> > source and the destination register of the JALR at function entries, and
> > the compiler never generates a JALR with that register pattern.
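A rough sketch of that trap-return check (illustration only; the function names
are made up, and a real version has to hook the right spot in the exception
return path):

#include <linux/types.h>
#include <linux/uaccess.h>
#include <asm/ptrace.h>

#define INSN_MASK_JALR	0x0000707fU	/* opcode + funct3 */
#define INSN_MATCH_JALR	0x00000067U	/* jalr, funct3 == 0 */
#define REG_T0		5

static bool is_patchable_entry_jalr(u32 insn)
{
	u32 rd  = (insn >> 7) & 0x1f;
	u32 rs1 = (insn >> 15) & 0x1f;

	/* only the "jalr t0, imm(t0)" used by patchable entries qualifies */
	return (insn & INSN_MASK_JALR) == INSN_MATCH_JALR &&
	       rd == REG_T0 && rs1 == REG_T0;
}

static void replay_entry_auipc(struct pt_regs *regs)
{
	u32 insn;

	if (get_kernel_nofault(insn, (u32 *)regs->epc))
		return;

	if (is_patchable_entry_jalr(insn))
		regs->epc -= 4;		/* re-execute the (possibly re-patched) AUIPC */
}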
> >
> > Another solution is inspired by Zcmt, and perhaps we can optimize it when
> > the hardware actually supports Zcmt. First, we allocate a page and divide
> > it into two halves. The first half of the page holds 255 x 8-byte
> > destination addresses. Then, starting from offset 2056, the second half of
> > the page is composed of a series of two 4-byte instructions:
> >
> > 0: ftrace_tramp_1
> > 8: ftrace_tramp_2
> > ...
> > 2040: ftrace_tramp_255
> > 2048: ftrace_tramp_256 (not used when configured with 255 tramps)
> > 2056:
> > ld t1, -2048(t1)
> > jr t1
> > ld t1, -2048(t1)
> > jr t1
> > ...
> > 4088:
> > ld t1, -2048(t1)
> > jr t1
> > 4096:
> >
> > It is possible to expand to 511 trampolines by adding a second page below
> > and starting the load+jr sequence at offset +2040.
> >
> > When the kernel boots, we direct the AUIPCs at patchable entries to the
> > page, and disable the call by setting the second instruction to NOP4. Then
> > we can effectively enable/disable/modify a call by rewriting only the JALR
> > instruction. It is possible to reuse most of the current patch set to
> > achieve atomic patching. The missing part is allocating and managing
> > trampolines for ftrace users.
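To make the offset arithmetic concrete, a rough C model of that page
(illustration only: the struct and helpers are invented, and the slot indexing
merely mirrors the layout sketched above, where the ld/jr pairs start at byte
2056 and each pair sits exactly 2048 bytes above the 8-byte word it
dereferences):

#include <linux/compiler.h>
#include <linux/types.h>

struct ftrace_jump_page {
	u64 target[256];	/* bytes    0..2047: trampoline addresses     */
	u32 insn[512];		/* bytes 2048..4095: "ld t1,-2048(t1); jr t1" */
};

/* the address a function's AUIPC+JALR aims at for slot i (first pair at 2056) */
static inline void *jump_slot(struct ftrace_jump_page *pg, unsigned int i)
{
	return &pg->insn[2 * (i + 1)];
}

/* the 8-byte word that slot i's "ld t1, -2048(t1)" loads */
static inline u64 *jump_slot_target(struct ftrace_jump_page *pg, unsigned int i)
{
	return (u64 *)((char *)jump_slot(pg, i) - 2048);
}

/* retargeting a call is then a single naturally aligned 8-byte store */
static inline void jump_slot_set(struct ftrace_jump_page *pg, unsigned int i,
				 unsigned long tramp)
{
	WRITE_ONCE(*jump_slot_target(pg, i), tramp);
}

The instructions in the second half never change after boot; only the 8-byte
words in the first half do, which is what keeps the per-call update a single
atomic store.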
>
> (I will need to digest above in detail!)
>
> I don't think it's a good idea to try to handle direct calls w/o
> call_ops. What I was trying to say is "add the call_ops patch to your
> series, so that direct calls aren't broken". If direct calls depend on
> call_ops -- sure, no worries. But don't try to get direct calls W/O
> call_ops. That's a whole new bag of worms.
>
> Some more high-level thoughts: ...is all this to work around the case where
> we don't want the call_ops overhead? Is there really a use-case with a
> platform that can't handle the text overhead of call_ops?
Sorry for any confusion. I have no strong personal preference on what we
should do; I just want to have a technical discussion about what is possible
if we want to optimize code size.
>
> Maybe I'm missing context here... but I'd say, let's follow what arm64
> did (but obviously w/o the BL direct call optimization, and always jump
> to a trampoline -- since that's not possible with RISC-V branch length),
> and just do the call_ops way.
>
> Then, as a second step, and if there are platforms that care, think
> about a variant w/o call_ops.
>
> Or what I wrote in the first section:
>
> 1. Keep this patch set
> 2. ...but add call_ops to it, and require call_ops for direct calls.
>
> Just my $.02.
>
>
> Björn
* Re: [PATCH v2 3/6] riscv: ftrace: prepare ftrace for atomic code patching
2024-09-11 15:03 ` Tao Chiu
@ 2024-09-11 17:16 ` Björn Töpel
0 siblings, 0 replies; 17+ messages in thread
From: Björn Töpel @ 2024-09-11 17:16 UTC (permalink / raw)
To: Tao Chiu
Cc: alexghiti, andy.chiu, aou, justinstitt, linux-kernel, linux-riscv,
linux-trace-kernel, llvm, mark.rutland, mhiramat, morbo, nathan,
ndesaulniers, palmer, palmer, paul.walmsley, puranjay, rostedt,
zong.li, yongxuan.wang
Tao Chiu <andybnac@gmail.com> writes:
> Björn Töpel <bjorn@kernel.org> wrote on Wed, Sep 11, 2024 at 10:37 PM:
>
>>
>> Andy Chiu <andybnac@gmail.com> writes:
>>
>> > [...]
>>
>> (I will need to digest above in detail!)
>>
>> I don't think it's a good idea to try to handle direct calls w/o
>> call_ops. What I was trying to say is "add the call_ops patch to your
>> series, so that direct calls aren't broken". If direct calls depend on
>> call_ops -- sure, no worries. But don't try to get direct calls W/O
>> call_ops. That's a whole new bag of worms.
>>
>> Some more high-level thoughts: ...is all this to work around the case where
>> we don't want the call_ops overhead? Is there really a use-case with a
>> platform that can't handle the text overhead of call_ops?
>
> Sorry for any confusion. I have no strong personal preference on what we
> should do; I just want to have a technical discussion about what is possible
> if we want to optimize code size.
Understood! I'm -- as you know -- really eager to have a text patching
mechanism that is not horrible. ;-) I'd rather hold off on the text size
optimizations.
TL;DR -- your series is fine from my POV, but it needs call_ops so that it
doesn't break direct calls.
I will take your patch series and Puranjay's call_ops series, and see how
they play together.
We can discuss it at LPC next week!
Cheers,
Björn
end of thread [~2024-09-11 17:16 UTC]
Thread overview: 17+ messages
2024-06-28 11:47 [PATCH v2 0/6] riscv: ftrace: atmoic patching and preempt improvements Andy Chiu
2024-06-28 11:47 ` [PATCH v2 1/6] riscv: ftrace: support fastcc in Clang for WITH_ARGS Andy Chiu
2024-08-13 11:09 ` Björn Töpel
2024-08-22 8:39 ` Andy Chiu
2024-06-28 11:47 ` [PATCH v2 2/6] riscv: ftrace: align patchable functions to 4 Byte boundary Andy Chiu
2024-08-13 11:11 ` Björn Töpel
2024-06-28 11:47 ` [PATCH v2 3/6] riscv: ftrace: prepare ftrace for atomic code patching Andy Chiu
2024-08-13 12:59 ` Björn Töpel
2024-08-14 12:57 ` Björn Töpel
2024-09-11 10:57 ` Andy Chiu
2024-09-11 14:37 ` Björn Töpel
2024-09-11 15:03 ` Tao Chiu
2024-09-11 17:16 ` Björn Töpel
2024-06-28 11:47 ` [PATCH v2 4/6] riscv: ftrace: do not use stop_machine to update code Andy Chiu
2024-06-28 11:47 ` [PATCH v2 5/6] riscv: vector: Support calling schedule() for preemptible Vector Andy Chiu
2024-06-28 11:47 ` [PATCH v2 6/6] riscv: ftrace: support PREEMPT Andy Chiu
2024-08-13 11:00 ` [PATCH v2 0/6] riscv: ftrace: atmoic patching and preempt improvements Björn Töpel