* [PATCH v4 0/4] per-function storage support
@ 2025-03-03 13:28 Menglong Dong
2025-03-03 13:28 ` [PATCH v4 1/4] x86/ibt: factor out cfi and fineibt offset Menglong Dong
` (3 more replies)
0 siblings, 4 replies; 23+ messages in thread
From: Menglong Dong @ 2025-03-03 13:28 UTC (permalink / raw)
To: peterz, rostedt, mark.rutland, alexei.starovoitov
Cc: catalin.marinas, will, mhiramat, tglx, mingo, bp, dave.hansen,
x86, hpa, ast, daniel, andrii, martin.lau, eddyz87, yonghong.song,
john.fastabend, kpsingh, sdf, jolsa, davem, dsahern,
mathieu.desnoyers, nathan, nick.desaulniers+lkml, morbo,
samitolvanen, kees, dongml2, akpm, riel, rppt, linux-arm-kernel,
linux-kernel, linux-trace-kernel, bpf, netdev, llvm
For now, there is no way to set and get per-function metadata with low
overhead, which is inconvenient in some situations. Take the BPF
trampoline for example: we need to create a trampoline for each kernel
function, because we have to store per-function information, such as the
BPF progs and the function argument count, in the trampoline itself.
Creating all these trampolines has a noticeable performance and memory
cost. With per-function metadata storage, we can keep this information
in the metadata instead and create a single global BPF trampoline for all
kernel functions. In the global trampoline, we look up the information we
need from the function metadata through the ip (function address) with
almost no overhead.
Another beneficiary can be ftrace. For now, all kernel functions that are
enabled by dynamic ftrace are added to a filter hash if there is more
than one callback, and a hash lookup happens whenever the traced
functions are called, which has an impact on performance, see
__ftrace_ops_list_func() -> ftrace_ops_test(). With per-function metadata
support, we can record in the metadata whether a callback is enabled for
a given kernel function.
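To make the intended usage concrete, a consumer would do roughly the
following on function entry (a sketch only: kfunc_md_find() is introduced
in the 2nd patch, and handle_traced_function() is just a made-up name for
illustration):

	/* sketch: what a global BPF trampoline or an ftrace callback could
	 * do, given the address of the traced function in "ip"
	 */
	static void handle_traced_function(unsigned long ip)
	{
		struct kfunc_md *md;

		rcu_read_lock();
		md = kfunc_md_find((void *)ip);	/* index decoded from the padding */
		if (md) {
			/* use the per-function data here, e.g. the attached
			 * BPF progs or the argument count we plan to store
			 */
		}
		rcu_read_unlock();
	}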
In the 1st patch, we factor out FINEIBT_INSN_OFFSET and CFI_INSN_OFFSET to
make FineIBT work when the kernel functions are 32-byte aligned.
In the 2nd patch, we implement the per-function metadata storage by
storing the index of the metadata in the function padding space.
In the 3rd and 4th patches, we implement the per-function metadata storage
for x86 and arm64. More architectures can be supported in the future.
Changes since V3:
- rebase to the newest tip/x86/core, where the fineibt code has been updated
Changes since V2:
- split the patch into a series
- consider the effect on cfi and fineibt and introduce the 1st patch
Changes since V1:
- add support for arm64
- split out the arch-relevant code
- refactor the commit log
Menglong Dong (4):
x86/ibt: factor out cfi and fineibt offset
add per-function metadata storage support
x86: implement per-function metadata storage for x86
arm64: implement per-function metadata storage for arm64
arch/arm64/Kconfig | 15 ++
arch/arm64/Makefile | 23 ++-
arch/arm64/include/asm/ftrace.h | 34 +++++
arch/arm64/kernel/ftrace.c | 13 +-
arch/x86/Kconfig | 18 +++
arch/x86/include/asm/cfi.h | 13 +-
arch/x86/include/asm/ftrace.h | 54 ++++++++
arch/x86/kernel/alternative.c | 18 ++-
arch/x86/net/bpf_jit_comp.c | 22 +--
include/linux/kfunc_md.h | 25 ++++
kernel/Makefile | 1 +
kernel/trace/Makefile | 1 +
kernel/trace/kfunc_md.c | 239 ++++++++++++++++++++++++++++++++
13 files changed, 450 insertions(+), 26 deletions(-)
create mode 100644 include/linux/kfunc_md.h
create mode 100644 kernel/trace/kfunc_md.c
--
2.39.5
^ permalink raw reply [flat|nested] 23+ messages in thread
* [PATCH v4 1/4] x86/ibt: factor out cfi and fineibt offset
2025-03-03 13:28 [PATCH v4 0/4] per-function storage support Menglong Dong
@ 2025-03-03 13:28 ` Menglong Dong
2025-03-03 16:54 ` Peter Zijlstra
2025-03-03 13:28 ` [PATCH v4 2/4] add per-function metadata storage support Menglong Dong
` (2 subsequent siblings)
3 siblings, 1 reply; 23+ messages in thread
From: Menglong Dong @ 2025-03-03 13:28 UTC (permalink / raw)
To: peterz, rostedt, mark.rutland, alexei.starovoitov
Cc: catalin.marinas, will, mhiramat, tglx, mingo, bp, dave.hansen,
x86, hpa, ast, daniel, andrii, martin.lau, eddyz87, yonghong.song,
john.fastabend, kpsingh, sdf, jolsa, davem, dsahern,
mathieu.desnoyers, nathan, nick.desaulniers+lkml, morbo,
samitolvanen, kees, dongml2, akpm, riel, rppt, linux-arm-kernel,
linux-kernel, linux-trace-kernel, bpf, netdev, llvm
For now, the layout of cfi and fineibt is hard coded, and the padding is
fixed at 16 bytes.
Factor out FINEIBT_INSN_OFFSET and CFI_INSN_OFFSET. CFI_INSN_OFFSET is
the offset of the cfi preamble, which is the same as FUNCTION_ALIGNMENT
when CALL_PADDING is enabled. FINEIBT_INSN_OFFSET is the offset where we
place the fineibt preamble, which is 16 for now.
When FUNCTION_ALIGNMENT is bigger than 16, we place the fineibt preamble
in the last 16 bytes of the padding for better performance, which means
the fineibt preamble doesn't use the space that cfi uses.
FINEIBT_INSN_OFFSET is not used in fineibt_caller_start and
fineibt_paranoid_start, as it is always "0x10" there. Note that the
offset in fineibt_caller_start and fineibt_paranoid_start needs to be
updated if FINEIBT_INSN_OFFSET ever changes.
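For illustration (not part of the diff), with a 32-byte FUNCTION_ALIGNMENT
the resulting layout is roughly:

	__cfi_foo:				<- foo - CFI_INSN_OFFSET (32)
		mov	$hash, %eax		-- kCFI preamble, 5 bytes
		nop (11)
		<fineibt preamble, 16 bytes>	<- foo - FINEIBT_INSN_OFFSET (16)
	foo:

With 16-byte alignment both offsets are 16, so the behavior is unchanged.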
Signed-off-by: Menglong Dong <dongml2@chinatelecom.cn>
---
v4:
- rebase to the newest tip/x86/core, where the fineibt code has been updated
---
arch/x86/include/asm/cfi.h | 13 +++++++++----
arch/x86/kernel/alternative.c | 18 +++++++++++-------
arch/x86/net/bpf_jit_comp.c | 22 +++++++++++-----------
3 files changed, 31 insertions(+), 22 deletions(-)
diff --git a/arch/x86/include/asm/cfi.h b/arch/x86/include/asm/cfi.h
index 2f6a01f098b5..04525f2f6bf2 100644
--- a/arch/x86/include/asm/cfi.h
+++ b/arch/x86/include/asm/cfi.h
@@ -108,6 +108,14 @@ extern bhi_thunk __bhi_args_end[];
struct pt_regs;
+#ifdef CONFIG_CALL_PADDING
+#define FINEIBT_INSN_OFFSET 16
+#define CFI_INSN_OFFSET CONFIG_FUNCTION_ALIGNMENT
+#else
+#define FINEIBT_INSN_OFFSET 0
+#define CFI_INSN_OFFSET 5
+#endif
+
#ifdef CONFIG_CFI_CLANG
enum bug_trap_type handle_cfi_failure(struct pt_regs *regs);
#define __bpfcall
@@ -118,11 +126,8 @@ static inline int cfi_get_offset(void)
{
switch (cfi_mode) {
case CFI_FINEIBT:
- return 16;
case CFI_KCFI:
- if (IS_ENABLED(CONFIG_CALL_PADDING))
- return 16;
- return 5;
+ return CFI_INSN_OFFSET;
default:
return 0;
}
diff --git a/arch/x86/kernel/alternative.c b/arch/x86/kernel/alternative.c
index 32e4b801db99..0088d2313f33 100644
--- a/arch/x86/kernel/alternative.c
+++ b/arch/x86/kernel/alternative.c
@@ -917,7 +917,7 @@ void __init_or_module noinline apply_seal_endbr(s32 *start, s32 *end)
poison_endbr(addr);
if (IS_ENABLED(CONFIG_FINEIBT))
- poison_cfi(addr - 16);
+ poison_cfi(addr);
}
}
@@ -980,12 +980,13 @@ u32 cfi_get_func_hash(void *func)
{
u32 hash;
- func -= cfi_get_offset();
switch (cfi_mode) {
case CFI_FINEIBT:
+ func -= FINEIBT_INSN_OFFSET;
func += 7;
break;
case CFI_KCFI:
+ func -= CFI_INSN_OFFSET;
func += 1;
break;
default:
@@ -1372,7 +1373,7 @@ static int cfi_rewrite_preamble(s32 *start, s32 *end)
* have determined there are no indirect calls to it and we
* don't need no CFI either.
*/
- if (!is_endbr(addr + 16))
+ if (!is_endbr(addr + CFI_INSN_OFFSET))
continue;
hash = decode_preamble_hash(addr, &arity);
@@ -1380,6 +1381,7 @@ static int cfi_rewrite_preamble(s32 *start, s32 *end)
addr, addr, 5, addr))
return -EINVAL;
+ addr += (CFI_INSN_OFFSET - FINEIBT_INSN_OFFSET);
text_poke_early(addr, fineibt_preamble_start, fineibt_preamble_size);
WARN_ON(*(u32 *)(addr + fineibt_preamble_hash) != 0x12345678);
text_poke_early(addr + fineibt_preamble_hash, &hash, 4);
@@ -1402,10 +1404,10 @@ static void cfi_rewrite_endbr(s32 *start, s32 *end)
for (s = start; s < end; s++) {
void *addr = (void *)s + *s;
- if (!exact_endbr(addr + 16))
+ if (!exact_endbr(addr + CFI_INSN_OFFSET))
continue;
- poison_endbr(addr + 16);
+ poison_endbr(addr + CFI_INSN_OFFSET);
}
}
@@ -1543,12 +1545,12 @@ static void __apply_fineibt(s32 *start_retpoline, s32 *end_retpoline,
return;
case CFI_FINEIBT:
- /* place the FineIBT preamble at func()-16 */
+ /* place the FineIBT preamble at func()-FINEIBT_INSN_OFFSET */
ret = cfi_rewrite_preamble(start_cfi, end_cfi);
if (ret)
goto err;
- /* rewrite the callers to target func()-16 */
+ /* rewrite the callers to target func()-FINEIBT_INSN_OFFSET */
ret = cfi_rewrite_callers(start_retpoline, end_retpoline);
if (ret)
goto err;
@@ -1588,6 +1590,7 @@ static void poison_cfi(void *addr)
*/
switch (cfi_mode) {
case CFI_FINEIBT:
+ addr -= FINEIBT_INSN_OFFSET;
/*
* FineIBT prefix should start with an ENDBR.
*/
@@ -1607,6 +1610,7 @@ static void poison_cfi(void *addr)
break;
case CFI_KCFI:
+ addr -= CFI_INSN_OFFSET;
/*
* kCFI prefix should start with a valid hash.
*/
diff --git a/arch/x86/net/bpf_jit_comp.c b/arch/x86/net/bpf_jit_comp.c
index 72776dcb75aa..ee86a5df5ffb 100644
--- a/arch/x86/net/bpf_jit_comp.c
+++ b/arch/x86/net/bpf_jit_comp.c
@@ -415,6 +415,12 @@ static int emit_call(u8 **prog, void *func, void *ip);
static void emit_fineibt(u8 **pprog, u8 *ip, u32 hash, int arity)
{
u8 *prog = *pprog;
+#ifdef CONFIG_CALL_PADDING
+ int i;
+
+ for (i = 0; i < CFI_INSN_OFFSET - 16; i++)
+ EMIT1(0x90);
+#endif
EMIT_ENDBR();
EMIT3_off32(0x41, 0x81, 0xea, hash); /* subl $hash, %r10d */
@@ -432,20 +438,14 @@ static void emit_fineibt(u8 **pprog, u8 *ip, u32 hash, int arity)
static void emit_kcfi(u8 **pprog, u32 hash)
{
u8 *prog = *pprog;
+#ifdef CONFIG_CALL_PADDING
+ int i;
+#endif
EMIT1_off32(0xb8, hash); /* movl $hash, %eax */
#ifdef CONFIG_CALL_PADDING
- EMIT1(0x90);
- EMIT1(0x90);
- EMIT1(0x90);
- EMIT1(0x90);
- EMIT1(0x90);
- EMIT1(0x90);
- EMIT1(0x90);
- EMIT1(0x90);
- EMIT1(0x90);
- EMIT1(0x90);
- EMIT1(0x90);
+ for (i = 0; i < CFI_INSN_OFFSET - 5; i++)
+ EMIT1(0x90);
#endif
EMIT_ENDBR();
--
2.39.5
^ permalink raw reply related [flat|nested] 23+ messages in thread
* [PATCH v4 2/4] add per-function metadata storage support
2025-03-03 13:28 [PATCH v4 0/4] per-function storage support Menglong Dong
2025-03-03 13:28 ` [PATCH v4 1/4] x86/ibt: factor out cfi and fineibt offset Menglong Dong
@ 2025-03-03 13:28 ` Menglong Dong
2025-03-03 13:28 ` [PATCH v4 3/4] x86: implement per-function metadata storage for x86 Menglong Dong
2025-03-03 13:28 ` [PATCH v4 4/4] arm64: implement per-function metadata storage for arm64 Menglong Dong
3 siblings, 0 replies; 23+ messages in thread
From: Menglong Dong @ 2025-03-03 13:28 UTC (permalink / raw)
To: peterz, rostedt, mark.rutland, alexei.starovoitov
Cc: catalin.marinas, will, mhiramat, tglx, mingo, bp, dave.hansen,
x86, hpa, ast, daniel, andrii, martin.lau, eddyz87, yonghong.song,
john.fastabend, kpsingh, sdf, jolsa, davem, dsahern,
mathieu.desnoyers, nathan, nick.desaulniers+lkml, morbo,
samitolvanen, kees, dongml2, akpm, riel, rppt, linux-arm-kernel,
linux-kernel, linux-trace-kernel, bpf, netdev, llvm
For now, there is no way to set and get per-function metadata with low
overhead, which is inconvenient in some situations. Take the BPF
trampoline for example: we need to create a trampoline for each kernel
function, because we have to store per-function information, such as the
BPF progs and the function argument count, in the trampoline itself.
Creating all these trampolines has a noticeable performance and memory
cost. With per-function metadata storage, we can keep this information
in the metadata instead and create a single global BPF trampoline for all
kernel functions. In the global trampoline, we look up the information we
need from the function metadata through the ip (function address) with
almost no overhead.
Another beneficiary can be ftrace. For now, all kernel functions that are
enabled by dynamic ftrace are added to a filter hash if there is more
than one callback, and a hash lookup happens whenever the traced
functions are called, which has an impact on performance, see
__ftrace_ops_list_func() -> ftrace_ops_test(). With per-function metadata
support, we can record in the metadata whether a callback is enabled for
a given kernel function.
Support per-function metadata storage in the function padding; the
previous discussion can be found in [1]. Generally speaking, there are
two ways to implement this feature:
1. Create a function metadata array, and prepend an insn that holds the
index of the function metadata in the array, storing the insn in the
function padding.
2. Allocate the function metadata with kmalloc(), and prepend an insn
that holds the pointer to the metadata, storing the insn in the function
padding.
Compared with way 2, way 1 consumes less space, but we need to do more
work to manage the global function metadata array. This series implements
way 1; a usage sketch of the resulting API follows below.
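A minimal usage sketch of the API added below (illustration only; the
wrapper function and the way the metadata is used are made up):

	/* usage sketch: attach and release per-function data for the
	 * function at address "ip"
	 */
	static void kfunc_md_usage_example(void *ip)
	{
		struct kfunc_md *md;

		kfunc_md_lock();
		md = kfunc_md_get(ip);	/* creates the entry on first use */
		kfunc_md_unlock();

		if (!md)
			return;

		/* ... use md here, e.g. read md->func ... */

		kfunc_md_put_by_ip(ip);	/* drop the reference again */
	}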
Link: https://lore.kernel.org/bpf/CADxym3anLzM6cAkn_z71GDd_VeKiqqk1ts=xuiP7pr4PO6USPA@mail.gmail.com/ [1]
Signed-off-by: Menglong Dong <dongml2@chinatelecom.cn>
---
v2:
- add support for arm64
- split out the arch-relevant code
- refactor the commit log
---
include/linux/kfunc_md.h | 25 ++++
kernel/Makefile | 1 +
kernel/trace/Makefile | 1 +
kernel/trace/kfunc_md.c | 239 +++++++++++++++++++++++++++++++++++++++
4 files changed, 266 insertions(+)
create mode 100644 include/linux/kfunc_md.h
create mode 100644 kernel/trace/kfunc_md.c
diff --git a/include/linux/kfunc_md.h b/include/linux/kfunc_md.h
new file mode 100644
index 000000000000..df616f0fcb36
--- /dev/null
+++ b/include/linux/kfunc_md.h
@@ -0,0 +1,25 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef _LINUX_KFUNC_MD_H
+#define _LINUX_KFUNC_MD_H
+
+#include <linux/kernel.h>
+
+struct kfunc_md {
+ int users;
+ /* not used yet; keeps the struct 8-bytes aligned and leaves room
+ * for future fields.
+ */
+ int pad0;
+ void *func;
+};
+
+extern struct kfunc_md __rcu *kfunc_mds;
+
+struct kfunc_md *kfunc_md_find(void *ip);
+struct kfunc_md *kfunc_md_get(void *ip);
+void kfunc_md_put(struct kfunc_md *meta);
+void kfunc_md_put_by_ip(void *ip);
+void kfunc_md_lock(void);
+void kfunc_md_unlock(void);
+
+#endif
diff --git a/kernel/Makefile b/kernel/Makefile
index 87866b037fbe..7435674d5da3 100644
--- a/kernel/Makefile
+++ b/kernel/Makefile
@@ -108,6 +108,7 @@ obj-$(CONFIG_TRACE_CLOCK) += trace/
obj-$(CONFIG_RING_BUFFER) += trace/
obj-$(CONFIG_TRACEPOINTS) += trace/
obj-$(CONFIG_RETHOOK) += trace/
+obj-$(CONFIG_FUNCTION_METADATA) += trace/
obj-$(CONFIG_IRQ_WORK) += irq_work.o
obj-$(CONFIG_CPU_PM) += cpu_pm.o
obj-$(CONFIG_BPF) += bpf/
diff --git a/kernel/trace/Makefile b/kernel/trace/Makefile
index 057cd975d014..9780ee3f8d8d 100644
--- a/kernel/trace/Makefile
+++ b/kernel/trace/Makefile
@@ -106,6 +106,7 @@ obj-$(CONFIG_FTRACE_RECORD_RECURSION) += trace_recursion_record.o
obj-$(CONFIG_FPROBE) += fprobe.o
obj-$(CONFIG_RETHOOK) += rethook.o
obj-$(CONFIG_FPROBE_EVENTS) += trace_fprobe.o
+obj-$(CONFIG_FUNCTION_METADATA) += kfunc_md.o
obj-$(CONFIG_TRACEPOINT_BENCHMARK) += trace_benchmark.o
obj-$(CONFIG_RV) += rv/
diff --git a/kernel/trace/kfunc_md.c b/kernel/trace/kfunc_md.c
new file mode 100644
index 000000000000..7ec25bcf778d
--- /dev/null
+++ b/kernel/trace/kfunc_md.c
@@ -0,0 +1,239 @@
+// SPDX-License-Identifier: GPL-2.0
+
+#include <linux/slab.h>
+#include <linux/memory.h>
+#include <linux/rcupdate.h>
+#include <linux/ftrace.h>
+#include <linux/kfunc_md.h>
+
+#define ENTRIES_PER_PAGE (PAGE_SIZE / sizeof(struct kfunc_md))
+
+static u32 kfunc_md_count = ENTRIES_PER_PAGE, kfunc_md_used;
+struct kfunc_md __rcu *kfunc_mds;
+EXPORT_SYMBOL_GPL(kfunc_mds);
+
+static DEFINE_MUTEX(kfunc_md_mutex);
+
+
+void kfunc_md_unlock(void)
+{
+ mutex_unlock(&kfunc_md_mutex);
+}
+EXPORT_SYMBOL_GPL(kfunc_md_unlock);
+
+void kfunc_md_lock(void)
+{
+ mutex_lock(&kfunc_md_mutex);
+}
+EXPORT_SYMBOL_GPL(kfunc_md_lock);
+
+static u32 kfunc_md_get_index(void *ip)
+{
+ return *(u32 *)(ip - KFUNC_MD_DATA_OFFSET);
+}
+
+static void kfunc_md_init(struct kfunc_md *mds, u32 start, u32 end)
+{
+ u32 i;
+
+ for (i = start; i < end; i++)
+ mds[i].users = 0;
+}
+
+static int kfunc_md_page_order(void)
+{
+ return fls(DIV_ROUND_UP(kfunc_md_count, ENTRIES_PER_PAGE)) - 1;
+}
+
+/* Get the next usable function metadata entry. On success, return the
+ * usable kfunc_md and store its index in *index. If no usable kfunc_md
+ * is found in kfunc_mds, a larger array will be allocated.
+ */
+static struct kfunc_md *kfunc_md_get_next(u32 *index)
+{
+ struct kfunc_md *new_mds, *mds;
+ u32 i, order;
+
+ mds = rcu_dereference(kfunc_mds);
+ if (mds == NULL) {
+ order = kfunc_md_page_order();
+ new_mds = (void *)__get_free_pages(GFP_KERNEL, order);
+ if (!new_mds)
+ return NULL;
+ kfunc_md_init(new_mds, 0, kfunc_md_count);
+ /* This is the first initialization of kfunc_mds, so it is
+ * not used anywhere yet and we can update it directly.
+ */
+ rcu_assign_pointer(kfunc_mds, new_mds);
+ mds = new_mds;
+ }
+
+ if (likely(kfunc_md_used < kfunc_md_count)) {
+ /* maybe we can manage the used function metadata entry
+ * with a bit map ?
+ */
+ for (i = 0; i < kfunc_md_count; i++) {
+ if (!mds[i].users) {
+ kfunc_md_used++;
+ *index = i;
+ mds[i].users++;
+ return mds + i;
+ }
+ }
+ }
+
+ order = kfunc_md_page_order();
+ /* no available function metadata, so allocate a bigger function
+ * metadata array.
+ */
+ new_mds = (void *)__get_free_pages(GFP_KERNEL, order + 1);
+ if (!new_mds)
+ return NULL;
+
+ memcpy(new_mds, mds, kfunc_md_count * sizeof(*new_mds));
+ kfunc_md_init(new_mds, kfunc_md_count, kfunc_md_count * 2);
+
+ rcu_assign_pointer(kfunc_mds, new_mds);
+ synchronize_rcu();
+ free_pages((u64)mds, order);
+
+ mds = new_mds + kfunc_md_count;
+ *index = kfunc_md_count;
+ kfunc_md_count <<= 1;
+ kfunc_md_used++;
+ mds->users++;
+
+ return mds;
+}
+
+static int kfunc_md_text_poke(void *ip, void *insn, void *nop)
+{
+ void *target;
+ int ret = 0;
+ u8 *prog;
+
+ target = ip - KFUNC_MD_INSN_OFFSET;
+ mutex_lock(&text_mutex);
+ if (insn) {
+ if (!memcmp(target, insn, KFUNC_MD_INSN_SIZE))
+ goto out;
+
+ if (memcmp(target, nop, KFUNC_MD_INSN_SIZE)) {
+ ret = -EBUSY;
+ goto out;
+ }
+ prog = insn;
+ } else {
+ if (!memcmp(target, nop, KFUNC_MD_INSN_SIZE))
+ goto out;
+ prog = nop;
+ }
+
+ ret = kfunc_md_arch_poke(target, prog);
+out:
+ mutex_unlock(&text_mutex);
+ return ret;
+}
+
+static bool __kfunc_md_put(struct kfunc_md *md)
+{
+ u8 nop_insn[KFUNC_MD_INSN_SIZE];
+
+ if (WARN_ON_ONCE(md->users <= 0))
+ return false;
+
+ md->users--;
+ if (md->users > 0)
+ return false;
+
+ if (!kfunc_md_arch_exist(md->func))
+ return false;
+
+ kfunc_md_arch_nops(nop_insn);
+ /* release the metadata by recovering the function padding to NOPS */
+ kfunc_md_text_poke(md->func, NULL, nop_insn);
+ /* TODO: we need a way to shrink the array "kfunc_mds" */
+ kfunc_md_used--;
+
+ return true;
+}
+
+/* Decrease the reference count of the md, and release it when it drops to 0 */
+void kfunc_md_put(struct kfunc_md *md)
+{
+ mutex_lock(&kfunc_md_mutex);
+ __kfunc_md_put(md);
+ mutex_unlock(&kfunc_md_mutex);
+}
+EXPORT_SYMBOL_GPL(kfunc_md_put);
+
+/* Get an existing metadata entry by the function address; NULL will be
+ * returned if it doesn't exist.
+ *
+ * NOTE: the rcu read lock should be held while reading the metadata, and
+ * kfunc_md_lock should be held if writing happens.
+ */
+struct kfunc_md *kfunc_md_find(void *ip)
+{
+ struct kfunc_md *md;
+ u32 index;
+
+ if (kfunc_md_arch_exist(ip)) {
+ index = kfunc_md_get_index(ip);
+ if (WARN_ON_ONCE(index >= kfunc_md_count))
+ return NULL;
+
+ md = rcu_dereference(kfunc_mds) + index;
+ return md;
+ }
+ return NULL;
+}
+EXPORT_SYMBOL_GPL(kfunc_md_find);
+
+void kfunc_md_put_by_ip(void *ip)
+{
+ struct kfunc_md *md;
+
+ mutex_lock(&kfunc_md_mutex);
+ md = kfunc_md_find(ip);
+ if (md)
+ __kfunc_md_put(md);
+ mutex_unlock(&kfunc_md_mutex);
+}
+EXPORT_SYMBOL_GPL(kfunc_md_put_by_ip);
+
+/* Get an existing metadata entry by the function address, and create one
+ * if it doesn't exist. The reference count of the metadata is increased by 1.
+ *
+ * NOTE: always call this function with kfunc_md_lock held, and all
+ * updates to the metadata should also hold kfunc_md_lock.
+ */
+struct kfunc_md *kfunc_md_get(void *ip)
+{
+ u8 nop_insn[KFUNC_MD_INSN_SIZE], insn[KFUNC_MD_INSN_SIZE];
+ struct kfunc_md *md;
+ u32 index;
+
+ md = kfunc_md_find(ip);
+ if (md) {
+ md->users++;
+ return md;
+ }
+
+ md = kfunc_md_get_next(&index);
+ if (!md)
+ return NULL;
+
+ kfunc_md_arch_pretend(insn, index);
+ kfunc_md_arch_nops(nop_insn);
+
+ if (kfunc_md_text_poke(ip, insn, nop_insn)) {
+ kfunc_md_used--;
+ md->users = 0;
+ return NULL;
+ }
+ md->func = ip;
+
+ return md;
+}
+EXPORT_SYMBOL_GPL(kfunc_md_get);
--
2.39.5
^ permalink raw reply related [flat|nested] 23+ messages in thread
* [PATCH v4 3/4] x86: implement per-function metadata storage for x86
2025-03-03 13:28 [PATCH v4 0/4] per-function storage support Menglong Dong
2025-03-03 13:28 ` [PATCH v4 1/4] x86/ibt: factor out cfi and fineibt offset Menglong Dong
2025-03-03 13:28 ` [PATCH v4 2/4] add per-function metadata storage support Menglong Dong
@ 2025-03-03 13:28 ` Menglong Dong
2025-03-03 13:28 ` [PATCH v4 4/4] arm64: implement per-function metadata storage for arm64 Menglong Dong
3 siblings, 0 replies; 23+ messages in thread
From: Menglong Dong @ 2025-03-03 13:28 UTC (permalink / raw)
To: peterz, rostedt, mark.rutland, alexei.starovoitov
Cc: catalin.marinas, will, mhiramat, tglx, mingo, bp, dave.hansen,
x86, hpa, ast, daniel, andrii, martin.lau, eddyz87, yonghong.song,
john.fastabend, kpsingh, sdf, jolsa, davem, dsahern,
mathieu.desnoyers, nathan, nick.desaulniers+lkml, morbo,
samitolvanen, kees, dongml2, akpm, riel, rppt, linux-arm-kernel,
linux-kernel, linux-trace-kernel, bpf, netdev, llvm
With CONFIG_CALL_PADDING enabled, there are 16 bytes of padding space
before all kernel functions, and some kernel features can use it, such as
MITIGATION_CALL_DEPTH_TRACKING, CFI_CLANG, FINEIBT, etc.
From my research, MITIGATION_CALL_DEPTH_TRACKING consumes the tail 9 bytes
of the function padding, CFI_CLANG consumes the head 5 bytes, and FINEIBT
consumes all 16 bytes if it is enabled. So there is no space left for us
if MITIGATION_CALL_DEPTH_TRACKING and CFI_CLANG are both enabled, or if
FINEIBT is enabled.
On x86, we need 5 bytes to prepend a "mov $index, %eax" insn, which can
hold a 4-byte index. So we have the following logic:
1. use the head 5 bytes if CFI_CLANG is not enabled
2. use the tail 5 bytes if MITIGATION_CALL_DEPTH_TRACKING and FINEIBT are
not enabled
3. compile the kernel with FUNCTION_ALIGNMENT_32B otherwise
In the third case, we make the kernel functions 32-byte aligned, and there
are 32 bytes of padding before the functions. According to my testing,
the text size barely increased in this case, which is weird.
With 16-byte padding:
-rwxr-xr-x 1 401190688 x86-dev/vmlinux*
-rw-r--r-- 1 251068 x86-dev/vmlinux.a
-rw-r--r-- 1 851892992 x86-dev/vmlinux.o
-rw-r--r-- 1 12395008 x86-dev/arch/x86/boot/bzImage
With 32-byte padding:
-rwxr-xr-x 1 401318128 x86-dev/vmlinux*
-rw-r--r-- 1 251154 x86-dev/vmlinux.a
-rw-r--r-- 1 853636704 x86-dev/vmlinux.o
-rw-r--r-- 1 12509696 x86-dev/arch/x86/boot/bzImage
The way I tested should be right, and this is good news for us. In the
third case, the layout of the padding space will look like this if
FineIBT is enabled:
__cfi_func:
mov -- 5 -- cfi, not used anymore
nop
nop
nop
mov -- 5 -- function metadata
nop
nop
nop
fineibt -- 16 -- fineibt
func:
nopw -- 4
......
I tested FineIBT with the "cfi=fineibt" cmdline, and it works well
together with FUNCTION_METADATA enabled. I also tested the performance of
this feature by setting metadata for all kernel functions, and it takes
0.7s for 70k+ functions, not bad :/
I can't find a machine that supports IBT, so I didn't test IBT. I'd
appreciate it if someone could do that testing for me :/
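For clarity, the decode path on x86 boils down to the following
(illustration only, combining kfunc_md_arch_exist() from this patch with
kfunc_md_get_index() from the previous patch):

	/* illustration: does the padding of "ip" carry a metadata index? */
	if (*(u8 *)(ip - KFUNC_MD_INSN_OFFSET) == 0xB8)		/* opcode of "mov $imm32, %eax" */
		index = *(u32 *)(ip - KFUNC_MD_DATA_OFFSET);	/* the imm32 of that mov */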
Signed-off-by: Menglong Dong <dongml2@chinatelecom.cn>
---
v3:
- select FUNCTION_ALIGNMENT_32B in case 3, instead of an extra 5 bytes
---
arch/x86/Kconfig | 18 ++++++++++++
arch/x86/include/asm/ftrace.h | 54 +++++++++++++++++++++++++++++++++++
2 files changed, 72 insertions(+)
diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 5c277261507e..b0614188c80b 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -2518,6 +2518,24 @@ config PREFIX_SYMBOLS
def_bool y
depends on CALL_PADDING && !CFI_CLANG
+config FUNCTION_METADATA
+ bool "Per-function metadata storage support"
+ default y
+ depends on CC_HAS_ENTRY_PADDING && OBJTOOL
+ select CALL_PADDING
+ select FUNCTION_ALIGNMENT_32B if ((CFI_CLANG && CALL_THUNKS) || FINEIBT)
+ help
+ Support per-function metadata storage for kernel functions, which
+ allows the metadata of a function to be retrieved by its address
+ with almost no overhead.
+
+ The index of the metadata will be stored in the function padding
+ and consumes 5 bytes of it. FUNCTION_ALIGNMENT_32B will be selected
+ if "(CFI_CLANG && CALL_THUNKS) || FINEIBT" to make sure there is
+ enough padding space available for this feature. However, the
+ text size seems to barely change compared with
+ FUNCTION_ALIGNMENT_16B.
+
menuconfig CPU_MITIGATIONS
bool "Mitigations for CPU vulnerabilities"
default y
diff --git a/arch/x86/include/asm/ftrace.h b/arch/x86/include/asm/ftrace.h
index f2265246249a..700bb729e949 100644
--- a/arch/x86/include/asm/ftrace.h
+++ b/arch/x86/include/asm/ftrace.h
@@ -4,6 +4,28 @@
#include <asm/ptrace.h>
+#ifdef CONFIG_FUNCTION_METADATA
+#if (defined(CONFIG_CFI_CLANG) && defined(CONFIG_CALL_THUNKS)) || (defined(CONFIG_FINEIBT))
+ /* the CONFIG_FUNCTION_PADDING_BYTES is 32 in this case, use the
+ * range: [align + 8, align + 13].
+ */
+ #define KFUNC_MD_INSN_OFFSET (CONFIG_FUNCTION_PADDING_BYTES - 8)
+ #define KFUNC_MD_DATA_OFFSET (CONFIG_FUNCTION_PADDING_BYTES - 9)
+#else
+ #ifdef CONFIG_CFI_CLANG
+ /* use the space that CALL_THUNKS is supposed to use */
+ #define KFUNC_MD_INSN_OFFSET (5)
+ #define KFUNC_MD_DATA_OFFSET (4)
+ #else
+ /* use the space that CFI_CLANG is supposed to use */
+ #define KFUNC_MD_INSN_OFFSET (CONFIG_FUNCTION_PADDING_BYTES)
+ #define KFUNC_MD_DATA_OFFSET (CONFIG_FUNCTION_PADDING_BYTES - 1)
+ #endif
+#endif
+
+#define KFUNC_MD_INSN_SIZE (5)
+#endif
+
#ifdef CONFIG_FUNCTION_TRACER
#ifndef CC_USING_FENTRY
# error Compiler does not support fentry?
@@ -156,4 +178,36 @@ static inline bool arch_trace_is_compat_syscall(struct pt_regs *regs)
#endif /* !COMPILE_OFFSETS */
#endif /* !__ASSEMBLY__ */
+#if !defined(__ASSEMBLY__) && defined(CONFIG_FUNCTION_METADATA)
+#include <asm/text-patching.h>
+
+static inline bool kfunc_md_arch_exist(void *ip)
+{
+ return *(u8 *)(ip - KFUNC_MD_INSN_OFFSET) == 0xB8;
+}
+
+static inline void kfunc_md_arch_pretend(u8 *insn, u32 index)
+{
+ *insn = 0xB8;
+ *(u32 *)(insn + 1) = index;
+}
+
+static inline void kfunc_md_arch_nops(u8 *insn)
+{
+ *(insn++) = BYTES_NOP1;
+ *(insn++) = BYTES_NOP1;
+ *(insn++) = BYTES_NOP1;
+ *(insn++) = BYTES_NOP1;
+ *(insn++) = BYTES_NOP1;
+}
+
+static inline int kfunc_md_arch_poke(void *ip, u8 *insn)
+{
+ text_poke(ip, insn, KFUNC_MD_INSN_SIZE);
+ text_poke_sync();
+ return 0;
+}
+
+#endif
+
#endif /* _ASM_X86_FTRACE_H */
--
2.39.5
^ permalink raw reply related [flat|nested] 23+ messages in thread
* [PATCH v4 4/4] arm64: implement per-function metadata storage for arm64
2025-03-03 13:28 [PATCH v4 0/4] per-function storage support Menglong Dong
` (2 preceding siblings ...)
2025-03-03 13:28 ` [PATCH v4 3/4] x86: implement per-function metadata storage for x86 Menglong Dong
@ 2025-03-03 13:28 ` Menglong Dong
2025-03-03 21:40 ` Sami Tolvanen
3 siblings, 1 reply; 23+ messages in thread
From: Menglong Dong @ 2025-03-03 13:28 UTC (permalink / raw)
To: peterz, rostedt, mark.rutland, alexei.starovoitov
Cc: catalin.marinas, will, mhiramat, tglx, mingo, bp, dave.hansen,
x86, hpa, ast, daniel, andrii, martin.lau, eddyz87, yonghong.song,
john.fastabend, kpsingh, sdf, jolsa, davem, dsahern,
mathieu.desnoyers, nathan, nick.desaulniers+lkml, morbo,
samitolvanen, kees, dongml2, akpm, riel, rppt, linux-arm-kernel,
linux-kernel, linux-trace-kernel, bpf, netdev, llvm
The function padding is already used by ftrace if
CONFIG_DYNAMIC_FTRACE_WITH_CALL_OPS is enabled: since commit
baaf553d3bc3 ("arm64: Implement HAVE_DYNAMIC_FTRACE_WITH_CALL_OPS"), it
stores the pointer of the callback ops directly in the function padding,
which consumes 8 bytes.
So we can store the index directly in the function padding too, without
prepending an insn. With CONFIG_DYNAMIC_FTRACE_WITH_CALL_OPS enabled, the
functions are 8-byte aligned, and we compile the kernel with 8 extra
bytes (2 NOPs) of padding space. Otherwise, the functions are 4-byte
aligned, and only 4 extra bytes (1 NOP) are needed.
However, we have the same problem as Mark in the commit above: we can't
use the function padding together with CFI_CLANG, which can make clang
compute a wrong offset to the pre-function type hash. He said that he was
working with others on this problem 2 years ago. Hi Mark, is there any
progress on this problem?
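For clarity, on arm64 the index is stored as raw data in one of the
padding NOP slots, so the decode path boils down to the following
(illustration only, combining kfunc_md_arch_exist() from this patch with
kfunc_md_get_index() from the 2nd patch; note that KFUNC_MD_INSN_OFFSET
and KFUNC_MD_DATA_OFFSET are the same here):

	/* illustration: the same word serves as both the "is it set?" check
	 * and the index itself
	 */
	u32 slot = *(u32 *)(ip - KFUNC_MD_DATA_OFFSET);

	if (!aarch64_insn_is_nop(slot))	/* the slot has been patched... */
		index = slot;		/* ...and holds the metadata index */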
Signed-off-by: Menglong Dong <dongml2@chinatelecom.cn>
---
arch/arm64/Kconfig | 15 +++++++++++++++
arch/arm64/Makefile | 23 ++++++++++++++++++++--
arch/arm64/include/asm/ftrace.h | 34 +++++++++++++++++++++++++++++++++
arch/arm64/kernel/ftrace.c | 13 +++++++++++--
4 files changed, 81 insertions(+), 4 deletions(-)
diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig
index 940343beb3d4..7ed80f5eb267 100644
--- a/arch/arm64/Kconfig
+++ b/arch/arm64/Kconfig
@@ -1536,6 +1536,21 @@ config NODES_SHIFT
Specify the maximum number of NUMA Nodes available on the target
system. Increases memory reserved to accommodate various tables.
+config FUNCTION_METADATA
+ bool "Per-function metadata storage support"
+ default y
+ select HAVE_DYNAMIC_FTRACE_NO_PATCHABLE if !FTRACE_MCOUNT_USE_PATCHABLE_FUNCTION_ENTRY
+ depends on !CFI_CLANG
+ help
+ Support per-function metadata storage for kernel functions, which
+ allows the metadata of a function to be retrieved by its address
+ with almost no overhead.
+
+ The index of the metadata will be stored in the function padding
+ and consumes 4 bytes. If FUNCTION_ALIGNMENT_8B is enabled, 8 extra
+ bytes of function padding will be reserved during compilation.
+ Otherwise, only 4 extra bytes of function padding are needed.
+
source "kernel/Kconfig.hz"
config ARCH_SPARSEMEM_ENABLE
diff --git a/arch/arm64/Makefile b/arch/arm64/Makefile
index 2b25d671365f..2df2b0f4dd90 100644
--- a/arch/arm64/Makefile
+++ b/arch/arm64/Makefile
@@ -144,12 +144,31 @@ endif
CHECKFLAGS += -D__aarch64__
+ifeq ($(CONFIG_FUNCTION_METADATA),y)
+ ifeq ($(CONFIG_FUNCTION_ALIGNMENT_8B),y)
+ __padding_nops := 2
+ else
+ __padding_nops := 1
+ endif
+else
+ __padding_nops := 0
+endif
+
ifeq ($(CONFIG_DYNAMIC_FTRACE_WITH_CALL_OPS),y)
+ __padding_nops := $(shell echo $(__padding_nops) + 2 | bc)
KBUILD_CPPFLAGS += -DCC_USING_PATCHABLE_FUNCTION_ENTRY
- CC_FLAGS_FTRACE := -fpatchable-function-entry=4,2
+ CC_FLAGS_FTRACE := -fpatchable-function-entry=$(shell echo $(__padding_nops) + 2 | bc),$(__padding_nops)
else ifeq ($(CONFIG_DYNAMIC_FTRACE_WITH_ARGS),y)
+ CC_FLAGS_FTRACE := -fpatchable-function-entry=$(shell echo $(__padding_nops) + 2 | bc),$(__padding_nops)
KBUILD_CPPFLAGS += -DCC_USING_PATCHABLE_FUNCTION_ENTRY
- CC_FLAGS_FTRACE := -fpatchable-function-entry=2
+else ifeq ($(CONFIG_FUNCTION_METADATA),y)
+ CC_FLAGS_FTRACE += -fpatchable-function-entry=$(__padding_nops),$(__padding_nops)
+ ifneq ($(CONFIG_FUNCTION_TRACER),y)
+ KBUILD_CFLAGS += $(CC_FLAGS_FTRACE)
+ # some files need to remove this cflag when CONFIG_FUNCTION_TRACER
+ # is not enabled, so we need to export it here
+ export CC_FLAGS_FTRACE
+ endif
endif
ifeq ($(CONFIG_KASAN_SW_TAGS), y)
diff --git a/arch/arm64/include/asm/ftrace.h b/arch/arm64/include/asm/ftrace.h
index bfe3ce9df197..aa3eaa91bf82 100644
--- a/arch/arm64/include/asm/ftrace.h
+++ b/arch/arm64/include/asm/ftrace.h
@@ -24,6 +24,16 @@
#define FTRACE_PLT_IDX 0
#define NR_FTRACE_PLTS 1
+#ifdef CONFIG_FUNCTION_METADATA
+#ifdef CONFIG_DYNAMIC_FTRACE_WITH_CALL_OPS
+#define KFUNC_MD_DATA_OFFSET (AARCH64_INSN_SIZE * 3)
+#else
+#define KFUNC_MD_DATA_OFFSET AARCH64_INSN_SIZE
+#endif
+#define KFUNC_MD_INSN_SIZE AARCH64_INSN_SIZE
+#define KFUNC_MD_INSN_OFFSET KFUNC_MD_DATA_OFFSET
+#endif
+
/*
* Currently, gcc tends to save the link register after the local variables
* on the stack. This causes the max stack tracer to report the function
@@ -216,6 +226,30 @@ static inline bool arch_syscall_match_sym_name(const char *sym,
*/
return !strcmp(sym + 8, name);
}
+
+#ifdef CONFIG_FUNCTION_METADATA
+#include <asm/text-patching.h>
+
+static inline bool kfunc_md_arch_exist(void *ip)
+{
+ return !aarch64_insn_is_nop(*(u32 *)(ip - KFUNC_MD_INSN_OFFSET));
+}
+
+static inline void kfunc_md_arch_pretend(u8 *insn, u32 index)
+{
+ *(u32 *)insn = index;
+}
+
+static inline void kfunc_md_arch_nops(u8 *insn)
+{
+ *(u32 *)insn = aarch64_insn_gen_nop();
+}
+
+static inline int kfunc_md_arch_poke(void *ip, u8 *insn)
+{
+ return aarch64_insn_patch_text_nosync(ip, *(u32 *)insn);
+}
+#endif
#endif /* ifndef __ASSEMBLY__ */
#ifndef __ASSEMBLY__
diff --git a/arch/arm64/kernel/ftrace.c b/arch/arm64/kernel/ftrace.c
index d7c0d023dfe5..4191ff0037f5 100644
--- a/arch/arm64/kernel/ftrace.c
+++ b/arch/arm64/kernel/ftrace.c
@@ -88,8 +88,10 @@ unsigned long ftrace_call_adjust(unsigned long addr)
* to `BL <caller>`, which is at `addr + 4` bytes in either case.
*
*/
- if (!IS_ENABLED(CONFIG_DYNAMIC_FTRACE_WITH_CALL_OPS))
- return addr + AARCH64_INSN_SIZE;
+ if (!IS_ENABLED(CONFIG_DYNAMIC_FTRACE_WITH_CALL_OPS)) {
+ addr += AARCH64_INSN_SIZE;
+ goto out;
+ }
/*
* When using patchable-function-entry with pre-function NOPs, addr is
@@ -139,6 +141,13 @@ unsigned long ftrace_call_adjust(unsigned long addr)
/* Skip the first NOP after function entry */
addr += AARCH64_INSN_SIZE;
+out:
+ if (IS_ENABLED(CONFIG_FUNCTION_METADATA)) {
+ if (IS_ENABLED(CONFIG_FUNCTION_ALIGNMENT_8B))
+ addr += 2 * AARCH64_INSN_SIZE;
+ else
+ addr += AARCH64_INSN_SIZE;
+ }
return addr;
}
--
2.39.5
^ permalink raw reply related [flat|nested] 23+ messages in thread
* Re: [PATCH v4 1/4] x86/ibt: factor out cfi and fineibt offset
2025-03-03 13:28 ` [PATCH v4 1/4] x86/ibt: factor out cfi and fineibt offset Menglong Dong
@ 2025-03-03 16:54 ` Peter Zijlstra
2025-03-04 1:10 ` Menglong Dong
0 siblings, 1 reply; 23+ messages in thread
From: Peter Zijlstra @ 2025-03-03 16:54 UTC (permalink / raw)
To: Menglong Dong
Cc: rostedt, mark.rutland, alexei.starovoitov, catalin.marinas, will,
mhiramat, tglx, mingo, bp, dave.hansen, x86, hpa, ast, daniel,
andrii, martin.lau, eddyz87, yonghong.song, john.fastabend,
kpsingh, sdf, jolsa, davem, dsahern, mathieu.desnoyers, nathan,
nick.desaulniers+lkml, morbo, samitolvanen, kees, dongml2, akpm,
riel, rppt, linux-arm-kernel, linux-kernel, linux-trace-kernel,
bpf, netdev, llvm
On Mon, Mar 03, 2025 at 09:28:34PM +0800, Menglong Dong wrote:
> For now, the layout of cfi and fineibt is hard coded, and the padding is
> fixed on 16 bytes.
>
> Factor out FINEIBT_INSN_OFFSET and CFI_INSN_OFFSET. CFI_INSN_OFFSET is
> the offset of cfi, which is the same as FUNCTION_ALIGNMENT when
> CALL_PADDING is enabled. And FINEIBT_INSN_OFFSET is the offset where we
> put the fineibt preamble on, which is 16 for now.
>
> When the FUNCTION_ALIGNMENT is bigger than 16, we place the fineibt
> preamble on the last 16 bytes of the padding for better performance, which
> means the fineibt preamble don't use the space that cfi uses.
>
> The FINEIBT_INSN_OFFSET is not used in fineibt_caller_start and
> fineibt_paranoid_start, as it is always "0x10". Note that we need to
> update the offset in fineibt_caller_start and fineibt_paranoid_start if
> FINEIBT_INSN_OFFSET changes.
>
> Signed-off-by: Menglong Dong <dongml2@chinatelecom.cn>
I'm confused as to what exactly you mean.
Preamble will have __cfi symbol and some number of NOPs right before
actual symbol like:
__cfi_foo:
mov $0x12345678, %reg
nop
nop
nop
...
foo:
FineIBT must be at foo-16, has nothing to do with performance. This 16
can also be spelled: fineibt_preamble_size.
The total size of the preamble is FUNCTION_PADDING_BYTES + CFI_CLANG*5.
If you increase FUNCTION_PADDING_BYTES by another 5, which is what you
want I think, then we'll have total preamble of 21 bytes; 5 bytes kCFI,
16 bytes nop.
Then kCFI expects hash to be at -20, while FineIBT must be at -16.
This then means there is no unambiguous hole for you to stick your
meta-data thing (whatever that is).
There are two options: make meta data location depend on cfi_mode, or
have __apply_fineibt() rewrite kCFI to also be at -16, so that you can
have -21 for your 5 bytes.
I think I prefer latter.
In any case, I don't think we need *_INSN_OFFSET. At most we need
PREAMBLE_SIZE.
Hmm?
^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: [PATCH v4 4/4] arm64: implement per-function metadata storage for arm64
2025-03-03 13:28 ` [PATCH v4 4/4] arm64: implement per-function metadata storage for arm64 Menglong Dong
@ 2025-03-03 21:40 ` Sami Tolvanen
2025-03-04 1:21 ` Menglong Dong
0 siblings, 1 reply; 23+ messages in thread
From: Sami Tolvanen @ 2025-03-03 21:40 UTC (permalink / raw)
To: Menglong Dong
Cc: peterz, rostedt, mark.rutland, alexei.starovoitov,
catalin.marinas, will, mhiramat, tglx, mingo, bp, dave.hansen,
x86, hpa, ast, daniel, andrii, martin.lau, eddyz87, yonghong.song,
john.fastabend, kpsingh, sdf, jolsa, davem, dsahern,
mathieu.desnoyers, nathan, nick.desaulniers+lkml, morbo, kees,
dongml2, akpm, riel, rppt, linux-arm-kernel, linux-kernel,
linux-trace-kernel, bpf, netdev, llvm
On Mon, Mar 03, 2025 at 09:28:37PM +0800, Menglong Dong wrote:
> The per-function metadata storage is already used by ftrace if
> CONFIG_DYNAMIC_FTRACE_WITH_CALL_OPS is enabled, and it store the pointer
> of the callback directly to the function padding, which consume 8-bytes,
> in the commit
> baaf553d3bc3 ("arm64: Implement HAVE_DYNAMIC_FTRACE_WITH_CALL_OPS").
> So we can directly store the index to the function padding too, without
> a prepending. With CONFIG_DYNAMIC_FTRACE_WITH_CALL_OPS enabled, the
> function is 8-bytes aligned, and we will compile the kernel with extra
> 8-bytes (2 NOPS) padding space. Otherwise, the function is 4-bytes
> aligned, and only extra 4-bytes (1 NOPS) is needed.
>
> However, we have the same problem with Mark in the commit above: we can't
> use the function padding together with CFI_CLANG, which can make the clang
> compiles a wrong offset to the pre-function type hash. He said that he was
> working with others on this problem 2 years ago. Hi Mark, is there any
> progress on this problem?
I don't think there's been much progress since the previous
discussion a couple of years ago. The conclusion seemed to be
that adding a section parameter to -fpatchable-function-entry
would allow us to identify notrace functions while keeping a
consistent layout for functions:
https://lore.kernel.org/lkml/Y1QEzk%2FA41PKLEPe@hirez.programming.kicks-ass.net/
Sami
^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: [PATCH v4 1/4] x86/ibt: factor out cfi and fineibt offset
2025-03-03 16:54 ` Peter Zijlstra
@ 2025-03-04 1:10 ` Menglong Dong
2025-03-04 5:38 ` Peter Zijlstra
0 siblings, 1 reply; 23+ messages in thread
From: Menglong Dong @ 2025-03-04 1:10 UTC (permalink / raw)
To: Peter Zijlstra
Cc: rostedt, mark.rutland, alexei.starovoitov, catalin.marinas, will,
mhiramat, tglx, mingo, bp, dave.hansen, x86, hpa, ast, daniel,
andrii, martin.lau, eddyz87, yonghong.song, john.fastabend,
kpsingh, sdf, jolsa, davem, dsahern, mathieu.desnoyers, nathan,
nick.desaulniers+lkml, morbo, samitolvanen, kees, dongml2, akpm,
riel, rppt, linux-arm-kernel, linux-kernel, linux-trace-kernel,
bpf, netdev, llvm
On Tue, Mar 4, 2025 at 12:55 AM Peter Zijlstra <peterz@infradead.org> wrote:
>
> On Mon, Mar 03, 2025 at 09:28:34PM +0800, Menglong Dong wrote:
> > For now, the layout of cfi and fineibt is hard coded, and the padding is
> > fixed on 16 bytes.
> >
> > Factor out FINEIBT_INSN_OFFSET and CFI_INSN_OFFSET. CFI_INSN_OFFSET is
> > the offset of cfi, which is the same as FUNCTION_ALIGNMENT when
> > CALL_PADDING is enabled. And FINEIBT_INSN_OFFSET is the offset where we
> > put the fineibt preamble on, which is 16 for now.
> >
> > When the FUNCTION_ALIGNMENT is bigger than 16, we place the fineibt
> > preamble on the last 16 bytes of the padding for better performance, which
> > means the fineibt preamble don't use the space that cfi uses.
> >
> > The FINEIBT_INSN_OFFSET is not used in fineibt_caller_start and
> > fineibt_paranoid_start, as it is always "0x10". Note that we need to
> > update the offset in fineibt_caller_start and fineibt_paranoid_start if
> > FINEIBT_INSN_OFFSET changes.
> >
> > Signed-off-by: Menglong Dong <dongml2@chinatelecom.cn>
>
> I'm confused as to what exactly you mean.
>
> Preamble will have __cfi symbol and some number of NOPs right before
> actual symbol like:
>
> __cfi_foo:
> mov $0x12345678, %reg
> nop
> nop
> nop
> ...
> foo:
>
> FineIBT must be at foo-16, has nothing to do with performance. This 16
> can also be spelled: fineibt_preamble_size.
>
> The total size of the preamble is FUNCTION_PADDING_BYTES + CFI_CLANG*5.
>
> If you increase FUNCTION_PADDING_BYTES by another 5, which is what you
> want I think, then we'll have total preamble of 21 bytes; 5 bytes kCFI,
> 16 bytes nop.
Hello, sorry that I forgot to add something to the changelog. In fact,
I don't add extra 5-bytes anymore, which you can see in the 3rd patch.
The thing is that we can't add extra 5-bytes if CFI is enabled. Without
CFI, we can make the padding space any value, such as 5-bytes, and
the layout will be like this:
__align:
nop
nop
nop
nop
nop
foo: -- __align +5
However, the CFI will always make the cfi insn 16-bytes aligned. When
we set the FUNCTION_PADDING_BYTES to (11 + 5), the layout will be
like this:
__cfi_foo:
nop (11)
mov $0x12345678, %reg
nop (16)
foo:
and the padding space is 32-bytes actually. So, we can just select
FUNCTION_ALIGNMENT_32B instead, which makes the padding
space 32-bytes too, and have the following layout:
__cfi_foo:
mov $0x12345678, %reg
nop (27)
foo:
And the layout will be like this if fineibt and the function metadata are
both used:
__cfi_foo:
mov -- 5 -- cfi, not used anymore if fineibt is used
nop
nop
nop
mov -- 5 -- function metadata
nop
nop
nop
fineibt -- 16 -- fineibt
foo:
nopw -- 4
......
What I do in this commit is to make sure that the code in
arch/x86/kernel/alternative.c can find the location of the cfi hash and
the fineibt preamble depending on FUNCTION_ALIGNMENT. The offsets of cfi
and fineibt are different now, so we need to do some adjustment here.
In the beginning, I thought about using the following layout to keep the
offsets of cfi and fineibt the same:
__cfi_foo:
fineibt -- 16 -- fineibt
mov -- 5 -- function metadata
nop(11)
foo:
nopw -- 4
......
The adjustment would be easier in that mode. However, it may have an
impact on the performance. That is why I say it doesn't impact the
performance in this commit.
Sorry that I didn't describe this clearly in the commit log; I'll add the
explanation above to the commit log in the next version.
Thanks!
Menglong Dong
>
> Then kCFI expects hash to be at -20, while FineIBT must be at -16.
>
> This then means there is no unambiguous hole for you to stick your
> meta-data thing (whatever that is).
>
> There are two options: make meta data location depend on cfi_mode, or
> have __apply_fineibt() rewrite kCFI to also be at -16, so that you can
> have -21 for your 5 bytes.
>
> I think I prefer latter.
>
> In any case, I don't think we need *_INSN_OFFSET. At most we need
> PREAMBLE_SIZE.
>
> Hmm?
^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: [PATCH v4 4/4] arm64: implement per-function metadata storage for arm64
2025-03-03 21:40 ` Sami Tolvanen
@ 2025-03-04 1:21 ` Menglong Dong
0 siblings, 0 replies; 23+ messages in thread
From: Menglong Dong @ 2025-03-04 1:21 UTC (permalink / raw)
To: Sami Tolvanen
Cc: peterz, rostedt, mark.rutland, alexei.starovoitov,
catalin.marinas, will, mhiramat, tglx, mingo, bp, dave.hansen,
x86, hpa, ast, daniel, andrii, martin.lau, eddyz87, yonghong.song,
john.fastabend, kpsingh, sdf, jolsa, davem, dsahern,
mathieu.desnoyers, nathan, nick.desaulniers+lkml, morbo, kees,
dongml2, akpm, riel, rppt, linux-arm-kernel, linux-kernel,
linux-trace-kernel, bpf, netdev, llvm
On Tue, Mar 4, 2025 at 5:40 AM Sami Tolvanen <samitolvanen@google.com> wrote:
>
> On Mon, Mar 03, 2025 at 09:28:37PM +0800, Menglong Dong wrote:
> > The per-function metadata storage is already used by ftrace if
> > CONFIG_DYNAMIC_FTRACE_WITH_CALL_OPS is enabled, and it store the pointer
> > of the callback directly to the function padding, which consume 8-bytes,
> > in the commit
> > baaf553d3bc3 ("arm64: Implement HAVE_DYNAMIC_FTRACE_WITH_CALL_OPS").
> > So we can directly store the index to the function padding too, without
> > a prepending. With CONFIG_DYNAMIC_FTRACE_WITH_CALL_OPS enabled, the
> > function is 8-bytes aligned, and we will compile the kernel with extra
> > 8-bytes (2 NOPS) padding space. Otherwise, the function is 4-bytes
> > aligned, and only extra 4-bytes (1 NOPS) is needed.
> >
> > However, we have the same problem with Mark in the commit above: we can't
> > use the function padding together with CFI_CLANG, which can make the clang
> > compiles a wrong offset to the pre-function type hash. He said that he was
> > working with others on this problem 2 years ago. Hi Mark, is there any
> > progress on this problem?
>
> I don't think there's been much progress since the previous
> discussion a couple of years ago. The conclusion seemed to be
> that adding a section parameter to -fpatchable-function-entry
> would allow us to identify notrace functions while keeping a
> consistent layout for functions:
>
> https://lore.kernel.org/lkml/Y1QEzk%2FA41PKLEPe@hirez.programming.kicks-ass.net/
Thank you for your information, which helps me a lot.
I'll dig deeper to find a way to keep CFI working together
with this function.
Thanks!
Menglong Dong
>
> Sami
^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: [PATCH v4 1/4] x86/ibt: factor out cfi and fineibt offset
2025-03-04 1:10 ` Menglong Dong
@ 2025-03-04 5:38 ` Peter Zijlstra
2025-03-04 6:16 ` Peter Zijlstra
0 siblings, 1 reply; 23+ messages in thread
From: Peter Zijlstra @ 2025-03-04 5:38 UTC (permalink / raw)
To: Menglong Dong
Cc: rostedt, mark.rutland, alexei.starovoitov, catalin.marinas, will,
mhiramat, tglx, mingo, bp, dave.hansen, x86, hpa, ast, daniel,
andrii, martin.lau, eddyz87, yonghong.song, john.fastabend,
kpsingh, sdf, jolsa, davem, dsahern, mathieu.desnoyers, nathan,
nick.desaulniers+lkml, morbo, samitolvanen, kees, dongml2, akpm,
riel, rppt, linux-arm-kernel, linux-kernel, linux-trace-kernel,
bpf, netdev, llvm
On Tue, Mar 04, 2025 at 09:10:12AM +0800, Menglong Dong wrote:
> Hello, sorry that I forgot to add something to the changelog. In fact,
> I don't add extra 5-bytes anymore, which you can see in the 3rd patch.
>
> The thing is that we can't add extra 5-bytes if CFI is enabled. Without
> CFI, we can make the padding space any value, such as 5-bytes, and
> the layout will be like this:
>
> __align:
> nop
> nop
> nop
> nop
> nop
> foo: -- __align +5
>
> However, the CFI will always make the cfi insn 16-bytes aligned. When
> we set the FUNCTION_PADDING_BYTES to (11 + 5), the layout will be
> like this:
>
> __cfi_foo:
> nop (11)
> mov $0x12345678, %reg
> nop (16)
> foo:
>
> and the padding space is 32-bytes actually. So, we can just select
> FUNCTION_ALIGNMENT_32B instead, which makes the padding
> space 32-bytes too, and have the following layout:
>
> __cfi_foo:
> mov $0x12345678, %reg
> nop (27)
> foo:
*blink*, wtf is clang smoking.
I mean, you're right, this is what it is doing, but that is somewhat
unexpected. Let me go look at clang source, this is insane.
^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: [PATCH v4 1/4] x86/ibt: factor out cfi and fineibt offset
2025-03-04 5:38 ` Peter Zijlstra
@ 2025-03-04 6:16 ` Peter Zijlstra
2025-03-04 7:47 ` Menglong Dong
0 siblings, 1 reply; 23+ messages in thread
From: Peter Zijlstra @ 2025-03-04 6:16 UTC (permalink / raw)
To: Menglong Dong
Cc: rostedt, mark.rutland, alexei.starovoitov, catalin.marinas, will,
mhiramat, tglx, mingo, bp, dave.hansen, x86, hpa, ast, daniel,
andrii, martin.lau, eddyz87, yonghong.song, john.fastabend,
kpsingh, sdf, jolsa, davem, dsahern, mathieu.desnoyers, nathan,
nick.desaulniers+lkml, morbo, samitolvanen, kees, dongml2, akpm,
riel, rppt, linux-arm-kernel, linux-kernel, linux-trace-kernel,
bpf, netdev, llvm
On Tue, Mar 04, 2025 at 06:38:53AM +0100, Peter Zijlstra wrote:
> On Tue, Mar 04, 2025 at 09:10:12AM +0800, Menglong Dong wrote:
> > Hello, sorry that I forgot to add something to the changelog. In fact,
> > I don't add extra 5-bytes anymore, which you can see in the 3rd patch.
> >
> > The thing is that we can't add extra 5-bytes if CFI is enabled. Without
> > CFI, we can make the padding space any value, such as 5-bytes, and
> > the layout will be like this:
> >
> > __align:
> > nop
> > nop
> > nop
> > nop
> > nop
> > foo: -- __align +5
> >
> > However, the CFI will always make the cfi insn 16-bytes aligned. When
> > we set the FUNCTION_PADDING_BYTES to (11 + 5), the layout will be
> > like this:
> >
> > __cfi_foo:
> > nop (11)
> > mov $0x12345678, %reg
> > nop (16)
> > foo:
> >
> > and the padding space is 32-bytes actually. So, we can just select
> > FUNCTION_ALIGNMENT_32B instead, which makes the padding
> > space 32-bytes too, and have the following layout:
> >
> > __cfi_foo:
> > mov $0x12345678, %reg
> > nop (27)
> > foo:
>
> *blink*, wtf is clang smoking.
>
> I mean, you're right, this is what it is doing, but that is somewhat
> unexpected. Let me go look at clang source, this is insane.
Bah, this is because assemblers are stupid :/
There is no way to tell them to have foo aligned such that there are at
least N bytes free before it.
So what kCFI ends up having to do is align the __cfi symbol to the
function alignment, and then stuff enough nops in to make the real
symbol meet alignment.
And the end result is utter insanity.
I mean, look at this:
50: 2e e9 00 00 00 00 cs jmp 56 <__traceiter_initcall_level+0x46> 52: R_X86_64_PLT32 __x86_return_thunk-0x4
56: 66 2e 0f 1f 84 00 00 00 00 00 cs nopw 0x0(%rax,%rax,1)
0000000000000060 <__cfi___probestub_initcall_level>:
60: 90 nop
61: 90 nop
62: 90 nop
63: 90 nop
64: 90 nop
65: 90 nop
66: 90 nop
67: 90 nop
68: 90 nop
69: 90 nop
6a: 90 nop
6b: b8 b1 fd 66 f9 mov $0xf966fdb1,%eax
0000000000000070 <__probestub_initcall_level>:
70: 2e e9 00 00 00 00 cs jmp 76 <__probestub_initcall_level+0x6> 72: R_X86_64_PLT32 __x86_return_thunk-0x4
That's 21 bytes wasted, for no reason other than that asm doesn't have a
directive to say: get me a place that is M before N alignment.
Because ideally the whole above thing would look like:
50: 2e e9 00 00 00 00 cs jmp 56 <__traceiter_initcall_level+0x46> 52: R_X86_64_PLT32 __x86_return_thunk-0x4
56: 66 2e 0f 1f 84 cs nopw (%rax,%rax,1)
000000000000005b <__cfi___probestub_initcall_level>:
5b: b8 b1 fd 66 f9 mov $0xf966fdb1,%eax
0000000000000060 <__probestub_initcall_level>:
60: 2e e9 00 00 00 00 cs jmp 76 <__probestub_initcall_level+0x6> 72: R_X86_64_PLT32 __x86_return_thunk-0x4
^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: [PATCH v4 1/4] x86/ibt: factor out cfi and fineibt offset
2025-03-04 6:16 ` Peter Zijlstra
@ 2025-03-04 7:47 ` Menglong Dong
2025-03-04 8:41 ` Menglong Dong
2025-03-04 9:42 ` Peter Zijlstra
0 siblings, 2 replies; 23+ messages in thread
From: Menglong Dong @ 2025-03-04 7:47 UTC (permalink / raw)
To: Peter Zijlstra
Cc: rostedt, mark.rutland, alexei.starovoitov, catalin.marinas, will,
mhiramat, tglx, mingo, bp, dave.hansen, x86, hpa, ast, daniel,
andrii, martin.lau, eddyz87, yonghong.song, john.fastabend,
kpsingh, sdf, jolsa, davem, dsahern, mathieu.desnoyers, nathan,
nick.desaulniers+lkml, morbo, samitolvanen, kees, dongml2, akpm,
riel, rppt, linux-arm-kernel, linux-kernel, linux-trace-kernel,
bpf, netdev, llvm
On Tue, Mar 4, 2025 at 2:16 PM Peter Zijlstra <peterz@infradead.org> wrote:
>
> On Tue, Mar 04, 2025 at 06:38:53AM +0100, Peter Zijlstra wrote:
> > On Tue, Mar 04, 2025 at 09:10:12AM +0800, Menglong Dong wrote:
> > > Hello, sorry that I forgot to add something to the changelog. In fact,
> > > I don't add extra 5-bytes anymore, which you can see in the 3rd patch.
> > >
> > > The thing is that we can't add extra 5-bytes if CFI is enabled. Without
> > > CFI, we can make the padding space any value, such as 5-bytes, and
> > > the layout will be like this:
> > >
> > > __align:
> > > nop
> > > nop
> > > nop
> > > nop
> > > nop
> > > foo: -- __align +5
> > >
> > > However, the CFI will always make the cfi insn 16-bytes aligned. When
> > > we set the FUNCTION_PADDING_BYTES to (11 + 5), the layout will be
> > > like this:
> > >
> > > __cfi_foo:
> > > nop (11)
> > > mov $0x12345678, %reg
> > > nop (16)
> > > foo:
> > >
> > > and the padding space is 32-bytes actually. So, we can just select
> > > FUNCTION_ALIGNMENT_32B instead, which makes the padding
> > > space 32-bytes too, and have the following layout:
> > >
> > > __cfi_foo:
> > > mov $0x12345678, %reg
> > > nop (27)
> > > foo:
> >
> > *blink*, wtf is clang smoking.
> >
> > I mean, you're right, this is what it is doing, but that is somewhat
> > unexpected. Let me go look at clang source, this is insane.
>
> Bah, this is because assemblers are stupid :/
>
> There is no way to tell them to have foo aligned such that there are at
> least N bytes free before it.
>
> So what kCFI ends up having to do is align the __cfi symbol to the
> function alignment, and then stuff enough nops in to make the real
> symbol meet alignment.
>
> And the end result is utter insanity.
>
> I mean, look at this:
>
> 50: 2e e9 00 00 00 00 cs jmp 56 <__traceiter_initcall_level+0x46> 52: R_X86_64_PLT32 __x86_return_thunk-0x4
> 56: 66 2e 0f 1f 84 00 00 00 00 00 cs nopw 0x0(%rax,%rax,1)
>
> 0000000000000060 <__cfi___probestub_initcall_level>:
> 60: 90 nop
> 61: 90 nop
> 62: 90 nop
> 63: 90 nop
> 64: 90 nop
> 65: 90 nop
> 66: 90 nop
> 67: 90 nop
> 68: 90 nop
> 69: 90 nop
> 6a: 90 nop
> 6b: b8 b1 fd 66 f9 mov $0xf966fdb1,%eax
>
> 0000000000000070 <__probestub_initcall_level>:
> 70: 2e e9 00 00 00 00 cs jmp 76 <__probestub_initcall_level+0x6> 72: R_X86_64_PLT32 __x86_return_thunk-0x4
>
>
> That's 21 bytes wasted, for no reason other than that asm doesn't have a
> directive to say: get me a place that is M before N alignment.
>
> Because ideally the whole above thing would look like:
>
> 50: 2e e9 00 00 00 00 cs jmp 56 <__traceiter_initcall_level+0x46> 52: R_X86_64_PLT32 __x86_return_thunk-0x4
> 56: 66 2e 0f 1f 84 cs nopw (%rax,%rax,1)
>
> 000000000000005b <__cfi___probestub_initcall_level>:
> 5b: b8 b1 fd 66 f9 mov $0xf966fdb1,%eax
>
> 0000000000000060 <__probestub_initcall_level>:
> 60: 2e e9 00 00 00 00 cs jmp 76 <__probestub_initcall_level+0x6> 72: R_X86_64_PLT32 __x86_return_thunk-0x4
Hi, peter. Thank you for the testing, which is quite helpful
to understand the whole thing.
I was surprised at this too. Without CALL_PADDING, the cfi is
nop(11) + mov; with CALL_PADDING, the cfi is mov + nop(11),
which is weird, as it seems that we can select CALL_PADDING if
CFI_CLANG to make things consistent. And I thought that it is
designed to be this for some reasons :/
Hmm......so what should we do now? Accept and bear it,
or do something different?
I'm good at clang, so the solution that I can think of is how to
bear it :/
According to my testing, the text size will increase:
~2.2% if we make FUNCTION_PADDING_BYTES 27 and select
FUNCTION_ALIGNMENT_16B.
~3.5% if we make FUNCTION_PADDING_BYTES 27 and select
FUNCTION_ALIGNMENT_32B.
We don't have to select FUNCTION_ALIGNMENT_32B, so the
worst case is an increase of ~2.2%.
What do you think?
Thanks!
Menglong Dong
>
>
>
^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: [PATCH v4 1/4] x86/ibt: factor out cfi and fineibt offset
2025-03-04 7:47 ` Menglong Dong
@ 2025-03-04 8:41 ` Menglong Dong
2025-03-04 9:42 ` Peter Zijlstra
1 sibling, 0 replies; 23+ messages in thread
From: Menglong Dong @ 2025-03-04 8:41 UTC (permalink / raw)
To: Peter Zijlstra
Cc: rostedt, mark.rutland, alexei.starovoitov, catalin.marinas, will,
mhiramat, tglx, mingo, bp, dave.hansen, x86, hpa, ast, daniel,
andrii, martin.lau, eddyz87, yonghong.song, john.fastabend,
kpsingh, sdf, jolsa, davem, dsahern, mathieu.desnoyers, nathan,
nick.desaulniers+lkml, morbo, samitolvanen, kees, dongml2, akpm,
riel, rppt, linux-arm-kernel, linux-kernel, linux-trace-kernel,
bpf, netdev, llvm
On Tue, Mar 4, 2025 at 3:47 PM Menglong Dong <menglong8.dong@gmail.com> wrote:
>
> On Tue, Mar 4, 2025 at 2:16 PM Peter Zijlstra <peterz@infradead.org> wrote:
> >
> > On Tue, Mar 04, 2025 at 06:38:53AM +0100, Peter Zijlstra wrote:
> > > On Tue, Mar 04, 2025 at 09:10:12AM +0800, Menglong Dong wrote:
> > > > Hello, sorry that I forgot to add something to the changelog. In fact,
> > > > I don't add extra 5-bytes anymore, which you can see in the 3rd patch.
> > > >
> > > > The thing is that we can't add extra 5-bytes if CFI is enabled. Without
> > > > CFI, we can make the padding space any value, such as 5-bytes, and
> > > > the layout will be like this:
> > > >
> > > > __align:
> > > > nop
> > > > nop
> > > > nop
> > > > nop
> > > > nop
> > > > foo: -- __align +5
> > > >
> > > > However, the CFI will always make the cfi insn 16-bytes aligned. When
> > > > we set the FUNCTION_PADDING_BYTES to (11 + 5), the layout will be
> > > > like this:
> > > >
> > > > __cfi_foo:
> > > > nop (11)
> > > > mov $0x12345678, %reg
> > > > nop (16)
> > > > foo:
> > > >
> > > > and the padding space is 32-bytes actually. So, we can just select
> > > > FUNCTION_ALIGNMENT_32B instead, which makes the padding
> > > > space 32-bytes too, and have the following layout:
> > > >
> > > > __cfi_foo:
> > > > mov $0x12345678, %reg
> > > > nop (27)
> > > > foo:
> > >
> > > *blink*, wtf is clang smoking.
> > >
> > > I mean, you're right, this is what it is doing, but that is somewhat
> > > unexpected. Let me go look at clang source, this is insane.
> >
> > Bah, this is because assemblers are stupid :/
> >
> > There is no way to tell them to have foo aligned such that there are at
> > least N bytes free before it.
> >
> > So what kCFI ends up having to do is align the __cfi symbol to the
> > function alignment, and then stuff enough nops in to make the real
> > symbol meet alignment.
> >
> > And the end result is utter insanity.
> >
> > I mean, look at this:
> >
> > 50: 2e e9 00 00 00 00 cs jmp 56 <__traceiter_initcall_level+0x46> 52: R_X86_64_PLT32 __x86_return_thunk-0x4
> > 56: 66 2e 0f 1f 84 00 00 00 00 00 cs nopw 0x0(%rax,%rax,1)
> >
> > 0000000000000060 <__cfi___probestub_initcall_level>:
> > 60: 90 nop
> > 61: 90 nop
> > 62: 90 nop
> > 63: 90 nop
> > 64: 90 nop
> > 65: 90 nop
> > 66: 90 nop
> > 67: 90 nop
> > 68: 90 nop
> > 69: 90 nop
> > 6a: 90 nop
> > 6b: b8 b1 fd 66 f9 mov $0xf966fdb1,%eax
> >
> > 0000000000000070 <__probestub_initcall_level>:
> > 70: 2e e9 00 00 00 00 cs jmp 76 <__probestub_initcall_level+0x6> 72: R_X86_64_PLT32 __x86_return_thunk-0x4
> >
> >
> > That's 21 bytes wasted, for no reason other than that asm doesn't have a
> > directive to say: get me a place that is M before N alignment.
> >
> > Because ideally the whole above thing would look like:
> >
> > 50: 2e e9 00 00 00 00 cs jmp 56 <__traceiter_initcall_level+0x46> 52: R_X86_64_PLT32 __x86_return_thunk-0x4
> > 56: 66 2e 0f 1f 84 cs nopw (%rax,%rax,1)
> >
> > 000000000000005b <__cfi___probestub_initcall_level>:
> > 5b: b8 b1 fd 66 f9 mov $0xf966fdb1,%eax
> >
> > 0000000000000060 <__probestub_initcall_level>:
> > 60: 2e e9 00 00 00 00 cs jmp 76 <__probestub_initcall_level+0x6> 72: R_X86_64_PLT32 __x86_return_thunk-0x4
>
> Hi, peter. Thank you for the testing, which is quite helpful
> to understand the whole thing.
>
> I was surprised at this too. Without CALL_PADDING, the cfi is
> nop(11) + mov; with CALL_PADDING, the cfi is mov + nop(11),
> which is weird, as it seems that we can select CALL_PADDING if
> CFI_CLANG to make things consistent. And I thought that it is
> designed to be this for some reasons :/
>
> Hmm......so what should we do now? Accept and bear it,
> or do something different?
>
> I'm good at clang, so the solution that I can think of is how to
*not good at*
> bear it :/
>
> According to my testing, the text size will increase:
>
> ~2.2% if we make FUNCTION_PADDING_BYTES 27 and select
> FUNCTION_ALIGNMENT_16B.
>
> ~3.5% if we make FUNCTION_PADDING_BYTES 27 and select
> FUNCTION_ALIGNMENT_32B.
>
> We don't have to select FUNCTION_ALIGNMENT_32B, so the
> worst case is to increase ~2.2%.
>
> What do you think?
>
> Thanks!
> Menglong Dong
>
> >
> >
> >
^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: [PATCH v4 1/4] x86/ibt: factor out cfi and fineibt offset
2025-03-04 7:47 ` Menglong Dong
2025-03-04 8:41 ` Menglong Dong
@ 2025-03-04 9:42 ` Peter Zijlstra
2025-03-04 14:52 ` H. Peter Anvin
1 sibling, 1 reply; 23+ messages in thread
From: Peter Zijlstra @ 2025-03-04 9:42 UTC (permalink / raw)
To: Menglong Dong
Cc: rostedt, mark.rutland, alexei.starovoitov, catalin.marinas, will,
mhiramat, tglx, mingo, bp, dave.hansen, x86, hpa, ast, daniel,
andrii, martin.lau, eddyz87, yonghong.song, john.fastabend,
kpsingh, sdf, jolsa, davem, dsahern, mathieu.desnoyers, nathan,
nick.desaulniers+lkml, morbo, samitolvanen, kees, dongml2, akpm,
riel, rppt, linux-arm-kernel, linux-kernel, linux-trace-kernel,
bpf, netdev, llvm
On Tue, Mar 04, 2025 at 03:47:45PM +0800, Menglong Dong wrote:
> We don't have to select FUNCTION_ALIGNMENT_32B, so the
> worst case is to increase ~2.2%.
>
> What do you think?
Well, since I don't understand what you need this for at all, I'm firmly
on the side of not doing this.
What actual problem is being solved with this meta data nonsense? Why is
it worth blowing up our I$ footprint over.
Also note, that if you're going to be explaining this, start from
scratch, as I have absolutely 0 clues about BPF and such.
^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: [PATCH v4 1/4] x86/ibt: factor out cfi and fineibt offset
2025-03-04 9:42 ` Peter Zijlstra
@ 2025-03-04 14:52 ` H. Peter Anvin
2025-03-05 1:19 ` Menglong Dong
0 siblings, 1 reply; 23+ messages in thread
From: H. Peter Anvin @ 2025-03-04 14:52 UTC (permalink / raw)
To: Peter Zijlstra, Menglong Dong
Cc: rostedt, mark.rutland, alexei.starovoitov, catalin.marinas, will,
mhiramat, tglx, mingo, bp, dave.hansen, x86, ast, daniel, andrii,
martin.lau, eddyz87, yonghong.song, john.fastabend, kpsingh, sdf,
jolsa, davem, dsahern, mathieu.desnoyers, nathan,
nick.desaulniers+lkml, morbo, samitolvanen, kees, dongml2, akpm,
riel, rppt, linux-arm-kernel, linux-kernel, linux-trace-kernel,
bpf, netdev, llvm
On March 4, 2025 1:42:20 AM PST, Peter Zijlstra <peterz@infradead.org> wrote:
>On Tue, Mar 04, 2025 at 03:47:45PM +0800, Menglong Dong wrote:
>> We don't have to select FUNCTION_ALIGNMENT_32B, so the
>> worst case is to increase ~2.2%.
>>
>> What do you think?
>
>Well, since I don't understand what you need this for at all, I'm firmly
>on the side of not doing this.
>
>What actual problem is being solved with this meta data nonsense? Why is
>it worth blowing up our I$ footprint over.
>
>Also note, that if you're going to be explaining this, start from
>scratch, as I have absolutely 0 clues about BPF and such.
I would appreciate such information as well. The idea seems dubious on the surface.
^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: [PATCH v4 1/4] x86/ibt: factor out cfi and fineibt offset
2025-03-04 14:52 ` H. Peter Anvin
@ 2025-03-05 1:19 ` Menglong Dong
2025-03-05 8:29 ` H. Peter Anvin
2025-03-05 15:03 ` Steven Rostedt
0 siblings, 2 replies; 23+ messages in thread
From: Menglong Dong @ 2025-03-05 1:19 UTC (permalink / raw)
To: H. Peter Anvin, Peter Zijlstra
Cc: rostedt, mark.rutland, alexei.starovoitov, catalin.marinas, will,
mhiramat, tglx, mingo, bp, dave.hansen, x86, ast, daniel, andrii,
martin.lau, eddyz87, yonghong.song, john.fastabend, kpsingh, sdf,
jolsa, davem, dsahern, mathieu.desnoyers, nathan,
nick.desaulniers+lkml, morbo, samitolvanen, kees, dongml2, akpm,
riel, rppt, linux-arm-kernel, linux-kernel, linux-trace-kernel,
bpf, netdev, llvm
On Tue, Mar 4, 2025 at 10:53 PM H. Peter Anvin <hpa@zytor.com> wrote:
>
> On March 4, 2025 1:42:20 AM PST, Peter Zijlstra <peterz@infradead.org> wrote:
> >On Tue, Mar 04, 2025 at 03:47:45PM +0800, Menglong Dong wrote:
> >> We don't have to select FUNCTION_ALIGNMENT_32B, so the
> >> worst case is to increase ~2.2%.
> >>
> >> What do you think?
> >
> >Well, since I don't understand what you need this for at all, I'm firmly
> >on the side of not doing this.
> >
> >What actual problem is being solved with this meta data nonsense? Why is
> >it worth blowing up our I$ footprint over.
> >
> >Also note, that if you're going to be explaining this, start from
> >scratch, as I have absolutely 0 clues about BPF and such.
>
> I would appreciate such information as well. The idea seems dubious on the surface.
Ok, let me explain it from the beginning. (My English is not good,
but I'll try to describe it as clearly as possible :/)
Many BPF program types depend on the BPF trampoline, such as
BPF_PROG_TYPE_TRACING, BPF_PROG_TYPE_EXT, BPF_PROG_TYPE_LSM, etc.
The BPF trampoline is a bridge between a kernel (or bpf) function
and a BPF program, and it acts just like the trampoline that ftrace
uses.
Generally speaking, it is used to hook a function, just like what
ftrace does:
foo:
endbr
nop5 --> call trampoline_foo
xxxx
In short, the trampoline_foo can be this:
trampoline_foo:
prepare a array and store the args of foo to the array
call fentry_bpf1
call fentry_bpf2
......
call foo+4 (origin call)
save the return value of foo
call fexit_bpf1 (this bpf can get the return value of foo)
call fexit_bpf2
.......
return to the caller of foo
We can see that trampoline_foo can only be used for the function
foo, as different kernel functions can have different BPF programs
attached, different argument counts, etc. Therefore, we have to
create 1000 BPF trampolines if we want to attach a BPF program to
1000 kernel functions.
Creating BPF trampolines is expensive. According to my testing, it
takes more than 1 second to create 100 bpf trampolines. What's more,
they consume more memory.
With per-function metadata support, we can instead create one global
BPF trampoline, like this:
trampoline_global:
prepare a array and store the args of foo to the array
get the metadata by the ip
call metadata.fentry_bpf1
call metadata.fentry_bpf2
....
call foo+4 (origin call)
save the return value of foo
call metadata.fexit_bpf1 (this bpf can get the return value of foo)
call metadata.fexit_bpf2
.......
return to the caller of foo
(The metadata holds more information for the global trampoline than
I described.)
Then, we don't need to create a trampoline for every kernel function
anymore.
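To make the "get the metadata by the ip" step concrete, here is a
rough sketch of the lookup (the names and the struct layout below are
illustrative, not what the patches actually define; the only
assumption is that the padding right before the function entry holds
an index into a global metadata array):

#include <linux/types.h>

struct kfunc_md_sketch {
        void *fentry_bpf;       /* fentry BPF progs of this function */
        void *fexit_bpf;        /* fexit BPF progs of this function */
        int nr_args;            /* argument count of the function */
};

/* global array, indexed by the value stored in the function padding */
extern struct kfunc_md_sketch *kfunc_md_array;

static inline struct kfunc_md_sketch *kfunc_md_lookup_sketch(unsigned long ip)
{
        u32 index;

        /* the index was written into the padding bytes right before
         * the function entry, so the lookup is a single load (ip - 4)
         */
        index = *(u32 *)(ip - 4);
        return &kfunc_md_array[index];
}

So the per-function cost in the global trampoline is one load plus
one array index, with no hashing involved.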
Another beneficiary can be ftrace. For now, all the kernel functions
that are enabled by dynamic ftrace will be added to a filter hash if
there is more than one callback. A hash lookup then happens whenever
the traced functions are called, which has an impact on performance,
see __ftrace_ops_list_func() -> ftrace_ops_test(). With per-function
metadata support, we can store in the metadata whether the callback
is enabled for the kernel function, which can make the performance
much better.
The per-function metadata storage is a basic facility, and I think
other features may be able to use it for better performance in the
future too.
(Hope that I'm describing it clearly :/)
Thanks!
Menglong Dong
^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: [PATCH v4 1/4] x86/ibt: factor out cfi and fineibt offset
2025-03-05 1:19 ` Menglong Dong
@ 2025-03-05 8:29 ` H. Peter Anvin
2025-03-05 8:49 ` Menglong Dong
2025-03-05 15:03 ` Steven Rostedt
1 sibling, 1 reply; 23+ messages in thread
From: H. Peter Anvin @ 2025-03-05 8:29 UTC (permalink / raw)
To: Menglong Dong, Peter Zijlstra
Cc: rostedt, mark.rutland, alexei.starovoitov, catalin.marinas, will,
mhiramat, tglx, mingo, bp, dave.hansen, x86, ast, daniel, andrii,
martin.lau, eddyz87, yonghong.song, john.fastabend, kpsingh, sdf,
jolsa, davem, dsahern, mathieu.desnoyers, nathan,
nick.desaulniers+lkml, morbo, samitolvanen, kees, dongml2, akpm,
riel, rppt, linux-arm-kernel, linux-kernel, linux-trace-kernel,
bpf, netdev, llvm
On March 4, 2025 5:19:09 PM PST, Menglong Dong <menglong8.dong@gmail.com> wrote:
>On Tue, Mar 4, 2025 at 10:53 PM H. Peter Anvin <hpa@zytor.com> wrote:
>>
>> On March 4, 2025 1:42:20 AM PST, Peter Zijlstra <peterz@infradead.org> wrote:
>> >On Tue, Mar 04, 2025 at 03:47:45PM +0800, Menglong Dong wrote:
>> >> We don't have to select FUNCTION_ALIGNMENT_32B, so the
>> >> worst case is to increase ~2.2%.
>> >>
>> >> What do you think?
>> >
>> >Well, since I don't understand what you need this for at all, I'm firmly
>> >on the side of not doing this.
>> >
>> >What actual problem is being solved with this meta data nonsense? Why is
>> >it worth blowing up our I$ footprint over.
>> >
>> >Also note, that if you're going to be explaining this, start from
>> >scratch, as I have absolutely 0 clues about BPF and such.
>>
>> I would appreciate such information as well. The idea seems dubious on the surface.
>
>Ok, let me explain it from the beginning. (My English is not good,
>but I'll try to describe it as clear as possible :/)
>
>Many BPF program types need to depend on the BPF trampoline,
>such as BPF_PROG_TYPE_TRACING, BPF_PROG_TYPE_EXT,
>BPF_PROG_TYPE_LSM, etc. BPF trampoline is a bridge between
>the kernel (or bpf) function and BPF program, and it acts just like the
>trampoline that ftrace uses.
>
>Generally speaking, it is used to hook a function, just like what ftrace
>do:
>
>foo:
> endbr
> nop5 --> call trampoline_foo
> xxxx
>
>In short, the trampoline_foo can be this:
>
>trampoline_foo:
> prepare a array and store the args of foo to the array
> call fentry_bpf1
> call fentry_bpf2
> ......
> call foo+4 (origin call)
> save the return value of foo
> call fexit_bpf1 (this bpf can get the return value of foo)
> call fexit_bpf2
> .......
> return to the caller of foo
>
>We can see that the trampoline_foo can be only used for
>the function foo, as different kernel function can be attached
>different BPF programs, and have different argument count,
>etc. Therefore, we have to create 1000 BPF trampolines if
>we want to attach a BPF program to 1000 kernel functions.
>
>The creation of the BPF trampoline is expensive. According to
>my testing, It will spend more than 1 second to create 100 bpf
>trampoline. What's more, it consumes more memory.
>
>If we have the per-function metadata supporting, then we can
>create a global BPF trampoline, like this:
>
>trampoline_global:
> prepare a array and store the args of foo to the array
> get the metadata by the ip
> call metadata.fentry_bpf1
> call metadata.fentry_bpf2
> ....
> call foo+4 (origin call)
> save the return value of foo
> call metadata.fexit_bpf1 (this bpf can get the return value of foo)
> call metadata.fexit_bpf2
> .......
> return to the caller of foo
>
>(The metadata holds more information for the global trampoline than
>I described.)
>
>Then, we don't need to create a trampoline for every kernel function
>anymore.
>
>Another beneficiary can be ftrace. For now, all the kernel functions that
>are enabled by dynamic ftrace will be added to a filter hash if there are
>more than one callbacks. And hash lookup will happen when the traced
>functions are called, which has an impact on the performance, see
>__ftrace_ops_list_func() -> ftrace_ops_test(). With the per-function
>metadata supporting, we can store the information that if the callback is
>enabled on the kernel function to the metadata, which can make the performance
>much better.
>
>The per-function metadata storage is a basic function, and I think there
>may be other functions that can use it for better performance in the feature
>too.
>
>(Hope that I'm describing it clearly :/)
>
>Thanks!
>Menglong Dong
>
This is way too cursory. For one thing, you need to start by explaining why you are asking to put this *inline* with the code, which is something that normally would be avoided at all cost.
^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: [PATCH v4 1/4] x86/ibt: factor out cfi and fineibt offset
2025-03-05 8:29 ` H. Peter Anvin
@ 2025-03-05 8:49 ` Menglong Dong
0 siblings, 0 replies; 23+ messages in thread
From: Menglong Dong @ 2025-03-05 8:49 UTC (permalink / raw)
To: H. Peter Anvin
Cc: Peter Zijlstra, rostedt, mark.rutland, alexei.starovoitov,
catalin.marinas, will, mhiramat, tglx, mingo, bp, dave.hansen,
x86, ast, daniel, andrii, martin.lau, eddyz87, yonghong.song,
john.fastabend, kpsingh, sdf, jolsa, davem, dsahern,
mathieu.desnoyers, nathan, nick.desaulniers+lkml, morbo,
samitolvanen, kees, dongml2, akpm, riel, rppt, linux-arm-kernel,
linux-kernel, linux-trace-kernel, bpf, netdev, llvm
On Wed, Mar 5, 2025 at 4:30 PM H. Peter Anvin <hpa@zytor.com> wrote:
>
> On March 4, 2025 5:19:09 PM PST, Menglong Dong <menglong8.dong@gmail.com> wrote:
> >On Tue, Mar 4, 2025 at 10:53 PM H. Peter Anvin <hpa@zytor.com> wrote:
> >>
> >> On March 4, 2025 1:42:20 AM PST, Peter Zijlstra <peterz@infradead.org> wrote:
> >> >On Tue, Mar 04, 2025 at 03:47:45PM +0800, Menglong Dong wrote:
> >> >> We don't have to select FUNCTION_ALIGNMENT_32B, so the
> >> >> worst case is to increase ~2.2%.
> >> >>
> >> >> What do you think?
> >> >
> >> >Well, since I don't understand what you need this for at all, I'm firmly
> >> >on the side of not doing this.
> >> >
> >> >What actual problem is being solved with this meta data nonsense? Why is
> >> >it worth blowing up our I$ footprint over.
> >> >
> >> >Also note, that if you're going to be explaining this, start from
> >> >scratch, as I have absolutely 0 clues about BPF and such.
> >>
> >> I would appreciate such information as well. The idea seems dubious on the surface.
> >
> >Ok, let me explain it from the beginning. (My English is not good,
> >but I'll try to describe it as clear as possible :/)
> >
> >Many BPF program types need to depend on the BPF trampoline,
> >such as BPF_PROG_TYPE_TRACING, BPF_PROG_TYPE_EXT,
> >BPF_PROG_TYPE_LSM, etc. BPF trampoline is a bridge between
> >the kernel (or bpf) function and BPF program, and it acts just like the
> >trampoline that ftrace uses.
> >
> >Generally speaking, it is used to hook a function, just like what ftrace
> >do:
> >
> >foo:
> > endbr
> > nop5 --> call trampoline_foo
> > xxxx
> >
> >In short, the trampoline_foo can be this:
> >
> >trampoline_foo:
> > prepare a array and store the args of foo to the array
> > call fentry_bpf1
> > call fentry_bpf2
> > ......
> > call foo+4 (origin call)
> > save the return value of foo
> > call fexit_bpf1 (this bpf can get the return value of foo)
> > call fexit_bpf2
> > .......
> > return to the caller of foo
> >
> >We can see that the trampoline_foo can be only used for
> >the function foo, as different kernel function can be attached
> >different BPF programs, and have different argument count,
> >etc. Therefore, we have to create 1000 BPF trampolines if
> >we want to attach a BPF program to 1000 kernel functions.
> >
> >The creation of the BPF trampoline is expensive. According to
> >my testing, It will spend more than 1 second to create 100 bpf
> >trampoline. What's more, it consumes more memory.
> >
> >If we have the per-function metadata supporting, then we can
> >create a global BPF trampoline, like this:
> >
> >trampoline_global:
> > prepare a array and store the args of foo to the array
> > get the metadata by the ip
> > call metadata.fentry_bpf1
> > call metadata.fentry_bpf2
> > ....
> > call foo+4 (origin call)
> > save the return value of foo
> > call metadata.fexit_bpf1 (this bpf can get the return value of foo)
> > call metadata.fexit_bpf2
> > .......
> > return to the caller of foo
> >
> >(The metadata holds more information for the global trampoline than
> >I described.)
> >
> >Then, we don't need to create a trampoline for every kernel function
> >anymore.
> >
> >Another beneficiary can be ftrace. For now, all the kernel functions that
> >are enabled by dynamic ftrace will be added to a filter hash if there are
> >more than one callbacks. And hash lookup will happen when the traced
> >functions are called, which has an impact on the performance, see
> >__ftrace_ops_list_func() -> ftrace_ops_test(). With the per-function
> >metadata supporting, we can store the information that if the callback is
> >enabled on the kernel function to the metadata, which can make the performance
> >much better.
> >
> >The per-function metadata storage is a basic function, and I think there
> >may be other functions that can use it for better performance in the feature
> >too.
> >
> >(Hope that I'm describing it clearly :/)
> >
> >Thanks!
> >Menglong Dong
> >
>
> This is way too cursory. For one thing, you need to start by explaining why you are asking to put this *inline* with the code, which is something that normally would be avoided at all cost.
Hi,
Sorry, I don't quite understand the *inline* here. Do you mean
why I am putting the metadata in the function padding?
^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: [PATCH v4 1/4] x86/ibt: factor out cfi and fineibt offset
2025-03-05 1:19 ` Menglong Dong
2025-03-05 8:29 ` H. Peter Anvin
@ 2025-03-05 15:03 ` Steven Rostedt
2025-03-06 2:58 ` Menglong Dong
1 sibling, 1 reply; 23+ messages in thread
From: Steven Rostedt @ 2025-03-05 15:03 UTC (permalink / raw)
To: Menglong Dong
Cc: H. Peter Anvin, Peter Zijlstra, mark.rutland, alexei.starovoitov,
catalin.marinas, will, mhiramat, tglx, mingo, bp, dave.hansen,
x86, ast, daniel, andrii, martin.lau, eddyz87, yonghong.song,
john.fastabend, kpsingh, sdf, jolsa, davem, dsahern,
mathieu.desnoyers, nathan, nick.desaulniers+lkml, morbo,
samitolvanen, kees, dongml2, akpm, riel, rppt, linux-arm-kernel,
linux-kernel, linux-trace-kernel, bpf, netdev, llvm
On Wed, 5 Mar 2025 09:19:09 +0800
Menglong Dong <menglong8.dong@gmail.com> wrote:
> Ok, let me explain it from the beginning. (My English is not good,
> but I'll try to describe it as clear as possible :/)
I always appreciate those who struggle with English having these
conversations. Thank you for that, as I know I am horrible in speaking any
other language. (I can get by in German, but even Germans tell me to switch
back to English ;-)
>
> Many BPF program types need to depend on the BPF trampoline,
> such as BPF_PROG_TYPE_TRACING, BPF_PROG_TYPE_EXT,
> BPF_PROG_TYPE_LSM, etc. BPF trampoline is a bridge between
> the kernel (or bpf) function and BPF program, and it acts just like the
> trampoline that ftrace uses.
>
> Generally speaking, it is used to hook a function, just like what ftrace
> do:
>
> foo:
> endbr
> nop5 --> call trampoline_foo
> xxxx
>
> In short, the trampoline_foo can be this:
>
> trampoline_foo:
> prepare a array and store the args of foo to the array
> call fentry_bpf1
> call fentry_bpf2
> ......
> call foo+4 (origin call)
Note, I brought up this issue when I first heard about how BPF does this.
The calling of the original function from the trampoline. I said this will
cause issues, and is only good for a few functions. Once you start doing
this for 1000s of functions, it's going to be a nightmare.
Looks like you are now in the nightmare phase.
My argument was once you have this case, you need to switch over to the
kretprobe / function graph way of doing things, which is to have a shadow
stack and hijack the return address. Yes, that has slightly more overhead,
but it's better than having to add all these hacks.
And function graph has been updated so that it can do this for other users.
fprobes uses it now, and bpf can too.
> save the return value of foo
> call fexit_bpf1 (this bpf can get the return value of foo)
> call fexit_bpf2
> .......
> return to the caller of foo
>
> We can see that the trampoline_foo can be only used for
> the function foo, as different kernel function can be attached
> different BPF programs, and have different argument count,
> etc. Therefore, we have to create 1000 BPF trampolines if
> we want to attach a BPF program to 1000 kernel functions.
>
> The creation of the BPF trampoline is expensive. According to
> my testing, It will spend more than 1 second to create 100 bpf
> trampoline. What's more, it consumes more memory.
>
> If we have the per-function metadata supporting, then we can
> create a global BPF trampoline, like this:
>
> trampoline_global:
> prepare a array and store the args of foo to the array
> get the metadata by the ip
> call metadata.fentry_bpf1
> call metadata.fentry_bpf2
> ....
> call foo+4 (origin call)
So if this is a global trampoline, wouldn't this "call foo" need to be an
indirect call? It can't be a direct call, otherwise you need a separate
trampoline for that.
This means you need to mitigate for spectre here, and you just lost the
performance gain from not using function graph.
> save the return value of foo
> call metadata.fexit_bpf1 (this bpf can get the return value of foo)
> call metadata.fexit_bpf2
> .......
> return to the caller of foo
>
> (The metadata holds more information for the global trampoline than
> I described.)
>
> Then, we don't need to create a trampoline for every kernel function
> anymore.
>
> Another beneficiary can be ftrace. For now, all the kernel functions that
> are enabled by dynamic ftrace will be added to a filter hash if there are
> more than one callbacks. And hash lookup will happen when the traced
> functions are called, which has an impact on the performance, see
> __ftrace_ops_list_func() -> ftrace_ops_test(). With the per-function
> metadata supporting, we can store the information that if the callback is
> enabled on the kernel function to the metadata, which can make the performance
> much better.
Let me say now that ftrace will not use this. Looks like too much work for
little gain. The only time this impacts ftrace is when there's two
different callbacks tracing the same function, and it only impacts that
function. All other functions being traced still call the appropriate
trampoline for the callback.
-- Steve
>
> The per-function metadata storage is a basic function, and I think there
> may be other functions that can use it for better performance in the feature
> too.
>
> (Hope that I'm describing it clearly :/)
^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: [PATCH v4 1/4] x86/ibt: factor out cfi and fineibt offset
2025-03-05 15:03 ` Steven Rostedt
@ 2025-03-06 2:58 ` Menglong Dong
2025-03-06 3:39 ` Alexei Starovoitov
0 siblings, 1 reply; 23+ messages in thread
From: Menglong Dong @ 2025-03-06 2:58 UTC (permalink / raw)
To: Steven Rostedt
Cc: H. Peter Anvin, Peter Zijlstra, mark.rutland, alexei.starovoitov,
catalin.marinas, will, mhiramat, tglx, mingo, bp, dave.hansen,
x86, ast, daniel, andrii, martin.lau, eddyz87, yonghong.song,
john.fastabend, kpsingh, sdf, jolsa, davem, dsahern,
mathieu.desnoyers, nathan, nick.desaulniers+lkml, morbo,
samitolvanen, kees, dongml2, akpm, riel, rppt, linux-arm-kernel,
linux-kernel, linux-trace-kernel, bpf, netdev, llvm
On Wed, Mar 5, 2025 at 11:02 PM Steven Rostedt <rostedt@goodmis.org> wrote:
>
> On Wed, 5 Mar 2025 09:19:09 +0800
> Menglong Dong <menglong8.dong@gmail.com> wrote:
>
> > Ok, let me explain it from the beginning. (My English is not good,
> > but I'll try to describe it as clear as possible :/)
>
> I always appreciate those who struggle with English having these
> conversations. Thank you for that, as I know I am horrible in speaking any
> other language. (I can get by in German, but even Germans tell me to switch
> back to English ;-)
>
> >
> > Many BPF program types need to depend on the BPF trampoline,
> > such as BPF_PROG_TYPE_TRACING, BPF_PROG_TYPE_EXT,
> > BPF_PROG_TYPE_LSM, etc. BPF trampoline is a bridge between
> > the kernel (or bpf) function and BPF program, and it acts just like the
> > trampoline that ftrace uses.
> >
> > Generally speaking, it is used to hook a function, just like what ftrace
> > do:
> >
> > foo:
> > endbr
> > nop5 --> call trampoline_foo
> > xxxx
> >
> > In short, the trampoline_foo can be this:
> >
> > trampoline_foo:
> > prepare a array and store the args of foo to the array
> > call fentry_bpf1
> > call fentry_bpf2
> > ......
> > call foo+4 (origin call)
>
> Note, I brought up this issue when I first heard about how BPF does this.
> The calling of the original function from the trampoline. I said this will
> cause issues, and is only good for a few functions. Once you start doing
> this for 1000s of functions, it's going to be a nightmare.
>
> Looks like you are now in the nightmare phase.
>
> My argument was once you have this case, you need to switch over to the
> kretprobe / function graph way of doing things, which is to have a shadow
> stack and hijack the return address. Yes, that has slightly more overhead,
> but it's better than having to add all theses hacks.
>
> And function graph has been updated so that it can do this for other users.
> fprobes uses it now, and bpf can too.
Yeah, I heard that kretprobe is able to get the function arguments
too, which benefits from the function graph.
Besides the overhead, another problem is that we can't do direct
memory access if we use BPF programs based on kretprobe.
>
> > save the return value of foo
> > call fexit_bpf1 (this bpf can get the return value of foo)
> > call fexit_bpf2
> > .......
> > return to the caller of foo
> >
> > We can see that the trampoline_foo can be only used for
> > the function foo, as different kernel function can be attached
> > different BPF programs, and have different argument count,
> > etc. Therefore, we have to create 1000 BPF trampolines if
> > we want to attach a BPF program to 1000 kernel functions.
> >
> > The creation of the BPF trampoline is expensive. According to
> > my testing, It will spend more than 1 second to create 100 bpf
> > trampoline. What's more, it consumes more memory.
> >
> > If we have the per-function metadata supporting, then we can
> > create a global BPF trampoline, like this:
> >
> > trampoline_global:
> > prepare a array and store the args of foo to the array
> > get the metadata by the ip
> > call metadata.fentry_bpf1
> > call metadata.fentry_bpf2
> > ....
> > call foo+4 (origin call)
>
> So if this is a global trampoline, wouldn't this "call foo" need to be an
> indirect call? It can't be a direct call, otherwise you need a separate
> trampoline for that.
>
> This means you need to mitigate for spectre here, and you just lost the
> performance gain from not using function graph.
Yeah, you are right, this is an indirect call here. I haven't done
any research on mitigating spectre yet, but maybe we can
convert it into a direct call somehow? For example, we could
maintain a trampoline_table:
some preparation
jmp +%eax (eax is the index of the target function)
call foo1 + 4
return
call foo2 + 4
return
call foo3 + 4
return
(Hmm......Is the jmp above also an indirect call?)
And in the trampoline_global, we can call it like this:
mov metadata.index %eax
call trampoline_table
I'm not sure if it works. However, indirect call is also used
in function graph, so we still have better performance. Isn't it?
Let me have a look at the code of the function graph first :/
Thanks!
Menglong Dong
>
>
> > save the return value of foo
> > call metadata.fexit_bpf1 (this bpf can get the return value of foo)
> > call metadata.fexit_bpf2
> > .......
> > return to the caller of foo
> >
> > (The metadata holds more information for the global trampoline than
> > I described.)
> >
> > Then, we don't need to create a trampoline for every kernel function
> > anymore.
> >
> > Another beneficiary can be ftrace. For now, all the kernel functions that
> > are enabled by dynamic ftrace will be added to a filter hash if there are
> > more than one callbacks. And hash lookup will happen when the traced
> > functions are called, which has an impact on the performance, see
> > __ftrace_ops_list_func() -> ftrace_ops_test(). With the per-function
> > metadata supporting, we can store the information that if the callback is
> > enabled on the kernel function to the metadata, which can make the performance
> > much better.
>
> Let me say now that ftrace will not use this. Looks like too much work for
> little gain. The only time this impacts ftrace is when there's two
> different callbacks tracing the same function, and it only impacts that
> function. All other functions being traced still call the appropriate
> trampoline for the callback.
>
> -- Steve
>
> >
> > The per-function metadata storage is a basic function, and I think there
> > may be other functions that can use it for better performance in the feature
> > too.
> >
> > (Hope that I'm describing it clearly :/)
>
^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: [PATCH v4 1/4] x86/ibt: factor out cfi and fineibt offset
2025-03-06 2:58 ` Menglong Dong
@ 2025-03-06 3:39 ` Alexei Starovoitov
2025-03-06 8:50 ` Menglong Dong
0 siblings, 1 reply; 23+ messages in thread
From: Alexei Starovoitov @ 2025-03-06 3:39 UTC (permalink / raw)
To: Menglong Dong
Cc: Steven Rostedt, H. Peter Anvin, Peter Zijlstra, Mark Rutland,
Catalin Marinas, Will Deacon, Masami Hiramatsu, Thomas Gleixner,
Ingo Molnar, Borislav Petkov, Dave Hansen, X86 ML,
Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko,
Martin KaFai Lau, Eddy Z, Yonghong Song, John Fastabend, KP Singh,
Stanislav Fomichev, Jiri Olsa, David S. Miller, David Ahern,
Mathieu Desnoyers, Nathan Chancellor, Nick Desaulniers,
Bill Wendling, Sami Tolvanen, Kees Cook, dongml2, Andrew Morton,
Rik van Riel, Mike Rapoport, linux-arm-kernel, LKML,
linux-trace-kernel, bpf, Network Development, clang-built-linux
On Wed, Mar 5, 2025 at 6:59 PM Menglong Dong <menglong8.dong@gmail.com> wrote:
>
> I'm not sure if it works. However, indirect call is also used
> in function graph, so we still have better performance. Isn't it?
>
> Let me have a look at the code of the function graph first :/
Menglong,
Function graph infra isn't going to help.
"call foo" isn't a problem either.
But we have to step back.
per-function metadata is an optimization and feels like
we're doing a premature optimization here without collecting
performance numbers first.
Let's implement multi-fentry with generic get_metadata_by_ip() first.
get_metadata_by_ip() will be a hashtable in such a case and
then we can compare its performance when it's implemented as
a direct lookup from ip-4 (this patch) vs hash table
(that does 'ip' to 'metadata' lookup).
If/when we decide to do this per-function metadata we can also
punt to generic hashtable for cfi, IBT, FineIBT, etc configs.
When mitigations are enabled the performance suffers anyway,
so hashtable lookup vs direct ip-4 lookup won't make much difference.
So we can enable per-function metadata only on non-mitigation configs
when FUNCTION_ALIGNMENT=16.
There will be some number of bytes available before every function
and if we can tell gcc/llvm to leave at least 5 bytes there
the growth of vmlinux .text will be within a noise.
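E.g. something along these lines (just a sketch; gcc and clang have a
patchable_function_entry function attribute and the matching
-fpatchable-function-entry=N,M option, where M of the N NOPs are
placed before the function label):

/* ask the compiler for 5 NOPs before foo's entry and none after it,
 * which is the padding the metadata index would live in (sketch)
 */
__attribute__((patchable_function_entry(5, 5)))
void foo(void)
{
}

Assuming the usual 1-byte NOP encoding on x86, that gives the 5
padding bytes without changing the function alignment.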
So let's figure out the design of multi-fentry first with a hashtable
for metadata and decide next steps afterwards.
^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: [PATCH v4 1/4] x86/ibt: factor out cfi and fineibt offset
2025-03-06 3:39 ` Alexei Starovoitov
@ 2025-03-06 8:50 ` Menglong Dong
2025-03-23 3:51 ` Menglong Dong
0 siblings, 1 reply; 23+ messages in thread
From: Menglong Dong @ 2025-03-06 8:50 UTC (permalink / raw)
To: Alexei Starovoitov
Cc: Steven Rostedt, H. Peter Anvin, Peter Zijlstra, Mark Rutland,
Catalin Marinas, Will Deacon, Masami Hiramatsu, Thomas Gleixner,
Ingo Molnar, Borislav Petkov, Dave Hansen, X86 ML,
Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko,
Martin KaFai Lau, Eddy Z, Yonghong Song, John Fastabend, KP Singh,
Stanislav Fomichev, Jiri Olsa, David S. Miller, David Ahern,
Mathieu Desnoyers, Nathan Chancellor, Nick Desaulniers,
Bill Wendling, Sami Tolvanen, Kees Cook, dongml2, Andrew Morton,
Rik van Riel, Mike Rapoport, linux-arm-kernel, LKML,
linux-trace-kernel, bpf, Network Development, clang-built-linux
On Thu, Mar 6, 2025 at 11:39 AM Alexei Starovoitov
<alexei.starovoitov@gmail.com> wrote:
>
> On Wed, Mar 5, 2025 at 6:59 PM Menglong Dong <menglong8.dong@gmail.com> wrote:
> >
> > I'm not sure if it works. However, indirect call is also used
> > in function graph, so we still have better performance. Isn't it?
> >
> > Let me have a look at the code of the function graph first :/
>
> Menglong,
>
> Function graph infra isn't going to help.
> "call foo" isn't a problem either.
>
> But we have to step back.
> per-function metadata is an optimization and feels like
> we're doing a premature optimization here without collecting
> performance numbers first.
>
> Let's implement multi-fentry with generic get_metadata_by_ip() first.
> get_metadata_by_ip() will be a hashtable in such a case and
> then we can compare its performance when it's implemented as
> a direct lookup from ip-4 (this patch) vs hash table
> (that does 'ip' to 'metadata' lookup).
Hi, Alexei
You are right, I should do such a performance comparison.
>
> If/when we decide to do this per-function metadata we can also
> punt to generic hashtable for cfi, IBT, FineIBT, etc configs.
> When mitigations are enabled the performance suffers anyway,
> so hashtable lookup vs direct ip-4 lookup won't make much difference.
> So we can enable per-function metadata only on non-mitigation configs
> when FUNCTION_ALIGNMENT=16.
> There will be some number of bytes available before every function
> and if we can tell gcc/llvm to leave at least 5 bytes there
> the growth of vmlinux .text will be within a noise.
Sounds great! It's so difficult to make the per-function metadata
work in all the cases. Especially, we can't implement it on arm64
if CFI_CLANG is enabled. And falling back to the hash table makes
it much easier in these cases.
>
> So let's figure out the design of multi-fenty first with a hashtable
> for metadata and decide next steps afterwards.
Ok, I'll develop a version for fentry multi-link with both hashtable
and function metadata, and do some performance testing. Thank
you for your advice :/
Thanks!
Menglong Dong
^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: [PATCH v4 1/4] x86/ibt: factor out cfi and fineibt offset
2025-03-06 8:50 ` Menglong Dong
@ 2025-03-23 3:51 ` Menglong Dong
0 siblings, 0 replies; 23+ messages in thread
From: Menglong Dong @ 2025-03-23 3:51 UTC (permalink / raw)
To: Alexei Starovoitov, Steven Rostedt
Cc: Peter Zijlstra, X86 ML, Alexei Starovoitov, Andrii Nakryiko,
Yonghong Song, dongml2, Mike Rapoport, linux-arm-kernel, LKML,
linux-trace-kernel, bpf, Network Development, clang-built-linux
On Thu, Mar 6, 2025 at 4:50 PM Menglong Dong <menglong8.dong@gmail.com> wrote:
>
> On Thu, Mar 6, 2025 at 11:39 AM Alexei Starovoitov
> <alexei.starovoitov@gmail.com> wrote:
> >
> > On Wed, Mar 5, 2025 at 6:59 PM Menglong Dong <menglong8.dong@gmail.com> wrote:
> > >
> > > I'm not sure if it works. However, indirect call is also used
> > > in function graph, so we still have better performance. Isn't it?
> > >
> > > Let me have a look at the code of the function graph first :/
> >
> > Menglong,
> >
> > Function graph infra isn't going to help.
> > "call foo" isn't a problem either.
> >
> > But we have to step back.
> > per-function metadata is an optimization and feels like
> > we're doing a premature optimization here without collecting
> > performance numbers first.
> >
> > Let's implement multi-fentry with generic get_metadata_by_ip() first.
> > get_metadata_by_ip() will be a hashtable in such a case and
> > then we can compare its performance when it's implemented as
> > a direct lookup from ip-4 (this patch) vs hash table
> > (that does 'ip' to 'metadata' lookup).
>
> Hi, Alexei
>
> You are right, I should do such a performance comparison.
>
> >
> > If/when we decide to do this per-function metadata we can also
> > punt to generic hashtable for cfi, IBT, FineIBT, etc configs.
> > When mitigations are enabled the performance suffers anyway,
> > so hashtable lookup vs direct ip-4 lookup won't make much difference.
> > So we can enable per-function metadata only on non-mitigation configs
> > when FUNCTION_ALIGNMENT=16.
> > There will be some number of bytes available before every function
> > and if we can tell gcc/llvm to leave at least 5 bytes there
> > the growth of vmlinux .text will be within a noise.
Hi, Alexei
I recently finished a demo of the tracing multi-link. The code is
not ready to be sent out, as it's still very ugly, but I did some
performance testing.
The test case is very simple. I defined a function "kfunc_md_test"
and called it 10000000 times in "do_kfunc_md_test", then attached my
empty bpf program of attach type BPF_FENTRY_MULTI to it. Following
is the code of the test case:
-----------------------------------------kernel part--------------------------------
int kfunc_md_test_result = 0;
noinline void kfunc_md_test(int a)
{
        kfunc_md_test_result = a;
}

int noinline
do_kfunc_md_test(const struct ctl_table *table, int write,
                 void *buffer, size_t *lenp, loff_t *ppos)
{
        u64 start, interval;
        int i;

        start = ktime_get_boottime_ns();
        for (i = 0; i < 10000000; i++)
                kfunc_md_test(i);

        interval = ktime_get_boottime_ns() - start;
        pr_info("%llu.%llums\n",
                interval / 1000000, interval % 1000000);
        return 0;
}
---------------------------------------bpf part-----------------------------------------
SEC("fentry.multi/kfunc_md_test")
int BPF_PROG(fentry_manual_nop)
{
        return 0;
}
------------------------------------bpf part end-------------------------------------
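(do_kfunc_md_test() above has a proc_handler signature; it is
triggered from user space through a debug sysctl, roughly like this,
details trimmed:)

static struct ctl_table kfunc_md_sysctl[] = {
        {
                .procname       = "kfunc_md_test",
                .mode           = 0644,
                .proc_handler   = do_kfunc_md_test,
        },
};

/* register_sysctl("kernel", kfunc_md_sysctl) in an __init function,
 * then "echo 1 > /proc/sys/kernel/kfunc_md_test" runs the loop once
 * and the elapsed time shows up in dmesg.
 */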
I did the testing for BPF_FENTRY, BPF_FENTRY_MULTI and
BPF_KPROBE_MULTI, and the following are the results:
Without any bpf:
--------------------------------------------------------------------------------------
9.234677ms
9.486119ms
9.310059ms
9.468227ms
9.217295ms
9.500406ms
9.292606ms
9.530492ms
9.268741ms
9.513371ms
BPF_FENTRY:
----------------------------------------------------------------------------------------
80.800800ms
79.746338ms
83.292012ms
80.324835ms
84.25841ms
81.67250ms
81.21824ms
80.415886ms
79.910556ms
80.427809ms
BPF_FENTRY_MULTI with function padding:
---------------------------------------------------------------------------------------
120.457336ms
117.854154ms
118.888287ms
119.726011ms
117.52847ms
117.463910ms
119.212126ms
118.722216ms
118.843222ms
119.166079ms
It seems that the overhead of BPF_FENTRY_MULTI is higher than
BPF_FENTRY. I'm not sure if it is because of the "indirect call".
However, that's not what we want to discuss today, so let's focus on
the performance of the function metadata based on "function padding"
vs "hash table".
Generally speaking, the overhead of BPF_FENTRY_MULTI with the hash
table grows linearly with the lookup count. The hash table that I
used is exactly the same as the filter_hash that ftrace uses, and
the array length is 1024. I gathered the statistics based on the
hash table lookup count rather than the function count, as I find
that the hash is sometimes not random enough. However, we can
estimate the corresponding kernel function count if we assume the
hash is random enough.
BPF_FENTRY_MULTI with hash table:
----------------------------------------------------------------------------------
1(1k) 16(32k)
-------------------- --------------------
124.950881ms 235.24341ms
124.171226ms 232.20816ms
123.969627ms 232.212086ms
125.803975ms 230.935175ms
124.256777ms 230.906713ms
124.314095ms 234.551623ms
124.165637ms 231.435496ms
124.488003ms 230.936458ms
125.571929ms 230.753203ms
124.168110ms 234.679152ms
(The 1 and 16 above mean that the hash lookup count is 1 and 16;
1k and 32k are the corresponding numbers of traced kernel
functions.)
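For reference, the hash table I measured follows the same scheme as
the ftrace filter_hash: hash the ip into one of 1024 buckets and walk
the chain. A simplified sketch (illustrative, not the exact code of
my demo):

#include <linux/hash.h>
#include <linux/list.h>
#include <linux/types.h>

#define MD_HASH_BITS    10              /* 1024 buckets, as in my test */

struct md_hash_entry {
        struct hlist_node node;
        unsigned long ip;
        void *md;                       /* the per-function metadata */
};

static struct hlist_head md_hash[1 << MD_HASH_BITS];

static void *md_hash_lookup_sketch(unsigned long ip)
{
        struct md_hash_entry *entry;
        u32 key = hash_long(ip, MD_HASH_BITS);

        /* the "lookup count" column above is how many entries this
         * loop walks before it finds a match
         */
        hlist_for_each_entry(entry, &md_hash[key], node) {
                if (entry->ip == ip)
                        return entry->md;
        }
        return NULL;
}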
According to my testing, the hash table has only a slight overhead
if we trace no more than 5k kernel functions. And I think this covers
most use cases, according to the people who are interested in tracing
multi-link. When the function count goes up to 32k, the overhead is
obvious.
According to my research, kprobe-multi/fprobe is also based on a
hash table, which looks up the callback ops by the function address,
and the overhead is heavy too. I also measured the kprobe-multi
performance: I ran the test case "kprobe_multi_bench_attach/kernel"
and triggered "kfunc_md_test" meanwhile, just like what I did for
BPF_FENTRY_MULTI:
BPF_KPROBE_MULTI:
-----------------------------------------------------------------------------------
36895.985224ms
37002.298075ms
30150.774087ms
The kernel function count is 55239 in the kprobe-multi testing.
I'm not sure if there is something wrong with my testing, but
the overhead looks heavy.
So I think maybe it works to fall back to the hash table if
CFI/FINEIBT/... are enabled? I would appreciate some advice
here.
(BTW, I removed most CCs to reduce the noise :/)
Thanks!
Menglong Dong
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
Following is the bpf global trampoline (x86, demo and ugly):
---------------------------------------------------------------------------------
#define FUNC_ARGS_SIZE (6 * 8)
#define FUNC_ARGS_OFFSET (-8 - FUNC_ARGS_SIZE)
#define FUNC_ARGS_1 (FUNC_ARGS_OFFSET + 0 * 8)
#define FUNC_ARGS_2 (FUNC_ARGS_OFFSET + 1 * 8)
#define FUNC_ARGS_3 (FUNC_ARGS_OFFSET + 2 * 8)
#define FUNC_ARGS_4 (FUNC_ARGS_OFFSET + 3 * 8)
#define FUNC_ARGS_5 (FUNC_ARGS_OFFSET + 4 * 8)
#define FUNC_ARGS_6 (FUNC_ARGS_OFFSET + 5 * 8)
/* the args count, rbp - 8 * 8 */
#define FUNC_ARGS_COUNT_OFFSET (FUNC_ARGS_OFFSET - 1 * 8)
#define FUNC_ORIGIN_IP (FUNC_ARGS_OFFSET - 2 * 8) /* -9 * 8 */
#define RBX_OFFSET (FUNC_ARGS_OFFSET - 3 * 8)
/* bpf_tramp_run_ctx, rbp - BPF_RUN_CTX_OFFSET */
#define BPF_RUN_CTX_OFFSET (RBX_OFFSET - BPF_TRAMP_RUN_CTX_SIZE)
#define KFUNC_MD_OFFSET (BPF_RUN_CTX_OFFSET - 1 * 8)
#define STACK_SIZE (-1 * KFUNC_MD_OFFSET)
.macro tramp_restore_regs
        movq FUNC_ARGS_1(%rbp), %rdi
        movq FUNC_ARGS_2(%rbp), %rsi
        movq FUNC_ARGS_3(%rbp), %rdx
        movq FUNC_ARGS_4(%rbp), %rcx
        movq FUNC_ARGS_5(%rbp), %r8
        movq FUNC_ARGS_6(%rbp), %r9
.endm

SYM_FUNC_START(bpf_global_caller)
        pushq %rbp
        movq %rsp, %rbp
        subq $STACK_SIZE, %rsp

        /* save the args to stack, only regs is supported for now */
        movq %rdi, FUNC_ARGS_1(%rbp)
        movq %rsi, FUNC_ARGS_2(%rbp)
        movq %rdx, FUNC_ARGS_3(%rbp)
        movq %rcx, FUNC_ARGS_4(%rbp)
        movq %r8, FUNC_ARGS_5(%rbp)
        movq %r9, FUNC_ARGS_6(%rbp)

        /* save the rbx, rbp - 9 * 8 */
        movq %rbx, RBX_OFFSET(%rbp)

        /* get the function address */
        movq 8(%rbp), %rdi
        /* subq $(4+5), %rdi */
        /* save the function ip */
        movq %rdi, FUNC_ORIGIN_IP(%rbp)
        call kfunc_md_find
        cmpq $0, %rax
        jz out

        /* kfunc_md, keep it in %rcx */
        movq %rax, %rcx
        /* fentry bpf prog */
        cmpq $0, KFUNC_MD_FENTRY(%rcx)
        jz out

        /* load fentry bpf prog to the 1st arg */
        movq KFUNC_MD_FENTRY(%rcx), %rdi
        /* load the pointer of tramp_run_ctx to the 2nd arg */
        leaq BPF_RUN_CTX_OFFSET(%rbp), %rsi
        /* save the bpf cookie to the tramp_run_ctx */
        movq KFUNC_MD_COOKIE(%rcx), %rax
        movq %rax, BPF_COOKIE_OFFSET(%rsi)
        call __bpf_prog_enter_recur

        /* save the start time to rbx */
        movq %rax, %rbx
        /* load fentry JITed prog to rax */
        movq BPF_FUNC_OFFSET(%rdi), %rax
        /* load func args array to the 1st arg */
        leaq FUNC_ARGS_OFFSET(%rbp), %rdi
        /* load and call the JITed bpf func */
        call *%rax

        /* load bpf prog to the 1st arg */
        movq KFUNC_MD_FENTRY(%rcx), %rdi
        /* load the rbx(start time) to the 2nd arg */
        movq %rbx, %rsi
        /* load the pointer of tramp_run_ctx to the 3rd arg */
        leaq BPF_RUN_CTX_OFFSET(%rbp), %rdx
        call __bpf_prog_exit_recur

out:
        tramp_restore_regs
        movq RBX_OFFSET(%rbp), %rbx
        addq $STACK_SIZE, %rsp
        popq %rbp
        RET
SYM_FUNC_END(bpf_global_caller)
STACK_FRAME_NON_STANDARD_FP(bpf_global_caller)
>
> Sounds great! It's so different to make the per-function metadata
> work in all the cases. Especially, we can't implement it in arm64
> if CFI_CLANG is enabled. And the fallbacking to the hash table makes
> it much easier in these cases.
>
> >
> > So let's figure out the design of multi-fenty first with a hashtable
> > for metadata and decide next steps afterwards.
>
> Ok, I'll develop a version for fentry multi-link with both hashtable
> and function metadata, and do some performance testing. Thank
> you for your advice :/
>
> Thanks!
> Menglong Dong
^ permalink raw reply [flat|nested] 23+ messages in thread
end of thread
Thread overview: 23+ messages
2025-03-03 13:28 [PATCH v4 0/4] per-function storage support Menglong Dong
2025-03-03 13:28 ` [PATCH v4 1/4] x86/ibt: factor out cfi and fineibt offset Menglong Dong
2025-03-03 16:54 ` Peter Zijlstra
2025-03-04 1:10 ` Menglong Dong
2025-03-04 5:38 ` Peter Zijlstra
2025-03-04 6:16 ` Peter Zijlstra
2025-03-04 7:47 ` Menglong Dong
2025-03-04 8:41 ` Menglong Dong
2025-03-04 9:42 ` Peter Zijlstra
2025-03-04 14:52 ` H. Peter Anvin
2025-03-05 1:19 ` Menglong Dong
2025-03-05 8:29 ` H. Peter Anvin
2025-03-05 8:49 ` Menglong Dong
2025-03-05 15:03 ` Steven Rostedt
2025-03-06 2:58 ` Menglong Dong
2025-03-06 3:39 ` Alexei Starovoitov
2025-03-06 8:50 ` Menglong Dong
2025-03-23 3:51 ` Menglong Dong
2025-03-03 13:28 ` [PATCH v4 2/4] add per-function metadata storage support Menglong Dong
2025-03-03 13:28 ` [PATCH v4 3/4] x86: implement per-function metadata storage for x86 Menglong Dong
2025-03-03 13:28 ` [PATCH v4 4/4] arm64: implement per-function metadata storage for arm64 Menglong Dong
2025-03-03 21:40 ` Sami Tolvanen
2025-03-04 1:21 ` Menglong Dong