* [PATCH bpf-next v11 1/4] bpf: add bpf_get_cpu_time_counter kfunc
2025-03-17 22:49 [PATCH bpf-next v11 0/4] bpf: add cpu time counter kfuncs Vadim Fedorenko
@ 2025-03-17 22:49 ` Vadim Fedorenko
2025-03-18 0:23 ` Alexei Starovoitov
2025-03-17 22:49 ` [PATCH bpf-next v11 2/4] bpf: add bpf_cpu_time_counter_to_ns helper Vadim Fedorenko
` (2 subsequent siblings)
3 siblings, 1 reply; 13+ messages in thread
From: Vadim Fedorenko @ 2025-03-17 22:49 UTC (permalink / raw)
To: Borislav Petkov, Alexei Starovoitov, Daniel Borkmann,
Andrii Nakryiko, Eduard Zingerman, Thomas Gleixner, Yonghong Song,
Vadim Fedorenko, Mykola Lysenko
Cc: x86, bpf, Peter Zijlstra, Vadim Fedorenko, Martin KaFai Lau
Add a new kfunc that returns the architecture-specific time counter. The main
motivation is to avoid the extra overhead of benchmark measurements, which are
usually done with a pair of bpf_ktime_get_ns() calls at the beginning and at
the end of the code block under benchmark. When fully JITed this function does
not convert to the monotonic clock and saves CPU cycles by reading the time
counter in a single-digit number of instructions. The delta values can be
translated into nanoseconds using the kfunc introduced in the next patch.
On x86 the BPF JIT converts this kfunc into an ordered rdtsc read. Other
architectures can get JIT implementations too where supported. The fallback is
to return the CLOCK_MONOTONIC_RAW value in ns.

The JITed version of the function uses the "LFENCE; RDTSC" variant because it
doesn't need the cookie value returned by "RDTSCP" and doesn't want to clobber
RCX. The LFENCE variant provides the same ordering guarantee as RDTSCP.

The simplest use case is added in the 4th patch, where we calculate the time
spent by bpf_get_ns_current_pid_tgid(). A more complex example is to use the
session cookie to store the time counter value at kprobe/uprobe entry using
kprobe.session/uprobe.session, and to calculate the difference at
kretprobe/uretprobe return.
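For illustration, a minimal BPF-side sketch of the session-cookie pattern
described above (the __ksym extern declarations, section name and probe target
are illustrative and not part of this series; the selftests in patches 3 and 4
are the authoritative examples):

  #include "vmlinux.h"
  #include <bpf/bpf_helpers.h>

  /* kfuncs added by this series (declarations assumed, not generated) */
  extern __u64 bpf_get_cpu_time_counter(void) __ksym;
  extern __u64 bpf_cpu_time_counter_to_ns(__u64 counter) __ksym;
  /* session kfuncs available to kprobe.session programs */
  extern __u64 *bpf_session_cookie(void) __ksym;
  extern bool bpf_session_is_return(void) __ksym;

  char _license[] SEC("license") = "GPL";

  SEC("kprobe.session/do_sys_openat2")
  int measure_openat2(struct pt_regs *ctx)
  {
          __u64 *cookie = bpf_session_cookie();

          if (!bpf_session_is_return()) {
                  /* entry: stash the raw counter in the session cookie */
                  *cookie = bpf_get_cpu_time_counter();
                  return 0;
          }
          /* return: convert only the delta to nanoseconds */
          bpf_printk("do_sys_openat2 took %llu ns",
                     bpf_cpu_time_counter_to_ns(bpf_get_cpu_time_counter() -
                                                *cookie));
          return 0;
  }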
Acked-by: Eduard Zingerman <eddyz87@gmail.com>
Acked-by: Andrii Nakryiko <andrii@kernel.org>
Acked-by: Yonghong Song <yonghong.song@linux.dev>
Signed-off-by: Vadim Fedorenko <vadfed@meta.com>
---
arch/x86/net/bpf_jit_comp.c | 47 +++++++++++++++++++++++++++++++++++
arch/x86/net/bpf_jit_comp32.c | 33 ++++++++++++++++++++++++
include/linux/bpf.h | 3 +++
include/linux/filter.h | 1 +
kernel/bpf/core.c | 11 ++++++++
kernel/bpf/helpers.c | 6 +++++
kernel/bpf/verifier.c | 41 +++++++++++++++++++++++++-----
7 files changed, 136 insertions(+), 6 deletions(-)
diff --git a/arch/x86/net/bpf_jit_comp.c b/arch/x86/net/bpf_jit_comp.c
index d3491cc0898b..92cd5945d630 100644
--- a/arch/x86/net/bpf_jit_comp.c
+++ b/arch/x86/net/bpf_jit_comp.c
@@ -15,6 +15,7 @@
#include <asm/ftrace.h>
#include <asm/set_memory.h>
#include <asm/nospec-branch.h>
+#include <asm/timer.h>
#include <asm/text-patching.h>
#include <asm/unwind.h>
#include <asm/cfi.h>
@@ -2254,6 +2255,40 @@ st: if (is_imm8(insn->off))
case BPF_JMP | BPF_CALL: {
u8 *ip = image + addrs[i - 1];
+ if (insn->src_reg == BPF_PSEUDO_KFUNC_CALL &&
+ IS_ENABLED(CONFIG_BPF_SYSCALL) &&
+ imm32 == BPF_CALL_IMM(bpf_get_cpu_time_counter) &&
+ cpu_feature_enabled(X86_FEATURE_TSC) &&
+ using_native_sched_clock() && sched_clock_stable()) {
+ /* The default implementation of this kfunc uses
+ * ktime_get_raw_ns() which effectively is implemented as
+ * `(u64)rdtsc_ordered() & S64_MAX`. For JIT we skip
+ * masking part because we assume it's not needed in BPF
+ * use case (two measurements close in time).
+ * Original code for rdtsc_ordered() uses sequence:
+ * 'rdtsc; nop; nop; nop' to patch it into
+ * 'lfence; rdtsc' or 'rdtscp' depending on CPU features.
+ * JIT uses 'lfence; rdtsc' variant because BPF program
+ * doesn't care about cookie provided by rdtscp in RCX.
+ * Save RDX because RDTSC will use EDX:EAX to return u64
+ */
+ emit_mov_reg(&prog, true, AUX_REG, BPF_REG_3);
+ if (cpu_feature_enabled(X86_FEATURE_LFENCE_RDTSC))
+ EMIT_LFENCE();
+ EMIT2(0x0F, 0x31);
+
+ /* shl RDX, 32 */
+ maybe_emit_1mod(&prog, BPF_REG_3, true);
+ EMIT3(0xC1, add_1reg(0xE0, BPF_REG_3), 32);
+ /* or RAX, RDX */
+ maybe_emit_mod(&prog, BPF_REG_0, BPF_REG_3, true);
+ EMIT2(0x09, add_2reg(0xC0, BPF_REG_0, BPF_REG_3));
+ /* restore RDX from R11 */
+ emit_mov_reg(&prog, true, BPF_REG_3, AUX_REG);
+
+ break;
+ }
+
func = (u8 *) __bpf_call_base + imm32;
if (src_reg == BPF_PSEUDO_CALL && tail_call_reachable) {
LOAD_TAIL_CALL_CNT_PTR(stack_depth);
@@ -3865,3 +3900,15 @@ bool bpf_jit_supports_timed_may_goto(void)
{
return true;
}
+
+/* x86-64 JIT can inline kfunc */
+bool bpf_jit_inlines_kfunc_call(s32 imm)
+{
+ if (!IS_ENABLED(CONFIG_BPF_SYSCALL))
+ return false;
+ if (imm == BPF_CALL_IMM(bpf_get_cpu_time_counter) &&
+ cpu_feature_enabled(X86_FEATURE_TSC) &&
+ using_native_sched_clock() && sched_clock_stable())
+ return true;
+ return false;
+}
diff --git a/arch/x86/net/bpf_jit_comp32.c b/arch/x86/net/bpf_jit_comp32.c
index de0f9e5f9f73..7f13509c66db 100644
--- a/arch/x86/net/bpf_jit_comp32.c
+++ b/arch/x86/net/bpf_jit_comp32.c
@@ -16,6 +16,7 @@
#include <asm/set_memory.h>
#include <asm/nospec-branch.h>
#include <asm/asm-prototypes.h>
+#include <asm/timer.h>
#include <linux/bpf.h>
/*
@@ -2094,6 +2095,27 @@ static int do_jit(struct bpf_prog *bpf_prog, int *addrs, u8 *image,
if (insn->src_reg == BPF_PSEUDO_KFUNC_CALL) {
int err;
+ if (IS_ENABLED(CONFIG_BPF_SYSCALL) &&
+ imm32 == BPF_CALL_IMM(bpf_get_cpu_time_counter) &&
+ cpu_feature_enabled(X86_FEATURE_TSC) &&
+ using_native_sched_clock() && sched_clock_stable()) {
+ /* The default implementation of this kfunc uses
+ * ktime_get_raw_ns() which effectively is implemented as
+ * `(u64)rdtsc_ordered() & S64_MAX`. For JIT we skip
+ * masking part because we assume it's not needed in BPF
+ * use case (two measurements close in time).
+ * Original code for rdtsc_ordered() uses sequence:
+ * 'rdtsc; nop; nop; nop' to patch it into
+ * 'lfence; rdtsc' or 'rdtscp' depending on CPU features.
+ * JIT uses 'lfence; rdtsc' variant because BPF program
+ * doesn't care about cookie provided by rdtscp in ECX.
+ */
+ if (cpu_feature_enabled(X86_FEATURE_LFENCE_RDTSC))
+ EMIT3(0x0F, 0xAE, 0xE8);
+ EMIT2(0x0F, 0x31);
+ break;
+ }
+
err = emit_kfunc_call(bpf_prog,
image + addrs[i],
insn, &prog);
@@ -2621,3 +2643,14 @@ bool bpf_jit_supports_kfunc_call(void)
{
return true;
}
+
+bool bpf_jit_inlines_kfunc_call(s32 imm)
+{
+ if (!IS_ENABLED(CONFIG_BPF_SYSCALL))
+ return false;
+ if (imm == BPF_CALL_IMM(bpf_get_cpu_time_counter) &&
+ cpu_feature_enabled(X86_FEATURE_TSC) &&
+ using_native_sched_clock() && sched_clock_stable())
+ return true;
+ return false;
+}
diff --git a/include/linux/bpf.h b/include/linux/bpf.h
index 0d7b70124d81..a5e9b592d3e8 100644
--- a/include/linux/bpf.h
+++ b/include/linux/bpf.h
@@ -3387,6 +3387,9 @@ void bpf_user_rnd_init_once(void);
u64 bpf_user_rnd_u32(u64 r1, u64 r2, u64 r3, u64 r4, u64 r5);
u64 bpf_get_raw_cpu_id(u64 r1, u64 r2, u64 r3, u64 r4, u64 r5);
+/* Inlined kfuncs */
+u64 bpf_get_cpu_time_counter(void);
+
#if defined(CONFIG_NET)
bool bpf_sock_common_is_valid_access(int off, int size,
enum bpf_access_type type,
diff --git a/include/linux/filter.h b/include/linux/filter.h
index 590476743f7a..2fbfa1bc3f49 100644
--- a/include/linux/filter.h
+++ b/include/linux/filter.h
@@ -1128,6 +1128,7 @@ struct bpf_prog *bpf_int_jit_compile(struct bpf_prog *prog);
void bpf_jit_compile(struct bpf_prog *prog);
bool bpf_jit_needs_zext(void);
bool bpf_jit_inlines_helper_call(s32 imm);
+bool bpf_jit_inlines_kfunc_call(s32 imm);
bool bpf_jit_supports_subprog_tailcalls(void);
bool bpf_jit_supports_percpu_insn(void);
bool bpf_jit_supports_kfunc_call(void);
diff --git a/kernel/bpf/core.c b/kernel/bpf/core.c
index 62cb9557ad3b..1d811fc39eac 100644
--- a/kernel/bpf/core.c
+++ b/kernel/bpf/core.c
@@ -3035,6 +3035,17 @@ bool __weak bpf_jit_inlines_helper_call(s32 imm)
return false;
}
+/* Return true if the JIT inlines the call to the kfunc corresponding to
+ * the imm.
+ *
+ * The verifier will not patch the insn->imm for the call to the helper if
+ * this returns true.
+ */
+bool __weak bpf_jit_inlines_kfunc_call(s32 imm)
+{
+ return false;
+}
+
/* Return TRUE if the JIT backend supports mixing bpf2bpf and tailcalls. */
bool __weak bpf_jit_supports_subprog_tailcalls(void)
{
diff --git a/kernel/bpf/helpers.c b/kernel/bpf/helpers.c
index 5449756ba102..43bf35a15f78 100644
--- a/kernel/bpf/helpers.c
+++ b/kernel/bpf/helpers.c
@@ -3193,6 +3193,11 @@ __bpf_kfunc void bpf_local_irq_restore(unsigned long *flags__irq_flag)
local_irq_restore(*flags__irq_flag);
}
+__bpf_kfunc u64 bpf_get_cpu_time_counter(void)
+{
+ return ktime_get_raw_fast_ns();
+}
+
__bpf_kfunc_end_defs();
BTF_KFUNCS_START(generic_btf_ids)
@@ -3293,6 +3298,7 @@ BTF_ID_FLAGS(func, bpf_iter_kmem_cache_next, KF_ITER_NEXT | KF_RET_NULL | KF_SLE
BTF_ID_FLAGS(func, bpf_iter_kmem_cache_destroy, KF_ITER_DESTROY | KF_SLEEPABLE)
BTF_ID_FLAGS(func, bpf_local_irq_save)
BTF_ID_FLAGS(func, bpf_local_irq_restore)
+BTF_ID_FLAGS(func, bpf_get_cpu_time_counter, KF_FASTCALL)
BTF_KFUNCS_END(common_btf_ids)
static const struct btf_kfunc_id_set common_kfunc_set = {
diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
index 3303a3605ee8..0c4ea977973c 100644
--- a/kernel/bpf/verifier.c
+++ b/kernel/bpf/verifier.c
@@ -17035,6 +17035,24 @@ static bool verifier_inlines_helper_call(struct bpf_verifier_env *env, s32 imm)
}
}
+/* True if fixup_kfunc_call() replaces calls to kfunc number 'imm',
+ * replacement patch is presumed to follow bpf_fastcall contract
+ * (see mark_fastcall_pattern_for_call() below).
+ */
+static bool verifier_inlines_kfunc_call(struct bpf_verifier_env *env, s32 imm)
+{
+ const struct bpf_kfunc_desc *desc = find_kfunc_desc(env->prog, imm, 0);
+
+ if (!env->prog->jit_requested)
+ return false;
+
+ if (desc->func_id == special_kfunc_list[KF_bpf_cast_to_kern_ctx] ||
+ desc->func_id == special_kfunc_list[KF_bpf_rdonly_cast])
+ return true;
+
+ return false;
+}
+
struct call_summary {
u8 num_params;
bool is_void;
@@ -17077,7 +17095,10 @@ static bool get_call_summary(struct bpf_verifier_env *env, struct bpf_insn *call
/* error would be reported later */
return false;
cs->num_params = btf_type_vlen(meta.func_proto);
- cs->fastcall = meta.kfunc_flags & KF_FASTCALL;
+ cs->fastcall = meta.kfunc_flags & KF_FASTCALL &&
+ (verifier_inlines_kfunc_call(env, call->imm) ||
+ (meta.btf == btf_vmlinux &&
+ bpf_jit_inlines_kfunc_call(call->imm)));
cs->is_void = btf_type_is_void(btf_type_by_id(meta.btf, meta.func_proto->type));
return true;
}
@@ -21223,6 +21244,7 @@ static int fixup_kfunc_call(struct bpf_verifier_env *env, struct bpf_insn *insn,
struct bpf_insn *insn_buf, int insn_idx, int *cnt)
{
const struct bpf_kfunc_desc *desc;
+ s32 imm = insn->imm;
if (!insn->imm) {
verbose(env, "invalid kernel function call not eliminated in verifier pass\n");
@@ -21246,7 +21268,18 @@ static int fixup_kfunc_call(struct bpf_verifier_env *env, struct bpf_insn *insn,
insn->imm = BPF_CALL_IMM(desc->addr);
if (insn->off)
return 0;
- if (desc->func_id == special_kfunc_list[KF_bpf_obj_new_impl] ||
+ if (verifier_inlines_kfunc_call(env, imm)) {
+ if (desc->func_id == special_kfunc_list[KF_bpf_cast_to_kern_ctx] ||
+ desc->func_id == special_kfunc_list[KF_bpf_rdonly_cast]) {
+ insn_buf[0] = BPF_MOV64_REG(BPF_REG_0, BPF_REG_1);
+ *cnt = 1;
+ } else {
+ verbose(env, "verifier internal error: kfunc id %d has no inline code\n",
+ desc->func_id);
+ return -EFAULT;
+ }
+
+ } else if (desc->func_id == special_kfunc_list[KF_bpf_obj_new_impl] ||
desc->func_id == special_kfunc_list[KF_bpf_percpu_obj_new_impl]) {
struct btf_struct_meta *kptr_struct_meta = env->insn_aux_data[insn_idx].kptr_struct_meta;
struct bpf_insn addr[2] = { BPF_LD_IMM64(BPF_REG_2, (long)kptr_struct_meta) };
@@ -21307,10 +21340,6 @@ static int fixup_kfunc_call(struct bpf_verifier_env *env, struct bpf_insn *insn,
__fixup_collection_insert_kfunc(&env->insn_aux_data[insn_idx], struct_meta_reg,
node_offset_reg, insn, insn_buf, cnt);
- } else if (desc->func_id == special_kfunc_list[KF_bpf_cast_to_kern_ctx] ||
- desc->func_id == special_kfunc_list[KF_bpf_rdonly_cast]) {
- insn_buf[0] = BPF_MOV64_REG(BPF_REG_0, BPF_REG_1);
- *cnt = 1;
} else if (is_bpf_wq_set_callback_impl_kfunc(desc->func_id)) {
struct bpf_insn ld_addrs[2] = { BPF_LD_IMM64(BPF_REG_4, (long)env->prog->aux) };
--
2.47.1
* Re: [PATCH bpf-next v11 1/4] bpf: add bpf_get_cpu_time_counter kfunc
2025-03-17 22:49 ` [PATCH bpf-next v11 1/4] bpf: add bpf_get_cpu_time_counter kfunc Vadim Fedorenko
@ 2025-03-18 0:23 ` Alexei Starovoitov
2025-03-18 8:17 ` Vadim Fedorenko
0 siblings, 1 reply; 13+ messages in thread
From: Alexei Starovoitov @ 2025-03-18 0:23 UTC (permalink / raw)
To: Vadim Fedorenko
Cc: Borislav Petkov, Alexei Starovoitov, Daniel Borkmann,
Andrii Nakryiko, Eduard Zingerman, Thomas Gleixner, Yonghong Song,
Vadim Fedorenko, Mykola Lysenko, X86 ML, bpf, Peter Zijlstra,
Martin KaFai Lau
On Mon, Mar 17, 2025 at 3:50 PM Vadim Fedorenko <vadfed@meta.com> wrote:
>
> New kfunc to return ARCH-specific timecounter. The main reason to
> implement this kfunc is to avoid extra overhead of benchmark
> measurements, which are usually done by a pair of bpf_ktime_get_ns()
> at the beginnig and at the end of the code block under benchmark.
> When fully JITed this function doesn't implement conversion to the
> monotonic clock and saves some CPU cycles by receiving timecounter
> values in single-digit amount of instructions. The delta values can be
> translated into nanoseconds using kfunc introduced in the next patch.
> For x86 BPF JIT converts this kfunc into rdtsc ordered call. Other
> architectures will get JIT implementation too if supported. The fallback
> is to get CLOCK_MONOTONIC_RAW value in ns.
>
> JIT version of the function uses "LFENCE; RDTSC" variant because it
> doesn't care about cookie value returned by "RDTSCP" and it doesn't want
> to trash RCX value. LFENCE option provides the same ordering guarantee as
> RDTSCP variant.
>
> The simplest use-case is added in 4th patch, where we calculate the time
> spent by bpf_get_ns_current_pid_tgid() kfunc. More complex example is to
> use session cookie to store timecounter value at kprobe/uprobe using
> kprobe.session/uprobe.session, and calculate the difference at
> kretprobe/uretprobe.
>
> Acked-by: Eduard Zingerman <eddyz87@gmail.com>
> Acked-by: Andrii Nakryiko <andrii@kernel.org>
> Acked-by: Yonghong Song <yonghong.song@linux.dev>
> Signed-off-by: Vadim Fedorenko <vadfed@meta.com>
> ---
> arch/x86/net/bpf_jit_comp.c | 47 +++++++++++++++++++++++++++++++++++
> arch/x86/net/bpf_jit_comp32.c | 33 ++++++++++++++++++++++++
> include/linux/bpf.h | 3 +++
> include/linux/filter.h | 1 +
> kernel/bpf/core.c | 11 ++++++++
> kernel/bpf/helpers.c | 6 +++++
> kernel/bpf/verifier.c | 41 +++++++++++++++++++++++++-----
> 7 files changed, 136 insertions(+), 6 deletions(-)
>
> diff --git a/arch/x86/net/bpf_jit_comp.c b/arch/x86/net/bpf_jit_comp.c
> index d3491cc0898b..92cd5945d630 100644
> --- a/arch/x86/net/bpf_jit_comp.c
> +++ b/arch/x86/net/bpf_jit_comp.c
> @@ -15,6 +15,7 @@
> #include <asm/ftrace.h>
> #include <asm/set_memory.h>
> #include <asm/nospec-branch.h>
> +#include <asm/timer.h>
> #include <asm/text-patching.h>
> #include <asm/unwind.h>
> #include <asm/cfi.h>
> @@ -2254,6 +2255,40 @@ st: if (is_imm8(insn->off))
> case BPF_JMP | BPF_CALL: {
> u8 *ip = image + addrs[i - 1];
>
> + if (insn->src_reg == BPF_PSEUDO_KFUNC_CALL &&
> + IS_ENABLED(CONFIG_BPF_SYSCALL) &&
why?
It's true that JIT can be compiled in even when there is no sys_bpf,
but why gate this?
> + imm32 == BPF_CALL_IMM(bpf_get_cpu_time_counter) &&
> + cpu_feature_enabled(X86_FEATURE_TSC) &&
> + using_native_sched_clock() && sched_clock_stable()) {
> + /* The default implementation of this kfunc uses
> + * ktime_get_raw_ns() which effectively is implemented as
> + * `(u64)rdtsc_ordered() & S64_MAX`. For JIT We skip
> + * masking part because we assume it's not needed in BPF
> + * use case (two measurements close in time).
> + * Original code for rdtsc_ordered() uses sequence:
> + * 'rdtsc; nop; nop; nop' to patch it into
> + * 'lfence; rdtsc' or 'rdtscp' depending on CPU features.
> + * JIT uses 'lfence; rdtsc' variant because BPF program
> + * doesn't care about cookie provided by rdtscp in RCX.
> + * Save RDX because RDTSC will use EDX:EAX to return u64
> + */
> + emit_mov_reg(&prog, true, AUX_REG, BPF_REG_3);
> + if (cpu_feature_enabled(X86_FEATURE_LFENCE_RDTSC))
> + EMIT_LFENCE();
> + EMIT2(0x0F, 0x31);
> +
> + /* shl RDX, 32 */
> + maybe_emit_1mod(&prog, BPF_REG_3, true);
> + EMIT3(0xC1, add_1reg(0xE0, BPF_REG_3), 32);
> + /* or RAX, RDX */
> + maybe_emit_mod(&prog, BPF_REG_0, BPF_REG_3, true);
> + EMIT2(0x09, add_2reg(0xC0, BPF_REG_0, BPF_REG_3));
> + /* restore RDX from R11 */
> + emit_mov_reg(&prog, true, BPF_REG_3, AUX_REG);
> +
> + break;
> + }
> +
> func = (u8 *) __bpf_call_base + imm32;
> if (src_reg == BPF_PSEUDO_CALL && tail_call_reachable) {
> LOAD_TAIL_CALL_CNT_PTR(stack_depth);
> @@ -3865,3 +3900,15 @@ bool bpf_jit_supports_timed_may_goto(void)
> {
> return true;
> }
> +
> +/* x86-64 JIT can inline kfunc */
> +bool bpf_jit_inlines_kfunc_call(s32 imm)
> +{
> + if (!IS_ENABLED(CONFIG_BPF_SYSCALL))
> + return false;
This is certainly unnecessary.
Only the verifier calls this bpf_jit_inlines_kfunc_call() helper.
> + if (imm == BPF_CALL_IMM(bpf_get_cpu_time_counter) &&
> + cpu_feature_enabled(X86_FEATURE_TSC) &&
> + using_native_sched_clock() && sched_clock_stable())
The duplication of the check is ugly.
Call this helper from an earlier bit ?
if (insn->src_reg == BPF_PSEUDO_KFUNC_CALL &&
bpf_jit_inlines_kfunc_call(imm32))
?
> + return true;
> + return false;
> +}
> diff --git a/arch/x86/net/bpf_jit_comp32.c b/arch/x86/net/bpf_jit_comp32.c
> index de0f9e5f9f73..7f13509c66db 100644
> --- a/arch/x86/net/bpf_jit_comp32.c
> +++ b/arch/x86/net/bpf_jit_comp32.c
> @@ -16,6 +16,7 @@
> #include <asm/set_memory.h>
> #include <asm/nospec-branch.h>
> #include <asm/asm-prototypes.h>
> +#include <asm/timer.h>
> #include <linux/bpf.h>
>
> /*
> @@ -2094,6 +2095,27 @@ static int do_jit(struct bpf_prog *bpf_prog, int *addrs, u8 *image,
> if (insn->src_reg == BPF_PSEUDO_KFUNC_CALL) {
> int err;
>
> + if (IS_ENABLED(CONFIG_BPF_SYSCALL) &&
same.
> + imm32 == BPF_CALL_IMM(bpf_get_cpu_time_counter) &&
> + cpu_feature_enabled(X86_FEATURE_TSC) &&
> + using_native_sched_clock() && sched_clock_stable()) {
> + /* The default implementation of this kfunc uses
> + * ktime_get_raw_ns() which effectively is implemented as
> + * `(u64)rdtsc_ordered() & S64_MAX`. For JIT We skip
> + * masking part because we assume it's not needed in BPF
> + * use case (two measurements close in time).
> + * Original code for rdtsc_ordered() uses sequence:
> + * 'rdtsc; nop; nop; nop' to patch it into
> + * 'lfence; rdtsc' or 'rdtscp' depending on CPU features.
> + * JIT uses 'lfence; rdtsc' variant because BPF program
> + * doesn't care about cookie provided by rdtscp in ECX.
> + */
> + if (cpu_feature_enabled(X86_FEATURE_LFENCE_RDTSC))
> + EMIT3(0x0F, 0xAE, 0xE8);
> + EMIT2(0x0F, 0x31);
> + break;
> + }
> +
> err = emit_kfunc_call(bpf_prog,
> image + addrs[i],
> insn, &prog);
> @@ -2621,3 +2643,14 @@ bool bpf_jit_supports_kfunc_call(void)
> {
> return true;
> }
> +
> +bool bpf_jit_inlines_kfunc_call(s32 imm)
> +{
> + if (!IS_ENABLED(CONFIG_BPF_SYSCALL))
> + return false;
> + if (imm == BPF_CALL_IMM(bpf_get_cpu_time_counter) &&
> + cpu_feature_enabled(X86_FEATURE_TSC) &&
> + using_native_sched_clock() && sched_clock_stable())
> + return true;
same issue.
> + return false;
> +}
> diff --git a/include/linux/bpf.h b/include/linux/bpf.h
> index 0d7b70124d81..a5e9b592d3e8 100644
> --- a/include/linux/bpf.h
> +++ b/include/linux/bpf.h
> @@ -3387,6 +3387,9 @@ void bpf_user_rnd_init_once(void);
> u64 bpf_user_rnd_u32(u64 r1, u64 r2, u64 r3, u64 r4, u64 r5);
> u64 bpf_get_raw_cpu_id(u64 r1, u64 r2, u64 r3, u64 r4, u64 r5);
>
> +/* Inlined kfuncs */
> +u64 bpf_get_cpu_time_counter(void);
> +
> #if defined(CONFIG_NET)
> bool bpf_sock_common_is_valid_access(int off, int size,
> enum bpf_access_type type,
> diff --git a/include/linux/filter.h b/include/linux/filter.h
> index 590476743f7a..2fbfa1bc3f49 100644
> --- a/include/linux/filter.h
> +++ b/include/linux/filter.h
> @@ -1128,6 +1128,7 @@ struct bpf_prog *bpf_int_jit_compile(struct bpf_prog *prog);
> void bpf_jit_compile(struct bpf_prog *prog);
> bool bpf_jit_needs_zext(void);
> bool bpf_jit_inlines_helper_call(s32 imm);
> +bool bpf_jit_inlines_kfunc_call(s32 imm);
> bool bpf_jit_supports_subprog_tailcalls(void);
> bool bpf_jit_supports_percpu_insn(void);
> bool bpf_jit_supports_kfunc_call(void);
> diff --git a/kernel/bpf/core.c b/kernel/bpf/core.c
> index 62cb9557ad3b..1d811fc39eac 100644
> --- a/kernel/bpf/core.c
> +++ b/kernel/bpf/core.c
> @@ -3035,6 +3035,17 @@ bool __weak bpf_jit_inlines_helper_call(s32 imm)
> return false;
> }
>
> +/* Return true if the JIT inlines the call to the kfunc corresponding to
> + * the imm.
> + *
> + * The verifier will not patch the insn->imm for the call to the helper if
> + * this returns true.
> + */
> +bool __weak bpf_jit_inlines_kfunc_call(s32 imm)
> +{
> + return false;
> +}
> +
> /* Return TRUE if the JIT backend supports mixing bpf2bpf and tailcalls. */
> bool __weak bpf_jit_supports_subprog_tailcalls(void)
> {
> diff --git a/kernel/bpf/helpers.c b/kernel/bpf/helpers.c
> index 5449756ba102..43bf35a15f78 100644
> --- a/kernel/bpf/helpers.c
> +++ b/kernel/bpf/helpers.c
> @@ -3193,6 +3193,11 @@ __bpf_kfunc void bpf_local_irq_restore(unsigned long *flags__irq_flag)
> local_irq_restore(*flags__irq_flag);
> }
>
> +__bpf_kfunc u64 bpf_get_cpu_time_counter(void)
> +{
> + return ktime_get_raw_fast_ns();
Why 'raw' ?
Is it faster than 'mono' ?
This needs a comment at least.
> +}
> +
> __bpf_kfunc_end_defs();
>
> BTF_KFUNCS_START(generic_btf_ids)
> @@ -3293,6 +3298,7 @@ BTF_ID_FLAGS(func, bpf_iter_kmem_cache_next, KF_ITER_NEXT | KF_RET_NULL | KF_SLE
> BTF_ID_FLAGS(func, bpf_iter_kmem_cache_destroy, KF_ITER_DESTROY | KF_SLEEPABLE)
> BTF_ID_FLAGS(func, bpf_local_irq_save)
> BTF_ID_FLAGS(func, bpf_local_irq_restore)
> +BTF_ID_FLAGS(func, bpf_get_cpu_time_counter, KF_FASTCALL)
> BTF_KFUNCS_END(common_btf_ids)
>
> static const struct btf_kfunc_id_set common_kfunc_set = {
> diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
> index 3303a3605ee8..0c4ea977973c 100644
> --- a/kernel/bpf/verifier.c
> +++ b/kernel/bpf/verifier.c
> @@ -17035,6 +17035,24 @@ static bool verifier_inlines_helper_call(struct bpf_verifier_env *env, s32 imm)
> }
> }
>
> +/* True if fixup_kfunc_call() replaces calls to kfunc number 'imm',
> + * replacement patch is presumed to follow bpf_fastcall contract
> + * (see mark_fastcall_pattern_for_call() below).
> + */
> +static bool verifier_inlines_kfunc_call(struct bpf_verifier_env *env, s32 imm)
> +{
> + const struct bpf_kfunc_desc *desc = find_kfunc_desc(env->prog, imm, 0);
> +
> + if (!env->prog->jit_requested)
> + return false;
This is a regression.
Why disable verifier inlining when jit is off?
> +
> + if (desc->func_id == special_kfunc_list[KF_bpf_cast_to_kern_ctx] ||
> + desc->func_id == special_kfunc_list[KF_bpf_rdonly_cast])
> + return true;
> +
> + return false;
> +}
> +
> struct call_summary {
> u8 num_params;
> bool is_void;
> @@ -17077,7 +17095,10 @@ static bool get_call_summary(struct bpf_verifier_env *env, struct bpf_insn *call
> /* error would be reported later */
> return false;
> cs->num_params = btf_type_vlen(meta.func_proto);
> - cs->fastcall = meta.kfunc_flags & KF_FASTCALL;
> + cs->fastcall = meta.kfunc_flags & KF_FASTCALL &&
> + (verifier_inlines_kfunc_call(env, call->imm) ||
> + (meta.btf == btf_vmlinux &&
> + bpf_jit_inlines_kfunc_call(call->imm)));
> cs->is_void = btf_type_is_void(btf_type_by_id(meta.btf, meta.func_proto->type));
> return true;
> }
> @@ -21223,6 +21244,7 @@ static int fixup_kfunc_call(struct bpf_verifier_env *env, struct bpf_insn *insn,
> struct bpf_insn *insn_buf, int insn_idx, int *cnt)
> {
> const struct bpf_kfunc_desc *desc;
> + s32 imm = insn->imm;
>
> if (!insn->imm) {
> verbose(env, "invalid kernel function call not eliminated in verifier pass\n");
> @@ -21246,7 +21268,18 @@ static int fixup_kfunc_call(struct bpf_verifier_env *env, struct bpf_insn *insn,
> insn->imm = BPF_CALL_IMM(desc->addr);
> if (insn->off)
> return 0;
> - if (desc->func_id == special_kfunc_list[KF_bpf_obj_new_impl] ||
> + if (verifier_inlines_kfunc_call(env, imm)) {
> + if (desc->func_id == special_kfunc_list[KF_bpf_cast_to_kern_ctx] ||
> + desc->func_id == special_kfunc_list[KF_bpf_rdonly_cast]) {
I really don't like this copy paste.
Next trivial function that inlines as r0=r1
would need to add itself in two places for no good reason.
pw-bot: cr
* Re: [PATCH bpf-next v11 1/4] bpf: add bpf_get_cpu_time_counter kfunc
2025-03-18 0:23 ` Alexei Starovoitov
@ 2025-03-18 8:17 ` Vadim Fedorenko
2025-03-18 15:56 ` Alexei Starovoitov
0 siblings, 1 reply; 13+ messages in thread
From: Vadim Fedorenko @ 2025-03-18 8:17 UTC (permalink / raw)
To: Alexei Starovoitov, Vadim Fedorenko
Cc: Borislav Petkov, Alexei Starovoitov, Daniel Borkmann,
Andrii Nakryiko, Eduard Zingerman, Thomas Gleixner, Yonghong Song,
Mykola Lysenko, X86 ML, bpf, Peter Zijlstra, Martin KaFai Lau
On 18/03/2025 00:23, Alexei Starovoitov wrote:
> On Mon, Mar 17, 2025 at 3:50 PM Vadim Fedorenko <vadfed@meta.com> wrote:
>>
>> New kfunc to return ARCH-specific timecounter. The main reason to
>> implement this kfunc is to avoid extra overhead of benchmark
>> measurements, which are usually done by a pair of bpf_ktime_get_ns()
>> at the beginnig and at the end of the code block under benchmark.
>> When fully JITed this function doesn't implement conversion to the
>> monotonic clock and saves some CPU cycles by receiving timecounter
>> values in single-digit amount of instructions. The delta values can be
>> translated into nanoseconds using kfunc introduced in the next patch.
>> For x86 BPF JIT converts this kfunc into rdtsc ordered call. Other
>> architectures will get JIT implementation too if supported. The fallback
>> is to get CLOCK_MONOTONIC_RAW value in ns.
>>
>> JIT version of the function uses "LFENCE; RDTSC" variant because it
>> doesn't care about cookie value returned by "RDTSCP" and it doesn't want
>> to trash RCX value. LFENCE option provides the same ordering guarantee as
>> RDTSCP variant.
>>
>> The simplest use-case is added in 4th patch, where we calculate the time
>> spent by bpf_get_ns_current_pid_tgid() kfunc. More complex example is to
>> use session cookie to store timecounter value at kprobe/uprobe using
>> kprobe.session/uprobe.session, and calculate the difference at
>> kretprobe/uretprobe.
>>
>> Acked-by: Eduard Zingerman <eddyz87@gmail.com>
>> Acked-by: Andrii Nakryiko <andrii@kernel.org>
>> Acked-by: Yonghong Song <yonghong.song@linux.dev>
>> Signed-off-by: Vadim Fedorenko <vadfed@meta.com>
>> ---
>> arch/x86/net/bpf_jit_comp.c | 47 +++++++++++++++++++++++++++++++++++
>> arch/x86/net/bpf_jit_comp32.c | 33 ++++++++++++++++++++++++
>> include/linux/bpf.h | 3 +++
>> include/linux/filter.h | 1 +
>> kernel/bpf/core.c | 11 ++++++++
>> kernel/bpf/helpers.c | 6 +++++
>> kernel/bpf/verifier.c | 41 +++++++++++++++++++++++++-----
>> 7 files changed, 136 insertions(+), 6 deletions(-)
>>
>> diff --git a/arch/x86/net/bpf_jit_comp.c b/arch/x86/net/bpf_jit_comp.c
>> index d3491cc0898b..92cd5945d630 100644
>> --- a/arch/x86/net/bpf_jit_comp.c
>> +++ b/arch/x86/net/bpf_jit_comp.c
>> @@ -15,6 +15,7 @@
>> #include <asm/ftrace.h>
>> #include <asm/set_memory.h>
>> #include <asm/nospec-branch.h>
>> +#include <asm/timer.h>
>> #include <asm/text-patching.h>
>> #include <asm/unwind.h>
>> #include <asm/cfi.h>
>> @@ -2254,6 +2255,40 @@ st: if (is_imm8(insn->off))
>> case BPF_JMP | BPF_CALL: {
>> u8 *ip = image + addrs[i - 1];
>>
>> + if (insn->src_reg == BPF_PSEUDO_KFUNC_CALL &&
>> + IS_ENABLED(CONFIG_BPF_SYSCALL) &&
>
> why?
>
> It's true that JIT can be compiled in even when there is no sys_bpf,
> but why gate this?
Both bpf_get_cpu_time_counter() and bpf_cpu_time_counter_to_ns() are
defined in helpers.c which is compiled only when CONFIG_BPF_SYSCALL is
enabled. Otherwise these symbols are absent and compilation fails.
See kernel test bot report:
https://lore.kernel.org/bpf/202503131640.opwmXIvU-lkp@intel.com/
>> + imm32 == BPF_CALL_IMM(bpf_get_cpu_time_counter) &&
>> + cpu_feature_enabled(X86_FEATURE_TSC) &&
>> + using_native_sched_clock() && sched_clock_stable()) {
>> + /* The default implementation of this kfunc uses
>> + * ktime_get_raw_ns() which effectively is implemented as
>> + * `(u64)rdtsc_ordered() & S64_MAX`. For JIT We skip
>> + * masking part because we assume it's not needed in BPF
>> + * use case (two measurements close in time).
>> + * Original code for rdtsc_ordered() uses sequence:
>> + * 'rdtsc; nop; nop; nop' to patch it into
>> + * 'lfence; rdtsc' or 'rdtscp' depending on CPU features.
>> + * JIT uses 'lfence; rdtsc' variant because BPF program
>> + * doesn't care about cookie provided by rdtscp in RCX.
>> + * Save RDX because RDTSC will use EDX:EAX to return u64
>> + */
>> + emit_mov_reg(&prog, true, AUX_REG, BPF_REG_3);
>> + if (cpu_feature_enabled(X86_FEATURE_LFENCE_RDTSC))
>> + EMIT_LFENCE();
>> + EMIT2(0x0F, 0x31);
>> +
>> + /* shl RDX, 32 */
>> + maybe_emit_1mod(&prog, BPF_REG_3, true);
>> + EMIT3(0xC1, add_1reg(0xE0, BPF_REG_3), 32);
>> + /* or RAX, RDX */
>> + maybe_emit_mod(&prog, BPF_REG_0, BPF_REG_3, true);
>> + EMIT2(0x09, add_2reg(0xC0, BPF_REG_0, BPF_REG_3));
>> + /* restore RDX from R11 */
>> + emit_mov_reg(&prog, true, BPF_REG_3, AUX_REG);
>> +
>> + break;
>> + }
>> +
>> func = (u8 *) __bpf_call_base + imm32;
>> if (src_reg == BPF_PSEUDO_CALL && tail_call_reachable) {
>> LOAD_TAIL_CALL_CNT_PTR(stack_depth);
>> @@ -3865,3 +3900,15 @@ bool bpf_jit_supports_timed_may_goto(void)
>> {
>> return true;
>> }
>> +
>> +/* x86-64 JIT can inline kfunc */
>> +bool bpf_jit_inlines_kfunc_call(s32 imm)
>> +{
>> + if (!IS_ENABLED(CONFIG_BPF_SYSCALL))
>> + return false;
>
> This is certainly unnecessary.
> Only the verifier calls this bpf_jit_inlines_kfunc_call() helper.
>
>> + if (imm == BPF_CALL_IMM(bpf_get_cpu_time_counter) &&
>> + cpu_feature_enabled(X86_FEATURE_TSC) &&
>> + using_native_sched_clock() && sched_clock_stable())
>
> The duplication of the check is ugly.
> Call this helper from an earlier bit ?
> if (insn->src_reg == BPF_PSEUDO_KFUNC_CALL &&
> bpf_jit_inlines_kfunc_call(imm32))
> ?
LGTM, I'll adjust it
>
>> + return true;
>> + return false;
>> +}
>> diff --git a/arch/x86/net/bpf_jit_comp32.c b/arch/x86/net/bpf_jit_comp32.c
>> index de0f9e5f9f73..7f13509c66db 100644
>> --- a/arch/x86/net/bpf_jit_comp32.c
>> +++ b/arch/x86/net/bpf_jit_comp32.c
>> @@ -16,6 +16,7 @@
>> #include <asm/set_memory.h>
>> #include <asm/nospec-branch.h>
>> #include <asm/asm-prototypes.h>
>> +#include <asm/timer.h>
>> #include <linux/bpf.h>
>>
>> /*
>> @@ -2094,6 +2095,27 @@ static int do_jit(struct bpf_prog *bpf_prog, int *addrs, u8 *image,
>> if (insn->src_reg == BPF_PSEUDO_KFUNC_CALL) {
>> int err;
>>
>> + if (IS_ENABLED(CONFIG_BPF_SYSCALL) &&
>
> same.
>
>> + imm32 == BPF_CALL_IMM(bpf_get_cpu_time_counter) &&
>> + cpu_feature_enabled(X86_FEATURE_TSC) &&
>> + using_native_sched_clock() && sched_clock_stable()) {
>> + /* The default implementation of this kfunc uses
>> + * ktime_get_raw_ns() which effectively is implemented as
>> + * `(u64)rdtsc_ordered() & S64_MAX`. For JIT We skip
>> + * masking part because we assume it's not needed in BPF
>> + * use case (two measurements close in time).
>> + * Original code for rdtsc_ordered() uses sequence:
>> + * 'rdtsc; nop; nop; nop' to patch it into
>> + * 'lfence; rdtsc' or 'rdtscp' depending on CPU features.
>> + * JIT uses 'lfence; rdtsc' variant because BPF program
>> + * doesn't care about cookie provided by rdtscp in ECX.
>> + */
>> + if (cpu_feature_enabled(X86_FEATURE_LFENCE_RDTSC))
>> + EMIT3(0x0F, 0xAE, 0xE8);
>> + EMIT2(0x0F, 0x31);
>> + break;
>> + }
>> +
>> err = emit_kfunc_call(bpf_prog,
>> image + addrs[i],
>> insn, &prog);
>> @@ -2621,3 +2643,14 @@ bool bpf_jit_supports_kfunc_call(void)
>> {
>> return true;
>> }
>> +
>> +bool bpf_jit_inlines_kfunc_call(s32 imm)
>> +{
>> + if (!IS_ENABLED(CONFIG_BPF_SYSCALL))
>> + return false;
>> + if (imm == BPF_CALL_IMM(bpf_get_cpu_time_counter) &&
>> + cpu_feature_enabled(X86_FEATURE_TSC) &&
>> + using_native_sched_clock() && sched_clock_stable())
>> + return true;
>
> same issue.
>
>> + return false;
>> +}
>> diff --git a/include/linux/bpf.h b/include/linux/bpf.h
>> index 0d7b70124d81..a5e9b592d3e8 100644
>> --- a/include/linux/bpf.h
>> +++ b/include/linux/bpf.h
>> @@ -3387,6 +3387,9 @@ void bpf_user_rnd_init_once(void);
>> u64 bpf_user_rnd_u32(u64 r1, u64 r2, u64 r3, u64 r4, u64 r5);
>> u64 bpf_get_raw_cpu_id(u64 r1, u64 r2, u64 r3, u64 r4, u64 r5);
>>
>> +/* Inlined kfuncs */
>> +u64 bpf_get_cpu_time_counter(void);
>> +
>> #if defined(CONFIG_NET)
>> bool bpf_sock_common_is_valid_access(int off, int size,
>> enum bpf_access_type type,
>> diff --git a/include/linux/filter.h b/include/linux/filter.h
>> index 590476743f7a..2fbfa1bc3f49 100644
>> --- a/include/linux/filter.h
>> +++ b/include/linux/filter.h
>> @@ -1128,6 +1128,7 @@ struct bpf_prog *bpf_int_jit_compile(struct bpf_prog *prog);
>> void bpf_jit_compile(struct bpf_prog *prog);
>> bool bpf_jit_needs_zext(void);
>> bool bpf_jit_inlines_helper_call(s32 imm);
>> +bool bpf_jit_inlines_kfunc_call(s32 imm);
>> bool bpf_jit_supports_subprog_tailcalls(void);
>> bool bpf_jit_supports_percpu_insn(void);
>> bool bpf_jit_supports_kfunc_call(void);
>> diff --git a/kernel/bpf/core.c b/kernel/bpf/core.c
>> index 62cb9557ad3b..1d811fc39eac 100644
>> --- a/kernel/bpf/core.c
>> +++ b/kernel/bpf/core.c
>> @@ -3035,6 +3035,17 @@ bool __weak bpf_jit_inlines_helper_call(s32 imm)
>> return false;
>> }
>>
>> +/* Return true if the JIT inlines the call to the kfunc corresponding to
>> + * the imm.
>> + *
>> + * The verifier will not patch the insn->imm for the call to the helper if
>> + * this returns true.
>> + */
>> +bool __weak bpf_jit_inlines_kfunc_call(s32 imm)
>> +{
>> + return false;
>> +}
>> +
>> /* Return TRUE if the JIT backend supports mixing bpf2bpf and tailcalls. */
>> bool __weak bpf_jit_supports_subprog_tailcalls(void)
>> {
>> diff --git a/kernel/bpf/helpers.c b/kernel/bpf/helpers.c
>> index 5449756ba102..43bf35a15f78 100644
>> --- a/kernel/bpf/helpers.c
>> +++ b/kernel/bpf/helpers.c
>> @@ -3193,6 +3193,11 @@ __bpf_kfunc void bpf_local_irq_restore(unsigned long *flags__irq_flag)
>> local_irq_restore(*flags__irq_flag);
>> }
>>
>> +__bpf_kfunc u64 bpf_get_cpu_time_counter(void)
>> +{
>> + return ktime_get_raw_fast_ns();
>
> Why 'raw' ?
> Is it faster than 'mono' ?
> This needs a comment at least.
'raw' is the closest analogue to what is implemented in JIT. The access
time is the same as for 'mono', but the slope of 'raw' is not affected
by NTP adjustments, and with stable tsc it can provide less jitter in
short term measurements.
>
>> +}
>> +
>> __bpf_kfunc_end_defs();
>>
>> BTF_KFUNCS_START(generic_btf_ids)
>> @@ -3293,6 +3298,7 @@ BTF_ID_FLAGS(func, bpf_iter_kmem_cache_next, KF_ITER_NEXT | KF_RET_NULL | KF_SLE
>> BTF_ID_FLAGS(func, bpf_iter_kmem_cache_destroy, KF_ITER_DESTROY | KF_SLEEPABLE)
>> BTF_ID_FLAGS(func, bpf_local_irq_save)
>> BTF_ID_FLAGS(func, bpf_local_irq_restore)
>> +BTF_ID_FLAGS(func, bpf_get_cpu_time_counter, KF_FASTCALL)
>> BTF_KFUNCS_END(common_btf_ids)
>>
>> static const struct btf_kfunc_id_set common_kfunc_set = {
>> diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
>> index 3303a3605ee8..0c4ea977973c 100644
>> --- a/kernel/bpf/verifier.c
>> +++ b/kernel/bpf/verifier.c
>> @@ -17035,6 +17035,24 @@ static bool verifier_inlines_helper_call(struct bpf_verifier_env *env, s32 imm)
>> }
>> }
>>
>> +/* True if fixup_kfunc_call() replaces calls to kfunc number 'imm',
>> + * replacement patch is presumed to follow bpf_fastcall contract
>> + * (see mark_fastcall_pattern_for_call() below).
>> + */
>> +static bool verifier_inlines_kfunc_call(struct bpf_verifier_env *env, s32 imm)
>> +{
>> + const struct bpf_kfunc_desc *desc = find_kfunc_desc(env->prog, imm, 0);
>> +
>> + if (!env->prog->jit_requested)
>> + return false;
>
> This is a regression.
> Why disable verifier inlining when jit is off?
>
>> +
>> + if (desc->func_id == special_kfunc_list[KF_bpf_cast_to_kern_ctx] ||
>> + desc->func_id == special_kfunc_list[KF_bpf_rdonly_cast])
>> + return true;
>> +
>> + return false;
>> +}
>> +
>> struct call_summary {
>> u8 num_params;
>> bool is_void;
>> @@ -17077,7 +17095,10 @@ static bool get_call_summary(struct bpf_verifier_env *env, struct bpf_insn *call
>> /* error would be reported later */
>> return false;
>> cs->num_params = btf_type_vlen(meta.func_proto);
>> - cs->fastcall = meta.kfunc_flags & KF_FASTCALL;
>> + cs->fastcall = meta.kfunc_flags & KF_FASTCALL &&
>> + (verifier_inlines_kfunc_call(env, call->imm) ||
>> + (meta.btf == btf_vmlinux &&
>> + bpf_jit_inlines_kfunc_call(call->imm)));
>> cs->is_void = btf_type_is_void(btf_type_by_id(meta.btf, meta.func_proto->type));
>> return true;
>> }
>> @@ -21223,6 +21244,7 @@ static int fixup_kfunc_call(struct bpf_verifier_env *env, struct bpf_insn *insn,
>> struct bpf_insn *insn_buf, int insn_idx, int *cnt)
>> {
>> const struct bpf_kfunc_desc *desc;
>> + s32 imm = insn->imm;
>>
>> if (!insn->imm) {
>> verbose(env, "invalid kernel function call not eliminated in verifier pass\n");
>> @@ -21246,7 +21268,18 @@ static int fixup_kfunc_call(struct bpf_verifier_env *env, struct bpf_insn *insn,
>> insn->imm = BPF_CALL_IMM(desc->addr);
>> if (insn->off)
>> return 0;
>> - if (desc->func_id == special_kfunc_list[KF_bpf_obj_new_impl] ||
>> + if (verifier_inlines_kfunc_call(env, imm)) {
>> + if (desc->func_id == special_kfunc_list[KF_bpf_cast_to_kern_ctx] ||
>> + desc->func_id == special_kfunc_list[KF_bpf_rdonly_cast]) {
>
> I really don't like this copy paste.
> Next trivial function that inlines as r0=r1
> would need to add itself in two places for no good reason.
OK, I'll keep this part untouched
>
> pw-bot: cr
* Re: [PATCH bpf-next v11 1/4] bpf: add bpf_get_cpu_time_counter kfunc
2025-03-18 8:17 ` Vadim Fedorenko
@ 2025-03-18 15:56 ` Alexei Starovoitov
0 siblings, 0 replies; 13+ messages in thread
From: Alexei Starovoitov @ 2025-03-18 15:56 UTC (permalink / raw)
To: Vadim Fedorenko
Cc: Vadim Fedorenko, Borislav Petkov, Alexei Starovoitov,
Daniel Borkmann, Andrii Nakryiko, Eduard Zingerman,
Thomas Gleixner, Yonghong Song, Mykola Lysenko, X86 ML, bpf,
Peter Zijlstra, Martin KaFai Lau
On Tue, Mar 18, 2025 at 1:17 AM Vadim Fedorenko
<vadim.fedorenko@linux.dev> wrote:
>
> >>
> >> + if (insn->src_reg == BPF_PSEUDO_KFUNC_CALL &&
> >> + IS_ENABLED(CONFIG_BPF_SYSCALL) &&
> >
> > why?
> >
> > It's true that JIT can be compiled in even when there is no sys_bpf,
> > but why gate this?
>
> Both bpf_get_cpu_time_counter() and bpf_cpu_time_counter_to_ns() are
> defined in helpers.c which is compiled only when CONFIG_BPF_SYSCALL is
> enabled. Otherwise these symbols are absent and compilation fails.
> See kernel test bot report:
> https://lore.kernel.org/bpf/202503131640.opwmXIvU-lkp@intel.com/
put in core.c then?
> >> +__bpf_kfunc u64 bpf_get_cpu_time_counter(void)
> >> +{
> >> + return ktime_get_raw_fast_ns();
> >
> > Why 'raw' ?
> > Is it faster than 'mono' ?
> > This needs a comment at least.
>
> 'raw' is the closest analogue to what is implemented in JIT. The access
> time is the same as for 'mono', but the slope of 'raw' is not affected
> by NTP adjustments, and with stable tsc it can provide less jitter in
> short term measurements.
fair enough. the comment is needed.
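(For reference, one possible shape of that comment, based only on the
explanation above; the wording is illustrative, not the final patch:)

  __bpf_kfunc u64 bpf_get_cpu_time_counter(void)
  {
          /* Use CLOCK_MONOTONIC_RAW: access cost is the same as the 'mono'
           * fast clock, but its slope is not adjusted by NTP, so with a
           * stable TSC it is the closest match to the JITed rdtsc path and
           * gives less jitter for short-term measurements.
           */
          return ktime_get_raw_fast_ns();
  }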
* [PATCH bpf-next v11 2/4] bpf: add bpf_cpu_time_counter_to_ns helper
2025-03-17 22:49 [PATCH bpf-next v11 0/4] bpf: add cpu time counter kfuncs Vadim Fedorenko
2025-03-17 22:49 ` [PATCH bpf-next v11 1/4] bpf: add bpf_get_cpu_time_counter kfunc Vadim Fedorenko
@ 2025-03-17 22:49 ` Vadim Fedorenko
2025-03-18 0:29 ` Alexei Starovoitov
2025-03-17 22:49 ` [PATCH bpf-next v11 3/4] selftests/bpf: add selftest to check bpf_get_cpu_time_counter jit Vadim Fedorenko
2025-03-17 22:49 ` [PATCH bpf-next v11 4/4] selftests/bpf: add usage example for cpu time counter kfuncs Vadim Fedorenko
3 siblings, 1 reply; 13+ messages in thread
From: Vadim Fedorenko @ 2025-03-17 22:49 UTC (permalink / raw)
To: Borislav Petkov, Alexei Starovoitov, Daniel Borkmann,
Andrii Nakryiko, Eduard Zingerman, Thomas Gleixner, Yonghong Song,
Vadim Fedorenko, Mykola Lysenko
Cc: x86, bpf, Peter Zijlstra, Vadim Fedorenko, Martin KaFai Lau
The new kfunc should be used to convert deltas of values returned by
bpf_get_cpu_time_counter() into nanoseconds. It is not designed to do a full
conversion of time counter values to CLOCK_MONOTONIC_RAW nanoseconds and
cannot guarantee monotonicity of two independent values; it converts the
difference of two close-enough CPU timestamp counter values into nanoseconds.

This function is JITed into just a few instructions, adds as little overhead
as possible, and is well suited to benchmark use cases.

When the kfunc is not JITed it returns the value provided as its argument,
because in that case the kfunc introduced in the previous patch already
returns values in nanoseconds.
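As an illustration (a sketch, not part of this series, assuming the kfuncs are
declared as __ksym externs in the BPF program), the intended pattern is to
convert a delta rather than two absolute readings:

  static __always_inline __u64 bench_section_ns(void)
  {
          __u64 start = bpf_get_cpu_time_counter();

          /* ... code under benchmark ... */

          return bpf_cpu_time_counter_to_ns(bpf_get_cpu_time_counter() - start);
  }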
Reviewed-by: Eduard Zingerman <eddyz87@gmail.com>
Acked-by: Andrii Nakryiko <andrii@kernel.org>
Signed-off-by: Vadim Fedorenko <vadfed@meta.com>
---
arch/x86/net/bpf_jit_comp.c | 28 +++++++++++++++++++++++++++-
arch/x86/net/bpf_jit_comp32.c | 27 ++++++++++++++++++++++++++-
include/linux/bpf.h | 1 +
kernel/bpf/helpers.c | 6 ++++++
4 files changed, 60 insertions(+), 2 deletions(-)
diff --git a/arch/x86/net/bpf_jit_comp.c b/arch/x86/net/bpf_jit_comp.c
index 92cd5945d630..3e4d45defe2f 100644
--- a/arch/x86/net/bpf_jit_comp.c
+++ b/arch/x86/net/bpf_jit_comp.c
@@ -9,6 +9,7 @@
#include <linux/filter.h>
#include <linux/if_vlan.h>
#include <linux/bpf.h>
+#include <linux/clocksource.h>
#include <linux/memory.h>
#include <linux/sort.h>
#include <asm/extable.h>
@@ -2289,6 +2290,30 @@ st: if (is_imm8(insn->off))
break;
}
+ if (insn->src_reg == BPF_PSEUDO_KFUNC_CALL &&
+ IS_ENABLED(CONFIG_BPF_SYSCALL) &&
+ imm32 == BPF_CALL_IMM(bpf_cpu_time_counter_to_ns) &&
+ cpu_feature_enabled(X86_FEATURE_TSC) &&
+ using_native_sched_clock() && sched_clock_stable()) {
+ struct cyc2ns_data data;
+ u32 mult, shift;
+
+ cyc2ns_read_begin(&data);
+ mult = data.cyc2ns_mul;
+ shift = data.cyc2ns_shift;
+ cyc2ns_read_end();
+ /* imul RAX, RDI, mult */
+ maybe_emit_mod(&prog, BPF_REG_1, BPF_REG_0, true);
+ EMIT2_off32(0x69, add_2reg(0xC0, BPF_REG_1, BPF_REG_0),
+ mult);
+
+ /* shr RAX, shift (which is less than 64) */
+ maybe_emit_1mod(&prog, BPF_REG_0, true);
+ EMIT3(0xC1, add_1reg(0xE8, BPF_REG_0), shift);
+
+ break;
+ }
+
func = (u8 *) __bpf_call_base + imm32;
if (src_reg == BPF_PSEUDO_CALL && tail_call_reachable) {
LOAD_TAIL_CALL_CNT_PTR(stack_depth);
@@ -3906,7 +3931,8 @@ bool bpf_jit_inlines_kfunc_call(s32 imm)
{
if (!IS_ENABLED(CONFIG_BPF_SYSCALL))
return false;
- if (imm == BPF_CALL_IMM(bpf_get_cpu_time_counter) &&
+ if ((imm == BPF_CALL_IMM(bpf_get_cpu_time_counter) ||
+ imm == BPF_CALL_IMM(bpf_cpu_time_counter_to_ns)) &&
cpu_feature_enabled(X86_FEATURE_TSC) &&
using_native_sched_clock() && sched_clock_stable())
return true;
diff --git a/arch/x86/net/bpf_jit_comp32.c b/arch/x86/net/bpf_jit_comp32.c
index 7f13509c66db..9791a3fb9d69 100644
--- a/arch/x86/net/bpf_jit_comp32.c
+++ b/arch/x86/net/bpf_jit_comp32.c
@@ -12,6 +12,7 @@
#include <linux/netdevice.h>
#include <linux/filter.h>
#include <linux/if_vlan.h>
+#include <linux/clocksource.h>
#include <asm/cacheflush.h>
#include <asm/set_memory.h>
#include <asm/nospec-branch.h>
@@ -2115,6 +2116,29 @@ static int do_jit(struct bpf_prog *bpf_prog, int *addrs, u8 *image,
EMIT2(0x0F, 0x31);
break;
}
+ if (IS_ENABLED(CONFIG_BPF_SYSCALL) &&
+ imm32 == BPF_CALL_IMM(bpf_cpu_time_counter_to_ns) &&
+ cpu_feature_enabled(X86_FEATURE_TSC) &&
+ using_native_sched_clock() && sched_clock_stable()) {
+ struct cyc2ns_data data;
+ u32 mult, shift;
+
+ cyc2ns_read_begin(&data);
+ mult = data.cyc2ns_mul;
+ shift = data.cyc2ns_shift;
+ cyc2ns_read_end();
+
+ /* move parameter to BPF_REG_0 */
+ emit_ia32_mov_r64(true, bpf2ia32[BPF_REG_0],
+ bpf2ia32[BPF_REG_1], true, true,
+ &prog, bpf_prog->aux);
+ /* multiply parameter by mult */
+ emit_ia32_mul_i64(bpf2ia32[BPF_REG_0],
+ mult, true, &prog);
+ /* shift parameter by shift which is less than 64 */
+ emit_ia32_rsh_i64(bpf2ia32[BPF_REG_0],
+ shift, true, &prog);
+ }
err = emit_kfunc_call(bpf_prog,
image + addrs[i],
@@ -2648,7 +2672,8 @@ bool bpf_jit_inlines_kfunc_call(s32 imm)
{
if (!IS_ENABLED(CONFIG_BPF_SYSCALL))
return false;
- if (imm == BPF_CALL_IMM(bpf_get_cpu_time_counter) &&
+ if ((imm == BPF_CALL_IMM(bpf_get_cpu_time_counter) ||
+ imm == BPF_CALL_IMM(bpf_cpu_time_counter_to_ns)) &&
cpu_feature_enabled(X86_FEATURE_TSC) &&
using_native_sched_clock() && sched_clock_stable())
return true;
diff --git a/include/linux/bpf.h b/include/linux/bpf.h
index a5e9b592d3e8..f45a704f06e3 100644
--- a/include/linux/bpf.h
+++ b/include/linux/bpf.h
@@ -3389,6 +3389,7 @@ u64 bpf_get_raw_cpu_id(u64 r1, u64 r2, u64 r3, u64 r4, u64 r5);
/* Inlined kfuncs */
u64 bpf_get_cpu_time_counter(void);
+u64 bpf_cpu_time_counter_to_ns(u64 counter);
#if defined(CONFIG_NET)
bool bpf_sock_common_is_valid_access(int off, int size,
diff --git a/kernel/bpf/helpers.c b/kernel/bpf/helpers.c
index 43bf35a15f78..e5ed5ba4b4aa 100644
--- a/kernel/bpf/helpers.c
+++ b/kernel/bpf/helpers.c
@@ -3198,6 +3198,11 @@ __bpf_kfunc u64 bpf_get_cpu_time_counter(void)
return ktime_get_raw_fast_ns();
}
+__bpf_kfunc u64 bpf_cpu_time_counter_to_ns(u64 counter)
+{
+ return counter;
+}
+
__bpf_kfunc_end_defs();
BTF_KFUNCS_START(generic_btf_ids)
@@ -3299,6 +3304,7 @@ BTF_ID_FLAGS(func, bpf_iter_kmem_cache_destroy, KF_ITER_DESTROY | KF_SLEEPABLE)
BTF_ID_FLAGS(func, bpf_local_irq_save)
BTF_ID_FLAGS(func, bpf_local_irq_restore)
BTF_ID_FLAGS(func, bpf_get_cpu_time_counter, KF_FASTCALL)
+BTF_ID_FLAGS(func, bpf_cpu_time_counter_to_ns, KF_FASTCALL)
BTF_KFUNCS_END(common_btf_ids)
static const struct btf_kfunc_id_set common_kfunc_set = {
--
2.47.1
* Re: [PATCH bpf-next v11 2/4] bpf: add bpf_cpu_time_counter_to_ns helper
2025-03-17 22:49 ` [PATCH bpf-next v11 2/4] bpf: add bpf_cpu_time_counter_to_ns helper Vadim Fedorenko
@ 2025-03-18 0:29 ` Alexei Starovoitov
2025-03-18 8:44 ` Vadim Fedorenko
0 siblings, 1 reply; 13+ messages in thread
From: Alexei Starovoitov @ 2025-03-18 0:29 UTC (permalink / raw)
To: Vadim Fedorenko
Cc: Borislav Petkov, Alexei Starovoitov, Daniel Borkmann,
Andrii Nakryiko, Eduard Zingerman, Thomas Gleixner, Yonghong Song,
Vadim Fedorenko, Mykola Lysenko, X86 ML, bpf, Peter Zijlstra,
Martin KaFai Lau
On Mon, Mar 17, 2025 at 3:50 PM Vadim Fedorenko <vadfed@meta.com> wrote:
>
> The new helper should be used to convert deltas of values
> received by bpf_get_cpu_time_counter() into nanoseconds. It is not
> designed to do full conversion of time counter values to
> CLOCK_MONOTONIC_RAW nanoseconds and cannot guarantee monotonicity of 2
> independent values, but rather to convert the difference of 2 close
> enough values of CPU timestamp counter into nanoseconds.
>
> This function is JITted into just several instructions and adds as
> low overhead as possible and perfectly suits benchmark use-cases.
>
> When the kfunc is not JITted it returns the value provided as argument
> because the kfunc in previous patch will return values in nanoseconds.
>
> Reviewed-by: Eduard Zingerman <eddyz87@gmail.com>
> Acked-by: Andrii Nakryiko <andrii@kernel.org>
> Signed-off-by: Vadim Fedorenko <vadfed@meta.com>
> ---
> arch/x86/net/bpf_jit_comp.c | 28 +++++++++++++++++++++++++++-
> arch/x86/net/bpf_jit_comp32.c | 27 ++++++++++++++++++++++++++-
> include/linux/bpf.h | 1 +
> kernel/bpf/helpers.c | 6 ++++++
> 4 files changed, 60 insertions(+), 2 deletions(-)
>
> diff --git a/arch/x86/net/bpf_jit_comp.c b/arch/x86/net/bpf_jit_comp.c
> index 92cd5945d630..3e4d45defe2f 100644
> --- a/arch/x86/net/bpf_jit_comp.c
> +++ b/arch/x86/net/bpf_jit_comp.c
> @@ -9,6 +9,7 @@
> #include <linux/filter.h>
> #include <linux/if_vlan.h>
> #include <linux/bpf.h>
> +#include <linux/clocksource.h>
> #include <linux/memory.h>
> #include <linux/sort.h>
> #include <asm/extable.h>
> @@ -2289,6 +2290,30 @@ st: if (is_imm8(insn->off))
> break;
> }
>
> + if (insn->src_reg == BPF_PSEUDO_KFUNC_CALL &&
> + IS_ENABLED(CONFIG_BPF_SYSCALL) &&
> + imm32 == BPF_CALL_IMM(bpf_cpu_time_counter_to_ns) &&
> + cpu_feature_enabled(X86_FEATURE_TSC) &&
> + using_native_sched_clock() && sched_clock_stable()) {
And now this condition copy pasted 3 times ?!
> + struct cyc2ns_data data;
> + u32 mult, shift;
> +
> + cyc2ns_read_begin(&data);
> + mult = data.cyc2ns_mul;
> + shift = data.cyc2ns_shift;
> + cyc2ns_read_end();
This needs a big comment explaining why this math will be stable
after JIT and for the lifetime of the prog.
> + /* imul RAX, RDI, mult */
> + maybe_emit_mod(&prog, BPF_REG_1, BPF_REG_0, true);
> + EMIT2_off32(0x69, add_2reg(0xC0, BPF_REG_1, BPF_REG_0),
> + mult);
> +
> + /* shr RAX, shift (which is less than 64) */
> + maybe_emit_1mod(&prog, BPF_REG_0, true);
> + EMIT3(0xC1, add_1reg(0xE8, BPF_REG_0), shift);
> +
> + break;
> + }
> +
> func = (u8 *) __bpf_call_base + imm32;
> if (src_reg == BPF_PSEUDO_CALL && tail_call_reachable) {
> LOAD_TAIL_CALL_CNT_PTR(stack_depth);
> @@ -3906,7 +3931,8 @@ bool bpf_jit_inlines_kfunc_call(s32 imm)
> {
> if (!IS_ENABLED(CONFIG_BPF_SYSCALL))
> return false;
> - if (imm == BPF_CALL_IMM(bpf_get_cpu_time_counter) &&
> + if ((imm == BPF_CALL_IMM(bpf_get_cpu_time_counter) ||
> + imm == BPF_CALL_IMM(bpf_cpu_time_counter_to_ns)) &&
> cpu_feature_enabled(X86_FEATURE_TSC) &&
> using_native_sched_clock() && sched_clock_stable())
> return true;
> diff --git a/arch/x86/net/bpf_jit_comp32.c b/arch/x86/net/bpf_jit_comp32.c
> index 7f13509c66db..9791a3fb9d69 100644
> --- a/arch/x86/net/bpf_jit_comp32.c
> +++ b/arch/x86/net/bpf_jit_comp32.c
> @@ -12,6 +12,7 @@
> #include <linux/netdevice.h>
> #include <linux/filter.h>
> #include <linux/if_vlan.h>
> +#include <linux/clocksource.h>
> #include <asm/cacheflush.h>
> #include <asm/set_memory.h>
> #include <asm/nospec-branch.h>
> @@ -2115,6 +2116,29 @@ static int do_jit(struct bpf_prog *bpf_prog, int *addrs, u8 *image,
> EMIT2(0x0F, 0x31);
> break;
> }
> + if (IS_ENABLED(CONFIG_BPF_SYSCALL) &&
> + imm32 == BPF_CALL_IMM(bpf_cpu_time_counter_to_ns) &&
> + cpu_feature_enabled(X86_FEATURE_TSC) &&
> + using_native_sched_clock() && sched_clock_stable()) {
> + struct cyc2ns_data data;
> + u32 mult, shift;
> +
> + cyc2ns_read_begin(&data);
> + mult = data.cyc2ns_mul;
> + shift = data.cyc2ns_shift;
> + cyc2ns_read_end();
same here.
> +
> + /* move parameter to BPF_REG_0 */
> + emit_ia32_mov_r64(true, bpf2ia32[BPF_REG_0],
> + bpf2ia32[BPF_REG_1], true, true,
> + &prog, bpf_prog->aux);
> + /* multiply parameter by mut */
> + emit_ia32_mul_i64(bpf2ia32[BPF_REG_0],
> + mult, true, &prog);
How did you test this?
It's far from obvious that this will match what mul_u64_u32_shr() does.
And on a quick look I really doubt.
The trouble of adding support for 32-bit JIT doesn't seem worth it.
> + /* shift parameter by shift which is less than 64 */
> + emit_ia32_rsh_i64(bpf2ia32[BPF_REG_0],
> + shift, true, &prog);
> + }
>
> err = emit_kfunc_call(bpf_prog,
> image + addrs[i],
> @@ -2648,7 +2672,8 @@ bool bpf_jit_inlines_kfunc_call(s32 imm)
> {
> if (!IS_ENABLED(CONFIG_BPF_SYSCALL))
> return false;
> - if (imm == BPF_CALL_IMM(bpf_get_cpu_time_counter) &&
> + if ((imm == BPF_CALL_IMM(bpf_get_cpu_time_counter) ||
> + imm == BPF_CALL_IMM(bpf_cpu_time_counter_to_ns)) &&
> cpu_feature_enabled(X86_FEATURE_TSC) &&
> using_native_sched_clock() && sched_clock_stable())
> return true;
> diff --git a/include/linux/bpf.h b/include/linux/bpf.h
> index a5e9b592d3e8..f45a704f06e3 100644
> --- a/include/linux/bpf.h
> +++ b/include/linux/bpf.h
> @@ -3389,6 +3389,7 @@ u64 bpf_get_raw_cpu_id(u64 r1, u64 r2, u64 r3, u64 r4, u64 r5);
>
> /* Inlined kfuncs */
> u64 bpf_get_cpu_time_counter(void);
> +u64 bpf_cpu_time_counter_to_ns(u64 counter);
>
> #if defined(CONFIG_NET)
> bool bpf_sock_common_is_valid_access(int off, int size,
> diff --git a/kernel/bpf/helpers.c b/kernel/bpf/helpers.c
> index 43bf35a15f78..e5ed5ba4b4aa 100644
> --- a/kernel/bpf/helpers.c
> +++ b/kernel/bpf/helpers.c
> @@ -3198,6 +3198,11 @@ __bpf_kfunc u64 bpf_get_cpu_time_counter(void)
> return ktime_get_raw_fast_ns();
> }
>
> +__bpf_kfunc u64 bpf_cpu_time_counter_to_ns(u64 counter)
> +{
> + return counter;
> +}
> +
> __bpf_kfunc_end_defs();
>
> BTF_KFUNCS_START(generic_btf_ids)
> @@ -3299,6 +3304,7 @@ BTF_ID_FLAGS(func, bpf_iter_kmem_cache_destroy, KF_ITER_DESTROY | KF_SLEEPABLE)
> BTF_ID_FLAGS(func, bpf_local_irq_save)
> BTF_ID_FLAGS(func, bpf_local_irq_restore)
> BTF_ID_FLAGS(func, bpf_get_cpu_time_counter, KF_FASTCALL)
> +BTF_ID_FLAGS(func, bpf_cpu_time_counter_to_ns, KF_FASTCALL)
> BTF_KFUNCS_END(common_btf_ids)
>
> static const struct btf_kfunc_id_set common_kfunc_set = {
> --
> 2.47.1
>
* Re: [PATCH bpf-next v11 2/4] bpf: add bpf_cpu_time_counter_to_ns helper
2025-03-18 0:29 ` Alexei Starovoitov
@ 2025-03-18 8:44 ` Vadim Fedorenko
2025-03-18 16:42 ` Alexei Starovoitov
0 siblings, 1 reply; 13+ messages in thread
From: Vadim Fedorenko @ 2025-03-18 8:44 UTC (permalink / raw)
To: Alexei Starovoitov, Vadim Fedorenko
Cc: Borislav Petkov, Alexei Starovoitov, Daniel Borkmann,
Andrii Nakryiko, Eduard Zingerman, Thomas Gleixner, Yonghong Song,
Mykola Lysenko, X86 ML, bpf, Peter Zijlstra, Martin KaFai Lau
On 18/03/2025 00:29, Alexei Starovoitov wrote:
> On Mon, Mar 17, 2025 at 3:50 PM Vadim Fedorenko <vadfed@meta.com> wrote:
>>
>> The new helper should be used to convert deltas of values
>> received by bpf_get_cpu_time_counter() into nanoseconds. It is not
>> designed to do full conversion of time counter values to
>> CLOCK_MONOTONIC_RAW nanoseconds and cannot guarantee monotonicity of 2
>> independent values, but rather to convert the difference of 2 close
>> enough values of CPU timestamp counter into nanoseconds.
>>
>> This function is JITted into just several instructions and adds as
>> low overhead as possible and perfectly suits benchmark use-cases.
>>
>> When the kfunc is not JITted it returns the value provided as argument
>> because the kfunc in previous patch will return values in nanoseconds.
>>
>> Reviewed-by: Eduard Zingerman <eddyz87@gmail.com>
>> Acked-by: Andrii Nakryiko <andrii@kernel.org>
>> Signed-off-by: Vadim Fedorenko <vadfed@meta.com>
>> ---
>> arch/x86/net/bpf_jit_comp.c | 28 +++++++++++++++++++++++++++-
>> arch/x86/net/bpf_jit_comp32.c | 27 ++++++++++++++++++++++++++-
>> include/linux/bpf.h | 1 +
>> kernel/bpf/helpers.c | 6 ++++++
>> 4 files changed, 60 insertions(+), 2 deletions(-)
>>
>> diff --git a/arch/x86/net/bpf_jit_comp.c b/arch/x86/net/bpf_jit_comp.c
>> index 92cd5945d630..3e4d45defe2f 100644
>> --- a/arch/x86/net/bpf_jit_comp.c
>> +++ b/arch/x86/net/bpf_jit_comp.c
>> @@ -9,6 +9,7 @@
>> #include <linux/filter.h>
>> #include <linux/if_vlan.h>
>> #include <linux/bpf.h>
>> +#include <linux/clocksource.h>
>> #include <linux/memory.h>
>> #include <linux/sort.h>
>> #include <asm/extable.h>
>> @@ -2289,6 +2290,30 @@ st: if (is_imm8(insn->off))
>> break;
>> }
>>
>> + if (insn->src_reg == BPF_PSEUDO_KFUNC_CALL &&
>> + IS_ENABLED(CONFIG_BPF_SYSCALL) &&
>> + imm32 == BPF_CALL_IMM(bpf_cpu_time_counter_to_ns) &&
>> + cpu_feature_enabled(X86_FEATURE_TSC) &&
>> + using_native_sched_clock() && sched_clock_stable()) {
>
> And now this condition copy pasted 3 times ?!
Yeah, I'll factor it out
>> + struct cyc2ns_data data;
>> + u32 mult, shift;
>> +
>> + cyc2ns_read_begin(&data);
>> + mult = data.cyc2ns_mul;
>> + shift = data.cyc2ns_shift;
>> + cyc2ns_read_end();
>
> This needs a big comment explaining why this math will be stable
> after JIT and for the lifetime of the prog.
It's more or less the same comment as for the JIT of
bpf_get_cpu_time_counter(). I'll add it.
>> + /* imul RAX, RDI, mult */
>> + maybe_emit_mod(&prog, BPF_REG_1, BPF_REG_0, true);
>> + EMIT2_off32(0x69, add_2reg(0xC0, BPF_REG_1, BPF_REG_0),
>> + mult);
>> +
>> + /* shr RAX, shift (which is less than 64) */
>> + maybe_emit_1mod(&prog, BPF_REG_0, true);
>> + EMIT3(0xC1, add_1reg(0xE8, BPF_REG_0), shift);
>> +
>> + break;
>> + }
>> +
>> func = (u8 *) __bpf_call_base + imm32;
>> if (src_reg == BPF_PSEUDO_CALL && tail_call_reachable) {
>> LOAD_TAIL_CALL_CNT_PTR(stack_depth);
>> @@ -3906,7 +3931,8 @@ bool bpf_jit_inlines_kfunc_call(s32 imm)
>> {
>> if (!IS_ENABLED(CONFIG_BPF_SYSCALL))
>> return false;
>> - if (imm == BPF_CALL_IMM(bpf_get_cpu_time_counter) &&
>> + if ((imm == BPF_CALL_IMM(bpf_get_cpu_time_counter) ||
>> + imm == BPF_CALL_IMM(bpf_cpu_time_counter_to_ns)) &&
>> cpu_feature_enabled(X86_FEATURE_TSC) &&
>> using_native_sched_clock() && sched_clock_stable())
>> return true;
>> diff --git a/arch/x86/net/bpf_jit_comp32.c b/arch/x86/net/bpf_jit_comp32.c
>> index 7f13509c66db..9791a3fb9d69 100644
>> --- a/arch/x86/net/bpf_jit_comp32.c
>> +++ b/arch/x86/net/bpf_jit_comp32.c
>> @@ -12,6 +12,7 @@
>> #include <linux/netdevice.h>
>> #include <linux/filter.h>
>> #include <linux/if_vlan.h>
>> +#include <linux/clocksource.h>
>> #include <asm/cacheflush.h>
>> #include <asm/set_memory.h>
>> #include <asm/nospec-branch.h>
>> @@ -2115,6 +2116,29 @@ static int do_jit(struct bpf_prog *bpf_prog, int *addrs, u8 *image,
>> EMIT2(0x0F, 0x31);
>> break;
>> }
>> + if (IS_ENABLED(CONFIG_BPF_SYSCALL) &&
>> + imm32 == BPF_CALL_IMM(bpf_cpu_time_counter_to_ns) &&
>> + cpu_feature_enabled(X86_FEATURE_TSC) &&
>> + using_native_sched_clock() && sched_clock_stable()) {
>> + struct cyc2ns_data data;
>> + u32 mult, shift;
>> +
>> + cyc2ns_read_begin(&data);
>> + mult = data.cyc2ns_mul;
>> + shift = data.cyc2ns_shift;
>> + cyc2ns_read_end();
>
> same here.
>
>> +
>> + /* move parameter to BPF_REG_0 */
>> + emit_ia32_mov_r64(true, bpf2ia32[BPF_REG_0],
>> + bpf2ia32[BPF_REG_1], true, true,
>> + &prog, bpf_prog->aux);
>> + /* multiply parameter by mult */
>> + emit_ia32_mul_i64(bpf2ia32[BPF_REG_0],
>> + mult, true, &prog);
>
> How did you test this?
> It's far from obvious that this will match what mul_u64_u32_shr() does.
> And on a quick look I really doubt.
Well, I can re-write it op-by-op from mul_u64_u32_shr(), but it's more
or less the same given that mult and shift are not too big, which is
common for TSC coefficients.
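To be precise (illustrative only, not part of the patch): mul_u64_u32_shr()
computes the full 128-bit product before shifting,

	ns = mul_u64_u32_shr(delta, mult, shift);

while the JITed multiply + shift keep only the low 64 bits of the product,

	ns = (delta * (u64)mult) >> shift;

The two agree as long as delta * mult fits in 64 bits, i.e. for deltas well
below U64_MAX / mult, which should comfortably hold for the short intervals
these kfuncs are meant to measure.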
>
> The trouble of adding support for 32-bit JIT doesn't seem worth it.
Do you mean it's better to drop this JIT implementation?
>
>> + /* shift parameter by shift which is less than 64 */
>> + emit_ia32_rsh_i64(bpf2ia32[BPF_REG_0],
>> + shift, true, &prog);
>> + }
>>
>> err = emit_kfunc_call(bpf_prog,
>> image + addrs[i],
>> @@ -2648,7 +2672,8 @@ bool bpf_jit_inlines_kfunc_call(s32 imm)
>> {
>> if (!IS_ENABLED(CONFIG_BPF_SYSCALL))
>> return false;
>> - if (imm == BPF_CALL_IMM(bpf_get_cpu_time_counter) &&
>> + if ((imm == BPF_CALL_IMM(bpf_get_cpu_time_counter) ||
>> + imm == BPF_CALL_IMM(bpf_cpu_time_counter_to_ns)) &&
>> cpu_feature_enabled(X86_FEATURE_TSC) &&
>> using_native_sched_clock() && sched_clock_stable())
>> return true;
>> diff --git a/include/linux/bpf.h b/include/linux/bpf.h
>> index a5e9b592d3e8..f45a704f06e3 100644
>> --- a/include/linux/bpf.h
>> +++ b/include/linux/bpf.h
>> @@ -3389,6 +3389,7 @@ u64 bpf_get_raw_cpu_id(u64 r1, u64 r2, u64 r3, u64 r4, u64 r5);
>>
>> /* Inlined kfuncs */
>> u64 bpf_get_cpu_time_counter(void);
>> +u64 bpf_cpu_time_counter_to_ns(u64 counter);
>>
>> #if defined(CONFIG_NET)
>> bool bpf_sock_common_is_valid_access(int off, int size,
>> diff --git a/kernel/bpf/helpers.c b/kernel/bpf/helpers.c
>> index 43bf35a15f78..e5ed5ba4b4aa 100644
>> --- a/kernel/bpf/helpers.c
>> +++ b/kernel/bpf/helpers.c
>> @@ -3198,6 +3198,11 @@ __bpf_kfunc u64 bpf_get_cpu_time_counter(void)
>> return ktime_get_raw_fast_ns();
>> }
>>
>> +__bpf_kfunc u64 bpf_cpu_time_counter_to_ns(u64 counter)
>> +{
>> + return counter;
>> +}
>> +
>> __bpf_kfunc_end_defs();
>>
>> BTF_KFUNCS_START(generic_btf_ids)
>> @@ -3299,6 +3304,7 @@ BTF_ID_FLAGS(func, bpf_iter_kmem_cache_destroy, KF_ITER_DESTROY | KF_SLEEPABLE)
>> BTF_ID_FLAGS(func, bpf_local_irq_save)
>> BTF_ID_FLAGS(func, bpf_local_irq_restore)
>> BTF_ID_FLAGS(func, bpf_get_cpu_time_counter, KF_FASTCALL)
>> +BTF_ID_FLAGS(func, bpf_cpu_time_counter_to_ns, KF_FASTCALL)
>> BTF_KFUNCS_END(common_btf_ids)
>>
>> static const struct btf_kfunc_id_set common_kfunc_set = {
>> --
>> 2.47.1
>>
^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: [PATCH bpf-next v11 2/4] bpf: add bpf_cpu_time_counter_to_ns helper
2025-03-18 8:44 ` Vadim Fedorenko
@ 2025-03-18 16:42 ` Alexei Starovoitov
0 siblings, 0 replies; 13+ messages in thread
From: Alexei Starovoitov @ 2025-03-18 16:42 UTC (permalink / raw)
To: Vadim Fedorenko
Cc: Vadim Fedorenko, Borislav Petkov, Alexei Starovoitov,
Daniel Borkmann, Andrii Nakryiko, Eduard Zingerman,
Thomas Gleixner, Yonghong Song, Mykola Lysenko, X86 ML, bpf,
Peter Zijlstra, Martin KaFai Lau
On Tue, Mar 18, 2025 at 1:44 AM Vadim Fedorenko
<vadim.fedorenko@linux.dev> wrote:
>
> On 18/03/2025 00:29, Alexei Starovoitov wrote:
> > On Mon, Mar 17, 2025 at 3:50 PM Vadim Fedorenko <vadfed@meta.com> wrote:
> >>
> >> The new helper should be used to convert deltas of values
> >> received by bpf_get_cpu_time_counter() into nanoseconds. It is not
> >> designed to do full conversion of time counter values to
> >> CLOCK_MONOTONIC_RAW nanoseconds and cannot guarantee monotonicity of 2
> >> independent values, but rather to convert the difference of 2 close
> >> enough values of CPU timestamp counter into nanoseconds.
> >>
> >> This function is JITted into just a few instructions and adds as little
> >> overhead as possible, which makes it a good fit for benchmark use-cases.
> >>
> >> When the kfunc is not JITted it returns the value provided as its
> >> argument, because the kfunc in the previous patch will return values in
> >> nanoseconds.
> >>
> >> Reviewed-by: Eduard Zingerman <eddyz87@gmail.com>
> >> Acked-by: Andrii Nakryiko <andrii@kernel.org>
> >> Signed-off-by: Vadim Fedorenko <vadfed@meta.com>
> >> ---
> >> arch/x86/net/bpf_jit_comp.c | 28 +++++++++++++++++++++++++++-
> >> arch/x86/net/bpf_jit_comp32.c | 27 ++++++++++++++++++++++++++-
> >> include/linux/bpf.h | 1 +
> >> kernel/bpf/helpers.c | 6 ++++++
> >> 4 files changed, 60 insertions(+), 2 deletions(-)
> >>
> >> diff --git a/arch/x86/net/bpf_jit_comp.c b/arch/x86/net/bpf_jit_comp.c
> >> index 92cd5945d630..3e4d45defe2f 100644
> >> --- a/arch/x86/net/bpf_jit_comp.c
> >> +++ b/arch/x86/net/bpf_jit_comp.c
> >> @@ -9,6 +9,7 @@
> >> #include <linux/filter.h>
> >> #include <linux/if_vlan.h>
> >> #include <linux/bpf.h>
> >> +#include <linux/clocksource.h>
> >> #include <linux/memory.h>
> >> #include <linux/sort.h>
> >> #include <asm/extable.h>
> >> @@ -2289,6 +2290,30 @@ st: if (is_imm8(insn->off))
> >> break;
> >> }
> >>
> >> + if (insn->src_reg == BPF_PSEUDO_KFUNC_CALL &&
> >> + IS_ENABLED(CONFIG_BPF_SYSCALL) &&
> >> + imm32 == BPF_CALL_IMM(bpf_cpu_time_counter_to_ns) &&
> >> + cpu_feature_enabled(X86_FEATURE_TSC) &&
> >> + using_native_sched_clock() && sched_clock_stable()) {
> >
> > And now this condition copy pasted 3 times ?!
>
> Yeah, I'll factor it out
>
> >> + struct cyc2ns_data data;
> >> + u32 mult, shift;
> >> +
> >> + cyc2ns_read_begin(&data);
> >> + mult = data.cyc2ns_mul;
> >> + shift = data.cyc2ns_shift;
> >> + cyc2ns_read_end();
> >
> > This needs a big comment explaining why this math will be stable
> > after JIT and for the lifetime of the prog.
>
> It's more or less the same comment as for the JIT of
> bpf_get_cpu_time_counter(). I'll add it.
>
>
> >> + /* imul RAX, RDI, mult */
> >> + maybe_emit_mod(&prog, BPF_REG_1, BPF_REG_0, true);
> >> + EMIT2_off32(0x69, add_2reg(0xC0, BPF_REG_1, BPF_REG_0),
> >> + mult);
> >> +
> >> + /* shr RAX, shift (which is less than 64) */
> >> + maybe_emit_1mod(&prog, BPF_REG_0, true);
> >> + EMIT3(0xC1, add_1reg(0xE8, BPF_REG_0), shift);
> >> +
> >> + break;
> >> + }
> >> +
> >> func = (u8 *) __bpf_call_base + imm32;
> >> if (src_reg == BPF_PSEUDO_CALL && tail_call_reachable) {
> >> LOAD_TAIL_CALL_CNT_PTR(stack_depth);
> >> @@ -3906,7 +3931,8 @@ bool bpf_jit_inlines_kfunc_call(s32 imm)
> >> {
> >> if (!IS_ENABLED(CONFIG_BPF_SYSCALL))
> >> return false;
> >> - if (imm == BPF_CALL_IMM(bpf_get_cpu_time_counter) &&
> >> + if ((imm == BPF_CALL_IMM(bpf_get_cpu_time_counter) ||
> >> + imm == BPF_CALL_IMM(bpf_cpu_time_counter_to_ns)) &&
> >> cpu_feature_enabled(X86_FEATURE_TSC) &&
> >> using_native_sched_clock() && sched_clock_stable())
> >> return true;
> >> diff --git a/arch/x86/net/bpf_jit_comp32.c b/arch/x86/net/bpf_jit_comp32.c
> >> index 7f13509c66db..9791a3fb9d69 100644
> >> --- a/arch/x86/net/bpf_jit_comp32.c
> >> +++ b/arch/x86/net/bpf_jit_comp32.c
> >> @@ -12,6 +12,7 @@
> >> #include <linux/netdevice.h>
> >> #include <linux/filter.h>
> >> #include <linux/if_vlan.h>
> >> +#include <linux/clocksource.h>
> >> #include <asm/cacheflush.h>
> >> #include <asm/set_memory.h>
> >> #include <asm/nospec-branch.h>
> >> @@ -2115,6 +2116,29 @@ static int do_jit(struct bpf_prog *bpf_prog, int *addrs, u8 *image,
> >> EMIT2(0x0F, 0x31);
> >> break;
> >> }
> >> + if (IS_ENABLED(CONFIG_BPF_SYSCALL) &&
> >> + imm32 == BPF_CALL_IMM(bpf_cpu_time_counter_to_ns) &&
> >> + cpu_feature_enabled(X86_FEATURE_TSC) &&
> >> + using_native_sched_clock() && sched_clock_stable()) {
> >> + struct cyc2ns_data data;
> >> + u32 mult, shift;
> >> +
> >> + cyc2ns_read_begin(&data);
> >> + mult = data.cyc2ns_mul;
> >> + shift = data.cyc2ns_shift;
> >> + cyc2ns_read_end();
> >
> > same here.
> >
> >> +
> >> + /* move parameter to BPF_REG_0 */
> >> + emit_ia32_mov_r64(true, bpf2ia32[BPF_REG_0],
> >> + bpf2ia32[BPF_REG_1], true, true,
> >> + &prog, bpf_prog->aux);
> >> + /* multiply parameter by mult */
> >> + emit_ia32_mul_i64(bpf2ia32[BPF_REG_0],
> >> + mult, true, &prog);
> >
> > How did you test this?
> > It's far from obvious that this will match what mul_u64_u32_shr() does.
> > And on a quick look I really doubt.
>
> Well, I can re-write it op-by-op from mul_u64_u32_shr(), but it's more
> or less the same given that mult and shift are not too big, which is
> common for TSC coefficients.
>
> >
> > The trouble of adding support for 32-bit JIT doesn't seem worth it.
>
> Do you mean it's better to drop this JIT implementation?
yes.
The additional complexity doesn't look justified.
^ permalink raw reply [flat|nested] 13+ messages in thread
* [PATCH bpf-next v11 3/4] selftests/bpf: add selftest to check bpf_get_cpu_time_counter jit
2025-03-17 22:49 [PATCH bpf-next v11 0/4] bpf: add cpu time counter kfuncs Vadim Fedorenko
2025-03-17 22:49 ` [PATCH bpf-next v11 1/4] bpf: add bpf_get_cpu_time_counter kfunc Vadim Fedorenko
2025-03-17 22:49 ` [PATCH bpf-next v11 2/4] bpf: add bpf_cpu_time_counter_to_ns helper Vadim Fedorenko
@ 2025-03-17 22:49 ` Vadim Fedorenko
2025-03-18 0:30 ` Alexei Starovoitov
2025-03-17 22:49 ` [PATCH bpf-next v11 4/4] selftests/bpf: add usage example for cpu time counter kfuncs Vadim Fedorenko
3 siblings, 1 reply; 13+ messages in thread
From: Vadim Fedorenko @ 2025-03-17 22:49 UTC (permalink / raw)
To: Borislav Petkov, Alexei Starovoitov, Daniel Borkmann,
Andrii Nakryiko, Eduard Zingerman, Thomas Gleixner, Yonghong Song,
Vadim Fedorenko, Mykola Lysenko
Cc: x86, bpf, Peter Zijlstra, Vadim Fedorenko, Martin KaFai Lau
bpf_get_cpu_time_counter() is replaced with rdtsc instruction on x86_64.
Add tests to check that JIT works as expected.
Acked-by: Eduard Zingerman <eddyz87@gmail.com>
Signed-off-by: Vadim Fedorenko <vadfed@meta.com>
---
.../selftests/bpf/prog_tests/verifier.c | 2 +
.../selftests/bpf/progs/verifier_cpu_cycles.c | 104 ++++++++++++++++++
2 files changed, 106 insertions(+)
create mode 100644 tools/testing/selftests/bpf/progs/verifier_cpu_cycles.c
diff --git a/tools/testing/selftests/bpf/prog_tests/verifier.c b/tools/testing/selftests/bpf/prog_tests/verifier.c
index e66a57970d28..d5e7e302a344 100644
--- a/tools/testing/selftests/bpf/prog_tests/verifier.c
+++ b/tools/testing/selftests/bpf/prog_tests/verifier.c
@@ -102,6 +102,7 @@
#include "verifier_xdp_direct_packet_access.skel.h"
#include "verifier_bits_iter.skel.h"
#include "verifier_lsm.skel.h"
+#include "verifier_cpu_cycles.skel.h"
#include "irq.skel.h"
#define MAX_ENTRIES 11
@@ -236,6 +237,7 @@ void test_verifier_bits_iter(void) { RUN(verifier_bits_iter); }
void test_verifier_lsm(void) { RUN(verifier_lsm); }
void test_irq(void) { RUN(irq); }
void test_verifier_mtu(void) { RUN(verifier_mtu); }
+void test_verifier_cpu_cycles(void) { RUN(verifier_cpu_cycles); }
static int init_test_val_map(struct bpf_object *obj, char *map_name)
{
diff --git a/tools/testing/selftests/bpf/progs/verifier_cpu_cycles.c b/tools/testing/selftests/bpf/progs/verifier_cpu_cycles.c
new file mode 100644
index 000000000000..5b62e3690362
--- /dev/null
+++ b/tools/testing/selftests/bpf/progs/verifier_cpu_cycles.c
@@ -0,0 +1,104 @@
+// SPDX-License-Identifier: GPL-2.0
+/* Copyright (c) 2022 Meta Inc. */
+#include "vmlinux.h"
+#include <bpf/bpf_helpers.h>
+#include <bpf/bpf_tracing.h>
+#include "bpf_misc.h"
+
+extern u64 bpf_cpu_time_counter_to_ns(u64 cycles) __weak __ksym;
+extern u64 bpf_get_cpu_time_counter(void) __weak __ksym;
+
+SEC("syscall")
+__arch_x86_64
+__xlated("0: call kernel-function")
+__naked int bpf_rdtsc(void)
+{
+ asm volatile(
+ "call %[bpf_get_cpu_time_counter];"
+ "exit"
+ :
+ : __imm(bpf_get_cpu_time_counter)
+ : __clobber_all
+ );
+}
+
+SEC("syscall")
+__arch_x86_64
+/* program entry for bpf_rdtsc_jit_x86_64(), regular function prologue */
+__jited(" endbr64")
+__jited(" nopl (%rax,%rax)")
+__jited(" nopl (%rax)")
+__jited(" pushq %rbp")
+__jited(" movq %rsp, %rbp")
+__jited(" endbr64")
+/* save RDX in R11 as it will be overwritten */
+__jited(" movq %rdx, %r11")
+/* lfence may not be executed depending on cpu features */
+__jited(" {{(lfence|)}}")
+__jited(" rdtsc")
+/* combine EDX:EAX into RAX */
+__jited(" shlq ${{(32|0x20)}}, %rdx")
+__jited(" orq %rdx, %rax")
+/* restore RDX from R11 */
+__jited(" movq %r11, %rdx")
+__jited(" leave")
+__naked int bpf_rdtsc_jit_x86_64(void)
+{
+ asm volatile(
+ "call %[bpf_get_cpu_time_counter];"
+ "exit"
+ :
+ : __imm(bpf_get_cpu_time_counter)
+ : __clobber_all
+ );
+}
+
+SEC("syscall")
+__arch_x86_64
+__xlated("0: r1 = 42")
+__xlated("1: call kernel-function")
+__naked int bpf_cyc2ns(void)
+{
+ asm volatile(
+ "r1=0x2a;"
+ "call %[bpf_cpu_time_counter_to_ns];"
+ "exit"
+ :
+ : __imm(bpf_cpu_time_counter_to_ns)
+ : __clobber_all
+ );
+}
+
+SEC("syscall")
+__arch_x86_64
+/* program entry for bpf_cyc2ns_jit_x86(), regular function prologue */
+__jited(" endbr64")
+__jited(" nopl (%rax,%rax)")
+__jited(" nopl (%rax)")
+__jited(" pushq %rbp")
+__jited(" movq %rsp, %rbp")
+__jited(" endbr64")
+/* load the 64-bit immediate argument into RDI */
+__jited(" movabsq $0x2a2a2a2a2a, %rdi")
+__jited(" imulq ${{.*}}, %rdi, %rax")
+__jited(" shrq ${{.*}}, %rax")
+__jited(" leave")
+__naked int bpf_cyc2ns_jit_x86(void)
+{
+ asm volatile(
+ "r1=0x2a2a2a2a2a ll;"
+ "call %[bpf_cpu_time_counter_to_ns];"
+ "exit"
+ :
+ : __imm(bpf_cpu_time_counter_to_ns)
+ : __clobber_all
+ );
+}
+
+void rdtsc(void)
+{
+ bpf_get_cpu_time_counter();
+ bpf_cpu_time_counter_to_ns(42);
+}
+
+char _license[] SEC("license") = "GPL";
--
2.47.1
^ permalink raw reply related [flat|nested] 13+ messages in thread
* Re: [PATCH bpf-next v11 3/4] selftests/bpf: add selftest to check bpf_get_cpu_time_counter jit
2025-03-17 22:49 ` [PATCH bpf-next v11 3/4] selftests/bpf: add selftest to check bpf_get_cpu_time_counter jit Vadim Fedorenko
@ 2025-03-18 0:30 ` Alexei Starovoitov
2025-03-18 8:45 ` Vadim Fedorenko
0 siblings, 1 reply; 13+ messages in thread
From: Alexei Starovoitov @ 2025-03-18 0:30 UTC (permalink / raw)
To: Vadim Fedorenko
Cc: Borislav Petkov, Alexei Starovoitov, Daniel Borkmann,
Andrii Nakryiko, Eduard Zingerman, Thomas Gleixner, Yonghong Song,
Vadim Fedorenko, Mykola Lysenko, X86 ML, bpf, Peter Zijlstra,
Martin KaFai Lau
On Mon, Mar 17, 2025 at 3:50 PM Vadim Fedorenko <vadfed@meta.com> wrote:
>
> bpf_get_cpu_time_counter() is replaced with rdtsc instruction on x86_64.
> Add tests to check that JIT works as expected.
>
> Acked-by: Eduard Zingerman <eddyz87@gmail.com>
> Signed-off-by: Vadim Fedorenko <vadfed@meta.com>
> ---
> .../selftests/bpf/prog_tests/verifier.c | 2 +
> .../selftests/bpf/progs/verifier_cpu_cycles.c | 104 ++++++++++++++++++
> 2 files changed, 106 insertions(+)
> create mode 100644 tools/testing/selftests/bpf/progs/verifier_cpu_cycles.c
>
> diff --git a/tools/testing/selftests/bpf/prog_tests/verifier.c b/tools/testing/selftests/bpf/prog_tests/verifier.c
> index e66a57970d28..d5e7e302a344 100644
> --- a/tools/testing/selftests/bpf/prog_tests/verifier.c
> +++ b/tools/testing/selftests/bpf/prog_tests/verifier.c
> @@ -102,6 +102,7 @@
> #include "verifier_xdp_direct_packet_access.skel.h"
> #include "verifier_bits_iter.skel.h"
> #include "verifier_lsm.skel.h"
> +#include "verifier_cpu_cycles.skel.h"
> #include "irq.skel.h"
>
> #define MAX_ENTRIES 11
> @@ -236,6 +237,7 @@ void test_verifier_bits_iter(void) { RUN(verifier_bits_iter); }
> void test_verifier_lsm(void) { RUN(verifier_lsm); }
> void test_irq(void) { RUN(irq); }
> void test_verifier_mtu(void) { RUN(verifier_mtu); }
> +void test_verifier_cpu_cycles(void) { RUN(verifier_cpu_cycles); }
>
> static int init_test_val_map(struct bpf_object *obj, char *map_name)
> {
> diff --git a/tools/testing/selftests/bpf/progs/verifier_cpu_cycles.c b/tools/testing/selftests/bpf/progs/verifier_cpu_cycles.c
> new file mode 100644
> index 000000000000..5b62e3690362
> --- /dev/null
> +++ b/tools/testing/selftests/bpf/progs/verifier_cpu_cycles.c
> @@ -0,0 +1,104 @@
> +// SPDX-License-Identifier: GPL-2.0
> +/* Copyright (c) 2022 Meta Inc. */
botched copy paste.
^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: [PATCH bpf-next v11 3/4] selftests/bpf: add selftest to check bpf_get_cpu_time_counter jit
2025-03-18 0:30 ` Alexei Starovoitov
@ 2025-03-18 8:45 ` Vadim Fedorenko
0 siblings, 0 replies; 13+ messages in thread
From: Vadim Fedorenko @ 2025-03-18 8:45 UTC (permalink / raw)
To: Alexei Starovoitov, Vadim Fedorenko
Cc: Borislav Petkov, Alexei Starovoitov, Daniel Borkmann,
Andrii Nakryiko, Eduard Zingerman, Thomas Gleixner, Yonghong Song,
Mykola Lysenko, X86 ML, bpf, Peter Zijlstra, Martin KaFai Lau
On 18/03/2025 00:30, Alexei Starovoitov wrote:
> On Mon, Mar 17, 2025 at 3:50 PM Vadim Fedorenko <vadfed@meta.com> wrote:
>>
>> bpf_get_cpu_time_counter() is replaced with rdtsc instruction on x86_64.
>> Add tests to check that JIT works as expected.
>>
>> Acked-by: Eduard Zingerman <eddyz87@gmail.com>
>> Signed-off-by: Vadim Fedorenko <vadfed@meta.com>
>> ---
>> .../selftests/bpf/prog_tests/verifier.c | 2 +
>> .../selftests/bpf/progs/verifier_cpu_cycles.c | 104 ++++++++++++++++++
>> 2 files changed, 106 insertions(+)
>> create mode 100644 tools/testing/selftests/bpf/progs/verifier_cpu_cycles.c
>>
>> diff --git a/tools/testing/selftests/bpf/prog_tests/verifier.c b/tools/testing/selftests/bpf/prog_tests/verifier.c
>> index e66a57970d28..d5e7e302a344 100644
>> --- a/tools/testing/selftests/bpf/prog_tests/verifier.c
>> +++ b/tools/testing/selftests/bpf/prog_tests/verifier.c
>> @@ -102,6 +102,7 @@
>> #include "verifier_xdp_direct_packet_access.skel.h"
>> #include "verifier_bits_iter.skel.h"
>> #include "verifier_lsm.skel.h"
>> +#include "verifier_cpu_cycles.skel.h"
>> #include "irq.skel.h"
>>
>> #define MAX_ENTRIES 11
>> @@ -236,6 +237,7 @@ void test_verifier_bits_iter(void) { RUN(verifier_bits_iter); }
>> void test_verifier_lsm(void) { RUN(verifier_lsm); }
>> void test_irq(void) { RUN(irq); }
>> void test_verifier_mtu(void) { RUN(verifier_mtu); }
>> +void test_verifier_cpu_cycles(void) { RUN(verifier_cpu_cycles); }
>>
>> static int init_test_val_map(struct bpf_object *obj, char *map_name)
>> {
>> diff --git a/tools/testing/selftests/bpf/progs/verifier_cpu_cycles.c b/tools/testing/selftests/bpf/progs/verifier_cpu_cycles.c
>> new file mode 100644
>> index 000000000000..5b62e3690362
>> --- /dev/null
>> +++ b/tools/testing/selftests/bpf/progs/verifier_cpu_cycles.c
>> @@ -0,0 +1,104 @@
>> +// SPDX-License-Identifier: GPL-2.0
>> +/* Copyright (c) 2022 Meta Inc. */
>
> botched copy paste.
Ouch.
^ permalink raw reply [flat|nested] 13+ messages in thread
* [PATCH bpf-next v11 4/4] selftests/bpf: add usage example for cpu time counter kfuncs
2025-03-17 22:49 [PATCH bpf-next v11 0/4] bpf: add cpu time counter kfuncs Vadim Fedorenko
` (2 preceding siblings ...)
2025-03-17 22:49 ` [PATCH bpf-next v11 3/4] selftests/bpf: add selftest to check bpf_get_cpu_time_counter jit Vadim Fedorenko
@ 2025-03-17 22:49 ` Vadim Fedorenko
3 siblings, 0 replies; 13+ messages in thread
From: Vadim Fedorenko @ 2025-03-17 22:49 UTC (permalink / raw)
To: Borislav Petkov, Alexei Starovoitov, Daniel Borkmann,
Andrii Nakryiko, Eduard Zingerman, Thomas Gleixner, Yonghong Song,
Vadim Fedorenko, Mykola Lysenko
Cc: x86, bpf, Peter Zijlstra, Vadim Fedorenko, Martin KaFai Lau
The selftest provides an example of how to measure the latency of a bpf
kfunc/helper call using the time stamp counter and how to convert the
measured value into nanoseconds.
Signed-off-by: Vadim Fedorenko <vadfed@meta.com>
---
.../bpf/prog_tests/test_cpu_cycles.c | 35 +++++++++++++++++++
.../selftests/bpf/progs/test_cpu_cycles.c | 25 +++++++++++++
2 files changed, 60 insertions(+)
create mode 100644 tools/testing/selftests/bpf/prog_tests/test_cpu_cycles.c
create mode 100644 tools/testing/selftests/bpf/progs/test_cpu_cycles.c
diff --git a/tools/testing/selftests/bpf/prog_tests/test_cpu_cycles.c b/tools/testing/selftests/bpf/prog_tests/test_cpu_cycles.c
new file mode 100644
index 000000000000..d7f3b66594b3
--- /dev/null
+++ b/tools/testing/selftests/bpf/prog_tests/test_cpu_cycles.c
@@ -0,0 +1,35 @@
+// SPDX-License-Identifier: GPL-2.0
+/* Copyright (c) 2024 Meta Inc. */
+
+#include <test_progs.h>
+#include "test_cpu_cycles.skel.h"
+
+static void cpu_cycles(void)
+{
+ LIBBPF_OPTS(bpf_test_run_opts, opts);
+ struct test_cpu_cycles *skel;
+ int err, pfd;
+
+ skel = test_cpu_cycles__open_and_load();
+ if (!ASSERT_OK_PTR(skel, "test_cpu_cycles open and load"))
+ return;
+
+ pfd = bpf_program__fd(skel->progs.bpf_cpu_cycles);
+ if (!ASSERT_GT(pfd, 0, "test_cpu_cycles fd"))
+ goto fail;
+
+ err = bpf_prog_test_run_opts(pfd, &opts);
+ if (!ASSERT_OK(err, "test_cpu_cycles test run"))
+ goto fail;
+
+ ASSERT_NEQ(skel->bss->cycles, 0, "test_cpu_cycles 0 cycles");
+ ASSERT_NEQ(skel->bss->ns, 0, "test_cpu_cycles 0 ns");
+fail:
+ test_cpu_cycles__destroy(skel);
+}
+
+void test_cpu_cycles(void)
+{
+ if (test__start_subtest("cpu_cycles"))
+ cpu_cycles();
+}
diff --git a/tools/testing/selftests/bpf/progs/test_cpu_cycles.c b/tools/testing/selftests/bpf/progs/test_cpu_cycles.c
new file mode 100644
index 000000000000..a7f8a4c6b854
--- /dev/null
+++ b/tools/testing/selftests/bpf/progs/test_cpu_cycles.c
@@ -0,0 +1,25 @@
+// SPDX-License-Identifier: GPL-2.0
+/* Copyright (c) 2024 Meta Inc. */
+
+#include "vmlinux.h"
+#include <bpf/bpf_helpers.h>
+
+extern u64 bpf_cpu_time_counter_to_ns(u64 cycles) __weak __ksym;
+extern u64 bpf_get_cpu_time_counter(void) __weak __ksym;
+
+__u64 cycles, ns;
+
+SEC("syscall")
+int bpf_cpu_cycles(void)
+{
+ struct bpf_pidns_info pidns;
+ __u64 start;
+
+ start = bpf_get_cpu_time_counter();
+ bpf_get_ns_current_pid_tgid(0, 0, &pidns, sizeof(struct bpf_pidns_info));
+ cycles = bpf_get_cpu_time_counter() - start;
+ ns = bpf_cpu_time_counter_to_ns(cycles);
+ return 0;
+}
+
+char _license[] SEC("license") = "GPL";
--
2.47.1
^ permalink raw reply related [flat|nested] 13+ messages in thread