* [PATCH bpf-next v9 0/4] bpf: add cpu time counter kfuncs
@ 2024-11-23  0:58 Vadim Fedorenko
  2024-11-23  0:58 ` [PATCH bpf-next v9 1/4] bpf: add bpf_get_cpu_time_counter kfunc Vadim Fedorenko
                   ` (3 more replies)
  0 siblings, 4 replies; 10+ messages in thread
From: Vadim Fedorenko @ 2024-11-23  0:58 UTC (permalink / raw)
  To: Borislav Petkov, Alexei Starovoitov, Daniel Borkmann,
	Andrii Nakryiko, Eduard Zingerman, Thomas Gleixner, Yonghong Song,
	Vadim Fedorenko, Mykola Lysenko
  Cc: x86, bpf, Peter Zijlstra, Vadim Fedorenko, Martin KaFai Lau

This patchset adds 2 kfuncs to provide a way to precisely measure the
time spent running some code. The first patch provides a way to get the
CPU cycle counter which is used to feed CLOCK_MONOTONIC_RAW. On the x86
architecture it is effectively the rdtsc_ordered() function, while on
other architectures it falls back to __arch_get_hw_counter(). The
second patch adds a kfunc to convert CPU cycles to nanoseconds using
the shift/mult constants discovered by the kernel. The main use-case
for this kfunc is to convert deltas of timestamp counter values into
nanoseconds. It is not supposed to produce CLOCK_MONOTONIC_RAW values
as the offset part is skipped. The JIT version is implemented for x86
for now; other architectures fall back to a slightly simplified version
of vdso_calc_ns().

The reason to have these functions is to avoid the overhead added by a
bpf_ktime_get_ns() call when benchmarking, where two timestamps are
taken to compute a delta value. With both functions JITed, the overhead
is minimal and the result has better precision. The new functions can
be used to benchmark BPF code directly in the program, or in a
kprobe/uprobe to store the timestamp counter in the session cookie so
that the delta can be calculated in the kretprobe/uretprobe and
converted into nanoseconds.
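
As a minimal sketch of the in-program benchmarking pattern (this
mirrors the selftest added in patch 4; it assumes vmlinux.h plus
<bpf/bpf_helpers.h>, and the bpf_printk() is purely illustrative):

extern u64 bpf_get_cpu_time_counter(void) __weak __ksym;
extern u64 bpf_cpu_time_counter_to_ns(u64 cycles) __weak __ksym;

SEC("syscall")
int bench(void *ctx)
{
	u64 start, delta;

	start = bpf_get_cpu_time_counter();
	/* ... code under benchmark ... */
	delta = bpf_get_cpu_time_counter() - start;
	bpf_printk("took %llu ns", bpf_cpu_time_counter_to_ns(delta));
	return 0;
}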

Selftests are also added to check whether the JIT implementation is
correct and to show the simplest usage example.

Change log:
v8 -> v9:
* rewording of commit messages, no code changes
* move change log from each patch into cover letter
v7 -> v8:
* rename kfuncs again to bpf_get_cpu_time_counter() and
  bpf_cpu_time_counter_to_ns()
* use cyc2ns_read_begin()/cyc2ns_read_end() to get mult and shift
  constants in bpf_cpu_time_counter_to_ns()
v6 -> v7:
* change boot_cpu_has() to cpu_feature_enabled() (Borislav)
* return constant clock_mode in __arch_get_hw_counter() call
v5 -> v6:
* added cover letter
* add comment about dropping S64_MAX manipulation in jitted
  implementation of rdtsc_ordered() (Alexei)
* add comment about using 'lfence;rdtsc' variant (Alexei)
* change the check in fixup_kfunc_call() (Eduard)
* make __arch_get_hw_counter() call more aligned with vDSO
  implementation (Yonghong)
v4 -> v5:
* use #if instead of #ifdef with IS_ENABLED
v3 -> v4:
* change name of the helper to bpf_get_cpu_cycles (Andrii)
* Hide the helper behind CONFIG_GENERIC_GETTIMEOFDAY to avoid exposing
  it on architectures which do not have vDSO functions and data
* reduce the scope of the check for inlined functions in the verifier
  to only the 2 which are actually inlined.
* change helper name to bpf_cpu_cycles_to_ns.
* hide it behind CONFIG_GENERIC_GETTIMEOFDAY to avoid exposing on
  unsupported architectures.
v2 -> v3:
* change name of the helper to bpf_get_cpu_cycles_counter to
  explicitly mention what counter it provides (Andrii)
* move kfunc definition to bpf.h to use it in JIT.
* introduce another kfunc to convert cycles into nanoseconds as
  more meaningful time units for generic tracing use case (Andrii)
v1 -> v2:
* Fix incorrect function return value type to u64
* Introduce bpf_jit_inlines_kfunc_call() and use it in
	mark_fastcall_pattern_for_call() to avoid clobbering in case
	of running programs with no JIT (Eduard)
* Avoid rewriting instruction and check function pointer directly
	in JIT (Alexei)
* Change includes to fix compile issues on non x86 architectures

Vadim Fedorenko (4):
  bpf: add bpf_get_cpu_time_counter kfunc
  bpf: add bpf_cpu_time_counter_to_ns helper
  selftests/bpf: add selftest to check bpf_get_cpu_time_counter jit
  selftests/bpf: add usage example for cpu time counter kfuncs

 arch/x86/net/bpf_jit_comp.c                   |  66 +++++++++++
 arch/x86/net/bpf_jit_comp32.c                 |  41 +++++++
 include/linux/bpf.h                           |   6 +
 include/linux/filter.h                        |   1 +
 kernel/bpf/core.c                             |  11 ++
 kernel/bpf/helpers.c                          |  39 +++++++
 kernel/bpf/verifier.c                         |  41 ++++++-
 .../bpf/prog_tests/test_cpu_cycles.c          |  35 ++++++
 .../selftests/bpf/prog_tests/verifier.c       |   2 +
 .../selftests/bpf/progs/test_cpu_cycles.c     |  25 +++++
 .../selftests/bpf/progs/verifier_cpu_cycles.c | 104 ++++++++++++++++++
 11 files changed, 365 insertions(+), 6 deletions(-)
 create mode 100644 tools/testing/selftests/bpf/prog_tests/test_cpu_cycles.c
 create mode 100644 tools/testing/selftests/bpf/progs/test_cpu_cycles.c
 create mode 100644 tools/testing/selftests/bpf/progs/verifier_cpu_cycles.c

-- 
2.43.5

* [PATCH bpf-next v9 1/4] bpf: add bpf_get_cpu_time_counter kfunc
  2024-11-23  0:58 [PATCH bpf-next v9 0/4] bpf: add cpu time counter kfuncs Vadim Fedorenko
@ 2024-11-23  0:58 ` Vadim Fedorenko
  2024-11-27 12:03   ` kernel test robot
  2024-12-01 12:46   ` Thomas Gleixner
  2024-11-23  0:58 ` [PATCH bpf-next v9 2/4] bpf: add bpf_cpu_time_counter_to_ns helper Vadim Fedorenko
                   ` (2 subsequent siblings)
  3 siblings, 2 replies; 10+ messages in thread
From: Vadim Fedorenko @ 2024-11-23  0:58 UTC (permalink / raw)
  To: Borislav Petkov, Alexei Starovoitov, Daniel Borkmann,
	Andrii Nakryiko, Eduard Zingerman, Thomas Gleixner, Yonghong Song,
	Vadim Fedorenko, Mykola Lysenko
  Cc: x86, bpf, Peter Zijlstra, Vadim Fedorenko, Martin KaFai Lau

Add a new kfunc to return the arch-specific timecounter. The main
reason to implement this kfunc is to avoid the extra overhead of
benchmark measurements, which are usually done with a pair of
bpf_ktime_get_ns() calls at the beginning and at the end of the code
block under benchmark. This function doesn't implement the conversion
to the monotonic clock and saves some CPU cycles because of it. Another
point is that, when properly JITed, it saves hundreds of extra cycles
by obtaining the timecounter value in a single-digit number of
instructions. The delta values can be translated into nanoseconds using
the kfunc introduced in the next patch. On x86 the BPF JIT converts
this kfunc into an ordered rdtsc call. Other architectures will get a
JIT implementation too where supported. The fallback is to call
__arch_get_hw_counter().

The simplest use-case is added in the 4th patch, where we calculate the
time spent by the bpf_get_ns_current_pid_tgid() helper. A more complex
example is to use the session cookie to store the timecounter value at
kprobe/uprobe time using kprobe.session/uprobe.session, and calculate
the difference at kretprobe/uretprobe.
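
A hedged sketch of that session pattern (the target function is
hypothetical, and it assumes the existing bpf_session_cookie() /
bpf_session_is_return() kfuncs):

extern u64 bpf_get_cpu_time_counter(void) __weak __ksym;
extern u64 bpf_cpu_time_counter_to_ns(u64 cycles) __weak __ksym;
extern __u64 *bpf_session_cookie(void) __weak __ksym;
extern bool bpf_session_is_return(void) __weak __ksym;

SEC("kprobe.session/do_sys_openat2")	/* hypothetical target */
int measure(struct pt_regs *ctx)
{
	__u64 *cookie = bpf_session_cookie();

	if (!cookie)
		return 0;
	if (!bpf_session_is_return()) {
		/* entry: store the timestamp counter in the cookie */
		*cookie = bpf_get_cpu_time_counter();
	} else {
		u64 delta = bpf_get_cpu_time_counter() - *cookie;

		bpf_printk("latency: %llu ns",
			   bpf_cpu_time_counter_to_ns(delta));
	}
	return 0;
}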

Acked-by: Eduard Zingerman <eddyz87@gmail.com>
Acked-by: Andrii Nakryiko <andrii@kernel.org>
Acked-by: Yonghong Song <yonghong.song@linux.dev>
Signed-off-by: Vadim Fedorenko <vadfed@meta.com>
---
 arch/x86/net/bpf_jit_comp.c   | 39 +++++++++++++++++++++++++++++++++
 arch/x86/net/bpf_jit_comp32.c | 14 ++++++++++++
 include/linux/bpf.h           |  5 +++++
 include/linux/filter.h        |  1 +
 kernel/bpf/core.c             | 11 ++++++++++
 kernel/bpf/helpers.c          | 27 +++++++++++++++++++++++
 kernel/bpf/verifier.c         | 41 ++++++++++++++++++++++++++++++-----
 7 files changed, 132 insertions(+), 6 deletions(-)

diff --git a/arch/x86/net/bpf_jit_comp.c b/arch/x86/net/bpf_jit_comp.c
index a43fc5af973d..92431ab1a21e 100644
--- a/arch/x86/net/bpf_jit_comp.c
+++ b/arch/x86/net/bpf_jit_comp.c
@@ -2185,6 +2185,37 @@ st:			if (is_imm8(insn->off))
 		case BPF_JMP | BPF_CALL: {
 			u8 *ip = image + addrs[i - 1];
 
+			if (insn->src_reg == BPF_PSEUDO_KFUNC_CALL &&
+			    imm32 == BPF_CALL_IMM(bpf_get_cpu_time_counter)) {
+				/* The default implementation of this kfunc uses
+				 * __arch_get_hw_counter() which is implemented as
+				 * `(u64)rdtsc_ordered() & S64_MAX`. We skip the
+				 * masking part because we assume it's not needed
+				 * in the BPF use case (two measurements close in time).
+				 * Original code for rdtsc_ordered() uses sequence:
+				 * 'rdtsc; nop; nop; nop' to patch it into
+				 * 'lfence; rdtsc' or 'rdtscp' depending on CPU features.
+				 * JIT uses 'lfence; rdtsc' variant because BPF program
+				 * doesn't care about cookie provided by rdtscp in RCX.
+				 * Save RDX because RDTSC will use EDX:EAX to return u64
+				 */
+				emit_mov_reg(&prog, true, AUX_REG, BPF_REG_3);
+				if (cpu_feature_enabled(X86_FEATURE_LFENCE_RDTSC))
+					EMIT_LFENCE();
+				EMIT2(0x0F, 0x31);
+
+				/* shl RDX, 32 */
+				maybe_emit_1mod(&prog, BPF_REG_3, true);
+				EMIT3(0xC1, add_1reg(0xE0, BPF_REG_3), 32);
+				/* or RAX, RDX */
+				maybe_emit_mod(&prog, BPF_REG_0, BPF_REG_3, true);
+				EMIT2(0x09, add_2reg(0xC0, BPF_REG_0, BPF_REG_3));
+				/* restore RDX from R11 */
+				emit_mov_reg(&prog, true, BPF_REG_3, AUX_REG);
+
+				break;
+			}
+
 			func = (u8 *) __bpf_call_base + imm32;
 			if (src_reg == BPF_PSEUDO_CALL && tail_call_reachable) {
 				LOAD_TAIL_CALL_CNT_PTR(stack_depth);
@@ -3791,3 +3822,11 @@ u64 bpf_arch_uaddress_limit(void)
 {
 	return 0;
 }
+
+/* x86-64 JIT can inline kfunc */
+bool bpf_jit_inlines_kfunc_call(s32 imm)
+{
+	if (imm == BPF_CALL_IMM(bpf_get_cpu_time_counter))
+		return true;
+	return false;
+}
diff --git a/arch/x86/net/bpf_jit_comp32.c b/arch/x86/net/bpf_jit_comp32.c
index de0f9e5f9f73..a549aea25f5f 100644
--- a/arch/x86/net/bpf_jit_comp32.c
+++ b/arch/x86/net/bpf_jit_comp32.c
@@ -2094,6 +2094,13 @@ static int do_jit(struct bpf_prog *bpf_prog, int *addrs, u8 *image,
 			if (insn->src_reg == BPF_PSEUDO_KFUNC_CALL) {
 				int err;
 
+				if (imm32 == BPF_CALL_IMM(bpf_get_cpu_time_counter)) {
+					if (cpu_feature_enabled(X86_FEATURE_LFENCE_RDTSC))
+						EMIT3(0x0F, 0xAE, 0xE8);
+					EMIT2(0x0F, 0x31);
+					break;
+				}
+
 				err = emit_kfunc_call(bpf_prog,
 						      image + addrs[i],
 						      insn, &prog);
@@ -2621,3 +2628,10 @@ bool bpf_jit_supports_kfunc_call(void)
 {
 	return true;
 }
+
+bool bpf_jit_inlines_kfunc_call(s32 imm)
+{
+	if (imm == BPF_CALL_IMM(bpf_get_cpu_time_counter))
+		return true;
+	return false;
+}
diff --git a/include/linux/bpf.h b/include/linux/bpf.h
index 3ace0d6227e3..6d540253cfb4 100644
--- a/include/linux/bpf.h
+++ b/include/linux/bpf.h
@@ -3333,6 +3333,11 @@ void bpf_user_rnd_init_once(void);
 u64 bpf_user_rnd_u32(u64 r1, u64 r2, u64 r3, u64 r4, u64 r5);
 u64 bpf_get_raw_cpu_id(u64 r1, u64 r2, u64 r3, u64 r4, u64 r5);
 
+/* Inlined kfuncs */
+#if IS_ENABLED(CONFIG_GENERIC_GETTIMEOFDAY)
+u64 bpf_get_cpu_time_counter(void);
+#endif
+
 #if defined(CONFIG_NET)
 bool bpf_sock_common_is_valid_access(int off, int size,
 				     enum bpf_access_type type,
diff --git a/include/linux/filter.h b/include/linux/filter.h
index 3a21947f2fd4..9cf57233874f 100644
--- a/include/linux/filter.h
+++ b/include/linux/filter.h
@@ -1111,6 +1111,7 @@ struct bpf_prog *bpf_int_jit_compile(struct bpf_prog *prog);
 void bpf_jit_compile(struct bpf_prog *prog);
 bool bpf_jit_needs_zext(void);
 bool bpf_jit_inlines_helper_call(s32 imm);
+bool bpf_jit_inlines_kfunc_call(s32 imm);
 bool bpf_jit_supports_subprog_tailcalls(void);
 bool bpf_jit_supports_percpu_insn(void);
 bool bpf_jit_supports_kfunc_call(void);
diff --git a/kernel/bpf/core.c b/kernel/bpf/core.c
index a2327c4fdc8b..4f79a6ed1d3a 100644
--- a/kernel/bpf/core.c
+++ b/kernel/bpf/core.c
@@ -2965,6 +2965,17 @@ bool __weak bpf_jit_inlines_helper_call(s32 imm)
 	return false;
 }
 
+/* Return true if the JIT inlines the call to the kfunc corresponding to
+ * the imm.
+ *
+ * The verifier will not patch the insn->imm for the call to the kfunc if
+ * this returns true.
+ */
+bool __weak bpf_jit_inlines_kfunc_call(s32 imm)
+{
+	return false;
+}
+
 /* Return TRUE if the JIT backend supports mixing bpf2bpf and tailcalls. */
 bool __weak bpf_jit_supports_subprog_tailcalls(void)
 {
diff --git a/kernel/bpf/helpers.c b/kernel/bpf/helpers.c
index 751c150f9e1c..23f1a1606f8b 100644
--- a/kernel/bpf/helpers.c
+++ b/kernel/bpf/helpers.c
@@ -23,6 +23,10 @@
 #include <linux/btf_ids.h>
 #include <linux/bpf_mem_alloc.h>
 #include <linux/kasan.h>
+#if IS_ENABLED(CONFIG_GENERIC_GETTIMEOFDAY)
+#include <vdso/datapage.h>
+#include <asm/vdso/vsyscall.h>
+#endif
 
 #include "../../lib/kstrtox.h"
 
@@ -3057,6 +3061,26 @@ __bpf_kfunc int bpf_copy_from_user_str(void *dst, u32 dst__sz, const void __user
 	return ret + 1;
 }
 
+#if IS_ENABLED(CONFIG_GENERIC_GETTIMEOFDAY)
+__bpf_kfunc u64 bpf_get_cpu_time_counter(void)
+{
+	const struct vdso_data *vd = __arch_get_k_vdso_data();
+
+	vd = &vd[CS_RAW];
+
+	/* CS_RAW clock_mode translates to VDSO_CLOCKMODE_TSC on x86 and
+	 * to VDSO_CLOCKMODE_ARCHTIMER on aarch64/risc-v. We cannot use
+	 * vd->clock_mode directly because it brings possible access to
+	 * pages visible by user-space only via vDSO. But the constant value
+	 * of 1 is exactly what we need - it works for any architecture and
+	 * translates to reading of HW timecounter regardless of architecture.
+	 * We still have to provide vdso_data for some architectures to avoid
+	 * NULL pointer dereference.
+	 */
+	return __arch_get_hw_counter(1, vd);
+}
+#endif
+
 __bpf_kfunc_end_defs();
 
 BTF_KFUNCS_START(generic_btf_ids)
@@ -3149,6 +3173,9 @@ BTF_ID_FLAGS(func, bpf_get_kmem_cache)
 BTF_ID_FLAGS(func, bpf_iter_kmem_cache_new, KF_ITER_NEW | KF_SLEEPABLE)
 BTF_ID_FLAGS(func, bpf_iter_kmem_cache_next, KF_ITER_NEXT | KF_RET_NULL | KF_SLEEPABLE)
 BTF_ID_FLAGS(func, bpf_iter_kmem_cache_destroy, KF_ITER_DESTROY | KF_SLEEPABLE)
+#if IS_ENABLED(CONFIG_GENERIC_GETTIMEOFDAY)
+BTF_ID_FLAGS(func, bpf_get_cpu_time_counter, KF_FASTCALL)
+#endif
 BTF_KFUNCS_END(common_btf_ids)
 
 static const struct btf_kfunc_id_set common_kfunc_set = {
diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
index 1c4ebb326785..dbfad4457bef 100644
--- a/kernel/bpf/verifier.c
+++ b/kernel/bpf/verifier.c
@@ -16407,6 +16407,24 @@ static bool verifier_inlines_helper_call(struct bpf_verifier_env *env, s32 imm)
 	}
 }
 
+/* True if fixup_kfunc_call() replaces calls to kfunc number 'imm',
+ * replacement patch is presumed to follow bpf_fastcall contract
+ * (see mark_fastcall_pattern_for_call() below).
+ */
+static bool verifier_inlines_kfunc_call(struct bpf_verifier_env *env, s32 imm)
+{
+	const struct bpf_kfunc_desc *desc = find_kfunc_desc(env->prog, imm, 0);
+
+	if (!env->prog->jit_requested)
+		return false;
+
+	if (desc->func_id == special_kfunc_list[KF_bpf_cast_to_kern_ctx] ||
+	    desc->func_id == special_kfunc_list[KF_bpf_rdonly_cast])
+		return true;
+
+	return false;
+}
+
 /* Same as helper_fastcall_clobber_mask() but for kfuncs, see comment above */
 static u32 kfunc_fastcall_clobber_mask(struct bpf_kfunc_call_arg_meta *meta)
 {
@@ -16534,7 +16552,10 @@ static void mark_fastcall_pattern_for_call(struct bpf_verifier_env *env,
 			return;
 
 		clobbered_regs_mask = kfunc_fastcall_clobber_mask(&meta);
-		can_be_inlined = is_fastcall_kfunc_call(&meta);
+		can_be_inlined = is_fastcall_kfunc_call(&meta) &&
+				 (verifier_inlines_kfunc_call(env, call->imm) ||
+				 (meta.btf == btf_vmlinux &&
+				  bpf_jit_inlines_kfunc_call(call->imm)));
 	}
 
 	if (clobbered_regs_mask == ALL_CALLER_SAVED_REGS)
@@ -20541,6 +20562,7 @@ static int fixup_kfunc_call(struct bpf_verifier_env *env, struct bpf_insn *insn,
 			    struct bpf_insn *insn_buf, int insn_idx, int *cnt)
 {
 	const struct bpf_kfunc_desc *desc;
+	s32 imm = insn->imm;
 
 	if (!insn->imm) {
 		verbose(env, "invalid kernel function call not eliminated in verifier pass\n");
@@ -20564,7 +20586,18 @@ static int fixup_kfunc_call(struct bpf_verifier_env *env, struct bpf_insn *insn,
 		insn->imm = BPF_CALL_IMM(desc->addr);
 	if (insn->off)
 		return 0;
-	if (desc->func_id == special_kfunc_list[KF_bpf_obj_new_impl] ||
+	if (verifier_inlines_kfunc_call(env, imm)) {
+		if (desc->func_id == special_kfunc_list[KF_bpf_cast_to_kern_ctx] ||
+		    desc->func_id == special_kfunc_list[KF_bpf_rdonly_cast]) {
+			insn_buf[0] = BPF_MOV64_REG(BPF_REG_0, BPF_REG_1);
+			*cnt = 1;
+		} else {
+			verbose(env, "verifier internal error: kfunc id %d has no inline code\n",
+				desc->func_id);
+			return -EFAULT;
+		}
+
+	} else if (desc->func_id == special_kfunc_list[KF_bpf_obj_new_impl] ||
 	    desc->func_id == special_kfunc_list[KF_bpf_percpu_obj_new_impl]) {
 		struct btf_struct_meta *kptr_struct_meta = env->insn_aux_data[insn_idx].kptr_struct_meta;
 		struct bpf_insn addr[2] = { BPF_LD_IMM64(BPF_REG_2, (long)kptr_struct_meta) };
@@ -20625,10 +20658,6 @@ static int fixup_kfunc_call(struct bpf_verifier_env *env, struct bpf_insn *insn,
 
 		__fixup_collection_insert_kfunc(&env->insn_aux_data[insn_idx], struct_meta_reg,
 						node_offset_reg, insn, insn_buf, cnt);
-	} else if (desc->func_id == special_kfunc_list[KF_bpf_cast_to_kern_ctx] ||
-		   desc->func_id == special_kfunc_list[KF_bpf_rdonly_cast]) {
-		insn_buf[0] = BPF_MOV64_REG(BPF_REG_0, BPF_REG_1);
-		*cnt = 1;
 	} else if (is_bpf_wq_set_callback_impl_kfunc(desc->func_id)) {
 		struct bpf_insn ld_addrs[2] = { BPF_LD_IMM64(BPF_REG_4, (long)env->prog->aux) };
 
-- 
2.43.5


* [PATCH bpf-next v9 2/4] bpf: add bpf_cpu_time_counter_to_ns helper
  2024-11-23  0:58 [PATCH bpf-next v9 0/4] bpf: add cpu time counter kfuncs Vadim Fedorenko
  2024-11-23  0:58 ` [PATCH bpf-next v9 1/4] bpf: add bpf_get_cpu_time_counter kfunc Vadim Fedorenko
@ 2024-11-23  0:58 ` Vadim Fedorenko
  2024-11-27 19:07   ` kernel test robot
  2024-11-23  0:58 ` [PATCH bpf-next v9 3/4] selftests/bpf: add selftest to check bpf_get_cpu_time_counter jit Vadim Fedorenko
  2024-11-23  0:58 ` [PATCH bpf-next v9 4/4] selftests/bpf: add usage example for cpu time counter kfuncs Vadim Fedorenko
  3 siblings, 1 reply; 10+ messages in thread
From: Vadim Fedorenko @ 2024-11-23  0:58 UTC (permalink / raw)
  To: Borislav Petkov, Alexei Starovoitov, Daniel Borkmann,
	Andrii Nakryiko, Eduard Zingerman, Thomas Gleixner, Yonghong Song,
	Vadim Fedorenko, Mykola Lysenko
  Cc: x86, bpf, Peter Zijlstra, Vadim Fedorenko, Martin KaFai Lau

The new helper should be used to convert deltas of values received from
bpf_get_cpu_time_counter() into nanoseconds. It is not designed to do a
full conversion of time counter values to CLOCK_MONOTONIC_RAW
nanoseconds and cannot guarantee the monotonicity of 2 independent
values; rather, it converts the difference of 2 close enough values of
the CPU timestamp counter into nanoseconds.

This function is also JITed into just a few instructions, adds as
little overhead as possible, and perfectly suits benchmark use-cases.
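
For a hedged sense of the arithmetic (the constants below are
hypothetical, roughly what a 3 GHz TSC would produce):

	/* mult = 349525, shift = 20, i.e. ~1/3 ns per cycle */
	ns = (cycles * mult) >> shift;
	/* a delta of 3000 cycles: (3000 * 349525) >> 20 = 999 ns */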

Reviewed-by: Eduard Zingerman <eddyz87@gmail.com>
Acked-by: Andrii Nakryiko <andrii@kernel.org>
Signed-off-by: Vadim Fedorenko <vadfed@meta.com>
---
 arch/x86/net/bpf_jit_comp.c   | 27 +++++++++++++++++++++++++++
 arch/x86/net/bpf_jit_comp32.c | 27 +++++++++++++++++++++++++++
 include/linux/bpf.h           |  1 +
 kernel/bpf/helpers.c          | 14 +++++++++++++-
 4 files changed, 68 insertions(+), 1 deletion(-)

diff --git a/arch/x86/net/bpf_jit_comp.c b/arch/x86/net/bpf_jit_comp.c
index 92431ab1a21e..d21e0ab55c94 100644
--- a/arch/x86/net/bpf_jit_comp.c
+++ b/arch/x86/net/bpf_jit_comp.c
@@ -11,10 +11,12 @@
 #include <linux/bpf.h>
 #include <linux/memory.h>
 #include <linux/sort.h>
+#include <linux/clocksource.h>
 #include <asm/extable.h>
 #include <asm/ftrace.h>
 #include <asm/set_memory.h>
 #include <asm/nospec-branch.h>
+#include <asm/timer.h>
 #include <asm/text-patching.h>
 #include <asm/unwind.h>
 #include <asm/cfi.h>
@@ -2216,6 +2218,28 @@ st:			if (is_imm8(insn->off))
 				break;
 			}
 
+			if (insn->src_reg == BPF_PSEUDO_KFUNC_CALL &&
+			    imm32 == BPF_CALL_IMM(bpf_cpu_time_counter_to_ns) &&
+			    using_native_sched_clock() && sched_clock_stable()) {
+				struct cyc2ns_data data;
+				u32 mult, shift;
+
+				cyc2ns_read_begin(&data);
+				mult = data.cyc2ns_mul;
+				shift = data.cyc2ns_shift;
+				cyc2ns_read_end();
+				/* imul RAX, RDI, mult */
+				maybe_emit_mod(&prog, BPF_REG_1, BPF_REG_0, true);
+				EMIT2_off32(0x69, add_2reg(0xC0, BPF_REG_1, BPF_REG_0),
+					    mult);
+
+				/* shr RAX, shift (which is less than 64) */
+				maybe_emit_1mod(&prog, BPF_REG_0, true);
+				EMIT3(0xC1, add_1reg(0xE8, BPF_REG_0), shift);
+
+				break;
+			}
+
 			func = (u8 *) __bpf_call_base + imm32;
 			if (src_reg == BPF_PSEUDO_CALL && tail_call_reachable) {
 				LOAD_TAIL_CALL_CNT_PTR(stack_depth);
@@ -3828,5 +3852,8 @@ bool bpf_jit_inlines_kfunc_call(s32 imm)
 {
 	if (imm == BPF_CALL_IMM(bpf_get_cpu_time_counter))
 		return true;
+	if (imm == BPF_CALL_IMM(bpf_cpu_time_counter_to_ns) &&
+	    using_native_sched_clock() && sched_clock_stable())
+		return true;
 	return false;
 }
diff --git a/arch/x86/net/bpf_jit_comp32.c b/arch/x86/net/bpf_jit_comp32.c
index a549aea25f5f..a2069a3ee4a3 100644
--- a/arch/x86/net/bpf_jit_comp32.c
+++ b/arch/x86/net/bpf_jit_comp32.c
@@ -12,10 +12,12 @@
 #include <linux/netdevice.h>
 #include <linux/filter.h>
 #include <linux/if_vlan.h>
+#include <linux/clocksource.h>
 #include <asm/cacheflush.h>
 #include <asm/set_memory.h>
 #include <asm/nospec-branch.h>
 #include <asm/asm-prototypes.h>
+#include <asm/timer.h>
 #include <linux/bpf.h>
 
 /*
@@ -2100,6 +2102,27 @@ static int do_jit(struct bpf_prog *bpf_prog, int *addrs, u8 *image,
 					EMIT2(0x0F, 0x31);
 					break;
 				}
+				if (imm32 == BPF_CALL_IMM(bpf_cpu_time_counter_to_ns) &&
+				    cpu_feature_enabled(X86_FEATURE_CONSTANT_TSC)) {
+					struct cyc2ns_data data;
+					u32 mult, shift;
+
+					cyc2ns_read_begin(&data);
+					mult = data.cyc2ns_mul;
+					shift = data.cyc2ns_shift;
+					cyc2ns_read_end();
+
+					/* move parameter to BPF_REG_0 */
+					emit_ia32_mov_r64(true, bpf2ia32[BPF_REG_0],
+							  bpf2ia32[BPF_REG_1], true, true,
+							  &prog, bpf_prog->aux);
+					/* multiply parameter by mult */
+					emit_ia32_mul_i64(bpf2ia32[BPF_REG_0],
+							  mult, true, &prog);
+					/* shift parameter by shift which is less than 64 */
+					emit_ia32_rsh_i64(bpf2ia32[BPF_REG_0],
+							  shift, true, &prog);
+				}
 
 				err = emit_kfunc_call(bpf_prog,
 						      image + addrs[i],
@@ -2633,5 +2656,9 @@ bool bpf_jit_inlines_kfunc_call(s32 imm)
 {
 	if (imm == BPF_CALL_IMM(bpf_get_cpu_time_counter))
 		return true;
+	if (imm == BPF_CALL_IMM(bpf_cpu_time_counter_to_ns) &&
+	    using_native_sched_clock() && sched_clock_stable())
+		return true;
+
 	return false;
 }
diff --git a/include/linux/bpf.h b/include/linux/bpf.h
index 6d540253cfb4..dd3c4ddfd60e 100644
--- a/include/linux/bpf.h
+++ b/include/linux/bpf.h
@@ -3336,6 +3336,7 @@ u64 bpf_get_raw_cpu_id(u64 r1, u64 r2, u64 r3, u64 r4, u64 r5);
 /* Inlined kfuncs */
 #if IS_ENABLED(CONFIG_GENERIC_GETTIMEOFDAY)
 u64 bpf_get_cpu_time_counter(void);
+u64 bpf_cpu_time_counter_to_ns(u64 cycles);
 #endif
 
 #if defined(CONFIG_NET)
diff --git a/kernel/bpf/helpers.c b/kernel/bpf/helpers.c
index 23f1a1606f8b..e4d461f2e98f 100644
--- a/kernel/bpf/helpers.c
+++ b/kernel/bpf/helpers.c
@@ -3079,8 +3079,19 @@ __bpf_kfunc u64 bpf_get_cpu_time_counter(void)
 	 */
 	return __arch_get_hw_counter(1, vd);
 }
-#endif
 
+__bpf_kfunc u64 bpf_cpu_time_counter_to_ns(u64 cycles)
+{
+	const struct vdso_data *vd = __arch_get_k_vdso_data();
+
+	vd = &vd[CS_RAW];
+	/* The kfunc implementation does fewer manipulations than the vDSO
+	 * implementation. The BPF use-case assumes the two measurements are
+	 * close in time, so the logic can be simplified.
+	 */
+	return mul_u64_u32_shr(cycles, vd->mult, vd->shift);
+}
+#endif
 __bpf_kfunc_end_defs();
 
 BTF_KFUNCS_START(generic_btf_ids)
@@ -3175,6 +3186,7 @@ BTF_ID_FLAGS(func, bpf_iter_kmem_cache_next, KF_ITER_NEXT | KF_RET_NULL | KF_SLE
 BTF_ID_FLAGS(func, bpf_iter_kmem_cache_destroy, KF_ITER_DESTROY | KF_SLEEPABLE)
 #if IS_ENABLED(CONFIG_GENERIC_GETTIMEOFDAY)
 BTF_ID_FLAGS(func, bpf_get_cpu_time_counter, KF_FASTCALL)
+BTF_ID_FLAGS(func, bpf_cpu_time_counter_to_ns, KF_FASTCALL)
 #endif
 BTF_KFUNCS_END(common_btf_ids)
 
-- 
2.43.5

* [PATCH bpf-next v9 3/4] selftests/bpf: add selftest to check bpf_get_cpu_time_counter jit
  2024-11-23  0:58 [PATCH bpf-next v9 0/4] bpf: add cpu time counter kfuncs Vadim Fedorenko
  2024-11-23  0:58 ` [PATCH bpf-next v9 1/4] bpf: add bpf_get_cpu_time_counter kfunc Vadim Fedorenko
  2024-11-23  0:58 ` [PATCH bpf-next v9 2/4] bpf: add bpf_cpu_time_counter_to_ns helper Vadim Fedorenko
@ 2024-11-23  0:58 ` Vadim Fedorenko
  2024-11-23  0:58 ` [PATCH bpf-next v9 4/4] selftests/bpf: add usage example for cpu time counter kfuncs Vadim Fedorenko
  3 siblings, 0 replies; 10+ messages in thread
From: Vadim Fedorenko @ 2024-11-23  0:58 UTC (permalink / raw)
  To: Borislav Petkov, Alexei Starovoitov, Daniel Borkmann,
	Andrii Nakryiko, Eduard Zingerman, Thomas Gleixner, Yonghong Song,
	Vadim Fedorenko, Mykola Lysenko
  Cc: x86, bpf, Peter Zijlstra, Vadim Fedorenko, Martin KaFai Lau

bpf_get_cpu_time_counter() is replaced with the rdtsc instruction on
x86_64. Add tests to check that the JIT works as expected.

Acked-by: Eduard Zingerman <eddyz87@gmail.com>
Signed-off-by: Vadim Fedorenko <vadfed@meta.com>
---
 .../selftests/bpf/prog_tests/verifier.c       |   2 +
 .../selftests/bpf/progs/verifier_cpu_cycles.c | 104 ++++++++++++++++++
 2 files changed, 106 insertions(+)
 create mode 100644 tools/testing/selftests/bpf/progs/verifier_cpu_cycles.c

diff --git a/tools/testing/selftests/bpf/prog_tests/verifier.c b/tools/testing/selftests/bpf/prog_tests/verifier.c
index d9f65adb456b..6cbb8949164a 100644
--- a/tools/testing/selftests/bpf/prog_tests/verifier.c
+++ b/tools/testing/selftests/bpf/prog_tests/verifier.c
@@ -98,6 +98,7 @@
 #include "verifier_xdp_direct_packet_access.skel.h"
 #include "verifier_bits_iter.skel.h"
 #include "verifier_lsm.skel.h"
+#include "verifier_cpu_cycles.skel.h"
 
 #define MAX_ENTRIES 11
 
@@ -225,6 +226,7 @@ void test_verifier_xdp(void)                  { RUN(verifier_xdp); }
 void test_verifier_xdp_direct_packet_access(void) { RUN(verifier_xdp_direct_packet_access); }
 void test_verifier_bits_iter(void) { RUN(verifier_bits_iter); }
 void test_verifier_lsm(void)                  { RUN(verifier_lsm); }
+void test_verifier_cpu_cycles(void)           { RUN(verifier_cpu_cycles); }
 
 void test_verifier_mtu(void)
 {
diff --git a/tools/testing/selftests/bpf/progs/verifier_cpu_cycles.c b/tools/testing/selftests/bpf/progs/verifier_cpu_cycles.c
new file mode 100644
index 000000000000..5b62e3690362
--- /dev/null
+++ b/tools/testing/selftests/bpf/progs/verifier_cpu_cycles.c
@@ -0,0 +1,104 @@
+// SPDX-License-Identifier: GPL-2.0
+/* Copyright (c) 2022 Meta Inc. */
+#include "vmlinux.h"
+#include <bpf/bpf_helpers.h>
+#include <bpf/bpf_tracing.h>
+#include "bpf_misc.h"
+
+extern u64 bpf_cpu_time_counter_to_ns(u64 cycles) __weak __ksym;
+extern u64 bpf_get_cpu_time_counter(void) __weak __ksym;
+
+SEC("syscall")
+__arch_x86_64
+__xlated("0: call kernel-function")
+__naked int bpf_rdtsc(void)
+{
+	asm volatile(
+	"call %[bpf_get_cpu_time_counter];"
+	"exit"
+	:
+	: __imm(bpf_get_cpu_time_counter)
+	: __clobber_all
+	);
+}
+
+SEC("syscall")
+__arch_x86_64
+/* program entry for bpf_rdtsc_jit_x86_64(), regular function prologue */
+__jited("	endbr64")
+__jited("	nopl	(%rax,%rax)")
+__jited("	nopl	(%rax)")
+__jited("	pushq	%rbp")
+__jited("	movq	%rsp, %rbp")
+__jited("	endbr64")
+/* save RDX in R11 as it will be overwritten */
+__jited("	movq	%rdx, %r11")
+/* lfence may not be executed depending on cpu features */
+__jited("	{{(lfence|)}}")
+__jited("	rdtsc")
+/* combine EDX:EAX into RAX */
+__jited("	shlq	${{(32|0x20)}}, %rdx")
+__jited("	orq	%rdx, %rax")
+/* restore RDX from R11 */
+__jited("	movq	%r11, %rdx")
+__jited("	leave")
+__naked int bpf_rdtsc_jit_x86_64(void)
+{
+	asm volatile(
+	"call %[bpf_get_cpu_time_counter];"
+	"exit"
+	:
+	: __imm(bpf_get_cpu_time_counter)
+	: __clobber_all
+	);
+}
+
+SEC("syscall")
+__arch_x86_64
+__xlated("0: r1 = 42")
+__xlated("1: call kernel-function")
+__naked int bpf_cyc2ns(void)
+{
+	asm volatile(
+	"r1=0x2a;"
+	"call %[bpf_cpu_time_counter_to_ns];"
+	"exit"
+	:
+	: __imm(bpf_cpu_time_counter_to_ns)
+	: __clobber_all
+	);
+}
+
+SEC("syscall")
+__arch_x86_64
+/* program entry for bpf_cyc2ns_jit_x86(), regular function prologue */
+__jited("	endbr64")
+__jited("	nopl	(%rax,%rax)")
+__jited("	nopl	(%rax)")
+__jited("	pushq	%rbp")
+__jited("	movq	%rsp, %rbp")
+__jited("	endbr64")
+/* load the 64-bit immediate argument into RDI (BPF R1) */
+__jited("	movabsq	$0x2a2a2a2a2a, %rdi")
+__jited("	imulq	${{.*}}, %rdi, %rax")
+__jited("	shrq	${{.*}}, %rax")
+__jited("	leave")
+__naked int bpf_cyc2ns_jit_x86(void)
+{
+	asm volatile(
+	"r1=0x2a2a2a2a2a ll;"
+	"call %[bpf_cpu_time_counter_to_ns];"
+	"exit"
+	:
+	: __imm(bpf_cpu_time_counter_to_ns)
+	: __clobber_all
+	);
+}
+
+void rdtsc(void)
+{
+	bpf_get_cpu_time_counter();
+	bpf_cpu_time_counter_to_ns(42);
+}
+
+char _license[] SEC("license") = "GPL";
-- 
2.43.5


* [PATCH bpf-next v9 4/4] selftests/bpf: add usage example for cpu time counter kfuncs
  2024-11-23  0:58 [PATCH bpf-next v9 0/4] bpf: add cpu time counter kfuncs Vadim Fedorenko
                   ` (2 preceding siblings ...)
  2024-11-23  0:58 ` [PATCH bpf-next v9 3/4] selftests/bpf: add selftest to check bpf_get_cpu_time_counter jit Vadim Fedorenko
@ 2024-11-23  0:58 ` Vadim Fedorenko
  3 siblings, 0 replies; 10+ messages in thread
From: Vadim Fedorenko @ 2024-11-23  0:58 UTC (permalink / raw)
  To: Borislav Petkov, Alexei Starovoitov, Daniel Borkmann,
	Andrii Nakryiko, Eduard Zingerman, Thomas Gleixner, Yonghong Song,
	Vadim Fedorenko, Mykola Lysenko
  Cc: x86, bpf, Peter Zijlstra, Vadim Fedorenko, Martin KaFai Lau

The selftest provides an example of how to measure the latency of a bpf
kfunc/helper call using the timestamp counter and how to convert the
measured value into nanoseconds.

Signed-off-by: Vadim Fedorenko <vadfed@meta.com>
---
 .../bpf/prog_tests/test_cpu_cycles.c          | 35 +++++++++++++++++++
 .../selftests/bpf/progs/test_cpu_cycles.c     | 25 +++++++++++++
 2 files changed, 60 insertions(+)
 create mode 100644 tools/testing/selftests/bpf/prog_tests/test_cpu_cycles.c
 create mode 100644 tools/testing/selftests/bpf/progs/test_cpu_cycles.c

diff --git a/tools/testing/selftests/bpf/prog_tests/test_cpu_cycles.c b/tools/testing/selftests/bpf/prog_tests/test_cpu_cycles.c
new file mode 100644
index 000000000000..d7f3b66594b3
--- /dev/null
+++ b/tools/testing/selftests/bpf/prog_tests/test_cpu_cycles.c
@@ -0,0 +1,35 @@
+// SPDX-License-Identifier: GPL-2.0
+/* Copyright (c) 2024 Meta Inc. */
+
+#include <test_progs.h>
+#include "test_cpu_cycles.skel.h"
+
+static void cpu_cycles(void)
+{
+	LIBBPF_OPTS(bpf_test_run_opts, opts);
+	struct test_cpu_cycles *skel;
+	int err, pfd;
+
+	skel = test_cpu_cycles__open_and_load();
+	if (!ASSERT_OK_PTR(skel, "test_cpu_cycles open and load"))
+		return;
+
+	pfd = bpf_program__fd(skel->progs.bpf_cpu_cycles);
+	if (!ASSERT_GT(pfd, 0, "test_cpu_cycles fd"))
+		goto fail;
+
+	err = bpf_prog_test_run_opts(pfd, &opts);
+	if (!ASSERT_OK(err, "test_cpu_cycles test run"))
+		goto fail;
+
+	ASSERT_NEQ(skel->bss->cycles, 0, "test_cpu_cycles 0 cycles");
+	ASSERT_NEQ(skel->bss->ns, 0, "test_cpu_cycles 0 ns");
+fail:
+	test_cpu_cycles__destroy(skel);
+}
+
+void test_cpu_cycles(void)
+{
+	if (test__start_subtest("cpu_cycles"))
+		cpu_cycles();
+}
diff --git a/tools/testing/selftests/bpf/progs/test_cpu_cycles.c b/tools/testing/selftests/bpf/progs/test_cpu_cycles.c
new file mode 100644
index 000000000000..a7f8a4c6b854
--- /dev/null
+++ b/tools/testing/selftests/bpf/progs/test_cpu_cycles.c
@@ -0,0 +1,25 @@
+// SPDX-License-Identifier: GPL-2.0
+/* Copyright (c) 2024 Meta Inc. */
+
+#include "vmlinux.h"
+#include <bpf/bpf_helpers.h>
+
+extern u64 bpf_cpu_time_counter_to_ns(u64 cycles) __weak __ksym;
+extern u64 bpf_get_cpu_time_counter(void) __weak __ksym;
+
+__u64 cycles, ns;
+
+SEC("syscall")
+int bpf_cpu_cycles(void)
+{
+	struct bpf_pidns_info pidns;
+	__u64 start;
+
+	start = bpf_get_cpu_time_counter();
+	bpf_get_ns_current_pid_tgid(0, 0, &pidns, sizeof(struct bpf_pidns_info));
+	cycles = bpf_get_cpu_time_counter() - start;
+	ns = bpf_cpu_time_counter_to_ns(cycles);
+	return 0;
+}
+
+char _license[] SEC("license") = "GPL";
-- 
2.43.5


* Re: [PATCH bpf-next v9 1/4] bpf: add bpf_get_cpu_time_counter kfunc
  2024-11-23  0:58 ` [PATCH bpf-next v9 1/4] bpf: add bpf_get_cpu_time_counter kfunc Vadim Fedorenko
@ 2024-11-27 12:03   ` kernel test robot
  2024-12-01 12:46   ` Thomas Gleixner
  1 sibling, 0 replies; 10+ messages in thread
From: kernel test robot @ 2024-11-27 12:03 UTC (permalink / raw)
  To: Vadim Fedorenko, Borislav Petkov, Alexei Starovoitov,
	Daniel Borkmann, Andrii Nakryiko, Eduard Zingerman,
	Thomas Gleixner, Yonghong Song, Vadim Fedorenko, Mykola Lysenko
  Cc: oe-kbuild-all, x86, bpf, Peter Zijlstra, Martin KaFai Lau

Hi Vadim,

kernel test robot noticed the following build errors:

[auto build test ERROR on bpf-next/master]

url:    https://github.com/intel-lab-lkp/linux/commits/Vadim-Fedorenko/bpf-add-bpf_get_cpu_time_counter-kfunc/20241125-122255
base:   https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next.git master
patch link:    https://lore.kernel.org/r/20241123005833.810044-2-vadfed%40meta.com
patch subject: [PATCH bpf-next v9 1/4] bpf: add bpf_get_cpu_time_counter kfunc
config: x86_64-randconfig-104-20241127 (https://download.01.org/0day-ci/archive/20241127/202411271903.OUFqQ7FP-lkp@intel.com/config)
compiler: clang version 19.1.3 (https://github.com/llvm/llvm-project ab51eccf88f5321e7c60591c5546b254b6afab99)
reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20241127/202411271903.OUFqQ7FP-lkp@intel.com/reproduce)

If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <lkp@intel.com>
| Closes: https://lore.kernel.org/oe-kbuild-all/202411271903.OUFqQ7FP-lkp@intel.com/

All errors (new ones prefixed by >>):

>> ld.lld: error: undefined symbol: bpf_get_cpu_time_counter
   >>> referenced by bpf_jit_comp.c:0 (arch/x86/net/bpf_jit_comp.c:0)
   >>>               vmlinux.o:(do_jit)
   >>> referenced by bpf_jit_comp.c:3829 (arch/x86/net/bpf_jit_comp.c:3829)
   >>>               vmlinux.o:(bpf_jit_inlines_kfunc_call)

-- 
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki

* Re: [PATCH bpf-next v9 2/4] bpf: add bpf_cpu_time_counter_to_ns helper
  2024-11-23  0:58 ` [PATCH bpf-next v9 2/4] bpf: add bpf_cpu_time_counter_to_ns helper Vadim Fedorenko
@ 2024-11-27 19:07   ` kernel test robot
  0 siblings, 0 replies; 10+ messages in thread
From: kernel test robot @ 2024-11-27 19:07 UTC (permalink / raw)
  To: Vadim Fedorenko, Borislav Petkov, Alexei Starovoitov,
	Daniel Borkmann, Andrii Nakryiko, Eduard Zingerman,
	Thomas Gleixner, Yonghong Song, Vadim Fedorenko, Mykola Lysenko
  Cc: oe-kbuild-all, x86, bpf, Peter Zijlstra, Martin KaFai Lau

Hi Vadim,

kernel test robot noticed the following build errors:

[auto build test ERROR on bpf-next/master]

url:    https://github.com/intel-lab-lkp/linux/commits/Vadim-Fedorenko/bpf-add-bpf_get_cpu_time_counter-kfunc/20241125-122255
base:   https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next.git master
patch link:    https://lore.kernel.org/r/20241123005833.810044-3-vadfed%40meta.com
patch subject: [PATCH bpf-next v9 2/4] bpf: add bpf_cpu_time_counter_to_ns helper
config: x86_64-randconfig-104-20241127 (https://download.01.org/0day-ci/archive/20241128/202411280258.urOdkuWz-lkp@intel.com/config)
compiler: clang version 19.1.3 (https://github.com/llvm/llvm-project ab51eccf88f5321e7c60591c5546b254b6afab99)
reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20241128/202411280258.urOdkuWz-lkp@intel.com/reproduce)

If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <lkp@intel.com>
| Closes: https://lore.kernel.org/oe-kbuild-all/202411280258.urOdkuWz-lkp@intel.com/

All errors (new ones prefixed by >>):

>> ld.lld: error: undefined symbol: bpf_cpu_time_counter_to_ns
   >>> referenced by bpf_jit_comp.c:0 (arch/x86/net/bpf_jit_comp.c:0)
   >>>               vmlinux.o:(do_jit)
   >>> referenced by bpf_jit_comp.c:3855 (arch/x86/net/bpf_jit_comp.c:3855)
   >>>               vmlinux.o:(bpf_jit_inlines_kfunc_call)

-- 
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki

* Re: [PATCH bpf-next v9 1/4] bpf: add bpf_get_cpu_time_counter kfunc
  2024-11-23  0:58 ` [PATCH bpf-next v9 1/4] bpf: add bpf_get_cpu_time_counter kfunc Vadim Fedorenko
  2024-11-27 12:03   ` kernel test robot
@ 2024-12-01 12:46   ` Thomas Gleixner
  2024-12-01 17:45     ` Vadim Fedorenko
  1 sibling, 1 reply; 10+ messages in thread
From: Thomas Gleixner @ 2024-12-01 12:46 UTC (permalink / raw)
  To: Vadim Fedorenko, Borislav Petkov, Alexei Starovoitov,
	Daniel Borkmann, Andrii Nakryiko, Eduard Zingerman, Yonghong Song,
	Vadim Fedorenko, Mykola Lysenko
  Cc: x86, bpf, Peter Zijlstra, Vadim Fedorenko, Martin KaFai Lau

On Fri, Nov 22 2024 at 16:58, Vadim Fedorenko wrote:
> diff --git a/arch/x86/net/bpf_jit_comp32.c b/arch/x86/net/bpf_jit_comp32.c
> index de0f9e5f9f73..a549aea25f5f 100644
> --- a/arch/x86/net/bpf_jit_comp32.c
> +++ b/arch/x86/net/bpf_jit_comp32.c
> @@ -2094,6 +2094,13 @@ static int do_jit(struct bpf_prog *bpf_prog, int *addrs, u8 *image,
>  			if (insn->src_reg == BPF_PSEUDO_KFUNC_CALL) {
>  				int err;
>  
> +				if (imm32 == BPF_CALL_IMM(bpf_get_cpu_time_counter)) {
> +					if (cpu_feature_enabled(X86_FEATURE_LFENCE_RDTSC))
> +						EMIT3(0x0F, 0xAE, 0xE8);
> +					EMIT2(0x0F, 0x31);

What guarantees that RDTSC is supported by the CPU?

Aside of that, if you want the read to be ordered, then you need to take
RDTSCP into account too.

> +#if IS_ENABLED(CONFIG_GENERIC_GETTIMEOFDAY)
> +__bpf_kfunc u64 bpf_get_cpu_time_counter(void)
> +{
> +	const struct vdso_data *vd = __arch_get_k_vdso_data();
> +
> +	vd = &vd[CS_RAW];
> +
> +	/* CS_RAW clock_mode translates to VDSO_CLOCKMODE_TSC on x86 and

How so?

vd->clock_mode is not guaranteed to be VDSO_CLOCKMODE_TSC or
VDSO_CLOCKMODE_ARCHTIMER. CS_RAW is the access to the raw (uncorrected)
time of the current clocksource. If the clock mode does not match, then
you cannot access it.

> +	 * to VDSO_CLOCKMODE_ARCHTIMER on aarch64/risc-v. We cannot use
> +	 * vd->clock_mode directly because it brings possible access to
> +	 * pages visible by user-space only via vDSO.

How so? vd->clock_mode is kernel visible.

>        * But the constant value
> +	 * of 1 is exactly what we need - it works for any architecture and
> +	 * translates to reading of HW timecounter regardles of architecture.

It does not. Care to look at MIPS?

> +	 * We still have to provide vdso_data for some architectures to avoid
> +	 * NULL pointer dereference.
> +	 */
> +	return __arch_get_hw_counter(1, vd);

This is outright dangerous. __arch_get_hw_counter() is for VDSO usage
and not for in-kernel usage. What guarantees that the
architecture-specific implementation does not need access to user-only
mappings?

Aside of that, what guarantees that '1' is what you want and stays that
way forever? It's already broken on MIPS.

Thanks,

        tglx

* Re: [PATCH bpf-next v9 1/4] bpf: add bpf_get_cpu_time_counter kfunc
  2024-12-01 12:46   ` Thomas Gleixner
@ 2024-12-01 17:45     ` Vadim Fedorenko
  2024-12-02 20:52       ` Thomas Gleixner
  0 siblings, 1 reply; 10+ messages in thread
From: Vadim Fedorenko @ 2024-12-01 17:45 UTC (permalink / raw)
  To: Thomas Gleixner, Vadim Fedorenko, Borislav Petkov,
	Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko,
	Eduard Zingerman, Yonghong Song, Mykola Lysenko
  Cc: x86, bpf, Peter Zijlstra, Martin KaFai Lau

On 01.12.2024 12:46, Thomas Gleixner wrote:
> On Fri, Nov 22 2024 at 16:58, Vadim Fedorenko wrote:
>> diff --git a/arch/x86/net/bpf_jit_comp32.c b/arch/x86/net/bpf_jit_comp32.c
>> index de0f9e5f9f73..a549aea25f5f 100644
>> --- a/arch/x86/net/bpf_jit_comp32.c
>> +++ b/arch/x86/net/bpf_jit_comp32.c
>> @@ -2094,6 +2094,13 @@ static int do_jit(struct bpf_prog *bpf_prog, int *addrs, u8 *image,
>>   			if (insn->src_reg == BPF_PSEUDO_KFUNC_CALL) {
>>   				int err;
>>   
>> +				if (imm32 == BPF_CALL_IMM(bpf_get_cpu_time_counter)) {
>> +					if (cpu_feature_enabled(X86_FEATURE_LFENCE_RDTSC))
>> +						EMIT3(0x0F, 0xAE, 0xE8);
>> +					EMIT2(0x0F, 0x31);
> 
> What guarantees that RDTSC is supported by the CPU?

Well, technically it may be a problem on x86_32 because there are
x86-compatible platforms which don't have RDTSC, but they are almost all
16+ years old, and I'm not quite sure we expose a vDSO on such platforms.

> 
> Aside of that, if you want the read to be ordered, then you need to take
> RDTSCP into account too.

Yes, we have already had this discussion. RDTSCP has the same ordering
guarantees as "LFENCE; RDTSC" according to the programming manuals. But
it also provides a "cookie" value, which is not used in this case and
just trashes the value of ECX. To avoid the additional register
manipulation, I used the lfence option.
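
For reference, the sequence the x86-64 JIT emits for this kfunc in
patch 1 is effectively:

	movq	%rdx, %r11	/* save RDX, rdtsc clobbers EDX:EAX */
	lfence			/* only with X86_FEATURE_LFENCE_RDTSC */
	rdtsc
	shlq	$32, %rdx
	orq	%rdx, %rax	/* combine EDX:EAX into RAX (BPF R0) */
	movq	%r11, %rdx	/* restore RDX */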

>> +#if IS_ENABLED(CONFIG_GENERIC_GETTIMEOFDAY)
>> +__bpf_kfunc u64 bpf_get_cpu_time_counter(void)
>> +{
>> +	const struct vdso_data *vd = __arch_get_k_vdso_data();
>> +
>> +	vd = &vd[CS_RAW];
>> +
>> +	/* CS_RAW clock_mode translates to VDSO_CLOCKMODE_TSC on x86 and
> 
> How so?
> 
> vd->clock_mode is not guaranteed to be VDSO_CLOCKMODE_TSC or
> VDSO_CLOCKMODE_ARCHTIMER. CS_RAW is the access to the raw (uncorrected)
> time of the current clocksource. If the clock mode does not match, then
> you cannot access it.

That's more about x86 and virtualization options. But in the end all of
this ends up reading the TSC value. And we do JIT anyway, so this
function call will never be executed on x86. Other architectures (well,
apart from MIPS) don't care about vd->clock_mode at all. And we don't
provide kfuncs for architectures without a JIT.

For MIPS I think I can ifdef these new kfuncs to the case when
CONFIG_CSRC_R4K is not defined.

I'm going to create a patchset to implement arch-specific replacements
for all architectures supported by the BPF JIT, so in the end this call
will effectively never be executed.

> 
>> +	 * to VDSO_CLOCKMODE_ARCHTIMER on aarch64/risc-v. We cannot use
>> +	 * vd->clock_mode directly because it brings possible access to
>> +	 * pages visible by user-space only via vDSO.
> 
> How so? vd->clock_mode is kernel visible.

vd->clock_mode is kernel visible, but the compiler cannot optimize out
the code which accesses user-space pages if I don't provide a constant
value here.

> 
>>         * But the constant value
>> +	 * of 1 is exactly what we need - it works for any architecture and
>> +	 * translates to reading of HW timecounter regardles of architecture.
> 
> It does not. Care to look at MIPS?

Yes, this is pretty much MIPS-specific. But again, the goal is to have a
JIT implementation for all architectures, so this function will
actually never be called this way.

> 
>> +	 * We still have to provide vdso_data for some architectures to avoid
>> +	 * NULL pointer dereference.
>> +	 */
>> +	return __arch_get_hw_counter(1, vd);
> 
> This is outright dangerous. __arch_get_hw_counter() is for VDSO usage
> and not for in kernel usage. What guarantees you that the architecture
> specific implementation does not need access to user only mappings.
> 
> Aside of that what guarantees that '1' is what you want and stays that
> way forever? It's already broken on MIPS.

I can ifdef the MIPS case until we have a JIT for it (which has a
pretty straightforward implementation for the HW counter).

> 
> Thanks,
> 
>          tglx

* Re: [PATCH bpf-next v9 1/4] bpf: add bpf_get_cpu_time_counter kfunc
  2024-12-01 17:45     ` Vadim Fedorenko
@ 2024-12-02 20:52       ` Thomas Gleixner
  0 siblings, 0 replies; 10+ messages in thread
From: Thomas Gleixner @ 2024-12-02 20:52 UTC (permalink / raw)
  To: Vadim Fedorenko, Vadim Fedorenko, Borislav Petkov,
	Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko,
	Eduard Zingerman, Yonghong Song, Mykola Lysenko
  Cc: x86, bpf, Peter Zijlstra, Martin KaFai Lau

On Sun, Dec 01 2024 at 17:45, Vadim Fedorenko wrote:
> On 01.12.2024 12:46, Thomas Gleixner wrote:
>> On Fri, Nov 22 2024 at 16:58, Vadim Fedorenko wrote:
>>> diff --git a/arch/x86/net/bpf_jit_comp32.c b/arch/x86/net/bpf_jit_comp32.c
>>> index de0f9e5f9f73..a549aea25f5f 100644
>>> --- a/arch/x86/net/bpf_jit_comp32.c
>>> +++ b/arch/x86/net/bpf_jit_comp32.c
>>> @@ -2094,6 +2094,13 @@ static int do_jit(struct bpf_prog *bpf_prog, int *addrs, u8 *image,
>>>   			if (insn->src_reg == BPF_PSEUDO_KFUNC_CALL) {
>>>   				int err;
>>>   
>>> +				if (imm32 == BPF_CALL_IMM(bpf_get_cpu_time_counter)) {
>>> +					if (cpu_feature_enabled(X86_FEATURE_LFENCE_RDTSC))
>>> +						EMIT3(0x0F, 0xAE, 0xE8);
>>> +					EMIT2(0x0F, 0x31);
>> 
>> What guarantees that RDTSC is supported by the CPU?
>
> Well, technically it may be a problem on x86_32 because there are
> x86-compatible platforms which don't have RDTSC, but they are almost all
> 16+ years old, and I'm not quite sure we expose a vDSO on such platforms.

Care to look at the VDSO-related config symbols? They are
unconditionally selected, independent of 16+ year old platforms.

Also, the TSC can be disabled at boot time by a particular platform
because its implementation is buggy.

It does not matter at all whether those platforms are old or not. What
matters is that the code which tries to access the TSC has to be correct
under all circumstances and not under magic assumptions.

>> Aside of that, if you want the read to be ordered, then you need to take
>> RDTSCP into account too.
>
> Yes, we have already had this discussion. RDTSCP has the same ordering
> guarantees as "LFENCE; RDTSC" according to the programming manuals. But
> it also provides a "cookie" value, which is not used in this case and
> just trashes the value of ECX. To avoid the additional register
> manipulation, I used the lfence option.

With zero comment and zero explanation in the change log.

>>> +#if IS_ENABLED(CONFIG_GENERIC_GETTIMEOFDAY)
>>> +__bpf_kfunc u64 bpf_get_cpu_time_counter(void)
>>> +{
>>> +	const struct vdso_data *vd = __arch_get_k_vdso_data();
>>> +
>>> +	vd = &vd[CS_RAW];
>>> +
>>> +	/* CS_RAW clock_mode translates to VDSO_CLOCKMODE_TSC on x86 and
>> 
>> How so?
>> 
>> vd->clock_mode is not guaranteed to be VDSO_CLOCKMODE_TSC or
>> VDSO_CLOCKMODE_ARCHTIMER. CS_RAW is the access to the raw (uncorrected)
>> time of the current clocksource. If the clock mode is not matching, then
>> you cannot access it.
>
> That's more about x86 and virtualization options. But in the end all of
> this ends up reading the TSC value.

As long as VDSO_CLOCKMODE_TSC == 1. There is no guarantee for this.

> And we do JIT anyway, so this function call will never be executed on
> x86. Other architectures (well, apart from MIPS) don't care about
> vd->clock_mode at all. And we don't provide kfuncs for architectures
> without a JIT.

They do care: all of them can set VDSO_CLOCKMODE_NONE, which forces the
fallback into the syscall. And they do so for good reasons: either
because the clocksource is not functional or because it has other
limitations which prevent VDSO usage or even usage in the kernel.

> For MIPS I think I can ifdef these new kfuncs to the case when CONFIG_CSRC_R4K
> is not defined.

Which excludes VDSO_CLOCKMODE_GIC, which is the most common clock source
on modern MIPS systems.

Aside of that, a config symbol does not guarantee at all that the
clocksource exists on the actual hardware or is properly configured and
enabled.

>>> +	 * We still have to provide vdso_data for some architectures to avoid
>>> +	 * NULL pointer dereference.
>>> +	 */
>>> +	return __arch_get_hw_counter(1, vd);
>> 
>> This is outright dangerous. __arch_get_hw_counter() is for VDSO usage
>> and not for in kernel usage. What guarantees you that the architecture
>> specific implementation does not need access to user only mappings.
>> 
>> Aside of that what guarantees that '1' is what you want and stays that
>> way forever? It's already broken on MIPS.
>
> I can ifdef the MIPS case until we have a JIT for it (which has a
> pretty straightforward implementation for the HW counter).

Again. You cannot make any assumptions about any implementation detail
of __arch_get_hw_counter().

It is a function solely designed for user-space VDSO usage; in-kernel
usage of it is simply bogus.

Just because it "works" today does not guarantee that it works
tomorrow, and there is no justification for BPF to enforce
compatibility with the magic number '1' and a constraint that the code
has to work in the kernel.

Thanks,

        tglx
