BPF List
 help / color / mirror / Atom feed
* [PATCH bpf-next v8 0/4] bpf: add cpu cycles kfuncs
@ 2024-11-21  0:08 Vadim Fedorenko
  2024-11-21  0:08 ` [PATCH bpf-next v8 1/4] bpf: add bpf_get_cpu_time_counter kfunc Vadim Fedorenko
                   ` (4 more replies)
  0 siblings, 5 replies; 19+ messages in thread
From: Vadim Fedorenko @ 2024-11-21  0:08 UTC (permalink / raw)
  To: Borislav Petkov, Alexei Starovoitov, Daniel Borkmann,
	Andrii Nakryiko, Eduard Zingerman, Thomas Gleixner, Yonghong Song,
	Vadim Fedorenko, Mykola Lysenko
  Cc: x86, bpf, Peter Zijlstra, Vadim Fedorenko, Martin KaFai Lau

This patchset adds 2 kfuncs to provide a way to precisely measure the
time spent running some code. The first patch provides a way to get the
cpu cycles counter which is used to feed CLOCK_MONOTONIC_RAW. On the x86
architecture it is effectively the rdtsc_ordered() function, while on
other architectures it falls back to __arch_get_hw_counter(). The second
patch adds a kfunc to convert cpu cycles to nanoseconds using the
shift/mult constants discovered by the kernel. The main use-case for
this kfunc is to convert deltas of timestamp counter values into
nanoseconds. It is not supposed to produce CLOCK_MONOTONIC_RAW values,
as the offset part is skipped. The JIT version is done for x86 for now;
other architectures fall back to a slightly simplified version of
vdso_calc_ns().

Selftests are also added to check whether the JIT implementation is
correct and to show the simplest usage example.
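
To illustrate the intended usage, here is a minimal sketch (this is
essentially what the selftest in patch 4 does; the only assumed names
are the two kfuncs added by this series):

#include "vmlinux.h"
#include <bpf/bpf_helpers.h>

extern u64 bpf_get_cpu_time_counter(void) __weak __ksym;
extern u64 bpf_cpu_time_counter_to_ns(u64 cycles) __weak __ksym;

__u64 delta_ns;

SEC("syscall")
int measure(void *ctx)
{
	__u64 start = bpf_get_cpu_time_counter();

	/* ... code under measurement ... */

	/* convert only the delta; absolute counter values are not
	 * CLOCK_MONOTONIC_RAW readings
	 */
	delta_ns = bpf_cpu_time_counter_to_ns(bpf_get_cpu_time_counter() - start);
	return 0;
}

char _license[] SEC("license") = "GPL";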

Change log:
v7 -> v8:
* rename kfuncs again to bpf_get_cpu_time_counter() and
  bpf_cpu_time_counter_to_ns()
* use cyc2ns_read_begin()/cyc2ns_read_end() to get mult and shift
  constants in bpf_cpu_time_counter_to_ns()
v6 -> v7:
* change boot_cpu_has() to cpu_feature_enabled() (Borislav)
* return constant clock_mode in __arch_get_hw_counter() call
v5 -> v6:
* added cover letter
* add comment about dropping S64_MAX manipulation in jitted
  implementation of rdtsc_ordered (Alexei)
* add comment about using 'lfence;rdtsc' variant (Alexei)
* change the check in fixup_kfunc_call() (Eduard)
* make __arch_get_hw_counter() call more aligned with vDSO
  implementation (Yonghong)
v4 -> v5:
* use #if instead of #ifdef with IS_ENABLED
v3 -> v4:
* change name of the helper to bpf_get_cpu_cycles (Andrii)
* Hide the helper behind CONFIG_GENERIC_GETTIMEOFDAY to avoid exposing
  it on architectures which do not have vDSO functions and data
* reduce the scope of check of inlined functions in verifier to only 2,
  which are actually inlined.
* change helper name to bpf_cpu_cycles_to_ns.
* hide it behind CONFIG_GENERIC_GETTIMEOFDAY to avoid exposing on
  unsupported architectures.
v2 -> v3:
* change name of the helper to bpf_get_cpu_cycles_counter to
  explicitly mention what counter it provides (Andrii)
* move kfunc definition to bpf.h to use it in JIT.
* introduce another kfunc to convert cycles into nanoseconds as
  more meaningful time units for generic tracing use case (Andrii)
v1 -> v2:
* Fix incorrect function return value type to u64
* Introduce bpf_jit_inlines_kfunc_call() and use it in
	mark_fastcall_pattern_for_call() to avoid clobbering in case
	of running programs with no JIT (Eduard)
* Avoid rewriting instruction and check function pointer directly
	in JIT (Alexei)
* Change includes to fix compile issues on non x86 architectures


Vadim Fedorenko (4):
  bpf: add bpf_get_cpu_time_counter kfunc
  bpf: add bpf_cpu_time_counter_to_ns helper
  selftests/bpf: add selftest to check rdtsc jit
  selftests/bpf: add usage example for cpu cycles kfuncs

 arch/x86/net/bpf_jit_comp.c                   |  66 +++++++++++
 arch/x86/net/bpf_jit_comp32.c                 |  41 +++++++
 include/linux/bpf.h                           |   6 +
 include/linux/filter.h                        |   1 +
 kernel/bpf/core.c                             |  11 ++
 kernel/bpf/helpers.c                          |  39 +++++++
 kernel/bpf/verifier.c                         |  41 ++++++-
 .../bpf/prog_tests/test_cpu_cycles.c          |  35 ++++++
 .../selftests/bpf/prog_tests/verifier.c       |   2 +
 .../selftests/bpf/progs/test_cpu_cycles.c     |  25 +++++
 .../selftests/bpf/progs/verifier_cpu_cycles.c | 104 ++++++++++++++++++
 11 files changed, 365 insertions(+), 6 deletions(-)
 create mode 100644 tools/testing/selftests/bpf/prog_tests/test_cpu_cycles.c
 create mode 100644 tools/testing/selftests/bpf/progs/test_cpu_cycles.c
 create mode 100644 tools/testing/selftests/bpf/progs/verifier_cpu_cycles.c

-- 
2.43.5

^ permalink raw reply	[flat|nested] 19+ messages in thread

* [PATCH bpf-next v8 1/4] bpf: add bpf_get_cpu_time_counter kfunc
  2024-11-21  0:08 [PATCH bpf-next v8 0/4] bpf: add cpu cycles kfuncs Vadim Fedorenko
@ 2024-11-21  0:08 ` Vadim Fedorenko
  2024-11-21 11:32   ` Peter Zijlstra
  2024-11-21  0:08 ` [PATCH bpf-next v8 2/4] bpf: add bpf_cpu_time_counter_to_ns helper Vadim Fedorenko
                   ` (3 subsequent siblings)
  4 siblings, 1 reply; 19+ messages in thread
From: Vadim Fedorenko @ 2024-11-21  0:08 UTC (permalink / raw)
  To: Borislav Petkov, Alexei Starovoitov, Daniel Borkmann,
	Andrii Nakryiko, Eduard Zingerman, Thomas Gleixner, Yonghong Song,
	Vadim Fedorenko, Mykola Lysenko
  Cc: x86, bpf, Peter Zijlstra, Vadim Fedorenko, Martin KaFai Lau

New kfunc to return an arch-specific time counter. On x86 the BPF JIT
converts it into an rdtsc_ordered() call. Other architectures will get
JIT implementations too if supported. The fallback is
__arch_get_hw_counter().

Acked-by: Eduard Zingerman <eddyz87@gmail.com>
Acked-by: Andrii Nakryiko <andrii@kernel.org>
Acked-by: Yonghong Song <yonghong.song@linux.dev>
Signed-off-by: Vadim Fedorenko <vadfed@meta.com>
---
v7 -> v8:
* rename kfunc to bpf_get_cpu_time_counter()
v6 -> v7:
* change boot_cpu_has() -> cpu_feature_enabled() (Borislav)
* change __arch_get_hw_counter() back to use constant clock mode
  to avoid linking issues with CONFIG_HYPERV_TIMER or
  CONFIG_PARAVIRT_CLOCK enabled on x86.
v5 -> v6:
* add comment about dropping S64_MAX manipulation in jitted
  implementation of rdtsc_ordered (Alexei)
* add comment about using 'lfence;rdtsc' variant (Alexei)
* change the check in fixup_kfunc_call() (Eduard)
* make __arch_get_hw_counter() call more aligned with vDSO
  implementation (Yonghong)
v4 -> v5:
* use #if instead of #ifdef with IS_ENABLED
v3 -> v4:
* change name of the helper to bpf_get_cpu_cycles (Andrii)
* Hide the helper behind CONFIG_GENERIC_GETTIMEOFDAY to avoid exposing
  it on architectures which do not have vDSO functions and data
* reduce the scope of check of inlined functions in verifier to only 2,
  which are actually inlined.
v2 -> v3:
* change name of the helper to bpf_get_cpu_cycles_counter to explicitly
  mention what counter it provides (Andrii)
* move kfunc definition to bpf.h to use it in JIT.
* introduce another kfunc to convert cycles into nanoseconds as more
  meaningful time units for generic tracing use case (Andrii)
v1 -> v2:
* Fix incorrect function return value type to u64
* Introduce bpf_jit_inlines_kfunc_call() and use it in
  mark_fastcall_pattern_for_call() to avoid clobbering in case of
  running programs with no JIT (Eduard)
* Avoid rewriting instruction and check function pointer directly
  in JIT (Alexei)
* Change includes to fix compile issues on non x86 architectures
---
 arch/x86/net/bpf_jit_comp.c   | 39 +++++++++++++++++++++++++++++++++
 arch/x86/net/bpf_jit_comp32.c | 14 ++++++++++++
 include/linux/bpf.h           |  5 +++++
 include/linux/filter.h        |  1 +
 kernel/bpf/core.c             | 11 ++++++++++
 kernel/bpf/helpers.c          | 27 +++++++++++++++++++++++
 kernel/bpf/verifier.c         | 41 ++++++++++++++++++++++++++++++-----
 7 files changed, 132 insertions(+), 6 deletions(-)

diff --git a/arch/x86/net/bpf_jit_comp.c b/arch/x86/net/bpf_jit_comp.c
index a43fc5af973d..92431ab1a21e 100644
--- a/arch/x86/net/bpf_jit_comp.c
+++ b/arch/x86/net/bpf_jit_comp.c
@@ -2185,6 +2185,37 @@ st:			if (is_imm8(insn->off))
 		case BPF_JMP | BPF_CALL: {
 			u8 *ip = image + addrs[i - 1];
 
+			if (insn->src_reg == BPF_PSEUDO_KFUNC_CALL &&
+			    imm32 == BPF_CALL_IMM(bpf_get_cpu_time_counter)) {
+				/* The default implementation of this kfunc uses
+				 * __arch_get_hw_counter() which is implemented as
+				 * `(u64)rdtsc_ordered() & S64_MAX`. We skip masking
+				 * part because we assume it's not needed in BPF
+				 * use case (two measurements close in time).
+				 * Original code for rdtsc_ordered() uses sequence:
+				 * 'rdtsc; nop; nop; nop' to patch it into
+				 * 'lfence; rdtsc' or 'rdtscp' depending on CPU features.
+				 * JIT uses 'lfence; rdtsc' variant because BPF program
+				 * doesn't care about cookie provided by rdtscp in RCX.
+				 * Save RDX because RDTSC will use EDX:EAX to return u64
+				 */
+				emit_mov_reg(&prog, true, AUX_REG, BPF_REG_3);
+				if (cpu_feature_enabled(X86_FEATURE_LFENCE_RDTSC))
+					EMIT_LFENCE();
+				EMIT2(0x0F, 0x31);
+
+				/* shl RDX, 32 */
+				maybe_emit_1mod(&prog, BPF_REG_3, true);
+				EMIT3(0xC1, add_1reg(0xE0, BPF_REG_3), 32);
+				/* or RAX, RDX */
+				maybe_emit_mod(&prog, BPF_REG_0, BPF_REG_3, true);
+				EMIT2(0x09, add_2reg(0xC0, BPF_REG_0, BPF_REG_3));
+				/* restore RDX from R11 */
+				emit_mov_reg(&prog, true, BPF_REG_3, AUX_REG);
+
+				break;
+			}
+
 			func = (u8 *) __bpf_call_base + imm32;
 			if (src_reg == BPF_PSEUDO_CALL && tail_call_reachable) {
 				LOAD_TAIL_CALL_CNT_PTR(stack_depth);
@@ -3791,3 +3822,11 @@ u64 bpf_arch_uaddress_limit(void)
 {
 	return 0;
 }
+
+/* x86-64 JIT can inline kfunc */
+bool bpf_jit_inlines_kfunc_call(s32 imm)
+{
+	if (imm == BPF_CALL_IMM(bpf_get_cpu_time_counter))
+		return true;
+	return false;
+}
diff --git a/arch/x86/net/bpf_jit_comp32.c b/arch/x86/net/bpf_jit_comp32.c
index de0f9e5f9f73..a549aea25f5f 100644
--- a/arch/x86/net/bpf_jit_comp32.c
+++ b/arch/x86/net/bpf_jit_comp32.c
@@ -2094,6 +2094,13 @@ static int do_jit(struct bpf_prog *bpf_prog, int *addrs, u8 *image,
 			if (insn->src_reg == BPF_PSEUDO_KFUNC_CALL) {
 				int err;
 
+				if (imm32 == BPF_CALL_IMM(bpf_get_cpu_time_counter)) {
+					if (cpu_feature_enabled(X86_FEATURE_LFENCE_RDTSC))
+						EMIT3(0x0F, 0xAE, 0xE8);
+					EMIT2(0x0F, 0x31);
+					break;
+				}
+
 				err = emit_kfunc_call(bpf_prog,
 						      image + addrs[i],
 						      insn, &prog);
@@ -2621,3 +2628,10 @@ bool bpf_jit_supports_kfunc_call(void)
 {
 	return true;
 }
+
+bool bpf_jit_inlines_kfunc_call(s32 imm)
+{
+	if (imm == BPF_CALL_IMM(bpf_get_cpu_time_counter))
+		return true;
+	return false;
+}
diff --git a/include/linux/bpf.h b/include/linux/bpf.h
index 3ace0d6227e3..6d540253cfb4 100644
--- a/include/linux/bpf.h
+++ b/include/linux/bpf.h
@@ -3333,6 +3333,11 @@ void bpf_user_rnd_init_once(void);
 u64 bpf_user_rnd_u32(u64 r1, u64 r2, u64 r3, u64 r4, u64 r5);
 u64 bpf_get_raw_cpu_id(u64 r1, u64 r2, u64 r3, u64 r4, u64 r5);
 
+/* Inlined kfuncs */
+#if IS_ENABLED(CONFIG_GENERIC_GETTIMEOFDAY)
+u64 bpf_get_cpu_time_counter(void);
+#endif
+
 #if defined(CONFIG_NET)
 bool bpf_sock_common_is_valid_access(int off, int size,
 				     enum bpf_access_type type,
diff --git a/include/linux/filter.h b/include/linux/filter.h
index 3a21947f2fd4..9cf57233874f 100644
--- a/include/linux/filter.h
+++ b/include/linux/filter.h
@@ -1111,6 +1111,7 @@ struct bpf_prog *bpf_int_jit_compile(struct bpf_prog *prog);
 void bpf_jit_compile(struct bpf_prog *prog);
 bool bpf_jit_needs_zext(void);
 bool bpf_jit_inlines_helper_call(s32 imm);
+bool bpf_jit_inlines_kfunc_call(s32 imm);
 bool bpf_jit_supports_subprog_tailcalls(void);
 bool bpf_jit_supports_percpu_insn(void);
 bool bpf_jit_supports_kfunc_call(void);
diff --git a/kernel/bpf/core.c b/kernel/bpf/core.c
index 14d9288441f2..daa3ab458c8a 100644
--- a/kernel/bpf/core.c
+++ b/kernel/bpf/core.c
@@ -2965,6 +2965,17 @@ bool __weak bpf_jit_inlines_helper_call(s32 imm)
 	return false;
 }
 
+/* Return true if the JIT inlines the call to the kfunc corresponding to
+ * the imm.
+ *
+ * The verifier will not patch the insn->imm for the call to the kfunc if
+ * this returns true.
+ */
+bool __weak bpf_jit_inlines_kfunc_call(s32 imm)
+{
+	return false;
+}
+
 /* Return TRUE if the JIT backend supports mixing bpf2bpf and tailcalls. */
 bool __weak bpf_jit_supports_subprog_tailcalls(void)
 {
diff --git a/kernel/bpf/helpers.c b/kernel/bpf/helpers.c
index 751c150f9e1c..23f1a1606f8b 100644
--- a/kernel/bpf/helpers.c
+++ b/kernel/bpf/helpers.c
@@ -23,6 +23,10 @@
 #include <linux/btf_ids.h>
 #include <linux/bpf_mem_alloc.h>
 #include <linux/kasan.h>
+#if IS_ENABLED(CONFIG_GENERIC_GETTIMEOFDAY)
+#include <vdso/datapage.h>
+#include <asm/vdso/vsyscall.h>
+#endif
 
 #include "../../lib/kstrtox.h"
 
@@ -3057,6 +3061,26 @@ __bpf_kfunc int bpf_copy_from_user_str(void *dst, u32 dst__sz, const void __user
 	return ret + 1;
 }
 
+#if IS_ENABLED(CONFIG_GENERIC_GETTIMEOFDAY)
+__bpf_kfunc u64 bpf_get_cpu_time_counter(void)
+{
+	const struct vdso_data *vd = __arch_get_k_vdso_data();
+
+	vd = &vd[CS_RAW];
+
+	/* CS_RAW clock_mode translates to VDSO_CLOCKMODE_TSC on x86 and
+	 * to VDSO_CLOCKMODE_ARCHTIMER on aarch64/risc-v. We cannot use
+	 * vd->clock_mode directly because it brings possible access to
+	 * pages visible by user-space only via vDSO. But the constant value
+	 * of 1 is exactly what we need - it works for any architecture and
+	 * translates to reading of HW timecounter regardless of architecture.
+	 * We still have to provide vdso_data for some architectures to avoid
+	 * NULL pointer dereference.
+	 */
+	return __arch_get_hw_counter(1, vd);
+}
+#endif
+
 __bpf_kfunc_end_defs();
 
 BTF_KFUNCS_START(generic_btf_ids)
@@ -3149,6 +3173,9 @@ BTF_ID_FLAGS(func, bpf_get_kmem_cache)
 BTF_ID_FLAGS(func, bpf_iter_kmem_cache_new, KF_ITER_NEW | KF_SLEEPABLE)
 BTF_ID_FLAGS(func, bpf_iter_kmem_cache_next, KF_ITER_NEXT | KF_RET_NULL | KF_SLEEPABLE)
 BTF_ID_FLAGS(func, bpf_iter_kmem_cache_destroy, KF_ITER_DESTROY | KF_SLEEPABLE)
+#if IS_ENABLED(CONFIG_GENERIC_GETTIMEOFDAY)
+BTF_ID_FLAGS(func, bpf_get_cpu_time_counter, KF_FASTCALL)
+#endif
 BTF_KFUNCS_END(common_btf_ids)
 
 static const struct btf_kfunc_id_set common_kfunc_set = {
diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
index 1c4ebb326785..dbfad4457bef 100644
--- a/kernel/bpf/verifier.c
+++ b/kernel/bpf/verifier.c
@@ -16407,6 +16407,24 @@ static bool verifier_inlines_helper_call(struct bpf_verifier_env *env, s32 imm)
 	}
 }
 
+/* True if fixup_kfunc_call() replaces calls to kfunc number 'imm',
+ * replacement patch is presumed to follow bpf_fastcall contract
+ * (see mark_fastcall_pattern_for_call() below).
+ */
+static bool verifier_inlines_kfunc_call(struct bpf_verifier_env *env, s32 imm)
+{
+	const struct bpf_kfunc_desc *desc = find_kfunc_desc(env->prog, imm, 0);
+
+	if (!env->prog->jit_requested)
+		return false;
+
+	if (desc->func_id == special_kfunc_list[KF_bpf_cast_to_kern_ctx] ||
+	    desc->func_id == special_kfunc_list[KF_bpf_rdonly_cast])
+		return true;
+
+	return false;
+}
+
 /* Same as helper_fastcall_clobber_mask() but for kfuncs, see comment above */
 static u32 kfunc_fastcall_clobber_mask(struct bpf_kfunc_call_arg_meta *meta)
 {
@@ -16534,7 +16552,10 @@ static void mark_fastcall_pattern_for_call(struct bpf_verifier_env *env,
 			return;
 
 		clobbered_regs_mask = kfunc_fastcall_clobber_mask(&meta);
-		can_be_inlined = is_fastcall_kfunc_call(&meta);
+		can_be_inlined = is_fastcall_kfunc_call(&meta) &&
+				 (verifier_inlines_kfunc_call(env, call->imm) ||
+				 (meta.btf == btf_vmlinux &&
+				  bpf_jit_inlines_kfunc_call(call->imm)));
 	}
 
 	if (clobbered_regs_mask == ALL_CALLER_SAVED_REGS)
@@ -20541,6 +20562,7 @@ static int fixup_kfunc_call(struct bpf_verifier_env *env, struct bpf_insn *insn,
 			    struct bpf_insn *insn_buf, int insn_idx, int *cnt)
 {
 	const struct bpf_kfunc_desc *desc;
+	s32 imm = insn->imm;
 
 	if (!insn->imm) {
 		verbose(env, "invalid kernel function call not eliminated in verifier pass\n");
@@ -20564,7 +20586,18 @@ static int fixup_kfunc_call(struct bpf_verifier_env *env, struct bpf_insn *insn,
 		insn->imm = BPF_CALL_IMM(desc->addr);
 	if (insn->off)
 		return 0;
-	if (desc->func_id == special_kfunc_list[KF_bpf_obj_new_impl] ||
+	if (verifier_inlines_kfunc_call(env, imm)) {
+		if (desc->func_id == special_kfunc_list[KF_bpf_cast_to_kern_ctx] ||
+		    desc->func_id == special_kfunc_list[KF_bpf_rdonly_cast]) {
+			insn_buf[0] = BPF_MOV64_REG(BPF_REG_0, BPF_REG_1);
+			*cnt = 1;
+		} else {
+			verbose(env, "verifier internal error: kfunc id %d has no inline code\n",
+				desc->func_id);
+			return -EFAULT;
+		}
+
+	} else if (desc->func_id == special_kfunc_list[KF_bpf_obj_new_impl] ||
 	    desc->func_id == special_kfunc_list[KF_bpf_percpu_obj_new_impl]) {
 		struct btf_struct_meta *kptr_struct_meta = env->insn_aux_data[insn_idx].kptr_struct_meta;
 		struct bpf_insn addr[2] = { BPF_LD_IMM64(BPF_REG_2, (long)kptr_struct_meta) };
@@ -20625,10 +20658,6 @@ static int fixup_kfunc_call(struct bpf_verifier_env *env, struct bpf_insn *insn,
 
 		__fixup_collection_insert_kfunc(&env->insn_aux_data[insn_idx], struct_meta_reg,
 						node_offset_reg, insn, insn_buf, cnt);
-	} else if (desc->func_id == special_kfunc_list[KF_bpf_cast_to_kern_ctx] ||
-		   desc->func_id == special_kfunc_list[KF_bpf_rdonly_cast]) {
-		insn_buf[0] = BPF_MOV64_REG(BPF_REG_0, BPF_REG_1);
-		*cnt = 1;
 	} else if (is_bpf_wq_set_callback_impl_kfunc(desc->func_id)) {
 		struct bpf_insn ld_addrs[2] = { BPF_LD_IMM64(BPF_REG_4, (long)env->prog->aux) };
 
-- 
2.43.5


^ permalink raw reply related	[flat|nested] 19+ messages in thread

* [PATCH bpf-next v8 2/4] bpf: add bpf_cpu_time_counter_to_ns helper
  2024-11-21  0:08 [PATCH bpf-next v8 0/4] bpf: add cpu cycles kfuncs Vadim Fedorenko
  2024-11-21  0:08 ` [PATCH bpf-next v8 1/4] bpf: add bpf_get_cpu_time_counter kfunc Vadim Fedorenko
@ 2024-11-21  0:08 ` Vadim Fedorenko
  2024-11-21 11:31   ` Peter Zijlstra
  2024-11-21  0:08 ` [PATCH bpf-next v8 3/4] selftests/bpf: add selftest to check rdtsc jit Vadim Fedorenko
                   ` (2 subsequent siblings)
  4 siblings, 1 reply; 19+ messages in thread
From: Vadim Fedorenko @ 2024-11-21  0:08 UTC (permalink / raw)
  To: Borislav Petkov, Alexei Starovoitov, Daniel Borkmann,
	Andrii Nakryiko, Eduard Zingerman, Thomas Gleixner, Yonghong Song,
	Vadim Fedorenko, Mykola Lysenko
  Cc: x86, bpf, Peter Zijlstra, Vadim Fedorenko, Martin KaFai Lau

The new helper should be used to convert cycles received by
bpf_get_cpu_cycle() into nanoseconds.

Reviewed-by: Eduard Zingerman <eddyz87@gmail.com>
Acked-by: Andrii Nakryiko <andrii@kernel.org>
Signed-off-by: Vadim Fedorenko <vadfed@meta.com>
---
v7 -> v8:
* rename helper to bpf_cpu_time_counter_to_ns
* use cyc2ns_read_begin()/cyc2ns_read_end() to get mult and shift
  values instead of manually recalculating them (Peter)
v6 -> v7:
* change boot_cpu_has() -> cpu_feature_enabled() (Borislav)
v4 -> v6:
* add comment about simplified implementation (Eduard)
v4:
* change helper name to bpf_cpu_cycles_to_ns.
* hide it behind CONFIG_GENERIC_GETTIMEOFDAY to avoid exposing on
  unsupported architectures.
---
 arch/x86/net/bpf_jit_comp.c   | 27 +++++++++++++++++++++++++++
 arch/x86/net/bpf_jit_comp32.c | 27 +++++++++++++++++++++++++++
 include/linux/bpf.h           |  1 +
 kernel/bpf/helpers.c          | 14 +++++++++++++-
 4 files changed, 68 insertions(+), 1 deletion(-)

diff --git a/arch/x86/net/bpf_jit_comp.c b/arch/x86/net/bpf_jit_comp.c
index 92431ab1a21e..d21e0ab55c94 100644
--- a/arch/x86/net/bpf_jit_comp.c
+++ b/arch/x86/net/bpf_jit_comp.c
@@ -11,10 +11,12 @@
 #include <linux/bpf.h>
 #include <linux/memory.h>
 #include <linux/sort.h>
+#include <linux/clocksource.h>
 #include <asm/extable.h>
 #include <asm/ftrace.h>
 #include <asm/set_memory.h>
 #include <asm/nospec-branch.h>
+#include <asm/timer.h>
 #include <asm/text-patching.h>
 #include <asm/unwind.h>
 #include <asm/cfi.h>
@@ -2216,6 +2218,28 @@ st:			if (is_imm8(insn->off))
 				break;
 			}
 
+			if (insn->src_reg == BPF_PSEUDO_KFUNC_CALL &&
+			    imm32 == BPF_CALL_IMM(bpf_cpu_time_counter_to_ns) &&
+			    using_native_sched_clock() && sched_clock_stable()) {
+				struct cyc2ns_data data;
+				u32 mult, shift;
+
+				cyc2ns_read_begin(&data);
+				mult = data.cyc2ns_mul;
+				shift = data.cyc2ns_shift;
+				cyc2ns_read_end();
+				/* imul RAX, RDI, mult */
+				maybe_emit_mod(&prog, BPF_REG_1, BPF_REG_0, true);
+				EMIT2_off32(0x69, add_2reg(0xC0, BPF_REG_1, BPF_REG_0),
+					    mult);
+
+				/* shr RAX, shift (which is less than 64) */
+				maybe_emit_1mod(&prog, BPF_REG_0, true);
+				EMIT3(0xC1, add_1reg(0xE8, BPF_REG_0), shift);
+
+				break;
+			}
+
 			func = (u8 *) __bpf_call_base + imm32;
 			if (src_reg == BPF_PSEUDO_CALL && tail_call_reachable) {
 				LOAD_TAIL_CALL_CNT_PTR(stack_depth);
@@ -3828,5 +3852,8 @@ bool bpf_jit_inlines_kfunc_call(s32 imm)
 {
 	if (imm == BPF_CALL_IMM(bpf_get_cpu_time_counter))
 		return true;
+	if (imm == BPF_CALL_IMM(bpf_cpu_time_counter_to_ns) &&
+	    using_native_sched_clock() && sched_clock_stable())
+		return true;
 	return false;
 }
diff --git a/arch/x86/net/bpf_jit_comp32.c b/arch/x86/net/bpf_jit_comp32.c
index a549aea25f5f..a2069a3ee4a3 100644
--- a/arch/x86/net/bpf_jit_comp32.c
+++ b/arch/x86/net/bpf_jit_comp32.c
@@ -12,10 +12,12 @@
 #include <linux/netdevice.h>
 #include <linux/filter.h>
 #include <linux/if_vlan.h>
+#include <linux/clocksource.h>
 #include <asm/cacheflush.h>
 #include <asm/set_memory.h>
 #include <asm/nospec-branch.h>
 #include <asm/asm-prototypes.h>
+#include <asm/timer.h>
 #include <linux/bpf.h>
 
 /*
@@ -2100,6 +2102,27 @@ static int do_jit(struct bpf_prog *bpf_prog, int *addrs, u8 *image,
 					EMIT2(0x0F, 0x31);
 					break;
 				}
+				if (imm32 == BPF_CALL_IMM(bpf_cpu_time_counter_to_ns) &&
+				    cpu_feature_enabled(X86_FEATURE_CONSTANT_TSC)) {
+					struct cyc2ns_data data;
+					u32 mult, shift;
+
+					cyc2ns_read_begin(&data);
+					mult = data.cyc2ns_mul;
+					shift = data.cyc2ns_shift;
+					cyc2ns_read_end();
+
+					/* move parameter to BPF_REG_0 */
+					emit_ia32_mov_r64(true, bpf2ia32[BPF_REG_0],
+							  bpf2ia32[BPF_REG_1], true, true,
+							  &prog, bpf_prog->aux);
+					/* multiply parameter by mult */
+					emit_ia32_mul_i64(bpf2ia32[BPF_REG_0],
+							  mult, true, &prog);
+					/* shift parameter by shift which is less than 64 */
+					emit_ia32_rsh_i64(bpf2ia32[BPF_REG_0],
+							  shift, true, &prog);
+				}
 
 				err = emit_kfunc_call(bpf_prog,
 						      image + addrs[i],
@@ -2633,5 +2656,9 @@ bool bpf_jit_inlines_kfunc_call(s32 imm)
 {
 	if (imm == BPF_CALL_IMM(bpf_get_cpu_time_counter))
 		return true;
+	if (imm == BPF_CALL_IMM(bpf_cpu_time_counter_to_ns) &&
+	    using_native_sched_clock() && sched_clock_stable())
+		return true;
+
 	return false;
 }
diff --git a/include/linux/bpf.h b/include/linux/bpf.h
index 6d540253cfb4..dd3c4ddfd60e 100644
--- a/include/linux/bpf.h
+++ b/include/linux/bpf.h
@@ -3336,6 +3336,7 @@ u64 bpf_get_raw_cpu_id(u64 r1, u64 r2, u64 r3, u64 r4, u64 r5);
 /* Inlined kfuncs */
 #if IS_ENABLED(CONFIG_GENERIC_GETTIMEOFDAY)
 u64 bpf_get_cpu_time_counter(void);
+u64 bpf_cpu_time_counter_to_ns(u64 cycles);
 #endif
 
 #if defined(CONFIG_NET)
diff --git a/kernel/bpf/helpers.c b/kernel/bpf/helpers.c
index 23f1a1606f8b..e4d461f2e98f 100644
--- a/kernel/bpf/helpers.c
+++ b/kernel/bpf/helpers.c
@@ -3079,8 +3079,19 @@ __bpf_kfunc u64 bpf_get_cpu_time_counter(void)
 	 */
 	return __arch_get_hw_counter(1, vd);
 }
-#endif
 
+__bpf_kfunc u64 bpf_cpu_time_counter_to_ns(u64 cycles)
+{
+	const struct vdso_data *vd = __arch_get_k_vdso_data();
+
+	vd = &vd[CS_RAW];
+	/* kfunc implementation does less manipulations than vDSO
+	 * implementation. BPF use-case assumes two measurements are close
+	 * in time and can simplify the logic.
+	 */
+	return mul_u64_u32_shr(cycles, vd->mult, vd->shift);
+}
+#endif
 __bpf_kfunc_end_defs();
 
 BTF_KFUNCS_START(generic_btf_ids)
@@ -3175,6 +3186,7 @@ BTF_ID_FLAGS(func, bpf_iter_kmem_cache_next, KF_ITER_NEXT | KF_RET_NULL | KF_SLE
 BTF_ID_FLAGS(func, bpf_iter_kmem_cache_destroy, KF_ITER_DESTROY | KF_SLEEPABLE)
 #if IS_ENABLED(CONFIG_GENERIC_GETTIMEOFDAY)
 BTF_ID_FLAGS(func, bpf_get_cpu_time_counter, KF_FASTCALL)
+BTF_ID_FLAGS(func, bpf_cpu_time_counter_to_ns, KF_FASTCALL)
 #endif
 BTF_KFUNCS_END(common_btf_ids)
 
-- 
2.43.5


^ permalink raw reply related	[flat|nested] 19+ messages in thread

* [PATCH bpf-next v8 3/4] selftests/bpf: add selftest to check rdtsc jit
  2024-11-21  0:08 [PATCH bpf-next v8 0/4] bpf: add cpu cycles kfuncs Vadim Fedorenko
  2024-11-21  0:08 ` [PATCH bpf-next v8 1/4] bpf: add bpf_get_cpu_time_counter kfunc Vadim Fedorenko
  2024-11-21  0:08 ` [PATCH bpf-next v8 2/4] bpf: add bpf_cpu_time_counter_to_ns helper Vadim Fedorenko
@ 2024-11-21  0:08 ` Vadim Fedorenko
  2024-11-21  0:08 ` [PATCH bpf-next v8 4/4] selftests/bpf: add usage example for cpu cycles kfuncs Vadim Fedorenko
  2024-11-22 11:34 ` [PATCH bpf-next v8 0/4] bpf: add cpu cycles kfuncs Peter Zijlstra
  4 siblings, 0 replies; 19+ messages in thread
From: Vadim Fedorenko @ 2024-11-21  0:08 UTC (permalink / raw)
  To: Borislav Petkov, Alexei Starovoitov, Daniel Borkmann,
	Andrii Nakryiko, Eduard Zingerman, Thomas Gleixner, Yonghong Song,
	Vadim Fedorenko, Mykola Lysenko
  Cc: x86, bpf, Peter Zijlstra, Vadim Fedorenko, Martin KaFai Lau

bpf_get_cpu_time_counter() is replaced with rdtsc instruction on x86_64.
Add tests to check it.

Acked-by: Eduard Zingerman <eddyz87@gmail.com>
Signed-off-by: Vadim Fedorenko <vadfed@meta.com>
---
 .../selftests/bpf/prog_tests/verifier.c       |   2 +
 .../selftests/bpf/progs/verifier_cpu_cycles.c | 104 ++++++++++++++++++
 2 files changed, 106 insertions(+)
 create mode 100644 tools/testing/selftests/bpf/progs/verifier_cpu_cycles.c

diff --git a/tools/testing/selftests/bpf/prog_tests/verifier.c b/tools/testing/selftests/bpf/prog_tests/verifier.c
index d9f65adb456b..6cbb8949164a 100644
--- a/tools/testing/selftests/bpf/prog_tests/verifier.c
+++ b/tools/testing/selftests/bpf/prog_tests/verifier.c
@@ -98,6 +98,7 @@
 #include "verifier_xdp_direct_packet_access.skel.h"
 #include "verifier_bits_iter.skel.h"
 #include "verifier_lsm.skel.h"
+#include "verifier_cpu_cycles.skel.h"
 
 #define MAX_ENTRIES 11
 
@@ -225,6 +226,7 @@ void test_verifier_xdp(void)                  { RUN(verifier_xdp); }
 void test_verifier_xdp_direct_packet_access(void) { RUN(verifier_xdp_direct_packet_access); }
 void test_verifier_bits_iter(void) { RUN(verifier_bits_iter); }
 void test_verifier_lsm(void)                  { RUN(verifier_lsm); }
+void test_verifier_cpu_cycles(void)           { RUN(verifier_cpu_cycles); }
 
 void test_verifier_mtu(void)
 {
diff --git a/tools/testing/selftests/bpf/progs/verifier_cpu_cycles.c b/tools/testing/selftests/bpf/progs/verifier_cpu_cycles.c
new file mode 100644
index 000000000000..5b62e3690362
--- /dev/null
+++ b/tools/testing/selftests/bpf/progs/verifier_cpu_cycles.c
@@ -0,0 +1,104 @@
+// SPDX-License-Identifier: GPL-2.0
+/* Copyright (c) 2022 Meta Inc. */
+#include "vmlinux.h"
+#include <bpf/bpf_helpers.h>
+#include <bpf/bpf_tracing.h>
+#include "bpf_misc.h"
+
+extern u64 bpf_cpu_time_counter_to_ns(u64 cycles) __weak __ksym;
+extern u64 bpf_get_cpu_time_counter(void) __weak __ksym;
+
+SEC("syscall")
+__arch_x86_64
+__xlated("0: call kernel-function")
+__naked int bpf_rdtsc(void)
+{
+	asm volatile(
+	"call %[bpf_get_cpu_time_counter];"
+	"exit"
+	:
+	: __imm(bpf_get_cpu_time_counter)
+	: __clobber_all
+	);
+}
+
+SEC("syscall")
+__arch_x86_64
+/* program entry for bpf_rdtsc_jit_x86_64(), regular function prologue */
+__jited("	endbr64")
+__jited("	nopl	(%rax,%rax)")
+__jited("	nopl	(%rax)")
+__jited("	pushq	%rbp")
+__jited("	movq	%rsp, %rbp")
+__jited("	endbr64")
+/* save RDX in R11 as it will be overwritten */
+__jited("	movq	%rdx, %r11")
+/* lfence may not be executed depending on cpu features */
+__jited("	{{(lfence|)}}")
+__jited("	rdtsc")
+/* combine EDX:EAX into RAX */
+__jited("	shlq	${{(32|0x20)}}, %rdx")
+__jited("	orq	%rdx, %rax")
+/* restore RDX from R11 */
+__jited("	movq	%r11, %rdx")
+__jited("	leave")
+__naked int bpf_rdtsc_jit_x86_64(void)
+{
+	asm volatile(
+	"call %[bpf_get_cpu_time_counter];"
+	"exit"
+	:
+	: __imm(bpf_get_cpu_time_counter)
+	: __clobber_all
+	);
+}
+
+SEC("syscall")
+__arch_x86_64
+__xlated("0: r1 = 42")
+__xlated("1: call kernel-function")
+__naked int bpf_cyc2ns(void)
+{
+	asm volatile(
+	"r1=0x2a;"
+	"call %[bpf_cpu_time_counter_to_ns];"
+	"exit"
+	:
+	: __imm(bpf_cpu_time_counter_to_ns)
+	: __clobber_all
+	);
+}
+
+SEC("syscall")
+__arch_x86_64
+/* program entry for bpf_cyc2ns_jit_x86(), regular function prologue */
+__jited("	endbr64")
+__jited("	nopl	(%rax,%rax)")
+__jited("	nopl	(%rax)")
+__jited("	pushq	%rbp")
+__jited("	movq	%rsp, %rbp")
+__jited("	endbr64")
+/* load the 64-bit immediate argument into RDI */
+__jited("	movabsq	$0x2a2a2a2a2a, %rdi")
+__jited("	imulq	${{.*}}, %rdi, %rax")
+__jited("	shrq	${{.*}}, %rax")
+__jited("	leave")
+__naked int bpf_cyc2ns_jit_x86(void)
+{
+	asm volatile(
+	"r1=0x2a2a2a2a2a ll;"
+	"call %[bpf_cpu_time_counter_to_ns];"
+	"exit"
+	:
+	: __imm(bpf_cpu_time_counter_to_ns)
+	: __clobber_all
+	);
+}
+
+void rdtsc(void)
+{
+	bpf_get_cpu_time_counter();
+	bpf_cpu_time_counter_to_ns(42);
+}
+
+char _license[] SEC("license") = "GPL";
-- 
2.43.5


^ permalink raw reply related	[flat|nested] 19+ messages in thread

* [PATCH bpf-next v8 4/4] selftests/bpf: add usage example for cpu cycles kfuncs
  2024-11-21  0:08 [PATCH bpf-next v8 0/4] bpf: add cpu cycles kfuncs Vadim Fedorenko
                   ` (2 preceding siblings ...)
  2024-11-21  0:08 ` [PATCH bpf-next v8 3/4] selftests/bpf: add selftest to check rdtsc jit Vadim Fedorenko
@ 2024-11-21  0:08 ` Vadim Fedorenko
  2024-11-22 11:34 ` [PATCH bpf-next v8 0/4] bpf: add cpu cycles kfuncs Peter Zijlstra
  4 siblings, 0 replies; 19+ messages in thread
From: Vadim Fedorenko @ 2024-11-21  0:08 UTC (permalink / raw)
  To: Borislav Petkov, Alexei Starovoitov, Daniel Borkmann,
	Andrii Nakryiko, Eduard Zingerman, Thomas Gleixner, Yonghong Song,
	Vadim Fedorenko, Mykola Lysenko
  Cc: x86, bpf, Peter Zijlstra, Vadim Fedorenko, Martin KaFai Lau

The selftest provides an example of how to measure the latency of bpf
kfunc/helper call in cycles and nanoseconds.

Signed-off-by: Vadim Fedorenko <vadfed@meta.com>
---
 .../bpf/prog_tests/test_cpu_cycles.c          | 35 +++++++++++++++++++
 .../selftests/bpf/progs/test_cpu_cycles.c     | 25 +++++++++++++
 2 files changed, 60 insertions(+)
 create mode 100644 tools/testing/selftests/bpf/prog_tests/test_cpu_cycles.c
 create mode 100644 tools/testing/selftests/bpf/progs/test_cpu_cycles.c

diff --git a/tools/testing/selftests/bpf/prog_tests/test_cpu_cycles.c b/tools/testing/selftests/bpf/prog_tests/test_cpu_cycles.c
new file mode 100644
index 000000000000..d7f3b66594b3
--- /dev/null
+++ b/tools/testing/selftests/bpf/prog_tests/test_cpu_cycles.c
@@ -0,0 +1,35 @@
+// SPDX-License-Identifier: GPL-2.0
+/* Copyright (c) 2024 Meta Inc. */
+
+#include <test_progs.h>
+#include "test_cpu_cycles.skel.h"
+
+static void cpu_cycles(void)
+{
+	LIBBPF_OPTS(bpf_test_run_opts, opts);
+	struct test_cpu_cycles *skel;
+	int err, pfd;
+
+	skel = test_cpu_cycles__open_and_load();
+	if (!ASSERT_OK_PTR(skel, "test_cpu_cycles open and load"))
+		return;
+
+	pfd = bpf_program__fd(skel->progs.bpf_cpu_cycles);
+	if (!ASSERT_GT(pfd, 0, "test_cpu_cycles fd"))
+		goto fail;
+
+	err = bpf_prog_test_run_opts(pfd, &opts);
+	if (!ASSERT_OK(err, "test_cpu_cycles test run"))
+		goto fail;
+
+	ASSERT_NEQ(skel->bss->cycles, 0, "test_cpu_cycles 0 cycles");
+	ASSERT_NEQ(skel->bss->ns, 0, "test_cpu_cycles 0 ns");
+fail:
+	test_cpu_cycles__destroy(skel);
+}
+
+void test_cpu_cycles(void)
+{
+	if (test__start_subtest("cpu_cycles"))
+		cpu_cycles();
+}
diff --git a/tools/testing/selftests/bpf/progs/test_cpu_cycles.c b/tools/testing/selftests/bpf/progs/test_cpu_cycles.c
new file mode 100644
index 000000000000..a7f8a4c6b854
--- /dev/null
+++ b/tools/testing/selftests/bpf/progs/test_cpu_cycles.c
@@ -0,0 +1,25 @@
+// SPDX-License-Identifier: GPL-2.0
+/* Copyright (c) 2024 Meta Inc. */
+
+#include "vmlinux.h"
+#include <bpf/bpf_helpers.h>
+
+extern u64 bpf_cpu_time_counter_to_ns(u64 cycles) __weak __ksym;
+extern u64 bpf_get_cpu_time_counter(void) __weak __ksym;
+
+__u64 cycles, ns;
+
+SEC("syscall")
+int bpf_cpu_cycles(void)
+{
+	struct bpf_pidns_info pidns;
+	__u64 start;
+
+	start = bpf_get_cpu_time_counter();
+	bpf_get_ns_current_pid_tgid(0, 0, &pidns, sizeof(struct bpf_pidns_info));
+	cycles = bpf_get_cpu_time_counter() - start;
+	ns = bpf_cpu_time_counter_to_ns(cycles);
+	return 0;
+}
+
+char _license[] SEC("license") = "GPL";
-- 
2.43.5


^ permalink raw reply related	[flat|nested] 19+ messages in thread

* Re: [PATCH bpf-next v8 2/4] bpf: add bpf_cpu_time_counter_to_ns helper
  2024-11-21  0:08 ` [PATCH bpf-next v8 2/4] bpf: add bpf_cpu_time_counter_to_ns helper Vadim Fedorenko
@ 2024-11-21 11:31   ` Peter Zijlstra
  0 siblings, 0 replies; 19+ messages in thread
From: Peter Zijlstra @ 2024-11-21 11:31 UTC (permalink / raw)
  To: Vadim Fedorenko
  Cc: Borislav Petkov, Alexei Starovoitov, Daniel Borkmann,
	Andrii Nakryiko, Eduard Zingerman, Thomas Gleixner, Yonghong Song,
	Vadim Fedorenko, Mykola Lysenko, x86, bpf, Martin KaFai Lau

On Wed, Nov 20, 2024 at 04:08:12PM -0800, Vadim Fedorenko wrote:
> The new helper should be used to convert cycles received by
> bpf_get_cpu_cycle() into nanoseconds.

This is pertinently false, as written here you're looking for a single
reading into a clock value, while that is exactly not what you want.

Please, go write real changelogs

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH bpf-next v8 1/4] bpf: add bpf_get_cpu_time_counter kfunc
  2024-11-21  0:08 ` [PATCH bpf-next v8 1/4] bpf: add bpf_get_cpu_time_counter kfunc Vadim Fedorenko
@ 2024-11-21 11:32   ` Peter Zijlstra
  2024-11-21 14:35     ` Vadim Fedorenko
  0 siblings, 1 reply; 19+ messages in thread
From: Peter Zijlstra @ 2024-11-21 11:32 UTC (permalink / raw)
  To: Vadim Fedorenko
  Cc: Borislav Petkov, Alexei Starovoitov, Daniel Borkmann,
	Andrii Nakryiko, Eduard Zingerman, Thomas Gleixner, Yonghong Song,
	Vadim Fedorenko, Mykola Lysenko, x86, bpf, Martin KaFai Lau

On Wed, Nov 20, 2024 at 04:08:11PM -0800, Vadim Fedorenko wrote:
> New kfunc to return ARCH-specific timecounter. For x86 BPF JIT converts
> it into rdtsc ordered call. Other architectures will get JIT
> implementation too if supported. The fallback is to
> __arch_get_hw_counter().

Still not a single word as to *WHY* and what you're going to do with
those values.

NAK

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH bpf-next v8 1/4] bpf: add bpf_get_cpu_time_counter kfunc
  2024-11-21 11:32   ` Peter Zijlstra
@ 2024-11-21 14:35     ` Vadim Fedorenko
  2024-11-21 15:33       ` Peter Zijlstra
  0 siblings, 1 reply; 19+ messages in thread
From: Vadim Fedorenko @ 2024-11-21 14:35 UTC (permalink / raw)
  To: Peter Zijlstra, Vadim Fedorenko
  Cc: Borislav Petkov, Alexei Starovoitov, Daniel Borkmann,
	Andrii Nakryiko, Eduard Zingerman, Thomas Gleixner, Yonghong Song,
	Mykola Lysenko, x86, bpf, Martin KaFai Lau

On 21/11/2024 03:32, Peter Zijlstra wrote:
> On Wed, Nov 20, 2024 at 04:08:11PM -0800, Vadim Fedorenko wrote:
>> New kfunc to return ARCH-specific timecounter. For x86 BPF JIT converts
>> it into rdtsc ordered call. Other architectures will get JIT
>> implementation too if supported. The fallback is to
>> __arch_get_hw_counter().
> 
> Still not a single word as to *WHY* and what you're going to do with
> those values.
> 
> NAK

Did you have a chance to read the cover letter?

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH bpf-next v8 1/4] bpf: add bpf_get_cpu_time_counter kfunc
  2024-11-21 14:35     ` Vadim Fedorenko
@ 2024-11-21 15:33       ` Peter Zijlstra
  2024-11-21 23:51         ` Andrii Nakryiko
  0 siblings, 1 reply; 19+ messages in thread
From: Peter Zijlstra @ 2024-11-21 15:33 UTC (permalink / raw)
  To: Vadim Fedorenko
  Cc: Vadim Fedorenko, Borislav Petkov, Alexei Starovoitov,
	Daniel Borkmann, Andrii Nakryiko, Eduard Zingerman,
	Thomas Gleixner, Yonghong Song, Mykola Lysenko, x86, bpf,
	Martin KaFai Lau

On Thu, Nov 21, 2024 at 06:35:39AM -0800, Vadim Fedorenko wrote:
> On 21/11/2024 03:32, Peter Zijlstra wrote:
> > On Wed, Nov 20, 2024 at 04:08:11PM -0800, Vadim Fedorenko wrote:
> > > New kfunc to return ARCH-specific timecounter. For x86 BPF JIT converts
> > > it into rdtsc ordered call. Other architectures will get JIT
> > > implementation too if supported. The fallback is to
> > > __arch_get_hw_counter().
> > 
> > Still not a single word as to *WHY* and what you're going to do with
> > those values.
> > 
> > NAK
> 
> Did you have a chance to read cover letter?

Cover letter is disposable and not retained when applying patches, as
such I rarely read it.

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH bpf-next v8 1/4] bpf: add bpf_get_cpu_time_counter kfunc
  2024-11-21 15:33       ` Peter Zijlstra
@ 2024-11-21 23:51         ` Andrii Nakryiko
  2024-11-21 23:55           ` Vadim Fedorenko
  0 siblings, 1 reply; 19+ messages in thread
From: Andrii Nakryiko @ 2024-11-21 23:51 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Vadim Fedorenko, Vadim Fedorenko, Borislav Petkov,
	Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko,
	Eduard Zingerman, Thomas Gleixner, Yonghong Song, Mykola Lysenko,
	x86, bpf, Martin KaFai Lau

On Thu, Nov 21, 2024 at 7:33 AM Peter Zijlstra <peterz@infradead.org> wrote:
>
> On Thu, Nov 21, 2024 at 06:35:39AM -0800, Vadim Fedorenko wrote:
> > On 21/11/2024 03:32, Peter Zijlstra wrote:
> > > On Wed, Nov 20, 2024 at 04:08:11PM -0800, Vadim Fedorenko wrote:
> > > > New kfunc to return ARCH-specific timecounter. For x86 BPF JIT converts
> > > > it into rdtsc ordered call. Other architectures will get JIT
> > > > implementation too if supported. The fallback is to
> > > > __arch_get_hw_counter().
> > >
> > > Still not a single word as to *WHY* and what you're going to do with
> > > those values.
> > >
> > > NAK
> >
> > Did you have a chance to read cover letter?
>
> Cover letter is disposable and not retained when applying patches, as
> such I rarely read it.

It's not disposable for BPF trees. We preserve it as part of the
merge commit for the patch set, e.g. [0]. Both bpf and netdev
maintainers use a set of scripts to apply patches (pw-apply,
specifically) which does all that automatically.

Vadim,

Please do another careful pass over the commit messages and cover
letter. I'd suggest moving the version history into the cover letter
(see other multi-version cover letters for an example). You can use an
example from your BPF selftests as an intended use case (measuring the
duration of some piece of BPF logic), and I'd also mention that this
is useful for measuring the duration between two related BPF events,
e.g. uprobe entry and exit, or kprobe entry/exit. kprobe.session and
uprobe.session programs are especially well suited for that, as they
allow capturing the initial timestamp, storing it in the session
cookie, then retrieving it in the return probe and calculating the
difference.
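
Roughly something like this (an untested sketch; it assumes the
kprobe.session kfuncs bpf_session_is_return()/bpf_session_cookie() and
the two kfuncs from this series, and the attach target is only a
placeholder):

#include "vmlinux.h"
#include <bpf/bpf_helpers.h>

extern u64 bpf_get_cpu_time_counter(void) __weak __ksym;
extern u64 bpf_cpu_time_counter_to_ns(u64 cycles) __weak __ksym;
extern bool bpf_session_is_return(void) __weak __ksym;
extern __u64 *bpf_session_cookie(void) __weak __ksym;

SEC("kprobe.session/do_sys_openat2")
int measure_func(struct pt_regs *ctx)
{
	__u64 *cookie = bpf_session_cookie();

	if (!bpf_session_is_return()) {
		/* entry: stash the raw counter value in the session cookie */
		*cookie = bpf_get_cpu_time_counter();
		return 0;
	}

	/* return: convert only the delta to nanoseconds */
	bpf_printk("took %llu ns",
		   bpf_cpu_time_counter_to_ns(bpf_get_cpu_time_counter() - *cookie));
	return 0;
}

char _license[] SEC("license") = "GPL";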

Please also update all the "cycles" references to "time counter",
stuff like that.

pw-bot: cr


  [0] https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next.git/commit/?h=bpf-next-6.13&id=379d5ee624eda6a897d9e1f7f88c68ea482bd5fa
  [1] https://git.kernel.org/pub/scm/linux/kernel/git/dborkman/pw.git/

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH bpf-next v8 1/4] bpf: add bpf_get_cpu_time_counter kfunc
  2024-11-21 23:51         ` Andrii Nakryiko
@ 2024-11-21 23:55           ` Vadim Fedorenko
  0 siblings, 0 replies; 19+ messages in thread
From: Vadim Fedorenko @ 2024-11-21 23:55 UTC (permalink / raw)
  To: Andrii Nakryiko, Peter Zijlstra
  Cc: Vadim Fedorenko, Borislav Petkov, Alexei Starovoitov,
	Daniel Borkmann, Andrii Nakryiko, Eduard Zingerman,
	Thomas Gleixner, Yonghong Song, Mykola Lysenko, x86, bpf,
	Martin KaFai Lau

On 21/11/2024 15:51, Andrii Nakryiko wrote:
> On Thu, Nov 21, 2024 at 7:33 AM Peter Zijlstra <peterz@infradead.org> wrote:
>>
>> On Thu, Nov 21, 2024 at 06:35:39AM -0800, Vadim Fedorenko wrote:
>>> On 21/11/2024 03:32, Peter Zijlstra wrote:
>>>> On Wed, Nov 20, 2024 at 04:08:11PM -0800, Vadim Fedorenko wrote:
>>>>> New kfunc to return ARCH-specific timecounter. For x86 BPF JIT converts
>>>>> it into rdtsc ordered call. Other architectures will get JIT
>>>>> implementation too if supported. The fallback is to
>>>>> __arch_get_hw_counter().
>>>>
>>>> Still not a single word as to *WHY* and what you're going to do with
>>>> those values.
>>>>
>>>> NAK
>>>
>>> Did you have a chance to read cover letter?
>>
>> Cover letter is disposable and not retained when applying patches, as
>> such I rarely read it.
> 
> It's not disposable for BPF trees. We preserve them as part of the
> merge commit for the patch set, e.g. [0]. Both bpf and netdev
> maintainers use a set of scripts to apply patches (pw-apply,
> specifically), that does all that automatically.

Yeah, was going to write the same...

> Vadim,
> 
> Please do another careful pass over commit messages and cover letter.
> I'd suggest moving the version history into cover letter (see other
> multi-version cover letter for an example). You can use an example
> from your BPF selftests as an intended use case (measuring the
> duration of some BPF piece of logic), and I'd also mention that this
> is useful to measure the duration of two related BPF events. E.g.,
> uprobe entry and exit, of kprobe entry/exit. kprobe.session and
> uprobe.session programs are especially well suited for that, as they
> allow to capture initial timestamp, store it in session cookie, then
> retrieve it in return probe and calculate the difference.

Sure, I'm on it already.

> 
> Please also update all the "cycles" references to "time counter",
> stuff like that.

Yeah, I've done it already in my local branch.

Thanks.

> 
> pw-bot: cr
> 
> 
>    [0] https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next.git/commit/?h=bpf-next-6.13&id=379d5ee624eda6a897d9e1f7f88c68ea482bd5fa
>    [1] https://git.kernel.org/pub/scm/linux/kernel/git/dborkman/pw.git/



^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH bpf-next v8 0/4] bpf: add cpu cycles kfuncs
  2024-11-21  0:08 [PATCH bpf-next v8 0/4] bpf: add cpu cycles kfuncs Vadim Fedorenko
                   ` (3 preceding siblings ...)
  2024-11-21  0:08 ` [PATCH bpf-next v8 4/4] selftests/bpf: add usage example for cpu cycles kfuncs Vadim Fedorenko
@ 2024-11-22 11:34 ` Peter Zijlstra
  2024-11-22 15:40   ` Vadim Fedorenko
  2024-11-26 18:12   ` Andrii Nakryiko
  4 siblings, 2 replies; 19+ messages in thread
From: Peter Zijlstra @ 2024-11-22 11:34 UTC (permalink / raw)
  To: Vadim Fedorenko
  Cc: Borislav Petkov, Alexei Starovoitov, Daniel Borkmann,
	Andrii Nakryiko, Eduard Zingerman, Thomas Gleixner, Yonghong Song,
	Vadim Fedorenko, Mykola Lysenko, x86, bpf, Martin KaFai Lau

On Wed, Nov 20, 2024 at 04:08:10PM -0800, Vadim Fedorenko wrote:
> This patchset adds 2 kfuncs to provide a way to precisely measure the
> time spent running some code. The first patch provides a way to get cpu
> cycles counter which is used to feed CLOCK_MONOTONIC_RAW. On x86
> architecture it is effectively rdtsc_ordered() function while on other
> architectures it falls back to __arch_get_hw_counter(). The second patch
> adds a kfunc to convert cpu cycles to nanoseconds using shift/mult
> constants discovered by kernel. The main use-case for this kfunc is to
> convert deltas of timestamp counter values into nanoseconds. It is not
> supposed to get CLOCK_MONOTONIC_RAW values as offset part is skipped.
> JIT version is done for x86 for now, on other architectures it falls
> back to slightly simplified version of vdso_calc_ns.

So having now read this. I'm still left wondering why you would want to
do this.

Is this just debug stuff, for when you're doing a poor man's profile
run? If it is, why do we care about all the precision or the ns. And why
aren't you using perf?

Is it something else?

Again, what are you going to do with this information?

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH bpf-next v8 0/4] bpf: add cpu cycles kfuncs
  2024-11-22 11:34 ` [PATCH bpf-next v8 0/4] bpf: add cpu cycles kfuncs Peter Zijlstra
@ 2024-11-22 15:40   ` Vadim Fedorenko
  2024-11-26 18:12   ` Andrii Nakryiko
  1 sibling, 0 replies; 19+ messages in thread
From: Vadim Fedorenko @ 2024-11-22 15:40 UTC (permalink / raw)
  To: Peter Zijlstra, Vadim Fedorenko
  Cc: Borislav Petkov, Alexei Starovoitov, Daniel Borkmann,
	Andrii Nakryiko, Eduard Zingerman, Thomas Gleixner, Yonghong Song,
	Mykola Lysenko, x86, bpf, Martin KaFai Lau

On 22/11/2024 03:34, Peter Zijlstra wrote:
> On Wed, Nov 20, 2024 at 04:08:10PM -0800, Vadim Fedorenko wrote:
>> This patchset adds 2 kfuncs to provide a way to precisely measure the
>> time spent running some code. The first patch provides a way to get cpu
>> cycles counter which is used to feed CLOCK_MONOTONIC_RAW. On x86
>> architecture it is effectively rdtsc_ordered() function while on other
>> architectures it falls back to __arch_get_hw_counter(). The second patch
>> adds a kfunc to convert cpu cycles to nanoseconds using shift/mult
>> constants discovered by kernel. The main use-case for this kfunc is to
>> convert deltas of timestamp counter values into nanoseconds. It is not
>> supposed to get CLOCK_MONOTONIC_RAW values as offset part is skipped.
>> JIT version is done for x86 for now, on other architectures it falls
>> back to slightly simplified version of vdso_calc_ns.
> 
> So having now read this. I'm still left wondering why you would want to
> do this.
> 
> Is this just debug stuff, for when you're doing a poor man's profile
> run? If it is, why do we care about all the precision or the ns. And why
> aren't you using perf?
> 
> Is it something else?
> 
> Again, what are you going to do with this information?

We do a lot of benchmarking at scale. We benchmark kernel functions as
well as our own BPF programs, and we already do it using
bpf_ktime_get_ns(). This patchset optimizes the benchmark use-case by
removing the overhead of the double conversion from TSC to ns when we
only need a delta value as the result of a benchmark. Another
optimization, which has an even bigger effect, is removing the overhead
of function calls: as you can see, both kfuncs are fully inlined on
x86, reducing the number of instructions from hundreds to a
single-digit number and eliminating the calls themselves. The precision
comes next; it lets us better understand the effect of fast exits of
some programs, but that is more like micro-benchmarking and may bring
fewer benefits.
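
To make it concrete, the two patterns look roughly like this (sketch
only, using the kfunc names from this series):

/* what we do today: two full helper calls, each doing its own
 * counter read plus tsc->ns conversion inside ktime_get_mono_fast_ns()
 */
__u64 t0 = bpf_ktime_get_ns();
/* ... code under measurement ... */
__u64 old_delta_ns = bpf_ktime_get_ns() - t0;

/* with this series: two inlined counter reads and a single cheap
 * conversion of the delta only
 */
__u64 c0 = bpf_get_cpu_time_counter();
/* ... code under measurement ... */
__u64 new_delta_ns = bpf_cpu_time_counter_to_ns(bpf_get_cpu_time_counter() - c0);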

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH bpf-next v8 0/4] bpf: add cpu cycles kfuncs
  2024-11-22 11:34 ` [PATCH bpf-next v8 0/4] bpf: add cpu cycles kfuncs Peter Zijlstra
  2024-11-22 15:40   ` Vadim Fedorenko
@ 2024-11-26 18:12   ` Andrii Nakryiko
  2024-11-28 11:27     ` Peter Zijlstra
  1 sibling, 1 reply; 19+ messages in thread
From: Andrii Nakryiko @ 2024-11-26 18:12 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Vadim Fedorenko, Borislav Petkov, Alexei Starovoitov,
	Daniel Borkmann, Andrii Nakryiko, Eduard Zingerman,
	Thomas Gleixner, Yonghong Song, Vadim Fedorenko, Mykola Lysenko,
	x86, bpf, Martin KaFai Lau

On Fri, Nov 22, 2024 at 3:34 AM Peter Zijlstra <peterz@infradead.org> wrote:
>
> On Wed, Nov 20, 2024 at 04:08:10PM -0800, Vadim Fedorenko wrote:
> > This patchset adds 2 kfuncs to provide a way to precisely measure the
> > time spent running some code. The first patch provides a way to get cpu
> > cycles counter which is used to feed CLOCK_MONOTONIC_RAW. On x86
> > architecture it is effectively rdtsc_ordered() function while on other
> > architectures it falls back to __arch_get_hw_counter(). The second patch
> > adds a kfunc to convert cpu cycles to nanoseconds using shift/mult
> > constants discovered by kernel. The main use-case for this kfunc is to
> > convert deltas of timestamp counter values into nanoseconds. It is not
> > supposed to get CLOCK_MONOTONIC_RAW values as offset part is skipped.
> > JIT version is done for x86 for now, on other architectures it falls
> > back to slightly simplified version of vdso_calc_ns.
>
> So having now read this. I'm still left wondering why you would want to
> do this.
>
> Is this just debug stuff, for when you're doing a poor man's profile
> run? If it is, why do we care about all the precision or the ns. And why
> aren't you using perf?

No, it's not debug stuff. It's meant to be used in production for
measuring durations of whatever is needed. Like uprobe entry/exit
duration, or time between scheduling switches, etc.

Vadim emphasizes benchmarking at scale, but that's a bit misleading.
It's not "benchmarking", it's measuring durations of relevant pairs of
events. In production and at scale, so the unnecessary overhead all
adds up. We'd like to have the minimal possible overhead for this time
passage measurement. And some durations are very brief, so precision
matters as well. And given this is meant to be later used to do
aggregation and comparison across large swaths of production hosts, we
have to have comparable units, which is why nanoseconds and not some
abstract "time cycles".

Does this address your concerns?

>
> Is it something else?
>
> Again, what are you going to do with this information?

There are many specific uses, all of which currently use pairs of
bpf_ktime_get_ns() helper calls; that helper calls into
ktime_get_mono_fast_ns(). These new kfuncs are meant to be a faster
replacement for 2 x bpf_ktime_get_ns() calls.

The information itself is collected and emitted into a centralized
data storage and querying system used by tons of engineers for
whatever they need. I'm not sure we need to go into specific
individual use cases. There are tons and they all vary. The common
need is to measure real wallclock time passage.

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH bpf-next v8 0/4] bpf: add cpu cycles kfuncs
  2024-11-26 18:12   ` Andrii Nakryiko
@ 2024-11-28 11:27     ` Peter Zijlstra
  2024-11-28 11:33       ` Peter Zijlstra
                         ` (2 more replies)
  0 siblings, 3 replies; 19+ messages in thread
From: Peter Zijlstra @ 2024-11-28 11:27 UTC (permalink / raw)
  To: Andrii Nakryiko
  Cc: Vadim Fedorenko, Borislav Petkov, Alexei Starovoitov,
	Daniel Borkmann, Andrii Nakryiko, Eduard Zingerman,
	Thomas Gleixner, Yonghong Song, Vadim Fedorenko, Mykola Lysenko,
	x86, bpf, Martin KaFai Lau

On Tue, Nov 26, 2024 at 10:12:57AM -0800, Andrii Nakryiko wrote:
> On Fri, Nov 22, 2024 at 3:34 AM Peter Zijlstra <peterz@infradead.org> wrote:
> >
> > On Wed, Nov 20, 2024 at 04:08:10PM -0800, Vadim Fedorenko wrote:
> > > This patchset adds 2 kfuncs to provide a way to precisely measure the
> > > time spent running some code. The first patch provides a way to get cpu
> > > cycles counter which is used to feed CLOCK_MONOTONIC_RAW. On x86
> > > architecture it is effectively rdtsc_ordered() function while on other
> > > architectures it falls back to __arch_get_hw_counter(). The second patch
> > > adds a kfunc to convert cpu cycles to nanoseconds using shift/mult
> > > constants discovered by kernel. The main use-case for this kfunc is to
> > > convert deltas of timestamp counter values into nanoseconds. It is not
> > > supposed to get CLOCK_MONOTONIC_RAW values as offset part is skipped.
> > > JIT version is done for x86 for now, on other architectures it falls
> > > back to slightly simplified version of vdso_calc_ns.
> >
> > So having now read this. I'm still left wondering why you would want to
> > do this.
> >
> > Is this just debug stuff, for when you're doing a poor man's profile
> > run? If it is, why do we care about all the precision or the ns. And why
> > aren't you using perf?
> 
> No, it's not debug stuff. It's meant to be used in production for
> measuring durations of whatever is needed. Like uprobe entry/exit
> duration, or time between scheduling switches, etc.
> 
> Vadim emphasizes benchmarking at scale, but that's a bit misleading.
> It's not "benchmarking", it's measuring durations of relevant pairs of
> events. In production and at scale, so the unnecessary overhead all
> adds up. We'd like to have the minimal possible overhead for this time
> passage measurement. And some durations are very brief,

You might want to consider leaving out the LFENCE before the RDTSC on
some of those, LFENCE isn't exactly cheap.

> so precision
> matters as well. And given this is meant to be later used to do
> aggregation and comparison across large swaths of production hosts, we
> have to have comparable units, which is why nanoseconds and not some
> abstract "time cycles".
> 
> Does this address your concerns?

Well, it's clearly useful for you guys, but I do worry about it. Even on
servers DVFS is starting to play a significant role. And the TSC is
unaffected by it.

Directly comparing these numbers, esp. across different systems makes no
sense to me. Yes putting them all in [ns] allows for comparison, but
you're still comparing fundamentally different things.

How does it make sense to measure uprobe entry/exit in wall-clock when
it can vary by at least a factor of 2 depending on DVFS. How does it
make sense to compare an x86-64 uprobe entry/exit to an aaargh64 one?

Or are you trying to estimate the fraction of overhead spend on
instrumentation instead of real work? Like, this machine spends 5% of
its wall-time in instrumentation, which is effectively not doing work?

The part I'm missing is how using wall-time for these things makes
sense.

I mean, if all you're doing is saying, hey, we appear to be spending X
on this action on this particular system Y doing workload Z (irrespective
of you then having like a million Ys) and this patch reduces X by half
given the same Y and Z. So patch must be awesome.

Then you don't need the conversion to [ns], and the DVFS angle is more
or less mitigated by the whole 'same workload' thing.



^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH bpf-next v8 0/4] bpf: add cpu cycles kfuncs
  2024-11-28 11:27     ` Peter Zijlstra
@ 2024-11-28 11:33       ` Peter Zijlstra
  2024-11-28 14:30         ` Vadim Fedorenko
  2024-11-28 14:28       ` Vadim Fedorenko
  2024-12-02 19:15       ` Andrii Nakryiko
  2 siblings, 1 reply; 19+ messages in thread
From: Peter Zijlstra @ 2024-11-28 11:33 UTC (permalink / raw)
  To: Andrii Nakryiko
  Cc: Vadim Fedorenko, Borislav Petkov, Alexei Starovoitov,
	Daniel Borkmann, Andrii Nakryiko, Eduard Zingerman,
	Thomas Gleixner, Yonghong Song, Vadim Fedorenko, Mykola Lysenko,
	x86, bpf, Martin KaFai Lau

On Thu, Nov 28, 2024 at 12:27:34PM +0100, Peter Zijlstra wrote:
> On Tue, Nov 26, 2024 at 10:12:57AM -0800, Andrii Nakryiko wrote:
> > On Fri, Nov 22, 2024 at 3:34 AM Peter Zijlstra <peterz@infradead.org> wrote:
> > >
> > > On Wed, Nov 20, 2024 at 04:08:10PM -0800, Vadim Fedorenko wrote:
> > > > This patchset adds 2 kfuncs to provide a way to precisely measure the
> > > > time spent running some code. The first patch provides a way to get cpu
> > > > cycles counter which is used to feed CLOCK_MONOTONIC_RAW. On x86
> > > > architecture it is effectively rdtsc_ordered() function while on other
> > > > architectures it falls back to __arch_get_hw_counter(). The second patch
> > > > adds a kfunc to convert cpu cycles to nanoseconds using shift/mult
> > > > constants discovered by kernel. The main use-case for this kfunc is to
> > > > convert deltas of timestamp counter values into nanoseconds. It is not
> > > > supposed to get CLOCK_MONOTONIC_RAW values as offset part is skipped.
> > > > JIT version is done for x86 for now, on other architectures it falls
> > > > back to slightly simplified version of vdso_calc_ns.
> > >
> > > So having now read this. I'm still left wondering why you would want to
> > > do this.
> > >
> > > Is this just debug stuff, for when you're doing a poor man's profile
> > > run? If it is, why do we care about all the precision or the ns. And why
> > > aren't you using perf?
> > 
> > No, it's not debug stuff. It's meant to be used in production for
> > measuring durations of whatever is needed. Like uprobe entry/exit
> > duration, or time between scheduling switches, etc.
> > 
> > Vadim emphasizes benchmarking at scale, but that's a bit misleading.
> > It's not "benchmarking", it's measuring durations of relevant pairs of
> > events. In production and at scale, so the unnecessary overhead all
> > adds up. We'd like to have the minimal possible overhead for this time
> > passage measurement. And some durations are very brief,
> 
> You might want to consider leaving out the LFENCE before the RDTSC on
> some of those, LFENCE isn't exactly cheap.
> 
> > so precision
> > matters as well. And given this is meant to be later used to do
> > aggregation and comparison across large swaths of production hosts, we
> > have to have comparable units, which is why nanoseconds and not some
> > abstract "time cycles".
> > 
> > Does this address your concerns?
> 
> Well, it's clearly useful for you guys, but I do worry about it. Even on
> servers DVFS is starting to play a significant role. And the TSC is
> unaffected by it.
> 
> Directly comparing these numbers, esp. across different systems makes no
> sense to me. Yes putting them all in [ns] allows for comparison, but
> you're still comparing fundamentally different things.
> 
> How does it make sense to measure uprobe entry/exit in wall-clock when
> it can vary by at least a factor of 2 depending on DVFS. How does it
> make sense to compare an x86-64 uprobe entry/exit to an aaargh64 one?
> 
> Or are you trying to estimate the fraction of overhead spend on
> instrumentation instead of real work? Like, this machine spends 5% of
> its wall-time in instrumentation, which is effectively not doing work?
> 
> The part I'm missing is how using wall-time for these things makes
> sense.
> 
> I mean, if all you're doing is saying, hey, we appear to be spending X
> on this action on this particular system Y doing workload Z (irrespecive
> of you then having like a million Ys) and this patch reduces X by half
> given the same Y and Z. So patch must be awesome.
> 
> Then you don't need the conversion to [ns], and the DVFS angle is more
> or less mitigated by the whole 'same workload' thing.
> 
> 

Anyway, latest patches are functionally good and Changelogs are fair.



^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH bpf-next v8 0/4] bpf: add cpu cycles kfuncss
  2024-11-28 11:27     ` Peter Zijlstra
  2024-11-28 11:33       ` Peter Zijlstra
@ 2024-11-28 14:28       ` Vadim Fedorenko
  2024-12-02 19:15       ` Andrii Nakryiko
  2 siblings, 0 replies; 19+ messages in thread
From: Vadim Fedorenko @ 2024-11-28 14:28 UTC (permalink / raw)
  To: Peter Zijlstra, Andrii Nakryiko
  Cc: Vadim Fedorenko, Borislav Petkov, Alexei Starovoitov,
	Daniel Borkmann, Andrii Nakryiko, Eduard Zingerman,
	Thomas Gleixner, Yonghong Song, Mykola Lysenko, x86, bpf,
	Martin KaFai Lau

On 28/11/2024 11:27, Peter Zijlstra wrote:
> On Tue, Nov 26, 2024 at 10:12:57AM -0800, Andrii Nakryiko wrote:
>> On Fri, Nov 22, 2024 at 3:34 AM Peter Zijlstra <peterz@infradead.org> wrote:
>>>
>>> On Wed, Nov 20, 2024 at 04:08:10PM -0800, Vadim Fedorenko wrote:
>>>> This patchset adds 2 kfuncs to provide a way to precisely measure the
>>>> time spent running some code. The first patch provides a way to get cpu
>>>> cycles counter which is used to feed CLOCK_MONOTONIC_RAW. On x86
>>>> architecture it is effectively rdtsc_ordered() function while on other
>>>> architectures it falls back to __arch_get_hw_counter(). The second patch
>>>> adds a kfunc to convert cpu cycles to nanoseconds using shift/mult
>>>> constants discovered by kernel. The main use-case for this kfunc is to
>>>> convert deltas of timestamp counter values into nanoseconds. It is not
>>>> supposed to get CLOCK_MONOTONIC_RAW values as offset part is skipped.
>>>> JIT version is done for x86 for now, on other architectures it falls
>>>> back to slightly simplified version of vdso_calc_ns.
>>>
>>> So having now read this. I'm still left wondering why you would want to
>>> do this.
>>>
>>> Is this just debug stuff, for when you're doing a poor man's profile
>>> run? If it is, why do we care about all the precision or the ns. And why
>>> aren't you using perf?
>>
>> No, it's not debug stuff. It's meant to be used in production for
>> measuring durations of whatever is needed. Like uprobe entry/exit
>> duration, or time between scheduling switches, etc.
>>
>> Vadim emphasizes benchmarking at scale, but that's a bit misleading.
>> It's not "benchmarking", it's measuring durations of relevant pairs of
>> events. In production and at scale, so the unnecessary overhead all
>> adds up. We'd like to have the minimal possible overhead for this time
>> passage measurement. And some durations are very brief,
> 
> You might want to consider leaving out the LFENCE before the RDTSC on
> some of those, LFENCE isn't exactly cheap.

I was considering this option. Unfortunately, RDTSC without LFENCE may
be executed well out of order by the CPU and can easily bring more noise.
We have seen some effects of LFENCE being quite expensive on high core
count machines and we will continue monitoring it. I might add another
helper in the future if the situation becomes unacceptable.
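
For reference, a minimal userspace sketch of the two variants under
discussion (the helper names are made up for illustration, and the
kernel's rdtsc_ordered() uses runtime alternatives rather than a
hardcoded LFENCE):

#include <stdint.h>

/* Plain RDTSC: the read can be reordered with surrounding instructions,
 * so very short measured regions may look shorter or longer than they
 * really are. */
static inline uint64_t tsc_unordered(void)
{
	uint32_t lo, hi;

	asm volatile("rdtsc" : "=a"(lo), "=d"(hi));
	return ((uint64_t)hi << 32) | lo;
}

/* LFENCE; RDTSC: the fence keeps the timestamp read from being executed
 * ahead of earlier instructions, at the cost of the fence itself (the
 * overhead discussed above). */
static inline uint64_t tsc_ordered(void)
{
	uint32_t lo, hi;

	asm volatile("lfence\n\trdtsc" : "=a"(lo), "=d"(hi) :: "memory");
	return ((uint64_t)hi << 32) | lo;
}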

> 
>> so precision
>> matters as well. And given this is meant to be later used to do
>> aggregation and comparison across large swaths of production hosts, we
>> have to have comparable units, which is why nanoseconds and not some
>> abstract "time cycles".
>>
>> Does this address your concerns?
> 
> Well, it's clearly useful for you guys, but I do worry about it. Even on
> servers DVFS is starting to play a significant role. And the TSC is
> unaffected by it.
> 
> Directly comparing these numbers, esp. across different systems makes no
> sense to me. Yes putting them all in [ns] allows for comparison, but
> you're still comparing fundamentally different things.
> 
> How does it make sense to measure uprobe entry/exit in wall-clock when
> it can vary by at least a factor of 2 depending on DVFS. How does it
> make sense to compare an x86-64 uprobe entry/exit to an aaargh64 one?

I'm going to implement the JIT for aarch64 soon, and measuring wall-time
can bring more info about platform differences.

> 
> Or are you trying to estimate the fraction of overhead spend on
> instrumentation instead of real work? Like, this machine spends 5% of
> its wall-time in instrumentation, which is effectively not doing work?
> 
> The part I'm missing is how using wall-time for these things makes
> sense.
> 
> I mean, if all you're doing is saying, hey, we appear to be spending X
> on this action on this particular system Y doing workload Z (irrespecive
> of you then having like a million Ys) and this patch reduces X by half
> given the same Y and Z. So patch must be awesome.

This is one of the use-cases. Another one is to show differences across
platforms, and that's what usually needs ns.

> 
> Then you don't need the conversion to [ns], and the DVFS angle is more
> or less mitigated by the whole 'same workload' thing.

We are thinking of this option too, but another point is that it's
usually easier for people to understand nanoseconds rather than
counter values.
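
To make that concrete, a simplified sketch of the mult/shift conversion
behind the kfunc (same idea as vdso_calc_ns minus the offset handling;
the names and constants here are illustrative, not the kernel's):

#include <stdint.h>

/* The kernel precomputes mult and shift so that, for a cycle delta:
 *
 *	ns ~= (cycles * mult) >> shift
 *
 * A 128-bit intermediate avoids overflow for large deltas. */
static uint64_t cycles_to_ns(uint64_t cycles, uint32_t mult, uint32_t shift)
{
	return (uint64_t)(((unsigned __int128)cycles * mult) >> shift);
}

/* Example with made-up constants: for a ~2.5 GHz counter the kernel
 * might pick mult/shift such that one cycle maps to roughly 0.4 ns,
 * so a delta of 2500 cycles converts to about 1000 ns. */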


^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH bpf-next v8 0/4] bpf: add cpu cycles kfuncss
  2024-11-28 11:33       ` Peter Zijlstra
@ 2024-11-28 14:30         ` Vadim Fedorenko
  0 siblings, 0 replies; 19+ messages in thread
From: Vadim Fedorenko @ 2024-11-28 14:30 UTC (permalink / raw)
  To: Peter Zijlstra, Andrii Nakryiko
  Cc: Vadim Fedorenko, Borislav Petkov, Alexei Starovoitov,
	Daniel Borkmann, Andrii Nakryiko, Eduard Zingerman,
	Thomas Gleixner, Yonghong Song, Mykola Lysenko, x86, bpf,
	Martin KaFai Lau

On 28/11/2024 11:33, Peter Zijlstra wrote:
> On Thu, Nov 28, 2024 at 12:27:34PM +0100, Peter Zijlstra wrote:
>> On Tue, Nov 26, 2024 at 10:12:57AM -0800, Andrii Nakryiko wrote:
>>> On Fri, Nov 22, 2024 at 3:34 AM Peter Zijlstra <peterz@infradead.org> wrote:
>>>>
>>>> On Wed, Nov 20, 2024 at 04:08:10PM -0800, Vadim Fedorenko wrote:
>>>>> This patchset adds 2 kfuncs to provide a way to precisely measure the
>>>>> time spent running some code. The first patch provides a way to get cpu
>>>>> cycles counter which is used to feed CLOCK_MONOTONIC_RAW. On x86
>>>>> architecture it is effectively rdtsc_ordered() function while on other
>>>>> architectures it falls back to __arch_get_hw_counter(). The second patch
>>>>> adds a kfunc to convert cpu cycles to nanoseconds using shift/mult
>>>>> constants discovered by kernel. The main use-case for this kfunc is to
>>>>> convert deltas of timestamp counter values into nanoseconds. It is not
>>>>> supposed to get CLOCK_MONOTONIC_RAW values as offset part is skipped.
>>>>> JIT version is done for x86 for now, on other architectures it falls
>>>>> back to slightly simplified version of vdso_calc_ns.
>>>>
>>>> So having now read this. I'm still left wondering why you would want to
>>>> do this.
>>>>
>>>> Is this just debug stuff, for when you're doing a poor man's profile
>>>> run? If it is, why do we care about all the precision or the ns. And why
>>>> aren't you using perf?
>>>
>>> No, it's not debug stuff. It's meant to be used in production for
>>> measuring durations of whatever is needed. Like uprobe entry/exit
>>> duration, or time between scheduling switches, etc.
>>>
>>> Vadim emphasizes benchmarking at scale, but that's a bit misleading.
>>> It's not "benchmarking", it's measuring durations of relevant pairs of
>>> events. In production and at scale, so the unnecessary overhead all
>>> adds up. We'd like to have the minimal possible overhead for this time
>>> passage measurement. And some durations are very brief,
>>
>> You might want to consider leaving out the LFENCE before the RDTSC on
>> some of those, LFENCE isn't exactly cheap.
>>
>>> so precision
>>> matters as well. And given this is meant to be later used to do
>>> aggregation and comparison across large swaths of production hosts, we
>>> have to have comparable units, which is why nanoseconds and not some
>>> abstract "time cycles".
>>>
>>> Does this address your concerns?
>>
>> Well, it's clearly useful for you guys, but I do worry about it. Even on
>> servers DVFS is starting to play a significant role. And the TSC is
>> unaffected by it.
>>
>> Directly comparing these numbers, esp. across different systems makes no
>> sense to me. Yes putting them all in [ns] allows for comparison, but
>> you're still comparing fundamentally different things.
>>
>> How does it make sense to measure uprobe entry/exit in wall-clock when
>> it can vary by at least a factor of 2 depending on DVFS. How does it
>> make sense to compare an x86-64 uprobe entry/exit to an aaargh64 one?
>>
>> Or are you trying to estimate the fraction of overhead spend on
>> instrumentation instead of real work? Like, this machine spends 5% of
>> its wall-time in instrumentation, which is effectively not doing work?
>>
>> The part I'm missing is how using wall-time for these things makes
>> sense.
>>
>> I mean, if all you're doing is saying, hey, we appear to be spending X
>> on this action on this particular system Y doing workload Z (irrespecive
>> of you then having like a million Ys) and this patch reduces X by half
>> given the same Y and Z. So patch must be awesome.
>>
>> Then you don't need the conversion to [ns], and the DVFS angle is more
>> or less mitigated by the whole 'same workload' thing.
>>
>>
> 
> Anyway, latest patches are functionally good and Changelogs are fair.

OK, thanks! I'm going to post v9 soon, as the kernel test robot found a
build issue, and I'll address some style issues, but no functional
changes are expected.

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH bpf-next v8 0/4] bpf: add cpu cycles kfuncss
  2024-11-28 11:27     ` Peter Zijlstra
  2024-11-28 11:33       ` Peter Zijlstra
  2024-11-28 14:28       ` Vadim Fedorenko
@ 2024-12-02 19:15       ` Andrii Nakryiko
  2 siblings, 0 replies; 19+ messages in thread
From: Andrii Nakryiko @ 2024-12-02 19:15 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Vadim Fedorenko, Borislav Petkov, Alexei Starovoitov,
	Daniel Borkmann, Andrii Nakryiko, Eduard Zingerman,
	Thomas Gleixner, Yonghong Song, Vadim Fedorenko, Mykola Lysenko,
	x86, bpf, Martin KaFai Lau

On Thu, Nov 28, 2024 at 3:27 AM Peter Zijlstra <peterz@infradead.org> wrote:
>
> On Tue, Nov 26, 2024 at 10:12:57AM -0800, Andrii Nakryiko wrote:
> > On Fri, Nov 22, 2024 at 3:34 AM Peter Zijlstra <peterz@infradead.org> wrote:
> > >
> > > On Wed, Nov 20, 2024 at 04:08:10PM -0800, Vadim Fedorenko wrote:
> > > > This patchset adds 2 kfuncs to provide a way to precisely measure the
> > > > time spent running some code. The first patch provides a way to get cpu
> > > > cycles counter which is used to feed CLOCK_MONOTONIC_RAW. On x86
> > > > architecture it is effectively rdtsc_ordered() function while on other
> > > > architectures it falls back to __arch_get_hw_counter(). The second patch
> > > > adds a kfunc to convert cpu cycles to nanoseconds using shift/mult
> > > > constants discovered by kernel. The main use-case for this kfunc is to
> > > > convert deltas of timestamp counter values into nanoseconds. It is not
> > > > supposed to get CLOCK_MONOTONIC_RAW values as offset part is skipped.
> > > > JIT version is done for x86 for now, on other architectures it falls
> > > > back to slightly simplified version of vdso_calc_ns.
> > >
> > > So having now read this. I'm still left wondering why you would want to
> > > do this.
> > >
> > > Is this just debug stuff, for when you're doing a poor man's profile
> > > run? If it is, why do we care about all the precision or the ns. And why
> > > aren't you using perf?
> >
> > No, it's not debug stuff. It's meant to be used in production for
> > measuring durations of whatever is needed. Like uprobe entry/exit
> > duration, or time between scheduling switches, etc.
> >
> > Vadim emphasizes benchmarking at scale, but that's a bit misleading.
> > It's not "benchmarking", it's measuring durations of relevant pairs of
> > events. In production and at scale, so the unnecessary overhead all
> > adds up. We'd like to have the minimal possible overhead for this time
> > passage measurement. And some durations are very brief,
>
> You might want to consider leaving out the LFENCE before the RDTSC on
> some of those, LFENCE isn't exactly cheap.
>
> > so precision
> > matters as well. And given this is meant to be later used to do
> > aggregation and comparison across large swaths of production hosts, we
> > have to have comparable units, which is why nanoseconds and not some
> > abstract "time cycles".
> >
> > Does this address your concerns?
>
> Well, it's clearly useful for you guys, but I do worry about it. Even on
> servers DVFS is starting to play a significant role. And the TSC is
> unaffected by it.
>
> Directly comparing these numbers, esp. across different systems makes no
> sense to me. Yes putting them all in [ns] allows for comparison, but
> you're still comparing fundamentally different things.
>
> How does it make sense to measure uprobe entry/exit in wall-clock when
> it can vary by at least a factor of 2 depending on DVFS. How does it
> make sense to compare an x86-64 uprobe entry/exit to an aaargh64 one?
>
> Or are you trying to estimate the fraction of overhead spend on
> instrumentation instead of real work? Like, this machine spends 5% of
> its wall-time in instrumentation, which is effectively not doing work?

Yes, exactly. I think most of the time it will be comparisons based on
percentages. E.g., we can measure the total latency of handling the whole
request, and separately the total time spent on some extra metrics
collection and/or logging during that request. This would show how
much (in relative terms) this extra metrics/logging infrastructure
costs, and would inform any of the planned efficiency work. And it's
just one possible use case.
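
As a toy illustration of that relative-cost computation (all numbers
made up):

#include <stdint.h>
#include <stdio.h>

int main(void)
{
	/* Hypothetical per-request totals, e.g. aggregated from the
	 * BPF-side duration measurements discussed in this thread. */
	uint64_t request_ns = 2000000;		/* 2 ms end to end */
	uint64_t instrumentation_ns = 100000;	/* 100 us of extra work */

	double pct = 100.0 * (double)instrumentation_ns / (double)request_ns;

	printf("instrumentation overhead: %.1f%%\n", pct);	/* 5.0% */
	return 0;
}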

We do collect CPU cycles-based measurements as well, just to be clear.
But we have lots of time-based data collection, and currently we are
just using bpf_ktime_get_ns() for those.
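
For what it's worth, a hedged sketch of what that replacement could look
like on the BPF side, using the kfunc names from this series (the attach
point is a placeholder, and the selftest in patch 4/4 is the
authoritative usage example):

// SPDX-License-Identifier: GPL-2.0
#include <vmlinux.h>
#include <bpf/bpf_helpers.h>
#include <bpf/bpf_tracing.h>

/* kfuncs added by this series */
extern u64 bpf_get_cpu_time_counter(void) __ksym;
extern u64 bpf_cpu_time_counter_to_ns(u64 cycles) __ksym;

/* single global slot for brevity; a real program would key this
 * by pid or CPU */
u64 start_cycles;
u64 last_duration_ns;

SEC("fentry/some_traced_function")	/* placeholder attach point */
int BPF_PROG(on_entry)
{
	/* previously: start_ns = bpf_ktime_get_ns(); */
	start_cycles = bpf_get_cpu_time_counter();
	return 0;
}

SEC("fexit/some_traced_function")	/* placeholder attach point */
int BPF_PROG(on_exit)
{
	u64 delta = bpf_get_cpu_time_counter() - start_cycles;

	/* Convert only the delta: the conversion kfunc skips the clock
	 * offset, so absolute values are not CLOCK_MONOTONIC_RAW. */
	last_duration_ns = bpf_cpu_time_counter_to_ns(delta);
	return 0;
}

char _license[] SEC("license") = "GPL";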

>
> The part I'm missing is how using wall-time for these things makes
> sense.
>
> I mean, if all you're doing is saying, hey, we appear to be spending X
> on this action on this particular system Y doing workload Z (irrespecive
> of you then having like a million Ys) and this patch reduces X by half
> given the same Y and Z. So patch must be awesome.
>
> Then you don't need the conversion to [ns], and the DVFS angle is more
> or less mitigated by the whole 'same workload' thing.
>

We run big services across different types of machines, and usually
people look at aggregate metrics across all of them. It might not be
the most accurate and precise way to quantify overheads, but it seems
to be good enough in practice to drive (some of) the efficiency work.
They *can* dive deeper and look at a breakdown by specific CPU type, if
they care, but it's totally up to them (all this data is published as
self-service data sets, so tons of people have various uses for it, and
they often don't consult anyone involved in the actual collection of
this data).

In summary, I think we understand the DVFS concern you have, but in a
lot of cases it doesn't matter for the specific use cases our customers
have. All in all, these new APIs seem good and useful and are an
improvement over the currently abused bpf_ktime_get_ns(). Thanks for
your reviews!

^ permalink raw reply	[flat|nested] 19+ messages in thread

end of thread, other threads:[~2024-12-02 19:15 UTC | newest]

Thread overview: 19+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2024-11-21  0:08 [PATCH bpf-next v8 0/4] bpf: add cpu cycles kfuncss Vadim Fedorenko
2024-11-21  0:08 ` [PATCH bpf-next v8 1/4] bpf: add bpf_get_cpu_time_counter kfunc Vadim Fedorenko
2024-11-21 11:32   ` Peter Zijlstra
2024-11-21 14:35     ` Vadim Fedorenko
2024-11-21 15:33       ` Peter Zijlstra
2024-11-21 23:51         ` Andrii Nakryiko
2024-11-21 23:55           ` Vadim Fedorenko
2024-11-21  0:08 ` [PATCH bpf-next v8 2/4] bpf: add bpf_cpu_time_counter_to_ns helper Vadim Fedorenko
2024-11-21 11:31   ` Peter Zijlstra
2024-11-21  0:08 ` [PATCH bpf-next v8 3/4] selftests/bpf: add selftest to check rdtsc jit Vadim Fedorenko
2024-11-21  0:08 ` [PATCH bpf-next v8 4/4] selftests/bpf: add usage example for cpu cycles kfuncs Vadim Fedorenko
2024-11-22 11:34 ` [PATCH bpf-next v8 0/4] bpf: add cpu cycles kfuncss Peter Zijlstra
2024-11-22 15:40   ` Vadim Fedorenko
2024-11-26 18:12   ` Andrii Nakryiko
2024-11-28 11:27     ` Peter Zijlstra
2024-11-28 11:33       ` Peter Zijlstra
2024-11-28 14:30         ` Vadim Fedorenko
2024-11-28 14:28       ` Vadim Fedorenko
2024-12-02 19:15       ` Andrii Nakryiko
