* [PATCH bpf-next v10 0/4] bpf: add cpu time counter kfuncs
@ 2025-03-11 15:48 Vadim Fedorenko
2025-03-11 15:48 ` [PATCH bpf-next v10 1/4] bpf: add bpf_get_cpu_time_counter kfunc Vadim Fedorenko
` (3 more replies)
0 siblings, 4 replies; 6+ messages in thread
From: Vadim Fedorenko @ 2025-03-11 15:48 UTC (permalink / raw)
To: Borislav Petkov, Alexei Starovoitov, Daniel Borkmann,
Andrii Nakryiko, Eduard Zingerman, Thomas Gleixner, Yonghong Song,
Vadim Fedorenko, Mykola Lysenko
Cc: x86, bpf, Peter Zijlstra, Vadim Fedorenko, Martin KaFai Lau
This patchset adds 2 kfuncs to provide a way to precisely measure the
time spent running some code. The first patch provides a way to get the
CPU cycles counter which is used to feed CLOCK_MONOTONIC_RAW. On x86
it is effectively the rdtsc_ordered() function. The second patch
adds a kfunc to convert CPU cycles to nanoseconds using the shift/mult
constants discovered by the kernel. The main use-case for this kfunc is to
convert deltas of timestamp counter values into nanoseconds. It is not
supposed to produce CLOCK_MONOTONIC_RAW values because the offset part is
skipped. The JIT version is implemented for x86 only for now; other
architectures fall back to reading CLOCK_MONOTONIC_RAW values.
The reason to have these functions is to avoid the overhead added by
a bpf_ktime_get_ns() call when benchmarking, where two timestamps
are taken to compute a delta. With both functions JITed, the
overhead is minimal and the result has better precision. The new functions
can be used to benchmark BPF code directly in the program, or they can be
used in a kprobe/uprobe to store the timestamp counter in the session cookie
so that in a kretprobe/uretprobe the delta can be calculated and
converted into nanoseconds. A minimal sketch of the inline variant is shown below.
Selftests are also added to check whether the JIT implementation is
correct and to show the simplest usage example.
Change log:
v9 -> v10:
* rework fallback implementation to avoid using vDSO data from
kernel space.
* add comment about using "LFENCE; RDTSC" instead of "RDTSCP"
* guard x86 JIT implementation to be sure that TSC is enabled and
stable
* v9 link:
https://lore.kernel.org/bpf/20241123005833.810044-1-vadfed@meta.com/
v8 -> v9:
* rewording of commit messages, no code changes
* move change log from each patch into cover letter
v7 -> v8:
* rename kfuncs again to bpf_get_cpu_time_counter() and
bpf_cpu_time_counter_to_ns()
* use cyc2ns_read_begin()/cyc2ns_read_end() to get mult and shift
constants in bpf_cpu_time_counter_to_ns()
v6 -> v7:
* change boot_cpu_has() to cpu_feature_enabled() (Borislav)
* return constant clock_mode in __arch_get_hw_counter() call
v5 -> v6:
* added cover letter
* add comment about dropping S64_MAX manipulation in jitted
implementation of rdtsc_ordered() (Alexey)
* add comment about using 'lfence;rdtsc' variant (Alexey)
* change the check in fixup_kfunc_call() (Eduard)
* make __arch_get_hw_counter() call more aligned with vDSO
implementation (Yonghong)
v4 -> v5:
* use #if instead of #ifdef with IS_ENABLED
v3 -> v4:
* change name of the helper to bpf_get_cpu_cycles (Andrii)
* Hide the helper behind CONFIG_GENERIC_GETTIMEOFDAY to avoid exposing
it on architectures which do not have vDSO functions and data
* reduce the scope of check of inlined functions in verifier to only 2,
which are actually inlined.
* change helper name to bpf_cpu_cycles_to_ns.
* hide it behind CONFIG_GENERIC_GETTIMEOFDAY to avoid exposing on
unsupported architectures.
v2 -> v3:
* change name of the helper to bpf_get_cpu_cycles_counter to
explicitly mention what counter it provides (Andrii)
* move kfunc definition to bpf.h to use it in JIT.
* introduce another kfunc to convert cycles into nanoseconds as
more meaningful time units for generic tracing use case (Andrii)
v1 -> v2:
* Fix incorrect function return value type to u64
* Introduce bpf_jit_inlines_kfunc_call() and use it in
mark_fastcall_pattern_for_call() to avoid clobbering in case
of running programs with no JIT (Eduard)
* Avoid rewriting instruction and check function pointer directly
in JIT (Alexei)
* Change includes to fix compile issues on non x86 architectures
Vadim Fedorenko (4):
bpf: add bpf_get_cpu_time_counter kfunc
bpf: add bpf_cpu_time_counter_to_ns helper
selftests/bpf: add selftest to check bpf_get_cpu_time_counter jit
selftests/bpf: add usage example for cpu time counter kfuncs
arch/x86/net/bpf_jit_comp.c | 72 ++++++++++++
arch/x86/net/bpf_jit_comp32.c | 58 ++++++++++
include/linux/bpf.h | 4 +
include/linux/filter.h | 1 +
kernel/bpf/core.c | 11 ++
kernel/bpf/helpers.c | 12 ++
kernel/bpf/verifier.c | 41 ++++++-
.../bpf/prog_tests/test_cpu_cycles.c | 35 ++++++
.../selftests/bpf/prog_tests/verifier.c | 2 +
.../selftests/bpf/progs/test_cpu_cycles.c | 25 +++++
.../selftests/bpf/progs/verifier_cpu_cycles.c | 104 ++++++++++++++++++
11 files changed, 359 insertions(+), 6 deletions(-)
create mode 100644 tools/testing/selftests/bpf/prog_tests/test_cpu_cycles.c
create mode 100644 tools/testing/selftests/bpf/progs/test_cpu_cycles.c
create mode 100644 tools/testing/selftests/bpf/progs/verifier_cpu_cycles.c
--
2.47.1
* [PATCH bpf-next v10 1/4] bpf: add bpf_get_cpu_time_counter kfunc
2025-03-11 15:48 [PATCH bpf-next v10 0/4] bpf: add cpu time counter kfuncs Vadim Fedorenko
@ 2025-03-11 15:48 ` Vadim Fedorenko
2025-03-11 15:48 ` [PATCH bpf-next v10 2/4] bpf: add bpf_cpu_time_counter_to_ns helper Vadim Fedorenko
` (2 subsequent siblings)
3 siblings, 0 replies; 6+ messages in thread
From: Vadim Fedorenko @ 2025-03-11 15:48 UTC (permalink / raw)
To: Borislav Petkov, Alexei Starovoitov, Daniel Borkmann,
Andrii Nakryiko, Eduard Zingerman, Thomas Gleixner, Yonghong Song,
Vadim Fedorenko, Mykola Lysenko
Cc: x86, bpf, Peter Zijlstra, Vadim Fedorenko, Martin KaFai Lau
New kfunc to return an arch-specific time counter. The main reason to
implement this kfunc is to avoid the extra overhead of benchmark
measurements, which are usually done by a pair of bpf_ktime_get_ns() calls
at the beginning and at the end of the code block under benchmark.
When fully JITed this function doesn't implement conversion to the
monotonic clock and saves some CPU cycles by obtaining the time counter
value in a single-digit number of instructions. The delta values can be
translated into nanoseconds using the kfunc introduced in the next patch.
On x86 the BPF JIT converts this kfunc into an ordered rdtsc call. Other
architectures will get a JIT implementation too if supported. The fallback
is to return the CLOCK_MONOTONIC_RAW value in ns.
The JIT version of the function uses the "LFENCE; RDTSC" variant because it
doesn't care about the cookie value returned by "RDTSCP" and doesn't want
to trash the RCX value. The LFENCE option provides the same ordering
guarantee as the RDTSCP variant.
The simplest use-case is added in the 4th patch, where we calculate the time
spent by the bpf_get_ns_current_pid_tgid() helper. A more complex example is to
use the session cookie to store the time counter value at kprobe/uprobe entry
using kprobe.session/uprobe.session, and calculate the difference at
kretprobe/uretprobe; a sketch of that pattern follows.
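An illustrative sketch of that pattern, assuming the bpf_session_cookie() and
bpf_session_is_return() kfuncs available to kprobe.session programs (the probed
function and the output are arbitrary examples):

#include "vmlinux.h"
#include <bpf/bpf_helpers.h>

extern u64 bpf_get_cpu_time_counter(void) __weak __ksym;
extern u64 bpf_cpu_time_counter_to_ns(u64 cycles) __weak __ksym;
extern __u64 *bpf_session_cookie(void) __ksym;
extern bool bpf_session_is_return(void) __ksym;

SEC("kprobe.session/do_nanosleep")
int handle(struct pt_regs *ctx)
{
	__u64 *cookie = bpf_session_cookie();

	if (!bpf_session_is_return()) {
		/* entry: stash the raw counter value in the session cookie */
		*cookie = bpf_get_cpu_time_counter();
		return 0;
	}
	/* return: compute the delta and convert it to nanoseconds */
	bpf_printk("spent %llu ns",
		   bpf_cpu_time_counter_to_ns(bpf_get_cpu_time_counter() - *cookie));
	return 0;
}

char _license[] SEC("license") = "GPL";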
Acked-by: Eduard Zingerman <eddyz87@gmail.com>
Acked-by: Andrii Nakryiko <andrii@kernel.org>
Acked-by: Yonghong Song <yonghong.song@linux.dev>
Signed-off-by: Vadim Fedorenko <vadfed@meta.com>
---
arch/x86/net/bpf_jit_comp.c | 47 +++++++++++++++++++++++++++++++++++
arch/x86/net/bpf_jit_comp32.c | 33 ++++++++++++++++++++++++
include/linux/bpf.h | 3 +++
include/linux/filter.h | 1 +
kernel/bpf/core.c | 11 ++++++++
kernel/bpf/helpers.c | 6 +++++
kernel/bpf/verifier.c | 41 +++++++++++++++++++++++++-----
7 files changed, 136 insertions(+), 6 deletions(-)
diff --git a/arch/x86/net/bpf_jit_comp.c b/arch/x86/net/bpf_jit_comp.c
index d3491cc0898b..92cd5945d630 100644
--- a/arch/x86/net/bpf_jit_comp.c
+++ b/arch/x86/net/bpf_jit_comp.c
@@ -15,6 +15,7 @@
#include <asm/ftrace.h>
#include <asm/set_memory.h>
#include <asm/nospec-branch.h>
+#include <asm/timer.h>
#include <asm/text-patching.h>
#include <asm/unwind.h>
#include <asm/cfi.h>
@@ -2254,6 +2255,40 @@ st: if (is_imm8(insn->off))
case BPF_JMP | BPF_CALL: {
u8 *ip = image + addrs[i - 1];
+ if (insn->src_reg == BPF_PSEUDO_KFUNC_CALL &&
+ IS_ENABLED(CONFIG_BPF_SYSCALL) &&
+ imm32 == BPF_CALL_IMM(bpf_get_cpu_time_counter) &&
+ cpu_feature_enabled(X86_FEATURE_TSC) &&
+ using_native_sched_clock() && sched_clock_stable()) {
+ /* The default implementation of this kfunc uses
+ * ktime_get_raw_ns() which effectively is implemented as
+ * `(u64)rdtsc_ordered() & S64_MAX`. For JIT We skip
+ * masking part because we assume it's not needed in BPF
+ * use case (two measurements close in time).
+ * Original code for rdtsc_ordered() uses sequence:
+ * 'rdtsc; nop; nop; nop' to patch it into
+ * 'lfence; rdtsc' or 'rdtscp' depending on CPU features.
+ * JIT uses 'lfence; rdtsc' variant because BPF program
+ * doesn't care about cookie provided by rdtscp in RCX.
+ * Save RDX because RDTSC will use EDX:EAX to return u64
+ */
+ emit_mov_reg(&prog, true, AUX_REG, BPF_REG_3);
+ if (cpu_feature_enabled(X86_FEATURE_LFENCE_RDTSC))
+ EMIT_LFENCE();
+ EMIT2(0x0F, 0x31);
+
+ /* shl RDX, 32 */
+ maybe_emit_1mod(&prog, BPF_REG_3, true);
+ EMIT3(0xC1, add_1reg(0xE0, BPF_REG_3), 32);
+ /* or RAX, RDX */
+ maybe_emit_mod(&prog, BPF_REG_0, BPF_REG_3, true);
+ EMIT2(0x09, add_2reg(0xC0, BPF_REG_0, BPF_REG_3));
+ /* restore RDX from R11 */
+ emit_mov_reg(&prog, true, BPF_REG_3, AUX_REG);
+
+ break;
+ }
+
func = (u8 *) __bpf_call_base + imm32;
if (src_reg == BPF_PSEUDO_CALL && tail_call_reachable) {
LOAD_TAIL_CALL_CNT_PTR(stack_depth);
@@ -3865,3 +3900,15 @@ bool bpf_jit_supports_timed_may_goto(void)
{
return true;
}
+
+/* x86-64 JIT can inline kfunc */
+bool bpf_jit_inlines_kfunc_call(s32 imm)
+{
+ if (!IS_ENABLED(CONFIG_BPF_SYSCALL))
+ return false;
+ if (imm == BPF_CALL_IMM(bpf_get_cpu_time_counter) &&
+ cpu_feature_enabled(X86_FEATURE_TSC) &&
+ using_native_sched_clock() && sched_clock_stable())
+ return true;
+ return false;
+}
diff --git a/arch/x86/net/bpf_jit_comp32.c b/arch/x86/net/bpf_jit_comp32.c
index de0f9e5f9f73..7f13509c66db 100644
--- a/arch/x86/net/bpf_jit_comp32.c
+++ b/arch/x86/net/bpf_jit_comp32.c
@@ -16,6 +16,7 @@
#include <asm/set_memory.h>
#include <asm/nospec-branch.h>
#include <asm/asm-prototypes.h>
+#include <asm/timer.h>
#include <linux/bpf.h>
/*
@@ -2094,6 +2095,27 @@ static int do_jit(struct bpf_prog *bpf_prog, int *addrs, u8 *image,
if (insn->src_reg == BPF_PSEUDO_KFUNC_CALL) {
int err;
+ if (IS_ENABLED(CONFIG_BPF_SYSCALL) &&
+ imm32 == BPF_CALL_IMM(bpf_get_cpu_time_counter) &&
+ cpu_feature_enabled(X86_FEATURE_TSC) &&
+ using_native_sched_clock() && sched_clock_stable()) {
+ /* The default implementation of this kfunc uses
+ * ktime_get_raw_ns() which effectively is implemented as
+ * `(u64)rdtsc_ordered() & S64_MAX`. For JIT We skip
+ * masking part because we assume it's not needed in BPF
+ * use case (two measurements close in time).
+ * Original code for rdtsc_ordered() uses sequence:
+ * 'rdtsc; nop; nop; nop' to patch it into
+ * 'lfence; rdtsc' or 'rdtscp' depending on CPU features.
+ * JIT uses 'lfence; rdtsc' variant because BPF program
+ * doesn't care about cookie provided by rdtscp in ECX.
+ */
+ if (cpu_feature_enabled(X86_FEATURE_LFENCE_RDTSC))
+ EMIT3(0x0F, 0xAE, 0xE8);
+ EMIT2(0x0F, 0x31);
+ break;
+ }
+
err = emit_kfunc_call(bpf_prog,
image + addrs[i],
insn, &prog);
@@ -2621,3 +2643,14 @@ bool bpf_jit_supports_kfunc_call(void)
{
return true;
}
+
+bool bpf_jit_inlines_kfunc_call(s32 imm)
+{
+ if (!IS_ENABLED(CONFIG_BPF_SYSCALL))
+ return false;
+ if (imm == BPF_CALL_IMM(bpf_get_cpu_time_counter) &&
+ cpu_feature_enabled(X86_FEATURE_TSC) &&
+ using_native_sched_clock() && sched_clock_stable())
+ return true;
+ return false;
+}
diff --git a/include/linux/bpf.h b/include/linux/bpf.h
index 7d55553de3fc..599aaa854e4c 100644
--- a/include/linux/bpf.h
+++ b/include/linux/bpf.h
@@ -3387,6 +3387,9 @@ void bpf_user_rnd_init_once(void);
u64 bpf_user_rnd_u32(u64 r1, u64 r2, u64 r3, u64 r4, u64 r5);
u64 bpf_get_raw_cpu_id(u64 r1, u64 r2, u64 r3, u64 r4, u64 r5);
+/* Inlined kfuncs */
+u64 bpf_get_cpu_time_counter(void);
+
#if defined(CONFIG_NET)
bool bpf_sock_common_is_valid_access(int off, int size,
enum bpf_access_type type,
diff --git a/include/linux/filter.h b/include/linux/filter.h
index 590476743f7a..2fbfa1bc3f49 100644
--- a/include/linux/filter.h
+++ b/include/linux/filter.h
@@ -1128,6 +1128,7 @@ struct bpf_prog *bpf_int_jit_compile(struct bpf_prog *prog);
void bpf_jit_compile(struct bpf_prog *prog);
bool bpf_jit_needs_zext(void);
bool bpf_jit_inlines_helper_call(s32 imm);
+bool bpf_jit_inlines_kfunc_call(s32 imm);
bool bpf_jit_supports_subprog_tailcalls(void);
bool bpf_jit_supports_percpu_insn(void);
bool bpf_jit_supports_kfunc_call(void);
diff --git a/kernel/bpf/core.c b/kernel/bpf/core.c
index 62cb9557ad3b..1d811fc39eac 100644
--- a/kernel/bpf/core.c
+++ b/kernel/bpf/core.c
@@ -3035,6 +3035,17 @@ bool __weak bpf_jit_inlines_helper_call(s32 imm)
return false;
}
+/* Return true if the JIT inlines the call to the kfunc corresponding to
+ * the imm.
+ *
+ * The verifier will not patch the insn->imm for the call to the helper if
+ * this returns true.
+ */
+bool __weak bpf_jit_inlines_kfunc_call(s32 imm)
+{
+ return false;
+}
+
/* Return TRUE if the JIT backend supports mixing bpf2bpf and tailcalls. */
bool __weak bpf_jit_supports_subprog_tailcalls(void)
{
diff --git a/kernel/bpf/helpers.c b/kernel/bpf/helpers.c
index 5449756ba102..43bf35a15f78 100644
--- a/kernel/bpf/helpers.c
+++ b/kernel/bpf/helpers.c
@@ -3193,6 +3193,11 @@ __bpf_kfunc void bpf_local_irq_restore(unsigned long *flags__irq_flag)
local_irq_restore(*flags__irq_flag);
}
+__bpf_kfunc u64 bpf_get_cpu_time_counter(void)
+{
+ return ktime_get_raw_fast_ns();
+}
+
__bpf_kfunc_end_defs();
BTF_KFUNCS_START(generic_btf_ids)
@@ -3293,6 +3298,7 @@ BTF_ID_FLAGS(func, bpf_iter_kmem_cache_next, KF_ITER_NEXT | KF_RET_NULL | KF_SLE
BTF_ID_FLAGS(func, bpf_iter_kmem_cache_destroy, KF_ITER_DESTROY | KF_SLEEPABLE)
BTF_ID_FLAGS(func, bpf_local_irq_save)
BTF_ID_FLAGS(func, bpf_local_irq_restore)
+BTF_ID_FLAGS(func, bpf_get_cpu_time_counter, KF_FASTCALL)
BTF_KFUNCS_END(common_btf_ids)
static const struct btf_kfunc_id_set common_kfunc_set = {
diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
index 3303a3605ee8..0c4ea977973c 100644
--- a/kernel/bpf/verifier.c
+++ b/kernel/bpf/verifier.c
@@ -17035,6 +17035,24 @@ static bool verifier_inlines_helper_call(struct bpf_verifier_env *env, s32 imm)
}
}
+/* True if fixup_kfunc_call() replaces calls to kfunc number 'imm',
+ * replacement patch is presumed to follow bpf_fastcall contract
+ * (see mark_fastcall_pattern_for_call() below).
+ */
+static bool verifier_inlines_kfunc_call(struct bpf_verifier_env *env, s32 imm)
+{
+ const struct bpf_kfunc_desc *desc = find_kfunc_desc(env->prog, imm, 0);
+
+ if (!env->prog->jit_requested)
+ return false;
+
+ if (desc->func_id == special_kfunc_list[KF_bpf_cast_to_kern_ctx] ||
+ desc->func_id == special_kfunc_list[KF_bpf_rdonly_cast])
+ return true;
+
+ return false;
+}
+
struct call_summary {
u8 num_params;
bool is_void;
@@ -17077,7 +17095,10 @@ static bool get_call_summary(struct bpf_verifier_env *env, struct bpf_insn *call
/* error would be reported later */
return false;
cs->num_params = btf_type_vlen(meta.func_proto);
- cs->fastcall = meta.kfunc_flags & KF_FASTCALL;
+ cs->fastcall = meta.kfunc_flags & KF_FASTCALL &&
+ (verifier_inlines_kfunc_call(env, call->imm) ||
+ (meta.btf == btf_vmlinux &&
+ bpf_jit_inlines_kfunc_call(call->imm)));
cs->is_void = btf_type_is_void(btf_type_by_id(meta.btf, meta.func_proto->type));
return true;
}
@@ -21223,6 +21244,7 @@ static int fixup_kfunc_call(struct bpf_verifier_env *env, struct bpf_insn *insn,
struct bpf_insn *insn_buf, int insn_idx, int *cnt)
{
const struct bpf_kfunc_desc *desc;
+ s32 imm = insn->imm;
if (!insn->imm) {
verbose(env, "invalid kernel function call not eliminated in verifier pass\n");
@@ -21246,7 +21268,18 @@ static int fixup_kfunc_call(struct bpf_verifier_env *env, struct bpf_insn *insn,
insn->imm = BPF_CALL_IMM(desc->addr);
if (insn->off)
return 0;
- if (desc->func_id == special_kfunc_list[KF_bpf_obj_new_impl] ||
+ if (verifier_inlines_kfunc_call(env, imm)) {
+ if (desc->func_id == special_kfunc_list[KF_bpf_cast_to_kern_ctx] ||
+ desc->func_id == special_kfunc_list[KF_bpf_rdonly_cast]) {
+ insn_buf[0] = BPF_MOV64_REG(BPF_REG_0, BPF_REG_1);
+ *cnt = 1;
+ } else {
+ verbose(env, "verifier internal error: kfunc id %d has no inline code\n",
+ desc->func_id);
+ return -EFAULT;
+ }
+
+ } else if (desc->func_id == special_kfunc_list[KF_bpf_obj_new_impl] ||
desc->func_id == special_kfunc_list[KF_bpf_percpu_obj_new_impl]) {
struct btf_struct_meta *kptr_struct_meta = env->insn_aux_data[insn_idx].kptr_struct_meta;
struct bpf_insn addr[2] = { BPF_LD_IMM64(BPF_REG_2, (long)kptr_struct_meta) };
@@ -21307,10 +21340,6 @@ static int fixup_kfunc_call(struct bpf_verifier_env *env, struct bpf_insn *insn,
__fixup_collection_insert_kfunc(&env->insn_aux_data[insn_idx], struct_meta_reg,
node_offset_reg, insn, insn_buf, cnt);
- } else if (desc->func_id == special_kfunc_list[KF_bpf_cast_to_kern_ctx] ||
- desc->func_id == special_kfunc_list[KF_bpf_rdonly_cast]) {
- insn_buf[0] = BPF_MOV64_REG(BPF_REG_0, BPF_REG_1);
- *cnt = 1;
} else if (is_bpf_wq_set_callback_impl_kfunc(desc->func_id)) {
struct bpf_insn ld_addrs[2] = { BPF_LD_IMM64(BPF_REG_4, (long)env->prog->aux) };
--
2.47.1
* [PATCH bpf-next v10 2/4] bpf: add bpf_cpu_time_counter_to_ns helper
2025-03-11 15:48 [PATCH bpf-next v10 0/4] bpf: add cpu time counter kfuncs Vadim Fedorenko
2025-03-11 15:48 ` [PATCH bpf-next v10 1/4] bpf: add bpf_get_cpu_time_counter kfunc Vadim Fedorenko
@ 2025-03-11 15:48 ` Vadim Fedorenko
2025-03-13 8:27 ` kernel test robot
2025-03-11 15:48 ` [PATCH bpf-next v10 3/4] selftests/bpf: add selftest to check bpf_get_cpu_time_counter jit Vadim Fedorenko
2025-03-11 15:48 ` [PATCH bpf-next v10 4/4] selftests/bpf: add usage example for cpu time counter kfuncs Vadim Fedorenko
3 siblings, 1 reply; 6+ messages in thread
From: Vadim Fedorenko @ 2025-03-11 15:48 UTC (permalink / raw)
To: Borislav Petkov, Alexei Starovoitov, Daniel Borkmann,
Andrii Nakryiko, Eduard Zingerman, Thomas Gleixner, Yonghong Song,
Vadim Fedorenko, Mykola Lysenko
Cc: x86, bpf, Peter Zijlstra, Vadim Fedorenko, Martin KaFai Lau
The new helper should be used to convert deltas of values
received from bpf_get_cpu_time_counter() into nanoseconds. It is not
designed to do a full conversion of time counter values to
CLOCK_MONOTONIC_RAW nanoseconds and cannot guarantee monotonicity of 2
independent values; rather, it converts the difference of 2 close
enough values of the CPU timestamp counter into nanoseconds.
This function is JITed into just a few instructions, adds as
little overhead as possible, and suits benchmark use-cases well.
When the kfunc is not JITed it returns the value provided as the argument,
because the kfunc in the previous patch will return values in nanoseconds.
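The arithmetic the x86 JIT emits is the kernel's cyc2ns scaling without the
offset; a hedged C sketch of the equivalent computation is below (mult/shift
correspond to cyc2ns_mul/cyc2ns_shift, the wrapper name is illustrative, and
the plain 64-bit multiply mirrors the truncating imul the JIT uses rather than
the wider mul_u64_u32_shr() the kernel uses for full timestamps):

/* ns = (cycles * mult) >> shift; only meaningful for small deltas of
 * close-in-time counter reads, where the multiplication cannot overflow
 * 64 bits and the missing offset does not matter.
 */
static inline u64 cycles_delta_to_ns(u64 cycles, u32 mult, u32 shift)
{
	return (cycles * mult) >> shift;
}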
Reviewed-by: Eduard Zingerman <eddyz87@gmail.com>
Acked-by: Andrii Nakryiko <andrii@kernel.org>
Signed-off-by: Vadim Fedorenko <vadfed@meta.com>
---
arch/x86/net/bpf_jit_comp.c | 27 ++++++++++++++++++++++++++-
arch/x86/net/bpf_jit_comp32.c | 27 ++++++++++++++++++++++++++-
include/linux/bpf.h | 1 +
kernel/bpf/helpers.c | 6 ++++++
4 files changed, 59 insertions(+), 2 deletions(-)
diff --git a/arch/x86/net/bpf_jit_comp.c b/arch/x86/net/bpf_jit_comp.c
index 92cd5945d630..56f7557048d1 100644
--- a/arch/x86/net/bpf_jit_comp.c
+++ b/arch/x86/net/bpf_jit_comp.c
@@ -9,6 +9,7 @@
#include <linux/filter.h>
#include <linux/if_vlan.h>
#include <linux/bpf.h>
+#include <linux/clocksource.h>
#include <linux/memory.h>
#include <linux/sort.h>
#include <asm/extable.h>
@@ -2289,6 +2290,29 @@ st: if (is_imm8(insn->off))
break;
}
+ if (insn->src_reg == BPF_PSEUDO_KFUNC_CALL &&
+ imm32 == BPF_CALL_IMM(bpf_cpu_time_counter_to_ns) &&
+ cpu_feature_enabled(X86_FEATURE_TSC) &&
+ using_native_sched_clock() && sched_clock_stable()) {
+ struct cyc2ns_data data;
+ u32 mult, shift;
+
+ cyc2ns_read_begin(&data);
+ mult = data.cyc2ns_mul;
+ shift = data.cyc2ns_shift;
+ cyc2ns_read_end();
+ /* imul RAX, RDI, mult */
+ maybe_emit_mod(&prog, BPF_REG_1, BPF_REG_0, true);
+ EMIT2_off32(0x69, add_2reg(0xC0, BPF_REG_1, BPF_REG_0),
+ mult);
+
+ /* shr RAX, shift (which is less than 64) */
+ maybe_emit_1mod(&prog, BPF_REG_0, true);
+ EMIT3(0xC1, add_1reg(0xE8, BPF_REG_0), shift);
+
+ break;
+ }
+
func = (u8 *) __bpf_call_base + imm32;
if (src_reg == BPF_PSEUDO_CALL && tail_call_reachable) {
LOAD_TAIL_CALL_CNT_PTR(stack_depth);
@@ -3906,7 +3930,8 @@ bool bpf_jit_inlines_kfunc_call(s32 imm)
{
if (!IS_ENABLED(CONFIG_BPF_SYSCALL))
return false;
- if (imm == BPF_CALL_IMM(bpf_get_cpu_time_counter) &&
+ if ((imm == BPF_CALL_IMM(bpf_get_cpu_time_counter) ||
+ imm == BPF_CALL_IMM(bpf_cpu_time_counter_to_ns)) &&
cpu_feature_enabled(X86_FEATURE_TSC) &&
using_native_sched_clock() && sched_clock_stable())
return true;
diff --git a/arch/x86/net/bpf_jit_comp32.c b/arch/x86/net/bpf_jit_comp32.c
index 7f13509c66db..9791a3fb9d69 100644
--- a/arch/x86/net/bpf_jit_comp32.c
+++ b/arch/x86/net/bpf_jit_comp32.c
@@ -12,6 +12,7 @@
#include <linux/netdevice.h>
#include <linux/filter.h>
#include <linux/if_vlan.h>
+#include <linux/clocksource.h>
#include <asm/cacheflush.h>
#include <asm/set_memory.h>
#include <asm/nospec-branch.h>
@@ -2115,6 +2116,29 @@ static int do_jit(struct bpf_prog *bpf_prog, int *addrs, u8 *image,
EMIT2(0x0F, 0x31);
break;
}
+ if (IS_ENABLED(CONFIG_BPF_SYSCALL) &&
+ imm32 == BPF_CALL_IMM(bpf_cpu_time_counter_to_ns) &&
+ cpu_feature_enabled(X86_FEATURE_TSC) &&
+ using_native_sched_clock() && sched_clock_stable()) {
+ struct cyc2ns_data data;
+ u32 mult, shift;
+
+ cyc2ns_read_begin(&data);
+ mult = data.cyc2ns_mul;
+ shift = data.cyc2ns_shift;
+ cyc2ns_read_end();
+
+ /* move parameter to BPF_REG_0 */
+ emit_ia32_mov_r64(true, bpf2ia32[BPF_REG_0],
+ bpf2ia32[BPF_REG_1], true, true,
+ &prog, bpf_prog->aux);
+ /* multiply parameter by mult */
+ emit_ia32_mul_i64(bpf2ia32[BPF_REG_0],
+ mult, true, &prog);
+ /* shift parameter by shift which is less than 64 */
+ emit_ia32_rsh_i64(bpf2ia32[BPF_REG_0],
+ shift, true, &prog);
+ }
err = emit_kfunc_call(bpf_prog,
image + addrs[i],
@@ -2648,7 +2672,8 @@ bool bpf_jit_inlines_kfunc_call(s32 imm)
{
if (!IS_ENABLED(CONFIG_BPF_SYSCALL))
return false;
- if (imm == BPF_CALL_IMM(bpf_get_cpu_time_counter) &&
+ if ((imm == BPF_CALL_IMM(bpf_get_cpu_time_counter) ||
+ imm == BPF_CALL_IMM(bpf_cpu_time_counter_to_ns)) &&
cpu_feature_enabled(X86_FEATURE_TSC) &&
using_native_sched_clock() && sched_clock_stable())
return true;
diff --git a/include/linux/bpf.h b/include/linux/bpf.h
index 599aaa854e4c..5c4d35019b22 100644
--- a/include/linux/bpf.h
+++ b/include/linux/bpf.h
@@ -3389,6 +3389,7 @@ u64 bpf_get_raw_cpu_id(u64 r1, u64 r2, u64 r3, u64 r4, u64 r5);
/* Inlined kfuncs */
u64 bpf_get_cpu_time_counter(void);
+u64 bpf_cpu_time_counter_to_ns(u64 cycles);
#if defined(CONFIG_NET)
bool bpf_sock_common_is_valid_access(int off, int size,
diff --git a/kernel/bpf/helpers.c b/kernel/bpf/helpers.c
index 43bf35a15f78..cc986d2048db 100644
--- a/kernel/bpf/helpers.c
+++ b/kernel/bpf/helpers.c
@@ -3198,6 +3198,11 @@ __bpf_kfunc u64 bpf_get_cpu_time_counter(void)
return ktime_get_raw_fast_ns();
}
+__bpf_kfunc u64 bpf_cpu_time_counter_to_ns(u64 cycles)
+{
+ return cycles;
+}
+
__bpf_kfunc_end_defs();
BTF_KFUNCS_START(generic_btf_ids)
@@ -3299,6 +3304,7 @@ BTF_ID_FLAGS(func, bpf_iter_kmem_cache_destroy, KF_ITER_DESTROY | KF_SLEEPABLE)
BTF_ID_FLAGS(func, bpf_local_irq_save)
BTF_ID_FLAGS(func, bpf_local_irq_restore)
BTF_ID_FLAGS(func, bpf_get_cpu_time_counter, KF_FASTCALL)
+BTF_ID_FLAGS(func, bpf_cpu_time_counter_to_ns, KF_FASTCALL)
BTF_KFUNCS_END(common_btf_ids)
static const struct btf_kfunc_id_set common_kfunc_set = {
--
2.47.1
* [PATCH bpf-next v10 3/4] selftests/bpf: add selftest to check bpf_get_cpu_time_counter jit
2025-03-11 15:48 [PATCH bpf-next v10 0/4] bpf: add cpu time counter kfuncs Vadim Fedorenko
2025-03-11 15:48 ` [PATCH bpf-next v10 1/4] bpf: add bpf_get_cpu_time_counter kfunc Vadim Fedorenko
2025-03-11 15:48 ` [PATCH bpf-next v10 2/4] bpf: add bpf_cpu_time_counter_to_ns helper Vadim Fedorenko
@ 2025-03-11 15:48 ` Vadim Fedorenko
2025-03-11 15:48 ` [PATCH bpf-next v10 4/4] selftests/bpf: add usage example for cpu time counter kfuncs Vadim Fedorenko
3 siblings, 0 replies; 6+ messages in thread
From: Vadim Fedorenko @ 2025-03-11 15:48 UTC (permalink / raw)
To: Borislav Petkov, Alexei Starovoitov, Daniel Borkmann,
Andrii Nakryiko, Eduard Zingerman, Thomas Gleixner, Yonghong Song,
Vadim Fedorenko, Mykola Lysenko
Cc: x86, bpf, Peter Zijlstra, Vadim Fedorenko, Martin KaFai Lau
bpf_get_cpu_time_counter() is replaced with the rdtsc instruction on x86_64.
Add tests to check that the JIT works as expected.
Acked-by: Eduard Zingerman <eddyz87@gmail.com>
Signed-off-by: Vadim Fedorenko <vadfed@meta.com>
---
.../selftests/bpf/prog_tests/verifier.c | 2 +
.../selftests/bpf/progs/verifier_cpu_cycles.c | 104 ++++++++++++++++++
2 files changed, 106 insertions(+)
create mode 100644 tools/testing/selftests/bpf/progs/verifier_cpu_cycles.c
diff --git a/tools/testing/selftests/bpf/prog_tests/verifier.c b/tools/testing/selftests/bpf/prog_tests/verifier.c
index e66a57970d28..d5e7e302a344 100644
--- a/tools/testing/selftests/bpf/prog_tests/verifier.c
+++ b/tools/testing/selftests/bpf/prog_tests/verifier.c
@@ -102,6 +102,7 @@
#include "verifier_xdp_direct_packet_access.skel.h"
#include "verifier_bits_iter.skel.h"
#include "verifier_lsm.skel.h"
+#include "verifier_cpu_cycles.skel.h"
#include "irq.skel.h"
#define MAX_ENTRIES 11
@@ -236,6 +237,7 @@ void test_verifier_bits_iter(void) { RUN(verifier_bits_iter); }
void test_verifier_lsm(void) { RUN(verifier_lsm); }
void test_irq(void) { RUN(irq); }
void test_verifier_mtu(void) { RUN(verifier_mtu); }
+void test_verifier_cpu_cycles(void) { RUN(verifier_cpu_cycles); }
static int init_test_val_map(struct bpf_object *obj, char *map_name)
{
diff --git a/tools/testing/selftests/bpf/progs/verifier_cpu_cycles.c b/tools/testing/selftests/bpf/progs/verifier_cpu_cycles.c
new file mode 100644
index 000000000000..5b62e3690362
--- /dev/null
+++ b/tools/testing/selftests/bpf/progs/verifier_cpu_cycles.c
@@ -0,0 +1,104 @@
+// SPDX-License-Identifier: GPL-2.0
+/* Copyright (c) 2022 Meta Inc. */
+#include "vmlinux.h"
+#include <bpf/bpf_helpers.h>
+#include <bpf/bpf_tracing.h>
+#include "bpf_misc.h"
+
+extern u64 bpf_cpu_time_counter_to_ns(u64 cycles) __weak __ksym;
+extern u64 bpf_get_cpu_time_counter(void) __weak __ksym;
+
+SEC("syscall")
+__arch_x86_64
+__xlated("0: call kernel-function")
+__naked int bpf_rdtsc(void)
+{
+ asm volatile(
+ "call %[bpf_get_cpu_time_counter];"
+ "exit"
+ :
+ : __imm(bpf_get_cpu_time_counter)
+ : __clobber_all
+ );
+}
+
+SEC("syscall")
+__arch_x86_64
+/* program entry for bpf_rdtsc_jit_x86_64(), regular function prologue */
+__jited(" endbr64")
+__jited(" nopl (%rax,%rax)")
+__jited(" nopl (%rax)")
+__jited(" pushq %rbp")
+__jited(" movq %rsp, %rbp")
+__jited(" endbr64")
+/* save RDX in R11 as it will be overwritten */
+__jited(" movq %rdx, %r11")
+/* lfence may not be executed depending on cpu features */
+__jited(" {{(lfence|)}}")
+__jited(" rdtsc")
+/* combine EDX:EAX into RAX */
+__jited(" shlq ${{(32|0x20)}}, %rdx")
+__jited(" orq %rdx, %rax")
+/* restore RDX from R11 */
+__jited(" movq %r11, %rdx")
+__jited(" leave")
+__naked int bpf_rdtsc_jit_x86_64(void)
+{
+ asm volatile(
+ "call %[bpf_get_cpu_time_counter];"
+ "exit"
+ :
+ : __imm(bpf_get_cpu_time_counter)
+ : __clobber_all
+ );
+}
+
+SEC("syscall")
+__arch_x86_64
+__xlated("0: r1 = 42")
+__xlated("1: call kernel-function")
+__naked int bpf_cyc2ns(void)
+{
+ asm volatile(
+ "r1=0x2a;"
+ "call %[bpf_cpu_time_counter_to_ns];"
+ "exit"
+ :
+ : __imm(bpf_cpu_time_counter_to_ns)
+ : __clobber_all
+ );
+}
+
+SEC("syscall")
+__arch_x86_64
+/* program entry for bpf_cyc2ns_jit_x86(), regular function prologue */
+__jited(" endbr64")
+__jited(" nopl (%rax,%rax)")
+__jited(" nopl (%rax)")
+__jited(" pushq %rbp")
+__jited(" movq %rsp, %rbp")
+__jited(" endbr64")
+/* load the 0x2a2a2a2a2a constant argument into RDI */
+__jited(" movabsq $0x2a2a2a2a2a, %rdi")
+__jited(" imulq ${{.*}}, %rdi, %rax")
+__jited(" shrq ${{.*}}, %rax")
+__jited(" leave")
+__naked int bpf_cyc2ns_jit_x86(void)
+{
+ asm volatile(
+ "r1=0x2a2a2a2a2a ll;"
+ "call %[bpf_cpu_time_counter_to_ns];"
+ "exit"
+ :
+ : __imm(bpf_cpu_time_counter_to_ns)
+ : __clobber_all
+ );
+}
+
+void rdtsc(void)
+{
+ bpf_get_cpu_time_counter();
+ bpf_cpu_time_counter_to_ns(42);
+}
+
+char _license[] SEC("license") = "GPL";
--
2.47.1
* [PATCH bpf-next v10 4/4] selftests/bpf: add usage example for cpu time counter kfuncs
2025-03-11 15:48 [PATCH bpf-next v10 0/4] bpf: add cpu time counter kfuncs Vadim Fedorenko
` (2 preceding siblings ...)
2025-03-11 15:48 ` [PATCH bpf-next v10 3/4] selftests/bpf: add selftest to check bpf_get_cpu_time_counter jit Vadim Fedorenko
@ 2025-03-11 15:48 ` Vadim Fedorenko
3 siblings, 0 replies; 6+ messages in thread
From: Vadim Fedorenko @ 2025-03-11 15:48 UTC (permalink / raw)
To: Borislav Petkov, Alexei Starovoitov, Daniel Borkmann,
Andrii Nakryiko, Eduard Zingerman, Thomas Gleixner, Yonghong Song,
Vadim Fedorenko, Mykola Lysenko
Cc: x86, bpf, Peter Zijlstra, Vadim Fedorenko, Martin KaFai Lau
The selftest provides an example of how to measure the latency of a bpf
kfunc/helper call using the timestamp counter and how to convert the measured
value into nanoseconds.
Signed-off-by: Vadim Fedorenko <vadfed@meta.com>
---
.../bpf/prog_tests/test_cpu_cycles.c | 35 +++++++++++++++++++
.../selftests/bpf/progs/test_cpu_cycles.c | 25 +++++++++++++
2 files changed, 60 insertions(+)
create mode 100644 tools/testing/selftests/bpf/prog_tests/test_cpu_cycles.c
create mode 100644 tools/testing/selftests/bpf/progs/test_cpu_cycles.c
diff --git a/tools/testing/selftests/bpf/prog_tests/test_cpu_cycles.c b/tools/testing/selftests/bpf/prog_tests/test_cpu_cycles.c
new file mode 100644
index 000000000000..d7f3b66594b3
--- /dev/null
+++ b/tools/testing/selftests/bpf/prog_tests/test_cpu_cycles.c
@@ -0,0 +1,35 @@
+// SPDX-License-Identifier: GPL-2.0
+/* Copyright (c) 2024 Meta Inc. */
+
+#include <test_progs.h>
+#include "test_cpu_cycles.skel.h"
+
+static void cpu_cycles(void)
+{
+ LIBBPF_OPTS(bpf_test_run_opts, opts);
+ struct test_cpu_cycles *skel;
+ int err, pfd;
+
+ skel = test_cpu_cycles__open_and_load();
+ if (!ASSERT_OK_PTR(skel, "test_cpu_cycles open and load"))
+ return;
+
+ pfd = bpf_program__fd(skel->progs.bpf_cpu_cycles);
+ if (!ASSERT_GT(pfd, 0, "test_cpu_cycles fd"))
+ goto fail;
+
+ err = bpf_prog_test_run_opts(pfd, &opts);
+ if (!ASSERT_OK(err, "test_cpu_cycles test run"))
+ goto fail;
+
+ ASSERT_NEQ(skel->bss->cycles, 0, "test_cpu_cycles 0 cycles");
+ ASSERT_NEQ(skel->bss->ns, 0, "test_cpu_cycles 0 ns");
+fail:
+ test_cpu_cycles__destroy(skel);
+}
+
+void test_cpu_cycles(void)
+{
+ if (test__start_subtest("cpu_cycles"))
+ cpu_cycles();
+}
diff --git a/tools/testing/selftests/bpf/progs/test_cpu_cycles.c b/tools/testing/selftests/bpf/progs/test_cpu_cycles.c
new file mode 100644
index 000000000000..a7f8a4c6b854
--- /dev/null
+++ b/tools/testing/selftests/bpf/progs/test_cpu_cycles.c
@@ -0,0 +1,25 @@
+// SPDX-License-Identifier: GPL-2.0
+/* Copyright (c) 2024 Meta Inc. */
+
+#include "vmlinux.h"
+#include <bpf/bpf_helpers.h>
+
+extern u64 bpf_cpu_time_counter_to_ns(u64 cycles) __weak __ksym;
+extern u64 bpf_get_cpu_time_counter(void) __weak __ksym;
+
+__u64 cycles, ns;
+
+SEC("syscall")
+int bpf_cpu_cycles(void)
+{
+ struct bpf_pidns_info pidns;
+ __u64 start;
+
+ start = bpf_get_cpu_time_counter();
+ bpf_get_ns_current_pid_tgid(0, 0, &pidns, sizeof(struct bpf_pidns_info));
+ cycles = bpf_get_cpu_time_counter() - start;
+ ns = bpf_cpu_time_counter_to_ns(cycles);
+ return 0;
+}
+
+char _license[] SEC("license") = "GPL";
--
2.47.1
* Re: [PATCH bpf-next v10 2/4] bpf: add bpf_cpu_time_counter_to_ns helper
2025-03-11 15:48 ` [PATCH bpf-next v10 2/4] bpf: add bpf_cpu_time_counter_to_ns helper Vadim Fedorenko
@ 2025-03-13 8:27 ` kernel test robot
0 siblings, 0 replies; 6+ messages in thread
From: kernel test robot @ 2025-03-13 8:27 UTC (permalink / raw)
To: Vadim Fedorenko, Borislav Petkov, Alexei Starovoitov,
Daniel Borkmann, Andrii Nakryiko, Eduard Zingerman,
Thomas Gleixner, Yonghong Song, Vadim Fedorenko, Mykola Lysenko
Cc: llvm, oe-kbuild-all, x86, bpf, Peter Zijlstra, Martin KaFai Lau
Hi Vadim,
kernel test robot noticed the following build errors:
[auto build test ERROR on bpf-next/master]
url: https://github.com/intel-lab-lkp/linux/commits/Vadim-Fedorenko/bpf-add-bpf_get_cpu_time_counter-kfunc/20250311-235326
base: https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next.git master
patch link: https://lore.kernel.org/r/20250311154850.3616840-3-vadfed%40meta.com
patch subject: [PATCH bpf-next v10 2/4] bpf: add bpf_cpu_time_counter_to_ns helper
config: x86_64-randconfig-004-20250313 (https://download.01.org/0day-ci/archive/20250313/202503131640.opwmXIvU-lkp@intel.com/config)
compiler: gcc-12 (Debian 12.2.0-14) 12.2.0
reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20250313/202503131640.opwmXIvU-lkp@intel.com/reproduce)
If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <lkp@intel.com>
| Closes: https://lore.kernel.org/oe-kbuild-all/202503131640.opwmXIvU-lkp@intel.com/
All errors (new ones prefixed by >>):
ld: arch/x86/net/bpf_jit_comp.o: in function `do_jit':
>> arch/x86/net/bpf_jit_comp.c:2294: undefined reference to `bpf_cpu_time_counter_to_ns'
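A plausible cause, judging from the hunks above: the new
bpf_cpu_time_counter_to_ns branch in do_jit() lacks the
IS_ENABLED(CONFIG_BPF_SYSCALL) guard that the bpf_get_cpu_time_counter branch
has, so a config without CONFIG_BPF_SYSCALL still references a kfunc symbol
that is never built. A sketch of the guarded condition, mirroring the first
hunk (an assumption, not a confirmed fix):

	if (insn->src_reg == BPF_PSEUDO_KFUNC_CALL &&
	    IS_ENABLED(CONFIG_BPF_SYSCALL) &&
	    imm32 == BPF_CALL_IMM(bpf_cpu_time_counter_to_ns) &&
	    cpu_feature_enabled(X86_FEATURE_TSC) &&
	    using_native_sched_clock() && sched_clock_stable()) {
		/* existing mult/shift emission would go here */
	}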
vim +2294 arch/x86/net/bpf_jit_comp.c
1498
1499 static int do_jit(struct bpf_prog *bpf_prog, int *addrs, u8 *image, u8 *rw_image,
1500 int oldproglen, struct jit_context *ctx, bool jmp_padding)
1501 {
1502 bool tail_call_reachable = bpf_prog->aux->tail_call_reachable;
1503 struct bpf_insn *insn = bpf_prog->insnsi;
1504 bool callee_regs_used[4] = {};
1505 int insn_cnt = bpf_prog->len;
1506 bool seen_exit = false;
1507 u8 temp[BPF_MAX_INSN_SIZE + BPF_INSN_SAFETY];
1508 void __percpu *priv_frame_ptr = NULL;
1509 u64 arena_vm_start, user_vm_start;
1510 void __percpu *priv_stack_ptr;
1511 int i, excnt = 0;
1512 int ilen, proglen = 0;
1513 u8 *prog = temp;
1514 u32 stack_depth;
1515 int err;
1516
1517 stack_depth = bpf_prog->aux->stack_depth;
1518 priv_stack_ptr = bpf_prog->aux->priv_stack_ptr;
1519 if (priv_stack_ptr) {
1520 priv_frame_ptr = priv_stack_ptr + PRIV_STACK_GUARD_SZ + round_up(stack_depth, 8);
1521 stack_depth = 0;
1522 }
1523
1524 arena_vm_start = bpf_arena_get_kern_vm_start(bpf_prog->aux->arena);
1525 user_vm_start = bpf_arena_get_user_vm_start(bpf_prog->aux->arena);
1526
1527 detect_reg_usage(insn, insn_cnt, callee_regs_used);
1528
1529 emit_prologue(&prog, stack_depth,
1530 bpf_prog_was_classic(bpf_prog), tail_call_reachable,
1531 bpf_is_subprog(bpf_prog), bpf_prog->aux->exception_cb);
1532 /* Exception callback will clobber callee regs for its own use, and
1533 * restore the original callee regs from main prog's stack frame.
1534 */
1535 if (bpf_prog->aux->exception_boundary) {
1536 /* We also need to save r12, which is not mapped to any BPF
1537 * register, as we throw after entry into the kernel, which may
1538 * overwrite r12.
1539 */
1540 push_r12(&prog);
1541 push_callee_regs(&prog, all_callee_regs_used);
1542 } else {
1543 if (arena_vm_start)
1544 push_r12(&prog);
1545 push_callee_regs(&prog, callee_regs_used);
1546 }
1547 if (arena_vm_start)
1548 emit_mov_imm64(&prog, X86_REG_R12,
1549 arena_vm_start >> 32, (u32) arena_vm_start);
1550
1551 if (priv_frame_ptr)
1552 emit_priv_frame_ptr(&prog, priv_frame_ptr);
1553
1554 ilen = prog - temp;
1555 if (rw_image)
1556 memcpy(rw_image + proglen, temp, ilen);
1557 proglen += ilen;
1558 addrs[0] = proglen;
1559 prog = temp;
1560
1561 for (i = 1; i <= insn_cnt; i++, insn++) {
1562 const s32 imm32 = insn->imm;
1563 u32 dst_reg = insn->dst_reg;
1564 u32 src_reg = insn->src_reg;
1565 u8 b2 = 0, b3 = 0;
1566 u8 *start_of_ldx;
1567 s64 jmp_offset;
1568 s16 insn_off;
1569 u8 jmp_cond;
1570 u8 *func;
1571 int nops;
1572
1573 if (priv_frame_ptr) {
1574 if (src_reg == BPF_REG_FP)
1575 src_reg = X86_REG_R9;
1576
1577 if (dst_reg == BPF_REG_FP)
1578 dst_reg = X86_REG_R9;
1579 }
1580
1581 switch (insn->code) {
1582 /* ALU */
1583 case BPF_ALU | BPF_ADD | BPF_X:
1584 case BPF_ALU | BPF_SUB | BPF_X:
1585 case BPF_ALU | BPF_AND | BPF_X:
1586 case BPF_ALU | BPF_OR | BPF_X:
1587 case BPF_ALU | BPF_XOR | BPF_X:
1588 case BPF_ALU64 | BPF_ADD | BPF_X:
1589 case BPF_ALU64 | BPF_SUB | BPF_X:
1590 case BPF_ALU64 | BPF_AND | BPF_X:
1591 case BPF_ALU64 | BPF_OR | BPF_X:
1592 case BPF_ALU64 | BPF_XOR | BPF_X:
1593 maybe_emit_mod(&prog, dst_reg, src_reg,
1594 BPF_CLASS(insn->code) == BPF_ALU64);
1595 b2 = simple_alu_opcodes[BPF_OP(insn->code)];
1596 EMIT2(b2, add_2reg(0xC0, dst_reg, src_reg));
1597 break;
1598
1599 case BPF_ALU64 | BPF_MOV | BPF_X:
1600 if (insn_is_cast_user(insn)) {
1601 if (dst_reg != src_reg)
1602 /* 32-bit mov */
1603 emit_mov_reg(&prog, false, dst_reg, src_reg);
1604 /* shl dst_reg, 32 */
1605 maybe_emit_1mod(&prog, dst_reg, true);
1606 EMIT3(0xC1, add_1reg(0xE0, dst_reg), 32);
1607
1608 /* or dst_reg, user_vm_start */
1609 maybe_emit_1mod(&prog, dst_reg, true);
1610 if (is_axreg(dst_reg))
1611 EMIT1_off32(0x0D, user_vm_start >> 32);
1612 else
1613 EMIT2_off32(0x81, add_1reg(0xC8, dst_reg), user_vm_start >> 32);
1614
1615 /* rol dst_reg, 32 */
1616 maybe_emit_1mod(&prog, dst_reg, true);
1617 EMIT3(0xC1, add_1reg(0xC0, dst_reg), 32);
1618
1619 /* xor r11, r11 */
1620 EMIT3(0x4D, 0x31, 0xDB);
1621
1622 /* test dst_reg32, dst_reg32; check if lower 32-bit are zero */
1623 maybe_emit_mod(&prog, dst_reg, dst_reg, false);
1624 EMIT2(0x85, add_2reg(0xC0, dst_reg, dst_reg));
1625
1626 /* cmove r11, dst_reg; if so, set dst_reg to zero */
1627 /* WARNING: Intel swapped src/dst register encoding in CMOVcc !!! */
1628 maybe_emit_mod(&prog, AUX_REG, dst_reg, true);
1629 EMIT3(0x0F, 0x44, add_2reg(0xC0, AUX_REG, dst_reg));
1630 break;
1631 } else if (insn_is_mov_percpu_addr(insn)) {
1632 /* mov <dst>, <src> (if necessary) */
1633 EMIT_mov(dst_reg, src_reg);
1634 #ifdef CONFIG_SMP
1635 /* add <dst>, gs:[<off>] */
1636 EMIT2(0x65, add_1mod(0x48, dst_reg));
1637 EMIT3(0x03, add_2reg(0x04, 0, dst_reg), 0x25);
1638 EMIT((u32)(unsigned long)&this_cpu_off, 4);
1639 #endif
1640 break;
1641 }
1642 fallthrough;
1643 case BPF_ALU | BPF_MOV | BPF_X:
1644 if (insn->off == 0)
1645 emit_mov_reg(&prog,
1646 BPF_CLASS(insn->code) == BPF_ALU64,
1647 dst_reg, src_reg);
1648 else
1649 emit_movsx_reg(&prog, insn->off,
1650 BPF_CLASS(insn->code) == BPF_ALU64,
1651 dst_reg, src_reg);
1652 break;
1653
1654 /* neg dst */
1655 case BPF_ALU | BPF_NEG:
1656 case BPF_ALU64 | BPF_NEG:
1657 maybe_emit_1mod(&prog, dst_reg,
1658 BPF_CLASS(insn->code) == BPF_ALU64);
1659 EMIT2(0xF7, add_1reg(0xD8, dst_reg));
1660 break;
1661
1662 case BPF_ALU | BPF_ADD | BPF_K:
1663 case BPF_ALU | BPF_SUB | BPF_K:
1664 case BPF_ALU | BPF_AND | BPF_K:
1665 case BPF_ALU | BPF_OR | BPF_K:
1666 case BPF_ALU | BPF_XOR | BPF_K:
1667 case BPF_ALU64 | BPF_ADD | BPF_K:
1668 case BPF_ALU64 | BPF_SUB | BPF_K:
1669 case BPF_ALU64 | BPF_AND | BPF_K:
1670 case BPF_ALU64 | BPF_OR | BPF_K:
1671 case BPF_ALU64 | BPF_XOR | BPF_K:
1672 maybe_emit_1mod(&prog, dst_reg,
1673 BPF_CLASS(insn->code) == BPF_ALU64);
1674
1675 /*
1676 * b3 holds 'normal' opcode, b2 short form only valid
1677 * in case dst is eax/rax.
1678 */
1679 switch (BPF_OP(insn->code)) {
1680 case BPF_ADD:
1681 b3 = 0xC0;
1682 b2 = 0x05;
1683 break;
1684 case BPF_SUB:
1685 b3 = 0xE8;
1686 b2 = 0x2D;
1687 break;
1688 case BPF_AND:
1689 b3 = 0xE0;
1690 b2 = 0x25;
1691 break;
1692 case BPF_OR:
1693 b3 = 0xC8;
1694 b2 = 0x0D;
1695 break;
1696 case BPF_XOR:
1697 b3 = 0xF0;
1698 b2 = 0x35;
1699 break;
1700 }
1701
1702 if (is_imm8(imm32))
1703 EMIT3(0x83, add_1reg(b3, dst_reg), imm32);
1704 else if (is_axreg(dst_reg))
1705 EMIT1_off32(b2, imm32);
1706 else
1707 EMIT2_off32(0x81, add_1reg(b3, dst_reg), imm32);
1708 break;
1709
1710 case BPF_ALU64 | BPF_MOV | BPF_K:
1711 case BPF_ALU | BPF_MOV | BPF_K:
1712 emit_mov_imm32(&prog, BPF_CLASS(insn->code) == BPF_ALU64,
1713 dst_reg, imm32);
1714 break;
1715
1716 case BPF_LD | BPF_IMM | BPF_DW:
1717 emit_mov_imm64(&prog, dst_reg, insn[1].imm, insn[0].imm);
1718 insn++;
1719 i++;
1720 break;
1721
1722 /* dst %= src, dst /= src, dst %= imm32, dst /= imm32 */
1723 case BPF_ALU | BPF_MOD | BPF_X:
1724 case BPF_ALU | BPF_DIV | BPF_X:
1725 case BPF_ALU | BPF_MOD | BPF_K:
1726 case BPF_ALU | BPF_DIV | BPF_K:
1727 case BPF_ALU64 | BPF_MOD | BPF_X:
1728 case BPF_ALU64 | BPF_DIV | BPF_X:
1729 case BPF_ALU64 | BPF_MOD | BPF_K:
1730 case BPF_ALU64 | BPF_DIV | BPF_K: {
1731 bool is64 = BPF_CLASS(insn->code) == BPF_ALU64;
1732
1733 if (dst_reg != BPF_REG_0)
1734 EMIT1(0x50); /* push rax */
1735 if (dst_reg != BPF_REG_3)
1736 EMIT1(0x52); /* push rdx */
1737
1738 if (BPF_SRC(insn->code) == BPF_X) {
1739 if (src_reg == BPF_REG_0 ||
1740 src_reg == BPF_REG_3) {
1741 /* mov r11, src_reg */
1742 EMIT_mov(AUX_REG, src_reg);
1743 src_reg = AUX_REG;
1744 }
1745 } else {
1746 /* mov r11, imm32 */
1747 EMIT3_off32(0x49, 0xC7, 0xC3, imm32);
1748 src_reg = AUX_REG;
1749 }
1750
1751 if (dst_reg != BPF_REG_0)
1752 /* mov rax, dst_reg */
1753 emit_mov_reg(&prog, is64, BPF_REG_0, dst_reg);
1754
1755 if (insn->off == 0) {
1756 /*
1757 * xor edx, edx
1758 * equivalent to 'xor rdx, rdx', but one byte less
1759 */
1760 EMIT2(0x31, 0xd2);
1761
1762 /* div src_reg */
1763 maybe_emit_1mod(&prog, src_reg, is64);
1764 EMIT2(0xF7, add_1reg(0xF0, src_reg));
1765 } else {
1766 if (BPF_CLASS(insn->code) == BPF_ALU)
1767 EMIT1(0x99); /* cdq */
1768 else
1769 EMIT2(0x48, 0x99); /* cqo */
1770
1771 /* idiv src_reg */
1772 maybe_emit_1mod(&prog, src_reg, is64);
1773 EMIT2(0xF7, add_1reg(0xF8, src_reg));
1774 }
1775
1776 if (BPF_OP(insn->code) == BPF_MOD &&
1777 dst_reg != BPF_REG_3)
1778 /* mov dst_reg, rdx */
1779 emit_mov_reg(&prog, is64, dst_reg, BPF_REG_3);
1780 else if (BPF_OP(insn->code) == BPF_DIV &&
1781 dst_reg != BPF_REG_0)
1782 /* mov dst_reg, rax */
1783 emit_mov_reg(&prog, is64, dst_reg, BPF_REG_0);
1784
1785 if (dst_reg != BPF_REG_3)
1786 EMIT1(0x5A); /* pop rdx */
1787 if (dst_reg != BPF_REG_0)
1788 EMIT1(0x58); /* pop rax */
1789 break;
1790 }
1791
1792 case BPF_ALU | BPF_MUL | BPF_K:
1793 case BPF_ALU64 | BPF_MUL | BPF_K:
1794 maybe_emit_mod(&prog, dst_reg, dst_reg,
1795 BPF_CLASS(insn->code) == BPF_ALU64);
1796
1797 if (is_imm8(imm32))
1798 /* imul dst_reg, dst_reg, imm8 */
1799 EMIT3(0x6B, add_2reg(0xC0, dst_reg, dst_reg),
1800 imm32);
1801 else
1802 /* imul dst_reg, dst_reg, imm32 */
1803 EMIT2_off32(0x69,
1804 add_2reg(0xC0, dst_reg, dst_reg),
1805 imm32);
1806 break;
1807
1808 case BPF_ALU | BPF_MUL | BPF_X:
1809 case BPF_ALU64 | BPF_MUL | BPF_X:
1810 maybe_emit_mod(&prog, src_reg, dst_reg,
1811 BPF_CLASS(insn->code) == BPF_ALU64);
1812
1813 /* imul dst_reg, src_reg */
1814 EMIT3(0x0F, 0xAF, add_2reg(0xC0, src_reg, dst_reg));
1815 break;
1816
1817 /* Shifts */
1818 case BPF_ALU | BPF_LSH | BPF_K:
1819 case BPF_ALU | BPF_RSH | BPF_K:
1820 case BPF_ALU | BPF_ARSH | BPF_K:
1821 case BPF_ALU64 | BPF_LSH | BPF_K:
1822 case BPF_ALU64 | BPF_RSH | BPF_K:
1823 case BPF_ALU64 | BPF_ARSH | BPF_K:
1824 maybe_emit_1mod(&prog, dst_reg,
1825 BPF_CLASS(insn->code) == BPF_ALU64);
1826
1827 b3 = simple_alu_opcodes[BPF_OP(insn->code)];
1828 if (imm32 == 1)
1829 EMIT2(0xD1, add_1reg(b3, dst_reg));
1830 else
1831 EMIT3(0xC1, add_1reg(b3, dst_reg), imm32);
1832 break;
1833
1834 case BPF_ALU | BPF_LSH | BPF_X:
1835 case BPF_ALU | BPF_RSH | BPF_X:
1836 case BPF_ALU | BPF_ARSH | BPF_X:
1837 case BPF_ALU64 | BPF_LSH | BPF_X:
1838 case BPF_ALU64 | BPF_RSH | BPF_X:
1839 case BPF_ALU64 | BPF_ARSH | BPF_X:
1840 /* BMI2 shifts aren't better when shift count is already in rcx */
1841 if (boot_cpu_has(X86_FEATURE_BMI2) && src_reg != BPF_REG_4) {
1842 /* shrx/sarx/shlx dst_reg, dst_reg, src_reg */
1843 bool w = (BPF_CLASS(insn->code) == BPF_ALU64);
1844 u8 op;
1845
1846 switch (BPF_OP(insn->code)) {
1847 case BPF_LSH:
1848 op = 1; /* prefix 0x66 */
1849 break;
1850 case BPF_RSH:
1851 op = 3; /* prefix 0xf2 */
1852 break;
1853 case BPF_ARSH:
1854 op = 2; /* prefix 0xf3 */
1855 break;
1856 }
1857
1858 emit_shiftx(&prog, dst_reg, src_reg, w, op);
1859
1860 break;
1861 }
1862
1863 if (src_reg != BPF_REG_4) { /* common case */
1864 /* Check for bad case when dst_reg == rcx */
1865 if (dst_reg == BPF_REG_4) {
1866 /* mov r11, dst_reg */
1867 EMIT_mov(AUX_REG, dst_reg);
1868 dst_reg = AUX_REG;
1869 } else {
1870 EMIT1(0x51); /* push rcx */
1871 }
1872 /* mov rcx, src_reg */
1873 EMIT_mov(BPF_REG_4, src_reg);
1874 }
1875
1876 /* shl %rax, %cl | shr %rax, %cl | sar %rax, %cl */
1877 maybe_emit_1mod(&prog, dst_reg,
1878 BPF_CLASS(insn->code) == BPF_ALU64);
1879
1880 b3 = simple_alu_opcodes[BPF_OP(insn->code)];
1881 EMIT2(0xD3, add_1reg(b3, dst_reg));
1882
1883 if (src_reg != BPF_REG_4) {
1884 if (insn->dst_reg == BPF_REG_4)
1885 /* mov dst_reg, r11 */
1886 EMIT_mov(insn->dst_reg, AUX_REG);
1887 else
1888 EMIT1(0x59); /* pop rcx */
1889 }
1890
1891 break;
1892
1893 case BPF_ALU | BPF_END | BPF_FROM_BE:
1894 case BPF_ALU64 | BPF_END | BPF_FROM_LE:
1895 switch (imm32) {
1896 case 16:
1897 /* Emit 'ror %ax, 8' to swap lower 2 bytes */
1898 EMIT1(0x66);
1899 if (is_ereg(dst_reg))
1900 EMIT1(0x41);
1901 EMIT3(0xC1, add_1reg(0xC8, dst_reg), 8);
1902
1903 /* Emit 'movzwl eax, ax' */
1904 if (is_ereg(dst_reg))
1905 EMIT3(0x45, 0x0F, 0xB7);
1906 else
1907 EMIT2(0x0F, 0xB7);
1908 EMIT1(add_2reg(0xC0, dst_reg, dst_reg));
1909 break;
1910 case 32:
1911 /* Emit 'bswap eax' to swap lower 4 bytes */
1912 if (is_ereg(dst_reg))
1913 EMIT2(0x41, 0x0F);
1914 else
1915 EMIT1(0x0F);
1916 EMIT1(add_1reg(0xC8, dst_reg));
1917 break;
1918 case 64:
1919 /* Emit 'bswap rax' to swap 8 bytes */
1920 EMIT3(add_1mod(0x48, dst_reg), 0x0F,
1921 add_1reg(0xC8, dst_reg));
1922 break;
1923 }
1924 break;
1925
1926 case BPF_ALU | BPF_END | BPF_FROM_LE:
1927 switch (imm32) {
1928 case 16:
1929 /*
1930 * Emit 'movzwl eax, ax' to zero extend 16-bit
1931 * into 64 bit
1932 */
1933 if (is_ereg(dst_reg))
1934 EMIT3(0x45, 0x0F, 0xB7);
1935 else
1936 EMIT2(0x0F, 0xB7);
1937 EMIT1(add_2reg(0xC0, dst_reg, dst_reg));
1938 break;
1939 case 32:
1940 /* Emit 'mov eax, eax' to clear upper 32-bits */
1941 if (is_ereg(dst_reg))
1942 EMIT1(0x45);
1943 EMIT2(0x89, add_2reg(0xC0, dst_reg, dst_reg));
1944 break;
1945 case 64:
1946 /* nop */
1947 break;
1948 }
1949 break;
1950
1951 /* speculation barrier */
1952 case BPF_ST | BPF_NOSPEC:
1953 EMIT_LFENCE();
1954 break;
1955
1956 /* ST: *(u8*)(dst_reg + off) = imm */
1957 case BPF_ST | BPF_MEM | BPF_B:
1958 if (is_ereg(dst_reg))
1959 EMIT2(0x41, 0xC6);
1960 else
1961 EMIT1(0xC6);
1962 goto st;
1963 case BPF_ST | BPF_MEM | BPF_H:
1964 if (is_ereg(dst_reg))
1965 EMIT3(0x66, 0x41, 0xC7);
1966 else
1967 EMIT2(0x66, 0xC7);
1968 goto st;
1969 case BPF_ST | BPF_MEM | BPF_W:
1970 if (is_ereg(dst_reg))
1971 EMIT2(0x41, 0xC7);
1972 else
1973 EMIT1(0xC7);
1974 goto st;
1975 case BPF_ST | BPF_MEM | BPF_DW:
1976 EMIT2(add_1mod(0x48, dst_reg), 0xC7);
1977
1978 st: if (is_imm8(insn->off))
1979 EMIT2(add_1reg(0x40, dst_reg), insn->off);
1980 else
1981 EMIT1_off32(add_1reg(0x80, dst_reg), insn->off);
1982
1983 EMIT(imm32, bpf_size_to_x86_bytes(BPF_SIZE(insn->code)));
1984 break;
1985
1986 /* STX: *(u8*)(dst_reg + off) = src_reg */
1987 case BPF_STX | BPF_MEM | BPF_B:
1988 case BPF_STX | BPF_MEM | BPF_H:
1989 case BPF_STX | BPF_MEM | BPF_W:
1990 case BPF_STX | BPF_MEM | BPF_DW:
1991 emit_stx(&prog, BPF_SIZE(insn->code), dst_reg, src_reg, insn->off);
1992 break;
1993
1994 case BPF_ST | BPF_PROBE_MEM32 | BPF_B:
1995 case BPF_ST | BPF_PROBE_MEM32 | BPF_H:
1996 case BPF_ST | BPF_PROBE_MEM32 | BPF_W:
1997 case BPF_ST | BPF_PROBE_MEM32 | BPF_DW:
1998 start_of_ldx = prog;
1999 emit_st_r12(&prog, BPF_SIZE(insn->code), dst_reg, insn->off, insn->imm);
2000 goto populate_extable;
2001
2002 /* LDX: dst_reg = *(u8*)(src_reg + r12 + off) */
2003 case BPF_LDX | BPF_PROBE_MEM32 | BPF_B:
2004 case BPF_LDX | BPF_PROBE_MEM32 | BPF_H:
2005 case BPF_LDX | BPF_PROBE_MEM32 | BPF_W:
2006 case BPF_LDX | BPF_PROBE_MEM32 | BPF_DW:
2007 case BPF_STX | BPF_PROBE_MEM32 | BPF_B:
2008 case BPF_STX | BPF_PROBE_MEM32 | BPF_H:
2009 case BPF_STX | BPF_PROBE_MEM32 | BPF_W:
2010 case BPF_STX | BPF_PROBE_MEM32 | BPF_DW:
2011 start_of_ldx = prog;
2012 if (BPF_CLASS(insn->code) == BPF_LDX)
2013 emit_ldx_r12(&prog, BPF_SIZE(insn->code), dst_reg, src_reg, insn->off);
2014 else
2015 emit_stx_r12(&prog, BPF_SIZE(insn->code), dst_reg, src_reg, insn->off);
2016 populate_extable:
2017 {
2018 struct exception_table_entry *ex;
2019 u8 *_insn = image + proglen + (start_of_ldx - temp);
2020 s64 delta;
2021
2022 if (!bpf_prog->aux->extable)
2023 break;
2024
2025 if (excnt >= bpf_prog->aux->num_exentries) {
2026 pr_err("mem32 extable bug\n");
2027 return -EFAULT;
2028 }
2029 ex = &bpf_prog->aux->extable[excnt++];
2030
2031 delta = _insn - (u8 *)&ex->insn;
2032 /* switch ex to rw buffer for writes */
2033 ex = (void *)rw_image + ((void *)ex - (void *)image);
2034
2035 ex->insn = delta;
2036
2037 ex->data = EX_TYPE_BPF;
2038
2039 ex->fixup = (prog - start_of_ldx) |
2040 ((BPF_CLASS(insn->code) == BPF_LDX ? reg2pt_regs[dst_reg] : DONT_CLEAR) << 8);
2041 }
2042 break;
2043
2044 /* LDX: dst_reg = *(u8*)(src_reg + off) */
2045 case BPF_LDX | BPF_MEM | BPF_B:
2046 case BPF_LDX | BPF_PROBE_MEM | BPF_B:
2047 case BPF_LDX | BPF_MEM | BPF_H:
2048 case BPF_LDX | BPF_PROBE_MEM | BPF_H:
2049 case BPF_LDX | BPF_MEM | BPF_W:
2050 case BPF_LDX | BPF_PROBE_MEM | BPF_W:
2051 case BPF_LDX | BPF_MEM | BPF_DW:
2052 case BPF_LDX | BPF_PROBE_MEM | BPF_DW:
2053 /* LDXS: dst_reg = *(s8*)(src_reg + off) */
2054 case BPF_LDX | BPF_MEMSX | BPF_B:
2055 case BPF_LDX | BPF_MEMSX | BPF_H:
2056 case BPF_LDX | BPF_MEMSX | BPF_W:
2057 case BPF_LDX | BPF_PROBE_MEMSX | BPF_B:
2058 case BPF_LDX | BPF_PROBE_MEMSX | BPF_H:
2059 case BPF_LDX | BPF_PROBE_MEMSX | BPF_W:
2060 insn_off = insn->off;
2061
2062 if (BPF_MODE(insn->code) == BPF_PROBE_MEM ||
2063 BPF_MODE(insn->code) == BPF_PROBE_MEMSX) {
2064 /* Conservatively check that src_reg + insn->off is a kernel address:
2065 * src_reg + insn->off > TASK_SIZE_MAX + PAGE_SIZE
2066 * and
2067 * src_reg + insn->off < VSYSCALL_ADDR
2068 */
2069
2070 u64 limit = TASK_SIZE_MAX + PAGE_SIZE - VSYSCALL_ADDR;
2071 u8 *end_of_jmp;
2072
2073 /* movabsq r10, VSYSCALL_ADDR */
2074 emit_mov_imm64(&prog, BPF_REG_AX, (long)VSYSCALL_ADDR >> 32,
2075 (u32)(long)VSYSCALL_ADDR);
2076
2077 /* mov src_reg, r11 */
2078 EMIT_mov(AUX_REG, src_reg);
2079
2080 if (insn->off) {
2081 /* add r11, insn->off */
2082 maybe_emit_1mod(&prog, AUX_REG, true);
2083 EMIT2_off32(0x81, add_1reg(0xC0, AUX_REG), insn->off);
2084 }
2085
2086 /* sub r11, r10 */
2087 maybe_emit_mod(&prog, AUX_REG, BPF_REG_AX, true);
2088 EMIT2(0x29, add_2reg(0xC0, AUX_REG, BPF_REG_AX));
2089
2090 /* movabsq r10, limit */
2091 emit_mov_imm64(&prog, BPF_REG_AX, (long)limit >> 32,
2092 (u32)(long)limit);
2093
2094 /* cmp r10, r11 */
2095 maybe_emit_mod(&prog, AUX_REG, BPF_REG_AX, true);
2096 EMIT2(0x39, add_2reg(0xC0, AUX_REG, BPF_REG_AX));
2097
2098 /* if unsigned '>', goto load */
2099 EMIT2(X86_JA, 0);
2100 end_of_jmp = prog;
2101
2102 /* xor dst_reg, dst_reg */
2103 emit_mov_imm32(&prog, false, dst_reg, 0);
2104 /* jmp byte_after_ldx */
2105 EMIT2(0xEB, 0);
2106
2107 /* populate jmp_offset for JAE above to jump to start_of_ldx */
2108 start_of_ldx = prog;
2109 end_of_jmp[-1] = start_of_ldx - end_of_jmp;
2110 }
2111 if (BPF_MODE(insn->code) == BPF_PROBE_MEMSX ||
2112 BPF_MODE(insn->code) == BPF_MEMSX)
2113 emit_ldsx(&prog, BPF_SIZE(insn->code), dst_reg, src_reg, insn_off);
2114 else
2115 emit_ldx(&prog, BPF_SIZE(insn->code), dst_reg, src_reg, insn_off);
2116 if (BPF_MODE(insn->code) == BPF_PROBE_MEM ||
2117 BPF_MODE(insn->code) == BPF_PROBE_MEMSX) {
2118 struct exception_table_entry *ex;
2119 u8 *_insn = image + proglen + (start_of_ldx - temp);
2120 s64 delta;
2121
2122 /* populate jmp_offset for JMP above */
2123 start_of_ldx[-1] = prog - start_of_ldx;
2124
2125 if (!bpf_prog->aux->extable)
2126 break;
2127
2128 if (excnt >= bpf_prog->aux->num_exentries) {
2129 pr_err("ex gen bug\n");
2130 return -EFAULT;
2131 }
2132 ex = &bpf_prog->aux->extable[excnt++];
2133
2134 delta = _insn - (u8 *)&ex->insn;
2135 if (!is_simm32(delta)) {
2136 pr_err("extable->insn doesn't fit into 32-bit\n");
2137 return -EFAULT;
2138 }
2139 /* switch ex to rw buffer for writes */
2140 ex = (void *)rw_image + ((void *)ex - (void *)image);
2141
2142 ex->insn = delta;
2143
2144 ex->data = EX_TYPE_BPF;
2145
2146 if (dst_reg > BPF_REG_9) {
2147 pr_err("verifier error\n");
2148 return -EFAULT;
2149 }
2150 /*
2151 * Compute size of x86 insn and its target dest x86 register.
2152 * ex_handler_bpf() will use lower 8 bits to adjust
2153 * pt_regs->ip to jump over this x86 instruction
2154 * and upper bits to figure out which pt_regs to zero out.
2155 * End result: x86 insn "mov rbx, qword ptr [rax+0x14]"
2156 * of 4 bytes will be ignored and rbx will be zero inited.
2157 */
2158 ex->fixup = (prog - start_of_ldx) | (reg2pt_regs[dst_reg] << 8);
2159 }
2160 break;
2161
2162 case BPF_STX | BPF_ATOMIC | BPF_B:
2163 case BPF_STX | BPF_ATOMIC | BPF_H:
2164 if (!bpf_atomic_is_load_store(insn)) {
2165 pr_err("bpf_jit: 1- and 2-byte RMW atomics are not supported\n");
2166 return -EFAULT;
2167 }
2168 fallthrough;
2169 case BPF_STX | BPF_ATOMIC | BPF_W:
2170 case BPF_STX | BPF_ATOMIC | BPF_DW:
2171 if (insn->imm == (BPF_AND | BPF_FETCH) ||
2172 insn->imm == (BPF_OR | BPF_FETCH) ||
2173 insn->imm == (BPF_XOR | BPF_FETCH)) {
2174 bool is64 = BPF_SIZE(insn->code) == BPF_DW;
2175 u32 real_src_reg = src_reg;
2176 u32 real_dst_reg = dst_reg;
2177 u8 *branch_target;
2178
2179 /*
2180 * Can't be implemented with a single x86 insn.
2181 * Need to do a CMPXCHG loop.
2182 */
2183
2184 /* Will need RAX as a CMPXCHG operand so save R0 */
2185 emit_mov_reg(&prog, true, BPF_REG_AX, BPF_REG_0);
2186 if (src_reg == BPF_REG_0)
2187 real_src_reg = BPF_REG_AX;
2188 if (dst_reg == BPF_REG_0)
2189 real_dst_reg = BPF_REG_AX;
2190
2191 branch_target = prog;
2192 /* Load old value */
2193 emit_ldx(&prog, BPF_SIZE(insn->code),
2194 BPF_REG_0, real_dst_reg, insn->off);
2195 /*
2196 * Perform the (commutative) operation locally,
2197 * put the result in the AUX_REG.
2198 */
2199 emit_mov_reg(&prog, is64, AUX_REG, BPF_REG_0);
2200 maybe_emit_mod(&prog, AUX_REG, real_src_reg, is64);
2201 EMIT2(simple_alu_opcodes[BPF_OP(insn->imm)],
2202 add_2reg(0xC0, AUX_REG, real_src_reg));
2203 /* Attempt to swap in new value */
2204 err = emit_atomic_rmw(&prog, BPF_CMPXCHG,
2205 real_dst_reg, AUX_REG,
2206 insn->off,
2207 BPF_SIZE(insn->code));
2208 if (WARN_ON(err))
2209 return err;
2210 /*
2211 * ZF tells us whether we won the race. If it's
2212 * cleared we need to try again.
2213 */
2214 EMIT2(X86_JNE, -(prog - branch_target) - 2);
2215 /* Return the pre-modification value */
2216 emit_mov_reg(&prog, is64, real_src_reg, BPF_REG_0);
2217 /* Restore R0 after clobbering RAX */
2218 emit_mov_reg(&prog, true, BPF_REG_0, BPF_REG_AX);
2219 break;
2220 }
2221
2222 if (bpf_atomic_is_load_store(insn))
2223 err = emit_atomic_ld_st(&prog, insn->imm, dst_reg, src_reg,
2224 insn->off, BPF_SIZE(insn->code));
2225 else
2226 err = emit_atomic_rmw(&prog, insn->imm, dst_reg, src_reg,
2227 insn->off, BPF_SIZE(insn->code));
2228 if (err)
2229 return err;
2230 break;
2231
2232 case BPF_STX | BPF_PROBE_ATOMIC | BPF_B:
2233 case BPF_STX | BPF_PROBE_ATOMIC | BPF_H:
2234 if (!bpf_atomic_is_load_store(insn)) {
2235 pr_err("bpf_jit: 1- and 2-byte RMW atomics are not supported\n");
2236 return -EFAULT;
2237 }
2238 fallthrough;
2239 case BPF_STX | BPF_PROBE_ATOMIC | BPF_W:
2240 case BPF_STX | BPF_PROBE_ATOMIC | BPF_DW:
2241 start_of_ldx = prog;
2242
2243 if (bpf_atomic_is_load_store(insn))
2244 err = emit_atomic_ld_st_index(&prog, insn->imm,
2245 BPF_SIZE(insn->code), dst_reg,
2246 src_reg, X86_REG_R12, insn->off);
2247 else
2248 err = emit_atomic_rmw_index(&prog, insn->imm, BPF_SIZE(insn->code),
2249 dst_reg, src_reg, X86_REG_R12,
2250 insn->off);
2251 if (err)
2252 return err;
2253 goto populate_extable;
2254
2255 /* call */
2256 case BPF_JMP | BPF_CALL: {
2257 u8 *ip = image + addrs[i - 1];
2258
2259 if (insn->src_reg == BPF_PSEUDO_KFUNC_CALL &&
2260 IS_ENABLED(CONFIG_BPF_SYSCALL) &&
2261 imm32 == BPF_CALL_IMM(bpf_get_cpu_time_counter) &&
2262 cpu_feature_enabled(X86_FEATURE_TSC) &&
2263 using_native_sched_clock() && sched_clock_stable()) {
2264 /* The default implementation of this kfunc uses
2265 * ktime_get_raw_ns() which effectively is implemented as
2266 * `(u64)rdtsc_ordered() & S64_MAX`. For JIT We skip
2267 * masking part because we assume it's not needed in BPF
2268 * use case (two measurements close in time).
2269 * Original code for rdtsc_ordered() uses sequence:
2270 * 'rdtsc; nop; nop; nop' to patch it into
2271 * 'lfence; rdtsc' or 'rdtscp' depending on CPU features.
2272 * JIT uses 'lfence; rdtsc' variant because BPF program
2273 * doesn't care about cookie provided by rdtscp in RCX.
2274 * Save RDX because RDTSC will use EDX:EAX to return u64
2275 */
2276 emit_mov_reg(&prog, true, AUX_REG, BPF_REG_3);
2277 if (cpu_feature_enabled(X86_FEATURE_LFENCE_RDTSC))
2278 EMIT_LFENCE();
2279 EMIT2(0x0F, 0x31);
2280
2281 /* shl RDX, 32 */
2282 maybe_emit_1mod(&prog, BPF_REG_3, true);
2283 EMIT3(0xC1, add_1reg(0xE0, BPF_REG_3), 32);
2284 /* or RAX, RDX */
2285 maybe_emit_mod(&prog, BPF_REG_0, BPF_REG_3, true);
2286 EMIT2(0x09, add_2reg(0xC0, BPF_REG_0, BPF_REG_3));
2287 /* restore RDX from R11 */
2288 emit_mov_reg(&prog, true, BPF_REG_3, AUX_REG);
2289
2290 break;
2291 }
2292
2293 if (insn->src_reg == BPF_PSEUDO_KFUNC_CALL &&
> 2294 imm32 == BPF_CALL_IMM(bpf_cpu_time_counter_to_ns) &&
2295 cpu_feature_enabled(X86_FEATURE_TSC) &&
2296 using_native_sched_clock() && sched_clock_stable()) {
2297 struct cyc2ns_data data;
2298 u32 mult, shift;
2299
2300 cyc2ns_read_begin(&data);
2301 mult = data.cyc2ns_mul;
2302 shift = data.cyc2ns_shift;
2303 cyc2ns_read_end();
2304 /* imul RAX, RDI, mult */
2305 maybe_emit_mod(&prog, BPF_REG_1, BPF_REG_0, true);
2306 EMIT2_off32(0x69, add_2reg(0xC0, BPF_REG_1, BPF_REG_0),
2307 mult);
2308
2309 /* shr RAX, shift (which is less than 64) */
2310 maybe_emit_1mod(&prog, BPF_REG_0, true);
2311 EMIT3(0xC1, add_1reg(0xE8, BPF_REG_0), shift);
2312
2313 break;
2314 }
2315
2316 func = (u8 *) __bpf_call_base + imm32;
2317 if (src_reg == BPF_PSEUDO_CALL && tail_call_reachable) {
2318 LOAD_TAIL_CALL_CNT_PTR(stack_depth);
2319 ip += 7;
2320 }
2321 if (!imm32)
2322 return -EINVAL;
2323 if (priv_frame_ptr) {
2324 push_r9(&prog);
2325 ip += 2;
2326 }
2327 ip += x86_call_depth_emit_accounting(&prog, func, ip);
2328 if (emit_call(&prog, func, ip))
2329 return -EINVAL;
2330 if (priv_frame_ptr)
2331 pop_r9(&prog);
2332 break;
2333 }
2334
2335 case BPF_JMP | BPF_TAIL_CALL:
2336 if (imm32)
2337 emit_bpf_tail_call_direct(bpf_prog,
2338 &bpf_prog->aux->poke_tab[imm32 - 1],
2339 &prog, image + addrs[i - 1],
2340 callee_regs_used,
2341 stack_depth,
2342 ctx);
2343 else
2344 emit_bpf_tail_call_indirect(bpf_prog,
2345 &prog,
2346 callee_regs_used,
2347 stack_depth,
2348 image + addrs[i - 1],
2349 ctx);
2350 break;
2351
2352 /* cond jump */
2353 case BPF_JMP | BPF_JEQ | BPF_X:
2354 case BPF_JMP | BPF_JNE | BPF_X:
2355 case BPF_JMP | BPF_JGT | BPF_X:
2356 case BPF_JMP | BPF_JLT | BPF_X:
2357 case BPF_JMP | BPF_JGE | BPF_X:
2358 case BPF_JMP | BPF_JLE | BPF_X:
2359 case BPF_JMP | BPF_JSGT | BPF_X:
2360 case BPF_JMP | BPF_JSLT | BPF_X:
2361 case BPF_JMP | BPF_JSGE | BPF_X:
2362 case BPF_JMP | BPF_JSLE | BPF_X:
2363 case BPF_JMP32 | BPF_JEQ | BPF_X:
2364 case BPF_JMP32 | BPF_JNE | BPF_X:
2365 case BPF_JMP32 | BPF_JGT | BPF_X:
2366 case BPF_JMP32 | BPF_JLT | BPF_X:
2367 case BPF_JMP32 | BPF_JGE | BPF_X:
2368 case BPF_JMP32 | BPF_JLE | BPF_X:
2369 case BPF_JMP32 | BPF_JSGT | BPF_X:
2370 case BPF_JMP32 | BPF_JSLT | BPF_X:
2371 case BPF_JMP32 | BPF_JSGE | BPF_X:
2372 case BPF_JMP32 | BPF_JSLE | BPF_X:
2373 /* cmp dst_reg, src_reg */
2374 maybe_emit_mod(&prog, dst_reg, src_reg,
2375 BPF_CLASS(insn->code) == BPF_JMP);
2376 EMIT2(0x39, add_2reg(0xC0, dst_reg, src_reg));
2377 goto emit_cond_jmp;
2378
2379 case BPF_JMP | BPF_JSET | BPF_X:
2380 case BPF_JMP32 | BPF_JSET | BPF_X:
2381 /* test dst_reg, src_reg */
2382 maybe_emit_mod(&prog, dst_reg, src_reg,
2383 BPF_CLASS(insn->code) == BPF_JMP);
2384 EMIT2(0x85, add_2reg(0xC0, dst_reg, src_reg));
2385 goto emit_cond_jmp;
2386
2387 case BPF_JMP | BPF_JSET | BPF_K:
2388 case BPF_JMP32 | BPF_JSET | BPF_K:
2389 /* test dst_reg, imm32 */
2390 maybe_emit_1mod(&prog, dst_reg,
2391 BPF_CLASS(insn->code) == BPF_JMP);
2392 EMIT2_off32(0xF7, add_1reg(0xC0, dst_reg), imm32);
2393 goto emit_cond_jmp;
2394
2395 case BPF_JMP | BPF_JEQ | BPF_K:
2396 case BPF_JMP | BPF_JNE | BPF_K:
2397 case BPF_JMP | BPF_JGT | BPF_K:
2398 case BPF_JMP | BPF_JLT | BPF_K:
2399 case BPF_JMP | BPF_JGE | BPF_K:
2400 case BPF_JMP | BPF_JLE | BPF_K:
2401 case BPF_JMP | BPF_JSGT | BPF_K:
2402 case BPF_JMP | BPF_JSLT | BPF_K:
2403 case BPF_JMP | BPF_JSGE | BPF_K:
2404 case BPF_JMP | BPF_JSLE | BPF_K:
2405 case BPF_JMP32 | BPF_JEQ | BPF_K:
2406 case BPF_JMP32 | BPF_JNE | BPF_K:
2407 case BPF_JMP32 | BPF_JGT | BPF_K:
2408 case BPF_JMP32 | BPF_JLT | BPF_K:
2409 case BPF_JMP32 | BPF_JGE | BPF_K:
2410 case BPF_JMP32 | BPF_JLE | BPF_K:
2411 case BPF_JMP32 | BPF_JSGT | BPF_K:
2412 case BPF_JMP32 | BPF_JSLT | BPF_K:
2413 case BPF_JMP32 | BPF_JSGE | BPF_K:
2414 case BPF_JMP32 | BPF_JSLE | BPF_K:
2415 /* test dst_reg, dst_reg to save one extra byte */
2416 if (imm32 == 0) {
2417 maybe_emit_mod(&prog, dst_reg, dst_reg,
2418 BPF_CLASS(insn->code) == BPF_JMP);
2419 EMIT2(0x85, add_2reg(0xC0, dst_reg, dst_reg));
2420 goto emit_cond_jmp;
2421 }
2422
2423 /* cmp dst_reg, imm8/32 */
2424 maybe_emit_1mod(&prog, dst_reg,
2425 BPF_CLASS(insn->code) == BPF_JMP);
2426
2427 if (is_imm8(imm32))
2428 EMIT3(0x83, add_1reg(0xF8, dst_reg), imm32);
2429 else
2430 EMIT2_off32(0x81, add_1reg(0xF8, dst_reg), imm32);
2431
--
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki