[PATCH bpf-next v1 0/2] Timed may

public inbox for bpf@vger.kernel.org
 help / color / mirror / Atom feed

* [PATCH bpf-next v1 0/2] Timed may_goto
@ 2025-03-02 20:13 Kumar Kartikeya Dwivedi
  2025-03-02 20:13 ` [PATCH bpf-next v1 1/2] bpf: Add verifier support for timed may_goto Kumar Kartikeya Dwivedi
  2025-03-02 20:13 ` [PATCH bpf-next v1 2/2] bpf, x86: Add x86 JIT " Kumar Kartikeya Dwivedi
  0 siblings, 2 replies; 6+ messages in thread
From: Kumar Kartikeya Dwivedi @ 2025-03-02 20:13 UTC (permalink / raw)
  To: bpf
  Cc: Alexei Starovoitov, Andrii Nakryiko, Daniel Borkmann,
	Martin KaFai Lau, Eduard Zingerman, Tejun Heo, Emil Tsalapatis,
	Barret Rhoden, Josh Don, Dohyun Kim, kkd, kernel-team

This series replaces the current implementation of cond_break, which
uses the may_goto instruction, and counts 8 million iterations per stack
frame, with an implementation based on sampling time locally on the CPU.

This is done to permit a longer time for a given loop per-program
invocation. The accounting is still done per-stack frame, but the count
is used to instead amortize the cost of the logic to sample and check
the time spent since the start.

This is needed for expressing more complicated algorithms (spin locks,
waiting loops, etc.) in BPF programs without false positive expiration
of the loop. For instance, the plan is to make use of this for
implementing spin locks for BPF arena [0].

For the loop as follows:

for (int i = 0;; i++) {}

Testing on a bare-metal Saphire Rapids Intel server yields the following
table (taking an average of 25 runs).

+-----------------------------+--------------+--------------+------------------+
| Loop type		      |	Iterations   |	Time (ms)   |	Time/iter (ns) |
+-----------------------------|--------------+--------------+------------------+
| may_goto		      |	8388608	     |	3	    |	0.36	       |
| timed_may_goto (count=65535)|	589674932    |	250	    |	0.42	       |
| bpf_for		      |	8388608	     |	10	    |	1.19	       |
+-----------------------------+--------------+--------------+------------------+

Here, count is used to amortize the time sampling and checking logic.

Obviously, this is the limit of an empty loop. Given the complexity of
the loop body, the time spent in the loop can be longer. Cancellations
will address the task of imposing an upper bound on program runtime.

For now, the implementation only supports x86.

  [0]: https://lore.kernel.org/bpf/20250118162238.2621311-1-memxor@gmail.com

Kumar Kartikeya Dwivedi (2):
  bpf: Add verifier support for timed may_goto
  bpf, x86: Add x86 JIT support for timed may_goto

 arch/x86/net/Makefile                         |  2 +-
 arch/x86/net/bpf_jit_comp.c                   |  5 ++
 arch/x86/net/bpf_timed_may_goto.S             | 43 ++++++++++++++
 include/linux/bpf.h                           |  1 +
 include/linux/filter.h                        |  8 +++
 kernel/bpf/core.c                             | 31 ++++++++++
 kernel/bpf/verifier.c                         | 52 ++++++++++++++---
 .../bpf/progs/verifier_bpf_fastcall.c         | 58 +++++++++++++++----
 .../selftests/bpf/progs/verifier_may_goto_1.c | 34 ++++++++++-
 9 files changed, 213 insertions(+), 21 deletions(-)
 create mode 100644 arch/x86/net/bpf_timed_may_goto.S

base-commit: 0b9363131daf4227d5ae11ee677acdcfff06e938
-- 
2.43.5

^ permalink raw reply	[flat|nested] 6+ messages in thread

* [PATCH bpf-next v1 1/2] bpf: Add verifier support for timed may_goto
  2025-03-02 20:13 [PATCH bpf-next v1 0/2] Timed may_goto Kumar Kartikeya Dwivedi
@ 2025-03-02 20:13 ` Kumar Kartikeya Dwivedi
  2025-03-03 21:59   ` Alexei Starovoitov
  2025-03-02 20:13 ` [PATCH bpf-next v1 2/2] bpf, x86: Add x86 JIT " Kumar Kartikeya Dwivedi
  1 sibling, 1 reply; 6+ messages in thread
From: Kumar Kartikeya Dwivedi @ 2025-03-02 20:13 UTC (permalink / raw)
  To: bpf
  Cc: Alexei Starovoitov, Andrii Nakryiko, Daniel Borkmann,
	Martin KaFai Lau, Eduard Zingerman, Tejun Heo, Emil Tsalapatis,
	Barret Rhoden, Josh Don, Dohyun Kim, kkd, kernel-team

Implement support in the verifier for replacing may_goto implementation
from a counter-based approach to one which samples time on the local CPU
to have a bigger loop bound.

We implement it by maintaining 16-bytes per-stack frame, and using 8
bytes for maintaining the count for amortizing time sampling, and 8
bytes for the starting timestamp. To minimize overhead, we need to avoid
spilling and filling of registers around this sequence, so we push this
cost into the time sampling function 'arch_bpf_timed_may_goto'. This is
a JIT-specific wrapper around bpf_check_timed_may_goto which returns us
the count to store into the stack through BPF_REG_AX. All caller-saved
registers (r0-r5) are guaranteed to remain untouched.

The loop can be broken by returning count as 0, otherwise we dispatch
into the function when the count becomes 1, and the runtime chooses to
refresh it (by returning count as BPF_MAX_TIMED_LOOPS) or returning 0
and aborting it.

Since the check for 0 is done right after loading the count from the
stack, all subsequent cond_break sequences should immediately break as
well.

We pass in the stack_depth of the count (and thus the timestamp, by
adding 8 to it) to the arch_bpf_timed_may_goto call so that it can be
passed in to bpf_check_timed_may_goto as an argument after r1 is saved,
by adding the offset to r10/fp. This adjustment will be arch specific,
and the next patch will introduce support for x86.

Note that depending on loop complexity, time spent in the loop can be
more than the current limit (250 ms), but imposing an upper bound on
program runtime is an orthogonal problem which will be addressed when
program cancellations are supported.

The current time afforded by cond_break may not be enough for cases
where BPF programs want to implement locking algorithms inline, and use
cond_break as a promise to the verifier that they will eventually
terminate.

Below are some benchmarking numbers on the time taken per-iteration for
an empty loop that counts the number of iterations until cond_break
fires. For comparison, we compare it against bpf_for/bpf_repeat which is
another way to achieve the same number of spins (BPF_MAX_LOOPS).  The
hardware used for benchmarking was a Saphire Rapids Intel server with
performance governor enabled.

+-----------------------------+--------------+--------------+------------------+
| Loop type                   | Iterations   |  Time (ms)   |   Time/iter (ns) |
+-----------------------------|--------------+--------------+------------------+
| may_goto                    | 8388608      |  3           |   0.36           |
| timed_may_goto (count=65535)| 589674932    |  250         |   0.42           |
| bpf_for                     | 8388608      |  10          |   1.19           |
+-----------------------------+--------------+--------------+------------------+

This gives a good approximation at low overhead while staying close to
the current implementation.

Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
---
 include/linux/bpf.h    |  1 +
 include/linux/filter.h |  8 +++++++
 kernel/bpf/core.c      | 31 +++++++++++++++++++++++++
 kernel/bpf/verifier.c  | 52 +++++++++++++++++++++++++++++++++++-------
 4 files changed, 84 insertions(+), 8 deletions(-)

diff --git a/include/linux/bpf.h b/include/linux/bpf.h
index aec102868b93..788f6ca374e9 100644
--- a/include/linux/bpf.h
+++ b/include/linux/bpf.h
@@ -1986,6 +1986,7 @@ struct bpf_array {
  */
 enum {
 	BPF_MAX_LOOPS = 8 * 1024 * 1024,
+	BPF_MAX_TIMED_LOOPS = 0xffff,
 };
 
 #define BPF_F_ACCESS_MASK	(BPF_F_RDONLY |		\
diff --git a/include/linux/filter.h b/include/linux/filter.h
index 3ed6eb9e7c73..02dda5c53d91 100644
--- a/include/linux/filter.h
+++ b/include/linux/filter.h
@@ -669,6 +669,11 @@ struct bpf_prog_stats {
 	struct u64_stats_sync syncp;
 } __aligned(2 * sizeof(u64));
 
+struct bpf_timed_may_goto {
+	u64 count;
+	u64 timestamp;
+};
+
 struct sk_filter {
 	refcount_t	refcnt;
 	struct rcu_head	rcu;
@@ -1130,8 +1135,11 @@ bool bpf_jit_supports_ptr_xchg(void);
 bool bpf_jit_supports_arena(void);
 bool bpf_jit_supports_insn(struct bpf_insn *insn, bool in_arena);
 bool bpf_jit_supports_private_stack(void);
+bool bpf_jit_supports_timed_may_goto(void);
 u64 bpf_arch_uaddress_limit(void);
 void arch_bpf_stack_walk(bool (*consume_fn)(void *cookie, u64 ip, u64 sp, u64 bp), void *cookie);
+u64 arch_bpf_timed_may_goto(void);
+u64 bpf_check_timed_may_goto(struct bpf_timed_may_goto *);
 bool bpf_helper_changes_pkt_data(enum bpf_func_id func_id);
 
 static inline bool bpf_dump_raw_ok(const struct cred *cred)
diff --git a/kernel/bpf/core.c b/kernel/bpf/core.c
index a0200fbbace9..b3f7c7bd08d3 100644
--- a/kernel/bpf/core.c
+++ b/kernel/bpf/core.c
@@ -3069,6 +3069,37 @@ void __weak arch_bpf_stack_walk(bool (*consume_fn)(void *cookie, u64 ip, u64 sp,
 {
 }
 
+bool __weak bpf_jit_supports_timed_may_goto(void)
+{
+	return false;
+}
+
+u64 __weak arch_bpf_timed_may_goto(void)
+{
+	return 0;
+}
+
+u64 bpf_check_timed_may_goto(struct bpf_timed_may_goto *p)
+{
+	u64 time = ktime_get_mono_fast_ns();
+
+	/* If the count is zero, we've already broken a prior loop in this stack
+	 * frame, let's just exit quickly.
+	 */
+	if (!p->count)
+		return 0;
+	/* Populate the timestamp for this stack frame. */
+	if (!p->timestamp) {
+		p->timestamp = time;
+		return BPF_MAX_TIMED_LOOPS;
+	}
+	/* Check if we've exhausted our time slice. */
+	if (time - p->timestamp >= (NSEC_PER_SEC / 4))
+		return 0;
+	/* Refresh the count for the stack frame. */
+	return BPF_MAX_TIMED_LOOPS;
+}
+
 /* for configs without MMU or 32-bit */
 __weak const struct bpf_map_ops arena_map_ops;
 __weak u64 bpf_arena_get_user_vm_start(struct bpf_arena *arena)
diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
index dcd0da4e62fc..79bfb1932f40 100644
--- a/kernel/bpf/verifier.c
+++ b/kernel/bpf/verifier.c
@@ -21503,7 +21503,34 @@ static int do_misc_fixups(struct bpf_verifier_env *env)
 			goto next_insn;
 		}
 
-		if (is_may_goto_insn(insn)) {
+		if (is_may_goto_insn(insn) && bpf_jit_supports_timed_may_goto()) {
+			int stack_off_cnt = -stack_depth - 16;
+
+			/* Two 8 byte slots, depth-16 stores the count, and
+			 * depth-8 stores the start timestamp of the loop.
+			 */
+			stack_depth_extra = 16;
+			insn_buf[0] = BPF_LDX_MEM(BPF_DW, BPF_REG_AX, BPF_REG_10, stack_off_cnt);
+			if (insn->off >= 0)
+				insn_buf[1] = BPF_JMP_IMM(BPF_JEQ, BPF_REG_AX, 0, insn->off + 5);
+			else
+				insn_buf[1] = BPF_JMP_IMM(BPF_JEQ, BPF_REG_AX, 0, insn->off - 1);
+			insn_buf[2] = BPF_ALU64_IMM(BPF_SUB, BPF_REG_AX, 1);
+			insn_buf[3] = BPF_JMP_IMM(BPF_JNE, BPF_REG_AX, 1, 2);
+			insn_buf[4] = BPF_MOV64_IMM(BPF_REG_AX, stack_off_cnt);
+			insn_buf[5] = BPF_RAW_INSN(BPF_JMP | BPF_CALL, 0, 0, 0, BPF_CALL_IMM(arch_bpf_timed_may_goto));
+			insn_buf[6] = BPF_STX_MEM(BPF_DW, BPF_REG_10, BPF_REG_AX, stack_off_cnt);
+			cnt = 7;
+
+			new_prog = bpf_patch_insn_data(env, i + delta, insn_buf, cnt);
+			if (!new_prog)
+				return -ENOMEM;
+
+			delta += cnt - 1;
+			env->prog = prog = new_prog;
+			insn = new_prog->insnsi + i + delta;
+			goto next_insn;
+		} else if (is_may_goto_insn(insn)) {
 			int stack_off = -stack_depth - 8;
 
 			stack_depth_extra = 8;
@@ -22044,23 +22071,32 @@ static int do_misc_fixups(struct bpf_verifier_env *env)
 
 	env->prog->aux->stack_depth = subprogs[0].stack_depth;
 	for (i = 0; i < env->subprog_cnt; i++) {
+		int delta = bpf_jit_supports_timed_may_goto() ? 2 : 1;
 		int subprog_start = subprogs[i].start;
 		int stack_slots = subprogs[i].stack_extra / 8;
+		int slots = delta, cnt = 0;
 
 		if (!stack_slots)
 			continue;
-		if (stack_slots > 1) {
+		/* We need two in case timed may_goto is supported. */
+		if (stack_slots > slots) {
 			verbose(env, "verifier bug: stack_slots supports may_goto only\n");
 			return -EFAULT;
 		}
 
-		/* Add ST insn to subprog prologue to init extra stack */
-		insn_buf[0] = BPF_ST_MEM(BPF_DW, BPF_REG_FP,
-					 -subprogs[i].stack_depth, BPF_MAX_LOOPS);
+		if (bpf_jit_supports_timed_may_goto()) {
+			insn_buf[cnt++] = BPF_ST_MEM(BPF_DW, BPF_REG_FP, -subprogs[i].stack_depth,
+						     BPF_MAX_TIMED_LOOPS);
+			insn_buf[cnt++] = BPF_ST_MEM(BPF_DW, BPF_REG_FP, -subprogs[i].stack_depth + 8, 0);
+		} else {
+			/* Add ST insn to subprog prologue to init extra stack */
+			insn_buf[cnt++] = BPF_ST_MEM(BPF_DW, BPF_REG_FP, -subprogs[i].stack_depth,
+						     BPF_MAX_LOOPS);
+		}
 		/* Copy first actual insn to preserve it */
-		insn_buf[1] = env->prog->insnsi[subprog_start];
+		insn_buf[cnt++] = env->prog->insnsi[subprog_start];
 
-		new_prog = bpf_patch_insn_data(env, subprog_start, insn_buf, 2);
+		new_prog = bpf_patch_insn_data(env, subprog_start, insn_buf, cnt);
 		if (!new_prog)
 			return -ENOMEM;
 		env->prog = prog = new_prog;
@@ -22070,7 +22106,7 @@ static int do_misc_fixups(struct bpf_verifier_env *env)
 		 * to insn after BPF_ST that inits may_goto count.
 		 * Adjustment will succeed because bpf_patch_insn_data() didn't fail.
 		 */
-		WARN_ON(adjust_jmp_off(env->prog, subprog_start, 1));
+		WARN_ON(adjust_jmp_off(env->prog, subprog_start, delta));
 	}
 
 	/* Since poke tab is now finalized, publish aux to tracker. */
-- 
2.43.5


^ permalink raw reply related	[flat|nested] 6+ messages in thread

* Re: [PATCH bpf-next v1 1/2] bpf: Add verifier support for timed may_goto
  2025-03-02 20:13 ` [PATCH bpf-next v1 1/2] bpf: Add verifier support for timed may_goto Kumar Kartikeya Dwivedi
@ 2025-03-03 21:59   ` Alexei Starovoitov
  2025-03-03 22:25     ` Kumar Kartikeya Dwivedi
  0 siblings, 1 reply; 6+ messages in thread
From: Alexei Starovoitov @ 2025-03-03 21:59 UTC (permalink / raw)
  To: Kumar Kartikeya Dwivedi
  Cc: bpf, Alexei Starovoitov, Andrii Nakryiko, Daniel Borkmann,
	Martin KaFai Lau, Eduard Zingerman, Tejun Heo, Emil Tsalapatis,
	Barret Rhoden, Josh Don, Dohyun Kim, kkd, Kernel Team

On Sun, Mar 2, 2025 at 12:13 PM Kumar Kartikeya Dwivedi
<memxor@gmail.com> wrote:
>
> Implement support in the verifier for replacing may_goto implementation
> from a counter-based approach to one which samples time on the local CPU
> to have a bigger loop bound.
>
> We implement it by maintaining 16-bytes per-stack frame, and using 8
> bytes for maintaining the count for amortizing time sampling, and 8
> bytes for the starting timestamp. To minimize overhead, we need to avoid
> spilling and filling of registers around this sequence, so we push this
> cost into the time sampling function 'arch_bpf_timed_may_goto'. This is
> a JIT-specific wrapper around bpf_check_timed_may_goto which returns us
> the count to store into the stack through BPF_REG_AX. All caller-saved
> registers (r0-r5) are guaranteed to remain untouched.
>
> The loop can be broken by returning count as 0, otherwise we dispatch
> into the function when the count becomes 1, and the runtime chooses to
> refresh it (by returning count as BPF_MAX_TIMED_LOOPS) or returning 0
> and aborting it.
>
> Since the check for 0 is done right after loading the count from the
> stack, all subsequent cond_break sequences should immediately break as
> well.
>
> We pass in the stack_depth of the count (and thus the timestamp, by
> adding 8 to it) to the arch_bpf_timed_may_goto call so that it can be
> passed in to bpf_check_timed_may_goto as an argument after r1 is saved,
> by adding the offset to r10/fp. This adjustment will be arch specific,
> and the next patch will introduce support for x86.
>
> Note that depending on loop complexity, time spent in the loop can be
> more than the current limit (250 ms), but imposing an upper bound on
> program runtime is an orthogonal problem which will be addressed when
> program cancellations are supported.
>
> The current time afforded by cond_break may not be enough for cases
> where BPF programs want to implement locking algorithms inline, and use
> cond_break as a promise to the verifier that they will eventually
> terminate.
>
> Below are some benchmarking numbers on the time taken per-iteration for
> an empty loop that counts the number of iterations until cond_break
> fires. For comparison, we compare it against bpf_for/bpf_repeat which is
> another way to achieve the same number of spins (BPF_MAX_LOOPS).  The
> hardware used for benchmarking was a Saphire Rapids Intel server with
> performance governor enabled.
>
> +-----------------------------+--------------+--------------+------------------+
> | Loop type                   | Iterations   |  Time (ms)   |   Time/iter (ns) |
> +-----------------------------|--------------+--------------+------------------+
> | may_goto                    | 8388608      |  3           |   0.36           |
> | timed_may_goto (count=65535)| 589674932    |  250         |   0.42           |
> | bpf_for                     | 8388608      |  10          |   1.19           |
> +-----------------------------+--------------+--------------+------------------+
>
> This gives a good approximation at low overhead while staying close to
> the current implementation.
>
> Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
> ---
>  include/linux/bpf.h    |  1 +
>  include/linux/filter.h |  8 +++++++
>  kernel/bpf/core.c      | 31 +++++++++++++++++++++++++
>  kernel/bpf/verifier.c  | 52 +++++++++++++++++++++++++++++++++++-------
>  4 files changed, 84 insertions(+), 8 deletions(-)
>
> diff --git a/include/linux/bpf.h b/include/linux/bpf.h
> index aec102868b93..788f6ca374e9 100644
> --- a/include/linux/bpf.h
> +++ b/include/linux/bpf.h
> @@ -1986,6 +1986,7 @@ struct bpf_array {
>   */
>  enum {
>         BPF_MAX_LOOPS = 8 * 1024 * 1024,
> +       BPF_MAX_TIMED_LOOPS = 0xffff,
>  };
>
>  #define BPF_F_ACCESS_MASK      (BPF_F_RDONLY |         \
> diff --git a/include/linux/filter.h b/include/linux/filter.h
> index 3ed6eb9e7c73..02dda5c53d91 100644
> --- a/include/linux/filter.h
> +++ b/include/linux/filter.h
> @@ -669,6 +669,11 @@ struct bpf_prog_stats {
>         struct u64_stats_sync syncp;
>  } __aligned(2 * sizeof(u64));
>
> +struct bpf_timed_may_goto {
> +       u64 count;
> +       u64 timestamp;
> +};
> +
>  struct sk_filter {
>         refcount_t      refcnt;
>         struct rcu_head rcu;
> @@ -1130,8 +1135,11 @@ bool bpf_jit_supports_ptr_xchg(void);
>  bool bpf_jit_supports_arena(void);
>  bool bpf_jit_supports_insn(struct bpf_insn *insn, bool in_arena);
>  bool bpf_jit_supports_private_stack(void);
> +bool bpf_jit_supports_timed_may_goto(void);
>  u64 bpf_arch_uaddress_limit(void);
>  void arch_bpf_stack_walk(bool (*consume_fn)(void *cookie, u64 ip, u64 sp, u64 bp), void *cookie);
> +u64 arch_bpf_timed_may_goto(void);
> +u64 bpf_check_timed_may_goto(struct bpf_timed_may_goto *);
>  bool bpf_helper_changes_pkt_data(enum bpf_func_id func_id);
>
>  static inline bool bpf_dump_raw_ok(const struct cred *cred)
> diff --git a/kernel/bpf/core.c b/kernel/bpf/core.c
> index a0200fbbace9..b3f7c7bd08d3 100644
> --- a/kernel/bpf/core.c
> +++ b/kernel/bpf/core.c
> @@ -3069,6 +3069,37 @@ void __weak arch_bpf_stack_walk(bool (*consume_fn)(void *cookie, u64 ip, u64 sp,
>  {
>  }
>
> +bool __weak bpf_jit_supports_timed_may_goto(void)
> +{
> +       return false;
> +}
> +
> +u64 __weak arch_bpf_timed_may_goto(void)
> +{
> +       return 0;
> +}
> +
> +u64 bpf_check_timed_may_goto(struct bpf_timed_may_goto *p)
> +{
> +       u64 time = ktime_get_mono_fast_ns();

let's move the call after !p->count check to avoid unused work.

> +
> +       /* If the count is zero, we've already broken a prior loop in this stack
> +        * frame, let's just exit quickly.
> +        */

Let's use normal kernel comment style in all new code.
I think even netdev folks allow both styles now.

> +       if (!p->count)
> +               return 0;
> +       /* Populate the timestamp for this stack frame. */
> +       if (!p->timestamp) {
> +               p->timestamp = time;
> +               return BPF_MAX_TIMED_LOOPS;
> +       }
> +       /* Check if we've exhausted our time slice. */
> +       if (time - p->timestamp >= (NSEC_PER_SEC / 4))
> +               return 0;
> +       /* Refresh the count for the stack frame. */
> +       return BPF_MAX_TIMED_LOOPS;
> +}
> +
>  /* for configs without MMU or 32-bit */
>  __weak const struct bpf_map_ops arena_map_ops;
>  __weak u64 bpf_arena_get_user_vm_start(struct bpf_arena *arena)
> diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
> index dcd0da4e62fc..79bfb1932f40 100644
> --- a/kernel/bpf/verifier.c
> +++ b/kernel/bpf/verifier.c
> @@ -21503,7 +21503,34 @@ static int do_misc_fixups(struct bpf_verifier_env *env)
>                         goto next_insn;
>                 }
>
> -               if (is_may_goto_insn(insn)) {
> +               if (is_may_goto_insn(insn) && bpf_jit_supports_timed_may_goto()) {
> +                       int stack_off_cnt = -stack_depth - 16;
> +
> +                       /* Two 8 byte slots, depth-16 stores the count, and
> +                        * depth-8 stores the start timestamp of the loop.
> +                        */
> +                       stack_depth_extra = 16;
> +                       insn_buf[0] = BPF_LDX_MEM(BPF_DW, BPF_REG_AX, BPF_REG_10, stack_off_cnt);
> +                       if (insn->off >= 0)
> +                               insn_buf[1] = BPF_JMP_IMM(BPF_JEQ, BPF_REG_AX, 0, insn->off + 5);
> +                       else
> +                               insn_buf[1] = BPF_JMP_IMM(BPF_JEQ, BPF_REG_AX, 0, insn->off - 1);
> +                       insn_buf[2] = BPF_ALU64_IMM(BPF_SUB, BPF_REG_AX, 1);
> +                       insn_buf[3] = BPF_JMP_IMM(BPF_JNE, BPF_REG_AX, 1, 2);

Maybe != 0 instead ?
Otherwise it's off by 1.

> +                       insn_buf[4] = BPF_MOV64_IMM(BPF_REG_AX, stack_off_cnt);

Please add a comment that BPF_REG_AX is used as an argument
register and contains return value too.

I looked at a couple other non-x86 JITs and I think this calling
convention should work for them too.

> +                       insn_buf[5] = BPF_RAW_INSN(BPF_JMP | BPF_CALL, 0, 0, 0, BPF_CALL_IMM(arch_bpf_timed_may_goto));

Use BPF_EMIT_CALL() instead?

> +                       insn_buf[6] = BPF_STX_MEM(BPF_DW, BPF_REG_10, BPF_REG_AX, stack_off_cnt);
> +                       cnt = 7;
> +
> +                       new_prog = bpf_patch_insn_data(env, i + delta, insn_buf, cnt);
> +                       if (!new_prog)
> +                               return -ENOMEM;
> +
> +                       delta += cnt - 1;
> +                       env->prog = prog = new_prog;
> +                       insn = new_prog->insnsi + i + delta;
> +                       goto next_insn;
> +               } else if (is_may_goto_insn(insn)) {
>                         int stack_off = -stack_depth - 8;
>
>                         stack_depth_extra = 8;
> @@ -22044,23 +22071,32 @@ static int do_misc_fixups(struct bpf_verifier_env *env)
>
>         env->prog->aux->stack_depth = subprogs[0].stack_depth;
>         for (i = 0; i < env->subprog_cnt; i++) {
> +               int delta = bpf_jit_supports_timed_may_goto() ? 2 : 1;
>                 int subprog_start = subprogs[i].start;
>                 int stack_slots = subprogs[i].stack_extra / 8;
> +               int slots = delta, cnt = 0;
>
>                 if (!stack_slots)
>                         continue;
> -               if (stack_slots > 1) {
> +               /* We need two in case timed may_goto is supported. */
> +               if (stack_slots > slots) {
>                         verbose(env, "verifier bug: stack_slots supports may_goto only\n");
>                         return -EFAULT;
>                 }
>
> -               /* Add ST insn to subprog prologue to init extra stack */
> -               insn_buf[0] = BPF_ST_MEM(BPF_DW, BPF_REG_FP,
> -                                        -subprogs[i].stack_depth, BPF_MAX_LOOPS);
> +               if (bpf_jit_supports_timed_may_goto()) {
> +                       insn_buf[cnt++] = BPF_ST_MEM(BPF_DW, BPF_REG_FP, -subprogs[i].stack_depth,
> +                                                    BPF_MAX_TIMED_LOOPS);
> +                       insn_buf[cnt++] = BPF_ST_MEM(BPF_DW, BPF_REG_FP, -subprogs[i].stack_depth + 8, 0);
> +               } else {
> +                       /* Add ST insn to subprog prologue to init extra stack */
> +                       insn_buf[cnt++] = BPF_ST_MEM(BPF_DW, BPF_REG_FP, -subprogs[i].stack_depth,
> +                                                    BPF_MAX_LOOPS);
> +               }
>                 /* Copy first actual insn to preserve it */
> -               insn_buf[1] = env->prog->insnsi[subprog_start];
> +               insn_buf[cnt++] = env->prog->insnsi[subprog_start];
>
> -               new_prog = bpf_patch_insn_data(env, subprog_start, insn_buf, 2);
> +               new_prog = bpf_patch_insn_data(env, subprog_start, insn_buf, cnt);
>                 if (!new_prog)
>                         return -ENOMEM;
>                 env->prog = prog = new_prog;
> @@ -22070,7 +22106,7 @@ static int do_misc_fixups(struct bpf_verifier_env *env)
>                  * to insn after BPF_ST that inits may_goto count.
>                  * Adjustment will succeed because bpf_patch_insn_data() didn't fail.
>                  */
> -               WARN_ON(adjust_jmp_off(env->prog, subprog_start, 1));
> +               WARN_ON(adjust_jmp_off(env->prog, subprog_start, delta));
>         }
>
>         /* Since poke tab is now finalized, publish aux to tracker. */
> --
> 2.43.5
>

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: [PATCH bpf-next v1 1/2] bpf: Add verifier support for timed may_goto
  2025-03-03 21:59   ` Alexei Starovoitov
@ 2025-03-03 22:25     ` Kumar Kartikeya Dwivedi
  2025-03-03 22:40       ` Kumar Kartikeya Dwivedi
  0 siblings, 1 reply; 6+ messages in thread
From: Kumar Kartikeya Dwivedi @ 2025-03-03 22:25 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: bpf, Alexei Starovoitov, Andrii Nakryiko, Daniel Borkmann,
	Martin KaFai Lau, Eduard Zingerman, Tejun Heo, Emil Tsalapatis,
	Barret Rhoden, Josh Don, Dohyun Kim, kkd, Kernel Team

On Mon, 3 Mar 2025 at 22:59, Alexei Starovoitov
<alexei.starovoitov@gmail.com> wrote:
>
> On Sun, Mar 2, 2025 at 12:13 PM Kumar Kartikeya Dwivedi
> <memxor@gmail.com> wrote:
> >
> > Implement support in the verifier for replacing may_goto implementation
> > from a counter-based approach to one which samples time on the local CPU
> > to have a bigger loop bound.
> >
> > We implement it by maintaining 16-bytes per-stack frame, and using 8
> > bytes for maintaining the count for amortizing time sampling, and 8
> > bytes for the starting timestamp. To minimize overhead, we need to avoid
> > spilling and filling of registers around this sequence, so we push this
> > cost into the time sampling function 'arch_bpf_timed_may_goto'. This is
> > a JIT-specific wrapper around bpf_check_timed_may_goto which returns us
> > the count to store into the stack through BPF_REG_AX. All caller-saved
> > registers (r0-r5) are guaranteed to remain untouched.
> >
> > The loop can be broken by returning count as 0, otherwise we dispatch
> > into the function when the count becomes 1, and the runtime chooses to
> > refresh it (by returning count as BPF_MAX_TIMED_LOOPS) or returning 0
> > and aborting it.
> >
> > Since the check for 0 is done right after loading the count from the
> > stack, all subsequent cond_break sequences should immediately break as
> > well.
> >
> > We pass in the stack_depth of the count (and thus the timestamp, by
> > adding 8 to it) to the arch_bpf_timed_may_goto call so that it can be
> > passed in to bpf_check_timed_may_goto as an argument after r1 is saved,
> > by adding the offset to r10/fp. This adjustment will be arch specific,
> > and the next patch will introduce support for x86.
> >
> > Note that depending on loop complexity, time spent in the loop can be
> > more than the current limit (250 ms), but imposing an upper bound on
> > program runtime is an orthogonal problem which will be addressed when
> > program cancellations are supported.
> >
> > The current time afforded by cond_break may not be enough for cases
> > where BPF programs want to implement locking algorithms inline, and use
> > cond_break as a promise to the verifier that they will eventually
> > terminate.
> >
> > Below are some benchmarking numbers on the time taken per-iteration for
> > an empty loop that counts the number of iterations until cond_break
> > fires. For comparison, we compare it against bpf_for/bpf_repeat which is
> > another way to achieve the same number of spins (BPF_MAX_LOOPS).  The
> > hardware used for benchmarking was a Saphire Rapids Intel server with
> > performance governor enabled.
> >
> > +-----------------------------+--------------+--------------+------------------+
> > | Loop type                   | Iterations   |  Time (ms)   |   Time/iter (ns) |
> > +-----------------------------|--------------+--------------+------------------+
> > | may_goto                    | 8388608      |  3           |   0.36           |
> > | timed_may_goto (count=65535)| 589674932    |  250         |   0.42           |
> > | bpf_for                     | 8388608      |  10          |   1.19           |
> > +-----------------------------+--------------+--------------+------------------+
> >
> > This gives a good approximation at low overhead while staying close to
> > the current implementation.
> >
> > Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
> > ---
> >  include/linux/bpf.h    |  1 +
> >  include/linux/filter.h |  8 +++++++
> >  kernel/bpf/core.c      | 31 +++++++++++++++++++++++++
> >  kernel/bpf/verifier.c  | 52 +++++++++++++++++++++++++++++++++++-------
> >  4 files changed, 84 insertions(+), 8 deletions(-)
> >
> > diff --git a/include/linux/bpf.h b/include/linux/bpf.h
> > index aec102868b93..788f6ca374e9 100644
> > --- a/include/linux/bpf.h
> > +++ b/include/linux/bpf.h
> > @@ -1986,6 +1986,7 @@ struct bpf_array {
> >   */
> >  enum {
> >         BPF_MAX_LOOPS = 8 * 1024 * 1024,
> > +       BPF_MAX_TIMED_LOOPS = 0xffff,
> >  };
> >
> >  #define BPF_F_ACCESS_MASK      (BPF_F_RDONLY |         \
> > diff --git a/include/linux/filter.h b/include/linux/filter.h
> > index 3ed6eb9e7c73..02dda5c53d91 100644
> > --- a/include/linux/filter.h
> > +++ b/include/linux/filter.h
> > @@ -669,6 +669,11 @@ struct bpf_prog_stats {
> >         struct u64_stats_sync syncp;
> >  } __aligned(2 * sizeof(u64));
> >
> > +struct bpf_timed_may_goto {
> > +       u64 count;
> > +       u64 timestamp;
> > +};
> > +
> >  struct sk_filter {
> >         refcount_t      refcnt;
> >         struct rcu_head rcu;
> > @@ -1130,8 +1135,11 @@ bool bpf_jit_supports_ptr_xchg(void);
> >  bool bpf_jit_supports_arena(void);
> >  bool bpf_jit_supports_insn(struct bpf_insn *insn, bool in_arena);
> >  bool bpf_jit_supports_private_stack(void);
> > +bool bpf_jit_supports_timed_may_goto(void);
> >  u64 bpf_arch_uaddress_limit(void);
> >  void arch_bpf_stack_walk(bool (*consume_fn)(void *cookie, u64 ip, u64 sp, u64 bp), void *cookie);
> > +u64 arch_bpf_timed_may_goto(void);
> > +u64 bpf_check_timed_may_goto(struct bpf_timed_may_goto *);
> >  bool bpf_helper_changes_pkt_data(enum bpf_func_id func_id);
> >
> >  static inline bool bpf_dump_raw_ok(const struct cred *cred)
> > diff --git a/kernel/bpf/core.c b/kernel/bpf/core.c
> > index a0200fbbace9..b3f7c7bd08d3 100644
> > --- a/kernel/bpf/core.c
> > +++ b/kernel/bpf/core.c
> > @@ -3069,6 +3069,37 @@ void __weak arch_bpf_stack_walk(bool (*consume_fn)(void *cookie, u64 ip, u64 sp,
> >  {
> >  }
> >
> > +bool __weak bpf_jit_supports_timed_may_goto(void)
> > +{
> > +       return false;
> > +}
> > +
> > +u64 __weak arch_bpf_timed_may_goto(void)
> > +{
> > +       return 0;
> > +}
> > +
> > +u64 bpf_check_timed_may_goto(struct bpf_timed_may_goto *p)
> > +{
> > +       u64 time = ktime_get_mono_fast_ns();
>
> let's move the call after !p->count check to avoid unused work.
>

Ok.

> > +
> > +       /* If the count is zero, we've already broken a prior loop in this stack
> > +        * frame, let's just exit quickly.
> > +        */
>
> Let's use normal kernel comment style in all new code.
> I think even netdev folks allow both styles now.
>

Ack.

> > +       if (!p->count)
> > +               return 0;
> > +       /* Populate the timestamp for this stack frame. */
> > +       if (!p->timestamp) {
> > +               p->timestamp = time;
> > +               return BPF_MAX_TIMED_LOOPS;
> > +       }
> > +       /* Check if we've exhausted our time slice. */
> > +       if (time - p->timestamp >= (NSEC_PER_SEC / 4))
> > +               return 0;
> > +       /* Refresh the count for the stack frame. */
> > +       return BPF_MAX_TIMED_LOOPS;
> > +}
> > +
> >  /* for configs without MMU or 32-bit */
> >  __weak const struct bpf_map_ops arena_map_ops;
> >  __weak u64 bpf_arena_get_user_vm_start(struct bpf_arena *arena)
> > diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
> > index dcd0da4e62fc..79bfb1932f40 100644
> > --- a/kernel/bpf/verifier.c
> > +++ b/kernel/bpf/verifier.c
> > @@ -21503,7 +21503,34 @@ static int do_misc_fixups(struct bpf_verifier_env *env)
> >                         goto next_insn;
> >                 }
> >
> > -               if (is_may_goto_insn(insn)) {
> > +               if (is_may_goto_insn(insn) && bpf_jit_supports_timed_may_goto()) {
> > +                       int stack_off_cnt = -stack_depth - 16;
> > +
> > +                       /* Two 8 byte slots, depth-16 stores the count, and
> > +                        * depth-8 stores the start timestamp of the loop.
> > +                        */
> > +                       stack_depth_extra = 16;
> > +                       insn_buf[0] = BPF_LDX_MEM(BPF_DW, BPF_REG_AX, BPF_REG_10, stack_off_cnt);
> > +                       if (insn->off >= 0)
> > +                               insn_buf[1] = BPF_JMP_IMM(BPF_JEQ, BPF_REG_AX, 0, insn->off + 5);
> > +                       else
> > +                               insn_buf[1] = BPF_JMP_IMM(BPF_JEQ, BPF_REG_AX, 0, insn->off - 1);
> > +                       insn_buf[2] = BPF_ALU64_IMM(BPF_SUB, BPF_REG_AX, 1);
> > +                       insn_buf[3] = BPF_JMP_IMM(BPF_JNE, BPF_REG_AX, 1, 2);
>
> Maybe != 0 instead ?
> Otherwise it's off by 1.

We'll never do the sub with AX=0, we'll always break out in that case.
It starts at 0xffff, so when it's 2 in stack, and 1 on subtraction, we
will do the call.
This resets it to 0xffff or 0.

But it's late and I could be missing something.

>
> > +                       insn_buf[4] = BPF_MOV64_IMM(BPF_REG_AX, stack_off_cnt);
>
> Please add a comment that BPF_REG_AX is used as an argument
> register and contains return value too.

Ok, will do.

>
> I looked at a couple other non-x86 JITs and I think this calling
> convention should work for them too.
>
> > +                       insn_buf[5] = BPF_RAW_INSN(BPF_JMP | BPF_CALL, 0, 0, 0, BPF_CALL_IMM(arch_bpf_timed_may_goto));
>
> Use BPF_EMIT_CALL() instead?
>

Ack.

> > [...]

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: [PATCH bpf-next v1 1/2] bpf: Add verifier support for timed may_goto
  2025-03-03 22:25     ` Kumar Kartikeya Dwivedi
@ 2025-03-03 22:40       ` Kumar Kartikeya Dwivedi
  0 siblings, 0 replies; 6+ messages in thread
From: Kumar Kartikeya Dwivedi @ 2025-03-03 22:40 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: bpf, Alexei Starovoitov, Andrii Nakryiko, Daniel Borkmann,
	Martin KaFai Lau, Eduard Zingerman, Tejun Heo, Emil Tsalapatis,
	Barret Rhoden, Josh Don, Dohyun Kim, kkd, Kernel Team

On Mon, 3 Mar 2025 at 23:25, Kumar Kartikeya Dwivedi <memxor@gmail.com> wrote:
>
> On Mon, 3 Mar 2025 at 22:59, Alexei Starovoitov
> <alexei.starovoitov@gmail.com> wrote:
> >
> > On Sun, Mar 2, 2025 at 12:13 PM Kumar Kartikeya Dwivedi
> > <memxor@gmail.com> wrote:
> > >
> > > Implement support in the verifier for replacing may_goto implementation
> > > from a counter-based approach to one which samples time on the local CPU
> > > to have a bigger loop bound.
> > >
> > > We implement it by maintaining 16-bytes per-stack frame, and using 8
> > > bytes for maintaining the count for amortizing time sampling, and 8
> > > bytes for the starting timestamp. To minimize overhead, we need to avoid
> > > spilling and filling of registers around this sequence, so we push this
> > > cost into the time sampling function 'arch_bpf_timed_may_goto'. This is
> > > a JIT-specific wrapper around bpf_check_timed_may_goto which returns us
> > > the count to store into the stack through BPF_REG_AX. All caller-saved
> > > registers (r0-r5) are guaranteed to remain untouched.
> > >
> > > The loop can be broken by returning count as 0, otherwise we dispatch
> > > into the function when the count becomes 1, and the runtime chooses to
> > > refresh it (by returning count as BPF_MAX_TIMED_LOOPS) or returning 0
> > > and aborting it.
> > >
> > > Since the check for 0 is done right after loading the count from the
> > > stack, all subsequent cond_break sequences should immediately break as
> > > well.
> > >
> > > We pass in the stack_depth of the count (and thus the timestamp, by
> > > adding 8 to it) to the arch_bpf_timed_may_goto call so that it can be
> > > passed in to bpf_check_timed_may_goto as an argument after r1 is saved,
> > > by adding the offset to r10/fp. This adjustment will be arch specific,
> > > and the next patch will introduce support for x86.
> > >
> > > Note that depending on loop complexity, time spent in the loop can be
> > > more than the current limit (250 ms), but imposing an upper bound on
> > > program runtime is an orthogonal problem which will be addressed when
> > > program cancellations are supported.
> > >
> > > The current time afforded by cond_break may not be enough for cases
> > > where BPF programs want to implement locking algorithms inline, and use
> > > cond_break as a promise to the verifier that they will eventually
> > > terminate.
> > >
> > > Below are some benchmarking numbers on the time taken per-iteration for
> > > an empty loop that counts the number of iterations until cond_break
> > > fires. For comparison, we compare it against bpf_for/bpf_repeat which is
> > > another way to achieve the same number of spins (BPF_MAX_LOOPS).  The
> > > hardware used for benchmarking was a Saphire Rapids Intel server with
> > > performance governor enabled.
> > >
> > > +-----------------------------+--------------+--------------+------------------+
> > > | Loop type                   | Iterations   |  Time (ms)   |   Time/iter (ns) |
> > > +-----------------------------|--------------+--------------+------------------+
> > > | may_goto                    | 8388608      |  3           |   0.36           |
> > > | timed_may_goto (count=65535)| 589674932    |  250         |   0.42           |
> > > | bpf_for                     | 8388608      |  10          |   1.19           |
> > > +-----------------------------+--------------+--------------+------------------+
> > >
> > > This gives a good approximation at low overhead while staying close to
> > > the current implementation.
> > >
> > > Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
> > > ---
> > >  include/linux/bpf.h    |  1 +
> > >  include/linux/filter.h |  8 +++++++
> > >  kernel/bpf/core.c      | 31 +++++++++++++++++++++++++
> > >  kernel/bpf/verifier.c  | 52 +++++++++++++++++++++++++++++++++++-------
> > >  4 files changed, 84 insertions(+), 8 deletions(-)
> > >
> > > diff --git a/include/linux/bpf.h b/include/linux/bpf.h
> > > index aec102868b93..788f6ca374e9 100644
> > > --- a/include/linux/bpf.h
> > > +++ b/include/linux/bpf.h
> > > @@ -1986,6 +1986,7 @@ struct bpf_array {
> > >   */
> > >  enum {
> > >         BPF_MAX_LOOPS = 8 * 1024 * 1024,
> > > +       BPF_MAX_TIMED_LOOPS = 0xffff,
> > >  };
> > >
> > >  #define BPF_F_ACCESS_MASK      (BPF_F_RDONLY |         \
> > > diff --git a/include/linux/filter.h b/include/linux/filter.h
> > > index 3ed6eb9e7c73..02dda5c53d91 100644
> > > --- a/include/linux/filter.h
> > > +++ b/include/linux/filter.h
> > > @@ -669,6 +669,11 @@ struct bpf_prog_stats {
> > >         struct u64_stats_sync syncp;
> > >  } __aligned(2 * sizeof(u64));
> > >
> > > +struct bpf_timed_may_goto {
> > > +       u64 count;
> > > +       u64 timestamp;
> > > +};
> > > +
> > >  struct sk_filter {
> > >         refcount_t      refcnt;
> > >         struct rcu_head rcu;
> > > @@ -1130,8 +1135,11 @@ bool bpf_jit_supports_ptr_xchg(void);
> > >  bool bpf_jit_supports_arena(void);
> > >  bool bpf_jit_supports_insn(struct bpf_insn *insn, bool in_arena);
> > >  bool bpf_jit_supports_private_stack(void);
> > > +bool bpf_jit_supports_timed_may_goto(void);
> > >  u64 bpf_arch_uaddress_limit(void);
> > >  void arch_bpf_stack_walk(bool (*consume_fn)(void *cookie, u64 ip, u64 sp, u64 bp), void *cookie);
> > > +u64 arch_bpf_timed_may_goto(void);
> > > +u64 bpf_check_timed_may_goto(struct bpf_timed_may_goto *);
> > >  bool bpf_helper_changes_pkt_data(enum bpf_func_id func_id);
> > >
> > >  static inline bool bpf_dump_raw_ok(const struct cred *cred)
> > > diff --git a/kernel/bpf/core.c b/kernel/bpf/core.c
> > > index a0200fbbace9..b3f7c7bd08d3 100644
> > > --- a/kernel/bpf/core.c
> > > +++ b/kernel/bpf/core.c
> > > @@ -3069,6 +3069,37 @@ void __weak arch_bpf_stack_walk(bool (*consume_fn)(void *cookie, u64 ip, u64 sp,
> > >  {
> > >  }
> > >
> > > +bool __weak bpf_jit_supports_timed_may_goto(void)
> > > +{
> > > +       return false;
> > > +}
> > > +
> > > +u64 __weak arch_bpf_timed_may_goto(void)
> > > +{
> > > +       return 0;
> > > +}
> > > +
> > > +u64 bpf_check_timed_may_goto(struct bpf_timed_may_goto *p)
> > > +{
> > > +       u64 time = ktime_get_mono_fast_ns();
> >
> > let's move the call after !p->count check to avoid unused work.
> >
>
> Ok.
>
> > > +
> > > +       /* If the count is zero, we've already broken a prior loop in this stack
> > > +        * frame, let's just exit quickly.
> > > +        */
> >
> > Let's use normal kernel comment style in all new code.
> > I think even netdev folks allow both styles now.
> >
>
> Ack.
>
> > > +       if (!p->count)
> > > +               return 0;
> > > +       /* Populate the timestamp for this stack frame. */
> > > +       if (!p->timestamp) {
> > > +               p->timestamp = time;
> > > +               return BPF_MAX_TIMED_LOOPS;
> > > +       }
> > > +       /* Check if we've exhausted our time slice. */
> > > +       if (time - p->timestamp >= (NSEC_PER_SEC / 4))
> > > +               return 0;
> > > +       /* Refresh the count for the stack frame. */
> > > +       return BPF_MAX_TIMED_LOOPS;
> > > +}
> > > +
> > >  /* for configs without MMU or 32-bit */
> > >  __weak const struct bpf_map_ops arena_map_ops;
> > >  __weak u64 bpf_arena_get_user_vm_start(struct bpf_arena *arena)
> > > diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
> > > index dcd0da4e62fc..79bfb1932f40 100644
> > > --- a/kernel/bpf/verifier.c
> > > +++ b/kernel/bpf/verifier.c
> > > @@ -21503,7 +21503,34 @@ static int do_misc_fixups(struct bpf_verifier_env *env)
> > >                         goto next_insn;
> > >                 }
> > >
> > > -               if (is_may_goto_insn(insn)) {
> > > +               if (is_may_goto_insn(insn) && bpf_jit_supports_timed_may_goto()) {
> > > +                       int stack_off_cnt = -stack_depth - 16;
> > > +
> > > +                       /* Two 8 byte slots, depth-16 stores the count, and
> > > +                        * depth-8 stores the start timestamp of the loop.
> > > +                        */
> > > +                       stack_depth_extra = 16;
> > > +                       insn_buf[0] = BPF_LDX_MEM(BPF_DW, BPF_REG_AX, BPF_REG_10, stack_off_cnt);
> > > +                       if (insn->off >= 0)
> > > +                               insn_buf[1] = BPF_JMP_IMM(BPF_JEQ, BPF_REG_AX, 0, insn->off + 5);
> > > +                       else
> > > +                               insn_buf[1] = BPF_JMP_IMM(BPF_JEQ, BPF_REG_AX, 0, insn->off - 1);
> > > +                       insn_buf[2] = BPF_ALU64_IMM(BPF_SUB, BPF_REG_AX, 1);
> > > +                       insn_buf[3] = BPF_JMP_IMM(BPF_JNE, BPF_REG_AX, 1, 2);
> >
> > Maybe != 0 instead ?
> > Otherwise it's off by 1.
>
> We'll never do the sub with AX=0, we'll always break out in that case.
> It starts at 0xffff, so when it's 2 in stack, and 1 on subtraction, we
> will do the call.
> This resets it to 0xffff or 0.
>
> But it's late and I could be missing something.

Which means we will never see p->count as 0 in
bpf_check_timed_may_goto, since we'll go through the first condition
checking ax to be 0, so I will just remove that condition from the
function.

>
> >
> > > +                       insn_buf[4] = BPF_MOV64_IMM(BPF_REG_AX, stack_off_cnt);
> >
> > Please add a comment that BPF_REG_AX is used as an argument
> > register and contains return value too.
>
> Ok, will do.
>
> >
> > I looked at a couple other non-x86 JITs and I think this calling
> > convention should work for them too.
> >
> > > +                       insn_buf[5] = BPF_RAW_INSN(BPF_JMP | BPF_CALL, 0, 0, 0, BPF_CALL_IMM(arch_bpf_timed_may_goto));
> >
> > Use BPF_EMIT_CALL() instead?
> >
>
> Ack.
>
> > > [...]

^ permalink raw reply	[flat|nested] 6+ messages in thread

* [PATCH bpf-next v1 2/2] bpf, x86: Add x86 JIT support for timed may_goto
  2025-03-02 20:13 [PATCH bpf-next v1 0/2] Timed may_goto Kumar Kartikeya Dwivedi
  2025-03-02 20:13 ` [PATCH bpf-next v1 1/2] bpf: Add verifier support for timed may_goto Kumar Kartikeya Dwivedi
@ 2025-03-02 20:13 ` Kumar Kartikeya Dwivedi
  1 sibling, 0 replies; 6+ messages in thread
From: Kumar Kartikeya Dwivedi @ 2025-03-02 20:13 UTC (permalink / raw)
  To: bpf
  Cc: Alexei Starovoitov, Andrii Nakryiko, Daniel Borkmann,
	Martin KaFai Lau, Eduard Zingerman, Tejun Heo, Emil Tsalapatis,
	Barret Rhoden, Josh Don, Dohyun Kim, kkd, kernel-team

Implement the arch_bpf_timed_may_goto function using inline assembly to
have control over which registers are spilled, and use our special
protocol of using BPF_REG_AX as an argument into the function, and as
the return value when going back.

Emit call depth accounting for the call made from this stub, and ensure
we don't have naked returns (when rethunk mitigations are enabled) by
falling back to the RET macro (instead of retq). After popping all saved
registers, the return address into the BPF program should be on top of
the stack.

Since the JIT support is now enabled, ensure selftests which are
checking the produced may_goto sequences do not break by adjusting them.
Make sure we still test the old may_goto sequence on other
architectures, while testing the new sequence on x86_64.

Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
---
 arch/x86/net/Makefile                         |  2 +-
 arch/x86/net/bpf_jit_comp.c                   |  5 ++
 arch/x86/net/bpf_timed_may_goto.S             | 43 ++++++++++++++
 .../bpf/progs/verifier_bpf_fastcall.c         | 58 +++++++++++++++----
 .../selftests/bpf/progs/verifier_may_goto_1.c | 34 ++++++++++-
 5 files changed, 129 insertions(+), 13 deletions(-)
 create mode 100644 arch/x86/net/bpf_timed_may_goto.S

diff --git a/arch/x86/net/Makefile b/arch/x86/net/Makefile
index 383c87300b0d..dddbefc0f439 100644
--- a/arch/x86/net/Makefile
+++ b/arch/x86/net/Makefile
@@ -6,5 +6,5 @@
 ifeq ($(CONFIG_X86_32),y)
         obj-$(CONFIG_BPF_JIT) += bpf_jit_comp32.o
 else
-        obj-$(CONFIG_BPF_JIT) += bpf_jit_comp.o
+        obj-$(CONFIG_BPF_JIT) += bpf_jit_comp.o bpf_timed_may_goto.o
 endif
diff --git a/arch/x86/net/bpf_jit_comp.c b/arch/x86/net/bpf_jit_comp.c
index a43fc5af973d..f3e9ef6b5329 100644
--- a/arch/x86/net/bpf_jit_comp.c
+++ b/arch/x86/net/bpf_jit_comp.c
@@ -3791,3 +3791,8 @@ u64 bpf_arch_uaddress_limit(void)
 {
 	return 0;
 }
+
+bool bpf_jit_supports_timed_may_goto(void)
+{
+	return true;
+}
diff --git a/arch/x86/net/bpf_timed_may_goto.S b/arch/x86/net/bpf_timed_may_goto.S
new file mode 100644
index 000000000000..c35e00b93ac6
--- /dev/null
+++ b/arch/x86/net/bpf_timed_may_goto.S
@@ -0,0 +1,43 @@
+// SPDX-License-Identifier: GPL-2.0
+/* Copyright (c) 2025 Meta Platforms, Inc. and affiliates. */
+
+#include <linux/export.h>
+#include <linux/linkage.h>
+#include <asm/nospec-branch.h>
+
+	.code64
+	.section .text, "ax"
+
+SYM_FUNC_START(arch_bpf_timed_may_goto)
+	ANNOTATE_NOENDBR
+
+	/* Save r0-r5 */
+	pushq %rax
+	pushq %rdi
+	pushq %rsi
+	pushq %rdx
+	pushq %rcx
+	pushq %r8
+
+	/* r10 passes us stack depth, load the pointer to count and timestamp as
+	 * first argument to the call below.
+	 */
+	leaq (%rbp, %r10, 1), %rdi
+
+	/* Emit call depth accounting for call below */
+	CALL_DEPTH_ACCOUNT
+	call bpf_check_timed_may_goto
+
+	/* BPF_REG_AX=r10 will be stored into count, so move return value to it */
+	movq %rax, %r10
+
+	/* Restore r5-r0 */
+	popq %r8
+	popq %rcx
+	popq %rdx
+	popq %rsi
+	popq %rdi
+	popq %rax
+
+	RET
+SYM_FUNC_END(arch_bpf_timed_may_goto)
diff --git a/tools/testing/selftests/bpf/progs/verifier_bpf_fastcall.c b/tools/testing/selftests/bpf/progs/verifier_bpf_fastcall.c
index 5094c288cfd7..67e6980cd722 100644
--- a/tools/testing/selftests/bpf/progs/verifier_bpf_fastcall.c
+++ b/tools/testing/selftests/bpf/progs/verifier_bpf_fastcall.c
@@ -620,23 +620,61 @@ __naked void helper_call_does_not_prevent_bpf_fastcall(void)
 
 SEC("raw_tp")
 __arch_x86_64
+__log_level(4) __msg("stack depth 24")
+/* may_goto counter at -24 */
+__xlated("0: *(u64 *)(r10 -24) =")
+/* may_goto timestamp at -16 */
+__xlated("1: *(u64 *)(r10 -16) =")
+__xlated("2: r1 = 1")
+__xlated("...")
+__xlated("4: r0 = &(void __percpu *)(r0)")
+__xlated("...")
+/* may_goto expansion starts */
+__xlated("6: r11 = *(u64 *)(r10 -24)")
+__xlated("7: if r11 == 0x0 goto pc+6")
+__xlated("8: r11 -= 1")
+__xlated("9: if r11 != 0x1 goto pc+2")
+__xlated("10: r11 = -24")
+__xlated("11: call unknown")
+__xlated("12: *(u64 *)(r10 -24) = r11")
+/* may_goto expansion ends */
+__xlated("13: *(u64 *)(r10 -8) = r1")
+__xlated("14: exit")
+__success
+__naked void may_goto_interaction_x86_64(void)
+{
+	asm volatile (
+	"r1 = 1;"
+	"*(u64 *)(r10 - 16) = r1;"
+	"call %[bpf_get_smp_processor_id];"
+	"r1 = *(u64 *)(r10 - 16);"
+	".8byte %[may_goto];"
+	/* just touch some stack at -8 */
+	"*(u64 *)(r10 - 8) = r1;"
+	"exit;"
+	:
+	: __imm(bpf_get_smp_processor_id),
+	  __imm_insn(may_goto, BPF_RAW_INSN(BPF_JMP | BPF_JCOND, 0, 0, +1 /* offset */, 0))
+	: __clobber_all);
+}
+
+SEC("raw_tp")
+__arch_arm64
 __log_level(4) __msg("stack depth 16")
 /* may_goto counter at -16 */
 __xlated("0: *(u64 *)(r10 -16) =")
 __xlated("1: r1 = 1")
-__xlated("...")
-__xlated("3: r0 = &(void __percpu *)(r0)")
-__xlated("...")
+__xlated("2: call bpf_get_smp_processor_id")
 /* may_goto expansion starts */
-__xlated("5: r11 = *(u64 *)(r10 -16)")
-__xlated("6: if r11 == 0x0 goto pc+3")
-__xlated("7: r11 -= 1")
-__xlated("8: *(u64 *)(r10 -16) = r11")
+__xlated("3: r11 = *(u64 *)(r10 -16)")
+__xlated("4: if r11 == 0x0 goto pc+3")
+__xlated("5: r11 -= 1")
+__xlated("6: *(u64 *)(r10 -16) = r11")
 /* may_goto expansion ends */
-__xlated("9: *(u64 *)(r10 -8) = r1")
-__xlated("10: exit")
+__xlated("7: *(u64 *)(r10 -8) = r1")
+__xlated("8: exit")
 __success
-__naked void may_goto_interaction(void)
+__naked void may_goto_interaction_arm64(void)
 {
 	asm volatile (
 	"r1 = 1;"
diff --git a/tools/testing/selftests/bpf/progs/verifier_may_goto_1.c b/tools/testing/selftests/bpf/progs/verifier_may_goto_1.c
index e81097c96fe2..b75548a52658 100644
--- a/tools/testing/selftests/bpf/progs/verifier_may_goto_1.c
+++ b/tools/testing/selftests/bpf/progs/verifier_may_goto_1.c
@@ -69,8 +69,38 @@ __naked void may_goto_batch_1(void)
 }
 
 SEC("raw_tp")
-__description("may_goto batch with offsets 2/0")
+__description("may_goto batch with offsets 2/0 - x86_64")
 __arch_x86_64
+__xlated("0: *(u64 *)(r10 -16) = 65535")
+__xlated("1: *(u64 *)(r10 -8) = 0")
+__xlated("2: r11 = *(u64 *)(r10 -16)")
+__xlated("3: if r11 == 0x0 goto pc+6")
+__xlated("4: r11 -= 1")
+__xlated("5: if r11 != 0x1 goto pc+2")
+__xlated("6: r11 = -16")
+__xlated("7: call unknown")
+__xlated("8: *(u64 *)(r10 -16) = r11")
+__xlated("9: r0 = 1")
+__xlated("10: r0 = 2")
+__xlated("11: exit")
+__success
+__naked void may_goto_batch_2_x86_64(void)
+{
+	asm volatile (
+	".8byte %[may_goto1];"
+	".8byte %[may_goto3];"
+	"r0 = 1;"
+	"r0 = 2;"
+	"exit;"
+	:
+	: __imm_insn(may_goto1, BPF_RAW_INSN(BPF_JMP | BPF_JCOND, 0, 0, 2 /* offset */, 0)),
+	  __imm_insn(may_goto3, BPF_RAW_INSN(BPF_JMP | BPF_JCOND, 0, 0, 0 /* offset */, 0))
+	: __clobber_all);
+}
+
+SEC("raw_tp")
+__description("may_goto batch with offsets 2/0 - arm64")
+__arch_arm64
 __xlated("0: *(u64 *)(r10 -8) = 8388608")
 __xlated("1: r11 = *(u64 *)(r10 -8)")
 __xlated("2: if r11 == 0x0 goto pc+3")
@@ -80,7 +110,7 @@ __xlated("5: r0 = 1")
 __xlated("6: r0 = 2")
 __xlated("7: exit")
 __success
-__naked void may_goto_batch_2(void)
+__naked void may_goto_batch_2_arm64(void)
 {
 	asm volatile (
 	".8byte %[may_goto1];"
-- 
2.43.5


^ permalink raw reply related	[flat|nested] 6+ messages in thread

end of thread, other threads:[~2025-03-03 22:40 UTC | newest]

Thread overview: 6+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2025-03-02 20:13 [PATCH bpf-next v1 0/2] Timed may_goto Kumar Kartikeya Dwivedi
2025-03-02 20:13 ` [PATCH bpf-next v1 1/2] bpf: Add verifier support for timed may_goto Kumar Kartikeya Dwivedi
2025-03-03 21:59   ` Alexei Starovoitov
2025-03-03 22:25     ` Kumar Kartikeya Dwivedi
2025-03-03 22:40       ` Kumar Kartikeya Dwivedi
2025-03-02 20:13 ` [PATCH bpf-next v1 2/2] bpf, x86: Add x86 JIT " Kumar Kartikeya Dwivedi

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox