* [PATCH bpf-next 0/7] selftests/bpf: Add XDP load-balancer benchmark
@ 2026-04-27 23:22 Puranjay Mohan
2026-04-27 23:22 ` [PATCH bpf-next 1/7] selftests/bpf: Add bench_force_done() for early benchmark completion Puranjay Mohan
` (6 more replies)
0 siblings, 7 replies; 24+ messages in thread
From: Puranjay Mohan @ 2026-04-27 23:22 UTC (permalink / raw)
To: bpf
Cc: Puranjay Mohan, Puranjay Mohan, Alexei Starovoitov,
Andrii Nakryiko, Daniel Borkmann, Martin KaFai Lau,
Eduard Zingerman, Kumar Kartikeya Dwivedi, Mykyta Yatsenko,
Fei Chen, Taruna Agrawal, Nikhil Dixit Limaye, Nikita V. Shirokov,
kernel-team
Changelog:
RFC: https://lore.kernel.org/all/20260420111726.2118636-1-puranjay@kernel.org/
Changes in v1:
- Replace bpf_get_cpu_time_counter() with bpf_ktime_get_ns()
- Replace bpf_repeat() with plain for loop and may_goto
- Refactor collect_measurements() to reuse bench_force_done()
- Remove histogram, verbose calibration output, and per-scenario status prints
- Trim run script table to p50/stddev/p99
- Set env.quiet when --machine-readable is passed
- Add || true to run script benchmark invocation for set -e safety
- Add bpf-nop benchmark as timing overhead baseline (patch 3)
- Use named struct for LRU inner map to fix build on older toolchains
This series adds an XDP load-balancer benchmark (based on Katran) to the BPF
selftest bench framework.
Motivation
----------
Existing BPF bench tests measure individual operations (map lookups,
kprobes, ring buffers) in isolation. Production BPF programs combine
parsing, map lookups, branching, and packet rewriting in a single call
chain. The performance characteristics of such programs depend on the
interaction of these operations -- register pressure, spills, inlining
decisions, branch layout -- which isolated micro-benchmarks do not
capture.
This benchmark implements a simplified L4 load-balancer modeled after
katran [1]. The BPF program reproduces katran's core datapath:
L3/L4 parsing -> VIP hash lookup -> per-CPU LRU connection table
with consistent-hash fallback -> real server selection -> per-VIP
and per-real stats -> IPIP/IP6IP6 encapsulation
The BPF code exercises hash maps, array-of-maps (per-CPU LRU),
percpu arrays, jhash, bpf_xdp_adjust_head(), bpf_ktime_get_ns(),
and bpf_get_smp_processor_id() in a single pipeline.
This is intended as the first in a series of BPF workload benchmarks
covering other use cases (sched_ext, etc.).
Design
------
A userspace loop calling bpf_prog_test_run_opts(repeat=1) would
measure syscall overhead, not BPF program cost -- the ~4 ns early-exit
paths would be buried under kernel entry/exit. Using repeat=N is
also unsuitable: the kernel re-runs the same packet without resetting
state between iterations, so the second iteration of an encap scenario
would process an already-encapsulated packet.
Instead, timing is measured inside the BPF program using
bpf_ktime_get_ns(). BENCH_BPF_LOOP() brackets N iterations (a plain
for loop bounded by may_goto) with timestamp reads, runs a
caller-supplied reset block between iterations to undo side effects
(e.g. strip encapsulation), and records the elapsed time per batch.
One extra untimed iteration runs afterward for output validation.
Auto-calibration picks a batch size targeting ~10 ms per invocation.
A proportionality sanity check verifies that 2N iterations take ~2x
as long as N.
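For reference, the minimal use of the macro is the bpf-nop baseline
added in patch 3:

  SEC("syscall")
  int bench_nop(void *ctx)
  {
          return BENCH_BPF_LOOP(0, ({}));
  }

The load-balancer program (patch 5) passes process_packet(xdp) as the
body and a reset block that strips the encapsulation the body added.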
24 scenarios cover the code-path matrix:
- Protocol: TCP, UDP
- Address family: IPv4, IPv6, cross-AF (IPv4-in-IPv6)
- LRU state: hit, miss (16M flow space), diverse (4K flows), cold
- Consistent-hash: direct (LRU bypass)
- TCP flags: SYN (skip LRU, force CH), RST (skip LRU insert)
- Early exits: unknown VIP, non-IP, ICMP, fragments, IP options
Each scenario validates correctness before benchmarking by comparing
the output packet byte-for-byte against a pre-built expected packet
and checking BPF map counters.
Sample single-scenario output:
$ sudo ./bench xdp-lb --scenario tcp-v4-lru-hit
Setting up benchmark 'xdp-lb'...
Benchmark 'xdp-lb' started.
tcp-v4-lru-hit: median 74.51 ns/op, stddev 0.11, p99 74.81 (202 samples)
Sample run script output:
$ ./benchs/run_bench_xdp_lb.sh
XDP load-balancer benchmark
===========================
+----------------------------------+----------+---------+----------+
| Single-flow baseline | p50 | stddev | p99 |
+----------------------------------+----------+---------+----------+
| tcp-v4-lru-hit | 74.30 | 0.08 | 74.48 |
| tcp-v4-ch | 101.73 | 0.11 | 102.01 |
| tcp-v6-lru-hit | 76.77 | 0.14 | 77.04 |
| tcp-v6-ch | 121.40 | 0.10 | 121.65 |
| udp-v4-lru-hit | 107.42 | 0.22 | 107.90 |
| udp-v6-lru-hit | 110.21 | 0.12 | 110.45 |
| tcp-v4v6-lru-hit | 74.82 | 0.35 | 75.43 |
+----------------------------------+----------+---------+----------+
| Diverse flows (4K src addrs) | p50 | stddev | p99 |
+----------------------------------+----------+---------+----------+
| tcp-v4-lru-diverse | 86.63 | 0.37 | 89.04 |
| tcp-v4-ch-diverse | 104.09 | 0.19 | 105.67 |
| tcp-v6-lru-diverse | 89.34 | 0.42 | 90.70 |
| tcp-v6-ch-diverse | 122.20 | 0.21 | 123.78 |
| udp-v4-lru-diverse | 119.37 | 0.58 | 123.10 |
+----------------------------------+----------+---------+----------+
| TCP flags | p50 | stddev | p99 |
+----------------------------------+----------+---------+----------+
| tcp-v4-syn | 165.52 | 15.68 | 198.34 |
| tcp-v4-rst-miss | 161.34 | 2.69 | 172.64 |
+----------------------------------+----------+---------+----------+
| LRU stress | p50 | stddev | p99 |
+----------------------------------+----------+---------+----------+
| tcp-v4-lru-miss | 440.39 | 35.75 | 550.62 |
| udp-v4-lru-miss | 571.88 | 57.38 | 680.61 |
| tcp-v4-lru-warmup | 317.75 | 9.55 | 356.20 |
+----------------------------------+----------+---------+----------+
| Early exits | p50 | stddev | p99 |
+----------------------------------+----------+---------+----------+
| pass-v4-no-vip | 18.26 | 0.13 | 18.66 |
| pass-v6-no-vip | 19.08 | 0.01 | 19.10 |
| pass-v4-icmp | 6.81 | 0.02 | 6.86 |
| pass-non-ip | 5.71 | 0.03 | 5.76 |
| drop-v4-frag | 6.09 | 0.01 | 6.10 |
| drop-v4-options | 5.88 | 0.00 | 5.89 |
| drop-v6-frag | 6.00 | 0.03 | 6.04 |
+----------------------------------+----------+---------+----------+
Patches
-------
Patch 1 adds bench_force_done() to the bench framework so benchmarks
can signal early completion when enough samples have been collected.
Patch 2 adds the shared BPF batch-timing library (BPF-side timing
arrays, BENCH_BPF_LOOP macro, userspace statistics and calibration).
Patch 3 adds a bpf-nop benchmark as a timing overhead baseline and
usage example for the timing library.
Patch 4 adds the common header shared between the BPF program and
userspace (flow_key, vip_definition, real_definition, encap helpers).
Patch 5 adds the XDP load-balancer BPF program.
Patch 6 adds the userspace benchmark driver with 24 scenarios,
packet construction, validation, and bench framework integration.
Patch 7 adds the run script for running all scenarios.
[1] https://github.com/facebookincubator/katran
Puranjay Mohan (7):
selftests/bpf: Add bench_force_done() for early benchmark completion
selftests/bpf: Add BPF batch-timing library
selftests/bpf: Add bpf-nop benchmark for timing overhead baseline
selftests/bpf: Add XDP load-balancer common definitions
selftests/bpf: Add XDP load-balancer BPF program
selftests/bpf: Add XDP load-balancer benchmark driver
selftests/bpf: Add XDP load-balancer benchmark run script
tools/testing/selftests/bpf/Makefile | 6 +
tools/testing/selftests/bpf/bench.c | 20 +-
tools/testing/selftests/bpf/bench.h | 1 +
.../testing/selftests/bpf/bench_bpf_timing.h | 50 +
.../selftests/bpf/benchs/bench_bpf_nop.c | 84 ++
.../selftests/bpf/benchs/bench_bpf_timing.c | 272 ++++
.../selftests/bpf/benchs/bench_xdp_lb.c | 1113 +++++++++++++++++
.../selftests/bpf/benchs/run_bench_xdp_lb.sh | 79 ++
.../bpf/progs/bench_bpf_timing.bpf.h | 69 +
.../selftests/bpf/progs/bpf_nop_bench.c | 14 +
.../selftests/bpf/progs/xdp_lb_bench.c | 647 ++++++++++
.../selftests/bpf/xdp_lb_bench_common.h | 112 ++
12 files changed, 2462 insertions(+), 5 deletions(-)
create mode 100644 tools/testing/selftests/bpf/bench_bpf_timing.h
create mode 100644 tools/testing/selftests/bpf/benchs/bench_bpf_nop.c
create mode 100644 tools/testing/selftests/bpf/benchs/bench_bpf_timing.c
create mode 100644 tools/testing/selftests/bpf/benchs/bench_xdp_lb.c
create mode 100755 tools/testing/selftests/bpf/benchs/run_bench_xdp_lb.sh
create mode 100644 tools/testing/selftests/bpf/progs/bench_bpf_timing.bpf.h
create mode 100644 tools/testing/selftests/bpf/progs/bpf_nop_bench.c
create mode 100644 tools/testing/selftests/bpf/progs/xdp_lb_bench.c
create mode 100644 tools/testing/selftests/bpf/xdp_lb_bench_common.h
--
2.52.0
^ permalink raw reply [flat|nested] 24+ messages in thread
* [PATCH bpf-next 1/7] selftests/bpf: Add bench_force_done() for early benchmark completion
2026-04-27 23:22 [PATCH bpf-next 0/7] selftests/bpf: Add XDP load-balancer benchmark Puranjay Mohan
@ 2026-04-27 23:22 ` Puranjay Mohan
2026-04-27 23:39 ` sashiko-bot
2026-04-28 0:05 ` bot+bpf-ci
2026-04-27 23:22 ` [PATCH bpf-next 2/7] selftests/bpf: Add BPF batch-timing library Puranjay Mohan
` (5 subsequent siblings)
6 siblings, 2 replies; 24+ messages in thread
From: Puranjay Mohan @ 2026-04-27 23:22 UTC (permalink / raw)
To: bpf
Cc: Puranjay Mohan, Puranjay Mohan, Alexei Starovoitov,
Andrii Nakryiko, Daniel Borkmann, Martin KaFai Lau,
Eduard Zingerman, Kumar Kartikeya Dwivedi, Mykyta Yatsenko,
Fei Chen, Taruna Agrawal, Nikhil Dixit Limaye, Nikita V. Shirokov,
kernel-team
The bench framework runs for the full warmup_sec + duration_sec before
reporting results. Benchmarks that know exactly how many samples they
need can call bench_force_done() to signal completion early, avoiding
wasted wall-clock time.
Also refactor collect_measurements() to reuse bench_force_done()
instead of open-coding the same mutex/cond_signal sequence.
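For example, the batch-timing library added in the next patch calls it
from its measure() callback once enough samples have been collected
(slightly simplified):

	if (total_samples >= t->target_samples && !t->done) {
		t->done = true;
		*t->timing_enabled = 0;
		bench_force_done();
	}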
Signed-off-by: Puranjay Mohan <puranjay@kernel.org>
---
tools/testing/selftests/bpf/bench.c | 14 +++++++++-----
tools/testing/selftests/bpf/bench.h | 1 +
2 files changed, 10 insertions(+), 5 deletions(-)
diff --git a/tools/testing/selftests/bpf/bench.c b/tools/testing/selftests/bpf/bench.c
index 029b3e21f438..47a4e72208d6 100644
--- a/tools/testing/selftests/bpf/bench.c
+++ b/tools/testing/selftests/bpf/bench.c
@@ -741,6 +741,13 @@ static void setup_benchmark(void)
static pthread_mutex_t bench_done_mtx = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t bench_done = PTHREAD_COND_INITIALIZER;
+void bench_force_done(void)
+{
+ pthread_mutex_lock(&bench_done_mtx);
+ pthread_cond_signal(&bench_done);
+ pthread_mutex_unlock(&bench_done_mtx);
+}
+
static void collect_measurements(long delta_ns) {
int iter = state.res_cnt++;
struct bench_res *res = &state.results[iter];
@@ -750,11 +757,8 @@ static void collect_measurements(long delta_ns) {
if (bench->report_progress)
bench->report_progress(iter, res, delta_ns);
- if (iter == env.duration_sec + env.warmup_sec) {
- pthread_mutex_lock(&bench_done_mtx);
- pthread_cond_signal(&bench_done);
- pthread_mutex_unlock(&bench_done_mtx);
- }
+ if (iter == env.duration_sec + env.warmup_sec)
+ bench_force_done();
}
int main(int argc, char **argv)
diff --git a/tools/testing/selftests/bpf/bench.h b/tools/testing/selftests/bpf/bench.h
index 7cf21936e7ed..89a3fc72f70e 100644
--- a/tools/testing/selftests/bpf/bench.h
+++ b/tools/testing/selftests/bpf/bench.h
@@ -70,6 +70,7 @@ extern struct env env;
extern const struct bench *bench;
void setup_libbpf(void);
+void bench_force_done(void);
void hits_drops_report_progress(int iter, struct bench_res *res, long delta_ns);
void hits_drops_report_final(struct bench_res res[], int res_cnt);
void false_hits_report_progress(int iter, struct bench_res *res, long delta_ns);
--
2.52.0
^ permalink raw reply related [flat|nested] 24+ messages in thread
* [PATCH bpf-next 2/7] selftests/bpf: Add BPF batch-timing library
2026-04-27 23:22 [PATCH bpf-next 0/7] selftests/bpf: Add XDP load-balancer benchmark Puranjay Mohan
2026-04-27 23:22 ` [PATCH bpf-next 1/7] selftests/bpf: Add bench_force_done() for early benchmark completion Puranjay Mohan
@ 2026-04-27 23:22 ` Puranjay Mohan
2026-04-28 0:12 ` sashiko-bot
2026-04-28 0:18 ` bot+bpf-ci
2026-04-27 23:23 ` [PATCH bpf-next 3/7] selftests/bpf: Add bpf-nop benchmark for timing overhead baseline Puranjay Mohan
` (4 subsequent siblings)
6 siblings, 2 replies; 24+ messages in thread
From: Puranjay Mohan @ 2026-04-27 23:22 UTC (permalink / raw)
To: bpf
Cc: Puranjay Mohan, Puranjay Mohan, Alexei Starovoitov,
Andrii Nakryiko, Daniel Borkmann, Martin KaFai Lau,
Eduard Zingerman, Kumar Kartikeya Dwivedi, Mykyta Yatsenko,
Fei Chen, Taruna Agrawal, Nikhil Dixit Limaye, Nikita V. Shirokov,
kernel-team
Add a reusable timing library for BPF benchmarks that need to measure
BPF program execution time.
The BPF side (progs/bench_bpf_timing.bpf.h) provides per-CPU sample
arrays and BENCH_BPF_LOOP(), a macro that brackets batch_iters
iterations with bpf_ktime_get_ns() reads and records the elapsed time.
One extra untimed iteration runs afterward for output validation.
The userspace side (benchs/bench_bpf_timing.c) collects samples from
the skeleton BSS, computes percentile statistics, and auto-calibrates
batch_iters to target ~10 ms per batch.
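A benchmark wires the library into its setup hook, e.g. from the
bpf-nop benchmark added in the next patch:

	BENCH_TIMING_INIT(&ctx.timing, skel, 0);
	bpf_bench_calibrate(&ctx.timing, nop_run_once, NULL);

and forwards its measure and report_final callbacks to
bpf_bench_timing_measure() and bpf_bench_timing_report().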
Signed-off-by: Puranjay Mohan <puranjay@kernel.org>
---
tools/testing/selftests/bpf/Makefile | 2 +
.../testing/selftests/bpf/bench_bpf_timing.h | 50 ++++
.../selftests/bpf/benchs/bench_bpf_timing.c | 272 ++++++++++++++++++
.../bpf/progs/bench_bpf_timing.bpf.h | 69 +++++
4 files changed, 393 insertions(+)
create mode 100644 tools/testing/selftests/bpf/bench_bpf_timing.h
create mode 100644 tools/testing/selftests/bpf/benchs/bench_bpf_timing.c
create mode 100644 tools/testing/selftests/bpf/progs/bench_bpf_timing.bpf.h
diff --git a/tools/testing/selftests/bpf/Makefile b/tools/testing/selftests/bpf/Makefile
index 97ee61f2ade5..3d516f10f29e 100644
--- a/tools/testing/selftests/bpf/Makefile
+++ b/tools/testing/selftests/bpf/Makefile
@@ -906,6 +906,7 @@ $(OUTPUT)/bench_htab_mem.o: $(OUTPUT)/htab_mem_bench.skel.h
$(OUTPUT)/bench_bpf_crypto.o: $(OUTPUT)/crypto_bench.skel.h
$(OUTPUT)/bench_sockmap.o: $(OUTPUT)/bench_sockmap_prog.skel.h
$(OUTPUT)/bench_lpm_trie_map.o: $(OUTPUT)/lpm_trie_bench.skel.h $(OUTPUT)/lpm_trie_map.skel.h
+$(OUTPUT)/bench_bpf_timing.o: bench_bpf_timing.h
$(OUTPUT)/bench.o: bench.h testing_helpers.h $(BPFOBJ)
$(OUTPUT)/bench: LDLIBS += -lm
$(OUTPUT)/bench: $(OUTPUT)/bench.o \
@@ -928,6 +929,7 @@ $(OUTPUT)/bench: $(OUTPUT)/bench.o \
$(OUTPUT)/bench_bpf_crypto.o \
$(OUTPUT)/bench_sockmap.o \
$(OUTPUT)/bench_lpm_trie_map.o \
+ $(OUTPUT)/bench_bpf_timing.o \
$(OUTPUT)/usdt_1.o \
$(OUTPUT)/usdt_2.o \
#
diff --git a/tools/testing/selftests/bpf/bench_bpf_timing.h b/tools/testing/selftests/bpf/bench_bpf_timing.h
new file mode 100644
index 000000000000..6ef23b6d6639
--- /dev/null
+++ b/tools/testing/selftests/bpf/bench_bpf_timing.h
@@ -0,0 +1,50 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+/* Copyright (c) 2026 Meta Platforms, Inc. and affiliates. */
+
+#ifndef __BENCH_BPF_TIMING_H__
+#define __BENCH_BPF_TIMING_H__
+
+#include <stdbool.h>
+#include <linux/types.h>
+#include "bench.h"
+
+#ifndef BENCH_NR_SAMPLES
+#define BENCH_NR_SAMPLES 4096
+#endif
+#ifndef BENCH_NR_CPUS
+#define BENCH_NR_CPUS 256
+#endif
+
+typedef void (*bpf_bench_run_fn)(void *ctx);
+
+struct bpf_bench_timing {
+ __u64 (*samples)[BENCH_NR_SAMPLES]; /* skel->bss->timing_samples */
+ __u32 *idx; /* skel->bss->timing_idx */
+ volatile __u32 *timing_enabled; /* &skel->bss->timing_enabled */
+ volatile __u32 *batch_iters_bss; /* &skel->bss->batch_iters */
+ __u32 batch_iters;
+ __u32 target_samples;
+ __u32 nr_cpus;
+ int warmup_ticks;
+ bool done;
+ bool machine_readable;
+};
+
+#define BENCH_TIMING_INIT(t, skel, iters) do { \
+ (t)->samples = (skel)->bss->timing_samples; \
+ (t)->idx = (skel)->bss->timing_idx; \
+ (t)->timing_enabled = &(skel)->bss->timing_enabled; \
+ (t)->batch_iters_bss = &(skel)->bss->batch_iters; \
+ (t)->batch_iters = (iters); \
+ (t)->target_samples = 200; \
+ (t)->nr_cpus = env.nr_cpus; \
+ (t)->warmup_ticks = 0; \
+ (t)->done = false; \
+ (t)->machine_readable = false; \
+} while (0)
+
+void bpf_bench_timing_measure(struct bpf_bench_timing *t, struct bench_res *res);
+void bpf_bench_timing_report(struct bpf_bench_timing *t, const char *name, const char *desc);
+void bpf_bench_calibrate(struct bpf_bench_timing *t, bpf_bench_run_fn run_fn, void *ctx);
+
+#endif /* __BENCH_BPF_TIMING_H__ */
diff --git a/tools/testing/selftests/bpf/benchs/bench_bpf_timing.c b/tools/testing/selftests/bpf/benchs/bench_bpf_timing.c
new file mode 100644
index 000000000000..75a39da69655
--- /dev/null
+++ b/tools/testing/selftests/bpf/benchs/bench_bpf_timing.c
@@ -0,0 +1,272 @@
+// SPDX-License-Identifier: GPL-2.0
+/* Copyright (c) 2026 Meta Platforms, Inc. and affiliates. */
+
+#include <stdio.h>
+#include <stdlib.h>
+#include <string.h>
+#include <math.h>
+#include "bench_bpf_timing.h"
+#include "bpf_util.h"
+
+struct timing_stats {
+ double min, max;
+ double median, p99;
+ double mean, stddev;
+ int count;
+};
+
+static int cmp_double(const void *a, const void *b)
+{
+ double da = *(const double *)a;
+ double db = *(const double *)b;
+
+ if (da < db)
+ return -1;
+ if (da > db)
+ return 1;
+ return 0;
+}
+
+static double percentile(const double *sorted, int n, double pct)
+{
+ int idx = (int)(n * pct / 100.0);
+
+ if (idx >= n)
+ idx = n - 1;
+ return sorted[idx];
+}
+
+static int collect_samples(struct bpf_bench_timing *t,
+ double *out, int max_out)
+{
+ unsigned int nr_cpus = bpf_num_possible_cpus();
+ __u32 timed_iters = t->batch_iters;
+ int total = 0;
+
+ if (nr_cpus > BENCH_NR_CPUS)
+ nr_cpus = BENCH_NR_CPUS;
+
+ for (unsigned int cpu = 0; cpu < nr_cpus; cpu++) {
+ __u32 count = t->idx[cpu];
+
+ if (count > BENCH_NR_SAMPLES)
+ count = BENCH_NR_SAMPLES;
+
+ for (__u32 i = 0; i < count && total < max_out; i++) {
+ __u64 sample = t->samples[cpu][i];
+
+ if (sample == 0)
+ continue;
+ out[total++] = (double)sample / timed_iters;
+ }
+ }
+
+ qsort(out, total, sizeof(double), cmp_double);
+ return total;
+}
+
+static void compute_stats(const double *sorted, int n,
+ struct timing_stats *s)
+{
+ double sum = 0, var_sum = 0;
+
+ memset(s, 0, sizeof(*s));
+ s->count = n;
+
+ if (n == 0)
+ return;
+
+ s->min = sorted[0];
+ s->max = sorted[n - 1];
+ s->median = sorted[n / 2];
+ s->p99 = percentile(sorted, n, 99);
+
+ for (int i = 0; i < n; i++)
+ sum += sorted[i];
+ s->mean = sum / n;
+
+ for (int i = 0; i < n; i++) {
+ double d = sorted[i] - s->mean;
+
+ var_sum += d * d;
+ }
+ s->stddev = n > 1 ? sqrt(var_sum / (n - 1)) : 0;
+}
+
+void bpf_bench_timing_measure(struct bpf_bench_timing *t, struct bench_res *res)
+{
+ unsigned int nr_cpus;
+ __u32 total_samples;
+ int i;
+
+ t->warmup_ticks++;
+
+ if (t->warmup_ticks < env.warmup_sec)
+ return;
+
+ if (t->warmup_ticks == env.warmup_sec) {
+ *t->timing_enabled = 1;
+ return;
+ }
+
+ nr_cpus = bpf_num_possible_cpus();
+ if (nr_cpus > BENCH_NR_CPUS)
+ nr_cpus = BENCH_NR_CPUS;
+
+ total_samples = 0;
+ for (i = 0; i < (int)nr_cpus; i++) {
+ __u32 cnt = t->idx[i];
+
+ if (cnt > BENCH_NR_SAMPLES)
+ cnt = BENCH_NR_SAMPLES;
+ total_samples += cnt;
+ }
+
+ if (total_samples >= (__u32)env.producer_cnt * t->target_samples && !t->done) {
+ t->done = true;
+ *t->timing_enabled = 0;
+ bench_force_done();
+ }
+}
+
+void bpf_bench_timing_report(struct bpf_bench_timing *t, const char *name, const char *description)
+{
+ int max_out = BENCH_NR_CPUS * BENCH_NR_SAMPLES;
+ struct timing_stats s;
+ double *all;
+ int total;
+
+ all = calloc(max_out, sizeof(*all));
+ if (!all) {
+ fprintf(stderr, "failed to allocate timing buffer\n");
+ return;
+ }
+
+ total = collect_samples(t, all, max_out);
+
+ if (total == 0) {
+ printf("No timing samples collected.\n");
+ free(all);
+ return;
+ }
+
+ compute_stats(all, total, &s);
+
+ if (t->machine_readable) {
+ printf("RESULT scenario=%s samples=%d median=%.2f stddev=%.2f cv=%.2f min=%.2f "
+ "p99=%.2f max=%.2f\n", name, total, s.median, s.stddev,
+ s.mean > 0 ? s.stddev / s.mean * 100.0 : 0.0, s.min, s.p99, s.max);
+ } else {
+ printf("%s: median %.2f ns/op, stddev %.2f, p99 %.2f (%d samples)\n", name,
+ s.median, s.stddev, s.p99, total);
+ }
+
+ free(all);
+}
+
+#define CALIBRATE_SEED_BATCH 100
+#define CALIBRATE_MIN_BATCH 100
+#define CALIBRATE_MAX_BATCH 10000000
+#define CALIBRATE_TARGET_MS 10
+#define CALIBRATE_RUNS 5
+#define PROPORTIONALITY_TOL 0.05 /* 5% */
+
+static void reset_timing(struct bpf_bench_timing *t)
+{
+ *t->timing_enabled = 0;
+ memset(t->samples, 0, sizeof(__u64) * BENCH_NR_CPUS * BENCH_NR_SAMPLES);
+ memset(t->idx, 0, sizeof(__u32) * BENCH_NR_CPUS);
+}
+
+static __u64 measure_elapsed(struct bpf_bench_timing *t, bpf_bench_run_fn run_fn, void *run_ctx,
+ __u32 iters, int runs)
+{
+ __u64 buf[CALIBRATE_RUNS];
+ int n = 0, i, j;
+
+ reset_timing(t);
+ *t->batch_iters_bss = iters;
+ *t->timing_enabled = 1;
+
+ for (i = 0; i < runs; i++)
+ run_fn(run_ctx);
+
+ *t->timing_enabled = 0;
+
+ for (i = 0; i < BENCH_NR_CPUS && n < runs; i++) {
+ __u32 cnt = t->idx[i];
+
+ for (j = 0; j < (int)cnt && n < runs; j++)
+ buf[n++] = t->samples[i][j];
+ }
+
+ if (n == 0)
+ return 0;
+
+ for (i = 1; i < n; i++) {
+ __u64 key = buf[i];
+
+ j = i - 1;
+ while (j >= 0 && buf[j] > key) {
+ buf[j + 1] = buf[j];
+ j--;
+ }
+ buf[j + 1] = key;
+ }
+
+ return buf[n / 2];
+}
+
+static __u32 compute_batch_iters(__u64 per_op_ns)
+{
+ __u64 target_ns = (__u64)CALIBRATE_TARGET_MS * 1000000ULL;
+ __u32 iters;
+
+ if (per_op_ns == 0)
+ return CALIBRATE_MIN_BATCH;
+
+ iters = target_ns / per_op_ns;
+
+ if (iters < CALIBRATE_MIN_BATCH)
+ iters = CALIBRATE_MIN_BATCH;
+ if (iters > CALIBRATE_MAX_BATCH)
+ iters = CALIBRATE_MAX_BATCH;
+
+ return iters;
+}
+
+void bpf_bench_calibrate(struct bpf_bench_timing *t, bpf_bench_run_fn run_fn, void *run_ctx)
+{
+ __u64 elapsed, per_op_ns;
+ __u64 time_n, time_2n;
+ double ratio;
+
+ elapsed = measure_elapsed(t, run_fn, run_ctx, CALIBRATE_SEED_BATCH, CALIBRATE_RUNS);
+ if (elapsed == 0) {
+ fprintf(stderr, "calibration: no timing samples, using default\n");
+ t->batch_iters = 10000;
+ *t->batch_iters_bss = t->batch_iters;
+ reset_timing(t);
+ return;
+ }
+
+ per_op_ns = elapsed / CALIBRATE_SEED_BATCH;
+ t->batch_iters = compute_batch_iters(per_op_ns);
+
+ time_n = measure_elapsed(t, run_fn, run_ctx, t->batch_iters, CALIBRATE_RUNS);
+ time_2n = measure_elapsed(t, run_fn, run_ctx, t->batch_iters * 2, CALIBRATE_RUNS);
+
+ if (time_n > 0 && time_2n > 0) {
+ ratio = (double)time_2n / (double)time_n;
+
+ if (fabs(ratio - 2.0) / 2.0 > PROPORTIONALITY_TOL)
+ fprintf(stderr,
+ "WARNING: proportionality check failed (2N/N ratio=%.3f, "
+				"expected=2.000, error=%.1f%%)\n"
+				"System noise may be affecting results.\n",
+ ratio, fabs(ratio - 2.0) / 2.0 * 100.0);
+ }
+
+ *t->batch_iters_bss = t->batch_iters;
+ reset_timing(t);
+}
diff --git a/tools/testing/selftests/bpf/progs/bench_bpf_timing.bpf.h b/tools/testing/selftests/bpf/progs/bench_bpf_timing.bpf.h
new file mode 100644
index 000000000000..6a1ad75f1fd7
--- /dev/null
+++ b/tools/testing/selftests/bpf/progs/bench_bpf_timing.bpf.h
@@ -0,0 +1,69 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+/* Copyright (c) 2026 Meta Platforms, Inc. and affiliates. */
+
+#ifndef __BENCH_BPF_TIMING_BPF_H__
+#define __BENCH_BPF_TIMING_BPF_H__
+
+#include <stdbool.h>
+#include <linux/bpf.h>
+#include <bpf/bpf_helpers.h>
+#include <bpf_may_goto.h>
+
+#ifndef BENCH_NR_SAMPLES
+#define BENCH_NR_SAMPLES 4096
+#endif
+#ifndef BENCH_NR_CPUS
+#define BENCH_NR_CPUS 256
+#endif
+#define BENCH_CPU_MASK (BENCH_NR_CPUS - 1)
+
+__u64 timing_samples[BENCH_NR_CPUS][BENCH_NR_SAMPLES];
+__u32 timing_idx[BENCH_NR_CPUS];
+
+volatile __u32 batch_iters;
+volatile __u32 timing_enabled;
+
+static __always_inline void bench_record_sample(__u64 elapsed_ns)
+{
+ __u32 cpu, idx;
+
+ if (!timing_enabled)
+ return;
+
+ cpu = bpf_get_smp_processor_id() & BENCH_CPU_MASK;
+ idx = timing_idx[cpu];
+
+ if (idx >= BENCH_NR_SAMPLES)
+ return;
+
+ timing_samples[cpu][idx] = elapsed_ns;
+ timing_idx[cpu] = idx + 1;
+}
+
+/*
+ * @body: expression to time; return value (int) stored in __bench_result.
+ * @reset: undo body's side-effects so each iteration starts identically.
+ * May reference __bench_result. Use ({}) for empty reset.
+ *
+ * Runs batch_iters timed iterations, then one untimed iteration whose
+ * return value the macro evaluates to (for validation).
+ */
+#define BENCH_BPF_LOOP(body, reset) ({ \
+ __u64 __bench_start = bpf_ktime_get_ns(); \
+ __u32 __bench_i; \
+ int __bench_result; \
+ \
+ for (__bench_i = 0; \
+ __bench_i < batch_iters && can_loop; \
+ __bench_i++) { \
+ __bench_result = (body); \
+ reset; \
+ } \
+ \
+ bench_record_sample(bpf_ktime_get_ns() - __bench_start); \
+ \
+ __bench_result = (body); \
+ __bench_result; \
+})
+
+#endif /* __BENCH_BPF_TIMING_BPF_H__ */
--
2.52.0
^ permalink raw reply related [flat|nested] 24+ messages in thread
* [PATCH bpf-next 3/7] selftests/bpf: Add bpf-nop benchmark for timing overhead baseline
2026-04-27 23:22 [PATCH bpf-next 0/7] selftests/bpf: Add XDP load-balancer benchmark Puranjay Mohan
2026-04-27 23:22 ` [PATCH bpf-next 1/7] selftests/bpf: Add bench_force_done() for early benchmark completion Puranjay Mohan
2026-04-27 23:22 ` [PATCH bpf-next 2/7] selftests/bpf: Add BPF batch-timing library Puranjay Mohan
@ 2026-04-27 23:23 ` Puranjay Mohan
2026-04-27 23:23 ` [PATCH bpf-next 4/7] selftests/bpf: Add XDP load-balancer common definitions Puranjay Mohan
` (3 subsequent siblings)
6 siblings, 0 replies; 24+ messages in thread
From: Puranjay Mohan @ 2026-04-27 23:23 UTC (permalink / raw)
To: bpf
Cc: Puranjay Mohan, Puranjay Mohan, Alexei Starovoitov,
Andrii Nakryiko, Daniel Borkmann, Martin KaFai Lau,
Eduard Zingerman, Kumar Kartikeya Dwivedi, Mykyta Yatsenko,
Fei Chen, Taruna Agrawal, Nikhil Dixit Limaye, Nikita V. Shirokov,
kernel-team
Add a minimal benchmark that measures the overhead of the batch-timing
infrastructure itself. The BPF program runs an empty BENCH_BPF_LOOP body
(~1.5-2 ns/op), establishing the floor cost that all timing-library
benchmarks include.
[root@virtme-ng tools/testing/selftests/bpf]# sudo ./bench -a -p8 bpf-nop
Setting up benchmark 'bpf-nop'...
Benchmark 'bpf-nop' started.
bpf-nop: median 1.82 ns/op, stddev 0.01, p99 1.86 (1754 samples)
Signed-off-by: Puranjay Mohan <puranjay@kernel.org>
---
tools/testing/selftests/bpf/Makefile | 2 +
tools/testing/selftests/bpf/bench.c | 2 +
.../selftests/bpf/benchs/bench_bpf_nop.c | 84 +++++++++++++++++++
.../selftests/bpf/progs/bpf_nop_bench.c | 14 ++++
4 files changed, 102 insertions(+)
create mode 100644 tools/testing/selftests/bpf/benchs/bench_bpf_nop.c
create mode 100644 tools/testing/selftests/bpf/progs/bpf_nop_bench.c
diff --git a/tools/testing/selftests/bpf/Makefile b/tools/testing/selftests/bpf/Makefile
index 3d516f10f29e..97f9fbd41244 100644
--- a/tools/testing/selftests/bpf/Makefile
+++ b/tools/testing/selftests/bpf/Makefile
@@ -906,6 +906,7 @@ $(OUTPUT)/bench_htab_mem.o: $(OUTPUT)/htab_mem_bench.skel.h
$(OUTPUT)/bench_bpf_crypto.o: $(OUTPUT)/crypto_bench.skel.h
$(OUTPUT)/bench_sockmap.o: $(OUTPUT)/bench_sockmap_prog.skel.h
$(OUTPUT)/bench_lpm_trie_map.o: $(OUTPUT)/lpm_trie_bench.skel.h $(OUTPUT)/lpm_trie_map.skel.h
+$(OUTPUT)/bench_bpf_nop.o: $(OUTPUT)/bpf_nop_bench.skel.h bench_bpf_timing.h
$(OUTPUT)/bench_bpf_timing.o: bench_bpf_timing.h
$(OUTPUT)/bench.o: bench.h testing_helpers.h $(BPFOBJ)
$(OUTPUT)/bench: LDLIBS += -lm
@@ -930,6 +931,7 @@ $(OUTPUT)/bench: $(OUTPUT)/bench.o \
$(OUTPUT)/bench_sockmap.o \
$(OUTPUT)/bench_lpm_trie_map.o \
$(OUTPUT)/bench_bpf_timing.o \
+ $(OUTPUT)/bench_bpf_nop.o \
$(OUTPUT)/usdt_1.o \
$(OUTPUT)/usdt_2.o \
#
diff --git a/tools/testing/selftests/bpf/bench.c b/tools/testing/selftests/bpf/bench.c
index 47a4e72208d6..1696de5d6780 100644
--- a/tools/testing/selftests/bpf/bench.c
+++ b/tools/testing/selftests/bpf/bench.c
@@ -575,6 +575,7 @@ extern const struct bench bench_lpm_trie_insert;
extern const struct bench bench_lpm_trie_update;
extern const struct bench bench_lpm_trie_delete;
extern const struct bench bench_lpm_trie_free;
+extern const struct bench bench_bpf_nop;
static const struct bench *benchs[] = {
&bench_count_global,
@@ -653,6 +654,7 @@ static const struct bench *benchs[] = {
&bench_lpm_trie_update,
&bench_lpm_trie_delete,
&bench_lpm_trie_free,
+ &bench_bpf_nop,
};
static void find_benchmark(void)
diff --git a/tools/testing/selftests/bpf/benchs/bench_bpf_nop.c b/tools/testing/selftests/bpf/benchs/bench_bpf_nop.c
new file mode 100644
index 000000000000..e2d8c2ccf384
--- /dev/null
+++ b/tools/testing/selftests/bpf/benchs/bench_bpf_nop.c
@@ -0,0 +1,84 @@
+// SPDX-License-Identifier: GPL-2.0
+/* Copyright (c) 2026 Meta Platforms, Inc. and affiliates. */
+
+#include "bench.h"
+#include "bench_bpf_timing.h"
+#include "bpf_nop_bench.skel.h"
+#include "bpf_util.h"
+
+static struct ctx {
+ struct bpf_nop_bench *skel;
+ struct bpf_bench_timing timing;
+ int prog_fd;
+} ctx;
+
+static void nop_validate(void)
+{
+ if (env.consumer_cnt != 0) {
+ fprintf(stderr, "benchmark doesn't support consumers\n");
+ exit(1);
+ }
+}
+
+static void nop_run_once(void *unused __always_unused)
+{
+ LIBBPF_OPTS(bpf_test_run_opts, topts);
+
+ bpf_prog_test_run_opts(ctx.prog_fd, &topts);
+}
+
+static void nop_setup(void)
+{
+ struct bpf_nop_bench *skel;
+ int err;
+
+ setup_libbpf();
+
+ skel = bpf_nop_bench__open();
+ if (!skel) {
+ fprintf(stderr, "failed to open skeleton\n");
+ exit(1);
+ }
+
+ err = bpf_nop_bench__load(skel);
+ if (err) {
+ fprintf(stderr, "failed to load skeleton: %s\n", strerror(-err));
+ bpf_nop_bench__destroy(skel);
+ exit(1);
+ }
+
+ ctx.skel = skel;
+ ctx.prog_fd = bpf_program__fd(skel->progs.bench_nop);
+
+ BENCH_TIMING_INIT(&ctx.timing, skel, 0);
+ bpf_bench_calibrate(&ctx.timing, nop_run_once, NULL);
+
+ env.duration_sec = 600;
+}
+
+static void *nop_producer(void *input)
+{
+ while (true)
+ nop_run_once(NULL);
+
+ return NULL;
+}
+
+static void nop_measure(struct bench_res *res)
+{
+ bpf_bench_timing_measure(&ctx.timing, res);
+}
+
+static void nop_report_final(struct bench_res res[], int res_cnt)
+{
+ bpf_bench_timing_report(&ctx.timing, "bpf-nop", NULL);
+}
+
+const struct bench bench_bpf_nop = {
+ .name = "bpf-nop",
+ .validate = nop_validate,
+ .setup = nop_setup,
+ .producer_thread = nop_producer,
+ .measure = nop_measure,
+ .report_final = nop_report_final,
+};
diff --git a/tools/testing/selftests/bpf/progs/bpf_nop_bench.c b/tools/testing/selftests/bpf/progs/bpf_nop_bench.c
new file mode 100644
index 000000000000..01ed284c1bb3
--- /dev/null
+++ b/tools/testing/selftests/bpf/progs/bpf_nop_bench.c
@@ -0,0 +1,14 @@
+// SPDX-License-Identifier: GPL-2.0
+/* Copyright (c) 2026 Meta Platforms, Inc. and affiliates. */
+
+#include <linux/bpf.h>
+#include <bpf/bpf_helpers.h>
+#include "bench_bpf_timing.bpf.h"
+
+SEC("syscall")
+int bench_nop(void *ctx)
+{
+ return BENCH_BPF_LOOP(0, ({}));
+}
+
+char _license[] SEC("license") = "GPL";
--
2.52.0
^ permalink raw reply related [flat|nested] 24+ messages in thread
* [PATCH bpf-next 4/7] selftests/bpf: Add XDP load-balancer common definitions
2026-04-27 23:22 [PATCH bpf-next 0/7] selftests/bpf: Add XDP load-balancer benchmark Puranjay Mohan
` (2 preceding siblings ...)
2026-04-27 23:23 ` [PATCH bpf-next 3/7] selftests/bpf: Add bpf-nop benchmark for timing overhead baseline Puranjay Mohan
@ 2026-04-27 23:23 ` Puranjay Mohan
2026-04-28 0:05 ` bot+bpf-ci
2026-04-28 0:38 ` sashiko-bot
2026-04-27 23:23 ` [PATCH bpf-next 5/7] selftests/bpf: Add XDP load-balancer BPF program Puranjay Mohan
` (2 subsequent siblings)
6 siblings, 2 replies; 24+ messages in thread
From: Puranjay Mohan @ 2026-04-27 23:23 UTC (permalink / raw)
To: bpf
Cc: Puranjay Mohan, Puranjay Mohan, Alexei Starovoitov,
Andrii Nakryiko, Daniel Borkmann, Martin KaFai Lau,
Eduard Zingerman, Kumar Kartikeya Dwivedi, Mykyta Yatsenko,
Fei Chen, Taruna Agrawal, Nikhil Dixit Limaye, Nikita V. Shirokov,
kernel-team
Add the shared header for the XDP load-balancer benchmark. This
defines the data structures used by both the BPF program and
userspace: flow_key, vip_definition, real_definition, and the
stats/control structures.
Also provide the encapsulation source-address helpers shared
between the BPF datapath (for encap) and userspace (for building
expected output packets used in validation).
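For illustration, the validation side can build the expected outer
source addresses directly (a sketch; inner_sport/inner_saddr are
placeholder names, the real packet builders land in patch 6):

	/* outer IPv4 source: flow entropy folded into 172.16/12 */
	__be32 outer_saddr = create_encap_ipv4_src(inner_sport, inner_saddr);

	/* outer IPv6 source: 0100::/64 with src ^ port in the last word */
	__be32 outer_saddr6[4];
	create_encap_ipv6_src(inner_sport, inner_saddr, outer_saddr6);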
Signed-off-by: Puranjay Mohan <puranjay@kernel.org>
---
.../selftests/bpf/xdp_lb_bench_common.h | 112 ++++++++++++++++++
1 file changed, 112 insertions(+)
create mode 100644 tools/testing/selftests/bpf/xdp_lb_bench_common.h
diff --git a/tools/testing/selftests/bpf/xdp_lb_bench_common.h b/tools/testing/selftests/bpf/xdp_lb_bench_common.h
new file mode 100644
index 000000000000..aed20a963701
--- /dev/null
+++ b/tools/testing/selftests/bpf/xdp_lb_bench_common.h
@@ -0,0 +1,112 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+/* Copyright (c) 2026 Meta Platforms, Inc. and affiliates. */
+
+#ifndef XDP_LB_BENCH_COMMON_H
+#define XDP_LB_BENCH_COMMON_H
+
+#define F_IPV6 (1 << 0)
+#define F_LRU_BYPASS (1 << 1)
+
+#define CH_RING_SIZE 65537 /* per-VIP consistent hash ring slots */
+#define MAX_VIPS 16
+#define CH_RINGS_SIZE (MAX_VIPS * CH_RING_SIZE)
+#define MAX_REALS 512
+#define DEFAULT_LRU_SIZE 100000 /* connection tracking cache size */
+#define ONE_SEC 1000000000U /* 1 sec in nanosec */
+#define MAX_CONN_RATE 100000000 /* high enough to never trigger in bench */
+#define LRU_UDP_TIMEOUT 30000000000ULL /* 30 sec in nanosec */
+#define PCKT_FRAGMENTED 0x3FFF
+#define KNUTH_HASH_MULT 2654435761U
+#define IPIP_V4_PREFIX 4268 /* 172.16/12 in network order */
+#define IPIP_V6_PREFIX1 1 /* 0100::/64 (RFC 6666 discard) */
+#define IPIP_V6_PREFIX2 0
+#define IPIP_V6_PREFIX3 0
+
+/* Stats indices (0..MAX_VIPS-1 are per-VIP packet/byte counters) */
+#define STATS_LRU (MAX_VIPS + 0) /* v1: total VIP packets, v2: LRU misses */
+#define STATS_XDP_TX (MAX_VIPS + 1)
+#define STATS_XDP_PASS (MAX_VIPS + 2)
+#define STATS_XDP_DROP (MAX_VIPS + 3)
+#define STATS_NEW_CONN (MAX_VIPS + 4) /* v1: conn count, v2: last reset ts */
+#define STATS_LRU_MISS (MAX_VIPS + 5) /* v1: TCP LRU misses */
+#define STATS_SIZE (MAX_VIPS + 6)
+
+#ifdef __BPF__
+#define lb_htons(x) bpf_htons(x)
+#define LB_INLINE static __always_inline
+#else
+#define lb_htons(x) htons(x)
+#define LB_INLINE static inline
+#endif
+
+LB_INLINE __be32 create_encap_ipv4_src(__u16 port, __be32 src)
+{
+ __u32 ip_suffix = lb_htons(port);
+
+ ip_suffix <<= 16;
+ ip_suffix ^= src;
+ return (0xFFFF0000 & ip_suffix) | IPIP_V4_PREFIX;
+}
+
+LB_INLINE void create_encap_ipv6_src(__u16 port, __be32 src, __be32 *saddr)
+{
+ saddr[0] = IPIP_V6_PREFIX1;
+ saddr[1] = IPIP_V6_PREFIX2;
+ saddr[2] = IPIP_V6_PREFIX3;
+ saddr[3] = src ^ port;
+}
+
+struct flow_key {
+ union {
+ __be32 src;
+ __be32 srcv6[4];
+ };
+ union {
+ __be32 dst;
+ __be32 dstv6[4];
+ };
+ union {
+ __u32 ports;
+ __u16 port16[2];
+ };
+ __u8 proto;
+ __u8 pad[3];
+};
+
+struct vip_definition {
+ union {
+ __be32 vip;
+ __be32 vipv6[4];
+ };
+ __u16 port;
+ __u8 proto;
+ __u8 pad;
+};
+
+struct vip_meta {
+ __u32 flags;
+ __u32 vip_num;
+};
+
+struct real_pos_lru {
+ __u32 pos;
+ __u64 atime;
+};
+
+struct real_definition {
+ __be32 dst;
+ __be32 dstv6[4];
+ __u8 flags;
+};
+
+struct lb_stats {
+ __u64 v1;
+ __u64 v2;
+};
+
+struct ctl_value {
+ __u8 mac[6];
+ __u8 pad[2];
+};
+
+#endif /* XDP_LB_BENCH_COMMON_H */
--
2.52.0
^ permalink raw reply related [flat|nested] 24+ messages in thread
* [PATCH bpf-next 5/7] selftests/bpf: Add XDP load-balancer BPF program
2026-04-27 23:22 [PATCH bpf-next 0/7] selftests/bpf: Add XDP load-balancer benchmark Puranjay Mohan
` (3 preceding siblings ...)
2026-04-27 23:23 ` [PATCH bpf-next 4/7] selftests/bpf: Add XDP load-balancer common definitions Puranjay Mohan
@ 2026-04-27 23:23 ` Puranjay Mohan
2026-04-28 0:18 ` bot+bpf-ci
2026-04-28 1:05 ` sashiko-bot
2026-04-27 23:23 ` [PATCH bpf-next 6/7] selftests/bpf: Add XDP load-balancer benchmark driver Puranjay Mohan
2026-04-27 23:23 ` [PATCH bpf-next 7/7] selftests/bpf: Add XDP load-balancer benchmark run script Puranjay Mohan
6 siblings, 2 replies; 24+ messages in thread
From: Puranjay Mohan @ 2026-04-27 23:23 UTC (permalink / raw)
To: bpf
Cc: Puranjay Mohan, Puranjay Mohan, Alexei Starovoitov,
Andrii Nakryiko, Daniel Borkmann, Martin KaFai Lau,
Eduard Zingerman, Kumar Kartikeya Dwivedi, Mykyta Yatsenko,
Fei Chen, Taruna Agrawal, Nikhil Dixit Limaye, Nikita V. Shirokov,
kernel-team
Add the BPF datapath for the XDP load-balancer benchmark, a
simplified L4 load-balancer inspired by katran.
The pipeline: L3/L4 parse -> VIP lookup -> per-CPU LRU connection
table or consistent-hash fallback -> real server lookup -> per-VIP
and per-real stats -> IPIP/IP6IP6 encapsulation. TCP SYN forces
the consistent-hash path (skipping LRU); TCP RST skips LRU insert
to avoid polluting the table.
process_packet() is marked __noinline so that the BENCH_BPF_LOOP
reset block (which strips encapsulation) operates on valid packet
pointers after bpf_xdp_adjust_head().
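bpf_xdp_adjust_head() invalidates previously derived packet pointers,
so strip_encap() re-derives them before touching the packet again:

	data = (void *)(long)xdp->data;
	data_end = (void *)(long)xdp->data_end;
	eth = data;

	if (eth + 1 > data_end)
		return -1;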
Signed-off-by: Puranjay Mohan <puranjay@kernel.org>
---
.../selftests/bpf/progs/xdp_lb_bench.c | 647 ++++++++++++++++++
1 file changed, 647 insertions(+)
create mode 100644 tools/testing/selftests/bpf/progs/xdp_lb_bench.c
diff --git a/tools/testing/selftests/bpf/progs/xdp_lb_bench.c b/tools/testing/selftests/bpf/progs/xdp_lb_bench.c
new file mode 100644
index 000000000000..b9fd848c035d
--- /dev/null
+++ b/tools/testing/selftests/bpf/progs/xdp_lb_bench.c
@@ -0,0 +1,647 @@
+// SPDX-License-Identifier: GPL-2.0
+/* Copyright (c) 2026 Meta Platforms, Inc. and affiliates. */
+
+#include <stddef.h>
+#include <stdbool.h>
+#include <linux/bpf.h>
+#include <linux/if_ether.h>
+#include <linux/ip.h>
+#include <linux/ipv6.h>
+#include <linux/in.h>
+#include <linux/tcp.h>
+#include <linux/udp.h>
+#include <bpf/bpf_helpers.h>
+#include <bpf/bpf_endian.h>
+#include "bpf_compiler.h"
+#include "xdp_lb_bench_common.h"
+#include "bench_bpf_timing.bpf.h"
+
+#ifndef IPPROTO_FRAGMENT
+#define IPPROTO_FRAGMENT 44
+#endif
+
+/* jhash helpers */
+
+static inline __u32 rol32(__u32 word, unsigned int shift)
+{
+ return (word << shift) | (word >> ((-shift) & 31));
+}
+
+#define __jhash_mix(a, b, c) \
+{ \
+ a -= c; a ^= rol32(c, 4); c += b; \
+ b -= a; b ^= rol32(a, 6); a += c; \
+ c -= b; c ^= rol32(b, 8); b += a; \
+ a -= c; a ^= rol32(c, 16); c += b; \
+ b -= a; b ^= rol32(a, 19); a += c; \
+ c -= b; c ^= rol32(b, 4); b += a; \
+}
+
+#define __jhash_final(a, b, c) \
+{ \
+ c ^= b; c -= rol32(b, 14); \
+ a ^= c; a -= rol32(c, 11); \
+ b ^= a; b -= rol32(a, 25); \
+ c ^= b; c -= rol32(b, 16); \
+ a ^= c; a -= rol32(c, 4); \
+ b ^= a; b -= rol32(a, 14); \
+ c ^= b; c -= rol32(b, 24); \
+}
+
+#define JHASH_INITVAL 0xdeadbeef
+
+static inline __u32 __jhash_nwords(__u32 a, __u32 b, __u32 c, __u32 initval)
+{
+ a += initval;
+ b += initval;
+ c += initval;
+ __jhash_final(a, b, c);
+ return c;
+}
+
+static inline __u32 jhash_2words(__u32 a, __u32 b, __u32 initval)
+{
+ return __jhash_nwords(a, b, 0, initval + JHASH_INITVAL + (2 << 2));
+}
+
+static inline __u32 jhash2_4words(const __u32 *k, __u32 initval)
+{
+ __u32 a, b, c;
+
+ a = b = c = JHASH_INITVAL + (4 << 2) + initval;
+
+ a += k[0]; b += k[1]; c += k[2];
+ __jhash_mix(a, b, c);
+
+ a += k[3];
+ __jhash_final(a, b, c);
+
+ return c;
+}
+
+static __always_inline void ipv4_csum(struct iphdr *iph)
+{
+ __u16 *next_iph = (__u16 *)iph;
+ __u32 csum = 0;
+ int i;
+
+ __pragma_loop_unroll_full
+ for (i = 0; i < (int)(sizeof(*iph) >> 1); i++)
+ csum += *next_iph++;
+
+ csum = (csum & 0xffff) + (csum >> 16);
+ csum = (csum & 0xffff) + (csum >> 16);
+ iph->check = ~csum;
+}
+
+struct {
+ __uint(type, BPF_MAP_TYPE_HASH);
+ __uint(max_entries, 64);
+ __type(key, struct vip_definition);
+ __type(value, struct vip_meta);
+} vip_map SEC(".maps");
+
+struct lru_inner_map {
+ __uint(type, BPF_MAP_TYPE_LRU_HASH);
+ __type(key, struct flow_key);
+ __type(value, struct real_pos_lru);
+ __uint(max_entries, DEFAULT_LRU_SIZE);
+} lru_inner SEC(".maps");
+
+struct {
+ __uint(type, BPF_MAP_TYPE_ARRAY_OF_MAPS);
+ __type(key, __u32);
+ __type(value, __u32);
+ __uint(max_entries, BENCH_NR_CPUS);
+ __array(values, struct lru_inner_map);
+} lru_mapping SEC(".maps");
+
+struct {
+ __uint(type, BPF_MAP_TYPE_ARRAY);
+ __uint(max_entries, CH_RINGS_SIZE);
+ __type(key, __u32);
+ __type(value, __u32);
+} ch_rings SEC(".maps");
+
+struct {
+ __uint(type, BPF_MAP_TYPE_ARRAY);
+ __uint(max_entries, MAX_REALS);
+ __type(key, __u32);
+ __type(value, struct real_definition);
+} reals SEC(".maps");
+
+struct {
+ __uint(type, BPF_MAP_TYPE_PERCPU_ARRAY);
+ __uint(max_entries, STATS_SIZE);
+ __type(key, __u32);
+ __type(value, struct lb_stats);
+} stats SEC(".maps");
+
+struct {
+ __uint(type, BPF_MAP_TYPE_PERCPU_ARRAY);
+ __uint(max_entries, MAX_REALS);
+ __type(key, __u32);
+ __type(value, struct lb_stats);
+} reals_stats SEC(".maps");
+
+struct {
+ __uint(type, BPF_MAP_TYPE_ARRAY);
+ __uint(max_entries, 1);
+ __type(key, __u32);
+ __type(value, struct ctl_value);
+} ctl_array SEC(".maps");
+
+struct {
+ __uint(type, BPF_MAP_TYPE_ARRAY);
+ __uint(max_entries, 1);
+ __type(key, __u32);
+ __type(value, struct vip_definition);
+} vip_miss_stats SEC(".maps");
+
+struct {
+ __uint(type, BPF_MAP_TYPE_PERCPU_ARRAY);
+ __uint(max_entries, MAX_REALS);
+ __type(key, __u32);
+ __type(value, __u32);
+} lru_miss_stats SEC(".maps");
+
+volatile __u32 flow_mask;
+volatile __u32 cold_lru;
+__u32 batch_gen;
+
+/*
+ * old_eth MUST be read BEFORE writing the outer header because
+ * bpf_xdp_adjust_head makes them overlap.
+ */
+static __always_inline int encap_v4(struct xdp_md *xdp, __be32 saddr, __be32 daddr,
+ __u16 payload_len, const __u8 *dst_mac)
+{
+ struct ethhdr *new_eth, *old_eth;
+ void *data, *data_end;
+ struct iphdr *iph;
+
+ if (bpf_xdp_adjust_head(xdp, -(int)sizeof(struct iphdr)))
+ return -1;
+
+ data = (void *)(long)xdp->data;
+ data_end = (void *)(long)xdp->data_end;
+
+ new_eth = data;
+ iph = data + sizeof(struct ethhdr);
+ old_eth = data + sizeof(struct iphdr);
+
+ if (new_eth + 1 > data_end || old_eth + 1 > data_end || iph + 1 > data_end)
+ return -1;
+
+ __builtin_memcpy(new_eth->h_source, old_eth->h_dest, sizeof(new_eth->h_source));
+ __builtin_memcpy(new_eth->h_dest, dst_mac, sizeof(new_eth->h_dest));
+ new_eth->h_proto = bpf_htons(ETH_P_IP);
+
+ __builtin_memset(iph, 0, sizeof(*iph));
+ iph->version = 4;
+ iph->ihl = sizeof(*iph) >> 2;
+ iph->protocol = IPPROTO_IPIP;
+ iph->tot_len = bpf_htons(payload_len + sizeof(*iph));
+ iph->ttl = 64;
+ iph->saddr = saddr;
+ iph->daddr = daddr;
+ ipv4_csum(iph);
+
+ return 0;
+}
+
+static __always_inline int encap_v6(struct xdp_md *xdp, const __be32 saddr[4],
+ const __be32 daddr[4], __u8 nexthdr, __u16 payload_len,
+ const __u8 *dst_mac)
+{
+ struct ethhdr *new_eth, *old_eth;
+ void *data, *data_end;
+ struct ipv6hdr *ip6h;
+
+ if (bpf_xdp_adjust_head(xdp, -(int)sizeof(struct ipv6hdr)))
+ return -1;
+
+ data = (void *)(long)xdp->data;
+ data_end = (void *)(long)xdp->data_end;
+
+ new_eth = data;
+ ip6h = data + sizeof(struct ethhdr);
+ old_eth = data + sizeof(struct ipv6hdr);
+
+ if (new_eth + 1 > data_end || old_eth + 1 > data_end || ip6h + 1 > data_end)
+ return -1;
+
+ __builtin_memcpy(new_eth->h_source, old_eth->h_dest, sizeof(new_eth->h_source));
+ __builtin_memcpy(new_eth->h_dest, dst_mac, sizeof(new_eth->h_dest));
+ new_eth->h_proto = bpf_htons(ETH_P_IPV6);
+
+ __builtin_memset(ip6h, 0, sizeof(*ip6h));
+ ip6h->version = 6;
+ ip6h->nexthdr = nexthdr;
+ ip6h->payload_len = bpf_htons(payload_len);
+ ip6h->hop_limit = 64;
+ __builtin_memcpy(&ip6h->saddr, saddr, sizeof(ip6h->saddr));
+ __builtin_memcpy(&ip6h->daddr, daddr, sizeof(ip6h->daddr));
+
+ return 0;
+}
+
+static __always_inline void update_stats(void *map, __u32 key, __u16 bytes)
+{
+ struct lb_stats *st = bpf_map_lookup_elem(map, &key);
+
+ if (st) {
+ st->v1 += 1;
+ st->v2 += bytes;
+ }
+}
+
+static __always_inline void count_action(int action)
+{
+ struct lb_stats *st;
+ __u32 key;
+
+ if (action == XDP_TX)
+ key = STATS_XDP_TX;
+ else if (action == XDP_PASS)
+ key = STATS_XDP_PASS;
+ else
+ key = STATS_XDP_DROP;
+
+ st = bpf_map_lookup_elem(&stats, &key);
+ if (st)
+ st->v1 += 1;
+}
+
+static __always_inline bool is_under_flood(void)
+{
+ __u32 key = STATS_NEW_CONN;
+ struct lb_stats *conn_st = bpf_map_lookup_elem(&stats, &key);
+ __u64 cur_time;
+
+ if (!conn_st)
+ return true;
+
+ cur_time = bpf_ktime_get_ns();
+ if ((cur_time - conn_st->v2) > ONE_SEC) {
+ conn_st->v1 = 1;
+ conn_st->v2 = cur_time;
+ } else {
+ conn_st->v1 += 1;
+ if (conn_st->v1 > MAX_CONN_RATE)
+ return true;
+ }
+ return false;
+}
+
+static __always_inline struct real_definition *connection_table_lookup(void *lru_map,
+ struct flow_key *flow,
+ __u32 *out_pos)
+{
+ struct real_pos_lru *dst_lru;
+ struct real_definition *real;
+ __u32 key;
+
+ dst_lru = bpf_map_lookup_elem(lru_map, flow);
+ if (!dst_lru)
+ return NULL;
+
+ /* UDP connections use atime-based timeout instead of FIN/RST */
+ if (flow->proto == IPPROTO_UDP) {
+ __u64 cur_time = bpf_ktime_get_ns();
+
+ if (cur_time - dst_lru->atime > LRU_UDP_TIMEOUT)
+ return NULL;
+ dst_lru->atime = cur_time;
+ }
+
+ key = dst_lru->pos;
+ *out_pos = key;
+ real = bpf_map_lookup_elem(&reals, &key);
+ return real;
+}
+
+static __always_inline bool get_packet_dst(struct real_definition **real, struct flow_key *flow,
+ struct vip_meta *vip_info, bool is_v6, void *lru_map,
+ bool is_rst, __u32 *out_pos)
+{
+ bool under_flood;
+ __u32 hash, ch_key;
+ __u32 *ch_val;
+ __u32 real_pos;
+
+ under_flood = is_under_flood();
+
+ if (is_v6) {
+ __u32 src_hash = jhash2_4words((__u32 *)flow->srcv6, MAX_VIPS);
+
+ hash = jhash_2words(src_hash, flow->ports, CH_RING_SIZE);
+ } else {
+ hash = jhash_2words(flow->src, flow->ports, CH_RING_SIZE);
+ }
+
+ ch_key = CH_RING_SIZE * vip_info->vip_num + hash % CH_RING_SIZE;
+ ch_val = bpf_map_lookup_elem(&ch_rings, &ch_key);
+ if (!ch_val)
+ return false;
+ real_pos = *ch_val;
+
+ *real = bpf_map_lookup_elem(&reals, &real_pos);
+ if (!(*real))
+ return false;
+
+ if (!(vip_info->flags & F_LRU_BYPASS) && !under_flood && !is_rst) {
+ struct real_pos_lru new_lru = { .pos = real_pos };
+
+ if (flow->proto == IPPROTO_UDP)
+ new_lru.atime = bpf_ktime_get_ns();
+ bpf_map_update_elem(lru_map, flow, &new_lru, BPF_ANY);
+ }
+
+ *out_pos = real_pos;
+ return true;
+}
+
+static __always_inline void update_vip_lru_miss_stats(struct vip_definition *vip, bool is_v6,
+ __u32 real_idx)
+{
+ struct vip_definition *miss_vip;
+ __u32 key = 0;
+ __u32 *cnt;
+
+ miss_vip = bpf_map_lookup_elem(&vip_miss_stats, &key);
+ if (!miss_vip)
+ return;
+
+ if (is_v6) {
+ if (miss_vip->vipv6[0] != vip->vipv6[0] || miss_vip->vipv6[1] != vip->vipv6[1] ||
+ miss_vip->vipv6[2] != vip->vipv6[2] || miss_vip->vipv6[3] != vip->vipv6[3])
+ return;
+ } else {
+ if (miss_vip->vip != vip->vip)
+ return;
+ }
+
+ if (miss_vip->port != vip->port || miss_vip->proto != vip->proto)
+ return;
+
+ cnt = bpf_map_lookup_elem(&lru_miss_stats, &real_idx);
+ if (cnt)
+ *cnt += 1;
+}
+
+static __noinline int process_packet(struct xdp_md *xdp)
+{
+ void *data = (void *)(long)xdp->data;
+ void *data_end = (void *)(long)xdp->data_end;
+ struct ethhdr *eth = data;
+ struct real_definition *dst = NULL;
+ struct vip_definition vip_def = {};
+ struct ctl_value *cval;
+ struct flow_key flow = {};
+ struct vip_meta *vip_info;
+ struct lb_stats *data_stats;
+ struct udphdr *uh;
+ __be32 tnl_src[4];
+ void *lru_map;
+ void *l4;
+ __u16 payload_len;
+ __u32 real_pos = 0, cpu_num, key;
+ __u8 proto;
+ int action = XDP_DROP;
+ bool is_v6, is_syn = false, is_rst = false;
+
+ if (eth + 1 > data_end)
+ goto out;
+
+ if (eth->h_proto == bpf_htons(ETH_P_IPV6)) {
+ is_v6 = true;
+ } else if (eth->h_proto == bpf_htons(ETH_P_IP)) {
+ is_v6 = false;
+ } else {
+ action = XDP_PASS;
+ goto out;
+ }
+
+ if (is_v6) {
+ struct ipv6hdr *ip6h = (void *)(eth + 1);
+
+ if (ip6h + 1 > data_end)
+ goto out;
+ if (ip6h->nexthdr == IPPROTO_FRAGMENT)
+ goto out;
+
+ payload_len = sizeof(struct ipv6hdr) + bpf_ntohs(ip6h->payload_len);
+ proto = ip6h->nexthdr;
+
+ __builtin_memcpy(flow.srcv6, &ip6h->saddr, sizeof(flow.srcv6));
+ __builtin_memcpy(flow.dstv6, &ip6h->daddr, sizeof(flow.dstv6));
+ __builtin_memcpy(vip_def.vipv6, &ip6h->daddr, sizeof(vip_def.vipv6));
+ l4 = (void *)(ip6h + 1);
+ } else {
+ struct iphdr *iph = (void *)(eth + 1);
+
+ if (iph + 1 > data_end)
+ goto out;
+ if (iph->ihl != 5)
+ goto out;
+ if (iph->frag_off & bpf_htons(PCKT_FRAGMENTED))
+ goto out;
+
+ payload_len = bpf_ntohs(iph->tot_len);
+ proto = iph->protocol;
+
+ flow.src = iph->saddr;
+ flow.dst = iph->daddr;
+ vip_def.vip = iph->daddr;
+ l4 = (void *)(iph + 1);
+ }
+
+ /* TCP and UDP share the same port layout at offset 0 */
+ if (proto != IPPROTO_TCP && proto != IPPROTO_UDP) {
+ action = XDP_PASS;
+ goto out;
+ }
+
+ uh = l4;
+ if ((void *)(uh + 1) > data_end)
+ goto out;
+ flow.port16[0] = uh->source;
+ flow.port16[1] = uh->dest;
+
+ if (proto == IPPROTO_TCP) {
+ struct tcphdr *th = l4;
+
+ if ((void *)(th + 1) > data_end)
+ goto out;
+ is_syn = th->syn;
+ is_rst = th->rst;
+ }
+
+ flow.proto = proto;
+ vip_def.port = flow.port16[1];
+ vip_def.proto = proto;
+
+ vip_info = bpf_map_lookup_elem(&vip_map, &vip_def);
+ if (!vip_info) {
+ action = XDP_PASS;
+ goto out;
+ }
+
+ key = STATS_LRU;
+ data_stats = bpf_map_lookup_elem(&stats, &key);
+ if (!data_stats)
+ goto out;
+ data_stats->v1 += 1;
+
+ cpu_num = bpf_get_smp_processor_id();
+ lru_map = bpf_map_lookup_elem(&lru_mapping, &cpu_num);
+ if (!lru_map)
+ goto out;
+
+ if (!(vip_info->flags & F_LRU_BYPASS) && !is_syn)
+ dst = connection_table_lookup(lru_map, &flow, &real_pos);
+
+ if (!dst) {
+ if (flow.proto == IPPROTO_TCP) {
+ struct lb_stats *miss_st;
+
+ key = STATS_LRU_MISS;
+ miss_st = bpf_map_lookup_elem(&stats, &key);
+ if (miss_st)
+ miss_st->v1 += 1;
+ }
+
+ if (!get_packet_dst(&dst, &flow, vip_info, is_v6, lru_map, is_rst, &real_pos))
+ goto out;
+
+ update_vip_lru_miss_stats(&vip_def, is_v6, real_pos);
+ data_stats->v2 += 1;
+ }
+
+ key = 0;
+ cval = bpf_map_lookup_elem(&ctl_array, &key);
+ if (!cval)
+ goto out;
+
+ update_stats(&stats, vip_info->vip_num, payload_len);
+ update_stats(&reals_stats, real_pos, payload_len);
+
+ if (is_v6) {
+ create_encap_ipv6_src(flow.port16[0], flow.srcv6[0], tnl_src);
+ if (encap_v6(xdp, tnl_src, dst->dstv6, IPPROTO_IPV6, payload_len, cval->mac))
+ goto out;
+ } else if (dst->flags & F_IPV6) {
+ create_encap_ipv6_src(flow.port16[0], flow.src, tnl_src);
+ if (encap_v6(xdp, tnl_src, dst->dstv6, IPPROTO_IPIP, payload_len, cval->mac))
+ goto out;
+ } else {
+ if (encap_v4(xdp, create_encap_ipv4_src(flow.port16[0], flow.src), dst->dst,
+ payload_len, cval->mac))
+ goto out;
+ }
+
+ action = XDP_TX;
+
+out:
+ count_action(action);
+ return action;
+}
+
+static __always_inline int strip_encap(struct xdp_md *xdp, const struct ethhdr *saved_eth)
+{
+ void *data = (void *)(long)xdp->data;
+ void *data_end = (void *)(long)xdp->data_end;
+ struct ethhdr *eth = data;
+ int hdr_sz;
+
+ if (eth + 1 > data_end)
+ return -1;
+
+ hdr_sz = (eth->h_proto == bpf_htons(ETH_P_IPV6)) ? (int)sizeof(struct ipv6hdr)
+ : (int)sizeof(struct iphdr);
+
+ if (bpf_xdp_adjust_head(xdp, hdr_sz))
+ return -1;
+
+ data = (void *)(long)xdp->data;
+ data_end = (void *)(long)xdp->data_end;
+ eth = data;
+
+ if (eth + 1 > data_end)
+ return -1;
+
+ __builtin_memcpy(eth, saved_eth, sizeof(*saved_eth));
+ return 0;
+}
+
+static __always_inline void randomize_src(struct xdp_md *xdp, int saddr_off, __u32 *rand_state)
+{
+ void *data = (void *)(long)xdp->data;
+ void *data_end = (void *)(long)xdp->data_end;
+ __u32 *saddr = data + saddr_off;
+
+ *rand_state ^= *rand_state << 13;
+ *rand_state ^= *rand_state >> 17;
+ *rand_state ^= *rand_state << 5;
+
+ if ((void *)(saddr + 1) <= data_end)
+ *saddr = *rand_state & flow_mask;
+}
+
+SEC("xdp")
+int xdp_lb_bench(struct xdp_md *xdp)
+{
+ void *data = (void *)(long)xdp->data;
+ void *data_end = (void *)(long)xdp->data_end;
+ struct ethhdr *eth = data;
+ struct ethhdr saved_eth;
+ __u32 rand_state = 0;
+ __u32 batch_hash = 0;
+ int saddr_off = 0;
+ bool is_v6;
+
+ if (eth + 1 > data_end)
+ return XDP_DROP;
+
+ __builtin_memcpy(&saved_eth, eth, sizeof(saved_eth));
+
+ is_v6 = (saved_eth.h_proto == bpf_htons(ETH_P_IPV6));
+
+ saddr_off = sizeof(struct ethhdr) + (is_v6 ? offsetof(struct ipv6hdr, saddr) :
+ offsetof(struct iphdr, saddr));
+
+ if (flow_mask)
+ rand_state = bpf_get_prandom_u32() | 1;
+
+ if (cold_lru) {
+ __u32 *saddr = data + saddr_off;
+
+ batch_gen++;
+ batch_hash = (batch_gen ^ bpf_get_smp_processor_id()) * KNUTH_HASH_MULT;
+ if ((void *)(saddr + 1) <= data_end)
+ *saddr ^= batch_hash;
+ }
+
+ return BENCH_BPF_LOOP(
+ process_packet(xdp),
+ ({
+ if (__bench_result == XDP_TX) {
+ if (strip_encap(xdp, &saved_eth))
+ return XDP_DROP;
+ if (rand_state)
+ randomize_src(xdp, saddr_off, &rand_state);
+ }
+ if (cold_lru) {
+ void *d = (void *)(long)xdp->data;
+ void *de = (void *)(long)xdp->data_end;
+ __u32 *__sa = d + saddr_off;
+
+ if ((void *)(__sa + 1) <= de)
+ *__sa ^= batch_hash;
+ }
+ })
+ );
+}
+
+char _license[] SEC("license") = "GPL";
--
2.52.0
^ permalink raw reply related [flat|nested] 24+ messages in thread
* [PATCH bpf-next 6/7] selftests/bpf: Add XDP load-balancer benchmark driver
2026-04-27 23:22 [PATCH bpf-next 0/7] selftests/bpf: Add XDP load-balancer benchmark Puranjay Mohan
` (4 preceding siblings ...)
2026-04-27 23:23 ` [PATCH bpf-next 5/7] selftests/bpf: Add XDP load-balancer BPF program Puranjay Mohan
@ 2026-04-27 23:23 ` Puranjay Mohan
2026-04-28 0:05 ` bot+bpf-ci
2026-04-28 1:29 ` sashiko-bot
2026-04-27 23:23 ` [PATCH bpf-next 7/7] selftests/bpf: Add XDP load-balancer benchmark run script Puranjay Mohan
6 siblings, 2 replies; 24+ messages in thread
From: Puranjay Mohan @ 2026-04-27 23:23 UTC (permalink / raw)
To: bpf
Cc: Puranjay Mohan, Puranjay Mohan, Alexei Starovoitov,
Andrii Nakryiko, Daniel Borkmann, Martin KaFai Lau,
Eduard Zingerman, Kumar Kartikeya Dwivedi, Mykyta Yatsenko,
Fei Chen, Taruna Agrawal, Nikhil Dixit Limaye, Nikita V. Shirokov,
kernel-team
Wire up the userspace side of the XDP load-balancer benchmark.
24 scenarios cover the full code-path matrix: TCP/UDP, IPv4/IPv6,
cross-AF encap, LRU hit/miss/diverse/cold, consistent-hash bypass,
SYN/RST flag handling, and early exits (unknown VIP, non-IP, ICMP,
fragments, IP options).
Before benchmarking, each scenario validates correctness: the output
packet is compared byte-for-byte against a pre-built expected packet
and BPF map counters are checked against the expected values.
Usage:
sudo ./bench -a -w3 -p1 xdp-lb --scenario tcp-v4-lru-hit
sudo ./bench xdp-lb --list-scenarios
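Scenarios are table-driven via struct test_scenario; an entry looks
roughly like this (a sketch, the VIP address shown is a placeholder):

	[S_TCP_V4_LRU_HIT] = {
		.name = "tcp-v4-lru-hit",
		S_BASE_ENCAP_V4,
		.vip_addr = IP4(10, 0, 0, 1),
		.ip_proto = IPPROTO_TCP,
		.prepopulate_lru = true,
	},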
Signed-off-by: Puranjay Mohan <puranjay@kernel.org>
---
tools/testing/selftests/bpf/Makefile | 2 +
tools/testing/selftests/bpf/bench.c | 4 +
.../selftests/bpf/benchs/bench_xdp_lb.c | 1113 +++++++++++++++++
3 files changed, 1119 insertions(+)
create mode 100644 tools/testing/selftests/bpf/benchs/bench_xdp_lb.c
diff --git a/tools/testing/selftests/bpf/Makefile b/tools/testing/selftests/bpf/Makefile
index 97f9fbd41244..bc049620c774 100644
--- a/tools/testing/selftests/bpf/Makefile
+++ b/tools/testing/selftests/bpf/Makefile
@@ -907,6 +907,7 @@ $(OUTPUT)/bench_bpf_crypto.o: $(OUTPUT)/crypto_bench.skel.h
$(OUTPUT)/bench_sockmap.o: $(OUTPUT)/bench_sockmap_prog.skel.h
$(OUTPUT)/bench_lpm_trie_map.o: $(OUTPUT)/lpm_trie_bench.skel.h $(OUTPUT)/lpm_trie_map.skel.h
$(OUTPUT)/bench_bpf_nop.o: $(OUTPUT)/bpf_nop_bench.skel.h bench_bpf_timing.h
+$(OUTPUT)/bench_xdp_lb.o: $(OUTPUT)/xdp_lb_bench.skel.h bench_bpf_timing.h
$(OUTPUT)/bench_bpf_timing.o: bench_bpf_timing.h
$(OUTPUT)/bench.o: bench.h testing_helpers.h $(BPFOBJ)
$(OUTPUT)/bench: LDLIBS += -lm
@@ -932,6 +933,7 @@ $(OUTPUT)/bench: $(OUTPUT)/bench.o \
$(OUTPUT)/bench_lpm_trie_map.o \
$(OUTPUT)/bench_bpf_timing.o \
$(OUTPUT)/bench_bpf_nop.o \
+ $(OUTPUT)/bench_xdp_lb.o \
$(OUTPUT)/usdt_1.o \
$(OUTPUT)/usdt_2.o \
#
diff --git a/tools/testing/selftests/bpf/bench.c b/tools/testing/selftests/bpf/bench.c
index 1696de5d6780..6155ce455c27 100644
--- a/tools/testing/selftests/bpf/bench.c
+++ b/tools/testing/selftests/bpf/bench.c
@@ -286,6 +286,7 @@ extern struct argp bench_trigger_batch_argp;
extern struct argp bench_crypto_argp;
extern struct argp bench_sockmap_argp;
extern struct argp bench_lpm_trie_map_argp;
+extern struct argp bench_xdp_lb_argp;
static const struct argp_child bench_parsers[] = {
{ &bench_ringbufs_argp, 0, "Ring buffers benchmark", 0 },
@@ -302,6 +303,7 @@ static const struct argp_child bench_parsers[] = {
{ &bench_crypto_argp, 0, "bpf crypto benchmark", 0 },
{ &bench_sockmap_argp, 0, "bpf sockmap benchmark", 0 },
{ &bench_lpm_trie_map_argp, 0, "LPM trie map benchmark", 0 },
+ { &bench_xdp_lb_argp, 0, "XDP load-balancer benchmark", 0 },
{},
};
@@ -576,6 +578,7 @@ extern const struct bench bench_lpm_trie_update;
extern const struct bench bench_lpm_trie_delete;
extern const struct bench bench_lpm_trie_free;
extern const struct bench bench_bpf_nop;
+extern const struct bench bench_xdp_lb;
static const struct bench *benchs[] = {
&bench_count_global,
@@ -655,6 +658,7 @@ static const struct bench *benchs[] = {
&bench_lpm_trie_delete,
&bench_lpm_trie_free,
&bench_bpf_nop,
+ &bench_xdp_lb,
};
static void find_benchmark(void)
diff --git a/tools/testing/selftests/bpf/benchs/bench_xdp_lb.c b/tools/testing/selftests/bpf/benchs/bench_xdp_lb.c
new file mode 100644
index 000000000000..0b6709a2b03c
--- /dev/null
+++ b/tools/testing/selftests/bpf/benchs/bench_xdp_lb.c
@@ -0,0 +1,1113 @@
+// SPDX-License-Identifier: GPL-2.0
+/* Copyright (c) 2026 Meta Platforms, Inc. and affiliates. */
+
+#include <argp.h>
+#include <string.h>
+#include <arpa/inet.h>
+#include <linux/if_ether.h>
+#include <linux/ip.h>
+#include <linux/ipv6.h>
+#include <linux/in.h>
+#include <linux/tcp.h>
+#include <linux/udp.h>
+#include "bench.h"
+#include "bench_bpf_timing.h"
+#include "xdp_lb_bench.skel.h"
+#include "xdp_lb_bench_common.h"
+#include "bpf_util.h"
+
+#define IP4(a, b, c, d) (((__u32)(a) << 24) | ((__u32)(b) << 16) | ((__u32)(c) << 8) | (__u32)(d))
+
+#define IP6(a, b, c, d) { (__u32)(a), (__u32)(b), (__u32)(c), (__u32)(d) }
+
+#define TNL_DST IP4(192, 168, 1, 2)
+#define REAL_INDEX 1
+#define REAL_INDEX_V6 2
+#define MAX_PKT_SIZE 256
+#define IP_MF 0x2000
+
+static const __u32 tnl_dst_v6[4] = { 0xfd000000, 0, 0, 2 };
+
+static const __u8 lb_mac[ETH_ALEN] = {0xaa, 0xbb, 0xcc, 0xdd, 0xee, 0xff};
+static const __u8 client_mac[ETH_ALEN] = {0x11, 0x22, 0x33, 0x44, 0x55, 0x66};
+static const __u8 router_mac[ETH_ALEN] = {0xde, 0xad, 0xbe, 0xef, 0x00, 0x01};
+
+enum scenario_id {
+ S_TCP_V4_LRU_HIT,
+ S_TCP_V4_CH,
+ S_TCP_V6_LRU_HIT,
+ S_TCP_V6_CH,
+ S_UDP_V4_LRU_HIT,
+ S_UDP_V6_LRU_HIT,
+ S_TCP_V4V6_LRU_HIT,
+ S_TCP_V4_LRU_DIVERSE,
+ S_TCP_V4_CH_DIVERSE,
+ S_TCP_V6_LRU_DIVERSE,
+ S_TCP_V6_CH_DIVERSE,
+ S_UDP_V4_LRU_DIVERSE,
+ S_TCP_V4_LRU_MISS,
+ S_UDP_V4_LRU_MISS,
+ S_TCP_V4_LRU_WARMUP,
+ S_TCP_V4_SYN,
+ S_TCP_V4_RST_MISS,
+ S_PASS_V4_NO_VIP,
+ S_PASS_V6_NO_VIP,
+ S_PASS_V4_ICMP,
+ S_PASS_NON_IP,
+ S_DROP_V4_FRAG,
+ S_DROP_V4_OPTIONS,
+ S_DROP_V6_FRAG,
+ NUM_SCENARIOS,
+};
+
+enum lru_miss_type {
+ LRU_MISS_AUTO = 0, /* compute from scenario flags (default) */
+ LRU_MISS_NONE, /* 0 misses (all LRU hits) */
+ LRU_MISS_ALL, /* batch_iters+1 misses (every op misses) */
+ LRU_MISS_FIRST, /* 1 miss (first miss, then hits) */
+};
+
+#define S_BASE_ENCAP_V4 \
+ .expected_retval = XDP_TX, .expect_encap = true, \
+ .tunnel_dst = TNL_DST
+
+#define S_BASE_ENCAP_V6 \
+ .expected_retval = XDP_TX, .expect_encap = true, \
+ .is_v6 = true, .encap_v6_outer = true, \
+ .tunnel_dst_v6 = { 0xfd000000, 0, 0, 2 }
+
+#define S_BASE_ENCAP_V4V6 \
+ .expected_retval = XDP_TX, .expect_encap = true, \
+ .encap_v6_outer = true, \
+ .tunnel_dst_v6 = { 0xfd000000, 0, 0, 2 }
+
+struct test_scenario {
+ const char *name;
+ const char *description;
+ int expected_retval;
+ bool expect_encap;
+ bool is_v6;
+ __u32 vip_addr;
+ __u32 src_addr;
+ __u32 tunnel_dst;
+ __u32 vip_addr_v6[4];
+ __u32 src_addr_v6[4];
+ __u32 tunnel_dst_v6[4];
+ __u16 dst_port;
+ __u16 src_port;
+ __u8 ip_proto;
+ __u32 vip_flags;
+ __u32 vip_num;
+ bool prepopulate_lru;
+ bool set_frag;
+ __u16 eth_proto;
+ bool encap_v6_outer;
+ __u32 flow_mask;
+ bool cold_lru;
+ bool set_syn;
+ bool set_rst;
+ bool set_ip_options;
+ __u32 fixed_batch_iters; /* 0 = auto-calibrate, >0 = use this value */
+ enum lru_miss_type lru_miss; /* expected LRU miss pattern */
+};
+
+static const struct test_scenario scenarios[NUM_SCENARIOS] = {
+ /* Single-flow baseline */
+ [S_TCP_V4_LRU_HIT] = {
+ S_BASE_ENCAP_V4, .ip_proto = IPPROTO_TCP,
+ .name = "tcp-v4-lru-hit",
+ .description = "IPv4 TCP, LRU hit, IPIP encap",
+ .vip_addr = IP4(10, 10, 1, 1), .dst_port = 80,
+ .src_addr = IP4(10, 10, 2, 1), .src_port = 12345,
+ .prepopulate_lru = true, .lru_miss = LRU_MISS_NONE,
+ },
+ [S_TCP_V4_CH] = {
+ S_BASE_ENCAP_V4, .ip_proto = IPPROTO_TCP,
+ .name = "tcp-v4-ch",
+ .description = "IPv4 TCP, CH (LRU bypass), IPIP encap",
+ .vip_addr = IP4(10, 10, 1, 2), .dst_port = 80,
+ .src_addr = IP4(10, 10, 2, 2), .src_port = 54321,
+ .vip_flags = F_LRU_BYPASS, .vip_num = 1,
+ .lru_miss = LRU_MISS_ALL,
+ },
+ [S_TCP_V6_LRU_HIT] = {
+ S_BASE_ENCAP_V6, .ip_proto = IPPROTO_TCP,
+ .name = "tcp-v6-lru-hit",
+ .description = "IPv6 TCP, LRU hit, IP6IP6 encap",
+ .vip_addr_v6 = IP6(0xfd000100, 0, 0, 1), .dst_port = 80,
+ .src_addr_v6 = IP6(0xfd000200, 0, 0, 1), .src_port = 12345,
+ .vip_num = 10,
+ .prepopulate_lru = true, .lru_miss = LRU_MISS_NONE,
+ },
+ [S_TCP_V6_CH] = {
+ S_BASE_ENCAP_V6, .ip_proto = IPPROTO_TCP,
+ .name = "tcp-v6-ch",
+ .description = "IPv6 TCP, CH (LRU bypass), IP6IP6 encap",
+ .vip_addr_v6 = IP6(0xfd000100, 0, 0, 2), .dst_port = 80,
+ .src_addr_v6 = IP6(0xfd000200, 0, 0, 2), .src_port = 54321,
+ .vip_flags = F_LRU_BYPASS, .vip_num = 12,
+ .lru_miss = LRU_MISS_ALL,
+ },
+ [S_UDP_V4_LRU_HIT] = {
+ S_BASE_ENCAP_V4, .ip_proto = IPPROTO_UDP,
+ .name = "udp-v4-lru-hit",
+ .description = "IPv4 UDP, LRU hit, IPIP encap",
+ .vip_addr = IP4(10, 10, 1, 1), .dst_port = 443,
+ .src_addr = IP4(10, 10, 3, 1), .src_port = 11111,
+ .vip_num = 2,
+ .prepopulate_lru = true, .lru_miss = LRU_MISS_NONE,
+ },
+ [S_UDP_V6_LRU_HIT] = {
+ S_BASE_ENCAP_V6, .ip_proto = IPPROTO_UDP,
+ .name = "udp-v6-lru-hit",
+ .description = "IPv6 UDP, LRU hit, IP6IP6 encap",
+ .vip_addr_v6 = IP6(0xfd000100, 0, 0, 1), .dst_port = 443,
+ .src_addr_v6 = IP6(0xfd000200, 0, 0, 3), .src_port = 22222,
+ .vip_num = 14,
+ .prepopulate_lru = true, .lru_miss = LRU_MISS_NONE,
+ },
+ [S_TCP_V4V6_LRU_HIT] = {
+ S_BASE_ENCAP_V4V6, .ip_proto = IPPROTO_TCP,
+ .name = "tcp-v4v6-lru-hit",
+ .description = "IPv4 TCP, LRU hit, IPv4-in-IPv6 encap",
+ .vip_addr = IP4(10, 10, 1, 4), .dst_port = 80,
+ .src_addr = IP4(10, 10, 2, 4), .src_port = 12347,
+ .vip_num = 13,
+ .prepopulate_lru = true, .lru_miss = LRU_MISS_NONE,
+ },
+
+ /* Diverse flows (4K src addrs) */
+ [S_TCP_V4_LRU_DIVERSE] = {
+ S_BASE_ENCAP_V4, .ip_proto = IPPROTO_TCP,
+ .name = "tcp-v4-lru-diverse",
+ .description = "IPv4 TCP, diverse flows, warm LRU",
+ .vip_addr = IP4(10, 10, 1, 1), .dst_port = 80,
+ .src_addr = IP4(10, 10, 2, 1), .src_port = 12345,
+ .prepopulate_lru = true, .flow_mask = 0xFFF,
+ .lru_miss = LRU_MISS_NONE,
+ },
+ [S_TCP_V4_CH_DIVERSE] = {
+ S_BASE_ENCAP_V4, .ip_proto = IPPROTO_TCP,
+ .name = "tcp-v4-ch-diverse",
+ .description = "IPv4 TCP, diverse flows, CH (LRU bypass)",
+ .vip_addr = IP4(10, 10, 1, 2), .dst_port = 80,
+ .src_addr = IP4(10, 10, 2, 2), .src_port = 54321,
+ .vip_flags = F_LRU_BYPASS, .vip_num = 1,
+ .flow_mask = 0xFFF, .lru_miss = LRU_MISS_ALL,
+ },
+ [S_TCP_V6_LRU_DIVERSE] = {
+ S_BASE_ENCAP_V6, .ip_proto = IPPROTO_TCP,
+ .name = "tcp-v6-lru-diverse",
+ .description = "IPv6 TCP, diverse flows, warm LRU",
+ .vip_addr_v6 = IP6(0xfd000100, 0, 0, 1), .dst_port = 80,
+ .src_addr_v6 = IP6(0xfd000200, 0, 0, 1), .src_port = 12345,
+ .vip_num = 10,
+ .prepopulate_lru = true, .flow_mask = 0xFFF,
+ .lru_miss = LRU_MISS_NONE,
+ },
+ [S_TCP_V6_CH_DIVERSE] = {
+ S_BASE_ENCAP_V6, .ip_proto = IPPROTO_TCP,
+ .name = "tcp-v6-ch-diverse",
+ .description = "IPv6 TCP, diverse flows, CH (LRU bypass)",
+ .vip_addr_v6 = IP6(0xfd000100, 0, 0, 2), .dst_port = 80,
+ .src_addr_v6 = IP6(0xfd000200, 0, 0, 2), .src_port = 54321,
+ .vip_flags = F_LRU_BYPASS, .vip_num = 12,
+ .flow_mask = 0xFFF, .lru_miss = LRU_MISS_ALL,
+ },
+ [S_UDP_V4_LRU_DIVERSE] = {
+ S_BASE_ENCAP_V4, .ip_proto = IPPROTO_UDP,
+ .name = "udp-v4-lru-diverse",
+ .description = "IPv4 UDP, diverse flows, warm LRU",
+ .vip_addr = IP4(10, 10, 1, 1), .dst_port = 443,
+ .src_addr = IP4(10, 10, 3, 1), .src_port = 11111,
+ .vip_num = 2,
+ .prepopulate_lru = true, .flow_mask = 0xFFF,
+ .lru_miss = LRU_MISS_NONE,
+ },
+
+ /* LRU stress */
+ [S_TCP_V4_LRU_MISS] = {
+ S_BASE_ENCAP_V4, .ip_proto = IPPROTO_TCP,
+ .name = "tcp-v4-lru-miss",
+ .description = "IPv4 TCP, LRU miss (16M flow space), CH lookup",
+ .vip_addr = IP4(10, 10, 1, 1), .dst_port = 80,
+ .src_addr = IP4(10, 10, 2, 1), .src_port = 12345,
+ .flow_mask = 0xFFFFFF, .cold_lru = true,
+ .lru_miss = LRU_MISS_FIRST,
+ },
+ [S_UDP_V4_LRU_MISS] = {
+ S_BASE_ENCAP_V4, .ip_proto = IPPROTO_UDP,
+ .name = "udp-v4-lru-miss",
+ .description = "IPv4 UDP, LRU miss (16M flow space), CH lookup",
+ .vip_addr = IP4(10, 10, 1, 1), .dst_port = 443,
+ .src_addr = IP4(10, 10, 3, 1), .src_port = 11111,
+ .vip_num = 2,
+ .flow_mask = 0xFFFFFF, .cold_lru = true,
+ .lru_miss = LRU_MISS_FIRST,
+ },
+ [S_TCP_V4_LRU_WARMUP] = {
+ S_BASE_ENCAP_V4, .ip_proto = IPPROTO_TCP,
+ .name = "tcp-v4-lru-warmup",
+ .description = "IPv4 TCP, 4K flows, ~50% LRU miss",
+ .vip_addr = IP4(10, 10, 1, 1), .dst_port = 80,
+ .src_addr = IP4(10, 10, 2, 1), .src_port = 12345,
+ .flow_mask = 0xFFF, .cold_lru = true,
+ .fixed_batch_iters = 6500,
+ .lru_miss = LRU_MISS_FIRST,
+ },
+
+ /* TCP flags */
+ [S_TCP_V4_SYN] = {
+ S_BASE_ENCAP_V4, .ip_proto = IPPROTO_TCP,
+ .name = "tcp-v4-syn",
+ .description = "IPv4 TCP SYN, skip LRU, CH + LRU insert",
+ .vip_addr = IP4(10, 10, 1, 1), .dst_port = 80,
+ .src_addr = IP4(10, 10, 8, 2), .src_port = 60001,
+ .set_syn = true, .lru_miss = LRU_MISS_ALL,
+ },
+ [S_TCP_V4_RST_MISS] = {
+ S_BASE_ENCAP_V4, .ip_proto = IPPROTO_TCP,
+ .name = "tcp-v4-rst-miss",
+ .description = "IPv4 TCP RST, CH lookup, no LRU insert",
+ .vip_addr = IP4(10, 10, 1, 1), .dst_port = 80,
+ .src_addr = IP4(10, 10, 8, 1), .src_port = 60000,
+ .flow_mask = 0xFFFFFF, .cold_lru = true,
+ .set_rst = true, .lru_miss = LRU_MISS_ALL,
+ },
+
+ /* Early exits */
+ [S_PASS_V4_NO_VIP] = {
+ .name = "pass-v4-no-vip",
+ .description = "IPv4 TCP, unknown VIP, XDP_PASS",
+ .expected_retval = XDP_PASS,
+ .ip_proto = IPPROTO_TCP,
+ .vip_addr = IP4(10, 10, 9, 9), .dst_port = 80,
+ .src_addr = IP4(10, 10, 4, 1), .src_port = 33333,
+ },
+ [S_PASS_V6_NO_VIP] = {
+ .name = "pass-v6-no-vip",
+ .description = "IPv6 TCP, unknown VIP, XDP_PASS",
+ .expected_retval = XDP_PASS, .is_v6 = true,
+ .ip_proto = IPPROTO_TCP,
+ .vip_addr_v6 = IP6(0xfd009900, 0, 0, 1), .dst_port = 80,
+ .src_addr_v6 = IP6(0xfd000400, 0, 0, 1), .src_port = 33333,
+ },
+ [S_PASS_V4_ICMP] = {
+ .name = "pass-v4-icmp",
+ .description = "IPv4 ICMP, non-TCP/UDP protocol, XDP_PASS",
+ .expected_retval = XDP_PASS,
+ .ip_proto = IPPROTO_ICMP,
+ .vip_addr = IP4(10, 10, 1, 1),
+ .src_addr = IP4(10, 10, 6, 1),
+ },
+ [S_PASS_NON_IP] = {
+ .name = "pass-non-ip",
+ .description = "Non-IP (ARP), earliest XDP_PASS exit",
+ .expected_retval = XDP_PASS,
+ .eth_proto = ETH_P_ARP,
+ },
+ [S_DROP_V4_FRAG] = {
+ .name = "drop-v4-frag",
+ .description = "IPv4 fragmented, XDP_DROP",
+ .expected_retval = XDP_DROP, .ip_proto = IPPROTO_TCP,
+ .vip_addr = IP4(10, 10, 1, 1), .dst_port = 80,
+ .src_addr = IP4(10, 10, 5, 1), .src_port = 44444,
+ .set_frag = true,
+ },
+ [S_DROP_V4_OPTIONS] = {
+ .name = "drop-v4-options",
+ .description = "IPv4 with IP options (ihl>5), XDP_DROP",
+ .expected_retval = XDP_DROP, .ip_proto = IPPROTO_TCP,
+ .vip_addr = IP4(10, 10, 1, 1), .dst_port = 80,
+ .src_addr = IP4(10, 10, 7, 1), .src_port = 55555,
+ .set_ip_options = true,
+ },
+ [S_DROP_V6_FRAG] = {
+ .name = "drop-v6-frag",
+ .description = "IPv6 fragment extension header, XDP_DROP",
+ .expected_retval = XDP_DROP, .is_v6 = true,
+ .ip_proto = IPPROTO_TCP,
+ .vip_addr_v6 = IP6(0xfd000100, 0, 0, 1), .dst_port = 80,
+ .src_addr_v6 = IP6(0xfd000500, 0, 0, 1), .src_port = 44444,
+ .set_frag = true,
+ },
+};
+
+#define MAX_ENCAP_SIZE (MAX_PKT_SIZE + sizeof(struct ipv6hdr))
+
+static __u8 pkt_buf[NUM_SCENARIOS][MAX_PKT_SIZE];
+static __u32 pkt_len[NUM_SCENARIOS];
+static __u8 expected_buf[NUM_SCENARIOS][MAX_ENCAP_SIZE];
+static __u32 expected_len[NUM_SCENARIOS];
+
+static int lru_inner_fds[BENCH_NR_CPUS];
+static int nr_inner_maps;
+
+static struct ctx {
+ struct xdp_lb_bench *skel;
+ struct bpf_bench_timing timing;
+ int prog_fd;
+} ctx;
+
+static struct {
+ int scenario;
+ bool machine_readable;
+} args = {
+ .scenario = -1,
+};
+
+static __u16 ip_checksum(const void *hdr, int len)
+{
+ const __u16 *p = hdr;
+ __u32 csum = 0;
+ int i;
+
+ for (i = 0; i < len / 2; i++)
+ csum += p[i];
+
+ while (csum >> 16)
+ csum = (csum & 0xffff) + (csum >> 16);
+
+ return ~csum;
+}
+
+static void htonl_v6(__be32 dst[4], const __u32 src[4])
+{
+ int i;
+
+ for (i = 0; i < 4; i++)
+ dst[i] = htonl(src[i]);
+}
+
+static void build_flow_key(struct flow_key *fk, const struct test_scenario *sc)
+{
+ memset(fk, 0, sizeof(*fk));
+ if (sc->is_v6) {
+ htonl_v6(fk->srcv6, sc->src_addr_v6);
+ htonl_v6(fk->dstv6, sc->vip_addr_v6);
+ } else {
+ fk->src = htonl(sc->src_addr);
+ fk->dst = htonl(sc->vip_addr);
+ }
+ fk->proto = sc->ip_proto;
+ fk->port16[0] = htons(sc->src_port);
+ fk->port16[1] = htons(sc->dst_port);
+}
+
+static void build_l4(const struct test_scenario *sc, __u8 *p, __u32 *off)
+{
+ if (sc->ip_proto == IPPROTO_TCP) {
+ struct tcphdr tcp = {};
+
+ tcp.source = htons(sc->src_port);
+ tcp.dest = htons(sc->dst_port);
+ tcp.doff = 5;
+ tcp.syn = sc->set_syn ? 1 : 0;
+ tcp.rst = sc->set_rst ? 1 : 0;
+ tcp.window = htons(8192);
+ memcpy(p + *off, &tcp, sizeof(tcp));
+ *off += sizeof(tcp);
+ } else if (sc->ip_proto == IPPROTO_UDP) {
+ struct udphdr udp = {};
+
+ udp.source = htons(sc->src_port);
+ udp.dest = htons(sc->dst_port);
+ udp.len = htons(sizeof(udp) + 16);
+ memcpy(p + *off, &udp, sizeof(udp));
+ *off += sizeof(udp);
+ }
+}
+
+static void build_packet(int idx)
+{
+ const struct test_scenario *sc = &scenarios[idx];
+ __u8 *p = pkt_buf[idx];
+ struct ethhdr eth = {};
+ __u16 proto;
+ __u32 off = 0;
+
+ memcpy(eth.h_dest, lb_mac, ETH_ALEN);
+ memcpy(eth.h_source, client_mac, ETH_ALEN);
+
+ if (sc->eth_proto)
+ proto = sc->eth_proto;
+ else if (sc->is_v6)
+ proto = ETH_P_IPV6;
+ else
+ proto = ETH_P_IP;
+
+ eth.h_proto = htons(proto);
+ memcpy(p, &eth, sizeof(eth));
+ off += sizeof(eth);
+
+ if (proto != ETH_P_IP && proto != ETH_P_IPV6) {
+ memcpy(p + off, "bench___payload!", 16);
+ off += 16;
+ pkt_len[idx] = off;
+ return;
+ }
+
+ if (sc->is_v6) {
+ struct ipv6hdr ip6h = {};
+ __u32 ip6_off = off;
+
+ ip6h.version = 6;
+ ip6h.nexthdr = sc->set_frag ? 44 : sc->ip_proto;
+ ip6h.hop_limit = 64;
+ htonl_v6((__be32 *)&ip6h.saddr, sc->src_addr_v6);
+ htonl_v6((__be32 *)&ip6h.daddr, sc->vip_addr_v6);
+ off += sizeof(ip6h);
+
+ if (sc->set_frag) {
+ memset(p + off, 0, 8);
+ p[off] = sc->ip_proto;
+ off += 8;
+ }
+
+ build_l4(sc, p, &off);
+
+ memcpy(p + off, "bench___payload!", 16);
+ off += 16;
+
+ ip6h.payload_len = htons(off - ip6_off - sizeof(ip6h));
+ memcpy(p + ip6_off, &ip6h, sizeof(ip6h));
+ } else {
+ struct iphdr iph = {};
+ __u32 ip_off = off;
+
+ iph.version = 4;
+ iph.ihl = sc->set_ip_options ? 6 : 5;
+ iph.ttl = 64;
+ iph.protocol = sc->ip_proto;
+ iph.saddr = htonl(sc->src_addr);
+ iph.daddr = htonl(sc->vip_addr);
+ iph.frag_off = sc->set_frag ? htons(IP_MF) : 0;
+ off += sizeof(iph);
+
+ if (sc->set_ip_options) {
+ /* NOP option padding (4 bytes = 1 word) */
+ __u32 nop = htonl(0x01010101);
+
+ memcpy(p + off, &nop, sizeof(nop));
+ off += sizeof(nop);
+ }
+
+ build_l4(sc, p, &off);
+
+ memcpy(p + off, "bench___payload!", 16);
+ off += 16;
+
+ iph.tot_len = htons(off - ip_off);
+ iph.check = ip_checksum(&iph, sizeof(iph));
+ memcpy(p + ip_off, &iph, sizeof(iph));
+ }
+
+ pkt_len[idx] = off;
+}
+
+static void populate_vip(struct xdp_lb_bench *skel, const struct test_scenario *sc)
+{
+ struct vip_definition key = {};
+ struct vip_meta val = {};
+ int err;
+
+ if (sc->is_v6)
+ htonl_v6(key.vipv6, sc->vip_addr_v6);
+ else
+ key.vip = htonl(sc->vip_addr);
+ key.port = htons(sc->dst_port);
+ key.proto = sc->ip_proto;
+ val.flags = sc->vip_flags;
+ val.vip_num = sc->vip_num;
+
+ err = bpf_map_update_elem(bpf_map__fd(skel->maps.vip_map), &key, &val, BPF_ANY);
+ if (err) {
+ fprintf(stderr, "vip_map [%s]: %s\n", sc->name, strerror(errno));
+ exit(1);
+ }
+}
+
+static void create_per_cpu_lru_maps(struct xdp_lb_bench *skel)
+{
+ int outer_fd = bpf_map__fd(skel->maps.lru_mapping);
+ unsigned int nr_cpus = bpf_num_possible_cpus();
+ int i, inner_fd, err;
+ __u32 cpu;
+
+ if (nr_cpus > BENCH_NR_CPUS)
+ nr_cpus = BENCH_NR_CPUS;
+
+ for (i = 0; i < (int)nr_cpus; i++) {
+ LIBBPF_OPTS(bpf_map_create_opts, opts);
+
+ inner_fd = bpf_map_create(BPF_MAP_TYPE_LRU_HASH, "lru_inner",
+ sizeof(struct flow_key),
+ sizeof(struct real_pos_lru),
+ DEFAULT_LRU_SIZE, &opts);
+ if (inner_fd < 0) {
+ fprintf(stderr, "lru_inner[%d]: %s\n", i, strerror(errno));
+ exit(1);
+ }
+
+ cpu = i;
+ err = bpf_map_update_elem(outer_fd, &cpu, &inner_fd, BPF_ANY);
+ if (err) {
+ fprintf(stderr, "lru_mapping[%d]: %s\n", i, strerror(errno));
+ close(inner_fd);
+ exit(1);
+ }
+
+ lru_inner_fds[i] = inner_fd;
+ }
+
+ nr_inner_maps = nr_cpus;
+}
+
+static void populate_lru(const struct test_scenario *sc, __u32 real_idx)
+{
+ struct real_pos_lru lru = { .pos = real_idx };
+ struct flow_key fk;
+ int i, err;
+
+ build_flow_key(&fk, sc);
+
+ /* Insert into every per-CPU inner LRU so the entry is found
+ * regardless of which CPU runs the BPF program.
+ */
+ for (i = 0; i < nr_inner_maps; i++) {
+ err = bpf_map_update_elem(lru_inner_fds[i], &fk, &lru, BPF_ANY);
+ if (err) {
+ fprintf(stderr, "lru_inner[%d] [%s]: %s\n", i, sc->name,
+ strerror(errno));
+ exit(1);
+ }
+ }
+}
+
+static void populate_maps(struct xdp_lb_bench *skel)
+{
+ struct real_definition real_v4 = {};
+ struct real_definition real_v6 = {};
+ struct ctl_value cval = {};
+ __u32 key, real_idx = REAL_INDEX;
+ int ch_fd, err, i;
+
+ if (scenarios[args.scenario].expect_encap)
+ populate_vip(skel, &scenarios[args.scenario]);
+
+ ch_fd = bpf_map__fd(skel->maps.ch_rings);
+ for (i = 0; i < CH_RINGS_SIZE; i++) {
+ __u32 k = i;
+
+ err = bpf_map_update_elem(ch_fd, &k, &real_idx, BPF_ANY);
+ if (err) {
+ fprintf(stderr, "ch_rings[%d]: %s\n", i, strerror(errno));
+ exit(1);
+ }
+ }
+
+ memcpy(cval.mac, router_mac, ETH_ALEN);
+ key = 0;
+ err = bpf_map_update_elem(bpf_map__fd(skel->maps.ctl_array), &key, &cval, BPF_ANY);
+ if (err) {
+ fprintf(stderr, "ctl_array: %s\n", strerror(errno));
+ exit(1);
+ }
+
+ key = REAL_INDEX;
+ real_v4.dst = htonl(TNL_DST);
+ htonl_v6(real_v4.dstv6, tnl_dst_v6);
+ err = bpf_map_update_elem(bpf_map__fd(skel->maps.reals), &key, &real_v4, BPF_ANY);
+ if (err) {
+ fprintf(stderr, "reals[%d]: %s\n", REAL_INDEX, strerror(errno));
+ exit(1);
+ }
+
+ key = REAL_INDEX_V6;
+ htonl_v6(real_v6.dstv6, tnl_dst_v6);
+ real_v6.flags = F_IPV6;
+ err = bpf_map_update_elem(bpf_map__fd(skel->maps.reals), &key, &real_v6, BPF_ANY);
+ if (err) {
+ fprintf(stderr, "reals[%d]: %s\n", REAL_INDEX_V6, strerror(errno));
+ exit(1);
+ }
+
+ create_per_cpu_lru_maps(skel);
+
+ if (scenarios[args.scenario].prepopulate_lru) {
+ const struct test_scenario *sc = &scenarios[args.scenario];
+ __u32 ridx = sc->encap_v6_outer ? REAL_INDEX_V6 : REAL_INDEX;
+
+ populate_lru(sc, ridx);
+ }
+
+ if (scenarios[args.scenario].expect_encap) {
+ const struct test_scenario *sc = &scenarios[args.scenario];
+ struct vip_definition miss_vip = {};
+
+ if (sc->is_v6)
+ htonl_v6(miss_vip.vipv6, sc->vip_addr_v6);
+ else
+ miss_vip.vip = htonl(sc->vip_addr);
+ miss_vip.port = htons(sc->dst_port);
+ miss_vip.proto = sc->ip_proto;
+
+ key = 0;
+ err = bpf_map_update_elem(bpf_map__fd(skel->maps.vip_miss_stats),
+ &key, &miss_vip, BPF_ANY);
+ if (err) {
+ fprintf(stderr, "vip_miss_stats: %s\n", strerror(errno));
+ exit(1);
+ }
+ }
+}
+
+static void build_expected_packet(int idx)
+{
+ const struct test_scenario *sc = &scenarios[idx];
+ __u8 *p = expected_buf[idx];
+ struct ethhdr eth = {};
+ const __u8 *in = pkt_buf[idx];
+ __u32 in_len = pkt_len[idx];
+ __u32 off = 0;
+ __u32 inner_len = in_len - sizeof(struct ethhdr);
+
+ if (sc->expected_retval == XDP_DROP) {
+ expected_len[idx] = 0;
+ return;
+ }
+
+ if (sc->expected_retval == XDP_PASS) {
+ memcpy(p, in, in_len);
+ expected_len[idx] = in_len;
+ return;
+ }
+
+ memcpy(eth.h_dest, router_mac, ETH_ALEN);
+ memcpy(eth.h_source, lb_mac, ETH_ALEN);
+ eth.h_proto = htons(sc->encap_v6_outer ? ETH_P_IPV6 : ETH_P_IP);
+ memcpy(p, &eth, sizeof(eth));
+ off += sizeof(eth);
+
+ if (sc->encap_v6_outer) {
+ struct ipv6hdr ip6h = {};
+ __u8 nexthdr = sc->is_v6 ? IPPROTO_IPV6 : IPPROTO_IPIP;
+
+ ip6h.version = 6;
+ ip6h.nexthdr = nexthdr;
+ ip6h.payload_len = htons(inner_len);
+ ip6h.hop_limit = 64;
+
+ create_encap_ipv6_src(htons(sc->src_port),
+ sc->is_v6 ? htonl(sc->src_addr_v6[0])
+ : htonl(sc->src_addr),
+ (__be32 *)&ip6h.saddr);
+ htonl_v6((__be32 *)&ip6h.daddr, sc->tunnel_dst_v6);
+
+ memcpy(p + off, &ip6h, sizeof(ip6h));
+ off += sizeof(ip6h);
+ } else {
+ struct iphdr iph = {};
+
+ iph.version = 4;
+ iph.ihl = sizeof(iph) >> 2;
+ iph.protocol = IPPROTO_IPIP;
+ iph.tot_len = htons(inner_len + sizeof(iph));
+ iph.ttl = 64;
+ iph.saddr = create_encap_ipv4_src(htons(sc->src_port),
+ htonl(sc->src_addr));
+ iph.daddr = htonl(sc->tunnel_dst);
+ iph.check = ip_checksum(&iph, sizeof(iph));
+
+ memcpy(p + off, &iph, sizeof(iph));
+ off += sizeof(iph);
+ }
+
+ memcpy(p + off, in + sizeof(struct ethhdr), inner_len);
+ off += inner_len;
+
+ expected_len[idx] = off;
+}
+
+static void print_hex_diff(const char *name, const __u8 *got, __u32 got_len, const __u8 *exp,
+ __u32 exp_len)
+{
+ __u32 max_len = got_len > exp_len ? got_len : exp_len;
+ __u32 i, ndiffs = 0;
+
+ fprintf(stderr, " [%s] got %u bytes, expected %u bytes\n",
+ name, got_len, exp_len);
+
+ for (i = 0; i < max_len && ndiffs < 8; i++) {
+ __u8 g = i < got_len ? got[i] : 0;
+ __u8 e = i < exp_len ? exp[i] : 0;
+
+ if (g != e || i >= got_len || i >= exp_len) {
+ fprintf(stderr, " offset 0x%03x: got 0x%02x expected 0x%02x\n",
+ i, g, e);
+ ndiffs++;
+ }
+ }
+
+ if (ndiffs >= 8 && i < max_len)
+ fprintf(stderr, " ... (more differences)\n");
+}
+
+static void read_stat(int stats_fd, __u32 key, __u64 *v1_out, __u64 *v2_out)
+{
+ struct lb_stats values[BENCH_NR_CPUS];
+ unsigned int nr_cpus = bpf_num_possible_cpus();
+ __u64 v1 = 0, v2 = 0;
+ unsigned int i;
+
+ if (nr_cpus > BENCH_NR_CPUS)
+ nr_cpus = BENCH_NR_CPUS;
+
+ if (bpf_map_lookup_elem(stats_fd, &key, values) == 0) {
+ for (i = 0; i < nr_cpus; i++) {
+ v1 += values[i].v1;
+ v2 += values[i].v2;
+ }
+ }
+
+ *v1_out = v1;
+ *v2_out = v2;
+}
+
+static void reset_stats(int stats_fd)
+{
+ struct lb_stats zeros[BENCH_NR_CPUS];
+ __u32 key;
+
+ memset(zeros, 0, sizeof(zeros));
+ for (key = 0; key < STATS_SIZE; key++)
+ bpf_map_update_elem(stats_fd, &key, zeros, BPF_ANY);
+}
+
+static bool validate_counters(int idx)
+{
+ const struct test_scenario *sc = &scenarios[idx];
+ int stats_fd = bpf_map__fd(ctx.skel->maps.stats);
+ __u64 xdp_tx, xdp_pass, xdp_drop, lru_pkts, lru_misses, tcp_misses;
+ __u64 expected_misses;
+ __u64 dummy;
+ /*
+ * BENCH_BPF_LOOP runs batch_iters timed + 1 untimed iteration.
+ * Each iteration calls process_packet -> count_action, so all
+ * counters are incremented (batch_iters + 1) times.
+ */
+ __u64 n = ctx.timing.batch_iters + 1;
+ bool pass = true;
+
+ read_stat(stats_fd, STATS_XDP_TX, &xdp_tx, &dummy);
+ read_stat(stats_fd, STATS_XDP_PASS, &xdp_pass, &dummy);
+ read_stat(stats_fd, STATS_XDP_DROP, &xdp_drop, &dummy);
+ read_stat(stats_fd, STATS_LRU, &lru_pkts, &lru_misses);
+ read_stat(stats_fd, STATS_LRU_MISS, &tcp_misses, &dummy);
+
+ if (sc->expected_retval == XDP_TX && xdp_tx != n) {
+ fprintf(stderr, " [%s] COUNTER FAIL: STATS_XDP_TX=%llu, expected %llu\n", sc->name,
+ (unsigned long long)xdp_tx, (unsigned long long)n);
+ pass = false;
+ }
+ if (sc->expected_retval == XDP_PASS && xdp_pass != n) {
+ fprintf(stderr, " [%s] COUNTER FAIL: STATS_XDP_PASS=%llu, expected %llu\n",
+ sc->name, (unsigned long long)xdp_pass, (unsigned long long)n);
+ pass = false;
+ }
+ if (sc->expected_retval == XDP_DROP && xdp_drop != n) {
+ fprintf(stderr, " [%s] COUNTER FAIL: STATS_XDP_DROP=%llu, expected %llu\n",
+ sc->name, (unsigned long long)xdp_drop, (unsigned long long)n);
+ pass = false;
+ }
+
+ if (!sc->expect_encap)
+ goto out;
+
+ if (lru_pkts != n) {
+ fprintf(stderr, " [%s] COUNTER FAIL: STATS_LRU.v1=%llu, expected %llu\n",
+ sc->name, (unsigned long long)lru_pkts, (unsigned long long)n);
+ pass = false;
+ }
+
+ switch (sc->lru_miss) {
+ case LRU_MISS_NONE:
+ expected_misses = 0;
+ break;
+ case LRU_MISS_ALL:
+ expected_misses = n;
+ break;
+ case LRU_MISS_FIRST:
+ expected_misses = 1;
+ break;
+ default:
+ /* LRU_MISS_AUTO: compute from scenario flags */
+ if (sc->prepopulate_lru && !sc->set_syn)
+ expected_misses = 0;
+ else if (sc->set_syn || sc->set_rst ||
+ (sc->vip_flags & F_LRU_BYPASS))
+ expected_misses = n;
+ else if (sc->cold_lru)
+ expected_misses = 1;
+ else
+ expected_misses = n;
+ break;
+ }
+
+ if (lru_misses != expected_misses) {
+ fprintf(stderr, " [%s] COUNTER FAIL: LRU misses=%llu, expected %llu\n",
+ sc->name, (unsigned long long)lru_misses,
+ (unsigned long long)expected_misses);
+ pass = false;
+ }
+
+ if (sc->ip_proto == IPPROTO_TCP && lru_misses > 0) {
+ if (tcp_misses != lru_misses) {
+ fprintf(stderr, " [%s] COUNTER FAIL: TCP LRU misses=%llu, expected %llu\n",
+ sc->name, (unsigned long long)tcp_misses,
+ (unsigned long long)lru_misses);
+ pass = false;
+ }
+ }
+
+out:
+ reset_stats(stats_fd);
+ return pass;
+}
+
+static const char *xdp_action_str(int action)
+{
+ switch (action) {
+ case XDP_DROP: return "XDP_DROP";
+ case XDP_PASS: return "XDP_PASS";
+ case XDP_TX: return "XDP_TX";
+ default: return "UNKNOWN";
+ }
+}
+
+static bool validate_scenario(int idx)
+{
+ LIBBPF_OPTS(bpf_test_run_opts, topts);
+ const struct test_scenario *sc = &scenarios[idx];
+ __u8 out[MAX_ENCAP_SIZE];
+ int err;
+
+ topts.data_in = pkt_buf[idx];
+ topts.data_size_in = pkt_len[idx];
+ topts.data_out = out;
+ topts.data_size_out = sizeof(out);
+ topts.repeat = 1;
+
+ err = bpf_prog_test_run_opts(ctx.prog_fd, &topts);
+ if (err) {
+ fprintf(stderr, " [%s] FAIL: test_run: %s\n", sc->name, strerror(errno));
+ return false;
+ }
+
+ if ((int)topts.retval != sc->expected_retval) {
+ fprintf(stderr, " [%s] FAIL: retval %s, expected %s\n", sc->name,
+ xdp_action_str(topts.retval), xdp_action_str(sc->expected_retval));
+ return false;
+ }
+
+ /*
+ * Compare output packet when it's deterministic.
+ * Skip for XDP_DROP (no output) and cold_lru (source IP poisoned).
+ */
+ if (sc->expected_retval != XDP_DROP && !sc->cold_lru) {
+ if (topts.data_size_out != expected_len[idx] ||
+ memcmp(out, expected_buf[idx], expected_len[idx]) != 0) {
+ fprintf(stderr, " [%s] FAIL: output packet mismatch\n", sc->name);
+ print_hex_diff(sc->name, out, topts.data_size_out, expected_buf[idx],
+ expected_len[idx]);
+ return false;
+ }
+ }
+
+ if (!validate_counters(idx))
+ return false;
+ return true;
+}
+
+static int find_scenario(const char *name)
+{
+ int i;
+
+ for (i = 0; i < NUM_SCENARIOS; i++) {
+ if (strcmp(scenarios[i].name, name) == 0)
+ return i;
+ }
+ return -1;
+}
+
+static void xdp_lb_validate(void)
+{
+ if (env.consumer_cnt != 0) {
+ fprintf(stderr, "benchmark doesn't support consumers\n");
+ exit(1);
+ }
+ if (bpf_num_possible_cpus() > BENCH_NR_CPUS) {
+ fprintf(stderr, "too many CPUs (%d > %d), increase BENCH_NR_CPUS\n",
+ bpf_num_possible_cpus(), BENCH_NR_CPUS);
+ exit(1);
+ }
+}
+
+static void xdp_lb_run_once(void *unused __always_unused)
+{
+ int idx = args.scenario;
+
+ LIBBPF_OPTS(bpf_test_run_opts, topts,
+ .data_in = pkt_buf[idx],
+ .data_size_in = pkt_len[idx],
+ .repeat = 1,
+ );
+
+ bpf_prog_test_run_opts(ctx.prog_fd, &topts);
+}
+
+static void xdp_lb_setup(void)
+{
+ struct xdp_lb_bench *skel;
+ int err;
+
+ if (args.scenario < 0) {
+ fprintf(stderr, "--scenario is required. Use --list-scenarios to see options.\n");
+ exit(1);
+ }
+
+ setup_libbpf();
+
+ skel = xdp_lb_bench__open();
+ if (!skel) {
+ fprintf(stderr, "failed to open skeleton\n");
+ exit(1);
+ }
+
+ err = xdp_lb_bench__load(skel);
+ if (err) {
+ fprintf(stderr, "failed to load skeleton: %s\n", strerror(-err));
+ xdp_lb_bench__destroy(skel);
+ exit(1);
+ }
+
+ ctx.skel = skel;
+ ctx.prog_fd = bpf_program__fd(skel->progs.xdp_lb_bench);
+
+ build_packet(args.scenario);
+ build_expected_packet(args.scenario);
+
+ populate_maps(skel);
+
+ BENCH_TIMING_INIT(&ctx.timing, skel, 0);
+ ctx.timing.machine_readable = args.machine_readable;
+
+ if (scenarios[args.scenario].fixed_batch_iters) {
+ ctx.timing.batch_iters = scenarios[args.scenario].fixed_batch_iters;
+ skel->bss->batch_iters = ctx.timing.batch_iters;
+ } else {
+ bpf_bench_calibrate(&ctx.timing, xdp_lb_run_once, NULL);
+ }
+
+ env.duration_sec = 600;
+
+ /*
+ * Enable cold_lru before validation so LRU miss counters are
+ * correct. Seed the LRU with one run so the original flow is
+ * present; validation then sees exactly 1 miss (the poisoned
+ * flow) regardless of whether calibration ran.
+ */
+ if (scenarios[args.scenario].cold_lru) {
+ skel->bss->cold_lru = 1;
+ xdp_lb_run_once(NULL);
+ }
+
+ reset_stats(bpf_map__fd(skel->maps.stats));
+
+ if (!validate_scenario(args.scenario)) {
+ fprintf(stderr, "Validation FAILED - aborting benchmark\n");
+ exit(1);
+ }
+
+ if (scenarios[args.scenario].flow_mask)
+ skel->bss->flow_mask = scenarios[args.scenario].flow_mask;
+}
+
+static void *xdp_lb_producer(void *input)
+{
+ while (true)
+ xdp_lb_run_once(NULL);
+
+ return NULL;
+}
+
+static void xdp_lb_measure(struct bench_res *res)
+{
+ bpf_bench_timing_measure(&ctx.timing, res);
+}
+
+static void xdp_lb_report_final(struct bench_res res[], int res_cnt)
+{
+ bpf_bench_timing_report(&ctx.timing, scenarios[args.scenario].name,
+ scenarios[args.scenario].description);
+}
+
+enum {
+ ARG_SCENARIO = 9001,
+ ARG_LIST_SCENARIOS = 9002,
+ ARG_MACHINE_READABLE = 9003,
+};
+
+static const struct argp_option opts[] = {
+ { "scenario", ARG_SCENARIO, "NAME", 0,
+ "Scenario to benchmark (required)" },
+ { "list-scenarios", ARG_LIST_SCENARIOS, NULL, 0,
+ "List available scenarios and exit" },
+ { "machine-readable", ARG_MACHINE_READABLE, NULL, 0,
+ "Print only a machine-readable RESULT line" },
+ {},
+};
+
+static error_t parse_arg(int key, char *arg, struct argp_state *state)
+{
+ int i;
+
+ switch (key) {
+ case ARG_SCENARIO:
+ args.scenario = find_scenario(arg);
+ if (args.scenario < 0) {
+ fprintf(stderr, "unknown scenario: '%s'\n", arg);
+ fprintf(stderr, "use --list-scenarios to see options\n");
+ argp_usage(state);
+ }
+ break;
+ case ARG_LIST_SCENARIOS:
+ printf("Available scenarios:\n");
+ for (i = 0; i < NUM_SCENARIOS; i++)
+ printf(" %-20s %s\n", scenarios[i].name, scenarios[i].description);
+ exit(0);
+ case ARG_MACHINE_READABLE:
+ args.machine_readable = true;
+ env.quiet = true;
+ break;
+ default:
+ return ARGP_ERR_UNKNOWN;
+ }
+
+ return 0;
+}
+
+const struct argp bench_xdp_lb_argp = {
+ .options = opts,
+ .parser = parse_arg,
+};
+
+const struct bench bench_xdp_lb = {
+ .name = "xdp-lb",
+ .argp = &bench_xdp_lb_argp,
+ .validate = xdp_lb_validate,
+ .setup = xdp_lb_setup,
+ .producer_thread = xdp_lb_producer,
+ .measure = xdp_lb_measure,
+ .report_final = xdp_lb_report_final,
+};
--
2.52.0
^ permalink raw reply related [flat|nested] 24+ messages in thread
* [PATCH bpf-next 7/7] selftests/bpf: Add XDP load-balancer benchmark run script
2026-04-27 23:22 [PATCH bpf-next 0/7] selftests/bpf: Add XDP load-balancer benchmark Puranjay Mohan
` (5 preceding siblings ...)
2026-04-27 23:23 ` [PATCH bpf-next 6/7] selftests/bpf: Add XDP load-balancer benchmark driver Puranjay Mohan
@ 2026-04-27 23:23 ` Puranjay Mohan
2026-04-28 2:03 ` sashiko-bot
6 siblings, 1 reply; 24+ messages in thread
From: Puranjay Mohan @ 2026-04-27 23:23 UTC (permalink / raw)
To: bpf
Cc: Puranjay Mohan, Puranjay Mohan, Alexei Starovoitov,
Andrii Nakryiko, Daniel Borkmann, Martin KaFai Lau,
Eduard Zingerman, Kumar Kartikeya Dwivedi, Mykyta Yatsenko,
Fei Chen, Taruna Agrawal, Nikhil Dixit Limaye, Nikita V. Shirokov,
kernel-team
Add a convenience script that runs all 24 XDP load-balancer scenarios
and formats the results as a table with p50 (median), stddev, and p99
columns.
./benchs/run_bench_xdp_lb.sh
Signed-off-by: Puranjay Mohan <puranjay@kernel.org>
---
.../selftests/bpf/benchs/run_bench_xdp_lb.sh | 79 +++++++++++++++++++
1 file changed, 79 insertions(+)
create mode 100755 tools/testing/selftests/bpf/benchs/run_bench_xdp_lb.sh
diff --git a/tools/testing/selftests/bpf/benchs/run_bench_xdp_lb.sh b/tools/testing/selftests/bpf/benchs/run_bench_xdp_lb.sh
new file mode 100755
index 000000000000..f65cf46214a3
--- /dev/null
+++ b/tools/testing/selftests/bpf/benchs/run_bench_xdp_lb.sh
@@ -0,0 +1,79 @@
+#!/bin/bash
+# SPDX-License-Identifier: GPL-2.0
+
+source ./benchs/run_common.sh
+
+set -eufo pipefail
+
+WARMUP=${WARMUP:-3}
+
+RUN="sudo ./bench -q -w${WARMUP} -a xdp-lb --machine-readable"
+
+SEP=" +----------------------------------+----------+---------+----------+"
+HDR=" | %-32s | %8s | %7s | %8s |\n"
+ROW=" | %-32s | %8s | %7s | %8s |\n"
+
+function group_header()
+{
+ printf "%s\n" "$SEP"
+ printf "$HDR" "$1" "p50" "stddev" "p99"
+ printf "%s\n" "$SEP"
+}
+
+function rval()
+{
+ echo "$1" | sed -nE "s/.*$2=([^ ]+).*/\1/p"
+}
+
+function run_scenario()
+{
+ local sc="$1"
+ shift
+ local output rline
+
+ output=$($RUN --scenario "$sc" "$@" 2>&1) || true
+ rline=$(echo "$output" | grep '^RESULT ' || true)
+
+ if [ -z "$rline" ]; then
+ printf "$ROW" "$sc" "ERR" "-" "-"
+ return
+ fi
+
+ printf "$ROW" "$sc" \
+ "$(rval "$rline" median)" \
+ "$(rval "$rline" stddev)" \
+ "$(rval "$rline" p99)"
+}
+
+header "XDP load-balancer benchmark"
+
+group_header "Single-flow baseline"
+for sc in tcp-v4-lru-hit tcp-v4-ch \
+ tcp-v6-lru-hit tcp-v6-ch \
+ udp-v4-lru-hit udp-v6-lru-hit \
+ tcp-v4v6-lru-hit; do
+ run_scenario "$sc"
+done
+
+group_header "Diverse flows (4K src addrs)"
+for sc in tcp-v4-lru-diverse tcp-v4-ch-diverse \
+ tcp-v6-lru-diverse tcp-v6-ch-diverse \
+ udp-v4-lru-diverse; do
+ run_scenario "$sc"
+done
+
+group_header "TCP flags"
+run_scenario tcp-v4-syn
+run_scenario tcp-v4-rst-miss
+
+group_header "LRU stress"
+run_scenario tcp-v4-lru-miss
+run_scenario udp-v4-lru-miss
+run_scenario tcp-v4-lru-warmup
+
+group_header "Early exits"
+for sc in pass-v4-no-vip pass-v6-no-vip pass-v4-icmp pass-non-ip drop-v4-frag drop-v4-options \
+ drop-v6-frag; do
+ run_scenario "$sc"
+done
+printf "%s\n" "$SEP"
--
2.52.0
^ permalink raw reply related [flat|nested] 24+ messages in thread
* Re: [PATCH bpf-next 1/7] selftests/bpf: Add bench_force_done() for early benchmark completion
2026-04-27 23:22 ` [PATCH bpf-next 1/7] selftests/bpf: Add bench_force_done() for early benchmark completion Puranjay Mohan
@ 2026-04-27 23:39 ` sashiko-bot
2026-04-28 0:05 ` bot+bpf-ci
1 sibling, 0 replies; 24+ messages in thread
From: sashiko-bot @ 2026-04-27 23:39 UTC (permalink / raw)
To: Puranjay Mohan; +Cc: bpf
Sashiko AI review found 3 potential issue(s):
- [High] Calling non-async-signal-safe POSIX functions in a signal handler creates a self-deadlock risk.
- [Medium] Missing state predicate for condition variable causes lost wakeups, defeating the patch's purpose.
- [Medium] Early benchmark completion passes negative sample counts to reporting functions, yielding NaN outputs.
--
commit 7ae4c67dc4880455d3c6c1f70bd9e8d56eb4e69b
Author: Puranjay Mohan <puranjay@kernel.org>
selftests/bpf: Add bench_force_done() for early benchmark completion
The bench framework waits for duration_sec to elapse before collecting
results. Benchmarks that know exactly how many samples they need can
call bench_force_done() to signal completion early, avoiding wasted
wall-clock time.
> diff --git a/tools/testing/selftests/bpf/bench.c b/tools/testing/selftests/bpf/bench.c
> --- a/tools/testing/selftests/bpf/bench.c
> +++ b/tools/testing/selftests/bpf/bench.c
> @@ -741,6 +741,13 @@ static void setup_benchmark(void)
> static pthread_mutex_t bench_done_mtx = PTHREAD_MUTEX_INITIALIZER;
> static pthread_cond_t bench_done = PTHREAD_COND_INITIALIZER;
>
> +void bench_force_done(void)
> +{
> + pthread_mutex_lock(&bench_done_mtx);
> + pthread_cond_signal(&bench_done);
> + pthread_mutex_unlock(&bench_done_mtx);
> +}
> +
Since bench_force_done() is now exported for concurrent benchmark worker
threads to call directly, could this lead to lost wakeups?
Condition variables are stateless. If a fast-running benchmark finishes and
signals the condition variable before main() reaches its wait statement:
tools/testing/selftests/bpf/bench.c:main() {
...
pthread_cond_wait(&bench_done, &bench_done_mtx);
...
}
Will the signal be permanently lost, causing main() to sleep for the full
duration and defeating the goal of the patch?
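If so, one conventional fix (untested sketch; names reuse the ones in
bench.c) is a shared predicate guarded by the same mutex, so a signal
sent before main() starts waiting is never lost:
static bool bench_done_flag; /* protected by bench_done_mtx */
void bench_force_done(void)
{
pthread_mutex_lock(&bench_done_mtx);
bench_done_flag = true;
pthread_cond_signal(&bench_done);
pthread_mutex_unlock(&bench_done_mtx);
}
/* main(): */
pthread_mutex_lock(&bench_done_mtx);
while (!bench_done_flag)
pthread_cond_wait(&bench_done, &bench_done_mtx);
pthread_mutex_unlock(&bench_done_mtx);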
Also, if a benchmark legitimately completes its target operations and calls
bench_force_done() before the warmup period is fully satisfied:
tools/testing/selftests/bpf/bench.c:main() {
...
if (bench->report_final)
bench->report_final(state.results + env.warmup_sec,
state.res_cnt - env.warmup_sec);
}
Could state.res_cnt - env.warmup_sec evaluate to a negative integer?
Passing a negative sample count to reporting functions might cause
reporter loops to fail bounds checks and yield NaN outputs.
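If that can happen, clamping the count before reporting (illustrative
sketch only) would contain the damage:
int res_cnt = state.res_cnt - env.warmup_sec;
if (res_cnt < 0)
res_cnt = 0;
if (bench->report_final)
bench->report_final(state.results + env.warmup_sec, res_cnt);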
> static void collect_measurements(long delta_ns) {
> int iter = state.res_cnt++;
> struct bench_res *res = &state.results[iter];
> @@ -750,11 +757,8 @@ static void collect_measurements(long delta_ns) {
> if (bench->report_progress)
> bench->report_progress(iter, res, delta_ns);
>
> - if (iter == env.duration_sec + env.warmup_sec) {
> - pthread_mutex_lock(&bench_done_mtx);
> - pthread_cond_signal(&bench_done);
> - pthread_mutex_unlock(&bench_done_mtx);
> - }
> + if (iter == env.duration_sec + env.warmup_sec)
> + bench_force_done();
> }
collect_measurements() executes in the context of a timer signal handler.
POSIX specifies that pthread_mutex_lock() and pthread_cond_signal() are
not async-signal-safe.
If the timer signal interrupts a thread exactly while it already holds
bench_done_mtx, will the signal handler attempt to acquire the same
non-recursive mutex and cause a self-deadlock?
--
Sashiko AI review · https://sashiko.dev/#/patchset/20260427232313.1582588-1-puranjay@kernel.org?part=1
^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: [PATCH bpf-next 1/7] selftests/bpf: Add bench_force_done() for early benchmark completion
2026-04-27 23:22 ` [PATCH bpf-next 1/7] selftests/bpf: Add bench_force_done() for early benchmark completion Puranjay Mohan
2026-04-27 23:39 ` sashiko-bot
@ 2026-04-28 0:05 ` bot+bpf-ci
2026-04-28 9:15 ` Puranjay Mohan
1 sibling, 1 reply; 24+ messages in thread
From: bot+bpf-ci @ 2026-04-28 0:05 UTC (permalink / raw)
To: puranjay, bpf
Cc: puranjay, puranjay12, ast, andrii, daniel, martin.lau, eddyz87,
memxor, mykyta.yatsenko5, feichen, taragrawal, ndixit, tehnerd,
kernel-team, ast, andrii, daniel, martin.lau, eddyz87,
yonghong.song, clm, ihor.solodrai
> diff --git a/tools/testing/selftests/bpf/bench.c b/tools/testing/selftests/bpf/bench.c
> index 029b3e21f438..47a4e72208d6 100644
> --- a/tools/testing/selftests/bpf/bench.c
> +++ b/tools/testing/selftests/bpf/bench.c
> @@ -741,6 +741,13 @@ static void setup_benchmark(void)
> static pthread_mutex_t bench_done_mtx = PTHREAD_MUTEX_INITIALIZER;
> static pthread_cond_t bench_done = PTHREAD_COND_INITIALIZER;
>
> +void bench_force_done(void)
> +{
> + pthread_mutex_lock(&bench_done_mtx);
> + pthread_cond_signal(&bench_done);
> + pthread_mutex_unlock(&bench_done_mtx);
> +}
> +
Is it safe to use pthread_mutex_lock() and pthread_cond_signal() here
given that collect_measurements() calls bench_force_done() from a
SIGALRM signal handler (sigalarm_handler)?
If the signal interrupts a thread that is already holding bench_done_mtx
(for example, a worker thread executing bench_force_done()), the signal
handler will attempt to lock the same mutex:
Thread context:
bench_force_done()
pthread_mutex_lock(&bench_done_mtx); // lock acquired
[SIGALRM arrives]
Signal handler context:
sigalarm_handler()
collect_measurements()
bench_force_done()
pthread_mutex_lock(&bench_done_mtx); // deadlock
This causes a self-deadlock because pthread_mutex_lock() and
pthread_cond_signal() are not async-signal-safe according to POSIX.
Could an async-signal-safe mechanism like an atomic flag with sem_post()
be used instead?
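For example (untested sketch), a semaphore removes both the mutex from
signal context and the lost-wakeup window, since sem_post() is on the
POSIX async-signal-safe list and posts are counted:
#include <semaphore.h>
static sem_t bench_done_sem; /* sem_init(&bench_done_sem, 0, 0) in setup */
void bench_force_done(void)
{
sem_post(&bench_done_sem); /* async-signal-safe per POSIX */
}
/* main() blocks in sem_wait(&bench_done_sem) instead of
* pthread_cond_wait(); a post issued before the wait is counted,
* so it cannot be lost, and no lock is taken in signal context.
*/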
This concern was raised in the v1 and v2 reviews by Sashiko AI on lore:
https://lore.kernel.org/bpf/20260420124120.13970C19425@smtp.kernel.org/
Could this code lose wakeups because the condition variable is signaled
without a shared state predicate?
If a worker thread completes quickly and calls bench_force_done() before
main() reaches pthread_cond_wait(), the signal will be lost because POSIX
condition variables do not retain state:
Worker thread:
bench_force_done()
pthread_cond_signal(&bench_done); // signal sent
Main thread (not waiting yet):
[still in setup]
pthread_cond_wait(&bench_done, &bench_done_mtx); // misses signal
This would cause main() to block for the full timer duration, defeating
the purpose of bench_force_done(). Should a boolean flag be added to
track the done state?
This concern was raised in the v1 and v2 reviews by Sashiko AI on lore:
https://lore.kernel.org/bpf/20260420124120.13970C19425@smtp.kernel.org/
> static void collect_measurements(long delta_ns) {
> int iter = state.res_cnt++;
> struct bench_res *res = &state.results[iter];
> @@ -750,11 +757,8 @@ static void collect_measurements(long delta_ns) {
> if (bench->report_progress)
> bench->report_progress(iter, res, delta_ns);
>
> - if (iter == env.duration_sec + env.warmup_sec) {
> - pthread_mutex_lock(&bench_done_mtx);
> - pthread_cond_signal(&bench_done);
> - pthread_mutex_unlock(&bench_done_mtx);
> - }
> + if (iter == env.duration_sec + env.warmup_sec)
> + bench_force_done();
> }
[ ... ]
What happens if bench_force_done() is called before the warmup period
fully elapses?
In main(), the final sample count passed to bench->report_final() is
calculated as:
state.res_cnt - env.warmup_sec
If state.res_cnt is strictly less than env.warmup_sec, the resulting
negative count will bypass the iteration loops in the report_final
callbacks (like ops_report_final()), leaving aggregate variables at zero.
Would subsequent calculations divide by zero and corrupt the benchmark
summary?
This concern was raised in the v1 and v2 reviews by Sashiko AI on lore:
https://lore.kernel.org/bpf/20260420124120.13970C19425@smtp.kernel.org/
---
AI reviewed your patch. Please fix the bug or email reply why it's not a bug.
See: https://github.com/kernel-patches/vmtest/blob/master/ci/claude/README.md
CI run summary: https://github.com/kernel-patches/bpf/actions/runs/25025533577
^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: [PATCH bpf-next 4/7] selftests/bpf: Add XDP load-balancer common definitions
2026-04-27 23:23 ` [PATCH bpf-next 4/7] selftests/bpf: Add XDP load-balancer common definitions Puranjay Mohan
@ 2026-04-28 0:05 ` bot+bpf-ci
2026-04-28 0:38 ` sashiko-bot
1 sibling, 0 replies; 24+ messages in thread
From: bot+bpf-ci @ 2026-04-28 0:05 UTC (permalink / raw)
To: puranjay, bpf
Cc: puranjay, puranjay12, ast, andrii, daniel, martin.lau, eddyz87,
memxor, mykyta.yatsenko5, feichen, taragrawal, ndixit, tehnerd,
kernel-team, ast, andrii, daniel, martin.lau, eddyz87,
yonghong.song, clm, ihor.solodrai
> diff --git a/tools/testing/selftests/bpf/xdp_lb_bench_common.h b/tools/testing/selftests/bpf/xdp_lb_bench_common.h
> new file mode 100644
> index 000000000000..aed20a963701
> --- /dev/null
> +++ b/tools/testing/selftests/bpf/xdp_lb_bench_common.h
> @@ -0,0 +1,112 @@
> +/* SPDX-License-Identifier: GPL-2.0 */
> +/* Copyright (c) 2026 Meta Platforms, Inc. and affiliates. */
> +
> +#ifndef XDP_LB_BENCH_COMMON_H
> +#define XDP_LB_BENCH_COMMON_H
[ ... ]
> +#ifdef __BPF__
> +#define lb_htons(x) bpf_htons(x)
> +#define LB_INLINE static __always_inline
> +#else
> +#define lb_htons(x) htons(x)
> +#define LB_INLINE static inline
> +#endif
> +
> +LB_INLINE __be32 create_encap_ipv4_src(__u16 port, __be32 src)
> +{
> + __u32 ip_suffix = lb_htons(port);
> +
> + ip_suffix <<= 16;
> + ip_suffix ^= src;
> + return (0xFFFF0000 & ip_suffix) | IPIP_V4_PREFIX;
> +}
Does this mix host-endian arithmetic with network-order types?
Shifting and masking __be32 values using host operations modifies
different IP octets depending on the architecture.
Also, on big-endian systems, will the host-order constant IPIP_V4_PREFIX
(4268) map to 0.0.16.172 instead of 172.16.0.0?
This concern was raised in review of v1:
https://lore.kernel.org/bpf/20260420132625.95C0AC2BCB9@smtp.kernel.org/
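If the prefix really is meant to be 172.16.0.0/16, one endianness-safe
shape (untested sketch, written with the userspace htonl()/ntohl()
spelling; the BPF build would need the bpf_* equivalents) is to do all
arithmetic in host order and convert exactly once:
LB_INLINE __be32 create_encap_ipv4_src(__u16 port, __be32 src)
{
/* 172.16.0.0 in host order, low 16 bits mixed from port and src */
__u32 host = (172u << 24) | (16u << 16) |
(((__u32)port ^ ntohl(src)) & 0xFFFF);
return htonl(host); /* single explicit host-to-network conversion */
}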
> +
> +LB_INLINE void create_encap_ipv6_src(__u16 port, __be32 src, __be32 *saddr)
> +{
> + saddr[0] = IPIP_V6_PREFIX1;
> + saddr[1] = IPIP_V6_PREFIX2;
> + saddr[2] = IPIP_V6_PREFIX3;
> + saddr[3] = src ^ port;
> +}
Are we mixing endianness here by applying a host-endian XOR operation
between a __be32 source and a __u16 port?
Looking at the parallel IPv4 version above, it converts port with
lb_htons(port) before the XOR operation. Shouldn't this be:
saddr[3] = src ^ lb_htons(port);
to match the IPv4 version's byte order handling?
On little-endian architectures, when port=0x1234, the XOR will use
0x3412 (byte-swapped during 32-bit promotion) instead of the intended
value. This will cause packet validation failures and incorrect routing
behavior when these functions are eventually used.
This concern was raised in review of v1:
https://lore.kernel.org/bpf/20260420132625.95C0AC2BCB9@smtp.kernel.org/
> +
> +struct flow_key {
> + union {
> + __be32 src;
> + __be32 srcv6[4];
> + };
> + union {
> + __be32 dst;
> + __be32 dstv6[4];
> + };
> + union {
> + __u32 ports;
> + __u16 port16[2];
> + };
> + __u8 proto;
> + __u8 pad[3];
> +};
> +
> +struct vip_definition {
> + union {
> + __be32 vip;
> + __be32 vipv6[4];
> + };
> + __u16 port;
> + __u8 proto;
> + __u8 pad;
> +};
[ ... ]
> +struct real_definition {
> + __be32 dst;
> + __be32 dstv6[4];
> + __u8 flags;
> +};
Should dst and dstv6 be wrapped in an anonymous union?
The other dual-stack IP structures in this header (flow_key and
vip_definition) use a union to save memory in BPF maps and maintain
structural layout.
Also, does this structure need explicit trailing padding? There are
currently 3 bytes of implicit trailing padding which might cause UAPI
ABI mismatch issues between 32-bit and 64-bit architectures.
These concerns were raised in review of v1:
https://lore.kernel.org/bpf/20260420132625.95C0AC2BCB9@smtp.kernel.org/
[ ... ]
---
AI reviewed your patch. Please fix the bug or email reply why it's not a bug.
See: https://github.com/kernel-patches/vmtest/blob/master/ci/claude/README.md
CI run summary: https://github.com/kernel-patches/bpf/actions/runs/25025533577
^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: [PATCH bpf-next 6/7] selftests/bpf: Add XDP load-balancer benchmark driver
2026-04-27 23:23 ` [PATCH bpf-next 6/7] selftests/bpf: Add XDP load-balancer benchmark driver Puranjay Mohan
@ 2026-04-28 0:05 ` bot+bpf-ci
2026-04-28 1:29 ` sashiko-bot
1 sibling, 0 replies; 24+ messages in thread
From: bot+bpf-ci @ 2026-04-28 0:05 UTC (permalink / raw)
To: puranjay, bpf
Cc: puranjay, puranjay12, ast, andrii, daniel, martin.lau, eddyz87,
memxor, mykyta.yatsenko5, feichen, taragrawal, ndixit, tehnerd,
kernel-team, ast, andrii, daniel, martin.lau, eddyz87,
yonghong.song, clm, ihor.solodrai
> diff --git a/tools/testing/selftests/bpf/benchs/bench_xdp_lb.c b/tools/testing/selftests/bpf/benchs/bench_xdp_lb.c
> new file mode 100644
> index 000000000000..0b6709a2b03c
> --- /dev/null
> +++ b/tools/testing/selftests/bpf/benchs/bench_xdp_lb.c
[ ... ]
> +static void build_packet(int idx)
> +{
> + const struct test_scenario *sc = &scenarios[idx];
> + __u8 *p = pkt_buf[idx];
> + struct ethhdr eth = {};
> + __u16 proto;
> + __u32 off = 0;
[ ... ]
> + } else {
> + struct iphdr iph = {};
> + __u32 ip_off = off;
> +
> + iph.version = 4;
> + iph.ihl = sc->set_ip_options ? 6 : 5;
> + iph.ttl = 64;
> + iph.protocol = sc->ip_proto;
> + iph.saddr = htonl(sc->src_addr);
> + iph.daddr = htonl(sc->vip_addr);
> + iph.frag_off = sc->set_frag ? htons(IP_MF) : 0;
> + off += sizeof(iph);
> +
> + if (sc->set_ip_options) {
> + /* NOP option padding (4 bytes = 1 word) */
> + __u32 nop = htonl(0x01010101);
> +
> + memcpy(p + off, &nop, sizeof(nop));
> + off += sizeof(nop);
> + }
> +
> + build_l4(sc, p, &off);
> +
> + memcpy(p + off, "bench___payload!", 16);
> + off += 16;
> +
> + iph.tot_len = htons(off - ip_off);
> + iph.check = ip_checksum(&iph, sizeof(iph));
> + memcpy(p + ip_off, &iph, sizeof(iph));
Does the IP header checksum calculation exclude the IP options when
set_ip_options is true? Since ip_checksum(&iph, sizeof(iph)) only
computes over the 20-byte stack structure iph, and the 4 bytes of NOP
options are added separately to the packet buffer after the base header,
those option bytes are not covered by the checksum.
Could this cause the BPF program to see an invalid checksum and drop the
packet prematurely instead of processing the intended options logic?
This concern was raised by sashiko-bot@kernel.org in the RFC v1 thread
https://lore.kernel.org/bpf/20260420171123.CC789C19425@smtp.kernel.org/
and does not appear to have been addressed.
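One way to cover the options (untested sketch, with the stack-side
iph.check assignment dropped) is to copy the base header into the
packet buffer first — the NOP word is already in place after it — and
checksum in place over the full ihl*4 bytes:
struct iphdr *h = (struct iphdr *)(p + ip_off);
memcpy(h, &iph, sizeof(iph)); /* iph.check still 0 here */
h->check = ip_checksum(h, h->ihl * 4); /* covers options: 24 bytes
* when ihl == 6 */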
> + }
> +
> + pkt_len[idx] = off;
> +}
[ ... ]
> +static void populate_lru(const struct test_scenario *sc, __u32 real_idx)
> +{
> + struct real_pos_lru lru = { .pos = real_idx };
> + struct flow_key fk;
> + int i, err;
When atime is not explicitly initialized, does this leave it at zero?
If the machine running the selftest has been up for more than 30 seconds
(LRU_UDP_TIMEOUT), would lookups against this pre-populated entry see
an age of bpf_ktime_get_ns() - 0 and immediately evict it?
This could lead to unintended LRU misses in UDP scenarios that expect
LRU hits (like S_UDP_V4_LRU_HIT, S_UDP_V6_LRU_HIT) and cause validation
failures.
This concern was raised by sashiko-bot@kernel.org in the RFC v1 thread
https://lore.kernel.org/bpf/20260420171123.CC789C19425@smtp.kernel.org/
and does not appear to have been addressed.
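If that is the failure mode, seeding atime at insert time (untested
sketch; assumes real_pos_lru carries an atime field, as implied above)
would sidestep it:
struct timespec ts;
struct real_pos_lru lru = { .pos = real_idx };
/* bpf_ktime_get_ns() reads CLOCK_MONOTONIC, so seed atime from
* the same clock to make the entry look recently used
*/
clock_gettime(CLOCK_MONOTONIC, &ts);
lru.atime = (__u64)ts.tv_sec * 1000000000ULL + ts.tv_nsec;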
> +
> + build_flow_key(&fk, sc);
> +
> + /* Insert into every per-CPU inner LRU so the entry is found
> + * regardless of which CPU runs the BPF program.
> + */
> + for (i = 0; i < nr_inner_maps; i++) {
> + err = bpf_map_update_elem(lru_inner_fds[i], &fk, &lru, BPF_ANY);
> + if (err) {
> + fprintf(stderr, "lru_inner[%d] [%s]: %s\n", i, sc->name,
> + strerror(errno));
> + exit(1);
> + }
> + }
> +}
[ ... ]
> +static void build_expected_packet(int idx)
> +{
> + const struct test_scenario *sc = &scenarios[idx];
> + __u8 *p = expected_buf[idx];
> + struct ethhdr eth = {};
> + const __u8 *in = pkt_buf[idx];
> + __u32 in_len = pkt_len[idx];
> + __u32 off = 0;
> + __u32 inner_len = in_len - sizeof(struct ethhdr);
[ ... ]
> + if (sc->encap_v6_outer) {
> + struct ipv6hdr ip6h = {};
> + __u8 nexthdr = sc->is_v6 ? IPPROTO_IPV6 : IPPROTO_IPIP;
> +
> + ip6h.version = 6;
> + ip6h.nexthdr = nexthdr;
> + ip6h.payload_len = htons(inner_len);
> + ip6h.hop_limit = 64;
> +
> + create_encap_ipv6_src(htons(sc->src_port),
> + sc->is_v6 ? htonl(sc->src_addr_v6[0])
> + : htonl(sc->src_addr),
> + (__be32 *)&ip6h.saddr);
> + htonl_v6((__be32 *)&ip6h.daddr, sc->tunnel_dst_v6);
> +
> + memcpy(p + off, &ip6h, sizeof(ip6h));
> + off += sizeof(ip6h);
> + } else {
> + struct iphdr iph = {};
> +
> + iph.version = 4;
> + iph.ihl = sizeof(iph) >> 2;
> + iph.protocol = IPPROTO_IPIP;
> + iph.tot_len = htons(inner_len + sizeof(iph));
> + iph.ttl = 64;
> + iph.saddr = create_encap_ipv4_src(htons(sc->src_port),
> + htonl(sc->src_addr));
> + iph.daddr = htonl(sc->tunnel_dst);
> + iph.check = ip_checksum(&iph, sizeof(iph));
> +
> + memcpy(p + off, &iph, sizeof(iph));
> + off += sizeof(iph);
> + }
Are the port values being double byte-swapped here? The code passes
htons(sc->src_port) into create_encap_ipv6_src() and create_encap_ipv4_src(),
but looking at xdp_lb_bench_common.h, those functions internally call
lb_htons(port) again. On little-endian architectures, would this cause
the port to be swapped twice, ending up in host order instead of
network order?
Additionally, the encapsulation helpers use bitwise shifts and XORs on
host integers (like IPIP_V4_PREFIX) directly into network-order types.
On big-endian architectures, could this place the prefix in the wrong
bytes, altering the IP address semantics and causing validation failures?
This concern was raised by sashiko-bot@kernel.org in the RFC v1 thread
https://lore.kernel.org/bpf/20260420171123.CC789C19425@smtp.kernel.org/
and does not appear to have been addressed.
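If the helpers are meant to do the conversion themselves via
lb_htons(), the callers would pass host-order ports (illustrative
sketch):
/* helpers apply lb_htons() internally, so pass sc->src_port as-is */
iph.saddr = create_encap_ipv4_src(sc->src_port, htonl(sc->src_addr));
create_encap_ipv6_src(sc->src_port, htonl(sc->src_addr),
(__be32 *)&ip6h.saddr);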
> +
> + memcpy(p + off, in + sizeof(struct ethhdr), inner_len);
> + off += inner_len;
> +
> + expected_len[idx] = off;
> +}
[ ... ]
---
AI reviewed your patch. Please fix the bug or email reply why it's not a bug.
See: https://github.com/kernel-patches/vmtest/blob/master/ci/claude/README.md
CI run summary: https://github.com/kernel-patches/bpf/actions/runs/25025533577
^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: [PATCH bpf-next 2/7] selftests/bpf: Add BPF batch-timing library
2026-04-27 23:22 ` [PATCH bpf-next 2/7] selftests/bpf: Add BPF batch-timing library Puranjay Mohan
@ 2026-04-28 0:12 ` sashiko-bot
2026-04-28 0:18 ` bot+bpf-ci
1 sibling, 0 replies; 24+ messages in thread
From: sashiko-bot @ 2026-04-28 0:12 UTC (permalink / raw)
To: Puranjay Mohan; +Cc: bpf
Sashiko AI review found 4 potential issue(s):
- [Medium] Integer division truncation in `bpf_bench_calibrate()` breaks auto-calibration for very fast BPF programs.
- [Medium] Zero warmup time (`-w 0`) causes the benchmark to silently collect no samples.
- [Medium] BPF `can_loop` (`bpf_may_goto`) can prematurely terminate the timed loop, silently invalidating the userspace `ns/op` calculation.
- [Low] The BPF header includes a non-existent file `<bpf_may_goto.h>`.
--
commit ecbc1ddd2ab895fa706970f55f150bb87ba2790e
Author: Puranjay Mohan <puranjay@kernel.org>
selftests/bpf: Add BPF batch-timing library
This commit adds a reusable timing library for BPF benchmarks that need to
measure BPF program execution time. It provides a BPF macro to record elapsed
time and a userspace library to auto-calibrate batch iterations and compute
statistics.
> diff --git a/tools/testing/selftests/bpf/benchs/bench_bpf_timing.c b/tools/testing/selftests/bpf/benchs/bench_bpf_timing.c
> new file mode 100644
> index 000000000000..75a39da69655
> --- /dev/null
> +++ b/tools/testing/selftests/bpf/benchs/bench_bpf_timing.c
[ ... ]
> +static int collect_samples(struct bpf_bench_timing *t,
> + double *out, int max_out)
> +{
> + unsigned int nr_cpus = bpf_num_possible_cpus();
> + __u32 timed_iters = t->batch_iters;
> + int total = 0;
[ ... ]
> + for (__u32 i = 0; i < count && total < max_out; i++) {
> + __u64 sample = t->samples[cpu][i];
> +
> + if (sample == 0)
> + continue;
> + out[total++] = (double)sample / timed_iters;
If the loop in BENCH_BPF_LOOP is terminated early by the can_loop condition
(e.g., due to instruction limits or yielding), the actual number of completed
iterations will be less than timed_iters.
Since the BPF program only records the elapsed time, will dividing by
timed_iters produce artificially low ns/op calculations?
Should the BPF program explicitly record and export the actual number of
completed iterations for this calculation?
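For instance (sketch only; do_timed_body() and the sample_iters array
are hypothetical stand-ins), the BPF side could record the real count
next to the elapsed time and userspace could divide by that:
/* BPF side: count loop bodies that actually ran */
__u64 done = 0, start = bpf_ktime_get_ns();
for (i = 0; i < batch_iters && can_loop; i++) {
do_timed_body();
done++;
}
samples[cpu][slot] = bpf_ktime_get_ns() - start;
sample_iters[cpu][slot] = done;
/* userspace, collect_samples(): */
__u64 iters = t->sample_iters[cpu][i];
if (sample && iters)
out[total++] = (double)sample / iters;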
> + }
> + }
> +
> + qsort(out, total, sizeof(double), cmp_double);
> + return total;
> +}
[ ... ]
> +void bpf_bench_timing_measure(struct bpf_bench_timing *t, struct bench_res *res)
> +{
> + unsigned int nr_cpus;
> + __u32 total_samples;
> + int i;
> +
> + t->warmup_ticks++;
> +
> + if (t->warmup_ticks < env.warmup_sec)
> + return;
> +
> + if (t->warmup_ticks == env.warmup_sec) {
> + *t->timing_enabled = 1;
> + return;
> + }
If the benchmark is run with zero warmup (env.warmup_sec == 0),
t->warmup_ticks becomes 1 on the first tick.
Does this mean t->warmup_ticks == env.warmup_sec will never be true, bypassing
the assignment and preventing timing_enabled from ever being set to 1?
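If so, a small guard (untested sketch) would arm timing on the first
tick when -w 0 is given; samples would then start on the next tick:
t->warmup_ticks++;
if (t->warmup_ticks < env.warmup_sec)
return;
if (t->warmup_ticks == env.warmup_sec || env.warmup_sec == 0) {
if (!*t->timing_enabled) {
*t->timing_enabled = 1;
return;
}
}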
[ ... ]
> +void bpf_bench_calibrate(struct bpf_bench_timing *t, bpf_bench_run_fn run_fn, void *run_ctx)
> +{
> + __u64 elapsed, per_op_ns;
> + __u64 time_n, time_2n;
> + double ratio;
> +
> + elapsed = measure_elapsed(t, run_fn, run_ctx, CALIBRATE_SEED_BATCH, CALIBRATE_RUNS);
> + if (elapsed == 0) {
> + fprintf(stderr, "calibration: no timing samples, using default\n");
> + t->batch_iters = 10000;
> + *t->batch_iters_bss = t->batch_iters;
> + reset_timing(t);
> + return;
> + }
> +
> + per_op_ns = elapsed / CALIBRATE_SEED_BATCH;
If the BPF operation executes in less than 1 ns (e.g., an empty loop or
bpf-nop), elapsed could be less than 100 since CALIBRATE_SEED_BATCH is 100.
Will the integer division truncate per_op_ns to 0 here, causing
compute_batch_iters() to incorrectly fall back to CALIBRATE_MIN_BATCH instead
of scaling up the iterations?
> + t->batch_iters = compute_batch_iters(per_op_ns);
[ ... ]
> diff --git a/tools/testing/selftests/bpf/progs/bench_bpf_timing.bpf.h b/tools/testing/selftests/bpf/progs/bench_bpf_timing.bpf.h
> new file mode 100644
> index 000000000000..6a1ad75f1fd7
> --- /dev/null
> +++ b/tools/testing/selftests/bpf/progs/bench_bpf_timing.bpf.h
> @@ -0,0 +1,69 @@
> +/* SPDX-License-Identifier: GPL-2.0 */
> +/* Copyright (c) 2026 Meta Platforms, Inc. and affiliates. */
> +
> +#ifndef __BENCH_BPF_TIMING_BPF_H__
> +#define __BENCH_BPF_TIMING_BPF_H__
> +
> +#include <stdbool.h>
> +#include <linux/bpf.h>
> +#include <bpf/bpf_helpers.h>
> +#include <bpf_may_goto.h>
Does bpf_may_goto.h exist in the kernel source tree?
I noticed that can_loop is typically defined in bpf_experimental.h. Will
including this cause a No such file or directory build error?
> +
> +#ifndef BENCH_NR_SAMPLES
--
Sashiko AI review · https://sashiko.dev/#/patchset/20260427232313.1582588-1-puranjay@kernel.org?part=2
* Re: [PATCH bpf-next 5/7] selftests/bpf: Add XDP load-balancer BPF program
2026-04-27 23:23 ` [PATCH bpf-next 5/7] selftests/bpf: Add XDP load-balancer BPF program Puranjay Mohan
@ 2026-04-28 0:18 ` bot+bpf-ci
2026-04-28 1:05 ` sashiko-bot
1 sibling, 0 replies; 24+ messages in thread
From: bot+bpf-ci @ 2026-04-28 0:18 UTC (permalink / raw)
To: puranjay, bpf
Cc: puranjay, puranjay12, ast, andrii, daniel, martin.lau, eddyz87,
memxor, mykyta.yatsenko5, feichen, taragrawal, ndixit, tehnerd,
kernel-team, ast, andrii, daniel, martin.lau, eddyz87,
yonghong.song, clm, ihor.solodrai
> diff --git a/tools/testing/selftests/bpf/progs/xdp_lb_bench.c b/tools/testing/selftests/bpf/progs/xdp_lb_bench.c
> new file mode 100644
> index 000000000000..b9fd848c035d
> --- /dev/null
> +++ b/tools/testing/selftests/bpf/progs/xdp_lb_bench.c
> @@ -0,0 +1,647 @@
[ ... ]
> +SEC("xdp")
> +int xdp_lb_bench(struct xdp_md *xdp)
> +{
> + void *data = (void *)(long)xdp->data;
> + void *data_end = (void *)(long)xdp->data_end;
> + struct ethhdr *eth = data;
> + struct ethhdr saved_eth;
> + __u32 rand_state = 0;
> + __u32 batch_hash = 0;
> + int saddr_off = 0;
> + bool is_v6;
> +
> + if (eth + 1 > data_end)
> + return XDP_DROP;
> +
> + __builtin_memcpy(&saved_eth, eth, sizeof(saved_eth));
> +
> + is_v6 = (saved_eth.h_proto == bpf_htons(ETH_P_IPV6));
> +
> + saddr_off = sizeof(struct ethhdr) + (is_v6 ? offsetof(struct ipv6hdr, saddr) :
> + offsetof(struct iphdr, saddr));
> +
> + if (flow_mask)
> + rand_state = bpf_get_prandom_u32() | 1;
> +
> + if (cold_lru) {
> + __u32 *saddr = data + saddr_off;
> +
> + batch_gen++;
> + batch_hash = (batch_gen ^ bpf_get_smp_processor_id()) * KNUTH_HASH_MULT;
> + if ((void *)(saddr + 1) <= data_end)
> + *saddr ^= batch_hash;
> + }
> +
> + return BENCH_BPF_LOOP(
> + process_packet(xdp),
> + ({
> + if (__bench_result == XDP_TX) {
> + if (strip_encap(xdp, &saved_eth))
> + return XDP_DROP;
> + if (rand_state)
> + randomize_src(xdp, saddr_off, &rand_state);
> + }
> + if (cold_lru) {
> + void *d = (void *)(long)xdp->data;
> + void *de = (void *)(long)xdp->data_end;
> + __u32 *__sa = d + saddr_off;
> +
> + if ((void *)(__sa + 1) <= de)
> + *__sa ^= batch_hash;
^^^^
Does the cold_lru simulation actually generate the intended cache misses?
Inside BENCH_BPF_LOOP, the source IP address is XORed with the constant
batch_hash value on every iteration. Since batch_hash does not change
within the loop, this XOR operation will toggle the IP address between
only two unique values.
If the LRU cache is sized to hold both IP addresses, after the first two
iterations the system will continuously hit cached entries rather than
experiencing the cache misses that the cold_lru simulation is intended to
produce.
Would using a different random or incrementing value for each iteration
ensure a continuous stream of unique addresses?
This concern was raised by sashiko-bot in the RFC v1 patch series but
remains unaddressed in v2.
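If per-iteration uniqueness were wanted here, one possibility is advancing
batch_hash inside the reset block so the cumulative XOR walks through fresh
addresses instead of toggling (a sketch of this question's suggestion,
reusing the __sa/de locals from the reset block quoted above):

	/* Sketch: advance the hash before each XOR so successive XORs
	 * never return to the previous address. */
	if (cold_lru) {
		batch_hash = batch_hash * KNUTH_HASH_MULT + 1;
		if ((void *)(__sa + 1) <= de)
			*__sa ^= batch_hash;
	}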
> + }
> + })
> + );
> +}
> +
> +char _license[] SEC("license") = "GPL";
---
AI reviewed your patch. Please fix the bug or email reply why it's not a bug.
See: https://github.com/kernel-patches/vmtest/blob/master/ci/claude/README.md
CI run summary: https://github.com/kernel-patches/bpf/actions/runs/25025533577
* Re: [PATCH bpf-next 2/7] selftests/bpf: Add BPF batch-timing library
2026-04-27 23:22 ` [PATCH bpf-next 2/7] selftests/bpf: Add BPF batch-timing library Puranjay Mohan
2026-04-28 0:12 ` sashiko-bot
@ 2026-04-28 0:18 ` bot+bpf-ci
2026-04-28 9:23 ` Puranjay Mohan
1 sibling, 1 reply; 24+ messages in thread
From: bot+bpf-ci @ 2026-04-28 0:18 UTC (permalink / raw)
To: puranjay, bpf
Cc: puranjay, puranjay12, ast, andrii, daniel, martin.lau, eddyz87,
memxor, mykyta.yatsenko5, feichen, taragrawal, ndixit, tehnerd,
kernel-team, ast, andrii, daniel, martin.lau, eddyz87,
yonghong.song, clm, ihor.solodrai
> diff --git a/tools/testing/selftests/bpf/benchs/bench_bpf_timing.c b/tools/testing/selftests/bpf/benchs/bench_bpf_timing.c
> new file mode 100644
> index 000000000000..75a39da69655
> --- /dev/null
> +++ b/tools/testing/selftests/bpf/benchs/bench_bpf_timing.c
[ ... ]
> +static __u32 compute_batch_iters(__u64 per_op_ns)
> +{
> + __u64 target_ns = (__u64)CALIBRATE_TARGET_MS * 1000000ULL;
> + __u32 iters;
> +
> + if (per_op_ns == 0)
> + return CALIBRATE_MIN_BATCH;
> +
> + iters = target_ns / per_op_ns;
> +
> + if (iters < CALIBRATE_MIN_BATCH)
> + iters = CALIBRATE_MIN_BATCH;
> + if (iters > CALIBRATE_MAX_BATCH)
> + iters = CALIBRATE_MAX_BATCH;
> +
> + return iters;
> +}
> +
> +void bpf_bench_calibrate(struct bpf_bench_timing *t, bpf_bench_run_fn run_fn, void *run_ctx)
> +{
> + __u64 elapsed, per_op_ns;
> + __u64 time_n, time_2n;
> + double ratio;
> +
> + elapsed = measure_elapsed(t, run_fn, run_ctx, CALIBRATE_SEED_BATCH, CALIBRATE_RUNS);
> + if (elapsed == 0) {
> + fprintf(stderr, "calibration: no timing samples, using default\n");
> + t->batch_iters = 10000;
> + *t->batch_iters_bss = t->batch_iters;
> + reset_timing(t);
> + return;
> + }
> +
> + per_op_ns = elapsed / CALIBRATE_SEED_BATCH;
> + t->batch_iters = compute_batch_iters(per_op_ns);
Will this integer division truncation break calibration for operations
taking less than 1 ns? If the total elapsed time for 100 iterations is under
100 ns, per_op_ns evaluates to 0, forcing compute_batch_iters() to return
CALIBRATE_MIN_BATCH (100) instead of scaling up to reach the 10 ms target.
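If sub-ns costs were considered worth calibrating for, the truncation could
be sidestepped by folding the seed batch into the division (a sketch using
the constants quoted above; the product stays near 1e9, well inside __u64):

	/* Sketch: iterations = target_time / per_op_time, computed without
	 * an intermediate truncating per-op division. */
	__u64 target_ns = (__u64)CALIBRATE_TARGET_MS * 1000000ULL;
	__u64 scaled = target_ns * CALIBRATE_SEED_BATCH / elapsed;

	if (scaled < CALIBRATE_MIN_BATCH)
		scaled = CALIBRATE_MIN_BATCH;
	if (scaled > CALIBRATE_MAX_BATCH)
		scaled = CALIBRATE_MAX_BATCH;
	t->batch_iters = (__u32)scaled;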
> +
> + time_n = measure_elapsed(t, run_fn, run_ctx, t->batch_iters, CALIBRATE_RUNS);
> + time_2n = measure_elapsed(t, run_fn, run_ctx, t->batch_iters * 2, CALIBRATE_RUNS);
[ ... ]
> +static void reset_timing(struct bpf_bench_timing *t)
> +{
> + *t->timing_enabled = 0;
> + memset(t->samples, 0, sizeof(__u64) * BENCH_NR_CPUS * BENCH_NR_SAMPLES);
> + memset(t->idx, 0, sizeof(__u32) * BENCH_NR_CPUS);
> +}
Does this need CPU memory barriers to prevent hardware reordering?
While timing_enabled is volatile, weakly-ordered architectures like arm64 might
reorder the memory stores across the flag toggle. This could allow background
BPF executions to write samples while the arrays are being zeroed.
[ ... ]
> diff --git a/tools/testing/selftests/bpf/progs/bench_bpf_timing.bpf.h b/tools/testing/selftests/bpf/progs/bench_bpf_timing.bpf.h
> new file mode 100644
> index 000000000000..6a1ad75f1fd7
> --- /dev/null
> +++ b/tools/testing/selftests/bpf/progs/bench_bpf_timing.bpf.h
[ ... ]
> +static __always_inline void bench_record_sample(__u64 elapsed_ns)
> +{
> + __u32 cpu, idx;
> +
> + if (!timing_enabled)
> + return;
> +
> + cpu = bpf_get_smp_processor_id() & BENCH_CPU_MASK;
> + idx = timing_idx[cpu];
> +
> + if (idx >= BENCH_NR_SAMPLES)
> + return;
> +
> + timing_samples[cpu][idx] = elapsed_ns;
> + timing_idx[cpu] = idx + 1;
> +}
Can this read-modify-write pattern cause lost samples or out-of-bounds
writes? If machines have more than 256 logical CPUs, BENCH_CPU_MASK will alias
multiple CPUs to the same index. Concurrent thread executions could race and
overwrite the index, potentially corrupting the timing statistics.
> +
> +/*
> + * @body: expression to time; return value (int) stored in __bench_result.
> + * @reset: undo body's side-effects so each iteration starts identically.
> + * May reference __bench_result. Use ({}) for empty reset.
> + *
> + * Runs batch_iters timed iterations, then one untimed iteration whose
> + * return value the macro evaluates to (for validation).
> + */
> +#define BENCH_BPF_LOOP(body, reset) ({ \
> + __u64 __bench_start = bpf_ktime_get_ns(); \
> + __u32 __bench_i; \
> + int __bench_result; \
> + \
> + for (__bench_i = 0; \
> + __bench_i < batch_iters && can_loop; \
> + __bench_i++) { \
> + __bench_result = (body); \
> + reset; \
> + } \
> + \
> + bench_record_sample(bpf_ktime_get_ns() - __bench_start); \
> + \
> + __bench_result = (body); \
> + __bench_result; \
> +})
Should the final untimed iteration also evaluate the reset expression?
Without calling reset here, any side effects from the last body execution (like
memory allocations or map modifications) will leak into the next program
invocation and could violate benchmark isolation.
---
AI reviewed your patch. Please fix the bug or email reply why it's not a bug.
See: https://github.com/kernel-patches/vmtest/blob/master/ci/claude/README.md
CI run summary: https://github.com/kernel-patches/bpf/actions/runs/25025533577
* Re: [PATCH bpf-next 4/7] selftests/bpf: Add XDP load-balancer common definitions
2026-04-27 23:23 ` [PATCH bpf-next 4/7] selftests/bpf: Add XDP load-balancer common definitions Puranjay Mohan
2026-04-28 0:05 ` bot+bpf-ci
@ 2026-04-28 0:38 ` sashiko-bot
2026-04-28 9:29 ` Puranjay Mohan
1 sibling, 1 reply; 24+ messages in thread
From: sashiko-bot @ 2026-04-28 0:38 UTC (permalink / raw)
To: Puranjay Mohan; +Cc: bpf
Sashiko AI review found 3 potential issue(s):
- [Medium] Endianness bug and invalid arithmetic on network-order values in IP encapsulation helpers.
- [Medium] Implicit compiler padding in shared BPF/userspace structures causes cross-architecture ABI breakage.
- [Low] Missing anonymous union for IPv4/IPv6 addresses in `struct real_definition`.
--
commit 27405cc11d849dd578c45222803d3bbd40b0e500
Author: Puranjay Mohan <puranjay@kernel.org>
selftests/bpf: Add XDP load-balancer common definitions
This commit adds the shared header for the XDP load-balancer benchmark.
It defines the data structures used by both the BPF program and userspace,
along with encapsulation source-address helpers.
> diff --git a/tools/testing/selftests/bpf/xdp_lb_bench_common.h b/tools/testing/selftests/bpf/xdp_lb_bench_common.h
> new file mode 100644
> index 000000000000..aed20a963701
> --- /dev/null
> +++ b/tools/testing/selftests/bpf/xdp_lb_bench_common.h
[ ... ]
> +LB_INLINE __be32 create_encap_ipv4_src(__u16 port, __be32 src)
> +{
> + __u32 ip_suffix = lb_htons(port);
> +
> + ip_suffix <<= 16;
> + ip_suffix ^= src;
> + return (0xFFFF0000 & ip_suffix) | IPIP_V4_PREFIX;
> +}
Is it safe to perform bitwise operations directly on network-order values
like this? Since lb_htons() returns a network-order port, left-shifting it
by 16 and masking with host-order constants like 0xFFFF0000 or combining
it with IPIP_V4_PREFIX (4268) seems to assume a little-endian architecture.
Could this generate corrupted packet headers on big-endian systems?
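An endian-clean shape would do the arithmetic in host order and convert once
at the boundary (a sketch only: lb_ntohl()/lb_htonl() are assumed to exist
alongside lb_htons(), and the exact octet placement would need to be matched
to what the benchmark expects):

	/* Sketch: host-order arithmetic, one conversion at the boundary.
	 * lb_ntohl()/lb_htonl() are hypothetical counterparts of lb_htons(). */
	LB_INLINE __be32 create_encap_ipv4_src(__u16 port, __be32 src)
	{
		__u32 suffix = ((__u32)port << 16) ^ lb_ntohl(src);

		return lb_htonl((suffix & 0xFFFF0000) | IPIP_V4_PREFIX);
	}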
> +
> +LB_INLINE void create_encap_ipv6_src(__u16 port, __be32 src, __be32 *saddr)
> +{
> + saddr[0] = IPIP_V6_PREFIX1;
> + saddr[1] = IPIP_V6_PREFIX2;
> + saddr[2] = IPIP_V6_PREFIX3;
> + saddr[3] = src ^ port;
> +}
Will saddr[0] evaluate to the correct prefix on all architectures? Because
IPIP_V6_PREFIX1 is defined as 1, the first 32 bits of the address read as
0100:: on little-endian but 0:1:: on big-endian.
Also, is it intended to XOR the network-order __be32 src with the
host-order __u16 port without a prior byte-order conversion?
[ ... ]
> +struct real_pos_lru {
> + __u32 pos;
> + __u64 atime;
> +};
Could this shared structure cause ABI breakage between 32-bit userspace
and 64-bit BPF environments? In 64-bit environments, the compiler will
insert 4 bytes of padding after pos, but 32-bit environments only require
4-byte alignment for __u64. Should an explicit padding field be added to
ensure identical memory layout?
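An explicit padding field would pin the layout (a sketch of the suggestion):

	/* Sketch: make the 64-bit compiler's implicit padding explicit so
	 * 32-bit and 64-bit builds agree on the layout. */
	struct real_pos_lru {
		__u32 pos;
		__u32 __pad;	/* was implicit on 64-bit targets */
		__u64 atime;
	};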
> +
> +struct real_definition {
> + __be32 dst;
> + __be32 dstv6[4];
> + __u8 flags;
> +};
This isn't a bug, but earlier structures like flow_key and vip_definition
use an anonymous union to overlay their IPv4 and IPv6 addresses. Should
dst and dstv6 be placed in a union here as well to save memory and
maintain consistency?
[ ... ]
--
Sashiko AI review · https://sashiko.dev/#/patchset/20260427232313.1582588-1-puranjay@kernel.org?part=4
* Re: [PATCH bpf-next 5/7] selftests/bpf: Add XDP load-balancer BPF program
2026-04-27 23:23 ` [PATCH bpf-next 5/7] selftests/bpf: Add XDP load-balancer BPF program Puranjay Mohan
2026-04-28 0:18 ` bot+bpf-ci
@ 2026-04-28 1:05 ` sashiko-bot
2026-04-28 9:30 ` Puranjay Mohan
1 sibling, 1 reply; 24+ messages in thread
From: sashiko-bot @ 2026-04-28 1:05 UTC (permalink / raw)
To: Puranjay Mohan; +Cc: bpf
Sashiko AI review found 1 potential issue(s):
- [Medium] The `cold_lru` logic inside the benchmark loop's `reset` block toggles the IP source address back and forth, breaking the benchmark's LRU cache miss measurement.
--
commit 295ed705f4fdd57c01615b43803c47313840bbf1
Author: Puranjay Mohan <puranjay@kernel.org>
selftests/bpf: Add XDP load-balancer BPF program
Add the BPF datapath for the XDP load-balancer benchmark, a
simplified L4 load-balancer inspired by katran.
> diff --git a/tools/testing/selftests/bpf/progs/xdp_lb_bench.c b/tools/testing/selftests/bpf/progs/xdp_lb_bench.c
> new file mode 100644
> index 000000000000..b9fd848c035d
> --- /dev/null
> +++ b/tools/testing/selftests/bpf/progs/xdp_lb_bench.c
[ ... ]
> +SEC("xdp")
> +int xdp_lb_bench(struct xdp_md *xdp)
> +{
[ ... ]
> + if (cold_lru) {
> + __u32 *saddr = data + saddr_off;
> +
> + batch_gen++;
> + batch_hash = (batch_gen ^ bpf_get_smp_processor_id()) * KNUTH_HASH_MULT;
> + if ((void *)(saddr + 1) <= data_end)
> + *saddr ^= batch_hash;
> + }
> +
> + return BENCH_BPF_LOOP(
> + process_packet(xdp),
> + ({
> + if (__bench_result == XDP_TX) {
> + if (strip_encap(xdp, &saved_eth))
> + return XDP_DROP;
> + if (rand_state)
> + randomize_src(xdp, saddr_off, &rand_state);
> + }
> + if (cold_lru) {
> + void *d = (void *)(long)xdp->data;
> + void *de = (void *)(long)xdp->data_end;
> + __u32 *__sa = d + saddr_off;
> +
> + if ((void *)(__sa + 1) <= de)
> + *__sa ^= batch_hash;
Does this code toggle the source IP back and forth on alternating iterations?
Because batch_hash is computed outside the loop and remains constant, XORing
the source IP on every reset block will cause it to alternate between two
values.
Won't this allow the LRU cache to easily cache both addresses, leading to a
near 100% cache hit rate and defeating the purpose of measuring cold_lru cache
misses?
> + }
> + })
> + );
> +}
--
Sashiko AI review · https://sashiko.dev/#/patchset/20260427232313.1582588-1-puranjay@kernel.org?part=5
* Re: [PATCH bpf-next 6/7] selftests/bpf: Add XDP load-balancer benchmark driver
2026-04-27 23:23 ` [PATCH bpf-next 6/7] selftests/bpf: Add XDP load-balancer benchmark driver Puranjay Mohan
2026-04-28 0:05 ` bot+bpf-ci
@ 2026-04-28 1:29 ` sashiko-bot
2026-04-28 9:33 ` Puranjay Mohan
1 sibling, 1 reply; 24+ messages in thread
From: sashiko-bot @ 2026-04-28 1:29 UTC (permalink / raw)
To: Puranjay Mohan; +Cc: bpf
Sashiko AI review found 1 potential issue(s):
- [Low] Incorrect IPv4 checksum calculation for packets with IP options in test packet generator
--
commit cc1cb16090fdc833cb82a1c6a168d6cb11005220
Author: Puranjay Mohan <puranjay@kernel.org>
selftests/bpf: Add XDP load-balancer benchmark driver
This commit wires up the userspace side of the XDP load-balancer benchmark.
It covers 24 scenarios and validates correctness before benchmarking by
comparing the output packet byte-for-byte against an expected packet.
> diff --git a/tools/testing/selftests/bpf/benchs/bench_xdp_lb.c b/tools/testing/selftests/bpf/benchs/bench_xdp_lb.c
> new file mode 100644
> index 000000000000..0b6709a2b03c
> --- /dev/null
> +++ b/tools/testing/selftests/bpf/benchs/bench_xdp_lb.c
[ ... ]
> +static void build_packet(int idx)
> +{
[ ... ]
> + if (sc->is_v6) {
[ ... ]
> + } else {
> + struct iphdr iph = {};
> + __u32 ip_off = off;
> +
> + iph.version = 4;
> + iph.ihl = sc->set_ip_options ? 6 : 5;
> + iph.ttl = 64;
> + iph.protocol = sc->ip_proto;
> + iph.saddr = htonl(sc->src_addr);
> + iph.daddr = htonl(sc->vip_addr);
> + iph.frag_off = sc->set_frag ? htons(IP_MF) : 0;
> + off += sizeof(iph);
> +
> + if (sc->set_ip_options) {
> + /* NOP option padding (4 bytes = 1 word) */
> + __u32 nop = htonl(0x01010101);
> +
> + memcpy(p + off, &nop, sizeof(nop));
> + off += sizeof(nop);
> + }
> +
> + build_l4(sc, p, &off);
> +
> + memcpy(p + off, "bench___payload!", 16);
> + off += 16;
> +
> + iph.tot_len = htons(off - ip_off);
> + iph.check = ip_checksum(&iph, sizeof(iph));
Does this calculation exclude the appended IP options from the checksum?
When sc->set_ip_options is true, the header length is set to 6 words (24
bytes) and 4 bytes of NOP options are copied directly into the packet
buffer. However, ip_checksum() is only passed sizeof(iph), which is exactly
20 bytes, and computes the checksum over the local iph struct alone.
While this logically malformed packet might not cause the current tests
to fail because the BPF program validates iph->ihl == 5 before verifying
the IP checksum, could this mask bugs or produce false test failures if
the BPF program is later modified to validate L3 checksums?
> + memcpy(p + ip_off, &iph, sizeof(iph));
> + }
> +
> + pkt_len[idx] = off;
> +}
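If full-header coverage were wanted, the checksum could be computed over
ihl * 4 bytes straight from the packet buffer once both the header and the
options are in place (a sketch; it assumes ip_checksum() takes a buffer
pointer and a byte length, as its call above suggests):

	/* Sketch: checksum the on-buffer header so the options words are
	 * included, then patch the result back into the buffer. */
	iph.check = 0;
	memcpy(p + ip_off, &iph, sizeof(iph));	/* options already follow at ip_off + 20 */
	iph.check = ip_checksum(p + ip_off, iph.ihl * 4);
	memcpy(p + ip_off + offsetof(struct iphdr, check), &iph.check,
	       sizeof(iph.check));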
--
Sashiko AI review · https://sashiko.dev/#/patchset/20260427232313.1582588-1-puranjay@kernel.org?part=6
* Re: [PATCH bpf-next 7/7] selftests/bpf: Add XDP load-balancer benchmark run script
2026-04-27 23:23 ` [PATCH bpf-next 7/7] selftests/bpf: Add XDP load-balancer benchmark run script Puranjay Mohan
@ 2026-04-28 2:03 ` sashiko-bot
0 siblings, 0 replies; 24+ messages in thread
From: sashiko-bot @ 2026-04-28 2:03 UTC (permalink / raw)
To: Puranjay Mohan; +Cc: bpf
Sashiko AI review found 1 potential issue(s):
- [Medium] Any extra arguments passed to the script are silently ignored because `"$@"` is used incorrectly inside the `run_scenario` bash function.
--
commit eb02fb82c543ffe6f8607af5eadc8ba489f2fb53
Author: Puranjay Mohan <puranjay@kernel.org>
selftests/bpf: Add XDP load-balancer benchmark run script
This commit adds a convenience script to run 24 XDP load-balancer scenarios.
It formats the benchmark results as a table displaying the median, standard
deviation, and 99th percentile.
> diff --git a/tools/testing/selftests/bpf/benchs/run_bench_xdp_lb.sh b/tools/testing/selftests/bpf/benchs/run_bench_xdp_lb.sh
> new file mode 100755
> index 000000000000..f65cf46214a3
> --- /dev/null
> +++ b/tools/testing/selftests/bpf/benchs/run_bench_xdp_lb.sh
[ ... ]
> +function run_scenario()
> +{
> + local sc="$1"
> + shift
> + local output rline
> +
> + output=$($RUN --scenario "$sc" "$@" 2>&1) || true
Does this silently ignore extra arguments passed to the script?
In bash, "$@" inside a function evaluates to the arguments passed to that
specific function, not the global script arguments. Since run_scenario is
called with exactly one argument ($sc) throughout the script, shift will
consume it and leave "$@" empty.
Should the script's global arguments be passed explicitly to run_scenario
when it is called, or should the script store its arguments in a global array?
--
Sashiko AI review · https://sashiko.dev/#/patchset/20260427232313.1582588-1-puranjay@kernel.org?part=7
* Re: [PATCH bpf-next 1/7] selftests/bpf: Add bench_force_done() for early benchmark completion
2026-04-28 0:05 ` bot+bpf-ci
@ 2026-04-28 9:15 ` Puranjay Mohan
0 siblings, 0 replies; 24+ messages in thread
From: Puranjay Mohan @ 2026-04-28 9:15 UTC (permalink / raw)
To: bot+bpf-ci
Cc: bpf, ast, andrii, daniel, martin.lau, eddyz87, memxor,
mykyta.yatsenko5, feichen, taragrawal, ndixit, tehnerd,
kernel-team, yonghong.song, clm, ihor.solodrai
On Tue, Apr 28, 2026 at 1:05 AM <bot+bpf-ci@kernel.org> wrote:
>
> > diff --git a/tools/testing/selftests/bpf/bench.c b/tools/testing/selftests/bpf/bench.c
> > index 029b3e21f438..47a4e72208d6 100644
> > --- a/tools/testing/selftests/bpf/bench.c
> > +++ b/tools/testing/selftests/bpf/bench.c
> > @@ -741,6 +741,13 @@ static void setup_benchmark(void)
> > static pthread_mutex_t bench_done_mtx = PTHREAD_MUTEX_INITIALIZER;
> > static pthread_cond_t bench_done = PTHREAD_COND_INITIALIZER;
> >
> > +void bench_force_done(void)
> > +{
> > + pthread_mutex_lock(&bench_done_mtx);
> > + pthread_cond_signal(&bench_done);
> > + pthread_mutex_unlock(&bench_done_mtx);
> > +}
> > +
>
> Is it safe to use pthread_mutex_lock() and pthread_cond_signal() here
> given that collect_measurements() calls bench_force_done() from a
> SIGALRM signal handler (sigalarm_handler)?
>
> If the signal interrupts a thread that is already holding bench_done_mtx
> (for example, a worker thread executing bench_force_done()), the signal
> handler will attempt to lock the same mutex:
>
> Thread context:
> bench_force_done()
> pthread_mutex_lock(&bench_done_mtx); // lock acquired
> [SIGALRM arrives]
bench_force_done() is expected to be called from the SIGALRM handler
only; it is never called in thread context.
>
> Signal handler context:
> sigalarm_handler()
> collect_measurements()
> bench_force_done()
> pthread_mutex_lock(&bench_done_mtx); // deadlock
>
> This causes a self-deadlock because pthread_mutex_lock() and
> pthread_cond_signal() are not async-signal-safe according to POSIX.
>
> Could an async-signal-safe mechanism like an atomic flag with sem_post()
> be used instead?
>
> This concern was raised in the v1 and v2 reviews by Sashiko AI on lore:
> https://lore.kernel.org/bpf/20260420124120.13970C19425@smtp.kernel.org/
>
>
> Could this code lose wakeups because the condition variable is signaled
> without a shared state predicate?
>
> If a worker thread completes quickly and calls bench_force_done() before
> main() reaches pthread_cond_wait(), the signal will be lost because POSIX
> condition variables do not retain state:
>
> Worker thread:
> bench_force_done()
> pthread_cond_signal(&bench_done); // signal sent
>
> Main thread (not waiting yet):
> [still in setup]
> pthread_cond_wait(&bench_done, &bench_done_mtx); // misses signal
>
> This would cause main() to block for the full timer duration, defeating
> the purpose of bench_force_done(). Should a boolean flag be added to
> track the done state?
t->warmup_ticks++;
if (t->warmup_ticks < env.warmup_sec)
        return;
if (t->warmup_ticks == env.warmup_sec) {
        *t->timing_enabled = 1;
        return;
}
/* ... only here can bench_force_done() be called */
So bench_force_done() fires at the earliest on tick warmup_sec + 1,
meaning state.res_cnt >= warmup_sec + 1 when it fires; the subtraction
is always positive.
> This concern was raised in the v1 and v2 reviews by Sashiko AI on lore:
> https://lore.kernel.org/bpf/20260420124120.13970C19425@smtp.kernel.org/
>
>
> > static void collect_measurements(long delta_ns) {
> > int iter = state.res_cnt++;
> > struct bench_res *res = &state.results[iter];
> > @@ -750,11 +757,8 @@ static void collect_measurements(long delta_ns) {
> > if (bench->report_progress)
> > bench->report_progress(iter, res, delta_ns);
> >
> > - if (iter == env.duration_sec + env.warmup_sec) {
> > - pthread_mutex_lock(&bench_done_mtx);
> > - pthread_cond_signal(&bench_done);
> > - pthread_mutex_unlock(&bench_done_mtx);
> > - }
> > + if (iter == env.duration_sec + env.warmup_sec)
> > + bench_force_done();
> > }
>
> [ ... ]
>
> What happens if bench_force_done() is called before the warmup period
> fully elapses?
>
> In main(), the final sample count passed to bench->report_final() is
> calculated as:
>
> state.res_cnt - env.warmup_sec
>
> If state.res_cnt is strictly less than env.warmup_sec, the resulting
> negative count will bypass the iteration loops in the report_final
> callbacks (like ops_report_final()), leaving aggregate variables at zero.
>
> Would subsequent calculations divide by zero and corrupt the benchmark
> summary?
>
> This concern was raised in the v1 and v2 reviews by Sashiko AI on lore:
> https://lore.kernel.org/bpf/20260420124120.13970C19425@smtp.kernel.org/
>
>
> ---
> AI reviewed your patch. Please fix the bug or email reply why it's not a bug.
> See: https://github.com/kernel-patches/vmtest/blob/master/ci/claude/README.md
>
> CI run summary: https://github.com/kernel-patches/bpf/actions/runs/25025533577
* Re: [PATCH bpf-next 2/7] selftests/bpf: Add BPF batch-timing library
2026-04-28 0:18 ` bot+bpf-ci
@ 2026-04-28 9:23 ` Puranjay Mohan
0 siblings, 0 replies; 24+ messages in thread
From: Puranjay Mohan @ 2026-04-28 9:23 UTC (permalink / raw)
To: bot+bpf-ci
Cc: bpf, ast, andrii, daniel, martin.lau, eddyz87, memxor,
mykyta.yatsenko5, feichen, taragrawal, ndixit, tehnerd,
kernel-team, yonghong.song, clm, ihor.solodrai
On Tue, Apr 28, 2026 at 1:18 AM <bot+bpf-ci@kernel.org> wrote:
>
> > diff --git a/tools/testing/selftests/bpf/benchs/bench_bpf_timing.c b/tools/testing/selftests/bpf/benchs/bench_bpf_timing.c
> > new file mode 100644
> > index 000000000000..75a39da69655
> > --- /dev/null
> > +++ b/tools/testing/selftests/bpf/benchs/bench_bpf_timing.c
>
> [ ... ]
>
> > +static __u32 compute_batch_iters(__u64 per_op_ns)
> > +{
> > + __u64 target_ns = (__u64)CALIBRATE_TARGET_MS * 1000000ULL;
> > + __u32 iters;
> > +
> > + if (per_op_ns == 0)
> > + return CALIBRATE_MIN_BATCH;
> > +
> > + iters = target_ns / per_op_ns;
> > +
> > + if (iters < CALIBRATE_MIN_BATCH)
> > + iters = CALIBRATE_MIN_BATCH;
> > + if (iters > CALIBRATE_MAX_BATCH)
> > + iters = CALIBRATE_MAX_BATCH;
> > +
> > + return iters;
> > +}
> > +
> > +void bpf_bench_calibrate(struct bpf_bench_timing *t, bpf_bench_run_fn run_fn, void *run_ctx)
> > +{
> > + __u64 elapsed, per_op_ns;
> > + __u64 time_n, time_2n;
> > + double ratio;
> > +
> > + elapsed = measure_elapsed(t, run_fn, run_ctx, CALIBRATE_SEED_BATCH, CALIBRATE_RUNS);
> > + if (elapsed == 0) {
> > + fprintf(stderr, "calibration: no timing samples, using default\n");
> > + t->batch_iters = 10000;
> > + *t->batch_iters_bss = t->batch_iters;
> > + reset_timing(t);
> > + return;
> > + }
> > +
> > + per_op_ns = elapsed / CALIBRATE_SEED_BATCH;
> > + t->batch_iters = compute_batch_iters(per_op_ns);
>
> Will this integer division truncation break calibration for operations
> taking less than 1 ns? If the total elapsed time for 100 iterations is under
> 100 ns, per_op_ns evaluates to 0, forcing compute_batch_iters() to return
> CALIBRATE_MIN_BATCH (100) instead of scaling up to reach the 10 ms target.
If individual operations take less than 1 ns then we can't meaningfully
measure them, so returning 100 here is good enough.
>
> > +
> > + time_n = measure_elapsed(t, run_fn, run_ctx, t->batch_iters, CALIBRATE_RUNS);
> > + time_2n = measure_elapsed(t, run_fn, run_ctx, t->batch_iters * 2, CALIBRATE_RUNS);
>
> [ ... ]
>
> > +static void reset_timing(struct bpf_bench_timing *t)
> > +{
> > + *t->timing_enabled = 0;
> > + memset(t->samples, 0, sizeof(__u64) * BENCH_NR_CPUS * BENCH_NR_SAMPLES);
> > + memset(t->idx, 0, sizeof(__u32) * BENCH_NR_CPUS);
> > +}
>
> Does this need CPU memory barriers to prevent hardware reordering?
> While timing_enabled is volatile, weakly-ordered architectures like arm64 might
> reorder the memory stores across the flag toggle. This could allow background
> BPF executions to write samples while the arrays are being zeroed.
reset_timing() is only called during calibration, which runs in
setup() -- before any producer threads are started. There's no
concurrent BPF execution to race with.
> [ ... ]
>
> > diff --git a/tools/testing/selftests/bpf/progs/bench_bpf_timing.bpf.h b/tools/testing/selftests/bpf/progs/bench_bpf_timing.bpf.h
> > new file mode 100644
> > index 000000000000..6a1ad75f1fd7
> > --- /dev/null
> > +++ b/tools/testing/selftests/bpf/progs/bench_bpf_timing.bpf.h
>
> [ ... ]
>
> > +static __always_inline void bench_record_sample(__u64 elapsed_ns)
> > +{
> > + __u32 cpu, idx;
> > +
> > + if (!timing_enabled)
> > + return;
> > +
> > + cpu = bpf_get_smp_processor_id() & BENCH_CPU_MASK;
> > + idx = timing_idx[cpu];
> > +
> > + if (idx >= BENCH_NR_SAMPLES)
> > + return;
> > +
> > + timing_samples[cpu][idx] = elapsed_ns;
> > + timing_idx[cpu] = idx + 1;
> > +}
>
> Can this read-modify-write pattern cause lost samples or out-of-bounds
> writes? If machines have more than 256 logical CPUs, BENCH_CPU_MASK will alias
> multiple CPUs to the same index. Concurrent thread executions could race and
> overwrite the index, potentially corrupting the timing statistics.
xdp_lb_validate() checks bpf_num_possible_cpus() > BENCH_NR_CPUS and
exits with an error.
>
> > +
> > +/*
> > + * @body: expression to time; return value (int) stored in __bench_result.
> > + * @reset: undo body's side-effects so each iteration starts identically.
> > + * May reference __bench_result. Use ({}) for empty reset.
> > + *
> > + * Runs batch_iters timed iterations, then one untimed iteration whose
> > + * return value the macro evaluates to (for validation).
> > + */
> > +#define BENCH_BPF_LOOP(body, reset) ({ \
> > + __u64 __bench_start = bpf_ktime_get_ns(); \
> > + __u32 __bench_i; \
> > + int __bench_result; \
> > + \
> > + for (__bench_i = 0; \
> > + __bench_i < batch_iters && can_loop; \
> > + __bench_i++) { \
> > + __bench_result = (body); \
> > + reset; \
> > + } \
> > + \
> > + bench_record_sample(bpf_ktime_get_ns() - __bench_start); \
> > + \
> > + __bench_result = (body); \
> > + __bench_result; \
> > +})
>
> Should the final untimed iteration also evaluate the reset expression?
> Without calling reset here, any side effects from the last body execution (like
> memory allocations or map modifications) will leak into the next program
> invocation and could violate benchmark isolation.
This is by design; we want the last untimed iteration to have
side effects so we can validate the output.
>
> ---
> AI reviewed your patch. Please fix the bug or email reply why it's not a bug.
> See: https://github.com/kernel-patches/vmtest/blob/master/ci/claude/README.md
>
> CI run summary: https://github.com/kernel-patches/bpf/actions/runs/25025533577
* Re: [PATCH bpf-next 4/7] selftests/bpf: Add XDP load-balancer common definitions
2026-04-28 0:38 ` sashiko-bot
@ 2026-04-28 9:29 ` Puranjay Mohan
0 siblings, 0 replies; 24+ messages in thread
From: Puranjay Mohan @ 2026-04-28 9:29 UTC (permalink / raw)
To: sashiko; +Cc: bpf
On Tue, Apr 28, 2026 at 1:38 AM <sashiko-bot@kernel.org> wrote:
>
> Sashiko AI review found 3 potential issue(s):
> - [Medium] Endianness bug and invalid arithmetic on network-order values in IP encapsulation helpers.
> - [Medium] Implicit compiler padding in shared BPF/userspace structures causes cross-architecture ABI breakage.
> - [Low] Missing anonymous union for IPv4/IPv6 addresses in `struct real_definition`.
> --
>
> commit 27405cc11d849dd578c45222803d3bbd40b0e500
> Author: Puranjay Mohan <puranjay@kernel.org>
>
> selftests/bpf: Add XDP load-balancer common definitions
>
> This commit adds the shared header for the XDP load-balancer benchmark.
> It defines the data structures used by both the BPF program and userspace,
> along with encapsulation source-address helpers.
>
> > diff --git a/tools/testing/selftests/bpf/xdp_lb_bench_common.h b/tools/testing/selftests/bpf/xdp_lb_bench_common.h
> > new file mode 100644
> > index 000000000000..aed20a963701
> > --- /dev/null
> > +++ b/tools/testing/selftests/bpf/xdp_lb_bench_common.h
>
> [ ... ]
>
> > +LB_INLINE __be32 create_encap_ipv4_src(__u16 port, __be32 src)
> > +{
> > + __u32 ip_suffix = lb_htons(port);
> > +
> > + ip_suffix <<= 16;
> > + ip_suffix ^= src;
> > + return (0xFFFF0000 & ip_suffix) | IPIP_V4_PREFIX;
> > +}
>
> Is it safe to perform bitwise operations directly on network-order values
> like this? Since lb_htons() returns a network-order port, left-shifting it
> by 16 and masking with host-order constants like 0xFFFF0000 or combining
> it with IPIP_V4_PREFIX (4268) seems to assume a little-endian architecture.
> Could this generate corrupted packet headers on big-endian systems?
It is safe; everything happens on the same host, and this is a benchmark!
>
> > +
> > +LB_INLINE void create_encap_ipv6_src(__u16 port, __be32 src, __be32 *saddr)
> > +{
> > + saddr[0] = IPIP_V6_PREFIX1;
> > + saddr[1] = IPIP_V6_PREFIX2;
> > + saddr[2] = IPIP_V6_PREFIX3;
> > + saddr[3] = src ^ port;
> > +}
>
> Will saddr[0] evaluate to the correct prefix on all architectures? Because
> IPIP_V6_PREFIX1 is defined as 1, it evaluates to 0100:: on little-endian
> but ::1 on big-endian.
>
> Also, is it intended to XOR the network-order __be32 src with the
> host-order __u16 port without a prior byte-order conversion?
>
> [ ... ]
>
> > +struct real_pos_lru {
> > + __u32 pos;
> > + __u64 atime;
> > +};
>
> Could this shared structure cause ABI breakage between 32-bit userspace
> and 64-bit BPF environments? In 64-bit environments, the compiler will
> insert 4 bytes of padding after pos, but 32-bit environments only require
> 4-byte alignment for __u64. Should an explicit padding field be added to
> ensure identical memory layout?
>
> > +
> > +struct real_definition {
> > + __be32 dst;
> > + __be32 dstv6[4];
> > + __u8 flags;
> > +};
>
> This isn't a bug, but earlier structures like flow_key and vip_definition
> use an anonymous union to overlay their IPv4 and IPv6 addresses. Should
> dst and dstv6 be placed in a union here as well to save memory and
> maintain consistency?
No, some scenarios use both.
>
> [ ... ]
>
> --
> Sashiko AI review · https://sashiko.dev/#/patchset/20260427232313.1582588-1-puranjay@kernel.org?part=4
* Re: [PATCH bpf-next 5/7] selftests/bpf: Add XDP load-balancer BPF program
2026-04-28 1:05 ` sashiko-bot
@ 2026-04-28 9:30 ` Puranjay Mohan
0 siblings, 0 replies; 24+ messages in thread
From: Puranjay Mohan @ 2026-04-28 9:30 UTC (permalink / raw)
To: sashiko; +Cc: bpf
On Tue, Apr 28, 2026 at 2:05 AM <sashiko-bot@kernel.org> wrote:
>
> Sashiko AI review found 1 potential issue(s):
> - [Medium] The `cold_lru` logic inside the benchmark loop's `reset` block toggles the IP source address back and forth, breaking the benchmark's LRU cache miss measurement.
> --
>
> commit 295ed705f4fdd57c01615b43803c47313840bbf1
> Author: Puranjay Mohan <puranjay@kernel.org>
>
> selftests/bpf: Add XDP load-balancer BPF program
>
> Add the BPF datapath for the XDP load-balancer benchmark, a
> simplified L4 load-balancer inspired by katran.
>
> > diff --git a/tools/testing/selftests/bpf/progs/xdp_lb_bench.c b/tools/testing/selftests/bpf/progs/xdp_lb_bench.c
> > new file mode 100644
> > index 000000000000..b9fd848c035d
> > --- /dev/null
> > +++ b/tools/testing/selftests/bpf/progs/xdp_lb_bench.c
> [ ... ]
> > +SEC("xdp")
> > +int xdp_lb_bench(struct xdp_md *xdp)
> > +{
> [ ... ]
> > + if (cold_lru) {
> > + __u32 *saddr = data + saddr_off;
> > +
> > + batch_gen++;
> > + batch_hash = (batch_gen ^ bpf_get_smp_processor_id()) * KNUTH_HASH_MULT;
> > + if ((void *)(saddr + 1) <= data_end)
> > + *saddr ^= batch_hash;
> > + }
> > +
> > + return BENCH_BPF_LOOP(
> > + process_packet(xdp),
> > + ({
> > + if (__bench_result == XDP_TX) {
> > + if (strip_encap(xdp, &saved_eth))
> > + return XDP_DROP;
> > + if (rand_state)
> > + randomize_src(xdp, saddr_off, &rand_state);
> > + }
> > + if (cold_lru) {
> > + void *d = (void *)(long)xdp->data;
> > + void *de = (void *)(long)xdp->data_end;
> > + __u32 *__sa = d + saddr_off;
> > +
> > + if ((void *)(__sa + 1) <= de)
> > + *__sa ^= batch_hash;
>
> Does this code toggle the source IP back and forth on alternating iterations?
>
> Because batch_hash is computed outside the loop and remains constant, XORing
> the source IP on every reset block will cause it to alternate between two
> values.
>
> Won't this allow the LRU cache to easily cache both addresses, leading to a
> near 100% cache hit rate and defeating the purpose of measuring cold_lru cache
> misses?
All cold_lru scenarios also have flow_mask set, which means
randomize_src() runs in the reset block on every iteration, producing
a different source IP each time via an xorshift PRNG. The cold_lru XOR
is an additional per-batch perturbation, not the sole source of flow
diversity. With flow_mask = 0xFFFFFF (16M flow space) and a 100K LRU,
most iterations genuinely miss.
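For reference, the xorshift step is of the usual form (a sketch of the
general technique, not the exact randomize_src() code from the patch):

	/* Sketch: standard xorshift32 step; the new state doubles as the
	 * per-iteration randomness for the source address. */
	static __always_inline __u32 xorshift32(__u32 x)
	{
		x ^= x << 13;
		x ^= x >> 17;
		x ^= x << 5;
		return x;
	}

	/* usage: rand_state = xorshift32(rand_state);
	 *        saddr = (saddr & ~flow_mask) | (rand_state & flow_mask);
	 */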
* Re: [PATCH bpf-next 6/7] selftests/bpf: Add XDP load-balancer benchmark driver
2026-04-28 1:29 ` sashiko-bot
@ 2026-04-28 9:33 ` Puranjay Mohan
0 siblings, 0 replies; 24+ messages in thread
From: Puranjay Mohan @ 2026-04-28 9:33 UTC (permalink / raw)
To: sashiko; +Cc: bpf
On Tue, Apr 28, 2026 at 2:29 AM <sashiko-bot@kernel.org> wrote:
>
> Sashiko AI review found 1 potential issue(s):
> - [Low] Incorrect IPv4 checksum calculation for packets with IP options in test packet generator
> --
>
> commit cc1cb16090fdc833cb82a1c6a168d6cb11005220
> Author: Puranjay Mohan <puranjay@kernel.org>
> selftests/bpf: Add XDP load-balancer benchmark driver
>
> This commit wires up the userspace side of the XDP load-balancer benchmark.
> It covers 24 scenarios and validates correctness before benchmarking by
> comparing the output packet byte-for-byte against an expected packet.
>
> > diff --git a/tools/testing/selftests/bpf/benchs/bench_xdp_lb.c b/tools/testing/selftests/bpf/benchs/bench_xdp_lb.c
> > new file mode 100644
> > index 000000000000..0b6709a2b03c
> > --- /dev/null
> > +++ b/tools/testing/selftests/bpf/benchs/bench_xdp_lb.c
> [ ... ]
> > +static void build_packet(int idx)
> > +{
> [ ... ]
> > + if (sc->is_v6) {
> [ ... ]
> > + } else {
> > + struct iphdr iph = {};
> > + __u32 ip_off = off;
> > +
> > + iph.version = 4;
> > + iph.ihl = sc->set_ip_options ? 6 : 5;
> > + iph.ttl = 64;
> > + iph.protocol = sc->ip_proto;
> > + iph.saddr = htonl(sc->src_addr);
> > + iph.daddr = htonl(sc->vip_addr);
> > + iph.frag_off = sc->set_frag ? htons(IP_MF) : 0;
> > + off += sizeof(iph);
> > +
> > + if (sc->set_ip_options) {
> > + /* NOP option padding (4 bytes = 1 word) */
> > + __u32 nop = htonl(0x01010101);
> > +
> > + memcpy(p + off, &nop, sizeof(nop));
> > + off += sizeof(nop);
> > + }
> > +
> > + build_l4(sc, p, &off);
> > +
> > + memcpy(p + off, "bench___payload!", 16);
> > + off += 16;
> > +
> > + iph.tot_len = htons(off - ip_off);
> > + iph.check = ip_checksum(&iph, sizeof(iph));
>
> Does this calculation exclude the appended IP options from the checksum?
>
> When sc->set_ip_options is true, the header length is set to 6 words (24
> bytes) and 4 bytes of NOP options are copied directly into the packet
> buffer. However, ip_checksum() is only passed sizeof(iph), which is exactly
> 20 bytes, and computes the checksum over the local iph struct alone.
>
> While this logically malformed packet might not cause the current tests
> to fail because the BPF program validates iph->ihl == 5 before verifying
> the IP checksum, could this mask bugs or produce false test failures if
> the BPF program is later modified to validate L3 checksums?
The current BPF program drops packets with ihl != 5 and never validates
checksums.