From: Puranjay Mohan
To: bpf@vger.kernel.org
Cc: Puranjay Mohan, Alexei Starovoitov, Andrii Nakryiko,
    Daniel Borkmann, Martin KaFai Lau, Eduard Zingerman,
    Kumar Kartikeya Dwivedi, Mykyta Yatsenko, Fei Chen, Taruna Agrawal,
    Nikhil Dixit Limaye, "Nikita V. Shirokov", kernel-team@meta.com
Subject: [PATCH bpf-next 0/7] selftests/bpf: Add XDP load-balancer benchmark
Date: Mon, 27 Apr 2026 16:22:57 -0700
Message-ID: <20260427232313.1582588-1-puranjay@kernel.org>

Changelog:

RFC: https://lore.kernel.org/all/20260420111726.2118636-1-puranjay@kernel.org/
Changes in v1:
- Replace bpf_get_cpu_time_counter() with bpf_ktime_get_ns()
- Replace bpf_repeat() with plain for loop and may_goto
- Refactor collect_measurements() to reuse bench_force_done()
- Remove histogram, verbose calibration output, and per-scenario status
  prints
- Trim run script table to p50/stddev/p99
- Set env.quiet when --machine-readable is passed
- Add || true to run script benchmark invocation for set -e safety
- Add bpf-nop benchmark as timing overhead baseline (patch 3)
- Use named struct for LRU inner map to fix build on older toolchains

This series adds an XDP load-balancer benchmark (based on Katran) to the
BPF selftest bench framework.

Motivation
----------

Existing BPF bench tests measure individual operations (map lookups,
kprobes, ring buffers) in isolation. Production BPF programs combine
parsing, map lookups, branching, and packet rewriting in a single call
chain. The performance characteristics of such programs depend on the
interaction of these operations -- register pressure, spills, inlining
decisions, branch layout -- which isolated micro-benchmarks do not
capture.

This benchmark implements a simplified L4 load-balancer modeled after
katran [1]. The BPF program reproduces katran's core datapath:

  L3/L4 parsing
    -> VIP hash lookup
    -> per-CPU LRU connection table with consistent-hash fallback
    -> real server selection
    -> per-VIP and per-real stats
    -> IPIP/IP6IP6 encapsulation

The BPF code exercises hash maps, array-of-maps (per-CPU LRU), percpu
arrays, jhash, bpf_xdp_adjust_head(), bpf_ktime_get_ns(), and
bpf_get_smp_processor_id() in a single pipeline.

This is intended as the first in a series of BPF workload benchmarks
covering other use cases (sched_ext, etc.).

Design
------

A userspace loop calling bpf_prog_test_run_opts(repeat=1) would measure
syscall overhead, not BPF program cost -- the ~4 ns early-exit paths
would be buried under kernel entry/exit. Using repeat=N is also
unsuitable: the kernel re-runs the same packet without resetting state
between iterations, so the second iteration of an encap scenario would
process an already-encapsulated packet.

Instead, timing is measured inside the BPF program using
bpf_ktime_get_ns(). BENCH_BPF_LOOP() brackets N iterations with
timestamp reads using a plain for loop with may_goto, runs a
caller-supplied reset block between iterations to undo side effects
(e.g. strip encapsulation), and records the elapsed time per batch. One
extra untimed iteration runs afterward for output validation.

Auto-calibration picks a batch size targeting ~10 ms per invocation. A
proportionality sanity check verifies that 2N iterations take ~2x as
long as N.

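The calibration and proportionality steps described above can be
sketched in plain C. This is only an illustration under assumptions, not
the code from the series: the names pick_batch_size(), is_proportional()
and TARGET_NS, and the percentage tolerance, are hypothetical. Only the
~10 ms target and the "2N takes ~2x N" check come from the text.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* ~10 ms per invocation, as stated in the text (name is hypothetical). */
#define TARGET_NS 10000000ULL

/*
 * Scale a short probe run so that one invocation lasts about TARGET_NS.
 * probe_iters iterations were observed to take probe_ns nanoseconds.
 */
static uint64_t pick_batch_size(uint64_t probe_iters, uint64_t probe_ns)
{
	uint64_t n;

	assert(probe_iters > 0);
	if (probe_ns == 0)
		probe_ns = 1;	/* clock too coarse: avoid dividing by zero */

	n = probe_iters * TARGET_NS / probe_ns;
	return n ? n : 1;	/* never return an empty batch */
}

/*
 * Return true when the time for 2N iterations (t_2n) is within tol_pct
 * percent of twice the time for N iterations (t_n). The tolerance is an
 * assumption; the text only says "~2x".
 */
static bool is_proportional(uint64_t t_n, uint64_t t_2n, unsigned int tol_pct)
{
	uint64_t expect = 2 * t_n;
	uint64_t diff = t_2n > expect ? t_2n - expect : expect - t_2n;

	return diff * 100 <= expect * tol_pct;
}
```

With these helpers, a 1000-iteration probe that took 75000 ns scales to
a batch of 133333 iterations, and a pair of runs taking 100 ns and 205 ns
passes a 5% tolerance while 100 ns and 300 ns does not.
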
24 scenarios cover the code-path matrix:

- Protocol: TCP, UDP
- Address family: IPv4, IPv6, cross-AF (IPv4-in-IPv6)
- LRU state: hit, miss (16M flow space), diverse (4K flows), cold
- Consistent-hash: direct (LRU bypass)
- TCP flags: SYN (skip LRU, force CH), RST (skip LRU insert)
- Early exits: unknown VIP, non-IP, ICMP, fragments, IP options

Each scenario validates correctness before benchmarking by comparing the
output packet byte-for-byte against a pre-built expected packet and
checking BPF map counters.

Sample single-scenario output:

  $ sudo ./bench xdp-lb --scenario tcp-v4-lru-hit
  Setting up benchmark 'xdp-lb'...
  Benchmark 'xdp-lb' started.
  tcp-v4-lru-hit: median 74.51 ns/op, stddev 0.11, p99 74.81 (202 samples)

Sample run script output:

  $ ./benchs/run_bench_xdp_lb.sh

  XDP load-balancer benchmark
  ===========================

  +----------------------------------+----------+---------+----------+
  | Single-flow baseline             |   p50    | stddev  |   p99    |
  +----------------------------------+----------+---------+----------+
  | tcp-v4-lru-hit                   |    74.30 |    0.08 |    74.48 |
  | tcp-v4-ch                        |   101.73 |    0.11 |   102.01 |
  | tcp-v6-lru-hit                   |    76.77 |    0.14 |    77.04 |
  | tcp-v6-ch                        |   121.40 |    0.10 |   121.65 |
  | udp-v4-lru-hit                   |   107.42 |    0.22 |   107.90 |
  | udp-v6-lru-hit                   |   110.21 |    0.12 |   110.45 |
  | tcp-v4v6-lru-hit                 |    74.82 |    0.35 |    75.43 |
  +----------------------------------+----------+---------+----------+
  | Diverse flows (4K src addrs)     |   p50    | stddev  |   p99    |
  +----------------------------------+----------+---------+----------+
  | tcp-v4-lru-diverse               |    86.63 |    0.37 |    89.04 |
  | tcp-v4-ch-diverse                |   104.09 |    0.19 |   105.67 |
  | tcp-v6-lru-diverse               |    89.34 |    0.42 |    90.70 |
  | tcp-v6-ch-diverse                |   122.20 |    0.21 |   123.78 |
  | udp-v4-lru-diverse               |   119.37 |    0.58 |   123.10 |
  +----------------------------------+----------+---------+----------+
  | TCP flags                        |   p50    | stddev  |   p99    |
  +----------------------------------+----------+---------+----------+
  | tcp-v4-syn                       |   165.52 |   15.68 |   198.34 |
  | tcp-v4-rst-miss                  |   161.34 |    2.69 |   172.64 |
  +----------------------------------+----------+---------+----------+
  | LRU stress                       |   p50    | stddev  |   p99    |
  +----------------------------------+----------+---------+----------+
  | tcp-v4-lru-miss                  |   440.39 |   35.75 |   550.62 |
  | udp-v4-lru-miss                  |   571.88 |   57.38 |   680.61 |
  | tcp-v4-lru-warmup                |   317.75 |    9.55 |   356.20 |
  +----------------------------------+----------+---------+----------+
  | Early exits                      |   p50    | stddev  |   p99    |
  +----------------------------------+----------+---------+----------+
  | pass-v4-no-vip                   |    18.26 |    0.13 |    18.66 |
  | pass-v6-no-vip                   |    19.08 |    0.01 |    19.10 |
  | pass-v4-icmp                     |     6.81 |    0.02 |     6.86 |
  | pass-non-ip                      |     5.71 |    0.03 |     5.76 |
  | drop-v4-frag                     |     6.09 |    0.01 |     6.10 |
  | drop-v4-options                  |     5.88 |    0.00 |     5.89 |
  | drop-v6-frag                     |     6.00 |    0.03 |     6.04 |
  +----------------------------------+----------+---------+----------+

Patches
-------

Patch 1 adds bench_force_done() to the bench framework so benchmarks can
signal early completion when enough samples have been collected.

Patch 2 adds the shared BPF batch-timing library (BPF-side timing
arrays, BENCH_BPF_LOOP macro, userspace statistics and calibration).

Patch 3 adds a bpf-nop benchmark as a timing overhead baseline and usage
example for the timing library.

Patch 4 adds the common header shared between the BPF program and
userspace (flow_key, vip_definition, real_definition, encap helpers).

Patch 5 adds the XDP load-balancer BPF program.

Patch 6 adds the userspace benchmark driver with 24 scenarios, packet
construction, validation, and bench framework integration.

Patch 7 adds the run script for running all scenarios.

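For readers wondering how the p50 and p99 columns in the sample output
might be derived from per-batch samples, here is a minimal sketch. It
assumes a nearest-rank percentile over sorted samples; the series may
compute its statistics differently, and percentile() and cmp_double()
are hypothetical names used only for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* qsort() comparator for doubles, ascending order. */
static int cmp_double(const void *a, const void *b)
{
	double x = *(const double *)a;
	double y = *(const double *)b;

	return (x > y) - (x < y);
}

/*
 * Nearest-rank percentile (an assumed method, not taken from the series).
 * Sorts samples in place and returns the value at 1-based rank
 * ceil(n * pct / 100). pct must be in 1..100.
 */
static double percentile(double *samples, size_t n, unsigned int pct)
{
	size_t rank;

	assert(n > 0 && pct > 0 && pct <= 100);
	qsort(samples, n, sizeof(*samples), cmp_double);

	rank = (n * pct + 99) / 100;	/* integer ceil(n * pct / 100) */
	return samples[rank - 1];
}
```

Calling percentile(samples, n, 50) gives the median under this
definition and percentile(samples, n, 99) gives the p99; with the 202
samples from the single-scenario run, p99 would be the 200th smallest
sample.
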
[1] https://github.com/facebookincubator/katran

Puranjay Mohan (7):
  selftests/bpf: Add bench_force_done() for early benchmark completion
  selftests/bpf: Add BPF batch-timing library
  selftests/bpf: Add bpf-nop benchmark for timing overhead baseline
  selftests/bpf: Add XDP load-balancer common definitions
  selftests/bpf: Add XDP load-balancer BPF program
  selftests/bpf: Add XDP load-balancer benchmark driver
  selftests/bpf: Add XDP load-balancer benchmark run script

 tools/testing/selftests/bpf/Makefile         |    6 +
 tools/testing/selftests/bpf/bench.c          |   20 +-
 tools/testing/selftests/bpf/bench.h          |    1 +
 .../testing/selftests/bpf/bench_bpf_timing.h |   50 +
 .../selftests/bpf/benchs/bench_bpf_nop.c     |   84 ++
 .../selftests/bpf/benchs/bench_bpf_timing.c  |  272 ++++
 .../selftests/bpf/benchs/bench_xdp_lb.c      | 1113 +++++++++++++++++
 .../selftests/bpf/benchs/run_bench_xdp_lb.sh |   79 ++
 .../bpf/progs/bench_bpf_timing.bpf.h         |   69 +
 .../selftests/bpf/progs/bpf_nop_bench.c      |   14 +
 .../selftests/bpf/progs/xdp_lb_bench.c       |  647 ++++++++++
 .../selftests/bpf/xdp_lb_bench_common.h      |  112 ++
 12 files changed, 2462 insertions(+), 5 deletions(-)
 create mode 100644 tools/testing/selftests/bpf/bench_bpf_timing.h
 create mode 100644 tools/testing/selftests/bpf/benchs/bench_bpf_nop.c
 create mode 100644 tools/testing/selftests/bpf/benchs/bench_bpf_timing.c
 create mode 100644 tools/testing/selftests/bpf/benchs/bench_xdp_lb.c
 create mode 100755 tools/testing/selftests/bpf/benchs/run_bench_xdp_lb.sh
 create mode 100644 tools/testing/selftests/bpf/progs/bench_bpf_timing.bpf.h
 create mode 100644 tools/testing/selftests/bpf/progs/bpf_nop_bench.c
 create mode 100644 tools/testing/selftests/bpf/progs/xdp_lb_bench.c
 create mode 100644 tools/testing/selftests/bpf/xdp_lb_bench_common.h

-- 
2.52.0