public inbox for bpf@vger.kernel.org
* [RFC PATCH bpf-next 0/6] selftests/bpf: Add XDP load-balancer benchmark
@ 2026-04-20 11:17 Puranjay Mohan
  2026-04-20 11:17 ` [RFC PATCH bpf-next 1/6] selftests/bpf: Add bench_force_done() for early benchmark completion Puranjay Mohan
                   ` (6 more replies)
  0 siblings, 7 replies; 16+ messages in thread
From: Puranjay Mohan @ 2026-04-20 11:17 UTC (permalink / raw)
  To: bpf
  Cc: Puranjay Mohan, Alexei Starovoitov,
	Andrii Nakryiko, Daniel Borkmann, Martin KaFai Lau,
	Eduard Zingerman, Kumar Kartikeya Dwivedi, Mykyta Yatsenko,
	Fei Chen, Taruna Agrawal, Nikhil Dixit Limaye, Nikita V. Shirokov,
	kernel-team

This series adds an XDP load-balancer benchmark (based on Katran) to the BPF
selftest bench framework.

This series depends on the bpf_get_cpu_time_counter() kfunc series:
https://lore.kernel.org/all/20260418131614.1501848-1-puranjay@kernel.org/
If that series doesn't land in time, bpf_get_cpu_time_counter() can be
replaced with bpf_ktime_get_ns(); this adds benchmarking overhead, but
the benchmark still works.

Motivation
----------

Existing BPF bench tests measure individual operations (map lookups,
kprobes, ring buffers) in isolation.  Production BPF programs combine
parsing, map lookups, branching, and packet rewriting in a single call
chain.  The performance characteristics of such programs depend on the
interaction of these operations -- register pressure, spills, inlining
decisions, branch layout -- which isolated micro-benchmarks do not
capture.

This benchmark implements a simplified L4 load-balancer modeled after
katran [1].  The BPF program reproduces katran's core datapath:

  L3/L4 parsing -> VIP hash lookup -> per-CPU LRU connection table
  with consistent-hash fallback -> real server selection -> per-VIP
  and per-real stats -> IPIP/IP6IP6 encapsulation

The BPF code exercises hash maps, array-of-maps (per-CPU LRU),
percpu arrays, jhash, bpf_xdp_adjust_head(), bpf_ktime_get_ns(),
and bpf_get_smp_processor_id() in a single pipeline.
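
As a rough illustration of the lookup chain (not katran's or this
series' actual code -- the map sizes, struct layouts, and the FNV hash
standing in for jhash are all hypothetical), the LRU-then-consistent-hash
backend selection can be sketched in plain C:

```c
#include <stdint.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-ins for the real BPF maps. */
#define LRU_SLOTS	4096
#define CH_RING_SIZE	65537
#define NUM_REALS	16

struct flow_key { uint32_t saddr, daddr; uint16_t sport, dport; uint8_t proto; };
struct lru_entry { struct flow_key key; uint32_t real_id; int valid; };

static struct lru_entry lru[LRU_SLOTS];	/* stand-in for the per-CPU LRU map */
static uint32_t ch_ring[CH_RING_SIZE];	/* stand-in for the CH ring of real ids */

/* FNV-1a as a toy stand-in for jhash; keys must be zero-initialized so
 * struct padding bytes hash deterministically.
 */
static uint32_t hash_flow(const struct flow_key *k)
{
	const uint8_t *p = (const uint8_t *)k;
	uint32_t h = 2166136261u;

	for (size_t i = 0; i < sizeof(*k); i++)
		h = (h ^ p[i]) * 16777619u;
	return h;
}

static uint32_t pick_real(const struct flow_key *k)
{
	uint32_t h = hash_flow(k);
	struct lru_entry *e = &lru[h % LRU_SLOTS];

	/* LRU hit: reuse the backend already chosen for this flow */
	if (e->valid && !memcmp(&e->key, k, sizeof(*k)))
		return e->real_id;

	/* LRU miss: consistent-hash fallback, then cache the choice */
	e->key = *k;
	e->real_id = ch_ring[h % CH_RING_SIZE];
	e->valid = 1;
	return e->real_id;
}

/* Returns 1 if two lookups of one flow agree and pick a valid real. */
static int demo(void)
{
	struct flow_key k;
	uint32_t r1, r2;

	for (uint32_t i = 0; i < CH_RING_SIZE; i++)
		ch_ring[i] = i % NUM_REALS;

	memset(&k, 0, sizeof(k));
	k.saddr = 0x0a000001; k.daddr = 0x0a000002;
	k.sport = 12345; k.dport = 80; k.proto = 6;

	r1 = pick_real(&k);	/* LRU miss -> CH ring */
	r2 = pick_real(&k);	/* LRU hit  -> cached real */
	return r1 == r2 && r1 < NUM_REALS;
}
```

The point of the sketch is the two-level structure: the cheap cached
path on LRU hit versus the full consistent-hash computation on miss,
which is why the lru-hit and lru-miss scenarios below differ by an
order of magnitude.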

This is intended as the first in a series of BPF workload benchmarks;
future entries will cover other use cases (sched_ext, etc.).

Design
------

A userspace loop calling bpf_prog_test_run_opts(repeat=1) would
measure syscall overhead, not BPF program cost -- the ~4 ns early-exit
paths would be buried under kernel entry/exit.  Using repeat=N is
also unsuitable: the kernel re-runs the same packet without resetting
state between iterations, so the second iteration of an encap scenario
would process an already-encapsulated packet.

Instead, timing is measured inside the BPF program using
bpf_get_cpu_time_counter().  BENCH_BPF_LOOP() brackets N iterations
with counter reads, runs a caller-supplied reset block between
iterations to undo side effects (e.g. strip encapsulation), and
records the elapsed time per batch.  One extra untimed iteration
runs afterward for output validation.

Auto-calibration picks a batch size targeting ~10 ms per invocation.
A proportionality sanity check verifies that 2N iterations take ~2x
as long as N.
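
The calibration arithmetic reduces to the following (a sketch with
hypothetical function names; the 10 ms target is the one described
above, and 84 ns/op yields the batch_iters=119047 seen in the sample
output below):

```c
/* Pick a batch size so one invocation runs for roughly 10 ms. */
static unsigned long calibrate_batch_iters(unsigned long ns_per_op)
{
	const unsigned long target_ns = 10UL * 1000 * 1000;	/* ~10 ms */
	unsigned long iters = ns_per_op ? target_ns / ns_per_op : 1;

	return iters ? iters : 1;
}

/* Sanity check: 2N iterations should take ~2x as long as N, so
 * t(2N)/t(N) should be close to 2.0 if per-op cost is stable.
 */
static double proportionality_ratio(unsigned long long t_n,
				    unsigned long long t_2n)
{
	return (double)t_2n / (double)t_n;
}
```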

24 scenarios cover the code-path matrix:

  - Protocol: TCP, UDP
  - Address family: IPv4, IPv6, cross-AF (IPv4-in-IPv6)
  - LRU state: hit, miss (16M flow space), diverse (4K flows), cold
  - Consistent-hash: direct (LRU bypass)
  - TCP flags: SYN (skip LRU, force CH), RST (skip LRU insert)
  - Early exits: unknown VIP, non-IP, ICMP, fragments, IP options

Each scenario validates correctness before benchmarking by comparing
the output packet byte-for-byte against a pre-built expected packet
and checking BPF map counters.
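
The packet comparison itself is just a length check plus memcmp; a
minimal sketch (helper name hypothetical):

```c
#include <stddef.h>
#include <string.h>

/* Compare the packet produced by the XDP program against the
 * pre-built expected packet, byte for byte.
 */
static int packet_matches(const unsigned char *got, size_t got_len,
			  const unsigned char *want, size_t want_len)
{
	return got_len == want_len && memcmp(got, want, got_len) == 0;
}
```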

Sample single-scenario output:

  $ sudo ./bench -a -w3 -p1 xdp-lb --scenario tcp-v4-lru-miss
  Calibration: 84 ns/op, batch_iters=119047 (~9ms/batch)
  Proportionality check: 2N/N ratio=1.9996 (ok)
  Validating scenario 'tcp-v4-lru-miss' (batch_iters=119047):
    [tcp-v4-lru-miss] PASS  (XDP_TX) IPv4 TCP, LRU miss (16M flow space), CH lookup
    Flow diversity: 16777216 unique src addrs (mask 0xffffff)
    Cold LRU: enabled (per-batch generation)

  Scenario: tcp-v4-lru-miss - IPv4 TCP, LRU miss (16M flow space), CH lookup
  Batch size: 119047 iterations/invocation (+1 for validation)

  In-BPF timing: 203 samples, 119047 ops/batch
    median 856.8 ns/op, stddev 39.4, CV 4.54% [min 817.4, max 1173.6]
    p50 856.8, p75 881.4, p90 918.3, p95 943.3, p99 976.3
    NOTE: right-skewed distribution (tail 3.6x the body)

    Distribution (ns/op):
         <p1 : 1         (below range)
         820 : 4         |***                                     |
         830 : 31        |*****************************           |
         840 : 42        |****************************************|
         850 : 33        |*******************************         |
         860 : 25        |***********************                 |
         870 : 15        |**************                          |
         880 : 8         |*******                                 |
         890 : 14        |*************                           |
         900 : 6         |*****                                   |
         910 : 6         |*****                                   |
         920 : 2         |*                                       |
         930 : 5         |****                                    |
         940 : 2         |*                                       |
         950 : 3         |**                                      |
         960 : 2         |*                                       |
         970 : 2         |*                                       |
        >p99 : 2         (above range)

Sample run script output:

  $ ./benchs/run_bench_xdp_lb.sh

  +----------------------------------+------+----------+---------+--------+----------+----------+----------+----------+
  | Single-flow baseline             |    n |      p50 |  stddev |     CV |      min |      p90 |      p99 |      max |
  +----------------------------------+------+----------+---------+--------+----------+----------+----------+----------+
  | tcp-v4-lru-hit                   |  202 |    83.14 |    0.16 |  0.19% |    82.79 |    83.31 |    83.51 |    83.60 |
  | tcp-v4-ch                        |  201 |    92.26 |    0.12 |  0.13% |    92.05 |    92.41 |    92.57 |    92.68 |
  | tcp-v6-lru-hit                   |  202 |    81.00 |    0.12 |  0.14% |    80.80 |    81.15 |    81.45 |    81.56 |
  | tcp-v6-ch                        |  201 |   106.36 |    0.14 |  0.13% |   106.07 |   106.55 |   106.73 |   107.03 |
  | udp-v4-lru-hit                   |  202 |   114.65 |    0.17 |  0.15% |   114.22 |   114.85 |   115.02 |   115.06 |
  | udp-v6-lru-hit                   |  297 |   112.91 |    0.17 |  0.15% |   112.56 |   113.13 |   113.45 |   113.50 |
  | tcp-v4v6-lru-hit                 |  298 |    81.28 |    1.11 |  1.37% |    80.09 |    82.04 |    86.32 |    86.71 |
  +----------------------------------+------+----------+---------+--------+----------+----------+----------+----------+
  | Diverse flows (4K src addrs)     |    n |      p50 |  stddev |     CV |      min |      p90 |      p99 |      max |
  +----------------------------------+------+----------+---------+--------+----------+----------+----------+----------+
  | tcp-v4-lru-diverse               |  272 |    93.43 |    0.38 |  0.40% |    92.76 |    93.92 |    94.97 |    95.30 |
  | tcp-v4-ch-diverse                |  291 |    94.92 |    1.88 |  1.97% |    94.08 |    97.70 |   102.66 |   102.86 |
  | tcp-v6-lru-diverse               |  270 |    89.43 |    1.85 |  2.06% |    88.42 |    91.34 |    99.04 |   100.00 |
  | tcp-v6-ch-diverse                |  291 |   108.85 |    0.23 |  0.21% |   108.58 |   109.04 |   110.26 |   110.73 |
  | udp-v4-lru-diverse               |  268 |   126.66 |    2.04 |  1.60% |   124.95 |   129.11 |   137.47 |   138.29 |
  +----------------------------------+------+----------+---------+--------+----------+----------+----------+----------+
  | TCP flags                        |    n |      p50 |  stddev |     CV |      min |      p90 |      p99 |      max |
  +----------------------------------+------+----------+---------+--------+----------+----------+----------+----------+
  | tcp-v4-syn                       |  204 |   787.60 |    0.92 |  0.12% |   785.53 |   788.68 |   790.23 |   791.21 |
  | tcp-v4-rst-miss                  |  226 |   160.33 |    1.12 |  0.70% |   158.67 |   161.71 |   164.23 |   164.92 |
  +----------------------------------+------+----------+---------+--------+----------+----------+----------+----------+
  | LRU stress                       |    n |      p50 |  stddev |     CV |      min |      p90 |      p99 |      max |
  +----------------------------------+------+----------+---------+--------+----------+----------+----------+----------+
  | tcp-v4-lru-miss                  |  203 |   855.68 |   31.97 |  3.71% |   813.63 |   908.00 |   952.04 |   979.49 |
  | udp-v4-lru-miss                  |  211 |   854.79 |   48.24 |  5.55% |   819.49 |   928.10 |  1061.91 |  1129.39 |
  | tcp-v4-lru-warmup                |  286 |   534.18 |   16.62 |  3.09% |   510.34 |   559.55 |   592.37 |   594.51 |
  +----------------------------------+------+----------+---------+--------+----------+----------+----------+----------+
  | Early exits                      |    n |      p50 |  stddev |     CV |      min |      p90 |      p99 |      max |
  +----------------------------------+------+----------+---------+--------+----------+----------+----------+----------+
  | pass-v4-no-vip                   |  210 |    25.80 |    0.04 |  0.15% |    25.72 |    25.85 |    25.88 |    26.04 |
  | pass-v6-no-vip                   |  282 |    27.67 |    0.06 |  0.20% |    27.60 |    27.73 |    27.96 |    28.00 |
  | pass-v4-icmp                     |  215 |     5.57 |    0.06 |  1.12% |     5.47 |     5.67 |     5.76 |     5.77 |
  | pass-non-ip                      |  207 |     4.80 |    0.10 |  2.14% |     4.67 |     4.94 |     5.18 |     5.23 |
  | drop-v4-frag                     |  207 |     4.81 |    0.02 |  0.48% |     4.79 |     4.85 |     4.88 |     4.89 |
  | drop-v4-options                  |  206 |     4.82 |    0.09 |  1.95% |     4.74 |     5.00 |     5.17 |     5.24 |
  | drop-v6-frag                     |  201 |     4.93 |    0.09 |  1.89% |     4.86 |     5.11 |     5.21 |     5.36 |
  +----------------------------------+------+----------+---------+--------+----------+----------+----------+----------+

Patches
-------

Patch 1 adds bench_force_done() to the bench framework so benchmarks
can signal early completion when enough samples have been collected.

Patch 2 adds the shared BPF batch-timing library (BPF-side timing
arrays, BENCH_BPF_LOOP macro, userspace statistics and calibration).

Patch 3 adds the common header shared between the BPF program and
userspace (flow_key, vip_definition, real_definition, encap helpers).

Patch 4 adds the XDP load-balancer BPF program.

Patch 5 adds the userspace benchmark driver with 24 scenarios,
packet construction, validation, and bench framework integration.

Patch 6 adds the run script for running all scenarios.

[1] https://github.com/facebookincubator/katran

Puranjay Mohan (6):
  selftests/bpf: Add bench_force_done() for early benchmark completion
  selftests/bpf: Add BPF batch-timing library
  selftests/bpf: Add XDP load-balancer common definitions
  selftests/bpf: Add XDP load-balancer BPF program
  selftests/bpf: Add XDP load-balancer benchmark driver
  selftests/bpf: Add XDP load-balancer benchmark run script

 tools/testing/selftests/bpf/Makefile          |    4 +
 tools/testing/selftests/bpf/bench.c           |   11 +
 tools/testing/selftests/bpf/bench.h           |    1 +
 .../testing/selftests/bpf/bench_bpf_timing.h  |   49 +
 .../selftests/bpf/benchs/bench_bpf_timing.c   |  415 ++++++
 .../selftests/bpf/benchs/bench_xdp_lb.c       | 1160 +++++++++++++++++
 .../selftests/bpf/benchs/run_bench_xdp_lb.sh  |   84 ++
 .../bpf/progs/bench_bpf_timing.bpf.h          |   68 +
 .../selftests/bpf/progs/xdp_lb_bench.c        |  653 ++++++++++
 .../selftests/bpf/xdp_lb_bench_common.h       |  112 ++
 10 files changed, 2557 insertions(+)
 create mode 100644 tools/testing/selftests/bpf/bench_bpf_timing.h
 create mode 100644 tools/testing/selftests/bpf/benchs/bench_bpf_timing.c
 create mode 100644 tools/testing/selftests/bpf/benchs/bench_xdp_lb.c
 create mode 100755 tools/testing/selftests/bpf/benchs/run_bench_xdp_lb.sh
 create mode 100644 tools/testing/selftests/bpf/progs/bench_bpf_timing.bpf.h
 create mode 100644 tools/testing/selftests/bpf/progs/xdp_lb_bench.c
 create mode 100644 tools/testing/selftests/bpf/xdp_lb_bench_common.h


base-commit: 31f61ac33032ee87ea404d6d996ba2c386502a36
-- 
2.52.0



Thread overview: 16+ messages
2026-04-20 11:17 [RFC PATCH bpf-next 0/6] selftests/bpf: Add XDP load-balancer benchmark Puranjay Mohan
2026-04-20 11:17 ` [RFC PATCH bpf-next 1/6] selftests/bpf: Add bench_force_done() for early benchmark completion Puranjay Mohan
2026-04-20 12:41   ` sashiko-bot
2026-04-20 15:32   ` Mykyta Yatsenko
2026-04-20 11:17 ` [RFC PATCH bpf-next 2/6] selftests/bpf: Add BPF batch-timing library Puranjay Mohan
2026-04-20 13:18   ` sashiko-bot
2026-04-22  1:10   ` Alexei Starovoitov
2026-04-20 11:17 ` [RFC PATCH bpf-next 3/6] selftests/bpf: Add XDP load-balancer common definitions Puranjay Mohan
2026-04-20 13:26   ` sashiko-bot
2026-04-20 11:17 ` [RFC PATCH bpf-next 4/6] selftests/bpf: Add XDP load-balancer BPF program Puranjay Mohan
2026-04-20 13:57   ` sashiko-bot
2026-04-20 11:17 ` [RFC PATCH bpf-next 5/6] selftests/bpf: Add XDP load-balancer benchmark driver Puranjay Mohan
2026-04-20 17:11   ` sashiko-bot
2026-04-20 11:17 ` [RFC PATCH bpf-next 6/6] selftests/bpf: Add XDP load-balancer benchmark run script Puranjay Mohan
2026-04-20 17:36   ` sashiko-bot
2026-04-22  1:16 ` [RFC PATCH bpf-next 0/6] selftests/bpf: Add XDP load-balancer benchmark Alexei Starovoitov
