* [PATCH bpf-next v3] selftests/bpf: Add LPM trie microbenchmarks
@ 2025-07-22 15:01 Matt Fleming
2025-07-28 14:34 ` Alexei Starovoitov
2025-08-08 15:51 ` Jesper Dangaard Brouer
0 siblings, 2 replies; 8+ messages in thread
From: Matt Fleming @ 2025-07-22 15:01 UTC (permalink / raw)
To: Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko,
Eduard Zingerman
Cc: Shuah Khan, kernel-team, Jesper Dangaard Brouer, linux-kselftest,
linux-kernel, bpf, Martin KaFai Lau, Yonghong Song, netdev,
Matt Fleming
From: Matt Fleming <mfleming@cloudflare.com>
Add benchmarks for the standard set of operations: lookup, update,
delete. Also, include a benchmark for trie_free() which is known to have
terrible performance for maps with many entries.
Benchmarks operate on tries without gaps in the key range, i.e. each
test begins with a trie with valid keys in the range [0, nr_entries).
This is intended to cause maximum branching when traversing the trie.
All measurements are recorded inside the kernel to remove syscall
overhead.
Most benchmarks run an XDP program to generate stats, but free needs to
collect latencies using fentry/fexit on map_free_deferred() because it's
not possible to attach fentry directly to functions in lpm_trie.c since
commit c83508da5620 ("bpf: Avoid deadlock caused by nested kprobe and
fentry bpf programs"), and there's no way to create/destroy a map from
within an XDP program.
Here is example output from an AMD EPYC 9684X 96-Core machine for each
of the benchmarks using a trie with 10K entries and a 32-bit prefix
length, e.g.
$ ./bench lpm-trie-$op \
--prefix_len=32 \
--producers=1 \
--nr_entries=10000
lookup: throughput 7.423 ± 0.023 M ops/s ( 7.423M ops/prod), latency 134.710 ns/op
update: throughput 2.643 ± 0.015 M ops/s ( 2.643M ops/prod), latency 378.310 ns/op
delete: throughput 0.712 ± 0.008 M ops/s ( 0.712M ops/prod), latency 1405.152 ns/op
free: throughput 0.574 ± 0.003 K ops/s ( 0.574K ops/prod), latency 1.743 ms/op
Tested-by: Jesper Dangaard Brouer <hawk@kernel.org>
Reviewed-by: Jesper Dangaard Brouer <hawk@kernel.org>
Signed-off-by: Matt Fleming <mfleming@cloudflare.com>
---
Changes in v3:
- Replace BPF_CORE_READ() with BPF_CORE_READ_STR_INTO() to avoid
gcc-bpf CI build failure
Changes in v2:
- Add Jesper's Tested-by and Reviewed-by tags
- Remove use of atomic_*() in favour of __sync_add_and_fetch()
- Use a file-local 'deleted_entries' in the DELETE op benchmark and add
a comment explaining why non-atomic accesses are safe.
- Bump 'hits' with the number of bpf_loop() loops actually executed
tools/testing/selftests/bpf/Makefile | 2 +
tools/testing/selftests/bpf/bench.c | 10 +
tools/testing/selftests/bpf/bench.h | 1 +
.../selftests/bpf/benchs/bench_lpm_trie_map.c | 337 ++++++++++++++++++
.../selftests/bpf/progs/lpm_trie_bench.c | 171 +++++++++
.../selftests/bpf/progs/lpm_trie_map.c | 19 +
6 files changed, 540 insertions(+)
create mode 100644 tools/testing/selftests/bpf/benchs/bench_lpm_trie_map.c
create mode 100644 tools/testing/selftests/bpf/progs/lpm_trie_bench.c
create mode 100644 tools/testing/selftests/bpf/progs/lpm_trie_map.c
diff --git a/tools/testing/selftests/bpf/Makefile b/tools/testing/selftests/bpf/Makefile
index 910d8d6402ef..10a5f1d0fa41 100644
--- a/tools/testing/selftests/bpf/Makefile
+++ b/tools/testing/selftests/bpf/Makefile
@@ -815,6 +815,7 @@ $(OUTPUT)/bench_bpf_hashmap_lookup.o: $(OUTPUT)/bpf_hashmap_lookup.skel.h
$(OUTPUT)/bench_htab_mem.o: $(OUTPUT)/htab_mem_bench.skel.h
$(OUTPUT)/bench_bpf_crypto.o: $(OUTPUT)/crypto_bench.skel.h
$(OUTPUT)/bench_sockmap.o: $(OUTPUT)/bench_sockmap_prog.skel.h
+$(OUTPUT)/bench_lpm_trie_map.o: $(OUTPUT)/lpm_trie_bench.skel.h $(OUTPUT)/lpm_trie_map.skel.h
$(OUTPUT)/bench.o: bench.h testing_helpers.h $(BPFOBJ)
$(OUTPUT)/bench: LDLIBS += -lm
$(OUTPUT)/bench: $(OUTPUT)/bench.o \
@@ -836,6 +837,7 @@ $(OUTPUT)/bench: $(OUTPUT)/bench.o \
$(OUTPUT)/bench_htab_mem.o \
$(OUTPUT)/bench_bpf_crypto.o \
$(OUTPUT)/bench_sockmap.o \
+ $(OUTPUT)/bench_lpm_trie_map.o \
#
$(call msg,BINARY,,$@)
$(Q)$(CC) $(CFLAGS) $(LDFLAGS) $(filter %.a %.o,$^) $(LDLIBS) -o $@
diff --git a/tools/testing/selftests/bpf/bench.c b/tools/testing/selftests/bpf/bench.c
index ddd73d06a1eb..fd15f60fd5a8 100644
--- a/tools/testing/selftests/bpf/bench.c
+++ b/tools/testing/selftests/bpf/bench.c
@@ -284,6 +284,7 @@ extern struct argp bench_htab_mem_argp;
extern struct argp bench_trigger_batch_argp;
extern struct argp bench_crypto_argp;
extern struct argp bench_sockmap_argp;
+extern struct argp bench_lpm_trie_map_argp;
static const struct argp_child bench_parsers[] = {
{ &bench_ringbufs_argp, 0, "Ring buffers benchmark", 0 },
@@ -299,6 +300,7 @@ static const struct argp_child bench_parsers[] = {
{ &bench_trigger_batch_argp, 0, "BPF triggering benchmark", 0 },
{ &bench_crypto_argp, 0, "bpf crypto benchmark", 0 },
{ &bench_sockmap_argp, 0, "bpf sockmap benchmark", 0 },
+ { &bench_lpm_trie_map_argp, 0, "LPM trie map benchmark", 0 },
{},
};
@@ -558,6 +560,10 @@ extern const struct bench bench_htab_mem;
extern const struct bench bench_crypto_encrypt;
extern const struct bench bench_crypto_decrypt;
extern const struct bench bench_sockmap;
+extern const struct bench bench_lpm_trie_lookup;
+extern const struct bench bench_lpm_trie_update;
+extern const struct bench bench_lpm_trie_delete;
+extern const struct bench bench_lpm_trie_free;
static const struct bench *benchs[] = {
&bench_count_global,
@@ -625,6 +631,10 @@ static const struct bench *benchs[] = {
&bench_crypto_encrypt,
&bench_crypto_decrypt,
&bench_sockmap,
+ &bench_lpm_trie_lookup,
+ &bench_lpm_trie_update,
+ &bench_lpm_trie_delete,
+ &bench_lpm_trie_free,
};
static void find_benchmark(void)
diff --git a/tools/testing/selftests/bpf/bench.h b/tools/testing/selftests/bpf/bench.h
index 005c401b3e22..bea323820ffb 100644
--- a/tools/testing/selftests/bpf/bench.h
+++ b/tools/testing/selftests/bpf/bench.h
@@ -46,6 +46,7 @@ struct bench_res {
unsigned long gp_ns;
unsigned long gp_ct;
unsigned int stime;
+ unsigned long duration_ns;
};
struct bench {
diff --git a/tools/testing/selftests/bpf/benchs/bench_lpm_trie_map.c b/tools/testing/selftests/bpf/benchs/bench_lpm_trie_map.c
new file mode 100644
index 000000000000..435b5c7ceee9
--- /dev/null
+++ b/tools/testing/selftests/bpf/benchs/bench_lpm_trie_map.c
@@ -0,0 +1,337 @@
+// SPDX-License-Identifier: GPL-2.0
+/* Copyright (c) 2025 Cloudflare */
+
+/*
+ * All of these benchmarks operate on tries with keys in the range
+ * [0, args.nr_entries), i.e. there are no gaps or partially filled
+ * branches of the trie for any key < args.nr_entries.
+ *
+ * This gives an idea of worst-case behaviour.
+ */
+
+#include <argp.h>
+#include <linux/time64.h>
+#include <linux/if_ether.h>
+#include "lpm_trie_bench.skel.h"
+#include "lpm_trie_map.skel.h"
+#include "bench.h"
+#include "testing_helpers.h"
+
+static struct ctx {
+ struct lpm_trie_bench *bench;
+} ctx;
+
+static struct {
+ __u32 nr_entries;
+ __u32 prefixlen;
+} args = {
+ .nr_entries = 10000,
+ .prefixlen = 32,
+};
+
+enum {
+ ARG_NR_ENTRIES = 9000,
+ ARG_PREFIX_LEN,
+};
+
+static const struct argp_option opts[] = {
+ { "nr_entries", ARG_NR_ENTRIES, "NR_ENTRIES", 0,
+ "Number of unique entries in the LPM trie" },
+ { "prefix_len", ARG_PREFIX_LEN, "PREFIX_LEN", 0,
+ "Number of prefix bits to use in the LPM trie" },
+ {},
+};
+
+static error_t lpm_parse_arg(int key, char *arg, struct argp_state *state)
+{
+ long ret;
+
+ switch (key) {
+ case ARG_NR_ENTRIES:
+ ret = strtol(arg, NULL, 10);
+ if (ret < 1 || ret > UINT_MAX) {
+			fprintf(stderr, "Invalid nr_entries count.\n");
+ argp_usage(state);
+ }
+ args.nr_entries = ret;
+ break;
+ case ARG_PREFIX_LEN:
+ ret = strtol(arg, NULL, 10);
+ if (ret < 1 || ret > UINT_MAX) {
+			fprintf(stderr, "Invalid prefix_len value.\n");
+ argp_usage(state);
+ }
+ args.prefixlen = ret;
+ break;
+ default:
+ return ARGP_ERR_UNKNOWN;
+ }
+ return 0;
+}
+
+const struct argp bench_lpm_trie_map_argp = {
+ .options = opts,
+ .parser = lpm_parse_arg,
+};
+
+static void __lpm_validate(void)
+{
+ if (env.consumer_cnt != 0) {
+ fprintf(stderr, "benchmark doesn't support consumer!\n");
+ exit(1);
+ }
+
+ if ((1UL << args.prefixlen) < args.nr_entries) {
+ fprintf(stderr, "prefix_len value too small for nr_entries!\n");
+ exit(1);
+	}
+}
+
+enum { OP_LOOKUP = 1, OP_UPDATE, OP_DELETE, OP_FREE };
+
+static void lpm_delete_validate(void)
+{
+ __lpm_validate();
+
+ if (env.producer_cnt != 1) {
+ fprintf(stderr,
+ "lpm-trie-delete requires a single producer!\n");
+ exit(1);
+ }
+}
+
+static void lpm_free_validate(void)
+{
+ __lpm_validate();
+
+ if (env.producer_cnt != 1) {
+ fprintf(stderr, "lpm-trie-free requires a single producer!\n");
+ exit(1);
+ }
+}
+
+static void fill_map(int map_fd)
+{
+ int i, err;
+
+ for (i = 0; i < args.nr_entries; i++) {
+ struct trie_key {
+ __u32 prefixlen;
+ __u32 data;
+ } key = { args.prefixlen, i };
+ __u32 val = 1;
+
+ err = bpf_map_update_elem(map_fd, &key, &val, BPF_NOEXIST);
+ if (err) {
+ fprintf(stderr, "failed to add key %d to map: %d\n",
+ key.data, -err);
+ exit(1);
+ }
+ }
+}
+
+static void __lpm_setup(void)
+{
+ ctx.bench = lpm_trie_bench__open_and_load();
+ if (!ctx.bench) {
+ fprintf(stderr, "failed to open skeleton\n");
+ exit(1);
+ }
+
+ ctx.bench->bss->nr_entries = args.nr_entries;
+ ctx.bench->bss->prefixlen = args.prefixlen;
+
+ if (lpm_trie_bench__attach(ctx.bench)) {
+ fprintf(stderr, "failed to attach skeleton\n");
+ exit(1);
+ }
+}
+
+static void lpm_setup(void)
+{
+ int fd;
+
+ __lpm_setup();
+
+ fd = bpf_map__fd(ctx.bench->maps.trie_map);
+ fill_map(fd);
+}
+
+static void lpm_lookup_setup(void)
+{
+ lpm_setup();
+
+ ctx.bench->bss->op = OP_LOOKUP;
+}
+
+static void lpm_update_setup(void)
+{
+ lpm_setup();
+
+ ctx.bench->bss->op = OP_UPDATE;
+}
+
+static void lpm_delete_setup(void)
+{
+ lpm_setup();
+
+ ctx.bench->bss->op = OP_DELETE;
+}
+
+static void lpm_free_setup(void)
+{
+ __lpm_setup();
+ ctx.bench->bss->op = OP_FREE;
+}
+
+static void lpm_measure(struct bench_res *res)
+{
+ res->hits = atomic_swap(&ctx.bench->bss->hits, 0);
+ res->duration_ns = atomic_swap(&ctx.bench->bss->duration_ns, 0);
+}
+
+/* For LOOKUP, UPDATE, and DELETE */
+static void *lpm_producer(void *unused __always_unused)
+{
+ int err;
+ char in[ETH_HLEN]; /* unused */
+
+ LIBBPF_OPTS(bpf_test_run_opts, opts, .data_in = in,
+ .data_size_in = sizeof(in), .repeat = 1, );
+
+ while (true) {
+ int fd = bpf_program__fd(ctx.bench->progs.run_bench);
+ err = bpf_prog_test_run_opts(fd, &opts);
+ if (err) {
+ fprintf(stderr, "failed to run BPF prog: %d\n", err);
+ exit(1);
+ }
+
+ if (opts.retval < 0) {
+ fprintf(stderr, "BPF prog returned error: %d\n",
+ opts.retval);
+ exit(1);
+ }
+
+ if (ctx.bench->bss->op == OP_DELETE && opts.retval == 1) {
+ /* trie_map needs to be refilled */
+ fill_map(bpf_map__fd(ctx.bench->maps.trie_map));
+ }
+ }
+
+ return NULL;
+}
+
+static void *lpm_free_producer(void *unused __always_unused)
+{
+ while (true) {
+ struct lpm_trie_map *skel;
+
+ skel = lpm_trie_map__open_and_load();
+ if (!skel) {
+ fprintf(stderr, "failed to open skeleton\n");
+ exit(1);
+ }
+
+ fill_map(bpf_map__fd(skel->maps.trie_free_map));
+ lpm_trie_map__destroy(skel);
+ }
+
+ return NULL;
+}
+
+static void free_ops_report_progress(int iter, struct bench_res *res,
+ long delta_ns)
+{
+ double hits_per_sec, hits_per_prod;
+ double rate_divisor = 1000.0;
+ char rate = 'K';
+
+ hits_per_sec = res->hits / (res->duration_ns / (double)NSEC_PER_SEC) /
+ rate_divisor;
+ hits_per_prod = hits_per_sec / env.producer_cnt;
+
+ printf("Iter %3d (%7.3lfus): ", iter,
+ (delta_ns - NSEC_PER_SEC) / 1000.0);
+ printf("hits %8.3lf%c/s (%7.3lf%c/prod)\n", hits_per_sec, rate,
+ hits_per_prod, rate);
+}
+
+static void free_ops_report_final(struct bench_res res[], int res_cnt)
+{
+ double hits_mean = 0.0, hits_stddev = 0.0;
+ double lat_divisor = 1000000.0;
+ double rate_divisor = 1000.0;
+ const char *unit = "ms";
+ double latency = 0.0;
+ char rate = 'K';
+ int i;
+
+ for (i = 0; i < res_cnt; i++) {
+ double val = res[i].hits / rate_divisor /
+ (res[i].duration_ns / (double)NSEC_PER_SEC);
+ hits_mean += val / (0.0 + res_cnt);
+ latency += res[i].duration_ns / res[i].hits / (0.0 + res_cnt);
+ }
+
+ if (res_cnt > 1) {
+ for (i = 0; i < res_cnt; i++) {
+ double val =
+ res[i].hits / rate_divisor /
+ (res[i].duration_ns / (double)NSEC_PER_SEC);
+ hits_stddev += (hits_mean - val) * (hits_mean - val) /
+ (res_cnt - 1.0);
+ }
+
+ hits_stddev = sqrt(hits_stddev);
+ }
+ printf("Summary: throughput %8.3lf \u00B1 %5.3lf %c ops/s (%7.3lf%c ops/prod), ",
+ hits_mean, hits_stddev, rate, hits_mean / env.producer_cnt,
+ rate);
+ printf("latency %8.3lf %s/op\n",
+ latency / lat_divisor / env.producer_cnt, unit);
+}
+
+const struct bench bench_lpm_trie_lookup = {
+ .name = "lpm-trie-lookup",
+ .argp = &bench_lpm_trie_map_argp,
+ .validate = __lpm_validate,
+ .setup = lpm_lookup_setup,
+ .producer_thread = lpm_producer,
+ .measure = lpm_measure,
+ .report_progress = ops_report_progress,
+ .report_final = ops_report_final,
+};
+
+const struct bench bench_lpm_trie_update = {
+ .name = "lpm-trie-update",
+ .argp = &bench_lpm_trie_map_argp,
+ .validate = __lpm_validate,
+ .setup = lpm_update_setup,
+ .producer_thread = lpm_producer,
+ .measure = lpm_measure,
+ .report_progress = ops_report_progress,
+ .report_final = ops_report_final,
+};
+
+const struct bench bench_lpm_trie_delete = {
+ .name = "lpm-trie-delete",
+ .argp = &bench_lpm_trie_map_argp,
+ .validate = lpm_delete_validate,
+ .setup = lpm_delete_setup,
+ .producer_thread = lpm_producer,
+ .measure = lpm_measure,
+ .report_progress = ops_report_progress,
+ .report_final = ops_report_final,
+};
+
+const struct bench bench_lpm_trie_free = {
+ .name = "lpm-trie-free",
+ .argp = &bench_lpm_trie_map_argp,
+ .validate = lpm_free_validate,
+ .setup = lpm_free_setup,
+ .producer_thread = lpm_free_producer,
+ .measure = lpm_measure,
+ .report_progress = free_ops_report_progress,
+ .report_final = free_ops_report_final,
+};
diff --git a/tools/testing/selftests/bpf/progs/lpm_trie_bench.c b/tools/testing/selftests/bpf/progs/lpm_trie_bench.c
new file mode 100644
index 000000000000..522e1cbef490
--- /dev/null
+++ b/tools/testing/selftests/bpf/progs/lpm_trie_bench.c
@@ -0,0 +1,171 @@
+// SPDX-License-Identifier: GPL-2.0
+/* Copyright (c) 2025 Cloudflare */
+
+#include <vmlinux.h>
+#include <bpf/bpf_tracing.h>
+#include <bpf/bpf_helpers.h>
+#include <bpf/bpf_core_read.h>
+#include "bpf_misc.h"
+
+#define BPF_OBJ_NAME_LEN 16U
+#define MAX_ENTRIES 100000000
+#define NR_LOOPS 10000
+
+struct trie_key {
+ __u32 prefixlen;
+ __u32 data;
+};
+
+char _license[] SEC("license") = "GPL";
+
+struct {
+ __uint(type, BPF_MAP_TYPE_HASH);
+ __uint(max_entries, 512);
+ __type(key, struct bpf_map *);
+ __type(value, __u64);
+} latency_free_start SEC(".maps");
+
+/* Filled by userspace. See fill_map() in bench_lpm_trie_map.c */
+struct {
+ __uint(type, BPF_MAP_TYPE_LPM_TRIE);
+ __type(key, struct trie_key);
+ __type(value, __u32);
+ __uint(map_flags, BPF_F_NO_PREALLOC);
+ __uint(max_entries, MAX_ENTRIES);
+} trie_map SEC(".maps");
+
+long hits;
+long duration_ns;
+
+/* Configured from userspace */
+__u32 nr_entries;
+__u32 prefixlen;
+__u8 op;
+
+SEC("fentry/bpf_map_free_deferred")
+int BPF_PROG(trie_free_entry, struct work_struct *work)
+{
+ struct bpf_map *map = container_of(work, struct bpf_map, work);
+ char name[BPF_OBJ_NAME_LEN];
+ u32 map_type;
+ __u64 val;
+
+ map_type = BPF_CORE_READ(map, map_type);
+ if (map_type != BPF_MAP_TYPE_LPM_TRIE)
+ return 0;
+
+ /*
+ * Ideally we'd have access to the map ID but that's already
+ * freed before we enter trie_free().
+ */
+ BPF_CORE_READ_STR_INTO(&name, map, name);
+ if (bpf_strncmp(name, BPF_OBJ_NAME_LEN, "trie_free_map"))
+ return 0;
+
+ val = bpf_ktime_get_ns();
+ bpf_map_update_elem(&latency_free_start, &map, &val, BPF_ANY);
+
+ return 0;
+}
+
+SEC("fexit/bpf_map_free_deferred")
+int BPF_PROG(trie_free_exit, struct work_struct *work)
+{
+ struct bpf_map *map = container_of(work, struct bpf_map, work);
+ __u64 *val;
+
+ val = bpf_map_lookup_elem(&latency_free_start, &map);
+ if (val) {
+ __sync_add_and_fetch(&duration_ns, bpf_ktime_get_ns() - *val);
+ __sync_add_and_fetch(&hits, 1);
+ bpf_map_delete_elem(&latency_free_start, &map);
+ }
+
+ return 0;
+}
+
+static void gen_random_key(struct trie_key *key)
+{
+ key->prefixlen = prefixlen;
+ key->data = bpf_get_prandom_u32() % nr_entries;
+}
+
+static int lookup(__u32 index, __u32 *unused)
+{
+ struct trie_key key;
+
+ gen_random_key(&key);
+ bpf_map_lookup_elem(&trie_map, &key);
+ return 0;
+}
+
+static int update(__u32 index, __u32 *unused)
+{
+ struct trie_key key;
+ u32 val = bpf_get_prandom_u32();
+
+ gen_random_key(&key);
+ bpf_map_update_elem(&trie_map, &key, &val, BPF_EXIST);
+ return 0;
+}
+
+static __u32 deleted_entries;
+
+static int delete (__u32 index, bool *need_refill)
+{
+ struct trie_key key = {
+ .data = deleted_entries,
+ .prefixlen = prefixlen,
+ };
+
+ bpf_map_delete_elem(&trie_map, &key);
+
+ /* Do we need to refill the map? */
+ if (++deleted_entries == nr_entries) {
+ /*
+ * Atomicity isn't required because DELETE only supports
+ * one producer running concurrently. What we need is a
+ * way to track how many entries have been deleted from
+ * the trie between consecutive invocations of the BPF
+ * prog because a single bpf_loop() call might not
+ * delete all entries, e.g. when NR_LOOPS < nr_entries.
+ */
+ deleted_entries = 0;
+ *need_refill = true;
+ return 1;
+ }
+
+ return 0;
+}
+
+SEC("xdp")
+int BPF_PROG(run_bench)
+{
+ bool need_refill = false;
+ u64 start, delta;
+ int loops;
+
+ start = bpf_ktime_get_ns();
+
+ switch (op) {
+ case 1:
+ loops = bpf_loop(NR_LOOPS, lookup, NULL, 0);
+ break;
+ case 2:
+ loops = bpf_loop(NR_LOOPS, update, NULL, 0);
+ break;
+ case 3:
+ loops = bpf_loop(NR_LOOPS, delete, &need_refill, 0);
+ break;
+ default:
+ bpf_printk("invalid benchmark operation\n");
+ return -1;
+ }
+
+ delta = bpf_ktime_get_ns() - start;
+
+ __sync_add_and_fetch(&duration_ns, delta);
+ __sync_add_and_fetch(&hits, loops);
+
+ return need_refill;
+}
diff --git a/tools/testing/selftests/bpf/progs/lpm_trie_map.c b/tools/testing/selftests/bpf/progs/lpm_trie_map.c
new file mode 100644
index 000000000000..2ab43e2cd6c6
--- /dev/null
+++ b/tools/testing/selftests/bpf/progs/lpm_trie_map.c
@@ -0,0 +1,19 @@
+/* SPDX-License-Identifier: GPL-2.0-or-later */
+#include <linux/bpf.h>
+#include <bpf/bpf_helpers.h>
+#include <bpf/bpf_tracing.h>
+
+#define MAX_ENTRIES 100000000
+
+struct trie_key {
+ __u32 prefixlen;
+ __u32 data;
+};
+
+struct {
+ __uint(type, BPF_MAP_TYPE_LPM_TRIE);
+ __type(key, struct trie_key);
+ __type(value, __u32);
+ __uint(map_flags, BPF_F_NO_PREALLOC);
+ __uint(max_entries, MAX_ENTRIES);
+} trie_free_map SEC(".maps");
--
2.34.1
* Re: [PATCH bpf-next v3] selftests/bpf: Add LPM trie microbenchmarks
2025-07-22 15:01 [PATCH bpf-next v3] selftests/bpf: Add LPM trie microbenchmarks Matt Fleming
@ 2025-07-28 14:34 ` Alexei Starovoitov
2025-07-29 13:56 ` Matt Fleming
2025-08-13 16:59 ` Jesper Dangaard Brouer
2025-08-08 15:51 ` Jesper Dangaard Brouer
1 sibling, 2 replies; 8+ messages in thread
From: Alexei Starovoitov @ 2025-07-28 14:34 UTC (permalink / raw)
To: Matt Fleming
Cc: Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko,
Eduard Zingerman, Shuah Khan, kernel-team, Jesper Dangaard Brouer,
open list:KERNEL SELFTEST FRAMEWORK, LKML, bpf, Martin KaFai Lau,
Yonghong Song, Network Development, Matt Fleming
On Tue, Jul 22, 2025 at 9:02 AM Matt Fleming <matt@readmodwrite.com> wrote:
>
> From: Matt Fleming <mfleming@cloudflare.com>
>
> Add benchmarks for the standard set of operations: lookup, update,
> delete. Also, include a benchmark for trie_free() which is known to have
> terrible performance for maps with many entries.
>
> Benchmarks operate on tries without gaps in the key range, i.e. each
> test begins with a trie with valid keys in the range [0, nr_entries).
> This is intended to cause maximum branching when traversing the trie.
Please make a full description of what the test does,
since it's not trivial to decipher from the code.
If I'm reading it correctly, first, the user space
makes the LPM completely full and then lookup/update
use random keys within the range.
But delete looks different. It seems the kernel delete
operation can interleave with user space refilling the LPM,
so ...
> lookup: throughput 7.423 ± 0.023 M ops/s ( 7.423M ops/prod), latency 134.710 ns/op
> update: throughput 2.643 ± 0.015 M ops/s ( 2.643M ops/prod), latency 378.310 ns/op
> delete: throughput 0.712 ± 0.008 M ops/s ( 0.712M ops/prod), latency 1405.152 ns/op
this comparison doesn't look like apples to apples,
since delete will include user space refill time.
> free: throughput 0.574 ± 0.003 K ops/s ( 0.574K ops/prod), latency 1.743 ms/op
Does this measure the free-ing of full LPM ?
Please explain the benchmark logic in words.
Both in the commit log and the comments.
> Tested-by: Jesper Dangaard Brouer <hawk@kernel.org>
> Reviewed-by: Jesper Dangaard Brouer <hawk@kernel.org>
> Signed-off-by: Matt Fleming <mfleming@cloudflare.com>
> ---
>
> Changes in v3:
>
> - Replace BPF_CORE_READ() with BPF_CORE_READ_STR_INTO() to avoid
> gcc-bpf CI build failure
>
> Changes in v2:
>
> - Add Jesper's Tested-by and Revewied-by tags
> - Remove use of atomic_*() in favour of __sync_add_and_fetch()
> - Use a file-local 'deleted_entries' in the DELETE op benchmark and add
> a comment explaining why non-atomic accesses are safe.
> - Bump 'hits' with the number of bpf_loop() loops actually executed
>
> tools/testing/selftests/bpf/Makefile | 2 +
> tools/testing/selftests/bpf/bench.c | 10 +
> tools/testing/selftests/bpf/bench.h | 1 +
> .../selftests/bpf/benchs/bench_lpm_trie_map.c | 337 ++++++++++++++++++
> .../selftests/bpf/progs/lpm_trie_bench.c | 171 +++++++++
> .../selftests/bpf/progs/lpm_trie_map.c | 19 +
> 6 files changed, 540 insertions(+)
> create mode 100644 tools/testing/selftests/bpf/benchs/bench_lpm_trie_map.c
> create mode 100644 tools/testing/selftests/bpf/progs/lpm_trie_bench.c
> create mode 100644 tools/testing/selftests/bpf/progs/lpm_trie_map.c
>
> diff --git a/tools/testing/selftests/bpf/Makefile b/tools/testing/selftests/bpf/Makefile
> index 910d8d6402ef..10a5f1d0fa41 100644
> --- a/tools/testing/selftests/bpf/Makefile
> +++ b/tools/testing/selftests/bpf/Makefile
> @@ -815,6 +815,7 @@ $(OUTPUT)/bench_bpf_hashmap_lookup.o: $(OUTPUT)/bpf_hashmap_lookup.skel.h
> $(OUTPUT)/bench_htab_mem.o: $(OUTPUT)/htab_mem_bench.skel.h
> $(OUTPUT)/bench_bpf_crypto.o: $(OUTPUT)/crypto_bench.skel.h
> $(OUTPUT)/bench_sockmap.o: $(OUTPUT)/bench_sockmap_prog.skel.h
> +$(OUTPUT)/bench_lpm_trie_map.o: $(OUTPUT)/lpm_trie_bench.skel.h $(OUTPUT)/lpm_trie_map.skel.h
> $(OUTPUT)/bench.o: bench.h testing_helpers.h $(BPFOBJ)
> $(OUTPUT)/bench: LDLIBS += -lm
> $(OUTPUT)/bench: $(OUTPUT)/bench.o \
> @@ -836,6 +837,7 @@ $(OUTPUT)/bench: $(OUTPUT)/bench.o \
> $(OUTPUT)/bench_htab_mem.o \
> $(OUTPUT)/bench_bpf_crypto.o \
> $(OUTPUT)/bench_sockmap.o \
> + $(OUTPUT)/bench_lpm_trie_map.o \
> #
> $(call msg,BINARY,,$@)
> $(Q)$(CC) $(CFLAGS) $(LDFLAGS) $(filter %.a %.o,$^) $(LDLIBS) -o $@
> diff --git a/tools/testing/selftests/bpf/bench.c b/tools/testing/selftests/bpf/bench.c
> index ddd73d06a1eb..fd15f60fd5a8 100644
> --- a/tools/testing/selftests/bpf/bench.c
> +++ b/tools/testing/selftests/bpf/bench.c
> @@ -284,6 +284,7 @@ extern struct argp bench_htab_mem_argp;
> extern struct argp bench_trigger_batch_argp;
> extern struct argp bench_crypto_argp;
> extern struct argp bench_sockmap_argp;
> +extern struct argp bench_lpm_trie_map_argp;
>
> static const struct argp_child bench_parsers[] = {
> { &bench_ringbufs_argp, 0, "Ring buffers benchmark", 0 },
> @@ -299,6 +300,7 @@ static const struct argp_child bench_parsers[] = {
> { &bench_trigger_batch_argp, 0, "BPF triggering benchmark", 0 },
> { &bench_crypto_argp, 0, "bpf crypto benchmark", 0 },
> { &bench_sockmap_argp, 0, "bpf sockmap benchmark", 0 },
> + { &bench_lpm_trie_map_argp, 0, "LPM trie map benchmark", 0 },
> {},
> };
>
> @@ -558,6 +560,10 @@ extern const struct bench bench_htab_mem;
> extern const struct bench bench_crypto_encrypt;
> extern const struct bench bench_crypto_decrypt;
> extern const struct bench bench_sockmap;
> +extern const struct bench bench_lpm_trie_lookup;
> +extern const struct bench bench_lpm_trie_update;
> +extern const struct bench bench_lpm_trie_delete;
> +extern const struct bench bench_lpm_trie_free;
>
> static const struct bench *benchs[] = {
> &bench_count_global,
> @@ -625,6 +631,10 @@ static const struct bench *benchs[] = {
> &bench_crypto_encrypt,
> &bench_crypto_decrypt,
> &bench_sockmap,
> + &bench_lpm_trie_lookup,
> + &bench_lpm_trie_update,
> + &bench_lpm_trie_delete,
> + &bench_lpm_trie_free,
> };
>
> static void find_benchmark(void)
> diff --git a/tools/testing/selftests/bpf/bench.h b/tools/testing/selftests/bpf/bench.h
> index 005c401b3e22..bea323820ffb 100644
> --- a/tools/testing/selftests/bpf/bench.h
> +++ b/tools/testing/selftests/bpf/bench.h
> @@ -46,6 +46,7 @@ struct bench_res {
> unsigned long gp_ns;
> unsigned long gp_ct;
> unsigned int stime;
> + unsigned long duration_ns;
> };
>
> struct bench {
> diff --git a/tools/testing/selftests/bpf/benchs/bench_lpm_trie_map.c b/tools/testing/selftests/bpf/benchs/bench_lpm_trie_map.c
> new file mode 100644
> index 000000000000..435b5c7ceee9
> --- /dev/null
> +++ b/tools/testing/selftests/bpf/benchs/bench_lpm_trie_map.c
> @@ -0,0 +1,337 @@
> +// SPDX-License-Identifier: GPL-2.0
> +/* Copyright (c) 2025 Cloudflare */
> +
> +/*
> + * All of these benchmarks operate on tries with keys in the range
> + * [0, args.nr_entries), i.e. there are no gaps or partially filled
> + * branches of the trie for any key < args.nr_entries.
> + *
> + * This gives an idea of worst-case behaviour.
> + */
> +
> +#include <argp.h>
> +#include <linux/time64.h>
> +#include <linux/if_ether.h>
> +#include "lpm_trie_bench.skel.h"
> +#include "lpm_trie_map.skel.h"
> +#include "bench.h"
> +#include "testing_helpers.h"
> +
> +static struct ctx {
> + struct lpm_trie_bench *bench;
> +} ctx;
> +
> +static struct {
> + __u32 nr_entries;
> + __u32 prefixlen;
> +} args = {
> + .nr_entries = 10000,
> + .prefixlen = 32,
> +};
> +
> +enum {
> + ARG_NR_ENTRIES = 9000,
> + ARG_PREFIX_LEN,
> +};
> +
> +static const struct argp_option opts[] = {
> + { "nr_entries", ARG_NR_ENTRIES, "NR_ENTRIES", 0,
> + "Number of unique entries in the LPM trie" },
> + { "prefix_len", ARG_PREFIX_LEN, "PREFIX_LEN", 0,
> + "Number of prefix bits to use in the LPM trie" },
> + {},
> +};
> +
> +static error_t lpm_parse_arg(int key, char *arg, struct argp_state *state)
> +{
> + long ret;
> +
> + switch (key) {
> + case ARG_NR_ENTRIES:
> + ret = strtol(arg, NULL, 10);
> + if (ret < 1 || ret > UINT_MAX) {
> + fprintf(stderr, "Invalid nr_entries count.");
> + argp_usage(state);
> + }
> + args.nr_entries = ret;
> + break;
> + case ARG_PREFIX_LEN:
> + ret = strtol(arg, NULL, 10);
> + if (ret < 1 || ret > UINT_MAX) {
> + fprintf(stderr, "Invalid prefix_len value.");
> + argp_usage(state);
> + }
> + args.prefixlen = ret;
> + break;
> + default:
> + return ARGP_ERR_UNKNOWN;
> + }
> + return 0;
> +}
> +
> +const struct argp bench_lpm_trie_map_argp = {
> + .options = opts,
> + .parser = lpm_parse_arg,
> +};
> +
> +static void __lpm_validate(void)
why double underscore ?
Just lpm_validate() ?
> +{
> + if (env.consumer_cnt != 0) {
> + fprintf(stderr, "benchmark doesn't support consumer!\n");
> + exit(1);
> + }
> +
> + if ((1UL << args.prefixlen) < args.nr_entries) {
> + fprintf(stderr, "prefix_len value too small for nr_entries!\n");
> + exit(1);
> + };
> +}
> +
> +enum { OP_LOOKUP = 1, OP_UPDATE, OP_DELETE, OP_FREE };
> +
> +static void lpm_delete_validate(void)
> +{
> + __lpm_validate();
> +
> + if (env.producer_cnt != 1) {
> + fprintf(stderr,
> + "lpm-trie-delete requires a single producer!\n");
> + exit(1);
> + }
> +}
> +
> +static void lpm_free_validate(void)
> +{
> + __lpm_validate();
> +
> + if (env.producer_cnt != 1) {
> + fprintf(stderr, "lpm-trie-free requires a single producer!\n");
> + exit(1);
> + }
> +}
> +
> +static void fill_map(int map_fd)
> +{
> + int i, err;
> +
> + for (i = 0; i < args.nr_entries; i++) {
> + struct trie_key {
> + __u32 prefixlen;
> + __u32 data;
> + } key = { args.prefixlen, i };
> + __u32 val = 1;
> +
> + err = bpf_map_update_elem(map_fd, &key, &val, BPF_NOEXIST);
> + if (err) {
> + fprintf(stderr, "failed to add key %d to map: %d\n",
> + key.data, -err);
> + exit(1);
> + }
> + }
> +}
> +
> +static void __lpm_setup(void)
> +{
> + ctx.bench = lpm_trie_bench__open_and_load();
> + if (!ctx.bench) {
> + fprintf(stderr, "failed to open skeleton\n");
> + exit(1);
> + }
> +
> + ctx.bench->bss->nr_entries = args.nr_entries;
> + ctx.bench->bss->prefixlen = args.prefixlen;
> +
> + if (lpm_trie_bench__attach(ctx.bench)) {
> + fprintf(stderr, "failed to attach skeleton\n");
> + exit(1);
> + }
> +}
> +
> +static void lpm_setup(void)
> +{
> + int fd;
> +
> + __lpm_setup();
> +
> + fd = bpf_map__fd(ctx.bench->maps.trie_map);
> + fill_map(fd);
> +}
lpm_setup() vs __lpm_setup() are too cryptic.
The name of the function should explain at a high level
what it does.
> +
> +static void lpm_lookup_setup(void)
> +{
> + lpm_setup();
> +
> + ctx.bench->bss->op = OP_LOOKUP;
> +}
> +
> +static void lpm_update_setup(void)
> +{
> + lpm_setup();
> +
> + ctx.bench->bss->op = OP_UPDATE;
> +}
> +
> +static void lpm_delete_setup(void)
> +{
> + lpm_setup();
> +
> + ctx.bench->bss->op = OP_DELETE;
> +}
> +
> +static void lpm_free_setup(void)
> +{
> + __lpm_setup();
... then the reader won't need to scratch their head
over why this one is __ and the others don't.
> + ctx.bench->bss->op = OP_FREE;
> +}
> +
> +static void lpm_measure(struct bench_res *res)
> +{
> + res->hits = atomic_swap(&ctx.bench->bss->hits, 0);
> + res->duration_ns = atomic_swap(&ctx.bench->bss->duration_ns, 0);
> +}
> +
> +/* For LOOKUP, UPDATE, and DELETE */
> +static void *lpm_producer(void *unused __always_unused)
> +{
> + int err;
> + char in[ETH_HLEN]; /* unused */
> +
> + LIBBPF_OPTS(bpf_test_run_opts, opts, .data_in = in,
> + .data_size_in = sizeof(in), .repeat = 1, );
> +
> + while (true) {
> + int fd = bpf_program__fd(ctx.bench->progs.run_bench);
> + err = bpf_prog_test_run_opts(fd, &opts);
> + if (err) {
> + fprintf(stderr, "failed to run BPF prog: %d\n", err);
> + exit(1);
> + }
> +
> + if (opts.retval < 0) {
> + fprintf(stderr, "BPF prog returned error: %d\n",
> + opts.retval);
> + exit(1);
> + }
> +
> + if (ctx.bench->bss->op == OP_DELETE && opts.retval == 1) {
> + /* trie_map needs to be refilled */
> + fill_map(bpf_map__fd(ctx.bench->maps.trie_map));
> + }
> + }
> +
> + return NULL;
> +}
> +
> +static void *lpm_free_producer(void *unused __always_unused)
> +{
> + while (true) {
> + struct lpm_trie_map *skel;
> +
> + skel = lpm_trie_map__open_and_load();
> + if (!skel) {
> + fprintf(stderr, "failed to open skeleton\n");
> + exit(1);
> + }
> +
> + fill_map(bpf_map__fd(skel->maps.trie_free_map));
> + lpm_trie_map__destroy(skel);
> + }
> +
> + return NULL;
> +}
> +
> +static void free_ops_report_progress(int iter, struct bench_res *res,
> + long delta_ns)
> +{
> + double hits_per_sec, hits_per_prod;
> + double rate_divisor = 1000.0;
> + char rate = 'K';
> +
> + hits_per_sec = res->hits / (res->duration_ns / (double)NSEC_PER_SEC) /
> + rate_divisor;
> + hits_per_prod = hits_per_sec / env.producer_cnt;
> +
> + printf("Iter %3d (%7.3lfus): ", iter,
> + (delta_ns - NSEC_PER_SEC) / 1000.0);
> + printf("hits %8.3lf%c/s (%7.3lf%c/prod)\n", hits_per_sec, rate,
> + hits_per_prod, rate);
> +}
> +
> +static void free_ops_report_final(struct bench_res res[], int res_cnt)
> +{
> + double hits_mean = 0.0, hits_stddev = 0.0;
> + double lat_divisor = 1000000.0;
> + double rate_divisor = 1000.0;
> + const char *unit = "ms";
> + double latency = 0.0;
> + char rate = 'K';
> + int i;
> +
> + for (i = 0; i < res_cnt; i++) {
> + double val = res[i].hits / rate_divisor /
> + (res[i].duration_ns / (double)NSEC_PER_SEC);
> + hits_mean += val / (0.0 + res_cnt);
> + latency += res[i].duration_ns / res[i].hits / (0.0 + res_cnt);
> + }
> +
> + if (res_cnt > 1) {
> + for (i = 0; i < res_cnt; i++) {
> + double val =
> + res[i].hits / rate_divisor /
> + (res[i].duration_ns / (double)NSEC_PER_SEC);
> + hits_stddev += (hits_mean - val) * (hits_mean - val) /
> + (res_cnt - 1.0);
> + }
> +
> + hits_stddev = sqrt(hits_stddev);
> + }
> + printf("Summary: throughput %8.3lf \u00B1 %5.3lf %c ops/s (%7.3lf%c ops/prod), ",
> + hits_mean, hits_stddev, rate, hits_mean / env.producer_cnt,
> + rate);
> + printf("latency %8.3lf %s/op\n",
> + latency / lat_divisor / env.producer_cnt, unit);
> +}
> +
> +const struct bench bench_lpm_trie_lookup = {
> + .name = "lpm-trie-lookup",
> + .argp = &bench_lpm_trie_map_argp,
> + .validate = __lpm_validate,
> + .setup = lpm_lookup_setup,
> + .producer_thread = lpm_producer,
> + .measure = lpm_measure,
> + .report_progress = ops_report_progress,
> + .report_final = ops_report_final,
> +};
> +
> +const struct bench bench_lpm_trie_update = {
> + .name = "lpm-trie-update",
> + .argp = &bench_lpm_trie_map_argp,
> + .validate = __lpm_validate,
> + .setup = lpm_update_setup,
> + .producer_thread = lpm_producer,
> + .measure = lpm_measure,
> + .report_progress = ops_report_progress,
> + .report_final = ops_report_final,
> +};
> +
> +const struct bench bench_lpm_trie_delete = {
> + .name = "lpm-trie-delete",
> + .argp = &bench_lpm_trie_map_argp,
> + .validate = lpm_delete_validate,
> + .setup = lpm_delete_setup,
> + .producer_thread = lpm_producer,
> + .measure = lpm_measure,
> + .report_progress = ops_report_progress,
> + .report_final = ops_report_final,
> +};
> +
> +const struct bench bench_lpm_trie_free = {
> + .name = "lpm-trie-free",
> + .argp = &bench_lpm_trie_map_argp,
> + .validate = lpm_free_validate,
> + .setup = lpm_free_setup,
> + .producer_thread = lpm_free_producer,
> + .measure = lpm_measure,
> + .report_progress = free_ops_report_progress,
> + .report_final = free_ops_report_final,
> +};
> diff --git a/tools/testing/selftests/bpf/progs/lpm_trie_bench.c b/tools/testing/selftests/bpf/progs/lpm_trie_bench.c
> new file mode 100644
> index 000000000000..522e1cbef490
> --- /dev/null
> +++ b/tools/testing/selftests/bpf/progs/lpm_trie_bench.c
> @@ -0,0 +1,171 @@
> +// SPDX-License-Identifier: GPL-2.0
> +/* Copyright (c) 2025 Cloudflare */
> +
> +#include <vmlinux.h>
> +#include <bpf/bpf_tracing.h>
> +#include <bpf/bpf_helpers.h>
> +#include <bpf/bpf_core_read.h>
> +#include "bpf_misc.h"
> +
> +#define BPF_OBJ_NAME_LEN 16U
> +#define MAX_ENTRIES 100000000
> +#define NR_LOOPS 10000
> +
> +struct trie_key {
> + __u32 prefixlen;
> + __u32 data;
> +};
> +
> +char _license[] SEC("license") = "GPL";
> +
> +struct {
> + __uint(type, BPF_MAP_TYPE_HASH);
> + __uint(max_entries, 512);
> + __type(key, struct bpf_map *);
> + __type(value, __u64);
> +} latency_free_start SEC(".maps");
> +
> +/* Filled by userspace. See fill_map() in bench_lpm_trie_map.c */
> +struct {
> + __uint(type, BPF_MAP_TYPE_LPM_TRIE);
> + __type(key, struct trie_key);
> + __type(value, __u32);
> + __uint(map_flags, BPF_F_NO_PREALLOC);
> + __uint(max_entries, MAX_ENTRIES);
> +} trie_map SEC(".maps");
> +
> +long hits;
> +long duration_ns;
> +
> +/* Configured from userspace */
> +__u32 nr_entries;
> +__u32 prefixlen;
> +__u8 op;
> +
> +SEC("fentry/bpf_map_free_deferred")
> +int BPF_PROG(trie_free_entry, struct work_struct *work)
> +{
> + struct bpf_map *map = container_of(work, struct bpf_map, work);
> + char name[BPF_OBJ_NAME_LEN];
> + u32 map_type;
> + __u64 val;
> +
> + map_type = BPF_CORE_READ(map, map_type);
> + if (map_type != BPF_MAP_TYPE_LPM_TRIE)
> + return 0;
> +
> + /*
> + * Ideally we'd have access to the map ID but that's already
> + * freed before we enter trie_free().
> + */
> + BPF_CORE_READ_STR_INTO(&name, map, name);
> + if (bpf_strncmp(name, BPF_OBJ_NAME_LEN, "trie_free_map"))
> + return 0;
> +
> + val = bpf_ktime_get_ns();
> + bpf_map_update_elem(&latency_free_start, &map, &val, BPF_ANY);
Looks like there is only one lpm map.
What's the point of that extra map ?
Just have a global var for start time ?
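A minimal sketch of that, assuming the map_type/name filter above stays
in both progs and at most one trie_free_map teardown is in flight at a
time:

__u64 free_start_ns;	/* new global replacing the latency_free_start map */

/* in the fentry prog, after the filter */
free_start_ns = bpf_ktime_get_ns();

/* in the fexit prog, after the same filter */
if (free_start_ns) {
	__sync_add_and_fetch(&duration_ns, bpf_ktime_get_ns() - free_start_ns);
	__sync_add_and_fetch(&hits, 1);
	free_start_ns = 0;
}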
> + return 0;
> +}
> +
> +SEC("fexit/bpf_map_free_deferred")
> +int BPF_PROG(trie_free_exit, struct work_struct *work)
> +{
> + struct bpf_map *map = container_of(work, struct bpf_map, work);
> + __u64 *val;
> +
> + val = bpf_map_lookup_elem(&latency_free_start, &map);
> + if (val) {
> + __sync_add_and_fetch(&duration_ns, bpf_ktime_get_ns() - *val);
> + __sync_add_and_fetch(&hits, 1);
> + bpf_map_delete_elem(&latency_free_start, &map);
> + }
> +
> + return 0;
> +}
> +
> +static void gen_random_key(struct trie_key *key)
> +{
> + key->prefixlen = prefixlen;
> + key->data = bpf_get_prandom_u32() % nr_entries;
bpf_get_prandom_u32() is not free
and modulo operation isn't free either.
The benchmark includes their time.
It's ok to have it, but add a mode where the bench
tests linear lookup/update too with simple key.data++
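Something like this hypothetical generator next to gen_random_key()
would do for a sequential mode (untested sketch; gen_seq_key and
seq_index are made-up names):

static __u32 seq_index;

static void gen_seq_key(struct trie_key *key)
{
	key->prefixlen = prefixlen;
	key->data = seq_index;
	seq_index = (seq_index + 1) % nr_entries;	/* wrap back to 0 */
}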
> +}
> +
> +static int lookup(__u32 index, __u32 *unused)
> +{
> + struct trie_key key;
> +
> + gen_random_key(&key);
> + bpf_map_lookup_elem(&trie_map, &key);
> + return 0;
> +}
> +
> +static int update(__u32 index, __u32 *unused)
> +{
> + struct trie_key key;
> + u32 val = bpf_get_prandom_u32();
> +
> + gen_random_key(&key);
> + bpf_map_update_elem(&trie_map, &key, &val, BPF_EXIST);
Since the map is supposed to be full before we start, all keys
should be there, right?
Let's add a sanity check that update() succeeds.
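e.g. (sketch; update_errors would be a new global counter checked from
user space):

	/* update_errors: hypothetical new global */
	if (bpf_map_update_elem(&trie_map, &key, &val, BPF_EXIST))
		__sync_add_and_fetch(&update_errors, 1);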
> + return 0;
> +}
> +
> +static __u32 deleted_entries;
> +
> +static int delete (__u32 index, bool *need_refill)
> +{
> + struct trie_key key = {
> + .data = deleted_entries,
> + .prefixlen = prefixlen,
> + };
> +
> + bpf_map_delete_elem(&trie_map, &key);
> +
> + /* Do we need to refill the map? */
> + if (++deleted_entries == nr_entries) {
> + /*
> + * Atomicity isn't required because DELETE only supports
> + * one producer running concurrently. What we need is a
> + * way to track how many entries have been deleted from
> + * the trie between consecutive invocations of the BPF
> + * prog because a single bpf_loop() call might not
> + * delete all entries, e.g. when NR_LOOPS < nr_entries.
> + */
> + deleted_entries = 0;
> + *need_refill = true;
> + return 1;
This early break is another reason that makes
'delete' op different from 'lookup/update'.
Pls make all 3 consistent, so they can be compared to each other.
--
pw-bot: cr
* Re: [PATCH bpf-next v3] selftests/bpf: Add LPM trie microbenchmarks
2025-07-28 14:34 ` Alexei Starovoitov
@ 2025-07-29 13:56 ` Matt Fleming
2025-07-31 16:41 ` Alexei Starovoitov
2025-08-13 16:59 ` Jesper Dangaard Brouer
1 sibling, 1 reply; 8+ messages in thread
From: Matt Fleming @ 2025-07-29 13:56 UTC (permalink / raw)
To: Alexei Starovoitov
Cc: Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko,
Eduard Zingerman, Shuah Khan, kernel-team, Jesper Dangaard Brouer,
open list:KERNEL SELFTEST FRAMEWORK, LKML, bpf, Martin KaFai Lau,
Yonghong Song, Network Development, Matt Fleming
On Mon, Jul 28, 2025 at 3:35 PM Alexei Starovoitov
<alexei.starovoitov@gmail.com> wrote:
>
> Please make a full description of what the test does,
> since it's not trivial to decipher from the code.
> If I'm reading it correctly, first, the user space
> makes the LPM completely full and then lookup/update
> use random key-s within range.
Yep, that's correct. I'll provide a more comprehensive description in v4.
> But delete looks different. It seems the kernel delete
> operation can interleave with user space refilling the LPM,
> so ...
>
> > lookup: throughput 7.423 ± 0.023 M ops/s ( 7.423M ops/prod), latency 134.710 ns/op
> > update: throughput 2.643 ± 0.015 M ops/s ( 2.643M ops/prod), latency 378.310 ns/op
> > delete: throughput 0.712 ± 0.008 M ops/s ( 0.712M ops/prod), latency 1405.152 ns/op
>
> this comparison doesn't look like apples to apples,
> since delete will include user space refill time.
Yeah, you're right. Though we only measure the delete time from within
the BPF prog, delete throughput is essentially blocked while the refill
happens, and because things are measured with a 1-second timer
(bench.c:sigalarm_handler) the refill time gets accounted for anyway.
I could try extrapolating the delete time like I've done for the free
op, i.e. we calculate how many ops were completed in what fraction of
a second.
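That is, roughly what free_ops_report_progress() already does:

	double ops_per_sec = res->hits / (res->duration_ns / (double)NSEC_PER_SEC);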
> > free: throughput 0.574 ± 0.003 K ops/s ( 0.574K ops/prod), latency 1.743 ms/op
>
> Does this measure the free-ing of full LPM ?
Yes, this measures the total time to free every element in the trie.
> > +static void __lpm_validate(void)
>
> why double underscore ?
> Just lpm_validate() ?
The double underscore is used for the common validation parts, but
I'll rename this to include "_common()" (along with all other uses of
__).
> > + /*
> > + * Ideally we'd have access to the map ID but that's already
> > + * freed before we enter trie_free().
> > + */
> > + BPF_CORE_READ_STR_INTO(&name, map, name);
> > + if (bpf_strncmp(name, BPF_OBJ_NAME_LEN, "trie_free_map"))
> > + return 0;
> > +
> > + val = bpf_ktime_get_ns();
> > + bpf_map_update_elem(&latency_free_start, &map, &val, BPF_ANY);
>
> Looks like there is only one lpm map.
> What's the point of that extra map ?
> Just have a global var for start time ?
Sure, I can make this a global variable for the start time instead.
> bpf_get_prandom_u32() is not free
> and modulo operation isn't free either.
> The benchmark includes their time.
> It's ok to have it, but add a mode where the bench
> tests linear lookup/update too with simple key.data++
Good idea.
> Since the map suppose to full before we start all keys
> should be there, right?
Yes.
> Let's add a sanity check that update() succeeds.
Will do.
> > +static int delete (__u32 index, bool *need_refill)
> > +{
> > + struct trie_key key = {
> > + .data = deleted_entries,
> > + .prefixlen = prefixlen,
> > + };
> > +
> > + bpf_map_delete_elem(&trie_map, &key);
> > +
> > + /* Do we need to refill the map? */
> > + if (++deleted_entries == nr_entries) {
> > + /*
> > + * Atomicity isn't required because DELETE only supports
> > + * one producer running concurrently. What we need is a
> > + * way to track how many entries have been deleted from
> > + * the trie between consecutive invocations of the BPF
> > + * prog because a single bpf_loop() call might not
> > + * delete all entries, e.g. when NR_LOOPS < nr_entries.
> > + */
> > + deleted_entries = 0;
> > + *need_refill = true;
> > + return 1;
>
> This early break is another reason that makes
> 'delete' op different from 'lookup/update'.
> Pls make all 3 consistent, so they can be compared to each other.
Hmm.. I'm not quite sure how to do that. lookup/update don't modify
the number of entries in the map whereas delete does (update only
updates existing entries, it doesn't create new ones). So when the map
is empty it needs to be refilled. You're right that somehow the time
to refill needs to be removed from the delete throughput/latency
numbers but fundamentally these 3 ops are not the same and I don't see
how to treat them as such.
* Re: [PATCH bpf-next v3] selftests/bpf: Add LPM trie microbenchmarks
2025-07-29 13:56 ` Matt Fleming
@ 2025-07-31 16:41 ` Alexei Starovoitov
2025-08-08 14:21 ` Matt Fleming
0 siblings, 1 reply; 8+ messages in thread
From: Alexei Starovoitov @ 2025-07-31 16:41 UTC (permalink / raw)
To: Matt Fleming
Cc: Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko,
Eduard Zingerman, Shuah Khan, kernel-team, Jesper Dangaard Brouer,
open list:KERNEL SELFTEST FRAMEWORK, LKML, bpf, Martin KaFai Lau,
Yonghong Song, Network Development, Matt Fleming
On Tue, Jul 29, 2025 at 6:56 AM Matt Fleming <matt@readmodwrite.com> wrote:
>
> On Mon, Jul 28, 2025 at 3:35 PM Alexei Starovoitov
> <alexei.starovoitov@gmail.com> wrote:
> >
> > Please make a full description of what the test does,
> > since it's not trivial to decipher from the code.
> > If I'm reading it correctly, first, the user space
> > makes the LPM completely full and then lookup/update
> > use random key-s within range.
>
> Yep, that's correct. I'll provide a more comprehensive description in v4.
>
> > But delete looks different. It seems the kernel delete
> > operation can interleave with user space refilling the LPM,
> > so ...
> >
> > > lookup: throughput 7.423 ± 0.023 M ops/s ( 7.423M ops/prod), latency 134.710 ns/op
> > > update: throughput 2.643 ± 0.015 M ops/s ( 2.643M ops/prod), latency 378.310 ns/op
> > > delete: throughput 0.712 ± 0.008 M ops/s ( 0.712M ops/prod), latency 1405.152 ns/op
> >
> > this comparison doesn't look like apples to apples,
> > since delete will include user space refill time.
>
> Yeah, you're right. Though we only measure the delete time from in the
> BPF prog, delete throughput is essentially blocked while the refill
> happens and because things are measured with a 1-second timer
> (bench.c:sigalarm_handler) the refill time gets accounted for anyway.
> I could try extrapolating the delete time like I've done for the free
> op, i.e. we calculate how many ops were completed in what fraction of
> a second.
>
> > > free: throughput 0.574 ± 0.003 K ops/s ( 0.574K ops/prod), latency 1.743 ms/op
> >
> > Does this measure the free-ing of full LPM ?
>
> Yes, this measures the total time to free every element in the trie.
>
> > > +static void __lpm_validate(void)
> >
> > why double underscore ?
> > Just lpm_validate() ?
>
> The double underscore is used for the common validation parts, but
> I'll rename this to include "_common()" (along with all other uses of
> __).
No. It's also wrong.
Double underscore or _common suffix are both meaningless.
The helper name should describe what it's for.
> > > + /*
> > > + * Ideally we'd have access to the map ID but that's already
> > > + * freed before we enter trie_free().
> > > + */
> > > + BPF_CORE_READ_STR_INTO(&name, map, name);
> > > + if (bpf_strncmp(name, BPF_OBJ_NAME_LEN, "trie_free_map"))
> > > + return 0;
> > > +
> > > + val = bpf_ktime_get_ns();
> > > + bpf_map_update_elem(&latency_free_start, &map, &val, BPF_ANY);
> >
> > Looks like there is only one lpm map.
> > What's the point of that extra map ?
> > Just have a global var for start time ?
>
> Sure, I can make this a global variable for the start time instead.
>
> > bpf_get_prandom_u32() is not free
> > and modulo operation isn't free either.
> > The benchmark includes their time.
> > It's ok to have it, but add a mode where the bench
> > tests linear lookup/update too with simple key.data++
>
> Good idea.
>
> > Since the map suppose to full before we start all keys
> > should be there, right?
>
> Yes.
>
> > Let's add a sanity check that update() succeeds.
>
> Will do.
>
> > > +static int delete (__u32 index, bool *need_refill)
> > > +{
> > > + struct trie_key key = {
> > > + .data = deleted_entries,
> > > + .prefixlen = prefixlen,
> > > + };
> > > +
> > > + bpf_map_delete_elem(&trie_map, &key);
> > > +
> > > + /* Do we need to refill the map? */
> > > + if (++deleted_entries == nr_entries) {
> > > + /*
> > > + * Atomicity isn't required because DELETE only supports
> > > + * one producer running concurrently. What we need is a
> > > + * way to track how many entries have been deleted from
> > > + * the trie between consecutive invocations of the BPF
> > > + * prog because a single bpf_loop() call might not
> > > + * delete all entries, e.g. when NR_LOOPS < nr_entries.
> > > + */
> > > + deleted_entries = 0;
> > > + *need_refill = true;
> > > + return 1;
> >
> > This early break is another reason that makes
> > 'delete' op different from 'lookup/update'.
> > Pls make all 3 consistent, so they can be compared to each other.
>
> Hmm.. I'm not quite sure how to do that. lookup/update don't modify
> the number of entries in the map whereas delete does (update only
> updates existing entries, it doesn't create new ones). So when the map
> is empty it needs to be refilled. You're right that somehow the time
> to refill needs to be removed from the delete throughput/latency
> numbers but fundamentally these 3 ops are not the same and I don't see
> how to treat them as such.
well, random-key update when the map is full is also quite different from
random-key update when the map is empty.
Instead of doing an update from user space, do timed ops:
1 start with empty map, update (aka insert) all keys sequentially
2 lookup all sequentially
3 delete all sequentially
4 update (aka insert) all sequentially
5 lookup random
6 update random
7 delete all random
The elapsed time for 1 and 4 should be exactly the same.
All the others might have differences,
but they can all be compared to each other and reasoned about.
* Re: [PATCH bpf-next v3] selftests/bpf: Add LPM trie microbenchmarks
2025-07-31 16:41 ` Alexei Starovoitov
@ 2025-08-08 14:21 ` Matt Fleming
2025-08-08 16:42 ` Alexei Starovoitov
0 siblings, 1 reply; 8+ messages in thread
From: Matt Fleming @ 2025-08-08 14:21 UTC (permalink / raw)
To: Alexei Starovoitov
Cc: Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko,
Eduard Zingerman, Shuah Khan, kernel-team, Jesper Dangaard Brouer,
open list:KERNEL SELFTEST FRAMEWORK, LKML, bpf, Martin KaFai Lau,
Yonghong Song, Network Development, Matt Fleming
On Thu, Jul 31, 2025 at 5:41 PM Alexei Starovoitov
<alexei.starovoitov@gmail.com> wrote:
>
> well, random-key update when the map is full is also quite different from
> random-key update when the map is empty.
>
> Instead doing an update from user space do timed ops:
> 1 start with empty map, update (aka insert) all keys sequentially
> 2 lookup all sequentially
> 3 delete all sequentially
> 4 update (aka insert) all sequentially
> 5 lookup random
> 6 update random
> 7 delete all random
>
> The elapsed time for 1 and 4 should be exactly the same.
> While all others might have differences,
> but all can be compared to each other and reasoned about.
Having both sequential and random access for the benchmarks is fine,
but as far as I can tell the scheme you propose is not how the bpf
bench framework is implemented.
Plus, handing off a map between subtests is brittle and prone to
error. What if I just want to investigate the sequential access update
time? The cost of the most expensive op (probably delete) is going to
dwarf all over timings making it difficult to separate them and this
scheme is going to be susceptible to noise if I can't crank up the
number of iterations without altering the number of entries in the
map. Microbenchmarks mitigate noise/run-to-run variance by doing a
single op over and over again.
I agree we need a better approach to timing deletes than what I
suggested, but I don't think this is it.
* Re: [PATCH bpf-next v3] selftests/bpf: Add LPM trie microbenchmarks
2025-08-08 14:21 ` Matt Fleming
@ 2025-08-08 16:42 ` Alexei Starovoitov
0 siblings, 0 replies; 8+ messages in thread
From: Alexei Starovoitov @ 2025-08-08 16:42 UTC (permalink / raw)
To: Matt Fleming
Cc: Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko,
Eduard Zingerman, Shuah Khan, kernel-team, Jesper Dangaard Brouer,
open list:KERNEL SELFTEST FRAMEWORK, LKML, bpf, Martin KaFai Lau,
Yonghong Song, Network Development, Matt Fleming
On Fri, Aug 8, 2025 at 7:21 AM Matt Fleming <matt@readmodwrite.com> wrote:
>
> On Thu, Jul 31, 2025 at 5:41 PM Alexei Starovoitov
> <alexei.starovoitov@gmail.com> wrote:
> >
> > well, random-key update when the map is full is also quite different from
> > random-key update when the map is empty.
> >
> > Instead doing an update from user space do timed ops:
> > 1 start with empty map, update (aka insert) all keys sequentially
> > 2 lookup all sequentially
> > 3 delete all sequentially
> > 4 update (aka insert) all sequentially
> > 5 lookup random
> > 6 update random
> > 7 delete all random
> >
> > The elapsed time for 1 and 4 should be exactly the same.
> > While all others might have differences,
> > but all can be compared to each other and reasoned about.
>
> Having both sequential and random access for the benchmarks is fine,
> but as far as I can tell the scheme you propose is not how the bpf
> bench framework is implemented.
>
> Plus, handing off a map between subtests is brittle and prone to
> error. What if I just want to investigate the sequential access update
> time? The cost of the most expensive op (probably delete) is going to
> dwarf all over timings making it difficult to separate them and this
> scheme is going to be susceptible to noise if I can't crank up the
> number of iterations without altering the number of entries in the
> map. Microbenchmarks mitigate noise/run-to-run variance by doing a
> single op over and over again.
The bench has to repeat the operation obviously.
The point is that it cannot benchmark 'delete' in isolation.
And it cannot benchmark 'update of empty aka insert' in isolation.
So some operations have to be paired, and when one of the pair can also
be done standalone, the delta is the performance of the other (e.g. time
an insert+delete pass, subtract the standalone insert time, and the
remainder is the delete cost).
In the above 2,5,6 are repeatable. The others need to be paired.
7 with pure random is probably not a good idea, since it will
try to delete some keys that were already deleted and will skew the numbers.
* Re: [PATCH bpf-next v3] selftests/bpf: Add LPM trie microbenchmarks
2025-07-28 14:34 ` Alexei Starovoitov
2025-07-29 13:56 ` Matt Fleming
@ 2025-08-13 16:59 ` Jesper Dangaard Brouer
1 sibling, 0 replies; 8+ messages in thread
From: Jesper Dangaard Brouer @ 2025-08-13 16:59 UTC (permalink / raw)
To: Alexei Starovoitov, Matt Fleming
Cc: Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko,
Eduard Zingerman, Shuah Khan, kernel-team,
open list:KERNEL SELFTEST FRAMEWORK, LKML, bpf, Martin KaFai Lau,
Yonghong Song, Network Development, Matt Fleming
On 28/07/2025 16.34, Alexei Starovoitov wrote:
>> diff --git a/tools/testing/selftests/bpf/progs/lpm_trie_bench.c b/tools/testing/selftests/bpf/progs/lpm_trie_bench.c
>> new file mode 100644
>> index 000000000000..522e1cbef490
>> --- /dev/null
>> +++ b/tools/testing/selftests/bpf/progs/lpm_trie_bench.c
[...]
>> +
>> +static void gen_random_key(struct trie_key *key)
>> +{
>> + key->prefixlen = prefixlen;
>> + key->data = bpf_get_prandom_u32() % nr_entries;
> bpf_get_prandom_u32() is not free
> and modulo operation isn't free either.
> The benchmark includes their time.
> It's ok to have it, but add a mode where the bench
> tests linear lookup/update too with simple key.data++
I've extended this bench with a "noop" and "baseline" benchmark[1].
[1]
https://lore.kernel.org/all/175509897596.2755384.18413775753563966331.stgit@firesoul/
This allowed us to measure and deduce that:
bpf_get_prandom_u32() % nr_entries
takes 14.1 ns for the rand + modulo.
The "noop" test shows the harness overhead is 13.402 ns/op, and on top of
that the "baseline" test comes in at 27.529 ns/op, i.e. the randomness
costs 27.529 - 13.402 ≈ 14.1 ns.
--Jesper
* Re: [PATCH bpf-next v3] selftests/bpf: Add LPM trie microbenchmarks
2025-07-22 15:01 [PATCH bpf-next v3] selftests/bpf: Add LPM trie microbenchmarks Matt Fleming
2025-07-28 14:34 ` Alexei Starovoitov
@ 2025-08-08 15:51 ` Jesper Dangaard Brouer
1 sibling, 0 replies; 8+ messages in thread
From: Jesper Dangaard Brouer @ 2025-08-08 15:51 UTC (permalink / raw)
To: Matt Fleming, Alexei Starovoitov, Daniel Borkmann,
Andrii Nakryiko, Eduard Zingerman
Cc: Shuah Khan, kernel-team, linux-kselftest, linux-kernel, bpf,
Martin KaFai Lau, Yonghong Song, netdev, Matt Fleming
On 22/07/2025 17.01, Matt Fleming wrote:
> From: Matt Fleming<mfleming@cloudflare.com>
>
> Add benchmarks for the standard set of operations: lookup, update,
> delete. Also, include a benchmark for trie_free() which is known to have
> terrible performance for maps with many entries.
>
> Benchmarks operate on tries without gaps in the key range, i.e. each
> test begins with a trie with valid keys in the range [0, nr_entries).
> This is intended to cause maximum branching when traversing the trie.
>
> All measurements are recorded inside the kernel to remove syscall
> overhead.
>
> Most benchmarks run an XDP program to generate stats but free needs to
> collect latencies using fentry/fexit on map_free_deferred() because it's
> not possible to use fentry directly on lpm_trie.c since commit
> c83508da5620 ("bpf: Avoid deadlock caused by nested kprobe and fentry
> bpf programs") and there's no way to create/destroy a map from within an
> XDP program.
>
> Here is example output from an AMD EPYC 9684X 96-Core machine for each
> of the benchmarks using a trie with 10K entries and a 32-bit prefix
> length, e.g.
>
> $ ./bench lpm-trie-$op \
> --prefix_len=32 \
> --producers=1 \
> --nr_entries=10000
>
> lookup: throughput 7.423 ± 0.023 M ops/s ( 7.423M ops/prod), latency 134.710 ns/op
> update: throughput 2.643 ± 0.015 M ops/s ( 2.643M ops/prod), latency 378.310 ns/op
> delete: throughput 0.712 ± 0.008 M ops/s ( 0.712M ops/prod), latency 1405.152 ns/op
> free: throughput 0.574 ± 0.003 K ops/s ( 0.574K ops/prod), latency 1.743 ms/op
>
> Tested-by: Jesper Dangaard Brouer<hawk@kernel.org>
I've run a lot more benchmarks.
I've used a slightly updated version[1] written by Matt that improves
the "delete" operation accounting that Alexei complained about, but I
guess Matt isn't 100% happy with his approach (as I don't see a v4).
The below "delete" numbers have improved compared to above.
Results from[2] with default 10,000 entries:
lookup 7.598 ± 0.004 M ops/s 131.608 ns/op
update 3.247 ± 0.029 M ops/s 308.008 ns/op
delete 1.747 ± 0.053 M ops/s 572.519 ns/op
free 0.294 ± 0.055 K ops/s 3.521 ms/op
[1]
https://github.com/xdp-project/xdp-project/blob/main/areas/bench/patches/bench-lpm-trie-V3-adjusted.patch
[2]
https://github.com/xdp-project/xdp-project/blob/main/areas/bench/bench01_lpm-trie.org
I'm mostly interested in the fast-path lookup performance. It is
documented here [3] with links to plots [4]. When looking at these
per-operation costs, remember that they include the cost of generating
the random number. That said, it is very clear from my data that the
LPM trie has problems with cache-line thrashing as the number of
entries increases.
[3]
https://github.com/xdp-project/xdp-project/blob/main/areas/bench/bench02_lpm-trie-lookup.org
[4]
https://github.com/xdp-project/xdp-project/blob/main/areas/bench/bench02_lpm-trie-lookup.org#plotting-results
I've also documented the "bench" harness a little[5].
[5]
https://github.com/xdp-project/xdp-project/blob/main/areas/bench/README.org
--Jesper