From: John Fastabend <john.fastabend@gmail.com>
To: Dave Marchevsky <davemarchevsky@fb.com>, bpf@vger.kernel.org
Cc: Alexei Starovoitov <ast@kernel.org>,
Daniel Borkmann <daniel@iogearbox.net>,
Andrii Nakryiko <andrii@kernel.org>,
Martin KaFai Lau <kafai@fb.com>, Kernel Team <kernel-team@fb.com>,
Dave Marchevsky <davemarchevsky@fb.com>
Subject: RE: [PATCH v6 bpf-next] selftests/bpf: Add benchmark for local_storage get
Date: Tue, 21 Jun 2022 12:17:54 -0700 [thread overview]
Message-ID: <62b21962dc64_1627420844@john.notmuch> (raw)
In-Reply-To: <20220620222554.270578-1-davemarchevsky@fb.com>
Dave Marchevsky wrote:
> Add a benchmarks to demonstrate the performance cliff for local_storage
> get as the number of local_storage maps increases beyond current
> local_storage implementation's cache size.
>
> "sequential get" and "interleaved get" benchmarks are added, both of
> which do many bpf_task_storage_get calls on sets of task local_storage
> maps of various counts, while considering a single specific map to be
> 'important' and counting task_storage_gets to the important map
> separately in addition to normal 'hits' count of all gets. Goal here is
> to mimic scenario where a particular program using one map - the
> important one - is running on a system where many other local_storage
> maps exist and are accessed often.
>
> While "sequential get" benchmark does bpf_task_storage_get for map 0, 1,
> ..., {9, 99, 999} in order, "interleaved" benchmark interleaves 4
> bpf_task_storage_gets for the important map for every 10 map gets. This
> is meant to highlight performance differences when important map is
> accessed far more frequently than non-important maps.
>
> A "hashmap control" benchmark is also included for easy comparison of
> standard bpf hashmap lookup vs local_storage get. The benchmark is
> similar to "sequential get", but creates and uses BPF_MAP_TYPE_HASH
> instead of local storage. Only one inner map is created - a hashmap
> meant to hold tid -> data mapping for all tasks. Size of the hashmap is
> hardcoded to my system's PID_MAX_LIMIT (4,194,304). The number of these
> keys which are actually fetched as part of the benchmark is
> configurable.
>
> Addition of this benchmark is inspired by conversation with Alexei in a
> previous patchset's thread [0], which highlighted the need for such a
> benchmark to motivate and validate improvements to local_storage
> implementation. My approach in that series focused on improving
> performance for explicitly-marked 'important' maps and was rejected
> with feedback to make more generally-applicable improvements while
> avoiding explicitly marking maps as important. Thus the benchmark
> reports both general and important-map-focused metrics, so effect of
> future work on both is clear.
>
> Regarding the benchmark results. On a powerful system (Skylake, 20
> cores, 256gb ram):
>
> Hashmap Control
> ===============
> num keys: 10
> hashmap (control) sequential get: hits throughput: 20.900 ± 0.334 M ops/s, hits latency: 47.847 ns/op, important_hits throughput: 20.900 ± 0.334 M ops/s
>
> num keys: 1000
> hashmap (control) sequential get: hits throughput: 13.758 ± 0.219 M ops/s, hits latency: 72.683 ns/op, important_hits throughput: 13.758 ± 0.219 M ops/s
>
> num keys: 10000
> hashmap (control) sequential get: hits throughput: 6.995 ± 0.034 M ops/s, hits latency: 142.959 ns/op, important_hits throughput: 6.995 ± 0.034 M ops/s
>
> num keys: 100000
> hashmap (control) sequential get: hits throughput: 4.452 ± 0.371 M ops/s, hits latency: 224.635 ns/op, important_hits throughput: 4.452 ± 0.371 M ops/s
>
> num keys: 4194304
> hashmap (control) sequential get: hits throughput: 3.043 ± 0.033 M ops/s, hits latency: 328.587 ns/op, important_hits throughput: 3.043 ± 0.033 M ops/s
>
Why is the hashmap lookup not constant with the number of keys? It looks
like its prepopulated without collisions so I wouldn't expect any
extra ops on the lookup side after looking at the code quickly.
> Local Storage
> =============
> num_maps: 1
> local_storage cache sequential get: hits throughput: 47.298 ± 0.180 M ops/s, hits latency: 21.142 ns/op, important_hits throughput: 47.298 ± 0.180 M ops/s
> local_storage cache interleaved get: hits throughput: 55.277 ± 0.888 M ops/s, hits latency: 18.091 ns/op, important_hits throughput: 55.277 ± 0.888 M ops/s
>
> num_maps: 10
> local_storage cache sequential get: hits throughput: 40.240 ± 0.802 M ops/s, hits latency: 24.851 ns/op, important_hits throughput: 4.024 ± 0.080 M ops/s
> local_storage cache interleaved get: hits throughput: 48.701 ± 0.722 M ops/s, hits latency: 20.533 ns/op, important_hits throughput: 17.393 ± 0.258 M ops/s
>
> num_maps: 16
> local_storage cache sequential get: hits throughput: 44.515 ± 0.708 M ops/s, hits latency: 22.464 ns/op, important_hits throughput: 2.782 ± 0.044 M ops/s
> local_storage cache interleaved get: hits throughput: 49.553 ± 2.260 M ops/s, hits latency: 20.181 ns/op, important_hits throughput: 15.767 ± 0.719 M ops/s
>
> num_maps: 17
> local_storage cache sequential get: hits throughput: 38.778 ± 0.302 M ops/s, hits latency: 25.788 ns/op, important_hits throughput: 2.284 ± 0.018 M ops/s
> local_storage cache interleaved get: hits throughput: 43.848 ± 1.023 M ops/s, hits latency: 22.806 ns/op, important_hits throughput: 13.349 ± 0.311 M ops/s
>
> num_maps: 24
> local_storage cache sequential get: hits throughput: 19.317 ± 0.568 M ops/s, hits latency: 51.769 ns/op, important_hits throughput: 0.806 ± 0.024 M ops/s
> local_storage cache interleaved get: hits throughput: 24.397 ± 0.272 M ops/s, hits latency: 40.989 ns/op, important_hits throughput: 6.863 ± 0.077 M ops/s
>
> num_maps: 32
> local_storage cache sequential get: hits throughput: 13.333 ± 0.135 M ops/s, hits latency: 75.000 ns/op, important_hits throughput: 0.417 ± 0.004 M ops/s
> local_storage cache interleaved get: hits throughput: 16.898 ± 0.383 M ops/s, hits latency: 59.178 ns/op, important_hits throughput: 4.717 ± 0.107 M ops/s
>
> num_maps: 100
> local_storage cache sequential get: hits throughput: 6.360 ± 0.107 M ops/s, hits latency: 157.233 ns/op, important_hits throughput: 0.064 ± 0.001 M ops/s
> local_storage cache interleaved get: hits throughput: 7.303 ± 0.362 M ops/s, hits latency: 136.930 ns/op, important_hits throughput: 1.907 ± 0.094 M ops/s
>
> num_maps: 1000
> local_storage cache sequential get: hits throughput: 0.452 ± 0.010 M ops/s, hits latency: 2214.022 ns/op, important_hits throughput: 0.000 ± 0.000 M ops/s
> local_storage cache interleaved get: hits throughput: 0.542 ± 0.007 M ops/s, hits latency: 1843.341 ns/op, important_hits throughput: 0.136 ± 0.002 M ops/s
>
> Looking at the "sequential get" results, it's clear that as the
> number of task local_storage maps grows beyond the current cache size
> (16), there's a significant reduction in hits throughput. Note that
> current local_storage implementation assigns a cache_idx to maps as they
> are created. Since "sequential get" is creating maps 0..n in order and
> then doing bpf_task_storage_get calls in the same order, the benchmark
> is effectively ensuring that a map will not be in cache when the program
> tries to access it.
>
> For "interleaved get" results, important-map hits throughput is greatly
> increased as the important map is more likely to be in cache by virtue
> of being accessed far more frequently. Throughput still reduces as #
> maps increases, though.
>
> To get a sense of the overhead of the benchmark program, I
> commented out bpf_task_storage_get/bpf_map_lookup_elem in
> local_storage_bench.c and ran the benchmark on the same host as the
> 'real' run. Results:
Also just checking the hash overhead was taken including the
urandom so we can pull that out of the cost.
[...]
> +#include "vmlinux.h"
> +#include <bpf/bpf_helpers.h>
> +#include "bpf_misc.h"
> +
> +#define HASHMAP_SZ 4194304
> +
> +struct {
> + __uint(type, BPF_MAP_TYPE_ARRAY_OF_MAPS);
> + __uint(max_entries, 1000);
> + __type(key, int);
> + __type(value, int);
> + __array(values, struct {
> + __uint(type, BPF_MAP_TYPE_TASK_STORAGE);
> + __uint(map_flags, BPF_F_NO_PREALLOC);
> + __type(key, int);
> + __type(value, int);
> + });
> +} array_of_local_storage_maps SEC(".maps");
> +
> +struct {
> + __uint(type, BPF_MAP_TYPE_ARRAY_OF_MAPS);
> + __uint(max_entries, 1000);
> + __type(key, int);
> + __type(value, int);
> + __array(values, struct {
> + __uint(type, BPF_MAP_TYPE_HASH);
> + __uint(max_entries, HASHMAP_SZ);
> + __type(key, int);
> + __type(value, int);
> + });
> +} array_of_hash_maps SEC(".maps");
> +
> +long important_hits;
> +long hits;
> +
> +/* set from user-space */
> +const volatile unsigned int use_hashmap;
> +const volatile unsigned int hashmap_num_keys;
> +const volatile unsigned int num_maps;
> +const volatile unsigned int interleave;
> +
> +struct loop_ctx {
> + struct task_struct *task;
> + long loop_hits;
> + long loop_important_hits;
> +};
> +
> +static int do_lookup(unsigned int elem, struct loop_ctx *lctx)
> +{
> + void *map, *inner_map;
> + int idx = 0;
> +
> + if (use_hashmap)
> + map = &array_of_hash_maps;
> + else
> + map = &array_of_local_storage_maps;
> +
> + inner_map = bpf_map_lookup_elem(map, &elem);
> + if (!inner_map)
> + return -1;
> +
> + if (use_hashmap) {
> + idx = bpf_get_prandom_u32() % hashmap_num_keys;
> + bpf_map_lookup_elem(inner_map, &idx);
The htab lookup is just,
static void *htab_map_lookup_elem(struct bpf_map *map, void *key)
{
struct htab_elem *l = __htab_map_lookup_elem(map, key);
if (l)
return l->key + round_up(map->key_size, 8);
return NULL;
}
> + } else {
> + bpf_task_storage_get(inner_map, lctx->task, &idx,
> + BPF_LOCAL_STORAGE_GET_F_CREATE);
> + }
> +
> + lctx->loop_hits++;
> + if (!elem)
> + lctx->loop_important_hits++;
> + return 0;
> +}
> +
> +static long loop(u32 index, void *ctx)
> +{
> + struct loop_ctx *lctx = (struct loop_ctx *)ctx;
> + unsigned int map_idx = index % num_maps;
> +
> + do_lookup(map_idx, lctx);
> + if (interleave && map_idx % 3 == 0)
> + do_lookup(0, lctx);
> + return 0;
> +}
> +
> +SEC("fentry/" SYS_PREFIX "sys_getpgid")
> +int get_local(void *ctx)
> +{
> + struct loop_ctx lctx;
> +
> + lctx.task = bpf_get_current_task_btf();
> + lctx.loop_hits = 0;
> + lctx.loop_important_hits = 0;
> + bpf_loop(10000, &loop, &lctx, 0);
> + __sync_add_and_fetch(&hits, lctx.loop_hits);
> + __sync_add_and_fetch(&important_hits, lctx.loop_important_hits);
> + return 0;
> +}
> +
> +char _license[] SEC("license") = "GPL";
> --
> 2.30.2
>
next prev parent reply other threads:[~2022-06-21 19:19 UTC|newest]
Thread overview: 12+ messages / expand[flat|nested] mbox.gz Atom feed top
2022-06-20 22:25 [PATCH v6 bpf-next] selftests/bpf: Add benchmark for local_storage get Dave Marchevsky
2022-06-21 19:17 ` John Fastabend [this message]
2022-06-22 0:29 ` Martin KaFai Lau
2022-06-22 5:49 ` John Fastabend
2022-06-22 17:26 ` Martin KaFai Lau
2022-06-23 1:26 ` John Fastabend
2022-06-23 2:18 ` Alexei Starovoitov
2022-06-23 3:25 ` John Fastabend
2022-06-23 2:53 ` Dave Marchevsky
2022-06-23 3:27 ` John Fastabend
2022-06-23 2:31 ` Dave Marchevsky
2022-06-23 2:20 ` patchwork-bot+netdevbpf
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=62b21962dc64_1627420844@john.notmuch \
--to=john.fastabend@gmail.com \
--cc=andrii@kernel.org \
--cc=ast@kernel.org \
--cc=bpf@vger.kernel.org \
--cc=daniel@iogearbox.net \
--cc=davemarchevsky@fb.com \
--cc=kafai@fb.com \
--cc=kernel-team@fb.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.