From: Yonghong Song <yhs@fb.com>
To: Andrii Nakryiko <andrii.nakryiko@gmail.com>
Cc: Andrii Nakryiko <andriin@fb.com>, bpf <bpf@vger.kernel.org>,
Networking <netdev@vger.kernel.org>,
Alexei Starovoitov <ast@fb.com>,
Daniel Borkmann <daniel@iogearbox.net>,
Kernel Team <kernel-team@fb.com>
Subject: Re: [PATCH v2 bpf-next 1/3] selftests/bpf: add benchmark runner infrastructure
Date: Tue, 12 May 2020 07:39:18 -0700
Message-ID: <80ca0fba-8a30-4f1f-6bf3-7ccaa1fa8d69@fb.com>
In-Reply-To: <CAEf4BzYkbsd3EUXoH8M80+udtz-owN6gAb-Hpp==bNU2Pk6x+A@mail.gmail.com>
On 5/11/20 8:29 PM, Andrii Nakryiko wrote:
> On Sat, May 9, 2020 at 10:10 AM Yonghong Song <yhs@fb.com> wrote:
>>
>>
>>
>> On 5/8/20 4:20 PM, Andrii Nakryiko wrote:
>>> While working on the BPF ringbuf implementation, testing, and benchmarking,
>>> I've developed a pretty generic and modular benchmark runner, which seems to
>>> be broadly useful, as I've already used it for one more purpose (testing the
>>> fastest way to trigger a BPF program, to minimize the overhead of in-kernel code).
>>>
>>> This patch adds the generic part of the benchmark runner and sets up the
>>> Makefile for extending it with more sets of benchmarks.
>>>
>>> The benchmarker itself operates by spinning up the specified number of
>>> producer and consumer threads and setting up an interval timer that sends a
>>> SIGALRM signal to the application once a second. Every second, a snapshot of
>>> the hits/drops counters is collected and stored in an array. Drops are useful
>>> for producer/consumer benchmarks in which producers might overwhelm consumers.
>>>
>>> Once the test finishes after the given number of warm-up and testing seconds,
>>> the mean and stddev are calculated (ignoring warm-up results) and printed to
>>> stdout. This setup seems to give consistent and accurate results.
>>>
>>> To validate the behavior, I added two atomic counting tests: global and local.
>>> For the global one, all producer threads atomically increment the same
>>> counter as fast as possible. This, of course, leads to a huge drop in
>>> performance once there is more than one producer thread, due to CPUs fighting
>>> over the same memory location.
>>>
>>> Local counting, on the other hand, maintains one counter per producer
>>> thread, incremented independently. Once per second, all counters are read and
>>> added together to form the final "counting throughput" measurement. As
>>> expected, such a setup demonstrates linear scalability with the number of
>>> producers (as long as there are enough physical CPU cores, of course). See
>>> the example output below. This setup can also nicely demonstrate the
>>> disastrous effects of false sharing if care is not taken to keep those
>>> per-producer counters on independent cache lines.
>>>
>>> The demo output shows the global counter first with 1 producer, then with 4.
>>> Both total and per-producer performance drop significantly. The last run is
>>> the local counter with 4 producers, demonstrating near-perfect scalability.
>>>
>>> $ ./bench -a -w1 -d2 -p1 count-global
>>> Setting up benchmark 'count-global'...
>>> Benchmark 'count-global' started.
>>> Iter 0 ( 24.822us): hits 148.179M/s (148.179M/prod), drops 0.000M/s
>>> Iter 1 ( 37.939us): hits 149.308M/s (149.308M/prod), drops 0.000M/s
>>> Iter 2 (-10.774us): hits 150.717M/s (150.717M/prod), drops 0.000M/s
>>> Iter 3 ( 3.807us): hits 151.435M/s (151.435M/prod), drops 0.000M/s
>>> Summary: hits 150.488 ± 1.079M/s (150.488M/prod), drops 0.000 ± 0.000M/s
>>>
>>> $ ./bench -a -w1 -d2 -p4 count-global
>>> Setting up benchmark 'count-global'...
>>> Benchmark 'count-global' started.
>>> Iter 0 ( 60.659us): hits 53.910M/s ( 13.477M/prod), drops 0.000M/s
>>> Iter 1 (-17.658us): hits 53.722M/s ( 13.431M/prod), drops 0.000M/s
>>> Iter 2 ( 5.865us): hits 53.495M/s ( 13.374M/prod), drops 0.000M/s
>>> Iter 3 ( 0.104us): hits 53.606M/s ( 13.402M/prod), drops 0.000M/s
>>> Summary: hits 53.608 ± 0.113M/s ( 13.402M/prod), drops 0.000 ± 0.000M/s
>>>
>>> $ ./bench -a -w1 -d2 -p4 count-local
>>> Setting up benchmark 'count-local'...
>>> Benchmark 'count-local' started.
>>> Iter 0 ( 23.388us): hits 640.450M/s (160.113M/prod), drops 0.000M/s
>>> Iter 1 ( 2.291us): hits 605.661M/s (151.415M/prod), drops 0.000M/s
>>> Iter 2 ( -6.415us): hits 607.092M/s (151.773M/prod), drops 0.000M/s
>>> Iter 3 ( -1.361us): hits 601.796M/s (150.449M/prod), drops 0.000M/s
>>> Summary: hits 604.849 ± 2.739M/s (151.212M/prod), drops 0.000 ± 0.000M/s
>>>
>>> Signed-off-by: Andrii Nakryiko <andriin@fb.com>
>>> ---
>>> tools/testing/selftests/bpf/.gitignore | 1 +
>>> tools/testing/selftests/bpf/Makefile | 13 +-
>>> tools/testing/selftests/bpf/bench.c | 372 ++++++++++++++++++
>>> tools/testing/selftests/bpf/bench.h | 74 ++++
>>> .../selftests/bpf/benchs/bench_count.c | 91 +++++
>>> 5 files changed, 550 insertions(+), 1 deletion(-)
>>> create mode 100644 tools/testing/selftests/bpf/bench.c
>>> create mode 100644 tools/testing/selftests/bpf/bench.h
>>> create mode 100644 tools/testing/selftests/bpf/benchs/bench_count.c
>>>
>
> [...]
>
> trimming is good :)
>
>>> +
>>> +void hits_drops_report_final(struct bench_res res[], int res_cnt)
>>> +{
>>> + int i;
>>> + double hits_mean = 0.0, drops_mean = 0.0;
>>> + double hits_stddev = 0.0, drops_stddev = 0.0;
>>> +
>>> + for (i = 0; i < res_cnt; i++) {
>>> + hits_mean += res[i].hits / 1000000.0 / (0.0 + res_cnt);
>>> + drops_mean += res[i].drops / 1000000.0 / (0.0 + res_cnt);
>>> + }
>>> +
>>> + if (res_cnt > 1) {
>>> + for (i = 0; i < res_cnt; i++) {
>>> + hits_stddev += (hits_mean - res[i].hits / 1000000.0) *
>>> + (hits_mean - res[i].hits / 1000000.0) /
>>> + (res_cnt - 1.0);
>>> + drops_stddev += (drops_mean - res[i].drops / 1000000.0) *
>>> + (drops_mean - res[i].drops / 1000000.0) /
>>> + (res_cnt - 1.0);
>>> + }
>>> + hits_stddev = sqrt(hits_stddev);
>>> + drops_stddev = sqrt(drops_stddev);
>>> + }
>>> + printf("Summary: hits %8.3lf \u00B1 %5.3lfM/s (%7.3lfM/prod), ",
>>> + hits_mean, hits_stddev, hits_mean / env.producer_cnt);
>>> + printf("drops %8.3lf \u00B1 %5.3lfM/s\n",
>>> + drops_mean, drops_stddev);
>>
>> The unicode char \u00B1 (for +|-) may cause warnings with some old
>> compilers (e.g., GCC 4.8.5), as it needs C99 mode.
>>
>> /data/users/yhs/work/net-next/tools/testing/selftests/bpf/bench.c:91:9:
>> warning: universal character names are only valid in C++ and C99
>> [enabled by default]
>> printf("Summary: hits %8.3lf \u00B1 %5.3lfM/s (%7.3lfM/prod), ",
>>
>> "+|-" is an alternative, but not as nice as \u00B1. Newer compilers
>> may already default to C99. Maybe we can just add C99 when building
>> `bench`?
>
> I think I'm fine with an ancient compiler emitting a harmless warning
> for code under selftests/bpf, honestly...
>
>>
>>> +}
>>> +
>>> +const char *argp_program_version = "benchmark";
>>> +const char *argp_program_bug_address = "<bpf@vger.kernel.org>";
>>> +const char argp_program_doc[] =
>>> +"benchmark Generic benchmarking framework.\n"
>>> +"\n"
>>> +"This tool runs benchmarks.\n"
>>> +"\n"
>>> +"USAGE: benchmark <bench-name>\n"
>>> +"\n"
>>> +"EXAMPLES:\n"
>>> +" # run 'count-local' benchmark with 1 producer and 1 consumer\n"
>>> +" benchmark count-local\n"
>>> +" # run 'count-local' with 16 producer and 8 consumer threads, pinned to CPUs\n"
>>> +" benchmark -p16 -c8 -a count-local\n";
>>
>> Some of the above global variables could probably be static,
>> but I do not have a strong preference on this.
>
> Oh, these are actually global variables the argp library expects; they have to be global.
>
>>
>>> +
>>> +static const struct argp_option opts[] = {
>>> + { "list", 'l', NULL, 0, "List available benchmarks"},
>>> + { "duration", 'd', "SEC", 0, "Duration of benchmark, seconds"},
>>> + { "warmup", 'w', "SEC", 0, "Warm-up period, seconds"},
>>> + { "producers", 'p', "NUM", 0, "Number of producer threads"},
>>> + { "consumers", 'c', "NUM", 0, "Number of consumer threads"},
>>> + { "verbose", 'v', NULL, 0, "Verbose debug output"},
>>> + { "affinity", 'a', NULL, 0, "Set consumer/producer thread affinity"},
>>> + { "b2b", 'b', NULL, 0, "Back-to-back mode"},
>>> + { "rb-output", 10001, NULL, 0, "Set consumer/producer thread affinity"},
>>
>> I did not see the b2b and rb-output options being processed in this file.
>
> That slipped through the rebasing cracks from the future ringbuf
> benchmarks; I will remove it. I also figured out a way to do this more
> modularly anyway (child parsers in argp).
>
>>
>>> + {},
>>> +};
>>> +
>
> [...]
>
>>> + for (i = 0; i < env.consumer_cnt; i++) {
>>> + err = pthread_create(&state.consumers[i], NULL,
>>> + bench->consumer_thread, (void *)(long)i);
>>> + if (err) {
>>> + fprintf(stderr, "failed to create consumer thread #%d: %d\n",
>>> + i, -errno);
>>> + exit(1);
>>> + }
>>> + if (env.affinity)
>>> + set_thread_affinity(state.consumers[i], i);
>>> + }
>>> + for (i = 0; i < env.producer_cnt; i++) {
>>> + err = pthread_create(&state.producers[i], NULL,
>>> + bench->producer_thread, (void *)(long)i);
>>> + if (err) {
>>> + fprintf(stderr, "failed to create producer thread #%d: %d\n",
>>> + i, -errno);
>>> + exit(1);
>>> + }
>>> + if (env.affinity)
>>> + set_thread_affinity(state.producers[i],
>>> + env.consumer_cnt + i);
>>
>> Here, we bind consumer threads first and then producer threads. I think
>> this is probably just an arbitrary choice?
>
> Yep, almost arbitrary. Most of my cases have 1 consumer and >=1
> producers, so it was convenient to have the consumer pinned to the same
> CPU, regardless of how many producers I have.
>
>>
>> In certain cases, I think people may want to have more advanced binding
>> scenarios, e.g., for hyperthreading, binding the consumer and producer to
>> the same core or to different cores, etc. One possibility is to introduce
>> a -c option similar to taskset's. If -c is not supplied, you can keep the
>> current default; otherwise, use the -c list.
>>
>
> Well, taskset's job is simpler: it takes a list of CPUs for a single
> PID, if I understand correctly. Here we have many threads, each of which
> might have a different CPU or even a set of CPUs. But I agree that for
> some benchmarks it's going to be critical to control this precisely.
> Here's how I'm going to allow the most flexibility without too much
> complexity.
>
> --prod-affinity 1,2,10-16,100 will specify a set of CPUs for producers.
> The first producer will use the CPU with the lowest index from that
> list, the second will take the second-lowest, and so on. If fewer CPUs
> are provided than necessary, it's an error. If more, it's fine.
>
> Then, for consumers, I will add an independent --cons-affinity parameter,
> which will do the same for consumer threads.
>
> Having two independent lists will allow testing scenarios where we
> want producers and consumers to fight for the same CPU.
>
> Does this sound ok?
This sounds good; it's more than what I asked for.
>
>>> + }
>>> +
>>> + printf("Benchmark '%s' started.\n", bench->name);
>>> +}
>>> +
>
> [...]
Thread overview: 13+ messages
2020-05-08 23:20 [PATCH v2 bpf-next 0/3] Add benchmark runner and few benchmarks Andrii Nakryiko
2020-05-08 23:20 ` [PATCH v2 bpf-next 1/3] selftests/bpf: add benchmark runner infrastructure Andrii Nakryiko
2020-05-09 17:10 ` Yonghong Song
2020-05-12 3:29 ` Andrii Nakryiko
2020-05-12 14:39 ` Yonghong Song [this message]
2020-05-08 23:20 ` [PATCH v2 bpf-next 2/3] selftest/bpf: fmod_ret prog and implement test_overhead as part of bench Andrii Nakryiko
2020-05-09 17:23 ` Yonghong Song
2020-05-12 4:22 ` Andrii Nakryiko
2020-05-12 15:11 ` Yonghong Song
2020-05-12 17:23 ` Andrii Nakryiko
2020-05-12 17:47 ` Yonghong Song
2020-05-08 23:20 ` [PATCH v2 bpf-next 3/3] selftest/bpf: add BPF triggering benchmark Andrii Nakryiko
2020-05-09 17:43 ` Yonghong Song