* [PATCH bpf-next 0/5] Optimize bpf_csum_diff() and homogenize for all archs
@ 2024-10-21 12:21 Puranjay Mohan
2024-10-21 12:21 ` [PATCH bpf-next 1/5] net: checksum: move from32to16() to generic header Puranjay Mohan
` (4 more replies)
0 siblings, 5 replies; 24+ messages in thread
From: Puranjay Mohan @ 2024-10-21 12:21 UTC (permalink / raw)
To: Albert Ou, Alexei Starovoitov, Andrew Morton, Andrii Nakryiko,
bpf, Daniel Borkmann, David S. Miller, Eduard Zingerman,
Eric Dumazet, Hao Luo, Helge Deller, Jakub Kicinski,
James E.J. Bottomley, Jiri Olsa, John Fastabend, KP Singh,
linux-kernel, linux-parisc, linux-riscv, Martin KaFai Lau,
Mykola Lysenko, netdev, Palmer Dabbelt, Paolo Abeni,
Paul Walmsley, Puranjay Mohan, Puranjay Mohan, Shuah Khan,
Song Liu, Stanislav Fomichev, Yonghong Song
The bpf_csum_diff() helper currently returns different values on different
architectures because it calls csum_partial(), which is either implemented
by the architecture (x86_64, arm, etc.) or provided by the generic
implementation in lib/checksum.c (arm64, riscv, etc.).
The implementation in lib/checksum.c returns a folded result that is 16
bits long, but the architecture-specific implementations can return an
unfolded value that is larger than 16 bits.
The helper uses a per-CPU scratchpad buffer into which it copies the data
before computing the csum over it. This copying can be optimised away by
utilising some mathematical properties of the csum.
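(Concretely, the identity being exploited, spelled out in patch 2, is:
	csum(to - from) == csum_sub(csum(to), csum(from))
so 'to' and 'from' no longer need to be copied into one contiguous
buffer.)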
Patch 1 in this series does the preparatory work for homogenizing the
helper, and patch 2 makes the changes to the helper itself. The performance
gain can be seen in the tables below, generated using the benchmark built
in patch 4:
x86-64:
+-------------+------------------+------------------+-------------+
| Buffer Size | Before | After | Improvement |
+-------------+------------------+------------------+-------------+
| 4 | 2.296 ± 0.066M/s | 3.415 ± 0.001M/s | 48.73 % |
| 8 | 2.320 ± 0.003M/s | 3.409 ± 0.003M/s | 46.93 % |
| 16 | 2.315 ± 0.001M/s | 3.414 ± 0.003M/s | 47.47 % |
| 20 | 2.318 ± 0.001M/s | 3.416 ± 0.001M/s | 47.36 % |
| 32 | 2.308 ± 0.003M/s | 3.413 ± 0.003M/s | 47.87 % |
| 40 | 2.300 ± 0.029M/s | 3.413 ± 0.003M/s | 48.39 % |
| 64 | 2.286 ± 0.001M/s | 3.410 ± 0.001M/s | 49.16 % |
| 128 | 2.250 ± 0.001M/s | 3.404 ± 0.001M/s | 51.28 % |
| 256 | 2.173 ± 0.001M/s | 3.383 ± 0.001M/s | 55.68 % |
| 512 | 2.023 ± 0.055M/s | 3.340 ± 0.001M/s | 65.10 % |
+-------------+------------------+------------------+-------------+
ARM64:
+-------------+------------------+------------------+-------------+
| Buffer Size | Before | After | Improvement |
+-------------+------------------+------------------+-------------+
| 4 | 1.397 ± 0.005M/s | 1.493 ± 0.005M/s | 6.87 % |
| 8 | 1.402 ± 0.002M/s | 1.489 ± 0.002M/s | 6.20 % |
| 16 | 1.391 ± 0.001M/s | 1.481 ± 0.001M/s | 6.47 % |
| 20 | 1.379 ± 0.001M/s | 1.477 ± 0.001M/s | 7.10 % |
| 32 | 1.358 ± 0.001M/s | 1.469 ± 0.002M/s | 8.17 % |
| 40 | 1.339 ± 0.001M/s | 1.462 ± 0.002M/s | 9.18 % |
| 64 | 1.302 ± 0.002M/s | 1.449 ± 0.003M/s | 11.29 % |
| 128 | 1.214 ± 0.001M/s | 1.443 ± 0.003M/s | 18.86 % |
| 256 | 1.080 ± 0.001M/s | 1.423 ± 0.001M/s | 31.75 % |
| 512 | 0.887 ± 0.001M/s | 1.411 ± 0.002M/s | 59.07 % |
+-------------+------------------+------------------+-------------+
Patch 5 adds a selftest that verifies the results produced by the helper
in multiple modes and edge cases.
Patch 3 reverts a workaround that masked the helper's return value so that
a test_verifier test would pass on all architectures.
Puranjay Mohan (5):
net: checksum: move from32to16() to generic header
bpf: bpf_csum_diff: optimize and homogenize for all archs
selftests/bpf: don't mask result of bpf_csum_diff() in test_verifier
selftests/bpf: Add benchmark for bpf_csum_diff() helper
selftests/bpf: Add a selftest for bpf_csum_diff()
arch/parisc/lib/checksum.c | 13 +-
include/net/checksum.h | 6 +
lib/checksum.c | 11 +-
net/core/filter.c | 37 +-
tools/testing/selftests/bpf/Makefile | 2 +
tools/testing/selftests/bpf/bench.c | 4 +
.../selftests/bpf/benchs/bench_csum_diff.c | 164 +++++++
.../bpf/benchs/run_bench_csum_diff.sh | 10 +
.../selftests/bpf/prog_tests/test_csum_diff.c | 408 ++++++++++++++++++
.../selftests/bpf/progs/csum_diff_bench.c | 25 ++
.../selftests/bpf/progs/csum_diff_test.c | 42 ++
.../bpf/progs/verifier_array_access.c | 3 +-
12 files changed, 674 insertions(+), 51 deletions(-)
create mode 100644 tools/testing/selftests/bpf/benchs/bench_csum_diff.c
create mode 100755 tools/testing/selftests/bpf/benchs/run_bench_csum_diff.sh
create mode 100644 tools/testing/selftests/bpf/prog_tests/test_csum_diff.c
create mode 100644 tools/testing/selftests/bpf/progs/csum_diff_bench.c
create mode 100644 tools/testing/selftests/bpf/progs/csum_diff_test.c
--
2.40.1
^ permalink raw reply [flat|nested] 24+ messages in thread
* [PATCH bpf-next 1/5] net: checksum: move from32to16() to generic header
2024-10-21 12:21 [PATCH bpf-next 0/5] Optimize bpf_csum_diff() and homogenize for all archs Puranjay Mohan
@ 2024-10-21 12:21 ` Puranjay Mohan
2024-10-21 13:41 ` Daniel Borkmann
` (2 more replies)
2024-10-21 12:21 ` [PATCH bpf-next 2/5] bpf: bpf_csum_diff: optimize and homogenize for all archs Puranjay Mohan
` (3 subsequent siblings)
4 siblings, 3 replies; 24+ messages in thread
From: Puranjay Mohan @ 2024-10-21 12:21 UTC (permalink / raw)
To: Albert Ou, Alexei Starovoitov, Andrew Morton, Andrii Nakryiko,
bpf, Daniel Borkmann, David S. Miller, Eduard Zingerman,
Eric Dumazet, Hao Luo, Helge Deller, Jakub Kicinski,
James E.J. Bottomley, Jiri Olsa, John Fastabend, KP Singh,
linux-kernel, linux-parisc, linux-riscv, Martin KaFai Lau,
Mykola Lysenko, netdev, Palmer Dabbelt, Paolo Abeni,
Paul Walmsley, Puranjay Mohan, Puranjay Mohan, Shuah Khan,
Song Liu, Stanislav Fomichev, Yonghong Song
from32to16() is used by lib/checksum.c and also by
arch/parisc/lib/checksum.c. The next patch will use it in the
bpf_csum_diff() helper.
Move from32to16() to include/net/checksum.h as csum_from32to16() and
remove the other implementations.
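As a quick illustrative check (not part of the patch): for the 32-bit
value 0x0001ffe2, the rotate-based csum_from32to16() computes
	sum += (sum >> 16) | (sum << 16);	/* 0x0001ffe2 + 0xffe20001 = 0xffe3ffe3 */
	return (__force __sum16)(sum >> 16);	/* 0xffe3 */
which matches the classic two-step fold, (x & 0xffff) + (x >> 16) applied
twice, used by the implementations being removed.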
Signed-off-by: Puranjay Mohan <puranjay@kernel.org>
---
arch/parisc/lib/checksum.c | 13 ++-----------
include/net/checksum.h | 6 ++++++
lib/checksum.c | 11 +----------
3 files changed, 9 insertions(+), 21 deletions(-)
diff --git a/arch/parisc/lib/checksum.c b/arch/parisc/lib/checksum.c
index 4818f3db84a5c..59d8c15d81bd0 100644
--- a/arch/parisc/lib/checksum.c
+++ b/arch/parisc/lib/checksum.c
@@ -25,15 +25,6 @@
: "=r"(_t) \
: "r"(_r), "0"(_t));
-static inline unsigned short from32to16(unsigned int x)
-{
- /* 32 bits --> 16 bits + carry */
- x = (x & 0xffff) + (x >> 16);
- /* 16 bits + carry --> 16 bits including carry */
- x = (x & 0xffff) + (x >> 16);
- return (unsigned short)x;
-}
-
static inline unsigned int do_csum(const unsigned char * buff, int len)
{
int odd, count;
@@ -85,7 +76,7 @@ static inline unsigned int do_csum(const unsigned char * buff, int len)
}
if (len & 1)
result += le16_to_cpu(*buff);
- result = from32to16(result);
+ result = csum_from32to16(result);
if (odd)
result = swab16(result);
out:
@@ -102,7 +93,7 @@ __wsum csum_partial(const void *buff, int len, __wsum sum)
{
unsigned int result = do_csum(buff, len);
addc(result, sum);
- return (__force __wsum)from32to16(result);
+ return (__force __wsum)csum_from32to16(result);
}
EXPORT_SYMBOL(csum_partial);
diff --git a/include/net/checksum.h b/include/net/checksum.h
index 1338cb92c8e72..0d082febfead4 100644
--- a/include/net/checksum.h
+++ b/include/net/checksum.h
@@ -151,6 +151,12 @@ static inline void csum_replace(__wsum *csum, __wsum old, __wsum new)
*csum = csum_add(csum_sub(*csum, old), new);
}
+static inline __sum16 csum_from32to16(__wsum sum)
+{
+ sum += (sum >> 16) | (sum << 16);
+ return (__force __sum16)(sum >> 16);
+}
+
struct sk_buff;
void inet_proto_csum_replace4(__sum16 *sum, struct sk_buff *skb,
__be32 from, __be32 to, bool pseudohdr);
diff --git a/lib/checksum.c b/lib/checksum.c
index 6860d6b05a171..025ba546e1ec6 100644
--- a/lib/checksum.c
+++ b/lib/checksum.c
@@ -34,15 +34,6 @@
#include <asm/byteorder.h>
#ifndef do_csum
-static inline unsigned short from32to16(unsigned int x)
-{
- /* add up 16-bit and 16-bit for 16+c bit */
- x = (x & 0xffff) + (x >> 16);
- /* add up carry.. */
- x = (x & 0xffff) + (x >> 16);
- return x;
-}
-
static unsigned int do_csum(const unsigned char *buff, int len)
{
int odd;
@@ -90,7 +81,7 @@ static unsigned int do_csum(const unsigned char *buff, int len)
#else
result += (*buff << 8);
#endif
- result = from32to16(result);
+ result = csum_from32to16(result);
if (odd)
result = ((result >> 8) & 0xff) | ((result & 0xff) << 8);
out:
--
2.40.1
^ permalink raw reply related [flat|nested] 24+ messages in thread
* [PATCH bpf-next 2/5] bpf: bpf_csum_diff: optimize and homogenize for all archs
2024-10-21 12:21 [PATCH bpf-next 0/5] Optimize bpf_csum_diff() and homogenize for all archs Puranjay Mohan
2024-10-21 12:21 ` [PATCH bpf-next 1/5] net: checksum: move from32to16() to generic header Puranjay Mohan
@ 2024-10-21 12:21 ` Puranjay Mohan
2024-10-21 13:42 ` Daniel Borkmann
` (2 more replies)
2024-10-21 12:21 ` [PATCH bpf-next 3/5] selftests/bpf: don't mask result of bpf_csum_diff() in test_verifier Puranjay Mohan
` (2 subsequent siblings)
4 siblings, 3 replies; 24+ messages in thread
From: Puranjay Mohan @ 2024-10-21 12:21 UTC (permalink / raw)
To: Albert Ou, Alexei Starovoitov, Andrew Morton, Andrii Nakryiko,
bpf, Daniel Borkmann, David S. Miller, Eduard Zingerman,
Eric Dumazet, Hao Luo, Helge Deller, Jakub Kicinski,
James E.J. Bottomley, Jiri Olsa, John Fastabend, KP Singh,
linux-kernel, linux-parisc, linux-riscv, Martin KaFai Lau,
Mykola Lysenko, netdev, Palmer Dabbelt, Paolo Abeni,
Paul Walmsley, Puranjay Mohan, Puranjay Mohan, Shuah Khan,
Song Liu, Stanislav Fomichev, Yonghong Song
1. Optimization
------------
The current implementation copies the 'from' and 'to' buffers to a
scratchpad and it takes the bitwise NOT of 'from' buffer while copying.
In the next step csum_partial() is called with this scratchpad.
So, mathematically, the current implementation computes:
result = csum(to - from)
Here, 'to' and '~from' are copied into the scratchpad buffer; this is
needed because csum_partial() takes a single contiguous buffer, not two
disjoint buffers like 'to' and 'from'.
We can rewrite this equation as:
result = csum(to) - csum(from)
using the distributive property of csum().
This allows 'to' and 'from' to be at different locations, so the
scratchpad and the copying are no longer needed.
In C, this looks like:
result = csum_sub(csum_partial(to, to_size, seed),
csum_partial(from, from_size, 0));
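Illustratively, the pull case (to_size == 0) follows from the same algebra,
using the one's-complement negation -x == ~x:
	result = seed - csum(from)
	       = ~(~seed + csum(from))
which is what the updated helper computes as:
	~csum_partial(from, from_size, ~seed)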
2. Homogenization
--------------
The bpf_csum_diff() helper calls csum_partial() which is implemented by
some architectures like arm and x86 but other architectures rely on the
generic implementation in lib/checksum.c
The generic implementation in lib/checksum.c returns a 16-bit value, but
the arch-specific implementations can return more than 16 bits. This works
out in most places because, before the result is used, it is passed
through csum_fold(), which turns it into a 16-bit value.
bpf_csum_diff() directly returns the value from csum_partial(), and
therefore the returned values can differ across architectures. See the
discussion in [1]:
For the int value 28, the calculated checksums are:
x86 : -29 : 0xffffffe3
generic (arm64, riscv) : 65507 : 0x0000ffe3
arm : 131042 : 0x0001ffe2
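(With the change below, all three of these fold to the same 16-bit value,
0xffe3 = 65507.)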
Pass the result of bpf_csum_diff() through csum_from32to16() before
returning, to homogenize the result for all architectures.
NOTE: csum_from32to16() is used instead of csum_fold() because csum_fold()
does csum_from32to16() + a bitwise NOT of the result, which is not what we
want to do here.
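For reference, from the generic definitions, csum_fold(x) is equivalent to
(__sum16)~csum_from32to16(x), so reusing csum_fold() here would have
flipped all 16 bits of the result.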
[1] https://lore.kernel.org/bpf/CAJ+HfNiQbOcqCLxFUP2FMm5QrLXUUaj852Fxe3hn_2JNiucn6g@mail.gmail.com/
Signed-off-by: Puranjay Mohan <puranjay@kernel.org>
---
net/core/filter.c | 37 +++++++++----------------------------
1 file changed, 9 insertions(+), 28 deletions(-)
diff --git a/net/core/filter.c b/net/core/filter.c
index bd0d08bf76bb8..e00bec7de9edd 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -1654,18 +1654,6 @@ void sk_reuseport_prog_free(struct bpf_prog *prog)
bpf_prog_destroy(prog);
}
-struct bpf_scratchpad {
- union {
- __be32 diff[MAX_BPF_STACK / sizeof(__be32)];
- u8 buff[MAX_BPF_STACK];
- };
- local_lock_t bh_lock;
-};
-
-static DEFINE_PER_CPU(struct bpf_scratchpad, bpf_sp) = {
- .bh_lock = INIT_LOCAL_LOCK(bh_lock),
-};
-
static inline int __bpf_try_make_writable(struct sk_buff *skb,
unsigned int write_len)
{
@@ -2022,11 +2010,6 @@ static const struct bpf_func_proto bpf_l4_csum_replace_proto = {
BPF_CALL_5(bpf_csum_diff, __be32 *, from, u32, from_size,
__be32 *, to, u32, to_size, __wsum, seed)
{
- struct bpf_scratchpad *sp = this_cpu_ptr(&bpf_sp);
- u32 diff_size = from_size + to_size;
- int i, j = 0;
- __wsum ret;
-
/* This is quite flexible, some examples:
*
* from_size == 0, to_size > 0, seed := csum --> pushing data
@@ -2035,19 +2018,17 @@ BPF_CALL_5(bpf_csum_diff, __be32 *, from, u32, from_size,
*
* Even for diffing, from_size and to_size don't need to be equal.
*/
- if (unlikely(((from_size | to_size) & (sizeof(__be32) - 1)) ||
- diff_size > sizeof(sp->diff)))
- return -EINVAL;
- local_lock_nested_bh(&bpf_sp.bh_lock);
- for (i = 0; i < from_size / sizeof(__be32); i++, j++)
- sp->diff[j] = ~from[i];
- for (i = 0; i < to_size / sizeof(__be32); i++, j++)
- sp->diff[j] = to[i];
+ if (from_size && to_size)
+ return csum_from32to16(csum_sub(csum_partial(to, to_size, seed),
+ csum_partial(from, from_size, 0)));
+ if (to_size)
+ return csum_from32to16(csum_partial(to, to_size, seed));
- ret = csum_partial(sp->diff, diff_size, seed);
- local_unlock_nested_bh(&bpf_sp.bh_lock);
- return ret;
+ if (from_size)
+ return csum_from32to16(~csum_partial(from, from_size, ~seed));
+
+ return seed;
}
static const struct bpf_func_proto bpf_csum_diff_proto = {
--
2.40.1
^ permalink raw reply related [flat|nested] 24+ messages in thread
* [PATCH bpf-next 3/5] selftests/bpf: don't mask result of bpf_csum_diff() in test_verifier
2024-10-21 12:21 [PATCH bpf-next 0/5] Optimize bpf_csum_diff() and homogenize for all archs Puranjay Mohan
2024-10-21 12:21 ` [PATCH bpf-next 1/5] net: checksum: move from32to16() to generic header Puranjay Mohan
2024-10-21 12:21 ` [PATCH bpf-next 2/5] bpf: bpf_csum_diff: optimize and homogenize for all archs Puranjay Mohan
@ 2024-10-21 12:21 ` Puranjay Mohan
2024-10-21 13:01 ` Helge Deller
` (2 more replies)
2024-10-21 12:21 ` [PATCH bpf-next 4/5] selftests/bpf: Add benchmark for bpf_csum_diff() helper Puranjay Mohan
2024-10-21 12:21 ` [PATCH bpf-next 5/5] selftests/bpf: Add a selftest for bpf_csum_diff() Puranjay Mohan
4 siblings, 3 replies; 24+ messages in thread
From: Puranjay Mohan @ 2024-10-21 12:21 UTC (permalink / raw)
To: Albert Ou, Alexei Starovoitov, Andrew Morton, Andrii Nakryiko,
bpf, Daniel Borkmann, David S. Miller, Eduard Zingerman,
Eric Dumazet, Hao Luo, Helge Deller, Jakub Kicinski,
James E.J. Bottomley, Jiri Olsa, John Fastabend, KP Singh,
linux-kernel, linux-parisc, linux-riscv, Martin KaFai Lau,
Mykola Lysenko, netdev, Palmer Dabbelt, Paolo Abeni,
Paul Walmsley, Puranjay Mohan, Puranjay Mohan, Shuah Khan,
Song Liu, Stanislav Fomichev, Yonghong Song
The bpf_csum_diff() helper has been fixed to return a 16-bit value for
all archs, so now we don't need to mask the result.
This commit is basically reverting the below:
commit 6185266c5a85 ("selftests/bpf: Mask bpf_csum_diff() return value
to 16 bits in test_verifier")
Signed-off-by: Puranjay Mohan <puranjay@kernel.org>
---
tools/testing/selftests/bpf/progs/verifier_array_access.c | 3 +--
1 file changed, 1 insertion(+), 2 deletions(-)
diff --git a/tools/testing/selftests/bpf/progs/verifier_array_access.c b/tools/testing/selftests/bpf/progs/verifier_array_access.c
index 95d7ecc12963b..4195aa824ba55 100644
--- a/tools/testing/selftests/bpf/progs/verifier_array_access.c
+++ b/tools/testing/selftests/bpf/progs/verifier_array_access.c
@@ -368,8 +368,7 @@ __naked void a_read_only_array_2_1(void)
r4 = 0; \
r5 = 0; \
call %[bpf_csum_diff]; \
-l0_%=: r0 &= 0xffff; \
- exit; \
+l0_%=: exit; \
" :
: __imm(bpf_csum_diff),
__imm(bpf_map_lookup_elem),
--
2.40.1
^ permalink raw reply related [flat|nested] 24+ messages in thread
* [PATCH bpf-next 4/5] selftests/bpf: Add benchmark for bpf_csum_diff() helper
2024-10-21 12:21 [PATCH bpf-next 0/5] Optimize bpf_csum_diff() and homogenize for all archs Puranjay Mohan
` (2 preceding siblings ...)
2024-10-21 12:21 ` [PATCH bpf-next 3/5] selftests/bpf: don't mask result of bpf_csum_diff() in test_verifier Puranjay Mohan
@ 2024-10-21 12:21 ` Puranjay Mohan
2024-10-21 13:43 ` Daniel Borkmann
2024-10-21 23:28 ` Andrii Nakryiko
2024-10-21 12:21 ` [PATCH bpf-next 5/5] selftests/bpf: Add a selftest for bpf_csum_diff() Puranjay Mohan
4 siblings, 2 replies; 24+ messages in thread
From: Puranjay Mohan @ 2024-10-21 12:21 UTC (permalink / raw)
To: Albert Ou, Alexei Starovoitov, Andrew Morton, Andrii Nakryiko,
bpf, Daniel Borkmann, David S. Miller, Eduard Zingerman,
Eric Dumazet, Hao Luo, Helge Deller, Jakub Kicinski,
James E.J. Bottomley, Jiri Olsa, John Fastabend, KP Singh,
linux-kernel, linux-parisc, linux-riscv, Martin KaFai Lau,
Mykola Lysenko, netdev, Palmer Dabbelt, Paolo Abeni,
Paul Walmsley, Puranjay Mohan, Puranjay Mohan, Shuah Khan,
Song Liu, Stanislav Fomichev, Yonghong Song
Add a microbenchmark for the bpf_csum_diff() helper. The benchmark works
by filling a 4KB buffer with random data and calculating the Internet
checksum on different parts of this buffer using bpf_csum_diff().
Example run using ./benchs/run_bench_csum_diff.sh on x86_64:
[bpf]$ ./benchs/run_bench_csum_diff.sh
4 2.296 ± 0.066M/s (drops 0.000 ± 0.000M/s)
8 2.320 ± 0.003M/s (drops 0.000 ± 0.000M/s)
16 2.315 ± 0.001M/s (drops 0.000 ± 0.000M/s)
20 2.318 ± 0.001M/s (drops 0.000 ± 0.000M/s)
32 2.308 ± 0.003M/s (drops 0.000 ± 0.000M/s)
40 2.300 ± 0.029M/s (drops 0.000 ± 0.000M/s)
64 2.286 ± 0.001M/s (drops 0.000 ± 0.000M/s)
128 2.250 ± 0.001M/s (drops 0.000 ± 0.000M/s)
256 2.173 ± 0.001M/s (drops 0.000 ± 0.000M/s)
512 2.023 ± 0.055M/s (drops 0.000 ± 0.000M/s)
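A single buffer size can be benchmarked directly with something like the
following (assuming the bench binary has been built in the selftests
output directory):
	./bench --buff-len=512 csum-diff-helper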
Signed-off-by: Puranjay Mohan <puranjay@kernel.org>
---
tools/testing/selftests/bpf/Makefile | 2 +
tools/testing/selftests/bpf/bench.c | 4 +
.../selftests/bpf/benchs/bench_csum_diff.c | 164 ++++++++++++++++++
.../bpf/benchs/run_bench_csum_diff.sh | 10 ++
.../selftests/bpf/progs/csum_diff_bench.c | 25 +++
5 files changed, 205 insertions(+)
create mode 100644 tools/testing/selftests/bpf/benchs/bench_csum_diff.c
create mode 100755 tools/testing/selftests/bpf/benchs/run_bench_csum_diff.sh
create mode 100644 tools/testing/selftests/bpf/progs/csum_diff_bench.c
diff --git a/tools/testing/selftests/bpf/Makefile b/tools/testing/selftests/bpf/Makefile
index 28a76baa854d3..a0d86dd453e16 100644
--- a/tools/testing/selftests/bpf/Makefile
+++ b/tools/testing/selftests/bpf/Makefile
@@ -809,6 +809,7 @@ $(OUTPUT)/bench_ringbufs.o: $(OUTPUT)/ringbuf_bench.skel.h \
$(OUTPUT)/bench_bloom_filter_map.o: $(OUTPUT)/bloom_filter_bench.skel.h
$(OUTPUT)/bench_bpf_loop.o: $(OUTPUT)/bpf_loop_bench.skel.h
$(OUTPUT)/bench_strncmp.o: $(OUTPUT)/strncmp_bench.skel.h
+$(OUTPUT)/bench_csum_diff.o: $(OUTPUT)/csum_diff_bench.skel.h
$(OUTPUT)/bench_bpf_hashmap_full_update.o: $(OUTPUT)/bpf_hashmap_full_update_bench.skel.h
$(OUTPUT)/bench_local_storage.o: $(OUTPUT)/local_storage_bench.skel.h
$(OUTPUT)/bench_local_storage_rcu_tasks_trace.o: $(OUTPUT)/local_storage_rcu_tasks_trace_bench.skel.h
@@ -829,6 +830,7 @@ $(OUTPUT)/bench: $(OUTPUT)/bench.o \
$(OUTPUT)/bench_bloom_filter_map.o \
$(OUTPUT)/bench_bpf_loop.o \
$(OUTPUT)/bench_strncmp.o \
+ $(OUTPUT)/bench_csum_diff.o \
$(OUTPUT)/bench_bpf_hashmap_full_update.o \
$(OUTPUT)/bench_local_storage.o \
$(OUTPUT)/bench_local_storage_rcu_tasks_trace.o \
diff --git a/tools/testing/selftests/bpf/bench.c b/tools/testing/selftests/bpf/bench.c
index 1bd403a5ef7b3..29bd6f4498ebc 100644
--- a/tools/testing/selftests/bpf/bench.c
+++ b/tools/testing/selftests/bpf/bench.c
@@ -278,6 +278,7 @@ extern struct argp bench_bpf_loop_argp;
extern struct argp bench_local_storage_argp;
extern struct argp bench_local_storage_rcu_tasks_trace_argp;
extern struct argp bench_strncmp_argp;
+extern struct argp bench_csum_diff_argp;
extern struct argp bench_hashmap_lookup_argp;
extern struct argp bench_local_storage_create_argp;
extern struct argp bench_htab_mem_argp;
@@ -290,6 +291,7 @@ static const struct argp_child bench_parsers[] = {
{ &bench_bpf_loop_argp, 0, "bpf_loop helper benchmark", 0 },
{ &bench_local_storage_argp, 0, "local_storage benchmark", 0 },
{ &bench_strncmp_argp, 0, "bpf_strncmp helper benchmark", 0 },
+ { &bench_csum_diff_argp, 0, "bpf_csum_diff helper benchmark", 0 },
{ &bench_local_storage_rcu_tasks_trace_argp, 0,
"local_storage RCU Tasks Trace slowdown benchmark", 0 },
{ &bench_hashmap_lookup_argp, 0, "Hashmap lookup benchmark", 0 },
@@ -539,6 +541,7 @@ extern const struct bench bench_hashmap_with_bloom;
extern const struct bench bench_bpf_loop;
extern const struct bench bench_strncmp_no_helper;
extern const struct bench bench_strncmp_helper;
+extern const struct bench bench_csum_diff;
extern const struct bench bench_bpf_hashmap_full_update;
extern const struct bench bench_local_storage_cache_seq_get;
extern const struct bench bench_local_storage_cache_interleaved_get;
@@ -599,6 +602,7 @@ static const struct bench *benchs[] = {
&bench_bpf_loop,
&bench_strncmp_no_helper,
&bench_strncmp_helper,
+ &bench_csum_diff,
&bench_bpf_hashmap_full_update,
&bench_local_storage_cache_seq_get,
&bench_local_storage_cache_interleaved_get,
diff --git a/tools/testing/selftests/bpf/benchs/bench_csum_diff.c b/tools/testing/selftests/bpf/benchs/bench_csum_diff.c
new file mode 100644
index 0000000000000..2c30c8b54d9bc
--- /dev/null
+++ b/tools/testing/selftests/bpf/benchs/bench_csum_diff.c
@@ -0,0 +1,164 @@
+// SPDX-License-Identifier: GPL-2.0
+/* Copyright Amazon.com Inc. or its affiliates */
+#include <argp.h>
+#include "bench.h"
+#include "csum_diff_bench.skel.h"
+
+static struct csum_diff_ctx {
+ struct csum_diff_bench *skel;
+ int pfd;
+} ctx;
+
+static struct csum_diff_args {
+ u32 buff_len;
+} args = {
+ .buff_len = 32,
+};
+
+enum {
+ ARG_BUFF_LEN = 5000,
+};
+
+static const struct argp_option opts[] = {
+ { "buff-len", ARG_BUFF_LEN, "BUFF_LEN", 0,
+ "Set the length of the buffer" },
+ {},
+};
+
+static error_t csum_diff_parse_arg(int key, char *arg, struct argp_state *state)
+{
+ switch (key) {
+ case ARG_BUFF_LEN:
+ args.buff_len = strtoul(arg, NULL, 10);
+ if (!args.buff_len ||
+ args.buff_len >= sizeof(ctx.skel->rodata->buff)) {
+ fprintf(stderr, "Invalid buff len (limit %zu)\n",
+ sizeof(ctx.skel->rodata->buff));
+ argp_usage(state);
+ }
+ break;
+ default:
+ return ARGP_ERR_UNKNOWN;
+ }
+
+ return 0;
+}
+
+const struct argp bench_csum_diff_argp = {
+ .options = opts,
+ .parser = csum_diff_parse_arg,
+};
+
+static void csum_diff_validate(void)
+{
+ if (env.consumer_cnt != 0) {
+ fprintf(stderr, "csum_diff benchmark doesn't support consumer!\n");
+ exit(1);
+ }
+}
+
+static void csum_diff_setup(void)
+{
+ int err;
+ char *buff;
+ size_t i, sz;
+
+ sz = sizeof(ctx.skel->rodata->buff);
+
+ setup_libbpf();
+
+ ctx.skel = csum_diff_bench__open();
+ if (!ctx.skel) {
+ fprintf(stderr, "failed to open skeleton\n");
+ exit(1);
+ }
+
+ srandom(time(NULL));
+ buff = ctx.skel->rodata->buff;
+
+ /*
+ * Set the first 8 bytes of the buffer to 0xdeadbeefdeadbeef. This is later used to
+ * verify the correctness of the helper by checking that the checksum of
+ * 0xdeadbeefdeadbeef is 0x3b3b.
+ */
+
+ *(u64 *)buff = 0xdeadbeefdeadbeef;
+
+ for (i = 8; i < sz; i++)
+ buff[i] = '1' + random() % 9;
+
+ ctx.skel->rodata->buff_len = args.buff_len;
+
+ err = csum_diff_bench__load(ctx.skel);
+ if (err) {
+ fprintf(stderr, "failed to load skeleton\n");
+ csum_diff_bench__destroy(ctx.skel);
+ exit(1);
+ }
+}
+
+static void csum_diff_helper_setup(void)
+{
+ u8 tmp_out[64 << 2] = {};
+ u8 tmp_in[64] = {};
+ int err, saved_errno;
+
+ LIBBPF_OPTS(bpf_test_run_opts, topts,
+ .data_in = tmp_in,
+ .data_size_in = sizeof(tmp_in),
+ .data_out = tmp_out,
+ .data_size_out = sizeof(tmp_out),
+ .repeat = 1,
+ );
+ csum_diff_setup();
+ ctx.pfd = bpf_program__fd(ctx.skel->progs.compute_checksum);
+
+ err = bpf_prog_test_run_opts(ctx.pfd, &topts);
+ saved_errno = errno;
+
+ if (err) {
+ fprintf(stderr, "failed to run setup prog: err %d, result %d, serror %d\n",
+ err, ctx.skel->bss->result, saved_errno);
+ csum_diff_bench__destroy(ctx.skel);
+ exit(1);
+ }
+
+ /* Sanity check for correctness of helper */
+ if (args.buff_len == 8 && ctx.skel->bss->result != 0x3b3b) {
+ fprintf(stderr, "csum_diff helper broken: buff: %lx, result: %x, expected: %x\n",
+ *(u64 *)ctx.skel->rodata->buff, ctx.skel->bss->result, 0x3b3b);
+ }
+}
+
+static void *csum_diff_producer(void *unused)
+{
+ u8 tmp_out[64 << 2] = {};
+ u8 tmp_in[64] = {};
+
+ LIBBPF_OPTS(bpf_test_run_opts, topts,
+ .data_in = tmp_in,
+ .data_size_in = sizeof(tmp_in),
+ .data_out = tmp_out,
+ .data_size_out = sizeof(tmp_out),
+ .repeat = 64,
+ );
+ while (true)
+ (void)bpf_prog_test_run_opts(ctx.pfd, &topts);
+ return NULL;
+}
+
+static void csum_diff_measure(struct bench_res *res)
+{
+ res->hits = atomic_swap(&ctx.skel->bss->hits, 0);
+}
+
+const struct bench bench_csum_diff = {
+ .name = "csum-diff-helper",
+ .argp = &bench_csum_diff_argp,
+ .validate = csum_diff_validate,
+ .setup = csum_diff_helper_setup,
+ .producer_thread = csum_diff_producer,
+ .measure = csum_diff_measure,
+ .report_progress = hits_drops_report_progress,
+ .report_final = hits_drops_report_final,
+};
diff --git a/tools/testing/selftests/bpf/benchs/run_bench_csum_diff.sh b/tools/testing/selftests/bpf/benchs/run_bench_csum_diff.sh
new file mode 100755
index 0000000000000..c4e147fbf2f98
--- /dev/null
+++ b/tools/testing/selftests/bpf/benchs/run_bench_csum_diff.sh
@@ -0,0 +1,10 @@
+#!/bin/bash
+# SPDX-License-Identifier: GPL-2.0
+
+source ./benchs/run_common.sh
+
+set -eufo pipefail
+
+for s in 4 8 16 20 32 40 64 128 256 512; do
+ summarize ${s} "$($RUN_BENCH --buff-len=$s csum-diff-helper)"
+done
diff --git a/tools/testing/selftests/bpf/progs/csum_diff_bench.c b/tools/testing/selftests/bpf/progs/csum_diff_bench.c
new file mode 100644
index 0000000000000..85245edd6f9dc
--- /dev/null
+++ b/tools/testing/selftests/bpf/progs/csum_diff_bench.c
@@ -0,0 +1,25 @@
+// SPDX-License-Identifier: GPL-2.0
+/* Copyright Amazon.com Inc. or its affiliates */
+#include <linux/types.h>
+#include <linux/bpf.h>
+#include <bpf/bpf_helpers.h>
+#include <bpf/bpf_tracing.h>
+
+#define BUFF_SZ 4096
+
+/* Will be updated by benchmark before program loading */
+const char buff[BUFF_SZ];
+const volatile unsigned int buff_len = 4;
+
+long hits = 0;
+short result;
+
+char _license[] SEC("license") = "GPL";
+
+SEC("tc")
+int compute_checksum(void *ctx)
+{
+ result = bpf_csum_diff(0, 0, (void *)buff, buff_len, 0);
+ __sync_add_and_fetch(&hits, 1);
+ return 0;
+}
--
2.40.1
^ permalink raw reply related [flat|nested] 24+ messages in thread
* [PATCH bpf-next 5/5] selftests/bpf: Add a selftest for bpf_csum_diff()
2024-10-21 12:21 [PATCH bpf-next 0/5] Optimize bpf_csum_diff() and homogenize for all archs Puranjay Mohan
` (3 preceding siblings ...)
2024-10-21 12:21 ` [PATCH bpf-next 4/5] selftests/bpf: Add benchmark for bpf_csum_diff() helper Puranjay Mohan
@ 2024-10-21 12:21 ` Puranjay Mohan
2024-10-21 13:44 ` Daniel Borkmann
4 siblings, 1 reply; 24+ messages in thread
From: Puranjay Mohan @ 2024-10-21 12:21 UTC (permalink / raw)
To: Albert Ou, Alexei Starovoitov, Andrew Morton, Andrii Nakryiko,
bpf, Daniel Borkmann, David S. Miller, Eduard Zingerman,
Eric Dumazet, Hao Luo, Helge Deller, Jakub Kicinski,
James E.J. Bottomley, Jiri Olsa, John Fastabend, KP Singh,
linux-kernel, linux-parisc, linux-riscv, Martin KaFai Lau,
Mykola Lysenko, netdev, Palmer Dabbelt, Paolo Abeni,
Paul Walmsley, Puranjay Mohan, Puranjay Mohan, Shuah Khan,
Song Liu, Stanislav Fomichev, Yonghong Song
Add a selftest for the bpf_csum_diff() helper. This selftest runs the
helper in all three configurations (push, pull, and diff) and verifies its
output. The expected results were computed by hand and cross-checked
against the helper's older implementation.
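The test can be run with something like the following (assuming a built
selftests tree):
	./test_progs -t test_csum_diff
which exercises the csum_diff_push, csum_diff_pull, csum_diff_diff, and
csum_diff_edge subtests.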
Signed-off-by: Puranjay Mohan <puranjay@kernel.org>
---
.../selftests/bpf/prog_tests/test_csum_diff.c | 408 ++++++++++++++++++
.../selftests/bpf/progs/csum_diff_test.c | 42 ++
2 files changed, 450 insertions(+)
create mode 100644 tools/testing/selftests/bpf/prog_tests/test_csum_diff.c
create mode 100644 tools/testing/selftests/bpf/progs/csum_diff_test.c
diff --git a/tools/testing/selftests/bpf/prog_tests/test_csum_diff.c b/tools/testing/selftests/bpf/prog_tests/test_csum_diff.c
new file mode 100644
index 0000000000000..107b20d43e839
--- /dev/null
+++ b/tools/testing/selftests/bpf/prog_tests/test_csum_diff.c
@@ -0,0 +1,408 @@
+// SPDX-License-Identifier: GPL-2.0
+/* Copyright Amazon.com Inc. or its affiliates */
+#include <test_progs.h>
+#include "csum_diff_test.skel.h"
+
+#define BUFF_SZ 512
+
+struct testcase {
+ unsigned long long to_buff[BUFF_SZ / 8];
+ unsigned int to_buff_len;
+ unsigned long long from_buff[BUFF_SZ / 8];
+ unsigned int from_buff_len;
+ unsigned short seed;
+ unsigned short result;
+};
+
+#define NUM_PUSH_TESTS 4
+
+struct testcase push_tests[NUM_PUSH_TESTS] = {
+ {
+ .to_buff = {
+ 0xdeadbeefdeadbeef,
+ },
+ .to_buff_len = 8,
+ .from_buff = {},
+ .from_buff_len = 0,
+ .seed = 0,
+ .result = 0x3b3b
+ },
+ {
+ .to_buff = {
+ 0xdeadbeefdeadbeef,
+ 0xbeefdeadbeefdead,
+ },
+ .to_buff_len = 16,
+ .from_buff = {},
+ .from_buff_len = 0,
+ .seed = 0x1234,
+ .result = 0x88aa
+ },
+ {
+ .to_buff = {
+ 0xdeadbeefdeadbeef,
+ 0xbeefdeadbeefdead,
+ },
+ .to_buff_len = 15,
+ .from_buff = {},
+ .from_buff_len = 0,
+ .seed = 0x1234,
+#if __BYTE_ORDER__ == __ORDER_LITTLE_ENDIAN__
+ .result = 0xcaa9
+#else
+ .result = 0x87fd
+#endif
+ },
+ {
+ .to_buff = {
+ 0x327b23c66b8b4567,
+ 0x66334873643c9869,
+ 0x19495cff74b0dc51,
+ 0x625558ec2ae8944a,
+ 0x46e87ccd238e1f29,
+ 0x507ed7ab3d1b58ba,
+ 0x41b71efb2eb141f2,
+ 0x7545e14679e2a9e3,
+ 0x5bd062c2515f007c,
+ 0x4db127f812200854,
+ 0x1f16e9e80216231b,
+ 0x66ef438d1190cde7,
+ 0x3352255a140e0f76,
+ 0x0ded7263109cf92e,
+ 0x1befd79f7fdcc233,
+ 0x6b68079a41a7c4c9,
+ 0x25e45d324e6afb66,
+ 0x431bd7b7519b500d,
+ 0x7c83e4583f2dba31,
+ 0x62bbd95a257130a3,
+ 0x628c895d436c6125,
+ 0x721da317333ab105,
+ 0x2d1d5ae92443a858,
+ 0x75a2a8d46763845e,
+ 0x79838cb208edbdab,
+ 0x0b03e0c64353d0cd,
+ 0x54e49eb4189a769b,
+ 0x2ca8861171f32454,
+ 0x02901d820836c40e,
+ 0x081386413a95f874,
+ 0x7c3dbd3d1e7ff521,
+ 0x6ceaf087737b8ddc,
+ 0x4516dde922221a70,
+ 0x614fd4a13006c83e,
+ 0x5577f8e1419ac241,
+ 0x05072367440badfc,
+ 0x77465f013804823e,
+ 0x5c482a977724c67e,
+ 0x5e884adc2463b9ea,
+ 0x2d51779651ead36b,
+ 0x153ea438580bd78f,
+ 0x70a64e2a3855585c,
+ 0x2a487cb06a2342ec,
+ 0x725a06fb1d4ed43b,
+ 0x57e4ccaf2cd89a32,
+ 0x4b588f547a6d8d3c,
+ 0x6de91b18542289ec,
+ 0x7644a45c38437fdb,
+ 0x684a481a32fff902,
+ 0x749abb43579478fe,
+ 0x1ba026fa3dc240fb,
+ 0x75c6c33a79a1deaa,
+ 0x70c6a52912e685fb,
+ 0x374a3fe6520eedd1,
+ 0x23f9c13c4f4ef005,
+ 0x275ac794649bb77c,
+ 0x1cf10fd839386575,
+ 0x235ba861180115be,
+ 0x354fe9f947398c89,
+ 0x741226bb15b5af5c,
+ 0x10233c990d34b6a8,
+ 0x615740953f6ab60f,
+ 0x77ae35eb7e0c57b1,
+ 0x310c50b3579be4f1,
+ },
+ .to_buff_len = 512,
+ .from_buff = {},
+ .from_buff_len = 0,
+ .seed = 0xffff,
+ .result = 0xca45
+ },
+};
+
+#define NUM_PULL_TESTS 4
+
+struct testcase pull_tests[NUM_PULL_TESTS] = {
+ {
+ .from_buff = {
+ 0xdeadbeefdeadbeef,
+ },
+ .from_buff_len = 8,
+ .to_buff = {},
+ .to_buff_len = 0,
+ .seed = 0,
+ .result = 0xc4c4
+ },
+ {
+ .from_buff = {
+ 0xdeadbeefdeadbeef,
+ 0xbeefdeadbeefdead,
+ },
+ .from_buff_len = 16,
+ .to_buff = {},
+ .to_buff_len = 0,
+ .seed = 0x1234,
+ .result = 0x9bbd
+ },
+ {
+ .from_buff = {
+ 0xdeadbeefdeadbeef,
+ 0xbeefdeadbeefdead,
+ },
+ .from_buff_len = 15,
+ .to_buff = {},
+ .to_buff_len = 0,
+ .seed = 0x1234,
+#if __BYTE_ORDER__ == __ORDER_LITTLE_ENDIAN__
+ .result = 0x59be
+#else
+ .result = 0x9c6a
+#endif
+ },
+ {
+ .from_buff = {
+ 0x327b23c66b8b4567,
+ 0x66334873643c9869,
+ 0x19495cff74b0dc51,
+ 0x625558ec2ae8944a,
+ 0x46e87ccd238e1f29,
+ 0x507ed7ab3d1b58ba,
+ 0x41b71efb2eb141f2,
+ 0x7545e14679e2a9e3,
+ 0x5bd062c2515f007c,
+ 0x4db127f812200854,
+ 0x1f16e9e80216231b,
+ 0x66ef438d1190cde7,
+ 0x3352255a140e0f76,
+ 0x0ded7263109cf92e,
+ 0x1befd79f7fdcc233,
+ 0x6b68079a41a7c4c9,
+ 0x25e45d324e6afb66,
+ 0x431bd7b7519b500d,
+ 0x7c83e4583f2dba31,
+ 0x62bbd95a257130a3,
+ 0x628c895d436c6125,
+ 0x721da317333ab105,
+ 0x2d1d5ae92443a858,
+ 0x75a2a8d46763845e,
+ 0x79838cb208edbdab,
+ 0x0b03e0c64353d0cd,
+ 0x54e49eb4189a769b,
+ 0x2ca8861171f32454,
+ 0x02901d820836c40e,
+ 0x081386413a95f874,
+ 0x7c3dbd3d1e7ff521,
+ 0x6ceaf087737b8ddc,
+ 0x4516dde922221a70,
+ 0x614fd4a13006c83e,
+ 0x5577f8e1419ac241,
+ 0x05072367440badfc,
+ 0x77465f013804823e,
+ 0x5c482a977724c67e,
+ 0x5e884adc2463b9ea,
+ 0x2d51779651ead36b,
+ 0x153ea438580bd78f,
+ 0x70a64e2a3855585c,
+ 0x2a487cb06a2342ec,
+ 0x725a06fb1d4ed43b,
+ 0x57e4ccaf2cd89a32,
+ 0x4b588f547a6d8d3c,
+ 0x6de91b18542289ec,
+ 0x7644a45c38437fdb,
+ 0x684a481a32fff902,
+ 0x749abb43579478fe,
+ 0x1ba026fa3dc240fb,
+ 0x75c6c33a79a1deaa,
+ 0x70c6a52912e685fb,
+ 0x374a3fe6520eedd1,
+ 0x23f9c13c4f4ef005,
+ 0x275ac794649bb77c,
+ 0x1cf10fd839386575,
+ 0x235ba861180115be,
+ 0x354fe9f947398c89,
+ 0x741226bb15b5af5c,
+ 0x10233c990d34b6a8,
+ 0x615740953f6ab60f,
+ 0x77ae35eb7e0c57b1,
+ 0x310c50b3579be4f1,
+ },
+ .from_buff_len = 512,
+ .to_buff = {},
+ .to_buff_len = 0,
+ .seed = 0xffff,
+ .result = 0x35ba
+ },
+};
+
+#define NUM_DIFF_TESTS 4
+
+struct testcase diff_tests[NUM_DIFF_TESTS] = {
+ {
+ .from_buff = {
+ 0xdeadbeefdeadbeef,
+ },
+ .from_buff_len = 8,
+ .to_buff = {
+ 0xabababababababab,
+ },
+ .to_buff_len = 8,
+ .seed = 0,
+ .result = 0x7373
+ },
+ {
+ .from_buff = {
+ 0xdeadbeefdeadbeef,
+ },
+ .from_buff_len = 7,
+ .to_buff = {
+ 0xabababababababab,
+ },
+ .to_buff_len = 7,
+ .seed = 0,
+#if __BYTE_ORDER__ == __ORDER_LITTLE_ENDIAN__
+ .result = 0xa673
+#else
+ .result = 0x73b7
+#endif
+ },
+ {
+ .from_buff = {
+ 0,
+ },
+ .from_buff_len = 8,
+ .to_buff = {
+ 0xabababababababab,
+ },
+ .to_buff_len = 8,
+ .seed = 0,
+ .result = 0xaeae
+ },
+ {
+ .from_buff = {
+ 0xdeadbeefdeadbeef
+ },
+ .from_buff_len = 8,
+ .to_buff = {
+ 0,
+ },
+ .to_buff_len = 8,
+ .seed = 0xffff,
+ .result = 0xc4c4
+ },
+};
+
+#define NUM_EDGE_TESTS 4
+
+struct testcase edge_tests[NUM_EDGE_TESTS] = {
+ {
+ .from_buff = {},
+ .from_buff_len = 0,
+ .to_buff = {},
+ .to_buff_len = 0,
+ .seed = 0,
+ .result = 0
+ },
+ {
+ .from_buff = {
+ 0x1234
+ },
+ .from_buff_len = 0,
+ .to_buff = {
+ 0x1234
+ },
+ .to_buff_len = 0,
+ .seed = 0,
+ .result = 0
+ },
+ {
+ .from_buff = {},
+ .from_buff_len = 0,
+ .to_buff = {},
+ .to_buff_len = 0,
+ .seed = 0x1234,
+ .result = 0x1234
+ },
+ {
+ .from_buff = {},
+ .from_buff_len = 512,
+ .to_buff = {},
+ .to_buff_len = 0,
+ .seed = 0xffff,
+ .result = 0xffff
+ },
+};
+
+static unsigned short trigger_csum_diff(const struct csum_diff_test *skel)
+{
+ u8 tmp_out[64 << 2] = {};
+ u8 tmp_in[64] = {};
+ int err;
+ int pfd;
+
+ LIBBPF_OPTS(bpf_test_run_opts, topts,
+ .data_in = tmp_in,
+ .data_size_in = sizeof(tmp_in),
+ .data_out = tmp_out,
+ .data_size_out = sizeof(tmp_out),
+ .repeat = 1,
+ );
+ pfd = bpf_program__fd(skel->progs.compute_checksum);
+ err = bpf_prog_test_run_opts(pfd, &topts);
+ if (err)
+ return -1;
+
+ return skel->bss->result;
+}
+
+static void test_csum_diff(struct testcase *tests, int num_tests)
+{
+ struct csum_diff_test *skel;
+ unsigned short got;
+ int err;
+
+ for (int i = 0; i < num_tests; i++) {
+ skel = csum_diff_test__open();
+ if (!ASSERT_OK_PTR(skel, "csum_diff_test open"))
+ return;
+
+ skel->rodata->to_buff_len = tests[i].to_buff_len;
+ skel->rodata->from_buff_len = tests[i].from_buff_len;
+
+ err = csum_diff_test__load(skel);
+ if (!ASSERT_EQ(err, 0, "csum_diff_test load"))
+ goto out;
+
+ memcpy(skel->bss->to_buff, tests[i].to_buff, tests[i].to_buff_len);
+ memcpy(skel->bss->from_buff, tests[i].from_buff, tests[i].from_buff_len);
+ skel->bss->seed = tests[i].seed;
+
+ got = trigger_csum_diff(skel);
+ ASSERT_EQ(got, tests[i].result, "csum_diff result");
+
+ csum_diff_test__destroy(skel);
+ }
+
+ return;
+out:
+ csum_diff_test__destroy(skel);
+}
+
+void test_test_csum_diff(void)
+{
+ if (test__start_subtest("csum_diff_push"))
+ test_csum_diff(push_tests, NUM_PUSH_TESTS);
+ if (test__start_subtest("csum_diff_pull"))
+ test_csum_diff(pull_tests, NUM_PULL_TESTS);
+ if (test__start_subtest("csum_diff_diff"))
+ test_csum_diff(diff_tests, NUM_DIFF_TESTS);
+ if (test__start_subtest("csum_diff_edge"))
+ test_csum_diff(edge_tests, NUM_EDGE_TESTS);
+}
diff --git a/tools/testing/selftests/bpf/progs/csum_diff_test.c b/tools/testing/selftests/bpf/progs/csum_diff_test.c
new file mode 100644
index 0000000000000..9438f1773a589
--- /dev/null
+++ b/tools/testing/selftests/bpf/progs/csum_diff_test.c
@@ -0,0 +1,42 @@
+// SPDX-License-Identifier: GPL-2.0
+/* Copyright Amazon.com Inc. or its affiliates */
+#include <linux/types.h>
+#include <linux/bpf.h>
+#include <bpf/bpf_helpers.h>
+#include <bpf/bpf_tracing.h>
+
+#define BUFF_SZ 512
+
+/* Will be updated by the selftest */
+char to_buff[BUFF_SZ];
+const volatile unsigned int to_buff_len = 0;
+char from_buff[BUFF_SZ];
+const volatile unsigned int from_buff_len = 0;
+unsigned short seed = 0;
+
+short result;
+
+char _license[] SEC("license") = "GPL";
+
+SEC("tc")
+int compute_checksum(void *ctx)
+{
+ int to_len_half = to_buff_len / 2;
+ int from_len_half = from_buff_len / 2;
+ short result2;
+
+ /* Calculate checksum in one go */
+ result2 = bpf_csum_diff((void *)from_buff, from_buff_len,
+ (void *)to_buff, to_buff_len, seed);
+
+ /* Calculate checksum by chaining two bpf_csum_diff() calls */
+ result = bpf_csum_diff((void *)from_buff, from_buff_len - from_len_half,
+ (void *)to_buff, to_buff_len - to_len_half, seed);
+
+ result = bpf_csum_diff((void *)from_buff + (from_buff_len - from_len_half), from_len_half,
+ (void *)to_buff + (to_buff_len - to_len_half), to_len_half, result);
+
+ result = (result == result2) ? result : 0;
+
+ return 0;
+}
--
2.40.1
^ permalink raw reply related [flat|nested] 24+ messages in thread
* Re: [PATCH bpf-next 3/5] selftests/bpf: don't mask result of bpf_csum_diff() in test_verifier
2024-10-21 12:21 ` [PATCH bpf-next 3/5] selftests/bpf: don't mask result of bpf_csum_diff() in test_verifier Puranjay Mohan
@ 2024-10-21 13:01 ` Helge Deller
2024-10-21 13:14 ` Puranjay Mohan
2024-10-21 13:42 ` Daniel Borkmann
2024-10-22 9:55 ` Toke Høiland-Jørgensen
2 siblings, 1 reply; 24+ messages in thread
From: Helge Deller @ 2024-10-21 13:01 UTC (permalink / raw)
To: Puranjay Mohan, Albert Ou, Alexei Starovoitov, Andrew Morton,
Andrii Nakryiko, bpf, Daniel Borkmann, David S. Miller,
Eduard Zingerman, Eric Dumazet, Hao Luo, Jakub Kicinski,
James E.J. Bottomley, Jiri Olsa, John Fastabend, KP Singh,
linux-kernel, linux-parisc, linux-riscv, Martin KaFai Lau,
Mykola Lysenko, netdev, Palmer Dabbelt, Paolo Abeni,
Paul Walmsley, Puranjay Mohan, Shuah Khan, Song Liu,
Stanislav Fomichev, Yonghong Song
On 10/21/24 14:21, Puranjay Mohan wrote:
> The bpf_csum_diff() helper has been fixed to return a 16-bit value for
> all archs, so now we don't need to mask the result.
>
> ...
> --- a/tools/testing/selftests/bpf/progs/verifier_array_access.c
> +++ b/tools/testing/selftests/bpf/progs/verifier_array_access.c
> @@ -368,8 +368,7 @@ __naked void a_read_only_array_2_1(void)
> r4 = 0; \
> r5 = 0; \
> call %[bpf_csum_diff]; \
> -l0_%=: r0 &= 0xffff; \
> - exit; \
> +l0_%=: exit; \
Instead of dropping the masking, would it make sense to
check here if (r0 >> 16) == 0 ?
Helge
^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: [PATCH bpf-next 3/5] selftests/bpf: don't mask result of bpf_csum_diff() in test_verifier
2024-10-21 13:01 ` Helge Deller
@ 2024-10-21 13:14 ` Puranjay Mohan
2024-10-21 14:04 ` Helge Deller
0 siblings, 1 reply; 24+ messages in thread
From: Puranjay Mohan @ 2024-10-21 13:14 UTC (permalink / raw)
To: Helge Deller, Albert Ou, Alexei Starovoitov, Andrew Morton,
Andrii Nakryiko, bpf, Daniel Borkmann, David S. Miller,
Eduard Zingerman, Eric Dumazet, Hao Luo, Jakub Kicinski,
James E.J. Bottomley, Jiri Olsa, John Fastabend, KP Singh,
linux-kernel, linux-parisc, linux-riscv, Martin KaFai Lau,
Mykola Lysenko, netdev, Palmer Dabbelt, Paolo Abeni,
Paul Walmsley, Shuah Khan, Song Liu, Stanislav Fomichev,
Yonghong Song
[-- Attachment #1: Type: text/plain, Size: 1212 bytes --]
Helge Deller <deller@gmx.de> writes:
> On 10/21/24 14:21, Puranjay Mohan wrote:
>> The bpf_csum_diff() helper has been fixed to return a 16-bit value for
>> all archs, so now we don't need to mask the result.
>>
>> ...
>> --- a/tools/testing/selftests/bpf/progs/verifier_array_access.c
>> +++ b/tools/testing/selftests/bpf/progs/verifier_array_access.c
>> @@ -368,8 +368,7 @@ __naked void a_read_only_array_2_1(void)
>> r4 = 0; \
>> r5 = 0; \
>> call %[bpf_csum_diff]; \
>> -l0_%=: r0 &= 0xffff; \
>> - exit; \
>> +l0_%=: exit; \
>
> Instead of dropping the masking, would it make sense to
> check here if (r0 >> 16) == 0 ?
We define the expected value in R0 to be 65507 (0xffe3) in the line at the top:
__success __retval(65507)
So, we should just not do anything to R0 and it should contain this value
after returning from bpf_csum_diff()
This masking hack was added in:
6185266c5a853 ("selftests/bpf: Mask bpf_csum_diff() return value to 16 bits in test_verifier")
because without the fix in patch 2 bpf_csum_diff() would return the
following for this test:
x86 : -29 : 0xffffffe3
generic (arm64, riscv) : 65507 : 0x0000ffe3
Thanks,
Puranjay
[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 255 bytes --]
^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: [PATCH bpf-next 1/5] net: checksum: move from32to16() to generic header
2024-10-21 12:21 ` [PATCH bpf-next 1/5] net: checksum: move from32to16() to generic header Puranjay Mohan
@ 2024-10-21 13:41 ` Daniel Borkmann
2024-10-22 9:49 ` Toke Høiland-Jørgensen
2024-10-22 13:50 ` kernel test robot
2 siblings, 0 replies; 24+ messages in thread
From: Daniel Borkmann @ 2024-10-21 13:41 UTC (permalink / raw)
To: Puranjay Mohan, Albert Ou, Alexei Starovoitov, Andrew Morton,
Andrii Nakryiko, bpf, David S. Miller, Eduard Zingerman,
Eric Dumazet, Hao Luo, Helge Deller, Jakub Kicinski,
James E.J. Bottomley, Jiri Olsa, John Fastabend, KP Singh,
linux-kernel, linux-parisc, linux-riscv, Martin KaFai Lau,
Mykola Lysenko, netdev, Palmer Dabbelt, Paolo Abeni,
Paul Walmsley, Puranjay Mohan, Shuah Khan, Song Liu,
Stanislav Fomichev, Yonghong Song
On 10/21/24 2:21 PM, Puranjay Mohan wrote:
> from32to16() is used by lib/checksum.c and also by
> arch/parisc/lib/checksum.c. The next patch will use it in the
> bpf_csum_diff helper.
>
> Move from32to16() to include/net/checksum.h as csum_from32to16() and
> remove the other implementations.
>
> Signed-off-by: Puranjay Mohan <puranjay@kernel.org>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: [PATCH bpf-next 2/5] bpf: bpf_csum_diff: optimize and homogenize for all archs
2024-10-21 12:21 ` [PATCH bpf-next 2/5] bpf: bpf_csum_diff: optimize and homogenize for all archs Puranjay Mohan
@ 2024-10-21 13:42 ` Daniel Borkmann
2024-10-22 9:54 ` Toke Høiland-Jørgensen
2024-10-22 18:09 ` kernel test robot
2 siblings, 0 replies; 24+ messages in thread
From: Daniel Borkmann @ 2024-10-21 13:42 UTC (permalink / raw)
To: Puranjay Mohan, Albert Ou, Alexei Starovoitov, Andrew Morton,
Andrii Nakryiko, bpf, David S. Miller, Eduard Zingerman,
Eric Dumazet, Hao Luo, Helge Deller, Jakub Kicinski,
James E.J. Bottomley, Jiri Olsa, John Fastabend, KP Singh,
linux-kernel, linux-parisc, linux-riscv, Martin KaFai Lau,
Mykola Lysenko, netdev, Palmer Dabbelt, Paolo Abeni,
Paul Walmsley, Puranjay Mohan, Shuah Khan, Song Liu,
Stanislav Fomichev, Yonghong Song
On 10/21/24 2:21 PM, Puranjay Mohan wrote:
> 1. Optimization
> ------------
>
> The current implementation copies the 'from' and 'to' buffers to a
> scratchpad and it takes the bitwise NOT of 'from' buffer while copying.
> In the next step csum_partial() is called with this scratchpad.
>
> So, mathematically, the current implementation computes:
>
> result = csum(to - from)
>
> Here, 'to' and '~from' are copied into the scratchpad buffer; this is
> needed because csum_partial() takes a single contiguous buffer, not two
> disjoint buffers like 'to' and 'from'.
>
> We can rewrite this equation as:
>
> result = csum(to) - csum(from)
>
> using the distributive property of csum().
>
> This allows 'to' and 'from' to be at different locations, so the
> scratchpad and the copying are no longer needed.
>
> In C, this looks like:
>
> result = csum_sub(csum_partial(to, to_size, seed),
> csum_partial(from, from_size, 0));
>
> 2. Homogenization
> --------------
>
> The bpf_csum_diff() helper calls csum_partial() which is implemented by
> some architectures like arm and x86 but other architectures rely on the
> generic implementation in lib/checksum.c
>
> The generic implementation in lib/checksum.c returns a 16-bit value, but
> the arch-specific implementations can return more than 16 bits. This
> works out in most places because, before the result is used, it is
> passed through csum_fold(), which turns it into a 16-bit value.
>
> bpf_csum_diff() directly returns the value from csum_partial(), and
> therefore the returned values can differ across architectures. See the
> discussion in [1]:
>
> For the int value 28, the calculated checksums are:
>
> x86 : -29 : 0xffffffe3
> generic (arm64, riscv) : 65507 : 0x0000ffe3
> arm : 131042 : 0x0001ffe2
>
> Pass the result of bpf_csum_diff() through csum_from32to16() before
> returning, to homogenize the result for all architectures.
>
> NOTE: csum_from32to16() is used instead of csum_fold() because csum_fold()
> does csum_from32to16() + a bitwise NOT of the result, which is not what
> we want to do here.
>
> [1] https://lore.kernel.org/bpf/CAJ+HfNiQbOcqCLxFUP2FMm5QrLXUUaj852Fxe3hn_2JNiucn6g@mail.gmail.com/
>
> Signed-off-by: Puranjay Mohan <puranjay@kernel.org>
Thanks for looking into this!
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: [PATCH bpf-next 3/5] selftests/bpf: don't mask result of bpf_csum_diff() in test_verifier
2024-10-21 12:21 ` [PATCH bpf-next 3/5] selftests/bpf: don't mask result of bpf_csum_diff() in test_verifier Puranjay Mohan
2024-10-21 13:01 ` Helge Deller
@ 2024-10-21 13:42 ` Daniel Borkmann
2024-10-22 9:55 ` Toke Høiland-Jørgensen
2 siblings, 0 replies; 24+ messages in thread
From: Daniel Borkmann @ 2024-10-21 13:42 UTC (permalink / raw)
To: Puranjay Mohan, Albert Ou, Alexei Starovoitov, Andrew Morton,
Andrii Nakryiko, bpf, David S. Miller, Eduard Zingerman,
Eric Dumazet, Hao Luo, Helge Deller, Jakub Kicinski,
James E.J. Bottomley, Jiri Olsa, John Fastabend, KP Singh,
linux-kernel, linux-parisc, linux-riscv, Martin KaFai Lau,
Mykola Lysenko, netdev, Palmer Dabbelt, Paolo Abeni,
Paul Walmsley, Puranjay Mohan, Shuah Khan, Song Liu,
Stanislav Fomichev, Yonghong Song
On 10/21/24 2:21 PM, Puranjay Mohan wrote:
> The bpf_csum_diff() helper has been fixed to return a 16-bit value for
> all archs, so now we don't need to mask the result.
>
> This commit essentially reverts:
>
> commit 6185266c5a85 ("selftests/bpf: Mask bpf_csum_diff() return value
> to 16 bits in test_verifier")
>
> Signed-off-by: Puranjay Mohan <puranjay@kernel.org>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: [PATCH bpf-next 4/5] selftests/bpf: Add benchmark for bpf_csum_diff() helper
2024-10-21 12:21 ` [PATCH bpf-next 4/5] selftests/bpf: Add benchmark for bpf_csum_diff() helper Puranjay Mohan
@ 2024-10-21 13:43 ` Daniel Borkmann
2024-10-21 23:28 ` Andrii Nakryiko
1 sibling, 0 replies; 24+ messages in thread
From: Daniel Borkmann @ 2024-10-21 13:43 UTC (permalink / raw)
To: Puranjay Mohan, Albert Ou, Alexei Starovoitov, Andrew Morton,
Andrii Nakryiko, bpf, David S. Miller, Eduard Zingerman,
Eric Dumazet, Hao Luo, Helge Deller, Jakub Kicinski,
James E.J. Bottomley, Jiri Olsa, John Fastabend, KP Singh,
linux-kernel, linux-parisc, linux-riscv, Martin KaFai Lau,
Mykola Lysenko, netdev, Palmer Dabbelt, Paolo Abeni,
Paul Walmsley, Puranjay Mohan, Shuah Khan, Song Liu,
Stanislav Fomichev, Yonghong Song
On 10/21/24 2:21 PM, Puranjay Mohan wrote:
> Add a microbenchmark for the bpf_csum_diff() helper. The benchmark works
> by filling a 4KB buffer with random data and calculating the Internet
> checksum on different parts of this buffer using bpf_csum_diff().
>
> Example run using ./benchs/run_bench_csum_diff.sh on x86_64:
>
> [bpf]$ ./benchs/run_bench_csum_diff.sh
> 4 2.296 ± 0.066M/s (drops 0.000 ± 0.000M/s)
> 8 2.320 ± 0.003M/s (drops 0.000 ± 0.000M/s)
> 16 2.315 ± 0.001M/s (drops 0.000 ± 0.000M/s)
> 20 2.318 ± 0.001M/s (drops 0.000 ± 0.000M/s)
> 32 2.308 ± 0.003M/s (drops 0.000 ± 0.000M/s)
> 40 2.300 ± 0.029M/s (drops 0.000 ± 0.000M/s)
> 64 2.286 ± 0.001M/s (drops 0.000 ± 0.000M/s)
> 128 2.250 ± 0.001M/s (drops 0.000 ± 0.000M/s)
> 256 2.173 ± 0.001M/s (drops 0.000 ± 0.000M/s)
> 512 2.023 ± 0.055M/s (drops 0.000 ± 0.000M/s)
>
> Signed-off-by: Puranjay Mohan <puranjay@kernel.org>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: [PATCH bpf-next 5/5] selftests/bpf: Add a selftest for bpf_csum_diff()
2024-10-21 12:21 ` [PATCH bpf-next 5/5] selftests/bpf: Add a selftest for bpf_csum_diff() Puranjay Mohan
@ 2024-10-21 13:44 ` Daniel Borkmann
0 siblings, 0 replies; 24+ messages in thread
From: Daniel Borkmann @ 2024-10-21 13:44 UTC (permalink / raw)
To: Puranjay Mohan, Albert Ou, Alexei Starovoitov, Andrew Morton,
Andrii Nakryiko, bpf, David S. Miller, Eduard Zingerman,
Eric Dumazet, Hao Luo, Helge Deller, Jakub Kicinski,
James E.J. Bottomley, Jiri Olsa, John Fastabend, KP Singh,
linux-kernel, linux-parisc, linux-riscv, Martin KaFai Lau,
Mykola Lysenko, netdev, Palmer Dabbelt, Paolo Abeni,
Paul Walmsley, Puranjay Mohan, Shuah Khan, Song Liu,
Stanislav Fomichev, Yonghong Song
On 10/21/24 2:21 PM, Puranjay Mohan wrote:
> Add a selftest for the bpf_csum_diff() helper. This selftest runs the
> helper in all three configurations (push, pull, and diff) and verifies
> its output. The expected results were computed by hand and cross-checked
> against the helper's older implementation.
>
> Signed-off-by: Puranjay Mohan <puranjay@kernel.org>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: [PATCH bpf-next 3/5] selftests/bpf: don't mask result of bpf_csum_diff() in test_verifier
2024-10-21 13:14 ` Puranjay Mohan
@ 2024-10-21 14:04 ` Helge Deller
0 siblings, 0 replies; 24+ messages in thread
From: Helge Deller @ 2024-10-21 14:04 UTC (permalink / raw)
To: Puranjay Mohan, Albert Ou, Alexei Starovoitov, Andrew Morton,
Andrii Nakryiko, bpf, Daniel Borkmann, David S. Miller,
Eduard Zingerman, Eric Dumazet, Hao Luo, Jakub Kicinski,
James E.J. Bottomley, Jiri Olsa, John Fastabend, KP Singh,
linux-kernel, linux-parisc, linux-riscv, Martin KaFai Lau,
Mykola Lysenko, netdev, Palmer Dabbelt, Paolo Abeni,
Paul Walmsley, Shuah Khan, Song Liu, Stanislav Fomichev,
Yonghong Song
On 10/21/24 15:14, Puranjay Mohan wrote:
> Helge Deller <deller@gmx.de> writes:
>
>> On 10/21/24 14:21, Puranjay Mohan wrote:
>>> The bpf_csum_diff() helper has been fixed to return a 16-bit value for
>>> all archs, so now we don't need to mask the result.
>>>
>>> ...
>>> --- a/tools/testing/selftests/bpf/progs/verifier_array_access.c
>>> +++ b/tools/testing/selftests/bpf/progs/verifier_array_access.c
>>> @@ -368,8 +368,7 @@ __naked void a_read_only_array_2_1(void)
>>> r4 = 0; \
>>> r5 = 0; \
>>> call %[bpf_csum_diff]; \
>>> -l0_%=: r0 &= 0xffff; \
>>> - exit; \
>>> +l0_%=: exit; \
>>
>> Instead of dropping the masking, would it make sense to
>> check here if (r0 >> 16) == 0 ?
>
> We define the expected value in R0 to be 65507(0xffe3) in the line at the top:
> __success __retval(65507)
>
> So, we should just not do anything to R0 and it should contain this value
> after returning from bpf_csum_diff()
>
> This masking hack was added in:
>
> 6185266c5a853 ("selftests/bpf: Mask bpf_csum_diff() return value to 16 bits in test_verifier")
>
> because without the fix in patch 2 bpf_csum_diff() would return the
> following for this test:
>
> x86 : -29 : 0xffffffe3
> generic (arm64, riscv) : 65507 : 0x0000ffe3
You're right.
Thanks for explaining.
Helge
^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: [PATCH bpf-next 4/5] selftests/bpf: Add benchmark for bpf_csum_diff() helper
2024-10-21 12:21 ` [PATCH bpf-next 4/5] selftests/bpf: Add benchmark for bpf_csum_diff() helper Puranjay Mohan
2024-10-21 13:43 ` Daniel Borkmann
@ 2024-10-21 23:28 ` Andrii Nakryiko
2024-10-22 10:21 ` Puranjay Mohan
1 sibling, 1 reply; 24+ messages in thread
From: Andrii Nakryiko @ 2024-10-21 23:28 UTC (permalink / raw)
To: Puranjay Mohan
Cc: Albert Ou, Alexei Starovoitov, Andrew Morton, Andrii Nakryiko,
bpf, Daniel Borkmann, David S. Miller, Eduard Zingerman,
Eric Dumazet, Hao Luo, Helge Deller, Jakub Kicinski,
James E.J. Bottomley, Jiri Olsa, John Fastabend, KP Singh,
linux-kernel, linux-parisc, linux-riscv, Martin KaFai Lau,
Mykola Lysenko, netdev, Palmer Dabbelt, Paolo Abeni,
Paul Walmsley, Puranjay Mohan, Shuah Khan, Song Liu,
Stanislav Fomichev, Yonghong Song
On Mon, Oct 21, 2024 at 5:22 AM Puranjay Mohan <puranjay@kernel.org> wrote:
>
> Add a microbenchmark for the bpf_csum_diff() helper. The benchmark works
> by filling a 4KB buffer with random data and calculating the Internet
> checksum on different parts of this buffer using bpf_csum_diff().
>
> Example run using ./benchs/run_bench_csum_diff.sh on x86_64:
>
> [bpf]$ ./benchs/run_bench_csum_diff.sh
> 4 2.296 ± 0.066M/s (drops 0.000 ± 0.000M/s)
> 8 2.320 ± 0.003M/s (drops 0.000 ± 0.000M/s)
> 16 2.315 ± 0.001M/s (drops 0.000 ± 0.000M/s)
> 20 2.318 ± 0.001M/s (drops 0.000 ± 0.000M/s)
> 32 2.308 ± 0.003M/s (drops 0.000 ± 0.000M/s)
> 40 2.300 ± 0.029M/s (drops 0.000 ± 0.000M/s)
> 64 2.286 ± 0.001M/s (drops 0.000 ± 0.000M/s)
> 128 2.250 ± 0.001M/s (drops 0.000 ± 0.000M/s)
> 256 2.173 ± 0.001M/s (drops 0.000 ± 0.000M/s)
> 512 2.023 ± 0.055M/s (drops 0.000 ± 0.000M/s)
You are not benchmarking bpf_csum_diff(), you are benchmarking how often
you can call bpf_prog_test_run(). Add some batching on the BPF side; these
numbers tell you that there is no difference between calculating the
checksum for 4 bytes and for 512, didn't that seem strange to you?
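For illustration, an untested sketch of that kind of batching, based on
the benchmark program in this patch, could look like:
	SEC("tc")
	int compute_checksum(void *ctx)
	{
		int i;
		/* amortize the bpf_prog_test_run() overhead over many calls */
		for (i = 0; i < 100; i++)
			result = bpf_csum_diff(0, 0, (void *)buff, buff_len, 0);
		__sync_add_and_fetch(&hits, 100);
		return 0;
	}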
pw-bot: cr
>
> Signed-off-by: Puranjay Mohan <puranjay@kernel.org>
> ---
> tools/testing/selftests/bpf/Makefile | 2 +
> tools/testing/selftests/bpf/bench.c | 4 +
> .../selftests/bpf/benchs/bench_csum_diff.c | 164 ++++++++++++++++++
> .../bpf/benchs/run_bench_csum_diff.sh | 10 ++
> .../selftests/bpf/progs/csum_diff_bench.c | 25 +++
> 5 files changed, 205 insertions(+)
> create mode 100644 tools/testing/selftests/bpf/benchs/bench_csum_diff.c
> create mode 100755 tools/testing/selftests/bpf/benchs/run_bench_csum_diff.sh
> create mode 100644 tools/testing/selftests/bpf/progs/csum_diff_bench.c
>
[...]
> +
> +static void csum_diff_setup(void)
> +{
> + int err;
> + char *buff;
> + size_t i, sz;
> +
> + sz = sizeof(ctx.skel->rodata->buff);
> +
> + setup_libbpf();
> +
> + ctx.skel = csum_diff_bench__open();
> + if (!ctx.skel) {
> + fprintf(stderr, "failed to open skeleton\n");
> + exit(1);
> + }
> +
> + srandom(time(NULL));
> + buff = ctx.skel->rodata->buff;
> +
> + /*
> + * Set first 8 bytes of buffer to 0xdeadbeefdeadbeef, this is later used to verify the
> + * correctness of the helper by comparing the checksum result for 0xdeadbeefdeadbeef that
> + * should be 0x3b3b
> + */
> +
> + *(u64 *)buff = 0xdeadbeefdeadbeef;
> +
> + for (i = 8; i < sz; i++)
> + buff[i] = '1' + random() % 9;
So, you only generate 9 different values for the bytes, why? Why not the
full byte range?
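(For example, something like buff[i] = random() & 0xff; would exercise the
full byte range.)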
> +
> + ctx.skel->rodata->buff_len = args.buff_len;
> +
> + err = csum_diff_bench__load(ctx.skel);
> + if (err) {
> + fprintf(stderr, "failed to load skeleton\n");
> + csum_diff_bench__destroy(ctx.skel);
> + exit(1);
> + }
> +}
> +
[...]
^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: [PATCH bpf-next 1/5] net: checksum: move from32to16() to generic header
2024-10-21 12:21 ` [PATCH bpf-next 1/5] net: checksum: move from32to16() to generic header Puranjay Mohan
2024-10-21 13:41 ` Daniel Borkmann
@ 2024-10-22 9:49 ` Toke Høiland-Jørgensen
2024-10-22 13:50 ` kernel test robot
2 siblings, 0 replies; 24+ messages in thread
From: Toke Høiland-Jørgensen @ 2024-10-22 9:49 UTC (permalink / raw)
To: Puranjay Mohan, Albert Ou, Alexei Starovoitov, Andrew Morton,
Andrii Nakryiko, bpf, Daniel Borkmann, David S. Miller,
Eduard Zingerman, Eric Dumazet, Hao Luo, Helge Deller,
Jakub Kicinski, James E.J. Bottomley, Jiri Olsa, John Fastabend,
KP Singh, linux-kernel, linux-parisc, linux-riscv,
Martin KaFai Lau, Mykola Lysenko, netdev, Palmer Dabbelt,
Paolo Abeni, Paul Walmsley, Puranjay Mohan, Puranjay Mohan,
Shuah Khan, Song Liu, Stanislav Fomichev, Yonghong Song
Puranjay Mohan <puranjay@kernel.org> writes:
> from32to16() is used by lib/checksum.c and also by
> arch/parisc/lib/checksum.c. The next patch will use it in the
> bpf_csum_diff helper.
>
> Move from32to16() to include/net/checksum.h as csum_from32to16() and
> remove the other implementations.
>
> Signed-off-by: Puranjay Mohan <puranjay@kernel.org>
Reviewed-by: Toke Høiland-Jørgensen <toke@redhat.com>
* Re: [PATCH bpf-next 2/5] bpf: bpf_csum_diff: optimize and homogenize for all archs
2024-10-21 12:21 ` [PATCH bpf-next 2/5] bpf: bpf_csum_diff: optimize and homogenize for all archs Puranjay Mohan
2024-10-21 13:42 ` Daniel Borkmann
@ 2024-10-22 9:54 ` Toke Høiland-Jørgensen
2024-10-22 18:09 ` kernel test robot
2 siblings, 0 replies; 24+ messages in thread
From: Toke Høiland-Jørgensen @ 2024-10-22 9:54 UTC (permalink / raw)
To: Puranjay Mohan, Albert Ou, Alexei Starovoitov, Andrew Morton,
Andrii Nakryiko, bpf, Daniel Borkmann, David S. Miller,
Eduard Zingerman, Eric Dumazet, Hao Luo, Helge Deller,
Jakub Kicinski, James E.J. Bottomley, Jiri Olsa, John Fastabend,
KP Singh, linux-kernel, linux-parisc, linux-riscv,
Martin KaFai Lau, Mykola Lysenko, netdev, Palmer Dabbelt,
Paolo Abeni, Paul Walmsley, Puranjay Mohan, Puranjay Mohan,
Shuah Khan, Song Liu, Stanislav Fomichev, Yonghong Song
Puranjay Mohan <puranjay@kernel.org> writes:
> 1. Optimization
> ------------
>
> The current implementation copies the 'from' and 'to' buffers to a
> scratchpad and takes the bitwise NOT of the 'from' buffer while copying.
> In the next step, csum_partial() is called on this scratchpad.
>
> So, mathematically, the current implementation is doing:
>
> result = csum(to - from)
>
> Here, 'to' and '~from' are copied into the scratchpad buffer; the
> scratchpad is needed because csum_partial() takes a single contiguous
> buffer and not two disjoint buffers like 'to' and 'from'.
>
> We can rewrite this equation as:
>
> result = csum(to) - csum(from)
>
> using the distributive property of csum().
>
> This allows 'to' and 'from' to be at different locations, so the
> scratchpad and the copying are not needed.
>
> In C code, this looks like:
>
> result = csum_sub(csum_partial(to, to_size, seed),
> csum_partial(from, from_size, 0));
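>
A quick standalone check of this identity on sample data (a userspace
sketch with a simplified 16-bit fold, not the kernel implementation):

	#include <stdint.h>
	#include <stdio.h>
	#include <string.h>

	/* Ones'-complement sum of a buffer, folded to 16 bits (a toy
	 * stand-in for csum_partial() plus folding). */
	static uint16_t csum16(const uint8_t *buf, size_t len)
	{
		uint32_t sum = 0;
		size_t i;

		for (i = 0; i + 1 < len; i += 2)
			sum += (uint32_t)buf[i] << 8 | buf[i + 1];
		if (len & 1)
			sum += (uint32_t)buf[len - 1] << 8;
		while (sum >> 16)
			sum = (sum & 0xffff) + (sum >> 16);
		return (uint16_t)sum;
	}

	/* Ones'-complement subtraction: a - b == a + ~b (like csum_sub()). */
	static uint16_t csum16_sub(uint16_t a, uint16_t b)
	{
		uint32_t sum = a + (uint16_t)~b;

		return (uint16_t)((sum & 0xffff) + (sum >> 16));
	}

	int main(void)
	{
		uint8_t from[4] = { 0xde, 0xad, 0xbe, 0xef };
		uint8_t to[4] = { 0x12, 0x34, 0x56, 0x78 };
		uint8_t scratch[8];
		size_t i;

		/* Old approach: copy ~from and to into one scratch buffer. */
		for (i = 0; i < sizeof(from); i++)
			scratch[i] = ~from[i];
		memcpy(scratch + sizeof(from), to, sizeof(to));

		/* Both lines print 0xcb0e. */
		printf("csum(scratch)         = 0x%04x\n",
		       csum16(scratch, sizeof(scratch)));
		printf("csum(to) - csum(from) = 0x%04x\n",
		       csum16_sub(csum16(to, sizeof(to)),
				  csum16(from, sizeof(from))));
		return 0;
	}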
>
> 2. Homogenization
> --------------
>
> The bpf_csum_diff() helper calls csum_partial(), which is implemented by
> some architectures like arm and x86, while other architectures rely on
> the generic implementation in lib/checksum.c.
>
> The generic implementation in lib/checksum.c returns a 16-bit value, but
> the arch-specific implementations can return more than 16 bits. This
> works out in most places because, before the result is used, it is
> passed through csum_fold(), which turns it into a 16-bit value.
>
> bpf_csum_diff() directly returns the value from csum_partial(), and
> therefore the returned values can differ across architectures. See the
> discussion in [1]:
>
> For the int value 28, the calculated checksums are:
>
> x86 : -29 : 0xffffffe3
> generic (arm64, riscv) : 65507 : 0x0000ffe3
> arm : 131042 : 0x0001ffe2
>
> Pass the result of bpf_csum_diff() through from32to16() before returning
> it, to homogenize the result across all architectures.
>
> NOTE: from32to16() is used instead of csum_fold() because csum_fold()
> does from32to16() + bitwise NOT of the result, which is not what we want
> to do here.
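>
For reference, the fold in question is along these lines (a sketch of
the generic from32to16() from lib/checksum.c; the series moves it to
include/net/checksum.h as csum_from32to16() with the proper
__wsum/__sum16 types):

	/*
	 * Fold a 32-bit partial checksum to 16 bits: all three values
	 * above (0xffffffe3, 0x0000ffe3, 0x0001ffe2) fold to 0xffe3.
	 */
	static inline unsigned short from32to16(unsigned int x)
	{
		x = (x & 0xffff) + (x >> 16);	/* 16 + 16 -> 16 bits + carry */
		x = (x & 0xffff) + (x >> 16);	/* fold the carry back in */
		return (unsigned short)x;
	}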
>
> [1] https://lore.kernel.org/bpf/CAJ+HfNiQbOcqCLxFUP2FMm5QrLXUUaj852Fxe3hn_2JNiucn6g@mail.gmail.com/
>
> Signed-off-by: Puranjay Mohan <puranjay@kernel.org>
Pretty neat simplification :)
Reviewed-by: Toke Høiland-Jørgensen <toke@redhat.com>
* Re: [PATCH bpf-next 3/5] selftests/bpf: don't mask result of bpf_csum_diff() in test_verifier
2024-10-21 12:21 ` [PATCH bpf-next 3/5] selftests/bpf: don't mask result of bpf_csum_diff() in test_verifier Puranjay Mohan
2024-10-21 13:01 ` Helge Deller
2024-10-21 13:42 ` Daniel Borkmann
@ 2024-10-22 9:55 ` Toke Høiland-Jørgensen
2 siblings, 0 replies; 24+ messages in thread
From: Toke Høiland-Jørgensen @ 2024-10-22 9:55 UTC (permalink / raw)
To: Puranjay Mohan, Albert Ou, Alexei Starovoitov, Andrew Morton,
Andrii Nakryiko, bpf, Daniel Borkmann, David S. Miller,
Eduard Zingerman, Eric Dumazet, Hao Luo, Helge Deller,
Jakub Kicinski, James E.J. Bottomley, Jiri Olsa, John Fastabend,
KP Singh, linux-kernel, linux-parisc, linux-riscv,
Martin KaFai Lau, Mykola Lysenko, netdev, Palmer Dabbelt,
Paolo Abeni, Paul Walmsley, Puranjay Mohan, Puranjay Mohan,
Shuah Khan, Song Liu, Stanislav Fomichev, Yonghong Song
Puranjay Mohan <puranjay@kernel.org> writes:
> The bpf_csum_diff() helper has been fixed to return a 16-bit value for
> all archs, so now we don't need to mask the result.
>
> This commit is basically reverting the below:
>
> commit 6185266c5a85 ("selftests/bpf: Mask bpf_csum_diff() return value
> to 16 bits in test_verifier")
>
> Signed-off-by: Puranjay Mohan <puranjay@kernel.org>
Reviewed-by: Toke Høiland-Jørgensen <toke@redhat.com>
* Re: [PATCH bpf-next 4/5] selftests/bpf: Add benchmark for bpf_csum_diff() helper
2024-10-21 23:28 ` Andrii Nakryiko
@ 2024-10-22 10:21 ` Puranjay Mohan
2024-10-22 17:47 ` Andrii Nakryiko
0 siblings, 1 reply; 24+ messages in thread
From: Puranjay Mohan @ 2024-10-22 10:21 UTC (permalink / raw)
To: Andrii Nakryiko
Cc: Albert Ou, Alexei Starovoitov, Andrew Morton, Andrii Nakryiko,
bpf, Daniel Borkmann, David S. Miller, Eduard Zingerman,
Eric Dumazet, Hao Luo, Helge Deller, Jakub Kicinski,
James E.J. Bottomley, Jiri Olsa, John Fastabend, KP Singh,
linux-kernel, linux-parisc, linux-riscv, Martin KaFai Lau,
Mykola Lysenko, netdev, Palmer Dabbelt, Paolo Abeni,
Paul Walmsley, Shuah Khan, Song Liu, Stanislav Fomichev,
Yonghong Song
Andrii Nakryiko <andrii.nakryiko@gmail.com> writes:
> On Mon, Oct 21, 2024 at 5:22 AM Puranjay Mohan <puranjay@kernel.org> wrote:
>>
>> Add a microbenchmark for bpf_csum_diff() helper. This benchmark works by
>> filling a 4KB buffer with random data and calculating the internet
>> checksum on different parts of this buffer using bpf_csum_diff().
>>
>> Example run using ./benchs/run_bench_csum_diff.sh on x86_64:
>>
>> [bpf]$ ./benchs/run_bench_csum_diff.sh
>> 4 2.296 ± 0.066M/s (drops 0.000 ± 0.000M/s)
>> 8 2.320 ± 0.003M/s (drops 0.000 ± 0.000M/s)
>> 16 2.315 ± 0.001M/s (drops 0.000 ± 0.000M/s)
>> 20 2.318 ± 0.001M/s (drops 0.000 ± 0.000M/s)
>> 32 2.308 ± 0.003M/s (drops 0.000 ± 0.000M/s)
>> 40 2.300 ± 0.029M/s (drops 0.000 ± 0.000M/s)
>> 64 2.286 ± 0.001M/s (drops 0.000 ± 0.000M/s)
>> 128 2.250 ± 0.001M/s (drops 0.000 ± 0.000M/s)
>> 256 2.173 ± 0.001M/s (drops 0.000 ± 0.000M/s)
>> 512 2.023 ± 0.055M/s (drops 0.000 ± 0.000M/s)
>
> You are not benchmarking bpf_csum_diff(); you are benchmarking how
> often you can call bpf_prog_test_run(). Add some batching on the BPF
> side. These numbers tell you that there is no difference between
> calculating the checksum for 4 bytes and for 512; didn't that seem
> strange to you?
This didn't seem strange to me because, looking at the tables I added
to the cover letter, there is a clear improvement after optimizing the
helper, and arm64 even shows a linear drop going from 4 bytes to 512
bytes, even after the optimization.
On x86 after the improvement, 4 bytes and 512 bytes show similar numbers
but there is still a small drop that can be seen going from 4 to 512
bytes.
My thought was that because bpf_csum_diff() calls csum_partial(), which
is already optimised on x86, most of the overhead was due to copying
the buffer, which is now removed.
I guess I can amplify the difference between 4B and 512B by calling
bpf_csum_diff() multiple times in a loop, or by calculating the csum by
dividing the buffer into more parts (currently the BPF code divides it
into 2 parts only).
>>
>> Signed-off-by: Puranjay Mohan <puranjay@kernel.org>
>> ---
>> tools/testing/selftests/bpf/Makefile | 2 +
>> tools/testing/selftests/bpf/bench.c | 4 +
>> .../selftests/bpf/benchs/bench_csum_diff.c | 164 ++++++++++++++++++
>> .../bpf/benchs/run_bench_csum_diff.sh | 10 ++
>> .../selftests/bpf/progs/csum_diff_bench.c | 25 +++
>> 5 files changed, 205 insertions(+)
>> create mode 100644 tools/testing/selftests/bpf/benchs/bench_csum_diff.c
>> create mode 100755 tools/testing/selftests/bpf/benchs/run_bench_csum_diff.sh
>> create mode 100644 tools/testing/selftests/bpf/progs/csum_diff_bench.c
>>
>
> [...]
>
>> +
>> +static void csum_diff_setup(void)
>> +{
>> + int err;
>> + char *buff;
>> + size_t i, sz;
>> +
>> + sz = sizeof(ctx.skel->rodata->buff);
>> +
>> + setup_libbpf();
>> +
>> + ctx.skel = csum_diff_bench__open();
>> + if (!ctx.skel) {
>> + fprintf(stderr, "failed to open skeleton\n");
>> + exit(1);
>> + }
>> +
>> + srandom(time(NULL));
>> + buff = ctx.skel->rodata->buff;
>> +
>> + /*
>> + * Set the first 8 bytes of the buffer to 0xdeadbeefdeadbeef; this is later used to
>> + * verify the correctness of the helper against the known checksum of this pattern:
>> + * 0xdead + 0xbeef + 0xdead + 0xbeef = 0x33b38, which folds to 0x3b38 + 0x3 = 0x3b3b.
>> + */
>> +
>> + *(u64 *)buff = 0xdeadbeefdeadbeef;
>> +
>> + for (i = 8; i < sz; i++)
>> + buff[i] = '1' + random() % 9;
>
> So, you only generate 9 different values for the bytes, why? Why not
> the full byte range?
Thanks for catching this, there is no reason for this to be [1,10] I
will use the full byte range in the next version.
Thanks,
Puranjay
* Re: [PATCH bpf-next 1/5] net: checksum: move from32to16() to generic header
2024-10-21 12:21 ` [PATCH bpf-next 1/5] net: checksum: move from32to16() to generic header Puranjay Mohan
2024-10-21 13:41 ` Daniel Borkmann
2024-10-22 9:49 ` Toke Høiland-Jørgensen
@ 2024-10-22 13:50 ` kernel test robot
2 siblings, 0 replies; 24+ messages in thread
From: kernel test robot @ 2024-10-22 13:50 UTC (permalink / raw)
To: Puranjay Mohan, Albert Ou, Alexei Starovoitov, Andrew Morton,
Andrii Nakryiko, bpf, Daniel Borkmann, David S. Miller,
Eduard Zingerman, Eric Dumazet, Hao Luo, Helge Deller,
Jakub Kicinski, James E.J. Bottomley, Jiri Olsa, John Fastabend,
KP Singh, linux-kernel, linux-parisc, linux-riscv,
Martin KaFai Lau, Mykola Lysenko, Palmer Dabbelt, Paolo Abeni,
Paul Walmsley, Puranjay Mohan, Shuah Khan, Song Liu,
Stanislav Fomichev
Cc: oe-kbuild-all, Linux Memory Management List, netdev
Hi Puranjay,
kernel test robot noticed the following build warnings:
[auto build test WARNING on bpf-next/master]
url: https://github.com/intel-lab-lkp/linux/commits/Puranjay-Mohan/net-checksum-move-from32to16-to-generic-header/20241021-202707
base: https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next.git master
patch link: https://lore.kernel.org/r/20241021122112.101513-2-puranjay%40kernel.org
patch subject: [PATCH bpf-next 1/5] net: checksum: move from32to16() to generic header
config: x86_64-randconfig-122-20241022 (https://download.01.org/0day-ci/archive/20241022/202410222149.3FVJFYYy-lkp@intel.com/config)
compiler: gcc-12 (Debian 12.2.0-14) 12.2.0
reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20241022/202410222149.3FVJFYYy-lkp@intel.com/reproduce)
If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <lkp@intel.com>
| Closes: https://lore.kernel.org/oe-kbuild-all/202410222149.3FVJFYYy-lkp@intel.com/
sparse warnings: (new ones prefixed by >>)
>> lib/checksum.c:84:34: sparse: sparse: incorrect type in argument 1 (different base types) @@ expected restricted __wsum [usertype] sum @@ got unsigned int [assigned] result @@
lib/checksum.c:84:34: sparse: expected restricted __wsum [usertype] sum
lib/checksum.c:84:34: sparse: got unsigned int [assigned] result
>> lib/checksum.c:84:16: sparse: sparse: incorrect type in assignment (different base types) @@ expected unsigned int [assigned] result @@ got restricted __sum16 @@
lib/checksum.c:84:16: sparse: expected unsigned int [assigned] result
lib/checksum.c:84:16: sparse: got restricted __sum16
vim +84 lib/checksum.c
35
36 #ifndef do_csum
37 static unsigned int do_csum(const unsigned char *buff, int len)
38 {
39 int odd;
40 unsigned int result = 0;
41
42 if (len <= 0)
43 goto out;
44 odd = 1 & (unsigned long) buff;
45 if (odd) {
46 #ifdef __LITTLE_ENDIAN
47 result += (*buff << 8);
48 #else
49 result = *buff;
50 #endif
51 len--;
52 buff++;
53 }
54 if (len >= 2) {
55 if (2 & (unsigned long) buff) {
56 result += *(unsigned short *) buff;
57 len -= 2;
58 buff += 2;
59 }
60 if (len >= 4) {
61 const unsigned char *end = buff + ((unsigned)len & ~3);
62 unsigned int carry = 0;
63 do {
64 unsigned int w = *(unsigned int *) buff;
65 buff += 4;
66 result += carry;
67 result += w;
68 carry = (w > result);
69 } while (buff < end);
70 result += carry;
71 result = (result & 0xffff) + (result >> 16);
72 }
73 if (len & 2) {
74 result += *(unsigned short *) buff;
75 buff += 2;
76 }
77 }
78 if (len & 1)
79 #ifdef __LITTLE_ENDIAN
80 result += *buff;
81 #else
82 result += (*buff << 8);
83 #endif
> 84 result = csum_from32to16(result);
85 if (odd)
86 result = ((result >> 8) & 0xff) | ((result & 0xff) << 8);
87 out:
88 return result;
89 }
90 #endif
91
--
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki
* Re: [PATCH bpf-next 4/5] selftests/bpf: Add benchmark for bpf_csum_diff() helper
2024-10-22 10:21 ` Puranjay Mohan
@ 2024-10-22 17:47 ` Andrii Nakryiko
2024-10-22 17:58 ` Puranjay Mohan
0 siblings, 1 reply; 24+ messages in thread
From: Andrii Nakryiko @ 2024-10-22 17:47 UTC (permalink / raw)
To: Puranjay Mohan
Cc: Albert Ou, Alexei Starovoitov, Andrew Morton, Andrii Nakryiko,
bpf, Daniel Borkmann, David S. Miller, Eduard Zingerman,
Eric Dumazet, Hao Luo, Helge Deller, Jakub Kicinski,
James E.J. Bottomley, Jiri Olsa, John Fastabend, KP Singh,
linux-kernel, linux-parisc, linux-riscv, Martin KaFai Lau,
Mykola Lysenko, netdev, Palmer Dabbelt, Paolo Abeni,
Paul Walmsley, Shuah Khan, Song Liu, Stanislav Fomichev,
Yonghong Song
On Tue, Oct 22, 2024 at 3:21 AM Puranjay Mohan <puranjay@kernel.org> wrote:
>
> Andrii Nakryiko <andrii.nakryiko@gmail.com> writes:
>
> > On Mon, Oct 21, 2024 at 5:22 AM Puranjay Mohan <puranjay@kernel.org> wrote:
> >>
> >> Add a microbenchmark for bpf_csum_diff() helper. This benchmark works by
> >> filling a 4KB buffer with random data and calculating the internet
> >> checksum on different parts of this buffer using bpf_csum_diff().
> >>
> >> Example run using ./benchs/run_bench_csum_diff.sh on x86_64:
> >>
> >> [bpf]$ ./benchs/run_bench_csum_diff.sh
> >> 4 2.296 ± 0.066M/s (drops 0.000 ± 0.000M/s)
> >> 8 2.320 ± 0.003M/s (drops 0.000 ± 0.000M/s)
> >> 16 2.315 ± 0.001M/s (drops 0.000 ± 0.000M/s)
> >> 20 2.318 ± 0.001M/s (drops 0.000 ± 0.000M/s)
> >> 32 2.308 ± 0.003M/s (drops 0.000 ± 0.000M/s)
> >> 40 2.300 ± 0.029M/s (drops 0.000 ± 0.000M/s)
> >> 64 2.286 ± 0.001M/s (drops 0.000 ± 0.000M/s)
> >> 128 2.250 ± 0.001M/s (drops 0.000 ± 0.000M/s)
> >> 256 2.173 ± 0.001M/s (drops 0.000 ± 0.000M/s)
> >> 512 2.023 ± 0.055M/s (drops 0.000 ± 0.000M/s)
> >
> > You are not benchmarking bpf_csum_diff(); you are benchmarking how
> > often you can call bpf_prog_test_run(). Add some batching on the BPF
> > side. These numbers tell you that there is no difference between
> > calculating the checksum for 4 bytes and for 512; didn't that seem
> > strange to you?
>
> This didn't seem strange to me because, looking at the tables I added
> to the cover letter, there is a clear improvement after optimizing the
> helper, and arm64 even shows a linear drop going from 4 bytes to 512
> bytes, even after the optimization.
>
Regardless of optimization, it's strange that throughput barely
differs when you vary the amount of work by more than 100x. This
wouldn't be strange if this checksum calculation was some sort of
cryptographic hash, where it's intentional to have the same timing
regardless of amount of work, or something along those lines. But I
don't think that's the case here.
But as it is right now, this benchmark is benchmarking
bpf_prog_test_run(), as I mentioned, which seems to be bottlenecking
at about 2mln/s throughput for your machine. bpf_csum_diff()'s
overhead is trivial compared to bpf_prog_test_run() overhead and
syscall/context switch overhead.
We shouldn't add a benchmark that doesn't benchmark the right thing.
So just add a bpf_for(i, 0, 100) loop doing bpf_csum_diff(), and then
do an atomic increment *after* the loop (to minimize atomics overhead).
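
A minimal sketch of what that batching could look like on the BPF side
(hypothetical program and variable names, assuming bpf_for() from the
selftests' bpf_experimental.h; not the actual benchmark code):

	#include <vmlinux.h>
	#include <bpf/bpf_helpers.h>
	#include "bpf_experimental.h"

	/* Hypothetical globals, filled in from user space before load. */
	const volatile u32 buff_len = 512;
	const char buff[4096];
	long hits = 0;

	SEC("tc")
	int compute_checksum(struct __sk_buff *ctx)
	{
		u32 half = buff_len / 2;
		int i, csum = 0;

		/* Batch 100 helper calls per program run so that
		 * bpf_csum_diff() itself, not bpf_prog_test_run()
		 * overhead, dominates the measurement. */
		bpf_for(i, 0, 100)
			csum = bpf_csum_diff((__be32 *)buff, half,
					     (__be32 *)(buff + half), half, 0);

		/* One atomic per batch, outside the hot loop. */
		__sync_add_and_fetch(&hits, 1);
		return csum;
	}

	char _license[] SEC("license") = "GPL";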
> On x86 after the improvement, 4 bytes and 512 bytes show similar numbers
> but there is still a small drop that can be seen going from 4 to 512
> bytes.
>
> My thought was that because bpf_csum_diff() calls csum_partial(), which
> is already optimised on x86, most of the overhead was due to copying
> the buffer, which is now removed.
>
> I guess I can amplify the difference between 4B and 512B by calling
> bpf_csum_diff() multiple times in a loop, or by calculating the csum by
> dividing the buffer into more parts (currently the BPF code divides it
> into 2 parts only).
>
> >>
> >> Signed-off-by: Puranjay Mohan <puranjay@kernel.org>
> >> ---
> >> tools/testing/selftests/bpf/Makefile | 2 +
> >> tools/testing/selftests/bpf/bench.c | 4 +
> >> .../selftests/bpf/benchs/bench_csum_diff.c | 164 ++++++++++++++++++
> >> .../bpf/benchs/run_bench_csum_diff.sh | 10 ++
> >> .../selftests/bpf/progs/csum_diff_bench.c | 25 +++
> >> 5 files changed, 205 insertions(+)
> >> create mode 100644 tools/testing/selftests/bpf/benchs/bench_csum_diff.c
> >> create mode 100755 tools/testing/selftests/bpf/benchs/run_bench_csum_diff.sh
> >> create mode 100644 tools/testing/selftests/bpf/progs/csum_diff_bench.c
> >>
> >
[...]
* Re: [PATCH bpf-next 4/5] selftests/bpf: Add benchmark for bpf_csum_diff() helper
2024-10-22 17:47 ` Andrii Nakryiko
@ 2024-10-22 17:58 ` Puranjay Mohan
2024-10-23 15:37 ` Puranjay Mohan
0 siblings, 1 reply; 24+ messages in thread
From: Puranjay Mohan @ 2024-10-22 17:58 UTC (permalink / raw)
To: Andrii Nakryiko
Cc: Albert Ou, Alexei Starovoitov, Andrew Morton, Andrii Nakryiko,
bpf, Daniel Borkmann, David S. Miller, Eduard Zingerman,
Eric Dumazet, Hao Luo, Helge Deller, Jakub Kicinski,
James E.J. Bottomley, Jiri Olsa, John Fastabend, KP Singh,
linux-kernel, linux-parisc, linux-riscv, Martin KaFai Lau,
Mykola Lysenko, netdev, Palmer Dabbelt, Paolo Abeni,
Paul Walmsley, Shuah Khan, Song Liu, Stanislav Fomichev,
Yonghong Song
Andrii Nakryiko <andrii.nakryiko@gmail.com> writes:
> On Tue, Oct 22, 2024 at 3:21 AM Puranjay Mohan <puranjay@kernel.org> wrote:
>>
>> Andrii Nakryiko <andrii.nakryiko@gmail.com> writes:
>>
>> > On Mon, Oct 21, 2024 at 5:22 AM Puranjay Mohan <puranjay@kernel.org> wrote:
>> >>
>> >> Add a microbenchmark for bpf_csum_diff() helper. This benchmark works by
>> >> filling a 4KB buffer with random data and calculating the internet
>> >> checksum on different parts of this buffer using bpf_csum_diff().
>> >>
>> >> Example run using ./benchs/run_bench_csum_diff.sh on x86_64:
>> >>
>> >> [bpf]$ ./benchs/run_bench_csum_diff.sh
>> >> 4 2.296 ± 0.066M/s (drops 0.000 ± 0.000M/s)
>> >> 8 2.320 ± 0.003M/s (drops 0.000 ± 0.000M/s)
>> >> 16 2.315 ± 0.001M/s (drops 0.000 ± 0.000M/s)
>> >> 20 2.318 ± 0.001M/s (drops 0.000 ± 0.000M/s)
>> >> 32 2.308 ± 0.003M/s (drops 0.000 ± 0.000M/s)
>> >> 40 2.300 ± 0.029M/s (drops 0.000 ± 0.000M/s)
>> >> 64 2.286 ± 0.001M/s (drops 0.000 ± 0.000M/s)
>> >> 128 2.250 ± 0.001M/s (drops 0.000 ± 0.000M/s)
>> >> 256 2.173 ± 0.001M/s (drops 0.000 ± 0.000M/s)
>> >> 512 2.023 ± 0.055M/s (drops 0.000 ± 0.000M/s)
>> >
>> > You are not benchmarking bpf_csum_diff(); you are benchmarking how
>> > often you can call bpf_prog_test_run(). Add some batching on the BPF
>> > side. These numbers tell you that there is no difference between
>> > calculating the checksum for 4 bytes and for 512; didn't that seem
>> > strange to you?
>>
>> This didn't seem strange to me because, looking at the tables I added
>> to the cover letter, there is a clear improvement after optimizing the
>> helper, and arm64 even shows a linear drop going from 4 bytes to 512
>> bytes, even after the optimization.
>>
>
> Regardless of optimization, it's strange that throughput barely
> differs when you vary the amount of work by more than 100x. This
> wouldn't be strange if this checksum calculation was some sort of
> cryptographic hash, where it's intentional to have the same timing
> regardless of amount of work, or something along those lines. But I
> don't think that's the case here.
>
> But as it is right now, this benchmark is benchmarking
> bpf_prog_test_run(), as I mentioned, which seems to be bottlenecking
> at about 2mln/s throughput for your machine. bpf_csum_diff()'s
> overhead is trivial compared to bpf_prog_test_run() overhead and
> syscall/context switch overhead.
>
> We shouldn't add a benchmark that doesn't benchmark the right thing.
> So just add a bpf_for(i, 0, 100) loop doing bpf_csum_diff(), and then
> do an atomic increment *after* the loop (to minimize atomics overhead).
Thanks, now I understand what you meant. I will add the bpf_for() in
the next version.
Thanks,
Puranjay
* Re: [PATCH bpf-next 2/5] bpf: bpf_csum_diff: optimize and homogenize for all archs
2024-10-21 12:21 ` [PATCH bpf-next 2/5] bpf: bpf_csum_diff: optimize and homogenize for all archs Puranjay Mohan
2024-10-21 13:42 ` Daniel Borkmann
2024-10-22 9:54 ` Toke Høiland-Jørgensen
@ 2024-10-22 18:09 ` kernel test robot
2 siblings, 0 replies; 24+ messages in thread
From: kernel test robot @ 2024-10-22 18:09 UTC (permalink / raw)
To: Puranjay Mohan, Albert Ou, Alexei Starovoitov, Andrew Morton,
Andrii Nakryiko, bpf, Daniel Borkmann, David S. Miller,
Eduard Zingerman, Eric Dumazet, Hao Luo, Helge Deller,
Jakub Kicinski, James E.J. Bottomley, Jiri Olsa, John Fastabend,
KP Singh, linux-kernel, linux-parisc, linux-riscv,
Martin KaFai Lau, Mykola Lysenko, Palmer Dabbelt, Paolo Abeni,
Paul Walmsley, Puranjay Mohan, Shuah Khan, Song Liu,
Stanislav Fomichev
Cc: oe-kbuild-all, Linux Memory Management List, netdev
Hi Puranjay,
kernel test robot noticed the following build warnings:
[auto build test WARNING on bpf-next/master]
url: https://github.com/intel-lab-lkp/linux/commits/Puranjay-Mohan/net-checksum-move-from32to16-to-generic-header/20241021-202707
base: https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next.git master
patch link: https://lore.kernel.org/r/20241021122112.101513-3-puranjay%40kernel.org
patch subject: [PATCH bpf-next 2/5] bpf: bpf_csum_diff: optimize and homogenize for all archs
config: x86_64-randconfig-122-20241022 (https://download.01.org/0day-ci/archive/20241023/202410230122.BYZLEUHz-lkp@intel.com/config)
compiler: gcc-12 (Debian 12.2.0-14) 12.2.0
reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20241023/202410230122.BYZLEUHz-lkp@intel.com/reproduce)
If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <lkp@intel.com>
| Closes: https://lore.kernel.org/oe-kbuild-all/202410230122.BYZLEUHz-lkp@intel.com/
sparse warnings: (new ones prefixed by >>)
net/core/filter.c:1423:39: sparse: sparse: incorrect type in argument 1 (different address spaces) @@ expected struct sock_filter const *filter @@ got struct sock_filter [noderef] __user *filter @@
net/core/filter.c:1423:39: sparse: expected struct sock_filter const *filter
net/core/filter.c:1423:39: sparse: got struct sock_filter [noderef] __user *filter
net/core/filter.c:1501:39: sparse: sparse: incorrect type in argument 1 (different address spaces) @@ expected struct sock_filter const *filter @@ got struct sock_filter [noderef] __user *filter @@
net/core/filter.c:1501:39: sparse: expected struct sock_filter const *filter
net/core/filter.c:1501:39: sparse: got struct sock_filter [noderef] __user *filter
net/core/filter.c:2321:45: sparse: sparse: incorrect type in argument 2 (different base types) @@ expected restricted __be32 [usertype] daddr @@ got unsigned int [usertype] ipv4_nh @@
net/core/filter.c:2321:45: sparse: expected restricted __be32 [usertype] daddr
net/core/filter.c:2321:45: sparse: got unsigned int [usertype] ipv4_nh
net/core/filter.c:10993:31: sparse: sparse: symbol 'sk_filter_verifier_ops' was not declared. Should it be static?
net/core/filter.c:11000:27: sparse: sparse: symbol 'sk_filter_prog_ops' was not declared. Should it be static?
net/core/filter.c:11004:31: sparse: sparse: symbol 'tc_cls_act_verifier_ops' was not declared. Should it be static?
net/core/filter.c:11013:27: sparse: sparse: symbol 'tc_cls_act_prog_ops' was not declared. Should it be static?
net/core/filter.c:11017:31: sparse: sparse: symbol 'xdp_verifier_ops' was not declared. Should it be static?
net/core/filter.c:11029:31: sparse: sparse: symbol 'cg_skb_verifier_ops' was not declared. Should it be static?
net/core/filter.c:11035:27: sparse: sparse: symbol 'cg_skb_prog_ops' was not declared. Should it be static?
net/core/filter.c:11039:31: sparse: sparse: symbol 'lwt_in_verifier_ops' was not declared. Should it be static?
net/core/filter.c:11045:27: sparse: sparse: symbol 'lwt_in_prog_ops' was not declared. Should it be static?
net/core/filter.c:11049:31: sparse: sparse: symbol 'lwt_out_verifier_ops' was not declared. Should it be static?
net/core/filter.c:11055:27: sparse: sparse: symbol 'lwt_out_prog_ops' was not declared. Should it be static?
net/core/filter.c:11059:31: sparse: sparse: symbol 'lwt_xmit_verifier_ops' was not declared. Should it be static?
net/core/filter.c:11066:27: sparse: sparse: symbol 'lwt_xmit_prog_ops' was not declared. Should it be static?
net/core/filter.c:11070:31: sparse: sparse: symbol 'lwt_seg6local_verifier_ops' was not declared. Should it be static?
net/core/filter.c:11076:27: sparse: sparse: symbol 'lwt_seg6local_prog_ops' was not declared. Should it be static?
net/core/filter.c:11079:31: sparse: sparse: symbol 'cg_sock_verifier_ops' was not declared. Should it be static?
net/core/filter.c:11085:27: sparse: sparse: symbol 'cg_sock_prog_ops' was not declared. Should it be static?
net/core/filter.c:11088:31: sparse: sparse: symbol 'cg_sock_addr_verifier_ops' was not declared. Should it be static?
net/core/filter.c:11094:27: sparse: sparse: symbol 'cg_sock_addr_prog_ops' was not declared. Should it be static?
net/core/filter.c:11097:31: sparse: sparse: symbol 'sock_ops_verifier_ops' was not declared. Should it be static?
net/core/filter.c:11103:27: sparse: sparse: symbol 'sock_ops_prog_ops' was not declared. Should it be static?
net/core/filter.c:11106:31: sparse: sparse: symbol 'sk_skb_verifier_ops' was not declared. Should it be static?
net/core/filter.c:11113:27: sparse: sparse: symbol 'sk_skb_prog_ops' was not declared. Should it be static?
net/core/filter.c:11116:31: sparse: sparse: symbol 'sk_msg_verifier_ops' was not declared. Should it be static?
net/core/filter.c:11123:27: sparse: sparse: symbol 'sk_msg_prog_ops' was not declared. Should it be static?
net/core/filter.c:11126:31: sparse: sparse: symbol 'flow_dissector_verifier_ops' was not declared. Should it be static?
net/core/filter.c:11132:27: sparse: sparse: symbol 'flow_dissector_prog_ops' was not declared. Should it be static?
net/core/filter.c:11460:31: sparse: sparse: symbol 'sk_reuseport_verifier_ops' was not declared. Should it be static?
net/core/filter.c:11466:27: sparse: sparse: symbol 'sk_reuseport_prog_ops' was not declared. Should it be static?
net/core/filter.c:11668:27: sparse: sparse: symbol 'sk_lookup_prog_ops' was not declared. Should it be static?
net/core/filter.c:11672:31: sparse: sparse: symbol 'sk_lookup_verifier_ops' was not declared. Should it be static?
net/core/filter.c:1931:43: sparse: sparse: incorrect type in argument 2 (different base types) @@ expected restricted __wsum [usertype] diff @@ got unsigned long long [usertype] to @@
net/core/filter.c:1931:43: sparse: expected restricted __wsum [usertype] diff
net/core/filter.c:1931:43: sparse: got unsigned long long [usertype] to
net/core/filter.c:1934:36: sparse: sparse: incorrect type in argument 2 (different base types) @@ expected restricted __be16 [usertype] old @@ got unsigned long long [usertype] from @@
net/core/filter.c:1934:36: sparse: expected restricted __be16 [usertype] old
net/core/filter.c:1934:36: sparse: got unsigned long long [usertype] from
net/core/filter.c:1934:42: sparse: sparse: incorrect type in argument 3 (different base types) @@ expected restricted __be16 [usertype] new @@ got unsigned long long [usertype] to @@
net/core/filter.c:1934:42: sparse: expected restricted __be16 [usertype] new
net/core/filter.c:1934:42: sparse: got unsigned long long [usertype] to
net/core/filter.c:1937:36: sparse: sparse: incorrect type in argument 2 (different base types) @@ expected restricted __be32 [usertype] from @@ got unsigned long long [usertype] from @@
net/core/filter.c:1937:36: sparse: expected restricted __be32 [usertype] from
net/core/filter.c:1937:36: sparse: got unsigned long long [usertype] from
net/core/filter.c:1937:42: sparse: sparse: incorrect type in argument 3 (different base types) @@ expected restricted __be32 [usertype] to @@ got unsigned long long [usertype] to @@
net/core/filter.c:1937:42: sparse: expected restricted __be32 [usertype] to
net/core/filter.c:1937:42: sparse: got unsigned long long [usertype] to
net/core/filter.c:1982:59: sparse: sparse: incorrect type in argument 3 (different base types) @@ expected restricted __wsum [usertype] diff @@ got unsigned long long [usertype] to @@
net/core/filter.c:1982:59: sparse: expected restricted __wsum [usertype] diff
net/core/filter.c:1982:59: sparse: got unsigned long long [usertype] to
net/core/filter.c:1985:52: sparse: sparse: incorrect type in argument 3 (different base types) @@ expected restricted __be16 [usertype] from @@ got unsigned long long [usertype] from @@
net/core/filter.c:1985:52: sparse: expected restricted __be16 [usertype] from
net/core/filter.c:1985:52: sparse: got unsigned long long [usertype] from
net/core/filter.c:1985:58: sparse: sparse: incorrect type in argument 4 (different base types) @@ expected restricted __be16 [usertype] to @@ got unsigned long long [usertype] to @@
net/core/filter.c:1985:58: sparse: expected restricted __be16 [usertype] to
net/core/filter.c:1985:58: sparse: got unsigned long long [usertype] to
net/core/filter.c:1988:52: sparse: sparse: incorrect type in argument 3 (different base types) @@ expected restricted __be32 [usertype] from @@ got unsigned long long [usertype] from @@
net/core/filter.c:1988:52: sparse: expected restricted __be32 [usertype] from
net/core/filter.c:1988:52: sparse: got unsigned long long [usertype] from
net/core/filter.c:1988:58: sparse: sparse: incorrect type in argument 4 (different base types) @@ expected restricted __be32 [usertype] to @@ got unsigned long long [usertype] to @@
net/core/filter.c:1988:58: sparse: expected restricted __be32 [usertype] to
net/core/filter.c:1988:58: sparse: got unsigned long long [usertype] to
>> net/core/filter.c:2023:39: sparse: sparse: incorrect type in return expression (different base types) @@ expected unsigned long long @@ got restricted __sum16 @@
net/core/filter.c:2023:39: sparse: expected unsigned long long
net/core/filter.c:2023:39: sparse: got restricted __sum16
net/core/filter.c:2026:39: sparse: sparse: incorrect type in return expression (different base types) @@ expected unsigned long long @@ got restricted __sum16 @@
net/core/filter.c:2026:39: sparse: expected unsigned long long
net/core/filter.c:2026:39: sparse: got restricted __sum16
net/core/filter.c:2029:39: sparse: sparse: incorrect type in return expression (different base types) @@ expected unsigned long long @@ got restricted __sum16 @@
net/core/filter.c:2029:39: sparse: expected unsigned long long
net/core/filter.c:2029:39: sparse: got restricted __sum16
>> net/core/filter.c:2031:16: sparse: sparse: incorrect type in return expression (different base types) @@ expected unsigned long long @@ got restricted __wsum [usertype] seed @@
net/core/filter.c:2031:16: sparse: expected unsigned long long
net/core/filter.c:2031:16: sparse: got restricted __wsum [usertype] seed
net/core/filter.c:2053:35: sparse: sparse: incorrect type in return expression (different base types) @@ expected unsigned long long @@ got restricted __wsum [usertype] csum @@
net/core/filter.c:2053:35: sparse: expected unsigned long long
net/core/filter.c:2053:35: sparse: got restricted __wsum [usertype] csum
vim +2023 net/core/filter.c
1956
1957 BPF_CALL_5(bpf_l4_csum_replace, struct sk_buff *, skb, u32, offset,
1958 u64, from, u64, to, u64, flags)
1959 {
1960 bool is_pseudo = flags & BPF_F_PSEUDO_HDR;
1961 bool is_mmzero = flags & BPF_F_MARK_MANGLED_0;
1962 bool do_mforce = flags & BPF_F_MARK_ENFORCE;
1963 __sum16 *ptr;
1964
1965 if (unlikely(flags & ~(BPF_F_MARK_MANGLED_0 | BPF_F_MARK_ENFORCE |
1966 BPF_F_PSEUDO_HDR | BPF_F_HDR_FIELD_MASK)))
1967 return -EINVAL;
1968 if (unlikely(offset > 0xffff || offset & 1))
1969 return -EFAULT;
1970 if (unlikely(bpf_try_make_writable(skb, offset + sizeof(*ptr))))
1971 return -EFAULT;
1972
1973 ptr = (__sum16 *)(skb->data + offset);
1974 if (is_mmzero && !do_mforce && !*ptr)
1975 return 0;
1976
1977 switch (flags & BPF_F_HDR_FIELD_MASK) {
1978 case 0:
1979 if (unlikely(from != 0))
1980 return -EINVAL;
1981
1982 inet_proto_csum_replace_by_diff(ptr, skb, to, is_pseudo);
1983 break;
1984 case 2:
> 1985 inet_proto_csum_replace2(ptr, skb, from, to, is_pseudo);
1986 break;
1987 case 4:
1988 inet_proto_csum_replace4(ptr, skb, from, to, is_pseudo);
1989 break;
1990 default:
1991 return -EINVAL;
1992 }
1993
1994 if (is_mmzero && !*ptr)
1995 *ptr = CSUM_MANGLED_0;
1996 return 0;
1997 }
1998
1999 static const struct bpf_func_proto bpf_l4_csum_replace_proto = {
2000 .func = bpf_l4_csum_replace,
2001 .gpl_only = false,
2002 .ret_type = RET_INTEGER,
2003 .arg1_type = ARG_PTR_TO_CTX,
2004 .arg2_type = ARG_ANYTHING,
2005 .arg3_type = ARG_ANYTHING,
2006 .arg4_type = ARG_ANYTHING,
2007 .arg5_type = ARG_ANYTHING,
2008 };
2009
2010 BPF_CALL_5(bpf_csum_diff, __be32 *, from, u32, from_size,
2011 __be32 *, to, u32, to_size, __wsum, seed)
2012 {
2013 /* This is quite flexible, some examples:
2014 *
2015 * from_size == 0, to_size > 0, seed := csum --> pushing data
2016 * from_size > 0, to_size == 0, seed := csum --> pulling data
2017 * from_size > 0, to_size > 0, seed := 0 --> diffing data
2018 *
2019 * Even for diffing, from_size and to_size don't need to be equal.
2020 */
2021
2022 if (from_size && to_size)
> 2023 return csum_from32to16(csum_sub(csum_partial(to, to_size, seed),
2024 csum_partial(from, from_size, 0)));
2025 if (to_size)
2026 return csum_from32to16(csum_partial(to, to_size, seed));
2027
2028 if (from_size)
2029 return csum_from32to16(~csum_partial(from, from_size, ~seed));
2030
> 2031 return seed;
2032 }
2033
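
For context, the three modes described in the comment above map to
helper calls along these lines (a hypothetical fragment; old_addrs and
new_addrs are assumed to point at 8 bytes of readable memory, e.g. an
IPv4 address pair):

	#include <linux/types.h>
	#include <bpf/bpf_helpers.h>

	static __always_inline __s64 csum_three_modes(__be32 *old_addrs,
						      __be32 *new_addrs)
	{
		__s64 csum;

		/* diffing data: from_size > 0, to_size > 0, seed := 0 */
		csum = bpf_csum_diff(old_addrs, 8, new_addrs, 8, 0);

		/* pushing data: from_size == 0, to_size > 0, seed := csum */
		csum = bpf_csum_diff(NULL, 0, new_addrs, 8, (__wsum)csum);

		/* pulling data: from_size > 0, to_size == 0, seed := csum */
		csum = bpf_csum_diff(old_addrs, 8, NULL, 0, (__wsum)csum);

		return csum;
	}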
--
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki
* Re: [PATCH bpf-next 4/5] selftests/bpf: Add benchmark for bpf_csum_diff() helper
2024-10-22 17:58 ` Puranjay Mohan
@ 2024-10-23 15:37 ` Puranjay Mohan
0 siblings, 0 replies; 24+ messages in thread
From: Puranjay Mohan @ 2024-10-23 15:37 UTC (permalink / raw)
To: Andrii Nakryiko
Cc: Albert Ou, Alexei Starovoitov, Andrew Morton, Andrii Nakryiko,
bpf, Daniel Borkmann, David S. Miller, Eduard Zingerman,
Eric Dumazet, Hao Luo, Helge Deller, Jakub Kicinski,
James E.J. Bottomley, Jiri Olsa, John Fastabend, KP Singh,
linux-kernel, linux-parisc, linux-riscv, Martin KaFai Lau,
Mykola Lysenko, netdev, Palmer Dabbelt, Paolo Abeni,
Paul Walmsley, Shuah Khan, Song Liu, Stanislav Fomichev,
Yonghong Song
Puranjay Mohan <puranjay@kernel.org> writes:
> Andrii Nakryiko <andrii.nakryiko@gmail.com> writes:
>
>> On Tue, Oct 22, 2024 at 3:21 AM Puranjay Mohan <puranjay@kernel.org> wrote:
>>>
>>> Andrii Nakryiko <andrii.nakryiko@gmail.com> writes:
>>>
>>> > On Mon, Oct 21, 2024 at 5:22 AM Puranjay Mohan <puranjay@kernel.org> wrote:
>>> >>
>>> >> Add a microbenchmark for bpf_csum_diff() helper. This benchmark works by
>>> >> filling a 4KB buffer with random data and calculating the internet
>>> >> checksum on different parts of this buffer using bpf_csum_diff().
>>> >>
>>> >> Example run using ./benchs/run_bench_csum_diff.sh on x86_64:
>>> >>
>>> >> [bpf]$ ./benchs/run_bench_csum_diff.sh
>>> >> 4 2.296 ± 0.066M/s (drops 0.000 ± 0.000M/s)
>>> >> 8 2.320 ± 0.003M/s (drops 0.000 ± 0.000M/s)
>>> >> 16 2.315 ± 0.001M/s (drops 0.000 ± 0.000M/s)
>>> >> 20 2.318 ± 0.001M/s (drops 0.000 ± 0.000M/s)
>>> >> 32 2.308 ± 0.003M/s (drops 0.000 ± 0.000M/s)
>>> >> 40 2.300 ± 0.029M/s (drops 0.000 ± 0.000M/s)
>>> >> 64 2.286 ± 0.001M/s (drops 0.000 ± 0.000M/s)
>>> >> 128 2.250 ± 0.001M/s (drops 0.000 ± 0.000M/s)
>>> >> 256 2.173 ± 0.001M/s (drops 0.000 ± 0.000M/s)
>>> >> 512 2.023 ± 0.055M/s (drops 0.000 ± 0.000M/s)
>>> >
>>> > You are not benchmarking bpf_csum_diff(); you are benchmarking how
>>> > often you can call bpf_prog_test_run(). Add some batching on the BPF
>>> > side. These numbers tell you that there is no difference between
>>> > calculating the checksum for 4 bytes and for 512; didn't that seem
>>> > strange to you?
>>>
>>> This didn't seem strange to me because, looking at the tables I added
>>> to the cover letter, there is a clear improvement after optimizing the
>>> helper, and arm64 even shows a linear drop going from 4 bytes to 512
>>> bytes, even after the optimization.
>>>
>>
>> Regardless of optimization, it's strange that throughput barely
>> differs when you vary the amount of work by more than 100x. This
>> wouldn't be strange if this checksum calculation was some sort of
>> cryptographic hash, where it's intentional to have the same timing
>> regardless of amount of work, or something along those lines. But I
>> don't think that's the case here.
>>
>> But as it is right now, this benchmark is benchmarking
>> bpf_prog_test_run(), as I mentioned, which seems to be bottlenecking
>> at about 2mln/s throughput for your machine. bpf_csum_diff()'s
>> overhead is trivial compared to bpf_prog_test_run() overhead and
>> syscall/context switch overhead.
>>
>> We shouldn't add a benchmark that doesn't benchmark the right thing.
>> So just add a bpf_for(i, 0, 100) loop doing bpf_csum_diff(), and then
>> do an atomic increment *after* the loop (to minimize atomics overhead).
>
> Thanks, now I understand what you meant. I will add the bpf_for() in
> the next version.
I have decided to drop this patch: even after adding bpf_for(), the
difference between 4B and 512B is not that large, so benchmarking
bpf_csum_diff() using this trigger-based framework is not useful.
v2 will not have this patch, but the cover letter will still have the
tables showing the difference before/after the optimization.
Thanks,
Puranjay
Thread overview: 24+ messages
2024-10-21 12:21 [PATCH bpf-next 0/5] Optimize bpf_csum_diff() and homogenize for all archs Puranjay Mohan
2024-10-21 12:21 ` [PATCH bpf-next 1/5] net: checksum: move from32to16() to generic header Puranjay Mohan
2024-10-21 13:41 ` Daniel Borkmann
2024-10-22 9:49 ` Toke Høiland-Jørgensen
2024-10-22 13:50 ` kernel test robot
2024-10-21 12:21 ` [PATCH bpf-next 2/5] bpf: bpf_csum_diff: optimize and homogenize for all archs Puranjay Mohan
2024-10-21 13:42 ` Daniel Borkmann
2024-10-22 9:54 ` Toke Høiland-Jørgensen
2024-10-22 18:09 ` kernel test robot
2024-10-21 12:21 ` [PATCH bpf-next 3/5] selftests/bpf: don't mask result of bpf_csum_diff() in test_verifier Puranjay Mohan
2024-10-21 13:01 ` Helge Deller
2024-10-21 13:14 ` Puranjay Mohan
2024-10-21 14:04 ` Helge Deller
2024-10-21 13:42 ` Daniel Borkmann
2024-10-22 9:55 ` Toke Høiland-Jørgensen
2024-10-21 12:21 ` [PATCH bpf-next 4/5] selftests/bpf: Add benchmark for bpf_csum_diff() helper Puranjay Mohan
2024-10-21 13:43 ` Daniel Borkmann
2024-10-21 23:28 ` Andrii Nakryiko
2024-10-22 10:21 ` Puranjay Mohan
2024-10-22 17:47 ` Andrii Nakryiko
2024-10-22 17:58 ` Puranjay Mohan
2024-10-23 15:37 ` Puranjay Mohan
2024-10-21 12:21 ` [PATCH bpf-next 5/5] selftests/bpf: Add a selftest for bpf_csum_diff() Puranjay Mohan
2024-10-21 13:44 ` Daniel Borkmann