netdev.vger.kernel.org archive mirror
* [PATCH bpf-next 0/5] Optimize bpf_csum_diff() and homogenize for all archs
@ 2024-10-21 12:21 Puranjay Mohan
  2024-10-21 12:21 ` [PATCH bpf-next 1/5] net: checksum: move from32to16() to generic header Puranjay Mohan
                   ` (4 more replies)
  0 siblings, 5 replies; 24+ messages in thread
From: Puranjay Mohan @ 2024-10-21 12:21 UTC (permalink / raw)
  To: Albert Ou, Alexei Starovoitov, Andrew Morton, Andrii Nakryiko,
	bpf, Daniel Borkmann, David S. Miller, Eduard Zingerman,
	Eric Dumazet, Hao Luo, Helge Deller, Jakub Kicinski,
	James E.J. Bottomley, Jiri Olsa, John Fastabend, KP Singh,
	linux-kernel, linux-parisc, linux-riscv, Martin KaFai Lau,
	Mykola Lysenko, netdev, Palmer Dabbelt, Paolo Abeni,
	Paul Walmsley, Puranjay Mohan, Puranjay Mohan, Shuah Khan,
	Song Liu, Stanislav Fomichev, Yonghong Song

The bpf_csum_diff() helper currently returns different values on different
architectures because it calls csum_partial(), which is either implemented
by the architecture, as on x86_64, arm, etc., or falls back to the generic
implementation in lib/checksum.c, as on arm64, riscv, etc.

The implementation in lib/checksum.c returns a folded, 16-bit result, but
the architecture-specific implementations can return an unfolded value
that is wider than 16 bits.

The helper uses a per-CPU scratchpad buffer for copying the data and then
computes the csum on this buffer. This copy can be avoided by using a
mathematical property of the csum: the checksum of two concatenated
buffers equals the one's-complement sum of their individual checksums.

Patch 1 in this series does preparatory work for homogenizing the helper;
patch 2 makes the changes to the helper itself. The performance gain can
be seen in the tables below, which were generated using the benchmark
built in patch 4:

  x86-64:
  +-------------+------------------+------------------+-------------+
  | Buffer Size |      Before      |      After       | Improvement |
  +-------------+------------------+------------------+-------------+
  |      4      | 2.296 ± 0.066M/s | 3.415 ± 0.001M/s |   48.73  %  |
  |      8      | 2.320 ± 0.003M/s | 3.409 ± 0.003M/s |   46.93  %  |
  |      16     | 2.315 ± 0.001M/s | 3.414 ± 0.003M/s |   47.47  %  |
  |      20     | 2.318 ± 0.001M/s | 3.416 ± 0.001M/s |   47.36  %  |
  |      32     | 2.308 ± 0.003M/s | 3.413 ± 0.003M/s |   47.87  %  |
  |      40     | 2.300 ± 0.029M/s | 3.413 ± 0.003M/s |   48.39  %  |
  |      64     | 2.286 ± 0.001M/s | 3.410 ± 0.001M/s |   49.16  %  |
  |      128    | 2.250 ± 0.001M/s | 3.404 ± 0.001M/s |   51.28  %  |
  |      256    | 2.173 ± 0.001M/s | 3.383 ± 0.001M/s |   55.68  %  |
  |      512    | 2.023 ± 0.055M/s | 3.340 ± 0.001M/s |   65.10  %  |
  +-------------+------------------+------------------+-------------+

  ARM64:
  +-------------+------------------+------------------+-------------+
  | Buffer Size |      Before      |      After       | Improvement |
  +-------------+------------------+------------------+-------------+
  |      4      | 1.397 ± 0.005M/s | 1.493 ± 0.005M/s |    6.87  %  |
  |      8      | 1.402 ± 0.002M/s | 1.489 ± 0.002M/s |    6.20  %  |
  |      16     | 1.391 ± 0.001M/s | 1.481 ± 0.001M/s |    6.47  %  |
  |      20     | 1.379 ± 0.001M/s | 1.477 ± 0.001M/s |    7.10  %  |
  |      32     | 1.358 ± 0.001M/s | 1.469 ± 0.002M/s |    8.17  %  |
  |      40     | 1.339 ± 0.001M/s | 1.462 ± 0.002M/s |    9.18  %  |
  |      64     | 1.302 ± 0.002M/s | 1.449 ± 0.003M/s |    11.29 %  |
  |      128    | 1.214 ± 0.001M/s | 1.443 ± 0.003M/s |    18.86 %  |
  |      256    | 1.080 ± 0.001M/s | 1.423 ± 0.001M/s |    31.75 %  |
  |      512    | 0.887 ± 0.001M/s | 1.411 ± 0.002M/s |    59.07 %  |
  +-------------+------------------+------------------+-------------+

Patch 5 adds a selftest for this helper that verifies the results it
produces in multiple modes and edge cases.

Patch 3 reverts a masking hack that was added to make a test_verifier
test pass on all architectures.

Puranjay Mohan (5):
  net: checksum: move from32to16() to generic header
  bpf: bpf_csum_diff: optimize and homogenize for all archs
  selftests/bpf: don't mask result of bpf_csum_diff() in test_verifier
  selftests/bpf: Add benchmark for bpf_csum_diff() helper
  selftests/bpf: Add a selftest for bpf_csum_diff()

 arch/parisc/lib/checksum.c                    |  13 +-
 include/net/checksum.h                        |   6 +
 lib/checksum.c                                |  11 +-
 net/core/filter.c                             |  37 +-
 tools/testing/selftests/bpf/Makefile          |   2 +
 tools/testing/selftests/bpf/bench.c           |   4 +
 .../selftests/bpf/benchs/bench_csum_diff.c    | 164 +++++++
 .../bpf/benchs/run_bench_csum_diff.sh         |  10 +
 .../selftests/bpf/prog_tests/test_csum_diff.c | 408 ++++++++++++++++++
 .../selftests/bpf/progs/csum_diff_bench.c     |  25 ++
 .../selftests/bpf/progs/csum_diff_test.c      |  42 ++
 .../bpf/progs/verifier_array_access.c         |   3 +-
 12 files changed, 674 insertions(+), 51 deletions(-)
 create mode 100644 tools/testing/selftests/bpf/benchs/bench_csum_diff.c
 create mode 100755 tools/testing/selftests/bpf/benchs/run_bench_csum_diff.sh
 create mode 100644 tools/testing/selftests/bpf/prog_tests/test_csum_diff.c
 create mode 100644 tools/testing/selftests/bpf/progs/csum_diff_bench.c
 create mode 100644 tools/testing/selftests/bpf/progs/csum_diff_test.c

-- 
2.40.1



* [PATCH bpf-next 1/5] net: checksum: move from32to16() to generic header
  2024-10-21 12:21 [PATCH bpf-next 0/5] Optimize bpf_csum_diff() and homogenize for all archs Puranjay Mohan
@ 2024-10-21 12:21 ` Puranjay Mohan
  2024-10-21 13:41   ` Daniel Borkmann
                     ` (2 more replies)
  2024-10-21 12:21 ` [PATCH bpf-next 2/5] bpf: bpf_csum_diff: optimize and homogenize for all archs Puranjay Mohan
                   ` (3 subsequent siblings)
  4 siblings, 3 replies; 24+ messages in thread
From: Puranjay Mohan @ 2024-10-21 12:21 UTC (permalink / raw)
  To: Albert Ou, Alexei Starovoitov, Andrew Morton, Andrii Nakryiko,
	bpf, Daniel Borkmann, David S. Miller, Eduard Zingerman,
	Eric Dumazet, Hao Luo, Helge Deller, Jakub Kicinski,
	James E.J. Bottomley, Jiri Olsa, John Fastabend, KP Singh,
	linux-kernel, linux-parisc, linux-riscv, Martin KaFai Lau,
	Mykola Lysenko, netdev, Palmer Dabbelt, Paolo Abeni,
	Paul Walmsley, Puranjay Mohan, Puranjay Mohan, Shuah Khan,
	Song Liu, Stanislav Fomichev, Yonghong Song

from32to16() is used by lib/checksum.c and also by
arch/parisc/lib/checksum.c. The next patch will use it in the
bpf_csum_diff() helper.

Move from32to16() to include/net/checksum.h as csum_from32to16() and
remove the duplicate implementations.
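
The header variant folds with a rotate-and-add instead of the two explicit
fold steps of the removed copies; a quick sketch (illustrative only) of why
the two forms agree:

	/* old: two-step fold of a 32-bit partial sum x */
	x = (x & 0xffff) + (x >> 16);
	x = (x & 0xffff) + (x >> 16);

	/* new: add the swapped halves, folded sum lands in the top 16 bits */
	sum += (sum >> 16) | (sum << 16);
	sum >>= 16;

Both reduce a 32-bit partial sum to 16 bits while preserving the
end-around carry.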

Signed-off-by: Puranjay Mohan <puranjay@kernel.org>
---
 arch/parisc/lib/checksum.c | 13 ++-----------
 include/net/checksum.h     |  6 ++++++
 lib/checksum.c             | 11 +----------
 3 files changed, 9 insertions(+), 21 deletions(-)

diff --git a/arch/parisc/lib/checksum.c b/arch/parisc/lib/checksum.c
index 4818f3db84a5c..59d8c15d81bd0 100644
--- a/arch/parisc/lib/checksum.c
+++ b/arch/parisc/lib/checksum.c
@@ -25,15 +25,6 @@
 	: "=r"(_t)                      \
 	: "r"(_r), "0"(_t));
 
-static inline unsigned short from32to16(unsigned int x)
-{
-	/* 32 bits --> 16 bits + carry */
-	x = (x & 0xffff) + (x >> 16);
-	/* 16 bits + carry --> 16 bits including carry */
-	x = (x & 0xffff) + (x >> 16);
-	return (unsigned short)x;
-}
-
 static inline unsigned int do_csum(const unsigned char * buff, int len)
 {
 	int odd, count;
@@ -85,7 +76,7 @@ static inline unsigned int do_csum(const unsigned char * buff, int len)
 	}
 	if (len & 1)
 		result += le16_to_cpu(*buff);
-	result = from32to16(result);
+	result = csum_from32to16(result);
 	if (odd)
 		result = swab16(result);
 out:
@@ -102,7 +93,7 @@ __wsum csum_partial(const void *buff, int len, __wsum sum)
 {
 	unsigned int result = do_csum(buff, len);
 	addc(result, sum);
-	return (__force __wsum)from32to16(result);
+	return (__force __wsum)csum_from32to16(result);
 }
 
 EXPORT_SYMBOL(csum_partial);
diff --git a/include/net/checksum.h b/include/net/checksum.h
index 1338cb92c8e72..0d082febfead4 100644
--- a/include/net/checksum.h
+++ b/include/net/checksum.h
@@ -151,6 +151,12 @@ static inline void csum_replace(__wsum *csum, __wsum old, __wsum new)
 	*csum = csum_add(csum_sub(*csum, old), new);
 }
 
+static inline __sum16 csum_from32to16(__wsum sum)
+{
+	sum += (sum >> 16) | (sum << 16);
+	return (__force __sum16)(sum >> 16);
+}
+
 struct sk_buff;
 void inet_proto_csum_replace4(__sum16 *sum, struct sk_buff *skb,
 			      __be32 from, __be32 to, bool pseudohdr);
diff --git a/lib/checksum.c b/lib/checksum.c
index 6860d6b05a171..025ba546e1ec6 100644
--- a/lib/checksum.c
+++ b/lib/checksum.c
@@ -34,15 +34,6 @@
 #include <asm/byteorder.h>
 
 #ifndef do_csum
-static inline unsigned short from32to16(unsigned int x)
-{
-	/* add up 16-bit and 16-bit for 16+c bit */
-	x = (x & 0xffff) + (x >> 16);
-	/* add up carry.. */
-	x = (x & 0xffff) + (x >> 16);
-	return x;
-}
-
 static unsigned int do_csum(const unsigned char *buff, int len)
 {
 	int odd;
@@ -90,7 +81,7 @@ static unsigned int do_csum(const unsigned char *buff, int len)
 #else
 		result += (*buff << 8);
 #endif
-	result = from32to16(result);
+	result = csum_from32to16(result);
 	if (odd)
 		result = ((result >> 8) & 0xff) | ((result & 0xff) << 8);
 out:
-- 
2.40.1



* [PATCH bpf-next 2/5] bpf: bpf_csum_diff: optimize and homogenize for all archs
  2024-10-21 12:21 [PATCH bpf-next 0/5] Optimize bpf_csum_diff() and homogenize for all archs Puranjay Mohan
  2024-10-21 12:21 ` [PATCH bpf-next 1/5] net: checksum: move from32to16() to generic header Puranjay Mohan
@ 2024-10-21 12:21 ` Puranjay Mohan
  2024-10-21 13:42   ` Daniel Borkmann
                     ` (2 more replies)
  2024-10-21 12:21 ` [PATCH bpf-next 3/5] selftests/bpf: don't mask result of bpf_csum_diff() in test_verifier Puranjay Mohan
                   ` (2 subsequent siblings)
  4 siblings, 3 replies; 24+ messages in thread
From: Puranjay Mohan @ 2024-10-21 12:21 UTC (permalink / raw)
  To: Albert Ou, Alexei Starovoitov, Andrew Morton, Andrii Nakryiko,
	bpf, Daniel Borkmann, David S. Miller, Eduard Zingerman,
	Eric Dumazet, Hao Luo, Helge Deller, Jakub Kicinski,
	James E.J. Bottomley, Jiri Olsa, John Fastabend, KP Singh,
	linux-kernel, linux-parisc, linux-riscv, Martin KaFai Lau,
	Mykola Lysenko, netdev, Palmer Dabbelt, Paolo Abeni,
	Paul Walmsley, Puranjay Mohan, Puranjay Mohan, Shuah Khan,
	Song Liu, Stanislav Fomichev, Yonghong Song

1. Optimization
   ------------

The current implementation copies the 'from' and 'to' buffers to a
scratchpad, taking the bitwise NOT of the 'from' buffer while copying.
csum_partial() is then called on this scratchpad.

So, mathematically, the current implementation computes:

	result = csum(to - from)

Here, 'to' and '~from' are copied into the scratchpad buffer; the
scratchpad is needed because csum_partial() takes a single contiguous
buffer, not two disjoint buffers like 'to' and 'from'.

We can rewrite this equation as:

	result = csum(to) - csum(from)

using the distributive property of csum().

This allows 'to' and 'from' to be at different locations, so the
scratchpad and the copying are no longer needed.

In C code, this looks like:

result = csum_sub(csum_partial(to, to_size, seed),
                  csum_partial(from, from_size, 0));
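
The identity can be sanity-checked with a small userspace sketch (an
illustrative stand-in for the kernel's csum_partial(), using a plain
one's-complement sum over 16-bit words; all names here are hypothetical):

	#include <stdio.h>
	#include <stdint.h>

	/* fold a 32-bit accumulator into a 16-bit one's-complement sum */
	static uint16_t fold16(uint32_t sum)
	{
		while (sum >> 16)
			sum = (sum & 0xffff) + (sum >> 16);
		return sum;
	}

	static uint16_t csum16(const uint16_t *p, int n, uint32_t seed)
	{
		uint32_t sum = seed;

		while (n--)
			sum += *p++;
		return fold16(sum);
	}

	int main(void)
	{
		uint16_t from[] = { 0x1234, 0xabcd }, to[] = { 0x5678, 0xef01 };
		uint16_t nfrom[] = { 0xedcb, 0x5432 };	/* ~from, word by word */
		uint16_t a, b;

		/* old: checksum of 'to' concatenated with ~'from' */
		a = fold16((uint32_t)csum16(to, 2, 0) + csum16(nfrom, 2, 0));
		/* new: csum(to) - csum(from), subtraction as add-of-complement */
		b = fold16((uint32_t)csum16(to, 2, 0) + (uint16_t)~csum16(from, 2, 0));

		printf("%#x %#x\n", a, b);	/* both print 0x8778 */
		return 0;
	}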

2. Homogenization
   --------------

The bpf_csum_diff() helper calls csum_partial(), which is implemented by
some architectures, like arm and x86, while other architectures rely on
the generic implementation in lib/checksum.c.

The generic implementation in lib/checksum.c returns a 16-bit value, but
the arch-specific implementations can return more than 16 bits. This
works out in most places because, before the result is used, it is
passed through csum_fold(), which turns it into a 16-bit value.

bpf_csum_diff() directly returns the value from csum_partial(), so the
returned values can differ across architectures. See the discussion
in [1]:

For the int value 28, the calculated checksums are:

x86                    :    -29 : 0xffffffe3
generic (arm64, riscv) :  65507 : 0x0000ffe3
arm                    : 131042 : 0x0001ffe2

Pass the result of bpf_csum_diff() through csum_from32to16() before
returning, to homogenize the result for all architectures.

NOTE: csum_from32to16() is used instead of csum_fold() because
csum_fold() does the same folding plus a bitwise NOT of the result,
which is not what we want to do here.
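
As a concrete example, take the x86 partial sum 0xffffffe3 from the table
above (values worked out from the generic definitions of the two
functions):

	csum_from32to16(0xffffffe3) == 0xffe3	/* 65507, matches generic */
	csum_fold(0xffffffe3)       == 0x001c	/* i.e. ~0xffe3           */

Folding alone yields the value the generic implementation already
returns, while csum_fold() would additionally invert it.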

[1] https://lore.kernel.org/bpf/CAJ+HfNiQbOcqCLxFUP2FMm5QrLXUUaj852Fxe3hn_2JNiucn6g@mail.gmail.com/

Signed-off-by: Puranjay Mohan <puranjay@kernel.org>
---
 net/core/filter.c | 37 +++++++++----------------------------
 1 file changed, 9 insertions(+), 28 deletions(-)

diff --git a/net/core/filter.c b/net/core/filter.c
index bd0d08bf76bb8..e00bec7de9edd 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -1654,18 +1654,6 @@ void sk_reuseport_prog_free(struct bpf_prog *prog)
 		bpf_prog_destroy(prog);
 }
 
-struct bpf_scratchpad {
-	union {
-		__be32 diff[MAX_BPF_STACK / sizeof(__be32)];
-		u8     buff[MAX_BPF_STACK];
-	};
-	local_lock_t	bh_lock;
-};
-
-static DEFINE_PER_CPU(struct bpf_scratchpad, bpf_sp) = {
-	.bh_lock	= INIT_LOCAL_LOCK(bh_lock),
-};
-
 static inline int __bpf_try_make_writable(struct sk_buff *skb,
 					  unsigned int write_len)
 {
@@ -2022,11 +2010,6 @@ static const struct bpf_func_proto bpf_l4_csum_replace_proto = {
 BPF_CALL_5(bpf_csum_diff, __be32 *, from, u32, from_size,
 	   __be32 *, to, u32, to_size, __wsum, seed)
 {
-	struct bpf_scratchpad *sp = this_cpu_ptr(&bpf_sp);
-	u32 diff_size = from_size + to_size;
-	int i, j = 0;
-	__wsum ret;
-
 	/* This is quite flexible, some examples:
 	 *
 	 * from_size == 0, to_size > 0,  seed := csum --> pushing data
@@ -2035,19 +2018,17 @@ BPF_CALL_5(bpf_csum_diff, __be32 *, from, u32, from_size,
 	 *
 	 * Even for diffing, from_size and to_size don't need to be equal.
 	 */
-	if (unlikely(((from_size | to_size) & (sizeof(__be32) - 1)) ||
-		     diff_size > sizeof(sp->diff)))
-		return -EINVAL;
 
-	local_lock_nested_bh(&bpf_sp.bh_lock);
-	for (i = 0; i < from_size / sizeof(__be32); i++, j++)
-		sp->diff[j] = ~from[i];
-	for (i = 0; i <   to_size / sizeof(__be32); i++, j++)
-		sp->diff[j] = to[i];
+	if (from_size && to_size)
+		return csum_from32to16(csum_sub(csum_partial(to, to_size, seed),
+						csum_partial(from, from_size, 0)));
+	if (to_size)
+		return csum_from32to16(csum_partial(to, to_size, seed));
 
-	ret = csum_partial(sp->diff, diff_size, seed);
-	local_unlock_nested_bh(&bpf_sp.bh_lock);
-	return ret;
+	if (from_size)
+		return csum_from32to16(~csum_partial(from, from_size, ~seed));
+
+	return seed;
 }
 
 static const struct bpf_func_proto bpf_csum_diff_proto = {
-- 
2.40.1



* [PATCH bpf-next 3/5] selftests/bpf: don't mask result of bpf_csum_diff() in test_verifier
  2024-10-21 12:21 [PATCH bpf-next 0/5] Optimize bpf_csum_diff() and homogenize for all archs Puranjay Mohan
  2024-10-21 12:21 ` [PATCH bpf-next 1/5] net: checksum: move from32to16() to generic header Puranjay Mohan
  2024-10-21 12:21 ` [PATCH bpf-next 2/5] bpf: bpf_csum_diff: optimize and homogenize for all archs Puranjay Mohan
@ 2024-10-21 12:21 ` Puranjay Mohan
  2024-10-21 13:01   ` Helge Deller
                     ` (2 more replies)
  2024-10-21 12:21 ` [PATCH bpf-next 4/5] selftests/bpf: Add benchmark for bpf_csum_diff() helper Puranjay Mohan
  2024-10-21 12:21 ` [PATCH bpf-next 5/5] selftests/bpf: Add a selftest for bpf_csum_diff() Puranjay Mohan
  4 siblings, 3 replies; 24+ messages in thread
From: Puranjay Mohan @ 2024-10-21 12:21 UTC (permalink / raw)
  To: Albert Ou, Alexei Starovoitov, Andrew Morton, Andrii Nakryiko,
	bpf, Daniel Borkmann, David S. Miller, Eduard Zingerman,
	Eric Dumazet, Hao Luo, Helge Deller, Jakub Kicinski,
	James E.J. Bottomley, Jiri Olsa, John Fastabend, KP Singh,
	linux-kernel, linux-parisc, linux-riscv, Martin KaFai Lau,
	Mykola Lysenko, netdev, Palmer Dabbelt, Paolo Abeni,
	Paul Walmsley, Puranjay Mohan, Puranjay Mohan, Shuah Khan,
	Song Liu, Stanislav Fomichev, Yonghong Song

The bpf_csum_diff() helper has been fixed to return a 16-bit value for
all archs, so now we don't need to mask the result.

This commit is basically reverting the below:

commit 6185266c5a85 ("selftests/bpf: Mask bpf_csum_diff() return value
to 16 bits in test_verifier")

Signed-off-by: Puranjay Mohan <puranjay@kernel.org>
---
 tools/testing/selftests/bpf/progs/verifier_array_access.c | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/tools/testing/selftests/bpf/progs/verifier_array_access.c b/tools/testing/selftests/bpf/progs/verifier_array_access.c
index 95d7ecc12963b..4195aa824ba55 100644
--- a/tools/testing/selftests/bpf/progs/verifier_array_access.c
+++ b/tools/testing/selftests/bpf/progs/verifier_array_access.c
@@ -368,8 +368,7 @@ __naked void a_read_only_array_2_1(void)
 	r4 = 0;						\
 	r5 = 0;						\
 	call %[bpf_csum_diff];				\
-l0_%=:	r0 &= 0xffff;					\
-	exit;						\
+l0_%=:	exit;						\
 "	:
 	: __imm(bpf_csum_diff),
 	  __imm(bpf_map_lookup_elem),
-- 
2.40.1



* [PATCH bpf-next 4/5] selftests/bpf: Add benchmark for bpf_csum_diff() helper
  2024-10-21 12:21 [PATCH bpf-next 0/5] Optimize bpf_csum_diff() and homogenize for all archs Puranjay Mohan
                   ` (2 preceding siblings ...)
  2024-10-21 12:21 ` [PATCH bpf-next 3/5] selftests/bpf: don't mask result of bpf_csum_diff() in test_verifier Puranjay Mohan
@ 2024-10-21 12:21 ` Puranjay Mohan
  2024-10-21 13:43   ` Daniel Borkmann
  2024-10-21 23:28   ` Andrii Nakryiko
  2024-10-21 12:21 ` [PATCH bpf-next 5/5] selftests/bpf: Add a selftest for bpf_csum_diff() Puranjay Mohan
  4 siblings, 2 replies; 24+ messages in thread
From: Puranjay Mohan @ 2024-10-21 12:21 UTC (permalink / raw)
  To: Albert Ou, Alexei Starovoitov, Andrew Morton, Andrii Nakryiko,
	bpf, Daniel Borkmann, David S. Miller, Eduard Zingerman,
	Eric Dumazet, Hao Luo, Helge Deller, Jakub Kicinski,
	James E.J. Bottomley, Jiri Olsa, John Fastabend, KP Singh,
	linux-kernel, linux-parisc, linux-riscv, Martin KaFai Lau,
	Mykola Lysenko, netdev, Palmer Dabbelt, Paolo Abeni,
	Paul Walmsley, Puranjay Mohan, Puranjay Mohan, Shuah Khan,
	Song Liu, Stanislav Fomichev, Yonghong Song

Add a microbenchmark for the bpf_csum_diff() helper. The benchmark works
by filling a 4KB buffer with random data and calculating the internet
checksum on different parts of this buffer using bpf_csum_diff().

Example run using ./benchs/run_bench_csum_diff.sh on x86_64:

[bpf]$ ./benchs/run_bench_csum_diff.sh
4                    2.296 ± 0.066M/s (drops 0.000 ± 0.000M/s)
8                    2.320 ± 0.003M/s (drops 0.000 ± 0.000M/s)
16                   2.315 ± 0.001M/s (drops 0.000 ± 0.000M/s)
20                   2.318 ± 0.001M/s (drops 0.000 ± 0.000M/s)
32                   2.308 ± 0.003M/s (drops 0.000 ± 0.000M/s)
40                   2.300 ± 0.029M/s (drops 0.000 ± 0.000M/s)
64                   2.286 ± 0.001M/s (drops 0.000 ± 0.000M/s)
128                  2.250 ± 0.001M/s (drops 0.000 ± 0.000M/s)
256                  2.173 ± 0.001M/s (drops 0.000 ± 0.000M/s)
512                  2.023 ± 0.055M/s (drops 0.000 ± 0.000M/s)
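
A single buffer size can also be benchmarked directly (assuming the bench
binary built by the selftests/bpf Makefile; the --buff-len flag is defined
in bench_csum_diff.c below):

[bpf]$ ./bench --buff-len=512 csum-diff-helper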

Signed-off-by: Puranjay Mohan <puranjay@kernel.org>
---
 tools/testing/selftests/bpf/Makefile          |   2 +
 tools/testing/selftests/bpf/bench.c           |   4 +
 .../selftests/bpf/benchs/bench_csum_diff.c    | 164 ++++++++++++++++++
 .../bpf/benchs/run_bench_csum_diff.sh         |  10 ++
 .../selftests/bpf/progs/csum_diff_bench.c     |  25 +++
 5 files changed, 205 insertions(+)
 create mode 100644 tools/testing/selftests/bpf/benchs/bench_csum_diff.c
 create mode 100755 tools/testing/selftests/bpf/benchs/run_bench_csum_diff.sh
 create mode 100644 tools/testing/selftests/bpf/progs/csum_diff_bench.c

diff --git a/tools/testing/selftests/bpf/Makefile b/tools/testing/selftests/bpf/Makefile
index 28a76baa854d3..a0d86dd453e16 100644
--- a/tools/testing/selftests/bpf/Makefile
+++ b/tools/testing/selftests/bpf/Makefile
@@ -809,6 +809,7 @@ $(OUTPUT)/bench_ringbufs.o: $(OUTPUT)/ringbuf_bench.skel.h \
 $(OUTPUT)/bench_bloom_filter_map.o: $(OUTPUT)/bloom_filter_bench.skel.h
 $(OUTPUT)/bench_bpf_loop.o: $(OUTPUT)/bpf_loop_bench.skel.h
 $(OUTPUT)/bench_strncmp.o: $(OUTPUT)/strncmp_bench.skel.h
+$(OUTPUT)/bench_csum_diff.o: $(OUTPUT)/csum_diff_bench.skel.h
 $(OUTPUT)/bench_bpf_hashmap_full_update.o: $(OUTPUT)/bpf_hashmap_full_update_bench.skel.h
 $(OUTPUT)/bench_local_storage.o: $(OUTPUT)/local_storage_bench.skel.h
 $(OUTPUT)/bench_local_storage_rcu_tasks_trace.o: $(OUTPUT)/local_storage_rcu_tasks_trace_bench.skel.h
@@ -829,6 +830,7 @@ $(OUTPUT)/bench: $(OUTPUT)/bench.o \
 		 $(OUTPUT)/bench_bloom_filter_map.o \
 		 $(OUTPUT)/bench_bpf_loop.o \
 		 $(OUTPUT)/bench_strncmp.o \
+		 $(OUTPUT)/bench_csum_diff.o \
 		 $(OUTPUT)/bench_bpf_hashmap_full_update.o \
 		 $(OUTPUT)/bench_local_storage.o \
 		 $(OUTPUT)/bench_local_storage_rcu_tasks_trace.o \
diff --git a/tools/testing/selftests/bpf/bench.c b/tools/testing/selftests/bpf/bench.c
index 1bd403a5ef7b3..29bd6f4498ebc 100644
--- a/tools/testing/selftests/bpf/bench.c
+++ b/tools/testing/selftests/bpf/bench.c
@@ -278,6 +278,7 @@ extern struct argp bench_bpf_loop_argp;
 extern struct argp bench_local_storage_argp;
 extern struct argp bench_local_storage_rcu_tasks_trace_argp;
 extern struct argp bench_strncmp_argp;
+extern struct argp bench_csum_diff_argp;
 extern struct argp bench_hashmap_lookup_argp;
 extern struct argp bench_local_storage_create_argp;
 extern struct argp bench_htab_mem_argp;
@@ -290,6 +291,7 @@ static const struct argp_child bench_parsers[] = {
 	{ &bench_bpf_loop_argp, 0, "bpf_loop helper benchmark", 0 },
 	{ &bench_local_storage_argp, 0, "local_storage benchmark", 0 },
 	{ &bench_strncmp_argp, 0, "bpf_strncmp helper benchmark", 0 },
+	{ &bench_csum_diff_argp, 0, "bpf_csum_diff helper benchmark", 0 },
 	{ &bench_local_storage_rcu_tasks_trace_argp, 0,
 		"local_storage RCU Tasks Trace slowdown benchmark", 0 },
 	{ &bench_hashmap_lookup_argp, 0, "Hashmap lookup benchmark", 0 },
@@ -539,6 +541,7 @@ extern const struct bench bench_hashmap_with_bloom;
 extern const struct bench bench_bpf_loop;
 extern const struct bench bench_strncmp_no_helper;
 extern const struct bench bench_strncmp_helper;
+extern const struct bench bench_csum_diff;
 extern const struct bench bench_bpf_hashmap_full_update;
 extern const struct bench bench_local_storage_cache_seq_get;
 extern const struct bench bench_local_storage_cache_interleaved_get;
@@ -599,6 +602,7 @@ static const struct bench *benchs[] = {
 	&bench_bpf_loop,
 	&bench_strncmp_no_helper,
 	&bench_strncmp_helper,
+	&bench_csum_diff,
 	&bench_bpf_hashmap_full_update,
 	&bench_local_storage_cache_seq_get,
 	&bench_local_storage_cache_interleaved_get,
diff --git a/tools/testing/selftests/bpf/benchs/bench_csum_diff.c b/tools/testing/selftests/bpf/benchs/bench_csum_diff.c
new file mode 100644
index 0000000000000..2c30c8b54d9bc
--- /dev/null
+++ b/tools/testing/selftests/bpf/benchs/bench_csum_diff.c
@@ -0,0 +1,164 @@
+// SPDX-License-Identifier: GPL-2.0
+/* Copyright Amazon.com Inc. or its affiliates */
+#include <argp.h>
+#include "bench.h"
+#include "csum_diff_bench.skel.h"
+
+static struct csum_diff_ctx {
+	struct csum_diff_bench *skel;
+	int pfd;
+} ctx;
+
+static struct csum_diff_args {
+	u32 buff_len;
+} args = {
+	.buff_len = 32,
+};
+
+enum {
+	ARG_BUFF_LEN = 5000,
+};
+
+static const struct argp_option opts[] = {
+	{ "buff-len", ARG_BUFF_LEN, "BUFF_LEN", 0,
+	  "Set the length of the buffer" },
+	{},
+};
+
+static error_t csum_diff_parse_arg(int key, char *arg, struct argp_state *state)
+{
+	switch (key) {
+	case ARG_BUFF_LEN:
+		args.buff_len = strtoul(arg, NULL, 10);
+		if (!args.buff_len ||
+		    args.buff_len >= sizeof(ctx.skel->rodata->buff)) {
+			fprintf(stderr, "Invalid buff len (limit %zu)\n",
+				sizeof(ctx.skel->rodata->buff));
+			argp_usage(state);
+		}
+		break;
+	default:
+		return ARGP_ERR_UNKNOWN;
+	}
+
+	return 0;
+}
+
+const struct argp bench_csum_diff_argp = {
+	.options = opts,
+	.parser = csum_diff_parse_arg,
+};
+
+static void csum_diff_validate(void)
+{
+	if (env.consumer_cnt != 0) {
+		fprintf(stderr, "csum_diff benchmark doesn't support consumer!\n");
+		exit(1);
+	}
+}
+
+static void csum_diff_setup(void)
+{
+	int err;
+	char *buff;
+	size_t i, sz;
+
+	sz = sizeof(ctx.skel->rodata->buff);
+
+	setup_libbpf();
+
+	ctx.skel = csum_diff_bench__open();
+	if (!ctx.skel) {
+		fprintf(stderr, "failed to open skeleton\n");
+		exit(1);
+	}
+
+	srandom(time(NULL));
+	buff = ctx.skel->rodata->buff;
+
+	/*
+	 * Set the first 8 bytes of the buffer to 0xdeadbeefdeadbeef; this is later used to
+	 * verify the correctness of the helper by checking that the checksum result for
+	 * 0xdeadbeefdeadbeef is 0x3b3b.
+	 */
+
+	*(u64 *)buff = 0xdeadbeefdeadbeef;
+
+	for (i = 8; i < sz; i++)
+		buff[i] = '1' + random() % 9;
+
+	ctx.skel->rodata->buff_len = args.buff_len;
+
+	err = csum_diff_bench__load(ctx.skel);
+	if (err) {
+		fprintf(stderr, "failed to load skeleton\n");
+		csum_diff_bench__destroy(ctx.skel);
+		exit(1);
+	}
+}
+
+static void csum_diff_helper_setup(void)
+{
+	u8 tmp_out[64 << 2] = {};
+	u8 tmp_in[64] = {};
+	int err, saved_errno;
+
+	LIBBPF_OPTS(bpf_test_run_opts, topts,
+		.data_in = tmp_in,
+		.data_size_in = sizeof(tmp_in),
+		.data_out = tmp_out,
+		.data_size_out = sizeof(tmp_out),
+		.repeat = 1,
+	);
+	csum_diff_setup();
+	ctx.pfd = bpf_program__fd(ctx.skel->progs.compute_checksum);
+
+	err = bpf_prog_test_run_opts(ctx.pfd, &topts);
+	saved_errno = errno;
+
+	if (err) {
+		fprintf(stderr, "failed to run setup prog: err %d, result %d, serror %d\n",
+			err, ctx.skel->bss->result, saved_errno);
+		csum_diff_bench__destroy(ctx.skel);
+		exit(1);
+	}
+
+	/* Sanity check for correctness of helper */
+	if (args.buff_len == 8 && ctx.skel->bss->result != 0x3b3b) {
+		fprintf(stderr, "csum_diff helper broken: buff: %lx, result: %x, expected: %x\n",
+			*(u64 *)ctx.skel->rodata->buff, ctx.skel->bss->result, 0x3b3b);
+	}
+}
+
+static void *csum_diff_producer(void *unused)
+{
+	u8 tmp_out[64 << 2] = {};
+	u8 tmp_in[64] = {};
+
+	LIBBPF_OPTS(bpf_test_run_opts, topts,
+		.data_in = tmp_in,
+		.data_size_in = sizeof(tmp_in),
+		.data_out = tmp_out,
+		.data_size_out = sizeof(tmp_out),
+		.repeat = 64,
+	);
+	while (true)
+		(void)bpf_prog_test_run_opts(ctx.pfd, &topts);
+	return NULL;
+}
+
+static void csum_diff_measure(struct bench_res *res)
+{
+	res->hits = atomic_swap(&ctx.skel->bss->hits, 0);
+}
+
+const struct bench bench_csum_diff = {
+	.name = "csum-diff-helper",
+	.argp = &bench_csum_diff_argp,
+	.validate = csum_diff_validate,
+	.setup = csum_diff_helper_setup,
+	.producer_thread = csum_diff_producer,
+	.measure = csum_diff_measure,
+	.report_progress = hits_drops_report_progress,
+	.report_final = hits_drops_report_final,
+};
diff --git a/tools/testing/selftests/bpf/benchs/run_bench_csum_diff.sh b/tools/testing/selftests/bpf/benchs/run_bench_csum_diff.sh
new file mode 100755
index 0000000000000..c4e147fbf2f98
--- /dev/null
+++ b/tools/testing/selftests/bpf/benchs/run_bench_csum_diff.sh
@@ -0,0 +1,10 @@
+#!/bin/bash
+# SPDX-License-Identifier: GPL-2.0
+
+source ./benchs/run_common.sh
+
+set -eufo pipefail
+
+for s in 4 8 16 20 32 40 64 128 256 512; do
+	summarize ${s} "$($RUN_BENCH --buff-len=$s csum-diff-helper)"
+done
diff --git a/tools/testing/selftests/bpf/progs/csum_diff_bench.c b/tools/testing/selftests/bpf/progs/csum_diff_bench.c
new file mode 100644
index 0000000000000..85245edd6f9dc
--- /dev/null
+++ b/tools/testing/selftests/bpf/progs/csum_diff_bench.c
@@ -0,0 +1,25 @@
+// SPDX-License-Identifier: GPL-2.0
+/* Copyright Amazon.com Inc. or its affiliates */
+#include <linux/types.h>
+#include <linux/bpf.h>
+#include <bpf/bpf_helpers.h>
+#include <bpf/bpf_tracing.h>
+
+#define BUFF_SZ 4096
+
+/* Will be updated by benchmark before program loading */
+const char buff[BUFF_SZ];
+const volatile unsigned int buff_len = 4;
+
+long hits = 0;
+short result;
+
+char _license[] SEC("license") = "GPL";
+
+SEC("tc")
+int compute_checksum(void *ctx)
+{
+	result = bpf_csum_diff(0, 0, (void *)buff, buff_len, 0);
+	__sync_add_and_fetch(&hits, 1);
+	return 0;
+}
-- 
2.40.1



* [PATCH bpf-next 5/5] selftests/bpf: Add a selftest for bpf_csum_diff()
  2024-10-21 12:21 [PATCH bpf-next 0/5] Optimize bpf_csum_diff() and homogenize for all archs Puranjay Mohan
                   ` (3 preceding siblings ...)
  2024-10-21 12:21 ` [PATCH bpf-next 4/5] selftests/bpf: Add benchmark for bpf_csum_diff() helper Puranjay Mohan
@ 2024-10-21 12:21 ` Puranjay Mohan
  2024-10-21 13:44   ` Daniel Borkmann
  4 siblings, 1 reply; 24+ messages in thread
From: Puranjay Mohan @ 2024-10-21 12:21 UTC (permalink / raw)
  To: Albert Ou, Alexei Starovoitov, Andrew Morton, Andrii Nakryiko,
	bpf, Daniel Borkmann, David S. Miller, Eduard Zingerman,
	Eric Dumazet, Hao Luo, Helge Deller, Jakub Kicinski,
	James E.J. Bottomley, Jiri Olsa, John Fastabend, KP Singh,
	linux-kernel, linux-parisc, linux-riscv, Martin KaFai Lau,
	Mykola Lysenko, netdev, Palmer Dabbelt, Paolo Abeni,
	Paul Walmsley, Puranjay Mohan, Puranjay Mohan, Shuah Khan,
	Song Liu, Stanislav Fomichev, Yonghong Song

Add a selftest for the bpf_csum_diff() helper. This selftest runs the
helper in all three configurations (push, pull, and diff) and verifies
its output. The expected results were computed by hand and with the
helper's older implementation.
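
For reference, the three modes map to the following call shapes (a sketch
based on the helper's documented semantics; buffer names are
illustrative):

	csum = bpf_csum_diff(NULL, 0, to, to_len, seed);	/* push */
	csum = bpf_csum_diff(from, from_len, NULL, 0, seed);	/* pull */
	csum = bpf_csum_diff(from, from_len, to, to_len, 0);	/* diff */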

Signed-off-by: Puranjay Mohan <puranjay@kernel.org>
---
 .../selftests/bpf/prog_tests/test_csum_diff.c | 408 ++++++++++++++++++
 .../selftests/bpf/progs/csum_diff_test.c      |  42 ++
 2 files changed, 450 insertions(+)
 create mode 100644 tools/testing/selftests/bpf/prog_tests/test_csum_diff.c
 create mode 100644 tools/testing/selftests/bpf/progs/csum_diff_test.c

diff --git a/tools/testing/selftests/bpf/prog_tests/test_csum_diff.c b/tools/testing/selftests/bpf/prog_tests/test_csum_diff.c
new file mode 100644
index 0000000000000..107b20d43e839
--- /dev/null
+++ b/tools/testing/selftests/bpf/prog_tests/test_csum_diff.c
@@ -0,0 +1,408 @@
+// SPDX-License-Identifier: GPL-2.0
+/* Copyright Amazon.com Inc. or its affiliates */
+#include <test_progs.h>
+#include "csum_diff_test.skel.h"
+
+#define BUFF_SZ 512
+
+struct testcase {
+	unsigned long long to_buff[BUFF_SZ / 8];
+	unsigned int to_buff_len;
+	unsigned long long from_buff[BUFF_SZ / 8];
+	unsigned int from_buff_len;
+	unsigned short seed;
+	unsigned short result;
+};
+
+#define NUM_PUSH_TESTS 4
+
+struct testcase push_tests[NUM_PUSH_TESTS] = {
+	{
+		.to_buff = {
+			0xdeadbeefdeadbeef,
+		},
+		.to_buff_len = 8,
+		.from_buff = {},
+		.from_buff_len = 0,
+		.seed = 0,
+		.result = 0x3b3b
+	},
+	{
+		.to_buff = {
+			0xdeadbeefdeadbeef,
+			0xbeefdeadbeefdead,
+		},
+		.to_buff_len = 16,
+		.from_buff = {},
+		.from_buff_len = 0,
+		.seed = 0x1234,
+		.result = 0x88aa
+	},
+	{
+		.to_buff = {
+			0xdeadbeefdeadbeef,
+			0xbeefdeadbeefdead,
+		},
+		.to_buff_len = 15,
+		.from_buff = {},
+		.from_buff_len = 0,
+		.seed = 0x1234,
+#if __BYTE_ORDER__ == __ORDER_LITTLE_ENDIAN__
+		.result = 0xcaa9
+#else
+		.result = 0x87fd
+#endif
+	},
+	{
+		.to_buff = {
+			0x327b23c66b8b4567,
+			0x66334873643c9869,
+			0x19495cff74b0dc51,
+			0x625558ec2ae8944a,
+			0x46e87ccd238e1f29,
+			0x507ed7ab3d1b58ba,
+			0x41b71efb2eb141f2,
+			0x7545e14679e2a9e3,
+			0x5bd062c2515f007c,
+			0x4db127f812200854,
+			0x1f16e9e80216231b,
+			0x66ef438d1190cde7,
+			0x3352255a140e0f76,
+			0x0ded7263109cf92e,
+			0x1befd79f7fdcc233,
+			0x6b68079a41a7c4c9,
+			0x25e45d324e6afb66,
+			0x431bd7b7519b500d,
+			0x7c83e4583f2dba31,
+			0x62bbd95a257130a3,
+			0x628c895d436c6125,
+			0x721da317333ab105,
+			0x2d1d5ae92443a858,
+			0x75a2a8d46763845e,
+			0x79838cb208edbdab,
+			0x0b03e0c64353d0cd,
+			0x54e49eb4189a769b,
+			0x2ca8861171f32454,
+			0x02901d820836c40e,
+			0x081386413a95f874,
+			0x7c3dbd3d1e7ff521,
+			0x6ceaf087737b8ddc,
+			0x4516dde922221a70,
+			0x614fd4a13006c83e,
+			0x5577f8e1419ac241,
+			0x05072367440badfc,
+			0x77465f013804823e,
+			0x5c482a977724c67e,
+			0x5e884adc2463b9ea,
+			0x2d51779651ead36b,
+			0x153ea438580bd78f,
+			0x70a64e2a3855585c,
+			0x2a487cb06a2342ec,
+			0x725a06fb1d4ed43b,
+			0x57e4ccaf2cd89a32,
+			0x4b588f547a6d8d3c,
+			0x6de91b18542289ec,
+			0x7644a45c38437fdb,
+			0x684a481a32fff902,
+			0x749abb43579478fe,
+			0x1ba026fa3dc240fb,
+			0x75c6c33a79a1deaa,
+			0x70c6a52912e685fb,
+			0x374a3fe6520eedd1,
+			0x23f9c13c4f4ef005,
+			0x275ac794649bb77c,
+			0x1cf10fd839386575,
+			0x235ba861180115be,
+			0x354fe9f947398c89,
+			0x741226bb15b5af5c,
+			0x10233c990d34b6a8,
+			0x615740953f6ab60f,
+			0x77ae35eb7e0c57b1,
+			0x310c50b3579be4f1,
+		},
+		.to_buff_len = 512,
+		.from_buff = {},
+		.from_buff_len = 0,
+		.seed = 0xffff,
+		.result = 0xca45
+	},
+};
+
+#define NUM_PULL_TESTS 4
+
+struct testcase pull_tests[NUM_PULL_TESTS] = {
+	{
+		.from_buff = {
+			0xdeadbeefdeadbeef,
+		},
+		.from_buff_len = 8,
+		.to_buff = {},
+		.to_buff_len = 0,
+		.seed = 0,
+		.result = 0xc4c4
+	},
+	{
+		.from_buff = {
+			0xdeadbeefdeadbeef,
+			0xbeefdeadbeefdead,
+		},
+		.from_buff_len = 16,
+		.to_buff = {},
+		.to_buff_len = 0,
+		.seed = 0x1234,
+		.result = 0x9bbd
+	},
+	{
+		.from_buff = {
+			0xdeadbeefdeadbeef,
+			0xbeefdeadbeefdead,
+		},
+		.from_buff_len = 15,
+		.to_buff = {},
+		.to_buff_len = 0,
+		.seed = 0x1234,
+#if __BYTE_ORDER__ == __ORDER_LITTLE_ENDIAN__
+		.result = 0x59be
+#else
+		.result = 0x9c6a
+#endif
+	},
+	{
+		.from_buff = {
+			0x327b23c66b8b4567,
+			0x66334873643c9869,
+			0x19495cff74b0dc51,
+			0x625558ec2ae8944a,
+			0x46e87ccd238e1f29,
+			0x507ed7ab3d1b58ba,
+			0x41b71efb2eb141f2,
+			0x7545e14679e2a9e3,
+			0x5bd062c2515f007c,
+			0x4db127f812200854,
+			0x1f16e9e80216231b,
+			0x66ef438d1190cde7,
+			0x3352255a140e0f76,
+			0x0ded7263109cf92e,
+			0x1befd79f7fdcc233,
+			0x6b68079a41a7c4c9,
+			0x25e45d324e6afb66,
+			0x431bd7b7519b500d,
+			0x7c83e4583f2dba31,
+			0x62bbd95a257130a3,
+			0x628c895d436c6125,
+			0x721da317333ab105,
+			0x2d1d5ae92443a858,
+			0x75a2a8d46763845e,
+			0x79838cb208edbdab,
+			0x0b03e0c64353d0cd,
+			0x54e49eb4189a769b,
+			0x2ca8861171f32454,
+			0x02901d820836c40e,
+			0x081386413a95f874,
+			0x7c3dbd3d1e7ff521,
+			0x6ceaf087737b8ddc,
+			0x4516dde922221a70,
+			0x614fd4a13006c83e,
+			0x5577f8e1419ac241,
+			0x05072367440badfc,
+			0x77465f013804823e,
+			0x5c482a977724c67e,
+			0x5e884adc2463b9ea,
+			0x2d51779651ead36b,
+			0x153ea438580bd78f,
+			0x70a64e2a3855585c,
+			0x2a487cb06a2342ec,
+			0x725a06fb1d4ed43b,
+			0x57e4ccaf2cd89a32,
+			0x4b588f547a6d8d3c,
+			0x6de91b18542289ec,
+			0x7644a45c38437fdb,
+			0x684a481a32fff902,
+			0x749abb43579478fe,
+			0x1ba026fa3dc240fb,
+			0x75c6c33a79a1deaa,
+			0x70c6a52912e685fb,
+			0x374a3fe6520eedd1,
+			0x23f9c13c4f4ef005,
+			0x275ac794649bb77c,
+			0x1cf10fd839386575,
+			0x235ba861180115be,
+			0x354fe9f947398c89,
+			0x741226bb15b5af5c,
+			0x10233c990d34b6a8,
+			0x615740953f6ab60f,
+			0x77ae35eb7e0c57b1,
+			0x310c50b3579be4f1,
+		},
+		.from_buff_len = 512,
+		.to_buff = {},
+		.to_buff_len = 0,
+		.seed = 0xffff,
+		.result = 0x35ba
+	},
+};
+
+#define NUM_DIFF_TESTS 4
+
+struct testcase diff_tests[NUM_DIFF_TESTS] = {
+	{
+		.from_buff = {
+			0xdeadbeefdeadbeef,
+		},
+		.from_buff_len = 8,
+		.to_buff = {
+			0xabababababababab,
+		},
+		.to_buff_len = 8,
+		.seed = 0,
+		.result = 0x7373
+	},
+	{
+		.from_buff = {
+			0xdeadbeefdeadbeef,
+		},
+		.from_buff_len = 7,
+		.to_buff = {
+			0xabababababababab,
+		},
+		.to_buff_len = 7,
+		.seed = 0,
+#if __BYTE_ORDER__ == __ORDER_LITTLE_ENDIAN__
+		.result = 0xa673
+#else
+		.result = 0x73b7
+#endif
+	},
+	{
+		.from_buff = {
+			0,
+		},
+		.from_buff_len = 8,
+		.to_buff = {
+			0xabababababababab,
+		},
+		.to_buff_len = 8,
+		.seed = 0,
+		.result = 0xaeae
+	},
+	{
+		.from_buff = {
+			0xdeadbeefdeadbeef
+		},
+		.from_buff_len = 8,
+		.to_buff = {
+			0,
+		},
+		.to_buff_len = 8,
+		.seed = 0xffff,
+		.result = 0xc4c4
+	},
+};
+
+#define NUM_EDGE_TESTS 4
+
+struct testcase edge_tests[NUM_EDGE_TESTS] = {
+	{
+		.from_buff = {},
+		.from_buff_len = 0,
+		.to_buff = {},
+		.to_buff_len = 0,
+		.seed = 0,
+		.result = 0
+	},
+	{
+		.from_buff = {
+			0x1234
+		},
+		.from_buff_len = 0,
+		.to_buff = {
+			0x1234
+		},
+		.to_buff_len = 0,
+		.seed = 0,
+		.result = 0
+	},
+	{
+		.from_buff = {},
+		.from_buff_len = 0,
+		.to_buff = {},
+		.to_buff_len = 0,
+		.seed = 0x1234,
+		.result = 0x1234
+	},
+	{
+		.from_buff = {},
+		.from_buff_len = 512,
+		.to_buff = {},
+		.to_buff_len = 0,
+		.seed = 0xffff,
+		.result = 0xffff
+	},
+};
+
+static unsigned short trigger_csum_diff(const struct csum_diff_test *skel)
+{
+	u8 tmp_out[64 << 2] = {};
+	u8 tmp_in[64] = {};
+	int err;
+	int pfd;
+
+	LIBBPF_OPTS(bpf_test_run_opts, topts,
+		.data_in = tmp_in,
+		.data_size_in = sizeof(tmp_in),
+		.data_out = tmp_out,
+		.data_size_out = sizeof(tmp_out),
+		.repeat = 1,
+	);
+	pfd = bpf_program__fd(skel->progs.compute_checksum);
+	err = bpf_prog_test_run_opts(pfd, &topts);
+	if (err)
+		return -1;
+
+	return skel->bss->result;
+}
+
+static void test_csum_diff(struct testcase *tests, int num_tests)
+{
+	struct csum_diff_test *skel;
+	unsigned short got;
+	int err;
+
+	for (int i = 0; i < num_tests; i++) {
+		skel = csum_diff_test__open();
+		if (!ASSERT_OK_PTR(skel, "csum_diff_test open"))
+			return;
+
+		skel->rodata->to_buff_len = tests[i].to_buff_len;
+		skel->rodata->from_buff_len = tests[i].from_buff_len;
+
+		err = csum_diff_test__load(skel);
+		if (!ASSERT_EQ(err, 0, "csum_diff_test load"))
+			goto out;
+
+		memcpy(skel->bss->to_buff, tests[i].to_buff, tests[i].to_buff_len);
+		memcpy(skel->bss->from_buff, tests[i].from_buff, tests[i].from_buff_len);
+		skel->bss->seed = tests[i].seed;
+
+		got = trigger_csum_diff(skel);
+		ASSERT_EQ(got, tests[i].result, "csum_diff result");
+
+		csum_diff_test__destroy(skel);
+	}
+
+	return;
+out:
+	csum_diff_test__destroy(skel);
+}
+
+void test_test_csum_diff(void)
+{
+	if (test__start_subtest("csum_diff_push"))
+		test_csum_diff(push_tests, NUM_PUSH_TESTS);
+	if (test__start_subtest("csum_diff_pull"))
+		test_csum_diff(pull_tests, NUM_PULL_TESTS);
+	if (test__start_subtest("csum_diff_diff"))
+		test_csum_diff(diff_tests, NUM_DIFF_TESTS);
+	if (test__start_subtest("csum_diff_edge"))
+		test_csum_diff(edge_tests, NUM_EDGE_TESTS);
+}
diff --git a/tools/testing/selftests/bpf/progs/csum_diff_test.c b/tools/testing/selftests/bpf/progs/csum_diff_test.c
new file mode 100644
index 0000000000000..9438f1773a589
--- /dev/null
+++ b/tools/testing/selftests/bpf/progs/csum_diff_test.c
@@ -0,0 +1,42 @@
+// SPDX-License-Identifier: GPL-2.0
+/* Copyright Amazon.com Inc. or its affiliates */
+#include <linux/types.h>
+#include <linux/bpf.h>
+#include <bpf/bpf_helpers.h>
+#include <bpf/bpf_tracing.h>
+
+#define BUFF_SZ 512
+
+/* Will be updated by the test before program loading */
+char to_buff[BUFF_SZ];
+const volatile unsigned int to_buff_len = 0;
+char from_buff[BUFF_SZ];
+const volatile unsigned int from_buff_len = 0;
+unsigned short seed = 0;
+
+short result;
+
+char _license[] SEC("license") = "GPL";
+
+SEC("tc")
+int compute_checksum(void *ctx)
+{
+	int to_len_half = to_buff_len / 2;
+	int from_len_half = from_buff_len / 2;
+	short result2;
+
+	/* Calculate checksum in one go */
+	result2 = bpf_csum_diff((void *)from_buff, from_buff_len,
+				(void *)to_buff, to_buff_len, seed);
+
+	/* Calculate the checksum by chaining two bpf_csum_diff() calls */
+	result = bpf_csum_diff((void *)from_buff, from_buff_len - from_len_half,
+			       (void *)to_buff, to_buff_len - to_len_half, seed);
+
+	result = bpf_csum_diff((void *)from_buff + (from_buff_len - from_len_half), from_len_half,
+			       (void *)to_buff + (to_buff_len - to_len_half), to_len_half, result);
+
+	result = (result == result2) ? result : 0;
+
+	return 0;
+}
-- 
2.40.1



* Re: [PATCH bpf-next 3/5] selftests/bpf: don't mask result of bpf_csum_diff() in test_verifier
  2024-10-21 12:21 ` [PATCH bpf-next 3/5] selftests/bpf: don't mask result of bpf_csum_diff() in test_verifier Puranjay Mohan
@ 2024-10-21 13:01   ` Helge Deller
  2024-10-21 13:14     ` Puranjay Mohan
  2024-10-21 13:42   ` Daniel Borkmann
  2024-10-22  9:55   ` Toke Høiland-Jørgensen
  2 siblings, 1 reply; 24+ messages in thread
From: Helge Deller @ 2024-10-21 13:01 UTC (permalink / raw)
  To: Puranjay Mohan, Albert Ou, Alexei Starovoitov, Andrew Morton,
	Andrii Nakryiko, bpf, Daniel Borkmann, David S. Miller,
	Eduard Zingerman, Eric Dumazet, Hao Luo, Jakub Kicinski,
	James E.J. Bottomley, Jiri Olsa, John Fastabend, KP Singh,
	linux-kernel, linux-parisc, linux-riscv, Martin KaFai Lau,
	Mykola Lysenko, netdev, Palmer Dabbelt, Paolo Abeni,
	Paul Walmsley, Puranjay Mohan, Shuah Khan, Song Liu,
	Stanislav Fomichev, Yonghong Song

On 10/21/24 14:21, Puranjay Mohan wrote:
> The bpf_csum_diff() helper has been fixed to return a 16-bit value for
> all archs, so now we don't need to mask the result.
>
> ...
> --- a/tools/testing/selftests/bpf/progs/verifier_array_access.c
> +++ b/tools/testing/selftests/bpf/progs/verifier_array_access.c
> @@ -368,8 +368,7 @@ __naked void a_read_only_array_2_1(void)
>   	r4 = 0;						\
>   	r5 = 0;						\
>   	call %[bpf_csum_diff];				\
> -l0_%=:	r0 &= 0xffff;					\
> -	exit;						\
> +l0_%=:	exit;						\

Instead of dropping the masking, would it make sense to
check here if (r0 >> 16) == 0 ?

Helge


* Re: [PATCH bpf-next 3/5] selftests/bpf: don't mask result of bpf_csum_diff() in test_verifier
  2024-10-21 13:01   ` Helge Deller
@ 2024-10-21 13:14     ` Puranjay Mohan
  2024-10-21 14:04       ` Helge Deller
  0 siblings, 1 reply; 24+ messages in thread
From: Puranjay Mohan @ 2024-10-21 13:14 UTC (permalink / raw)
  To: Helge Deller, Albert Ou, Alexei Starovoitov, Andrew Morton,
	Andrii Nakryiko, bpf, Daniel Borkmann, David S. Miller,
	Eduard Zingerman, Eric Dumazet, Hao Luo, Jakub Kicinski,
	James E.J. Bottomley, Jiri Olsa, John Fastabend, KP Singh,
	linux-kernel, linux-parisc, linux-riscv, Martin KaFai Lau,
	Mykola Lysenko, netdev, Palmer Dabbelt, Paolo Abeni,
	Paul Walmsley, Shuah Khan, Song Liu, Stanislav Fomichev,
	Yonghong Song


Helge Deller <deller@gmx.de> writes:

> On 10/21/24 14:21, Puranjay Mohan wrote:
>> The bpf_csum_diff() helper has been fixed to return a 16-bit value for
>> all archs, so now we don't need to mask the result.
>>
>> ...
>> --- a/tools/testing/selftests/bpf/progs/verifier_array_access.c
>> +++ b/tools/testing/selftests/bpf/progs/verifier_array_access.c
>> @@ -368,8 +368,7 @@ __naked void a_read_only_array_2_1(void)
>>   	r4 = 0;						\
>>   	r5 = 0;						\
>>   	call %[bpf_csum_diff];				\
>> -l0_%=:	r0 &= 0xffff;					\
>> -	exit;						\
>> +l0_%=:	exit;						\
>
> Instead of dropping the masking, would it make sense to
> check here if (r0 >> 16) == 0 ?

We define the expected value in R0 to be 65507 (0xffe3) in the line at the top:
__success __retval(65507)

So we should not do anything to R0; it should contain this value after
returning from bpf_csum_diff().

This masking hack was added in:

6185266c5a853 ("selftests/bpf: Mask bpf_csum_diff() return value to 16 bits in test_verifier")

because without the fix in patch 2 bpf_csum_diff() would return the
following for this test:

x86                    :    -29 : 0xffffffe3
generic (arm64, riscv) :  65507 : 0x0000ffe3


Thanks,
Puranjay



* Re: [PATCH bpf-next 1/5] net: checksum: move from32to16() to generic header
  2024-10-21 12:21 ` [PATCH bpf-next 1/5] net: checksum: move from32to16() to generic header Puranjay Mohan
@ 2024-10-21 13:41   ` Daniel Borkmann
  2024-10-22  9:49   ` Toke Høiland-Jørgensen
  2024-10-22 13:50   ` kernel test robot
  2 siblings, 0 replies; 24+ messages in thread
From: Daniel Borkmann @ 2024-10-21 13:41 UTC (permalink / raw)
  To: Puranjay Mohan, Albert Ou, Alexei Starovoitov, Andrew Morton,
	Andrii Nakryiko, bpf, David S. Miller, Eduard Zingerman,
	Eric Dumazet, Hao Luo, Helge Deller, Jakub Kicinski,
	James E.J. Bottomley, Jiri Olsa, John Fastabend, KP Singh,
	linux-kernel, linux-parisc, linux-riscv, Martin KaFai Lau,
	Mykola Lysenko, netdev, Palmer Dabbelt, Paolo Abeni,
	Paul Walmsley, Puranjay Mohan, Shuah Khan, Song Liu,
	Stanislav Fomichev, Yonghong Song

On 10/21/24 2:21 PM, Puranjay Mohan wrote:
> from32to16() is used by lib/checksum.c and also by
> arch/parisc/lib/checksum.c. The next patch will use it in the
> bpf_csum_diff helper.
> 
> Move from32to16() to the include/net/checksum.h as csum_from32to16() and
> remove other implementations.
> 
> Signed-off-by: Puranjay Mohan <puranjay@kernel.org>

Acked-by: Daniel Borkmann <daniel@iogearbox.net>


* Re: [PATCH bpf-next 2/5] bpf: bpf_csum_diff: optimize and homogenize for all archs
  2024-10-21 12:21 ` [PATCH bpf-next 2/5] bpf: bpf_csum_diff: optimize and homogenize for all archs Puranjay Mohan
@ 2024-10-21 13:42   ` Daniel Borkmann
  2024-10-22  9:54   ` Toke Høiland-Jørgensen
  2024-10-22 18:09   ` kernel test robot
  2 siblings, 0 replies; 24+ messages in thread
From: Daniel Borkmann @ 2024-10-21 13:42 UTC (permalink / raw)
  To: Puranjay Mohan, Albert Ou, Alexei Starovoitov, Andrew Morton,
	Andrii Nakryiko, bpf, David S. Miller, Eduard Zingerman,
	Eric Dumazet, Hao Luo, Helge Deller, Jakub Kicinski,
	James E.J. Bottomley, Jiri Olsa, John Fastabend, KP Singh,
	linux-kernel, linux-parisc, linux-riscv, Martin KaFai Lau,
	Mykola Lysenko, netdev, Palmer Dabbelt, Paolo Abeni,
	Paul Walmsley, Puranjay Mohan, Shuah Khan, Song Liu,
	Stanislav Fomichev, Yonghong Song

On 10/21/24 2:21 PM, Puranjay Mohan wrote:
> 1. Optimization
>     ------------
> 
> The current implementation copies the 'from' and 'to' buffers to a
> scratchpad and it takes the bitwise NOT of 'from' buffer while copying.
> In the next step csum_partial() is called with this scratchpad.
> 
> so, mathematically, the current implementation is doing:
> 
> 	result = csum(to - from)
> 
> Here, 'to'  and '~ from' are copied in to the scratchpad buffer, we need
> it in the scratchpad buffer because csum_partial() takes a single
> contiguous buffer and not two disjoint buffers like 'to' and 'from'.
> 
> We can re write this equation to:
> 
> 	result = csum(to) - csum(from)
> 
> using the distributive property of csum().
> 
> this allows 'to' and 'from' to be at different locations and therefore
> this scratchpad and copying is not needed.
> 
> This in C code will look like:
> 
> result = csum_sub(csum_partial(to, to_size, seed),
>                    csum_partial(from, from_size, 0));
> 
> 2. Homogenization
>     --------------
> 
> The bpf_csum_diff() helper calls csum_partial() which is implemented by
> some architectures like arm and x86 but other architectures rely on the
> generic implementation in lib/checksum.c
> 
> The generic implementation in lib/checksum.c returns a 16 bit value but
> the arch specific implementations can return more than 16 bits, this
> works out in most places because before the result is used, it is passed
> through csum_fold() that turns it into a 16-bit value.
> 
> bpf_csum_diff() directly returns the value from csum_partial() and
> therefore the returned values could be different on different
> architectures. see discussion in [1]:
> 
> for the int value 28 the calculated checksums are:
> 
> x86                    :    -29 : 0xffffffe3
> generic (arm64, riscv) :  65507 : 0x0000ffe3
> arm                    : 131042 : 0x0001ffe2
> 
> Pass the result of bpf_csum_diff() through from32to16() before returning
> to homogenize this result for all architectures.
> 
> NOTE: from32to16() is used instead of csum_fold() because csum_fold()
> does from32to16() + bitwise NOT of the result, which is not what we want
> to do here.
> 
> [1] https://lore.kernel.org/bpf/CAJ+HfNiQbOcqCLxFUP2FMm5QrLXUUaj852Fxe3hn_2JNiucn6g@mail.gmail.com/
> 
> Signed-off-by: Puranjay Mohan <puranjay@kernel.org>

Thanks for looking into this!

Acked-by: Daniel Borkmann <daniel@iogearbox.net>


* Re: [PATCH bpf-next 3/5] selftests/bpf: don't mask result of bpf_csum_diff() in test_verifier
  2024-10-21 12:21 ` [PATCH bpf-next 3/5] selftests/bpf: don't mask result of bpf_csum_diff() in test_verifier Puranjay Mohan
  2024-10-21 13:01   ` Helge Deller
@ 2024-10-21 13:42   ` Daniel Borkmann
  2024-10-22  9:55   ` Toke Høiland-Jørgensen
  2 siblings, 0 replies; 24+ messages in thread
From: Daniel Borkmann @ 2024-10-21 13:42 UTC (permalink / raw)
  To: Puranjay Mohan, Albert Ou, Alexei Starovoitov, Andrew Morton,
	Andrii Nakryiko, bpf, David S. Miller, Eduard Zingerman,
	Eric Dumazet, Hao Luo, Helge Deller, Jakub Kicinski,
	James E.J. Bottomley, Jiri Olsa, John Fastabend, KP Singh,
	linux-kernel, linux-parisc, linux-riscv, Martin KaFai Lau,
	Mykola Lysenko, netdev, Palmer Dabbelt, Paolo Abeni,
	Paul Walmsley, Puranjay Mohan, Shuah Khan, Song Liu,
	Stanislav Fomichev, Yonghong Song

On 10/21/24 2:21 PM, Puranjay Mohan wrote:
> The bpf_csum_diff() helper has been fixed to return a 16-bit value for
> all archs, so now we don't need to mask the result.
> 
> This commit is basically reverting the below:
> 
> commit 6185266c5a85 ("selftests/bpf: Mask bpf_csum_diff() return value
> to 16 bits in test_verifier")
> 
> Signed-off-by: Puranjay Mohan <puranjay@kernel.org>

Acked-by: Daniel Borkmann <daniel@iogearbox.net>


* Re: [PATCH bpf-next 4/5] selftests/bpf: Add benchmark for bpf_csum_diff() helper
  2024-10-21 12:21 ` [PATCH bpf-next 4/5] selftests/bpf: Add benchmark for bpf_csum_diff() helper Puranjay Mohan
@ 2024-10-21 13:43   ` Daniel Borkmann
  2024-10-21 23:28   ` Andrii Nakryiko
  1 sibling, 0 replies; 24+ messages in thread
From: Daniel Borkmann @ 2024-10-21 13:43 UTC (permalink / raw)
  To: Puranjay Mohan, Albert Ou, Alexei Starovoitov, Andrew Morton,
	Andrii Nakryiko, bpf, David S. Miller, Eduard Zingerman,
	Eric Dumazet, Hao Luo, Helge Deller, Jakub Kicinski,
	James E.J. Bottomley, Jiri Olsa, John Fastabend, KP Singh,
	linux-kernel, linux-parisc, linux-riscv, Martin KaFai Lau,
	Mykola Lysenko, netdev, Palmer Dabbelt, Paolo Abeni,
	Paul Walmsley, Puranjay Mohan, Shuah Khan, Song Liu,
	Stanislav Fomichev, Yonghong Song

On 10/21/24 2:21 PM, Puranjay Mohan wrote:
> Add a microbenchmark for bpf_csum_diff() helper. This benchmark works by
> filling a 4KB buffer with random data and calculating the internet
> checksum on different parts of this buffer using bpf_csum_diff().
> 
> Example run using ./benchs/run_bench_csum_diff.sh on x86_64:
> 
> [bpf]$ ./benchs/run_bench_csum_diff.sh
> 4                    2.296 ± 0.066M/s (drops 0.000 ± 0.000M/s)
> 8                    2.320 ± 0.003M/s (drops 0.000 ± 0.000M/s)
> 16                   2.315 ± 0.001M/s (drops 0.000 ± 0.000M/s)
> 20                   2.318 ± 0.001M/s (drops 0.000 ± 0.000M/s)
> 32                   2.308 ± 0.003M/s (drops 0.000 ± 0.000M/s)
> 40                   2.300 ± 0.029M/s (drops 0.000 ± 0.000M/s)
> 64                   2.286 ± 0.001M/s (drops 0.000 ± 0.000M/s)
> 128                  2.250 ± 0.001M/s (drops 0.000 ± 0.000M/s)
> 256                  2.173 ± 0.001M/s (drops 0.000 ± 0.000M/s)
> 512                  2.023 ± 0.055M/s (drops 0.000 ± 0.000M/s)
> 
> Signed-off-by: Puranjay Mohan <puranjay@kernel.org>

Acked-by: Daniel Borkmann <daniel@iogearbox.net>


* Re: [PATCH bpf-next 5/5] selftests/bpf: Add a selftest for bpf_csum_diff()
  2024-10-21 12:21 ` [PATCH bpf-next 5/5] selftests/bpf: Add a selftest for bpf_csum_diff() Puranjay Mohan
@ 2024-10-21 13:44   ` Daniel Borkmann
  0 siblings, 0 replies; 24+ messages in thread
From: Daniel Borkmann @ 2024-10-21 13:44 UTC (permalink / raw)
  To: Puranjay Mohan, Albert Ou, Alexei Starovoitov, Andrew Morton,
	Andrii Nakryiko, bpf, David S. Miller, Eduard Zingerman,
	Eric Dumazet, Hao Luo, Helge Deller, Jakub Kicinski,
	James E.J. Bottomley, Jiri Olsa, John Fastabend, KP Singh,
	linux-kernel, linux-parisc, linux-riscv, Martin KaFai Lau,
	Mykola Lysenko, netdev, Palmer Dabbelt, Paolo Abeni,
	Paul Walmsley, Puranjay Mohan, Shuah Khan, Song Liu,
	Stanislav Fomichev, Yonghong Song

On 10/21/24 2:21 PM, Puranjay Mohan wrote:
> Add a selftest for the bpf_csum_diff() helper. This selftests runs the
> helper in all three configurations(push, pull, and diff) and verifies
> its output. The correct results have been computed by hand and by the
> helper's older implementation.
> 
> Signed-off-by: Puranjay Mohan <puranjay@kernel.org>

Acked-by: Daniel Borkmann <daniel@iogearbox.net>


* Re: [PATCH bpf-next 3/5] selftests/bpf: don't mask result of bpf_csum_diff() in test_verifier
  2024-10-21 13:14     ` Puranjay Mohan
@ 2024-10-21 14:04       ` Helge Deller
  0 siblings, 0 replies; 24+ messages in thread
From: Helge Deller @ 2024-10-21 14:04 UTC (permalink / raw)
  To: Puranjay Mohan, Albert Ou, Alexei Starovoitov, Andrew Morton,
	Andrii Nakryiko, bpf, Daniel Borkmann, David S. Miller,
	Eduard Zingerman, Eric Dumazet, Hao Luo, Jakub Kicinski,
	James E.J. Bottomley, Jiri Olsa, John Fastabend, KP Singh,
	linux-kernel, linux-parisc, linux-riscv, Martin KaFai Lau,
	Mykola Lysenko, netdev, Palmer Dabbelt, Paolo Abeni,
	Paul Walmsley, Shuah Khan, Song Liu, Stanislav Fomichev,
	Yonghong Song

On 10/21/24 15:14, Puranjay Mohan wrote:
> Helge Deller <deller@gmx.de> writes:
>
>> On 10/21/24 14:21, Puranjay Mohan wrote:
>>> The bpf_csum_diff() helper has been fixed to return a 16-bit value for
>>> all archs, so now we don't need to mask the result.
>>>
>>> ...
>>> --- a/tools/testing/selftests/bpf/progs/verifier_array_access.c
>>> +++ b/tools/testing/selftests/bpf/progs/verifier_array_access.c
>>> @@ -368,8 +368,7 @@ __naked void a_read_only_array_2_1(void)
>>>    	r4 = 0;						\
>>>    	r5 = 0;						\
>>>    	call %[bpf_csum_diff];				\
>>> -l0_%=:	r0 &= 0xffff;					\
>>> -	exit;						\
>>> +l0_%=:	exit;						\
>>
>> Instead of dropping the masking, would it make sense to
>> check here if (r0 >> 16) == 0 ?
>
> We define the expected value in R0 to be 65507(0xffe3) in the line at the top:
> __success __retval(65507)
>
> So, we should just not do anything to R0 and it should contain this value
> after returning from bpf_csum_diff()
>
> This masking hack was added in:
>
> 6185266c5a853 ("selftests/bpf: Mask bpf_csum_diff() return value to 16 bits in test_verifier")
>
> because without the fix in patch 2 bpf_csum_diff() would return the
> following for this test:
>
> x86                    :    -29 : 0xffffffe3
> generic (arm64, riscv) :  65507 : 0x0000ffe3

You're right.
Thanks for explaining.

Helge


* Re: [PATCH bpf-next 4/5] selftests/bpf: Add benchmark for bpf_csum_diff() helper
  2024-10-21 12:21 ` [PATCH bpf-next 4/5] selftests/bpf: Add benchmark for bpf_csum_diff() helper Puranjay Mohan
  2024-10-21 13:43   ` Daniel Borkmann
@ 2024-10-21 23:28   ` Andrii Nakryiko
  2024-10-22 10:21     ` Puranjay Mohan
  1 sibling, 1 reply; 24+ messages in thread
From: Andrii Nakryiko @ 2024-10-21 23:28 UTC (permalink / raw)
  To: Puranjay Mohan
  Cc: Albert Ou, Alexei Starovoitov, Andrew Morton, Andrii Nakryiko,
	bpf, Daniel Borkmann, David S. Miller, Eduard Zingerman,
	Eric Dumazet, Hao Luo, Helge Deller, Jakub Kicinski,
	James E.J. Bottomley, Jiri Olsa, John Fastabend, KP Singh,
	linux-kernel, linux-parisc, linux-riscv, Martin KaFai Lau,
	Mykola Lysenko, netdev, Palmer Dabbelt, Paolo Abeni,
	Paul Walmsley, Puranjay Mohan, Shuah Khan, Song Liu,
	Stanislav Fomichev, Yonghong Song

On Mon, Oct 21, 2024 at 5:22 AM Puranjay Mohan <puranjay@kernel.org> wrote:
>
> Add a microbenchmark for bpf_csum_diff() helper. This benchmark works by
> filling a 4KB buffer with random data and calculating the internet
> checksum on different parts of this buffer using bpf_csum_diff().
>
> Example run using ./benchs/run_bench_csum_diff.sh on x86_64:
>
> [bpf]$ ./benchs/run_bench_csum_diff.sh
> 4                    2.296 ± 0.066M/s (drops 0.000 ± 0.000M/s)
> 8                    2.320 ± 0.003M/s (drops 0.000 ± 0.000M/s)
> 16                   2.315 ± 0.001M/s (drops 0.000 ± 0.000M/s)
> 20                   2.318 ± 0.001M/s (drops 0.000 ± 0.000M/s)
> 32                   2.308 ± 0.003M/s (drops 0.000 ± 0.000M/s)
> 40                   2.300 ± 0.029M/s (drops 0.000 ± 0.000M/s)
> 64                   2.286 ± 0.001M/s (drops 0.000 ± 0.000M/s)
> 128                  2.250 ± 0.001M/s (drops 0.000 ± 0.000M/s)
> 256                  2.173 ± 0.001M/s (drops 0.000 ± 0.000M/s)
> 512                  2.023 ± 0.055M/s (drops 0.000 ± 0.000M/s)

you are not benchmarking bpf_csum_diff(), you are benchmarking how
often you can call bpf_prog_test_run(). Add some batching on the BPF
side; these numbers tell you that there is no difference between
calculating checksum for 4 bytes and for 512. Didn't that seem strange
to you?

pw-bot: cr

>
> Signed-off-by: Puranjay Mohan <puranjay@kernel.org>
> ---
>  tools/testing/selftests/bpf/Makefile          |   2 +
>  tools/testing/selftests/bpf/bench.c           |   4 +
>  .../selftests/bpf/benchs/bench_csum_diff.c    | 164 ++++++++++++++++++
>  .../bpf/benchs/run_bench_csum_diff.sh         |  10 ++
>  .../selftests/bpf/progs/csum_diff_bench.c     |  25 +++
>  5 files changed, 205 insertions(+)
>  create mode 100644 tools/testing/selftests/bpf/benchs/bench_csum_diff.c
>  create mode 100755 tools/testing/selftests/bpf/benchs/run_bench_csum_diff.sh
>  create mode 100644 tools/testing/selftests/bpf/progs/csum_diff_bench.c
>

[...]

> +
> +static void csum_diff_setup(void)
> +{
> +       int err;
> +       char *buff;
> +       size_t i, sz;
> +
> +       sz = sizeof(ctx.skel->rodata->buff);
> +
> +       setup_libbpf();
> +
> +       ctx.skel = csum_diff_bench__open();
> +       if (!ctx.skel) {
> +               fprintf(stderr, "failed to open skeleton\n");
> +               exit(1);
> +       }
> +
> +       srandom(time(NULL));
> +       buff = ctx.skel->rodata->buff;
> +
> +       /*
> +        * Set first 8 bytes of buffer to 0xdeadbeefdeadbeef, this is later used to verify the
> +        * correctness of the helper by comparing the checksum result for 0xdeadbeefdeadbeef that
> +        * should be 0x3b3b
> +        */
> +
> +       *(u64 *)buff = 0xdeadbeefdeadbeef;
> +
> +       for (i = 8; i < sz; i++)
> +               buff[i] = '1' + random() % 9;

So, you only generate 9 different values for the bytes. Why? Why not the
full byte range?

> +
> +       ctx.skel->rodata->buff_len = args.buff_len;
> +
> +       err = csum_diff_bench__load(ctx.skel);
> +       if (err) {
> +               fprintf(stderr, "failed to load skeleton\n");
> +               csum_diff_bench__destroy(ctx.skel);
> +               exit(1);
> +       }
> +}
> +

[...]

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH bpf-next 1/5] net: checksum: move from32to16() to generic header
  2024-10-21 12:21 ` [PATCH bpf-next 1/5] net: checksum: move from32to16() to generic header Puranjay Mohan
  2024-10-21 13:41   ` Daniel Borkmann
@ 2024-10-22  9:49   ` Toke Høiland-Jørgensen
  2024-10-22 13:50   ` kernel test robot
  2 siblings, 0 replies; 24+ messages in thread
From: Toke Høiland-Jørgensen @ 2024-10-22  9:49 UTC (permalink / raw)
  To: Puranjay Mohan, Albert Ou, Alexei Starovoitov, Andrew Morton,
	Andrii Nakryiko, bpf, Daniel Borkmann, David S. Miller,
	Eduard Zingerman, Eric Dumazet, Hao Luo, Helge Deller,
	Jakub Kicinski, James E.J. Bottomley, Jiri Olsa, John Fastabend,
	KP Singh, linux-kernel, linux-parisc, linux-riscv,
	Martin KaFai Lau, Mykola Lysenko, netdev, Palmer Dabbelt,
	Paolo Abeni, Paul Walmsley, Puranjay Mohan, Puranjay Mohan,
	Shuah Khan, Song Liu, Stanislav Fomichev, Yonghong Song

Puranjay Mohan <puranjay@kernel.org> writes:

> from32to16() is used by lib/checksum.c and also by
> arch/parisc/lib/checksum.c. The next patch will use it in the
> bpf_csum_diff helper.
>
> Move from32to16() to include/net/checksum.h as csum_from32to16() and
> remove the other implementations.
>
> Signed-off-by: Puranjay Mohan <puranjay@kernel.org>

Reviewed-by: Toke Høiland-Jørgensen <toke@redhat.com>
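
The fold being moved is, in essence, the following (a sketch of the generic
logic; the header version may differ in types and naming):

	static inline unsigned short from32to16(unsigned int x)
	{
		/* add the high and low 16-bit halves ... */
		x = (x & 0xffff) + (x >> 16);
		/* ... and fold any resulting carry back in */
		x = (x & 0xffff) + (x >> 16);
		return x;
	}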


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH bpf-next 2/5] bpf: bpf_csum_diff: optimize and homogenize for all archs
  2024-10-21 12:21 ` [PATCH bpf-next 2/5] bpf: bpf_csum_diff: optimize and homogenize for all archs Puranjay Mohan
  2024-10-21 13:42   ` Daniel Borkmann
@ 2024-10-22  9:54   ` Toke Høiland-Jørgensen
  2024-10-22 18:09   ` kernel test robot
  2 siblings, 0 replies; 24+ messages in thread
From: Toke Høiland-Jørgensen @ 2024-10-22  9:54 UTC (permalink / raw)
  To: Puranjay Mohan, Albert Ou, Alexei Starovoitov, Andrew Morton,
	Andrii Nakryiko, bpf, Daniel Borkmann, David S. Miller,
	Eduard Zingerman, Eric Dumazet, Hao Luo, Helge Deller,
	Jakub Kicinski, James E.J. Bottomley, Jiri Olsa, John Fastabend,
	KP Singh, linux-kernel, linux-parisc, linux-riscv,
	Martin KaFai Lau, Mykola Lysenko, netdev, Palmer Dabbelt,
	Paolo Abeni, Paul Walmsley, Puranjay Mohan, Puranjay Mohan,
	Shuah Khan, Song Liu, Stanislav Fomichev, Yonghong Song

Puranjay Mohan <puranjay@kernel.org> writes:

> 1. Optimization
>    ------------
>
> The current implementation copies the 'from' and 'to' buffers to a
> scratchpad and it takes the bitwise NOT of 'from' buffer while copying.
> In the next step csum_partial() is called with this scratchpad.
>
> So, mathematically, the current implementation is doing:
>
> 	result = csum(to - from)
>
> Here, 'to' and '~from' are copied into the scratchpad buffer; we need
> the scratchpad buffer because csum_partial() takes a single contiguous
> buffer and not two disjoint buffers like 'to' and 'from'.
>
> We can rewrite this equation as:
>
> 	result = csum(to) - csum(from)
>
> using the distributive property of csum().
>
> This allows 'to' and 'from' to be at different locations, and therefore
> the scratchpad and the copying are not needed.
>
> In C code, this will look like:
>
> result = csum_sub(csum_partial(to, to_size, seed),
>                   csum_partial(from, from_size, 0));
>
> 2. Homogenization
>    --------------
>
> The bpf_csum_diff() helper calls csum_partial(), which is implemented by
> some architectures like arm and x86, while other architectures rely on
> the generic implementation in lib/checksum.c.
>
> The generic implementation in lib/checksum.c returns a 16-bit value, but
> the arch-specific implementations can return more than 16 bits. This
> works out in most places because, before the result is used, it is passed
> through csum_fold(), which turns it into a 16-bit value.
>
> bpf_csum_diff() directly returns the value from csum_partial(), and
> therefore the returned values could be different on different
> architectures; see the discussion in [1]:
>
> For the int value 28, the calculated checksums are:
>
> x86                    :    -29 : 0xffffffe3
> generic (arm64, riscv) :  65507 : 0x0000ffe3
> arm                    : 131042 : 0x0001ffe2
>
> Pass the result of bpf_csum_diff() through from32to16() before returning,
> to homogenize this result for all architectures.
>
> NOTE: from32to16() is used instead of csum_fold() because csum_fold()
> does from32to16() + bitwise NOT of the result, which is not what we want
> to do here.
>
> [1] https://lore.kernel.org/bpf/CAJ+HfNiQbOcqCLxFUP2FMm5QrLXUUaj852Fxe3hn_2JNiucn6g@mail.gmail.com/
>
> Signed-off-by: Puranjay Mohan <puranjay@kernel.org>

Pretty neat simplification :)

Reviewed-by: Toke Høiland-Jørgensen <toke@redhat.com>
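
The rewrite leans on one's-complement arithmetic, where subtraction is
addition of the bitwise complement. A small user-space sketch of the
csum_add()/csum_sub() semantics the commit message relies on (simplified
stand-ins for the kernel helpers of the same name):

	#include <stdint.h>

	/* 32-bit one's-complement add: fold the carry-out back in */
	static uint32_t csum_add(uint32_t csum, uint32_t addend)
	{
		uint32_t res = csum + addend;

		return res + (res < addend);
	}

	/* one's-complement subtract: add the bitwise complement */
	static uint32_t csum_sub(uint32_t csum, uint32_t addend)
	{
		return csum_add(csum, ~addend);
	}

With these, csum_sub(csum(to), csum(from)) folds to the same 16-bit result
as a csum over 'to' concatenated with the complemented 'from' bytes, which
is why the scratchpad copy can go away.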


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH bpf-next 3/5] selftests/bpf: don't mask result of bpf_csum_diff() in test_verifier
  2024-10-21 12:21 ` [PATCH bpf-next 3/5] selftests/bpf: don't mask result of bpf_csum_diff() in test_verifier Puranjay Mohan
  2024-10-21 13:01   ` Helge Deller
  2024-10-21 13:42   ` Daniel Borkmann
@ 2024-10-22  9:55   ` Toke Høiland-Jørgensen
  2 siblings, 0 replies; 24+ messages in thread
From: Toke Høiland-Jørgensen @ 2024-10-22  9:55 UTC (permalink / raw)
  To: Puranjay Mohan, Albert Ou, Alexei Starovoitov, Andrew Morton,
	Andrii Nakryiko, bpf, Daniel Borkmann, David S. Miller,
	Eduard Zingerman, Eric Dumazet, Hao Luo, Helge Deller,
	Jakub Kicinski, James E.J. Bottomley, Jiri Olsa, John Fastabend,
	KP Singh, linux-kernel, linux-parisc, linux-riscv,
	Martin KaFai Lau, Mykola Lysenko, netdev, Palmer Dabbelt,
	Paolo Abeni, Paul Walmsley, Puranjay Mohan, Puranjay Mohan,
	Shuah Khan, Song Liu, Stanislav Fomichev, Yonghong Song

Puranjay Mohan <puranjay@kernel.org> writes:

> The bpf_csum_diff() helper has been fixed to return a 16-bit value for
> all archs, so now we don't need to mask the result.
>
> This commit is basically reverting the below:
>
> commit 6185266c5a85 ("selftests/bpf: Mask bpf_csum_diff() return value
> to 16 bits in test_verifier")
>
> Signed-off-by: Puranjay Mohan <puranjay@kernel.org>

Reviewed-by: Toke Høiland-Jørgensen <toke@redhat.com>


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH bpf-next 4/5] selftests/bpf: Add benchmark for bpf_csum_diff() helper
  2024-10-21 23:28   ` Andrii Nakryiko
@ 2024-10-22 10:21     ` Puranjay Mohan
  2024-10-22 17:47       ` Andrii Nakryiko
  0 siblings, 1 reply; 24+ messages in thread
From: Puranjay Mohan @ 2024-10-22 10:21 UTC (permalink / raw)
  To: Andrii Nakryiko
  Cc: Albert Ou, Alexei Starovoitov, Andrew Morton, Andrii Nakryiko,
	bpf, Daniel Borkmann, David S. Miller, Eduard Zingerman,
	Eric Dumazet, Hao Luo, Helge Deller, Jakub Kicinski,
	James E.J. Bottomley, Jiri Olsa, John Fastabend, KP Singh,
	linux-kernel, linux-parisc, linux-riscv, Martin KaFai Lau,
	Mykola Lysenko, netdev, Palmer Dabbelt, Paolo Abeni,
	Paul Walmsley, Shuah Khan, Song Liu, Stanislav Fomichev,
	Yonghong Song

Andrii Nakryiko <andrii.nakryiko@gmail.com> writes:

> On Mon, Oct 21, 2024 at 5:22 AM Puranjay Mohan <puranjay@kernel.org> wrote:
>>
>> Add a microbenchmark for bpf_csum_diff() helper. This benchmark works by
>> filling a 4KB buffer with random data and calculating the internet
>> checksum on different parts of this buffer using bpf_csum_diff().
>>
>> Example run using ./benchs/run_bench_csum_diff.sh on x86_64:
>>
>> [bpf]$ ./benchs/run_bench_csum_diff.sh
>> 4                    2.296 ± 0.066M/s (drops 0.000 ± 0.000M/s)
>> 8                    2.320 ± 0.003M/s (drops 0.000 ± 0.000M/s)
>> 16                   2.315 ± 0.001M/s (drops 0.000 ± 0.000M/s)
>> 20                   2.318 ± 0.001M/s (drops 0.000 ± 0.000M/s)
>> 32                   2.308 ± 0.003M/s (drops 0.000 ± 0.000M/s)
>> 40                   2.300 ± 0.029M/s (drops 0.000 ± 0.000M/s)
>> 64                   2.286 ± 0.001M/s (drops 0.000 ± 0.000M/s)
>> 128                  2.250 ± 0.001M/s (drops 0.000 ± 0.000M/s)
>> 256                  2.173 ± 0.001M/s (drops 0.000 ± 0.000M/s)
>> 512                  2.023 ± 0.055M/s (drops 0.000 ± 0.000M/s)
>
> you are not benchmarking bpf_csum_diff(), you are benchmarking how
> often you can call bpf_prog_test_run(). Add some batching on the BPF
> side, these numbers tell you that there is no difference between
> calculating checksum for 4 bytes and for 512, that didn't seem strange
> to you?

This didn't seem strange to me because, if you look at the tables I added
to the cover letter, there is a clear improvement after optimizing the
helper, and arm64 even shows a linear drop going from 4 bytes to 512
bytes, even after the optimization.

On x86, after the improvement, 4 bytes and 512 bytes show similar numbers,
but there is still a small drop that can be seen going from 4 to 512
bytes.

My thought was that, because bpf_csum_diff() calls csum_partial() on
x86, which is already optimised, most of the overhead was due to copying
the buffer, which is now removed.

I guess I can amplify the difference between 4B and 512B by calling
bpf_csum_diff() multiple times in a loop, or by calculating the csum by
dividing the buffer into more parts (currently the BPF code divides it
into 2 parts only).

>>
>> Signed-off-by: Puranjay Mohan <puranjay@kernel.org>
>> ---
>>  tools/testing/selftests/bpf/Makefile          |   2 +
>>  tools/testing/selftests/bpf/bench.c           |   4 +
>>  .../selftests/bpf/benchs/bench_csum_diff.c    | 164 ++++++++++++++++++
>>  .../bpf/benchs/run_bench_csum_diff.sh         |  10 ++
>>  .../selftests/bpf/progs/csum_diff_bench.c     |  25 +++
>>  5 files changed, 205 insertions(+)
>>  create mode 100644 tools/testing/selftests/bpf/benchs/bench_csum_diff.c
>>  create mode 100755 tools/testing/selftests/bpf/benchs/run_bench_csum_diff.sh
>>  create mode 100644 tools/testing/selftests/bpf/progs/csum_diff_bench.c
>>
>
> [...]
>
>> +
>> +static void csum_diff_setup(void)
>> +{
>> +       int err;
>> +       char *buff;
>> +       size_t i, sz;
>> +
>> +       sz = sizeof(ctx.skel->rodata->buff);
>> +
>> +       setup_libbpf();
>> +
>> +       ctx.skel = csum_diff_bench__open();
>> +       if (!ctx.skel) {
>> +               fprintf(stderr, "failed to open skeleton\n");
>> +               exit(1);
>> +       }
>> +
>> +       srandom(time(NULL));
>> +       buff = ctx.skel->rodata->buff;
>> +
>> +       /*
>> +        * Set first 8 bytes of buffer to 0xdeadbeefdeadbeef, this is later used to verify the
>> +        * correctness of the helper by comparing the checksum result for 0xdeadbeefdeadbeef that
>> +        * should be 0x3b3b
>> +        */
>> +
>> +       *(u64 *)buff = 0xdeadbeefdeadbeef;
>> +
>> +       for (i = 8; i < sz; i++)
>> +               buff[i] = '1' + random() % 9;
>
> so, you only generate 9 different values for bytes, why? Why not full
> byte range?

Thanks for catching this; there is no reason for this to be [1,10]. I
will use the full byte range in the next version.

Thanks,
Puranjay
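
The fix is a one-line change; a sketch (keeping the patch's 'buff', 'i',
and 'sz' names):

	for (i = 8; i < sz; i++)
		buff[i] = random() % 256;	/* full byte range, not just '1'..'9' */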

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH bpf-next 1/5] net: checksum: move from32to16() to generic header
  2024-10-21 12:21 ` [PATCH bpf-next 1/5] net: checksum: move from32to16() to generic header Puranjay Mohan
  2024-10-21 13:41   ` Daniel Borkmann
  2024-10-22  9:49   ` Toke Høiland-Jørgensen
@ 2024-10-22 13:50   ` kernel test robot
  2 siblings, 0 replies; 24+ messages in thread
From: kernel test robot @ 2024-10-22 13:50 UTC (permalink / raw)
  To: Puranjay Mohan, Albert Ou, Alexei Starovoitov, Andrew Morton,
	Andrii Nakryiko, bpf, Daniel Borkmann, David S. Miller,
	Eduard Zingerman, Eric Dumazet, Hao Luo, Helge Deller,
	Jakub Kicinski, James E.J. Bottomley, Jiri Olsa, John Fastabend,
	KP Singh, linux-kernel, linux-parisc, linux-riscv,
	Martin KaFai Lau, Mykola Lysenko, Palmer Dabbelt, Paolo Abeni,
	Paul Walmsley, Puranjay Mohan, Shuah Khan, Song Liu,
	Stanislav Fomichev
  Cc: oe-kbuild-all, Linux Memory Management List, netdev

Hi Puranjay,

kernel test robot noticed the following build warnings:

[auto build test WARNING on bpf-next/master]

url:    https://github.com/intel-lab-lkp/linux/commits/Puranjay-Mohan/net-checksum-move-from32to16-to-generic-header/20241021-202707
base:   https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next.git master
patch link:    https://lore.kernel.org/r/20241021122112.101513-2-puranjay%40kernel.org
patch subject: [PATCH bpf-next 1/5] net: checksum: move from32to16() to generic header
config: x86_64-randconfig-122-20241022 (https://download.01.org/0day-ci/archive/20241022/202410222149.3FVJFYYy-lkp@intel.com/config)
compiler: gcc-12 (Debian 12.2.0-14) 12.2.0
reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20241022/202410222149.3FVJFYYy-lkp@intel.com/reproduce)

If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <lkp@intel.com>
| Closes: https://lore.kernel.org/oe-kbuild-all/202410222149.3FVJFYYy-lkp@intel.com/

sparse warnings: (new ones prefixed by >>)
>> lib/checksum.c:84:34: sparse: sparse: incorrect type in argument 1 (different base types) @@     expected restricted __wsum [usertype] sum @@     got unsigned int [assigned] result @@
   lib/checksum.c:84:34: sparse:     expected restricted __wsum [usertype] sum
   lib/checksum.c:84:34: sparse:     got unsigned int [assigned] result
>> lib/checksum.c:84:16: sparse: sparse: incorrect type in assignment (different base types) @@     expected unsigned int [assigned] result @@     got restricted __sum16 @@
   lib/checksum.c:84:16: sparse:     expected unsigned int [assigned] result
   lib/checksum.c:84:16: sparse:     got restricted __sum16

vim +84 lib/checksum.c

    35	
    36	#ifndef do_csum
    37	static unsigned int do_csum(const unsigned char *buff, int len)
    38	{
    39		int odd;
    40		unsigned int result = 0;
    41	
    42		if (len <= 0)
    43			goto out;
    44		odd = 1 & (unsigned long) buff;
    45		if (odd) {
    46	#ifdef __LITTLE_ENDIAN
    47			result += (*buff << 8);
    48	#else
    49			result = *buff;
    50	#endif
    51			len--;
    52			buff++;
    53		}
    54		if (len >= 2) {
    55			if (2 & (unsigned long) buff) {
    56				result += *(unsigned short *) buff;
    57				len -= 2;
    58				buff += 2;
    59			}
    60			if (len >= 4) {
    61				const unsigned char *end = buff + ((unsigned)len & ~3);
    62				unsigned int carry = 0;
    63				do {
    64					unsigned int w = *(unsigned int *) buff;
    65					buff += 4;
    66					result += carry;
    67					result += w;
    68					carry = (w > result);
    69				} while (buff < end);
    70				result += carry;
    71				result = (result & 0xffff) + (result >> 16);
    72			}
    73			if (len & 2) {
    74				result += *(unsigned short *) buff;
    75				buff += 2;
    76			}
    77		}
    78		if (len & 1)
    79	#ifdef __LITTLE_ENDIAN
    80			result += *buff;
    81	#else
    82			result += (*buff << 8);
    83	#endif
  > 84		result = csum_from32to16(result);
    85		if (odd)
    86			result = ((result >> 8) & 0xff) | ((result & 0xff) << 8);
    87	out:
    88		return result;
    89	}
    90	#endif
    91	

-- 
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki
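
The warnings come from mixing plain unsigned int with sparse's restricted
__wsum/__sum16 types at the new call site. One plausible way to annotate it
(a sketch using the usual __force casts; the actual fix in a later revision
may differ):

	result = (__force unsigned int)csum_from32to16((__force __wsum)result);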

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH bpf-next 4/5] selftests/bpf: Add benchmark for bpf_csum_diff() helper
  2024-10-22 10:21     ` Puranjay Mohan
@ 2024-10-22 17:47       ` Andrii Nakryiko
  2024-10-22 17:58         ` Puranjay Mohan
  0 siblings, 1 reply; 24+ messages in thread
From: Andrii Nakryiko @ 2024-10-22 17:47 UTC (permalink / raw)
  To: Puranjay Mohan
  Cc: Albert Ou, Alexei Starovoitov, Andrew Morton, Andrii Nakryiko,
	bpf, Daniel Borkmann, David S. Miller, Eduard Zingerman,
	Eric Dumazet, Hao Luo, Helge Deller, Jakub Kicinski,
	James E.J. Bottomley, Jiri Olsa, John Fastabend, KP Singh,
	linux-kernel, linux-parisc, linux-riscv, Martin KaFai Lau,
	Mykola Lysenko, netdev, Palmer Dabbelt, Paolo Abeni,
	Paul Walmsley, Shuah Khan, Song Liu, Stanislav Fomichev,
	Yonghong Song

On Tue, Oct 22, 2024 at 3:21 AM Puranjay Mohan <puranjay@kernel.org> wrote:
>
> Andrii Nakryiko <andrii.nakryiko@gmail.com> writes:
>
> > On Mon, Oct 21, 2024 at 5:22 AM Puranjay Mohan <puranjay@kernel.org> wrote:
> >>
> >> Add a microbenchmark for bpf_csum_diff() helper. This benchmark works by
> >> filling a 4KB buffer with random data and calculating the internet
> >> checksum on different parts of this buffer using bpf_csum_diff().
> >>
> >> Example run using ./benchs/run_bench_csum_diff.sh on x86_64:
> >>
> >> [bpf]$ ./benchs/run_bench_csum_diff.sh
> >> 4                    2.296 ± 0.066M/s (drops 0.000 ± 0.000M/s)
> >> 8                    2.320 ± 0.003M/s (drops 0.000 ± 0.000M/s)
> >> 16                   2.315 ± 0.001M/s (drops 0.000 ± 0.000M/s)
> >> 20                   2.318 ± 0.001M/s (drops 0.000 ± 0.000M/s)
> >> 32                   2.308 ± 0.003M/s (drops 0.000 ± 0.000M/s)
> >> 40                   2.300 ± 0.029M/s (drops 0.000 ± 0.000M/s)
> >> 64                   2.286 ± 0.001M/s (drops 0.000 ± 0.000M/s)
> >> 128                  2.250 ± 0.001M/s (drops 0.000 ± 0.000M/s)
> >> 256                  2.173 ± 0.001M/s (drops 0.000 ± 0.000M/s)
> >> 512                  2.023 ± 0.055M/s (drops 0.000 ± 0.000M/s)
> >
> > you are not benchmarking bpf_csum_diff(), you are benchmarking how
> > often you can call bpf_prog_test_run(). Add some batching on the BPF
> > side, these numbers tell you that there is no difference between
> > calculating checksum for 4 bytes and for 512, that didn't seem strange
> > to you?
>
> This didn't seem strange to me because if you see the tables I added to
> the cover letter, there is a clear improvement after optimizing the
> helper and arm64 even shows a linear drop going from 4 bytes to 512
> bytes, even after the optimization.
>

Regardless of optimization, it's strange that throughput barely
differs when you vary the amount of work by more than 100x. This
wouldn't be strange if this checksum calculation was some sort of
cryptographic hash, where it's intentional to have the same timing
regardless of amount of work, or something along those lines. But I
don't think that's the case here.

But as it is right now, this benchmark is benchmarking
bpf_prog_test_run(), as I mentioned, which seems to be bottlenecking
at about 2mln/s throughput for your machine. bpf_csum_diff()'s
overhead is trivial compared to bpf_prog_test_run() overhead and
syscall/context switch overhead.

We shouldn't add a benchmark that doesn't benchmark the right thing.
So just add a bpf_for(i, 0, 100) loop doing bpf_csum_diff(), and then
do the atomic increment *after* the loop (to minimize atomics overhead).

> On x86 after the improvement, 4 bytes and 512 bytes show similar numbers
> but there is still a small drop that can be seen going from 4 to 512
> bytes.
>
> My thought was that because the bpf_csum_diff() calls csum_partial() on
> x86 which is already optimised, most of the overhead was due to copying
> the buffer which is now removed.
>
> I guess I can amplify the difference between 4B and 512B by calling
> bpf_csum_diff() multiple times in a loop, or by calculating the csum by
> dividing the buffer into more parts (currently the BPF code divides it
> into 2 parts only).
>
> >>
> >> Signed-off-by: Puranjay Mohan <puranjay@kernel.org>
> >> ---
> >>  tools/testing/selftests/bpf/Makefile          |   2 +
> >>  tools/testing/selftests/bpf/bench.c           |   4 +
> >>  .../selftests/bpf/benchs/bench_csum_diff.c    | 164 ++++++++++++++++++
> >>  .../bpf/benchs/run_bench_csum_diff.sh         |  10 ++
> >>  .../selftests/bpf/progs/csum_diff_bench.c     |  25 +++
> >>  5 files changed, 205 insertions(+)
> >>  create mode 100644 tools/testing/selftests/bpf/benchs/bench_csum_diff.c
> >>  create mode 100755 tools/testing/selftests/bpf/benchs/run_bench_csum_diff.sh
> >>  create mode 100644 tools/testing/selftests/bpf/progs/csum_diff_bench.c
> >>
> >

[...]

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH bpf-next 4/5] selftests/bpf: Add benchmark for bpf_csum_diff() helper
  2024-10-22 17:47       ` Andrii Nakryiko
@ 2024-10-22 17:58         ` Puranjay Mohan
  2024-10-23 15:37           ` Puranjay Mohan
  0 siblings, 1 reply; 24+ messages in thread
From: Puranjay Mohan @ 2024-10-22 17:58 UTC (permalink / raw)
  To: Andrii Nakryiko
  Cc: Albert Ou, Alexei Starovoitov, Andrew Morton, Andrii Nakryiko,
	bpf, Daniel Borkmann, David S. Miller, Eduard Zingerman,
	Eric Dumazet, Hao Luo, Helge Deller, Jakub Kicinski,
	James E.J. Bottomley, Jiri Olsa, John Fastabend, KP Singh,
	linux-kernel, linux-parisc, linux-riscv, Martin KaFai Lau,
	Mykola Lysenko, netdev, Palmer Dabbelt, Paolo Abeni,
	Paul Walmsley, Shuah Khan, Song Liu, Stanislav Fomichev,
	Yonghong Song

Andrii Nakryiko <andrii.nakryiko@gmail.com> writes:

> On Tue, Oct 22, 2024 at 3:21 AM Puranjay Mohan <puranjay@kernel.org> wrote:
>>
>> Andrii Nakryiko <andrii.nakryiko@gmail.com> writes:
>>
>> > On Mon, Oct 21, 2024 at 5:22 AM Puranjay Mohan <puranjay@kernel.org> wrote:
>> >>
>> >> Add a microbenchmark for bpf_csum_diff() helper. This benchmark works by
>> >> filling a 4KB buffer with random data and calculating the internet
>> >> checksum on different parts of this buffer using bpf_csum_diff().
>> >>
>> >> Example run using ./benchs/run_bench_csum_diff.sh on x86_64:
>> >>
>> >> [bpf]$ ./benchs/run_bench_csum_diff.sh
>> >> 4                    2.296 ± 0.066M/s (drops 0.000 ± 0.000M/s)
>> >> 8                    2.320 ± 0.003M/s (drops 0.000 ± 0.000M/s)
>> >> 16                   2.315 ± 0.001M/s (drops 0.000 ± 0.000M/s)
>> >> 20                   2.318 ± 0.001M/s (drops 0.000 ± 0.000M/s)
>> >> 32                   2.308 ± 0.003M/s (drops 0.000 ± 0.000M/s)
>> >> 40                   2.300 ± 0.029M/s (drops 0.000 ± 0.000M/s)
>> >> 64                   2.286 ± 0.001M/s (drops 0.000 ± 0.000M/s)
>> >> 128                  2.250 ± 0.001M/s (drops 0.000 ± 0.000M/s)
>> >> 256                  2.173 ± 0.001M/s (drops 0.000 ± 0.000M/s)
>> >> 512                  2.023 ± 0.055M/s (drops 0.000 ± 0.000M/s)
>> >
>> > you are not benchmarking bpf_csum_diff(), you are benchmarking how
>> > often you can call bpf_prog_test_run(). Add some batching on the BPF
>> > side, these numbers tell you that there is no difference between
>> > calculating checksum for 4 bytes and for 512, that didn't seem strange
>> > to you?
>>
>> This didn't seem strange to me because if you see the tables I added to
>> the cover letter, there is a clear improvement after optimizing the
>> helper and arm64 even shows a linear drop going from 4 bytes to 512
>> bytes, even after the optimization.
>>
>
> Regardless of optimization, it's strange that throughput barely
> differs when you vary the amount of work by more than 100x. This
> wouldn't be strange if this checksum calculation was some sort of
> cryptographic hash, where it's intentional to have the same timing
> regardless of amount of work, or something along those lines. But I
> don't think that's the case here.
>
> But as it is right now, this benchmark is benchmarking
> bpf_prog_test_run(), as I mentioned, which seems to be bottlenecking
> at about 2mln/s throughput for your machine. bpf_csum_diff()'s
> overhead is trivial compared to bpf_prog_test_run() overhead and
> syscall/context switch overhead.
>
> We shouldn't add the benchmark that doesn't benchmark the right thing.
> So just add a bpf_for(i, 0, 100) loop doing bpf_csum_diff(), and then
> do atomic increment *after* the loop (to minimize atomics overhead).

Thanks, now I understand what you meant. Will add the bpf_for() in the
next version.

Thanks,
Puranjay
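
A sketch of what the batched BPF side could look like (hypothetical names;
bpf_for() is the open-coded iterator macro from the selftests'
bpf_experimental.h, and 'hits' stands in for the benchmark's counter):

	#include <vmlinux.h>
	#include <bpf/bpf_helpers.h>
	#include "bpf_experimental.h"

	/* filled with random data from user space before load */
	const char buff[4096];
	const volatile __u32 buff_len = 512;
	__u64 hits = 0;

	SEC("tc")
	int csum_diff_bench(struct __sk_buff *skb)
	{
		int i;

		bpf_for(i, 0, 100)
			bpf_csum_diff(0, 0, (void *)buff, buff_len, 0);

		/* one counter bump per batch of 100 helper calls */
		__sync_fetch_and_add(&hits, 1);
		return 0;
	}

	char _license[] SEC("license") = "GPL";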

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH bpf-next 2/5] bpf: bpf_csum_diff: optimize and homogenize for all archs
  2024-10-21 12:21 ` [PATCH bpf-next 2/5] bpf: bpf_csum_diff: optimize and homogenize for all archs Puranjay Mohan
  2024-10-21 13:42   ` Daniel Borkmann
  2024-10-22  9:54   ` Toke Høiland-Jørgensen
@ 2024-10-22 18:09   ` kernel test robot
  2 siblings, 0 replies; 24+ messages in thread
From: kernel test robot @ 2024-10-22 18:09 UTC (permalink / raw)
  To: Puranjay Mohan, Albert Ou, Alexei Starovoitov, Andrew Morton,
	Andrii Nakryiko, bpf, Daniel Borkmann, David S. Miller,
	Eduard Zingerman, Eric Dumazet, Hao Luo, Helge Deller,
	Jakub Kicinski, James E.J. Bottomley, Jiri Olsa, John Fastabend,
	KP Singh, linux-kernel, linux-parisc, linux-riscv,
	Martin KaFai Lau, Mykola Lysenko, Palmer Dabbelt, Paolo Abeni,
	Paul Walmsley, Puranjay Mohan, Shuah Khan, Song Liu,
	Stanislav Fomichev
  Cc: oe-kbuild-all, Linux Memory Management List, netdev

Hi Puranjay,

kernel test robot noticed the following build warnings:

[auto build test WARNING on bpf-next/master]

url:    https://github.com/intel-lab-lkp/linux/commits/Puranjay-Mohan/net-checksum-move-from32to16-to-generic-header/20241021-202707
base:   https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next.git master
patch link:    https://lore.kernel.org/r/20241021122112.101513-3-puranjay%40kernel.org
patch subject: [PATCH bpf-next 2/5] bpf: bpf_csum_diff: optimize and homogenize for all archs
config: x86_64-randconfig-122-20241022 (https://download.01.org/0day-ci/archive/20241023/202410230122.BYZLEUHz-lkp@intel.com/config)
compiler: gcc-12 (Debian 12.2.0-14) 12.2.0
reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20241023/202410230122.BYZLEUHz-lkp@intel.com/reproduce)

If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <lkp@intel.com>
| Closes: https://lore.kernel.org/oe-kbuild-all/202410230122.BYZLEUHz-lkp@intel.com/

sparse warnings: (new ones prefixed by >>)
   net/core/filter.c:1423:39: sparse: sparse: incorrect type in argument 1 (different address spaces) @@     expected struct sock_filter const *filter @@     got struct sock_filter [noderef] __user *filter @@
   net/core/filter.c:1423:39: sparse:     expected struct sock_filter const *filter
   net/core/filter.c:1423:39: sparse:     got struct sock_filter [noderef] __user *filter
   net/core/filter.c:1501:39: sparse: sparse: incorrect type in argument 1 (different address spaces) @@     expected struct sock_filter const *filter @@     got struct sock_filter [noderef] __user *filter @@
   net/core/filter.c:1501:39: sparse:     expected struct sock_filter const *filter
   net/core/filter.c:1501:39: sparse:     got struct sock_filter [noderef] __user *filter
   net/core/filter.c:2321:45: sparse: sparse: incorrect type in argument 2 (different base types) @@     expected restricted __be32 [usertype] daddr @@     got unsigned int [usertype] ipv4_nh @@
   net/core/filter.c:2321:45: sparse:     expected restricted __be32 [usertype] daddr
   net/core/filter.c:2321:45: sparse:     got unsigned int [usertype] ipv4_nh
   net/core/filter.c:10993:31: sparse: sparse: symbol 'sk_filter_verifier_ops' was not declared. Should it be static?
   net/core/filter.c:11000:27: sparse: sparse: symbol 'sk_filter_prog_ops' was not declared. Should it be static?
   net/core/filter.c:11004:31: sparse: sparse: symbol 'tc_cls_act_verifier_ops' was not declared. Should it be static?
   net/core/filter.c:11013:27: sparse: sparse: symbol 'tc_cls_act_prog_ops' was not declared. Should it be static?
   net/core/filter.c:11017:31: sparse: sparse: symbol 'xdp_verifier_ops' was not declared. Should it be static?
   net/core/filter.c:11029:31: sparse: sparse: symbol 'cg_skb_verifier_ops' was not declared. Should it be static?
   net/core/filter.c:11035:27: sparse: sparse: symbol 'cg_skb_prog_ops' was not declared. Should it be static?
   net/core/filter.c:11039:31: sparse: sparse: symbol 'lwt_in_verifier_ops' was not declared. Should it be static?
   net/core/filter.c:11045:27: sparse: sparse: symbol 'lwt_in_prog_ops' was not declared. Should it be static?
   net/core/filter.c:11049:31: sparse: sparse: symbol 'lwt_out_verifier_ops' was not declared. Should it be static?
   net/core/filter.c:11055:27: sparse: sparse: symbol 'lwt_out_prog_ops' was not declared. Should it be static?
   net/core/filter.c:11059:31: sparse: sparse: symbol 'lwt_xmit_verifier_ops' was not declared. Should it be static?
   net/core/filter.c:11066:27: sparse: sparse: symbol 'lwt_xmit_prog_ops' was not declared. Should it be static?
   net/core/filter.c:11070:31: sparse: sparse: symbol 'lwt_seg6local_verifier_ops' was not declared. Should it be static?
   net/core/filter.c:11076:27: sparse: sparse: symbol 'lwt_seg6local_prog_ops' was not declared. Should it be static?
   net/core/filter.c:11079:31: sparse: sparse: symbol 'cg_sock_verifier_ops' was not declared. Should it be static?
   net/core/filter.c:11085:27: sparse: sparse: symbol 'cg_sock_prog_ops' was not declared. Should it be static?
   net/core/filter.c:11088:31: sparse: sparse: symbol 'cg_sock_addr_verifier_ops' was not declared. Should it be static?
   net/core/filter.c:11094:27: sparse: sparse: symbol 'cg_sock_addr_prog_ops' was not declared. Should it be static?
   net/core/filter.c:11097:31: sparse: sparse: symbol 'sock_ops_verifier_ops' was not declared. Should it be static?
   net/core/filter.c:11103:27: sparse: sparse: symbol 'sock_ops_prog_ops' was not declared. Should it be static?
   net/core/filter.c:11106:31: sparse: sparse: symbol 'sk_skb_verifier_ops' was not declared. Should it be static?
   net/core/filter.c:11113:27: sparse: sparse: symbol 'sk_skb_prog_ops' was not declared. Should it be static?
   net/core/filter.c:11116:31: sparse: sparse: symbol 'sk_msg_verifier_ops' was not declared. Should it be static?
   net/core/filter.c:11123:27: sparse: sparse: symbol 'sk_msg_prog_ops' was not declared. Should it be static?
   net/core/filter.c:11126:31: sparse: sparse: symbol 'flow_dissector_verifier_ops' was not declared. Should it be static?
   net/core/filter.c:11132:27: sparse: sparse: symbol 'flow_dissector_prog_ops' was not declared. Should it be static?
   net/core/filter.c:11460:31: sparse: sparse: symbol 'sk_reuseport_verifier_ops' was not declared. Should it be static?
   net/core/filter.c:11466:27: sparse: sparse: symbol 'sk_reuseport_prog_ops' was not declared. Should it be static?
   net/core/filter.c:11668:27: sparse: sparse: symbol 'sk_lookup_prog_ops' was not declared. Should it be static?
   net/core/filter.c:11672:31: sparse: sparse: symbol 'sk_lookup_verifier_ops' was not declared. Should it be static?
   net/core/filter.c:1931:43: sparse: sparse: incorrect type in argument 2 (different base types) @@     expected restricted __wsum [usertype] diff @@     got unsigned long long [usertype] to @@
   net/core/filter.c:1931:43: sparse:     expected restricted __wsum [usertype] diff
   net/core/filter.c:1931:43: sparse:     got unsigned long long [usertype] to
   net/core/filter.c:1934:36: sparse: sparse: incorrect type in argument 2 (different base types) @@     expected restricted __be16 [usertype] old @@     got unsigned long long [usertype] from @@
   net/core/filter.c:1934:36: sparse:     expected restricted __be16 [usertype] old
   net/core/filter.c:1934:36: sparse:     got unsigned long long [usertype] from
   net/core/filter.c:1934:42: sparse: sparse: incorrect type in argument 3 (different base types) @@     expected restricted __be16 [usertype] new @@     got unsigned long long [usertype] to @@
   net/core/filter.c:1934:42: sparse:     expected restricted __be16 [usertype] new
   net/core/filter.c:1934:42: sparse:     got unsigned long long [usertype] to
   net/core/filter.c:1937:36: sparse: sparse: incorrect type in argument 2 (different base types) @@     expected restricted __be32 [usertype] from @@     got unsigned long long [usertype] from @@
   net/core/filter.c:1937:36: sparse:     expected restricted __be32 [usertype] from
   net/core/filter.c:1937:36: sparse:     got unsigned long long [usertype] from
   net/core/filter.c:1937:42: sparse: sparse: incorrect type in argument 3 (different base types) @@     expected restricted __be32 [usertype] to @@     got unsigned long long [usertype] to @@
   net/core/filter.c:1937:42: sparse:     expected restricted __be32 [usertype] to
   net/core/filter.c:1937:42: sparse:     got unsigned long long [usertype] to
   net/core/filter.c:1982:59: sparse: sparse: incorrect type in argument 3 (different base types) @@     expected restricted __wsum [usertype] diff @@     got unsigned long long [usertype] to @@
   net/core/filter.c:1982:59: sparse:     expected restricted __wsum [usertype] diff
   net/core/filter.c:1982:59: sparse:     got unsigned long long [usertype] to
   net/core/filter.c:1985:52: sparse: sparse: incorrect type in argument 3 (different base types) @@     expected restricted __be16 [usertype] from @@     got unsigned long long [usertype] from @@
   net/core/filter.c:1985:52: sparse:     expected restricted __be16 [usertype] from
   net/core/filter.c:1985:52: sparse:     got unsigned long long [usertype] from
   net/core/filter.c:1985:58: sparse: sparse: incorrect type in argument 4 (different base types) @@     expected restricted __be16 [usertype] to @@     got unsigned long long [usertype] to @@
   net/core/filter.c:1985:58: sparse:     expected restricted __be16 [usertype] to
   net/core/filter.c:1985:58: sparse:     got unsigned long long [usertype] to
   net/core/filter.c:1988:52: sparse: sparse: incorrect type in argument 3 (different base types) @@     expected restricted __be32 [usertype] from @@     got unsigned long long [usertype] from @@
   net/core/filter.c:1988:52: sparse:     expected restricted __be32 [usertype] from
   net/core/filter.c:1988:52: sparse:     got unsigned long long [usertype] from
   net/core/filter.c:1988:58: sparse: sparse: incorrect type in argument 4 (different base types) @@     expected restricted __be32 [usertype] to @@     got unsigned long long [usertype] to @@
   net/core/filter.c:1988:58: sparse:     expected restricted __be32 [usertype] to
   net/core/filter.c:1988:58: sparse:     got unsigned long long [usertype] to
>> net/core/filter.c:2023:39: sparse: sparse: incorrect type in return expression (different base types) @@     expected unsigned long long @@     got restricted __sum16 @@
   net/core/filter.c:2023:39: sparse:     expected unsigned long long
   net/core/filter.c:2023:39: sparse:     got restricted __sum16
   net/core/filter.c:2026:39: sparse: sparse: incorrect type in return expression (different base types) @@     expected unsigned long long @@     got restricted __sum16 @@
   net/core/filter.c:2026:39: sparse:     expected unsigned long long
   net/core/filter.c:2026:39: sparse:     got restricted __sum16
   net/core/filter.c:2029:39: sparse: sparse: incorrect type in return expression (different base types) @@     expected unsigned long long @@     got restricted __sum16 @@
   net/core/filter.c:2029:39: sparse:     expected unsigned long long
   net/core/filter.c:2029:39: sparse:     got restricted __sum16
>> net/core/filter.c:2031:16: sparse: sparse: incorrect type in return expression (different base types) @@     expected unsigned long long @@     got restricted __wsum [usertype] seed @@
   net/core/filter.c:2031:16: sparse:     expected unsigned long long
   net/core/filter.c:2031:16: sparse:     got restricted __wsum [usertype] seed
   net/core/filter.c:2053:35: sparse: sparse: incorrect type in return expression (different base types) @@     expected unsigned long long @@     got restricted __wsum [usertype] csum @@
   net/core/filter.c:2053:35: sparse:     expected unsigned long long
   net/core/filter.c:2053:35: sparse:     got restricted __wsum [usertype] csum

vim +2023 net/core/filter.c

  1956	
  1957	BPF_CALL_5(bpf_l4_csum_replace, struct sk_buff *, skb, u32, offset,
  1958		   u64, from, u64, to, u64, flags)
  1959	{
  1960		bool is_pseudo = flags & BPF_F_PSEUDO_HDR;
  1961		bool is_mmzero = flags & BPF_F_MARK_MANGLED_0;
  1962		bool do_mforce = flags & BPF_F_MARK_ENFORCE;
  1963		__sum16 *ptr;
  1964	
  1965		if (unlikely(flags & ~(BPF_F_MARK_MANGLED_0 | BPF_F_MARK_ENFORCE |
  1966				       BPF_F_PSEUDO_HDR | BPF_F_HDR_FIELD_MASK)))
  1967			return -EINVAL;
  1968		if (unlikely(offset > 0xffff || offset & 1))
  1969			return -EFAULT;
  1970		if (unlikely(bpf_try_make_writable(skb, offset + sizeof(*ptr))))
  1971			return -EFAULT;
  1972	
  1973		ptr = (__sum16 *)(skb->data + offset);
  1974		if (is_mmzero && !do_mforce && !*ptr)
  1975			return 0;
  1976	
  1977		switch (flags & BPF_F_HDR_FIELD_MASK) {
  1978		case 0:
  1979			if (unlikely(from != 0))
  1980				return -EINVAL;
  1981	
  1982			inet_proto_csum_replace_by_diff(ptr, skb, to, is_pseudo);
  1983			break;
  1984		case 2:
> 1985			inet_proto_csum_replace2(ptr, skb, from, to, is_pseudo);
  1986			break;
  1987		case 4:
  1988			inet_proto_csum_replace4(ptr, skb, from, to, is_pseudo);
  1989			break;
  1990		default:
  1991			return -EINVAL;
  1992		}
  1993	
  1994		if (is_mmzero && !*ptr)
  1995			*ptr = CSUM_MANGLED_0;
  1996		return 0;
  1997	}
  1998	
  1999	static const struct bpf_func_proto bpf_l4_csum_replace_proto = {
  2000		.func		= bpf_l4_csum_replace,
  2001		.gpl_only	= false,
  2002		.ret_type	= RET_INTEGER,
  2003		.arg1_type	= ARG_PTR_TO_CTX,
  2004		.arg2_type	= ARG_ANYTHING,
  2005		.arg3_type	= ARG_ANYTHING,
  2006		.arg4_type	= ARG_ANYTHING,
  2007		.arg5_type	= ARG_ANYTHING,
  2008	};
  2009	
  2010	BPF_CALL_5(bpf_csum_diff, __be32 *, from, u32, from_size,
  2011		   __be32 *, to, u32, to_size, __wsum, seed)
  2012	{
  2013		/* This is quite flexible, some examples:
  2014		 *
  2015		 * from_size == 0, to_size > 0,  seed := csum --> pushing data
  2016		 * from_size > 0,  to_size == 0, seed := csum --> pulling data
  2017		 * from_size > 0,  to_size > 0,  seed := 0    --> diffing data
  2018		 *
  2019		 * Even for diffing, from_size and to_size don't need to be equal.
  2020		 */
  2021	
  2022		if (from_size && to_size)
> 2023			return csum_from32to16(csum_sub(csum_partial(to, to_size, seed),
  2024							csum_partial(from, from_size, 0)));
  2025		if (to_size)
  2026			return csum_from32to16(csum_partial(to, to_size, seed));
  2027	
  2028		if (from_size)
  2029			return csum_from32to16(~csum_partial(from, from_size, ~seed));
  2030	
> 2031		return seed;
  2032	}
  2033	

-- 
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH bpf-next 4/5] selftests/bpf: Add benchmark for bpf_csum_diff() helper
  2024-10-22 17:58         ` Puranjay Mohan
@ 2024-10-23 15:37           ` Puranjay Mohan
  0 siblings, 0 replies; 24+ messages in thread
From: Puranjay Mohan @ 2024-10-23 15:37 UTC (permalink / raw)
  To: Andrii Nakryiko
  Cc: Albert Ou, Alexei Starovoitov, Andrew Morton, Andrii Nakryiko,
	bpf, Daniel Borkmann, David S. Miller, Eduard Zingerman,
	Eric Dumazet, Hao Luo, Helge Deller, Jakub Kicinski,
	James E.J. Bottomley, Jiri Olsa, John Fastabend, KP Singh,
	linux-kernel, linux-parisc, linux-riscv, Martin KaFai Lau,
	Mykola Lysenko, netdev, Palmer Dabbelt, Paolo Abeni,
	Paul Walmsley, Shuah Khan, Song Liu, Stanislav Fomichev,
	Yonghong Song

Puranjay Mohan <puranjay@kernel.org> writes:

> Andrii Nakryiko <andrii.nakryiko@gmail.com> writes:
>
>> On Tue, Oct 22, 2024 at 3:21 AM Puranjay Mohan <puranjay@kernel.org> wrote:
>>>
>>> Andrii Nakryiko <andrii.nakryiko@gmail.com> writes:
>>>
>>> > On Mon, Oct 21, 2024 at 5:22 AM Puranjay Mohan <puranjay@kernel.org> wrote:
>>> >>
>>> >> Add a microbenchmark for bpf_csum_diff() helper. This benchmark works by
>>> >> filling a 4KB buffer with random data and calculating the internet
>>> >> checksum on different parts of this buffer using bpf_csum_diff().
>>> >>
>>> >> Example run using ./benchs/run_bench_csum_diff.sh on x86_64:
>>> >>
>>> >> [bpf]$ ./benchs/run_bench_csum_diff.sh
>>> >> 4                    2.296 ± 0.066M/s (drops 0.000 ± 0.000M/s)
>>> >> 8                    2.320 ± 0.003M/s (drops 0.000 ± 0.000M/s)
>>> >> 16                   2.315 ± 0.001M/s (drops 0.000 ± 0.000M/s)
>>> >> 20                   2.318 ± 0.001M/s (drops 0.000 ± 0.000M/s)
>>> >> 32                   2.308 ± 0.003M/s (drops 0.000 ± 0.000M/s)
>>> >> 40                   2.300 ± 0.029M/s (drops 0.000 ± 0.000M/s)
>>> >> 64                   2.286 ± 0.001M/s (drops 0.000 ± 0.000M/s)
>>> >> 128                  2.250 ± 0.001M/s (drops 0.000 ± 0.000M/s)
>>> >> 256                  2.173 ± 0.001M/s (drops 0.000 ± 0.000M/s)
>>> >> 512                  2.023 ± 0.055M/s (drops 0.000 ± 0.000M/s)
>>> >
>>> > you are not benchmarking bpf_csum_diff(), you are benchmarking how
>>> > often you can call bpf_prog_test_run(). Add some batching on the BPF
>>> > side, these numbers tell you that there is no difference between
>>> > calculating checksum for 4 bytes and for 512, that didn't seem strange
>>> > to you?
>>>
>>> This didn't seem strange to me because if you see the tables I added to
>>> the cover letter, there is a clear improvement after optimizing the
>>> helper and arm64 even shows a linear drop going from 4 bytes to 512
>>> bytes, even after the optimization.
>>>
>>
>> Regardless of optimization, it's strange that throughput barely
>> differs when you vary the amount of work by more than 100x. This
>> wouldn't be strange if this checksum calculation was some sort of
>> cryptographic hash, where it's intentional to have the same timing
>> regardless of amount of work, or something along those lines. But I
>> don't think that's the case here.
>>
>> But as it is right now, this benchmark is benchmarking
>> bpf_prog_test_run(), as I mentioned, which seems to be bottlenecking
>> at about 2mln/s throughput for your machine. bpf_csum_diff()'s
>> overhead is trivial compared to bpf_prog_test_run() overhead and
>> syscall/context switch overhead.
>>
>> We shouldn't add the benchmark that doesn't benchmark the right thing.
>> So just add a bpf_for(i, 0, 100) loop doing bpf_csum_diff(), and then
>> do atomic increment *after* the loop (to minimize atomics overhead).
>
> Thanks, now I undestand what you meant. Will add the bpf_for() in the
> next version.

I have decided to drop this patch, as even after adding bpf_for() the
difference between 4B and 512B is not that much. So, benchmarking
bpf_csum_diff() using this triggering-based framework is not useful.

So, v2 will not have this patch, but the cover letter will still have the
tables to show the difference before/after the optimization.

Thanks,
Puranjay

^ permalink raw reply	[flat|nested] 24+ messages in thread

end of thread, other threads:[~2024-10-23 15:37 UTC | newest]

Thread overview: 24+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2024-10-21 12:21 [PATCH bpf-next 0/5] Optimize bpf_csum_diff() and homogenize for all archs Puranjay Mohan
2024-10-21 12:21 ` [PATCH bpf-next 1/5] net: checksum: move from32to16() to generic header Puranjay Mohan
2024-10-21 13:41   ` Daniel Borkmann
2024-10-22  9:49   ` Toke Høiland-Jørgensen
2024-10-22 13:50   ` kernel test robot
2024-10-21 12:21 ` [PATCH bpf-next 2/5] bpf: bpf_csum_diff: optimize and homogenize for all archs Puranjay Mohan
2024-10-21 13:42   ` Daniel Borkmann
2024-10-22  9:54   ` Toke Høiland-Jørgensen
2024-10-22 18:09   ` kernel test robot
2024-10-21 12:21 ` [PATCH bpf-next 3/5] selftests/bpf: don't mask result of bpf_csum_diff() in test_verifier Puranjay Mohan
2024-10-21 13:01   ` Helge Deller
2024-10-21 13:14     ` Puranjay Mohan
2024-10-21 14:04       ` Helge Deller
2024-10-21 13:42   ` Daniel Borkmann
2024-10-22  9:55   ` Toke Høiland-Jørgensen
2024-10-21 12:21 ` [PATCH bpf-next 4/5] selftests/bpf: Add benchmark for bpf_csum_diff() helper Puranjay Mohan
2024-10-21 13:43   ` Daniel Borkmann
2024-10-21 23:28   ` Andrii Nakryiko
2024-10-22 10:21     ` Puranjay Mohan
2024-10-22 17:47       ` Andrii Nakryiko
2024-10-22 17:58         ` Puranjay Mohan
2024-10-23 15:37           ` Puranjay Mohan
2024-10-21 12:21 ` [PATCH bpf-next 5/5] selftests/bpf: Add a selftest for bpf_csum_diff() Puranjay Mohan
2024-10-21 13:44   ` Daniel Borkmann

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).