netdev.vger.kernel.org archive mirror
* [PATCH bpf-next v2 0/4] Optimize bpf_csum_diff() and homogenize for all archs
@ 2024-10-23 15:39 Puranjay Mohan
From: Puranjay Mohan @ 2024-10-23 15:39 UTC
  To: Albert Ou, Alexei Starovoitov, Andrew Morton, Andrii Nakryiko,
	bpf, Daniel Borkmann, David S. Miller, Eduard Zingerman,
	Eric Dumazet, Hao Luo, Helge Deller, Jakub Kicinski,
	James E.J. Bottomley, Jiri Olsa, John Fastabend, KP Singh,
	linux-kernel, linux-parisc, linux-riscv, Martin KaFai Lau,
	Mykola Lysenko, netdev, Palmer Dabbelt, Paolo Abeni,
	Paul Walmsley, Puranjay Mohan, Puranjay Mohan, Shuah Khan,
	Song Liu, Stanislav Fomichev, Yonghong Song

Changes in v2:
v1: https://lore.kernel.org/all/20241021122112.101513-1-puranjay@kernel.org/
- Remove the patch that adds the benchmark as it is not useful enough to be
  added to the tree.
- Fix a sparse warning in patch 1.
- Add reviewed-by and acked-by tags.

NOTE: There are some sparse warnings in net/core/filter.c but those are not
worth fixing because bpf helpers take and return u64 values and using them
in csum related functions that take and return __sum16 / __wsum would need
a lot of casts everywhere.

The bpf_csum_diff() helper currently returns different values on different
architectures because it calls csum_partial(), which is either implemented
by the architecture (x86_64, arm, etc.) or provided by the generic
implementation in lib/checksum.c (arm64, riscv, etc.).

The implementation in lib/checksum.c returns the folded result, which is
16 bits wide, but an architecture-specific implementation can return an
unfolded value that is wider than 16 bits.
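To illustrate what "folded" means here, a minimal user-space sketch (not
the kernel code; the name fold32to16() is made up for this example). It
adds the upper 16 bits of a 32-bit ones' complement accumulator back into
the lower 16 bits, so two different unfolded values can represent the same
16-bit checksum:

```c
#include <stdint.h>

/*
 * Fold a 32-bit ones' complement accumulator into 16 bits.
 * Hypothetical helper name, for illustration only.
 */
static uint16_t fold32to16(uint32_t sum)
{
	/* Add the high half into the low half; this may carry into bit 16. */
	sum = (sum & 0xffff) + (sum >> 16);
	/* Add that carry back in (end-around carry). */
	sum = (sum & 0xffff) + (sum >> 16);
	return (uint16_t)sum;
}
```

For example, 0x0001fffe and 0x0000ffff are different 32-bit values, but
both fold to 0xffff, which is why comparing unfolded results across
architectures is not meaningful.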

The helper uses a per-cpu scratchpad buffer: it copies the data into this
buffer and then computes the csum over it. This can be optimised by
utilising some mathematical properties of csum.
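A rough user-space sketch of the kind of property meant here, under stated
assumptions: it works on 32-bit words, all function names and the
SCRATCH_WORDS limit are made up for the example, and it is not the code in
net/core/filter.c. In ones' complement arithmetic, negation is a bitwise
NOT, so summing the complemented "from" words followed by the "to" words
gives the same result as csum(to) - csum(from), without any copy:

```c
#include <stddef.h>
#include <stdint.h>

#define SCRATCH_WORDS 128	/* assumed scratch size for this sketch */

/* Ones' complement addition of two 32-bit values (end-around carry). */
static uint32_t oc_add(uint32_t a, uint32_t b)
{
	uint64_t s = (uint64_t)a + b;

	return (uint32_t)(s + (s >> 32));
}

/* Ones' complement sum of nwords 32-bit words, starting from seed. */
static uint32_t oc_sum(const uint32_t *buf, size_t nwords, uint32_t seed)
{
	uint32_t sum = seed;

	for (size_t i = 0; i < nwords; i++)
		sum = oc_add(sum, buf[i]);
	return sum;
}

/* Copy ~from and to into one scratch buffer, then sum it once. */
static uint32_t diff_scratch(const uint32_t *from, size_t nfrom,
			     const uint32_t *to, size_t nto, uint32_t seed)
{
	uint32_t scratch[SCRATCH_WORDS];
	size_t j = 0;

	if (nfrom + nto > SCRATCH_WORDS)
		return 0;	/* sketch only: no real error handling */
	for (size_t i = 0; i < nfrom; i++)
		scratch[j++] = ~from[i];
	for (size_t i = 0; i < nto; i++)
		scratch[j++] = to[i];
	return oc_sum(scratch, j, seed);
}

/* No copy: sum both inputs in place and subtract (add the complement). */
static uint32_t diff_direct(const uint32_t *from, size_t nfrom,
			    const uint32_t *to, size_t nto, uint32_t seed)
{
	return oc_add(oc_sum(to, nto, seed), ~oc_sum(from, nfrom, 0));
}
```

With from = {0x11111111, 0x22222222}, to = {0x33333333, 0x44444444, 0x5}
and seed 0, both functions return 0x44444449. The two approaches are
congruent modulo 2^32 - 1, so they can differ only in which of the two
ones' complement representations of zero (0 or 0xffffffff) they produce,
for instance when "from" is empty.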

Patch 1 in this series does preparatory work for homogenizing the helper.
Patch 2 changes the helper itself. The performance gain can be seen in the
tables below, which were generated using the benchmark built in patch 4 of
v1 of this series:

  x86-64:
  +-------------+------------------+------------------+-------------+
  | Buffer Size |      Before      |      After       | Improvement |
  +-------------+------------------+------------------+-------------+
  |      4      | 2.296 ± 0.066M/s | 3.415 ± 0.001M/s |   48.73  %  |
  |      8      | 2.320 ± 0.003M/s | 3.409 ± 0.003M/s |   46.93  %  |
  |      16     | 2.315 ± 0.001M/s | 3.414 ± 0.003M/s |   47.47  %  |
  |      20     | 2.318 ± 0.001M/s | 3.416 ± 0.001M/s |   47.36  %  |
  |      32     | 2.308 ± 0.003M/s | 3.413 ± 0.003M/s |   47.87  %  |
  |      40     | 2.300 ± 0.029M/s | 3.413 ± 0.003M/s |   48.39  %  |
  |      64     | 2.286 ± 0.001M/s | 3.410 ± 0.001M/s |   49.16  %  |
  |      128    | 2.250 ± 0.001M/s | 3.404 ± 0.001M/s |   51.28  %  |
  |      256    | 2.173 ± 0.001M/s | 3.383 ± 0.001M/s |   55.68  %  |
  |      512    | 2.023 ± 0.055M/s | 3.340 ± 0.001M/s |   65.10  %  |
  +-------------+------------------+------------------+-------------+

  ARM64:
  +-------------+------------------+------------------+-------------+
  | Buffer Size |      Before      |      After       | Improvement |
  +-------------+------------------+------------------+-------------+
  |      4      | 1.397 ± 0.005M/s | 1.493 ± 0.005M/s |    6.87  %  |
  |      8      | 1.402 ± 0.002M/s | 1.489 ± 0.002M/s |    6.20  %  |
  |      16     | 1.391 ± 0.001M/s | 1.481 ± 0.001M/s |    6.47  %  |
  |      20     | 1.379 ± 0.001M/s | 1.477 ± 0.001M/s |    7.10  %  |
  |      32     | 1.358 ± 0.001M/s | 1.469 ± 0.002M/s |    8.17  %  |
  |      40     | 1.339 ± 0.001M/s | 1.462 ± 0.002M/s |    9.18  %  |
  |      64     | 1.302 ± 0.002M/s | 1.449 ± 0.003M/s |   11.29  %  |
  |      128    | 1.214 ± 0.001M/s | 1.443 ± 0.003M/s |   18.86  %  |
  |      256    | 1.080 ± 0.001M/s | 1.423 ± 0.001M/s |   31.75  %  |
  |      512    | 0.887 ± 0.001M/s | 1.411 ± 0.002M/s |   59.07  %  |
  +-------------+------------------+------------------+-------------+

Patch 3 reverts a hack that was done to make the selftest pass on all
architectures.

Patch 4 adds a selftest for this helper that verifies its results in
multiple modes and edge cases.

Puranjay Mohan (4):
  net: checksum: move from32to16() to generic header
  bpf: bpf_csum_diff: optimize and homogenize for all archs
  selftests/bpf: don't mask result of bpf_csum_diff() in test_verifier
  selftests/bpf: Add a selftest for bpf_csum_diff()

 arch/parisc/lib/checksum.c                    |  13 +-
 include/net/checksum.h                        |   6 +
 lib/checksum.c                                |  11 +-
 net/core/filter.c                             |  37 +-
 .../selftests/bpf/prog_tests/test_csum_diff.c | 408 ++++++++++++++++++
 .../selftests/bpf/progs/csum_diff_test.c      |  42 ++
 .../bpf/progs/verifier_array_access.c         |   3 +-
 7 files changed, 469 insertions(+), 51 deletions(-)
 create mode 100644 tools/testing/selftests/bpf/prog_tests/test_csum_diff.c
 create mode 100644 tools/testing/selftests/bpf/progs/csum_diff_test.c

-- 
2.40.1



Thread overview (8+ messages):
2024-10-23 15:39 [PATCH bpf-next v2 0/4] Optimize bpf_csum_diff() and homogenize for all archs Puranjay Mohan
2024-10-23 15:39 ` [PATCH bpf-next v2 1/4] net: checksum: move from32to16() to generic header Puranjay Mohan
2024-10-23 15:39 ` [PATCH bpf-next v2 2/4] bpf: bpf_csum_diff: optimize and homogenize for all archs Puranjay Mohan
2024-10-25  7:38   ` kernel test robot
2024-10-25 10:11     ` Puranjay Mohan
2024-10-25 11:32       ` Daniel Borkmann
2024-10-23 15:39 ` [PATCH bpf-next v2 3/4] selftests/bpf: don't mask result of bpf_csum_diff() in test_verifier Puranjay Mohan
2024-10-23 15:39 ` [PATCH bpf-next v2 4/4] selftests/bpf: Add a selftest for bpf_csum_diff() Puranjay Mohan
