* [PATCH bpf-next v4 00/10] Add bpf_xdp_get_xfrm_state() kfunc
@ 2023-12-04 20:56 Daniel Xu
From: Daniel Xu @ 2023-12-04 20:56 UTC
To: bpf, linux-kernel, linux-kselftest, netdev, llvm,
steffen.klassert, antony.antony, alexei.starovoitov,
yonghong.song, eddyz87
Cc: devel
This patchset adds two kfunc helpers, bpf_xdp_get_xfrm_state() and
bpf_xdp_xfrm_state_release(), that wrap xfrm_state_lookup() and
xfrm_state_put(). The intent is to support software RSS (via XDP) for
the ongoing/upcoming IPsec pcpu work [0]. Recent experiments performed
on (hopefully) reproducible AWS testbeds indicate that single-tunnel
pcpu IPsec can reach line rate on 100G ENA NICs.
Note that this patchset only tests/shows generic xfrm_state access. The
"secret sauce" (if you can really even call it that) involves accessing
a soon-to-be-upstreamed pcpu_num field in xfrm_state. An early example is
available here [1].
[0]: https://datatracker.ietf.org/doc/draft-ietf-ipsecme-multi-sa-performance/03/
[1]: https://github.com/danobi/xdp-tools/blob/e89a1c617aba3b50d990f779357d6ce2863ecb27/xdp-bench/xdp_redirect_cpumap.bpf.c#L385-L406
Changes from v3:
* Place all xfrm bpf integrations in xfrm_bpf.c
* Avoid using nval as a temporary
* Rebase to bpf-next
* Remove extraneous __failure_unpriv annotation for verifier tests
Changes from v2:
* Fix/simplify BPF_CORE_WRITE_BITFIELD() algorithm
* Added verifier tests for bitfield writes
* Fix state leakage across test_tunnel subtests
Changes from v1:
* Move xfrm tunnel tests to test_progs
* Fix writing to opts->error when opts is invalid
* Use __bpf_kfunc_start_defs()
* Remove unused vxlanhdr definition
* Add and use BPF_CORE_WRITE_BITFIELD() macro
* Make series bisect clean
Changes from RFCv2:
* Rebased to ipsec-next
* Fix netns leak
Changes from RFCv1:
* Add Antony's commit tags
* Add KF_ACQUIRE and KF_RELEASE semantics
Daniel Xu (10):
xfrm: bpf: Move xfrm_interface_bpf.c to xfrm_bpf.c
bpf: xfrm: Add bpf_xdp_get_xfrm_state() kfunc
bpf: xfrm: Add bpf_xdp_xfrm_state_release() kfunc
libbpf: Add BPF_CORE_WRITE_BITFIELD() macro
bpf: selftests: test_loader: Support __btf_path() annotation
libbpf: selftests: Add verifier tests for CO-RE bitfield writes
bpf: selftests: test_tunnel: Setup fresh topology for each subtest
bpf: selftests: test_tunnel: Use vmlinux.h declarations
bpf: selftests: Move xfrm tunnel test to test_progs
bpf: xfrm: Add selftest for bpf_xdp_get_xfrm_state()
include/net/xfrm.h | 9 +
net/xfrm/Makefile | 7 +-
net/xfrm/xfrm_bpf.c | 232 ++++++++++++++++++
net/xfrm/xfrm_interface_bpf.c | 110 ---------
net/xfrm/xfrm_policy.c | 2 +
tools/lib/bpf/bpf_core_read.h | 32 +++
.../selftests/bpf/prog_tests/test_tunnel.c | 162 +++++++++++-
.../selftests/bpf/prog_tests/verifier.c | 2 +
tools/testing/selftests/bpf/progs/bpf_misc.h | 1 +
.../selftests/bpf/progs/bpf_tracing_net.h | 1 +
.../selftests/bpf/progs/test_tunnel_kern.c | 138 ++++++-----
.../bpf/progs/verifier_bitfield_write.c | 100 ++++++++
tools/testing/selftests/bpf/test_loader.c | 7 +
tools/testing/selftests/bpf/test_tunnel.sh | 92 -------
14 files changed, 624 insertions(+), 271 deletions(-)
create mode 100644 net/xfrm/xfrm_bpf.c
delete mode 100644 net/xfrm/xfrm_interface_bpf.c
create mode 100644 tools/testing/selftests/bpf/progs/verifier_bitfield_write.c
--
2.42.1
* [PATCH bpf-next v4 04/10] libbpf: Add BPF_CORE_WRITE_BITFIELD() macro
From: Daniel Xu @ 2023-12-04 20:56 UTC
To: daniel, ast, nathan, andrii, ndesaulniers, steffen.klassert,
antony.antony, alexei.starovoitov, yonghong.song, eddyz87
Cc: martin.lau, song, john.fastabend, kpsingh, sdf, haoluo, jolsa,
trix, bpf, linux-kernel, llvm, devel, netdev, Jonathan Lemon
=== Motivation ===
Similar to reading from CO-RE bitfields, we need a CO-RE aware bitfield
writing wrapper to make the verifier happy.
Two alternatives to this approach are:
1. Use the upcoming `preserve_static_offset` [0] attribute to disable
CO-RE on specific structs.
2. Use broader byte-sized writes to write to bitfields.
(1) is a bit hard to use. It requires specific and not-very-obvious
annotations to bpftool generated vmlinux.h. It's also not generally
available in released LLVM versions yet.
(2) makes the code quite hard to read and write. And especially if
BPF_CORE_READ_BITFIELD() is already being used, it makes more sense to
have an inverse helper for writing.
=== Implementation details ===
Since the logic is a bit non-obvious, I thought it would be helpful
to explain exactly what's going on.
To start, it helps to explain what LSHIFT_U64 (lshift) and RSHIFT_U64
(rshift) are designed to mean. Consider the core of the
BPF_CORE_READ_BITFIELD() algorithm:
val <<= __CORE_RELO(s, field, LSHIFT_U64);
val = val >> __CORE_RELO(s, field, RSHIFT_U64);
Basically what happens is we lshift to clear the non-relevant (blank)
higher order bits. Then we rshift to bring the relevant bits (bitfield)
down to LSB position (while also clearing blank lower order bits). To
illustrate:
Start: ........XXX......
Lshift: XXX......00000000
Rshift: 00000000000000XXX
where `.` means blank bit, `0` means 0 bit, and `X` means bitfield bit.
After the two operations, the bitfield is ready to be interpreted as a
regular integer.
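
To make the two shifts concrete, here is a minimal standalone sketch of
the read side for an unsigned bitfield. The helper name
read_bitfield_u64() is hypothetical (it is not part of libbpf), and the
shift counts are passed in by hand instead of coming from
__CORE_RELO():

```c
#include <stdint.h>

/*
 * Illustrative helper, not part of libbpf: extract an unsigned bitfield
 * from a u64 using the two shift counts that CO-RE would provide.
 */
static uint64_t read_bitfield_u64(uint64_t val, unsigned int lshift,
				  unsigned int rshift)
{
	val <<= lshift;		/* drop the blank higher order bits */
	return val >> rshift;	/* bring the bitfield down to the LSB */
}
```

For a 3-bit field that sits 3 bits above the LSB, lshift is 58 and
rshift is 61, so read_bitfield_u64(0xEF, 58, 61) evaluates to 5.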
Next, we want to build an alternative (but more helpful) mental model
of lshift and rshift. That is, to consider:
* rshift as the total number of blank bits in the u64
* lshift as the number of blank bits left of the bitfield in the u64
Take a moment to consider why that is true by consulting the above
diagram.
With this insight, we can now define the following relationship:
         bitfield
             _
            | |
    0.....00XXX0...00
    |      |   |    |
    |______|   |    |
     lshift    |    |
               |____|
         (rshift - lshift)
That is, we know the number of higher order blank bits is just lshift.
And the number of lower order blank bits is (rshift - lshift).
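
The relationship above can be sketched as a small standalone helper
that recovers the bitfield layout from the two shift counts. The struct
and function names are hypothetical and exist only for illustration;
the sketch assumes the 64-bit model used in the diagram:

```c
/* Illustrative only: layout of a bitfield inside a u64. */
struct bitfield_layout {
	unsigned int hi_pad;	/* blank bits left of the bitfield */
	unsigned int width;	/* number of bitfield bits */
	unsigned int lo_pad;	/* blank bits right of the bitfield */
};

static struct bitfield_layout layout_from_shifts(unsigned int lshift,
						 unsigned int rshift)
{
	struct bitfield_layout l;

	l.hi_pad = lshift;		/* higher order blank bits */
	l.width = 64 - rshift;		/* rshift counts all blank bits */
	l.lo_pad = rshift - lshift;	/* what remains is the lower pad */
	return l;
}
```

The three parts always sum to 64, which is a quick sanity check on any
(lshift, rshift) pair.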
Finally, we can examine the core of the write side algorithm:
mask = (~0ULL << rshift) >> lshift; // 1
val = (val & ~mask) | ((nval << rpad) & mask); // 2
1. Compute a mask where the set bits are the bitfield bits. The first
left shift zeros out exactly the number of blank bits, leaving a
bitfield-sized set of 1s. The subsequent right shift inserts the
correct number of higher order blank bits.
2. On the left of the `|`, mask out the bitfield bits. This creates
0s where the new bitfield bits will go. On the right of the `|`,
bring nval into the correct bit position and mask out any bits
that fall outside of the bitfield. Finally, by OR-ing the two
halves, we get the final set of bits to write back.
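
As a rough userspace sketch of the two steps above, the same arithmetic
can be tried outside of BPF. The helper name write_bitfield_u64() is
hypothetical, the shift counts are supplied by hand rather than by
__CORE_RELO(), and valid counts (lshift <= rshift < 64) are assumed:

```c
#include <stdint.h>

/*
 * Illustrative helper, not part of libbpf: return val with the bitfield
 * described by (lshift, rshift) replaced by nval.
 */
static uint64_t write_bitfield_u64(uint64_t val, uint64_t nval,
				   unsigned int lshift, unsigned int rshift)
{
	unsigned int rpad = rshift - lshift;	/* lower order blank bits */
	uint64_t mask;

	mask = (~0ULL << rshift) >> lshift;		/* step 1 */
	return (val & ~mask) | ((nval << rpad) & mask);	/* step 2 */
}
```

With a 3-bit field 3 bits above the LSB (lshift 58, rshift 61), the
mask is 0x38. Writing 2 into 0xEF gives 0xD7, and writing an oversized
value such as 0xF into 0 gives 0x38, because bits outside the bitfield
are masked off.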
[0]: https://reviews.llvm.org/D133361
Co-developed-by: Eduard Zingerman <eddyz87@gmail.com>
Signed-off-by: Eduard Zingerman <eddyz87@gmail.com>
Co-developed-by: Jonathan Lemon <jlemon@aviatrix.com>
Signed-off-by: Jonathan Lemon <jlemon@aviatrix.com>
Signed-off-by: Daniel Xu <dxu@dxuuu.xyz>
---
tools/lib/bpf/bpf_core_read.h | 32 ++++++++++++++++++++++++++++++++
1 file changed, 32 insertions(+)
diff --git a/tools/lib/bpf/bpf_core_read.h b/tools/lib/bpf/bpf_core_read.h
index 1ac57bb7ac55..7325a12692a3 100644
--- a/tools/lib/bpf/bpf_core_read.h
+++ b/tools/lib/bpf/bpf_core_read.h
@@ -111,6 +111,38 @@ enum bpf_enum_value_kind {
val; \
})
+/*
+ * Write to a bitfield, identified by s->field.
+ * This is the inverse of BPF_CORE_READ_BITFIELD().
+ */
+#define BPF_CORE_WRITE_BITFIELD(s, field, new_val) ({ \
+ void *p = (void *)s + __CORE_RELO(s, field, BYTE_OFFSET); \
+ unsigned int byte_size = __CORE_RELO(s, field, BYTE_SIZE); \
+ unsigned int lshift = __CORE_RELO(s, field, LSHIFT_U64); \
+ unsigned int rshift = __CORE_RELO(s, field, RSHIFT_U64); \
+ unsigned long long mask, val, nval = new_val; \
+ unsigned int rpad = rshift - lshift; \
+ \
+ asm volatile("" : "+r"(p)); \
+ \
+ switch (byte_size) { \
+ case 1: val = *(unsigned char *)p; break; \
+ case 2: val = *(unsigned short *)p; break; \
+ case 4: val = *(unsigned int *)p; break; \
+ case 8: val = *(unsigned long long *)p; break; \
+ } \
+ \
+ mask = (~0ULL << rshift) >> lshift; \
+ val = (val & ~mask) | ((nval << rpad) & mask); \
+ \
+ switch (byte_size) { \
+ case 1: *(unsigned char *)p = val; break; \
+ case 2: *(unsigned short *)p = val; break; \
+ case 4: *(unsigned int *)p = val; break; \
+ case 8: *(unsigned long long *)p = val; break; \
+ } \
+})
+
#define ___bpf_field_ref1(field) (field)
#define ___bpf_field_ref2(type, field) (((typeof(type) *)0)->field)
#define ___bpf_field_ref(args...) \
--
2.42.1
* Re: [PATCH bpf-next v4 04/10] libbpf: Add BPF_CORE_WRITE_BITFIELD() macro
From: Andrii Nakryiko @ 2023-12-05 4:03 UTC
To: Daniel Xu
Cc: daniel, ast, nathan, andrii, ndesaulniers, steffen.klassert,
antony.antony, alexei.starovoitov, yonghong.song, eddyz87,
martin.lau, song, john.fastabend, kpsingh, sdf, haoluo, jolsa,
trix, bpf, linux-kernel, llvm, devel, netdev, Jonathan Lemon
On Mon, Dec 4, 2023 at 12:57 PM Daniel Xu <dxu@dxuuu.xyz> wrote:
>
> === Motivation ===
>
> Similar to reading from CO-RE bitfields, we need a CO-RE aware bitfield
> writing wrapper to make the verifier happy.
>
> Two alternatives to this approach are:
>
> 1. Use the upcoming `preserve_static_offset` [0] attribute to disable
> CO-RE on specific structs.
> 2. Use broader byte-sized writes to write to bitfields.
>
> (1) is a bit hard to use. It requires specific and not-very-obvious
> annotations to bpftool generated vmlinux.h. It's also not generally
> available in released LLVM versions yet.
>
> (2) makes the code quite hard to read and write. And especially if
> BPF_CORE_READ_BITFIELD() is already being used, it makes more sense to
> have an inverse helper for writing.
>
> === Implementation details ===
>
> Since the logic is a bit non-obvious, I thought it would be helpful
> to explain exactly what's going on.
>
> To start, it helps to explain what LSHIFT_U64 (lshift) and RSHIFT_U64
> (rshift) are designed to mean. Consider the core of the
> BPF_CORE_READ_BITFIELD() algorithm:
>
> val <<= __CORE_RELO(s, field, LSHIFT_U64);
> val = val >> __CORE_RELO(s, field, RSHIFT_U64);
>
> Basically what happens is we lshift to clear the non-relevant (blank)
> higher order bits. Then we rshift to bring the relevant bits (bitfield)
> down to LSB position (while also clearing blank lower order bits). To
> illustrate:
>
> Start: ........XXX......
> Lshift: XXX......00000000
> Rshift: 00000000000000XXX
>
> where `.` means blank bit, `0` means 0 bit, and `X` means bitfield bit.
>
> After the two operations, the bitfield is ready to be interpreted as a
> regular integer.
>
> Next, we want to build an alternative (but more helpful) mental model
> of lshift and rshift. That is, to consider:
>
> * rshift as the total number of blank bits in the u64
> * lshift as the number of blank bits left of the bitfield in the u64
>
> Take a moment to consider why that is true by consulting the above
> diagram.
>
> With this insight, we can now define the following relationship:
>
>          bitfield
>              _
>             | |
>     0.....00XXX0...00
>     |      |   |    |
>     |______|   |    |
>      lshift    |    |
>                |____|
>          (rshift - lshift)
>
> That is, we know the number of higher order blank bits is just lshift.
> And the number of lower order blank bits is (rshift - lshift).
>
> Finally, we can examine the core of the write side algorithm:
>
> mask = (~0ULL << rshift) >> lshift; // 1
> val = (val & ~mask) | ((nval << rpad) & mask); // 2
>
> 1. Compute a mask where the set bits are the bitfield bits. The first
> left shift zeros out exactly the number of blank bits, leaving a
> bitfield-sized set of 1s. The subsequent right shift inserts the
> correct number of higher order blank bits.
>
> 2. On the left of the `|`, mask out the bitfield bits. This creates
> 0s where the new bitfield bits will go. On the right of the `|`,
> bring nval into the correct bit position and mask out any bits
> that fall outside of the bitfield. Finally, by OR-ing the two
> halves, we get the final set of bits to write back.
>
> [0]: https://reviews.llvm.org/D133361
> Co-developed-by: Eduard Zingerman <eddyz87@gmail.com>
> Signed-off-by: Eduard Zingerman <eddyz87@gmail.com>
> Co-developed-by: Jonathan Lemon <jlemon@aviatrix.com>
> Signed-off-by: Jonathan Lemon <jlemon@aviatrix.com>
> Signed-off-by: Daniel Xu <dxu@dxuuu.xyz>
> ---
> tools/lib/bpf/bpf_core_read.h | 32 ++++++++++++++++++++++++++++++++
> 1 file changed, 32 insertions(+)
>
LGTM
Acked-by: Andrii Nakryiko <andrii@kernel.org>
> diff --git a/tools/lib/bpf/bpf_core_read.h b/tools/lib/bpf/bpf_core_read.h
> index 1ac57bb7ac55..7325a12692a3 100644
> --- a/tools/lib/bpf/bpf_core_read.h
> +++ b/tools/lib/bpf/bpf_core_read.h
> @@ -111,6 +111,38 @@ enum bpf_enum_value_kind {
> val; \
> })
>
> +/*
> + * Write to a bitfield, identified by s->field.
> + * This is the inverse of BPF_CORE_READ_BITFIELD().
> + */
> +#define BPF_CORE_WRITE_BITFIELD(s, field, new_val) ({ \
> + void *p = (void *)s + __CORE_RELO(s, field, BYTE_OFFSET); \
> + unsigned int byte_size = __CORE_RELO(s, field, BYTE_SIZE); \
> + unsigned int lshift = __CORE_RELO(s, field, LSHIFT_U64); \
> + unsigned int rshift = __CORE_RELO(s, field, RSHIFT_U64); \
> + unsigned long long mask, val, nval = new_val; \
> + unsigned int rpad = rshift - lshift; \
> + \
> + asm volatile("" : "+r"(p)); \
> + \
> + switch (byte_size) { \
> + case 1: val = *(unsigned char *)p; break; \
> + case 2: val = *(unsigned short *)p; break; \
> + case 4: val = *(unsigned int *)p; break; \
> + case 8: val = *(unsigned long long *)p; break; \
> + } \
> + \
> + mask = (~0ULL << rshift) >> lshift; \
> + val = (val & ~mask) | ((nval << rpad) & mask); \
> + \
> + switch (byte_size) { \
> + case 1: *(unsigned char *)p = val; break; \
> + case 2: *(unsigned short *)p = val; break; \
> + case 4: *(unsigned int *)p = val; break; \
> + case 8: *(unsigned long long *)p = val; break; \
> + } \
> +})
> +
> #define ___bpf_field_ref1(field) (field)
> #define ___bpf_field_ref2(type, field) (((typeof(type) *)0)->field)
> #define ___bpf_field_ref(args...) \
> --
> 2.42.1
>