[RFC bpf-next 0/6] Add bpf_xdp_get_xfrm_state() kfunc

* [RFC bpf-next 0/6] Add bpf_xdp_get_xfrm_state() kfunc
@ 2023-10-27 18:46 Daniel Xu
  2023-10-27 18:46 ` [RFC bpf-next 1/6] bpf: xfrm: " Daniel Xu
                   ` (6 more replies)
  0 siblings, 7 replies; 16+ messages in thread
From: Daniel Xu @ 2023-10-27 18:46 UTC
  To: netdev, linux-kernel, bpf, linux-kselftest, steffen.klassert,
	antony.antony
  Cc: devel

This patchset adds a kfunc helper, bpf_xdp_get_xfrm_state(), that wraps
xfrm_state_lookup(). The intent is to support software RSS (via XDP) for
the ongoing/upcoming IPsec pcpu work [0]. Recent experiments performed
on (hopefully) reproducible AWS testbeds indicate that single-tunnel
pcpu IPsec can reach line rate on 100G ENA NICs.

More details about that will be presented at netdev next week [1].

Antony did the initial stable bpf helper - I later ported it to unstable
kfuncs. So for the series, please apply a Co-developed-by for Antony,
provided he acks and signs off on this.

[0]: https://datatracker.ietf.org/doc/html/draft-ietf-ipsecme-multi-sa-performance-02
[1]: https://netdevconf.info/0x17/sessions/workshop/security-workshop.html

Daniel Xu (6):
  bpf: xfrm: Add bpf_xdp_get_xfrm_state() kfunc
  bpf: selftests: test_tunnel: Use ping -6 over ping6
  bpf: selftests: test_tunnel: Mount bpffs if necessary
  bpf: selftests: test_tunnel: Use vmlinux.h declarations
  bpf: selftests: test_tunnel: Disable CO-RE relocations
  bpf: xfrm: Add selftest for bpf_xdp_get_xfrm_state()

 include/net/xfrm.h                            |   9 ++
 net/xfrm/Makefile                             |   1 +
 net/xfrm/xfrm_policy.c                        |   2 +
 net/xfrm/xfrm_state_bpf.c                     | 105 ++++++++++++++++++
 .../selftests/bpf/progs/bpf_tracing_net.h     |   1 +
 .../selftests/bpf/progs/test_tunnel_kern.c    |  95 +++++++++-------
 tools/testing/selftests/bpf/test_tunnel.sh    |  43 ++++---
 7 files changed, 202 insertions(+), 54 deletions(-)
 create mode 100644 net/xfrm/xfrm_state_bpf.c

-- 
2.42.0



* [RFC bpf-next 1/6] bpf: xfrm: Add bpf_xdp_get_xfrm_state() kfunc
  2023-10-27 18:46 [RFC bpf-next 0/6] Add bpf_xdp_get_xfrm_state() kfunc Daniel Xu
@ 2023-10-27 18:46 ` Daniel Xu
  2023-10-28 23:49   ` Alexei Starovoitov
  2023-10-27 18:46 ` [RFC bpf-next 2/6] bpf: selftests: test_tunnel: Use ping -6 over ping6 Daniel Xu
                   ` (5 subsequent siblings)
  6 siblings, 1 reply; 16+ messages in thread
From: Daniel Xu @ 2023-10-27 18:46 UTC
  To: hawk, steffen.klassert, ast, pabeni, daniel, kuba, Herbert Xu,
	davem, john.fastabend, edumazet, antony.antony
  Cc: linux-kernel, netdev, bpf, devel

This commit adds an unstable kfunc helper to access internal xfrm_state
associated with an SA. This is intended to be used for the upcoming
IPsec pcpu work to assign special pcpu SAs to a particular CPU. In other
words: for custom software RSS.

That being said, the function that this kfunc wraps is fairly generic
and used for a lot of xfrm tasks. I'm sure people will find uses
elsewhere over time.

Signed-off-by: Daniel Xu <dxu@dxuuu.xyz>
---
 include/net/xfrm.h        |   9 ++++
 net/xfrm/Makefile         |   1 +
 net/xfrm/xfrm_policy.c    |   2 +
 net/xfrm/xfrm_state_bpf.c | 105 ++++++++++++++++++++++++++++++++++++++
 4 files changed, 117 insertions(+)
 create mode 100644 net/xfrm/xfrm_state_bpf.c

diff --git a/include/net/xfrm.h b/include/net/xfrm.h
index 98d7aa78adda..ab4cf66480f3 100644
--- a/include/net/xfrm.h
+++ b/include/net/xfrm.h
@@ -2188,4 +2188,13 @@ static inline int register_xfrm_interface_bpf(void)
 
 #endif
 
+#if IS_ENABLED(CONFIG_DEBUG_INFO_BTF)
+int register_xfrm_state_bpf(void);
+#else
+static inline int register_xfrm_state_bpf(void)
+{
+	return 0;
+}
+#endif
+
 #endif	/* _NET_XFRM_H */
diff --git a/net/xfrm/Makefile b/net/xfrm/Makefile
index cd47f88921f5..547cec77ba03 100644
--- a/net/xfrm/Makefile
+++ b/net/xfrm/Makefile
@@ -21,3 +21,4 @@ obj-$(CONFIG_XFRM_USER_COMPAT) += xfrm_compat.o
 obj-$(CONFIG_XFRM_IPCOMP) += xfrm_ipcomp.o
 obj-$(CONFIG_XFRM_INTERFACE) += xfrm_interface.o
 obj-$(CONFIG_XFRM_ESPINTCP) += espintcp.o
+obj-$(CONFIG_DEBUG_INFO_BTF) += xfrm_state_bpf.o
diff --git a/net/xfrm/xfrm_policy.c b/net/xfrm/xfrm_policy.c
index 5cdd3bca3637..62e64fa7ae5c 100644
--- a/net/xfrm/xfrm_policy.c
+++ b/net/xfrm/xfrm_policy.c
@@ -4267,6 +4267,8 @@ void __init xfrm_init(void)
 #ifdef CONFIG_XFRM_ESPINTCP
 	espintcp_init();
 #endif
+
+	register_xfrm_state_bpf();
 }
 
 #ifdef CONFIG_AUDITSYSCALL
diff --git a/net/xfrm/xfrm_state_bpf.c b/net/xfrm/xfrm_state_bpf.c
new file mode 100644
index 000000000000..a73a17a6497b
--- /dev/null
+++ b/net/xfrm/xfrm_state_bpf.c
@@ -0,0 +1,105 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/* Unstable XFRM state BPF helpers.
+ *
+ * Note that it is allowed to break compatibility for these functions since the
+ * interface they are exposed through to BPF programs is explicitly unstable.
+ */
+
+#include <linux/bpf.h>
+#include <linux/btf_ids.h>
+#include <net/xdp.h>
+#include <net/xfrm.h>
+
+/* bpf_xfrm_state_opts - Options for XFRM state lookup helpers
+ *
+ * Members:
+ * @error      - Out parameter, set for any errors encountered
+ *		 Values:
+ *		   -EINVAL - netns_id is less than -1
+ *		   -EINVAL - Passed NULL for opts
+ *		   -EINVAL - opts__sz isn't BPF_XFRM_STATE_OPTS_SZ
+ *		   -ENONET - No network namespace found for netns_id
+ * @netns_id	- Specify the network namespace for lookup
+ *		 Values:
+ *		   BPF_F_CURRENT_NETNS (-1)
+ *		     Use namespace associated with ctx
+ *		   [0, S32_MAX]
+ *		     Network Namespace ID
+ * @mark	- XFRM mark to match on
+ * @daddr	- Destination address to match on
+ * @spi		- Security parameter index to match on
+ * @proto	- L3 protocol to match on
+ * @family	- L3 protocol family to match on
+ */
+struct bpf_xfrm_state_opts {
+	s32 error;
+	s32 netns_id;
+	u32 mark;
+	xfrm_address_t daddr;
+	__be32 spi;
+	u8 proto;
+	u16 family;
+};
+
+enum {
+	BPF_XFRM_STATE_OPTS_SZ = sizeof(struct bpf_xfrm_state_opts),
+};
+
+__diag_push();
+__diag_ignore_all("-Wmissing-prototypes",
+		  "Global functions as their definitions will be in xfrm_state BTF");
+
+/* bpf_xdp_get_xfrm_state - Get XFRM state
+ *
+ * Parameters:
+ * @ctx 	- Pointer to ctx (xdp_md) in XDP program
+ *		    Cannot be NULL
+ * @opts	- Options for lookup (documented above)
+ *		    Cannot be NULL
+ * @opts__sz	- Length of the bpf_xfrm_state_opts structure
+ *		    Must be BPF_XFRM_STATE_OPTS_SZ
+ */
+__bpf_kfunc struct xfrm_state *
+bpf_xdp_get_xfrm_state(struct xdp_md *ctx, struct bpf_xfrm_state_opts *opts, u32 opts__sz)
+{
+	struct xdp_buff *xdp = (struct xdp_buff *)ctx;
+	struct net *net = dev_net(xdp->rxq->dev);
+
+	if (!opts || opts__sz != BPF_XFRM_STATE_OPTS_SZ) {
+		opts->error = -EINVAL;
+		return NULL;
+	}
+
+	if (unlikely(opts->netns_id < BPF_F_CURRENT_NETNS)) {
+		opts->error = -EINVAL;
+		return NULL;
+	}
+
+	if (opts->netns_id >= 0) {
+		net = get_net_ns_by_id(net, opts->netns_id);
+		if (unlikely(!net)) {
+			opts->error = -ENONET;
+			return NULL;
+		}
+	}
+
+	return xfrm_state_lookup(net, opts->mark, &opts->daddr, opts->spi,
+				 opts->proto, opts->family);
+}
+
+__diag_pop()
+
+BTF_SET8_START(xfrm_state_kfunc_set)
+BTF_ID_FLAGS(func, bpf_xdp_get_xfrm_state, KF_RET_NULL)
+BTF_SET8_END(xfrm_state_kfunc_set)
+
+static const struct btf_kfunc_id_set xfrm_state_xdp_kfunc_set = {
+	.owner = THIS_MODULE,
+	.set   = &xfrm_state_kfunc_set,
+};
+
+int __init register_xfrm_state_bpf(void)
+{
+	return register_btf_kfunc_id_set(BPF_PROG_TYPE_XDP,
+					 &xfrm_state_xdp_kfunc_set);
+}
-- 
2.42.0
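
One detail in the argument checks of patch 1 is worth spelling out: when
opts is NULL, the first branch still stores to opts->error, so the NULL
case has to be rejected before any store. A minimal userspace sketch of
that ordering follows. It assumes a cut-down stand-in for
struct bpf_xfrm_state_opts; opts_model, validate_opts and CURRENT_NETNS
are illustrative names, not part of the patch.

```c
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for struct bpf_xfrm_state_opts; only the
 * fields that the validation touches are modeled here. */
struct opts_model {
	int32_t error;
	int32_t netns_id;
};

#define CURRENT_NETNS (-1)	/* mirrors BPF_F_CURRENT_NETNS */

/* Returns 0 if the arguments pass validation, -1 otherwise.
 * opts->error is written only once opts is known to be non-NULL. */
int validate_opts(struct opts_model *opts, uint32_t opts_sz)
{
	if (!opts)
		return -1;	/* no place to report the error */

	if (opts_sz != sizeof(*opts) || opts->netns_id < CURRENT_NETNS) {
		opts->error = -EINVAL;
		return -1;
	}

	return 0;
}
```

With this ordering a NULL opts simply fails the call, while a bad size or
a netns_id below -1 is reported through opts->error as -EINVAL.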



* [RFC bpf-next 2/6] bpf: selftests: test_tunnel: Use ping -6 over ping6
  2023-10-27 18:46 [RFC bpf-next 0/6] Add bpf_xdp_get_xfrm_state() kfunc Daniel Xu
  2023-10-27 18:46 ` [RFC bpf-next 1/6] bpf: xfrm: " Daniel Xu
@ 2023-10-27 18:46 ` Daniel Xu
  2023-10-27 18:46 ` [RFC bpf-next 3/6] bpf: selftests: test_tunnel: Mount bpffs if necessary Daniel Xu
                   ` (4 subsequent siblings)
  6 siblings, 0 replies; 16+ messages in thread
From: Daniel Xu @ 2023-10-27 18:46 UTC
  To: ast, andrii, shuah, daniel, steffen.klassert, antony.antony
  Cc: martin.lau, song, yonghong.song, john.fastabend, kpsingh, sdf,
	haoluo, jolsa, mykolal, bpf, linux-kselftest, linux-kernel, devel

The ping6 binary went away over 7 years ago [0].

[0]: https://github.com/iputils/iputils/commit/ebad35fee3de851b809c7b72ccc654a72b6af61d

Signed-off-by: Daniel Xu <dxu@dxuuu.xyz>
---
 tools/testing/selftests/bpf/test_tunnel.sh | 18 +++++++++---------
 1 file changed, 9 insertions(+), 9 deletions(-)

diff --git a/tools/testing/selftests/bpf/test_tunnel.sh b/tools/testing/selftests/bpf/test_tunnel.sh
index 2dec7dbf29a2..85ba39992461 100755
--- a/tools/testing/selftests/bpf/test_tunnel.sh
+++ b/tools/testing/selftests/bpf/test_tunnel.sh
@@ -295,13 +295,13 @@ test_ip6gre()
 	add_ip6gretap_tunnel
 	attach_bpf $DEV ip6gretap_set_tunnel ip6gretap_get_tunnel
 	# underlay
-	ping6 $PING_ARG ::11
+	ping -6 $PING_ARG ::11
 	# overlay: ipv4 over ipv6
 	ip netns exec at_ns0 ping $PING_ARG 10.1.1.200
 	ping $PING_ARG 10.1.1.100
 	check_err $?
 	# overlay: ipv6 over ipv6
-	ip netns exec at_ns0 ping6 $PING_ARG fc80::200
+	ip netns exec at_ns0 ping -6 $PING_ARG fc80::200
 	check_err $?
 	cleanup
 
@@ -324,13 +324,13 @@ test_ip6gretap()
 	add_ip6gretap_tunnel
 	attach_bpf $DEV ip6gretap_set_tunnel ip6gretap_get_tunnel
 	# underlay
-	ping6 $PING_ARG ::11
+	ping -6 $PING_ARG ::11
 	# overlay: ipv4 over ipv6
 	ip netns exec at_ns0 ping $PING_ARG 10.1.1.200
 	ping $PING_ARG 10.1.1.100
 	check_err $?
 	# overlay: ipv6 over ipv6
-	ip netns exec at_ns0 ping6 $PING_ARG fc80::200
+	ip netns exec at_ns0 ping -6 $PING_ARG fc80::200
 	check_err $?
 	cleanup
 
@@ -376,7 +376,7 @@ test_ip6erspan()
 	config_device
 	add_ip6erspan_tunnel $1
 	attach_bpf $DEV ip4ip6erspan_set_tunnel ip4ip6erspan_get_tunnel
-	ping6 $PING_ARG ::11
+	ping -6 $PING_ARG ::11
 	ip netns exec at_ns0 ping $PING_ARG 10.1.1.200
 	check_err $?
 	cleanup
@@ -474,7 +474,7 @@ test_ipip6()
 	ip link set dev veth1 mtu 1500
 	attach_bpf $DEV ipip6_set_tunnel ipip6_get_tunnel
 	# underlay
-	ping6 $PING_ARG ::11
+	ping -6 $PING_ARG ::11
 	# ip4 over ip6
 	ping $PING_ARG 10.1.1.100
 	check_err $?
@@ -502,11 +502,11 @@ test_ip6ip6()
 	ip link set dev veth1 mtu 1500
 	attach_bpf $DEV ip6ip6_set_tunnel ip6ip6_get_tunnel
 	# underlay
-	ping6 $PING_ARG ::11
+	ping -6 $PING_ARG ::11
 	# ip6 over ip6
-	ping6 $PING_ARG 1::11
+	ping -6 $PING_ARG 1::11
 	check_err $?
-	ip netns exec at_ns0 ping6 $PING_ARG 1::22
+	ip netns exec at_ns0 ping -6 $PING_ARG 1::22
 	check_err $?
 	cleanup
 
-- 
2.42.0



* [RFC bpf-next 3/6] bpf: selftests: test_tunnel: Mount bpffs if necessary
  2023-10-27 18:46 [RFC bpf-next 0/6] Add bpf_xdp_get_xfrm_state() kfunc Daniel Xu
  2023-10-27 18:46 ` [RFC bpf-next 1/6] bpf: xfrm: " Daniel Xu
  2023-10-27 18:46 ` [RFC bpf-next 2/6] bpf: selftests: test_tunnel: Use ping -6 over ping6 Daniel Xu
@ 2023-10-27 18:46 ` Daniel Xu
  2023-10-27 18:46 ` [RFC bpf-next 4/6] bpf: selftests: test_tunnel: Use vmlinux.h declarations Daniel Xu
                   ` (3 subsequent siblings)
  6 siblings, 0 replies; 16+ messages in thread
From: Daniel Xu @ 2023-10-27 18:46 UTC
  To: andrii, ast, shuah, daniel, steffen.klassert, antony.antony
  Cc: martin.lau, song, yonghong.song, john.fastabend, kpsingh, sdf,
	haoluo, jolsa, mykolal, bpf, linux-kselftest, linux-kernel, devel

Previously, if bpffs was not already mounted, then the test suite would
fail during object file pinning steps. Fix by mounting bpffs if
necessary.

Signed-off-by: Daniel Xu <dxu@dxuuu.xyz>
---
 tools/testing/selftests/bpf/test_tunnel.sh | 13 ++++++++++++-
 1 file changed, 12 insertions(+), 1 deletion(-)

diff --git a/tools/testing/selftests/bpf/test_tunnel.sh b/tools/testing/selftests/bpf/test_tunnel.sh
index 85ba39992461..dd3c79129e87 100755
--- a/tools/testing/selftests/bpf/test_tunnel.sh
+++ b/tools/testing/selftests/bpf/test_tunnel.sh
@@ -46,7 +46,8 @@
 # 6) Forward the packet to the overlay tnl dev
 
 BPF_FILE="test_tunnel_kern.bpf.o"
-BPF_PIN_TUNNEL_DIR="/sys/fs/bpf/tc/tunnel"
+BPF_FS="/sys/fs/bpf"
+BPF_PIN_TUNNEL_DIR="${BPF_FS}/tc/tunnel"
 PING_ARG="-c 3 -w 10 -q"
 ret=0
 GREEN='\033[0;92m'
@@ -668,10 +669,20 @@ check_err()
 	fi
 }
 
+mount_bpffs()
+{
+	if ! mount | grep "bpf on $BPF_FS" &>/dev/null; then
+		mount -t bpf bpf "$BPF_FS"
+	fi
+}
+
 bpf_tunnel_test()
 {
 	local errors=0
 
+	echo "Mounting bpffs..."
+	mount_bpffs
+
 	echo "Testing GRE tunnel..."
 	test_gre
 	errors=$(( $errors + $? ))
-- 
2.42.0



* [RFC bpf-next 4/6] bpf: selftests: test_tunnel: Use vmlinux.h declarations
  2023-10-27 18:46 [RFC bpf-next 0/6] Add bpf_xdp_get_xfrm_state() kfunc Daniel Xu
                   ` (2 preceding siblings ...)
  2023-10-27 18:46 ` [RFC bpf-next 3/6] bpf: selftests: test_tunnel: Mount bpffs if necessary Daniel Xu
@ 2023-10-27 18:46 ` Daniel Xu
  2023-10-27 18:46 ` [RFC bpf-next 5/6] bpf: selftests: test_tunnel: Disable CO-RE relocations Daniel Xu
                   ` (2 subsequent siblings)
  6 siblings, 0 replies; 16+ messages in thread
From: Daniel Xu @ 2023-10-27 18:46 UTC
  To: ast, andrii, shuah, daniel, steffen.klassert, antony.antony
  Cc: martin.lau, song, yonghong.song, john.fastabend, kpsingh, sdf,
	haoluo, jolsa, mykolal, bpf, linux-kselftest, linux-kernel, devel

vmlinux.h declarations are more ergonomic, especially when working with
kfuncs. The uapi headers are often incomplete for kfunc definitions.

Signed-off-by: Daniel Xu <dxu@dxuuu.xyz>
---
 .../selftests/bpf/progs/bpf_tracing_net.h     |  1 +
 .../selftests/bpf/progs/test_tunnel_kern.c    | 48 ++++---------------
 2 files changed, 9 insertions(+), 40 deletions(-)

diff --git a/tools/testing/selftests/bpf/progs/bpf_tracing_net.h b/tools/testing/selftests/bpf/progs/bpf_tracing_net.h
index 0b793a102791..1bdc680b0e0e 100644
--- a/tools/testing/selftests/bpf/progs/bpf_tracing_net.h
+++ b/tools/testing/selftests/bpf/progs/bpf_tracing_net.h
@@ -26,6 +26,7 @@
 #define IPV6_AUTOFLOWLABEL	70
 
 #define TC_ACT_UNSPEC		(-1)
+#define TC_ACT_OK		0
 #define TC_ACT_SHOT		2
 
 #define SOL_TCP			6
diff --git a/tools/testing/selftests/bpf/progs/test_tunnel_kern.c b/tools/testing/selftests/bpf/progs/test_tunnel_kern.c
index f66af753bbbb..3065a716544d 100644
--- a/tools/testing/selftests/bpf/progs/test_tunnel_kern.c
+++ b/tools/testing/selftests/bpf/progs/test_tunnel_kern.c
@@ -6,62 +6,30 @@
  * modify it under the terms of version 2 of the GNU General Public
  * License as published by the Free Software Foundation.
  */
-#include <stddef.h>
-#include <string.h>
-#include <arpa/inet.h>
-#include <linux/bpf.h>
-#include <linux/if_ether.h>
-#include <linux/if_packet.h>
-#include <linux/if_tunnel.h>
-#include <linux/ip.h>
-#include <linux/ipv6.h>
-#include <linux/icmp.h>
-#include <linux/types.h>
-#include <linux/socket.h>
-#include <linux/pkt_cls.h>
-#include <linux/erspan.h>
-#include <linux/udp.h>
+#include "vmlinux.h"
 #include <bpf/bpf_helpers.h>
 #include <bpf/bpf_endian.h>
+#include "bpf_kfuncs.h"
+#include "bpf_tracing_net.h"
 
 #define log_err(__ret) bpf_printk("ERROR line:%d ret:%d\n", __LINE__, __ret)
 
-#define VXLAN_UDP_PORT 4789
+#define VXLAN_UDP_PORT		4789
+#define ETH_P_IP		0x0800
+#define PACKET_HOST		0
+#define TUNNEL_CSUM		bpf_htons(0x01)
+#define TUNNEL_KEY		bpf_htons(0x04)
 
 /* Only IPv4 address assigned to veth1.
  * 172.16.1.200
  */
 #define ASSIGNED_ADDR_VETH1 0xac1001c8
 
-struct geneve_opt {
-	__be16	opt_class;
-	__u8	type;
-	__u8	length:5;
-	__u8	r3:1;
-	__u8	r2:1;
-	__u8	r1:1;
-	__u8	opt_data[8]; /* hard-coded to 8 byte */
-};
-
 struct vxlanhdr {
 	__be32 vx_flags;
 	__be32 vx_vni;
 } __attribute__((packed));
 
-struct vxlan_metadata {
-	__u32     gbp;
-};
-
-struct bpf_fou_encap {
-	__be16 sport;
-	__be16 dport;
-};
-
-enum bpf_fou_encap_type {
-	FOU_BPF_ENCAP_FOU,
-	FOU_BPF_ENCAP_GUE,
-};
-
 int bpf_skb_set_fou_encap(struct __sk_buff *skb_ctx,
 			  struct bpf_fou_encap *encap, int type) __ksym;
 int bpf_skb_get_fou_encap(struct __sk_buff *skb_ctx,
-- 
2.42.0



* [RFC bpf-next 5/6] bpf: selftests: test_tunnel: Disable CO-RE relocations
  2023-10-27 18:46 [RFC bpf-next 0/6] Add bpf_xdp_get_xfrm_state() kfunc Daniel Xu
                   ` (3 preceding siblings ...)
  2023-10-27 18:46 ` [RFC bpf-next 4/6] bpf: selftests: test_tunnel: Use vmlinux.h declarations Daniel Xu
@ 2023-10-27 18:46 ` Daniel Xu
  2023-10-27 20:33   ` Andrii Nakryiko
  2023-10-27 18:46 ` [RFC bpf-next 6/6] bpf: xfrm: Add selftest for bpf_xdp_get_xfrm_state() Daniel Xu
  2023-10-29  2:13 ` [RFC bpf-next 0/6] Add bpf_xdp_get_xfrm_state() kfunc Antony Antony
  6 siblings, 1 reply; 16+ messages in thread
From: Daniel Xu @ 2023-10-27 18:46 UTC
  To: ast, andrii, shuah, daniel, steffen.klassert, antony.antony
  Cc: mykolal, martin.lau, song, yonghong.song, john.fastabend, kpsingh,
	sdf, haoluo, jolsa, bpf, linux-kselftest, linux-kernel, devel

Switching to vmlinux.h definitions seems to make the verifier very
unhappy with bitfield accesses. The error is:

    ; md.u.md2.dir = direction;
    33: (69) r1 = *(u16 *)(r2 +11)
    misaligned stack access off (0x0; 0x0)+-64+11 size 2

It looks like disabling CO-RE relocations seems to make the error go
away.

Signed-off-by: Daniel Xu <dxu@dxuuu.xyz>
---
 tools/testing/selftests/bpf/progs/test_tunnel_kern.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/tools/testing/selftests/bpf/progs/test_tunnel_kern.c b/tools/testing/selftests/bpf/progs/test_tunnel_kern.c
index 3065a716544d..ec7e04e012ae 100644
--- a/tools/testing/selftests/bpf/progs/test_tunnel_kern.c
+++ b/tools/testing/selftests/bpf/progs/test_tunnel_kern.c
@@ -6,6 +6,7 @@
  * modify it under the terms of version 2 of the GNU General Public
  * License as published by the Free Software Foundation.
  */
+#define BPF_NO_PRESERVE_ACCESS_INDEX
 #include "vmlinux.h"
 #include <bpf/bpf_helpers.h>
 #include <bpf/bpf_endian.h>
-- 
2.42.0



* [RFC bpf-next 6/6] bpf: xfrm: Add selftest for bpf_xdp_get_xfrm_state()
  2023-10-27 18:46 [RFC bpf-next 0/6] Add bpf_xdp_get_xfrm_state() kfunc Daniel Xu
                   ` (4 preceding siblings ...)
  2023-10-27 18:46 ` [RFC bpf-next 5/6] bpf: selftests: test_tunnel: Disable CO-RE relocations Daniel Xu
@ 2023-10-27 18:46 ` Daniel Xu
  2023-10-29  2:13 ` [RFC bpf-next 0/6] Add bpf_xdp_get_xfrm_state() kfunc Antony Antony
  6 siblings, 0 replies; 16+ messages in thread
From: Daniel Xu @ 2023-10-27 18:46 UTC
  To: hawk, ast, shuah, daniel, kuba, davem, andrii, john.fastabend,
	steffen.klassert, antony.antony
  Cc: mykolal, martin.lau, song, yonghong.song, kpsingh, sdf, haoluo,
	jolsa, bpf, linux-kselftest, linux-kernel, netdev, devel

This commit extends test_tunnel selftest to test the new XDP xfrm state
lookup kfunc.

Signed-off-by: Daniel Xu <dxu@dxuuu.xyz>
---
 .../selftests/bpf/progs/test_tunnel_kern.c    | 46 +++++++++++++++++++
 tools/testing/selftests/bpf/test_tunnel.sh    | 12 +++--
 2 files changed, 54 insertions(+), 4 deletions(-)

diff --git a/tools/testing/selftests/bpf/progs/test_tunnel_kern.c b/tools/testing/selftests/bpf/progs/test_tunnel_kern.c
index ec7e04e012ae..f5f6a18ac0f1 100644
--- a/tools/testing/selftests/bpf/progs/test_tunnel_kern.c
+++ b/tools/testing/selftests/bpf/progs/test_tunnel_kern.c
@@ -35,6 +35,9 @@ int bpf_skb_set_fou_encap(struct __sk_buff *skb_ctx,
 			  struct bpf_fou_encap *encap, int type) __ksym;
 int bpf_skb_get_fou_encap(struct __sk_buff *skb_ctx,
 			  struct bpf_fou_encap *encap) __ksym;
+struct xfrm_state *
+bpf_xdp_get_xfrm_state(struct xdp_md *ctx, struct bpf_xfrm_state_opts *opts,
+		       u32 opts__sz) __ksym;
 
 struct {
 	__uint(type, BPF_MAP_TYPE_ARRAY);
@@ -948,4 +951,47 @@ int xfrm_get_state(struct __sk_buff *skb)
 	return TC_ACT_OK;
 }
 
+SEC("xdp")
+int xfrm_get_state_xdp(struct xdp_md *xdp)
+{
+	struct bpf_xfrm_state_opts opts = {};
+	struct ip_esp_hdr *esph;
+	struct bpf_dynptr ptr;
+	struct xfrm_state *x;
+	u8 esph_buf[8] = {};
+	u8 iph_buf[20] = {};
+	struct iphdr *iph;
+	u32 off;
+
+	if (bpf_dynptr_from_xdp(xdp, 0, &ptr))
+		goto out;
+
+	off = sizeof(struct ethhdr);
+	iph = bpf_dynptr_slice(&ptr, off, iph_buf, sizeof(iph_buf));
+	if (!iph || iph->protocol != IPPROTO_ESP)
+		goto out;
+
+	off += sizeof(struct iphdr);
+	esph = bpf_dynptr_slice(&ptr, off, esph_buf, sizeof(esph_buf));
+	if (!esph)
+		goto out;
+
+	opts.netns_id = BPF_F_CURRENT_NETNS;
+	opts.daddr.a4 = iph->daddr;
+	opts.spi = esph->spi;
+	opts.proto = IPPROTO_ESP;
+	opts.family = AF_INET;
+
+	x = bpf_xdp_get_xfrm_state(xdp, &opts, sizeof(opts));
+	if (!x || opts.error)
+		goto out;
+
+	if (!x->replay_esn)
+		goto out;
+
+	bpf_printk("replay-window %d\n", x->replay_esn->replay_window);
+out:
+	return XDP_PASS;
+}
+
 char _license[] SEC("license") = "GPL";
diff --git a/tools/testing/selftests/bpf/test_tunnel.sh b/tools/testing/selftests/bpf/test_tunnel.sh
index dd3c79129e87..17d263681c71 100755
--- a/tools/testing/selftests/bpf/test_tunnel.sh
+++ b/tools/testing/selftests/bpf/test_tunnel.sh
@@ -528,7 +528,7 @@ setup_xfrm_tunnel()
 	# at_ns0 -> root
 	ip netns exec at_ns0 \
 		ip xfrm state add src 172.16.1.100 dst 172.16.1.200 proto esp \
-			spi $spi_in_to_out reqid 1 mode tunnel \
+			spi $spi_in_to_out reqid 1 mode tunnel replay-window 42 \
 			auth-trunc 'hmac(sha1)' $auth 96 enc 'cbc(aes)' $enc
 	ip netns exec at_ns0 \
 		ip xfrm policy add src 10.1.1.100/32 dst 10.1.1.200/32 dir out \
@@ -537,7 +537,7 @@ setup_xfrm_tunnel()
 	# root -> at_ns0
 	ip netns exec at_ns0 \
 		ip xfrm state add src 172.16.1.200 dst 172.16.1.100 proto esp \
-			spi $spi_out_to_in reqid 2 mode tunnel \
+			spi $spi_out_to_in reqid 2 mode tunnel replay-window 42 \
 			auth-trunc 'hmac(sha1)' $auth 96 enc 'cbc(aes)' $enc
 	ip netns exec at_ns0 \
 		ip xfrm policy add src 10.1.1.200/32 dst 10.1.1.100/32 dir in \
@@ -553,14 +553,14 @@ setup_xfrm_tunnel()
 	# root namespace
 	# at_ns0 -> root
 	ip xfrm state add src 172.16.1.100 dst 172.16.1.200 proto esp \
-		spi $spi_in_to_out reqid 1 mode tunnel \
+		spi $spi_in_to_out reqid 1 mode tunnel replay-window 42 \
 		auth-trunc 'hmac(sha1)' $auth 96  enc 'cbc(aes)' $enc
 	ip xfrm policy add src 10.1.1.100/32 dst 10.1.1.200/32 dir in \
 		tmpl src 172.16.1.100 dst 172.16.1.200 proto esp reqid 1 \
 		mode tunnel
 	# root -> at_ns0
 	ip xfrm state add src 172.16.1.200 dst 172.16.1.100 proto esp \
-		spi $spi_out_to_in reqid 2 mode tunnel \
+		spi $spi_out_to_in reqid 2 mode tunnel replay-window 42 \
 		auth-trunc 'hmac(sha1)' $auth 96  enc 'cbc(aes)' $enc
 	ip xfrm policy add src 10.1.1.200/32 dst 10.1.1.100/32 dir out \
 		tmpl src 172.16.1.200 dst 172.16.1.100 proto esp reqid 2 \
@@ -585,6 +585,8 @@ test_xfrm_tunnel()
 	tc qdisc add dev veth1 clsact
 	tc filter add dev veth1 proto ip ingress bpf da object-pinned \
 		${BPF_PIN_TUNNEL_DIR}/xfrm_get_state
+	ip link set dev veth1 xdpdrv pinned \
+		${BPF_PIN_TUNNEL_DIR}/xfrm_get_state_xdp
 	ip netns exec at_ns0 ping $PING_ARG 10.1.1.200
 	sleep 1
 	grep "reqid 1" ${TRACE}
@@ -593,6 +595,8 @@ test_xfrm_tunnel()
 	check_err $?
 	grep "remote ip 0xac100164" ${TRACE}
 	check_err $?
+	grep "replay-window 42" ${TRACE}
+	check_err $?
 	cleanup
 
 	if [ $ret -ne 0 ]; then
-- 
2.42.0



* Re: [RFC bpf-next 5/6] bpf: selftests: test_tunnel: Disable CO-RE relocations
  2023-10-27 18:46 ` [RFC bpf-next 5/6] bpf: selftests: test_tunnel: Disable CO-RE relocations Daniel Xu
@ 2023-10-27 20:33   ` Andrii Nakryiko
  2023-10-29 23:22     ` Daniel Xu
  0 siblings, 1 reply; 16+ messages in thread
From: Andrii Nakryiko @ 2023-10-27 20:33 UTC
  To: Daniel Xu
  Cc: ast, andrii, shuah, daniel, steffen.klassert, antony.antony,
	mykolal, martin.lau, song, yonghong.song, john.fastabend, kpsingh,
	sdf, haoluo, jolsa, bpf, linux-kselftest, linux-kernel, devel

On Fri, Oct 27, 2023 at 11:46 AM Daniel Xu <dxu@dxuuu.xyz> wrote:
>
> Switching to vmlinux.h definitions seems to make the verifier very
> unhappy with bitfield accesses. The error is:
>
>     ; md.u.md2.dir = direction;
>     33: (69) r1 = *(u16 *)(r2 +11)
>     misaligned stack access off (0x0; 0x0)+-64+11 size 2
>
> It looks like disabling CO-RE relocations seems to make the error go
> away.
>

For accessing bitfields, libbpf provides the
BPF_CORE_READ_BITFIELD_PROBED() and BPF_CORE_READ_BITFIELD() macros.
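
The verifier complaint quoted above is about a 2-byte load at an odd
offset. A bitfield can instead be read with byte-sized loads followed by
a shift and a mask, which is the general idea behind bitfield read
helpers. The sketch below is a plain userspace illustration of that
idea, not libbpf's implementation; read_bits and its little-endian bit
numbering are assumptions made for the example.

```c
#include <stdint.h>

/* Extract `width` bits (1..32) starting `bit_off` bits into `buf`,
 * counting bits from the least significant bit of buf[0].
 * Only single-byte loads are used, so no access is misaligned. */
uint64_t read_bits(const uint8_t *buf, unsigned int bit_off,
		   unsigned int width)
{
	unsigned int first = bit_off / 8;	/* first byte touched */
	unsigned int shift = bit_off % 8;	/* bit position inside it */
	unsigned int nbytes = (shift + width + 7) / 8;
	uint64_t val = 0;

	/* Assemble the covering bytes, lowest address in the low bits. */
	for (unsigned int i = 0; i < nbytes; i++)
		val |= (uint64_t)buf[first + i] << (8 * i);

	val >>= shift;				/* drop the bits below the field */
	return val & ((1ULL << width) - 1);	/* drop the bits above it */
}
```

For a buffer { 0x00, 0xA8 }, reading 4 bits at bit offset 12 yields the
high nibble of the second byte, 0xA, without ever issuing a 16-bit load.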

> Signed-off-by: Daniel Xu <dxu@dxuuu.xyz>
> ---
>  tools/testing/selftests/bpf/progs/test_tunnel_kern.c | 1 +
>  1 file changed, 1 insertion(+)
>
> diff --git a/tools/testing/selftests/bpf/progs/test_tunnel_kern.c b/tools/testing/selftests/bpf/progs/test_tunnel_kern.c
> index 3065a716544d..ec7e04e012ae 100644
> --- a/tools/testing/selftests/bpf/progs/test_tunnel_kern.c
> +++ b/tools/testing/selftests/bpf/progs/test_tunnel_kern.c
> @@ -6,6 +6,7 @@
>   * modify it under the terms of version 2 of the GNU General Public
>   * License as published by the Free Software Foundation.
>   */
> +#define BPF_NO_PRESERVE_ACCESS_INDEX
>  #include "vmlinux.h"
>  #include <bpf/bpf_helpers.h>
>  #include <bpf/bpf_endian.h>
> --
> 2.42.0
>


* Re: [RFC bpf-next 1/6] bpf: xfrm: Add bpf_xdp_get_xfrm_state() kfunc
  2023-10-27 18:46 ` [RFC bpf-next 1/6] bpf: xfrm: " Daniel Xu
@ 2023-10-28 23:49   ` Alexei Starovoitov
  2023-10-29 22:55     ` Daniel Xu
  0 siblings, 1 reply; 16+ messages in thread
From: Alexei Starovoitov @ 2023-10-28 23:49 UTC
  To: Daniel Xu
  Cc: Jesper Dangaard Brouer, Steffen Klassert, Alexei Starovoitov,
	Paolo Abeni, Daniel Borkmann, Jakub Kicinski, Herbert Xu,
	David S. Miller, John Fastabend, Eric Dumazet, antony.antony,
	LKML, Network Development, bpf, devel

On Fri, Oct 27, 2023 at 11:46 AM Daniel Xu <dxu@dxuuu.xyz> wrote:
>
> This commit adds an unstable kfunc helper to access internal xfrm_state
> associated with an SA. This is intended to be used for the upcoming
> IPsec pcpu work to assign special pcpu SAs to a particular CPU. In other
> words: for custom software RSS.
>
> That being said, the function that this kfunc wraps is fairly generic
> and used for a lot of xfrm tasks. I'm sure people will find uses
> elsewhere over time.
>
> Signed-off-by: Daniel Xu <dxu@dxuuu.xyz>
> ---
>  include/net/xfrm.h        |   9 ++++
>  net/xfrm/Makefile         |   1 +
>  net/xfrm/xfrm_policy.c    |   2 +
>  net/xfrm/xfrm_state_bpf.c | 105 ++++++++++++++++++++++++++++++++++++++
>  4 files changed, 117 insertions(+)
>  create mode 100644 net/xfrm/xfrm_state_bpf.c
>
> diff --git a/include/net/xfrm.h b/include/net/xfrm.h
> index 98d7aa78adda..ab4cf66480f3 100644
> --- a/include/net/xfrm.h
> +++ b/include/net/xfrm.h
> @@ -2188,4 +2188,13 @@ static inline int register_xfrm_interface_bpf(void)
>
>  #endif
>
> +#if IS_ENABLED(CONFIG_DEBUG_INFO_BTF)
> +int register_xfrm_state_bpf(void);
> +#else
> +static inline int register_xfrm_state_bpf(void)
> +{
> +       return 0;
> +}
> +#endif
> +
>  #endif /* _NET_XFRM_H */
> diff --git a/net/xfrm/Makefile b/net/xfrm/Makefile
> index cd47f88921f5..547cec77ba03 100644
> --- a/net/xfrm/Makefile
> +++ b/net/xfrm/Makefile
> @@ -21,3 +21,4 @@ obj-$(CONFIG_XFRM_USER_COMPAT) += xfrm_compat.o
>  obj-$(CONFIG_XFRM_IPCOMP) += xfrm_ipcomp.o
>  obj-$(CONFIG_XFRM_INTERFACE) += xfrm_interface.o
>  obj-$(CONFIG_XFRM_ESPINTCP) += espintcp.o
> +obj-$(CONFIG_DEBUG_INFO_BTF) += xfrm_state_bpf.o
> diff --git a/net/xfrm/xfrm_policy.c b/net/xfrm/xfrm_policy.c
> index 5cdd3bca3637..62e64fa7ae5c 100644
> --- a/net/xfrm/xfrm_policy.c
> +++ b/net/xfrm/xfrm_policy.c
> @@ -4267,6 +4267,8 @@ void __init xfrm_init(void)
>  #ifdef CONFIG_XFRM_ESPINTCP
>         espintcp_init();
>  #endif
> +
> +       register_xfrm_state_bpf();
>  }
>
>  #ifdef CONFIG_AUDITSYSCALL
> diff --git a/net/xfrm/xfrm_state_bpf.c b/net/xfrm/xfrm_state_bpf.c
> new file mode 100644
> index 000000000000..a73a17a6497b
> --- /dev/null
> +++ b/net/xfrm/xfrm_state_bpf.c
> @@ -0,0 +1,105 @@
> +// SPDX-License-Identifier: GPL-2.0-only
> +/* Unstable XFRM state BPF helpers.
> + *
> + * Note that it is allowed to break compatibility for these functions since the
> + * interface they are exposed through to BPF programs is explicitly unstable.
> + */
> +
> +#include <linux/bpf.h>
> +#include <linux/btf_ids.h>
> +#include <net/xdp.h>
> +#include <net/xfrm.h>
> +
> +/* bpf_xfrm_state_opts - Options for XFRM state lookup helpers
> + *
> + * Members:
> + * @error      - Out parameter, set for any errors encountered
> + *              Values:
> + *                -EINVAL - netns_id is less than -1
> + *                -EINVAL - Passed NULL for opts
> + *                -EINVAL - opts__sz isn't BPF_XFRM_STATE_OPTS_SZ
> + *                -ENONET - No network namespace found for netns_id
> + * @netns_id   - Specify the network namespace for lookup
> + *              Values:
> + *                BPF_F_CURRENT_NETNS (-1)
> + *                  Use namespace associated with ctx
> + *                [0, S32_MAX]
> + *                  Network Namespace ID
> + * @mark       - XFRM mark to match on
> + * @daddr      - Destination address to match on
> + * @spi                - Security parameter index to match on
> + * @proto      - L3 protocol to match on
> + * @family     - L3 protocol family to match on
> + */
> +struct bpf_xfrm_state_opts {
> +       s32 error;
> +       s32 netns_id;
> +       u32 mark;
> +       xfrm_address_t daddr;
> +       __be32 spi;
> +       u8 proto;
> +       u16 family;
> +};
> +
> +enum {
> +       BPF_XFRM_STATE_OPTS_SZ = sizeof(struct bpf_xfrm_state_opts),
> +};
> +
> +__diag_push();
> +__diag_ignore_all("-Wmissing-prototypes",
> +                 "Global functions as their definitions will be in xfrm_state BTF");
> +
> +/* bpf_xdp_get_xfrm_state - Get XFRM state
> + *
> + * Parameters:
> + * @ctx        - Pointer to ctx (xdp_md) in XDP program
> + *                 Cannot be NULL
> + * @opts       - Options for lookup (documented above)
> + *                 Cannot be NULL
> + * @opts__sz   - Length of the bpf_xfrm_state_opts structure
> + *                 Must be BPF_XFRM_STATE_OPTS_SZ
> + */
> +__bpf_kfunc struct xfrm_state *
> +bpf_xdp_get_xfrm_state(struct xdp_md *ctx, struct bpf_xfrm_state_opts *opts, u32 opts__sz)
> +{
> +       struct xdp_buff *xdp = (struct xdp_buff *)ctx;
> +       struct net *net = dev_net(xdp->rxq->dev);
> +
> +       if (!opts || opts__sz != BPF_XFRM_STATE_OPTS_SZ) {
> +               opts->error = -EINVAL;
> +               return NULL;
> +       }
> +
> +       if (unlikely(opts->netns_id < BPF_F_CURRENT_NETNS)) {
> +               opts->error = -EINVAL;
> +               return NULL;
> +       }
> +
> +       if (opts->netns_id >= 0) {
> +               net = get_net_ns_by_id(net, opts->netns_id);
> +               if (unlikely(!net)) {
> +                       opts->error = -ENONET;
> +                       return NULL;
> +               }
> +       }
> +
> +       return xfrm_state_lookup(net, opts->mark, &opts->daddr, opts->spi,
> +                                opts->proto, opts->family);
> +}

Patch 6 example does little to explain how this kfunc can be used.
Cover letter sounds promising, but no code to demonstrate the result.
The main issue is that this kfunc has to be KF_ACQUIRE,
otherwise bpf prog will keep leaking xfrm_state.
Plenty of red flags in this RFC.
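
The leak follows from the lookup handing back a state with its reference
count already raised: every successful lookup needs a matching release.
A minimal userspace model of that pairing is sketched below.
state_model, lookup_acquire and release_state are hypothetical names
used only for illustration; they are not kernel or BPF APIs.

```c
#include <stddef.h>

/* Hypothetical stand-in for a refcounted object such as xfrm_state. */
struct state_model {
	int refcnt;
};

/* Acquire-style lookup: on success the caller owns one reference.
 * Returns NULL when idx is out of range. */
struct state_model *lookup_acquire(struct state_model *table, size_t n,
				   size_t idx)
{
	if (idx >= n)
		return NULL;

	table[idx].refcnt++;
	return &table[idx];
}

/* Release-style counterpart: drops the reference taken by the lookup. */
void release_state(struct state_model *x)
{
	if (x)
		x->refcnt--;
}
```

In this model, a caller that skips release_state() leaves refcnt one
higher after every lookup, which is the leak described above. Marking the
kfunc KF_ACQUIRE and pairing it with a release kfunc lets the verifier
reject programs that drop the pointer without releasing it.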


* Re: [RFC bpf-next 0/6] Add bpf_xdp_get_xfrm_state() kfunc
  2023-10-27 18:46 [RFC bpf-next 0/6] Add bpf_xdp_get_xfrm_state() kfunc Daniel Xu
                   ` (5 preceding siblings ...)
  2023-10-27 18:46 ` [RFC bpf-next 6/6] bpf: xfrm: Add selftest for bpf_xdp_get_xfrm_state() Daniel Xu
@ 2023-10-29  2:13 ` Antony Antony
  6 siblings, 0 replies; 16+ messages in thread
From: Antony Antony @ 2023-10-29  2:13 UTC (permalink / raw)
  To: Daniel Xu
  Cc: netdev, linux-kernel, bpf, linux-kselftest, steffen.klassert,
	antony.antony, devel

On Fri, Oct 27, 2023 at 12:46:16 -0600, Daniel Xu wrote:
> This patchset adds a kfunc helper, bpf_xdp_get_xfrm_state(), that wraps
> xfrm_state_lookup(). The intent is to support software RSS (via XDP) for
> the ongoing/upcoming ipsec pcpu work [0]. Recent experiments performed
> on (hopefully) reproducible AWS testbeds indicate that single tunnel
> pcpu ipsec can reach line rate on 100G ENA nics.
> 
> More details about that will be presented at netdev next week [1].
> 
> Antony did the initial stable bpf helper - I later ported it to unstable
> kfuncs. So for the series, please apply a Co-developed-by for Antony,
> provided he acks and signs off on this.

Thanks Daniel for working on this and bringing it upstream.

Co-developed-by: Antony Antony <antony.antony@secunet.com>
Signed-off-by: Antony Antony <antony.antony@secunet.com>

> 
> [0]: https://datatracker.ietf.org/doc/html/draft-ietf-ipsecme-multi-sa-performance-02
> [1]: https://netdevconf.info/0x17/sessions/workshop/security-workshop.html
> 
> Daniel Xu (6):
>   bpf: xfrm: Add bpf_xdp_get_xfrm_state() kfunc
>   bpf: selftests: test_tunnel: Use ping -6 over ping6
>   bpf: selftests: test_tunnel: Mount bpffs if necessary
>   bpf: selftests: test_tunnel: Use vmlinux.h declarations
>   bpf: selftests: test_tunnel: Disable CO-RE relocations
>   bpf: xfrm: Add selftest for bpf_xdp_get_xfrm_state()
> 
>  include/net/xfrm.h                            |   9 ++
>  net/xfrm/Makefile                             |   1 +
>  net/xfrm/xfrm_policy.c                        |   2 +
>  net/xfrm/xfrm_state_bpf.c                     | 105 ++++++++++++++++++
>  .../selftests/bpf/progs/bpf_tracing_net.h     |   1 +
>  .../selftests/bpf/progs/test_tunnel_kern.c    |  95 +++++++++-------
>  tools/testing/selftests/bpf/test_tunnel.sh    |  43 ++++---
>  7 files changed, 202 insertions(+), 54 deletions(-)
>  create mode 100644 net/xfrm/xfrm_state_bpf.c
> 
> -- 
> 2.42.0
> 


* Re: [RFC bpf-next 1/6] bpf: xfrm: Add bpf_xdp_get_xfrm_state() kfunc
  2023-10-28 23:49   ` Alexei Starovoitov
@ 2023-10-29 22:55     ` Daniel Xu
  2023-10-31 22:38       ` Alexei Starovoitov
  0 siblings, 1 reply; 16+ messages in thread
From: Daniel Xu @ 2023-10-29 22:55 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: Jesper Dangaard Brouer, Steffen Klassert, Alexei Starovoitov,
	Paolo Abeni, Daniel Borkmann, Jakub Kicinski, Herbert Xu,
	David S. Miller, John Fastabend, Eric Dumazet, antony.antony,
	LKML, Network Development, bpf, devel

Hi Alexei,

On Sat, Oct 28, 2023 at 04:49:45PM -0700, Alexei Starovoitov wrote:
> On Fri, Oct 27, 2023 at 11:46 AM Daniel Xu <dxu@dxuuu.xyz> wrote:
> >
> > This commit adds an unstable kfunc helper to access internal xfrm_state
> > associated with an SA. This is intended to be used for the upcoming
> > IPsec pcpu work to assign special pcpu SAs to a particular CPU. In other
> > words: for custom software RSS.
> >
> > That being said, the function that this kfunc wraps is fairly generic
> > and used for a lot of xfrm tasks. I'm sure people will find uses
> > elsewhere over time.
> >
> > Signed-off-by: Daniel Xu <dxu@dxuuu.xyz>
> > ---
> >  include/net/xfrm.h        |   9 ++++
> >  net/xfrm/Makefile         |   1 +
> >  net/xfrm/xfrm_policy.c    |   2 +
> >  net/xfrm/xfrm_state_bpf.c | 105 ++++++++++++++++++++++++++++++++++++++
> >  4 files changed, 117 insertions(+)
> >  create mode 100644 net/xfrm/xfrm_state_bpf.c
> >
> > diff --git a/include/net/xfrm.h b/include/net/xfrm.h
> > index 98d7aa78adda..ab4cf66480f3 100644
> > --- a/include/net/xfrm.h
> > +++ b/include/net/xfrm.h
> > @@ -2188,4 +2188,13 @@ static inline int register_xfrm_interface_bpf(void)
> >
> >  #endif
> >
> > +#if IS_ENABLED(CONFIG_DEBUG_INFO_BTF)
> > +int register_xfrm_state_bpf(void);
> > +#else
> > +static inline int register_xfrm_state_bpf(void)
> > +{
> > +       return 0;
> > +}
> > +#endif
> > +
> >  #endif /* _NET_XFRM_H */
> > diff --git a/net/xfrm/Makefile b/net/xfrm/Makefile
> > index cd47f88921f5..547cec77ba03 100644
> > --- a/net/xfrm/Makefile
> > +++ b/net/xfrm/Makefile
> > @@ -21,3 +21,4 @@ obj-$(CONFIG_XFRM_USER_COMPAT) += xfrm_compat.o
> >  obj-$(CONFIG_XFRM_IPCOMP) += xfrm_ipcomp.o
> >  obj-$(CONFIG_XFRM_INTERFACE) += xfrm_interface.o
> >  obj-$(CONFIG_XFRM_ESPINTCP) += espintcp.o
> > +obj-$(CONFIG_DEBUG_INFO_BTF) += xfrm_state_bpf.o
> > diff --git a/net/xfrm/xfrm_policy.c b/net/xfrm/xfrm_policy.c
> > index 5cdd3bca3637..62e64fa7ae5c 100644
> > --- a/net/xfrm/xfrm_policy.c
> > +++ b/net/xfrm/xfrm_policy.c
> > @@ -4267,6 +4267,8 @@ void __init xfrm_init(void)
> >  #ifdef CONFIG_XFRM_ESPINTCP
> >         espintcp_init();
> >  #endif
> > +
> > +       register_xfrm_state_bpf();
> >  }
> >
> >  #ifdef CONFIG_AUDITSYSCALL
> > diff --git a/net/xfrm/xfrm_state_bpf.c b/net/xfrm/xfrm_state_bpf.c
> > new file mode 100644
> > index 000000000000..a73a17a6497b
> > --- /dev/null
> > +++ b/net/xfrm/xfrm_state_bpf.c
> > @@ -0,0 +1,105 @@
> > +// SPDX-License-Identifier: GPL-2.0-only
> > +/* Unstable XFRM state BPF helpers.
> > + *
> > + * Note that it is allowed to break compatibility for these functions since the
> > + * interface they are exposed through to BPF programs is explicitly unstable.
> > + */
> > +
> > +#include <linux/bpf.h>
> > +#include <linux/btf_ids.h>
> > +#include <net/xdp.h>
> > +#include <net/xfrm.h>
> > +
> > +/* bpf_xfrm_state_opts - Options for XFRM state lookup helpers
> > + *
> > + * Members:
> > + * @error      - Out parameter, set for any errors encountered
> > + *              Values:
> > + *                -EINVAL - netns_id is less than -1
> > + *                -EINVAL - Passed NULL for opts
> > + *                -EINVAL - opts__sz isn't BPF_XFRM_STATE_OPTS_SZ
> > + *                -ENONET - No network namespace found for netns_id
> > + * @netns_id   - Specify the network namespace for lookup
> > + *              Values:
> > + *                BPF_F_CURRENT_NETNS (-1)
> > + *                  Use namespace associated with ctx
> > + *                [0, S32_MAX]
> > + *                  Network Namespace ID
> > + * @mark       - XFRM mark to match on
> > + * @daddr      - Destination address to match on
> > + * @spi                - Security parameter index to match on
> > + * @proto      - L3 protocol to match on
> > + * @family     - L3 protocol family to match on
> > + */
> > +struct bpf_xfrm_state_opts {
> > +       s32 error;
> > +       s32 netns_id;
> > +       u32 mark;
> > +       xfrm_address_t daddr;
> > +       __be32 spi;
> > +       u8 proto;
> > +       u16 family;
> > +};
> > +
> > +enum {
> > +       BPF_XFRM_STATE_OPTS_SZ = sizeof(struct bpf_xfrm_state_opts),
> > +};
> > +
> > +__diag_push();
> > +__diag_ignore_all("-Wmissing-prototypes",
> > +                 "Global functions as their definitions will be in xfrm_state BTF");
> > +
> > +/* bpf_xdp_get_xfrm_state - Get XFRM state
> > + *
> > + * Parameters:
> > + * @ctx        - Pointer to ctx (xdp_md) in XDP program
> > + *                 Cannot be NULL
> > + * @opts       - Options for lookup (documented above)
> > + *                 Cannot be NULL
> > + * @opts__sz   - Length of the bpf_xfrm_state_opts structure
> > + *                 Must be BPF_XFRM_STATE_OPTS_SZ
> > + */
> > +__bpf_kfunc struct xfrm_state *
> > +bpf_xdp_get_xfrm_state(struct xdp_md *ctx, struct bpf_xfrm_state_opts *opts, u32 opts__sz)
> > +{
> > +       struct xdp_buff *xdp = (struct xdp_buff *)ctx;
> > +       struct net *net = dev_net(xdp->rxq->dev);
> > +
> > +       if (!opts || opts__sz != BPF_XFRM_STATE_OPTS_SZ) {
> > +               opts->error = -EINVAL;
> > +               return NULL;
> > +       }
> > +
> > +       if (unlikely(opts->netns_id < BPF_F_CURRENT_NETNS)) {
> > +               opts->error = -EINVAL;
> > +               return NULL;
> > +       }
> > +
> > +       if (opts->netns_id >= 0) {
> > +               net = get_net_ns_by_id(net, opts->netns_id);
> > +               if (unlikely(!net)) {
> > +                       opts->error = -ENONET;
> > +                       return NULL;
> > +               }
> > +       }
> > +
> > +       return xfrm_state_lookup(net, opts->mark, &opts->daddr, opts->spi,
> > +                                opts->proto, opts->family);
> > +}
> 
> Patch 6 example does little to explain how this kfunc can be used.
> Cover letter sounds promising, but no code to demonstrate the result.

Part of the reason for that is this kfunc is intended to be used with a
not-yet-upstreamed xfrm patchset. The other is that the usage is quite
trivial. This is the code the experiments were run with:

https://github.com/danobi/xdp-tools/blob/e89a1c617aba3b50d990f779357d6ce2863ecb27/xdp-bench/xdp_redirect_cpumap.bpf.c#L385-L406

We intend to upstream that cpumap mode to xdp-tools as soon as the xfrm
patches are in. (Note the linked code is a little buggy but the
main idea is there).

Depending on your appetite for complex diagrams, I can also offer you a
sequence diagram that describes how everything fits together:

https://dxuuu.xyz/r/ipsec-pcpu.png

The TLDR is that all the magic comes from the xfrm subsystem. This kfunc
just enables software RSS.

> The main issue is that this kfunc has to be KF_ACQUIRE,
> otherwise bpf prog will keep leaking xfrm_state.
> Plenty of red flags in this RFC.

Ack, will check on KF_ACQUIRE.

Thanks,
Daniel


* Re: [RFC bpf-next 5/6] bpf: selftests: test_tunnel: Disable CO-RE relocations
  2023-10-27 20:33   ` Andrii Nakryiko
@ 2023-10-29 23:22     ` Daniel Xu
  2023-10-31  6:25       ` Andrii Nakryiko
  0 siblings, 1 reply; 16+ messages in thread
From: Daniel Xu @ 2023-10-29 23:22 UTC (permalink / raw)
  To: Andrii Nakryiko
  Cc: ast, andrii, shuah, daniel, steffen.klassert, antony.antony,
	mykolal, martin.lau, song, yonghong.song, john.fastabend, kpsingh,
	sdf, haoluo, jolsa, bpf, linux-kselftest, linux-kernel, devel

On Fri, Oct 27, 2023 at 01:33:09PM -0700, Andrii Nakryiko wrote:
> On Fri, Oct 27, 2023 at 11:46 AM Daniel Xu <dxu@dxuuu.xyz> wrote:
> >
> > Switching to vmlinux.h definitions seems to make the verifier very
> > unhappy with bitfield accesses. The error is:
> >
> >     ; md.u.md2.dir = direction;
> >     33: (69) r1 = *(u16 *)(r2 +11)
> >     misaligned stack access off (0x0; 0x0)+-64+11 size 2
> >
> > It looks like disabling CO-RE relocations seems to make the error go
> > away.
> >
> 
> for accessing bitfields libbpf provides
> BPF_CORE_READ_BITFIELD_PROBED() and BPF_CORE_READ_BITFIELD() macros

In this case the code in question is:

        __u8 direction = 0;
        md.u.md2.dir = direction;

IOW the problem is assigning to bitfields, not reading from them.

Is that something that libbpf needs to support as well?

Thanks,
Daniel


* Re: [RFC bpf-next 5/6] bpf: selftests: test_tunnel: Disable CO-RE relocations
  2023-10-29 23:22     ` Daniel Xu
@ 2023-10-31  6:25       ` Andrii Nakryiko
  0 siblings, 0 replies; 16+ messages in thread
From: Andrii Nakryiko @ 2023-10-31  6:25 UTC (permalink / raw)
  To: Daniel Xu
  Cc: ast, andrii, shuah, daniel, steffen.klassert, antony.antony,
	mykolal, martin.lau, song, yonghong.song, john.fastabend, kpsingh,
	sdf, haoluo, jolsa, bpf, linux-kselftest, linux-kernel, devel

On Sun, Oct 29, 2023 at 4:22 PM Daniel Xu <dxu@dxuuu.xyz> wrote:
>
> On Fri, Oct 27, 2023 at 01:33:09PM -0700, Andrii Nakryiko wrote:
> > On Fri, Oct 27, 2023 at 11:46 AM Daniel Xu <dxu@dxuuu.xyz> wrote:
> > >
> > > Switching to vmlinux.h definitions seems to make the verifier very
> > > unhappy with bitfield accesses. The error is:
> > >
> > >     ; md.u.md2.dir = direction;
> > >     33: (69) r1 = *(u16 *)(r2 +11)
> > >     misaligned stack access off (0x0; 0x0)+-64+11 size 2
> > >
> > > It looks like disabling CO-RE relocations seems to make the error go
> > > away.
> > >
> >
> > for accessing bitfields libbpf provides
> > BPF_CORE_READ_BITFIELD_PROBED() and BPF_CORE_READ_BITFIELD() macros
>
> In this case the code in question is:
>
>         __u8 direction = 0;
>         md.u.md2.dir = direction;
>
> IOW the problem is assigning to bitfields, not reading from them.
>
> Is that something that libbpf needs to support as well?

Ah, I missed that this is a write into a struct. I think we can
support BPF_CORE_WRITE_BITFIELD() (not the PROBED version, though)
using all the same CO-RE relocations. It's probably a very niche case,
but BPF_CORE_READ_BITFIELD() is niche as well (though an absolute
necessity when the need does come up).
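
For illustration, the read-modify-write that a bitfield store boils down
to can be sketched in plain C. This is not a libbpf macro and no
BPF_CORE_WRITE_BITFIELD() is defined in this thread: write_bitfield_u8()
is a hypothetical helper, and its shift and width are passed in by hand,
whereas a CO-RE version would presumably take them from relocations.

```c
#include <stdint.h>

/* Write `val` into a bitfield of `width` bits starting at bit `shift`
 * of the byte at `p`, leaving the neighbouring bits untouched.
 * Assumes width > 0 and shift + width <= 8.
 */
static void write_bitfield_u8(uint8_t *p, unsigned int shift,
			      unsigned int width, uint8_t val)
{
	uint8_t mask = (uint8_t)(((1u << width) - 1u) << shift);

	*p = (uint8_t)((*p & ~mask) | ((uint8_t)(val << shift) & mask));
}
```

A one-bit field such as dir would be written with width 1; the bit
position used for it here would be an assumption, not the real layout.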

>
> Thanks,
> Daniel


* Re: [RFC bpf-next 1/6] bpf: xfrm: Add bpf_xdp_get_xfrm_state() kfunc
  2023-10-29 22:55     ` Daniel Xu
@ 2023-10-31 22:38       ` Alexei Starovoitov
  2023-11-01 17:51         ` Daniel Xu
  0 siblings, 1 reply; 16+ messages in thread
From: Alexei Starovoitov @ 2023-10-31 22:38 UTC (permalink / raw)
  To: Daniel Xu
  Cc: Jesper Dangaard Brouer, Steffen Klassert, Alexei Starovoitov,
	Paolo Abeni, Daniel Borkmann, Jakub Kicinski, Herbert Xu,
	David S. Miller, John Fastabend, Eric Dumazet, antony.antony,
	LKML, Network Development, bpf, devel

On Sun, Oct 29, 2023 at 3:55 PM Daniel Xu <dxu@dxuuu.xyz> wrote:
>
> Hi Alexei,
>
> On Sat, Oct 28, 2023 at 04:49:45PM -0700, Alexei Starovoitov wrote:
> > On Fri, Oct 27, 2023 at 11:46 AM Daniel Xu <dxu@dxuuu.xyz> wrote:
> > >
> > > > [...]
> >
> > Patch 6 example does little to explain how this kfunc can be used.
> > Cover letter sounds promising, but no code to demonstrate the result.
>
> Part of the reason for that is this kfunc is intended to be used with a
> not-yet-upstreamed xfrm patchset. The other is that the usage is quite
> trivial. This is the code the experiments were run with:
>
> https://github.com/danobi/xdp-tools/blob/e89a1c617aba3b50d990f779357d6ce2863ecb27/xdp-bench/xdp_redirect_cpumap.bpf.c#L385-L406
>
> We intend to upstream that cpumap mode to xdp-tools as soon as the xfrm
> patches are in. (Note the linked code is a little buggy but the
> main idea is there).

I don't understand how it survives anything but a sanity check.
To measure perf gains it needs to be under traffic for some time,
but
x = bpf_xdp_get_xfrm_state(ctx, &opts, sizeof(opts));
will take another reference on that state for every packet.
At minimum that means a memory leak or a refcnt overflow.
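
The effect can be shown with a toy userspace model. This is not kernel
code: toy_state, toy_lookup(), toy_put() and toy_run() are hypothetical
names that only mimic a lookup which returns a held reference and a put
that drops it.

```c
#include <stdint.h>

/* Toy stand-in for a refcounted xfrm_state (not the kernel struct). */
struct toy_state {
	int64_t refcnt;
};

/* Models a lookup that returns the state with an extra reference held. */
static struct toy_state *toy_lookup(struct toy_state *s)
{
	s->refcnt++;
	return s;
}

/* Models the matching release of that reference. */
static void toy_put(struct toy_state *s)
{
	s->refcnt--;
}

/* Run `packets` lookups against one state, optionally releasing each
 * reference. Returns the references left over beyond the initial one.
 */
static int64_t toy_run(int64_t packets, int release)
{
	struct toy_state s = { .refcnt = 1 };
	int64_t i;

	for (i = 0; i < packets; i++) {
		struct toy_state *x = toy_lookup(&s);

		if (release)
			toy_put(x);
	}
	return s.refcnt - 1;
}
```

Without the release, the leftover count grows by one per packet; with
it, the count returns to its starting value after every packet.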


* Re: [RFC bpf-next 1/6] bpf: xfrm: Add bpf_xdp_get_xfrm_state() kfunc
  2023-10-31 22:38       ` Alexei Starovoitov
@ 2023-11-01 17:51         ` Daniel Xu
  2023-11-01 18:51           ` Alexei Starovoitov
  0 siblings, 1 reply; 16+ messages in thread
From: Daniel Xu @ 2023-11-01 17:51 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: Jesper Dangaard Brouer, Steffen Klassert, Alexei Starovoitov,
	Paolo Abeni, Daniel Borkmann, Jakub Kicinski, Herbert Xu,
	David S. Miller, John Fastabend, Eric Dumazet, antony.antony,
	LKML, Network Development, bpf, devel

On Tue, Oct 31, 2023 at 03:38:26PM -0700, Alexei Starovoitov wrote:
> On Sun, Oct 29, 2023 at 3:55 PM Daniel Xu <dxu@dxuuu.xyz> wrote:
> >
> > Hi Alexei,
> >
> > On Sat, Oct 28, 2023 at 04:49:45PM -0700, Alexei Starovoitov wrote:
> > > On Fri, Oct 27, 2023 at 11:46 AM Daniel Xu <dxu@dxuuu.xyz> wrote:
> > > >
> > > > > [...]
> > >
> > > Patch 6 example does little to explain how this kfunc can be used.
> > > Cover letter sounds promising, but no code to demonstrate the result.
> >
> > Part of the reason for that is this kfunc is intended to be used with a
> > not-yet-upstreamed xfrm patchset. The other is that the usage is quite
> > trivial. This is the code the experiments were run with:
> >
> > https://github.com/danobi/xdp-tools/blob/e89a1c617aba3b50d990f779357d6ce2863ecb27/xdp-bench/xdp_redirect_cpumap.bpf.c#L385-L406
> >
> > We intend to upstream that cpumap mode to xdp-tools as soon as the xfrm
> > patches are in. (Note the linked code is a little buggy but the
> > main idea is there).
> 
> I don't understand how it survives anything but a sanity check.
> To measure perf gains it needs to be under traffic for some time,
> but
> x = bpf_xdp_get_xfrm_state(ctx, &opts, sizeof(opts));
> will take another reference on that state for every packet.
> At minimum that means a memory leak or a refcnt overflow.

Yeah, I agree the code in this patchset is not correct. I have the fix
(a KF_RELEASE wrapper around xfrm_state_put()) ready to send. I think
Steffen was gonna chat w/ you about this at IETF next week. But I can
send it now if you'd like.

To answer your question why it doesn't blow up immediately:

* The test system only has ~33 inbound SAs and the test doesn't try to
  delete any. So leak is not noticed in the test. Oddly enough I recall
  `ip x s flush` working correctly... Could be misremembering.

* Refcnt overflow will indeed happen, but some rough math shows it'll
  take about 12 hrs receiving at 100Gbps for that to happen. 100Gbps =
  12.5 GB/s. 12.5GB/s / (32 CPUs) / (9000B) = 43k pps for each pcpu SA.
  INT_MAX = 2 billion. 2B / 43k = 46k. 46k seconds to hours is ~12 hrs.
  And I was only running traffic for ~1 hour.

At least I think that math is right.
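
The same estimate can be written out as a small C sketch. It assumes an
even spread of traffic over the SAs, 9000-byte packets, and one refcount
increment per packet; pps_per_sa() and hours_to_overflow() are
hypothetical helper names.

```c
#include <limits.h>

/* Packets per second seen by one per-CPU SA when `gbps` of traffic made
 * of `pkt_bytes`-sized packets is spread evenly over `cpus` SAs.
 */
static double pps_per_sa(double gbps, int cpus, double pkt_bytes)
{
	double bytes_per_sec = gbps * 1e9 / 8.0;

	return bytes_per_sec / cpus / pkt_bytes;
}

/* Hours until a counter bumped once per packet reaches INT_MAX. */
static double hours_to_overflow(double pps)
{
	return (double)INT_MAX / pps / 3600.0;
}
```

For 100 Gbps, 32 CPUs and 9000B packets this gives roughly 43.4k pps per
SA. With the exact INT_MAX rather than the rounded 2 billion, the result
is a little under 14 hours, the same order as the estimate above.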

Thanks,
Daniel


* Re: [RFC bpf-next 1/6] bpf: xfrm: Add bpf_xdp_get_xfrm_state() kfunc
  2023-11-01 17:51         ` Daniel Xu
@ 2023-11-01 18:51           ` Alexei Starovoitov
  0 siblings, 0 replies; 16+ messages in thread
From: Alexei Starovoitov @ 2023-11-01 18:51 UTC (permalink / raw)
  To: Daniel Xu
  Cc: Jesper Dangaard Brouer, Steffen Klassert, Alexei Starovoitov,
	Paolo Abeni, Daniel Borkmann, Jakub Kicinski, Herbert Xu,
	David S. Miller, John Fastabend, Eric Dumazet, antony.antony,
	LKML, Network Development, bpf, devel

On Wed, Nov 1, 2023 at 10:51 AM Daniel Xu <dxu@dxuuu.xyz> wrote:
>
> Yeah, I agree the code in this patchset is not correct. I have the fix
> (a KF_RELEASE wrapper around xfrm_state_put()) ready to send. I think
> Steffen was gonna chat w/ you about this at IETF next week. But I can
> send it now if you'd like.

I say send a new version with all issues addressed now, since
it might help to frame the discussion at IETF.

>
> To answer your question why it doesn't blow up immediately:
>
> * The test system only has ~33 inbound SAs and the test doesn't try to
>   delete any. So leak is not noticed in the test. Oddly enough I recall
>   `ip x s flush` working correctly... Could be misremembering.
>
> * Refcnt overflow will indeed happen, but some rough math shows it'll
>   take about 12 hrs receiving at 100Gbps for that to happen. 100Gbps =
>   12.5 GB/s. 12.5GB/s / (32 CPUs) / (9000B) = 43k pps for each pcpu SA.
>   INT_MAX = 2 billion. 2B / 43k = 46k. 46k seconds to hours is ~12 hrs.
>   And I was only running traffic for ~1 hour.
>
> At least I think that math is right.

Makes sense.


Thread overview: 16+ messages
2023-10-27 18:46 [RFC bpf-next 0/6] Add bpf_xdp_get_xfrm_state() kfunc Daniel Xu
2023-10-27 18:46 ` [RFC bpf-next 1/6] bpf: xfrm: " Daniel Xu
2023-10-28 23:49   ` Alexei Starovoitov
2023-10-29 22:55     ` Daniel Xu
2023-10-31 22:38       ` Alexei Starovoitov
2023-11-01 17:51         ` Daniel Xu
2023-11-01 18:51           ` Alexei Starovoitov
2023-10-27 18:46 ` [RFC bpf-next 2/6] bpf: selftests: test_tunnel: Use ping -6 over ping6 Daniel Xu
2023-10-27 18:46 ` [RFC bpf-next 3/6] bpf: selftests: test_tunnel: Mount bpffs if necessary Daniel Xu
2023-10-27 18:46 ` [RFC bpf-next 4/6] bpf: selftests: test_tunnel: Use vmlinux.h declarations Daniel Xu
2023-10-27 18:46 ` [RFC bpf-next 5/6] bpf: selftests: test_tunnel: Disable CO-RE relocations Daniel Xu
2023-10-27 20:33   ` Andrii Nakryiko
2023-10-29 23:22     ` Daniel Xu
2023-10-31  6:25       ` Andrii Nakryiko
2023-10-27 18:46 ` [RFC bpf-next 6/6] bpf: xfrm: Add selftest for bpf_xdp_get_xfrm_state() Daniel Xu
2023-10-29  2:13 ` [RFC bpf-next 0/6] Add bpf_xdp_get_xfrm_state() kfunc Antony Antony
