* [PATCH bpf-next v4 0/7] Add bpf programmable net device
@ 2023-10-24 21:48 Daniel Borkmann
  2023-10-24 21:48 ` [PATCH bpf-next v4 1/7] netkit, bpf: " Daniel Borkmann
                   ` (8 more replies)
  0 siblings, 9 replies; 27+ messages in thread
From: Daniel Borkmann @ 2023-10-24 21:48 UTC (permalink / raw)
  To: bpf
  Cc: netdev, martin.lau, razor, ast, andrii, john.fastabend, sdf, toke,
	kuba, andrew, Daniel Borkmann

This work adds a BPF programmable device which can operate in L3 or L2
mode where the BPF program is part of the xmit routine. Its program
management is done via bpf_mprog, and it comes with BPF link support.
For details see patch 1 and the following patches. Thanks!

v3 -> v4:
  - Moved netkit_release_all() into ndo_uninit (Stan)
  - Two small commit msg corrections (Toke)
  - Added Acked/Reviewed-by
v2 -> v3:
  - Remove setting dev->min_mtu to ETH_MIN_MTU (Andrew)
  - Do not populate ethtool info->version (Andrew)
  - Populate netdev private data before register_netdevice (Andrew)
  - Use strscpy for ifname template (Jakub)
  - Use GFP_KERNEL_ACCOUNT for link kzalloc (Jakub)
  - Carry and dump link attach type for bpftool (Toke)
v1 -> v2:
  - Rename from meta (Toke, Andrii, Alexei)
  - Reuse skb_scrub_packet (Stan)
  - Remove IFF_META and use netdev_ops (Toke)
  - Add comment to multicast handler (Toke)
  - Remove silly version info (Toke)
  - Fix attach_type_name (Quentin)
  - Rework libbpf link attach api to be similar
    as tcx (Andrii)
  - Move flags last for bpf_netkit_opts (Andrii)
  - Rebased to bpf_mprog query api changes
  - Folded link support patch into main one

Daniel Borkmann (7):
  netkit, bpf: Add bpf programmable net device
  tools: Sync if_link uapi header
  libbpf: Add link-based API for netkit
  bpftool: Implement link show support for netkit
  bpftool: Extend net dump with netkit progs
  selftests/bpf: Add netlink helper library
  selftests/bpf: Add selftests for netkit

 MAINTAINERS                                   |   9 +
 drivers/net/Kconfig                           |   9 +
 drivers/net/Makefile                          |   1 +
 drivers/net/netkit.c                          | 940 ++++++++++++++++++
 include/net/netkit.h                          |  38 +
 include/uapi/linux/bpf.h                      |  14 +
 include/uapi/linux/if_link.h                  |  24 +
 kernel/bpf/syscall.c                          |  30 +-
 .../bpf/bpftool/Documentation/bpftool-net.rst |   8 +-
 tools/bpf/bpftool/link.c                      |   9 +
 tools/bpf/bpftool/net.c                       |   7 +-
 tools/include/uapi/linux/bpf.h                |  14 +
 tools/include/uapi/linux/if_link.h            | 141 +++
 tools/lib/bpf/bpf.c                           |  16 +
 tools/lib/bpf/bpf.h                           |   5 +
 tools/lib/bpf/libbpf.c                        |  39 +
 tools/lib/bpf/libbpf.h                        |  15 +
 tools/lib/bpf/libbpf.map                      |   1 +
 tools/testing/selftests/bpf/Makefile          |  19 +-
 tools/testing/selftests/bpf/config            |   1 +
 tools/testing/selftests/bpf/netlink_helpers.c | 358 +++++++
 tools/testing/selftests/bpf/netlink_helpers.h |  46 +
 .../selftests/bpf/prog_tests/tc_helpers.h     |   4 +
 .../selftests/bpf/prog_tests/tc_netkit.c      | 687 +++++++++++++
 .../selftests/bpf/progs/test_tc_link.c        |  13 +
 25 files changed, 2433 insertions(+), 15 deletions(-)
 create mode 100644 drivers/net/netkit.c
 create mode 100644 include/net/netkit.h
 create mode 100644 tools/testing/selftests/bpf/netlink_helpers.c
 create mode 100644 tools/testing/selftests/bpf/netlink_helpers.h
 create mode 100644 tools/testing/selftests/bpf/prog_tests/tc_netkit.c

-- 
2.34.1



* [PATCH bpf-next v4 1/7] netkit, bpf: Add bpf programmable net device
  2023-10-24 21:48 [PATCH bpf-next v4 0/7] Add bpf programmable net device Daniel Borkmann
@ 2023-10-24 21:48 ` Daniel Borkmann
  2023-10-25 15:47   ` Jiri Pirko
  2023-10-25 21:24   ` Kui-Feng Lee
  2023-10-24 21:48 ` [PATCH bpf-next v4 2/7] tools: Sync if_link uapi header Daniel Borkmann
                   ` (7 subsequent siblings)
  8 siblings, 2 replies; 27+ messages in thread
From: Daniel Borkmann @ 2023-10-24 21:48 UTC (permalink / raw)
  To: bpf
  Cc: netdev, martin.lau, razor, ast, andrii, john.fastabend, sdf, toke,
	kuba, andrew, Daniel Borkmann, Toke Høiland-Jørgensen

This work adds a new, minimal BPF-programmable device called "netkit"
(former PoC code-name "meta") which we recently presented at LSF/MM/BPF. The
core idea is that BPF programs are executed within the driver's xmit routine
and therefore, e.g. in the case of containers/Pods, BPF processing moves
closer to the source.

One of the goals was that in the case of Pod egress traffic, this allows
moving BPF programs from hostns tcx ingress into the device itself,
providing earlier drop or forward mechanisms. For example, if the BPF
program determines that the skb must be sent out of the node, then a
redirect to the physical device can take place directly without going
through the per-CPU backlog queue. This helps to shift processing for such
traffic from softirq to process context, leading to better scheduling
decisions/performance (see measurements in the slides).

In this initial version, the netkit device ships as a pair, but we plan to
extend this further so it can also operate in single device mode. The pair
comes with a primary and a peer device. Only the primary device, typically
residing in hostns, can manage BPF programs for itself and its peer. The
peer device is designated for containers/Pods and cannot attach/detach
BPF programs. Upon device creation, the user can set the default policy
to 'pass' or 'drop' for the case when no BPF program is attached.

Additionally, the device can be operated in L3 (default) or L2 mode. The
management of BPF programs is done via bpf_mprog, so that multi-attach is
supported right from the beginning with similar API and dependency controls
as tcx. For details on the latter see commit 053c8e1f235d ("bpf: Add generic
attach/detach/query API for multi-progs"). tc BPF compatibility is provided,
so that existing programs can be easily migrated.
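
As a rough illustration, a minimal netkit program could look as follows
(a sketch only, not part of this series; it assumes the NETKIT_* return
codes from the if_link.h uapi additions below and the "netkit/primary"
section definition added to libbpf in patch 3):

  /* Minimal netkit program sketch; the program type is
   * BPF_PROG_TYPE_SCHED_CLS, so existing tc BPF code can be reused as-is.
   */
  #include <linux/bpf.h>
  #include <linux/if_link.h>	/* NETKIT_* return codes */
  #include <bpf/bpf_helpers.h>

  SEC("netkit/primary")
  int nk_egress(struct __sk_buff *skb)
  {
  	/* Runs in the netkit xmit path. NETKIT_NEXT would continue with
  	 * the next program in the bpf_mprog chain, and bpf_redirect()
  	 * (returning NETKIT_REDIRECT) can push the skb to the physical
  	 * device directly.
  	 */
  	return NETKIT_PASS;
  }

  char LICENSE[] SEC("license") = "GPL";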

Going forward, we plan to use netkit devices in Cilium as the main device
type for connecting Pods. They will be operated in L3 mode in order to
simplify a Pod's neighbor management, and the peer will operate in default
drop mode, so that no traffic can leave between the time when a Pod is
brought up by the CNI plugin and when programs are attached by the agent.
Additionally, the programs we attach via tcx on the physical devices use
bpf_redirect_peer() for inbound traffic into the netkit device, hence the
latter also supports the ndo_get_peer_dev callback. Similarly, we use
bpf_redirect_neigh() for the way out, pushing from the netkit peer to the
physical device directly. Also, BIG TCP is supported on the netkit device.
As follow-up work for single device mode, we plan to convert Cilium's
cilium_host/_net devices into a single one.
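
As a sketch of the inbound path (again not part of this series, with the
ifindex as a placeholder that an agent would resolve at runtime), such a
tcx ingress program on the physical device could use the existing
bpf_redirect_peer() helper roughly as follows:

  #include <linux/bpf.h>
  #include <bpf/bpf_helpers.h>

  SEC("tcx/ingress")
  int phys_ingress(struct __sk_buff *skb)
  {
  	const int nk_primary_ifindex = 42;	/* placeholder ifindex */

  	/* Cross directly into the netkit peer inside the Pod's netns via
  	 * ndo_get_peer_dev, bypassing the per-CPU backlog queue.
  	 */
  	return bpf_redirect_peer(nk_primary_ifindex, 0);
  }

  char LICENSE[] SEC("license") = "GPL";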

An extensive test suite for checking device operations and the BPF program
and link management API comes as BPF selftests in this series.

Co-developed-by: Nikolay Aleksandrov <razor@blackwall.org>
Signed-off-by: Nikolay Aleksandrov <razor@blackwall.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Reviewed-by: Toke Høiland-Jørgensen <toke@redhat.com>
Acked-by: Stanislav Fomichev <sdf@google.com>
Link: https://github.com/borkmann/iproute2/tree/pr/netkit
Link: http://vger.kernel.org/bpfconf2023_material/tcx_meta_netdev_borkmann.pdf (24ff.)
---
 MAINTAINERS                    |   9 +
 drivers/net/Kconfig            |   9 +
 drivers/net/Makefile           |   1 +
 drivers/net/netkit.c           | 940 +++++++++++++++++++++++++++++++++
 include/net/netkit.h           |  38 ++
 include/uapi/linux/bpf.h       |  14 +
 include/uapi/linux/if_link.h   |  24 +
 kernel/bpf/syscall.c           |  30 +-
 tools/include/uapi/linux/bpf.h |  14 +
 9 files changed, 1074 insertions(+), 5 deletions(-)
 create mode 100644 drivers/net/netkit.c
 create mode 100644 include/net/netkit.h

diff --git a/MAINTAINERS b/MAINTAINERS
index ed33b9df8b3d..43be6197e655 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -3795,6 +3795,15 @@ L:	bpf@vger.kernel.org
 S:	Odd Fixes
 K:	(?:\b|_)bpf(?:\b|_)
 
+BPF [NETKIT] (BPF-programmable network device)
+M:	Daniel Borkmann <daniel@iogearbox.net>
+M:	Nikolay Aleksandrov <razor@blackwall.org>
+L:	bpf@vger.kernel.org
+L:	netdev@vger.kernel.org
+S:	Supported
+F:	drivers/net/netkit.c
+F:	include/net/netkit.h
+
 BPF [NETWORKING] (struct_ops, reuseport)
 M:	Martin KaFai Lau <martin.lau@linux.dev>
 L:	bpf@vger.kernel.org
diff --git a/drivers/net/Kconfig b/drivers/net/Kconfig
index 44eeb5d61ba9..af0da4bb429b 100644
--- a/drivers/net/Kconfig
+++ b/drivers/net/Kconfig
@@ -448,6 +448,15 @@ config NLMON
 	  diagnostics, etc. This is mostly intended for developers or support
 	  to debug netlink issues. If unsure, say N.
 
+config NETKIT
+	bool "BPF-programmable network device"
+	depends on BPF_SYSCALL
+	help
+	  The netkit device is a virtual networking device where BPF programs
+	  can be attached to the device(s) transmission routine in order to
+	  implement the driver's internal logic. The device can be configured
+	  to operate in L3 or L2 mode. If unsure, say N.
+
 config NET_VRF
 	tristate "Virtual Routing and Forwarding (Lite)"
 	depends on IP_MULTIPLE_TABLES
diff --git a/drivers/net/Makefile b/drivers/net/Makefile
index 8a83db32509d..7cab36f94782 100644
--- a/drivers/net/Makefile
+++ b/drivers/net/Makefile
@@ -22,6 +22,7 @@ obj-$(CONFIG_MDIO) += mdio.o
 obj-$(CONFIG_NET) += loopback.o
 obj-$(CONFIG_NETDEV_LEGACY_INIT) += Space.o
 obj-$(CONFIG_NETCONSOLE) += netconsole.o
+obj-$(CONFIG_NETKIT) += netkit.o
 obj-y += phy/
 obj-y += pse-pd/
 obj-y += mdio/
diff --git a/drivers/net/netkit.c b/drivers/net/netkit.c
new file mode 100644
index 000000000000..7e484f9fd3ae
--- /dev/null
+++ b/drivers/net/netkit.c
@@ -0,0 +1,940 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/* Copyright (c) 2023 Isovalent */
+
+#include <linux/netdevice.h>
+#include <linux/ethtool.h>
+#include <linux/etherdevice.h>
+#include <linux/filter.h>
+#include <linux/netfilter_netdev.h>
+#include <linux/bpf_mprog.h>
+
+#include <net/netkit.h>
+#include <net/dst.h>
+#include <net/tcx.h>
+
+#define DRV_NAME "netkit"
+
+struct netkit {
+	/* Needed in fast-path */
+	struct net_device __rcu *peer;
+	struct bpf_mprog_entry __rcu *active;
+	enum netkit_action policy;
+	struct bpf_mprog_bundle	bundle;
+
+	/* Needed in slow-path */
+	enum netkit_mode mode;
+	bool primary;
+	u32 headroom;
+};
+
+struct netkit_link {
+	struct bpf_link link;
+	struct net_device *dev;
+	u32 location;
+};
+
+static __always_inline int
+netkit_run(const struct bpf_mprog_entry *entry, struct sk_buff *skb,
+	   enum netkit_action ret)
+{
+	const struct bpf_mprog_fp *fp;
+	const struct bpf_prog *prog;
+
+	bpf_mprog_foreach_prog(entry, fp, prog) {
+		bpf_compute_data_pointers(skb);
+		ret = bpf_prog_run(prog, skb);
+		if (ret != NETKIT_NEXT)
+			break;
+	}
+	return ret;
+}
+
+static void netkit_prep_forward(struct sk_buff *skb, bool xnet)
+{
+	skb_scrub_packet(skb, xnet);
+	skb->priority = 0;
+	nf_skip_egress(skb, true);
+}
+
+static struct netkit *netkit_priv(const struct net_device *dev)
+{
+	return netdev_priv(dev);
+}
+
+static netdev_tx_t netkit_xmit(struct sk_buff *skb, struct net_device *dev)
+{
+	struct netkit *nk = netkit_priv(dev);
+	enum netkit_action ret = READ_ONCE(nk->policy);
+	netdev_tx_t ret_dev = NET_XMIT_SUCCESS;
+	const struct bpf_mprog_entry *entry;
+	struct net_device *peer;
+
+	rcu_read_lock();
+	peer = rcu_dereference(nk->peer);
+	if (unlikely(!peer || !(peer->flags & IFF_UP) ||
+		     !pskb_may_pull(skb, ETH_HLEN) ||
+		     skb_orphan_frags(skb, GFP_ATOMIC)))
+		goto drop;
+	netkit_prep_forward(skb, !net_eq(dev_net(dev), dev_net(peer)));
+	skb->dev = peer;
+	entry = rcu_dereference(nk->active);
+	if (entry)
+		ret = netkit_run(entry, skb, ret);
+	switch (ret) {
+	case NETKIT_NEXT:
+	case NETKIT_PASS:
+		skb->protocol = eth_type_trans(skb, skb->dev);
+		skb_postpull_rcsum(skb, eth_hdr(skb), ETH_HLEN);
+		__netif_rx(skb);
+		break;
+	case NETKIT_REDIRECT:
+		skb_do_redirect(skb);
+		break;
+	case NETKIT_DROP:
+	default:
+drop:
+		kfree_skb(skb);
+		dev_core_stats_tx_dropped_inc(dev);
+		ret_dev = NET_XMIT_DROP;
+		break;
+	}
+	rcu_read_unlock();
+	return ret_dev;
+}
+
+static int netkit_open(struct net_device *dev)
+{
+	struct netkit *nk = netkit_priv(dev);
+	struct net_device *peer = rtnl_dereference(nk->peer);
+
+	if (!peer)
+		return -ENOTCONN;
+	if (peer->flags & IFF_UP) {
+		netif_carrier_on(dev);
+		netif_carrier_on(peer);
+	}
+	return 0;
+}
+
+static int netkit_close(struct net_device *dev)
+{
+	struct netkit *nk = netkit_priv(dev);
+	struct net_device *peer = rtnl_dereference(nk->peer);
+
+	netif_carrier_off(dev);
+	if (peer)
+		netif_carrier_off(peer);
+	return 0;
+}
+
+static int netkit_get_iflink(const struct net_device *dev)
+{
+	struct netkit *nk = netkit_priv(dev);
+	struct net_device *peer;
+	int iflink = 0;
+
+	rcu_read_lock();
+	peer = rcu_dereference(nk->peer);
+	if (peer)
+		iflink = peer->ifindex;
+	rcu_read_unlock();
+	return iflink;
+}
+
+static void netkit_set_multicast(struct net_device *dev)
+{
+	/* Nothing to do, we receive whatever gets pushed to us! */
+}
+
+static void netkit_set_headroom(struct net_device *dev, int headroom)
+{
+	struct netkit *nk = netkit_priv(dev), *nk2;
+	struct net_device *peer;
+
+	if (headroom < 0)
+		headroom = NET_SKB_PAD;
+
+	rcu_read_lock();
+	peer = rcu_dereference(nk->peer);
+	if (unlikely(!peer))
+		goto out;
+
+	nk2 = netkit_priv(peer);
+	nk->headroom = headroom;
+	headroom = max(nk->headroom, nk2->headroom);
+
+	peer->needed_headroom = headroom;
+	dev->needed_headroom = headroom;
+out:
+	rcu_read_unlock();
+}
+
+static struct net_device *netkit_peer_dev(struct net_device *dev)
+{
+	return rcu_dereference(netkit_priv(dev)->peer);
+}
+
+static void netkit_uninit(struct net_device *dev);
+
+static const struct net_device_ops netkit_netdev_ops = {
+	.ndo_open		= netkit_open,
+	.ndo_stop		= netkit_close,
+	.ndo_start_xmit		= netkit_xmit,
+	.ndo_set_rx_mode	= netkit_set_multicast,
+	.ndo_set_rx_headroom	= netkit_set_headroom,
+	.ndo_get_iflink		= netkit_get_iflink,
+	.ndo_get_peer_dev	= netkit_peer_dev,
+	.ndo_uninit		= netkit_uninit,
+	.ndo_features_check	= passthru_features_check,
+};
+
+static void netkit_get_drvinfo(struct net_device *dev,
+			       struct ethtool_drvinfo *info)
+{
+	strscpy(info->driver, DRV_NAME, sizeof(info->driver));
+}
+
+static const struct ethtool_ops netkit_ethtool_ops = {
+	.get_drvinfo		= netkit_get_drvinfo,
+};
+
+static void netkit_setup(struct net_device *dev)
+{
+	static const netdev_features_t netkit_features_hw_vlan =
+		NETIF_F_HW_VLAN_CTAG_TX |
+		NETIF_F_HW_VLAN_CTAG_RX |
+		NETIF_F_HW_VLAN_STAG_TX |
+		NETIF_F_HW_VLAN_STAG_RX;
+	static const netdev_features_t netkit_features =
+		netkit_features_hw_vlan |
+		NETIF_F_SG |
+		NETIF_F_FRAGLIST |
+		NETIF_F_HW_CSUM |
+		NETIF_F_RXCSUM |
+		NETIF_F_SCTP_CRC |
+		NETIF_F_HIGHDMA |
+		NETIF_F_GSO_SOFTWARE |
+		NETIF_F_GSO_ENCAP_ALL;
+
+	ether_setup(dev);
+	dev->max_mtu = ETH_MAX_MTU;
+
+	dev->flags |= IFF_NOARP;
+	dev->priv_flags &= ~IFF_TX_SKB_SHARING;
+	dev->priv_flags |= IFF_LIVE_ADDR_CHANGE;
+	dev->priv_flags |= IFF_PHONY_HEADROOM;
+	dev->priv_flags |= IFF_NO_QUEUE;
+
+	dev->ethtool_ops = &netkit_ethtool_ops;
+	dev->netdev_ops  = &netkit_netdev_ops;
+
+	dev->features |= netkit_features | NETIF_F_LLTX;
+	dev->hw_features = netkit_features;
+	dev->hw_enc_features = netkit_features;
+	dev->mpls_features = NETIF_F_HW_CSUM | NETIF_F_GSO_SOFTWARE;
+	dev->vlan_features = dev->features & ~netkit_features_hw_vlan;
+
+	dev->needs_free_netdev = true;
+
+	netif_set_tso_max_size(dev, GSO_MAX_SIZE);
+}
+
+static struct net *netkit_get_link_net(const struct net_device *dev)
+{
+	struct netkit *nk = netkit_priv(dev);
+	struct net_device *peer = rtnl_dereference(nk->peer);
+
+	return peer ? dev_net(peer) : dev_net(dev);
+}
+
+static int netkit_check_policy(int policy, struct nlattr *tb,
+			       struct netlink_ext_ack *extack)
+{
+	switch (policy) {
+	case NETKIT_PASS:
+	case NETKIT_DROP:
+		return 0;
+	default:
+		NL_SET_ERR_MSG_ATTR(extack, tb,
+				    "Provided default xmit policy not supported");
+		return -EINVAL;
+	}
+}
+
+static int netkit_check_mode(int mode, struct nlattr *tb,
+			     struct netlink_ext_ack *extack)
+{
+	switch (mode) {
+	case NETKIT_L2:
+	case NETKIT_L3:
+		return 0;
+	default:
+		NL_SET_ERR_MSG_ATTR(extack, tb,
+				    "Provided device mode can only be L2 or L3");
+		return -EINVAL;
+	}
+}
+
+static int netkit_validate(struct nlattr *tb[], struct nlattr *data[],
+			   struct netlink_ext_ack *extack)
+{
+	struct nlattr *attr = tb[IFLA_ADDRESS];
+
+	if (!attr)
+		return 0;
+	NL_SET_ERR_MSG_ATTR(extack, attr,
+			    "Setting Ethernet address is not supported");
+	return -EOPNOTSUPP;
+}
+
+static struct rtnl_link_ops netkit_link_ops;
+
+static int netkit_new_link(struct net *src_net, struct net_device *dev,
+			   struct nlattr *tb[], struct nlattr *data[],
+			   struct netlink_ext_ack *extack)
+{
+	struct nlattr *peer_tb[IFLA_MAX + 1], **tbp = tb, *attr;
+	enum netkit_action default_prim = NETKIT_PASS;
+	enum netkit_action default_peer = NETKIT_PASS;
+	enum netkit_mode mode = NETKIT_L3;
+	unsigned char ifname_assign_type;
+	struct ifinfomsg *ifmp = NULL;
+	struct net_device *peer;
+	char ifname[IFNAMSIZ];
+	struct netkit *nk;
+	struct net *net;
+	int err;
+
+	if (data) {
+		if (data[IFLA_NETKIT_MODE]) {
+			attr = data[IFLA_NETKIT_MODE];
+			mode = nla_get_u32(attr);
+			err = netkit_check_mode(mode, attr, extack);
+			if (err < 0)
+				return err;
+		}
+		if (data[IFLA_NETKIT_PEER_INFO]) {
+			attr = data[IFLA_NETKIT_PEER_INFO];
+			ifmp = nla_data(attr);
+			err = rtnl_nla_parse_ifinfomsg(peer_tb, attr, extack);
+			if (err < 0)
+				return err;
+			err = netkit_validate(peer_tb, NULL, extack);
+			if (err < 0)
+				return err;
+			tbp = peer_tb;
+		}
+		if (data[IFLA_NETKIT_POLICY]) {
+			attr = data[IFLA_NETKIT_POLICY];
+			default_prim = nla_get_u32(attr);
+			err = netkit_check_policy(default_prim, attr, extack);
+			if (err < 0)
+				return err;
+		}
+		if (data[IFLA_NETKIT_PEER_POLICY]) {
+			attr = data[IFLA_NETKIT_PEER_POLICY];
+			default_peer = nla_get_u32(attr);
+			err = netkit_check_policy(default_peer, attr, extack);
+			if (err < 0)
+				return err;
+		}
+	}
+
+	if (ifmp && tbp[IFLA_IFNAME]) {
+		nla_strscpy(ifname, tbp[IFLA_IFNAME], IFNAMSIZ);
+		ifname_assign_type = NET_NAME_USER;
+	} else {
+		strscpy(ifname, "nk%d", IFNAMSIZ);
+		ifname_assign_type = NET_NAME_ENUM;
+	}
+
+	net = rtnl_link_get_net(src_net, tbp);
+	if (IS_ERR(net))
+		return PTR_ERR(net);
+
+	peer = rtnl_create_link(net, ifname, ifname_assign_type,
+				&netkit_link_ops, tbp, extack);
+	if (IS_ERR(peer)) {
+		put_net(net);
+		return PTR_ERR(peer);
+	}
+
+	netif_inherit_tso_max(peer, dev);
+
+	if (mode == NETKIT_L2)
+		eth_hw_addr_random(peer);
+	if (ifmp && dev->ifindex)
+		peer->ifindex = ifmp->ifi_index;
+
+	nk = netkit_priv(peer);
+	nk->primary = false;
+	nk->policy = default_peer;
+	nk->mode = mode;
+	bpf_mprog_bundle_init(&nk->bundle);
+	RCU_INIT_POINTER(nk->active, NULL);
+	RCU_INIT_POINTER(nk->peer, NULL);
+
+	err = register_netdevice(peer);
+	put_net(net);
+	if (err < 0)
+		goto err_register_peer;
+	netif_carrier_off(peer);
+	if (mode == NETKIT_L2)
+		dev_change_flags(peer, peer->flags & ~IFF_NOARP, NULL);
+
+	err = rtnl_configure_link(peer, NULL, 0, NULL);
+	if (err < 0)
+		goto err_configure_peer;
+
+	if (mode == NETKIT_L2)
+		eth_hw_addr_random(dev);
+	if (tb[IFLA_IFNAME])
+		nla_strscpy(dev->name, tb[IFLA_IFNAME], IFNAMSIZ);
+	else
+		strscpy(dev->name, "nk%d", IFNAMSIZ);
+
+	nk = netkit_priv(dev);
+	nk->primary = true;
+	nk->policy = default_prim;
+	nk->mode = mode;
+	bpf_mprog_bundle_init(&nk->bundle);
+	RCU_INIT_POINTER(nk->active, NULL);
+	RCU_INIT_POINTER(nk->peer, NULL);
+
+	err = register_netdevice(dev);
+	if (err < 0)
+		goto err_configure_peer;
+	netif_carrier_off(dev);
+	if (mode == NETKIT_L2)
+		dev_change_flags(dev, dev->flags & ~IFF_NOARP, NULL);
+
+	rcu_assign_pointer(netkit_priv(dev)->peer, peer);
+	rcu_assign_pointer(netkit_priv(peer)->peer, dev);
+	return 0;
+err_configure_peer:
+	unregister_netdevice(peer);
+	return err;
+err_register_peer:
+	free_netdev(peer);
+	return err;
+}
+
+static struct bpf_mprog_entry *netkit_entry_fetch(struct net_device *dev,
+						  bool bundle_fallback)
+{
+	struct netkit *nk = netkit_priv(dev);
+	struct bpf_mprog_entry *entry;
+
+	ASSERT_RTNL();
+	entry = rcu_dereference_rtnl(nk->active);
+	if (entry)
+		return entry;
+	if (bundle_fallback)
+		return &nk->bundle.a;
+	return NULL;
+}
+
+static void netkit_entry_update(struct net_device *dev,
+				struct bpf_mprog_entry *entry)
+{
+	struct netkit *nk = netkit_priv(dev);
+
+	ASSERT_RTNL();
+	rcu_assign_pointer(nk->active, entry);
+}
+
+static void netkit_entry_sync(void)
+{
+	synchronize_rcu();
+}
+
+static struct net_device *netkit_dev_fetch(struct net *net, u32 ifindex, u32 which)
+{
+	struct net_device *dev;
+	struct netkit *nk;
+
+	ASSERT_RTNL();
+
+	switch (which) {
+	case BPF_NETKIT_PRIMARY:
+	case BPF_NETKIT_PEER:
+		break;
+	default:
+		return ERR_PTR(-EINVAL);
+	}
+
+	dev = __dev_get_by_index(net, ifindex);
+	if (!dev)
+		return ERR_PTR(-ENODEV);
+	if (dev->netdev_ops != &netkit_netdev_ops)
+		return ERR_PTR(-ENXIO);
+
+	nk = netkit_priv(dev);
+	if (!nk->primary)
+		return ERR_PTR(-EACCES);
+	if (which == BPF_NETKIT_PEER) {
+		dev = rcu_dereference_rtnl(nk->peer);
+		if (!dev)
+			return ERR_PTR(-ENODEV);
+	}
+	return dev;
+}
+
+int netkit_prog_attach(const union bpf_attr *attr, struct bpf_prog *prog)
+{
+	struct bpf_mprog_entry *entry, *entry_new;
+	struct bpf_prog *replace_prog = NULL;
+	struct net_device *dev;
+	int ret;
+
+	rtnl_lock();
+	dev = netkit_dev_fetch(current->nsproxy->net_ns, attr->target_ifindex,
+			       attr->attach_type);
+	if (IS_ERR(dev)) {
+		ret = PTR_ERR(dev);
+		goto out;
+	}
+	entry = netkit_entry_fetch(dev, true);
+	if (attr->attach_flags & BPF_F_REPLACE) {
+		replace_prog = bpf_prog_get_type(attr->replace_bpf_fd,
+						 prog->type);
+		if (IS_ERR(replace_prog)) {
+			ret = PTR_ERR(replace_prog);
+			replace_prog = NULL;
+			goto out;
+		}
+	}
+	ret = bpf_mprog_attach(entry, &entry_new, prog, NULL, replace_prog,
+			       attr->attach_flags, attr->relative_fd,
+			       attr->expected_revision);
+	if (!ret) {
+		if (entry != entry_new) {
+			netkit_entry_update(dev, entry_new);
+			netkit_entry_sync();
+		}
+		bpf_mprog_commit(entry);
+	}
+out:
+	if (replace_prog)
+		bpf_prog_put(replace_prog);
+	rtnl_unlock();
+	return ret;
+}
+
+int netkit_prog_detach(const union bpf_attr *attr, struct bpf_prog *prog)
+{
+	struct bpf_mprog_entry *entry, *entry_new;
+	struct net_device *dev;
+	int ret;
+
+	rtnl_lock();
+	dev = netkit_dev_fetch(current->nsproxy->net_ns, attr->target_ifindex,
+			       attr->attach_type);
+	if (IS_ERR(dev)) {
+		ret = PTR_ERR(dev);
+		goto out;
+	}
+	entry = netkit_entry_fetch(dev, false);
+	if (!entry) {
+		ret = -ENOENT;
+		goto out;
+	}
+	ret = bpf_mprog_detach(entry, &entry_new, prog, NULL, attr->attach_flags,
+			       attr->relative_fd, attr->expected_revision);
+	if (!ret) {
+		if (!bpf_mprog_total(entry_new))
+			entry_new = NULL;
+		netkit_entry_update(dev, entry_new);
+		netkit_entry_sync();
+		bpf_mprog_commit(entry);
+	}
+out:
+	rtnl_unlock();
+	return ret;
+}
+
+int netkit_prog_query(const union bpf_attr *attr, union bpf_attr __user *uattr)
+{
+	struct net_device *dev;
+	int ret;
+
+	rtnl_lock();
+	dev = netkit_dev_fetch(current->nsproxy->net_ns,
+			       attr->query.target_ifindex,
+			       attr->query.attach_type);
+	if (IS_ERR(dev)) {
+		ret = PTR_ERR(dev);
+		goto out;
+	}
+	ret = bpf_mprog_query(attr, uattr, netkit_entry_fetch(dev, false));
+out:
+	rtnl_unlock();
+	return ret;
+}
+
+static struct netkit_link *netkit_link(const struct bpf_link *link)
+{
+	return container_of(link, struct netkit_link, link);
+}
+
+static int netkit_link_prog_attach(struct bpf_link *link, u32 flags,
+				   u32 id_or_fd, u64 revision)
+{
+	struct netkit_link *nkl = netkit_link(link);
+	struct bpf_mprog_entry *entry, *entry_new;
+	struct net_device *dev = nkl->dev;
+	int ret;
+
+	ASSERT_RTNL();
+	entry = netkit_entry_fetch(dev, true);
+	ret = bpf_mprog_attach(entry, &entry_new, link->prog, link, NULL, flags,
+			       id_or_fd, revision);
+	if (!ret) {
+		if (entry != entry_new) {
+			netkit_entry_update(dev, entry_new);
+			netkit_entry_sync();
+		}
+		bpf_mprog_commit(entry);
+	}
+	return ret;
+}
+
+static void netkit_link_release(struct bpf_link *link)
+{
+	struct netkit_link *nkl = netkit_link(link);
+	struct bpf_mprog_entry *entry, *entry_new;
+	struct net_device *dev;
+	int ret = 0;
+
+	rtnl_lock();
+	dev = nkl->dev;
+	if (!dev)
+		goto out;
+	entry = netkit_entry_fetch(dev, false);
+	if (!entry) {
+		ret = -ENOENT;
+		goto out;
+	}
+	ret = bpf_mprog_detach(entry, &entry_new, link->prog, link, 0, 0, 0);
+	if (!ret) {
+		if (!bpf_mprog_total(entry_new))
+			entry_new = NULL;
+		netkit_entry_update(dev, entry_new);
+		netkit_entry_sync();
+		bpf_mprog_commit(entry);
+		nkl->dev = NULL;
+	}
+out:
+	WARN_ON_ONCE(ret);
+	rtnl_unlock();
+}
+
+static int netkit_link_update(struct bpf_link *link, struct bpf_prog *nprog,
+			      struct bpf_prog *oprog)
+{
+	struct netkit_link *nkl = netkit_link(link);
+	struct bpf_mprog_entry *entry, *entry_new;
+	struct net_device *dev;
+	int ret = 0;
+
+	rtnl_lock();
+	dev = nkl->dev;
+	if (!dev) {
+		ret = -ENOLINK;
+		goto out;
+	}
+	if (oprog && link->prog != oprog) {
+		ret = -EPERM;
+		goto out;
+	}
+	oprog = link->prog;
+	if (oprog == nprog) {
+		bpf_prog_put(nprog);
+		goto out;
+	}
+	entry = netkit_entry_fetch(dev, false);
+	if (!entry) {
+		ret = -ENOENT;
+		goto out;
+	}
+	ret = bpf_mprog_attach(entry, &entry_new, nprog, link, oprog,
+			       BPF_F_REPLACE | BPF_F_ID,
+			       link->prog->aux->id, 0);
+	if (!ret) {
+		WARN_ON_ONCE(entry != entry_new);
+		oprog = xchg(&link->prog, nprog);
+		bpf_prog_put(oprog);
+		bpf_mprog_commit(entry);
+	}
+out:
+	rtnl_unlock();
+	return ret;
+}
+
+static void netkit_link_dealloc(struct bpf_link *link)
+{
+	kfree(netkit_link(link));
+}
+
+static void netkit_link_fdinfo(const struct bpf_link *link, struct seq_file *seq)
+{
+	const struct netkit_link *nkl = netkit_link(link);
+	u32 ifindex = 0;
+
+	rtnl_lock();
+	if (nkl->dev)
+		ifindex = nkl->dev->ifindex;
+	rtnl_unlock();
+
+	seq_printf(seq, "ifindex:\t%u\n", ifindex);
+	seq_printf(seq, "attach_type:\t%u (%s)\n",
+		   nkl->location,
+		   nkl->location == BPF_NETKIT_PRIMARY ? "primary" : "peer");
+}
+
+static int netkit_link_fill_info(const struct bpf_link *link,
+				 struct bpf_link_info *info)
+{
+	const struct netkit_link *nkl = netkit_link(link);
+	u32 ifindex = 0;
+
+	rtnl_lock();
+	if (nkl->dev)
+		ifindex = nkl->dev->ifindex;
+	rtnl_unlock();
+
+	info->netkit.ifindex = ifindex;
+	info->netkit.attach_type = nkl->location;
+	return 0;
+}
+
+static int netkit_link_detach(struct bpf_link *link)
+{
+	netkit_link_release(link);
+	return 0;
+}
+
+static const struct bpf_link_ops netkit_link_lops = {
+	.release	= netkit_link_release,
+	.detach		= netkit_link_detach,
+	.dealloc	= netkit_link_dealloc,
+	.update_prog	= netkit_link_update,
+	.show_fdinfo	= netkit_link_fdinfo,
+	.fill_link_info	= netkit_link_fill_info,
+};
+
+static int netkit_link_init(struct netkit_link *nkl,
+			    struct bpf_link_primer *link_primer,
+			    const union bpf_attr *attr,
+			    struct net_device *dev,
+			    struct bpf_prog *prog)
+{
+	bpf_link_init(&nkl->link, BPF_LINK_TYPE_NETKIT,
+		      &netkit_link_lops, prog);
+	nkl->location = attr->link_create.attach_type;
+	nkl->dev = dev;
+	return bpf_link_prime(&nkl->link, link_primer);
+}
+
+int netkit_link_attach(const union bpf_attr *attr, struct bpf_prog *prog)
+{
+	struct bpf_link_primer link_primer;
+	struct netkit_link *nkl;
+	struct net_device *dev;
+	int ret;
+
+	rtnl_lock();
+	dev = netkit_dev_fetch(current->nsproxy->net_ns,
+			       attr->link_create.target_ifindex,
+			       attr->link_create.attach_type);
+	if (IS_ERR(dev)) {
+		ret = PTR_ERR(dev);
+		goto out;
+	}
+	nkl = kzalloc(sizeof(*nkl), GFP_KERNEL_ACCOUNT);
+	if (!nkl) {
+		ret = -ENOMEM;
+		goto out;
+	}
+	ret = netkit_link_init(nkl, &link_primer, attr, dev, prog);
+	if (ret) {
+		kfree(nkl);
+		goto out;
+	}
+	ret = netkit_link_prog_attach(&nkl->link,
+				      attr->link_create.flags,
+				      attr->link_create.netkit.relative_fd,
+				      attr->link_create.netkit.expected_revision);
+	if (ret) {
+		nkl->dev = NULL;
+		bpf_link_cleanup(&link_primer);
+		goto out;
+	}
+	ret = bpf_link_settle(&link_primer);
+out:
+	rtnl_unlock();
+	return ret;
+}
+
+static void netkit_release_all(struct net_device *dev)
+{
+	struct bpf_mprog_entry *entry;
+	struct bpf_tuple tuple = {};
+	struct bpf_mprog_fp *fp;
+	struct bpf_mprog_cp *cp;
+
+	entry = netkit_entry_fetch(dev, false);
+	if (!entry)
+		return;
+	netkit_entry_update(dev, NULL);
+	netkit_entry_sync();
+	bpf_mprog_foreach_tuple(entry, fp, cp, tuple) {
+		if (tuple.link)
+			netkit_link(tuple.link)->dev = NULL;
+		else
+			bpf_prog_put(tuple.prog);
+	}
+}
+
+static void netkit_uninit(struct net_device *dev)
+{
+	netkit_release_all(dev);
+}
+
+static void netkit_del_link(struct net_device *dev, struct list_head *head)
+{
+	struct netkit *nk = netkit_priv(dev);
+	struct net_device *peer = rtnl_dereference(nk->peer);
+
+	RCU_INIT_POINTER(nk->peer, NULL);
+	unregister_netdevice_queue(dev, head);
+	if (peer) {
+		nk = netkit_priv(peer);
+		RCU_INIT_POINTER(nk->peer, NULL);
+		unregister_netdevice_queue(peer, head);
+	}
+}
+
+static int netkit_change_link(struct net_device *dev, struct nlattr *tb[],
+			      struct nlattr *data[],
+			      struct netlink_ext_ack *extack)
+{
+	struct netkit *nk = netkit_priv(dev);
+	struct net_device *peer = rtnl_dereference(nk->peer);
+	enum netkit_action policy;
+	struct nlattr *attr;
+	int err;
+
+	if (!nk->primary) {
+		NL_SET_ERR_MSG(extack,
+			       "netkit link settings can be changed only through the primary device");
+		return -EACCES;
+	}
+
+	if (data[IFLA_NETKIT_MODE]) {
+		NL_SET_ERR_MSG_ATTR(extack, data[IFLA_NETKIT_MODE],
+				    "netkit link operating mode cannot be changed after device creation");
+		return -EACCES;
+	}
+
+	if (data[IFLA_NETKIT_POLICY]) {
+		attr = data[IFLA_NETKIT_POLICY];
+		policy = nla_get_u32(attr);
+		err = netkit_check_policy(policy, attr, extack);
+		if (err)
+			return err;
+		WRITE_ONCE(nk->policy, policy);
+	}
+
+	if (data[IFLA_NETKIT_PEER_POLICY]) {
+		err = -EOPNOTSUPP;
+		attr = data[IFLA_NETKIT_PEER_POLICY];
+		policy = nla_get_u32(attr);
+		if (peer)
+			err = netkit_check_policy(policy, attr, extack);
+		if (err)
+			return err;
+		nk = netkit_priv(peer);
+		WRITE_ONCE(nk->policy, policy);
+	}
+
+	return 0;
+}
+
+static size_t netkit_get_size(const struct net_device *dev)
+{
+	return nla_total_size(sizeof(u32)) + /* IFLA_NETKIT_POLICY */
+	       nla_total_size(sizeof(u32)) + /* IFLA_NETKIT_PEER_POLICY */
+	       nla_total_size(sizeof(u8))  + /* IFLA_NETKIT_PRIMARY */
+	       nla_total_size(sizeof(u32)) + /* IFLA_NETKIT_MODE */
+	       0;
+}
+
+static int netkit_fill_info(struct sk_buff *skb, const struct net_device *dev)
+{
+	struct netkit *nk = netkit_priv(dev);
+	struct net_device *peer = rtnl_dereference(nk->peer);
+
+	if (nla_put_u8(skb, IFLA_NETKIT_PRIMARY, nk->primary))
+		return -EMSGSIZE;
+	if (nla_put_u32(skb, IFLA_NETKIT_POLICY, nk->policy))
+		return -EMSGSIZE;
+	if (nla_put_u32(skb, IFLA_NETKIT_MODE, nk->mode))
+		return -EMSGSIZE;
+
+	if (peer) {
+		nk = netkit_priv(peer);
+		if (nla_put_u32(skb, IFLA_NETKIT_PEER_POLICY, nk->policy))
+			return -EMSGSIZE;
+	}
+
+	return 0;
+}
+
+static const struct nla_policy netkit_policy[IFLA_NETKIT_MAX + 1] = {
+	[IFLA_NETKIT_PEER_INFO]		= { .len = sizeof(struct ifinfomsg) },
+	[IFLA_NETKIT_POLICY]		= { .type = NLA_U32 },
+	[IFLA_NETKIT_MODE]		= { .type = NLA_U32 },
+	[IFLA_NETKIT_PEER_POLICY]	= { .type = NLA_U32 },
+	[IFLA_NETKIT_PRIMARY]		= { .type = NLA_REJECT,
+					    .reject_message = "Primary attribute is read-only" },
+};
+
+static struct rtnl_link_ops netkit_link_ops = {
+	.kind		= DRV_NAME,
+	.priv_size	= sizeof(struct netkit),
+	.setup		= netkit_setup,
+	.newlink	= netkit_new_link,
+	.dellink	= netkit_del_link,
+	.changelink	= netkit_change_link,
+	.get_link_net	= netkit_get_link_net,
+	.get_size	= netkit_get_size,
+	.fill_info	= netkit_fill_info,
+	.policy		= netkit_policy,
+	.validate	= netkit_validate,
+	.maxtype	= IFLA_NETKIT_MAX,
+};
+
+static __init int netkit_init(void)
+{
+	BUILD_BUG_ON((int)NETKIT_NEXT != (int)TCX_NEXT ||
+		     (int)NETKIT_PASS != (int)TCX_PASS ||
+		     (int)NETKIT_DROP != (int)TCX_DROP ||
+		     (int)NETKIT_REDIRECT != (int)TCX_REDIRECT);
+
+	return rtnl_link_register(&netkit_link_ops);
+}
+
+static __exit void netkit_exit(void)
+{
+	rtnl_link_unregister(&netkit_link_ops);
+}
+
+module_init(netkit_init);
+module_exit(netkit_exit);
+
+MODULE_DESCRIPTION("BPF-programmable network device");
+MODULE_AUTHOR("Daniel Borkmann <daniel@iogearbox.net>");
+MODULE_AUTHOR("Nikolay Aleksandrov <razor@blackwall.org>");
+MODULE_LICENSE("GPL");
+MODULE_ALIAS_RTNL_LINK(DRV_NAME);
diff --git a/include/net/netkit.h b/include/net/netkit.h
new file mode 100644
index 000000000000..0ba2e6b847ca
--- /dev/null
+++ b/include/net/netkit.h
@@ -0,0 +1,38 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+/* Copyright (c) 2023 Isovalent */
+#ifndef __NET_NETKIT_H
+#define __NET_NETKIT_H
+
+#include <linux/bpf.h>
+
+#ifdef CONFIG_NETKIT
+int netkit_prog_attach(const union bpf_attr *attr, struct bpf_prog *prog);
+int netkit_link_attach(const union bpf_attr *attr, struct bpf_prog *prog);
+int netkit_prog_detach(const union bpf_attr *attr, struct bpf_prog *prog);
+int netkit_prog_query(const union bpf_attr *attr, union bpf_attr __user *uattr);
+#else
+static inline int netkit_prog_attach(const union bpf_attr *attr,
+				     struct bpf_prog *prog)
+{
+	return -EINVAL;
+}
+
+static inline int netkit_link_attach(const union bpf_attr *attr,
+				     struct bpf_prog *prog)
+{
+	return -EINVAL;
+}
+
+static inline int netkit_prog_detach(const union bpf_attr *attr,
+				     struct bpf_prog *prog)
+{
+	return -EINVAL;
+}
+
+static inline int netkit_prog_query(const union bpf_attr *attr,
+				    union bpf_attr __user *uattr)
+{
+	return -EINVAL;
+}
+#endif /* CONFIG_NETKIT */
+#endif /* __NET_NETKIT_H */
diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index 7ba61b75bc0e..0f6cdf52b1da 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -1052,6 +1052,8 @@ enum bpf_attach_type {
 	BPF_CGROUP_UNIX_RECVMSG,
 	BPF_CGROUP_UNIX_GETPEERNAME,
 	BPF_CGROUP_UNIX_GETSOCKNAME,
+	BPF_NETKIT_PRIMARY,
+	BPF_NETKIT_PEER,
 	__MAX_BPF_ATTACH_TYPE
 };
 
@@ -1071,6 +1073,7 @@ enum bpf_link_type {
 	BPF_LINK_TYPE_NETFILTER = 10,
 	BPF_LINK_TYPE_TCX = 11,
 	BPF_LINK_TYPE_UPROBE_MULTI = 12,
+	BPF_LINK_TYPE_NETKIT = 13,
 	MAX_BPF_LINK_TYPE,
 };
 
@@ -1656,6 +1659,13 @@ union bpf_attr {
 				__u32		flags;
 				__u32		pid;
 			} uprobe_multi;
+			struct {
+				union {
+					__u32	relative_fd;
+					__u32	relative_id;
+				};
+				__u64		expected_revision;
+			} netkit;
 		};
 	} link_create;
 
@@ -6576,6 +6586,10 @@ struct bpf_link_info {
 			__u32 ifindex;
 			__u32 attach_type;
 		} tcx;
+		struct {
+			__u32 ifindex;
+			__u32 attach_type;
+		} netkit;
 	};
 } __attribute__((aligned(8)));
 
diff --git a/include/uapi/linux/if_link.h b/include/uapi/linux/if_link.h
index fac351a93aed..a0aa05a28cf2 100644
--- a/include/uapi/linux/if_link.h
+++ b/include/uapi/linux/if_link.h
@@ -756,6 +756,30 @@ struct tunnel_msg {
 	__u32 ifindex;
 };
 
+/* netkit section */
+enum netkit_action {
+	NETKIT_NEXT	= -1,
+	NETKIT_PASS	= 0,
+	NETKIT_DROP	= 2,
+	NETKIT_REDIRECT	= 7,
+};
+
+enum netkit_mode {
+	NETKIT_L2,
+	NETKIT_L3,
+};
+
+enum {
+	IFLA_NETKIT_UNSPEC,
+	IFLA_NETKIT_PEER_INFO,
+	IFLA_NETKIT_PRIMARY,
+	IFLA_NETKIT_POLICY,
+	IFLA_NETKIT_PEER_POLICY,
+	IFLA_NETKIT_MODE,
+	__IFLA_NETKIT_MAX,
+};
+#define IFLA_NETKIT_MAX	(__IFLA_NETKIT_MAX - 1)
+
 /* VXLAN section */
 
 /* include statistics in the dump */
diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
index a9b2cb500bf7..0ed286b8a0f0 100644
--- a/kernel/bpf/syscall.c
+++ b/kernel/bpf/syscall.c
@@ -35,8 +35,9 @@
 #include <linux/rcupdate_trace.h>
 #include <linux/memcontrol.h>
 #include <linux/trace_events.h>
-#include <net/netfilter/nf_bpf_link.h>
 
+#include <net/netfilter/nf_bpf_link.h>
+#include <net/netkit.h>
 #include <net/tcx.h>
 
 #define IS_FD_ARRAY(map) ((map)->map_type == BPF_MAP_TYPE_PERF_EVENT_ARRAY || \
@@ -3730,6 +3731,8 @@ attach_type_to_prog_type(enum bpf_attach_type attach_type)
 		return BPF_PROG_TYPE_LSM;
 	case BPF_TCX_INGRESS:
 	case BPF_TCX_EGRESS:
+	case BPF_NETKIT_PRIMARY:
+	case BPF_NETKIT_PEER:
 		return BPF_PROG_TYPE_SCHED_CLS;
 	default:
 		return BPF_PROG_TYPE_UNSPEC;
@@ -3781,7 +3784,9 @@ static int bpf_prog_attach_check_attach_type(const struct bpf_prog *prog,
 		return 0;
 	case BPF_PROG_TYPE_SCHED_CLS:
 		if (attach_type != BPF_TCX_INGRESS &&
-		    attach_type != BPF_TCX_EGRESS)
+		    attach_type != BPF_TCX_EGRESS &&
+		    attach_type != BPF_NETKIT_PRIMARY &&
+		    attach_type != BPF_NETKIT_PEER)
 			return -EINVAL;
 		return 0;
 	default:
@@ -3864,7 +3869,11 @@ static int bpf_prog_attach(const union bpf_attr *attr)
 			ret = cgroup_bpf_prog_attach(attr, ptype, prog);
 		break;
 	case BPF_PROG_TYPE_SCHED_CLS:
-		ret = tcx_prog_attach(attr, prog);
+		if (attr->attach_type == BPF_TCX_INGRESS ||
+		    attr->attach_type == BPF_TCX_EGRESS)
+			ret = tcx_prog_attach(attr, prog);
+		else
+			ret = netkit_prog_attach(attr, prog);
 		break;
 	default:
 		ret = -EINVAL;
@@ -3925,7 +3934,11 @@ static int bpf_prog_detach(const union bpf_attr *attr)
 		ret = cgroup_bpf_prog_detach(attr, ptype);
 		break;
 	case BPF_PROG_TYPE_SCHED_CLS:
-		ret = tcx_prog_detach(attr, prog);
+		if (attr->attach_type == BPF_TCX_INGRESS ||
+		    attr->attach_type == BPF_TCX_EGRESS)
+			ret = tcx_prog_detach(attr, prog);
+		else
+			ret = netkit_prog_detach(attr, prog);
 		break;
 	default:
 		ret = -EINVAL;
@@ -3992,6 +4005,9 @@ static int bpf_prog_query(const union bpf_attr *attr,
 	case BPF_TCX_INGRESS:
 	case BPF_TCX_EGRESS:
 		return tcx_prog_query(attr, uattr);
+	case BPF_NETKIT_PRIMARY:
+	case BPF_NETKIT_PEER:
+		return netkit_prog_query(attr, uattr);
 	default:
 		return -EINVAL;
 	}
@@ -4973,7 +4989,11 @@ static int link_create(union bpf_attr *attr, bpfptr_t uattr)
 		ret = bpf_xdp_link_attach(attr, prog);
 		break;
 	case BPF_PROG_TYPE_SCHED_CLS:
-		ret = tcx_link_attach(attr, prog);
+		if (attr->link_create.attach_type == BPF_TCX_INGRESS ||
+		    attr->link_create.attach_type == BPF_TCX_EGRESS)
+			ret = tcx_link_attach(attr, prog);
+		else
+			ret = netkit_link_attach(attr, prog);
 		break;
 	case BPF_PROG_TYPE_NETFILTER:
 		ret = bpf_nf_link_attach(attr, prog);
diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h
index 7ba61b75bc0e..0f6cdf52b1da 100644
--- a/tools/include/uapi/linux/bpf.h
+++ b/tools/include/uapi/linux/bpf.h
@@ -1052,6 +1052,8 @@ enum bpf_attach_type {
 	BPF_CGROUP_UNIX_RECVMSG,
 	BPF_CGROUP_UNIX_GETPEERNAME,
 	BPF_CGROUP_UNIX_GETSOCKNAME,
+	BPF_NETKIT_PRIMARY,
+	BPF_NETKIT_PEER,
 	__MAX_BPF_ATTACH_TYPE
 };
 
@@ -1071,6 +1073,7 @@ enum bpf_link_type {
 	BPF_LINK_TYPE_NETFILTER = 10,
 	BPF_LINK_TYPE_TCX = 11,
 	BPF_LINK_TYPE_UPROBE_MULTI = 12,
+	BPF_LINK_TYPE_NETKIT = 13,
 	MAX_BPF_LINK_TYPE,
 };
 
@@ -1656,6 +1659,13 @@ union bpf_attr {
 				__u32		flags;
 				__u32		pid;
 			} uprobe_multi;
+			struct {
+				union {
+					__u32	relative_fd;
+					__u32	relative_id;
+				};
+				__u64		expected_revision;
+			} netkit;
 		};
 	} link_create;
 
@@ -6576,6 +6586,10 @@ struct bpf_link_info {
 			__u32 ifindex;
 			__u32 attach_type;
 		} tcx;
+		struct {
+			__u32 ifindex;
+			__u32 attach_type;
+		} netkit;
 	};
 } __attribute__((aligned(8)));
 
-- 
2.34.1



* [PATCH bpf-next v4 2/7] tools: Sync if_link uapi header
  2023-10-24 21:48 [PATCH bpf-next v4 0/7] Add bpf programmable net device Daniel Borkmann
  2023-10-24 21:48 ` [PATCH bpf-next v4 1/7] netkit, bpf: " Daniel Borkmann
@ 2023-10-24 21:48 ` Daniel Borkmann
  2023-10-24 21:49 ` [PATCH bpf-next v4 3/7] libbpf: Add link-based API for netkit Daniel Borkmann
                   ` (6 subsequent siblings)
  8 siblings, 0 replies; 27+ messages in thread
From: Daniel Borkmann @ 2023-10-24 21:48 UTC (permalink / raw)
  To: bpf
  Cc: netdev, martin.lau, razor, ast, andrii, john.fastabend, sdf, toke,
	kuba, andrew, Daniel Borkmann

Sync if_link uapi header to the latest version as we need the refresher
in tooling for the netkit device. Given it's been a while since the last sync
and the diff is fairly big, it has been done as its own commit.

Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
---
 tools/include/uapi/linux/if_link.h | 141 +++++++++++++++++++++++++++++
 1 file changed, 141 insertions(+)

diff --git a/tools/include/uapi/linux/if_link.h b/tools/include/uapi/linux/if_link.h
index 39e659c83cfd..a0aa05a28cf2 100644
--- a/tools/include/uapi/linux/if_link.h
+++ b/tools/include/uapi/linux/if_link.h
@@ -211,6 +211,9 @@ struct rtnl_link_stats {
  * @rx_nohandler: Number of packets received on the interface
  *   but dropped by the networking stack because the device is
  *   not designated to receive packets (e.g. backup link in a bond).
+ *
+ * @rx_otherhost_dropped: Number of packets dropped due to mismatch
+ *   in destination MAC address.
  */
 struct rtnl_link_stats64 {
 	__u64	rx_packets;
@@ -243,6 +246,23 @@ struct rtnl_link_stats64 {
 	__u64	rx_compressed;
 	__u64	tx_compressed;
 	__u64	rx_nohandler;
+
+	__u64	rx_otherhost_dropped;
+};
+
+/* Subset of link stats useful for in-HW collection. Meaning of the fields is as
+ * for struct rtnl_link_stats64.
+ */
+struct rtnl_hw_stats64 {
+	__u64	rx_packets;
+	__u64	tx_packets;
+	__u64	rx_bytes;
+	__u64	tx_bytes;
+	__u64	rx_errors;
+	__u64	tx_errors;
+	__u64	rx_dropped;
+	__u64	tx_dropped;
+	__u64	multicast;
 };
 
 /* The struct should be in sync with struct ifmap */
@@ -350,7 +370,13 @@ enum {
 	IFLA_GRO_MAX_SIZE,
 	IFLA_TSO_MAX_SIZE,
 	IFLA_TSO_MAX_SEGS,
+	IFLA_ALLMULTI,		/* Allmulti count: > 0 means acts ALLMULTI */
+
+	IFLA_DEVLINK_PORT,
 
+	IFLA_GSO_IPV4_MAX_SIZE,
+	IFLA_GRO_IPV4_MAX_SIZE,
+	IFLA_DPLL_PIN,
 	__IFLA_MAX
 };
 
@@ -539,6 +565,12 @@ enum {
 	IFLA_BRPORT_MRP_IN_OPEN,
 	IFLA_BRPORT_MCAST_EHT_HOSTS_LIMIT,
 	IFLA_BRPORT_MCAST_EHT_HOSTS_CNT,
+	IFLA_BRPORT_LOCKED,
+	IFLA_BRPORT_MAB,
+	IFLA_BRPORT_MCAST_N_GROUPS,
+	IFLA_BRPORT_MCAST_MAX_GROUPS,
+	IFLA_BRPORT_NEIGH_VLAN_SUPPRESS,
+	IFLA_BRPORT_BACKUP_NHID,
 	__IFLA_BRPORT_MAX
 };
 #define IFLA_BRPORT_MAX (__IFLA_BRPORT_MAX - 1)
@@ -716,7 +748,79 @@ enum ipvlan_mode {
 #define IPVLAN_F_PRIVATE	0x01
 #define IPVLAN_F_VEPA		0x02
 
+/* Tunnel RTM header */
+struct tunnel_msg {
+	__u8 family;
+	__u8 flags;
+	__u16 reserved2;
+	__u32 ifindex;
+};
+
+/* netkit section */
+enum netkit_action {
+	NETKIT_NEXT	= -1,
+	NETKIT_PASS	= 0,
+	NETKIT_DROP	= 2,
+	NETKIT_REDIRECT	= 7,
+};
+
+enum netkit_mode {
+	NETKIT_L2,
+	NETKIT_L3,
+};
+
+enum {
+	IFLA_NETKIT_UNSPEC,
+	IFLA_NETKIT_PEER_INFO,
+	IFLA_NETKIT_PRIMARY,
+	IFLA_NETKIT_POLICY,
+	IFLA_NETKIT_PEER_POLICY,
+	IFLA_NETKIT_MODE,
+	__IFLA_NETKIT_MAX,
+};
+#define IFLA_NETKIT_MAX	(__IFLA_NETKIT_MAX - 1)
+
 /* VXLAN section */
+
+/* include statistics in the dump */
+#define TUNNEL_MSG_FLAG_STATS	0x01
+
+#define TUNNEL_MSG_VALID_USER_FLAGS TUNNEL_MSG_FLAG_STATS
+
+/* Embedded inside VXLAN_VNIFILTER_ENTRY_STATS */
+enum {
+	VNIFILTER_ENTRY_STATS_UNSPEC,
+	VNIFILTER_ENTRY_STATS_RX_BYTES,
+	VNIFILTER_ENTRY_STATS_RX_PKTS,
+	VNIFILTER_ENTRY_STATS_RX_DROPS,
+	VNIFILTER_ENTRY_STATS_RX_ERRORS,
+	VNIFILTER_ENTRY_STATS_TX_BYTES,
+	VNIFILTER_ENTRY_STATS_TX_PKTS,
+	VNIFILTER_ENTRY_STATS_TX_DROPS,
+	VNIFILTER_ENTRY_STATS_TX_ERRORS,
+	VNIFILTER_ENTRY_STATS_PAD,
+	__VNIFILTER_ENTRY_STATS_MAX
+};
+#define VNIFILTER_ENTRY_STATS_MAX (__VNIFILTER_ENTRY_STATS_MAX - 1)
+
+enum {
+	VXLAN_VNIFILTER_ENTRY_UNSPEC,
+	VXLAN_VNIFILTER_ENTRY_START,
+	VXLAN_VNIFILTER_ENTRY_END,
+	VXLAN_VNIFILTER_ENTRY_GROUP,
+	VXLAN_VNIFILTER_ENTRY_GROUP6,
+	VXLAN_VNIFILTER_ENTRY_STATS,
+	__VXLAN_VNIFILTER_ENTRY_MAX
+};
+#define VXLAN_VNIFILTER_ENTRY_MAX	(__VXLAN_VNIFILTER_ENTRY_MAX - 1)
+
+enum {
+	VXLAN_VNIFILTER_UNSPEC,
+	VXLAN_VNIFILTER_ENTRY,
+	__VXLAN_VNIFILTER_MAX
+};
+#define VXLAN_VNIFILTER_MAX	(__VXLAN_VNIFILTER_MAX - 1)
+
 enum {
 	IFLA_VXLAN_UNSPEC,
 	IFLA_VXLAN_ID,
@@ -748,6 +852,8 @@ enum {
 	IFLA_VXLAN_GPE,
 	IFLA_VXLAN_TTL_INHERIT,
 	IFLA_VXLAN_DF,
+	IFLA_VXLAN_VNIFILTER, /* only applicable with COLLECT_METADATA mode */
+	IFLA_VXLAN_LOCALBYPASS,
 	__IFLA_VXLAN_MAX
 };
 #define IFLA_VXLAN_MAX	(__IFLA_VXLAN_MAX - 1)
@@ -781,6 +887,7 @@ enum {
 	IFLA_GENEVE_LABEL,
 	IFLA_GENEVE_TTL_INHERIT,
 	IFLA_GENEVE_DF,
+	IFLA_GENEVE_INNER_PROTO_INHERIT,
 	__IFLA_GENEVE_MAX
 };
 #define IFLA_GENEVE_MAX	(__IFLA_GENEVE_MAX - 1)
@@ -826,6 +933,8 @@ enum {
 	IFLA_GTP_FD1,
 	IFLA_GTP_PDP_HASHSIZE,
 	IFLA_GTP_ROLE,
+	IFLA_GTP_CREATE_SOCKETS,
+	IFLA_GTP_RESTART_COUNT,
 	__IFLA_GTP_MAX,
 };
 #define IFLA_GTP_MAX (__IFLA_GTP_MAX - 1)
@@ -1162,6 +1271,17 @@ enum {
 
 #define IFLA_STATS_FILTER_BIT(ATTR)	(1 << (ATTR - 1))
 
+enum {
+	IFLA_STATS_GETSET_UNSPEC,
+	IFLA_STATS_GET_FILTERS, /* Nest of IFLA_STATS_LINK_xxx, each a u32 with
+				 * a filter mask for the corresponding group.
+				 */
+	IFLA_STATS_SET_OFFLOAD_XSTATS_L3_STATS, /* 0 or 1 as u8 */
+	__IFLA_STATS_GETSET_MAX,
+};
+
+#define IFLA_STATS_GETSET_MAX (__IFLA_STATS_GETSET_MAX - 1)
+
 /* These are embedded into IFLA_STATS_LINK_XSTATS:
  * [IFLA_STATS_LINK_XSTATS]
  * -> [LINK_XSTATS_TYPE_xxx]
@@ -1179,10 +1299,21 @@ enum {
 enum {
 	IFLA_OFFLOAD_XSTATS_UNSPEC,
 	IFLA_OFFLOAD_XSTATS_CPU_HIT, /* struct rtnl_link_stats64 */
+	IFLA_OFFLOAD_XSTATS_HW_S_INFO,	/* HW stats info. A nest */
+	IFLA_OFFLOAD_XSTATS_L3_STATS,	/* struct rtnl_hw_stats64 */
 	__IFLA_OFFLOAD_XSTATS_MAX
 };
 #define IFLA_OFFLOAD_XSTATS_MAX (__IFLA_OFFLOAD_XSTATS_MAX - 1)
 
+enum {
+	IFLA_OFFLOAD_XSTATS_HW_S_INFO_UNSPEC,
+	IFLA_OFFLOAD_XSTATS_HW_S_INFO_REQUEST,		/* u8 */
+	IFLA_OFFLOAD_XSTATS_HW_S_INFO_USED,		/* u8 */
+	__IFLA_OFFLOAD_XSTATS_HW_S_INFO_MAX,
+};
+#define IFLA_OFFLOAD_XSTATS_HW_S_INFO_MAX \
+	(__IFLA_OFFLOAD_XSTATS_HW_S_INFO_MAX - 1)
+
 /* XDP section */
 
 #define XDP_FLAGS_UPDATE_IF_NOEXIST	(1U << 0)
@@ -1281,4 +1412,14 @@ enum {
 
 #define IFLA_MCTP_MAX (__IFLA_MCTP_MAX - 1)
 
+/* DSA section */
+
+enum {
+	IFLA_DSA_UNSPEC,
+	IFLA_DSA_MASTER,
+	__IFLA_DSA_MAX,
+};
+
+#define IFLA_DSA_MAX	(__IFLA_DSA_MAX - 1)
+
 #endif /* _UAPI_LINUX_IF_LINK_H */
-- 
2.34.1



* [PATCH bpf-next v4 3/7] libbpf: Add link-based API for netkit
  2023-10-24 21:48 [PATCH bpf-next v4 0/7] Add bpf programmable net device Daniel Borkmann
  2023-10-24 21:48 ` [PATCH bpf-next v4 1/7] netkit, bpf: " Daniel Borkmann
  2023-10-24 21:48 ` [PATCH bpf-next v4 2/7] tools: Sync if_link uapi header Daniel Borkmann
@ 2023-10-24 21:49 ` Daniel Borkmann
  2023-10-24 21:49 ` [PATCH bpf-next v4 4/7] bpftool: Implement link show support " Daniel Borkmann
                   ` (5 subsequent siblings)
  8 siblings, 0 replies; 27+ messages in thread
From: Daniel Borkmann @ 2023-10-24 21:49 UTC (permalink / raw)
  To: bpf
  Cc: netdev, martin.lau, razor, ast, andrii, john.fastabend, sdf, toke,
	kuba, andrew, Daniel Borkmann

This adds the bpf_program__attach_netkit() API to libbpf. Overall it is very
similar to tcx. The API looks as follows:

  LIBBPF_API struct bpf_link *
  bpf_program__attach_netkit(const struct bpf_program *prog, int ifindex,
                             const struct bpf_netkit_opts *opts);

The struct bpf_netkit_opts is done in a similar way as struct bpf_tcx_opts
for supporting bpf_mprog control parameters. The attach location for the
primary and peer device is derived from the program sections "netkit/primary"
and "netkit/peer", respectively.

Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
---
 tools/lib/bpf/bpf.c      | 16 ++++++++++++++++
 tools/lib/bpf/bpf.h      |  5 +++++
 tools/lib/bpf/libbpf.c   | 39 +++++++++++++++++++++++++++++++++++++++
 tools/lib/bpf/libbpf.h   | 15 +++++++++++++++
 tools/lib/bpf/libbpf.map |  1 +
 5 files changed, 76 insertions(+)

diff --git a/tools/lib/bpf/bpf.c b/tools/lib/bpf/bpf.c
index b0f1913763a3..9dc9625651dc 100644
--- a/tools/lib/bpf/bpf.c
+++ b/tools/lib/bpf/bpf.c
@@ -810,6 +810,22 @@ int bpf_link_create(int prog_fd, int target_fd,
 		if (!OPTS_ZEROED(opts, tcx))
 			return libbpf_err(-EINVAL);
 		break;
+	case BPF_NETKIT_PRIMARY:
+	case BPF_NETKIT_PEER:
+		relative_fd = OPTS_GET(opts, netkit.relative_fd, 0);
+		relative_id = OPTS_GET(opts, netkit.relative_id, 0);
+		if (relative_fd && relative_id)
+			return libbpf_err(-EINVAL);
+		if (relative_id) {
+			attr.link_create.netkit.relative_id = relative_id;
+			attr.link_create.flags |= BPF_F_ID;
+		} else {
+			attr.link_create.netkit.relative_fd = relative_fd;
+		}
+		attr.link_create.netkit.expected_revision = OPTS_GET(opts, netkit.expected_revision, 0);
+		if (!OPTS_ZEROED(opts, netkit))
+			return libbpf_err(-EINVAL);
+		break;
 	default:
 		if (!OPTS_ZEROED(opts, flags))
 			return libbpf_err(-EINVAL);
diff --git a/tools/lib/bpf/bpf.h b/tools/lib/bpf/bpf.h
index 74c2887cfd24..d0f53772bdc0 100644
--- a/tools/lib/bpf/bpf.h
+++ b/tools/lib/bpf/bpf.h
@@ -415,6 +415,11 @@ struct bpf_link_create_opts {
 			__u32 relative_id;
 			__u64 expected_revision;
 		} tcx;
+		struct {
+			__u32 relative_fd;
+			__u32 relative_id;
+			__u64 expected_revision;
+		} netkit;
 	};
 	size_t :0;
 };
diff --git a/tools/lib/bpf/libbpf.c b/tools/lib/bpf/libbpf.c
index a295f6acf98f..e067be95da3c 100644
--- a/tools/lib/bpf/libbpf.c
+++ b/tools/lib/bpf/libbpf.c
@@ -126,6 +126,8 @@ static const char * const attach_type_name[] = {
 	[BPF_TCX_INGRESS]		= "tcx_ingress",
 	[BPF_TCX_EGRESS]		= "tcx_egress",
 	[BPF_TRACE_UPROBE_MULTI]	= "trace_uprobe_multi",
+	[BPF_NETKIT_PRIMARY]		= "netkit_primary",
+	[BPF_NETKIT_PEER]		= "netkit_peer",
 };
 
 static const char * const link_type_name[] = {
@@ -142,6 +144,7 @@ static const char * const link_type_name[] = {
 	[BPF_LINK_TYPE_NETFILTER]		= "netfilter",
 	[BPF_LINK_TYPE_TCX]			= "tcx",
 	[BPF_LINK_TYPE_UPROBE_MULTI]		= "uprobe_multi",
+	[BPF_LINK_TYPE_NETKIT]			= "netkit",
 };
 
 static const char * const map_type_name[] = {
@@ -8915,6 +8918,8 @@ static const struct bpf_sec_def section_defs[] = {
 	SEC_DEF("tc",			SCHED_CLS, 0, SEC_NONE), /* deprecated / legacy, use tcx */
 	SEC_DEF("classifier",		SCHED_CLS, 0, SEC_NONE), /* deprecated / legacy, use tcx */
 	SEC_DEF("action",		SCHED_ACT, 0, SEC_NONE), /* deprecated / legacy, use tcx */
+	SEC_DEF("netkit/primary",	SCHED_CLS, BPF_NETKIT_PRIMARY, SEC_NONE),
+	SEC_DEF("netkit/peer",		SCHED_CLS, BPF_NETKIT_PEER, SEC_NONE),
 	SEC_DEF("tracepoint+",		TRACEPOINT, 0, SEC_NONE, attach_tp),
 	SEC_DEF("tp+",			TRACEPOINT, 0, SEC_NONE, attach_tp),
 	SEC_DEF("raw_tracepoint+",	RAW_TRACEPOINT, 0, SEC_NONE, attach_raw_tp),
@@ -12126,6 +12131,40 @@ bpf_program__attach_tcx(const struct bpf_program *prog, int ifindex,
 	return bpf_program_attach_fd(prog, ifindex, "tcx", &link_create_opts);
 }
 
+struct bpf_link *
+bpf_program__attach_netkit(const struct bpf_program *prog, int ifindex,
+			   const struct bpf_netkit_opts *opts)
+{
+	LIBBPF_OPTS(bpf_link_create_opts, link_create_opts);
+	__u32 relative_id;
+	int relative_fd;
+
+	if (!OPTS_VALID(opts, bpf_netkit_opts))
+		return libbpf_err_ptr(-EINVAL);
+
+	relative_id = OPTS_GET(opts, relative_id, 0);
+	relative_fd = OPTS_GET(opts, relative_fd, 0);
+
+	/* validate we don't have unexpected combinations of non-zero fields */
+	if (!ifindex) {
+		pr_warn("prog '%s': target netdevice ifindex cannot be zero\n",
+			prog->name);
+		return libbpf_err_ptr(-EINVAL);
+	}
+	if (relative_fd && relative_id) {
+		pr_warn("prog '%s': relative_fd and relative_id cannot be set at the same time\n",
+			prog->name);
+		return libbpf_err_ptr(-EINVAL);
+	}
+
+	link_create_opts.netkit.expected_revision = OPTS_GET(opts, expected_revision, 0);
+	link_create_opts.netkit.relative_fd = relative_fd;
+	link_create_opts.netkit.relative_id = relative_id;
+	link_create_opts.flags = OPTS_GET(opts, flags, 0);
+
+	return bpf_program_attach_fd(prog, ifindex, "netkit", &link_create_opts);
+}
+
 struct bpf_link *bpf_program__attach_freplace(const struct bpf_program *prog,
 					      int target_fd,
 					      const char *attach_func_name)
diff --git a/tools/lib/bpf/libbpf.h b/tools/lib/bpf/libbpf.h
index 475378438545..6cd9c501624f 100644
--- a/tools/lib/bpf/libbpf.h
+++ b/tools/lib/bpf/libbpf.h
@@ -800,6 +800,21 @@ LIBBPF_API struct bpf_link *
 bpf_program__attach_tcx(const struct bpf_program *prog, int ifindex,
 			const struct bpf_tcx_opts *opts);
 
+struct bpf_netkit_opts {
+	/* size of this struct, for forward/backward compatibility */
+	size_t sz;
+	__u32 flags;
+	__u32 relative_fd;
+	__u32 relative_id;
+	__u64 expected_revision;
+	size_t :0;
+};
+#define bpf_netkit_opts__last_field expected_revision
+
+LIBBPF_API struct bpf_link *
+bpf_program__attach_netkit(const struct bpf_program *prog, int ifindex,
+			   const struct bpf_netkit_opts *opts);
+
 struct bpf_map;
 
 LIBBPF_API struct bpf_link *bpf_map__attach_struct_ops(const struct bpf_map *map);
diff --git a/tools/lib/bpf/libbpf.map b/tools/lib/bpf/libbpf.map
index cc973b678a39..b52dc28dc8af 100644
--- a/tools/lib/bpf/libbpf.map
+++ b/tools/lib/bpf/libbpf.map
@@ -398,6 +398,7 @@ LIBBPF_1.3.0 {
 		bpf_object__unpin;
 		bpf_prog_detach_opts;
 		bpf_program__attach_netfilter;
+		bpf_program__attach_netkit;
 		bpf_program__attach_tcx;
 		bpf_program__attach_uprobe_multi;
 		ring__avail_data_size;
-- 
2.34.1



* [PATCH bpf-next v4 4/7] bpftool: Implement link show support for netkit
  2023-10-24 21:48 [PATCH bpf-next v4 0/7] Add bpf programmable net device Daniel Borkmann
                   ` (2 preceding siblings ...)
  2023-10-24 21:49 ` [PATCH bpf-next v4 3/7] libbpf: Add link-based API for netkit Daniel Borkmann
@ 2023-10-24 21:49 ` Daniel Borkmann
  2023-10-24 21:49 ` [PATCH bpf-next v4 5/7] bpftool: Extend net dump with netkit progs Daniel Borkmann
                   ` (4 subsequent siblings)
  8 siblings, 0 replies; 27+ messages in thread
From: Daniel Borkmann @ 2023-10-24 21:49 UTC (permalink / raw)
  To: bpf
  Cc: netdev, martin.lau, razor, ast, andrii, john.fastabend, sdf, toke,
	kuba, andrew, Daniel Borkmann, Quentin Monnet

Add support for dumping netkit link information to bpftool in a similar way
as we have for XDP. The netkit link info only exposes the ifindex and the
attach_type.

Below shows an example link dump output, and a cgroup link is included for
comparison, too:

  # bpftool link
  [...]
  10: cgroup  prog 2466
        cgroup_id 1  attach_type cgroup_inet6_post_bind
  [...]
  8: netkit  prog 35
        ifindex nk1(18)  attach_type netkit_primary
  [...]

Equivalent json output:

  # bpftool link --json
  [...]
  {
    "id": 10,
    "type": "cgroup",
    "prog_id": 2466,
    "cgroup_id": 1,
    "attach_type": "cgroup_inet6_post_bind"
  },
  [...]
  {
    "id": 12,
    "type": "netkit",
    "prog_id": 61,
    "devname": "nk1",
    "ifindex": 21,
    "attach_type": "netkit_primary"
  }
  [...]

Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Reviewed-by: Quentin Monnet <quentin@isovalent.com>
---
 tools/bpf/bpftool/link.c | 9 +++++++++
 1 file changed, 9 insertions(+)

diff --git a/tools/bpf/bpftool/link.c b/tools/bpf/bpftool/link.c
index 4b1407b05056..a1528cde81ab 100644
--- a/tools/bpf/bpftool/link.c
+++ b/tools/bpf/bpftool/link.c
@@ -451,6 +451,10 @@ static int show_link_close_json(int fd, struct bpf_link_info *info)
 		show_link_ifindex_json(info->tcx.ifindex, json_wtr);
 		show_link_attach_type_json(info->tcx.attach_type, json_wtr);
 		break;
+	case BPF_LINK_TYPE_NETKIT:
+		show_link_ifindex_json(info->netkit.ifindex, json_wtr);
+		show_link_attach_type_json(info->netkit.attach_type, json_wtr);
+		break;
 	case BPF_LINK_TYPE_XDP:
 		show_link_ifindex_json(info->xdp.ifindex, json_wtr);
 		break;
@@ -791,6 +795,11 @@ static int show_link_close_plain(int fd, struct bpf_link_info *info)
 		show_link_ifindex_plain(info->tcx.ifindex);
 		show_link_attach_type_plain(info->tcx.attach_type);
 		break;
+	case BPF_LINK_TYPE_NETKIT:
+		printf("\n\t");
+		show_link_ifindex_plain(info->netkit.ifindex);
+		show_link_attach_type_plain(info->netkit.attach_type);
+		break;
 	case BPF_LINK_TYPE_XDP:
 		printf("\n\t");
 		show_link_ifindex_plain(info->xdp.ifindex);
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 27+ messages in thread

* [PATCH bpf-next v4 5/7] bpftool: Extend net dump with netkit progs
  2023-10-24 21:48 [PATCH bpf-next v4 0/7] Add bpf programmable net device Daniel Borkmann
                   ` (3 preceding siblings ...)
  2023-10-24 21:49 ` [PATCH bpf-next v4 4/7] bpftool: Implement link show support " Daniel Borkmann
@ 2023-10-24 21:49 ` Daniel Borkmann
  2023-10-24 21:49 ` [PATCH bpf-next v4 6/7] selftests/bpf: Add netlink helper library Daniel Borkmann
                   ` (3 subsequent siblings)
  8 siblings, 0 replies; 27+ messages in thread
From: Daniel Borkmann @ 2023-10-24 21:49 UTC (permalink / raw)
  To: bpf
  Cc: netdev, martin.lau, razor, ast, andrii, john.fastabend, sdf, toke,
	kuba, andrew, Daniel Borkmann, Quentin Monnet

Add support for dumping BPF programs on netkit via bpftool. This includes both
the BPF link and attach ops programs. The dumped information contains the attach
location, function entry name, program ID and link ID when applicable.

Example with tc BPF link:

  # ./bpftool net
  xdp:

  tc:
  nk1(22) netkit/peer tc1 prog_id 43 link_id 12

  [...]

Example with json dump:

  # ./bpftool net --json | jq
  [
    {
      "xdp": [],
      "tc": [
        {
          "devname": "nk1",
          "ifindex": 18,
          "kind": "netkit/primary",
          "name": "tc1",
          "prog_id": 29,
          "prog_flags": [],
          "link_id": 8,
          "link_flags": []
        }
      ],
      "flow_dissector": [],
      "netfilter": []
    }
  ]

Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Reviewed-by: Quentin Monnet <quentin@isovalent.com>
---
 tools/bpf/bpftool/Documentation/bpftool-net.rst | 8 ++++----
 tools/bpf/bpftool/net.c                         | 7 ++++++-
 2 files changed, 10 insertions(+), 5 deletions(-)

diff --git a/tools/bpf/bpftool/Documentation/bpftool-net.rst b/tools/bpf/bpftool/Documentation/bpftool-net.rst
index 5e2abd3de5ab..dd3f9469765b 100644
--- a/tools/bpf/bpftool/Documentation/bpftool-net.rst
+++ b/tools/bpf/bpftool/Documentation/bpftool-net.rst
@@ -37,7 +37,7 @@ DESCRIPTION
 	**bpftool net { show | list }** [ **dev** *NAME* ]
 		  List bpf program attachments in the kernel networking subsystem.
 
-		  Currently, device driver xdp attachments, tcx and old-style tc
+		  Currently, device driver xdp attachments, tcx, netkit and old-style tc
 		  classifier/action attachments, flow_dissector as well as netfilter
 		  attachments are implemented, i.e., for
 		  program types **BPF_PROG_TYPE_XDP**, **BPF_PROG_TYPE_SCHED_CLS**,
@@ -52,11 +52,11 @@ DESCRIPTION
 		  bpf programs, users should consult other tools, e.g., iproute2.
 
 		  The current output will start with all xdp program attachments, followed by
-		  all tcx, then tc class/qdisc bpf program attachments, then flow_dissector
-		  and finally netfilter programs. Both xdp programs and tcx/tc programs are
+		  all tcx, netkit, then tc class/qdisc bpf program attachments, then flow_dissector
+		  and finally netfilter programs. Both xdp programs and tcx/netkit/tc programs are
 		  ordered based on ifindex number. If multiple bpf programs attached
 		  to the same networking device through **tc**, the order will be first
-		  all bpf programs attached to tcx, then tc classes, then all bpf programs
+		  all bpf programs attached to tcx, netkit, then tc classes, then all bpf programs
 		  attached to non clsact qdiscs, and finally all bpf programs attached
 		  to root and clsact qdisc.
 
diff --git a/tools/bpf/bpftool/net.c b/tools/bpf/bpftool/net.c
index 66a8ce8ae012..968714b4c3d4 100644
--- a/tools/bpf/bpftool/net.c
+++ b/tools/bpf/bpftool/net.c
@@ -79,6 +79,8 @@ static const char * const attach_type_strings[] = {
 static const char * const attach_loc_strings[] = {
 	[BPF_TCX_INGRESS]		= "tcx/ingress",
 	[BPF_TCX_EGRESS]		= "tcx/egress",
+	[BPF_NETKIT_PRIMARY]		= "netkit/primary",
+	[BPF_NETKIT_PEER]		= "netkit/peer",
 };
 
 const size_t net_attach_type_size = ARRAY_SIZE(attach_type_strings);
@@ -506,6 +508,9 @@ static void show_dev_tc_bpf(struct ip_devname_ifindex *dev)
 {
 	__show_dev_tc_bpf(dev, BPF_TCX_INGRESS);
 	__show_dev_tc_bpf(dev, BPF_TCX_EGRESS);
+
+	__show_dev_tc_bpf(dev, BPF_NETKIT_PRIMARY);
+	__show_dev_tc_bpf(dev, BPF_NETKIT_PEER);
 }
 
 static int show_dev_tc_bpf_classic(int sock, unsigned int nl_pid,
@@ -926,7 +931,7 @@ static int do_help(int argc, char **argv)
 		"       ATTACH_TYPE := { xdp | xdpgeneric | xdpdrv | xdpoffload }\n"
 		"       " HELP_SPEC_OPTIONS " }\n"
 		"\n"
-		"Note: Only xdp, tcx, tc, flow_dissector and netfilter attachments\n"
+		"Note: Only xdp, tcx, tc, netkit, flow_dissector and netfilter attachments\n"
 		"      are currently supported.\n"
 		"      For progs attached to cgroups, use \"bpftool cgroup\"\n"
 		"      to dump program attachments. For program types\n"
-- 
2.34.1
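
As a reader aid (an editor's sketch, not part of this series): the per-device query
that such a net dump ultimately boils down to, expressed via the libbpf query API.
The device name is an illustrative assumption and error handling is trimmed; this
is not bpftool's actual implementation.

  #include <stdio.h>
  #include <net/if.h>
  #include <bpf/bpf.h>

  static void dump_netkit_primary_progs(const char *dev)	/* dev name assumed */
  {
  	__u32 prog_ids[8] = {}, link_ids[8] = {};
  	LIBBPF_OPTS(bpf_prog_query_opts, optq,
  		.prog_ids = prog_ids,
  		.link_ids = link_ids,
  		.count	  = 8,
  	);
  	int ifindex = if_nametoindex(dev);

  	if (!ifindex || bpf_prog_query_opts(ifindex, BPF_NETKIT_PRIMARY, &optq))
  		return;
  	for (__u32 i = 0; i < optq.count; i++)
  		printf("%s(%d) netkit/primary prog_id %u link_id %u\n",
  		       dev, ifindex, prog_ids[i], link_ids[i]);
  }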


^ permalink raw reply related	[flat|nested] 27+ messages in thread

* [PATCH bpf-next v4 6/7] selftests/bpf: Add netlink helper library
  2023-10-24 21:48 [PATCH bpf-next v4 0/7] Add bpf programmable net device Daniel Borkmann
                   ` (4 preceding siblings ...)
  2023-10-24 21:49 ` [PATCH bpf-next v4 5/7] bpftool: Extend net dump with netkit progs Daniel Borkmann
@ 2023-10-24 21:49 ` Daniel Borkmann
  2023-10-24 21:49 ` [PATCH bpf-next v4 7/7] selftests/bpf: Add selftests for netkit Daniel Borkmann
                   ` (2 subsequent siblings)
  8 siblings, 0 replies; 27+ messages in thread
From: Daniel Borkmann @ 2023-10-24 21:49 UTC (permalink / raw)
  To: bpf
  Cc: netdev, martin.lau, razor, ast, andrii, john.fastabend, sdf, toke,
	kuba, andrew, Daniel Borkmann

Add a minimal netlink helper library for the BPF selftests. It has been
taken from iproute2 and cut down and cleaned up. It covers basics such
as netdevice creation, which we need for BPF selftests / BPF CI given the
iproute2 package cannot cover it yet.

Stanislav Fomichev suggested that this could be replaced in the future by
ynl-tool-generated C code once it has RTNL support to create devices. Once we
get to that point, the BPF CI would also need to add libmnl. If no further
extensions are needed, a second option could be to remove this code
again once the iproute2 package has support.

Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
---
 tools/testing/selftests/bpf/Makefile          |  19 +-
 tools/testing/selftests/bpf/netlink_helpers.c | 358 ++++++++++++++++++
 tools/testing/selftests/bpf/netlink_helpers.h |  46 +++
 3 files changed, 418 insertions(+), 5 deletions(-)
 create mode 100644 tools/testing/selftests/bpf/netlink_helpers.c
 create mode 100644 tools/testing/selftests/bpf/netlink_helpers.h

diff --git a/tools/testing/selftests/bpf/Makefile b/tools/testing/selftests/bpf/Makefile
index 4225f975fce3..9c27b67bc7b1 100644
--- a/tools/testing/selftests/bpf/Makefile
+++ b/tools/testing/selftests/bpf/Makefile
@@ -585,11 +585,20 @@ endef
 # Define test_progs test runner.
 TRUNNER_TESTS_DIR := prog_tests
 TRUNNER_BPF_PROGS_DIR := progs
-TRUNNER_EXTRA_SOURCES := test_progs.c cgroup_helpers.c trace_helpers.c	\
-			 network_helpers.c testing_helpers.c		\
-			 btf_helpers.c flow_dissector_load.h		\
-			 cap_helpers.c test_loader.c xsk.c disasm.c	\
-			 json_writer.c unpriv_helpers.c 		\
+TRUNNER_EXTRA_SOURCES := test_progs.c		\
+			 cgroup_helpers.c	\
+			 trace_helpers.c	\
+			 network_helpers.c	\
+			 testing_helpers.c	\
+			 btf_helpers.c		\
+			 cap_helpers.c		\
+			 unpriv_helpers.c 	\
+			 netlink_helpers.c	\
+			 test_loader.c		\
+			 xsk.c			\
+			 disasm.c		\
+			 json_writer.c 		\
+			 flow_dissector_load.h	\
 			 ip_check_defrag_frags.h
 TRUNNER_EXTRA_FILES := $(OUTPUT)/urandom_read $(OUTPUT)/bpf_testmod.ko	\
 		       $(OUTPUT)/liburandom_read.so			\
diff --git a/tools/testing/selftests/bpf/netlink_helpers.c b/tools/testing/selftests/bpf/netlink_helpers.c
new file mode 100644
index 000000000000..caf36eb1d032
--- /dev/null
+++ b/tools/testing/selftests/bpf/netlink_helpers.c
@@ -0,0 +1,358 @@
+// SPDX-License-Identifier: GPL-2.0-or-later
+/* Taken & modified from iproute2's libnetlink.c
+ * Authors: Alexey Kuznetsov, <kuznet@ms2.inr.ac.ru>
+ */
+#include <stdio.h>
+#include <stdlib.h>
+#include <unistd.h>
+#include <errno.h>
+#include <time.h>
+#include <sys/socket.h>
+
+#include "netlink_helpers.h"
+
+static int rcvbuf = 1024 * 1024;
+
+void rtnl_close(struct rtnl_handle *rth)
+{
+	if (rth->fd >= 0) {
+		close(rth->fd);
+		rth->fd = -1;
+	}
+}
+
+int rtnl_open_byproto(struct rtnl_handle *rth, unsigned int subscriptions,
+		      int protocol)
+{
+	socklen_t addr_len;
+	int sndbuf = 32768;
+	int one = 1;
+
+	memset(rth, 0, sizeof(*rth));
+	rth->proto = protocol;
+	rth->fd = socket(AF_NETLINK, SOCK_RAW | SOCK_CLOEXEC, protocol);
+	if (rth->fd < 0) {
+		perror("Cannot open netlink socket");
+		return -1;
+	}
+	if (setsockopt(rth->fd, SOL_SOCKET, SO_SNDBUF,
+		       &sndbuf, sizeof(sndbuf)) < 0) {
+		perror("SO_SNDBUF");
+		goto err;
+	}
+	if (setsockopt(rth->fd, SOL_SOCKET, SO_RCVBUF,
+		       &rcvbuf, sizeof(rcvbuf)) < 0) {
+		perror("SO_RCVBUF");
+		goto err;
+	}
+
+	/* Older kernels may not support extended ACK reporting */
+	setsockopt(rth->fd, SOL_NETLINK, NETLINK_EXT_ACK,
+		   &one, sizeof(one));
+
+	memset(&rth->local, 0, sizeof(rth->local));
+	rth->local.nl_family = AF_NETLINK;
+	rth->local.nl_groups = subscriptions;
+
+	if (bind(rth->fd, (struct sockaddr *)&rth->local,
+		 sizeof(rth->local)) < 0) {
+		perror("Cannot bind netlink socket");
+		goto err;
+	}
+	addr_len = sizeof(rth->local);
+	if (getsockname(rth->fd, (struct sockaddr *)&rth->local,
+			&addr_len) < 0) {
+		perror("Cannot getsockname");
+		goto err;
+	}
+	if (addr_len != sizeof(rth->local)) {
+		fprintf(stderr, "Wrong address length %d\n", addr_len);
+		goto err;
+	}
+	if (rth->local.nl_family != AF_NETLINK) {
+		fprintf(stderr, "Wrong address family %d\n",
+			rth->local.nl_family);
+		goto err;
+	}
+	rth->seq = time(NULL);
+	return 0;
+err:
+	rtnl_close(rth);
+	return -1;
+}
+
+int rtnl_open(struct rtnl_handle *rth, unsigned int subscriptions)
+{
+	return rtnl_open_byproto(rth, subscriptions, NETLINK_ROUTE);
+}
+
+static int __rtnl_recvmsg(int fd, struct msghdr *msg, int flags)
+{
+	int len;
+
+	do {
+		len = recvmsg(fd, msg, flags);
+	} while (len < 0 && (errno == EINTR || errno == EAGAIN));
+	if (len < 0) {
+		fprintf(stderr, "netlink receive error %s (%d)\n",
+			strerror(errno), errno);
+		return -errno;
+	}
+	if (len == 0) {
+		fprintf(stderr, "EOF on netlink\n");
+		return -ENODATA;
+	}
+	return len;
+}
+
+static int rtnl_recvmsg(int fd, struct msghdr *msg, char **answer)
+{
+	struct iovec *iov = msg->msg_iov;
+	char *buf;
+	int len;
+
+	iov->iov_base = NULL;
+	iov->iov_len = 0;
+
+	len = __rtnl_recvmsg(fd, msg, MSG_PEEK | MSG_TRUNC);
+	if (len < 0)
+		return len;
+	if (len < 32768)
+		len = 32768;
+	buf = malloc(len);
+	if (!buf) {
+		fprintf(stderr, "malloc error: not enough buffer\n");
+		return -ENOMEM;
+	}
+	iov->iov_base = buf;
+	iov->iov_len = len;
+	len = __rtnl_recvmsg(fd, msg, 0);
+	if (len < 0) {
+		free(buf);
+		return len;
+	}
+	if (answer)
+		*answer = buf;
+	else
+		free(buf);
+	return len;
+}
+
+static void rtnl_talk_error(struct nlmsghdr *h, struct nlmsgerr *err,
+			    nl_ext_ack_fn_t errfn)
+{
+	fprintf(stderr, "RTNETLINK answers: %s\n",
+		strerror(-err->error));
+}
+
+static int __rtnl_talk_iov(struct rtnl_handle *rtnl, struct iovec *iov,
+			   size_t iovlen, struct nlmsghdr **answer,
+			   bool show_rtnl_err, nl_ext_ack_fn_t errfn)
+{
+	struct sockaddr_nl nladdr = { .nl_family = AF_NETLINK };
+	struct iovec riov;
+	struct msghdr msg = {
+		.msg_name	= &nladdr,
+		.msg_namelen	= sizeof(nladdr),
+		.msg_iov	= iov,
+		.msg_iovlen	= iovlen,
+	};
+	unsigned int seq = 0;
+	struct nlmsghdr *h;
+	int i, status;
+	char *buf;
+
+	for (i = 0; i < iovlen; i++) {
+		h = iov[i].iov_base;
+		h->nlmsg_seq = seq = ++rtnl->seq;
+		if (answer == NULL)
+			h->nlmsg_flags |= NLM_F_ACK;
+	}
+	status = sendmsg(rtnl->fd, &msg, 0);
+	if (status < 0) {
+		perror("Cannot talk to rtnetlink");
+		return -1;
+	}
+	/* change msg to use the response iov */
+	msg.msg_iov = &riov;
+	msg.msg_iovlen = 1;
+	i = 0;
+	while (1) {
+next:
+		status = rtnl_recvmsg(rtnl->fd, &msg, &buf);
+		++i;
+		if (status < 0)
+			return status;
+		if (msg.msg_namelen != sizeof(nladdr)) {
+			fprintf(stderr,
+				"Sender address length == %d!\n",
+				msg.msg_namelen);
+			exit(1);
+		}
+		for (h = (struct nlmsghdr *)buf; status >= sizeof(*h); ) {
+			int len = h->nlmsg_len;
+			int l = len - sizeof(*h);
+
+			if (l < 0 || len > status) {
+				if (msg.msg_flags & MSG_TRUNC) {
+					fprintf(stderr, "Truncated message!\n");
+					free(buf);
+					return -1;
+				}
+				fprintf(stderr,
+					"Malformed message: len=%d!\n",
+					len);
+				exit(1);
+			}
+			if (nladdr.nl_pid != 0 ||
+			    h->nlmsg_pid != rtnl->local.nl_pid ||
+			    h->nlmsg_seq > seq || h->nlmsg_seq < seq - iovlen) {
+				/* Don't forget to skip that message. */
+				status -= NLMSG_ALIGN(len);
+				h = (struct nlmsghdr *)((char *)h + NLMSG_ALIGN(len));
+				continue;
+			}
+			if (h->nlmsg_type == NLMSG_ERROR) {
+				struct nlmsgerr *err = (struct nlmsgerr *)NLMSG_DATA(h);
+				int error = err->error;
+
+				if (l < sizeof(struct nlmsgerr)) {
+					fprintf(stderr, "ERROR truncated\n");
+					free(buf);
+					return -1;
+				}
+				if (error) {
+					errno = -error;
+					if (rtnl->proto != NETLINK_SOCK_DIAG &&
+					    show_rtnl_err)
+						rtnl_talk_error(h, err, errfn);
+				}
+				if (i < iovlen) {
+					free(buf);
+					goto next;
+				}
+				if (error) {
+					free(buf);
+					return -i;
+				}
+				if (answer)
+					*answer = (struct nlmsghdr *)buf;
+				else
+					free(buf);
+				return 0;
+			}
+			if (answer) {
+				*answer = (struct nlmsghdr *)buf;
+				return 0;
+			}
+			fprintf(stderr, "Unexpected reply!\n");
+			status -= NLMSG_ALIGN(len);
+			h = (struct nlmsghdr *)((char *)h + NLMSG_ALIGN(len));
+		}
+		free(buf);
+		if (msg.msg_flags & MSG_TRUNC) {
+			fprintf(stderr, "Message truncated!\n");
+			continue;
+		}
+		if (status) {
+			fprintf(stderr, "Remnant of size %d!\n", status);
+			exit(1);
+		}
+	}
+}
+
+static int __rtnl_talk(struct rtnl_handle *rtnl, struct nlmsghdr *n,
+		       struct nlmsghdr **answer, bool show_rtnl_err,
+		       nl_ext_ack_fn_t errfn)
+{
+	struct iovec iov = {
+		.iov_base	= n,
+		.iov_len	= n->nlmsg_len,
+	};
+
+	return __rtnl_talk_iov(rtnl, &iov, 1, answer, show_rtnl_err, errfn);
+}
+
+int rtnl_talk(struct rtnl_handle *rtnl, struct nlmsghdr *n,
+	      struct nlmsghdr **answer)
+{
+	return __rtnl_talk(rtnl, n, answer, true, NULL);
+}
+
+int addattr(struct nlmsghdr *n, int maxlen, int type)
+{
+	return addattr_l(n, maxlen, type, NULL, 0);
+}
+
+int addattr8(struct nlmsghdr *n, int maxlen, int type, __u8 data)
+{
+	return addattr_l(n, maxlen, type, &data, sizeof(__u8));
+}
+
+int addattr16(struct nlmsghdr *n, int maxlen, int type, __u16 data)
+{
+	return addattr_l(n, maxlen, type, &data, sizeof(__u16));
+}
+
+int addattr32(struct nlmsghdr *n, int maxlen, int type, __u32 data)
+{
+	return addattr_l(n, maxlen, type, &data, sizeof(__u32));
+}
+
+int addattr64(struct nlmsghdr *n, int maxlen, int type, __u64 data)
+{
+	return addattr_l(n, maxlen, type, &data, sizeof(__u64));
+}
+
+int addattrstrz(struct nlmsghdr *n, int maxlen, int type, const char *str)
+{
+	return addattr_l(n, maxlen, type, str, strlen(str)+1);
+}
+
+int addattr_l(struct nlmsghdr *n, int maxlen, int type, const void *data,
+	      int alen)
+{
+	int len = RTA_LENGTH(alen);
+	struct rtattr *rta;
+
+	if (NLMSG_ALIGN(n->nlmsg_len) + RTA_ALIGN(len) > maxlen) {
+		fprintf(stderr, "%s: Message exceeded bound of %d\n",
+			__func__, maxlen);
+		return -1;
+	}
+	rta = NLMSG_TAIL(n);
+	rta->rta_type = type;
+	rta->rta_len = len;
+	if (alen)
+		memcpy(RTA_DATA(rta), data, alen);
+	n->nlmsg_len = NLMSG_ALIGN(n->nlmsg_len) + RTA_ALIGN(len);
+	return 0;
+}
+
+int addraw_l(struct nlmsghdr *n, int maxlen, const void *data, int len)
+{
+	if (NLMSG_ALIGN(n->nlmsg_len) + NLMSG_ALIGN(len) > maxlen) {
+		fprintf(stderr, "%s: Message exceeded bound of %d\n",
+			__func__, maxlen);
+		return -1;
+	}
+
+	memcpy(NLMSG_TAIL(n), data, len);
+	memset((void *) NLMSG_TAIL(n) + len, 0, NLMSG_ALIGN(len) - len);
+	n->nlmsg_len = NLMSG_ALIGN(n->nlmsg_len) + NLMSG_ALIGN(len);
+	return 0;
+}
+
+struct rtattr *addattr_nest(struct nlmsghdr *n, int maxlen, int type)
+{
+	struct rtattr *nest = NLMSG_TAIL(n);
+
+	addattr_l(n, maxlen, type, NULL, 0);
+	return nest;
+}
+
+int addattr_nest_end(struct nlmsghdr *n, struct rtattr *nest)
+{
+	nest->rta_len = (void *)NLMSG_TAIL(n) - (void *)nest;
+	return n->nlmsg_len;
+}
diff --git a/tools/testing/selftests/bpf/netlink_helpers.h b/tools/testing/selftests/bpf/netlink_helpers.h
new file mode 100644
index 000000000000..68116818a47e
--- /dev/null
+++ b/tools/testing/selftests/bpf/netlink_helpers.h
@@ -0,0 +1,46 @@
+/* SPDX-License-Identifier: GPL-2.0-or-later */
+#ifndef NETLINK_HELPERS_H
+#define NETLINK_HELPERS_H
+
+#include <string.h>
+#include <linux/netlink.h>
+#include <linux/rtnetlink.h>
+
+struct rtnl_handle {
+	int			fd;
+	struct sockaddr_nl	local;
+	struct sockaddr_nl	peer;
+	__u32			seq;
+	__u32			dump;
+	int			proto;
+	FILE			*dump_fp;
+#define RTNL_HANDLE_F_LISTEN_ALL_NSID		0x01
+#define RTNL_HANDLE_F_SUPPRESS_NLERR		0x02
+#define RTNL_HANDLE_F_STRICT_CHK		0x04
+	int			flags;
+};
+
+#define NLMSG_TAIL(nmsg) \
+	((struct rtattr *) (((void *) (nmsg)) + NLMSG_ALIGN((nmsg)->nlmsg_len)))
+
+typedef int (*nl_ext_ack_fn_t)(const char *errmsg, uint32_t off,
+			       const struct nlmsghdr *inner_nlh);
+
+int rtnl_open(struct rtnl_handle *rth, unsigned int subscriptions)
+	      __attribute__((warn_unused_result));
+void rtnl_close(struct rtnl_handle *rth);
+int rtnl_talk(struct rtnl_handle *rtnl, struct nlmsghdr *n,
+	      struct nlmsghdr **answer)
+	      __attribute__((warn_unused_result));
+
+int addattr(struct nlmsghdr *n, int maxlen, int type);
+int addattr8(struct nlmsghdr *n, int maxlen, int type, __u8 data);
+int addattr16(struct nlmsghdr *n, int maxlen, int type, __u16 data);
+int addattr32(struct nlmsghdr *n, int maxlen, int type, __u32 data);
+int addattr64(struct nlmsghdr *n, int maxlen, int type, __u64 data);
+int addattrstrz(struct nlmsghdr *n, int maxlen, int type, const char *data);
+int addattr_l(struct nlmsghdr *n, int maxlen, int type, const void *data, int alen);
+int addraw_l(struct nlmsghdr *n, int maxlen, const void *data, int len);
+struct rtattr *addattr_nest(struct nlmsghdr *n, int maxlen, int type);
+int addattr_nest_end(struct nlmsghdr *n, struct rtattr *nest);
+#endif /* NETLINK_HELPERS_H */
-- 
2.34.1
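
As a reader aid (an editor's sketch, not part of this series): how these helpers
compose into an RTM_NEWLINK request. Creating a "dummy" device here is purely
illustrative; the selftest in the next patch applies the same pattern to create
the netkit pair.

  #include <sys/socket.h>
  #include <linux/rtnetlink.h>

  #include "netlink_helpers.h"

  static int create_dummy_example(const char *name)	/* e.g. "dummy0", assumed free */
  {
  	struct rtnl_handle rth = { .fd = -1 };
  	struct {
  		struct nlmsghdr  n;
  		struct ifinfomsg i;
  		char             buf[1024];
  	} req = {};
  	struct rtattr *linkinfo;
  	int err;

  	if (rtnl_open(&rth, 0))
  		return -1;

  	req.n.nlmsg_len = NLMSG_LENGTH(sizeof(struct ifinfomsg));
  	req.n.nlmsg_flags = NLM_F_REQUEST | NLM_F_CREATE | NLM_F_EXCL;
  	req.n.nlmsg_type = RTM_NEWLINK;
  	req.i.ifi_family = AF_UNSPEC;

  	/* IFLA_IFNAME, then a nested IFLA_LINKINFO carrying the link kind */
  	addattrstrz(&req.n, sizeof(req), IFLA_IFNAME, name);
  	linkinfo = addattr_nest(&req.n, sizeof(req), IFLA_LINKINFO);
  	addattrstrz(&req.n, sizeof(req), IFLA_INFO_KIND, "dummy");
  	addattr_nest_end(&req.n, linkinfo);

  	err = rtnl_talk(&rth, &req.n, NULL);
  	rtnl_close(&rth);
  	return err;
  }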


^ permalink raw reply related	[flat|nested] 27+ messages in thread

* [PATCH bpf-next v4 7/7] selftests/bpf: Add selftests for netkit
  2023-10-24 21:48 [PATCH bpf-next v4 0/7] Add bpf programmable net device Daniel Borkmann
                   ` (5 preceding siblings ...)
  2023-10-24 21:49 ` [PATCH bpf-next v4 6/7] selftests/bpf: Add netlink helper library Daniel Borkmann
@ 2023-10-24 21:49 ` Daniel Borkmann
  2023-10-24 22:45 ` [PATCH bpf-next v4 0/7] Add bpf programmable net device Martin KaFai Lau
  2023-10-24 23:50 ` patchwork-bot+netdevbpf
  8 siblings, 0 replies; 27+ messages in thread
From: Daniel Borkmann @ 2023-10-24 21:49 UTC (permalink / raw)
  To: bpf
  Cc: netdev, martin.lau, razor, ast, andrii, john.fastabend, sdf, toke,
	kuba, andrew, Daniel Borkmann

Add a bigger batch of test coverage to assert correct operation of
netkit devices and their BPF program management:

  # ./test_progs -t tc_netkit
  [...]
  [    1.166267] bpf_testmod: loading out-of-tree module taints kernel.
  [    1.166831] bpf_testmod: module verification failed: signature and/or required key missing - tainting kernel
  [    1.270957] tsc: Refined TSC clocksource calibration: 3407.988 MHz
  [    1.272579] clocksource: tsc: mask: 0xffffffffffffffff max_cycles: 0x311fc932722, max_idle_ns: 440795381586 ns
  [    1.275336] clocksource: Switched to clocksource tsc
  #257     tc_netkit_basic:OK
  #258     tc_netkit_device:OK
  #259     tc_netkit_multi_links:OK
  #260     tc_netkit_multi_opts:OK
  #261     tc_netkit_neigh_links:OK
  Summary: 5/0 PASSED, 0 SKIPPED, 0 FAILED
  [...]

Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
---
 tools/testing/selftests/bpf/config            |   1 +
 .../selftests/bpf/prog_tests/tc_helpers.h     |   4 +
 .../selftests/bpf/prog_tests/tc_netkit.c      | 687 ++++++++++++++++++
 .../selftests/bpf/progs/test_tc_link.c        |  13 +
 4 files changed, 705 insertions(+)
 create mode 100644 tools/testing/selftests/bpf/prog_tests/tc_netkit.c

diff --git a/tools/testing/selftests/bpf/config b/tools/testing/selftests/bpf/config
index 02dd4409200e..3ec5927ec3e5 100644
--- a/tools/testing/selftests/bpf/config
+++ b/tools/testing/selftests/bpf/config
@@ -71,6 +71,7 @@ CONFIG_NETFILTER_SYNPROXY=y
 CONFIG_NETFILTER_XT_CONNMARK=y
 CONFIG_NETFILTER_XT_MATCH_STATE=y
 CONFIG_NETFILTER_XT_TARGET_CT=y
+CONFIG_NETKIT=y
 CONFIG_NF_CONNTRACK=y
 CONFIG_NF_CONNTRACK_MARK=y
 CONFIG_NF_DEFRAG_IPV4=y
diff --git a/tools/testing/selftests/bpf/prog_tests/tc_helpers.h b/tools/testing/selftests/bpf/prog_tests/tc_helpers.h
index 67f985f7d215..924d0e25320c 100644
--- a/tools/testing/selftests/bpf/prog_tests/tc_helpers.h
+++ b/tools/testing/selftests/bpf/prog_tests/tc_helpers.h
@@ -4,6 +4,10 @@
 #define TC_HELPERS
 #include <test_progs.h>
 
+#ifndef loopback
+# define loopback 1
+#endif
+
 static inline __u32 id_from_prog_fd(int fd)
 {
 	struct bpf_prog_info prog_info = {};
diff --git a/tools/testing/selftests/bpf/prog_tests/tc_netkit.c b/tools/testing/selftests/bpf/prog_tests/tc_netkit.c
new file mode 100644
index 000000000000..15ee7b2fc410
--- /dev/null
+++ b/tools/testing/selftests/bpf/prog_tests/tc_netkit.c
@@ -0,0 +1,687 @@
+// SPDX-License-Identifier: GPL-2.0
+/* Copyright (c) 2023 Isovalent */
+#include <uapi/linux/if_link.h>
+#include <net/if.h>
+#include <test_progs.h>
+
+#define netkit_peer "nk0"
+#define netkit_name "nk1"
+
+#define ping_addr_neigh		0x0a000002 /* 10.0.0.2 */
+#define ping_addr_noneigh	0x0a000003 /* 10.0.0.3 */
+
+#include "test_tc_link.skel.h"
+#include "netlink_helpers.h"
+#include "tc_helpers.h"
+
+#define ICMP_ECHO 8
+
+struct icmphdr {
+	__u8		type;
+	__u8		code;
+	__sum16		checksum;
+	struct {
+		__be16	id;
+		__be16	sequence;
+	} echo;
+};
+
+struct iplink_req {
+	struct nlmsghdr  n;
+	struct ifinfomsg i;
+	char             buf[1024];
+};
+
+static int create_netkit(int mode, int policy, int peer_policy, int *ifindex,
+			 bool same_netns)
+{
+	struct rtnl_handle rth = { .fd = -1 };
+	struct iplink_req req = {};
+	struct rtattr *linkinfo, *data;
+	const char *type = "netkit";
+	int err;
+
+	err = rtnl_open(&rth, 0);
+	if (!ASSERT_OK(err, "open_rtnetlink"))
+		return err;
+
+	memset(&req, 0, sizeof(req));
+	req.n.nlmsg_len = NLMSG_LENGTH(sizeof(struct ifinfomsg));
+	req.n.nlmsg_flags = NLM_F_REQUEST | NLM_F_CREATE | NLM_F_EXCL;
+	req.n.nlmsg_type = RTM_NEWLINK;
+	req.i.ifi_family = AF_UNSPEC;
+
+	addattr_l(&req.n, sizeof(req), IFLA_IFNAME, netkit_name,
+		  strlen(netkit_name));
+	linkinfo = addattr_nest(&req.n, sizeof(req), IFLA_LINKINFO);
+	addattr_l(&req.n, sizeof(req), IFLA_INFO_KIND, type, strlen(type));
+	data = addattr_nest(&req.n, sizeof(req), IFLA_INFO_DATA);
+	addattr32(&req.n, sizeof(req), IFLA_NETKIT_POLICY, policy);
+	addattr32(&req.n, sizeof(req), IFLA_NETKIT_PEER_POLICY, peer_policy);
+	addattr32(&req.n, sizeof(req), IFLA_NETKIT_MODE, mode);
+	addattr_nest_end(&req.n, data);
+	addattr_nest_end(&req.n, linkinfo);
+
+	err = rtnl_talk(&rth, &req.n, NULL);
+	ASSERT_OK(err, "talk_rtnetlink");
+	rtnl_close(&rth);
+	*ifindex = if_nametoindex(netkit_name);
+
+	ASSERT_GT(*ifindex, 0, "retrieve_ifindex");
+	ASSERT_OK(system("ip netns add foo"), "create netns");
+	ASSERT_OK(system("ip link set dev " netkit_name " up"),
+			 "up primary");
+	ASSERT_OK(system("ip addr add dev " netkit_name " 10.0.0.1/24"),
+			 "addr primary");
+	if (same_netns) {
+		ASSERT_OK(system("ip link set dev " netkit_peer " up"),
+				 "up peer");
+		ASSERT_OK(system("ip addr add dev " netkit_peer " 10.0.0.2/24"),
+				 "addr peer");
+	} else {
+		ASSERT_OK(system("ip link set " netkit_peer " netns foo"),
+				 "move peer");
+		ASSERT_OK(system("ip netns exec foo ip link set dev "
+				 netkit_peer " up"), "up peer");
+		ASSERT_OK(system("ip netns exec foo ip addr add dev "
+				 netkit_peer " 10.0.0.2/24"), "addr peer");
+	}
+	return err;
+}
+
+static void destroy_netkit(void)
+{
+	ASSERT_OK(system("ip link del dev " netkit_name), "del primary");
+	ASSERT_OK(system("ip netns del foo"), "delete netns");
+	ASSERT_EQ(if_nametoindex(netkit_name), 0, netkit_name "_ifindex");
+}
+
+static int __send_icmp(__u32 dest)
+{
+	struct sockaddr_in addr;
+	struct icmphdr icmp;
+	int sock, ret;
+
+	ret = write_sysctl("/proc/sys/net/ipv4/ping_group_range", "0 0");
+	if (!ASSERT_OK(ret, "write_sysctl(net.ipv4.ping_group_range)"))
+		return ret;
+
+	sock = socket(AF_INET, SOCK_DGRAM, IPPROTO_ICMP);
+	if (!ASSERT_GE(sock, 0, "icmp_socket"))
+		return -errno;
+
+	ret = setsockopt(sock, SOL_SOCKET, SO_BINDTODEVICE,
+			 netkit_name, strlen(netkit_name) + 1);
+	if (!ASSERT_OK(ret, "setsockopt(SO_BINDTODEVICE)"))
+		goto out;
+
+	memset(&addr, 0, sizeof(addr));
+	addr.sin_family = AF_INET;
+	addr.sin_addr.s_addr = htonl(dest);
+
+	memset(&icmp, 0, sizeof(icmp));
+	icmp.type = ICMP_ECHO;
+	icmp.echo.id = 1234;
+	icmp.echo.sequence = 1;
+
+	ret = sendto(sock, &icmp, sizeof(icmp), 0,
+		     (struct sockaddr *)&addr, sizeof(addr));
+	if (!ASSERT_GE(ret, 0, "icmp_sendto"))
+		ret = -errno;
+	else
+		ret = 0;
+out:
+	close(sock);
+	return ret;
+}
+
+static int send_icmp(void)
+{
+	return __send_icmp(ping_addr_neigh);
+}
+
+void serial_test_tc_netkit_basic(void)
+{
+	LIBBPF_OPTS(bpf_prog_query_opts, optq);
+	LIBBPF_OPTS(bpf_netkit_opts, optl);
+	__u32 prog_ids[2], link_ids[2];
+	__u32 pid1, pid2, lid1, lid2;
+	struct test_tc_link *skel;
+	struct bpf_link *link;
+	int err, ifindex;
+
+	err = create_netkit(NETKIT_L2, NETKIT_PASS, NETKIT_PASS,
+			    &ifindex, false);
+	if (err)
+		return;
+
+	skel = test_tc_link__open();
+	if (!ASSERT_OK_PTR(skel, "skel_open"))
+		goto cleanup;
+
+	ASSERT_EQ(bpf_program__set_expected_attach_type(skel->progs.tc1,
+		  BPF_NETKIT_PRIMARY), 0, "tc1_attach_type");
+	ASSERT_EQ(bpf_program__set_expected_attach_type(skel->progs.tc2,
+		  BPF_NETKIT_PEER), 0, "tc2_attach_type");
+
+	err = test_tc_link__load(skel);
+	if (!ASSERT_OK(err, "skel_load"))
+		goto cleanup;
+
+	pid1 = id_from_prog_fd(bpf_program__fd(skel->progs.tc1));
+	pid2 = id_from_prog_fd(bpf_program__fd(skel->progs.tc2));
+
+	ASSERT_NEQ(pid1, pid2, "prog_ids_1_2");
+
+	assert_mprog_count_ifindex(ifindex, BPF_NETKIT_PRIMARY, 0);
+	assert_mprog_count_ifindex(ifindex, BPF_NETKIT_PEER, 0);
+
+	ASSERT_EQ(skel->bss->seen_tc1, false, "seen_tc1");
+	ASSERT_EQ(skel->bss->seen_tc2, false, "seen_tc2");
+
+	link = bpf_program__attach_netkit(skel->progs.tc1, ifindex, &optl);
+	if (!ASSERT_OK_PTR(link, "link_attach"))
+		goto cleanup;
+
+	skel->links.tc1 = link;
+
+	lid1 = id_from_link_fd(bpf_link__fd(skel->links.tc1));
+
+	assert_mprog_count_ifindex(ifindex, BPF_NETKIT_PRIMARY, 1);
+	assert_mprog_count_ifindex(ifindex, BPF_NETKIT_PEER, 0);
+
+	optq.prog_ids = prog_ids;
+	optq.link_ids = link_ids;
+
+	memset(prog_ids, 0, sizeof(prog_ids));
+	memset(link_ids, 0, sizeof(link_ids));
+	optq.count = ARRAY_SIZE(prog_ids);
+
+	err = bpf_prog_query_opts(ifindex, BPF_NETKIT_PRIMARY, &optq);
+	if (!ASSERT_OK(err, "prog_query"))
+		goto cleanup;
+
+	ASSERT_EQ(optq.count, 1, "count");
+	ASSERT_EQ(optq.revision, 2, "revision");
+	ASSERT_EQ(optq.prog_ids[0], pid1, "prog_ids[0]");
+	ASSERT_EQ(optq.link_ids[0], lid1, "link_ids[0]");
+	ASSERT_EQ(optq.prog_ids[1], 0, "prog_ids[1]");
+	ASSERT_EQ(optq.link_ids[1], 0, "link_ids[1]");
+
+	tc_skel_reset_all_seen(skel);
+	ASSERT_EQ(send_icmp(), 0, "icmp_pkt");
+
+	ASSERT_EQ(skel->bss->seen_tc1, true, "seen_tc1");
+	ASSERT_EQ(skel->bss->seen_tc2, false, "seen_tc2");
+
+	link = bpf_program__attach_netkit(skel->progs.tc2, ifindex, &optl);
+	if (!ASSERT_OK_PTR(link, "link_attach"))
+		goto cleanup;
+
+	skel->links.tc2 = link;
+
+	lid2 = id_from_link_fd(bpf_link__fd(skel->links.tc2));
+	ASSERT_NEQ(lid1, lid2, "link_ids_1_2");
+
+	assert_mprog_count_ifindex(ifindex, BPF_NETKIT_PRIMARY, 1);
+	assert_mprog_count_ifindex(ifindex, BPF_NETKIT_PEER, 1);
+
+	memset(prog_ids, 0, sizeof(prog_ids));
+	memset(link_ids, 0, sizeof(link_ids));
+	optq.count = ARRAY_SIZE(prog_ids);
+
+	err = bpf_prog_query_opts(ifindex, BPF_NETKIT_PEER, &optq);
+	if (!ASSERT_OK(err, "prog_query"))
+		goto cleanup;
+
+	ASSERT_EQ(optq.count, 1, "count");
+	ASSERT_EQ(optq.revision, 2, "revision");
+	ASSERT_EQ(optq.prog_ids[0], pid2, "prog_ids[0]");
+	ASSERT_EQ(optq.link_ids[0], lid2, "link_ids[0]");
+	ASSERT_EQ(optq.prog_ids[1], 0, "prog_ids[1]");
+	ASSERT_EQ(optq.link_ids[1], 0, "link_ids[1]");
+
+	tc_skel_reset_all_seen(skel);
+	ASSERT_EQ(send_icmp(), 0, "icmp_pkt");
+
+	ASSERT_EQ(skel->bss->seen_tc1, true, "seen_tc1");
+	ASSERT_EQ(skel->bss->seen_tc2, true, "seen_tc2");
+cleanup:
+	test_tc_link__destroy(skel);
+
+	assert_mprog_count_ifindex(ifindex, BPF_NETKIT_PRIMARY, 0);
+	assert_mprog_count_ifindex(ifindex, BPF_NETKIT_PEER, 0);
+	destroy_netkit();
+}
+
+static void serial_test_tc_netkit_multi_links_target(int mode, int target)
+{
+	LIBBPF_OPTS(bpf_prog_query_opts, optq);
+	LIBBPF_OPTS(bpf_netkit_opts, optl);
+	__u32 prog_ids[3], link_ids[3];
+	__u32 pid1, pid2, lid1, lid2;
+	struct test_tc_link *skel;
+	struct bpf_link *link;
+	int err, ifindex;
+
+	err = create_netkit(mode, NETKIT_PASS, NETKIT_PASS,
+			    &ifindex, false);
+	if (err)
+		return;
+
+	skel = test_tc_link__open();
+	if (!ASSERT_OK_PTR(skel, "skel_open"))
+		goto cleanup;
+
+	ASSERT_EQ(bpf_program__set_expected_attach_type(skel->progs.tc1,
+		  target), 0, "tc1_attach_type");
+	ASSERT_EQ(bpf_program__set_expected_attach_type(skel->progs.tc2,
+		  target), 0, "tc2_attach_type");
+
+	err = test_tc_link__load(skel);
+	if (!ASSERT_OK(err, "skel_load"))
+		goto cleanup;
+
+	pid1 = id_from_prog_fd(bpf_program__fd(skel->progs.tc1));
+	pid2 = id_from_prog_fd(bpf_program__fd(skel->progs.tc2));
+
+	ASSERT_NEQ(pid1, pid2, "prog_ids_1_2");
+
+	assert_mprog_count_ifindex(ifindex, target, 0);
+
+	ASSERT_EQ(skel->bss->seen_tc1, false, "seen_tc1");
+	ASSERT_EQ(skel->bss->seen_eth, false, "seen_eth");
+	ASSERT_EQ(skel->bss->seen_tc2, false, "seen_tc2");
+
+	link = bpf_program__attach_netkit(skel->progs.tc1, ifindex, &optl);
+	if (!ASSERT_OK_PTR(link, "link_attach"))
+		goto cleanup;
+
+	skel->links.tc1 = link;
+
+	lid1 = id_from_link_fd(bpf_link__fd(skel->links.tc1));
+
+	assert_mprog_count_ifindex(ifindex, target, 1);
+
+	optq.prog_ids = prog_ids;
+	optq.link_ids = link_ids;
+
+	memset(prog_ids, 0, sizeof(prog_ids));
+	memset(link_ids, 0, sizeof(link_ids));
+	optq.count = ARRAY_SIZE(prog_ids);
+
+	err = bpf_prog_query_opts(ifindex, target, &optq);
+	if (!ASSERT_OK(err, "prog_query"))
+		goto cleanup;
+
+	ASSERT_EQ(optq.count, 1, "count");
+	ASSERT_EQ(optq.revision, 2, "revision");
+	ASSERT_EQ(optq.prog_ids[0], pid1, "prog_ids[0]");
+	ASSERT_EQ(optq.link_ids[0], lid1, "link_ids[0]");
+	ASSERT_EQ(optq.prog_ids[1], 0, "prog_ids[1]");
+	ASSERT_EQ(optq.link_ids[1], 0, "link_ids[1]");
+
+	tc_skel_reset_all_seen(skel);
+	ASSERT_EQ(send_icmp(), 0, "icmp_pkt");
+
+	ASSERT_EQ(skel->bss->seen_tc1, true, "seen_tc1");
+	ASSERT_EQ(skel->bss->seen_eth, true, "seen_eth");
+	ASSERT_EQ(skel->bss->seen_tc2, false, "seen_tc2");
+
+	LIBBPF_OPTS_RESET(optl,
+		.flags = BPF_F_BEFORE,
+		.relative_fd = bpf_program__fd(skel->progs.tc1),
+	);
+
+	link = bpf_program__attach_netkit(skel->progs.tc2, ifindex, &optl);
+	if (!ASSERT_OK_PTR(link, "link_attach"))
+		goto cleanup;
+
+	skel->links.tc2 = link;
+
+	lid2 = id_from_link_fd(bpf_link__fd(skel->links.tc2));
+	ASSERT_NEQ(lid1, lid2, "link_ids_1_2");
+
+	assert_mprog_count_ifindex(ifindex, target, 2);
+
+	memset(prog_ids, 0, sizeof(prog_ids));
+	memset(link_ids, 0, sizeof(link_ids));
+	optq.count = ARRAY_SIZE(prog_ids);
+
+	err = bpf_prog_query_opts(ifindex, target, &optq);
+	if (!ASSERT_OK(err, "prog_query"))
+		goto cleanup;
+
+	ASSERT_EQ(optq.count, 2, "count");
+	ASSERT_EQ(optq.revision, 3, "revision");
+	ASSERT_EQ(optq.prog_ids[0], pid2, "prog_ids[0]");
+	ASSERT_EQ(optq.link_ids[0], lid2, "link_ids[0]");
+	ASSERT_EQ(optq.prog_ids[1], pid1, "prog_ids[1]");
+	ASSERT_EQ(optq.link_ids[1], lid1, "link_ids[1]");
+	ASSERT_EQ(optq.prog_ids[2], 0, "prog_ids[2]");
+	ASSERT_EQ(optq.link_ids[2], 0, "link_ids[2]");
+
+	tc_skel_reset_all_seen(skel);
+	ASSERT_EQ(send_icmp(), 0, "icmp_pkt");
+
+	ASSERT_EQ(skel->bss->seen_tc1, true, "seen_tc1");
+	ASSERT_EQ(skel->bss->seen_eth, true, "seen_eth");
+	ASSERT_EQ(skel->bss->seen_tc2, true, "seen_tc2");
+cleanup:
+	test_tc_link__destroy(skel);
+
+	assert_mprog_count_ifindex(ifindex, target, 0);
+	destroy_netkit();
+}
+
+void serial_test_tc_netkit_multi_links(void)
+{
+	serial_test_tc_netkit_multi_links_target(NETKIT_L2, BPF_NETKIT_PRIMARY);
+	serial_test_tc_netkit_multi_links_target(NETKIT_L3, BPF_NETKIT_PRIMARY);
+	serial_test_tc_netkit_multi_links_target(NETKIT_L2, BPF_NETKIT_PEER);
+	serial_test_tc_netkit_multi_links_target(NETKIT_L3, BPF_NETKIT_PEER);
+}
+
+static void serial_test_tc_netkit_multi_opts_target(int mode, int target)
+{
+	LIBBPF_OPTS(bpf_prog_attach_opts, opta);
+	LIBBPF_OPTS(bpf_prog_detach_opts, optd);
+	LIBBPF_OPTS(bpf_prog_query_opts, optq);
+	__u32 pid1, pid2, fd1, fd2;
+	__u32 prog_ids[3];
+	struct test_tc_link *skel;
+	int err, ifindex;
+
+	err = create_netkit(mode, NETKIT_PASS, NETKIT_PASS,
+			    &ifindex, false);
+	if (err)
+		return;
+
+	skel = test_tc_link__open_and_load();
+	if (!ASSERT_OK_PTR(skel, "skel_load"))
+		goto cleanup;
+
+	fd1 = bpf_program__fd(skel->progs.tc1);
+	fd2 = bpf_program__fd(skel->progs.tc2);
+
+	pid1 = id_from_prog_fd(fd1);
+	pid2 = id_from_prog_fd(fd2);
+
+	ASSERT_NEQ(pid1, pid2, "prog_ids_1_2");
+
+	assert_mprog_count_ifindex(ifindex, target, 0);
+
+	ASSERT_EQ(skel->bss->seen_tc1, false, "seen_tc1");
+	ASSERT_EQ(skel->bss->seen_eth, false, "seen_eth");
+	ASSERT_EQ(skel->bss->seen_tc2, false, "seen_tc2");
+
+	err = bpf_prog_attach_opts(fd1, ifindex, target, &opta);
+	if (!ASSERT_EQ(err, 0, "prog_attach"))
+		goto cleanup;
+
+	assert_mprog_count_ifindex(ifindex, target, 1);
+
+	optq.prog_ids = prog_ids;
+
+	memset(prog_ids, 0, sizeof(prog_ids));
+	optq.count = ARRAY_SIZE(prog_ids);
+
+	err = bpf_prog_query_opts(ifindex, target, &optq);
+	if (!ASSERT_OK(err, "prog_query"))
+		goto cleanup_fd1;
+
+	ASSERT_EQ(optq.count, 1, "count");
+	ASSERT_EQ(optq.revision, 2, "revision");
+	ASSERT_EQ(optq.prog_ids[0], pid1, "prog_ids[0]");
+	ASSERT_EQ(optq.prog_ids[1], 0, "prog_ids[1]");
+
+	tc_skel_reset_all_seen(skel);
+	ASSERT_EQ(send_icmp(), 0, "icmp_pkt");
+
+	ASSERT_EQ(skel->bss->seen_tc1, true, "seen_tc1");
+	ASSERT_EQ(skel->bss->seen_eth, true, "seen_eth");
+	ASSERT_EQ(skel->bss->seen_tc2, false, "seen_tc2");
+
+	LIBBPF_OPTS_RESET(opta,
+		.flags = BPF_F_BEFORE,
+		.relative_fd = fd1,
+	);
+
+	err = bpf_prog_attach_opts(fd2, ifindex, target, &opta);
+	if (!ASSERT_EQ(err, 0, "prog_attach"))
+		goto cleanup_fd1;
+
+	assert_mprog_count_ifindex(ifindex, target, 2);
+
+	memset(prog_ids, 0, sizeof(prog_ids));
+	optq.count = ARRAY_SIZE(prog_ids);
+
+	err = bpf_prog_query_opts(ifindex, target, &optq);
+	if (!ASSERT_OK(err, "prog_query"))
+		goto cleanup_fd2;
+
+	ASSERT_EQ(optq.count, 2, "count");
+	ASSERT_EQ(optq.revision, 3, "revision");
+	ASSERT_EQ(optq.prog_ids[0], pid2, "prog_ids[0]");
+	ASSERT_EQ(optq.prog_ids[1], pid1, "prog_ids[1]");
+	ASSERT_EQ(optq.prog_ids[2], 0, "prog_ids[2]");
+
+	tc_skel_reset_all_seen(skel);
+	ASSERT_EQ(send_icmp(), 0, "icmp_pkt");
+
+	ASSERT_EQ(skel->bss->seen_tc1, true, "seen_tc1");
+	ASSERT_EQ(skel->bss->seen_eth, true, "seen_eth");
+	ASSERT_EQ(skel->bss->seen_tc2, true, "seen_tc2");
+
+cleanup_fd2:
+	err = bpf_prog_detach_opts(fd2, ifindex, target, &optd);
+	ASSERT_OK(err, "prog_detach");
+	assert_mprog_count_ifindex(ifindex, target, 1);
+cleanup_fd1:
+	err = bpf_prog_detach_opts(fd1, ifindex, target, &optd);
+	ASSERT_OK(err, "prog_detach");
+	assert_mprog_count_ifindex(ifindex, target, 0);
+cleanup:
+	test_tc_link__destroy(skel);
+
+	assert_mprog_count_ifindex(ifindex, target, 0);
+	destroy_netkit();
+}
+
+void serial_test_tc_netkit_multi_opts(void)
+{
+	serial_test_tc_netkit_multi_opts_target(NETKIT_L2, BPF_NETKIT_PRIMARY);
+	serial_test_tc_netkit_multi_opts_target(NETKIT_L3, BPF_NETKIT_PRIMARY);
+	serial_test_tc_netkit_multi_opts_target(NETKIT_L2, BPF_NETKIT_PEER);
+	serial_test_tc_netkit_multi_opts_target(NETKIT_L3, BPF_NETKIT_PEER);
+}
+
+void serial_test_tc_netkit_device(void)
+{
+	LIBBPF_OPTS(bpf_prog_query_opts, optq);
+	LIBBPF_OPTS(bpf_netkit_opts, optl);
+	__u32 prog_ids[2], link_ids[2];
+	__u32 pid1, pid2, lid1;
+	struct test_tc_link *skel;
+	struct bpf_link *link;
+	int err, ifindex, ifindex2;
+
+	err = create_netkit(NETKIT_L3, NETKIT_PASS, NETKIT_PASS,
+			    &ifindex, true);
+	if (err)
+		return;
+
+	ifindex2 = if_nametoindex(netkit_peer);
+	ASSERT_NEQ(ifindex, ifindex2, "ifindex_1_2");
+
+	skel = test_tc_link__open();
+	if (!ASSERT_OK_PTR(skel, "skel_open"))
+		goto cleanup;
+
+	ASSERT_EQ(bpf_program__set_expected_attach_type(skel->progs.tc1,
+		  BPF_NETKIT_PRIMARY), 0, "tc1_attach_type");
+	ASSERT_EQ(bpf_program__set_expected_attach_type(skel->progs.tc2,
+		  BPF_NETKIT_PEER), 0, "tc2_attach_type");
+	ASSERT_EQ(bpf_program__set_expected_attach_type(skel->progs.tc3,
+		  BPF_NETKIT_PRIMARY), 0, "tc3_attach_type");
+
+	err = test_tc_link__load(skel);
+	if (!ASSERT_OK(err, "skel_load"))
+		goto cleanup;
+
+	pid1 = id_from_prog_fd(bpf_program__fd(skel->progs.tc1));
+	pid2 = id_from_prog_fd(bpf_program__fd(skel->progs.tc2));
+
+	ASSERT_NEQ(pid1, pid2, "prog_ids_1_2");
+
+	assert_mprog_count_ifindex(ifindex, BPF_NETKIT_PRIMARY, 0);
+	assert_mprog_count_ifindex(ifindex, BPF_NETKIT_PEER, 0);
+
+	ASSERT_EQ(skel->bss->seen_tc1, false, "seen_tc1");
+	ASSERT_EQ(skel->bss->seen_tc2, false, "seen_tc2");
+
+	link = bpf_program__attach_netkit(skel->progs.tc1, ifindex, &optl);
+	if (!ASSERT_OK_PTR(link, "link_attach"))
+		goto cleanup;
+
+	skel->links.tc1 = link;
+
+	lid1 = id_from_link_fd(bpf_link__fd(skel->links.tc1));
+
+	assert_mprog_count_ifindex(ifindex, BPF_NETKIT_PRIMARY, 1);
+	assert_mprog_count_ifindex(ifindex, BPF_NETKIT_PEER, 0);
+
+	optq.prog_ids = prog_ids;
+	optq.link_ids = link_ids;
+
+	memset(prog_ids, 0, sizeof(prog_ids));
+	memset(link_ids, 0, sizeof(link_ids));
+	optq.count = ARRAY_SIZE(prog_ids);
+
+	err = bpf_prog_query_opts(ifindex, BPF_NETKIT_PRIMARY, &optq);
+	if (!ASSERT_OK(err, "prog_query"))
+		goto cleanup;
+
+	ASSERT_EQ(optq.count, 1, "count");
+	ASSERT_EQ(optq.revision, 2, "revision");
+	ASSERT_EQ(optq.prog_ids[0], pid1, "prog_ids[0]");
+	ASSERT_EQ(optq.link_ids[0], lid1, "link_ids[0]");
+	ASSERT_EQ(optq.prog_ids[1], 0, "prog_ids[1]");
+	ASSERT_EQ(optq.link_ids[1], 0, "link_ids[1]");
+
+	tc_skel_reset_all_seen(skel);
+	ASSERT_EQ(send_icmp(), 0, "icmp_pkt");
+
+	ASSERT_EQ(skel->bss->seen_tc1, true, "seen_tc1");
+	ASSERT_EQ(skel->bss->seen_tc2, false, "seen_tc2");
+
+	memset(prog_ids, 0, sizeof(prog_ids));
+	memset(link_ids, 0, sizeof(link_ids));
+	optq.count = ARRAY_SIZE(prog_ids);
+
+	err = bpf_prog_query_opts(ifindex2, BPF_NETKIT_PRIMARY, &optq);
+	ASSERT_EQ(err, -EACCES, "prog_query_should_fail");
+
+	err = bpf_prog_query_opts(ifindex2, BPF_NETKIT_PEER, &optq);
+	ASSERT_EQ(err, -EACCES, "prog_query_should_fail");
+
+	link = bpf_program__attach_netkit(skel->progs.tc2, ifindex2, &optl);
+	if (!ASSERT_ERR_PTR(link, "link_attach_should_fail")) {
+		bpf_link__destroy(link);
+		goto cleanup;
+	}
+
+	link = bpf_program__attach_netkit(skel->progs.tc3, ifindex2, &optl);
+	if (!ASSERT_ERR_PTR(link, "link_attach_should_fail")) {
+		bpf_link__destroy(link);
+		goto cleanup;
+	}
+
+	assert_mprog_count_ifindex(ifindex, BPF_NETKIT_PRIMARY, 1);
+	assert_mprog_count_ifindex(ifindex, BPF_NETKIT_PEER, 0);
+cleanup:
+	test_tc_link__destroy(skel);
+
+	assert_mprog_count_ifindex(ifindex, BPF_NETKIT_PRIMARY, 0);
+	assert_mprog_count_ifindex(ifindex, BPF_NETKIT_PEER, 0);
+	destroy_netkit();
+}
+
+static void serial_test_tc_netkit_neigh_links_target(int mode, int target)
+{
+	LIBBPF_OPTS(bpf_prog_query_opts, optq);
+	LIBBPF_OPTS(bpf_netkit_opts, optl);
+	__u32 prog_ids[2], link_ids[2];
+	__u32 pid1, lid1;
+	struct test_tc_link *skel;
+	struct bpf_link *link;
+	int err, ifindex;
+
+	err = create_netkit(mode, NETKIT_PASS, NETKIT_PASS,
+			    &ifindex, false);
+	if (err)
+		return;
+
+	skel = test_tc_link__open();
+	if (!ASSERT_OK_PTR(skel, "skel_open"))
+		goto cleanup;
+
+	ASSERT_EQ(bpf_program__set_expected_attach_type(skel->progs.tc1,
+		  BPF_NETKIT_PRIMARY), 0, "tc1_attach_type");
+
+	err = test_tc_link__load(skel);
+	if (!ASSERT_OK(err, "skel_load"))
+		goto cleanup;
+
+	pid1 = id_from_prog_fd(bpf_program__fd(skel->progs.tc1));
+
+	assert_mprog_count_ifindex(ifindex, target, 0);
+
+	ASSERT_EQ(skel->bss->seen_tc1, false, "seen_tc1");
+	ASSERT_EQ(skel->bss->seen_eth, false, "seen_eth");
+
+	link = bpf_program__attach_netkit(skel->progs.tc1, ifindex, &optl);
+	if (!ASSERT_OK_PTR(link, "link_attach"))
+		goto cleanup;
+
+	skel->links.tc1 = link;
+
+	lid1 = id_from_link_fd(bpf_link__fd(skel->links.tc1));
+
+	assert_mprog_count_ifindex(ifindex, target, 1);
+
+	optq.prog_ids = prog_ids;
+	optq.link_ids = link_ids;
+
+	memset(prog_ids, 0, sizeof(prog_ids));
+	memset(link_ids, 0, sizeof(link_ids));
+	optq.count = ARRAY_SIZE(prog_ids);
+
+	err = bpf_prog_query_opts(ifindex, target, &optq);
+	if (!ASSERT_OK(err, "prog_query"))
+		goto cleanup;
+
+	ASSERT_EQ(optq.count, 1, "count");
+	ASSERT_EQ(optq.revision, 2, "revision");
+	ASSERT_EQ(optq.prog_ids[0], pid1, "prog_ids[0]");
+	ASSERT_EQ(optq.link_ids[0], lid1, "link_ids[0]");
+	ASSERT_EQ(optq.prog_ids[1], 0, "prog_ids[1]");
+	ASSERT_EQ(optq.link_ids[1], 0, "link_ids[1]");
+
+	tc_skel_reset_all_seen(skel);
+	ASSERT_EQ(__send_icmp(ping_addr_noneigh), 0, "icmp_pkt");
+
+	ASSERT_EQ(skel->bss->seen_tc1, true /* L2: ARP */, "seen_tc1");
+	ASSERT_EQ(skel->bss->seen_eth, mode == NETKIT_L3, "seen_eth");
+cleanup:
+	test_tc_link__destroy(skel);
+
+	assert_mprog_count_ifindex(ifindex, target, 0);
+	destroy_netkit();
+}
+
+void serial_test_tc_netkit_neigh_links(void)
+{
+	serial_test_tc_netkit_neigh_links_target(NETKIT_L2, BPF_NETKIT_PRIMARY);
+	serial_test_tc_netkit_neigh_links_target(NETKIT_L3, BPF_NETKIT_PRIMARY);
+}
diff --git a/tools/testing/selftests/bpf/progs/test_tc_link.c b/tools/testing/selftests/bpf/progs/test_tc_link.c
index 30e7124c49a1..992400acb957 100644
--- a/tools/testing/selftests/bpf/progs/test_tc_link.c
+++ b/tools/testing/selftests/bpf/progs/test_tc_link.c
@@ -1,7 +1,11 @@
 // SPDX-License-Identifier: GPL-2.0
 /* Copyright (c) 2023 Isovalent */
 #include <stdbool.h>
+
 #include <linux/bpf.h>
+#include <linux/if_ether.h>
+
+#include <bpf/bpf_endian.h>
 #include <bpf/bpf_helpers.h>
 
 char LICENSE[] SEC("license") = "GPL";
@@ -12,10 +16,19 @@ bool seen_tc3;
 bool seen_tc4;
 bool seen_tc5;
 bool seen_tc6;
+bool seen_eth;
 
 SEC("tc/ingress")
 int tc1(struct __sk_buff *skb)
 {
+	struct ethhdr eth = {};
+
+	if (skb->protocol != __bpf_constant_htons(ETH_P_IP))
+		goto out;
+	if (bpf_skb_load_bytes(skb, 0, &eth, sizeof(eth)))
+		goto out;
+	seen_eth = eth.h_proto == bpf_htons(ETH_P_IP);
+out:
 	seen_tc1 = true;
 	return TCX_NEXT;
 }
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 27+ messages in thread

* Re: [PATCH bpf-next v4 0/7] Add bpf programmable net device
  2023-10-24 21:48 [PATCH bpf-next v4 0/7] Add bpf programmable net device Daniel Borkmann
                   ` (6 preceding siblings ...)
  2023-10-24 21:49 ` [PATCH bpf-next v4 7/7] selftests/bpf: Add selftests for netkit Daniel Borkmann
@ 2023-10-24 22:45 ` Martin KaFai Lau
  2023-10-24 23:50 ` patchwork-bot+netdevbpf
  8 siblings, 0 replies; 27+ messages in thread
From: Martin KaFai Lau @ 2023-10-24 22:45 UTC (permalink / raw)
  To: Daniel Borkmann, bpf
  Cc: netdev, razor, ast, andrii, john.fastabend, sdf, toke, kuba,
	andrew

On 10/24/23 2:48 PM, Daniel Borkmann wrote:
> This work adds a BPF programmable device which can operate in L3 or L2
> mode where the BPF program is part of the xmit routine. It's program
> management is done via bpf_mprog and it comes with BPF link support.
> For details see patch 1 and following. Thanks!

Thanks for the work on this. We have been testing the earlier version in a
production service and we see at least a 3% CPU win when comparing with veth.

Acked-by: Martin KaFai Lau <martin.lau@kernel.org>


^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH bpf-next v4 0/7] Add bpf programmable net device
  2023-10-24 21:48 [PATCH bpf-next v4 0/7] Add bpf programmable net device Daniel Borkmann
                   ` (7 preceding siblings ...)
  2023-10-24 22:45 ` [PATCH bpf-next v4 0/7] Add bpf programmable net device Martin KaFai Lau
@ 2023-10-24 23:50 ` patchwork-bot+netdevbpf
  2023-10-25 15:50   ` Jiri Pirko
  8 siblings, 1 reply; 27+ messages in thread
From: patchwork-bot+netdevbpf @ 2023-10-24 23:50 UTC (permalink / raw)
  To: Daniel Borkmann
  Cc: bpf, netdev, martin.lau, razor, ast, andrii, john.fastabend, sdf,
	toke, kuba, andrew

Hello:

This series was applied to bpf/bpf-next.git (master)
by Martin KaFai Lau <martin.lau@kernel.org>:

On Tue, 24 Oct 2023 23:48:57 +0200 you wrote:
> This work adds a BPF programmable device which can operate in L3 or L2
> mode where the BPF program is part of the xmit routine. It's program
> management is done via bpf_mprog and it comes with BPF link support.
> For details see patch 1 and following. Thanks!
> 
> v3 -> v4:
>   - Moved netkit_release_all() into ndo_uninit (Stan)
>   - Two small commit msg corrections (Toke)
>   - Added Acked/Reviewed-by
> v2 -> v3:
>   - Remove setting dev->min_mtu to ETH_MIN_MTU (Andrew)
>   - Do not populate ethtool info->version (Andrew)
>   - Populate netdev private data before register_netdevice (Andrew)
>   - Use strscpy for ifname template (Jakub)
>   - Use GFP_KERNEL_ACCOUNT for link kzalloc (Jakub)
>   - Carry and dump link attach type for bpftool (Toke)
> v1 -> v2:
>   - Rename from meta (Toke, Andrii, Alexei)
>   - Reuse skb_scrub_packet (Stan)
>   - Remove IFF_META and use netdev_ops (Toke)
>   - Add comment to multicast handler (Toke)
>   - Remove silly version info (Toke)
>   - Fix attach_type_name (Quentin)
>   - Rework libbpf link attach api to be similar
>     as tcx (Andrii)
>   - Move flags last for bpf_netkit_opts (Andrii)
>   - Rebased to bpf_mprog query api changes
>   - Folded link support patch into main one
> 
> [...]

Here is the summary with links:
  - [bpf-next,v4,1/7] netkit, bpf: Add bpf programmable net device
    https://git.kernel.org/bpf/bpf-next/c/35dfaad7188c
  - [bpf-next,v4,2/7] tools: Sync if_link uapi header
    https://git.kernel.org/bpf/bpf-next/c/5c1b994de4be
  - [bpf-next,v4,3/7] libbpf: Add link-based API for netkit
    https://git.kernel.org/bpf/bpf-next/c/05c31b4ab205
  - [bpf-next,v4,4/7] bpftool: Implement link show support for netkit
    https://git.kernel.org/bpf/bpf-next/c/92a85e18ad47
  - [bpf-next,v4,5/7] bpftool: Extend net dump with netkit progs
    https://git.kernel.org/bpf/bpf-next/c/bec981a4add6
  - [bpf-next,v4,6/7] selftests/bpf: Add netlink helper library
    https://git.kernel.org/bpf/bpf-next/c/51f1892b5289
  - [bpf-next,v4,7/7] selftests/bpf: Add selftests for netkit
    https://git.kernel.org/bpf/bpf-next/c/ace15f91e569

You are awesome, thank you!
-- 
Deet-doot-dot, I am a bot.
https://korg.docs.kernel.org/patchwork/pwbot.html



^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH bpf-next v4 1/7] netkit, bpf: Add bpf programmable net device
  2023-10-24 21:48 ` [PATCH bpf-next v4 1/7] netkit, bpf: " Daniel Borkmann
@ 2023-10-25 15:47   ` Jiri Pirko
  2023-10-25 17:20     ` Daniel Borkmann
  2023-10-25 19:21     ` Nikolay Aleksandrov
  2023-10-25 21:24   ` Kui-Feng Lee
  1 sibling, 2 replies; 27+ messages in thread
From: Jiri Pirko @ 2023-10-25 15:47 UTC (permalink / raw)
  To: Daniel Borkmann
  Cc: bpf, netdev, martin.lau, razor, ast, andrii, john.fastabend, sdf,
	toke, kuba, andrew, Toke Høiland-Jørgensen

Tue, Oct 24, 2023 at 11:48:58PM CEST, daniel@iogearbox.net wrote:
>This work adds a new, minimal BPF-programmable device called "netkit"
>(former PoC code-name "meta") which we recently presented at LSF/MM/BPF. The
>core idea is that BPF programs are executed within the driver's xmit routine,
>therefore, e.g. in the case of containers/Pods, moving BPF processing closer
>to the source.
>
>One of the goals was that in the case of Pod egress traffic, this allows
>moving BPF programs from hostns tcx ingress into the device itself, providing
>earlier drop or forward mechanisms. For example, if the BPF program
>determines that the skb must be sent out of the node, then a redirect to
>the physical device can take place directly without going through the per-CPU
>backlog queue. This helps to shift processing for such traffic from softirq
>to process context, leading to better scheduling decisions/performance (see
>measurements in the slides).
>
>In this initial version, the netkit device ships as a pair, but we plan to
>extend this further so it can also operate in single device mode. The pair

Single device mode should work how?


>comes with a primary and a peer device. Only the primary device, typically
>residing in hostns, can manage BPF programs for itself and its peer. The
>peer device is designated for containers/Pods and cannot attach/detach
>BPF programs. Upon the device creation, the user can set the default policy
>to 'pass' or 'drop' for the case when no BPF program is attached.

It looks to me that you only need the host (primary) netdevice to be
used as a handle to attach the bpf programs, because the bpf program
can (and in a real use case probably will) redirect to the uplink/another
pod netdevice, skipping the host (primary) netdevice, correct?

If so, why can't you do just single device mode from the start, finding a
different sort of bpf attach handle? (not sure which)


>
>Additionally, the device can be operated in L3 (default) or L2 mode. The
>management of BPF programs is done via bpf_mprog, so that multi-attach is
>supported right from the beginning with similar API and dependency controls
>as tcx. For details on the latter see commit 053c8e1f235d ("bpf: Add generic
>attach/detach/query API for multi-progs"). tc BPF compatibility is provided,
>so that existing programs can be easily migrated.
>
>Going forward, we plan to use netkit devices in Cilium as the main device
>type for connecting Pods. They will be operated in L3 mode in order to
>simplify a Pod's neighbor management and the peer will operate in default
>drop mode, so that no traffic is leaving between the time when a Pod is
>brought up by the CNI plugin and programs attached by the agent.
>Additionally, the programs we attach via tcx on the physical devices are
>using bpf_redirect_peer() for inbound traffic into netkit device, hence the
>latter is also supporting the ndo_get_peer_dev callback. Similarly, we use
>bpf_redirect_neigh() for the way out, pushing from netkit peer to phys device
>directly. Also, BIG TCP is supported on netkit device. For the follow-up
>work in single device mode, we plan to convert Cilium's cilium_host/_net
>devices into a single one.
>
>An extensive test suite for checking device operations and the BPF program
>and link management API comes as BPF selftests in this series.
>

Couple of nitpicks below:

[..]


>+static int netkit_check_policy(int policy, struct nlattr *tb,
>+			       struct netlink_ext_ack *extack)
>+{
>+	switch (policy) {
>+	case NETKIT_PASS:
>+	case NETKIT_DROP:
>+		return 0;
>+	default:

Isn't this job for netlink policy?


>+		NL_SET_ERR_MSG_ATTR(extack, tb,
>+				    "Provided default xmit policy not supported");
>+		return -EINVAL;
>+	}
>+}
>+
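
(An editor's aside for readers: the reviewer is referring to rtnetlink's
per-attribute nla_policy validation. A hedged sketch of that alternative,
reusing the IFLA_NETKIT_*/NETKIT_* names from the patch but otherwise
illustrative and not the code under review, might look like this; the same
idea would apply to the mode check quoted next.)

  /* Sketch only; assumes the driver's <net/rtnetlink.h> includes. A
   * validation callback keeps the accepted action set explicit, while
   * the mode, being a small contiguous enum, could use a plain range.
   */
  static int netkit_policy_validate(const struct nlattr *attr,
  				  struct netlink_ext_ack *extack)
  {
  	switch (nla_get_u32(attr)) {
  	case NETKIT_PASS:
  	case NETKIT_DROP:
  		return 0;
  	default:
  		NL_SET_ERR_MSG_ATTR(extack, attr,
  				    "Provided default xmit policy not supported");
  		return -EINVAL;
  	}
  }

  static const struct nla_policy netkit_policy_sketch[IFLA_NETKIT_MAX + 1] = {
  	[IFLA_NETKIT_POLICY]	  = NLA_POLICY_VALIDATE_FN(NLA_U32, netkit_policy_validate),
  	[IFLA_NETKIT_PEER_POLICY] = NLA_POLICY_VALIDATE_FN(NLA_U32, netkit_policy_validate),
  	[IFLA_NETKIT_MODE]	  = NLA_POLICY_RANGE(NLA_U32, NETKIT_L2, NETKIT_L3),
  };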
>+static int netkit_check_mode(int mode, struct nlattr *tb,
>+			     struct netlink_ext_ack *extack)
>+{
>+	switch (mode) {
>+	case NETKIT_L2:
>+	case NETKIT_L3:
>+		return 0;
>+	default:

Isn't this job for netlink policy?


>+		NL_SET_ERR_MSG_ATTR(extack, tb,
>+				    "Provided device mode can only be L2 or L3");
>+		return -EINVAL;
>+	}
>+}
>+
>+static int netkit_validate(struct nlattr *tb[], struct nlattr *data[],
>+			   struct netlink_ext_ack *extack)
>+{
>+	struct nlattr *attr = tb[IFLA_ADDRESS];
>+
>+	if (!attr)
>+		return 0;
>+	NL_SET_ERR_MSG_ATTR(extack, attr,
>+			    "Setting Ethernet address is not supported");
>+	return -EOPNOTSUPP;
>+}
>+
>+static struct rtnl_link_ops netkit_link_ops;
>+
>+static int netkit_new_link(struct net *src_net, struct net_device *dev,
>+			   struct nlattr *tb[], struct nlattr *data[],
>+			   struct netlink_ext_ack *extack)
>+{
>+	struct nlattr *peer_tb[IFLA_MAX + 1], **tbp = tb, *attr;
>+	enum netkit_action default_prim = NETKIT_PASS;
>+	enum netkit_action default_peer = NETKIT_PASS;
>+	enum netkit_mode mode = NETKIT_L3;
>+	unsigned char ifname_assign_type;
>+	struct ifinfomsg *ifmp = NULL;
>+	struct net_device *peer;
>+	char ifname[IFNAMSIZ];
>+	struct netkit *nk;
>+	struct net *net;
>+	int err;
>+
>+	if (data) {
>+		if (data[IFLA_NETKIT_MODE]) {
>+			attr = data[IFLA_NETKIT_MODE];
>+			mode = nla_get_u32(attr);
>+			err = netkit_check_mode(mode, attr, extack);
>+			if (err < 0)
>+				return err;
>+		}
>+		if (data[IFLA_NETKIT_PEER_INFO]) {
>+			attr = data[IFLA_NETKIT_PEER_INFO];
>+			ifmp = nla_data(attr);
>+			err = rtnl_nla_parse_ifinfomsg(peer_tb, attr, extack);
>+			if (err < 0)
>+				return err;
>+			err = netkit_validate(peer_tb, NULL, extack);
>+			if (err < 0)
>+				return err;
>+			tbp = peer_tb;
>+		}
>+		if (data[IFLA_NETKIT_POLICY]) {
>+			attr = data[IFLA_NETKIT_POLICY];
>+			default_prim = nla_get_u32(attr);
>+			err = netkit_check_policy(default_prim, attr, extack);
>+			if (err < 0)
>+				return err;
>+		}
>+		if (data[IFLA_NETKIT_PEER_POLICY]) {
>+			attr = data[IFLA_NETKIT_PEER_POLICY];
>+			default_peer = nla_get_u32(attr);
>+			err = netkit_check_policy(default_peer, attr, extack);
>+			if (err < 0)
>+				return err;
>+		}
>+	}
>+
>+	if (ifmp && tbp[IFLA_IFNAME]) {
>+		nla_strscpy(ifname, tbp[IFLA_IFNAME], IFNAMSIZ);
>+		ifname_assign_type = NET_NAME_USER;
>+	} else {
>+		strscpy(ifname, "nk%d", IFNAMSIZ);
>+		ifname_assign_type = NET_NAME_ENUM;
>+	}
>+
>+	net = rtnl_link_get_net(src_net, tbp);
>+	if (IS_ERR(net))
>+		return PTR_ERR(net);
>+
>+	peer = rtnl_create_link(net, ifname, ifname_assign_type,
>+				&netkit_link_ops, tbp, extack);
>+	if (IS_ERR(peer)) {
>+		put_net(net);
>+		return PTR_ERR(peer);
>+	}
>+
>+	netif_inherit_tso_max(peer, dev);
>+
>+	if (mode == NETKIT_L2)
>+		eth_hw_addr_random(peer);
>+	if (ifmp && dev->ifindex)
>+		peer->ifindex = ifmp->ifi_index;
>+
>+	nk = netkit_priv(peer);
>+	nk->primary = false;
>+	nk->policy = default_peer;
>+	nk->mode = mode;
>+	bpf_mprog_bundle_init(&nk->bundle);
>+	RCU_INIT_POINTER(nk->active, NULL);
>+	RCU_INIT_POINTER(nk->peer, NULL);

Aren't these already 0?


>+
>+	err = register_netdevice(peer);
>+	put_net(net);
>+	if (err < 0)
>+		goto err_register_peer;
>+	netif_carrier_off(peer);
>+	if (mode == NETKIT_L2)
>+		dev_change_flags(peer, peer->flags & ~IFF_NOARP, NULL);
>+
>+	err = rtnl_configure_link(peer, NULL, 0, NULL);
>+	if (err < 0)
>+		goto err_configure_peer;
>+
>+	if (mode == NETKIT_L2)
>+		eth_hw_addr_random(dev);
>+	if (tb[IFLA_IFNAME])
>+		nla_strscpy(dev->name, tb[IFLA_IFNAME], IFNAMSIZ);
>+	else
>+		strscpy(dev->name, "nk%d", IFNAMSIZ);
>+
>+	nk = netkit_priv(dev);
>+	nk->primary = true;
>+	nk->policy = default_prim;
>+	nk->mode = mode;
>+	bpf_mprog_bundle_init(&nk->bundle);
>+	RCU_INIT_POINTER(nk->active, NULL);
>+	RCU_INIT_POINTER(nk->peer, NULL);
>+
>+	err = register_netdevice(dev);
>+	if (err < 0)
>+		goto err_configure_peer;
>+	netif_carrier_off(dev);
>+	if (mode == NETKIT_L2)
>+		dev_change_flags(dev, dev->flags & ~IFF_NOARP, NULL);
>+
>+	rcu_assign_pointer(netkit_priv(dev)->peer, peer);
>+	rcu_assign_pointer(netkit_priv(peer)->peer, dev);
>+	return 0;
>+err_configure_peer:
>+	unregister_netdevice(peer);
>+	return err;
>+err_register_peer:
>+	free_netdev(peer);
>+	return err;
>+}
>+

[..]

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH bpf-next v4 0/7] Add bpf programmable net device
  2023-10-24 23:50 ` patchwork-bot+netdevbpf
@ 2023-10-25 15:50   ` Jiri Pirko
  2023-10-25 16:54     ` Martin KaFai Lau
  0 siblings, 1 reply; 27+ messages in thread
From: Jiri Pirko @ 2023-10-25 15:50 UTC (permalink / raw)
  To: patchwork-bot+netdevbpf
  Cc: Daniel Borkmann, bpf, netdev, martin.lau, razor, ast, andrii,
	john.fastabend, sdf, toke, kuba, andrew

Wed, Oct 25, 2023 at 01:50:25AM CEST, patchwork-bot+netdevbpf@kernel.org wrote:
>Hello:
>
>This series was applied to bpf/bpf-next.git (master)
>by Martin KaFai Lau <martin.lau@kernel.org>:

Interesting, applied within 2 hours after send. You bpf people don't
care about some 24h timeout?

[..]

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH bpf-next v4 0/7] Add bpf programmable net device
  2023-10-25 15:50   ` Jiri Pirko
@ 2023-10-25 16:54     ` Martin KaFai Lau
  2023-10-26  5:35       ` Jiri Pirko
  0 siblings, 1 reply; 27+ messages in thread
From: Martin KaFai Lau @ 2023-10-25 16:54 UTC (permalink / raw)
  To: Jiri Pirko
  Cc: Daniel Borkmann, bpf, netdev, razor, ast, andrii, john.fastabend,
	sdf, toke, kuba, andrew

On 10/25/23 8:50 AM, Jiri Pirko wrote:
> Wed, Oct 25, 2023 at 01:50:25AM CEST, patchwork-bot+netdevbpf@kernel.org wrote:
>> Hello:
>>
>> This series was applied to bpf/bpf-next.git (master)
>> by Martin KaFai Lau <martin.lau@kernel.org>:
> 
> Interesting, applied within 2 hours after send. You bpf people don't
> care about some 24h timeout?

24hr? The v1 was posted to both the netdev and bpf lists on 9/25. It was 10/24 
yesterday. The part you commented on in patch 1 had not changed much since v1, 
so there was a month of time. netdev is always on the cc list. Multiple people 
(Andrew, Jakub...etc) had already helped to review and Daniel had addressed the 
comments. The change history had been diminishing from v1 to v4, and the v4 
changes were mostly nit-picking already.


^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH bpf-next v4 1/7] netkit, bpf: Add bpf programmable net device
  2023-10-25 15:47   ` Jiri Pirko
@ 2023-10-25 17:20     ` Daniel Borkmann
  2023-10-26  5:18       ` Jiri Pirko
  2023-10-25 19:21     ` Nikolay Aleksandrov
  1 sibling, 1 reply; 27+ messages in thread
From: Daniel Borkmann @ 2023-10-25 17:20 UTC (permalink / raw)
  To: Jiri Pirko
  Cc: bpf, netdev, martin.lau, razor, ast, andrii, john.fastabend, sdf,
	toke, kuba, andrew, Toke Høiland-Jørgensen

On 10/25/23 5:47 PM, Jiri Pirko wrote:
> Tue, Oct 24, 2023 at 11:48:58PM CEST, daniel@iogearbox.net wrote:
[...]
>> comes with a primary and a peer device. Only the primary device, typically
>> residing in hostns, can manage BPF programs for itself and its peer. The
>> peer device is designated for containers/Pods and cannot attach/detach
>> BPF programs. Upon the device creation, the user can set the default policy
>> to 'pass' or 'drop' for the case when no BPF program is attached.
> 
> It looks to me that you only need the host (primary) netdevice to be
> used as a handle to attach the bpf programs. Because the bpf program
> can (and probably in real use case will) redirect to uplink/another
> pod netdevice skipping the host (primary) netdevice, correct?
> 
> If so, why can't you do just single device mode from start finding a
> different sort of bpf attach handle? (not sure which)

The first point where we switch netns out of a K8s Pod is the netdevice.
For the CNI case the vast majority has one, but there could also be multi-
homing for Pods where there may be two or more, and from a troubleshooting
PoV aka tcpdump et al, it is the most natural point. An attach handle
inside the Pod doesn't really fit, given that from a policy PoV it also
must be unreachable for apps inside the Pod itself.

>> Additionally, the device can be operated in L3 (default) or L2 mode. The
>> management of BPF programs is done via bpf_mprog, so that multi-attach is
>> supported right from the beginning with similar API and dependency controls
>> as tcx. For details on the latter see commit 053c8e1f235d ("bpf: Add generic
>> attach/detach/query API for multi-progs"). tc BPF compatibility is provided,
>> so that existing programs can be easily migrated.
>>
>> Going forward, we plan to use netkit devices in Cilium as the main device
>> type for connecting Pods. They will be operated in L3 mode in order to
>> simplify a Pod's neighbor management and the peer will operate in default
>> drop mode, so that no traffic is leaving between the time when a Pod is
>> brought up by the CNI plugin and programs attached by the agent.
>> Additionally, the programs we attach via tcx on the physical devices are
>> using bpf_redirect_peer() for inbound traffic into netkit device, hence the
>> latter is also supporting the ndo_get_peer_dev callback. Similarly, we use
>> bpf_redirect_neigh() for the way out, pushing from netkit peer to phys device
>> directly. Also, BIG TCP is supported on netkit device. For the follow-up
>> work in single device mode, we plan to convert Cilium's cilium_host/_net
>> devices into a single one.
>>
>> An extensive test suite for checking device operations and the BPF program
>> and link management API comes as BPF selftests in this series.
> 
> Couple of nitpicks below:

Thanks for looking into those, I'll look into it and send a small cleanup on
the two.

>> +static int netkit_check_policy(int policy, struct nlattr *tb,
>> +			       struct netlink_ext_ack *extack)
>> +{
>> +	switch (policy) {
>> +	case NETKIT_PASS:
>> +	case NETKIT_DROP:
>> +		return 0;
>> +	default:
> 
> Isn't this job for netlink policy?
> 
> 
>> +		NL_SET_ERR_MSG_ATTR(extack, tb,
>> +				    "Provided default xmit policy not supported");
>> +		return -EINVAL;
>> +	}
>> +}
>> +
>> +static int netkit_check_mode(int mode, struct nlattr *tb,
>> +			     struct netlink_ext_ack *extack)
>> +{
>> +	switch (mode) {
>> +	case NETKIT_L2:
>> +	case NETKIT_L3:
>> +		return 0;
>> +	default:
> 
> Isn't this job for netlink policy?
> 
> 
>> +		NL_SET_ERR_MSG_ATTR(extack, tb,
>> +				    "Provided device mode can only be L2 or L3");
>> +		return -EINVAL;
>> +	}
>> +}
>> +
>> +static int netkit_validate(struct nlattr *tb[], struct nlattr *data[],
>> +			   struct netlink_ext_ack *extack)
>> +{
>> +	struct nlattr *attr = tb[IFLA_ADDRESS];
>> +
>> +	if (!attr)
>> +		return 0;
>> +	NL_SET_ERR_MSG_ATTR(extack, attr,
>> +			    "Setting Ethernet address is not supported");
>> +	return -EOPNOTSUPP;
>> +}
>> +
>> +static struct rtnl_link_ops netkit_link_ops;
>> +
>> +static int netkit_new_link(struct net *src_net, struct net_device *dev,
>> +			   struct nlattr *tb[], struct nlattr *data[],
>> +			   struct netlink_ext_ack *extack)
>> +{
>> +	struct nlattr *peer_tb[IFLA_MAX + 1], **tbp = tb, *attr;
>> +	enum netkit_action default_prim = NETKIT_PASS;
>> +	enum netkit_action default_peer = NETKIT_PASS;
>> +	enum netkit_mode mode = NETKIT_L3;
>> +	unsigned char ifname_assign_type;
>> +	struct ifinfomsg *ifmp = NULL;
>> +	struct net_device *peer;
>> +	char ifname[IFNAMSIZ];
>> +	struct netkit *nk;
>> +	struct net *net;
>> +	int err;
>> +
>> +	if (data) {
>> +		if (data[IFLA_NETKIT_MODE]) {
>> +			attr = data[IFLA_NETKIT_MODE];
>> +			mode = nla_get_u32(attr);
>> +			err = netkit_check_mode(mode, attr, extack);
>> +			if (err < 0)
>> +				return err;
>> +		}
>> +		if (data[IFLA_NETKIT_PEER_INFO]) {
>> +			attr = data[IFLA_NETKIT_PEER_INFO];
>> +			ifmp = nla_data(attr);
>> +			err = rtnl_nla_parse_ifinfomsg(peer_tb, attr, extack);
>> +			if (err < 0)
>> +				return err;
>> +			err = netkit_validate(peer_tb, NULL, extack);
>> +			if (err < 0)
>> +				return err;
>> +			tbp = peer_tb;
>> +		}
>> +		if (data[IFLA_NETKIT_POLICY]) {
>> +			attr = data[IFLA_NETKIT_POLICY];
>> +			default_prim = nla_get_u32(attr);
>> +			err = netkit_check_policy(default_prim, attr, extack);
>> +			if (err < 0)
>> +				return err;
>> +		}
>> +		if (data[IFLA_NETKIT_PEER_POLICY]) {
>> +			attr = data[IFLA_NETKIT_PEER_POLICY];
>> +			default_peer = nla_get_u32(attr);
>> +			err = netkit_check_policy(default_peer, attr, extack);
>> +			if (err < 0)
>> +				return err;
>> +		}
>> +	}
>> +
>> +	if (ifmp && tbp[IFLA_IFNAME]) {
>> +		nla_strscpy(ifname, tbp[IFLA_IFNAME], IFNAMSIZ);
>> +		ifname_assign_type = NET_NAME_USER;
>> +	} else {
>> +		strscpy(ifname, "nk%d", IFNAMSIZ);
>> +		ifname_assign_type = NET_NAME_ENUM;
>> +	}
>> +
>> +	net = rtnl_link_get_net(src_net, tbp);
>> +	if (IS_ERR(net))
>> +		return PTR_ERR(net);
>> +
>> +	peer = rtnl_create_link(net, ifname, ifname_assign_type,
>> +				&netkit_link_ops, tbp, extack);
>> +	if (IS_ERR(peer)) {
>> +		put_net(net);
>> +		return PTR_ERR(peer);
>> +	}
>> +
>> +	netif_inherit_tso_max(peer, dev);
>> +
>> +	if (mode == NETKIT_L2)
>> +		eth_hw_addr_random(peer);
>> +	if (ifmp && dev->ifindex)
>> +		peer->ifindex = ifmp->ifi_index;
>> +
>> +	nk = netkit_priv(peer);
>> +	nk->primary = false;
>> +	nk->policy = default_peer;
>> +	nk->mode = mode;
>> +	bpf_mprog_bundle_init(&nk->bundle);
>> +	RCU_INIT_POINTER(nk->active, NULL);
>> +	RCU_INIT_POINTER(nk->peer, NULL);
> 
> Aren't these already 0?
> 
> 
>> +
>> +	err = register_netdevice(peer);
>> +	put_net(net);
>> +	if (err < 0)
>> +		goto err_register_peer;
>> +	netif_carrier_off(peer);
>> +	if (mode == NETKIT_L2)
>> +		dev_change_flags(peer, peer->flags & ~IFF_NOARP, NULL);
>> +
>> +	err = rtnl_configure_link(peer, NULL, 0, NULL);
>> +	if (err < 0)
>> +		goto err_configure_peer;
>> +
>> +	if (mode == NETKIT_L2)
>> +		eth_hw_addr_random(dev);
>> +	if (tb[IFLA_IFNAME])
>> +		nla_strscpy(dev->name, tb[IFLA_IFNAME], IFNAMSIZ);
>> +	else
>> +		strscpy(dev->name, "nk%d", IFNAMSIZ);
>> +
>> +	nk = netkit_priv(dev);
>> +	nk->primary = true;
>> +	nk->policy = default_prim;
>> +	nk->mode = mode;
>> +	bpf_mprog_bundle_init(&nk->bundle);
>> +	RCU_INIT_POINTER(nk->active, NULL);
>> +	RCU_INIT_POINTER(nk->peer, NULL);
>> +
>> +	err = register_netdevice(dev);
>> +	if (err < 0)
>> +		goto err_configure_peer;
>> +	netif_carrier_off(dev);
>> +	if (mode == NETKIT_L2)
>> +		dev_change_flags(dev, dev->flags & ~IFF_NOARP, NULL);
>> +
>> +	rcu_assign_pointer(netkit_priv(dev)->peer, peer);
>> +	rcu_assign_pointer(netkit_priv(peer)->peer, dev);
>> +	return 0;
>> +err_configure_peer:
>> +	unregister_netdevice(peer);
>> +	return err;
>> +err_register_peer:
>> +	free_netdev(peer);
>> +	return err;
>> +}
>> +
> 
> [..]
> 


^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH bpf-next v4 1/7] netkit, bpf: Add bpf programmable net device
  2023-10-25 15:47   ` Jiri Pirko
  2023-10-25 17:20     ` Daniel Borkmann
@ 2023-10-25 19:21     ` Nikolay Aleksandrov
  2023-10-26  5:26       ` Jiri Pirko
  1 sibling, 1 reply; 27+ messages in thread
From: Nikolay Aleksandrov @ 2023-10-25 19:21 UTC (permalink / raw)
  To: Jiri Pirko, Daniel Borkmann
  Cc: bpf, netdev, martin.lau, ast, andrii, john.fastabend, sdf, toke,
	kuba, andrew, Toke Høiland-Jørgensen

On 10/25/23 18:47, Jiri Pirko wrote:
> Tue, Oct 24, 2023 at 11:48:58PM CEST, daniel@iogearbox.net wrote:
>> This work adds a new, minimal BPF-programmable device called "netkit"
[snip]
> 
> Couple of nitpicks below:
> 
> [..]
> 
> 

Hi,
Thanks for the review. I know about the nits below but decided against
changing them, more below each...

>> +static int netkit_check_policy(int policy, struct nlattr *tb,
>> +			       struct netlink_ext_ack *extack)
>> +{
>> +	switch (policy) {
>> +	case NETKIT_PASS:
>> +	case NETKIT_DROP:
>> +		return 0;
>> +	default:
> 
> Isn't this job for netlink policy?
> 
> 

This cannot be handled by policies AFAIK, because only 2 sparse values out
of a larger set are allowed. We could potentially do it through validate(),
but it's the same minus the explicit policy type info. IMO this approach is
good.

>> +		NL_SET_ERR_MSG_ATTR(extack, tb,
>> +				    "Provided default xmit policy not supported");
>> +		return -EINVAL;
>> +	}
>> +}
>> +
>> +static int netkit_check_mode(int mode, struct nlattr *tb,
>> +			     struct netlink_ext_ack *extack)
>> +{
>> +	switch (mode) {
>> +	case NETKIT_L2:
>> +	case NETKIT_L3:
>> +		return 0;
>> +	default:
> 
> Isn't this job for netlink policy?
> 
> 

This one can be handled by policy indeed, but then we lose the nice user
error. Again, it can be done through validate(), but it's the same and we
lose the explicit policy type information.

>> +		NL_SET_ERR_MSG_ATTR(extack, tb,
>> +				    "Provided device mode can only be L2 or L3");
>> +		return -EINVAL;
>> +	}
>> +}
>> +
>> +static int netkit_validate(struct nlattr *tb[], struct nlattr *data[],
>> +			   struct netlink_ext_ack *extack)
>> +{
>> +	struct nlattr *attr = tb[IFLA_ADDRESS];
>> +
>> +	if (!attr)
>> +		return 0;
>> +	NL_SET_ERR_MSG_ATTR(extack, attr,
>> +			    "Setting Ethernet address is not supported");
>> +	return -EOPNOTSUPP;
>> +}
>> +
>> +static struct rtnl_link_ops netkit_link_ops;
>> +
>> +static int netkit_new_link(struct net *src_net, struct net_device *dev,
>> +			   struct nlattr *tb[], struct nlattr *data[],
>> +			   struct netlink_ext_ack *extack)
>> +{
>> +	struct nlattr *peer_tb[IFLA_MAX + 1], **tbp = tb, *attr;
>> +	enum netkit_action default_prim = NETKIT_PASS;
>> +	enum netkit_action default_peer = NETKIT_PASS;
>> +	enum netkit_mode mode = NETKIT_L3;
>> +	unsigned char ifname_assign_type;
>> +	struct ifinfomsg *ifmp = NULL;
>> +	struct net_device *peer;
>> +	char ifname[IFNAMSIZ];
>> +	struct netkit *nk;
>> +	struct net *net;
>> +	int err;
>> +
>> +	if (data) {
>> +		if (data[IFLA_NETKIT_MODE]) {
>> +			attr = data[IFLA_NETKIT_MODE];
>> +			mode = nla_get_u32(attr);
>> +			err = netkit_check_mode(mode, attr, extack);
>> +			if (err < 0)
>> +				return err;
>> +		}
>> +		if (data[IFLA_NETKIT_PEER_INFO]) {
>> +			attr = data[IFLA_NETKIT_PEER_INFO];
>> +			ifmp = nla_data(attr);
>> +			err = rtnl_nla_parse_ifinfomsg(peer_tb, attr, extack);
>> +			if (err < 0)
>> +				return err;
>> +			err = netkit_validate(peer_tb, NULL, extack);
>> +			if (err < 0)
>> +				return err;
>> +			tbp = peer_tb;
>> +		}
>> +		if (data[IFLA_NETKIT_POLICY]) {
>> +			attr = data[IFLA_NETKIT_POLICY];
>> +			default_prim = nla_get_u32(attr);
>> +			err = netkit_check_policy(default_prim, attr, extack);
>> +			if (err < 0)
>> +				return err;
>> +		}
>> +		if (data[IFLA_NETKIT_PEER_POLICY]) {
>> +			attr = data[IFLA_NETKIT_PEER_POLICY];
>> +			default_peer = nla_get_u32(attr);
>> +			err = netkit_check_policy(default_peer, attr, extack);
>> +			if (err < 0)
>> +				return err;
>> +		}
>> +	}
>> +
>> +	if (ifmp && tbp[IFLA_IFNAME]) {
>> +		nla_strscpy(ifname, tbp[IFLA_IFNAME], IFNAMSIZ);
>> +		ifname_assign_type = NET_NAME_USER;
>> +	} else {
>> +		strscpy(ifname, "nk%d", IFNAMSIZ);
>> +		ifname_assign_type = NET_NAME_ENUM;
>> +	}
>> +
>> +	net = rtnl_link_get_net(src_net, tbp);
>> +	if (IS_ERR(net))
>> +		return PTR_ERR(net);
>> +
>> +	peer = rtnl_create_link(net, ifname, ifname_assign_type,
>> +				&netkit_link_ops, tbp, extack);
>> +	if (IS_ERR(peer)) {
>> +		put_net(net);
>> +		return PTR_ERR(peer);
>> +	}
>> +
>> +	netif_inherit_tso_max(peer, dev);
>> +
>> +	if (mode == NETKIT_L2)
>> +		eth_hw_addr_random(peer);
>> +	if (ifmp && dev->ifindex)
>> +		peer->ifindex = ifmp->ifi_index;
>> +
>> +	nk = netkit_priv(peer);
>> +	nk->primary = false;
>> +	nk->policy = default_peer;
>> +	nk->mode = mode;
>> +	bpf_mprog_bundle_init(&nk->bundle);
>> +	RCU_INIT_POINTER(nk->active, NULL);
>> +	RCU_INIT_POINTER(nk->peer, NULL);
> 
> Aren't these already 0?
> 
> 

Yep, they are. Here I decided in favor of explicitly showing the values,
although it's minor and I'm fine either way.

>> +
>> +	err = register_netdevice(peer);
>> +	put_net(net);
>> +	if (err < 0)
>> +		goto err_register_peer;
>> +	netif_carrier_off(peer);
>> +	if (mode == NETKIT_L2)
>> +		dev_change_flags(peer, peer->flags & ~IFF_NOARP, NULL);
>> +
>> +	err = rtnl_configure_link(peer, NULL, 0, NULL);
>> +	if (err < 0)
>> +		goto err_configure_peer;
>> +
>> +	if (mode == NETKIT_L2)
>> +		eth_hw_addr_random(dev);
>> +	if (tb[IFLA_IFNAME])
>> +		nla_strscpy(dev->name, tb[IFLA_IFNAME], IFNAMSIZ);
>> +	else
>> +		strscpy(dev->name, "nk%d", IFNAMSIZ);
>> +
>> +	nk = netkit_priv(dev);
>> +	nk->primary = true;
>> +	nk->policy = default_prim;
>> +	nk->mode = mode;
>> +	bpf_mprog_bundle_init(&nk->bundle);
>> +	RCU_INIT_POINTER(nk->active, NULL);
>> +	RCU_INIT_POINTER(nk->peer, NULL);
>> +
>> +	err = register_netdevice(dev);
>> +	if (err < 0)
>> +		goto err_configure_peer;
>> +	netif_carrier_off(dev);
>> +	if (mode == NETKIT_L2)
>> +		dev_change_flags(dev, dev->flags & ~IFF_NOARP, NULL);
>> +
>> +	rcu_assign_pointer(netkit_priv(dev)->peer, peer);
>> +	rcu_assign_pointer(netkit_priv(peer)->peer, dev);
>> +	return 0;
>> +err_configure_peer:
>> +	unregister_netdevice(peer);
>> +	return err;
>> +err_register_peer:
>> +	free_netdev(peer);
>> +	return err;
>> +}
>> +
> 
> [..]

Cheers,
  Nik


^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH bpf-next v4 1/7] netkit, bpf: Add bpf programmable net device
  2023-10-24 21:48 ` [PATCH bpf-next v4 1/7] netkit, bpf: " Daniel Borkmann
  2023-10-25 15:47   ` Jiri Pirko
@ 2023-10-25 21:24   ` Kui-Feng Lee
  2023-10-25 22:09     ` Martin KaFai Lau
  1 sibling, 1 reply; 27+ messages in thread
From: Kui-Feng Lee @ 2023-10-25 21:24 UTC (permalink / raw)
  To: Daniel Borkmann, bpf
  Cc: netdev, martin.lau, razor, ast, andrii, john.fastabend, sdf, toke,
	kuba, andrew, Toke Høiland-Jørgensen



On 10/24/23 14:48, Daniel Borkmann wrote:
> This work adds a new, minimal BPF-programmable device called "netkit"
> (former PoC code-name "meta") we recently presented at LSF/MM/BPF. The
> core idea is that BPF programs are executed within the drivers xmit routine
> and therefore e.g. in case of containers/Pods moving BPF processing closer
> to the source.
> 

Sorry for intruding into this discussion! Although it is too late to
mention this since this patchset is already at v4.

I notice netkit has introduced a new attach type. I wonder if it
possible to implement it as a new struct_ops type.

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH bpf-next v4 1/7] netkit, bpf: Add bpf programmable net device
  2023-10-25 21:24   ` Kui-Feng Lee
@ 2023-10-25 22:09     ` Martin KaFai Lau
  2023-10-26  1:15       ` Kui-Feng Lee
  0 siblings, 1 reply; 27+ messages in thread
From: Martin KaFai Lau @ 2023-10-25 22:09 UTC (permalink / raw)
  To: Kui-Feng Lee
  Cc: netdev, razor, ast, andrii, john.fastabend, sdf, toke, kuba,
	andrew, Toke Høiland-Jørgensen, Daniel Borkmann, bpf

On 10/25/23 2:24 PM, Kui-Feng Lee wrote:
> 
> 
> On 10/24/23 14:48, Daniel Borkmann wrote:
>> This work adds a new, minimal BPF-programmable device called "netkit"
>> (former PoC code-name "meta") we recently presented at LSF/MM/BPF. The
>> core idea is that BPF programs are executed within the drivers xmit routine
>> and therefore e.g. in case of containers/Pods moving BPF processing closer
>> to the source.
>>
> 
> Sorry for intruding into this discussion! Although it is too late to
> mentioned this since this patchset have been v4 already.
> 
> I notice netkit has introduced a new attach type. I wonder if it
> possible to implement it as a new struct_ops type.

Could you elaborate more on what this struct_ops type does and how it is
different from the SCHED_CLS bpf prog that netkit is running?

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH bpf-next v4 1/7] netkit, bpf: Add bpf programmable net device
  2023-10-25 22:09     ` Martin KaFai Lau
@ 2023-10-26  1:15       ` Kui-Feng Lee
  2023-10-26  1:18         ` Kui-Feng Lee
  0 siblings, 1 reply; 27+ messages in thread
From: Kui-Feng Lee @ 2023-10-26  1:15 UTC (permalink / raw)
  To: Martin KaFai Lau
  Cc: netdev, razor, ast, andrii, john.fastabend, sdf, toke, kuba,
	andrew, Toke Høiland-Jørgensen, Daniel Borkmann, bpf



On 10/25/23 15:09, Martin KaFai Lau wrote:
> On 10/25/23 2:24 PM, Kui-Feng Lee wrote:
>>
>>
>> On 10/24/23 14:48, Daniel Borkmann wrote:
>>> This work adds a new, minimal BPF-programmable device called "netkit"
>>> (former PoC code-name "meta") we recently presented at LSF/MM/BPF. The
>>> core idea is that BPF programs are executed within the drivers xmit 
>>> routine
>>> and therefore e.g. in case of containers/Pods moving BPF processing 
>>> closer
>>> to the source.
>>>
>>
>> Sorry for intruding into this discussion! Although it is too late to
>> mentioned this since this patchset have been v4 already.
>>
>> I notice netkit has introduced a new attach type. I wonder if it
>> possible to implement it as a new struct_ops type.
> 
> Could your elaborate more about what does this struct_ops type do and 
> how is it different from the SCHED_CLS bpf prog that the netkit is running?

I found the code has already landed.
Based on the landed code and
the patchset for registering bpf struct_ops from modules that I
am working on, it would look like what is done in the following patch.
No changes to the syscall, uapi or libbpf are required.


diff --git a/drivers/net/netkit.c b/drivers/net/netkit.c
index 7e484f9fd3ae..e4eafaf397bf 100644
--- a/drivers/net/netkit.c
+++ b/drivers/net/netkit.c
@@ -20,6 +20,7 @@ struct netkit {
  	struct bpf_mprog_entry __rcu *active;
  	enum netkit_action policy;
  	struct bpf_mprog_bundle	bundle;
+	struct hlist_head ops_list;

  	/* Needed in slow-path */
  	enum netkit_mode mode;
@@ -27,6 +28,13 @@ struct netkit {
  	u32 headroom;
  };

+struct netkit_ops {
+	struct hlist_node node;
+	int ifindex;
+
+	int (*xmit)(struct sk_buff *skb);
+};
+
  struct netkit_link {
  	struct bpf_link link;
  	struct net_device *dev;
@@ -46,6 +54,22 @@ netkit_run(const struct bpf_mprog_entry *entry, 
struct sk_buff *skb,
  		if (ret != NETKIT_NEXT)
  			break;
  	}
+
+	return ret;
+}
+
+static __always_inline int
+netkit_run_st_ops(const struct netkit *nk, struct sk_buff *skb,
+	   enum netkit_action ret)
+{
+	struct netkit_ops *ops;
+
+	hlist_for_each_entry_rcu(ops, &nk->ops_list, node) {
+		ret = ops->xmit(skb);
+		if (ret != NETKIT_NEXT)
+			break;
+	}
+
  	return ret;
  }

@@ -80,6 +104,8 @@ static netdev_tx_t netkit_xmit(struct sk_buff *skb, 
struct net_device *dev)
  	entry = rcu_dereference(nk->active);
  	if (entry)
  		ret = netkit_run(entry, skb, ret);
+	if (ret == NETKIT_NEXT)
+		ret = netkit_run_st_ops(nk, skb, ret);
  	switch (ret) {
  	case NETKIT_NEXT:
  	case NETKIT_PASS:
@@ -900,6 +926,78 @@ static const struct nla_policy 
netkit_policy[IFLA_NETKIT_MAX + 1] = {
  					    .reject_message = "Primary attribute is read-only" },
  };

+#ifdef CONFIG_DEBUG_INFO_BTF_MODULES
+
+static bool bpf_netkit_ops_is_valid_access(int off, int size,
+					   enum bpf_access_type type,
+					   const struct bpf_prog *prog,
+					   struct bpf_insn_access_aux *info)
+{
+	return bpf_tracing_btf_ctx_access(off, size, type, prog, info);
+}
+
+static const struct bpf_verifier_ops bpf_netkit_verifier_ops = {
+	.is_valid_access = bpf_netkit_ops_is_valid_access,
+};
+
+static int bpf_netkit_ops_reg(void *kdata)
+{
+	struct netkit_ops *ops = kdata;
+	struct netkit_link *nkl;
+	struct net_device *dev;
+
+	BTF_STRUCT_OPS_TYPE_EMIT(netkit_ops);
+	dev = netkit_dev_fetch(current->nsproxy->net_ns,
+			       ops->ifindex,
+			       BPF_NETKIT_PRIMARY);
+	nkl = netkit_link(dev);
+	hlist_add_tail_rcu(&ops->node, &nkl->ops_list);
+
+	return 0;
+}
+
+static int bpf_netkit_ops_init(struct btf *btf)
+{
+	return 0;
+}
+
+static int bpf_netkit_ops_init_member(const struct btf_type *t,
+				       const struct btf_member *member,
+				       void *kdata, const void *udata)
+{
+	struct netkit_ops *kops = kdata;
+	const struct netkit_ops *uops = udata;
+
+	u32 moff = __btf_member_bit_offset(t, member) / 8;
+	if (moff == offsetof(struct netkit_ops, ifindex)) {
+		kops->ifindex = uops->ifindex;
+		return 1;
+	}
+	if (moff < offsetof(struct netkit_ops, ifindex))
+		return 1;
+
+	return 0;
+}
+
+static void bpf_netkit_ops_unreg(void *kdata)
+{
+	struct netkit_ops *ops = kdata;
+
+	hlist_del_rcu(&ops->node);
+}
+
+struct bpf_struct_ops bpf_netkit_ops = {
+	.verifier_ops = &bpf_netkit_verifier_ops,
+	.init = bpf_netkit_ops_init,
+	.init_member = bpf_netkit_ops_init_member,
+	.reg = bpf_netkit_ops_reg,
+	.unreg = bpf_netkit_ops_unreg,
+	.name = "netkit_ops",
+	.owner = THIS_MODULE,
+};
+
+#endif /* CONFIG_DEBUG_INFO_BTF_MODULES */
+
  static struct rtnl_link_ops netkit_link_ops = {
  	.kind		= DRV_NAME,
  	.priv_size	= sizeof(struct netkit),
@@ -917,17 +1015,22 @@ static struct rtnl_link_ops netkit_link_ops = {

  static __init int netkit_init(void)
  {
+	int ret;
+
  	BUILD_BUG_ON((int)NETKIT_NEXT != (int)TCX_NEXT ||
  		     (int)NETKIT_PASS != (int)TCX_PASS ||
  		     (int)NETKIT_DROP != (int)TCX_DROP ||
  		     (int)NETKIT_REDIRECT != (int)TCX_REDIRECT);

-	return rtnl_link_register(&netkit_link_ops);
+	ret = rtnl_link_register(&netkit_link_ops);
+#ifdef CONFIG_DEBUG_INFO_BTF_MODULES
+	ret = ret ?: register_bpf_struct_ops(&bpf_netkit_ops);
+#endif
+	return ret;
  }

  static __exit void netkit_exit(void)
  {
-	rtnl_link_unregister(&netkit_link_ops);
+	rtnl_link_unregister(&bpf_netkit_ops);
  }

  module_init(netkit_init);
-- 
2.34.1

^ permalink raw reply related	[flat|nested] 27+ messages in thread

* Re: [PATCH bpf-next v4 1/7] netkit, bpf: Add bpf programmable net device
  2023-10-26  1:15       ` Kui-Feng Lee
@ 2023-10-26  1:18         ` Kui-Feng Lee
  2023-10-26  6:20           ` Daniel Borkmann
  0 siblings, 1 reply; 27+ messages in thread
From: Kui-Feng Lee @ 2023-10-26  1:18 UTC (permalink / raw)
  To: Martin KaFai Lau
  Cc: netdev, razor, ast, andrii, john.fastabend, sdf, toke, kuba,
	andrew, Toke Høiland-Jørgensen, Daniel Borkmann, bpf



On 10/25/23 18:15, Kui-Feng Lee wrote:
> 
> 
> On 10/25/23 15:09, Martin KaFai Lau wrote:
>> On 10/25/23 2:24 PM, Kui-Feng Lee wrote:
>>>
>>>
>>> On 10/24/23 14:48, Daniel Borkmann wrote:
>>>> This work adds a new, minimal BPF-programmable device called "netkit"
>>>> (former PoC code-name "meta") we recently presented at LSF/MM/BPF. The
>>>> core idea is that BPF programs are executed within the drivers xmit 
>>>> routine
>>>> and therefore e.g. in case of containers/Pods moving BPF processing 
>>>> closer
>>>> to the source.
>>>>
>>>
>>> Sorry for intruding into this discussion! Although it is too late to
>>> mentioned this since this patchset have been v4 already.
>>>
>>> I notice netkit has introduced a new attach type. I wonder if it
>>> possible to implement it as a new struct_ops type.
>>
>> Could your elaborate more about what does this struct_ops type do and 
>> how is it different from the SCHED_CLS bpf prog that the netkit is 
>> running?
> 
> I found the code has been landed.
> Basing on the landed code and
> the patchset of registering bpf struct_ops from modules that I
> am working on, it will looks like what is done in following patch.
> No changes on syscall, uapi and libbpf are required.
> 
> 
> diff --git a/drivers/net/netkit.c b/drivers/net/netkit.c
> index 7e484f9fd3ae..e4eafaf397bf 100644
> --- a/drivers/net/netkit.c
> +++ b/drivers/net/netkit.c
> @@ -20,6 +20,7 @@ struct netkit {
>       struct bpf_mprog_entry __rcu *active;
>       enum netkit_action policy;
>       struct bpf_mprog_bundle    bundle;
> +    struct hlist_head ops_list;
> 
>       /* Needed in slow-path */
>       enum netkit_mode mode;
> @@ -27,6 +28,13 @@ struct netkit {
>       u32 headroom;
>   };
> 
> +struct netkit_ops {
> +    struct hlist_node node;
> +    int ifindex;
> +
> +    int (*xmit)(struct sk_buff *skb);
> +};
> +
>   struct netkit_link {
>       struct bpf_link link;
>       struct net_device *dev;
> @@ -46,6 +54,22 @@ netkit_run(const struct bpf_mprog_entry *entry, 
> struct sk_buff *skb,
>           if (ret != NETKIT_NEXT)
>               break;
>       }
> +
> +    return ret;
> +}
> +
> +static __always_inline int
> +netkit_run_st_ops(const struct netkit *nk, struct sk_buff *skb,
> +       enum netkit_action ret)
> +{
> +    struct netkit_ops *ops;
> +
> +    hlist_for_each_entry_rcu(ops, &nk->ops_list, node) {
> +        ret = ops->xmit(skb);
> +        if (ret != NETKIT_NEXT)
> +            break;
> +    }
> +
>       return ret;
>   }
> 
> @@ -80,6 +104,8 @@ static netdev_tx_t netkit_xmit(struct sk_buff *skb, 
> struct net_device *dev)
>       entry = rcu_dereference(nk->active);
>       if (entry)
>           ret = netkit_run(entry, skb, ret);
> +    if (ret == NETKIT_NEXT)
> +        ret = netkit_run_st_ops(nk, skb, ret);
>       switch (ret) {
>       case NETKIT_NEXT:
>       case NETKIT_PASS:
> @@ -900,6 +926,78 @@ static const struct nla_policy 
> netkit_policy[IFLA_NETKIT_MAX + 1] = {
>                           .reject_message = "Primary attribute is 
> read-only" },
>   };
> 
> +#ifdef CONFIG_DEBUG_INFO_BTF_MODULES
> +
> +static bool bpf_netkit_ops_is_valid_access(int off, int size,
> +                       enum bpf_access_type type,
> +                       const struct bpf_prog *prog,
> +                       struct bpf_insn_access_aux *info)
> +{
> +    return bpf_tracing_btf_ctx_access(off, size, type, prog, info);
> +}
> +
> +static const struct bpf_verifier_ops bpf_netkit_verifier_ops = {
> +    .is_valid_access = bpf_netkit_ops_is_valid_access,
> +};
> +
> +static int bpf_netkit_ops_reg(void *kdata)
> +{
> +    struct netkit_ops *ops = kdata;
> +    struct netkit_link *nkl;
> +    struct net_device *dev;
> +
> +    BTF_STRUCT_OPS_TYPE_EMIT(netkit_ops);
> +    dev = netkit_dev_fetch(current->nsproxy->net_ns,
> +                   ops->ifindex,
> +                   BPF_NETKIT_PRIMARY);
> +    nkl = netkit_link(dev);
> +    hlist_add_tail_rcu(&ops->node, &nkl->ops_list);
> +
> +    return 0;
> +}
> +
> +static int bpf_netkit_ops_init(struct btf *btf)
> +{
> +    return 0;
> +}
> +
> +static int bpf_netkit_ops_init_member(const struct btf_type *t,
> +                       const struct btf_member *member,
> +                       void *kdata, const void *udata)
> +{
> +    struct netkit_ops *kops = kdata;
> +    struct netkit_ops *uops = kdata;
> +
> +    u32 moff = __btf_member_bit_offset(t, member) / 8;
> +    if (moff == offsetof(struct netkit_ops, ifindex)) {
> +        kops->ifindex = uops->ifindex;
> +        return 1;
> +    }
> +    if (mod < offsetof(struct netkit_ops, ifindex))
> +        return 1;
> +
> +    return 0;
> +}
> +
> +static void bpf_netkit_ops_unreg(void *kdata)
> +{
> +    struct netkit_ops *ops = kdata;
> +
> +    hlist_del_rcu(&ops->node);
> +}
> +
> +struct bpf_struct_ops bpf_netkit_ops = {
> +    .verifier_ops = &bpf_netkit_verifier_ops,
> +    .init = bpf_netkit_ops_init,
> +    .init_member = bpf_netkit_ops_init_member,
> +    .reg = bpf_netkit_ops_reg,
> +    .unreg = bpf_netki_ops_unreg,
> +    .name = "netkit_ops",
> +    .owner = THIS_MODULE,
> +};
> +
> +#endif /* CONFIG_DEBUG_INFO_BTF_MODULES */
> +
>   static struct rtnl_link_ops netkit_link_ops = {
>       .kind        = DRV_NAME,
>       .priv_size    = sizeof(struct netkit),
> @@ -917,17 +1015,22 @@ static struct rtnl_link_ops netkit_link_ops = {
> 
>   static __init int netkit_init(void)
>   {
> +    int ret;
> +
>       BUILD_BUG_ON((int)NETKIT_NEXT != (int)TCX_NEXT ||
>                (int)NETKIT_PASS != (int)TCX_PASS ||
>                (int)NETKIT_DROP != (int)TCX_DROP ||
>                (int)NETKIT_REDIRECT != (int)TCX_REDIRECT);
> 
> -    return rtnl_link_register(&netkit_link_ops);
> +    ret = rtnl_link_register(&netkit_link_ops);
> +#ifdef CONFIG_DEBUG_INFO_BTF_MODULES
> +    ret = ret ?: register_bpf_struct_ops(&bpf_netkit_ops);
> +#endif
>   }
> 
>   static __exit void netkit_exit(void)
>   {
> -    rtnl_link_unregister(&netkit_link_ops);
> +    rtnl_link_unregister(&bpf_netkit_ops);

This change should be removed.

>   }
> 
>   module_init(netkit_init);

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH bpf-next v4 1/7] netkit, bpf: Add bpf programmable net device
  2023-10-25 17:20     ` Daniel Borkmann
@ 2023-10-26  5:18       ` Jiri Pirko
  2023-10-26 12:11         ` Daniel Borkmann
  0 siblings, 1 reply; 27+ messages in thread
From: Jiri Pirko @ 2023-10-26  5:18 UTC (permalink / raw)
  To: Daniel Borkmann
  Cc: bpf, netdev, martin.lau, razor, ast, andrii, john.fastabend, sdf,
	toke, kuba, andrew, Toke Høiland-Jørgensen

Wed, Oct 25, 2023 at 07:20:01PM CEST, daniel@iogearbox.net wrote:
>On 10/25/23 5:47 PM, Jiri Pirko wrote:
>> Tue, Oct 24, 2023 at 11:48:58PM CEST, daniel@iogearbox.net wrote:
>[...]
>> > comes with a primary and a peer device. Only the primary device, typically
>> > residing in hostns, can manage BPF programs for itself and its peer. The
>> > peer device is designated for containers/Pods and cannot attach/detach
>> > BPF programs. Upon the device creation, the user can set the default policy
>> > to 'pass' or 'drop' for the case when no BPF program is attached.
>> 
>> It looks to me that you only need the host (primary) netdevice to be
>> used as a handle to attach the bpf programs. Because the bpf program
>> can (and probably in real use case will) redirect to uplink/another
>> pod netdevice skipping the host (primary) netdevice, correct?
>> 
>> If so, why can't you do just single device mode from start finding a
>> different sort of bpf attach handle? (not sure which)
>
>The first point where we switch netns from a K8s Pod is out of the netdevice.
>For the CNI case the vast majority has one but there could also be multi-
>homing for Pods where there may be two or more, and from a troubleshooting
>PoV aka tcpdump et al, it is the most natural point. Other attach handle
>inside the Pod doesn't really fit given from policy PoV it also must be
>unreachable for apps inside the Pod itself.

Okay. What is the use case for the single device model then?

[..]

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH bpf-next v4 1/7] netkit, bpf: Add bpf programmable net device
  2023-10-25 19:21     ` Nikolay Aleksandrov
@ 2023-10-26  5:26       ` Jiri Pirko
  2023-10-26  6:21         ` Nikolay Aleksandrov
  0 siblings, 1 reply; 27+ messages in thread
From: Jiri Pirko @ 2023-10-26  5:26 UTC (permalink / raw)
  To: Nikolay Aleksandrov
  Cc: Daniel Borkmann, bpf, netdev, martin.lau, ast, andrii,
	john.fastabend, sdf, toke, kuba, andrew,
	Toke Høiland-Jørgensen

Wed, Oct 25, 2023 at 09:21:01PM CEST, razor@blackwall.org wrote:
>On 10/25/23 18:47, Jiri Pirko wrote:
>> Tue, Oct 24, 2023 at 11:48:58PM CEST, daniel@iogearbox.net wrote:
>> > This work adds a new, minimal BPF-programmable device called "netkit"
>[snip]
>> 
>> Couple of nitpicks below:
>> 
>> [..]
>> 
>> 
>
>Hi,
>Thanks for the review. I know about the nits below but decided against
>changing them, more below each...
>
>> > +static int netkit_check_policy(int policy, struct nlattr *tb,
>> > +			       struct netlink_ext_ack *extack)
>> > +{
>> > +	switch (policy) {
>> > +	case NETKIT_PASS:
>> > +	case NETKIT_DROP:
>> > +		return 0;
>> > +	default:
>> 
>> Isn't this job for netlink policy?
>> 
>> 
>
>This cannot be handled by policies AFAIK, because only 2 sparse values from
>more are allowed. We could potentially do it through validate() but

Perhaps a good time to extend the netlink validation with a list
of allowed values?
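
Something along these lines perhaps (NLA_POLICY_VALUES() is purely
hypothetical here, no such helper exists in the netlink policy machinery
today, and netkit_policy_values is just an illustrative name):

static const u32 netkit_policy_values[] = { NETKIT_PASS, NETKIT_DROP };

static const struct nla_policy netkit_policy[IFLA_NETKIT_MAX + 1] = {
	/* hypothetical: a generic validator checks the attr against the
	 * listed values and fills extack on mismatch
	 */
	[IFLA_NETKIT_POLICY]	= NLA_POLICY_VALUES(NLA_U32,
						    netkit_policy_values),
};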



>it's the same minus the explicit policy type info. IMO this approach is good.
>
>> > +		NL_SET_ERR_MSG_ATTR(extack, tb,
>> > +				    "Provided default xmit policy not supported");
>> > +		return -EINVAL;
>> > +	}
>> > +}
>> > +
>> > +static int netkit_check_mode(int mode, struct nlattr *tb,
>> > +			     struct netlink_ext_ack *extack)
>> > +{
>> > +	switch (mode) {
>> > +	case NETKIT_L2:
>> > +	case NETKIT_L3:
>> > +		return 0;
>> > +	default:
>> 
>> Isn't this job for netlink policy?
>> 
>> 
>
>This one can be handled by policy indeed, but then we lose the nice user
>error. Again can be done through validate(), but it's the same and we
>lose explicit policy type information.

But the netlink validator sets up the extack properly. What's wrong with
that?



>
>> > +		NL_SET_ERR_MSG_ATTR(extack, tb,
>> > +				    "Provided device mode can only be L2 or L3");
>> > +		return -EINVAL;
>> > +	}
>> > +}
>> > +
>> > +static int netkit_validate(struct nlattr *tb[], struct nlattr *data[],
>> > +			   struct netlink_ext_ack *extack)
>> > +{
>> > +	struct nlattr *attr = tb[IFLA_ADDRESS];
>> > +
>> > +	if (!attr)
>> > +		return 0;
>> > +	NL_SET_ERR_MSG_ATTR(extack, attr,
>> > +			    "Setting Ethernet address is not supported");
>> > +	return -EOPNOTSUPP;
>> > +}
>> > +
>> > +static struct rtnl_link_ops netkit_link_ops;
>> > +
>> > +static int netkit_new_link(struct net *src_net, struct net_device *dev,
>> > +			   struct nlattr *tb[], struct nlattr *data[],
>> > +			   struct netlink_ext_ack *extack)
>> > +{
>> > +	struct nlattr *peer_tb[IFLA_MAX + 1], **tbp = tb, *attr;
>> > +	enum netkit_action default_prim = NETKIT_PASS;
>> > +	enum netkit_action default_peer = NETKIT_PASS;
>> > +	enum netkit_mode mode = NETKIT_L3;
>> > +	unsigned char ifname_assign_type;
>> > +	struct ifinfomsg *ifmp = NULL;
>> > +	struct net_device *peer;
>> > +	char ifname[IFNAMSIZ];
>> > +	struct netkit *nk;
>> > +	struct net *net;
>> > +	int err;
>> > +
>> > +	if (data) {
>> > +		if (data[IFLA_NETKIT_MODE]) {
>> > +			attr = data[IFLA_NETKIT_MODE];
>> > +			mode = nla_get_u32(attr);
>> > +			err = netkit_check_mode(mode, attr, extack);
>> > +			if (err < 0)
>> > +				return err;
>> > +		}
>> > +		if (data[IFLA_NETKIT_PEER_INFO]) {
>> > +			attr = data[IFLA_NETKIT_PEER_INFO];
>> > +			ifmp = nla_data(attr);
>> > +			err = rtnl_nla_parse_ifinfomsg(peer_tb, attr, extack);
>> > +			if (err < 0)
>> > +				return err;
>> > +			err = netkit_validate(peer_tb, NULL, extack);
>> > +			if (err < 0)
>> > +				return err;
>> > +			tbp = peer_tb;
>> > +		}
>> > +		if (data[IFLA_NETKIT_POLICY]) {
>> > +			attr = data[IFLA_NETKIT_POLICY];
>> > +			default_prim = nla_get_u32(attr);
>> > +			err = netkit_check_policy(default_prim, attr, extack);
>> > +			if (err < 0)
>> > +				return err;
>> > +		}
>> > +		if (data[IFLA_NETKIT_PEER_POLICY]) {
>> > +			attr = data[IFLA_NETKIT_PEER_POLICY];
>> > +			default_peer = nla_get_u32(attr);
>> > +			err = netkit_check_policy(default_peer, attr, extack);
>> > +			if (err < 0)
>> > +				return err;
>> > +		}
>> > +	}
>> > +
>> > +	if (ifmp && tbp[IFLA_IFNAME]) {
>> > +		nla_strscpy(ifname, tbp[IFLA_IFNAME], IFNAMSIZ);
>> > +		ifname_assign_type = NET_NAME_USER;
>> > +	} else {
>> > +		strscpy(ifname, "nk%d", IFNAMSIZ);
>> > +		ifname_assign_type = NET_NAME_ENUM;
>> > +	}
>> > +
>> > +	net = rtnl_link_get_net(src_net, tbp);
>> > +	if (IS_ERR(net))
>> > +		return PTR_ERR(net);
>> > +
>> > +	peer = rtnl_create_link(net, ifname, ifname_assign_type,
>> > +				&netkit_link_ops, tbp, extack);
>> > +	if (IS_ERR(peer)) {
>> > +		put_net(net);
>> > +		return PTR_ERR(peer);
>> > +	}
>> > +
>> > +	netif_inherit_tso_max(peer, dev);
>> > +
>> > +	if (mode == NETKIT_L2)
>> > +		eth_hw_addr_random(peer);
>> > +	if (ifmp && dev->ifindex)
>> > +		peer->ifindex = ifmp->ifi_index;
>> > +
>> > +	nk = netkit_priv(peer);
>> > +	nk->primary = false;
>> > +	nk->policy = default_peer;
>> > +	nk->mode = mode;
>> > +	bpf_mprog_bundle_init(&nk->bundle);
>> > +	RCU_INIT_POINTER(nk->active, NULL);
>> > +	RCU_INIT_POINTER(nk->peer, NULL);
>> 
>> Aren't these already 0?
>> 
>> 
>
>Yep, they are. Here decided in favor of explicit show of values, although
>it's minor and I'm fine either way.

It is always confusing to see this. A reader might think this is necessary,
as if the mem was not previously cleared. The general approach is to
rely on mem being zeroed, not sure why this is different.
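
For reference, the zeroing comes from the netdev allocation itself,
roughly (condensed from alloc_netdev_mqs(), from memory rather than a
verbatim quote):

	/* net_device and the driver priv area behind it come from a
	 * single zeroing allocation, so the netkit_priv() fields of a
	 * freshly created device already start out as 0/NULL.
	 */
	alloc_size = sizeof(struct net_device);
	if (sizeof_priv) {
		alloc_size = ALIGN(alloc_size, NETDEV_ALIGN);
		alloc_size += sizeof_priv;
	}
	alloc_size += NETDEV_ALIGN - 1;

	p = kvzalloc(alloc_size, GFP_KERNEL_ACCOUNT | __GFP_RETRY_MAYFAIL);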


>
>> > +
>> > +	err = register_netdevice(peer);
>> > +	put_net(net);
>> > +	if (err < 0)
>> > +		goto err_register_peer;
>> > +	netif_carrier_off(peer);
>> > +	if (mode == NETKIT_L2)
>> > +		dev_change_flags(peer, peer->flags & ~IFF_NOARP, NULL);
>> > +
>> > +	err = rtnl_configure_link(peer, NULL, 0, NULL);
>> > +	if (err < 0)
>> > +		goto err_configure_peer;
>> > +
>> > +	if (mode == NETKIT_L2)
>> > +		eth_hw_addr_random(dev);
>> > +	if (tb[IFLA_IFNAME])
>> > +		nla_strscpy(dev->name, tb[IFLA_IFNAME], IFNAMSIZ);
>> > +	else
>> > +		strscpy(dev->name, "nk%d", IFNAMSIZ);
>> > +
>> > +	nk = netkit_priv(dev);
>> > +	nk->primary = true;
>> > +	nk->policy = default_prim;
>> > +	nk->mode = mode;
>> > +	bpf_mprog_bundle_init(&nk->bundle);
>> > +	RCU_INIT_POINTER(nk->active, NULL);
>> > +	RCU_INIT_POINTER(nk->peer, NULL);
>> > +
>> > +	err = register_netdevice(dev);
>> > +	if (err < 0)
>> > +		goto err_configure_peer;
>> > +	netif_carrier_off(dev);
>> > +	if (mode == NETKIT_L2)
>> > +		dev_change_flags(dev, dev->flags & ~IFF_NOARP, NULL);
>> > +
>> > +	rcu_assign_pointer(netkit_priv(dev)->peer, peer);
>> > +	rcu_assign_pointer(netkit_priv(peer)->peer, dev);
>> > +	return 0;
>> > +err_configure_peer:
>> > +	unregister_netdevice(peer);
>> > +	return err;
>> > +err_register_peer:
>> > +	free_netdev(peer);
>> > +	return err;
>> > +}
>> > +
>> 
>> [..]
>
>Cheers,
> Nik
>

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH bpf-next v4 0/7] Add bpf programmable net device
  2023-10-25 16:54     ` Martin KaFai Lau
@ 2023-10-26  5:35       ` Jiri Pirko
  0 siblings, 0 replies; 27+ messages in thread
From: Jiri Pirko @ 2023-10-26  5:35 UTC (permalink / raw)
  To: Martin KaFai Lau
  Cc: Daniel Borkmann, bpf, netdev, razor, ast, andrii, john.fastabend,
	sdf, toke, kuba, andrew

Wed, Oct 25, 2023 at 06:54:27PM CEST, martin.lau@linux.dev wrote:
>On 10/25/23 8:50 AM, Jiri Pirko wrote:
>> Wed, Oct 25, 2023 at 01:50:25AM CEST, patchwork-bot+netdevbpf@kernel.org wrote:
>> > Hello:
>> > 
>> > This series was applied to bpf/bpf-next.git (master)
>> > by Martin KaFai Lau <martin.lau@kernel.org>:
>> 
>> Interesting, applied within 2 hours after send. You bpf people don't
>> care about some 24h timeout?
>
>24hr? The v1 was posted to both netdev and bpf list on 9/25. It was 10/24
>yesterday. The part you commented in patch 1 had not been changed much since
>v1, so there was a month of time. netdev is always on the cc list. Multiple
>people (Andrew, Jakub...etc) had already helped to review and Daniel had
>addressed the comments. The change history had been diminishing from v1 to v4
>and v4 changes was mostly nit-picking already.

AFAIK the netdev maintainers have a policy (which I think I saw written
down somewhere but cannot find) that a patch stays on the list for one day
before it gets applied. It actually makes a lot of sense. Anyway,
I may be wrong.

>

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH bpf-next v4 1/7] netkit, bpf: Add bpf programmable net device
  2023-10-26  1:18         ` Kui-Feng Lee
@ 2023-10-26  6:20           ` Daniel Borkmann
  2023-10-26 17:47             ` Kui-Feng Lee
  0 siblings, 1 reply; 27+ messages in thread
From: Daniel Borkmann @ 2023-10-26  6:20 UTC (permalink / raw)
  To: Kui-Feng Lee, Martin KaFai Lau
  Cc: netdev, razor, ast, andrii, john.fastabend, sdf, toke, kuba,
	andrew, Toke Høiland-Jørgensen, bpf

Hi Kui-Feng,

On 10/26/23 3:18 AM, Kui-Feng Lee wrote:
> On 10/25/23 18:15, Kui-Feng Lee wrote:
>> On 10/25/23 15:09, Martin KaFai Lau wrote:
>>> On 10/25/23 2:24 PM, Kui-Feng Lee wrote:
>>>> On 10/24/23 14:48, Daniel Borkmann wrote:
>>>>> This work adds a new, minimal BPF-programmable device called "netkit"
>>>>> (former PoC code-name "meta") we recently presented at LSF/MM/BPF. The
>>>>> core idea is that BPF programs are executed within the drivers xmit routine
>>>>> and therefore e.g. in case of containers/Pods moving BPF processing closer
>>>>> to the source.
>>>>
>>>> Sorry for intruding into this discussion! Although it is too late to
>>>> mentioned this since this patchset have been v4 already.
>>>>
>>>> I notice netkit has introduced a new attach type. I wonder if it
>>>> possible to implement it as a new struct_ops type.
>>>
>>> Could your elaborate more about what does this struct_ops type do and how is it different from the SCHED_CLS bpf prog that the netkit is running?
>>
>> I found the code has been landed.
>> Basing on the landed code and
>> the patchset of registering bpf struct_ops from modules that I
>> am working on, it will looks like what is done in following patch.
>> No changes on syscall, uapi and libbpf are required.
>>
>> diff --git a/drivers/net/netkit.c b/drivers/net/netkit.c
>> index 7e484f9fd3ae..e4eafaf397bf 100644
>> --- a/drivers/net/netkit.c
>> +++ b/drivers/net/netkit.c
>> @@ -20,6 +20,7 @@ struct netkit {
>>       struct bpf_mprog_entry __rcu *active;
>>       enum netkit_action policy;
>>       struct bpf_mprog_bundle    bundle;
>> +    struct hlist_head ops_list;
>>
>>       /* Needed in slow-path */
>>       enum netkit_mode mode;
>> @@ -27,6 +28,13 @@ struct netkit {
>>       u32 headroom;
>>   };
>>
>> +struct netkit_ops {
>> +    struct hlist_node node;
>> +    int ifindex;
>> +
>> +    int (*xmit)(struct sk_buff *skb);
>> +};
>> +
>>   struct netkit_link {
>>       struct bpf_link link;
>>       struct net_device *dev;
>> @@ -46,6 +54,22 @@ netkit_run(const struct bpf_mprog_entry *entry, struct sk_buff *skb,
>>           if (ret != NETKIT_NEXT)
>>               break;
>>       }
>> +
>> +    return ret;
>> +}
>> +
>> +static __always_inline int
>> +netkit_run_st_ops(const struct netkit *nk, struct sk_buff *skb,
>> +       enum netkit_action ret)
>> +{
>> +    struct netkit_ops *ops;
>> +
>> +    hlist_for_each_entry_rcu(ops, &nk->ops_list, node) {
>> +        ret = ops->xmit(skb);
>> +        if (ret != NETKIT_NEXT)
>> +            break;
>> +    }
>> +
>>       return ret;
>>   }
>>
>> @@ -80,6 +104,8 @@ static netdev_tx_t netkit_xmit(struct sk_buff *skb, struct net_device *dev)
>>       entry = rcu_dereference(nk->active);
>>       if (entry)
>>           ret = netkit_run(entry, skb, ret);
>> +    if (ret == NETKIT_NEXT)
>> +        ret = netkit_run_st_ops(nk, skb, ret);
>>       switch (ret) {
>>       case NETKIT_NEXT:
>>       case NETKIT_PASS:

I don't think it makes sense to cram struct_ops in here for what has already
been solved with the bpf_mprog interface in a more efficient way and with
control dependencies for the insertion (before/after relative programs/links).
The latter is particularly crucial for a multi-user interface when dealing
with network traffic (think for example: policy, forwarder, observability
prog, etc).
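
As a rough sketch of what that ordering looks like from a loader's PoV
(assuming bpf_netkit_opts mirrors the tcx opts with relative_fd plus the
BPF_F_BEFORE/BPF_F_AFTER flags; the prog names and ifindex below are made
up):

#include <errno.h>
#include <linux/bpf.h>
#include <bpf/libbpf.h>

static int attach_obs_after_policy(struct bpf_program *obs_prog,
				   struct bpf_program *policy_prog,
				   int ifindex_primary)
{
	/* Place the observability prog directly after the already
	 * attached policy prog on the primary netkit device; bpf_mprog
	 * resolves the relative placement and rejects the attach if the
	 * dependency is not there.
	 */
	LIBBPF_OPTS(bpf_netkit_opts, optl,
		.relative_fd	= bpf_program__fd(policy_prog),
		.flags		= BPF_F_AFTER,
	);
	struct bpf_link *link;

	link = bpf_program__attach_netkit(obs_prog, ifindex_primary, &optl);
	return link ? 0 : -errno;
}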

Thanks,
Daniel

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH bpf-next v4 1/7] netkit, bpf: Add bpf programmable net device
  2023-10-26  5:26       ` Jiri Pirko
@ 2023-10-26  6:21         ` Nikolay Aleksandrov
  0 siblings, 0 replies; 27+ messages in thread
From: Nikolay Aleksandrov @ 2023-10-26  6:21 UTC (permalink / raw)
  To: Jiri Pirko
  Cc: Daniel Borkmann, bpf, netdev, martin.lau, ast, andrii,
	john.fastabend, sdf, toke, kuba, andrew,
	Toke Høiland-Jørgensen

On 10/26/23 08:26, Jiri Pirko wrote:
> Wed, Oct 25, 2023 at 09:21:01PM CEST, razor@blackwall.org wrote:
>> On 10/25/23 18:47, Jiri Pirko wrote:
>>> Tue, Oct 24, 2023 at 11:48:58PM CEST, daniel@iogearbox.net wrote:
>>>> This work adds a new, minimal BPF-programmable device called "netkit"
>> [snip]
>>>
>>> Couple of nitpicks below:
>>>
>>> [..]
>>>
>>>
>>
>> Hi,
>> Thanks for the review. I know about the nits below but decided against
>> changing them, more below each...
>>
>>>> +static int netkit_check_policy(int policy, struct nlattr *tb,
>>>> +			       struct netlink_ext_ack *extack)
>>>> +{
>>>> +	switch (policy) {
>>>> +	case NETKIT_PASS:
>>>> +	case NETKIT_DROP:
>>>> +		return 0;
>>>> +	default:
>>>
>>> Isn't this job for netlink policy?
>>>
>>>
>>
>> This cannot be handled by policies AFAIK, because only 2 sparse values from
>> more are allowed. We could potentially do it through validate() but
> 
> Perhaps good time to extend the netlink validation for list
> of values allowed?
> 
> 
> 
>> it's the same minus the explicit policy type info. IMO this approach is good.
>>
>>>> +		NL_SET_ERR_MSG_ATTR(extack, tb,
>>>> +				    "Provided default xmit policy not supported");
>>>> +		return -EINVAL;
>>>> +	}
>>>> +}
>>>> +
>>>> +static int netkit_check_mode(int mode, struct nlattr *tb,
>>>> +			     struct netlink_ext_ack *extack)
>>>> +{
>>>> +	switch (mode) {
>>>> +	case NETKIT_L2:
>>>> +	case NETKIT_L3:
>>>> +		return 0;
>>>> +	default:
>>>
>>> Isn't this job for netlink policy?
>>>
>>>
>>
>> This one can be handled by policy indeed, but then we lose the nice user
>> error. Again can be done through validate(), but it's the same and we
>> lose explicit policy type information.
> 
> But the netlink validator setups the extack properly. What's wrong with
> that?
> 
> 

"integer out of range" vs "Provided device mode can only be L2 or L3" ?
I like the second one better and it's more informative. The way to get 
it is to use NLA_POLICY_VALIDATE_FN() as type and provide custom 
validate() callback, which is identical to the current solution, I see 
no added value in changing it.
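
Concretely, the policy-table route would be roughly the below (untested
sketch, assuming NLA_POLICY_VALIDATE_FN()'s current two-argument form;
netkit_mode_validate is just a made-up name for netkit_check_mode() moved
over):

static int netkit_mode_validate(const struct nlattr *attr,
				struct netlink_ext_ack *extack)
{
	switch (nla_get_u32(attr)) {
	case NETKIT_L2:
	case NETKIT_L3:
		return 0;
	default:
		NL_SET_ERR_MSG_ATTR(extack, attr,
				    "Provided device mode can only be L2 or L3");
		return -EINVAL;
	}
}

static const struct nla_policy netkit_policy[IFLA_NETKIT_MAX + 1] = {
	[IFLA_NETKIT_MODE]	= NLA_POLICY_VALIDATE_FN(NLA_U32,
							 netkit_mode_validate),
	/* remaining entries as in the merged driver */
};

So the switch only moves from the newlink path into the policy table,
which is why it buys nothing here.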

> 
>>
>>>> +		NL_SET_ERR_MSG_ATTR(extack, tb,
>>>> +				    "Provided device mode can only be L2 or L3");
>>>> +		return -EINVAL;
>>>> +	}
>>>> +}
>>>> +
>>>> +static int netkit_validate(struct nlattr *tb[], struct nlattr *data[],
>>>> +			   struct netlink_ext_ack *extack)
>>>> +{
>>>> +	struct nlattr *attr = tb[IFLA_ADDRESS];
>>>> +
>>>> +	if (!attr)
>>>> +		return 0;
>>>> +	NL_SET_ERR_MSG_ATTR(extack, attr,
>>>> +			    "Setting Ethernet address is not supported");
>>>> +	return -EOPNOTSUPP;
>>>> +}
>>>> +
>>>> +static struct rtnl_link_ops netkit_link_ops;
>>>> +
>>>> +static int netkit_new_link(struct net *src_net, struct net_device *dev,
>>>> +			   struct nlattr *tb[], struct nlattr *data[],
>>>> +			   struct netlink_ext_ack *extack)
>>>> +{
>>>> +	struct nlattr *peer_tb[IFLA_MAX + 1], **tbp = tb, *attr;
>>>> +	enum netkit_action default_prim = NETKIT_PASS;
>>>> +	enum netkit_action default_peer = NETKIT_PASS;
>>>> +	enum netkit_mode mode = NETKIT_L3;
>>>> +	unsigned char ifname_assign_type;
>>>> +	struct ifinfomsg *ifmp = NULL;
>>>> +	struct net_device *peer;
>>>> +	char ifname[IFNAMSIZ];
>>>> +	struct netkit *nk;
>>>> +	struct net *net;
>>>> +	int err;
>>>> +
>>>> +	if (data) {
>>>> +		if (data[IFLA_NETKIT_MODE]) {
>>>> +			attr = data[IFLA_NETKIT_MODE];
>>>> +			mode = nla_get_u32(attr);
>>>> +			err = netkit_check_mode(mode, attr, extack);
>>>> +			if (err < 0)
>>>> +				return err;
>>>> +		}
>>>> +		if (data[IFLA_NETKIT_PEER_INFO]) {
>>>> +			attr = data[IFLA_NETKIT_PEER_INFO];
>>>> +			ifmp = nla_data(attr);
>>>> +			err = rtnl_nla_parse_ifinfomsg(peer_tb, attr, extack);
>>>> +			if (err < 0)
>>>> +				return err;
>>>> +			err = netkit_validate(peer_tb, NULL, extack);
>>>> +			if (err < 0)
>>>> +				return err;
>>>> +			tbp = peer_tb;
>>>> +		}
>>>> +		if (data[IFLA_NETKIT_POLICY]) {
>>>> +			attr = data[IFLA_NETKIT_POLICY];
>>>> +			default_prim = nla_get_u32(attr);
>>>> +			err = netkit_check_policy(default_prim, attr, extack);
>>>> +			if (err < 0)
>>>> +				return err;
>>>> +		}
>>>> +		if (data[IFLA_NETKIT_PEER_POLICY]) {
>>>> +			attr = data[IFLA_NETKIT_PEER_POLICY];
>>>> +			default_peer = nla_get_u32(attr);
>>>> +			err = netkit_check_policy(default_peer, attr, extack);
>>>> +			if (err < 0)
>>>> +				return err;
>>>> +		}
>>>> +	}
>>>> +
>>>> +	if (ifmp && tbp[IFLA_IFNAME]) {
>>>> +		nla_strscpy(ifname, tbp[IFLA_IFNAME], IFNAMSIZ);
>>>> +		ifname_assign_type = NET_NAME_USER;
>>>> +	} else {
>>>> +		strscpy(ifname, "nk%d", IFNAMSIZ);
>>>> +		ifname_assign_type = NET_NAME_ENUM;
>>>> +	}
>>>> +
>>>> +	net = rtnl_link_get_net(src_net, tbp);
>>>> +	if (IS_ERR(net))
>>>> +		return PTR_ERR(net);
>>>> +
>>>> +	peer = rtnl_create_link(net, ifname, ifname_assign_type,
>>>> +				&netkit_link_ops, tbp, extack);
>>>> +	if (IS_ERR(peer)) {
>>>> +		put_net(net);
>>>> +		return PTR_ERR(peer);
>>>> +	}
>>>> +
>>>> +	netif_inherit_tso_max(peer, dev);
>>>> +
>>>> +	if (mode == NETKIT_L2)
>>>> +		eth_hw_addr_random(peer);
>>>> +	if (ifmp && dev->ifindex)
>>>> +		peer->ifindex = ifmp->ifi_index;
>>>> +
>>>> +	nk = netkit_priv(peer);
>>>> +	nk->primary = false;
>>>> +	nk->policy = default_peer;
>>>> +	nk->mode = mode;
>>>> +	bpf_mprog_bundle_init(&nk->bundle);
>>>> +	RCU_INIT_POINTER(nk->active, NULL);
>>>> +	RCU_INIT_POINTER(nk->peer, NULL);
>>>
>>> Aren't these already 0?
>>>
>>>
>>
>> Yep, they are. Here decided in favor of explicit show of values, although
>> it's minor and I'm fine either way.
> 
> It is always confusing to see this. Reader might think this is necessary
> as if the mem was not previuosly cleared. The general approach is to
> rely on mem being zeroed, not sure why this is different.
> 
> 

Oh come on :) you really believe someone reading this code would start 
inferring about netdev_alloc mem zeroing instead of getting a mental 
picture of the expected state? Anyway, as I said I think this is way too
minor and it's fine either way, we can remove the explicit 
initialization and rely on the implicit one.

>>
>>>> +
>>>> +	err = register_netdevice(peer);
>>>> +	put_net(net);
>>>> +	if (err < 0)
>>>> +		goto err_register_peer;
>>>> +	netif_carrier_off(peer);
>>>> +	if (mode == NETKIT_L2)
>>>> +		dev_change_flags(peer, peer->flags & ~IFF_NOARP, NULL);
>>>> +
>>>> +	err = rtnl_configure_link(peer, NULL, 0, NULL);
>>>> +	if (err < 0)
>>>> +		goto err_configure_peer;
>>>> +
>>>> +	if (mode == NETKIT_L2)
>>>> +		eth_hw_addr_random(dev);
>>>> +	if (tb[IFLA_IFNAME])
>>>> +		nla_strscpy(dev->name, tb[IFLA_IFNAME], IFNAMSIZ);
>>>> +	else
>>>> +		strscpy(dev->name, "nk%d", IFNAMSIZ);
>>>> +
>>>> +	nk = netkit_priv(dev);
>>>> +	nk->primary = true;
>>>> +	nk->policy = default_prim;
>>>> +	nk->mode = mode;
>>>> +	bpf_mprog_bundle_init(&nk->bundle);
>>>> +	RCU_INIT_POINTER(nk->active, NULL);
>>>> +	RCU_INIT_POINTER(nk->peer, NULL);
>>>> +
>>>> +	err = register_netdevice(dev);
>>>> +	if (err < 0)
>>>> +		goto err_configure_peer;
>>>> +	netif_carrier_off(dev);
>>>> +	if (mode == NETKIT_L2)
>>>> +		dev_change_flags(dev, dev->flags & ~IFF_NOARP, NULL);
>>>> +
>>>> +	rcu_assign_pointer(netkit_priv(dev)->peer, peer);
>>>> +	rcu_assign_pointer(netkit_priv(peer)->peer, dev);
>>>> +	return 0;
>>>> +err_configure_peer:
>>>> +	unregister_netdevice(peer);
>>>> +	return err;
>>>> +err_register_peer:
>>>> +	free_netdev(peer);
>>>> +	return err;
>>>> +}
>>>> +
>>>
>>> [..]
>>
>> Cheers,
>> Nik
>>


^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH bpf-next v4 1/7] netkit, bpf: Add bpf programmable net device
  2023-10-26  5:18       ` Jiri Pirko
@ 2023-10-26 12:11         ` Daniel Borkmann
  0 siblings, 0 replies; 27+ messages in thread
From: Daniel Borkmann @ 2023-10-26 12:11 UTC (permalink / raw)
  To: Jiri Pirko
  Cc: bpf, netdev, martin.lau, razor, ast, andrii, john.fastabend, sdf,
	toke, kuba, andrew, Toke Høiland-Jørgensen, joe

On 10/26/23 7:18 AM, Jiri Pirko wrote:
> Wed, Oct 25, 2023 at 07:20:01PM CEST, daniel@iogearbox.net wrote:
>> On 10/25/23 5:47 PM, Jiri Pirko wrote:
>>> Tue, Oct 24, 2023 at 11:48:58PM CEST, daniel@iogearbox.net wrote:
>> [...]
>>>> comes with a primary and a peer device. Only the primary device, typically
>>>> residing in hostns, can manage BPF programs for itself and its peer. The
>>>> peer device is designated for containers/Pods and cannot attach/detach
>>>> BPF programs. Upon the device creation, the user can set the default policy
>>>> to 'pass' or 'drop' for the case when no BPF program is attached.
>>>
>>> It looks to me that you only need the host (primary) netdevice to be
>>> used as a handle to attach the bpf programs. Because the bpf program
>>> can (and probably in real use case will) redirect to uplink/another
>>> pod netdevice skipping the host (primary) netdevice, correct?
>>>
>>> If so, why can't you do just single device mode from start finding a
>>> different sort of bpf attach handle? (not sure which)
>>
>> The first point where we switch netns from a K8s Pod is out of the netdevice.
>> For the CNI case the vast majority has one but there could also be multi-
>> homing for Pods where there may be two or more, and from a troubleshooting
>> PoV aka tcpdump et al, it is the most natural point. Other attach handle
>> inside the Pod doesn't really fit given from policy PoV it also must be
>> unreachable for apps inside the Pod itself.
> 
> Okay. What is the usecase for the single device model then?

One immediate use case in the context of Cilium itself is to replace
and simplify our cilium_host/cilium_net pair, which lives in the hostns
and, depending on the routing mode Cilium is configured in, handles
traffic/policy from the host into Pods, respectively from the host or Pods
into collect_md encaps in order to route traffic E/W to other cluster nodes.
Going further, we were also thinking of having a single dev and moving it
into the target netns, with policy being configured from the host.

Best,
Daniel

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH bpf-next v4 1/7] netkit, bpf: Add bpf programmable net device
  2023-10-26  6:20           ` Daniel Borkmann
@ 2023-10-26 17:47             ` Kui-Feng Lee
  2023-10-26 18:46               ` Martin KaFai Lau
  0 siblings, 1 reply; 27+ messages in thread
From: Kui-Feng Lee @ 2023-10-26 17:47 UTC (permalink / raw)
  To: Daniel Borkmann, Martin KaFai Lau
  Cc: netdev, razor, ast, andrii, john.fastabend, sdf, toke, kuba,
	andrew, Toke Høiland-Jørgensen, bpf



On 10/25/23 23:20, Daniel Borkmann wrote:
> Hi Kui-Feng,
> 
> On 10/26/23 3:18 AM, Kui-Feng Lee wrote:
>> On 10/25/23 18:15, Kui-Feng Lee wrote:
>>> On 10/25/23 15:09, Martin KaFai Lau wrote:
>>>> On 10/25/23 2:24 PM, Kui-Feng Lee wrote:
>>>>> On 10/24/23 14:48, Daniel Borkmann wrote:
>>>>>> This work adds a new, minimal BPF-programmable device called "netkit"
>>>>>> (former PoC code-name "meta") we recently presented at LSF/MM/BPF. The
>>>>>> core idea is that BPF programs are executed within the drivers xmit
>>>>>> routine and therefore e.g. in case of containers/Pods moving BPF
>>>>>> processing closer to the source.
>>>>>
>>>>> Sorry for intruding into this discussion! Although it is too late to
>>>>> mention this since this patchset is already at v4.
>>>>>
>>>>> I notice netkit has introduced a new attach type. I wonder if it is
>>>>> possible to implement it as a new struct_ops type.
>>>>
>>>> Could you elaborate more on what this struct_ops type does and how it
>>>> is different from the SCHED_CLS bpf prog that netkit is running?
>>>
>>> I found the code has already landed.
>>> Based on the landed code and the patchset for registering bpf struct_ops
>>> from modules that I am working on, it would look like the following patch.
>>> No changes to syscall, uapi, or libbpf are required.
>>>
>>> diff --git a/drivers/net/netkit.c b/drivers/net/netkit.c
>>> index 7e484f9fd3ae..e4eafaf397bf 100644
>>> --- a/drivers/net/netkit.c
>>> +++ b/drivers/net/netkit.c
>>> @@ -20,6 +20,7 @@ struct netkit {
>>>       struct bpf_mprog_entry __rcu *active;
>>>       enum netkit_action policy;
>>>       struct bpf_mprog_bundle    bundle;
>>> +    struct hlist_head ops_list;
>>>
>>>       /* Needed in slow-path */
>>>       enum netkit_mode mode;
>>> @@ -27,6 +28,13 @@ struct netkit {
>>>       u32 headroom;
>>>   };
>>>
>>> +struct netkit_ops {
>>> +    struct hlist_node node;
>>> +    int ifindex;
>>> +
>>> +    int (*xmit)(struct sk_buff *skb);
>>> +};
>>> +
>>>   struct netkit_link {
>>>       struct bpf_link link;
>>>       struct net_device *dev;
>>> @@ -46,6 +54,22 @@ netkit_run(const struct bpf_mprog_entry *entry, struct sk_buff *skb,
>>>           if (ret != NETKIT_NEXT)
>>>               break;
>>>       }
>>> +
>>> +    return ret;
>>> +}
>>> +
>>> +static __always_inline int
>>> +netkit_run_st_ops(const struct netkit *nk, struct sk_buff *skb,
>>> +       enum netkit_action ret)
>>> +{
>>> +    struct netkit_ops *ops;
>>> +
>>> +    hlist_for_each_entry_rcu(ops, &nk->ops_list, node) {
>>> +        ret = ops->xmit(skb);
>>> +        if (ret != NETKIT_NEXT)
>>> +            break;
>>> +    }
>>> +
>>>       return ret;
>>>   }
>>>
>>> @@ -80,6 +104,8 @@ static netdev_tx_t netkit_xmit(struct sk_buff *skb, struct net_device *dev)
>>>       entry = rcu_dereference(nk->active);
>>>       if (entry)
>>>           ret = netkit_run(entry, skb, ret);
>>> +    if (ret == NETKIT_NEXT)
>>> +        ret = netkit_run_st_ops(nk, skb, ret);
>>>       switch (ret) {
>>>       case NETKIT_NEXT:
>>>       case NETKIT_PASS:
> 
> I don't think it makes sense to cram struct ops in here for what has been
> solved already with the bpf_mprog interface in a more efficient way and with
> control dependencies for the insertion (before/after relative programs/links).
> The latter is in particular crucial for a multi-user interface when dealing
> with network traffic (think for example: policy, forwarder, observability
> prog, etc).
> 

I didn't mean to cram two implementations together, and I didn't notice
at the beginning that this patchset had already landed.
This patch is just to illustrate what it would look like if it were
implemented with just struct_ops (without bpf_mprog).
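
For the sake of the comparison only, the BPF-side counterpart of such a
struct_ops flavour might look roughly like below. To be clear, none of this
exists today: struct netkit_ops, the section names and the reuse of the
NETKIT_* return codes here simply mirror the hypothetical kernel snippet
above.

#include <vmlinux.h>
#include <bpf/bpf_helpers.h>
#include <bpf/bpf_tracing.h>

/* Hypothetical sketch: registering the xmit callback of the imaginary
 * struct netkit_ops as a link-based struct_ops map. Nothing below is an
 * existing netkit interface; it only illustrates the shape of the idea.
 */
SEC("struct_ops/netkit_ops_xmit")
int BPF_PROG(nk_xmit, struct sk_buff *skb)
{
	/* e.g. drop oversized frames, otherwise fall through to the next
	 * handler in the list (NETKIT_NEXT, as in the existing uapi)
	 */
	if (skb->len > 1514)
		return NETKIT_DROP;
	return NETKIT_NEXT;
}

SEC(".struct_ops.link")
struct netkit_ops nk_ops = {
	.ifindex = 2,			/* placeholder target device */
	.xmit	 = (void *)nk_xmit,
};

char _license[] SEC("license") = "GPL";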

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH bpf-next v4 1/7] netkit, bpf: Add bpf programmable net device
  2023-10-26 17:47             ` Kui-Feng Lee
@ 2023-10-26 18:46               ` Martin KaFai Lau
  0 siblings, 0 replies; 27+ messages in thread
From: Martin KaFai Lau @ 2023-10-26 18:46 UTC (permalink / raw)
  To: Kui-Feng Lee
  Cc: netdev, razor, ast, andrii, john.fastabend, sdf, toke, kuba,
	andrew, Toke Høiland-Jørgensen, bpf, Daniel Borkmann

On 10/26/23 10:47 AM, Kui-Feng Lee wrote:
> 
> 
> On 10/25/23 23:20, Daniel Borkmann wrote:
>> Hi Kui-Feng,
>>
>> On 10/26/23 3:18 AM, Kui-Feng Lee wrote:
>>> On 10/25/23 18:15, Kui-Feng Lee wrote:
>>>> On 10/25/23 15:09, Martin KaFai Lau wrote:
>>>>> On 10/25/23 2:24 PM, Kui-Feng Lee wrote:
>>>>>> On 10/24/23 14:48, Daniel Borkmann wrote:
>>>>>>> This work adds a new, minimal BPF-programmable device called "netkit"
>>>>>>> (former PoC code-name "meta") we recently presented at LSF/MM/BPF. The
>>>>>>> core idea is that BPF programs are executed within the drivers xmit routine
>>>>>>> and therefore e.g. in case of containers/Pods moving BPF processing closer
>>>>>>> to the source.
>>>>>>
>>>>>> Sorry for intruding into this discussion! Although it is too late to
>>>>>> mention this since this patchset is already at v4.
>>>>>>
>>>>>> I notice netkit has introduced a new attach type. I wonder if it is
>>>>>> possible to implement it as a new struct_ops type.
>>>>>
>>>>> Could you elaborate more on what this struct_ops type does and how it
>>>>> is different from the SCHED_CLS bpf prog that netkit is running?
>>>>
>>>> I found the code has already landed.
>>>> Based on the landed code and the patchset for registering bpf struct_ops
>>>> from modules that I am working on, it would look like the following patch.
>>>> No changes to syscall, uapi, or libbpf are required.
>>>>
>>>> diff --git a/drivers/net/netkit.c b/drivers/net/netkit.c
>>>> index 7e484f9fd3ae..e4eafaf397bf 100644
>>>> --- a/drivers/net/netkit.c
>>>> +++ b/drivers/net/netkit.c
>>>> @@ -20,6 +20,7 @@ struct netkit {
>>>>       struct bpf_mprog_entry __rcu *active;
>>>>       enum netkit_action policy;
>>>>       struct bpf_mprog_bundle    bundle;
>>>> +    struct hlist_head ops_list;
>>>>
>>>>       /* Needed in slow-path */
>>>>       enum netkit_mode mode;
>>>> @@ -27,6 +28,13 @@ struct netkit {
>>>>       u32 headroom;
>>>>   };
>>>>
>>>> +struct netkit_ops {
>>>> +    struct hlist_node node;
>>>> +    int ifindex;
>>>> +
>>>> +    int (*xmit)(struct sk_buff *skb);
>>>> +};
>>>> +
>>>>   struct netkit_link {
>>>>       struct bpf_link link;
>>>>       struct net_device *dev;
>>>> @@ -46,6 +54,22 @@ netkit_run(const struct bpf_mprog_entry *entry, struct sk_buff *skb,
>>>>           if (ret != NETKIT_NEXT)
>>>>               break;
>>>>       }
>>>> +
>>>> +    return ret;
>>>> +}
>>>> +
>>>> +static __always_inline int
>>>> +netkit_run_st_ops(const struct netkit *nk, struct sk_buff *skb,
>>>> +       enum netkit_action ret)
>>>> +{
>>>> +    struct netkit_ops *ops;
>>>> +
>>>> +    hlist_for_each_entry_rcu(ops, &nk->ops_list, node) {
>>>> +        ret = ops->xmit(skb);
>>>> +        if (ret != NETKIT_NEXT)
>>>> +            break;
>>>> +    }
>>>> +
>>>>       return ret;
>>>>   }
>>>>
>>>> @@ -80,6 +104,8 @@ static netdev_tx_t netkit_xmit(struct sk_buff *skb, struct net_device *dev)
>>>>       entry = rcu_dereference(nk->active);
>>>>       if (entry)
>>>>           ret = netkit_run(entry, skb, ret);
>>>> +    if (ret == NETKIT_NEXT)
>>>> +        ret = netkit_run_st_ops(nk, skb, ret);
>>>>       switch (ret) {
>>>>       case NETKIT_NEXT:
>>>>       case NETKIT_PASS:
>>
>> I don't think it makes sense to cram struct ops in here for what has been
>> solved already with the bpf_mprog interface in a more efficient way and with
>> control dependencies for the insertion (before/after relative programs/links).
>> The latter is in particular crucial for a multi-user interface when dealing
>> with network traffic (think for example: policy, forwarder, observability
>> prog, etc).
>>
> 
> I didn't mean to cram two implementations together, and I didn't notice
> at the beginning that this patchset had already landed.

There are a few ways to track this. The patchwork bot will send a landing
message to the list. There is a few minutes of lag time, but I don't think
the lag matters here. You may want to check your inbox and ensure it got
through.

git is always the source of truth as well.

> This patch is just to illustrate what it would look like if it were
> implemented with just struct_ops (without bpf_mprog).

Thanks for sharing a struct_ops code snippet. It is an interesting idea to
embed the ifindex and other details in the struct.

That said, it still needs verifier changes to make the PTR_TO_BTF_ID skb in
struct_ops work like the tc __sk_buff, such that all existing tc-bpf progs
will work as is. Daniel has already mentioned the ordering API (bpf_mprog)
that has been discussed for a year and is already used in tc-link, which I
hope will be extended to solve the xdp ordering as well. I am also not
convinced that saving two attach types (note the prog type is the same here)
deserves re-creating something in parallel to tc-link and then requiring the
same "skb" bpf dataplane program to be administered (attach/introspect, etc.)
differently.
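
To spell out the ordering semantics being referred to, a small sketch from
the libbpf side; the option struct follows bpf_netkit_opts from this series,
while the prog handles and ifindex are placeholders:

#include <errno.h>
#include <bpf/libbpf.h>

/* Sketch: attach an observability prog such that it runs before an
 * already attached policy prog on the same netkit primary device.
 * BPF_F_BEFORE plus relative_fd expresses the dependency that bpf_mprog
 * resolves at insertion time; all handles here are illustrative.
 */
static int attach_before_policy(struct bpf_program *observ_prog,
				int ifindex, int policy_prog_fd)
{
	LIBBPF_OPTS(bpf_netkit_opts, opts,
		.flags	     = BPF_F_BEFORE,
		.relative_fd = policy_prog_fd,
	);
	struct bpf_link *link;

	link = bpf_program__attach_netkit(observ_prog, ifindex, &opts);
	return link ? 0 : -errno;
}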

^ permalink raw reply	[flat|nested] 27+ messages in thread

end of thread, other threads:[~2023-10-26 18:46 UTC | newest]

Thread overview: 27+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2023-10-24 21:48 [PATCH bpf-next v4 0/7] Add bpf programmable net device Daniel Borkmann
2023-10-24 21:48 ` [PATCH bpf-next v4 1/7] netkit, bpf: " Daniel Borkmann
2023-10-25 15:47   ` Jiri Pirko
2023-10-25 17:20     ` Daniel Borkmann
2023-10-26  5:18       ` Jiri Pirko
2023-10-26 12:11         ` Daniel Borkmann
2023-10-25 19:21     ` Nikolay Aleksandrov
2023-10-26  5:26       ` Jiri Pirko
2023-10-26  6:21         ` Nikolay Aleksandrov
2023-10-25 21:24   ` Kui-Feng Lee
2023-10-25 22:09     ` Martin KaFai Lau
2023-10-26  1:15       ` Kui-Feng Lee
2023-10-26  1:18         ` Kui-Feng Lee
2023-10-26  6:20           ` Daniel Borkmann
2023-10-26 17:47             ` Kui-Feng Lee
2023-10-26 18:46               ` Martin KaFai Lau
2023-10-24 21:48 ` [PATCH bpf-next v4 2/7] tools: Sync if_link uapi header Daniel Borkmann
2023-10-24 21:49 ` [PATCH bpf-next v4 3/7] libbpf: Add link-based API for netkit Daniel Borkmann
2023-10-24 21:49 ` [PATCH bpf-next v4 4/7] bpftool: Implement link show support " Daniel Borkmann
2023-10-24 21:49 ` [PATCH bpf-next v4 5/7] bpftool: Extend net dump with netkit progs Daniel Borkmann
2023-10-24 21:49 ` [PATCH bpf-next v4 6/7] selftests/bpf: Add netlink helper library Daniel Borkmann
2023-10-24 21:49 ` [PATCH bpf-next v4 7/7] selftests/bpf: Add selftests for netkit Daniel Borkmann
2023-10-24 22:45 ` [PATCH bpf-next v4 0/7] Add bpf programmable net device Martin KaFai Lau
2023-10-24 23:50 ` patchwork-bot+netdevbpf
2023-10-25 15:50   ` Jiri Pirko
2023-10-25 16:54     ` Martin KaFai Lau
2023-10-26  5:35       ` Jiri Pirko
