* [PATCH bpf-next 0/8] Add bpf programmable device
@ 2023-09-26  5:59 Daniel Borkmann
  2023-09-26  5:59 ` [PATCH bpf-next 1/8] meta, bpf: Add bpf programmable meta device Daniel Borkmann
                   ` (7 more replies)
  0 siblings, 8 replies; 22+ messages in thread
From: Daniel Borkmann @ 2023-09-26  5:59 UTC (permalink / raw)
  To: bpf; +Cc: netdev, martin.lau, razor, ast, andrii, john.fastabend,
	Daniel Borkmann

This work adds a BPF-programmable device which can operate in L3 or L2
mode, with the BPF program being run as part of the device's xmit routine.
Its program management is done via bpf_mprog, and BPF link support is
included. For details see patch 1 and the following patches. Thanks!

Daniel Borkmann (8):
  meta, bpf: Add bpf programmable meta device
  meta, bpf: Add bpf link support for meta device
  tools: Sync if_link uapi header
  libbpf: Add link-based API for meta
  bpftool: Implement link show support for meta
  bpftool: Extend net dump with meta progs
  selftests/bpf: Add netlink helper library
  selftests/bpf: Add selftests for meta

 MAINTAINERS                                   |   9 +
 drivers/net/Kconfig                           |   9 +
 drivers/net/Makefile                          |   1 +
 drivers/net/meta.c                            | 943 ++++++++++++++++++
 include/linux/netdevice.h                     |   2 +
 include/net/meta.h                            |  38 +
 include/uapi/linux/bpf.h                      |  13 +
 include/uapi/linux/if_link.h                  |  25 +
 kernel/bpf/syscall.c                          |  30 +-
 .../bpf/bpftool/Documentation/bpftool-net.rst |   8 +-
 tools/bpf/bpftool/link.c                      |   7 +
 tools/bpf/bpftool/net.c                       |   7 +-
 tools/include/uapi/linux/bpf.h                |  13 +
 tools/include/uapi/linux/if_link.h            | 142 +++
 tools/lib/bpf/bpf.c                           |  16 +
 tools/lib/bpf/bpf.h                           |   5 +
 tools/lib/bpf/libbpf.c                        |  61 +-
 tools/lib/bpf/libbpf.h                        |  15 +
 tools/lib/bpf/libbpf.map                      |   1 +
 tools/testing/selftests/bpf/Makefile          |  19 +-
 tools/testing/selftests/bpf/config            |   1 +
 tools/testing/selftests/bpf/netlink_helpers.c | 358 +++++++
 tools/testing/selftests/bpf/netlink_helpers.h |  46 +
 .../selftests/bpf/prog_tests/tc_helpers.h     |   4 +
 .../selftests/bpf/prog_tests/tc_meta.c        | 650 ++++++++++++
 .../selftests/bpf/progs/test_tc_link.c        |  13 +
 26 files changed, 2415 insertions(+), 21 deletions(-)
 create mode 100644 drivers/net/meta.c
 create mode 100644 include/net/meta.h
 create mode 100644 tools/testing/selftests/bpf/netlink_helpers.c
 create mode 100644 tools/testing/selftests/bpf/netlink_helpers.h
 create mode 100644 tools/testing/selftests/bpf/prog_tests/tc_meta.c

-- 
2.34.1



* [PATCH bpf-next 1/8] meta, bpf: Add bpf programmable meta device
  2023-09-26  5:59 [PATCH bpf-next 0/8] Add bpf programmable device Daniel Borkmann
@ 2023-09-26  5:59 ` Daniel Borkmann
  2023-09-26 21:26   ` Stanislav Fomichev
                     ` (2 more replies)
  2023-09-26  5:59 ` [PATCH bpf-next 2/8] meta, bpf: Add bpf link support for " Daniel Borkmann
                   ` (6 subsequent siblings)
  7 siblings, 3 replies; 22+ messages in thread
From: Daniel Borkmann @ 2023-09-26  5:59 UTC (permalink / raw)
  To: bpf; +Cc: netdev, martin.lau, razor, ast, andrii, john.fastabend,
	Daniel Borkmann

This work adds a new, minimal BPF-programmable device called "meta" which
we recently presented at LSF/MM/BPF. The name derives from the Greek μετά,
which encompasses a wide array of meanings such as "on top of" or "beyond".
Given the business logic is defined by BPF, this device can take on many
meanings. The core idea is that BPF programs are executed within the
driver's xmit routine which, for example in the case of containers/Pods,
moves BPF processing closer to the source.

One of the goals was that, for Pod egress traffic, BPF programs can be
moved from hostns tcx ingress into the device itself, providing earlier
drop or forward mechanisms. For example, if the BPF program determines
that the skb must be sent out of the node, a redirect to the physical
device can take place directly without going through the per-CPU backlog
queue. This helps to shift processing for such traffic from softirq to
process context, leading to better scheduling decisions and better
performance.
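
As an illustrative sketch only (not part of this patch; the ifindex and
the actual forwarding decision are placeholders), such a program attached
to the meta device could hand off node-external traffic as follows:

  #include <linux/bpf.h>
  #include <bpf/bpf_helpers.h>

  #define PHYS_IFINDEX 2 /* placeholder, e.g. the node's uplink device */

  SEC("tc")
  int meta_egress(struct __sk_buff *skb)
  {
  	/* ... decide whether skb must be sent out of the node ... */

  	/* On success, bpf_redirect_neigh() returns TC_ACT_REDIRECT, which
  	 * the meta xmit path treats as META_REDIRECT and hands off via
  	 * skb_do_redirect().
  	 */
  	return bpf_redirect_neigh(PHYS_IFINDEX, NULL, 0, 0);
  }

  char __license[] SEC("license") = "GPL";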

In this initial version, the meta device ships as a pair, but we plan to
extend this further so that it can also operate in single device mode. The
pair comes with a primary and a peer device. Only the primary device,
typically residing in hostns, can manage BPF programs for itself and its
peer. The peer device is designated for containers/Pods and cannot
attach/detach BPF programs. Upon device creation, the user can set the
default policy to 'pass' (forward) or 'drop' for the case when no BPF
program is attached.

Additionally, the device can be operated in L3 (default) or L2 mode. The
management of BPF programs is done via bpf_mprog, so that multi-attach is
supported right from the beginning, with the same API/dependency controls
as tcx. For details on the latter, see commit 053c8e1f235d ("bpf: Add
generic attach/detach/query API for multi-progs"). tc BPF compatibility is
provided, so that existing programs can be migrated easily.
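
Purely as an illustration of that compatibility (not part of this patch;
the IPv4/IPv6-only policy is just an example), an existing SEC("tc")
program such as the following sketch can be attached to a meta device
unchanged, since the verdicts line up (TC_ACT_OK == META_PASS,
TC_ACT_SHOT == META_DROP):

  #include <linux/bpf.h>
  #include <linux/if_ether.h>
  #include <linux/pkt_cls.h>
  #include <bpf/bpf_helpers.h>
  #include <bpf/bpf_endian.h>

  SEC("tc")
  int meta_filter(struct __sk_buff *skb)
  {
  	/* Example policy only: let IPv4/IPv6 pass, drop everything else. */
  	if (skb->protocol == bpf_htons(ETH_P_IP) ||
  	    skb->protocol == bpf_htons(ETH_P_IPV6))
  		return TC_ACT_OK;	/* maps to META_PASS */
  	return TC_ACT_SHOT;		/* maps to META_DROP */
  }

  char __license[] SEC("license") = "GPL";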

Going forward, we plan to use meta devices in Cilium as the main device
type for connecting Pods. They will be operated in L3 mode in order to
simplify a Pod's neighbor management, and the peer will operate with a
default drop policy, so that no traffic can leave between the time a Pod
is brought up by the CNI plugin and the time the agent attaches its
programs. Additionally, the programs we attach via tcx on the physical
devices use bpf_redirect_peer() for inbound traffic into the meta device,
which is why the latter also supports the ndo_get_peer_dev callback.
Similarly, we use bpf_redirect_neigh() on the way out, pushing directly
to the physical device. BIG TCP is also supported on the meta device. As
follow-up work for single device mode, we plan to convert Cilium's
cilium_host/_net devices into a single meta device.

An extensive test suite covering device operations as well as the BPF
program and link management APIs is included as BPF selftests in this
series.

Co-developed-by: Nikolay Aleksandrov <razor@blackwall.org>
Signed-off-by: Nikolay Aleksandrov <razor@blackwall.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://github.com/borkmann/iproute2/commits/pr/meta
Link: http://vger.kernel.org/bpfconf2023_material/tcx_meta_netdev_borkmann.pdf (24ff.)
---
 MAINTAINERS                    |   9 +
 drivers/net/Kconfig            |   9 +
 drivers/net/Makefile           |   1 +
 drivers/net/meta.c             | 734 +++++++++++++++++++++++++++++++++
 include/linux/netdevice.h      |   2 +
 include/net/meta.h             |  31 ++
 include/uapi/linux/bpf.h       |   2 +
 include/uapi/linux/if_link.h   |  25 ++
 kernel/bpf/syscall.c           |  30 +-
 tools/include/uapi/linux/bpf.h |   2 +
 10 files changed, 840 insertions(+), 5 deletions(-)
 create mode 100644 drivers/net/meta.c
 create mode 100644 include/net/meta.h

diff --git a/MAINTAINERS b/MAINTAINERS
index 8985a1b0b5ee..ec3edd4caa56 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -3774,6 +3774,15 @@ L:	bpf@vger.kernel.org
 S:	Maintained
 F:	tools/lib/bpf/
 
+BPF [META]
+M:	Daniel Borkmann <daniel@iogearbox.net>
+M:	Nikolay Aleksandrov <razor@blackwall.org>
+L:	bpf@vger.kernel.org
+L:	netdev@vger.kernel.org
+S:	Supported
+F:	drivers/net/meta.c
+F:	include/net/meta.h
+
 BPF [MISC]
 L:	bpf@vger.kernel.org
 S:	Odd Fixes
diff --git a/drivers/net/Kconfig b/drivers/net/Kconfig
index 44eeb5d61ba9..9959cdd50b0b 100644
--- a/drivers/net/Kconfig
+++ b/drivers/net/Kconfig
@@ -448,6 +448,15 @@ config NLMON
 	  diagnostics, etc. This is mostly intended for developers or support
 	  to debug netlink issues. If unsure, say N.
 
+config META
+	bool "BPF-programmable meta device"
+	depends on BPF_SYSCALL
+	help
+	  Virtual meta devices can be created in pairs and used to connect
+	  two network namespaces. A BPF program can be attached to the device(s);
+	  it is then executed on transmission to implement the driver-internal
+	  logic.
+
 config NET_VRF
 	tristate "Virtual Routing and Forwarding (Lite)"
 	depends on IP_MULTIPLE_TABLES
diff --git a/drivers/net/Makefile b/drivers/net/Makefile
index e26f98f897c5..18eabeb78ece 100644
--- a/drivers/net/Makefile
+++ b/drivers/net/Makefile
@@ -22,6 +22,7 @@ obj-$(CONFIG_MDIO) += mdio.o
 obj-$(CONFIG_NET) += loopback.o
 obj-$(CONFIG_NETDEV_LEGACY_INIT) += Space.o
 obj-$(CONFIG_NETCONSOLE) += netconsole.o
+obj-$(CONFIG_META) += meta.o
 obj-y += phy/
 obj-y += pse-pd/
 obj-y += mdio/
diff --git a/drivers/net/meta.c b/drivers/net/meta.c
new file mode 100644
index 000000000000..e464f547b0a6
--- /dev/null
+++ b/drivers/net/meta.c
@@ -0,0 +1,734 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/* Copyright (c) 2023 Isovalent */
+
+#include <linux/netdevice.h>
+#include <linux/ethtool.h>
+#include <linux/etherdevice.h>
+#include <linux/filter.h>
+#include <linux/netfilter_netdev.h>
+#include <linux/bpf_mprog.h>
+
+#include <net/meta.h>
+#include <net/dst.h>
+#include <net/tcx.h>
+
+#define DRV_NAME	"meta"
+#define DRV_VERSION	"1.0"
+
+struct meta {
+	/* Needed in fast-path */
+	struct net_device __rcu *peer;
+	struct bpf_mprog_entry __rcu *active;
+	enum meta_action policy;
+	struct bpf_mprog_bundle	bundle;
+	/* Needed in slow-path */
+	enum meta_mode mode;
+	bool primary;
+	u32 headroom;
+};
+
+static void meta_scrub_minimum(struct sk_buff *skb)
+{
+	skb->skb_iif = 0;
+	skb->ignore_df = 0;
+	skb->priority = 0;
+	skb_dst_drop(skb);
+	skb_ext_reset(skb);
+	nf_reset_ct(skb);
+	nf_reset_trace(skb);
+	nf_skip_egress(skb, true);
+	ipvs_reset(skb);
+}
+
+static __always_inline int
+meta_run(const struct meta *meta, const struct bpf_mprog_entry *entry,
+	 struct sk_buff *skb, enum meta_action ret)
+{
+	const struct bpf_mprog_fp *fp;
+	const struct bpf_prog *prog;
+
+	bpf_mprog_foreach_prog(entry, fp, prog) {
+		bpf_compute_data_pointers(skb);
+		ret = bpf_prog_run(prog, skb);
+		if (ret != META_NEXT)
+			break;
+	}
+	return ret;
+}
+
+static netdev_tx_t meta_xmit(struct sk_buff *skb, struct net_device *dev)
+{
+	struct meta *meta = netdev_priv(dev);
+	enum meta_action ret = READ_ONCE(meta->policy);
+	netdev_tx_t ret_dev = NET_XMIT_SUCCESS;
+	const struct bpf_mprog_entry *entry;
+	struct net_device *peer;
+
+	rcu_read_lock();
+	peer = rcu_dereference(meta->peer);
+	if (unlikely(!peer || !(peer->flags & IFF_UP) ||
+		     !pskb_may_pull(skb, ETH_HLEN) ||
+		     skb_orphan_frags(skb, GFP_ATOMIC)))
+		goto drop;
+	meta_scrub_minimum(skb);
+	skb->dev = peer;
+	entry = rcu_dereference(meta->active);
+	if (entry)
+		ret = meta_run(meta, entry, skb, ret);
+	switch (ret) {
+	case META_NEXT:
+	case META_PASS:
+		skb->pkt_type = PACKET_HOST;
+		skb->protocol = eth_type_trans(skb, skb->dev);
+		skb_postpull_rcsum(skb, eth_hdr(skb), ETH_HLEN);
+		__netif_rx(skb);
+		break;
+	case META_REDIRECT:
+		skb_do_redirect(skb);
+		break;
+	case META_DROP:
+	default:
+drop:
+		ret_dev = NET_XMIT_DROP;
+		dev_core_stats_tx_dropped_inc(dev);
+		kfree_skb(skb);
+		break;
+	}
+	rcu_read_unlock();
+	return ret_dev;
+}
+
+static int meta_open(struct net_device *dev)
+{
+	struct meta *meta = netdev_priv(dev);
+	struct net_device *peer = rtnl_dereference(meta->peer);
+
+	if (!peer)
+		return -ENOTCONN;
+	if (peer->flags & IFF_UP) {
+		netif_carrier_on(dev);
+		netif_carrier_on(peer);
+	}
+	return 0;
+}
+
+static int meta_close(struct net_device *dev)
+{
+	struct meta *meta = netdev_priv(dev);
+	struct net_device *peer = rtnl_dereference(meta->peer);
+
+	netif_carrier_off(dev);
+	if (peer)
+		netif_carrier_off(peer);
+	return 0;
+}
+
+static int meta_get_iflink(const struct net_device *dev)
+{
+	struct meta *meta = netdev_priv(dev);
+	struct net_device *peer;
+	int iflink = 0;
+
+	rcu_read_lock();
+	peer = rcu_dereference(meta->peer);
+	if (peer)
+		iflink = peer->ifindex;
+	rcu_read_unlock();
+	return iflink;
+}
+
+static void meta_set_multicast_list(struct net_device *dev)
+{
+}
+
+static void meta_set_headroom(struct net_device *dev, int headroom)
+{
+	struct meta *meta = netdev_priv(dev), *meta2;
+	struct net_device *peer;
+
+	if (headroom < 0)
+		headroom = NET_SKB_PAD;
+
+	rcu_read_lock();
+	peer = rcu_dereference(meta->peer);
+	if (unlikely(!peer))
+		goto out;
+
+	meta2 = netdev_priv(peer);
+	meta->headroom = headroom;
+	headroom = max(meta->headroom, meta2->headroom);
+
+	peer->needed_headroom = headroom;
+	dev->needed_headroom = headroom;
+out:
+	rcu_read_unlock();
+}
+
+static struct net_device *meta_peer_dev(struct net_device *dev)
+{
+	struct meta *meta = netdev_priv(dev);
+
+	return rcu_dereference(meta->peer);
+}
+
+static struct net_device *meta_peer_dev_rtnl(struct net_device *dev)
+{
+	struct meta *meta = netdev_priv(dev);
+
+	return rcu_dereference_rtnl(meta->peer);
+}
+
+static const struct net_device_ops meta_netdev_ops = {
+	.ndo_open		= meta_open,
+	.ndo_stop		= meta_close,
+	.ndo_start_xmit		= meta_xmit,
+	.ndo_set_rx_mode	= meta_set_multicast_list,
+	.ndo_set_rx_headroom	= meta_set_headroom,
+	.ndo_get_iflink		= meta_get_iflink,
+	.ndo_get_peer_dev	= meta_peer_dev,
+	.ndo_features_check	= passthru_features_check,
+};
+
+static void meta_get_drvinfo(struct net_device *dev,
+			     struct ethtool_drvinfo *info)
+{
+	strscpy(info->driver, DRV_NAME, sizeof(info->driver));
+	strscpy(info->version, DRV_VERSION, sizeof(info->version));
+}
+
+static const struct ethtool_ops meta_ethtool_ops = {
+	.get_drvinfo		= meta_get_drvinfo,
+};
+
+static void meta_setup(struct net_device *dev)
+{
+	static const netdev_features_t meta_features_hw_vlan =
+		NETIF_F_HW_VLAN_CTAG_TX |
+		NETIF_F_HW_VLAN_CTAG_RX |
+		NETIF_F_HW_VLAN_STAG_TX |
+		NETIF_F_HW_VLAN_STAG_RX;
+	static const netdev_features_t meta_features =
+		meta_features_hw_vlan |
+		NETIF_F_SG |
+		NETIF_F_FRAGLIST |
+		NETIF_F_HW_CSUM |
+		NETIF_F_RXCSUM |
+		NETIF_F_SCTP_CRC |
+		NETIF_F_HIGHDMA |
+		NETIF_F_GSO_SOFTWARE |
+		NETIF_F_GSO_ENCAP_ALL;
+
+	ether_setup(dev);
+	dev->min_mtu = ETH_MIN_MTU;
+	dev->max_mtu = ETH_MAX_MTU;
+
+	dev->flags |= IFF_NOARP;
+	dev->priv_flags &= ~IFF_TX_SKB_SHARING;
+	dev->priv_flags |= IFF_LIVE_ADDR_CHANGE;
+	dev->priv_flags |= IFF_PHONY_HEADROOM;
+	dev->priv_flags |= IFF_NO_QUEUE;
+	dev->priv_flags |= IFF_META;
+
+	dev->ethtool_ops = &meta_ethtool_ops;
+	dev->netdev_ops  = &meta_netdev_ops;
+
+	dev->features |= meta_features | NETIF_F_LLTX;
+	dev->hw_features = meta_features;
+	dev->hw_enc_features = meta_features;
+	dev->mpls_features = NETIF_F_HW_CSUM | NETIF_F_GSO_SOFTWARE;
+	dev->vlan_features = dev->features & ~meta_features_hw_vlan;
+
+	dev->needs_free_netdev = true;
+
+	netif_set_tso_max_size(dev, GSO_MAX_SIZE);
+}
+
+static struct net *meta_get_link_net(const struct net_device *dev)
+{
+	struct meta *meta = netdev_priv(dev);
+	struct net_device *peer = rtnl_dereference(meta->peer);
+
+	return peer ? dev_net(peer) : dev_net(dev);
+}
+
+static int meta_check_policy(int policy, struct nlattr *tb,
+			     struct netlink_ext_ack *extack)
+{
+	switch (policy) {
+	case META_PASS:
+	case META_DROP:
+		return 0;
+	default:
+		NL_SET_ERR_MSG_ATTR(extack, tb,
+				    "Provided default xmit policy not supported");
+		return -EINVAL;
+	}
+}
+
+static int meta_check_mode(int mode, struct nlattr *tb,
+			   struct netlink_ext_ack *extack)
+{
+	switch (mode) {
+	case META_L2:
+	case META_L3:
+		return 0;
+	default:
+		NL_SET_ERR_MSG_ATTR(extack, tb,
+				    "Provided device mode can only be L2 or L3");
+		return -EINVAL;
+	}
+}
+
+static int meta_validate(struct nlattr *tb[], struct nlattr *data[],
+			 struct netlink_ext_ack *extack)
+{
+	struct nlattr *attr = tb[IFLA_ADDRESS];
+
+	if (!attr)
+		return 0;
+	NL_SET_ERR_MSG_ATTR(extack, attr,
+			    "Setting Ethernet address is not supported");
+	return -EOPNOTSUPP;
+}
+
+static struct rtnl_link_ops meta_link_ops;
+
+static int meta_new_link(struct net *src_net, struct net_device *dev,
+			 struct nlattr *tb[], struct nlattr *data[],
+			 struct netlink_ext_ack *extack)
+{
+	struct nlattr *peer_tb[IFLA_MAX + 1], **tbp = tb, *attr;
+	enum meta_action default_prim = META_PASS;
+	enum meta_action default_peer = META_PASS;
+	unsigned char name_assign_type;
+	enum meta_mode mode = META_L3;
+	struct ifinfomsg *ifmp = NULL;
+	struct net_device *peer;
+	char ifname[IFNAMSIZ];
+	struct meta *meta;
+	struct net *net;
+	int err;
+
+	if (data) {
+		if (data[IFLA_META_MODE]) {
+			attr = data[IFLA_META_MODE];
+			mode = nla_get_u32(attr);
+			err = meta_check_mode(mode, attr, extack);
+			if (err < 0)
+				return err;
+		}
+		if (data[IFLA_META_PEER_INFO]) {
+			attr = data[IFLA_META_PEER_INFO];
+			ifmp = nla_data(attr);
+			err = rtnl_nla_parse_ifinfomsg(peer_tb, attr, extack);
+			if (err < 0)
+				return err;
+			err = meta_validate(peer_tb, NULL, extack);
+			if (err < 0)
+				return err;
+			tbp = peer_tb;
+		}
+		if (data[IFLA_META_POLICY]) {
+			attr = data[IFLA_META_POLICY];
+			default_prim = nla_get_u32(attr);
+			err = meta_check_policy(default_prim, attr, extack);
+			if (err < 0)
+				return err;
+		}
+		if (data[IFLA_META_PEER_POLICY]) {
+			attr = data[IFLA_META_PEER_POLICY];
+			default_peer = nla_get_u32(attr);
+			err = meta_check_policy(default_peer, attr, extack);
+			if (err < 0)
+				return err;
+		}
+	}
+
+	if (ifmp && tbp[IFLA_IFNAME]) {
+		nla_strscpy(ifname, tbp[IFLA_IFNAME], IFNAMSIZ);
+		name_assign_type = NET_NAME_USER;
+	} else {
+		snprintf(ifname, IFNAMSIZ, "m%%d");
+		name_assign_type = NET_NAME_ENUM;
+	}
+
+	net = rtnl_link_get_net(src_net, tbp);
+	if (IS_ERR(net))
+		return PTR_ERR(net);
+
+	peer = rtnl_create_link(net, ifname, name_assign_type,
+				&meta_link_ops, tbp, extack);
+	if (IS_ERR(peer)) {
+		put_net(net);
+		return PTR_ERR(peer);
+	}
+
+	if (mode == META_L2)
+		eth_hw_addr_random(peer);
+	if (ifmp && dev->ifindex)
+		peer->ifindex = ifmp->ifi_index;
+
+	netif_inherit_tso_max(peer, dev);
+
+	err = register_netdevice(peer);
+	put_net(net);
+	if (err < 0)
+		goto err_register_peer;
+
+	netif_carrier_off(peer);
+
+	err = rtnl_configure_link(peer, ifmp, 0, NULL);
+	if (err < 0)
+		goto err_configure_peer;
+
+	if (mode == META_L2)
+		eth_hw_addr_random(dev);
+	if (tb[IFLA_IFNAME])
+		nla_strscpy(dev->name, tb[IFLA_IFNAME], IFNAMSIZ);
+	else
+		snprintf(dev->name, IFNAMSIZ, "m%%d");
+
+	err = register_netdevice(dev);
+	if (err < 0)
+		goto err_configure_peer;
+
+	netif_carrier_off(dev);
+
+	meta = netdev_priv(dev);
+	meta->primary = true;
+	meta->policy = default_prim;
+	meta->mode = mode;
+	if (meta->mode == META_L2)
+		dev_change_flags(dev, dev->flags & ~IFF_NOARP, NULL);
+	bpf_mprog_bundle_init(&meta->bundle);
+	RCU_INIT_POINTER(meta->active, NULL);
+	rcu_assign_pointer(meta->peer, peer);
+
+	meta = netdev_priv(peer);
+	meta->primary = false;
+	meta->policy = default_peer;
+	meta->mode = mode;
+	if (meta->mode == META_L2)
+		dev_change_flags(peer, peer->flags & ~IFF_NOARP, NULL);
+	bpf_mprog_bundle_init(&meta->bundle);
+	RCU_INIT_POINTER(meta->active, NULL);
+	rcu_assign_pointer(meta->peer, dev);
+	return 0;
+err_configure_peer:
+	unregister_netdevice(peer);
+	return err;
+err_register_peer:
+	free_netdev(peer);
+	return err;
+}
+
+static struct bpf_mprog_entry *meta_entry_fetch(struct net_device *dev,
+						bool bundle_fallback)
+{
+	struct meta *meta = netdev_priv(dev);
+	struct bpf_mprog_entry *entry;
+
+	ASSERT_RTNL();
+	entry = rcu_dereference_rtnl(meta->active);
+	if (entry)
+		return entry;
+	if (bundle_fallback)
+		return &meta->bundle.a;
+	return NULL;
+}
+
+static void meta_entry_update(struct net_device *dev, struct bpf_mprog_entry *entry)
+{
+	struct meta *meta = netdev_priv(dev);
+
+	ASSERT_RTNL();
+	rcu_assign_pointer(meta->active, entry);
+}
+
+static void meta_entry_sync(void)
+{
+	synchronize_rcu();
+}
+
+static struct net_device *meta_dev_fetch(struct net *net, u32 ifindex, u32 which)
+{
+	struct net_device *dev;
+	struct meta *meta;
+
+	ASSERT_RTNL();
+
+	switch (which) {
+	case BPF_META_PRIMARY:
+	case BPF_META_PEER:
+		break;
+	default:
+		return ERR_PTR(-EINVAL);
+	}
+
+	dev = __dev_get_by_index(net, ifindex);
+	if (!dev)
+		return ERR_PTR(-ENODEV);
+	if (!(dev->priv_flags & IFF_META))
+		return ERR_PTR(-ENXIO);
+
+	meta = netdev_priv(dev);
+	if (!meta->primary)
+		return ERR_PTR(-EACCES);
+	if (which == BPF_META_PRIMARY)
+		return dev;
+	return meta_peer_dev_rtnl(dev);
+}
+
+int meta_prog_attach(const union bpf_attr *attr, struct bpf_prog *prog)
+{
+	struct bpf_mprog_entry *entry, *entry_new;
+	struct bpf_prog *replace_prog = NULL;
+	struct net_device *dev;
+	int ret;
+
+	rtnl_lock();
+	dev = meta_dev_fetch(current->nsproxy->net_ns, attr->target_ifindex,
+			     attr->attach_type);
+	if (IS_ERR(dev)) {
+		ret = PTR_ERR(dev);
+		goto out;
+	}
+	entry = meta_entry_fetch(dev, true);
+	if (attr->attach_flags & BPF_F_REPLACE) {
+		replace_prog = bpf_prog_get_type(attr->replace_bpf_fd,
+						 prog->type);
+		if (IS_ERR(replace_prog)) {
+			ret = PTR_ERR(replace_prog);
+			replace_prog = NULL;
+			goto out;
+		}
+	}
+	ret = bpf_mprog_attach(entry, &entry_new, prog, NULL, replace_prog,
+			       attr->attach_flags, attr->relative_fd,
+			       attr->expected_revision);
+	if (!ret) {
+		if (entry != entry_new) {
+			meta_entry_update(dev, entry_new);
+			meta_entry_sync();
+		}
+		bpf_mprog_commit(entry);
+	}
+out:
+	if (replace_prog)
+		bpf_prog_put(replace_prog);
+	rtnl_unlock();
+	return ret;
+}
+
+int meta_prog_detach(const union bpf_attr *attr, struct bpf_prog *prog)
+{
+	struct bpf_mprog_entry *entry, *entry_new;
+	struct net_device *dev;
+	int ret;
+
+	rtnl_lock();
+	dev = meta_dev_fetch(current->nsproxy->net_ns, attr->target_ifindex,
+			     attr->attach_type);
+	if (IS_ERR(dev)) {
+		ret = PTR_ERR(dev);
+		goto out;
+	}
+	entry = meta_entry_fetch(dev, false);
+	if (!entry) {
+		ret = -ENOENT;
+		goto out;
+	}
+	ret = bpf_mprog_detach(entry, &entry_new, prog, NULL, attr->attach_flags,
+			       attr->relative_fd, attr->expected_revision);
+	if (!ret) {
+		if (!bpf_mprog_total(entry_new))
+			entry_new = NULL;
+		meta_entry_update(dev, entry_new);
+		meta_entry_sync();
+		bpf_mprog_commit(entry);
+	}
+out:
+	rtnl_unlock();
+	return ret;
+}
+
+int meta_prog_query(const union bpf_attr *attr, union bpf_attr __user *uattr)
+{
+	struct bpf_mprog_entry *entry;
+	struct net_device *dev;
+	int ret;
+
+	rtnl_lock();
+	dev = meta_dev_fetch(current->nsproxy->net_ns, attr->query.target_ifindex,
+			     attr->query.attach_type);
+	if (IS_ERR(dev)) {
+		ret = PTR_ERR(dev);
+		goto out;
+	}
+	entry = meta_entry_fetch(dev, false);
+	if (!entry) {
+		ret = -ENOENT;
+		goto out;
+	}
+	ret = bpf_mprog_query(attr, uattr, entry);
+out:
+	rtnl_unlock();
+	return ret;
+}
+
+static void meta_release_all(struct net_device *dev)
+{
+	struct bpf_mprog_entry *entry;
+	struct bpf_tuple tuple = {};
+	struct bpf_mprog_fp *fp;
+	struct bpf_mprog_cp *cp;
+
+	entry = meta_entry_fetch(dev, false);
+	if (!entry)
+		return;
+	meta_entry_update(dev, NULL);
+	meta_entry_sync();
+	bpf_mprog_foreach_tuple(entry, fp, cp, tuple) {
+		bpf_prog_put(tuple.prog);
+	}
+}
+
+static void meta_del_link(struct net_device *dev, struct list_head *head)
+{
+	struct meta *meta = netdev_priv(dev);
+	struct net_device *peer = rtnl_dereference(meta->peer);
+
+	RCU_INIT_POINTER(meta->peer, NULL);
+	meta_release_all(dev);
+	unregister_netdevice_queue(dev, head);
+	if (peer) {
+		meta = netdev_priv(peer);
+		RCU_INIT_POINTER(meta->peer, NULL);
+		meta_release_all(peer);
+		unregister_netdevice_queue(peer, head);
+	}
+}
+
+static int meta_change_link(struct net_device *dev, struct nlattr *tb[],
+			    struct nlattr *data[],
+			    struct netlink_ext_ack *extack)
+{
+	struct meta *meta = netdev_priv(dev);
+	struct net_device *peer = rtnl_dereference(meta->peer);
+	enum meta_action policy;
+	struct nlattr *attr;
+	int err;
+
+	if (!meta->primary) {
+		NL_SET_ERR_MSG(extack,
+			       "Meta settings can be changed only through the primary device");
+		return -EACCES;
+	}
+
+	if (data[IFLA_META_MODE]) {
+		NL_SET_ERR_MSG_ATTR(extack, data[IFLA_META_MODE],
+				    "Meta operating mode cannot be changed after device creation");
+		return -EACCES;
+	}
+
+	if (data[IFLA_META_POLICY]) {
+		attr = data[IFLA_META_POLICY];
+		policy = nla_get_u32(attr);
+		err = meta_check_policy(policy, attr, extack);
+		if (err)
+			return err;
+		WRITE_ONCE(meta->policy, policy);
+	}
+
+	if (data[IFLA_META_PEER_POLICY]) {
+		err = -EOPNOTSUPP;
+		attr = data[IFLA_META_PEER_POLICY];
+		policy = nla_get_u32(attr);
+		if (peer)
+			err = meta_check_policy(policy, attr, extack);
+		if (err)
+			return err;
+		meta = netdev_priv(peer);
+		WRITE_ONCE(meta->policy, policy);
+	}
+
+	return 0;
+}
+
+static size_t meta_get_size(const struct net_device *dev)
+{
+	return nla_total_size(sizeof(u32)) + /* IFLA_META_POLICY */
+	       nla_total_size(sizeof(u32)) + /* IFLA_META_PEER_POLICY */
+	       nla_total_size(sizeof(u8))  + /* IFLA_META_PRIMARY */
+	       nla_total_size(sizeof(u32)) + /* IFLA_META_MODE */
+	       0;
+}
+
+static int meta_fill_info(struct sk_buff *skb, const struct net_device *dev)
+{
+	struct meta *meta = netdev_priv(dev);
+	struct net_device *peer = rtnl_dereference(meta->peer);
+
+	if (nla_put_u8(skb, IFLA_META_PRIMARY, meta->primary))
+		return -EMSGSIZE;
+	if (nla_put_u32(skb, IFLA_META_POLICY, meta->policy))
+		return -EMSGSIZE;
+	if (nla_put_u32(skb, IFLA_META_MODE, meta->mode))
+		return -EMSGSIZE;
+
+	if (peer) {
+		meta = netdev_priv(peer);
+		if (nla_put_u32(skb, IFLA_META_PEER_POLICY, meta->policy))
+			return -EMSGSIZE;
+	}
+
+	return 0;
+}
+
+static const struct nla_policy meta_policy[IFLA_META_MAX + 1] = {
+	[IFLA_META_PEER_INFO]	= { .len = sizeof(struct ifinfomsg) },
+	[IFLA_META_POLICY]	= { .type = NLA_U32 },
+	[IFLA_META_MODE]	= { .type = NLA_U32 },
+	[IFLA_META_PEER_POLICY]	= { .type = NLA_U32 },
+	[IFLA_META_PRIMARY]	= { .type = NLA_REJECT,
+				    .reject_message = "Primary attribute is read-only" },
+};
+
+static struct rtnl_link_ops meta_link_ops = {
+	.kind		= DRV_NAME,
+	.priv_size	= sizeof(struct meta),
+	.setup		= meta_setup,
+	.newlink	= meta_new_link,
+	.dellink	= meta_del_link,
+	.changelink	= meta_change_link,
+	.get_link_net	= meta_get_link_net,
+	.get_size	= meta_get_size,
+	.fill_info	= meta_fill_info,
+	.policy		= meta_policy,
+	.validate	= meta_validate,
+	.maxtype	= IFLA_META_MAX,
+};
+
+static __init int meta_init(void)
+{
+	BUILD_BUG_ON((int)META_NEXT != (int)TCX_NEXT ||
+		     (int)META_PASS != (int)TCX_PASS ||
+		     (int)META_DROP != (int)TCX_DROP ||
+		     (int)META_REDIRECT != (int)TCX_REDIRECT);
+
+	return rtnl_link_register(&meta_link_ops);
+}
+
+static __exit void meta_exit(void)
+{
+	rtnl_link_unregister(&meta_link_ops);
+}
+
+module_init(meta_init);
+module_exit(meta_exit);
+
+MODULE_DESCRIPTION("BPF-programmable meta device");
+MODULE_AUTHOR("Daniel Borkmann <daniel@iogearbox.net>");
+MODULE_AUTHOR("Nikolay Aleksandrov <razor@blackwall.org>");
+MODULE_LICENSE("GPL");
+MODULE_ALIAS_RTNL_LINK(DRV_NAME);
diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index 7e520c14eb8c..af0f23ed8d51 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -1701,6 +1701,7 @@ struct net_device_ops {
  * @IFF_SEE_ALL_HWTSTAMP_REQUESTS: device wants to see calls to
  *	ndo_hwtstamp_set() for all timestamp requests regardless of source,
  *	even if those aren't HWTSTAMP_SOURCE_NETDEV.
+ * @IFF_META: device is a meta device
  */
 enum netdev_priv_flags {
 	IFF_802_1Q_VLAN			= 1<<0,
@@ -1737,6 +1738,7 @@ enum netdev_priv_flags {
 	IFF_TX_SKB_NO_LINEAR		= BIT_ULL(31),
 	IFF_CHANGE_PROTO_DOWN		= BIT_ULL(32),
 	IFF_SEE_ALL_HWTSTAMP_REQUESTS	= BIT_ULL(33),
+	IFF_META			= BIT_ULL(34),
 };
 
 #define IFF_802_1Q_VLAN			IFF_802_1Q_VLAN
diff --git a/include/net/meta.h b/include/net/meta.h
new file mode 100644
index 000000000000..20fc61d05970
--- /dev/null
+++ b/include/net/meta.h
@@ -0,0 +1,31 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+/* Copyright (c) 2023 Isovalent */
+#ifndef __NET_META_H
+#define __NET_META_H
+
+#include <linux/bpf.h>
+
+#ifdef CONFIG_META
+int meta_prog_attach(const union bpf_attr *attr, struct bpf_prog *prog);
+int meta_prog_detach(const union bpf_attr *attr, struct bpf_prog *prog);
+int meta_prog_query(const union bpf_attr *attr, union bpf_attr __user *uattr);
+#else
+static inline int meta_prog_attach(const union bpf_attr *attr,
+				   struct bpf_prog *prog)
+{
+	return -EINVAL;
+}
+
+static inline int meta_prog_detach(const union bpf_attr *attr,
+				   struct bpf_prog *prog)
+{
+	return -EINVAL;
+}
+
+static inline int meta_prog_query(const union bpf_attr *attr,
+				  union bpf_attr __user *uattr)
+{
+	return -EINVAL;
+}
+#endif /* CONFIG_META */
+#endif /* __NET_META_H */
diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index 5f13db15a3c7..00a875720e84 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -1047,6 +1047,8 @@ enum bpf_attach_type {
 	BPF_TCX_INGRESS,
 	BPF_TCX_EGRESS,
 	BPF_TRACE_UPROBE_MULTI,
+	BPF_META_PRIMARY,
+	BPF_META_PEER,
 	__MAX_BPF_ATTACH_TYPE
 };
 
diff --git a/include/uapi/linux/if_link.h b/include/uapi/linux/if_link.h
index fac351a93aed..ec099c6c51e0 100644
--- a/include/uapi/linux/if_link.h
+++ b/include/uapi/linux/if_link.h
@@ -756,6 +756,31 @@ struct tunnel_msg {
 	__u32 ifindex;
 };
 
+/* META section */
+enum meta_action {
+	META_NEXT	= -1,
+	META_PASS	= 0,
+	META_DROP	= 2,
+	META_REDIRECT	= 7,
+};
+
+enum meta_mode {
+	META_L2,
+	META_L3,
+};
+
+enum {
+	IFLA_META_UNSPEC,
+	IFLA_META_PEER_INFO,
+	IFLA_META_PRIMARY,
+	IFLA_META_POLICY,
+	IFLA_META_PEER_POLICY,
+	IFLA_META_MODE,
+	__IFLA_META_MAX,
+};
+
+#define IFLA_META_MAX	(__IFLA_META_MAX - 1)
+
 /* VXLAN section */
 
 /* include statistics in the dump */
diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
index 85c1d908f70f..51baf4355c39 100644
--- a/kernel/bpf/syscall.c
+++ b/kernel/bpf/syscall.c
@@ -35,8 +35,9 @@
 #include <linux/rcupdate_trace.h>
 #include <linux/memcontrol.h>
 #include <linux/trace_events.h>
-#include <net/netfilter/nf_bpf_link.h>
 
+#include <net/netfilter/nf_bpf_link.h>
+#include <net/meta.h>
 #include <net/tcx.h>
 
 #define IS_FD_ARRAY(map) ((map)->map_type == BPF_MAP_TYPE_PERF_EVENT_ARRAY || \
@@ -3720,6 +3721,8 @@ attach_type_to_prog_type(enum bpf_attach_type attach_type)
 		return BPF_PROG_TYPE_LSM;
 	case BPF_TCX_INGRESS:
 	case BPF_TCX_EGRESS:
+	case BPF_META_PRIMARY:
+	case BPF_META_PEER:
 		return BPF_PROG_TYPE_SCHED_CLS;
 	default:
 		return BPF_PROG_TYPE_UNSPEC;
@@ -3771,7 +3774,9 @@ static int bpf_prog_attach_check_attach_type(const struct bpf_prog *prog,
 		return 0;
 	case BPF_PROG_TYPE_SCHED_CLS:
 		if (attach_type != BPF_TCX_INGRESS &&
-		    attach_type != BPF_TCX_EGRESS)
+		    attach_type != BPF_TCX_EGRESS &&
+		    attach_type != BPF_META_PRIMARY &&
+		    attach_type != BPF_META_PEER)
 			return -EINVAL;
 		return 0;
 	default:
@@ -3849,7 +3854,11 @@ static int bpf_prog_attach(const union bpf_attr *attr)
 			ret = cgroup_bpf_prog_attach(attr, ptype, prog);
 		break;
 	case BPF_PROG_TYPE_SCHED_CLS:
-		ret = tcx_prog_attach(attr, prog);
+		if (attr->link_create.attach_type == BPF_TCX_INGRESS ||
+		    attr->link_create.attach_type == BPF_TCX_EGRESS)
+			ret = tcx_prog_attach(attr, prog);
+		else
+			ret = meta_prog_attach(attr, prog);
 		break;
 	default:
 		ret = -EINVAL;
@@ -3906,7 +3915,11 @@ static int bpf_prog_detach(const union bpf_attr *attr)
 		ret = cgroup_bpf_prog_detach(attr, ptype);
 		break;
 	case BPF_PROG_TYPE_SCHED_CLS:
-		ret = tcx_prog_detach(attr, prog);
+		if (attr->link_create.attach_type == BPF_TCX_INGRESS ||
+		    attr->link_create.attach_type == BPF_TCX_EGRESS)
+			ret = tcx_prog_detach(attr, prog);
+		else
+			ret = meta_prog_detach(attr, prog);
 		break;
 	default:
 		ret = -EINVAL;
@@ -3968,6 +3981,9 @@ static int bpf_prog_query(const union bpf_attr *attr,
 	case BPF_TCX_INGRESS:
 	case BPF_TCX_EGRESS:
 		return tcx_prog_query(attr, uattr);
+	case BPF_META_PRIMARY:
+	case BPF_META_PEER:
+		return meta_prog_query(attr, uattr);
 	default:
 		return -EINVAL;
 	}
@@ -4949,7 +4965,11 @@ static int link_create(union bpf_attr *attr, bpfptr_t uattr)
 		ret = bpf_xdp_link_attach(attr, prog);
 		break;
 	case BPF_PROG_TYPE_SCHED_CLS:
-		ret = tcx_link_attach(attr, prog);
+		if (attr->link_create.attach_type == BPF_TCX_INGRESS ||
+		    attr->link_create.attach_type == BPF_TCX_EGRESS)
+			ret = tcx_link_attach(attr, prog);
+		else
+			ret = -EINVAL;
 		break;
 	case BPF_PROG_TYPE_NETFILTER:
 		ret = bpf_nf_link_attach(attr, prog);
diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h
index 5f13db15a3c7..00a875720e84 100644
--- a/tools/include/uapi/linux/bpf.h
+++ b/tools/include/uapi/linux/bpf.h
@@ -1047,6 +1047,8 @@ enum bpf_attach_type {
 	BPF_TCX_INGRESS,
 	BPF_TCX_EGRESS,
 	BPF_TRACE_UPROBE_MULTI,
+	BPF_META_PRIMARY,
+	BPF_META_PEER,
 	__MAX_BPF_ATTACH_TYPE
 };
 
-- 
2.34.1



* [PATCH bpf-next 2/8] meta, bpf: Add bpf link support for meta device
  2023-09-26  5:59 [PATCH bpf-next 0/8] Add bpf programmable device Daniel Borkmann
  2023-09-26  5:59 ` [PATCH bpf-next 1/8] meta, bpf: Add bpf programmable meta device Daniel Borkmann
@ 2023-09-26  5:59 ` Daniel Borkmann
  2023-09-28  0:12   ` Andrii Nakryiko
  2023-09-26  5:59 ` [PATCH bpf-next 3/8] tools: Sync if_link uapi header Daniel Borkmann
                   ` (5 subsequent siblings)
  7 siblings, 1 reply; 22+ messages in thread
From: Daniel Borkmann @ 2023-09-26  5:59 UTC (permalink / raw)
  To: bpf; +Cc: netdev, martin.lau, razor, ast, andrii, john.fastabend,
	Daniel Borkmann

This adds BPF link support for the meta device (BPF_LINK_TYPE_META). As
with tcx or XDP, the BPF link for meta is tied to the device.

The bpf_mprog API has been reused for the implementation. For details, see
also commit e420bed0250 ("bpf: Add fd-based tcx multi-prog infra with link
support").

This is now the second user of bpf_mprog after tcx, and in the meta case
the implementation is a bit more straightforward since it does not need
to deal with miniq.

The UAPI extensions for the BPF_LINK_CREATE command are similar to those
for tcx, that is, relative_{fd,id} and expected_revision.
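
For illustration only (not part of the patch; error handling is omitted),
a minimal sketch of the resulting UAPI usage via the raw bpf(2) syscall
could look as follows, assuming prog_fd refers to a loaded SCHED_CLS
program and ifindex to the primary meta device:

  #include <string.h>
  #include <unistd.h>
  #include <sys/syscall.h>
  #include <linux/bpf.h>

  static int meta_link_create(int prog_fd, int ifindex)
  {
  	union bpf_attr attr;

  	memset(&attr, 0, sizeof(attr));
  	attr.link_create.prog_fd = prog_fd;
  	attr.link_create.target_ifindex = ifindex;
  	attr.link_create.attach_type = BPF_META_PRIMARY;
  	/* 0 means no expectation on the current bpf_mprog revision. */
  	attr.link_create.meta.expected_revision = 0;

  	/* Returns a new link fd on success, -1 with errno set on error. */
  	return syscall(__NR_bpf, BPF_LINK_CREATE, &attr, sizeof(attr));
  }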

Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
---
 drivers/net/meta.c             | 211 ++++++++++++++++++++++++++++++++-
 include/net/meta.h             |   7 ++
 include/uapi/linux/bpf.h       |  11 ++
 kernel/bpf/syscall.c           |   2 +-
 tools/include/uapi/linux/bpf.h |  11 ++
 5 files changed, 240 insertions(+), 2 deletions(-)

diff --git a/drivers/net/meta.c b/drivers/net/meta.c
index e464f547b0a6..8cb39281c455 100644
--- a/drivers/net/meta.c
+++ b/drivers/net/meta.c
@@ -27,6 +27,11 @@ struct meta {
 	u32 headroom;
 };
 
+struct meta_link {
+	struct bpf_link link;
+	struct net_device *dev;
+};
+
 static void meta_scrub_minimum(struct sk_buff *skb)
 {
 	skb->skb_iif = 0;
@@ -576,6 +581,207 @@ int meta_prog_query(const union bpf_attr *attr, union bpf_attr __user *uattr)
 	return ret;
 }
 
+static struct meta_link *meta_link(struct bpf_link *link)
+{
+	return container_of(link, struct meta_link, link);
+}
+
+static const struct meta_link *meta_link_const(const struct bpf_link *link)
+{
+	return meta_link((struct bpf_link *)link);
+}
+
+static int meta_link_prog_attach(struct bpf_link *link, u32 flags,
+				 u32 id_or_fd, u64 revision)
+{
+	struct meta_link *meta = meta_link(link);
+	struct bpf_mprog_entry *entry, *entry_new;
+	struct net_device *dev = meta->dev;
+	int ret;
+
+	ASSERT_RTNL();
+	entry = meta_entry_fetch(dev, true);
+	ret = bpf_mprog_attach(entry, &entry_new, link->prog, link, NULL, flags,
+			       id_or_fd, revision);
+	if (!ret) {
+		if (entry != entry_new) {
+			meta_entry_update(dev, entry_new);
+			meta_entry_sync();
+		}
+		bpf_mprog_commit(entry);
+	}
+	return ret;
+}
+
+static void meta_link_release(struct bpf_link *link)
+{
+	struct meta_link *meta = meta_link(link);
+	struct bpf_mprog_entry *entry, *entry_new;
+	struct net_device *dev;
+	int ret = 0;
+
+	rtnl_lock();
+	dev = meta->dev;
+	if (!dev)
+		goto out;
+	entry = meta_entry_fetch(dev, false);
+	if (!entry) {
+		ret = -ENOENT;
+		goto out;
+	}
+	ret = bpf_mprog_detach(entry, &entry_new, link->prog, link, 0, 0, 0);
+	if (!ret) {
+		if (!bpf_mprog_total(entry_new))
+			entry_new = NULL;
+		meta_entry_update(dev, entry_new);
+		meta_entry_sync();
+		bpf_mprog_commit(entry);
+		meta->dev = NULL;
+	}
+out:
+	WARN_ON_ONCE(ret);
+	rtnl_unlock();
+}
+
+static int meta_link_update(struct bpf_link *link, struct bpf_prog *nprog,
+			    struct bpf_prog *oprog)
+{
+	struct meta_link *meta = meta_link(link);
+	struct bpf_mprog_entry *entry, *entry_new;
+	struct net_device *dev;
+	int ret = 0;
+
+	rtnl_lock();
+	dev = meta->dev;
+	if (!dev) {
+		ret = -ENOLINK;
+		goto out;
+	}
+	if (oprog && link->prog != oprog) {
+		ret = -EPERM;
+		goto out;
+	}
+	oprog = link->prog;
+	if (oprog == nprog) {
+		bpf_prog_put(nprog);
+		goto out;
+	}
+	entry = meta_entry_fetch(dev, false);
+	if (!entry) {
+		ret = -ENOENT;
+		goto out;
+	}
+	ret = bpf_mprog_attach(entry, &entry_new, nprog, link, oprog,
+			       BPF_F_REPLACE | BPF_F_ID,
+			       link->prog->aux->id, 0);
+	if (!ret) {
+		WARN_ON_ONCE(entry != entry_new);
+		oprog = xchg(&link->prog, nprog);
+		bpf_prog_put(oprog);
+		bpf_mprog_commit(entry);
+	}
+out:
+	rtnl_unlock();
+	return ret;
+}
+
+static void meta_link_dealloc(struct bpf_link *link)
+{
+	kfree(meta_link(link));
+}
+
+static void meta_link_fdinfo(const struct bpf_link *link, struct seq_file *seq)
+{
+	const struct meta_link *meta = meta_link_const(link);
+	u32 ifindex = 0;
+
+	rtnl_lock();
+	if (meta->dev)
+		ifindex = meta->dev->ifindex;
+	rtnl_unlock();
+
+	seq_printf(seq, "ifindex:\t%u\n", ifindex);
+}
+
+static int meta_link_fill_info(const struct bpf_link *link,
+			       struct bpf_link_info *info)
+{
+	const struct meta_link *meta = meta_link_const(link);
+	u32 ifindex = 0;
+
+	rtnl_lock();
+	if (meta->dev)
+		ifindex = meta->dev->ifindex;
+	rtnl_unlock();
+
+	info->meta.ifindex = ifindex;
+	return 0;
+}
+
+static int meta_link_detach(struct bpf_link *link)
+{
+	meta_link_release(link);
+	return 0;
+}
+
+static const struct bpf_link_ops meta_link_lops = {
+	.release	= meta_link_release,
+	.detach		= meta_link_detach,
+	.dealloc	= meta_link_dealloc,
+	.update_prog	= meta_link_update,
+	.show_fdinfo	= meta_link_fdinfo,
+	.fill_link_info	= meta_link_fill_info,
+};
+
+static int meta_link_init(struct meta_link *meta,
+			  struct bpf_link_primer *link_primer,
+			  struct net_device *dev, struct bpf_prog *prog)
+{
+	bpf_link_init(&meta->link, BPF_LINK_TYPE_META, &meta_link_lops, prog);
+	meta->dev = dev;
+	return bpf_link_prime(&meta->link, link_primer);
+}
+
+int meta_link_attach(const union bpf_attr *attr, struct bpf_prog *prog)
+{
+	struct bpf_link_primer link_primer;
+	struct net_device *dev;
+	struct meta_link *meta;
+	int ret;
+
+	rtnl_lock();
+	dev = meta_dev_fetch(current->nsproxy->net_ns,
+			     attr->link_create.target_ifindex,
+			     attr->link_create.attach_type);
+	if (IS_ERR(dev)) {
+		ret = PTR_ERR(dev);
+		goto out;
+	}
+	meta = kzalloc(sizeof(*meta), GFP_USER);
+	if (!meta) {
+		ret = -ENOMEM;
+		goto out;
+	}
+	ret = meta_link_init(meta, &link_primer, dev, prog);
+	if (ret) {
+		kfree(meta);
+		goto out;
+	}
+	ret = meta_link_prog_attach(&meta->link,
+				    attr->link_create.flags,
+				    attr->link_create.meta.relative_fd,
+				    attr->link_create.meta.expected_revision);
+	if (ret) {
+		meta->dev = NULL;
+		bpf_link_cleanup(&link_primer);
+		goto out;
+	}
+	ret = bpf_link_settle(&link_primer);
+out:
+	rtnl_unlock();
+	return ret;
+}
+
 static void meta_release_all(struct net_device *dev)
 {
 	struct bpf_mprog_entry *entry;
@@ -589,7 +795,10 @@ static void meta_release_all(struct net_device *dev)
 	meta_entry_update(dev, NULL);
 	meta_entry_sync();
 	bpf_mprog_foreach_tuple(entry, fp, cp, tuple) {
-		bpf_prog_put(tuple.prog);
+		if (tuple.link)
+			meta_link(tuple.link)->dev = NULL;
+		else
+			bpf_prog_put(tuple.prog);
 	}
 }
 
diff --git a/include/net/meta.h b/include/net/meta.h
index 20fc61d05970..f1abe1d6d02d 100644
--- a/include/net/meta.h
+++ b/include/net/meta.h
@@ -7,6 +7,7 @@
 
 #ifdef CONFIG_META
 int meta_prog_attach(const union bpf_attr *attr, struct bpf_prog *prog);
+int meta_link_attach(const union bpf_attr *attr, struct bpf_prog *prog);
 int meta_prog_detach(const union bpf_attr *attr, struct bpf_prog *prog);
 int meta_prog_query(const union bpf_attr *attr, union bpf_attr __user *uattr);
 #else
@@ -16,6 +17,12 @@ static inline int meta_prog_attach(const union bpf_attr *attr,
 	return -EINVAL;
 }
 
+static inline int meta_link_attach(const union bpf_attr *attr,
+				   struct bpf_prog *prog)
+{
+	return -EINVAL;
+}
+
 static inline int meta_prog_detach(const union bpf_attr *attr,
 				   struct bpf_prog *prog)
 {
diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index 00a875720e84..fd069f285fbc 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -1068,6 +1068,7 @@ enum bpf_link_type {
 	BPF_LINK_TYPE_NETFILTER = 10,
 	BPF_LINK_TYPE_TCX = 11,
 	BPF_LINK_TYPE_UPROBE_MULTI = 12,
+	BPF_LINK_TYPE_META = 13,
 	MAX_BPF_LINK_TYPE,
 };
 
@@ -1653,6 +1654,13 @@ union bpf_attr {
 				__u32		flags;
 				__u32		pid;
 			} uprobe_multi;
+			struct {
+				union {
+					__u32	relative_fd;
+					__u32	relative_id;
+				};
+				__u64		expected_revision;
+			} meta;
 		};
 	} link_create;
 
@@ -6564,6 +6572,9 @@ struct bpf_link_info {
 			__u32 ifindex;
 			__u32 attach_type;
 		} tcx;
+		struct {
+			__u32 ifindex;
+		} meta;
 	};
 } __attribute__((aligned(8)));
 
diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
index 51baf4355c39..b689da4de280 100644
--- a/kernel/bpf/syscall.c
+++ b/kernel/bpf/syscall.c
@@ -4969,7 +4969,7 @@ static int link_create(union bpf_attr *attr, bpfptr_t uattr)
 		    attr->link_create.attach_type == BPF_TCX_EGRESS)
 			ret = tcx_link_attach(attr, prog);
 		else
-			ret = -EINVAL;
+			ret = meta_link_attach(attr, prog);
 		break;
 	case BPF_PROG_TYPE_NETFILTER:
 		ret = bpf_nf_link_attach(attr, prog);
diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h
index 00a875720e84..fd069f285fbc 100644
--- a/tools/include/uapi/linux/bpf.h
+++ b/tools/include/uapi/linux/bpf.h
@@ -1068,6 +1068,7 @@ enum bpf_link_type {
 	BPF_LINK_TYPE_NETFILTER = 10,
 	BPF_LINK_TYPE_TCX = 11,
 	BPF_LINK_TYPE_UPROBE_MULTI = 12,
+	BPF_LINK_TYPE_META = 13,
 	MAX_BPF_LINK_TYPE,
 };
 
@@ -1653,6 +1654,13 @@ union bpf_attr {
 				__u32		flags;
 				__u32		pid;
 			} uprobe_multi;
+			struct {
+				union {
+					__u32	relative_fd;
+					__u32	relative_id;
+				};
+				__u64		expected_revision;
+			} meta;
 		};
 	} link_create;
 
@@ -6564,6 +6572,9 @@ struct bpf_link_info {
 			__u32 ifindex;
 			__u32 attach_type;
 		} tcx;
+		struct {
+			__u32 ifindex;
+		} meta;
 	};
 } __attribute__((aligned(8)));
 
-- 
2.34.1



* [PATCH bpf-next 3/8] tools: Sync if_link uapi header
  2023-09-26  5:59 [PATCH bpf-next 0/8] Add bpf programmable device Daniel Borkmann
  2023-09-26  5:59 ` [PATCH bpf-next 1/8] meta, bpf: Add bpf programmable meta device Daniel Borkmann
  2023-09-26  5:59 ` [PATCH bpf-next 2/8] meta, bpf: Add bpf link support for " Daniel Borkmann
@ 2023-09-26  5:59 ` Daniel Borkmann
  2023-09-26  5:59 ` [PATCH bpf-next 4/8] libbpf: Add link-based API for meta Daniel Borkmann
                   ` (4 subsequent siblings)
  7 siblings, 0 replies; 22+ messages in thread
From: Daniel Borkmann @ 2023-09-26  5:59 UTC (permalink / raw)
  To: bpf; +Cc: netdev, martin.lau, razor, ast, andrii, john.fastabend,
	Daniel Borkmann

Sync the if_link uapi header to the latest version, as the tooling needs
the updated definitions for the meta device. Given it has been a while
since the last sync and the diff is fairly big, this has been done as its
own commit.

Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
---
 tools/include/uapi/linux/if_link.h | 142 +++++++++++++++++++++++++++++
 1 file changed, 142 insertions(+)

diff --git a/tools/include/uapi/linux/if_link.h b/tools/include/uapi/linux/if_link.h
index 39e659c83cfd..ec099c6c51e0 100644
--- a/tools/include/uapi/linux/if_link.h
+++ b/tools/include/uapi/linux/if_link.h
@@ -211,6 +211,9 @@ struct rtnl_link_stats {
  * @rx_nohandler: Number of packets received on the interface
  *   but dropped by the networking stack because the device is
  *   not designated to receive packets (e.g. backup link in a bond).
+ *
+ * @rx_otherhost_dropped: Number of packets dropped due to mismatch
+ *   in destination MAC address.
  */
 struct rtnl_link_stats64 {
 	__u64	rx_packets;
@@ -243,6 +246,23 @@ struct rtnl_link_stats64 {
 	__u64	rx_compressed;
 	__u64	tx_compressed;
 	__u64	rx_nohandler;
+
+	__u64	rx_otherhost_dropped;
+};
+
+/* Subset of link stats useful for in-HW collection. Meaning of the fields is as
+ * for struct rtnl_link_stats64.
+ */
+struct rtnl_hw_stats64 {
+	__u64	rx_packets;
+	__u64	tx_packets;
+	__u64	rx_bytes;
+	__u64	tx_bytes;
+	__u64	rx_errors;
+	__u64	tx_errors;
+	__u64	rx_dropped;
+	__u64	tx_dropped;
+	__u64	multicast;
 };
 
 /* The struct should be in sync with struct ifmap */
@@ -350,7 +370,13 @@ enum {
 	IFLA_GRO_MAX_SIZE,
 	IFLA_TSO_MAX_SIZE,
 	IFLA_TSO_MAX_SEGS,
+	IFLA_ALLMULTI,		/* Allmulti count: > 0 means acts ALLMULTI */
+
+	IFLA_DEVLINK_PORT,
 
+	IFLA_GSO_IPV4_MAX_SIZE,
+	IFLA_GRO_IPV4_MAX_SIZE,
+	IFLA_DPLL_PIN,
 	__IFLA_MAX
 };
 
@@ -539,6 +565,12 @@ enum {
 	IFLA_BRPORT_MRP_IN_OPEN,
 	IFLA_BRPORT_MCAST_EHT_HOSTS_LIMIT,
 	IFLA_BRPORT_MCAST_EHT_HOSTS_CNT,
+	IFLA_BRPORT_LOCKED,
+	IFLA_BRPORT_MAB,
+	IFLA_BRPORT_MCAST_N_GROUPS,
+	IFLA_BRPORT_MCAST_MAX_GROUPS,
+	IFLA_BRPORT_NEIGH_VLAN_SUPPRESS,
+	IFLA_BRPORT_BACKUP_NHID,
 	__IFLA_BRPORT_MAX
 };
 #define IFLA_BRPORT_MAX (__IFLA_BRPORT_MAX - 1)
@@ -716,7 +748,80 @@ enum ipvlan_mode {
 #define IPVLAN_F_PRIVATE	0x01
 #define IPVLAN_F_VEPA		0x02
 
+/* Tunnel RTM header */
+struct tunnel_msg {
+	__u8 family;
+	__u8 flags;
+	__u16 reserved2;
+	__u32 ifindex;
+};
+
+/* META section */
+enum meta_action {
+	META_NEXT	= -1,
+	META_PASS	= 0,
+	META_DROP	= 2,
+	META_REDIRECT	= 7,
+};
+
+enum meta_mode {
+	META_L2,
+	META_L3,
+};
+
+enum {
+	IFLA_META_UNSPEC,
+	IFLA_META_PEER_INFO,
+	IFLA_META_PRIMARY,
+	IFLA_META_POLICY,
+	IFLA_META_PEER_POLICY,
+	IFLA_META_MODE,
+	__IFLA_META_MAX,
+};
+
+#define IFLA_META_MAX	(__IFLA_META_MAX - 1)
+
 /* VXLAN section */
+
+/* include statistics in the dump */
+#define TUNNEL_MSG_FLAG_STATS	0x01
+
+#define TUNNEL_MSG_VALID_USER_FLAGS TUNNEL_MSG_FLAG_STATS
+
+/* Embedded inside VXLAN_VNIFILTER_ENTRY_STATS */
+enum {
+	VNIFILTER_ENTRY_STATS_UNSPEC,
+	VNIFILTER_ENTRY_STATS_RX_BYTES,
+	VNIFILTER_ENTRY_STATS_RX_PKTS,
+	VNIFILTER_ENTRY_STATS_RX_DROPS,
+	VNIFILTER_ENTRY_STATS_RX_ERRORS,
+	VNIFILTER_ENTRY_STATS_TX_BYTES,
+	VNIFILTER_ENTRY_STATS_TX_PKTS,
+	VNIFILTER_ENTRY_STATS_TX_DROPS,
+	VNIFILTER_ENTRY_STATS_TX_ERRORS,
+	VNIFILTER_ENTRY_STATS_PAD,
+	__VNIFILTER_ENTRY_STATS_MAX
+};
+#define VNIFILTER_ENTRY_STATS_MAX (__VNIFILTER_ENTRY_STATS_MAX - 1)
+
+enum {
+	VXLAN_VNIFILTER_ENTRY_UNSPEC,
+	VXLAN_VNIFILTER_ENTRY_START,
+	VXLAN_VNIFILTER_ENTRY_END,
+	VXLAN_VNIFILTER_ENTRY_GROUP,
+	VXLAN_VNIFILTER_ENTRY_GROUP6,
+	VXLAN_VNIFILTER_ENTRY_STATS,
+	__VXLAN_VNIFILTER_ENTRY_MAX
+};
+#define VXLAN_VNIFILTER_ENTRY_MAX	(__VXLAN_VNIFILTER_ENTRY_MAX - 1)
+
+enum {
+	VXLAN_VNIFILTER_UNSPEC,
+	VXLAN_VNIFILTER_ENTRY,
+	__VXLAN_VNIFILTER_MAX
+};
+#define VXLAN_VNIFILTER_MAX	(__VXLAN_VNIFILTER_MAX - 1)
+
 enum {
 	IFLA_VXLAN_UNSPEC,
 	IFLA_VXLAN_ID,
@@ -748,6 +853,8 @@ enum {
 	IFLA_VXLAN_GPE,
 	IFLA_VXLAN_TTL_INHERIT,
 	IFLA_VXLAN_DF,
+	IFLA_VXLAN_VNIFILTER, /* only applicable with COLLECT_METADATA mode */
+	IFLA_VXLAN_LOCALBYPASS,
 	__IFLA_VXLAN_MAX
 };
 #define IFLA_VXLAN_MAX	(__IFLA_VXLAN_MAX - 1)
@@ -781,6 +888,7 @@ enum {
 	IFLA_GENEVE_LABEL,
 	IFLA_GENEVE_TTL_INHERIT,
 	IFLA_GENEVE_DF,
+	IFLA_GENEVE_INNER_PROTO_INHERIT,
 	__IFLA_GENEVE_MAX
 };
 #define IFLA_GENEVE_MAX	(__IFLA_GENEVE_MAX - 1)
@@ -826,6 +934,8 @@ enum {
 	IFLA_GTP_FD1,
 	IFLA_GTP_PDP_HASHSIZE,
 	IFLA_GTP_ROLE,
+	IFLA_GTP_CREATE_SOCKETS,
+	IFLA_GTP_RESTART_COUNT,
 	__IFLA_GTP_MAX,
 };
 #define IFLA_GTP_MAX (__IFLA_GTP_MAX - 1)
@@ -1162,6 +1272,17 @@ enum {
 
 #define IFLA_STATS_FILTER_BIT(ATTR)	(1 << (ATTR - 1))
 
+enum {
+	IFLA_STATS_GETSET_UNSPEC,
+	IFLA_STATS_GET_FILTERS, /* Nest of IFLA_STATS_LINK_xxx, each a u32 with
+				 * a filter mask for the corresponding group.
+				 */
+	IFLA_STATS_SET_OFFLOAD_XSTATS_L3_STATS, /* 0 or 1 as u8 */
+	__IFLA_STATS_GETSET_MAX,
+};
+
+#define IFLA_STATS_GETSET_MAX (__IFLA_STATS_GETSET_MAX - 1)
+
 /* These are embedded into IFLA_STATS_LINK_XSTATS:
  * [IFLA_STATS_LINK_XSTATS]
  * -> [LINK_XSTATS_TYPE_xxx]
@@ -1179,10 +1300,21 @@ enum {
 enum {
 	IFLA_OFFLOAD_XSTATS_UNSPEC,
 	IFLA_OFFLOAD_XSTATS_CPU_HIT, /* struct rtnl_link_stats64 */
+	IFLA_OFFLOAD_XSTATS_HW_S_INFO,	/* HW stats info. A nest */
+	IFLA_OFFLOAD_XSTATS_L3_STATS,	/* struct rtnl_hw_stats64 */
 	__IFLA_OFFLOAD_XSTATS_MAX
 };
 #define IFLA_OFFLOAD_XSTATS_MAX (__IFLA_OFFLOAD_XSTATS_MAX - 1)
 
+enum {
+	IFLA_OFFLOAD_XSTATS_HW_S_INFO_UNSPEC,
+	IFLA_OFFLOAD_XSTATS_HW_S_INFO_REQUEST,		/* u8 */
+	IFLA_OFFLOAD_XSTATS_HW_S_INFO_USED,		/* u8 */
+	__IFLA_OFFLOAD_XSTATS_HW_S_INFO_MAX,
+};
+#define IFLA_OFFLOAD_XSTATS_HW_S_INFO_MAX \
+	(__IFLA_OFFLOAD_XSTATS_HW_S_INFO_MAX - 1)
+
 /* XDP section */
 
 #define XDP_FLAGS_UPDATE_IF_NOEXIST	(1U << 0)
@@ -1281,4 +1413,14 @@ enum {
 
 #define IFLA_MCTP_MAX (__IFLA_MCTP_MAX - 1)
 
+/* DSA section */
+
+enum {
+	IFLA_DSA_UNSPEC,
+	IFLA_DSA_MASTER,
+	__IFLA_DSA_MAX,
+};
+
+#define IFLA_DSA_MAX	(__IFLA_DSA_MAX - 1)
+
 #endif /* _UAPI_LINUX_IF_LINK_H */
-- 
2.34.1



* [PATCH bpf-next 4/8] libbpf: Add link-based API for meta
  2023-09-26  5:59 [PATCH bpf-next 0/8] Add bpf programmable device Daniel Borkmann
                   ` (2 preceding siblings ...)
  2023-09-26  5:59 ` [PATCH bpf-next 3/8] tools: Sync if_link uapi header Daniel Borkmann
@ 2023-09-26  5:59 ` Daniel Borkmann
  2023-09-26 11:19   ` Quentin Monnet
  2023-09-28  0:12   ` Andrii Nakryiko
  2023-09-26  5:59 ` [PATCH bpf-next 5/8] bpftool: Implement link show support " Daniel Borkmann
                   ` (3 subsequent siblings)
  7 siblings, 2 replies; 22+ messages in thread
From: Daniel Borkmann @ 2023-09-26  5:59 UTC (permalink / raw)
  To: bpf; +Cc: netdev, martin.lau, razor, ast, andrii, john.fastabend,
	Daniel Borkmann

This adds the bpf_program__attach_meta() API to libbpf. Overall it is very
similar to tcx. The API looks as follows:

  LIBBPF_API struct bpf_link *
  bpf_program__attach_meta(const struct bpf_program *prog, int ifindex,
                           bool peer_device, const struct bpf_meta_opts *opts);

The struct bpf_meta_opts is laid out in a similar way to struct
bpf_tcx_opts. Compared to bpf_program__attach_tcx(),
bpf_program__attach_meta() has one additional argument, peer_device. The
latter denotes whether the program should be attached to the peer of
ifindex or to ifindex itself.
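
An illustrative usage sketch (not part of the patch; the object and
program names are placeholders, error handling is trimmed), assuming a
BPF object meta_prog.bpf.o with a SEC("meta") program named meta_prog and
ifindex referring to the primary meta device:

  #include <bpf/libbpf.h>

  static struct bpf_link *attach_to_meta(int ifindex)
  {
  	LIBBPF_OPTS(bpf_meta_opts, opts); /* defaults: no ordering, any revision */
  	struct bpf_program *prog;
  	struct bpf_object *obj;

  	obj = bpf_object__open_file("meta_prog.bpf.o", NULL);
  	if (!obj || bpf_object__load(obj))
  		return NULL;
  	prog = bpf_object__find_program_by_name(obj, "meta_prog");
  	if (!prog)
  		return NULL;
  	/* false: attach to the primary device itself rather than its peer */
  	return bpf_program__attach_meta(prog, ifindex, false, &opts);
  }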

Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
---
 tools/lib/bpf/bpf.c      | 16 +++++++++++
 tools/lib/bpf/bpf.h      |  5 ++++
 tools/lib/bpf/libbpf.c   | 61 ++++++++++++++++++++++++++++++++++++----
 tools/lib/bpf/libbpf.h   | 15 ++++++++++
 tools/lib/bpf/libbpf.map |  1 +
 5 files changed, 92 insertions(+), 6 deletions(-)

diff --git a/tools/lib/bpf/bpf.c b/tools/lib/bpf/bpf.c
index b0f1913763a3..f1335333b63c 100644
--- a/tools/lib/bpf/bpf.c
+++ b/tools/lib/bpf/bpf.c
@@ -810,6 +810,22 @@ int bpf_link_create(int prog_fd, int target_fd,
 		if (!OPTS_ZEROED(opts, tcx))
 			return libbpf_err(-EINVAL);
 		break;
+	case BPF_META_PRIMARY:
+	case BPF_META_PEER:
+		relative_fd = OPTS_GET(opts, meta.relative_fd, 0);
+		relative_id = OPTS_GET(opts, meta.relative_id, 0);
+		if (relative_fd && relative_id)
+			return libbpf_err(-EINVAL);
+		if (relative_id) {
+			attr.link_create.meta.relative_id = relative_id;
+			attr.link_create.flags |= BPF_F_ID;
+		} else {
+			attr.link_create.meta.relative_fd = relative_fd;
+		}
+		attr.link_create.meta.expected_revision = OPTS_GET(opts, meta.expected_revision, 0);
+		if (!OPTS_ZEROED(opts, meta))
+			return libbpf_err(-EINVAL);
+		break;
 	default:
 		if (!OPTS_ZEROED(opts, flags))
 			return libbpf_err(-EINVAL);
diff --git a/tools/lib/bpf/bpf.h b/tools/lib/bpf/bpf.h
index 74c2887cfd24..175cfb95a175 100644
--- a/tools/lib/bpf/bpf.h
+++ b/tools/lib/bpf/bpf.h
@@ -415,6 +415,11 @@ struct bpf_link_create_opts {
 			__u32 relative_id;
 			__u64 expected_revision;
 		} tcx;
+		struct {
+			__u32 relative_fd;
+			__u32 relative_id;
+			__u64 expected_revision;
+		} meta;
 	};
 	size_t :0;
 };
diff --git a/tools/lib/bpf/libbpf.c b/tools/lib/bpf/libbpf.c
index b4758e54a815..4d4da8ba2179 100644
--- a/tools/lib/bpf/libbpf.c
+++ b/tools/lib/bpf/libbpf.c
@@ -121,6 +121,8 @@ static const char * const attach_type_name[] = {
 	[BPF_TCX_INGRESS]		= "tcx_ingress",
 	[BPF_TCX_EGRESS]		= "tcx_egress",
 	[BPF_TRACE_UPROBE_MULTI]	= "trace_uprobe_multi",
+	[BPF_META_PRIMARY]		= "meta",
+	[BPF_META_PEER]			= "meta",
 };
 
 static const char * const link_type_name[] = {
@@ -137,6 +139,7 @@ static const char * const link_type_name[] = {
 	[BPF_LINK_TYPE_NETFILTER]		= "netfilter",
 	[BPF_LINK_TYPE_TCX]			= "tcx",
 	[BPF_LINK_TYPE_UPROBE_MULTI]		= "uprobe_multi",
+	[BPF_LINK_TYPE_META]			= "meta",
 };
 
 static const char * const map_type_name[] = {
@@ -8910,6 +8913,7 @@ static const struct bpf_sec_def section_defs[] = {
 	SEC_DEF("tc",			SCHED_CLS, 0, SEC_NONE), /* deprecated / legacy, use tcx */
 	SEC_DEF("classifier",		SCHED_CLS, 0, SEC_NONE), /* deprecated / legacy, use tcx */
 	SEC_DEF("action",		SCHED_ACT, 0, SEC_NONE), /* deprecated / legacy, use tcx */
+	SEC_DEF("meta",			SCHED_CLS, 0, SEC_NONE),
 	SEC_DEF("tracepoint+",		TRACEPOINT, 0, SEC_NONE, attach_tp),
 	SEC_DEF("tp+",			TRACEPOINT, 0, SEC_NONE, attach_tp),
 	SEC_DEF("raw_tracepoint+",	RAW_TRACEPOINT, 0, SEC_NONE, attach_raw_tp),
@@ -12019,11 +12023,11 @@ static int attach_lsm(const struct bpf_program *prog, long cookie, struct bpf_li
 }
 
 static struct bpf_link *
-bpf_program_attach_fd(const struct bpf_program *prog,
-		      int target_fd, const char *target_name,
-		      const struct bpf_link_create_opts *opts)
+bpf_program_attach_fd_type(const struct bpf_program *prog,
+			   int target_fd, const char *target_name,
+			   enum bpf_attach_type attach_type,
+			   const struct bpf_link_create_opts *opts)
 {
-	enum bpf_attach_type attach_type;
 	char errmsg[STRERR_BUFSIZE];
 	struct bpf_link *link;
 	int prog_fd, link_fd;
@@ -12038,8 +12042,6 @@ bpf_program_attach_fd(const struct bpf_program *prog,
 	if (!link)
 		return libbpf_err_ptr(-ENOMEM);
 	link->detach = &bpf_link__detach_fd;
-
-	attach_type = bpf_program__expected_attach_type(prog);
 	link_fd = bpf_link_create(prog_fd, target_fd, attach_type, opts);
 	if (link_fd < 0) {
 		link_fd = -errno;
@@ -12053,6 +12055,16 @@ bpf_program_attach_fd(const struct bpf_program *prog,
 	return link;
 }
 
+static struct bpf_link *
+bpf_program_attach_fd(const struct bpf_program *prog,
+		      int target_fd, const char *target_name,
+		      const struct bpf_link_create_opts *opts)
+{
+	return bpf_program_attach_fd_type(prog, target_fd, target_name,
+					  bpf_program__expected_attach_type(prog),
+					  opts);
+}
+
 struct bpf_link *
 bpf_program__attach_cgroup(const struct bpf_program *prog, int cgroup_fd)
 {
@@ -12106,6 +12118,43 @@ bpf_program__attach_tcx(const struct bpf_program *prog, int ifindex,
 	return bpf_program_attach_fd(prog, ifindex, "tcx", &link_create_opts);
 }
 
+struct bpf_link *
+bpf_program__attach_meta(const struct bpf_program *prog, int ifindex,
+			 bool peer_device, const struct bpf_meta_opts *opts)
+{
+	LIBBPF_OPTS(bpf_link_create_opts, link_create_opts);
+	enum bpf_attach_type attach_type;
+	__u32 relative_id;
+	int relative_fd;
+
+	if (!OPTS_VALID(opts, bpf_meta_opts))
+		return libbpf_err_ptr(-EINVAL);
+
+	relative_id = OPTS_GET(opts, relative_id, 0);
+	relative_fd = OPTS_GET(opts, relative_fd, 0);
+	attach_type = peer_device ? BPF_META_PEER : BPF_META_PRIMARY;
+
+	/* validate we don't have unexpected combinations of non-zero fields */
+	if (!ifindex) {
+		pr_warn("prog '%s': target netdevice ifindex cannot be zero\n",
+			prog->name);
+		return libbpf_err_ptr(-EINVAL);
+	}
+	if (relative_fd && relative_id) {
+		pr_warn("prog '%s': relative_fd and relative_id cannot be set at the same time\n",
+			prog->name);
+		return libbpf_err_ptr(-EINVAL);
+	}
+
+	link_create_opts.meta.expected_revision = OPTS_GET(opts, expected_revision, 0);
+	link_create_opts.meta.relative_fd = relative_fd;
+	link_create_opts.meta.relative_id = relative_id;
+	link_create_opts.flags = OPTS_GET(opts, flags, 0);
+
+	return bpf_program_attach_fd_type(prog, ifindex, "meta", attach_type,
+					  &link_create_opts);
+}
+
 struct bpf_link *bpf_program__attach_freplace(const struct bpf_program *prog,
 					      int target_fd,
 					      const char *attach_func_name)
diff --git a/tools/lib/bpf/libbpf.h b/tools/lib/bpf/libbpf.h
index 0e52621cba43..827d29cf9a06 100644
--- a/tools/lib/bpf/libbpf.h
+++ b/tools/lib/bpf/libbpf.h
@@ -800,6 +800,21 @@ LIBBPF_API struct bpf_link *
 bpf_program__attach_tcx(const struct bpf_program *prog, int ifindex,
 			const struct bpf_tcx_opts *opts);
 
+struct bpf_meta_opts {
+	/* size of this struct, for forward/backward compatibility */
+	size_t sz;
+	__u32 flags;
+	__u32 relative_fd;
+	__u32 relative_id;
+	__u64 expected_revision;
+	size_t :0;
+};
+#define bpf_meta_opts__last_field expected_revision
+
+LIBBPF_API struct bpf_link *
+bpf_program__attach_meta(const struct bpf_program *prog, int ifindex,
+			 bool peer_device, const struct bpf_meta_opts *opts);
+
 struct bpf_map;
 
 LIBBPF_API struct bpf_link *bpf_map__attach_struct_ops(const struct bpf_map *map);
diff --git a/tools/lib/bpf/libbpf.map b/tools/lib/bpf/libbpf.map
index 57712321490f..2dd4fe2cba3d 100644
--- a/tools/lib/bpf/libbpf.map
+++ b/tools/lib/bpf/libbpf.map
@@ -397,6 +397,7 @@ LIBBPF_1.3.0 {
 		bpf_obj_pin_opts;
 		bpf_object__unpin;
 		bpf_prog_detach_opts;
+		bpf_program__attach_meta;
 		bpf_program__attach_netfilter;
 		bpf_program__attach_tcx;
 		bpf_program__attach_uprobe_multi;
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 22+ messages in thread

* [PATCH bpf-next 5/8] bpftool: Implement link show support for meta
  2023-09-26  5:59 [PATCH bpf-next 0/8] Add bpf programmable device Daniel Borkmann
                   ` (3 preceding siblings ...)
  2023-09-26  5:59 ` [PATCH bpf-next 4/8] libbpf: Add link-based API for meta Daniel Borkmann
@ 2023-09-26  5:59 ` Daniel Borkmann
  2023-09-26 11:19   ` Quentin Monnet
  2023-09-26  5:59 ` [PATCH bpf-next 6/8] bpftool: Extend net dump with meta progs Daniel Borkmann
                   ` (2 subsequent siblings)
  7 siblings, 1 reply; 22+ messages in thread
From: Daniel Borkmann @ 2023-09-26  5:59 UTC (permalink / raw)
  To: bpf; +Cc: netdev, martin.lau, razor, ast, andrii, john.fastabend,
	Daniel Borkmann

Add support to dump meta link information to bpftool in a similar way
as we have for XDP. The meta link info only exposes the ifindex.

Below shows an example link dump output, and a cgroup link is included
for comparison, too:

  # bpftool link
  [...]
  10: cgroup  prog 2466
        cgroup_id 1  attach_type cgroup_inet6_post_bind
  [...]
  8: meta  prog 35
        ifindex meta1(18)
  [...]

Equivalent json output:

  # bpftool link --json
  [...]
  {
    "id": 10,
    "type": "cgroup",
    "prog_id": 2466,
    "cgroup_id": 1,
    "attach_type": "cgroup_inet6_post_bind"
  },
  [...]
  {
    "id": 12,
    "type": "meta",
    "prog_id": 61,
    "devname": "meta1",
    "ifindex": 21
  }
  [...]
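
For illustration only (not part of the diff below), a minimal sketch of how
the ifindex shown above can be read from a link fd via bpf_obj_get_info_by_fd().
The info.meta member and BPF_LINK_TYPE_META come from this series' uapi
changes; show_meta_link() and the already-opened link_fd are assumptions of
the sketch, not bpftool code:

  #include <stdio.h>
  #include <net/if.h>
  #include <linux/bpf.h>
  #include <bpf/bpf.h>

  /* Print roughly what bpftool's plain output shows for a meta link. */
  static int show_meta_link(int link_fd)
  {
  	struct bpf_link_info info = {};
  	__u32 len = sizeof(info);
  	char dev[IF_NAMESIZE] = {};

  	if (bpf_obj_get_info_by_fd(link_fd, &info, &len))
  		return -1;
  	if (info.type != BPF_LINK_TYPE_META)
  		return -1;
  	if (!if_indextoname(info.meta.ifindex, dev))
  		snprintf(dev, sizeof(dev), "?");
  	printf("%u: meta  prog %u\n\tifindex %s(%u)\n",
  	       info.id, info.prog_id, dev, info.meta.ifindex);
  	return 0;
  }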

Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
---
 tools/bpf/bpftool/link.c | 7 +++++++
 1 file changed, 7 insertions(+)

diff --git a/tools/bpf/bpftool/link.c b/tools/bpf/bpftool/link.c
index 2e5c231e08ac..57fd3f1a7330 100644
--- a/tools/bpf/bpftool/link.c
+++ b/tools/bpf/bpftool/link.c
@@ -449,6 +449,9 @@ static int show_link_close_json(int fd, struct bpf_link_info *info)
 		show_link_ifindex_json(info->tcx.ifindex, json_wtr);
 		show_link_attach_type_json(info->tcx.attach_type, json_wtr);
 		break;
+	case BPF_LINK_TYPE_META:
+		show_link_ifindex_json(info->meta.ifindex, json_wtr);
+		break;
 	case BPF_LINK_TYPE_XDP:
 		show_link_ifindex_json(info->xdp.ifindex, json_wtr);
 		break;
@@ -785,6 +788,10 @@ static int show_link_close_plain(int fd, struct bpf_link_info *info)
 		show_link_ifindex_plain(info->tcx.ifindex);
 		show_link_attach_type_plain(info->tcx.attach_type);
 		break;
+	case BPF_LINK_TYPE_META:
+		printf("\n\t");
+		show_link_ifindex_plain(info->meta.ifindex);
+		break;
 	case BPF_LINK_TYPE_XDP:
 		printf("\n\t");
 		show_link_ifindex_plain(info->xdp.ifindex);
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 22+ messages in thread

* [PATCH bpf-next 6/8] bpftool: Extend net dump with meta progs
  2023-09-26  5:59 [PATCH bpf-next 0/8] Add bpf programmable device Daniel Borkmann
                   ` (4 preceding siblings ...)
  2023-09-26  5:59 ` [PATCH bpf-next 5/8] bpftool: Implement link show support " Daniel Borkmann
@ 2023-09-26  5:59 ` Daniel Borkmann
  2023-09-26 11:19   ` Quentin Monnet
  2023-09-26  5:59 ` [PATCH bpf-next 7/8] selftests/bpf: Add netlink helper library Daniel Borkmann
  2023-09-26  5:59 ` [PATCH bpf-next 8/8] selftests/bpf: Add selftests for meta Daniel Borkmann
  7 siblings, 1 reply; 22+ messages in thread
From: Daniel Borkmann @ 2023-09-26  5:59 UTC (permalink / raw)
  To: bpf; +Cc: netdev, martin.lau, razor, ast, andrii, john.fastabend,
	Daniel Borkmann

Add support to dump BPF programs on meta via bpftool. This includes both
the BPF link and attach ops programs. The dumped information contains the
attach location, function entry name, program ID and link ID when applicable.

Example with tc BPF link:

  # ./bpftool net
  xdp:

  tc:
  meta1(22) meta/peer tc1 prog_id 43 link_id 12

  [...]

Example with json dump:

  # ./bpftool net --json | jq
  [
    {
      "xdp": [],
      "tc": [
        {
          "devname": "meta1",
          "ifindex": 18,
          "kind": "meta/primary",
          "name": "tc1",
          "prog_id": 29,
          "prog_flags": [],
          "link_id": 8,
          "link_flags": []
        }
      ],
      "flow_dissector": [],
      "netfilter": []
    }
  ]
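
For illustration only (not part of the diff below), a rough sketch of the
query backing the dump above: BPF_META_PRIMARY/BPF_META_PEER (added by this
series) plug into the generic multi-prog query API just like tcx, so per
device this boils down to a bpf_prog_query_opts() call. dump_meta_side() is
an illustrative name, not bpftool code:

  #include <stdio.h>
  #include <linux/bpf.h>
  #include <bpf/bpf.h>

  static void dump_meta_side(int ifindex, enum bpf_attach_type loc)
  {
  	__u32 prog_ids[64] = {};
  	LIBBPF_OPTS(bpf_prog_query_opts, optq,
  		.prog_ids = prog_ids,
  		.count = 64,
  	);
  	__u32 i;

  	if (bpf_prog_query_opts(ifindex, loc, &optq))
  		return;
  	for (i = 0; i < optq.count; i++)
  		printf("ifindex %d %s prog_id %u\n", ifindex,
  		       loc == BPF_META_PRIMARY ? "meta/primary" : "meta/peer",
  		       prog_ids[i]);
  }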

Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
---
 tools/bpf/bpftool/Documentation/bpftool-net.rst | 8 ++++----
 tools/bpf/bpftool/net.c                         | 7 ++++++-
 2 files changed, 10 insertions(+), 5 deletions(-)

diff --git a/tools/bpf/bpftool/Documentation/bpftool-net.rst b/tools/bpf/bpftool/Documentation/bpftool-net.rst
index 5e2abd3de5ab..268770c3eb9c 100644
--- a/tools/bpf/bpftool/Documentation/bpftool-net.rst
+++ b/tools/bpf/bpftool/Documentation/bpftool-net.rst
@@ -37,7 +37,7 @@ DESCRIPTION
 	**bpftool net { show | list }** [ **dev** *NAME* ]
 		  List bpf program attachments in the kernel networking subsystem.
 
-		  Currently, device driver xdp attachments, tcx and old-style tc
+		  Currently, device driver xdp attachments, tcx, meta and old-style tc
 		  classifier/action attachments, flow_dissector as well as netfilter
 		  attachments are implemented, i.e., for
 		  program types **BPF_PROG_TYPE_XDP**, **BPF_PROG_TYPE_SCHED_CLS**,
@@ -52,11 +52,11 @@ DESCRIPTION
 		  bpf programs, users should consult other tools, e.g., iproute2.
 
 		  The current output will start with all xdp program attachments, followed by
-		  all tcx, then tc class/qdisc bpf program attachments, then flow_dissector
-		  and finally netfilter programs. Both xdp programs and tcx/tc programs are
+		  all tcx, meta, then tc class/qdisc bpf program attachments, then flow_dissector
+		  and finally netfilter programs. Both xdp programs and tcx/meta/tc programs are
 		  ordered based on ifindex number. If multiple bpf programs attached
 		  to the same networking device through **tc**, the order will be first
-		  all bpf programs attached to tcx, then tc classes, then all bpf programs
+		  all bpf programs attached to tcx, meta, then tc classes, then all bpf programs
 		  attached to non clsact qdiscs, and finally all bpf programs attached
 		  to root and clsact qdisc.
 
diff --git a/tools/bpf/bpftool/net.c b/tools/bpf/bpftool/net.c
index 66a8ce8ae012..1c60fb18b7fd 100644
--- a/tools/bpf/bpftool/net.c
+++ b/tools/bpf/bpftool/net.c
@@ -79,6 +79,8 @@ static const char * const attach_type_strings[] = {
 static const char * const attach_loc_strings[] = {
 	[BPF_TCX_INGRESS]		= "tcx/ingress",
 	[BPF_TCX_EGRESS]		= "tcx/egress",
+	[BPF_META_PRIMARY]		= "meta/primary",
+	[BPF_META_PEER]			= "meta/peer",
 };
 
 const size_t net_attach_type_size = ARRAY_SIZE(attach_type_strings);
@@ -506,6 +508,9 @@ static void show_dev_tc_bpf(struct ip_devname_ifindex *dev)
 {
 	__show_dev_tc_bpf(dev, BPF_TCX_INGRESS);
 	__show_dev_tc_bpf(dev, BPF_TCX_EGRESS);
+
+	__show_dev_tc_bpf(dev, BPF_META_PRIMARY);
+	__show_dev_tc_bpf(dev, BPF_META_PEER);
 }
 
 static int show_dev_tc_bpf_classic(int sock, unsigned int nl_pid,
@@ -926,7 +931,7 @@ static int do_help(int argc, char **argv)
 		"       ATTACH_TYPE := { xdp | xdpgeneric | xdpdrv | xdpoffload }\n"
 		"       " HELP_SPEC_OPTIONS " }\n"
 		"\n"
-		"Note: Only xdp, tcx, tc, flow_dissector and netfilter attachments\n"
+		"Note: Only xdp, tcx, meta, tc, flow_dissector and netfilter attachments\n"
 		"      are currently supported.\n"
 		"      For progs attached to cgroups, use \"bpftool cgroup\"\n"
 		"      to dump program attachments. For program types\n"
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 22+ messages in thread

* [PATCH bpf-next 7/8] selftests/bpf: Add netlink helper library
  2023-09-26  5:59 [PATCH bpf-next 0/8] Add bpf programmable device Daniel Borkmann
                   ` (5 preceding siblings ...)
  2023-09-26  5:59 ` [PATCH bpf-next 6/8] bpftool: Extend net dump with meta progs Daniel Borkmann
@ 2023-09-26  5:59 ` Daniel Borkmann
  2023-09-26 21:35   ` Stanislav Fomichev
  2023-09-26  5:59 ` [PATCH bpf-next 8/8] selftests/bpf: Add selftests for meta Daniel Borkmann
  7 siblings, 1 reply; 22+ messages in thread
From: Daniel Borkmann @ 2023-09-26  5:59 UTC (permalink / raw)
  To: bpf; +Cc: netdev, martin.lau, razor, ast, andrii, john.fastabend,
	Daniel Borkmann

Add a basic netlink helper library for the BPF selftests. This has been
taken and cut down/cleaned up from iproute2. More can be added at some
later point in time when needed, but for now this covers basics such as
device creation which we need for BPF selftests / BPF CI.
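
As a usage sketch (illustration only, not part of the library itself), the
typical pattern is rtnl_open(), building an RTM_NEWLINK request with the
addattr*() helpers, then rtnl_talk() and rtnl_close(). The selftest in the
next patch follows exactly this pattern for the meta device; here it is shown
for a hypothetical "dummy" link:

  #include <string.h>
  #include <linux/if_link.h>
  #include "netlink_helpers.h"

  struct iplink_req {
  	struct nlmsghdr  n;
  	struct ifinfomsg i;
  	char             buf[1024];
  };

  static int create_dummy(const char *name)
  {
  	struct rtnl_handle rth = { .fd = -1 };
  	struct iplink_req req = {};
  	struct rtattr *linkinfo;
  	int err;

  	if (rtnl_open(&rth, 0))
  		return -1;
  	req.n.nlmsg_len = NLMSG_LENGTH(sizeof(struct ifinfomsg));
  	req.n.nlmsg_flags = NLM_F_REQUEST | NLM_F_CREATE | NLM_F_EXCL;
  	req.n.nlmsg_type = RTM_NEWLINK;
  	req.i.ifi_family = AF_UNSPEC;

  	addattr_l(&req.n, sizeof(req), IFLA_IFNAME, name, strlen(name));
  	linkinfo = addattr_nest(&req.n, sizeof(req), IFLA_LINKINFO);
  	addattr_l(&req.n, sizeof(req), IFLA_INFO_KIND, "dummy", strlen("dummy"));
  	addattr_nest_end(&req.n, linkinfo);

  	err = rtnl_talk(&rth, &req.n, NULL);
  	rtnl_close(&rth);
  	return err;
  }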

Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
---
 tools/testing/selftests/bpf/Makefile          |  19 +-
 tools/testing/selftests/bpf/netlink_helpers.c | 358 ++++++++++++++++++
 tools/testing/selftests/bpf/netlink_helpers.h |  46 +++
 3 files changed, 418 insertions(+), 5 deletions(-)
 create mode 100644 tools/testing/selftests/bpf/netlink_helpers.c
 create mode 100644 tools/testing/selftests/bpf/netlink_helpers.h

diff --git a/tools/testing/selftests/bpf/Makefile b/tools/testing/selftests/bpf/Makefile
index 47365161b6fc..b8186ceb31dc 100644
--- a/tools/testing/selftests/bpf/Makefile
+++ b/tools/testing/selftests/bpf/Makefile
@@ -579,11 +579,20 @@ endef
 # Define test_progs test runner.
 TRUNNER_TESTS_DIR := prog_tests
 TRUNNER_BPF_PROGS_DIR := progs
-TRUNNER_EXTRA_SOURCES := test_progs.c cgroup_helpers.c trace_helpers.c	\
-			 network_helpers.c testing_helpers.c		\
-			 btf_helpers.c flow_dissector_load.h		\
-			 cap_helpers.c test_loader.c xsk.c disasm.c	\
-			 json_writer.c unpriv_helpers.c 		\
+TRUNNER_EXTRA_SOURCES := test_progs.c		\
+			 cgroup_helpers.c	\
+			 trace_helpers.c	\
+			 network_helpers.c	\
+			 testing_helpers.c	\
+			 btf_helpers.c		\
+			 cap_helpers.c		\
+			 unpriv_helpers.c 	\
+			 netlink_helpers.c	\
+			 test_loader.c		\
+			 xsk.c			\
+			 disasm.c		\
+			 json_writer.c 		\
+			 flow_dissector_load.h	\
 			 ip_check_defrag_frags.h
 TRUNNER_EXTRA_FILES := $(OUTPUT)/urandom_read $(OUTPUT)/bpf_testmod.ko	\
 		       $(OUTPUT)/liburandom_read.so			\
diff --git a/tools/testing/selftests/bpf/netlink_helpers.c b/tools/testing/selftests/bpf/netlink_helpers.c
new file mode 100644
index 000000000000..caf36eb1d032
--- /dev/null
+++ b/tools/testing/selftests/bpf/netlink_helpers.c
@@ -0,0 +1,358 @@
+// SPDX-License-Identifier: GPL-2.0-or-later
+/* Taken & modified from iproute2's libnetlink.c
+ * Authors: Alexey Kuznetsov, <kuznet@ms2.inr.ac.ru>
+ */
+#include <stdio.h>
+#include <stdlib.h>
+#include <unistd.h>
+#include <errno.h>
+#include <time.h>
+#include <sys/socket.h>
+
+#include "netlink_helpers.h"
+
+static int rcvbuf = 1024 * 1024;
+
+void rtnl_close(struct rtnl_handle *rth)
+{
+	if (rth->fd >= 0) {
+		close(rth->fd);
+		rth->fd = -1;
+	}
+}
+
+int rtnl_open_byproto(struct rtnl_handle *rth, unsigned int subscriptions,
+		      int protocol)
+{
+	socklen_t addr_len;
+	int sndbuf = 32768;
+	int one = 1;
+
+	memset(rth, 0, sizeof(*rth));
+	rth->proto = protocol;
+	rth->fd = socket(AF_NETLINK, SOCK_RAW | SOCK_CLOEXEC, protocol);
+	if (rth->fd < 0) {
+		perror("Cannot open netlink socket");
+		return -1;
+	}
+	if (setsockopt(rth->fd, SOL_SOCKET, SO_SNDBUF,
+		       &sndbuf, sizeof(sndbuf)) < 0) {
+		perror("SO_SNDBUF");
+		goto err;
+	}
+	if (setsockopt(rth->fd, SOL_SOCKET, SO_RCVBUF,
+		       &rcvbuf, sizeof(rcvbuf)) < 0) {
+		perror("SO_RCVBUF");
+		goto err;
+	}
+
+	/* Older kernels may not support extended ACK reporting */
+	setsockopt(rth->fd, SOL_NETLINK, NETLINK_EXT_ACK,
+		   &one, sizeof(one));
+
+	memset(&rth->local, 0, sizeof(rth->local));
+	rth->local.nl_family = AF_NETLINK;
+	rth->local.nl_groups = subscriptions;
+
+	if (bind(rth->fd, (struct sockaddr *)&rth->local,
+		 sizeof(rth->local)) < 0) {
+		perror("Cannot bind netlink socket");
+		goto err;
+	}
+	addr_len = sizeof(rth->local);
+	if (getsockname(rth->fd, (struct sockaddr *)&rth->local,
+			&addr_len) < 0) {
+		perror("Cannot getsockname");
+		goto err;
+	}
+	if (addr_len != sizeof(rth->local)) {
+		fprintf(stderr, "Wrong address length %d\n", addr_len);
+		goto err;
+	}
+	if (rth->local.nl_family != AF_NETLINK) {
+		fprintf(stderr, "Wrong address family %d\n",
+			rth->local.nl_family);
+		goto err;
+	}
+	rth->seq = time(NULL);
+	return 0;
+err:
+	rtnl_close(rth);
+	return -1;
+}
+
+int rtnl_open(struct rtnl_handle *rth, unsigned int subscriptions)
+{
+	return rtnl_open_byproto(rth, subscriptions, NETLINK_ROUTE);
+}
+
+static int __rtnl_recvmsg(int fd, struct msghdr *msg, int flags)
+{
+	int len;
+
+	do {
+		len = recvmsg(fd, msg, flags);
+	} while (len < 0 && (errno == EINTR || errno == EAGAIN));
+	if (len < 0) {
+		fprintf(stderr, "netlink receive error %s (%d)\n",
+			strerror(errno), errno);
+		return -errno;
+	}
+	if (len == 0) {
+		fprintf(stderr, "EOF on netlink\n");
+		return -ENODATA;
+	}
+	return len;
+}
+
+static int rtnl_recvmsg(int fd, struct msghdr *msg, char **answer)
+{
+	struct iovec *iov = msg->msg_iov;
+	char *buf;
+	int len;
+
+	iov->iov_base = NULL;
+	iov->iov_len = 0;
+
+	len = __rtnl_recvmsg(fd, msg, MSG_PEEK | MSG_TRUNC);
+	if (len < 0)
+		return len;
+	if (len < 32768)
+		len = 32768;
+	buf = malloc(len);
+	if (!buf) {
+		fprintf(stderr, "malloc error: not enough buffer\n");
+		return -ENOMEM;
+	}
+	iov->iov_base = buf;
+	iov->iov_len = len;
+	len = __rtnl_recvmsg(fd, msg, 0);
+	if (len < 0) {
+		free(buf);
+		return len;
+	}
+	if (answer)
+		*answer = buf;
+	else
+		free(buf);
+	return len;
+}
+
+static void rtnl_talk_error(struct nlmsghdr *h, struct nlmsgerr *err,
+			    nl_ext_ack_fn_t errfn)
+{
+	fprintf(stderr, "RTNETLINK answers: %s\n",
+		strerror(-err->error));
+}
+
+static int __rtnl_talk_iov(struct rtnl_handle *rtnl, struct iovec *iov,
+			   size_t iovlen, struct nlmsghdr **answer,
+			   bool show_rtnl_err, nl_ext_ack_fn_t errfn)
+{
+	struct sockaddr_nl nladdr = { .nl_family = AF_NETLINK };
+	struct iovec riov;
+	struct msghdr msg = {
+		.msg_name	= &nladdr,
+		.msg_namelen	= sizeof(nladdr),
+		.msg_iov	= iov,
+		.msg_iovlen	= iovlen,
+	};
+	unsigned int seq = 0;
+	struct nlmsghdr *h;
+	int i, status;
+	char *buf;
+
+	for (i = 0; i < iovlen; i++) {
+		h = iov[i].iov_base;
+		h->nlmsg_seq = seq = ++rtnl->seq;
+		if (answer == NULL)
+			h->nlmsg_flags |= NLM_F_ACK;
+	}
+	status = sendmsg(rtnl->fd, &msg, 0);
+	if (status < 0) {
+		perror("Cannot talk to rtnetlink");
+		return -1;
+	}
+	/* change msg to use the response iov */
+	msg.msg_iov = &riov;
+	msg.msg_iovlen = 1;
+	i = 0;
+	while (1) {
+next:
+		status = rtnl_recvmsg(rtnl->fd, &msg, &buf);
+		++i;
+		if (status < 0)
+			return status;
+		if (msg.msg_namelen != sizeof(nladdr)) {
+			fprintf(stderr,
+				"Sender address length == %d!\n",
+				msg.msg_namelen);
+			exit(1);
+		}
+		for (h = (struct nlmsghdr *)buf; status >= sizeof(*h); ) {
+			int len = h->nlmsg_len;
+			int l = len - sizeof(*h);
+
+			if (l < 0 || len > status) {
+				if (msg.msg_flags & MSG_TRUNC) {
+					fprintf(stderr, "Truncated message!\n");
+					free(buf);
+					return -1;
+				}
+				fprintf(stderr,
+					"Malformed message: len=%d!\n",
+					len);
+				exit(1);
+			}
+			if (nladdr.nl_pid != 0 ||
+			    h->nlmsg_pid != rtnl->local.nl_pid ||
+			    h->nlmsg_seq > seq || h->nlmsg_seq < seq - iovlen) {
+				/* Don't forget to skip that message. */
+				status -= NLMSG_ALIGN(len);
+				h = (struct nlmsghdr *)((char *)h + NLMSG_ALIGN(len));
+				continue;
+			}
+			if (h->nlmsg_type == NLMSG_ERROR) {
+				struct nlmsgerr *err = (struct nlmsgerr *)NLMSG_DATA(h);
+				int error = err->error;
+
+				if (l < sizeof(struct nlmsgerr)) {
+					fprintf(stderr, "ERROR truncated\n");
+					free(buf);
+					return -1;
+				}
+				if (error) {
+					errno = -error;
+					if (rtnl->proto != NETLINK_SOCK_DIAG &&
+					    show_rtnl_err)
+						rtnl_talk_error(h, err, errfn);
+				}
+				if (i < iovlen) {
+					free(buf);
+					goto next;
+				}
+				if (error) {
+					free(buf);
+					return -i;
+				}
+				if (answer)
+					*answer = (struct nlmsghdr *)buf;
+				else
+					free(buf);
+				return 0;
+			}
+			if (answer) {
+				*answer = (struct nlmsghdr *)buf;
+				return 0;
+			}
+			fprintf(stderr, "Unexpected reply!\n");
+			status -= NLMSG_ALIGN(len);
+			h = (struct nlmsghdr *)((char *)h + NLMSG_ALIGN(len));
+		}
+		free(buf);
+		if (msg.msg_flags & MSG_TRUNC) {
+			fprintf(stderr, "Message truncated!\n");
+			continue;
+		}
+		if (status) {
+			fprintf(stderr, "Remnant of size %d!\n", status);
+			exit(1);
+		}
+	}
+}
+
+static int __rtnl_talk(struct rtnl_handle *rtnl, struct nlmsghdr *n,
+		       struct nlmsghdr **answer, bool show_rtnl_err,
+		       nl_ext_ack_fn_t errfn)
+{
+	struct iovec iov = {
+		.iov_base	= n,
+		.iov_len	= n->nlmsg_len,
+	};
+
+	return __rtnl_talk_iov(rtnl, &iov, 1, answer, show_rtnl_err, errfn);
+}
+
+int rtnl_talk(struct rtnl_handle *rtnl, struct nlmsghdr *n,
+	      struct nlmsghdr **answer)
+{
+	return __rtnl_talk(rtnl, n, answer, true, NULL);
+}
+
+int addattr(struct nlmsghdr *n, int maxlen, int type)
+{
+	return addattr_l(n, maxlen, type, NULL, 0);
+}
+
+int addattr8(struct nlmsghdr *n, int maxlen, int type, __u8 data)
+{
+	return addattr_l(n, maxlen, type, &data, sizeof(__u8));
+}
+
+int addattr16(struct nlmsghdr *n, int maxlen, int type, __u16 data)
+{
+	return addattr_l(n, maxlen, type, &data, sizeof(__u16));
+}
+
+int addattr32(struct nlmsghdr *n, int maxlen, int type, __u32 data)
+{
+	return addattr_l(n, maxlen, type, &data, sizeof(__u32));
+}
+
+int addattr64(struct nlmsghdr *n, int maxlen, int type, __u64 data)
+{
+	return addattr_l(n, maxlen, type, &data, sizeof(__u64));
+}
+
+int addattrstrz(struct nlmsghdr *n, int maxlen, int type, const char *str)
+{
+	return addattr_l(n, maxlen, type, str, strlen(str)+1);
+}
+
+int addattr_l(struct nlmsghdr *n, int maxlen, int type, const void *data,
+	      int alen)
+{
+	int len = RTA_LENGTH(alen);
+	struct rtattr *rta;
+
+	if (NLMSG_ALIGN(n->nlmsg_len) + RTA_ALIGN(len) > maxlen) {
+		fprintf(stderr, "%s: Message exceeded bound of %d\n",
+			__func__, maxlen);
+		return -1;
+	}
+	rta = NLMSG_TAIL(n);
+	rta->rta_type = type;
+	rta->rta_len = len;
+	if (alen)
+		memcpy(RTA_DATA(rta), data, alen);
+	n->nlmsg_len = NLMSG_ALIGN(n->nlmsg_len) + RTA_ALIGN(len);
+	return 0;
+}
+
+int addraw_l(struct nlmsghdr *n, int maxlen, const void *data, int len)
+{
+	if (NLMSG_ALIGN(n->nlmsg_len) + NLMSG_ALIGN(len) > maxlen) {
+		fprintf(stderr, "%s: Message exceeded bound of %d\n",
+			__func__, maxlen);
+		return -1;
+	}
+
+	memcpy(NLMSG_TAIL(n), data, len);
+	memset((void *) NLMSG_TAIL(n) + len, 0, NLMSG_ALIGN(len) - len);
+	n->nlmsg_len = NLMSG_ALIGN(n->nlmsg_len) + NLMSG_ALIGN(len);
+	return 0;
+}
+
+struct rtattr *addattr_nest(struct nlmsghdr *n, int maxlen, int type)
+{
+	struct rtattr *nest = NLMSG_TAIL(n);
+
+	addattr_l(n, maxlen, type, NULL, 0);
+	return nest;
+}
+
+int addattr_nest_end(struct nlmsghdr *n, struct rtattr *nest)
+{
+	nest->rta_len = (void *)NLMSG_TAIL(n) - (void *)nest;
+	return n->nlmsg_len;
+}
diff --git a/tools/testing/selftests/bpf/netlink_helpers.h b/tools/testing/selftests/bpf/netlink_helpers.h
new file mode 100644
index 000000000000..68116818a47e
--- /dev/null
+++ b/tools/testing/selftests/bpf/netlink_helpers.h
@@ -0,0 +1,46 @@
+/* SPDX-License-Identifier: GPL-2.0-or-later */
+#ifndef NETLINK_HELPERS_H
+#define NETLINK_HELPERS_H
+
+#include <string.h>
+#include <linux/netlink.h>
+#include <linux/rtnetlink.h>
+
+struct rtnl_handle {
+	int			fd;
+	struct sockaddr_nl	local;
+	struct sockaddr_nl	peer;
+	__u32			seq;
+	__u32			dump;
+	int			proto;
+	FILE			*dump_fp;
+#define RTNL_HANDLE_F_LISTEN_ALL_NSID		0x01
+#define RTNL_HANDLE_F_SUPPRESS_NLERR		0x02
+#define RTNL_HANDLE_F_STRICT_CHK		0x04
+	int			flags;
+};
+
+#define NLMSG_TAIL(nmsg) \
+	((struct rtattr *) (((void *) (nmsg)) + NLMSG_ALIGN((nmsg)->nlmsg_len)))
+
+typedef int (*nl_ext_ack_fn_t)(const char *errmsg, uint32_t off,
+			       const struct nlmsghdr *inner_nlh);
+
+int rtnl_open(struct rtnl_handle *rth, unsigned int subscriptions)
+	      __attribute__((warn_unused_result));
+void rtnl_close(struct rtnl_handle *rth);
+int rtnl_talk(struct rtnl_handle *rtnl, struct nlmsghdr *n,
+	      struct nlmsghdr **answer)
+	      __attribute__((warn_unused_result));
+
+int addattr(struct nlmsghdr *n, int maxlen, int type);
+int addattr8(struct nlmsghdr *n, int maxlen, int type, __u8 data);
+int addattr16(struct nlmsghdr *n, int maxlen, int type, __u16 data);
+int addattr32(struct nlmsghdr *n, int maxlen, int type, __u32 data);
+int addattr64(struct nlmsghdr *n, int maxlen, int type, __u64 data);
+int addattrstrz(struct nlmsghdr *n, int maxlen, int type, const char *data);
+int addattr_l(struct nlmsghdr *n, int maxlen, int type, const void *data, int alen);
+int addraw_l(struct nlmsghdr *n, int maxlen, const void *data, int len);
+struct rtattr *addattr_nest(struct nlmsghdr *n, int maxlen, int type);
+int addattr_nest_end(struct nlmsghdr *n, struct rtattr *nest);
+#endif /* NETLINK_HELPERS_H */
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 22+ messages in thread

* [PATCH bpf-next 8/8] selftests/bpf: Add selftests for meta
  2023-09-26  5:59 [PATCH bpf-next 0/8] Add bpf programmable device Daniel Borkmann
                   ` (6 preceding siblings ...)
  2023-09-26  5:59 ` [PATCH bpf-next 7/8] selftests/bpf: Add netlink helper library Daniel Borkmann
@ 2023-09-26  5:59 ` Daniel Borkmann
  7 siblings, 0 replies; 22+ messages in thread
From: Daniel Borkmann @ 2023-09-26  5:59 UTC (permalink / raw)
  To: bpf; +Cc: netdev, martin.lau, razor, ast, andrii, john.fastabend,
	Daniel Borkmann

Add a bigger batch of test coverage to assert correct operation of the
meta device and its BPF program management:

  # ./vmtest.sh -- ./test_progs -t tc_meta
  [...]
  ./test_progs -t tc_meta
  [    1.211407] bpf_testmod: loading out-of-tree module taints kernel.
  [    1.211805] bpf_testmod: module verification failed: signature and/or required key missing - tainting kernel
  [    1.271692] tsc: Refined TSC clocksource calibration: 3407.989 MHz
  [    1.274015] clocksource: tsc: mask: 0xffffffffffffffff max_cycles: 0x311fc9c9451, max_idle_ns: 440795361646 ns
  [    1.275241] clocksource: Switched to clocksource tsc
  #255     tc_meta_basic:OK
  #256     tc_meta_device:OK
  #257     tc_meta_multi_links:OK
  #258     tc_meta_multi_opts:OK
  #259     tc_meta_neigh_links:OK
  Summary: 5/0 PASSED, 0 SKIPPED, 0 FAILED
  [...]
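
Condensed into a sketch (error handling and setup trimmed; skeleton, helpers
and the ifindex are the ones used by the selftest code below), the basic flow
the tests exercise is: attach tc1 to the primary and tc2 to the peer side of
the device, send an ICMP packet across it, and verify both programs ran:

  LIBBPF_OPTS(bpf_meta_opts, optl);
  struct test_tc_link *skel = test_tc_link__open_and_load();

  skel->links.tc1 = bpf_program__attach_meta(skel->progs.tc1, ifindex,
  					     false /* primary */, &optl);
  skel->links.tc2 = bpf_program__attach_meta(skel->progs.tc2, ifindex,
  					     true /* peer */, &optl);

  ASSERT_EQ(send_icmp(), 0, "icmp_pkt");
  ASSERT_EQ(skel->bss->seen_tc1, true, "seen_tc1");
  ASSERT_EQ(skel->bss->seen_tc2, true, "seen_tc2");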

Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
---
 tools/testing/selftests/bpf/config            |   1 +
 .../selftests/bpf/prog_tests/tc_helpers.h     |   4 +
 .../selftests/bpf/prog_tests/tc_meta.c        | 650 ++++++++++++++++++
 .../selftests/bpf/progs/test_tc_link.c        |  13 +
 4 files changed, 668 insertions(+)
 create mode 100644 tools/testing/selftests/bpf/prog_tests/tc_meta.c

diff --git a/tools/testing/selftests/bpf/config b/tools/testing/selftests/bpf/config
index e41eb33b2704..c00bac17dace 100644
--- a/tools/testing/selftests/bpf/config
+++ b/tools/testing/selftests/bpf/config
@@ -43,6 +43,7 @@ CONFIG_IPV6_TUNNEL=y
 CONFIG_KEYS=y
 CONFIG_LIRC=y
 CONFIG_LWTUNNEL=y
+CONFIG_META=y
 CONFIG_MODULE_SIG=y
 CONFIG_MODULE_SRCVERSION_ALL=y
 CONFIG_MODULE_UNLOAD=y
diff --git a/tools/testing/selftests/bpf/prog_tests/tc_helpers.h b/tools/testing/selftests/bpf/prog_tests/tc_helpers.h
index 6c93215be8a3..03dbb10918ae 100644
--- a/tools/testing/selftests/bpf/prog_tests/tc_helpers.h
+++ b/tools/testing/selftests/bpf/prog_tests/tc_helpers.h
@@ -4,6 +4,10 @@
 #define TC_HELPERS
 #include <test_progs.h>
 
+#ifndef loopback
+# define loopback 1
+#endif
+
 static inline __u32 id_from_prog_fd(int fd)
 {
 	struct bpf_prog_info prog_info = {};
diff --git a/tools/testing/selftests/bpf/prog_tests/tc_meta.c b/tools/testing/selftests/bpf/prog_tests/tc_meta.c
new file mode 100644
index 000000000000..c528fcedd519
--- /dev/null
+++ b/tools/testing/selftests/bpf/prog_tests/tc_meta.c
@@ -0,0 +1,650 @@
+// SPDX-License-Identifier: GPL-2.0
+/* Copyright (c) 2023 Isovalent */
+#include <uapi/linux/if_link.h>
+#include <net/if.h>
+#include <test_progs.h>
+
+#define meta_peer "m0"
+#define meta_name "m1"
+
+#define ping_addr_neigh		0x0a000002 /* 10.0.0.2 */
+#define ping_addr_noneigh	0x0a000003 /* 10.0.0.3 */
+
+#include "test_tc_link.skel.h"
+#include "netlink_helpers.h"
+#include "tc_helpers.h"
+
+#define ICMP_ECHO 8
+
+struct icmphdr {
+	__u8		type;
+	__u8		code;
+	__sum16		checksum;
+	struct {
+		__be16	id;
+		__be16	sequence;
+	} echo;
+};
+
+struct iplink_req {
+	struct nlmsghdr  n;
+	struct ifinfomsg i;
+	char             buf[1024];
+};
+
+static int create_meta(int mode, int policy, int peer_policy, int *ifindex,
+		       bool same_netns)
+{
+	struct rtnl_handle rth = { .fd = -1 };
+	struct iplink_req req = {};
+	struct rtattr *linkinfo, *data;
+	const char *type = "meta";
+	int err;
+
+	err = rtnl_open(&rth, 0);
+	if (!ASSERT_OK(err, "open_rtnetlink"))
+		return err;
+
+	memset(&req, 0, sizeof(req));
+	req.n.nlmsg_len = NLMSG_LENGTH(sizeof(struct ifinfomsg));
+	req.n.nlmsg_flags = NLM_F_REQUEST | NLM_F_CREATE | NLM_F_EXCL;
+	req.n.nlmsg_type = RTM_NEWLINK;
+	req.i.ifi_family = AF_UNSPEC;
+
+	addattr_l(&req.n, sizeof(req), IFLA_IFNAME, meta_name,
+		  strlen(meta_name));
+	linkinfo = addattr_nest(&req.n, sizeof(req), IFLA_LINKINFO);
+	addattr_l(&req.n, sizeof(req), IFLA_INFO_KIND, type, strlen(type));
+	data = addattr_nest(&req.n, sizeof(req), IFLA_INFO_DATA);
+	addattr32(&req.n, sizeof(req), IFLA_META_POLICY, policy);
+	addattr32(&req.n, sizeof(req), IFLA_META_PEER_POLICY, peer_policy);
+	addattr32(&req.n, sizeof(req), IFLA_META_MODE, mode);
+	addattr_nest_end(&req.n, data);
+	addattr_nest_end(&req.n, linkinfo);
+
+	err = rtnl_talk(&rth, &req.n, NULL);
+	ASSERT_OK(err, "talk_rtnetlink");
+	rtnl_close(&rth);
+	*ifindex = if_nametoindex(meta_name);
+
+	ASSERT_GT(*ifindex, 0, "retrieve_ifindex");
+	ASSERT_OK(system("ip netns add foo"), "create netns");
+	ASSERT_OK(system("ip link set dev " meta_name " up"),
+			 "up primary");
+	ASSERT_OK(system("ip addr add dev " meta_name " 10.0.0.1/24"),
+			 "addr primary");
+	if (same_netns) {
+		ASSERT_OK(system("ip link set dev " meta_peer " up"),
+				 "up peer");
+		ASSERT_OK(system("ip addr add dev " meta_peer " 10.0.0.2/24"),
+				 "addr peer");
+	} else {
+		ASSERT_OK(system("ip link set " meta_peer " netns foo"),
+				 "move peer");
+		ASSERT_OK(system("ip netns exec foo ip link set dev "
+				 meta_peer " up"), "up peer");
+		ASSERT_OK(system("ip netns exec foo ip addr add dev "
+				 meta_peer " 10.0.0.2/24"), "addr peer");
+	}
+	return err;
+}
+
+static void destroy_meta(void)
+{
+	ASSERT_OK(system("ip link del dev " meta_name), "del primary");
+	ASSERT_OK(system("ip netns del foo"), "delete netns");
+	ASSERT_EQ(if_nametoindex(meta_name), 0, meta_name "_ifindex");
+}
+
+static int __send_icmp(__u32 dest)
+{
+	struct sockaddr_in addr;
+	struct icmphdr icmp;
+	int sock, ret;
+
+	ret = write_sysctl("/proc/sys/net/ipv4/ping_group_range", "0 0");
+	if (!ASSERT_OK(ret, "write_sysctl(net.ipv4.ping_group_range)"))
+		return ret;
+
+	sock = socket(AF_INET, SOCK_DGRAM, IPPROTO_ICMP);
+	if (!ASSERT_GE(sock, 0, "icmp_socket"))
+		return -errno;
+
+	ret = setsockopt(sock, SOL_SOCKET, SO_BINDTODEVICE,
+			 meta_name, strlen(meta_name) + 1);
+	if (!ASSERT_OK(ret, "setsockopt(SO_BINDTODEVICE)"))
+		goto out;
+
+	memset(&addr, 0, sizeof(addr));
+	addr.sin_family = AF_INET;
+	addr.sin_addr.s_addr = htonl(dest);
+
+	memset(&icmp, 0, sizeof(icmp));
+	icmp.type = ICMP_ECHO;
+	icmp.echo.id = 1234;
+	icmp.echo.sequence = 1;
+
+	ret = sendto(sock, &icmp, sizeof(icmp), 0,
+		     (struct sockaddr *)&addr, sizeof(addr));
+	if (!ASSERT_GE(ret, 0, "icmp_sendto"))
+		ret = -errno;
+	else
+		ret = 0;
+out:
+	close(sock);
+	return ret;
+}
+
+static int send_icmp(void)
+{
+	return __send_icmp(ping_addr_neigh);
+}
+
+void serial_test_tc_meta_basic(void)
+{
+	LIBBPF_OPTS(bpf_prog_query_opts, optq);
+	LIBBPF_OPTS(bpf_meta_opts, optl);
+	__u32 prog_ids[2], link_ids[2];
+	__u32 pid1, pid2, lid1, lid2;
+	struct test_tc_link *skel;
+	struct bpf_link *link;
+	int err, ifindex;
+
+	err = create_meta(META_L2, META_PASS, META_PASS, &ifindex, false);
+	if (err)
+		return;
+
+	skel = test_tc_link__open_and_load();
+	if (!ASSERT_OK_PTR(skel, "skel_load"))
+		goto cleanup;
+
+	pid1 = id_from_prog_fd(bpf_program__fd(skel->progs.tc1));
+	pid2 = id_from_prog_fd(bpf_program__fd(skel->progs.tc2));
+
+	ASSERT_NEQ(pid1, pid2, "prog_ids_1_2");
+
+	assert_mprog_count_ifindex(ifindex, BPF_META_PRIMARY, 0);
+	assert_mprog_count_ifindex(ifindex, BPF_META_PEER, 0);
+
+	ASSERT_EQ(skel->bss->seen_tc1, false, "seen_tc1");
+	ASSERT_EQ(skel->bss->seen_tc2, false, "seen_tc2");
+
+	link = bpf_program__attach_meta(skel->progs.tc1, ifindex, false, &optl);
+	if (!ASSERT_OK_PTR(link, "link_attach"))
+		goto cleanup;
+
+	skel->links.tc1 = link;
+
+	lid1 = id_from_link_fd(bpf_link__fd(skel->links.tc1));
+
+	assert_mprog_count_ifindex(ifindex, BPF_META_PRIMARY, 1);
+	assert_mprog_count_ifindex(ifindex, BPF_META_PEER, 0);
+
+	optq.prog_ids = prog_ids;
+	optq.link_ids = link_ids;
+
+	memset(prog_ids, 0, sizeof(prog_ids));
+	memset(link_ids, 0, sizeof(link_ids));
+	optq.count = ARRAY_SIZE(prog_ids);
+
+	err = bpf_prog_query_opts(ifindex, BPF_META_PRIMARY, &optq);
+	if (!ASSERT_OK(err, "prog_query"))
+		goto cleanup;
+
+	ASSERT_EQ(optq.count, 1, "count");
+	ASSERT_EQ(optq.revision, 2, "revision");
+	ASSERT_EQ(optq.prog_ids[0], pid1, "prog_ids[0]");
+	ASSERT_EQ(optq.link_ids[0], lid1, "link_ids[0]");
+	ASSERT_EQ(optq.prog_ids[1], 0, "prog_ids[1]");
+	ASSERT_EQ(optq.link_ids[1], 0, "link_ids[1]");
+
+	ASSERT_EQ(send_icmp(), 0, "icmp_pkt");
+
+	ASSERT_EQ(skel->bss->seen_tc1, true, "seen_tc1");
+	ASSERT_EQ(skel->bss->seen_tc2, false, "seen_tc2");
+
+	link = bpf_program__attach_meta(skel->progs.tc2, ifindex, true, &optl);
+	if (!ASSERT_OK_PTR(link, "link_attach"))
+		goto cleanup;
+
+	skel->links.tc2 = link;
+
+	lid2 = id_from_link_fd(bpf_link__fd(skel->links.tc2));
+	ASSERT_NEQ(lid1, lid2, "link_ids_1_2");
+
+	skel->bss->seen_tc1 = false;
+	skel->bss->seen_tc2 = false;
+
+	assert_mprog_count_ifindex(ifindex, BPF_META_PRIMARY, 1);
+	assert_mprog_count_ifindex(ifindex, BPF_META_PEER, 1);
+
+	memset(prog_ids, 0, sizeof(prog_ids));
+	memset(link_ids, 0, sizeof(link_ids));
+	optq.count = ARRAY_SIZE(prog_ids);
+
+	err = bpf_prog_query_opts(ifindex, BPF_META_PEER, &optq);
+	if (!ASSERT_OK(err, "prog_query"))
+		goto cleanup;
+
+	ASSERT_EQ(optq.count, 1, "count");
+	ASSERT_EQ(optq.revision, 2, "revision");
+	ASSERT_EQ(optq.prog_ids[0], pid2, "prog_ids[0]");
+	ASSERT_EQ(optq.link_ids[0], lid2, "link_ids[0]");
+	ASSERT_EQ(optq.prog_ids[1], 0, "prog_ids[1]");
+	ASSERT_EQ(optq.link_ids[1], 0, "link_ids[1]");
+
+	ASSERT_EQ(send_icmp(), 0, "icmp_pkt");
+
+	ASSERT_EQ(skel->bss->seen_tc1, true, "seen_tc1");
+	ASSERT_EQ(skel->bss->seen_tc2, true, "seen_tc2");
+cleanup:
+	test_tc_link__destroy(skel);
+
+	assert_mprog_count_ifindex(ifindex, BPF_META_PRIMARY, 0);
+	assert_mprog_count_ifindex(ifindex, BPF_META_PEER, 0);
+	destroy_meta();
+}
+
+void serial_test_tc_meta_multi_links_target(int mode, int target)
+{
+	LIBBPF_OPTS(bpf_prog_query_opts, optq);
+	LIBBPF_OPTS(bpf_meta_opts, optl);
+	__u32 prog_ids[3], link_ids[3];
+	__u32 pid1, pid2, lid1, lid2;
+	bool peer = target == BPF_META_PEER;
+	struct test_tc_link *skel;
+	struct bpf_link *link;
+	int err, ifindex;
+
+	err = create_meta(mode, META_PASS, META_PASS, &ifindex, false);
+	if (err)
+		return;
+
+	skel = test_tc_link__open_and_load();
+	if (!ASSERT_OK_PTR(skel, "skel_load"))
+		goto cleanup;
+
+	pid1 = id_from_prog_fd(bpf_program__fd(skel->progs.tc1));
+	pid2 = id_from_prog_fd(bpf_program__fd(skel->progs.tc2));
+
+	ASSERT_NEQ(pid1, pid2, "prog_ids_1_2");
+
+	assert_mprog_count_ifindex(ifindex, target, 0);
+
+	ASSERT_EQ(skel->bss->seen_tc1, false, "seen_tc1");
+	ASSERT_EQ(skel->bss->seen_eth, false, "seen_eth");
+	ASSERT_EQ(skel->bss->seen_tc2, false, "seen_tc2");
+
+	link = bpf_program__attach_meta(skel->progs.tc1, ifindex, peer, &optl);
+	if (!ASSERT_OK_PTR(link, "link_attach"))
+		goto cleanup;
+
+	skel->links.tc1 = link;
+
+	lid1 = id_from_link_fd(bpf_link__fd(skel->links.tc1));
+
+	assert_mprog_count_ifindex(ifindex, target, 1);
+
+	optq.prog_ids = prog_ids;
+	optq.link_ids = link_ids;
+
+	memset(prog_ids, 0, sizeof(prog_ids));
+	memset(link_ids, 0, sizeof(link_ids));
+	optq.count = ARRAY_SIZE(prog_ids);
+
+	err = bpf_prog_query_opts(ifindex, target, &optq);
+	if (!ASSERT_OK(err, "prog_query"))
+		goto cleanup;
+
+	ASSERT_EQ(optq.count, 1, "count");
+	ASSERT_EQ(optq.revision, 2, "revision");
+	ASSERT_EQ(optq.prog_ids[0], pid1, "prog_ids[0]");
+	ASSERT_EQ(optq.link_ids[0], lid1, "link_ids[0]");
+	ASSERT_EQ(optq.prog_ids[1], 0, "prog_ids[1]");
+	ASSERT_EQ(optq.link_ids[1], 0, "link_ids[1]");
+
+	ASSERT_EQ(send_icmp(), 0, "icmp_pkt");
+
+	ASSERT_EQ(skel->bss->seen_tc1, true, "seen_tc1");
+	ASSERT_EQ(skel->bss->seen_eth, true, "seen_eth");
+	ASSERT_EQ(skel->bss->seen_tc2, false, "seen_tc2");
+
+	LIBBPF_OPTS_RESET(optl,
+		.flags = BPF_F_BEFORE,
+		.relative_fd = bpf_program__fd(skel->progs.tc1),
+	);
+
+	link = bpf_program__attach_meta(skel->progs.tc2, ifindex, peer, &optl);
+	if (!ASSERT_OK_PTR(link, "link_attach"))
+		goto cleanup;
+
+	skel->links.tc2 = link;
+
+	lid2 = id_from_link_fd(bpf_link__fd(skel->links.tc2));
+	ASSERT_NEQ(lid1, lid2, "link_ids_1_2");
+
+	skel->bss->seen_tc1 = false;
+	skel->bss->seen_eth = false;
+	skel->bss->seen_tc2 = false;
+
+	assert_mprog_count_ifindex(ifindex, target, 2);
+
+	memset(prog_ids, 0, sizeof(prog_ids));
+	memset(link_ids, 0, sizeof(link_ids));
+	optq.count = ARRAY_SIZE(prog_ids);
+
+	err = bpf_prog_query_opts(ifindex, target, &optq);
+	if (!ASSERT_OK(err, "prog_query"))
+		goto cleanup;
+
+	ASSERT_EQ(optq.count, 2, "count");
+	ASSERT_EQ(optq.revision, 3, "revision");
+	ASSERT_EQ(optq.prog_ids[0], pid2, "prog_ids[0]");
+	ASSERT_EQ(optq.link_ids[0], lid2, "link_ids[0]");
+	ASSERT_EQ(optq.prog_ids[1], pid1, "prog_ids[1]");
+	ASSERT_EQ(optq.link_ids[1], lid1, "link_ids[1]");
+	ASSERT_EQ(optq.prog_ids[2], 0, "prog_ids[2]");
+	ASSERT_EQ(optq.link_ids[2], 0, "link_ids[2]");
+
+	ASSERT_EQ(send_icmp(), 0, "icmp_pkt");
+
+	ASSERT_EQ(skel->bss->seen_tc1, true, "seen_tc1");
+	ASSERT_EQ(skel->bss->seen_eth, true, "seen_eth");
+	ASSERT_EQ(skel->bss->seen_tc2, true, "seen_tc2");
+cleanup:
+	test_tc_link__destroy(skel);
+
+	assert_mprog_count_ifindex(ifindex, target, 0);
+	destroy_meta();
+}
+
+void serial_test_tc_meta_multi_links(void)
+{
+	serial_test_tc_meta_multi_links_target(META_L2, BPF_META_PRIMARY);
+	serial_test_tc_meta_multi_links_target(META_L3, BPF_META_PRIMARY);
+	serial_test_tc_meta_multi_links_target(META_L2, BPF_META_PEER);
+	serial_test_tc_meta_multi_links_target(META_L3, BPF_META_PEER);
+}
+
+void serial_test_tc_meta_multi_opts_target(int mode, int target)
+{
+	LIBBPF_OPTS(bpf_prog_attach_opts, opta);
+	LIBBPF_OPTS(bpf_prog_detach_opts, optd);
+	LIBBPF_OPTS(bpf_prog_query_opts, optq);
+	__u32 pid1, pid2, fd1, fd2;
+	__u32 prog_ids[3];
+	struct test_tc_link *skel;
+	int err, ifindex;
+
+	err = create_meta(mode, META_PASS, META_PASS, &ifindex, false);
+	if (err)
+		return;
+
+	skel = test_tc_link__open_and_load();
+	if (!ASSERT_OK_PTR(skel, "skel_load"))
+		goto cleanup;
+
+	fd1 = bpf_program__fd(skel->progs.tc1);
+	fd2 = bpf_program__fd(skel->progs.tc2);
+
+	pid1 = id_from_prog_fd(fd1);
+	pid2 = id_from_prog_fd(fd2);
+
+	ASSERT_NEQ(pid1, pid2, "prog_ids_1_2");
+
+	assert_mprog_count_ifindex(ifindex, target, 0);
+
+	ASSERT_EQ(skel->bss->seen_tc1, false, "seen_tc1");
+	ASSERT_EQ(skel->bss->seen_eth, false, "seen_eth");
+	ASSERT_EQ(skel->bss->seen_tc2, false, "seen_tc2");
+
+	err = bpf_prog_attach_opts(fd1, ifindex, target, &opta);
+	if (!ASSERT_EQ(err, 0, "prog_attach"))
+		goto cleanup;
+
+	assert_mprog_count_ifindex(ifindex, target, 1);
+
+	optq.prog_ids = prog_ids;
+
+	memset(prog_ids, 0, sizeof(prog_ids));
+	optq.count = ARRAY_SIZE(prog_ids);
+
+	err = bpf_prog_query_opts(ifindex, target, &optq);
+	if (!ASSERT_OK(err, "prog_query"))
+		goto cleanup_fd1;
+
+	ASSERT_EQ(optq.count, 1, "count");
+	ASSERT_EQ(optq.revision, 2, "revision");
+	ASSERT_EQ(optq.prog_ids[0], pid1, "prog_ids[0]");
+	ASSERT_EQ(optq.prog_ids[1], 0, "prog_ids[1]");
+
+	ASSERT_EQ(send_icmp(), 0, "icmp_pkt");
+
+	ASSERT_EQ(skel->bss->seen_tc1, true, "seen_tc1");
+	ASSERT_EQ(skel->bss->seen_eth, true, "seen_eth");
+	ASSERT_EQ(skel->bss->seen_tc2, false, "seen_tc2");
+
+	LIBBPF_OPTS_RESET(opta,
+		.flags = BPF_F_BEFORE,
+		.relative_fd = fd1,
+	);
+
+	err = bpf_prog_attach_opts(fd2, ifindex, target, &opta);
+	if (!ASSERT_EQ(err, 0, "prog_attach"))
+		goto cleanup_fd1;
+
+	skel->bss->seen_tc1 = false;
+	skel->bss->seen_eth = false;
+	skel->bss->seen_tc2 = false;
+
+	assert_mprog_count_ifindex(ifindex, target, 2);
+
+	memset(prog_ids, 0, sizeof(prog_ids));
+	optq.count = ARRAY_SIZE(prog_ids);
+
+	err = bpf_prog_query_opts(ifindex, target, &optq);
+	if (!ASSERT_OK(err, "prog_query"))
+		goto cleanup_fd2;
+
+	ASSERT_EQ(optq.count, 2, "count");
+	ASSERT_EQ(optq.revision, 3, "revision");
+	ASSERT_EQ(optq.prog_ids[0], pid2, "prog_ids[0]");
+	ASSERT_EQ(optq.prog_ids[1], pid1, "prog_ids[1]");
+	ASSERT_EQ(optq.prog_ids[2], 0, "prog_ids[2]");
+
+	ASSERT_EQ(send_icmp(), 0, "icmp_pkt");
+
+	ASSERT_EQ(skel->bss->seen_tc1, true, "seen_tc1");
+	ASSERT_EQ(skel->bss->seen_eth, true, "seen_eth");
+	ASSERT_EQ(skel->bss->seen_tc2, true, "seen_tc2");
+
+cleanup_fd2:
+	err = bpf_prog_detach_opts(fd2, ifindex, target, &optd);
+	ASSERT_OK(err, "prog_detach");
+	assert_mprog_count_ifindex(ifindex, target, 1);
+cleanup_fd1:
+	err = bpf_prog_detach_opts(fd1, ifindex, target, &optd);
+	ASSERT_OK(err, "prog_detach");
+	assert_mprog_count_ifindex(ifindex, target, 0);
+cleanup:
+	test_tc_link__destroy(skel);
+
+	assert_mprog_count_ifindex(ifindex, target, 0);
+	destroy_meta();
+}
+
+void serial_test_tc_meta_multi_opts(void)
+{
+	serial_test_tc_meta_multi_opts_target(META_L2, BPF_META_PRIMARY);
+	serial_test_tc_meta_multi_opts_target(META_L3, BPF_META_PRIMARY);
+	serial_test_tc_meta_multi_opts_target(META_L2, BPF_META_PEER);
+	serial_test_tc_meta_multi_opts_target(META_L3, BPF_META_PEER);
+}
+
+void serial_test_tc_meta_device(void)
+{
+	LIBBPF_OPTS(bpf_prog_query_opts, optq);
+	LIBBPF_OPTS(bpf_meta_opts, optl);
+	__u32 prog_ids[2], link_ids[2];
+	__u32 pid1, pid2, lid1;
+	struct test_tc_link *skel;
+	struct bpf_link *link;
+	int err, ifindex, ifindex2;
+
+	err = create_meta(META_L3, META_PASS, META_PASS, &ifindex, true);
+	if (err)
+		return;
+
+	ifindex2 = if_nametoindex(meta_peer);
+	ASSERT_NEQ(ifindex, ifindex2, "ifindex_1_2");
+
+	skel = test_tc_link__open_and_load();
+	if (!ASSERT_OK_PTR(skel, "skel_load"))
+		goto cleanup;
+
+	pid1 = id_from_prog_fd(bpf_program__fd(skel->progs.tc1));
+	pid2 = id_from_prog_fd(bpf_program__fd(skel->progs.tc2));
+
+	ASSERT_NEQ(pid1, pid2, "prog_ids_1_2");
+
+	assert_mprog_count_ifindex(ifindex, BPF_META_PRIMARY, 0);
+	assert_mprog_count_ifindex(ifindex, BPF_META_PEER, 0);
+
+	ASSERT_EQ(skel->bss->seen_tc1, false, "seen_tc1");
+	ASSERT_EQ(skel->bss->seen_tc2, false, "seen_tc2");
+
+	link = bpf_program__attach_meta(skel->progs.tc1, ifindex, false, &optl);
+	if (!ASSERT_OK_PTR(link, "link_attach"))
+		goto cleanup;
+
+	skel->links.tc1 = link;
+
+	lid1 = id_from_link_fd(bpf_link__fd(skel->links.tc1));
+
+	assert_mprog_count_ifindex(ifindex, BPF_META_PRIMARY, 1);
+	assert_mprog_count_ifindex(ifindex, BPF_META_PEER, 0);
+
+	optq.prog_ids = prog_ids;
+	optq.link_ids = link_ids;
+
+	memset(prog_ids, 0, sizeof(prog_ids));
+	memset(link_ids, 0, sizeof(link_ids));
+	optq.count = ARRAY_SIZE(prog_ids);
+
+	err = bpf_prog_query_opts(ifindex, BPF_META_PRIMARY, &optq);
+	if (!ASSERT_OK(err, "prog_query"))
+		goto cleanup;
+
+	ASSERT_EQ(optq.count, 1, "count");
+	ASSERT_EQ(optq.revision, 2, "revision");
+	ASSERT_EQ(optq.prog_ids[0], pid1, "prog_ids[0]");
+	ASSERT_EQ(optq.link_ids[0], lid1, "link_ids[0]");
+	ASSERT_EQ(optq.prog_ids[1], 0, "prog_ids[1]");
+	ASSERT_EQ(optq.link_ids[1], 0, "link_ids[1]");
+
+	ASSERT_EQ(send_icmp(), 0, "icmp_pkt");
+
+	ASSERT_EQ(skel->bss->seen_tc1, true, "seen_tc1");
+	ASSERT_EQ(skel->bss->seen_tc2, false, "seen_tc2");
+
+	memset(prog_ids, 0, sizeof(prog_ids));
+	memset(link_ids, 0, sizeof(link_ids));
+	optq.count = ARRAY_SIZE(prog_ids);
+
+	err = bpf_prog_query_opts(ifindex2, BPF_META_PRIMARY, &optq);
+	ASSERT_EQ(err, -EACCES, "prog_query_should_fail");
+
+	err = bpf_prog_query_opts(ifindex2, BPF_META_PEER, &optq);
+	ASSERT_EQ(err, -EACCES, "prog_query_should_fail");
+
+	link = bpf_program__attach_meta(skel->progs.tc2, ifindex2, true, &optl);
+	if (!ASSERT_ERR_PTR(link, "link_attach_should_fail")) {
+		bpf_link__destroy(link);
+		goto cleanup;
+	}
+
+	link = bpf_program__attach_meta(skel->progs.tc2, ifindex2, false, &optl);
+	if (!ASSERT_ERR_PTR(link, "link_attach_should_fail")) {
+		bpf_link__destroy(link);
+		goto cleanup;
+	}
+
+	assert_mprog_count_ifindex(ifindex, BPF_META_PRIMARY, 1);
+	assert_mprog_count_ifindex(ifindex, BPF_META_PEER, 0);
+cleanup:
+	test_tc_link__destroy(skel);
+
+	assert_mprog_count_ifindex(ifindex, BPF_META_PRIMARY, 0);
+	assert_mprog_count_ifindex(ifindex, BPF_META_PEER, 0);
+	destroy_meta();
+}
+
+void serial_test_tc_meta_neigh_links_target(int mode, int target)
+{
+	LIBBPF_OPTS(bpf_prog_query_opts, optq);
+	LIBBPF_OPTS(bpf_meta_opts, optl);
+	__u32 prog_ids[2], link_ids[2];
+	__u32 pid1, lid1;
+	struct test_tc_link *skel;
+	struct bpf_link *link;
+	int err, ifindex;
+
+	err = create_meta(mode, META_PASS, META_PASS, &ifindex, false);
+	if (err)
+		return;
+
+	skel = test_tc_link__open_and_load();
+	if (!ASSERT_OK_PTR(skel, "skel_load"))
+		goto cleanup;
+
+	pid1 = id_from_prog_fd(bpf_program__fd(skel->progs.tc1));
+
+	assert_mprog_count_ifindex(ifindex, target, 0);
+
+	ASSERT_EQ(skel->bss->seen_tc1, false, "seen_tc1");
+	ASSERT_EQ(skel->bss->seen_eth, false, "seen_eth");
+
+	link = bpf_program__attach_meta(skel->progs.tc1, ifindex, false, &optl);
+	if (!ASSERT_OK_PTR(link, "link_attach"))
+		goto cleanup;
+
+	skel->links.tc1 = link;
+
+	lid1 = id_from_link_fd(bpf_link__fd(skel->links.tc1));
+
+	assert_mprog_count_ifindex(ifindex, target, 1);
+
+	optq.prog_ids = prog_ids;
+	optq.link_ids = link_ids;
+
+	memset(prog_ids, 0, sizeof(prog_ids));
+	memset(link_ids, 0, sizeof(link_ids));
+	optq.count = ARRAY_SIZE(prog_ids);
+
+	err = bpf_prog_query_opts(ifindex, target, &optq);
+	if (!ASSERT_OK(err, "prog_query"))
+		goto cleanup;
+
+	ASSERT_EQ(optq.count, 1, "count");
+	ASSERT_EQ(optq.revision, 2, "revision");
+	ASSERT_EQ(optq.prog_ids[0], pid1, "prog_ids[0]");
+	ASSERT_EQ(optq.link_ids[0], lid1, "link_ids[0]");
+	ASSERT_EQ(optq.prog_ids[1], 0, "prog_ids[1]");
+	ASSERT_EQ(optq.link_ids[1], 0, "link_ids[1]");
+
+	ASSERT_EQ(__send_icmp(ping_addr_noneigh), 0, "icmp_pkt");
+
+	ASSERT_EQ(skel->bss->seen_tc1, true /* L2: ARP */, "seen_tc1");
+	ASSERT_EQ(skel->bss->seen_eth, mode == META_L3, "seen_eth");
+cleanup:
+	test_tc_link__destroy(skel);
+
+	assert_mprog_count_ifindex(ifindex, target, 0);
+	destroy_meta();
+}
+
+void serial_test_tc_meta_neigh_links(void)
+{
+	serial_test_tc_meta_neigh_links_target(META_L2, BPF_META_PRIMARY);
+	serial_test_tc_meta_neigh_links_target(META_L3, BPF_META_PRIMARY);
+}
diff --git a/tools/testing/selftests/bpf/progs/test_tc_link.c b/tools/testing/selftests/bpf/progs/test_tc_link.c
index 30e7124c49a1..992400acb957 100644
--- a/tools/testing/selftests/bpf/progs/test_tc_link.c
+++ b/tools/testing/selftests/bpf/progs/test_tc_link.c
@@ -1,7 +1,11 @@
 // SPDX-License-Identifier: GPL-2.0
 /* Copyright (c) 2023 Isovalent */
 #include <stdbool.h>
+
 #include <linux/bpf.h>
+#include <linux/if_ether.h>
+
+#include <bpf/bpf_endian.h>
 #include <bpf/bpf_helpers.h>
 
 char LICENSE[] SEC("license") = "GPL";
@@ -12,10 +16,19 @@ bool seen_tc3;
 bool seen_tc4;
 bool seen_tc5;
 bool seen_tc6;
+bool seen_eth;
 
 SEC("tc/ingress")
 int tc1(struct __sk_buff *skb)
 {
+	struct ethhdr eth = {};
+
+	if (skb->protocol != __bpf_constant_htons(ETH_P_IP))
+		goto out;
+	if (bpf_skb_load_bytes(skb, 0, &eth, sizeof(eth)))
+		goto out;
+	seen_eth = eth.h_proto == bpf_htons(ETH_P_IP);
+out:
 	seen_tc1 = true;
 	return TCX_NEXT;
 }
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 22+ messages in thread

* Re: [PATCH bpf-next 4/8] libbpf: Add link-based API for meta
  2023-09-26  5:59 ` [PATCH bpf-next 4/8] libbpf: Add link-based API for meta Daniel Borkmann
@ 2023-09-26 11:19   ` Quentin Monnet
  2023-09-28  0:12   ` Andrii Nakryiko
  1 sibling, 0 replies; 22+ messages in thread
From: Quentin Monnet @ 2023-09-26 11:19 UTC (permalink / raw)
  To: Daniel Borkmann, bpf
  Cc: netdev, martin.lau, razor, ast, andrii, john.fastabend

On 26/09/2023 06:59, Daniel Borkmann wrote:
> This adds bpf_program__attach_meta() API to libbpf. Overall it is very
> similar to tcx. The API looks as following:
> 
>   LIBBPF_API struct bpf_link *
>   bpf_program__attach_meta(const struct bpf_program *prog, int ifindex,
>                            bool peer_device, const struct bpf_meta_opts *opts);
> 
> The struct bpf_meta_opts is done in similar way as struct bpf_tcx_opts.
> bpf_program__attach_meta() compared to bpf_program__attach_tcx() has one
> additional argument, that is peer_device. The latter denotes whether the
> program should be attached to the relative peer of ifindex or whether it
> should be attached to ifindex itself.
> 
> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
> ---
>  tools/lib/bpf/bpf.c      | 16 +++++++++++
>  tools/lib/bpf/bpf.h      |  5 ++++
>  tools/lib/bpf/libbpf.c   | 61 ++++++++++++++++++++++++++++++++++++----
>  tools/lib/bpf/libbpf.h   | 15 ++++++++++
>  tools/lib/bpf/libbpf.map |  1 +
>  5 files changed, 92 insertions(+), 6 deletions(-)

> diff --git a/tools/lib/bpf/libbpf.c b/tools/lib/bpf/libbpf.c
> index b4758e54a815..4d4da8ba2179 100644
> --- a/tools/lib/bpf/libbpf.c
> +++ b/tools/lib/bpf/libbpf.c
> @@ -121,6 +121,8 @@ static const char * const attach_type_name[] = {
>  	[BPF_TCX_INGRESS]		= "tcx_ingress",
>  	[BPF_TCX_EGRESS]		= "tcx_egress",
>  	[BPF_TRACE_UPROBE_MULTI]	= "trace_uprobe_multi",
> +	[BPF_META_PRIMARY]		= "meta",
> +	[BPF_META_PEER]			= "meta",

"meta_primary" and "meta_peer"? Or is there a particular reason for
making these the only array entries with identical values?


^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH bpf-next 5/8] bpftool: Implement link show support for meta
  2023-09-26  5:59 ` [PATCH bpf-next 5/8] bpftool: Implement link show support " Daniel Borkmann
@ 2023-09-26 11:19   ` Quentin Monnet
  0 siblings, 0 replies; 22+ messages in thread
From: Quentin Monnet @ 2023-09-26 11:19 UTC (permalink / raw)
  To: Daniel Borkmann, bpf
  Cc: netdev, martin.lau, razor, ast, andrii, john.fastabend

On 26/09/2023 06:59, Daniel Borkmann wrote:
> Add support to dump meta link information to bpftool in a similar way
> as we have for XDP. The meta link info only exposes the ifindex.
> 
> Below shows an example link dump output, and a cgroup link is included
> for comparison, too:
> 
>   # bpftool link
>   [...]
>   10: cgroup  prog 2466
>         cgroup_id 1  attach_type cgroup_inet6_post_bind
>   [...]
>   8: meta  prog 35
>         ifindex meta1(18)
>   [...]
> 
> Equivalent json output:
> 
>   # bpftool link --json
>   [...]
>   {
>     "id": 10,
>     "type": "cgroup",
>     "prog_id": 2466,
>     "cgroup_id": 1,
>     "attach_type": "cgroup_inet6_post_bind"
>   },
>   [...]
>   {
>     "id": 12,
>     "type": "meta",
>     "prog_id": 61,
>     "devname": "meta1",
>     "ifindex": 21
>   }
>   [...]
> 
> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>

Reviewed-by: Quentin Monnet <quentin@isovalent.com>

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH bpf-next 6/8] bpftool: Extend net dump with meta progs
  2023-09-26  5:59 ` [PATCH bpf-next 6/8] bpftool: Extend net dump with meta progs Daniel Borkmann
@ 2023-09-26 11:19   ` Quentin Monnet
  0 siblings, 0 replies; 22+ messages in thread
From: Quentin Monnet @ 2023-09-26 11:19 UTC (permalink / raw)
  To: Daniel Borkmann, bpf
  Cc: netdev, martin.lau, razor, ast, andrii, john.fastabend

On 26/09/2023 06:59, Daniel Borkmann wrote:
> Add support to dump BPF programs on meta via bpftool. This includes both
> the BPF link and attach ops programs. The dumped information contains the
> attach location, function entry name, program ID and link ID when applicable.
> 
> Example with tc BPF link:
> 
>   # ./bpftool net
>   xdp:
> 
>   tc:
>   meta1(22) meta/peer tc1 prog_id 43 link_id 12
> 
>   [...]
> 
> Example with json dump:
> 
>   # ./bpftool net --json | jq
>   [
>     {
>       "xdp": [],
>       "tc": [
>         {
>           "devname": "meta1",
>           "ifindex": 18,
>           "kind": "meta/primary",
>           "name": "tc1",
>           "prog_id": 29,
>           "prog_flags": [],
>           "link_id": 8,
>           "link_flags": []
>         }
>       ],
>       "flow_dissector": [],
>       "netfilter": []
>     }
>   ]
> 
> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>

Reviewed-by: Quentin Monnet <quentin@isovalent.com>

Thanks!


^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH bpf-next 1/8] meta, bpf: Add bpf programmable meta device
  2023-09-26  5:59 ` [PATCH bpf-next 1/8] meta, bpf: Add bpf programmable meta device Daniel Borkmann
@ 2023-09-26 21:26   ` Stanislav Fomichev
  2023-09-28  9:16   ` Toke Høiland-Jørgensen
  2023-10-13 11:26   ` Florian Kauer
  2 siblings, 0 replies; 22+ messages in thread
From: Stanislav Fomichev @ 2023-09-26 21:26 UTC (permalink / raw)
  To: Daniel Borkmann
  Cc: bpf, netdev, martin.lau, razor, ast, andrii, john.fastabend

On 09/26, Daniel Borkmann wrote:
> This work adds a new, minimal BPF-programmable device called "meta" we
> recently presented at LSF/MM/BPF. The latter name derives from the Greek
> μετά, encompassing a wide array of meanings such as "on top of", "beyond".
> Given business logic is defined by BPF, this device can have many meanings.
> The core idea is that BPF programs are executed within the drivers xmit
> routine and therefore e.g. in case of containers/Pods moving BPF processing
> closer to the source.
> 
> One of the goals was that in case of Pod egress traffic, this allows to
> move BPF programs from hostns tcx ingress into the device itself, providing
> earlier drop or forward mechanisms, for example, if the BPF program
> determines that the skb must be sent out of the node, then a redirect to
> the physical device can take place directly without going through per-CPU
> backlog queue. This helps to shift processing for such traffic from softirq
> to process context, leading to better scheduling decisions and better
> performance.
> 
> In this initial version, the meta device ships as a pair, but we plan to
> extend this further so it can also operate in single device mode. The pair
> comes with a primary and a peer device. Only the primary device, typically
> residing in hostns, can manage BPF programs for itself and its peer. The
> peer device is designated for containers/Pods and cannot attach/detach
> BPF programs. Upon the device creation, the user can set the default policy
> to 'forward' or 'drop' for the case when no BPF program is attached.
> 
> Additionally, the device can be operated in L3 (default) or L2 mode. The
> management of BPF programs is done via bpf_mprog, so that multi-attach is
> supported right from the beginning with similar API/dependency controls as
> tcx. For details on the latter see commit 053c8e1f235d ("bpf: Add generic
> attach/detach/query API for multi-progs"). tc BPF compatibility is provided,
> so that existing programs can be easily migrated.
> 
> Going forward, we plan to use meta devices in Cilium as the main device type
> for connecting Pods. They will be operated in L3 mode in order to simplify
> a Pod's neighbor management and the peer will operate in default drop mode,
> so that no traffic is leaving between the time when a Pod is brought up by
> the CNI plugin and programs attached by the agent. Additionally, the programs
> we attach via tcx on the physical devices are using bpf_redirect_peer()
> for inbound traffic into meta device, hence the latter also supporting the
> ndo_get_peer_dev callback. Similarly, we use bpf_redirect_neigh() for the
> way out, pushing to phys device directly. Also, BIG TCP is supported on meta
> device. For the follow-up work in single device mode, we plan to convert
> Cilium's cilium_host/_net devices into a single one.
> 
> An extensive test suite for checking device operations and the BPF program
> and link management API comes as BPF selftests in this series.
> 
> Co-developed-by: Nikolay Aleksandrov <razor@blackwall.org>
> Signed-off-by: Nikolay Aleksandrov <razor@blackwall.org>
> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
> Link: https://github.com/borkmann/iproute2/commits/pr/meta
> Link: http://vger.kernel.org/bpfconf2023_material/tcx_meta_netdev_borkmann.pdf (24ff.)
> ---
>  MAINTAINERS                    |   9 +
>  drivers/net/Kconfig            |   9 +
>  drivers/net/Makefile           |   1 +
>  drivers/net/meta.c             | 734 +++++++++++++++++++++++++++++++++
>  include/linux/netdevice.h      |   2 +
>  include/net/meta.h             |  31 ++
>  include/uapi/linux/bpf.h       |   2 +
>  include/uapi/linux/if_link.h   |  25 ++
>  kernel/bpf/syscall.c           |  30 +-
>  tools/include/uapi/linux/bpf.h |   2 +
>  10 files changed, 840 insertions(+), 5 deletions(-)
>  create mode 100644 drivers/net/meta.c
>  create mode 100644 include/net/meta.h
> 
> diff --git a/MAINTAINERS b/MAINTAINERS
> index 8985a1b0b5ee..ec3edd4caa56 100644
> --- a/MAINTAINERS
> +++ b/MAINTAINERS
> @@ -3774,6 +3774,15 @@ L:	bpf@vger.kernel.org
>  S:	Maintained
>  F:	tools/lib/bpf/
>  
> +BPF [META]
> +M:	Daniel Borkmann <daniel@iogearbox.net>
> +M:	Nikolay Aleksandrov <razor@blackwall.org>
> +L:	bpf@vger.kernel.org
> +L:	netdev@vger.kernel.org
> +S:	Supported
> +F:	drivers/net/meta.c
> +F:	include/net/meta.h
> +
>  BPF [MISC]
>  L:	bpf@vger.kernel.org
>  S:	Odd Fixes
> diff --git a/drivers/net/Kconfig b/drivers/net/Kconfig
> index 44eeb5d61ba9..9959cdd50b0b 100644
> --- a/drivers/net/Kconfig
> +++ b/drivers/net/Kconfig
> @@ -448,6 +448,15 @@ config NLMON
>  	  diagnostics, etc. This is mostly intended for developers or support
>  	  to debug netlink issues. If unsure, say N.
>  
> +config META
> +	bool "BPF-programmable meta device"
> +	depends on BPF_SYSCALL
> +	help
> +	  The virtual meta devices can be created in pairs and used to connect
> +	  two network namespaces. A BPF program can be attached to the device(s)
> +	  which then gets executed on transmission to implement the driver
> +	  internal logic.
> +
>  config NET_VRF
>  	tristate "Virtual Routing and Forwarding (Lite)"
>  	depends on IP_MULTIPLE_TABLES
> diff --git a/drivers/net/Makefile b/drivers/net/Makefile
> index e26f98f897c5..18eabeb78ece 100644
> --- a/drivers/net/Makefile
> +++ b/drivers/net/Makefile
> @@ -22,6 +22,7 @@ obj-$(CONFIG_MDIO) += mdio.o
>  obj-$(CONFIG_NET) += loopback.o
>  obj-$(CONFIG_NETDEV_LEGACY_INIT) += Space.o
>  obj-$(CONFIG_NETCONSOLE) += netconsole.o
> +obj-$(CONFIG_META) += meta.o
>  obj-y += phy/
>  obj-y += pse-pd/
>  obj-y += mdio/
> diff --git a/drivers/net/meta.c b/drivers/net/meta.c
> new file mode 100644
> index 000000000000..e464f547b0a6
> --- /dev/null
> +++ b/drivers/net/meta.c
> @@ -0,0 +1,734 @@
> +// SPDX-License-Identifier: GPL-2.0-only
> +/* Copyright (c) 2023 Isovalent */
> +
> +#include <linux/netdevice.h>
> +#include <linux/ethtool.h>
> +#include <linux/etherdevice.h>
> +#include <linux/filter.h>
> +#include <linux/netfilter_netdev.h>
> +#include <linux/bpf_mprog.h>
> +
> +#include <net/meta.h>
> +#include <net/dst.h>
> +#include <net/tcx.h>
> +
> +#define DRV_NAME	"meta"
> +#define DRV_VERSION	"1.0"
> +
> +struct meta {
> +	/* Needed in fast-path */
> +	struct net_device __rcu *peer;
> +	struct bpf_mprog_entry __rcu *active;
> +	enum meta_action policy;
> +	struct bpf_mprog_bundle	bundle;
> +	/* Needed in slow-path */
> +	enum meta_mode mode;
> +	bool primary;
> +	u32 headroom;
> +};
> +
> +static void meta_scrub_minimum(struct sk_buff *skb)
> +{

[..]

> +	skb->skb_iif = 0;
> +	skb->ignore_df = 0;
> +	skb->priority = 0;
> +	skb_dst_drop(skb);
> +	skb_ext_reset(skb);
> +	nf_reset_ct(skb);
> +	nf_reset_trace(skb);
> +	nf_skip_egress(skb, true);
> +	ipvs_reset(skb);

This looks similar to skb_scrub_packet; what's the difference?

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH bpf-next 7/8] selftests/bpf: Add netlink helper library
  2023-09-26  5:59 ` [PATCH bpf-next 7/8] selftests/bpf: Add netlink helper library Daniel Borkmann
@ 2023-09-26 21:35   ` Stanislav Fomichev
  0 siblings, 0 replies; 22+ messages in thread
From: Stanislav Fomichev @ 2023-09-26 21:35 UTC (permalink / raw)
  To: Daniel Borkmann
  Cc: bpf, netdev, martin.lau, razor, ast, andrii, john.fastabend,
	donald.hunter, kuba

On 09/26, Daniel Borkmann wrote:
> Add a basic netlink helper library for the BPF selftests. This has been
> taken and cut down/cleaned up from iproute2. More can be added at some
> later point in time when needed, but for now this covers basics such as
> device creation which we need for BPF selftests / BPF CI.

Should the netlink code be based on ynl
(https://lore.kernel.org/all/20230825122756.7603-1-donald.hunter@gmail.com/)?
Or does it not have full rtnl support yet?

> 
> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
> ---
>  tools/testing/selftests/bpf/Makefile          |  19 +-
>  tools/testing/selftests/bpf/netlink_helpers.c | 358 ++++++++++++++++++
>  tools/testing/selftests/bpf/netlink_helpers.h |  46 +++
>  3 files changed, 418 insertions(+), 5 deletions(-)
>  create mode 100644 tools/testing/selftests/bpf/netlink_helpers.c
>  create mode 100644 tools/testing/selftests/bpf/netlink_helpers.h
> 
> diff --git a/tools/testing/selftests/bpf/Makefile b/tools/testing/selftests/bpf/Makefile
> index 47365161b6fc..b8186ceb31dc 100644
> --- a/tools/testing/selftests/bpf/Makefile
> +++ b/tools/testing/selftests/bpf/Makefile
> @@ -579,11 +579,20 @@ endef
>  # Define test_progs test runner.
>  TRUNNER_TESTS_DIR := prog_tests
>  TRUNNER_BPF_PROGS_DIR := progs
> -TRUNNER_EXTRA_SOURCES := test_progs.c cgroup_helpers.c trace_helpers.c	\
> -			 network_helpers.c testing_helpers.c		\
> -			 btf_helpers.c flow_dissector_load.h		\
> -			 cap_helpers.c test_loader.c xsk.c disasm.c	\
> -			 json_writer.c unpriv_helpers.c 		\
> +TRUNNER_EXTRA_SOURCES := test_progs.c		\
> +			 cgroup_helpers.c	\
> +			 trace_helpers.c	\
> +			 network_helpers.c	\
> +			 testing_helpers.c	\
> +			 btf_helpers.c		\
> +			 cap_helpers.c		\
> +			 unpriv_helpers.c 	\
> +			 netlink_helpers.c	\
> +			 test_loader.c		\
> +			 xsk.c			\
> +			 disasm.c		\
> +			 json_writer.c 		\
> +			 flow_dissector_load.h	\
>  			 ip_check_defrag_frags.h
>  TRUNNER_EXTRA_FILES := $(OUTPUT)/urandom_read $(OUTPUT)/bpf_testmod.ko	\
>  		       $(OUTPUT)/liburandom_read.so			\
> diff --git a/tools/testing/selftests/bpf/netlink_helpers.c b/tools/testing/selftests/bpf/netlink_helpers.c
> new file mode 100644
> index 000000000000..caf36eb1d032
> --- /dev/null
> +++ b/tools/testing/selftests/bpf/netlink_helpers.c
> @@ -0,0 +1,358 @@
> +// SPDX-License-Identifier: GPL-2.0-or-later
> +/* Taken & modified from iproute2's libnetlink.c
> + * Authors: Alexey Kuznetsov, <kuznet@ms2.inr.ac.ru>
> + */
> +#include <stdio.h>
> +#include <stdlib.h>
> +#include <unistd.h>
> +#include <errno.h>
> +#include <time.h>
> +#include <sys/socket.h>
> +
> +#include "netlink_helpers.h"
> +
> +static int rcvbuf = 1024 * 1024;
> +
> +void rtnl_close(struct rtnl_handle *rth)
> +{
> +	if (rth->fd >= 0) {
> +		close(rth->fd);
> +		rth->fd = -1;
> +	}
> +}
> +
> +int rtnl_open_byproto(struct rtnl_handle *rth, unsigned int subscriptions,
> +		      int protocol)
> +{
> +	socklen_t addr_len;
> +	int sndbuf = 32768;
> +	int one = 1;
> +
> +	memset(rth, 0, sizeof(*rth));
> +	rth->proto = protocol;
> +	rth->fd = socket(AF_NETLINK, SOCK_RAW | SOCK_CLOEXEC, protocol);
> +	if (rth->fd < 0) {
> +		perror("Cannot open netlink socket");
> +		return -1;
> +	}
> +	if (setsockopt(rth->fd, SOL_SOCKET, SO_SNDBUF,
> +		       &sndbuf, sizeof(sndbuf)) < 0) {
> +		perror("SO_SNDBUF");
> +		goto err;
> +	}
> +	if (setsockopt(rth->fd, SOL_SOCKET, SO_RCVBUF,
> +		       &rcvbuf, sizeof(rcvbuf)) < 0) {
> +		perror("SO_RCVBUF");
> +		goto err;
> +	}
> +
> +	/* Older kernels may not support extended ACK reporting */
> +	setsockopt(rth->fd, SOL_NETLINK, NETLINK_EXT_ACK,
> +		   &one, sizeof(one));
> +
> +	memset(&rth->local, 0, sizeof(rth->local));
> +	rth->local.nl_family = AF_NETLINK;
> +	rth->local.nl_groups = subscriptions;
> +
> +	if (bind(rth->fd, (struct sockaddr *)&rth->local,
> +		 sizeof(rth->local)) < 0) {
> +		perror("Cannot bind netlink socket");
> +		goto err;
> +	}
> +	addr_len = sizeof(rth->local);
> +	if (getsockname(rth->fd, (struct sockaddr *)&rth->local,
> +			&addr_len) < 0) {
> +		perror("Cannot getsockname");
> +		goto err;
> +	}
> +	if (addr_len != sizeof(rth->local)) {
> +		fprintf(stderr, "Wrong address length %d\n", addr_len);
> +		goto err;
> +	}
> +	if (rth->local.nl_family != AF_NETLINK) {
> +		fprintf(stderr, "Wrong address family %d\n",
> +			rth->local.nl_family);
> +		goto err;
> +	}
> +	rth->seq = time(NULL);
> +	return 0;
> +err:
> +	rtnl_close(rth);
> +	return -1;
> +}
> +
> +int rtnl_open(struct rtnl_handle *rth, unsigned int subscriptions)
> +{
> +	return rtnl_open_byproto(rth, subscriptions, NETLINK_ROUTE);
> +}
> +
> +static int __rtnl_recvmsg(int fd, struct msghdr *msg, int flags)
> +{
> +	int len;
> +
> +	do {
> +		len = recvmsg(fd, msg, flags);
> +	} while (len < 0 && (errno == EINTR || errno == EAGAIN));
> +	if (len < 0) {
> +		fprintf(stderr, "netlink receive error %s (%d)\n",
> +			strerror(errno), errno);
> +		return -errno;
> +	}
> +	if (len == 0) {
> +		fprintf(stderr, "EOF on netlink\n");
> +		return -ENODATA;
> +	}
> +	return len;
> +}
> +
> +static int rtnl_recvmsg(int fd, struct msghdr *msg, char **answer)
> +{
> +	struct iovec *iov = msg->msg_iov;
> +	char *buf;
> +	int len;
> +
> +	iov->iov_base = NULL;
> +	iov->iov_len = 0;
> +
> +	len = __rtnl_recvmsg(fd, msg, MSG_PEEK | MSG_TRUNC);
> +	if (len < 0)
> +		return len;
> +	if (len < 32768)
> +		len = 32768;
> +	buf = malloc(len);
> +	if (!buf) {
> +		fprintf(stderr, "malloc error: not enough buffer\n");
> +		return -ENOMEM;
> +	}
> +	iov->iov_base = buf;
> +	iov->iov_len = len;
> +	len = __rtnl_recvmsg(fd, msg, 0);
> +	if (len < 0) {
> +		free(buf);
> +		return len;
> +	}
> +	if (answer)
> +		*answer = buf;
> +	else
> +		free(buf);
> +	return len;
> +}
> +
> +static void rtnl_talk_error(struct nlmsghdr *h, struct nlmsgerr *err,
> +			    nl_ext_ack_fn_t errfn)
> +{
> +	fprintf(stderr, "RTNETLINK answers: %s\n",
> +		strerror(-err->error));
> +}
> +
> +static int __rtnl_talk_iov(struct rtnl_handle *rtnl, struct iovec *iov,
> +			   size_t iovlen, struct nlmsghdr **answer,
> +			   bool show_rtnl_err, nl_ext_ack_fn_t errfn)
> +{
> +	struct sockaddr_nl nladdr = { .nl_family = AF_NETLINK };
> +	struct iovec riov;
> +	struct msghdr msg = {
> +		.msg_name	= &nladdr,
> +		.msg_namelen	= sizeof(nladdr),
> +		.msg_iov	= iov,
> +		.msg_iovlen	= iovlen,
> +	};
> +	unsigned int seq = 0;
> +	struct nlmsghdr *h;
> +	int i, status;
> +	char *buf;
> +
> +	for (i = 0; i < iovlen; i++) {
> +		h = iov[i].iov_base;
> +		h->nlmsg_seq = seq = ++rtnl->seq;
> +		if (answer == NULL)
> +			h->nlmsg_flags |= NLM_F_ACK;
> +	}
> +	status = sendmsg(rtnl->fd, &msg, 0);
> +	if (status < 0) {
> +		perror("Cannot talk to rtnetlink");
> +		return -1;
> +	}
> +	/* change msg to use the response iov */
> +	msg.msg_iov = &riov;
> +	msg.msg_iovlen = 1;
> +	i = 0;
> +	while (1) {
> +next:
> +		status = rtnl_recvmsg(rtnl->fd, &msg, &buf);
> +		++i;
> +		if (status < 0)
> +			return status;
> +		if (msg.msg_namelen != sizeof(nladdr)) {
> +			fprintf(stderr,
> +				"Sender address length == %d!\n",
> +				msg.msg_namelen);
> +			exit(1);
> +		}
> +		for (h = (struct nlmsghdr *)buf; status >= sizeof(*h); ) {
> +			int len = h->nlmsg_len;
> +			int l = len - sizeof(*h);
> +
> +			if (l < 0 || len > status) {
> +				if (msg.msg_flags & MSG_TRUNC) {
> +					fprintf(stderr, "Truncated message!\n");
> +					free(buf);
> +					return -1;
> +				}
> +				fprintf(stderr,
> +					"Malformed message: len=%d!\n",
> +					len);
> +				exit(1);
> +			}
> +			if (nladdr.nl_pid != 0 ||
> +			    h->nlmsg_pid != rtnl->local.nl_pid ||
> +			    h->nlmsg_seq > seq || h->nlmsg_seq < seq - iovlen) {
> +				/* Don't forget to skip that message. */
> +				status -= NLMSG_ALIGN(len);
> +				h = (struct nlmsghdr *)((char *)h + NLMSG_ALIGN(len));
> +				continue;
> +			}
> +			if (h->nlmsg_type == NLMSG_ERROR) {
> +				struct nlmsgerr *err = (struct nlmsgerr *)NLMSG_DATA(h);
> +				int error = err->error;
> +
> +				if (l < sizeof(struct nlmsgerr)) {
> +					fprintf(stderr, "ERROR truncated\n");
> +					free(buf);
> +					return -1;
> +				}
> +				if (error) {
> +					errno = -error;
> +					if (rtnl->proto != NETLINK_SOCK_DIAG &&
> +					    show_rtnl_err)
> +						rtnl_talk_error(h, err, errfn);
> +				}
> +				if (i < iovlen) {
> +					free(buf);
> +					goto next;
> +				}
> +				if (error) {
> +					free(buf);
> +					return -i;
> +				}
> +				if (answer)
> +					*answer = (struct nlmsghdr *)buf;
> +				else
> +					free(buf);
> +				return 0;
> +			}
> +			if (answer) {
> +				*answer = (struct nlmsghdr *)buf;
> +				return 0;
> +			}
> +			fprintf(stderr, "Unexpected reply!\n");
> +			status -= NLMSG_ALIGN(len);
> +			h = (struct nlmsghdr *)((char *)h + NLMSG_ALIGN(len));
> +		}
> +		free(buf);
> +		if (msg.msg_flags & MSG_TRUNC) {
> +			fprintf(stderr, "Message truncated!\n");
> +			continue;
> +		}
> +		if (status) {
> +			fprintf(stderr, "Remnant of size %d!\n", status);
> +			exit(1);
> +		}
> +	}
> +}
> +
> +static int __rtnl_talk(struct rtnl_handle *rtnl, struct nlmsghdr *n,
> +		       struct nlmsghdr **answer, bool show_rtnl_err,
> +		       nl_ext_ack_fn_t errfn)
> +{
> +	struct iovec iov = {
> +		.iov_base	= n,
> +		.iov_len	= n->nlmsg_len,
> +	};
> +
> +	return __rtnl_talk_iov(rtnl, &iov, 1, answer, show_rtnl_err, errfn);
> +}
> +
> +int rtnl_talk(struct rtnl_handle *rtnl, struct nlmsghdr *n,
> +	      struct nlmsghdr **answer)
> +{
> +	return __rtnl_talk(rtnl, n, answer, true, NULL);
> +}
> +
> +int addattr(struct nlmsghdr *n, int maxlen, int type)
> +{
> +	return addattr_l(n, maxlen, type, NULL, 0);
> +}
> +
> +int addattr8(struct nlmsghdr *n, int maxlen, int type, __u8 data)
> +{
> +	return addattr_l(n, maxlen, type, &data, sizeof(__u8));
> +}
> +
> +int addattr16(struct nlmsghdr *n, int maxlen, int type, __u16 data)
> +{
> +	return addattr_l(n, maxlen, type, &data, sizeof(__u16));
> +}
> +
> +int addattr32(struct nlmsghdr *n, int maxlen, int type, __u32 data)
> +{
> +	return addattr_l(n, maxlen, type, &data, sizeof(__u32));
> +}
> +
> +int addattr64(struct nlmsghdr *n, int maxlen, int type, __u64 data)
> +{
> +	return addattr_l(n, maxlen, type, &data, sizeof(__u64));
> +}
> +
> +int addattrstrz(struct nlmsghdr *n, int maxlen, int type, const char *str)
> +{
> +	return addattr_l(n, maxlen, type, str, strlen(str)+1);
> +}
> +
> +int addattr_l(struct nlmsghdr *n, int maxlen, int type, const void *data,
> +	      int alen)
> +{
> +	int len = RTA_LENGTH(alen);
> +	struct rtattr *rta;
> +
> +	if (NLMSG_ALIGN(n->nlmsg_len) + RTA_ALIGN(len) > maxlen) {
> +		fprintf(stderr, "%s: Message exceeded bound of %d\n",
> +			__func__, maxlen);
> +		return -1;
> +	}
> +	rta = NLMSG_TAIL(n);
> +	rta->rta_type = type;
> +	rta->rta_len = len;
> +	if (alen)
> +		memcpy(RTA_DATA(rta), data, alen);
> +	n->nlmsg_len = NLMSG_ALIGN(n->nlmsg_len) + RTA_ALIGN(len);
> +	return 0;
> +}
> +
> +int addraw_l(struct nlmsghdr *n, int maxlen, const void *data, int len)
> +{
> +	if (NLMSG_ALIGN(n->nlmsg_len) + NLMSG_ALIGN(len) > maxlen) {
> +		fprintf(stderr, "%s: Message exceeded bound of %d\n",
> +			__func__, maxlen);
> +		return -1;
> +	}
> +
> +	memcpy(NLMSG_TAIL(n), data, len);
> +	memset((void *) NLMSG_TAIL(n) + len, 0, NLMSG_ALIGN(len) - len);
> +	n->nlmsg_len = NLMSG_ALIGN(n->nlmsg_len) + NLMSG_ALIGN(len);
> +	return 0;
> +}
> +
> +struct rtattr *addattr_nest(struct nlmsghdr *n, int maxlen, int type)
> +{
> +	struct rtattr *nest = NLMSG_TAIL(n);
> +
> +	addattr_l(n, maxlen, type, NULL, 0);
> +	return nest;
> +}
> +
> +int addattr_nest_end(struct nlmsghdr *n, struct rtattr *nest)
> +{
> +	nest->rta_len = (void *)NLMSG_TAIL(n) - (void *)nest;
> +	return n->nlmsg_len;
> +}
> diff --git a/tools/testing/selftests/bpf/netlink_helpers.h b/tools/testing/selftests/bpf/netlink_helpers.h
> new file mode 100644
> index 000000000000..68116818a47e
> --- /dev/null
> +++ b/tools/testing/selftests/bpf/netlink_helpers.h
> @@ -0,0 +1,46 @@
> +/* SPDX-License-Identifier: GPL-2.0-or-later */
> +#ifndef NETLINK_HELPERS_H
> +#define NETLINK_HELPERS_H
> +
> +#include <string.h>
> +#include <linux/netlink.h>
> +#include <linux/rtnetlink.h>
> +
> +struct rtnl_handle {
> +	int			fd;
> +	struct sockaddr_nl	local;
> +	struct sockaddr_nl	peer;
> +	__u32			seq;
> +	__u32			dump;
> +	int			proto;
> +	FILE			*dump_fp;
> +#define RTNL_HANDLE_F_LISTEN_ALL_NSID		0x01
> +#define RTNL_HANDLE_F_SUPPRESS_NLERR		0x02
> +#define RTNL_HANDLE_F_STRICT_CHK		0x04
> +	int			flags;
> +};
> +
> +#define NLMSG_TAIL(nmsg) \
> +	((struct rtattr *) (((void *) (nmsg)) + NLMSG_ALIGN((nmsg)->nlmsg_len)))
> +
> +typedef int (*nl_ext_ack_fn_t)(const char *errmsg, uint32_t off,
> +			       const struct nlmsghdr *inner_nlh);
> +
> +int rtnl_open(struct rtnl_handle *rth, unsigned int subscriptions)
> +	      __attribute__((warn_unused_result));
> +void rtnl_close(struct rtnl_handle *rth);
> +int rtnl_talk(struct rtnl_handle *rtnl, struct nlmsghdr *n,
> +	      struct nlmsghdr **answer)
> +	      __attribute__((warn_unused_result));
> +
> +int addattr(struct nlmsghdr *n, int maxlen, int type);
> +int addattr8(struct nlmsghdr *n, int maxlen, int type, __u8 data);
> +int addattr16(struct nlmsghdr *n, int maxlen, int type, __u16 data);
> +int addattr32(struct nlmsghdr *n, int maxlen, int type, __u32 data);
> +int addattr64(struct nlmsghdr *n, int maxlen, int type, __u64 data);
> +int addattrstrz(struct nlmsghdr *n, int maxlen, int type, const char *data);
> +int addattr_l(struct nlmsghdr *n, int maxlen, int type, const void *data, int alen);
> +int addraw_l(struct nlmsghdr *n, int maxlen, const void *data, int len);
> +struct rtattr *addattr_nest(struct nlmsghdr *n, int maxlen, int type);
> +int addattr_nest_end(struct nlmsghdr *n, struct rtattr *nest);
> +#endif /* NETLINK_HELPERS_H */
> -- 
> 2.34.1
> 
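[ To make the intended use a bit more concrete, a trimmed-down sketch of how
  a test could create a "meta" device pair with these helpers; error handling
  and the optional peer/policy attributes are omitted, so this is only an
  illustration of the API above, not code from the series: ]

  #include <stdio.h>
  #include <sys/socket.h>
  #include <linux/if_link.h>

  #include "netlink_helpers.h"

  static int create_meta_dev(const char *name)
  {
  	struct {
  		struct nlmsghdr n;
  		struct ifinfomsg i;
  		char buf[1024];
  	} req = {
  		.n.nlmsg_len   = NLMSG_LENGTH(sizeof(struct ifinfomsg)),
  		.n.nlmsg_flags = NLM_F_REQUEST | NLM_F_CREATE | NLM_F_EXCL,
  		.n.nlmsg_type  = RTM_NEWLINK,
  		.i.ifi_family  = AF_UNSPEC,
  	};
  	struct rtnl_handle rth;
  	struct rtattr *linkinfo;
  	int err;

  	if (rtnl_open(&rth, 0) < 0)
  		return -1;
  	/* Primary device name; the peer gets an auto-assigned "m%d" name. */
  	addattrstrz(&req.n, sizeof(req), IFLA_IFNAME, name);
  	linkinfo = addattr_nest(&req.n, sizeof(req), IFLA_LINKINFO);
  	addattrstrz(&req.n, sizeof(req), IFLA_INFO_KIND, "meta");
  	addattr_nest_end(&req.n, linkinfo);
  	err = rtnl_talk(&rth, &req.n, NULL);
  	rtnl_close(&rth);
  	return err;
  }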

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH bpf-next 2/8] meta, bpf: Add bpf link support for meta device
  2023-09-26  5:59 ` [PATCH bpf-next 2/8] meta, bpf: Add bpf link support for " Daniel Borkmann
@ 2023-09-28  0:12   ` Andrii Nakryiko
  0 siblings, 0 replies; 22+ messages in thread
From: Andrii Nakryiko @ 2023-09-28  0:12 UTC (permalink / raw)
  To: Daniel Borkmann
  Cc: bpf, netdev, martin.lau, razor, ast, andrii, john.fastabend

On Mon, Sep 25, 2023 at 10:59 PM Daniel Borkmann <daniel@iogearbox.net> wrote:
>
> This adds BPF link support for meta device (BPF_LINK_TYPE_META). Similar
> as with tcx or XDP, the BPF link for meta contains the device.
>
> The bpf_mprog API has been reused for its implementation. For details, see
> also commit e420bed0250 ("bpf: Add fd-based tcx multi-prog infra with link
> support").
>
> This is now the second user of bpf_mprog after tcx, and in the meta case the
> implementation is also a bit more straightforward since it does not need to
> deal with miniq.
>
> The UAPI extensions for the BPF_LINK_CREATE command are similar to those for
> tcx, that is, relative_{fd,id} and expected_revision.
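[ For reference, a small sketch of how this maps onto the low-level
  bpf_link_create() API once the libbpf bits later in this series are in
  place; expected_revision = 1 assumes nothing has been attached to the
  device yet: ]

  #include <bpf/bpf.h>

  /* prog_fd: a loaded SCHED_CLS program, ifindex: the meta primary device */
  static int meta_link_create(int prog_fd, int ifindex)
  {
  	LIBBPF_OPTS(bpf_link_create_opts, opts,
  		/* fail if another attachment raced in before us */
  		.meta.expected_revision = 1,
  	);

  	return bpf_link_create(prog_fd, ifindex, BPF_META_PRIMARY, &opts);
  }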
>
> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
> ---
>  drivers/net/meta.c             | 211 ++++++++++++++++++++++++++++++++-
>  include/net/meta.h             |   7 ++
>  include/uapi/linux/bpf.h       |  11 ++
>  kernel/bpf/syscall.c           |   2 +-
>  tools/include/uapi/linux/bpf.h |  11 ++
>  5 files changed, 240 insertions(+), 2 deletions(-)
>

[...]

> diff --git a/include/net/meta.h b/include/net/meta.h
> index 20fc61d05970..f1abe1d6d02d 100644
> --- a/include/net/meta.h
> +++ b/include/net/meta.h
> @@ -7,6 +7,7 @@
>
>  #ifdef CONFIG_META
>  int meta_prog_attach(const union bpf_attr *attr, struct bpf_prog *prog);
> +int meta_link_attach(const union bpf_attr *attr, struct bpf_prog *prog);
>  int meta_prog_detach(const union bpf_attr *attr, struct bpf_prog *prog);
>  int meta_prog_query(const union bpf_attr *attr, union bpf_attr __user *uattr);
>  #else
> @@ -16,6 +17,12 @@ static inline int meta_prog_attach(const union bpf_attr *attr,
>         return -EINVAL;
>  }
>
> +static inline int meta_link_attach(const union bpf_attr *attr,
> +                                  struct bpf_prog *prog)
> +{
> +       return -EINVAL;
> +}
> +
>  static inline int meta_prog_detach(const union bpf_attr *attr,
>                                    struct bpf_prog *prog)
>  {
> diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
> index 00a875720e84..fd069f285fbc 100644
> --- a/include/uapi/linux/bpf.h
> +++ b/include/uapi/linux/bpf.h
> @@ -1068,6 +1068,7 @@ enum bpf_link_type {
>         BPF_LINK_TYPE_NETFILTER = 10,
>         BPF_LINK_TYPE_TCX = 11,
>         BPF_LINK_TYPE_UPROBE_MULTI = 12,
> +       BPF_LINK_TYPE_META = 13,

it's not just some completely universal "meta" device, it's
specifically networking meta-device, is that right? so at least in
UAPI I think we should stay away from using super-generic "meta"
words, and do something like "net_meta" or "meta_net" or whatnot, but
indicate that this is networking stuff. WDYT?


>         MAX_BPF_LINK_TYPE,
>  };
>
> @@ -1653,6 +1654,13 @@ union bpf_attr {
>                                 __u32           flags;
>                                 __u32           pid;
>                         } uprobe_multi;
> +                       struct {
> +                               union {
> +                                       __u32   relative_fd;
> +                                       __u32   relative_id;
> +                               };
> +                               __u64           expected_revision;
> +                       } meta;
>                 };
>         } link_create;
>
> @@ -6564,6 +6572,9 @@ struct bpf_link_info {
>                         __u32 ifindex;
>                         __u32 attach_type;
>                 } tcx;
> +               struct {
> +                       __u32 ifindex;
> +               } meta;
>         };
>  } __attribute__((aligned(8)));
>
> diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
> index 51baf4355c39..b689da4de280 100644
> --- a/kernel/bpf/syscall.c
> +++ b/kernel/bpf/syscall.c
> @@ -4969,7 +4969,7 @@ static int link_create(union bpf_attr *attr, bpfptr_t uattr)
>                     attr->link_create.attach_type == BPF_TCX_EGRESS)
>                         ret = tcx_link_attach(attr, prog);
>                 else
> -                       ret = -EINVAL;
> +                       ret = meta_link_attach(attr, prog);
>                 break;
>         case BPF_PROG_TYPE_NETFILTER:
>                 ret = bpf_nf_link_attach(attr, prog);
> diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h
> index 00a875720e84..fd069f285fbc 100644
> --- a/tools/include/uapi/linux/bpf.h
> +++ b/tools/include/uapi/linux/bpf.h
> @@ -1068,6 +1068,7 @@ enum bpf_link_type {
>         BPF_LINK_TYPE_NETFILTER = 10,
>         BPF_LINK_TYPE_TCX = 11,
>         BPF_LINK_TYPE_UPROBE_MULTI = 12,
> +       BPF_LINK_TYPE_META = 13,
>         MAX_BPF_LINK_TYPE,
>  };
>
> @@ -1653,6 +1654,13 @@ union bpf_attr {
>                                 __u32           flags;
>                                 __u32           pid;
>                         } uprobe_multi;
> +                       struct {
> +                               union {
> +                                       __u32   relative_fd;
> +                                       __u32   relative_id;
> +                               };
> +                               __u64           expected_revision;
> +                       } meta;
>                 };
>         } link_create;
>
> @@ -6564,6 +6572,9 @@ struct bpf_link_info {
>                         __u32 ifindex;
>                         __u32 attach_type;
>                 } tcx;
> +               struct {
> +                       __u32 ifindex;
> +               } meta;
>         };
>  } __attribute__((aligned(8)));
>
> --
> 2.34.1
>

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH bpf-next 4/8] libbpf: Add link-based API for meta
  2023-09-26  5:59 ` [PATCH bpf-next 4/8] libbpf: Add link-based API for meta Daniel Borkmann
  2023-09-26 11:19   ` Quentin Monnet
@ 2023-09-28  0:12   ` Andrii Nakryiko
  2023-09-28 21:30     ` Daniel Borkmann
  1 sibling, 1 reply; 22+ messages in thread
From: Andrii Nakryiko @ 2023-09-28  0:12 UTC (permalink / raw)
  To: Daniel Borkmann
  Cc: bpf, netdev, martin.lau, razor, ast, andrii, john.fastabend

On Mon, Sep 25, 2023 at 10:59 PM Daniel Borkmann <daniel@iogearbox.net> wrote:
>
> This adds bpf_program__attach_meta() API to libbpf. Overall it is very
> similar to tcx. The API looks as following:
>
>   LIBBPF_API struct bpf_link *
>   bpf_program__attach_meta(const struct bpf_program *prog, int ifindex,
>                            bool peer_device, const struct bpf_meta_opts *opts);
>
> The struct bpf_meta_opts is done in similar way as struct bpf_tcx_opts.
> bpf_program__attach_meta() has one additional argument compared to
> bpf_program__attach_tcx(), namely peer_device. The latter denotes whether the
> program should be attached to the peer of ifindex or to ifindex itself.
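[ A hypothetical usage sketch of the new API; the object handling and the
  "meta_egress" program name are made up for illustration: ]

  #include <bpf/libbpf.h>

  static struct bpf_link *attach_to_meta(struct bpf_object *obj, int ifindex)
  {
  	LIBBPF_OPTS(bpf_meta_opts, opts); /* no relative prog, no revision check */
  	struct bpf_program *prog;

  	prog = bpf_object__find_program_by_name(obj, "meta_egress");
  	if (!prog)
  		return NULL;
  	/* peer_device == false: attach to ifindex itself, not to its peer */
  	return bpf_program__attach_meta(prog, ifindex, false, &opts);
  }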
>
> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
> ---
>  tools/lib/bpf/bpf.c      | 16 +++++++++++
>  tools/lib/bpf/bpf.h      |  5 ++++
>  tools/lib/bpf/libbpf.c   | 61 ++++++++++++++++++++++++++++++++++++----
>  tools/lib/bpf/libbpf.h   | 15 ++++++++++
>  tools/lib/bpf/libbpf.map |  1 +
>  5 files changed, 92 insertions(+), 6 deletions(-)
>
> diff --git a/tools/lib/bpf/bpf.c b/tools/lib/bpf/bpf.c
> index b0f1913763a3..f1335333b63c 100644
> --- a/tools/lib/bpf/bpf.c
> +++ b/tools/lib/bpf/bpf.c
> @@ -810,6 +810,22 @@ int bpf_link_create(int prog_fd, int target_fd,
>                 if (!OPTS_ZEROED(opts, tcx))
>                         return libbpf_err(-EINVAL);
>                 break;
> +       case BPF_META_PRIMARY:
> +       case BPF_META_PEER:

thinking out loud: should this be expected_attach_type during program
load? Or is it going to be common for primary and peer to be exactly
the same instance of a BPF program? If BPF_META_PRIMARY or
BPF_META_PEER is expected_attach_type, it seems to be a more natural
API where you'll be just saying "bpf_program__attach_meta(prog,
ifindex)" and whether it's primary or peer will be determined by SEC()
definition (SEC("meta/primary") vs SEC("meta/peer"))?

> +               relative_fd = OPTS_GET(opts, meta.relative_fd, 0);
> +               relative_id = OPTS_GET(opts, meta.relative_id, 0);
> +               if (relative_fd && relative_id)
> +                       return libbpf_err(-EINVAL);
> +               if (relative_id) {
> +                       attr.link_create.meta.relative_id = relative_id;
> +                       attr.link_create.flags |= BPF_F_ID;
> +               } else {
> +                       attr.link_create.meta.relative_fd = relative_fd;
> +               }
> +               attr.link_create.meta.expected_revision = OPTS_GET(opts, meta.expected_revision, 0);
> +               if (!OPTS_ZEROED(opts, meta))
> +                       return libbpf_err(-EINVAL);
> +               break;
>         default:
>                 if (!OPTS_ZEROED(opts, flags))
>                         return libbpf_err(-EINVAL);
> diff --git a/tools/lib/bpf/bpf.h b/tools/lib/bpf/bpf.h
> index 74c2887cfd24..175cfb95a175 100644
> --- a/tools/lib/bpf/bpf.h
> +++ b/tools/lib/bpf/bpf.h
> @@ -415,6 +415,11 @@ struct bpf_link_create_opts {
>                         __u32 relative_id;
>                         __u64 expected_revision;
>                 } tcx;
> +               struct {
> +                       __u32 relative_fd;
> +                       __u32 relative_id;
> +                       __u64 expected_revision;
> +               } meta;
>         };
>         size_t :0;
>  };
> diff --git a/tools/lib/bpf/libbpf.c b/tools/lib/bpf/libbpf.c
> index b4758e54a815..4d4da8ba2179 100644
> --- a/tools/lib/bpf/libbpf.c
> +++ b/tools/lib/bpf/libbpf.c
> @@ -121,6 +121,8 @@ static const char * const attach_type_name[] = {
>         [BPF_TCX_INGRESS]               = "tcx_ingress",
>         [BPF_TCX_EGRESS]                = "tcx_egress",
>         [BPF_TRACE_UPROBE_MULTI]        = "trace_uprobe_multi",
> +       [BPF_META_PRIMARY]              = "meta",
> +       [BPF_META_PEER]                 = "meta",
>  };
>
>  static const char * const link_type_name[] = {
> @@ -137,6 +139,7 @@ static const char * const link_type_name[] = {
>         [BPF_LINK_TYPE_NETFILTER]               = "netfilter",
>         [BPF_LINK_TYPE_TCX]                     = "tcx",
>         [BPF_LINK_TYPE_UPROBE_MULTI]            = "uprobe_multi",
> +       [BPF_LINK_TYPE_META]                    = "meta",
>  };
>
>  static const char * const map_type_name[] = {
> @@ -8910,6 +8913,7 @@ static const struct bpf_sec_def section_defs[] = {
>         SEC_DEF("tc",                   SCHED_CLS, 0, SEC_NONE), /* deprecated / legacy, use tcx */
>         SEC_DEF("classifier",           SCHED_CLS, 0, SEC_NONE), /* deprecated / legacy, use tcx */
>         SEC_DEF("action",               SCHED_ACT, 0, SEC_NONE), /* deprecated / legacy, use tcx */
> +       SEC_DEF("meta",                 SCHED_CLS, 0, SEC_NONE),
>         SEC_DEF("tracepoint+",          TRACEPOINT, 0, SEC_NONE, attach_tp),
>         SEC_DEF("tp+",                  TRACEPOINT, 0, SEC_NONE, attach_tp),
>         SEC_DEF("raw_tracepoint+",      RAW_TRACEPOINT, 0, SEC_NONE, attach_raw_tp),
> @@ -12019,11 +12023,11 @@ static int attach_lsm(const struct bpf_program *prog, long cookie, struct bpf_li
>  }
>
>  static struct bpf_link *
> -bpf_program_attach_fd(const struct bpf_program *prog,
> -                     int target_fd, const char *target_name,
> -                     const struct bpf_link_create_opts *opts)
> +bpf_program_attach_fd_type(const struct bpf_program *prog,
> +                          int target_fd, const char *target_name,
> +                          enum bpf_attach_type attach_type,
> +                          const struct bpf_link_create_opts *opts)
>  {
> -       enum bpf_attach_type attach_type;
>         char errmsg[STRERR_BUFSIZE];
>         struct bpf_link *link;
>         int prog_fd, link_fd;
> @@ -12038,8 +12042,6 @@ bpf_program_attach_fd(const struct bpf_program *prog,
>         if (!link)
>                 return libbpf_err_ptr(-ENOMEM);
>         link->detach = &bpf_link__detach_fd;
> -
> -       attach_type = bpf_program__expected_attach_type(prog);
>         link_fd = bpf_link_create(prog_fd, target_fd, attach_type, opts);
>         if (link_fd < 0) {
>                 link_fd = -errno;
> @@ -12053,6 +12055,16 @@ bpf_program_attach_fd(const struct bpf_program *prog,
>         return link;
>  }
>
> +static struct bpf_link *
> +bpf_program_attach_fd(const struct bpf_program *prog,
> +                     int target_fd, const char *target_name,
> +                     const struct bpf_link_create_opts *opts)
> +{
> +       return bpf_program_attach_fd_type(prog, target_fd, target_name,
> +                                         bpf_program__expected_attach_type(prog),
> +                                         opts);
> +}
> +
>  struct bpf_link *
>  bpf_program__attach_cgroup(const struct bpf_program *prog, int cgroup_fd)
>  {
> @@ -12106,6 +12118,43 @@ bpf_program__attach_tcx(const struct bpf_program *prog, int ifindex,
>         return bpf_program_attach_fd(prog, ifindex, "tcx", &link_create_opts);
>  }
>
> +struct bpf_link *
> +bpf_program__attach_meta(const struct bpf_program *prog, int ifindex,
> +                        bool peer_device, const struct bpf_meta_opts *opts)

you mentioned that there are plans to also support cases where there
is no primary-peer. Is that going to be a primary-only setup or will
it be some third option? If the latter, should this `bool peer_device`
be an enum then?

> +{
> +       LIBBPF_OPTS(bpf_link_create_opts, link_create_opts);
> +       enum bpf_attach_type attach_type;
> +       __u32 relative_id;
> +       int relative_fd;
> +
> +       if (!OPTS_VALID(opts, bpf_meta_opts))
> +               return libbpf_err_ptr(-EINVAL);
> +
> +       relative_id = OPTS_GET(opts, relative_id, 0);
> +       relative_fd = OPTS_GET(opts, relative_fd, 0);
> +       attach_type = peer_device ? BPF_META_PEER : BPF_META_PRIMARY;
> +
> +       /* validate we don't have unexpected combinations of non-zero fields */
> +       if (!ifindex) {
> +               pr_warn("prog '%s': target netdevice ifindex cannot be zero\n",
> +                       prog->name);
> +               return libbpf_err_ptr(-EINVAL);
> +       }
> +       if (relative_fd && relative_id) {
> +               pr_warn("prog '%s': relative_fd and relative_id cannot be set at the same time\n",
> +                       prog->name);
> +               return libbpf_err_ptr(-EINVAL);
> +       }
> +
> +       link_create_opts.meta.expected_revision = OPTS_GET(opts, expected_revision, 0);
> +       link_create_opts.meta.relative_fd = relative_fd;
> +       link_create_opts.meta.relative_id = relative_id;
> +       link_create_opts.flags = OPTS_GET(opts, flags, 0);
> +
> +       return bpf_program_attach_fd_type(prog, ifindex, "meta", attach_type,
> +                                         &link_create_opts);
> +}
> +
>  struct bpf_link *bpf_program__attach_freplace(const struct bpf_program *prog,
>                                               int target_fd,
>                                               const char *attach_func_name)
> diff --git a/tools/lib/bpf/libbpf.h b/tools/lib/bpf/libbpf.h
> index 0e52621cba43..827d29cf9a06 100644
> --- a/tools/lib/bpf/libbpf.h
> +++ b/tools/lib/bpf/libbpf.h
> @@ -800,6 +800,21 @@ LIBBPF_API struct bpf_link *
>  bpf_program__attach_tcx(const struct bpf_program *prog, int ifindex,
>                         const struct bpf_tcx_opts *opts);
>
> +struct bpf_meta_opts {
> +       /* size of this struct, for forward/backward compatibility */
> +       size_t sz;
> +       __u32 flags;
> +       __u32 relative_fd;
> +       __u32 relative_id;
> +       __u64 expected_revision;

nit: move flags to be the last, so we don't have that padding before
expected_revision?


> +       size_t :0;
> +};
> +#define bpf_meta_opts__last_field expected_revision
> +
> +LIBBPF_API struct bpf_link *
> +bpf_program__attach_meta(const struct bpf_program *prog, int ifindex,
> +                        bool peer_device, const struct bpf_meta_opts *opts);
> +
>  struct bpf_map;
>
>  LIBBPF_API struct bpf_link *bpf_map__attach_struct_ops(const struct bpf_map *map);
> diff --git a/tools/lib/bpf/libbpf.map b/tools/lib/bpf/libbpf.map
> index 57712321490f..2dd4fe2cba3d 100644
> --- a/tools/lib/bpf/libbpf.map
> +++ b/tools/lib/bpf/libbpf.map
> @@ -397,6 +397,7 @@ LIBBPF_1.3.0 {
>                 bpf_obj_pin_opts;
>                 bpf_object__unpin;
>                 bpf_prog_detach_opts;
> +               bpf_program__attach_meta;
>                 bpf_program__attach_netfilter;
>                 bpf_program__attach_tcx;
>                 bpf_program__attach_uprobe_multi;
> --
> 2.34.1
>

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH bpf-next 1/8] meta, bpf: Add bpf programmable meta device
  2023-09-26  5:59 ` [PATCH bpf-next 1/8] meta, bpf: Add bpf programmable meta device Daniel Borkmann
  2023-09-26 21:26   ` Stanislav Fomichev
@ 2023-09-28  9:16   ` Toke Høiland-Jørgensen
  2023-09-28 12:01     ` Willem de Bruijn
  2023-09-28 21:14     ` Daniel Borkmann
  2023-10-13 11:26   ` Florian Kauer
  2 siblings, 2 replies; 22+ messages in thread
From: Toke Høiland-Jørgensen @ 2023-09-28  9:16 UTC (permalink / raw)
  To: Daniel Borkmann, bpf
  Cc: netdev, martin.lau, razor, ast, andrii, john.fastabend,
	Daniel Borkmann

Daniel Borkmann <daniel@iogearbox.net> writes:

> This work adds a new, minimal BPF-programmable device called "meta" we
> recently presented at LSF/MM/BPF. The latter name derives from the Greek
> μετά, encompassing a wide array of meanings such as "on top of", "beyond".
> Given business logic is defined by BPF, this device can have many meanings.
> The core idea is that BPF programs are executed within the driver's xmit
> routine and therefore, e.g. in the case of containers/Pods, BPF processing
> moves closer to the source.

I like the concept, but I think we should change the name (as I believe
I also mentioned back when you presented it at LSF/MM/BPF). I know this
is basically bikeshedding, but I nevertheless think it is important, for
a couple of reasons:

- As you say, meta has a specific meaning, and this device is not a
  "meta" device in the common sense of the word: it is not tied to other
  devices (so it's not 'on top of' anything), and it is not "about"
  anything (as in metadata). It is just a device type that is programmed
  by BPF, so let's call it that.

- It's not discoverable; how are people supposed to figure out that they
  should go look for a 'meta' device? We also already have multiple
  things called 'metadata', so this is just going to create even more
  confusion (as we also discussed in relation to 'xdp hints').

- It squats on a pretty widely used term throughout the kernel
  (CONFIG_META, 'meta' as the module name). This is related to the above
  point; seeing something named 'meta' in lsmod, the natural assumption
  wouldn't be that it's a network driver.

I think we should just name the driver 'bpfnet'; it's not pretty, but
it's obvious and descriptive. Optionally we could teach 'ip' to
understand just 'bpf' as the device type, so you could go 'ip link add
type bpf' and get one of these.

> One of the goals was that, in the case of Pod egress traffic, this allows
> moving BPF programs from hostns tcx ingress into the device itself, providing
> earlier drop or forward mechanisms. For example, if the BPF program
> determines that the skb must be sent out of the node, then a redirect to
> the physical device can take place directly without going through the
> per-CPU backlog queue. This helps to shift processing for such traffic from
> softirq to process context, leading to better scheduling decisions and
> better performance.
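[ A rough, hypothetical sketch of what such an egress program attached to the
  meta device could look like; phys_ifindex is a placeholder and the
  node-local vs. remote decision is omitted, so this is not code from the
  series: ]

  #include <linux/bpf.h>
  #include <bpf/bpf_helpers.h>

  /* ifindex of the physical device; placeholder set by the loader */
  volatile const __u32 phys_ifindex = 0;

  /* Runs from within the meta device's xmit routine. */
  SEC("tc")
  int meta_egress(struct __sk_buff *skb)
  {
  	/* Resolve the L2 neighbor and queue directly on the physical device;
  	 * the helper's return value corresponds to META_REDIRECT.
  	 */
  	return bpf_redirect_neigh(phys_ifindex, NULL, 0, 0);
  }

  char __license[] SEC("license") = "GPL";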

So my only reservation to having this tied to a BPF-only device like
this is basically that if this is indeed such a big win, shouldn't we
try to make the stack operate in this mode by default? I assume you did
the analysis of what it would take to change veth to operate in this
mode; so what was the reason you decided to create a new device type
instead?

(I seem to recall at the presentation that you made a general reference
to veth being 'too complex', but complexity can be managed, so I'm more
thinking about whether there's any specific reason why changing veth
wouldn't work at all?)

[...]

Some comments on the code below:

> --- /dev/null
> +++ b/drivers/net/meta.c
> @@ -0,0 +1,734 @@
> +// SPDX-License-Identifier: GPL-2.0-only
> +/* Copyright (c) 2023 Isovalent */
> +
> +#include <linux/netdevice.h>
> +#include <linux/ethtool.h>
> +#include <linux/etherdevice.h>
> +#include <linux/filter.h>
> +#include <linux/netfilter_netdev.h>
> +#include <linux/bpf_mprog.h>
> +
> +#include <net/meta.h>
> +#include <net/dst.h>
> +#include <net/tcx.h>
> +
> +#define DRV_NAME	"meta"
> +#define DRV_VERSION	"1.0"

Looking at veth as an example, this will probably never get updated :)

So wouldn't it be better to use the kernel version as the driver
version? That way there will at least be some information in this field.
I guess we could make the same change for veth.

> +struct meta {
> +	/* Needed in fast-path */
> +	struct net_device __rcu *peer;
> +	struct bpf_mprog_entry __rcu *active;
> +	enum meta_action policy;
> +	struct bpf_mprog_bundle	bundle;
> +	/* Needed in slow-path */
> +	enum meta_mode mode;
> +	bool primary;
> +	u32 headroom;
> +};
> +
> +static void meta_scrub_minimum(struct sk_buff *skb)
> +{
> +	skb->skb_iif = 0;
> +	skb->ignore_df = 0;
> +	skb->priority = 0;
> +	skb_dst_drop(skb);
> +	skb_ext_reset(skb);
> +	nf_reset_ct(skb);
> +	nf_reset_trace(skb);
> +	nf_skip_egress(skb, true);
> +	ipvs_reset(skb);
> +}

Same question as Stanislav here :)

> +static __always_inline int
> +meta_run(const struct meta *meta, const struct bpf_mprog_entry *entry,
> +	 struct sk_buff *skb, enum meta_action ret)
> +{
> +	const struct bpf_mprog_fp *fp;
> +	const struct bpf_prog *prog;
> +
> +	bpf_mprog_foreach_prog(entry, fp, prog) {
> +		bpf_compute_data_pointers(skb);
> +		ret = bpf_prog_run(prog, skb);
> +		if (ret != META_NEXT)
> +			break;
> +	}
> +	return ret;
> +}
> +
> +static netdev_tx_t meta_xmit(struct sk_buff *skb, struct net_device *dev)
> +{
> +	struct meta *meta = netdev_priv(dev);
> +	enum meta_action ret = READ_ONCE(meta->policy);
> +	netdev_tx_t ret_dev = NET_XMIT_SUCCESS;
> +	const struct bpf_mprog_entry *entry;
> +	struct net_device *peer;
> +
> +	rcu_read_lock();
> +	peer = rcu_dereference(meta->peer);
> +	if (unlikely(!peer || !(peer->flags & IFF_UP) ||
> +		     !pskb_may_pull(skb, ETH_HLEN) ||
> +		     skb_orphan_frags(skb, GFP_ATOMIC)))
> +		goto drop;
> +	meta_scrub_minimum(skb);
> +	skb->dev = peer;
> +	entry = rcu_dereference(meta->active);
> +	if (entry)
> +		ret = meta_run(meta, entry, skb, ret);
> +	switch (ret) {
> +	case META_NEXT:
> +	case META_PASS:
> +		skb->pkt_type = PACKET_HOST;
> +		skb->protocol = eth_type_trans(skb, skb->dev);
> +		skb_postpull_rcsum(skb, eth_hdr(skb), ETH_HLEN);
> +		__netif_rx(skb);
> +		break;
> +	case META_REDIRECT:
> +		skb_do_redirect(skb);
> +		break;
> +	case META_DROP:

Why the aliases for the constants? Might as well reuse the TCX names?

> +	default:
> +drop:
> +		ret_dev = NET_XMIT_DROP;
> +		dev_core_stats_tx_dropped_inc(dev);
> +		kfree_skb(skb);
> +		break;
> +	}
> +	rcu_read_unlock();
> +	return ret_dev;
> +}
> +
> +static int meta_open(struct net_device *dev)
> +{
> +	struct meta *meta = netdev_priv(dev);
> +	struct net_device *peer = rtnl_dereference(meta->peer);
> +
> +	if (!peer)
> +		return -ENOTCONN;
> +	if (peer->flags & IFF_UP) {
> +		netif_carrier_on(dev);
> +		netif_carrier_on(peer);
> +	}
> +	return 0;
> +}
> +
> +static int meta_close(struct net_device *dev)
> +{
> +	struct meta *meta = netdev_priv(dev);
> +	struct net_device *peer = rtnl_dereference(meta->peer);
> +
> +	netif_carrier_off(dev);
> +	if (peer)
> +		netif_carrier_off(peer);
> +	return 0;
> +}
> +
> +static int meta_get_iflink(const struct net_device *dev)
> +{
> +	struct meta *meta = netdev_priv(dev);
> +	struct net_device *peer;
> +	int iflink = 0;
> +
> +	rcu_read_lock();
> +	peer = rcu_dereference(meta->peer);
> +	if (peer)
> +		iflink = peer->ifindex;
> +	rcu_read_unlock();
> +	return iflink;
> +}
> +
> +static void meta_set_multicast_list(struct net_device *dev)
> +{
> +}

The function name indicates there is some functionality envisioned here?
Why is the function empty?

> +static void meta_set_headroom(struct net_device *dev, int headroom)
> +{
> +	struct meta *meta = netdev_priv(dev), *meta2;
> +	struct net_device *peer;
> +
> +	if (headroom < 0)
> +		headroom = NET_SKB_PAD;
> +
> +	rcu_read_lock();
> +	peer = rcu_dereference(meta->peer);
> +	if (unlikely(!peer))
> +		goto out;
> +
> +	meta2 = netdev_priv(peer);
> +	meta->headroom = headroom;
> +	headroom = max(meta->headroom, meta2->headroom);
> +
> +	peer->needed_headroom = headroom;
> +	dev->needed_headroom = headroom;
> +out:
> +	rcu_read_unlock();
> +}
> +
> +static struct net_device *meta_peer_dev(struct net_device *dev)
> +{
> +	struct meta *meta = netdev_priv(dev);
> +
> +	return rcu_dereference(meta->peer);
> +}
> +
> +static struct net_device *meta_peer_dev_rtnl(struct net_device *dev)
> +{
> +	struct meta *meta = netdev_priv(dev);
> +
> +	return rcu_dereference_rtnl(meta->peer);
> +}
> +
> +static const struct net_device_ops meta_netdev_ops = {
> +	.ndo_open		= meta_open,
> +	.ndo_stop		= meta_close,
> +	.ndo_start_xmit		= meta_xmit,
> +	.ndo_set_rx_mode	= meta_set_multicast_list,
> +	.ndo_set_rx_headroom	= meta_set_headroom,
> +	.ndo_get_iflink		= meta_get_iflink,
> +	.ndo_get_peer_dev	= meta_peer_dev,
> +	.ndo_features_check	= passthru_features_check,
> +};
> +
> +static void meta_get_drvinfo(struct net_device *dev,
> +			     struct ethtool_drvinfo *info)
> +{
> +	strscpy(info->driver, DRV_NAME, sizeof(info->driver));
> +	strscpy(info->version, DRV_VERSION, sizeof(info->version));
> +}
> +
> +static const struct ethtool_ops meta_ethtool_ops = {
> +	.get_drvinfo		= meta_get_drvinfo,
> +};
> +
> +static void meta_setup(struct net_device *dev)
> +{
> +	static const netdev_features_t meta_features_hw_vlan =
> +		NETIF_F_HW_VLAN_CTAG_TX |
> +		NETIF_F_HW_VLAN_CTAG_RX |
> +		NETIF_F_HW_VLAN_STAG_TX |
> +		NETIF_F_HW_VLAN_STAG_RX;
> +	static const netdev_features_t meta_features =
> +		meta_features_hw_vlan |
> +		NETIF_F_SG |
> +		NETIF_F_FRAGLIST |
> +		NETIF_F_HW_CSUM |
> +		NETIF_F_RXCSUM |
> +		NETIF_F_SCTP_CRC |
> +		NETIF_F_HIGHDMA |
> +		NETIF_F_GSO_SOFTWARE |
> +		NETIF_F_GSO_ENCAP_ALL;
> +
> +	ether_setup(dev);
> +	dev->min_mtu = ETH_MIN_MTU;
> +	dev->max_mtu = ETH_MAX_MTU;
> +
> +	dev->flags |= IFF_NOARP;
> +	dev->priv_flags &= ~IFF_TX_SKB_SHARING;
> +	dev->priv_flags |= IFF_LIVE_ADDR_CHANGE;
> +	dev->priv_flags |= IFF_PHONY_HEADROOM;
> +	dev->priv_flags |= IFF_NO_QUEUE;

What happens if someone attaches a qdisc to the device in spite of this?

> +	dev->priv_flags |= IFF_META;
> +
> +	dev->ethtool_ops = &meta_ethtool_ops;
> +	dev->netdev_ops  = &meta_netdev_ops;
> +
> +	dev->features |= meta_features | NETIF_F_LLTX;
> +	dev->hw_features = meta_features;
> +	dev->hw_enc_features = meta_features;
> +	dev->mpls_features = NETIF_F_HW_CSUM | NETIF_F_GSO_SOFTWARE;
> +	dev->vlan_features = dev->features & ~meta_features_hw_vlan;
> +
> +	dev->needs_free_netdev = true;
> +
> +	netif_set_tso_max_size(dev, GSO_MAX_SIZE);
> +}
> +
> +static struct net *meta_get_link_net(const struct net_device *dev)
> +{
> +	struct meta *meta = netdev_priv(dev);
> +	struct net_device *peer = rtnl_dereference(meta->peer);
> +
> +	return peer ? dev_net(peer) : dev_net(dev);
> +}
> +
> +static int meta_check_policy(int policy, struct nlattr *tb,
> +			     struct netlink_ext_ack *extack)
> +{
> +	switch (policy) {
> +	case META_PASS:
> +	case META_DROP:
> +		return 0;
> +	default:
> +		NL_SET_ERR_MSG_ATTR(extack, tb,
> +				    "Provided default xmit policy not supported");
> +		return -EINVAL;
> +	}
> +}
> +
> +static int meta_check_mode(int mode, struct nlattr *tb,
> +			   struct netlink_ext_ack *extack)
> +{
> +	switch (mode) {
> +	case META_L2:
> +	case META_L3:
> +		return 0;
> +	default:
> +		NL_SET_ERR_MSG_ATTR(extack, tb,
> +				    "Provided device mode can only be L2 or L3");
> +		return -EINVAL;
> +	}
> +}
> +
> +static int meta_validate(struct nlattr *tb[], struct nlattr *data[],
> +			 struct netlink_ext_ack *extack)
> +{
> +	struct nlattr *attr = tb[IFLA_ADDRESS];
> +
> +	if (!attr)
> +		return 0;
> +	NL_SET_ERR_MSG_ATTR(extack, attr,
> +			    "Setting Ethernet address is not supported");
> +	return -EOPNOTSUPP;
> +}
> +
> +static struct rtnl_link_ops meta_link_ops;
> +
> +static int meta_new_link(struct net *src_net, struct net_device *dev,
> +			 struct nlattr *tb[], struct nlattr *data[],
> +			 struct netlink_ext_ack *extack)
> +{
> +	struct nlattr *peer_tb[IFLA_MAX + 1], **tbp = tb, *attr;
> +	enum meta_action default_prim = META_PASS;
> +	enum meta_action default_peer = META_PASS;
> +	unsigned char name_assign_type;
> +	enum meta_mode mode = META_L3;
> +	struct ifinfomsg *ifmp = NULL;
> +	struct net_device *peer;
> +	char ifname[IFNAMSIZ];
> +	struct meta *meta;
> +	struct net *net;
> +	int err;
> +
> +	if (data) {
> +		if (data[IFLA_META_MODE]) {
> +			attr = data[IFLA_META_MODE];
> +			mode = nla_get_u32(attr);
> +			err = meta_check_mode(mode, attr, extack);
> +			if (err < 0)
> +				return err;
> +		}
> +		if (data[IFLA_META_PEER_INFO]) {
> +			attr = data[IFLA_META_PEER_INFO];
> +			ifmp = nla_data(attr);
> +			err = rtnl_nla_parse_ifinfomsg(peer_tb, attr, extack);
> +			if (err < 0)
> +				return err;
> +			err = meta_validate(peer_tb, NULL, extack);
> +			if (err < 0)
> +				return err;
> +			tbp = peer_tb;
> +		}
> +		if (data[IFLA_META_POLICY]) {
> +			attr = data[IFLA_META_POLICY];
> +			default_prim = nla_get_u32(attr);
> +			err = meta_check_policy(default_prim, attr, extack);
> +			if (err < 0)
> +				return err;
> +		}
> +		if (data[IFLA_META_PEER_POLICY]) {
> +			attr = data[IFLA_META_PEER_POLICY];
> +			default_peer = nla_get_u32(attr);
> +			err = meta_check_policy(default_peer, attr, extack);
> +			if (err < 0)
> +				return err;
> +		}
> +	}
> +
> +	if (ifmp && tbp[IFLA_IFNAME]) {
> +		nla_strscpy(ifname, tbp[IFLA_IFNAME], IFNAMSIZ);
> +		name_assign_type = NET_NAME_USER;
> +	} else {
> +		snprintf(ifname, IFNAMSIZ, "m%%d");
> +		name_assign_type = NET_NAME_ENUM;
> +	}
> +
> +	net = rtnl_link_get_net(src_net, tbp);
> +	if (IS_ERR(net))
> +		return PTR_ERR(net);
> +
> +	peer = rtnl_create_link(net, ifname, name_assign_type,
> +				&meta_link_ops, tbp, extack);
> +	if (IS_ERR(peer)) {
> +		put_net(net);
> +		return PTR_ERR(peer);
> +	}
> +
> +	if (mode == META_L2)
> +		eth_hw_addr_random(peer);
> +	if (ifmp && dev->ifindex)
> +		peer->ifindex = ifmp->ifi_index;
> +
> +	netif_inherit_tso_max(peer, dev);
> +
> +	err = register_netdevice(peer);
> +	put_net(net);
> +	if (err < 0)
> +		goto err_register_peer;
> +
> +	netif_carrier_off(peer);
> +
> +	err = rtnl_configure_link(peer, ifmp, 0, NULL);
> +	if (err < 0)
> +		goto err_configure_peer;
> +
> +	if (mode == META_L2)
> +		eth_hw_addr_random(dev);
> +	if (tb[IFLA_IFNAME])
> +		nla_strscpy(dev->name, tb[IFLA_IFNAME], IFNAMSIZ);
> +	else
> +		snprintf(dev->name, IFNAMSIZ, "m%%d");
> +
> +	err = register_netdevice(dev);
> +	if (err < 0)
> +		goto err_configure_peer;
> +
> +	netif_carrier_off(dev);
> +
> +	meta = netdev_priv(dev);
> +	meta->primary = true;
> +	meta->policy = default_prim;
> +	meta->mode = mode;
> +	if (meta->mode == META_L2)
> +		dev_change_flags(dev, dev->flags & ~IFF_NOARP, NULL);
> +	bpf_mprog_bundle_init(&meta->bundle);
> +	RCU_INIT_POINTER(meta->active, NULL);
> +	rcu_assign_pointer(meta->peer, peer);
> +
> +	meta = netdev_priv(peer);
> +	meta->primary = false;
> +	meta->policy = default_peer;
> +	meta->mode = mode;
> +	if (meta->mode == META_L2)
> +		dev_change_flags(peer, peer->flags & ~IFF_NOARP, NULL);
> +	bpf_mprog_bundle_init(&meta->bundle);
> +	RCU_INIT_POINTER(meta->active, NULL);
> +	rcu_assign_pointer(meta->peer, dev);
> +	return 0;
> +err_configure_peer:
> +	unregister_netdevice(peer);
> +	return err;
> +err_register_peer:
> +	free_netdev(peer);
> +	return err;
> +}
> +
> +static struct bpf_mprog_entry *meta_entry_fetch(struct net_device *dev,
> +						bool bundle_fallback)
> +{
> +	struct meta *meta = netdev_priv(dev);
> +	struct bpf_mprog_entry *entry;
> +
> +	ASSERT_RTNL();
> +	entry = rcu_dereference_rtnl(meta->active);
> +	if (entry)
> +		return entry;
> +	if (bundle_fallback)
> +		return &meta->bundle.a;
> +	return NULL;
> +}
> +
> +static void meta_entry_update(struct net_device *dev, struct bpf_mprog_entry *entry)
> +{
> +	struct meta *meta = netdev_priv(dev);
> +
> +	ASSERT_RTNL();
> +	rcu_assign_pointer(meta->active, entry);
> +}
> +
> +static void meta_entry_sync(void)
> +{
> +	synchronize_rcu();
> +}
> +
> +static struct net_device *meta_dev_fetch(struct net *net, u32 ifindex, u32 which)
> +{
> +	struct net_device *dev;
> +	struct meta *meta;
> +
> +	ASSERT_RTNL();
> +
> +	switch (which) {
> +	case BPF_META_PRIMARY:
> +	case BPF_META_PEER:
> +		break;
> +	default:
> +		return ERR_PTR(-EINVAL);
> +	}
> +
> +	dev = __dev_get_by_index(net, ifindex);
> +	if (!dev)
> +		return ERR_PTR(-ENODEV);
> +	if (!(dev->priv_flags & IFF_META))
> +		return ERR_PTR(-ENXIO);

I don't really think a new flag value is needed here? Can't you just
make this check if (dev->netdev_ops == &meta_netdev_ops) ?
> +
> +	meta = netdev_priv(dev);
> +	if (!meta->primary)
> +		return ERR_PTR(-EACCES);
> +	if (which == BPF_META_PRIMARY)
> +		return dev;
> +	return meta_peer_dev_rtnl(dev);
> +}
> +
> +int meta_prog_attach(const union bpf_attr *attr, struct bpf_prog *prog)
> +{
> +	struct bpf_mprog_entry *entry, *entry_new;
> +	struct bpf_prog *replace_prog = NULL;
> +	struct net_device *dev;
> +	int ret;
> +
> +	rtnl_lock();
> +	dev = meta_dev_fetch(current->nsproxy->net_ns, attr->target_ifindex,
> +			     attr->attach_type);
> +	if (IS_ERR(dev)) {
> +		ret = PTR_ERR(dev);
> +		goto out;
> +	}
> +	entry = meta_entry_fetch(dev, true);
> +	if (attr->attach_flags & BPF_F_REPLACE) {
> +		replace_prog = bpf_prog_get_type(attr->replace_bpf_fd,
> +						 prog->type);
> +		if (IS_ERR(replace_prog)) {
> +			ret = PTR_ERR(replace_prog);
> +			replace_prog = NULL;
> +			goto out;
> +		}
> +	}
> +	ret = bpf_mprog_attach(entry, &entry_new, prog, NULL, replace_prog,
> +			       attr->attach_flags, attr->relative_fd,
> +			       attr->expected_revision);
> +	if (!ret) {
> +		if (entry != entry_new) {
> +			meta_entry_update(dev, entry_new);
> +			meta_entry_sync();
> +		}
> +		bpf_mprog_commit(entry);
> +	}
> +out:
> +	if (replace_prog)
> +		bpf_prog_put(replace_prog);
> +	rtnl_unlock();
> +	return ret;
> +}
> +
> +int meta_prog_detach(const union bpf_attr *attr, struct bpf_prog *prog)
> +{
> +	struct bpf_mprog_entry *entry, *entry_new;
> +	struct net_device *dev;
> +	int ret;
> +
> +	rtnl_lock();
> +	dev = meta_dev_fetch(current->nsproxy->net_ns, attr->target_ifindex,
> +			     attr->attach_type);
> +	if (IS_ERR(dev)) {
> +		ret = PTR_ERR(dev);
> +		goto out;
> +	}
> +	entry = meta_entry_fetch(dev, false);
> +	if (!entry) {
> +		ret = -ENOENT;
> +		goto out;
> +	}
> +	ret = bpf_mprog_detach(entry, &entry_new, prog, NULL, attr->attach_flags,
> +			       attr->relative_fd, attr->expected_revision);
> +	if (!ret) {
> +		if (!bpf_mprog_total(entry_new))
> +			entry_new = NULL;
> +		meta_entry_update(dev, entry_new);
> +		meta_entry_sync();
> +		bpf_mprog_commit(entry);
> +	}
> +out:
> +	rtnl_unlock();
> +	return ret;
> +}
> +
> +int meta_prog_query(const union bpf_attr *attr, union bpf_attr __user *uattr)
> +{
> +	struct bpf_mprog_entry *entry;
> +	struct net_device *dev;
> +	int ret;
> +
> +	rtnl_lock();
> +	dev = meta_dev_fetch(current->nsproxy->net_ns, attr->query.target_ifindex,
> +			     attr->query.attach_type);
> +	if (IS_ERR(dev)) {
> +		ret = PTR_ERR(dev);
> +		goto out;
> +	}
> +	entry = meta_entry_fetch(dev, false);
> +	if (!entry) {
> +		ret = -ENOENT;
> +		goto out;
> +	}
> +	ret = bpf_mprog_query(attr, uattr, entry);
> +out:
> +	rtnl_unlock();
> +	return ret;
> +}
> +
> +static void meta_release_all(struct net_device *dev)
> +{
> +	struct bpf_mprog_entry *entry;
> +	struct bpf_tuple tuple = {};
> +	struct bpf_mprog_fp *fp;
> +	struct bpf_mprog_cp *cp;
> +
> +	entry = meta_entry_fetch(dev, false);
> +	if (!entry)
> +		return;
> +	meta_entry_update(dev, NULL);
> +	meta_entry_sync();
> +	bpf_mprog_foreach_tuple(entry, fp, cp, tuple) {
> +		bpf_prog_put(tuple.prog);
> +	}
> +}
> +
> +static void meta_del_link(struct net_device *dev, struct list_head *head)
> +{
> +	struct meta *meta = netdev_priv(dev);
> +	struct net_device *peer = rtnl_dereference(meta->peer);
> +
> +	RCU_INIT_POINTER(meta->peer, NULL);
> +	meta_release_all(dev);
> +	unregister_netdevice_queue(dev, head);
> +	if (peer) {
> +		meta = netdev_priv(peer);
> +		RCU_INIT_POINTER(meta->peer, NULL);
> +		meta_release_all(peer);
> +		unregister_netdevice_queue(peer, head);
> +	}
> +}
> +
> +static int meta_change_link(struct net_device *dev, struct nlattr *tb[],
> +			    struct nlattr *data[],
> +			    struct netlink_ext_ack *extack)
> +{
> +	struct meta *meta = netdev_priv(dev);
> +	struct net_device *peer = rtnl_dereference(meta->peer);
> +	enum meta_action policy;
> +	struct nlattr *attr;
> +	int err;
> +
> +	if (!meta->primary) {
> +		NL_SET_ERR_MSG(extack,
> +			       "Meta settings can be changed only through the primary device");
> +		return -EACCES;
> +	}
> +
> +	if (data[IFLA_META_MODE]) {
> +		NL_SET_ERR_MSG_ATTR(extack, data[IFLA_META_MODE],
> +				    "Meta operating mode cannot be changed after device creation");
> +		return -EACCES;
> +	}
> +
> +	if (data[IFLA_META_POLICY]) {
> +		attr = data[IFLA_META_POLICY];
> +		policy = nla_get_u32(attr);
> +		err = meta_check_policy(policy, attr, extack);
> +		if (err)
> +			return err;
> +		WRITE_ONCE(meta->policy, policy);
> +	}
> +
> +	if (data[IFLA_META_PEER_POLICY]) {
> +		err = -EOPNOTSUPP;
> +		attr = data[IFLA_META_PEER_POLICY];
> +		policy = nla_get_u32(attr);
> +		if (peer)
> +			err = meta_check_policy(policy, attr, extack);
> +		if (err)
> +			return err;
> +		meta = netdev_priv(peer);
> +		WRITE_ONCE(meta->policy, policy);
> +	}
> +
> +	return 0;
> +}
> +
> +static size_t meta_get_size(const struct net_device *dev)
> +{
> +	return nla_total_size(sizeof(u32)) + /* IFLA_META_POLICY */
> +	       nla_total_size(sizeof(u32)) + /* IFLA_META_PEER_POLICY */
> +	       nla_total_size(sizeof(u8))  + /* IFLA_META_PRIMARY */
> +	       nla_total_size(sizeof(u32)) + /* IFLA_META_MODE */
> +	       0;
> +}
> +
> +static int meta_fill_info(struct sk_buff *skb, const struct net_device *dev)
> +{
> +	struct meta *meta = netdev_priv(dev);
> +	struct net_device *peer = rtnl_dereference(meta->peer);
> +
> +	if (nla_put_u8(skb, IFLA_META_PRIMARY, meta->primary))
> +		return -EMSGSIZE;
> +	if (nla_put_u32(skb, IFLA_META_POLICY, meta->policy))
> +		return -EMSGSIZE;
> +	if (nla_put_u32(skb, IFLA_META_MODE, meta->mode))
> +		return -EMSGSIZE;
> +
> +	if (peer) {
> +		meta = netdev_priv(peer);
> +		if (nla_put_u32(skb, IFLA_META_PEER_POLICY, meta->policy))
> +			return -EMSGSIZE;
> +	}
> +
> +	return 0;
> +}
> +
> +static const struct nla_policy meta_policy[IFLA_META_MAX + 1] = {
> +	[IFLA_META_PEER_INFO]	= { .len = sizeof(struct ifinfomsg) },
> +	[IFLA_META_POLICY]	= { .type = NLA_U32 },
> +	[IFLA_META_MODE]	= { .type = NLA_U32 },
> +	[IFLA_META_PEER_POLICY]	= { .type = NLA_U32 },
> +	[IFLA_META_PRIMARY]	= { .type = NLA_REJECT,
> +				    .reject_message = "Primary attribute is read-only" },
> +};
> +
> +static struct rtnl_link_ops meta_link_ops = {
> +	.kind		= DRV_NAME,
> +	.priv_size	= sizeof(struct meta),
> +	.setup		= meta_setup,
> +	.newlink	= meta_new_link,
> +	.dellink	= meta_del_link,
> +	.changelink	= meta_change_link,
> +	.get_link_net	= meta_get_link_net,
> +	.get_size	= meta_get_size,
> +	.fill_info	= meta_fill_info,
> +	.policy		= meta_policy,
> +	.validate	= meta_validate,
> +	.maxtype	= IFLA_META_MAX,
> +};
> +
> +static __init int meta_init(void)
> +{
> +	BUILD_BUG_ON((int)META_NEXT != (int)TCX_NEXT ||
> +		     (int)META_PASS != (int)TCX_PASS ||
> +		     (int)META_DROP != (int)TCX_DROP ||
> +		     (int)META_REDIRECT != (int)TCX_REDIRECT);
> +
> +	return rtnl_link_register(&meta_link_ops);
> +}
> +
> +static __exit void meta_exit(void)
> +{
> +	rtnl_link_unregister(&meta_link_ops);
> +}
> +
> +module_init(meta_init);
> +module_exit(meta_exit);
> +
> +MODULE_DESCRIPTION("BPF-programmable meta device");
> +MODULE_AUTHOR("Daniel Borkmann <daniel@iogearbox.net>");
> +MODULE_AUTHOR("Nikolay Aleksandrov <razor@blackwall.org>");
> +MODULE_LICENSE("GPL");
> +MODULE_ALIAS_RTNL_LINK(DRV_NAME);
> diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
> index 7e520c14eb8c..af0f23ed8d51 100644
> --- a/include/linux/netdevice.h
> +++ b/include/linux/netdevice.h
> @@ -1701,6 +1701,7 @@ struct net_device_ops {
>   * @IFF_SEE_ALL_HWTSTAMP_REQUESTS: device wants to see calls to
>   *	ndo_hwtstamp_set() for all timestamp requests regardless of source,
>   *	even if those aren't HWTSTAMP_SOURCE_NETDEV.
> + * @IFF_META: device is a meta device
>   */
>  enum netdev_priv_flags {
>  	IFF_802_1Q_VLAN			= 1<<0,
> @@ -1737,6 +1738,7 @@ enum netdev_priv_flags {
>  	IFF_TX_SKB_NO_LINEAR		= BIT_ULL(31),
>  	IFF_CHANGE_PROTO_DOWN		= BIT_ULL(32),
>  	IFF_SEE_ALL_HWTSTAMP_REQUESTS	= BIT_ULL(33),
> +	IFF_META			= BIT_ULL(34),
>  };
>  
>  #define IFF_802_1Q_VLAN			IFF_802_1Q_VLAN
> diff --git a/include/net/meta.h b/include/net/meta.h
> new file mode 100644
> index 000000000000..20fc61d05970
> --- /dev/null
> +++ b/include/net/meta.h
> @@ -0,0 +1,31 @@
> +/* SPDX-License-Identifier: GPL-2.0 */
> +/* Copyright (c) 2023 Isovalent */
> +#ifndef __NET_META_H
> +#define __NET_META_H
> +
> +#include <linux/bpf.h>
> +
> +#ifdef CONFIG_META
> +int meta_prog_attach(const union bpf_attr *attr, struct bpf_prog *prog);
> +int meta_prog_detach(const union bpf_attr *attr, struct bpf_prog *prog);
> +int meta_prog_query(const union bpf_attr *attr, union bpf_attr __user *uattr);
> +#else
> +static inline int meta_prog_attach(const union bpf_attr *attr,
> +				   struct bpf_prog *prog)
> +{
> +	return -EINVAL;
> +}
> +
> +static inline int meta_prog_detach(const union bpf_attr *attr,
> +				   struct bpf_prog *prog)
> +{
> +	return -EINVAL;
> +}
> +
> +static inline int meta_prog_query(const union bpf_attr *attr,
> +				  union bpf_attr __user *uattr)
> +{
> +	return -EINVAL;
> +}
> +#endif /* CONFIG_META */
> +#endif /* __NET_META_H */
> diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
> index 5f13db15a3c7..00a875720e84 100644
> --- a/include/uapi/linux/bpf.h
> +++ b/include/uapi/linux/bpf.h
> @@ -1047,6 +1047,8 @@ enum bpf_attach_type {
>  	BPF_TCX_INGRESS,
>  	BPF_TCX_EGRESS,
>  	BPF_TRACE_UPROBE_MULTI,
> +	BPF_META_PRIMARY,
> +	BPF_META_PEER,
>  	__MAX_BPF_ATTACH_TYPE
>  };
>  
> diff --git a/include/uapi/linux/if_link.h b/include/uapi/linux/if_link.h
> index fac351a93aed..ec099c6c51e0 100644
> --- a/include/uapi/linux/if_link.h
> +++ b/include/uapi/linux/if_link.h
> @@ -756,6 +756,31 @@ struct tunnel_msg {
>  	__u32 ifindex;
>  };
>  
> +/* META section */
> +enum meta_action {
> +	META_NEXT	= -1,
> +	META_PASS	= 0,
> +	META_DROP	= 2,
> +	META_REDIRECT	= 7,
> +};
> +
> +enum meta_mode {
> +	META_L2,
> +	META_L3,
> +};
> +
> +enum {
> +	IFLA_META_UNSPEC,
> +	IFLA_META_PEER_INFO,
> +	IFLA_META_PRIMARY,
> +	IFLA_META_POLICY,
> +	IFLA_META_PEER_POLICY,
> +	IFLA_META_MODE,
> +	__IFLA_META_MAX,
> +};
> +
> +#define IFLA_META_MAX	(__IFLA_META_MAX - 1)
> +
>  /* VXLAN section */
>  
>  /* include statistics in the dump */
> diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
> index 85c1d908f70f..51baf4355c39 100644
> --- a/kernel/bpf/syscall.c
> +++ b/kernel/bpf/syscall.c
> @@ -35,8 +35,9 @@
>  #include <linux/rcupdate_trace.h>
>  #include <linux/memcontrol.h>
>  #include <linux/trace_events.h>
> -#include <net/netfilter/nf_bpf_link.h>
>  
> +#include <net/netfilter/nf_bpf_link.h>
> +#include <net/meta.h>
>  #include <net/tcx.h>
>  
>  #define IS_FD_ARRAY(map) ((map)->map_type == BPF_MAP_TYPE_PERF_EVENT_ARRAY || \
> @@ -3720,6 +3721,8 @@ attach_type_to_prog_type(enum bpf_attach_type attach_type)
>  		return BPF_PROG_TYPE_LSM;
>  	case BPF_TCX_INGRESS:
>  	case BPF_TCX_EGRESS:
> +	case BPF_META_PRIMARY:
> +	case BPF_META_PEER:
>  		return BPF_PROG_TYPE_SCHED_CLS;
>  	default:
>  		return BPF_PROG_TYPE_UNSPEC;
> @@ -3771,7 +3774,9 @@ static int bpf_prog_attach_check_attach_type(const struct bpf_prog *prog,
>  		return 0;
>  	case BPF_PROG_TYPE_SCHED_CLS:
>  		if (attach_type != BPF_TCX_INGRESS &&
> -		    attach_type != BPF_TCX_EGRESS)
> +		    attach_type != BPF_TCX_EGRESS &&
> +		    attach_type != BPF_META_PRIMARY &&
> +		    attach_type != BPF_META_PEER)

PRIMARY and PEER basically correspond to INGRESS and EGRESS in terms of
which packets the program sees, right? So why not just reuse the ingress and
egress designators? The fact that it's a "peer" attachment is mostly an
implementation detail, isn't it? Or should 'mirred' redirection to the
device inside a container also be supported? (is it?)

Reusing it (and special-casing the tcx attachment) would prevent people
from accidentally attaching a tcx program on top of the device (which
AFAICT is otherwise possible, right?). Or maybe this is a feature?

-Toke

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH bpf-next 1/8] meta, bpf: Add bpf programmable meta device
  2023-09-28  9:16   ` Toke Høiland-Jørgensen
@ 2023-09-28 12:01     ` Willem de Bruijn
  2023-09-28 21:14     ` Daniel Borkmann
  1 sibling, 0 replies; 22+ messages in thread
From: Willem de Bruijn @ 2023-09-28 12:01 UTC (permalink / raw)
  To: Toke Høiland-Jørgensen
  Cc: Daniel Borkmann, bpf, netdev, martin.lau, razor, ast, andrii,
	john.fastabend

On Thu, Sep 28, 2023 at 11:17 AM Toke Høiland-Jørgensen <toke@kernel.org> wrote:
>
> Daniel Borkmann <daniel@iogearbox.net> writes:
>
> > This work adds a new, minimal BPF-programmable device called "meta" that we
> > recently presented at LSF/MM/BPF. The latter name derives from the Greek
> > μετά, encompassing a wide array of meanings such as "on top of", "beyond".
> > Given business logic is defined by BPF, this device can have many meanings.
> > The core idea is that BPF programs are executed within the driver's xmit
> > routine and therefore, e.g. in the case of containers/Pods, BPF processing
> > moves closer to the source.
>
> I like the concept, but I think we should change the name (as I believe
> I also mentioned back when you presented it at LSF/MM/BPF). I know this
> is basically bikeshedding, but I nevertheless think it is important, for
> a couple of reasons:
>
> - As you say, meta has a specific meaning, and this device is not a
>   "meta" device in the common sense of the word: it is not tied to other
>   devices (so it's not 'on top of' anything), and it is not "about"
>   anything (as in metadata). It is just a device type that is programmed
>   by BPF, so let's call it that.
>
> - It's not discoverable; how are people supposed to figure out that they
>   should go look for a 'meta' device? We also already have multiple
>   things called 'metadata', so this is just going to create even more
>   confusion (as we also discussed in relation to 'xdp hints').
>
> - It squats on a pretty widely used term throughout the kernel
>   (CONFIG_META, 'meta' as the module name). This is related to the above
>   point; seeing something named 'meta' in lsmod, the natural assumption
>   wouldn't be that it's a network driver.
>
> I think we should just name the driver 'bpfnet'; it's not pretty, but
> it's obvious and descriptive. Optionally we could teach 'ip' to
> understand just 'bpf' as the device type, so you could go 'ip link add
> type bpf' and get one of these.

+1

> > One of the goals was that in the case of Pod egress traffic, this allows
> > moving BPF programs from hostns tcx ingress into the device itself,
> > providing earlier drop or forward mechanisms: for example, if the BPF
> > program determines that the skb must be sent out of the node, then a
> > redirect to the physical device can take place directly without going
> > through the per-CPU backlog queue. This helps to shift processing for such
> > traffic from softirq to process context, leading to better scheduling
> > decisions and better performance.
>
> So my only reservation to having this tied to a BPF-only device like
> this is basically that if this is indeed such a big win, shouldn't we
> try to make the stack operate in this mode by default? I assume you did
> the analysis of what it would take to change veth to operate in this
> mode; so what was the reason you decided to create a new device type
> instead?
>
> (I seem to recall at the presentation that you made a general reference
> to veth being 'too complex', but complexity can be managed, so I'm more
> thinking about whether there's any specific reason why changing veth
> wouldn't work at all?)

If one of the concerns is the queuing of packets on the per-CPU softnet
backlog queue, I think it should be fine to call netif_receive_skb() instead
of __netif_rx(), at least for a single level of device nesting.
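
I.e. roughly the following in meta_xmit() (untested sketch, just to
illustrate what I mean):

	case META_NEXT:
	case META_PASS:
		skb->pkt_type = PACKET_HOST;
		skb->protocol = eth_type_trans(skb, skb->dev);
		skb_postpull_rcsum(skb, eth_hdr(skb), ETH_HLEN);
		/* deliver to the stack directly rather than deferring
		 * to the per-CPU backlog queue:
		 */
		netif_receive_skb(skb);
		break;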

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH bpf-next 1/8] meta, bpf: Add bpf programmable meta device
  2023-09-28  9:16   ` Toke Høiland-Jørgensen
  2023-09-28 12:01     ` Willem de Bruijn
@ 2023-09-28 21:14     ` Daniel Borkmann
  2023-09-29 19:25       ` Alexei Starovoitov
  1 sibling, 1 reply; 22+ messages in thread
From: Daniel Borkmann @ 2023-09-28 21:14 UTC (permalink / raw)
  To: Toke Høiland-Jørgensen, bpf
  Cc: netdev, martin.lau, razor, ast, andrii, john.fastabend

On 9/28/23 11:16 AM, Toke Høiland-Jørgensen wrote:
> Daniel Borkmann <daniel@iogearbox.net> writes:
> 
>> This work adds a new, minimal BPF-programmable device called "meta" that we
>> recently presented at LSF/MM/BPF. The latter name derives from the Greek
>> μετά, encompassing a wide array of meanings such as "on top of", "beyond".
>> Given business logic is defined by BPF, this device can have many meanings.
>> The core idea is that BPF programs are executed within the driver's xmit
>> routine and therefore, e.g. in the case of containers/Pods, BPF processing
>> moves closer to the source.
> 
> I like the concept, but I think we should change the name (as I believe
> I also mentioned back when you presented it at LSF/MM/BPF). I know this
> is basically bikeshedding, but I nevertheless think it is important, for
> a couple of reasons:
> 
> - As you say, meta has a specific meaning, and this device is not a
>    "meta" device in the common sense of the word: it is not tied to other
>    devices (so it's not 'on top of' anything), and it is not "about"
>    anything (as in metadata). It is just a device type that is programmed
>    by BPF, so let's call it that.
> 
> - It's not discoverable; how are people supposed to figure out that they
>    should go look for a 'meta' device? We also already have multiple
>    things called 'metadata', so this is just going to create even more
>    confusion (as we also discussed in relation to 'xdp hints').
> 
> - It squats on a pretty widely used term throughout the kernel
>    (CONFIG_META, 'meta' as the module name). This is related to the above
>    point; seeing something named 'meta' in lsmod, the natural assumption
>    wouldn't be that it's a network driver.
> 
> I think we should just name the driver 'bpfnet'; it's not pretty, but
> it's obvious and descriptive. Optionally we could teach 'ip' to
> understand just 'bpf' as the device type, so you could go 'ip link add
> type bpf' and get one of these.

I'll think about it; bpfnet sounds terrible, as you also noticed, and I
definitely don't like it. Perhaps meta_net, as suggested by Andrii in the
other thread, could be a compromise. I need to sleep on it; my preference
was actually to keep the name shorter.

>> One of the goals was that in the case of Pod egress traffic, this allows
>> moving BPF programs from hostns tcx ingress into the device itself,
>> providing earlier drop or forward mechanisms: for example, if the BPF
>> program determines that the skb must be sent out of the node, then a
>> redirect to the physical device can take place directly without going
>> through the per-CPU backlog queue. This helps to shift processing for such
>> traffic from softirq to process context, leading to better scheduling
>> decisions and better performance.
> 
> So my only reservation to having this tied to a BPF-only device like
> this is basically that if this is indeed such a big win, shouldn't we
> try to make the stack operate in this mode by default? I assume you did
> the analysis of what it would take to change veth to operate in this
> mode; so what was the reason you decided to create a new device type
> instead?

There are multiple virtual device flavors, and veth is not the sole one. Could
other virtual device flavors have been folded into veth instead? Perhaps, but
that doesn't mean they should have been. veth very much carries the connotation
of L2 and of a device pair. The core of this work is having BPF logic as part
of the xmit routine (with default policies when no BPF is attached), supporting
an L3 mode, and having the option to use the devices as a pair but also as a
single/standalone device, which we plan to push as the next step after this
series.

> Some comments on the code below:
> 
>> --- /dev/null
>> +++ b/drivers/net/meta.c
>> @@ -0,0 +1,734 @@
>> +// SPDX-License-Identifier: GPL-2.0-only
>> +/* Copyright (c) 2023 Isovalent */
>> +
>> +#include <linux/netdevice.h>
>> +#include <linux/ethtool.h>
>> +#include <linux/etherdevice.h>
>> +#include <linux/filter.h>
>> +#include <linux/netfilter_netdev.h>
>> +#include <linux/bpf_mprog.h>
>> +
>> +#include <net/meta.h>
>> +#include <net/dst.h>
>> +#include <net/tcx.h>
>> +
>> +#define DRV_NAME	"meta"
>> +#define DRV_VERSION	"1.0"
> 
> Looking at veth as an example, this will probably never get updated :)
> 
> So wouldn't it be better to use the kernel version as the driver
> version? That way there will at least be some information in this field.
> I guess we could make the same change for veth.

That's fine, I can change it to something more useful.
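
E.g. reusing the kernel release string would be one option (sketch; assuming
pulling in utsrelease.h is acceptable here):

#include <generated/utsrelease.h>

#define DRV_NAME	"meta"
#define DRV_VERSION	UTS_RELEASE

Or DRV_VERSION could be dropped altogether.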

[...]
>> +static netdev_tx_t meta_xmit(struct sk_buff *skb, struct net_device *dev)
>> +{
>> +	struct meta *meta = netdev_priv(dev);
>> +	enum meta_action ret = READ_ONCE(meta->policy);
>> +	netdev_tx_t ret_dev = NET_XMIT_SUCCESS;
>> +	const struct bpf_mprog_entry *entry;
>> +	struct net_device *peer;
>> +
>> +	rcu_read_lock();
>> +	peer = rcu_dereference(meta->peer);
>> +	if (unlikely(!peer || !(peer->flags & IFF_UP) ||
>> +		     !pskb_may_pull(skb, ETH_HLEN) ||
>> +		     skb_orphan_frags(skb, GFP_ATOMIC)))
>> +		goto drop;
>> +	meta_scrub_minimum(skb);
>> +	skb->dev = peer;
>> +	entry = rcu_dereference(meta->active);
>> +	if (entry)
>> +		ret = meta_run(meta, entry, skb, ret);
>> +	switch (ret) {
>> +	case META_NEXT:
>> +	case META_PASS:
>> +		skb->pkt_type = PACKET_HOST;
>> +		skb->protocol = eth_type_trans(skb, skb->dev);
>> +		skb_postpull_rcsum(skb, eth_hdr(skb), ETH_HLEN);
>> +		__netif_rx(skb);
>> +		break;
>> +	case META_REDIRECT:
>> +		skb_do_redirect(skb);
>> +		break;
>> +	case META_DROP:
> 
> Why the aliases for the constants? Might as well reuse the TCX names?

The constants are also used for the default configuration of the device
when no BPF is attached. Using the tcx constant names as part of that
configuration would be confusing, and I don't see a reason why the two need
to be tied together; it would add more confusion than it would help.

>> +	default:
>> +drop:
>> +		ret_dev = NET_XMIT_DROP;
>> +		dev_core_stats_tx_dropped_inc(dev);
>> +		kfree_skb(skb);
>> +		break;
>> +	}
>> +	rcu_read_unlock();
>> +	return ret_dev;
>> +}
>> +
>> +static int meta_open(struct net_device *dev)
>> +{
>> +	struct meta *meta = netdev_priv(dev);
>> +	struct net_device *peer = rtnl_dereference(meta->peer);
>> +
>> +	if (!peer)
>> +		return -ENOTCONN;
>> +	if (peer->flags & IFF_UP) {
>> +		netif_carrier_on(dev);
>> +		netif_carrier_on(peer);
>> +	}
>> +	return 0;
>> +}
>> +
>> +static int meta_close(struct net_device *dev)
>> +{
>> +	struct meta *meta = netdev_priv(dev);
>> +	struct net_device *peer = rtnl_dereference(meta->peer);
>> +
>> +	netif_carrier_off(dev);
>> +	if (peer)
>> +		netif_carrier_off(peer);
>> +	return 0;
>> +}
>> +
>> +static int meta_get_iflink(const struct net_device *dev)
>> +{
>> +	struct meta *meta = netdev_priv(dev);
>> +	struct net_device *peer;
>> +	int iflink = 0;
>> +
>> +	rcu_read_lock();
>> +	peer = rcu_dereference(meta->peer);
>> +	if (peer)
>> +		iflink = peer->ifindex;
>> +	rcu_read_unlock();
>> +	return iflink;
>> +}
>> +
>> +static void meta_set_multicast_list(struct net_device *dev)
>> +{
>> +}
> 
> The function name indicates there is some functionality envisioned here?
> Why is the function empty?

This is a stub callback for the multicast filter: given it's a virtual dev
that receives whatever traffic you push to it without further configuration,
the callback is intentionally empty. See also ndo_set_rx_mode in various
other virtual-only devices. I can add a comment.
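
Roughly along these lines (sketch of the comment I have in mind):

/* ndo_set_rx_mode stub: there are no address filters to program for this
 * purely virtual device, it accepts whatever is pushed to it, hence this
 * is intentionally a no-op.
 */
static void meta_set_multicast_list(struct net_device *dev)
{
}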

[...]
>> +static struct net_device *meta_dev_fetch(struct net *net, u32 ifindex, u32 which)
>> +{
>> +	struct net_device *dev;
>> +	struct meta *meta;
>> +
>> +	ASSERT_RTNL();
>> +
>> +	switch (which) {
>> +	case BPF_META_PRIMARY:
>> +	case BPF_META_PEER:
>> +		break;
>> +	default:
>> +		return ERR_PTR(-EINVAL);
>> +	}
>> +
>> +	dev = __dev_get_by_index(net, ifindex);
>> +	if (!dev)
>> +		return ERR_PTR(-ENODEV);
>> +	if (!(dev->priv_flags & IFF_META))
>> +		return ERR_PTR(-ENXIO);
> 
> I don't really think a new flag value is needed here? Can't you just
> make this check if (dev->netdev_ops == &meta_netdev_ops) ?

Agree, very good point. Will change.
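
I.e. meta_dev_fetch() then simply becomes something like (sketch):

	dev = __dev_get_by_index(net, ifindex);
	if (!dev)
		return ERR_PTR(-ENODEV);
	/* the ops pointer identifies the device, no extra priv flag needed */
	if (dev->netdev_ops != &meta_netdev_ops)
		return ERR_PTR(-ENXIO);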

[...]
>>   #define IS_FD_ARRAY(map) ((map)->map_type == BPF_MAP_TYPE_PERF_EVENT_ARRAY || \
>> @@ -3720,6 +3721,8 @@ attach_type_to_prog_type(enum bpf_attach_type attach_type)
>>   		return BPF_PROG_TYPE_LSM;
>>   	case BPF_TCX_INGRESS:
>>   	case BPF_TCX_EGRESS:
>> +	case BPF_META_PRIMARY:
>> +	case BPF_META_PEER:
>>   		return BPF_PROG_TYPE_SCHED_CLS;
>>   	default:
>>   		return BPF_PROG_TYPE_UNSPEC;
>> @@ -3771,7 +3774,9 @@ static int bpf_prog_attach_check_attach_type(const struct bpf_prog *prog,
>>   		return 0;
>>   	case BPF_PROG_TYPE_SCHED_CLS:
>>   		if (attach_type != BPF_TCX_INGRESS &&
>> -		    attach_type != BPF_TCX_EGRESS)
>> +		    attach_type != BPF_TCX_EGRESS &&
>> +		    attach_type != BPF_META_PRIMARY &&
>> +		    attach_type != BPF_META_PEER)
> 
> PRIMARY and PEER basically correspond to INGRESS and EGRESS in terms of
> which packets the program sees, right? So why not just reuse the ingress and
> egress designators? The fact that it's a "peer" attachment is mostly an
> implementation detail, isn't it? Or should 'mirred' redirection to the
> device inside a container also be supported? (is it?)

No, ingress/egress is highly confusing here given it can have many meanings.
You can ingress into the container or ingress into the host, for example, so
it is not clear without more context. Also, as a next step we plan to make
this device configurable as a single device instead of a peered pair. Then
only 'primary' is available to attach to, which is much simpler to reason
about as a mental model.
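
To make the intended usage a bit more concrete from the user side, with the
libbpf API from patch 4 of this series (untested sketch):

	struct bpf_link *link;

	/* prog to run in the primary device's xmit path */
	link = bpf_program__attach_meta(prog, ifindex, false /* peer_device */, NULL);

	/* prog to run in the peer's xmit path, i.e. the side typically
	 * moved into the Pod's netns
	 */
	link = bpf_program__attach_meta(prog, ifindex, true /* peer_device */, NULL);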

> Reusing it (and special-casing the tcx attachment) would prevent people
> from accidentally attaching a tcx program on top of the device (which
> AFAICT is otherwise possible, right?). Or maybe this is a feature?

You can use tcx with it just fine and maybe some people have a need for it,
for example, for implementing logic within the container. There is certainly
no reason to prevent that.
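
E.g. from inside the container something like the following keeps working
as-is (untested sketch; assuming the prog was loaded with SEC("tcx/ingress")
and ifindex is the device as seen inside the container):

	LIBBPF_OPTS(bpf_tcx_opts, opts);
	struct bpf_link *link;

	link = bpf_program__attach_tcx(prog, ifindex, &opts);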

Thanks,
Daniel

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH bpf-next 4/8] libbpf: Add link-based API for meta
  2023-09-28  0:12   ` Andrii Nakryiko
@ 2023-09-28 21:30     ` Daniel Borkmann
  0 siblings, 0 replies; 22+ messages in thread
From: Daniel Borkmann @ 2023-09-28 21:30 UTC (permalink / raw)
  To: Andrii Nakryiko
  Cc: bpf, netdev, martin.lau, razor, ast, andrii, john.fastabend

On 9/28/23 2:12 AM, Andrii Nakryiko wrote:
> On Mon, Sep 25, 2023 at 10:59 PM Daniel Borkmann <daniel@iogearbox.net> wrote:
[...]
>> +struct bpf_link *
>> +bpf_program__attach_meta(const struct bpf_program *prog, int ifindex,
>> +                        bool peer_device, const struct bpf_meta_opts *opts)
> 
> you mentioned that there are plans to also support cases where there
> is no primary-peer. Is that going to be a primary-only setup or will
> it be some third option? If the latter, should this `bool peer_device`
> be an enum then?

Agree, enum is more flexible either way, will change it to that.
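
Something like this then (sketch; the enum and its member names below are
just placeholders):

enum bpf_meta_attach_target {
	BPF_META_TARGET_PRIMARY = 0,
	BPF_META_TARGET_PEER,
	/* leaves room for a future single/standalone mode */
};

LIBBPF_API struct bpf_link *
bpf_program__attach_meta(const struct bpf_program *prog, int ifindex,
			 enum bpf_meta_attach_target target,
			 const struct bpf_meta_opts *opts);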

>> +{
>> +       LIBBPF_OPTS(bpf_link_create_opts, link_create_opts);
>> +       enum bpf_attach_type attach_type;
>> +       __u32 relative_id;
>> +       int relative_fd;
>> +
>> +       if (!OPTS_VALID(opts, bpf_meta_opts))
>> +               return libbpf_err_ptr(-EINVAL);
>> +
>> +       relative_id = OPTS_GET(opts, relative_id, 0);
>> +       relative_fd = OPTS_GET(opts, relative_fd, 0);
>> +       attach_type = peer_device ? BPF_META_PEER : BPF_META_PRIMARY;
>> +
>> +       /* validate we don't have unexpected combinations of non-zero fields */
>> +       if (!ifindex) {
>> +               pr_warn("prog '%s': target netdevice ifindex cannot be zero\n",
>> +                       prog->name);
>> +               return libbpf_err_ptr(-EINVAL);
>> +       }
>> +       if (relative_fd && relative_id) {
>> +               pr_warn("prog '%s': relative_fd and relative_id cannot be set at the same time\n",
>> +                       prog->name);
>> +               return libbpf_err_ptr(-EINVAL);
>> +       }
>> +
>> +       link_create_opts.meta.expected_revision = OPTS_GET(opts, expected_revision, 0);
>> +       link_create_opts.meta.relative_fd = relative_fd;
>> +       link_create_opts.meta.relative_id = relative_id;
>> +       link_create_opts.flags = OPTS_GET(opts, flags, 0);
>> +
>> +       return bpf_program_attach_fd_type(prog, ifindex, "meta", attach_type,
>> +                                         &link_create_opts);
>> +}
>> +
>>   struct bpf_link *bpf_program__attach_freplace(const struct bpf_program *prog,
>>                                                int target_fd,
>>                                                const char *attach_func_name)
>> diff --git a/tools/lib/bpf/libbpf.h b/tools/lib/bpf/libbpf.h
>> index 0e52621cba43..827d29cf9a06 100644
>> --- a/tools/lib/bpf/libbpf.h
>> +++ b/tools/lib/bpf/libbpf.h
>> @@ -800,6 +800,21 @@ LIBBPF_API struct bpf_link *
>>   bpf_program__attach_tcx(const struct bpf_program *prog, int ifindex,
>>                          const struct bpf_tcx_opts *opts);
>>
>> +struct bpf_meta_opts {
>> +       /* size of this struct, for forward/backward compatibility */
>> +       size_t sz;
>> +       __u32 flags;
>> +       __u32 relative_fd;
>> +       __u32 relative_id;
>> +       __u64 expected_revision;
> 
> nit: move flags to be the last, so we don't have that padding before
> expected_revision?

Sounds good, will do.
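
I.e. something like (sketch):

struct bpf_meta_opts {
	/* size of this struct, for forward/backward compatibility */
	size_t sz;
	__u32 relative_fd;
	__u32 relative_id;
	__u64 expected_revision;
	__u32 flags;
	size_t :0;
};
#define bpf_meta_opts__last_field flags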

>> +       size_t :0;
>> +};
>> +#define bpf_meta_opts__last_field expected_revision
>> +
>> +LIBBPF_API struct bpf_link *
>> +bpf_program__attach_meta(const struct bpf_program *prog, int ifindex,
>> +                        bool peer_device, const struct bpf_meta_opts *opts);
>> +
>>   struct bpf_map;
>>
>>   LIBBPF_API struct bpf_link *bpf_map__attach_struct_ops(const struct bpf_map *map);
>> diff --git a/tools/lib/bpf/libbpf.map b/tools/lib/bpf/libbpf.map
>> index 57712321490f..2dd4fe2cba3d 100644
>> --- a/tools/lib/bpf/libbpf.map
>> +++ b/tools/lib/bpf/libbpf.map
>> @@ -397,6 +397,7 @@ LIBBPF_1.3.0 {
>>                  bpf_obj_pin_opts;
>>                  bpf_object__unpin;
>>                  bpf_prog_detach_opts;
>> +               bpf_program__attach_meta;
>>                  bpf_program__attach_netfilter;
>>                  bpf_program__attach_tcx;
>>                  bpf_program__attach_uprobe_multi;
>> --
>> 2.34.1
>>


^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH bpf-next 1/8] meta, bpf: Add bpf programmable meta device
  2023-09-28 21:14     ` Daniel Borkmann
@ 2023-09-29 19:25       ` Alexei Starovoitov
  0 siblings, 0 replies; 22+ messages in thread
From: Alexei Starovoitov @ 2023-09-29 19:25 UTC (permalink / raw)
  To: Daniel Borkmann
  Cc: Toke Høiland-Jørgensen, bpf, Network Development,
	Martin KaFai Lau, Nikolay Aleksandrov, Alexei Starovoitov,
	Andrii Nakryiko, John Fastabend

On Thu, Sep 28, 2023 at 2:14 PM Daniel Borkmann <daniel@iogearbox.net> wrote:
>
>
> > I think we should just name the driver 'bpfnet'; it's not pretty, but
> > it's obvious and descriptive. Optionally we could teach 'ip' to
> > understand just 'bpf' as the device type, so you could go 'ip link add
> > type bpf' and get one of these.
>
> I'll think about it, the bpfnet sounds terrible as you also noticed. I
> definitely don't like that. Perhaps meta_net as suggested by Andrii in
> the other thread could be a compromise. Need to sleep over it, my pref
> was actually to keep it shorter.

I don't like the meta name either, standalone or as meta_net.
Maybe a "hollow" or "void" netdevice?
Since this netdev doesn't have any substance when no bpf prog is attached,
it's empty == dummy == hollow == void netdevice.

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH bpf-next 1/8] meta, bpf: Add bpf programmable meta device
  2023-09-26  5:59 ` [PATCH bpf-next 1/8] meta, bpf: Add bpf programmable meta device Daniel Borkmann
  2023-09-26 21:26   ` Stanislav Fomichev
  2023-09-28  9:16   ` Toke Høiland-Jørgensen
@ 2023-10-13 11:26   ` Florian Kauer
  2 siblings, 0 replies; 22+ messages in thread
From: Florian Kauer @ 2023-10-13 11:26 UTC (permalink / raw)
  To: Daniel Borkmann, bpf
  Cc: netdev, martin.lau, razor, ast, andrii, john.fastabend

Hi Daniel,

On 26.09.23 07:59, Daniel Borkmann wrote:
> This work adds a new, minimal BPF-programmable device called "meta" that we
> recently presented at LSF/MM/BPF. The latter name derives from the Greek
> μετά, encompassing a wide array of meanings such as "on top of", "beyond".
> Given business logic is defined by BPF, this device can have many meanings.
> The core idea is that BPF programs are executed within the driver's xmit
> routine and therefore, e.g. in the case of containers/Pods, BPF processing
> moves closer to the source.

I have a more general question:
You mentioned in your LSF/MM/BPF talk that you do not plan XDP support for the "meta" device,
because it is already supported in veth.
Does that imply that, already today, I would get full zero-copy from a container via veth to the NIC
(and not only to other containers) by simply opening an AF_XDP socket inside the container?

Thanks,
Florian

^ permalink raw reply	[flat|nested] 22+ messages in thread
