From: Jiri Pirko <jiri@resnulli.us>
To: Daniel Borkmann <daniel@iogearbox.net>
Cc: bpf@vger.kernel.org, netdev@vger.kernel.org,
martin.lau@linux.dev, razor@blackwall.org, ast@kernel.org,
andrii@kernel.org, john.fastabend@gmail.com, sdf@google.com,
toke@kernel.org, kuba@kernel.org, andrew@lunn.ch,
"Toke Høiland-Jørgensen" <toke@redhat.com>
Subject: Re: [PATCH bpf-next v4 1/7] netkit, bpf: Add bpf programmable net device
Date: Wed, 25 Oct 2023 17:47:05 +0200 [thread overview]
Message-ID: <ZTk4ec8CBh92PZvs@nanopsycho> (raw)
In-Reply-To: <20231024214904.29825-2-daniel@iogearbox.net>
Tue, Oct 24, 2023 at 11:48:58PM CEST, daniel@iogearbox.net wrote:
>This work adds a new, minimal BPF-programmable device called "netkit"
>(former PoC code-name "meta") we recently presented at LSF/MM/BPF. The
>core idea is that BPF programs are executed within the drivers xmit routine
>and therefore e.g. in case of containers/Pods moving BPF processing closer
>to the source.
>
>One of the goals was that in case of Pod egress traffic, this allows to
>move BPF programs from hostns tcx ingress into the device itself, providing
>earlier drop or forward mechanisms, for example, if the BPF program
>determines that the skb must be sent out of the node, then a redirect to
>the physical device can take place directly without going through per-CPU
>backlog queue. This helps to shift processing for such traffic from softirq
>to process context, leading to better scheduling decisions/performance (see
>measurements in the slides).
>
>In this initial version, the netkit device ships as a pair, but we plan to
>extend this further so it can also operate in single device mode. The pair
Single device mode should work how?
>comes with a primary and a peer device. Only the primary device, typically
>residing in hostns, can manage BPF programs for itself and its peer. The
>peer device is designated for containers/Pods and cannot attach/detach
>BPF programs. Upon the device creation, the user can set the default policy
>to 'pass' or 'drop' for the case when no BPF program is attached.
It looks to me that you only need the host (primary) netdevice to be
used as a handle to attach the bpf programs. Because the bpf program
can (and probably in real use case will) redirect to uplink/another
pod netdevice skipping the host (primary) netdevice, correct?
If so, why can't you do just single device mode from start finding a
different sort of bpf attach handle? (not sure which)
>
>Additionally, the device can be operated in L3 (default) or L2 mode. The
>management of BPF programs is done via bpf_mprog, so that multi-attach is
>supported right from the beginning with similar API and dependency controls
>as tcx. For details on the latter see commit 053c8e1f235d ("bpf: Add generic
>attach/detach/query API for multi-progs"). tc BPF compatibility is provided,
>so that existing programs can be easily migrated.
>
>Going forward, we plan to use netkit devices in Cilium as the main device
>type for connecting Pods. They will be operated in L3 mode in order to
>simplify a Pod's neighbor management and the peer will operate in default
>drop mode, so that no traffic is leaving between the time when a Pod is
>brought up by the CNI plugin and programs attached by the agent.
>Additionally, the programs we attach via tcx on the physical devices are
>using bpf_redirect_peer() for inbound traffic into netkit device, hence the
>latter is also supporting the ndo_get_peer_dev callback. Similarly, we use
>bpf_redirect_neigh() for the way out, pushing from netkit peer to phys device
>directly. Also, BIG TCP is supported on netkit device. For the follow-up
>work in single device mode, we plan to convert Cilium's cilium_host/_net
>devices into a single one.
>
>An extensive test suite for checking device operations and the BPF program
>and link management API comes as BPF selftests in this series.
>
Couple of nitpicks below:
[..]
>+static int netkit_check_policy(int policy, struct nlattr *tb,
>+ struct netlink_ext_ack *extack)
>+{
>+ switch (policy) {
>+ case NETKIT_PASS:
>+ case NETKIT_DROP:
>+ return 0;
>+ default:
Isn't this job for netlink policy?
>+ NL_SET_ERR_MSG_ATTR(extack, tb,
>+ "Provided default xmit policy not supported");
>+ return -EINVAL;
>+ }
>+}
>+
>+static int netkit_check_mode(int mode, struct nlattr *tb,
>+ struct netlink_ext_ack *extack)
>+{
>+ switch (mode) {
>+ case NETKIT_L2:
>+ case NETKIT_L3:
>+ return 0;
>+ default:
Isn't this job for netlink policy?
>+ NL_SET_ERR_MSG_ATTR(extack, tb,
>+ "Provided device mode can only be L2 or L3");
>+ return -EINVAL;
>+ }
>+}
>+
>+static int netkit_validate(struct nlattr *tb[], struct nlattr *data[],
>+ struct netlink_ext_ack *extack)
>+{
>+ struct nlattr *attr = tb[IFLA_ADDRESS];
>+
>+ if (!attr)
>+ return 0;
>+ NL_SET_ERR_MSG_ATTR(extack, attr,
>+ "Setting Ethernet address is not supported");
>+ return -EOPNOTSUPP;
>+}
>+
>+static struct rtnl_link_ops netkit_link_ops;
>+
>+static int netkit_new_link(struct net *src_net, struct net_device *dev,
>+ struct nlattr *tb[], struct nlattr *data[],
>+ struct netlink_ext_ack *extack)
>+{
>+ struct nlattr *peer_tb[IFLA_MAX + 1], **tbp = tb, *attr;
>+ enum netkit_action default_prim = NETKIT_PASS;
>+ enum netkit_action default_peer = NETKIT_PASS;
>+ enum netkit_mode mode = NETKIT_L3;
>+ unsigned char ifname_assign_type;
>+ struct ifinfomsg *ifmp = NULL;
>+ struct net_device *peer;
>+ char ifname[IFNAMSIZ];
>+ struct netkit *nk;
>+ struct net *net;
>+ int err;
>+
>+ if (data) {
>+ if (data[IFLA_NETKIT_MODE]) {
>+ attr = data[IFLA_NETKIT_MODE];
>+ mode = nla_get_u32(attr);
>+ err = netkit_check_mode(mode, attr, extack);
>+ if (err < 0)
>+ return err;
>+ }
>+ if (data[IFLA_NETKIT_PEER_INFO]) {
>+ attr = data[IFLA_NETKIT_PEER_INFO];
>+ ifmp = nla_data(attr);
>+ err = rtnl_nla_parse_ifinfomsg(peer_tb, attr, extack);
>+ if (err < 0)
>+ return err;
>+ err = netkit_validate(peer_tb, NULL, extack);
>+ if (err < 0)
>+ return err;
>+ tbp = peer_tb;
>+ }
>+ if (data[IFLA_NETKIT_POLICY]) {
>+ attr = data[IFLA_NETKIT_POLICY];
>+ default_prim = nla_get_u32(attr);
>+ err = netkit_check_policy(default_prim, attr, extack);
>+ if (err < 0)
>+ return err;
>+ }
>+ if (data[IFLA_NETKIT_PEER_POLICY]) {
>+ attr = data[IFLA_NETKIT_PEER_POLICY];
>+ default_peer = nla_get_u32(attr);
>+ err = netkit_check_policy(default_peer, attr, extack);
>+ if (err < 0)
>+ return err;
>+ }
>+ }
>+
>+ if (ifmp && tbp[IFLA_IFNAME]) {
>+ nla_strscpy(ifname, tbp[IFLA_IFNAME], IFNAMSIZ);
>+ ifname_assign_type = NET_NAME_USER;
>+ } else {
>+ strscpy(ifname, "nk%d", IFNAMSIZ);
>+ ifname_assign_type = NET_NAME_ENUM;
>+ }
>+
>+ net = rtnl_link_get_net(src_net, tbp);
>+ if (IS_ERR(net))
>+ return PTR_ERR(net);
>+
>+ peer = rtnl_create_link(net, ifname, ifname_assign_type,
>+ &netkit_link_ops, tbp, extack);
>+ if (IS_ERR(peer)) {
>+ put_net(net);
>+ return PTR_ERR(peer);
>+ }
>+
>+ netif_inherit_tso_max(peer, dev);
>+
>+ if (mode == NETKIT_L2)
>+ eth_hw_addr_random(peer);
>+ if (ifmp && dev->ifindex)
>+ peer->ifindex = ifmp->ifi_index;
>+
>+ nk = netkit_priv(peer);
>+ nk->primary = false;
>+ nk->policy = default_peer;
>+ nk->mode = mode;
>+ bpf_mprog_bundle_init(&nk->bundle);
>+ RCU_INIT_POINTER(nk->active, NULL);
>+ RCU_INIT_POINTER(nk->peer, NULL);
Aren't these already 0?
>+
>+ err = register_netdevice(peer);
>+ put_net(net);
>+ if (err < 0)
>+ goto err_register_peer;
>+ netif_carrier_off(peer);
>+ if (mode == NETKIT_L2)
>+ dev_change_flags(peer, peer->flags & ~IFF_NOARP, NULL);
>+
>+ err = rtnl_configure_link(peer, NULL, 0, NULL);
>+ if (err < 0)
>+ goto err_configure_peer;
>+
>+ if (mode == NETKIT_L2)
>+ eth_hw_addr_random(dev);
>+ if (tb[IFLA_IFNAME])
>+ nla_strscpy(dev->name, tb[IFLA_IFNAME], IFNAMSIZ);
>+ else
>+ strscpy(dev->name, "nk%d", IFNAMSIZ);
>+
>+ nk = netkit_priv(dev);
>+ nk->primary = true;
>+ nk->policy = default_prim;
>+ nk->mode = mode;
>+ bpf_mprog_bundle_init(&nk->bundle);
>+ RCU_INIT_POINTER(nk->active, NULL);
>+ RCU_INIT_POINTER(nk->peer, NULL);
>+
>+ err = register_netdevice(dev);
>+ if (err < 0)
>+ goto err_configure_peer;
>+ netif_carrier_off(dev);
>+ if (mode == NETKIT_L2)
>+ dev_change_flags(dev, dev->flags & ~IFF_NOARP, NULL);
>+
>+ rcu_assign_pointer(netkit_priv(dev)->peer, peer);
>+ rcu_assign_pointer(netkit_priv(peer)->peer, dev);
>+ return 0;
>+err_configure_peer:
>+ unregister_netdevice(peer);
>+ return err;
>+err_register_peer:
>+ free_netdev(peer);
>+ return err;
>+}
>+
[..]
next prev parent reply other threads:[~2023-10-25 15:47 UTC|newest]
Thread overview: 27+ messages / expand[flat|nested] mbox.gz Atom feed top
2023-10-24 21:48 [PATCH bpf-next v4 0/7] Add bpf programmable net device Daniel Borkmann
2023-10-24 21:48 ` [PATCH bpf-next v4 1/7] netkit, bpf: " Daniel Borkmann
2023-10-25 15:47 ` Jiri Pirko [this message]
2023-10-25 17:20 ` Daniel Borkmann
2023-10-26 5:18 ` Jiri Pirko
2023-10-26 12:11 ` Daniel Borkmann
2023-10-25 19:21 ` Nikolay Aleksandrov
2023-10-26 5:26 ` Jiri Pirko
2023-10-26 6:21 ` Nikolay Aleksandrov
2023-10-25 21:24 ` Kui-Feng Lee
2023-10-25 22:09 ` Martin KaFai Lau
2023-10-26 1:15 ` Kui-Feng Lee
2023-10-26 1:18 ` Kui-Feng Lee
2023-10-26 6:20 ` Daniel Borkmann
2023-10-26 17:47 ` Kui-Feng Lee
2023-10-26 18:46 ` Martin KaFai Lau
2023-10-24 21:48 ` [PATCH bpf-next v4 2/7] tools: Sync if_link uapi header Daniel Borkmann
2023-10-24 21:49 ` [PATCH bpf-next v4 3/7] libbpf: Add link-based API for netkit Daniel Borkmann
2023-10-24 21:49 ` [PATCH bpf-next v4 4/7] bpftool: Implement link show support " Daniel Borkmann
2023-10-24 21:49 ` [PATCH bpf-next v4 5/7] bpftool: Extend net dump with netkit progs Daniel Borkmann
2023-10-24 21:49 ` [PATCH bpf-next v4 6/7] selftests/bpf: Add netlink helper library Daniel Borkmann
2023-10-24 21:49 ` [PATCH bpf-next v4 7/7] selftests/bpf: Add selftests for netkit Daniel Borkmann
2023-10-24 22:45 ` [PATCH bpf-next v4 0/7] Add bpf programmable net device Martin KaFai Lau
2023-10-24 23:50 ` patchwork-bot+netdevbpf
2023-10-25 15:50 ` Jiri Pirko
2023-10-25 16:54 ` Martin KaFai Lau
2023-10-26 5:35 ` Jiri Pirko
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=ZTk4ec8CBh92PZvs@nanopsycho \
--to=jiri@resnulli.us \
--cc=andrew@lunn.ch \
--cc=andrii@kernel.org \
--cc=ast@kernel.org \
--cc=bpf@vger.kernel.org \
--cc=daniel@iogearbox.net \
--cc=john.fastabend@gmail.com \
--cc=kuba@kernel.org \
--cc=martin.lau@linux.dev \
--cc=netdev@vger.kernel.org \
--cc=razor@blackwall.org \
--cc=sdf@google.com \
--cc=toke@kernel.org \
--cc=toke@redhat.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).