From: Stanislav Fomichev <sdf@google.com>
To: Daniel Borkmann <daniel@iogearbox.net>
Cc: bpf@vger.kernel.org, netdev@vger.kernel.org,
martin.lau@kernel.org, razor@blackwall.org, ast@kernel.org,
andrii@kernel.org, john.fastabend@gmail.com
Subject: Re: [PATCH bpf-next 1/8] meta, bpf: Add bpf programmable meta device
Date: Tue, 26 Sep 2023 14:26:29 -0700
Message-ID: <ZRNMhVfuqPrK3J6O@google.com>
In-Reply-To: <20230926055913.9859-2-daniel@iogearbox.net>

On 09/26, Daniel Borkmann wrote:
> This work adds a new, minimal BPF-programmable device called "meta" we
> recently presented at LSF/MM/BPF. The latter name derives from the Greek
> μετά, encompassing a wide array of meanings such as "on top of", "beyond".
> Given business logic is defined by BPF, this device can have many meanings.
> The core idea is that BPF programs are executed within the drivers xmit
> routine and therefore e.g. in case of containers/Pods moving BPF processing
> closer to the source.
>
> One of the goals was that in case of Pod egress traffic, this allows to
> move BPF programs from hostns tcx ingress into the device itself, providing
> earlier drop or forward mechanisms, for example, if the BPF program
> determines that the skb must be sent out of the node, then a redirect to
> the physical device can take place directly without going through per-CPU
> backlog queue. This helps to shift processing for such traffic from softirq
> to process context, leading to better scheduling decisions and better
> performance.
>
> In this initial version, the meta device ships as a pair, but we plan to
> extend this further so it can also operate in single device mode. The pair
> comes with a primary and a peer device. Only the primary device, typically
> residing in hostns, can manage BPF programs for itself and its peer. The
> peer device is designated for containers/Pods and cannot attach/detach
> BPF programs. Upon the device creation, the user can set the default policy
> to 'forward' or 'drop' for the case when no BPF program is attached.
>
> Additionally, the device can be operated in L3 (default) or L2 mode. The
> management of BPF programs is done via bpf_mprog, so that multi-attach is
> supported right from the beginning with similar API/dependency controls as
> tcx. For details on the latter see commit 053c8e1f235d ("bpf: Add generic
> attach/detach/query API for multi-progs"). tc BPF compatibility is provided,
> so that existing programs can be easily migrated.
>
> Going forward, we plan to use meta devices in Cilium as the main device type
> for connecting Pods. They will be operated in L3 mode in order to simplify
> a Pod's neighbor management and the peer will operate in default drop mode,
> so that no traffic is leaving between the time when a Pod is brought up by
> the CNI plugin and programs attached by the agent. Additionally, the programs
> we attach via tcx on the physical devices are using bpf_redirect_peer()
> for inbound traffic into meta device, hence the latter also supporting the
> ndo_get_peer_dev callback. Similarly, we use bpf_redirect_neigh() for the
> way out, pushing to phys device directly. Also, BIG TCP is supported on meta
> device. For the follow-up work in single device mode, we plan to convert
> Cilium's cilium_host/_net devices into a single one.
>
> An extensive test suite for checking device operations and the BPF program
> and link management API comes as BPF selftests in this series.
>
> Co-developed-by: Nikolay Aleksandrov <razor@blackwall.org>
> Signed-off-by: Nikolay Aleksandrov <razor@blackwall.org>
> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
> Link: https://github.com/borkmann/iproute2/commits/pr/meta
> Link: http://vger.kernel.org/bpfconf2023_material/tcx_meta_netdev_borkmann.pdf (24ff.)
> ---
> MAINTAINERS | 9 +
> drivers/net/Kconfig | 9 +
> drivers/net/Makefile | 1 +
> drivers/net/meta.c | 734 +++++++++++++++++++++++++++++++++
> include/linux/netdevice.h | 2 +
> include/net/meta.h | 31 ++
> include/uapi/linux/bpf.h | 2 +
> include/uapi/linux/if_link.h | 25 ++
> kernel/bpf/syscall.c | 30 +-
> tools/include/uapi/linux/bpf.h | 2 +
> 10 files changed, 840 insertions(+), 5 deletions(-)
> create mode 100644 drivers/net/meta.c
> create mode 100644 include/net/meta.h
>
> diff --git a/MAINTAINERS b/MAINTAINERS
> index 8985a1b0b5ee..ec3edd4caa56 100644
> --- a/MAINTAINERS
> +++ b/MAINTAINERS
> @@ -3774,6 +3774,15 @@ L: bpf@vger.kernel.org
> S: Maintained
> F: tools/lib/bpf/
>
> +BPF [META]
> +M: Daniel Borkmann <daniel@iogearbox.net>
> +M: Nikolay Aleksandrov <razor@blackwall.org>
> +L: bpf@vger.kernel.org
> +L: netdev@vger.kernel.org
> +S: Supported
> +F: drivers/net/meta.c
> +F: include/net/meta.h
> +
> BPF [MISC]
> L: bpf@vger.kernel.org
> S: Odd Fixes
> diff --git a/drivers/net/Kconfig b/drivers/net/Kconfig
> index 44eeb5d61ba9..9959cdd50b0b 100644
> --- a/drivers/net/Kconfig
> +++ b/drivers/net/Kconfig
> @@ -448,6 +448,15 @@ config NLMON
> diagnostics, etc. This is mostly intended for developers or support
> to debug netlink issues. If unsure, say N.
>
> +config META
> + bool "BPF-programmable meta device"
> + depends on BPF_SYSCALL
> + help
> + The virtual meta devices can be created in pairs and used to connect
> + two network namespaces. A BPF program can be attached to the device(s)
> + which then gets executed on transmission to implement the driver
> + internal logic.
> +
> config NET_VRF
> tristate "Virtual Routing and Forwarding (Lite)"
> depends on IP_MULTIPLE_TABLES
> diff --git a/drivers/net/Makefile b/drivers/net/Makefile
> index e26f98f897c5..18eabeb78ece 100644
> --- a/drivers/net/Makefile
> +++ b/drivers/net/Makefile
> @@ -22,6 +22,7 @@ obj-$(CONFIG_MDIO) += mdio.o
> obj-$(CONFIG_NET) += loopback.o
> obj-$(CONFIG_NETDEV_LEGACY_INIT) += Space.o
> obj-$(CONFIG_NETCONSOLE) += netconsole.o
> +obj-$(CONFIG_META) += meta.o
> obj-y += phy/
> obj-y += pse-pd/
> obj-y += mdio/
> diff --git a/drivers/net/meta.c b/drivers/net/meta.c
> new file mode 100644
> index 000000000000..e464f547b0a6
> --- /dev/null
> +++ b/drivers/net/meta.c
> @@ -0,0 +1,734 @@
> +// SPDX-License-Identifier: GPL-2.0-only
> +/* Copyright (c) 2023 Isovalent */
> +
> +#include <linux/netdevice.h>
> +#include <linux/ethtool.h>
> +#include <linux/etherdevice.h>
> +#include <linux/filter.h>
> +#include <linux/netfilter_netdev.h>
> +#include <linux/bpf_mprog.h>
> +
> +#include <net/meta.h>
> +#include <net/dst.h>
> +#include <net/tcx.h>
> +
> +#define DRV_NAME "meta"
> +#define DRV_VERSION "1.0"
> +
> +struct meta {
> +	/* Needed in fast-path */
> +	struct net_device __rcu *peer;
> +	struct bpf_mprog_entry __rcu *active;
> +	enum meta_action policy;
> +	struct bpf_mprog_bundle bundle;
> +	/* Needed in slow-path */
> +	enum meta_mode mode;
> +	bool primary;
> +	u32 headroom;
> +};
> +
> +static void meta_scrub_minimum(struct sk_buff *skb)
> +{
[..]
> +	skb->skb_iif = 0;
> +	skb->ignore_df = 0;
> +	skb->priority = 0;
> +	skb_dst_drop(skb);
> +	skb_ext_reset(skb);
> +	nf_reset_ct(skb);
> +	nf_reset_trace(skb);
> +	nf_skip_egress(skb, true);
> +	ipvs_reset(skb);

This looks similar to skb_scrub_packet; what's the difference?
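
For anyone comparing the two side by side, here is a rough userspace model, not kernel code. struct mock_skb and every mock_* name are hypothetical stand-ins invented for illustration. mock_meta_scrub_minimum() mirrors only the statements visible in the quoted hunk (anything trimmed at "[..]" is not modeled), and mock_skb_scrub_packet() is an assumption about the shape of skb_scrub_packet(skb, xnet) in net/core/skbuff.c, with the timestamp and switchdev offload-mark handling left out for brevity.

```c
#include <stdbool.h>
#include <stdio.h>

/* Hypothetical stand-in for the few sk_buff fields discussed here. */
struct mock_skb {
	int pkt_type;
	int skb_iif;
	bool ignore_df;
	unsigned int priority;
	unsigned int mark;
	bool has_dst;
	bool has_ext;
	bool has_ct;
	bool nf_trace;
	bool nf_skip_egress;
	bool ipvs_property;
};

/* Mirrors only the statements visible in the quoted hunk. */
static void mock_meta_scrub_minimum(struct mock_skb *skb)
{
	skb->skb_iif = 0;
	skb->ignore_df = false;
	skb->priority = 0;
	skb->has_dst = false;		/* skb_dst_drop() */
	skb->has_ext = false;		/* skb_ext_reset() */
	skb->has_ct = false;		/* nf_reset_ct() */
	skb->nf_trace = false;		/* nf_reset_trace() */
	skb->nf_skip_egress = true;	/* nf_skip_egress(skb, true) */
	skb->ipvs_property = false;	/* ipvs_reset() */
}

/* Assumed shape of skb_scrub_packet(); tstamp/offload marks omitted. */
static void mock_skb_scrub_packet(struct mock_skb *skb, bool xnet)
{
	skb->pkt_type = 0;		/* PACKET_HOST */
	skb->skb_iif = 0;
	skb->ignore_df = false;
	skb->has_dst = false;
	skb->has_ext = false;
	skb->has_ct = false;
	skb->nf_trace = false;
	if (!xnet)
		return;
	skb->ipvs_property = false;
	skb->mark = 0;
}

/* An skb with every modeled field set to a non-default value. */
static struct mock_skb mock_dirty_skb(void)
{
	struct mock_skb skb = {
		.pkt_type = 3, .skb_iif = 7, .ignore_df = true,
		.priority = 6, .mark = 0x2a, .has_dst = true,
		.has_ext = true, .has_ct = true, .nf_trace = true,
		.nf_skip_egress = false, .ipvs_property = true,
	};
	return skb;
}

/* Print the fields that end up different; return how many differ. */
static int mock_scrub_diff(bool xnet)
{
	struct mock_skb a = mock_dirty_skb(), b = mock_dirty_skb();
	int n = 0;

	mock_meta_scrub_minimum(&a);
	mock_skb_scrub_packet(&b, xnet);

#define DIFF(f) do {						\
	if (a.f != b.f) {					\
		printf("%s: meta=%d scrub=%d\n", #f,		\
		       (int)a.f, (int)b.f);			\
		n++;						\
	}							\
} while (0)
	DIFF(pkt_type); DIFF(skb_iif); DIFF(ignore_df);
	DIFF(priority); DIFF(mark); DIFF(has_dst);
	DIFF(has_ext); DIFF(has_ct); DIFF(nf_trace);
	DIFF(nf_skip_egress); DIFF(ipvs_property);
#undef DIFF
	return n;
}
```

Calling mock_scrub_diff(true) and mock_scrub_diff(false) from a small main() lists the modeled fields on which the two routines disagree. Under the assumptions above, the quoted hunk additionally clears priority and sets the skip-egress flag, while the skb_scrub_packet() model resets pkt_type and, for xnet only, mark.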