Date: Wed, 25 Oct 2023 17:47:05 +0200
From: Jiri Pirko
To: Daniel Borkmann
Cc: bpf@vger.kernel.org, netdev@vger.kernel.org, martin.lau@linux.dev,
	razor@blackwall.org, ast@kernel.org, andrii@kernel.org,
	john.fastabend@gmail.com, sdf@google.com, toke@kernel.org,
	kuba@kernel.org, andrew@lunn.ch, Toke Høiland-Jørgensen
Subject: Re: [PATCH bpf-next v4 1/7] netkit, bpf: Add bpf programmable net device
References: <20231024214904.29825-1-daniel@iogearbox.net>
	<20231024214904.29825-2-daniel@iogearbox.net>
In-Reply-To: <20231024214904.29825-2-daniel@iogearbox.net>

Tue, Oct 24, 2023 at 11:48:58PM CEST, daniel@iogearbox.net wrote:
>This work adds a new, minimal BPF-programmable device called "netkit"
>(former PoC code-name "meta") we recently presented at LSF/MM/BPF. The
>core idea is that BPF programs are executed within the driver's xmit
>routine, therefore, e.g. in case of containers/Pods, moving BPF
>processing closer to the source.
>
>One of the goals was that in case of Pod egress traffic, this allows
>moving BPF programs from hostns tcx ingress into the device itself,
>providing earlier drop or forward mechanisms. For example, if the BPF
>program determines that the skb must be sent out of the node, then a
>redirect to the physical device can take place directly without going
>through the per-CPU backlog queue. This helps to shift processing for
>such traffic from softirq to process context, leading to better
>scheduling decisions/performance (see measurements in the slides).
>
>In this initial version, the netkit device ships as a pair, but we plan
>to extend this further so it can also operate in single device mode.
>The pair

Single device mode should work how?

>comes with a primary and a peer device. Only the primary device,
>typically residing in hostns, can manage BPF programs for itself and
>its peer. The peer device is designated for containers/Pods and cannot
>attach/detach BPF programs. Upon device creation, the user can set the
>default policy to 'pass' or 'drop' for the case when no BPF program is
>attached.

It looks to me that you only need the host (primary) netdevice to be
used as a handle to attach the bpf programs, because the bpf program
can (and in a real use case probably will) redirect to the uplink or
another pod netdevice, skipping the host (primary) netdevice, correct?
If so, why can't you do just single device mode from the start, finding
a different sort of bpf attach handle? (not sure which)
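For illustration only (nothing from this series; the program name, the
uplink ifindex and the mark-based "leaves the node" check are made-up
placeholders): a minimal sketch of the kind of egress program being
discussed, written as a plain tc-compatible program per the "tc BPF
compatibility" noted in the description. Attached on the primary, it
would pass Pod-local traffic up the stack and push node-egress traffic
straight to the physical device, which is also why the primary mostly
serves as an attach handle here.

/* Hedged sketch, not from the patch under review. */
#include <linux/bpf.h>
#include <linux/pkt_cls.h>
#include <bpf/bpf_helpers.h>

#define PHYS_IFINDEX	2	/* placeholder: ifindex of the uplink device */
#define MARK_OFF_NODE	1	/* placeholder: "skb leaves the node" marker  */

SEC("tc")
int netkit_prim_egress(struct __sk_buff *skb)
{
	/* Traffic classified as leaving the node is pushed directly to the
	 * physical device via the neighbor subsystem, bypassing the per-CPU
	 * backlog queue; everything else continues up the stack as usual.
	 */
	if (skb->mark == MARK_OFF_NODE)
		return bpf_redirect_neigh(PHYS_IFINDEX, NULL, 0, 0);
	return TC_ACT_OK;
}

char _license[] SEC("license") = "GPL";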
>
>Additionally, the device can be operated in L3 (default) or L2 mode.
>The management of BPF programs is done via bpf_mprog, so that
>multi-attach is supported right from the beginning with similar API and
>dependency controls as tcx. For details on the latter see commit
>053c8e1f235d ("bpf: Add generic attach/detach/query API for
>multi-progs"). tc BPF compatibility is provided, so that existing
>programs can be easily migrated.
>
>Going forward, we plan to use netkit devices in Cilium as the main
>device type for connecting Pods. They will be operated in L3 mode in
>order to simplify a Pod's neighbor management and the peer will operate
>in default drop mode, so that no traffic is leaving between the time
>when a Pod is brought up by the CNI plugin and the time when programs
>are attached by the agent.
>
>Additionally, the programs we attach via tcx on the physical devices
>are using bpf_redirect_peer() for inbound traffic into the netkit
>device, hence the latter also supports the ndo_get_peer_dev callback.
>Similarly, we use bpf_redirect_neigh() for the way out, pushing from
>the netkit peer to the phys device directly. Also, BIG TCP is supported
>on the netkit device. For the follow-up work in single device mode, we
>plan to convert Cilium's cilium_host/_net devices into a single one.
>
>An extensive test suite for checking device operations and the BPF
>program and link management API comes as BPF selftests in this series.
>

Couple of nitpicks below:

[..]

>+static int netkit_check_policy(int policy, struct nlattr *tb,
>+			       struct netlink_ext_ack *extack)
>+{
>+	switch (policy) {
>+	case NETKIT_PASS:
>+	case NETKIT_DROP:
>+		return 0;
>+	default:

Isn't this job for netlink policy? (see the sketch after the quoted
code below)

>+		NL_SET_ERR_MSG_ATTR(extack, tb,
>+				    "Provided default xmit policy not supported");
>+		return -EINVAL;
>+	}
>+}
>+
>+static int netkit_check_mode(int mode, struct nlattr *tb,
>+			     struct netlink_ext_ack *extack)
>+{
>+	switch (mode) {
>+	case NETKIT_L2:
>+	case NETKIT_L3:
>+		return 0;
>+	default:

Isn't this job for netlink policy?

>+		NL_SET_ERR_MSG_ATTR(extack, tb,
>+				    "Provided device mode can only be L2 or L3");
>+		return -EINVAL;
>+	}
>+}
>+
>+static int netkit_validate(struct nlattr *tb[], struct nlattr *data[],
>+			   struct netlink_ext_ack *extack)
>+{
>+	struct nlattr *attr = tb[IFLA_ADDRESS];
>+
>+	if (!attr)
>+		return 0;
>+	NL_SET_ERR_MSG_ATTR(extack, attr,
>+			    "Setting Ethernet address is not supported");
>+	return -EOPNOTSUPP;
>+}
>+
>+static struct rtnl_link_ops netkit_link_ops;
>+
>+static int netkit_new_link(struct net *src_net, struct net_device *dev,
>+			   struct nlattr *tb[], struct nlattr *data[],
>+			   struct netlink_ext_ack *extack)
>+{
>+	struct nlattr *peer_tb[IFLA_MAX + 1], **tbp = tb, *attr;
>+	enum netkit_action default_prim = NETKIT_PASS;
>+	enum netkit_action default_peer = NETKIT_PASS;
>+	enum netkit_mode mode = NETKIT_L3;
>+	unsigned char ifname_assign_type;
>+	struct ifinfomsg *ifmp = NULL;
>+	struct net_device *peer;
>+	char ifname[IFNAMSIZ];
>+	struct netkit *nk;
>+	struct net *net;
>+	int err;
>+
>+	if (data) {
>+		if (data[IFLA_NETKIT_MODE]) {
>+			attr = data[IFLA_NETKIT_MODE];
>+			mode = nla_get_u32(attr);
>+			err = netkit_check_mode(mode, attr, extack);
>+			if (err < 0)
>+				return err;
>+		}
>+		if (data[IFLA_NETKIT_PEER_INFO]) {
>+			attr = data[IFLA_NETKIT_PEER_INFO];
>+			ifmp = nla_data(attr);
>+			err = rtnl_nla_parse_ifinfomsg(peer_tb, attr, extack);
>+			if (err < 0)
>+				return err;
>+			err = netkit_validate(peer_tb, NULL, extack);
>+			if (err < 0)
>+				return err;
>+			tbp = peer_tb;
>+		}
>+		if (data[IFLA_NETKIT_POLICY]) {
>+			attr = data[IFLA_NETKIT_POLICY];
>+			default_prim = nla_get_u32(attr);
>+			err = netkit_check_policy(default_prim, attr, extack);
>+			if (err < 0)
>+				return err;
>+		}
>+		if (data[IFLA_NETKIT_PEER_POLICY]) {
>+			attr = data[IFLA_NETKIT_PEER_POLICY];
>+			default_peer = nla_get_u32(attr);
>+			err = netkit_check_policy(default_peer, attr, extack);
>+			if (err < 0)
>+				return err;
>+		}
>+	}
>+
>+	if (ifmp && tbp[IFLA_IFNAME]) {
>+		nla_strscpy(ifname, tbp[IFLA_IFNAME], IFNAMSIZ);
>+		ifname_assign_type = NET_NAME_USER;
>+	} else {
>+		strscpy(ifname, "nk%d", IFNAMSIZ);
>+		ifname_assign_type = NET_NAME_ENUM;
>+	}
>+
>+	net = rtnl_link_get_net(src_net, tbp);
>+	if (IS_ERR(net))
>+		return PTR_ERR(net);
>+
>+	peer = rtnl_create_link(net, ifname, ifname_assign_type,
>+				&netkit_link_ops, tbp, extack);
>+	if (IS_ERR(peer)) {
>+		put_net(net);
>+		return PTR_ERR(peer);
>+	}
>+
>+	netif_inherit_tso_max(peer, dev);
>+
>+	if (mode == NETKIT_L2)
>+		eth_hw_addr_random(peer);
>+	if (ifmp && dev->ifindex)
>+		peer->ifindex = ifmp->ifi_index;
>+
>+	nk = netkit_priv(peer);
>+	nk->primary = false;
>+	nk->policy = default_peer;
>+	nk->mode = mode;
>+	bpf_mprog_bundle_init(&nk->bundle);
>+	RCU_INIT_POINTER(nk->active, NULL);
>+	RCU_INIT_POINTER(nk->peer, NULL);

Aren't these already 0?

>+
>+	err = register_netdevice(peer);
>+	put_net(net);
>+	if (err < 0)
>+		goto err_register_peer;
>+	netif_carrier_off(peer);
>+	if (mode == NETKIT_L2)
>+		dev_change_flags(peer, peer->flags & ~IFF_NOARP, NULL);
>+
>+	err = rtnl_configure_link(peer, NULL, 0, NULL);
>+	if (err < 0)
>+		goto err_configure_peer;
>+
>+	if (mode == NETKIT_L2)
>+		eth_hw_addr_random(dev);
>+	if (tb[IFLA_IFNAME])
>+		nla_strscpy(dev->name, tb[IFLA_IFNAME], IFNAMSIZ);
>+	else
>+		strscpy(dev->name, "nk%d", IFNAMSIZ);
>+
>+	nk = netkit_priv(dev);
>+	nk->primary = true;
>+	nk->policy = default_prim;
>+	nk->mode = mode;
>+	bpf_mprog_bundle_init(&nk->bundle);
>+	RCU_INIT_POINTER(nk->active, NULL);
>+	RCU_INIT_POINTER(nk->peer, NULL);
>+
>+	err = register_netdevice(dev);
>+	if (err < 0)
>+		goto err_configure_peer;
>+	netif_carrier_off(dev);
>+	if (mode == NETKIT_L2)
>+		dev_change_flags(dev, dev->flags & ~IFF_NOARP, NULL);
>+
>+	rcu_assign_pointer(netkit_priv(dev)->peer, peer);
>+	rcu_assign_pointer(netkit_priv(peer)->peer, dev);
>+	return 0;
>+err_configure_peer:
>+	unregister_netdevice(peer);
>+	return err;
>+err_register_peer:
>+	free_netdev(peer);
>+	return err;
>+}
>+

[..]