From mboxrd@z Thu Jan 1 00:00:00 1970
From: Jakub Kicinski
Subject: Re: [PATCH 0/6] offload Linux LAG devices to the TC datapath
Date: Wed, 27 Jun 2018 21:02:43 -0700
Message-ID: <20180627210243.154f05e0@cakuba.netronome.com>
References: <1529588161-15934-1-git-send-email-john.hurley@netronome.com>
	<8f406548-8f90-b658-fcd1-342d702b3445@mellanox.com>
	<20180626153140.6a06eb97@cakuba.netronome.com>
	<20180627160811.57250c26@cakuba.netronome.com>
Mime-Version: 1.0
Content-Type: text/plain; charset=US-ASCII
Content-Transfer-Encoding: 7bit
Cc: Or Gerlitz, John Hurley, Jiri Pirko, Linux Netdev List,
	ASAP_Direct_Dev, Simon Horman, Andy Gospodarek
To: Or Gerlitz
Return-path:
Received: from mail-qt0-f196.google.com ([209.85.216.196]:35418 "EHLO
	mail-qt0-f196.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org
	with ESMTP id S1750802AbeF1ECs (ORCPT );
	Thu, 28 Jun 2018 00:02:48 -0400
Received: by mail-qt0-f196.google.com with SMTP id z6-v6so3616273qti.2
	for ; Wed, 27 Jun 2018 21:02:48 -0700 (PDT)
In-Reply-To:
Sender: netdev-owner@vger.kernel.org
List-ID:

On Thu, 28 Jun 2018 06:50:32 +0300, Or Gerlitz wrote:
> On Thu, Jun 28, 2018 at 2:08 AM, Jakub Kicinski wrote:
> > On Wed, 27 Jun 2018 23:07:29 +0300, Or Gerlitz wrote:
> >> On Wed, Jun 27, 2018 at 1:31 AM, Jakub Kicinski wrote:
> >> > On Tue, 26 Jun 2018 17:57:08 +0300, Or Gerlitz wrote:
> >>
> >> >> 2. re the egress side of things. Some NIC HWs can't just use LAG
> >> >> as the egress port destination of an ACL (tc rule) and the HW rule
> >> >> needs to be duplicated to both HW ports. So... in that case, you
> >> >> see the HW driver doing the duplication (:() or we can somehow
> >> >> make it happen from user-space?
> >>
> >> > It's the TC core that does the duplication. Drivers which don't need
> >> > the duplication (e.g. mlxsw) will not register a new callback for each
> >> > port on which shared block is bound. They will keep one list of rules,
> >> > and a list of ports that those rules apply to.
> >>
> >> [snip]
> >>
> >> > Drivers which need duplication (multiplication) (all NICs?) have to
> >> > register a new callback for each port bound to a shared block. And TC
> >> > will call those drivers as many times as they have callbacks registered
> >> > == as many times as they have ports bound to the block. Each time
> >> > callback is invoked the driver will figure out the ingress port based
> >> > on the cb_priv and use as the key in its rule table
> >> > (or have a separate rule table per ingress port).
> >>
> >> [snip snip]
> >>
> >> > I may be wrong, but I think you split the rules tables per port for mlx5
> >>
> >> correct, currently I have a rule table per physical port.
> >>
> >> > So again you just register a callback every time shared block is bound,
> >> > and then TC core will send add/remove rule commands down to the driver,
> >> > relaying existing rules as well if needed.
> >>
> >> Let's see, the NIC uplink rep port devices were bounded (say) by ovs to
> >> a shared-block because they are the lower devices (hate the slavish jargon)
> >> of a bond device.
> >>
> >> Next, the TC stack will invoke the callback over these ports, when ingress
> >> rule is added on the bond.
> >>
> >> But we are talking on ingress rule set on a non-uplink rep (VF rep) port,
> >> where bonding is the egress of the rule. I guess the callback which you probably
> >> refer to (you hinted there below) is the egdev one, correct? you are suggesting
> >> that bonding will do egdev registration... I am a bit confused.
> >
> > Ah, you really meant egress. We don't have this problem, but yes, I
>
> so how does it works for you -- the rule is:
>
>
>
> so from here, your driver logic does what inorder
> to allow offloading into the lagged uplinks? can you
> point the code please..

static int
nfp_fl_output(struct nfp_app *app, struct nfp_fl_output *output,
...
	if (tun_type) {
		/* Verify the egress netdev matches the tunnel type.
		 */
		if (!nfp_fl_netdev_is_tunnel_type(out_dev, tun_type))
			return -EOPNOTSUPP;

		if (*tun_out_cnt)
			return -EOPNOTSUPP;
		(*tun_out_cnt)++;

		output->flags = cpu_to_be16(tmp_flags |
					    NFP_FL_OUT_FLAGS_USE_TUN);
		output->port = cpu_to_be32(NFP_FL_PORT_TYPE_TUN | tun_type);
	} else if (netif_is_lag_master(out_dev) &&
		   priv->flower_ext_feats & NFP_FL_FEATS_LAG) {
		int gid;

		output->flags = cpu_to_be16(tmp_flags);
		gid = nfp_flower_lag_get_output_id(app, out_dev);
		if (gid < 0)
			return gid;
		output->port = cpu_to_be32(NFP_FL_LAG_OUT | gid);
	} else {
		/* Set action output parameters. */
		output->flags = cpu_to_be16(tmp_flags);

		/* Only offload if egress ports are on the same device as the
		 * ingress port.
		 */
		if (!switchdev_port_same_parent_id(in_dev, out_dev))
			return -EOPNOTSUPP;
		if (!nfp_netdev_is_nfp_repr(out_dev))
			return -EOPNOTSUPP;

		output->port = cpu_to_be32(nfp_repr_get_port_id(out_dev));
		if (!output->port)
			return -EOPNOTSUPP;
	}

> the bond BTW doesn't have the same switchdev id as
> the vfrep in case you keep different switchdev id's
> for the uplink reps under bonding -- do you unite them?
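
Going only by the excerpt above: the LAG master branch is tested before
the final else, and switchdev_port_same_parent_id() is called only in
that final else. So in the quoted code a bond used as the egress device
is never compared against the parent ID of the ingress repr. Below is a
standalone userspace sketch of that branch order. Everything in it is
made up for illustration (the sk_* names, the struct fields, the
constant values); it is not the nfp driver or any kernel API, and the
tunnel branch leaves out the tunnel type and count checks.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical port-type tags; the values are arbitrary, not nfp's. */
#define SK_PORT_TYPE_TUN 0x50000000u
#define SK_LAG_OUT       0xC0000000u

/* Hypothetical stand-in for a netdev, with only what the sketch needs. */
struct sk_netdev {
	bool is_lag_master;  /* models netif_is_lag_master() */
	bool is_repr;        /* models nfp_netdev_is_nfp_repr() */
	int parent_id;       /* models the switchdev parent ID */
	uint32_t port_id;    /* models nfp_repr_get_port_id() */
	int lag_group_id;    /* models the LAG output id, < 0 on error */
};

/* Pick the output port in the same branch order as the quoted excerpt. */
static int sk_select_output(const struct sk_netdev *in,
			    const struct sk_netdev *out,
			    uint32_t tun_type, bool fw_has_lag,
			    uint32_t *port)
{
	assert(in && out && port);

	if (tun_type) {
		/* Tunnel output (type and count checks omitted here). */
		*port = SK_PORT_TYPE_TUN | tun_type;
		return 0;
	}

	if (out->is_lag_master && fw_has_lag) {
		/* LAG output: no parent ID comparison on this path. */
		if (out->lag_group_id < 0)
			return out->lag_group_id;
		*port = SK_LAG_OUT | (uint32_t)out->lag_group_id;
		return 0;
	}

	/* Plain output: egress must share the ingress parent ID. */
	if (in->parent_id != out->parent_id)
		return -EOPNOTSUPP;
	if (!out->is_repr)
		return -EOPNOTSUPP;
	if (!out->port_id)
		return -EOPNOTSUPP;

	*port = out->port_id;
	return 0;
}
```

With a VF repr as ingress and a bond with a different parent_id as
egress, the sketch returns a LAG port when fw_has_lag is set, and
-EOPNOTSUPP from the parent ID check when it is not.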