* Re: [RFC net-next 2/6] driver: net: bonding: allow registration of tc offload callbacks in bond
@ 2018-03-13 15:51 Or Gerlitz
2018-03-13 15:53 ` Or Gerlitz
2018-03-14 9:50 ` Jiri Pirko
0 siblings, 2 replies; 11+ messages in thread
From: Or Gerlitz @ 2018-03-13 15:51 UTC (permalink / raw)
To: Jiri Pirko, Rabie Loulou, John Hurley
Cc: Jakub Kicinski, Simon Horman, Linux Netdev List, ASAP_Direct_Dev, mlxsw

On Wed, Mar 7, 2018 at 12:57 PM, Jiri Pirko <jiri@resnulli.us> wrote:
> Mon, Mar 05, 2018 at 02:28:30PM CET, john.hurley@netronome.com wrote:
>>Allow drivers to register netdev callbacks for tc offload in linux bonds.
>>If a netdev has registered and is a slave of a given bond, then any tc
>>rules offloaded to the bond will be relayed to it if both the bond and the
>>slave permit hw offload.

>>Because the bond itself is not offloaded, just the rules, we don't care
>>about whether the bond ports are on the same device or whether some of
>>slaves are representor ports and some are not.

John, I think we must design here for the case where the bond IS offloaded,
e.g. some sort of HW LAG. For example, the mlxsw driver supports LAG offload
and supports tcflower offload, so we need to see how these two live together;
mlx5 supports tcflower offload and we are working on bond offload, etc.

>>+EXPORT_SYMBOL_GPL(tc_setup_cb_bond_register);
>
> Please, no "bond" specific calls from drivers. That would be wrong.
> The idea behing block callbacks was that anyone who is interested could
> register to receive those. In this case, slave device is interested.
> So it should register to receive block callbacks in the same way as if
> the block was directly on top of the slave device. The only thing you
> need to handle is to propagate block bind/unbind from master down to the
> slaves.

Jiri,

This sounds nice for the case where one installs ingress tc rules on the bond
(let's call them type 1, see next).

One obstacle pointed out by my colleague, Rabie, is that when the upper layer
issues a stats call on the filter, they will get two replies; this can confuse
them and lead to wrong decisions (aging). I wonder if/how we can set a knob
somewhere that unifies the stats (add packets/bytes, use the latest lastuse).

Also, let's see what other rules have to be offloaded in that scheme (call
them type 2/3/4) where one bonded two HW ports:

2. bond being the egress port of a rule

TC rules for an overlay networks scheme, e.g. in a NIC SRIOV scheme where one
bonds the two uplink representors.

Starting with type 2, in our current NIC HW APIs we have to duplicate these
rules into two rules set to HW:

2.1 VF rep --> uplink 0
2.2 VF rep --> uplink 1

and we do that in the driver (add/del two HW rules, combine the stat results,
etc.)

3. ingress rule on a VF rep port with the shared tunnel device being the
egress (encap), and where the routing of the underlay (tunnel) goes through
LAG.

In our case, this is like 2.1/2.2 above: offload two rules, combine stats.

4. ingress rule with the shared tunnel device being the ingress and the VF rep
port being the egress (decap)

This uses the egdev facility to be offloaded into our driver, and then in the
driver we will treat it like type 1: two rules need to be installed into HW,
but now we can't delegate them from the vxlan device b/c it has no direct
connection with the bond.

All in all, for the mlx5 use case, it seems we have an elegant solution only
for type 1.

I think we should do the elegant solution for the case where it is applicable.

In parallel, if/when newer HW APIs are there such that type 2 and 3 can be set
using one HW rule whose dest is the bond, we are good. As for type 4, we need
to see if/how it can be nicer.

Or.

^ permalink raw reply [flat|nested] 11+ messages in thread
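The stats unification raised in the message above (add packets/bytes, use the
latest lastuse) could look roughly like the following user-space C sketch. It
is only an illustration of how a driver that duplicates one tc rule into
per-uplink HW rules (2.1/2.2) might merge their counters before reporting
them. The struct and function names are hypothetical and are not taken from
mlx5 or any other driver, and the plain timestamp comparison ignores counter
wraparound.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical per-rule counters; field names are illustrative only. */
struct rule_stats {
	uint64_t packets;
	uint64_t bytes;
	uint64_t lastuse;	/* timestamp of the most recent hit */
};

/*
 * Merge the counters of the n per-uplink copies of one logical rule:
 * packets and bytes are summed, lastuse is the most recent value.
 * Assumption: timestamps are monotonic and do not wrap.
 */
struct rule_stats rule_stats_combine(const struct rule_stats *s, int n)
{
	struct rule_stats out = { 0, 0, 0 };
	int i;

	assert(n >= 0);
	for (i = 0; i < n; i++) {
		out.packets += s[i].packets;
		out.bytes += s[i].bytes;
		if (s[i].lastuse > out.lastuse)
			out.lastuse = s[i].lastuse;
	}
	return out;
}
```

With two copies of a rule, one per uplink, the caller would pass an array of
two entries and report the returned totals as the stats of the single rule
the user installed.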
* Re: [RFC net-next 2/6] driver: net: bonding: allow registration of tc offload callbacks in bond
2018-03-13 15:51 [RFC net-next 2/6] driver: net: bonding: allow registration of tc offload callbacks in bond Or Gerlitz
@ 2018-03-13 15:53 ` Or Gerlitz
2018-03-14 1:50 ` Jakub Kicinski
2018-03-14 9:50 ` Jiri Pirko
1 sibling, 1 reply; 11+ messages in thread
From: Or Gerlitz @ 2018-03-13 15:53 UTC (permalink / raw)
To: Jiri Pirko, Rabie Loulou, John Hurley
Cc: Jakub Kicinski, Simon Horman, Linux Netdev List, mlxsw, Yevgeny Kliteynik, Paul Blakey

Sorry ppl, I added MLNX alias (ASAP_Direct_Dev@mellanox.com) which is not
open to outer posts, please remove it from your replies, otherwise it will
bump you back..

Or.

^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: [RFC net-next 2/6] driver: net: bonding: allow registration of tc offload callbacks in bond
2018-03-13 15:53 ` Or Gerlitz
@ 2018-03-14 1:50 ` Jakub Kicinski
2018-03-14 6:54 ` Or Gerlitz
2018-03-14 15:51 ` Jiri Pirko
0 siblings, 2 replies; 11+ messages in thread
From: Jakub Kicinski @ 2018-03-14 1:50 UTC (permalink / raw)
To: Or Gerlitz
Cc: Jiri Pirko, Rabie Loulou, John Hurley, Simon Horman, Linux Netdev List, mlxsw, Yevgeny Kliteynik, Paul Blakey

On Tue, 13 Mar 2018 17:53:39 +0200, Or Gerlitz wrote:
> > Starting with type 2, in our current NIC HW APIs we have to duplicate
> > these rules
> > into two rules set to HW:
> >
> > 2.1 VF rep --> uplink 0
> > 2.2 VF rep --> uplink 1
> >
> > and we do that in the driver (add/del two HW rules, combine the stat
> > results, etc)

Ack, I think our HW API also will require us to duplicate the rules
today, but IMHO we should implement some common helper module in the
core that would work for any block sharing rather than bond specific
solution.

> > 3. ingress rule on VF rep port with shared tunnel device being the
> > egress (encap)
> > and where the routing of the underlay (tunnel) goes through LAG.
> >
> > in our case, this is like 2.1/2.2 above, offload two rules, combine stats
> >
> > 4. ingress rule shared tunnel device being the ingress and VF rep port
> > being the egress (decap)
> >
> > this uses the egdev facility to be offloaded into the our driver, and
> > then in the driver
> > we will treat it like type 1, two rules need to be installed into HW,
> > but now, we can't delegate them
> > from the vxlan device b/c it has no direct connection with the bond.

Let's get rid of the egdev crutch first then :]

^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: [RFC net-next 2/6] driver: net: bonding: allow registration of tc offload callbacks in bond
2018-03-14 1:50 ` Jakub Kicinski
@ 2018-03-14 6:54 ` Or Gerlitz
2018-03-14 15:51 ` Jiri Pirko
1 sibling, 0 replies; 11+ messages in thread
From: Or Gerlitz @ 2018-03-14 6:54 UTC (permalink / raw)
To: Jakub Kicinski
Cc: Jiri Pirko, Rabie Loulou, John Hurley, Simon Horman, Linux Netdev List, mlxsw, Yevgeny Kliteynik, Paul Blakey

On Wed, Mar 14, 2018 at 3:50 AM, Jakub Kicinski <jakub.kicinski@netronome.com> wrote:
> On Tue, 13 Mar 2018 17:53:39 +0200, Or Gerlitz wrote:
>> > Starting with type 2, in our current NIC HW APIs we have to duplicate
>> > these rules
>> > into two rules set to HW:
>> >
>> > 2.1 VF rep --> uplink 0
>> > 2.2 VF rep --> uplink 1
>> >
>> > and we do that in the driver (add/del two HW rules, combine the stat
>> > results, etc)
>
> Ack, I think our HW API also will require us to duplicate the rules
> today, but IMHO we should implement some common helper module in the
> core that would work for any block sharing rather than bond specific
> solution.

To be clear, are you referring to the case where the bond is the egress
device of the rule?

For the case where the bond is the ingress device, are you OK with the
approach Jiri suggested, to propagate the tc setup ndo call into the lower
devices so that they bind/unbind for any block the upper does?

This approach is applicable to bond/team/vlan devices for both NIC and
switch ASIC (or NPU...) drivers. Do you want to make a helper out of this?

^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: [RFC net-next 2/6] driver: net: bonding: allow registration of tc offload callbacks in bond
2018-03-14 1:50 ` Jakub Kicinski
2018-03-14 6:54 ` Or Gerlitz
@ 2018-03-14 15:51 ` Jiri Pirko
1 sibling, 0 replies; 11+ messages in thread
From: Jiri Pirko @ 2018-03-14 15:51 UTC (permalink / raw)
To: Jakub Kicinski
Cc: Or Gerlitz, Jiri Pirko, Rabie Loulou, John Hurley, Simon Horman, Linux Netdev List, mlxsw, Yevgeny Kliteynik, Paul Blakey

Wed, Mar 14, 2018 at 02:50:02AM CET, jakub.kicinski@netronome.com wrote:
>On Tue, 13 Mar 2018 17:53:39 +0200, Or Gerlitz wrote:
>> > Starting with type 2, in our current NIC HW APIs we have to duplicate
>> > these rules
>> > into two rules set to HW:
>> >
>> > 2.1 VF rep --> uplink 0
>> > 2.2 VF rep --> uplink 1
>> >
>> > and we do that in the driver (add/del two HW rules, combine the stat
>> > results, etc)
>
>Ack, I think our HW API also will require us to duplicate the rules
>today, but IMHO we should implement some common helper module in the
>core that would work for any block sharing rather than bond specific
>solution.

But how? Only the driver knows, in case it has 2 netdevices, whether the HW
is capable of sharing or not. Accordingly, it registers 1 cb instance or
2 cb instances (1 for each netdev). I don't see how you can move it into
the core...

>
>> > 3. ingress rule on VF rep port with shared tunnel device being the
>> > egress (encap)
>> > and where the routing of the underlay (tunnel) goes through LAG.
>> >
>> > in our case, this is like 2.1/2.2 above, offload two rules, combine stats
>> >
>> > 4. ingress rule shared tunnel device being the ingress and VF rep port
>> > being the egress (decap)
>> >
>> > this uses the egdev facility to be offloaded into the our driver, and
>> > then in the driver
>> > we will treat it like type 1, two rules need to be installed into HW,
>> > but now, we can't delegate them
>> > from the vxlan device b/c it has no direct connection with the bond.
>
>Let's get rid of the egdev crutch first then :]

I don't see how you can do it. Note that this exists to catch insertions
of rules that have "mirred redirect" to the dev which is interested in the
rules. Originally it was done in a very ugly way (please see git history),
and I converted it to egdev - I was not able to find any nicer solution :/
Any ideas for improvement?

^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: [RFC net-next 2/6] driver: net: bonding: allow registration of tc offload callbacks in bond
2018-03-13 15:51 [RFC net-next 2/6] driver: net: bonding: allow registration of tc offload callbacks in bond Or Gerlitz
2018-03-13 15:53 ` Or Gerlitz
@ 2018-03-14 9:50 ` Jiri Pirko
2018-03-14 11:23 ` Or Gerlitz
1 sibling, 1 reply; 11+ messages in thread
From: Jiri Pirko @ 2018-03-14 9:50 UTC (permalink / raw)
To: Or Gerlitz
Cc: Jiri Pirko, Rabie Loulou, John Hurley, Jakub Kicinski, Simon Horman, Linux Netdev List, ASAP_Direct_Dev, mlxsw

Tue, Mar 13, 2018 at 04:51:02PM CET, gerlitz.or@gmail.com wrote:
>On Wed, Mar 7, 2018 at 12:57 PM, Jiri Pirko <jiri@resnulli.us> wrote:
>> Mon, Mar 05, 2018 at 02:28:30PM CET, john.hurley@netronome.com wrote:
>>>Allow drivers to register netdev callbacks for tc offload in linux bonds.
>>>If a netdev has registered and is a slave of a given bond, then any tc
>>>rules offloaded to the bond will be relayed to it if both the bond and the
>>>slave permit hw offload.
>
>>>Because the bond itself is not offloaded, just the rules, we don't care
>>>about whether the bond ports are on the same device or whether some of
>>>slaves are representor ports and some are not.
>
>John, I think we must design here for the case where the bond IS offloaded.
>E.g some sort of HW LAG. For example, the mlxsw driver supports
>LAG offload and support tcflower offload, we need to see how these
>two live together, mlx5 supports tcflower offload and we are working on
>bond offload, etc.
>
>>>+EXPORT_SYMBOL_GPL(tc_setup_cb_bond_register);
>>
>> Please, no "bond" specific calls from drivers. That would be wrong.
>> The idea behing block callbacks was that anyone who is interested could
>> register to receive those. In this case, slave device is interested.
>> So it should register to receive block callbacks in the same way as if
>> the block was directly on top of the slave device. The only thing you
>> need to handle is to propagate block bind/unbind from master down to the
>> slaves.
>
>Jiri,
>
>This sounds nice for the case where one install ingress tc rules on
>the bond (lets
>call them type 1, see next)
>
>One obstacle pointed by my colleague, Rabie, is that when the upper layer
>issues stat call on the filter, they will get two replies, this can confuse them
>and lead to wrong decisions (aging). I wonder if/how we can set a knob

The bonding itself would not do anything on a stats update command
(TC_CLSFLOWER_STATS for example). Only the slaves would do the update, so
the only replies will come from the slaves.

Bond/team is just going to propagate block bind/unbind down. Nothing else.

>somewhere that unifies the stats (add packet/bytes, use the latest lastuse).
>
>Also, lets see what other rules have to be offloaded in that scheme
>(call them type 2/3/4)
>where one bonded two HW ports
>
>2. bond being egress port of a rule
>
>TC rules for overlay networks scheme, e.g in NIC SRIOV
>scheme where one bonds the two uplink representors
>
>Starting with type 2, in our current NIC HW APIs we have to duplicate
>these rules
>into two rules set to HW:
>
>2.1 VF rep --> uplink 0
>2.2 VF rep --> uplink 1
>
>and we do that in the driver (add/del two HW rules, combine the stat
>results, etc)

That is up to the driver. If the driver can share a block between 2
devices, it can do that. If it cannot share, it will just report stats
for every device separately (2 block cbs registered) and tc will see
them both together. No need to do anything in the driver.

>
>3. ingress rule on VF rep port with shared tunnel device being the
>egress (encap)
>and where the routing of the underlay (tunnel) goes through LAG.
>
>in our case, this is like 2.1/2.2 above, offload two rules, combine stats
>

Same as "2."

>4. ingress rule shared tunnel device being the ingress and VF rep port
>being the egress (decap)

I don't follow :(

>
>this uses the egdev facility to be offloaded into the our driver, and
>then in the driver
>we will treat it like type 1, two rules need to be installed into HW,
>but now, we can't delegate them
>from the vxlan device b/c it has no direct connection with the bond.

I see another thing we need to sanitize: a vxlan rule with an ingress
match and a "mirred redirect" action to a lag.

>
>All to all, for the mlx5 use case, seems we have elegant solution only
>for type 1.
>
>I think we should do the elegant solution for the case where it applicable.
>
>In parallel if/when newer HW APIs are there such that type 2 and 3 can be set
>using one HW rule whose dest is the bond, we are good. As for type 4,
>need to see
>if/how it can be nicer.
>
>Or.

^ permalink raw reply [flat|nested] 11+ messages in thread
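The approach described in the message above (each slave registers its own
block callback, and the bond only propagates bind down to its slaves) can be
modelled in a few lines of user-space C. This is a sketch under stated
assumptions, not kernel code: every type and function name below is
hypothetical, the fixed-size arrays stand in for kernel lists, and unbind and
error unwinding are left out for brevity.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_CBS 8
#define MAX_SLAVES 4

/* Hypothetical model; none of these names are kernel APIs. */
typedef int (*block_cb_t)(int rule_id, void *priv);

struct block {
	block_cb_t cb[MAX_CBS];
	void *priv[MAX_CBS];
	int ncbs;
};

struct slave {
	int rules_seen;	/* rules this slave was asked to offload */
};

struct bond {
	struct slave *slaves[MAX_SLAVES];
	int nslaves;
};

/* Per-slave callback: stands in for the driver programming its HW. */
static int slave_block_cb(int rule_id, void *priv)
{
	struct slave *s = priv;

	(void)rule_id;
	s->rules_seen++;
	return 0;
}

/* What a slave does on bind, as if the block sat directly on top of it. */
int slave_block_bind(struct slave *s, struct block *b)
{
	if (b->ncbs >= MAX_CBS)
		return -1;
	b->cb[b->ncbs] = slave_block_cb;
	b->priv[b->ncbs] = s;
	b->ncbs++;
	return 0;
}

/* The bond registers nothing itself; it only propagates the bind down. */
int bond_block_bind(struct bond *bond, struct block *b)
{
	int i, err;

	for (i = 0; i < bond->nslaves; i++) {
		err = slave_block_bind(bond->slaves[i], b);
		if (err)
			return err;
	}
	return 0;
}

/* Adding a rule to the block reaches every registered callback. */
int block_add_rule(struct block *b, int rule_id)
{
	int i, err;

	for (i = 0; i < b->ncbs; i++) {
		assert(b->cb[i] != NULL);
		err = b->cb[i](rule_id, b->priv[i]);
		if (err)
			return err;
	}
	return 0;
}
```

In this model no bond-specific registration call is needed in the drivers:
after bond_block_bind(), a rule added to the block is seen once by each
slave, which matches the "2 block cbs registered" case in the reply.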
* Re: [RFC net-next 2/6] driver: net: bonding: allow registration of tc offload callbacks in bond
2018-03-14 9:50 ` Jiri Pirko
@ 2018-03-14 11:23 ` Or Gerlitz
2018-03-14 15:56 ` Jiri Pirko
0 siblings, 1 reply; 11+ messages in thread
From: Or Gerlitz @ 2018-03-14 11:23 UTC (permalink / raw)
To: Jiri Pirko
Cc: Jiri Pirko, Rabie Loulou, John Hurley, Jakub Kicinski, Simon Horman, Linux Netdev List, ASAP_Direct_Dev, mlxsw

On Wed, Mar 14, 2018 at 11:50 AM, Jiri Pirko <jiri@resnulli.us> wrote:
> Tue, Mar 13, 2018 at 04:51:02PM CET, gerlitz.or@gmail.com wrote:
>>On Wed, Mar 7, 2018 at 12:57 PM, Jiri Pirko <jiri@resnulli.us> wrote:

>>This sounds nice for the case where one install ingress tc rules on
>>the bond (lets
>>call them type 1, see next)
>>
>>One obstacle pointed by my colleague, Rabie, is that when the upper layer
>>issues stat call on the filter, they will get two replies, this can confuse them
>>and lead to wrong decisions (aging). I wonder if/how we can set a knob
>
> The bonding itself would not do anything on stats update
> command (TC_CLSFLOWER_STATS for example). Only the slaves would do
> update. So there will be only reply from slaves.
>
> Bond/team is just going to probagare block bind/unbind down. Nothing else.

Do we agree that user space will get the replies of all lower (slave)
devices, or am I missing something here?

>>2. bond being egress port of a rule

>>2.1 VF rep --> uplink 0
>>2.2 VF rep --> uplink 1
>>
>>and we do that in the driver (add/del two HW rules, combine the stat
>>results, etc)
>
> That is up to the driver. If the driver can share block between 2
> devices, he can do that. If he cannot share, it will just report stats
> for every device separatelly (2 block cbs registered) and tc will see
> them both together. No need to do anything in driver.

right

>>3. ingress rule on VF rep port with shared tunnel device being the
>>egress (encap)
>>and where the routing of the underlay (tunnel) goes through LAG.

> Same as "2."

ok

>>4. ingress rule shared tunnel device being the ingress and VF rep port
>>being the egress (decap)

> I don't follow :(

The way tunneling is handled in the tc classifier/action is:

encap: ingress: net port, action1: tunnel key set, action2: mirred to
shared-tunnel device

decap: ingress: shared tunnel device, action1: tunnel key unset,
action2: mirred to net port

Type 4 are the decap rules; when we offload them as HW ACLs we stretch
the line, and the ingress is a HW port too (e.g. the uplink port in NICs).

>>this uses the egdev facility to be offloaded into the our driver, and
>>then in the driver
>>we will treat it like type 1, two rules need to be installed into HW,
>>but now, we can't delegate them
>>from the vxlan device b/c it has no direct connection with the bond.

> I see another thing we need to sanitize: vxlan rule ingress match action
> mirred redirect to lag

Right, we don't have that for NICs, but for switch ASICs I guess it is
applicable.

^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: [RFC net-next 2/6] driver: net: bonding: allow registration of tc offload callbacks in bond
2018-03-14 11:23 ` Or Gerlitz
@ 2018-03-14 15:56 ` Jiri Pirko
2018-03-15 21:38 ` Or Gerlitz
0 siblings, 1 reply; 11+ messages in thread
From: Jiri Pirko @ 2018-03-14 15:56 UTC (permalink / raw)
To: Or Gerlitz
Cc: Jiri Pirko, Rabie Loulou, John Hurley, Jakub Kicinski, Simon Horman, Linux Netdev List, ASAP_Direct_Dev, mlxsw

Wed, Mar 14, 2018 at 12:23:59PM CET, gerlitz.or@gmail.com wrote:
>On Wed, Mar 14, 2018 at 11:50 AM, Jiri Pirko <jiri@resnulli.us> wrote:
>> Tue, Mar 13, 2018 at 04:51:02PM CET, gerlitz.or@gmail.com wrote:
>>>On Wed, Mar 7, 2018 at 12:57 PM, Jiri Pirko <jiri@resnulli.us> wrote:
>
>>>This sounds nice for the case where one install ingress tc rules on
>>>the bond (lets
>>>call them type 1, see next)
>>>
>>>One obstacle pointed by my colleague, Rabie, is that when the upper layer
>>>issues stat call on the filter, they will get two replies, this can confuse them
>>>and lead to wrong decisions (aging). I wonder if/how we can set a knob
>>
>> The bonding itself would not do anything on stats update
>> command (TC_CLSFLOWER_STATS for example). Only the slaves would do
>> update. So there will be only reply from slaves.
>>
>> Bond/team is just going to probagare block bind/unbind down. Nothing else.
>
>Do we agree that user space will get the replies of all lower (slave) devices,
>or I am missing something here?

"user space will get the replies" - not sure what exactly do you mean by
this. The stats would be accumulated over all devices/drivers who
registered block callback.

>
>>>2. bond being egress port of a rule
>>>2.1 VF rep --> uplink 0
>>>2.2 VF rep --> uplink 1
>>>
>>>and we do that in the driver (add/del two HW rules, combine the stat
>>>results, etc)
>>
>> That is up to the driver. If the driver can share block between 2
>> devices, he can do that. If he cannot share, it will just report stats
>> for every device separatelly (2 block cbs registered) and tc will see
>> them both together. No need to do anything in driver.
>
>right
>
>>>3. ingress rule on VF rep port with shared tunnel device being the
>>>egress (encap)
>>>and where the routing of the underlay (tunnel) goes through LAG.
>
>> Same as "2."
>
>ok
>
>>>4. ingress rule shared tunnel device being the ingress and VF rep port
>>>being the egress (decap)
>
>> I don't follow :(
>
>the way tunneling is handled in tc classifier/action is
>
>encap: ingress: net port, action1: tunnel key set action2: mirred to
>shared-tunnel device
>
>decap: ingress: shared tunnel device, action1: tunnel key unset
>action2: mirred to net port
>
>type 4 are the decap rules, when we offload it to as HW ACL we stretch
>the line and the ingress
>in a HW port too (e.g uplink port in NICs)

Okay, I see. But where's the bond here? Is it the one I mentioned as
"mirred redirect to lag"?

>
>
>>>this uses the egdev facility to be offloaded into the our driver, and
>>>then in the driver
>>>we will treat it like type 1, two rules need to be installed into HW,
>>>but now, we can't delegate them
>>>from the vxlan device b/c it has no direct connection with the bond.
>
>> I see another thing we need to sanitize: vxlan rule ingress match action
>> mirred redirect to lag
>
>right, we don't have for NIC but for switch ASIC, I guess it is applicable

Yes, it is. For future NICs I guess it is going to be as well.

^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: [RFC net-next 2/6] driver: net: bonding: allow registration of tc offload callbacks in bond
2018-03-14 15:56 ` Jiri Pirko
@ 2018-03-15 21:38 ` Or Gerlitz
0 siblings, 0 replies; 11+ messages in thread
From: Or Gerlitz @ 2018-03-15 21:38 UTC (permalink / raw)
To: Jiri Pirko
Cc: Jiri Pirko, Rabie Loulou, John Hurley, Jakub Kicinski, Simon Horman, Linux Netdev List, ASAP_Direct_Dev, mlxsw

On Wed, Mar 14, 2018 at 5:56 PM, Jiri Pirko <jiri@resnulli.us> wrote:
> Wed, Mar 14, 2018 at 12:23:59PM CET, gerlitz.or@gmail.com wrote:
>>On Wed, Mar 14, 2018 at 11:50 AM, Jiri Pirko <jiri@resnulli.us> wrote:
>>> Tue, Mar 13, 2018 at 04:51:02PM CET, gerlitz.or@gmail.com wrote:
>>>>On Wed, Mar 7, 2018 at 12:57 PM, Jiri Pirko <jiri@resnulli.us> wrote:
>>
>>>>This sounds nice for the case where one install ingress tc rules on
>>>>the bond (lets
>>>>call them type 1, see next)
>>>>
>>>>One obstacle pointed by my colleague, Rabie, is that when the upper layer
>>>>issues stat call on the filter, they will get two replies, this can confuse them
>>>>and lead to wrong decisions (aging). I wonder if/how we can set a knob
>>>
>>> The bonding itself would not do anything on stats update
>>> command (TC_CLSFLOWER_STATS for example). Only the slaves would do
>>> update. So there will be only reply from slaves.
>>>
>>> Bond/team is just going to probagare block bind/unbind down. Nothing else.
>>
>>Do we agree that user space will get the replies of all lower (slave) devices,
>>or I am missing something here?
>
> "user space will get the replies" - not sure what exactly do you mean by
> this. The stats would be accumulated over all devices/drivers who
> registered block callback.

OK, this is probably something I have to check, thanks.

>>>>2. bond being egress port of a rule
>>>>2.1 VF rep --> uplink 0
>>>>2.2 VF rep --> uplink 1
>>>>
>>>>and we do that in the driver (add/del two HW rules, combine the stat
>>>>results, etc)
>>>
>>> That is up to the driver. If the driver can share block between 2
>>> devices, he can do that. If he cannot share, it will just report stats
>>> for every device separatelly (2 block cbs registered) and tc will see
>>> them both together. No need to do anything in driver.
>>
>>right
>>
>>>>3. ingress rule on VF rep port with shared tunnel device being the
>>>>egress (encap)
>>>>and where the routing of the underlay (tunnel) goes through LAG.
>>
>>> Same as "2."
>>
>>ok
>>
>>>>4. ingress rule shared tunnel device being the ingress and VF rep port being the egress (decap)

>>> I don't follow :(

>> the way tunneling is handled in tc classifier/action is

>> encap: ingress: net port, action1: tunnel key set action2: mirred to
>> shared-tunnel device

>> decap: ingress: shared tunnel device, action1: tunnel key unset
>> action2: mirred to net port

>> type 4 are the decap rules, when we offload it to as HW ACL we stretch
>> the line and the ingress in a HW port too (e.g uplink port in NICs)

> Okay, I see. But where's the bond here? Is it the one I mentioned as
> "mirred redirect to lag"?

Since the ingress port is not a HW port, we will use the egdev approach
and offload the rule with the uplink of this VF rep port being the
ingress. Since we will see that this uplink is part of a LAG, we will
offload another rule with the 2nd uplink being the ingress.

>>> I see another thing we need to sanitize: vxlan rule ingress match action
>>> mirred redirect to lag

>>right, we don't have for NIC but for switch ASIC, I guess it is applicable

> Yes, it is. For future NICs I guess it is going to be as well.

might

^ permalink raw reply [flat|nested] 11+ messages in thread
* [RFC net-next 0/6] offload linux bonding tc ingress rules
@ 2018-03-05 13:28 John Hurley
2018-03-05 13:28 ` [RFC net-next 2/6] driver: net: bonding: allow registration of tc offload callbacks in bond John Hurley
0 siblings, 1 reply; 11+ messages in thread
From: John Hurley @ 2018-03-05 13:28 UTC (permalink / raw)
To: netdev; +Cc: jiri, ogerlitz, jakub.kicinski, simon.horman, John Hurley
Hi,
This RFC patchset adds support for offloading tc ingress rules applied to
linux bonds. The premise of these patches is that if a rule is applied to
a bond port then the rule should be applied to each slave of the bond.
The linux bond itself registers a cb for offloading tc rules. Potential
slave netdevs on offload devices can then register with the bond for a
further callback - this code is basically the same as registering for an
egress dev offload in TC. Then when a rule is offloaded to the bond, it
can be relayed to each netdev that has registered with the bond code and
which is a slave of the given bond.
To prevent sync issues between the kernel and offload device, the linux
bond driver is effectively locked when it has offloaded rules, i.e. no new
ports can be enslaved and no slaves can be released until the offload
rules are removed. Similarly, if a port on a bond is deleted, the bond is
destroyed, forcing a flush of all offloaded rules.
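As a rough illustration of the locking rule in the paragraph above, the
following user-space C sketch refuses to enslave or release a port while
offloaded rules exist. It is not the code from patch 3/6, which is not shown
in this thread: the struct, the function names and the choice of -EBUSY and
-ENOENT as return values are all assumptions made for the example.

```c
#include <assert.h>
#include <errno.h>

/* Hypothetical model of a bond; the fields are illustrative only. */
struct bond_model {
	int nslaves;
	unsigned int offloadcnt;	/* rules currently offloaded */
};

/* Enslaving is refused while any rule is offloaded (assumed -EBUSY). */
int bond_model_enslave(struct bond_model *b)
{
	assert(b);
	if (b->offloadcnt)
		return -EBUSY;
	b->nslaves++;
	return 0;
}

/* Releasing a slave is refused under the same condition. */
int bond_model_release(struct bond_model *b)
{
	assert(b);
	if (b->offloadcnt)
		return -EBUSY;
	if (b->nslaves == 0)
		return -ENOENT;
	b->nslaves--;
	return 0;
}
```

The point of the check is that the set of slaves the rules were relayed to
cannot change underneath the offload device; the rules have to be removed
first, after which enslave and release work again.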
Also included in the RFC are changes to the NFP driver to utilise the new
code by registering NFP port representors for bond offload rules and
modifying cookie handling to allow the relaying of a rule to multiple
ports.
Thanks,
John
John Hurley (6):
drivers: net: bonding: add tc offload infastructure to bond
driver: net: bonding: allow registration of tc offload callbacks in
bond
drivers: net: bonding: restrict bond mods when rules are offloaded
nfp: add ndo_set_mac_address for representors
nfp: register repr ports for bond offloads
nfp: support offloading multiple rules with same cookie
drivers/net/bonding/bond_main.c | 277 ++++++++++++++++++++-
drivers/net/ethernet/netronome/nfp/flower/main.c | 24 +-
drivers/net/ethernet/netronome/nfp/flower/main.h | 10 +-
.../net/ethernet/netronome/nfp/flower/metadata.c | 20 +-
.../net/ethernet/netronome/nfp/flower/offload.c | 33 ++-
drivers/net/ethernet/netronome/nfp/nfp_net_repr.c | 1 +
include/net/bonding.h | 9 +
7 files changed, 351 insertions(+), 23 deletions(-)
--
2.7.4
^ permalink raw reply [flat|nested] 11+ messages in thread
* [RFC net-next 2/6] driver: net: bonding: allow registration of tc offload callbacks in bond
2018-03-05 13:28 [RFC net-next 0/6] offload linux bonding tc ingress rules John Hurley
@ 2018-03-05 13:28 ` John Hurley
2018-03-07 10:57 ` Jiri Pirko
0 siblings, 1 reply; 11+ messages in thread
From: John Hurley @ 2018-03-05 13:28 UTC (permalink / raw)
To: netdev; +Cc: jiri, ogerlitz, jakub.kicinski, simon.horman, John Hurley

Allow drivers to register netdev callbacks for tc offload in linux bonds.
If a netdev has registered and is a slave of a given bond, then any tc
rules offloaded to the bond will be relayed to it if both the bond and the
slave permit hw offload.

Because the bond itself is not offloaded, just the rules, we don't care
about whether the bond ports are on the same device or whether some of
slaves are representor ports and some are not.

Signed-off-by: John Hurley <john.hurley@netronome.com>
---
drivers/net/bonding/bond_main.c | 195 +++++++++++++++++++++++++++++++++++++++-
include/net/bonding.h | 7 ++
2 files changed, 201 insertions(+), 1 deletion(-)

diff --git a/drivers/net/bonding/bond_main.c b/drivers/net/bonding/bond_main.c
index e6415f6..d9e41cf 100644
--- a/drivers/net/bonding/bond_main.c
+++ b/drivers/net/bonding/bond_main.c
@@ -335,9 +335,201 @@ static inline unsigned int bond_get_offload_cnt(struct bonding *bond)
 	return bond->tc_block->offloadcnt;
 }
+struct tcf_bond_cb {
+	struct list_head list;
+	tc_setup_cb_t *cb;
+	void *cb_priv;
+};
+
+struct tcf_bond_off {
+	struct rhash_head ht_node;
+	const struct net_device *netdev;
+	unsigned int refcnt;
+	struct list_head cb_list;
+};
+
+static const struct rhashtable_params tcf_bond_ht_params = {
+	.key_offset = offsetof(struct tcf_bond_off, netdev),
+	.head_offset = offsetof(struct tcf_bond_off, ht_node),
+	.key_len = sizeof(const struct net_device *),
+};
+
+static struct tcf_bond_off *tcf_bond_off_lookup(const struct net_device *dev)
+{
+	struct bond_net *bn = net_generic(dev_net(dev), bond_net_id);
+
+	return rhashtable_lookup_fast(&bn->bond_offload_ht, &dev,
+				      tcf_bond_ht_params);
+}
+
+static struct tcf_bond_cb *tcf_bond_off_cb_lookup(struct tcf_bond_off *off,
+						  tc_setup_cb_t *cb,
+						  void *cb_priv)
+{
+	struct tcf_bond_cb *bond_cb;
+
+	list_for_each_entry(bond_cb, &off->cb_list, list)
+		if (bond_cb->cb == cb && bond_cb->cb_priv == cb_priv)
+			return bond_cb;
+	return NULL;
+}
+
+static struct tcf_bond_off *tcf_bond_off_get(const struct net_device *dev,
+					     tc_setup_cb_t *cb,
+					     void *cb_priv)
+{
+	struct tcf_bond_off *bond_off;
+	struct bond_net *bn;
+
+	bond_off = tcf_bond_off_lookup(dev);
+	if (bond_off)
+		goto inc_ref;
+
+	bond_off = kzalloc(sizeof(*bond_off), GFP_KERNEL);
+	if (!bond_off)
+		return NULL;
+	INIT_LIST_HEAD(&bond_off->cb_list);
+	bond_off->netdev = dev;
+	bn = net_generic(dev_net(dev), bond_net_id);
+	rhashtable_insert_fast(&bn->bond_offload_ht, &bond_off->ht_node,
+			       tcf_bond_ht_params);
+
+inc_ref:
+	bond_off->refcnt++;
+	return bond_off;
+}
+
+static void tcf_bond_off_put(struct tcf_bond_off *bond_off)
+{
+	struct bond_net *bn;
+
+	if (--bond_off->refcnt)
+		return;
+	bn = net_generic(dev_net(bond_off->netdev), bond_net_id);
+	rhashtable_remove_fast(&bn->bond_offload_ht, &bond_off->ht_node,
+			       tcf_bond_ht_params);
+	kfree(bond_off);
+}
+
+static int tcf_bond_off_cb_add(struct tcf_bond_off *bond_off,
+			       tc_setup_cb_t *cb, void *cb_priv)
+{
+	struct tcf_bond_cb *bond_cb;
+
+	bond_cb = tcf_bond_off_cb_lookup(bond_off, cb, cb_priv);
+	if (WARN_ON(bond_cb))
+		return -EEXIST;
+	bond_cb = kzalloc(sizeof(*bond_cb), GFP_KERNEL);
+	if (!bond_cb)
+		return -ENOMEM;
+	bond_cb->cb = cb;
+	bond_cb->cb_priv = cb_priv;
+	list_add(&bond_cb->list, &bond_off->cb_list);
+	return 0;
+}
+
+static void tcf_bond_off_cb_del(struct tcf_bond_off *bond_off,
+				tc_setup_cb_t *cb, void *cb_priv)
+{
+	struct tcf_bond_cb *bond_cb;
+
+	bond_cb = tcf_bond_off_cb_lookup(bond_off, cb, cb_priv);
+	if (WARN_ON(!bond_cb))
+		return;
+ list_del(&bond_cb->list); + kfree(bond_cb); +} + +static int __tc_setup_cb_bond_register(const struct net_device *dev, + tc_setup_cb_t *cb, void *cb_priv) +{ + struct tcf_bond_off *bond_off = tcf_bond_off_get(dev, cb, cb_priv); + int err; + + if (!bond_off) + return -ENOMEM; + err = tcf_bond_off_cb_add(bond_off, cb, cb_priv); + if (err) + goto err_cb_add; + return 0; + +err_cb_add: + tcf_bond_off_put(bond_off); + return err; +} + +int tc_setup_cb_bond_register(const struct net_device *dev, tc_setup_cb_t *cb, + void *cb_priv) +{ + int err; + + rtnl_lock(); + err = __tc_setup_cb_bond_register(dev, cb, cb_priv); + rtnl_unlock(); + return err; +} +EXPORT_SYMBOL_GPL(tc_setup_cb_bond_register); + +static void __tc_setup_cb_bond_unregister(const struct net_device *dev, + tc_setup_cb_t *cb, void *cb_priv) +{ + struct tcf_bond_off *bond_off = tcf_bond_off_lookup(dev); + + if (WARN_ON(!bond_off)) + return; + tcf_bond_off_cb_del(bond_off, cb, cb_priv); + tcf_bond_off_put(bond_off); +} + +void tc_setup_cb_bond_unregister(const struct net_device *dev, + tc_setup_cb_t *cb, void *cb_priv) +{ + rtnl_lock(); + __tc_setup_cb_bond_unregister(dev, cb, cb_priv); + rtnl_unlock(); +} +EXPORT_SYMBOL_GPL(tc_setup_cb_bond_unregister); + static int bond_tc_relay_cb(enum tc_setup_type type, void *type_data, void *cb_priv) { + struct net_device *bond_dev = cb_priv; + struct tcf_bond_off *bond_off; + struct tcf_bond_cb *bond_cb; + struct list_head *iter; + struct bonding *bond; + struct slave *slave; + int err; + + bond = netdev_priv(bond_dev); + + if (!tc_can_offload(bond_dev)) + return -EOPNOTSUPP; + + bond_for_each_slave(bond, slave, iter) { + if (!tc_can_offload(slave->dev)) + continue; + + bond_off = tcf_bond_off_lookup(slave->dev); + if (!bond_off) + continue; + + list_for_each_entry(bond_cb, &bond_off->cb_list, list) { + err = bond_cb->cb(type, type_data, bond_cb->cb_priv); + /* Possible here that some of the relayed callbacks are + * accepted before the error meaning a rule add may be 
+ * offloaded to some ports and not others. + * + * If skip_sw is set then the classifier will generate + * a destroy message undoing the adds. If not set then + * some of the relays exist in hw and some software + * only. + */ + if (err) + return err; + } + } + return 0; } @@ -4829,7 +5021,7 @@ static int __net_init bond_net_init(struct net *net) bond_create_proc_dir(bn); bond_create_sysfs(bn); - return 0; + return rhashtable_init(&bn->bond_offload_ht, &tcf_bond_ht_params); } static void __net_exit bond_net_exit(struct net *net) @@ -4848,6 +5040,7 @@ static void __net_exit bond_net_exit(struct net *net) rtnl_unlock(); bond_destroy_proc_dir(bn); + rhashtable_destroy(&bn->bond_offload_ht); } static struct pernet_operations bond_net_ops = { diff --git a/include/net/bonding.h b/include/net/bonding.h index 424b9ea..056f5fc 100644 --- a/include/net/bonding.h +++ b/include/net/bonding.h @@ -30,6 +30,7 @@ #include <net/bond_alb.h> #include <net/bond_options.h> #include <net/pkt_cls.h> +#include <net/act_api.h> #define BOND_MAX_ARP_TARGETS 16 @@ -584,6 +585,7 @@ struct bond_net { struct proc_dir_entry *proc_dir; #endif struct class_attribute class_attr_bonding_masters; + struct rhashtable bond_offload_ht; }; int bond_arp_rcv(const struct sk_buff *skb, struct bonding *bond, struct slave *slave); @@ -620,6 +622,11 @@ int bond_update_slave_arr(struct bonding *bond, struct slave *skipslave); void bond_slave_arr_work_rearm(struct bonding *bond, unsigned long delay); void bond_work_init_all(struct bonding *bond); +int tc_setup_cb_bond_register(const struct net_device *dev, tc_setup_cb_t *cb, + void *cb_priv); +void tc_setup_cb_bond_unregister(const struct net_device *dev, + tc_setup_cb_t *cb, void *cb_priv); + #ifdef CONFIG_PROC_FS void bond_create_proc_entry(struct bonding *bond); void bond_remove_proc_entry(struct bonding *bond); -- 2.7.4 ^ permalink raw reply related [flat|nested] 11+ messages in thread
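The comment in bond_tc_relay_cb() describes a partial-offload outcome: the relay stops at the first callback error, so ports visited earlier keep the rule. A minimal userspace sketch of that behaviour follows, assuming a simplified model; `port_model`, `port_cb`, `relay_rule` and `relay_selfcheck` are hypothetical names for illustration and are not part of the patch or of any kernel API.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for tc_setup_cb_t; not the kernel type. */
typedef int (*relay_cb_t)(int rule_id, void *cb_priv);

/* Hypothetical model of one bond slave port. */
struct port_model {
	int can_offload;	/* models tc_can_offload(slave->dev) */
	relay_cb_t cb;		/* NULL models "no callback registered" */
	int fail;		/* when set, the callback rejects the rule */
	int installed;		/* number of rules this port accepted */
};

static int port_cb(int rule_id, void *cb_priv)
{
	struct port_model *p = cb_priv;

	(void)rule_id;
	if (p->fail)
		return -1;
	p->installed++;
	return 0;
}

/* Relay one rule to every eligible port, stopping at the first error. */
static int relay_rule(struct port_model *ports, size_t n, int rule_id)
{
	size_t i;
	int err;

	for (i = 0; i < n; i++) {
		if (!ports[i].can_offload || !ports[i].cb)
			continue;
		err = ports[i].cb(rule_id, &ports[i]);
		if (err)
			return err;	/* earlier ports keep the rule */
	}
	return 0;
}

/* Returns how many ports hold the rule after a failed relay. */
static int relay_selfcheck(void)
{
	struct port_model ports[3] = {
		{ 1, port_cb, 0, 0 },
		{ 1, port_cb, 1, 0 },	/* second port rejects the rule */
		{ 1, port_cb, 0, 0 },
	};
	int err = relay_rule(ports, 3, 42);

	assert(err == -1);
	assert(ports[0].installed == 1);	/* accepted before the error */
	assert(ports[1].installed == 0);
	assert(ports[2].installed == 0);	/* never reached */
	return ports[0].installed + ports[1].installed + ports[2].installed;
}
```

Calling relay_selfcheck() from a small main() exercises the model. In this sketch nothing undoes the first port's rule; per the patch comment, that cleanup is left to the classifier's destroy message when skip_sw is set.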
* Re: [RFC net-next 2/6] driver: net: bonding: allow registration of tc offload callbacks in bond
  2018-03-05 13:28 ` [RFC net-next 2/6] driver: net: bonding: allow registration of tc offload callbacks in bond John Hurley
@ 2018-03-07 10:57   ` Jiri Pirko
From: Jiri Pirko @ 2018-03-07 10:57 UTC
To: John Hurley; +Cc: netdev, jiri, ogerlitz, jakub.kicinski, simon.horman

Mon, Mar 05, 2018 at 02:28:30PM CET, john.hurley@netronome.com wrote:
>Allow drivers to register netdev callbacks for tc offload in linux bonds.
>If a netdev has registered and is a slave of a given bond, then any tc
>rules offloaded to the bond will be relayed to it if both the bond and the
>slave permit hw offload.
>
>Because the bond itself is not offloaded, just the rules, we don't care
>about whether the bond ports are on the same device or whether some of
>slaves are representor ports and some are not.
>
>Signed-off-by: John Hurley <john.hurley@netronome.com>
>---
> drivers/net/bonding/bond_main.c | 195 +++++++++++++++++++++++++++++++++++++++-
> include/net/bonding.h           |   7 ++
> 2 files changed, 201 insertions(+), 1 deletion(-)
>
>diff --git a/drivers/net/bonding/bond_main.c b/drivers/net/bonding/bond_main.c
>index e6415f6..d9e41cf 100644
>--- a/drivers/net/bonding/bond_main.c
>+++ b/drivers/net/bonding/bond_main.c

[...]

>+EXPORT_SYMBOL_GPL(tc_setup_cb_bond_register);

Please, no "bond" specific calls from drivers. That would be wrong.
The idea behind block callbacks was that anyone who is interested could
register to receive those. In this case, slave device is interested.
So it should register to receive block callbacks in the same way as if
the block was directly on top of the slave device. The only thing you
need to handle is to propagate block bind/unbind from master down to the
slaves.
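The alternative suggested in the reply above (each slave registers for block callbacks as if the block sat directly on it, and the master only propagates bind/unbind downwards) can be modelled in userspace roughly as follows. This is a sketch under stated assumptions, not kernel code: `block_model`, `dev_model` and every function below are hypothetical names chosen for illustration.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_CBS 8

typedef int (*block_cb_t)(int rule_id, void *cb_priv);

struct block_model {			/* hypothetical shared tc block */
	size_t ncbs;
	block_cb_t cb[MAX_CBS];
	void *cb_priv[MAX_CBS];
};

struct dev_model {			/* hypothetical netdev */
	int rules_seen;
	struct dev_model **lower;	/* slaves; unused for a leaf port */
	size_t nlower;
};

static int dev_block_cb(int rule_id, void *cb_priv)
{
	struct dev_model *dev = cb_priv;

	(void)rule_id;
	dev->rules_seen++;
	return 0;
}

static int block_cb_register(struct block_model *b, block_cb_t cb, void *priv)
{
	if (b->ncbs == MAX_CBS)
		return -1;
	b->cb[b->ncbs] = cb;
	b->cb_priv[b->ncbs] = priv;
	b->ncbs++;
	return 0;
}

static void block_cb_unregister(struct block_model *b, block_cb_t cb, void *priv)
{
	size_t i;

	for (i = 0; i < b->ncbs; i++) {
		if (b->cb[i] != cb || b->cb_priv[i] != priv)
			continue;
		b->ncbs--;		/* move the last entry into the hole */
		b->cb[i] = b->cb[b->ncbs];
		b->cb_priv[i] = b->cb_priv[b->ncbs];
		return;
	}
}

/* A leaf unregisters itself; a master forwards the unbind to its slaves. */
static void dev_block_unbind(struct dev_model *dev, struct block_model *b)
{
	size_t i;

	if (!dev->nlower) {
		block_cb_unregister(b, dev_block_cb, dev);
		return;
	}
	for (i = 0; i < dev->nlower; i++)
		dev_block_unbind(dev->lower[i], b);
}

/* A leaf registers itself; a master forwards the bind to its slaves. */
static int dev_block_bind(struct dev_model *dev, struct block_model *b)
{
	size_t i;
	int err;

	if (!dev->nlower)
		return block_cb_register(b, dev_block_cb, dev);
	for (i = 0; i < dev->nlower; i++) {
		err = dev_block_bind(dev->lower[i], b);
		if (err)
			goto unwind;
	}
	return 0;
unwind:
	while (i--)			/* undo the slaves already bound */
		dev_block_unbind(dev->lower[i], b);
	return err;
}

static int block_offload_rule(struct block_model *b, int rule_id)
{
	size_t i;
	int err;

	for (i = 0; i < b->ncbs; i++) {
		err = b->cb[i](rule_id, b->cb_priv[i]);
		if (err)
			return err;
	}
	return 0;
}

/* Returns the total number of rule callbacks the two slaves received. */
static int bind_selfcheck(void)
{
	struct dev_model p0 = { 0, NULL, 0 }, p1 = { 0, NULL, 0 };
	struct dev_model *lowers[] = { &p0, &p1 };
	struct dev_model bond = { 0, lowers, 2 };
	struct block_model blk = { 0 };

	assert(dev_block_bind(&bond, &blk) == 0);
	assert(blk.ncbs == 2);		/* one callback per slave */
	assert(block_offload_rule(&blk, 1) == 0);
	dev_block_unbind(&bond, &blk);
	assert(blk.ncbs == 0);
	assert(block_offload_rule(&blk, 2) == 0);	/* nobody listens now */
	return p0.rules_seen + p1.rules_seen;
}
```

In this model the slave code path is identical whether the block is bound on the slave directly or arrives through the master, which is the property the reply asks for; the driver needs no bond-specific registration call.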