Re: [RFC net-next 2/6] driver: net: bonding: allow registration of tc offload callbacks in bond

netdev.vger.kernel.org archive mirror

* Re: [RFC net-next 2/6] driver: net: bonding: allow registration of tc offload callbacks in bond
@ 2018-03-13 15:51 Or Gerlitz
  2018-03-13 15:53 ` Or Gerlitz
  2018-03-14  9:50 ` Jiri Pirko
  0 siblings, 2 replies; 11+ messages in thread
From: Or Gerlitz @ 2018-03-13 15:51 UTC (permalink / raw)
  To: Jiri Pirko, Rabie Loulou, John Hurley
  Cc: Jakub Kicinski, Simon Horman, Linux Netdev List, ASAP_Direct_Dev,
	mlxsw

On Wed, Mar 7, 2018 at 12:57 PM, Jiri Pirko <jiri@resnulli.us> wrote:
> Mon, Mar 05, 2018 at 02:28:30PM CET, john.hurley@netronome.com wrote:
>>Allow drivers to register netdev callbacks for tc offload in linux bonds.
>>If a netdev has registered and is a slave of a given bond, then any tc
>>rules offloaded to the bond will be relayed to it if both the bond and the
>>slave permit hw offload.

>>Because the bond itself is not offloaded, just the rules, we don't care
>>about whether the bond ports are on the same device or whether some of
>>slaves are representor ports and some are not.

John, I think we must design here for the case where the bond IS offloaded,
e.g. some sort of HW LAG. For example, the mlxsw driver supports LAG offload
and supports tc flower offload, so we need to see how these two live together;
mlx5 supports tc flower offload and we are working on bond offload, etc.

>>+EXPORT_SYMBOL_GPL(tc_setup_cb_bond_register);
>
> Please, no "bond" specific calls from drivers. That would be wrong.
> The idea behing block callbacks was that anyone who is interested could
> register to receive those. In this case, slave device is interested.
> So it should register to receive block callbacks in the same way as if
> the block was directly on top of the slave device. The only thing you
> need to handle is to propagate block bind/unbind from master down to the
> slaves.

Jiri,

This sounds nice for the case where one installs ingress tc rules on the
bond (let's call them type 1, see next).

One obstacle pointed out by my colleague, Rabie, is that when the upper layer
issues a stats call on the filter, it will get two replies; this can confuse it
and lead to wrong decisions (aging). I wonder if/how we can set a knob
somewhere that unifies the stats (add packets/bytes, use the latest lastuse).
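
A minimal sketch of the unification mentioned above (sum packets and bytes,
keep the latest lastuse), written as plain user-space C rather than kernel
code. The struct and function names are hypothetical and are not part of the
tc or bonding APIs; lastuse is treated as a simple monotonically increasing
timestamp, so counter wraparound is ignored.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical per-device counters for one offloaded filter. */
struct flow_stats {
	uint64_t packets;
	uint64_t bytes;
	uint64_t lastuse;	/* time of the last hit; wraparound ignored */
};

/* Fold the counters reported by each lower device into one result:
 * packets and bytes are summed, lastuse is the most recent value seen.
 */
static struct flow_stats flow_stats_merge(const struct flow_stats *s, size_t n)
{
	struct flow_stats sum = { 0, 0, 0 };
	size_t i;

	assert(s != NULL || n == 0);

	for (i = 0; i < n; i++) {
		sum.packets += s[i].packets;
		sum.bytes += s[i].bytes;
		if (s[i].lastuse > sum.lastuse)
			sum.lastuse = s[i].lastuse;
	}
	return sum;
}
```

With two slaves reporting separately, the caller would see one merged record
instead of two replies, which is what an aging decision needs.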

Also, let's see what other rules have to be offloaded in that scheme
(call them type 2/3/4), where one has bonded two HW ports.

2. bond being egress port of a rule

TC rules for an overlay networks scheme, e.g. in a NIC SRIOV
scheme where one bonds the two uplink representors.

Starting with type 2, in our current NIC HW APIs we have to duplicate
these rules into two rules set to HW:

2.1 VF rep --> uplink 0
2.2 VF rep --> uplink 1

and we do that in the driver (add/del two HW rules, combine the stat
results, etc).

3. ingress rule on VF rep port with the shared tunnel device being the
egress (encap), and where the routing of the underlay (tunnel) goes through LAG.

in our case, this is like 2.1/2.2 above: offload two rules, combine stats

4. ingress rule with the shared tunnel device being the ingress and VF rep port
being the egress (decap)

this uses the egdev facility to be offloaded into our driver, and then in
the driver we will treat it like type 1: two rules need to be installed into HW,
but now we can't delegate them from the vxlan device b/c it has no direct
connection with the bond.
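
The duplication described for types 2 to 4 above (one HW rule per uplink,
added and deleted together) can be sketched in user-space C as follows. This
is an illustration under stated assumptions, not mlx5 code: `struct hw_port`,
`hw_rule_add()` and the `lag_rule_*()` helpers are hypothetical, and the
rollback on partial failure is a design choice added here, not something the
thread specifies.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical model of one uplink's HW rule table. */
struct hw_port {
	int nrules;	/* rules currently installed toward this uplink */
	int full;	/* when set, further inserts fail */
};

static int hw_rule_add(struct hw_port *p)
{
	if (p->full)
		return -1;
	p->nrules++;
	return 0;
}

static void hw_rule_del(struct hw_port *p)
{
	assert(p->nrules > 0);
	p->nrules--;
}

/* Install one copy of the rule per uplink. If any insert fails, remove
 * the copies already installed so the uplinks never disagree.
 */
static int lag_rule_add(struct hw_port *uplinks, size_t n)
{
	size_t i;

	for (i = 0; i < n; i++) {
		if (hw_rule_add(&uplinks[i]) != 0) {
			while (i-- > 0)
				hw_rule_del(&uplinks[i]);
			return -1;
		}
	}
	return 0;
}

/* Remove the rule's copy from every uplink. */
static void lag_rule_del(struct hw_port *uplinks, size_t n)
{
	size_t i;

	for (i = 0; i < n; i++)
		hw_rule_del(&uplinks[i]);
}
```

Keeping add and delete symmetric across both uplinks is what lets the driver
present the pair as a single rule to the upper layer.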

All in all, for the mlx5 use case, it seems we have an elegant solution only
for type 1.

I think we should do the elegant solution for the case where it is applicable.

In parallel, if/when newer HW APIs are there such that types 2 and 3 can be set
using one HW rule whose dest is the bond, we are good. As for type 4, we
need to see if/how it can be nicer.

Or.


* Re: [RFC net-next 2/6] driver: net: bonding: allow registration of tc offload callbacks in bond
  2018-03-13 15:51 [RFC net-next 2/6] driver: net: bonding: allow registration of tc offload callbacks in bond Or Gerlitz
@ 2018-03-13 15:53 ` Or Gerlitz
  2018-03-14  1:50   ` Jakub Kicinski
  2018-03-14  9:50 ` Jiri Pirko
  1 sibling, 1 reply; 11+ messages in thread
From: Or Gerlitz @ 2018-03-13 15:53 UTC (permalink / raw)
  To: Jiri Pirko, Rabie Loulou, John Hurley
  Cc: Jakub Kicinski, Simon Horman, Linux Netdev List, mlxsw,
	Yevgeny Kliteynik, Paul Blakey

On Tue, Mar 13, 2018 at 5:51 PM, Or Gerlitz <gerlitz.or@gmail.com> wrote:

Sorry ppl, I added a MLNX alias (ASAP_Direct_Dev@mellanox.com) which is
not open to outside posts; please remove it from your replies, otherwise
it will bounce back to you.. Or.

> On Wed, Mar 7, 2018 at 12:57 PM, Jiri Pirko <jiri@resnulli.us> wrote:
>> Mon, Mar 05, 2018 at 02:28:30PM CET, john.hurley@netronome.com wrote:
>>>Allow drivers to register netdev callbacks for tc offload in linux bonds.
>>>If a netdev has registered and is a slave of a given bond, then any tc
>>>rules offloaded to the bond will be relayed to it if both the bond and the
>>>slave permit hw offload.
>
>>>Because the bond itself is not offloaded, just the rules, we don't care
>>>about whether the bond ports are on the same device or whether some of
>>>slaves are representor ports and some are not.
>
> John, I think we must design here for the case where the bond IS offloaded.
> E.g some sort of HW LAG. For example, the mlxsw driver supports
> LAG offload and support tcflower offload, we need to see how these
> two live together, mlx5 supports tcflower offload and we are working on
> bond offload, etc.
>
>>>+EXPORT_SYMBOL_GPL(tc_setup_cb_bond_register);
>>
>> Please, no "bond" specific calls from drivers. That would be wrong.
>> The idea behing block callbacks was that anyone who is interested could
>> register to receive those. In this case, slave device is interested.
>> So it should register to receive block callbacks in the same way as if
>> the block was directly on top of the slave device. The only thing you
>> need to handle is to propagate block bind/unbind from master down to the
>> slaves.
>
> Jiri,
>
> This sounds nice for the case where one install ingress tc rules on
> the bond (lets
> call them type 1, see next)
>
> One obstacle pointed by my colleague, Rabie, is that when the upper layer
> issues stat call on the filter, they will get two replies, this can confuse them
> and lead to wrong decisions (aging). I wonder if/how we can set a knob
> somewhere that unifies the stats (add packet/bytes, use the latest lastuse).
>
> Also, lets see what other rules have to be offloaded in that scheme
> (call them type 2/3/4)
> where one bonded two HW ports
>
> 2. bond being egress port of a rule
>
> TC rules for overlay networks scheme, e.g in NIC SRIOV
> scheme where one bonds the two uplink representors
>
> Starting with type 2, in our current NIC HW APIs we have to duplicate
> these rules
> into two rules set to HW:
>
> 2.1 VF rep --> uplink 0
> 2.2 VF rep --> uplink 1
>
> and we do that in the driver (add/del two HW rules, combine the stat
> results, etc)
>
> 3. ingress rule on VF rep port with shared tunnel device being the
> egress (encap)
> and where the routing of the underlay (tunnel) goes through LAG.
>
> in our case, this is like 2.1/2.2 above, offload two rules, combine stats
>
> 4. ingress rule shared tunnel device being the ingress and VF rep port
> being the egress (decap)
>
> this uses the egdev facility to be offloaded into the our driver, and
> then in the driver
> we will treat it like type 1, two rules need to be installed into HW,
> but now, we can't delegate them
> from the vxlan device b/c it has no direct connection with the bond.
>
> All to all, for the mlx5 use case, seems we have elegant solution only
> for type 1.
>
> I think we should do the elegant solution for the case where it applicable.
>
> In parallel if/when newer HW APIs are there such that type 2 and 3 can be set
> using one HW rule whose dest is the bond, we are good. As for type 4,
> need to see
> if/how it can be nicer.
>
> Or.


* Re: [RFC net-next 2/6] driver: net: bonding: allow registration of tc offload callbacks in bond
  2018-03-13 15:53 ` Or Gerlitz
@ 2018-03-14  1:50   ` Jakub Kicinski
  2018-03-14  6:54     ` Or Gerlitz
  2018-03-14 15:51     ` Jiri Pirko
  0 siblings, 2 replies; 11+ messages in thread
From: Jakub Kicinski @ 2018-03-14  1:50 UTC (permalink / raw)
  To: Or Gerlitz
  Cc: Jiri Pirko, Rabie Loulou, John Hurley, Simon Horman,
	Linux Netdev List, mlxsw, Yevgeny Kliteynik, Paul Blakey

On Tue, 13 Mar 2018 17:53:39 +0200, Or Gerlitz wrote:
> > Starting with type 2, in our current NIC HW APIs we have to duplicate
> > these rules
> > into two rules set to HW:
> >
> > 2.1 VF rep --> uplink 0
> > 2.2 VF rep --> uplink 1
> >
> > and we do that in the driver (add/del two HW rules, combine the stat
> > results, etc)

Ack, I think our HW API will also require us to duplicate the rules
today, but IMHO we should implement some common helper module in the
core that would work for any block sharing rather than a bond-specific
solution.

> > 3. ingress rule on VF rep port with shared tunnel device being the
> > egress (encap)
> > and where the routing of the underlay (tunnel) goes through LAG.
> >
> > in our case, this is like 2.1/2.2 above, offload two rules, combine stats
> >
> > 4. ingress rule shared tunnel device being the ingress and VF rep port
> > being the egress (decap)
> >
> > this uses the egdev facility to be offloaded into the our driver, and
> > then in the driver
> > we will treat it like type 1, two rules need to be installed into HW,
> > but now, we can't delegate them
> > from the vxlan device b/c it has no direct connection with the bond.

Let's get rid of the egdev crutch first then :]


* Re: [RFC net-next 2/6] driver: net: bonding: allow registration of tc offload callbacks in bond
  2018-03-14  1:50   ` Jakub Kicinski
@ 2018-03-14  6:54     ` Or Gerlitz
  2018-03-14 15:51     ` Jiri Pirko
  1 sibling, 0 replies; 11+ messages in thread
From: Or Gerlitz @ 2018-03-14  6:54 UTC (permalink / raw)
  To: Jakub Kicinski
  Cc: Jiri Pirko, Rabie Loulou, John Hurley, Simon Horman,
	Linux Netdev List, mlxsw, Yevgeny Kliteynik, Paul Blakey

On Wed, Mar 14, 2018 at 3:50 AM, Jakub Kicinski
<jakub.kicinski@netronome.com> wrote:
> On Tue, 13 Mar 2018 17:53:39 +0200, Or Gerlitz wrote:
>> > Starting with type 2, in our current NIC HW APIs we have to duplicate
>> > these rules
>> > into two rules set to HW:
>> >
>> > 2.1 VF rep --> uplink 0
>> > 2.2 VF rep --> uplink 1
>> >
>> > and we do that in the driver (add/del two HW rules, combine the stat
>> > results, etc)
>
> Ack, I think our HW API also will require us to duplicate the rules
> today, but IMHO we should implement some common helper module in the
> core that would work for any block sharing rather than bond specific
> solution.

To be clear, you refer to the case where the bond is the egress device
of the rule?

For the case where the bond is the ingress device, are you OK with the approach
Jiri suggested, to propagate the tc setup ndo call into the lower devices, so
they bind/unbind to any block the upper does? This approach is applicable to
bond/team/vlan devices for both NIC and switch ASIC (or NPU...) drivers. Do you
want to make a helper out of this?


* Re: [RFC net-next 2/6] driver: net: bonding: allow registration of tc offload callbacks in bond
  2018-03-14  1:50   ` Jakub Kicinski
  2018-03-14  6:54     ` Or Gerlitz
@ 2018-03-14 15:51     ` Jiri Pirko
  1 sibling, 0 replies; 11+ messages in thread
From: Jiri Pirko @ 2018-03-14 15:51 UTC (permalink / raw)
  To: Jakub Kicinski
  Cc: Or Gerlitz, Jiri Pirko, Rabie Loulou, John Hurley, Simon Horman,
	Linux Netdev List, mlxsw, Yevgeny Kliteynik, Paul Blakey

Wed, Mar 14, 2018 at 02:50:02AM CET, jakub.kicinski@netronome.com wrote:
>On Tue, 13 Mar 2018 17:53:39 +0200, Or Gerlitz wrote:
>> > Starting with type 2, in our current NIC HW APIs we have to duplicate
>> > these rules
>> > into two rules set to HW:
>> >
>> > 2.1 VF rep --> uplink 0
>> > 2.2 VF rep --> uplink 1
>> >
>> > and we do that in the driver (add/del two HW rules, combine the stat
>> > results, etc)
>
>Ack, I think our HW API also will require us to duplicate the rules
>today, but IMHO we should implement some common helper module in the
>core that would work for any block sharing rather than bond specific
>solution.

But how? Only the driver knows, in case it has 2 netdevices, whether the HW
is capable of sharing or not. Accordingly, it registers 1 cb instance
or 2 cb instances (1 for each netdev). I don't see how you can move it into
core...


>
>> > 3. ingress rule on VF rep port with shared tunnel device being the
>> > egress (encap)
>> > and where the routing of the underlay (tunnel) goes through LAG.
>> >
>> > in our case, this is like 2.1/2.2 above, offload two rules, combine stats
>> >
>> > 4. ingress rule shared tunnel device being the ingress and VF rep port
>> > being the egress (decap)
>> >
>> > this uses the egdev facility to be offloaded into the our driver, and
>> > then in the driver
>> > we will treat it like type 1, two rules need to be installed into HW,
>> > but now, we can't delegate them
>> > from the vxlan device b/c it has no direct connection with the bond.
>
>Let's get rid of the egdev crutch first then :]

I don't see how you can do it. Note that this exists to catch insertions
of rules that have "mirred redirect" to the dev which is interested in
the rules. Originally it was done in a very ugly way (please see git
history), and I converted it to egdev - I was not able to find any nicer
solution :/ Any ideas for improvement?


* Re: [RFC net-next 2/6] driver: net: bonding: allow registration of tc offload callbacks in bond
  2018-03-13 15:51 [RFC net-next 2/6] driver: net: bonding: allow registration of tc offload callbacks in bond Or Gerlitz
  2018-03-13 15:53 ` Or Gerlitz
@ 2018-03-14  9:50 ` Jiri Pirko
  2018-03-14 11:23   ` Or Gerlitz
  1 sibling, 1 reply; 11+ messages in thread
From: Jiri Pirko @ 2018-03-14  9:50 UTC (permalink / raw)
  To: Or Gerlitz
  Cc: Jiri Pirko, Rabie Loulou, John Hurley, Jakub Kicinski,
	Simon Horman, Linux Netdev List, ASAP_Direct_Dev, mlxsw

Tue, Mar 13, 2018 at 04:51:02PM CET, gerlitz.or@gmail.com wrote:
>On Wed, Mar 7, 2018 at 12:57 PM, Jiri Pirko <jiri@resnulli.us> wrote:
>> Mon, Mar 05, 2018 at 02:28:30PM CET, john.hurley@netronome.com wrote:
>>>Allow drivers to register netdev callbacks for tc offload in linux bonds.
>>>If a netdev has registered and is a slave of a given bond, then any tc
>>>rules offloaded to the bond will be relayed to it if both the bond and the
>>>slave permit hw offload.
>
>>>Because the bond itself is not offloaded, just the rules, we don't care
>>>about whether the bond ports are on the same device or whether some of
>>>slaves are representor ports and some are not.
>
>John, I think we must design here for the case where the bond IS offloaded.
>E.g some sort of HW LAG. For example, the mlxsw driver supports
>LAG offload and support tcflower offload, we need to see how these
>two live together, mlx5 supports tcflower offload and we are working on
>bond offload, etc.
>
>>>+EXPORT_SYMBOL_GPL(tc_setup_cb_bond_register);
>>
>> Please, no "bond" specific calls from drivers. That would be wrong.
>> The idea behing block callbacks was that anyone who is interested could
>> register to receive those. In this case, slave device is interested.
>> So it should register to receive block callbacks in the same way as if
>> the block was directly on top of the slave device. The only thing you
>> need to handle is to propagate block bind/unbind from master down to the
>> slaves.
>
>Jiri,
>
>This sounds nice for the case where one install ingress tc rules on
>the bond (lets
>call them type 1, see next)
>
>One obstacle pointed by my colleague, Rabie, is that when the upper layer
>issues stat call on the filter, they will get two replies, this can confuse them
>and lead to wrong decisions (aging). I wonder if/how we can set a knob

The bonding itself would not do anything on a stats update
command (TC_CLSFLOWER_STATS for example). Only the slaves would do the
update. So there will only be replies from the slaves.

Bond/team is just going to propagate block bind/unbind down. Nothing
else.
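
The propagation described above can be modelled in a few lines of user-space
C. This is only a sketch of the idea: the types and helpers are hypothetical
stand-ins (not `struct net_device` or the real block bind interface), and
returning the number of slaves that accepted the command is an assumption
made so the behaviour is easy to observe.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_SLAVES 8

enum block_cmd { BLOCK_BIND, BLOCK_UNBIND };

/* Hypothetical stand-in for a lower (slave) device. */
struct lower_dev {
	int hw_tc_offload;	/* does the device permit tc offload? */
	int bound;		/* number of blocks currently bound */
};

/* Hypothetical stand-in for the bond/team master. */
struct master_dev {
	struct lower_dev *slaves[MAX_SLAVES];
	size_t nslaves;
};

/* What a slave does when a block is bound to or unbound from it. */
static int lower_block_setup(struct lower_dev *dev, enum block_cmd cmd)
{
	if (!dev->hw_tc_offload)
		return -1;	/* stands in for "not supported" */
	if (cmd == BLOCK_BIND)
		dev->bound++;
	else if (dev->bound > 0)
		dev->bound--;
	return 0;
}

/* The master does no offload work itself; it only relays the command
 * to every slave and reports how many of them accepted it.
 */
static size_t master_block_setup(struct master_dev *m, enum block_cmd cmd)
{
	size_t i, ok = 0;

	assert(m->nslaves <= MAX_SLAVES);

	for (i = 0; i < m->nslaves; i++)
		if (lower_block_setup(m->slaves[i], cmd) == 0)
			ok++;
	return ok;
}
```

Each slave then sees the block exactly as if it had been bound directly on
top of it, so filter add/del and stats requests reach the slaves without any
bond-specific call.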



>somewhere that unifies the stats (add packet/bytes, use the latest lastuse).
>
>Also, lets see what other rules have to be offloaded in that scheme
>(call them type 2/3/4)
>where one bonded two HW ports
>
>2. bond being egress port of a rule
>
>TC rules for overlay networks scheme, e.g in NIC SRIOV
>scheme where one bonds the two uplink representors
>
>Starting with type 2, in our current NIC HW APIs we have to duplicate
>these rules
>into two rules set to HW:
>
>2.1 VF rep --> uplink 0
>2.2 VF rep --> uplink 1
>
>and we do that in the driver (add/del two HW rules, combine the stat
>results, etc)

That is up to the driver. If the driver can share a block between 2
devices, it can do that. If it cannot share, it will just report stats
for every device separately (2 block cbs registered) and tc will see
them both together. No need to do anything in the driver.


>
>3. ingress rule on VF rep port with shared tunnel device being the
>egress (encap)
>and where the routing of the underlay (tunnel) goes through LAG.
>
>in our case, this is like 2.1/2.2 above, offload two rules, combine stats
>

Same as "2."


>4. ingress rule shared tunnel device being the ingress and VF rep port
>being the egress (decap)

I don't follow :(

>
>this uses the egdev facility to be offloaded into the our driver, and
>then in the driver
>we will treat it like type 1, two rules need to be installed into HW,
>but now, we can't delegate them
>from the vxlan device b/c it has no direct connection with the bond.

I see another thing we need to sanitize: vxlan rule ingress match action
mirred redirect to lag


>
>All to all, for the mlx5 use case, seems we have elegant solution only
>for type 1.
>
>I think we should do the elegant solution for the case where it applicable.
>
>In parallel if/when newer HW APIs are there such that type 2 and 3 can be set
>using one HW rule whose dest is the bond, we are good. As for type 4,
>need to see
>if/how it can be nicer.
>
>Or.


* Re: [RFC net-next 2/6] driver: net: bonding: allow registration of tc offload callbacks in bond
  2018-03-14  9:50 ` Jiri Pirko
@ 2018-03-14 11:23   ` Or Gerlitz
  2018-03-14 15:56     ` Jiri Pirko
  0 siblings, 1 reply; 11+ messages in thread
From: Or Gerlitz @ 2018-03-14 11:23 UTC (permalink / raw)
  To: Jiri Pirko
  Cc: Jiri Pirko, Rabie Loulou, John Hurley, Jakub Kicinski,
	Simon Horman, Linux Netdev List, ASAP_Direct_Dev, mlxsw

On Wed, Mar 14, 2018 at 11:50 AM, Jiri Pirko <jiri@resnulli.us> wrote:
> Tue, Mar 13, 2018 at 04:51:02PM CET, gerlitz.or@gmail.com wrote:
>>On Wed, Mar 7, 2018 at 12:57 PM, Jiri Pirko <jiri@resnulli.us> wrote:

>>This sounds nice for the case where one install ingress tc rules on
>>the bond (lets
>>call them type 1, see next)
>>
>>One obstacle pointed by my colleague, Rabie, is that when the upper layer
>>issues stat call on the filter, they will get two replies, this can confuse them
>>and lead to wrong decisions (aging). I wonder if/how we can set a knob
>
> The bonding itself would not do anything on stats update
> command (TC_CLSFLOWER_STATS for example). Only the slaves would do
> update. So there will be only reply from slaves.
>
> Bond/team is just going to probagare block bind/unbind down. Nothing else.

Do we agree that user space will get the replies of all lower (slave) devices,
or am I missing something here?

>>2. bond being egress port of a rule
>>2.1 VF rep --> uplink 0
>>2.2 VF rep --> uplink 1
>>
>>and we do that in the driver (add/del two HW rules, combine the stat
>>results, etc)
>
> That is up to the driver. If the driver can share block between 2
> devices, he can do that. If he cannot share, it will just report stats
> for every device separatelly (2 block cbs registered) and tc will see
> them both together. No need to do anything in driver.

right

>>3. ingress rule on VF rep port with shared tunnel device being the
>>egress (encap)
>>and where the routing of the underlay (tunnel) goes through LAG.

> Same as "2."

ok

>>4. ingress rule shared tunnel device being the ingress and VF rep port
>>being the egress (decap)

> I don't follow :(

the way tunneling is handled in tc classifier/action is

encap:  ingress: net port, action1: tunnel key set action2: mirred to
shared-tunnel device

decap: ingress: shared tunnel device, action1: tunnel key unset
action2: mirred to net port

type 4 are the decap rules; when we offload them as a HW ACL we stretch
the line and put the ingress in a HW port too (e.g. uplink port in NICs)


>>this uses the egdev facility to be offloaded into the our driver, and
>>then in the driver
>>we will treat it like type 1, two rules need to be installed into HW,
>>but now, we can't delegate them
>>from the vxlan device b/c it has no direct connection with the bond.

> I see another thing we need to sanitize: vxlan rule ingress match action
> mirred redirect to lag

right, we don't have it for NIC, but for switch ASIC I guess it is applicable


* Re: [RFC net-next 2/6] driver: net: bonding: allow registration of tc offload callbacks in bond
  2018-03-14 11:23   ` Or Gerlitz
@ 2018-03-14 15:56     ` Jiri Pirko
  2018-03-15 21:38       ` Or Gerlitz
  0 siblings, 1 reply; 11+ messages in thread
From: Jiri Pirko @ 2018-03-14 15:56 UTC (permalink / raw)
  To: Or Gerlitz
  Cc: Jiri Pirko, Rabie Loulou, John Hurley, Jakub Kicinski,
	Simon Horman, Linux Netdev List, ASAP_Direct_Dev, mlxsw

Wed, Mar 14, 2018 at 12:23:59PM CET, gerlitz.or@gmail.com wrote:
>On Wed, Mar 14, 2018 at 11:50 AM, Jiri Pirko <jiri@resnulli.us> wrote:
>> Tue, Mar 13, 2018 at 04:51:02PM CET, gerlitz.or@gmail.com wrote:
>>>On Wed, Mar 7, 2018 at 12:57 PM, Jiri Pirko <jiri@resnulli.us> wrote:
>
>>>This sounds nice for the case where one install ingress tc rules on
>>>the bond (lets
>>>call them type 1, see next)
>>>
>>>One obstacle pointed by my colleague, Rabie, is that when the upper layer
>>>issues stat call on the filter, they will get two replies, this can confuse them
>>>and lead to wrong decisions (aging). I wonder if/how we can set a knob
>>
>> The bonding itself would not do anything on stats update
>> command (TC_CLSFLOWER_STATS for example). Only the slaves would do
>> update. So there will be only reply from slaves.
>>
>> Bond/team is just going to probagare block bind/unbind down. Nothing else.
>
>Do we agree that user space will get the replies of all lower (slave) devices,
>or I am missing something here?

"user space will get the replies" - not sure what exactly do you mean by
this. The stats would be accumulated over all devices/drivers who
registered block callback.


>
>>>2. bond being egress port of a rule
>>>2.1 VF rep --> uplink 0
>>>2.2 VF rep --> uplink 1
>>>
>>>and we do that in the driver (add/del two HW rules, combine the stat
>>>results, etc)
>>
>> That is up to the driver. If the driver can share block between 2
>> devices, he can do that. If he cannot share, it will just report stats
>> for every device separatelly (2 block cbs registered) and tc will see
>> them both together. No need to do anything in driver.
>
>right
>
>>>3. ingress rule on VF rep port with shared tunnel device being the
>>>egress (encap)
>>>and where the routing of the underlay (tunnel) goes through LAG.
>
>> Same as "2."
>
>ok
>
>>>4. ingress rule shared tunnel device being the ingress and VF rep port
>>>being the egress (decap)
>
>> I don't follow :(
>
>the way tunneling is handled in tc classifier/action is
>
>encap:  ingress: net port, action1: tunnel key set action2: mirred to
>shared-tunnel device
>
>decap: ingress: shared tunnel device, action1: tunnel key unset
>action2: mirred to net port
>
>type 4 are the decap rules, when we offload it to as HW ACL we stretch
>the line and the ingress
>in a HW port too (e.g uplink port in NICs)

Okay, I see. But where's the bond here? Is it the one I mentioned as
"mirred redirect to lag"?


>
>
>>>this uses the egdev facility to be offloaded into the our driver, and
>>>then in the driver
>>>we will treat it like type 1, two rules need to be installed into HW,
>>>but now, we can't delegate them
>>>from the vxlan device b/c it has no direct connection with the bond.
>
>> I see another thing we need to sanitize: vxlan rule ingress match action
>> mirred redirect to lag
>
>right, we don't have  for NIC but for switch ASIC, I guess it is applicable

Yes, it is. For future NICs I guess it is going to be as well.


* Re: [RFC net-next 2/6] driver: net: bonding: allow registration of tc offload callbacks in bond
  2018-03-14 15:56     ` Jiri Pirko
@ 2018-03-15 21:38       ` Or Gerlitz
  0 siblings, 0 replies; 11+ messages in thread
From: Or Gerlitz @ 2018-03-15 21:38 UTC (permalink / raw)
  To: Jiri Pirko
  Cc: Jiri Pirko, Rabie Loulou, John Hurley, Jakub Kicinski,
	Simon Horman, Linux Netdev List, ASAP_Direct_Dev, mlxsw

On Wed, Mar 14, 2018 at 5:56 PM, Jiri Pirko <jiri@resnulli.us> wrote:
> Wed, Mar 14, 2018 at 12:23:59PM CET, gerlitz.or@gmail.com wrote:
>>On Wed, Mar 14, 2018 at 11:50 AM, Jiri Pirko <jiri@resnulli.us> wrote:
>>> Tue, Mar 13, 2018 at 04:51:02PM CET, gerlitz.or@gmail.com wrote:
>>>>On Wed, Mar 7, 2018 at 12:57 PM, Jiri Pirko <jiri@resnulli.us> wrote:
>>
>>>>This sounds nice for the case where one install ingress tc rules on
>>>>the bond (lets
>>>>call them type 1, see next)
>>>>
>>>>One obstacle pointed by my colleague, Rabie, is that when the upper layer
>>>>issues stat call on the filter, they will get two replies, this can confuse them
>>>>and lead to wrong decisions (aging). I wonder if/how we can set a knob
>>>
>>> The bonding itself would not do anything on stats update
>>> command (TC_CLSFLOWER_STATS for example). Only the slaves would do
>>> update. So there will be only reply from slaves.
>>>
>>> Bond/team is just going to probagare block bind/unbind down. Nothing else.
>>
>>Do we agree that user space will get the replies of all lower (slave) devices,
>>or I am missing something here?
>
> "user space will get the replies" - not sure what exactly do you mean by
> this. The stats would be accumulated over all devices/drivers who
> registered block callback.

OK, this is probably something I have to check, thanks


>>>>2. bond being egress port of a rule
>>>>2.1 VF rep --> uplink 0
>>>>2.2 VF rep --> uplink 1
>>>>
>>>>and we do that in the driver (add/del two HW rules, combine the stat
>>>>results, etc)
>>>
>>> That is up to the driver. If the driver can share block between 2
>>> devices, he can do that. If he cannot share, it will just report stats
>>> for every device separatelly (2 block cbs registered) and tc will see
>>> them both together. No need to do anything in driver.
>>
>>right
>>
>>>>3. ingress rule on VF rep port with shared tunnel device being the
>>>>egress (encap)
>>>>and where the routing of the underlay (tunnel) goes through LAG.
>>
>>> Same as "2."
>>
>>ok
>>
>>>>4. ingress rule shared tunnel device being the ingress and VF rep port being the egress (decap)

>>> I don't follow :(

>> the way tunneling is handled in tc classifier/action is

>> encap:  ingress: net port, action1: tunnel key set action2: mirred to
>> shared-tunnel device

>> decap: ingress: shared tunnel device, action1: tunnel key unset
>> action2: mirred to net port

>> type 4 are the decap rules, when we offload it to as HW ACL we stretch
>> the line and the ingress in a HW port too (e.g uplink port in NICs)

> Okay, I see. But where's the bond here? Is it the one I mentioned as
> "mirred redirect to lag"?

since the ingress port is not a HW port, we will use the egdev approach
and offload the rule with the uplink of this VF rep port being the ingress.

Since we will see that this uplink is in a LAG, we will offload another rule
with the 2nd uplink being the ingress

>>> I see another thing we need to sanitize: vxlan rule ingress match action
>>> mirred redirect to lag
>>right, we don't have  for NIC but for switch ASIC, I guess it is applicable
> Yes, it is. For future NICs I guess it is going to be as well.

might


* [RFC net-next 0/6] offload linux bonding tc ingress rules
@ 2018-03-05 13:28 John Hurley
  2018-03-05 13:28 ` [RFC net-next 2/6] driver: net: bonding: allow registration of tc offload callbacks in bond John Hurley
  0 siblings, 1 reply; 11+ messages in thread
From: John Hurley @ 2018-03-05 13:28 UTC (permalink / raw)
  To: netdev; +Cc: jiri, ogerlitz, jakub.kicinski, simon.horman, John Hurley

Hi,

This RFC patchset adds support for offloading tc ingress rules applied to
linux bonds. The premise of these patches is that if a rule is applied to
a bond port then the rule should be applied to each slave of the bond.

The linux bond itself registers a cb for offloading tc rules. Potential
slave netdevs on offload devices can then register with the bond for a
further callback - this code is basically the same as registering for an
egress dev offload in TC. Then when a rule is offloaded to the bond, it
can be relayed to each netdev that has registered with the bond code and
which is a slave of the given bond.

To prevent sync issues between the kernel and offload device, the linux
bond driver is effectively locked when it has offloaded rules, i.e. no new
ports can be enslaved and no slaves can be released until the offload
rules are removed. Similarly, if a port on a bond is deleted, the bond is
destroyed, forcing a flush of all offloaded rules.
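
The locking rule in the paragraph above can be sketched as a small user-space
model. The names are hypothetical and the choice of `-EBUSY` is an assumption
for illustration; the actual patches may report the condition differently.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical model of a bond: only the two counters that matter here. */
struct bond_model {
	unsigned int nslaves;		/* ports currently enslaved */
	unsigned int offloadcnt;	/* tc rules currently offloaded */
};

/* Enslaving is refused while any rule is offloaded, so the set of ports
 * the rules were relayed to cannot change underneath the hardware.
 */
static int bond_model_enslave(struct bond_model *b)
{
	assert(b != NULL);

	if (b->offloadcnt)
		return -EBUSY;
	b->nslaves++;
	return 0;
}

/* Releasing a slave follows the same rule. */
static int bond_model_release(struct bond_model *b)
{
	assert(b != NULL);

	if (b->offloadcnt)
		return -EBUSY;
	if (!b->nslaves)
		return -ENODEV;
	b->nslaves--;
	return 0;
}
```

Once the offloaded rules are removed and the count drops back to zero, both
operations succeed again.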

Also included in the RFC are changes to the NFP driver to utilise the new
code by registering NFP port representors for bond offload rules and
modifying cookie handling to allow the relaying of a rule to multiple
ports.

Thanks,
John

John Hurley (6):
  drivers: net: bonding: add tc offload infastructure to bond
  driver: net: bonding: allow registration of tc offload callbacks in
    bond
  drivers: net: bonding: restrict bond mods when rules are offloaded
  nfp: add ndo_set_mac_address for representors
  nfp: register repr ports for bond offloads
  nfp: support offloading multiple rules with same cookie

 drivers/net/bonding/bond_main.c                    | 277 ++++++++++++++++++++-
 drivers/net/ethernet/netronome/nfp/flower/main.c   |  24 +-
 drivers/net/ethernet/netronome/nfp/flower/main.h   |  10 +-
 .../net/ethernet/netronome/nfp/flower/metadata.c   |  20 +-
 .../net/ethernet/netronome/nfp/flower/offload.c    |  33 ++-
 drivers/net/ethernet/netronome/nfp/nfp_net_repr.c  |   1 +
 include/net/bonding.h                              |   9 +
 7 files changed, 351 insertions(+), 23 deletions(-)

-- 
2.7.4


* [RFC net-next 2/6] driver: net: bonding: allow registration of tc offload callbacks in bond
  2018-03-05 13:28 [RFC net-next 0/6] offload linux bonding tc ingress rules John Hurley
@ 2018-03-05 13:28 ` John Hurley
  2018-03-07 10:57   ` Jiri Pirko
  0 siblings, 1 reply; 11+ messages in thread
From: John Hurley @ 2018-03-05 13:28 UTC (permalink / raw)
  To: netdev; +Cc: jiri, ogerlitz, jakub.kicinski, simon.horman, John Hurley

Allow drivers to register netdev callbacks for tc offload in linux bonds.
If a netdev has registered and is a slave of a given bond, then any tc
rules offloaded to the bond will be relayed to it if both the bond and the
slave permit hw offload.

Because the bond itself is not offloaded, just the rules, we don't care
about whether the bond ports are on the same device or whether some of
the slaves are representor ports and some are not.

Signed-off-by: John Hurley <john.hurley@netronome.com>
---
 drivers/net/bonding/bond_main.c | 195 +++++++++++++++++++++++++++++++++++++++-
 include/net/bonding.h           |   7 ++
 2 files changed, 201 insertions(+), 1 deletion(-)

diff --git a/drivers/net/bonding/bond_main.c b/drivers/net/bonding/bond_main.c
index e6415f6..d9e41cf 100644
--- a/drivers/net/bonding/bond_main.c
+++ b/drivers/net/bonding/bond_main.c
@@ -335,9 +335,201 @@ static inline unsigned int bond_get_offload_cnt(struct bonding *bond)
 	return bond->tc_block->offloadcnt;
 }
 
+struct tcf_bond_cb {
+	struct list_head list;
+	tc_setup_cb_t *cb;
+	void *cb_priv;
+};
+
+struct tcf_bond_off {
+	struct rhash_head ht_node;
+	const struct net_device *netdev;
+	unsigned int refcnt;
+	struct list_head cb_list;
+};
+
+static const struct rhashtable_params tcf_bond_ht_params = {
+	.key_offset = offsetof(struct tcf_bond_off, netdev),
+	.head_offset = offsetof(struct tcf_bond_off, ht_node),
+	.key_len = sizeof(const struct net_device *),
+};
+
+static struct tcf_bond_off *tcf_bond_off_lookup(const struct net_device *dev)
+{
+	struct bond_net *bn = net_generic(dev_net(dev), bond_net_id);
+
+	return rhashtable_lookup_fast(&bn->bond_offload_ht, &dev,
+				      tcf_bond_ht_params);
+}
+
+static struct tcf_bond_cb *tcf_bond_off_cb_lookup(struct tcf_bond_off *off,
+						  tc_setup_cb_t *cb,
+						  void *cb_priv)
+{
+	struct tcf_bond_cb *bond_cb;
+
+	list_for_each_entry(bond_cb, &off->cb_list, list)
+		if (bond_cb->cb == cb && bond_cb->cb_priv == cb_priv)
+			return bond_cb;
+	return NULL;
+}
+
+static struct tcf_bond_off *tcf_bond_off_get(const struct net_device *dev,
+					     tc_setup_cb_t *cb,
+					     void *cb_priv)
+{
+	struct tcf_bond_off *bond_off;
+	struct bond_net *bn;
+
+	bond_off = tcf_bond_off_lookup(dev);
+	if (bond_off)
+		goto inc_ref;
+
+	bond_off = kzalloc(sizeof(*bond_off), GFP_KERNEL);
+	if (!bond_off)
+		return NULL;
+	INIT_LIST_HEAD(&bond_off->cb_list);
+	bond_off->netdev = dev;
+	bn = net_generic(dev_net(dev), bond_net_id);
+	rhashtable_insert_fast(&bn->bond_offload_ht, &bond_off->ht_node,
+			       tcf_bond_ht_params);
+
+inc_ref:
+	bond_off->refcnt++;
+	return bond_off;
+}
+
+static void tcf_bond_off_put(struct tcf_bond_off *bond_off)
+{
+	struct bond_net *bn;
+
+	if (--bond_off->refcnt)
+		return;
+	bn = net_generic(dev_net(bond_off->netdev), bond_net_id);
+	rhashtable_remove_fast(&bn->bond_offload_ht, &bond_off->ht_node,
+			       tcf_bond_ht_params);
+	kfree(bond_off);
+}
+
+static int tcf_bond_off_cb_add(struct tcf_bond_off *bond_off,
+			       tc_setup_cb_t *cb, void *cb_priv)
+{
+	struct tcf_bond_cb *bond_cb;
+
+	bond_cb = tcf_bond_off_cb_lookup(bond_off, cb, cb_priv);
+	if (WARN_ON(bond_cb))
+		return -EEXIST;
+	bond_cb = kzalloc(sizeof(*bond_cb), GFP_KERNEL);
+	if (!bond_cb)
+		return -ENOMEM;
+	bond_cb->cb = cb;
+	bond_cb->cb_priv = cb_priv;
+	list_add(&bond_cb->list, &bond_off->cb_list);
+	return 0;
+}
+
+static void tcf_bond_off_cb_del(struct tcf_bond_off *bond_off,
+				tc_setup_cb_t *cb, void *cb_priv)
+{
+	struct tcf_bond_cb *bond_cb;
+
+	bond_cb = tcf_bond_off_cb_lookup(bond_off, cb, cb_priv);
+	if (WARN_ON(!bond_cb))
+		return;
+	list_del(&bond_cb->list);
+	kfree(bond_cb);
+}
+
+static int __tc_setup_cb_bond_register(const struct net_device *dev,
+				       tc_setup_cb_t *cb, void *cb_priv)
+{
+	struct tcf_bond_off *bond_off = tcf_bond_off_get(dev, cb, cb_priv);
+	int err;
+
+	if (!bond_off)
+		return -ENOMEM;
+	err = tcf_bond_off_cb_add(bond_off, cb, cb_priv);
+	if (err)
+		goto err_cb_add;
+	return 0;
+
+err_cb_add:
+	tcf_bond_off_put(bond_off);
+	return err;
+}
+
+int tc_setup_cb_bond_register(const struct net_device *dev, tc_setup_cb_t *cb,
+			      void *cb_priv)
+{
+	int err;
+
+	rtnl_lock();
+	err = __tc_setup_cb_bond_register(dev, cb, cb_priv);
+	rtnl_unlock();
+	return err;
+}
+EXPORT_SYMBOL_GPL(tc_setup_cb_bond_register);
+
+static void __tc_setup_cb_bond_unregister(const struct net_device *dev,
+					  tc_setup_cb_t *cb, void *cb_priv)
+{
+	struct tcf_bond_off *bond_off = tcf_bond_off_lookup(dev);
+
+	if (WARN_ON(!bond_off))
+		return;
+	tcf_bond_off_cb_del(bond_off, cb, cb_priv);
+	tcf_bond_off_put(bond_off);
+}
+
+void tc_setup_cb_bond_unregister(const struct net_device *dev,
+				 tc_setup_cb_t *cb, void *cb_priv)
+{
+	rtnl_lock();
+	__tc_setup_cb_bond_unregister(dev, cb, cb_priv);
+	rtnl_unlock();
+}
+EXPORT_SYMBOL_GPL(tc_setup_cb_bond_unregister);
+
 static int bond_tc_relay_cb(enum tc_setup_type type, void *type_data,
 			    void *cb_priv)
 {
+	struct net_device *bond_dev = cb_priv;
+	struct tcf_bond_off *bond_off;
+	struct tcf_bond_cb *bond_cb;
+	struct list_head *iter;
+	struct bonding *bond;
+	struct slave *slave;
+	int err;
+
+	bond = netdev_priv(bond_dev);
+
+	if (!tc_can_offload(bond_dev))
+		return -EOPNOTSUPP;
+
+	bond_for_each_slave(bond, slave, iter) {
+		if (!tc_can_offload(slave->dev))
+			continue;
+
+		bond_off = tcf_bond_off_lookup(slave->dev);
+		if (!bond_off)
+			continue;
+
+		list_for_each_entry(bond_cb, &bond_off->cb_list, list) {
+			err = bond_cb->cb(type, type_data, bond_cb->cb_priv);
+			/* Possible here that some of the relayed callbacks are
+			 * accepted before the error meaning a rule add may be
+			 * offloaded to some ports and not others.
+			 *
+			 * If skip_sw is set then the classifier will generate
+			 * a destroy message undoing the adds. If not set then
+			 * some of the relays exist in hw and some software
+			 * only.
+			 */
+			if (err)
+				return err;
+		}
+	}
+
 	return 0;
 }
 
@@ -4829,7 +5021,7 @@ static int __net_init bond_net_init(struct net *net)
 	bond_create_proc_dir(bn);
 	bond_create_sysfs(bn);
 
-	return 0;
+	return rhashtable_init(&bn->bond_offload_ht, &tcf_bond_ht_params);
 }
 
 static void __net_exit bond_net_exit(struct net *net)
@@ -4848,6 +5040,7 @@ static void __net_exit bond_net_exit(struct net *net)
 	rtnl_unlock();
 
 	bond_destroy_proc_dir(bn);
+	rhashtable_destroy(&bn->bond_offload_ht);
 }
 
 static struct pernet_operations bond_net_ops = {
diff --git a/include/net/bonding.h b/include/net/bonding.h
index 424b9ea..056f5fc 100644
--- a/include/net/bonding.h
+++ b/include/net/bonding.h
@@ -30,6 +30,7 @@
 #include <net/bond_alb.h>
 #include <net/bond_options.h>
 #include <net/pkt_cls.h>
+#include <net/act_api.h>
 
 #define BOND_MAX_ARP_TARGETS	16
 
@@ -584,6 +585,7 @@ struct bond_net {
 	struct proc_dir_entry	*proc_dir;
 #endif
 	struct class_attribute	class_attr_bonding_masters;
+	struct rhashtable	bond_offload_ht;
 };
 
 int bond_arp_rcv(const struct sk_buff *skb, struct bonding *bond, struct slave *slave);
@@ -620,6 +622,11 @@ int bond_update_slave_arr(struct bonding *bond, struct slave *skipslave);
 void bond_slave_arr_work_rearm(struct bonding *bond, unsigned long delay);
 void bond_work_init_all(struct bonding *bond);
 
+int tc_setup_cb_bond_register(const struct net_device *dev, tc_setup_cb_t *cb,
+			      void *cb_priv);
+void tc_setup_cb_bond_unregister(const struct net_device *dev,
+				 tc_setup_cb_t *cb, void *cb_priv);
+
 #ifdef CONFIG_PROC_FS
 void bond_create_proc_entry(struct bonding *bond);
 void bond_remove_proc_entry(struct bonding *bond);
-- 
2.7.4


* Re: [RFC net-next 2/6] driver: net: bonding: allow registration of tc offload callbacks in bond
  2018-03-05 13:28 ` [RFC net-next 2/6] driver: net: bonding: allow registration of tc offload callbacks in bond John Hurley
@ 2018-03-07 10:57   ` Jiri Pirko
  0 siblings, 0 replies; 11+ messages in thread
From: Jiri Pirko @ 2018-03-07 10:57 UTC (permalink / raw)
  To: John Hurley; +Cc: netdev, jiri, ogerlitz, jakub.kicinski, simon.horman

Mon, Mar 05, 2018 at 02:28:30PM CET, john.hurley@netronome.com wrote:
>Allow drivers to register netdev callbacks for tc offload in linux bonds.
>If a netdev has registered and is a slave of a given bond, then any tc
>rules offloaded to the bond will be relayed to it if both the bond and the
>slave permit hw offload.
>
>Because the bond itself is not offloaded, just the rules, we don't care
>about whether the bond ports are on the same device or whether some of
>the slaves are representor ports and some are not.
>
>Signed-off-by: John Hurley <john.hurley@netronome.com>
>---
> drivers/net/bonding/bond_main.c | 195 +++++++++++++++++++++++++++++++++++++++-
> include/net/bonding.h           |   7 ++
> 2 files changed, 201 insertions(+), 1 deletion(-)
>
>diff --git a/drivers/net/bonding/bond_main.c b/drivers/net/bonding/bond_main.c
>index e6415f6..d9e41cf 100644
>--- a/drivers/net/bonding/bond_main.c
>+++ b/drivers/net/bonding/bond_main.c

[...]


>+EXPORT_SYMBOL_GPL(tc_setup_cb_bond_register);

Please, no "bond" specific calls from drivers. That would be wrong.
The idea behind block callbacks was that anyone who is interested could
register to receive those. In this case, the slave device is interested.
So it should register to receive block callbacks in the same way as if
the block was directly on top of the slave device. The only thing you
need to handle is to propagate block bind/unbind from master down to the
slaves.
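
One possible reading of this suggestion, sketched as a userspace model rather
than kernel code: the master's bind/unbind handler simply forwards the request
to each slave, and each slave registers its callback on the block exactly as
it would if the block sat directly on top of it. All names below (model_block,
model_dev, master_setup_block, slave_setup_block and so on) are hypothetical
and chosen for illustration only; they are not existing kernel APIs.

```c
#include <stdbool.h>
#include <stddef.h>

#define MODEL_MAX_CBS 8

/* Hypothetical stand-in for a block callback; not a kernel type. */
typedef int (*model_cb_t)(int rule_id, void *cb_priv);

struct model_cb_entry {
	model_cb_t cb;
	void *cb_priv;
};

/* Hypothetical model of a tc block: just a list of registered callbacks. */
struct model_block {
	struct model_cb_entry cbs[MODEL_MAX_CBS];
	size_t ncbs;
};

struct model_dev {
	/* models the device's block bind/unbind handler */
	int (*setup_block)(struct model_dev *dev, struct model_block *block,
			   bool bind);
	struct model_dev **slaves;
	size_t nslaves;
	void *priv;
};

static int model_block_cb_register(struct model_block *b, model_cb_t cb,
				   void *cb_priv)
{
	if (b->ncbs == MODEL_MAX_CBS)
		return -1;
	b->cbs[b->ncbs].cb = cb;
	b->cbs[b->ncbs].cb_priv = cb_priv;
	b->ncbs++;
	return 0;
}

static void model_block_cb_unregister(struct model_block *b, model_cb_t cb,
				      void *cb_priv)
{
	size_t i;

	for (i = 0; i < b->ncbs; i++) {
		if (b->cbs[i].cb == cb && b->cbs[i].cb_priv == cb_priv) {
			b->cbs[i] = b->cbs[--b->ncbs];
			return;
		}
	}
}

/* Slave-side rule callback: counts the rules it has been offered. */
static int slave_rule_cb(int rule_id, void *cb_priv)
{
	(void)rule_id;
	*(int *)cb_priv += 1;
	return 0;
}

/* Slave handler: identical whether the block is its own or the master's. */
static int slave_setup_block(struct model_dev *dev, struct model_block *b,
			     bool bind)
{
	if (bind)
		return model_block_cb_register(b, slave_rule_cb, dev->priv);
	model_block_cb_unregister(b, slave_rule_cb, dev->priv);
	return 0;
}

/* Master handler: propagate bind/unbind down, unwinding a failed bind. */
static int master_setup_block(struct model_dev *master, struct model_block *b,
			      bool bind)
{
	size_t i;
	int err;

	for (i = 0; i < master->nslaves; i++) {
		struct model_dev *s = master->slaves[i];

		err = s->setup_block(s, b, bind);
		if (err && bind) {
			while (i--)
				master->slaves[i]->setup_block(master->slaves[i],
							       b, false);
			return err;
		}
	}
	return 0;
}

/* Offer a rule to every callback on the block, stopping at the first error. */
static int model_block_offload(struct model_block *b, int rule_id)
{
	size_t i;
	int err;

	for (i = 0; i < b->ncbs; i++) {
		err = b->cbs[i].cb(rule_id, b->cbs[i].cb_priv);
		if (err)
			return err;
	}
	return 0;
}
```

In this model the slave driver never calls anything bond-specific: binding the
block to the master leaves one callback per slave on the block, and unbinding
removes them again.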


Thread overview: 11+ messages
2018-03-13 15:51 [RFC net-next 2/6] driver: net: bonding: allow registration of tc offload callbacks in bond Or Gerlitz
2018-03-13 15:53 ` Or Gerlitz
2018-03-14  1:50   ` Jakub Kicinski
2018-03-14  6:54     ` Or Gerlitz
2018-03-14 15:51     ` Jiri Pirko
2018-03-14  9:50 ` Jiri Pirko
2018-03-14 11:23   ` Or Gerlitz
2018-03-14 15:56     ` Jiri Pirko
2018-03-15 21:38       ` Or Gerlitz
  -- strict thread matches above, loose matches on Subject: below --
2018-03-05 13:28 [RFC net-next 0/6] offload linux bonding tc ingress rules John Hurley
2018-03-05 13:28 ` [RFC net-next 2/6] driver: net: bonding: allow registration of tc offload callbacks in bond John Hurley
2018-03-07 10:57   ` Jiri Pirko
