Re: Fwd: [PATCH 0/6] offload Linux LAG devices to the TC datapath

Netdev List

* Re: Fwd: [PATCH 0/6] offload Linux LAG devices to the TC datapath
       [not found] ` <8f406548-8f90-b658-fcd1-342d702b3445@mellanox.com>
@ 2018-06-26 14:57   ` Or Gerlitz
  2018-06-26 18:16     ` John Hurley
  2018-06-26 22:31     ` Jakub Kicinski
  0 siblings, 2 replies; 9+ messages in thread
From: Or Gerlitz @ 2018-06-26 14:57 UTC
  To: John Hurley, Jakub Kicinski, Jiri Pirko
  Cc: netdev, ASAP_Direct_Dev, simon.horman, Andy Gospodarek

> -------- Forwarded Message --------
> Subject: [PATCH 0/6] offload Linux LAG devices to the TC datapath
> Date: Thu, 21 Jun 2018 14:35:55 +0100
> From: John Hurley <john.hurley@netronome.com>
> To: dev@openvswitch.org, roid@mellanox.com, gavi@mellanox.com, paulb@mellanox.com, fbl@sysclose.org, simon.horman@netronome.com
> CC: John Hurley <john.hurley@netronome.com>
> 
> This patchset extends OvS TC and the linux-netdev implementation to
> support the offloading of Linux Link Aggregation devices (LAG) and their
> slaves. TC blocks are used to provide this offload. Blocks, in TC, group
> together a series of qdiscs. If a filter is added to one of these qdiscs
> then it is applied to all. Similarly, if a packet is matched on one of the
> grouped qdiscs then the stats for the entire block are increased. The
> basis of the LAG offload is that the LAG master (attached to the OvS
> bridge) and slaves that may exist outside of OvS are all added to the same
> TC block. OvS can then control the filters and collect the stats on the
> slaves via its interaction with the LAG master.
> 
> The TC API is extended within OvS to allow the addition of a block id to
> ingress qdisc adds. Block ids are then assigned to each LAG master that is
> attached to the OvS bridge. The linux netdev netlink socket is used to
> monitor slave devices. If a LAG slave is found whose master is on the bridge
> then it is added to the same block as its master. If the underlying slaves
> belong to an offloadable device then the Linux LAG device can be offloaded
> to hardware.
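
The block-sharing idea in the cover letter above can be sketched in a few
lines of C. This is only a toy model under stated assumptions: the names
`struct netdev_model`, `assign_block_ids`, `block_hit` and `block_hits` are
hypothetical and are not the OvS or kernel TC API. It shows a LAG master on
the bridge getting a block id, its slaves joining the same block, and a match
on any member being accounted to the block as a whole.

```c
#include <assert.h>
#include <stddef.h>

#define NO_MASTER  (-1)
#define NO_BLOCK   0u
#define MAX_BLOCKS 16

/* Hypothetical model of a netdev as seen by the monitoring code. */
struct netdev_model {
	int ifindex;
	int master_ifindex;   /* NO_MASTER if the device is not enslaved */
	int is_lag_master;
	int on_bridge;        /* attached to the OvS bridge */
	unsigned int block_id;
};

/* One counter per block: a hit on any member bumps the block total. */
static unsigned long block_hits[MAX_BLOCKS];

static struct netdev_model *find_dev(struct netdev_model *devs, size_t n,
				     int ifindex)
{
	for (size_t i = 0; i < n; i++)
		if (devs[i].ifindex == ifindex)
			return &devs[i];
	return NULL;
}

/* Give each LAG master on the bridge its own block, then put every slave
 * whose master has a block into that same block.
 */
void assign_block_ids(struct netdev_model *devs, size_t n)
{
	unsigned int next_id = 1;

	for (size_t i = 0; i < n; i++)
		if (devs[i].is_lag_master && devs[i].on_bridge &&
		    next_id < MAX_BLOCKS)
			devs[i].block_id = next_id++;

	for (size_t i = 0; i < n; i++) {
		struct netdev_model *m;

		if (devs[i].master_ifindex == NO_MASTER)
			continue;
		m = find_dev(devs, n, devs[i].master_ifindex);
		if (m && m->block_id != NO_BLOCK)
			devs[i].block_id = m->block_id;
	}
}

/* A packet matched on dev's qdisc is accounted to the whole block. */
void block_hit(const struct netdev_model *dev)
{
	assert(dev->block_id < MAX_BLOCKS);
	if (dev->block_id != NO_BLOCK)
		block_hits[dev->block_id]++;
}
```

In this model, user-space only ever reads the counter of the master's block,
which is why the slaves can stay outside the bridge.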

Guys (J/J/J),

I'm raising this here because:

a. this has an impact on the kernel side of things

b. I am more of a netdev than an openvswitch citizen.

Some comments:

1. This + Jakub's patch for the replay are really a great design.

2. Re the egress side of things: some NIC HWs can't just use a LAG
as the egress port destination of an ACL (tc rule), and the HW rule
needs to be duplicated to both HW ports. So in that case, do you
see the HW driver doing the duplication (:(), or can we somehow
make it happen from user-space?

3. For the case of overlay networks, e.g. an OVS-based vxlan tunnel, the
ingress (decap) rule is set on the vxlan device. Jakub, you mentioned
a possible kernel patch to the HW (nfp, mlx5) drivers to have them bind
to the tunnel device for ingress rules. If we have an agreed way to identify
uplink representors, can we do that from ovs too? Does it matter if we are
bonding + encapsulating or just encapsulating? Note that under the encap
scheme the bond is typically not part of the OVS bridge.

Or.

* Re: Fwd: [PATCH 0/6] offload Linux LAG devices to the TC datapath
  2018-06-26 14:57   ` Fwd: [PATCH 0/6] offload Linux LAG devices to the TC datapath Or Gerlitz
@ 2018-06-26 18:16     ` John Hurley
  2018-06-27 20:13       ` Or Gerlitz
  2018-06-26 22:31     ` Jakub Kicinski
  1 sibling, 1 reply; 9+ messages in thread
From: John Hurley @ 2018-06-26 18:16 UTC
  To: Or Gerlitz
  Cc: Jakub Kicinski, Jiri Pirko, Linux Netdev List, Simon Horman,
	Andy Gospodarek

On Tue, Jun 26, 2018 at 3:57 PM, Or Gerlitz <ogerlitz@mellanox.com> wrote:
>> -------- Forwarded Message --------
>> Subject: [PATCH 0/6] offload Linux LAG devices to the TC datapath
>> Date: Thu, 21 Jun 2018 14:35:55 +0100
>> From: John Hurley <john.hurley@netronome.com>
>> To: dev@openvswitch.org, roid@mellanox.com, gavi@mellanox.com, paulb@mellanox.com, fbl@sysclose.org, simon.horman@netronome.com
>> CC: John Hurley <john.hurley@netronome.com>
>>
>> This patchset extends OvS TC and the linux-netdev implementation to
>> support the offloading of Linux Link Aggregation devices (LAG) and their
>> slaves. TC blocks are used to provide this offload. Blocks, in TC, group
>> together a series of qdiscs. If a filter is added to one of these qdiscs
>> then it is applied to all. Similarly, if a packet is matched on one of the
>> grouped qdiscs then the stats for the entire block are increased. The
>> basis of the LAG offload is that the LAG master (attached to the OvS
>> bridge) and slaves that may exist outside of OvS are all added to the same
>> TC block. OvS can then control the filters and collect the stats on the
>> slaves via its interaction with the LAG master.
>>
>> The TC API is extended within OvS to allow the addition of a block id to
>> ingress qdisc adds. Block ids are then assigned to each LAG master that is
>> attached to the OvS bridge. The linux netdev netlink socket is used to
>> monitor slave devices. If a LAG slave is found whose master is on the bridge
>> then it is added to the same block as its master. If the underlying slaves
>> belong to an offloadable device then the Linux LAG device can be offloaded
>> to hardware.
>
> Guys (J/J/J),
>
> Doing this here b/c
>
> a. this has impact on the kernel side of things
>
> b. I am more of a netdev and not openvswitch citizen..
>
> some comments,
>
> 1. this + Jakub's patch for the replay are really a great design
>
> 2. re the egress side of things. Some NIC HWs can't just use LAG
> as the egress port destination of an ACL (tc rule) and the HW rule
> needs to be duplicated to both HW ports. So... in that case, you
> see the HW driver doing the duplication (:() or we can somehow
> make it happen from user-space?
>

Hi Or,
I'm not sure how rule duplication would work for rules that egress to
a LAG device.
Perhaps this could be done for an active/backup mode, where user-space
adds a rule to one port and deletes it from another as appropriate.
For load balancing modes where the egress port is selected based on a
hash of packet fields, it would be a lot more complicated.
OvS can do this with its own bonds as far as I'm aware but (if
recirculation is turned off) it basically creates exact match datapath
entries for each packet flow.
Perhaps I do not fully understand your question?
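
The difference between the two bonding modes discussed above can be
illustrated with a small C sketch. It is a minimal model under assumptions:
`struct flow_key`, `flow_hash` and `lag_select_port` are hypothetical names,
and the hash is a toy function, not the bonding driver's or OvS's actual
algorithm. In active/backup mode every flow leaves on one known port, so a
single per-port rule is enough; in a hash mode the port depends on the packet
fields, so no single static per-port rule covers all flows.

```c
#include <assert.h>
#include <stdint.h>

enum lag_mode { LAG_ACTIVE_BACKUP, LAG_HASH };

/* Hypothetical flow key; real hash policies use other field sets. */
struct flow_key {
	uint32_t src_ip;
	uint32_t dst_ip;
	uint16_t src_port;
	uint16_t dst_port;
};

/* Toy hash for illustration only. */
uint32_t flow_hash(const struct flow_key *k)
{
	uint32_t h = k->src_ip ^ k->dst_ip;

	h ^= ((uint32_t)k->src_port << 16) | k->dst_port;
	h ^= h >> 16;
	return h;
}

/* Return the index of the slave port that a flow egresses on. */
unsigned int lag_select_port(enum lag_mode mode, const struct flow_key *k,
			     unsigned int n_ports, unsigned int active_port)
{
	assert(n_ports > 0 && active_port < n_ports);

	if (mode == LAG_ACTIVE_BACKUP)
		return active_port;      /* same port for every flow */
	return flow_hash(k) % n_ports;   /* depends on the packet fields */
}
```

With a selection function like this, user-space would have to install one
exact-match rule per flow to pin the egress port in a hash mode, which matches
the per-flow datapath entries mentioned above.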

> 3. for the case of overlay networks, e.g OVS based vxlan tunnel, the
> ingress (decap) rule is set on the vxlan device. Jakub, you mentioned
> a possible kernel patch to the HW (nfp, mlx5) drivers to have them bind
> to the tunnel device for ingress rules. If we have agreed way to identify
> uplink representors, can we do that from ovs too? does it matter if we are
> bonding + encapsulating or just encapsulating? note that under encap scheme
> the bond is typically not part of the OVS bridge.
>

If we have a way to bind the HW drivers to tunnel devs for ingress
rules then this should work fine with OvS (possibly requiring a small
patch - I'd need to check).

In terms of bonding + encap this probably needs to be handled in the
hw itself for the same reason I mentioned in point 2.

> Or.

* Re: Fwd: [PATCH 0/6] offload Linux LAG devices to the TC datapath
  2018-06-26 18:16     ` John Hurley
@ 2018-06-27 20:13       ` Or Gerlitz
  0 siblings, 0 replies; 9+ messages in thread
From: Or Gerlitz @ 2018-06-27 20:13 UTC
  To: John Hurley
  Cc: Or Gerlitz, Jakub Kicinski, Jiri Pirko, Linux Netdev List,
	Simon Horman, Andy Gospodarek

On Tue, Jun 26, 2018 at 9:16 PM, John Hurley <john.hurley@netronome.com> wrote:
> On Tue, Jun 26, 2018 at 3:57 PM, Or Gerlitz <ogerlitz@mellanox.com> wrote:
>>> -------- Forwarded Message --------
>>> Subject: [PATCH 0/6] offload Linux LAG devices to the TC datapath
>>> Date: Thu, 21 Jun 2018 14:35:55 +0100
>>> From: John Hurley <john.hurley@netronome.com>
>>> To: dev@openvswitch.org, roid@mellanox.com, gavi@mellanox.com, paulb@mellanox.com, fbl@sysclose.org, simon.horman@netronome.com
>>> CC: John Hurley <john.hurley@netronome.com>
>>>
>>> This patchset extends OvS TC and the linux-netdev implementation to
>>> support the offloading of Linux Link Aggregation devices (LAG) and their
>>> slaves. TC blocks are used to provide this offload. Blocks, in TC, group
>>> together a series of qdiscs. If a filter is added to one of these qdiscs
>>> then it is applied to all. Similarly, if a packet is matched on one of the
>>> grouped qdiscs then the stats for the entire block are increased. The
>>> basis of the LAG offload is that the LAG master (attached to the OvS
>>> bridge) and slaves that may exist outside of OvS are all added to the same
>>> TC block. OvS can then control the filters and collect the stats on the
>>> slaves via its interaction with the LAG master.
>>>
>>> The TC API is extended within OvS to allow the addition of a block id to
>>> ingress qdisc adds. Block ids are then assigned to each LAG master that is
>>> attached to the OvS bridge. The linux netdev netlink socket is used to
>>> monitor slave devices. If a LAG slave is found whose master is on the bridge
>>> then it is added to the same block as its master. If the underlying slaves
>>> belong to an offloadable device then the Linux LAG device can be offloaded
>>> to hardware.
>>
>> Guys (J/J/J),
>>
>> Doing this here b/c
>>
>> a. this has impact on the kernel side of things
>>
>> b. I am more of a netdev and not openvswitch citizen..
>>
>> some comments,
>>
>> 1. this + Jakub's patch for the replay are really a great design
>>
>> 2. re the egress side of things. Some NIC HWs can't just use LAG
>> as the egress port destination of an ACL (tc rule) and the HW rule
>> needs to be duplicated to both HW ports. So... in that case, you
>> see the HW driver doing the duplication (:() or we can somehow
>> make it happen from user-space?
>>
>
> Hi Or,
> I'm not sure how rule duplication would work for rules that egress to
> a LAG device.
> Perhaps this could be done for an active/backup mode where user-space
> adds a rule to 1 port and deletes from another as appropriate.
> For load balancing modes where the egress port is selected based on a
> hash of packet fields, it would be a lot more complicated.
> OvS can do this with its own bonds as far as I'm aware but (if
> recirculation is turned off) it basically creates exact match datapath
> entries for each packet flow.
> Perhaps I do not fully understand your question?

Hi John,

Some NICs don't support egress LAG hashing, but they can still provide HW
high availability (HA) and load balancing (LB). Specifically, we are referring
here to giving a VF netdev HA and LB without any action on its side, once we
bond the uplink reps, apply your patch on ovs, and some more.

So the use case I am targeting (1) does this with the kernel bond/team and
(2) uses the LAG/802.3ad mode of bonding/teaming,

and needs to duplicate rules where the egress is the bond.

>
>> 3. for the case of overlay networks, e.g OVS based vxlan tunnel, the
>> ingress (decap) rule is set on the vxlan device. Jakub, you mentioned
>> a possible kernel patch to the HW (nfp, mlx5) drivers to have them bind
>> to the tunnel device for ingress rules. If we have agreed way to identify
>> uplink representors, can we do that from ovs too? does it matter if we are
>> bonding + encapsulating or just encapsulating? note that under encap scheme
>> the bond is typically not part of the OVS bridge.
>>
>
> If we have a way to bind the HW drivers to tunnel devs for ingress
> rules then this should work fine with OvS (possibly requiring a small
> patch - I'd need to check).
>
> In terms of bonding + encap this probably needs to be handled in the
> hw itself for the same reason I mentioned in point 2.

So we have two cases where the stack/ovs can't do the job and the HW driver
needs to act; let's try to improve on that and reduce them.

* Re: [PATCH 0/6] offload Linux LAG devices to the TC datapath
  2018-06-26 14:57   ` Fwd: [PATCH 0/6] offload Linux LAG devices to the TC datapath Or Gerlitz
  2018-06-26 18:16     ` John Hurley
@ 2018-06-26 22:31     ` Jakub Kicinski
  2018-06-27 20:07       ` Or Gerlitz
  1 sibling, 1 reply; 9+ messages in thread
From: Jakub Kicinski @ 2018-06-26 22:31 UTC
  To: Or Gerlitz
  Cc: John Hurley, Jiri Pirko, netdev, ASAP_Direct_Dev, simon.horman,
	Andy Gospodarek

On Tue, 26 Jun 2018 17:57:08 +0300, Or Gerlitz wrote:
> > -------- Forwarded Message --------
> > Subject: [PATCH 0/6] offload Linux LAG devices to the TC datapath
> > Date: Thu, 21 Jun 2018 14:35:55 +0100
> > From: John Hurley <john.hurley@netronome.com>
> > To: dev@openvswitch.org, roid@mellanox.com, gavi@mellanox.com, paulb@mellanox.com, fbl@sysclose.org, simon.horman@netronome.com
> > CC: John Hurley <john.hurley@netronome.com>
> > 
> > This patchset extends OvS TC and the linux-netdev implementation to
> > support the offloading of Linux Link Aggregation devices (LAG) and their
> > slaves. TC blocks are used to provide this offload. Blocks, in TC, group
> > together a series of qdiscs. If a filter is added to one of these qdiscs
> > then it is applied to all. Similarly, if a packet is matched on one of the
> > grouped qdiscs then the stats for the entire block are increased. The
> > basis of the LAG offload is that the LAG master (attached to the OvS
> > bridge) and slaves that may exist outside of OvS are all added to the same
> > TC block. OvS can then control the filters and collect the stats on the
> > slaves via its interaction with the LAG master.
> > 
> > The TC API is extended within OvS to allow the addition of a block id to
> > ingress qdisc adds. Block ids are then assigned to each LAG master that is
> > attached to the OvS bridge. The linux netdev netlink socket is used to
> > monitor slave devices. If a LAG slave is found whose master is on the bridge
> > then it is added to the same block as its master. If the underlying slaves
> > belong to an offloadable device then the Linux LAG device can be offloaded
> > to hardware.  
> 
> Guys (J/J/J), 
> 
> Doing this here b/c
> 
> a. this has impact on the kernel side of things
> 
> b. I am more of a netdev and not openvswitch citizen..
> 
> some comments, 
> 
> > 1. this + Jakub's patch for the replay are really a great design
> 
> 2. re the egress side of things. Some NIC HWs can't just use LAG
> as the egress port destination of an ACL (tc rule) and the HW rule
> needs to be duplicated to both HW ports. So... in that case, you 
> see the HW driver doing the duplication (:() or we can somehow
> make it happen from user-space?

It's the TC core that does the duplication.  Drivers which don't need
the duplication (e.g. mlxsw) will not register a new callback for each
port on which the shared block is bound.  They will keep one list of rules,
and a list of ports that those rules apply to.

Drivers which need duplication (multiplication) (all NICs?) have to
register a new callback for each port bound to a shared block.  And TC
will call those drivers as many times as they have callbacks registered
== as many times as they have ports bound to the block.  Each time the
callback is invoked the driver will figure out the ingress port based
on the cb_priv and use <ingress, cookie> as the key in its rule table
(or have a separate rule table per ingress port).

So again you just register a callback every time a shared block is bound,
and then the TC core will send add/remove rule commands down to the driver,
relaying existing rules as well if needed.  I may be wrong, but I think
you split the rule tables per port for mlx5, so this OvS bond offload
scheme "should just work" for you?

Does that clarify things or were you asking more about the active
passive thing John mentioned or some way to save rule space?
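
The per-port callback scheme described above can be modelled in plain C. This
is a simplified sketch, not the kernel's TC block callback API: the names
`struct shared_block_model`, `block_bind_cb`, `block_add_filter`,
`driver_add_rule`, `struct port_priv` and `rule_table` are all hypothetical.
It shows one filter add fanning out to one callback per bound port, with the
driver keying each resulting rule on <ingress, cookie>.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_CBS   8
#define MAX_RULES 32

/* Hypothetical per-port private data handed back to the driver (cb_priv). */
struct port_priv {
	int ingress_port;
};

/* One offloaded rule, keyed by <ingress port, cookie>. */
struct hw_rule {
	int ingress_port;
	uint64_t cookie;
};

struct hw_rule rule_table[MAX_RULES];
size_t n_rules;

typedef int (*block_cb_t)(uint64_t cookie, void *cb_priv);

/* Model of a shared block: a list of registered driver callbacks. */
struct shared_block_model {
	block_cb_t cb[MAX_CBS];
	void *cb_priv[MAX_CBS];
	size_t n_cbs;
};

/* Driver side: derive the ingress port from cb_priv and store the rule. */
int driver_add_rule(uint64_t cookie, void *cb_priv)
{
	struct port_priv *priv = cb_priv;

	assert(priv != NULL);
	if (n_rules >= MAX_RULES)
		return -1;
	rule_table[n_rules].ingress_port = priv->ingress_port;
	rule_table[n_rules].cookie = cookie;
	n_rules++;
	return 0;
}

/* Called once for each port bound to the block. */
int block_bind_cb(struct shared_block_model *b, block_cb_t cb, void *cb_priv)
{
	if (b->n_cbs >= MAX_CBS)
		return -1;
	b->cb[b->n_cbs] = cb;
	b->cb_priv[b->n_cbs] = cb_priv;
	b->n_cbs++;
	return 0;
}

/* Core side: a single filter add is sent to every registered callback. */
int block_add_filter(struct shared_block_model *b, uint64_t cookie)
{
	for (size_t i = 0; i < b->n_cbs; i++) {
		int err = b->cb[i](cookie, b->cb_priv[i]);

		if (err)
			return err;
	}
	return 0;
}
```

A driver that does not need duplication would, in this model, bind a single
callback and keep its own list of ports instead.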

> 3. for the case of overlay networks, e.g OVS based vxlan tunnel, the
> ingress (decap) rule is set on the vxlan device. Jakub, you mentioned 
> a possible kernel patch to the HW (nfp, mlx5) drivers to have them bind 
> to the tunnel device for ingress rules. If we have agreed way to identify
> uplink representors, can we do that from ovs too?

I'm not sure, there can be multiple tunnel devices.  Plus we really
want to know the tunnel type, e.g. vxlan vs geneve, so simple shared
block propagation will probably not cut it.  If that's what you're
referring to.

> does it matter if we are bonding + encapsulating or just
> encapsulating? note that under encap scheme the bond is typically not
> part of the OVS bridge. 

I don't think that matters in general, driver doing bonding offload
should just start recognizing the bond master as "their port" and
register an egdev callback for redirects to master today (or equivalent
in the new scheme once that materializes...)

* Re: [PATCH 0/6] offload Linux LAG devices to the TC datapath
  2018-06-26 22:31     ` Jakub Kicinski
@ 2018-06-27 20:07       ` Or Gerlitz
  2018-06-27 23:08         ` Jakub Kicinski
  0 siblings, 1 reply; 9+ messages in thread
From: Or Gerlitz @ 2018-06-27 20:07 UTC
  To: Jakub Kicinski
  Cc: Or Gerlitz, John Hurley, Jiri Pirko, Linux Netdev List,
	ASAP_Direct_Dev, Simon Horman, Andy Gospodarek

On Wed, Jun 27, 2018 at 1:31 AM, Jakub Kicinski
<jakub.kicinski@netronome.com> wrote:
> On Tue, 26 Jun 2018 17:57:08 +0300, Or Gerlitz wrote:

>> 2. re the egress side of things. Some NIC HWs can't just use LAG
>> as the egress port destination of an ACL (tc rule) and the HW rule
>> needs to be duplicated to both HW ports. So... in that case, you
>> see the HW driver doing the duplication (:() or we can somehow
>> make it happen from user-space?

> It's the TC core that does the duplication.  Drivers which don't need
> the duplication (e.g. mlxsw) will not register a new callback for each
> port on which shared block is bound.  They will keep one list of rules,
> and a list of ports that those rules apply to.

[snip]

> Drivers which need duplication (multiplication) (all NICs?) have to
> register a new callback for each port bound to a shared block.  And TC
> will call those drivers as many times as they have callbacks registered
> == as many times as they have ports bound to the block.  Each time
> callback is invoked the driver will figure out the ingress port based
> on the cb_priv and use <ingress, cookie> as the key in its rule table
> (or have a separate rule table per ingress port).

[snip snip]

> I may be wrong, but I think you split the rules tables per port for mlx5

correct,  currently I have a rule table per physical port.

> So again you just register a callback every time shared block is bound,
> and then TC core will send add/remove rule commands down to the driver,
> relaying existing rules as well if needed.

Let's see, the NIC uplink rep port devices were bound (say) by ovs to
a shared block because they are the lower devices (hate the slavish jargon)
of a bond device.

Next, the TC stack will invoke the callback over these ports when an ingress
rule is added on the bond.

But we are talking about an ingress rule set on a non-uplink rep (VF rep) port,
where the bond is the egress of the rule. I guess the callback you are probably
referring to (you hinted at it below) is the egdev one, correct? You are
suggesting that bonding will do egdev registration... I am a bit confused.

> Does that clarify things or were you asking more about the active
> passive thing John mentioned or some way to save rule space?

No (I wasn't referring to active-passive) and no (I wasn't looking to save
rule space); yes for active-active in HW that needs duplicated rules (NICs).

>> 3. for the case of overlay networks, e.g OVS based vxlan tunnel, the
>> ingress (decap) rule is set on the vxlan device. Jakub, you mentioned
>> a possible kernel patch to the HW (nfp, mlx5) drivers to have them bind
>> to the tunnel device for ingress rules. If we have agreed way to identify
>> uplink representors, can we do that from ovs too?
>
> I'm not sure, there can be multiple tunnel devices.  Plus we really
> want to know the tunnel type, e.g. vxlan vs geneve, so simple shared
> block propagation will probably not cut it.  If that's what you're
> referring to.

Isn't knowing the tunnel type already missing today? I saw you
started patching the tunnel key set action for Geneve; does upstream
+ the patches you sent work, or is more missing to get geneve encap
through the TC stack?

>> does it matter if we are bonding + encapsulating or just
>> encapsulating? note that under encap scheme the bond is typically not
>> part of the OVS bridge.

> I don't think that matters in general, driver doing bonding offload
> should just start recognizing the bond master as "their port" and
> register an egdev callback for redirects to master today (or equivalent
> in the new scheme once that materializes...)

* Re: [PATCH 0/6] offload Linux LAG devices to the TC datapath
  2018-06-27 20:07       ` Or Gerlitz
@ 2018-06-27 23:08         ` Jakub Kicinski
  2018-06-28  3:50           ` Or Gerlitz
  0 siblings, 1 reply; 9+ messages in thread
From: Jakub Kicinski @ 2018-06-27 23:08 UTC
  To: Or Gerlitz
  Cc: Or Gerlitz, John Hurley, Jiri Pirko, Linux Netdev List,
	ASAP_Direct_Dev, Simon Horman, Andy Gospodarek

On Wed, 27 Jun 2018 23:07:29 +0300, Or Gerlitz wrote:
> On Wed, Jun 27, 2018 at 1:31 AM, Jakub Kicinski
> <jakub.kicinski@netronome.com> wrote:
> > On Tue, 26 Jun 2018 17:57:08 +0300, Or Gerlitz wrote:  
> 
> >> 2. re the egress side of things. Some NIC HWs can't just use LAG
> >> as the egress port destination of an ACL (tc rule) and the HW rule
> >> needs to be duplicated to both HW ports. So... in that case, you
> >> see the HW driver doing the duplication (:() or we can somehow
> >> make it happen from user-space?  
> 
> > It's the TC core that does the duplication.  Drivers which don't need
> > the duplication (e.g. mlxsw) will not register a new callback for each
> > port on which shared block is bound.  They will keep one list of rules,
> > and a list of ports that those rules apply to.  
> 
> [snip]
> 
> > Drivers which need duplication (multiplication) (all NICs?) have to
> > register a new callback for each port bound to a shared block.  And TC
> > will call those drivers as many times as they have callbacks registered
> > == as many times as they have ports bound to the block.  Each time
> > callback is invoked the driver will figure out the ingress port based
> > on the cb_priv and use <ingress, cookie> as the key in its rule table
> > (or have a separate rule table per ingress port).  
> 
> [snip snip]
> 
> > I may be wrong, but I think you split the rules tables per port for mlx5  
> 
> correct,  currently I have a rule table per physical port.
> 
> > So again you just register a callback every time shared block is bound,
> > and then TC core will send add/remove rule commands down to the driver,
> > relaying existing rules as well if needed.  
> 
> Let's see, the NIC uplink rep port devices were bounded (say) by ovs to
> a shared-block because they are the lower devices (hate the slavish jargon)
> of a bond device.
> 
> Next, the TC stack will invoke the callback over these ports, when ingress
> rule is added on the bond.
> 
> But we are talking on ingress rule set on a non-uplink rep (VF rep) port,
> where bonding is the egress of the rule. I guess the callback which you probably
> refer to (you hinted there below) is the egdev one, correct? you are suggesting
> that bonding will do egdev registration... I am a bit confused.

Ah, you really meant egress.  We don't have this problem, but yes, I
think you could register an egdev callback for each lower.  You won't
get the nice rule replay from egdev as of today, though :(

> > Does that clarify things or were you asking more about the active
> > passive thing John mentioned or some way to save rule space?  
> 
> no (didn't refer to active-passive) and no (didn't look to save rule space)
> yes for active-active in a HW that needs duplicated rules (NICs).
> 
> >> 3. for the case of overlay networks, e.g OVS based vxlan tunnel, the
> >> ingress (decap) rule is set on the vxlan device. Jakub, you mentioned
> >> a possible kernel patch to the HW (nfp, mlx5) drivers to have them bind
> >> to the tunnel device for ingress rules. If we have agreed way to identify
> >> uplink representors, can we do that from ovs too?  
> >
> > I'm not sure, there can be multiple tunnel devices.  Plus we really
> > want to know the tunnel type, e.g. vxlan vs geneve, so simple shared
> > block propagation will probably not cut it.  If that's what you're
> > referring to.  
> 
> isn't knowing the tunnel type already missing today? I saw you
> started patches the tunnel key set action for Geneve, does upstream
> + the patches you sent works or more is missing to get geneve encap
> through the TC stack?

Yes, knowing the tunnel type is missing today, but hopefully it won't be once
we get to the redesign of egdev :)  Today we only support decap on standard
ports :/  Encap is fine, though.  FWIW Geneve already works on the nfp,
the work from Simon & Pieter we posted is adding support for the
options.

* Re: [PATCH 0/6] offload Linux LAG devices to the TC datapath
  2018-06-27 23:08         ` Jakub Kicinski
@ 2018-06-28  3:50           ` Or Gerlitz
  2018-06-28  4:02             ` Jakub Kicinski
  0 siblings, 1 reply; 9+ messages in thread
From: Or Gerlitz @ 2018-06-28  3:50 UTC
  To: Jakub Kicinski
  Cc: Or Gerlitz, John Hurley, Jiri Pirko, Linux Netdev List,
	ASAP_Direct_Dev, Simon Horman, Andy Gospodarek

On Thu, Jun 28, 2018 at 2:08 AM, Jakub Kicinski
<jakub.kicinski@netronome.com> wrote:
> On Wed, 27 Jun 2018 23:07:29 +0300, Or Gerlitz wrote:
>> On Wed, Jun 27, 2018 at 1:31 AM, Jakub Kicinski
>> <jakub.kicinski@netronome.com> wrote:
>> > On Tue, 26 Jun 2018 17:57:08 +0300, Or Gerlitz wrote:
>>
>> >> 2. re the egress side of things. Some NIC HWs can't just use LAG
>> >> as the egress port destination of an ACL (tc rule) and the HW rule
>> >> needs to be duplicated to both HW ports. So... in that case, you
>> >> see the HW driver doing the duplication (:() or we can somehow
>> >> make it happen from user-space?
>>
>> > It's the TC core that does the duplication.  Drivers which don't need
>> > the duplication (e.g. mlxsw) will not register a new callback for each
>> > port on which shared block is bound.  They will keep one list of rules,
>> > and a list of ports that those rules apply to.
>>
>> [snip]
>>
>> > Drivers which need duplication (multiplication) (all NICs?) have to
>> > register a new callback for each port bound to a shared block.  And TC
>> > will call those drivers as many times as they have callbacks registered
>> > == as many times as they have ports bound to the block.  Each time
>> > callback is invoked the driver will figure out the ingress port based
>> > on the cb_priv and use <ingress, cookie> as the key in its rule table
>> > (or have a separate rule table per ingress port).
>>
>> [snip snip]
>>
>> > I may be wrong, but I think you split the rules tables per port for mlx5
>>
>> correct,  currently I have a rule table per physical port.
>>
>> > So again you just register a callback every time shared block is bound,
>> > and then TC core will send add/remove rule commands down to the driver,
>> > relaying existing rules as well if needed.
>>
>> Let's see, the NIC uplink rep port devices were bounded (say) by ovs to
>> a shared-block because they are the lower devices (hate the slavish jargon)
>> of a bond device.
>>
>> Next, the TC stack will invoke the callback over these ports, when ingress
>> rule is added on the bond.
>>
>> But we are talking on ingress rule set on a non-uplink rep (VF rep) port,
>> where bonding is the egress of the rule. I guess the callback which you probably
>> refer to (you hinted there below) is the egdev one, correct? you are suggesting
>> that bonding will do egdev registration... I am a bit confused.
>
> Ah, you really meant egress.  We don't have this problem, but yes, I

So how does it work for you? The rule is:

<ingress=vfrep netdev, egress=bond netdev>

From here, what does your driver logic do in order
to allow offloading into the lagged uplinks? Can you
point me to the code, please?

BTW, the bond doesn't have the same switchdev id as
the vfrep in case you keep different switchdev ids
for the uplink reps under bonding. Do you unite them?

* Re: [PATCH 0/6] offload Linux LAG devices to the TC datapath
  2018-06-28  3:50           ` Or Gerlitz
@ 2018-06-28  4:02             ` Jakub Kicinski
  2018-06-28 22:19               ` Or Gerlitz
  0 siblings, 1 reply; 9+ messages in thread
From: Jakub Kicinski @ 2018-06-28  4:02 UTC
  To: Or Gerlitz
  Cc: Or Gerlitz, John Hurley, Jiri Pirko, Linux Netdev List,
	ASAP_Direct_Dev, Simon Horman, Andy Gospodarek

On Thu, 28 Jun 2018 06:50:32 +0300, Or Gerlitz wrote:
> On Thu, Jun 28, 2018 at 2:08 AM, Jakub Kicinski
> <jakub.kicinski@netronome.com> wrote:
> > On Wed, 27 Jun 2018 23:07:29 +0300, Or Gerlitz wrote:  
> >> On Wed, Jun 27, 2018 at 1:31 AM, Jakub Kicinski
> >> <jakub.kicinski@netronome.com> wrote:  
> >> > On Tue, 26 Jun 2018 17:57:08 +0300, Or Gerlitz wrote:  
> >>  
> >> >> 2. re the egress side of things. Some NIC HWs can't just use LAG
> >> >> as the egress port destination of an ACL (tc rule) and the HW rule
> >> >> needs to be duplicated to both HW ports. So... in that case, you
> >> >> see the HW driver doing the duplication (:() or we can somehow
> >> >> make it happen from user-space?  
> >>  
> >> > It's the TC core that does the duplication.  Drivers which don't need
> >> > the duplication (e.g. mlxsw) will not register a new callback for each
> >> > port on which shared block is bound.  They will keep one list of rules,
> >> > and a list of ports that those rules apply to.  
> >>
> >> [snip]
> >>  
> >> > Drivers which need duplication (multiplication) (all NICs?) have to
> >> > register a new callback for each port bound to a shared block.  And TC
> >> > will call those drivers as many times as they have callbacks registered
> >> > == as many times as they have ports bound to the block.  Each time
> >> > callback is invoked the driver will figure out the ingress port based
> >> > on the cb_priv and use <ingress, cookie> as the key in its rule table
> >> > (or have a separate rule table per ingress port).  
> >>
> >> [snip snip]
> >>  
> >> > I may be wrong, but I think you split the rules tables per port for mlx5  
> >>
> >> correct,  currently I have a rule table per physical port.
> >>  
> >> > So again you just register a callback every time shared block is bound,
> >> > and then TC core will send add/remove rule commands down to the driver,
> >> > relaying existing rules as well if needed.  
> >>
> >> Let's see, the NIC uplink rep port devices were bounded (say) by ovs to
> >> a shared-block because they are the lower devices (hate the slavish jargon)
> >> of a bond device.
> >>
> >> Next, the TC stack will invoke the callback over these ports, when ingress
> >> rule is added on the bond.
> >>
> >> But we are talking on ingress rule set on a non-uplink rep (VF rep) port,
> >> where bonding is the egress of the rule. I guess the callback which you probably
> >> refer to (you hinted there below) is the egdev one, correct? you are suggesting
> >> that bonding will do egdev registration... I am a bit confused.  
> >
> > Ah, you really meant egress.  We don't have this problem, but yes, I  
> 
> so how does it works for you -- the rule is:
> 
> <ingress=vfrep netdev, egress=bond netdev>
> 
> so from here, your driver logic does what inorder
> to allow offloading into the lagged uplinks? can you
> point the code please..

static int
nfp_fl_output(struct nfp_app *app, struct nfp_fl_output *output,
...
	if (tun_type) {
		/* Verify the egress netdev matches the tunnel type. */
		if (!nfp_fl_netdev_is_tunnel_type(out_dev, tun_type))
			return -EOPNOTSUPP;

		if (*tun_out_cnt)
			return -EOPNOTSUPP;
		(*tun_out_cnt)++;

		output->flags = cpu_to_be16(tmp_flags |
					    NFP_FL_OUT_FLAGS_USE_TUN);
		output->port = cpu_to_be32(NFP_FL_PORT_TYPE_TUN | tun_type);
	} else if (netif_is_lag_master(out_dev) &&
		   priv->flower_ext_feats & NFP_FL_FEATS_LAG) {
		int gid;

		output->flags = cpu_to_be16(tmp_flags);
		gid = nfp_flower_lag_get_output_id(app, out_dev);
		if (gid < 0)
			return gid;
		output->port = cpu_to_be32(NFP_FL_LAG_OUT | gid);
	} else {
		/* Set action output parameters. */
		output->flags = cpu_to_be16(tmp_flags);

		/* Only offload if egress ports are on the same device as the
		 * ingress port.
		 */
		if (!switchdev_port_same_parent_id(in_dev, out_dev))
			return -EOPNOTSUPP;
		if (!nfp_netdev_is_nfp_repr(out_dev))
			return -EOPNOTSUPP;

		output->port = cpu_to_be32(nfp_repr_get_port_id(out_dev));
		if (!output->port)
			return -EOPNOTSUPP;
	}

> the bond BTW doesn't have the same switchdev id as
> the vfrep in case you keep different switchdev id's
> for the uplink reps under bonding -- do you unite them?

* Re: [PATCH 0/6] offload Linux LAG devices to the TC datapath
  2018-06-28  4:02             ` Jakub Kicinski
@ 2018-06-28 22:19               ` Or Gerlitz
  0 siblings, 0 replies; 9+ messages in thread
From: Or Gerlitz @ 2018-06-28 22:19 UTC
  To: Jakub Kicinski
  Cc: Or Gerlitz, John Hurley, Jiri Pirko, Linux Netdev List,
	ASAP_Direct_Dev, Simon Horman, Andy Gospodarek

On Thu, Jun 28, 2018 at 7:02 AM, Jakub Kicinski
<jakub.kicinski@netronome.com> wrote:

[...]


>         } else if (netif_is_lag_master(out_dev) &&
>                    priv->flower_ext_feats & NFP_FL_FEATS_LAG) {
>                 int gid;
>
>                 output->flags = cpu_to_be16(tmp_flags);
>                 gid = nfp_flower_lag_get_output_id(app, out_dev);
>                 if (gid < 0)
>                         return gid;
>                 output->port = cpu_to_be32(NFP_FL_LAG_OUT | gid);

Got it, I see how you do that. Cool for you.

Thread overview: 9+ messages
     [not found] <1529588161-15934-1-git-send-email-john.hurley@netronome.com>
     [not found] ` <8f406548-8f90-b658-fcd1-342d702b3445@mellanox.com>
2018-06-26 14:57   ` Fwd: [PATCH 0/6] offload Linux LAG devices to the TC datapath Or Gerlitz
2018-06-26 18:16     ` John Hurley
2018-06-27 20:13       ` Or Gerlitz
2018-06-26 22:31     ` Jakub Kicinski
2018-06-27 20:07       ` Or Gerlitz
2018-06-27 23:08         ` Jakub Kicinski
2018-06-28  3:50           ` Or Gerlitz
2018-06-28  4:02             ` Jakub Kicinski
2018-06-28 22:19               ` Or Gerlitz