* RSTP with switchdev question
From: Murali Karicheri @ 2019-12-13 21:18 UTC (permalink / raw)
To: netdev, Kwok, WingMan
Hi Netdev experts,

We are working on a switchdev-based switch driver with L2 and STP
offload. We implemented the driver based on
Documentation/networking/switchdev.txt.

We are currently seeing an issue with switchover after a link failure
and are wondering how this is supposed to work. Any help on this will
be highly appreciated.

                     0   1
   |-------X---------- B ----------------------|
   A                                           C root
   |-------------------------------------------|

Figure 1)
At the start, nodes A, B and C are brought up and mstpd is started
on all nodes, and we get the topology above, with X marking the link
that is blocked to break the loop. We run a ping from C to A; it works
fine and takes the direct path from C to A. We then simulate a link
failure to trigger a topology change by disconnecting the link between
A and C. Switchover happens, the topology gets updated quickly,
and we get the one below in Figure 2).
Case 2)

                     0   1
   |------------------ B ----------------------|
   A                                           C root
   |-------------------X-----------------------|

Figure 2)
The ping stops and resumes after about 30 seconds, instead of right
away as expected with RSTP, where recovery should take milliseconds.
On debugging, we found the following happening:

1) In the steady state, the FDB dump from the firmware on B (which
   implements the switch) shows that both A and C appear on port
   1, as expected.
2) After switchover, the ping frame from C with A's MAC address gets
   sent to B. However, B's FDB entry still shows A at port 1. Since
   the frame arrived from C, which is also on port 1, B drops the
   frame.

So the question is: in this scenario, how is the data path
restored quickly? It looks like, for this to happen, the FDB at the
nodes needs to be flushed or re-learned so that it shows all nodes
at the correct port in the new topology. In this case, at node
B, A should appear on port 0 instead of port 1 so that L2
forwarding happens correctly. As expected, if another ping is
initiated from A to C, the other ping (C to A) starts working, as
the FDB at B is updated. But if the data path needs to be restored
quickly, these FDB updates should happen immediately. How does
this happen?
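
To make the failure in step 2) concrete, here is a minimal user-space
sketch of a learning FDB for node B. It is not driver code: struct fdb,
fdb_learn(), fdb_flush_port() and bridge_rx() are hypothetical names
invented for illustration, and the rule of dropping a frame whose
destination resolves to its ingress port is assumed from the behaviour
described above.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define FDB_SIZE  8
#define FWD_FLOOD (-1)	/* destination unknown: flood to other ports */
#define FWD_DROP  (-2)	/* destination sits on the ingress port */

/* Hypothetical toy FDB: one MAC address -> port mapping per slot. */
struct fdb_entry {
	uint8_t mac[6];
	int port;
	int used;
};

struct fdb {
	struct fdb_entry e[FDB_SIZE];
};

/* Learn (or move) a MAC address on the given port. */
void fdb_learn(struct fdb *f, const uint8_t *mac, int port)
{
	int free_slot = -1;

	assert(port >= 0);
	for (int i = 0; i < FDB_SIZE; i++) {
		if (f->e[i].used && memcmp(f->e[i].mac, mac, 6) == 0) {
			f->e[i].port = port;
			return;
		}
		if (!f->e[i].used && free_slot < 0)
			free_slot = i;
	}
	/* A full table silently skips learning in this sketch. */
	if (free_slot >= 0) {
		memcpy(f->e[free_slot].mac, mac, 6);
		f->e[free_slot].port = port;
		f->e[free_slot].used = 1;
	}
}

/* Drop every entry learned on the given port. */
void fdb_flush_port(struct fdb *f, int port)
{
	for (int i = 0; i < FDB_SIZE; i++)
		if (f->e[i].used && f->e[i].port == port)
			f->e[i].used = 0;
}

/* Learn the source, then return the egress port, FWD_FLOOD or FWD_DROP. */
int bridge_rx(struct fdb *f, int in_port, const uint8_t *src,
	      const uint8_t *dst)
{
	fdb_learn(f, src, in_port);
	for (int i = 0; i < FDB_SIZE; i++) {
		if (f->e[i].used && memcmp(f->e[i].mac, dst, 6) == 0)
			return f->e[i].port == in_port ? FWD_DROP : f->e[i].port;
	}
	return FWD_FLOOD;
}
```

With A and C both learned on port 1, a frame from C to A arriving on
port 1 yields FWD_DROP. After fdb_flush_port(&f, 1) the same frame is
flooded instead, and once a frame from A has been seen on port 0 the
C to A traffic is forwarded to port 0.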
Thanks
Murali
--
Murali Karicheri
Texas Instruments
* Re: RSTP with switchdev question
From: Murali Karicheri @ 2019-12-16 16:55 UTC (permalink / raw)
To: netdev, Kwok, WingMan, andrew, vivien.didelot, f.fainelli, jiri,
ivecera
+ switchdev/DSA experts
On 12/13/2019 04:18 PM, Murali Karicheri wrote:
> Hi Netdev experts,
>
> We are working on a switchdev based switch driver with L2 and stp
> offload. Implemented the driver based on
> Documentation/networking/switchdev.txt
> Currently seeing an issue with switch over of a link failure. So
> wondering how this is supposed to work. So any help on this will
> be highly appreciated.
>
>
>                      0   1
>    |-------X---------- B ----------------------|
>    A                                           C root
>    |-------------------------------------------|
>
> Figure 1)
>
> At the start, A, B and C nodes are brought up and mstpd is started
> on all nodes and we get a topology as above with X marking the link
> that breaks the loop. We run a Ping from C to A and it works fine
> and takes the direct path from C to A. We then simulate a link
> failure to trigger topology change by disconnecting of the link
> A to C. Switch over happens and the topology gets updated quickly
> and we get the one below in Figure 2).
>
> Case 2)
>                      0   1
>    |------------------ B ----------------------|
>    A                                           C root
>    |-------------------X-----------------------|
> Figure 2)
>
> The ping stops and resumes after about 30 seconds instead of right
> away as expected in rstp case which should be in milliseconds. On
> debug we found following happening.
>
> 1) In the steady state, the fdb dump at the firmware on B (This
> implements the switch) shows that both A and C appears on port
> 1 as expected.
> 2) After switch over, Ping frame from C with A's MAC address gets
> sent to B. However B's fdb entry is still showing it is at
> port 1. Since the frame arrived from C, it drops the frame.
>
> So the question is, in this scenario, how does the data path
> restored quickly? Looks like for this to happen FDB at the nodes
> needs to get flushed or re-learned so that it will show all nodes
> at the correct port in the new topology. So in this case at node
> B, A should appear on port 0 instead of port 1 so that L2
> forwarding happens correctly? As expected, if another ping is
> initiated from A to C, the other ping (C to A) starts working as
> the FDB at B is updated. But if data path needs to be restored
> quickly, these fdb update should happen immediately. How does
> this happen?
>
> Thanks
>
> Murali
>
--
Murali Karicheri
Texas Instruments
* Re: RSTP with switchdev question
From: Andrew Lunn @ 2019-12-17 11:21 UTC (permalink / raw)
To: Murali Karicheri
Cc: netdev, Kwok, WingMan, vivien.didelot, f.fainelli, jiri, ivecera
On Mon, Dec 16, 2019 at 11:55:05AM -0500, Murali Karicheri wrote:
> + switchdev/DSA experts
Hi Murali

I did not reply before because this is a pure switchdev issue. DSA
does things differently. The kernel FDB and the switch's FDB are not
kept in sync. With DSA, when a port changes state, we flush the switch
FDB. For STP, that seems to be sufficient. There have been reports that
for RSTP this might not be enough, but that conversation did not go
very far.

I've no idea how this is supposed to work with a pure switchdev
driver. Often, to answer a question like this, you need to take a step
backwards. How is this supposed to work for a machine with two e1000e
cards and a plain software bridge? Whatever APIs user space RSTP is
using in a pure software case should be used in a switchdev setup as
well, but extra plumbing in the kernel might be required, and it
sounds like it may be missing...

Andrew
* Re: RSTP with switchdev question
From: Murali Karicheri @ 2019-12-19 17:30 UTC (permalink / raw)
To: Andrew Lunn
Cc: netdev, Kwok, WingMan, vivien.didelot, f.fainelli, jiri, ivecera
Hi Andrew,
Thanks for responding to this.
On 12/17/2019 06:21 AM, Andrew Lunn wrote:
> On Mon, Dec 16, 2019 at 11:55:05AM -0500, Murali Karicheri wrote:
>> + switchdev/DSA experts
>
> Hi Murali
>
> I did not reply before because this is a pure switchdev issue. DSA
> does things differently. The kernel FDB and the switch's FDB are not
> kept in sync. With DSA, when a port changes state, we flush the switch
> FDB. For STP, that seems to be sufficient. There have been reports for
> RSTP this might not be enough, but that conversation did not go very
> far.

I am new to RSTP and am trying to understand what needs to be done
at the driver level when switchdev is used.

It looks like topology changes are handled correctly when only the
Linux bridge is used and L2 forwarding is not offloaded to a switch
(plain Ethernet interfaces underneath).

This is my understanding: the Linux bridge code uses BR_USER_STP for
user-space STP handling. The daemon manages the STP state machine and
updates the STP state in the bridge, which is then sent to the device
driver through the switchdev SET attribute command, in the same way as
with kernel STP. From the RSTP point of view, AFAIK, the quick data
path switchover happens by purging and re-learning when the topology
changes (TCN BPDUs). Currently we are using the following workaround,
which seems to solve the issue based on the limited testing we have
done. The idea is for the switchdev-based switch driver to monitor the
STP state per port and, if there is any change in state, purge the
learned MAC addresses in the switch and send a notification to the
bridge using

    call_switchdev_notifiers(SWITCHDEV_FDB_DEL_TO_BRIDGE, dev, &info.info);

The following transitions are to be monitored, with a purge on any port:

  Blocking   -> Learning   (assuming blocking to forwarding doesn't
                            happen directly)
  Blocking   -> Forwarding (not sure if this is possible; need to
                            check the spec)
  Learning   -> Blocking
  Forwarding -> Blocking

I hope the above are correct. Do you know if DSA checks the above
transitions? Also, when the learned addresses are purged in the switch
hardware, we send an event notification to the Linux bridge so that it
can sync up its database.

Since this is required for all switchdev-supported drivers, it makes
sense to eventually move this into switchdev, to trigger the purge at
the switch as well as the notification to the bridge to purge its
entries. What do you think?
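
As a rough illustration of the transition check described above, the
four listed transitions can be written as a single predicate. This is
a stand-alone sketch, not the actual driver code: enum port_stp_state
and purge_needed() are hypothetical names, and a real driver would
compare the BR_STATE_* values it receives from the bridge.

```c
#include <stdbool.h>

/* Hypothetical local copy of the per-port STP states. */
enum port_stp_state {
	PORT_DISABLED,
	PORT_LISTENING,
	PORT_LEARNING,
	PORT_FORWARDING,
	PORT_BLOCKING,
};

/* True for Learning and Forwarding, the states in which MACs are learned. */
static bool port_is_active(enum port_stp_state s)
{
	return s == PORT_LEARNING || s == PORT_FORWARDING;
}

/*
 * Return true for the four transitions in the list:
 * Blocking -> Learning/Forwarding and Learning/Forwarding -> Blocking.
 */
bool purge_needed(enum port_stp_state old, enum port_stp_state next)
{
	return (old == PORT_BLOCKING && port_is_active(next)) ||
	       (port_is_active(old) && next == PORT_BLOCKING);
}
```

The driver would call purge_needed() with the previous and the new
state each time the port state attribute is set, and purge the learned
addresses on that port only when it returns true.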
Regards,
Murali
>
> I've no idea how this is supposed to work with a pure switchdev
> driver. Often, to answer a question like this, you need to take a step
> backwards. How is this supposed to work for a machine with two e1000e
> cards and a plain software bridge? Whatever APIs user space RSTP is
> using in a pure software case should be used in a switchdev setup as
> well, but extra plumbing in the kernel might be required, and it
> sounds like it may be missing...
>
> Andrew
>
--
Murali Karicheri
Texas Instruments
* Re: RSTP with switchdev question
From: Andrew Lunn @ 2019-12-19 18:06 UTC (permalink / raw)
To: Murali Karicheri
Cc: netdev, Kwok, WingMan, vivien.didelot, f.fainelli, jiri, ivecera
On Thu, Dec 19, 2019 at 12:30:06PM -0500, Murali Karicheri wrote:
> Hi Andrew,
>
> Thanks for responding to this.
>
> On 12/17/2019 06:21 AM, Andrew Lunn wrote:
> > On Mon, Dec 16, 2019 at 11:55:05AM -0500, Murali Karicheri wrote:
> > > + switchdev/DSA experts
> >
> > Hi Murali
> >
> > I did not reply before because this is a pure switchdev issue. DSA
> > does things differently. The kernel FDB and the switch's FDB are not
> > kept in sync. With DSA, when a port changes state, we flush the switch
> > FDB. For STP, that seems to be sufficient. There have been reports for
> > RSTP this might not be enough, but that conversation did not go very
> > far.
> I am new to RSTP and trying to understand what is required to be done
> at the driver level when switchdev is used.
>
> Looks like topology changes are handled correctly when only Linux bridge
> is used and L2 forwarding is not offloaded to switch (Plain Ethernet
> interface underneath).
>
> This is my understanding. Linux bridge code uses BR_USER_STP to handle
> user space handling. So daemon manages the STP state machine and update
> the STP state to bridge which then get sent to device driver through
> switchdev SET attribute command in the same way as kernel STP. From the
> RSTP point of view, AFAIK, the quick data path switch over happens by
> purging and re-learning when topology changes (TCN BPDUs). Currently
> we are doing the following workaround which seems to solve the issue
> based on the limited testing we had. Idea is for the switchdev based
> switch driver to monitor the STP state per port and if there is any
> change in state, do a purge of learned MAC address in switch and send a
> notification to bridge using
> call_switchdev_notifiers(SWITCHDEV_FDB_DEL_TO_BRIDGE, dev, &info.info);
Are you saying the hardware should send a notification to the software
bridge? That seems the wrong way around. It is the software bridge
which is in control of everything. It should be the software bridge
which tells the hardware to purge its cache.
> Following transition to be monitored and purged on any port:-
> Blocking -> Learning (assuming blocking to forward doesn't happen
> directly)
> Blocking -> Forward (Not sure if this is possible. Need to check the
> spec.)
> Learning -> Blocked
> Forwarding -> Blocked
What we have for dsa is:

int dsa_port_set_state(struct dsa_port *dp, u8 state,
		       struct switchdev_trans *trans)
{
	struct dsa_switch *ds = dp->ds;
	int port = dp->index;

	if (switchdev_trans_ph_prepare(trans))
		return ds->ops->port_stp_state_set ? 0 : -EOPNOTSUPP;

	if (ds->ops->port_stp_state_set)
		ds->ops->port_stp_state_set(ds, port, state);

	if (ds->ops->port_fast_age) {
		/* Fast age FDB entries or flush appropriate forwarding database
		 * for the given port, if we are moving it from Learning or
		 * Forwarding state, to Disabled or Blocking or Listening state.
		 */

		if ((dp->stp_state == BR_STATE_LEARNING ||
		     dp->stp_state == BR_STATE_FORWARDING) &&
		    (state == BR_STATE_DISABLED ||
		     state == BR_STATE_BLOCKING ||
		     state == BR_STATE_LISTENING))
			ds->ops->port_fast_age(ds, port);
	}

	dp->stp_state = state;

	return 0;
}

This gets called from the software bridge. The first call into the DSA
driver changes the port state. If ds->ops->port_fast_age is
implemented in the DSA driver, it is used. For STP, ideally you age
out entries quicker. If the hardware cannot do that, the driver is
expected to just flush them.
I don't know what RSTP requires. Is fast ageing also used in RSTP, or
is a complete flush expected?
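
To compare this with the transitions quoted above, the condition in
dsa_port_set_state() can be pulled out into a stand-alone predicate.
This is only a user-space sketch for reasoning about the state pairs:
enum stp_state and dsa_would_fast_age() are hypothetical names, not
kernel symbols.

```c
#include <stdbool.h>

/* Hypothetical stand-ins for the BR_STATE_* values used above. */
enum stp_state {
	STP_DISABLED,
	STP_LISTENING,
	STP_LEARNING,
	STP_FORWARDING,
	STP_BLOCKING,
};

/*
 * Same condition as in dsa_port_set_state(): fast age only when the
 * port leaves Learning or Forwarding for Disabled, Blocking or
 * Listening.
 */
bool dsa_would_fast_age(enum stp_state old, enum stp_state next)
{
	return (old == STP_LEARNING || old == STP_FORWARDING) &&
	       (next == STP_DISABLED || next == STP_BLOCKING ||
		next == STP_LISTENING);
}
```

Note that this fires for Learning -> Blocking and Forwarding ->
Blocking, but not for Blocking -> Learning or Blocking -> Forwarding,
so it covers only the second half of the list quoted above.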
> Hope the above are correct. Do you know if DSA is checking the above
> transitions? Also when the learned address are purged in the switch
> hardware, send event notification to Linux bridge to sync up with its
> database.
Nope. We expect the software bridge to perform its own flush. With DSA,
we have two databases, one in the hardware, and one in the software
bridge. No attempt is made to keep them in sync. Each performs its own
learning and ageing.
Andrew