* net: switchdev: Port-to-Bridge Synchronization
From: Tobias Waldekranz @ 2024-01-25 23:38 UTC
To: jiri, ivecera, roopa, razor, olteanv; +Cc: netdev, bridge
Hello netdev,
Let me start by describing an issue that I have found in the
bridge/switchdev replay logic, which will lead to a larger question
about how switchdev ports are synchronized with a bridge. Anyway, on to
the bug!
The steps to reproduce the issue are very simple on my board, but I've
never seen it before, so it could be that you need a pretty fast (by
embedded standards) SMP CPU to trigger the race.
Hardware:
- Marvell CN9130 SoC
- Marvell 88E6393X Switch
Steps:
1. Create a bridge with multicast snooping enabled
2. Attach a switch port to the bridge
3. Remove the bridge
If (2) is done immediately after (1), then my switch's hardware MDB will
be left with two orphan entries after removing the bridge (3).
On a system running net-next with the switchdev tracepoint series[1]
applied, this is what it looks like (with comm, pid, timestamp
etc. trimmed from the trace output to make the lines a bit narrower,
with an event number added instead):
root@infix-06-0b-00:~$ echo 1 >/sys/kernel/debug/tracing/events/switchdev/enable
root@infix-06-0b-00:~$ cat /sys/kernel/debug/tracing/trace_pipe | grep HOST_MDB &
[1] 2602
root@infix-06-0b-00:~$ ip link add dev br0 up type bridge mcast_snooping 1 && ip link set dev x3 up master br0
01: switchdev_defer: dev x3 PORT_OBJ_ADD HOST_MDB(flags 0x4 orig br0) vid 0 addr 33:33:ff:87:e4:3f
02: switchdev_defer: dev x3 PORT_OBJ_ADD HOST_MDB(flags 0x4 orig br0) vid 0 addr 33:33:00:00:00:6a
03: switchdev_call_replay: dev x3 PORT_OBJ_ADD HOST_MDB(flags 0x0 orig br0) vid 0 addr 33:33:00:00:00:6a -> 0
04: switchdev_call_replay: dev x3 PORT_OBJ_ADD HOST_MDB(flags 0x0 orig br0) vid 0 addr 33:33:ff:87:e4:3f -> 0
05: switchdev_call_blocking: dev x3 PORT_OBJ_ADD HOST_MDB(flags 0x4 orig br0) vid 0 addr 33:33:ff:87:e4:3f -> 0
06: switchdev_call_blocking: dev x3 PORT_OBJ_ADD HOST_MDB(flags 0x4 orig br0) vid 0 addr 33:33:00:00:00:6a -> 0
07: switchdev_defer: dev x3 PORT_OBJ_ADD HOST_MDB(flags 0x4 orig br0) vid 0 addr 33:33:00:00:00:fb
08: switchdev_call_blocking: dev x3 PORT_OBJ_ADD HOST_MDB(flags 0x4 orig br0) vid 0 addr 33:33:00:00:00:fb -> 0
root@infix-06-0b-00:~$ ip link del dev br0
09: switchdev_call_replay: dev x3 PORT_OBJ_DEL HOST_MDB(flags 0x0 orig br0) vid 0 addr 33:33:00:00:00:fb -> 0
10: switchdev_call_replay: dev x3 PORT_OBJ_DEL HOST_MDB(flags 0x0 orig br0) vid 0 addr 33:33:00:00:00:6a -> 0
11: switchdev_call_replay: dev x3 PORT_OBJ_DEL HOST_MDB(flags 0x0 orig br0) vid 0 addr 33:33:ff:87:e4:3f -> 0
root@infix-06-0b-00:~$ mvls atu
ADDRESS FID STATE Q F 0 1 2 3 4 5 6 7 8 9 a
DEV:0 Marvell 88E6393X
33:33:00:00:00:6a 1 static - - 0 . . . . . . . . . .
33:33:ff:87:e4:3f 1 static - - 0 . . . . . . . . . .
ff:ff:ff:ff:ff:ff 1 static - - 0 1 2 3 4 5 6 7 8 9 a
root@infix-06-0b-00:~$
Events 01 and 02 are generated when the bridge parses an MLD report,
generated by the kernel, which registers membership in the
solicited-node group of the interface's link-local address and in the
All-Snoopers group. As we are in softirq context, the events are placed
in the deferred queue.
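
(For anyone not familiar with the deferral mechanism: the "flags 0x4"
in the trace is SWITCHDEV_F_DEFER. From memory, the dispatch logic in
net/switchdev/switchdev.c is roughly the following; treat it as a
sketch rather than a verbatim quote:

    int switchdev_port_obj_add(struct net_device *dev,
                               const struct switchdev_obj *obj,
                               struct netlink_ext_ack *extack)
    {
            if (obj->flags & SWITCHDEV_F_DEFER)
                    /* e.g. softirq context: queue the object now,
                     * dispatch it later from process context
                     */
                    return switchdev_port_obj_add_defer(dev, obj);

            ASSERT_RTNL();
            return switchdev_port_obj_add_now(dev, obj, extack);
    }

The replay path does not go through this deferral, which is consistent
with events 03 and 04 showing flags 0x0.)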
More or less concurrently, the port is joining the bridge, and the DSA
layer sends a SWITCHDEV_BRPORT_OFFLOADED notification to request a
replay of all relevant events that happened before it joined. Since we
now find the
two host entries in the bridge's MDB, we replay the events (03 and 04).
Once the replay is done, at some later point in time, we start to
process the deferred events and send 05 and 06 (originally 01 and 02) to
the driver again.
At this point, the DSA layer will have a recorded refcount of 2 for both
of the groups in question, whereas the bridge only holds a single
reference for each membership.
Therefore, when we tear down the bridge, the corresponding UNOFFLOADED
event will trigger another replay cycle, which will send a single delete
event for each group, leaving the DSA layer's refcount at 1. This is
confirmed by the mvls output, showing the two groups are still loaded in
the ATU.
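
Laid out for one of the groups (33:33:00:00:00:6a), the counts evolve
like this:

    event                            DSA refcount    bridge refcount
    02: defer (queued, not yet run)       0                1
    03: replay                            1                1
    06: deferred dispatch                 2                1  <- double-counted
    10: replay of the delete              1                0  <- orphan in ATU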
The end (of the bug report)
How can we make sure that this does not happen?
Broadly, it seems to me like we need the bridge to signal the start and
end of a replay sequence (via the deferred queue), which the drivers can
then use to determine when they should start/stop accepting deferred
messages. I think this also means that we have to guarantee that no
new events of a particular class can be generated while we are scanning
the database of those objects and generating replay events.
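
To make that concrete, the shape of it could be something like the
sketch below. To be clear, everything in it is made up; neither the
marker events nor the driver-side flag exist today:

    /* bridge side, bracketing the scan of the object database: */
    switchdev_replay_begin(brport_dev);  /* queued like any deferred event */
    /* ... walk br->mdb_list etc., generating replay events ... */
    switchdev_replay_end(brport_dev);

    /* driver side, when processing a deferred object: */
    if (!dp->seen_replay_end)
            /* queued before the replay snapshot was taken, so its
             * contents were already delivered by the replay itself
             */
            return 0;

Since the deferred queue is ordered, anything dequeued before the end
marker predates the snapshot and can be dropped; anything after it is
genuinely new.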
Some complicating factors:
- There is no single stream of events to synchronize with:
  - Some calls are deferred for later dispatch
  - Some are run synchronously in a blocking context
  - Some are run synchronously in an atomic context
- The rhashtable which backs the FDB is designed to be lock-free, making
it hard to ensure atomicity between a replay cycle and new addresses
being learned in the hotpath
- If we take this approach, how can we make sure that most of the
driver-side implementation can be shared in switchdev.c, and doesn't
have to be reinvented by each driver?
I really hope that someone can tell me why this problem is much easier
than I am making it out to be!
[1]: https://lore.kernel.org/netdev/20240123153707.550795-1-tobias@waldekranz.com/
* Re: net: switchdev: Port-to-Bridge Synchronization
From: Vladimir Oltean @ 2024-01-29 12:17 UTC
To: Tobias Waldekranz; +Cc: jiri, ivecera, roopa, razor, netdev, bridge
Hi Tobias,
On Fri, Jan 26, 2024 at 12:38:05AM +0100, Tobias Waldekranz wrote:
> [...]
> Broadly, it seems to me like we need the bridge to signal the start and
> end of a replay sequence (via the deferred queue), which the drivers can
> then use to determine when they should start/stop accepting deferred
> messages. I think this also means that we have to guarantee that no
> new events of a particular class can be generated while we are scanning
> the database of those objects and generating replay events.
>
> Some complicating factors:
>
> - There is no single stream of events to synchronize with:
>   - Some calls are deferred for later dispatch
>   - Some are run synchronously in a blocking context
>   - Some are run synchronously in an atomic context
>
> - The rhashtable which backs the FDB is designed to be lock-free, making
> it hard to ensure atomicity between a replay cycle and new addresses
> being learned in the hotpath
>
> - If we take this approach, how can we make sure that most of the
> driver-side implementation can be shared in switchdev.c, and doesn't
> have to be reinvented by each driver?
>
> I really hope that someone can tell me why this problem is much easier
> than I am making it out to be!
Please expect a few weeks from my side until I can take a close look
at this. Otherwise, here are some superficial (and maybe impractical) ideas.
I agree that the replay procedures race with the data path. Let's try to
concentrate on one at a time, and then see how the others are different.
We seem to need to distinguish the events that happened before the
SWITCHDEV_BRPORT_OFFLOADED event from those that happened afterwards.
The SWITCHDEV_BRPORT_OFFLOADED event itself runs under rtnl_lock()
protection, and so does switchdev_deferred_process(). So, purely from a
switchdev driver API perspective, things are relatively coarse grained,
and the order of operations within the NETDEV_CHANGEUPPER handler is not
very important, because at the end of the day, only this matters: before
NETDEV_CHANGEUPPER, the port is not ready to process deferred switchdev
events, and after NETDEV_CHANGEUPPER, it is.
The problem is figuring out, during this relatively large rtnl_lock()
section, where to delineate events that are 'before' vs 'after'. We must
only replay what's 'before'.
Multicast database entries seem to have br->multicast_lock for serializing
writers. Currently we are not acquiring that during replay.
I'm thinking we could do something like this in br_switchdev_mdb_replay().
We could block concurrent callers to br_switchdev_mdb_notify() by
acquiring br->multicast_lock, so that br->mdb_list stays constant for a
while.
Then, after constructing the local mdb_list, the problem is that it
still contains some SWITCHDEV_F_DEFER elements which are pending a call
to switchdev_deferred_process(). But that cannot run right now (it also
needs rtnl_lock, which we hold), so we can
iterate through switchdev's "deferred" list and remove them from the
replay list, if we figure out some sort of reliable switchdev object
comparison function. Then we can safely release br->multicast_lock.
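
Roughly like this, as a sketch; switchdev_deferred_contains() is a
made-up stand-in for that object comparison walk:

    struct switchdev_obj_port_mdb *mdb, *tmp;

    spin_lock_bh(&br->multicast_lock); /* blocks br_switchdev_mdb_notify() */

    /* ... snapshot br->mdb_list into the local mdb_list, as today ... */

    /* Anything still sitting in switchdev's "deferred" queue will be
     * delivered by switchdev_deferred_process() later, so don't also
     * replay it.
     */
    list_for_each_entry_safe(mdb, tmp, &mdb_list, obj.list) {
            if (switchdev_deferred_contains(dev, &mdb->obj)) {
                    list_del(&mdb->obj.list);
                    kfree(mdb);
            }
    }

    spin_unlock_bh(&br->multicast_lock);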
The big problem I see is that FDB notifications don't work that way.
They are also deferred, but all the deferred processing is on the
switchdev driver side, not on the bridge side. One of the reasons is
that the deferred side should not need rtnl_lock() at all. I don't see
how this model would scale to FDB processing.
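
For reference, the driver-side deferral looks roughly like this in DSA
(condensed and from memory, so approximate):

    /* Atomic notifier context; rtnl_lock is neither held nor taken. */
    static int dsa_slave_switchdev_event(struct notifier_block *unused,
                                         unsigned long event, void *ptr)
    {
            struct switchdev_notifier_fdb_info *fdb_info = ptr;
            struct dsa_switchdev_event_work *switchdev_work;

            /* SWITCHDEV_FDB_{ADD,DEL}_TO_DEVICE: stash addr/vid in a
             * work item and return; hardware is programmed later from
             * the worker, still without rtnl_lock.
             */
            switchdev_work = kzalloc(sizeof(*switchdev_work), GFP_ATOMIC);
            if (!switchdev_work)
                    return NOTIFY_BAD;

            INIT_WORK(&switchdev_work->work, dsa_slave_switchdev_event_work);
            ether_addr_copy(switchdev_work->addr, fdb_info->addr);
            switchdev_work->vid = fdb_info->vid;

            dsa_schedule_work(&switchdev_work->work);
            return NOTIFY_OK;
    }

So the pending state that a replay would have to reconcile against
lives in each driver's private workqueue, not in a central list like
switchdev's "deferred".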
* Re: net: switchdev: Port-to-Bridge Synchronization
From: Tobias Waldekranz @ 2024-01-30 21:23 UTC
To: Vladimir Oltean; +Cc: jiri, ivecera, roopa, razor, netdev, bridge
On Mon, Jan 29, 2024 at 14:17, Vladimir Oltean <olteanv@gmail.com> wrote:
> Hi Tobias,
>
> On Fri, Jan 26, 2024 at 12:38:05AM +0100, Tobias Waldekranz wrote:
>> [...]
>
> Please expect a few weeks from my side until I can take a close look
> at this. Otherwise, here are some superficial (and maybe impractical) ideas.
Thanks for taking the time!
> I agree that the replay procedures race with the data path. Let's try to
> concentrate on one at a time, and then see how the others are different.
Right, I think this is the key. I was focusing too much on finding one
solution to fix all replay races. I will start with the MDB.
> We seem to need to distinguish the events that happened before the
> SWITCHDEV_BRPORT_OFFLOADED event from those that happened afterwards.
> The SWITCHDEV_BRPORT_OFFLOADED event itself runs under rtnl_lock()
> protection, and so does switchdev_deferred_process(). So, purely from a
> switchdev driver API perspective, things are relatively coarse grained,
> and the order of operations within the NETDEV_CHANGEUPPER handler is not
> very important, because at the end of the day, only this matters: before
> NETDEV_CHANGEUPPER, the port is not ready to process deferred switchdev
> events, and after NETDEV_CHANGEUPPER, it is.
>
> The problem is figuring out, during this relatively large rtnl_lock()
> section, where to delineate events that are 'before' vs 'after'. We must
> only replay what's 'before'.
>
> Multicast database entries seem to have br->multicast_lock for serializing
> writers. Currently we are not acquiring that during replay.
>
> I'm thinking we could do something like this in br_switchdev_mdb_replay().
> We could block concurrent callers to br_switchdev_mdb_notify() by
> acquiring br->multicast_lock, so that br->mdb_list stays constant for a
> while.
>
> Then, after constructing the local mdb_list, the problem is that it
> still contains some SWITCHDEV_F_DEFER elements which are pending a call
> to switchdev_deferred_process(). But that cannot run right now (it also
> needs rtnl_lock, which we hold), so we can
> iterate through switchdev's "deferred" list and remove them from the
> replay list, if we figure out some sort of reliable switchdev object
> comparison function. Then we can safely release br->multicast_lock.
That would _almost_ work, I think. But instead of purging the deferred
items, I think we have to skip queuing the replay events in these
cases. Otherwise we limit the scope of the notification to the
requesting driver, when it ought to reach all registered callbacks on
the notifier chain.
This matters if a driver wants to handle foreign group memberships the
same way we do with FDB entries, which I want to add to DSA once this
race has been taken care of.
> The big problem I see is that FDB notifications don't work that way.
> They are also deferred, but all the deferred processing is on the
> switchdev driver side, not on the bridge side. One of the reasons is
> that the deferred side should not need rtnl_lock() at all. I don't see
> how this model would scale to FDB processing.
Yeah, that's where I threw in the towel as well. That issue will just
have to go unsolved for now.
* Re: net: switchdev: Port-to-Bridge Synchronization
From: Vladimir Oltean @ 2024-01-31 13:10 UTC
To: Tobias Waldekranz; +Cc: jiri, ivecera, roopa, razor, netdev, bridge
On Tue, Jan 30, 2024 at 10:23:34PM +0100, Tobias Waldekranz wrote:
> On Mon, Jan 29, 2024 at 14:17, Vladimir Oltean <olteanv@gmail.com> wrote:
> > I'm thinking we could do something like this in br_switchdev_mdb_replay().
> > We could block concurrent callers to br_switchdev_mdb_notify() by
> > acquiring br->multicast_lock, so that br->mdb_list stays constant for a
> > while.
> >
> > Then, after constructing the local mdb_list, the problem is that it
> > still contains some SWITCHDEV_F_DEFER elements which are pending a call
> > to switchdev_deferred_process(). But that cannot run right now (it also
> > needs rtnl_lock, which we hold), so we can
> > iterate through switchdev's "deferred" list and remove them from the
> > replay list, if we figure out some sort of reliable switchdev object
> > comparison function. Then we can safely release br->multicast_lock.
>
> That would _almost_ work, I think. But instead of purging the deferred
> items, I think we have to skip queuing the replay events in these
> cases. Otherwise we limit the scope of the notification to the
> requesting driver, when it ought to reach all registered callbacks on
> the notifier chain.
>
> This matters if a driver wants to handle foreign group memberships the
> same way we do with FDB entries, which I want to add to DSA once this
> race has been taken care of.
Yes, not purging the deferred items (for the reasons you mention), but
as I said, "remove them from the replay list" (the local mdb_list).
Similar to what you've done in the series you just posted
(https://lore.kernel.org/netdev/20240131123544.462597-3-tobias@waldekranz.com/),
only that instead of removing entries from the list, you simply never
call list_add_tail() for them.
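
I.e., roughly, with switchdev_obj_is_deferred() standing in for
whatever comparison helper the series ends up with:

    /* In br_switchdev_mdb_replay(), while snapshotting br->mdb_list: */
    hlist_for_each_entry(mp, &br->mdb_list, mdb_node) {
            /* ... fill in a struct switchdev_obj_port_mdb "mdb" for mp ... */

            if (switchdev_obj_is_deferred(dev, SWITCHDEV_PORT_OBJ_ADD,
                                          &mdb.obj))
                    continue; /* the deferred queue will deliver it */

            /* only now queue it, i.e. do the list_add_tail() */
    }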
> > The big problem I see is that FDB notifications don't work that way.
> > They are also deferred, but all the deferred processing is on the
> > switchdev driver side, not on the bridge side. One of the reasons is
> > that the deferred side should not need rtnl_lock() at all. I don't see
> > how this model would scale to FDB processing.
>
> Yeah, that's where I threw in the towel as well. That issue will just
> have to go unsolved for now.
I mean I have some ideas for FDB replays as well, but they all revolve
around making the interface more similar to how MDBs are handled, while
also maintaining the rtnl-lockless behavior. It's a lot more rework,
potentially involves maintaining 2 parallel mechanisms for switchdev FDB
notifications, and very likely something to handle in net-next.