netdev.vger.kernel.org archive mirror
* [PATCH v4 net 0/2] net: bridge: switchdev: Ensure MDB events are delivered exactly once
@ 2024-02-12 19:18 Tobias Waldekranz
  2024-02-12 19:18 ` [PATCH v4 net 1/2] net: bridge: switchdev: Skip MDB replays of deferred events on offload Tobias Waldekranz
  2024-02-12 19:18 ` [PATCH v4 net 2/2] net: bridge: switchdev: Ensure deferred event delivery on unoffload Tobias Waldekranz
  0 siblings, 2 replies; 6+ messages in thread
From: Tobias Waldekranz @ 2024-02-12 19:18 UTC (permalink / raw)
  To: davem, kuba; +Cc: olteanv, atenart, roopa, razor, bridge, netdev, jiri, ivecera

When a device is attached to a bridge, drivers will request a replay
of objects that were created before the device joined the bridge and
that are still of interest to the joining port. Typical examples
include FDB entries and MDB memberships on other ports ("foreign
interfaces") or on the bridge itself.

Conversely, when a device is detached, the bridge will synthesize
deletion events for all those objects that are still live but no
longer applicable to the device in question.

This series eliminates two races related to the syncing and unsyncing
of a bridge's MDB with a joining or leaving device, which would cause
notifications of such objects to be either delivered twice (1/2) or
not at all (2/2).

A similar race to the one solved by 1/2 still remains for the
FDB. This is much harder to solve, due to the lockless operation of
the FDB's rhashtable, and is therefore knowingly left out of this
series.

v1 -> v2:
- Squash the previously separate addition of
  switchdev_port_obj_act_is_deferred into first consumer.
- Use ether_addr_equal to compare MAC addresses.
- Document switchdev_port_obj_act_is_deferred (renamed from
  switchdev_port_obj_is_deferred in v1, to indicate that we also match
  on the action).
- Delay allocations of MDB objects until we know they're needed.
- Use non-RCU version of the hash list iterator, now that the MDB is
  not scanned while holding the RCU read lock.
- Add Fixes tag to commit message.

v2 -> v3:
- Fix unlocking in error paths.
- Access the RCU-protected port list via mlock_dereference(), since
  the MDB is guaranteed to remain constant for the duration of the
  scan.

v3 -> v4:
- Limit the search for existing deferred events in 1/2 to additions
  only, since the problem does not exist in the deletion case.
- Add 2/2, to plug a related race when unoffloading an indirectly
  associated device.

Tobias Waldekranz (2):
  net: bridge: switchdev: Skip MDB replays of deferred events on offload
  net: bridge: switchdev: Ensure deferred event delivery on unoffload

 include/net/switchdev.h   |  3 ++
 net/bridge/br_switchdev.c | 84 ++++++++++++++++++++++++++-------------
 net/switchdev/switchdev.c | 73 ++++++++++++++++++++++++++++++++++
 3 files changed, 132 insertions(+), 28 deletions(-)

-- 
2.34.1


^ permalink raw reply	[flat|nested] 6+ messages in thread

* [PATCH v4 net 1/2] net: bridge: switchdev: Skip MDB replays of deferred events on offload
  2024-02-12 19:18 [PATCH v4 net 0/2] net: bridge: switchdev: Ensure MDB events are delivered exactly once Tobias Waldekranz
@ 2024-02-12 19:18 ` Tobias Waldekranz
  2024-02-14 16:45   ` Vladimir Oltean
  2024-02-12 19:18 ` [PATCH v4 net 2/2] net: bridge: switchdev: Ensure deferred event delivery on unoffload Tobias Waldekranz
  1 sibling, 1 reply; 6+ messages in thread
From: Tobias Waldekranz @ 2024-02-12 19:18 UTC (permalink / raw)
  To: davem, kuba; +Cc: olteanv, atenart, roopa, razor, bridge, netdev, jiri, ivecera

Before this change, generation of the list of MDB events to replay
would race against the creation of new group memberships, either from
the IGMP/MLD snooping logic or from user configuration.

While new memberships are immediately visible to walkers of
br->mdb_list, the notification of their existence to switchdev event
subscribers is deferred until a later point in time. So if a replay
list was generated during a time that overlapped with such a window,
it would also contain a replay of the not-yet-delivered event.

The driver would thus receive two copies of what the bridge internally
considered to be one single event. On destruction of the bridge, only
a single membership deletion event was therefore sent. As a
consequence, drivers which reference count memberships (at least DSA)
would be left with orphan groups in their hardware database when the
bridge was destroyed.

This is only an issue when replaying additions. While deletion events
may still be pending on the deferred queue, they will already have
been removed from br->mdb_list, so no duplicates can be generated in
that scenario.

To a user, this meant that old group memberships from a bridge to
which a port was previously attached could be reanimated (in hardware)
when the port joined a new bridge, without the new bridge's knowledge.

For example, on an mv88e6xxx system, create a snooping bridge and
immediately add a port to it:

    root@infix-06-0b-00:~$ ip link add dev br0 up type bridge mcast_snooping 1 && \
    > ip link set dev x3 up master br0

And then destroy the bridge:

    root@infix-06-0b-00:~$ ip link del dev br0
    root@infix-06-0b-00:~$ mvls atu
    ADDRESS             FID  STATE      Q  F  0  1  2  3  4  5  6  7  8  9  a
    DEV:0 Marvell 88E6393X
    33:33:00:00:00:6a     1  static     -  -  0  .  .  .  .  .  .  .  .  .  .
    33:33:ff:87:e4:3f     1  static     -  -  0  .  .  .  .  .  .  .  .  .  .
    ff:ff:ff:ff:ff:ff     1  static     -  -  0  1  2  3  4  5  6  7  8  9  a
    root@infix-06-0b-00:~$

The two IPv6 groups remain in the hardware database because the
port (x3) is notified of the host's membership twice: once via the
original event and once via a replay. Since only a single delete
notification is sent, the count remains at 1 when the bridge is
destroyed.

Then add the same port (or another port belonging to the same hardware
domain) to a new bridge, this time with snooping disabled:

    root@infix-06-0b-00:~$ ip link add dev br1 up type bridge mcast_snooping 0 && \
    > ip link set dev x3 up master br1

All multicast, including the two IPv6 groups from br0, should now be
flooded, according to the policy of br1. But instead the old
memberships are still active in the hardware database, causing the
switch to only forward traffic to those groups towards the CPU (port
0).

Eliminate the race in two steps:

1. Grab the write-side lock of the MDB while generating the replay
   list.

This prevents new memberships from showing up while we are generating
the replay list. But it leaves the scenario in which a deferred event
was already generated, but not delivered, before we grabbed the
lock. Therefore:

2. Make sure that no deferred version of a replay event is already
   enqueued to the switchdev deferred queue, before adding it to the
   replay list, when replaying additions.

Fixes: 4f2673b3a2b6 ("net: bridge: add helper to replay port and host-joined mdb entries")
Signed-off-by: Tobias Waldekranz <tobias@waldekranz.com>
---
 include/net/switchdev.h   |  3 ++
 net/bridge/br_switchdev.c | 74 ++++++++++++++++++++++++---------------
 net/switchdev/switchdev.c | 73 ++++++++++++++++++++++++++++++++++++++
 3 files changed, 122 insertions(+), 28 deletions(-)

diff --git a/include/net/switchdev.h b/include/net/switchdev.h
index a43062d4c734..8346b0d29542 100644
--- a/include/net/switchdev.h
+++ b/include/net/switchdev.h
@@ -308,6 +308,9 @@ void switchdev_deferred_process(void);
 int switchdev_port_attr_set(struct net_device *dev,
 			    const struct switchdev_attr *attr,
 			    struct netlink_ext_ack *extack);
+bool switchdev_port_obj_act_is_deferred(struct net_device *dev,
+					enum switchdev_notifier_type nt,
+					const struct switchdev_obj *obj);
 int switchdev_port_obj_add(struct net_device *dev,
 			   const struct switchdev_obj *obj,
 			   struct netlink_ext_ack *extack);
diff --git a/net/bridge/br_switchdev.c b/net/bridge/br_switchdev.c
index ee84e783e1df..6a7cb01f121c 100644
--- a/net/bridge/br_switchdev.c
+++ b/net/bridge/br_switchdev.c
@@ -595,21 +595,40 @@ br_switchdev_mdb_replay_one(struct notifier_block *nb, struct net_device *dev,
 }
 
 static int br_switchdev_mdb_queue_one(struct list_head *mdb_list,
+				      struct net_device *dev,
+				      unsigned long action,
 				      enum switchdev_obj_id id,
 				      const struct net_bridge_mdb_entry *mp,
 				      struct net_device *orig_dev)
 {
-	struct switchdev_obj_port_mdb *mdb;
+	struct switchdev_obj_port_mdb mdb = {
+		.obj = {
+			.id = id,
+			.orig_dev = orig_dev,
+		},
+	};
+	struct switchdev_obj_port_mdb *pmdb;
 
-	mdb = kzalloc(sizeof(*mdb), GFP_ATOMIC);
-	if (!mdb)
-		return -ENOMEM;
+	br_switchdev_mdb_populate(&mdb, mp);
 
-	mdb->obj.id = id;
-	mdb->obj.orig_dev = orig_dev;
-	br_switchdev_mdb_populate(mdb, mp);
-	list_add_tail(&mdb->obj.list, mdb_list);
+	if (action == SWITCHDEV_PORT_OBJ_ADD &&
+	    switchdev_port_obj_act_is_deferred(dev, action, &mdb.obj)) {
+		/* This event is already in the deferred queue of
+		 * events, so this replay must be elided, lest the
+		 * driver receives duplicate events for it. This can
+		 * only happen when replaying additions, since
+		 * modifications are always immediately visible in
+		 * br->mdb_list, whereas actual event delivery may be
+		 * delayed.
+		 */
+		return 0;
+	}
+
+	pmdb = kmemdup(&mdb, sizeof(mdb), GFP_ATOMIC);
+	if (!pmdb)
+		return -ENOMEM;
 
+	list_add_tail(&pmdb->obj.list, mdb_list);
 	return 0;
 }
 
@@ -677,51 +696,50 @@ br_switchdev_mdb_replay(struct net_device *br_dev, struct net_device *dev,
 	if (!br_opt_get(br, BROPT_MULTICAST_ENABLED))
 		return 0;
 
-	/* We cannot walk over br->mdb_list protected just by the rtnl_mutex,
-	 * because the write-side protection is br->multicast_lock. But we
-	 * need to emulate the [ blocking ] calling context of a regular
-	 * switchdev event, so since both br->multicast_lock and RCU read side
-	 * critical sections are atomic, we have no choice but to pick the RCU
-	 * read side lock, queue up all our events, leave the critical section
-	 * and notify switchdev from blocking context.
+	if (adding)
+		action = SWITCHDEV_PORT_OBJ_ADD;
+	else
+		action = SWITCHDEV_PORT_OBJ_DEL;
+
+	/* br_switchdev_mdb_queue_one() will take care to not queue a
+	 * replay of an event that is already pending in the switchdev
+	 * deferred queue. In order to safely determine that, there
+	 * must be no new deferred MDB notifications enqueued for the
+	 * duration of the MDB scan. Therefore, grab the write-side
+	 * lock to avoid racing with any concurrent IGMP/MLD snooping.
 	 */
-	rcu_read_lock();
+	spin_lock_bh(&br->multicast_lock);
 
-	hlist_for_each_entry_rcu(mp, &br->mdb_list, mdb_node) {
+	hlist_for_each_entry(mp, &br->mdb_list, mdb_node) {
 		struct net_bridge_port_group __rcu * const *pp;
 		const struct net_bridge_port_group *p;
 
 		if (mp->host_joined) {
-			err = br_switchdev_mdb_queue_one(&mdb_list,
+			err = br_switchdev_mdb_queue_one(&mdb_list, dev, action,
 							 SWITCHDEV_OBJ_ID_HOST_MDB,
 							 mp, br_dev);
 			if (err) {
-				rcu_read_unlock();
+				spin_unlock_bh(&br->multicast_lock);
 				goto out_free_mdb;
 			}
 		}
 
-		for (pp = &mp->ports; (p = rcu_dereference(*pp)) != NULL;
+		for (pp = &mp->ports; (p = mlock_dereference(*pp, br)) != NULL;
 		     pp = &p->next) {
 			if (p->key.port->dev != dev)
 				continue;
 
-			err = br_switchdev_mdb_queue_one(&mdb_list,
+			err = br_switchdev_mdb_queue_one(&mdb_list, dev, action,
 							 SWITCHDEV_OBJ_ID_PORT_MDB,
 							 mp, dev);
 			if (err) {
-				rcu_read_unlock();
+				spin_unlock_bh(&br->multicast_lock);
 				goto out_free_mdb;
 			}
 		}
 	}
 
-	rcu_read_unlock();
-
-	if (adding)
-		action = SWITCHDEV_PORT_OBJ_ADD;
-	else
-		action = SWITCHDEV_PORT_OBJ_DEL;
+	spin_unlock_bh(&br->multicast_lock);
 
 	list_for_each_entry(obj, &mdb_list, list) {
 		err = br_switchdev_mdb_replay_one(nb, dev,
diff --git a/net/switchdev/switchdev.c b/net/switchdev/switchdev.c
index 5b045284849e..7d11f31820df 100644
--- a/net/switchdev/switchdev.c
+++ b/net/switchdev/switchdev.c
@@ -19,6 +19,35 @@
 #include <linux/rtnetlink.h>
 #include <net/switchdev.h>
 
+static bool switchdev_obj_eq(const struct switchdev_obj *a,
+			     const struct switchdev_obj *b)
+{
+	const struct switchdev_obj_port_vlan *va, *vb;
+	const struct switchdev_obj_port_mdb *ma, *mb;
+
+	if (a->id != b->id || a->orig_dev != b->orig_dev)
+		return false;
+
+	switch (a->id) {
+	case SWITCHDEV_OBJ_ID_PORT_VLAN:
+		va = SWITCHDEV_OBJ_PORT_VLAN(a);
+		vb = SWITCHDEV_OBJ_PORT_VLAN(b);
+		return va->flags == vb->flags &&
+			va->vid == vb->vid &&
+			va->changed == vb->changed;
+	case SWITCHDEV_OBJ_ID_PORT_MDB:
+	case SWITCHDEV_OBJ_ID_HOST_MDB:
+		ma = SWITCHDEV_OBJ_PORT_MDB(a);
+		mb = SWITCHDEV_OBJ_PORT_MDB(b);
+		return ma->vid == mb->vid &&
+			ether_addr_equal(ma->addr, mb->addr);
+	default:
+		break;
+	}
+
+	BUG();
+}
+
 static LIST_HEAD(deferred);
 static DEFINE_SPINLOCK(deferred_lock);
 
@@ -307,6 +336,50 @@ int switchdev_port_obj_del(struct net_device *dev,
 }
 EXPORT_SYMBOL_GPL(switchdev_port_obj_del);
 
+/**
+ *	switchdev_port_obj_act_is_deferred - Is object action pending?
+ *
+ *	@dev: port device
+ *	@nt: type of action; add or delete
+ *	@obj: object to test
+ *
+ *	Returns true if a deferred item is exists, which is equivalent
+ *	to the action @nt of an object @obj.
+ *
+ *	rtnl_lock must be held.
+ */
+bool switchdev_port_obj_act_is_deferred(struct net_device *dev,
+					enum switchdev_notifier_type nt,
+					const struct switchdev_obj *obj)
+{
+	struct switchdev_deferred_item *dfitem;
+	bool found = false;
+
+	ASSERT_RTNL();
+
+	spin_lock_bh(&deferred_lock);
+
+	list_for_each_entry(dfitem, &deferred, list) {
+		if (dfitem->dev != dev)
+			continue;
+
+		if ((dfitem->func == switchdev_port_obj_add_deferred &&
+		     nt == SWITCHDEV_PORT_OBJ_ADD) ||
+		    (dfitem->func == switchdev_port_obj_del_deferred &&
+		     nt == SWITCHDEV_PORT_OBJ_DEL)) {
+			if (switchdev_obj_eq((const void *)dfitem->data, obj)) {
+				found = true;
+				break;
+			}
+		}
+	}
+
+	spin_unlock_bh(&deferred_lock);
+
+	return found;
+}
+EXPORT_SYMBOL_GPL(switchdev_port_obj_act_is_deferred);
+
 static ATOMIC_NOTIFIER_HEAD(switchdev_notif_chain);
 static BLOCKING_NOTIFIER_HEAD(switchdev_blocking_notif_chain);
 
-- 
2.34.1



* [PATCH v4 net 2/2] net: bridge: switchdev: Ensure deferred event delivery on unoffload
  2024-02-12 19:18 [PATCH v4 net 0/2] net: bridge: switchdev: Ensure MDB events are delivered exactly once Tobias Waldekranz
  2024-02-12 19:18 ` [PATCH v4 net 1/2] net: bridge: switchdev: Skip MDB replays of deferred events on offload Tobias Waldekranz
@ 2024-02-12 19:18 ` Tobias Waldekranz
  2024-02-14 16:47   ` Vladimir Oltean
  1 sibling, 1 reply; 6+ messages in thread
From: Tobias Waldekranz @ 2024-02-12 19:18 UTC (permalink / raw)
  To: davem, kuba; +Cc: olteanv, atenart, roopa, razor, bridge, netdev, jiri, ivecera

When unoffloading a device, it is important to ensure that all
relevant deferred events are delivered to it before it disassociates
itself from the bridge.

Before this change, this was true for the normal case when a device
maps 1:1 to a net_bridge_port, i.e.

   br0
   /
swp0

When swp0 leaves br0, the call to switchdev_deferred_process() in
del_nbp() makes sure to process any outstanding events while the
device is still associated with the bridge.

In the case when the association is indirect though, i.e. when the
device is attached to the bridge via an intermediate device, like a
LAG...

    br0
    /
  lag0
  /
swp0

...then detaching swp0 from lag0 does not cause any net_bridge_port to
be deleted, so there was no guarantee that all events had been
processed before the device disassociated itself from the bridge.

Fix this by always synchronously processing all deferred events before
signaling completion of unoffloading back to the driver.

Fixes: 4e51bf44a03a ("net: bridge: move the switchdev object replay helpers to "push" mode")
Signed-off-by: Tobias Waldekranz <tobias@waldekranz.com>
---
 net/bridge/br_switchdev.c | 10 ++++++++++
 1 file changed, 10 insertions(+)

diff --git a/net/bridge/br_switchdev.c b/net/bridge/br_switchdev.c
index 6a7cb01f121c..7b41ee8740cb 100644
--- a/net/bridge/br_switchdev.c
+++ b/net/bridge/br_switchdev.c
@@ -804,6 +804,16 @@ static void nbp_switchdev_unsync_objs(struct net_bridge_port *p,
 	br_switchdev_mdb_replay(br_dev, dev, ctx, false, blocking_nb, NULL);
 
 	br_switchdev_vlan_replay(br_dev, ctx, false, blocking_nb, NULL);
+
+	/* Make sure that the device leaving this bridge has seen all
+	 * relevant events before it is disassociated. In the normal
+	 * case, when the device is directly attached to the bridge,
+	 * this is covered by del_nbp(). If the association was indirect
+	 * however, e.g. via a team or bond, and the device is leaving
+	 * that intermediate device, then the bridge port remains in
+	 * place.
+	 */
+	switchdev_deferred_process();
 }
 
 /* Let the bridge know that this port is offloaded, so that it can assign a
-- 
2.34.1



* Re: [PATCH v4 net 1/2] net: bridge: switchdev: Skip MDB replays of deferred events on offload
  2024-02-12 19:18 ` [PATCH v4 net 1/2] net: bridge: switchdev: Skip MDB replays of deferred events on offload Tobias Waldekranz
@ 2024-02-14 16:45   ` Vladimir Oltean
  2024-02-14 21:28     ` Tobias Waldekranz
  0 siblings, 1 reply; 6+ messages in thread
From: Vladimir Oltean @ 2024-02-14 16:45 UTC (permalink / raw)
  To: Tobias Waldekranz
  Cc: davem, kuba, atenart, roopa, razor, bridge, netdev, jiri, ivecera

On Mon, Feb 12, 2024 at 08:18:43PM +0100, Tobias Waldekranz wrote:
> Before this change, generation of the list of MDB events to replay
> would race against the creation of new group memberships, either from
> the IGMP/MLD snooping logic or from user configuration.
> 
> [...]
> 
> Fixes: 4f2673b3a2b6 ("net: bridge: add helper to replay port and host-joined mdb entries")
> Signed-off-by: Tobias Waldekranz <tobias@waldekranz.com>
> ---

Excellent from my side, thank you!

Reviewed-by: Vladimir Oltean <olteanv@gmail.com>

> @@ -307,6 +336,50 @@ int switchdev_port_obj_del(struct net_device *dev,
>  }
>  EXPORT_SYMBOL_GPL(switchdev_port_obj_del);
>  
> +/**
> + *	switchdev_port_obj_act_is_deferred - Is object action pending?
> + *
> + *	@dev: port device
> + *	@nt: type of action; add or delete
> + *	@obj: object to test
> + *
> + *	Returns true if a deferred item is exists, which is equivalent
> + *	to the action @nt of an object @obj.

nitpick: replace "is exists" with something else like "is pending" or
"exists".

Also "action of an object" or "on an object"?

> + *
> + *	rtnl_lock must be held.
> + */


* Re: [PATCH v4 net 2/2] net: bridge: switchdev: Ensure deferred event delivery on unoffload
  2024-02-12 19:18 ` [PATCH v4 net 2/2] net: bridge: switchdev: Ensure deferred event delivery on unoffload Tobias Waldekranz
@ 2024-02-14 16:47   ` Vladimir Oltean
  0 siblings, 0 replies; 6+ messages in thread
From: Vladimir Oltean @ 2024-02-14 16:47 UTC (permalink / raw)
  To: Tobias Waldekranz
  Cc: davem, kuba, atenart, roopa, razor, bridge, netdev, jiri, ivecera

On Mon, Feb 12, 2024 at 08:18:44PM +0100, Tobias Waldekranz wrote:
> When unoffloading a device, it is important to ensure that all
> relevant deferred events are delivered to it before it disassociates
> itself from the bridge.
> 
> [...]
> 
> Fixes: 4e51bf44a03a ("net: bridge: move the switchdev object replay helpers to "push" mode")
> Signed-off-by: Tobias Waldekranz <tobias@waldekranz.com>
> ---

Reviewed-by: Vladimir Oltean <olteanv@gmail.com>


* Re: [PATCH v4 net 1/2] net: bridge: switchdev: Skip MDB replays of deferred events on offload
  2024-02-14 16:45   ` Vladimir Oltean
@ 2024-02-14 21:28     ` Tobias Waldekranz
  0 siblings, 0 replies; 6+ messages in thread
From: Tobias Waldekranz @ 2024-02-14 21:28 UTC (permalink / raw)
  To: Vladimir Oltean
  Cc: davem, kuba, atenart, roopa, razor, bridge, netdev, jiri, ivecera

On Wed, Feb 14, 2024 at 18:45, Vladimir Oltean <olteanv@gmail.com> wrote:
> On Mon, Feb 12, 2024 at 08:18:43PM +0100, Tobias Waldekranz wrote:
>> Before this change, generation of the list of MDB events to replay
>> would race against the creation of new group memberships, either from
>> the IGMP/MLD snooping logic or from user configuration.
>> 
>> [...]
>> 
>> Fixes: 4f2673b3a2b6 ("net: bridge: add helper to replay port and host-joined mdb entries")
>> Signed-off-by: Tobias Waldekranz <tobias@waldekranz.com>
>> ---
>
> Excellent from my side, thank you!

Thanks!

> Reviewed-by: Vladimir Oltean <olteanv@gmail.com>
>
>> @@ -307,6 +336,50 @@ int switchdev_port_obj_del(struct net_device *dev,
>>  }
>>  EXPORT_SYMBOL_GPL(switchdev_port_obj_del);
>>  
>> +/**
>> + *	switchdev_port_obj_act_is_deferred - Is object action pending?
>> + *
>> + *	@dev: port device
>> + *	@nt: type of action; add or delete
>> + *	@obj: object to test
>> + *
>> + *	Returns true if a deferred item is exists, which is equivalent
>> + *	to the action @nt of an object @obj.
>
> nitpick: replace "is exists" with something else like "is pending" or
> "exists".
>
> Also "action of an object" or "on an object"?

Yes, these are annoying. I might as well send a v5.

pw-bot: changes-requested

>> + *
>> + *	rtnl_lock must be held.
>> + */


end of thread, other threads:[~2024-02-14 21:28 UTC | newest]

Thread overview: 6+ messages
2024-02-12 19:18 [PATCH v4 net 0/2] net: bridge: switchdev: Ensure MDB events are delivered exactly once Tobias Waldekranz
2024-02-12 19:18 ` [PATCH v4 net 1/2] net: bridge: switchdev: Skip MDB replays of deferred events on offload Tobias Waldekranz
2024-02-14 16:45   ` Vladimir Oltean
2024-02-14 21:28     ` Tobias Waldekranz
2024-02-12 19:18 ` [PATCH v4 net 2/2] net: bridge: switchdev: Ensure deferred event delivery on unoffload Tobias Waldekranz
2024-02-14 16:47   ` Vladimir Oltean
