[PATCH ipsec v2 1/2] bond: Use xfrm_state

netdev.vger.kernel.org archive mirror
 help / color / mirror / Atom feed

* [PATCH ipsec v2 1/2] bond: Use xfrm_state_migrate to migrate SAs
@ 2025-11-13 10:43 Cosmin Ratiu
  2025-11-13 10:43 ` [PATCH ipsec v2 2/2] bond: Use a separate xfrm_active_slave pointer Cosmin Ratiu
  2025-11-14 12:56 ` [PATCH ipsec v2 1/2] bond: Use xfrm_state_migrate to migrate SAs Sabrina Dubroca
  0 siblings, 2 replies; 6+ messages in thread
From: Cosmin Ratiu @ 2025-11-13 10:43 UTC (permalink / raw)
  To: netdev
  Cc: Jay Vosburgh, Andrew Lunn, David S. Miller, Eric Dumazet,
	Jakub Kicinski, Paolo Abeni, Simon Horman, Taehee Yoo, Jianbo Liu,
	Steffen Klassert, Herbert Xu, Leon Romanovsky, Cosmin Ratiu

The bonding driver manages offloaded SAs using the following strategy:

An xfrm_state offloaded on the bond device with bond_ipsec_add_sa() uses
'real_dev' on the xfrm_state xs to redirect the offload to the current
active slave. The corresponding bond_ipsec_del_sa() (called with the xs
spinlock held) redirects the unoffload call to real_dev. Finally,
cleanup happens in bond_ipsec_free_sa(), which removes the offload from
the device. Since the last call happens without the xs spinlock held,
that is where the real work to unoffload actually happens.

When the active slave changes to a new device a 3-step process is used
to migrate all xfrm states to the new device:
1. bond_ipsec_del_sa_all() unoffloads all states in bond->ipsec_list
   from the previously active device.
2. The active slave is flipped to the new device.
3. bond_ipsec_add_sa_all() offloads all states in bond->ipsec_list to
   the new device.

This patch closes a race that could happen between xfrm_state migration
and TX, which could result in unencrypted packets going out the wire:
CPU1 (xfrm_output)                   CPU2 (bond_change_active_slave)
bond_ipsec_offload_ok -> true
                                     bond_ipsec_del_sa_all
bond_xmit_activebackup
bond_dev_queue_xmit
dev_queue_xmit on old_dev
				     bond->curr_active_slave = new_dev
				     bond_ipsec_add_sa_all

So the packet makes it out to old_dev after the offloaded xfrm_state is
deleted from it. The result: an unencrypted IPSec packet on the wire.

With the new approach, in-use states on old_dev will not be deleted
until in-flight packets are transmitted. It also makes for cleaner
bonding code, which no longer needs to care about xfrm_state management
so much.

Fixes: ("ec13009472f4 bonding: implement xdo_dev_state_free and call it after deletion")
Signed-off-by: Cosmin Ratiu <cratiu@nvidia.com>
---
 drivers/net/bonding/bond_main.c | 126 ++++++++++++--------------------
 1 file changed, 45 insertions(+), 81 deletions(-)

diff --git a/drivers/net/bonding/bond_main.c b/drivers/net/bonding/bond_main.c
index 67fdcbdd2764..e45e89179236 100644
--- a/drivers/net/bonding/bond_main.c
+++ b/drivers/net/bonding/bond_main.c
@@ -513,19 +513,21 @@ static int bond_ipsec_add_sa(struct net_device *bond_dev,
 	return err;
 }
 
-static void bond_ipsec_add_sa_all(struct bonding *bond)
+static void bond_ipsec_migrate_sa_all(struct bonding *bond)
 {
+	struct slave *new_active = rtnl_dereference(bond->curr_active_slave);
 	struct net_device *bond_dev = bond->dev;
+	struct net *net = dev_net(bond_dev);
+	struct bond_ipsec *ipsec, *tmp;
+	struct xfrm_user_offload xuo;
 	struct net_device *real_dev;
-	struct bond_ipsec *ipsec;
-	struct slave *slave;
+	struct xfrm_migrate m = {};
+	LIST_HEAD(ipsec_list);
 
-	slave = rtnl_dereference(bond->curr_active_slave);
-	real_dev = slave ? slave->dev : NULL;
-	if (!real_dev)
+	if (!new_active)
 		return;
 
-	mutex_lock(&bond->ipsec_lock);
+	real_dev = new_active->dev;
 	if (!real_dev->xfrmdev_ops ||
 	    !real_dev->xfrmdev_ops->xdo_dev_state_add ||
 	    netif_is_bond_master(real_dev)) {
@@ -533,36 +535,42 @@ static void bond_ipsec_add_sa_all(struct bonding *bond)
 			slave_warn(bond_dev, real_dev,
 				   "%s: no slave xdo_dev_state_add\n",
 				   __func__);
-		goto out;
+		return;
 	}
 
-	list_for_each_entry(ipsec, &bond->ipsec_list, list) {
-		/* If new state is added before ipsec_lock acquired */
-		if (ipsec->xs->xso.real_dev == real_dev)
-			continue;
+	/* Prepare the list of xfrm_states to be migrated. */
+	mutex_lock(&bond->ipsec_lock);
+	list_splice_init(&bond->ipsec_list, &ipsec_list);
+	/* Add back states already offloaded on the new device before the
+	 * lock was acquired and hold all remaining states to avoid them
+	 * getting deleted during the migration.
+	 */
+	list_for_each_entry_safe(ipsec, tmp, &ipsec_list, list) {
+		if (unlikely(ipsec->xs->xso.real_dev == real_dev))
+			list_move_tail(&ipsec->list, &bond->ipsec_list);
+		else
+			xfrm_state_hold(ipsec->xs);
+	}
+	mutex_unlock(&bond->ipsec_lock);
 
-		if (real_dev->xfrmdev_ops->xdo_dev_state_add(real_dev,
-							     ipsec->xs, NULL)) {
-			slave_warn(bond_dev, real_dev, "%s: failed to add SA\n", __func__);
-			continue;
-		}
+	xuo.ifindex = bond_dev->ifindex;
+	list_for_each_entry_safe(ipsec, tmp, &ipsec_list, list) {
+		struct xfrm_state *x = ipsec->xs;
 
-		spin_lock_bh(&ipsec->xs->lock);
-		/* xs might have been killed by the user during the migration
-		 * to the new dev, but bond_ipsec_del_sa() should have done
-		 * nothing, as xso.real_dev is NULL.
-		 * Delete it from the device we just added it to. The pending
-		 * bond_ipsec_free_sa() call will do the rest of the cleanup.
-		 */
-		if (ipsec->xs->km.state == XFRM_STATE_DEAD &&
-		    real_dev->xfrmdev_ops->xdo_dev_state_delete)
-			real_dev->xfrmdev_ops->xdo_dev_state_delete(real_dev,
-								    ipsec->xs);
-		ipsec->xs->xso.real_dev = real_dev;
-		spin_unlock_bh(&ipsec->xs->lock);
+		m.new_family = x->props.family;
+		memcpy(&m.new_daddr, &x->id.daddr, sizeof(x->id.daddr));
+		memcpy(&m.new_saddr, &x->props.saddr, sizeof(x->props.saddr));
+
+		xuo.flags = x->xso.dir == XFRM_DEV_OFFLOAD_IN ?
+			XFRM_OFFLOAD_INBOUND : 0;
+
+		if (!xfrm_state_migrate(x, &m, NULL, net, &xuo, NULL))
+			slave_warn(bond_dev, real_dev,
+				   "%s: xfrm_state_migrate failed\n", __func__);
+		xfrm_state_delete(x);
+		xfrm_state_put(x);
+		kfree(ipsec);
 	}
-out:
-	mutex_unlock(&bond->ipsec_lock);
 }
 
 /**
@@ -590,47 +598,6 @@ static void bond_ipsec_del_sa(struct net_device *bond_dev,
 	real_dev->xfrmdev_ops->xdo_dev_state_delete(real_dev, xs);
 }
 
-static void bond_ipsec_del_sa_all(struct bonding *bond)
-{
-	struct net_device *bond_dev = bond->dev;
-	struct net_device *real_dev;
-	struct bond_ipsec *ipsec;
-	struct slave *slave;
-
-	slave = rtnl_dereference(bond->curr_active_slave);
-	real_dev = slave ? slave->dev : NULL;
-	if (!real_dev)
-		return;
-
-	mutex_lock(&bond->ipsec_lock);
-	list_for_each_entry(ipsec, &bond->ipsec_list, list) {
-		if (!ipsec->xs->xso.real_dev)
-			continue;
-
-		if (!real_dev->xfrmdev_ops ||
-		    !real_dev->xfrmdev_ops->xdo_dev_state_delete ||
-		    netif_is_bond_master(real_dev)) {
-			slave_warn(bond_dev, real_dev,
-				   "%s: no slave xdo_dev_state_delete\n",
-				   __func__);
-			continue;
-		}
-
-		spin_lock_bh(&ipsec->xs->lock);
-		ipsec->xs->xso.real_dev = NULL;
-		/* Don't double delete states killed by the user. */
-		if (ipsec->xs->km.state != XFRM_STATE_DEAD)
-			real_dev->xfrmdev_ops->xdo_dev_state_delete(real_dev,
-								    ipsec->xs);
-		spin_unlock_bh(&ipsec->xs->lock);
-
-		if (real_dev->xfrmdev_ops->xdo_dev_state_free)
-			real_dev->xfrmdev_ops->xdo_dev_state_free(real_dev,
-								  ipsec->xs);
-	}
-	mutex_unlock(&bond->ipsec_lock);
-}
-
 static void bond_ipsec_free_sa(struct net_device *bond_dev,
 			       struct xfrm_state *xs)
 {
@@ -1221,10 +1188,6 @@ void bond_change_active_slave(struct bonding *bond, struct slave *new_active)
 	if (old_active == new_active)
 		return;
 
-#ifdef CONFIG_XFRM_OFFLOAD
-	bond_ipsec_del_sa_all(bond);
-#endif /* CONFIG_XFRM_OFFLOAD */
-
 	if (new_active) {
 		new_active->last_link_up = jiffies;
 
@@ -1247,6 +1210,7 @@ void bond_change_active_slave(struct bonding *bond, struct slave *new_active)
 			if (bond_uses_primary(bond))
 				slave_info(bond->dev, new_active->dev, "making interface the new active one\n");
 		}
+
 	}
 
 	if (bond_uses_primary(bond))
@@ -1264,6 +1228,10 @@ void bond_change_active_slave(struct bonding *bond, struct slave *new_active)
 		rcu_assign_pointer(bond->curr_active_slave, new_active);
 	}
 
+#ifdef CONFIG_XFRM_OFFLOAD
+	bond_ipsec_migrate_sa_all(bond);
+#endif
+
 	if (BOND_MODE(bond) == BOND_MODE_ACTIVEBACKUP) {
 		if (old_active)
 			bond_set_slave_inactive_flags(old_active,
@@ -1296,10 +1264,6 @@ void bond_change_active_slave(struct bonding *bond, struct slave *new_active)
 		}
 	}
 
-#ifdef CONFIG_XFRM_OFFLOAD
-	bond_ipsec_add_sa_all(bond);
-#endif /* CONFIG_XFRM_OFFLOAD */
-
 	/* resend IGMP joins since active slave has changed or
 	 * all were sent on curr_active_slave.
 	 * resend only if bond is brought up with the affected
-- 
2.45.0


^ permalink raw reply related	[flat|nested] 6+ messages in thread

* [PATCH ipsec v2 2/2] bond: Use a separate xfrm_active_slave pointer
  2025-11-13 10:43 [PATCH ipsec v2 1/2] bond: Use xfrm_state_migrate to migrate SAs Cosmin Ratiu
@ 2025-11-13 10:43 ` Cosmin Ratiu
  2025-11-14 12:56 ` [PATCH ipsec v2 1/2] bond: Use xfrm_state_migrate to migrate SAs Sabrina Dubroca
  1 sibling, 0 replies; 6+ messages in thread
From: Cosmin Ratiu @ 2025-11-13 10:43 UTC (permalink / raw)
  To: netdev
  Cc: Jay Vosburgh, Andrew Lunn, David S. Miller, Eric Dumazet,
	Jakub Kicinski, Paolo Abeni, Simon Horman, Taehee Yoo, Jianbo Liu,
	Steffen Klassert, Herbert Xu, Leon Romanovsky, Cosmin Ratiu

Offloaded xfrm states are currently added to a new device after
curr_active_slave is updated to direct traffic to it. This could result
in the following race, which could lead to unencrypted IPSec packets on
the wire:

CPU1 (xfrm_output)                   CPU2 (bond_change_active_slave)
bond_ipsec_offload_ok -> true
                                     bond->curr_active_slave = new_dev
                                     bond_ipsec_migrate_sa_all
bond_xmit_activebackup
bond_dev_queue_xmit
dev_queue_xmit on new_dev
				     bond_ipsec_migrate_sa_all finishes

So the packet can make it out to new_dev before its xfrm_state is
offloaded to it. The result: an unencrypted IPSec packet on the wire.

This patch closes this race by introducing a new pointer in the bond
device, named 'xfrm_active_slave'. All xfrm_states offloaded to the bond
device get offloaded to it. Changing the current active slave will now
first update this new pointer, then migrate all xfrm states on the bond
device, then flip curr_active_slave. This closes the above race and
makes sure that any IPSec packets transmitted on the new device will
find an offloaded xfrm_state on the device.

Fixes: 9a5605505d9c ("bonding: Add struct bond_ipesc to manage SA")
Signed-off-by: Cosmin Ratiu <cratiu@nvidia.com>
---
 drivers/net/bonding/bond_main.c | 15 ++++++++-------
 include/net/bonding.h           |  2 ++
 2 files changed, 10 insertions(+), 7 deletions(-)

diff --git a/drivers/net/bonding/bond_main.c b/drivers/net/bonding/bond_main.c
index e45e89179236..98d5f9974086 100644
--- a/drivers/net/bonding/bond_main.c
+++ b/drivers/net/bonding/bond_main.c
@@ -438,7 +438,7 @@ static struct net_device *bond_ipsec_dev(struct xfrm_state *xs)
 	if (BOND_MODE(bond) != BOND_MODE_ACTIVEBACKUP)
 		return NULL;
 
-	slave = rcu_dereference(bond->curr_active_slave);
+	slave = rcu_dereference(bond->xfrm_active_slave);
 	if (!slave)
 		return NULL;
 
@@ -474,7 +474,7 @@ static int bond_ipsec_add_sa(struct net_device *bond_dev,
 
 	rcu_read_lock();
 	bond = netdev_priv(bond_dev);
-	slave = rcu_dereference(bond->curr_active_slave);
+	slave = rcu_dereference(bond->xfrm_active_slave);
 	real_dev = slave ? slave->dev : NULL;
 	netdev_hold(real_dev, &tracker, GFP_ATOMIC);
 	rcu_read_unlock();
@@ -515,7 +515,7 @@ static int bond_ipsec_add_sa(struct net_device *bond_dev,
 
 static void bond_ipsec_migrate_sa_all(struct bonding *bond)
 {
-	struct slave *new_active = rtnl_dereference(bond->curr_active_slave);
+	struct slave *new_active = rtnl_dereference(bond->xfrm_active_slave);
 	struct net_device *bond_dev = bond->dev;
 	struct net *net = dev_net(bond_dev);
 	struct bond_ipsec *ipsec, *tmp;
@@ -1216,6 +1216,11 @@ void bond_change_active_slave(struct bonding *bond, struct slave *new_active)
 	if (bond_uses_primary(bond))
 		bond_hw_addr_swap(bond, new_active, old_active);
 
+#ifdef CONFIG_XFRM_OFFLOAD
+	rcu_assign_pointer(bond->xfrm_active_slave, new_active);
+	bond_ipsec_migrate_sa_all(bond);
+#endif
+
 	if (bond_is_lb(bond)) {
 		bond_alb_handle_active_change(bond, new_active);
 		if (old_active)
@@ -1228,10 +1233,6 @@ void bond_change_active_slave(struct bonding *bond, struct slave *new_active)
 		rcu_assign_pointer(bond->curr_active_slave, new_active);
 	}
 
-#ifdef CONFIG_XFRM_OFFLOAD
-	bond_ipsec_migrate_sa_all(bond);
-#endif
-
 	if (BOND_MODE(bond) == BOND_MODE_ACTIVEBACKUP) {
 		if (old_active)
 			bond_set_slave_inactive_flags(old_active,
diff --git a/include/net/bonding.h b/include/net/bonding.h
index 49edc7da0586..256fe96fcfda 100644
--- a/include/net/bonding.h
+++ b/include/net/bonding.h
@@ -260,6 +260,8 @@ struct bonding {
 #endif /* CONFIG_DEBUG_FS */
 	struct rtnl_link_stats64 bond_stats;
 #ifdef CONFIG_XFRM_OFFLOAD
+	/* The device where new xfrm states will be offloaded */
+	struct slave __rcu *xfrm_active_slave;
 	struct list_head ipsec_list;
 	/* protecting ipsec_list */
 	struct mutex ipsec_lock;
-- 
2.45.0


^ permalink raw reply related	[flat|nested] 6+ messages in thread

* Re: [PATCH ipsec v2 1/2] bond: Use xfrm_state_migrate to migrate SAs
  2025-11-13 10:43 [PATCH ipsec v2 1/2] bond: Use xfrm_state_migrate to migrate SAs Cosmin Ratiu
  2025-11-13 10:43 ` [PATCH ipsec v2 2/2] bond: Use a separate xfrm_active_slave pointer Cosmin Ratiu
@ 2025-11-14 12:56 ` Sabrina Dubroca
  2025-11-17 12:48   ` Cosmin Ratiu
  1 sibling, 1 reply; 6+ messages in thread
From: Sabrina Dubroca @ 2025-11-14 12:56 UTC (permalink / raw)
  To: Cosmin Ratiu
  Cc: netdev, Jay Vosburgh, Andrew Lunn, David S. Miller, Eric Dumazet,
	Jakub Kicinski, Paolo Abeni, Simon Horman, Taehee Yoo, Jianbo Liu,
	Steffen Klassert, Herbert Xu, Leon Romanovsky

2025-11-13, 12:43:09 +0200, Cosmin Ratiu wrote:
> The bonding driver manages offloaded SAs using the following strategy:
> 
> An xfrm_state offloaded on the bond device with bond_ipsec_add_sa() uses
> 'real_dev' on the xfrm_state xs to redirect the offload to the current
> active slave. The corresponding bond_ipsec_del_sa() (called with the xs
> spinlock held) redirects the unoffload call to real_dev.


> Finally,
> cleanup happens in bond_ipsec_free_sa(), which removes the offload from
> the device. Since the last call happens without the xs spinlock held,
> that is where the real work to unoffload actually happens.

Not on all devices (some don't even implement xdo_dev_state_free).


> 
> When the active slave changes to a new device a 3-step process is used
> to migrate all xfrm states to the new device:
>
> 1. bond_ipsec_del_sa_all() unoffloads all states in bond->ipsec_list
>    from the previously active device.
> 2. The active slave is flipped to the new device.
> 3. bond_ipsec_add_sa_all() offloads all states in bond->ipsec_list to
>    the new device.
> 
> This patch closes a race that could happen between xfrm_state migration
> and TX, which could result in unencrypted packets going out the wire:
> CPU1 (xfrm_output)                   CPU2 (bond_change_active_slave)
> bond_ipsec_offload_ok -> true
>                                      bond_ipsec_del_sa_all
> bond_xmit_activebackup
> bond_dev_queue_xmit
> dev_queue_xmit on old_dev
> 				     bond->curr_active_slave = new_dev
> 				     bond_ipsec_add_sa_all
> 
> So the packet makes it out to old_dev after the offloaded xfrm_state is
> deleted from it. The result: an unencrypted IPSec packet on the wire.
> 
> With the new approach, in-use states on old_dev will not be deleted
> until in-flight packets are transmitted.

How does this guarantee it? It would be good to describe how the new
approach closes the race with a bit more info than "use
xfrm_state_migrate".

And I don't think we currently guarantee that packets using offload
will be fully transmitted before xdo_dev_state_delete was called in
case of deletion. But ok, the bond case is worse due to the add/delete
dance when we change the active slave (and there's still the possible
issue Steffen mentioned a while ago, that this delete/add dance may
not be valid at all depending on how the HW behaves wrt IVs).


> It also makes for cleaner
> bonding code, which no longer needs to care about xfrm_state management
> so much.

But using the migrate code for that feels kind of hacky, and the 2nd
patch in this set also looks quite hacky.

And doing all that without protection against admin operations on the
xfrm state objects doesn't seem safe.


Thinking about the migrate behavior, if we fail to create/offload the
new state:
 - old state will be deleted
 - new state won't be created

So any packet we send afterwards that would need to use this SA will
just get dropped? (while the old behavior was "no more offload until
we change the active slave again"?)


> Fixes: ("ec13009472f4 bonding: implement xdo_dev_state_free and call it after deletion")

nit: wrong formatting of the Fixes tag


[just one comment on the diff, I'll look at it again if we decide to
proceed with this patch]

> @@ -533,36 +535,42 @@ static void bond_ipsec_add_sa_all(struct bonding *bond)
>  			slave_warn(bond_dev, real_dev,
>  				   "%s: no slave xdo_dev_state_add\n",
>  				   __func__);
> -		goto out;
> +		return;
>  	}
>  
> -	list_for_each_entry(ipsec, &bond->ipsec_list, list) {
> -		/* If new state is added before ipsec_lock acquired */
> -		if (ipsec->xs->xso.real_dev == real_dev)
> -			continue;
> +	/* Prepare the list of xfrm_states to be migrated. */
> +	mutex_lock(&bond->ipsec_lock);
> +	list_splice_init(&bond->ipsec_list, &ipsec_list);
> +	/* Add back states already offloaded on the new device before the
> +	 * lock was acquired and hold all remaining states to avoid them
> +	 * getting deleted during the migration.

Even with hold(), they could still be deleted (but not destroyed)?

> +	 */
> +	list_for_each_entry_safe(ipsec, tmp, &ipsec_list, list) {
> +		if (unlikely(ipsec->xs->xso.real_dev == real_dev))
> +			list_move_tail(&ipsec->list, &bond->ipsec_list);
> +		else
> +			xfrm_state_hold(ipsec->xs);
> +	}
> +	mutex_unlock(&bond->ipsec_lock);

-- 
Sabrina

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: [PATCH ipsec v2 1/2] bond: Use xfrm_state_migrate to migrate SAs
  2025-11-14 12:56 ` [PATCH ipsec v2 1/2] bond: Use xfrm_state_migrate to migrate SAs Sabrina Dubroca
@ 2025-11-17 12:48   ` Cosmin Ratiu
  2025-11-20 11:36     ` Sabrina Dubroca
  0 siblings, 1 reply; 6+ messages in thread
From: Cosmin Ratiu @ 2025-11-17 12:48 UTC (permalink / raw)
  To: sd@queasysnail.net
  Cc: andrew+netdev@lunn.ch, davem@davemloft.net, ap420073@gmail.com,
	herbert@gondor.apana.org.au, Leon Romanovsky,
	steffen.klassert@secunet.com, jv@jvosburgh.net, kuba@kernel.org,
	horms@kernel.org, edumazet@google.com, netdev@vger.kernel.org,
	Jianbo Liu, pabeni@redhat.com

On Fri, 2025-11-14 at 13:56 +0100, Sabrina Dubroca wrote:
> 2025-11-13, 12:43:09 +0200, Cosmin Ratiu wrote:
> > The bonding driver manages offloaded SAs using the following
> > strategy:
> > 
> > An xfrm_state offloaded on the bond device with bond_ipsec_add_sa()
> > uses
> > 'real_dev' on the xfrm_state xs to redirect the offload to the
> > current
> > active slave. The corresponding bond_ipsec_del_sa() (called with
> > the xs
> > spinlock held) redirects the unoffload call to real_dev.
> 
> 
> > Finally,
> > cleanup happens in bond_ipsec_free_sa(), which removes the offload
> > from
> > the device. Since the last call happens without the xs spinlock
> > held,
> > that is where the real work to unoffload actually happens.
> 
> Not on all devices (some don't even implement xdo_dev_state_free).

You're right. Looking at what the stack does:
xfrm_state_delete() immediately calls xdo_dev_state_delete(), but
leaves xdo_dev_state_free() for when there are no more refs (in-flight
tx packets are done).
xfrm_dev_state_flush() forces an xfrm_dev_state_free() immediately
after deleting xs. I guess the goal is to clean up *now* everything
from 'dev'.
All other callers of xfrm_state_delete() don't care about free, it will
be done when there are no more refs.

So right now for devices that implement xdo_dev_state_free(), there's
distinct behavior of what happens when xfrm_state_delete gets called

So right now, there's a difference in behavior for what happens with
in-flight packets when xfrm_state_delete() is called:
1. On devs which delete the dev state in xdo_dev_state_free(), in-
flight packets are not affected.
2. On devs which delete the dev state in xdo_dev_state_delete(), in-
flight packets will see the xs yanked from underneath them.

This makes me ask the question: Is there a point to the
xdo_dev_state_delete() callback any more? Couldn't we consolidate on
having a single callback to free the offloaded xfrm_state when there
are no more references to it? This would simplify the delete+free dance
and would leave proper cleanup for the xs reference counting.

What am I missing?

> > 
> > When the active slave changes to a new device a 3-step process is
> > used
> > to migrate all xfrm states to the new device:
> > 
> > 1. bond_ipsec_del_sa_all() unoffloads all states in bond-
> > >ipsec_list
> >    from the previously active device.
> > 2. The active slave is flipped to the new device.
> > 3. bond_ipsec_add_sa_all() offloads all states in bond->ipsec_list
> > to
> >    the new device.
> > 
> > This patch closes a race that could happen between xfrm_state
> > migration
> > and TX, which could result in unencrypted packets going out the
> > wire:
> > CPU1 (xfrm_output)                   CPU2
> > (bond_change_active_slave)
> > bond_ipsec_offload_ok -> true
> >                                      bond_ipsec_del_sa_all
> > bond_xmit_activebackup
> > bond_dev_queue_xmit
> > dev_queue_xmit on old_dev
> > 				     bond->curr_active_slave =
> > new_dev
> > 				     bond_ipsec_add_sa_all
> > 
> > So the packet makes it out to old_dev after the offloaded
> > xfrm_state is
> > deleted from it. The result: an unencrypted IPSec packet on the
> > wire.
> > 
> > With the new approach, in-use states on old_dev will not be deleted
> > until in-flight packets are transmitted.
> 
> How does this guarantee it? It would be good to describe how the new
> approach closes the race with a bit more info than "use
> xfrm_state_migrate".
> 
> And I don't think we currently guarantee that packets using offload
> will be fully transmitted before xdo_dev_state_delete was called in
> case of deletion. 

Apologies for leaving this part out, yeah, it's pretty important.
I changed the descriptions for the next versions, here's what happens:
In-flight offloaded tx packets hold a reference to the used xfrm_state
via xfrm_output -> xfrm_state_hold which gets released when the
completion arrives via napi_consume_skb -...-> skb_ext_put ->
skb_ext_put_sp -> xfrm_state_put.

But this doesn't work on devices which do the dev state deletion in
xdo_dev_state_delete(), because those might get their SAs yanked from
the device during the xfrm_state_delete() added in this patch. I guess
this ties to the previous point: Shouldn't there be only
xdo_dev_state_delete which touches the device when refcount is 0?

> But ok, the bond case is worse due to the add/delete
> dance when we change the active slave (and there's still the possible
> issue Steffen mentioned a while ago, that this delete/add dance may
> not be valid at all depending on how the HW behaves wrt IVs).

I am aware of that issue, I am not trying to change any of that. Just
trying to improve bond from a security perspective. I don't think it's
ok for it to send out unencrypted IPSec packets.

> > It also makes for cleaner
> > bonding code, which no longer needs to care about xfrm_state
> > management
> > so much.
> 
> But using the migrate code for that feels kind of hacky, and the 2nd
> patch in this set also looks quite hacky.

It's less hacky than the manual xfrm state management done so far. At
least bonding no longer needs to care so much about the semantics of
the xfrm dev state operations. And it no longer needs to acquire the
xs->lock (what does bonding have to do with an internal xfrm_state lock
anyway?)

> 
> And doing all that without protection against admin operations on the
> xfrm state objects doesn't seem safe.
> 
> Thinking about the migrate behavior, if we fail to create/offload the
> new state:
>  - old state will be deleted
>  - new state won't be created
> 
> So any packet we send afterwards that would need to use this SA will
> just get dropped? (while the old behavior was "no more offload until
> we change the active slave again"?)

This is not ideal, I agree. Perhaps instead of giving up on the failed
xs there could be an alternate migration path where we call
xdo_dev_state_free+xdo_dev_state_add like before? Ick, I don't really
like that.

Alternatively, I have implemented another fix to these races, which is
to change xs.xso to be able to be offloaded on multiple devices at the
same time (nothing fancy, just parameter changes to xdo ops) and
changing the bonding driver to maintain a single offloaded xfrm_state
on *all* slaves (using bonding data structs). Changing the active slave
then becomes as simple as updating the esn on the new device (to get
that device state up to speed).
Leon and I discussed about that and he suggested it would be better to
use xfrm_state_migrate, since that is an existing well-understood
workflow.
Do you think that approach is worth pursuing instead? I could send them
patches as RFC for discussion.

> 
> 
> > Fixes: ("ec13009472f4 bonding: implement xdo_dev_state_free and
> > call it after deletion")
> 
> nit: wrong formatting of the Fixes tag

Thanks, will fix in the next submission (if we decide there will be a
next one).

> 
> 
> [just one comment on the diff, I'll look at it again if we decide to
> proceed with this patch]
> 
> > @@ -533,36 +535,42 @@ static void bond_ipsec_add_sa_all(struct
> > bonding *bond)
> >  			slave_warn(bond_dev, real_dev,
> >  				   "%s: no slave
> > xdo_dev_state_add\n",
> >  				   __func__);
> > -		goto out;
> > +		return;
> >  	}
> >  
> > -	list_for_each_entry(ipsec, &bond->ipsec_list, list) {
> > -		/* If new state is added before ipsec_lock
> > acquired */
> > -		if (ipsec->xs->xso.real_dev == real_dev)
> > -			continue;
> > +	/* Prepare the list of xfrm_states to be migrated. */
> > +	mutex_lock(&bond->ipsec_lock);
> > +	list_splice_init(&bond->ipsec_list, &ipsec_list);
> > +	/* Add back states already offloaded on the new device
> > before the
> > +	 * lock was acquired and hold all remaining states to
> > avoid them
> > +	 * getting deleted during the migration.
> 
> Even with hold(), they could still be deleted (but not destroyed)?

xfrm_del_sa() might concurrently happen, but since we have a ref,
they'll not get destroyed, yes.

> 
> > +	 */
> > +	list_for_each_entry_safe(ipsec, tmp, &ipsec_list, list) {
> > +		if (unlikely(ipsec->xs->xso.real_dev == real_dev))
> > +			list_move_tail(&ipsec->list, &bond-
> > >ipsec_list);
> > +		else
> > +			xfrm_state_hold(ipsec->xs);
> > +	}
> > +	mutex_unlock(&bond->ipsec_lock);
> 

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: [PATCH ipsec v2 1/2] bond: Use xfrm_state_migrate to migrate SAs
  2025-11-17 12:48   ` Cosmin Ratiu
@ 2025-11-20 11:36     ` Sabrina Dubroca
  2025-12-01  8:42       ` Steffen Klassert
  0 siblings, 1 reply; 6+ messages in thread
From: Sabrina Dubroca @ 2025-11-20 11:36 UTC (permalink / raw)
  To: Cosmin Ratiu, steffen.klassert
  Cc: andrew+netdev@lunn.ch, davem@davemloft.net, ap420073@gmail.com,
	herbert@gondor.apana.org.au, Leon Romanovsky, jv@jvosburgh.net,
	kuba@kernel.org, horms@kernel.org, edumazet@google.com,
	netdev@vger.kernel.org, Jianbo Liu, pabeni@redhat.com

2025-11-17, 12:48:20 +0000, Cosmin Ratiu wrote:
> On Fri, 2025-11-14 at 13:56 +0100, Sabrina Dubroca wrote:
> > 2025-11-13, 12:43:09 +0200, Cosmin Ratiu wrote:
> > > The bonding driver manages offloaded SAs using the following
> > > strategy:
> > > 
> > > An xfrm_state offloaded on the bond device with bond_ipsec_add_sa()
> > > uses
> > > 'real_dev' on the xfrm_state xs to redirect the offload to the
> > > current
> > > active slave. The corresponding bond_ipsec_del_sa() (called with
> > > the xs
> > > spinlock held) redirects the unoffload call to real_dev.
> > 
> > 
> > > Finally,
> > > cleanup happens in bond_ipsec_free_sa(), which removes the offload
> > > from
> > > the device. Since the last call happens without the xs spinlock
> > > held,
> > > that is where the real work to unoffload actually happens.
> > 
> > Not on all devices (some don't even implement xdo_dev_state_free).
> 
> You're right. Looking at what the stack does:
> xfrm_state_delete() immediately calls xdo_dev_state_delete(), but
> leaves xdo_dev_state_free() for when there are no more refs (in-flight
> tx packets are done).
> xfrm_dev_state_flush() forces an xfrm_dev_state_free() immediately
> after deleting xs. I guess the goal is to clean up *now* everything
> from 'dev'.

Yes, see 07b87f9eea0c ("xfrm: Fix unregister netdevice hang on hardware offload.")
(though I'm not sure it's still needed, now that TCP drops the secpath
early: 9b6412e6979f)

> All other callers of xfrm_state_delete() don't care about free, it will
> be done when there are no more refs.
> 
> So right now for devices that implement xdo_dev_state_free(), there's
> distinct behavior of what happens when xfrm_state_delete gets called
> 
> So right now, there's a difference in behavior for what happens with
> in-flight packets when xfrm_state_delete() is called:
> 1. On devs which delete the dev state in xdo_dev_state_free(), in-
> flight packets are not affected.
> 2. On devs which delete the dev state in xdo_dev_state_delete(), in-
> flight packets will see the xs yanked from underneath them.
> 
> This makes me ask the question: Is there a point to the
> xdo_dev_state_delete() callback any more? Couldn't we consolidate on
> having a single callback to free the offloaded xfrm_state when there
> are no more references to it? This would simplify the delete+free dance
> and would leave proper cleanup for the xs reference counting.
> 
> What am I missing?

I don't know. Maybe it's a leftover of the initial offload
implementation/drivers that we don't need anymore? Steffen?


[...]
> > > With the new approach, in-use states on old_dev will not be deleted
> > > until in-flight packets are transmitted.
> > 
> > How does this guarantee it? It would be good to describe how the new
> > approach closes the race with a bit more info than "use
> > xfrm_state_migrate".
> > 
> > And I don't think we currently guarantee that packets using offload
> > will be fully transmitted before xdo_dev_state_delete was called in
> > case of deletion. 
> 
> Apologies for leaving this part out, yeah, it's pretty important.
> I changed the descriptions for the next versions, here's what happens:
> In-flight offloaded tx packets hold a reference to the used xfrm_state
> via xfrm_output -> xfrm_state_hold which gets released when the
> completion arrives via napi_consume_skb -...-> skb_ext_put ->
> skb_ext_put_sp -> xfrm_state_put.
> 
> But this doesn't work on devices which do the dev state deletion in
> xdo_dev_state_delete(), because those might get their SAs yanked from
> the device during the xfrm_state_delete() added in this patch. I guess
> this ties to the previous point: Shouldn't there be only
> xdo_dev_state_delete which touches the device when refcount is 0?
> 
> 
> > But ok, the bond case is worse due to the add/delete
> > dance when we change the active slave (and there's still the possible
> > issue Steffen mentioned a while ago, that this delete/add dance may
> > not be valid at all depending on how the HW behaves wrt IVs).
> 
> I am aware of that issue, I am not trying to change any of that. Just
> trying to improve bond from a security perspective.

Sure. But I'm not sure we can make it really trustworthy...

> I don't think it's
> ok for it to send out unencrypted IPSec packets.

Agree.


> > > It also makes for cleaner
> > > bonding code, which no longer needs to care about xfrm_state
> > > management
> > > so much.
> > 
> > But using the migrate code for that feels kind of hacky, and the 2nd
> > patch in this set also looks quite hacky.
> 
> It's less hacky than the manual xfrm state management done so far. At
> least bonding no longer needs to care so much about the semantics of
> the xfrm dev state operations. And it no longer needs to acquire the
> xs->lock (what does bonding have to do with an internal xfrm_state lock
> anyway?)

To me, it's hacky in the sense that we're hijacking the migrate code
that isn't intended for that, and triggering core xfrm operations from
inside a driver (and without proper locking). But true, the current
code is also hacky.

I think a better solution might be to find a way to use the
"per-resource" SA code for bonding (currently implemented for
"per-CPU" SAs, but a resource could be a lower device). Then we don't
have to worry about moving states from one link to another, but it
requires userspace cooperation.


> > And doing all that without protection against admin operations on the
> > xfrm state objects doesn't seem safe.
> > 
> > Thinking about the migrate behavior, if we fail to create/offload the
> > new state:
> >  - old state will be deleted
> >  - new state won't be created
> > 
> > So any packet we send afterwards that would need to use this SA will
> > just get dropped? (while the old behavior was "no more offload until
> > we change the active slave again"?)
> 
> This is not ideal, I agree. Perhaps instead of giving up on the failed
> xs there could be an alternate migration path where we call
> xdo_dev_state_free+xdo_dev_state_add like before? Ick, I don't really
> like that.
> 
> Alternatively, I have implemented another fix to these races, which is
> to change xs.xso to be able to be offloaded on multiple devices at the
> same time (nothing fancy, just parameter changes to xdo ops) and
> changing the bonding driver to maintain a single offloaded xfrm_state
> on *all* slaves (using bonding data structs). Changing the active slave
> then becomes as simple as updating the esn on the new device (to get
> that device state up to speed).
> Leon and I discussed about that and he suggested it would be better to
> use xfrm_state_migrate, since that is an existing well-understood
> workflow.
> Do you think that approach is worth pursuing instead? I could send them
> patches as RFC for discussion.

You can go ahead and share them if you want, but the short description
above kind of puzzles/scares me.

This whole feature is really a mess :/

-- 
Sabrina

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: [PATCH ipsec v2 1/2] bond: Use xfrm_state_migrate to migrate SAs
  2025-11-20 11:36     ` Sabrina Dubroca
@ 2025-12-01  8:42       ` Steffen Klassert
  0 siblings, 0 replies; 6+ messages in thread
From: Steffen Klassert @ 2025-12-01  8:42 UTC (permalink / raw)
  To: Sabrina Dubroca
  Cc: Cosmin Ratiu, andrew+netdev@lunn.ch, davem@davemloft.net,
	ap420073@gmail.com, herbert@gondor.apana.org.au, Leon Romanovsky,
	jv@jvosburgh.net, kuba@kernel.org, horms@kernel.org,
	edumazet@google.com, netdev@vger.kernel.org, Jianbo Liu,
	pabeni@redhat.com

On Thu, Nov 20, 2025 at 12:36:16PM +0100, Sabrina Dubroca wrote:
> 2025-11-17, 12:48:20 +0000, Cosmin Ratiu wrote:
> > On Fri, 2025-11-14 at 13:56 +0100, Sabrina Dubroca wrote:
> 
> > All other callers of xfrm_state_delete() don't care about free, it will
> > be done when there are no more refs.
> > 
> > So right now for devices that implement xdo_dev_state_free(), there's
> > distinct behavior of what happens when xfrm_state_delete gets called
> > 
> > So right now, there's a difference in behavior for what happens with
> > in-flight packets when xfrm_state_delete() is called:
> > 1. On devs which delete the dev state in xdo_dev_state_free(), in-
> > flight packets are not affected.
> > 2. On devs which delete the dev state in xdo_dev_state_delete(), in-
> > flight packets will see the xs yanked from underneath them.
> > 
> > This makes me ask the question: Is there a point to the
> > xdo_dev_state_delete() callback any more? Couldn't we consolidate on
> > having a single callback to free the offloaded xfrm_state when there
> > are no more references to it? This would simplify the delete+free dance
> > and would leave proper cleanup for the xs reference counting.
> > 
> > What am I missing?
> 
> I don't know. Maybe it's a leftover of the initial offload
> implementation/drivers that we don't need anymore? Steffen?

The xfrm states are deleted in two stages. xfrm_state_delete
removes the states from the lists so they don't get used anymore.
In a second step the states are freed once all inflight packets
that used the state left the system. The xdo_dev_state_delete
and xdo_dev_state_free were an offer to the driver to do something
at both stages. I don't remember anymore how it was used in the
beginning. But if one callback is sufficient for all the drivers,
I'm ok with having just one callback.


^ permalink raw reply	[flat|nested] 6+ messages in thread

end of thread, other threads:[~2025-12-01  8:43 UTC | newest]

Thread overview: 6+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2025-11-13 10:43 [PATCH ipsec v2 1/2] bond: Use xfrm_state_migrate to migrate SAs Cosmin Ratiu
2025-11-13 10:43 ` [PATCH ipsec v2 2/2] bond: Use a separate xfrm_active_slave pointer Cosmin Ratiu
2025-11-14 12:56 ` [PATCH ipsec v2 1/2] bond: Use xfrm_state_migrate to migrate SAs Sabrina Dubroca
2025-11-17 12:48   ` Cosmin Ratiu
2025-11-20 11:36     ` Sabrina Dubroca
2025-12-01  8:42       ` Steffen Klassert

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).