netdev.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
From: Hangbin Liu <liuhangbin@gmail.com>
To: Cosmin Ratiu <cratiu@nvidia.com>
Cc: netdev@vger.kernel.org, Jay Vosburgh <jv@jvosburgh.net>,
	Andrew Lunn <andrew+netdev@lunn.ch>,
	"David S. Miller" <davem@davemloft.net>,
	Eric Dumazet <edumazet@google.com>,
	Jakub Kicinski <kuba@kernel.org>, Paolo Abeni <pabeni@redhat.com>,
	Simon Horman <horms@kernel.org>, Taehee Yoo <ap420073@gmail.com>,
	Jianbo Liu <jianbol@nvidia.com>,
	Sabrina Dubroca <sd@queasysnail.net>,
	Steffen Klassert <steffen.klassert@secunet.com>,
	Herbert Xu <herbert@gondor.apana.org.au>,
	Leon Romanovsky <leonro@nvidia.com>
Subject: Re: [RFC PATCH ipsec 2/2] bonding: Maintain offloaded xfrm on all devices
Date: Fri, 5 Dec 2025 03:48:51 +0000	[thread overview]
Message-ID: <aTJWI3aybYO-NHg5@fedora> (raw)
In-Reply-To: <20251121151644.1797728-3-cratiu@nvidia.com>

On Fri, Nov 21, 2025 at 05:16:44PM +0200, Cosmin Ratiu wrote:
> The bonding driver manages offloaded SAs using the following strategy:
> 
> An xfrm_state offloaded on the bond device with bond_ipsec_add_sa() uses
> 'real_dev' on the xfrm_state xs to redirect the offload to the current
> active slave. The corresponding bond_ipsec_del_sa() (called with the xs
> spinlock held) redirects the unoffload call to real_dev. Finally,
> cleanup happens in bond_ipsec_free_sa(), which removes the offload from
> the device. Since the last call happens without the xs spinlock held,
> that is where the real work to unoffload actually happens.
> 
> When the active slave changes to a new device a 3-step process is used
> to migrate all xfrm states to the new device:
> 1. bond_ipsec_del_sa_all() unoffloads all states in bond->ipsec_list
>    from the previously active device.
> 2. The active slave is flipped to the new device.
> 3. bond_ipsec_add_sa_all() offloads all states in bond->ipsec_list to
>    the new device.
> 
> There can be two races which result in unencrypted IPSec packets being
> transmitted on the wire:
> 
> 1. Unencrypted IPSec packet on old_dev:
> CPU1 (xfrm_output)                   CPU2 (bond_change_active_slave)
> bond_ipsec_offload_ok -> true
>                                      bond_ipsec_del_sa_all
> bond_xmit_activebackup
> bond_dev_queue_xmit
> dev_queue_xmit on old_dev
> 				     bond->curr_active_slave = new_dev
> 				     bond_ipsec_add_sa_all
> 
> 2. Unencrypted IPSec packet on new_dev:
> CPU1 (xfrm_output)                   CPU2 (bond_change_active_slave)
> bond_ipsec_offload_ok -> true
>                                      bond->curr_active_slave = new_dev
>                                      bond_ipsec_migrate_sa_all
> bond_xmit_activebackup
> bond_dev_queue_xmit
> dev_queue_xmit on new_dev
> 				     bond_ipsec_migrate_sa_all finishes
> 
> This patch fixes both these issues. Bonding now maintain SAs on all
> devices by making use of the previous patch that allows the same xfrm
> state to be offloaded on multiple devices. This consists of:
> 
> 1. Maintaining two linked lists:
> - bond->ipsec_list is the list of xfrm states offloaded to the bonding
>   device.
> - Each slave has its own bond->ipsec_offloads list holding offloads of
>   bond->ipsec_list on that slave.
> These lists are protected by the existing bond->ipsec_lock mutex.
> 
> 2. When a slave is added (bond_enslave), bond_ipsec_add_sa_all now
>    offloads all xfrm states to the new device.
> 
> 3. When a slave is removed (__bond_release_one), bond_ipsec_del_sa_all
>    now removes all xfrm state offloads from that device.
> 
> 4. When the active slave is changed (bond_change_active_slave), a new
>    bond_ipsec_migrate_sa_all function switches xs->xso.real_dev and
>    xs->xso.offload handle for all offloaded xfrm states.
>    xdo_dev_state_advance_esn is also called on the new device to update
>    the esn state.
> 
> 5. Adding an offloaded xfrm state to the bond device must now iterate
>    through active slaves. To make that nice, RTNL is grabbed there. The
>    alternative is repeatedly grabbing each slave under the RCU lock,
>    holding it, releasing the lock to be able to offload a state, then
>    re-grabbing the RCU lock and releasing the slave. RTNL seems cleaner.
> 
> 6. bond_ipsec_del_sa (.xdo_dev_state_delete for bond) is unchanged, it
>    now only deletes the state from the active device and leaves the rest
>    for the xdo_dev_state_free callback, which can grab the required
>    locks.
> 
> Fixes: 9a5605505d9c ("bonding: Add struct bond_ipesc to manage SA")
> Signed-off-by: Cosmin Ratiu <cratiu@nvidia.com>
> ---
>  drivers/net/bonding/bond_main.c | 283 +++++++++++++++++---------------
>  include/net/bonding.h           |  22 ++-
>  2 files changed, 164 insertions(+), 141 deletions(-)
> 
> diff --git a/drivers/net/bonding/bond_main.c b/drivers/net/bonding/bond_main.c
> index 4c5b73786877..979e5aabf8d2 100644
> --- a/drivers/net/bonding/bond_main.c
> +++ b/drivers/net/bonding/bond_main.c
> @@ -452,6 +452,61 @@ static struct net_device *bond_ipsec_dev(struct xfrm_state *xs)
>  	return slave->dev;
>  }
>  
> +static struct bond_ipsec_offload*
> +bond_ipsec_dev_add_sa(struct net_device *dev, struct bond_ipsec *ipsec,
> +		      struct netlink_ext_ack *extack)
> +{
> +	struct bond_ipsec_offload *offload;
> +	int err;
> +
> +	if (!dev->xfrmdev_ops ||
> +	    !dev->xfrmdev_ops->xdo_dev_state_add ||
> +	    netif_is_bond_master(dev)) {
> +		NL_SET_ERR_MSG_MOD(extack,
> +				   "Slave does not support ipsec offload");
> +		return ERR_PTR(-EINVAL);
> +	}
> +
> +	offload = kzalloc(sizeof(*offload), GFP_KERNEL);
> +	if (!offload)
> +		return ERR_PTR(-ENOMEM);
> +
> +	offload->ipsec = ipsec;
> +	offload->dev = dev;
> +	err = dev->xfrmdev_ops->xdo_dev_state_add(dev, ipsec->xs,
> +						   &offload->handle, extack);
> +	if (err)

Here we need to free the offload.

> +		return ERR_PTR(err);
> +	return offload;
> +}
> +
> +static void bond_ipsec_dev_del_sa(struct bond_ipsec_offload *offload)
> +{
> +	struct xfrm_state *xs = offload->ipsec->xs;
> +	struct net_device *dev = offload->dev;
> +
> +	if (dev->xfrmdev_ops->xdo_dev_state_delete) {
> +		spin_lock_bh(&xs->lock);
> +		/* Don't double delete states killed by the user
> +		 * from xs->xso.real_dev.
> +		 */
> +		if (dev != xs->xso.real_dev ||
> +		    xs->km.state != XFRM_STATE_DEAD)
> +			dev->xfrmdev_ops->xdo_dev_state_delete(dev, xs,
> +							       offload->handle);
> +		if (xs->xso.real_dev == dev)
> +			xs->xso.real_dev = NULL;
> +		spin_unlock_bh(&xs->lock);
> +	}
> +
> +	if (dev->xfrmdev_ops->xdo_dev_state_free)
> +		dev->xfrmdev_ops->xdo_dev_state_free(dev, xs, offload->handle);
> +
> +	list_del(&offload->list);
> +	list_del(&offload->ipsec_list);
> +	kfree(offload);
> +}
> +
[...]

> -static void bond_ipsec_add_sa_all(struct bonding *bond)
> +static void bond_ipsec_add_sa_all(struct bonding *bond, struct slave *new_slave,
> +				  struct netlink_ext_ack *extack)
>  {
> +	struct net_device *real_dev = new_slave->dev;
>  	struct net_device *bond_dev = bond->dev;
> -	struct net_device *real_dev;
>  	struct bond_ipsec *ipsec;
>  	struct slave *slave;
>  
> -	slave = rtnl_dereference(bond->curr_active_slave);
> -	real_dev = slave ? slave->dev : NULL;
> -	if (!real_dev)
> -		return;
> +	INIT_LIST_HEAD(&new_slave->ipsec_offloads);
>  
>  	mutex_lock(&bond->ipsec_lock);
> -	if (!real_dev->xfrmdev_ops ||
> -	    !real_dev->xfrmdev_ops->xdo_dev_state_add ||
> -	    netif_is_bond_master(real_dev)) {
> -		if (!list_empty(&bond->ipsec_list))
> -			slave_warn(bond_dev, real_dev,
> -				   "%s: no slave xdo_dev_state_add\n",
> -				   __func__);
> -		goto out;
> -	}
> -
>  	list_for_each_entry(ipsec, &bond->ipsec_list, list) {
> +		struct bond_ipsec_offload *offload;
>  		struct xfrm_state *xs = ipsec->xs;
>  
> -		/* If new state is added before ipsec_lock acquired */
> -		if (xs->xso.real_dev == real_dev)
> -			continue;
> -
> -		if (real_dev->xfrmdev_ops->xdo_dev_state_add(real_dev, xs,
> -							     &xs->xso.offload_handle,
> -							     NULL)) {
> +		offload = bond_ipsec_dev_add_sa(slave->dev, ipsec, extack);

Here should be real_dev.

> +		if (IS_ERR(offload)) {
>  			slave_warn(bond_dev, real_dev, "%s: failed to add SA\n", __func__);
>  			continue;
>  		}

If we add offload failed on the slave, what would happen during migrate?

Thanks
Hangbin

  reply	other threads:[~2025-12-05  3:49 UTC|newest]

Thread overview: 5+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2025-11-21 15:16 [RFC PATCH ipsec 0/2] Fix bonding IPSec races Cosmin Ratiu
2025-11-21 15:16 ` [RFC PATCH ipsec 1/2] xfrm: Add explicit offload_handle to some xfrm callbacks Cosmin Ratiu
2025-11-21 15:16 ` [RFC PATCH ipsec 2/2] bonding: Maintain offloaded xfrm on all devices Cosmin Ratiu
2025-12-05  3:48   ` Hangbin Liu [this message]
2025-12-01  9:01 ` [RFC PATCH ipsec 0/2] Fix bonding IPSec races Steffen Klassert

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=aTJWI3aybYO-NHg5@fedora \
    --to=liuhangbin@gmail.com \
    --cc=andrew+netdev@lunn.ch \
    --cc=ap420073@gmail.com \
    --cc=cratiu@nvidia.com \
    --cc=davem@davemloft.net \
    --cc=edumazet@google.com \
    --cc=herbert@gondor.apana.org.au \
    --cc=horms@kernel.org \
    --cc=jianbol@nvidia.com \
    --cc=jv@jvosburgh.net \
    --cc=kuba@kernel.org \
    --cc=leonro@nvidia.com \
    --cc=netdev@vger.kernel.org \
    --cc=pabeni@redhat.com \
    --cc=sd@queasysnail.net \
    --cc=steffen.klassert@secunet.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).