All of lore.kernel.org
 help / color / mirror / Atom feed
From: Hangbin Liu <liuhangbin@gmail.com>
To: Cosmin Ratiu <cratiu@nvidia.com>
Cc: netdev@vger.kernel.org, Jay Vosburgh <jv@jvosburgh.net>,
	Andrew Lunn <andrew+netdev@lunn.ch>,
	"David S. Miller" <davem@davemloft.net>,
	Eric Dumazet <edumazet@google.com>,
	Jakub Kicinski <kuba@kernel.org>, Paolo Abeni <pabeni@redhat.com>,
	Simon Horman <horms@kernel.org>, Taehee Yoo <ap420073@gmail.com>,
	Jianbo Liu <jianbol@nvidia.com>,
	Sabrina Dubroca <sd@queasysnail.net>,
	Steffen Klassert <steffen.klassert@secunet.com>,
	Herbert Xu <herbert@gondor.apana.org.au>,
	Leon Romanovsky <leonro@nvidia.com>
Subject: Re: [RFC PATCH ipsec 2/2] bonding: Maintain offloaded xfrm on all devices
Date: Fri, 5 Dec 2025 03:48:51 +0000	[thread overview]
Message-ID: <aTJWI3aybYO-NHg5@fedora> (raw)
In-Reply-To: <20251121151644.1797728-3-cratiu@nvidia.com>

On Fri, Nov 21, 2025 at 05:16:44PM +0200, Cosmin Ratiu wrote:
> The bonding driver manages offloaded SAs using the following strategy:
> 
> An xfrm_state offloaded on the bond device with bond_ipsec_add_sa() uses
> 'real_dev' on the xfrm_state xs to redirect the offload to the current
> active slave. The corresponding bond_ipsec_del_sa() (called with the xs
> spinlock held) redirects the unoffload call to real_dev. Finally,
> cleanup happens in bond_ipsec_free_sa(), which removes the offload from
> the device. Since the last call happens without the xs spinlock held,
> that is where the real work to unoffload actually happens.
> 
> When the active slave changes to a new device a 3-step process is used
> to migrate all xfrm states to the new device:
> 1. bond_ipsec_del_sa_all() unoffloads all states in bond->ipsec_list
>    from the previously active device.
> 2. The active slave is flipped to the new device.
> 3. bond_ipsec_add_sa_all() offloads all states in bond->ipsec_list to
>    the new device.
> 
> There can be two races which result in unencrypted IPSec packets being
> transmitted on the wire:
> 
> 1. Unencrypted IPSec packet on old_dev:
> CPU1 (xfrm_output)                   CPU2 (bond_change_active_slave)
> bond_ipsec_offload_ok -> true
>                                      bond_ipsec_del_sa_all
> bond_xmit_activebackup
> bond_dev_queue_xmit
> dev_queue_xmit on old_dev
> 				     bond->curr_active_slave = new_dev
> 				     bond_ipsec_add_sa_all
> 
> 2. Unencrypted IPSec packet on new_dev:
> CPU1 (xfrm_output)                   CPU2 (bond_change_active_slave)
> bond_ipsec_offload_ok -> true
>                                      bond->curr_active_slave = new_dev
>                                      bond_ipsec_migrate_sa_all
> bond_xmit_activebackup
> bond_dev_queue_xmit
> dev_queue_xmit on new_dev
> 				     bond_ipsec_migrate_sa_all finishes
> 
> This patch fixes both these issues. Bonding now maintain SAs on all
> devices by making use of the previous patch that allows the same xfrm
> state to be offloaded on multiple devices. This consists of:
> 
> 1. Maintaining two linked lists:
> - bond->ipsec_list is the list of xfrm states offloaded to the bonding
>   device.
> - Each slave has its own bond->ipsec_offloads list holding offloads of
>   bond->ipsec_list on that slave.
> These lists are protected by the existing bond->ipsec_lock mutex.
> 
> 2. When a slave is added (bond_enslave), bond_ipsec_add_sa_all now
>    offloads all xfrm states to the new device.
> 
> 3. When a slave is removed (__bond_release_one), bond_ipsec_del_sa_all
>    now removes all xfrm state offloads from that device.
> 
> 4. When the active slave is changed (bond_change_active_slave), a new
>    bond_ipsec_migrate_sa_all function switches xs->xso.real_dev and
>    xs->xso.offload handle for all offloaded xfrm states.
>    xdo_dev_state_advance_esn is also called on the new device to update
>    the esn state.
> 
> 5. Adding an offloaded xfrm state to the bond device must now iterate
>    through active slaves. To make that nice, RTNL is grabbed there. The
>    alternative is repeatedly grabbing each slave under the RCU lock,
>    holding it, releasing the lock to be able to offload a state, then
>    re-grabbing the RCU lock and releasing the slave. RTNL seems cleaner.
> 
> 6. bond_ipsec_del_sa (.xdo_dev_state_delete for bond) is unchanged, it
>    now only deletes the state from the active device and leaves the rest
>    for the xdo_dev_state_free callback, which can grab the required
>    locks.
> 
> Fixes: 9a5605505d9c ("bonding: Add struct bond_ipesc to manage SA")
> Signed-off-by: Cosmin Ratiu <cratiu@nvidia.com>
> ---
>  drivers/net/bonding/bond_main.c | 283 +++++++++++++++++---------------
>  include/net/bonding.h           |  22 ++-
>  2 files changed, 164 insertions(+), 141 deletions(-)
> 
> diff --git a/drivers/net/bonding/bond_main.c b/drivers/net/bonding/bond_main.c
> index 4c5b73786877..979e5aabf8d2 100644
> --- a/drivers/net/bonding/bond_main.c
> +++ b/drivers/net/bonding/bond_main.c
> @@ -452,6 +452,61 @@ static struct net_device *bond_ipsec_dev(struct xfrm_state *xs)
>  	return slave->dev;
>  }
>  
> +static struct bond_ipsec_offload*
> +bond_ipsec_dev_add_sa(struct net_device *dev, struct bond_ipsec *ipsec,
> +		      struct netlink_ext_ack *extack)
> +{
> +	struct bond_ipsec_offload *offload;
> +	int err;
> +
> +	if (!dev->xfrmdev_ops ||
> +	    !dev->xfrmdev_ops->xdo_dev_state_add ||
> +	    netif_is_bond_master(dev)) {
> +		NL_SET_ERR_MSG_MOD(extack,
> +				   "Slave does not support ipsec offload");
> +		return ERR_PTR(-EINVAL);
> +	}
> +
> +	offload = kzalloc(sizeof(*offload), GFP_KERNEL);
> +	if (!offload)
> +		return ERR_PTR(-ENOMEM);
> +
> +	offload->ipsec = ipsec;
> +	offload->dev = dev;
> +	err = dev->xfrmdev_ops->xdo_dev_state_add(dev, ipsec->xs,
> +						   &offload->handle, extack);
> +	if (err)

Here we need to free the offload.

> +		return ERR_PTR(err);
> +	return offload;
> +}
> +
> +static void bond_ipsec_dev_del_sa(struct bond_ipsec_offload *offload)
> +{
> +	struct xfrm_state *xs = offload->ipsec->xs;
> +	struct net_device *dev = offload->dev;
> +
> +	if (dev->xfrmdev_ops->xdo_dev_state_delete) {
> +		spin_lock_bh(&xs->lock);
> +		/* Don't double delete states killed by the user
> +		 * from xs->xso.real_dev.
> +		 */
> +		if (dev != xs->xso.real_dev ||
> +		    xs->km.state != XFRM_STATE_DEAD)
> +			dev->xfrmdev_ops->xdo_dev_state_delete(dev, xs,
> +							       offload->handle);
> +		if (xs->xso.real_dev == dev)
> +			xs->xso.real_dev = NULL;
> +		spin_unlock_bh(&xs->lock);
> +	}
> +
> +	if (dev->xfrmdev_ops->xdo_dev_state_free)
> +		dev->xfrmdev_ops->xdo_dev_state_free(dev, xs, offload->handle);
> +
> +	list_del(&offload->list);
> +	list_del(&offload->ipsec_list);
> +	kfree(offload);
> +}
> +
[...]

> -static void bond_ipsec_add_sa_all(struct bonding *bond)
> +static void bond_ipsec_add_sa_all(struct bonding *bond, struct slave *new_slave,
> +				  struct netlink_ext_ack *extack)
>  {
> +	struct net_device *real_dev = new_slave->dev;
>  	struct net_device *bond_dev = bond->dev;
> -	struct net_device *real_dev;
>  	struct bond_ipsec *ipsec;
>  	struct slave *slave;
>  
> -	slave = rtnl_dereference(bond->curr_active_slave);
> -	real_dev = slave ? slave->dev : NULL;
> -	if (!real_dev)
> -		return;
> +	INIT_LIST_HEAD(&new_slave->ipsec_offloads);
>  
>  	mutex_lock(&bond->ipsec_lock);
> -	if (!real_dev->xfrmdev_ops ||
> -	    !real_dev->xfrmdev_ops->xdo_dev_state_add ||
> -	    netif_is_bond_master(real_dev)) {
> -		if (!list_empty(&bond->ipsec_list))
> -			slave_warn(bond_dev, real_dev,
> -				   "%s: no slave xdo_dev_state_add\n",
> -				   __func__);
> -		goto out;
> -	}
> -
>  	list_for_each_entry(ipsec, &bond->ipsec_list, list) {
> +		struct bond_ipsec_offload *offload;
>  		struct xfrm_state *xs = ipsec->xs;
>  
> -		/* If new state is added before ipsec_lock acquired */
> -		if (xs->xso.real_dev == real_dev)
> -			continue;
> -
> -		if (real_dev->xfrmdev_ops->xdo_dev_state_add(real_dev, xs,
> -							     &xs->xso.offload_handle,
> -							     NULL)) {
> +		offload = bond_ipsec_dev_add_sa(slave->dev, ipsec, extack);

Here should be real_dev.

> +		if (IS_ERR(offload)) {
>  			slave_warn(bond_dev, real_dev, "%s: failed to add SA\n", __func__);
>  			continue;
>  		}

If we add offload failed on the slave, what would happen during migrate?

Thanks
Hangbin

  reply	other threads:[~2025-12-05  3:49 UTC|newest]

Thread overview: 5+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2025-11-21 15:16 [RFC PATCH ipsec 0/2] Fix bonding IPSec races Cosmin Ratiu
2025-11-21 15:16 ` [RFC PATCH ipsec 1/2] xfrm: Add explicit offload_handle to some xfrm callbacks Cosmin Ratiu
2025-11-21 15:16 ` [RFC PATCH ipsec 2/2] bonding: Maintain offloaded xfrm on all devices Cosmin Ratiu
2025-12-05  3:48   ` Hangbin Liu [this message]
2025-12-01  9:01 ` [RFC PATCH ipsec 0/2] Fix bonding IPSec races Steffen Klassert

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=aTJWI3aybYO-NHg5@fedora \
    --to=liuhangbin@gmail.com \
    --cc=andrew+netdev@lunn.ch \
    --cc=ap420073@gmail.com \
    --cc=cratiu@nvidia.com \
    --cc=davem@davemloft.net \
    --cc=edumazet@google.com \
    --cc=herbert@gondor.apana.org.au \
    --cc=horms@kernel.org \
    --cc=jianbol@nvidia.com \
    --cc=jv@jvosburgh.net \
    --cc=kuba@kernel.org \
    --cc=leonro@nvidia.com \
    --cc=netdev@vger.kernel.org \
    --cc=pabeni@redhat.com \
    --cc=sd@queasysnail.net \
    --cc=steffen.klassert@secunet.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.