From: Sabrina Dubroca <sd@queasysnail.net>
To: Cosmin Ratiu <cratiu@nvidia.com>, steffen.klassert@secunet.com
Cc: "andrew+netdev@lunn.ch" <andrew+netdev@lunn.ch>,
"davem@davemloft.net" <davem@davemloft.net>,
"ap420073@gmail.com" <ap420073@gmail.com>,
"herbert@gondor.apana.org.au" <herbert@gondor.apana.org.au>,
Leon Romanovsky <leonro@nvidia.com>,
"jv@jvosburgh.net" <jv@jvosburgh.net>,
"kuba@kernel.org" <kuba@kernel.org>,
"horms@kernel.org" <horms@kernel.org>,
"edumazet@google.com" <edumazet@google.com>,
"netdev@vger.kernel.org" <netdev@vger.kernel.org>,
Jianbo Liu <jianbol@nvidia.com>,
"pabeni@redhat.com" <pabeni@redhat.com>
Subject: Re: [PATCH ipsec v2 1/2] bond: Use xfrm_state_migrate to migrate SAs
Date: Thu, 20 Nov 2025 12:36:16 +0100 [thread overview]
Message-ID: <aR79MCBdyx2oTcp2@krikkit> (raw)
In-Reply-To: <88f2bf5ef1977fcdd4c87051cd54a4545db993da.camel@nvidia.com>
2025-11-17, 12:48:20 +0000, Cosmin Ratiu wrote:
> On Fri, 2025-11-14 at 13:56 +0100, Sabrina Dubroca wrote:
> > 2025-11-13, 12:43:09 +0200, Cosmin Ratiu wrote:
> > > The bonding driver manages offloaded SAs using the following
> > > strategy:
> > >
> > > An xfrm_state offloaded on the bond device with bond_ipsec_add_sa()
> > > uses
> > > 'real_dev' on the xfrm_state xs to redirect the offload to the
> > > current
> > > active slave. The corresponding bond_ipsec_del_sa() (called with
> > > the xs
> > > spinlock held) redirects the unoffload call to real_dev.
> >
> >
> > > Finally,
> > > cleanup happens in bond_ipsec_free_sa(), which removes the offload
> > > from
> > > the device. Since the last call happens without the xs spinlock
> > > held,
> > > that is where the real work to unoffload actually happens.
> >
> > Not on all devices (some don't even implement xdo_dev_state_free).
>
> You're right. Looking at what the stack does:
> xfrm_state_delete() immediately calls xdo_dev_state_delete(), but
> leaves xdo_dev_state_free() for when there are no more refs (in-flight
> tx packets are done).
> xfrm_dev_state_flush() forces an xfrm_dev_state_free() immediately
> after deleting xs. I guess the goal is to clean up *now* everything
> from 'dev'.
Yes, see 07b87f9eea0c ("xfrm: Fix unregister netdevice hang on hardware offload.")
(though I'm not sure it's still needed, now that TCP drops the secpath
early: 9b6412e6979f)
> All other callers of xfrm_state_delete() don't care about free, it will
> be done when there are no more refs.
>
> So right now for devices that implement xdo_dev_state_free(), there's
> distinct behavior of what happens when xfrm_state_delete gets called
>
> So right now, there's a difference in behavior for what happens with
> in-flight packets when xfrm_state_delete() is called:
> 1. On devs which delete the dev state in xdo_dev_state_free(), in-
> flight packets are not affected.
> 2. On devs which delete the dev state in xdo_dev_state_delete(), in-
> flight packets will see the xs yanked from underneath them.
>
> This makes me ask the question: Is there a point to the
> xdo_dev_state_delete() callback any more? Couldn't we consolidate on
> having a single callback to free the offloaded xfrm_state when there
> are no more references to it? This would simplify the delete+free dance
> and would leave proper cleanup for the xs reference counting.
>
> What am I missing?
I don't know. Maybe it's a leftover of the initial offload
implementation/drivers that we don't need anymore? Steffen?
[...]
> > > With the new approach, in-use states on old_dev will not be deleted
> > > until in-flight packets are transmitted.
> >
> > How does this guarantee it? It would be good to describe how the new
> > approach closes the race with a bit more info than "use
> > xfrm_state_migrate".
> >
> > And I don't think we currently guarantee that packets using offload
> > will be fully transmitted before xdo_dev_state_delete was called in
> > case of deletion.
>
> Apologies for leaving this part out, yeah, it's pretty important.
> I changed the descriptions for the next versions, here's what happens:
> In-flight offloaded tx packets hold a reference to the used xfrm_state
> via xfrm_output -> xfrm_state_hold which gets released when the
> completion arrives via napi_consume_skb -...-> skb_ext_put ->
> skb_ext_put_sp -> xfrm_state_put.
>
> But this doesn't work on devices which do the dev state deletion in
> xdo_dev_state_delete(), because those might get their SAs yanked from
> the device during the xfrm_state_delete() added in this patch. I guess
> this ties to the previous point: Shouldn't there be only
> xdo_dev_state_delete which touches the device when refcount is 0?
>
>
> > But ok, the bond case is worse due to the add/delete
> > dance when we change the active slave (and there's still the possible
> > issue Steffen mentioned a while ago, that this delete/add dance may
> > not be valid at all depending on how the HW behaves wrt IVs).
>
> I am aware of that issue, I am not trying to change any of that. Just
> trying to improve bond from a security perspective.
Sure. But I'm not sure we can make it really trustworthy...
> I don't think it's
> ok for it to send out unencrypted IPSec packets.
Agree.
> > > It also makes for cleaner
> > > bonding code, which no longer needs to care about xfrm_state
> > > management
> > > so much.
> >
> > But using the migrate code for that feels kind of hacky, and the 2nd
> > patch in this set also looks quite hacky.
>
> It's less hacky than the manual xfrm state management done so far. At
> least bonding no longer needs to care so much about the semantics of
> the xfrm dev state operations. And it no longer needs to acquire the
> xs->lock (what does bonding have to do with an internal xfrm_state lock
> anyway?)
To me, it's hacky in the sense that we're hijacking the migrate code
that isn't intended for that, and triggering core xfrm operations from
inside a driver (and without proper locking). But true, the current
code is also hacky.
I think a better solution might be to find a way to use the
"per-resource" SA code for bonding (currently implemented for
"per-CPU" SAs, but a resource could be a lower device). Then we don't
have to worry about moving states from one link to another, but it
requires userspace cooperation.
> > And doing all that without protection against admin operations on the
> > xfrm state objects doesn't seem safe.
> >
> > Thinking about the migrate behavior, if we fail to create/offload the
> > new state:
> > - old state will be deleted
> > - new state won't be created
> >
> > So any packet we send afterwards that would need to use this SA will
> > just get dropped? (while the old behavior was "no more offload until
> > we change the active slave again"?)
>
> This is not ideal, I agree. Perhaps instead of giving up on the failed
> xs there could be an alternate migration path where we call
> xdo_dev_state_free+xdo_dev_state_add like before? Ick, I don't really
> like that.
>
> Alternatively, I have implemented another fix to these races, which is
> to change xs.xso to be able to be offloaded on multiple devices at the
> same time (nothing fancy, just parameter changes to xdo ops) and
> changing the bonding driver to maintain a single offloaded xfrm_state
> on *all* slaves (using bonding data structs). Changing the active slave
> then becomes as simple as updating the esn on the new device (to get
> that device state up to speed).
> Leon and I discussed about that and he suggested it would be better to
> use xfrm_state_migrate, since that is an existing well-understood
> workflow.
> Do you think that approach is worth pursuing instead? I could send them
> patches as RFC for discussion.
You can go ahead and share them if you want, but the short description
above kind of puzzles/scares me.
This whole feature is really a mess :/
--
Sabrina
next prev parent reply other threads:[~2025-11-20 11:36 UTC|newest]
Thread overview: 6+ messages / expand[flat|nested] mbox.gz Atom feed top
2025-11-13 10:43 [PATCH ipsec v2 1/2] bond: Use xfrm_state_migrate to migrate SAs Cosmin Ratiu
2025-11-13 10:43 ` [PATCH ipsec v2 2/2] bond: Use a separate xfrm_active_slave pointer Cosmin Ratiu
2025-11-14 12:56 ` [PATCH ipsec v2 1/2] bond: Use xfrm_state_migrate to migrate SAs Sabrina Dubroca
2025-11-17 12:48 ` Cosmin Ratiu
2025-11-20 11:36 ` Sabrina Dubroca [this message]
2025-12-01 8:42 ` Steffen Klassert
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=aR79MCBdyx2oTcp2@krikkit \
--to=sd@queasysnail.net \
--cc=andrew+netdev@lunn.ch \
--cc=ap420073@gmail.com \
--cc=cratiu@nvidia.com \
--cc=davem@davemloft.net \
--cc=edumazet@google.com \
--cc=herbert@gondor.apana.org.au \
--cc=horms@kernel.org \
--cc=jianbol@nvidia.com \
--cc=jv@jvosburgh.net \
--cc=kuba@kernel.org \
--cc=leonro@nvidia.com \
--cc=netdev@vger.kernel.org \
--cc=pabeni@redhat.com \
--cc=steffen.klassert@secunet.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.