From: Sabrina Dubroca <sd@queasysnail.net>
To: Cosmin Ratiu <cratiu@nvidia.com>, steffen.klassert@secunet.com
Cc: "andrew+netdev@lunn.ch" <andrew+netdev@lunn.ch>,
"davem@davemloft.net" <davem@davemloft.net>,
"ap420073@gmail.com" <ap420073@gmail.com>,
"herbert@gondor.apana.org.au" <herbert@gondor.apana.org.au>,
Leon Romanovsky <leonro@nvidia.com>,
"jv@jvosburgh.net" <jv@jvosburgh.net>,
"kuba@kernel.org" <kuba@kernel.org>,
"horms@kernel.org" <horms@kernel.org>,
"edumazet@google.com" <edumazet@google.com>,
"netdev@vger.kernel.org" <netdev@vger.kernel.org>,
Jianbo Liu <jianbol@nvidia.com>,
"pabeni@redhat.com" <pabeni@redhat.com>
Subject: Re: [PATCH ipsec v2 1/2] bond: Use xfrm_state_migrate to migrate SAs
Date: Thu, 20 Nov 2025 12:36:16 +0100 [thread overview]
Message-ID: <aR79MCBdyx2oTcp2@krikkit> (raw)
In-Reply-To: <88f2bf5ef1977fcdd4c87051cd54a4545db993da.camel@nvidia.com>
2025-11-17, 12:48:20 +0000, Cosmin Ratiu wrote:
> On Fri, 2025-11-14 at 13:56 +0100, Sabrina Dubroca wrote:
> > 2025-11-13, 12:43:09 +0200, Cosmin Ratiu wrote:
> > > The bonding driver manages offloaded SAs using the following
> > > strategy:
> > >
> > > An xfrm_state offloaded on the bond device with bond_ipsec_add_sa()
> > > uses
> > > 'real_dev' on the xfrm_state xs to redirect the offload to the
> > > current
> > > active slave. The corresponding bond_ipsec_del_sa() (called with
> > > the xs
> > > spinlock held) redirects the unoffload call to real_dev.
> >
> >
> > > Finally,
> > > cleanup happens in bond_ipsec_free_sa(), which removes the offload
> > > from
> > > the device. Since the last call happens without the xs spinlock
> > > held,
> > > that is where the real work to unoffload actually happens.
> >
> > Not on all devices (some don't even implement xdo_dev_state_free).
>
> You're right. Looking at what the stack does:
> xfrm_state_delete() immediately calls xdo_dev_state_delete(), but
> leaves xdo_dev_state_free() for when there are no more refs (in-flight
> tx packets are done).
> xfrm_dev_state_flush() forces an xfrm_dev_state_free() immediately
> after deleting xs. I guess the goal is to clean up *now* everything
> from 'dev'.
Yes, see 07b87f9eea0c ("xfrm: Fix unregister netdevice hang on hardware offload.")
(though I'm not sure it's still needed, now that TCP drops the secpath
early: 9b6412e6979f)
> All other callers of xfrm_state_delete() don't care about free, it will
> be done when there are no more refs.
>
> So right now for devices that implement xdo_dev_state_free(), there's
> distinct behavior of what happens when xfrm_state_delete gets called
>
> So right now, there's a difference in behavior for what happens with
> in-flight packets when xfrm_state_delete() is called:
> 1. On devs which delete the dev state in xdo_dev_state_free(), in-
> flight packets are not affected.
> 2. On devs which delete the dev state in xdo_dev_state_delete(), in-
> flight packets will see the xs yanked from underneath them.
>
> This makes me ask the question: Is there a point to the
> xdo_dev_state_delete() callback any more? Couldn't we consolidate on
> having a single callback to free the offloaded xfrm_state when there
> are no more references to it? This would simplify the delete+free dance
> and would leave proper cleanup for the xs reference counting.
>
> What am I missing?
I don't know. Maybe it's a leftover of the initial offload
implementation/drivers that we don't need anymore? Steffen?
[...]
> > > With the new approach, in-use states on old_dev will not be deleted
> > > until in-flight packets are transmitted.
> >
> > How does this guarantee it? It would be good to describe how the new
> > approach closes the race with a bit more info than "use
> > xfrm_state_migrate".
> >
> > And I don't think we currently guarantee that packets using offload
> > will be fully transmitted before xdo_dev_state_delete was called in
> > case of deletion.
>
> Apologies for leaving this part out, yeah, it's pretty important.
> I changed the descriptions for the next versions, here's what happens:
> In-flight offloaded tx packets hold a reference to the used xfrm_state
> via xfrm_output -> xfrm_state_hold which gets released when the
> completion arrives via napi_consume_skb -...-> skb_ext_put ->
> skb_ext_put_sp -> xfrm_state_put.
>
> But this doesn't work on devices which do the dev state deletion in
> xdo_dev_state_delete(), because those might get their SAs yanked from
> the device during the xfrm_state_delete() added in this patch. I guess
> this ties to the previous point: Shouldn't there be only
> xdo_dev_state_delete which touches the device when refcount is 0?
>
>
> > But ok, the bond case is worse due to the add/delete
> > dance when we change the active slave (and there's still the possible
> > issue Steffen mentioned a while ago, that this delete/add dance may
> > not be valid at all depending on how the HW behaves wrt IVs).
>
> I am aware of that issue, I am not trying to change any of that. Just
> trying to improve bond from a security perspective.
Sure. But I'm not sure we can make it really trustworthy...
> I don't think it's
> ok for it to send out unencrypted IPSec packets.
Agree.
> > > It also makes for cleaner
> > > bonding code, which no longer needs to care about xfrm_state
> > > management
> > > so much.
> >
> > But using the migrate code for that feels kind of hacky, and the 2nd
> > patch in this set also looks quite hacky.
>
> It's less hacky than the manual xfrm state management done so far. At
> least bonding no longer needs to care so much about the semantics of
> the xfrm dev state operations. And it no longer needs to acquire the
> xs->lock (what does bonding have to do with an internal xfrm_state lock
> anyway?)
To me, it's hacky in the sense that we're hijacking the migrate code
that isn't intended for that, and triggering core xfrm operations from
inside a driver (and without proper locking). But true, the current
code is also hacky.
I think a better solution might be to find a way to use the
"per-resource" SA code for bonding (currently implemented for
"per-CPU" SAs, but a resource could be a lower device). Then we don't
have to worry about moving states from one link to another, but it
requires userspace cooperation.
> > And doing all that without protection against admin operations on the
> > xfrm state objects doesn't seem safe.
> >
> > Thinking about the migrate behavior, if we fail to create/offload the
> > new state:
> > - old state will be deleted
> > - new state won't be created
> >
> > So any packet we send afterwards that would need to use this SA will
> > just get dropped? (while the old behavior was "no more offload until
> > we change the active slave again"?)
>
> This is not ideal, I agree. Perhaps instead of giving up on the failed
> xs there could be an alternate migration path where we call
> xdo_dev_state_free+xdo_dev_state_add like before? Ick, I don't really
> like that.
>
> Alternatively, I have implemented another fix to these races, which is
> to change xs.xso to be able to be offloaded on multiple devices at the
> same time (nothing fancy, just parameter changes to xdo ops) and
> changing the bonding driver to maintain a single offloaded xfrm_state
> on *all* slaves (using bonding data structs). Changing the active slave
> then becomes as simple as updating the esn on the new device (to get
> that device state up to speed).
> Leon and I discussed about that and he suggested it would be better to
> use xfrm_state_migrate, since that is an existing well-understood
> workflow.
> Do you think that approach is worth pursuing instead? I could send them
> patches as RFC for discussion.
You can go ahead and share them if you want, but the short description
above kind of puzzles/scares me.
This whole feature is really a mess :/
--
Sabrina
next prev parent reply other threads:[~2025-11-20 11:36 UTC|newest]
Thread overview: 6+ messages / expand[flat|nested] mbox.gz Atom feed top
2025-11-13 10:43 [PATCH ipsec v2 1/2] bond: Use xfrm_state_migrate to migrate SAs Cosmin Ratiu
2025-11-13 10:43 ` [PATCH ipsec v2 2/2] bond: Use a separate xfrm_active_slave pointer Cosmin Ratiu
2025-11-14 12:56 ` [PATCH ipsec v2 1/2] bond: Use xfrm_state_migrate to migrate SAs Sabrina Dubroca
2025-11-17 12:48 ` Cosmin Ratiu
2025-11-20 11:36 ` Sabrina Dubroca [this message]
2025-12-01 8:42 ` Steffen Klassert
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=aR79MCBdyx2oTcp2@krikkit \
--to=sd@queasysnail.net \
--cc=andrew+netdev@lunn.ch \
--cc=ap420073@gmail.com \
--cc=cratiu@nvidia.com \
--cc=davem@davemloft.net \
--cc=edumazet@google.com \
--cc=herbert@gondor.apana.org.au \
--cc=horms@kernel.org \
--cc=jianbol@nvidia.com \
--cc=jv@jvosburgh.net \
--cc=kuba@kernel.org \
--cc=leonro@nvidia.com \
--cc=netdev@vger.kernel.org \
--cc=pabeni@redhat.com \
--cc=steffen.klassert@secunet.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).