From mboxrd@z Thu Jan 1 00:00:00 1970 Received: from fout-a5-smtp.messagingengine.com (fout-a5-smtp.messagingengine.com [103.168.172.148]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id 564F13002A8 for ; Thu, 20 Nov 2025 11:36:21 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=103.168.172.148 ARC-Seal:i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1763638585; cv=none; b=Q7chUSIgigCkkbfXmbo3d01bCNbcTHrqEWjCJj5AGE5IS4tBfquQW9f+f+f9xjJes4wmNPFAPiLTaWw0CvK/EEYBZuhOSnTIw7Q7lJSvmdAA/oW7R0JpN4Me9BuE9flGacn3Dbi0lO8zB0pdV3rhZpImaUgpGtAfQc1g8jlK9oc= ARC-Message-Signature:i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1763638585; c=relaxed/simple; bh=kgXlwnzmYFsXTZ8lHcA5mi62tkQY8RwrDdQ1rYh+2g4=; h=Date:From:To:Cc:Subject:Message-ID:References:MIME-Version: Content-Type:Content-Disposition:In-Reply-To; b=guGOgXcqlJ7qcgwM1C95onu0a+nIdY7UgRfW5vMs3s157U01bqpDeG645nDT4l1UC+JwQjvRa019YxNOPXDONiZHE5Qc5f3MUsWLm50SiqibRoXCwQ9aQFr2XIEKMYW6JrKjY0ZCIsuKPoB8iVYNCmqOlOdV/oLk7euGIQThDgs= ARC-Authentication-Results:i=1; smtp.subspace.kernel.org; dmarc=none (p=none dis=none) header.from=queasysnail.net; spf=pass smtp.mailfrom=queasysnail.net; dkim=pass (2048-bit key) header.d=queasysnail.net header.i=@queasysnail.net header.b=TnpMXkKR; dkim=pass (2048-bit key) header.d=messagingengine.com header.i=@messagingengine.com header.b=LPGF8y4Q; arc=none smtp.client-ip=103.168.172.148 Authentication-Results: smtp.subspace.kernel.org; dmarc=none (p=none dis=none) header.from=queasysnail.net Authentication-Results: smtp.subspace.kernel.org; spf=pass smtp.mailfrom=queasysnail.net Authentication-Results: smtp.subspace.kernel.org; dkim=pass (2048-bit key) header.d=queasysnail.net header.i=@queasysnail.net header.b="TnpMXkKR"; dkim=pass (2048-bit key) header.d=messagingengine.com header.i=@messagingengine.com header.b="LPGF8y4Q" Received: from phl-compute-06.internal (phl-compute-06.internal [10.202.2.46]) by mailfout.phl.internal (Postfix) with ESMTP id 37C14EC0370; Thu, 20 Nov 2025 06:36:21 -0500 (EST) Received: from phl-mailfrontend-01 ([10.202.2.162]) by phl-compute-06.internal (MEProxy); Thu, 20 Nov 2025 06:36:21 -0500 DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=queasysnail.net; h=cc:cc:content-transfer-encoding:content-type:content-type :date:date:from:from:in-reply-to:in-reply-to:message-id :mime-version:references:reply-to:subject:subject:to:to; s=fm1; t=1763638581; x=1763724981; bh=duJmX5HHOziadQf5q4kEyZjSRR1LrD50 zaH8pVR/GKo=; b=TnpMXkKRYM2qwcsuVgkhBt00zkAurS4vLdp3QkC4fmCXIT8p VIYY8ffTUffInnEBvWvX6FQAfLFtPQMSgdSIl+ggAU9W43Zrixhvo47P656funMP cFfAA/ZY/AIN3gQDKyskJkNAh4XSYrYrHD4u/ToYnYo8yvLEofDwkM2YUUGfK9al Tb/OjpwEl0Gd1dlMX/uyf0/xOYomWmBo7cRk/Xa9yP9mQH2pNG6Sxk0ldIhymoaK OTg2ADNujbDpeGMQifcmiRHCCjkQtwPeuVSTtySc10YbrtLmj7/KLWPF11eaMTew 9FEU0uMHB5Rt1NRPK21NMtfinUHPhnKb/Yn7xw== DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d= messagingengine.com; h=cc:cc:content-transfer-encoding :content-type:content-type:date:date:feedback-id:feedback-id :from:from:in-reply-to:in-reply-to:message-id:mime-version :references:reply-to:subject:subject:to:to:x-me-proxy :x-me-sender:x-me-sender:x-sasl-enc; s=fm3; t=1763638581; x= 1763724981; bh=duJmX5HHOziadQf5q4kEyZjSRR1LrD50zaH8pVR/GKo=; b=L PGF8y4QMGXARiehHOCHCVjnxv/MqLtLMFTdvUjJ1snuciB3Mf2F0Xfpdlybnx0TH KQ9UZQsxu6dfdb9TC6Zm86QwS7m7WXb2NkjRSElJ6bJVOXsvU1U6Crjrr04ZuuyQ /tTgXiLHhXC0ImcUiWs1WSlgOAyvqYjZULajfhfFboDh5+s4OTDXduCGlgnDGT3x lJC0jN3Ao2uT5lXQqSvJb7fTmqqMYpuvu0zsNz+xOjj24AgrqCBeAO5XV5dDr2pS 6T3f7q6i8ltr1SmxwXra9br1KOg1vaBWCMeJSkOWnpBRcHyIFbCXschNeAlwIrA+ vK2Ig6kC/+Ol3TT3R37Pw== X-ME-Sender: X-ME-Received: X-ME-Proxy-Cause: gggruggvucftvghtrhhoucdtuddrgeeffedrtdeggddvvdeileejucetufdoteggodetrf dotffvucfrrhhofhhilhgvmecuhfgrshhtofgrihhlpdfurfetoffkrfgpnffqhgenuceu rghilhhouhhtmecufedttdenucesvcftvggtihhpihgvnhhtshculddquddttddmnecujf gurhepfffhvfevuffkfhggtggugfgjsehtkeertddttdejnecuhfhrohhmpefurggsrhhi nhgrucffuhgsrhhotggruceoshgusehquhgvrghshihsnhgrihhlrdhnvghtqeenucggtf frrghtthgvrhhnpefgvdegieetffefvdfguddtleegiefhgeeuheetveevgeevjeduleef ffeiheelvdenucevlhhushhtvghrufhiiigvpedtnecurfgrrhgrmhepmhgrihhlfhhroh hmpehsugesqhhuvggrshihshhnrghilhdrnhgvthdpnhgspghrtghpthhtohepudegpdhm ohguvgepshhmthhpohhuthdprhgtphhtthhopegtrhgrthhiuhesnhhvihguihgrrdgtoh hmpdhrtghpthhtohepshhtvghffhgvnhdrkhhlrghsshgvrhhtsehsvggtuhhnvghtrdgt ohhmpdhrtghpthhtoheprghnughrvgifodhnvghtuggvvheslhhunhhnrdgthhdprhgtph htthhopegurghvvghmsegurghvvghmlhhofhhtrdhnvghtpdhrtghpthhtoheprghpgedv tddtjeefsehgmhgrihhlrdgtohhmpdhrtghpthhtohephhgvrhgsvghrthesghhonhguoh hrrdgrphgrnhgrrdhorhhgrdgruhdprhgtphhtthhopehlvghonhhrohesnhhvihguihgr rdgtohhmpdhrtghpthhtohepjhhvsehjvhhoshgsuhhrghhhrdhnvghtpdhrtghpthhtoh epkhhusggrsehkvghrnhgvlhdrohhrgh X-ME-Proxy: Feedback-ID: i934648bf:Fastmail Received: by mail.messagingengine.com (Postfix) with ESMTPA; Thu, 20 Nov 2025 06:36:18 -0500 (EST) Date: Thu, 20 Nov 2025 12:36:16 +0100 From: Sabrina Dubroca To: Cosmin Ratiu , steffen.klassert@secunet.com Cc: "andrew+netdev@lunn.ch" , "davem@davemloft.net" , "ap420073@gmail.com" , "herbert@gondor.apana.org.au" , Leon Romanovsky , "jv@jvosburgh.net" , "kuba@kernel.org" , "horms@kernel.org" , "edumazet@google.com" , "netdev@vger.kernel.org" , Jianbo Liu , "pabeni@redhat.com" Subject: Re: [PATCH ipsec v2 1/2] bond: Use xfrm_state_migrate to migrate SAs Message-ID: References: <20251113104310.1243150-1-cratiu@nvidia.com> <88f2bf5ef1977fcdd4c87051cd54a4545db993da.camel@nvidia.com> Precedence: bulk X-Mailing-List: netdev@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 Content-Type: text/plain; charset=utf-8 Content-Disposition: inline Content-Transfer-Encoding: 8bit In-Reply-To: <88f2bf5ef1977fcdd4c87051cd54a4545db993da.camel@nvidia.com> 2025-11-17, 12:48:20 +0000, Cosmin Ratiu wrote: > On Fri, 2025-11-14 at 13:56 +0100, Sabrina Dubroca wrote: > > 2025-11-13, 12:43:09 +0200, Cosmin Ratiu wrote: > > > The bonding driver manages offloaded SAs using the following > > > strategy: > > > > > > An xfrm_state offloaded on the bond device with bond_ipsec_add_sa() > > > uses > > > 'real_dev' on the xfrm_state xs to redirect the offload to the > > > current > > > active slave. The corresponding bond_ipsec_del_sa() (called with > > > the xs > > > spinlock held) redirects the unoffload call to real_dev. > > > > > > > Finally, > > > cleanup happens in bond_ipsec_free_sa(), which removes the offload > > > from > > > the device. Since the last call happens without the xs spinlock > > > held, > > > that is where the real work to unoffload actually happens. > > > > Not on all devices (some don't even implement xdo_dev_state_free). > > You're right. Looking at what the stack does: > xfrm_state_delete() immediately calls xdo_dev_state_delete(), but > leaves xdo_dev_state_free() for when there are no more refs (in-flight > tx packets are done). > xfrm_dev_state_flush() forces an xfrm_dev_state_free() immediately > after deleting xs. I guess the goal is to clean up *now* everything > from 'dev'. Yes, see 07b87f9eea0c ("xfrm: Fix unregister netdevice hang on hardware offload.") (though I'm not sure it's still needed, now that TCP drops the secpath early: 9b6412e6979f) > All other callers of xfrm_state_delete() don't care about free, it will > be done when there are no more refs. > > So right now for devices that implement xdo_dev_state_free(), there's > distinct behavior of what happens when xfrm_state_delete gets called > > So right now, there's a difference in behavior for what happens with > in-flight packets when xfrm_state_delete() is called: > 1. On devs which delete the dev state in xdo_dev_state_free(), in- > flight packets are not affected. > 2. On devs which delete the dev state in xdo_dev_state_delete(), in- > flight packets will see the xs yanked from underneath them. > > This makes me ask the question: Is there a point to the > xdo_dev_state_delete() callback any more? Couldn't we consolidate on > having a single callback to free the offloaded xfrm_state when there > are no more references to it? This would simplify the delete+free dance > and would leave proper cleanup for the xs reference counting. > > What am I missing? I don't know. Maybe it's a leftover of the initial offload implementation/drivers that we don't need anymore? Steffen? [...] > > > With the new approach, in-use states on old_dev will not be deleted > > > until in-flight packets are transmitted. > > > > How does this guarantee it? It would be good to describe how the new > > approach closes the race with a bit more info than "use > > xfrm_state_migrate". > > > > And I don't think we currently guarantee that packets using offload > > will be fully transmitted before xdo_dev_state_delete was called in > > case of deletion.  > > Apologies for leaving this part out, yeah, it's pretty important. > I changed the descriptions for the next versions, here's what happens: > In-flight offloaded tx packets hold a reference to the used xfrm_state > via xfrm_output -> xfrm_state_hold which gets released when the > completion arrives via napi_consume_skb -...-> skb_ext_put -> > skb_ext_put_sp -> xfrm_state_put. > > But this doesn't work on devices which do the dev state deletion in > xdo_dev_state_delete(), because those might get their SAs yanked from > the device during the xfrm_state_delete() added in this patch. I guess > this ties to the previous point: Shouldn't there be only > xdo_dev_state_delete which touches the device when refcount is 0? > > > > But ok, the bond case is worse due to the add/delete > > dance when we change the active slave (and there's still the possible > > issue Steffen mentioned a while ago, that this delete/add dance may > > not be valid at all depending on how the HW behaves wrt IVs). > > I am aware of that issue, I am not trying to change any of that. Just > trying to improve bond from a security perspective. Sure. But I'm not sure we can make it really trustworthy... > I don't think it's > ok for it to send out unencrypted IPSec packets. Agree. > > > It also makes for cleaner > > > bonding code, which no longer needs to care about xfrm_state > > > management > > > so much. > > > > But using the migrate code for that feels kind of hacky, and the 2nd > > patch in this set also looks quite hacky. > > It's less hacky than the manual xfrm state management done so far. At > least bonding no longer needs to care so much about the semantics of > the xfrm dev state operations. And it no longer needs to acquire the > xs->lock (what does bonding have to do with an internal xfrm_state lock > anyway?) To me, it's hacky in the sense that we're hijacking the migrate code that isn't intended for that, and triggering core xfrm operations from inside a driver (and without proper locking). But true, the current code is also hacky. I think a better solution might be to find a way to use the "per-resource" SA code for bonding (currently implemented for "per-CPU" SAs, but a resource could be a lower device). Then we don't have to worry about moving states from one link to another, but it requires userspace cooperation. > > And doing all that without protection against admin operations on the > > xfrm state objects doesn't seem safe. > > > > Thinking about the migrate behavior, if we fail to create/offload the > > new state: > >  - old state will be deleted > >  - new state won't be created > > > > So any packet we send afterwards that would need to use this SA will > > just get dropped? (while the old behavior was "no more offload until > > we change the active slave again"?) > > This is not ideal, I agree. Perhaps instead of giving up on the failed > xs there could be an alternate migration path where we call > xdo_dev_state_free+xdo_dev_state_add like before? Ick, I don't really > like that. > > Alternatively, I have implemented another fix to these races, which is > to change xs.xso to be able to be offloaded on multiple devices at the > same time (nothing fancy, just parameter changes to xdo ops) and > changing the bonding driver to maintain a single offloaded xfrm_state > on *all* slaves (using bonding data structs). Changing the active slave > then becomes as simple as updating the esn on the new device (to get > that device state up to speed). > Leon and I discussed about that and he suggested it would be better to > use xfrm_state_migrate, since that is an existing well-understood > workflow. > Do you think that approach is worth pursuing instead? I could send them > patches as RFC for discussion. You can go ahead and share them if you want, but the short description above kind of puzzles/scares me. This whole feature is really a mess :/ -- Sabrina