Netdev List
* Re: [PATCH] crypto: af_alg - Document the deprecation of AF_ALG
From: Eric Biggers @ 2026-06-23 16:49 UTC (permalink / raw)
  To: Bastien Nocera
  Cc: linux-crypto, Herbert Xu, Marcel Holtmann, Luiz Augusto von Dentz,
	linux-doc, linux-api, linux-kernel, netdev, Linus Torvalds,
	linux-bluetooth, ell
In-Reply-To: <7d08a6df54279e9915f5df6bd4e5e5dde52b4fe1.camel@hadess.net>

On Tue, Jun 23, 2026 at 02:44:28PM +0200, Bastien Nocera wrote:
> Hey,
> 
> Replying to this older patch.
> 
> On Wed, 2026-04-29 at 18:15 -0700, Eric Biggers wrote:
> <snip>
> > This isn't intended to change anything overnight.  After all, most Linux
> > distros won't be able to disable the kconfig options quite yet, mainly
> > because of iwd.  But this should create a bit more impetus for these
> > userspace programs to be fixed, and the documentation update should also
> > help prevent more users from appearing.
> 
> There are 2 other users that I know of: bluez, and the ell library
> (used by iwd and bluez).
>
> From what I could tell, bluetoothd uses AF_ALG for cryptography:
> https://git.kernel.org/pub/scm/bluetooth/bluez.git/tree/src/shared/crypto.c
> https://git.kernel.org/pub/scm/bluetooth/bluez.git/tree/tools/mesh-gatt/crypto.c
> 
> It uses "ecb(aes)" and "cmac(aes)" as algorithms.
> 
> Finally, it also uses them both again:
> https://git.kernel.org/pub/scm/bluetooth/bluez.git/tree/mesh/crypto.c
> through ell:
> https://git.kernel.org/pub/scm/libs/ell/ell.git/tree/ell/cipher.c
> 
> Because that's a question that also came up, bluetoothd also uses the
> CAP_NET_ADMIN capability.
> 
> I'll let Luiz and Marcel take it over from here.
> 

We're aware of that and are taking it into account in the allowlist:
https://lore.kernel.org/linux-crypto/20260622234803.6982-1-ebiggers@kernel.org/
If you have any feedback on the allowlist, please respond to that patch.

- Eric


* Re: [RFC net-next 08/15] ipxlat: add translation engine and dispatch core
From: Ralf Lici @ 2026-06-23 16:36 UTC (permalink / raw)
  To: Toke Høiland-Jørgensen
  Cc: netdev, Daniel Gröber, Antonio Quartulli, Andrew Lunn,
	David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
	linux-kernel, Pablo Neira Ayuso, Florian Westphal, Phil Sutter,
	Beniamino Galvani
In-Reply-To: <87ik7aej6f.fsf@toke.dk>

On Mon, 22 Jun 2026 16:36:24 +0200, Toke Høiland-Jørgensen <toke@kernel.org> wrote:
> >> > My second concern is that the SIIT boundary would be a property of
> >> > rule and hook placement. That gives flexibility, but it also means the
> >> > translation point has to be constrained and documented very carefully
> >> > to avoid ambiguous TTL/Hop Limit, PMTU/ICMP, and hook-order behavior.
> >> > For this use case I would rather have the route that matches the
> >> > translation prefix also be the object that says: leave this family
> >> > here and continue in the other one.
> >>
> >> Yeah, with flexibility comes the ability to shoot yourself in the foot.
> >> But that's not really different from much of the other functionality we
> >> have in the kernel today, is it? For netfilter in particular it's
> >> certainly possible to configure a broken NAT configuration that leads to
> >> packet drops (or just invalid packets being sent out on a network
> >> device).
> >>
> >
> > True, misconfiguration is always possible and that alone is not an
> > argument against the netfilter model. But what do we actually gain in
> > capability from that flexibility? I agree on the UX argument (an admin
> > would look in nft first), but in terms of what the feature can do, I
> > can't yet see what the nft model unlocks. More on this just below.
> >
> >> > After looking at the available kernel mechanisms again, I think the
> >> > better model is probably LWT: routes carry an ipxlat encap referencing a
> >> > named translator domain configured over netlink. That should represent
> >> > the stateless, prefix-based and symmetric nature of ipxlat.
> >>
> >> I think this description actually hits the nail on the head: What are we
> >> implementing here? Is it a product feature, or a building block for one?
> >> The properties you mention wrt consistency, symmetry etc are properties
> >> of the high-level feature (which is also generally the level things are
> >> specified in RFCs). Whereas other packet mangling features in the kernel
> >> are more in the "building block" category, where it's possible to
> >> configure things to implement a particular feature set / compliance with
> >> a particular RFC, but it's also possible to do things that are outside
> >> of that.
> >>
> >> I think this relates to the "mechanism, not policy" approach that we
> >> take to most things in the kernel: implement the building blocks to do
> >> something in the most general way we can, and then leave it up to
> >> userspace to configure things in a way that results in a consistent
> >> high-level system behaviour.
> >>
> >
> > That's a good point, and I agree that we should not bake a high-level
> > product policy into the kernel if what we need is a reusable mechanism
> > (the LWT idea was my attempt at exactly that). What I am still trying to
> > understand is whether there is a useful generic trigger for stateless
> > cross-family translation beyond the route/prefix/policy-routing cases.
> >
> > Routes and policy routing already cover the selectors I can make
> > coherent for a stateless, per-packet translator: destination/source
> > prefix, iif/oif/VRF, mark, TOS/DSCP, and so on. nft can of course match
> > much more than that, but the additional selectors that would materially
> > change the translation decision seem to be selectors such as L4 fields,
> > payload state, or conntrack state. Those are exactly the selectors I am
> > struggling to make correct for a stateless translator:
> >
> > - non-first fragments carry no L4 header at all, yet the translator must
> >   rewrite every fragment (an nft ... tcp dport trigger cannot fire on
> >   them);
> >
> > - ICMP errors must be translated too, but the flow identity lives in the
> >   quoted inner header (reversed), not in anything an L4/ct match on the
> >   error packet can see, and there is no conntrack to associate them,
> >   since this is stateless.
>
> True in principle, but if (say) you deploy this on a network that is
> configured so it will never fragment packets, this won't be an issue in
> practice.
>
> I.e., you're quite right that arbitrary matching criteria cannot be
> guaranteed to result in coherent translation. But I think that goes into
> the "use it wrong, get wrong results" bin. E.g., if you match on
> something that results in only a subset of the packets of a flow being
> translated, well, only that subset of the packets will make it to the
> destination. The SIIT translator itself should not try to fix this, but
> neither should it prevent it; that's what I mean by "building block" -
> it's up to the builder using the blocks to make sure the building
> doesn't collapse, that's out of scope for the block manufacturer to
> worry about :)
>

I agree with that framing. The translation core should not try to prove
that the surrounding policy describes a coherent SIIT deployment.

> > So an L4-conditional trigger does not look like a good primitive for
> > correct stateless SIIT unless the action also defragments/refragments or
> > uses conntrack-like state. Those may be valid mechanisms, but they move
> > the design away from the stateless per-packet SIIT boundary this RFC is
> > trying to model.
> >
> > So my first question is: is there a useful nft configuration this should
> > enable that is not naturally expressible as route selection, while still
> > remaining stateless SIIT rather than a NAT64-like stateful feature?
> > Maybe there is a real use case there, but I cannot construct one yet.
>
> So the poster child for "match on arbitrary criteria" is of course BPF.
> You can write BPF programs that match on arbitrary parts of the packet
> header, custom encapsulation headers, or even on out of band things like
> system state, phase of the moon, or what have you. And we should
> certainly allow a BPF program to make the decision on whether to perform
> the SIIT translation.
>
> Which... maybe is an argument to keep it as a device like you do in this
> RFC series? Redirecting to a device is trivially supported from TC-BPF,
> which also makes it possible to use the translation mechanism without
> going through the routing subsystem at all, saving a bit of overhead.
> Whereas making it a route action ties it very closely to the routing
> subsystem.
>
> WDYT?
>

I see the netdevice appeal for this, especially as a BPF redirect
target. But as we discussed earlier, the device model has some real
problems: the device selected by the first route is not the real
post-translation egress, so the model ends up doing translation and
reinjection rather than normal transmission. Concretely:

- it needs synthetic routing state purely to get things like MTU for
  fragmentation, because the real post-translation nexthop is not known
  at translation time;

- TTL/Hop Limit handling gets harder to reason about because the packet
  has effectively gone through two routing decisions;

- rx/tx stats can't be made meaningful for a direction-agnostic device
  whose ndo_start_xmit is really "translate and receive";

- and the setup is not very obvious: create an interface, route packets
  to it, then have them come back translated.

None of these is fatal on its own, but together they make me think the
abstraction does not quite fit.

On the BPF point specifically: I agree a BPF program should be able to
decide whether to translate. What I am less sure about is whether
redirecting to a netdevice is the best way to expose that. A TC action
(yet another model, I know :)) gives you the same thing in-pipeline and
more directly:

    tc filter add dev wwan0 egress \
        bpf obj match.o action ipxlat4to6 domain clat0

Let BPF make the policy decision, with the native action doing the
translation work that the current BPF CLAT implementations have trouble
with: fragmentation, checksum corner cases, and ICMP error inner
headers (as explained by Beniamino).

So TC clsact looks like the natural in-kernel replacement for today's
TC-BPF CLAT programs: no extra netdev, you attach to the existing
uplink, direction is explicit, and on egress you sit on the real route
dst, so the synthetic-dst and double-routing problems above just don't
arise. The cost is more moving parts than a single bpf_redirect since
userspace has to manage clsact, filters, priorities and action
lifecycle/cleanup.

For a gateway translator, though, I still think a device-bound model is
less natural. There the translation point is more like a forwarding
decision across routes and nexthops, so a route/LWT attachment, or
possibly a netfilter attachment, seems easier to reason about. Also, as
you already pointed out while discussing LWT, an admin setting up NAT64
is more likely to reach for an nft rule than for a clsact filter on a
specific device.

Taking a step back, ipxlat is really a generic translation engine plus a
thin harness around it. So rather than pick one attachment, it might be
worth structuring the engine so different harnesses can drive it.
There's interesting precedent for this shape:

- ILA, again, is the closest sibling: stateless IPv6 address translation
  with a shared core in ila_common.c, driven both by an LWT frontend in
  ila_lwt.c and by an inline netfilter hook with a netlink-configured
  mapping table in ila_xlat.c.

- act_ct is the precedent for the TC side specifically: a TC action that
  reuses the netfilter conntrack engine rather than reimplementing it.

And act_nat is the cautionary counter-example: a standalone TC
reimplementation of stateless NAT that shares no code with nf_nat, and
carries a "would be nice to share code" comment :)

So I am wondering whether the right direction is to factor the
translation engine cleanly, land it with one harness first, and keep the
other attachment points as follow-up work once the core semantics are
settled.

Does that direction seem reasonable to you?

-- 
Ralf Lici
Mandelbit Srl


* Re: [PATCH net-next v3 0/2] net: phy: sfp/mdio-i2c: defer RollBall probe + fix mii_bus leak
From: Maxime Chevallier @ 2026-06-23 16:34 UTC (permalink / raw)
  To: Petr Wozniak, linux, andrew, hkallweit1
  Cc: kuba, davem, edumazet, pabeni, netdev, linux-kernel, linux-phy,
	bjorn, olek2, kabel
In-Reply-To: <20260623080538.7646-1-petr.wozniak@gmail.com>

Hi Petr,

On 6/23/26 10:05, Petr Wozniak wrote:
> This series resends the RollBall bridge probe deferral (a fix for the
> regression in commit 8fe125892f40) and adds a related mii_bus leak fix.

These are bugfixes, so you need to target the 'net' tree, as explained here:

https://docs.kernel.org/process/maintainer-netdev.html

Thanks :)

Maxime
> 
> Patch 1 fixes a pre-existing mii_bus leak in sfp_i2c_mdiobus_destroy()
> that has been present since the helper was introduced in 2022. Patch 2's
> new -ENODEV path destroys the MDIO bus via sfp_i2c_mdiobus_destroy(), so
> patch 1 is a prerequisite to avoid leaking the bus on that path.
> 
> The v2 deferral patch was corrupted in transit and failed to apply; it is
> regenerated here against current net-next with no functional change.
> 
> v3:
>  - Resend: v2 defer patch was corrupted in transit and failed to apply
>    (netdev/apply); regenerated against current net-next.
>  - Fixed block comment style flagged by checkpatch. No functional change.
>  - Added patch 1/2 (sfp: free mii_bus in sfp_i2c_mdiobus_destroy).
> v2 (defer):
>  - Generalized scope: regression affects boot-inserted and hotplugged
>    modules where bridge init exceeds 200 ms; Aleksander Bajkowski
>    confirmed FLYPRO SFP-10GT-CS-30M / AQR113C broken when hotplugged.
>  - Corrected state machine description (probe runs in SFP_S_INIT after
>    SFP_S_WAIT) - Jan Hoffmann.
>  - No code changes from v1.
> v1: initial submission.
> 
> Petr Wozniak (2):
>   net: phy: sfp: free mii_bus in sfp_i2c_mdiobus_destroy
>   net: phy: mdio-i2c: defer RollBall bridge probe to PHY discovery
> 
>  drivers/net/mdio/mdio-i2c.c   | 15 +++++++++------
>  drivers/net/phy/sfp.c         | 23 ++++++++++++++---------
>  include/linux/mdio/mdio-i2c.h |  1 +
>  3 files changed, 24 insertions(+), 15 deletions(-)
> 
> 
> base-commit: b85966adbf5de0668a815c6e3527f87e0c387fb4



* Re: [PATCH iwl-net v2 2/2] ice: preserve uplink DFLT Rx rule on switchdev release
From: Simon Horman @ 2026-06-23 16:25 UTC (permalink / raw)
  To: poros; +Cc: Simon Horman, netdev, aleksandr.loktionov
In-Reply-To: <20260622113428.2565255-3-poros@redhat.com>

This is an AI-generated review of your patch. The human sending this
email has considered the AI review valid, or at least plausible.
Full review at: https://sashiko.dev
---
ice: preserve uplink DFLT Rx rule on switchdev release

This commit preserves the default Rx rule on the uplink VSI when releasing
the switchdev environment if the interface is in promiscuous mode. This
prevents the catch-all rule from being removed while the netdev flags still
request promiscuity.

> diff --git a/drivers/net/ethernet/intel/ice/ice_eswitch.c b/drivers/net/ethernet/intel/ice/ice_eswitch.c
> index 2e4f0969035f7..48273ef9f69dc 100644
> --- a/drivers/net/ethernet/intel/ice/ice_eswitch.c
> +++ b/drivers/net/ethernet/intel/ice/ice_eswitch.c
[ ... ]
> @@ -278,8 +280,16 @@ static void ice_eswitch_release_env(struct ice_pf *pf)
>  	vlan_ops->ena_rx_filtering(uplink_vsi);

[Severity: High]
Could this lockless read of netdev->flags cause a regression due to a TOCTOU
race?

The ena_rx_filtering() call eventually reaches ice_cfg_vlan_pruning(),
which reads uplink_vsi->netdev->flags & IFF_PROMISC locklessly to decide
whether to enable VLAN pruning.

>  	ice_cfg_dflt_vsi(uplink_vsi->port_info, uplink_vsi->idx, false,
>  			 ICE_FLTR_TX);
> -	ice_cfg_dflt_vsi(uplink_vsi->port_info, uplink_vsi->idx, false,
> -			 ICE_FLTR_RX);
> +
> +	/* Keep the Rx DFLT rule if the uplink is promiscuous; it must outlive
> +	 * the session. Test the live netdev->flags, the same value
> +	 * ena_rx_filtering() -> ice_cfg_vlan_pruning() above keys its decision
> +	 * on, so the preserved DFLT rule and the pruning state stay consistent.
> +	 */
> +	if (!(uplink_vsi->netdev->flags & IFF_PROMISC))

... and then the exact same flags are read locklessly again right here.

Because setup/teardown paths are protected by devl_lock rather than
rtnl_lock, could a concurrent dev_change_flags() holding rtnl_lock toggle
IFF_PROMISC between the two checks?

If IFF_PROMISC changes from 0 to 1 in this window, the code would incorrectly
enable VLAN pruning (seeing 0) and then incorrectly retain the DFLT_RX rule
(seeing 1), leaving the hardware with VLAN pruning enabled during promiscuous
mode.
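
One way to picture the concern is a minimal userspace sketch in which the
flag word is read exactly once and both decisions are derived from that
snapshot. Everything below is hypothetical: struct fake_netdev, struct
hw_state and release_env_snapshot() are stand-ins invented for illustration,
not ice driver code, and the pruning decision is modelled simply as "enabled
when not promiscuous", as described above.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

#define IFF_PROMISC 0x100 /* same value as the Linux UAPI flag */

/* Hypothetical stand-in for the hardware state decided on release. */
struct hw_state {
	bool vlan_pruning; /* models the ena_rx_filtering() outcome */
	bool dflt_rx_rule; /* models whether the Rx DFLT rule is kept */
};

/* Hypothetical stand-in for netdev->flags, which may change concurrently. */
struct fake_netdev {
	atomic_uint flags;
};

/*
 * Read the flags once and derive both decisions from the same snapshot,
 * so the pruning state and the DFLT rule cannot disagree with each other.
 */
static void release_env_snapshot(struct fake_netdev *dev, struct hw_state *hw)
{
	unsigned int flags = atomic_load_explicit(&dev->flags,
						  memory_order_relaxed);
	bool promisc = flags & IFF_PROMISC;

	hw->vlan_pruning = !promisc;
	hw->dflt_rx_rule = promisc;
}
```

Note that a snapshot only makes the two decisions mutually consistent. It
does not stop the flags from changing afterwards, so a later resync or a
suitable lock would still be needed; in kernel code the single read would
typically be a READ_ONCE().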

[Severity: High]
Does preserving the Rx DFLT rule here cause a regression by permanently
breaking multicast promiscuity configuration if IFF_PROMISC was set?

During the switchdev session, ice_remove_vsi_fltr() removes the
ICE_SW_LKUP_PROMISC rules. By manually preserving the DFLT rule here, any
future synchronization via ice_vsi_sync_fltr() will evaluate
!ice_is_dflt_vsi_in_use(vsi->port_info) as false:

ice_main.c:ice_vsi_sync_fltr() {
    ...
    if (!ice_is_dflt_vsi_in_use(vsi->port_info)) {
        err = ice_set_dflt_vsi(vsi);
        ...
        err = ice_set_promisc(vsi, ICE_MCAST_PROMISC_BITS);
    }
    ...
}

Because it is skipped, ICE_MCAST_PROMISC_BITS is never configured, leaving
the VSI in unicast promiscuous mode but silently dropping multicast traffic
destined for other MAC addresses.

> +		ice_cfg_dflt_vsi(uplink_vsi->port_info, uplink_vsi->idx,
> +				 false, ICE_FLTR_RX);
> +
>  	ice_fltr_add_mac_and_broadcast(uplink_vsi,
>  				       uplink_vsi->port_info->mac.perm_addr,
>  				       ICE_FWD_TO_VSI);

[Severity: High]
This is a pre-existing issue, but does this teardown function fail to
synchronize hardware state for previously joined multicast groups and
secondary unicast addresses?

When entering switchdev mode, __dev_uc_unsync() and __dev_mc_unsync() clear
the synchronization state of OS-level MAC addresses, and
ice_remove_vsi_fltr() removes the actual hardware filters. While switchdev
runs, calls to ndo_set_rx_mode return early and do not update hardware.

When returning to legacy mode, this function restores only the primary MAC
and broadcast addresses, but doesn't seem to invoke netif_set_rx_mode() or
equivalent to resynchronize the deferred Rx filters from the netdev to the
hardware. Could this cause the hardware to silently drop packets for
previously joined groups until the interface is bounced?


* Re: [PATCH v7 01/15] arm64: dts: qcom: kodiak: Add EL2 overlay
From: Mukesh Ojha @ 2026-06-23 16:31 UTC (permalink / raw)
  To: Sumit Garg
  Cc: andersson, linux-arm-msm, devicetree, dri-devel, freedreno,
	linux-media, netdev, linux-wireless, ath12k, linux-remoteproc,
	konradybcio, robh, krzk+dt, conor+dt, robin.clark, sean, akhilpo,
	lumag, abhinav.kumar, jesszhan0024, marijn.suijten, airlied,
	simona, vikash.garodia, dikshita.agarwal, bod, mchehab, elder,
	andrew+netdev, davem, edumazet, kuba, pabeni, jjohnson,
	mathieu.poirier, trilokkumar.soni, pavan.kondeti, jorge.ramirez,
	tonyh, vignesh.viswanathan, srinivas.kandagatla, amirreza.zarrabi,
	jens.wiklander, op-tee, apurupa, skare, linux-kernel, Sumit Garg
In-Reply-To: <20260522115936.201208-2-sumit.garg@kernel.org>

On Fri, May 22, 2026 at 05:29:22PM +0530, Sumit Garg wrote:
> From: Mukesh Ojha <mukesh.ojha@oss.qualcomm.com>
> 
> All existing variants of Kodiak boards use the Gunyah hypervisor,
> which means that, so far, a Linux-based OS could only boot at EL1 on those
> devices.  However, it is possible for us to boot Linux at EL2 on these
> devices [1].
> 
> When running under Gunyah, the remote processor firmware IOMMU
> streams are controlled by Gunyah. However, without Gunyah, the IOMMU is
> managed by the consumer of this DeviceTree. Therefore, describe the
> firmware streams for each remote processor.
> 
> Add an EL2-specific DT overlay and apply it to Kodiak IOT variant
> devices to create -el2.dtb for each of them alongside the "normal" dtb.
> 
> Note that modem and media subsystems haven't been supported yet due
> to missing dependencies. For GPU to work, zap shader is disabled and
> in EL2 mode the kernel owns hardware watchdog which is enabled here.
> 
> [1]
> https://docs.qualcomm.com/bundle/publicresource/topics/80-70020-4/boot-developer-touchpoints.html#uefi
> 
> Signed-off-by: Mukesh Ojha <mukesh.ojha@oss.qualcomm.com>
> [SG: watchdog and modem fixup]
> Signed-off-by: Sumit Garg <sumit.garg@oss.qualcomm.com>

As discussed internally, I will be taking this patch separately and you
can drop this from series.

-- 
-Mukesh Ojha


* Re: [PATCH net-next v3 2/2] net: phy: mdio-i2c: defer RollBall bridge probe to PHY discovery
From: Maxime Chevallier @ 2026-06-23 16:28 UTC (permalink / raw)
  To: Petr Wozniak, linux, andrew, hkallweit1
  Cc: kuba, davem, edumazet, pabeni, netdev, linux-kernel, linux-phy,
	bjorn, olek2, kabel
In-Reply-To: <20260623080538.7646-3-petr.wozniak@gmail.com>

Hi Petr,

On 6/23/26 10:05, Petr Wozniak wrote:
> commit 8fe125892f40 ("net: phy: sfp: probe for RollBall I2C-to-MDIO
> bridge in mdio-i2c") introduced a regression: the RollBall I2C-to-MDIO
> bridge is not yet ready to respond to CMD_READ/CMD_DONE cycles when
> sfp_sm_add_mdio_bus() runs in SFP_S_INIT.  The 200 ms probe times out,
> i2c_mii_probe_rollball() returns -ENODEV, and sfp_sm_add_mdio_bus()
> sets mdio_protocol = MDIO_I2C_NONE.  By the time sfp_sm_probe_for_phy()
> runs (up to ~17 s later on affected hardware), the bridge is fully
> initialized but PHY probing is skipped because the protocol has already
> been changed to NONE.
> 
> This affects both modules inserted before boot and hotplugged modules on
> hardware where bridge initialization exceeds the 200 ms probe window
> (confirmed: FLYPRO SFP-10GT-CS-30M with Aquantia AQR113C, hotplugged).
> 
> Move the probe from i2c_mii_init_rollball(), called at bus-creation time,
> to sfp_sm_probe_for_phy() in sfp.c, where it runs after the SFP state
> machine module initialization delays.  Export the probe function as
> mdio_i2c_probe_rollball() so sfp.c can call it.
> 
> For RTL8261BE-based modules the probe correctly returns -ENODEV at PHY
> discovery time, causing sfp_sm_probe_for_phy() to destroy the MDIO bus
> and set MDIO_I2C_NONE, eliminating the 5+ minute PHY probe retry loop.
> 
> For genuine RollBall modules (e.g. FLYPRO SFP-10GT-CS-30M with Aquantia
> AQR113C) the probe now runs after initialization is complete and
> correctly returns 0, so PHY detection proceeds normally.
> 
> Reported-by: Aleksander Bajkowski <olek2@wp.pl>
> Fixes: 8fe125892f40 ("net: phy: sfp: probe for RollBall I2C-to-MDIO bridge in mdio-i2c")
> Signed-off-by: Petr Wozniak <petr.wozniak@gmail.com>

I'm not currently at home so I can't test that on my side, but as you'll
have to resend to the net tree, can you CC me for the next round so that
I can test with the few odd-ball modules I have?

I expect to be able to test this on Friday :(

Maxime

> ---
> v3: regenerated against net-next (v2 failed to apply due to transit
>     corruption); fixed block comment style (checkpatch); no functional
>     change.
> v2: commit message only - generalized scope (Aleksander Bajkowski);
>     corrected SM description (Jan Hoffmann); no code change from v1.
> v1: initial.
>  drivers/net/mdio/mdio-i2c.c   | 15 +++++++++------
>  drivers/net/phy/sfp.c         | 22 +++++++++++++---------
>  include/linux/mdio/mdio-i2c.h |  1 +
>  3 files changed, 23 insertions(+), 15 deletions(-)
> 
> diff --git a/drivers/net/mdio/mdio-i2c.c b/drivers/net/mdio/mdio-i2c.c
> index b88f63234b4e..2a3a418c1369 100644
> --- a/drivers/net/mdio/mdio-i2c.c
> +++ b/drivers/net/mdio/mdio-i2c.c
> @@ -419,7 +419,7 @@ static int i2c_mii_write_rollball(struct mii_bus *bus, int phy_id, int devad,
>  	return 0;
>  }
>  
> -static int i2c_mii_probe_rollball(struct i2c_adapter *i2c)
> +int mdio_i2c_probe_rollball(struct i2c_adapter *i2c)
>  {
>  	u8 data_buf[] = { ROLLBALL_DATA_ADDR, 0x01, 0x00, 0x00 };
>  	u8 cmd_buf[]  = { ROLLBALL_CMD_ADDR, ROLLBALL_CMD_READ };
> @@ -462,9 +462,13 @@ static int i2c_mii_probe_rollball(struct i2c_adapter *i2c)
>  
>  	return -ENODEV;
>  }
> +EXPORT_SYMBOL_GPL(mdio_i2c_probe_rollball);
>  
>  static int i2c_mii_init_rollball(struct i2c_adapter *i2c)
>  {
> +	/* Send the RollBall unlock password; bridge presence is verified
> +	 * later, in sfp_sm_probe_for_phy(), after module initialization.
> +	 */
>  	struct i2c_msg msg;
>  	u8 pw[5];
>  	int ret;
> @@ -486,7 +490,7 @@ static int i2c_mii_init_rollball(struct i2c_adapter *i2c)
>  	if (ret != 1)
>  		return -EIO;
>  
> -	return i2c_mii_probe_rollball(i2c);
> +	return 0;
>  }
>  
>  static bool mdio_i2c_check_functionality(struct i2c_adapter *i2c,
> @@ -531,10 +535,9 @@ struct mii_bus *mdio_i2c_alloc(struct device *parent, struct i2c_adapter *i2c,
>  	case MDIO_I2C_ROLLBALL:
>  		ret = i2c_mii_init_rollball(i2c);
>  		if (ret < 0) {
> -			if (ret != -ENODEV)
> -				dev_err(parent,
> -					"Cannot initialize RollBall MDIO I2C protocol: %d\n",
> -					ret);
> +			dev_err(parent,
> +				"Cannot initialize RollBall MDIO I2C protocol: %d\n",
> +				ret);
>  			mdiobus_free(mii);
>  			return ERR_PTR(ret);
>  		}
> diff --git a/drivers/net/phy/sfp.c b/drivers/net/phy/sfp.c
> index c4d274ab651e..bbfaa0450798 100644
> --- a/drivers/net/phy/sfp.c
> +++ b/drivers/net/phy/sfp.c
> @@ -2174,17 +2174,10 @@ static void sfp_sm_fault(struct sfp *sfp, unsigned int next_state, bool warn)
>  
>  static int sfp_sm_add_mdio_bus(struct sfp *sfp)
>  {
> -	int ret;
> -
>  	if (sfp->mdio_protocol == MDIO_I2C_NONE)
>  		return 0;
>  
> -	ret = sfp_i2c_mdiobus_create(sfp);
> -	if (ret == -ENODEV) {
> -		sfp->mdio_protocol = MDIO_I2C_NONE;
> -		return 0;
> -	}
> -	return ret;
> +	return sfp_i2c_mdiobus_create(sfp);
>  }
>  
>  /* Probe a SFP for a PHY device if the module supports copper - the PHY
> @@ -2215,7 +2208,18 @@ static int sfp_sm_probe_for_phy(struct sfp *sfp)
>  		break;
>  
>  	case MDIO_I2C_ROLLBALL:
> -		err = sfp_sm_probe_phy(sfp, SFP_PHY_ADDR_ROLLBALL, true);
> +		/* Probe here, after module initialization delays, so that
> +		 * genuine RollBall bridges have had time to start up.
> +		 * Modules without a bridge (e.g. RTL8261BE) return -ENODEV.
> +		 */
> +		err = mdio_i2c_probe_rollball(sfp->i2c);
> +		if (err == -ENODEV) {
> +			sfp_i2c_mdiobus_destroy(sfp);
> +			sfp->mdio_protocol = MDIO_I2C_NONE;
> +			break;
> +		}
> +		if (!err)
> +			err = sfp_sm_probe_phy(sfp, SFP_PHY_ADDR_ROLLBALL, true);
>  		break;
>  	}
>  
> diff --git a/include/linux/mdio/mdio-i2c.h b/include/linux/mdio/mdio-i2c.h
> index 65b550a6fc32..5cf14f45c94b 100644
> --- a/include/linux/mdio/mdio-i2c.h
> +++ b/include/linux/mdio/mdio-i2c.h
> @@ -20,5 +20,6 @@ enum mdio_i2c_proto {
>  
>  struct mii_bus *mdio_i2c_alloc(struct device *parent, struct i2c_adapter *i2c,
>  			       enum mdio_i2c_proto protocol);
> +int mdio_i2c_probe_rollball(struct i2c_adapter *i2c);
>  
>  #endif



* Re: [PATCH iwl-net v2 1/2] ice: skip per-VLAN promisc rules when default VSI Rx rule is set
From: Simon Horman @ 2026-06-23 16:25 UTC (permalink / raw)
  To: poros; +Cc: Simon Horman, netdev, aleksandr.loktionov
In-Reply-To: <20260622113428.2565255-2-poros@redhat.com>

This is an AI-generated review of your patch. The human sending this
email has considered the AI review valid, or at least plausible.
Full review at: https://sashiko.dev
---
Subject: ice: skip per-VLAN promisc rules when default VSI Rx rule is set

This patch optimizes the number of hardware switch rules used when an ice
port is part of a vlan-filtering bridge. It avoids installing per-VLAN
promiscuous rules if the broad default VSI Rx rule is already catching
all traffic.

> diff --git a/drivers/net/ethernet/intel/ice/ice_main.c b/drivers/net/ethernet/intel/ice/ice_main.c
> index e2fbe111f8496..efb1f536de6c7 100644
> --- a/drivers/net/ethernet/intel/ice/ice_main.c
> +++ b/drivers/net/ethernet/intel/ice/ice_main.c
[ ... ]
> @@ -442,17 +496,12 @@ static int ice_vsi_sync_fltr(struct ice_vsi *vsi)

[Severity: High]
This is a pre-existing issue, but does MAC filter overflow fail to force
promiscuous mode, resulting in silent packet drops?

When the hardware MAC filter limit is reached (LIBIE_AQ_RC_ENOSPC), the driver
attempts to fall back to promiscuous mode by setting promisc_forced_on = true.

However, the execution flow then reaches a check for IFF_PROMISC:

    if (vsi->current_netdev_flags & IFF_PROMISC) {
        /* Apply Rx filter rule to get traffic from wire */
        ...
    } else {
        /* Clear Rx filter to remove traffic from wire */

Since the user never explicitly requested IFF_PROMISC, this evaluates to false,
causing the driver to take the else branch. The else branch actively disables
the default VSI rule and clears promiscuous mode instead of enabling it. The
fallback feature appears to be non-functional, leading to lost traffic for
newly added MACs.

[Severity: High]
This is a pre-existing issue, but could this function prevent retries of
incomplete configurations and break multicast promiscuity if any default
VSI is already active?

In ice_vsi_sync_fltr(), the setup of multicast promiscuity and disabling of
Rx filtering is gated:

    if (!ice_is_dflt_vsi_in_use(vsi->port_info)) {
        err = ice_set_dflt_vsi(vsi);
        ...
        err = ice_set_promisc(vsi, ICE_MCAST_PROMISC_BITS);
        if (err)
            goto out_promisc;
    }

If ice_set_dflt_vsi() succeeds but the subsequent ice_set_promisc() fails,
the function jumps to out_promisc to trigger a retry. On the next cycle,
ice_is_dflt_vsi_in_use() returns true because this VSI is now the default,
causing the entire block to be skipped and leaving the VSI misconfigured.

Furthermore, if another VSI (such as a VF) is already the default VSI,
the PF VSI entirely skips setting multicast promiscuity, breaking multicast
reception in promiscuous mode.

>  		} else {
>  			/* Clear Rx filter to remove traffic from wire */
>  			if (ice_is_vsi_dflt_vsi(vsi)) {
> -				err = ice_clear_dflt_vsi(vsi);
> +				err = ice_vsi_exit_dflt_promisc(vsi);
>  				if (err) {
> -					netdev_err(netdev, "Error %d clearing default VSI %i Rx rule\n",
> -						   err, vsi->vsi_num);
>  					vsi->current_netdev_flags |=
>  						IFF_PROMISC;
>  					goto out_promisc;
>  				}
> -				if (vsi->netdev->features &
> -				    NETIF_F_HW_VLAN_CTAG_FILTER)
> -					vlan_ops->ena_rx_filtering(vsi);
>  			}
[ ... ]
> @@ -3675,10 +3724,9 @@ int ice_vlan_rx_add_vid(struct net_device *netdev, __be16 proto, u16 vid)
>  	while (test_and_set_bit(ICE_CFG_BUSY, vsi->state))
>  		usleep_range(1000, 2000);

[Severity: High]
This is a pre-existing issue, but isn't there a risk of TOCTOU races with
this ad-hoc lock pattern lacking memory barriers?

The driver spins on ICE_CFG_BUSY using test_and_set_bit() and usleep_range()
to guard VSI filter synchronization. The lock is subsequently dropped using
clear_bit(ICE_CFG_BUSY, vsi->state) at the exit label.

Because clear_bit() lacks a memory release barrier, the CPU or compiler can
reorder memory stores from the critical section to occur after the flag is
cleared, leading to data races and corruption of shared filter state on
weakly ordered architectures.
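
For illustration only, here is a userspace C11 sketch of a flag-based lock
with explicit acquire/release ordering. CFG_BUSY_BIT, cfg_trylock() and
cfg_unlock() are hypothetical names standing in for the ICE_CFG_BUSY pattern;
this is not driver code. The kernel provides test_and_set_bit_lock() and
clear_bit_unlock() for this purpose, but whether they are the right fix here
is for the maintainers to judge.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

#define CFG_BUSY_BIT 0x1u /* hypothetical flag standing in for ICE_CFG_BUSY */

/*
 * Try to take the flag. Acquire ordering keeps accesses in the critical
 * section from being reordered before the flag is observed as taken.
 */
static bool cfg_trylock(atomic_uint *state)
{
	unsigned int old = atomic_fetch_or_explicit(state, CFG_BUSY_BIT,
						    memory_order_acquire);

	return !(old & CFG_BUSY_BIT);
}

/*
 * Drop the flag. Release ordering keeps stores made inside the critical
 * section from being reordered after the flag is cleared.
 */
static void cfg_unlock(atomic_uint *state)
{
	atomic_fetch_and_explicit(state, ~CFG_BUSY_BIT, memory_order_release);
}
```

A caller would loop on cfg_trylock(), sleeping between attempts much like
the usleep_range() loop quoted above, and call cfg_unlock() at the exit
label.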

> -	/* Add multicast promisc rule for the VLAN ID to be added if
> -	 * all-multicast is currently enabled.
> -	 */
> -	if (vsi->current_netdev_flags & IFF_ALLMULTI) {
> +	/* skip the per-VID rule when the DFLT Rx rule already covers this VID */
> +	if ((vsi->current_netdev_flags & IFF_ALLMULTI) &&
> +	    !ice_is_vsi_dflt_vsi(vsi)) {
>  		ret = ice_fltr_set_vsi_promisc(&vsi->back->hw, vsi->idx,
>  					       ICE_MCAST_VLAN_PROMISC_BITS,
>  					       vid);
> @@ -3696,11 +3744,12 @@ int ice_vlan_rx_add_vid(struct net_device *netdev, __be16 proto, u16 vid)
>  	if (ret)
>  		goto finish;

[Severity: High]
This is a pre-existing issue, but what happens if vlan_ops->add_vlan()
fails here? Is the hardware switch rule leaked?

When adding a VLAN to an interface with IFF_ALLMULTI enabled, this function
first allocates a promiscuous switch rule in hardware via
ice_fltr_set_vsi_promisc().

If this succeeds but the subsequent call to vlan_ops->add_vlan() fails
(e.g., due to ENOSPC or memory allocation failure), the function returns
early without rolling back the promiscuous rule. Because the network stack
considers the VLAN addition to have failed, ice_vlan_rx_kill_vid() will
never be called for this VID, permanently leaking a Flow Lookup Unit (FLU).
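
The unwind being asked about can be sketched generically. This is only a
model under stated assumptions: struct fake_hw, set_promisc_rule(),
clear_promisc_rule(), add_vlan() and add_vid() are hypothetical stand-ins,
not the driver's functions.

```c
#include <errno.h>
#include <stdbool.h>

/* Hypothetical model of hardware state; not the ice driver's structures. */
struct fake_hw {
	int promisc_rules;	/* number of promiscuous rules installed */
	bool vlan_add_fails;	/* force the second step to fail */
};

static int set_promisc_rule(struct fake_hw *hw)
{
	hw->promisc_rules++;
	return 0;
}

static void clear_promisc_rule(struct fake_hw *hw)
{
	hw->promisc_rules--;
}

static int add_vlan(struct fake_hw *hw)
{
	return hw->vlan_add_fails ? -ENOSPC : 0;
}

/* Install the rule, then the VLAN; undo the rule if the VLAN step fails. */
static int add_vid(struct fake_hw *hw)
{
	int ret;

	ret = set_promisc_rule(hw);
	if (ret)
		return ret;

	ret = add_vlan(hw);
	if (ret)
		clear_promisc_rule(hw);	/* roll back so nothing is leaked */

	return ret;
}
```

With the rollback in place, a failed add leaves the rule count where it
started, so no later kill_vid call is needed to clean up.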

> -	/* If all-multicast is currently enabled and this VLAN ID is only one
> -	 * besides VLAN-0 we have to update look-up type of multicast promisc
> -	 * rule for VLAN-0 from ICE_SW_LKUP_PROMISC to ICE_SW_LKUP_PROMISC_VLAN.
> +	/* On the first non-zero VLAN, promote the VLAN-0 multicast promisc
> +	 * rule from ICE_SW_LKUP_PROMISC to ICE_SW_LKUP_PROMISC_VLAN. Skip when
> +	 * the DFLT Rx rule is installed; it already covers every VID.
>  	 */
>  	if ((vsi->current_netdev_flags & IFF_ALLMULTI) &&
> +	    !ice_is_vsi_dflt_vsi(vsi) &&
>  	    ice_vsi_num_non_zero_vlans(vsi) == 1) {
[ ... ]
> @@ -3763,11 +3812,12 @@ int ice_vlan_rx_kill_vid(struct net_device *netdev, __be16 proto, u16 vid)
>  					   ICE_MCAST_VLAN_PROMISC_BITS, vid);
>  
>  	if (!ice_vsi_has_non_zero_vlans(vsi)) {
> -		/* Update look-up type of multicast promisc rule for VLAN 0
> -		 * from ICE_SW_LKUP_PROMISC_VLAN to ICE_SW_LKUP_PROMISC when
> -		 * all-multicast is enabled and VLAN 0 is the only VLAN rule.
> +		/* Last non-zero VLAN gone: demote the VLAN-0 multicast promisc
> +		 * rule back to ICE_SW_LKUP_PROMISC. Skip when the DFLT Rx rule
> +		 * is installed; no recipe swap is needed.
>  		 */
> -		if (vsi->current_netdev_flags & IFF_ALLMULTI) {
> +		if ((vsi->current_netdev_flags & IFF_ALLMULTI) &&
> +		    !ice_is_vsi_dflt_vsi(vsi)) {

[Severity: High]
Will this modification cause a stale switch rule (ICE_SW_LKUP_PROMISC_VLAN for
vid=0) to be permanently leaked if the last VLAN is removed while the DFLT VSI
rule is active?

This patch skips demoting the vid=0 multicast promiscuous rule from
ICE_SW_LKUP_PROMISC_VLAN back to ICE_SW_LKUP_PROMISC when the last VLAN is
removed, if ice_is_vsi_dflt_vsi(vsi) is true.

However, if the interface later drops IFF_PROMISC (removing the DFLT VSI rule)
and then drops IFF_ALLMULTI, ice_clear_promisc() is called. Because the
interface now has zero VLANs, ice_clear_promisc() executes its else branch:

    } else {
        status = ice_fltr_clear_vsi_promisc(&vsi->back->hw, vsi->idx,
                                            promisc_m, 0);
    }

This only searches and clears rules from the ICE_SW_LKUP_PROMISC recipe. The
stranded rule in the ICE_SW_LKUP_PROMISC_VLAN recipe is never cleared,
resulting in a permanent hardware switch rule leak.

>  			ice_fltr_clear_vsi_promisc(&vsi->back->hw, vsi->idx,
>  						   ICE_MCAST_VLAN_PROMISC_BITS,
>  						   0);


* Re: [PATCH net-next v3 1/2] net: phy: sfp: free mii_bus in sfp_i2c_mdiobus_destroy
From: Maxime Chevallier @ 2026-06-23 16:23 UTC (permalink / raw)
  To: Petr Wozniak, linux, andrew, hkallweit1
  Cc: kuba, davem, edumazet, pabeni, netdev, linux-kernel, linux-phy,
	bjorn, olek2, kabel
In-Reply-To: <20260623080538.7646-2-petr.wozniak@gmail.com>



On 6/23/26 10:05, Petr Wozniak wrote:
> sfp_i2c_mdiobus_create() allocates the I2C MDIO bus with mdio_i2c_alloc(),
> a plain (non-devm) allocation, and registers it. sfp_i2c_mdiobus_destroy()
> only unregisters the bus and clears sfp->i2c_mii without calling
> mdiobus_free(). As the only reference to the bus is then cleared, the
> struct mii_bus is leaked.
> 
> This is hit whenever a copper/RollBall SFP module that instantiated an MDIO
> bus is removed: sfp_sm_main() takes the global teardown path and calls
> sfp_i2c_mdiobus_destroy(). sfp_cleanup(), on driver unbind, frees
> sfp->i2c_mii directly, which is why the leak only triggered on module
> hot-removal and not on unbind.

What's worse, this can happen many times in a row :)

> 
> Free the bus in sfp_i2c_mdiobus_destroy() to match the allocation done in
> sfp_i2c_mdiobus_create().
> 
> Fixes: e85b1347ace6 ("net: sfp: create/destroy I2C mdiobus before PHY probe/after PHY release")
> Signed-off-by: Petr Wozniak <petr.wozniak@gmail.com>

With this patch sent towards the -net tree,

Reviewed-by: Maxime Chevallier <maxime.chevallier@bootlin.com>

Maxime

> ---
>  drivers/net/phy/sfp.c | 1 +
>  1 file changed, 1 insertion(+)
> 
> diff --git a/drivers/net/phy/sfp.c b/drivers/net/phy/sfp.c
> index 03bfd8640db9..c4d274ab651e 100644
> --- a/drivers/net/phy/sfp.c
> +++ b/drivers/net/phy/sfp.c
> @@ -963,6 +963,7 @@ static int sfp_i2c_mdiobus_create(struct sfp *sfp)
>  static void sfp_i2c_mdiobus_destroy(struct sfp *sfp)
>  {
>  	mdiobus_unregister(sfp->i2c_mii);
> +	mdiobus_free(sfp->i2c_mii);
>  	sfp->i2c_mii = NULL;
>  }
>  



* Re: [PATCH bpf-next v2] bpf, unix: Guard sk_msg-dependent code behind CONFIG_NET_SOCK_MSG
From: Kuniyuki Iwashima @ 2026-06-23 16:08 UTC (permalink / raw)
  To: Jakub Sitnicki
  Cc: bpf, Alexei Starovoitov, Daniel Borkmann, Jakub Kicinski,
	Jiayuan Chen, John Fastabend, netdev, kernel-team
In-Reply-To: <20260623-bpf-sk_msg-split-unix-v2-1-ca7a626a94a5@cloudflare.com>

On Tue, Jun 23, 2026 at 4:20 AM Jakub Sitnicki <jakub@cloudflare.com> wrote:
>
> Prepare to decouple BPF_SYSCALL config option from NET_SOCK_MSG. When
> completed all code paths related to sockmap-based redirects should be
> guarded by BPF_SYSCALL && NET_SOCK_MSG to allow users to opt out by
> disabling NET_SOCK_MSG. The implementation of sockmap as a container for
> socket references would remain under BPF_SYSCALL.
>
> Signed-off-by: Jakub Sitnicki <jakub@cloudflare.com>
> ---
> Changes in v2:
> - Handle prot->recvmsg being NULL (Sashiko)
> - Elaborate on the end goal in description
> - Link to v1: https://patch.msgid.link/20260622-bpf-sk_msg-split-unix-v1-1-d7e0cb7bb03b@cloudflare.com
> ---
>  net/unix/af_unix.c  | 4 ++--
>  net/unix/unix_bpf.c | 6 ++++++
>  2 files changed, 8 insertions(+), 2 deletions(-)
>
> diff --git a/net/unix/af_unix.c b/net/unix/af_unix.c
> index f7a9d55eee8a..84c11c60c75f 100644
> --- a/net/unix/af_unix.c
> +++ b/net/unix/af_unix.c
> @@ -2675,7 +2675,7 @@ static int unix_dgram_recvmsg(struct socket *sock, struct msghdr *msg, size_t si
>  #ifdef CONFIG_BPF_SYSCALL
>         const struct proto *prot = READ_ONCE(sk->sk_prot);
>
> -       if (prot != &unix_dgram_proto)
> +       if (prot->recvmsg)

There is no reason to have this dead branch when
CONFIG_BPF_SYSCALL && !NET_SOCK_MSG.

Let's compile out all sockmap code unless both configs
are enabled.

Since AF_UNIX differs from TCP/UDP, it can take the
simpler approach.
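
One possible shape for this, as a sketch only: UNIX_SOCKMAP_REDIRECT,
default_recvmsg(), bpf_recvmsg() and recvmsg_dispatch() are hypothetical
names that do not appear in the patch, and the two CONFIG_ defines at the
top merely simulate a build with both options selected.

```c
/* Hypothetical: pretend both options were selected at build time. */
#define CONFIG_BPF_SYSCALL 1
#define CONFIG_NET_SOCK_MSG 1

#if defined(CONFIG_BPF_SYSCALL) && defined(CONFIG_NET_SOCK_MSG)
#define UNIX_SOCKMAP_REDIRECT 1
#else
#define UNIX_SOCKMAP_REDIRECT 0
#endif

/* Stand-ins for the default and sockmap receive paths. */
static int default_recvmsg(void) { return 0; }
#if UNIX_SOCKMAP_REDIRECT
static int bpf_recvmsg(void) { return 1; }
#endif

/* With either option off, the redirect branch is not compiled at all. */
static int recvmsg_dispatch(int prot_overridden)
{
#if UNIX_SOCKMAP_REDIRECT
	if (prot_overridden)
		return bpf_recvmsg();
#else
	(void)prot_overridden;
#endif
	return default_recvmsg();
}
```

Removing either define at the top makes the dispatch function collapse to
the default path, with no runtime check left behind.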


>                 return prot->recvmsg(sk, msg, size, flags);
>  #endif
>         return __unix_dgram_recvmsg(sk, msg, size, flags);
> @@ -3152,7 +3152,7 @@ static int unix_stream_recvmsg(struct socket *sock, struct msghdr *msg,
>         struct sock *sk = sock->sk;
>         const struct proto *prot = READ_ONCE(sk->sk_prot);
>
> -       if (prot != &unix_stream_proto)
> +       if (prot->recvmsg)
>                 return prot->recvmsg(sk, msg, size, flags);
>  #endif
>         return unix_stream_read_generic(&state, true);
> diff --git a/net/unix/unix_bpf.c b/net/unix/unix_bpf.c
> index f86ff19e9764..5289a04b4993 100644
> --- a/net/unix/unix_bpf.c
> +++ b/net/unix/unix_bpf.c
> @@ -7,6 +7,7 @@
>
>  #include "af_unix.h"
>
> +#ifdef CONFIG_NET_SOCK_MSG
>  #define unix_sk_has_data(__sk, __psock)                                        \
>                 ({      !skb_queue_empty(&__sk->sk_receive_queue) ||    \
>                         !skb_queue_empty(&__psock->ingress_skb) ||      \
> @@ -94,6 +95,7 @@ static int unix_bpf_recvmsg(struct sock *sk, struct msghdr *msg,
>         sk_psock_put(sk, psock);
>         return copied;
>  }
> +#endif /* CONFIG_NET_SOCK_MSG */
>
>  static struct proto *unix_dgram_prot_saved __read_mostly;
>  static DEFINE_SPINLOCK(unix_dgram_prot_lock);
> @@ -107,8 +109,10 @@ static void unix_dgram_bpf_rebuild_protos(struct proto *prot, const struct proto
>  {
>         *prot        = *base;
>         prot->close  = sock_map_close;
> +#ifdef CONFIG_NET_SOCK_MSG
>         prot->recvmsg = unix_bpf_recvmsg;
>         prot->sock_is_readable = sk_msg_is_readable;
> +#endif
>  }
>
>  static void unix_stream_bpf_rebuild_protos(struct proto *prot,
> @@ -116,8 +120,10 @@ static void unix_stream_bpf_rebuild_protos(struct proto *prot,
>  {
>         *prot        = *base;
>         prot->close  = sock_map_close;
> +#ifdef CONFIG_NET_SOCK_MSG
>         prot->recvmsg = unix_bpf_recvmsg;
>         prot->sock_is_readable = sk_msg_is_readable;
> +#endif
>         prot->unhash  = sock_map_unhash;
>  }
>
>
>
>


* Re: [PATCH 0/3] SM8450 IPA support
From: Alex Elder @ 2026-06-23 15:56 UTC (permalink / raw)
  To: esteuwu, Bjorn Andersson, Konrad Dybcio, Rob Herring,
	Krzysztof Kozlowski, Conor Dooley, Andrew Lunn, David S. Miller,
	Eric Dumazet, Jakub Kicinski, Paolo Abeni, Alex Elder
  Cc: linux-arm-msm, devicetree, linux-kernel, netdev
In-Reply-To: <20260622-sm8450-ipa-v1-0-532f0299f96e@proton.me>

On 6/22/26 8:44 PM, Esteban Urrutia via B4 Relay wrote:
> This series adds support for the IPA subsystem found in the SM8450 SoC.
> While IPA v5.0 is very similar to IPA v5.1 (heck, it even managed to
> properly get the modem up and running), it wasn't perfect, since the
> modem would sometimes hang when rebooting or powering the AP off.
> After a thorough investigation, I managed to create the proper data file
> required for IPA v5.1.
> 
> Regards,
> Esteban

I assume you have implemented this based on what you found in
some downstream code.  And if so, could you please indicate
where to find that (so I can do some cross-referencing myself).
I no longer have access to any Qualcomm internal documentation.

Thanks.

					-Alex

> Signed-off-by: Esteban Urrutia <esteuwu@proton.me>
> ---
> Esteban Urrutia (3):
>        arm64: dts: qcom: sm8450: Add IPA support
>        dt-bindings: net: qcom,ipa: Add SM8450 compatible string
>        net: ipa: Add IPA v5.1 data
> 
>   .../devicetree/bindings/net/qcom,ipa.yaml          |   1 +
>   arch/arm64/boot/dts/qcom/sm8450.dtsi               |  55 ++-
>   drivers/net/ipa/Makefile                           |   2 +-
>   drivers/net/ipa/data/ipa_data-v5.1.c               | 477 +++++++++++++++++++++
>   drivers/net/ipa/gsi_reg.c                          |   1 +
>   drivers/net/ipa/ipa_data.h                         |   1 +
>   drivers/net/ipa/ipa_main.c                         |   4 +
>   drivers/net/ipa/ipa_reg.c                          |   1 +
>   8 files changed, 536 insertions(+), 6 deletions(-)
> ---
> base-commit: 948efecf22e49aa4bf55bb73ec79a0ddcfd38571
> change-id: 20260622-sm8450-ipa-5da81f67eb65
> 
> Best regards,
> --
> Esteban Urrutia <esteuwu@proton.me>
> 
> 



* Re: [PATCH] net: ipa: fix SMEM state handle leaks in SMP2P init
From: Alex Elder @ 2026-06-23 15:53 UTC (permalink / raw)
  To: Haoxiang Li, elder, andrew+netdev, davem, edumazet, kuba, pabeni
  Cc: netdev, linux-kernel, stable
In-Reply-To: <20260623031831.1788454-1-haoxiang_li2024@163.com>

On 6/22/26 10:18 PM, Haoxiang Li wrote:
> ipa_smp2p_init() acquires two Qualcomm SMEM state handles with
> qcom_smem_state_get(). However, neither the init error paths
> nor ipa_smp2p_exit() release them.
> 
> Use devm_qcom_smem_state_get() for both state handles so the
> references are released automatically when the platform device
> is removed.
> 
> Fixes: 530f9216a953 ("soc: qcom: ipa: AP/modem communications")
> Cc: stable@vger.kernel.org
> Signed-off-by: Haoxiang Li <haoxiang_li2024@163.com>

So I guess they were never "put" before?

This looks OK, but I'll just mention that the IPA code
doesn't use devm_*() (managed) interfaces.  So it would
be more consistent to just call qcom_smem_state_put()
at the end of ipa_smp2p_exit() for both ipa->enabled_state
and ipa->valid_state.

					-Alex

> ---
>   drivers/net/ipa/ipa_smp2p.c | 8 ++++----
>   1 file changed, 4 insertions(+), 4 deletions(-)
> 
> diff --git a/drivers/net/ipa/ipa_smp2p.c b/drivers/net/ipa/ipa_smp2p.c
> index 2f0ccdd937cc..d8fd56949082 100644
> --- a/drivers/net/ipa/ipa_smp2p.c
> +++ b/drivers/net/ipa/ipa_smp2p.c
> @@ -228,15 +228,15 @@ ipa_smp2p_init(struct ipa *ipa, struct platform_device *pdev, bool modem_init)
>   	u32 valid_bit;
>   	int ret;
>   
> -	valid_state = qcom_smem_state_get(dev, "ipa-clock-enabled-valid",
> -					  &valid_bit);
> +	valid_state = devm_qcom_smem_state_get(dev, "ipa-clock-enabled-valid",
> +					       &valid_bit);
>   	if (IS_ERR(valid_state))
>   		return PTR_ERR(valid_state);
>   	if (valid_bit >= 32)		/* BITS_PER_U32 */
>   		return -EINVAL;
>   
> -	enabled_state = qcom_smem_state_get(dev, "ipa-clock-enabled",
> -					    &enabled_bit);
> +	enabled_state = devm_qcom_smem_state_get(dev, "ipa-clock-enabled",
> +						 &enabled_bit);
>   	if (IS_ERR(enabled_state))
>   		return PTR_ERR(enabled_state);
>   	if (enabled_bit >= 32)		/* BITS_PER_U32 */



* Re: [PATCH 1/2] bug: Provide WARN_ON.*DEFERRED() macros for console deferred output
From: Petr Mladek @ 2026-06-23 15:49 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Sebastian Andrzej Siewior, linux-arch, linux-kernel, sched-ext,
	netdev, David S . Miller, Andrea Righi, Arnd Bergmann, Ben Segall,
	Breno Leitao, Changwoo Min, David Vernet, Dietmar Eggemann,
	Eric Dumazet, Ingo Molnar, Jakub Kicinski, John Ogness,
	Juri Lelli, K Prateek Nayak, Paolo Abeni, Peter Zijlstra,
	Sergey Senozhatsky, Simon Horman, Steven Rostedt, Tejun Heo,
	Vincent Guittot, Vlad Poenaru
In-Reply-To: <20260623081258.580e034fdb5b98f4f8dba44a@linux-foundation.org>

On Tue 2026-06-23 08:12:58, Andrew Morton wrote:
> On Tue, 23 Jun 2026 16:26:49 +0200 Sebastian Andrzej Siewior <bigeasy@linutronix.de> wrote:
> 
> > Provide a deferred version of the WARN_ON() macro. It will delay
> > flushing the console until a later context. It is needed in a context
> > where the caller holds locks which can lead to a deadlock content is
> > flushed to the console driver.
> > An example would from a warning from within the scheduler resulting in a
> > wake-up of a task.
> > 
> > Deferring the output works by using printk_deferred_enter/ exit() around
> > the printing output. This must be used in a context where the task can't
> > migrate to another CPU. This should be the case usually, since the
> > scheduler would acquire the rq lock whith disabled interrupts, but to be
> > safe preemption is disabled to guarantee this.
> > 
> > In order not to bloat the code on architectures which provide an
> > optimized __WARN_FLAGS() define BUGFLAG_DEFERRED which is handled by
> > __report_bug() and does not increase the code size.
> > 
> > Provide the DEFERRED macros based on __WARN_FLAGS and __WARN_FLAGS
> > macros. Extend __report_bug() to handle the deferred case.
> > 
> > ...
> >
> > --- a/include/asm-generic/bug.h
> > +++ b/include/asm-generic/bug.h
> > @@ -229,7 +230,10 @@ static enum bug_trap_type __report_bug(struct bug_entry *bug, unsigned long buga
> >  		 */
> >  		bug->flags |= BUGFLAG_DONE;
> >  	}
> > -
> > +	if (deferred) {
> > +		preempt_disable_notrace();
> > +		printk_deferred_enter();
> > +	}
> 
> For some reason the comment over printk_deferred_enter() says
> "Interrupts must be disabled for the deferred duration".  Is that the
> case for all the printk_deferred_enter() calls which this patch adds?

Strictly speaking, "only" CPU migration must be disabled around the
printk_deferred_enter()/exit() calls because the state is stored
in a per-CPU variable.

It means that preempt_disable() would work.

I do not recall whether we mentioned interrupts by mistake or
on purpose. It is possible that we suggested disabling interrupts
because we did not want to defer messages from an unrelated (interrupt)
context.

Best Regards,
Petr


* [PATCH net-next v3] vsock/virtio: rewrite MSG_ZEROCOPY flag handling
From: Arseniy Krasnov @ 2026-06-23 15:38 UTC (permalink / raw)
  To: Stefan Hajnoczi, Stefano Garzarella, David S. Miller,
	Eric Dumazet, Jakub Kicinski, Paolo Abeni, Michael S. Tsirkin,
	Jason Wang, Bobby Eshleman, Xuan Zhuo, Eugenio Pérez,
	Simon Horman
  Cc: kvm, virtualization, netdev, linux-kernel, oxffffaa, rulkc,
	Arseniy Krasnov

The MSG_ZEROCOPY flag handling was logically based on the TCP implementation,
so to make future maintenance easier, rewrite it the TCP way (as in
'tcp_sendmsg_locked()'). This patch only rewrites the flag handling (i.e. it
doesn't change the logic).

Signed-off-by: Arseniy Krasnov <avkrasnov@rulkc.org>
---
 Changelog v1->v2:
 * Rebase on last 'net-next'. Don't need 'skb_zcopy_set()' now - it was
   already added.
 Changelog v2->v3:
 * Update commit message.
 * Remove one empty line.

 net/vmw_vsock/virtio_transport_common.c | 47 ++++++++++++-------------
 1 file changed, 22 insertions(+), 25 deletions(-)

diff --git a/net/vmw_vsock/virtio_transport_common.c b/net/vmw_vsock/virtio_transport_common.c
index 09475007165b..41c2a0b82a8e 100644
--- a/net/vmw_vsock/virtio_transport_common.c
+++ b/net/vmw_vsock/virtio_transport_common.c
@@ -328,38 +328,35 @@ static int virtio_transport_send_pkt_info(struct vsock_sock *vsk,
 	if (pkt_len == 0 && info->op == VIRTIO_VSOCK_OP_RW)
 		return pkt_len;
 
-	if (info->msg) {
-		/* If zerocopy is not enabled by 'setsockopt()', we behave as
-		 * there is no MSG_ZEROCOPY flag set.
+	if (info->msg && (info->msg->msg_flags & MSG_ZEROCOPY)) {
+		/* If 'info->msg' is not NULL, this is only VIRTIO_VSOCK_OP_RW.
+		 * 'MSG_ZEROCOPY' flag handling here is based on the same flag
+		 * handling from 'tcp_sendmsg_locked()'.
 		 */
-		if (!sock_flag(sk_vsock(vsk), SOCK_ZEROCOPY))
-			info->msg->msg_flags &= ~MSG_ZEROCOPY;
+		if (info->msg->msg_ubuf) {
+			uarg = info->msg->msg_ubuf;
+			can_zcopy = virtio_transport_can_zcopy(t_ops, info, pkt_len);
+		} else if (sock_flag(sk_vsock(vsk), SOCK_ZEROCOPY)) {
+			uarg = msg_zerocopy_realloc(sk_vsock(vsk), pkt_len,
+						    NULL, false);
+			if (!uarg) {
+				virtio_transport_put_credit(vvs, pkt_len);
+				return -ENOMEM;
+			}
 
-		if (info->msg->msg_flags & MSG_ZEROCOPY)
 			can_zcopy = virtio_transport_can_zcopy(t_ops, info, pkt_len);
+			if (!can_zcopy)
+				uarg_to_msgzc(uarg)->zerocopy = 0;
 
+			have_uref = true;
+		}
+
+		/* 'can_zcopy' means that this transmission will be
+		 * in zerocopy way (e.g. using 'frags' array).
+		 */
 		if (can_zcopy)
 			max_skb_len = min_t(u32, VIRTIO_VSOCK_MAX_PKT_BUF_SIZE,
 					    (MAX_SKB_FRAGS * PAGE_SIZE));
-
-		if (info->msg->msg_flags & MSG_ZEROCOPY &&
-		    info->op == VIRTIO_VSOCK_OP_RW) {
-			uarg = info->msg->msg_ubuf;
-
-			if (!uarg) {
-				uarg = msg_zerocopy_realloc(sk_vsock(vsk),
-							    pkt_len, NULL, false);
-				if (!uarg) {
-					virtio_transport_put_credit(vvs, pkt_len);
-					return -ENOMEM;
-				}
-
-				if (!can_zcopy)
-					uarg_to_msgzc(uarg)->zerocopy = 0;
-
-				have_uref = true;
-			}
-		}
 	}
 
 	rest_len = pkt_len;
-- 
2.25.1



* Re: [PATCH 1/2] bug: Provide WARN_ON.*DEFERRED() macros for console deferred output
From: Andrew Morton @ 2026-06-23 15:12 UTC (permalink / raw)
  To: Sebastian Andrzej Siewior
  Cc: linux-arch, linux-kernel, sched-ext, netdev, David S . Miller,
	Andrea Righi, Arnd Bergmann, Ben Segall, Breno Leitao,
	Changwoo Min, David Vernet, Dietmar Eggemann, Eric Dumazet,
	Ingo Molnar, Jakub Kicinski, John Ogness, Juri Lelli,
	K Prateek Nayak, Paolo Abeni, Peter Zijlstra, Petr Mladek,
	Sergey Senozhatsky, Simon Horman, Steven Rostedt, Tejun Heo,
	Vincent Guittot, Vlad Poenaru
In-Reply-To: <20260623142650.265721-2-bigeasy@linutronix.de>

On Tue, 23 Jun 2026 16:26:49 +0200 Sebastian Andrzej Siewior <bigeasy@linutronix.de> wrote:

> Provide a deferred version of the WARN_ON() macro. It will delay
> flushing the console until a later context. It is needed in a context
> where the caller holds locks which can lead to a deadlock content is
> flushed to the console driver.
> An example would from a warning from within the scheduler resulting in a
> wake-up of a task.
> 
> Deferring the output works by using printk_deferred_enter/ exit() around
> the printing output. This must be used in a context where the task can't
> migrate to another CPU. This should be the case usually, since the
> scheduler would acquire the rq lock whith disabled interrupts, but to be
> safe preemption is disabled to guarantee this.
> 
> In order not to bloat the code on architectures which provide an
> optimized __WARN_FLAGS() define BUGFLAG_DEFERRED which is handled by
> __report_bug() and does not increase the code size.
> 
> Provide the DEFERRED macros based on __WARN_FLAGS and __WARN_FLAGS
> macros. Extend __report_bug() to handle the deferred case.
> 
> ...
>
> --- a/include/asm-generic/bug.h
> +++ b/include/asm-generic/bug.h
> @@ -229,7 +230,10 @@ static enum bug_trap_type __report_bug(struct bug_entry *bug, unsigned long buga
>  		 */
>  		bug->flags |= BUGFLAG_DONE;
>  	}
> -
> +	if (deferred) {
> +		preempt_disable_notrace();
> +		printk_deferred_enter();
> +	}

For some reason the comment over printk_deferred_enter() says
"Interrupts must be disabled for the deferred duration".  Is that the
case for all the printk_deferred_enter() calls which this patch adds?




* Re: [PATCH net-next v2] Documentation: net/smc: correct old value of smcr_max_recv_wr
From: Breno Leitao @ 2026-06-23 15:12 UTC (permalink / raw)
  To: Mahanta Jambigi
  Cc: andrew+netdev, davem, edumazet, kuba, pabeni, alibuda, dust.li,
	sidraya, wenjia, wintera, pasic, horms, tonylu, guwen, netdev,
	linux-s390
In-Reply-To: <20260424052336.3262350-1-mjambigi@linux.ibm.com>

On Fri, Apr 24, 2026 at 07:23:36AM +0200, Mahanta Jambigi wrote:
> The smc-sysctl.rst documentation incorrectly stated that the previous
> hardcoded maximum number of WR buffers on the receive path (smcr_max_recv_wr)
> was 16. The correct historical value used before the introduction of the sysctl
> control was 48. Update the documentation to reflect the accurate historical
> value. Also fix a couple of minor typos.
> 
> Fixes: aef3cdb47bbb net/smc: make wr buffer count configurable

This Fixes tag is broken. You probably want:

	Fixes: aef3cdb47bbb ("net/smc: make wr buffer count configurable")

Other than that, it looks good; the corrected value checks out.


* Re: [PATCH net-next 0/3] selftests/xsk: stabilize timeout test behavior
From: Maciej Fijalkowski @ 2026-06-23 14:56 UTC (permalink / raw)
  To: Jason Xing
  Cc: Tushar Vyavahare, netdev, magnus.karlsson, stfomichev, kernelxing,
	davem, kuba, pabeni, ast, daniel, tirthendu.sarkar, bpf
In-Reply-To: <ajJsMj0QMOF5I8qq@boxer>

On Wed, Jun 17, 2026 at 11:43:14AM +0200, Maciej Fijalkowski wrote:
> On Wed, Jun 17, 2026 at 07:39:06AM +0800, Jason Xing wrote:
> > Hi Tushar,
> > 
> > On Tue, Jun 16, 2026 at 11:50 PM Tushar Vyavahare
> > <tushar.vyavahare@intel.com> wrote:
> > >
> > > This series improves AF_XDP selftests by making timeout handling
> > > explicit and fixing sources of non-determinism in xsk timeout tests.
> > >
> > > Patch 1 introduces test_spec::poll_tmout and removes implicit
> > > dependence on RX UMEM setup state for timeout behavior.
> > >
> > > Patch 2 fixes thread harness sequencing by attaching XDP programs
> > > before worker startup, removing signal-based termination, and using
> > > barrier synchronization only for dual-thread runs.
> > >
> > > Patch 3 restores shared_umem after POLL_TXQ_FULL so test-local
> > > configuration does not leak into subsequent cases on shared-netdev
> > > runs.
> > >
> > > Together these changes make timeout handling easier to follow and
> > > improve selftest stability, especially on real NIC runs.
> > 
> > net-next is closed, but in the meantime I'll review the series ASAP.
> > 
> > BTW, another thing about selftests I had in my mind is that are you
> > planning to work on this [1]?
> 
> This one is on me. I took your changes Jason and aligned ZC batching side
> to this behavior, followed by xskxceiver adjustment. I am planning to send
> this today EOD, however let's see how badly internal Sashiko will kick my
> ass.

Acked-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com>
Tested-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com>

> 
> > 
> > [1]: https://lore.kernel.org/all/20260520004244.55663-1-kerneljasonxing@gmail.com/
> > 
> > Thanks,
> > Jason
> > 
> > >
> > > Tushar Vyavahare (3):
> > >   selftests/xsk: make poll timeout mode explicit
> > >   selftests/xsk: fix timeout thread harness sequencing
> > >   selftests/xsk: restore shared_umem after POLL_TXQ_FULL
> > >
> > >  .../selftests/bpf/prog_tests/test_xsk.c       | 96 +++++++++++--------
> > >  .../selftests/bpf/prog_tests/test_xsk.h       |  2 +
> > >  2 files changed, 56 insertions(+), 42 deletions(-)
> > >
> > > --
> > > 2.43.0
> > >
> > >
> > 


* Re: [PATCH net-next 0/3] selftests/xsk: stabilize timeout test behavior
From: Maciej Fijalkowski @ 2026-06-23 14:58 UTC (permalink / raw)
  To: Jakub Kicinski
  Cc: Jason Xing, Tushar Vyavahare, netdev, magnus.karlsson, stfomichev,
	kernelxing, davem, pabeni, ast, daniel, tirthendu.sarkar, bpf
In-Reply-To: <ajpLuDNCu2PHS78l@boxer>

On Tue, Jun 23, 2026 at 11:02:48AM +0200, Maciej Fijalkowski wrote:
> On Mon, Jun 22, 2026 at 04:07:06PM -0700, Jakub Kicinski wrote:
> > On Wed, 17 Jun 2026 11:43:14 +0200 Maciej Fijalkowski wrote:
> > > > On Tue, Jun 16, 2026 at 11:50 PM Tushar Vyavahare
> > > > <tushar.vyavahare@intel.com> wrote:  
> > > > >
> > > > > This series improves AF_XDP selftests by making timeout handling
> > > > > explicit and fixing sources of non-determinism in xsk timeout tests.
> > > > >
> > > > > Patch 1 introduces test_spec::poll_tmout and removes implicit
> > > > > dependence on RX UMEM setup state for timeout behavior.
> > > > >
> > > > > Patch 2 fixes thread harness sequencing by attaching XDP programs
> > > > > before worker startup, removing signal-based termination, and using
> > > > > barrier synchronization only for dual-thread runs.
> > > > >
> > > > > Patch 3 restores shared_umem after POLL_TXQ_FULL so test-local
> > > > > configuration does not leak into subsequent cases on shared-netdev
> > > > > runs.
> > > > >
> > > > > Together these changes make timeout handling easier to follow and
> > > > > improve selftest stability, especially on real NIC runs.  
> > > > 
> > > > net-next is closed, but in the meantime I'll review the series ASAP.
> > > > 
> > > > BTW, another thing about selftests I had in my mind is that are you
> > > > planning to work on this [1]?  
> > > 
> > > This one is on me. I took your changes Jason and aligned ZC batching side
> > > to this behavior, followed by xskxceiver adjustment. I am planning to send
> > > this today EOD, however let's see how badly internal Sashiko will kick my
> > > ass.
> > 
> > Hi Maciej, do you want these applied? If they help make the tests less
> > flaky I think that it's fine to take them during the merge window.
> 
> Hi Jakub,
> 
> last refactor from Tushar broke BIDIRECTIONAL test case when HW is test
> target, but not on veth, so let me test these changes locally and then get
> back to you.
> 
> BPF CI runs xskxceiver on veth so this has not been caught. Seems my/our
> focus should be to enable xskxceiver HW tests on any kind of
> environment/infrastructure.
> 
> Gonna get back to you by the EOD.
> Maciej

Ah, I replied on the other thread I guess, so let me repeat:

Acked-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com>
Tested-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com>


* Re: [PATCH 1/2] bug: Provide WARN_ON.*DEFERRED() macros for console deferred output
From: K Prateek Nayak @ 2026-06-23 14:54 UTC (permalink / raw)
  To: Sebastian Andrzej Siewior, linux-arch, linux-kernel, sched-ext,
	netdev
  Cc: David S . Miller, Andrea Righi, Andrew Morton, Arnd Bergmann,
	Ben Segall, Breno Leitao, Changwoo Min, David Vernet,
	Dietmar Eggemann, Eric Dumazet, Ingo Molnar, Jakub Kicinski,
	John Ogness, Juri Lelli, Paolo Abeni, Peter Zijlstra, Petr Mladek,
	Sergey Senozhatsky, Simon Horman, Steven Rostedt, Tejun Heo,
	Vincent Guittot, Vlad Poenaru
In-Reply-To: <20260623142650.265721-2-bigeasy@linutronix.de>

Hello Sebastian,

On 6/23/2026 7:56 PM, Sebastian Andrzej Siewior wrote:
> --- a/lib/bug.c
> +++ b/lib/bug.c
> @@ -196,7 +196,7 @@ void __warn_printf(const char *fmt, struct pt_regs *regs)
>  
>  static enum bug_trap_type __report_bug(struct bug_entry *bug, unsigned long bugaddr, struct pt_regs *regs)
>  {
> -	bool warning, once, done, no_cut, has_args;
> +	bool warning, once, done, no_cut, has_args, deferred;
>  	const char *file, *fmt;
>  	unsigned line;
>  
> @@ -219,6 +219,7 @@ static enum bug_trap_type __report_bug(struct bug_entry *bug, unsigned long buga
>  	done     = bug->flags & BUGFLAG_DONE;
>  	no_cut   = bug->flags & BUGFLAG_NO_CUT_HERE;
>  	has_args = bug->flags & BUGFLAG_ARGS;
> +	deferred = bug->flags & BUGFLAG_DEFERRED;
>  
>  	if (warning && once) {
>  		if (done)
> @@ -229,7 +230,10 @@ static enum bug_trap_type __report_bug(struct bug_entry *bug, unsigned long buga
>  		 */
>  		bug->flags |= BUGFLAG_DONE;
>  	}
> -
> +	if (deferred) {
> +		preempt_disable_notrace();
> +		printk_deferred_enter();
> +	}
>  	/*
>  	 * BUG() and WARN_ON() families don't print a custom debug message
>  	 * before triggering the exception handler, so we must add the
> @@ -245,6 +249,10 @@ static enum bug_trap_type __report_bug(struct bug_entry *bug, unsigned long buga
>  		/* this is a WARN_ON rather than BUG/BUG_ON */
>  		__warn(file, line, (void *)bugaddr, BUG_GET_TAINT(bug), regs,
>  		       NULL);
> +		if (deferred) {
> +			printk_deferred_exit();
> +			preempt_enable_notrace();
> +		}
>  		return BUG_TRAP_TYPE_WARN;

nit.

Instead of replicating these bits, can we replace that return with a
"goto out" ...

>  	}
>  
> @@ -254,6 +262,10 @@ static enum bug_trap_type __report_bug(struct bug_entry *bug, unsigned long buga
>  		pr_crit("kernel BUG at %pB [verbose debug info unavailable]\n",
>  			(void *)bugaddr);
>  

out:

> +	if (deferred) {
> +		printk_deferred_exit();
> +		preempt_enable_notrace();
> +	}
>  	return BUG_TRAP_TYPE_BUG;

... and replace this return with a:

    return (warning) ? BUG_TRAP_TYPE_WARN : BUG_TRAP_TYPE_BUG;

Looks a tad bit cleaner to my eyes. Thoughts?
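
To make the suggestion concrete, here is a standalone sketch of the single
exit path. It is not lib/bug.c: enum trap_type, deferred_depth,
deferred_enter(), deferred_exit() and report() are hypothetical stand-ins,
with the counter playing the role of printk_deferred_enter()/exit().

```c
#include <stdbool.h>

enum trap_type { TRAP_TYPE_BUG, TRAP_TYPE_WARN };

/* Hypothetical counter standing in for printk_deferred_enter()/exit(). */
static int deferred_depth;

static void deferred_enter(void) { deferred_depth++; }
static void deferred_exit(void) { deferred_depth--; }

/* One exit path: the deferred section is closed in exactly one place. */
static enum trap_type report(bool warning, bool deferred)
{
	if (deferred)
		deferred_enter();

	if (warning) {
		/* ... emit the warning ... */
		goto out;
	}

	/* ... emit the BUG message ... */
out:
	if (deferred)
		deferred_exit();
	return warning ? TRAP_TYPE_WARN : TRAP_TYPE_BUG;
}
```

Both the WARN and the BUG case leave the counter balanced, which is the
property the duplicated exit blocks were guarding.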

>  }
>  

-- 
Thanks and Regards,
Prateek



* [PATCH] qede: fix out-of-bounds check for cqe->len_list[]
From: Matvey Kovalev @ 2026-06-23 14:45 UTC (permalink / raw)
  To: Andrew Lunn
  Cc: Matvey Kovalev, David S. Miller, Eric Dumazet, Jakub Kicinski,
	Paolo Abeni, Pavel Zhigulin, netdev, linux-kernel, lvc-project

Move the index check before the element access, so that cqe->len_list[i]
is only read once i is known to be in bounds.

Fixes: 896f1a2493b5 ("net: qlogic/qede: fix potential out-of-bounds read in qede_tpa_cont() and qede_tpa_end()")
Found by Linux Verification Center (linuxtesting.org) with SVACE.

Signed-off-by: Matvey Kovalev <matvey.kovalev@ispras.ru>
---
 drivers/net/ethernet/qlogic/qede/qede_fp.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/drivers/net/ethernet/qlogic/qede/qede_fp.c b/drivers/net/ethernet/qlogic/qede/qede_fp.c
index e338bfc8b7b2..33e18bb69774 100644
--- a/drivers/net/ethernet/qlogic/qede/qede_fp.c
+++ b/drivers/net/ethernet/qlogic/qede/qede_fp.c
@@ -961,7 +961,7 @@ static inline void qede_tpa_cont(struct qede_dev *edev,
 {
 	int i;
 
-	for (i = 0; cqe->len_list[i] && i < ARRAY_SIZE(cqe->len_list); i++)
+	for (i = 0; i < ARRAY_SIZE(cqe->len_list) && cqe->len_list[i]; i++)
 		qede_fill_frag_skb(edev, rxq, cqe->tpa_agg_index,
 				   le16_to_cpu(cqe->len_list[i]));
 
@@ -986,7 +986,7 @@ static int qede_tpa_end(struct qede_dev *edev,
 		dma_unmap_page(rxq->dev, tpa_info->buffer.mapping,
 			       PAGE_SIZE, rxq->data_direction);
 
-	for (i = 0; cqe->len_list[i] && i < ARRAY_SIZE(cqe->len_list); i++)
+	for (i = 0; i < ARRAY_SIZE(cqe->len_list) && cqe->len_list[i]; i++)
 		qede_fill_frag_skb(edev, rxq, cqe->tpa_agg_index,
 				   le16_to_cpu(cqe->len_list[i]));
 	if (unlikely(i > 1))
-- 
2.54.0


^ permalink raw reply related

* Re: [PATCH net v2 2/2] sctp: add INIT verification after cookie unpacking
From: Xin Long @ 2026-06-23 14:45 UTC (permalink / raw)
  To: Simon Horman
  Cc: netdev, linux-sctp, davem, kuba, edumazet, pabeni,
	marcelo.leitner
In-Reply-To: <20260622165200.1050676-1-horms@kernel.org>

On Mon, Jun 22, 2026 at 12:52 PM Simon Horman <horms@kernel.org> wrote:
>
> This is an AI-generated review of your patch. The human sending this
> email has considered the AI review valid, or at least plausible.
> Full review at: https://netdev-ai.bots.linux.dev/sashiko/
> ---
> sctp: add INIT verification after cookie unpacking
>
> This commit adds sctp_verify_init() checks in the COOKIE-ECHO processing
> paths sctp_sf_do_5_1D_ce() and sctp_sf_do_5_2_4_dupcook() when
> cookie_auth_enable is disabled, and updates sctp_verify_init() to bound
> the parameter walk against peer_init length rather than chunk->chunk_end.
> On verification failure the new association is freed and an ABORT is
> generated via sctp_abort_on_init_err().
>
> > diff --git a/net/sctp/sm_statefuns.c b/net/sctp/sm_statefuns.c
> > index 8c636f045e45..6967e889d1bd 100644
> > --- a/net/sctp/sm_statefuns.c
> > +++ b/net/sctp/sm_statefuns.c
> > @@ -650,11 +650,12 @@ enum sctp_disposition sctp_sf_do_5_1D_ce(struct net *net,
> >                                        struct sctp_cmd_seq *commands)
> >  {
> >       struct sctp_ulpevent *ev, *ai_ev = NULL, *auth_ev = NULL;
> > +     struct sctp_chunk *err_chk_p = NULL;
> >       struct sctp_association *new_asoc;
> >       struct sctp_init_chunk *peer_init;
> >       struct sctp_chunk *chunk = arg;
> > -     struct sctp_chunk *err_chk_p;
> >       struct sctp_chunk *repl;
> > +     enum sctp_cid cid;
> >       struct sock *sk;
> >       int error = 0;
> >
> > @@ -728,6 +729,18 @@ enum sctp_disposition sctp_sf_do_5_1D_ce(struct net *net,
> >               }
> >       }
> >
> > +     peer_init = (struct sctp_init_chunk *)(chunk->subh.cookie_hdr + 1);
> > +     cid = peer_init->chunk_hdr.type;
> > +     if (!sctp_sk(sk)->cookie_auth_enable &&
> > +         !sctp_verify_init(net, ep, asoc, cid, peer_init, chunk,
> > +                           &err_chk_p)) {
> > +             sctp_association_free(new_asoc);
> > +             return sctp_abort_on_init_err(net, ep, asoc, cid, arg, commands,
> > +                                           err_chk_p);
> > +     }
>
> [Critical]
> When cookie_auth_enable is disabled the cookie HMAC is not validated by
> sctp_unpack_cookie(), so every byte of the embedded INIT (including
> peer_init->chunk_hdr.type) is attacker-controlled.  Is it safe to feed
> that byte directly into both sctp_verify_init() and
> sctp_abort_on_init_err() as cid?
>
> Inside the COOKIE-ECHO body the embedded chunk is by RFC always an INIT.
> Could the cid argument be hard-coded to SCTP_CID_INIT (or
> peer_init->chunk_hdr.type be validated against SCTP_CID_INIT before use)
> instead of trusting the wire byte?
>
> Two attacker-reachable consequences look possible if cid is forged to
> SCTP_CID_INIT_ACK (2) and the STATE_COOKIE parameter is omitted:
>
> sctp_verify_init() then returns 0 via this branch with *errp populated:
>
>         if ((SCTP_CID_INIT_ACK == cid) && !has_cookie)
>                 return sctp_process_missing_param(asoc, SCTP_PARAM_STATE_COOKIE,
>                                                   chunk, errp);
>
> control transfers to sctp_abort_on_init_err(net, ep, asoc, cid, ...) with
> cid == SCTP_CID_INIT_ACK and a non-NULL err_chunk.  After sending the
> ABORT packet, that helper falls through to its out: label because
> cid == SCTP_CID_INIT_ACK:
>
>         if (cid != SCTP_CID_INIT_ACK) {
>                 if (!packet)
>                         return SCTP_DISPOSITION_NOMEM;
>                 return SCTP_DISPOSITION_CONSUME;
>         }
>
> out:
>         ...
>         return sctp_stop_t1_and_abort(net, commands, error, ECONNREFUSED,
>                                       asoc, chunk->transport);
>
> In the sctp_sf_do_5_1D_ce() context, asoc is typically NULL (the function
> itself opens with if (asoc && !sctp_vtag_verify(chunk, asoc))).  Can the
> SCTP_CMD_NEW_STATE / SCTP_CMD_TIMER_STOP / SCTP_CMD_INIT_FAILED commands
> queued by sctp_stop_t1_and_abort() against a NULL asoc be dispatched
> without the sideeffect interpreter dereferencing asoc?
>
> > @@ -2205,6 +2219,19 @@ enum sctp_disposition sctp_sf_do_5_2_4_dupcook(
> >       switch (action) {
> >       case 'A': /* Association restart. */
> >       case 'B': /* Collision case B. */
> > +             peer_init = (struct sctp_init_chunk *)
> > +                             (chunk->subh.cookie_hdr + 1);
> > +             cid = peer_init->chunk_hdr.type;
> > +             if (!sctp_sk(ep->base.sk)->cookie_auth_enable &&
> > +                 !sctp_verify_init(net, ep, asoc, cid, peer_init, chunk,
> > +                                   &err_chk_p)) {
> > +                     sctp_association_free(new_asoc);
> > +                     return sctp_abort_on_init_err(net, ep, asoc, cid, arg,
> > +                                                   commands, err_chk_p);
> > +             }
> > +             if (err_chk_p)
> > +                     sctp_chunk_free(err_chk_p);
> > +             fallthrough;
>
> [Critical]
> The same attacker-controlled cid is propagated here as well.  In
> sctp_sf_do_5_2_4_dupcook() asoc is the existing established association.
> With a forged cid of SCTP_CID_INIT_ACK and missing STATE_COOKIE, can a
> single unauthenticated COOKIE-ECHO drive sctp_abort_on_init_err() into
> its out: path and have sctp_stop_t1_and_abort() tear down the established
> asoc and report ECONNREFUSED to the application?
>
> Would constraining the verification to the well-defined INIT case (for
> example, hard-coding SCTP_CID_INIT or rejecting any cid !=
> SCTP_CID_INIT before calling sctp_verify_init()/sctp_abort_on_init_err())
> avoid both of these paths?

I will place a check in sctp_unpack_cookie() for such malformed packets:

diff --git a/net/sctp/sm_make_chunk.c b/net/sctp/sm_make_chunk.c
index d5ee81934d93..8adac9e0cd66 100644
--- a/net/sctp/sm_make_chunk.c
+++ b/net/sctp/sm_make_chunk.c
@@ -1761,6 +1761,8 @@ struct sctp_association *sctp_unpack_cookie(
        bear_cookie = &cookie->c;

        ch = (struct sctp_chunkhdr *)(bear_cookie + 1);
+       if (ch->type != SCTP_CID_INIT)
+               goto malformed;
        chlen = ntohs(ch->length);
        if (chlen < sizeof(struct sctp_init_chunk))
                goto malformed;
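
The idea, reduced to a standalone userspace sketch: reject the embedded
chunk on its type byte before anything else consumes it. The helper
name embedded_init_ok() and the raw byte-buffer interface are invented
for this example, and the "fits in the buffer" check is an extra
assumption, not part of the hunk above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define CID_INIT     1	/* INIT chunk type */
#define CID_INIT_ACK 2	/* INIT ACK chunk type */

/*
 * buf points at a chunk header: type (1 byte), flags (1 byte),
 * length (2 bytes, big-endian). Return true only if the type is INIT
 * and the declared length is at least min_len and fits in buflen.
 */
static bool embedded_init_ok(const uint8_t *buf, size_t buflen, size_t min_len)
{
	size_t chlen;

	if (buflen < 4)
		return false;
	if (buf[0] != CID_INIT)		/* reject before the type is used */
		return false;
	chlen = ((size_t)buf[2] << 8) | buf[3];
	return chlen >= min_len && chlen <= buflen;
}
```

Once the type is pinned to INIT at unpack time, later code that reads
the type back from the cookie can only ever see INIT.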

Thanks.

^ permalink raw reply related

* [PATCH v2] netdevsim: fix use-after-free in nsim_create and __nsim_dev_port_del
From: Hrushiraj Gandhi @ 2026-06-23 14:44 UTC (permalink / raw)
  To: Jakub Kicinski
  Cc: Simon Horman, Andrew Lunn, David S . Miller, Eric Dumazet,
	Paolo Abeni, Jiri Pirko, netdev, linux-kernel, bpf,
	syzbot+6c25f4750230faf70be9, Hrushiraj Gandhi

debugfs files created under a port's ddir (ethtool/get_err,
ethtool/set_err, ring params, bpf_offloaded_id, udp_ports/inject_error,
etc.) store raw pointers directly into the netdevsim struct, which lives
in the net_device private data kmalloc slab.

If these files outlive the netdevsim struct, a concurrent reader can
trigger a slab-use-after-free by passing debugfs_file_get() (which only
checks dentry lifetime) and then dereferencing the freed data pointer
in debugfs_u32_get().

In __nsim_dev_port_del(), nsim_destroy() is called before
nsim_dev_port_debugfs_exit(). However, nsim_destroy() calls free_netdev()
at its end, while nsim_dev_port_debugfs_exit() removes the port's
debugfs directory. This means the slab is freed before the debugfs
files are removed.

The same window exists on nsim_create()'s error path:
nsim_ethtool_init() creates debugfs files under ddir with pointers into
ns before nsim_init_netdevsim()/nsim_init_netdevsim_vf() which can fail,
and the err_free_netdev label calls free_netdev() while those debugfs
entries are still live.

Fix both paths by calling debugfs_remove_recursive() on the port's
ddir before every free_netdev() call. The subsequent
nsim_dev_port_debugfs_exit() calls become harmless no-ops since ddir is
set to NULL.
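
The ordering rule can be modelled in userspace roughly as below. This is
only a sketch: struct port, struct sim and the "exported" pointer are
hypothetical stand-ins for the port's debugfs entries and the netdevsim
struct, and the model ignores concurrency (in the kernel, removing the
debugfs entries is also what waits out in-flight readers).

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

struct port {
	unsigned int *exported;	/* models a debugfs file's data pointer */
};

struct sim {
	unsigned int get_err;
	struct port *port;
};

static struct sim *sim_create(struct port *port)
{
	struct sim *s = calloc(1, sizeof(*s));

	if (!s)
		return NULL;
	s->port = port;
	port->exported = &s->get_err;	/* publish a pointer into s */
	return s;
}

static void sim_destroy(struct sim *s)
{
	s->port->exported = NULL;	/* unpublish first ... */
	free(s);			/* ... then release the memory */
}
```

Doing the two steps in the opposite order leaves port->exported dangling
for as long as the port outlives the freed struct.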

Reported-by: syzbot+6c25f4750230faf70be9@syzkaller.appspotmail.com
Closes: https://syzkaller.appspot.com/bug?extid=6c25f4750230faf70be9
Fixes: e05b2d141fef ("netdevsim: move netdev creation/destruction to dev probe")
Signed-off-by: Hrushiraj Gandhi <hrushirajg23@gmail.com>
---
v2:
- Also fix the same use-after-free window on the error path of nsim_create() as suggested by Simon Horman.
- Shorten the code comment in nsim_destroy() to be more concise.

 drivers/net/netdevsim/netdev.c | 9 +++++++++
 1 file changed, 9 insertions(+)

diff --git a/drivers/net/netdevsim/netdev.c b/drivers/net/netdevsim/netdev.c
index 27e5f109f933..f2824e75cddd 100644
--- a/drivers/net/netdevsim/netdev.c
+++ b/drivers/net/netdevsim/netdev.c
@@ -1165,6 +1165,8 @@ struct netdevsim *nsim_create(struct nsim_dev *nsim_dev,
 	return ns;
 
 err_free_netdev:
+	debugfs_remove_recursive(nsim_dev_port->ddir);
+	nsim_dev_port->ddir = NULL;
 	free_netdev(dev);
 	return ERR_PTR(err);
 }
@@ -1214,6 +1216,13 @@ void nsim_destroy(struct netdevsim *ns)
 		ns->page = NULL;
 	}
 
+	/*
+	 * Remove per-port debugfs files before free_netdev() releases the
+	 * netdevsim struct to prevent use-after-free in concurrent readers.
+	 */
+	debugfs_remove_recursive(ns->nsim_dev_port->ddir);
+	ns->nsim_dev_port->ddir = NULL;
+
 	free_netdev(dev);
 }
 
-- 
2.47.3


^ permalink raw reply related

* Re: [PATCH v1 0/3] thunderbold: A few cleanups
From: Uwe Kleine-König (The Capable Hub) @ 2026-06-23 14:35 UTC (permalink / raw)
  To: Mika Westerberg
  Cc: Mika Westerberg, Yehezkel Bernat, Andreas Noever, Andrew Lunn,
	David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
	netdev, linux-kernel, linux-usb
In-Reply-To: <20260623121746.GD3066@black.igk.intel.com>

[-- Attachment #1: Type: text/plain, Size: 753 bytes --]

Hello Mika,

On Tue, Jun 23, 2026 at 02:17:46PM +0200, Mika Westerberg wrote:
> On Thu, Jun 18, 2026 at 12:14:49PM +0200, Uwe Kleine-König (The Capable Hub) wrote:
> > Uwe Kleine-König (The Capable Hub) (3):
> >   thunderbold: Stop passing matched device ID to .probe()
> >   thunderbold: Assert that a service driver has a probe callback
> >   thunderbold: Drop comma after device id array terminator
> 
> Fixed the typo "thunderbold" -> "thunderbolt" and applied all to

Oh.

> thunderbolt.git/next. I also took the networking patch, let me know if
> that's not okay (I'm the maintainer of that driver too and it looked fine).

Sounds fine to me. So assuming you're not offending the network guys,
that should be ok.

Thanks!
Uwe

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply

* [PATCH 2/2] sched: Use WARN_ON.*_DEFERRED()
From: Sebastian Andrzej Siewior @ 2026-06-23 14:26 UTC (permalink / raw)
  To: linux-arch, linux-kernel, sched-ext, netdev
  Cc: David S . Miller, Andrea Righi, Andrew Morton, Arnd Bergmann,
	Ben Segall, Breno Leitao, Changwoo Min, David Vernet,
	Dietmar Eggemann, Eric Dumazet, Ingo Molnar, Jakub Kicinski,
	John Ogness, Juri Lelli, K Prateek Nayak, Paolo Abeni,
	Peter Zijlstra, Petr Mladek, Sergey Senozhatsky, Simon Horman,
	Steven Rostedt, Tejun Heo, Vincent Guittot, Vlad Poenaru,
	Sebastian Andrzej Siewior
In-Reply-To: <20260623142650.265721-1-bigeasy@linutronix.de>

Vlad managed to trigger a warning in __enqueue_entity() while the rq
lock was held. He was using the netconsole in an older kernel which was
a legacy console (not nbcon). This resulted in an immediate flush which
led to sending packets and this in turn led to waking ksoftirqd. This
wake up ended up in a deadlock because the scheduler tried to acquire the
rq lock, which was already held.

This problem is not limited to the netconsole but affects all legacy
consoles: Should the console wake any task while holding its internal
lock, then lockdep will observe and report a possible AB-BA deadlock.
Also, since the warning does not happen regularly, lockdep may observe a
lock chain while acquiring the locks, leading to a recursion report
while holding locks.
More importantly, once the console printing is finished, the console
semaphore is released, which will lead to a wakeup if there is a waiter
pending.

Replace WARNs within the scheduler with the DEFERRED variant. This will
queue an irq_work and the print will occur once the locks are dropped.
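
As a rough userspace model of the deferral idea (not the kernel
mechanism: the fixed-size queue, the lock_held flag and all function
names here are assumptions for illustration), the warning is only
recorded while the lock is held and printed after it is dropped:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>

#define MAX_PENDING 8

static const char *pending[MAX_PENDING];
static int nr_pending;
static int nr_printed;
static bool lock_held;	/* models the rq lock */

/* Called with the lock held: record the message, do not print. */
static void warn_deferred(const char *msg)
{
	if (nr_pending < MAX_PENDING)
		pending[nr_pending++] = msg;
}

/* Called after the lock is dropped: printing may now wake tasks safely. */
static void flush_deferred(void)
{
	assert(!lock_held);
	for (int i = 0; i < nr_pending; i++) {
		fprintf(stderr, "WARNING: %s\n", pending[i]);
		nr_printed++;
	}
	nr_pending = 0;
}

static void critical_section(bool bad)
{
	lock_held = true;
	if (bad)
		warn_deferred("unexpected state");
	lock_held = false;
	flush_deferred();
}
```

The point of the split is that nothing in the print path runs while the
lock is held, so it cannot re-enter code that needs that lock.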

Reported-by: Vlad Poenaru <vlad.wing@gmail.com>
Closes: https://lore.kernel.org/all/20260610183621.3915271-1-vlad.wing@gmail.com
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
---
 kernel/sched/core.c        |  78 +++++++++++++-------------
 kernel/sched/core_sched.c  |   6 +-
 kernel/sched/cpudeadline.c |   6 +-
 kernel/sched/deadline.c    |  62 ++++++++++-----------
 kernel/sched/ext.c         | 110 ++++++++++++++++++-------------------
 kernel/sched/fair.c        |  88 ++++++++++++++---------------
 kernel/sched/rt.c          |  36 ++++++------
 kernel/sched/sched.h       |  18 +++---
 8 files changed, 202 insertions(+), 202 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index b8871449d3c69..0e282457abb91 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -853,7 +853,7 @@ void update_rq_clock(struct rq *rq)
 		return;
 
 	if (sched_feat(WARN_DOUBLE_CLOCK))
-		WARN_ON_ONCE(rq->clock_update_flags & RQCF_UPDATED);
+		WARN_ON_ONCE_DEFERRED(rq->clock_update_flags & RQCF_UPDATED);
 	rq->clock_update_flags |= RQCF_UPDATED;
 
 	clock = sched_clock_cpu(cpu_of(rq));
@@ -1807,7 +1807,7 @@ static inline void uclamp_rq_dec_id(struct rq *rq, struct task_struct *p,
 
 	bucket = &uc_rq->bucket[uc_se->bucket_id];
 
-	WARN_ON_ONCE(!bucket->tasks);
+	WARN_ON_ONCE_DEFERRED(!bucket->tasks);
 	if (likely(bucket->tasks))
 		bucket->tasks--;
 
@@ -1827,7 +1827,7 @@ static inline void uclamp_rq_dec_id(struct rq *rq, struct task_struct *p,
 	 * Defensive programming: this should never happen. If it happens,
 	 * e.g. due to future modification, warn and fix up the expected value.
 	 */
-	WARN_ON_ONCE(bucket->value > rq_clamp);
+	WARN_ON_ONCE_DEFERRED(bucket->value > rq_clamp);
 	if (bucket->value >= rq_clamp) {
 		bkt_clamp = uclamp_rq_max_value(rq, clamp_id, uc_se->value);
 		uclamp_rq_set(rq, clamp_id, bkt_clamp);
@@ -2210,7 +2210,7 @@ void activate_task(struct rq *rq, struct task_struct *p, int flags)
 
 void deactivate_task(struct rq *rq, struct task_struct *p, int flags)
 {
-	WARN_ON_ONCE(flags & DEQUEUE_SLEEP);
+	WARN_ON_ONCE_DEFERRED(flags & DEQUEUE_SLEEP);
 
 	WRITE_ONCE(p->on_rq, TASK_ON_RQ_MIGRATING);
 	ASSERT_EXCLUSIVE_WRITER(p->on_rq);
@@ -2516,7 +2516,7 @@ static struct rq *move_queued_task(struct rq *rq, struct rq_flags *rf,
 	rq = cpu_rq(new_cpu);
 
 	rq_lock(rq, rf);
-	WARN_ON_ONCE(task_cpu(p) != new_cpu);
+	WARN_ON_ONCE_DEFERRED(task_cpu(p) != new_cpu);
 	activate_task(rq, p, 0);
 	wakeup_preempt(rq, p, 0);
 
@@ -2602,7 +2602,7 @@ static int migration_cpu_stop(void *data)
 	 * If we were passed a pending, then ->stop_pending was set, thus
 	 * p->migration_pending must have remained stable.
 	 */
-	WARN_ON_ONCE(pending && pending != p->migration_pending);
+	WARN_ON_ONCE_DEFERRED(pending && pending != p->migration_pending);
 
 	/*
 	 * If task_rq(p) != rq, it cannot be migrated here, because we're
@@ -2661,7 +2661,7 @@ static int migration_cpu_stop(void *data)
 		 * determine is_migration_disabled() and so have to chase after
 		 * it.
 		 */
-		WARN_ON_ONCE(!pending->stop_pending);
+		WARN_ON_ONCE_DEFERRED(!pending->stop_pending);
 		preempt_disable();
 		rq_unlock(rq, &rf);
 		raw_spin_unlock_irqrestore(&p->pi_lock, rf.flags);
@@ -3004,7 +3004,7 @@ static int affine_move_task(struct rq *rq, struct task_struct *p, struct rq_flag
 	 *
 	 * Either way, we really should have a @pending here.
 	 */
-	if (WARN_ON_ONCE(!pending)) {
+	if (WARN_ON_ONCE_DEFERRED(!pending)) {
 		task_rq_unlock(rq, p, rf);
 		return -EINVAL;
 	}
@@ -3116,9 +3116,9 @@ static int __set_cpus_allowed_ptr_locked(struct task_struct *p,
 			goto out;
 		}
 
-		if (WARN_ON_ONCE(p == current &&
-				 is_migration_disabled(p) &&
-				 !cpumask_test_cpu(task_cpu(p), ctx->new_mask))) {
+		if (WARN_ON_ONCE_DEFERRED(p == current &&
+					  is_migration_disabled(p) &&
+					  !cpumask_test_cpu(task_cpu(p), ctx->new_mask))) {
 			ret = -EBUSY;
 			goto out;
 		}
@@ -3267,7 +3267,7 @@ void force_compatible_cpus_allowed_ptr(struct task_struct *p)
 				cpumask_pr_args(override_mask));
 	}
 
-	WARN_ON(set_cpus_allowed_ptr(p, override_mask));
+	WARN_ON_DEFERRED(set_cpus_allowed_ptr(p, override_mask));
 out_free_mask:
 	cpus_read_unlock();
 	free_cpumask_var(new_mask);
@@ -3293,7 +3293,7 @@ void relax_compatible_cpus_allowed_ptr(struct task_struct *p)
 	 * Cpuset masking will be done there too.
 	 */
 	ret = __sched_setaffinity(p, &ac);
-	WARN_ON_ONCE(ret);
+	WARN_ON_ONCE_DEFERRED(ret);
 }
 
 #ifdef CONFIG_SMP
@@ -3306,16 +3306,16 @@ void set_task_cpu(struct task_struct *p, unsigned int new_cpu)
 	 * We should never call set_task_cpu() on a blocked task,
 	 * ttwu() will sort out the placement.
 	 */
-	WARN_ON_ONCE(state != TASK_RUNNING && state != TASK_WAKING && !p->on_rq);
+	WARN_ON_ONCE_DEFERRED(state != TASK_RUNNING && state != TASK_WAKING && !p->on_rq);
 
 	/*
 	 * Migrating fair class task must have p->on_rq = TASK_ON_RQ_MIGRATING,
 	 * because schedstat_wait_{start,end} rebase migrating task's wait_start
 	 * time relying on p->on_rq.
 	 */
-	WARN_ON_ONCE(state == TASK_RUNNING &&
-		     p->sched_class == &fair_sched_class &&
-		     (p->on_rq && !task_on_rq_migrating(p)));
+	WARN_ON_ONCE_DEFERRED(state == TASK_RUNNING &&
+			      p->sched_class == &fair_sched_class &&
+			      (p->on_rq && !task_on_rq_migrating(p)));
 
 #ifdef CONFIG_LOCKDEP
 	/*
@@ -3328,15 +3328,15 @@ void set_task_cpu(struct task_struct *p, unsigned int new_cpu)
 	 * Furthermore, all task_rq users should acquire both locks, see
 	 * task_rq_lock().
 	 */
-	WARN_ON_ONCE(debug_locks && !(lockdep_is_held(&p->pi_lock) ||
-				      lockdep_is_held(__rq_lockp(task_rq(p)))));
+	WARN_ON_ONCE_DEFERRED(debug_locks && !(lockdep_is_held(&p->pi_lock) ||
+					       lockdep_is_held(__rq_lockp(task_rq(p)))));
 #endif
 	/*
 	 * Clearly, migrating tasks to offline CPUs is a fairly daft thing.
 	 */
-	WARN_ON_ONCE(!cpu_online(new_cpu));
+	WARN_ON_ONCE_DEFERRED(!cpu_online(new_cpu));
 
-	WARN_ON_ONCE(is_migration_disabled(p));
+	WARN_ON_ONCE_DEFERRED(is_migration_disabled(p));
 
 	trace_sched_migrate_task(p, new_cpu);
 
@@ -3803,10 +3803,10 @@ void sched_ttwu_pending(void *arg)
 	update_rq_clock(rq);
 
 	llist_for_each_entry_safe(p, t, llist, wake_entry.llist) {
-		if (WARN_ON_ONCE(p->on_cpu))
+		if (WARN_ON_ONCE_DEFERRED(p->on_cpu))
 			smp_cond_load_acquire(&p->on_cpu, !VAL);
 
-		if (WARN_ON_ONCE(task_cpu(p) != cpu_of(rq)))
+		if (WARN_ON_ONCE_DEFERRED(task_cpu(p) != cpu_of(rq)))
 			set_task_cpu(p, cpu_of(rq));
 
 		ttwu_do_activate(rq, p, p->sched_remote_wakeup ? WF_MIGRATED : 0, &rf);
@@ -4003,8 +4003,8 @@ bool ttwu_state_match(struct task_struct *p, unsigned int state, int *success)
 	int match;
 
 	if (IS_ENABLED(CONFIG_DEBUG_PREEMPT)) {
-		WARN_ON_ONCE((state & TASK_RTLOCK_WAIT) &&
-			     state != TASK_RTLOCK_WAIT);
+		WARN_ON_ONCE_DEFERRED((state & TASK_RTLOCK_WAIT) &&
+				      state != TASK_RTLOCK_WAIT);
 	}
 
 	*success = !!(match = __task_state_match(p, state));
@@ -5745,7 +5745,7 @@ static void sched_tick_remote(struct work_struct *work)
 			 * we are always sure that there is no proxy (only a
 			 * single task is running).
 			 */
-			WARN_ON_ONCE(rq->curr != rq->donor);
+			WARN_ON_ONCE_DEFERRED(rq->curr != rq->donor);
 			update_rq_clock(rq);
 
 			if (!is_idle_task(curr)) {
@@ -5754,7 +5754,7 @@ static void sched_tick_remote(struct work_struct *work)
 				 * reasonable amount of time.
 				 */
 				u64 delta = rq_clock_task(rq) - curr->se.exec_start;
-				WARN_ON_ONCE(delta > (u64)NSEC_PER_SEC * 30);
+				WARN_ON_ONCE_DEFERRED(delta > (u64)NSEC_PER_SEC * 30);
 			}
 			curr->sched_class->task_tick(rq, curr, 0);
 
@@ -5769,7 +5769,7 @@ static void sched_tick_remote(struct work_struct *work)
 	 * first update state to reflect hotplug activity if required.
 	 */
 	os = atomic_fetch_add_unless(&twork->state, -1, TICK_SCHED_REMOTE_RUNNING);
-	WARN_ON_ONCE(os == TICK_SCHED_REMOTE_OFFLINE);
+	WARN_ON_ONCE_DEFERRED(os == TICK_SCHED_REMOTE_OFFLINE);
 	if (os == TICK_SCHED_REMOTE_RUNNING)
 		queue_delayed_work(system_dfl_wq, dwork, HZ);
 }
@@ -6196,7 +6196,7 @@ pick_next_task(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
 			 * For robustness, update the min_vruntime_fi for
 			 * unconstrained picks as well.
 			 */
-			WARN_ON_ONCE(fi_before);
+			WARN_ON_ONCE_DEFERRED(fi_before);
 			task_vruntime_update(rq, next, false);
 			goto out_set_next;
 		}
@@ -6274,7 +6274,7 @@ pick_next_task(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
 	rq->core_sched_seq = rq->core->core_pick_seq;
 
 	/* Something should have been selected for current CPU */
-	WARN_ON_ONCE(!next);
+	WARN_ON_ONCE_DEFERRED(!next);
 
 	/*
 	 * Reschedule siblings
@@ -6317,7 +6317,7 @@ pick_next_task(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
 		}
 
 		/* Did we break L1TF mitigation requirements? */
-		WARN_ON_ONCE(!cookie_match(next, rq_i->core_pick));
+		WARN_ON_ONCE_DEFERRED(!cookie_match(next, rq_i->core_pick));
 
 		if (rq_i->curr == rq_i->core_pick) {
 			rq_i->core_pick = NULL;
@@ -6717,7 +6717,7 @@ static void proxy_migrate_task(struct rq *rq, struct rq_flags *rf,
 	struct rq *target_rq = cpu_rq(target_cpu);
 
 	lockdep_assert_rq_held(rq);
-	WARN_ON(p == rq->curr);
+	WARN_ON_DEFERRED(p == rq->curr);
 	/*
 	 * Since we are migrating a blocked donor, it could be rq->donor,
 	 * and we want to make sure there aren't any references from this
@@ -6749,7 +6749,7 @@ static void proxy_force_return(struct rq *rq, struct rq_flags *rf,
 	int cpu, wake_flag = WF_TTWU;
 
 	lockdep_assert_rq_held(rq);
-	WARN_ON(p == rq->curr);
+	WARN_ON_DEFERRED(p == rq->curr);
 
 	if (p == rq->donor)
 		proxy_resched_idle(rq);
@@ -6951,7 +6951,7 @@ find_proxy_task(struct rq *rq, struct task_struct *donor, struct rq_flags *rf)
 		 * guarantee its existence, as per ttwu_remote().
 		 */
 	}
-	WARN_ON_ONCE(owner && !owner->on_rq);
+	WARN_ON_ONCE_DEFERRED(owner && !owner->on_rq);
 	return owner;
 
 deactivate:
@@ -7631,8 +7631,8 @@ void rt_mutex_setprio(struct task_struct *p, struct task_struct *pi_task)
 	 * real need to boost.
 	 */
 	if (unlikely(p == rq->idle)) {
-		WARN_ON(p != rq->curr);
-		WARN_ON(p->pi_blocked_on);
+		WARN_ON_DEFERRED(p != rq->curr);
+		WARN_ON_DEFERRED(p->pi_blocked_on);
 		goto out_unlock;
 	}
 
@@ -8463,7 +8463,7 @@ static void balance_push_set(int cpu, bool on)
 
 	rq_lock_irqsave(rq, &rf);
 	if (on) {
-		WARN_ON_ONCE(rq->balance_callback);
+		WARN_ON_ONCE_DEFERRED(rq->balance_callback);
 		rq->balance_callback = &balance_push_callback;
 	} else if (rq->balance_callback == &balance_push_callback) {
 		rq->balance_callback = NULL;
@@ -11150,7 +11150,7 @@ struct sched_change_ctx *sched_change_begin(struct task_struct *p, unsigned int
 	 * Must exclusively use matched flags since this is both dequeue and
 	 * enqueue.
 	 */
-	WARN_ON_ONCE(flags & 0xFFFF0000);
+	WARN_ON_ONCE_DEFERRED(flags & 0xFFFF0000);
 
 	lockdep_assert_rq_held(rq);
 
@@ -11198,7 +11198,7 @@ void sched_change_end(struct sched_change_ctx *ctx)
 	/*
 	 * Changing class without *QUEUE_CLASS is bad.
 	 */
-	WARN_ON_ONCE(p->sched_class != ctx->class && !(ctx->flags & ENQUEUE_CLASS));
+	WARN_ON_ONCE_DEFERRED(p->sched_class != ctx->class && !(ctx->flags & ENQUEUE_CLASS));
 
 	if ((ctx->flags & ENQUEUE_CLASS) && p->sched_class->switching_to)
 		p->sched_class->switching_to(rq, p);
diff --git a/kernel/sched/core_sched.c b/kernel/sched/core_sched.c
index 73b6b24269119..ec88ed7d8ee87 100644
--- a/kernel/sched/core_sched.c
+++ b/kernel/sched/core_sched.c
@@ -67,7 +67,7 @@ static unsigned long sched_core_update_cookie(struct task_struct *p,
 	 * a cookie until after we've removed it, we must have core scheduling
 	 * enabled here.
 	 */
-	WARN_ON_ONCE((p->core_cookie || cookie) && !sched_core_enabled(rq));
+	WARN_ON_ONCE_DEFERRED((p->core_cookie || cookie) && !sched_core_enabled(rq));
 
 	if (sched_core_enqueued(p))
 		sched_core_dequeue(rq, p, DEQUEUE_SAVE);
@@ -249,7 +249,7 @@ void __sched_core_account_forceidle(struct rq *rq)
 
 	lockdep_assert_rq_held(rq);
 
-	WARN_ON_ONCE(!rq->core->core_forceidle_count);
+	WARN_ON_ONCE_DEFERRED(!rq->core->core_forceidle_count);
 
 	if (rq->core->core_forceidle_start == 0)
 		return;
@@ -260,7 +260,7 @@ void __sched_core_account_forceidle(struct rq *rq)
 
 	rq->core->core_forceidle_start = now;
 
-	if (WARN_ON_ONCE(!rq->core->core_forceidle_occupation)) {
+	if (WARN_ON_ONCE_DEFERRED(!rq->core->core_forceidle_occupation)) {
 		/* can't be forced idle without a running task */
 	} else if (rq->core->core_forceidle_count > 1 ||
 		   rq->core->core_forceidle_occupation > 1) {
diff --git a/kernel/sched/cpudeadline.c b/kernel/sched/cpudeadline.c
index 0a2b7e30fd10c..e305a8e993e27 100644
--- a/kernel/sched/cpudeadline.c
+++ b/kernel/sched/cpudeadline.c
@@ -149,7 +149,7 @@ int cpudl_find(struct cpudl *cp, struct task_struct *p,
 	} else {
 		int best_cpu = cpudl_maximum(cp);
 
-		WARN_ON(best_cpu != -1 && !cpu_present(best_cpu));
+		WARN_ON_DEFERRED(best_cpu != -1 && !cpu_present(best_cpu));
 
 		if (cpumask_test_cpu(best_cpu, &p->cpus_mask) &&
 		    dl_time_before(dl_se->deadline, cp->elements[0].dl)) {
@@ -177,7 +177,7 @@ void cpudl_clear(struct cpudl *cp, int cpu, bool online)
 	int old_idx, new_cpu;
 	unsigned long flags;
 
-	WARN_ON(!cpu_present(cpu));
+	WARN_ON_DEFERRED(!cpu_present(cpu));
 
 	raw_spin_lock_irqsave(&cp->lock, flags);
 
@@ -220,7 +220,7 @@ void cpudl_set(struct cpudl *cp, int cpu, u64 dl)
 	int old_idx;
 	unsigned long flags;
 
-	WARN_ON(!cpu_present(cpu));
+	WARN_ON_DEFERRED(!cpu_present(cpu));
 
 	raw_spin_lock_irqsave(&cp->lock, flags);
 
diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
index 7db4c87df83b0..863ac7509192f 100644
--- a/kernel/sched/deadline.c
+++ b/kernel/sched/deadline.c
@@ -217,8 +217,8 @@ void __add_running_bw(u64 dl_bw, struct dl_rq *dl_rq)
 
 	lockdep_assert_rq_held(rq_of_dl_rq(dl_rq));
 	dl_rq->running_bw += dl_bw;
-	WARN_ON_ONCE(dl_rq->running_bw < old); /* overflow */
-	WARN_ON_ONCE(dl_rq->running_bw > dl_rq->this_bw);
+	WARN_ON_ONCE_DEFERRED(dl_rq->running_bw < old); /* overflow */
+	WARN_ON_ONCE_DEFERRED(dl_rq->running_bw > dl_rq->this_bw);
 	/* kick cpufreq (see the comment in kernel/sched/sched.h). */
 	cpufreq_update_util(rq_of_dl_rq(dl_rq), 0);
 }
@@ -230,7 +230,7 @@ void __sub_running_bw(u64 dl_bw, struct dl_rq *dl_rq)
 
 	lockdep_assert_rq_held(rq_of_dl_rq(dl_rq));
 	dl_rq->running_bw -= dl_bw;
-	WARN_ON_ONCE(dl_rq->running_bw > old); /* underflow */
+	WARN_ON_ONCE_DEFERRED(dl_rq->running_bw > old); /* underflow */
 	if (dl_rq->running_bw > old)
 		dl_rq->running_bw = 0;
 	/* kick cpufreq (see the comment in kernel/sched/sched.h). */
@@ -244,7 +244,7 @@ void __add_rq_bw(u64 dl_bw, struct dl_rq *dl_rq)
 
 	lockdep_assert_rq_held(rq_of_dl_rq(dl_rq));
 	dl_rq->this_bw += dl_bw;
-	WARN_ON_ONCE(dl_rq->this_bw < old); /* overflow */
+	WARN_ON_ONCE_DEFERRED(dl_rq->this_bw < old); /* overflow */
 }
 
 static inline
@@ -254,10 +254,10 @@ void __sub_rq_bw(u64 dl_bw, struct dl_rq *dl_rq)
 
 	lockdep_assert_rq_held(rq_of_dl_rq(dl_rq));
 	dl_rq->this_bw -= dl_bw;
-	WARN_ON_ONCE(dl_rq->this_bw > old); /* underflow */
+	WARN_ON_ONCE_DEFERRED(dl_rq->this_bw > old); /* underflow */
 	if (dl_rq->this_bw > old)
 		dl_rq->this_bw = 0;
-	WARN_ON_ONCE(dl_rq->running_bw > dl_rq->this_bw);
+	WARN_ON_ONCE_DEFERRED(dl_rq->running_bw > dl_rq->this_bw);
 }
 
 static inline
@@ -335,7 +335,7 @@ void cancel_inactive_timer(struct sched_dl_entity *dl_se)
 
 static void dl_change_utilization(struct task_struct *p, u64 new_bw)
 {
-	WARN_ON_ONCE(p->dl.flags & SCHED_FLAG_SUGOV);
+	WARN_ON_ONCE_DEFERRED(p->dl.flags & SCHED_FLAG_SUGOV);
 
 	if (task_on_rq_queued(p))
 		return;
@@ -416,7 +416,7 @@ static void task_non_contending(struct sched_dl_entity *dl_se, bool dl_task)
 	if (dl_entity_is_special(dl_se))
 		return;
 
-	WARN_ON(dl_se->dl_non_contending);
+	WARN_ON_DEFERRED(dl_se->dl_non_contending);
 
 	zerolag_time = dl_se->deadline -
 		 div64_long((dl_se->runtime * dl_se->dl_period),
@@ -582,7 +582,7 @@ static void enqueue_pushable_dl_task(struct rq *rq, struct task_struct *p)
 {
 	struct rb_node *leftmost;
 
-	WARN_ON_ONCE(!RB_EMPTY_NODE(&p->pushable_dl_tasks));
+	WARN_ON_ONCE_DEFERRED(!RB_EMPTY_NODE(&p->pushable_dl_tasks));
 
 	leftmost = rb_add_cached(&p->pushable_dl_tasks,
 				 &rq->dl.pushable_dl_tasks_root,
@@ -664,7 +664,7 @@ static struct rq *dl_task_offline_migration(struct rq *rq, struct task_struct *p
 			 * Failed to find any suitable CPU.
 			 * The task will never come back!
 			 */
-			WARN_ON_ONCE(dl_bandwidth_enabled());
+			WARN_ON_ONCE_DEFERRED(dl_bandwidth_enabled());
 
 			/*
 			 * If admission control is disabled we
@@ -756,8 +756,8 @@ static inline void setup_new_dl_entity(struct sched_dl_entity *dl_se)
 	struct dl_rq *dl_rq = dl_rq_of_se(dl_se);
 	struct rq *rq = rq_of_dl_rq(dl_rq);
 
-	WARN_ON(is_dl_boosted(dl_se));
-	WARN_ON(dl_time_before(rq_clock(rq), dl_se->deadline));
+	WARN_ON_DEFERRED(is_dl_boosted(dl_se));
+	WARN_ON_DEFERRED(dl_time_before(rq_clock(rq), dl_se->deadline));
 
 	/*
 	 * We are racing with the deadline timer. So, do nothing because
@@ -801,7 +801,7 @@ static void replenish_dl_entity(struct sched_dl_entity *dl_se)
 	struct dl_rq *dl_rq = dl_rq_of_se(dl_se);
 	struct rq *rq = rq_of_dl_rq(dl_rq);
 
-	WARN_ON_ONCE(pi_of(dl_se)->dl_runtime <= 0);
+	WARN_ON_ONCE_DEFERRED(pi_of(dl_se)->dl_runtime <= 0);
 
 	/*
 	 * This could be the case for a !-dl task that is boosted.
@@ -975,7 +975,7 @@ update_dl_revised_wakeup(struct sched_dl_entity *dl_se, struct rq *rq)
 	 *
 	 * See update_dl_entity() comments for further details.
 	 */
-	WARN_ON(dl_time_before(dl_se->deadline, rq_clock(rq)));
+	WARN_ON_DEFERRED(dl_time_before(dl_se->deadline, rq_clock(rq)));
 
 	dl_se->runtime = (dl_se->dl_density * laxity) >> BW_SHIFT;
 }
@@ -1080,7 +1080,7 @@ static int start_dl_timer(struct sched_dl_entity *dl_se)
 	 * (current u > U).
 	 */
 	if (dl_se->dl_defer_armed) {
-		WARN_ON_ONCE(!dl_se->dl_throttled);
+		WARN_ON_ONCE_DEFERRED(!dl_se->dl_throttled);
 		act = ns_to_ktime(dl_se->deadline - dl_se->runtime);
 	} else {
 		/* act = deadline - rel-deadline + period */
@@ -1451,7 +1451,7 @@ static void update_curr_dl_se(struct rq *rq, struct sched_dl_entity *dl_se, s64
 		/*
 		 * Non-servers would never get time accounted while throttled.
 		 */
-		WARN_ON_ONCE(!dl_server(dl_se));
+		WARN_ON_ONCE_DEFERRED(!dl_server(dl_se));
 
 		/*
 		 * While the server is marked idle, do not push out the
@@ -1492,7 +1492,7 @@ static void update_curr_dl_se(struct rq *rq, struct sched_dl_entity *dl_se, s64
 		 * and queue right away. Otherwise nothing might queue it. That's similar
 		 * to what enqueue_dl_entity() does on start_dl_timer==0. For now, just warn.
 		 */
-		WARN_ON_ONCE(!start_dl_timer(dl_se));
+		WARN_ON_ONCE_DEFERRED(!start_dl_timer(dl_se));
 
 		return;
 	}
@@ -1801,7 +1801,7 @@ void dl_server_start(struct sched_dl_entity *dl_se)
 	 */
 	rq->donor->sched_class->update_curr(rq);
 
-	if (WARN_ON_ONCE(!cpu_online(cpu_of(rq))))
+	if (WARN_ON_ONCE_DEFERRED(!cpu_online(cpu_of(rq))))
 		return;
 
 	trace_sched_dl_server_start_tp(dl_se, cpu_of(rq), dl_get_type(dl_se, rq));
@@ -2073,7 +2073,7 @@ void inc_dl_tasks(struct sched_dl_entity *dl_se, struct dl_rq *dl_rq)
 static inline
 void dec_dl_tasks(struct sched_dl_entity *dl_se, struct dl_rq *dl_rq)
 {
-	WARN_ON(!dl_rq->dl_nr_running);
+	WARN_ON_DEFERRED(!dl_rq->dl_nr_running);
 	dl_rq->dl_nr_running--;
 
 	if (!dl_server(dl_se))
@@ -2165,7 +2165,7 @@ static void __enqueue_dl_entity(struct sched_dl_entity *dl_se)
 {
 	struct dl_rq *dl_rq = dl_rq_of_se(dl_se);
 
-	WARN_ON_ONCE(!RB_EMPTY_NODE(&dl_se->rb_node));
+	WARN_ON_ONCE_DEFERRED(!RB_EMPTY_NODE(&dl_se->rb_node));
 
 	rb_add_cached(&dl_se->rb_node, &dl_rq->root, __dl_less);
 
@@ -2189,7 +2189,7 @@ static void __dequeue_dl_entity(struct sched_dl_entity *dl_se)
 static void
 enqueue_dl_entity(struct sched_dl_entity *dl_se, int flags)
 {
-	WARN_ON_ONCE(on_dl_rq(dl_se));
+	WARN_ON_ONCE_DEFERRED(on_dl_rq(dl_se));
 
 	update_stats_enqueue_dl(dl_rq_of_se(dl_se), dl_se, flags);
 
@@ -2611,7 +2611,7 @@ static struct task_struct *__pick_task_dl(struct rq *rq, struct rq_flags *rf)
 		return NULL;
 
 	dl_se = pick_next_dl_entity(dl_rq);
-	WARN_ON_ONCE(!dl_se);
+	WARN_ON_ONCE_DEFERRED(!dl_se);
 
 	if (dl_server(dl_se)) {
 		p = dl_se->server_pick_task(dl_se, rf);
@@ -2823,12 +2823,12 @@ static struct task_struct *pick_next_pushable_dl_task(struct rq *rq)
 	if (!p)
 		return NULL;
 
-	WARN_ON_ONCE(rq->cpu != task_cpu(p));
-	WARN_ON_ONCE(task_current(rq, p));
-	WARN_ON_ONCE(p->nr_cpus_allowed <= 1);
+	WARN_ON_ONCE_DEFERRED(rq->cpu != task_cpu(p));
+	WARN_ON_ONCE_DEFERRED(task_current(rq, p));
+	WARN_ON_ONCE_DEFERRED(p->nr_cpus_allowed <= 1);
 
-	WARN_ON_ONCE(!task_on_rq_queued(p));
-	WARN_ON_ONCE(!dl_task(p));
+	WARN_ON_ONCE_DEFERRED(!task_on_rq_queued(p));
+	WARN_ON_ONCE_DEFERRED(!dl_task(p));
 
 	return p;
 }
@@ -2944,7 +2944,7 @@ static int push_dl_task(struct rq *rq)
 	if (is_migration_disabled(next_task))
 		return 0;
 
-	if (WARN_ON(next_task == rq->curr))
+	if (WARN_ON_DEFERRED(next_task == rq->curr))
 		return 0;
 
 	/* We might release rq lock */
@@ -3050,8 +3050,8 @@ static void pull_dl_task(struct rq *this_rq)
 		 */
 		if (p && dl_time_before(p->dl.deadline, dmin) &&
 		    dl_task_is_earliest_deadline(p, this_rq)) {
-			WARN_ON(p == src_rq->curr);
-			WARN_ON(!task_on_rq_queued(p));
+			WARN_ON_DEFERRED(p == src_rq->curr);
+			WARN_ON_DEFERRED(!task_on_rq_queued(p));
 
 			/*
 			 * Then we pull iff p has actually an earlier
@@ -3109,7 +3109,7 @@ static void set_cpus_allowed_dl(struct task_struct *p,
 {
 	struct rq *rq;
 
-	WARN_ON_ONCE(!dl_task(p));
+	WARN_ON_ONCE_DEFERRED(!dl_task(p));
 
 	rq = task_rq(p);
 	/*
diff --git a/kernel/sched/ext.c b/kernel/sched/ext.c
index 5d2d19473a82e..47d3a4c16455a 100644
--- a/kernel/sched/ext.c
+++ b/kernel/sched/ext.c
@@ -512,12 +512,12 @@ do {										\
  * So if kf_tasks[] is set, @p's scheduler-protected fields are stable.
  *
  * kf_tasks[] can not stack, so task-based SCX ops must not nest. The
- * WARN_ON_ONCE() in each macro catches a re-entry of any of the three variants
- * while a previous one is still in progress.
+ * WARN_ON_ONCE_DEFERRED() in each macro catches a re-entry of any of the three
+ * variants while a previous one is still in progress.
  */
 #define SCX_CALL_OP_TASK(sch, op, locked_rq, task, args...)			\
 do {										\
-	WARN_ON_ONCE(current->scx.kf_tasks[0]);					\
+	WARN_ON_ONCE_DEFERRED(current->scx.kf_tasks[0]);			\
 	current->scx.kf_tasks[0] = task;					\
 	SCX_CALL_OP((sch), op, locked_rq, task, ##args);			\
 	current->scx.kf_tasks[0] = NULL;					\
@@ -526,7 +526,7 @@ do {										\
 #define SCX_CALL_OP_TASK_RET(sch, op, locked_rq, task, args...)			\
 ({										\
 	__typeof__((sch)->ops.op(task, ##args)) __ret;				\
-	WARN_ON_ONCE(current->scx.kf_tasks[0]);					\
+	WARN_ON_ONCE_DEFERRED(current->scx.kf_tasks[0]);			\
 	current->scx.kf_tasks[0] = task;					\
 	__ret = SCX_CALL_OP_RET((sch), op, locked_rq, task, ##args);		\
 	current->scx.kf_tasks[0] = NULL;					\
@@ -536,7 +536,7 @@ do {										\
 #define SCX_CALL_OP_2TASKS_RET(sch, op, locked_rq, task0, task1, args...)	\
 ({										\
 	__typeof__((sch)->ops.op(task0, task1, ##args)) __ret;			\
-	WARN_ON_ONCE(current->scx.kf_tasks[0]);					\
+	WARN_ON_ONCE_DEFERRED(current->scx.kf_tasks[0]);			\
 	current->scx.kf_tasks[0] = task0;					\
 	current->scx.kf_tasks[1] = task1;					\
 	__ret = SCX_CALL_OP_RET((sch), op, locked_rq, task0, task1, ##args);	\
@@ -687,7 +687,7 @@ static bool nldsq_cursor_lost_task(struct scx_dsq_list_node *cursor,
 		return true;
 
 	/* if @p has stayed on @dsq, its rq couldn't have changed */
-	if (WARN_ON_ONCE(rq != task_rq(p)))
+	if (WARN_ON_ONCE_DEFERRED(rq != task_rq(p)))
 		return true;
 
 	return false;
@@ -1282,7 +1282,7 @@ static void schedule_reenq_local(struct rq *rq, u64 reenq_flags)
 {
 	struct scx_sched *root = rcu_dereference_sched(scx_root);
 
-	if (WARN_ON_ONCE(!root))
+	if (WARN_ON_ONCE_DEFERRED(!root))
 		return;
 
 	schedule_dsq_reenq(root, &rq->scx.local_dsq, reenq_flags, rq);
@@ -1379,7 +1379,7 @@ static void dsq_inc_nr(struct scx_dispatch_q *dsq, struct task_struct *p, u64 en
 	 */
 	if (enq_flags & SCX_ENQ_IMMED) {
 		if (unlikely(dsq->id != SCX_DSQ_LOCAL)) {
-			WARN_ON_ONCE(!(enq_flags & SCX_ENQ_GDSQ_FALLBACK));
+			WARN_ON_ONCE_DEFERRED(!(enq_flags & SCX_ENQ_GDSQ_FALLBACK));
 			return;
 		}
 		p->scx.flags |= SCX_TASK_IMMED;
@@ -1388,7 +1388,7 @@ static void dsq_inc_nr(struct scx_dispatch_q *dsq, struct task_struct *p, u64 en
 	if (p->scx.flags & SCX_TASK_IMMED) {
 		struct rq *rq = container_of(dsq, struct rq, scx.local_dsq);
 
-		if (WARN_ON_ONCE(dsq->id != SCX_DSQ_LOCAL))
+		if (WARN_ON_ONCE_DEFERRED(dsq->id != SCX_DSQ_LOCAL))
 			return;
 
 		rq->scx.nr_immed++;
@@ -1410,8 +1410,8 @@ static void dsq_dec_nr(struct scx_dispatch_q *dsq, struct task_struct *p)
 	if (p->scx.flags & SCX_TASK_IMMED) {
 		struct rq *rq = container_of(dsq, struct rq, scx.local_dsq);
 
-		if (WARN_ON_ONCE(dsq->id != SCX_DSQ_LOCAL) ||
-		    WARN_ON_ONCE(rq->scx.nr_immed <= 0))
+		if (WARN_ON_ONCE_DEFERRED(dsq->id != SCX_DSQ_LOCAL) ||
+		    WARN_ON_ONCE_DEFERRED(rq->scx.nr_immed <= 0))
 			return;
 
 		rq->scx.nr_immed--;
@@ -1521,9 +1521,9 @@ static void dispatch_enqueue(struct scx_sched *sch, struct rq *rq,
 {
 	bool is_local = dsq->id == SCX_DSQ_LOCAL;
 
-	WARN_ON_ONCE(p->scx.dsq || !list_empty(&p->scx.dsq_list.node));
-	WARN_ON_ONCE((p->scx.dsq_flags & SCX_TASK_DSQ_ON_PRIQ) ||
-		     !RB_EMPTY_NODE(&p->scx.dsq_priq));
+	WARN_ON_ONCE_DEFERRED(p->scx.dsq || !list_empty(&p->scx.dsq_list.node));
+	WARN_ON_ONCE_DEFERRED((p->scx.dsq_flags & SCX_TASK_DSQ_ON_PRIQ) ||
+			      !RB_EMPTY_NODE(&p->scx.dsq_priq));
 
 	if (!is_local) {
 		raw_spin_lock_nested(&dsq->lock,
@@ -1646,7 +1646,7 @@ static void dispatch_enqueue(struct scx_sched *sch, struct rq *rq,
 static void task_unlink_from_dsq(struct task_struct *p,
 				 struct scx_dispatch_q *dsq)
 {
-	WARN_ON_ONCE(list_empty(&p->scx.dsq_list.node));
+	WARN_ON_ONCE_DEFERRED(list_empty(&p->scx.dsq_list.node));
 
 	if (p->scx.dsq_flags & SCX_TASK_DSQ_ON_PRIQ) {
 		rb_erase(&p->scx.dsq_priq, &dsq->priq);
@@ -1709,7 +1709,7 @@ static void dispatch_dequeue(struct rq *rq, struct task_struct *p)
 		 * holding_cpu which tells dispatch_to_local_dsq() that it lost
 		 * the race.
 		 */
-		WARN_ON_ONCE(!list_empty(&p->scx.dsq_list.node));
+		WARN_ON_ONCE_DEFERRED(!list_empty(&p->scx.dsq_list.node));
 		p->scx.holding_cpu = -1;
 	}
 	p->scx.dsq = NULL;
@@ -1787,8 +1787,8 @@ static void mark_direct_dispatch(struct scx_sched *sch,
 		return;
 	}
 
-	WARN_ON_ONCE(p->scx.ddsp_dsq_id != SCX_DSQ_INVALID);
-	WARN_ON_ONCE(p->scx.ddsp_enq_flags);
+	WARN_ON_ONCE_DEFERRED(p->scx.ddsp_dsq_id != SCX_DSQ_INVALID);
+	WARN_ON_ONCE_DEFERRED(p->scx.ddsp_enq_flags);
 
 	p->scx.ddsp_dsq_id = dsq_id;
 	p->scx.ddsp_enq_flags = enq_flags;
@@ -1855,7 +1855,7 @@ static void direct_dispatch(struct scx_sched *sch, struct task_struct *p,
 			break;
 		}
 
-		WARN_ON_ONCE(p->scx.dsq || !list_empty(&p->scx.dsq_list.node));
+		WARN_ON_ONCE_DEFERRED(p->scx.dsq || !list_empty(&p->scx.dsq_list.node));
 		list_add_tail(&p->scx.dsq_list.node,
 			      &rq->scx.ddsp_deferred_locals);
 		schedule_deferred_locked(rq);
@@ -1888,7 +1888,7 @@ static void do_enqueue_task(struct rq *rq, struct task_struct *p, u64 enq_flags,
 	struct scx_dispatch_q *dsq;
 	unsigned long qseq;
 
-	WARN_ON_ONCE(!(p->scx.flags & SCX_TASK_QUEUED));
+	WARN_ON_ONCE_DEFERRED(!(p->scx.flags & SCX_TASK_QUEUED));
 
 	/* internal movements - rq migration / RESTORE */
 	if (sticky_cpu == cpu_of(rq))
@@ -1938,11 +1938,11 @@ static void do_enqueue_task(struct rq *rq, struct task_struct *p, u64 enq_flags,
 	/* DSQ bypass didn't trigger, enqueue on the BPF scheduler */
 	qseq = rq->scx.ops_qseq++ << SCX_OPSS_QSEQ_SHIFT;
 
-	WARN_ON_ONCE(atomic_long_read(&p->scx.ops_state) != SCX_OPSS_NONE);
+	WARN_ON_ONCE_DEFERRED(atomic_long_read(&p->scx.ops_state) != SCX_OPSS_NONE);
 	atomic_long_set(&p->scx.ops_state, SCX_OPSS_QUEUEING | qseq);
 
 	ddsp_taskp = this_cpu_ptr(&direct_dispatch_task);
-	WARN_ON_ONCE(*ddsp_taskp);
+	WARN_ON_ONCE_DEFERRED(*ddsp_taskp);
 	*ddsp_taskp = p;
 
 	SCX_CALL_OP_TASK(sch, enqueue, rq, p, enq_flags);
@@ -2039,7 +2039,7 @@ static void enqueue_task_scx(struct rq *rq, struct task_struct *p, int core_enq_
 		sticky_cpu = cpu_of(rq);
 
 	if (p->scx.flags & SCX_TASK_QUEUED) {
-		WARN_ON_ONCE(!task_runnable(p));
+		WARN_ON_ONCE_DEFERRED(!task_runnable(p));
 		goto out;
 	}
 
@@ -2159,7 +2159,7 @@ static bool dequeue_task_scx(struct rq *rq, struct task_struct *p, int core_deq_
 		deq_flags |= SCX_DEQ_SCHED_CHANGE;
 
 	if (!(p->scx.flags & SCX_TASK_QUEUED)) {
-		WARN_ON_ONCE(task_runnable(p));
+		WARN_ON_ONCE_DEFERRED(task_runnable(p));
 		return true;
 	}
 
@@ -2256,7 +2256,7 @@ static void move_local_task_to_local_dsq(struct scx_sched *sch,
 	lockdep_assert_held(&src_dsq->lock);
 	lockdep_assert_rq_held(dst_rq);
 
-	WARN_ON_ONCE(p->scx.holding_cpu >= 0);
+	WARN_ON_ONCE_DEFERRED(p->scx.holding_cpu >= 0);
 
 	if (enq_flags & (SCX_ENQ_HEAD | SCX_ENQ_PREEMPT))
 		list_add(&p->scx.dsq_list.node, &dst_dsq->list);
@@ -2299,8 +2299,8 @@ static void move_remote_task_to_local_dsq(struct task_struct *p, u64 enq_flags,
 	 * truncate the upper 32 bit. As we own @rq, we can pass them through
 	 * @rq->scx.extra_enq_flags instead.
 	 */
-	WARN_ON_ONCE(!cpumask_test_cpu(cpu_of(dst_rq), p->cpus_ptr));
-	WARN_ON_ONCE(dst_rq->scx.extra_enq_flags);
+	WARN_ON_ONCE_DEFERRED(!cpumask_test_cpu(cpu_of(dst_rq), p->cpus_ptr));
+	WARN_ON_ONCE_DEFERRED(dst_rq->scx.extra_enq_flags);
 	dst_rq->scx.extra_enq_flags = enq_flags;
 	activate_task(dst_rq, p, 0);
 	dst_rq->scx.extra_enq_flags = 0;
@@ -2331,7 +2331,7 @@ static bool task_can_run_on_remote_rq(struct scx_sched *sch,
 {
 	s32 cpu = cpu_of(rq);
 
-	WARN_ON_ONCE(task_cpu(p) == cpu);
+	WARN_ON_ONCE_DEFERRED(task_cpu(p) == cpu);
 
 	/*
 	 * If @p has migration disabled, @p->cpus_ptr is updated to contain only
@@ -2411,7 +2411,7 @@ static bool unlink_dsq_and_lock_src_rq(struct task_struct *p,
 
 	lockdep_assert_held(&dsq->lock);
 
-	WARN_ON_ONCE(p->scx.holding_cpu >= 0);
+	WARN_ON_ONCE_DEFERRED(p->scx.holding_cpu >= 0);
 	task_unlink_from_dsq(p, dsq);
 	p->scx.holding_cpu = cpu;
 
@@ -2420,7 +2420,7 @@ static bool unlink_dsq_and_lock_src_rq(struct task_struct *p,
 
 	/* task_rq couldn't have changed if we're still the holding cpu */
 	return likely(p->scx.holding_cpu == cpu) &&
-		!WARN_ON_ONCE(src_rq != task_rq(p));
+		!WARN_ON_ONCE_DEFERRED(src_rq != task_rq(p));
 }
 
 static bool consume_remote_task(struct rq *this_rq,
@@ -2630,7 +2630,7 @@ static void dispatch_to_local_dsq(struct scx_sched *sch, struct rq *rq,
 
 	/* task_rq couldn't have changed if we're still the holding cpu */
 	if (likely(p->scx.holding_cpu == raw_smp_processor_id()) &&
-	    !WARN_ON_ONCE(src_rq != task_rq(p))) {
+	    !WARN_ON_ONCE_DEFERRED(src_rq != task_rq(p))) {
 		/*
 		 * If @p is staying on the same rq, there's no need to go
 		 * through the full deactivate/activate cycle. Optimize by
@@ -3099,7 +3099,7 @@ static void put_prev_task_scx(struct rq *rq, struct task_struct *p,
 		 * which should trigger an explicit follow-up scheduling event.
 		 */
 		if (next && sched_class_above(&ext_sched_class, next->sched_class)) {
-			WARN_ON_ONCE(!(sch->ops.flags & SCX_OPS_ENQ_LAST));
+			WARN_ON_ONCE_DEFERRED(!(sch->ops.flags & SCX_OPS_ENQ_LAST));
 			do_enqueue_task(rq, p, SCX_ENQ_LAST, -1);
 		} else {
 			do_enqueue_task(rq, p, 0, -1);
@@ -3201,7 +3201,7 @@ do_pick_task_scx(struct rq *rq, struct rq_flags *rf, bool force_scx)
 	keep_prev = rq->scx.flags & SCX_RQ_BAL_KEEP;
 	if (unlikely(keep_prev &&
 		     prev->sched_class != &ext_sched_class)) {
-		WARN_ON_ONCE(scx_enable_state() == SCX_ENABLED);
+		WARN_ON_ONCE_DEFERRED(scx_enable_state() == SCX_ENABLED);
 		keep_prev = false;
 	}
 
@@ -3332,7 +3332,7 @@ static int select_task_rq_scx(struct task_struct *p, int prev_cpu, int wake_flag
 		struct task_struct **ddsp_taskp;
 
 		ddsp_taskp = this_cpu_ptr(&direct_dispatch_task);
-		WARN_ON_ONCE(*ddsp_taskp);
+		WARN_ON_ONCE_DEFERRED(*ddsp_taskp);
 		*ddsp_taskp = p;
 
 		this_rq()->scx.in_select_cpu = true;
@@ -3620,7 +3620,7 @@ static void __scx_enable_task(struct scx_sched *sch, struct task_struct *p)
 	 * transitions are consistent, the flag should always be clear
 	 * here.
 	 */
-	WARN_ON_ONCE(p->scx.flags & SCX_TASK_IN_CUSTODY);
+	WARN_ON_ONCE_DEFERRED(p->scx.flags & SCX_TASK_IN_CUSTODY);
 
 	/*
 	 * Set the weight before calling ops.enable() so that the scheduler
@@ -3651,7 +3651,7 @@ static void scx_disable_task(struct scx_sched *sch, struct task_struct *p)
 	struct rq *rq = task_rq(p);
 
 	lockdep_assert_rq_held(rq);
-	WARN_ON_ONCE(scx_get_task_state(p) != SCX_TASK_ENABLED);
+	WARN_ON_ONCE_DEFERRED(scx_get_task_state(p) != SCX_TASK_ENABLED);
 
 	clear_direct_dispatch(p);
 
@@ -3664,7 +3664,7 @@ static void scx_disable_task(struct scx_sched *sch, struct task_struct *p)
 	 * transitions are consistent, the flag should always be clear
 	 * here.
 	 */
-	WARN_ON_ONCE(p->scx.flags & SCX_TASK_IN_CUSTODY);
+	WARN_ON_ONCE_DEFERRED(p->scx.flags & SCX_TASK_IN_CUSTODY);
 }
 
 static void __scx_disable_and_exit_task(struct scx_sched *sch,
@@ -3689,7 +3689,7 @@ static void __scx_disable_and_exit_task(struct scx_sched *sch,
 		scx_disable_task(sch, p);
 		break;
 	default:
-		WARN_ON_ONCE(true);
+		WARN_ON_ONCE_DEFERRED(true);
 		return;
 	}
 
@@ -3726,7 +3726,7 @@ static void scx_disable_and_exit_task(struct scx_sched *sch,
 	 * path, so it's always clear when @p arrives here in %SCX_TASK_NONE.
 	 */
 	if (p->scx.flags & SCX_TASK_SUB_INIT) {
-		if (!WARN_ON_ONCE(!scx_enabling_sub_sched))
+		if (!WARN_ON_ONCE_DEFERRED(!scx_enabling_sub_sched))
 			scx_sub_init_cancel_task(scx_enabling_sub_sched, p);
 		p->scx.flags &= ~SCX_TASK_SUB_INIT;
 	}
@@ -3818,7 +3818,7 @@ void scx_cancel_fork(struct task_struct *p)
 		struct rq_flags rf;
 
 		rq = task_rq_lock(p, &rf);
-		WARN_ON_ONCE(scx_get_task_state(p) >= SCX_TASK_READY);
+		WARN_ON_ONCE_DEFERRED(scx_get_task_state(p) >= SCX_TASK_READY);
 		scx_disable_and_exit_task(scx_task_sched(p), p);
 		task_rq_unlock(rq, p, &rf);
 	}
@@ -3986,7 +3986,7 @@ static void process_ddsp_deferred_locals(struct rq *rq)
 		clear_direct_dispatch(p);
 
 		dsq = find_dsq_for_dispatch(sch, rq, dsq_id, task_cpu(p));
-		if (!WARN_ON_ONCE(dsq->id != SCX_DSQ_LOCAL))
+		if (!WARN_ON_ONCE_DEFERRED(dsq->id != SCX_DSQ_LOCAL))
 			dispatch_to_local_dsq(sch, rq, dsq, p, enq_flags);
 	}
 }
@@ -4041,7 +4041,7 @@ static u32 reenq_local(struct scx_sched *sch, struct rq *rq, u64 reenq_flags)
 
 	lockdep_assert_rq_held(rq);
 
-	if (WARN_ON_ONCE(reenq_flags & __SCX_REENQ_TSR_MASK))
+	if (WARN_ON_ONCE_DEFERRED(reenq_flags & __SCX_REENQ_TSR_MASK))
 		reenq_flags &= ~__SCX_REENQ_TSR_MASK;
 	if (rq_is_open(rq, 0))
 		reenq_flags |= SCX_REENQ_TSR_RQ_OPEN;
@@ -4078,7 +4078,7 @@ static u32 reenq_local(struct scx_sched *sch, struct rq *rq, u64 reenq_flags)
 
 		dispatch_dequeue(rq, p);
 
-		if (WARN_ON_ONCE(p->scx.flags & SCX_TASK_REENQ_REASON_MASK))
+		if (WARN_ON_ONCE_DEFERRED(p->scx.flags & SCX_TASK_REENQ_REASON_MASK))
 			p->scx.flags &= ~SCX_TASK_REENQ_REASON_MASK;
 		p->scx.flags |= reason;
 
@@ -4199,7 +4199,7 @@ static void reenq_user(struct rq *rq, struct scx_dispatch_q *dsq, u64 reenq_flag
 		dispatch_dequeue_locked(p, dsq);
 		raw_spin_unlock(&dsq->lock);
 
-		if (WARN_ON_ONCE(p->scx.flags & SCX_TASK_REENQ_REASON_MASK))
+		if (WARN_ON_ONCE_DEFERRED(p->scx.flags & SCX_TASK_REENQ_REASON_MASK))
 			p->scx.flags &= ~SCX_TASK_REENQ_REASON_MASK;
 		p->scx.flags |= reason;
 
@@ -4360,7 +4360,7 @@ int scx_cgroup_can_attach(struct cgroup_taskset *tset)
 		struct cgroup *from = tg_cgrp(task_group(p));
 		struct cgroup *to = tg_cgrp(css_tg(css));
 
-		WARN_ON_ONCE(p->scx.cgrp_moving_from);
+		WARN_ON_ONCE_DEFERRED(p->scx.cgrp_moving_from);
 
 		/*
 		 * sched_move_task() omits identity migrations. Let's match the
@@ -4617,7 +4617,7 @@ static void exit_dsq(struct scx_dispatch_q *dsq)
 		 * There must have been a RCU grace period since the last
 		 * insertion and @dsq should be off the deferred list by now.
 		 */
-		if (WARN_ON_ONCE(!list_empty(&dru->node))) {
+		if (WARN_ON_ONCE_DEFERRED(!list_empty(&dru->node))) {
 			guard(raw_spinlock_irqsave)(&rq->scx.deferred_reenq_lock);
 			list_del_init(&dru->node);
 		}
@@ -4745,7 +4745,7 @@ static int scx_cgroup_init(struct scx_sched *sch)
 		tg->scx.flags |= SCX_TG_INITED;
 	}
 
-	WARN_ON_ONCE(scx_cgroup_enabled);
+	WARN_ON_ONCE_DEFERRED(scx_cgroup_enabled);
 	scx_cgroup_enabled = true;
 
 	return 0;
@@ -4848,7 +4848,7 @@ static void scx_sched_free_rcu_work(struct work_struct *work)
 		 * period. As that blocks new deferrals, all
 		 * deferred_reenq_local_node's must be off-list by now.
 		 */
-		WARN_ON_ONCE(!list_empty(&pcpu->deferred_reenq_local.node));
+		WARN_ON_ONCE_DEFERRED(!list_empty(&pcpu->deferred_reenq_local.node));
 
 		exit_dsq(bypass_dsq(sch, cpu));
 	}
@@ -5324,7 +5324,7 @@ static bool inc_bypass_depth(struct scx_sched *sch)
 {
 	lockdep_assert_held(&scx_bypass_lock);
 
-	WARN_ON_ONCE(sch->bypass_depth < 0);
+	WARN_ON_ONCE_DEFERRED(sch->bypass_depth < 0);
 	WRITE_ONCE(sch->bypass_depth, sch->bypass_depth + 1);
 	if (sch->bypass_depth != 1)
 		return false;
@@ -5339,7 +5339,7 @@ static bool dec_bypass_depth(struct scx_sched *sch)
 {
 	lockdep_assert_held(&scx_bypass_lock);
 
-	WARN_ON_ONCE(sch->bypass_depth < 1);
+	WARN_ON_ONCE_DEFERRED(sch->bypass_depth < 1);
 	WRITE_ONCE(sch->bypass_depth, sch->bypass_depth - 1);
 	if (sch->bypass_depth != 0)
 		return false;
@@ -5360,7 +5360,7 @@ static void enable_bypass_dsp(struct scx_sched *sch)
 	 * @sch->bypass_depth transitioning from 0 to 1 triggers enabling.
 	 * Shouldn't stagger.
 	 */
-	if (WARN_ON_ONCE(test_and_set_bit(0, &sch->bypass_dsp_claim)))
+	if (WARN_ON_ONCE_DEFERRED(test_and_set_bit(0, &sch->bypass_dsp_claim)))
 		return;
 
 	/*
@@ -5380,11 +5380,11 @@ static void enable_bypass_dsp(struct scx_sched *sch)
 	 * Bump enable depth on both @sch and bypass dispatch host.
 	 */
 	ret = atomic_inc_return(&sch->bypass_dsp_enable_depth);
-	WARN_ON_ONCE(ret <= 0);
+	WARN_ON_ONCE_DEFERRED(ret <= 0);
 
 	if (host != sch) {
 		ret = atomic_inc_return(&host->bypass_dsp_enable_depth);
-		WARN_ON_ONCE(ret <= 0);
+		WARN_ON_ONCE_DEFERRED(ret <= 0);
 	}
 
 	/*
@@ -5405,11 +5405,11 @@ static void disable_bypass_dsp(struct scx_sched *sch)
 		return;
 
 	ret = atomic_dec_return(&sch->bypass_dsp_enable_depth);
-	WARN_ON_ONCE(ret < 0);
+	WARN_ON_ONCE_DEFERRED(ret < 0);
 
 	if (scx_parent(sch)) {
 		ret = atomic_dec_return(&scx_parent(sch)->bypass_dsp_enable_depth);
-		WARN_ON_ONCE(ret < 0);
+		WARN_ON_ONCE_DEFERRED(ret < 0);
 	}
 }
 
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 3ebec186f9823..1213e77665fe9 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -404,7 +404,7 @@ static inline void list_del_leaf_cfs_rq(struct cfs_rq *cfs_rq)
 
 static inline void assert_list_leaf_cfs_rq(struct rq *rq)
 {
-	WARN_ON_ONCE(rq->tmp_alone_branch != &rq->leaf_cfs_rq_list);
+	WARN_ON_ONCE_DEFERRED(rq->tmp_alone_branch != &rq->leaf_cfs_rq_list);
 }
 
 /* Iterate through all leaf cfs_rq's on a runqueue */
@@ -689,7 +689,7 @@ __sum_w_vruntime_add(struct cfs_rq *cfs_rq, struct sched_entity *se)
 	s64 w_vruntime, key = entity_key(cfs_rq, se);
 
 	w_vruntime = key * weight;
-	WARN_ON_ONCE((w_vruntime >> 63) != (w_vruntime >> 62));
+	WARN_ON_ONCE_DEFERRED((w_vruntime >> 63) != (w_vruntime >> 62));
 
 	cfs_rq->sum_w_vruntime += w_vruntime;
 	cfs_rq->sum_weight += weight;
@@ -861,7 +861,7 @@ bool update_entity_lag(struct cfs_rq *cfs_rq, struct sched_entity *se)
 	u64 avruntime = avg_vruntime(cfs_rq);
 	s64 vlag = entity_lag(cfs_rq, se, avruntime);
 
-	WARN_ON_ONCE(!se->on_rq);
+	WARN_ON_ONCE_DEFERRED(!se->on_rq);
 
 	if (se->sched_delayed) {
 		/* previous vlag < 0 otherwise se would not be delayed */
@@ -1153,7 +1153,7 @@ static struct sched_entity *pick_eevdf(struct cfs_rq *cfs_rq, bool protect)
 	if (sched_feat(PICK_BUDDY) && protect &&
 	    cfs_rq->next && entity_eligible(cfs_rq, cfs_rq->next)) {
 		/* ->next will never be delayed */
-		WARN_ON_ONCE(cfs_rq->next->sched_delayed);
+		WARN_ON_ONCE_DEFERRED(cfs_rq->next->sched_delayed);
 		return cfs_rq->next;
 	}
 
@@ -4302,9 +4302,9 @@ static inline bool load_avg_is_decayed(struct sched_avg *sa)
 	 * Make sure that rounding and/or propagation of PELT values never
 	 * break this.
 	 */
-	WARN_ON_ONCE(sa->load_avg ||
-		      sa->util_avg ||
-		      sa->runnable_avg);
+	WARN_ON_ONCE_DEFERRED(sa->load_avg ||
+			      sa->util_avg ||
+			      sa->runnable_avg);
 
 	return true;
 }
@@ -5460,7 +5460,7 @@ place_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
 
 		weight = avg_vruntime_weight(cfs_rq, se->load.weight);
 		lag *= load + weight;
-		if (WARN_ON_ONCE(!load))
+		if (WARN_ON_ONCE_DEFERRED(!load))
 			load = 1;
 		lag = div64_long(lag, load);
 
@@ -5653,7 +5653,7 @@ dequeue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
 	clear_buddies(cfs_rq, se);
 
 	if (flags & DEQUEUE_DELAYED) {
-		WARN_ON_ONCE(!se->sched_delayed);
+		WARN_ON_ONCE_DEFERRED(!se->sched_delayed);
 	} else {
 		bool delay = sleep;
 		/*
@@ -5663,7 +5663,7 @@ dequeue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
 		if (flags & (DEQUEUE_SPECIAL | DEQUEUE_THROTTLE))
 			delay = false;
 
-		WARN_ON_ONCE(delay && se->sched_delayed);
+		WARN_ON_ONCE_DEFERRED(delay && se->sched_delayed);
 
 		if (sched_feat(DELAY_DEQUEUE) && delay &&
 		    !entity_eligible(cfs_rq, se)) {
@@ -5747,7 +5747,7 @@ set_next_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, bool first)
 	}
 
 	update_stats_curr_start(cfs_rq, se);
-	WARN_ON_ONCE(cfs_rq->curr);
+	WARN_ON_ONCE_DEFERRED(cfs_rq->curr);
 	cfs_rq->curr = se;
 
 	/*
@@ -5814,7 +5814,7 @@ static void put_prev_entity(struct cfs_rq *cfs_rq, struct sched_entity *prev)
 		/* in !on_rq case, update occurred at dequeue */
 		update_load_avg(cfs_rq, prev, 0);
 	}
-	WARN_ON_ONCE(cfs_rq->curr != prev);
+	WARN_ON_ONCE_DEFERRED(cfs_rq->curr != prev);
 	cfs_rq->curr = NULL;
 }
 
@@ -6015,7 +6015,7 @@ static void throttle_cfs_rq_work(struct callback_head *work)
 	struct cfs_rq *cfs_rq;
 	struct rq *rq;
 
-	WARN_ON_ONCE(p != current);
+	WARN_ON_ONCE_DEFERRED(p != current);
 	p->sched_throttle_work.next = &p->sched_throttle_work;
 
 	/*
@@ -6041,7 +6041,7 @@ static void throttle_cfs_rq_work(struct callback_head *work)
 			return;
 		rq = scope.rq;
 		update_rq_clock(rq);
-		WARN_ON_ONCE(p->throttled || !list_empty(&p->throttle_node));
+		WARN_ON_ONCE_DEFERRED(p->throttled || !list_empty(&p->throttle_node));
 		dequeue_task_fair(rq, p, DEQUEUE_SLEEP | DEQUEUE_THROTTLE);
 		list_add(&p->throttle_node, &cfs_rq->throttled_limbo_list);
 		/*
@@ -6072,7 +6072,7 @@ void init_cfs_throttle_work(struct task_struct *p)
 static void detach_task_cfs_rq(struct task_struct *p);
 static void dequeue_throttled_task(struct task_struct *p, int flags)
 {
-	WARN_ON_ONCE(p->se.on_rq);
+	WARN_ON_ONCE_DEFERRED(p->se.on_rq);
 	list_del_init(&p->throttle_node);
 
 	/* task blocked after throttled */
@@ -6094,7 +6094,7 @@ static bool enqueue_throttled_task(struct task_struct *p)
 	struct cfs_rq *cfs_rq = cfs_rq_of(&p->se);
 
 	/* @p should have gone through dequeue_throttled_task() first */
-	WARN_ON_ONCE(!list_empty(&p->throttle_node));
+	WARN_ON_ONCE_DEFERRED(!list_empty(&p->throttle_node));
 
 	/*
 	 * If the throttled task @p is enqueued to a throttled cfs_rq,
@@ -6162,7 +6162,7 @@ static int tg_unthrottle_up(struct task_group *tg, void *data)
 
 		cfs_rq->throttled_clock_self = 0;
 
-		if (WARN_ON_ONCE((s64)delta < 0))
+		if (WARN_ON_ONCE_DEFERRED((s64)delta < 0))
 			delta = 0;
 
 		cfs_rq->throttled_clock_self_time += delta;
@@ -6231,8 +6231,8 @@ static int tg_throttle_down(struct task_group *tg, void *data)
 		cfs_rq->pelt_clock_throttled = 1;
 	}
 
-	WARN_ON_ONCE(cfs_rq->throttled_clock_self);
-	WARN_ON_ONCE(!list_empty(&cfs_rq->throttled_limbo_list));
+	WARN_ON_ONCE_DEFERRED(cfs_rq->throttled_clock_self);
+	WARN_ON_ONCE_DEFERRED(!list_empty(&cfs_rq->throttled_limbo_list));
 	return 0;
 }
 
@@ -6273,7 +6273,7 @@ static bool throttle_cfs_rq(struct cfs_rq *cfs_rq)
 	 * throttled-list.  rq->lock protects completion.
 	 */
 	cfs_rq->throttled = 1;
-	WARN_ON_ONCE(cfs_rq->throttled_clock);
+	WARN_ON_ONCE_DEFERRED(cfs_rq->throttled_clock);
 	return true;
 }
 
@@ -6380,7 +6380,7 @@ static inline void __unthrottle_cfs_rq_async(struct cfs_rq *cfs_rq)
 	}
 
 	/* Already enqueued */
-	if (WARN_ON_ONCE(!list_empty(&cfs_rq->throttled_csd_list)))
+	if (WARN_ON_ONCE_DEFERRED(!list_empty(&cfs_rq->throttled_csd_list)))
 		return;
 
 	first = list_empty(&rq->cfsb_csd_list);
@@ -6393,7 +6393,7 @@ static void unthrottle_cfs_rq_async(struct cfs_rq *cfs_rq)
 {
 	lockdep_assert_rq_held(rq_of(cfs_rq));
 
-	if (WARN_ON_ONCE(!cfs_rq_throttled(cfs_rq) ||
+	if (WARN_ON_ONCE_DEFERRED(!cfs_rq_throttled(cfs_rq) ||
 	    cfs_rq->runtime_remaining <= 0))
 		return;
 
@@ -6429,7 +6429,7 @@ static bool distribute_cfs_runtime(struct cfs_bandwidth *cfs_b)
 			goto next;
 
 		/* By the above checks, this should never be true */
-		WARN_ON_ONCE(cfs_rq->runtime_remaining > 0);
+		WARN_ON_ONCE_DEFERRED(cfs_rq->runtime_remaining > 0);
 
 		raw_spin_lock(&cfs_b->lock);
 		runtime = -cfs_rq->runtime_remaining + 1;
@@ -6450,7 +6450,7 @@ static bool distribute_cfs_runtime(struct cfs_bandwidth *cfs_b)
 				 * We currently only expect to be unthrottling
 				 * a single cfs_rq locally.
 				 */
-				WARN_ON_ONCE(!list_empty(&local_unthrottle));
+				WARN_ON_ONCE_DEFERRED(!list_empty(&local_unthrottle));
 				list_add_tail(&cfs_rq->throttled_csd_list,
 					      &local_unthrottle);
 			}
@@ -6475,7 +6475,7 @@ static bool distribute_cfs_runtime(struct cfs_bandwidth *cfs_b)
 
 		rq_unlock_irqrestore(rq, &rf);
 	}
-	WARN_ON_ONCE(!list_empty(&local_unthrottle));
+	WARN_ON_ONCE_DEFERRED(!list_empty(&local_unthrottle));
 
 	rcu_read_unlock();
 
@@ -7048,7 +7048,7 @@ static void hrtick_start_fair(struct rq *rq, struct task_struct *p)
 	u64 vdelta;
 	u64 delta;
 
-	WARN_ON_ONCE(task_rq(p) != rq);
+	WARN_ON_ONCE_DEFERRED(task_rq(p) != rq);
 
 	if (rq->cfs.h_nr_queued <= 1)
 		return;
@@ -7171,8 +7171,8 @@ requeue_delayed_entity(struct sched_entity *se)
 	 * Because a delayed entity is one that is still on
 	 * the runqueue competing until elegibility.
 	 */
-	WARN_ON_ONCE(!se->sched_delayed);
-	WARN_ON_ONCE(!se->on_rq);
+	WARN_ON_ONCE_DEFERRED(!se->sched_delayed);
+	WARN_ON_ONCE_DEFERRED(!se->on_rq);
 
 	if (update_entity_lag(cfs_rq, se)) {
 		cfs_rq->nr_queued--;
@@ -7409,8 +7409,8 @@ static int dequeue_entities(struct rq *rq, struct sched_entity *se, int flags)
 		rq->next_balance = jiffies;
 
 	if (p && task_delayed) {
-		WARN_ON_ONCE(!task_sleep);
-		WARN_ON_ONCE(p->on_rq != 1);
+		WARN_ON_ONCE_DEFERRED(!task_sleep);
+		WARN_ON_ONCE_DEFERRED(p->on_rq != 1);
 
 		/*
 		 * Fix-up what block_task() skipped.
@@ -8976,7 +8976,7 @@ static void set_cpus_allowed_fair(struct task_struct *p, struct affinity_context
 static void set_next_buddy(struct sched_entity *se)
 {
 	for_each_sched_entity(se) {
-		if (WARN_ON_ONCE(!se->on_rq))
+		if (WARN_ON_ONCE_DEFERRED(!se->on_rq))
 			return;
 		if (se_is_idle(se))
 			return;
@@ -9023,7 +9023,7 @@ preempt_sync(struct rq *rq, int wake_flags,
 	 * WF_SYNC without WF_TTWU is not expected so warn if it happens even
 	 * though it is likely harmless.
 	 */
-	WARN_ON_ONCE(!(wake_flags & WF_TTWU));
+	WARN_ON_ONCE_DEFERRED(!(wake_flags & WF_TTWU));
 
 	threshold = sysctl_sched_migration_cost;
 	delta = rq_clock_task(rq) - se->exec_start;
@@ -9095,7 +9095,7 @@ static void wakeup_preempt_fair(struct rq *rq, struct task_struct *p, int wake_f
 		return;
 
 	find_matching_se(&se, &pse);
-	WARN_ON_ONCE(!pse);
+	WARN_ON_ONCE_DEFERRED(!pse);
 
 	cse_is_idle = se_is_idle(se);
 	pse_is_idle = se_is_idle(pse);
@@ -9857,8 +9857,8 @@ static void detach_task(struct task_struct *p, struct lb_env *env)
 		schedstat_inc(p->stats.nr_forced_migrations);
 	}
 
-	WARN_ON(task_current(env->src_rq, p));
-	WARN_ON(task_current_donor(env->src_rq, p));
+	WARN_ON_DEFERRED(task_current(env->src_rq, p));
+	WARN_ON_DEFERRED(task_current_donor(env->src_rq, p));
 
 	deactivate_task(env->src_rq, p, DEQUEUE_NOCLOCK);
 	set_task_cpu(p, env->dst_cpu);
@@ -12151,7 +12151,7 @@ static int sched_balance_rq(int this_cpu, struct rq *this_rq,
 		goto out_balanced;
 	}
 
-	WARN_ON_ONCE(busiest == env.dst_rq);
+	WARN_ON_ONCE_DEFERRED(busiest == env.dst_rq);
 
 	update_lb_imbalance_stat(&env, sd, idle);
 
@@ -12461,7 +12461,7 @@ static int active_load_balance_cpu_stop(void *data)
 	 * we need to fix it. Originally reported by
 	 * Bjorn Helgaas on a 128-CPU setup.
 	 */
-	WARN_ON_ONCE(busiest_rq == target_rq);
+	WARN_ON_ONCE_DEFERRED(busiest_rq == target_rq);
 
 	/* Search for an sd spanning us and the target CPU. */
 	rcu_read_lock();
@@ -12883,7 +12883,7 @@ static void set_cpu_sd_state_busy(int cpu)
 
 void nohz_balance_exit_idle(struct rq *rq)
 {
-	WARN_ON_ONCE(rq != this_rq());
+	WARN_ON_ONCE_DEFERRED(rq != this_rq());
 
 	if (likely(!rq->nohz_tick_stopped))
 		return;
@@ -12918,7 +12918,7 @@ void nohz_balance_enter_idle(int cpu)
 {
 	struct rq *rq = cpu_rq(cpu);
 
-	WARN_ON_ONCE(cpu != smp_processor_id());
+	WARN_ON_ONCE_DEFERRED(cpu != smp_processor_id());
 
 	/* If this CPU is going down, then nothing needs to be done: */
 	if (!cpu_active(cpu))
@@ -13000,7 +13000,7 @@ static void _nohz_idle_balance(struct rq *this_rq, unsigned int flags)
 	int balance_cpu;
 	struct rq *rq;
 
-	WARN_ON_ONCE((flags & NOHZ_KICK_MASK) == NOHZ_BALANCE_KICK);
+	WARN_ON_ONCE_DEFERRED((flags & NOHZ_KICK_MASK) == NOHZ_BALANCE_KICK);
 
 	/*
 	 * We assume there will be no idle load after this update and clear
@@ -13623,7 +13623,7 @@ bool cfs_prio_less(const struct task_struct *a, const struct task_struct *b,
 	struct cfs_rq *cfs_rqb;
 	s64 delta;
 
-	WARN_ON_ONCE(task_rq(b)->core != rq->core);
+	WARN_ON_ONCE_DEFERRED(task_rq(b)->core != rq->core);
 
 #ifdef CONFIG_FAIR_GROUP_SCHED
 	/*
@@ -13839,7 +13839,7 @@ static void switched_from_fair(struct rq *rq, struct task_struct *p)
 
 static void switched_to_fair(struct rq *rq, struct task_struct *p)
 {
-	WARN_ON_ONCE(p->se.sched_delayed);
+	WARN_ON_ONCE_DEFERRED(p->se.sched_delayed);
 
 	attach_task_cfs_rq(p);
 
@@ -13872,7 +13872,7 @@ static void __set_next_task_fair(struct rq *rq, struct task_struct *p, bool firs
 	if (!first)
 		return;
 
-	WARN_ON_ONCE(se->sched_delayed);
+	WARN_ON_ONCE_DEFERRED(se->sched_delayed);
 
 	if (hrtick_enabled_fair(rq))
 		hrtick_start_fair(rq, p);
@@ -14148,7 +14148,7 @@ int sched_group_set_idle(struct task_group *tg, long idle)
 		rq_lock_irqsave(rq, &rf);
 
 		grp_cfs_rq->idle = idle;
-		if (WARN_ON_ONCE(was_idle == cfs_rq_is_idle(grp_cfs_rq)))
+		if (WARN_ON_ONCE_DEFERRED(was_idle == cfs_rq_is_idle(grp_cfs_rq)))
 			goto next_cpu;
 
 		idle_task_delta = grp_cfs_rq->h_nr_queued -
diff --git a/kernel/sched/rt.c b/kernel/sched/rt.c
index 4ee8faf01441a..506d0f1afa58f 100644
--- a/kernel/sched/rt.c
+++ b/kernel/sched/rt.c
@@ -170,7 +170,7 @@ static void destroy_rt_bandwidth(struct rt_bandwidth *rt_b)
 
 static inline struct task_struct *rt_task_of(struct sched_rt_entity *rt_se)
 {
-	WARN_ON_ONCE(!rt_entity_is_task(rt_se));
+	WARN_ON_ONCE_DEFERRED(!rt_entity_is_task(rt_se));
 
 	return container_of(rt_se, struct task_struct, rt);
 }
@@ -178,13 +178,13 @@ static inline struct task_struct *rt_task_of(struct sched_rt_entity *rt_se)
 static inline struct rq *rq_of_rt_rq(struct rt_rq *rt_rq)
 {
 	/* Cannot fold with non-CONFIG_RT_GROUP_SCHED version, layout */
-	WARN_ON(!rt_group_sched_enabled() && rt_rq->tg != &root_task_group);
+	WARN_ON_DEFERRED(!rt_group_sched_enabled() && rt_rq->tg != &root_task_group);
 	return rt_rq->rq;
 }
 
 static inline struct rt_rq *rt_rq_of_se(struct sched_rt_entity *rt_se)
 {
-	WARN_ON(!rt_group_sched_enabled() && rt_se->rt_rq->tg != &root_task_group);
+	WARN_ON_DEFERRED(!rt_group_sched_enabled() && rt_se->rt_rq->tg != &root_task_group);
 	return rt_se->rt_rq;
 }
 
@@ -192,7 +192,7 @@ static inline struct rq *rq_of_rt_se(struct sched_rt_entity *rt_se)
 {
 	struct rt_rq *rt_rq = rt_se->rt_rq;
 
-	WARN_ON(!rt_group_sched_enabled() && rt_rq->tg != &root_task_group);
+	WARN_ON_DEFERRED(!rt_group_sched_enabled() && rt_rq->tg != &root_task_group);
 	return rt_rq->rq;
 }
 
@@ -493,7 +493,7 @@ typedef struct task_group *rt_rq_iter_t;
 static inline struct task_group *next_task_group(struct task_group *tg)
 {
 	if (!rt_group_sched_enabled()) {
-		WARN_ON(tg != &root_task_group);
+		WARN_ON_DEFERRED(tg != &root_task_group);
 		return NULL;
 	}
 
@@ -723,7 +723,7 @@ static void __disable_runtime(struct rq *rq)
 		 * We cannot be left wanting - that would mean some runtime
 		 * leaked out of the system.
 		 */
-		WARN_ON_ONCE(want);
+		WARN_ON_ONCE_DEFERRED(want);
 balanced:
 		/*
 		 * Disable all the borrow logic by pretending we have inf
@@ -1094,7 +1094,7 @@ dec_rt_prio(struct rt_rq *rt_rq, int prio)
 
 	if (rt_rq->rt_nr_running) {
 
-		WARN_ON(prio < prev_prio);
+		WARN_ON_DEFERRED(prio < prev_prio);
 
 		/*
 		 * This may have been our highest task, and therefore
@@ -1131,7 +1131,7 @@ dec_rt_group(struct sched_rt_entity *rt_se, struct rt_rq *rt_rq)
 	if (rt_se_boosted(rt_se))
 		rt_rq->rt_nr_boosted--;
 
-	WARN_ON(!rt_rq->rt_nr_running && rt_rq->rt_nr_boosted);
+	WARN_ON_DEFERRED(!rt_rq->rt_nr_running && rt_rq->rt_nr_boosted);
 }
 
 #else /* !CONFIG_RT_GROUP_SCHED: */
@@ -1176,7 +1176,7 @@ void inc_rt_tasks(struct sched_rt_entity *rt_se, struct rt_rq *rt_rq)
 {
 	int prio = rt_se_prio(rt_se);
 
-	WARN_ON(!rt_prio(prio));
+	WARN_ON_DEFERRED(!rt_prio(prio));
 	rt_rq->rt_nr_running += rt_se_nr_running(rt_se);
 	rt_rq->rr_nr_running += rt_se_rr_nr_running(rt_se);
 
@@ -1187,8 +1187,8 @@ void inc_rt_tasks(struct sched_rt_entity *rt_se, struct rt_rq *rt_rq)
 static inline
 void dec_rt_tasks(struct sched_rt_entity *rt_se, struct rt_rq *rt_rq)
 {
-	WARN_ON(!rt_prio(rt_se_prio(rt_se)));
-	WARN_ON(!rt_rq->rt_nr_running);
+	WARN_ON_DEFERRED(!rt_prio(rt_se_prio(rt_se)));
+	WARN_ON_DEFERRED(!rt_rq->rt_nr_running);
 	rt_rq->rt_nr_running -= rt_se_nr_running(rt_se);
 	rt_rq->rr_nr_running -= rt_se_rr_nr_running(rt_se);
 
@@ -1348,7 +1348,7 @@ static void __enqueue_rt_entity(struct sched_rt_entity *rt_se, unsigned int flag
 	}
 
 	if (move_entity(flags)) {
-		WARN_ON_ONCE(rt_se->on_list);
+		WARN_ON_ONCE_DEFERRED(rt_se->on_list);
 		if (flags & ENQUEUE_HEAD)
 			list_add(&rt_se->run_list, queue);
 		else
@@ -1368,7 +1368,7 @@ static void __dequeue_rt_entity(struct sched_rt_entity *rt_se, unsigned int flag
 	struct rt_prio_array *array = &rt_rq->active;
 
 	if (move_entity(flags)) {
-		WARN_ON_ONCE(!rt_se->on_list);
+		WARN_ON_ONCE_DEFERRED(!rt_se->on_list);
 		__delist_rt_entity(rt_se, array);
 	}
 	rt_se->on_rq = 0;
@@ -1684,7 +1684,7 @@ static struct sched_rt_entity *pick_next_rt_entity(struct rt_rq *rt_rq)
 	BUG_ON(idx >= MAX_RT_PRIO);
 
 	queue = array->queue + idx;
-	if (WARN_ON_ONCE(list_empty(queue)))
+	if (WARN_ON_ONCE_DEFERRED(list_empty(queue)))
 		return NULL;
 	next = list_entry(queue->next, struct sched_rt_entity, run_list);
 
@@ -2016,7 +2016,7 @@ static int push_rt_task(struct rq *rq, bool pull)
 		return 0;
 	}
 
-	if (WARN_ON(next_task == rq->curr))
+	if (WARN_ON_DEFERRED(next_task == rq->curr))
 		return 0;
 
 	/* We might release rq lock */
@@ -2316,8 +2316,8 @@ static void pull_rt_task(struct rq *this_rq)
 		 * the to-be-scheduled task?
 		 */
 		if (p && (p->prio < this_rq->rt.highest_prio.curr)) {
-			WARN_ON(p == src_rq->curr);
-			WARN_ON(!task_on_rq_queued(p));
+			WARN_ON_DEFERRED(p == src_rq->curr);
+			WARN_ON_DEFERRED(!task_on_rq_queued(p));
 
 			/*
 			 * There's a chance that p is higher in priority
@@ -2583,7 +2583,7 @@ static int task_is_throttled_rt(struct task_struct *p, int cpu)
 
 #ifdef CONFIG_RT_GROUP_SCHED // XXX maybe add task_rt_rq(), see also sched_rt_period_rt_rq
 	rt_rq = task_group(p)->rt_rq[cpu];
-	WARN_ON(!rt_group_sched_enabled() && rt_rq->tg != &root_task_group);
+	WARN_ON_DEFERRED(!rt_group_sched_enabled() && rt_rq->tg != &root_task_group);
 #else
 	rt_rq = &cpu_rq(cpu)->rt;
 #endif
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 9f63b15d309d1..f74f9cd44e098 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1684,7 +1684,7 @@ static inline void update_idle_core(struct rq *rq) { }
 
 static inline struct task_struct *task_of(struct sched_entity *se)
 {
-	WARN_ON_ONCE(!entity_is_task(se));
+	WARN_ON_ONCE_DEFERRED(!entity_is_task(se));
 	return container_of(se, struct task_struct, se);
 }
 
@@ -1766,7 +1766,7 @@ static inline void assert_clock_updated(struct rq *rq)
 	 * The only reason for not seeing a clock update since the
 	 * last rq_pin_lock() is if we're currently skipping updates.
 	 */
-	WARN_ON_ONCE(rq->clock_update_flags < RQCF_ACT_SKIP);
+	WARN_ON_ONCE_DEFERRED(rq->clock_update_flags < RQCF_ACT_SKIP);
 }
 
 static inline u64 rq_clock(struct rq *rq)
@@ -1813,7 +1813,7 @@ static inline void rq_clock_cancel_skipupdate(struct rq *rq)
 static inline void rq_clock_start_loop_update(struct rq *rq)
 {
 	lockdep_assert_rq_held(rq);
-	WARN_ON_ONCE(rq->clock_update_flags & RQCF_ACT_SKIP);
+	WARN_ON_ONCE_DEFERRED(rq->clock_update_flags & RQCF_ACT_SKIP);
 	rq->clock_update_flags |= RQCF_ACT_SKIP;
 }
 
@@ -1870,9 +1870,9 @@ static inline void scx_rq_clock_invalidate(struct rq *rq) {}
 
 static inline void assert_balance_callbacks_empty(struct rq *rq)
 {
-	WARN_ON_ONCE(IS_ENABLED(CONFIG_PROVE_LOCKING) &&
-		     rq->balance_callback &&
-		     rq->balance_callback != &balance_push_callback);
+	WARN_ON_ONCE_DEFERRED(IS_ENABLED(CONFIG_PROVE_LOCKING) &&
+			      rq->balance_callback &&
+			      rq->balance_callback != &balance_push_callback);
 }
 
 /*
@@ -2681,7 +2681,7 @@ struct sched_class {
 
 static inline void put_prev_task(struct rq *rq, struct task_struct *prev)
 {
-	WARN_ON_ONCE(rq->donor != prev);
+	WARN_ON_ONCE_DEFERRED(rq->donor != prev);
 	prev->sched_class->put_prev_task(rq, prev, NULL);
 }
 
@@ -2704,7 +2704,7 @@ static inline void put_prev_set_next_task(struct rq *rq,
 					  struct task_struct *prev,
 					  struct task_struct *next)
 {
-	WARN_ON_ONCE(rq->donor != prev);
+	WARN_ON_ONCE_DEFERRED(rq->donor != prev);
 
 	__put_prev_set_next_dl_server(rq, prev, next);
 
@@ -3030,7 +3030,7 @@ static inline void attach_task(struct rq *rq, struct task_struct *p)
 {
 	lockdep_assert_rq_held(rq);
 
-	WARN_ON_ONCE(task_rq(p) != rq);
+	WARN_ON_ONCE_DEFERRED(task_rq(p) != rq);
 	activate_task(rq, p, ENQUEUE_NOCLOCK);
 	wakeup_preempt(rq, p, 0);
 }
-- 
2.53.0


^ permalink raw reply related

* [PATCH 0/2] sched: Introduce and use deferred WARNs in sched
From: Sebastian Andrzej Siewior @ 2026-06-23 14:26 UTC (permalink / raw)
  To: linux-arch, linux-kernel, sched-ext, netdev
  Cc: David S . Miller, Andrea Righi, Andrew Morton, Arnd Bergmann,
	Ben Segall, Breno Leitao, Changwoo Min, David Vernet,
	Dietmar Eggemann, Eric Dumazet, Ingo Molnar, Jakub Kicinski,
	John Ogness, Juri Lelli, K Prateek Nayak, Paolo Abeni,
	Peter Zijlstra, Petr Mladek, Sergey Senozhatsky, Simon Horman,
	Steven Rostedt, Tejun Heo, Vincent Guittot, Vlad Poenaru,
	Sebastian Andrzej Siewior

This is a follow-up to the netconsole lockup reported at
	https://lore.kernel.org/all/20260610183621.3915271-1-vlad.wing@gmail.com/

The idea is to use deferred printing for WARNs and use them in sched. I
tried to convert only the sites where it looks like the rq lock is held,
instead of a plain s/WARN_ON/WARN_ON_DEFERRED/, which would be simpler.

This unholy deferred mess can be removed once we don't have legacy
consoles anymore _or_ force force_legacy_kthread=true.

The initial report is against v6.16 and netconsole. The reported problem
does not occur upstream since commit 7eab73b18630e ("netconsole: convert
to NBCON console infrastructure") which is v7.0-rc1.

Should this be rejected outright because the preferred solution is to
| - stick msg in buffer (lockless)
| - print to atomic consoles (lockless)
| - use irq_work to wake console kthreads (lockless)
| - each kthread then tries to flush buffer to its own non-atomic console
|   in non-atomic context.

then this means forcing force_legacy_kthread=true.
The threaded legacy printer has been available since v6.12-rc1. In terms
of a stable fix, this could go back as far as v6.12 stable but not
earlier (in case we care).
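
The store-then-flush flow quoted above can be sketched in userspace as
follows. This is only a single-threaded model under stated assumptions,
not the printk implementation: all model_* names, the slot sizes and the
flush_pending flag (standing in for a queued irq_work) are hypothetical,
the atomic-console step is left out, and nothing here is actually
lockless.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

#define LOG_SLOTS 8
#define MSG_LEN   64

/* Shared message store; stands in for the printk ringbuffer. */
static char log_buf[LOG_SLOTS][MSG_LEN];
static unsigned long log_next_seq;	/* sequence number of the next record */
static int flush_pending;		/* stands in for a queued irq_work */

/* Hypothetical console: remembers how far it has printed. */
struct model_console {
	const char *name;
	unsigned long seq;	/* next record this console will print */
	unsigned long printed;	/* number of records emitted so far */
};

/* Store the message and request a later flush. No console is touched. */
static void model_store(const char *msg)
{
	char *slot = log_buf[log_next_seq % LOG_SLOTS];

	assert(msg != NULL);
	strncpy(slot, msg, MSG_LEN - 1);
	slot[MSG_LEN - 1] = '\0';
	log_next_seq++;
	flush_pending = 1;
}

/* A console's own thread drains everything it has not printed yet. */
static void model_console_flush(struct model_console *con)
{
	/* Skip records that were overwritten before this console got to them. */
	if (log_next_seq - con->seq > LOG_SLOTS)
		con->seq = log_next_seq - LOG_SLOTS;

	while (con->seq < log_next_seq) {
		printf("[%s] %s\n", con->name, log_buf[con->seq % LOG_SLOTS]);
		con->seq++;
		con->printed++;
	}
}

/* The deferred work runs later, outside the locks held by the caller. */
static void model_run_pending(struct model_console *cons, int n)
{
	if (!flush_pending)
		return;
	flush_pending = 0;
	for (int i = 0; i < n; i++)
		model_console_flush(&cons[i]);
}
```

The point of the model is that model_store() never calls into a console,
so whatever locks its caller holds cannot be taken again by the console
path. Each console keeps its own cursor into the shared buffer, so a slow
console does not hold back the others.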

I tested this on an x86 box with an 8250 and a warning in
put_prev_entity(). After it printed the initial warning, it deadlocked
shortly afterwards: systemd was writing to the kernel buffer, so it
acquired the uart_port_lock and then attempted to write a lockdep report,
which required the same lock…

Sebastian Andrzej Siewior (2):
  bug: Provide WARN_ON.*DEFERRED() macros for console deferred output
  sched: Use WARN_ON.*_DEFERRED()

 include/asm-generic/bug.h  |  41 ++++++++++++++
 kernel/sched/core.c        |  78 +++++++++++++-------------
 kernel/sched/core_sched.c  |   6 +-
 kernel/sched/cpudeadline.c |   6 +-
 kernel/sched/deadline.c    |  62 ++++++++++-----------
 kernel/sched/ext.c         | 110 ++++++++++++++++++-------------------
 kernel/sched/fair.c        |  88 ++++++++++++++---------------
 kernel/sched/rt.c          |  36 ++++++------
 kernel/sched/sched.h       |  18 +++---
 lib/bug.c                  |  16 +++++-
 10 files changed, 257 insertions(+), 204 deletions(-)

-- 
2.53.0


^ permalink raw reply

* [PATCH 1/2] bug: Provide WARN_ON.*DEFERRED() macros for console deferred output
From: Sebastian Andrzej Siewior @ 2026-06-23 14:26 UTC (permalink / raw)
  To: linux-arch, linux-kernel, sched-ext, netdev
  Cc: David S . Miller, Andrea Righi, Andrew Morton, Arnd Bergmann,
	Ben Segall, Breno Leitao, Changwoo Min, David Vernet,
	Dietmar Eggemann, Eric Dumazet, Ingo Molnar, Jakub Kicinski,
	John Ogness, Juri Lelli, K Prateek Nayak, Paolo Abeni,
	Peter Zijlstra, Petr Mladek, Sergey Senozhatsky, Simon Horman,
	Steven Rostedt, Tejun Heo, Vincent Guittot, Vlad Poenaru,
	Sebastian Andrzej Siewior
In-Reply-To: <20260623142650.265721-1-bigeasy@linutronix.de>

Provide a deferred version of the WARN_ON() macro. It will delay
flushing the console until a later context. It is needed in a context
where the caller holds locks which can lead to a deadlock if content is
flushed to the console driver.
An example would be a warning from within the scheduler resulting in a
wake-up of a task.

Deferring the output works by using printk_deferred_enter() and
printk_deferred_exit() around the printing output. This must be used in
a context where the task can't migrate to another CPU. This should
usually be the case, since the scheduler would acquire the rq lock with
disabled interrupts, but to be safe preemption is disabled to guarantee
this.
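
Why migration matters can be shown with a small userspace model. It
assumes the deferral state is a nesting counter kept per CPU, which is
what the no-migration requirement above implies; the model_* names, the
two-CPU array and the counters are hypothetical and only illustrate the
reasoning.

```c
#include <assert.h>
#include <stdio.h>

#define MODEL_NR_CPUS 2

/* One nesting counter per CPU; a stand-in for per-CPU printk context state. */
static int deferred_depth[MODEL_NR_CPUS];
static int direct_prints;	/* messages pushed to the console right away */
static int deferred_prints;	/* messages only stored for a later flush */

static void model_deferred_enter(int cpu) { deferred_depth[cpu]++; }
static void model_deferred_exit(int cpu)  { deferred_depth[cpu]--; }

/* Print on @cpu: store only while that CPU is inside a deferred section. */
static void model_print(int cpu, const char *msg)
{
	assert(cpu >= 0 && cpu < MODEL_NR_CPUS);
	if (deferred_depth[cpu] > 0) {
		deferred_prints++;
	} else {
		direct_prints++;
		printf("cpu%d: %s\n", cpu, msg);
	}
}

/* Correct use: enter, print and exit all run on the same CPU. */
static void model_warn_pinned(int cpu)
{
	model_deferred_enter(cpu);
	model_print(cpu, "WARNING");
	model_deferred_exit(cpu);
}

/* Broken use: the task migrates between enter and the print. */
static void model_warn_migrated(int from, int to)
{
	model_deferred_enter(from);
	model_print(to, "WARNING");	/* not deferred: wrong CPU's counter */
	model_deferred_exit(to);	/* leaves both counters unbalanced */
}
```

In the pinned case the message is deferred and the counter returns to
zero. In the migrated case the message is printed directly and both
counters are left unbalanced.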

In order not to bloat the code on architectures which provide an
optimized __WARN_FLAGS(), define BUGFLAG_DEFERRED, which is handled by
__report_bug() and does not increase the code size.
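
A rough userspace sketch of how such a flag travels in the bug entry and
is decoded at report time. Bits 2 to 5 and the taint shift match the
bug.h hunk in this patch; the values of BUGFLAG_WARNING, BUGFLAG_ONCE and
TAINT_WARN are not shown in the hunk and are assumed here, and the
model_* types and functions are hypothetical.

```c
#include <assert.h>
#include <stdio.h>

#define BUGFLAG_WARNING		(1 << 0)	/* assumed, not in the hunk */
#define BUGFLAG_ONCE		(1 << 1)	/* assumed, not in the hunk */
#define BUGFLAG_DONE		(1 << 2)
#define BUGFLAG_NO_CUT_HERE	(1 << 3)
#define BUGFLAG_ARGS		(1 << 4)
#define BUGFLAG_DEFERRED	(1 << 5)
#define BUGFLAG_TAINT(taint)	((taint) << 8)
#define BUG_GET_TAINT(bug)	((bug)->flags >> 8)
#define MODEL_TAINT_WARN	9		/* assumed value of TAINT_WARN */

/* Hypothetical, reduced bug table entry: only the flags word. */
struct model_bug_entry {
	unsigned short flags;
};

struct model_decoded {
	int warning, once, done, deferred;
	unsigned int taint;
};

/* Split the flags word into the booleans the report path looks at. */
static struct model_decoded model_decode(const struct model_bug_entry *bug)
{
	assert(bug != NULL);
	struct model_decoded d = {
		.warning  = !!(bug->flags & BUGFLAG_WARNING),
		.once     = !!(bug->flags & BUGFLAG_ONCE),
		.done     = !!(bug->flags & BUGFLAG_DONE),
		.deferred = !!(bug->flags & BUGFLAG_DEFERRED),
		.taint    = BUG_GET_TAINT(bug),
	};
	return d;
}

/* Returns 1 if a report is emitted, 0 if a ONCE entry already fired. */
static int model_report(struct model_bug_entry *bug)
{
	struct model_decoded d = model_decode(bug);

	if (d.warning && d.once) {
		if (d.done)
			return 0;
		bug->flags |= BUGFLAG_DONE;
	}
	printf("report: deferred=%d taint=%u\n", d.deferred, d.taint);
	return 1;
}
```

The flag only occupies one more bit of the existing flags word, below
the taint field that starts at bit 8, so the call site emits the same
trap as before and the decision to defer is made in the handler.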

Provide the DEFERRED macros based on the __WARN_FLAGS and __WARN
macros. Extend __report_bug() to handle the deferred case.

Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
---
 include/asm-generic/bug.h | 41 +++++++++++++++++++++++++++++++++++++++
 lib/bug.c                 | 16 +++++++++++++--
 2 files changed, 55 insertions(+), 2 deletions(-)

diff --git a/include/asm-generic/bug.h b/include/asm-generic/bug.h
index 09e8eccee8ed9..1e3ff00f709b8 100644
--- a/include/asm-generic/bug.h
+++ b/include/asm-generic/bug.h
@@ -14,6 +14,7 @@
 #define BUGFLAG_DONE		(1 << 2)
 #define BUGFLAG_NO_CUT_HERE	(1 << 3)	/* CUT_HERE already sent */
 #define BUGFLAG_ARGS		(1 << 4)
+#define BUGFLAG_DEFERRED	(1 << 5)
 #define BUGFLAG_TAINT(taint)	((taint) << 8)
 #define BUG_GET_TAINT(bug)	((bug)->flags >> 8)
 #endif
@@ -115,6 +116,16 @@ extern __printf(1, 2) void __warn_printk(const char *fmt, ...);
 })
 #endif
 
+#define WARN_ON_DEFERRED(condition) ({					\
+	int __ret_warn_on = !!(condition);				\
+	if (unlikely(__ret_warn_on)) {					\
+		__WARN_FLAGS(#condition,				\
+			     BUGFLAG_DEFERRED |				\
+			     BUGFLAG_TAINT(TAINT_WARN));		\
+	}								\
+	unlikely(__ret_warn_on);					\
+})
+
 #ifndef WARN_ON_ONCE
 #define WARN_ON_ONCE(condition) ({					\
 	int __ret_warn_on = !!(condition);				\
@@ -125,6 +136,16 @@ extern __printf(1, 2) void __warn_printk(const char *fmt, ...);
 	unlikely(__ret_warn_on);					\
 })
 #endif
+
+#define WARN_ON_ONCE_DEFERRED(condition) ({				\
+	int __ret_warn_on = !!(condition);				\
+	if (unlikely(__ret_warn_on)) {					\
+		__WARN_FLAGS(#condition,				\
+			     BUGFLAG_ONCE | BUGFLAG_DEFERRED |		\
+			     BUGFLAG_TAINT(TAINT_WARN));		\
+	}								\
+	unlikely(__ret_warn_on);					\
+})
 #endif /* __WARN_FLAGS */
 
 #if defined(__WARN_FLAGS) && !defined(__WARN_printf)
@@ -159,6 +180,19 @@ extern __printf(1, 2) void __warn_printk(const char *fmt, ...);
 })
 #endif
 
+#ifndef WARN_ON_DEFERRED
+#define WARN_ON_DEFERRED(condition) ({					\
+	int __ret_warn_on = !!(condition);				\
+	if (unlikely(__ret_warn_on)) {					\
+		guard(preempt)();					\
+		printk_deferred_enter();				\
+		__WARN();						\
+		printk_deferred_exit();					\
+	}								\
+	unlikely(__ret_warn_on);					\
+})
+#endif
+
 #ifndef WARN
 #define WARN(condition, format...) ({					\
 	int __ret_warn_on = !!(condition);				\
@@ -180,6 +214,11 @@ extern __printf(1, 2) void __warn_printk(const char *fmt, ...);
 	DO_ONCE_LITE_IF(condition, WARN_ON, 1)
 #endif
 
+#ifndef WARN_ON_ONCE_DEFERRED
+#define WARN_ON_ONCE_DEFERRED(condition)				\
+	DO_ONCE_LITE_IF(condition, WARN_ON_DEFERRED, 1)
+#endif
+
 #ifndef WARN_ONCE
 #define WARN_ONCE(condition, format...)				\
 	DO_ONCE_LITE_IF(condition, WARN, 1, format)
@@ -215,7 +254,9 @@ extern __printf(1, 2) void __warn_printk(const char *fmt, ...);
 })
 #endif
 
+#define WARN_ON_DEFERRED(condition) WARN_ON(condition)
 #define WARN_ON_ONCE(condition) WARN_ON(condition)
+#define WARN_ON_ONCE_DEFERRED(condition) WARN_ON(condition)
 #define WARN_ONCE(condition, format...) WARN(condition, format)
 #define WARN_TAINT(condition, taint, format...) WARN(condition, format)
 #define WARN_TAINT_ONCE(condition, taint, format...) WARN(condition, format)
diff --git a/lib/bug.c b/lib/bug.c
index 224f4cfa4aa31..f5768f5d17b47 100644
--- a/lib/bug.c
+++ b/lib/bug.c
@@ -196,7 +196,7 @@ void __warn_printf(const char *fmt, struct pt_regs *regs)
 
 static enum bug_trap_type __report_bug(struct bug_entry *bug, unsigned long bugaddr, struct pt_regs *regs)
 {
-	bool warning, once, done, no_cut, has_args;
+	bool warning, once, done, no_cut, has_args, deferred;
 	const char *file, *fmt;
 	unsigned line;
 
@@ -219,6 +219,7 @@ static enum bug_trap_type __report_bug(struct bug_entry *bug, unsigned long buga
 	done     = bug->flags & BUGFLAG_DONE;
 	no_cut   = bug->flags & BUGFLAG_NO_CUT_HERE;
 	has_args = bug->flags & BUGFLAG_ARGS;
+	deferred = bug->flags & BUGFLAG_DEFERRED;
 
 	if (warning && once) {
 		if (done)
@@ -229,7 +230,10 @@ static enum bug_trap_type __report_bug(struct bug_entry *bug, unsigned long buga
 		 */
 		bug->flags |= BUGFLAG_DONE;
 	}
-
+	if (deferred) {
+		preempt_disable_notrace();
+		printk_deferred_enter();
+	}
 	/*
 	 * BUG() and WARN_ON() families don't print a custom debug message
 	 * before triggering the exception handler, so we must add the
@@ -245,6 +249,10 @@ static enum bug_trap_type __report_bug(struct bug_entry *bug, unsigned long buga
 		/* this is a WARN_ON rather than BUG/BUG_ON */
 		__warn(file, line, (void *)bugaddr, BUG_GET_TAINT(bug), regs,
 		       NULL);
+		if (deferred) {
+			printk_deferred_exit();
+			preempt_enable_notrace();
+		}
 		return BUG_TRAP_TYPE_WARN;
 	}
 
@@ -254,6 +262,10 @@ static enum bug_trap_type __report_bug(struct bug_entry *bug, unsigned long buga
 		pr_crit("kernel BUG at %pB [verbose debug info unavailable]\n",
 			(void *)bugaddr);
 
+	if (deferred) {
+		printk_deferred_exit();
+		preempt_enable_notrace();
+	}
 	return BUG_TRAP_TYPE_BUG;
 }
 
-- 
2.53.0


^ permalink raw reply related

