Netdev List

Netdev List
 help / color / mirror / Atom feed

* [PATCH net v6 0/4] Fix i40e/ice/iavf VF bonding after netdev lock changes
From: Jose Ignacio Tornos Martinez @ 2026-06-19  6:13 UTC (permalink / raw)
  To: netdev
  Cc: intel-wired-lan, przemyslaw.kitszel, aleksandr.loktionov,
	jacob.e.keller, horms, jesse.brandeburg, anthony.l.nguyen, davem,
	edumazet, kuba, pabeni, Jose Ignacio Tornos Martinez

This series fixes VF bonding failures introduced by commit ad7c7b2172c3
("net: hold netdev instance lock during sysfs operations").

When adding VFs to a bond immediately after setting trust mode, MAC
address changes fail with -EAGAIN, preventing bonding setup. This
affects both i40e (700-series) and ice (800-series) Intel NICs.

The core issue is lock contention: iavf_set_mac() is now called with the
netdev lock held and waits for MAC change completion while holding it.
However, both the watchdog task that sends the request and the adminq_task
that processes PF responses also need this lock, creating a deadlock where
neither can run, causing timeouts.

Additionally, setting VF trust triggers an unnecessary ~10 second VF reset
in i40e driver that delays bonding setup, even though filter
synchronization happens naturally during normal VF operation. For ice
driver, the delay is not so big, but in the same way the operation is not
necessary.

This series:
1. Adds safety guard to prevent MAC changes during reset or early
   initialization (before VF is ready)
2. Eliminates unnecessary VF reset when setting trust in i40e (reset only
   if revoking trust and VF has advanced features configured).
3. Fixes lock contention by polling admin queue synchronously
4. Eliminates unnecessary VF reset when setting trust in ice, (reset only
   if revoking trust and VF has advanced features configured).

The key fix (patch 3/4) implements a synchronous MAC change operation
similar to the approach used for ndo_change_mtu deadlock fix:
https://lore.kernel.org/intel-wired-lan/20260211191855.1532226-1-poros@redhat.com/
Instead of scheduling work and waiting, it:

- Sends the virtchnl message directly (not via watchdog)
- Polls the admin queue hardware directly for responses
- Processes all messages inline (including non-MAC messages)
- Returns when complete or times out

This allows the operation to complete synchronously while holding
netdev_lock, without relying on watchdog or adminq_task.

The function can sleep for up to 2.5 seconds polling hardware, but this
is acceptable since netdev_lock is per-device and only serializes
operations on the same interface.

Testing shows VF bonding now works reliably in ~5 seconds vs 15+ seconds
before (i40e), without timeouts or errors (i40e and ice).

Tested on Intel 700-series (i40e) and 800-series (ice) dual-port NICs
with iavf driver.

Thanks to Jan Tluka <jtluka@redhat.com> and Yuying Ma <yuma@redhat.com> for
reporting the issues.

Jose Ignacio Tornos Martinez (4):
  iavf: return EBUSY if reset in progress or not ready during MAC change
  i40e: skip unnecessary VF reset when setting trust
  iavf: send MAC change request synchronously
  ice: skip unnecessary VF reset when setting trust

All patches tested successfully with bonding setup.
---
v6:
  - Patch 1/4 (iavf EBUSY): No changes from v5
  - Patch 2/4 (i40e trust): No code changes from v5. AI review comments covered
    design decisions and pre-existing issues, no bugs found in new code.
  - Patch 3/4 (iavf sync MAC): Address edge cases from AI review (Jakub Kicinski)
    - Allocate event buffer before sending to avoid state mismatch if allocation
      fails after message is sent to PF
    - Add loop to send all batches before polling (rare multi-batch scenario)
    - Conditional rollback: only rollback on send failure (ret != -EAGAIN), not on
      timeout where PF will eventually respond
  - Patch 4/4 (ice trust): Revert to original reset pattern based on AI review
    - AI review identified issues with v5's reset-before-cleanup approach
      (privilege bit not cleared, LLDP cleanup loop became dead code)
    - Restore proven cleanup → set trust → reset pattern from original code
    - Simpler implementation, no save/restore complexity
    - Maintains same goal: skip reset when VF has no advanced features
v5: https://lore.kernel.org/all/20260429102426.210750-1-jtornosm@redhat.com/

 drivers/net/ethernet/intel/i40e/i40e_virtchnl_pf.c |  38 ++++++++++++++++++++++++++++----------
 drivers/net/ethernet/intel/iavf/iavf.h             |  10 ++++++++--
 drivers/net/ethernet/intel/iavf/iavf_main.c        |  71 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++---------
 drivers/net/ethernet/intel/iavf/iavf_virtchnl.c    | 103 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++----------
 drivers/net/ethernet/intel/ice/ice_sriov.c         |  33 +++++++++++++++++++++++++++++----
 5 files changed, 220 insertions(+), 35 deletions(-)
--
2.43.0

^ permalink raw reply

* Re: [RFC net-next 3/4] net: dsa: motorcomm: Dynamically allocate port structures
From: Andrew Lunn @ 2026-06-19  6:06 UTC (permalink / raw)
  To: David Yang
  Cc: netdev, Vladimir Oltean, David S. Miller, Eric Dumazet,
	Jakub Kicinski, Paolo Abeni, linux-kernel
In-Reply-To: <20260618202716.2166450-4-mmyangfl@gmail.com>

On Fri, Jun 19, 2026 at 04:26:31AM +0800, David Yang wrote:
> With support for LED introduced later, struct yt921x_priv will be 17k
> which is not very good for a single kmalloc(). Convert the ports array
> to a array of pointers to stop bloating the priv struct.
> 
> Signed-off-by: David Yang <mmyangfl@gmail.com>
> ---
>  drivers/net/dsa/motorcomm/chip.c | 95 ++++++++++++++++++++++++--------
>  drivers/net/dsa/motorcomm/chip.h |  3 +-
>  2 files changed, 75 insertions(+), 23 deletions(-)
> 
> diff --git a/drivers/net/dsa/motorcomm/chip.c b/drivers/net/dsa/motorcomm/chip.c
> index 6dee25b6754a..d44f7749de02 100644
> --- a/drivers/net/dsa/motorcomm/chip.c
> +++ b/drivers/net/dsa/motorcomm/chip.c
> @@ -548,11 +548,14 @@ yt921x_mbus_ext_init(struct yt921x_priv *priv, struct device_node *mnp)
>  /* Read and handle overflow of 32bit MIBs. MIB buffer must be zeroed before. */
>  static int yt921x_read_mib(struct yt921x_priv *priv, int port)
>  {
> -	struct yt921x_port *pp = &priv->ports[port];
> +	struct yt921x_port *pp = priv->ports[port];
>  	struct device *dev = to_device(priv);
>  	struct yt921x_mib *mib = &pp->mib;
>  	int res = 0;
>  
> +	if (!pp)
> +		return -ENODEV;
> +

Are all these tests actually needed? If you cannot allocate the
memory, i would expect the probe to fail, so you can never get here.

	Andrew

^ permalink raw reply

* Re: [PATCH net-next 2/2] net: dsa: mv88e6xxx: add support for credit based shaper
From: Andrew Lunn @ 2026-06-19  5:59 UTC (permalink / raw)
  To: Luke Howard
  Cc: Cedric Jehasse, cedric.jehasse, Vladimir Oltean, David S. Miller,
	Eric Dumazet, Jakub Kicinski, Paolo Abeni, Simon Horman,
	Russell King, netdev, linux-kernel, Christoph Mellauner,
	Kieran Tyrrell
In-Reply-To: <A3A6C8D7-3B6D-4716-88C7-EC32DB9D087D@padl.com>

> As an aside, the 6341 driver is buggy. It assumes the chip does not
> have a dedicated ATU FID register because it has 256 databases. This
> is incorrect and breaks VLAN-aware bridging. I will post a patch
> when the merge window reopens.

You can post fixes for net any time. No need to wait for the merge
window to close.

	Andrew

^ permalink raw reply

* [PATCH net] net: mana: Fall back to standard MTU when PF reports adapter_mtu of 0
From: Erni Sri Satya Vennela @ 2026-06-19  5:53 UTC (permalink / raw)
  To: kys, haiyangz, wei.liu, decui, longli, andrew+netdev, davem,
	edumazet, kuba, pabeni, dipayanroy, ssengar, jacob.e.keller,
	ernis, horms, gargaditya, kees, linux-hyperv, netdev,
	linux-kernel, bpf

Commit d7709812e13d ("net: mana: hardening: Validate adapter_mtu from
MANA_QUERY_DEV_CONFIG") rejected any adapter_mtu value smaller than
ETH_MIN_MTU + ETH_HLEN, including 0, returning -EPROTO and failing
mana_probe().

Some older PF firmware versions still in the field report
adapter_mtu as 0 in the MANA_QUERY_DEV_CONFIG response. With the
hardening check in place, the MANA VF driver now fails to load on
those hosts, breaking networking entirely for guests.

MANA hardware always supports the standard Ethernet MTU. Treat a
reported adapter_mtu of 0 as "the PF did not advertise a value" and
fall back to ETH_FRAME_LEN, the same value used for the pre-V2
message version path. Only jumbo frames remain unavailable until
the PF reports a valid MTU.

Other small-but-nonzero bogus values are still rejected, preserving
the original protection against the unsigned-subtraction wrap that
would otherwise let ndev->max_mtu underflow to a huge value.

Fixes: d7709812e13d ("net: mana: hardening: Validate adapter_mtu from MANA_QUERY_DEV_CONFIG")
Signed-off-by: Erni Sri Satya Vennela <ernis@linux.microsoft.com>
---
 drivers/net/ethernet/microsoft/mana/mana_bpf.c |  3 ++-
 drivers/net/ethernet/microsoft/mana/mana_en.c  | 16 ++++++++++++++--
 2 files changed, 16 insertions(+), 3 deletions(-)

diff --git a/drivers/net/ethernet/microsoft/mana/mana_bpf.c b/drivers/net/ethernet/microsoft/mana/mana_bpf.c
index b5e9bb184a1d..53308e139cbe 100644
--- a/drivers/net/ethernet/microsoft/mana/mana_bpf.c
+++ b/drivers/net/ethernet/microsoft/mana/mana_bpf.c
@@ -237,7 +237,8 @@ static int mana_xdp_set(struct net_device *ndev, struct bpf_prog *prog,
 		bpf_prog_put(old_prog);
 
 	if (prog)
-		ndev->max_mtu = MANA_XDP_MTU_MAX;
+		ndev->max_mtu = min_t(unsigned int, MANA_XDP_MTU_MAX,
+				      gc->adapter_mtu - ETH_HLEN);
 	else
 		ndev->max_mtu = gc->adapter_mtu - ETH_HLEN;
 
diff --git a/drivers/net/ethernet/microsoft/mana/mana_en.c b/drivers/net/ethernet/microsoft/mana/mana_en.c
index 87862b0434c7..7438ea6b3f26 100644
--- a/drivers/net/ethernet/microsoft/mana/mana_en.c
+++ b/drivers/net/ethernet/microsoft/mana/mana_en.c
@@ -1233,12 +1233,24 @@ int mana_gd_query_device_cfg(struct gdma_context *gc, u32 proto_major_ver,
 	*max_num_vports = resp.max_num_vports;
 
 	if (resp.hdr.response.msg_version >= GDMA_MESSAGE_V2) {
-		if (resp.adapter_mtu < ETH_MIN_MTU + ETH_HLEN) {
+		if (resp.adapter_mtu == 0) {
+			/*
+			 * Some older PF firmware versions report an
+			 * adapter_mtu of 0. MANA hardware always supports the
+			 * standard Ethernet MTU, so fall back to ETH_FRAME_LEN.
+			 * Jumbo frames will not be available in this case.
+			 */
+			dev_info(dev,
+				 "PF reported adapter_mtu of 0, falling back to %u (jumbo frames disabled)\n",
+				 ETH_FRAME_LEN);
+			gc->adapter_mtu = ETH_FRAME_LEN;
+		} else if (resp.adapter_mtu < ETH_MIN_MTU + ETH_HLEN) {
 			dev_err(dev, "Adapter MTU too small: %u\n",
 				resp.adapter_mtu);
 			return -EPROTO;
+		} else {
+			gc->adapter_mtu = resp.adapter_mtu;
 		}
-		gc->adapter_mtu = resp.adapter_mtu;
 	} else {
 		gc->adapter_mtu = ETH_FRAME_LEN;
 	}
-- 
2.34.1


^ permalink raw reply related

* [PATCH net v2] net/smc: avoid recursive sk_callback_lock in listen data_ready
From: Runyu Xiao @ 2026-06-19  5:48 UTC (permalink / raw)
  To: D. Wythe, Dust Li, Sidraya Jayagond, Wenjia Zhang,
	David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni
  Cc: Mahanta Jambigi, Tony Lu, Wen Gu, Simon Horman, Karsten Graul,
	linux-rdma, linux-s390, netdev, linux-kernel, jianhao.xu,
	runyu.xiao, stable
In-Reply-To: <20260617152855.1039151-1-runyu.xiao@seu.edu.cn>

smc_listen() installs smc_clcsock_data_ready() as the underlying TCP
listen socket's sk_data_ready callback.  The callback takes
sk_callback_lock before looking up the SMC listener and queuing
smc_tcp_listen_work().

This can recurse when the underlying TCP listen socket is being closed.
The close/flush path may invoke the installed sk_data_ready callback with
sk_callback_lock already held, so smc_clcsock_data_ready() tries to take
the same rwlock again in the same thread.

This issue was found by our static analysis tool and then manually
reviewed against the current tree.  The reproducer keeps the SMC listen
callback installation path:

  smc_listen()
  smc_clcsock_replace_cb()
  sk_data_ready = smc_clcsock_data_ready()

It then models the close/flush carrier that invokes the installed
sk_data_ready callback while sk_callback_lock is already held.  Lockdep
reports the same-thread recursive acquisition:

  WARNING: possible recursive locking detected
  kworker/u4:3/39 is trying to acquire lock:
    (sk_callback_lock) at smc_clcsock_data_ready+0xa/0x4d

  but task is already holding lock:
    (sk_callback_lock) at smc_close_flush_work+0xc/0x30

  Possible unsafe locking scenario:

        CPU0
        ----
        lock(sk_callback_lock);
        lock(sk_callback_lock);

  *** DEADLOCK ***

  Workqueue: smc_close_wq smc_close_flush_work

  Call Trace:
    dump_stack_lvl
    __lock_acquire
    lock_acquire
    _raw_read_lock_bh
    smc_clcsock_data_ready
    smc_close_flush_work
    process_one_work
    worker_thread
    kthread
    ret_from_fork

The same pattern was fixed for nvmet TCP by checking TCP_LISTEN before
taking sk_callback_lock:

  commit 2fa8961d3a6a ("nvmet-tcp: fixup hang in
  nvmet_tcp_listen_data_ready()")

Do the same for SMC.  smc_clcsock_data_ready() is installed by
smc_listen() on the underlying TCP listen socket and only queues
smc_tcp_listen_work() for the SMC listen/accept path.  Once that socket is
no longer in TCP_LISTEN, there is no listen accept work to queue from this
callback, and avoiding sk_callback_lock also avoids the recursive locking
path.

Fixes: 0558226cebee ("net/smc: Fix slab-out-of-bounds issue in fallback")
Cc: stable@vger.kernel.org
Signed-off-by: Runyu Xiao <runyu.xiao@seu.edu.cn>
---
v2:
- Include the fuller Lockdep stack from the grounded reproducer.
- Add the related nvmet TCP fix reference.
- Explain why the TCP_LISTEN check is valid for the SMC listen callback.

 net/smc/af_smc.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/net/smc/af_smc.c b/net/smc/af_smc.c
index 6421c2e1c84d..1af4e3c333ff 100644
--- a/net/smc/af_smc.c
+++ b/net/smc/af_smc.c
@@ -2631,6 +2631,9 @@ static void smc_clcsock_data_ready(struct sock *listen_clcsock)
 {
 	struct smc_sock *lsmc;

+	if (READ_ONCE(listen_clcsock->sk_state) != TCP_LISTEN)
+		return;
+
 	read_lock_bh(&listen_clcsock->sk_callback_lock);
 	lsmc = smc_clcsock_user_data(listen_clcsock);
 	if (!lsmc)
-- 
2.34.1

^ permalink raw reply related

* Re: [PATCH v2 5/7] seg6: add End.M.GTP6.D.Di behavior
From: Yuya Kusakabe @ 2026-06-19  5:48 UTC (permalink / raw)
  To: andrea
  Cc: Yuya Kusakabe, andrea.mayer, davem, edumazet, dsahern, kuba,
	pabeni, horms, justin.iurman, shuah, corbet, skhan, linux-kernel,
	netdev, linux-kselftest, linux-doc, stefano.salsano, ahabdels
In-Reply-To: <20260607160119.ed2022e8a358d700e1134318@common-net.org>

Hi Andrea,

Thank you for the review.

> The patch 4 review applies here, except for the parts where Section 6.4 is
> implemented instead of Section 6.3 (which is incorrectly implemented in
> patch 4).

Answered in the patch 4 reply: the next version of End.M.GTP6.D will
implement Section 6.3 (Args.Mob.Session stamped into SRH[0], no
preserved D), leaving the original-DA preservation exclusive to this
drop-in variant.

> input_action_end_m_gtp6_d_di() and its finish callback are largely
> identical to the patch 4 functions (input_action_end_m_gtp6_d() and its
> finish): the SRH check, GTP-U dispatch, outer strip, inner protocol
> detection, and NF_HOOK invocation are identical. The duplication should be
> reduced via shared helpers.

Will do. The plan is one decap helper (SRH check, GTP-U dispatch,
outer strip, inner protocol detection) shared between End.M.GTP6.D and
End.M.GTP6.D.Di, and one SRv6-push helper (including the GSO offload
setup) shared with H.M.GTP4.D as well, with the GTP-U parser common to
all of them. The D.Di handler then reduces to the prepended-slot
handling specific to the drop-in variant. (The NF_HOOK invocation goes
away in the initial series per the cover letter thread.)

> D.Di does not use teid or qfi, so these variables and the (void) casts are
> dead code and should be avoided. For example, seg6_mobile_parse_gtpu() could
> accept NULL for teid and qfi so callers that do not need them can pass NULL
> directly.

Will do exactly that: the GTP-U parser (and the decap helper above)
will accept NULL for teid/qfi, and the drop-in variant will pass NULL.

Thanks,
Yuya

^ permalink raw reply

* Re: [PATCH net] net/smc: avoid recursive sk_callback_lock in listen data_ready
From: Mahanta Jambigi @ 2026-06-19  5:36 UTC (permalink / raw)
  To: Runyu Xiao, D. Wythe, Dust Li, Sidraya Jayagond, Wenjia Zhang,
	David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni
  Cc: Tony Lu, Wen Gu, Simon Horman, Karsten Graul, linux-rdma,
	linux-s390, netdev, linux-kernel, jianhao.xu
In-Reply-To: <20260618141629.2904071-1-runyu.xiao@seu.edu.cn>



On 18/06/26 7:46 pm, Runyu Xiao wrote:
> Hi,
> 
> Thanks for taking a look.
> 
> The exact Lockdep stack I have is from the grounded reproducer, not from
> a production SMC setup.  The reproducer keeps the same callback shape:
> the close/flush side holds sk_callback_lock and invokes the installed
> sk_data_ready callback, which re-enters smc_clcsock_data_ready() and tries
> to take sk_callback_lock again.
> 
> The relevant Lockdep report is:
> 
>   WARNING: possible recursive locking detected
>   kworker/u4:3/39 is trying to acquire lock:
>     (sk_callback_lock) at smc_clcsock_data_ready+0xa/0x4d
> 
>   but task is already holding lock:
>     (sk_callback_lock) at smc_close_flush_work+0xc/0x30
> 
>   Possible unsafe locking scenario:
> 
>         CPU0
>         ----
>         lock(sk_callback_lock);
>         lock(sk_callback_lock);
> 
>   *** DEADLOCK ***
> 
>   Workqueue: smc_close_wq smc_close_flush_work
> 
>   Call Trace:
>     dump_stack_lvl
>     __lock_acquire
>     lock_acquire
>     _raw_read_lock_bh
>     smc_clcsock_data_ready+0xa/0x4d
>     smc_close_flush_work+0x1f/0x30
>     process_one_work
>     worker_thread
>     kthread
>     ret_from_fork

Thank you for addressing the feedback. My suggestion would be to reply
to the original email thread where the review comments were given, so
that the maintainers can follow the conversation.

https://www.kernel.org/doc/html/latest/process/submitting-patches.html#respond-to-review-comments

Please include above call stack in your next version.

> 
> The nvmet change I referred to is:
> 
>   2fa8961d3a6a ("nvmet-tcp: fixup hang in nvmet_tcp_listen_data_ready()")

Please include this info in your next version.

> 
> The stable/backport patch I originally used as the reference is:
> 
>   1c90f930e7b4 ("nvmet-tcp: fixup hang in nvmet_tcp_listen_data_ready()")
> 
> Its commit message says that when the socket is closed while in
> TCP_LISTEN, the flush callback can call nvmet_tcp_listen_data_ready()
> with sk_callback_lock already held, so nvmet moved the TCP_LISTEN check
> before taking sk_callback_lock.
> 
> For the TCP_LISTEN check: my reasoning was that smc_clcsock_data_ready()
> is installed by smc_listen() on the underlying TCP listen socket and only
> queues smc_tcp_listen_work() for the SMC listen/accept path.  Once that
> underlying socket is no longer in TCP_LISTEN, there should be no SMC
> listen accept work to queue from this callback.  TCP_SYN_RECV and
> TCP_ESTABLISHED are not listen-socket states for this callback path, so I
> did not intend the callback to queue listen work for those states.

I understand. Please include this info in your next version.

> 
> That said, if SMC expects smc_clcsock_data_ready() to handle a non-LISTEN
> state during fallback or another transition, then the proposed check is
> too strict and I should rework the fix.
> 
> Thanks,
> Runyu


^ permalink raw reply

* Re: [PATCH v2 4/7] seg6: add End.M.GTP6.D behavior
From: Yuya Kusakabe @ 2026-06-19  5:27 UTC (permalink / raw)
  To: andrea
  Cc: Yuya Kusakabe, andrea.mayer, davem, edumazet, dsahern, kuba,
	pabeni, horms, justin.iurman, shuah, corbet, skhan, linux-kernel,
	netdev, linux-kselftest, linux-doc, stefano.salsano, ahabdels
In-Reply-To: <20260607020517.0c6bbb8beba505ac9447545e@common-net.org>

Hi Andrea,

Thank you for the review. The points shared with patches 1-3 will be
addressed as described in those replies; below the
End.M.GTP6.D-specific ones.

> The "src" attribute is used verbatim here as the outer IPv6 source address,
> same as patch 3. The src dual-semantics overload flagged in the patch 3
> reply applies here too.

Covered in the patch 3 reply: with the End.M.GTP4.E template use
gone, verbatim outer IPv6 SA becomes the single meaning of the src
attribute for the IPv6-emitting behaviors.

> Thank you for the follow-up in the cover letter thread. The finish callback
> writes orig_dst into SRH[0] and Args.Mob.Session into SRH[1]. As far as I
> can see, this matches neither Section 6.3 (Args.Mob.Session in SRH[0], no
> D) nor Section 6.4 (D in SRH[0], no Args.Mob).

Confirmed, that is the bug from my May 10 note. The next version of
End.M.GTP6.D will push the configured SR Policy verbatim and stamp
Args.Mob.Session into SRH[0] (at the locator length given by the
explicit sr_prefix_len attribute) per Section 6.3 S08; preserving the
original outer DA in a prepended slot will be exclusive to
End.M.GTP6.D.Di.

> Same reverse Christmas tree as patch 2; same issue in the other functions
> introduced by this patch.
> gtp is only used as a cast intermediary. Could it be inlined?

Will fix both.

> Nit: gtphl and hdrlen are assigned before the GTP1_F_EXTHDR check. On the
> path where the E flag is not set, gtphl is unused. Moving the gtphl
> assignment after the check would make the flow clearer.

Will move the gtphl dereference after the check; the pull has to stay
before it, since the long header is also consumed for S/PN-only
flags.

> Maybe ext could be renamed to ext_hdr? It would be easier to distinguish
> from ext_units and ext_bytes.
> ext_units is only used to derive ext_bytes. A single ext_len would
> remove the intermediate variable.

Will do both.

> If the extension chain contains more than one PDU Session Container, *qfi
> is silently overwritten. Is that intentional, or should the function reject
> a duplicate?

Not intentional; will reject a duplicate PDU Session Container as
malformed, with a selftest case for it.

> ext[ext_bytes - 1] reads the Next Extension Header Type field from the last
> byte of the current extension. Would a short comment help the reader?

Will add one.

> input_action_end_m_gtp6_d() does not change skb_dst(skb) before this call,
> so dst and lwtstate are the same ones the caller already dereferenced. When
> can this NULL check trigger?

It cannot: for a route installed with LWTUNNEL_STATE_INPUT_REDIRECT,
lwtunnel_set_redirect() always populates orig_input before dst.input
is replaced. I will drop the checks and call orig_input directly.

> Same dst/lwtstate issue as patch 2. Not introduced by this patch.
> Same missing iptunnel_handle_offloads() as patch 2.

The NF_HOOK split goes away per the cover letter thread, and the SRv6
push will go through a shared helper that calls
iptunnel_handle_offloads(skb, SKB_GSO_IPXIP6) before
seg6_do_srh_encap().

> Same BAD_INNER misuse as patch 2. seg6_do_srh_encap() can also fail from
> seg6_push_hmac(), which is an HMAC error on the new SRH, not an inner-T-PDU
> problem.
[...]
> segments[0], segments[1], saddr, and daddr are written after
> seg6_do_srh_encap() already called skb_postpush_rcsum(). skb->csum can
> be stale. Same for any later change to the outer header or SRH.
>
> HMAC, if configured, is computed on non-final SRH and saddr, hence invalid.

Thanks, both of these are real issues. My plan for the next version:

- every field stamped after seg6_do_srh_encap() (Args.Mob.Session, the
  preserved DA in the drop-in variant, the outer saddr/daddr refresh,
  and the dsfield propagation in H.M.GTP4.D) will go through a small
  helper that applies the corresponding diff to skb->csum when the skb
  is CHECKSUM_COMPLETE;

- the D-side behaviors will reject an HMAC-flagged SRH template at
  configuration time: stamping the per-packet fields after
  seg6_do_srh_encap() has signed the SRH would always invalidate the
  HMAC. Inbound HMAC validation is unaffected. Would you prefer the
  stamp-before-sign ordering solved from the start instead?

> The initializer on reason is dead. Every goto drop path sets reason
> explicitly before the jump. The variable can be left uninitialized here.

This goes away with the drop-reason rework: the MUP drop reasons will
be out of the initial series per the prep series plan, so the variable
itself is removed.

> Same SRH validation concerns as patch 1. HMAC is not validated here.

The ingress will use the same three-state SRH helper as the other
behaviors, which validates the HMAC whenever an SRH is present.

> Limitation note for both input_action_end() calls above: correct per RFC
> 9433 Section 6.3 S10-S11, but the SRH is absent or SL == 0 here, so
> input_action_end() will always drop without signaling non-GTP-U traffic.
> Perhaps you meant to drop directly with BAD_GTPU?

Right, the End fallback could only ever drop here. Instead of
dropping, I plan to hand non-UDP, non-GTP-U and non-T-PDU packets to
the route's original input path (the orig_input saved by the lwtunnel
input redirect), so a downstream owner of the GTP-U control plane
still receives e.g. Echo Request; the selftests will cover that
passthrough.

> Nit: inner_first could be an inner_ver with the shift done at assignment.
> The name would say what the variable holds.
[...]
> Same repeated size-selection ternary as patch 2.

Will do both: the inner version, header length and protocol computed
once in the switch.

> The anonymous { } block scopes three variables that should be declared at
> function top. Splitting into smaller helpers would make this easier to
> follow.

Will split the dispatch and outer strip into a decap helper shared
with End.M.GTP6.D.Di, with declarations at function top.

> Same missing frag_off check as patch 2.

Will add.

> The "{,.Di}" shell brace notation is unusual. Emitting the actual
> behavior name (End.M.GTP6.D or End.M.GTP6.D.Di) would be clearer.
> Same applies wherever this notation appears in the patchset.

Will replace it with the concrete behavior name everywhere.

Thanks,
Yuya

^ permalink raw reply

* Re: [PATCH net-next 2/2] net: dsa: mv88e6xxx: add support for credit based shaper
From: Luke Howard @ 2026-06-19  5:00 UTC (permalink / raw)
  To: Cedric Jehasse
  Cc: cedric.jehasse, Andrew Lunn, Vladimir Oltean, David S. Miller,
	Eric Dumazet, Jakub Kicinski, Paolo Abeni, Simon Horman,
	Russell King, netdev, linux-kernel, Christoph Mellauner,
	Kieran Tyrrell
In-Reply-To: <CABvLYuBZ=U96gVpYqHnnhwBZxp8OZzH-W503MsbNd4JPqLbEpQ@mail.gmail.com>

> On 28 May 2026, at 7:19 pm, Cedric Jehasse <cedric.jehasse@gmail.com> wrote:
> 
> On Thu, May 28, 2026 at 1:52 AM Luke Howard <lukeh@padl.com> wrote:
>> Would also be good to add support for the 6341, which as 4 queues but uses the same Qav multipliers as the 6390.
> I'll have a look at the 6341

As an aside, the 6341 driver is buggy. It assumes the chip does not have a dedicated ATU FID register because it has 256 databases. This is incorrect and breaks VLAN-aware bridging. I will post a patch when the merge window reopens.

On the hardware side, the ATU hashing algorithm appears to easily collide. This causes problems with things such as installing the broadcast address into multiple FIDs, not to mention 802.1Q Dynamic Reservation Entries. The only workaround is to use the ‘direct’ algorithm which may have a performance impact (I am presuming it is a linear search; the data sheet warns against using it).

Luke

^ permalink raw reply

* [PATCH net] ipv6: Fix null-ptr-deref in fib6_nh_mtu_change().
From: xmei5 @ 2026-06-19  4:53 UTC (permalink / raw)
  To: dsahern, idosch, netdev
  Cc: davem, edumazet, pabeni, kuba, horms, bestswngs, Xiang Mei

From: Xiang Mei <xmei5@asu.edu>

fib6_nh_mtu_change() re-fetches idev via __in6_dev_get(arg->dev) and
dereferences idev->cnf.mtu6 without a NULL check. addrconf_ifdown()
clears dev->ip6_ptr with RCU_INIT_POINTER() after rt6_disable_ip() has
released tb6_lock, so the RA-driven MTU walk can observe a NULL idev and
oops. The caller rt6_mtu_change_route() guards its own __in6_dev_get(),
but this re-fetch is unguarded; nexthop-backed routes survive
addrconf_ifdown()'s flush, so the walk still reaches it after ip6_ptr is
nulled.

Return 0 when idev is NULL, matching rt6_mtu_change_route() and the
fib6_mtu() fix in commit 5ad509c1fdad ("ipv6: Fix null-ptr-deref in
fib6_mtu().").

  Oops: general protection fault, ... KASAN: null-ptr-deref in range
        [0x00000000000002a8-0x00000000000002af]
  RIP: 0010:fib6_nh_mtu_change+0x203/0x990
   rt6_mtu_change_route+0x141/0x1d0
   __fib6_clean_all+0xd0/0x160
   rt6_mtu_change+0xb4/0x100
   ndisc_router_discovery+0x24b5/0x2cb0
   icmpv6_rcv+0x12e9/0x1710
   ipv6_rcv+0x39b/0x410

Fixes: c0b220cf7d80 ("ipv6: Refactor exception functions")
Reported-by: Weiming Shi <bestswngs@gmail.com>
Assisted-by: Claude:claude-opus-4-8
Signed-off-by: Xiang Mei <xmei5@asu.edu>
---
 net/ipv6/route.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/net/ipv6/route.c b/net/ipv6/route.c
index 6361ad2fcf77..a1301334da48 100644
--- a/net/ipv6/route.c
+++ b/net/ipv6/route.c
@@ -5055,6 +5055,9 @@ static int fib6_nh_mtu_change(struct fib6_nh *nh, void *_arg)
 		struct inet6_dev *idev = __in6_dev_get(arg->dev);
 		u32 mtu = f6i->fib6_pmtu;

+		if (!idev)
+			return 0;
+
 		if (mtu >= arg->mtu ||
 		    (mtu < arg->mtu && mtu == idev->cnf.mtu6))
 			fib6_metric_set(f6i, RTAX_MTU, arg->mtu);
-- 
2.43.0

^ permalink raw reply related

* [PATCH net] octeontx2-af: Block VFs from clobbering special CGX PKIND state
From: Ratheesh Kannoth @ 2026-06-19  4:10 UTC (permalink / raw)
  To: davem, gakula, linux-kernel, netdev, sgoutham
  Cc: andrew+netdev, edumazet, kuba, pabeni, Hariprasad Kelam,
	Ratheesh Kannoth

From: Hariprasad Kelam <hkelam@marvell.com>

PF and VF NIX LFs that share a CGX LMAC reuse the same hardware PKIND
programming. When HiGig2 or EDSA parsing is enabled, a VF NIX LF alloc must
not reset the LMAC RX PKIND or default TX parse config over the PF setup.

Add cgx_get_pkind() and rvu_cgx_is_pkind_config_permitted() so VFs skip
cgx_set_pkind(), rvu_npc_set_pkind(), and NIX_AF_LFX_TX_PARSE_CFG updates
when the LMAC is using NPC_RX_HIGIG_PKIND or NPC_RX_EDSA_PKIND.

Fixes: 94d942c5fb97 ("octeontx2-af: Config pkind for CGX mapped PFs")
Cc: Geetha sowjanya <gakula@marvell.com>
Signed-off-by: Hariprasad Kelam <hkelam@marvell.com>
Signed-off-by: Ratheesh Kannoth <rkannoth@marvell.com>
---
 .../net/ethernet/marvell/octeontx2/af/cgx.c   | 12 +++++++
 .../net/ethernet/marvell/octeontx2/af/cgx.h   |  1 +
 .../net/ethernet/marvell/octeontx2/af/rvu.h   |  1 +
 .../ethernet/marvell/octeontx2/af/rvu_cgx.c   | 32 +++++++++++++++++++
 .../ethernet/marvell/octeontx2/af/rvu_nix.c   | 13 +++++---
 5 files changed, 55 insertions(+), 4 deletions(-)

diff --git a/drivers/net/ethernet/marvell/octeontx2/af/cgx.c b/drivers/net/ethernet/marvell/octeontx2/af/cgx.c
index 2e94d5105016..f5fd6138c352 100644
--- a/drivers/net/ethernet/marvell/octeontx2/af/cgx.c
+++ b/drivers/net/ethernet/marvell/octeontx2/af/cgx.c
@@ -518,6 +518,18 @@ int cgx_set_pkind(void *cgxd, u8 lmac_id, int pkind)
 	return 0;
 }
 
+int cgx_get_pkind(void *cgxd, u8 lmac_id, int *pkind)
+{
+	struct cgx *cgx = cgxd;
+
+	if (!is_lmac_valid(cgx, lmac_id))
+		return -ENODEV;
+
+	*pkind = cgx_read(cgx, lmac_id, cgx->mac_ops->rxid_map_offset);
+	*pkind = *pkind & 0x3F;
+	return 0;
+}
+
 static u8 cgx_get_lmac_type(void *cgxd, int lmac_id)
 {
 	struct cgx *cgx = cgxd;
diff --git a/drivers/net/ethernet/marvell/octeontx2/af/cgx.h b/drivers/net/ethernet/marvell/octeontx2/af/cgx.h
index 92ccf343dfe0..8411a75dd723 100644
--- a/drivers/net/ethernet/marvell/octeontx2/af/cgx.h
+++ b/drivers/net/ethernet/marvell/octeontx2/af/cgx.h
@@ -141,6 +141,7 @@ int cgx_get_cgxid(void *cgxd);
 int cgx_get_lmac_cnt(void *cgxd);
 void *cgx_get_pdata(int cgx_id);
 int cgx_set_pkind(void *cgxd, u8 lmac_id, int pkind);
+int cgx_get_pkind(void *cgxd, u8 lmac_id, int *pkind);
 int cgx_lmac_evh_register(struct cgx_event_cb *cb, void *cgxd, int lmac_id);
 int cgx_lmac_evh_unregister(void *cgxd, int lmac_id);
 int cgx_get_tx_stats(void *cgxd, int lmac_id, int idx, u64 *tx_stat);
diff --git a/drivers/net/ethernet/marvell/octeontx2/af/rvu.h b/drivers/net/ethernet/marvell/octeontx2/af/rvu.h
index 7f3505ae6860..bb671e2150aa 100644
--- a/drivers/net/ethernet/marvell/octeontx2/af/rvu.h
+++ b/drivers/net/ethernet/marvell/octeontx2/af/rvu.h
@@ -1115,6 +1115,7 @@ void npc_read_mcam_entry(struct rvu *rvu, struct npc_mcam *mcam,
 			 u8 *intf, u8 *ena);
 int npc_config_cntr_default_entries(struct rvu *rvu, bool enable);
 bool is_cgx_config_permitted(struct rvu *rvu, u16 pcifunc);
+bool rvu_cgx_is_pkind_config_permitted(struct rvu *rvu, u16 pcifunc);
 bool is_mac_feature_supported(struct rvu *rvu, int pf, int feature);
 u32  rvu_cgx_get_fifolen(struct rvu *rvu);
 void *rvu_first_cgx_pdata(struct rvu *rvu);
diff --git a/drivers/net/ethernet/marvell/octeontx2/af/rvu_cgx.c b/drivers/net/ethernet/marvell/octeontx2/af/rvu_cgx.c
index 4ff3935ed3fe..2be1da3476ac 100644
--- a/drivers/net/ethernet/marvell/octeontx2/af/rvu_cgx.c
+++ b/drivers/net/ethernet/marvell/octeontx2/af/rvu_cgx.c
@@ -1355,3 +1355,35 @@ void rvu_mac_reset(struct rvu *rvu, u16 pcifunc)
 	if (mac_ops->mac_reset(cgxd, lmac, !is_vf(pcifunc)))
 		dev_err(rvu->dev, "Failed to reset MAC\n");
 }
+
+/* Do not allow CGX-mapped VFs to overwrite PKIND when special parse kinds
+ * (HiGig, EDSA, etc.) are in use on the shared LMAC.
+ */
+bool rvu_cgx_is_pkind_config_permitted(struct rvu *rvu, u16 pcifunc)
+{
+	int pf, err, rxpkind;
+	u8 cgx_id, lmac_id;
+	void *cgxd;
+
+	pf = rvu_get_pf(rvu->pdev, pcifunc);
+
+	if (!(pcifunc & RVU_PFVF_FUNC_MASK))
+		return true;
+
+	if (!is_pf_cgxmapped(rvu, pf))
+		return true;
+
+	rvu_get_cgx_lmac_id(rvu->pf2cgxlmac_map[pf], &cgx_id, &lmac_id);
+	cgxd = rvu_cgx_pdata(cgx_id, rvu);
+	err = cgx_get_pkind(cgxd, lmac_id, &rxpkind);
+	if (err)
+		return false;
+
+	switch (rxpkind) {
+	case NPC_RX_HIGIG_PKIND:
+	case NPC_RX_EDSA_PKIND:
+		return false;
+	default:
+		return true;
+	}
+}
diff --git a/drivers/net/ethernet/marvell/octeontx2/af/rvu_nix.c b/drivers/net/ethernet/marvell/octeontx2/af/rvu_nix.c
index d8989395e875..8c5cadd4465d 100644
--- a/drivers/net/ethernet/marvell/octeontx2/af/rvu_nix.c
+++ b/drivers/net/ethernet/marvell/octeontx2/af/rvu_nix.c
@@ -363,8 +363,11 @@ static int nix_interface_init(struct rvu *rvu, u16 pcifunc, int type, int nixlf,
 		pfvf->tx_chan_cnt = 1;
 		rsp->tx_link = cgx_id * hw->lmac_per_cgx + lmac_id;
 
-		cgx_set_pkind(rvu_cgx_pdata(cgx_id, rvu), lmac_id, pkind);
-		rvu_npc_set_pkind(rvu, pkind, pfvf);
+		if (rvu_cgx_is_pkind_config_permitted(rvu, pcifunc)) {
+			cgx_set_pkind(rvu_cgx_pdata(cgx_id, rvu), lmac_id,
+				      pkind);
+			rvu_npc_set_pkind(rvu, pkind, pfvf);
+		}
 		break;
 	case NIX_INTF_TYPE_LBK:
 		vf = (pcifunc & RVU_PFVF_FUNC_MASK) - 1;
@@ -1680,8 +1683,10 @@ int rvu_mbox_handler_nix_lf_alloc(struct rvu *rvu,
 	rvu_write64(rvu, blkaddr, NIX_AF_LFX_RX_CFG(nixlf), req->rx_cfg);
 
 	/* Configure pkind for TX parse config */
-	cfg = NPC_TX_DEF_PKIND;
-	rvu_write64(rvu, blkaddr, NIX_AF_LFX_TX_PARSE_CFG(nixlf), cfg);
+	if (rvu_cgx_is_pkind_config_permitted(rvu, pcifunc)) {
+		cfg = NPC_TX_DEF_PKIND;
+		rvu_write64(rvu, blkaddr, NIX_AF_LFX_TX_PARSE_CFG(nixlf), cfg);
+	}
 
 	if (is_rep_dev(rvu, pcifunc)) {
 		pfvf->tx_chan_base = RVU_SWITCH_LBK_CHAN;
-- 
2.43.0


^ permalink raw reply related

* Re: [PATCH net] ipv6: ndisc: fix NULL deref in accept_untracked_na()
From: Xiang Mei @ 2026-06-19  4:05 UTC (permalink / raw)
  To: Jiayuan Chen
  Cc: Weiming Shi, David S . Miller, David Ahern, Eric Dumazet,
	Jakub Kicinski, Paolo Abeni, Simon Horman, netdev, linux-kernel
In-Reply-To: <efea4e44-e9cd-444d-a648-8831f408acbf@linux.dev>

On Wed, Jun 17, 2026 at 9:09 PM Jiayuan Chen <jiayuan.chen@linux.dev> wrote:
>
>
> On 6/17/26 9:38 PM, Weiming Shi wrote:
> > On Wed Jun 17, 2026 at 4:32 PM CST, Jiayuan Chen wrote:
> >> On 6/17/26 2:55 PM, Weiming Shi wrote:
> >>> accept_untracked_na() re-fetches the inet6_dev with __in6_dev_get(dev)
> >>> and dereferences idev->cnf.accept_untracked_na without a NULL check,
> >>
> >> Does ipv6_rpl_srh_rcv have same problem?
> > Hi,
> >
> > Yes, ipv6_rpl_srh_rcv() has the same missing check. It reads
> > idev->cnf.rpl_seg_enabled right after __in6_dev_get(skb->dev) with no
> > NULL check, while seg6 and ioam6 in the same file both check it.
> >
> > But I tried to trigger it and couldn't. With a guard added as an instrument,
> > idev never came back NULL over tens of millions of RPL packets while
> > flapping the MTU, so I can't say it's actually reachable.
>
>
> Can you need to add mdelay to enlarge the race window to reproduce it?
>
The delay works; we insert one at

@@ -1078,6 +1079,7 @@ static enum skb_drop_reason ndisc_recv_na(struct
sk_buff *skb)
         */
        new_state = msg->icmph.icmp6_solicited ? NUD_REACHABLE : NUD_STALE;
        if (!neigh && lladdr && idev && READ_ONCE(idev->cnf.forwarding)) {
+               mdelay(100); /* AI-POC: widen race window vs
concurrent addrconf_ifdown() NULLing dev->ip6_ptr */
                if (accept_untracked_na(dev, saddr)) {
                        neigh = neigh_create(&nd_tbl, &msg->target, dev);
                        new_state = NUD_STALE;

And our poc triggers:
```
[    0.887292] Oops: general protection fault, probably for
non-canonical address 0xdffffc0000000068: 0000 [#1] SMP KASAN NOPTI
[    0.887791] KASAN: null-ptr-deref in range
[0x0000000000000340-0x0000000000000347]
[    0.888101] CPU: 1 UID: 0 PID: 146 Comm: exploit Not tainted 7.1.0+
#9 PREEMPTLAZY
[    0.888419] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996),
BIOS rel-1.17.0-0-gb52ca86e094d-prebuilt.qemu.org 04/01/4
[    0.888880] RIP: 0010:ndisc_recv_na+0x67b/0xf30
[    0.889072] Code: 03 80 3c 02 00 0f 85 5f 07 00 00 49 8b af c0 00
00 00 48 b8 00 00 00 00 00 fc ff df 48 8d bd 40 03 00 00 e
[    0.889829] RSP: 0018:ffff8881029084f8 EFLAGS: 00010202
[    0.890043] RAX: dffffc0000000000 RBX: 1ffff110205210a8 RCX: 0000000000000001
[    0.890340] RDX: 0000000000000068 RSI: 0000000000004d9a RDI: 0000000000000340
[    0.890633] RBP: 0000000000000000 R08: 0000000000000004 R09: ffff888102908570
[    0.890921] R10: ffff88800ee7f792 R11: ffff888102908580 R12: ffff88800ee7f780
[    0.891207] R13: ffff88800ee7f778 R14: ffff88800ee7f792 R15: ffff88800f2d6000
[    0.891511] FS:  00007b8516aec6c0(0000) GS:ffff888177d3d000(0000)
knlGS:0000000000000000
[    0.891853] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[    0.892094] CR2: 00007b8516aebb30 CR3: 000000000f164006 CR4: 0000000000772ef0
[    0.892402] PKRU: 55555554
[    0.892517] Call Trace:
[    0.892627]  <IRQ>
[    0.892718]  ? stack_trace_save+0x93/0xd0
[    0.892895]  ? __pfx_ndisc_recv_na+0x10/0x10
[    0.893087]  ? _raw_spin_unlock_irqrestore+0xe/0x40
[    0.893295]  ? stack_depot_save_flags+0x488/0x810
[    0.893506]  ? fib6_rule_lookup+0x40a/0x5a0
[    0.893689]  ? __napi_poll+0xa1/0x560
[    0.893852]  ? net_rx_action+0x401/0xd80
[    0.894025]  icmpv6_rcv+0x10a4/0x14e0
[    0.894181]  ip6_protocol_deliver_rcu+0xbd4/0x1340
[    0.894391]  ip6_input_finish+0x1d1/0x3a0
[    0.894559]  ip6_input+0x195/0x1d0
[    0.894709]  ? __pfx_ip6_input+0x10/0x10
[    0.894873]  ? __asan_memset+0x23/0x50
[    0.895034]  ip6_mc_input+0x26c/0x3a0
[    0.895188]  ipv6_rcv+0x31b/0x390
[    0.895340]  ? __pfx_ipv6_rcv+0x10/0x10
[    0.895501]  ? get_stack_info+0x2f/0x90
[    0.895665]  ? stack_access_ok+0xde/0x200
[    0.895833]  ? get_stack_info_noinstr+0x18/0x110
[    0.896036]  __netif_receive_skb_core.constprop.0+0xd78/0x2f00
[    0.896286]  ? common_startup_64+0x13e/0x151
[    0.896475]  ? common_startup_64+0x13e/0x151
[    0.896655]  ? __pfx_stack_trace_consume_entry+0x10/0x10
[    0.896875]  ? arch_stack_walk+0x99/0x100
[    0.897049]  ? __pfx___netif_receive_skb_core.constprop.0+0x10/0x10
[    0.897310]  ? common_startup_64+0x13e/0x151
[    0.897498]  ? stack_trace_save+0x93/0xd0
[    0.897668]  ? __pfx_stack_trace_save+0x10/0x10
[    0.897855]  ? stack_depot_save_flags+0x29/0x810
[    0.898046]  ? kasan_save_stack+0x33/0x60
[    0.898213]  ? kasan_save_track+0x14/0x30
[    0.898389]  ? kasan_save_free_info+0x3b/0x60
...
```

Weiming's patch works with a delay. The poc doesn't crash the kernel.
I noticed that you mentioned ipv6_rpl_srh_rcv. We actually have sent
one, but we later found that someone (Andrea Mayer) had sent the issue
earlier, so we just let them fix it:

https://lore.kernel.org/all/20260521200859.816b8923b5f27bba6124461e@uniroma2.it/

Xiang
> I believe we need more precise traffic and timing control, instead of
> aggressively ramping up traffic and load in an attempt to reproduce the
> issue.
>

^ permalink raw reply

* Re: [PATCH stable 6.6.y v3 1/4] bpf: Track equal scalars history on per-instruction level
From: Zhenzhong Wu @ 2026-06-19  4:01 UTC (permalink / raw)
  To: Shung-Hsi Yu
  Cc: bpf, netdev, linux-kernel, ast, daniel, john.fastabend, andrii,
	martin.lau, song, yonghong.song, kpsingh, haoluo, jolsa,
	menglong8.dong, eddyz87, stable, mykolal, tamird, Hao Sun
In-Reply-To: <ajDiLjjSYPp5p7KF@u94a>

On second thought, I'll drop the cnt == 1 reset (and the matching cnt > 0)
and follow upstream's cnt > 1 in v4. It gives no benefit over upstream, it
only adds the corner cases the bot flagged. And The other two points the
bot raised (clearing id on overflow, collecting src/dst twice) are upstream
behaviour, so I will keep them as-is.Thanks for the pointer.



On Tue, Jun 16, 2026 at 1:51 PM Shung-Hsi Yu <shung-hsi.yu@suse.com> wrote:
>
> On Mon, Jun 15, 2026 at 12:58:38AM +0800, Zhenzhong Wu wrote:
> [...]
> > +/* For all R being scalar registers or spilled scalar registers
> > + * in verifier state, save R in linked_regs if R->id == id.
> > + * If there are too many Rs sharing same id, reset id for leftover Rs.
> > + */
> > +static void collect_linked_regs(struct bpf_verifier_state *vstate, u32 id,
> > +                             struct linked_regs *linked_regs)
> > +{
> > +     struct bpf_func_state *func;
> >       struct bpf_reg_state *reg;
> > +     int i, j;
> >
> > -     bpf_for_each_reg_in_vstate(vstate, state, reg, ({
> > -             if (reg->type == SCALAR_VALUE && reg->id == known_reg->id) {
> > +     for (i = vstate->curframe; i >= 0; i--) {
> > +             func = vstate->frame[i];
> > +             for (j = 0; j < BPF_REG_FP; j++) {
> > +                     reg = &func->regs[j];
> > +                     __collect_linked_regs(linked_regs, reg, id, i, j, true);
> > +             }
> > +             for (j = 0; j < func->allocated_stack / BPF_REG_SIZE; j++) {
> > +                     if (!is_spilled_reg(&func->stack[j]))
> > +                             continue;
> > +                     reg = &func->stack[j].spilled_ptr;
> > +                     __collect_linked_regs(linked_regs, reg, id, i, j, false);
> > +             }
> > +     }
> > +
> > +     if (linked_regs->cnt == 1)
> > +             linked_regs->cnt = 0;
>
> This part seems new, not found on the original commit, and also not in
> bpf-next. Can you add some more explaining (in the notes before your
> signed-off-by) regarding why this is needed?
>
> > +}
> [...]
> > @@ -14704,6 +14899,21 @@ static int check_cond_jmp_op(struct bpf_verifier_env *env,
> >               return 0;
> >       }
> >
> > +     /* Push scalar registers sharing same ID to jump history,
> > +      * do this before creating 'other_branch', so that both
> > +      * 'this_branch' and 'other_branch' share this history
> > +      * if parent state is created.
> > +      */
> > +     if (BPF_SRC(insn->code) == BPF_X && src_reg->type == SCALAR_VALUE && src_reg->id)
> > +             collect_linked_regs(this_branch, src_reg->id, &linked_regs);
> > +     if (dst_reg->type == SCALAR_VALUE && dst_reg->id)
> > +             collect_linked_regs(this_branch, dst_reg->id, &linked_regs);
> > +     if (linked_regs.cnt > 0) {
>
> Same here, the original commit and bpf-next has the '> 1' conditional,
> where as your has '> 0'. Can you also added some explanation on this
> part?
>
> > +             err = push_jmp_history(env, this_branch, 0, linked_regs_pack(&linked_regs));
> > +             if (err)
> > +                     return err;
> > +     }
> > +
> ...

^ permalink raw reply

* Re: [PATCH stable 6.6.y v3 0/4] bpf: linked scalar precision fixes
From: Zhenzhong Wu @ 2026-06-19  4:01 UTC (permalink / raw)
  To: Shung-Hsi Yu
  Cc: Sasha Levin, Paul Chaignon, bpf, netdev, linux-kernel, ast,
	daniel, john.fastabend, andrii, martin.lau, song, yonghong.song,
	kpsingh, haoluo, jolsa, menglong8.dong, eddyz87, stable, mykolal,
	tamird
In-Reply-To: <ajC3d44N4s0sdEBB@u94a>

Sorry for the late reply.

For v3, I only ran the targeted test_progs coverage and missed the legacy
test_verifier coverage, so I did not catch the expectation mismatch in
precise.c. Thanks for the detailed analysis. I have rechecked this and will
send v4 shortly with the precise.c fix included.




On Tue, Jun 16, 2026 at 1:22 PM Shung-Hsi Yu <shung-hsi.yu@suse.com> wrote:
>
> On Tue, Jun 16, 2026 at 12:51:34AM +0200, Paul Chaignon wrote:
> > On Mon, Jun 15, 2026 at 12:58:37AM +0800, Zhenzhong Wu wrote:
> > > Hi,
> > >
> > > This v3 targets 6.6.y and changes the backport strategy based on review
> > > feedback on v2.
> >
> > [...]
> >
> > > Relevant QEMU selftest results on 6.6.y with this backport:
> > >
> > >   verifier_scalar_ids passed all 18 subtests, including the newly
> > >   backported linked-scalar precision tests and the related
> > >   check_ids_in_regsafe tests.
> >
> > The first patch in this backport series is actually breaking the
> > "precise: test 1" selftest from test_verifier. You can see the full
> > error at [1]. I haven't yet checked if it's the test or the backport
> > that needs to be adjusted.
>
> I had a quick look, and believe it was that test that needs to be
> adjusted to include r9 into the precise register set.
>
> So unless Sasha have other preference, I suggest Zhenzhong send a v4,
> with changes to tools/testing/selftests/bpf/verifier/precise.c
> (including "r9" the the expected verifier output) merged into "bpf:
> Track equal scalars history on per-instruction level".
>
> ---
>
> The program under test is:
>
>   00: BPF_MOV64_IMM(BPF_REG_0, 1),
>   01: BPF_LD_MAP_FD(BPF_REG_6, 0),
>   03: BPF_MOV64_REG(BPF_REG_1, BPF_REG_6),
>   04: BPF_MOV64_REG(BPF_REG_2, BPF_REG_FP),
>   05: BPF_ALU64_IMM(BPF_ADD, BPF_REG_2, -8),
>   06: BPF_ST_MEM(BPF_DW, BPF_REG_FP, -8, 0),
>   07: BPF_EMIT_CALL(BPF_FUNC_map_lookup_elem),
>   08: BPF_JMP_IMM(BPF_JNE, BPF_REG_0, 0, 1),
>   09: BPF_EXIT_INSN(),
>
>   10: BPF_MOV64_REG(BPF_REG_9, BPF_REG_0),
>
>   11: BPF_MOV64_REG(BPF_REG_1, BPF_REG_6),
>   12: BPF_MOV64_REG(BPF_REG_2, BPF_REG_FP),
>   13: BPF_ALU64_IMM(BPF_ADD, BPF_REG_2, -8),
>   14: BPF_EMIT_CALL(BPF_FUNC_map_lookup_elem),
>   15: BPF_JMP_IMM(BPF_JNE, BPF_REG_0, 0, 1),
>   16: BPF_EXIT_INSN(),
>
>   17: BPF_MOV64_REG(BPF_REG_8, BPF_REG_0),
>
>   18: BPF_ALU64_REG(BPF_SUB, BPF_REG_9, BPF_REG_8), /* map_value_ptr -= map_value_ptr */
>   19: BPF_MOV64_REG(BPF_REG_2, BPF_REG_9),
>   20: BPF_JMP_IMM(BPF_JLT, BPF_REG_2, 8, 1),
>   21: BPF_EXIT_INSN(),
>
>   22: BPF_ALU64_IMM(BPF_ADD, BPF_REG_2, 1), /* R2=scalar(umin=1, umax=8) */
>   23: BPF_MOV64_REG(BPF_REG_1, BPF_REG_FP),
>   24: BPF_ALU64_IMM(BPF_ADD, BPF_REG_1, -8),
>   25: BPF_MOV64_IMM(BPF_REG_3, 0),
>   26: BPF_EMIT_CALL(BPF_FUNC_probe_read_kernel),
>   27: BPF_EXIT_INSN(),
>
> The test was expecting the following line in the verifier log that was
> shown during the backtracking start at instruction 26 (call
> bpf_probe_read_kernel#113)
>
>   mark_precise: frame0: regs=r2 stack= before 20: (a5) if r2 < 0x8 goto pc+1
>   mark_precise: frame0: parent state regs=r2 stack=: ...
>   mark_precise: frame0: last_idx 19 first_idx 10 ...
>
> But after applying the patchset, we now got an additional register r9 in
> the precise set:
>
>   mark_precise: frame0: regs=r2 stack= before 20: (a5) if r2 < 0x8 goto pc+1
>   mark_precise: frame0: parent state regs=r2,r9 stack=: ....
>   mark_precise: frame0: last_idx 19 first_idx 10 ...
>
> The additional r9 in the precise set seems actually correct, this is
> because r2 and r9 share the same scalar ID at instruction 20 (before the
> link got broken in instruction 21), and hence at that point, both
> register should be marked as precise.
>
> ---
>
> In upstream the test already has the expected verifier log to include
> r9, and hence no failure, but it simply comes from the fact that r2 and
> r9 maintain a link even after instruction 22 (r2 += 1).
>
>   commit 98d7ca374ba4b39e7535613d40e159f09ca14da2
>   Author: Alexei Starovoitov <ast@kernel.org>
>   Date:   Wed Jun 12 18:38:13 2024 -0700
>
>       bpf: Track delta between "linked" registers.
>   ...
>   --- a/tools/testing/selftests/bpf/verifier/precise.c
>   +++ b/tools/testing/selftests/bpf/verifier/precise.c
>   @@ -39,12 +39,12 @@
>         .result = VERBOSE_ACCEPT,
>         .errstr =
>         "mark_precise: frame0: last_idx 26 first_idx 20\
>   -     mark_precise: frame0: regs=r2 stack= before 25\
>   -     mark_precise: frame0: regs=r2 stack= before 24\
>   -     mark_precise: frame0: regs=r2 stack= before 23\
>   -     mark_precise: frame0: regs=r2 stack= before 22\
>   -     mark_precise: frame0: regs=r2 stack= before 20\
>   -     mark_precise: frame0: parent state regs=r2 stack=:\
>   +     mark_precise: frame0: regs=r2,r9 stack= before 25\
>   +     mark_precise: frame0: regs=r2,r9 stack= before 24\
>   +     mark_precise: frame0: regs=r2,r9 stack= before 23\
>   +     mark_precise: frame0: regs=r2,r9 stack= before 22\
>   +     mark_precise: frame0: regs=r2,r9 stack= before 20\
>   +     mark_precise: frame0: parent state regs=r2,r9 stack=:\
>         mark_precise: frame0: last_idx 19 first_idx 10\
>         mark_precise: frame0: regs=r2,r9 stack= before 19\
>         mark_precise: frame0: regs=r9 stack= before 18\
>   ...
>
> ---
>
> Full test log below
>
>   #492/p precise: test 1 FAIL
>   Unexpected verifier log!
>   EXP: mark_precise: frame0: parent state regs=r2 stack=:
>   RES:
>   func#0 @0
>   0: R1=ctx(off=0,imm=0) R10=fp0
>   0: (b7) r0 = 1                        ; R0_w=1
>   1: (18) r6 = 0xffff9eb644619000       ; R6_w=map_ptr(off=0,ks=4,vs=48,imm=0)
>   3: (bf) r1 = r6                       ; R1_w=map_ptr(off=0,ks=4,vs=48,imm=0) R6_w=map_ptr(off=0,ks=4,vs=48,imm=0)
>   4: (bf) r2 = r10                      ; R2_w=fp0 R10=fp0
>   5: (07) r2 += -8                      ; R2_w=fp-8
>   6: (7a) *(u64 *)(r10 -8) = 0          ; R10=fp0 fp-8_w=00000000
>   7: (85) call bpf_map_lookup_elem#1    ; R0_w=map_value_or_null(id=1,off=0,ks=4,vs=48,imm=0)
>   8: (55) if r0 != 0x0 goto pc+1        ; R0_w=0
>   9: (95) exit
>
>   from 8 to 10: R0=map_value(off=0,ks=4,vs=48,imm=0) R6=map_ptr(off=0,ks=4,vs=48,imm=0) R10=fp0 fp-8=0000mmmm
>   10: R0=map_value(off=0,ks=4,vs=48,imm=0) R6=map_ptr(off=0,ks=4,vs=48,imm=0) R10=fp0 fp-8=0000mmmm
>   10: (bf) r9 = r0                      ; R0=map_value(off=0,ks=4,vs=48,imm=0) R9_w=map_value(off=0,ks=4,vs=48,imm=0)
>   11: (bf) r1 = r6                      ; R1_w=map_ptr(off=0,ks=4,vs=48,imm=0) R6=map_ptr(off=0,ks=4,vs=48,imm=0)
>   12: (bf) r2 = r10                     ; R2_w=fp0 R10=fp0
>   13: (07) r2 += -8                     ; R2_w=fp-8
>   14: (85) call bpf_map_lookup_elem#1   ; R0_w=map_value_or_null(id=2,off=0,ks=4,vs=48,imm=0)
>   15: (55) if r0 != 0x0 goto pc+1       ; R0_w=0
>   16: (95) exit
>
>   from 15 to 17: R0_w=map_value(off=0,ks=4,vs=48,imm=0) R6=map_ptr(off=0,ks=4,vs=48,imm=0) R9_w=map_value(off=0,ks=4,vs=48,imm=0) R10=fp0 fp-8=0000mmmm
>   17: R0_w=map_value(off=0,ks=4,vs=48,imm=0) R6=map_ptr(off=0,ks=4,vs=48,imm=0) R9_w=map_value(off=0,ks=4,vs=48,imm=0) R10=fp0 fp-8=0000mmmm
>   17: (bf) r8 = r0                      ; R0_w=map_value(off=0,ks=4,vs=48,imm=0) R8_w=map_value(off=0,ks=4,vs=48,imm=0)
>   18: (1f) r9 -= r8                     ; R8_w=map_value(off=0,ks=4,vs=48,imm=0) R9_w=scalar()
>   19: (bf) r2 = r9                      ; R2=scalar(id=3) R9=scalar(id=3)
>   20: (a5) if r2 < 0x8 goto pc+1        ; R2=scalar(id=3,umin=8)
>   21: (95) exit
>
>   from 20 to 22: R0=map_value(off=0,ks=4,vs=48,imm=0) R2=scalar(id=3,umax=7,var_off=(0x0; 0x7)) R6=map_ptr(off=0,ks=4,vs=48,imm=0) R8=map_value(off=0,ks=4,vs=48,imm=0) R9=scalar(id=3,umax=7,var_off=(0x0; 0x7)) R10=fp0 fp-8=0000mmmm
>   22: R0=map_value(off=0,ks=4,vs=48,imm=0) R2=scalar(id=3,umax=7,var_off=(0x0; 0x7)) R6=map_ptr(off=0,ks=4,vs=48,imm=0) R8=map_value(off=0,ks=4,vs=48,imm=0) R9=scalar(id=3,umax=7,var_off=(0x0; 0x7)) R10=fp0 fp-8=0000mmmm
>   22: (07) r2 += 1                      ; R2_w=scalar(umin=1,umax=8,var_off=(0x0; 0xf))
>   23: (bf) r1 = r10                     ; R1_w=fp0 R10=fp0
>   24: (07) r1 += -8                     ; R1_w=fp-8
>   25: (b7) r3 = 0                       ; R3_w=0
>   26: (85) call bpf_probe_read_kernel#113
>   mark_precise: frame0: last_idx 26 first_idx 20 subseq_idx -1
>   mark_precise: frame0: regs=r2 stack= before 25: (b7) r3 = 0
>   mark_precise: frame0: regs=r2 stack= before 24: (07) r1 += -8
>   mark_precise: frame0: regs=r2 stack= before 23: (bf) r1 = r10
>   mark_precise: frame0: regs=r2 stack= before 22: (07) r2 += 1
>   mark_precise: frame0: regs=r2 stack= before 20: (a5) if r2 < 0x8 goto pc+1
>   mark_precise: frame0: parent state regs=r2,r9 stack=:  R0_rw=map_value(off=0,ks=4,vs=48,imm=0) R2_rw=Pscalar(id=3) R6=map_ptr(off=0,ks=4,vs=48,imm=0) R8_w=map_value(off=0,ks=4,vs=48,imm=0) R9_w=Pscalar(id=3) R10=fp0 fp-8_r=0000mmmm
>   mark_precise: frame0: last_idx 19 first_idx 10 subseq_idx 20
>   mark_precise: frame0: regs=r2,r9 stack= before 19: (bf) r2 = r9
>   mark_precise: frame0: regs=r9 stack= before 18: (1f) r9 -= r8
>   mark_precise: frame0: regs=r8,r9 stack= before 17: (bf) r8 = r0
>   mark_precise: frame0: regs=r0,r9 stack= before 15: (55) if r0 != 0x0 goto pc+1
>   mark_precise: frame0: regs=r0,r9 stack= before 14: (85) call bpf_map_lookup_elem#1
>   mark_precise: frame0: regs=r9 stack= before 13: (07) r2 += -8
>   mark_precise: frame0: regs=r9 stack= before 12: (bf) r2 = r10
>   mark_precise: frame0: regs=r9 stack= before 11: (bf) r1 = r6
>   mark_precise: frame0: regs=r9 stack= before 10: (bf) r9 = r0
>   mark_precise: frame0: parent state regs= stack=:  R0_rw=map_value(off=0,ks=4,vs=48,imm=0) R6_rw=map_ptr(off=0,ks=4,vs=48,imm=0) R10=fp0 fp-8_rw=0000mmmm
>   27: R0_w=scalar()
>   27: (95) exit
>   processed 27 insns (limit 1000000) max_states_per_insn 0 total_states 2 peak_states 2 mark_read 1
>
> [...]

^ permalink raw reply

* [PATCH net 2/2] net: dsa: mxl862xx: fix use-after-free of DSA ports in crc_err_work
From: Daniel Golle @ 2026-06-19  3:40 UTC (permalink / raw)
  To: Daniel Golle, Andrew Lunn, Vladimir Oltean, David S. Miller,
	Eric Dumazet, Jakub Kicinski, Paolo Abeni, netdev, linux-kernel

Upon an MDIO CRC error mxl862xx_crc_err_work_fn() walks the DSA ports
and closes the CPU port conduits:

	dsa_switch_for_each_cpu_port(dp, priv->ds)
		dev_close(dp->conduit);

mxl862xx_remove() unregisters the switch before cancelling this work:

	set_bit(MXL862XX_FLAG_WORK_STOPPED, &priv->flags);
	cancel_delayed_work_sync(&priv->stats_work);
	dsa_unregister_switch(ds);
	mxl862xx_host_shutdown(priv);

dsa_unregister_switch() frees the dsa_port objects. If a CRC error
schedules the work during teardown it can run after the ports have been
freed and dereference freed memory.

Guard the port walk with MXL862XX_FLAG_WORK_STOPPED, which is already set
before dsa_unregister_switch(). DSA tears the ports down under
rtnl_lock(), so checking the flag under rtnl_lock() means the work either
runs before teardown and sees valid ports, or runs afterwards, observes
the flag and skips the walk. This mirrors the host_flood_work handler,
which skips torn-down ports under rtnl_lock().

Link: https://sashiko.dev/#/patchset/cover.1780968180.git.daniel%40makrotopia.org?part=2
Fixes: a319d0c8c8ce ("net: dsa: mxl862xx: add CRC for MDIO communication")
Signed-off-by: Daniel Golle <daniel@makrotopia.org>
---
 drivers/net/dsa/mxl862xx/mxl862xx-host.c | 11 ++++++-----
 1 file changed, 6 insertions(+), 5 deletions(-)

diff --git a/drivers/net/dsa/mxl862xx/mxl862xx-host.c b/drivers/net/dsa/mxl862xx/mxl862xx-host.c
index 882c5d960941..4acd216f7cc0 100644
--- a/drivers/net/dsa/mxl862xx/mxl862xx-host.c
+++ b/drivers/net/dsa/mxl862xx/mxl862xx-host.c
@@ -41,12 +41,13 @@ static void mxl862xx_crc_err_work_fn(struct work_struct *work)
 						  crc_err_work);
 	struct dsa_port *dp;
 
-	dev_warn(&priv->mdiodev->dev,
-		 "MDIO CRC error detected, shutting down all ports\n");
-
 	rtnl_lock();
-	dsa_switch_for_each_cpu_port(dp, priv->ds)
-		dev_close(dp->conduit);
+	if (!test_bit(MXL862XX_FLAG_WORK_STOPPED, &priv->flags)) {
+		dev_warn(&priv->mdiodev->dev,
+			 "MDIO CRC error detected, shutting down all ports\n");
+		dsa_switch_for_each_cpu_port(dp, priv->ds)
+			dev_close(dp->conduit);
+	}
 	rtnl_unlock();
 
 	clear_bit(MXL862XX_FLAG_CRC_ERR, &priv->flags);
-- 
2.54.0

^ permalink raw reply related

* [PATCH net 1/2] net: dsa: mxl862xx: avoid unaligned 16-bit access in api_wrap
From: Daniel Golle @ 2026-06-19  3:39 UTC (permalink / raw)
  To: Daniel Golle, Andrew Lunn, Vladimir Oltean, David S. Miller,
	Eric Dumazet, Jakub Kicinski, Paolo Abeni, netdev, linux-kernel

The MXL862XX_API_* macros pass the address of a stack-allocated, __packed
firmware-ABI struct to mxl862xx_api_wrap() as a void *. The struct has an
alignment of 1, so the compiler is free to place it at an odd address.

mxl862xx_api_wrap() reinterprets that buffer as a __le16 * and accesses it
with data[i], for which the compiler assumes the natural 2-byte alignment
of __le16 and emits aligned 16-bit loads/stores (e.g. lhu/sh on MIPS).
When the buffer lands on an odd address these fault on architectures that
do not support unaligned access, such as MIPS32.

-Waddress-of-packed-member does not catch this: the packed origin is
laundered through the void * parameter, so the cast inside api_wrap looks
alignment-safe to the compiler and no warning is emitted.

Use get_unaligned_le16()/put_unaligned_le16() for the three 16-bit word
accesses. The byte accesses (*(u8 *)&data[i], crc16()) are already safe
and are left unchanged.

Link: https://sashiko.dev/#/patchset/cover.1781319534.git.daniel%40makrotopia.org?part=4
Fixes: 23794bec1cb6 ("net: dsa: add basic initial driver for MxL862xx switches")
Signed-off-by: Daniel Golle <daniel@makrotopia.org>
---
 drivers/net/dsa/mxl862xx/mxl862xx-host.c | 7 ++++---
 1 file changed, 4 insertions(+), 3 deletions(-)

diff --git a/drivers/net/dsa/mxl862xx/mxl862xx-host.c b/drivers/net/dsa/mxl862xx/mxl862xx-host.c
index d55f9dff6433..882c5d960941 100644
--- a/drivers/net/dsa/mxl862xx/mxl862xx-host.c
+++ b/drivers/net/dsa/mxl862xx/mxl862xx-host.c
@@ -12,6 +12,7 @@
 #include <linux/crc16.h>
 #include <linux/iopoll.h>
 #include <linux/limits.h>
+#include <linux/unaligned.h>
 #include <net/dsa.h>
 #include "mxl862xx.h"
 #include "mxl862xx-host.h"
@@ -349,7 +350,7 @@ int mxl862xx_api_wrap(struct mxl862xx_priv *priv, u16 cmd, void *_data,
 	 * zero words individually.
 	 */
 	for (i = 0, zeros = 0; i < size / 2 && zeros < RST_DATA_THRESHOLD; i++)
-		if (!data[i])
+		if (!get_unaligned_le16(&data[i]))
 			zeros++;
 
 	if (zeros < RST_DATA_THRESHOLD && (size & 1) && !*(u8 *)&data[i])
@@ -395,7 +396,7 @@ int mxl862xx_api_wrap(struct mxl862xx_priv *priv, u16 cmd, void *_data,
 			 */
 			val = *(u8 *)&data[i] | ((crc & 0xff) << 8);
 		} else {
-			val = le16_to_cpu(data[i]);
+			val = get_unaligned_le16(&data[i]);
 		}
 
 		/* After RST_DATA, skip zero data words as the registers
@@ -453,7 +454,7 @@ int mxl862xx_api_wrap(struct mxl862xx_priv *priv, u16 cmd, void *_data,
 			*(uint8_t *)&data[i] = ret & 0xff;
 			crc = (ret >> 8) & 0xff;
 		} else {
-			data[i] = cpu_to_le16((u16)ret);
+			put_unaligned_le16((u16)ret, &data[i]);
 		}
 	}
 
-- 
2.54.0

^ permalink raw reply related

* Re: [PATCH v3 2/3] net/smc: bound the receive length to the RMB in smc_rx_recvmsg()
From: Dust Li @ 2026-06-19  3:31 UTC (permalink / raw)
  To: Bryam Vargas
  Cc: Wenjia Zhang, D . Wythe, Sidraya Jayagond, Eric Dumazet,
	David S . Miller, Mahanta Jambigi, Wen Gu, Simon Horman,
	Ursula Braun, Stefan Raspl, Tony Lu, Paolo Abeni, Jakub Kicinski,
	netdev, linux-s390, linux-rdma, linux-kernel
In-Reply-To: <20260618221106.236699-1-hexlabsecurity@proton.me>

On 2026-06-18 22:11:12, Bryam Vargas wrote:
>On Fri, 19 Jun 2026 00:03:17 +0800, Dust Li wrote:
>> Once we validate the CDC message at the input boundary (as in the
>> previous patch), bytes_to_rcv can never exceed rmb_desc->len, so
>> this check becomes unreachable. So I don't think this patch is needed.
>
>This one I'd actually like to keep, and let me walk through why -- I don't think the
>boundary check closes it.
>
>bytes_to_rcv isn't set to a cursor count, it's a running accumulator:
>smc_cdc_msg_recv_action does atomic_add(diff_prod, &bytes_to_rcv), where
>diff_prod = smc_curs_diff(rmb_desc->len, old, new). So bounding each cursor's count at
>the boundary doesn't bound the sum of the deltas.
>
>The differing-wrap branch of smc_curs_diff returns (len - old.count) + new.count,
>which is up to 2*len-1 even when both cursors pass count <= len. With len=16, a prod
>going (0,0) -> (1,15) gives diff=31, so bytes_to_rcv is already 31 > len after one
>message; alternating wrap 0<->1 at count=15 keeps adding ~len and eventually wraps the
>atomic_t negative. I have an A/B for this -- happy to send it along.

Glad to see you A/B test, I think we can decide after we see the real
issue.

>
>So to make this truly unreachable from the boundary check, we'd need to bound
>prod - cons <= len there, not just the absolute count. The consumer-side clamp is two
>lines and race-free against the tasklet, so my preference would be to keep it as a
>backstop -- but if you'd rather fold it into a stronger boundary check instead, I'm
>open to that.

Another thing I'd worry about is if this really happens, should we also
abort the connection like what we did in patch #1 ?

Best regards,
Dust


^ permalink raw reply

* Re: [PATCH v3 1/3] net/smc: bound the wire-controlled producer cursor to the RMB
From: Dust Li @ 2026-06-19  3:26 UTC (permalink / raw)
  To: Bryam Vargas
  Cc: Wenjia Zhang, D . Wythe, Sidraya Jayagond, Eric Dumazet,
	David S . Miller, Mahanta Jambigi, Wen Gu, Simon Horman,
	Ursula Braun, Stefan Raspl, Tony Lu, Paolo Abeni, Jakub Kicinski,
	netdev, linux-s390, linux-rdma, linux-kernel
In-Reply-To: <20260618221057.236673-1-hexlabsecurity@proton.me>

On 2026-06-18 22:11:05, Bryam Vargas wrote:
>On Thu, 18 Jun 2026 22:29:20 +0800, Dust Li wrote:
>> once we detect that the peer is misbehaving, I think the right action is
>> to abort the connection and record the event, rather than silently clamp.
>[...]
>>         u32 prod_count = ntohs(cdc->prod.count);
>> ...
>>             cdc->prod.wrap > 1 || cdc->cons.wrap > 1) {
>
>Thanks for taking a look, Dust. I'm on board with the direction for net-next --
>aborting and recording a bad CDC is cleaner than clamping something we already know
>we can't trust, and as you say, the clamp just papers over the peer bug. So: minimal
>clamp stays for -stable, and net-next gets the wire-boundary check + abort (through
>abort_work, with an smc_stats counter and a ratelimited warn).

That's greate. Then I think we can move on in this direction.

>
>A few things I ran into on the check itself, though:
>
>- count is __be32, so it wants ntohl() rather than ntohs() -- ntohs() ends up reading
>  the wrong half.

Right

>
>- I'd drop the wrap > 1 tests. wrap is a free-running counter (smc_curs_add does
>  wrap++), so a connection that legitimately wraps its RMB ends up with wrap > 1; and
>  since it's a __be16 read raw, on little-endian wrap==1 already reads as 0x0100 and
>  we'd abort on the very first wrap. I don't think there's a sane upper bound to put
>  on wrap.

Agree

>
>- the check is typed for SMC-R, but the SMC-D path hands a host-order smcd_cdc_msg to
>  smc_cdc_msg_recv() cast as smc_cdc_msg (smc_cdc.c:456), so ntohl/ntohs would
>  double-swap it there. The simplest thing I found is one check on the host cursor
>  right after smc_cdc_msg_to_host(), before the diff/atomic_add block -- that covers
>  SMC-R and SMC-D in one place.

Agree

>
>Minor: >= len rather than > len (count is an offset in [0,len)), and peer_rmbe_size
>is signed so worth guarding. The cons vs peer_rmbe_size bound looks right to me.

No problem

Best regards,
Dust


^ permalink raw reply

* net: wwan: iosm: possible out of bounds read in the MUX downlink decoder
From: Maoyi Xie @ 2026-06-19  3:05 UTC (permalink / raw)
  To: Loic Poulain, Sergey Ryazanov, Johannes Berg; +Cc: netdev, linux-kernel

Hi all,

I think mux_dl_adb_decode() in drivers/net/wwan/iosm/iosm_ipc_mux_codec.c can
read past the downlink buffer when the modem reports a large table index. I
would appreciate it if you could take a look.

The decoder takes the table offset straight from the device ADB header.

	block = skb->data;
	adbh = (struct mux_adbh *)block;
	adth_index = le32_to_cpu(adbh->first_table_index);
	if (adth_index < 1)
		goto adb_decode_err;
	...
	adth = (struct mux_adth *)(block + adth_index);

first_table_index is a device le32 and the only check is that it is not
zero. After that adth points at block + adth_index with no upper bound and
the code reads adth->table_length and the datagram table from there. The
downlink buffer is IPC_MEM_MAX_DL_MUX_LITE_BUF_SIZE, 2048 bytes, so an index
past 2048 reads past the slab object.

mux_dl_process_dg() has the same issue further down. It reads
dg->datagram_index and dg->datagram_length per entry, bounded only by the
device block_length, which is itself never checked against skb->len.

The protocol layer below only caps the total transfer at the pipe buffer
size and skb_put()s it, so skb->len is at most 2048, but none of these
in band offsets and lengths are checked against it.

The data here comes from the modem, so it is device input we should not
trust, especially with an external or compromised PCIe baseband. It fires on
a normal downlink receive once the iosm net device is up.

I reproduced the adth_index read under KASAN on 7.1-rc7. With an index that
fits the buffer the read stays inside. With an index past the buffer KASAN
reports a slab out of bounds read past the downlink buffer.

The fix I have in mind validates every device offset and length against
skb->len before use, rejecting an adth_index that leaves no room for a
struct mux_adth and a datagram that runs past the buffer.

Does this look like a real bug to you? The aggregation decoder has moved
around over the years, so I am not certain of the exact introducing commit.
1f52d7b62285 ("net: wwan: iosm: Enable M.2 7360 WWAN card support") is the
broadly cited landing point. I am happy to send a proper patch.

Thanks,
Maoyi
https://maoyixie.com/

^ permalink raw reply

* Re: [PATCH bpf v3 1/2] bpf, sockmap: fix use-after-free when the stream parser resizes the skb
From: Sechang Lim @ 2026-06-19  1:54 UTC (permalink / raw)
  To: John Fastabend; +Cc: Jiayuan Chen, netdev, bpf, linux-kernel, Jakub Kicinski
In-Reply-To: <ajQwWe4SC1M7MtJa@john-p8>

On Thu, Jun 18, 2026 at 11:01:58AM -0700, John Fastabend wrote:
>On Thu, Jun 18, 2026 at 07:56:34PM +0800, Jiayuan Chen wrote:
>>
>>On 6/18/26 6:27 PM, Sechang Lim wrote:
>>>sk_psock_strp_parse() runs the BPF_PROG_TYPE_SK_SKB stream-parser program
>>>to find the length of the next message. strparser assembles a message out
>>>of several received skbs by chaining them onto the head's frag_list and
>>>recording where to append the next one in strp->skb_nextp:
>>>
>>>	*strp->skb_nextp = skb;
>>>	strp->skb_nextp = &skb->next;
>>>
>>>and then calls the parser on the head:
>>>
>>>	len = (*strp->cb.parse_msg)(strp, head);
>>
>>[...]
>>
>>>unaffected and may still modify the skb.
>>>
>>>Fixes: 8a31db561566 ("bpf: add access to sock fields and pkt data from sk_skb programs")
>>
>>Is the Fixes tag correct ?
>>
>>Anyway, I don't think this patch is a fix; it's more of a hardening. 
>>So no Fixes tag needed, IMO.

Thanks for your review, dropping the Fixes tag in v4.

>>
>>
>>>Signed-off-by: Sechang Lim <rhkrqnwk98@gmail.com>
>>>---
>
>[...]
>
>>
>>
>>CI failed:
>>https://github.com/kernel-patches/bpf/actions/runs/27754218839/job/82113319982
>>   Failed stream parser bpf prog attach
>>
>>Hi John
>>I noticed that bpf_skb_pull_data was added to the skmsg test:
>>https://github.com/torvalds/linux/commit/82a8616889d506cb690cfc0afb2ccadda120461d
>>
>>Can we drop bpf_skb_pull_data in parser prog(sockmap_parse_prog.c‎) ?
>>And are there any scenarios where we need to modify skb len when 
>>using strparser ?
>
>We should never modify the skb from strparser. Just remove any tests
>that do this and state its not safe. We haven't used strparser progs
>for a long time anyways.

Thanks for your review. Will do in v4. 

Best,
Sechang

^ permalink raw reply

* Re: [PATCH net v4] tipc: fix slab-use-after-free Read in tipc_aead_decrypt_done
From: patchwork-bot+netdevbpf @ 2026-06-19  1:40 UTC (permalink / raw)
  To: Doruk Tan Ozturk
  Cc: jmaloy, davem, edumazet, kuba, pabeni, horms, aleksander.lobakin,
	tung.quang.nguyen, tipc-discussion, netdev, linux-kernel, stable
In-Reply-To: <20260617075818.37431-1-doruk@0sec.ai>

Hello:

This patch was applied to netdev/net.git (main)
by Jakub Kicinski <kuba@kernel.org>:

On Wed, 17 Jun 2026 09:58:18 +0200 you wrote:
> tipc_aead_decrypt() goes straight from tipc_bearer_hold(b) to
> crypto_aead_decrypt(req) without taking a reference on the netns, unlike
> the encrypt path. When crypto_aead_decrypt() is offloaded asynchronously
> (e.g. the SIMD aead wrapper queuing to cryptd), the cryptd worker runs
> tipc_aead_decrypt_done() later. If the bearer's netns is torn down in the
> meantime, cleanup_net() -> tipc_exit_net() -> tipc_crypto_stop() frees the
> per-netns tipc_crypto, and the completion then reads it:
> tipc_aead_decrypt_done() dereferences aead->crypto->stats and
> aead->crypto->net, and tipc_crypto_rcv_complete() dereferences
> aead->crypto->aead[] and the node table -- reading freed memory.
> 
> [...]

Here is the summary with links:
  - [net,v4] tipc: fix slab-use-after-free Read in tipc_aead_decrypt_done
    https://git.kernel.org/netdev/net/c/bda3348872a2

You are awesome, thank you!
-- 
Deet-doot-dot, I am a bot.
https://korg.docs.kernel.org/patchwork/pwbot.html



^ permalink raw reply

* Re: [PATCH net-next v2] net: rds: check cmsg_len before reading rds_rdma_args in size pass
From: patchwork-bot+netdevbpf @ 2026-06-19  1:40 UTC (permalink / raw)
  To: Michael Bommarito
  Cc: achender, davem, kuba, pabeni, edumazet, horms, netdev,
	linux-rdma, rds-devel, linux-kernel
In-Reply-To: <20260617023146.2780077-1-michael.bommarito@gmail.com>

Hello:

This patch was applied to netdev/net.git (main)
by Jakub Kicinski <kuba@kernel.org>:

On Tue, 16 Jun 2026 22:31:46 -0400 you wrote:
> rds_rm_size() handles RDS_CMSG_RDMA_ARGS after only CMSG_OK() and then
> calls rds_rdma_extra_size(), which reads args->local_vec_addr and
> args->nr_local without first checking that cmsg_len covers struct
> rds_rdma_args. The other two RDS_CMSG_RDMA_ARGS consumers already guard
> this: rds_rdma_bytes() in rds_sendmsg() and rds_cmsg_rdma_args() in
> rds_cmsg_send() both reject cmsg_len < CMSG_LEN(sizeof(struct
> rds_rdma_args)). Add the same check to rds_rm_size() so all three RDMA
> args passes are consistent.
> 
> [...]

Here is the summary with links:
  - [net-next,v2] net: rds: check cmsg_len before reading rds_rdma_args in size pass
    https://git.kernel.org/netdev/net/c/e5c00023270e

You are awesome, thank you!
-- 
Deet-doot-dot, I am a bot.
https://korg.docs.kernel.org/patchwork/pwbot.html



^ permalink raw reply

* Re: [PATCH net] eth: fbnic: take netif_addr_lock_bh() around rx mode address programming
From: patchwork-bot+netdevbpf @ 2026-06-19  1:40 UTC (permalink / raw)
  To: Daniel Zahka
  Cc: alexanderduyck, kuba, kernel-team, andrew+netdev, davem, edumazet,
	pabeni, sanmanpradhan, netdev, linux-kernel
In-Reply-To: <20260617-linux-fbnic-hwaddr-v1-1-3f9f5dee7f99@gmail.com>

Hello:

This patch was applied to netdev/net.git (main)
by Jakub Kicinski <kuba@kernel.org>:

On Wed, 17 Jun 2026 03:39:49 -0700 you wrote:
> When __fbnic_set_rx_mode() is called from contexts other than
> .ndo_set_rx_mode_async(), the uc and mc addr lists are accessed
> without the addr lock that __hw_addr_sync_dev() and
> __hw_addr_unsync_dev() require. Wrap these unprotected accesses with
> netif_addr_lock_bh(). fbnic_clear_rx_mode() has similar issues.
> 
> Fixes: eb690ef8d1c2 ("eth: fbnic: Add L2 address programming")
> Assisted-by: Claude:claude-opus-4-8
> Signed-off-by: Daniel Zahka <daniel.zahka@gmail.com>
> 
> [...]

Here is the summary with links:
  - [net] eth: fbnic: take netif_addr_lock_bh() around rx mode address programming
    https://git.kernel.org/netdev/net/c/96e7f9122aae

You are awesome, thank you!
-- 
Deet-doot-dot, I am a bot.
https://korg.docs.kernel.org/patchwork/pwbot.html



^ permalink raw reply

* Re: [RFC PATCH 1/2] landlock: fix TCP Fast Open connection bypass
From: Bryam Vargas @ 2026-06-19  1:39 UTC (permalink / raw)
  To: Matthieu Buffet
  Cc: Mickaël Salaün, Günther Noack, Mikhail Ivanov,
	Paul Moore, Eric Dumazet, Neal Cardwell, linux-security-module,
	netdev, linux-kernel
In-Reply-To: <e0c7b502-931e-481e-89b0-b47687d2b942@buffet.re>

Thanks, that settles it: MPTCP is out of scope by design, not a gap.

I read 854277e2cc8c ("landlock: Fix non-TCP sockets restriction"). It
changed the sock->type != SOCK_STREAM test to !sk_is_tcp(sock->sk),
dropping SMC/MPTCP/SCTP from the TCP rights on purpose, and 3d4033985ff5
pins that with a "MPTCP actions are not restricted" selftest. So my
"|| sk_protocol == IPPROTO_MPTCP" suggestion was wrong: it would revert
that decision and break the selftest. Please disregard it.

That leaves the series complete as-is on this axis. Keeping both the v0
guard and the 2/2 selftest sk_is_tcp()-only is correct, and the
Tested-by stands for the TCP and IPv6 fast-open path the patch fixes.

Bryam

^ permalink raw reply

* Re: [PATCH v2 0/1] selftests: net: fix file owner for broadcast_ether_dst test
From: patchwork-bot+netdevbpf @ 2026-06-19  1:30 UTC (permalink / raw)
  To: Ross Porter
  Cc: linux-kselftest, netdev, davem, edumazet, kuba, pabeni, horms,
	shuah, oscmaes92, bacs, linux-kernel
In-Reply-To: <20260617061039.79717-1-ross.porter@canonical.com>

Hello:

This patch was applied to netdev/net.git (main)
by Jakub Kicinski <kuba@kernel.org>:

On Wed, 17 Jun 2026 18:10:38 +1200 you wrote:
> The broadcast_ether_dst test can spuriously fail due to file
> permissions, depending on which flags tcpdump is compiled with: If
> tcpdump is compiled with the `--with-user` flag then tcpdump will
> step down to the provided user after opening the device. In this
> case the test fails due to permission issues, as the output file
> is owned by a different user.
> Debian ships tcpdump with this flag enabled, so this bug
> appears as part of our kernel testing.
> 
> [...]

Here is the summary with links:
  - [v2,1/1] selftests: net: fix file owner for broadcast_ether_dst test
    https://git.kernel.org/netdev/net/c/b8613e979200

You are awesome, thank you!
-- 
Deet-doot-dot, I am a bot.
https://korg.docs.kernel.org/patchwork/pwbot.html



^ permalink raw reply

page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox