Netdev List

* Re: [PATCH v3] flow_dissector: fix uninit-value in __skb_flow_dissect() for ETH_ADDRS
From: Jakub Kicinski @ 2026-06-16  0:28 UTC
  To: yun.zhou
  Cc: Jakub Kicinski, davem, edumazet, pabeni, horms, qingfang.deng,
	jiri, netdev, linux-kernel
In-Reply-To: <20260613113154.2550286-1-yun.zhou@windriver.com>

This is an AI-generated review of your patch. The human sending this
email has considered the AI review valid, or at least plausible.
---
flow_dissector: fix uninit-value in __skb_flow_dissect() for ETH_ADDRS

This patch addresses a KMSAN report by gating the read of eth_hdr(skb)
in __skb_flow_dissect() on skb->dev->hard_header_len being at least the
size of the ETH_ADDRS dissector key, zeroing the key otherwise. The
goal is to avoid reading uninitialized memory when a TUN device in
IFF_TUN (L3) mode passes a packet without a link-layer header to a
flower filter on a shared block.

Closes: https://syzkaller.appspot.com/bug?extid=fa2f5b1fb06147be5e16

> diff --git a/net/core/flow_dissector.c b/net/core/flow_dissector.c
> index 2a98f5fa74eb..de6ca9beee03 100644
> --- a/net/core/flow_dissector.c
> +++ b/net/core/flow_dissector.c
> @@ -1173,13 +1173,20 @@ bool __skb_flow_dissect(const struct net *net,
>  
>  	if (dissector_uses_key(flow_dissector,
>  			       FLOW_DISSECTOR_KEY_ETH_ADDRS)) {
> -		struct ethhdr *eth = eth_hdr(skb);
>  		struct flow_dissector_key_eth_addrs *key_eth_addrs;
>  
>  		key_eth_addrs = skb_flow_dissector_target(flow_dissector,
>  							  FLOW_DISSECTOR_KEY_ETH_ADDRS,
>  							  target_container);
> -		memcpy(key_eth_addrs, eth, sizeof(*key_eth_addrs));
> +		/* TC filter blocks can be shared across devices with
> +		 * different header lengths, so we cannot validate this
> +		 * when the filter is installed -- check at dissect time.
> +		 */
> +		if (skb->dev &&
> +		    skb->dev->hard_header_len >= sizeof(*key_eth_addrs))
> +			memcpy(key_eth_addrs, eth_hdr(skb), sizeof(*key_eth_addrs));

[Medium]
Is hard_header_len the right predicate to gate this read on?

The dissector key is ETH_ALEN+ETH_ALEN, so the gate accepts any device
with hard_header_len >= 12. IPoIB sets hard_header_len = IPOIB_HARD_LEN
with dev->type = ARPHRD_INFINIBAND, and ARPHRD_FDDI uses 21 bytes.
Wouldn't packets on those link types pass this check and have the first
12 bytes of a non-Ethernet L2 header copied into key_eth_addrs as if
they were the destination and source MAC addresses?

Would a check on dev->type == ARPHRD_ETHER (or dev->addr_len == ETH_ALEN)
be more appropriate, so that only true Ethernet framing is read into
the key?
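
A minimal userspace sketch of the two predicates, for comparison only.
struct mock_dev and both gate_* helpers are hypothetical stand-ins, not
kernel code; the ARPHRD_* values follow include/uapi/linux/if_arp.h, and
the IPoIB header length of 24 is an assumption.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Link-type values as defined in include/uapi/linux/if_arp.h. */
#define ARPHRD_ETHER       1
#define ARPHRD_INFINIBAND  32
#define ARPHRD_FDDI        774
#define ARPHRD_NONE        0xFFFE

#define ETH_ALEN 6

/* Hypothetical stand-in for the two net_device fields a gate could consult. */
struct mock_dev {
	unsigned short type;            /* ARPHRD_* link type */
	unsigned short hard_header_len; /* L2 header length in bytes */
};

/* Gate as written in the patch: only the header length is checked. */
static bool gate_by_header_len(const struct mock_dev *dev)
{
	return dev && dev->hard_header_len >= 2 * ETH_ALEN;
}

/* Alternative gate: accept only devices whose link type is Ethernet. */
static bool gate_by_link_type(const struct mock_dev *dev)
{
	return dev && dev->type == ARPHRD_ETHER;
}

void gate_demo(void)
{
	struct mock_dev eth  = { ARPHRD_ETHER, 14 };
	struct mock_dev fddi = { ARPHRD_FDDI, 21 };
	struct mock_dev ipoib = { ARPHRD_INFINIBAND, 24 }; /* length assumed */
	struct mock_dev tun  = { ARPHRD_NONE, 0 };

	/* Both gates agree on Ethernet and on a headerless device. */
	assert(gate_by_header_len(&eth) && gate_by_link_type(&eth));
	assert(!gate_by_header_len(&tun) && !gate_by_link_type(&tun));

	/* Only the length gate lets non-Ethernet L2 headers through. */
	assert(gate_by_header_len(&fddi) && !gate_by_link_type(&fddi));
	assert(gate_by_header_len(&ipoib) && !gate_by_link_type(&ipoib));
}
```

Calling gate_demo() exercises all four device shapes; the FDDI and IPoIB
cases are the ones where the two predicates disagree.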

[Medium]
Does this gate behave correctly across act_mirred?

act_mirred reassigns skb->dev to the mirror/redirect target before the
target device's classifier path runs. Consider a packet originally
framed on an Ethernet device, with a valid mac_header in the linear
region, redirected to a headerless device such as TUN IFF_TUN that
shares a TC block with the source. The key_eth_addrs would now be
zeroed via the else branch even though the original Ethernet header
is still present in the skb. Is that intended for shared blocks?

[Medium]
Does this still leave a path where eth_hdr(skb) reads past the buffer?

eth_hdr(skb) expands to skb->head + skb->mac_header. When mac_header is
the unset sentinel (typeof(skb->mac_header))~0U set by skb allocation
or by skb_unset_mac_header(), the dereference resolves to skb->head +
0xFFFF, well past skb_end_pointer(skb). The new gate only inspects
skb->dev->hard_header_len, which is independent of skb->mac_header, so
any skb delivered to the dissector with mac_header unset on a device
whose hard_header_len >= 12 would still read out-of-bounds bytes into
key_eth_addrs.

Would skb_mac_header_was_set(skb) plus a bounds check that the 12 bytes
lie within [skb->head, skb_tail_pointer(skb)) be a more direct match
for the property the commit message describes:

    This assumes the skb has a valid Ethernet header at mac_header,
    which is not always the case.
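
A rough userspace model of such a check, assuming a reduced struct
mock_skb that tracks only offsets from skb->head. The struct and the
helper names are hypothetical; only the 0xFFFF sentinel and the 12-byte
key size come from the discussion above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define MAC_HEADER_UNSET ((uint16_t)~0U) /* sentinel for an unset mac_header */
#define ETH_ADDRS_LEN    12              /* ETH_ALEN + ETH_ALEN */

/* Hypothetical, reduced model of the sk_buff fields involved. */
struct mock_skb {
	uint16_t mac_header; /* offset from head, or MAC_HEADER_UNSET */
	size_t   tail;       /* offset from head of the end of valid linear data */
};

static bool mac_header_was_set(const struct mock_skb *skb)
{
	return skb->mac_header != MAC_HEADER_UNSET;
}

/* True when [mac_header, mac_header + len) lies inside valid linear data. */
static bool eth_addrs_readable(const struct mock_skb *skb, size_t len)
{
	if (!mac_header_was_set(skb))
		return false;
	if (skb->mac_header > skb->tail)
		return false;
	return skb->tail - skb->mac_header >= len;
}

void eth_addrs_readable_demo(void)
{
	struct mock_skb framed    = { .mac_header = 64, .tail = 128 };
	struct mock_skb unset     = { .mac_header = MAC_HEADER_UNSET, .tail = 128 };
	struct mock_skb truncated = { .mac_header = 120, .tail = 128 };

	assert(eth_addrs_readable(&framed, ETH_ADDRS_LEN));
	assert(!eth_addrs_readable(&unset, ETH_ADDRS_LEN));     /* sentinel */
	assert(!eth_addrs_readable(&truncated, ETH_ADDRS_LEN)); /* only 8 bytes */
}
```

The subtraction is done only after confirming mac_header <= tail, so the
size comparison cannot underflow.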

> +		else
> +			memset(key_eth_addrs, 0, sizeof(*key_eth_addrs));

[Low]
The commit message says:

    If the device does not have a link-layer header large enough to
    contain the Ethernet addresses, zero the key so the filter will
    not match.

Is that wording accurate for all filter shapes?

__fl_lookup() uses key_eth_addrs as part of the rhashtable lookup key.
A flower filter installed to match src_mac 00:00:00:00:00:00 and
dst_mac 00:00:00:00:00:00 (for example a catch-all on a shared block)
would, after this change, deterministically match every packet on a
headerless device, where previously the random skb memory made such a
filter effectively never match there.

Could the commit message note this corner case, or could a sentinel
value that cannot collide with a legal user-supplied key be used
instead of all zeros?

>  	}
>  
>  	if (dissector_uses_key(flow_dissector,
-- 
pw-bot: cr

* Re: [PATCH net] netdev-genl: report NAPI thread PID in the caller's pid namespace
From: Samiullah Khawaja @ 2026-06-16  0:27 UTC
  To: Maoyi Xie
  Cc: David S . Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
	Simon Horman, Daniel Borkmann, Nikolay Aleksandrov, David Wei,
	Stanislav Fomichev, Dragos Tatulea, netdev, linux-kernel, stable
In-Reply-To: <20260615171736.1709318-1-maoyixie.tju@gmail.com>

On Tue, Jun 16, 2026 at 01:17:36AM +0800, Maoyi Xie wrote:
>netdev_nl_napi_fill_one() reports the NAPI kthread PID in NETDEV_A_NAPI_PID
>using task_pid_nr(), which returns the PID in the initial pid namespace.
>
>NETDEV_CMD_NAPI_GET does not have GENL_ADMIN_PERM and the netdev genl family
>is netnsok, so a caller in a child pid namespace can issue it. That caller
>then sees the kthread's global PID, even though the kthread is not visible
>in its pid namespace, where the value should be 0.
>
>Translate the PID through the caller's pid namespace, the same way commit
>3799c2570982 ("io_uring/fdinfo: translate SqThread PID through caller's
>pid_ns") did for the io_uring SQPOLL thread. The doit and dumpit paths both
>run synchronously in the caller's context, so task_active_pid_ns(current) is
>the caller's pid namespace.
>
>Fixes: db4704f4e4df ("netdev-genl: Add PID for the NAPI thread")
>Cc: stable@vger.kernel.org
>Signed-off-by: Maoyi Xie <maoyixie.tju@gmail.com>
>---
> net/core/netdev-genl.c | 4 +++-
> 1 file changed, 3 insertions(+), 1 deletion(-)
>
>diff --git a/net/core/netdev-genl.c b/net/core/netdev-genl.c
>index b8f6076d8007..4c23e985cc01 100644
>--- a/net/core/netdev-genl.c
>+++ b/net/core/netdev-genl.c
>@@ -2,6 +2,7 @@
>
> #include <linux/netdevice.h>
> #include <linux/notifier.h>
>+#include <linux/pid_namespace.h>
> #include <linux/rtnetlink.h>
> #include <net/busy_poll.h>
> #include <net/net_namespace.h>
>@@ -189,7 +190,8 @@ netdev_nl_napi_fill_one(struct sk_buff *rsp, struct napi_struct *napi,
> 		goto nla_put_failure;
>
> 	if (napi->thread) {
>-		pid = task_pid_nr(napi->thread);
>+		pid = task_pid_nr_ns(napi->thread,
>+				     task_active_pid_ns(current));
> 		if (nla_put_u32(rsp, NETDEV_A_NAPI_PID, pid))
> 			goto nla_put_failure;
> 	}
>--
>2.43.0
>

Reviewed-by: Samiullah Khawaja <skhawaja@google.com>

* Re: [PATCH] swiotlb: avoid double copy with swiotlb on tx socket
From: Jakub Kicinski @ 2026-06-16  0:25 UTC
  To: Luigi Rizzo
  Cc: rizzo.unipi, m.szyprowski, robin.murphy, willemb, kuniyu, davem,
	edumazet, pabeni, gregkh, rafael, akpm, david, netdev, linux-mm,
	iommu, driver-core, linux-kernel
In-Reply-To: <20260615234220.3946885-1-lrizzo@google.com>

On Mon, 15 Jun 2026 23:42:20 +0000 Luigi Rizzo wrote:
> The use of swiotlb causes an extra data copy on I/O.  For tx sockets,
> especially with greedy senders, this has a high chance of happening in
> the softirq handler for tx network interrupts, creating a significant
> performance bottleneck.

What's the use case? I associate swiotlb mostly with debug / testing,
so it'd be useful for people like me if you explained why you care.

BTW net-next is closed: https://netdev.bots.linux.dev/net-next.html

* Re: [PATCH net-next v10 0/2] net: mana: add ethtool private flag for full-page RX buffers
From: Dipayaan Roy @ 2026-06-16  0:21 UTC
  To: Jakub Kicinski
  Cc: kys, haiyangz, wei.liu, decui, andrew+netdev, davem, edumazet,
	pabeni, leon, longli, kotaranov, horms, shradhagupta, ssengar,
	ernis, shirazsaleem, linux-hyperv, netdev, linux-kernel,
	linux-rdma, stephen, jacob.e.keller, dipayanroy, leitao, kees,
	john.fastabend, hawk, bpf, daniel, ast, sdf, yury.norov,
	pavan.chebbi
In-Reply-To: <20260615134247.0bd7e16e@kernel.org>

On Mon, Jun 15, 2026 at 01:42:47PM -0700, Jakub Kicinski wrote:
> On Mon, 15 Jun 2026 12:25:53 -0700 Dipayaan Roy wrote:
> > Just a gentle ping on this series. The approach was agreed upon, and it
> > has picked up a few Reviewed-by tags as well.
> > 
> > Please let me know if you need anything else from me, or if I should
> > resend it to collect the tags.
> 
> Don't recall now what the exact sequence was but pretty sure this 
> no longer applied after some other mana series was merged.

I see. net-next is closed now, so I will rebase and resend this
once it reopens on June 29th.

Regards
Dipayaan Roy

* Re: [PATCH net-next v5 0/3] airoha: add the capability to configure GDM3/GDM4 as WAN/LAN on demand
From: patchwork-bot+netdevbpf @ 2026-06-16  0:20 UTC
  To: Lorenzo Bianconi
  Cc: andrew+netdev, davem, edumazet, kuba, pabeni, linux-arm-kernel,
	linux-mediatek, netdev, madhur.agrawal, aleksander.lobakin
In-Reply-To: <20260611-airoha-ethtool-priv_flags-v5-0-c11de08486d1@kernel.org>

Hello:

This series was applied to netdev/net-next.git (main)
by Jakub Kicinski <kuba@kernel.org>:

On Thu, 11 Jun 2026 23:55:50 +0200 you wrote:
> Add the capability to configure GDM3/GDM4 as WAN/LAN on demand when QoS
> offload is created or destroyed.
> Make dev->qdma an RCU pointer so the TX path can safely dereference it
> without holding RTNL.
> Introduce airoha_qdma_start() and airoha_qdma_stop() helpers.
> 
> 
> [...]

Here is the summary with links:
  - [net-next,v5,1/3] net: airoha: use int instead of atomic_t for qdma users counter
    https://git.kernel.org/netdev/net-next/c/a459b560e58b
  - [net-next,v5,2/3] net: airoha: refactor QDMA start/stop into reusable helpers
    (no matching commit)
  - [net-next,v5,3/3] net: airoha: defer GDM3/GDM4 WAN mode and GDM2 loopback to QoS offload
    (no matching commit)

You are awesome, thank you!
-- 
Deet-doot-dot, I am a bot.
https://korg.docs.kernel.org/patchwork/pwbot.html

* Re: [PATCH net-next v2 5/8] net: dsa: mt7530: replace mt7530_read with regmap_read
From: Jakub Kicinski @ 2026-06-16  0:15 UTC
  To: daniel
  Cc: chester.a.unal, andrew, olteanv, davem, edumazet, pabeni,
	matthias.bgg, angelogioacchino.delregno, linux, netdev,
	linux-kernel, linux-arm-kernel, linux-mediatek
In-Reply-To: <20260615235649.1060736-1-kuba@kernel.org>

On Mon, 15 Jun 2026 16:56:49 -0700 Jakub Kicinski wrote:
> >  static u32
> > -mt7530_read(struct mt7530_priv *priv, u32 reg)
> > +mt7530_mii_poll(struct mt7530_dummy_poll *p)
> >  {
> > -	int ret;
> >  	u32 val;
> >  
> > -	ret = regmap_read(priv->regmap, reg, &val);
> > -	if (ret) {
> > -		WARN_ON_ONCE(1);
> > -		dev_err(priv->dev,
> > -			"failed to read mt7530 register\n");
> > -		return 0;
> > -	}
> > +	regmap_read(p->priv->regmap, p->reg, &val);
> >  
> >  	return val;
> >  }  
> 
> [Medium]
> The old wrapper guaranteed val == 0 on failure. The new mt7530_mii_poll()
> declares val without an initializer and ignores the regmap_read() return
> code, so on a failed underlying transfer val is whatever was on the stack.
> 
> Looking at drivers/net/dsa/mt7530-mdio.c::mt7530_regmap_read():
> 
>     ret = bus->write(bus, priv->mdiodev->addr, 0x1f, page);
>     if (ret < 0)
>         return ret;
> 
>     lo = bus->read(bus, priv->mdiodev->addr, r);
>     hi = bus->read(bus, priv->mdiodev->addr, 0x10);
> 
>     *val = (hi << 16) | (lo & 0xffff);

This looks legit, AFAICT, sadly.

Maybe just write *val = 0; in mt7530_regmap_read() unconditionally?
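
A minimal userspace sketch of that idea, assuming a hypothetical
struct mock_bus in place of the MDIO bus. paged_read_zeroing() is an
illustration of the pattern, not the driver's mt7530_regmap_read().

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

/* Hypothetical bus model: write_ok decides whether page select succeeds. */
struct mock_bus {
	int      write_ok;
	uint16_t lo;
	uint16_t hi;
};

/* Paged read that defines *val on every path, including failure. */
static int paged_read_zeroing(const struct mock_bus *bus, uint32_t *val)
{
	*val = 0; /* callers that ignore the return code still see a known value */

	if (!bus->write_ok)
		return -EIO; /* page-select write failed; *val stays 0 */

	*val = ((uint32_t)bus->hi << 16) | bus->lo;
	return 0;
}

void paged_read_zeroing_demo(void)
{
	struct mock_bus good = { .write_ok = 1, .lo = 0x5678, .hi = 0x1234 };
	struct mock_bus bad  = { .write_ok = 0, .lo = 0x5678, .hi = 0x1234 };
	uint32_t val = 0xdeadbeef; /* stands in for stale stack contents */

	assert(paged_read_zeroing(&good, &val) == 0);
	assert(val == 0x12345678u);

	val = 0xdeadbeef;
	assert(paged_read_zeroing(&bad, &val) == -EIO);
	assert(val == 0);
}
```

With the early store, a polling predicate such as val != 0 would see 0
on a failed transfer rather than whatever was on the stack.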

* Re: [PATCH net-next v2 8/8] net: dsa: mt7530: implement port_change_conduit op
From: Jakub Kicinski @ 2026-06-15 23:57 UTC
  To: daniel
  Cc: Jakub Kicinski, chester.a.unal, andrew, olteanv, davem, edumazet,
	pabeni, matthias.bgg, angelogioacchino.delregno, linux, netdev,
	linux-kernel, linux-arm-kernel, linux-mediatek
In-Reply-To: <8dd8cfe32bc8e38b92c49e30a6255090fb0998fb.1781312667.git.daniel@makrotopia.org>

This is an AI-generated review of your patch. The human sending this
email has considered the AI review valid, or at least plausible.
---
net: dsa: mt7530: implement port_change_conduit op

Implements the port_change_conduit operation for the mt7530 driver,
updating the port matrix to forward to a new CPU port. The op is
restricted to MT7531 because MT7530/MT7621 cannot honour per-port CPU
affinity for trapped frames and the MT7988/EN7581/AN7583 switches have
only a single CPU port.

> diff --git a/drivers/net/dsa/mt7530.c b/drivers/net/dsa/mt7530.c
> index c96420c291d5..2f3e734b9f53 100644
> --- a/drivers/net/dsa/mt7530.c
> +++ b/drivers/net/dsa/mt7530.c
> @@ -3206,6 +3206,34 @@ static int mt753x_set_mac_eee(struct dsa_switch *ds, int port,
>  	return 0;
>  }
>  
> +static int
> +mt753x_port_change_conduit(struct dsa_switch *ds, int port,
> +			   struct net_device *conduit,
> +			   struct netlink_ext_ack *extack)
> +{
> +	struct dsa_port *new_cpu_dp = conduit->dsa_ptr;
> +	struct dsa_port *dp = dsa_to_port(ds, port);
> +	struct mt7530_priv *priv = ds->priv;
> +
> +	if (priv->id != ID_MT7531)
> +		return -EOPNOTSUPP;

[Low]
Should this branch populate the supplied extack with NL_SET_ERR_MSG_MOD()
before returning -EOPNOTSUPP?

Because port_change_conduit is registered in the shared mt7530_switch_ops
table, the DSA core's fallback message in dsa_user_change_conduit() ("Driver
does not support changing DSA conduit") is bypassed; that branch only runs
when ds->ops->port_change_conduit is NULL. Neither dsa_port_change_conduit()
nor __dsa_port_change_conduit() in net/dsa/port.c synthesize an extack on
callback failure either.

The result is that on MT7530, MT7621, MT7988, EN7581 and AN7583, userspace
invoking IFLA_DSA_CONDUIT receives EOPNOTSUPP with an empty extack, with
none of the per-chip context that the commit message itself spells out
(single CPU port; MT7530 MFC CPU_PORT trapping limitation). Would a short
NL_SET_ERR_MSG_MOD() naming the reason be appropriate here?

> +
> +	mutex_lock(&priv->reg_mutex);
> +
> +	/* dp->cpu_dp still points to the old CPU port */
> +	priv->ports[port].pm &= ~PCR_MATRIX(BIT(dp->cpu_dp->index));
> +	priv->ports[port].pm |= PCR_MATRIX(BIT(new_cpu_dp->index));
> +	if (priv->ports[port].enable)
> +		regmap_update_bits(priv->regmap, MT7530_PCR_P(port),
> +				   PCR_MATRIX_MASK, priv->ports[port].pm);
> +
> +	mutex_unlock(&priv->reg_mutex);
> +
> +	mt7530_port_fast_age(ds, port);
> +
> +	return 0;
> +}

* Re: [PATCH net-next v2 7/8] net: dsa: mt7530: implement port_fast_age
From: Jakub Kicinski @ 2026-06-15 23:56 UTC
  To: daniel
  Cc: Jakub Kicinski, chester.a.unal, andrew, olteanv, davem, edumazet,
	pabeni, matthias.bgg, angelogioacchino.delregno, linux, netdev,
	linux-kernel, linux-arm-kernel, linux-mediatek
In-Reply-To: <e6cdb705f0ec142283071e1966000d3a76f71a16.1781312667.git.daniel@makrotopia.org>

This is an AI-generated review of your patch. The human sending this
email has considered the AI review valid, or at least plausible.
---
Subject: net: dsa: mt7530: implement port_fast_age

This change adds a .port_fast_age operation for the mt7530 DSA driver.
The implementation triggers an ATC flush of all non-static MAC address
entries and polls ATC_BUSY for completion.

> diff --git a/drivers/net/dsa/mt7530.c b/drivers/net/dsa/mt7530.c
> index dcf72ab0cd66..c96420c291d5 100644
> --- a/drivers/net/dsa/mt7530.c
> +++ b/drivers/net/dsa/mt7530.c
> @@ -193,6 +193,21 @@ mt7530_fdb_cmd(struct mt7530_priv *priv, enum mt7530_fdb_cmd cmd, u32 *rsp)
>  	return 0;
>  }
>  
> +static void mt7530_port_fast_age(struct dsa_switch *ds, int port)
> +{
> +	struct mt7530_priv *priv = ds->priv;
> +	struct mt7530_dummy_poll p;
> +	u32 val;
> +
> +	/* Flush all non-static MAC address entries */
> +	val = ATC_BUSY | ATC_MAT_NON_STATIC_MAC | MT7530_FDB_FLUSH;
> +	regmap_write(priv->regmap, MT7530_ATC, val);

[High]
Should this transaction be serialised against priv->reg_mutex?

Every other accessor that touches MT7530_ATC wraps the write+poll
sequence in mutex_lock(&priv->reg_mutex)/mutex_unlock(&priv->reg_mutex).
For example mt7530_port_fdb_add(), mt7530_port_fdb_del(),
mt7530_port_fdb_dump(), mt7530_port_mdb_add() and mt7530_port_mdb_del()
all hold reg_mutex around the call to mt7530_fdb_cmd():

	mutex_lock(&priv->reg_mutex);
	mt7530_fdb_write(priv, vid, port_mask, addr, -1, STATIC_ENT);
	ret = mt7530_fdb_cmd(priv, MT7530_FDB_WRITE, NULL);
	mutex_unlock(&priv->reg_mutex);

If .port_fast_age runs concurrently with one of those operations, can
the two write-then-busy-poll sequences on the same MT7530_ATC register
overlap?

Could one thread's poll loop observe ATC_BUSY clearing for the other
thread's command, leading mt7530_port_fdb_dump() to read stale ATRD
data, or could one thread's freshly written ATC command be overwritten
by the other before the controller has executed it?

> +
> +	INIT_MT7530_DUMMY_POLL(&p, priv, MT7530_ATC);
> +	readx_poll_timeout(mt7530_mii_poll, &p, val,
> +			   !(val & ATC_BUSY), 20, 20000);
> +}

[Medium]
Is there a reason the return value of readx_poll_timeout() is ignored
here?

The structurally identical helper mt7530_fdb_cmd() in the same file
checks the result and emits a diagnostic on timeout:

	ret = readx_poll_timeout(mt7530_mii_poll, &p, val,
				 !(val & ATC_BUSY), 20, 20000);
	if (ret < 0) {
		dev_err(priv->dev, "reset timeout\n");
		return ret;
	}

Peer DSA drivers also log on failure (b53_br_fast_age prints "fast
ageing failed"; rtl83xx_port_fast_age prints "failed to fast age on
port %d: %d").

If the ATC engine fails to clear ATC_BUSY within the 20ms timeout,
mt7530_port_fast_age() returns silently and a subsequent
mt7530_fdb_cmd() will write a new command on top of the in-flight
operation. Should this path emit dev_err() on timeout to make a wedged
ATC engine visible to operators?
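
A small userspace model of the checked variant, assuming a hypothetical
struct mock_reg whose busy flag clears after a fixed number of reads.
BUSY_BIT, poll_busy_clear() and flush_and_wait() are illustrative names;
the bounded loop only approximates what readx_poll_timeout() does with
real time.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>
#include <stdio.h>

#define BUSY_BIT (1u << 15) /* hypothetical busy flag position */

/* Hypothetical register model: busy clears after clear_after reads. */
struct mock_reg {
	unsigned int reads;
	unsigned int clear_after;
};

static uint32_t mock_read(struct mock_reg *r)
{
	return (r->reads++ < r->clear_after) ? BUSY_BIT : 0;
}

/* Bounded poll: 0 once busy clears, -ETIMEDOUT after max_polls attempts. */
static int poll_busy_clear(struct mock_reg *r, unsigned int max_polls)
{
	for (unsigned int i = 0; i < max_polls; i++) {
		if (!(mock_read(r) & BUSY_BIT))
			return 0;
	}
	return -ETIMEDOUT;
}

/* Caller that reports the timeout instead of returning silently. */
static int flush_and_wait(struct mock_reg *r, unsigned int max_polls)
{
	int ret = poll_busy_clear(r, max_polls);

	if (ret < 0)
		fprintf(stderr, "fast age: busy flag did not clear (%d)\n", ret);
	return ret;
}

void flush_and_wait_demo(void)
{
	struct mock_reg quick  = { .reads = 0, .clear_after = 3 };
	struct mock_reg wedged = { .reads = 0, .clear_after = 1000 };

	assert(flush_and_wait(&quick, 10) == 0);
	assert(flush_and_wait(&wedged, 10) == -ETIMEDOUT);
}
```

Since .port_fast_age returns void, logging is the only signal available
on that path; the return value here exists so the sketch can be tested.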

> +
>  static void
>  mt7530_fdb_read(struct mt7530_priv *priv, struct mt7530_fdb *fdb)
>  {

[ ... ]

[Low]
Is the claim in the commit message accurate?

  "This is consistent with what other DSA drivers do (b53, realtek)
  and relearning is fast."

Both drivers cited actually implement per-port fast aging.

In b53, b53_br_fast_age() calls b53_fast_age_port() which writes the
target port to B53_FAST_AGE_PORT_CTRL and selects FAST_AGE_PORT mode.

In realtek, rtl83xx_port_fast_age() calls
priv->ops->l2_flush(priv, port, 0), and rtl8366rb_port_fast_age()
flushes only BIT(port) via RTL8366RB_SECURITY_CTRL.

The mt7530 implementation flushes globally and ignores the port
argument entirely, which is the opposite of the cited examples. The
hardware-limitation rationale ("switch does not offer combined
non-static AND per-port match mode") may be valid, but the b53/realtek
comparison appears to be the wrong precedent.

A side effect not mentioned is that an STP transition on one port will
flush dynamic FDB entries on every other port and every other bridge
sharing the switch. Should this divergence from the typical
.port_fast_age contract be called out?

> diff --git a/drivers/net/dsa/mt7530.h b/drivers/net/dsa/mt7530.h
> index abf19aa69520..decad7a93dbd 100644
> --- a/drivers/net/dsa/mt7530.h
> +++ b/drivers/net/dsa/mt7530.h
> @@ -165,6 +165,7 @@ enum mt753x_to_cpu_fw {
>  #define  ATC_MAT_MASK			GENMASK(11, 8)
>  #define  ATC_MAT(x)			FIELD_PREP(ATC_MAT_MASK, x)
>  #define  ATC_MAT_MACTAB			ATC_MAT(0)
> +#define  ATC_MAT_NON_STATIC_MAC	ATC_MAT(4)

[Low]
The new ATC_MAT_NON_STATIC_MAC define uses a single tab between the
macro name and its value, while the surrounding ATC_MAT_* macros use
multiple tabs and align the values at the same column. Should this
follow the existing alignment?

>  
>  enum mt7530_fdb_cmd {
>  	MT7530_FDB_READ	= 0,

* Re: [PATCH net-next v2 6/8] net: dsa: mt7530: convert to use field accessor macros
From: Jakub Kicinski @ 2026-06-15 23:56 UTC
  To: daniel
  Cc: Jakub Kicinski, chester.a.unal, andrew, olteanv, davem, edumazet,
	pabeni, matthias.bgg, angelogioacchino.delregno, linux, netdev,
	linux-kernel, linux-arm-kernel, linux-mediatek
In-Reply-To: <62e0a7e96bcadba0ec1a0f090c2fd2990c6ae0f1.1781312667.git.daniel@makrotopia.org>

This is an AI-generated review of your patch. The human sending this
email has considered the AI review valid, or at least plausible.
---
net: dsa: mt7530: convert to use field accessor macros

This patch converts the mt7530 DSA driver to use FIELD_GET and FIELD_PREP
instead of open-coded shift/mask constructs, and replaces a 0x1f literal
in the PHY address helper with (PHY_MAX_ADDR - 1).

> diff --git a/drivers/net/dsa/mt7530.c b/drivers/net/dsa/mt7530.c
> index 4168adca949f..dcf72ab0cd66 100644
> --- a/drivers/net/dsa/mt7530.c
> +++ b/drivers/net/dsa/mt7530.c

[ ... ]

> @@ -1555,7 +1555,7 @@ mt7530_vlan_cmd(struct mt7530_priv *priv, enum mt7530_vlan_cmd cmd, u16 vid)
>  	u32 val;
>  	int ret;
>  
> -	val = VTCR_BUSY | VTCR_FUNC(cmd) | vid;
> +	val = VTCR_BUSY | VTCR_FUNC(cmd) | VTCR_VID(vid);

[Low]
The commit message describes this change as a refactor:
"Use FIELD_GET and FIELD_PREP instead of open-coding register fields."

The mt7530_vlan_cmd() hunk is not quite a pure conversion though.  The
previous code OR'd in vid directly:

    val = VTCR_BUSY | VTCR_FUNC(cmd) | vid;

while the new code applies a GENMASK(11, 0) truncation through VTCR_VID():

    val = VTCR_BUSY | VTCR_FUNC(cmd) | VTCR_VID(vid);

Spec-conformant VIDs always fit in 12 bits, so the runtime result is the
same, but the value is now masked where it previously was not.

A related observation is that the prior VTCR_VID and ATC_HASH macros
referenced an undeclared x:

    #define  ATC_HASH                       (((x) & 0xfff) << 16)
    #define  VTCR_VID                       ((x) & 0xfff)

so they could not have been used by any caller before this patch.  Adding
the (x) parameter and a GENMASK is effectively a fix for those macros.

Would it be worth calling out the VTCR_VID() truncation and the macro
parameter fix in the commit message, or splitting them into a separate
patch so the conversion remains a no-op refactor?
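
To make the difference concrete, here is a small userspace sketch. The
field layout (busy flag in bit 31, a 4-bit function field at bit 12, a
12-bit VID) is assumed for illustration, and the macro and function
names are hypothetical rather than the driver's.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed layout for illustration: busy flag, 4-bit function, 12-bit VID. */
#define BUSY_FLAG  (1u << 31)
#define FUNC(x)    (((uint32_t)(x) & 0xfu) << 12)
#define VID_MASK   0xfffu
#define VID(x)     ((uint32_t)(x) & VID_MASK)

/* Old form: vid is OR'd in without masking. */
static uint32_t cmd_unmasked(unsigned int func, uint16_t vid)
{
	return BUSY_FLAG | FUNC(func) | vid;
}

/* New form: vid is truncated to the 12-bit field first. */
static uint32_t cmd_masked(unsigned int func, uint16_t vid)
{
	return BUSY_FLAG | FUNC(func) | VID(vid);
}

void vid_mask_demo(void)
{
	/* Any VID that fits in 12 bits yields the same word either way. */
	assert(cmd_unmasked(1, 100) == cmd_masked(1, 100));
	assert(cmd_unmasked(1, 0xfff) == cmd_masked(1, 0xfff));

	/* A wider value leaks into the function field only when unmasked. */
	assert(cmd_unmasked(1, 0x2001) != cmd_masked(1, 0x2001));
	assert(cmd_masked(1, 0x2001) == (BUSY_FLAG | FUNC(1) | 0x001u));
}
```

The two forms only diverge for out-of-range input, which is why the
change is harmless at runtime yet still not a pure no-op.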

> diff --git a/drivers/net/dsa/mt7530.h b/drivers/net/dsa/mt7530.h

[ ... ]

> @@ -171,32 +176,29 @@ enum mt7530_fdb_cmd {
>  
>  /* Registers for table search read address */
>  #define MT7530_TSRA1			0x84

[ ... ]

>  #define MT7530_ATRD			0x8C
> -#define	 AGE_TIMER			24
> -#define  AGE_TIMER_MASK			0xff
> -#define  PORT_MAP			4
> -#define  PORT_MAP_MASK			0xff
> -#define  ENT_STATUS			2
> -#define  ENT_STATUS_MASK		0x3
> +#define  AGE_TIMER_RD_MASK		GENMASK(31, 24)
> +#define  PORT_MAP_MASK			GENMASK(11, 4)
> +#define  ENT_STATUS_MASK		GENMASK(3, 2)

[Low]
The mask formerly named AGE_TIMER_MASK is renamed to AGE_TIMER_RD_MASK,
but the same mask is consumed on both the read and write paths
(mt7530_fdb_read() and mt7530_fdb_write()), so the _RD_ infix reads as
"read-side only".

Its sibling fields in the same MT7530_ATRD register keep plain names
(PORT_MAP_MASK, ENT_STATUS_MASK), and the related ageing counters in
MT7530_AAC use AGE_CNT_MASK / AGE_UNIT_MASK rather than an _RD_
disambiguation.

Was the _RD_ infix intentional, and if so, what does it disambiguate
from?

[ ... ]

> @@ -731,38 +745,48 @@ enum mt7531_xtal_fsel {
>  #define  MT7531_PHY_PLL_OFF		BIT(5)
>  #define  MT7531_PHY_PLL_BYPASS_MODE	BIT(4)
>  
> -#define MT753X_CTRL_PHY_ADDR(addr)	((addr + 1) & 0x1f)
> +#define MT753X_CTRL_PHY_ADDR(addr)	(((addr) + 1) & (PHY_MAX_ADDR - 1))

[Low]
The replacement evaluates to the same value today, since PHY_MAX_ADDR is
32 in include/linux/phy.h and PHY_MAX_ADDR - 1 is 0x1f, and both forms
ultimately derive from the IEEE 802.3 5-bit MDIO PHY-address field.

The original 0x1f represented the width of the hardware register field,
while PHY_MAX_ADDR - 1 is derived from the software limit on the number
of PHY addresses and additionally relies on PHY_MAX_ADDR being a power
of two, which phy.h does not formally guarantee.

Given the rest of the patch introduces explicit GENMASK based field
masks, would something like:

    #define MT753X_CTRL_PHY_ADDR_MASK GENMASK(4, 0)
    #define MT753X_CTRL_PHY_ADDR(addr) FIELD_PREP(MT753X_CTRL_PHY_ADDR_MASK, (addr) + 1)

be more consistent and make the hardware-field intent clearer?
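
A userspace sketch of that suggestion, assuming 32-bit unsigned int.
GENMASK32() and FIELD_PREP32() are simplified stand-ins written for this
example; the kernel's GENMASK() and FIELD_PREP() also carry compile-time
checks that are omitted here.

```c
#include <assert.h>
#include <stdint.h>

/* Simplified 32-bit stand-ins for the kernel's GENMASK() and FIELD_PREP(). */
#define GENMASK32(h, l)  ((~0u >> (31 - (h))) & (~0u << (l)))
/* (mask & ~(mask << 1)) isolates the lowest set bit of a contiguous mask. */
#define FIELD_PREP32(mask, val) \
	(((uint32_t)(val) * ((mask) & ~((mask) << 1))) & (mask))

#define CTRL_PHY_ADDR_MASK   GENMASK32(4, 0)
#define CTRL_PHY_ADDR(addr)  FIELD_PREP32(CTRL_PHY_ADDR_MASK, (addr) + 1)

void ctrl_phy_addr_demo(void)
{
	assert(CTRL_PHY_ADDR_MASK == 0x1fu);

	/* Same result as the open-coded ((addr + 1) & 0x1f) for every address. */
	for (unsigned int addr = 0; addr < 32; addr++)
		assert(CTRL_PHY_ADDR(addr) == ((addr + 1) & 0x1fu));

	/* Address 31 wraps to 0 because the field is five bits wide. */
	assert(CTRL_PHY_ADDR(31) == 0);
}
```

Expressing the mask as a bit range ties the wrap-around to the register
field width instead of to a software constant.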

* Re: [PATCH net-next v2 5/8] net: dsa: mt7530: replace mt7530_read with regmap_read
From: Jakub Kicinski @ 2026-06-15 23:56 UTC
  To: daniel
  Cc: Jakub Kicinski, chester.a.unal, andrew, olteanv, davem, edumazet,
	pabeni, matthias.bgg, angelogioacchino.delregno, linux, netdev,
	linux-kernel, linux-arm-kernel, linux-mediatek
In-Reply-To: <c6fc9cfae9762bf8d8a136339c91d4a76a307355.1781312667.git.daniel@makrotopia.org>

This is an AI-generated review of your patch. The human sending this
email has considered the AI review valid, or at least plausible.
---
net: dsa: mt7530: replace mt7530_read with regmap_read

This commit removes the mt7530_read() wrapper and converts every caller
to use regmap_read() directly. The WARN_ON_ONCE / dev_err and the
return-zero-on-failure behaviour the wrapper provided are dropped.

> diff --git a/drivers/net/dsa/mt7530.c b/drivers/net/dsa/mt7530.c
> index fe7e4ab5ae9c..4168adca949f 100644
> --- a/drivers/net/dsa/mt7530.c
> +++ b/drivers/net/dsa/mt7530.c
> @@ -152,28 +152,15 @@ core_clear(struct mt7530_priv *priv, u32 reg, u32 val)
>  
>  
>  static u32
> -mt7530_read(struct mt7530_priv *priv, u32 reg)
> +mt7530_mii_poll(struct mt7530_dummy_poll *p)
>  {
> -	int ret;
>  	u32 val;
>  
> -	ret = regmap_read(priv->regmap, reg, &val);
> -	if (ret) {
> -		WARN_ON_ONCE(1);
> -		dev_err(priv->dev,
> -			"failed to read mt7530 register\n");
> -		return 0;
> -	}
> +	regmap_read(p->priv->regmap, p->reg, &val);
>  
>  	return val;
>  }

[Medium]
The old wrapper guaranteed val == 0 on failure. The new mt7530_mii_poll()
declares val without an initializer and ignores the regmap_read() return
code, so on a failed underlying transfer val is whatever was on the stack.

Looking at drivers/net/dsa/mt7530-mdio.c::mt7530_regmap_read():

    ret = bus->write(bus, priv->mdiodev->addr, 0x1f, page);
    if (ret < 0)
        return ret;

    lo = bus->read(bus, priv->mdiodev->addr, r);
    hi = bus->read(bus, priv->mdiodev->addr, 0x10);

    *val = (hi << 16) | (lo & 0xffff);

When the page-select write fails, *val is not written. mt7530_mii_poll()
is the polling callback used by readx_poll_timeout() in mt7530_fdb_cmd(),
mt7530_vlan_cmd(), mt7530_setup() and mt7531_setup() with predicates
such as val != 0 (post-reset stability), !(val & ATC_BUSY) and
!(val & VTCR_BUSY).

Previously a broken MDIO transport produced val == 0 and the post-reset
poll deterministically timed out. Can the new code spuriously satisfy
the val != 0 predicate from an uninitialized stack slot, then feed that
same garbage into (val & MT7530_XTAL_MASK) to choose 25/40 MHz settings
in mt7530_pll_setup()?

> @@ -365,9 +354,9 @@ mt7531_pll_setup(struct mt7530_priv *priv)
>  	u32 hwstrap;
>  	u32 val;
>  
> -	val = mt7530_read(priv, MT7531_CREV);
> -	top_sig = mt7530_read(priv, MT7531_TOP_SIG_SR);
> -	hwstrap = mt7530_read(priv, MT753X_TRAP);
> +	regmap_read(priv->regmap, MT7531_CREV, &val);
> +	regmap_read(priv->regmap, MT7531_TOP_SIG_SR, &top_sig);
> +	regmap_read(priv->regmap, MT753X_TRAP, &hwstrap);

[Medium]
Same concern applies to all the read-modify-write sequences in
mt7531_pll_setup(), e.g.:

    regmap_read(priv->regmap, MT7531_PLLGP_EN, &val);
    val &= ~EN_COREPLL;
    regmap_write(priv->regmap, MT7531_PLLGP_EN, val);

If regmap_read() fails, val is uninitialized and is then OR'd / AND'd
with constants and written straight back into PLLGP / COREPLL control
registers. Was there a reason to drop the wrapper's return-zero (and
WARN_ON_ONCE / dev_err) behaviour without either initializing these
locals to 0 or checking the return code?

> @@ -1010,7 +999,7 @@ static void mt7530_setup_port5(struct dsa_switch *ds, phy_interface_t interface)
>  
>  	mutex_lock(&priv->reg_mutex);
>  
> -	val = mt7530_read(priv, MT753X_MTRAP);
> +	regmap_read(priv->regmap, MT753X_MTRAP, &val);

[Low]
mt7530_setup_port5() declares val as int, but regmap_read()'s third
parameter is unsigned int *. The previous code hid this through the
u32 return of mt7530_read() and an implicit assignment conversion;
the direct call now passes &val with mismatched signedness. Should this
be u32 val; like the other call sites in this file (sparse and
-Wpointer-sign would flag this)?

> @@ -2436,7 +2435,7 @@ mt7530_setup(struct dsa_switch *ds)
>  		return ret;
>  	}
>  
> -	id = mt7530_read(priv, MT7530_CREV);
> +	regmap_read(priv->regmap, MT7530_CREV, &id);
>  	id >>= CHIP_NAME_SHIFT;
>  	if (id != MT7530_ID) {
>  		dev_err(priv->dev, "chip %x can't be supported\n", id);

[Medium]
The chip-ID gating in mt7530_setup() and mt7531_setup() was previously
deterministic-fail on MDIO read errors (id == 0 after the wrapper).
With id now uninitialized on failure, can this check spuriously match
or mismatch on garbage and either mis-bind a wrong-id chip or refuse a
real one?

> @@ -2690,7 +2689,7 @@ mt7531_setup(struct dsa_switch *ds)
>  	/* MT7531AE has got two SGMII units. One for port 5, one for port 6.
>  	 * MT7531BE has got only one SGMII unit which is for port 6.
>  	 */
> -	val = mt7530_read(priv, MT7531_TOP_SIG_SR);
> +	regmap_read(priv->regmap, MT7531_TOP_SIG_SR, &val);
>  	priv->p5_sgmii = !!(val & PAD_DUAL_SGMII_EN);

[Medium]
And here, p5_sgmii is set from val & PAD_DUAL_SGMII_EN after an
unchecked read; if the read fails, would this latch a wrong p5_sgmii
state from stack contents and then drive different mt7531_pll_setup()
vs. GPIO mode programming below?

The commit message says:

    The WARN_ON_ONCE error logging is dropped as regmap provides its
    own error handling.

Is that accurate for this driver? regmap_read() does not zero *val on
error and does not log; with all new call sites discarding the return
value, the converted paths appear to have no error handling at all,
which seems like a behavioural regression compared to the prior
wrapper. Would initializing the affected locals to 0 (matching the
old return-zero-on-failure semantics), or checking the regmap_read()
return code at the sites that derive subsequent hardware programming
(chip ID, XTAL select, PLL RMW, post-reset poll), be appropriate here?
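
As a rough illustration of the second option, the sketch below models a
read-modify-write that refuses to write back after a failed read.
struct mock_regs and all helpers are hypothetical; the failing read
deliberately leaves *val untouched, mirroring the concern raised above.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

/* Hypothetical register model with a switchable read failure. */
struct mock_regs {
	uint32_t     value;
	int          fail_reads;
	unsigned int writes;
};

static int mock_reg_read(struct mock_regs *r, uint32_t *val)
{
	if (r->fail_reads)
		return -EIO; /* *val is left untouched on failure */
	*val = r->value;
	return 0;
}

static void mock_reg_write(struct mock_regs *r, uint32_t val)
{
	r->value = val;
	r->writes++;
}

/* Read-modify-write that skips the write when the read failed. */
static int clear_bits_checked(struct mock_regs *r, uint32_t bits)
{
	uint32_t val = 0; /* defined even if the read fails */
	int ret = mock_reg_read(r, &val);

	if (ret)
		return ret;
	mock_reg_write(r, val & ~bits);
	return 0;
}

void clear_bits_checked_demo(void)
{
	struct mock_regs ok     = { .value = 0xff, .fail_reads = 0, .writes = 0 };
	struct mock_regs broken = { .value = 0xff, .fail_reads = 1, .writes = 0 };

	assert(clear_bits_checked(&ok, 0x0f) == 0);
	assert(ok.value == 0xf0 && ok.writes == 1);

	/* A failed read leaves the register alone instead of writing garbage. */
	assert(clear_bits_checked(&broken, 0x0f) == -EIO);
	assert(broken.value == 0xff && broken.writes == 0);
}
```

In the driver itself, regmap_update_bits(), which the series already
uses elsewhere, may be a natural fit for the simple RMW sites since it
returns the read error instead of writing.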
-- 
pw-bot: cr

* Re: [PATCH net-next v2 1/2] dt-bindings: net: pse-pd: add bindings for Realtek/Broadcom PSE MCU
From: Daniel Golle @ 2026-06-15 23:50 UTC
  To: Rob Herring
  Cc: Jonas Jelonek, Oleksij Rempel, Kory Maincent, Andrew Lunn,
	David S . Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
	Krzysztof Kozlowski, Conor Dooley, netdev, devicetree,
	linux-kernel, Bjørn Mork
In-Reply-To: <20260615212959.GA1679454-robh@kernel.org>

On Mon, Jun 15, 2026 at 04:29:59PM -0500, Rob Herring wrote:
> On Fri, Jun 12, 2026 at 01:29:41PM +0000, Jonas Jelonek wrote:
> > +properties:
> > +  compatible:
> > +    enum:
> > +      - realtek,pse-mcu-rtk
> 
> The "rtk" feels redundant.
> 
> > +      - realtek,pse-mcu-bcm
> 
> "brcm" is the standard vendor prefix, so use that instead of "bcm". 
> Though who defined the protocol in this case? Realtek or Broadcom? In 
> the latter case, I'd argue that "brcm" should be the vendor prefix.

The microcontroller firmware, and hence the protocol, is designed
by Realtek in both cases. However, they chose to design two incompatible
protocol dialects based on the features of the PSE(s) connected to the
MCU.


* [PATCH] swiotlb: avoid double copy with swiotlb on tx socket
From: Luigi Rizzo @ 2026-06-15 23:42 UTC (permalink / raw)
  To: rizzo.unipi, lrizzo, m.szyprowski, robin.murphy, willemb, kuniyu,
	davem, edumazet, kuba, pabeni
  Cc: gregkh, rafael, akpm, david, netdev, linux-mm, iommu, driver-core,
	linux-kernel

The use of swiotlb causes an extra data copy on I/O.  For tx sockets,
especially with greedy senders, this has a high chance of happening in
the softirq handler for tx network interrupts, creating a significant
performance bottleneck.

Allow tx sockets to allocate socket buffers directly from the bounce
buffers. This avoids the second copy and removes the above bottleneck.
The fraction of swiotlb buffers allowed for this feature is set with
   /sys/module/swiotlb/parameters/zerocopy_tx_percent
(0 means disabled, 90 is the maximum, to avoid persistent I/O failures).
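
As a rough sketch of how that cap is evaluated: the helper name
zc_tx_allowed() is hypothetical, and the arithmetic mirrors the check in
swiotlb_alloc_pages() in the diff below, using plain integers in place of
the pool structures.

```c
#include <stdbool.h>

/* Hypothetical helper: may a tx socket take more slots from the pool?
 * percent is the module parameter value, nslabs the pool size in slots,
 * used the number of slots currently in use.
 */
static bool zc_tx_allowed(unsigned int percent, unsigned long nslabs,
			  unsigned long used)
{
	/* Values above 90 are clamped to 90. */
	unsigned int max_pct = percent > 90 ? 90 : percent;

	if (max_pct == 0)
		return false;	/* feature disabled */

	/* Allowed only while used / nslabs < max_pct / 100. */
	return (unsigned long)max_pct * nslabs > used * 100;
}
```

With a 1000-slot pool and the parameter set to 50, allocations are
admitted while fewer than 500 slots are in use and refused from 500 on.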

Implementation:
- define a new page type to unambiguously identify bounce buffers used
  as backing storage for socket buffers
- modify skb_page_frag_refill to perform the modified allocation
- modify the destructors __free_frozen_pages() and free_unref_folios() to
  handle those pages and return them to the pool.

The savings are especially visible with fewer queues. In synthetic
benchmarks, senders with 1-2 queues would cap around 50Gbps with
conventional swiotlb, and reach over 170Gbps with the feature enabled.

Signed-off-by: Luigi Rizzo <lrizzo@google.com>
---
 drivers/base/core.c        |   1 +
 include/linux/netdevice.h  |  22 ++++
 include/linux/page-flags.h |   4 +
 include/linux/skbuff.h     |   7 +-
 include/linux/swiotlb.h    |  74 ++++++++++++
 include/net/sock.h         |  29 +++++
 kernel/dma/swiotlb.c       | 227 +++++++++++++++++++++++++++++++++++++
 mm/page_alloc.c            |  32 ++++++
 net/core/sock.c            |  98 ++++++++++++++--
 9 files changed, 485 insertions(+), 9 deletions(-)

diff --git a/drivers/base/core.c b/drivers/base/core.c
index bd2ddf2aab505..e1257dea37ba0 100644
--- a/drivers/base/core.c
+++ b/drivers/base/core.c
@@ -3855,6 +3855,7 @@ void device_del(struct device *dev)
 	unsigned int noio_flag;
 
 	device_lock(dev);
+	swiotlb_device_deleted();
 	kill_device(dev);
 	device_unlock(dev);
 
diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index 0e1e581efc5ac..d7e5929e73c92 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -5368,13 +5368,35 @@ static inline netdev_tx_t __netdev_start_xmit(const struct net_device_ops *ops,
 	return ops->ndo_start_xmit(skb, dev);
 }
 
+struct sock;
+
+#ifdef CONFIG_SWIOTLB
+/* Per-CPU pointer to the socket currently performing transmission.
+ * Used to bridge the networking and DMA layers, allowing the dma_map_page()
+ * path to identify the socket originating the packet and apply SWIOTLB optimizations.
+ */
+DECLARE_PER_CPU(struct sock *, current_tx_socket);
+static inline struct sock *__set_current_tx_socket(struct sock *sk)
+{
+	struct sock *old_sk = this_cpu_read(current_tx_socket);
+
+	this_cpu_write(current_tx_socket, sk);
+	return old_sk;
+}
+#else
+static inline struct sock *__set_current_tx_socket(struct sock *sk) { return NULL; }
+#endif
+
 static inline netdev_tx_t netdev_start_xmit(struct sk_buff *skb, struct net_device *dev,
 					    struct netdev_queue *txq, bool more)
 {
 	const struct net_device_ops *ops = dev->netdev_ops;
+	struct sock *old_sk;
 	netdev_tx_t rc;
 
+	old_sk = __set_current_tx_socket(skb->sk);
 	rc = __netdev_start_xmit(ops, skb, dev, more);
+	__set_current_tx_socket(old_sk);
 	if (rc == NETDEV_TX_OK)
 		txq_trans_update(dev, txq);
 
diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
index 7223f6f4e2b40..0ecbb404038a0 100644
--- a/include/linux/page-flags.h
+++ b/include/linux/page-flags.h
@@ -923,6 +923,7 @@ enum pagetype {
 	PGTY_zsmalloc		= 0xf6,
 	PGTY_unaccepted		= 0xf7,
 	PGTY_large_kmalloc	= 0xf8,
+	PGTY_zcswiotlb		= 0xf9,
 
 	PGTY_mapcount_underflow = 0xff
 };
@@ -1055,6 +1056,9 @@ PAGE_TYPE_OPS(Zsmalloc, zsmalloc, zsmalloc)
 PAGE_TYPE_OPS(Unaccepted, unaccepted, unaccepted)
 PAGE_TYPE_OPS(LargeKmalloc, large_kmalloc, large_kmalloc)
 
+/* Pages in socket buffers from the swiotlb pool. */
+PAGE_TYPE_OPS(ZCSwiotlb, zcswiotlb, zcswiotlb)
+
 /**
  * PageHuge - Determine if the page belongs to hugetlbfs
  * @page: The page to test.
diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index 3f06254ab1b72..62340909409e5 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -3787,7 +3787,12 @@ static inline void skb_frag_page_copy(skb_frag_t *fragto,
 	fragto->netmem = fragfrom->netmem;
 }
 
-bool skb_page_frag_refill(unsigned int sz, struct page_frag *pfrag, gfp_t prio);
+/* zerocopy swiotlb uses an additional non-null struct sock pointer. */
+bool __skb_page_frag_refill(unsigned int sz, struct page_frag *pfrag, gfp_t prio, struct sock *sk);
+static inline bool skb_page_frag_refill(unsigned int sz, struct page_frag *pfrag, gfp_t prio)
+{
+	return __skb_page_frag_refill(sz, pfrag, prio, NULL);
+}
 
 /**
  * __skb_frag_dma_map - maps a paged fragment via the DMA API
diff --git a/include/linux/swiotlb.h b/include/linux/swiotlb.h
index 3dae0f592063e..bd2d0e160a9d8 100644
--- a/include/linux/swiotlb.h
+++ b/include/linux/swiotlb.h
@@ -7,8 +7,10 @@
 #include <linux/init.h>
 #include <linux/types.h>
 #include <linux/limits.h>
+#include <linux/percpu.h>
 #include <linux/spinlock.h>
 #include <linux/workqueue.h>
+#include <linux/atomic.h>
 
 struct device;
 struct page;
@@ -122,6 +124,9 @@ struct io_tlb_mem {
 	atomic_long_t total_used;
 	atomic_long_t used_hiwater;
 	atomic_long_t transient_nslabs;
+#else
+	unsigned long last_used_slots;
+	unsigned long last_used_jiffies;
 #endif
 };
 
@@ -185,6 +190,69 @@ bool is_swiotlb_active(struct device *dev);
 void __init swiotlb_adjust_size(unsigned long size);
 phys_addr_t default_swiotlb_base(void);
 phys_addr_t default_swiotlb_limit(void);
+
+/* Helpers for zerocopy swiotlb. */
+/* Control allocation fraction. */
+extern unsigned int swiotlb_zc_tx_percent;
+
+/* Track freshness of the leaf device info. */
+extern atomic_t global_device_serial;
+
+static inline u32 swiotlb_get_device_serial(void)
+{
+	return atomic_read(&global_device_serial);
+}
+
+static inline void swiotlb_device_deleted(void)
+{
+	atomic_inc(&global_device_serial);
+}
+
+struct page *swiotlb_alloc_pages(struct device *dev, unsigned int order);
+bool swiotlb_free_pages(struct page *page, bool where_debug_only);
+void swiotlb_safe_put_device(struct device *dev);
+
+static inline void swiotlb_set_page_dev(struct page *page, struct device *dev)
+{
+	page->private = (unsigned long)dev;
+}
+
+static inline struct device *swiotlb_page_to_dev(struct page *page)
+{
+	return (struct device *)compound_head(page)->private;
+}
+
+static inline bool is_zerocopy_swiotlb_folio(struct page *page)
+{
+	struct folio *folio = page_folio(page);
+
+	return folio_test_zcswiotlb(folio) && folio->private != 0;
+}
+
+/* These two are in mm/page_alloc.c */
+void swiotlb_prep_compound_page(struct page *page, unsigned int order);
+void swiotlb_destroy_compound_page(struct page *page, unsigned int order);
+
+#if defined(CONFIG_NET)
+/*
+ * Track the socket for the currently transmitted packet, so the dma mapping
+ * function can record there the leaf device if it needs bounce buffers.
+ */
+struct sock;
+DECLARE_PER_CPU(struct sock *, current_tx_socket);
+void sk_set_bounce_device(struct sock *sk, struct device *dev);
+static inline void dma_learn_bounce_device(struct device *dev)
+{
+	struct sock *sk = this_cpu_read(current_tx_socket);
+
+	if (sk)
+		sk_set_bounce_device(sk, dev);
+}
+#else
+static inline void dma_learn_bounce_device(struct device *dev) {}
+#endif
+/* End helpers for zerocopy swiotlb. */
+
 #else
 static inline void swiotlb_init(bool addressing_limited, unsigned int flags)
 {
@@ -234,6 +302,12 @@ static inline phys_addr_t default_swiotlb_limit(void)
 {
 	return 0;
 }
+
+/* zerocopy swiotlb stubs */
+static inline bool swiotlb_free_pages(struct page *page, bool where_debug_only) { return false; }
+static inline u32 swiotlb_get_device_serial(void) { return 0; }
+static inline void swiotlb_device_deleted(void) {}
+
 #endif /* CONFIG_SWIOTLB */
 
 phys_addr_t swiotlb_tbl_map_single(struct device *hwdev, phys_addr_t phys,
diff --git a/include/net/sock.h b/include/net/sock.h
index dccd3738c3687..1e6caf4bd1366 100644
--- a/include/net/sock.h
+++ b/include/net/sock.h
@@ -47,6 +47,7 @@
 #include <linux/skbuff.h>	/* struct sk_buff */
 #include <linux/mm.h>
 #include <linux/security.h>
+#include <linux/swiotlb.h>
 #include <linux/slab.h>
 #include <linux/uaccess.h>
 #include <linux/page_counter.h>
@@ -70,6 +71,14 @@
 #include <net/l3mdev.h>
 #include <uapi/linux/socket.h>
 
+#ifdef CONFIG_SWIOTLB
+struct sk_swiotlb_info {
+	struct device		*dev;
+	u32			serial;
+	unsigned long		jiffies;
+};
+#endif
+
 /*
  * This structure really needs to be cleaned up.
  * Most of it is for TCP, and not used by any of
@@ -602,8 +611,28 @@ struct sock {
 #if IS_ENABLED(CONFIG_PROVE_LOCKING) && IS_ENABLED(CONFIG_MODULES)
 	struct module		*sk_owner;
 #endif
+#ifdef CONFIG_SWIOTLB
+	struct sk_swiotlb_info	sk_swiotlb;
+#endif
 };
 
+#ifdef CONFIG_SWIOTLB
+static inline void sk_init_bounce_device(struct sock *sk)
+{
+	sk->sk_swiotlb.dev = NULL;
+}
+static inline void sk_cleanup_bounce_device(struct sock *sk)
+{
+	if (sk->sk_swiotlb.dev) {
+		swiotlb_safe_put_device(sk->sk_swiotlb.dev);
+		sk->sk_swiotlb.dev = NULL;
+	}
+}
+#else
+static inline void sk_init_bounce_device(struct sock *sk) {}
+static inline void sk_cleanup_bounce_device(struct sock *sk) {}
+#endif
+
 struct sock_bh_locked {
 	struct sock *sock;
 	local_lock_t bh_lock;
diff --git a/kernel/dma/swiotlb.c b/kernel/dma/swiotlb.c
index 1abd3e6146f45..e27f23d03c482 100644
--- a/kernel/dma/swiotlb.c
+++ b/kernel/dma/swiotlb.c
@@ -37,12 +37,16 @@
 #include <linux/mm.h>
 #include <linux/pfn.h>
 #include <linux/rculist.h>
+#include <linux/refcount.h>
 #include <linux/scatterlist.h>
 #include <linux/set_memory.h>
 #include <linux/spinlock.h>
 #include <linux/string.h>
 #include <linux/swiotlb.h>
+#include <linux/moduleparam.h>
+#include <linux/percpu.h>
 #include <linux/types.h>
+#include <linux/atomic.h>
 #ifdef CONFIG_DMA_RESTRICTED_POOL
 #include <linux/of.h>
 #include <linux/of_fdt.h>
@@ -81,6 +85,17 @@ struct io_tlb_slot {
 static bool swiotlb_force_bounce;
 static bool swiotlb_force_disable;
 
+/**
+ * global_device_serial - Global sequence number for device deletions
+ *
+ * Incremented every time a device is unregistered (in device_del()).
+ * Used by subsystems (like SWIOTLB zero-copy sockets) as a fast, lockless
+ * O(1) cache invalidation serial to detect when a cached device pointer
+ * might have been deleted and needs to be expired to prevent Use-After-Free.
+ */
+atomic_t global_device_serial = ATOMIC_INIT(0);
+EXPORT_SYMBOL(global_device_serial);
+
 #ifdef CONFIG_SWIOTLB_DYNAMIC
 
 static void swiotlb_dyn_alloc(struct work_struct *work);
@@ -1442,6 +1457,8 @@ phys_addr_t swiotlb_tbl_map_single(struct device *dev, phys_addr_t orig_addr,
 	offset &= (IO_TLB_SIZE - 1);
 	index += pad_slots;
 	pool->slots[index].pad_slots = pad_slots;
+	/* Fix an upstream bug with alloc_align_mask = 0xffff */
+	pool->slots[index].alloc_size = mapping_size;
 	for (i = 0; i < (nr_slots(size) - pad_slots); i++)
 		pool->slots[index + i].orig_addr = slot_addr(orig_addr, i);
 	tlb_addr = slot_addr(pool->start, index) + offset;
@@ -1554,6 +1571,13 @@ void __swiotlb_tbl_unmap_single(struct device *dev, phys_addr_t tlb_addr,
 		size_t mapping_size, enum dma_data_direction dir,
 		unsigned long attrs, struct io_tlb_pool *pool)
 {
+	/*
+	 * Recognize and avoid unmapping pages allocated for Zero-Copy SWIOTLB Page Bypass.
+	 * They will be eventually released when the page reference count drops to 0.
+	 */
+	if (is_zerocopy_swiotlb_folio(pfn_to_page(PHYS_PFN(tlb_addr))))
+		return;
+
 	/*
 	 * First, sync the memory before unmapping the entry
 	 */
@@ -1597,6 +1621,21 @@ dma_addr_t swiotlb_map(struct device *dev, phys_addr_t paddr, size_t size,
 	phys_addr_t swiotlb_addr;
 	dma_addr_t dma_addr;
 
+	dma_learn_bounce_device(dev);
+
+	/*
+	 * If the page was allocated via Zero-Copy SWIOTLB Page Bypass, it is likely
+	 * already good for DMA so we can return its dma address.
+	 */
+	if (is_zerocopy_swiotlb_folio(pfn_to_page(PHYS_PFN(paddr)))) {
+		dma_addr = phys_to_dma_unencrypted(dev, paddr);
+		if (likely(dma_capable(dev, dma_addr, size, true))) {
+			if (!dev_is_dma_coherent(dev) && !(attrs & DMA_ATTR_SKIP_CPU_SYNC))
+				arch_sync_dma_for_device(paddr, size, dir);
+			return dma_addr;
+		}
+	}
+
 	trace_swiotlb_bounced(dev, phys_to_dma(dev, paddr), size);
 
 	swiotlb_addr = swiotlb_tbl_map_single(dev, paddr, size, 0, dir, attrs);
@@ -1899,3 +1938,191 @@ static const struct reserved_mem_ops rmem_swiotlb_ops = {
 
 RESERVEDMEM_OF_DECLARE(dma, "restricted-dma-pool", &rmem_swiotlb_ops);
 #endif /* CONFIG_DMA_RESTRICTED_POOL */
+
+/*
+ * Asynchronous/Deferred Device Release.
+ * put_device() can trigger the final release path of a device which may sleep.
+ * Since SWIOTLB pages can be freed in atomic or interrupt context (e.g. TX completion),
+ * we must defer the put_device() call to task context using a workqueue.
+ */
+struct swiotlb_deferred_put {
+	struct work_struct work;
+	struct device *dev;
+};
+
+static void swiotlb_deferred_put_work(struct work_struct *work)
+{
+	struct swiotlb_deferred_put *dp = container_of(work, struct swiotlb_deferred_put, work);
+
+	put_device(dp->dev);
+	kfree(dp);
+}
+
+/**
+ * swiotlb_safe_put_device() - Safely release device reference from atomic/interrupt context
+ * @dev: The device structure to release.
+ *
+ * Enqueues a deferred put_device() call on a workqueue using GFP_ATOMIC.
+ * If memory allocation fails, the reference is leaked to avoid an immediate crash.
+ */
+void swiotlb_safe_put_device(struct device *dev)
+{
+	struct swiotlb_deferred_put *dp;
+
+	if (!dev)
+		return;
+
+	/*
+	 * FAST PATH (O(1) lockless): If this is not the last reference,
+	 * we can decrement it atomically and safely in any context
+	 * without allocating memory or scheduling work!
+	 */
+	if (refcount_dec_not_one(&dev->kobj.kref.refcount))
+		return;
+
+	/*
+	 * SLOW PATH: It is the last reference (refcount == 1). We must
+	 * defer the final put_device() to task context because it will
+	 * trigger device_release() which can sleep.
+	 */
+	dp = kmalloc_obj(*dp, GFP_ATOMIC);
+	if (dp) {
+		INIT_WORK(&dp->work, swiotlb_deferred_put_work);
+		dp->dev = dev;
+		schedule_work(&dp->work);
+	} else {
+		pr_warn_ratelimited("swiotlb: failed to allocate deferred put, leaking device ref\n");
+	}
+}
+EXPORT_SYMBOL_GPL(swiotlb_safe_put_device);
+
+unsigned int swiotlb_zc_tx_percent;
+module_param_named(zerocopy_tx_percent, swiotlb_zc_tx_percent, uint, 0644);
+
+static unsigned long fast_mem_used(struct io_tlb_mem *mem)
+{
+#ifdef CONFIG_DEBUG_FS
+	return mem_used(mem);
+#else
+	unsigned long last_j = READ_ONCE(mem->last_used_jiffies);
+	unsigned long now = jiffies;
+
+	if (time_after(now, last_j + HZ / 100) &&
+	    try_cmpxchg(&mem->last_used_jiffies, &last_j, now)) {
+		WRITE_ONCE(mem->last_used_slots, mem_used(mem));
+	}
+	return READ_ONCE(mem->last_used_slots);
+#endif
+}
+
+/**
+ * swiotlb_alloc_pages() - Allocate long-lived contiguous pages from SWIOTLB pool
+ * @dev: Device which requires the SWIOTLB bounce buffers.
+ * @order: Allocation order (log2 of number of pages).
+ */
+struct page *swiotlb_alloc_pages(struct device *dev, unsigned int order)
+{
+	struct io_tlb_mem *mem = dev->dma_io_tlb_mem;
+	struct io_tlb_pool *pool;
+	int npages = 1 << order;
+	unsigned int max_pct;
+	phys_addr_t tlb_addr;
+	struct page *page;
+	int index;
+
+	if (!mem || !mem->nslabs)
+		return NULL;
+
+	max_pct = clamp(READ_ONCE(swiotlb_zc_tx_percent), 0u, 90u);
+	if (max_pct == 0 || max_pct * mem->nslabs <= fast_mem_used(mem) * 100)
+		return NULL;
+
+	/*
+	 * Enforce natural alignment for compound pages. The mask-based
+	 * compound_head() optimization (used when HVO is enabled and struct page
+	 * size is a power of 2) assumes that compound pages are naturally aligned
+	 * to their size. Without this, compound_head() on tail pages can return
+	 * a wrong head page pointer, leading to refcount corruption.
+	 */
+	index = swiotlb_find_slots(dev, 0, PAGE_SIZE * npages, ~(PAGE_MASK << order), &pool);
+	if (index == -1)
+		return NULL;
+
+	tlb_addr = slot_addr(pool->start, index);
+
+	pool->slots[index].pad_slots = 0;
+	pool->slots[index].alloc_size = PAGE_SIZE * npages;
+
+	page = pfn_to_page(PHYS_PFN(tlb_addr));
+
+	set_page_count(page, 1);
+
+	/* Strictly tag page[0] to prevent clobbering folio tail overlays */
+	__SetPageZCSwiotlb(page);
+
+	swiotlb_set_page_dev(page, dev);
+	get_device(dev);
+	swiotlb_prep_compound_page(page, order);
+	return page;
+}
+EXPORT_SYMBOL_GPL(swiotlb_alloc_pages);
+
+/*
+ * Debugging to track how swiotlb_free_pages() was called.
+ * b2: 0 from __free_frozen_pages(), 1 from free_unref_folios()
+ * b1: pool found, b0: dev present
+ */
+static unsigned long zc_debug[8];
+static int ctrs_num = 8;
+module_param_array(zc_debug, ulong, &ctrs_num, 0644);
+static void __zc_debug_stats(bool where, bool has_dev, bool has_pool)
+{
+	zc_debug[has_dev + has_pool * 2 + where * 4]++;
+}
+
+/**
+ * swiotlb_free_pages() - Free pages allocated via swiotlb_alloc_pages()
+ * @page: The starting struct page to release.
+ */
+bool swiotlb_free_pages(struct page *page, bool where_debug_only)
+{
+	struct page *head = compound_head(page);
+	struct device *dev = swiotlb_page_to_dev(head);
+	phys_addr_t head_tlb_addr = page_to_phys(head);
+	struct io_tlb_pool *pool;
+	int index, npages, i;
+
+	if (!folio_test_zcswiotlb(page_folio(head)))
+		return false;
+
+	pool = dev ? swiotlb_find_pool(dev, head_tlb_addr) : NULL;
+	__zc_debug_stats(where_debug_only, !!dev, !!pool);
+
+	/* Check for any false positives. */
+	if (!pool)
+		return false;
+
+	/* Read alloc_size first, it is reset by swiotlb_release_slots(). */
+	index = (head_tlb_addr - pool->start) >> IO_TLB_SHIFT;
+	npages = pool->slots[index].alloc_size >> PAGE_SHIFT;
+
+	WARN_ON_ONCE(!is_power_of_2(npages));
+
+	/* Step 1: Sever compound links (clobbers compound_info / lru.next) */
+	swiotlb_destroy_compound_page(head, ilog2(npages));
+
+	/* Step 2: Re-init LRU, drop refcounts, and strip flag across all constituent pages */
+	for (i = 0; i < npages; i++) {
+		INIT_LIST_HEAD(&head[i].lru);
+		set_page_count(&head[i], 0);
+		head[i].private = 0;
+		__ClearPageZCSwiotlb(&head[i]);
+	}
+
+	/* Step 3: Safely release slots back to the pool */
+	swiotlb_release_slots(dev, head_tlb_addr, pool);
+	swiotlb_del_transient(dev, head_tlb_addr, pool);
+	swiotlb_safe_put_device(dev);
+	return true;
+}
+EXPORT_SYMBOL_GPL(swiotlb_free_pages);
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index d49c254174da7..eaba683b5b2a8 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -16,6 +16,7 @@
 
 #include <linux/stddef.h>
 #include <linux/mm.h>
+#include <linux/swiotlb.h>
 #include <linux/highmem.h>
 #include <linux/interrupt.h>
 #include <linux/jiffies.h>
@@ -705,6 +706,31 @@ void prep_compound_page(struct page *page, unsigned int order)
 	prep_compound_head(page, order);
 }
 
+#ifdef CONFIG_SWIOTLB
+void swiotlb_prep_compound_page(struct page *page, unsigned int order)
+{
+	if (order > 0)
+		prep_compound_page(page, order);
+}
+
+void swiotlb_destroy_compound_page(struct page *page, unsigned int order)
+{
+	if (order > 0) {
+		struct folio *folio = (struct folio *)page;
+
+		__ClearPageHead(page);
+		page[1].flags.f &= ~PAGE_FLAGS_SECOND;
+#ifdef NR_PAGES_IN_LARGE_FOLIO
+		folio->_nr_pages = 0;
+#endif
+		for (int i = 1; i < (1 << order); i++) {
+			page[i].mapping = NULL;
+			clear_compound_head(&page[i]);
+		}
+	}
+}
+#endif /* CONFIG_SWIOTLB */
+
 static inline void set_buddy_order(struct page *page, unsigned int order)
 {
 	set_page_private(page, order);
@@ -2930,6 +2956,9 @@ static void __free_frozen_pages(struct page *page, unsigned int order,
 	unsigned long pfn = page_to_pfn(page);
 	int migratetype;
 
+	if (unlikely(swiotlb_free_pages(page, false)))
+		return;
+
 	if (!pcp_allowed_order(order)) {
 		__free_pages_ok(page, order, fpi_flags);
 		return;
@@ -2996,6 +3025,9 @@ void free_unref_folios(struct folio_batch *folios)
 		unsigned long pfn = folio_pfn(folio);
 		unsigned int order = folio_order(folio);
 
+		if (unlikely(swiotlb_free_pages(&folio->page, true)))
+			continue;
+
 		if (!__free_pages_prepare(&folio->page, order, FPI_NONE))
 			continue;
 		/*
diff --git a/net/core/sock.c b/net/core/sock.c
index d097025c116a8..c6fbb469f9ce5 100644
--- a/net/core/sock.c
+++ b/net/core/sock.c
@@ -103,6 +103,9 @@
 #include <linux/sockios.h>
 #include <linux/net.h>
 #include <linux/mm.h>
+#include <linux/swiotlb.h>
+#include <linux/device.h>
+#include <linux/moduleparam.h>
 #include <linux/slab.h>
 #include <linux/interrupt.h>
 #include <linux/poll.h>
@@ -152,6 +155,83 @@
 
 #include "dev.h"
 
+#ifdef CONFIG_SWIOTLB
+
+DEFINE_PER_CPU(struct sock *, current_tx_socket);
+EXPORT_PER_CPU_SYMBOL(current_tx_socket);
+
+void sk_set_bounce_device(struct sock *sk, struct device *dev)
+{
+	struct device *old_dev;
+
+	if (in_hardirq() || !sk_fullsock(sk) || sock_flag(sk, SOCK_ZEROCOPY))
+		return;
+
+	old_dev = READ_ONCE(sk->sk_swiotlb.dev);
+
+	if (dev != old_dev) {
+		/* Rate-limit updates to once per second to prevent bonding thrashing */
+		if (old_dev && time_before(jiffies, sk->sk_swiotlb.jiffies + HZ))
+			return;
+
+		get_device(dev);
+
+		/* Atomically swap in the new device and get the actual old one */
+		old_dev = xchg(&sk->sk_swiotlb.dev, dev);
+
+		WRITE_ONCE(sk->sk_swiotlb.serial, swiotlb_get_device_serial());
+		sk->sk_swiotlb.jiffies = jiffies;
+
+		/* Only drop the reference to the device we actually replaced */
+		if (old_dev)
+			swiotlb_safe_put_device(old_dev);
+	}
+}
+EXPORT_SYMBOL(sk_set_bounce_device);
+
+/*
+ * Wrap alloc_pages in __skb_page_frag_refill(). If the socket's dma_device requires
+ * SWIOTLB bounce buffering, divert allocation to the SWIOTLB slot allocator.
+ * This ensures the packet payload is written directly to a bounce buffer from the start,
+ * enabling zero-copy during driver DMA mapping.
+ */
+static inline struct page *alloc_any_pg(gfp_t gfp, unsigned int order, struct sock *sk)
+{
+	if (sk && READ_ONCE(swiotlb_zc_tx_percent) && !sock_flag(sk, SOCK_ZEROCOPY)) {
+		u32 serial = READ_ONCE(sk->sk_swiotlb.serial);
+		struct device *dev;
+
+		/* Force serial read BEFORE device pointer read. */
+		smp_rmb();
+
+		dev = READ_ONCE(sk->sk_swiotlb.dev);
+
+		if (dev) {
+			/*
+			 * The serial check is just for cache invalidation, UAF is
+			 * protected by the reference held in the sk.
+			 */
+			if (swiotlb_get_device_serial() != serial) {
+				if (cmpxchg(&sk->sk_swiotlb.dev, dev, NULL) == dev)
+					swiotlb_safe_put_device(dev);
+			} else {
+				struct page *page = swiotlb_alloc_pages(dev, order);
+
+				if (page)
+					return page;
+				/* On failure, fallback to alloc_pages(). */
+			}
+		}
+	}
+	return alloc_pages(gfp, order);
+}
+#else
+static inline struct page *alloc_any_pg(gfp_t gfp, unsigned int order, struct sock *sk)
+{
+	return alloc_pages(gfp, order);
+}
+#endif
+
 static DEFINE_MUTEX(proto_list_mutex);
 static LIST_HEAD(proto_list);
 
@@ -2383,6 +2463,7 @@ static void __sk_destruct(struct rcu_head *head)
 		__netns_tracker_free(net, &sk->ns_tracker, false);
 		net_passive_dec(net);
 	}
+	sk_cleanup_bounce_device(sk);
 	sk_prot_free(sk->sk_prot_creator, sk);
 }
 
@@ -2485,6 +2566,7 @@ struct sock *sk_clone(const struct sock *sk, const gfp_t priority,
 		goto out;
 
 	sock_copy(newsk, sk);
+	sk_init_bounce_device(newsk);
 
 	newsk->sk_prot_creator = prot;
 
@@ -3134,7 +3216,7 @@ DEFINE_STATIC_KEY_FALSE(net_high_order_alloc_disable_key);
  * no guarantee that allocations succeed. Therefore, @sz MUST be
  * less or equal than PAGE_SIZE.
  */
-bool skb_page_frag_refill(unsigned int sz, struct page_frag *pfrag, gfp_t gfp)
+bool __skb_page_frag_refill(unsigned int sz, struct page_frag *pfrag, gfp_t gfp, struct sock *sk)
 {
 	if (pfrag->page) {
 		if (page_ref_count(pfrag->page) == 1) {
@@ -3150,27 +3232,27 @@ bool skb_page_frag_refill(unsigned int sz, struct page_frag *pfrag, gfp_t gfp)
 	if (SKB_FRAG_PAGE_ORDER &&
 	    !static_branch_unlikely(&net_high_order_alloc_disable_key)) {
 		/* Avoid direct reclaim but allow kswapd to wake */
-		pfrag->page = alloc_pages((gfp & ~__GFP_DIRECT_RECLAIM) |
-					  __GFP_COMP | __GFP_NOWARN |
-					  __GFP_NORETRY,
-					  SKB_FRAG_PAGE_ORDER);
+		pfrag->page = alloc_any_pg((gfp & ~__GFP_DIRECT_RECLAIM) |
+					   __GFP_COMP | __GFP_NOWARN |
+					   __GFP_NORETRY,
+					   SKB_FRAG_PAGE_ORDER, sk);
 		if (likely(pfrag->page)) {
 			pfrag->size = PAGE_SIZE << SKB_FRAG_PAGE_ORDER;
 			return true;
 		}
 	}
-	pfrag->page = alloc_page(gfp);
+	pfrag->page = alloc_any_pg(gfp, 0, sk);
 	if (likely(pfrag->page)) {
 		pfrag->size = PAGE_SIZE;
 		return true;
 	}
 	return false;
 }
-EXPORT_SYMBOL(skb_page_frag_refill);
+EXPORT_SYMBOL(__skb_page_frag_refill);
 
 bool sk_page_frag_refill(struct sock *sk, struct page_frag *pfrag)
 {
-	if (likely(skb_page_frag_refill(32U, pfrag, sk->sk_allocation)))
+	if (likely(__skb_page_frag_refill(32U, pfrag, sk->sk_allocation, sk)))
 		return true;
 
 	if (!sk->sk_bypass_prot_mem)
-- 
2.54.0.1136.gdb2ca164c4-goog



* Re: [PATCH net-next v14 0/2] tcp: rehash onto different local ECMP path on retransmit timeout
From: patchwork-bot+netdevbpf @ 2026-06-15 23:40 UTC (permalink / raw)
  To: Neil Spring
  Cc: netdev, edumazet, ncardwell, kuniyu, davem, kuba, dsahern, pabeni,
	horms, shuah, linux-kselftest, bpf, martin.lau, daniel
In-Reply-To: <20260615042158.1600746-1-ntspring@meta.com>

Hello:

This patch was applied to netdev/net-next.git (main)
by Jakub Kicinski <kuba@kernel.org>:

On Sun, 14 Jun 2026 21:21:56 -0700 you wrote:
> Currently sk_rethink_txhash() re-rolls the socket's txhash on RTO,
> PLB, and spurious-retransmission events, but the new hash is not
> propagated into the IPv6 ECMP path selection.  The cached
> route is reused and fib6_select_path() is never re-invoked, so
> the connection uses the same local ECMP decision.
> 
> This series adds the two missing pieces:
> 
> [...]

Here is the summary with links:
  - [v14,1/2] tcp: rehash onto different local ECMP path on retransmit timeout
    https://git.kernel.org/netdev/net-next/c/658eb696544c

You are awesome, thank you!
-- 
Deet-doot-dot, I am a bot.
https://korg.docs.kernel.org/patchwork/pwbot.html




* Re: [PATCH v14] selftests: net: add local ECMP rehash test
From: patchwork-bot+netdevbpf @ 2026-06-15 23:40 UTC (permalink / raw)
  To: Neil Spring
  Cc: netdev, edumazet, ncardwell, kuniyu, davem, kuba, dsahern, pabeni,
	horms, shuah, linux-kselftest, bpf, martin.lau, daniel
In-Reply-To: <20260615042158.1600746-3-ntspring@meta.com>

Hello:

This patch was applied to netdev/net-next.git (main)
by Jakub Kicinski <kuba@kernel.org>:

On Sun, 14 Jun 2026 21:21:58 -0700 you wrote:
> Add ecmp_rehash.sh with nine scenarios verifying that TCP rehash
> selects a different local ECMP path for IPv6:
> 
>   - SYN retransmission (forward path blocked during setup)
>   - SYN/ACK retransmission (reverse path blocked during setup)
>   - Midstream RTO (forward path blocked on established connection)
>   - Midstream ACK rehash (reverse path blocked on established connection)
>   - PLB rehash (ECN-driven congestion on established connection)
>   - Hash policy 1 negative test (rehash attempted but path unchanged)
>   - No flowlabel leak (client mp_hash does not alter on-wire flowlabel)
>   - Dst rebuild consistency (dst invalidation does not change path)
>   - Syncookie server path consistency (SYN-ACK and post-cookie ACKs
>     use the same ECMP path)
> 
> [...]

Here is the summary with links:
  - [v14] selftests: net: add local ECMP rehash test
    https://git.kernel.org/netdev/net-next/c/ae57a5533243

You are awesome, thank you!
-- 
Deet-doot-dot, I am a bot.
https://korg.docs.kernel.org/patchwork/pwbot.html




* Re: [PATCH net v5 0/4] MAC-PHY interrupt changed to level triggered interrupt
From: patchwork-bot+netdevbpf @ 2026-06-15 23:40 UTC (permalink / raw)
  To: Selvamani Rajagopal
  Cc: parthiban.veerasooran, andrew+netdev, davem, edumazet, kuba,
	pabeni, robh, krzk+dt, conor+dt, pier.beruto, andrew, netdev,
	linux-kernel, conor.dooley, devicetree, Parthiban.Veerasooran,
	Selvamani.Rajagopal
In-Reply-To: <20260611-level-trigger-v5-0-4533a9e85ce2@onsemi.com>

Hello:

This series was applied to netdev/net.git (main)
by Jakub Kicinski <kuba@kernel.org>:

On Thu, 11 Jun 2026 14:55:37 -0700 you wrote:
> According to OPEN Alliance 10BASE-T1x MAC-PHY Serial Interface
> specification, MAC-PHY interrupt is "active low, level triggered".
> The specification mentions about the conditions in which the IRQ
> is asserted and deasserted.
> 
> Bug is inadvertently introduced by treating the IRQ in the OA TC6
> framework driver and in dt-binding YAML file as edge triggered.
> 
> [...]

Here is the summary with links:
  - [net,v5,1/4] net: ethernet: oa_tc6: Interrupt is active low, level triggered.
    https://git.kernel.org/netdev/net/c/b542d13fab0f
  - [net,v5,2/4] net: ethernet: oa_tc6: mdiobus->parent initialized with NULL
    https://git.kernel.org/netdev/net/c/a221d3f7e3f3
  - [net,v5,3/4] net: ethernet: oa_tc6: Remove FCS size in RX frame
    https://git.kernel.org/netdev/net/c/a5a1d11dd372
  - [net,v5,4/4] dt-bindings: net: updated interrupt type to be active low, level triggered
    https://git.kernel.org/netdev/net/c/31e56112e654

You are awesome, thank you!
-- 
Deet-doot-dot, I am a bot.
https://korg.docs.kernel.org/patchwork/pwbot.html




* Re: [PATCH net 0/4] ICSSG XDP zero copy bug fixes
From: patchwork-bot+netdevbpf @ 2026-06-15 23:40 UTC (permalink / raw)
  To: Meghana Malladi
  Cc: diogo.ivo, haokexin, vadim.fedorenko, devnexen, horms,
	jacob.e.keller, sdf, john.fastabend, hawk, daniel, ast, pabeni,
	kuba, edumazet, davem, andrew+netdev, bpf, linux-kernel, netdev,
	linux-arm-kernel, srk, vigneshr, rogerq, danishanwar
In-Reply-To: <20260611185744.2498070-1-m-malladi@ti.com>

Hello:

This series was applied to netdev/net.git (main)
by Jakub Kicinski <kuba@kernel.org>:

On Fri, 12 Jun 2026 00:27:40 +0530 you wrote:
> This patch series fixes bugs introduced while adding xdp
> zero copy support in the icssg driver.
> 
> Patch 1/4: Fix wakeup handling for Rx when available CPPI
> descriptor is zero
> Patch 2,3/4: Fix destination tag in CPPI descriptor to enable
> proper Tx xmit for HSR offload mode with XDP and zero copy
> Patch 4/4: Fix Tx copy wakeup handling for XDP zero copy
> 
> [...]

Here is the summary with links:
  - [net,1/4] net: ti: icssg-prueth: Fix AF_XDP fill ring alloc and wakeup condition
    https://git.kernel.org/netdev/net/c/dfb787f7d157
  - [net,2/4] net: ti: icssg: Use undirected TX tag for native XDP in HSR offload mode
    https://git.kernel.org/netdev/net/c/bcbf73d98195
  - [net,3/4] net: ti: icssg: Use undirected TX tag for XDP zero copy in HSR offload mode
    https://git.kernel.org/netdev/net/c/f9691288413c
  - [net,4/4] net: ti: icssg: Fix XSK zero copy TX during application wakeup
    (no matching commit)

You are awesome, thank you!
-- 
Deet-doot-dot, I am a bot.
https://korg.docs.kernel.org/patchwork/pwbot.html




* Re: [PATCH net-next v7 2/2] net: ti: icssg-prueth: Add ethtool ops for Frame Preemption MAC Merge
From: Jakub Kicinski @ 2026-06-15 23:39 UTC (permalink / raw)
  To: m-malladi
  Cc: elfring, haokexin, vadim.fedorenko, devnexen, horms,
	jacob.e.keller, arnd, basharath, afd, parvathi, vladimir.oltean,
	rogerq, danishanwar, pabeni, edumazet, davem, andrew+netdev,
	linux-arm-kernel, netdev, linux-kernel, srk, vigneshr
In-Reply-To: <20260615231041.1007484-1-kuba@kernel.org>

On Mon, 15 Jun 2026 16:10:41 -0700 Jakub Kicinski wrote:
> > diff --git a/drivers/net/ethernet/ti/icssg/icssg_stats.h b/drivers/net/ethernet/ti/icssg/icssg_stats.h
> > index 5ec0b38e0c67..8073deac35c3 100644
> > --- a/drivers/net/ethernet/ti/icssg/icssg_stats.h
> > +++ b/drivers/net/ethernet/ti/icssg/icssg_stats.h
> > @@ -189,6 +187,11 @@ static const struct icssg_pa_stats icssg_all_pa_stats[] = {
> >  	ICSSG_PA_STATS(FW_INF_DROP_PRIOTAGGED),
> >  	ICSSG_PA_STATS(FW_INF_DROP_NOTAG),
> >  	ICSSG_PA_STATS(FW_INF_DROP_NOTMEMBER),
> > +	ICSSG_PA_STATS(FW_PREEMPT_BAD_FRAG),
> > +	ICSSG_PA_STATS(FW_PREEMPT_ASSEMBLY_ERR),
> > +	ICSSG_PA_STATS(FW_PREEMPT_FRAG_CNT_TX),
> > +	ICSSG_PA_STATS(FW_PREEMPT_ASSEMBLY_OK),
> > +	ICSSG_PA_STATS(FW_PREEMPT_FRAG_CNT_RX),
> >  	ICSSG_PA_STATS(FW_RX_EOF_SHORT_FRMERR),
> >  	ICSSG_PA_STATS(FW_RX_B0_DROP_EARLY_EOF),
> >  	ICSSG_PA_STATS(FW_TX_JUMBO_FRM_CUTOFF),  
> 
> [Medium]
> Are these five new entries duplicating values that already have a
> standard uAPI?
> 
> The same five firmware counters are exposed through the new
> .get_mm_stats callback as the standardized MAC Merge stats
> (MACMergeFrameAssOkCount, MACMergeFrameAssErrorCount, MACMergeFragCountRx,
> MACMergeFragCountTx, MACMergeFrameSmdErrorCount in struct
> ethtool_mm_stats), and adding them to icssg_all_pa_stats[] also
> publishes them via emac_get_strings() / emac_get_ethtool_stats() as
> ethtool -S strings.
> 
> Documentation/networking/statistics.rst describes ethtool -S as the
> private-driver-stats interface; counters that have a standard uAPI are
> expected to flow only through that uAPI.
> 
> Could the firmware-register lookup table used by emac_get_stat_by_name()
> be separated from the ethtool -S string table, so the new preemption
> counters feed get_mm_stats without also showing up under ethtool -S?

This -- not sure about the other complaints but this one looks legit.


* Re: [PATCH net-next 9/9] net: sparx5: add neighbour event handling for L3 routing
From: Jakub Kicinski @ 2026-06-15 23:38 UTC (permalink / raw)
  To: jensemil.schulzostergaard
  Cc: Jakub Kicinski, horatiu.vultur, UNGLinuxDriver, andrew+netdev,
	davem, edumazet, pabeni, daniel.machon, Steen.Hegelund, kees,
	gustavoars, netdev, linux-kernel, linux-arm-kernel,
	linux-hardening
In-Reply-To: <20260612-sparx5_l3_routing-v1-9-fc3c10160f49@microchip.com>

This is an AI-generated review of your patch. The human sending this
email has considered the AI review valid, or at least plausible.
---
net: sparx5: add neighbour event handling for L3 routing

Register a netevent notifier to handle NETEVENT_NEIGH_UPDATE events.
This completes the L3 unicast forwarding data path by updating hardware
ARP table entries and linked nexthops when neighbours are resolved.

> diff --git a/drivers/net/ethernet/microchip/sparx5/sparx5_router.c b/drivers/net/ethernet/microchip/sparx5/sparx5_router.c
> --- a/drivers/net/ethernet/microchip/sparx5/sparx5_router.c
> +++ b/drivers/net/ethernet/microchip/sparx5/sparx5_router.c

[ ... ]

> @@ -2525,6 +2525,104 @@ static int sparx5_rr_fib_event(struct notifier_block *nb, unsigned long event,
> +static void sparx5_rr_neigh_event_work(struct work_struct *work)
> +{

[ ... ]

> +	if (entry->connected && entry_connected &&
> +	    ether_addr_equal(entry->hwaddr, hwaddr))
> +		goto out_mutex;
> +
> +	ether_addr_copy(entry->hwaddr, hwaddr);
> +	sparx5_rr_neigh_entry_update(sparx5, entry, entry_connected);

[Severity: High]
Does this code risk re-offloading an orphaned neighbour entry with a stale
VMID?

When a subnet route is deleted, the fib_entry is destroyed. This unlinks the
neigh_entries, sets neigh_entry->fib_entry to NULL, and un-offloads them.
However, if another nexthop still references this neighbour,
sparx5_rr_neigh_entry_put() skips destruction.

The interface's router leg might then be destroyed, freeing its VMID for
reuse.

If an ARP update arrives later, this code finds the orphaned neigh_entry and
calls sparx5_rr_neigh_entry_update() to offload it again. This could apply a
hardware VCAP rule using the potentially reassigned VMID.

Also, because this sets vrule_id_valid to true, when the referencing nexthop
is eventually deleted, sparx5_rr_neigh_entry_put() will refuse to destroy it.
Since fib_entry is NULL, it won't be found during module unload, permanently
leaking the memory and the netdevice tracker reference.

> +	sparx5_rr_nexthops_update_notify(sparx5, entry, entry_connected);
> +	if (!entry_connected)
> +		sparx5_rr_neigh_entry_put(sparx5, entry);
> +
> +out_mutex:
> +	mutex_unlock(&sparx5->router->lock);
> +out:
> +	neigh_release(n);
> +	kfree(net_work);
> +}

[ ... ]

> +static int sparx5_rr_netevent_event(struct notifier_block *nb,
> +				    unsigned long event, void *ptr)
> +{

[ ... ]

> +	switch (event) {
> +	case NETEVENT_NEIGH_UPDATE:
> +		n = ptr;
> +
> +		if (n->tbl->family != AF_INET && n->tbl->family != AF_INET6)
> +			return NOTIFY_DONE;
> +
> +		port = sparx5_port_dev_lower_find(n->dev);
> +		if (!port)
> +			return NOTIFY_DONE;
> +
> +		net_work = kzalloc_obj(*net_work, GFP_ATOMIC);
> +		if (!net_work)
> +			return NOTIFY_BAD;

[Severity: High]
Could returning NOTIFY_BAD on allocation failure interrupt the global
notification chain?

The NETEVENT_NEIGH_UPDATE event is broadcast to all registered listeners via
the netevent_notif_chain atomic notifier chain. Returning NOTIFY_BAD from the
notifier callback prematurely aborts the chain traversal.

If this allocation fails, it appears this driver will silently prevent all
subsequent subsystems and drivers from receiving the neighbour update, which
might cause system-wide stale ARP/NDP caches.

Should this return NOTIFY_DONE instead?
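
To make the concern concrete, here is a minimal userspace sketch of how a
notifier chain walk reacts to the stop bit. The NOTIFY_* values mirror
include/linux/notifier.h; call_chain(), the three callbacks and
listeners_reached() are hypothetical stand-ins for illustration, not the
kernel's or the driver's code.

```c
#include <assert.h>
#include <stddef.h>

/* Values mirror include/linux/notifier.h. */
#define NOTIFY_DONE      0x0000
#define NOTIFY_OK        0x0001
#define NOTIFY_STOP_MASK 0x8000
#define NOTIFY_BAD       (NOTIFY_STOP_MASK | 0x0002)

typedef int (*notifier_fn)(int *seen);

/* Hypothetical model of a chain walk: stop as soon as the stop bit is set. */
static int call_chain(notifier_fn *chain, size_t n, int *seen)
{
	int ret = NOTIFY_DONE;
	size_t i;

	assert(chain && seen);
	for (i = 0; i < n; i++) {
		ret = chain[i](seen);
		if (ret & NOTIFY_STOP_MASK)
			break;
	}
	return ret;
}

/* Models the driver callback when its GFP_ATOMIC allocation fails. */
static int driver_alloc_fails_bad(int *seen)
{
	(void)seen;
	return NOTIFY_BAD;
}

static int driver_alloc_fails_done(int *seen)
{
	(void)seen;
	return NOTIFY_DONE;
}

/* Models any listener registered after the driver on the same chain. */
static int later_listener(int *seen)
{
	(*seen)++;
	return NOTIFY_OK;
}

/* Returns how many later listeners observed the event. */
static int listeners_reached(notifier_fn first)
{
	notifier_fn chain[] = { first, later_listener };
	int seen = 0;

	call_chain(chain, 2, &seen);
	return seen;
}
```

In this model the later listener is skipped only when the first callback
returns a value carrying NOTIFY_STOP_MASK, which is the behaviour the
question above is about.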

> +
> +		INIT_WORK(&net_work->work, sparx5_rr_neigh_event_work);
> +		net_work->sparx5 = router->sparx5;
> +		net_work->neigh = neigh_clone(n);


* Re: [PATCH net-next 8/9] net: sparx5: add L3 FIB, nexthop and neighbour entry management
From: Jakub Kicinski @ 2026-06-15 23:38 UTC (permalink / raw)
  To: jensemil.schulzostergaard
  Cc: Jakub Kicinski, horatiu.vultur, UNGLinuxDriver, andrew+netdev,
	davem, edumazet, pabeni, daniel.machon, Steen.Hegelund, kees,
	gustavoars, netdev, linux-kernel, linux-arm-kernel,
	linux-hardening
In-Reply-To: <20260612-sparx5_l3_routing-v1-8-fc3c10160f49@microchip.com>

This is an AI-generated review of your patch. The human sending this
email has considered the AI review valid, or at least plausible.
---
Subject: net: sparx5: add L3 FIB, nexthop and neighbour entry management

Add the data path for L3 unicast route offloading for IPv4 and IPv6 FIB
and nexthop groups.

> diff --git a/drivers/net/ethernet/microchip/sparx5/sparx5_router.c b/drivers/net/ethernet/microchip/sparx5/sparx5_router.c
> --- a/drivers/net/ethernet/microchip/sparx5/sparx5_router.c
> +++ b/drivers/net/ethernet/microchip/sparx5/sparx5_router.c
> @@ -396,6 +400,14 @@ static bool sparx5_rr_fib_info_should_ignore(struct sparx5_rr_fib_info *fi)
>  		       (IPV6_ADDR_MULTICAST | IPV6_ADDR_LINKLOCAL);
>  }
>  
> +#if IS_ENABLED(CONFIG_IPV6)
> +static void sparx5_rr_rt6_release(struct fib6_info *rt)
> +{
> +	if (!rt->nh)
> +		rt->fib6_nh->fib_nh_flags &= ~RTNH_F_OFFLOAD;
> +
> +	fib6_info_release(rt);
> +}

[Severity: High]
Does this non-atomic flag modification race with the core IPv6 stack?

The driver clears RTNH_F_OFFLOAD directly from an async worker thread
without holding the rtnl_lock. Since the core networking stack updates
these flags under the rtnl_lock, could this lockless non-atomic operation
cause torn writes and overwrite state updates like RTNH_F_DEAD?

[ ... ]

> @@ -826,6 +1143,14 @@ static void sparx5_rr_neigh_entry_destroy(struct sparx5 *sparx5,
>  	kfree(entry);
>  }
>  
> +static void sparx5_rr_neigh_entry_put(struct sparx5 *sparx5,
> +				      struct sparx5_rr_neigh_entry *neigh_entry)
> +{
> +	if (neigh_entry && list_empty(&neigh_entry->nexthop_list) &&
> +	    !neigh_entry->hw_route.vrule_id_valid)
> +		sparx5_rr_neigh_entry_destroy(sparx5, neigh_entry);
> +}

[Severity: High]
Can a neighbour entry leak permanently if its subnet route is deleted while
a gateway nexthop still references it?

When the subnet route is deleted, sparx5_rr_fib_entry_destroy() sets
neigh_entry->fib_entry = NULL and disables the VCAP rule. However, if an
ARP update arrives while the gateway route still references the neighbour,
sparx5_rr_neigh_event_work() re-enables the VCAP rule by setting
vrule_id_valid to true.

Later, when the gateway route is deleted, sparx5_rr_neigh_entry_put() will
skip destroying the neighbour because vrule_id_valid is true. The neighbour
would be left with an empty nexthop_list and a NULL fib_entry, meaning it
cannot be found during teardown by sparx5_rr_fib_flush(). Does this leak
the dev_tracker and hardware VCAP rule?

[ ... ]

> @@ -2126,6 +2753,8 @@ static int sparx5_rr_fib_replace(struct sparx5 *sparx5,
>  	/* Release and allow any previous fib_info to be deleted */
>  	sparx5_rr_fib_info_put(&old_fi);
>  
> +	fib_entry->type = sparx5_rr_rtm_type2fib_type(fi_type);
> +
>  	err = sparx5_rr_entry_nexthop_group_update(sparx5, fib_entry);

[Severity: Critical]
Could this in-place type update corrupt the LPM list?

If sparx5_rr_fib_replace() changes the route type from UNICAST to LOCAL,
the entry remains in the fib_lpm4_list or fib_lpm6_list. Later, when the
route is deleted, sparx5_rr_fib_entry_destroy() calls
sparx5_rr_fib_lpm_remove(), which has this check:

static void sparx5_rr_fib_lpm_remove(struct sparx5_rr_fib_entry *fib_entry)
{
        if (fib_entry->type != SPARX5_RR_FIB_TYPE_UNICAST)
                return;

        list_del(&fib_entry->fib_lpm_node);
}

Since the type is now LOCAL, the function returns early and skips the
list_del() operation. The entry is then freed, leaving a dangling pointer
in the linked list. Conversely, replacing a LOCAL route with a UNICAST
route appears to cause a NULL pointer dereference in list_del() because
it was never added to the list.
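
One possible direction, shown as a userspace sketch only: keep list
membership and type in step by routing every type change through a single
helper. The list helpers, enum fib_type, struct fib_entry and
entry_set_type() below are hypothetical models and do not reflect the
driver's actual structures.

```c
#include <assert.h>
#include <stdbool.h>

/* Minimal circular doubly-linked list, modelled on the kernel's list_head. */
struct node {
	struct node *next, *prev;
};

static void list_init(struct node *h)
{
	h->next = h;
	h->prev = h;
}

static bool list_is_empty(const struct node *h)
{
	return h->next == h;
}

static void list_add_node(struct node *n, struct node *h)
{
	n->next = h->next;
	n->prev = h;
	h->next->prev = n;
	h->next = n;
}

static void list_del_node(struct node *n)
{
	n->prev->next = n->next;
	n->next->prev = n->prev;
	list_init(n);	/* leave the node self-linked after removal */
}

enum fib_type { FIB_TYPE_UNICAST, FIB_TYPE_LOCAL };

struct fib_entry {
	enum fib_type type;
	struct node lpm_node;
};

static void entry_init(struct fib_entry *e, enum fib_type t, struct node *lpm)
{
	e->type = t;
	list_init(&e->lpm_node);
	if (t == FIB_TYPE_UNICAST)
		list_add_node(&e->lpm_node, lpm);
}

/* Change the type and fix up LPM list membership in the same step. */
static void entry_set_type(struct fib_entry *e, enum fib_type t,
			   struct node *lpm)
{
	assert(e && lpm);
	if (e->type == t)
		return;
	if (e->type == FIB_TYPE_UNICAST)
		list_del_node(&e->lpm_node);
	e->type = t;
	if (t == FIB_TYPE_UNICAST)
		list_add_node(&e->lpm_node, lpm);
}

/* Walks UNICAST -> LOCAL -> UNICAST and checks the list after each step. */
static bool replace_keeps_list_consistent(void)
{
	struct node lpm;
	struct fib_entry e;
	bool ok;

	list_init(&lpm);
	entry_init(&e, FIB_TYPE_UNICAST, &lpm);
	ok = !list_is_empty(&lpm);
	entry_set_type(&e, FIB_TYPE_LOCAL, &lpm);
	ok = ok && list_is_empty(&lpm);
	entry_set_type(&e, FIB_TYPE_UNICAST, &lpm);
	ok = ok && !list_is_empty(&lpm);
	return ok;
}
```

With this shape a later remove keyed on the type can no longer disagree
with what is actually on the list.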

[ ... ]

> @@ -2247,6 +2873,28 @@ static int sparx5_rr_fib6_nexthop_prune(struct sparx5 *sparx5,
> +	old_nrt6 = fib_entry->fi.fe6_info.nrt6;
> +	new_nrt6 = old_nrt6 >= f6i->nrt6 ? old_nrt6 - f6i->nrt6 : 0;
> +
> +	rt_arr = kzalloc_objs(struct fib6_info *, new_nrt6);
> +	if (!rt_arr)
> +		return -ENOMEM;
> +
> +	for (int i = 0; i < old_nrt6; i++) {
> +		struct fib6_info *fi = old_rt_arr[i];
> +
> +		if (sparx5_rr_fib6_rt_exists(f6i, fi)) {
> +			sparx5_rr_rt6_release(fi);
> +			continue;
> +		}
> +
> +		rt_arr[j++] = fi;
> +	}

[Severity: Critical]
Is there a risk of a heap buffer overflow in this loop?

The new array size is calculated as the difference between the old count
and the count of nexthops to remove. However, if the deletion request
contains nexthops that do not perfectly match the tracked nexthops,
sparx5_rr_fib6_rt_exists() will return false.

The loop then copies these non-matching elements into rt_arr. Could the
index j exceed new_nrt6 in this scenario, writing past the end of the
dynamically allocated buffer?
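
A minimal userspace sketch of a bounded variant, assuming the array can
be sized from the number of entries that actually survive rather than
from a subtraction of counts. prune(), in_set() and the int elements are
hypothetical stand-ins for the fib6_info handling in the patch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>

static bool in_set(const int *set, size_t n, int v)
{
	for (size_t i = 0; i < n; i++)
		if (set[i] == v)
			return true;
	return false;
}

/*
 * Return a newly allocated array holding the entries of 'old' that are
 * not listed in 'del'. The array is sized by counting survivors first,
 * so the copy loop cannot run past the allocation.
 */
static int *prune(const int *old, size_t old_n,
		  const int *del, size_t del_n, size_t *out_n)
{
	size_t keep = 0, j = 0;
	int *arr;

	assert(out_n);
	for (size_t i = 0; i < old_n; i++)
		if (!in_set(del, del_n, old[i]))
			keep++;

	arr = calloc(keep ? keep : 1, sizeof(*arr));
	if (!arr) {
		*out_n = 0;
		return NULL;
	}

	for (size_t i = 0; i < old_n; i++) {
		if (in_set(del, del_n, old[i]))
			continue;
		arr[j++] = old[i];
	}

	*out_n = j;
	return arr;
}

/* Delete request {2, 9} against {1, 2, 3}: only one entry matches. */
static size_t survivors_with_partial_match(void)
{
	const int old[] = { 1, 2, 3 };
	const int del[] = { 2, 9 };
	size_t n = 0;
	int *arr = prune(old, 3, del, 2, &n);

	free(arr);
	return n;
}
```

In the partial-match case a size of old count minus delete count would be
1, while two entries survive, which is the overflow scenario described
above.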

[ ... ]

> @@ -2939,6 +3032,12 @@ int sparx5_rr_router_init(struct sparx5 *sparx5)
>  err_unreg_inet_addr_val_notifier:
>  	unregister_inetaddr_validator_notifier(&r->inetaddr_valid_nb);
>  err_unreg_inet_notifier:
>  	unregister_inetaddr_notifier(&r->inetaddr_nb);
> +err_unreg_fib_notifier:
> +	unregister_fib_notifier(&init_net, &r->fib_nb);
>  err_workqueue_destroy:
>  	destroy_workqueue(r->sparx5_router_owq);
> +err_fib_ht_destroy:
> +	rhashtable_destroy(&r->fib_ht);
> +err_neigh_ht_destroy:
> +	rhashtable_destroy(&r->neigh_ht);

[Severity: High]
Does this error path leak all offloaded route hardware and software
structures?

If an error occurs after register_fib_notifier() has successfully run, the
error path calls destroy_workqueue(). This allows the initial FIB dump to
finish populating routes into the hardware and the r->fib_ht structure.

However, the error path subsequently calls rhashtable_destroy() without
first calling sparx5_rr_fib_flush(). Since rhashtable_destroy() only frees
the hash buckets and not the elements, does this permanently leak all the
inserted fib_entry objects, their dev_trackers, and their hardware VCAP
rules upon a module load failure?


* Re: [PATCH net-next 7/9] net: sparx5: add L3 router infrastructure and leg management
From: Jakub Kicinski @ 2026-06-15 23:38 UTC (permalink / raw)
  To: jensemil.schulzostergaard
  Cc: Jakub Kicinski, horatiu.vultur, UNGLinuxDriver, andrew+netdev,
	davem, edumazet, pabeni, daniel.machon, Steen.Hegelund, kees,
	gustavoars, netdev, linux-kernel, linux-arm-kernel,
	linux-hardening
In-Reply-To: <20260612-sparx5_l3_routing-v1-7-fc3c10160f49@microchip.com>

This is an AI-generated review of your patch. The human sending this
email has considered the AI review valid, or at least plausible.
---
Subject: net: sparx5: add L3 router infrastructure and leg management

This commit adds L3 router infrastructure and leg management to the
Microchip Sparx5 switch driver. It handles netdevice and inetaddr events
to intercept VLAN uppers of a bridge device, associating them with router
legs (VMIDs) for hardware offloading.

> diff --git a/drivers/net/ethernet/microchip/sparx5/sparx5_router.c b/drivers/net/ethernet/microchip/sparx5/sparx5_router.c
> new file mode 100644
> index 0000000000000..03923d91fdfb8
> --- /dev/null
> +++ b/drivers/net/ethernet/microchip/sparx5/sparx5_router.c

[ ... ]

> +static void sparx5_rr_inet6_make_mask_le(int logmask, u8 *mask)
> +{
> +	/* Caller must ensure 0 <= logmask <= 128 */
> +	int rem, byte_prefix = logmask;
> +
> +	rem = do_div(byte_prefix, BITS_PER_BYTE);

[Severity: High]
Will this do_div() call cause a build failure or stack corruption?

The do_div() macro expects a 64-bit unsigned integer for the dividend. Passing
byte_prefix, which is a 32-bit int, causes a compiler error on architectures
like x86. On architectures like ARM, it can overwrite stack memory since the
macro assumes the pointer points to a 64-bit value.

Could byte_prefix be defined as a u64, or could this use a standard division
operation instead?
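
For illustration, a userspace sketch of the plain-division alternative.
make_mask() and mask_byte() are hypothetical, and the most-significant-
byte-first layout is an assumption that may not match the byte order the
driver's _le helper needs.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define BITS_PER_BYTE 8

/* Build a 16-byte prefix mask using ordinary int division and modulo. */
static void make_mask(int logmask, uint8_t mask[16])
{
	int full, rem;

	assert(logmask >= 0 && logmask <= 128);
	full = logmask / BITS_PER_BYTE;	/* whole bytes covered by the prefix */
	rem = logmask % BITS_PER_BYTE;	/* leftover bits in the next byte */

	memset(mask, 0, 16);
	memset(mask, 0xff, (size_t)full);
	if (rem)
		mask[full] = (uint8_t)(0xff << (BITS_PER_BYTE - rem));
}

/* Convenience accessor used to inspect a single byte of the result. */
static uint8_t mask_byte(int logmask, int idx)
{
	uint8_t mask[16];

	assert(idx >= 0 && idx < 16);
	make_mask(logmask, mask);
	return mask[idx];
}
```

Since the prefix length never exceeds 128, a 32-bit quotient and
remainder are sufficient and no 64-bit helper is needed.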

[ ... ]

> +static bool
> +sparx5_rr_router_leg_addr_list_empty_rcu(struct sparx5_rr_router_leg *leg)
> +{
> +	struct inet6_dev *inet6_dev;
> +	struct in_device *in_dev;
> +
> +	in_dev = __in_dev_get_rcu(leg->dev);
> +	if (in_dev && in_dev->ifa_list)
> +		return false;

[Severity: Medium]
Does this code miss an RCU pointer annotation?

The ifa_list pointer in struct in_device is annotated with __rcu, so accessing
it directly violates RCU rules and can cause sparse warnings.

Could this be updated to use rcu_access_pointer() since it's just doing a
boolean check?

[ ... ]

> +/* Called in atomic context. */
> +static int sparx5_rr_inet6addr_event(struct notifier_block *nb,
> +				     unsigned long event, void *ptr)
> +{
> +	struct inet6_ifaddr *if6 = (struct inet6_ifaddr *)ptr;
> +	struct sparx5_rr_inet6addr_event_work *work;
> +	struct net_device *dev = if6->idev->dev;
> +	struct sparx5_router *router;
> +
> +	if (event != NETDEV_DOWN)
> +		return NOTIFY_DONE;
> +
> +	work = kzalloc_obj(*work, GFP_ATOMIC);
> +	if (!work)
> +		return NOTIFY_BAD;

[Severity: High]
Does returning NOTIFY_BAD during a NETDEV_DOWN event illegally abort the atomic
notification chain?

The inet6addr_chain is an atomic notifier chain, so returning NOTIFY_BAD stops
chain traversal. Because NETDEV_DOWN is a post-action event indicating the
address has already been removed, halting the chain prevents subsequent
notifiers (e.g., routing tables, IPsec) from observing the removal. This
leaves stale state that can cause use-after-free or routing blackholes.

Should this driver log an error and return NOTIFY_DONE or NOTIFY_OK instead
if memory allocation fails?

[ ... ]

> +/* Called with RTNL. */
> +static int sparx5_rr_inetaddr_valid_event(struct notifier_block *nb,
> +					  unsigned long event, void *ptr)
> +{
> +	struct in_validator_info *ivi = (struct in_validator_info *)ptr;
> +	struct net_device *dev = ivi->ivi_dev->dev;
> +	struct sparx5_router *router;
> +
> +	ASSERT_RTNL();
> +
> +	if (event != NETDEV_UP)
> +		return NOTIFY_DONE;
> +
> +	router = container_of(nb, struct sparx5_router, inetaddr_valid_nb);
> +
> +	return sparx5_rr_inetaddr_event_handle(router->sparx5, dev, event);
> +}

[Severity: High]
Does allocating hardware resources in the validator chain cause a leak?

The driver allocates a hardware router leg (VMID) via
sparx5_rr_router_leg_create() and takes a netdev_hold reference during the
NETDEV_UP validation phase.

If a downstream validator rejects the address addition, or if the addition
natively fails, NETDEV_DOWN is never fired. Because there is no rollback hook
handling this failure, the VMID and net_device tracker reference are leaked. A
leaked reference permanently prevents the net_device from being unregistered,
hanging unregister_netdevice() in netdev_wait_allrefs().

[ ... ]

> +/* Called with RTNL. */
> +static int sparx5_rr_netdevice_event(struct notifier_block *nb,
> +				     unsigned long event, void *ptr)
> +{
> +	struct net_device *dev = netdev_notifier_info_to_dev(ptr);
> +	unsigned char mac[ETH_ALEN] __aligned(2);
> +	struct sparx5_router *router;
> +	struct sparx5 *sparx5;
> +
> +	ASSERT_RTNL();
> +
> +	router = container_of(nb, struct sparx5_router, netdevice_nb);
> +	sparx5 = router->sparx5;
> +
> +	/* Allow single bridge. Global router leg MAC tracks bridge mac. */
> +	if (!netif_is_bridge_master(dev))
> +		return NOTIFY_OK;
> +
> +	switch (event) {
> +	case NETDEV_CHANGEADDR:
> +		ether_addr_copy(mac, dev->dev_addr);
> +		sparx5_rr_leg_base_mac_set(sparx5, mac);
> +		break;
> +	}
> +
> +	return NOTIFY_OK;
> +}

[Severity: High]
Does this notifier miss network namespace and offload ownership checks?

The netdevice and inetaddr notifiers process events globally. In
sparx5_rr_netdevice_event(), it only checks netif_is_bridge_master(dev). The
driver doesn't verify if the bridge actually contains Sparx5 physical ports or
if it is inside the target network namespace (init_net).

An unprivileged user inside a user or network namespace could create dummy
VLAN-aware bridges and change their MAC addresses or assign IPs. This would
trigger these global notifiers, causing the driver to program physical
hardware registers, overwrite the global router leg MAC, and exhaust the
finite 511 VMID slots, potentially leading to a denial of service.


* Re: [PATCH net-next 6/9] net: sparx5: vcap: add lpm vcap implementation
From: Jakub Kicinski @ 2026-06-15 23:38 UTC (permalink / raw)
  To: jensemil.schulzostergaard
  Cc: Jakub Kicinski, horatiu.vultur, UNGLinuxDriver, andrew+netdev,
	davem, edumazet, pabeni, daniel.machon, Steen.Hegelund, kees,
	gustavoars, netdev, linux-kernel, linux-arm-kernel,
	linux-hardening
In-Reply-To: <20260612-sparx5_l3_routing-v1-6-fc3c10160f49@microchip.com>

This is an AI-generated review of your patch. The human sending this
email has considered the AI review valid, or at least plausible.
---
net: sparx5: vcap: add lpm vcap implementation

Add the implementation for the LPM VCAP for lan969x and sparx5. The LPM
VCAP has a static keyset configuration, and a single lookup dedicated to
LPM, so this is straightforward, and many functions can use existing
functionality.

> diff --git a/drivers/net/ethernet/microchip/sparx5/sparx5_vcap_impl.c b/drivers/net/ethernet/microchip/sparx5/sparx5_vcap_impl.c
> index 95b93e46a41dc..e25e759c24f60 100644
> --- a/drivers/net/ethernet/microchip/sparx5/sparx5_vcap_impl.c
> +++ b/drivers/net/ethernet/microchip/sparx5/sparx5_vcap_impl.c
> @@ -1083,6 +1146,46 @@ static void sparx5_vcap_es2_cache_write(struct sparx5 *sparx5,
>  	}
>  }
>  
> +static void sparx5_vcap_lpm_cache_write(struct sparx5 *sparx5,
> +					struct vcap_admin *admin,
> +					enum vcap_selection sel,
> +					u32 start,
> +					u32 count)
> +{
> +	u32 *keystr, *mskstr, *actstr;
> +	int idx;
> +
> +	keystr = &admin->cache.keystream[start];
> +	mskstr = &admin->cache.maskstream[start];
> +	actstr = &admin->cache.actionstream[start];
> +
> +	switch (sel) {
> +	case VCAP_SEL_ENTRY:
> +		for (idx = 0; idx < count; ++idx) {
> +			/* Avoid 'match-off' by setting value & mask */
> +			spx5_wr(keystr[idx] & mskstr[idx], sparx5,
> +				VCAP_SUPER_VCAP_ENTRY_DAT(idx));

[Severity: High]
Are concurrent writes to the shared VCAP_SUPER_VCAP_ENTRY_DAT registers
safe here without a global hardware lock across the different VCAP types?

It looks like IS0 and IS2 rules are updated via vcap_add_rule() under the
rtnl_lock(). However, the LPM rules are updated under the
sparx5->router->lock inside a workqueue.

Since both execution paths share the Super VCAP hardware block, could
concurrent updates from TC (IS0/IS2) and routing (LPM) race on these shared
cache registers, and also on VCAP_SUPER_CTRL in sparx5_vcap_super_update(),
garbling the hardware command and corrupting the cache memory?

> +			spx5_wr(~mskstr[idx], sparx5,
> +				VCAP_SUPER_VCAP_MASK_DAT(idx));
> +		}
> +		break;

[ ... ]

> @@ -1379,6 +1523,7 @@ static void sparx5_vcap_update(struct net_device *ndev,
>  	struct sparx5 *sparx5 = port->sparx5;
>  
>  	switch (admin->vtype) {
> +	case VCAP_TYPE_LPM:
>  	case VCAP_TYPE_IS0:
>  	case VCAP_TYPE_IS2:
>  		sparx5_vcap_super_update(sparx5, cmd, sel, addr);


* Re: [PATCH net-next 4/9] net: microchip: vcap: expose helpers in vcap api and update debugfs
From: Jakub Kicinski @ 2026-06-15 23:38 UTC (permalink / raw)
  To: jensemil.schulzostergaard
  Cc: Jakub Kicinski, horatiu.vultur, UNGLinuxDriver, andrew+netdev,
	davem, edumazet, pabeni, daniel.machon, Steen.Hegelund, kees,
	gustavoars, netdev, linux-kernel, linux-arm-kernel,
	linux-hardening
In-Reply-To: <20260612-sparx5_l3_routing-v1-4-fc3c10160f49@microchip.com>

This is an AI-generated review of your patch. The human sending this
email has considered the AI review valid, or at least plausible.
---
net: microchip: vcap: expose helpers in vcap api and update debugfs

Add new helpers to the vcap client api in preparation for L3 routing
functionality, rename PREROUTING to PREROUTING_L0, and fix a latent
undefined-behavior bug in the debugfs action-field printer.

> diff --git a/drivers/net/ethernet/microchip/vcap/vcap_api.c b/drivers/net/ethernet/microchip/vcap/vcap_api.c
> index 30700648672fc..0905e4f192a02 100644
> --- a/drivers/net/ethernet/microchip/vcap/vcap_api.c
> +++ b/drivers/net/ethernet/microchip/vcap/vcap_api.c

[ ... ]

> @@ -3547,6 +3560,18 @@ int vcap_rule_mod_action_u32(struct vcap_rule *rule,
>  }
>  EXPORT_SYMBOL_GPL(vcap_rule_mod_action_u32);
>  
> +/* Modify a bit action with value in the rule */
> +int vcap_rule_mod_action_bit(struct vcap_rule *rule,
> +			     enum vcap_action_field action,
> +			     enum vcap_bit val)
> +{
> +	struct vcap_client_actionfield_data data;
> +
> +	vcap_rule_set_action_bitsize(&data.u1, val);
> +	return vcap_rule_mod_action(rule, action, VCAP_FIELD_BIT, &data);

[Severity: Medium]
Could this copy uninitialized stack memory to the heap?

The data union is allocated on the stack but not initialized. The call to
vcap_rule_set_action_bitsize() only initializes the first byte (data.u1).

When data is passed to vcap_rule_mod_action(), it performs a memcpy() of the
entire 16-byte union:

vcap_rule_mod_action() {
	...
	memcpy(&field->data, data, sizeof(field->data));
	...
}

This appears to copy 15 bytes of uninitialized kernel stack memory into the
heap-allocated field->data, which might trigger KMSAN uninit-value warnings.

While a similar pattern exists in the pre-existing vcap_rule_mod_action_u32()
function, should the new data union be zero-initialized here?
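
A minimal userspace sketch of the zero-initialized variant. The 16-byte
union field_data, struct u1_data and mod_action_bit() are hypothetical
models of the layout described above, not the vcap API itself.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical model: a one-byte member inside a 16-byte union. */
struct u1_data {
	uint8_t value;
};

union field_data {
	struct u1_data u1;
	uint8_t raw[16];
};

/* Zero the whole union before setting the one-byte member, then copy it. */
static void mod_action_bit(union field_data *dst, int bit)
{
	union field_data data;

	assert(dst);
	memset(&data, 0, sizeof(data));
	data.u1.value = bit ? 1 : 0;
	memcpy(dst, &data, sizeof(*dst));
}

/* Returns 1 if byte 0 holds the bit and the remaining 15 bytes are zero. */
static int tail_is_zero(int bit)
{
	union field_data dst;

	memset(&dst, 0xaa, sizeof(dst));	/* poison the destination first */
	mod_action_bit(&dst, bit);
	for (int i = 1; i < 16; i++)
		if (dst.raw[i])
			return 0;
	return dst.raw[0] == (bit ? 1 : 0);
}
```

memset() is used here rather than an initializer because it clears every
byte of the union regardless of which member is written afterwards.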

> +}
> +EXPORT_SYMBOL_GPL(vcap_rule_mod_action_bit);
> +
>  /* Drop keys in a keylist and any keys that are not supported by the keyset */
>  int vcap_filter_rule_keys(struct vcap_rule *rule,
>  			  enum vcap_key_field keylist[], int length,


* Re: [PATCH net-next 1/9] net: microchip: vcap: fix rule move for rules of coprime size
From: Jakub Kicinski @ 2026-06-15 23:38 UTC (permalink / raw)
  To: jensemil.schulzostergaard
  Cc: Jakub Kicinski, horatiu.vultur, UNGLinuxDriver, andrew+netdev,
	davem, edumazet, pabeni, daniel.machon, Steen.Hegelund, kees,
	gustavoars, netdev, linux-kernel, linux-arm-kernel,
	linux-hardening
In-Reply-To: <20260612-sparx5_l3_routing-v1-1-fc3c10160f49@microchip.com>

This is an AI-generated review of your patch. The human sending this
email has considered the AI review valid, or at least plausible.
---
Subject: net: microchip: vcap: fix rule move for rules of coprime size

This patch addresses a discrepancy in how software and hardware move rule
addresses on insert and delete for the LPM VCAP. It introduces a uniform
offset computation aligned to the least common multiple of all rule sizes
in a block. This ensures that hardware and software representations remain
synchronized and all rules land on correctly aligned addresses.

> diff --git a/drivers/net/ethernet/microchip/vcap/vcap_api.c b/drivers/net/ethernet/microchip/vcap/vcap_api.c
> index 0fdb5e363badc..6946fd7384581 100644
> --- a/drivers/net/ethernet/microchip/vcap/vcap_api.c
> +++ b/drivers/net/ethernet/microchip/vcap/vcap_api.c

[ ... ]

> @@ -2275,8 +2336,7 @@ int vcap_add_rule(struct vcap_rule *rule)
>  		       __func__, __LINE__, ret);
>  		goto out;
>  	}
> -	if (move.count > 0)
> -		vcap_move_rules(ri, &move);
> +	vcap_move_rules(ri, &move);
>  
>  	/* Set the counter to zero */
>  	ret = vcap_write_counter(ri, &ctr);

[Severity: High]
This is a pre-existing issue, but does vcap_add_rule() leak the newly
duplicated rule if operations like vcap_write_counter() or
vcap_write_rule() fail later on?

If a failure occurs, the code takes the goto out path:

vcap_add_rule() {
    ...
    ret = vcap_write_rule(ri);
    if (ret) {
        pr_err("%s:%d: rule write error: %d\n", __func__, __LINE__, ret);
        goto out;
    }
out:
    mutex_unlock(&ri->admin->lock);
    return ret;
}

This seems to exit without reverting the earlier list insertions and
hardware moves.

Could this leave the software and hardware states permanently inconsistent
and leak the newly duplicated rule?

[ ... ]

> @@ -2488,59 +2548,52 @@ int vcap_mod_rule(struct vcap_rule *rule)
>  }
>  EXPORT_SYMBOL_GPL(vcap_mod_rule);

[Severity: High]
This isn't a bug introduced by this patch, but does vcap_mod_rule() discard
the active software rule pointer and overwrite hardware with stale addresses?

Looking at vcap_mod_rule(), the function validates the rule using
vcap_get_locked_rule() but ignores the returned pointer:

vcap_mod_rule(struct vcap_rule *rule) {
    struct vcap_rule_internal *ri = to_intrule(rule);
    ...
    if (!vcap_get_locked_rule(ri->vctrl, ri->data.id))
        return -ENOENT;
    ...
    err = vcap_encode_rule(ri);
    if (err)
        goto out;

    err = vcap_write_rule(ri);
}

Because the function continues to use the caller provided copy ri, it
doesn't see updates to the real rule in the admin->rules list.

If the real rule's hardware address was previously shifted due to insertions
or deletions, could writing to the stale address in ri->addr corrupt other
rules in the hardware VCAP?

[ ... ]

One other pre-existing issue noticed while reviewing this subsystem:

[Severity: High]
This is a pre-existing issue, but is it safe to iterate over the admin->rules
list across different VCAP instances without acquiring admin->lock for
each instance?

In vcap_rule_exists():

static bool vcap_rule_exists(struct vcap_control *vctrl, u32 id) {
    ...
    list_for_each_entry(admin, &vctrl->list, list)
        list_for_each_entry(ri, &admin->rules, list)
            if (ri->data.id == id)
                return true;
    return false;
}

If another thread is concurrently inserting or deleting a rule on another
VCAP instance while holding that instance's lock, could this unprotected
concurrent access encounter a data race and dereference a poisoned pointer?
-- 
pw-bot: cr


* Re: [PATCH net-next v5 0/3] airoha: add the capability to configure GDM3/GDM4 as WAN/LAN on demand
From: Jakub Kicinski @ 2026-06-15 23:37 UTC (permalink / raw)
  To: Lorenzo Bianconi
  Cc: Andrew Lunn, David S. Miller, Eric Dumazet, Paolo Abeni,
	linux-arm-kernel, linux-mediatek, netdev, Madhur Agrawal,
	Alexander Lobakin
In-Reply-To: <20260611-airoha-ethtool-priv_flags-v5-0-c11de08486d1@kernel.org>

On Thu, 11 Jun 2026 23:55:50 +0200 Lorenzo Bianconi wrote:
>       net: airoha: use int instead of atomic_t for qdma users counter
>       net: airoha: refactor QDMA start/stop into reusable helpers
>       net: airoha: defer GDM3/GDM4 WAN mode and GDM2 loopback to QoS offload

only the first patch applies cleanly right now


* Re: [PATCH net-next 0/2] appletalk: move the protocol out of tree
From: John Paul Adrian Glaubitz @ 2026-06-15 23:34 UTC (permalink / raw)
  To: Jakub Kicinski, davem
  Cc: netdev, edumazet, pabeni, andrew+netdev, horms, geert, chleroy,
	npiggin, mpe, maddy, linux-mips, linux-m68k, linuxppc-dev
In-Reply-To: <20260615222935.947233-1-kuba@kernel.org>

Hello Jakub,

On Mon, 2026-06-15 at 15:29 -0700, Jakub Kicinski wrote:
> This tiny series moves appletalk out of tree, to:
> 
>   https://github.com/linux-netdev/mod-orphan
> 
> Core maintainers are unable to keep up with the rate of security
> bug reports and fixes. Nobody seems to care about appletalk enough
> to review the patches.

Why would fixing these vulnerabilities be relevant? No one is going to
expose an AppleTalk server to an untrusted network, are they? The same
applies to hamradio and AX.25, they are all used by hobbyists in DMZ
networks, so no one really cares about vulnerabilities in these protocols.

I find it sad that AI tools are basically used to shoot at the kernel
to kill off features as some people are apparently getting scared by
these AI reports and just nuke everything in a panic reaction as if it
wouldn't just be possible to disable these protocols at compile time
to reduce the attack surface.

> As Eric pointed out Mac OS dropped AppleTalk over a decade ago.

That's not the point though. No one is going to use AppleTalk to network
a Linux box to a modern macOS machine. The usefulness lies in hooking up
a Linux box to a vintage Mac or other retro computer.

So far, one of the huge advantages of open source operating systems has
always been that even niche use cases were supported and people could make
use of old hardware by using open source operating systems over commercial
offerings such as Windows or macOS.

With the advent of AI security reports, these niche use cases are more and
more being killed off with the argument that a vulnerability in the hamradio
code could pose a threat to a large SAP database running on a Linux enterprise
distribution. However, if your enterprise distribution is enabling kernel
features their customers aren't using and therefore enlarging the attack surface,
it's more a problem of said enterprise distribution and not of these old and
obscure network protocols.

I am trying my best to save as many classic features in the kernel as possible
to enable retro computing but I am sometimes fearing that commercial interest
in the kernel is taking over too much making my efforts harder every day.

Adrian

-- 
 .''`.  John Paul Adrian Glaubitz
: :' :  Debian Developer
`. `'   Physicist
  `-    GPG: 62FF 8A75 84E0 2956 9546  0006 7426 3B37 F5B5 F913

